Dr Serge Sharoff
- Position: Professor of Language Technology
- Areas of expertise: language technology; natural language processing; machine translation; corpus linguistics; digital humanities; genres and text classification; terminology mining; related languages
- Email: S.Sharoff@leeds.ac.uk
- Phone: +44(0)113 343 7287
- Website: Google Scholar | ORCID
Profile
I joined the University of Leeds in 2003. I obtained my PhD from the Moscow Lomonosov State University in 1997, followed by postdoctoral appointments at the Russian Research Institute for Artificial Intelligence (1997–2000) and a Humboldt Research Fellowship at the University of Bielefeld (2001–2002).
Research interests
Artificial Intelligence, and more specifically Large Language Models such as ChatGPT, have recently made a profound impact on how we interact with computers by providing the ability to produce new texts in response to prompts. Fundamental research in this area is at the core of my expertise, with one of my papers on the diversity of texts on the Web cited by the GPT creators.
My research interests are related to three domains: linguistics (primarily computational linguistics and corpus linguistics), cognitive science and communication studies.
Probably the most interesting bit in my recent research is digital curation of representative corpora automatically collected from the Web, i.e., their annotation in terms of genres, domains or morphosyntactic categories. The current set of resources includes very large corpora for Arabic, Chinese, English, French, German, Italian, Polish, Portuguese, Russian and Spanish.
I am happy to consider applications from prospective PhD students in the area of my expertise. The following general topics are preferable:
Automatic Text Classification for Translation
Setting up a translation project usually involves assessing the amount of time required for translating a text and selecting the most suitable translator. Modern approaches in Language Technology can do wonders with text processing, but it is not clear how helpful they can be in translation settings. For example, can they help to determine the genre of a text, its difficulty or its suitability for particular translators? Similar text classification tools can also be used for tasks related to learning foreign languages.
Background references:
- Serge Sharoff. Genre Annotation for the Web: text-external and text-internal perspectives. Register Studies, 2021
- Serge Sharoff. Functional text dimensions for the annotation of Web corpora. Corpora Journal, 13(1):65–95, 2018
- Yu Yuan and Serge Sharoff. Sentence Level Human Translation Quality Estimation with Attention-based Neural Networks. In Proc International Conference on Language Resources and Evaluation (LREC'20), Marseilles, May 2020
Language adaptation for improving models of lesser-resourced languages
A translation model needs to be applicable to a large number of languages, while the training resources or linguistic models are often better developed only for some languages. Language adaptation can be designed in a way similar to domain adaptation to improve the models of lesser-resourced languages by taking into account the resources available for closely related languages, e.g., from French to Romanian. This can be applied in a range of training scenarios, such as Part-Of-Speech tagging, text classification, translation quality prediction, etc.
Background references:
- Serge Sharoff. Finding next of kin: Cross-lingual embedding spaces for related languages. Journal of Natural Language Engineering, 25, 2019
- Miguel Rios and Serge Sharoff. Language adaptation for extending post-editing estimates for closely related languages. The Prague Bulletin of Mathematical Linguistics, 106:181–192, 2016
Non-parallel resources for translation
Modern Machine Translation is based on "plagiarising" large amounts of existing translations, which usually come from institutions such as the United Nations or the European Parliament. This is not enough for many language directions or for specific domains, such as biomedicine. What are productive methods to mine information about translations from non-parallel texts, such as Wikipedia articles on the same topic or news wire streams in different languages?
Background references:
- Maria Kunilovskaya and Serge Sharoff. Building functionally similar corpus resources for translation studies. In Proc RANLP, Varna, September 2019
- Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. A multilingual dataset for evaluating parallel sentence extraction from comparable corpora. In Proc LREC, Miyazaki, Japan, May 2018
- Serge Sharoff, Reinhard Rapp, and Pierre Zweigenbaum. Building and Using Comparable Corpora for Multilingual Natural Language Processing, 2023
- Leeds Russian Centre
- Process Automation for Localisation of Dialogue in Entertainment Media (PALODIEM)
Student education
I teach courses on:
- Computer-Assisted Translation: Translation Memories, Terminology extraction and management, Machine Translation
- Corpus methods for translators: using corpus tools to solve translation problems
- Introduction to Natural Language Processing: using computers to model language
Research groups and institutes
- Centre for Translation Studies
- Language documentation
- Translation
- Leeds Russian Centre
- Russian
- Centre for Endangered Languages, Cultures and Ecosystems