Dr Serge Sharoff
I joined the University of Leeds in 2003, after obtaining my PhD from the Moscow Lomonosov State University in 1997 and holding postdoctoral appointments at the Russian Research Institute for Artificial Intelligence (1997–2000) and a Humboldt Research Fellowship at the University of Bielefeld (2001–2002).
My research interests are related to three domains: linguistics (primarily computational linguistics and corpus linguistics), cognitive science and communication studies.
Probably the most interesting part of my recent research is the digital curation of representative corpora collected automatically from the Web, i.e., their annotation in terms of genres, domains or morphosyntactic categories. The current set of resources includes very large corpora for Arabic, Chinese, English, French, German, Italian, Polish, Portuguese, Russian and Spanish.
I am happy to consider applications from prospective PhD students in my areas of expertise. The following general topics are preferred:
Automatic Text Classification for Translation
Setting up a translation project usually involves assessing the amount of time required for translating a text and selecting the most suitable translator. Modern approaches in Language Technology can do wonders with text processing, but it is not clear how helpful they can be in translation settings. For example, can they help to determine the genre of a text, its difficulty or its suitability for particular translators? Similar text classification tools can also be used for tasks related to learning foreign languages.
- Serge Sharoff. Genre Annotation for the Web: text-external and text-internal perspectives. Register Studies, 2021
- Serge Sharoff. Functional text dimensions for the annotation of Web corpora. Corpora Journal, 13(1):65–95, 2018
- Yu Yuan and Serge Sharoff. Sentence Level Human Translation Quality Estimation with Attention-based Neural Networks. In Proc International Conference on Language Resources and Evaluation (LREC'20), Marseilles, May 2020
Language adaptation for improving models of lesser-resourced languages
A translation model needs to be applicable to a large number of languages, while the training resources or linguistic models are often well developed only for some of them. Language adaptation can be designed in a way similar to domain adaptation to improve the models of lesser-resourced languages by taking into account the resources available for closely related languages, e.g., from French to Romanian. This can be applied in a range of training scenarios, such as Part-Of-Speech tagging, text classification, translation quality prediction, etc.
- Serge Sharoff. Finding next of kin: Cross-lingual embedding spaces for related languages. Journal of Natural Language Engineering, 25, 2019
- Miguel Rios and Serge Sharoff. Language adaptation for extending post-editing estimates for closely related languages. The Prague Bulletin of Mathematical Linguistics, 106:181–192, 2016
Non-parallel resources for translation
Modern Machine Translation is based on "plagiarising" large amounts of existing translations, which usually come from institutions such as the United Nations or the European Parliament. This is not enough for many language directions or for specific domains, such as biomedicine. What are productive methods to mine information about translations from non-parallel texts, such as Wikipedia articles on the same topic or news wire streams in different languages?
- Serge Sharoff. Know thy corpus! Robust methods for digital curation of Web corpora. In Proc LREC, Marseilles, May 2020
- Maria Kunilovskaya and Serge Sharoff. Building functionally similar corpus resources for translation studies. In Proc RANLP, Varna, September 2019
I teach courses on:
- Computer-Assisted Translation: Translation Memories, Terminology extraction and management, Machine Translation
- Corpus methods for translators: using corpus tools to solve translation problems
- Introduction to Natural Language Processing: using computers to model language
Research groups and institutes
- Centre for Translation Studies
- Language documentation
- Leeds Russian Centre
- Centre for Endangered Languages, Cultures and Ecosystems