Dr Serge Sharoff
I joined the University of Leeds in 2003, after obtaining my PhD from the Moscow Lomonosov State University in 1997 and holding a postdoctoral appointment at the Russian Research Institute for Artificial Intelligence (1997-2000) and a Humboldt Research Fellowship at the University of Bielefeld (2001-2002).
My research interests are related to three domains: linguistics (primarily computational linguistics and corpus linguistics), cognitive science and communication studies.
Probably the most interesting part of my recent research is the digital curation of representative corpora automatically collected from the Web, i.e., their annotation in terms of genres, domains or morphosyntactic categories. The current set of resources includes very large corpora for Arabic, Chinese, English, French, German, Italian, Polish, Portuguese, Russian and Spanish.
I am happy to consider applications from prospective PhD students in my areas of expertise. The following general topics are preferred:
- Automatic Text Classification for Translation
Setting up a translation project usually involves assessing the amount of time required for translating a text and selecting the most suitable translator. Modern approaches in Language Technology can do wonders with text processing, but it is not clear how helpful they can be in translation settings. For example, can they help to determine the genre of a text, its difficulty, or its suitability for particular translators? Similar text classification tools can also be used for tasks related to learning foreign languages.
- Language adaptation for improving models of lesser-resourced languages
A translation model needs to be applicable to a large number of languages, while the training resources and linguistic models are often well developed for only some of them. Language adaptation can be designed in a way similar to domain adaptation: the models of lesser-resourced languages are improved by taking into account the resources available for closely related languages, e.g., from French to Romanian. This can be applied in a range of training scenarios, such as part-of-speech tagging, text classification, translation quality prediction, etc.
- Non-parallel resources for translation
Modern Machine Translation is based on "plagiarising" large amounts of existing translations, which usually come from institutions such as the United Nations or the European Parliament. This is not enough for many language directions or for specific domains, such as biomedicine. What are productive methods to mine information about translations from non-parallel texts, such as Wikipedia articles on the same topic or newswire streams in different languages?
I teach courses on:
- Computer-Assisted Translation: translation memories, terminology extraction and management
- Corpus methods for translators: using corpus tools to solve translation problems
- Introduction to Natural Language Processing: using computers to model language
Research groups and institutes
- Centre for Translation Studies
- Language documentation
- Leeds Russian Centre
- Centre for Endangered Languages, Cultures and Ecosystems