Dr Serge Sharoff

Position: Professor of Language Technology
Areas of expertise: language technology; natural language processing; machine translation; corpus linguistics; digital humanities; genres and text classification; terminology mining; related languages
Email: S.Sharoff@leeds.ac.uk
Phone: +44(0)113 343 7287
Website: Google Scholar | ORCID

Profile

I joined the University of Leeds in 2003. Before that, I obtained my PhD in 1997 from the Moscow Lomonosov State University, held a postdoctoral appointment at the Russian Research Institute for Artificial Intelligence (1997–2000), and was a Humboldt Research Fellow at the University of Bielefeld (2001–2002).

Research interests

Artificial Intelligence, and more specifically Large Language Models such as ChatGPT, have recently made a profound impact on how we interact with computers by providing the ability to produce new texts in response to prompts. Fundamental research in this area is at the core of my expertise, with one of my papers on the diversity of texts on the Web cited by the GPT creators.

My research interests are related to three domains: linguistics (primarily computational linguistics and corpus linguistics), cognitive science and communication studies.

Probably the most interesting bit in my recent research is digital curation of representative corpora automatically collected from the Web, i.e., their annotation in terms of genres, domains or morphosyntactic categories. The current set of resources includes very large corpora for Arabic, Chinese, English, French, German, Italian, Polish, Portuguese, Russian and Spanish.

I am happy to consider applications from prospective PhD students in the area of my expertise. The following general topics are preferable:

Automatic Text Classification for Translation

Setting up a translation project usually involves assessing the amount of time required for translating a text and selecting the most suitable translator. Modern approaches in Language Technology can do wonders with text processing, but it is not clear how helpful they can be in translation settings. For example, can they help to determine the genre of a text, its difficulty or its suitability for particular translators? Similar text classification tools can also be used for tasks related to learning foreign languages.

Background references:

Serge Sharoff. Genre Annotation for the Web: text-external and text-internal perspectives. Register Studies, 2021
Serge Sharoff. Functional text dimensions for the annotation of Web corpora. Corpora Journal, 13(1):65–95, 2018
Yu Yuan and Serge Sharoff. Sentence Level Human Translation Quality Estimation with Attention-based Neural Networks. In Proc. International Conference on Language Resources and Evaluation (LREC'20), Marseille, May 2020

Language adaptation for improving models of lesser-resourced languages

A translation model needs to be applicable to a large number of languages, while the training resources or linguistic models are often better developed only for some languages. Language adaptation can be designed in a way similar to domain adaptation to improve the models of lesser-resourced languages by taking into account the resources available for closely related languages, e.g., from French to Romanian. This can be applied in a range of training scenarios, such as Part-Of-Speech tagging, text classification, translation quality prediction, etc.

Background references:

Serge Sharoff. Finding next of kin: Cross-lingual embedding spaces for related languages. Journal of Natural Language Engineering, 25, 2019
Miguel Rios and Serge Sharoff. Language adaptation for extending post-editing estimates for closely related languages. The Prague Bulletin of Mathematical Linguistics, 106:181–192, 2016

Non-parallel resources for translation

Modern Machine Translation is based on "plagiarising" large amounts of existing translations, which usually come from institutions such as the United Nations or the European Parliament. This is not enough for many language directions or for specific domains, such as biomedicine. What are productive methods to mine information about translations from non-parallel texts, such as Wikipedia articles on the same topic or newswire streams in different languages?

Background references:

Research projects

Some research projects I'm currently working on, or have worked on, will be listed below. Our list of all research projects (https://ahc.leeds.ac.uk/dir/research-projects) allows you to view and search the full list of projects in the faculty.

Primary investigator (PI)

Process Automation for Localisation of Dialogue in Entertainment Media (PALODIEM)

Co-investigator (Co-I)

Leeds Russian Centre

Student education

I teach courses on:

Computer-Assisted Translation: Translation Memories, terminology extraction and management, Machine Translation
Corpus methods for translators: using corpus tools to solve translation problems
Introduction to Natural Language Processing: using computers to model language

Research groups and institutes

Centre for Translation, Interpreting and Localisation Studies
Language documentation
Translation
Leeds Russian Centre
Russian
Centre for Endangered Languages, Cultures and Ecosystems

Current postgraduate researchers

Postgraduate research opportunities

We welcome enquiries from motivated and qualified applicants from all around the world who are interested in PhD study. Our research opportunities (https://phd.leeds.ac.uk) allow you to search for projects and scholarships.

