Process Automation for Localisation of Dialogue in Entertainment Media (PALODIEM)

Start date: 2014
End date: 2016
Funder: InnovateUK
Principal investigator: Dr Serge Sharoff

Description

Machine Translation (MT) methods require a substantial body of parallel text, i.e. translations aligned between the required languages (an aligned corpus). The general principle of MT is to learn the probabilities of translations of individual words and constructions from aligned corpora. The traditional approach takes no account of the linguistic ‘distance’ between the languages: an English-Chinese engine is built on the same principle as a Spanish-Portuguese one. However, the latter pair is likely to benefit from similarities in words and constructions, since both languages share a common Latin origin.
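As an illustration of learning word translation probabilities from an aligned corpus, the sketch below runs a simplified version of IBM Model 1 with expectation-maximisation on a toy Spanish-Portuguese corpus. This is a textbook technique chosen for illustration, not necessarily the one used in the project; the function name `train_ibm_model1`, the toy sentences, and the omission of a NULL word are all assumptions.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(f, e)] = P(target word f | source word e) with EM.

    `pairs` is a list of (source_tokens, target_tokens) sentence pairs.
    Simplified sketch: no NULL word and no alignment or length model.
    """
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    # Start from a uniform distribution over the target vocabulary.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts of (f, e) links
        total = defaultdict(float)   # expected counts of e
        for src, tgt in pairs:
            for f in tgt:
                # E-step: share one unit of credit for f among the source words.
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts per source word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return dict(t)

# Hypothetical toy corpus: Spanish source, Portuguese target.
corpus = [
    ("la casa".split(), "a casa".split()),
    ("la mesa".split(), "a mesa".split()),
    ("una mesa".split(), "uma mesa".split()),
]
probs = train_ibm_model1(corpus)
```

After a few iterations the probability mass for Spanish "casa" concentrates on Portuguese "casa", even though the model never looks at spelling. Nothing in the procedure makes use of how closely related the two languages are.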

We explored how to build new high-quality MT engines using automatic detection of cognates, i.e. words with similar spelling and meaning in two languages, e.g. maladie (French for ‘disease’) and malattia (its Italian equivalent). Lists of cognates can be generated from large monolingual resources, extending the coverage of related language pairs beyond what is available in (smaller) aligned corpora, and then added to the translation lexicon of an MT engine.
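The spelling side of cognate detection can be sketched with a normalised edit-distance score over two monolingual word lists. This is a minimal illustration and not the method described in the project's publications: the function names, the 0.6 threshold, and the toy vocabularies are assumptions, and a real system would also need evidence that the candidates share a meaning.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def spelling_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1.0 for identical strings, 0.0 for no overlap."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def cognate_candidates(src_vocab, tgt_vocab, threshold=0.6):
    """Return (source, target, score) triples scoring at or above the threshold.

    The threshold is an illustrative assumption, not a tuned value.
    """
    pairs = []
    for s in src_vocab:
        for t in tgt_vocab:
            score = spelling_similarity(s.lower(), t.lower())
            if score >= threshold:
                pairs.append((s, t, score))
    return sorted(pairs, key=lambda p: -p[2])

# Hypothetical monolingual word lists (French and Italian).
french = ["maladie", "chien"]
italian = ["malattia", "cane"]
candidates = cognate_candidates(french, italian)
```

With these toy lists only the maladie/malattia pair passes the threshold (edit distance 3 over a maximum length of 8). The all-pairs comparison is quadratic in vocabulary size, so large monolingual resources would need some form of candidate filtering first.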

Publications and outputs

Miguel Rios and Serge Sharoff. Obtaining SMT dictionaries for related languages. In Proc. of the Eighth Workshop on Building and Using Comparable Corpora, pages 68–73, Beijing, China, July 2015.

Miguel Rios and Serge Sharoff. Language adaptation for extending post-editing estimates for closely related languages. The Prague Bulletin of Mathematical Linguistics, 106:181–192, 2016. http://corpus.leeds.ac.uk/serge/publications/2016-pbml.pdf

Serge Sharoff. Language adaptation experiments via cross-lingual embeddings for related languages. In Proc. LREC, Miyazaki, Japan, May 2018. http://corpus.leeds.ac.uk/serge/publications/2018-lrec.pdf