- Start date: 1 January 2010
- End date: 31 December 2012
- Funder: Internally funded
- Principal investigator: Serge Sharoff
The TTC project aims to leverage machine translation (MT) tools, computer-assisted translation (CAT) tools and multilingual content management tools by automatically generating bilingual terminologies from comparable corpora in five European languages (English, French, German, Spanish and one under-resourced language, Serbian), as well as in Chinese and Russian.
The TTC project focuses on the automatic acquisition of aligned bilingual terminologies for computer-assisted translation and machine translation. Its two main steps are the automatic extraction of monolingual terminologies from a large set of multilingual corpora and the bilingual alignment of the extracted terminologies.
Such terminologies could be extracted from parallel corpora, i.e. from previously translated texts, but such data is still sparse and available only for some language pairs and a few specific domains. As a result, no parallel corpora exist for most specialized domains, especially emerging ones such as renewable energy. The project therefore develops methods and tools for the automatic extraction of terminologies from comparable corpora, i.e. corpora that cover the same domain but are not necessarily translations of each other. It also develops tools for gathering these comparable corpora (a topical web crawler), for managing them, and for managing terminologies.
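To illustrate how term translations can be found without parallel text, here is a minimal sketch of one commonly used distributional approach: a source term's context words are mapped into the target language with a small seed dictionary and compared with the contexts of target-language candidates. This is not necessarily the method used by TTC. The function names, the toy sentences and the seed dictionary are all assumptions made for illustration, and the sketch handles single-word terms only.

```python
from collections import Counter
from math import sqrt

def context_vector(term, sentences, window=2):
    """Count the words that occur within `window` tokens of `term`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == term:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts

def translate_vector(vec, seed_dict):
    """Map context words to the target language; drop words not in the seed dictionary."""
    out = Counter()
    for word, n in vec.items():
        if word in seed_dict:
            out[seed_dict[word]] += n
    return out

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def align(term, src_sents, tgt_sents, tgt_candidates, seed_dict):
    """Return the target candidate whose context is most similar to that of `term`."""
    src_vec = translate_vector(context_vector(term, src_sents), seed_dict)
    scored = [(cosine(src_vec, context_vector(c, tgt_sents)), c) for c in tgt_candidates]
    return max(scored)[1]

# Toy comparable corpora (hypothetical): same domain, not translations of each other.
EN = [["the", "rotor", "blade", "turns", "in", "the", "wind"],
      ["each", "blade", "of", "the", "turbine", "is", "long"]]
FR = [["la", "pale", "du", "rotor", "tourne", "dans", "le", "vent"],
      ["la", "tour", "de", "la", "turbine", "est", "haute"]]
# Hypothetical seed dictionary of known general-vocabulary translations.
SEED = {"rotor": "rotor", "wind": "vent", "turbine": "turbine", "turns": "tourne"}

best = align("blade", EN, FR, ["pale", "tour"], SEED)
print(best)
```

On real corpora the raw counts would usually be replaced by an association measure, and multi-word terms would need their own treatment. The toy data only shows the principle.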
At the end of the TTC project, a platform will be set up to compile and manage comparable corpora using standards (TMF, TBX) and the existing open-source UIMA framework. The consortium will evaluate and validate this work on CAT and MT tools: technical documents from the aerospace and IT domains will be translated using CAT and MT techniques to assess the impact of the TTC project outputs.