Research projects

CTS maintains an extremely active research agenda and has an excellent track record in funded projects in different areas. 


The ACCURAT project aims to research methods and techniques to overcome one of the central problems of machine translation (MT): the lack of linguistic resources for many of the domains and languages to which MT is applied. The main goal is to find, analyse and evaluate novel methods that exploit comparable corpora to compensate for this shortage of linguistic resources, and ultimately to significantly improve MT quality for under-resourced languages and narrow domains.

Contacts: Bogdan Babych, Serge Sharoff


Translators have access to a wealth of information during the process of translating a text. This includes monolingual dictionaries to examine the senses in the source and target languages, and bilingual dictionaries to examine lexical equivalence. Less available is a translation (or parallel) corpus which provides examples of how translation equivalents are used in the target language. The recent research focus has been on providing translation equivalents for technical vocabulary in a restricted domain. The problem we addressed is that of providing contextual examples of translation equivalents for words from the general lexicon.

For a sentence in the source language (English or Russian), the tool suggests examples of similar contexts in the target language, selected from the target-language corpus, and provides an interface for creating and maintaining a user dictionary of contextualised translation equivalents. We concentrate on the general lexicon because its words exhibit a variety of meanings and possible translations that are not usually covered by the translation equivalence lists given in bilingual dictionaries.

Contacts: Serge Sharoff, Tony Hartley

eCoLoRe / eCoLoTrain / eCoLoMedia

From 2002 until 2010 CTS led, or was a partner in, three major vocational training projects funded by the Leonardo da Vinci programme: eCoLoRe, eCoLoTrain and eCoLoMedia. With a total value of over 1.5m EUR, these projects were dedicated to creating online training resources for continuous vocational training in eContent localisation. We are now excited to be working with Translation Commons, a nonprofit online volunteer community of language professionals, to offer centralised free access to the updated resources developed as part of the eCoLo projects. Further information on Translation Commons

eCoLoRe - eContent localisation is the translation and cultural adaptation for local markets of digital information. To be efficient, this relies heavily on specialised computer tools requiring intensive training. The desire of professional bodies and universities to provide adequate training is currently frustrated by a lack of resources. These include sample texts and scenarios for their pedagogic exploitation in realistic, task-oriented settings. By providing such materials, eCoLoRe has succeeded in remedying the "severe skills shortage" identified in the EC-sponsored SPICE-PREP II report on eContent localisation. Without these skills, the European digital economy will be stunted and access to wider world markets will be limited.

eCoLoTrain aims to improve the translation trainers' and teachers' ICT skills in general and eContent localisation skills in particular. In doing so, it relies on the raw materials developed during the eCoLoRe project, specially created for scenarios involving the use of state-of-the-art software for eContent localisation. From a practical point of view, eCoLoTrain provides a set of original materials ranging from a dedicated curriculum for continuous vocational training in eContent localisation - including courses such as Computer Assisted Translation (CAT) and Project Management (PM) - via methodological and didactic guidelines, to concrete course materials. All the materials have been tested in various workshops involving translation teachers and trainers.

eCoLoMedia - The new forms of multimodal communication emerging in this age of globalisation have an increasing impact on the translation industry. Language service providers are looking to recruit professionals who have not only linguistic expertise and the technical know-how for working with dedicated software, but who can also easily adapt their knowledge to respond to a variety of translation scenarios. These include localising websites and Flash animations, localising games, and adapting DVD content for audiences with other languages and for the Deaf and hard-of-hearing. eCoLoMedia prepares professional translators and translation students to respond effectively to this demand by providing training materials, delivered in the form of translation kits, which cover a variety of multimedia source files including videos, games and websites incorporating Flash animations. The translation kits include translations of the sources in a variety of formats - subtitling, voice-over and dubbing files as well as localised versions of websites and games - together with pedagogical and methodological guidelines to help users adapt these resources to their own learning and/or teaching scenarios. All material is freely available from the project website.

Contacts: Alina Secara, Anthony Hartley, Dragos Ciobanu, Martin Thomas 


Evaluating and improving document usability for multilingual audiences

Documents play crucial roles in modern societies, enabling us to take actions, make decisions and record transactions. As such, an inability to use documents effectively contributes to social exclusion. In design terms, documents are often highly complex, combining written text, diagrams, photos, tables, logos and other graphic elements. Thus we can characterize them as multimodal. Equally, the processes involved in our making sense of visual information are themselves also complex. Recent studies show that a combination of information each of us has learnt through experience is in play alongside our common human hardware. Thus cultural factors play their part in our interactions with document design. As such, it seems plausible that readers whose first language is not English are doubly disadvantaged when working with documents written in English, i.e. in terms of both language skills and document literacy.

EvIDence explores this hypothesis, bringing together rigorous techniques for document sampling and comparison, formal representations of document design and audience-sensitive usability testing. Thus EvIDence identifies and exploits synergies between several relevant research communities and contributes methodological innovation to the current state of the art in each. Moreover, by providing the means to evaluate and improve the accessibility of multimodal documents produced by UK organizations from two key sectors, health and financial services, EvIDence offers impact beyond the academy.

Contact: Martin Thomas


Much humanities research relies on or would benefit from analysis of electronic corpora (representative collections of texts). The main advantage of using corpora over hand-picked examples is the ability to collect data systematically, to assess the centrality of certain features to the research material, and to establish experimentally potential trends in the data. However, the major difficulty faced by corpus-based studies in humanities research is that creating and annotating a new corpus and designing an appropriate search engine for textual analysis require complex technical support.

IntelliText's novel contribution lies in tuning advanced tools and methods from computer science to the needs of humanities researchers, integrating them into a single software application with a simple interface and good documentation. This allows humanities researchers with no specialised background in computer science or corpus linguistics to take advantage of powerful methods of text collection and analysis. It enables them to collect new project corpora from the web, have them enriched automatically with linguistic and other annotations, and then easily uncover interesting patterns of usage, starting either from their own intuitions and hypotheses, or from expressions and patterns identified as potentially noteworthy by the system.

Contacts: Serge Sharoff, Tony Hartley


We propose a hybrid architecture for high-quality machine translation which combines the strengths of the rule-based and corpus-based approaches and minimizes their weaknesses. At the core is a rule-based MT system which provides morphology, declarative grammars, semantic categories, and small dictionaries, but which avoids all expensive kinds of intellectual knowledge acquisition. Instead of manually working out large dictionaries and compiling information on disambiguation preferences, we suggest a novel corpus-based bootstrapping method for automatically expanding dictionaries, and for training the analytical performance and the choice of transfer alternatives.

This is a Marie Curie FP7 project in collaboration with Lingenio, Heidelberg, a small company developing and selling rule-based MT systems (Translate) for English/German/French (Spanish and Italian under development) and also Office Dictionaries based on the context-sensitive Intellidict technology. The underlying technology was originally developed at the IBM Heidelberg research centre in a long-term project.

Contact: Bogdan Babych


Bringing a Corpus-Based Approach to Language Learning and Teaching into the Mainstream

LangCorp builds on the success of the AHRC-funded IntelliText project, which opened new potential for the use of electronic corpora - representative collections of texts (such as novels, newspaper articles, technical manuals), often enriched with additional information - in research across the humanities. Among several sample fields of research, IntelliText identified language learning and teaching as the one most fertile for immediate impact. Our primary target application domain is language learning and teaching in UK HEIs. This is the environment in which the pedagogical methods, teaching materials and technology we develop will be implemented and evaluated. We will reach tutors who may not be actively engaged in research through channels such as the Association of University Language Centres.

Contacts: Martin Thomas, Tony Hartley


MeLLANGE aims to adapt vocational training for translators and other language professionals to meet the new needs that flow from globalisation and increased cross-cultural communication, in particular an increased need for the management of intercultural language resources. MeLLANGE's achievements were to:

  • devise a methodology for the collaborative creation of corpus-based e-learning materials in translation and language resource management
  • address the issue of a vocational training policy at the European level by coordinating MA programmes, within the framework of the Bologna declaration

Leeds participated in all major aspects of the project and coordinated the following activities:

  • adapting existing corpora for use by material designers, trainers and end-users
  • creating and analysing a corpus of learner translations

Contacts: Alina Secara, Tony Hartley


The aim of the MITRAS project is to improve the retrieval, processing and presentation of web content -- text and speech/audio -- according to the needs of non-native speakers with insufficient knowledge of a particular language.

Since mobility within the EU is unthinkable without reasonable knowledge of the language of the country people want to settle in, MITRAS will focus on a multilingual approach as well as a deeper analysis in terms of language processing in information retrieval and extraction. The languages to be investigated in the project are Chinese, English, Finnish, German, Greek, Italian, Russian, Slovene and Swedish, which together represent a typologically diverse sample of languages.

Digital technologies such as the Internet provide a spectrum of authentic multimodal materials representing various established genres of text and audio such as stories (news, blogs), facts (reports, instructions) and evaluations (editorials, reviews). However, there is a shortage of methodologies and tools which would allow learners and teachers to exploit these novel communication resources successfully in the learning process. This leaves the selection of learning material to the subjective judgement and intuition of the teacher.

The project aims to develop building blocks for an intelligent, customizable, dedicated web search engine and a Reader Assistant which can be used for grading texts from the web according to difficulty and also for identifying difficult parts of texts retrieved from the Internet. Furthermore, a core engine for multilingual information retrieval (MIR) will be developed in order to retrieve relevant learning material requested by a user. Special attention will be given to creating a robust project infrastructure and to representing the project data in the standards already established by the EU projects CLARIN and META-NET.

Contact: Serge Sharoff


Multilingualism and multimodality: Improving meaning-making in healthcare information.

Our project delivers evidence to inform the improvement of the design and translation of healthcare communications in two related multilingual societies, Hong Kong and the UK. In sum, we aim to achieve four objectives. In order of priority, these are:

(1) to suggest ways to improve the accessibility of information, whether delivered in the original language or in translation;

(2) to identify opportunities to make its production more efficient;

(3) to develop methodological approaches, as well as results, which will be generalizable to other contexts, including health promotion for other language communities in Hong Kong, the UK and other multilingual societies, as well as transfer to other service sectors, public, charitable and commercial;

(4) to foster collaborative ties between leading academics in Hong Kong and the UK with complementary skills and expertise and to ensure knowledge transfer within the research team.

Contact: Martin Thomas


The medium for the research was an exhibition, at the Royal Armouries Museum in Leeds, based on mediaeval manuscripts documenting the 100 Years War.

The central interest of this project is the process of designing for personalisation, especially personalisation of experience. To provide an environment for a social investigation, the team installed interactive "waypoints" in the exhibition to help people attend to the narratives and forms of engagement that suit them. Waypoints recognise individual visitors, offer choices, and record their decisions and actions to build up a record of the visit that informs the next waypoint about that visitor's interests and guides them onward in their visit. We also looked at how the process can be set up to cause interaction between visitors, rather than just a private experience: most past research has investigated the individual experience of personalisation rather than its social dimension.

The research project had two main parts. The first was a nine-month creative design project to design and build the waypoints, including a programme of user testing to support the design process itself and our thinking about the audience research to be done in the second part.

The second was a six-month programme of investigations in the exhibition itself, observing and interviewing visitors to the exhibition and setting up experiments in which different communication strategies were tested.

Contact: Serge Sharoff


The NNI aims to raise awareness of interpreting as an exciting profession among young people in the UK through promotional activities, careers events and a dedicated website. The website includes information, related events listings and a range of interactive multimedia resources. As a strand in the HEFCE-funded Routes into Languages project, which tries to encourage the take-up of languages nationally, the NNI fosters the use of interpreting as a motivational tool in language teaching (e.g. through KT activities with teachers). Through outreach to undergraduate language students it also attempts to address the current shortage of English mother-tongue interpreters.

Contact: Svetlana Carsten


The aim of the project is to develop an open interactive multimedia resource in conference interpreter training. The interactive medium is intended for postgraduate interpreter training schools in the EU member states and those countries where postgraduate training in conference interpreting is being offered or envisaged. The project team is creating template modules in five EU languages (English, Czech, Greek, Lithuanian and Spanish) which could be adapted for training in any language. The expectation is that this collaborative project will complement the curriculum set by the European Masters in Conference Interpreting and will contribute to best practice in training standards in conference interpreting in Europe and beyond.

The proposed resources are intended to supplement all stages of training. Unlike video conferencing, ORCIT resources are open and completely accessible to anyone with a computer and internet access.

Contacts: Svetlana Carsten, Matthew Perret, Sophie Llewellyn Smith, Tamara Muroiwa, Jeremy Munday, Dragos Ciobanu, Sara Ramos Pinto


Building on language-related projects carried out at the Universities of Sheffield and Leeds, we aim to develop a corpus-based approach to teaching postgraduate students reading skills in Russian for subject-specific domains. Many of these students have not received any formal linguistic training and they need to learn Russian. The project has four primary aims:

1. to develop corpus-based tools to allow users to upload and annotate texts of direct relevance to their research and extract from them frequency lists of single- and multi-word terms;

2. to produce an online repository of readers in several areas of research that include words and phrases from our frequency lists and are set to prepare students for reading authentic articles in their area of research;

3. to produce a frequency-based grammar booklet for PG students based on the frequency of grammatical forms in a corpus of academic articles compiled at Leeds;

4. to test a corpus-based, vocabulary-oriented pedagogical approach on PG students with no previous knowledge of Russian.
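
As an illustration of the first aim, the following is a minimal sketch of how frequency lists of single- and multi-word sequences might be extracted from an uploaded text. It is not the project's tool: the function names `tokenize` and `frequency_lists` and the sample sentence are assumptions made for this example, and the sketch counts raw word sequences only, whereas a real term extractor would probably also lemmatise Russian word forms, respect sentence boundaries and filter out non-terms.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (runs of letters, any script)."""
    # [^\W\d_] matches any Unicode letter: a word character that is not a digit or underscore.
    return re.findall(r"[^\W\d_]+", text.lower())


def frequency_lists(text, n_max=2):
    """Return a dict mapping n (1..n_max) to a Counter of n-word sequences.

    Hypothetical helper for illustration: sequences are counted over the whole
    text, so they may span sentence boundaries.
    """
    tokens = tokenize(text)
    lists = {}
    for n in range(1, n_max + 1):
        ngrams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        lists[n] = Counter(ngrams)
    return lists


# Invented sample sentence ("Machine translation and machine translation of texts").
sample = "Машинный перевод и машинный перевод текстов"
lists = frequency_lists(sample)
print(lists[1].most_common(2))  # most frequent single words
print(lists[2].most_common(1))  # most frequent two-word sequence
```

Sorting each list by frequency in this way gives a first approximation of the vocabulary a reader would meet most often in a given research area, which could then be reviewed by hand before being used in teaching materials.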

Contacts: Serge Sharoff, James Wilson


From 2008 the Centre was a partner in a three-year knowledge-transfer project with the Translation Automation User Society (TAUS). TAUS is a not-for-profit community of users and providers of translation technologies and services, whose members include Adobe, eBay, Intel, McAfee, Microsoft, Oracle and Sun and other international organisations and companies that daily generate large volumes of documentation in many languages. TAUS has created a fellowship in CTS to support research and development of Intelligent Access to Translation Resources: large collections of translated texts (currently over four billion words), aligned sentence-by-sentence for 237 language pairs.

Contacts: Bogdan Babych, Tony Hartley


The TTC project (Terminology Extraction, Translation Tools and Comparable Corpora) aims at leveraging machine translation tools (MT tools), computer-assisted translation tools (CAT tools) and multilingual content management tools by automatically generating bilingual terminologies from comparable corpora in several European languages (including under-resourced languages, i.e. languages with scarce linguistic resources), as well as in Chinese and Russian.

Contact: Serge Sharoff


The project aims to classify web pages automatically. There are many different kinds of documents on the web, from games to shopping pages to journalism to blogs. Different sorts of page have quite different uses and characteristics. A query for 'Venice' results in pages of various types, referring to recent news, information about history, guidebooks, hotel lists, opinions about hotels and restaurants, etc. For many applications (language teaching, machine translation, information retrieval and extraction) it is also important to have the possibility of selecting a subcorpus according to specific parameters, such as encyclopedic knowledge vs. instructions, texts written for professionals vs. for the general public, or opinions vs. factual text.

In this project we will work on different language families, so that the method can be shown to be portable to further languages. We will be testing the approach using web pages in English, Chinese, German and Russian.

Hand in hand with classifying pages, we need to identify the categories we shall classify them into. The web is new, and this is not an area that has been widely researched to date. We shall adopt an iterative approach by classifying samples of web pages to see which pages fit the existing classification scheme, and amending the scheme to allow for those that do not.

Contact: Serge Sharoff