Statistical Machine Translation

Drawing on a complete, high-quality Chinese patent backfile dating back to 1985, we have created 4 million aligned bilingual sentence pairs from human-translated patents. We train and evaluate our Chinese-to-English SMT system with a phrase-based approach that uses jargon and phrases derived from some 20 million English patents. This makes it possible to carry search strategies and English-driven retrieval technologies over to other languages while still achieving useful recall and precision.

Facts

Technology: Phrase-based SMT technology (MOSES), with customized pre- and post-processing modules
Language: Chinese to English automatic translation, customized for patents
Training data: 4 million aligned bilingual sentences obtained from human-translated patents, and 1 million English sentences extracted from WO, EP and US patents
Production environment: a scalable grid environment with a current capacity of more than 100,000 documents per day
Quality:

– Automatic quality assessment mechanism integrated in the workflow, including BLEU score comparison with Google and CNPat (Chinese Patent Information Centre). Currently, our system shows an average BLEU score improvement of 170% over Google.

– Human quality assessment by a team of Chinese native speakers and language specialists, who feed their findings back to improve the translation engine.
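
A percentage improvement in BLEU, as quoted in the automatic assessment above, is a relative measure: the gain over the baseline divided by the baseline score. The following minimal Python sketch shows only that arithmetic. The function name `relative_improvement` and both scores are hypothetical and chosen for illustration; they are not the measured values behind the figure above.

```python
def relative_improvement(system_bleu, baseline_bleu):
    """Return the percentage by which system_bleu exceeds baseline_bleu."""
    if baseline_bleu <= 0:
        raise ValueError("baseline BLEU must be positive")
    return (system_bleu - baseline_bleu) / baseline_bleu * 100.0


# Hypothetical scores, used only to illustrate the calculation.
baseline = 0.10
system = 0.27
print(round(relative_improvement(system, baseline), 1))  # about 170 (percent)
```

Under these assumed numbers, a 170% improvement means the system score is 2.7 times the baseline score, not 1.7 times.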

Glossary

Source language – the language of the text to translate
Target language – the language into which the source language text is translated
Monolingual corpus – a corpus of documents in a single language, used to create the target language model so that jargon and phrases are optimized for a specific domain
Bilingual corpus – a corpus of paired documents in two languages
Sentence alignment – the process which takes a bilingual corpus and produces bilingual paired sentences containing equivalent concepts and information
RBMT - Rule-based machine translation – a system in which translation relies on “hand-crafted” linguistic rules and dictionaries
TM - Translation Memory – a database of paired sentences in the source and target languages, together with the mechanism that uses this data to translate sentences
SMT - Statistical machine translation – a system in which translation relies on bilingual and monolingual data acquired automatically from corpora
Phrase-based SMT – an SMT system in which the bilingual data includes not only words but also phrases (sequences of words) of arbitrary length
Language model – data and statistics which “describe” the word ordering of a given language. SMT uses a target language model during the actual translation.
Disambiguation – for a translation task, disambiguation is the process of selecting the most suitable translation of a given source language word in a given sentence context.
Pre- and post-processing – all the workflow steps that prepare the source document before the actual sentence-by-sentence translation and that afterwards generate a translated document with the formatting preserved.
BLEU score – an automatic measure of translation quality, computed by counting the words and phrases that the automatically translated text shares with a given human-translated reference.
Fluency – used in human assessment to measure the readability of a machine translated text
Adequacy – used in human assessment to measure how much of the original information is conveyed in the machine translated text
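
To make the BLEU entry in the glossary above more concrete, here is a simplified Python sketch of a sentence-level BLEU score against a single reference. It is an illustration under stated assumptions, not the evaluation code used in the workflow described here: it tokenizes on whitespace, uses n-grams up to length 4 with uniform weights, applies the standard brevity penalty, and does no smoothing. The names `ngrams` and `bleu` and the example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference.

    Assumptions: whitespace tokenization, uniform n-gram weights,
    no smoothing (the score is 0.0 if any n-gram order has no match).
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return penalty * math.exp(sum(log_precisions) / max_n)


# Hypothetical machine output and human reference.
machine = "the device comprises a housing and a cover"
human = "the device comprises a housing and a lid"
print(round(bleu(machine, human), 2))  # approximately 0.84
```

An output identical to the reference scores 1.0, and an output that shares no words with the reference scores 0.0. Production evaluations typically compute BLEU over a whole test corpus rather than sentence by sentence, which this sketch does not attempt.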

IRF
