Statistical Machine Translation
Based on a complete, high-quality Chinese backfile dating back to 1985, 4 million aligned bilingual sentences have been created from human-translated patents. We train and evaluate our Chinese-to-English SMT system with a phrase-based approach that uses jargon and phrases derived from some 20 million English patents. This allows search strategies and English-driven retrieval technologies to be replicated for other languages while still achieving useful recall and precision.
Facts
- Technology: Phrase-based SMT technology (Moses), with customized pre- and post-processing modules
- Language: Chinese to English automatic translation, customized for patents
- Training data: 4 million aligned bilingual sentences obtained from human-translated patents, and 1 million English sentences extracted from WO, EP and US patents
- Production environment: a scalable grid environment with a current capacity of more than 100,000 documents per day
- Quality:
– Automatic quality assessment mechanism integrated in the workflow, including BLEU score comparison with Google and CNPat (Chinese Patent Information Centre). Currently, our system shows an average BLEU score improvement of 170% over Google.
– Human quality assessment by a team of Chinese native speakers and language specialists, who feed their findings back to improve the translation engine.
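To make the quoted figure easier to read, the following minimal sketch shows how a relative BLEU improvement could be computed. It assumes the 170% is a relative improvement over the baseline score (that is, a score 2.7 times the baseline); the text does not say how the average was taken. The function name `relative_improvement` and the two example scores are hypothetical and are not measured results.

```python
def relative_improvement(ours, baseline):
    """Return the relative improvement of `ours` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (ours - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical BLEU scores on a 0..1 scale, chosen only for illustration.
    baseline_bleu = 0.10
    our_bleu = 0.27
    gain = relative_improvement(our_bleu, baseline_bleu)
    print(f"Relative BLEU improvement: {gain:.0f}%")
```

Under this reading, an improvement of 100% would mean a doubled BLEU score, so a gain of 170% corresponds to a score 2.7 times that of the baseline.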
Glossary
- Source language – the language of the text to translate
- Target language – the language into which the source language text is to be translated
- Monolingual corpus – a corpus of documents in a single language, used to create the target language model that optimizes jargon and phrases for a specific domain
- Bilingual corpus – a corpus of paired documents in two languages
- Sentence alignment – the process which takes a bilingual corpus and produces bilingual paired sentences containing equivalent concepts and information
- RBMT - Rule-based machine translation – a system in which translation relies on “hand-crafted” linguistic rules and dictionaries
- TM - Translation Memory – a database of paired sentences in source and target languages, and the mechanism that uses this data to translate sentences
- SMT - Statistical machine translation – a system in which translation relies on bilingual and monolingual data automatically acquired from corpora
- Phrase-based SMT – an SMT system where bilingual data includes not only words but also phrases (sequences of words) of arbitrary length
- Language model – data and statistics which “describe” the word ordering of a given language. SMT uses a target language model during the actual translation.
- Disambiguation – for a translation task, disambiguation is the process of selecting the most suitable translation of a given source language word in a given sentence context.
- Pre- and post-processing – all workflow steps that prepare the source document before the actual sentence-by-sentence translation, and that generate a translated document with preserved formatting afterwards.
- BLEU score – an automatic means of measuring the quality of a translation, computed by counting the words and phrases that the automatically translated text shares with a given human-translated reference.
- Fluency – used in human assessment to measure the readability of a machine-translated text
- Adequacy – used in human assessment to measure how much of the original information is conveyed in the machine-translated text
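The BLEU entry above can be illustrated with a short sketch. It is a simplified, single-reference, sentence-level variant with whitespace tokenization and no smoothing, and is not the scoring code used in the workflow described here. The function names `ngrams` and `simple_bleu` and the example sentences are assumptions made for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU in [0, 1] for one candidate and one reference."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the claimed device comprises a metal housing"
    candidate = "the claimed device comprises a housing"
    print(round(simple_bleu(candidate, reference), 3))
```

A candidate identical to its reference scores 1.0, and one that shares no words with it scores 0.0. Evaluations in practice are usually computed over a whole test set rather than sentence by sentence, so this sketch only shows the counting idea behind the metric.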