Using partly multilingual patents to support research on multilingual IR by building translation memories and MT systems

Lingxiao Wang1, Christian Boitet2, Mathieu Mangeot3
1GETALP, Laboratoire d'Informatique de Grenoble, 2UJF, Grenoble 1 (LIG-GETALP), 3GETALP-LIG Laboratory, Grenoble University


Abstract

In this paper, we describe the extraction of directional translation memories (TMs) from a partly multilingual corpus of patent documents, namely the CLEF-IP collection, and the subsequent production and gradual improvement of MT systems for the associated sublanguages (one for each language). The motivation is to support the work of researchers in the MUMIA community. First, we analysed the structure of the patent documents in this collection and extracted from it multilingual parallel segments (English-German, English-French, and French-German), taking care to identify the source language, as well as monolingual segments. We then used the extracted TMs to build statistical machine translation (SMT) systems. To obtain more parallel segments, we also imported the monolingual segments into our post-editing system and post-edited them with the help of SMT.