Multimodal Comparable Corpora for Machine Translation

Haithem Afli1, Loïc Barrault2, Holger Schwenk3
1LIUM, 2LIUM, University of Le Mans, 3University of Le Mans


The construction of a statistical machine translation (SMT) system requires a parallel corpus for training the translation model and monolingual data to build the target language model. A parallel corpus, also called a bitext, consists of bilingual or multilingual texts. Unfortunately, parallel texts are a scarce resource for many language pairs. One way to overcome this lack of data is to exploit comparable corpora, which are much more readily available. In this paper, we present the corpora developed for automatic parallel data extraction from multimodal comparable corpora, built from the Euronews and TED websites. We describe the content of each corpus and how we extracted the parallel data with our new extraction system. We present the methods we developed for exploiting multimodal corpora and discuss results on the extracted bitexts.