Preliminary Program

Multimodal Comparable Corpora for Machine Translation

Haithem Afli¹, Loïc Barrault², Holger Schwenk³
¹LIUM, ²LIUM, University of Le Mans, ³University of Le Mans

Abstract

Building a statistical machine translation (SMT) system requires a parallel corpus to train the translation model and monolingual data to build the target language model. A parallel corpus, also called a bitext, consists of bilingual or multilingual texts. Unfortunately, parallel texts are a scarce resource for many language pairs. One way to overcome this lack of data is to exploit comparable corpora, which are much more readily available. In this paper, we present the corpora we developed for automatic parallel data extraction from multimodal comparable corpora, drawn from the Euronews and TED websites. We describe the content of each corpus and how we extracted the parallel data with our new extraction system. We also present our methods for exploiting multimodal corpora and discuss results on the extracted bitexts.