Preliminary Program

Building comparable corpora from social networks

Malek Hajjem¹, Marwa Trabelsi², Chiraz Latiri³
¹Laboratoire de recherche LISI, INSAT Tunis Carthage, ²Laboratoire LIPAH, Faculté des sciences de Tunis, Département des Sciences de l’informatique, 1060 Tunis, Tunisie, ³Laboratoire LIPAH, Faculté des sciences de Tunis, Département des Sciences de l’informatique, 1060 Tunis, Tunisie

Abstract

Comparable corpora have proven a useful alternative to scarce parallel corpora in various natural language processing tasks. Many researchers have therefore emphasized the need for large quantities of such corpora and for work on their construction. In this paper, we highlight the value of textual data mining in social networks. We propose exploiting tweets from the microblogging platform Twitter to construct comparable corpora. This work aims to develop a new method for constructing comparable corpora that could later be used in multilingual information retrieval (MLIR), statistical machine translation (SMT), and other fields.