Building comparable corpora from social networks

Malek Hajjem1, Marwa Trabelsi2, Chiraz Latiri3
1Laboratoire de recherche LISI, INSAT Tunis Carthage, 2Laboratoire LIPAH, Faculté des sciences de Tunis, Département des Sciences de l’informatique, 1060 Tunis, Tunisie, 3Laboratoire LIPAH, Faculté des sciences de Tunis, Département des Sciences de l’informatique, 1060 Tunis, Tunisie


Working with comparable corpora has proven to be an interesting alternative to scarce parallel corpora in various natural language processing tasks. Many researchers have therefore stressed the need for large quantities of such corpora and for work on their construction. In this paper, we highlight the interest and usefulness of mining textual data from social networks. We propose exploiting tweets from the microblogging service Twitter to construct comparable corpora. This work aims to develop a new method for constructing comparable corpora that could later be used in multilingual information retrieval (MLIR), statistical machine translation (SMT), and other fields.