8th Workshop on Building and Using Comparable Corpora


A shared task is organized together with the workshop. This will be the first evaluation exercise on the identification of comparable texts: given a large multilingual collection of texts (we will be using Wikipedia documents in several languages), the task is to identify the most similar texts across languages. Evaluation will be done by measuring precision, recall and F-measure on links between pages, with a gold standard based on actual inter-language links.
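The link-based evaluation described above can be sketched as follows. This is an illustrative sketch, not the official scoring script: the function name and the set-of-pairs representation of links are assumptions.

```python
# Sketch of evaluation by precision, recall and F-measure on links between
# pages, against a gold standard of actual inter-language links.
# Links are represented as (id1, id2) pairs; ids below are made up.

def evaluate(predicted, gold):
    """predicted and gold are sets of (id1, id2) link pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

gold = {("de-001", "en-101"), ("de-002", "en-102"), ("de-003", "en-103")}
predicted = {("de-001", "en-101"), ("de-002", "en-999")}
p, r, f = evaluate(predicted, gold)
# One of two predicted links is correct, and one of three gold links is found.
```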

Task description

Parallel corpora of original texts with their translations have provided the basis for multilingual NLP applications since the beginning of the 1990s. The relative scarcity of such resources has led to greater attention to comparable (i.e., less parallel) resources for mining information about possible translations. Many studies have been produced within the paradigm of comparable corpora, including publications in the BUCC workshop series since 2008; see bucc-introduction.html.

However, the community has so far not conducted an evaluation comparing different approaches to identifying more or less parallel resources in a large amount of multilingual data. It is also not clear how language-specific such approaches are. In this shared task we propose the first such evaluation exercise, aimed at detecting the most similar texts in a large collection.

Data set

The data for each language pair has been split into two sets:

Training set
pages with information about the correct links for the respective language pairs;
Test set
pages without the links.

The task is, for each page in the test set, to submit up to five ranked suggestions for its linked page, assuming that the gold standard contains its counterpart in another language. Submissions must be in the tab-separated format used for TREC submissions, with six fields:
id1 X id2 Y score run.name

The X and Y fields are not used; they are reserved by the TREC evaluation script (which does not use them either). Please keep them with the constant values X and Y. id1 and id2 are the article IDs in the evaluation language and in English, respectively. The score should reflect the similarity between id1 and id2: the higher the score, the more similar the pair. Participants are invited to submit up to five runs of their system with different parameters, identified by a keyword in the last field. This field should include the name of the team and an identifier for the run, e.g., Leeds.run1 or LIMSI.BM25. For the evaluation script and more information about the format, please visit: http://trec.nist.gov/trec_eval/
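A submission file in the six-field format above can be produced with a few lines of code. This is a minimal sketch: the helper name, the example IDs, and the run name are made up for illustration.

```python
# Sketch of writing suggestions in the tab-separated TREC-style format:
# id1 X id2 Y score run.name

def format_submission(suggestions, run_name):
    """suggestions: list of (id1, id2, score) tuples, up to five per id1,
    with higher scores meaning more similar pages."""
    lines = []
    for id1, id2, score in suggestions:
        # X and Y are unused placeholder fields required by the format.
        lines.append(f"{id1}\tX\t{id2}\tY\t{score:.4f}\t{run_name}")
    return "\n".join(lines)

rows = [("fr-042", "en-317", 0.93), ("fr-042", "en-518", 0.87)]
print(format_submission(rows, "Leeds.run1"))
```

Scores are formatted to a fixed number of decimals here only for readability; the format itself does not prescribe a precision.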

The languages in the shared task will be Chinese, French, German, Russian and Turkish. Pages in these languages need to be linked to a page in English.

Submission procedure

Please register by sending a message to shared.bucc2015@gmail.com, giving the name of the contact person and the language pairs you'd like to work on.

In response you will receive links to the training sets and the scoring script.

Important dates

1 February 2015 Training set available
20 April 2015 Test set available
24 April 2015 Test submission deadline
1 May 2015 System results to participants
15 May 2015 Paper submission deadline
4 June 2015 Notification of acceptance
21 June 2015 Camera-ready papers due
30 July 2015 Workshop date