10th Workshop on Building and Using Comparable Corpora

SHARED TASK

Shared task: identifying parallel sentences in comparable corpora

We announce a new shared task for 2017. A well-known bottleneck in statistical machine translation is the scarcity of parallel resources for many language pairs and domains. Previous research has shown that this bottleneck can be reduced by exploiting parallel portions found within comparable corpora. Such parallel portions are useful for many purposes, including automatic terminology extraction and the training of statistical MT systems.

The aim of the shared task is to quantitatively evaluate competing methods for extracting parallel sentences from comparable monolingual corpora, so as to give an overview of the state of the art and to identify the best-performing approaches.

Shared task sample set released: 6 February 2017
Shared task training set released: 20 February 2017
Chinese training set released: 3 March 2017
Shared task test set released: 21 April 2017
Shared task test submission deadline: 28 April 2017
Shared task paper submission deadline: 2 May 2017
Shared task camera-ready papers: 26 May 2017

Any submission to the shared task is expected to be followed by a short paper (4 pages plus references) describing the methods and resources used to perform the task. The paper will automatically be accepted for publication in the workshop proceedings, although it must still be submitted via Softconf and will go through the standard peer-review process.

Shared task data contents

Sample, training and test data provide monolingual corpora split into sentences, with the following format (utf-8 text, with Unix end-of-lines; identifiers are made of a two-letter language code + 9 digits, separated by a dash ’-’):
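The identifier rule stated above (a two-letter language code, a dash, then 9 digits) can be checked with a short script. The following is only a sketch: the function names are hypothetical, and the restriction to lowercase language codes is an assumption not stated in the task description.

```python
import re

# Identifier rule from the task description: two-letter language code,
# a dash, then exactly 9 digits. Lowercase letters are an assumption.
ID_PATTERN = re.compile(r"[a-z]{2}-[0-9]{9}")


def is_valid_id(identifier: str) -> bool:
    """Return True if the whole string matches the identifier rule."""
    return ID_PATTERN.fullmatch(identifier) is not None


def language_of(identifier: str) -> str:
    """Return the two-letter language code of a valid identifier."""
    if not is_valid_id(identifier):
        raise ValueError(f"malformed sentence identifier: {identifier!r}")
    return identifier[:2]
```

For instance, is_valid_id("en-000000001") holds, while an identifier with fewer than 9 digits is rejected.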

Datasets are provided for French-English, German-English, Russian-English, and Chinese-English (see links below).

Important information and requirements:

Sample data

Sample data is provided for the following language pairs (note that the monolingual English data vary in each language pair):

Each sample dataset contains two monolingual corpora of about 10–70k sentences including 200–2,300 parallel sentences and is provided as a .tar.bz2 archive (1–4MB).

Training and test data

Training and test data are provided for the following language pairs (note that the monolingual English data vary in each language pair):

Each training or test dataset contains two monolingual corpora of about 100–550k sentences including 2,000–14,000 parallel sentences and is provided as a .tar.bz2 archive (6–36MB). Training data includes gold standard links, test data does not.

Task definition

Given two sentence-split monolingual corpora, participant systems are expected to identify pairs of sentences that are translations of each other.

Evaluation will be performed using balanced F-score. In the results of a system, a true positive TP is a pair of sentences that is present in the gold standard, and a false positive FP is a pair of sentences that is not present in the gold standard. A false negative FN is a pair of sentences present in the gold standard but absent from system results. Precision P, recall R and F1-score F_1 are then computed as:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \times P \times R}{P + R}$$
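Participants who want to score their own runs against the training gold standard could compute these measures as in the following sketch. This is not the official scorer: the function name and the representation of a pair as a tuple of two sentence identifiers are assumptions, as is the choice to return 0.0 when a denominator is zero.

```python
def evaluate(system_pairs, gold_pairs):
    """Compute precision, recall and balanced F-score over sentence pairs.

    Each pair is assumed to be a (source_id, target_id) tuple.
    """
    system = set(system_pairs)  # duplicates in a run are counted once
    gold = set(gold_pairs)

    tp = len(system & gold)  # pairs present in both system output and gold
    fp = len(system - gold)  # pairs proposed by the system but not in gold
    fn = len(gold - system)  # gold pairs the system missed

    # Returning 0.0 for an empty denominator is an assumption of this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up identifiers, for illustration only.
gold = [
    ("fr-000000001", "en-000000011"),
    ("fr-000000002", "en-000000012"),
    ("fr-000000003", "en-000000013"),
    ("fr-000000004", "en-000000014"),
]
run = [
    ("fr-000000001", "en-000000011"),
    ("fr-000000002", "en-000000012"),
    ("fr-000000009", "en-000000019"),
]
p, r, f1 = evaluate(run, gold)
```

In this made-up example TP = 2, FP = 1 and FN = 2, so P = 2/3, R = 1/2 and F_1 = 4/7.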

Submission details

Each team is allowed to submit up to three (3) runs for each language. In other words, a team can test several methods or parameter settings and submit the three they prefer.

Please structure your test results as follows:

Send the archive as an attachment in a message together with factual summary information on your team and method:

To: bucc2017st-submission@limsi.fr

Subject: <team> submission

 

Team name: <team>

Number of runs submitted: <1,2,3>

Participants:

<person1> <email> <affiliation> <country>

<person2> <email> <affiliation> <country>

...

Resources used: <dictionary X>, <corpus Y>, ...

Tools used: <POS tagger X>, <IR system Y>, <word alignment system Z>, <machine learning library T>, ...

You will receive an acknowledgment from a human within at most 8 hours (depending on the difference between your time zone and CEST).