11th Workshop on Building and Using Comparable Corpora

SHARED TASK

Shared task: identifying parallel sentences in comparable corpora

As a continuation of the previous year’s shared task, we announce a modified shared task for 2018. As is well known, a bottleneck in statistical machine translation is the scarcity of parallel resources for many language pairs and domains. Previous research has shown that this bottleneck can be reduced by utilizing parallel portions found within comparable corpora. These are useful for many purposes, including automatic terminology extraction and the training of statistical MT systems. The aim of the shared task is to quantitatively evaluate competing methods for extracting parallel sentences from comparable monolingual corpora, so as to give an overview of the state of the art and to identify the best-performing approaches. Repeating the task with updated data gives new participants an opportunity to address it and former participants an opportunity to test improved versions of their methods and systems.

Any submission to the shared task is expected to be accompanied by a short paper (4 pages plus references). This will be accepted for publication in the workshop proceedings after a basic quality check; hence the submission will not go via Softconf with the standard peer-review process.

The training data for this task is the same as in the 2017 shared task.

Schedule

Shared task sample and training sets released: 22 December 2017
Shared task test set released: 24 January 2018
Shared task test submission deadline: 31 January 2018
Shared task paper submission deadline: 2 February 2018
Shared task camera-ready papers due: 25 February 2018

Task definition

The task definition is the same as in the 2017 shared task: given two sentence-split monolingual corpora, participating systems are expected to identify pairs of sentences that are translations of each other.

Evaluation will be performed using the balanced F-score. A true positive (TP) is a pair of sentences returned by a system that is present in the gold standard, and a false positive (FP) is a returned pair that is not present in the gold standard. A false negative (FN) is a pair of sentences present in the gold standard but absent from the system results. Precision (P), recall (R) and F1-score (F1) are then computed as:

P = TP / (TP + FP),    R = TP / (TP + FN),    F1 = (2 × P × R) / (P + R)
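
For concreteness, here is a minimal Python sketch of this scoring, assuming the system output and the gold standard are each given as a set of (non-English identifier, English identifier) pairs; it is an illustration only, not the official scorer:

    def evaluate(system_pairs, gold_pairs):
        """Compute precision, recall and F1 over sets of sentence-ID pairs."""
        tp = len(system_pairs & gold_pairs)   # returned pairs present in the gold standard
        fp = len(system_pairs - gold_pairs)   # returned pairs absent from the gold standard
        fn = len(gold_pairs - system_pairs)   # gold-standard pairs the system missed
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    # Illustrative call with made-up identifiers:
    # evaluate({("de-000000001", "en-000000002")}, {("de-000000001", "en-000000002")})  ->  (1.0, 1.0, 1.0)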

Shared task data contents

Sample, training and test data provide monolingual corpora split into sentences, in the following format: utf-8 text with Unix end-of-lines; identifiers are made of a two-letter language code and 9 digits, separated by a dash ’-’.
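
As an illustration only, a short Python reader for such a corpus file, under the assumption that each line carries a sentence identifier followed by the sentence text, separated by a tab (the released archives are authoritative for the exact layout):

    def read_corpus(path):
        """Read a sentence-split monolingual corpus into {identifier: sentence}."""
        sentences = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                # Assumed layout: "<lang>-<9 digits>\t<sentence text>"
                sent_id, text = line.rstrip("\n").split("\t", 1)
                sentences[sent_id] = text
        return sentences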

Sample and training datasets are the same as in the 2017 shared task. They are provided for French-English, German-English, Russian-English, and Chinese-English (see links below).

Important information and requirements:

Sample data

Sample data is provided for each of the four language pairs listed above: French-English, German-English, Russian-English, and Chinese-English (note that the monolingual English data vary in each language pair).

Each sample dataset contains two monolingual corpora of about 10–70k sentences including 200–2,300 parallel sentences and is provided as a .tar.bz2 archive (1–4MB).

Training and test data

Training and test data are provided for the same four language pairs: French-English, German-English, Russian-English, and Chinese-English (note that the monolingual English data vary in each language pair).

Each training or test dataset contains two monolingual corpora of about 100–550k sentences including 2,000–14,000 parallel sentences and is provided as a .tar.bz2 archive (6–36MB). Training data include gold-standard links; test data will not.
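
A corresponding sketch for loading the gold-standard links, assuming each line lists the two identifiers of a parallel sentence pair separated by a tab (again, verify against the released training data); its output can be fed to the scoring function sketched above:

    def read_gold_links(path):
        """Read gold-standard links into a set of (id, id) pairs."""
        with open(path, encoding="utf-8") as f:
            # Assumed layout: "<id of sentence in language 1>\t<id of its English counterpart>"
            return {tuple(line.rstrip("\n").split("\t")[:2]) for line in f}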

Submission details

Each team is allowed to submit up to three (3) runs for each language pair. In other words, a team can test several methods or parameter settings and submit the three runs they prefer.

Please structure your test results as follows:

Send the archive as an email attachment, together with the following factual summary information about your team and method:

To: bucc2018st-submission@limsi.fr

Subject: <team> submission

 

Team name: <team>

Number of runs submitted: <1,2,3>

Participants:

<person1> <email> <affiliation> <country>

<person2> <email> <affiliation> <country>

...

Resources used: <dictionary X>, <corpus Y>, ...

Tools used: <POS tagger X>, <IR system Y>, <word alignment system Z>, <machine learning library T>, ...

You will receive a human acknowledgement within at most 8 hours (depending on the difference between your time zone and CEST).