11th Workshop on Building and Using Comparable Corpora
SHARED TASK
Shared task: identifying parallel sentences in comparable corpora
As a continuation of the previous year’s shared task, we announce a modified shared task for 2018. As is well known, a bottleneck in statistical machine translation is the scarcity of parallel resources for many language pairs and domains. Previous research has shown that this bottleneck can be reduced by utilizing parallel portions found within comparable corpora. These are useful for many purposes, including automatic terminology extraction and the training of statistical MT systems. The aim of the shared task is to quantitatively evaluate competing methods for extracting parallel sentences from comparable monolingual corpora, so as to give an overview of the state of the art and to identify the best performing approaches. This repetition of the same task with updated data gives new participants another opportunity to address the task, and former participants an opportunity to test improved versions of their methods and systems.
Any submission to the shared task is expected to be accompanied by a short paper (4 pages plus references). The paper will be accepted for publication in the workshop proceedings after a basic quality check; submissions go through Softconf with the standard peer-review process.
The training data for this task is the same as in the 2017 shared task.
Schedule
- Shared task sample and training sets released: 22 December 2017
- Shared task test set released: 24 January 2018
- Shared task test submission deadline: 31 January 2018
- Shared task paper submission deadline: 2 February 2018
- Shared task camera-ready papers: 25 February 2018
Task definition
The task definition is the same as in the 2017 shared task: Given two sentence-split monolingual corpora, participant systems are expected to identify pairs of sentences that are translations of each other.
Evaluation will be performed using balanced F-score. In the results of a system, a true positive (TP) is a pair of sentences that is present in the gold standard, and a false positive (FP) is a pair of sentences that is not present in the gold standard. A false negative (FN) is a pair of sentences present in the gold standard but absent from the system results. Precision P, recall R and F1-score F1 are then computed as:
P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2PR / (P + R)
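As an illustration, here is a minimal scoring sketch in Python, under the assumption that both the gold standard and a system run are files containing one tab-separated sentence_id pair per line (the format described below); the file names in the usage comment are hypothetical:

# Minimal evaluation sketch: precision, recall and balanced F-score over
# sentence-ID pairs. Assumes one tab-separated sentence_id pair per line
# in both the gold standard and the system output.

def read_pairs(path):
    """Read a file of tab-separated sentence_id pairs into a set of tuples."""
    with open(path, encoding="utf-8") as f:
        return {tuple(line.rstrip("\n").split("\t")) for line in f if line.strip()}

def evaluate(system_path, gold_path):
    """Return (precision, recall, F1) of a system run against the gold standard."""
    system = read_pairs(system_path)
    gold = read_pairs(gold_path)
    tp = len(system & gold)                        # pairs present in both
    p = tp / len(system) if system else 0.0        # precision
    r = tp / len(gold) if gold else 0.0            # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0     # balanced F-score
    return p, r, f1

# Hypothetical file names:
# print(evaluate("myteam1.de-en.test", "de-en.training.gold"))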
Shared task data contents
Sample, training and test data provide monolingual corpora split into sentences, with the following format (UTF-8 text with Unix end-of-lines; identifiers are made of a two-letter language code + 9 digits, separated by a dash '-'):
- Monolingual EN corpus (where EN stands for English), one tab-separated sentence_id + sentence per line
- Monolingual FR corpus (where FR stands for Foreign, e.g. French), one tab-separated sentence_id + sentence per line
- Gold standard list of tab-separated EN-FR sentence_id pairs (held out for the test data)
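As an illustration, a minimal Python sketch for loading one language pair in this format is given below; the file names are hypothetical and should be replaced by the actual names found in the downloaded archives:

# Minimal loading sketch for one language pair of the shared task data.
# File names used in the usage comment are hypothetical.

def read_corpus(path):
    """Read a monolingual corpus: one 'sentence_id<TAB>sentence' per line."""
    sentences = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            sent_id, sent = line.rstrip("\n").split("\t", 1)
            sentences[sent_id] = sent
    return sentences

def read_gold(path):
    """Read the gold standard: one tab-separated sentence_id pair per line."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t")) for line in f if line.strip()]

# Hypothetical file names for the de-en training data:
# de_sentences = read_corpus("de-en.training.de")
# en_sentences = read_corpus("de-en.training.en")
# gold_pairs = read_gold("de-en.training.gold")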
Sample and training datasets are the same as in the 2017 shared task. They are provided for French-English, German-English, Russian-English, and Chinese-English (see links below).
Important information and requirements:
- The papers that describe the data preparation process (BUCC 2016, BUCC 2017) transparently explain that the data come from two sources: Wikipedia (now the 20161201 dumps from December 2016) and News Commentary (now version 11).
- As a consequence, the use of Wikipedia and of News Commentary (other than what is distributed in the present shared task corpora) is not allowed in this task, because they trivially contain the solutions (News Commentary in a positive way, Wikipedia in a negative way).
Sample data
Sample data is provided for the following language pairs (note that the monolingual English data vary in each language pair):
- de-en (German-English)
- fr-en (French-English)
- ru-en (Russian-English)
- zh-en (Chinese-English)
Each sample dataset contains two monolingual corpora of about 10–70k sentences including 200–2,300 parallel sentences and is provided as a .tar.bz2 archive (1–4MB).
Training and test data
Training and test data are provided for the following language pairs (note that the monolingual English data vary in each language pair):
- de-en (German-English)
- fr-en (French-English)
- ru-en (Russian-English)
- zh-en (Chinese-English)
- download training data
- download test data
Each training or test dataset contains two monolingual corpora of about 100–550k sentences, including 2,000–14,000 parallel sentences, and is provided as a .tar.bz2 archive (6–36MB). Training data includes gold standard links; test data does not.
Submission details
Each team is allowed to submit up to three (3) runs for each language. In other words, a team can test several methods or parameter settings and submit the three runs it prefers.
Please structure your test results as follows:
- one file per language, named <team><N>.<fr>-en.test, where
- <team> stands for your team name (please use only ASCII letters, digits and “-” or “_”)
- <N> (1, 2 or 3) is the run number
- <fr> stands for the language (among de, fr, ru, zh)
- the file format should be the same as that of the gold standard files provided with the sample and training data, and the file should contain only those sentence pairs that the system believes are translation pairs:
- One sentence_id pair per line, tab-separated, of the form <fr>-<id1><tab>en-<id2>, where <fr> is one of de, fr, ru, zh, and <fr>-<id1> and en-<id2> are identifiers found in the <fr> and en parts of the test corpus. For instance, for de-en (German-English):
de-000000003<tab>en-000007818
de-000000004<tab>en-000013032
...
- put all files in one directory called <team>
- create an archive with the contents of this directory (either <team>.tar.bz2, <team>.tar.gz, or <team>.zip)
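As an illustration, a minimal Python sketch for writing one run in this format and packing the <team> directory into an archive is given below; the team name, run number and pair list are placeholders:

# Minimal submission sketch: write one run file named <team><N>.<fr>-en.test
# inside a directory called <team>, then pack that directory as <team>.tar.bz2.
# Team name, run number and the example pair are placeholders.
import os
import tarfile

def write_run(pairs, team, run, lang):
    """Write one run: one tab-separated <fr>-<id1>, en-<id2> pair per line."""
    os.makedirs(team, exist_ok=True)
    path = os.path.join(team, f"{team}{run}.{lang}-en.test")
    with open(path, "w", encoding="utf-8") as f:
        for fr_id, en_id in pairs:
            f.write(f"{fr_id}\t{en_id}\n")
    return path

def archive_team_dir(team):
    """Pack the <team> directory into <team>.tar.bz2 for submission."""
    with tarfile.open(f"{team}.tar.bz2", "w:bz2") as tar:
        tar.add(team, arcname=team)

# Placeholder usage:
# write_run([("de-000000003", "en-000007818")], "myteam", 1, "de")
# archive_team_dir("myteam")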
Send the archive as an attachment in a message together with factual summary information on your team and method:
To: bucc2018st-submission@limsi.fr
Subject: <team> submission
Team name: <team>
Number of runs submitted: <1,2,3>
Participants:
<person1> <email> <affiliation> <country>
<person2> <email> <affiliation> <country>
...
Resources used: <dictionary X>, <corpus Y>, ...
Tools used: <POS tagger X>, <IR system Y>, <word alignment system Z>, <machine learning library T>, ...
You will receive a human acknowledgment within at most 8 hours (depending on the difference between your time zone and CEST).