13th Workshop on Building and Using Comparable Corpora

SHARED TASK

Bilingual dictionary induction from comparable corpora

In the framework of machine translation, the extraction of bilingual dictionaries from parallel corpora has been conducted very successfully. On the other hand, human second language acquisition appears not to be based on parallel data. This means that there must be a way of acquiring and relating lexical knowledge in two or more languages without the use of parallel data.
It has been suggested that it might also be possible to extract multilingual lexical knowledge from comparable rather than from parallel corpora. From a theoretical perspective, this suggestion might lead to advances in understanding human second language acquisition. From a practical perspective, as comparable corpora are available in much larger quantities than parallel corpora, this approach might help in relieving the data acquisition bottleneck which tends to be especially severe when dealing with language pairs involving low resource languages.
A well-established practical task for approaching this topic is bilingual lexicon extraction from comparable corpora, which is the focus of this shared task. Its aim is typically to extract word translations such as the following from comparable corpora, where a given source word may receive multiple translations:
Source (English)    Target (French)
baby                bébé
baby                poupon
bath                bain
bed                 lit
bed                 plumard
convenience         commodité
doctor              médecin
doctor              docteur
eagle               aigle
mountain            montagne
nervous             nerveux
work                travail
Quite a few research groups are working on this problem using a wide variety of approaches. However, as there is no standard way to measure the performance of the systems, the published results are not comparable and the pros and cons of the various approaches are not clear.
The shared task aims at solving these problems by organizing a fair competition between systems. This is accomplished by providing corpora and bilingual datasets for a number of language pairs involving Chinese, English, French, German, Russian and Spanish, and by comparing the results using a common evaluation framework. For the shared task we provide corpora as well as training data. However, as these corpora and data may not suit all needs, we divide the shared task into two tracks.

How to participate

Research groups or individual researchers can participate in one or both tracks (see below for further details), can choose the language pairs they wish to work on, and can suggest new language pairs for which we will try to provide support. Participation in the shared task is expected to be accompanied by a system description paper (4 to 6 pages plus references). Ideally, this paper describes the participating system in a way that allows replication of the work. As the shared task is supposed to compare as many different systems as possible (i.e. including systems based on well-established techniques), the scientific content of the paper need not necessarily be novel. Nevertheless, the papers will be peer reviewed and (apart from novelty) the usual quality criteria for research papers will be applied for the papers to be published in the workshop proceedings.
Note that participation in the workshop, although we strongly encourage it, is not mandatory for participating in the shared task or for publication of the system description papers.

Checklist for participants

Time schedule for shared task

Any time: Expression of interest to reinhardrapp (at) gmx (dot) de (including suggestions for additional language pairs). This step is not compulsory but helps us with our planning and allows us to inform you about updates.
12 January 2020: Release of shared task training sets (done)
16 February 2020: Release of shared task test sets (done)
5 March 2020: Submission of shared task results by e-mail to reinhardrapp (at) gmx (dot) de
15 March 2020: Submission of shared task system description papers
11 May 2020: Workshop date

Track 1: Closed Track

The supported language pairs (for which we provide data) are the following:
l1 → l2   de              en              es              fr              ru              zh
de        —               DEWaC-UKWaC     deWiki-esWiki   DEWaC-FRWaC     —               —
en        UKWaC-DEWaC     —               enWiki-esWiki   UKWaC-FRWaC     UKWaC-RUWaC     enWiki-zhWiki
es        esWiki-deWiki   esWiki-enWiki   —               —               —               —
fr        FRWaC-DEWaC     FRWaC-UKWaC     —               —               —               —
ru        —               RUWaC-UKWaC     —               —               —               —
zh        —               zhWiki-enWiki   —               —               —               —
The cells in the table show which type of corpus should be used for both languages of a pair when conducting the dictionary induction task. The rationale behind these choices is that the WaCKy (Web-as-a-corpus initiative) corpora seem somewhat better suited for the dictionary induction task than Wikipedia, but they are not available for Chinese and Spanish. Language pairs involving Chinese and Spanish therefore use Wikipedia corpora, whereas other language pairs use WaCKy corpora.
The WaCKy corpora can be downloaded from the links below (the download can take from less than one minute to hours depending on your connection speed). For convenience, we also provide pre-trained fastText embeddings for these corpora (see the note on pre-trained embeddings at the end of this track description):
Corpus    Language    fastText embeddings
UKWaC     English     bin (3.2 GB), vec.xz (0.3 GB)
FRWaC     French      bin (3.0 GB), vec.xz (0.3 GB)
DEWaC     German      bin (3.0 GB), vec.xz (0.5 GB)
RUWaC     Russian     bin (4.1 GB), vec.xz (0.7 GB)
The WaCKy corpora are cleaned-up web crawls of approximately 2 billion words per language. They are kindly provided by the Web-as-a-corpus initiative (WaCKy). For further information see http://wacky.sslmit.unibo.it/doku.php?id=corpora.
If you use the WaCKy corpora, please cite the following paper: Baroni, M., Bernardini, S., Ferraresi, A., Zanchetta, E. (2009). The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation, 43(3), 209–226.
The Wikipedia corpora can be downloaded from the links below (the download can take from less than one minute to hours depending on your connection speed). For convenience, we also point to pre-trained fastText embeddings for these corpora prepared by Facebook (see the note on pre-trained embeddings at the end of this track description):
Corpus    Language    fastText embeddings
enWiki    English     bin+vec, zipped (9.6 GB); vec (6.1 GB)
esWiki    Spanish     bin+vec, zipped (5.1 GB); vec (2.4 GB)
zhWiki    Chinese     bin+vec, zipped (3.1 GB); vec (0.8 GB)
These corpora are in a one-line-per-document format. The first tab-separated field in each line contains metadata; the second field contains the text. Paragraph boundaries are marked with HTML tags. As cleaning up the original Wikipedia dump files is not trivial, there can occasionally be some noise in the form of not fully cleaned HTML and JavaScript fragments.
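To illustrate this format, here is a minimal sketch for iterating over such a corpus file; the file name corpus.txt and the tag-stripping step are assumptions for illustration, not part of the official data release:

    # Minimal sketch: iterate over a one-line-per-document corpus file
    # where each line is "metadata<TAB>text" (file name is a placeholder).
    import re

    def iter_documents(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t", 1)
                if len(parts) != 2:
                    continue  # skip malformed lines (residual noise)
                metadata, text = parts
                # paragraph boundaries are marked with HTML tags; strip any markup
                text = re.sub(r"<[^>]+>", " ", text)
                yield metadata, text

    for meta, text in iter_documents("corpus.txt"):
        print(meta, text[:80])
        break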
Bilingual word pairs. For checking and improving the performance of your systems, please use the following training data which consists of tab-separated bilingual word pairs:
l1 → l2   de             en             es             fr             ru             zh
de        —              high mid low   high mid low   high mid low   —              —
en        high mid low   —              high mid low   high mid low   high mid low   high mid low
es        high mid low   high mid low   —              high mid low   —              —
fr        high mid low   high mid low   high mid low   —              —              —
ru        —              high mid low   —              —              —              —
zh        —              high mid low   —              —              —              —
Rather than providing one large set of word pairs for each language pair, we split the data into frequency ranges and provide three smaller sets. Looking at different frequency ranges is of scientific interest, as algorithms typically work best for high-frequency words, whereas performance on low-frequency words is of higher practical relevance.
The three sets correspond to frequency ranges of the source language words: the high-frequency set contains bilingual word pairs whose source words are among the 5000 most frequent words, the mid-frequency set consists of words ranking between 5001 and 20000, and the low-frequency set covers ranks 20001 to 50000. (For languages where not enough data is available, we had to reduce the size of the bins.)
Each set is a random sample extracted from the MUSE data kindly provided by Facebook AI Research and comprises 2000 different source language words together with their translations. As in the original MUSE data, the source language words are ordered according to frequency (most frequent first). Taken together, the three sets per language pair thus contain 6000 source language words together with their translations, with each translation listed on a separate line.
If you use any of these datasets, please cite the following paper: Conneau, A., Lample, G., Ranzato, M., Denoyer, L., Jégou, H. (2018). Word Translation Without Parallel Data. Proceedings of ICLR 2018.
As described in this paper, the MUSE dictionaries, which take the polysemy of words into account, were created using a Facebook-internal translation tool. Given that they were generated automatically, they are of high quality, but they still contain a few errors. Participants in the shared task are encouraged to report such errors to us so that, as a positive side effect of the shared task, the datasets can be improved.
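For illustration, here is a minimal sketch for loading one of these training files into a Python dictionary mapping each source word to the set of its translations; the file name de-en.train.high.txt is a placeholder, not the official name of any released file:

    # Minimal sketch: load tab-separated bilingual word pairs into a dict
    # mapping each source word to the set of its gold translations.
    # The file name below is a placeholder for one of the released training files.
    from collections import defaultdict

    def load_pairs(path):
        translations = defaultdict(set)
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                source, target = line.split("\t")
                translations[source].add(target)
        return translations

    train = load_pairs("de-en.train.high.txt")
    print(len(train), "source words,", sum(len(v) for v in train.values()), "translation pairs")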
For testing the systems, lists of source language test words were released on the date given in the time schedule above; they are likewise split into three sets of 2000 words each:
l1 → l2   de             en             es             fr             ru             zh
de        —              high mid low   high mid low   high mid low   —              —
en        high mid low   —              high mid low   high mid low   high mid low   high mid low
es        high mid low   high mid low   —              high mid low   —              —
fr        high mid low   high mid low   high mid low   —              —              —
ru        —              high mid low   —              —              —              —
zh        —              high mid low   —              —              —              —
If your algorithm for inducing dictionaries from comparable corpora requires a seed lexicon, then please use an arbitrary part of the training data for this purpose. We hope that with its 6000 source language words and (depending on the language pair) roughly twice as many translation pairs, the training set is large enough to provide for your needs. If not, please consider using your own data and participating in Track 2 of the shared task.
Pre-trained embedding models such as fastText or BERT can be used only if (re)trained on the provided corpora. The fastText embeddings linked in the corpus tables above have been trained on the Wikipedia or WaCKy corpora and can therefore be readily used in this track.
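As an illustration, here is a minimal sketch for loading such embeddings in Python, assuming the fastText bindings and Gensim are installed; ukwac.bin and ukwac.vec are placeholder names for the downloaded (and, for .vec.xz, decompressed) files:

    # Minimal sketch: two common ways to load the provided fastText models.
    # "ukwac.bin" and "ukwac.vec" are placeholders for the downloaded files
    # (the .vec.xz files must be decompressed first, e.g. with `xz -d`).
    import fasttext                          # official fastText Python bindings
    from gensim.models import KeyedVectors

    # The binary model retains subword information and can embed unseen words.
    model = fasttext.load_model("ukwac.bin")
    vector = model.get_word_vector("baby")

    # The plain-text .vec file can be read as ordinary word vectors.
    kv = KeyedVectors.load_word2vec_format("ukwac.vec", binary=False)
    print(kv.most_similar("baby", topn=5))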

Track 2: Open Track

In this track, participants are free to work on other language pairs, use their own data and, if desired, conduct their own evaluation procedure. However, it would be very helpful if in their papers they described their reasons and motivation for deviating from the procedures of Track 1 and, if possible, provided access to their data.
Please also let us know about your plans in time as we may be able to support you with corpora and datasets.
As this appears to be the first shared task on the topic of dictionary induction from comparable corpora, we cannot draw on previous experience. Due to this pilot character, in Track 1 we are trying to keep things as clear and unsophisticated as possible. In Track 2, however, we encourage you to challenge this simplicity, to experiment freely and to come up with new ideas, in the hope that the resulting insights will promote future progress in the field.

Evaluation

For evaluation, Track 1 participants (for Track 2 participants this is optional) are asked to provide their results on the test data sets for the test words in each of the three frequency ranges. It is expected that for each source language word all of its major translations are provided (what counts as “major” should be inferred from the training data). The shared task organizers compare these translations to the translations found in their (internal) gold standard data, which is structurally similar to the training data. Only identical strings are considered correct, and the performance of each system is determined by computing precision, recall, and F1-score, the latter being the official score for system ranking. All data sets are in UTF-8 encoding.
More precisely: the input to the system is a list of source language words, one per line. A system should return, for each input word w_s, one or more candidate translations w_t, in the form of tab-separated word pairs w_s<TAB>w_t, each on its own line. For instance, in the English-French case, given the following gold standard, test word list, and system output (tab-separated word pairs), the system would be credited with two true positives, one false positive, and two false negatives, hence P = 2/3 ≈ 0.67, R = 2/4 = 0.50, F1 ≈ 0.57.
gold standard:
bed        lit
bed        plumard
doctor     médecin
doctor     docteur

test set:
bed
doctor

system output:
bed        lit
bed        futon
doctor     docteur
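For illustration, the following minimal sketch computes these scores from two such files; the file names are placeholders and this is not the official scoring script:

    # Minimal sketch of the evaluation described above: precision, recall and F1
    # over sets of tab-separated (source, translation) pairs.
    # File names are placeholders; this is not the official scoring script.

    def read_pairs(path):
        pairs = set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    source, target = line.split("\t")
                    pairs.add((source, target))
        return pairs

    gold = read_pairs("gold.txt")      # e.g. (bed, lit), (bed, plumard), (doctor, médecin), (doctor, docteur)
    output = read_pairs("output.txt")  # e.g. (bed, lit), (bed, futon), (doctor, docteur)

    tp = len(gold & output)                              # true positives: 2
    precision = tp / len(output)                         # 2/3 ≈ 0.67
    recall = tp / len(gold)                              # 2/4 = 0.50
    f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.57
    print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")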

Shared task organizers

Previous BUCC shared tasks and datasets:

Last modified: 8 May 2020