13th Workshop on Building and Using Comparable Corpora
SHARED TASK
Bilingual dictionary induction from comparable corpora
In the framework of machine translation, the extraction of bilingual dictionaries from parallel corpora has been conducted very successfully. On the other hand, human second language acquisition appears not to be based on parallel data. This means that there must be a way of acquiring and relating lexical knowledge in two or more languages without the use of parallel data.
It has been suggested that it might also be possible to extract multilingual lexical knowledge from comparable rather than from parallel corpora. From a theoretical perspective, this suggestion might lead to advances in understanding human second language acquisition. From a practical perspective, as comparable corpora are available in much larger quantities than parallel corpora, this approach might help in relieving the data acquisition bottleneck, which tends to be especially severe when dealing with language pairs involving low-resource languages.
A well-established practical task to approach this topic is bilingual lexicon extraction from comparable corpora, which is the focus of this shared task. Typically, its aim is to extract word translations such as the following from comparable corpora, where a given source word may receive multiple translations:
Quite a few research groups are working on this problem using a wide variety of approaches. However, as there is no standard way to measure the performance of the systems, the published results are not comparable and the pros and cons of the various approaches are not clear.
The shared task aims at solving these problems by organizing a fair competition between systems. This is accomplished by providing corpora and bilingual datasets for a number of language pairs involving Chinese, English, French, German, Russian and Spanish, and by comparing the results using a common evaluation framework. For the shared task we provide corpora as well as training data. However, as these corpora and data may not suit all needs, we divide the shared task into two tracks.
- In the “closed track,” participants are required to use only the data provided by us. In this way equal conditions are ensured and, as the outcome of this track, the systems can be ranked according to the quality of their results.
- In the “open track,” participants are free to use their own corpora and training data. If possible, they should still use our evaluation data, but this is also not mandatory. The participants can even work on languages for which the shared task provides no data. If relevant, the participants should describe why their systems are not suitable for the closed track, and discuss the pros and cons of their choices. If possible, they should also provide access to their data for the purpose of facilitating replication by others.
How to participate
Research groups or individual researchers can participate in one or both tracks (for further details, see below), choose the language pairs they wish to work on, and suggest new language pairs for which we will try to provide support. Participation in the shared task is expected to be accompanied by a system description paper (4 to 6 pages plus references). Ideally, this paper describes the participating system in a way that allows replication of the work. As the shared task is supposed to compare as many different systems as possible (including systems based on well-established techniques), the scientific content of the paper need not be novel. Nevertheless, the papers will be peer reviewed and (apart from novelty) the usual quality criteria for research papers will be applied for the papers to be published in the workshop proceedings.
Note that participation in the workshop, although we strongly encourage it, is not mandatory for participating in the shared task and for publication of the system description papers.
Checklist for participants
- Decide on the track you wish to participate in and on your language pairs.
- Express your interest to reinhardrapp (at) gmx (dot) de so that we can inform you about any possible updates, changes, issues etc. Please mention your track and the language pair(s) you are interested in. You may also suggest new language pairs, and we might be able to help you with data.
- Download the corpora from this webpage (WaCKy or Wikipedia, see below).
- Download the training data (bilingual word pairs) for your language pairs from this webpage (see below).
- Run your system on the words on the source side of the training data and compute the translations. Compare your results with the target side of the training data and improve your system if necessary.
- Download the test data on the date specified in the time schedule below.
- Run your system on the test data. Format your output in the same way as you see in the training data.
- Before the deadline specified in the schedule (check for any extensions!), submit your results by e-mail to reinhardrapp (at) gmx (dot) de. Evaluation results will be sent to you after that deadline.
- Write and submit a system description paper.
- Present your paper at the workshop. (If you cannot participate, please let us know in time.) Please see the LREC website for registration information.
Time schedule for shared task
Track 1: Closed Track
l1 → l2 | de | en | es | fr | ru | zh |
---|---|---|---|---|---|---|
de | | DEWaC-UKWaC | deWiki-esWiki | DEWaC-FRWaC | | |
en | UKWaC-DEWaC | | enWiki-esWiki | UKWaC-FRWaC | UKWaC-RUWaC | enWiki-zhWiki |
es | esWiki-deWiki | esWiki-enWiki | | | | |
fr | FRWaC-DEWaC | FRWaC-UKWaC | | | | |
ru | | RUWaC-UKWaC | | | | |
zh | | zhWiki-enWiki | | | | |
The cells in the table show which type of corpus should be used for both languages of a pair when conducting the dictionary induction task. The rationale behind these choices is that the WaCKy (Web-as-a-corpus initiative) corpora seem somewhat better suited for the dictionary induction task than Wikipedia, but they are not available for Chinese and Spanish. Language pairs involving Chinese and Spanish therefore use Wikipedia corpora, whereas other language pairs use WaCKy corpora.
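The selection rule described above can be written down as a small function. This is only an illustrative sketch: the function name `corpus_family` and the constant `WIKIPEDIA_ONLY` are made up for this example and are not part of the shared task materials.

```python
# Languages for which, according to the text above, no WaCKy corpus is available.
# (Hypothetical constant name, used only for this illustration.)
WIKIPEDIA_ONLY = {"es", "zh"}


def corpus_family(l1: str, l2: str) -> str:
    """Return the corpus type to use for both languages of the pair l1 -> l2."""
    if l1 in WIKIPEDIA_ONLY or l2 in WIKIPEDIA_ONLY:
        return "Wikipedia"
    return "WaCKy"
```

For example, `corpus_family("de", "fr")` selects WaCKy, while `corpus_family("en", "zh")` selects Wikipedia, in line with the table.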
The WaCKy corpora can be downloaded from the links below (download can take from less than one minute to hours depending on your connection speed). For convenience, we also provide pre-trained fastText embeddings for these corpora (see below (4) for details):
The WaCKy corpora are cleaned-up web crawls of approximately 2 billion words per language. They are kindly provided by the Web-as-a-corpus initiative (WaCKy). For further information see http://wacky.sslmit.unibo.it/doku.php?id=corpora.
- M. Baroni, S. Bernardini, A. Ferraresi and E. Zanchetta. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation 43(3): 209–226.
The Wikipedia corpora can be downloaded from the links below (download can take from less than one minute to hours depending on your connection speed). For convenience, we also point to pre-trained fastText embeddings for these corpora prepared at Facebook (see below (4) for details):
These corpora are in a one-line-per-document format. The first tab-separated field in each line contains metadata, the second field contains the text. Paragraph boundaries are marked with HTML tags. As cleaning up the original Wikipedia dump files is not trivial, there can occasionally be some noise in the form of not fully cleaned HTML and JavaScript fragments.
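A minimal sketch of reading this format might look as follows. It assumes that the paragraph markers are `<p>` / `</p>` tags, which the description above does not state explicitly; adjust `PARAGRAPH_TAG` if the files use a different tag. The function names are hypothetical.

```python
import re
from typing import Iterator, List, Tuple

# Assumption: paragraph boundaries are marked with <p> or </p> tags.
PARAGRAPH_TAG = re.compile(r"</?p\s*/?>", re.IGNORECASE)


def parse_line(line: str) -> Tuple[str, List[str]]:
    """Split one corpus line into its metadata field and a list of paragraphs."""
    metadata, _, text = line.rstrip("\n").partition("\t")
    paragraphs = [p.strip() for p in PARAGRAPH_TAG.split(text) if p.strip()]
    return metadata, paragraphs


def read_corpus(path: str) -> Iterator[Tuple[str, List[str]]]:
    """Yield (metadata, paragraphs) for every non-empty line (one document per line)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)
```

Because some HTML and JavaScript residue may remain in the text, a real pipeline would probably add a further cleaning step after this split.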
Bilingual word pairs. For checking and improving the performance of your systems, please use the following training data which consists of tab-separated bilingual word pairs:
l1 → l2 | de | en | es | fr | ru | zh |
---|---|---|---|---|---|---|
de | high mid low | high mid low | high mid low | |||
en | high mid low | high mid low | high mid low | high mid low | high mid low | |
es | high mid low | high mid low | high mid low | |||
fr | high mid low | high mid low | high mid low | |||
ru | high mid low | |||||
zh | high mid low |
Rather than providing one large set of word pairs for each language pair, we provide three smaller sets, split by frequency range. Looking at different frequency ranges is of scientific interest, as algorithms typically work best for high-frequency words, whereas the performance at low frequencies is of higher practical relevance.
We split the data into three sets corresponding to frequency ranges of the source language words: the high frequency set provides bilingual word pairs whose source word is among the 5000 most frequent words; the mid frequency set consists of words ranked between 5001 and 20000; and the low frequency set covers ranks 20001 to 50000. (For languages where not enough data is available, we had to reduce the size of the bins.)
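The rank boundaries can be expressed as a simple lookup. The sketch below uses the default boundaries given above (it does not model the reduced bins used for some languages), and the function name `frequency_bin` is hypothetical.

```python
from typing import Optional


def frequency_bin(rank: int) -> Optional[str]:
    """Map a 1-based frequency rank of a source word to "high", "mid" or "low".

    Returns None for ranks beyond 50000, which fall outside all three sets.
    """
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    if rank <= 5000:
        return "high"
    if rank <= 20000:
        return "mid"
    if rank <= 50000:
        return "low"
    return None
```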
Each set is a random sample extracted from the MUSE data kindly provided by Facebook AI Research and comprises 2000 different source language words together with their translations. As in the original MUSE data, the source language words are ordered according to frequency (most frequent first). Taken together, the three sets per language pair give 6000 source language words together with their translations, with each translation listed on a separate line.
- Conneau, Alexis; Lample, Guillaume; Ranzato, Marc Aurelio ; Denoyer, Ludovic ; Jégou, Hervé (2017). Word translation without parallel data. arXiv preprint 1710.04087.
As described in this paper, the MUSE dictionaries, which take the polysemy of words into account, were created using a Facebook-internal translation tool. Although they were generated automatically, they are of high quality, but they still contain a few errors.
Participants of the shared task are encouraged to report such errors to us, so that, as a positive side effect of the shared task, the datasets can be improved.
For testing the systems, lists of source language test words were provided on the day listed in the above time schedule, which are likewise split into three sets of 2000 words each:
l1 → l2 | de | en | es | fr | ru | zh |
---|---|---|---|---|---|---|
de | high mid low | high mid low | high mid low | |||
en | high mid low | high mid low | high mid low | high mid low | high mid low | |
es | high mid low | high mid low | high mid low | |||
fr | high mid low | high mid low | high mid low | |||
ru | high mid low | |||||
zh | high mid low |
If your algorithm for inducing dictionaries from comparable corpora requires a seed lexicon, then please use an arbitrary part of the training data for this purpose. We hope that with its 6000 source language words and (depending on the language pair) roughly twice as many translation pairs, the training set is large enough to provide for your needs. If not, please consider using your own data and participating in Track 2 of the shared task.
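One possible way to set aside part of the training data as a seed lexicon is sketched below. It reads tab-separated word pairs, groups them by source word, and splits by source word so that all translations of a word end up on the same side. The function names, the 50/50 default and the fixed random seed are arbitrary choices made for this illustration, not recommendations by the organizers.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Lexicon = Dict[str, List[str]]


def group_pairs(lines: Iterable[str]) -> Lexicon:
    """Group 'source<TAB>target' lines by source word, keeping the file order."""
    lexicon = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        source, target = line.split("\t", 1)
        lexicon[source].append(target)
    return dict(lexicon)


def split_seed(lexicon: Lexicon, seed_fraction: float = 0.5,
               rng_seed: int = 0) -> Tuple[Lexicon, Lexicon]:
    """Randomly split the source words into a seed lexicon and a held-out part."""
    sources = sorted(lexicon)
    random.Random(rng_seed).shuffle(sources)
    cut = int(len(sources) * seed_fraction)
    seed = {s: lexicon[s] for s in sources[:cut]}
    held_out = {s: lexicon[s] for s in sources[cut:]}
    return seed, held_out
```

The held-out part can then be used to check the system during development, while the seed part feeds the induction algorithm.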
Pre-trained embedding models such as fastText or BERT can be used only if (re)trained on the provided corpora. The following fastText embeddings have been trained on Wikipedia or WaCKy corpora and can be readily used in this track (specific links are provided in the table above (2)):
- Wikipedia: fastText embeddings for the Wikipedia corpora are available from Facebook: https://fasttext.cc/docs/en/pretrained-vectors.html
- WaCKy: pre-trained fastText embeddings are available as follows:
  - The .vec.xz files are text representations, widely used in various tools.
  - The .bin files are the binary versions for use in fastText.
  - The following parameters were used: method: skipgram; minCount: 30; dim: 300; ws (context window): 7; epochs: 10; neg (number of negatives sampled): 10. The other parameters are as defaults for fastText.
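For readers who want to load the .vec.xz files without extra tools, the following sketch uses only the Python standard library. It assumes the usual fastText text layout: a header line with the vocabulary size and the dimension, followed by one line per word containing the word and its vector components separated by spaces. The function name `load_vec_xz` and its `limit` parameter are hypothetical.

```python
import lzma
from typing import Dict, List, Optional


def load_vec_xz(path: str, limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Read an xz-compressed fastText .vec text file into a word -> vector dict.

    `limit` optionally stops after that many words (the files list the most
    frequent words first, so this keeps the most frequent part of the vocabulary).
    """
    vectors: Dict[str, List[float]] = {}
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        # Header line: "<number of words> <dimension>"
        _count, dim = (int(x) for x in handle.readline().split())
        for line in handle:
            if limit is not None and len(vectors) >= limit:
                break
            # Split from the right so that the last `dim` fields are the vector.
            parts = line.rstrip().rsplit(" ", dim)
            if len(parts) != dim + 1:
                continue  # skip malformed lines
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors
```

With 300-dimensional vectors and large vocabularies, plain Python lists use a lot of memory, so passing a `limit` or switching to an array library may be preferable in practice.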
Track 2: Open Track
In this track, participants are free to work on other language pairs, use their own data and, if desired, conduct their own evaluation procedure. However, it would be very helpful if in their papers they described their reasons and motivation for deviating from the procedures of Track 1 and, if possible, provided access to their data.
Please also let us know about your plans in time as we may be able to support you with corpora and datasets.
As this appears to be the first shared task on the topic of dictionary induction from comparable corpora, we cannot draw on previous experiences. Due to this pilot character, in Track 1 we are trying to keep things as clear and unsophisticated as possible. But in Track 2 we encourage you to challenge this simplicity, to freely experiment and to come up with new ideas in the hope that the resulting insights will promote future progress in the field.
Evaluation
For evaluation, Track 1 participants (for Track 2 participants this is optional) are asked to provide their results on the test data sets for the test words in each of the three frequency ranges. It is expected that for each source language word all its major translations are provided (where the definition of “major” is to be inferred from the training data). The shared task organizers compare these translations to the translations found in their (internal) gold standard data, which is structurally similar to the training data. Only identical strings are considered correct, and the performance of the respective system is determined by computing precision, recall, and F1-score, the latter being the official score for system ranking. All data sets are in UTF-8 encoding.
More precisely: the input to the system is a list of source language words, one per line. A system should return, for each input word, one or more candidate translations, in the form of tab-separated word pairs (source word, tab, translation), each on its own line. For instance, in the English-French case, given the following gold standard, test word list, and system output (tab-separated word pairs), the system would get credited for two true positives, one false positive, and two false negatives, hence precision 2/3 ≈ 0.67, recall 2/4 = 0.50, and F1 = 4/7 ≈ 0.57.
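The scoring described above can be sketched as follows. This is not the organizers' evaluation script: it assumes that scores are computed over the set of (source, translation) pairs with exact string matching, and the English-French word pairs are hypothetical data chosen only to reproduce the counts mentioned above (two true positives, one false positive, two false negatives).

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def parse_pairs(lines: Iterable[str]) -> Set[Pair]:
    """Parse 'source<TAB>target' lines into a set of (source, target) pairs."""
    pairs = set()
    for line in lines:
        line = line.rstrip("\n")
        if line:
            source, target = line.split("\t", 1)
            pairs.add((source, target))
    return pairs


def score(gold: Set[Pair], predicted: Set[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1); only identical strings count as correct."""
    tp = len(gold & predicted)   # pairs both proposed and in the gold standard
    fp = len(predicted - gold)   # proposed pairs missing from the gold standard
    fn = len(gold - predicted)   # gold pairs the system did not propose
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example data (not the organizers' example).
GOLD = parse_pairs(["dog\tchien", "cat\tchat", "house\tmaison", "house\tdemeure"])
SYSTEM = parse_pairs(["dog\tchien", "cat\tchat", "house\tlogis"])

if __name__ == "__main__":
    p, r, f1 = score(GOLD, SYSTEM)
    print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because the comparison is on exact strings, differences in casing or Unicode normalization between system output and gold standard would count as errors under this sketch.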
Shared task organizers
- Reinhard Rapp (Magdeburg-Stendal University of Applied Sciences and University of Mainz, Germany), Chair and contact person: reinhardrapp (at) gmx (dot) de
- Pierre Zweigenbaum (Université Paris-Saclay, CNRS, LIMSI, Orsay, France)
- Serge Sharoff (University of Leeds, United Kingdom)
Previous BUCC shared tasks and datasets:
- Identifying parallel sentences in comparable corpora (BUCC 2018, 2017)
- Identifying comparable text (BUCC 2015)