# 15th Workshop on Building and Using Comparable Corpora

## BUCC 2022 Shared Task: Bilingual Term Alignment in Comparable Specialized Corpora

The BUCC 2022 shared task is on multilingual terminology alignment in comparable corpora. Many research groups are working on this problem using a wide variety of approaches. However, as there is no standard way to measure the performance of the systems, the published results are not comparable and the pros and cons of the various approaches are not clear. The shared task aims to solve these problems by organizing a fair comparison of systems. This is accomplished by providing corpora and evaluation datasets for a number of language pairs and domains.
Moreover, the importance of dealing with multi-word expressions in Natural Language Processing applications has been recognized for a long time. In particular, multi-word expressions pose serious challenges for machine translation systems because of their syntactic and semantic properties. Furthermore, multi-word expressions tend to be more frequent in domain-specific text, hence the need to handle them in tasks with specialized-domain corpora.
Through the 2022 BUCC shared task, we seek to evaluate methods that detect pairs of terms that are translations of each other in two comparable corpora, with an emphasis on multi-word terms in specialized domains.

### Provided resources

The BUCC shared task provides several datasets of the following form:
• A pair of comparable corpora ${C}_{1}$ and ${C}_{2}$ in languages ${L}_{1}$ and ${L}_{2}$.
• A list of terms ${D}_{1}$ that occur in ${C}_{1}$ and a list of terms ${D}_{2}$ that occur in ${C}_{2}$. Term lists may include both single-word and multi-word terms.
• For training only, a gold standard dictionary ${D}_{1,2}$ in the form of a list of pairs of terms $\left({t}_{1},{t}_{2}\right)$ that are translations of each other, with ${t}_{1}$ in ${D}_{1}$ and ${t}_{2}$ in ${D}_{2}$.
The task participants may additionally use any external resources, except the CCAligned corpora, from which the task datasets have been extracted. When reporting their results, participants are required to specify which resources they used. They are also encouraged to test conditions in which they only use the provided resources.

Given a test dataset with comparable corpora ${C}_{1}$ and ${C}_{2}$, and lists of terms ${D}_{1}$ and ${D}_{2}$, participant systems are expected to produce a list of term pairs from ${D}_{1}×{D}_{2}$ that are translations of each other, in descending order of confidence.
Note that ${D}_{1}$ and ${D}_{2}$ may have different sizes, that not every term in ${D}_{1}$ need have a translation in ${D}_{2}$, that some terms in ${D}_{1}$ might have multiple translations, and vice versa. For practical reasons, we limit the length of a submitted term pair list to a ceiling of 10 times the average of the lengths of ${D}_{1}$ and ${D}_{2}$. (Equivalently, a system may on average submit up to 10 alignment hypotheses for each term in ${D}_{1}$ or in ${D}_{2}$.)
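The submission-length ceiling stated above can be sketched as a one-line computation (the function name is illustrative, not part of the task specification; we assume the ceiling is simply truncated to an integer):

```python
def max_submission_length(d1_size: int, d2_size: int) -> int:
    """Maximum number of term pairs a system may submit:
    10 times the average of |D1| and |D2|."""
    return int(10 * (d1_size + d2_size) / 2)

# e.g. with |D1| = 1000 and |D2| = 2000, at most 15000 pairs may be submitted
print(max_submission_length(1000, 2000))
```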
The test datasets will include both the same language pairs as those provided for training and also other language pairs.
Participants can submit up to 5 system runs for each test dataset.

### Evaluation

The evaluation metric will be the Average Precision of the predicted bilingual term pair list, where a term pair is relevant if it appears in the (hidden) gold standard dictionary ${D}_{1,2}$. This models the task as an information retrieval task: retrieve all relevant term pairs $\left({t}_{1},{t}_{2}\right)$ (documents) from the cross-product ${D}_{1}×{D}_{2}$ (virtual pool of documents), presenting them in descending order of confidence. Average Precision is the area under the precision-recall curve. It is computed as the average, over all $m$ relevant term pairs (i.e., all term pairs in the gold standard), of the precision obtained over the top-ranked predictions down to the position at which the $k$-th relevant term pair is retrieved, for $k=1$ to $m$. Relevant term pairs that are not retrieved receive a precision of zero and hence decrease Average Precision. Average Precision (AP) is defined as: $AP=\frac{1}{m}{\sum }_{k=1}^{m}P\left({R}_{k}\right)$
where ${R}_{k}$ is the set of ranked predicted term pairs from the top to the position at which $k$ relevant term pairs have been retrieved. Given the gold standard dictionary ${D}_{1,2}$, the precision of a set of predicted term pairs $R$ is defined as $P\left(R\right)=\frac{|R\cap {D}_{1,2}|}{|R|}$.
To optimize Average Precision, a system must find all relevant term pairs and put them at the top of the list. Average Precision increases when true predictions (relevant term pairs) are added anywhere in the prediction list. Average Precision also increases when false predictions, if any, are pushed towards the bottom of the list. Note that Average Precision cannot decrease when more predictions, whether true or false, are added to the bottom of the list. Also note that Average Precision is equivalent to Mean Average Precision (MAP) with exactly one query (find all term pairs in ${D}_{1}×{D}_{2}$ that are translations of each other).
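The definitions above can be sketched in a few lines of Python. The function name and the example term pairs are illustrative only; this is a minimal reading of the metric as defined here, not an official scorer:

```python
def average_precision(predictions, gold):
    """Average Precision of a ranked list of term pairs.

    predictions: list of (t1, t2) tuples in descending order of confidence.
    gold: set of (t1, t2) tuples, i.e. the hidden dictionary D_{1,2}.
    Relevant pairs that are never retrieved contribute a precision of zero,
    because the sum is divided by the full gold-standard size m.
    """
    hits = 0
    precision_sum = 0.0
    for rank, pair in enumerate(predictions, start=1):
        if pair in gold:
            hits += 1
            precision_sum += hits / rank  # P(R_k) at the k-th relevant pair
    return precision_sum / len(gold) if gold else 0.0

# Illustration with made-up en-fr pairs:
gold = {("cell", "cellule"), ("heart rate", "fréquence cardiaque")}
preds = [("cell", "cellule"),                 # relevant, P(R_1) = 1/1
         ("cell", "noyau"),                   # not relevant
         ("heart rate", "fréquence cardiaque")]  # relevant, P(R_2) = 2/3
print(average_precision(preds, gold))  # (1/1 + 2/3) / 2 = 0.8333...
```

Note how appending further false predictions at the bottom of the list leaves the result unchanged, as stated above.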

### File format

All files use UTF-8 encoding, with LF end-of-line markers.
• Term lists ${D}_{1}$ and ${D}_{2}$ contain one term per line.
• Corpora ${C}_{1}$ and ${C}_{2}$ contain one sentence per line.
• The gold standard dictionary ${D}_{1,2}$ contains two terms per line, separated by a tab character: ${t}_{1}$<TAB>${t}_{2}$
• The system output submitted by a participant contains two terms per line, separated by a tab character: ${t}_{1}$<TAB>${t}_{2}$. Its lines are sorted in descending order of confidence.
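A submission file in this format can be produced as follows (the file name and the term pairs are illustrative; `newline="\n"` keeps LF line endings on all platforms):

```python
# Hypothetical ranked predictions, best hypothesis first.
ranked_pairs = [
    ("heart rate", "fréquence cardiaque"),
    ("cell", "cellule"),
]

# UTF-8 encoding, LF end-of-line markers, t1<TAB>t2 per line.
with open("submission.tsv", "w", encoding="utf-8", newline="\n") as f:
    for t1, t2 in ranked_pairs:
        f.write(f"{t1}\t{t2}\n")
```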

### Sample data

A small sample dataset is provided in bucc2022_sample.zip for the English-French language pair. It contains:
• A pair of comparable corpora ${C}_{1}=$src_corpus_sample.txt and ${C}_{2}=$tgt_corpus_sample.txt in languages ${L}_{1}=$en and ${L}_{2}=$fr.
• A list of terms ${D}_{1}=$src_term_list_sample.txt that occur in ${C}_{1}$ and a list of terms ${D}_{2}=$tgt_term_list_sample.txt that occur in ${C}_{2}$. Term lists may include both single-word and multi-word terms.
• For training only, a gold standard dictionary ${D}_{1,2}=$gold_dictionary_sample.txt in en$×$fr.

### Training data

A training dataset is provided in bucc2022_training.zip for the English-French language pair. It contains:
• A pair of comparable corpora ${C}_{1}=$corpus-en.txt and ${C}_{2}=$corpus-fr.txt in languages ${L}_{1}=$en and ${L}_{2}=$fr.
• A list of terms ${D}_{1}=$terms-en.txt that occur in ${C}_{1}$ and a list of terms ${D}_{2}=$terms-fr.txt that occur in ${C}_{2}$. Term lists may include both single-word and multi-word terms.
• For training only, a gold standard dictionary ${D}_{1,2}=$terms-en-fr.txt in en$×$fr.
Note that the sizes of ${D}_{1}$, ${D}_{2}$ and ${D}_{1,2}$, as well as the proportions of terms in ${D}_{1}$ or ${D}_{2}$ that have a translation in ${D}_{1,2}$, are likely to be different in the test datasets.

### Test data

Test datasets are provided for three language pairs (en-fr: bucc2022_test_enfr_nogold.zip; en-de; en-ru). Each test dataset contains (file names shown for en-fr):
• A pair of comparable corpora ${C}_{1}=$corpus-en.txt and ${C}_{2}=$corpus-fr.txt in languages ${L}_{1}=$en and ${L}_{2}=$fr.
• A list of terms ${D}_{1}=$terms-en.txt that occur in ${C}_{1}$ and a list of terms ${D}_{2}=$terms-fr.txt that occur in ${C}_{2}$. Term lists may include both single-word and multi-word terms.

### Time schedule

| Date | Milestone |
| --- | --- |
| Any time | Expression of interest sent to all three contact points of the shared task. This will allow us to register you on the shared task discussion list and inform you about updates. |
| 19 January 2022 | Sample dataset release |
| 13 February 2022 | Training data release (en-fr) |
| 19 March 2022 | Test data release (1: en-fr) |
| 26 March 2022 | Submission of system runs by participants (up to 5 per dataset) by e-mail to all three contact points of the shared task |
| 30 March 2022 | Evaluation results sent to participants |
| 10 April 2022 | Submission of shared task papers to the BUCC workshop |
| 25 June 2022 | Workshop date |

### Shared task organizers and contact

Omar Adjali
(Université Paris-Saclay, CNRS, LISN, Orsay, France)
Emmanuel Morin
(Nantes Université, LS2N, Nantes, France)
Serge Sharoff
(University of Leeds, United Kingdom)
Reinhard Rapp
(Athena R.C., Greece; Magdeburg-Stendal University of Applied Sciences and University of Mainz, Germany)
Pierre Zweigenbaum
(Université Paris-Saclay, CNRS, LISN, Orsay, France)
Shared task contact points: please send expressions of interest to:
• omar (dot) adjali (at) universite-paris-saclay (dot) fr
• CC emmanuel (dot) morin (at) ls2n (dot) fr
• CC pz (at) lisn (dot) fr