15th Workshop on Building and Using Comparable Corpora

BUCC 2022 SHARED TASK: bilingual term alignment in comparable specialized corpora.

The BUCC 2022 shared task is on multilingual terminology alignment in comparable corpora. Many research groups are working on this problem using a wide variety of approaches. However, as there is no standard way to measure the performance of the systems, the published results are not comparable and the pros and cons of the various approaches are not clear. The shared task aims at solving these problems by organizing a fair comparison of systems. This is accomplished by providing corpora and evaluation datasets for a number of language pairs and domains.

Moreover, the importance of dealing with multi-word expressions in Natural Language Processing applications has been recognized for a long time. In particular, multi-word expressions pose serious challenges for machine translation systems because of their syntactic and semantic properties. Furthermore, multi-word expressions tend to be more frequent in domain-specific text, hence the need to handle them in tasks with specialized-domain corpora.

Through the 2022 BUCC shared task, we seek to evaluate methods that detect pairs of terms that are translations of each other in two comparable corpora, with an emphasis on multi-word terms in specialized domains.

Provided resources

The BUCC shared task provides several datasets of the following form:

A pair of comparable corpora $C_{1}$ and $C_{2}$ in languages $L_{1}$ and $L_{2}$.
A list of terms $D_{1}$ that occur in $C_{1}$ and a list of terms $D_{2}$ that occur in $C_{2}$. Term lists may include both single-word and multi-word terms.
For training only, a gold standard dictionary $D_{1, 2}$ in the form of a list of pairs of terms $(t_{1}, t_{2})$ that are translations of each other, with $t_{1}$ in $D_{1}$ and $t_{2}$ in $D_{2}$.

The task participants may additionally use any external resources, except the CCAligned corpora, from which the task datasets have been extracted. When reporting their results, participants are required to specify which resources they used. They are also encouraged to test conditions in which they only use the provided resources.

Task

Given a test dataset with comparable corpora $C_{1}$ and $C_{2}$, and lists of terms $D_{1}$ and $D_{2}$, participant systems are expected to produce an ordered list of term pairs in $(D_{1}, D_{2})$ that are translations of each other, in descending order of confidence.

Note that $D_{1}$ and $D_{2}$ may have different sizes, that not every term in $D_{1}$ may have a translation in $D_{2}$, that some terms in $D_{1}$ might have multiple translations, and conversely. For practical reasons, we limit the length of a submitted term pair list to a ceiling of 10 times the average length of $D_{1}$ and $D_{2}$. (This can be seen as meaning that, on average, a system may submit up to 10 alignment hypotheses for each term in $D_{1}$ or in $D_{2}$.)

The test datasets will include both the same language pairs as those provided for training and also other language pairs.

Participants can submit up to 5 system runs for each test dataset.
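
As an illustration of the length limit described above, the following minimal Python sketch computes the ceiling for one test dataset and truncates a ranked run to it. The function names are hypothetical, and rounding the ceiling down to an integer is an assumption, since the task description does not say how a fractional limit is handled.

```python
def max_submission_length(n_terms_1: int, n_terms_2: int, factor: int = 10) -> int:
    """Ceiling on the number of submitted term pairs.

    Implements "10 times the average length of D1 and D2".
    Rounding down is an assumption, not part of the task description.
    """
    return (factor * (n_terms_1 + n_terms_2)) // 2


def truncate_run(ranked_pairs, n_terms_1: int, n_terms_2: int, factor: int = 10):
    """Keep only the most confident pairs, up to the ceiling.

    `ranked_pairs` is assumed to be already sorted by decreasing confidence.
    """
    limit = max_submission_length(n_terms_1, n_terms_2, factor)
    return list(ranked_pairs[:limit])
```

For instance, with 100 terms in $D_{1}$ and 50 terms in $D_{2}$, the average length is 75, so a run may contain at most 750 term pairs.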

Evaluation

The evaluation metric will be the Average Precision of the predicted bilingual term pair list, where the relevance of a term pair is determined by its presence in the (hidden) gold standard dictionary $D_{1, 2}$. This models the task as an information retrieval task: retrieve all relevant term pairs $(t_{1}, t_{2})$ (documents) from the cross-product $D_{1} \times D_{2}$ (virtual pool of documents), presenting them in descending order of confidence. Average Precision is the area under the recall $\times$ precision curve. It is computed as the average over all $m$ relevant term pairs $(t_{i}, t_{j})$ (i.e., all term pairs in the gold standard) of the precision value obtained for the set of top $n_{k}$ term pairs existing after each relevant term pair $(t_{i_{k}}, t_{j_{k}})$ is retrieved, from the first to the last relevant term pair. Relevant term pairs that are not retrieved receive a precision of zero, hence decrease Average Precision. Average Precision (AP) is defined as:

$$AP = \frac{1}{m} \sum_{k = 1}^{m} P(R_{k})$$

where $R_{k}$ is the set of ranked predicted term pairs from the top to the position at which $k$ relevant term pairs have been retrieved. Given the gold standard dictionary $D_{1, 2}$, the precision of a set of predicted term pairs $R$ is defined as

$$P(R) = \frac{| R \cap D_{1, 2} |}{| R |}$$

To optimize Average Precision, a system must find all relevant term pairs and put them at the top of the list. Average Precision increases when true predictions (relevant term pairs) are added anywhere in the prediction list. Average Precision also increases when false predictions, if any, are pushed towards the bottom of the list. Note that Average Precision cannot decrease when more predictions, whether true or false, are added to the bottom of the list. Also note that Average Precision is equivalent to Mean Average Precision (MAP) with exactly one query (find all term pairs in $D_{1} \times D_{2}$ that are translations of each other).
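
The definition of Average Precision above can be sketched in a few lines of Python. This is an illustrative sketch, not the organizers' evaluation script: the function name is hypothetical, the term pairs in the example are made up, and skipping repeated predictions is an assumption about how duplicates would be treated.

```python
def average_precision(ranked_pairs, gold_pairs):
    """Average Precision of a ranked list of (t1, t2) pairs against a gold dictionary.

    Sums the precision P(R_k) at the rank where the k-th relevant pair is
    retrieved, then divides by m, the total number of gold pairs, so that
    relevant pairs never retrieved contribute a precision of zero.
    """
    gold = set(gold_pairs)
    if not gold:
        return 0.0
    seen = set()
    hits = 0             # relevant pairs retrieved so far (k)
    rank = 0             # size of the ranked prefix R_k (n_k)
    precision_sum = 0.0
    for pair in ranked_pairs:
        if pair in seen:  # assumption: a repeated prediction is ignored
            continue
        seen.add(pair)
        rank += 1
        if pair in gold:
            hits += 1
            precision_sum += hits / rank  # P(R_k) = |R_k ∩ D_{1,2}| / |R_k|
    return precision_sum / len(gold)


# Made-up example: 3 gold pairs, 2 of them retrieved at ranks 1 and 3.
gold = [("a", "x"), ("b", "y"), ("c", "z")]
run = [("a", "x"), ("a", "y"), ("b", "y")]
print(round(average_precision(run, gold), 4))  # (1/1 + 2/3 + 0) / 3 → 0.5556
```

In this example the third gold pair is never retrieved, so it adds zero to the sum while still counting in $m = 3$. Appending further false predictions to the end of `run` leaves the value unchanged, in line with the remark that Average Precision cannot decrease when predictions are added at the bottom of the list.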

File format

All files use UTF-8 encoding, with LF end-of-line markers.

Single-term lists $D_{1}$ and $D_{2}$ contain one term per line.
Corpora $C_{1}$ and $C_{2}$ contain one sentence per line.
The gold standard dictionary $D_{1, 2}$ contains one term pair per line, the two terms separated by a tab character: $t_{1}$ <TAB> $t_{2}$
The system output submitted by a participant contains one term pair per line, the two terms separated by a tab character (<TAB>). Its lines are sorted by decreasing confidence.
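
The file conventions above (UTF-8, LF line endings, one item per line, tab-separated pairs) could be handled in Python along the following lines. This is a minimal sketch with hypothetical helper names. Skipping empty lines and rejecting lines that do not contain exactly one tab are assumptions, not requirements stated by the task.

```python
def read_terms(path):
    """Read a term list (D1 or D2): one term per line, UTF-8, LF line endings."""
    with open(path, encoding="utf-8", newline="\n") as f:
        # Assumption: empty lines are not terms and are skipped.
        return [line.rstrip("\n") for line in f if line.strip()]


def read_pairs(path):
    """Read a gold dictionary or a system run: 't1<TAB>t2' on each line."""
    pairs = []
    with open(path, encoding="utf-8", newline="\n") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Raises ValueError if the line does not hold exactly two fields.
            t1, t2 = line.split("\t")
            pairs.append((t1, t2))
    return pairs


def write_run(path, ranked_pairs):
    """Write a system run; `ranked_pairs` must already be sorted by decreasing confidence."""
    # newline="\n" keeps LF line endings on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for t1, t2 in ranked_pairs:
            f.write(f"{t1}\t{t2}\n")
```

Multi-word terms contain spaces, which is why the tab is the only field separator used here. Splitting on arbitrary whitespace would break such terms apart.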

Sample data

A small sample dataset is provided in bucc2022_sample.zip for the English-French language pair. It contains:

A pair of comparable corpora $C_{1} =$ src_corpus_sample.txt and $C_{2} =$ tgt_corpus_sample.txt in languages $L_{1} =$ en and $L_{2} =$ fr.
A list of terms $D_{1} =$ src_term_list_sample.txt that occur in $C_{1}$ and a list of terms $D_{2} =$ tgt_term_list_sample.txt that occur in $C_{2}$. Term lists may include both single-word and multi-word terms.
For training only, a gold standard dictionary $D_{1, 2} =$ gold_dictionary_sample.txt in en $\times$ fr.

Training data

A training dataset is provided in bucc2022_training.zip for the English-French language pair. It contains:

A pair of comparable corpora $C_{1} =$ corpus-en.txt and $C_{2} =$ corpus-fr.txt in languages $L_{1} =$ en and $L_{2} =$ fr.
A list of terms $D_{1} =$ terms-en.txt that occur in $C_{1}$ and a list of terms $D_{2} =$ terms-fr.txt that occur in $C_{2}$. Term lists may include both single-word and multi-word terms.
For training only, a gold standard dictionary $D_{1, 2} =$ terms-en-fr.txt in en $\times$ fr.

Note that the sizes of $D_{1}$, $D_{2}$ and $D_{1, 2}$, as well as the proportions of terms in $D_{1}$ and $D_{2}$ that have a translation in $D_{1, 2}$, are likely to be different in the test datasets.

Test data

Test datasets (en-fr: bucc2022_test_enfr_nogold.zip; en-de; en-ru). A test dataset contains:

A pair of comparable corpora $C_{1} =$ corpus-en.txt and $C_{2} =$ corpus-fr.txt in languages $L_{1} =$ en and $L_{2} =$ fr.
A list of terms $D_{1} =$ terms-en.txt that occur in $C_{1}$ and a list of terms $D_{2} =$ terms-fr.txt that occur in $C_{2}$. Term lists may include both single-word and multi-word terms.

Time schedule

Any time	Expression of interest to all three contact points of the shared task. This will allow us to register you on the shared task discussion list and inform you about updates.
19 January 2022	Sample dataset release
13 February 2022	Training data release (en-fr)
19 March 2022	Test data release (1: en-fr)
26 March 2022	Submission of system runs by participants (up to 5 per dataset) by e-mail to all three contact points of the shared task
30 March 2022	Evaluation sent to participants
10 April 2022	Submission of shared task papers to the BUCC workshop
25 June 2022	Workshop date

Shared task organizers and contact

Omar Adjali (Université Paris-Saclay, CNRS, LISN, Orsay, France)
Emmanuel Morin (Nantes Université, LS2N, Nantes, France)
Serge Sharoff (University of Leeds, United Kingdom)
Reinhard Rapp (Athena R.C., Greece; Magdeburg-Stendal University of Applied Sciences and University of Mainz, Germany)
Pierre Zweigenbaum (Université Paris-Saclay, CNRS, LISN, Orsay, France)

Shared task contact points: please send expressions of interest to:

omar (dot) adjali (at) universite-paris-saclay (dot) fr
CC emmanuel (dot) morin (at) ls2n (dot) fr
CC pz (at) lisn (dot) fr

Last modified: 19 March 2022