16th Workshop on Building and Using Comparable Corpora

BUCC 2023 SHARED TASK: Bilingual Term Alignment in Comparable Specialized Corpora.

The BUCC 2023 shared task is on multilingual terminology alignment in comparable corpora. Many research groups are working on this problem using a wide variety of approaches. However, as there is no standard way to measure the performance of the systems, the published results are not comparable and the pros and cons of the various approaches are not clear. The shared task aims at solving these problems by organizing a fair comparison of systems. This is accomplished by providing corpora and evaluation datasets for a number of language pairs and domains.
Moreover, the importance of dealing with multi-word expressions in Natural Language Processing applications has been recognized for a long time. In particular, multi-word expressions pose serious challenges for machine translation systems because of their syntactic and semantic properties. Furthermore, multi-word expressions tend to be more frequent in domain-specific text, hence the need to handle them in tasks with specialized-domain corpora.
Through the 2023 BUCC shared task, we seek to evaluate methods that detect pairs of terms that are translations of each other in two comparable corpora, with an emphasis on multi-word terms in specialized domains.

Provided resources

The BUCC shared task provides several datasets of the following form:
The task participants may additionally use any external resources, except the CCAligned corpora, from which the task datasets have been extracted. When reporting their results, participants are required to specify which resources they used. They are also encouraged to test conditions in which they only use the provided resources.


Given a test dataset with comparable corpora $C_1$ and $C_2$, and lists of terms $D_1$ and $D_2$, participant systems are expected to produce an ordered list of term pairs in $D_1 \times D_2$ that are translations of each other, in descending order of confidence.
Note that $D_1$ and $D_2$ may have different sizes, that some terms in $D_1$ may have no translation in $D_2$, that some terms in $D_1$ may have multiple translations, and conversely. For practical reasons, we limit the length of a submitted term pair list to 10 times the average length of $D_1$ and $D_2$. (In other words, on average, a system may submit up to 10 alignment hypotheses for each term in $D_1$ or in $D_2$.)
The test datasets will include both the same language pairs as those provided for training and also other language pairs.
Participants can submit up to 5 system runs for each test dataset.
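The length ceiling described above can be sketched in Python as follows. This is only an illustration: the function names `max_submission_length` and `truncate_run` are hypothetical, term lists are assumed to be plain Python lists, and the ranked list is assumed to be already sorted in descending order of confidence.

```python
def max_submission_length(d1_terms, d2_terms, factor=10):
    """Ceiling on the number of submitted pairs: factor * mean(|D1|, |D2|).

    With the factor of 10 stated in the task description the result is
    always an integer, since 10 * (a + b) / 2 == 5 * (a + b).
    Rounding down for other factors is an assumption of this sketch.
    """
    return factor * (len(d1_terms) + len(d2_terms)) // 2


def truncate_run(ranked_pairs, d1_terms, d2_terms, factor=10):
    """Keep only the top-ranked pairs that fit under the ceiling."""
    limit = max_submission_length(d1_terms, d2_terms, factor)
    return ranked_pairs[:limit]
```

For instance, with 3 terms in $D_1$ and 5 terms in $D_2$, the average length is 4, so a run may contain at most 40 term pairs.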


The evaluation metric will be the Average Precision of the predicted bilingual term pair list, where the relevance of a term pair is determined by its presence in the (hidden) gold standard dictionary $D_{1,2}$. This models the task as an information retrieval task: retrieve all relevant term pairs $(t_1, t_2)$ (documents) from the cross-product $D_1 \times D_2$ (virtual pool of documents), presenting them in descending order of confidence. Average Precision is the area under the recall × precision curve. It is computed as the average, over all $m$ relevant term pairs $(t_i, t_j)$ (i.e., all term pairs in the gold standard), of the precision value obtained for the set of top $n_k$ term pairs existing after each relevant term pair $(t_{i_k}, t_{j_k})$ is retrieved, from the first to the last relevant term pair. Relevant term pairs that are not retrieved receive a precision of zero, hence decrease Average Precision. Average Precision (AP) is defined as:
$$AP = \frac{1}{m} \sum_{k=1}^{m} P(R_k)$$
where $R_k$ is the set of ranked predicted term pairs from the top to the position at which $k$ relevant term pairs have been retrieved. Given the gold standard dictionary $D_{1,2}$, the precision of a set of predicted term pairs $R$ is defined as $P(R) = \frac{|R \cap D_{1,2}|}{|R|}$. Evaluation code is provided on GitHub.
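A minimal Python sketch of this definition is given below. It is not the official evaluation code: the function name `average_precision`, the representation of term pairs as tuples, and the choice to skip duplicate predictions are assumptions made for illustration.

```python
def average_precision(ranked_pairs, gold_pairs):
    """Average Precision of a ranked list of term pairs.

    ranked_pairs: predicted (source_term, target_term) tuples,
                  in descending order of confidence.
    gold_pairs:   iterable of relevant pairs (the gold standard dictionary).
    """
    gold = set(gold_pairs)
    if not gold:
        return 0.0
    seen = set()
    hits = 0             # relevant pairs retrieved so far (k)
    rank = 0             # predictions examined so far (|R_k| at a hit)
    precision_sum = 0.0  # running sum of P(R_k)
    for pair in ranked_pairs:
        if pair in seen:  # assumption: repeated predictions are ignored
            continue
        seen.add(pair)
        rank += 1
        if pair in gold:
            hits += 1
            precision_sum += hits / rank
    # Dividing by m = |gold| gives unretrieved relevant pairs a precision of 0.
    return precision_sum / len(gold)


gold = {("a", "x"), ("b", "y"), ("c", "z")}
ranked = [("a", "x"), ("a", "y"), ("b", "y")]
ap = average_precision(ranked, gold)
print(round(ap, 4))  # (1/1 + 2/3 + 0) / 3 = 0.5556
```

In this toy example, two of the three gold pairs are retrieved, at ranks 1 and 3, and the missing pair contributes a precision of zero.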
Helper note: To optimize Average Precision, a system must find all relevant term pairs and put them at the top of the list. Average Precision increases when true predictions (relevant term pairs) are added anywhere in the prediction list. Average Precision also increases when false predictions, if any, are pushed towards the bottom of the list. Note that Average Precision cannot decrease when more predictions, whether true or false, are added to the bottom of the list. Also note that Average Precision is equivalent to Mean Average Precision (MAP) with exactly one query:
Q: find all term pairs in $D_1 \times D_2$ that are translations of each other
The present evaluation therefore does not model the task with a query per source term, but with one global query (Q above) that considers all source terms together. The systems are thus expected to rank all their chosen $(t_1, t_2)$ candidate term pairs in descending order of confidence. To state it another way, a system should not rank term pairs first according to the source term $t_1$, then in descending order of confidence.
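The effect of the ranking strategy can be illustrated with a small Python sketch. The candidate pairs, confidence scores, and gold standard below are invented toy data, and `average_precision` is a compact hypothetical helper that assumes the prediction list contains no duplicates.

```python
def average_precision(ranked_pairs, gold_pairs):
    """Compact Average Precision; assumes no duplicate predictions."""
    hits, precision_sum = 0, 0.0
    for rank, pair in enumerate(ranked_pairs, start=1):
        if pair in gold_pairs:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold_pairs)


# Toy candidates: (source term, target term, confidence).
candidates = [
    ("a", "x", 0.9),
    ("a", "y", 0.2),
    ("b", "z", 0.8),
    ("c", "w", 0.7),
]
gold = {("a", "x"), ("b", "z"), ("c", "w")}

# Expected ordering: one global list in descending order of confidence.
global_order = [(s, t) for s, t, _ in sorted(candidates, key=lambda c: -c[2])]

# Discouraged ordering: grouped by source term, then by confidence.
per_source_order = [
    (s, t) for s, t, _ in sorted(candidates, key=lambda c: (c[0], -c[2]))
]

ap_global = average_precision(global_order, gold)
ap_per_source = average_precision(per_source_order, gold)
ap_with_extra = average_precision(global_order + [("b", "q")], gold)
```

Here the global ordering places the low-confidence pair `("a", "y")` last and reaches an Average Precision of 1.0, whereas grouping by source term places it at rank 2 and lowers the precision of every relevant pair ranked after it. Appending a false prediction at the bottom of the global list leaves the score unchanged.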

File format

All files use UTF-8 encoding, with LF end-of-line markers.
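As an illustration of these two constraints, the following Python sketch reads and writes text files with explicit UTF-8 encoding and LF line endings. The helper names `write_lines` and `read_lines` are hypothetical, and the sketch deliberately makes no assumption about what each line contains.

```python
def write_lines(path, lines):
    """Write one item per line, UTF-8 encoded, with LF line endings.

    newline="\n" prevents Python from translating "\n" into the
    platform default (for example CRLF on Windows).
    """
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for line in lines:
            handle.write(line + "\n")


def read_lines(path):
    """Read a UTF-8 file, splitting on LF only."""
    with open(path, "r", encoding="utf-8", newline="\n") as handle:
        return [line.rstrip("\n") for line in handle]
```

Setting `encoding` and `newline` explicitly keeps the output identical across operating systems, regardless of the local default encoding or line-ending convention.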

Sample data

A small sample dataset is provided in bucc2023_sample.zip for the English-French language pair. It contains:

Training data

A training dataset is provided in bucc2023_training.zip for the English-French language pair. It contains:
Note that the sizes of $D_1$, $D_2$ and $D_{1,2}$, as well as the proportions of terms in $D_1$ or $D_2$ that have a translation in $D_{1,2}$, are likely to be different in the test datasets.

Test data

Test datasets (en-fr). A test dataset contains:

Time schedule

Any time
Expression of interest to the shared task contact point (see below). This will allow us to register you on the shared task discussion list and inform you about updates.
6 June 2023
Sample dataset and training data are available for the English-French language pair (en-fr)
18 July 2023
Test data release
21 July 2023
Submission of system runs by participants (up to 5 per dataset) by e-mail to the shared task contact point
31 July 2023
Submission of shared task papers to the BUCC workshop: extended to July 31st
7 September 2023
Workshop date

Shared task organizers and contact

Pierre Zweigenbaum
(Université Paris-Saclay, CNRS, LISN, Orsay, France)
Serge Sharoff
(University of Leeds, United Kingdom)
Reinhard Rapp
(University of Mainz and Magdeburg-Stendal University of Applied Sciences, Germany)
Shared task contact point: please send expressions of interest to:
Last modified: 22 July 2023