16th Workshop on Building and Using Comparable Corpora

BUCC 2023 SHARED TASK: Bilingual Term Alignment in Comparable Specialized Corpora.

The BUCC 2023 shared task is on multilingual terminology alignment in comparable corpora. Many research groups are working on this problem using a wide variety of approaches. However, as there is no standard way to measure the performance of the systems, the published results are not comparable and the pros and cons of the various approaches are not clear. The shared task aims at solving these problems by organizing a fair comparison of systems. This is accomplished by providing corpora and evaluation datasets for a number of language pairs and domains.

Moreover, the importance of dealing with multi-word expressions in Natural Language Processing applications has been recognized for a long time. In particular, multi-word expressions pose serious challenges for machine translation systems because of their syntactic and semantic properties. Furthermore, multi-word expressions tend to be more frequent in domain-specific text, hence the need to handle them in tasks with specialized-domain corpora.

Through the 2023 BUCC shared task, we seek to evaluate methods that detect pairs of terms that are translations of each other in two comparable corpora, with an emphasis on multi-word terms in specialized domains.

Provided resources

The BUCC shared task provides several datasets of the following form:

A pair of comparable corpora $C_{1}$ and $C_{2}$ in languages $L_{1}$ and $L_{2}$ .
A list of terms $D_{1}$ that occur in $C_{1}$ and a list of terms $D_{2}$ that occur in $C_{2}$ . Term lists may include both single-word and multi-word terms.
For training only, a gold standard dictionary $D_{1, 2}$ in the form of a list of pairs of terms $(t_{1}, t_{2})$ that are translations of each other, with $t_{1}$ in $D_{1}$ and $t_{2}$ in $D_{2}$ .

The task participants may additionally use any external resources, except the CCAligned corpora, from which the task datasets have been extracted. When reporting their results, participants are required to specify which resources they used. They are also encouraged to test conditions in which they only use the provided resources.

Task

Given a test dataset with comparable corpora $C_{1}$ and $C_{2}$, and lists of terms $D_{1}$ and $D_{2}$, participant systems are expected to produce an ordered list of term pairs in $(D_{1}, D_{2})$ that are translations of each other, in descending order of confidence.

Note that $D_{1}$ and $D_{2}$ may have different sizes, that not every term in $D_{1}$ may have a translation in $D_{2}$, that some terms in $D_{1}$ might have multiple translations, and conversely. For practical reasons, we limit the length of a submitted term pair list to a ceiling of 10 times the average length of $D_{1}$ and $D_{2}$. (This can be seen as meaning that, on average, a system may submit up to 10 alignment hypotheses for each term in $D_{1}$ or in $D_{2}$.)

The test datasets will include both the same language pairs as those provided for training and also other language pairs.

Participants can submit up to 5 system runs for each test dataset.
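The length limit on a submitted list can be sketched as follows. This is only an illustration: the function names `max_submission_length` and `truncate_run` are hypothetical, and the integer rounding is an assumption (with a factor of 10 the bound is always a whole number anyway).

```python
def max_submission_length(d1_size: int, d2_size: int, factor: int = 10) -> int:
    """Upper bound on the number of term pairs in one submitted run.

    The bound is `factor` times the average of the two term list sizes.
    Integer division is an assumption; it only matters for odd factors.
    """
    return factor * (d1_size + d2_size) // 2


def truncate_run(ranked_pairs, d1_size: int, d2_size: int):
    """Keep only the most confident pairs that fit under the bound.

    `ranked_pairs` is assumed to be sorted by descending confidence already.
    """
    return ranked_pairs[:max_submission_length(d1_size, d2_size)]


# With 100 source terms and 300 target terms, the average list length is 200,
# so a run may contain at most 2000 term pairs.
print(max_submission_length(100, 300))  # 2000
```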

Evaluation

The evaluation metric will be the Average Precision of the predicted bilingual term pair list, where the relevance of a term pair is determined by its presence in the (hidden) gold standard dictionary $D_{1, 2}$. This models the task as an information retrieval task: retrieve all relevant term pairs $(t_{1}, t_{2})$ (documents) from the cross-product $D_{1} \times D_{2}$ (virtual pool of documents), presenting them in descending order of confidence. Average Precision is the area under the recall $\times$ precision curve. It is computed as the average over all $m$ relevant term pairs $(t_{i}, t_{j})$ (i.e., all term pairs in the gold standard) of the precision value obtained for the set of top $n_{k}$ term pairs existing after each relevant term pair $(t_{i_{k}}, t_{j_{k}})$ is retrieved, from the first to the last relevant term pair. Relevant term pairs that are not retrieved receive a precision of zero, hence decrease Average Precision. Average Precision (AP) is defined as:

$$AP = \frac{1}{m} \sum_{k = 1}^{m} P(R_{k})$$

where $R_{k}$ is the set of ranked predicted term pairs from the top to the position at which $k$ relevant term pairs have been retrieved. Given the gold standard dictionary $D_{1, 2}$, the precision of a set of predicted term pairs $R$ is defined as $P(R) = \frac{| R \cap D_{1, 2} |}{| R |}$. Evaluation code is provided on GitHub.
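For readers who want to check their understanding of the definition, here is a minimal sketch of Average Precision in Python. It is not the official evaluation code; the function name `average_precision` is hypothetical, and skipping duplicate predictions is an assumption that the official scorer may handle differently.

```python
def average_precision(ranked_pairs, gold_pairs):
    """Average Precision of a ranked list of (t1, t2) pairs.

    ranked_pairs: predicted term pairs, most confident first.
    gold_pairs:   iterable of relevant term pairs (the gold dictionary D_{1,2}).
    """
    gold = set(gold_pairs)
    if not gold:
        return 0.0

    seen = set()
    rank = 0            # size of the current prefix R
    hits = 0            # number of relevant pairs in the current prefix
    precision_sum = 0.0

    for pair in ranked_pairs:
        if pair in seen:  # assumption: repeated predictions are ignored
            continue
        seen.add(pair)
        rank += 1
        if pair in gold:
            hits += 1
            # P(R_k): precision of the prefix ending at the k-th relevant pair
            precision_sum += hits / rank

    # Relevant pairs never retrieved contribute zero to the sum,
    # but still count in the denominator m = |gold|.
    return precision_sum / len(gold)


# Toy example with placeholder terms: three gold pairs, two of them retrieved.
gold = {("a", "x"), ("b", "y"), ("c", "z")}
run = [("a", "x"), ("a", "y"), ("b", "y")]
print(round(average_precision(run, gold), 4))  # (1/1 + 2/3 + 0) / 3 = 0.5556
```

Appending further false predictions to the end of `run` leaves the value unchanged, which matches the observation that extra predictions at the bottom of the list cannot decrease Average Precision.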

Helper note: To optimize Average Precision, a system must find all relevant term pairs and put them at the top of the list. Average Precision increases when true predictions (relevant term pairs) are added anywhere in the prediction list. Average Precision also increases when false predictions, if any, are pushed towards the bottom of the list. Note that Average Precision cannot decrease when more predictions, whether true or false, are added to the bottom of the list. Also note that Average Precision is equivalent to Mean Average Precision (MAP) with exactly one query:

Q: find all term pairs in $D_{1} \times D_{2}$ that are translations of each other

The present evaluation therefore does not model the task with a query per source term, but with one global query (Q above) that considers all source terms together. The systems are thus expected to rank all their chosen $(t_{1}, t_{2})$ candidate term pairs in descending order of confidence. To state it another way, a system should not rank term pairs first according to the source term $t_{1}$, then in descending order of confidence.
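The difference between the two orderings can be illustrated with a short sketch. The function names, the placeholder terms, and the confidence scores below are all made up for illustration; how a system obtains its confidence scores is outside the scope of this example.

```python
def rank_globally(scored_pairs):
    """Expected ordering: sort (t1, t2, confidence) triples by confidence only."""
    return sorted(scored_pairs, key=lambda triple: triple[2], reverse=True)


def rank_per_source_term(scored_pairs):
    """Ordering to avoid: group by source term first, then by confidence."""
    return sorted(scored_pairs, key=lambda triple: (triple[0], -triple[2]))


# Placeholder candidates: (source term, target term, confidence).
candidates = [
    ("src_a", "tgt_1", 0.4),
    ("src_a", "tgt_2", 0.9),
    ("src_b", "tgt_3", 0.7),
]

for t1, t2, score in rank_globally(candidates):
    print(t1, t2, score)
# src_a tgt_2 0.9
# src_b tgt_3 0.7
# src_a tgt_1 0.4
```

With the per-source-term ordering, the low-confidence pair `("src_a", "tgt_1")` would be placed above `("src_b", "tgt_3")`, which can lower Average Precision if the low-confidence pair turns out to be wrong.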

File format

All files use UTF-8 encoding, with LF end-of-line markers.

Single-term lists $D_{1}$ and $D_{2}$ contain one term per line.
Corpora $C_{1}$ and $C_{2}$ contain one sentence per line.
The gold standard dictionary $D_{1, 2}$ contains two terms per line, separated by a tab character: $t_{1}$ <TAB> $t_{2}$
The system output submitted by a participant contains two terms per line, separated by a tab character <TAB>. Its lines are sorted in descending order of confidence.
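The file conventions above can be handled with a few lines of standard Python. This is a sketch only: the function names are hypothetical, and skipping blank lines when reading is an assumption about how stray empty lines should be treated.

```python
def read_term_list(path):
    """Read a term list: one term per line, UTF-8, LF line endings."""
    with open(path, encoding="utf-8", newline="\n") as f:
        # Assumption: blank lines carry no term and are skipped.
        return [line.rstrip("\n") for line in f if line.strip()]


def read_gold_dictionary(path):
    """Read a tab-separated file of term pairs: t1 <TAB> t2."""
    pairs = []
    with open(path, encoding="utf-8", newline="\n") as f:
        for line in f:
            if not line.strip():
                continue
            # Raises ValueError if a line does not hold exactly two fields.
            t1, t2 = line.rstrip("\n").split("\t")
            pairs.append((t1, t2))
    return pairs


def write_system_output(path, ranked_pairs):
    """Write term pairs, most confident first, one tab-separated pair per line."""
    # newline="\n" keeps LF line endings on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for t1, t2 in ranked_pairs:
            f.write(f"{t1}\t{t2}\n")
```

Since the system output has the same two-column layout as the gold standard dictionary, `read_gold_dictionary` can also be used to read a submitted run back in.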

Sample data

A small sample dataset is provided in bucc2023_sample.zip for the English-French language pair. It contains:

A pair of comparable corpora $C_{1} =$ src_corpus_sample.txt and $C_{2} =$ tgt_corpus_sample.txt in languages $L_{1} =$ en and $L_{2} =$ fr.
A list of terms $D_{1} =$ src_term_list_sample.txt that occur in $C_{1}$ and a list of terms $D_{2} =$ tgt_term_list_sample.txt that occur in $C_{2}$ . Term lists may include both single-word and multi-word terms.
For training only, a gold standard dictionary $D_{1, 2} =$ gold_dictionary_sample.txt in en $\times$ fr.

Training data

A training dataset is provided in bucc2023_training.zip for the English-French language pair. It contains:

A pair of comparable corpora $C_{1} =$ corpus-en.txt and $C_{2} =$ corpus-fr.txt in languages $L_{1} =$ en and $L_{2} =$ fr.
A list of terms $D_{1} =$ terms-en.txt that occur in $C_{1}$ and a list of terms $D_{2} =$ terms-fr.txt that occur in $C_{2}$ . Term lists may include both single-word and multi-word terms.
For training only, a gold standard dictionary $D_{1, 2} =$ terms-en-fr.txt in en $\times$ fr.

Note that the sizes of $D_{1}$, $D_{2}$ and $D_{1, 2}$, as well as the proportions of terms in $D_{1}$ and $D_{2}$ that have a translation in $D_{1, 2}$, are likely to be different in the test datasets.

Test data

Test datasets (en-fr). A test dataset contains:

A pair of comparable corpora $C_{1} =$ corpus-en.txt and $C_{2} =$ corpus-fr.txt in languages $L_{1} =$ en and $L_{2} =$ fr.
A list of terms $D_{1}$ that occur in $C_{1}$ and a list of terms $D_{2}$ that occur in $C_{2}$ . Term lists may include both single-word and multi-word terms.

Time schedule

Any time	Expression of interest to the shared task contact point (see below). This will allow us to register you on the shared task discussion list and inform you about updates.
6 June 2023	Sample dataset and training data are available for the English-French language pair (en-fr)
18 July 2023	Test data release
21 July 2023	Submission of system runs by participants (up to 5 per dataset) by e-mail to the shared task contact point
31 July 2023	Submission of shared task papers to the BUCC workshop (deadline extended to 31 July)
7 September 2023	Workshop date

Shared task organizers and contact

Pierre Zweigenbaum: (Université Paris-Saclay, CNRS, LISN, Orsay, France)
Serge Sharoff: (University of Leeds, United Kingdom)
Reinhard Rapp: (University of Mainz and Magdeburg-Stendal University of Applied Sciences, Germany)

Shared task contact point: please send expressions of interest to:

pz (at) lisn (dot) fr

Last modified: 22 July 2023