16th Workshop on Building and Using Comparable Corpora
BUCC 2023 SHARED TASK: Bilingual Term Alignment in Comparable Specialized Corpora.
The BUCC 2023 shared task is on multilingual terminology alignment in comparable corpora. Many research groups are working on this problem using a wide variety of approaches. However, as there is no standard way to measure the performance of the systems, the published results are not comparable and the pros and cons of the various approaches are not clear. The shared task aims at solving these problems by organizing a fair comparison of systems. This is accomplished by providing corpora and evaluation datasets for a number of language pairs and domains.
Moreover, the importance of dealing with multi-word expressions in Natural Language Processing applications has been recognized for a long time. In particular, multi-word expressions pose serious challenges for machine translation systems because of their syntactic and semantic properties.
Furthermore, multi-word expressions tend to be more frequent in domain-specific text, hence the need to handle them in tasks with specialized-domain corpora.
Through the 2023 BUCC shared task, we seek to evaluate methods that detect pairs of terms that are translations of each other in two comparable corpora, with an emphasis on multi-word terms in specialized domains.
Provided resources
The BUCC shared task provides several datasets of the following form:
- A pair of comparable corpora $C_1$ and $C_2$ in languages $L_1$ and $L_2$.
- A list of terms $D_1$ that occur in $C_1$ and a list of terms $D_2$ that occur in $C_2$. Term lists may include both single-word and multi-word terms.
- For training only, a gold standard dictionary $D_{1,2}$ in the form of a list of pairs of terms $(t_1, t_2)$ that are translations of each other, with $t_1$ in $D_1$ and $t_2$ in $D_2$.
The task participants may additionally use any external resources, except the CCAligned corpora, from which the task datasets have been extracted. When reporting their results, participants are required to specify which resources they used. They are also encouraged to test conditions in which they only use the provided resources.
Task
Given a test dataset with comparable corpora $C_1$ and $C_2$, and lists of terms $D_1$ and $D_2$, participant systems are expected to produce an ordered list of term pairs in $D_1 \times D_2$ that are translations of each other, in descending order of confidence.
Note that the term lists $D_1$ and $D_2$ may have different sizes, that not every term in $D_1$ may have a translation in $D_2$, that some terms in $D_1$ might have multiple translations, and conversely. For practical reasons, we limit the length of a submitted term pair list to a ceiling of 10 times the average length of $D_1$ and $D_2$. (This can be seen as meaning that, on average, a system may submit up to 10 alignment hypotheses for each term in $D_1$ or in $D_2$.)
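The length ceiling can be illustrated with a short Python sketch. The function names below are hypothetical and are not part of the provided resources; the sketch only assumes that the ranked list is already sorted in descending order of confidence and that the ceiling is 10 times the average size of the source and target term lists.

```python
def max_submission_length(num_src_terms, num_tgt_terms, factor=10):
    """Ceiling on the number of submitted pairs.

    factor times the average of the two term list sizes, i.e.
    factor * (num_src_terms + num_tgt_terms) / 2. With factor=10
    this is always a whole number.
    """
    return factor * (num_src_terms + num_tgt_terms) // 2


def cap_ranked_pairs(ranked_pairs, num_src_terms, num_tgt_terms):
    """Keep only the highest-confidence pairs that fit under the ceiling.

    ranked_pairs must already be in descending order of confidence.
    """
    limit = max_submission_length(num_src_terms, num_tgt_terms)
    return ranked_pairs[:limit]
```

For instance, with 3 source terms and 5 target terms the average list length is 4, so at most 40 term pairs would be kept.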
The test datasets will include both the same language pairs as those provided for training and also other language pairs.
Participants can submit up to 5 system runs for each test dataset.
Evaluation
The evaluation metric will be the Average Precision of the predicted bilingual term pair list, where the relevance of a term pair is determined by its presence in the (hidden) gold standard dictionary $D_{1,2}$. This models the task as an information retrieval task: retrieve all relevant term pairs (documents) from the cross-product $D_1 \times D_2$ of the two term lists (virtual pool of documents), presenting them in descending order of confidence. Average Precision is the area under the precision-recall curve. It is computed as the average over all relevant term pairs (i.e., all term pairs in the gold standard) of the precision value obtained for the set of top term pairs existing after each relevant term pair is retrieved, from the first to the last relevant term pair. Relevant term pairs that are not retrieved receive a precision of zero, hence decrease Average Precision. Average Precision (AP) is defined as:
$$\mathrm{AP}(P, D_{1,2}) = \frac{1}{|D_{1,2}|} \sum_{k=1}^{|D_{1,2}|} \mathrm{prec}(P_k, D_{1,2})$$
where $P_k$ is the set of ranked predicted term pairs from the top to the position at which $k$ relevant term pairs have been retrieved, and the term for $k$ is zero if fewer than $k$ relevant term pairs are retrieved. Given the gold standard dictionary $D_{1,2}$, the precision of a set $S$ of predicted term pairs is defined as $\mathrm{prec}(S, D_{1,2}) = |S \cap D_{1,2}| / |S|$. Evaluation code is provided on github.
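The definition above can be sketched in Python as follows. This is an illustrative sketch, not the official evaluation code provided on github; the function name is hypothetical, and the choice to ignore repeated predictions of the same pair is an assumption made here.

```python
def average_precision(ranked_pairs, gold_pairs):
    """Average Precision of a ranked list of (source, target) term pairs.

    ranked_pairs: predictions in descending order of confidence.
    gold_pairs: iterable of relevant (source, target) pairs.
    """
    gold = set(gold_pairs)
    if not gold:
        raise ValueError("the gold standard must contain at least one pair")

    seen = set()          # assumption: a repeated prediction is ignored
    rank = 0              # number of distinct predictions examined so far
    hits = 0              # number of relevant pairs retrieved so far
    precision_sum = 0.0

    for pair in ranked_pairs:
        if pair in seen:
            continue
        seen.add(pair)
        rank += 1
        if pair in gold:
            hits += 1
            # Precision of the top `rank` predictions, taken each time
            # a relevant pair is retrieved.
            precision_sum += hits / rank

    # Dividing by the gold size gives unretrieved relevant pairs
    # a precision of zero.
    return precision_sum / len(gold)
```

For example, with two gold pairs and a ranked list of three predictions whose first and third entries are correct, the sum is 1/1 + 2/3 and the function returns (1/1 + 2/3) / 2 = 5/6. Appending a wrong prediction at the bottom of that list leaves the value unchanged.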
Helper note: To optimize Average Precision, a system must find all relevant term pairs and put them at the top of the list. Average Precision increases when true predictions (relevant term pairs) are added anywhere in the prediction list. Average Precision also increases when false predictions, if any, are pushed towards the bottom of the list. Note that Average Precision cannot decrease when more predictions, whether true or false, are added to the bottom of the list. Also note that Average Precision is equivalent to Mean Average Precision (MAP) with exactly one query:
Q: find all term pairs in $D_1 \times D_2$ that are translations of each other
The present evaluation therefore does not model the task with a query per source term, but with one global query (Q above) that considers all source terms together. The systems are thus expected to rank all their chosen candidate term pairs in descending order of confidence. To state it another way, a system should not rank term pairs first according to the source term $t_1$ and then, within each source term, in descending order of confidence.
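As an illustration of this global ranking, the following Python sketch flattens per-source-term candidates into a single list sorted by confidence. The function name, the input structure (a dictionary from source term to scored target candidates), and the example terms in the usage note are hypothetical; a real system may store its hypotheses differently.

```python
def rank_globally(candidates_by_source):
    """Merge per-source candidates into one globally ranked pair list.

    candidates_by_source: dict mapping a source term to a list of
    (target_term, confidence) tuples.
    Returns (source_term, target_term) pairs in descending order of
    confidence, regardless of which source term they belong to.
    """
    flat = [
        (score, src, tgt)
        for src, candidates in candidates_by_source.items()
        for tgt, score in candidates
    ]
    # Sort on confidence only; Python's sort is stable, so ties keep
    # their original relative order.
    flat.sort(key=lambda item: -item[0])
    return [(src, tgt) for _, src, tgt in flat]
```

With candidates such as `{"heart attack": [("crise cardiaque", 0.9), ("infarctus", 0.4)], "kidney": [("rein", 0.8)]}`, the pair for "kidney" is placed between the two pairs for "heart attack", because only the confidence determines the order.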
File format
All files use UTF-8 encoding, with LF end-of-line markers.
- Single-term lists $D_1$ and $D_2$ contain one term per line.
- Corpora $C_1$ and $C_2$ contain one sentence per line.
- The gold standard dictionary $D_{1,2}$ contains two terms per line, separated by a tab character: $t_1$<TAB>$t_2$
- The system output submitted by a participant contains two terms per line, separated by a tab character: $t_1$<TAB>$t_2$. Its lines are ordered in decreasing order of confidence.
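A minimal Python sketch of reading and writing these files is given below. The function names are hypothetical, and the decision to skip empty lines and to fail on lines that do not contain exactly two tab-separated fields is an assumption, not a rule stated by the task.

```python
def read_terms(path):
    """Read a term list: one term per line, UTF-8."""
    with open(path, encoding="utf-8", newline="") as handle:
        lines = (line.rstrip("\r\n") for line in handle)
        return [line for line in lines if line]


def read_term_pairs(path):
    """Read a dictionary or system output: two terms per line, tab-separated."""
    pairs = []
    with open(path, encoding="utf-8", newline="") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue
            # Raises ValueError if the line does not hold exactly two fields.
            src, tgt = line.split("\t")
            pairs.append((src, tgt))
    return pairs


def write_term_pairs(path, ranked_pairs):
    """Write ranked pairs, most confident first, with LF line endings."""
    # newline="\n" prevents translation to CRLF on platforms that use it.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for src, tgt in ranked_pairs:
            handle.write(f"{src}\t{tgt}\n")
```

Passing `newline="\n"` when writing keeps LF end-of-line markers even on systems whose default line ending is CRLF.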
Sample data
A small sample dataset is provided in bucc2023_sample.zip for the English-French language pair. It contains:
- A pair of comparable corpora src_corpus_sample.txt and tgt_corpus_sample.txt in languages en and fr.
- A list of terms src_term_list_sample.txt that occur in the English corpus and a list of terms tgt_term_list_sample.txt that occur in the French corpus. Term lists may include both single-word and multi-word terms.
- For training only, a gold standard dictionary gold_dictionary_sample.txt for en-fr.
Training data
A training dataset is provided in bucc2023_training.zip for the English-French language pair. It contains:
- A pair of comparable corpora corpus-en.txt and corpus-fr.txt in languages en and fr.
- A list of terms terms-en.txt that occur in the English corpus and a list of terms terms-fr.txt that occur in the French corpus. Term lists may include both single-word and multi-word terms.
- For training only, a gold standard dictionary terms-en-fr.txt for en-fr.
Note that the sizes of the two term lists and of the gold standard dictionary, as well as the proportions of terms in either list that have a translation in the other, are likely to be different in the test datasets.
Test data
Test datasets (en-fr). A test dataset contains:
- A pair of comparable corpora corpus-en.txt and corpus-fr.txt in languages en and fr.
- A list of terms that occur in the English corpus and a list of terms that occur in the French corpus. Term lists may include both single-word and multi-word terms.
Time schedule
| Date | Milestone |
|---|---|
| Any time | Expression of interest to the shared task contact point (see below). This will allow us to register you on the shared task discussion list and inform you about updates. |
| 6 June 2023 | Sample dataset and training data are available for the English-French language pair (en-fr) |
| 18 July 2023 | Test data release |
| 21 July 2023 | Submission of system runs by participants (up to 5 per dataset) by e-mail to the shared task contact point |
| 31 July 2023 | Submission of shared task papers to the BUCC workshop: extended to July 31st |
| 7 September 2023 | Workshop date |
Shared task organizers and contact
- Pierre Zweigenbaum (Université Paris-Saclay, CNRS, LISN, Orsay, France)
- Serge Sharoff (University of Leeds, United Kingdom)
- Reinhard Rapp (University of Mainz and Magdeburg-Stendal University of Applied Sciences, Germany)
Shared task contact point: please send expressions of interest to:
- pz (at) lisn (dot) fr
Last modified: 22 July 2023