16th WORKSHOP ON BUILDING AND USING COMPARABLE CORPORA

WITH SHARED TASK ON MULTILINGUAL TERMINOLOGY EXTRACTION FROM COMPARABLE SPECIALIZED CORPORA

Co-located with RANLP 2023
September 7 or September 8, 2023

Workshop website: https://comparable.limsi.fr/bucc2023/
RANLP website: http://ranlp.org/ranlp2023

Workshop proceedings to be published in ACL Anthology

**************************************************************

MOTIVATION

In the language engineering and the linguistics communities, research in comparable corpora has been motivated by two main reasons. In language engineering, on the one hand, it is chiefly motivated by the need to use comparable corpora as training data for statistical NLP applications such as statistical and neural machine translation or cross-lingual retrieval. In linguistics, on the other hand, comparable corpora are of interest because they enable cross-language discoveries and comparisons.

It is generally accepted in both communities that comparable corpora consist of documents that are comparable in content and form in various degrees and dimensions across several languages. Parallel corpora are at one end of this spectrum, unrelated corpora at the other.

Comparable corpora have been used in a range of applications, including Information Retrieval, Machine Translation, Cross-lingual text classification, etc. The linguistic definitions and observations related to comparable corpora can improve methods to mine such corpora for applications of statistical NLP, for example to extract parallel corpora from comparable corpora for neural MT. As such, it is of great interest to bring together builders and users of such corpora.

TOPICS

We solicit contributions on all topics related to comparable (and parallel) corpora, including but not limited to the following:

Building Comparable Corpora:

* Automatic and semi-automatic methods
* Methods to mine parallel and non-parallel corpora from the web
* Tools and criteria to evaluate the comparability of corpora
* Parallel vs non-parallel corpora, monolingual corpora
* Rare and minority languages, across language families
* Multi-media/multi-modal comparable corpora

Applications of comparable corpora:

* Human translation
* Language learning
* Cross-language information retrieval & document categorization
* Bilingual and multilingual projections
* (Unsupervised) Machine translation
* Writing assistance
* Machine learning techniques using comparable corpora

Mining from Comparable Corpora:

* Cross-language distributional semantics, word embeddings and pre-trained multilingual transformer models
* Extraction of parallel segments or paraphrases from comparable corpora
* Methods to derive parallel from non-parallel corpora (e.g. to provide for low-resource languages in neural machine translation)
* Extraction of bilingual and multilingual translations of single words, multi-word expressions, proper names, named entities, sentences, paraphrases etc. from comparable corpora
* Induction of morphological, grammatical, and translation rules from comparable corpora
* Induction of multilingual word classes from comparable corpora

Comparable Corpora in the Humanities:

* Comparing linguistic phenomena across languages in contrastive linguistics
* Analyzing properties of translated language in translation studies
* Studying language change over time in diachronic linguistics
* Assigning texts to authors via authors' corpora in forensic linguistics
* Comparing rhetorical features in discourse analysis
* Studying cultural differences in sociolinguistics
* Analyzing language universals in typological research

IMPORTANT DATES

July 18, 2023: Paper submission deadline
July 31, 2023: Notification of acceptance
August 25, 2023: Camera ready final papers
September 7 or 8, 2023: Workshop date

For updates see the workshop website at https://comparable.limsi.fr/bucc2023/

PRACTICAL INFORMATION

The workshop is an in-person event. Workshop registration is via the main conference registration site, see http://ranlp.org/ranlp2023/index.php/fees-registration/

The workshop proceedings will be published in the ACL Anthology.

SUBMISSION GUIDELINES

Please follow the style sheet and templates (for LaTeX, Overleaf and MS-Word) provided for the main conference at http://ranlp.org/ranlp2023/index.php/submissions/

Papers should be submitted as a PDF file using the START conference manager at https://softconf.com/ranlp23/BUCC/

Submissions must describe original and unpublished work and range from 4 to 8 pages plus unlimited references.

Reviewing will be double blind, so the papers should not reveal the authors' identity. Accepted papers will be published in the workshop proceedings, which will be included in the ACL Anthology.

Double submission policy: Parallel submission to other meetings or publications is possible, but the workshop organizers must be notified by e-mail immediately (i.e. as soon as it is known to the authors).

For further information and updates see the BUCC 2023 website: https://comparable.limsi.fr/bucc2023/

In case of questions, please contact Reinhard Rapp: reinhardrapp (at) gmx (dot) de

***** DRAFT

BUCC 2023 SHARED TASK: bilingual term alignment in comparable specialized corpora

The BUCC 2023 shared task is on multilingual terminology alignment in comparable corpora. Many research groups are working on this problem using a wide variety of approaches. However, as there is no standard way to measure the performance of the systems, the published results are not comparable and the pros and cons of the various approaches are not clear. The shared task aims at solving these problems by organizing a fair comparison of systems. This is accomplished by providing corpora and evaluation datasets for a number of language pairs and domains.

Moreover, the importance of dealing with multi-word expressions in Natural Language Processing applications has been recognized for a long time. In particular, multi-word expressions pose serious challenges for machine translation systems because of their syntactic and semantic properties. Furthermore, multi-word expressions tend to be more frequent in domain-specific text, hence the need to handle them in tasks with specialized-domain corpora.

Through the 2023 BUCC shared task, we seek to evaluate methods that detect pairs of terms that are translations of each other in two comparable corpora, with an emphasis on multi-word terms in specialized domains.

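As a rough illustration of what a shared measure of performance could look like, the sketch below scores a system's predicted term pairs against a reference list using precision, recall and F1. The choice of metric, the function name score_term_pairs, the case-insensitive exact matching and the English-French toy pairs are all assumptions made for this example and are not taken from the task definition; for the actual procedure, see the shared task website.

```python
def score_term_pairs(predicted, gold):
    """Return (precision, recall, f1) for predicted term pairs.

    Both arguments are iterables of (source_term, target_term) tuples.
    Matching is exact after lowercasing and stripping whitespace,
    which is an assumption of this sketch, not an official rule.
    """
    def normalize(pairs):
        return {(s.strip().lower(), t.strip().lower()) for s, t in pairs}

    pred = normalize(predicted)
    ref = normalize(gold)
    correct = len(pred & ref)

    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data, invented for illustration only.
gold_pairs = [
    ("wind turbine", "éolienne"),
    ("rotor blade", "pale de rotor"),
    ("power grid", "réseau électrique"),
]
system_pairs = [
    ("wind turbine", "éolienne"),         # correct
    ("rotor blade", "pale"),              # wrong target term
    ("Power Grid", "réseau électrique"),  # correct after lowercasing
]

precision, recall, f1 = score_term_pairs(system_pairs, gold_pairs)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Exact matching treats a partially correct multi-word term (such as "pale" for "pale de rotor") as a miss, which is one reason multi-word terms make the task harder to score as well as to solve.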
Sample and training data release: 15 June 2023
Test data release: 30 June 2023

For further details see the shared task website at https://comparable.limsi.fr/bucc2023/bucc2023-task.html

WORKSHOP ORGANIZERS

* Reinhard Rapp (University of Mainz and Magdeburg-Stendal University of Applied Sciences, Germany), chair and contact person: reinhardrapp (at) gmx (dot) de
* Pierre Zweigenbaum (Université Paris-Saclay, CNRS, LISN, Orsay, France)
* Serge Sharoff (University of Leeds, United Kingdom)

PROGRAMME COMMITTEE ************ NOT CURRENT

* Ahmet Aker (University of Duisburg-Essen, Germany)
* Ebrahim Ansari (Institute for Advanced Studies in Basic Sciences, Iran)
* Thierry Etchegoyhen (Vicomtech, Spain)
* Hitoshi Isahara (Otemon Gakuin University, Japan)
* Kyo Kageura (The University of Tokyo, Japan)
* Natalie Kübler (CLILLAC-ARP, Université de Paris, France)
* Philippe Langlais (Université de Montréal, Canada)
* Yves Lepage (Waseda University, Japan)
* Emmanuel Morin (Université de Nantes, France)
* Dragos Stefan Munteanu (RWS, USA)
* Reinhard Rapp (University of Mainz and Magdeburg-Stendal University of Applied Sciences, Germany)
* Nasredine Semmar (CEA LIST, Paris, France)
* Serge Sharoff (University of Leeds, UK)
* Richard Sproat (OGI School of Science & Technology, USA)
* Tim Van de Cruys (KU Leuven, Belgium)
* Pierre Zweigenbaum (Université Paris-Saclay, CNRS, LISN, Orsay, France)