MT4All Shared Task

For this shared task we leverage the resources generated by the recently completed CEF project MT4All, with the aim of exploring unsupervised MT techniques based only on monolingual corpora. In the course of the project, the following novel datasets were created: 18 monolingual corpora for specific languages and domains, 12 bilingual dictionaries and translation models, and 10 annotated datasets for evaluation.

The research in the project was carried out by the University of the Basque Country (UPV/EHU) and the Barcelona Supercomputing Center, with Unbabel, Iconic and Tilde as validating users.

Task Description

Machine Translation has made great progress in recent years, in particular owing to novel architectures such as the Transformer and to the availability of large parallel bilingual corpora for the targeted language pairs. However, for many low-resource languages and domains such parallel corpora are not available. To address this issue, unsupervised techniques have recently emerged that rely solely or partially on monolingual corpora to build Machine Translation systems.

Within this shared task we are interested in finding out how additional monolingual data can be leveraged to create a purely unsupervised Machine Translation model by

  • either adding value to an existing pre-trained model, on the condition that
    • it has been trained on monolingual datasets,
    • it has not been fine-tuned with any parallel data,
    • it is publicly accessible from the HuggingFace repository,
  • or training an unsupervised model from scratch.

We would like to emphasize that we exclude the possibility of fine-tuning models with any existing parallel data. However, as described below, we do allow the use of certain bilingual resources that were created in MT4All using purely unsupervised techniques.
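
For illustration, a publicly available HuggingFace checkpoint that was pre-trained on monolingual data only (for example xlm-roberta-base) could serve as a starting point. The sketch below is only an assumption of how such a model might be loaded with the transformers library; the checkpoint name is an example, and participants remain responsible for verifying that their chosen model satisfies the conditions above.

# Illustrative sketch only: load a publicly available HuggingFace checkpoint
# that was pre-trained on monolingual data and has not been fine-tuned on any
# parallel data. "xlm-roberta-base" is just an example checkpoint.
from transformers import AutoModel, AutoTokenizer

model_name = "xlm-roberta-base"  # example, not a recommendation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# The pre-trained encoder can then be adapted with purely unsupervised
# techniques (e.g. denoising and iterative back-translation on the
# monolingual corpora) without touching any parallel data.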

The shared task includes three separate subtasks. Participants can choose which subtask they want to participate in and are encouraged to participate in all of them. Each subtask is a combination of under-resourced (or moderately under-resourced) language pairs and a domain, except for Subtask 3, in which only the domain may be considered under-resourced.

Subtasks
  • Subtask 1: Unsupervised translation from English to Ukrainian, Georgian and Kazakh in the Legal domain;
  • Subtask 2: Unsupervised translation from English to Finnish, Latvian, and Norwegian Bokmål in the Financial domain;
  • Subtask 3: Unsupervised translation from English to German, Norwegian Bokmål, and Spanish in the Customer Support domain.

Data

Monolingual Data

For the purpose of this shared task we intend to leverage the in-domain monolingual datasets that were compiled during the MT4All project. The newly published MT4All monolingual datasets available for each of the subtasks are the following:

  • Subtask 1: Legal domain in Georgian, Kazakh, Ukrainian and English; the sources include legislation Web sites, governmental sites, and court and parliament Web domains;
  • Subtask 2: Financial domain in Finnish, Latvian, Norwegian Bokmål and English; the sources include bank Web sites, finance resource sites, and finance blogs and forums on banking and economy-related issues;
  • Subtask 3: Customer Support domain in German, Norwegian Bokmål, Spanish and English; the sources include FAQ and help Web sites, community sites, and forums.

In addition to the in-domain data collected by MT4All, any of the monolingual OSCAR datasets can also be used, but only those; no other external monolingual data is allowed.
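
As an example of how the permitted OSCAR data can be accessed, the sketch below streams one monolingual OSCAR split with the HuggingFace datasets library; the configuration name is an assumption based on the usual OSCAR naming scheme ("unshuffled_deduplicated_<lang>") and should be adapted to the language of interest. This is an illustration, not a required workflow.

# Illustrative sketch: stream a monolingual OSCAR split with the HuggingFace
# "datasets" library. The config name is an assumption based on the usual
# OSCAR naming scheme; adjust it to the language you need.
from datasets import load_dataset

oscar_kk = load_dataset("oscar", "unshuffled_deduplicated_kk",
                        split="train", streaming=True)

for i, record in enumerate(oscar_kk):
    print(record["text"][:80])   # each record carries the raw text
    if i == 2:                   # only peek at a few documents
        break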

Bilingual Resources (from Monolingual Data)

In addition to the monolingual datasets, participants may optionally use bilingual parallel resources, but only those that were developed within the scope of the MT4All project, because they were built using purely unsupervised machine translation techniques. In particular, for the different domains and different languages paired with English, we trained word embeddings that were aligned into a shared vector space according to their similarity using VecMap. Based on these alignments, we created bilingual dictionaries for the same language pair / domain combinations.
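
As a simplified illustration of the second step, once source and target embeddings live in the same vector space (as produced by VecMap), a bilingual dictionary can be induced by nearest-neighbour retrieval. The sketch below uses plain cosine similarity on toy data; the actual MT4All dictionaries were built from the aligned in-domain embeddings, and real pipelines typically use CSLS retrieval rather than plain nearest neighbours.

# Simplified sketch of bilingual dictionary induction from embeddings that are
# already aligned in a shared vector space. Plain cosine nearest neighbours are
# used here for brevity; CSLS retrieval is common in practice.
import numpy as np

def induce_dictionary(src_words, src_vecs, tgt_words, tgt_vecs, k=1):
    # L2-normalise so that the dot product equals cosine similarity
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T                        # |src| x |tgt| similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]   # top-k target indices per source word
    return {src_words[i]: [tgt_words[j] for j in topk[i]]
            for i in range(len(src_words))}

# Toy usage with random vectors (replace with the aligned MT4All embeddings):
rng = np.random.default_rng(0)
print(induce_dictionary(["bank", "loan"], rng.normal(size=(2, 300)),
                        ["banka", "kredits", "nauda"], rng.normal(size=(3, 300))))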

Validation Sets

We suggest using the FLORES-101 dataset as an out-of-domain test set for participants' internal model validation. However, participants are not allowed to use this set for automatic hyperparameter search, to keep the models as unsupervised as possible.

Test Sets

Previously unpublished annotated test datasets created in MT4All will be used to evaluate and rank the participating systems.

State Your Interest

We kindly ask you to fill in this form if you wish to participate in this task.

Evaluation

  • Subtask 1. Unsupervised translation from English to Georgian, Kazakh, and Ukrainian in the Legal domain; we will provide 3 different test sets in English, one for each target language;
  • Subtask 2. Unsupervised translation from English to Finnish, Latvian, and Norwegian Bokmål in the Financial domain; we will provide 3 different test sets in English, one for each target language;
  • Subtask 3. Unsupervised translation from English to German, Norwegian Bokmål, and Spanish in the Customer Support domain; we will provide a test set in English that will need to be translated into each of the target languages.

The test datasets will be in TXT format, with document boundaries indicated by empty lines. The FLORES-101 dataset that we suggest using for validation has the same structure.
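
For reference, documents in such a file can be recovered by splitting on the empty lines. The sketch below is a minimal reader; the file name is only a placeholder.

# Minimal sketch: read a TXT test set in which documents are separated by
# empty lines, returning one list of sentences per document.
def read_documents(path):
    documents, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.strip():
                current.append(line)
            elif current:            # an empty line closes the current document
                documents.append(current)
                current = []
    if current:                      # the last document may lack a trailing blank line
        documents.append(current)
    return documents

docs = read_documents("test_set_en.txt")   # placeholder file name
print(len(docs), "documents")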

We will report automatic evaluation results per language pair and on average per subtask.

To evaluate the results, the following environment and library versions should be used (conda example), where REF is the reference file and HYP is the file with your system's translations:

conda create -n sbleu python=3.7
conda activate sbleu
pip3 install sacrebleu==1.5.1
sacrebleu REF --metrics bleu < HYP
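
The same score can also be computed programmatically with the sacrebleu Python API of the pinned version; the sketch below assumes REF and HYP are plain-text files with one segment per line, as in the command above.

# Equivalent check from Python with sacrebleu 1.5.1: corpus BLEU of the
# hypothesis file against a single reference file (one segment per line).
import sacrebleu

with open("HYP", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("REF", encoding="utf-8") as f:
    refs = [line.rstrip("\n") for line in f]

print(sacrebleu.corpus_bleu(hyps, [refs]).score)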

Submission

We have released the test sets in English for each of the subtasks.

Send your translations, as described in the Evaluation section, by e-mail to Ksenia Kharitonova. Please state your team name and the subtask(s) you are participating in. You will receive a confirmation e-mail within a few hours.

The deadline to send the results is 2 May 2022.

Results

Test Set       MT4All Baseline   CUNI Legal   CUNI
Legal En-Ka    12                13.8         13.8
Legal En-Kk    6.4               9.4          7.7
Legal En-Uk    20.8              28.1         27

All scores are BLEU.

Important Dates

  • Training data release: 10/03/2022
  • Test sets release: 25/04/2022
  • Results deadline: 02/05/2022
  • Paper submission deadline: 16/05/2022
  • Acceptance notice: 30/05/2022
  • Camera ready: 13/06/2022
  • Workshop starts: 24/06/2022

Organizers

  • Maite Melero (Barcelona Supercomputing Center)
  • Ksenia Kharitonova (Barcelona Supercomputing Center)
  • Ona de Gibert-Bonet (Barcelona Supercomputing Center)
  • Gorka Labaka (University of the Basque Country – UPV/EHU)
  • Iakes Goenaga Azkarate (University of the Basque Country – UPV/EHU)
  • Nora Aranberri (University of the Basque Country – UPV/EHU)
  • Jordi Armengol-Estapé (University of Edinburgh)

Contact

If you have any comments and/or questions, do not hesitate to contact Ksenia Kharitonova.

Acknowledgements

This task was created within the scope of the MT4All CEF project.