What impacts pretrained multilingual language models in zero-shot learning?

BSc Project

This blog post briefly describes my bachelor's thesis “Analysing The Impact Of Linguistic Features On Cross-Lingual Transfer”. For details, you can read the full paper here. The code is publicly available here.

Motivation

The vast majority of currently available NLP datasets are in English, while many other languages lack corpora large enough to train well-performing models. Thankfully, there is growing evidence that when little or no data exists in a low-resource language (e.g. Latvian), training on a different language (e.g. English) can yield surprisingly good results. The high-resource language used for training is usually called the source (training) language, and the low-resource language is called the target (testing) language. This setup is called zero-shot learning because the model does not see any examples from the target language during training.

Unfortunately, there are no established guidelines for choosing the optimal training language. In an attempt to address this issue, we thoroughly analyze a state-of-the-art multilingual model and try to determine what makes transfer between languages work well.

Setup

We use XLM-R, a pretrained multilingual model. Multilingual models are trained on corpora from multiple languages, which has been shown to improve performance on zero-shot tasks. We evaluate it on 3 tasks: Part-of-Speech (POS) tagging, Named Entity Recognition (NER) and Natural Language Inference (NLI). For every pair of available languages and every task, we fine-tune XLM-R on the source language and test it on the target language.
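To make the setup concrete, here is a minimal sketch of zero-shot cross-lingual transfer with XLM-R using Hugging Face Transformers. The dataset and config names ("xnli", "en", "bg"), the data slice and the hyperparameters are illustrative assumptions, not necessarily the exact pipeline from the thesis.

```python
# Zero-shot transfer sketch: fine-tune XLM-R on an English NLI set,
# then evaluate on a target language without any target-language training data.
import numpy as np
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def encode(batch):
    # NLI examples are (premise, hypothesis) pairs with a 3-way label.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=128)

# Source language: English (small slice for illustration). Target language: Bulgarian.
train = load_dataset("xnli", "en", split="train[:2000]").map(encode, batched=True)
test = load_dataset("xnli", "bg", split="test").map(encode, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=train,
    eval_dataset=test,
    compute_metrics=accuracy,
)
trainer.train()            # fine-tune on the source language only
print(trainer.evaluate())  # zero-shot evaluation on the target language
```

Repeating this loop over all (source, target) pairs and all three tasks produces the grid of transfer scores analysed in the next section.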

Finding important factors for effective transfer

To find out which linguistic features contribute to high transfer performance, we use WALS (the World Atlas of Language Structures), a database containing 192 linguistic features for 2662 languages. For each language pair, we concatenate the features of both languages with the score obtained on a given dataset. We then train an XGBoost model to predict performance from the features of both languages. Once the model is trained, we extract feature importance scores from it, which show how helpful each feature is for predicting performance. Another indicator of feature importance is the Kruskal-Wallis test, which checks whether performance differs significantly across the values of a feature. A sketch of this analysis step follows below.
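Below is a toy version of that analysis. The tiny DataFrame is a hypothetical stand-in: the real inputs are the full WALS categories for every (source, target) pair plus the measured transfer score, and the feature shown (81A, word order) is only one of the 192 features used.

```python
# Sketch: predict transfer scores from concatenated WALS features of the
# source and target languages, then inspect feature importances and run a
# Kruskal-Wallis test on one feature.
import pandas as pd
import xgboost as xgb
from scipy.stats import kruskal

# Hypothetical toy data: one row per (source, target) language pair.
data = pd.DataFrame({
    "src_81A_word_order": ["SVO", "SOV", "SVO", "VSO", "SOV", "SVO"],
    "tgt_81A_word_order": ["SVO", "SVO", "SOV", "SVO", "SOV", "VSO"],
    "score":              [0.78,  0.61,  0.55,  0.70,  0.66,  0.59],
})

# One-hot encode the categorical WALS values for XGBoost.
X = pd.get_dummies(data.drop(columns="score"), dtype=int)
y = data["score"]

model = xgb.XGBRegressor(n_estimators=100, max_depth=3)
model.fit(X, y)

# Feature importance: how useful each (language, feature, value) column was
# for predicting transfer performance.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Kruskal-Wallis: does the score distribution differ across values of a feature?
groups = [g["score"].values for _, g in data.groupby("tgt_81A_word_order")]
stat, p = kruskal(*groups)
print(f"Kruskal-Wallis H={stat:.3f}, p={p:.3f}")
```

In the actual analysis this is repeated per task and dataset, so that each feature gets an importance score and a Kruskal-Wallis statistic for every task separately.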

Results

Based on the methods described above, the importance of syntactic features differs strongly depending on the task: no single feature is a good performance indicator for all NLP tasks. Consequently, one should not expect that for a given target language there is a single source language that is the best choice for every NLP task (for instance, for Bulgarian, the best source language is French for POS tagging, Russian for NER and Thai for NLI).
