Machine Learning Assisted Mapping of Multilingual Occupational Data to ESCO (Part 1)

Previously, we explained that linking external data to an ESCO occupation concept is an essential building block of maintaining ESCO because it supports drafting new occupations and enables quality control on existing ones. Here we touch upon a significant challenge that complicates this process: ESCO currently supports 28 languages, which requires multilingual machine learning models to connect textual information to the ESCO occupations. This is the first of two news articles in which we illustrate the ESCO multilingual mapping approach by applying it to different use cases.

In part 1 we focus on mapping multilingual occupational classifications to ESCO, while part 2 will generalise to mapping more diverse content such as job titles, work history descriptions or job advertisement descriptions, and will present insights into the underlying model results. Where possible, we benchmark our methodology against existing approaches published in the scientific literature.

Synergies with stakeholders

While the ESCO team uses artificial intelligence to maintain the ESCO classification, there are synergies with projects such as EURES and Europass. In EURES, Member States use ESCO to exchange online job advertisements, whereas Europass empowers users to build a profile and CV using the ESCO terminology. Given that the ESCO team is continuously updating the taxonomy and further optimising the algorithms that support this process, it is beneficial to follow an integrated approach so that ESCO stakeholders such as Europass and EURES can also benefit from the methodology developed for the maintenance tasks within ESCO.

EURES

EURES is a European cooperation network of employment services, designed to facilitate the free movement of workers. As part of Regulation (EU) 2016/589 of the European Parliament and of the Council, Member States need to adopt the ESCO taxonomy at national level or map their national classifications to ESCO in order to exchange data via a standardised terminology of occupations, skills and competences with the EURES platform. In the past, a software tool was developed to support Member States in the process of mapping their national classification to ESCO. The tool suggests potentially related ESCO concepts for a concept from the national classification through a TF-IDF-based approach to facilitate the manual mapping task. Some Member States worked outside the mapping platform and contracted external parties to support the mapping exercise. Several of these external parties reported the use of artificial intelligence to assist the mapping, but the applied methodology was not always made public. All this led to a very fragmented approach for what is essentially a common challenge faced by the Member States.
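The TF-IDF-based suggestion step mentioned above can be illustrated with a minimal sketch. The implementation of the actual mapping tool is not described in this article, so everything below is an assumption for illustration: the function names (tokenize, tfidf_vectors, suggest), the whitespace tokenisation, the smoothed IDF formula and the handful of occupation labels, which are illustrative and not an extract from ESCO.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenisation (an assumption; a real tool
    # would likely apply language-aware preprocessing).
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF so that every weight stays positive.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    vectors = [{tok: c * idf[tok] for tok, c in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest(source_text, target_labels, k=3):
    """Rank target labels by TF-IDF cosine similarity to the source text."""
    vectors, idf = tfidf_vectors(target_labels)
    counts = Counter(tokenize(source_text))
    # Tokens that never occur in the target labels carry no weight.
    query = {tok: c * idf[tok] for tok, c in counts.items() if tok in idf}
    scored = [(cosine(query, vec), label)
              for vec, label in zip(vectors, target_labels)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Illustrative labels only.
labels = [
    "software developer",
    "web developer",
    "primary school teacher",
    "secondary school teacher",
]
for score, label in suggest("teacher in a primary school", labels):
    print(f"{score:.3f}  {label}")
```

Because the ranking relies on shared tokens, a purely lexical method of this kind cannot relate a source concept to a target label written in another language unless the two happen to share words.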

Unified approach for mapping multilingual occupational classifications

The ESCO team uses representation learning techniques to support the maintenance of the classification. In particular, the multilingual XLM-RoBERTa model was finetuned on labour market data covering the 28 ESCO languages, namely ESCO itself, QDR qualifications and EURES online job advertisements. We apply this model to the problem of mapping national occupational classifications to ESCO. We evaluate it on the mapping tables that are gradually becoming available from the Member States' efforts to comply with Regulation (EU) 2016/589, and on the recently published O*NET – ESCO crosswalk.
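To make the idea of mapping through a learned representation more concrete, the sketch below ranks target concepts by the similarity of their embedding to the embedding of a source concept. It rests on assumptions: the article does not state which scoring function is used, so cosine similarity is chosen here as a common option, and the three-dimensional vectors and concept identifiers are toy placeholders rather than output of the finetuned model.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_suggestions(source_vec, concept_vecs, k=3):
    """Rank target concepts by cosine similarity to the source embedding."""
    scored = [(cosine_similarity(source_vec, vec), concept_id)
              for concept_id, vec in concept_vecs.items()]
    scored.sort(reverse=True)
    return scored[:k]


# Toy vectors stand in for model output (hypothetical values and ids).
concept_vecs = {
    "concept_a": [0.9, 0.1, 0.0],
    "concept_b": [0.1, 0.9, 0.1],
    "concept_c": [0.7, 0.3, 0.1],
}
source_vec = [1.0, 0.2, 0.0]

for score, concept_id in top_k_suggestions(source_vec, concept_vecs, k=2):
    print(f"{concept_id}: {score:.3f}")
```

In a setup like this, the target concept embeddings can be computed once and reused, so that only the source text needs to be encoded at mapping time.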

The following table presents occupation concepts selected from the national classifications of Latvia, Spain, Sweden, Italy and the United States of America, together with the model's top three suggestions and their scores for each source concept. The last column shows the expert validation, as obtained from the mapping tables provided by the Member States and from the O*NET – ESCO crosswalk. The types of matches are exact match, broader match, narrower match and close match (an empty cell means the ESCO concept was not selected by the experts for the source concept).

A benchmark analysis was performed for the finetuned multilingual model by mapping five national classifications to ESCO occupation concepts, based on the available mapping tables. Micro mean reciprocal rank and top-k accuracy are reported below for different input types (title of the source concept only versus title and description combined) and different ESCO language variants (English versus the language of the source classification). Note: Member State taxonomies were mapped to ESCO v1.0.7 (2,942 concepts) because that major version was used by the Member States, while O*NET was mapped to ESCO v1.1 (3,008 concepts) because this crosswalk was completed more recently.
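For readers less familiar with the two metrics, the sketch below shows one way to compute them. It is a minimal illustration under stated assumptions: "micro" is read here as averaging over individual source concepts, a source concept counts as a hit if any of its expert-selected ESCO concepts (whatever the match type) appears in the suggestions, and all identifiers and data are made up for the example. The article does not specify these details.

```python
def reciprocal_rank(ranked, relevant):
    """Return 1 / rank of the first expert-selected concept, or 0.0 if none."""
    for rank, concept in enumerate(ranked, start=1):
        if concept in relevant:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked, relevant, k):
    """True if any expert-selected concept is among the top k suggestions."""
    return any(concept in relevant for concept in ranked[:k])


def evaluate(predictions, gold, k=5):
    """Compute mean reciprocal rank and top-k accuracy over source concepts.

    predictions: source id -> ranked list of suggested target concepts
    gold:        source id -> set of expert-selected target concepts
    """
    # Only source concepts with at least one expert mapping are scored.
    sources = [s for s in gold if gold[s]]
    mrr = sum(reciprocal_rank(predictions.get(s, []), gold[s])
              for s in sources) / len(sources)
    accuracy = sum(hit_at_k(predictions.get(s, []), gold[s], k)
                   for s in sources) / len(sources)
    return mrr, accuracy


# Made-up example: three source concepts with three suggestions each.
predictions = {
    "s1": ["e1", "e2", "e3"],
    "s2": ["e4", "e5", "e6"],
    "s3": ["e7", "e8", "e9"],
}
gold = {"s1": {"e1"}, "s2": {"e6", "e9"}, "s3": {"e2"}}

mrr, acc_at_3 = evaluate(predictions, gold, k=3)
print(f"MRR = {mrr:.3f}, top-3 accuracy = {acc_at_3:.3f}")
```

In this toy case the reciprocal ranks are 1, 1/3 and 0, so the mean reciprocal rank is 4/9, and two of the three source concepts have an expert-selected concept in the top three suggestions.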

Results show that for between 75% and 83% of the source occupations in the Member State classifications, an ESCO occupation selected by the experts was among the top 5 suggestions; for O*NET this figure was 94%. We observe that model suggestions consistently match expert validation better when mapping to the ESCO variant in the same language as the source classification. Using more input information (i.e. title and description) further improves suggestion quality. We did not use examples of professions (e.g. job titles) as input in the experiments, but we expect that doing so would further improve the performance metrics. This illustrates that a single multilingual representation learning approach can be used to support experts in mapping different national classifications to ESCO, which is a step forward compared to the simpler TF-IDF approach currently available in the platform.

Mapping from unseen languages

A significant consideration when establishing the methodology was to opt for an approach that can easily be improved further for existing ESCO languages (e.g. when future labour market data become available), and that can also be extended to other languages with minimal effort. This helps ESCO support implementers all across Europe and have maximum impact.

The representation learning model that we developed is based on XLM-RoBERTa, which is pretrained on 100 languages. This model is finetuned to map text to ESCO occupations in 28 languages at the moment. To extend our model to new languages, the optimal approach would be to further finetune the model with labour market data from the corresponding languages.

In case no training data is available for a language for which we would like to map to ESCO, we are essentially dealing with zero-shot cross-lingual transfer learning. It was reported that XLM-RoBERTa does not align well across languages although it learns language encoders having a shared multilingual contextual embedding space: the multilingual encoders fail to capture similarity when the source and target languages are less similar at levels of morphology, syntax, and semantics. While here we finetuned the XLM-RoBERTa shared embedding space for a selected number of languages, we also investigated mapping for languages that were not in the training set. The following table contains suggested ESCO occupations for mapping Serbian (SR) and Albanian (SQ) input text to the English variant of ESCO. A more extreme test was performed for Korean (KO).

These examples show that transfer to these languages holds to some extent. The examples were selected for their higher mapping scores (indicating higher confidence in the suggestion), however, and we also found that extending to this zero-shot setting with unseen languages remains challenging. Further finetuning the model with labour market data from the unseen language therefore remains the preferred approach.

Summary

This article discussed the approach that the ESCO team is following for multilingual modelling to support the maintenance of the occupations pillar. Representation learning techniques are used to represent free text originating from the 28 ESCO languages, thereby aligning the embedding space for the different languages. The model is based on XLM-RoBERTa and was finetuned on labour market data (ESCO, qualifications and online job advertisements) covering all the ESCO languages. The approach is flexible in terms of extending the model to unseen languages. Examples for mapping Member State classification concepts to ESCO were presented and a benchmark analysis was performed.

Part 2 of this series will be published soon. We will present an analysis for mapping other content such as job titles, work history descriptions or job advertisement descriptions to ESCO, including benchmark results. Eventually, all findings will be summarised in a report.

If you are an ESCO implementer and want to share your feedback, please get in touch via email at EMPL-ESCO-SECRETARIAT@ec.europa.eu or use our hashtag #ESCO_EU.