All That Glitters is Not Gold: Transfer-learning for Offensive Language Detection in Dutch

Dion Theodoridis; Tommaso Caselli

Authors

Dion Theodoridis Rijksuniversiteit Groningen
Tommaso Caselli Rijksuniversiteit Groningen

Abstract

Creating datasets for language phenomena to fill gaps in the language resource panorama of specific natural languages is not a trivial task. In this work, we explore the application of transfer learning as a strategy to boost the creation of both language-specific datasets and systems. We use offensive language in Dutch tweets directed at Dutch politicians as a case study. In particular, we trained a multilingual model on the Political Speech Project dataset (Bröckling et al. 2018) to automatically annotate tweets in Dutch. The automatically annotated tweets were then used to further train a monolingual Dutch language model (BERTje), adopting different strategies and combinations of manually curated data. Our results show that: (i) transfer learning is an effective strategy to boost the creation of new datasets for specific language phenomena by reducing the annotation effort; (ii) a monolingual language model fine-tuned with automatically annotated data (i.e., silver data) is a competitive baseline against the zero-shot transfer of a multilingual model; and finally, (iii) less surprisingly, adding automatically annotated data to manually curated data is a source of errors for the systems, degrading their performance.
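The data flow described in the abstract (automatic annotation, then training on different combinations of silver and manually curated data) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `Example`, `silver_annotate`, `build_training_sets`, and the keyword-based `toy_predict`, which merely stands in for the multilingual classifier. The three training configurations are one plausible reading of "different strategies", not the authors' actual setup, and the example tweets are invented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Example:
    text: str
    label: int   # 1 = offensive, 0 = not offensive
    source: str  # "gold" (manually curated) or "silver" (automatically annotated)


def silver_annotate(texts: List[str], predict: Callable[[str], int]) -> List[Example]:
    """Label unannotated texts with a model's predictions, yielding silver data."""
    return [Example(text, predict(text), "silver") for text in texts]


def build_training_sets(gold: List[Example], silver: List[Example]) -> Dict[str, List[Example]]:
    """Assemble hypothetical training configurations for a monolingual model."""
    return {
        "gold_only": list(gold),
        "silver_only": list(silver),
        "gold_plus_silver": list(gold) + list(silver),
    }


def toy_predict(text: str) -> int:
    """Hypothetical stand-in for the multilingual classifier (keyword match only)."""
    return int("idioot" in text.lower())


# Invented example data, for illustration only.
gold = [Example("Wat een goed debat.", 0, "gold")]
silver = silver_annotate(
    ["Jij bent een idioot!", "Bedankt voor uw antwoord."],
    toy_predict,
)
sets = build_training_sets(gold, silver)

for name, examples in sets.items():
    print(name, len(examples))
```

In a real pipeline, `toy_predict` would be replaced by the fine-tuned multilingual model, and each configuration would be used to fine-tune the monolingual model and evaluated on held-out manually annotated data.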
