All That Glitters is Not Gold: Transfer-learning for Offensive Language Detection in Dutch
Creating datasets for language phenomena to fill gaps in the language resource panorama of specific natural languages is not a trivial task. In this work, we explore the application of transfer learning as a strategy to boost the creation of both language-specific datasets and systems. We use offensive language in Dutch tweets directed at Dutch politicians as a case study. In particular, we trained a multilingual model on the Political Speech Project dataset (Bröckling et al. 2018) to automatically annotate tweets in Dutch. The automatically annotated tweets were then used to further train a monolingual Dutch language model (BERTje), adopting different strategies and combinations with manually curated data. Our results show that: (i) transfer learning is an effective strategy to boost the creation of new datasets for specific language phenomena by reducing the annotation effort; (ii) a monolingual language model fine-tuned with automatically annotated data (i.e., silver data) is a competitive baseline against the zero-shot transfer of a multilingual model; and finally, (iii) less surprisingly, adding automatically annotated data to manually curated data is a source of errors for the systems, degrading their performance.
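The data flow described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the toy keyword-based teacher standing in for the multilingual model, the example tweets, and the three training-set combinations are all assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

# (tweet text, label), with 1 = offensive and 0 = not offensive (assumed encoding)
Example = Tuple[str, int]


def make_silver(teacher: Callable[[str], int], tweets: List[str]) -> List[Example]:
    """Label unannotated tweets with the teacher's predictions (silver data)."""
    return [(tweet, teacher(tweet)) for tweet in tweets]


def training_sets(gold: List[Example], silver: List[Example]) -> Dict[str, List[Example]]:
    """Build hypothetical training-data combinations for the monolingual model."""
    return {
        "gold_only": list(gold),
        "silver_only": list(silver),
        "gold_plus_silver": list(gold) + list(silver),
    }


def toy_teacher(tweet: str) -> int:
    """Hypothetical stand-in for the multilingual model: a trivial keyword rule."""
    return int("idioot" in tweet.lower())


# Invented example data, for illustration only.
unlabeled = ["Wat een idioot ben jij", "Goed debat vandaag"]
gold = [("Bedankt voor uw inzet", 0)]

silver = make_silver(toy_teacher, unlabeled)
sets = training_sets(gold, silver)
```

In a real setting, `toy_teacher` would be replaced by the predictions of the fine-tuned multilingual model, and each entry of `sets` would be used to fine-tune a separate monolingual model for comparison.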