MedRoBERTa.nl: A Language Model for Dutch Electronic Health Records

Stella Verkijk; Piek Vossen

Authors

Stella Verkijk Vrije Universiteit Amsterdam
Piek Vossen Vrije Universiteit Amsterdam

Abstract

This paper presents MedRoBERTa.nl, the first Transformer-based language model for Dutch medical language. Using 13 GB of text from Dutch hospital notes, we show that pre-training from scratch yields a better domain-specific language model than further pre-training RobBERT. For the extended pre-training of RobBERT, we use a domain-specific vocabulary and re-train the embedding look-up layer. MedRoBERTa.nl, the model trained from scratch, outperforms general language models for Dutch on a medical odd-one-out similarity task, and it already does so after only 10k pre-training steps. When fine-tuned, MedRoBERTa.nl also outperforms general language models for Dutch on a task that classifies sentences from Dutch hospital notes containing information about patients’ mobility levels.
