Controllable Sentence Simplification in Dutch

Authors

  • Theresa Seidl
  • Vincent Vandeghinste

Abstract

Text simplification aims to reduce complexity in vocabulary and syntax, enhancing the readability and comprehension of text. This paper presents a supervised sentence simplification approach for Dutch using a pre-trained large language model (T5). Given the absence of a parallel corpus in Dutch, a synthetic dataset is generated from established parallel corpora. The implementation incorporates a sentence-level discrete parametrization mechanism, enabling control over the simplification features. The model’s output can be tailored to different simplification scenarios and target audiences by incorporating control tokens into the training data. The controlled attributes include sentence length, word length, paraphrasing, and lexical and syntactic complexity. This work contributes a dedicated set of control tokens tailored to the Dutch language. It shows that significant simplification can be achieved using a synthetic dataset with as few as 2,000 parallel rows, although optimal performance requires a minimum of 10,000 rows. The fine-tuned model achieves a 36.85 SARI score on the test set, supporting its effectiveness in the simplification process. This research contributes to the field of sentence simplification by discussing the implementation of a supervised simplification approach for Dutch. The findings highlight the potential of synthetic datasets and control tokens in achieving effective simplification, despite the lack of a parallel corpus in the target language.
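
To make the control-token idea above concrete, the following is a minimal sketch of how discrete control tokens might be prepended to the source side of a training pair. It is not the authors' implementation: the token names (`<CHARS_…>`, `<WORDLEN_…>`), the 0.05 bucket size, and the 0.2 to 1.5 clamp range are illustrative assumptions, and only two of the attributes named in the abstract (sentence length and word length) are covered.

```python
# Illustrative sketch only: token names, bucket size, and clamp range are
# assumptions, not the control tokens defined in the paper.

def bucket(value, step=0.05, lo=0.2, hi=1.5):
    """Clamp a ratio to [lo, hi] and round it to the nearest multiple of step."""
    clamped = min(max(value, lo), hi)
    return round(round(clamped / step) * step, 2)


def mean_word_length(sentence):
    """Average number of characters per whitespace-separated word."""
    words = sentence.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def control_prefix(source, target):
    """Build control tokens from target/source ratios of two surface features."""
    if not source:
        raise ValueError("source sentence must not be empty")
    char_ratio = len(target) / len(source)
    source_word_length = mean_word_length(source)
    word_ratio = (
        mean_word_length(target) / source_word_length if source_word_length else 1.0
    )
    return f"<CHARS_{bucket(char_ratio)}> <WORDLEN_{bucket(word_ratio)}>"


def tag_training_pair(source, target):
    """Prepend the control tokens to the source; the target is left unchanged."""
    return f"{control_prefix(source, target)} {source}", target
```

During training, the ratios are computed from each (complex, simple) pair as above; at inference time, the same tokens would be set by hand to request a desired degree of simplification.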

Published

2024-03-21

How to Cite

Seidl, T., & Vandeghinste, V. (2024). Controllable Sentence Simplification in Dutch. Computational Linguistics in the Netherlands Journal, 13, 31–61. Retrieved from https://www.clinjournal.org/clinj/article/view/171

Section

Articles