Controllable Sentence Simplification in Dutch

Theresa Seidl; Vincent Vandeghinste

Abstract

Text simplification aims to reduce complexity in vocabulary and syntax, enhancing the readability and comprehension of text. This paper presents a supervised sentence simplification approach for Dutch using a pre-trained large language model (T5). Given the absence of a parallel corpus in Dutch, a synthetic dataset is generated from established parallel corpora. The implementation incorporates a sentence-level discrete parametrization mechanism, enabling control over the simplification features. The model’s output can be tailored to different simplification scenarios and target audiences by incorporating control tokens into the training data. The controlled attributes include sentence length, word length, paraphrasing, and lexical and syntactic complexity. This work contributes a dedicated set of control tokens tailored to the Dutch language. It shows that significant simplification can be achieved using a synthetic dataset with as few as 2,000 parallel rows, although optimal performance requires a minimum of 10,000 rows. The fine-tuned model achieves a 36.85 SARI score on the test set, supporting its effectiveness in the simplification process. This research contributes to the field of sentence simplification by discussing the implementation of a supervised simplification approach for Dutch. The findings highlight the potential of synthetic datasets and control tokens in achieving effective simplification, despite the lack of a parallel corpus in the target language.
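
To illustrate the control-token idea described in the abstract, the following is a minimal sketch of how discrete control tokens might be derived from a parallel sentence pair and prepended to the source sentence before fine-tuning. It covers only two of the listed attributes (sentence length and word length). The token names (`<SENT_LEN_…>`, `<WORD_LEN_…>`), the 0.05 bucket size, the cap of 2.0, and all function names are assumptions made for illustration and are not taken from the paper.

```python
def bucket(ratio, step=0.05, cap=2.0):
    """Round a ratio to the nearest multiple of `step`, clipped to [step, cap]."""
    clipped = min(max(ratio, step), cap)
    return round(round(clipped / step) * step, 2)


def mean_word_length(sentence):
    """Average number of characters per whitespace-separated word."""
    words = sentence.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def control_prefix(source, target):
    """Build a control-token prefix from target/source ratios (hypothetical format)."""
    if not source.strip():
        raise ValueError("source sentence must not be empty")
    # Ratio of character counts: values below 1.0 mean the target is shorter.
    len_ratio = len(target) / len(source)
    # Ratio of mean word lengths: values below 1.0 mean shorter words on average.
    word_ratio = mean_word_length(target) / mean_word_length(source)
    return (
        f"<SENT_LEN_{bucket(len_ratio):.2f}> "
        f"<WORD_LEN_{bucket(word_ratio):.2f}>"
    )


def tag_pair(source, target):
    """Return (model input, expected output) for one training example."""
    return f"{control_prefix(source, target)} {source}", target


# Example with an invented Dutch sentence pair (not from the paper's dataset).
model_input, expected_output = tag_pair(
    "De gemeente heeft besloten de subsidieregeling te beëindigen.",
    "De gemeente stopt met de subsidie.",
)
```

At inference time, the same prefix format would be filled with the desired ratios rather than measured ones, which is what would let the output be steered toward a given audience under this scheme.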
