Linguistic proxies of readability: Comparing easy-to-read and regular newspaper Dutch

  • Vincent Vandeghinste
  • Bram Bulté

Abstract

The aim of this study is to identify linguistic proxies of readability in Dutch, i.e. those linguistic features that mark a text as easy to read. To this end, we compare the Wablieft corpus (Flemish easy-to-read newspaper archives; Vandeghinste et al. 2019) to articles that appeared in the regular Flemish newspaper De Standaard, using a wide range of lexical, syntactic and readability metrics. We test which of these metrics has the largest effect size and which combinations of metrics work best in a classification task predicting whether an article belongs to Wablieft or De Standaard. The results indicate that the best linguistic proxy for readability is (not surprisingly) the average number of words per sentence. Traditional readability metrics score well, although the combination of the parameters constituting these metrics scores better in logistic regression than the original metrics.
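
To make the comparison described in the abstract concrete, the following is a minimal sketch of how the average number of words per sentence could be computed and compared across two groups of texts. It is not the authors' pipeline: the naive punctuation-based sentence splitter, the function names, and the choice of Cohen's d as the effect size measure are all assumptions made for illustration.

```python
import re
from statistics import mean, stdev


def words_per_sentence(text):
    """Average number of whitespace-separated words per sentence.

    Assumption: sentences are split naively on '.', '!' and '?';
    a real study would use a proper tokenizer.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return mean(len(s.split()) for s in sentences)


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed effect size measure)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Hypothetical toy texts, standing in for easy-to-read and regular articles.
easy_texts = ["Dit is kort. Dit is ook kort.", "De zon schijnt. Het is warm vandaag."]
regular_texts = [
    "De regering heeft gisteren na lang overleg een akkoord bereikt over de begroting.",
    "Volgens de minister zal de maatregel pas volgend jaar in werking treden. Dat is later dan gepland.",
]

easy_scores = [words_per_sentence(t) for t in easy_texts]
regular_scores = [words_per_sentence(t) for t in regular_texts]
effect = cohens_d(easy_scores, regular_scores)
```

A negative value of `effect` here would mean that the easy-to-read texts have shorter sentences on average than the regular ones; with only a handful of toy texts the magnitude is not meaningful.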

Author Biographies

Vincent Vandeghinste

Instituut voor de Nederlandse Taal (Leiden, Netherlands)

Bram Bulté

KU Leuven (Belgium)

Published
2019-12-18
How to Cite
Vandeghinste, V., & Bulté, B. (2019). Linguistic proxies of readability: Comparing easy-to-read and regular newspaper Dutch. Computational Linguistics in the Netherlands Journal, 9, 81-100. Retrieved from https://www.clinjournal.org/clinj/article/view/97
Section
Articles