MoNoise: Modeling Noise Using a Modular Normalization System

  • Rob van der Goot University of Groningen
  • Gertjan van Noord University of Groningen

Abstract

We propose MoNoise: a normalization model focused on generalizability and efficiency, it aims at being easily reusable and adaptable. Normalization is the task of translating texts from a noncanonical domain to a more canonical domain, in our case: from social media data to standard language. Our proposed model is based on a modular candidate generation in which each module is responsible for a different type of normalization action. The most important generation modules are a spelling correction system and a word embeddings module. Depending on the definition of the normalization task, a static lookup list can be crucial for performance. We train a random forest classifier to rank the candidates, which generalizes well to all different types of normalization actions. Most features for the ranking originate from the generation modules; besides these features, N-gram features prove to be an important source of information. We show that MoNoise beats the state-of-the-art on different normalization benchmarks for English and Dutch, which all define the task of normalization slightly different.

Published
2017-12-01
How to Cite
van der Goot, R., & van Noord, G. (2017). MoNoise: Modeling Noise Using a Modular Normalization System. Computational Linguistics in the Netherlands Journal, 7, 129-144. Retrieved from https://www.clinjournal.org/clinj/article/view/74
Section
Articles

Most read articles by the same author(s)