Don’t Do Your Experiments Double-blind: The Importance of Checking Your Data

Authors

  • Nelleke Oostdijk Radboud Universiteit
  • Hans Van Halteren Radboud Universiteit

Abstract

In this paper, we investigate what can happen if you run machine learning experiments on data found somewhere on the internet without first examining this data. As an example, we performed polarity recognition on a data set extracted from Booking.com. We found that a) the form of the data in the data set sometimes made polarity judgements hard for humans and probably also for systems, b) naive use of the data results in a different task than polarity recognition, as the content of the data fields does not always comply with the field descriptors, and c) the comments in the data set come in several quite different subtypes, so that recognition quality varies considerably with the choice of training and test data. On the basis of these findings, we conclude that our advice to inspect data before using it is indeed valuable.

Published

2022-12-22

How to Cite

Oostdijk, N., & Van Halteren, H. (2022). Don’t Do Your Experiments Double-blind: The Importance of Checking Your Data. Computational Linguistics in the Netherlands Journal, 12, 65–82. Retrieved from https://www.clinjournal.org/clinj/article/view/148

Section

Articles
