How to optimize your Twitter collection

Dutch keywords for better coverage

  • Tim Kreutz
  • Walter Daelemans

Abstract

Twitter allows API calls to retrieve one percent of all tweets at any time using a search word list. Since some languages, including Dutch, make up less than one percent of all tweets on average, a large part can be retrieved using the right keywords. This paper systematically assesses keyword lists for finding language-specific tweets. It contributes comparisons to previously suggested collection methods for the Dutch language and establishes the limitations of each. Generating keywords from Dutch tweets and picking 400 based on their precision-weighted recall achieves the best coverage at 91.3%. The list of Dutch keywords is made openly available alongside the code that can be used to generate lists for the collection of other languages or for other tasks that benefit from early filtering such as event or hate speech detection.
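
As an illustration of the keyword selection idea described above, the following is a minimal sketch and not the authors' released code. The abstract does not define "precision-weighted recall", so the score used here (precision multiplied by recall, computed per token over whitespace-split, lowercased tweets) is an assumption, as are the function names `rank_keywords` and `coverage` and the input format of `(text, is_dutch)` pairs.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple, assumed tokenizer."""
    return set(text.lower().split())


def rank_keywords(tweets, k):
    """Return the k tokens with the highest precision * recall score.

    tweets: iterable of (text, is_dutch) pairs.
    The scoring formula is an assumption, not taken from the paper.
    """
    dutch_df = Counter()   # number of Dutch tweets containing each token
    total_df = Counter()   # number of all tweets containing each token
    n_dutch = 0
    for text, is_dutch in tweets:
        tokens = tokenize(text)
        total_df.update(tokens)
        if is_dutch:
            n_dutch += 1
            dutch_df.update(tokens)
    if n_dutch == 0:
        return []

    scores = {}
    for token, dutch_count in dutch_df.items():
        precision = dutch_count / total_df[token]  # share of matches that are Dutch
        recall = dutch_count / n_dutch             # share of Dutch tweets matched
        scores[token] = precision * recall
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def coverage(tweets, keywords):
    """Fraction of Dutch tweets that contain at least one of the keywords."""
    keyword_set = set(keywords)
    dutch_texts = [text for text, is_dutch in tweets if is_dutch]
    if not dutch_texts:
        return 0.0
    hits = sum(1 for text in dutch_texts if tokenize(text) & keyword_set)
    return hits / len(dutch_texts)
```

With a labeled sample of tweets, `rank_keywords(sample, 400)` would produce a candidate list of the size mentioned in the abstract, and `coverage(sample, keywords)` estimates how many of the Dutch tweets that list would retrieve.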

Author Biographies

Tim Kreutz

University of Antwerp

Walter Daelemans

University of Antwerp

Published
2019-12-16
How to Cite
Kreutz, T., & Daelemans, W. (2019). How to optimize your Twitter collection. Computational Linguistics in the Netherlands Journal, 9, 55-66. Retrieved from https://www.clinjournal.org/clinj/article/view/92
Section
Articles