LLMs as chainsaws: evaluating open-weights generative LLMs for extracting fauna and flora from multilingual travelogues
Abstract
Named Entity Recognition (NER) is crucial in literary-historical research for tasks such as semantic indexing and entity linking. However, historical texts complicate these tasks due to language variation, OCR errors, and the poor performance of off-the-shelf annotation tools. Generative Large Language Models (LLMs) present both novel opportunities and challenges for humanities research. These models, while powerful, raise valid concerns regarding biases, hallucinations, and opacity, making their evaluation all the more urgent for the Digital Humanities (DH) community. In response, we present an evaluation of three quantized open-weights LLMs (mistral-7b-instruct-v0.1, nous-hermes-llama2-13b, Meta-Llama-3-8B-instruct), run through GPT4All, for NER on literary-historical travelogues from the 18th to 20th centuries in English, French, Dutch, and German. All models were assessed both quantitatively and qualitatively across five incrementally complex prompts, revealing common error types such as bias, parsing issues, the addition of redundant information, entity adaptations, and hallucinations. We analyse prevalent examples per language, century, prompt, and model. Our contributions include a publicly accessible annotated dataset, pioneering insights into LLMs' performance in literary-historical contexts, and the publication of reusable workflows for applying and evaluating LLMs in humanities research.