Beyond Perplexity: Examining Temporal Generalization in Large Language Models via Definition Generation


  • Iris Luden UVA
  • Mario Giulianelli ETH Zürich
  • Raquel Fernández UVA


The advent of large language models (LLMs) has significantly improved performance across various Natural Language Processing tasks. However, the performance of LLMs has been shown to deteriorate over time, indicating a lack of temporal generalization. To date, performance deterioration of LLMs is primarily attributed to the factual changes in the real world over time. However, not only the facts of the world, but also the language we use to describe it constantly changes. Recent studies have indicated a relationship between performance deterioration and semantic change. This is typically measured using perplexity scores and relative performance on downstream tasks. Yet, perplexity and accuracy do not explain the effects of temporally shifted data on LLMs in practice. In this work, we propose to assess lexico-semantic temporal generalization of a language model by exploiting the task of contextualized word definition generation. This in-depth semantic assessment enables interpretable insights into the possible mistakes a model may perpetrate due to meaning shift, and can be used to complement more coarse-grained measures like perplexity scores. To assess how semantic change impacts performance, we design the task by differentiating between semantically stable, changing, and emerging target words, and experiment with T5-base, fine-tuned for contextualized definition generation. Our results indicate that (i) the model’s performance deteriorates for the task of contextualized word definition generation, (ii) the performance deteriorates more for semantically changing words compared to semantically stable words, (iii) the model exhibits significantly lower performance and potential bias for emerging words, and (iv) the performance does not correlate with cross-entropy or (pseudo)-perplexity scores. Overall, our results show that definition generation can be a promising task to assess a model’s capacity for temporal generalization with respect to semantic change.




How to Cite

Luden, I., Giulianelli, M., & Fernández, R. (2024). Beyond Perplexity: Examining Temporal Generalization in Large Language Models via Definition Generation. Computational Linguistics in the Netherlands Journal, 13, 205–232. Retrieved from