From Orthography to Semantics: Large-Scale Unsupervised Textual Similarity Detection in Historical Greek

Authors

Paulien Lemay, Ghent University
Els Lefever, Ghent University
Klaas Bentein, Ghent University

Abstract

Computational methods for detecting textual similarity provide a powerful lens for exploring linguistic patterns, formulaic language, and textual transmission in historical corpora. In this paper, we investigate what insights become possible when such similarity measures are applied across a vast corpus of Greek texts from Antiquity to the Byzantine period. We propose two methods that enable analysis at this scale: orthographic similarity using MinHash-LSH and semantic similarity using transformer-based sentence embeddings. We first validate both approaches on the Database of Byzantine Book Epigrams, which serves as a gold standard for assessing performance, before applying them to a much larger and more heterogeneous corpus. Scaling MinHash-LSH reveals repeated formulae across textual traditions, while clustering semantic embeddings uncovers conceptual and thematic relationships between texts, highlighting recurring motifs and ideas despite orthographic variation. Our findings illustrate how unsupervised methods suited to high-volume data uncover structures and relationships that targeted studies may overlook.
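To make the orthographic side of the abstract concrete, the following is a minimal sketch of how MinHash signatures and LSH banding can surface near-duplicate lines despite spelling variation. It is not the authors' implementation: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `lsh_candidates`), the character trigram size, the 64 hash seeds, the 16 bands, and the toy Greek strings are all assumptions chosen for illustration, and a study at this scale would more plausibly rely on an established library.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, n=3):
    """Return the set of character n-grams of a whitespace-normalised, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, varied by an integer seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=64):
    """For each seed, keep the minimum hash over all tokens (one MinHash per seed)."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidates(signatures, bands=16):
    """Split each signature into bands; texts sharing any identical band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for key, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(key)
    pairs = set()
    for keys in buckets.values():
        pairs.update(combinations(sorted(keys), 2))
    return pairs


# Toy strings for illustration only (assumed, not taken from the corpus):
# "variant" imitates itacistic misspellings of "base".
texts = {
    "base": "ὥσπερ ξένοι χαίρουσιν ἰδεῖν πατρίδα",
    "variant": "ὥσπερ ξένοι χέρουσιν ἰδὴν πατρίδα",
    "other": "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος",
}
signatures = {k: minhash_signature(shingles(v)) for k, v in texts.items()}

for a, b in combinations(texts, 2):
    print(a, b, round(estimated_jaccard(signatures[a], signatures[b]), 2))
print(sorted(lsh_candidates(signatures)))
```

The banding step is what makes the approach scale: only texts that collide in at least one band are compared, so the number of pairwise comparisons stays far below the quadratic worst case. Semantic similarity would require a separate pipeline (sentence embeddings followed by clustering), which this sketch does not attempt.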
