From Orthography to Semantics: Large-Scale Unsupervised Textual Similarity Detection in Historical Greek
Abstract
Computational methods for detecting textual similarity provide a powerful lens for exploring linguistic patterns, formulaic language, and textual transmission in historical corpora. In this paper, we investigate what insights become possible when such similarity measures are applied across a vast corpus of Greek texts from Antiquity to the Byzantine period. We propose two methods that enable analysis at this scale: orthographic similarity using MinHash-LSH and semantic similarity using transformer-based sentence embeddings. We first validate both approaches on the Database of Byzantine Book Epigrams, which serves as a gold standard for assessing performance, before applying them to a much larger and more heterogeneous corpus. Scaling MinHash-LSH reveals repeated formulae across textual traditions, while clustering semantic embeddings uncovers conceptual and thematic relationships between texts, highlighting recurring motifs and ideas despite orthographic variation. Our findings illustrate how unsupervised methods suited to high-volume data uncover structures and relationships that targeted studies may overlook.
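As a rough illustration of the orthographic method named above, the following is a minimal pure-Python sketch of MinHash signatures with LSH banding over character shingles. It is not the authors' implementation: the shingle size, signature length, band count, hash function, and all function names are assumptions chosen for readability, and the sample strings are illustrative only, not drawn from the evaluated corpus.

```python
import hashlib
from collections import defaultdict

# All parameters below are assumptions for illustration, not values from the paper.
NUM_PERM = 128   # length of each MinHash signature
BANDS = 32       # LSH bands (here 32 bands of 4 rows each)
SHINGLE = 3      # character n-gram size


def shingles(text, k=SHINGLE):
    """Return the set of character k-grams of a whitespace-normalised, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(shingle, seed):
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """MinHash signature: the minimum hash value over all shingles, per seed."""
    sh = shingles(text)
    return tuple(min(_hash(s, seed) for s in sh) for seed in range(NUM_PERM))


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures):
    """LSH banding: texts sharing an identical band of their signatures become candidates."""
    rows = NUM_PERM // BANDS
    pairs = set()
    for band in range(BANDS):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            buckets[sig[band * rows:(band + 1) * rows]].append(key)
        for keys in buckets.values():
            for i in range(len(keys)):
                for j in range(i + 1, len(keys)):
                    pairs.add(tuple(sorted((keys[i], keys[j]))))
    return pairs


if __name__ == "__main__":
    # Illustrative strings only: two spellings of one line and an unrelated line.
    texts = {
        "a": "ὥσπερ ξένοι χαίρουσιν ἰδεῖν πατρίδα",
        "b": "ὥσπερ ξένοι χαίρουσιν ἰδῆν πατρίδα",
        "c": "ἐν ἀρχῇ ἦν ὁ λόγος",
    }
    sigs = {key: minhash(value) for key, value in texts.items()}
    print(estimate_jaccard(sigs["a"], sigs["b"]))
    print(estimate_jaccard(sigs["a"], sigs["c"]))
    print(candidate_pairs(sigs))
```

Banding is what makes this kind of approach viable at scale: only texts that collide in at least one band are compared directly, so the quadratic all-pairs comparison is avoided. In practice a dedicated library would likely replace the hand-rolled hashing shown here.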