Automatic Text Encoding in XML-TEI Using Large Language Models
This is an ongoing project, due in August 2026.
My master’s thesis explores how recent advances in artificial intelligence can support the digital humanities, with a specific focus on text encoding. While the digitization of books and historical sources has progressed greatly over the past decades, challenges remain. Optical Character Recognition (OCR) often produces imperfect text, and the resulting digital versions usually lack crucial annotations that mark a document’s internal structure (chapters, sections) or semantic features (citations, definitions, etc.). These annotations are essential for scholarly analysis but require significant manual effort.
The Text Encoding Initiative (TEI), a widely adopted XML-based standard, provides a way to represent such information in a structured and interoperable format. However, encoding texts in TEI is still a labor-intensive process. My research investigates how Large Language Models (LLMs) can be guided through prompt design to automate XML-TEI annotation, aiming to make the process more efficient, accurate, and scalable.
For this work, I use an excerpt from “Petit traité de versification française” by Louis Quicherat (1882), a detailed guide to French versification (the art and rules of French poetic meter and rhyme), as a text corpus. The project first surveys the state of the art in TEI, LLMs, and their applications in digital scholarly editing. In the second phase, I develop a reference XML-TEI model of the corpus, then design and experiment with a series of prompts providing different levels of contextual input, tested against multiple LLMs. Finally, I analyze and compare the models’ performance in generating TEI-compliant structures.
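One simple way to compare a model's output with the reference encoding is to count which elements each document uses. The sketch below is only an illustration of that idea, not the evaluation method of the thesis: the function names `tag_counts` and `tag_f1` are hypothetical, and the score ignores element order, nesting, attributes, and text content, so it would need to be complemented by schema validation and closer inspection.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_counts(xml_text):
    """Count element names (namespace stripped); None if the XML is not well-formed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    # ElementTree writes namespaced tags as "{uri}name"; keep only "name".
    return Counter(el.tag.split("}")[-1] for el in root.iter())


def tag_f1(reference_xml, candidate_xml):
    """F1 score over the multiset of element names (hypothetical metric)."""
    ref = tag_counts(reference_xml)
    if ref is None:
        raise ValueError("reference is not well-formed XML")
    cand = tag_counts(candidate_xml)
    if cand is None:
        return 0.0  # output that does not parse gets the lowest score
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A candidate that uses every reference element exactly as often as the reference does scores 1.0, and the score drops as elements are missed or added.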
Objectives
With this thesis, I aim to automate an annotation process that typically requires significant manual effort, making textual markup both faster and more accurate. The research seeks to identify effective prompting strategies and to evaluate how well LLMs produce precise, usable, TEI-compliant encodings across a range of text encoding needs.
Method
To achieve my objective, I will design a series of structured prompts to guide the automatic XML-TEI encoding of text, and test them with several advanced large language models: ChatGPT, Gemini, Mistral, and Claude.
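As an illustration of what a series of prompts with increasing context might look like, the sketch below builds three variants for the same text. The level names (`minimal`, `guided`, `few_shot`), the wording of the instructions, and the function `build_prompts` are all assumptions made for the example; the actual prompts of the thesis may be structured differently.

```python
def build_prompts(text, elements, example_pair=None):
    """Return a dict of prompt variants, from least to most context.

    text: the passage to encode.
    elements: names of the TEI elements the model may use.
    example_pair: optional (plain_text, encoded_xml) pair for a few-shot prompt.
    """
    base = "Encode the following text in XML-TEI. Return only the XML."
    prompts = {"minimal": f"{base}\n\nText:\n{text}"}

    # Second level: restrict the model to an explicit list of elements.
    element_list = ", ".join(f"<{name}>" for name in elements)
    guided = f"{base}\nUse only these TEI elements: {element_list}."
    prompts["guided"] = f"{guided}\n\nText:\n{text}"

    # Third level: add one worked example before the text to encode.
    if example_pair is not None:
        source, encoded = example_pair
        prompts["few_shot"] = (
            f"{guided}\n\nExample input:\n{source}\n"
            f"Example output:\n{encoded}\n\nText:\n{text}"
        )
    return prompts
```

Keeping all variants in one dictionary makes it straightforward to send the same passage to each model at every level of context.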
To improve efficiency and reproducibility, I will develop a script that interacts directly with the APIs of these models, automating the submission of prompts and the retrieval of outputs. Each model will process the same text extract using prompts of varying complexity, and I will systematically compare the XML-TEI annotations they generate in order to evaluate each model.
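The core loop of such a script could be sketched as follows. To avoid guessing at any provider's SDK, the example assumes each model is wrapped in a plain Python function that takes a prompt string and returns the response text; the `clients` mapping, the `run_experiment` name, and the retry settings are hypothetical.

```python
import time


def run_experiment(clients, prompts, retries=2, wait_seconds=0.0):
    """Send every prompt to every model and collect the raw outputs.

    clients: dict mapping a model name to a callable(prompt) -> str
             (hypothetical wrappers around each provider's API).
    prompts: dict mapping a prompt name to the prompt text.
    """
    results = []
    for model_name, send in clients.items():
        for prompt_name, prompt in prompts.items():
            output, error = None, None
            for _ in range(retries + 1):
                try:
                    output = send(prompt)
                    error = None
                    break
                except Exception as exc:  # record the failure and retry
                    error = str(exc)
                    time.sleep(wait_seconds)
            results.append({
                "model": model_name,
                "prompt": prompt_name,
                "output": output,
                "error": error,
            })
    return results
```

Recording failures alongside successful outputs keeps the result table complete, so a model that returns nothing for a given prompt is still visible in the later comparison.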
To analyze the results generated by each LLM, I will use Tableau to build an overview of the outputs, enabling deeper analysis of which models perform best with which prompts.
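Since Tableau can read CSV files, one possible hand-off is to write one row per model and prompt combination. The column names below (`model`, `prompt`, `well_formed`, `score`) and the function `write_results` are assumptions for illustration; the measures actually recorded will depend on the evaluation criteria chosen for the thesis.

```python
import csv

# Hypothetical columns: one row per (model, prompt) combination.
FIELDS = ["model", "prompt", "well_formed", "score"]


def write_results(rows, handle):
    """Write result dicts as CSV to an open text handle.

    Keys not listed in FIELDS (for example the raw XML output) are ignored,
    so the same dicts can carry extra data without breaking the export.
    """
    writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
```

When writing to a file, open it with `newline=""` as the `csv` module documentation recommends, for example `open("results.csv", "w", newline="", encoding="utf-8")`.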
This approach allows for a thorough, consistent evaluation of the models’ capabilities and helps identify the most effective strategies for AI-assisted digital text encoding.
Technologies used
- Python
- LLMs (ChatGPT, Gemini, Mistral, Claude)
- LLM APIs
- XML
- XML-TEI
- Tableau