REVERINO: REgesta generation VERsus latIN summarizatiOn

Puccetti, G.; Righi, L.; Sabbatini, I.; Esuli, A.

In this work we introduce the REVERINO dataset, a collection of 4533 pairs, each matching a Latin regestum with the full text of the corresponding medieval pontifical document. The pairs are extracted from two collections: Epistolae saeculi XIII e regestis pontificum Romanorum selectae (1216-1268) and Les Registres de Grégoire IX (1227-1241). We describe the pipeline used to extract the text from images of the printed pages and present a high-level analysis of the corpus. We then use REVERINO as a benchmark to test the ability of Large Language Models (LLMs) to generate the regestum of a given Latin text. We test three of the best-performing LLMs, GPT-4o, Llama 3.1 70B and Llama 3.1 405B, and find that GPT-4o is the best at generating text in Latin. Interestingly, we also find that Llama models can write better regesta by first generating a text in English and then translating it into Latin.
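The two generation strategies compared in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `generate` callable stands in for whichever LLM is used, and the function names and prompt wording are assumptions that do not come from the paper.

```python
from typing import Callable

# Hypothetical stand-in for an LLM call: any function from prompt to text.
Generate = Callable[[str], str]


def regestum_direct(document: str, generate: Generate) -> str:
    """One step: ask the model for a Latin regestum directly."""
    # Prompt wording is an assumption made for illustration.
    prompt = "Write a regestum in Latin for this Latin document:\n\n" + document
    return generate(prompt)


def regestum_via_english(document: str, generate: Generate) -> str:
    """Two steps: summarize in English first, then translate into Latin."""
    # Step 1: produce an English summary of the Latin source.
    english = generate(
        "Summarize this Latin document in English:\n\n" + document
    )
    # Step 2: translate that English summary into Latin.
    return generate("Translate this summary into Latin:\n\n" + english)
```

Passing the model in as a plain callable keeps the sketch independent of any particular API, so the same two functions can be run against each model under comparison.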

REVERINO: REgesta generation VERsus latIN summarizatiOn / Puccetti, G., Righi, L., Sabbatini, I., Esuli, A. - 3937 (2025). (21st Conference on Information and Research Science Connecting to Digital and Library Science, IRCDL 2025, Italy, 2025).



Publication year: 2025
Conference title: 21st Conference on Information and Research Science Connecting to Digital and Library Science, IRCDL 2025
Conference location: Italy
Conference date: 2025
Scopus ID: 2-s2.0-105001158243
Series: CEUR Workshop Proceedings
Volume number: 3937
Type: Conference proceedings paper

Files in this record:

File	Access	Type	License	Size	Format
WP7_PAPER_regesti___ircdl.pdf	Open access	VOR - Version of record published by the publisher	[IR] creative-commons	2.11 MB	Adobe PDF


The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International license (CC BY 4.0), unless otherwise indicated.
In case of copyright infringement, contact IRIS Support.