Generating Synthetic Data with Large Language Models for Low-Resource Sentence Retrieval / Caffagni, Davide; Cocchi, Federico; Mambelli, Anna; Tutrone, Fabio; Zanella, Marco; Cornia, Marcella; Cucchiara, Rita. - 16097 (2026), pp. 36-52. (Paper presented at the 29th International Conference on Theory and Practice of Digital Libraries, TPDL 2025, held in Tampere, Finland, September 23–26, 2025) [10.1007/978-3-032-05409-8_4].
Generating Synthetic Data with Large Language Models for Low-Resource Sentence Retrieval
Caffagni, Davide; Cocchi, Federico; Mambelli, Anna; Cornia, Marcella; Cucchiara, Rita
2026
Abstract
Sentence similarity search is a fundamental task in information retrieval, enabling applications such as search engines, question answering, and textual analysis. However, retrieval systems often struggle when training data are scarce, as is the case for low-resource languages or specialized domains such as ancient texts. To address this challenge, we propose a novel paradigm for domain-specific sentence similarity search, where the embedding space is shaped by a combination of limited real data and a large amount of synthetic data generated by Large Language Models (LLMs). Specifically, we employ LLMs to generate domain-specific sentence pairs and fine-tune a sentence embedding model, effectively distilling knowledge from the LLM to the retrieval model. We validate our method through a case study on biblical intertextuality in Latin, demonstrating that synthetic data augmentation significantly improves retrieval effectiveness in a domain with scarce annotated resources. More broadly, our approach offers a scalable and adaptable framework for enhancing retrieval in domain-specific contexts. Source code and trained models are available at https://github.com/aimagelab/biblical-retrieval-synthesis.
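
The pipeline described in the abstract has two steps: an LLM generates domain-specific sentence pairs, and those pairs then supervise contrastive fine-tuning of a sentence embedding model. The sketch below illustrates the second step using the sentence-transformers library; the base checkpoint, loss, hyperparameters, and toy Latin pairs are illustrative assumptions, not the configuration reported in the paper.

```python
# A minimal sketch, assuming synthetic pairs were already generated by an LLM.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical LLM-generated pairs of semantically related Latin sentences.
synthetic_pairs = [
    ("In principio creavit Deus caelum et terram.",
     "In principio erat Verbum, et Verbum erat apud Deum."),
    ("Ego sum via et veritas et vita.",
     "Ego sum lux mundi."),
]

# Multilingual base checkpoint (an assumption); any sentence-embedding
# model could be substituted here.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Each pair is a positive example; other sentences in the batch act as negatives.
train_examples = [InputExample(texts=[a, b]) for a, b in synthetic_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Contrastive fine-tuning: pulls paired sentences together in embedding space.
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
model.save("latin-retrieval-model")
```

One reason a loss of this kind suits synthetic augmentation is that it needs only positive pairs: every other sentence in the batch serves as an in-batch negative, so the LLM does not have to generate explicit negatives.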
| File | Type | License | Size | Format |
|---|---|---|---|---|
| Mambelli_Anna_2025_TPDL_Latin_Embeddings.pdf (restricted access) | VOR - Publisher's version | [IR] closed | 1.81 MB | Adobe PDF |
Metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact Supporto Iris.