Sentence similarity search is a fundamental task in information retrieval, enabling applications such as search engines, question answering, and textual analysis. However, retrieval systems often struggle when training data are scarce, as is the case for low-resource languages or specialized domains such as ancient texts. To address this challenge, we propose a novel paradigm for domain-specific sentence similarity search, where the embedding space is shaped by a combination of limited real data and a large amount of synthetic data generated by Large Language Models (LLMs). Specifically, we employ LLMs to generate domain-specific sentence pairs and fine-tune a sentence embedding model, effectively distilling knowledge from the LLM to the retrieval model. We validate our method through a case study on biblical intertextuality in Latin, demonstrating that synthetic data augmentation significantly improves retrieval effectiveness in a domain with scarce annotated resources. More broadly, our approach offers a scalable and adaptable framework for enhancing retrieval in domain-specific contexts.

Generating Synthetic Data with Large Language Models for Low-Resource Sentence Retrieval / Caffagni, Davide; Cocchi, Federico; Mambelli, Anna; Tutrone, Fabio; Zanella, Marco; Cornia, Marcella; Cucchiara, Rita. - (2025). (Intervento presentato al convegno 29th International Conference on Theory and Practice of Digital Libraries tenutosi a Tampere, Finland nel September 23-26).

Generating Synthetic Data with Large Language Models for Low-Resource Sentence Retrieval

Davide Caffagni;Federico Cocchi;Anna Mambelli;Marcella Cornia;Rita Cucchiara
2025

Abstract

Sentence similarity search is a fundamental task in information retrieval, enabling applications such as search engines, question answering, and textual analysis. However, retrieval systems often struggle when training data are scarce, as is the case for low-resource languages or specialized domains such as ancient texts. To address this challenge, we propose a novel paradigm for domain-specific sentence similarity search, where the embedding space is shaped by a combination of limited real data and a large amount of synthetic data generated by Large Language Models (LLMs). Specifically, we employ LLMs to generate domain-specific sentence pairs and fine-tune a sentence embedding model, effectively distilling knowledge from the LLM to the retrieval model. We validate our method through a case study on biblical intertextuality in Latin, demonstrating that synthetic data augmentation significantly improves retrieval effectiveness in a domain with scarce annotated resources. More broadly, our approach offers a scalable and adaptable framework for enhancing retrieval in domain-specific contexts.
2025
29th International Conference on Theory and Practice of Digital Libraries
Tampere, Finland
September 23-26
Caffagni, Davide; Cocchi, Federico; Mambelli, Anna; Tutrone, Fabio; Zanella, Marco; Cornia, Marcella; Cucchiara, Rita
Generating Synthetic Data with Large Language Models for Low-Resource Sentence Retrieval / Caffagni, Davide; Cocchi, Federico; Mambelli, Anna; Tutrone, Fabio; Zanella, Marco; Cornia, Marcella; Cucchiara, Rita. - (2025). (Intervento presentato al convegno 29th International Conference on Theory and Practice of Digital Libraries tenutosi a Tampere, Finland nel September 23-26).
File in questo prodotto:
Non ci sono file associati a questo prodotto.
Pubblicazioni consigliate

Licenza Creative Commons
I metadati presenti in IRIS UNIMORE sono rilasciati con licenza Creative Commons CC0 1.0 Universal, mentre i file delle pubblicazioni sono rilasciati con licenza Attribuzione 4.0 Internazionale (CC BY 4.0), salvo diversa indicazione.
In caso di violazione di copyright, contattare Supporto Iris

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11380/1380890
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact