Synthetic dataset generation for theft event extraction in Italian

Bonisoli, Giovanni; Rollo, Federica; Po, Laura

doi:10.1016/j.ipm.2026.104833

Event extraction is the task of automatically identifying and extracting structured information about events from unstructured text. Although Italian is a well-resourced language, it still lacks annotated datasets specifically designed for fine-grained event extraction. To address this gap, and motivated by the high cost and limited scalability of manual annotation, we propose a novel methodology for generating synthetic data suitable for fine-grained event extraction tasks. We introduce a controlled synthetic data generation pipeline that strictly adheres to a target annotation schema, providing a scalable alternative to extensive human labeling. The key methodological innovation is a two-phase, document-level generation framework that leverages Large Language Models, ensures structural consistency, and mitigates generation biases, enabling the creation of high-quality datasets for complex event extraction scenarios. Using this methodology, we release SYNTH-ITA, the first collection of four medium-scale synthetic datasets for fine-grained Italian event extraction, each generated from 10,000 structured crime scenarios. Experiments on event argument extraction using a QA formulation show that models fine-tuned on SYNTH-ITA perform better than or comparably to models fine-tuned on 200 manually annotated real news articles (+14% with ELECTRA, -0.4% with BERT). Conversely, NER-based models for event argument extraction trained on synthetic data exhibit an 18% performance drop compared to those trained on manually annotated articles.

Synthetic dataset generation for theft event extraction in Italian / Bonisoli, G., Rollo, F., Po, L. - In: INFORMATION PROCESSING & MANAGEMENT. - ISSN 0306-4573. - 63:7(2026), pp. N/A-N/A. [10.1016/j.ipm.2026.104833]

Publication year	2026
Journal	INFORMATION PROCESSING & MANAGEMENT
Volume	63
Issue	7
First page	N/A
Last page	N/A
DOI	https://dx.doi.org/10.1016/j.ipm.2026.104833
Scopus ID	2-s2.0-105037862147
Citation	Synthetic dataset generation for theft event extraction in Italian / Bonisoli, G., Rollo, F., Po, L. - In: INFORMATION PROCESSING & MANAGEMENT. - ISSN 0306-4573. - 63:7(2026), pp. N/A-N/A. [10.1016/j.ipm.2026.104833]
All authors	Bonisoli, Giovanni; Rollo, Federica; Po, Laura
Type	Journal article

Files in this record:

File	Size	Format
paper.pdf (Open access; Type: VOR - version published by the publisher; License: Creative Commons)	2.65 MB	Adobe PDF


The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.