Synthetic dataset generation for theft event extraction in Italian / Bonisoli, Giovanni; Rollo, Federica; Po, Laura. - In: INFORMATION PROCESSING & MANAGEMENT. - ISSN 0306-4573. - 63:7(2026), pp. 104833-104833. [10.1016/j.ipm.2026.104833]

Synthetic dataset generation for theft event extraction in Italian

Bonisoli, Giovanni; Rollo, Federica; Po, Laura
2026

Abstract

Event extraction is the task of automatically identifying and extracting structured information about events from unstructured text. Although Italian is a well-resourced language, it still lacks annotated datasets designed specifically for fine-grained event extraction. To address this gap, and motivated by the high cost and limited scalability of manual annotation, we propose a novel methodology for generating synthetic data suitable for fine-grained event extraction tasks. We introduce a controlled synthetic data generation pipeline that strictly adheres to a target annotation schema, providing a scalable alternative to extensive human labeling. The key methodological innovation is a two-phase, document-level generation framework that leverages Large Language Models, ensures structural consistency, and mitigates generation biases, enabling the creation of high-quality datasets for complex event extraction scenarios. Using this methodology, we release SYNTH-ITA, the first collection of four medium-scale synthetic datasets for fine-grained Italian event extraction, each generated from 10,000 structured crime scenarios. Experiments on event argument extraction using a question answering (QA) formulation show that models fine-tuned on SYNTH-ITA perform better than or comparably to models fine-tuned on 200 manually annotated real news articles (+14% with ELECTRA, -0.4% with BERT). Conversely, NER-based models for event argument extraction trained on synthetic data exhibit an 18% performance drop compared to those trained on manually annotated articles.
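
To make the QA formulation of event argument extraction more concrete, the following is a minimal sketch of how one annotated document might be turned into extractive QA training examples, one question per argument role. It is not the authors' implementation: the role names, the question templates, the `QAExample` record, the `build_qa_examples` helper, and the Italian example sentence are all hypothetical, since the abstract does not specify the annotation schema or the question wording.

```python
from dataclasses import dataclass

# Hypothetical role-to-question templates. The actual schema and question
# wording used for SYNTH-ITA are not given in the abstract.
ROLE_QUESTIONS = {
    "stolen_item": "What was stolen?",
    "location": "Where did the theft take place?",
    "victim": "Who was the victim?",
}


@dataclass
class QAExample:
    role: str
    question: str
    context: str
    answer_text: str   # empty string when the role is not expressed in the text
    answer_start: int  # character offset in context, or -1 when unanswerable


def build_qa_examples(context: str, arguments: dict) -> list:
    """Create one extractive QA example per role in ROLE_QUESTIONS.

    `arguments` maps a role name to the argument text annotated in `context`.
    Roles that are missing, or whose text cannot be located in the context,
    become unanswerable examples (empty answer, offset -1).
    """
    examples = []
    for role, question in ROLE_QUESTIONS.items():
        answer = arguments.get(role, "")
        start = context.find(answer) if answer else -1
        if start == -1:
            answer = ""
        examples.append(QAExample(role, question, context, answer, start))
    return examples


# Invented example sentence and annotations, for illustration only.
context = "Ieri sera un ladro ha rubato una bicicletta a Modena."
arguments = {"stolen_item": "una bicicletta", "location": "Modena"}

examples = build_qa_examples(context, arguments)
for ex in examples:
    print(ex.role, "|", ex.question, "|", repr(ex.answer_text), "|", ex.answer_start)
```

Records of this shape (question, context, answer text, answer start offset) resemble the input commonly used to fine-tune extractive QA models such as BERT or ELECTRA, with unanswerable roles kept so a model can learn to abstain. Whether the paper handles absent arguments this way is an assumption.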
Files in this item:
paper.pdf (Open access)
Type: VOR - Version of record published by the publisher
Size: 2.65 MB
Format: Adobe PDF
Creative Commons License
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact IRIS Support.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1405569