Event extraction is the task of automatically identifying and extracting structured information about events from unstructured text. Although Italian is a well-resourced language, it still lacks annotated datasets specifically designed for fine-grained event extraction. To address this gap, we propose a novel methodology for generating synthetic data suitable for fine-grained event extraction tasks. This work is motivated by the high cost and limited scalability of manual annotation. We introduce a controlled synthetic data generation pipeline that strictly adheres to a target annotation schema, providing a scalable alternative to extensive human labeling. The key methodological innovation is a two-phase, document-level generation framework that leverages Large Language Models, ensures structural consistency, and mitigates generation biases, enabling the creation of high-quality datasets for complex event extraction scenarios. Using this methodology, we release SYNTH-ITA, the first collection of four medium-scale synthetic datasets for fine-grained Italian event extraction, each generated from 10,000 structured crime scenarios. Experiments on event argument extraction using a QA formulation show that models fine-tuned on SYNTH-ITA perform better than or comparably to models fine-tuned on 200 manually annotated real news articles (+14% with ELECTRA, -0.4% with BERT). Conversely, NER-based models for event argument extraction trained on synthetic data exhibit an 18% performance drop compared to those trained on manually annotated articles.
Synthetic dataset generation for theft event extraction in Italian / Bonisoli, Giovanni; Rollo, Federica; Po, Laura. - In: INFORMATION PROCESSING & MANAGEMENT. - ISSN 0306-4573. - 63:7(2026), pp. 104833-104833. [10.1016/j.ipm.2026.104833]
Synthetic dataset generation for theft event extraction in Italian
Bonisoli, Giovanni; Rollo, Federica; Po, Laura
2026
| File | Access | Type | Size | Format |
|---|---|---|---|---|
| paper.pdf | Open access | VOR (version published by the publisher) | 2.65 MB | Adobe PDF |

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.




