Supervised and Unsupervised Categorization of an Imbalanced Italian Crime News Dataset

Rollo, F.; Bonisoli, G.; Po, L.

doi:10.1007/978-3-030-98997-2_6

The automatic categorization of crime news is useful for building statistics on the types of crime occurring in a given area. This task can be treated as a text categorization problem. Several studies have shown that word embeddings improve results in many Natural Language Processing (NLP) tasks, including text categorization. This paper explores the use of word embeddings for the categorization of Italian crime news. The approach compares different document pre-processing steps, Word2Vec models, and methods for obtaining word embeddings, including the extraction of bigrams and keyphrases. Supervised and unsupervised Machine Learning categorization algorithms are then applied and compared. In addition, the imbalance of the input dataset is addressed with the Synthetic Minority Oversampling Technique (SMOTE), which oversamples the elements of the minority classes. Experiments conducted on an Italian dataset of 17,500 crime news articles collected from 2011 to 2021 show very promising results. Supervised categorization proved better than unsupervised categorization, exceeding 80% in both precision and recall and reaching an accuracy of 0.86. Furthermore, lemmatization, bigrams, and keyphrase extraction do not make a decisive difference. Finally, our model is available on GitHub together with the code we used to extract word embeddings, so the approach can be replicated on other corpora, in Italian or in other languages.
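The oversampling step mentioned in the abstract can be illustrated with a minimal sketch of the SMOTE idea: each synthetic sample is a random point on the segment between a minority-class sample and one of its nearest minority-class neighbours. This is not the authors' code. The function name `smote_oversample`, the parameters `k` and `seed`, and the toy two-dimensional vectors are assumptions made for illustration only; in practice one would more likely rely on an existing library implementation applied to the document embeddings.

```python
import math
import random


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their nearest minority-class neighbours (SMOTE-style).

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy vectors standing in for document embeddings of a rare crime category.
minority_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_vectors = smote_oversample(minority_vectors, n_new=4)
print(len(new_vectors))
```

Because every synthetic point lies between two real minority samples, the new data stays inside the region already occupied by that class instead of simply duplicating existing articles.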

Supervised and Unsupervised Categorization of an Imbalanced Italian Crime News Dataset / Rollo, F., Bonisoli, G., Po, L.. - 442:(2022), pp. 117-139. (16th Conference on Information Systems Management, ISM 2021 and Information Systems and Technologies conference track, FedCSIS-IST 2021 Held as Part of 16th Conference on Computer Science and Information Systems, FedCSIS 2021 Virtual, Online 2021) [10.1007/978-3-030-98997-2_6].

Publication year: 2022
Conference title: 16th Conference on Information Systems Management, ISM 2021 and Information Systems and Technologies conference track, FedCSIS-IST 2021 Held as Part of 16th Conference on Computer Science and Information Systems, FedCSIS 2021
Conference location: Virtual, Online
Conference date: 2021
DOI: https://dx.doi.org/10.1007/978-3-030-98997-2_6
WoS ID: WOS:000787758500006
Scopus ID: 2-s2.0-85127930412
Series: LECTURE NOTES IN BUSINESS INFORMATION PROCESSING
Volume number: 442
First page: 117
Last page: 139
All authors: Rollo, F.; Bonisoli, G.; Po, L.
Type: Paper in conference proceedings

Files in this item:

File	Size	Format
Rollo et al - Supervised and Unsupervised Categorization of an Imbalanced Italian Crime News Dataset.pdf (Open access; type: AO - author's original version submitted for publication)	968.84 kB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact IRIS Support.