A comparison of approaches for measuring the semantic similarity of short texts based on word embeddings

Babic, K.; Guerra, F.; Martincic-Ipsic, S.; Mestrovic, A.

doi:10.31341/jios.44.2.2

Measuring the semantic similarity of texts plays a vital role in various natural language processing tasks. In this paper, we describe a set of experiments carried out to evaluate and compare the performance of different approaches for measuring the semantic similarity of short texts. We compare four models based on word embeddings: two variants of Word2Vec (one trained on a specific dataset and a second that extends it with embeddings of word senses), FastText, and TF-IDF. Since these models provide word vectors, we experiment with various methods that calculate the semantic similarity of short texts from word vectors. More precisely, for each of these models, we test five methods for aggregating word embeddings into a text embedding. We introduce three of these methods as variations of two commonly used similarity measures: one is an extension of the centroid-based cosine similarity, and the other two are variations of the Okapi BM25 function. We evaluate all approaches on two publicly available datasets, SICK and Lee, in terms of the Pearson and Spearman correlations. The results indicate that the extended methods perform better than the original ones in most cases.
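As a rough illustration of the baseline that the abstract refers to, the sketch below computes a centroid-based cosine similarity between two short texts and then scores the predictions against reference ratings with the Pearson and Spearman correlations. It is a minimal sketch under stated assumptions, not the authors' implementation: the three-dimensional vectors in `TOY_VECTORS`, the example sentence pairs, and the reference scores are all invented for illustration, and the extended cosine and BM25 variants proposed in the paper are not reproduced here.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
# In practice these would come from a trained model such as Word2Vec or FastText.
TOY_VECTORS = {
    "a": [0.1, 0.0, 0.1],
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "cat": [0.7, 0.3, 0.0],
    "runs": [0.1, 0.9, 0.1],
    "sleeps": [0.0, 0.8, 0.3],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
    "falls": [0.2, 0.5, 0.6],
}


def centroid(tokens, vectors):
    """Average the vectors of all known tokens; return None if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid_similarity(text_a, text_b, vectors):
    """Cosine similarity between the centroids of two whitespace-tokenized texts."""
    ca = centroid(text_a.lower().split(), vectors)
    cb = centroid(text_b.lower().split(), vectors)
    if ca is None or cb is None:
        return 0.0
    return cosine(ca, cb)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))


# Invented sentence pairs with made-up reference scores on a 1-5 scale.
PAIRS = [
    ("a dog runs", "a puppy runs", 4.8),
    ("a cat sleeps", "a dog sleeps", 4.0),
    ("a dog runs", "stock market falls", 1.0),
]
predicted = [centroid_similarity(a, b, TOY_VECTORS) for a, b, _ in PAIRS]
gold = [score for _, _, score in PAIRS]
print("Pearson:", pearson(predicted, gold))
print("Spearman:", spearman(predicted, gold))
```

Averaging word vectors ignores word order and weights every token equally, which is one reason weighted or BM25-style aggregations are worth comparing against this baseline.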

A comparison of approaches for measuring the semantic similarity of short texts based on word embeddings / Babic, K., Guerra, F., Martincic-Ipsic, S., Mestrovic, A.. - In: JOURNAL OF INFORMATION AND ORGANIZATIONAL SCIENCES. - ISSN 1846-3312. - 44:2(2020), pp. 231-246. [10.31341/jios.44.2.2]

2020


Year of publication: 2020
Journal: JOURNAL OF INFORMATION AND ORGANIZATIONAL SCIENCES
Volume: 44
Issue: 2
First page: 231
Last page: 246
DOI: https://dx.doi.org/10.31341/jios.44.2.2
WoS ID: WOS:000597419200003
Scopus ID: 2-s2.0-85097614747
Type: Journal article

Files in this item:

File	Size	Format
dagrabar,+2.+paper.pdf (Open access; Type: VOR, version published by the publisher; License: [IR] creative-commons)	950.53 kB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.