Evaluating the semantic similarity of texts is of paramount importance in real-world applications. In this paper, we describe experiments carried out to evaluate the performance of different forms of word embeddings and their aggregations in measuring the similarity of short texts. In particular, we explore the results obtained with two publicly available pre-trained word embeddings (one based on word2vec trained on a specific dataset, and a second that extends it with embeddings of word senses). We test five approaches for aggregating words into text. Two approaches are based on centroids and summarize a text as a single vector in the word-embedding space. The other three are variations of the Okapi BM25 function and directly provide a measure of the similarity of two texts.
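The centroid idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible variant: an unweighted average of word vectors, compared with cosine similarity. The abstract does not specify the exact centroid formulations used in the paper, so the function names (`centroid`, `cosine`, `centroid_similarity`) and the tiny three-dimensional `TOY_EMBEDDINGS` table are hypothetical and serve only as an illustration, not as the authors' implementation.

```python
import math

# Toy 3-dimensional embeddings, invented purely for illustration.
# A real experiment would load pre-trained word2vec vectors instead.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "sits": [0.1, 0.9, 0.0],
    "runs": [0.2, 0.8, 0.1],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.9],
}


def centroid(tokens, embeddings):
    """Return the unweighted mean of the vectors of all known tokens.

    Tokens without an embedding are skipped; returns None if no token is known.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid_similarity(text_a, text_b, embeddings=TOY_EMBEDDINGS):
    """Compare two short texts through the cosine of their centroids."""
    # Whitespace tokenization is an assumption made to keep the sketch short.
    c_a = centroid(text_a.lower().split(), embeddings)
    c_b = centroid(text_b.lower().split(), embeddings)
    if c_a is None or c_b is None:
        return 0.0
    return cosine(c_a, c_b)


if __name__ == "__main__":
    print(centroid_similarity("cat sits", "dog runs"))
    print(centroid_similarity("cat sits", "stock market"))
```

With these toy vectors, the first pair of texts scores higher than the second, because their centroids point in nearly the same direction. A weighted centroid (for example, one that down-weights frequent words) would change only the averaging step in `centroid`.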

Short Texts Semantic Similarity Based on Word Embeddings / Babić, Karlo; Martinčić-Ipšić, Sanda; Meštrović, Ana; Guerra, Francesco. - (2019), pp. 27-33. (Paper presented at the 30th Central European Conference on Information and Intelligent Systems (CECIIS), held in Varaždin, Croatia, October 2-4, 2019).

Short Texts Semantic Similarity Based on Word Embeddings

Francesco Guerra
2019

Year: 2019
Conference: 30th Central European Conference on Information and Intelligent Systems (CECIIS)
Location: Varaždin, Croatia
Dates: October 2-4, 2019
First page: 27
Last page: 33
Authors: Babić, Karlo; Martinčić-Ipšić, Sanda; Meštrović, Ana; Guerra, Francesco
Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1183097