We present BLAST2, a novel technique to efficiently extract loose schema information, i.e., metadata that can serve as a surrogate of the schema alignment task within the Entity Resolution (ER) process (the task of identifying records that refer to the same real-world entity) when integrating multiple, heterogeneous and voluminous data sources. The loose schema information can be extracted from both structured and unstructured data sources, and it is exploited to reduce the overall complexity of ER, whose naïve solution would imply O(n^2) comparisons, where n is the number of entity representations involved in the process. BLAST2 is completely unsupervised, yet it achieves almost the same precision and recall as supervised state-of-the-art schema alignment techniques when employed for Entity Resolution tasks, as shown in our experimental evaluation performed on two real-world data sets (composed of 7 and 10 data sources, respectively).
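
To make the quadratic cost mentioned in the abstract concrete, the following minimal Python sketch counts the comparisons a naïve ER pass would perform on a toy collection and contrasts that with a generic schema-agnostic token-blocking scheme. It is an illustration only and is not BLAST2's algorithm: the function names `naive_pairs` and `token_blocking_pairs` and the toy records are assumptions introduced here.

```python
from collections import defaultdict
from itertools import combinations


def naive_pairs(records):
    """Return every unordered pair of record ids: n * (n - 1) / 2 comparisons."""
    return set(combinations(sorted(records), 2))


def token_blocking_pairs(records):
    """Return only the pairs of records that share at least one token.

    Generic token blocking, used here purely for illustration
    (an assumption, not the technique described in the paper).
    """
    blocks = defaultdict(set)
    for record_id, text in records.items():
        for token in text.lower().split():
            blocks[token].add(record_id)

    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy records: two descriptions each of two products.
records = {
    1: "Apple iPhone 11",
    2: "iphone 11 by apple",
    3: "Samsung Galaxy S10",
    4: "galaxy s10 samsung",
}

print(len(naive_pairs(records)))              # all pairs among 4 records
print(sorted(token_blocking_pairs(records)))  # candidate pairs after blocking
```

On this toy input the naïve pass compares all pairs, while blocking keeps only the pairs that share a token. Any metadata that restricts which records need to be compared reduces the ER workload in the same way.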

BLAST2: An Efficient Technique for Loose Schema Information Extraction from Heterogeneous Big Data Sources / BENEVENTANO, Domenico; BERGAMASCHI, Sonia; GAGLIARDELLI, Luca; SIMONINI, Giovanni. - In: ACM JOURNAL OF DATA AND INFORMATION QUALITY. - ISSN 1936-1955. - 12:4(2020), pp. 1-20. [10.1145/3394957]

BLAST2: An Efficient Technique for Loose Schema Information Extraction from Heterogeneous Big Data Sources

DOMENICO BENEVENTANO; SONIA BERGAMASCHI; LUCA GAGLIARDELLI; GIOVANNI SIMONINI
2020

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1201265