BLAST: a Loosely Schema-aware Meta-blocking Approach for Entity Resolution / Simonini, Giovanni; Bergamaschi, Sonia; Jagadish, H. V. - In: PROCEEDINGS OF THE VLDB ENDOWMENT. - ISSN 2150-8097. - Print. - 9:12(2016), pp. 1173-1184. [10.14778/2994509.2994533]

BLAST: a Loosely Schema-aware Meta-blocking Approach for Entity Resolution

Simonini, Giovanni; Bergamaschi, Sonia; Jagadish, H. V.
2016

Abstract

Identifying records that refer to the same entity is a fundamental step for data integration. Since it is prohibitively expensive to compare every pair of records, blocking techniques are typically employed to reduce the complexity of this task. These techniques partition records into blocks and limit the comparison to records co-occurring in a block. Generally, to deal with highly heterogeneous and noisy data (e.g., semi-structured data of the Web), these techniques rely on redundancy to reduce the chance of missing matches. Meta-blocking is the task of restructuring blocks generated by redundancy-based blocking techniques, removing superfluous comparisons. Existing meta-blocking approaches rely exclusively on schema-agnostic features. In this paper, we demonstrate how “loose” schema information (i.e., statistics collected directly from the data) can be exploited to enhance the quality of the blocks in a holistic loosely schema-aware (meta-)blocking approach that can be used to speed up your favorite Entity Resolution algorithm. We call it Blast (Blocking with Loosely-Aware Schema Techniques). We show how Blast can automatically extract this loose information by adopting an LSH-based step for efficiently scaling to large datasets. We experimentally demonstrate, on real-world datasets, how Blast outperforms the state-of-the-art unsupervised meta-blocking approaches, and, in many cases, also the supervised one.
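To make the notion of redundancy-based blocking concrete, the following is a minimal sketch of schema-agnostic token blocking, a common baseline in which every token appearing in any attribute value defines a block. It is not the Blast algorithm; the function names (`token_blocking`, `co_occurrence`) and the toy records are hypothetical and chosen only for illustration. Counting how many blocks a pair shares is one simple signal that a meta-blocking step could use to discard superfluous comparisons.

```python
from collections import Counter, defaultdict
from itertools import combinations


def token_blocking(records):
    """Build one block per token, ignoring attribute names (schema-agnostic)."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        for value in record.values():
            for token in str(value).lower().split():
                blocks[token].add(rid)
    # A block with a single record yields no comparison, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def co_occurrence(blocks):
    """Count, for each record pair, the number of blocks the pair shares."""
    counts = Counter()
    for ids in blocks.values():
        counts.update(combinations(sorted(ids), 2))
    return counts


# Toy records with heterogeneous schemas (illustrative data only).
records = {
    1: {"name": "John Smith", "city": "Modena"},
    2: {"full_name": "Smith John", "location": "Modena Italy"},
    3: {"name": "Mary Jones", "city": "Ann Arbor"},
    4: {"title": "M. Jones", "addr": "Ann Arbor MI"},
}

blocks = token_blocking(records)
weights = co_occurrence(blocks)
pairs = set(weights)                                   # distinct candidate pairs
all_pairs = len(records) * (len(records) - 1) // 2     # exhaustive comparisons

print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
print(dict(weights))
```

In this toy input each candidate pair co-occurs in several blocks, which is the redundancy the abstract refers to: it lowers the chance of missing a match, at the cost of repeated or superfluous comparisons that meta-blocking then tries to remove.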
Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1111659