Entity Resolution, the task of identifying records that refer to the same real-world entity, is a fundamental step in data integration. Blocking is a widely employed technique to avoid comparing all possible record pairs in a dataset (an inefficient approach). Forgoing schema information in blocking has been shown to limit the chance of missing matches (i.e., it achieves high recall), at the cost of low precision. Meta-blocking alleviates this issue by restructuring a block collection, removing redundant and superfluous comparisons. Yet, existing meta-blocking techniques rely exclusively on schema-agnostic features. In this paper, we investigate how loose schema information, induced directly from the data, can be exploited in a holistic, loosely schema-aware (meta-)blocking approach that outperforms state-of-the-art meta-blocking in terms of precision without sacrificing high recall. We implemented our idea in a system called Blast and experimentally evaluated it on real-world datasets.
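To make the terminology concrete, the following is a minimal sketch of schema-agnostic blocking followed by a simple meta-blocking step. It is not the Blast algorithm and does not use loose schema information. The function names, the toy records, the weighting scheme (number of shared blocks), and the pruning rule (keep pairs whose weight is at least the mean) are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records):
    """Schema-agnostic blocking: one block per token, ignoring attribute names."""
    blocks = defaultdict(set)
    for rid, attrs in records.items():
        for value in attrs.values():
            for token in str(value).lower().split():
                blocks[token].add(rid)
    # A block holding a single record produces no comparison, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def meta_blocking(blocks):
    """Illustrative meta-blocking: weight each candidate pair by the number
    of blocks it shares, then keep pairs at or above the mean weight."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            weights[pair] += 1  # repeated pairs are the redundant comparisons
    if not weights:
        return set()
    mean = sum(weights.values()) / len(weights)
    # Low-weight pairs are treated as superfluous and pruned.
    return {pair for pair, w in weights.items() if w >= mean}


# Hypothetical records with different schemas (attribute names are never used).
records = {
    1: {"name": "John Smith", "city": "Modena"},
    2: {"full_name": "Smith John", "location": "Modena"},
    3: {"name": "Mary Jones", "city": "Modena"},
}

blocks = token_blocking(records)
candidates = meta_blocking(blocks)
```

In this toy input the blocks suggest five comparisons in total, three of which repeat the pair (1, 2). The weighting step collapses the repeats into one weighted pair, and the pruning step discards the two pairs that share only the frequent token "modena".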
Enhancing entity resolution efficiency with loosely schema-aware techniques - Discussion paper / Simonini, G.; Bergamaschi, S. - (2016), pp. 270-277. (Paper presented at the 24th Italian Symposium on Advanced Database Systems, SEBD 2016, held in Italy in 2016).
Enhancing entity resolution efficiency with loosely schema-aware techniques - Discussion paper
Simonini G.;Bergamaschi S.
2016
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.