A fundamental service for exploiting the large data sources now available online is the ability to identify the topics of the data they contain. Unfortunately, the heterogeneity of these sources and the lack of centralized control make it difficult to identify the topics directly from the actual values used in the sources. We present an approach that generates signatures of sources and matches them against a reference vocabulary of concepts, through the concepts' own signatures, to produce a description of the topics of each source in terms of that vocabulary. The reference vocabulary may be provided ready-made, created manually, or created by applying our signature-generation algorithm to a well-curated data source whose topics are clearly identified. In our case, we used DBpedia to create the vocabulary, since it is one of the largest known collections of entities and concepts. The signatures are generated by exploiting the entropy and the mutual information of the attributes of the sources to produce semantic identifiers for the individual attributes, which together form a unique signature of the concepts (i.e., the topics) of the source. Because the identifiers are based on the entropy of the attribute values, they are independent of naming heterogeneity across attributes or tables. Although traditional information-theoretic quantities such as entropy and mutual information have been used before, they can become unreliable because of their sensitivity to overfitting, and they require the same number of samples as was used to construct the reference vocabulary. To overcome these limitations, we normalize and use pseudo-additive entropy measures, which automatically downweight vocabulary items and property values with very low frequencies, yielding a more stable solution than their traditional counterparts.
We have implemented our approach in a system called WHATSIT, and we experimentally demonstrate its effectiveness.
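
The entropy quantities mentioned in the abstract can be illustrated with a minimal sketch. It is not the WHATSIT implementation: the abstract does not name the exact measure, so the Tsallis entropy is used here only as one well-known pseudo-additive entropy, and the function names, the parameter `q`, and the normalization by the maximum attainable value are all assumptions made for illustration.

```python
import math
from collections import Counter


def value_distribution(values):
    """Relative frequency of each distinct value of one attribute."""
    if not values:
        raise ValueError("an attribute needs at least one value")
    counts = Counter(values)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def shannon_entropy(values):
    """Traditional Shannon entropy (in bits) of an attribute's values."""
    return -sum(p * math.log2(p) for p in value_distribution(values))


def tsallis_entropy(values, q=2.0):
    """Tsallis entropy, used here as an example of a pseudo-additive entropy.

    For q > 1 each probability enters as p**q, so values with very low
    frequency contribute little to the result.
    """
    if q == 1.0:
        raise ValueError("q must differ from 1 (that limit is Shannon entropy)")
    return (1.0 - sum(p ** q for p in value_distribution(values))) / (q - 1.0)


def normalized_tsallis_entropy(values, q=2.0):
    """Tsallis entropy divided by its maximum for the same number of
    distinct values, giving a score in [0, 1] (assumed normalization)."""
    n = len(set(values))
    if n < 2:
        return 0.0
    max_entropy = (1.0 - n ** (1.0 - q)) / (q - 1.0)
    return tsallis_entropy(values, q) / max_entropy


# The functions see only the values, never the attribute or table name,
# so renaming a column cannot change the result.
genre_column = ["rock", "jazz", "rock", "jazz"]
print(shannon_entropy(genre_column))
print(normalized_tsallis_entropy(genre_column))
```

Because only the value distribution is inspected, two columns with different names but the same distribution receive the same score, which is the sense in which entropy-based identifiers are independent of naming heterogeneity.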

Providing Insight into Data Source Topics / Bergamaschi, Sonia; Ferrari, Davide; Guerra, Francesco; Simonini, Giovanni; Velegrakis, Yannis. - In: JOURNAL ON DATA SEMANTICS. - ISSN 1861-2032. - STAMPA. - 5:(2016), pp. 211-228. [10.1007/s13740-016-0063-6]

Providing Insight into Data Source Topics

Bergamaschi, Sonia; Guerra, Francesco; Simonini, Giovanni
2016

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1111660