Providing Insight into Data Source Topics

A fundamental service for exploiting the large data sources now available online is the ability to identify the topics of the data they contain. Unfortunately, the heterogeneity of these sources and the lack of centralized control make it difficult to identify the topics directly from the actual values used in the sources. We present an approach that generates a signature for each source and matches it against the signatures of the concepts in a reference vocabulary, producing a description of the topics of the source in terms of that vocabulary. The reference vocabulary may be provided ready-made, created manually, or created by applying our signature-generation algorithm to a well-curated data source whose topics are clearly identified. In our case, we used DBpedia to create the vocabulary, since it is one of the largest known collections of entities and concepts. The signatures are generated by exploiting the entropy and the mutual information of the attributes of the sources to produce semantic identifiers of the attributes, which together form a unique signature of the concepts (i.e., the topics) of the source. Because the identifiers are based on the entropy of the attribute values, they are independent of naming heterogeneity in attributes or tables. Although the use of traditional information-theoretic quantities such as entropy and mutual information is not new, these quantities can become untrustworthy because they are sensitive to overfitting, and they require the same number of samples as was used to construct the reference vocabulary. To overcome these limitations, we normalize and use pseudo-additive entropy measures, which automatically downweight vocabulary items and property values with very low frequencies and yield a more stable solution than their traditional counterparts.
We have implemented our approach in a system called WHATSIT, and we experimentally demonstrate its effectiveness.
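
As a rough illustration of the quantities named in the abstract, the following minimal sketch computes the Shannon entropy of an attribute, the mutual information between two attributes, and a normalized pseudo-additive entropy. It is not the WHATSIT implementation: the abstract does not give the exact measures, so the sketch assumes Tsallis entropy as the pseudo-additive measure, and the parameter `q` and the helper `attribute_signature` are hypothetical names introduced only for this example.

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    joint = list(zip(xs, ys))
    return shannon_entropy(xs) + shannon_entropy(ys) - shannon_entropy(joint)


def normalized_tsallis_entropy(values, q=2.0):
    """Tsallis entropy S_q = (1 - sum p_i^q) / (q - 1), scaled to [0, 1].

    Assumption: Tsallis entropy stands in for the paper's pseudo-additive
    measure. For q > 1, values with very low frequency contribute little.
    """
    if q == 1.0:
        raise ValueError("q = 1 is the Shannon limit; use shannon_entropy")
    n = len(values)
    counts = Counter(values)
    k = len(counts)
    if k < 2:
        return 0.0  # a constant attribute carries no information
    s_q = (1.0 - sum((c / n) ** q for c in counts.values())) / (q - 1.0)
    s_max = (1.0 - k ** (1.0 - q)) / (q - 1.0)  # value for a uniform distribution
    return s_q / s_max


def attribute_signature(table, q=2.0):
    """Hypothetical signature: sorted per-attribute entropies, names ignored.

    `table` maps attribute names to lists of values.
    """
    return tuple(sorted(round(normalized_tsallis_entropy(col, q), 6)
                        for col in table.values()))


if __name__ == "__main__":
    source_a = {"city": ["a", "a", "a", "b"], "id": [1, 2, 3, 4]}
    source_b = {"town": ["x", "y", "y", "y"], "key": [9, 8, 7, 6]}
    # The two sources use different attribute names and values, yet their
    # value distributions, and therefore their signatures, coincide.
    print(attribute_signature(source_a) == attribute_signature(source_b))
```

Because the signature depends only on how values are distributed, renaming attributes or tables leaves it unchanged, which is the property the abstract attributes to entropy-based identifiers.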

Providing Insight into Data Source Topics / Bergamaschi, Sonia; Ferrari, Davide; Guerra, Francesco; Simonini, Giovanni; Velegrakis, Yannis. - In: JOURNAL ON DATA SEMANTICS. - ISSN 1861-2032. - PRINT. - 5:(2016), pp. 211-228. [10.1007/s13740-016-0063-6]


Bergamaschi, Sonia; Ferrari, Davide; Guerra, Francesco; Simonini, Giovanni; Velegrakis, Yannis

2016



Year of publication	2016
Journal	JOURNAL ON DATA SEMANTICS
Volume	5
First page	211
Last page	228
DOI	https://dx.doi.org/10.1007/s13740-016-0063-6
WoS ID	WOS:000391188500001
Scopus ID	2-s2.0-84992386943
All authors	Bergamaschi, Sonia; Ferrari, Davide; Guerra, Francesco; Simonini, Giovanni; Velegrakis, Yannis
Type	Journal article

Files in this record:

File	Access	Description	Type	Size	Format
PAPER_1_10.1007_s13740-016-0063-6.pdf	Restricted	Published article	Publisher's published version	1.13 MB	Adobe PDF
report.pdf	Open access	-	Author's version, revised and accepted for publication	673.85 kB	Adobe PDF
VOR_Providing Insight into Data Source Topics.pdf	Restricted	-	Publisher's published version	920.61 kB	Adobe PDF


The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International license (CC BY 4.0), unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1111660
