Cleaning mapreduce workflows / Interlandi, Matteo; Lacroix, Julien; Boucelma, Omar; Guerra, Francesco. - (2017), pp. 74-78. (Paper presented at the 15th International Conference on High Performance Computing and Simulation, HPCS 2017, held in Italy in 2017) [10.1109/HPCS.2017.22].

Cleaning mapreduce workflows

Interlandi, Matteo; Guerra, Francesco
2017

Abstract

Integrity constraints (ICs) such as Functional Dependencies (FDs) or Inclusion Dependencies (INDs) are commonly used in databases to check whether input relations obey certain pre-defined quality metrics. Although Data-Intensive Scalable Computing (DISC) platforms such as MapReduce commonly accept as input (semi-structured) data that is not in relational format, data is nevertheless transformed into key/value pairs whenever it must be re-partitioned, a process commonly referred to as the shuffle. In this work, we present a provenance-aware model for assessing the quality of shuffled data: more precisely, we capture and model provenance using the PROV-DM W3C recommendation, and we extend it with rules expressed à la Datalog to assess data quality dimensions by means of IC metrics over DISC systems. In this way, data (and algorithmic) errors can be detected promptly and automatically, without a lengthy process of output debugging.
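To make the idea concrete, the following is a minimal sketch (not taken from the paper) of how one kind of IC metric, a functional dependency key → value, might be checked over the key/value pairs produced by a shuffle. The data, the `fd_violations` helper, and the pair format are all illustrative assumptions, not the authors' actual PROV-DM/Datalog machinery.

```python
from collections import defaultdict

# Hypothetical shuffled output: (key, value) pairs emitted by mappers
# before re-partitioning. "user2" deliberately violates the FD key -> value.
shuffled = [
    ("user1", "IT"), ("user2", "FR"),
    ("user1", "IT"), ("user2", "DE"),
]

def fd_violations(pairs):
    """Return the keys that violate the functional dependency key -> value,
    i.e. keys associated with more than one distinct value."""
    seen = defaultdict(set)
    for k, v in pairs:
        seen[k].add(v)
    return {k: vals for k, vals in seen.items() if len(vals) > 1}

print(fd_violations(shuffled))  # flags "user2", which maps to both "FR" and "DE"
```

In the paper's setting, a check of this kind would be expressed as a Datalog-style rule over captured provenance rather than hand-coded, so violations can be traced back to the operators that produced the offending pairs.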
Year: 2017
Conference: 15th International Conference on High Performance Computing and Simulation, HPCS 2017
Location: Italy
Pages: 74-78
Authors: Interlandi, Matteo; Lacroix, Julien; Boucelma, Omar; Guerra, Francesco
Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1149198