Data scientists spend over 80% of their time (1) parameter-tuning machine learning models and (2) iterating between data cleaning and machine learning model execution. While there are existing efforts to support the first requirement, there is currently no integrated workflow system that couples data cleaning and machine learning development. The previous version of Data Civilizer was geared towards data cleaning and discovery using a set of pre-defined tools. In this paper, we introduce Data Civilizer 2.0, an end-to-end workflow system satisfying both requirements. In addition, this system also supports a sophisticated data debugger and a workflow visualization system. In this demo, we will show how we used Data Civilizer 2.0 to help scientists at the Massachusetts General Hospital build their cleaning and machine learning pipeline on their 30TB brain activity dataset.

Data civilizer 2.0: A holistic framework for data preparation and analytics / Rezig, E. K.; Cao, L.; Stonebraker, M.; Simonini, G.; Tao, W.; Madden, S.; Ouzzani, M.; Tang, N.; Elmagarmid, A. K. - In: PROCEEDINGS OF THE VLDB ENDOWMENT. - ISSN 2150-8097. - 12:12(2019), pp. 1954-1957. (Paper presented at the 45th International Conference on Very Large Data Bases, VLDB 2019, held in the USA in 2019) [10.14778/3352063.3352108].

Data civilizer 2.0: A holistic framework for data preparation and analytics

Simonini G.;
2019

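Publication year: 2019
Publication date: August 2019
Conference: 45th International Conference on Very Large Data Bases, VLDB 2019
Conference location: USA
Conference year: 2019
Volume: 12
First page: 1954
Last page: 1957
Authors: Rezig, E. K.; Cao, L.; Stonebraker, M.; Simonini, G.; Tao, W.; Madden, S.; Ouzzani, M.; Tang, N.; Elmagarmid, A. K.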
Files in this record:
VQR.pdf (open access; publisher's published version; Adobe PDF, 480.57 kB)
Creative Commons License
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1191053