Mining of Biological Data II: Assessing Data Structure and Class Homogeneity by Cluster Analysis / Kamimura, R. T.; Bicciato, Silvio; Shimizu, H.; Alford, J.; Stephanopoulos, G. N.. - In: METABOLIC ENGINEERING. - ISSN 1096-7176. - STAMPA. - 2(3):(2000), pp. 228-238. [10.1006/mben.2000.0155]

Mining of Biological Data II: Assessing Data Structure and Class Homogeneity by Cluster Analysis

BICCIATO, Silvio
2000

Abstract

An important step in data analysis is class assignment, which is usually done on the basis of a macroscopic phenotypic or bioprocess characteristic, such as high vs low growth, healthy vs diseased state, or high vs low productivity. Unfortunately, such an assignment may lump together samples which, when derived from a more detailed phenotypic or bioprocess description, are dissimilar, giving rise to models of lower quality and predictive power. In this paper we present a clustering algorithm for data preprocessing which involves the identification of fundamentally similar lots on the basis of the extent of similarity among the system variables. The algorithm combines aspects of cluster analysis and principal component analysis by applying agglomerative clustering methods to the first principal component of the system data matrix. As part of a rational strategy for developing empirical models, this technique selects lots (samples) which are most appropriate for inclusion in a training set by analyzing multivariate data homogeneity. Samples with similar data structures are identified and grouped together into distinct clusters. This knowledge is used in the formation of potential training sets. Additionally, this technique can identify atypical lots, i.e., samples that are not simply outliers but that exhibit the general properties of one class while having been assigned to the other. The method is presented along with examples from its application to fermentation data sets.
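
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify scaling, linkage, or the rule for flagging atypical lots, so mean-centering without scaling, single linkage, and a majority-vote flag are all assumptions made here. The function names (`first_pc_scores`, `agglomerate`, `atypical_lots`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def first_pc_scores(X, iters=500):
    """Scores of each lot on the first principal component.

    Assumption: variables are mean-centered but not scaled. The leading
    eigenvector of the covariance matrix is found by power iteration.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    cov = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0 + j for j in range(p)]  # arbitrary non-symmetric start vector
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [sum(r[j] * v[j] for j in range(p)) for r in Xc]


def agglomerate(scores, n_clusters):
    """Naive agglomerative clustering of 1-D scores (single linkage, assumed)."""
    clusters = [[i] for i in range(len(scores))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(scores[i] - scores[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return clusters


def atypical_lots(clusters, labels):
    """Hypothetical rule: flag lots whose label differs from their cluster's majority."""
    flagged = []
    for members in clusters:
        majority = Counter(labels[i] for i in members).most_common(1)[0][0]
        flagged += [i for i in members if labels[i] != majority]
    return sorted(flagged)


# Toy data (invented for illustration): six lots, three process variables.
X = [
    [10.0, 5.0, 1.0],
    [11.0, 5.5, 1.1],
    [10.5, 5.2, 0.9],
    [2.0, 1.0, 1.0],
    [2.5, 1.2, 1.1],
    [1.8, 0.9, 0.9],
]
# Lot 2 resembles the "high" lots but carries the "low" label.
labels = ["high", "high", "low", "low", "low", "low"]

scores = first_pc_scores(X)
clusters = agglomerate(scores, n_clusters=2)
print(clusters)
print(atypical_lots(clusters, labels))
```

Clustering the one-dimensional scores rather than the full data matrix keeps the grouping tied to the dominant direction of variation; whether additional components, a different linkage, or variable scaling is appropriate depends on the data and is left open here.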

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/421610