Mining of Biological Data II: Assessing Data Structure and Class Homogeneity by Cluster Analysis / Kamimura, R. T.; Bicciato, Silvio; Shimizu, H.; Alford, J.; Stephanopoulos, G. N. - In: METABOLIC ENGINEERING. - ISSN 1096-7176. - Print. - 2:3(2000), pp. 228-238. [10.1006/mben.2000.0155]
Mining of Biological Data II: Assessing Data Structure and Class Homogeneity by Cluster Analysis
BICCIATO, Silvio;
2000
Abstract
An important step in data analysis is class assignment, which is usually done on the basis of a macroscopic phenotypic or bioprocess characteristic, such as high vs low growth, healthy vs diseased state, or high vs low productivity. Unfortunately, such an assignment may lump together samples that are dissimilar under a more detailed phenotypic or bioprocess description, giving rise to models of lower quality and predictive power. In this paper we present a clustering algorithm for data preprocessing that identifies fundamentally similar lots on the basis of the extent of similarity among the system variables. The algorithm combines aspects of cluster analysis and principal component analysis by applying agglomerative clustering methods to the first principal component of the system data matrix. As part of a rational strategy for developing empirical models, this technique selects the lots (samples) that are most appropriate for inclusion in a training set by analyzing multivariate data homogeneity. Samples with similar data structures are identified and grouped together into distinct clusters. This knowledge is used in the formation of potential training sets. Additionally, this technique can identify atypical lots, i.e., samples that are not simply outliers but exhibit the general properties of one class while having been assigned to the other. The method is presented along with examples from its application to fermentation data sets.
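The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the data layout, scaling, linkage rule, or number of clusters, so everything below is an assumption made for illustration. Here each lot is taken to be one row of a lots-by-variables matrix, the lots are clustered on their scores along the first principal component, average linkage is used, and a lot is called atypical when its assigned class differs from the majority class of its cluster. The function names (`first_pc_scores`, `agglomerate`, `atypical_lots`) and the synthetic data are hypothetical.

```python
import numpy as np


def first_pc_scores(X):
    """Project each lot (row of X) onto the first principal component.

    Assumption: variables are autoscaled (zero mean, unit variance) first.
    """
    Xc = X - X.mean(axis=0)
    scale = Xc.std(axis=0, ddof=1)
    scale[scale == 0] = 1.0  # leave constant variables unscaled
    Z = Xc / scale
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ vt[0]  # vt[0] is the first principal direction


def agglomerate(scores, n_clusters):
    """Naive average-linkage agglomerative clustering of 1-D scores."""
    clusters = [[i] for i in range(len(scores))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = np.mean([abs(scores[i] - scores[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    labels = np.empty(len(scores), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


def atypical_lots(labels, classes):
    """Lots whose assigned class differs from their cluster's majority class."""
    classes = np.asarray(classes)
    flagged = []
    for k in np.unique(labels):
        members = np.where(labels == k)[0]
        values, counts = np.unique(classes[members], return_counts=True)
        majority = values[np.argmax(counts)]
        flagged.extend(int(i) for i in members if classes[i] != majority)
    return sorted(flagged)


# Synthetic example (made-up data): two well-separated groups of five lots.
rng = np.random.default_rng(0)
low = rng.normal(0.0, 0.1, size=(5, 3))
high = rng.normal(5.0, 0.1, size=(5, 3))
X = np.vstack([low, high])
classes = ["low"] * 5 + ["high"] * 5
classes[7] = "low"  # lot 7 behaves like "high" but carries the "low" label

scores = first_pc_scores(X)
labels = agglomerate(scores, n_clusters=2)
print(atypical_lots(labels, classes))  # → [7]
```

In this sketch the clusters, rather than the original class labels, would guide which lots go into a training set, and the flagged lots are candidates for review. The naive clustering loop is only practical for small numbers of lots; a library routine would be the usual choice at scale.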
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.