Evaluation of Supervised Learning Algorithms on Gene Expression Data. CSCI 6505 – Machine Learning. Adan Cosgaya. Winter 2006, Dalhousie University.
Outline: Introduction; Definition of the Problem; Related Work; Algorithms; Description of the Data; Methodology of Experiments; Results; Relevance of Results; Conclusions & Future Work.
Introduction: ML has gained attention in the biomedical field. There is a need to turn biomedical data into meaningful information. Microarray technology is used to generate gene expression data. Gene expression data involves a huge number of numeric attributes (gene expression measurements). This kind of data also typically contains only a small number of instances. This work investigates the classification problem on such data.
Definition of the Problem: Classifying Gene Expression Data. The number of features (n) is much greater than the number of sample instances (m): n >> m. Typical data: n > 5000 and m < 100. There is a high risk of overfitting the data due to the abundance of attributes and the shortage of available samples. The datasets produced by microarray experiments are high-dimensional and often noisy because of the process involved in the experiments.
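The overfitting risk when n >> m can be illustrated with a small simulation. This is a minimal sketch, not part of the original experiments: the features are pure Gaussian noise, the labels are an arbitrary balanced 0/1 assignment, and the function names `best_stump_accuracy` and `max_chance_accuracy` are hypothetical. It shows that with many random features and few samples, some single feature can separate the training labels well by chance alone.

```python
import random

def best_stump_accuracy(values, labels):
    """Best training accuracy of a one-feature threshold rule (either polarity)."""
    pairs = sorted(zip(values, labels))
    m = len(pairs)
    total_pos = sum(label for _, label in pairs)
    best = 0.0
    pos_below = 0  # number of label-1 samples among the i smallest values
    for i in range(m + 1):
        # Rule: predict 0 for the i smallest values and 1 for the rest.
        neg_below = i - pos_below
        correct = neg_below + (total_pos - pos_below)
        acc = correct / m
        best = max(best, acc, 1.0 - acc)  # 1 - acc is the flipped rule
        if i < m:
            pos_below += pairs[i][1]
    return best

def max_chance_accuracy(n_features, m_samples, seed=0):
    """Highest single-feature training accuracy over pure-noise features.

    Assumes m_samples is even so the labels are balanced.
    """
    rng = random.Random(seed)
    labels = [0, 1] * (m_samples // 2)
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(m_samples)]
        best = max(best, best_stump_accuracy(feature, labels))
    return best

if __name__ == "__main__":
    # Few noise features versus many noise features, same 20 samples.
    print(max_chance_accuracy(10, 20))
    print(max_chance_accuracy(2000, 20))
```

Because the features carry no information about the labels, any high training accuracy found this way would not carry over to new samples, which is why evaluation on held-out data matters for this kind of dataset.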
Related Work: Using gene expression data for classification has recently gained attention in the biomedical community. Golub et al. describe an approach to cancer classification based on gene expression, applied to human acute leukemia (ALL vs. AML). Rosenwald et al. developed a model to predict patient survival after chemotherapy (Alive vs. Dead). Furey et al. present a method to analyze microarray expression data using SVM. Guyon et al. experiment with reducing the dimensionality of gene expression data.
Algorithms: K-Nearest Neighbor (KNN): one of the simplest and most widely used algorithms for data classification. Naive Bayes (NB): assumes that the effect of a feature value on a given class is independent of the values of the other features. Decision Trees (DT): internal nodes represent tests on one or more attributes, and leaf nodes indicate decision outcomes. Support Vector Machines (SVM): work well on high-dimensional data.
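As an illustration of the simplest of these methods, here is a minimal KNN sketch. It assumes Euclidean distance and a plain majority vote, neither of which the slides specify, and the function name `knn_predict` and the toy expression values are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Label a query sample by majority vote among its k nearest neighbors."""
    # Distance from the query to every training sample, nearest first.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    # Toy two-gene expression profiles (made-up numbers, for illustration only).
    train_x = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]]
    train_y = ["ALL", "ALL", "ALL", "AML", "AML"]
    print(knn_predict(train_x, train_y, [1.0, 1.0]))
    print(knn_predict(train_x, train_y, [3.0, 3.0]))
```

KNN has no training phase beyond storing the samples, so its cost is paid at prediction time, when every stored sample is compared with the query.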
Description of the Data: Leukemia dataset: a collection of 72 samples of gene expression measurements. The samples are divided into two variants of leukemia: 25 samples of acute myeloid leukemia (AML) and 47 samples of acute lymphoblastic leukemia (ALL). Diffuse Large-B-Cell Lymphoma (DLBCL) dataset: biopsy samples that were examined for gene expression with the use of DNA microarrays. Each sample is labeled with the survival outcome after chemotherapy for diffuse large-B-cell lymphoma (Alive, Dead).
Methodology of Experiments: Feature selection: remove irrelevant features (although a removed feature may still have biological meaning), using GainRatio. Selecting a supervised learning method: KNN, NB, DT, SVM. Testing methodology: evaluation over an independent test set (train/test split) with ratios 66/34, 80/20, and 90/10; and 10-fold cross-validation. Compare both methods and see whether they are in logical agreement. [Diagram labels: All features; Feature Selection (gene subset); Algorithm]
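The slides name GainRatio as the feature-selection criterion but do not show how it was computed. The following is a minimal sketch of the standard definition, information gain divided by the entropy of the feature itself, under the assumption that each gene's expression has already been discretized into categories. The "low"/"high" values and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    counts = Counter(items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(feature_values, labels):
    """Information gain of a discrete feature divided by its split information."""
    total = len(labels)
    # Group the class labels by feature value.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected class entropy after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    # A constant feature has zero split information and carries no signal.
    return info_gain / split_info if split_info > 0 else 0.0

if __name__ == "__main__":
    labels = ["ALL", "ALL", "AML", "AML"]
    print(gain_ratio(["low", "low", "high", "high"], labels))   # separates the classes
    print(gain_ratio(["low", "high", "low", "high"], labels))   # uninformative
```

Ranking all genes by this score and keeping the top-scoring subset is one way to obtain the reduced gene subset that is then passed to the classifier.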
Methodology of Experiments (cont…) Measuring performance: Accuracy, Precision (p), Recall (r), F-Measure. It is hard to compare two classifiers using two measures; the F-Measure combines precision and recall into one. The F-Measure is the harmonic mean of precision and recall, F = 2pr / (p + r). For F to be large, both p and r must be large.
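The measures above can be computed directly from the predictions. This is a minimal sketch that treats one class as the positive class; the function name `precision_recall_f` and the toy labels are hypothetical.

```python
def precision_recall_f(true_labels, predicted_labels, positive):
    """Precision, recall, and F-measure for one class treated as positive."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive and truth == positive:
            tp += 1  # true positive
        elif predicted == positive:
            fp += 1  # false positive
        elif truth == positive:
            fn += 1  # false negative
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure

if __name__ == "__main__":
    truth = ["ALL", "ALL", "ALL", "ALL", "AML"]
    predicted = ["ALL", "ALL", "AML", "AML", "AML"]
    p, r, f = precision_recall_f(truth, predicted, positive="ALL")
    print(p, r, f)
```

In this toy case precision is perfect but recall is only one half, and the harmonic mean falls below their arithmetic average, which is the sense in which F is large only when both p and r are large.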
Results: Without Feature Selection. Figure captions (one per results chart): Naive Bayes and SVM perform better; KNN and SVM perform better. Cross-validation results are lower; cross-validation uses nearly all the data for both training and testing, giving a more realistic estimate.
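The way 10-fold cross-validation reuses the data can be sketched as follows. This is a minimal illustration with a hypothetical helper `k_fold_indices`; it shuffles sample indices and does not stratify by class, which may differ from the tool used in the original experiments.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

if __name__ == "__main__":
    # 72 samples, as in the Leukemia dataset.
    for train, test in k_fold_indices(72, k=10):
        print(len(train), len(test))
```

Every sample is used for testing exactly once and for training in the other k - 1 folds, so the averaged score depends less on one particular train/test split.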
Results (cont…) With Feature Selection. Figure captions (one per results chart): KNN and SVM perform better; NB and SVM perform better. There is an increase in overall accuracy, more noticeable on DLBCL.
Results (cont…) [Table: Summary of classification accuracies with cross-validation] [Table: F-Measures for both datasets with and without feature selection]
Relevance of Results: Performance depends on the characteristics of the problem, the quality of the measurements in the data, and the capability of the classifier to find regularities in the data. Feature selection helps to minimize the use of redundant and/or noisy features. SVM gave the best results; SVMs perform well with high-dimensional data and also benefit from feature selection. Decision Trees had the worst overall performance; however, they still work at a competitive level.
Relevance of Results (cont…) Surprisingly, KNN behaves relatively well despite its simplicity; this simplicity allows it to scale well to large feature spaces. On the Leukemia dataset, very high accuracies were achieved by all the algorithms, and perfect accuracy was achieved in many cases. The DLBCL dataset shows lower accuracies, although feature selection improved them. Overall, the accuracy results are consistent with the F-measure results, giving us confidence in the relevance of the results obtained.
Conclusions & Future Work: Supervised learning algorithms can be used to classify gene expression data from DNA microarrays with high accuracy. SVMs, by their nature, deal well with high-dimensional gene expression data. We have verified that some subsets of features (genes) are more relevant than others and better separate the classes. The choice of one algorithm over the others should be evaluated on a case-by-case basis.
Conclusions & Future Work (cont…) Feature selection proved beneficial for the overall performance of the algorithms. This idea can be extended to other feature selection methods or to data transformations such as PCA. Future work also includes analyzing the effect of noisy gene expression data on the reliability of the classifier. While the scope of our experimental results is confined to a couple of datasets, the analysis can be used as a baseline for future use of supervised learning algorithms on gene expression data.
References
T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene-expression monitoring. Science, Vol. 286, 531–537, 1999.
A. Rosenwald, G. Wright, W. C. Chan, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large B-cell lymphoma. New England Journal of Medicine, Vol. 346, 1937–1947, 2002.
Terrence S. Furey, Nello Cristianini, et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, Vol. 16, 906–914, 2001.
I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. BIOWulf Technical Report, 2000.
Ethem Alpaydin. Introduction to Machine Learning. The MIT Press, 2004.
Ian H. Witten, Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Second Edition. Morgan Kaufmann Publishers, 2005.
Wikipedia: www.wikipedia.org
Alvis Brazma, Helen Parkinson, Thomas Schlitt, Mohammadreza Shojatalab. A quick introduction to elements of biology: cells, molecules, genes, functional genomics, microarrays. European Bioinformatics Institute.
Thank You!
butest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
butest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
butest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
butest
 
Ad

CSCI 6505 Machine Learning Project

  • 1. Evaluation of Supervised Learning Algorithms on Gene Expression Data CSCI 6505 – Machine Learning Adan Cosgaya [email_address] Winter 2006 Dalhousie University Machine Learning Prediction
  • 2. Outline Introduction Definition of the Problem Related Work Algorithms Description of the Data Methodology of Experiments Results Relevance of Results Conclusions & Future Work
  • 3. Introduction ML has gained attention in the biomedical field. There is a need to turn biomedical data into meaningful information. Microarray technology is used to generate gene expression data. Gene expression data involves a huge number of numeric attributes (gene expression measurements). This kind of data is also characterized by a small number of instances. This work investigates the classification problem on such data.
  • 4. Definition of the Problem Classifying Gene Expression Data. The number of features (n) is much greater than the number of sample instances (m): n >> m. Typical data: n > 5000 and m < 100. High risk of overfitting the data due to the abundance of attributes and the shortage of available samples. The datasets produced by microarray experiments are high-dimensional and often noisy due to the process involved in the experiments.
  • 5. Related Work Using gene expression data for classification has recently gained attention in the biomedical community. Golub et al. describe an approach to cancer classification based on gene expression, applied to human acute leukemia (ALL vs AML). A. Rosenwald et al. developed a model to predict patient survival after chemotherapy (Alive vs Dead). Furey et al. present a method to analyze microarray expression data using SVM. Guyon et al. experiment with reducing the dimensionality of gene expression data.
  • 6. Algorithms K-Nearest Neighbor (KNN): one of the simplest and most widely used algorithms for data classification. Naive Bayes (NB): assumes that the effect of a feature value on a given class is independent of the values of other features. Decision Trees (DT): internal nodes represent tests on one or more attributes, and leaf nodes indicate decision outcomes. Support Vector Machines (SVM): work well on high-dimensional data.
  • 7. Description of the Data Leukemia dataset. A collection of 72 gene expression samples. The samples are divided into two variants of leukemia: 25 samples of acute myeloid leukemia (AML) and 47 samples of acute lymphoblastic leukemia (ALL). Diffuse Large-B-Cell Lymphoma (DLBCL) dataset. Biopsy samples that were examined for gene expression with the use of DNA microarrays. Each sample is labeled with the survival outcome after chemotherapy for diffuse large-B-cell lymphoma (Alive, Dead).
  • 8. Methodology of Experiments Feature Selection: remove irrelevant features (although they may have biological meaning); use of GainRatio. Selecting a Supervised Learning Method: KNN, NB, DT, SVM. Testing Methodology: evaluation over an independent test set (train/test split) with ratios 66/34, 80/20, 90/10; 10-fold cross-validation; compare both methods and see if they are in logical agreement. [Diagram labels: Feature Selection (gene subset), Algorithm, All features]
  • 9. Methodology of Experiments (cont…) Measuring Performance: Accuracy, Precision (p), Recall (r), F-Measure. It is hard to compare two classifiers using two measures. F-Measure combines precision and recall into one measure: it is the harmonic mean of precision and recall, F = 2pr / (p + r). For F to be large, both p and r must be large.
  • 10. Results Without Feature Selection. Leukemia: Naive Bayes and SVM perform better. DLBCL: KNN and SVM perform better. Cross-validation results are lower; it uses nearly all the data for training and testing, giving a more realistic estimate.
  • 11. Results (cont…) With Feature Selection. Leukemia: KNN and SVM perform better. DLBCL: NB and SVM perform better. There is an increase in the overall accuracy, more noticeable in DLBCL.
  • 12. Results (cont…) [Table: Summary of classification accuracies with cross-validation] [Figure: F-Measures for both datasets with and without feature selection]
  • 13. Relevance of Results Performance depends on the characteristics of the problem, the quality of the measurements in the data, and the capabilities of the classifier in finding regularities in the data. Feature selection helps to minimize the use of redundant and/or noisy features. SVM gave the best results; it performs well with high-dimensional data and also benefits from feature selection. Decision Trees had the worst overall performance; however, they still work at a competitive level.
  • 14. Relevance of Results (cont…) Surprisingly, KNN behaves relatively well despite its simplicity; this characteristic allows it to scale well for large feature spaces. In the case of the Leukemia dataset, very high accuracies were achieved for all the algorithms, and perfect accuracy was achieved in many cases. The DLBCL dataset shows lower accuracies, although feature selection improved them. Overall, the accuracy results are consistent with those from the F-measure, giving us confidence in the relevance of the results obtained.
  • 15. Conclusions & Future Work Supervised learning algorithms can be used for the classification of gene expression data from DNA microarrays with high accuracy. SVM, by its very nature, deals well with high-dimensional gene expression data. We have verified that there are subsets of features (genes) that are more relevant than others and better separate the classes. The use of one algorithm instead of others should be evaluated on a case-by-case basis.
  • 16. Conclusions & Future Work (cont…) Feature selection proved beneficial to the overall performance of the algorithms. This idea can be extended to other feature selection methods or to data transformations such as PCA. A further direction is the analysis of the effect of noisy gene expression data on the reliability of the classifier. While the scope of our experimental results is confined to a couple of datasets, the analysis can be used as a baseline for future use of supervised learning algorithms on gene expression data.
  • 17. References T.R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene-expression monitoring. Science, Vol. 286, 531–537, 1999. A. Rosenwald, G. Wright, W. C. Chan, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large B-cell lymphoma. New England Journal of Medicine, Vol. 346, 1937–1947, 2002. Terrence S. Furey, Nello Cristianini, et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, Vol. 16, 906–914, 2001. I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. BIOWulf Technical Report, 2000. Ethem Alpaydin. Introduction to Machine Learning. The MIT Press, 2004. Ian H. Witten, Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Second Edition. Morgan Kaufmann Publishers, 2005. Wikipedia: www.wikipedia.org. Alvis Brazma, Helen Parkinson, Thomas Schlitt, Mohammadreza Shojatalab. A quick introduction to elements of biology: cells, molecules, genes, functional genomics, microarrays. European Bioinformatics Institute.
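
Slide 9 describes the F-Measure as the harmonic mean of precision and recall, and the notes for slide 10 say the per-class values are averaged. A minimal Python sketch of that calculation is given here for illustration. The function names and the toy labels are hypothetical, and this is not the Weka code used in the experiments.

```python
def per_class_scores(y_true, y_pred):
    """Return {class: (precision, recall, f_measure)} for every class seen."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = {}
    for c in classes:
        # True positives: instances of class c that were also predicted as c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted_c = sum(1 for p in y_pred if p == c)
        actual_c = sum(1 for t in y_true if t == c)
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        total = precision + recall
        # Harmonic mean of precision and recall; defined as 0 when both are 0.
        f_measure = 2 * precision * recall / total if total else 0.0
        scores[c] = (precision, recall, f_measure)
    return scores


def macro_average(scores):
    """Average precision, recall and F-Measure over classes, weighting each class equally."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Hypothetical toy labels (not taken from the Leukemia dataset).
y_true = ["ALL", "ALL", "ALL", "ALL", "AML", "AML"]
y_pred = ["ALL", "ALL", "ALL", "AML", "AML", "ALL"]

scores = per_class_scores(y_true, y_pred)
macro_p, macro_r, macro_f = macro_average(scores)
print(scores)   # ALL: (0.75, 0.75, 0.75), AML: (0.5, 0.5, 0.5)
print(macro_f)  # 0.625
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a classifier cannot obtain a high F-Measure by maximizing only precision or only recall.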

Editor's Notes

  • #4: (at the end) We used Weka to perform the experiments. We evaluated KNN, NB, DT, and SVM. Each has its own strengths and limitations, and it would be difficult to say in advance which one gives the best results. It is necessary to evaluate them on the same datasets and with common evaluation criteria. In our experiments, we perform comparative studies using the full set of features, as well as a subset of them. A DNA microarray is a collection of microscopic DNA spots attached to a solid surface, such as a glass, plastic, or silicon chip, forming an array. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously.
  • #5: In a classification problem, we are given m training instances and l classes, where the instances consist of n features and the known class labels C. The goal is to predict the class label for a new instance. For our problem, the features are gene expression coefficients and the instances correspond to patients. Here, n >> m. Overfitting: building models that are very good for the training set but perform poorly on future independent samples. How can we guard against overfitting? Split the data into a training set and a cross-validation set, and use the latter to monitor generalization performance. When overfitting sets in, stop the training process. Finding disease markers (classifiers) from gene expression data with machine learning algorithms is characterized by a high risk of overfitting the data, due to the abundance of attributes (simultaneously measured gene expression values) and the shortage of available examples (observations). DNA microarray experiments on biological samples generate thousands of gene expression measurements. The datasets produced are high-dimensional and often noisy due to the process involved in the experiments. This is a challenging problem, and one where the results can be used to diagnose a disease or predict the survival of a patient. The approach taken by this project is to provide comparative results indicating that a small number of instances can be used to create a useful model, and that feature selection improves the classification accuracy.
  • #6: Golub et al. … their results demonstrate the feasibility of cancer classification based solely on gene expression. A. Rosenwald et al. … for diffuse large-B-cell lymphoma. Furey et al. … their results indicate that SVM is able to classify this kind of data and can be used to identify the presence of a disease. Guyon et al. … their results show an increase in the overall performance of SVM classification with the reduced set of features.
  • #7: KNN - To classify a given instance I, the algorithm ranks the neighbors of I by similarity and uses the class labels of the k most similar neighbors to predict the class of I. After gathering the class labels of these neighbors, a majority vote is taken, and I is assigned the class label with the greatest number of votes among the k nearest neighbors. The best choice of k depends on the dataset. NB - The training phase consists of calculating the conditional probability P(x|c) of an instance given a class label, and the prior probability P(c) of the class. To classify an unseen instance, the posterior probability of each class given the instance is calculated, and the instance is assigned the class with the highest probability. DT - The algorithm builds a tree from a training dataset: it recursively partitions the set by choosing an attribute and creating a separate branch for each value of the chosen attribute. The best attribute to split on is the one with the highest information gain, i.e., the lowest remaining entropy. To classify an instance, the method starts at the root node, tests the attribute specified by the node, and moves down the branch corresponding to the value of that attribute in the given instance. This process is repeated for the subtree rooted at the new node until a leaf is encountered, and the instance is finally labeled with the class indicated by the leaf. SVM - The Support Vector Machine (SVM) method finds a linear discriminant, called a hyperplane, which separates the classes in a given dataset. The best hyperplane is the one that keeps the maximum separation between the classes, in order to generalize better, so we look for the maximum-margin hyperplane.
  • #8: The datasets used for this evaluation were obtained from the Kent Ridge Biomedical Data Set Repository. They correspond to gene expression data obtained from DNA microarrays. Leukemia dataset. The gene expression measurements were taken from bone marrow samples and blood samples. Diffuse Large-B-Cell Lymphoma (DLBCL) dataset. This dataset consists of biopsy samples from 240 patients that were examined for gene expression with the use of DNA microarrays. The number of microarray features is 7399, and each sample belongs to one of two classes: Alive, Dead. The two classes correspond to the prediction of survival after chemotherapy for diffuse large-B-cell lymphoma.
  • #9: FEATURE SELECTION Due to the high-dimensional nature of this type of data, we chose a smaller set of features from the original set. Another reason to perform feature selection is that having a number of features much greater than the number of instances increases the risk of overfitting. TESTING METHODOLOGY We divided both datasets with different ratios of train/test sets (66/34, 80/20, and 90/10) and averaged over the results (macroaveraging). However, given that our datasets are small, we also wanted to evaluate the accuracy on the basis of 10-fold cross-validation. The major advantage of cross-validation is that all the cases in the dataset are used for testing, and nearly all the cases are used for training the classifier. This resampling technique can provide a good estimate of the accuracy.
  • #10: The classification of the data corresponds to a binary classification task: we want to determine whether a patient is alive or dead, or which of two types of leukemia the patient has. However, using accuracy alone can result in misleading, overoptimistic estimates, which is why we also use precision, recall, and F-measure to evaluate the performance of the classification algorithms. Precision is the proportion of instances that actually have class C among all those that were classified as class C. Recall is the proportion of instances that were classified as class C among all instances that truly have class C, i.e., how much of the class was captured. In order to give equal importance to each class, we average the values of precision, recall, and F-measure obtained for each class C. Classes are almost evenly represented in the training samples, which is why accuracy can still be trusted as a measure of performance.
  • #11: For both datasets there is an intuitive agreement between the evaluation over an independent test set and cross-validation; however, cross-validation results are lower, most likely because it uses nearly all the data for training and testing, giving a more realistic estimate. In the Leukemia dataset, the classification accuracies in both evaluation methods are remarkably high: there are features that completely determine the class, and the Naive Bayes and SVM algorithms tend to slightly outperform KNN and DT. In the case of SVM, this is because the classes are linearly separable. For NB, given its assumption of feature independence, the result indicates that there are at least some features that completely determine the class, despite possible redundant or noisy features. For the DLBCL dataset, the accuracy is considerably lower for all algorithms, with KNN (66.92% and 62.91%) the best classifier. Decision Trees gave the lowest accuracy, due to the large number of features involved. Surprisingly, KNN outperforms SVM on DLBCL and almost matches it on Leukemia.
  • #12: We must point out that reducing the dimensionality by using only the best-ranked features increases the accuracy compared with using the full set of features. The results obtained from the independent test set evaluations and from cross-validation still intuitively agree, with the cross-validation measures again a little lower. For the Leukemia dataset, the reduced dimensionality brought a slight increase in the overall accuracy, indicating that this dataset can be described to a high degree of accuracy by a reduced number of features. For the DLBCL dataset, feature selection significantly increased the overall performance of all the algorithms, with Naive Bayes (78.84% and 70.83%) and SVM (75.37% and 71.25%) reaching the highest accuracies.
  • #13: Since cross-validation gives a more realistic view of the algorithms' behavior, the table summarizes the best performance for each type of classifier, with and without feature selection, in terms of 10-fold cross-validation. The figure shows the variation of the F-Measure for each algorithm, using both datasets, reinforcing the assumption that SVM outperforms the rest. It is interesting to note that the measures are consistent among all the algorithms in each dataset. For example, Leukemia with all features is in the range [0.847, 0.985], and DLBCL with feature selection is in the range [0.612, 0.706].
  • #14: Performance depends ... This is confirmed by the remarkably high results obtained with the Leukemia dataset, which drop dramatically with the DLBCL data. Feature selection … No matter which algorithm is used, all of them benefit from feature selection, which increases performance. This is especially important for algorithms such as KNN, where distances must be computed in terms of features. The use of an information-gain-based method such as gain ratio seems to preserve the underlying correlation between the selected features and the class labels. SVM … As initially suspected, SVM classification gave the best results. However, even though SVMs perform well with high-dimensional data, we have shown that they can also benefit from reducing the dimensionality with feature selection. Decision Trees … it is widely known that they do not behave well with high-dimensional and noisy datasets.
  • #15: Surprisingly, KNN … its relatively strong performance makes it a good choice of baseline when applied to gene expression data. The DLBCL dataset … The low results might be due to the fact that predicting whether a patient is dead or alive a certain time after chemotherapy involves other circumstances, such as the living environment and the care of the patient, which cannot be numerically measured but do affect the final outcome.
  • #16: While our results indicate that SVM, by its very nature, deals well with high-dimensional gene expression data, we have shown that other methods work surprisingly well too. The datasets used contain relatively few instances and do not allow one method to demonstrate absolute superiority. We have also shown that there is no single approach that works well in all situations, and the use of one algorithm instead of others should be evaluated on a case-by-case basis.
  • #17: Knowing that data transformation methods destroy the underlying meaning of the set of features, it would be interesting to see whether algorithms such as SVM and Naive Bayes (which assumes feature independence) benefit from the transformation. Another direction for future research is the statistical analysis of the effect of noisy gene expression data on the reliability of the classifier. This is of interest because the methods used to obtain this type of data are subject to noise, so it is crucial to determine these effects on the results and to judge the robustness of an algorithm in the presence of noisy measurements or mislabeled classes. Finally, more experiments with other datasets should be performed before drawing final conclusions.
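
The notes for slides 7 and 9 describe KNN's majority vote and k-fold cross-validation, in which every instance is tested once while the remaining folds are used for training. The sketch below shows how the two fit together in plain Python. The function names, the Euclidean distance, the unstratified fold assignment, and the toy data are all assumptions made for illustration; the experiments themselves were run in Weka, and this is not that implementation.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Rank training instances by Euclidean distance to the query (an assumed metric).
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def k_fold_indices(n, folds, seed=0):
    """Shuffle the indices 0..n-1 and deal them into `folds` disjoint test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::folds] for f in range(folds)]


def cross_val_accuracy(xs, ys, folds=10, k=3, seed=0):
    """Test each instance exactly once, training on all instances outside its fold."""
    correct = 0
    for test_idx in k_fold_indices(len(xs), folds, seed):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        correct += sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in test_idx)
    return correct / len(xs)


# Hypothetical, well-separated toy data: two features per instance, two classes.
xs = [[0.1 * i, 0.0] for i in range(10)] + [[10.0 + 0.1 * i, 0.0] for i in range(10)]
ys = ["A"] * 10 + ["B"] * 10

accuracy = cross_val_accuracy(xs, ys, folds=5, k=3)
print(accuracy)  # 1.0 on this toy data, because the two classes are far apart
```

With real gene expression data, where n >> m, the same loop applies unchanged, but any feature selection step would have to be repeated inside each fold using only that fold's training instances; otherwise information from the held-out instances leaks into the model.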