Lung Cancer Classification using Microarray Gene Expression Data and Machine Learning Approach  

Authors

  • Prakash Choudhary 1
  • Nada Tarek 2
  • Hend Alfred *3
  • Moataz Sayed 4

Keywords:

Microarray, Gene Expression, Machine Learning (ML), Biomarkers, Lung Cancer

Abstract

This work proposes an enhanced machine learning pipeline for accurate lung cancer subtype classification, based on statistical feature selection and ensemble modeling. A publicly available lung cancer dataset (Lung.arff) containing 203 samples and 12,600 gene features across five classes was used. Preprocessing consisted of normalization, imputation of missing data, and class balancing via SMOTE. For feature selection, an ANOVA F-test was combined with Mutual Information and Random Forest-based feature importance to select the top 50 discriminative genes. Four classifiers (Logistic Regression, Support Vector Machine with an RBF kernel, Random Forest, and XGBoost) were then trained and compared using nested 5-fold cross-validation, giving a robust and unbiased assessment of model performance. In the experiments, the XGBoost classifier achieved 90.8% accuracy, whereas a voting ensemble of RF, SVM, and XGBoost achieved 91.3% accuracy and improved the macro F1-score. The top-ranked genes from SHAP analysis are highly consistent with current lung cancer biomarkers, supporting the interpretability of the model. The results highlight the robustness of combining statistical feature selection with ensemble learning for precise lung cancer subtype classification.
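
The following is a minimal sketch of how a pipeline of this kind could be wired up with scikit-learn so that imputation, scaling, and gene selection are refit inside every fold of a nested 5-fold cross-validation. It is not the authors' code. Assumptions: synthetic data from `make_classification` stands in for Lung.arff; only the ANOVA F-test is used to pick the top 50 features (the Mutual Information and Random Forest rankings are omitted); `GradientBoostingClassifier` is swapped in for XGBoost; SMOTE is left out; hard voting is assumed; and the small grid over the SVM `C` value is purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a gene expression matrix (assumption: NOT Lung.arff).
X, y = make_classification(
    n_samples=200, n_features=300, n_informative=20,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# Hard-voting ensemble; GradientBoostingClassifier is a stand-in for XGBoost.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(kernel="rbf")),
        ("gb", GradientBoostingClassifier(n_estimators=20, random_state=0)),
    ],
    voting="hard",
)

# Every preprocessing step lives in the pipeline, so it is fit on the
# training part of each fold only and never sees the held-out samples.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=50)),  # ANOVA F-test, top 50 features
    ("clf", ensemble),
])

# Nested CV: the inner loop tunes hyperparameters, the outer loop estimates
# the performance of the whole tuning procedure.
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    pipe,
    param_grid={"clf__svm__C": [1.0, 10.0]},  # illustrative grid only
    cv=inner,
    scoring="f1_macro",
)
scores = cross_val_score(search, X, y, cv=outer, scoring="accuracy")
print("outer-fold accuracy: mean=%.3f std=%.3f" % (scores.mean(), scores.std()))
```

Keeping feature selection inside the pipeline matters for data with far more genes than samples: selecting the top genes on the full dataset before cross-validation would leak label information into the test folds and inflate the reported accuracy. If SMOTE were added, it would need the same treatment, for example through the pipeline class in imbalanced-learn, so that synthetic samples are generated from training folds only.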

Published

26-11-2025

How to Cite

Lung Cancer Classification using Microarray Gene Expression Data and Machine Learning Approach. (2025). Computational Discovery and Intelligent Systems, 1(1), 9-17. https://pub.scientificirg.com/index.php/CDIS/article/view/4
