Lung Cancer Classification using Microarray Gene Expression Data and Machine Learning Approach

Prakash Choudhary 1; Nada Tarek 2; Hend Alfred *3; Moataz Sayed 4

Published: 2025-11-26

Keywords:

Microarray, Gene Expression, Machine Learning (ML), Biomarkers, Lung Cancer

Abstract

This work proposes an enhanced machine learning pipeline for accurate lung cancer subtype classification that combines statistical feature selection with ensemble modeling. A publicly available lung cancer dataset (Lung.arff) containing 203 samples and 12,600 gene features across five classes was used. Preprocessing consisted of normalization, imputation of missing data, and class balancing via SMOTE. Feature selection combined an ANOVA F-test with Mutual Information and Random Forest-based feature importance, and the top 50 discriminative genes were retained. Four classifiers (Logistic Regression, Support Vector Machine with an RBF kernel, Random Forest, and XGBoost) were then trained and compared using nested 5-fold cross-validation to obtain a robust and unbiased assessment of model performance. The XGBoost classifier achieved 90.8% accuracy, whereas a voting ensemble of Random Forest, SVM, and XGBoost achieved 91.3% accuracy and an improved macro F1-score. The top-ranked genes from SHAP analysis are highly consistent with known lung cancer biomarkers, which supports the interpretability of the model. The results highlight the robustness of combining statistical feature selection with ensemble learning for precise lung cancer subtype classification.
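As an illustration only, the following minimal sketch shows two of the steps named in the abstract, ANOVA F-test gene ranking and a voting ensemble, in plain Python. The toy expression matrix, the function names (`anova_f_score`, `top_k_genes`, `majority_vote`), and the use of hard (majority) voting are assumptions made for this example; the abstract does not state the voting rule, and this is not the authors' implementation.

```python
from collections import Counter, defaultdict
from statistics import mean


def anova_f_score(values, labels):
    """One-way ANOVA F statistic for one gene across class labels."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(v)
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_k_genes(X, y, k):
    """Indices of the k genes (columns of X) with the largest F statistic."""
    n_genes = len(X[0])
    scores = [anova_f_score([row[j] for row in X], y) for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: scores[j], reverse=True)[:k]


def majority_vote(predictions):
    """Hard vote (assumed rule): most common label per sample across models."""
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]


# Hypothetical toy data: 6 samples x 3 genes, two classes.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 5.1, 0.9],
    [0.9, 4.9, 0.1],
    [3.0, 5.0, 0.8],
    [3.1, 5.2, 0.3],
    [2.9, 4.8, 0.7],
]
y = ["A", "A", "A", "B", "B", "B"]

selected = top_k_genes(X, y, k=2)  # gene 0 separates the classes best

# Hypothetical per-model predictions for three samples.
ensemble = majority_vote([
    ["A", "B", "B"],
    ["A", "A", "B"],
    ["B", "A", "B"],
])
```

On real data the same ranking would be applied inside each training fold of the cross-validation, so that gene selection never sees the held-out samples. Libraries such as scikit-learn provide comparable routines (`f_classif`, `VotingClassifier`).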

Issue

Vol. 1 No. 1 (2025): November-2025

Section

Original Research

This work is licensed under a Creative Commons Attribution 4.0 International License.

Computational Discovery and Intelligent Systems (CDIS) content is published under a Creative Commons Attribution License (CC BY). Content is freely available to all readers upon publication and is published as soon as production is complete.

Computational Discovery and Intelligent Systems (CDIS) seeks to publish the most influential papers that will significantly advance scientific understanding. Selected articles must present new and widely significant data, syntheses, or concepts. They should merit recognition by the wider scientific community and the general public through publication in a reputable scientific journal.

How to Cite

Lung Cancer Classification using Microarray Gene Expression Data and Machine Learning Approach. (2025). Computational Discovery and Intelligent Systems, 1(1), 9-17. https://pub.scientificirg.com/index.php/CDIS/article/view/4
