Lung Cancer Classification using Microarray Gene Expression Data and Machine Learning Approach  

Main Article Content

Prakash Choudhary 1
Nada Tarek 2
Hend Alfred *3
Moataz Sayed 4

Abstract

This work proposes an enhanced machine learning pipeline for accurate lung cancer subtype classification that combines statistical feature selection with ensemble modeling. A publicly available lung cancer dataset (Lung.arff) containing 203 samples and 12,600 gene features across five classes was used. Preprocessing consisted of normalization, imputation of missing data, and class balancing via SMOTE. Feature selection combined an ANOVA F-test with Mutual Information and Random Forest-based feature importance to retain the top 50 discriminative genes. Four classifiers (Logistic Regression, Support Vector Machine with an RBF kernel, Random Forest, and XGBoost) were then trained and compared using nested 5-fold cross-validation to obtain a robust and unbiased estimate of model performance. The XGBoost classifier achieved 90.8% accuracy, whereas a voting ensemble of RF, SVM, and XGBoost achieved 91.3% accuracy and improved the macro F1-score. Top-ranked genes from SHAP analysis are highly consistent with known lung cancer biomarkers, supporting the interpretability of the model. The results highlight the robustness of combining statistical feature selection with ensemble learning for precise lung cancer subtype classification.
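To illustrate the ANOVA F-test ranking named in the abstract, the following is a minimal sketch using only the Python standard library. It is not the authors' code: the function names `anova_f` and `top_k_genes` and the toy expression matrix are hypothetical, and the Mutual Information, Random Forest importance, SMOTE, and classifier stages are not shown.

```python
from statistics import fmean


def anova_f(values, labels):
    """One-way ANOVA F-statistic for one gene across the class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 classes and more samples than classes")
    grand = fmean(values)
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0.0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_k_genes(X, y, k):
    """Return the indices of the k highest-scoring columns, plus all scores."""
    n_genes = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Toy data (made up): 6 samples, 3 genes, 2 classes.
# Only gene 0 separates class "A" from class "B".
X = [
    [1.0, 5.0, 2.0],
    [1.2, 4.0, 2.1],
    [0.9, 6.0, 1.9],
    [3.0, 5.0, 2.0],
    [3.1, 4.0, 2.2],
    [2.9, 6.0, 1.8],
]
y = ["A", "A", "A", "B", "B", "B"]

selected, scores = top_k_genes(X, y, k=1)
print(selected)
```

With the dataset described in the abstract, `X` would be the 203 x 12,600 expression matrix and `k` would be 50. In a nested cross-validation setting, this ranking would be computed on the training folds only, so that the held-out fold does not influence which genes are selected.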


Article Details

Section

Original Research

How to Cite

Lung Cancer Classification using Microarray Gene Expression Data and Machine Learning Approach. (2025). Computational Discovery and Intelligent Systems, 1(1), 9-17. https://pub.scientificirg.com/index.php/CDIS/article/view/4
