Comparison of Sequential, Chi-Square, and Embedded Feature Selection for Breast Cancer Classification Using the Random Forest Algorithm

Yudha Alif Auliya, Muhammad ‘Ariful Furqon, Nico Wibiyanto

Abstract

Cancer is typically associated with malignant tumors that can spread to other tissues of the body. Breast cancer arises from the uncontrolled proliferation of breast cells, which can form benign or malignant tumors. Benign breast conditions and non-cancerous growths usually present as small, round, soft lumps, whereas malignant breast tumors tend to be asymmetrical, irregular, and painful, among other manifestations. If untreated, a malignant tumor may metastasize and become fatal. This study compares the Sequential Feature Selection (SFS), Chi-Square, and Embedded feature selection methods for breast cancer classification, with the hyperparameters of the random forest algorithm optimized via grid search. The study uses the Wisconsin Breast Cancer dataset from the UCI Machine Learning Repository, which comprises 569 records, 30 attributes, and 1 class label. Model performance is assessed with a confusion matrix, from which accuracy, precision, recall, and F1-score are computed. The results were obtained from twenty testing schemes that combine data splitting, cross-validation, and hyperparameter tuning via grid search. The best performance was achieved by the random forest model with hyperparameter tuning and SFS feature selection: with 20 selected features, it reached an accuracy of 97.37%, a precision of 95.83%, a recall of 97.87%, and an F1-score of 96.84%. The model performs well in identifying both the positive and the negative class. It correctly predicted the negative class in 66 instances (true negatives) and the positive class in 46 instances (true positives). It produced one false positive and one false negative. These results indicate that the model achieves high accuracy with few prediction errors.
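
The four metrics named in the abstract are all derived from the confusion-matrix counts. The following is a minimal sketch of that derivation in plain Python. The function name `classification_metrics` and the example counts are illustrative assumptions, not values or code taken from the study.

```python
from typing import Dict


def _safe_div(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from binary confusion-matrix counts.

    tp: true positives, fp: false positives,
    fn: false negatives, tn: true negatives.
    """
    total = tp + fp + fn + tn
    accuracy = _safe_div(tp + tn, total)   # share of all predictions that are correct
    precision = _safe_div(tp, tp + fp)     # share of predicted positives that are correct
    recall = _safe_div(tp, tp + fn)        # share of actual positives that are found
    # F1 is the harmonic mean of precision and recall.
    f1 = _safe_div(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only for illustration (not from the study).
    metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
    for name, value in metrics.items():
        print(f"{name}: {value:.2%}")
```

Returning 0.0 when a denominator is zero is one possible convention for degenerate cases, such as a model that never predicts the positive class. Other conventions, such as raising an error, are equally defensible.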
