Integrating DEA with Machine Learning for Predictive Modeling in Breast Cancer
Abstract
This study proposes an integrated methodology that combines Data Envelopment Analysis (DEA) with Machine Learning (ML) to improve predictive modeling in healthcare data analysis, specifically for breast cancer datasets. The methodology begins with data preprocessing (cleaning, normalization, and outlier detection) to ensure the quality and consistency of the dataset. DEA is then applied to compute an efficiency score for each Decision-Making Unit (DMU), such as a hospital or clinic, assessing its resource utilization and performance. These efficiency scores are added to the dataset as a new feature. Several ML models are trained on the augmented dataset, and their predictive accuracy is compared with that of models trained on the original dataset. The results suggest that including DEA-derived efficiency scores improves both the accuracy and the interpretability of the predictions, offering a promising approach for decision-making in complex domains such as healthcare. Future research could explore deep learning techniques or extend the methodology to other sectors, such as energy management or financial analysis.
Keywords:
Data envelopment analysis, Machine learning, Breast cancer dataset, Feature selection
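The DEA step summarized in the abstract can be illustrated with a minimal sketch. The abstract does not state which DEA model, inputs, or outputs were used, so everything here is an assumption: the sketch uses the classic input-oriented CCR model (constant returns to scale) in envelopment form, solves it with `scipy.optimize.linprog`, and runs on made-up toy data. The function name `ccr_efficiency` and the arrays `X`, `Y`, and `features` are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def ccr_efficiency(X, Y):
    """Input-oriented CCR efficiency score for every DMU (assumed model).

    X: (n_dmus, n_inputs) array of inputs, Y: (n_dmus, n_outputs) array of outputs.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, m = X.shape
    s = Y.shape[1]
    scores = np.empty(n)
    for o in range(n):
        # Decision variables: [theta, lambda_1, ..., lambda_n]; minimize theta.
        c = np.r_[1.0, np.zeros(n)]
        # Inputs:  sum_j lambda_j * x_ij - theta * x_io <= 0
        A_in = np.hstack([-X[o][:, None], X.T])
        # Outputs: -sum_j lambda_j * y_rj <= -y_ro
        A_out = np.hstack([np.zeros((s, 1)), -Y.T])
        res = linprog(
            c,
            A_ub=np.vstack([A_in, A_out]),
            b_ub=np.r_[np.zeros(m), -Y[o]],
            bounds=[(0, None)] * (n + 1),
            method="highs",
        )
        scores[o] = res.fun
    return scores


# Toy data (hypothetical): four DMUs, one input and one output each.
X = np.array([[2.0], [4.0], [5.0], [10.0]])
Y = np.array([[2.0], [2.0], [5.0], [5.0]])
scores = ccr_efficiency(X, Y)

# Append the efficiency score as an extra feature column.
features = np.hstack([X, Y])  # stand-in for the preprocessed dataset
augmented = np.column_stack([features, scores])
```

In this toy example the first and third DMUs have the best output-to-input ratio and lie on the efficient frontier, while the other two use twice as much input per unit of output. Any classifier could then be trained on both `features` and `augmented` to compare predictive accuracy, as the abstract describes.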