EASY ENSEMBLE WITH RANDOM FOREST TO HANDLE IMBALANCED DATA IN CLASSIFICATION

*Sarini Abdullah  -  Department of Mathematics, Universitas Indonesia, Indonesia
GV Prasetyo  -  Department of Mathematics, FMIPA Universitas Indonesia, Indonesia

Abstract

Imbalanced data can cause problems at the problem-definition level, the algorithm level, and the data level. Several methods have been developed to address this; one state-of-the-art method is Easy Ensemble, which is claimed to improve a model's ability to classify the minority class and to overcome the shortcomings of random under-sampling. In this paper we discuss the implementation of Easy Ensemble with Random Forest classifiers to handle the imbalance problem in credit scoring. The combined method is applied to two datasets taken from data science competition websites, finhacks.id and kaggle.com, with majority-to-minority class proportions of 70:30 and 94:6, respectively. The results show that resampling with Easy Ensemble improves the Random Forest classifier's performance on the minority class: minority-class recall increased markedly after resampling. For the first dataset (finhacks.id), minority-class recall was 0.49 before resampling and 0.82 after. Similar results were obtained for the second dataset (kaggle.com), where minority-class recall rose from 0.14 to 0.73.
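
The idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes binary labels with class 1 as the minority, uses scikit-learn's RandomForestClassifier as the base learner, and averages the forests' predicted probabilities. The helper names (balanced_subsets, fit_easy_ensemble_rf, predict_proba_minority), the number of subsets, the forest size, and the synthetic data are all assumptions made for the example; the data only mimics a 94:6 class ratio and is not one of the paper's datasets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split


def balanced_subsets(y, n_subsets, rng):
    """Yield index arrays holding every minority row plus an equally
    sized random draw (without replacement) from the majority rows."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    for _ in range(n_subsets):
        drawn = rng.choice(maj_idx, size=min_idx.size, replace=False)
        yield np.concatenate([min_idx, drawn])


def fit_easy_ensemble_rf(X, y, n_subsets=10, seed=0):
    """Train one random forest per balanced subset (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    models = []
    for i, idx in enumerate(balanced_subsets(y, n_subsets, rng)):
        rf = RandomForestClassifier(n_estimators=50, random_state=seed + i)
        models.append(rf.fit(X[idx], y[idx]))
    return models


def predict_proba_minority(models, X):
    """Average the class-1 probability over all forests."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)


# Synthetic stand-in data with roughly a 94:6 class ratio (assumption).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.94, 0.06], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

models = fit_easy_ensemble_rf(X_tr, y_tr)
y_pred = (predict_proba_minority(models, X_te) >= 0.5).astype(int)
print("minority recall:", recall_score(y_te, y_pred, pos_label=1))
```

Each forest sees all minority rows but a different slice of the majority class, so the ensemble as a whole uses far more majority data than a single random under-sample would. The 0.5 decision threshold is an arbitrary choice here and could be tuned on validation data.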
