EASY ENSEMBLE WITH RANDOM FOREST TO HANDLE IMBALANCED DATA IN CLASSIFICATION

Sarini Abdullah; GV Prasetyo

doi:10.14710/jfma.v3i1.7415

*Sarini Abdullah - Department of Mathematics, Universitas Indonesia, Indonesia

GV Prasetyo - Department of Mathematics, FMIPA Universitas Indonesia, Indonesia

Abstract

Imbalanced data can cause problems at the problem-definition level, the algorithm level, and the data level. Several methods have been developed to address this issue; one state-of-the-art method is Easy Ensemble. Easy Ensemble is claimed to improve a model's ability to classify the minority class and to overcome the deficiencies of random under-sampling. In this paper we discuss the implementation of Easy Ensemble with Random Forest classifiers to handle the imbalance problem in a credit scoring case. The combined method is applied to two datasets taken from data science competition websites, finhacks.id and kaggle.com, with majority-to-minority class proportions of 70:30 and 94:6, respectively. The results show that resampling with Easy Ensemble improves the Random Forest classifier's performance on the minority class. Recall on the minority class increased substantially after resampling: for the first dataset (finhacks.id) it rose from 0.49 to 0.82, and for the second dataset (kaggle.com) it rose from just 0.14 to 0.73.
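
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn is available, uses synthetic data with roughly a 94:6 class ratio in place of the finhacks.id and kaggle.com datasets, and combines the forests by averaging their predicted probabilities, which is one plausible choice among several. The helper names `fit_easy_ensemble` and `predict_easy_ensemble`, the number of subsets, and the 0.5 threshold are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split


def fit_easy_ensemble(X, y, n_subsets=10, seed=0):
    """Train one Random Forest per balanced subset (minority label assumed to be 1)."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    models = []
    for i in range(n_subsets):
        # Under-sample the majority class down to the minority class size.
        sampled = rng.choice(majority_idx, size=minority_idx.size, replace=False)
        idx = np.concatenate([minority_idx, sampled])
        model = RandomForestClassifier(n_estimators=50, random_state=seed + i)
        models.append(model.fit(X[idx], y[idx]))
    return models


def predict_easy_ensemble(models, X):
    """Average the minority-class probabilities of all models, then threshold at 0.5."""
    proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (proba >= 0.5).astype(int)


# Synthetic stand-in data (assumption), not the datasets used in the paper.
X, y = make_classification(
    n_samples=3000, n_features=10, weights=[0.94, 0.06], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Baseline: a single Random Forest trained on the imbalanced data as-is.
baseline = RandomForestClassifier(n_estimators=50, random_state=0)
baseline.fit(X_train, y_train)

models = fit_easy_ensemble(X_train, y_train)
y_pred_ensemble = predict_easy_ensemble(models, X_test)

# Recall on the minority class (label 1) before and after resampling.
recall_baseline = recall_score(y_test, baseline.predict(X_test))
recall_ensemble = recall_score(y_test, y_pred_ensemble)
print(f"minority recall, plain Random Forest:   {recall_baseline:.2f}")
print(f"minority recall, Easy Ensemble sketch:  {recall_ensemble:.2f}")
```

Because every subset keeps all minority samples and draws a fresh majority sample, the ensemble as a whole sees far more of the majority class than a single random under-sample would, which is the deficiency of random under-sampling that the abstract refers to. The recall values printed here depend on the synthetic data and should not be compared with the figures reported in the paper.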

Article Info

Section: FUNDAMENTAL MATHEMATICS AND APPLICATIONS

Language : ID

In Vol 3, No 1 (2020)
