Analysis of imbalanced data set problem: The case of churn prediction for telecommunication

Chun Gui

Abstract


Class-imbalanced datasets are common in the mobile Internet industry. We tested three feature selection techniques, Random Forest (RF), Relative Weight (RW), and Standardized Regression Coefficients (SRC); three balancing methods, over-sampling (OS), under-sampling (US), and the synthetic minority over-sampling technique (SMOTE); and one widely used classification method, RF. Each combined model consists of a feature selection technique, a balancing technique, and the classification method. The original dataset, which has 45 thousand records and 22 features, was used to evaluate the performance of both the feature selection and the balancing techniques. The experimental results show that SRC combined with SMOTE attained the minimum Cost of 1085. By calculating the Cost for all models, we identified the features most important for minimizing cost in telecommunication. Applying these combined models may help maximize profit at minimum customer-retention expenditure and help reduce customer churn rates.
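To make the three balancing methods named in the abstract concrete, the following is a minimal, standard-library sketch of how over-sampling, under-sampling, and SMOTE-style interpolation each equalize class sizes. It is an illustration under stated assumptions, not the paper's implementation: the function names, the tuple-of-floats row format, the choice of k, and the balancing target (equal class sizes) are all hypothetical, and the sketch assumes the majority class is at least as large as the minority class and that the minority class has at least two rows.

```python
import random


def random_over_sample(majority, minority, rng):
    """OS: duplicate randomly chosen minority rows until both classes are the same size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra


def random_under_sample(majority, minority, rng):
    """US: keep a random subset of the majority class, sized to match the minority class."""
    return rng.sample(majority, len(minority)), minority


def smote_like(majority, minority, rng, k=3):
    """SMOTE-style: add synthetic minority rows by interpolating between a
    minority row and one of its k nearest minority neighbours."""

    def dist2(a, b):
        # Squared Euclidean distance between two rows.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(len(majority) - len(minority)):
        base = rng.choice(minority)
        # Nearest minority neighbours of the chosen row, excluding the row itself.
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: dist2(base, m),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return majority, minority + synthetic
```

Each function takes the two classes as lists of feature tuples plus a `random.Random` instance and returns the rebalanced pair; in a pipeline like the one described, the rebalanced rows would then be passed to the classifier. The key difference is that OS and US only reuse or discard existing rows, whereas the SMOTE-style variant creates new rows that lie between existing minority rows.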



DOI: https://doi.org/10.5430/air.v6n2p93


Artificial Intelligence Research

ISSN 1927-6974 (Print)   ISSN 1927-6982 (Online)

Copyright © Sciedu Press 