This is the fourth article of the series on Predicting Customer Churn using Machine Learning and AI. In case you missed the previous one, here is the link.

In this article, we will look at the fourth phase of our CRISP-DM-based approach to customer attrition prediction. Before we dive into modeling, let's build some basic understanding of machine learning and predictive modeling.

4- Modeling

Machine learning is a methodology that uses cognitive learning methods to make a machine learn from data without being explicitly programmed. Complex algorithms and models are devised that lead the machine to make predictions, and with time and experience, these predictions improve.

Predictive modeling, on the other hand, is an advanced form of basic descriptive analytics that makes use of current and historical data to predict an outcome. Predictive modeling is a subset and an application of machine learning.

Based on the type of task, we can classify machine learning models into the following types:

  • Classification Models
  • Regression Models
  • Clustering
  • Dimensionality Reduction
  • Deep Learning, etc.

Customer attrition prediction is a classification problem. Classification is a supervised learning approach in which the program learns from the labeled input data given to it and then uses this learning to classify new observations. The data set may simply be bi-class (like identifying whether a person is male or female, or whether an email is spam or not) or it may be multi-class. Some examples of classification problems are speech recognition, handwriting recognition, biometric identification, and document classification.
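To make this concrete, here is a minimal binary-classification sketch in scikit-learn. The data is synthetic and the model choice is an illustrative assumption, not the model used in this series; it only shows the supervised-learning loop of fit-then-predict on unseen observations.

```python
# Minimal binary classification sketch: learn from labeled examples,
# then classify new observations. Data and parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two classes: 0 = customer stays, 1 = customer churns (synthetic data)
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"Hold-out accuracy: {accuracy:.2f}")
```

The same fit/predict pattern applies whatever classifier is eventually chosen; only the algorithm behind `clf` changes.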

One of the common challenges when working on a classification problem is dealing with imbalanced data.

Data Balancing Problem

Skewed or imbalanced classes refer to a dataset in which the number of training examples belonging to one class heavily outnumbers the number of training examples belonging to the other.

  • Imbalanced classes are a common problem in machine learning classification.
  • The majority class dominates the minority class.
  • The model becomes biased toward the majority class.
  • This results in poor classification of the minority class.

Imbalanced Data

Here are some options to solve skewed classes:

  1. Do nothing. Sometimes you get lucky and nothing needs to be done: you can train on the so-called natural (or stratified) distribution, and sometimes it works without modification.
  2. Try resampling:
    - Oversample the minority class.
    - Undersample the majority class.
    - Synthesize new minority-class samples (SMOTE).
  3. Try an anomaly-detection framework.
  4. At the algorithm level, or after it:
    - Adjust the class weights.
    - Adjust the decision threshold.
    - Modify an existing algorithm to be more sensitive to rare classes.
  5. Construct an entirely new algorithm to perform well on imbalanced data.
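Option 4 is often the cheapest to try. The sketch below shows the two algorithm-level levers with scikit-learn: re-weighting classes during training and lowering the decision threshold afterwards. The dataset, imbalance ratio, and threshold value are illustrative assumptions.

```python
# Sketch of option 4: adjust class weights and the decision threshold.
# Dataset and parameter values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced data: roughly 5% minority (positive) class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)

# class_weight='balanced' penalizes errors inversely to class frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Lowering the default 0.5 threshold catches more of the rare class
proba = clf.predict_proba(X)[:, 1]
preds_default = (proba >= 0.5).astype(int)
preds_lowered = (proba >= 0.3).astype(int)
print("minority predictions at threshold 0.5:", preds_default.sum())
print("minority predictions at threshold 0.3:", preds_lowered.sum())
```

Lowering the threshold trades precision for recall on the minority class, so the right value depends on the cost of missing a churner versus chasing a loyal customer.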

How did we handle skewed classes?

Answer: by iteratively applying SMOTE (Synthetic Minority Over-sampling Technique).
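The core idea of SMOTE is to synthesize new minority samples by interpolating between a minority point and one of its nearest minority neighbours. The toy implementation below is only a sketch of that mechanic, not the production approach (in practice one would use the imbalanced-learn package); the data and parameters are illustrative assumptions.

```python
# Simplified sketch of the SMOTE idea: new minority samples are drawn
# on the line segments between a minority point and a random one of its
# k nearest minority neighbours. Toy version for illustration only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_minority, n_new, k=3, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)      # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))   # pick a minority point
        j = rng.choice(idx[i, 1:])          # pick one of its neighbours
        gap = rng.random()                  # interpolation factor in [0, 1]
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sketch(minority, n_new=6)
print(new_points.shape)  # 6 synthetic 2-D minority samples
```

Applying this iteratively, as the answer above describes, means resampling, retraining, and checking minority-class metrics until the balance is acceptable.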

Feature Engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. Feature engineering turns your raw inputs into things the algorithm can understand.

  • It is the process of creating a dataset optimized to maximize the information density of your data.
  • It uses domain knowledge to create features that make the algorithm work.
  • Coming up with features is difficult, time-consuming, and requires expert knowledge.

Feature Engineering

In machine learning, your model is only ever as good as the data you train it on. Good data preparation and feature engineering are integral to better predictions. Some machine learning projects succeed and some fail. What makes the difference? The features used!

How did we do feature engineering?

We kept three basic principles in mind:

  1. More variables make models less interpretable.
  2. Models have to generalize to other data.
  3. There is a close connection between feature engineering and cross-validation.

We followed the approach below:

  1. Brainstorming or testing features.
  2. Deciding what features to create.
  3. Creating features.
  4. Checking how the features work with your model.
  5. Improving your features if needed.
  6. Going back to brainstorming and creating more features until the work is done.

For example: from a basic transaction variable, the following variables were created during the feature engineering process.
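As an illustration of that kind of derivation, the pandas sketch below builds per-customer features from a raw transaction table. The column names and aggregations are assumptions made for the example, not the actual variables built in this project.

```python
# Illustrative feature engineering: derive per-customer features from a
# raw transaction table. Column names and aggregations are assumptions.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount":      [20.0, 35.0, 15.0, 200.0, 180.0],
    "month":       [1, 2, 3, 1, 3],
})

# One engineered row per customer: counts, totals, averages, recency-style
features = transactions.groupby("customer_id").agg(
    txn_count=("amount", "count"),
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    active_months=("month", "nunique"),
).reset_index()
print(features)
```

Each derived column is a candidate feature to be tested against the model, following the brainstorm-create-check loop above.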

Attribute Importance

Attribute Importance is a process to identify and rank the attributes that are most important in predicting a target attribute.

This process ranks attributes according to the strength of their relationship with the target attribute; for example, which factors are most associated with customers who are going to leave us voluntarily. There are many algorithms that can be used for this purpose:

  • Principal Component Analysis
  • Non-negative Matrix Factorization
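As a small sketch of the first of these, PCA ranks directions in the data by how much variance each explains; components with tiny explained variance can often be dropped. The synthetic data below is purely illustrative.

```python
# Sketch of PCA ranking directions by explained variance (scikit-learn).
# Synthetic, purely illustrative data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 0] *= 10  # give one direction much larger variance than the rest

pca = PCA(n_components=5).fit(X)
ratios = pca.explained_variance_ratio_
print(ratios)  # the inflated direction dominates the ranking
```

Note that PCA ranks by variance rather than by relationship with the target, so in practice it is combined with the target-aware checks described above.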

Feature Selection is a very critical component of a data scientist's workflow. When presented with data of very high dimensionality, models often struggle because:

  1. Training time grows rapidly with the number of features.
  2. The risk of overfitting increases with the number of features.

Feature selection methods help with these problems by reducing the dimensions without losing much of the total information.
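A minimal feature-selection sketch with scikit-learn's `SelectKBest`: score each feature against the target and keep only the top k. The dataset and the choice of k are illustrative assumptions.

```python
# Score every feature against the target and keep the k best.
# Dataset and k are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)
```

Here 20 features are cut down to 5 before training, directly addressing both the training-time and overfitting concerns above.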

At this point, it is important to understand the bias-variance trade-off and how to use it to better understand machine learning algorithms and get better performance on your data.

Bias-Variance Tradeoff

Bias is the set of simplifying assumptions a model makes to make the target function easier to approximate. Variance is the amount by which the estimate of the target function changes when given different training data. The trade-off is the tension between the error introduced by bias and the error introduced by variance.

Have a look at the elaboration below to better understand the bias-variance trade-off.

Key points in the bias-variance trade-off:

  • Overfitting models are too good to be true.
  • Underfitting models are too simple to capture the complexity in the data.
  • Models with high bias tend to underfit the training data.
  • Models with high variance tend to overfit the training data.

Various models are developed and trained before selecting the best one.

Models developed and trained

In the next and final article of the series on Predicting Customer Churn using Machine Learning and AI, we will look at the methods to evaluate models. We will also see which model we adopted. Until then...

Cheers :)