This is the third article in the series on Predicting Customer Churn using Machine Learning and AI. In case you missed the previous one, here is the link.

In this article, we will look at the next two phases of our proposed CRISP-DM based approach for customer attrition prediction: first, Data Understanding, which we have called Data Collection, and second, Data Preparation. We will also look at the challenges we faced during data preparation and how we resolved them.

2- Data Collection

Given below is the data processing flow diagram.

Data Collection and Modeling Flow Chart

In our proposed approach for customer attrition prediction, the idea is to make use of historical data from various diversified data sources, both within and outside of our dummy organization. These data sources include business data repositories, the data warehouse, and digital sources like the mobile app and online portals. Data from Google Analytics will also be used to support the model. Data analysis is then performed to identify the data that is relevant to the business goal, i.e., predicting customer attrition. Given below is a glimpse of the data collected from various sources.

Building the Data Pipeline

At this point, there was a need for an automated process to extract only the required data from the sources and process it so that it could be used directly in the model. We called it the Data Pipeline. It gave us two main benefits: first, we had all the required data in one place and in the same format; second, the process became independent and re-executable whenever required.

Building a data pipeline is one of the most time-consuming and challenging phases; 70 to 80 percent of our project time was spent in this stage. It involves all sorts of small and large time-consuming complexities: debugging transformation logic, creating repeatable tasks, tuning and optimizing pipeline performance, and ensuring high-speed query performance.
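As a rough illustration, here is a minimal extract-and-transform sketch of such a pipeline in Python with pandas. The sources and column names (a CRM extract, a web-activity extract, `customer_id`) are hypothetical stand-ins for the real repositories and warehouse tables:

```python
import pandas as pd

def extract():
    # Stand-ins for the real sources (warehouse tables, app/portal logs, etc.)
    crm = pd.DataFrame({
        "customer_id":   [1, 2, 3],
        "tenure_months": [12, 3, 40],
        "churned":       [0, 1, 0],
    })
    web = pd.DataFrame({
        "customer_id":     [1, 3],
        "logins_last_30d": [14, 2],
    })
    return crm, web

def transform(crm, web):
    # Left-join so customers without web activity are kept, then fill the gap
    df = crm.merge(web, on="customer_id", how="left")
    df["logins_last_30d"] = df["logins_last_30d"].fillna(0).astype(int)
    return df

def run_pipeline():
    crm, web = extract()
    return transform(crm, web)

df = run_pipeline()
print(df.shape)  # (3, 4) — one row per customer, all sources merged
```

Because the steps are plain functions, the whole pipeline can be re-executed on demand, which was exactly the second benefit mentioned above.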

3- Data Preparation

In this phase, we transform raw data so that we can run it through machine learning algorithms to uncover insights or make predictions. It is quite complicated because it requires a lot of data exploration and profiling, as well as treatment of missing values and outliers. On top of that, we also need to format the data to make it consistent and to improve data quality. This phase also involves the most creative step in the whole process, called Feature Engineering, which we will discuss in detail in a separate post. Finally, we split the data into training, testing, and cross-validation sets.
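The final splitting step can be sketched in a few lines of NumPy (in practice one would typically reach for scikit-learn's `train_test_split`; the shuffle-and-slice version below is just to show the idea, with toy data):

```python
import numpy as np

def train_test_split(X, y, test_size=0.2, seed=42):
    """Shuffle the indices and hold out a test_size fraction for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_size)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(100).reshape(50, 2)   # 50 toy samples, 2 features
y = np.array([0, 1] * 25)           # toy churn labels

X_train, X_test, y_train, y_test = train_test_split(X, y)
print(len(X_train), len(X_test))  # 40 10
```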

Data Preparation Challenges

The two most important factors for an accurate prediction of attrition are:

  • A high volume of attrition data
  • Feature-rich data

Challenge #1: Missing Values

If the missing values are not handled properly, machine learning algorithms may end up drawing an inaccurate inference about the data.

A few ways to handle missing values are:

  • Eliminating rows
  • Replacing NA with the Most Frequent Value
  • Replacing NA with the Mean/Median Value
  • Using Weight of Evidence (WOE)
  • Using k-NN (Euclidean, Manhattan, etc.)
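A minimal sketch of two of these strategies with pandas, on a toy customer table (the column names `plan` and `monthly` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "plan":    ["basic", "basic", "premium", None],  # categorical, one NA
    "monthly": [20.0, None, 55.0, 20.0],             # numeric, one NA
})

# Replace NA with the most frequent value for the categorical column ...
df["plan"] = df["plan"].fillna(df["plan"].mode()[0])

# ... and with the median for the numeric column
df["monthly"] = df["monthly"].fillna(df["monthly"].median())

print(df.isna().sum().sum())  # 0 — no missing values left
```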

Challenge #2: Outliers

An outlier is a data point that differs significantly from other observations. Outliers in input data can skew and mislead the training process of machine learning algorithms resulting in longer training times, less accurate models and ultimately poorer results.

Outlier treatment:

  • Standard Deviations (multiples of sigma, e.g. 3, 2, 1)
  • Percent (lower percent value, upper percent value)
  • Value (lower value, upper value)
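A sketch of the standard-deviations approach with NumPy, on synthetic data (3 sigma is the conventional default; values beyond the bound are capped rather than dropped):

```python
import numpy as np

def clip_sigma(x, n_sigma=3):
    """Cap values more than n_sigma standard deviations from the mean."""
    mean, std = x.mean(), x.std()
    return np.clip(x, mean - n_sigma * std, mean + n_sigma * std)

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, 1000), 500.0)  # one extreme outlier
clipped = clip_sigma(x)

print(x.max(), round(clipped.max(), 1))  # the 500 is capped near the 3-sigma bound
```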

Challenge #3: Different Scales

Normalization is a technique used to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information. For example, assume your input dataset contains one column with values ranging from 0 to 1, and another column with values ranging from 10,000 to 100,000. The great difference in the scale of the numbers could cause problems when you attempt to combine the values as features during modeling.

Normalization applies only to numeric attributes. Some ways to do normalization are:

  • Min Max: Normalizes each attribute using the transformation x_new = (x_old-min)/(max-min)
  • Linear Scale: Normalizes each attribute using the transformation x_new = (x_old-shift)/scale
  • Z-Score: Normalizes each attribute using the transformation x_new = (x_old-mean)/standard deviation
  • Custom: The user defines normalization.
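The Min-Max and Z-Score transformations above can be sketched directly in NumPy (toy values chosen to mirror the 10,000–100,000 example):

```python
import numpy as np

x = np.array([10_000.0, 40_000.0, 100_000.0])

# Min-Max: x_new = (x_old - min) / (max - min), rescales to [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-Score: x_new = (x_old - mean) / standard deviation, gives zero mean and unit variance
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)  # [0.0, 0.333..., 1.0]
```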

Here are some advantages of normalization:

  1. The model is able to learn faster.
  2. The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges.

Challenge #4: Continuous vs. Categorical

Binning is a transformation type that converts:

  • a continuous variable into a categorical variable.
  • a continuous value into a discrete value. For example, age can be converted into 10 groups labelled 1 to 10.
  • a categorical value with many values into a categorical variable with fewer values.

Note: Binning is not always required.  

So, should we bin or not? A common question!


  • In general, we should not bin, because binning loses information.
  • Binning effectively increases the degrees of freedom of the model, so it can cause over-fitting. If we have a "high bias" model, binning may not be bad, but if we have a "high variance" model, we should avoid it.
  • It depends on the model. For a linear model on data with a lot of outliers, binning is probably better; for a tree model, outliers and binning will not make much difference.


On the other hand, binning can be a good choice when:

  • It makes sense for your problem.
  • It improves interpretability.
  • The model has limitations that call for it.
  • Continuous variables have too many unique values to model effectively, so you replace a column of numbers with categorical values that represent specific ranges.
  • A dataset has a few extreme values, all well outside the expected range, and these values have an outsized influence on the trained model. To mitigate the bias in the model, you might transform the data to a uniform distribution, using the quantiles (or equal-height) method.

What shall we do then?
Try binning and cross-validate: check whether it improves the performance and accuracy you got without binning. In our case, we got better results with binning for a few variables.
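Both binning flavours discussed above can be sketched with pandas (toy ages; the bin counts and labels are arbitrary):

```python
import pandas as pd

age = pd.Series([18, 25, 33, 47, 52, 61, 70])

# Equal-width bins: continuous age -> up to 10 groups labelled 1..10
groups = pd.cut(age, bins=10, labels=range(1, 11))

# Quantile (equal-height) bins: useful when a few extreme values dominate
quartiles = pd.qcut(age, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
```

`pd.cut` splits the value range into equal-width intervals, while `pd.qcut` puts roughly the same number of observations in each bin, which is the quantile approach mentioned earlier for datasets with extreme values.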

In the next post of this series, we will look at the Modeling phase of our proposed approach for customer attrition prediction.

Until then,

Cheers :)