This is the fifth and last article in the series on Predicting Customer Churn using Machine Learning and AI. In case you missed the previous one, here is the link.

In this post, we will look at Model Evaluation, the last phase of our proposed CRISP-DM-based approach to customer attrition prediction. We will look at how we evaluated the performance of the various models we developed.

5- Model Evaluation

Model evaluation aims to estimate the overall accuracy of a model on future unseen data.

Model Evaluation Techniques

Techniques for evaluating the performance of a model fall into two categories: holdout and cross-validation. Both evaluate the model on a test set (i.e. data not seen by the model during training). Evaluating a model on the same data used to build it is not recommended, because a sufficiently flexible model can simply memorize the training set and predict the correct label for nearly every point in it, which says little about how it will perform on new data. A model that fits the training data closely but generalizes poorly is said to be overfitting.


Holdout

The purpose of holdout evaluation is to test a model on different data than it was trained on. This provides an unbiased estimate of learning performance.

In this technique, the dataset is randomly divided into three subsets:

  • Training set is a subset of the dataset used to build predictive models.
  • Validation set is a subset of the dataset used to assess the performance of the model built in the training phase. It provides a test platform for fine-tuning a model’s parameters and selecting the best performing model. Not all modeling algorithms need a validation set.
  • Test set, or unseen data, is a subset of the dataset used to assess the likely future performance of a model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.

The holdout approach is useful because of its speed, simplicity, and flexibility. However, this technique is often associated with high variability since differences in the training and test dataset can result in meaningful differences in the estimate of accuracy.
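The three-way split described above can be sketched in plain Python. This is a minimal illustration, not the code used in the case study: the function name `holdout_split`, the 60/20/20 proportions, and the seed are all assumptions chosen for the example.

```python
import random

def holdout_split(rows, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test subsets."""
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test

accounts = list(range(100))  # stand-in for 100 customer records (hypothetical data)
train, validation, test = holdout_split(accounts)
print(len(train), len(validation), len(test))  # 60 20 20
```

Changing the seed changes which records land in each subset, which is one way to see the variability mentioned above: the same model can score differently on different random splits.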


Cross-Validation

Cross-validation is a technique that involves partitioning the original dataset into a training set, used to train the model, and an independent set, used to evaluate it.

The most common cross-validation technique is k-fold cross-validation, where the original dataset is partitioned into k equal-sized subsamples, called folds. The value of k is specified by the user, with 5 or 10 being the usual choices. Training and evaluation are repeated k times, such that each time, one of the k subsets is used as the test/validation set and the other k-1 subsets are put together to form the training set. The error estimates from the k trials are averaged to give an overall measure of the model’s effectiveness.

For instance, when performing five-fold cross-validation, the data is first partitioned into 5 parts of (approximately) equal size, and a sequence of five models is trained. The first model uses the first fold as the test set and the remaining four folds as the training set, the second model uses the second fold as the test set, and so on. The accuracy estimates from the 5 trials are then averaged to give the overall estimate.

As can be seen, every data point appears in a test set exactly once and in a training set k-1 times. Compared with a single holdout split, this generally gives a less biased estimate, because most of the data is used for fitting in every round, and a less variable one, because every observation is eventually used for testing and the result is averaged over k splits.
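The fold bookkeeping can be sketched as follows. This is a simplified illustration: the helper `k_fold_indices` is a hypothetical name, and it splits indices in order without shuffling, whereas in practice the data would normally be shuffled first.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra item, so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))

# Count how often each data point is used for testing and for training.
test_counts = Counter(i for _, test_idx in folds for i in test_idx)
train_counts = Counter(i for train_idx, _ in folds for i in train_idx)
print(set(test_counts.values()), set(train_counts.values()))  # {1} {4}
```

With 10 points and k = 5, each point is tested once and trained on four times (k-1). A model would be fitted on each `train_idx` and scored on the matching `test_idx`, and the k scores averaged.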

Model Evaluation Metrics

Model evaluation metrics are required to quantify model performance. The choice of evaluation metric depends on the machine learning task at hand (such as classification, regression, ranking, clustering, or topic modeling, among others). Some metrics, such as precision and recall, are useful for multiple tasks.

Classification Metrics

In this section we will review some of the metrics used in classification problems, namely:

  • Classification Accuracy
  • Confusion matrix
  • Precision vs. Recall
  • Logarithmic Loss
  • Area under curve (AUC)
  • F-Measure

In our case study, Predicting Customer Churn, we developed classification models, so we will discuss only Classification Accuracy, the Confusion Matrix, and Precision vs. Recall here.

Classification Accuracy

Accuracy is a common evaluation metric for classification problems. It’s the number of correct predictions made as a ratio of all predictions made. By using cross-validation, we’d be “testing” our machine learning model in the “training” phase to check for overfitting and to get an idea about how our machine learning model will generalize to independent data (test data set).

Cross-validation can also be used to compare the performance of different machine learning models on the same dataset, and to select the values of a model’s parameters that maximize its accuracy, a process known as parameter tuning.
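Accuracy as defined above, correct predictions divided by all predictions, can be computed in a few lines. The labels below are made-up values for illustration, not results from the case study.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels for five accounts.
actual    = ["Cancelled", "Normal", "Normal",    "Cancelled", "Normal"]
predicted = ["Cancelled", "Normal", "Cancelled", "Normal",    "Normal"]
print(accuracy(actual, predicted))  # 3 of 5 correct -> 0.6
```

Under cross-validation, this score would be computed once per fold on that fold's held-out data and then averaged.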

Confusion Matrix

A confusion matrix provides a more detailed breakdown of correct and incorrect classifications for each class. The confusion matrix for our case study, Predicting Customer Churn, is given below.

Figure: Confusion Matrix


  • True Positive (TP): Number of Cancelled Accounts that are labeled as Cancelled by the model.
  • True Negative (TN): Number of Normal Accounts that are labeled as Normal by the model.
  • False Positive (FP): Number of Normal Accounts that are labeled as Cancelled by the model.
  • False Negative (FN): Number of Cancelled Accounts that are labeled as Normal by the model.
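The four counts can be tallied directly from actual and predicted labels. The sketch below treats Cancelled as the positive class, so a false positive is a Normal account predicted as Cancelled and a false negative is a Cancelled account predicted as Normal. The function name and sample labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="Cancelled"):
    """Count TP, TN, FP, and FN, treating `positive` as the positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted churn: correct if the account really cancelled.
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted no churn: correct if the account really stayed.
            counts["FN" if actual == positive else "TN"] += 1
    return counts

# Hypothetical labels for five accounts.
actual    = ["Cancelled", "Normal", "Normal",    "Cancelled", "Normal"]
predicted = ["Cancelled", "Normal", "Cancelled", "Normal",    "Normal"]
print(confusion_counts(actual, predicted))
# {'TP': 1, 'TN': 2, 'FP': 1, 'FN': 1}
```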

Precision vs. Recall

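  • Precision is the percentage of the accounts the model labels as Cancelled that actually cancelled.
  • Recall is the percentage of all accounts that actually cancelled that the model correctly labels as Cancelled.

In terms of the confusion matrix counts, Precision, Recall, and Model Accuracy are calculated as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)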
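Precision, recall, and accuracy can each be computed from the four confusion matrix counts. The sketch below uses the standard definitions. The counts are invented for illustration and are not the results of the case study.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn) if tp + fn else 0.0

def accuracy_from_counts(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

# Hypothetical counts for 1,000 accounts (Cancelled is the positive class).
tp, tn, fp, fn = 80, 850, 20, 50
print(round(precision(tp, fp), 3))                     # 0.8
print(round(recall(tp, fn), 3))                        # 0.615
print(round(accuracy_from_counts(tp, tn, fp, fn), 3))  # 0.93
```

In this made-up example the model is 93% accurate overall yet finds only about 62% of the accounts that cancelled, which is why accuracy is usually read alongside precision and recall.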

Unfortunately, it is usually not possible to maximize both of these metrics at the same time, as improving one typically comes at the cost of the other.
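One way to see this trade-off, assuming the model outputs a churn score between 0 and 1 and an account is labeled Cancelled when its score reaches a chosen threshold, is to compute both metrics at several thresholds. The scores, labels, and thresholds below are made up for illustration.

```python
def precision_recall_at(threshold, scores, labels):
    """Predict Cancelled (1) when score >= threshold; return (precision, recall)."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical churn scores and true outcomes (1 = Cancelled, 0 = Normal).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

for threshold in (0.25, 0.5, 0.8):
    p, r = precision_recall_at(threshold, scores, labels)
    print(threshold, round(p, 3), round(r, 3))
# 0.25 0.667 1.0
# 0.5 0.75 0.75
# 0.8 1.0 0.5
```

Raising the threshold makes the model more selective: precision rises from 0.667 to 1.0 while recall falls from 1.0 to 0.5.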

Here are some tips…

For problems where both precision and recall are important, one can select the model that maximizes the F1 score, which is the harmonic mean of precision and recall.

For other problems, a trade-off is needed, and a decision has to be made whether to maximize precision, or recall.
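As a rough sketch of the first tip, the F1 score can be computed for each candidate model and the highest one selected. The model names and their precision and recall values are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs for three candidate models.
candidates = {
    "model_a": (0.80, 0.60),
    "model_b": (0.70, 0.72),
    "model_c": (0.95, 0.40),
}

f1_by_model = {name: f1_score(p, r) for name, (p, r) in candidates.items()}
best = max(f1_by_model, key=f1_by_model.get)
print(best, round(f1_by_model[best], 3))  # model_b 0.71
```

Here model_c has the highest precision, but its low recall drags its F1 down. The harmonic mean rewards models that are balanced across both metrics.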


Ideally, the overall performance of a model tells us how well it performs on unseen data. Making predictions on future data is often the main problem we want to solve. It’s important to understand the context before choosing a metric because each machine learning model tries to solve a problem with a different objective using a different dataset.

This is the last post of the series Predicting Customer Churn using Machine Learning and AI.

Keep coming back for more useful case studies.

Cheers :)

Disclaimer: This case study is solely an educational exercise, and the information contained in it is to be used only as a case study example for teaching purposes. This hypothetical case study is provided for illustrative purposes only and does not represent an actual client or an actual client’s experience. All of the data, contents and information presented here have been altered and edited to protect the confidentiality and privacy of the clients and my employer.