This is part-2 of the case study on Boost Collections and Recoveries using Machine Learning (MLBR).

To read part-1, click here.

Disclaimer: This case study is solely an educational exercise, and the information it contains is to be used only as a case study example for teaching purposes. This hypothetical case study is provided for illustrative purposes only and does not represent an actual client or an actual client’s experience. All of the data, contents and information presented here have been altered and edited to protect the confidentiality and privacy of the company.

In this part we will look at data collection and preparation.

4.1.  Data Collection

Data collection is the process of gathering information from various sources. In this case we collected various types of customer-related data from the data warehouse and OLTP repositories: customer demographics, the existing recoveries base, collection agents, external agencies, lawyers, call logs, payment history, etc.

Here is the high-level data pipeline for MLBR:

High level Data pipeline for Boost Collection and Recoveries using Machine Learning (MLBR)

4.2.  Data Preparation

After gathering and exploring the data, we came up with three types of variables as inputs to our model.

  • Simple Variables: These are direct fields from the data source.
  • Complex Variables: These are based on some logic or calculations.
  • Engineered Variables: These are prepared after doing extensive exploratory analysis or by using Principal Component Analysis (PCA).

There were over 40 variables gathered for the input data set. Some of the variables are listed below:

  • Market of the client
  • Top product of the client
  • Year of first account open date
  • Client tenure
  • Written off year
  • Number of months account remained in collection
  • Number of months account lived before written off
  • Number of months taken to resolve (an arbitrarily large value is used for unresolved cases)
  • Total write off amount
  • Total recovered amount
  • Yet to recover amount
  • Percentage recovered
  • Percentage yet to recover
  • Amount of first recovery payment
  • Percentage of first recovery payment
  • ETA days to receive first payment
  • Total number of accounts customer holds with the company.
  • Number of settled accounts
  • Number of yet to recover accounts
  • Percentage of number of accounts settled
  • Current external agency name
  • Current lawyer code
  • Number of agents worked
  • Repayment Score: (Feature Engineering) This variable calculates the repayment power of the customer based on historical data.
  • Collection Score: (Feature Engineering) This variable calculates the behavior score of the customer.
  • Expert Score: (Feature Engineering) This variable provides the subject matter expert rating of the customer.
  • Risk Level Score provided by the application team.
  • Flag for workable or not workable cases.
  • Fully Resolved or Partial
  • Number of payments made  
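Several of the variables listed above, such as the yet to recover amount and the percentages, are simple calculations on top of direct fields. The sketch below shows how a few of them could be derived. It is only an illustration: the function name, argument names and output keys are hypothetical and are not the actual field names used in MLBR.

```python
def derive_recovery_variables(write_off_amount, recovered_amount,
                              total_accounts, settled_accounts):
    """Derive a few 'complex' variables from simple fields.

    All names here are hypothetical placeholders, not real MLBR columns.
    """
    yet_to_recover = write_off_amount - recovered_amount
    # Guard against division by zero when nothing was written off.
    pct_recovered = (100.0 * recovered_amount / write_off_amount
                     if write_off_amount else 0.0)
    pct_yet_to_recover = 100.0 - pct_recovered if write_off_amount else 0.0
    pct_settled = (100.0 * settled_accounts / total_accounts
                   if total_accounts else 0.0)
    return {
        "yet_to_recover_amount": yet_to_recover,
        "pct_recovered": pct_recovered,
        "pct_yet_to_recover": pct_yet_to_recover,
        "accounts_yet_to_recover": total_accounts - settled_accounts,
        "pct_accounts_settled": pct_settled,
    }


# Made-up customer: 10,000 written off, 2,500 recovered, 1 of 4 accounts settled.
example = derive_recovery_variables(10_000.0, 2_500.0, 4, 1)
print(example)
```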

Once data for all the variables has been collected, it cannot be used directly for modeling. At this stage the data is treated so that it can be used for modeling. Preparing the dataset is usually the hardest and most time-consuming task in predictive analytics; normally 70 to 80 percent of the project time is spent in this stage. The ultimate goal of this phase is to integrate and enrich the data into an analytical dataset. The main activities include:

  • Data Acquiring
    Acquire the data from all data sources and integrate it to create a single dataset, i.e. denormalization.
  • Data Audit
    Use descriptive analytics techniques (e.g. mean, median, standard deviation, maximum value, minimum value) to take a comprehensive first look at the raw data for initial data exploration.
  • Missing Values
    Missing values are generally replaced with zeros. The approach needs to be agreed with the business.
  • Outlier Detection and Fixation
    Identify records with outliers or extreme values, normally values more than 3 or 5 standard deviations from the mean. These records can be put aside if outliers make up less than 1% of the total data. Otherwise, K-means or average values are other options for treating them.
  • Correlation Analysis
    Identify the top features that have a strong relationship with the target variable, i.e. the key variables affecting recovery.
  • Data Balancing
    This looks at the percentage of Recovered vs. Not Recovered customers in the training dataset. Any imbalance between the two classes needs to be removed. Several techniques can be used, including random undersampling, random oversampling and the synthetic minority over-sampling technique (SMOTE).
  • Data Imputation
    Techniques used are:
    a)  Categorization
    b)  Label Encoding
    c)  One-hot encoding/Dummy Coding
  • Customer Segmentation (if required)
    The customer base can be segmented by product and market if required.
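Three of the treatments above (outlier removal by standard deviation, random oversampling and one-hot encoding) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in MLBR: the field names `write_off`, `market` and `recovered`, the threshold of 3 standard deviations and the toy data are all hypothetical.

```python
import random
import statistics


def drop_outliers(rows, field, z_max=3.0):
    """Keep rows whose value lies within z_max standard deviations of the mean."""
    values = [r[field] for r in rows]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return list(rows)
    return [r for r in rows if abs(r[field] - mean) / std <= z_max]


def random_oversample(rows, label, seed=0):
    """Duplicate random rows of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(r[label], []).append(r)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend(members + extra)
    return balanced


def one_hot(rows, field):
    """Replace a categorical field with one 0/1 column per category."""
    categories = sorted({r[field] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != field}
        for c in categories:
            new_row[f"{field}_{c}"] = 1 if r[field] == c else 0
        encoded.append(new_row)
    return encoded


# Hypothetical toy data: 20 ordinary accounts plus one extreme write-off.
rows = [{"write_off": 1000.0 + 10 * i,
         "market": "A" if i % 2 else "B",
         "recovered": 1 if i < 5 else 0} for i in range(20)]
rows.append({"write_off": 1_000_000.0, "market": "A", "recovered": 0})

clean = drop_outliers(rows, "write_off")           # drops the extreme record
balanced = random_oversample(clean, "recovered")   # equal class sizes
encoded = one_hot(balanced, "market")              # market_A / market_B columns
```

Random oversampling is used here only because it is the simplest of the balancing options to show; SMOTE generates synthetic minority records instead of duplicating existing ones.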

Given below is the result after applying all the data treatments to the variables.

Variables Summary

4.3.  Exploratory Data Analysis

After cleaning and munging the data, the next step in a machine learning project is to do exploratory data analysis (EDA). It involves numeric summaries, plots, aggregations, distributions, densities, reviewing all the levels of factor variables and applying general statistical methods. A clear understanding of the data provides the foundation for model selection, i.e. choosing the correct machine learning algorithm to solve your problem.

An extensive exploratory data analysis was done for MLBR, which will be available in a separate series of articles soon.

4.4.  Feature Engineering

Feature engineering is the process of determining which predictor variables will contribute the most to the predictive power of a machine learning algorithm. There are two commonly used methods for making this selection. The Forward Selection Procedure starts with no variables in the model: you iteratively add variables and test the predictive accuracy of the model until adding more variables no longer has a positive effect. The Backward Elimination Procedure begins with all the variables in the model: you proceed by removing variables one at a time and testing the predictive accuracy of the model after each removal.
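The forward selection loop described above can be sketched as follows. This is a minimal illustration, not the procedure used in MLBR: the scoring function is a toy nearest-centroid classifier evaluated on the training data, and the feature names `repayment_score` and `noise` are hypothetical. In practice the score would come from the real model, ideally on validation data.

```python
def nearest_centroid_accuracy(X, y, features):
    """Toy scorer: training accuracy of a nearest-centroid classifier."""
    if not features:
        return 0.0
    centroids = {}
    for label in set(y):
        members = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(r[f] for r in members) / len(members)
                            for f in features]
    correct = 0
    for row, lab in zip(X, y):
        def dist(label):
            return sum((row[f] - c) ** 2
                       for f, c in zip(features, centroids[label]))
        if min(sorted(centroids), key=dist) == lab:
            correct += 1
    return correct / len(y)


def forward_selection(X, y, candidates, score=nearest_centroid_accuracy):
    """Greedily add the feature that improves the score most; stop when none does."""
    selected, best = [], 0.0
    remaining = list(candidates)
    while remaining:
        trials = [(score(X, y, selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score <= best:
            break  # adding more variables no longer has a positive effect
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Hypothetical toy data: one informative feature and one uninformative one.
X = [{"repayment_score": 0.9, "noise": 0.3},
     {"repayment_score": 0.8, "noise": 0.7},
     {"repayment_score": 0.7, "noise": 0.5},
     {"repayment_score": 0.2, "noise": 0.4},
     {"repayment_score": 0.1, "noise": 0.6},
     {"repayment_score": 0.3, "noise": 0.5}]
y = [1, 1, 1, 0, 0, 0]

selected, best = forward_selection(X, y, ["repayment_score", "noise"])
print(selected, best)
```

Backward elimination follows the same pattern in reverse: start with all candidates and drop the feature whose removal hurts the score least.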

Feature engineering is a give-and-take process with exploratory data analysis, which provides much-needed intuition about the data. It is also very important to have a domain expert at this stage. The Risk Collection team was very supportive in providing domain knowledge throughout the project.

During exploratory analysis of the historical data, we learned that account treatment decisions are spread across multiple phases of the credit life cycle, with each phase having several features. Therefore, using Principal Component Analysis (PCA), we came up with the idea of using predictive scores for the different phases, covering:

  • Behaviour score during collection phase (Collection Score)
  • Expert judgement (Expert Score)
  • Overall payment history analysis (Repayment Score)

4.4.1.  Collection Score

Collection-specific scoring is designed to predict what will happen in a shorter timeframe, the next month or two, using data elements that are proven to be effective during the collection phase. Collection Score is an engineered feature that looks at customer behavior during the collection period. The algorithm assigns a base score and then looks at several parameters before arriving at the final collection score. The parameters include number of calls made, contacts made, promises-to-pay (ptp) obtained, ptp kept rate, ptp broken rate, inbound calls, letters sent, etc. The algorithm also looks at the number of ptps taken until the first, second and third kept ptp and, similarly, the number of days taken until the first, second and third kept ptp. Given below is the density distribution of the Collection Score for different segments of customers.
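To make the "base score plus adjustments" idea concrete, here is a purely illustrative sketch. The article does not give the base score, the weights or the exact formula, so every number and name below (`base_score=500`, the weights 100, 200 and 150, the function signature) is an invented assumption, and only a few of the parameters are covered.

```python
def collection_score(calls_made, contacts_made, ptp_obtained, ptp_kept,
                     base_score=500):
    """Hypothetical collection score: a base score adjusted by behaviour rates."""
    # Contact rate: share of calls that actually reached the customer.
    contact_rate = contacts_made / calls_made if calls_made else 0.0
    # Kept and broken rates: share of promises-to-pay honoured or broken.
    kept_rate = ptp_kept / ptp_obtained if ptp_obtained else 0.0
    broken_rate = 1.0 - kept_rate if ptp_obtained else 0.0

    score = base_score
    score += 100 * contact_rate   # illustrative weight, not from the source
    score += 200 * kept_rate      # illustrative weight, not from the source
    score -= 150 * broken_rate    # illustrative weight, not from the source
    return round(score)


# Made-up customer: 10 calls, 5 contacts, 4 promises obtained, 2 kept.
score_example = collection_score(10, 5, 4, 2)
print(score_example)
```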

4.4.2.  Repayment Score

This is an engineered feature that calculates the customer's repaying power by looking at the customer's historical data before the account went to charge off. The algorithm developed to calculate the Customer Repayment Score does this by looking at the forward and backward movement across delinquency age buckets in the last three years before charge off. It counts the number of times the customer defaulted and the number of times the customer recovered back to normal. Before arriving at the final repayment score, the algorithm also calculates a customer lifetime value factor based on early-stage and late-stage delinquencies. Given below is the density distribution of the repayment score for different segments of customers.
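The bucket movement counts described above can be sketched like this. It is a minimal illustration that assumes a hypothetical encoding in which each month is an integer bucket (0 = current, higher = older delinquency); how these counts are combined into the final Repayment Score is not described in the article and is not attempted here.

```python
def bucket_movements(buckets):
    """Summarise month-to-month movement across delinquency buckets.

    `buckets` is a chronological list of integers, where 0 means current
    and higher numbers mean older delinquency (hypothetical encoding).
    """
    forward = backward = defaults = cures = 0
    for prev, curr in zip(buckets, buckets[1:]):
        if curr > prev:
            forward += 1
            if prev == 0:
                defaults += 1   # moved from current into delinquency
        elif curr < prev:
            backward += 1
            if curr == 0:
                cures += 1      # recovered back to normal
    return {"forward": forward, "backward": backward,
            "defaults": defaults, "cures": cures}


# Made-up ten-month history ending in a deepening delinquency.
history = [0, 1, 2, 0, 0, 1, 0, 1, 2, 3]
summary = bucket_movements(history)
print(summary)
```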

4.4.3.  Expert Score

In collections and recovery, determining how best to work accounts is often left to judgement. We drew on the expertise of the collection and recoveries team and developed an Expert Opinion Scorecard to rate a customer based on expert judgement. The scorecard primarily takes into consideration 19 independent variables related to customer demographics, market and product. Each variable is rated by the expert on a scale of 1 to 5 before the final score is calculated. Here is the scale:

  • Recovery process is likely to be costly and time consuming and not feasible
  • Recovery process is likely to be costly and time consuming and not fully recoverable
  • Recovery process is likely to be costly and time consuming but fully recoverable
  • Recovery process is likely to be quick but not fully recoverable
  • Recovery process is likely to be easy and quick and fully recoverable

Each variable has its own weight and hence its own contribution to the final score. The chart below shows the weights assigned to each variable.

Expert Score Attributes
Score calculated for a dummy customer
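A weighted scorecard of this kind can be sketched as a weighted average of the 1 to 5 ratings. This is an assumption made for illustration: the article does not state whether the final score is a weighted average or a weighted sum, and the three variables and weights below are hypothetical stand-ins for the 19 real ones.

```python
def expert_score(ratings, weights):
    """Weighted average of 1-5 expert ratings; the result stays on the 1-5 scale."""
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must cover the same variables")
    for name, rating in ratings.items():
        if not 1 <= rating <= 5:
            raise ValueError(f"rating for {name} must be between 1 and 5")
    total_weight = sum(weights.values())
    return sum(ratings[k] * weights[k] for k in ratings) / total_weight


# Hypothetical variables, weights and ratings for a dummy customer.
weights = {"market": 2.0, "top_product": 1.0, "client_tenure": 1.0}
ratings = {"market": 4, "top_product": 2, "client_tenure": 3}

dummy_score = expert_score(ratings, weights)
print(dummy_score)
```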

Given below is the density distribution of the Expert Opinion Score for different segments of customers.

Expert Score density distribution

4.5.  Attribute Importance

Feature importance (or attribute importance) generally explains the predictive power of the features in the dataset. It is important to mention that:

  • feature importance is calculated based on the training data given to the model, not on predictions on a test dataset.
  • these numbers do not indicate the true predictive power of the model, especially one that overfits. For example, the first variable in our case is overfitted.

Therefore, sample attribute importance and permutation attribute importance techniques are used to overcome this issue.

Attribute Importance
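Permutation importance, mentioned above, can be sketched as follows: shuffle one feature's values across rows of held-out data and measure how much the model's accuracy drops. This is a minimal illustration, not the implementation used in MLBR. The `predict` function is a stand-in for a trained model, and the feature names and data are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Share of rows for which the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, feature, n_repeats=10, seed=0):
    """Mean drop in accuracy when one feature's values are shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[feature] for row in X]
        rng.shuffle(shuffled)
        X_perm = [{**row, feature: v} for row, v in zip(X, shuffled)]
        drops.append(baseline - accuracy(predict, X_perm, y))
    return sum(drops) / n_repeats


# Hypothetical held-out data and a stand-in "model" that uses only one feature.
X_test = [{"repayment_score": s, "noise": n}
          for s, n in [(0.9, 1), (0.8, 0), (0.7, 1), (0.6, 0),
                       (0.4, 1), (0.3, 0), (0.2, 1), (0.1, 0)]]
y_test = [1, 1, 1, 1, 0, 0, 0, 0]


def predict(row):
    return 1 if row["repayment_score"] > 0.5 else 0


imp_score = permutation_importance(predict, X_test, y_test, "repayment_score")
imp_noise = permutation_importance(predict, X_test, y_test, "noise")
print(imp_score, imp_noise)
```

Because the stand-in model ignores `noise`, shuffling it cannot change any prediction, so its importance is exactly zero, while shuffling `repayment_score` lowers accuracy.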

In the next article we will look at the modeling part.

to be continued...

Cheers :)