This is part 3 of the case study on Boost Collections and Recoveries using Machine Learning (MLBR): a machine learning predictive model that enhances the current recovery system by creating focus groups the business can use to boost debt collection.

To read the previous part, click here.

Disclaimer: This case study is solely an educational exercise, and the information it contains is to be used only as a case study example for teaching purposes. This hypothetical case study is provided for illustrative purposes only and does not represent an actual client or an actual client’s experience. All of the data, contents and information presented here have been altered and edited to protect the confidentiality and privacy of the company.

In this part we will look at the machine learning modeling.

4.6.  Modeling

In theory, a machine learning model is a mathematical representation of a business process. To generate a machine learning model, a training data set needs to be provided to a machine learning algorithm so that it can learn from it.

Before looking into modeling, it is important to understand what machine learning models are and how they work.

How Machine Learning Works

There are two phases in the modeling step:

  • Training
  • Testing

4.6.1.  Training

During the training process, the machine learning algorithm is provided with the training data. The learning algorithm discovers patterns in the training data that map the input parameters to the target. The output of the training process is a machine learning model, which can then be used to make predictions. This phase is also called "learning".

The training phase begins by understanding the problem at hand. In our case, it is a classification problem, i.e. the model will classify each client into one of four prediction classes:

  1. High probability of recovery: above 75%
  2. Medium probability of recovery: between 25% and 74%
  3. Low probability of recovery: between 1% and 24%
  4. Zero probability of recovery: less than 1%
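
If the four classes are read as bands of a predicted recovery probability, the mapping can be sketched as follows. The function name `recovery_class` is hypothetical, and treating each band's lower bound as inclusive (so that exactly 75% counts as High) is an assumption, because the list above leaves the boundary values unassigned.

```python
def recovery_class(probability: float) -> str:
    """Map a recovery probability (0.0 to 1.0) to one of the four classes.

    Assumption: each band's lower bound is inclusive.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= 0.75:
        return "High"
    if probability >= 0.25:
        return "Medium"
    if probability >= 0.01:
        return "Low"
    return "Zero"


# Hypothetical probabilities for four clients.
for p in (0.90, 0.50, 0.10, 0.0):
    print(f"{p:.2f} -> {recovery_class(p)}")
```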

4.6.1.1  Labeling of Historical Data

Gathering a large amount of data is comparatively simple. Data can be scraped, synthesized, created or copied, and then stored in databases or HDFS.

A key challenge in developing an intelligent model is not just collecting a sheer mass of data but also having an effective strategy to label it intelligently, adding structure and meaning to the data. Data labeling can, therefore, be described as a way to organize information depending on its content.

Classification models are trained using labeled historical data. This means that before building and training a model, each observation (row) in the training data needs to be tagged with a label, i.e. High, Medium, Low or Zero. Unfortunately, our gathered historical data had no such labels. So we used the k-means clustering algorithm, which partitions n observations into k clusters such that each observation belongs to the cluster with the nearest mean. The primary input variables fed to the algorithm are:

  1. Resolved Type (Full/Partial/No)
  2. Time taken to resolve
  3. Time taken to make first payment

The output of this step is the Labeled Historical Data (LHD).
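
To make the clustering step concrete, here is a minimal, dependency-free sketch of k-means (Lloyd's algorithm) applied to the three input variables. It is not the implementation used in the case study. The sample accounts, the numeric encoding of Resolved Type (Full = 2, Partial = 1, No = 0), the placeholder times for unresolved accounts, the min-max scaling and the helper names `min_max_scale` and `kmeans` are all assumptions made for illustration.

```python
import random
from math import dist


def min_max_scale(rows):
    """Rescale every column to [0, 1] so no single feature dominates the distance."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: return (centroids, cluster index of each point)."""
    centroids = random.Random(seed).sample(points, k)  # k random points as starting means
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        assignments = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each mean to the average of its members.
        updated = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # empty cluster: keep its old mean
        if updated == centroids:  # means stopped moving, so we have converged
            break
        centroids = updated
    return centroids, assignments


# Hypothetical accounts: (resolved_code, days_to_resolve, days_to_first_payment).
# Assumed encoding: Full = 2, Partial = 1, No = 0. The times for unresolved
# accounts are placeholder values, not taken from the case study.
accounts = [
    (2, 30, 10), (2, 45, 15),
    (2, 200, 90), (1, 180, 80),
    (1, 330, 250), (1, 350, 270),
    (0, 365, 365), (0, 360, 360),
]
centroids, cluster_ids = kmeans(min_max_scale(accounts), k=4)
for account, cluster_id in zip(accounts, cluster_ids):
    print(account, "-> cluster", cluster_id)
```

The cluster indices carry no meaning on their own. Turning them into the High, Medium, Low and Zero labels requires inspecting the cluster means, a step the case study does not describe in detail. Like any k-means run, the result also depends on the random starting means.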

Labeled Historical Data

4.6.1.2  Creation of the Training, Testing and Cross Validation Data Sets

Multiple snapshots of the LHD were created:

  1. Snapshot-0: This is the historical snapshot of the database as of 31-Mar-2018. It is further split into two datasets:
    a)  Training dataset: 70% of the data from snapshot-0 is placed in the training dataset at random. The purpose of this dataset is to train different models for prediction.
    b)  Testing dataset: The remaining 30% of the data from snapshot-0 is placed in the testing dataset. The purpose of this dataset is to test and compare all three models. This dataset is also used for the first cross validation.
  2. Snapshot-1: This is the historical snapshot of the database as of 31-Mar-2019 (one year later).
  3. Snapshot-2: This is the historical snapshot of the database as of 31-Oct-2019 (about one and a half years later than snapshot-0).
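
The random 70/30 split of snapshot-0 can be sketched with the standard library alone. The function name `split_train_test`, the seed and the stand-in row IDs are hypothetical; the seed is fixed only so that the split is reproducible.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into (train, test) by train_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for the labeled rows of snapshot-0.
snapshot_0 = list(range(1000))
train, test = split_train_test(snapshot_0)
print(len(train), "training rows,", len(test), "testing rows")
```

Every row lands in exactly one of the two datasets, so the testing dataset contains no rows the models saw during training.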

The data cutoff dates for all snapshots are depicted below:

Data cutoff for Training, Testing and Cross Validation Data sets

In the next article we will compare the results of the various models and select the best one.

to be continued...

Cheers :)