This is Part 4 of the case study on Boost Debt Collections and Recoveries using Machine Learning (MLBR): a predictive machine learning model that enhances the current recovery system by creating focus groups the business can use to boost debt collection.

Disclaimer: This case study is solely an educational exercise, and the information it contains is to be used only as a teaching example. This hypothetical case study is provided for illustrative purposes only and does not represent an actual client or an actual client’s experience. All data, content and information presented here have been altered and edited to protect the confidentiality and privacy of the company.

In Part 1, we looked at the background of the debt collection and recovery process in credit lending companies and the entities/players involved in the traditional recovery process. We also defined the objective of our use case and the highlights of the proposed machine learning solution.

In Part 2, we looked at the data elements and the high-level design, including the data collection process and the design of the data pipeline. We also discussed the complex data variables created through feature engineering, and the importance of expert opinion in this case study: in particular, how the expert-derived collection score turned out to be a significant attribute in the modeling phase.

In Part 3, we discussed the modeling phase and the learning phase, including how the training, testing and cross-validation datasets are prepared.

In this part, we will compare the results of the various models and select the best one.

Selection of the best model for Classification

We trained three classification models:

  • Support Vector Machine (SVM)
  • Decision Tree (DT)
  • Naïve Bayes (NB)

The training dataset was used to train all three machine learning models. Once training was complete, we compared the statistics of all three models.
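As a minimal sketch (assuming scikit-learn; the actual LHD features and model settings are not public), training the three classifiers on the same training dataset and scoring them on the held-out test set might look like this:

```python
# Hypothetical sketch: the real LHD features are confidential, so a
# synthetic stand-in dataset is generated here.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Stand-in for Snapshot-0: 1,000 accounts, 10 engineered features.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

models = {
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)                  # fit on the training split
    acc = model.score(X_test, y_test)            # accuracy on the test split
    print(f"{name}: test accuracy = {acc:.3f}")
```

Because every model is fitted on the same training split and scored on the same test split, the resulting accuracies are directly comparable.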

Recap of the datasets used:

Multiple snapshots of LHD were created:

  1. Snapshot-0: This is the historical snapshot of the database as of 31-Mar-2018. It is further split into two datasets:
     a) Training dataset: 70% of the data from Snapshot-0, taken at random. The purpose of this dataset is to train the different models for prediction.
     b) Testing dataset: the remaining 30% of the data from Snapshot-0, taken at random. The purpose of this dataset is to test and compare all three models; it is also used for the first cross validation.
  2. Snapshot-1: This is the historical snapshot of the database as of 31-Mar-2019 (one year later).
  3. Snapshot-2: This is the historical snapshot of the database as of 31-Oct-2019 (almost one and a half years after Snapshot-0).
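The 70/30 random split of Snapshot-0 described above can be sketched with scikit-learn's `train_test_split` (the snapshot data below is a synthetic stand-in, since the real records are confidential):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for Snapshot-0: 10,000 records with 5 features each.
rng = np.random.default_rng(0)
snapshot_0 = rng.normal(size=(10_000, 5))
labels = rng.integers(0, 2, size=10_000)

# 70% training / 30% testing, drawn at random; stratify keeps the
# class balance nearly identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    snapshot_0, labels, test_size=0.30, random_state=42, stratify=labels)

print(len(X_train), len(X_test))  # 7000 3000
```

Fixing `random_state` makes the split reproducible, so every model sees exactly the same training and testing records.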

The data cutoff dates for all snapshots are depicted below:

[Figure: Data cutoff for the Training, Testing and Cross Validation datasets]

[Figure: Overall performance of all three models (after training)]

[Figure: Performance matrix for all three models (after training)]

The decision tree model had better overall accuracy on the training dataset, but Naïve Bayes performed better on the testing dataset, which suggests the decision tree was overfitting the training data. So we chose Naïve Bayes for cross validation.
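The selection rule used here (prefer the model with the higher test accuracy, since a large train/test gap signals overfitting) can be sketched as follows; the data is synthetic, so the winner below need not match the case study's result:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Noisy synthetic data (flip_y adds label noise, which deep trees overfit).
X, y = make_classification(n_samples=2000, n_features=12, n_informative=4,
                           flip_y=0.15, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=7)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=7),
    "Naive Bayes": GaussianNB(),
}
scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    scores[name] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"{name}: train={scores[name][0]:.3f} test={scores[name][1]:.3f}")

# Pick by test accuracy, not training accuracy.
best = max(scores, key=lambda name: scores[name][1])
print("Selected for cross validation:", best)
```

An unpruned decision tree will typically score close to 100% on the data it was trained on, which is exactly why the comparison must be made on the held-out test set.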

In the next part, we will look at the cross validation results and the implementation.

Cheers :)