This is Part-5 of the case study on Boost Debt Collections and Recoveries using Machine Learning (MLBR). The case study describes a machine learning predictive model that enhances the current recovery system by creating focus groups the business can use to boost debt collection.

Disclaimer: This case study is solely an educational exercise, and the information contained in it is to be used only as a case study example for teaching purposes. This hypothetical case study is provided for illustrative purposes only and does not represent an actual client or an actual client’s experience. All of the data, contents and information presented here have been altered and edited to protect the confidentiality and privacy of the company.

In Part-1, we looked at the background and the understanding of the debt collection and recovery process in credit lending companies. We looked at the entities/players involved in the traditional debt recovery process. We also defined the objective of our use case and highlights of the proposed machine learning solution.

In Part-2, we looked at the data elements and high-level design. We discussed the data collection process, the design of the data pipeline, and the complex data variables created as a result of feature engineering. We also discussed the importance of expert opinion in this complex case study, and how the collection score turned out to be a significant key attribute in the modeling phase.

In Part-3, we discussed the modeling phase. We also explained the learning phase and how the training, test and cross validation data sets are prepared.

In Part-4, we reviewed the training results of the various models including Support Vector Machine (SVM), Naïve Bayes (NB) and Decision Tree (DT).

In this part, we will look at the Cross Validation Result and Implementation.

4.7.   Testing and Cross Validation

The basic idea of cross validation here is to test the model using a historical snapshot. A particular point in history is taken as a reference, and the data prior to that point is fed to the model. The model predicts the recovery for that historical point. These predictions are then compared against real data. The cross validation approach adopted is illustrated below.

Cross Validation Approach

Two tests were performed using two different unseen snapshots: one for Mar 2018 and another for Mar 2019.

1)  Cross Validation-1 (CV1)

This is the first test, and it uses snapshot-1. The idea was to run the models on a historical snapshot (snapshot-1) and then compare the results with a snapshot taken 7 months later, as illustrated in figure 9. The objective of this cross validation is to validate the model predictions against the actual results available in snapshot-2.

2)  Cross Validation-2 (CV2)

This is the second test, which follows the same procedure as the first (CV1). The only difference is that we executed the models on a much older snapshot (fig. 9: “testing” dataset from snapshot-0). The results are then compared with snapshot-2 (same as in CV1). Note that the time difference between the two snapshots here is almost 1.5 years. The larger time window between the snapshots gives us more confidence in the accuracy of the model.
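The comparison step shared by CV1 and CV2 can be sketched in a few lines of Python. This is only an illustration of the idea, not the pipeline used in the case study: the function name, the dictionary layout and the toy account IDs are all assumptions made for the example.

```python
from collections import defaultdict


def compare_with_later_snapshot(predicted_bucket, recovered_later):
    """Summarise, per predicted bucket, how many accounts actually paid.

    predicted_bucket: dict of account_id -> bucket predicted on the OLDER snapshot
    recovered_later:  dict of account_id -> amount recovered by the LATER snapshot
    (both structures are hypothetical, chosen only for this sketch)
    """
    counts = defaultdict(lambda: {"accounts": 0, "paid": 0})
    for account_id, bucket in predicted_bucket.items():
        counts[bucket]["accounts"] += 1
        # An account counts as "paid" if any amount was recovered later.
        if recovered_later.get(account_id, 0.0) > 0:
            counts[bucket]["paid"] += 1
    return {
        bucket: {**c, "paid_share": c["paid"] / c["accounts"]}
        for bucket, c in counts.items()
    }


# Toy data, invented for illustration only.
predicted_on_older_snapshot = {"A1": "High", "A2": "High", "A3": "Medium", "A4": "Zero"}
recovered_by_later_snapshot = {"A1": 250.0, "A3": 40.0}

summary = compare_with_later_snapshot(
    predicted_on_older_snapshot, recovered_by_later_snapshot
)
for bucket, stats in summary.items():
    print(bucket, stats)
```

The key point is that the predictions come only from data available at the older snapshot, while the payments come from the later one, so the model never sees the outcome it is being judged on.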

4.7.1.  Results and Cross Validation

Results are reviewed and evaluated for both CV1 and CV2 cross validations.

A)  Cross Validation-1 Results (CV1)

For this round of testing, we have so far fed only the clients with a write-off amount over $1000, which is why the last row is empty. We will update this table with the remaining clients as well. The inferences from the results below are:

  1. 73% of “High probable” worked customers have returned some/full amount.
  2. 23% of “Medium probable” worked customers have returned some/full amount.
  3. None of the “Low” and “Zero” probable customers have paid any money so far, but they may do so in future.

Based on the above inference, we can say that Recoveries can be improved if

  1. We work 100% of the clients in “High” and “Medium”.
  2. We reduce the effort spent on “Low” customers, possibly focusing only on clients with a high write-off balance.
  3. We do not work the “Zero” base. It is better to sell this debt to external debt collectors, as the recovery rate is less than 1%.
Cross Validation-1
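The working rules listed above could be encoded as a simple routing function, sketched below. The function name, the action labels and especially the `low_balance_threshold` value are assumptions for illustration. The case study does not state a cut-off for what counts as a high write-off balance.

```python
def recommended_action(bucket, write_off_balance, low_balance_threshold=5000.0):
    """Map a predicted bucket to a suggested treatment.

    low_balance_threshold is a hypothetical cut-off; the real value would
    have to come from the business.
    """
    if bucket in ("High", "Medium"):
        # Work 100% of the High and Medium clients.
        return "work"
    if bucket == "Low":
        # Reduce effort on Low, keeping only the larger write-off balances.
        if write_off_balance >= low_balance_threshold:
            return "work"
        return "deprioritize"
    if bucket == "Zero":
        # Do not work the Zero base; sell the debt instead.
        return "sell_to_external_collector"
    raise ValueError(f"unknown bucket: {bucket!r}")


print(recommended_action("Medium", 1200.0))
print(recommended_action("Low", 800.0))
print(recommended_action("Zero", 3000.0))
```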

We further identified the monetary benefit of working based on the model results. If our agents work all the cases in the “High” and “Medium” probable groups, our recovery would increase by 11%.

B)  Cross Validation-2 Results (CV2)

The results of the Mar 2018 prediction are given below. A few inferences from these results are:

  1. Out of all customers in the “High probability” category, agents have worked on 60% and were able to collect from 75% of them, with a recovery rate of 63%.
  2. Out of all “Medium probability” customers, agents have worked on only 27%. We should work on the remaining cases.
  3. Out of all worked customers in the “Low” bucket, the return was only 6%, which may be a wasted effort. This effort should have been spent on High and Medium.
Cross Validation-2

We also estimated the monetary gain we would have made by working on all of the High and Medium clients. The inferences are:

  1. The recovery rate from “High” customers is 49%. This considers only the clients we have contacted. The uncontacted clients in the High category stand at 60%. Assuming we get the same 49% recovery from these customers, we would be able to recover at least 25% more.
  2. The same applies to the Medium category. By working the remaining unworked clients, we would be able to increase our recovery by 13%, as shown in green in the last row.
  3. If the above numbers hold true in the real world, we would realize a high net increase in our recoveries.
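The projection behind these inferences rests on one assumption: unworked clients would recover at the same rate as the worked ones. A minimal sketch of that arithmetic follows. The 49% rate comes from the text, but the dollar balances are invented for illustration, so the output is not meant to reproduce the percentages quoted above.

```python
def projected_extra_recovery(unworked_balance, observed_recovery_rate):
    """Estimate extra recovery if unworked clients behave like worked ones.

    Assumes the recovery rate observed on contacted clients carries over
    unchanged to the uncontacted ones, which may be optimistic.
    """
    if not 0.0 <= observed_recovery_rate <= 1.0:
        raise ValueError("recovery rate must be between 0 and 1")
    return unworked_balance * observed_recovery_rate


# Hypothetical balances for the High bucket (not from the case study).
worked_balance = 400_000.0
unworked_balance = 600_000.0
observed_rate = 0.49  # recovery rate on contacted High clients, per the text

recovered_so_far = worked_balance * observed_rate
extra = projected_extra_recovery(unworked_balance, observed_rate)

print(f"Recovered so far: {recovered_so_far:,.0f}")
print(f"Projected extra:  {extra:,.0f}")
```

In practice the carried-over rate is the weakest part of the estimate, since agents may already have picked the most promising clients to contact first.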

That's all for this case study. Hope this series of posts on Boost Debt Collection and Recovery using Machine Learning has been valuable to those who are intrigued.

Tools and Technologies Used:

  1. Oracle Advanced Analytics for machine learning.
  2. R programming language.
  3. OBIEE for dashboard and reports development.

Cheers :)