
Big Data and Machine Learning in Credit Risk Analysis

Today I would like to return to a very topical subject that I already addressed some months ago: how to use big data and machine learning models in credit risk analysis.

As you will remember, I said that at modeFinance we use very large amounts of data (70 million companies in 200 countries and more than 200 million financial accounts, thanks to our partnership with Bureau van Dijk). But, as I pointed out then, we use machine learning methodologies only indirectly; they are not integrated into a single analysis.

In this post I want to take that discussion further and show you where the difficulties in using these methodologies lie.

There are two main points:

1) Lack of data on bankrupt companies;

2) The “holes” in the financial statements.

Let’s start with the first point: the lack of data on bankrupt companies.

As a reminder, machine learning models “learn” from the database they are shown. In credit risk analysis, the classical database is nothing more than the set of all companies (healthy and bankrupt) with their respective financial statement indicators. (This raises a question: which indicators? We will delve into that in another post.) We can represent the database as follows:

Company 1 - ratio_1 - ratio_2 - ratio_3 - … - ratio_n 0
Company 2 - ratio_1 - ratio_2 - ratio_3 - … - ratio_n 0
Company m - ratio_1 - ratio_2 - ratio_3 - … - ratio_n 1

What is the 0 or 1 at the end? It is the label identifying whether the company is active (0) or bankrupt (1). This is a crucial element: basically, machine learning models build a numeric function that separates the healthy companies from the bankrupt ones.
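To make the idea of a "separating function" concrete, here is a minimal sketch in Python. It assumes a logistic regression fitted by plain gradient descent, which is only one of many possible models and not necessarily what anyone uses in production; the two ratios and all the numbers are invented for illustration.

```python
import math

# Toy data, invented for illustration: each row holds (ratio_1, ratio_2)
# for one company. Label 0 = active, 1 = bankrupt, as in the table above.
X = [
    [0.2, 1.8],
    [0.3, 1.5],
    [0.4, 1.6],
    [0.8, 0.6],
    [0.9, 0.4],
    [0.7, 0.5],
]
y = [0, 0, 0, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    m = len(X)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log-loss with respect to z
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / m for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that a company with ratios x is bankrupt."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


w, b = fit_logistic(X, y)
predictions = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in X]
print(predictions)
```

The point of the sketch is only that the function is learned from the labeled rows: without enough rows labeled 1, there is nothing for it to learn from.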

And here comes the fundamental problem: to develop the model we need a comprehensive database, one that includes a sufficient number of bankrupt companies. But can we retrieve this information? The answer is unfortunately NO, not in all countries.

To help you understand, the following figure shows the number of bankrupt companies with digitized data, that is, data that can be used in a numerical model such as a machine learning model:

Number of bankrupt companies

As you can see, there are very few countries in which we have a good amount of information and where machine learning models can therefore be used. In all other countries, applying this kind of modeling is impossible! And there are more than 200 countries in the world... I would say it is an insurmountable problem!
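The check behind this argument can be sketched in a few lines of Python. Everything here is hypothetical: the country codes, the records and above all the threshold `MIN_BANKRUPT`, which is an assumed value chosen only to make the example run, not a real requirement.

```python
from collections import Counter

# Hypothetical records: (country code, label), with 1 = bankrupt, 0 = active.
records = [
    ("IT", 1), ("IT", 0), ("IT", 1), ("IT", 0),
    ("FR", 1), ("FR", 0), ("FR", 1),
    ("XX", 0), ("XX", 0), ("XX", 0),  # "XX": placeholder country with no bankruptcies on file
]

MIN_BANKRUPT = 2  # assumed threshold, purely illustrative


def bankrupt_counts(records):
    """Count labeled bankruptcies per country (countries with none get 0)."""
    counts = Counter()
    for country, label in records:
        counts[country] += label
    return counts


def trainable_countries(records, min_bankrupt=MIN_BANKRUPT):
    """Countries with at least min_bankrupt bankrupt companies on file."""
    counts = bankrupt_counts(records)
    return sorted(c for c, n in counts.items() if n >= min_bankrupt)


print(dict(bankrupt_counts(records)))
print(trainable_countries(records))
```

In a country like the placeholder "XX", the label column contains only zeros, so a model has no bankrupt examples to separate from the healthy ones.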

Let us now turn to the second point: the “holes” in the financial statements.

Let us go back to the database from before:

Company 1 - ratio_1 - ratio_2 - ratio_3 - … - ratio_n 0
Company 2 - ratio_1 - ratio_2 - ratio_3 - … - ratio_n 0
Company m - ratio_1 - ratio_2 - ratio_3 - … - ratio_n 1

As you can see, for every company I have to know the full set of financial statement indicators (for example: Leverage, ROE, Current Ratio, etc.). But do we always have these values? The answer is unfortunately NO again. In many countries companies are not obliged to publish their entire financial statements, so we find “holes” for many companies. And note that you do not need to go very far to find this situation: already in Europe we have two cases, England and Holland! See the following two examples:

 

Incomplete Income Statement Sample

Complete Income Statement Sample

As you can see, within the same country we can find two types of companies: those for which I know all the indicators and those for which I know only a few. In these cases the database “with the holes” becomes:

Company 1 - blank - ratio_2 - blank - … - ratio_n 0
Company 2 - ratio_1 - ratio_2 - ratio_3 - … - ratio_n 0
Company m - ratio_1 - ratio_2 - ratio_3 - … - ratio_n 1
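A small Python sketch of the same table, with `None` standing in for each blank, shows how quickly the usable data shrinks. The ratio values are invented; only the pattern of holes follows the table above.

```python
# Toy version of the table above: None marks a "hole". Values are invented.
rows = [
    {"company": "Company 1", "ratios": [None, 0.8, None, 1.2], "label": 0},
    {"company": "Company 2", "ratios": [0.3, 0.9, 1.1, 1.0], "label": 0},
    {"company": "Company m", "ratios": [0.7, 0.2, 0.4, 0.5], "label": 1},
]


def missing_share_per_ratio(rows):
    """Fraction of companies with a hole, computed for each ratio column."""
    n = len(rows)
    n_ratios = len(rows[0]["ratios"])
    return [
        sum(1 for r in rows if r["ratios"][j] is None) / n
        for j in range(n_ratios)
    ]


def complete_cases(rows):
    """Keep only the companies for which every ratio is known."""
    return [r for r in rows if all(v is not None for v in r["ratios"])]


shares = missing_share_per_ratio(rows)
usable = complete_cases(rows)
print(shares)
print([r["company"] for r in usable])
```

Dropping incomplete rows is the simplest way to get a table a model can read, but in a country where most companies file abridged accounts it throws away most of the database.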

And in this case nothing can be done! (Of course, there are methods that “cover” the holes, such as Multiple Imputation, but they are approximating methods, and I would not use approximate methods to assign a rating!)
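To see why filling holes is an approximation, consider the rough Python sketch below. It is a deliberately simplified stand-in for Multiple Imputation: the missing ratio is filled several times with random draws from values observed at other companies, with none of the modeling a proper procedure would add. The `score` function, its weights and all the numbers are hypothetical.

```python
import random

# Values of ratio_1 observed at companies that do report it (invented).
observed_ratio_1 = [0.10, 0.25, 0.40, 0.55, 0.90]

# A company with a hole in ratio_1; ratio_2 is known.
company = {"ratio_1": None, "ratio_2": 0.8}


def score(ratio_1, ratio_2):
    """Hypothetical linear score, not a real rating model."""
    return 2.0 * ratio_1 - 1.0 * ratio_2


def imputed_scores(company, donors, n_draws=20, seed=0):
    """Fill the hole n_draws times with random donor values and score each fill."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    scores = []
    for _ in range(n_draws):
        filled = rng.choice(donors)
        scores.append(score(filled, company["ratio_2"]))
    return scores


scores = imputed_scores(company, observed_ratio_1)
spread = max(scores) - min(scores)
print(min(scores), max(scores), spread)
```

Each draw gives the same company a different score, so the result depends on the filled-in value rather than on anything the company actually reported.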

In this short post I have tried to show why we at modeFinance do not believe machine learning models are well suited to evaluating credit risk: too many dangerous assumptions and data that are too incomplete. Sure, for a model based only on Italian, French or Spanish companies you could use them, but as soon as you move beyond these countries you face too many problems, and at that point the assumptions you have to make become too strong.

This does not mean that financial Big Data is not fundamental: in an upcoming post, which I am already writing, I will tell you how we at modeFinance treat the huge amount of data we have and what results this brings us.
