Detecting Credit Card Default Risk With Machine Learning

Tom Tang
10 min read · May 5, 2021

Interest collected from credit card transactions and lending constitutes a large portion of banks' income. However, this lucrative business model is also a double-edged sword, because banks suffer great losses when their customers default. As a result, it is crucial to build a reliable model that discerns who is eligible for what credit ceiling. In this project, I am going to create a model that determines whether a bank customer's transaction/borrowing is at risk of default, given customer information such as job, salary, and family members.

For the dataset, I found a great resource on Kaggle, where the creator compiled data from the International Institute of Information Technology Bangalore to provide more insight into the transactions of credit card defaulters. But first, I will need to go through EDA and make sure my dataset is ML-ready.

Part I: EDA

1. Identifying my objective

After taking a rough look at the dataset, I found that its shape is (307511, 121): quite a lot of data, with a lot of attributes. Here's a snippet of the description for the attributes:

A snippet of Attributes for our dataset

I need to first identify my target variable. Luckily, the creator of the dataset was kind enough to point out the target variable, which is called “TARGET”, breaking the customers into two segments:

1: Customers that may have difficulties paying their loans and are at risk of default.

0: Customers that do not have difficulties in paying their loans.

This is exactly what I am trying to explore in this project.

2. Addressing Missing Values

One thing I noticed during the first look at the dataset was that some columns are missing a majority of their values. I wrote a function to look into that further, and here is a snippet of the result:

import pandas as pd

def missing(df):
    # For each column: count of missing values, percent missing, number of unique values, and dtype
    total = df.isnull().sum()
    percent = df.isnull().sum() / df.isnull().count() * 100
    unique = df.nunique()
    datatypes = df.dtypes
    return pd.concat([total, percent, unique, datatypes],
                     axis=1,
                     keys=['Total', 'Missing_Percent', 'Unique', 'Data_Type']).sort_values(by="Missing_Percent", ascending=False)
Percentage of Missing Value in Attribute Columns

Judging from the result above, I make the assumption that if a column has over 60% missing values, it can be regarded as non-essential information (otherwise the banks would most certainly collect it). Implementation-wise, if a column consists predominantly of null values, it would be very hard to fill in with either interpolation or imputation, and more importantly, the result would not be accurate enough to help the model's performance. As a result, I will exclude these columns from further consideration.
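Here is a minimal sketch of that filtering step, reusing the missing() helper above and assuming the raw dataframe is called data:

# Drop every column whose missing percentage exceeds the 60% threshold
report = missing(data)
cols_to_drop = report[report["Missing_Percent"] > 60].index
data = data.drop(columns=cols_to_drop)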

3. EDA based on data types

We can see that the data can be split into three parts based on data types: object, int, and float. Due to the different nature of these data types, I will perform EDA on each data type and put everything together at the end.

objects

After further inspection, I can conclude that the columns with object data are mostly categorical. As a result, I need to encode them into numeric values using dummy variables so they fit better into our models. Before that, I will use mode() to fill in the missing values in each column with that column's most frequent value.

A snippet of Attributes with object data type values in our dataset after EDA
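A sketch of that step, assuming the object columns are pulled into their own frame (the objVal name is my own, chosen to parallel floatVal below; data is the dataframe from above):

# Select the object (categorical) columns
objVal = data.select_dtypes("object")

# Fill missing values in each column with that column's most frequent value (its mode)
objVal = objVal.fillna(objVal.mode().iloc[0])

# Encode the categorical columns as dummy (one-hot) variables
objVal = pd.get_dummies(objVal)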

int64

Luckily, I don't have any missing values for the int64 columns. Here is a distribution of all int64 values:

Distribution of attributes with int64 as values

float64

Following a similar procedure for the float64 columns, I filled in the missing values with

floatVal = data.select_dtypes("float64").interpolate(method ='linear', limit_direction ='forward')

and I am left with this distribution:

Distribution of attributes with float64 as values

4. Putting Things Together

Now that I have successfully processed all three data types, it's time to merge them into a new data frame and check for any interesting correlations in my data.
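A rough sketch of that merge and a quick correlation check (the intVal and data_ready names are my own, parallel to floatVal above):

# Merge the processed object, int64, and float64 frames back into a single dataframe
intVal = data.select_dtypes("int64")
data_ready = pd.concat([objVal, intVal, floatVal], axis=1)

# See how each attribute correlates with the target variable
correlations = data_ready.corr()["TARGET"].sort_values(ascending=False)
print(correlations.head(10))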

A preliminary look at correlations between attributes and the target variable
Distribution of value of target variable

We can see from here that there is a huge disparity between customers marked as likely to have trouble repaying their loans and those who are not. This makes sense, since most customers should be able to repay their loans without trouble. However, the gap is very likely to affect the performance of my models, and I will try to address that in the next part.

Part II: Machine Learning

Initial Look

Now that our data is ML-ready, it's time to apply a few simple models to our dataset and see how they perform. To expedite the process, I wrote a function that tests multiple models on the dataset all at once. To start, I will use DummyClassifier(), LogisticRegression(), DecisionTreeClassifier(), and GaussianNB().

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def model_lot(X_train, X_test, Y_train, Y_test):
    collection = [DummyClassifier(),
                  LogisticRegression(),
                  DecisionTreeClassifier(),
                  GaussianNB()]
    comparison = pd.DataFrame([])
    row_index = 0
    for model in collection:
        # Fit each model on the training set and evaluate it on the test set
        predicted = model.fit(X_train, Y_train).predict(X_test)
        comparison.loc[row_index, 'Model Name'] = model.__class__.__name__
        comparison.loc[row_index, 'Train Accuracy'] = accuracy_score(Y_train, model.predict(X_train))
        comparison.loc[row_index, 'Test Accuracy'] = accuracy_score(Y_test, predicted)
        comparison.loc[row_index, 'Precision'] = precision_score(Y_test, predicted)
        comparison.loc[row_index, 'Recall'] = recall_score(Y_test, predicted)
        comparison.loc[row_index, 'F1 score'] = f1_score(Y_test, predicted)
        row_index += 1
    return comparison.sort_values(by="Precision", ascending=False)
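For context, this function assumes the processed data has already been split into training and test sets. A sketch of that split, under my own assumptions (data_ready as the merged frame, an 80/20 split, and a fixed random seed):

from sklearn.model_selection import train_test_split

# Separate the features from the target and hold out 20% of the rows for testing
X = data_ready.drop(columns=["TARGET"])
Y = data_ready["TARGET"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2020)

model_lot(X_train, X_test, Y_train, Y_test)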

Here is the performance of our models:

Performance Metrics of our simple models

We can see that my models all achieved pretty good results in terms of accuracy. However, this could be quite deceiving, since it is likely due to the overwhelming number of negatives (0) in the TARGET variable. Since I still have not addressed this imbalance, accuracy should not be the only metric we look at. For an imbalanced dataset, I should rely on Precision and Recall instead, and by those metrics my initial models did not perform well. As a result, I will use random oversampling to address this issue.

Random Over Sampling

Random oversampling balances the classes by randomly duplicating examples from the minority class. Since existing information is only being duplicated rather than discarded, the model gets to train on a balanced set and should become more sensitive to the minority class (at the cost of some risk of overfitting to the duplicated examples).

from imblearn.over_sampling import RandomOverSampler

# Duplicate minority-class (TARGET = 1) examples until both classes are the same size
oversample = RandomOverSampler(sampling_strategy='minority')
X_train_over, Y_train_over = oversample.fit_resample(X_train, Y_train)
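A quick sanity check I would add here (hypothetical, not part of the original run) is to print the class counts before and after resampling:

# Class counts before and after random oversampling
print(Y_train.value_counts())
print(Y_train_over.value_counts())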

Let’s take a look at the result of Random Over Sampling:

Target Variable Distribution before and after Random Over Sampling
Model Performance after Random Over Sampling

Compared to the initial results on the imbalanced training data, we observe an increase in accuracy for the models, and more importantly, the Precision, Recall, and F1 scores also improved. However, they are still not as high as I would like, so I will now use more complex models to improve the results.

Complex Models: Voting Classifier and RandomForestClassifier

Using the same logic as for the simpler models, I wrote a function that evaluates these two models all at once.

def model_complex(model_list):
    collection = model_list
    comparison = pd.DataFrame([])
    row_index = 0
    for model in collection:
        # Each model in the list is assumed to be fitted already; score it on the train and test sets
        predicted = model.predict(X_test)
        comparison.loc[row_index, 'Model Name'] = model.__class__.__name__
        comparison.loc[row_index, 'Train Accuracy'] = accuracy_score(Y_train, model.predict(X_train))
        comparison.loc[row_index, 'Test Accuracy'] = accuracy_score(Y_test, predicted)
        comparison.loc[row_index, 'Precision'] = precision_score(Y_test, predicted)
        comparison.loc[row_index, 'Recall'] = recall_score(Y_test, predicted)
        comparison.loc[row_index, 'F1 score'] = f1_score(Y_test, predicted)
        row_index += 1
    return comparison.sort_values(by="Precision", ascending=False)
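The models passed in are fitted on the oversampled training data before being evaluated. A sketch of how that call might look (the base estimators inside the voting classifier are my own choice for illustration, not necessarily the ones used in the original run):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier

# Fit both complex models on the oversampled training data
rf = RandomForestClassifier().fit(X_train_over, Y_train_over)
voting = VotingClassifier(estimators=[('lr', LogisticRegression()),
                                      ('dt', DecisionTreeClassifier()),
                                      ('nb', GaussianNB())]).fit(X_train_over, Y_train_over)

model_complex([rf, voting])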

And here are the performance results for these two models:

Performance Metrics for RandomForest and Voting Classifiers

As we can see from the results, RandomForestClassifier has the highest accuracy and precision scores of all the models we have employed. Its strong performance makes sense: RandomForest is well suited to datasets with a mixture of categorical and numeric features, and it can make accurate predictions despite a large amount of noise in the data. These strengths match the characteristics of my dataset (lots of noisy data and a mix of numeric and categorical features), so I will use RandomForestClassifier as my go-to model. Now I need to tune its hyperparameters to attain the best result.

Part III: Hyperparameter Tuning

Before we even start the tuning process, I need to think about which metric we want to prioritize.

Formulas for Precision, Recall, and Accuracy
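In terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), these metrics are defined as:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)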

Above are the formulas for three important metrics: Precision, Recall, and Accuracy. I have previously established that accuracy is not a suitable metric for our evaluation due to the imbalanced nature of the dataset. The purpose of my model is to minimize banks' losses from customer defaults. As a result, the most important task for my model is to catch those who are at risk of default, who are marked as 1 (true positives), while minimizing those who are at risk of default but marked otherwise (false negatives). Keeping that in mind, it is clear that Recall is the most important metric for evaluating performance given the nature of our problem. So let's get to it!

Method 1: RandomizedSearchCV

Random search is a technique in which random combinations of hyperparameters are tried in order to find the best configuration for the model. The drawback of random search is that its results have high variance: since the combinations are sampled completely at random, with no intelligence guiding the search, luck plays its part.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Candidate values for each hyperparameter
n_estimators = [int(x) for x in np.linspace(start=100, stop=1000, num=10)]
max_features = ["auto", "sqrt"]
max_depth = [int(x) for x in np.linspace(start=10, stop=50, num=9)]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
ccp_alpha = [x for x in np.arange(start=0, stop=0.03, step=0.005)]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'ccp_alpha': ccp_alpha}

# Randomly sample 5 combinations and evaluate each with 3-fold cross-validation
rf_new = RandomForestClassifier()
random = RandomizedSearchCV(estimator=rf_new,
                            param_distributions=random_grid,
                            n_iter=5,
                            cv=3,
                            random_state=2020,
                            verbose=2,
                            n_jobs=-1,
                            scoring="precision")
random.fit(X_train_over, Y_train_over)

RandomizedSearchCV turned out to be extremely slow due to the large size of my training set (I had been running it for 8 hours and it still had not finished). As a result, I will use Optuna, an open-source optimizer, to fine-tune my model more efficiently.

Method 2: Optuna

Optuna is an automated hyperparameter optimization framework designed specifically for machine learning tasks. It features an imperative, define-by-run style API. Advantages of Optuna include its efficient searching and pruning strategies and its integration modules for scikit-learn.

import optuna

def objective(trial):
    # Sample candidate hyperparameters for this trial
    n_estimators = trial.suggest_int('n_estimators', 20, 200)
    max_depth = int(trial.suggest_float('max_depth', 10, 50, log=True))

    # Train on the oversampled data and return recall on the held-out test set
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    predicted = clf.fit(X_train_over, Y_train_over).predict(X_test)
    return recall_score(Y_test, predicted)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)

trial = study.best_trial
print('Recall: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))
Results of Optuna Tuning

After running 10 trials, Optuna returned the best hyperparameters, {'n_estimators': 158, 'max_depth': 11.207186966546672}, with a recall score of 0.59, a drastic improvement over the untuned model. Here's a comparison of the performance metrics before and after tuning:

Model Performance Metrics Before and After Tuning

Here we can see that the Recall score increased by 8657% while the accuracy and precision decreased by 23% and 64%, respectively. This huge increase in Recall and decrease in Precision and Accuracy means that the model has become a lot more sensitive in flagging customers as ineligible for loans and transactions, and this is a trade-off that I am willing to make. After all, our purpose for this project is to minimize loss for banks from customer defaults.

Now that we have tuned our model, here's the list of the 10 most important features in determining default risk:

Feature Importance for attributes
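For reference, a sketch of how such a ranking can be pulled out of the tuned forest (assuming the resampled training frame kept its column names, and refitting with the hyperparameters Optuna reported, rounded to max_depth=11):

# Refit the forest with the tuned hyperparameters, then rank features by importance
best_rf = RandomForestClassifier(n_estimators=158, max_depth=11).fit(X_train_over, Y_train_over)
importances = pd.Series(best_rf.feature_importances_, index=X_train_over.columns)
print(importances.sort_values(ascending=False).head(10))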

Part IV: Conclusion

Takeaway

During my current internship at Airwallex, a cross-border payment solution provider, I've noticed that our company invests a large amount of resources in risk-control efforts such as KYC. After doing this project, I realize the importance of those efforts and the crucial role machine learning plays in the risk assessment process. There are numerous factors that can lead to a customer's eventual default, and the patterns among them would be impossible to spot with the human brain alone. Machine learning gives us a powerful tool to discern and quantify the patterns in seemingly unrelated data with relatively high consistency.

Things to Improve on

The main drawback of this project is the tuning process. Due to the large dataset, even Optuna was slow when testing different hyperparameters, so I had to limit the search to 10 trials (beyond which the program simply shut down). As a result of this limited tuning, I was left with a decent but imperfect result. From the bank's perspective, a relatively low precision score means that the model regularly flags eligible customers as ineligible, which makes it harder for customers to secure a loan, and the bank might lose customers because of it. To address this issue, I could: 1. further optimize my dataset, keeping only a portion of the most important features, or customize Optuna to better fit my data; 2. acquire more computing power. With more trials and more hyperparameter tuning, I would eventually get a model with both a high precision and a high recall score.

Another thing to improve on would be to take Previous_Application (which is also provided in the dataset) into consideration. Correlating the previous application history with the current application would likely yield more interesting results.


Written by Tom Tang

Presidential Scholar @ USC
