Credit Risk Analysis

Niko Montez - 1223861496


In [27]:
import pandas as pd 
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
import random
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
# Deprecated wrapper: the warning printed under In [28] points to SciKeras as the replacement
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

I. Introduction:

In the realm of financial risk management, accurate prediction of credit default risk is crucial for maintaining the stability and profitability of lending institutions. This study compares the predictive performance of a neural network against a Random Forest, both applied to binary classification in the context of credit risk assessment. Let's begin by importing and understanding the dataset: https://www.kaggle.com/datasets/praveengovi/credit-risk-classification-dataset/data

Here we first join our two datasets so that we can more easily clean and access our data.

In [2]:
custdata = pd.read_csv('customer_data.csv')
custdata = custdata.set_index('id')
pmntdata = pd.read_csv('payment_data.csv')
pmntdata = pmntdata.set_index('id')
data = custdata.join(pmntdata)
data.head(5)
Out[2]:
label fea_1 fea_2 fea_3 fea_4 fea_5 fea_6 fea_7 fea_8 fea_9 ... OVD_t2 OVD_t3 OVD_sum pay_normal prod_code prod_limit update_date new_balance highest_balance report_date
id
54982353 0 1 1130.0 2 1000000.0 2 4 -1 100 5 ... 0 0 0 1 10 55000.0 27/08/2014 0.0 2068.0 12/06/2014
54982353 0 1 1130.0 2 1000000.0 2 4 -1 100 5 ... 0 0 1 31 10 550000.0 03/09/2013 326684.4 609683.0 18/12/2015
54982353 0 1 1130.0 2 1000000.0 2 4 -1 100 5 ... 0 0 0 19 10 NaN 16/07/2011 31677.6 204037.0 14/12/2015
54982353 0 1 1130.0 2 1000000.0 2 4 -1 100 5 ... 0 35 31500 0 10 12100.0 27/12/2008 12142.8 10619.0 14/07/2009
54982353 0 1 1130.0 2 1000000.0 2 4 -1 100 5 ... 0 0 0 26 10 660000.0 12/03/2007 252998.4 775030.0 23/12/2015

5 rows × 23 columns

In [3]:
data.shape
Out[3]:
(8250, 23)
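
The repeated id 54982353 in the output above suggests that each customer has several payment records, so the join produces one row per payment rather than one row per customer. A minimal sketch of that behaviour, using two hypothetical miniature tables with made-up values (only the column names are borrowed from the dataset):

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (values are made up).
customers = pd.DataFrame(
    {"id": [1, 2], "label": [0, 1], "fea_1": [5, 7]}
).set_index("id")
payments = pd.DataFrame(
    {"id": [1, 1, 1, 2], "new_balance": [0.0, 120.5, 33.0, 9.9]}
).set_index("id")

# DataFrame.join aligns on the index and performs a left join by default,
# so customer columns are repeated once per matching payment row.
joined = customers.join(payments)

print(joined.shape)  # (4, 3)
print(joined)
```

One consequence worth keeping in mind is that rows belonging to the same customer share the same label and customer features.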

This DataFrame is designed to facilitate the development of machine learning models for credit risk assessment. Aside from the id number column, the 'label' column indicates credit risk classification, with 1 representing high risk and 0 representing low risk. Additionally, there are 11 columns for potential real-world features, aimed at deriving insights into individuals who pose credit risks to institutions.

With the description DataFrame, we can really begin to understand the dataset and its intended use, as well as use it to check anything we are unsure of further down the line. Now we can attempt to clean the data.


II. Data Wrangling and Cleaning

In [4]:
data.isna().sum()
Out[4]:
label                 0
fea_1                 0
fea_2              1028
fea_3                 0
fea_4                 0
fea_5                 0
fea_6                 0
fea_7                 0
fea_8                 0
fea_9                 0
fea_10                0
fea_11                0
OVD_t1                0
OVD_t2                0
OVD_t3                0
OVD_sum               0
pay_normal            0
prod_code             0
prod_limit         6118
update_date          26
new_balance           0
highest_balance     409
report_date        1114
dtype: int64
In [5]:
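# Drop the two date columns and the mostly missing prod_limit column
data.drop(columns=['report_date', 'update_date', 'prod_limit'], inplace=True)
# Fill the remaining gaps (fea_2, highest_balance) with each column's median
med_imp = SimpleImputer(strategy='median')
impData = pd.DataFrame(med_imp.fit_transform(data), columns=data.columns)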
impData.shape
Out[5]:
(8250, 20)
In [6]:
impData.isna().sum()
Out[6]:
label              0
fea_1              0
fea_2              0
fea_3              0
fea_4              0
fea_5              0
fea_6              0
fea_7              0
fea_8              0
fea_9              0
fea_10             0
fea_11             0
OVD_t1             0
OVD_t2             0
OVD_t3             0
OVD_sum            0
pay_normal         0
prod_code          0
new_balance        0
highest_balance    0
dtype: int64

In this step, we cleaned the data by dropping three columns and imputing the median for the missing values in the remaining ones. This lets us use all of the useful data in the set without discarding potentially key insights because of null values in a single column. Now that the data is cleaned, we can prepare it for our first model.
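
To make the imputation step concrete, here is a small sketch of median imputation on a hypothetical toy frame. The column names are borrowed from the dataset, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; values are made up.
toy = pd.DataFrame({
    "fea_2": [1000.0, np.nan, 1200.0, 1400.0],
    "highest_balance": [10.0, 20.0, np.nan, 90.0],
})

imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# statistics_ holds the per-column medians computed from the non-missing values.
print(imputer.statistics_)
print(filled)
```

Note that fitting the imputer before the train/test split lets test rows influence the medians. Fitting it on the training rows only is a common alternative when that matters.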


III. Training, Testing, and Scaling

We will begin by separating the data into training and test sets.

In [7]:
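random.seed(42)  # seeds Python's random module only; the split below is seeded via random_state
# Note: column 0 is 'label' and column 1 is 'fea_1', so this slice also leaves fea_1 out of the predictors
X = impData.iloc[:, 2:]
y = impData['label']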

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

Next, we will scale our data.

In [8]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Now that we have our training and test sets at 80 and 20 percent respectively (test_size=0.2), as well as our predictors scaled, we will create, train, and test our model.
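
The split-then-scale pattern can be sketched on synthetic stand-in data (an assumption, not the credit dataset). With test_size=0.2, 20% of the rows are held out, and the scaler learns its mean and standard deviation from the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 100 rows, 3 features, binary labels.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics learned from training rows only
X_te_scaled = scaler.transform(X_te)      # the same statistics are reused on the test rows

print(X_tr.shape, X_te.shape)  # (80, 3) (20, 3)
```

Reusing the training statistics on the test set keeps information about the held-out rows from leaking into preprocessing.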


IV. Neural Net

A neural network is a machine learning model inspired by the way the human brain processes information. It consists of layers of interconnected nodes, or "neurons," where each connection has a weight that is adjusted as learning proceeds. It can be trained to predict class labels such as those in the dataset we are using.
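
As a rough illustration of layers, weights, and activations, the sketch below runs a single forward pass through a tiny network in NumPy. The layer sizes and random weights are hypothetical, and no training takes place here.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_features = 4                       # hypothetical input width
x = rng.normal(size=(1, n_features))

# Randomly initialised weights and zero biases for a 4 -> 3 -> 1 network.
W1, b1 = rng.normal(size=(n_features, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

hidden = relu(x @ W1 + b1)           # hidden-layer activations (non-negative)
prob = sigmoid(hidden @ W2 + b2)     # output in (0, 1), read as P(high risk)

print(prob.shape)  # (1, 1)
```

Training adjusts W1, b1, W2, and b2 so that the output probability moves toward the observed labels.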

In [21]:
nn = Sequential()
nn.add(tf.keras.layers.Dense(15, input_shape=(X_train_scaled.shape[1],), activation='relu'))
nn.add(tf.keras.layers.Dense(8, activation='relu'))
nn.add(tf.keras.layers.Dense(7, activation='relu'))
nn.add(tf.keras.layers.Dense(1, activation='sigmoid'))


nn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = nn.fit(X_train_scaled, y_train, validation_split=0.33, epochs=7, batch_size=7, verbose=1)

loss, accuracy = nn.evaluate(X_test_scaled, y_test)
print(f"Neural Network Accuracy: {accuracy}")

# Plotting training and validation accuracy
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0, 1])
plt.legend(loc='lower right')
plt.title("Neural Network Accuracy")
plt.show()
Epoch 1/7
632/632 [==============================] - 2s 3ms/step - loss: 0.5849 - accuracy: 0.7376 - val_loss: 0.4376 - val_accuracy: 0.8435
Epoch 2/7
632/632 [==============================] - 2s 2ms/step - loss: 0.4513 - accuracy: 0.8256 - val_loss: 0.4152 - val_accuracy: 0.8444
Epoch 3/7
632/632 [==============================] - 1s 2ms/step - loss: 0.4369 - accuracy: 0.8254 - val_loss: 0.4079 - val_accuracy: 0.8412
Epoch 4/7
632/632 [==============================] - 1s 2ms/step - loss: 0.4288 - accuracy: 0.8276 - val_loss: 0.4054 - val_accuracy: 0.8430
Epoch 5/7
632/632 [==============================] - 1s 2ms/step - loss: 0.4212 - accuracy: 0.8295 - val_loss: 0.3997 - val_accuracy: 0.8444
Epoch 6/7
632/632 [==============================] - 1s 2ms/step - loss: 0.4136 - accuracy: 0.8295 - val_loss: 0.4014 - val_accuracy: 0.8417
Epoch 7/7
632/632 [==============================] - 1s 2ms/step - loss: 0.4061 - accuracy: 0.8333 - val_loss: 0.3978 - val_accuracy: 0.8435
52/52 [==============================] - 0s 1ms/step - loss: 0.4126 - accuracy: 0.8315
Neural Network Accuracy: 0.8315151333808899

83% accuracy is not bad, but we can now use grid search cross-validation to tune the model and try to get better results.

In [28]:
def create_model(optimizer='adam', init='glorot_uniform', dropout_rate=0.0, neurons1=15, neurons2=8, neurons3=7):
    model = Sequential()
    model.add(Dense(neurons1, input_shape=(X_train_scaled.shape[1],), kernel_initializer=init, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(neurons2, kernel_initializer=init, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(neurons3, kernel_initializer=init, activation='relu'))
    model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))
    
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model


model = KerasClassifier(build_fn=create_model, verbose=0)

param_grid = {
    'optimizer': ['adam', 'rmsprop'],
    'epochs': [10, 20],
    'batch_size': [10, 20],
    'init': ['glorot_uniform', 'normal', 'uniform'],
    'dropout_rate': [0.0, 0.1, 0.2],
    'neurons1': [15, 30],
    'neurons2': [8, 16],
    'neurons3': [7, 14]
}


grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X_train_scaled, y_train)


print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
<ipython-input-28-7e75e23a7d2a>:14: DeprecationWarning: KerasClassifier is deprecated, use Sci-Keras (https://github.com/adriangb/scikeras) instead.
  model = KerasClassifier(build_fn=create_model, verbose=0)
Best: 0.870152 using {'batch_size': 10, 'dropout_rate': 0.0, 'epochs': 20, 'init': 'glorot_uniform', 'neurons1': 30, 'neurons2': 16, 'neurons3': 7, 'optimizer': 'adam'}
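
To get a sense of the cost of this search, the number of model fits can be counted directly from the grid. The sketch below copies the grid from the cell above and multiplies the number of options per parameter by the number of folds.

```python
from math import prod

# Same grid as in the cell above.
param_grid = {
    "optimizer": ["adam", "rmsprop"],
    "epochs": [10, 20],
    "batch_size": [10, 20],
    "init": ["glorot_uniform", "normal", "uniform"],
    "dropout_rate": [0.0, 0.1, 0.2],
    "neurons1": [15, 30],
    "neurons2": [8, 16],
    "neurons3": [7, 14],
}

n_combinations = prod(len(values) for values in param_grid.values())
cv_folds = 3
# Fits during the search; GridSearchCV adds one final refit by default.
n_fits = n_combinations * cv_folds

print(n_combinations, n_fits)  # 576 1728
```

When the grid grows much beyond this, a randomized search over the same ranges is a common way to limit the number of fits.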

The best cross-validated accuracy rose to 87%, but accuracy alone can be misleading until we see how the predictions break down by class, for example in a confusion matrix.

In [35]:
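# Note: `nn` is the untuned network from In [21], not the best model found by the grid search
y_pred_prob = nn.predict(X_test_scaled)
y_pred = (y_pred_prob > 0.5).astype(int)  # threshold the sigmoid output at 0.5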
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=["Low Risk", "High Risk"], yticklabels=["Low Risk", "High Risk"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

As we can see, the model is fairly accurate when predicting true negatives (truly low-risk clients), but it performs poorly on true positives (truly high-risk clients). Let's see if our next model is a better fit for this type of classification problem.
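
The gap between overall accuracy and performance on the high-risk class can be quantified with recall. The sketch below uses made-up labels, not the model's actual predictions, to show how a classifier that almost always predicts "low risk" can score high accuracy while missing half of the high-risk clients.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 8 low-risk (0) and 2 high-risk (1) clients.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A hypothetical model that almost always predicts "low risk".
y_hat = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = accuracy_score(y_true, y_hat)     # (tn + tp) / total
recall_high = recall_score(y_true, y_hat)    # tp / (tp + fn) for class 1

print(tn, fp, fn, tp)         # 8 0 1 1
print(accuracy, recall_high)  # 0.9 0.5
```

classification_report, already imported at the top of the notebook, prints precision and recall for both classes in one call.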


V. Random Forest

A random forest is an ensemble learning method primarily used for classification and regression tasks. It builds multiple decision trees and merges them to get a more accurate and stable prediction.
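
The "merging" step can be made concrete. In scikit-learn, the forest's predicted class probabilities are the mean of the per-tree probabilities. The sketch below checks this on synthetic stand-in data (an assumption, not the credit dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption), not the credit dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

forest = RandomForestClassifier(n_estimators=25, random_state=42)
forest.fit(X_demo, y_demo)

# Each fitted decision tree is available in forest.estimators_.
tree_probs = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average the per-tree class probabilities across the 25 trees.
averaged = tree_probs.mean(axis=0)

print(len(forest.estimators_))
print(np.allclose(averaged, forest.predict_proba(X_demo)))
```

Averaging many trees, each grown on a bootstrap sample with random feature subsets, is what makes the ensemble more stable than any single tree.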

In [40]:
rfclass = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
rfclass.fit(X_train_scaled,y_train)
y_pred = rfclass.predict(X_test_scaled)
accuracy = accuracy_score(y_test,y_pred)
accuracy
Out[40]:
0.9606060606060606
In [41]:
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=["Low Risk", "High Risk"], yticklabels=["Low Risk", "High Risk"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

While the model was very proficient at classifying true negatives, it could improve at predicting true positives. We will attempt to tune the model the same way as before, aiming for a better result.

In [42]:
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(estimator=rfclass, param_grid=param_grid, cv=3, scoring='accuracy')

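# Unlike In [40], this fits on the unscaled features; tree splits are unaffected by standardization
grid_search.fit(X_train, y_train)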

print("Best Parameters:", grid_search.best_params_)
print("Best Cross-validation Accuracy:", grid_search.best_score_)
Best Parameters: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 100}
Best Cross-validation Accuracy: 0.9480303030303031
In [43]:
best_estimator = grid_search.best_estimator_
y_pred = best_estimator.predict(X_test)


conf_matrix = confusion_matrix(y_test, y_pred)


class_labels = ["Low Risk", "High Risk"]


plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

After tuning with grid search cross-validation, the model correctly identifies a notably higher proportion of risk labels.


VI. Inference/Conclusion

In this study comparing Random Forest and neural network models for credit risk assessment, the Random Forest classifier outperformed the neural network. It achieved 96% accuracy on the test set, compared with 83% for the neural network, and it was better at identifying both low-risk and high-risk instances while appearing less susceptible to overfitting. This advantage can be attributed to the Random Forest's ability to handle structured, tabular data effectively and to its robust performance even with default hyperparameters. For financial institutions, adopting Random Forest models of this type could offer efficient, accurate credit risk prediction, supporting stable and profitable risk management practices.