Niko Montez - 1223861496
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
import random
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
In the realm of financial risk management, accurate prediction of credit default risk is crucial for maintaining the stability and profitability of lending institutions. This study compares the predictive performance of a neural network against a Random Forest, both used as classifiers for credit risk assessment. Let's begin by importing and understanding the dataset: https://www.kaggle.com/datasets/praveengovi/credit-risk-classification-dataset/data
Here we will first join our two datasets so we can more easily clean and access our data.
custdata = pd.read_csv('customer_data.csv')
custdata = custdata.set_index('id')
pmntdata = pd.read_csv('payment_data.csv')
pmntdata = pmntdata.set_index('id')
data = custdata.join(pmntdata)
data.head(5)
data.shape
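Because the join aligns the two tables on their shared id index, it helps to see what happens when one id has several payment records. The sketch below uses two made-up miniature tables (the values and the choice of columns are illustrative assumptions, not taken from the real files) to show that each customer row is repeated once per matching payment row.

```python
import pandas as pd

# Hypothetical miniature versions of the two files (values are made up).
customers = pd.DataFrame(
    {"id": [1, 2], "label": [0, 1], "fea_1": [5, 7]}
).set_index("id")
payments = pd.DataFrame(
    {"id": [1, 1, 1, 2], "OVD_t1": [0, 1, 0, 2]}
).set_index("id")

# DataFrame.join aligns on the index and defaults to a left join,
# so each customer row is repeated once per matching payment row.
joined = customers.join(payments)

print(joined.shape)           # (4, 3)
print(joined.index.tolist())  # [1, 1, 1, 2]
```

If the real payment table turns out to have several rows per id (pmntdata.index.is_unique would tell us), the joined data contains one row per payment record rather than one row per customer, which affects both the row count and the class balance seen by the models.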
This DataFrame is designed to facilitate the development of machine learning models for credit risk assessment. Aside from the id number column, the 'label' column indicates credit risk classification, with 1 representing high risk and 0 representing low risk. Additionally, there are 11 columns for potential real-world features, aimed at deriving insights into individuals who pose credit risks to institutions.
With the description DataFrame, we can really begin to understand the dataset and its intended use, as well as use it to check anything we are unsure of further down the line. Now we can attempt to clean the data.
data.isna().sum()
data.drop(columns=['report_date','update_date','prod_limit'],axis=1,inplace=True)
med_imp = SimpleImputer(strategy='median')
impData = pd.DataFrame(med_imp.fit_transform(data), columns=data.columns)
impData.shape
impData.isna().sum()
In this step, we cleaned the data by dropping three columns and imputing the missing values in the remaining ones with each column's median. This practice allows us to use all of the useful data in the set without discarding potentially key insights because of null values in a single column. Now that the data is cleaned, we will begin to create our first model.
We will begin by separating the data into training and test sets.
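To make the median imputation concrete, here is a small library-free sketch of what happens to a single column. The column values are invented for illustration; the point is only that the fill value is computed from the observed entries and then written into the gaps.

```python
import math
from statistics import median

# Hypothetical column with two missing entries (NaN).
column = [10.0, float("nan"), 30.0, 50.0, float("nan")]

# The fill value is the median of the observed (non-missing) values only.
observed = [v for v in column if not math.isnan(v)]
fill_value = median(observed)

# Replace every missing entry with that median; observed values are untouched.
imputed = [fill_value if math.isnan(v) else v for v in column]
print(imputed)  # [10.0, 30.0, 30.0, 50.0, 30.0]
```

The median is less sensitive to extreme values than the mean, which is one common reason to prefer it for skewed financial features.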
random.seed(42)
tf.random.set_seed(42)  # also seed TensorFlow so the network's weight initialization is repeatable
X = impData.iloc[:, 2:]
y = impData['label']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
Next, we will scale our data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
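The scaler is fit on the training split and then only applied to the test split. A minimal sketch of the same idea for a single feature, using invented numbers, might look like this. It assumes the usual standardization formula (subtract the mean, divide by the population standard deviation), which is what StandardScaler computes.

```python
from statistics import mean, pstdev

# Hypothetical values of one feature in each split.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

# Statistics come from the training split only.
mu = mean(train)
sigma = pstdev(train)  # population standard deviation (ddof=0)

# The same mu and sigma are reused for the test split.
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

print(mu, sigma)
print(train_scaled)
print(test_scaled)
```

Reusing the training mean and standard deviation on the test data keeps information about the test set from leaking into preprocessing, so the test split is not guaranteed to have mean 0 and standard deviation 1 after scaling.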
Now that we have our training and test sets at 80 and 20 percent respectively (test_size=0.2), as well as our predictors scaled, we will create, train, and test our model.
A Neural Network is a machine learning model inspired by the way the human brain processes information. It consists of layers of interconnected nodes, or "neurons," where each connection has a weight that adjusts as learning proceeds. It can be trained to predict classification data such as the data set we are using.
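As a rough illustration of "layers of interconnected nodes," the sketch below runs one input through a tiny network by hand. The weights, biases, and layer sizes are made up for the example and are far smaller than the network built in this notebook; the activations (ReLU in the hidden layer, sigmoid at the output) match the ones used here.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each neuron takes a weighted sum of all inputs."""
    return [
        activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]

# Hypothetical weights for a 2-input network with one hidden layer of 2 neurons.
hidden_w = [[0.5, -1.0], [1.0, 1.0]]
hidden_b = [0.0, -0.5]
output_w = [[1.0, -2.0]]
output_b = [0.0]

x = [1.0, 0.5]
hidden = dense(x, hidden_w, hidden_b, relu)
prob = dense(hidden, output_w, output_b, sigmoid)[0]

print(hidden)  # [0.0, 1.0]
print(prob)    # sigmoid(-2), roughly 0.12, so this input would be labeled low risk
```

Training adjusts the numbers in hidden_w, hidden_b, output_w, and output_b so that the output probability moves toward the true label.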
nn = Sequential()
nn.add(tf.keras.layers.Dense(15, input_shape=(X_train_scaled.shape[1],), activation='relu'))
nn.add(tf.keras.layers.Dense(8, activation='relu'))
nn.add(tf.keras.layers.Dense(7, activation='relu'))
nn.add(tf.keras.layers.Dense(1, activation='sigmoid'))
nn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = nn.fit(X_train_scaled, y_train, validation_split=0.33, epochs=7, batch_size=7, verbose=1)
loss, accuracy = nn.evaluate(X_test_scaled, y_test)
print(f"Neural Network Accuracy: {accuracy}")
# Plotting training and validation accuracy
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0, 1])
plt.legend(loc='lower right')
plt.title("Neural Network Accuracy")
plt.show()
83% accuracy is not bad, but now we can use grid search cross validation to tune our model to try and get the best results.
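To judge whether an accuracy figure is good, it helps to compare it with the accuracy of always predicting the most common class. The sketch below computes that baseline for a made-up label vector; the 80/20 split is an assumption for illustration, not the class balance of this dataset.

```python
from collections import Counter

# Hypothetical label vector: 0 = low risk, 1 = high risk.
labels = [0] * 80 + [1] * 20

# A model that always predicts the majority class gets this accuracy for free.
counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_class, baseline_accuracy)  # 0 0.8
```

Running the same calculation on y_test would show how much of the network's accuracy comes from learning rather than from class imbalance.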
def create_model(optimizer='adam', init='glorot_uniform', dropout_rate=0.0, neurons1=15, neurons2=8, neurons3=7):
    model = Sequential()
    model.add(Dense(neurons1, input_shape=(X_train_scaled.shape[1],), kernel_initializer=init, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(neurons2, kernel_initializer=init, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(neurons3, kernel_initializer=init, activation='relu'))
    model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model
model = KerasClassifier(build_fn=create_model, verbose=0)
param_grid = {
    'optimizer': ['adam', 'rmsprop'],
    'epochs': [10, 20],
    'batch_size': [10, 20],
    'init': ['glorot_uniform', 'normal', 'uniform'],
    'dropout_rate': [0.0, 0.1, 0.2],
    'neurons1': [15, 30],
    'neurons2': [8, 16],
    'neurons3': [7, 14]
}
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X_train_scaled, y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
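Grid search trains one model per parameter combination per fold, so the cost grows quickly with the size of the grid. The sketch below simply counts the work implied by the grid used above; it copies that dictionary and the cv=3 setting and does no training itself.

```python
from math import prod

# Same grid as in the neural network search above.
param_grid = {
    'optimizer': ['adam', 'rmsprop'],
    'epochs': [10, 20],
    'batch_size': [10, 20],
    'init': ['glorot_uniform', 'normal', 'uniform'],
    'dropout_rate': [0.0, 0.1, 0.2],
    'neurons1': [15, 30],
    'neurons2': [8, 16],
    'neurons3': [7, 14],
}
cv_folds = 3

# Every combination of values is tried, once per cross-validation fold.
n_combinations = prod(len(values) for values in param_grid.values())
n_fits = n_combinations * cv_folds

print(n_combinations, n_fits)  # 576 1728
```

With 1,728 network trainings, trimming the grid or using a randomized search may be worth considering if runtime becomes a problem.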
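Grid search raised the best cross-validated accuracy to 87%, but accuracy alone can be misleading when one class is much more common than the other, so we visualize the predictions in a confusion matrix. Note that the matrix below is computed from the originally trained network, nn, on the test set.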
y_pred_prob = nn.predict(X_test_scaled)
y_pred = (y_pred_prob > 0.5).astype(int)
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=["Low Risk", "High Risk"], yticklabels=["Low Risk", "High Risk"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
As we can see, the model is fairly accurate when predicting true negatives (truly low-risk clients), but it performs poorly on true positives (truly high-risk clients). Let's see if our next model is a better fit for this type of classification problem.
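The pattern described here can be put into numbers by reading precision and recall off the confusion matrix. The counts below are invented for illustration and are not the notebook's results; the layout follows what confusion_matrix returns, with actual classes in rows and predicted classes in columns.

```python
# Hypothetical counts, rows = actual class, columns = predicted class,
# in label order [0 (low risk), 1 (high risk)].
conf = [[150, 10],   # actual low risk:  TN, FP
        [25, 15]]    # actual high risk: FN, TP

(tn, fp), (fn, tp) = conf

accuracy = (tp + tn) / (tn + fp + fn + tp)
recall = tp / (tp + fn)        # share of truly high-risk clients that were caught
precision = tp / (tp + fp)     # share of high-risk predictions that were correct
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, recall, precision, round(f1, 3))  # 0.825 0.375 0.6 0.462
```

In this made-up case accuracy looks healthy while recall on the high-risk class is only 37.5%. The classification_report function already imported at the top of the notebook prints the same per-class metrics for y_test and y_pred.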
A random forest is an ensemble learning method primarily used for classification and regression tasks. It builds multiple decision trees and merges them to get a more accurate and stable prediction.
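One way to picture the "merging" step: each tree produces its own estimate, and the forest combines them. The sketch below averages per-tree probabilities for a single applicant, which is how scikit-learn's random forest classifier is documented to combine its trees; the five probabilities are made up.

```python
from statistics import mean

# Hypothetical P(high risk) estimates from five trees for one applicant.
tree_probs = [0.9, 0.2, 0.7, 0.6, 0.4]

# The forest's estimate is the mean of the individual tree estimates.
forest_prob = mean(tree_probs)
prediction = int(forest_prob > 0.5)  # 1 = high risk

print(round(forest_prob, 2), prediction)  # 0.56 1
```

Averaging many trees, each built on a different bootstrap sample of the data, tends to smooth out the errors of any single tree.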
rfclass = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
rfclass.fit(X_train_scaled,y_train)
y_pred = rfclass.predict(X_test_scaled)
accuracy = accuracy_score(y_test,y_pred)
accuracy
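The classifier above is created with class_weight='balanced'. As a sketch of what that setting does, the code below applies scikit-learn's documented heuristic, n_samples / (n_classes * class_count), to a made-up label vector. The 80/20 split is an assumption for illustration only.

```python
from collections import Counter

# Hypothetical training labels: 0 = low risk, 1 = high risk.
y = [0] * 80 + [1] * 20

counts = Counter(y)
n_samples, n_classes = len(y), len(counts)

# "balanced" heuristic: weight = n_samples / (n_classes * count_of_that_class)
weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

print(weights)  # {0: 0.625, 1: 2.5}
```

The rarer high-risk class receives the larger weight, so mistakes on it count for more during training.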
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=["Low Risk", "High Risk"], yticklabels=["Low Risk", "High Risk"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
While the model was very proficient at classifying true negatives, it could improve at identifying true positives. We will attempt to tune the model the same way as before, aiming for a better result.
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(estimator=rfclass, param_grid=param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-validation Accuracy:", grid_search.best_score_)
best_estimator = grid_search.best_estimator_
y_pred = best_estimator.predict(X_test)
conf_matrix = confusion_matrix(y_test, y_pred)
class_labels = ["Low Risk", "High Risk"]
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
After tuning with grid search cross-validation, the model correctly identifies a notably higher proportion of risk labels.
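Since the goal is to catch high-risk clients, it can be useful to compare models on a metric that weights both classes equally. The sketch below computes balanced accuracy (the mean of the per-class recalls) for two invented confusion matrices standing in for "before" and "after" tuning; the counts are hypothetical, not the notebook's results.

```python
def balanced_accuracy(conf):
    """Mean of recall on the low-risk class and recall on the high-risk class."""
    (tn, fp), (fn, tp) = conf
    specificity = tn / (tn + fp)   # recall on low risk
    recall = tp / (tp + fn)        # recall on high risk
    return (specificity + recall) / 2

def accuracy(conf):
    (tn, fp), (fn, tp) = conf
    return (tn + tp) / (tn + fp + fn + tp)

# Hypothetical matrices, rows = actual, columns = predicted.
before = [[155, 5], [30, 10]]
after = [[150, 10], [10, 30]]

print(accuracy(before), balanced_accuracy(before))  # 0.825 0.609375
print(accuracy(after), balanced_accuracy(after))    # 0.9 0.84375
```

In this example plain accuracy moves only from 0.825 to 0.9, while balanced accuracy shows the much larger gain on the high-risk class. GridSearchCV accepts scoring='balanced_accuracy' or scoring='recall' if we wanted the search itself to optimize for this.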
In this study comparing Random Forest and neural network models for credit risk assessment, the Random Forest classifier consistently outperformed the neural network. Achieving 96% accuracy on the test set, the Random Forest identified both low-risk and high-risk instances well and appeared less susceptible to overfitting than the neural network. This advantage can be attributed to the Random Forest's ability to handle structured, tabular data effectively and to its robust performance even with default hyperparameters. For financial institutions, adopting Random Forest models of this kind could offer efficient, accurate credit risk prediction, supporting stable and profitable risk management practices.