import pandas as pd
import numpy as np
path = 'Heart.csv'
heart = pd.read_csv(path)
Next, I will take a look at the shape and description of the data, as well as the first few rows.
heart.shape
heart.describe()
heart.head(5)
As the data description shows, there are 13 predictor columns and one target column. The target column is a binary label that marks having heart disease as 1 and not having heart disease as 0.
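A quick check of the label counts confirms the two classes (assuming the target column is named 'target', as it is referenced later in the plots):
heart['target'].value_counts()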
Now, we will quickly clean up the data, checking if there are any null values.
heart.isnull().sum()
As the output shows, there are zero null values in any of the columns, so no rows need to be dropped.
Now that the data is clean, I will wrangle it into predictor and target variables in order to begin creating a basic decision tree.
X = heart.iloc[:,0:13]
y = heart.iloc[:,13]
Now that the data is separated into predictor and target variables, I will split it into training and test sets.
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size= 0.25,random_state= 0)
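If keeping the class balance identical in both sets matters, a stratified split is one option; a minimal variant of the call above (not used in the rest of this analysis):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)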
Next, I will begin to create and test the tree itself.
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train,y_train)
print("The accuracy of the model on the practice set is {:.3f}".format(tree.score(X_train,y_train)))
from sklearn.metrics import accuracy_score
y_pred = tree.predict(X_test)
print('The accuracy of the model on the test set is {:.3f}'.format(accuracy_score(y_test,y_pred)))
As the scores show, the model does far better on the training set than on the test set, which means it is badly overfit. Let's fix that by pruning the tree.
tree_pruned = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_pruned.fit(X_train,y_train)
print('The pruned model accuracy on the training set is {:.3f}'.format(tree_pruned.score(X_train,y_train)))
y_pruned_pred = tree_pruned.predict(X_test)
print('The pruned model accuracy on the test set is {:.3f}'.format(accuracy_score(y_test,y_pruned_pred)))
The pruned tree's accuracy on the test set is 76.3%.
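The depth of 3 was chosen by hand; a more systematic option is to cross-validate a few candidate depths on the training set. This is only a sketch, assuming 5-fold cross-validation, and is not part of the original analysis:
from sklearn.model_selection import cross_val_score
# Try a handful of depths and report the mean cross-validated accuracy for each
for depth in [2, 3, 4, 5, 6]:
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth, random_state=0), X_train, y_train, cv=5)
    print('max_depth={}: mean CV accuracy {:.3f}'.format(depth, scores.mean()))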
predictors = heart.iloc[:,0:13].columns.tolist()
importance = tree_pruned.feature_importances_
df = pd.DataFrame(importance, index=predictors, columns=['Importance'])
df
As the table shows, chestpain is the most important predictor of heart disease in the pruned tree.
Now, I will perform a logistic regression, and test whether the pruned tree or logistic regression produces a more accurate result.
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train,y_train)
log_pred = logreg.predict(X_test)
print('Logistic regression model accuracy on the test set is {:.3f}'.format(accuracy_score(y_test,log_pred)))
It is clear that the logistic regression is a more accurate model than even the pruned decision tree.
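A single train/test split can be noisy, so as a sanity check (an extra step, not part of the original comparison) the two models can also be compared with cross-validated accuracy:
from sklearn.model_selection import cross_val_score
# cross_val_score clones each estimator, so the fitted models above are unaffected
print('Pruned tree mean CV accuracy: {:.3f}'.format(cross_val_score(tree_pruned, X, y, cv=5).mean()))
print('Logistic regression mean CV accuracy: {:.3f}'.format(cross_val_score(logreg, X, y, cv=5).mean()))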
coefficients = logreg.coef_[0]
feature_importance = pd.Series(coefficients, index=X.columns)
feature_importance = feature_importance.sort_values(ascending=False)
print(feature_importance)
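One caveat: these raw coefficients are on each feature's original scale, so their magnitudes are not directly comparable across features. A sketch of refitting on standardized features (an extra step, not in the original workflow) makes them easier to compare:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Scale each feature to zero mean and unit variance before fitting the regression
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_train, y_train)
scaled_coefs = pd.Series(scaled_logreg.named_steps['logisticregression'].coef_[0], index=X.columns)
print(scaled_coefs.sort_values(ascending=False))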
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, log_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['No (0)', 'Yes (1)'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()
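Beyond overall accuracy, scikit-learn's classification report gives a per-class breakdown of precision and recall for the same predictions:
from sklearn.metrics import classification_report
print(classification_report(y_test, log_pred, target_names=['No (0)', 'Yes (1)']))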
Lastly, we will look at a few different plots.
import matplotlib.pyplot as plt
plt.hist(heart['age'], bins=6, color='seagreen', edgecolor='black')
plt.title('Age Distribution in Data', fontweight='bold')
plt.xlabel('Age')
plt.ylabel('# of Participants')
plt.show()
The histogram shows that the largest portion of participants is roughly 45 to 60 years old, which is useful context when drawing conclusions from the data.
Now, let's look at a distribution of who has and does not have the disease.
import seaborn as sns
sns.countplot(x='target', data=heart)
plt.xlabel('Target (0: No Disease, 1: Disease)')
plt.ylabel('Count')
plt.title('Distribution of Heart Disease')
plt.show()
As the plot shows, there are more participants in this dataset with heart disease than without, which could indicate a bias in the data, since in the general population there are more people without heart disease than with it.
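To put a number on that imbalance, the class proportions can be checked directly:
print(heart['target'].value_counts(normalize=True))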
# Create the bar plot
sns.countplot(x='chestpain', hue='target', data=heart)
plt.xlabel('Chest Pain Level')
plt.ylabel('Count')
plt.title('Distribution of Chest Pain Levels by Target')
plt.legend(title='Target', labels=['No (0)', 'Yes (1)'])
plt.show()
As the plot shows, there seems to be a strong association between higher chest pain levels and positive heart disease diagnoses.
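A simple cross-tabulation (an extra check, not in the original notebook) quantifies how the rate of positive diagnoses changes across chest pain levels:
# Share of each chest pain level that falls into target 0 vs 1
pd.crosstab(heart['chestpain'], heart['target'], normalize='index')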
sns.countplot(x='exercise', hue='target', data=heart)
plt.xlabel('Exercise Level')
plt.ylabel('Count')
plt.title('Distribution of Exercise Levels by Target')
plt.legend(title='Target', labels=['No (0)', 'Yes (1)'])
plt.show()
sns.countplot(x='fluor', hue='target', data=heart)
plt.xlabel('Fluoroscopy (fluor)')
plt.ylabel('Count')
plt.title('Distribution of fluor Values by Target')
plt.legend(title='Target', labels=['No (0)', 'Yes (1)'])
plt.show()