In [1]:
import pandas as pd
import numpy as np
path = 'Heart.csv'
heart = pd.read_csv(path)

Next, I will take a look at the shape and description of the data, as well as the first few rows.

In [2]:
heart.shape
Out[2]:
(303, 14)
In [3]:
heart.describe()
Out[3]:
[describe() output: count, mean, std, min, 25%, 50%, 75%, and max for each of the 14 numeric columns]
In [4]:
heart.head(5)
Out[4]:
   age  sex  chestpain  restbps  chol  sugar  ecg  maxhr  angina  dep  exercise  fluor  thal  target
0   63    1          3      145   233      1    0    150       0  2.3         0      0     1       1
1   37    1          2      130   250      0    1    187       0  3.5         0      0     2       1
2   41    0          1      130   204      0    0    172       0  1.4         2      0     2       1
3   56    1          1      120   236      0    1    178       0  0.8         2      0     2       1
4   57    0          0      120   354      0    1    163       1  0.6         2      0     2       1

As the output shows, the data contains 13 predictor columns and 1 target column. The target is a binary class label: 1 means the participant has heart disease and 0 means they do not.
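Since the target is a binary label, it is worth checking the class balance up front. A minimal sketch (output omitted here; the same distribution is plotted later in this notebook):

print(heart['target'].value_counts())                          # raw counts per class
print(heart['target'].value_counts(normalize=True).round(3))   # as proportions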

Now, we will quickly clean up the data, checking if there are any null values.

In [5]:
heart.isnull().sum()
Out[5]:
age          0
sex          0
chestpain    0
restbps      0
chol         0
sugar        0
ecg          0
maxhr        0
angina       0
dep          0
exercise     0
fluor        0
thal         0
target       0
dtype: int64

As the output shows, there are no null values in any column, so no rows need to be dropped.
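Had there been missing values, one common (hypothetical) cleanup step would be to fill numeric columns with their median, or simply drop the affected rows. A sketch, not needed for this dataset:

# Only runs if any nulls exist; filling with the column median is one common choice
if heart.isnull().any().any():
    heart = heart.fillna(heart.median(numeric_only=True))
    # alternatively: heart = heart.dropna()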

Now that the data is confirmed clean, I will separate it into predictor and target variables in order to build a basic decision tree.

In [6]:
X = heart.iloc[:,0:13]
y = heart.iloc[:,13]

Now that the data is separated into predictor and target variables, I will split the data into training and test sets.

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

Next, I will begin to create and test the tree itself.

In [8]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train,y_train)
print("The accuracy of the model on the practice set is {:.3f}".format(tree.score(X_train,y_train)))
The accuracy of the model on the practice set is 1.000
In [9]:
from sklearn.metrics import accuracy_score
y_pred = tree.predict(X_test)
print('The accuracy of the model on the test set is {:.3f}'.format(accuracy_score(y_test,y_pred)))
The accuracy of the model on the test set is 0.789

As the perfect score on the training set shows, the model is badly overfit. Let's fix that by limiting the depth of the tree.
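Before pruning, it helps to see how complex the unpruned tree actually is. A quick check using standard DecisionTreeClassifier methods (the exact numbers depend on the fit, so they are not shown here):

print('Unpruned tree depth:', tree.get_depth())
print('Unpruned tree leaves:', tree.get_n_leaves())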

In [10]:
tree_pruned = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_pruned.fit(X_train,y_train)
print('The pruned model accuracy on the training set is {:.3f}'.format(tree_pruned.score(X_train,y_train)))
The pruned model accuracy on the training set is 0.855
In [11]:
y_pruned_pred = tree_pruned.predict(X_test)
print('The pruned model accuracy on the test set is {:.3f}'.format(accuracy_score(y_test,y_pruned_pred)))
The pruned model accuracy on the test set is 0.763

The pruned tree's accuracy on the test set is 76.3%, and the gap between its training and test scores is far smaller than for the unpruned tree, so it generalizes more honestly even though its test accuracy is slightly lower here.
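To see exactly which splits the pruned tree makes, its structure can be printed as plain text. A minimal sketch (output omitted):

from sklearn.tree import export_text
print(export_text(tree_pruned, feature_names=list(X.columns)))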

In [12]:
predictors = heart.iloc[:,0:13].columns.tolist()
importance = tree_pruned.feature_importances_
df = pd.DataFrame(importance, index=predictors, columns=['Importance'])
df
Out[12]:
Importance
age 0.014161
sex 0.084347
chestpain 0.470628
restbps 0.000000
chol 0.000000
sugar 0.000000
ecg 0.000000
maxhr 0.096373
angina 0.000000
dep 0.000000
exercise 0.000000
fluor 0.213231
thal 0.121260

As the table shows, chestpain is by far the most important feature in the pruned tree, followed by fluor and thal.
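For readability, the same table can also be sorted so the ranking is explicit; a one-line sketch:

print(df.sort_values('Importance', ascending=False))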

Now, I will perform a logistic regression, and test whether the pruned tree or logistic regression produces a more accurate result.

In [13]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train,y_train)
log_pred = logreg.predict(X_test)
print('Logistic regression model accuracy on the test set is {:.3f}'.format(accuracy_score(y_test,log_pred)))
Logistic regression model accuracy on the test set is 0.829

The logistic regression is more accurate on the test set than either the unpruned or the pruned decision tree.

In [14]:
coefficients = logreg.coef_[0]

feature_importance = pd.Series(coefficients, index=X.columns)

feature_importance = feature_importance.sort_values(ascending=False)

print(feature_importance)
chestpain    0.820504
exercise     0.230714
ecg          0.168780
maxhr        0.022679
age          0.000290
chol        -0.005125
restbps     -0.010393
sugar       -0.440175
dep         -0.578433
thal        -0.755532
fluor       -0.778159
angina      -0.797117
sex         -1.685999
dtype: float64
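One caveat: these coefficients are on the raw feature scales, so their magnitudes are not directly comparable across features. A minimal sketch of refitting on standardized features with a scikit-learn Pipeline (the resulting accuracy and coefficients would differ somewhat from the values above):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each feature before fitting so coefficient sizes are comparable
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_train, y_train)
scaled_coefs = pd.Series(scaled_logreg.named_steps['logisticregression'].coef_[0], index=X.columns)
print(scaled_coefs.sort_values(ascending=False))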
In [32]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, log_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['No (0)', 'Yes (1)'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()
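Beyond overall accuracy, the confusion matrix can be summarized with per-class precision and recall; a short sketch:

from sklearn.metrics import classification_report
print(classification_report(y_test, log_pred, target_names=['No disease', 'Disease']))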

Lastly, we will look at a few different plots.

In [15]:
import matplotlib.pyplot as plt
plt.hist(heart['age'], bins=6, color='seagreen', edgecolor='black')
plt.title('Age Distribution in Data', fontweight='bold')
plt.xlabel('Age')
plt.ylabel('# of Participants')
Out[15]:
Text(0, 0.5, '# of Participants')

The histogram shows that most participants are roughly 45 to 60 years old, which is important context when drawing conclusions from this data.

Now, let's look at a distribution of who has and does not have the disease.

In [16]:
import seaborn as sns
sns.countplot(x='target', data=heart)
plt.xlabel('Target (0: No Disease, 1: Disease)')
plt.ylabel('Count')
plt.title('Distribution of Heart Disease')
plt.show()

As the plot shows, there are more participants in this dataset with heart disease than without, which could indicate a bias in the data, since in the general population more people are without heart disease than with it.

In [29]:
# Create the bar plot
sns.countplot(x='chestpain', hue='target', data=heart)
plt.xlabel('Chest Pain Level')
plt.ylabel('Count')
plt.title('Distribution of Chest Pain Levels by Target')
plt.legend(title='Target', labels=['No (0)', 'Yes (1)'])
plt.show()

As this plot shows, there is a strong association between higher chest pain levels and a positive heart disease diagnosis.
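To quantify this, the share of positive diagnoses within each chest pain level can be tabulated directly; a quick sketch:

# Row-normalized crosstab: fraction with target == 1 at each chestpain level
print(pd.crosstab(heart['chestpain'], heart['target'], normalize='index').round(2))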

In [30]:
sns.countplot(x='exercise', hue='target', data=heart)
plt.xlabel('Exercise Level')
plt.ylabel('Count')
plt.title('Distribution of Exercise Levels by Target')
plt.legend(title='Target', labels=['No (0)', 'Yes (1)'])
plt.show()
In [31]:
sns.countplot(x='fluor', hue='target', data=heart)
plt.xlabel('Fluoroscopy Value (fluor)')
plt.ylabel('Count')
plt.title('Distribution of Fluoroscopy Values by Target')
plt.legend(title='Target', labels=['No (0)', 'Yes (1)'])
plt.show()