Predicting Heart Disease using Machine Learning | #1 ML Blog

Sep 3, 2022 | Machine Learning

We are going to predict heart disease using machine learning classification. We use classification because we want to know whether a patient has heart disease or not (yes / no).

What is classification?

Classification involves deciding whether a sample belongs to one class or another (binary classification). If there are more than two class options, it’s referred to as multi-class classification.

So now the question is: how do we do it?

Topics we are Going to Cover

  • Exploratory data analysis (EDA) – the process of going through a dataset and finding out more about it.
  • Model training – create a model(s) to learn to predict a target variable based on other variables.
  • Model evaluation – evaluating a model’s predictions using problem-specific evaluation metrics.
  • Model comparison – comparing several different models to find the best one.
  • Model fine-tuning – once we’ve found a good model, how can we improve it?
  • Feature importance – since we’re predicting the presence of heart disease, are there some things that are more important for prediction?
  • Cross-validation – if we do build a good model, can we be sure it will work on unseen data?
  • Reporting what we’ve found – if we had to present our work, what would we show someone?

Problem Definition

In our case, the problem we will be exploring is binary classification (a sample can only be one of two things).

This is because we’re going to be using a number of different features (pieces of information) about a person to predict whether they have heart disease or not.

In a statement,

Given clinical parameters about a patient, can we predict whether or not they have heart disease?

Data

We have used a dataset from Kaggle.

Let’s get started:

# Regular EDA and plotting libraries
import numpy as np # np is short for numpy
import pandas as pd # pandas is so commonly used, it's shortened to pd
import matplotlib.pyplot as plt
import seaborn as sns # seaborn gets shortened to sns

# We want our plots to appear in the notebook
%matplotlib inline 

## Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

## Model evaluators
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay # plot_roc_curve was removed in scikit-learn 1.2

Load Data

df = pd.read_csv("../data/heart-disease.csv") # 'DataFrame' shortened to 'df'
df.shape # (rows, columns)

EDA (Exploratory Data Analysis)

Once you’ve imported a dataset, the next step is to explore. There’s no set way of doing this. But what you should be trying to do is become more and more familiar with the dataset.

Compare different columns to each other, and compare them to the target variable. Refer back to your data dictionary and remind yourself of what different columns mean.

Your goal is to become a subject matter expert on the dataset you’re working with. Then, if someone asks you a question about it, you can give them an explanation, and when you start building models, you can sanity-check them to see whether they’re performing suspiciously well (overfitting) or work out why they might be performing poorly (underfitting).

Since EDA has no real set methodology, the following is a short checklist you might want to walk through:

  1. What question(s) are you trying to solve (or prove wrong)?
  2. What kind of data do you have and how do you treat different types?
  3. What’s missing from the data and how do you deal with it?
  4. Where are the outliers and why should you care about them?
  5. How can you add, change or remove features to get more out of your data?

# Check the top 10 rows of our dataset
df.head(10)

# Let's see how many positive (1) and negative (0) samples we have in our dataframe
df.target.value_counts()

# Plot the value counts with a bar graph
df.target.value_counts().plot(kind="bar", color=["salmon", "lightblue"]);

Heart Disease Frequency according to Gender

Remember from our data dictionary, for the target column, 1 = heart disease present, 0 = no heart disease. And for sex, 1 = male, 0 = female.

df.sex.value_counts()
# Compare target column with sex column
pd.crosstab(df.target, df.sex)

Making our crosstab visual

# Create a plot
pd.crosstab(df.target, df.sex).plot(kind="bar", 
                                    figsize=(10,6), 
                                    color=["salmon", "lightblue"]);

We’ll create the plot again with crosstab() and plot(), then add some helpful labels to it with plt.title(), plt.xlabel() and more.

To add the attributes, you call them on plt within the same cell where you create the graph.

# Create a plot
pd.crosstab(df.target, df.sex).plot(kind="bar", figsize=(10,6), color=["salmon", "lightblue"])

# Add some attributes to it
plt.title("Heart Disease Frequency for Sex")
plt.xlabel("0 = No Disease, 1 = Disease")
plt.ylabel("Amount")
plt.legend(["Female", "Male"])
plt.xticks(rotation=0); # keep the x-axis tick labels horizontal

Age vs Max Heart rate for Heart Disease

# Create another figure
plt.figure(figsize=(10,6))

# Start with positive examples
plt.scatter(df.age[df.target==1], 
            df.thalach[df.target==1], 
            c="salmon") # patients with heart disease

# Now for negative examples, we want them on the same plot, so we call plt again
plt.scatter(df.age[df.target==0], 
            df.thalach[df.target==0], 
            c="lightblue") # arguments always come as (x, y)

# Add some helpful info
plt.title("Heart Disease as a Function of Age and Max Heart Rate")
plt.xlabel("Age")
plt.legend(["Disease", "No Disease"])
plt.ylabel("Max Heart Rate");

What can we infer from this?

It seems the younger someone is, the higher their max heart rate (dots are higher on the left of the graph), and the older someone is, the more dots there are overall. But this may simply reflect the sample: there are more older participants, so more dots sit on the right side of the graph.

Both of these are only observations, of course, but this is what we’re trying to do: build an understanding of the data.

Correlation between independent variables

# Find the correlation between our independent variables
corr_matrix = df.corr()
corr_matrix

# Let's make it look a little prettier
plt.figure(figsize=(15, 10))
sns.heatmap(corr_matrix, 
            annot=True, 
            linewidths=0.5, 
            fmt= ".2f", 
            cmap="YlGnBu");

Modelling 👌

We’ve explored the data, now we’ll try to use machine learning to predict our target variable based on the 13 independent variables.

# Everything except target variable
X = df.drop("target", axis=1)

# Target variable
y = df.target.values

# Independent variables (no target column)
X.head()

Training and test split

Now comes one of the most important concepts in machine learning, the training/test split.

This is where you’ll split your data into a training set and a test set.

Why not use all the data to train a model?

Let’s say you wanted to take your model into the hospital and start using it on patients. How would you know how well your model performs on a new patient who was not included in your original dataset?

This is where the test set comes in. It’s used to mimic taking your model to a real environment as much as possible.

This is why it’s important to never let your model learn from the test set; it should only be evaluated on it.
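To make the idea of holding data out concrete, here is a minimal sketch of a shuffle-and-split in plain Python. The function name simple_train_test_split and the toy patient rows are hypothetical and only for illustration; this shows the concept, not Scikit-Learn's actual implementation.

```python
import random

def simple_train_test_split(rows, labels, test_size=0.2, seed=42):
    """Shuffle the sample indices, then hold out a fraction of them as a test set."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)            # reproducible shuffle
    n_test = max(1, round(len(rows) * test_size))   # keep at least one test sample
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [rows[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Hypothetical toy data: 10 patients as [age, max heart rate] with a 0/1 label
rows = [[40 + i, 180 - 2 * i] for i in range(10)]
labels = [i % 2 for i in range(10)]

X_train, X_test, y_train, y_test = simple_train_test_split(rows, labels)
print(len(X_train), len(X_test))  # 8 2
```

The model would only ever be fitted on X_train and y_train; X_test and y_test stay untouched until evaluation.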

To split our data into a training and test set, we can use Scikit-Learn’s train_test_split() and feed it our independent and dependent variables (X & y).

# Random seed for reproducibility
np.random.seed(42)

# Split into train & test set
X_train, X_test, y_train, y_test = train_test_split(X, # independent variables 
                                                    y, # dependent variable
                                                    test_size = 0.2) # fraction of data to use for test set

Model choices

Now we’ve got our data prepared, we can start to fit models. We’ll be using the following and comparing their results.

  1. Logistic Regression – LogisticRegression()
  2. K-Nearest Neighbors – KNeighborsClassifier()
  3. RandomForest – RandomForestClassifier()

# Put models in a dictionary
models = {"KNN": KNeighborsClassifier(),
          "Logistic Regression": LogisticRegression(), 
          "Random Forest": RandomForestClassifier()}

# Create function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data
    X_test : testing data
    y_train : labels associated with training data
    y_test : labels associated with test data
    """
    # Random seed for reproducible results
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
        # Evaluate the model and store its score in model_scores
        model_scores[name] = model.score(X_test, y_test)
    return model_scores

model_scores = fit_and_score(models=models,
                             X_train=X_train,
                             X_test=X_test,
                             y_train=y_train,
                             y_test=y_test)
model_scores

Model Comparison

model_compare = pd.DataFrame(model_scores, index=['accuracy'])
model_compare.T.plot.bar();

Hyperparameter tuning and cross-validation

Let’s briefly go through some key terms before we see them in action.

  • Hyperparameter tuning – Each model you use has a series of dials you can turn to dictate how they perform. Changing these values may increase or decrease model performance.
  • Feature importance – If there are a large amount of features we’re using to make predictions, do some have more importance than others? For example, for predicting heart disease, which is more important, sex or age?
  • Confusion matrix – Compares the predicted values with the true values in a tabular way. If the model is 100% correct, all values in the matrix will lie on the diagonal from top left to bottom right.
  • Cross-validation – Splits your dataset into multiple parts, trains and tests your model on each part, and evaluates performance as an average.
  • Precision – Proportion of true positives over the total number of predicted positives (true positives plus false positives). Higher precision leads to fewer false positives.
  • Recall – Proportion of true positives over total number of true positives and false negatives. Higher recall leads to fewer false negatives.
  • F1 score – Combines precision and recall into one metric. 1 is best, 0 is worst.
  • Classification report – Sklearn has a built-in function called classification_report() which returns some of the main classification metrics such as precision, recall and f1-score.
  • ROC Curve – The Receiver Operating Characteristic (ROC) curve is a plot of true positive rate versus false positive rate.
  • Area Under Curve (AUC) – The area underneath the ROC curve. A perfect model achieves a score of 1.0.
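As a worked example of precision, recall and F1, the sketch below computes them from raw counts using the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as their harmonic mean. The helper name binary_metrics and the label lists are made up for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for 0/1 labels, treating 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical labels: 4 patients with heart disease (1), 4 without (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

precision, recall, f1 = binary_metrics(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Here one sick patient is missed (a false negative) and one healthy patient is flagged (a false positive), so both precision and recall come out at 3/4.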

Tuning models with RandomizedSearchCV

Let’s create a hyperparameter grid (a dictionary of different hyperparameters) for each model and then test them out.

# Different LogisticRegression hyperparameters
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}

# Different RandomForestClassifier hyperparameters
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

Now let’s use RandomizedSearchCV to try and tune our LogisticRegression model.

We’ll pass it the different hyperparameters from log_reg_grid and set n_iter = 20. This means RandomizedSearchCV will try 20 different combinations of hyperparameters from log_reg_grid and save the best one.


# Setup random seed
np.random.seed(42)

# Setup random hyperparameter search for LogisticRegression
rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                                param_distributions=log_reg_grid,
                                cv=5,
                                n_iter=20,
                                verbose=True)

# Fit random hyperparameter search model
rs_log_reg.fit(X_train, y_train);

rs_log_reg.best_params_
rs_log_reg.score(X_test, y_test)

Now we’ve tuned LogisticRegression using RandomizedSearchCV, we’ll do the same for RandomForestClassifier.

# Setup random seed
np.random.seed(42)

# Setup random hyperparameter search for RandomForestClassifier
rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                           param_distributions=rf_grid,
                           cv=5,
                           n_iter=20,
                           verbose=True)

# Fit random hyperparameter search model
rs_rf.fit(X_train, y_train);
# Find the best parameters
rs_rf.best_params_
# Evaluate the randomized search random forest model
rs_rf.score(X_test, y_test)

But since LogisticRegression is pulling out in front, we’ll try tuning it further with GridSearchCV.

Tuning a model with GridSearchCV

The difference between RandomizedSearchCV and GridSearchCV is that RandomizedSearchCV samples n_iter combinations from a grid of hyperparameters, whereas GridSearchCV tests every single possible combination.

In short:

  • RandomizedSearchCV – tries n_iter combinations of hyperparameters and saves the best.
  • GridSearchCV – tries every single combination of hyperparameters and saves the best.

# Different LogisticRegression hyperparameters
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}

# Setup grid hyperparameter search for LogisticRegression
gs_log_reg = GridSearchCV(LogisticRegression(),
                          param_grid=log_reg_grid,
                          cv=5,
                          verbose=True)

# Fit grid hyperparameter search model
gs_log_reg.fit(X_train, y_train);
# Check the best parameters
gs_log_reg.best_params_
# Evaluate the model
gs_log_reg.score(X_test, y_test)
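To see why the choice between the two searches matters, the sketch below counts the combinations in a grid shaped like rf_grid from earlier, rewritten with plain Python ranges, and then samples n_iter of them at random. It only illustrates the idea of exhaustive versus sampled search and is not how Scikit-Learn implements either class; the variable names are chosen for this example.

```python
import itertools
import random

# Same values as rf_grid, written with built-in ranges instead of np.arange
grid = {
    "n_estimators": list(range(10, 1000, 50)),   # 20 values
    "max_depth": [None, 3, 5, 10],               # 4 values
    "min_samples_split": list(range(2, 20, 2)),  # 9 values
    "min_samples_leaf": list(range(1, 20, 2)),   # 10 values
}

# Grid search idea: enumerate every combination of values
keys = list(grid)
all_combinations = [dict(zip(keys, values))
                    for values in itertools.product(*(grid[k] for k in keys))]

# Randomized search idea: evaluate only n_iter randomly chosen combinations
n_iter = 20
sampled = random.Random(42).sample(all_combinations, n_iter)

cv = 5  # each combination is fitted once per cross-validation fold
print(len(all_combinations), len(all_combinations) * cv)  # 7200 36000
print(len(sampled), len(sampled) * cv)                    # 20 100
```

An exhaustive search over this grid would need 36,000 model fits with 5-fold cross-validation, against 100 for the randomized search. The LogisticRegression grid has only 20 combinations, which is why an exhaustive search is affordable there.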

Evaluating a classification model, beyond accuracy

Now we’ve got a tuned model, let’s get some of the metrics we discussed before.

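We want:

  • ROC curve and AUC score
  • Confusion matrix
  • Classification report
  • Precision, recall and F1 score

To get these, we first make predictions on the test data with our tuned model.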

# Make predictions on test data
y_preds = gs_log_reg.predict(X_test)
y_preds

ROC Curve and AUC Scores

What’s a ROC curve?

It’s a way of understanding how your model is performing by comparing the true positive rate to the false positive rate.

In our case…

To get an appropriate example in a real-world problem, consider a diagnostic test that seeks to determine whether a person has a certain disease. A false positive in this case occurs when the person tests positive, but does not actually have the disease. A false negative, on the other hand, occurs when the person tests negative, suggesting they are healthy, when they actually do have the disease.
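The true positive rate and false positive rate depend on the threshold used to turn a predicted score into a yes/no answer. The following sketch, with hypothetical function names and made-up scores, sweeps that threshold to produce the ROC points and then computes the area under them with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs for each threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve through the given points, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up example: true labels and the model's predicted probability of disease
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(points))  # 0.75
```

A model that ranks every sick patient above every healthy one would reach the point (0.0, 1.0) and an AUC of 1.0.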

# Import ROC curve function from metrics module
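# Import the ROC curve display class from the metrics module
# (plot_roc_curve was removed in scikit-learn 1.2)
from sklearn.metrics import RocCurveDisplay

# Plot ROC curve and calculate AUC metric
RocCurveDisplay.from_estimator(gs_log_reg, X_test, y_test);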

Confusion matrix

A confusion matrix is a visual way to show where your model made the right predictions and where it made the wrong predictions (or in other words, got confused).
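As a small illustration of what goes into each cell, the sketch below builds a 2x2 confusion matrix by hand, following the same layout as Scikit-Learn's confusion_matrix(): rows are true labels and columns are predicted labels. The helper name confusion_counts and the labels are made up for this example.

```python
def confusion_counts(y_true, y_pred):
    """2x2 matrix for 0/1 labels: matrix[true_label][predicted_label]."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Hypothetical labels: 4 patients with heart disease (1), 4 without (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

matrix = confusion_counts(y_true, y_pred)
print(matrix)  # [[2, 2], [1, 3]]
```

Reading the result: 2 true negatives and 2 false positives in the top row, 1 false negative and 3 true positives in the bottom row. Correct predictions sit on the diagonal.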

# Display confusion matrix
print(confusion_matrix(y_test, y_preds))
# Import Seaborn
import seaborn as sns
sns.set(font_scale=1.5) # Increase font size

def plot_conf_mat(y_test, y_preds):
    """
    Plots a confusion matrix using Seaborn's heatmap().
    """
    fig, ax = plt.subplots(figsize=(3, 3))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                     annot=True, # Annotate the boxes
                     cbar=False)
    plt.xlabel("predicted label") # heatmap columns are the predicted labels
    plt.ylabel("true label") # heatmap rows are the true labels
    
plot_conf_mat(y_test, y_preds)

Classification report

We can make a classification report using classification_report(), passing it the true labels as well as our model’s predicted labels.

# Show classification report
print(classification_report(y_test, y_preds))

# Check best hyperparameters
gs_log_reg.best_params_
# Import cross_val_score
from sklearn.model_selection import cross_val_score

# Instantiate best model with best hyperparameters (found with GridSearchCV)
clf = LogisticRegression(C=0.23357214690901212,
                         solver="liblinear")

# Cross-validated accuracy score
cv_acc = cross_val_score(clf,
                         X,
                         y,
                         cv=5, # 5-fold cross-validation
                         scoring="accuracy") # accuracy as scoring
cv_acc

Since there are 5 scores here (one per fold), we’ll take the average.

cv_acc = np.mean(cv_acc)
cv_acc
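To show where those five scores come from, here is a minimal sketch of how 5-fold cross-validation partitions the samples: each fold takes a turn as the test set while the rest is used for training. The function k_fold_indices and the fold scores are hypothetical; note also that for classifiers Scikit-Learn uses stratified folds by default, which this plain version does not attempt.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop

folds = list(k_fold_indices(12, k=5))
fold_test_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_test_sizes)  # [3, 3, 2, 2, 2]

# Made-up accuracy for each fold, averaged the same way as cv_acc above
fold_scores = [0.80, 0.90, 0.85, 0.80, 0.75]
mean_score = sum(fold_scores) / len(fold_scores)
print(round(mean_score, 2))  # 0.82
```

Every sample lands in exactly one test fold, so the averaged score reflects performance across the whole dataset rather than a single lucky or unlucky split.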

Now we’ll do the same for other classification metrics.

# Cross-validated precision score
cv_precision = np.mean(cross_val_score(clf,
                                       X,
                                       y,
                                       cv=5, # 5-fold cross-validation
                                       scoring="precision")) # precision as scoring
cv_precision
# Cross-validated recall score
cv_recall = np.mean(cross_val_score(clf,
                                    X,
                                    y,
                                    cv=5, # 5-fold cross-validation
                                    scoring="recall")) # recall as scoring
cv_recall
# Cross-validated F1 score
cv_f1 = np.mean(cross_val_score(clf,
                                X,
                                y,
                                cv=5, # 5-fold cross-validation
                                scoring="f1")) # f1 as scoring
cv_f1

Let’s visualize them.

# Visualizing cross-validated metrics
cv_metrics = pd.DataFrame({"Accuracy": cv_acc,
                            "Precision": cv_precision,
                            "Recall": cv_recall,
                            "F1": cv_f1},
                          index=[0])
cv_metrics.T.plot.bar(title="Cross-Validated Metrics", legend=False);

Great! This looks like something we could share. An extension might be adding the metric values on top of each bar so someone can quickly tell what they were.

Experimentation

Well, we’ve completed all the metrics your boss requested. You should be able to put together a great report containing a confusion matrix, a handful of cross-validated metrics such as precision, recall and F1, as well as which features contribute most to the model making a decision.

But after all this you might be wondering where step 6 in the framework is, experimentation.

Well the secret here is, as you might’ve guessed, the whole thing is experimentation.

From trying different models, to tuning different models to figuring out which hyperparameters were best.

What we’ve worked through so far has been a series of experiments.

And the truth is, we could keep going. But of course, things can’t go on forever.

So by this stage, after trying a few different things, we’d ask ourselves did we meet the evaluation metric?

Remember we defined one in step 3.

If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we’ll pursue this project.

In this case, we didn’t. The highest accuracy our model achieved was below 90%.