Predicting Heart Disease Using Machine Learning
This project showcases an end-to-end example of using machine learning for healthcare. Focusing on heart disease classification, it combines data science techniques and machine learning models to predict heart disease from clinical data. The project serves as a practical demonstration of applying technology in medical diagnosis and preventive care.
Preparing the Tools
In our quest to predict heart disease using machine learning, selecting the right tools and libraries is crucial for effective analysis and model building. Our toolset has been carefully chosen to cover every aspect of the project, from data handling to model evaluation and validation.
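The snippet below is a representative setup rather than a definitive list of the original notebook's imports; it covers the libraries this walkthrough relies on, and the later snippets assume these imports are in scope.

```python
# Data analysis and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Models from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model selection: splitting, cross-validation, hyperparameter search
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# Evaluation metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import RocCurveDisplay
```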
Load Data
We use a tool called Pandas, a staple in data science, to read a file containing heart disease data. The original data comes from the Cleveland database in the UCI Machine Learning Repository.
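A minimal sketch of the loading step, assuming the data is saved locally as "heart-disease.csv" (the actual filename may differ):

```python
df = pd.read_csv("heart-disease.csv")  # load the CSV into a pandas DataFrame
```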
Once loaded, we first check the size of our dataset by examining the number of rows and columns it contains. This step is like getting a bird's-eye view of the data's dimensionality. Each row in the dataset represents a unique patient entry, and each column represents specific attributes or 'features' related to heart disease.
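Checking the dimensions is a one-liner; the values in the comment are inferred from the patient counts quoted later in this walkthrough:

```python
df.shape  # (number of patients, number of columns)
# e.g. (303, 14): 13 feature columns plus the target column
```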
Exploratory Data Analysis
In this phase, we dive into the dataset to explore its contents. A key step is to understand the distribution of outcomes within our dataset.
Examining the Distribution of Our Target Variable
In our dataset, the presence of heart disease in a patient is denoted as '1', and the absence as '0'. This categorization turns our target variable into what's known as a binary categorical variable – essentially a way to categorize data into two distinct groups.
​
To gain insight into how these categories are distributed, we perform a 'value counts' operation on the target column. This process is akin to taking a headcount in a room to see how many people fall into two different categories. Here, we are counting how many patients are classified as having heart disease ('1') and how many are classified as not having it ('0').
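A sketch of the count; the totals in the comments follow from the per-gender case counts discussed later in this walkthrough:

```python
df["target"].value_counts()
# 1    165    <- heart disease present
# 0    138    <- heart disease absent
```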
Since these two values are close to even, our target column can be considered balanced. An unbalanced target column, meaning some classes have far more samples, can be harder to model than a balanced set. Ideally, all of our target classes have the same number of samples.
​
We can plot the target column value counts by calling the plot() function and telling it what kind of plot we'd like; in this case, a bar plot works well.
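For example (the salmon/lightblue palette is an assumption, matching the colors used elsewhere in this project):

```python
df["target"].value_counts().plot(kind="bar", color=["salmon", "lightblue"]);
```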
After examining the distribution of our target variable, the next step in our data exploration journey involves a deeper dive into the structure of our dataset. For this, we employ a method called df.info().
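The call itself is a one-liner:

```python
df.info()  # per-column non-null counts and data types
```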
df.info() gives a quick insight into the number of missing values and the type of data you're working with. In our case, there are no missing values and all of our columns are numerical in nature.
​
Another way to get some quick insights into your DataFrame is to use df.describe(). describe() shows a range of different metrics for your numerical columns, such as the mean, max, and standard deviation.
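```python
df.describe()  # count, mean, std, min, quartiles, and max for each numerical column
```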
Heart Disease Frequency according to Gender
Our next step in the exploratory data analysis delves into understanding how heart disease frequency varies with gender. This analysis is crucial as it can reveal significant insights into gender-specific trends and risks associated with heart disease.
​
Step 1: Counting Gender Representation
We start by examining the gender distribution in our dataset using df.sex.value_counts(). This step helps us understand the proportion of male and female participants in our study. Knowing this distribution is essential to assess whether our dataset is balanced in terms of gender representation.
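A sketch of the count; the totals in the comments follow from the per-gender case counts discussed below:

```python
df.sex.value_counts()
# 1    207    <- male
# 0     96    <- female
```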
Step 2: Cross-Tabulating Heart Disease and Gender
Next, we deepen our analysis by creating a cross-tabulation between heart disease presence (our target variable) and gender. Cross-tabulation allows us to see the relationship between gender and the occurrence of heart disease, presenting us with a clear picture of how these two variables interact.
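A sketch of the cross-tabulation; the cell values shown match the figures discussed later in the feature importance section:

```python
pd.crosstab(df.target, df.sex)
# sex       0    1
# target
# 0        24   93
# 1        72  114
```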
Since there are about 100 women in the dataset and 72 of them have a positive value for heart disease, we might infer, based on this one variable alone, that if the participant is a woman, there's roughly a 75% chance she has heart disease.
​
As for males, there are about 200 in total, with around half indicating the presence of heart disease. So we might predict that if the participant is male, 50% of the time he will have heart disease. Averaging these two values, we might assume, with no other parameters to go on, that any given person has a 62.5% chance of having heart disease.
​
Step 3: Visual Representation through Bar Plots
To make our findings more accessible and understandable, we visualize this relationship using a bar plot that clearly illustrates the frequency of heart disease in different genders.
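A possible version of that plot; the Female/Male legend labels assume the dataset's encoding of 0 = female and 1 = male:

```python
pd.crosstab(df.target, df.sex).plot(kind="bar",
                                    figsize=(10, 6),
                                    color=["salmon", "lightblue"])
plt.title("Heart Disease Frequency for Sex")
plt.xlabel("0 = No Disease, 1 = Disease")
plt.ylabel("Count")
plt.legend(["Female", "Male"])
plt.xticks(rotation=0);  # keep the x-axis labels horizontal
```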
Visualizing the Relationship Between Age and Heart Rate in Heart Disease
Here, we're interested in two critical factors: age and maximum heart rate (thalach), and how they correlate with the occurrence of heart disease.
​
We start by creating a scatter plot, which allows us to observe individual data points. Our plot displays two sets of data:
​
Positive Examples: These are individuals with heart disease, shown in salmon color. We plot their age against their maximum heart rate to observe how these two variables interact for patients with heart disease.
​
Negative Examples: Similarly, we plot the same variables for individuals without heart disease, using light blue for distinction. This dual-plot approach on the same graph allows for direct comparison between the two groups.
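A sketch of the dual scatter plot described above, assuming the age and maximum heart rate columns are named age and thalach:

```python
plt.figure(figsize=(10, 6))

# Positive examples: patients with heart disease
plt.scatter(df.age[df.target == 1],
            df.thalach[df.target == 1],
            c="salmon")

# Negative examples: patients without heart disease
plt.scatter(df.age[df.target == 0],
            df.thalach[df.target == 0],
            c="lightblue")

plt.title("Heart Disease as a Function of Age and Max Heart Rate")
plt.xlabel("Age")
plt.ylabel("Max Heart Rate")
plt.legend(["Disease", "No Disease"]);
```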
Inference: Our exploratory analysis suggests that younger individuals tend to have higher maximum heart rates, as indicated by the positioning of data points towards the upper part of the graph on the left, representing younger ages. Conversely, among older participants, there is a greater prevalence of heart disease, as shown by the denser clustering of positive heart disease cases. However, it's important to note that the apparent increase in heart disease incidence with age may partly result from a higher overall number of older participants in the study.
​
Following the scatter plots, we use a histogram to analyze the age distribution of our entire dataset. A histogram is useful for seeing the frequency distribution of a single variable—in this case, age. It shows us how many participants fall into each age bracket, which can be indicative of the age groups we are dealing with.
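```python
df.age.plot.hist();  # frequency distribution of participant ages
```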
The data displays a generally normal distribution with a slight rightward skew, mirroring the trends observed in the scatter plot above.
Heart Disease Frequency per Chest Pain Type
We delve into the relationship between different types of chest pain and their association with heart disease.
​
Cross-Tabulation for Insight
By using a cross-tabulation, we can succinctly summarize the frequency of heart disease occurrences for each chest pain category in our dataset.
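Assuming the chest pain column is named cp, as in the standard UCI encoding:

```python
pd.crosstab(df.cp, df.target)  # rows: chest pain type, columns: disease absent/present
```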
Visualization Through Bar Charts
To visualize this data, we generate a bar chart that displays the frequency of heart disease for each type of chest pain. The use of contrasting colors, light blue for 'No Disease' and salmon for 'Disease,' allows for a clear visual distinction between the groups.
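A sketch of that chart; the color list maps lightblue to target 0 and salmon to target 1, as described above:

```python
pd.crosstab(df.cp, df.target).plot(kind="bar",
                                   figsize=(10, 6),
                                   color=["lightblue", "salmon"])
plt.title("Heart Disease Frequency per Chest Pain Type")
plt.xlabel("Chest Pain Type")
plt.ylabel("Count")
plt.legend(["No Disease", "Disease"])
plt.xticks(rotation=0);
```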
Correlation between independent variables
We now shift our focus to the relationships between all the independent variables in our dataset. This step is crucial for understanding how these variables might interact with each other and influence the outcome of heart disease. To explore these relationships, we calculate a correlation matrix.
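One common way to compute and visualize the matrix; rendering it as a seaborn heatmap is an assumption, and any annotated matrix plot works just as well:

```python
import seaborn as sns

corr_matrix = df.corr()  # pairwise correlation of all columns

fig, ax = plt.subplots(figsize=(15, 10))
ax = sns.heatmap(corr_matrix,
                 annot=True,      # write the correlation value in each cell
                 linewidths=0.5,
                 fmt=".2f",
                 cmap="YlGnBu");
```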
Modeling
​
Transitioning from our insightful data exploration, we now embark on the next pivotal phase of our project: Modeling. This is where we apply machine learning techniques to predict the likelihood of heart disease based on the patterns we've identified in the data.
​
Preparing Data
Before we can train our models, we need to prepare our dataset.
​
Feature Set (X): We create a feature set by dropping the 'target' column from our DataFrame. This set includes the 13 independent variables—such as age, sex, chest pain type, and maximum heart rate—that we suspect could influence the risk of heart disease.
​
Target Variable (y): Our target variable is what we aim to predict: the presence of heart disease. We extract this into a separate variable, making sure that our models will have clear guidance on what outcome they need to learn to predict.
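In code:

```python
X = df.drop("target", axis=1)  # the 13 independent variables
y = df["target"]               # labels: 1 = disease, 0 = no disease
```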
Testing and Training Sets
We now move on to a critical step in machine learning: dividing our dataset into training and testing subsets. This is necessary for both developing our predictive models and subsequently assessing their performance.
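A sketch of the split; the seed value 42 is an assumption, as any fixed seed gives reproducibility:

```python
np.random.seed(42)  # fix the random seed so the split is reproducible

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2)  # 80/20 split
```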
We begin by setting a random seed. It ensures that our results are reproducible, which is vital for verifying experiments and for collaborative work where consistency is key.
​
Training Set: This subset includes 80% of our data. The training set is what our machine learning models will learn from. It's the dataset that we expose our algorithms to so they can discover patterns and 'train' themselves to predict heart disease.
​
Testing Set: The remaining 20% of our data forms the testing set. It's kept separate from the training process. It is used to evaluate how well our models can apply what they've learned to new, unseen data.
Model Choices
Our prepared data is now ready to meet the machine learning models that we will train to predict heart disease. Our selection includes three well-regarded algorithms, each with its own strengths and approach to learning from data.
​
Selection of Machine Learning Models:
a) Logistic Regression
b) K-Nearest Neighbors (KNN)
c) Random Forest
​
We place our chosen models into a dictionary, streamlining the process of iterating over them. This allows us to efficiently apply the same operations to each model without writing repetitive code. To assess our models, we implement a fit_and_score function, shown below, which encapsulates the training and evaluation steps in a loop that iterates over each model in our dictionary, storing each model's score based on its accuracy in predicting the test data.
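A sketch of the dictionary and helper function; max_iter=1000 is a small addition to avoid a convergence warning from Logistic Regression on this data:

```python
models = {"Logistic Regression": LogisticRegression(max_iter=1000),
          "KNN": KNeighborsClassifier(),
          "Random Forest": RandomForestClassifier()}

def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fits each model and returns a dict of test-set accuracy scores."""
    np.random.seed(42)
    model_scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)                       # train
        model_scores[name] = model.score(X_test, y_test)  # evaluate on unseen data
    return model_scores

model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)
model_scores
```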
Model Comparison
To make our comparison more intuitive and visually accessible, we plot the accuracy scores in a bar chart.
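One way to do this, turning the scores dictionary into a DataFrame first:

```python
model_compare = pd.DataFrame(model_scores, index=["accuracy"])
model_compare.T.plot.bar();  # one bar per model
```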
Based on the graph and dictionary data, it's evident that the Logistic Regression model outperforms the others, showing the highest accuracy.
Hyperparameter Tuning
In this section of the project, we focus on hyperparameter tuning combined with cross-validation. This process is essential to optimize each model's performance by finding its ideal configuration.
Hyperparameter Tuning of KNN
A critical hyperparameter in KNN is n_neighbors, which defines the number of neighbors to consider for the classification. We experiment with a range of values from 1 to 20. This range allows us to see how the model performance varies with different levels of neighbor consideration.
​
Within a loop, we configure the KNN model with each n_neighbors value, fit it to our training data, and evaluate it on both the training and test datasets. The corresponding scores are then appended to our score lists.
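A sketch of the tuning loop:

```python
train_scores = []
test_scores = []

neighbors = range(1, 21)  # candidate values for n_neighbors
knn = KNeighborsClassifier()

for i in neighbors:
    knn.set_params(n_neighbors=i)   # reconfigure the model
    knn.fit(X_train, y_train)       # refit on the training data
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))
```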
To understand the impact of varying n_neighbors, we plot the training and test scores against the number of neighbors.
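For example:

```python
plt.plot(neighbors, train_scores, label="Train score")
plt.plot(neighbors, test_scores, label="Test score")
plt.xticks(np.arange(1, 21, 1))
plt.xlabel("Number of neighbors")
plt.ylabel("Model score")
plt.legend()

print(f"Maximum KNN score on the test data: {max(test_scores) * 100:.2f}%")
# In the run described below, this peaks at 75.41% with n_neighbors = 11
```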
Upon reviewing the graph, it appears that setting n_neighbors to 11 offers the optimal performance for our K-Nearest Neighbors (KNN) model. However, despite this tuning, the KNN model still falls short in performance (75.41%) compared to the Logistic Regression and Random Forest Classifier models. As a result, we have decided to set aside the KNN model and concentrate our efforts on further optimizing the other two.
​
We manually tuned the KNN model, but for Logistic Regression and Random Forest Classifier, we plan to utilize RandomizedSearchCV for hyperparameter tuning. This method automates the process of testing various hyperparameter combinations. It systematically evaluates these combinations and identifies the most effective one, thereby streamlining the optimization process and potentially enhancing model performance.
Tuning the Logistic Regression Model
In this section, we're utilizing RandomizedSearchCV to fine-tune our Logistic Regression. This process involves testing a range of hyperparameters to determine the most effective combination for each model.
​
We first define a grid of hyperparameters for Logistic Regression. This grid includes a range of values for 'C' (inverse of regularization strength) on a logarithmic scale and the 'solver' type.
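A sketch of the grid; twenty C values on a log scale with a single solver gives exactly 20 combinations, which matches the note in the GridSearchCV section later on:

```python
log_reg_grid = {"C": np.logspace(-4, 4, 20),   # 20 values from 1e-4 to 1e4
                "solver": ["liblinear"]}
```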
Using RandomizedSearchCV, we set up a randomized search for the best hyperparameters. We specify the Logistic Regression model, our defined hyperparameter grid, the number of cross-validation folds (cv=5), the number of different combinations to try (n_iter=20), and make the process verbose for more detailed output.
​
We then fit this search model to our training data. During this process, RandomizedSearchCV randomly selects combinations of hyperparameters from our grid, trains the Logistic Regression model using these parameters, and evaluates its performance across the five cross-validation folds.
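A sketch of the search setup and fit:

```python
np.random.seed(42)

rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                                param_distributions=log_reg_grid,
                                cv=5,        # 5-fold cross-validation
                                n_iter=20,   # number of combinations to sample
                                verbose=True)

rs_log_reg.fit(X_train, y_train)
```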
Once the fitting process is complete, we extract the best hyperparameters using the best_params_ attribute and assess the model's performance on the test data.
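```python
rs_log_reg.best_params_           # best hyperparameter combination found
rs_log_reg.score(X_test, y_test)  # accuracy on the held-out test set
```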
Tuning the Random Forest Classifier
The process is similar for the Random Forest Classifier. We define a different set of hyperparameters suitable for Random Forest, including the number of trees in the forest (n_estimators), the maximum depth of the trees (max_depth), and settings for splitting the nodes (min_samples_split and min_samples_leaf).
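The exact value ranges below are assumptions; any sensible spread over these four hyperparameters follows the same pattern:

```python
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}
```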
We then use RandomizedSearchCV in the same manner as for Logistic Regression to find the best combination of these parameters.
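```python
np.random.seed(42)

rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                           param_distributions=rf_grid,
                           cv=5,
                           n_iter=20,
                           verbose=True)

rs_rf.fit(X_train, y_train)
```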
Once the fitting process is complete, we extract the best hyperparameters using the best_params_ attribute and assess the model's performance on the test data.
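```python
rs_rf.best_params_
rs_rf.score(X_test, y_test)
```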
Adjusting the hyperparameters for both the RandomForestClassifier and LogisticRegression models resulted in a modest improvement in their performance. Given that LogisticRegression shows the leading performance, we plan to refine it further using GridSearchCV.
Tuning a Logistic Regression model with GridSearchCV
Here, we methodically test different combinations of hyperparameters to identify the most effective settings for our model.
​
We start by defining a grid of hyperparameters for Logistic Regression. This grid includes a range of values for the regularization strength ('C') and specifies the solver type as 'liblinear'. We then set up GridSearchCV with our Logistic Regression model and the defined hyperparameters. We specify 5-fold cross-validation (cv=5) and enable verbose output for more detailed progress information.
​
GridSearchCV is applied to fit the Logistic Regression model to the training data. Unlike RandomizedSearchCV, GridSearchCV tries all possible combinations of hyperparameters in the provided grid.
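A sketch of the exhaustive search, reusing the same 20-combination grid:

```python
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}

gs_log_reg = GridSearchCV(LogisticRegression(),
                          param_grid=log_reg_grid,
                          cv=5,
                          verbose=True)

gs_log_reg.fit(X_train, y_train)
```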
After the fitting process, we retrieve the best-performing hyperparameters and evaluate the performance of the model, now tuned with the optimal hyperparameters, on the test dataset. The score returned is an indicator of how well the Logistic Regression model, with its newly tuned settings, can predict heart disease.
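```python
gs_log_reg.best_params_
gs_log_reg.score(X_test, y_test)
```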
In this scenario, the outcome remains unchanged from our previous attempts: the grid contains only 20 distinct hyperparameter combinations, so the exhaustive search covers the same candidates the randomized search already explored.
Evaluating Model Predictions Beyond Accuracy
In this phase, we're moving beyond merely assessing the accuracy of our classification model to explore a broader range of evaluation metrics. To do this, we first need to generate predictions using our model.
​
We call the predict() function on gs_log_reg and pass it the test data. This function uses the model to make predictions based on the features of the test data.
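```python
y_preds = gs_log_reg.predict(X_test)  # predicted labels for the test set
```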
ROC Curve and AUC Scores
In this part of the model evaluation, we are focusing on the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) scores, which are crucial for assessing the performance of our classification model in predicting heart disease.
​
We start by importing RocCurveDisplay from sklearn.metrics. We utilize our previously optimized Logistic Regression model, which has been fine-tuned using GridSearchCV. We call RocCurveDisplay.from_estimator() and pass our trained model along with the test data and true labels.
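In code (RocCurveDisplay.from_estimator requires scikit-learn 1.0 or newer):

```python
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(gs_log_reg, X_test, y_test);
```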
Inference: The Area Under the Curve (AUC) score is 0.92, which is quite high. This means the model has a good measure of separability and is able to distinguish between positive and negative classes effectively.
​
Confusion Matrix
In this section, we are dealing with the visualization of the model's predictions using a confusion matrix, which is a tool to assess the performance of a classification algorithm. The confusion matrix itself shows where the model's predictions were correct (the true positives and true negatives) and where errors were made (the false positives and false negatives).
​
We define a function that takes in the true labels and the predicted labels as inputs. Inside the function, we compute the confusion matrix, which is a tabular way of visualizing the performance of the prediction model.
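A sketch of such a function; rendering the matrix with Seaborn's heatmap is an assumption, and the same counts could be drawn with plain matplotlib:

```python
import seaborn as sns
sns.set(font_scale=1.5)

def plot_conf_mat(y_test, y_preds):
    """Plots a labelled confusion matrix of true vs. predicted labels."""
    fig, ax = plt.subplots(figsize=(3, 3))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                     annot=True,   # write the count in each cell
                     cbar=False)
    plt.xlabel("Predicted label")
    plt.ylabel("True label")

plot_conf_mat(y_test, y_preds)
```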
The model displays a comparable level of misclassification for both classes, as indicated by the confusion matrix. Specifically, it incorrectly predicted 'no disease' (0) four times when it should have predicted 'disease' (1), which are false negatives. Similarly, there were three instances where it predicted 'disease' (1) in place of 'no disease' (0), which are false positives.
Classification Report
The classification report is an essential tool for evaluating a classification model across all classes: it provides a breakdown of the precision, recall, and F1-score for each class.
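```python
print(classification_report(y_test, y_preds))
```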
To solidify our understanding of the model's capabilities, we plan to implement a more robust evaluation using cross-validation. We'll utilize the cross_val_score function, incorporating our top-performing model with its optimal hyperparameters. This approach gives a more reliable estimate of performance by evaluating the model across multiple train-test splits.
​
A Logistic Regression classifier (clf) is instantiated with the best hyperparameters that were previously identified through GridSearchCV.
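A sketch, pulling the tuned values straight from the finished grid search rather than hard-coding them:

```python
clf = LogisticRegression(C=gs_log_reg.best_params_["C"],
                         solver=gs_log_reg.best_params_["solver"])
```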
Accuracy Assessment
The cross_val_score function is called to calculate the accuracy of the model. The entire dataset is used, and the data is split into 5 parts (5-fold cross-validation), with each part taking turns being the test set.
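```python
cv_acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
cv_acc = np.mean(cv_acc)  # average accuracy across the 5 folds
```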
Precision Calculation
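The same pattern, switching to the precision scorer:

```python
cv_precision = np.mean(cross_val_score(clf, X, y, cv=5, scoring="precision"))
```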
Recall Computation
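```python
cv_recall = np.mean(cross_val_score(clf, X, y, cv=5, scoring="recall"))
```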
F1 Score Derivation
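```python
cv_f1 = np.mean(cross_val_score(clf, X, y, cv=5, scoring="f1"))
```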
Having computed the cross-validated metrics, the next step is to create a visual representation of these results.
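One way to plot the four metrics side by side:

```python
cv_metrics = pd.DataFrame({"Accuracy": cv_acc,
                           "Precision": cv_precision,
                           "Recall": cv_recall,
                           "F1": cv_f1},
                          index=[0])
cv_metrics.T.plot.bar(title="Cross-validated metrics", legend=False);
```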
Feature Importance
Building upon our exploration of cross-validated metrics, the next phase in our analysis involves delving into the feature importance of our Logistic Regression model.
​
This step aims to uncover which specific patient characteristics are most influential in predicting heart disease, offering deeper insights into the factors that are key drivers in the model's decision-making process.
​
By examining the coef_ attribute after fitting our model, we'll be able to pinpoint the features that contribute most significantly to the model's predictions, thereby enhancing our understanding of the underlying patterns the model is leveraging to determine the likelihood of heart disease.
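```python
clf.fit(X_train, y_train)  # fit the tuned classifier
clf.coef_                  # one coefficient per feature
```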
The values in the coef_ array represent the extent to which each feature influences the model's decision-making process, particularly in determining whether a patient's health data suggests the presence of heart disease. To make them more meaningful and interpretable, we can align them with the corresponding feature names from our dataset. This alignment will provide a clearer understanding of how each specific health attribute contributes to the model's predictions regarding heart disease.
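A sketch of that alignment; note that zip stops at the shorter sequence, so the trailing 'target' column picks up no coefficient:

```python
feature_dict = dict(zip(df.columns, list(clf.coef_[0])))
feature_dict
```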
Having aligned the model's feature coefficients with their respective features, the next step is to create a visual representation of this information. This visualization will help in better understanding the impact of each feature on the model's predictions.
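```python
feature_df = pd.DataFrame(feature_dict, index=[0])
feature_df.T.plot.bar(title="Feature Importance", legend=False);
```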
Inference:
Observing the coefficients, you will see that they vary between negative and positive values. The magnitude of these values, represented by the length of the bars in the visualization, indicates the strength of each feature's contribution to the model's decision-making process.
​
a) Negative Correlation
Take, for instance, the 'sex' feature, which has a coefficient of -0.904. This suggests that as the value of the 'sex' feature increases, the probability of the target (presence of heart disease) decreases. This relationship can be further understood by examining the correlation between the 'sex' feature and the target variable in our data.
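The cell values in the comments match the counts discussed in the next paragraph:

```python
pd.crosstab(df.sex, df.target)
# target    0    1
# sex
# 0        24   72
# 1        93  114
```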
The analysis of the 'sex' feature in relation to heart disease reveals a distinct pattern in the data. When examining cases where the 'sex' value is 0 (representing females), there is a noticeably higher incidence of heart disease, with heart disease cases tripling the non-heart disease cases (72 vs. 24). However, when the 'sex' value is 1 (representing males), this discrepancy diminishes, showing a more balanced ratio (114 vs. 93) between heart disease and non-heart disease cases.
​
b) Positive Correlation
Examining a feature with a positive correlation, such as 'slope', which refers to the slope of the peak exercise ST segment, we notice some intriguing insights. The 'slope' feature has values like:
​
- 0: Upsloping (suggesting a better heart rate with exercise, which is less common)
- 1: Flatsloping (indicating minimal change, often seen in a typically healthy heart)
- 2: Downsloping (associated with signs of an unhealthy heart)
​
This relationship is validated when examining the data. By creating a cross-tabulation of 'slope' against 'target', we can see that as the 'slope' value rises, the number of heart disease cases (target = 1) also tends to increase, corroborating the pattern identified by the model.
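```python
pd.crosstab(df.slope, df.target)  # heart disease counts per slope category
```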