Cardiovascular Disease Prediction Using Machine Learning Models
Authors: Arnav Gupta, Karan Gupta, Shivesh Gulati, Vishal Singh
Motivation
A timely and correct diagnosis of cardiovascular diseases helps provide a better prognosis and improves the quality of patient care. A diagnosis often involves multiple parameters, some of which doctors might neglect or whose effects have not yet been explored. Machine learning applications can identify complex patterns in the data and support faster, more accurate diagnoses. The idea for this topic arose from the large number of cardiovascular-related deaths during the COVID-19 pandemic, many of which were caused by subtle signs and symptoms being overlooked. This prompted us to develop a machine learning model for diagnosing cardiovascular diseases.
Introduction
The study aims to develop a machine learning model capable of predicting whether an individual has cardiovascular disease (CVD) using easy-to-determine parameters such as age, glucose levels, weight, and blood pressure indices. This can serve as a tool for early detection of CVD risk, allowing individuals to take preventive measures and seek medical attention at an early stage to reduce further risk.
The algorithm proposed uses the parameters age, gender, weight, height, smoking, glucose level, systolic blood pressure, diastolic blood pressure, pulse pressure, and mean arterial pressure to predict if a person has CVD.
In this study, various classification models were trained on a cleaned and standardized dataset after removing outliers. Dimensionality reduction techniques, such as PCA, were also used to optimize the model performance further. The optimal number of components for PCA and the optimal hyperparameters for each model were determined using the KFold Cross Validation method.
Literature Survey
1. Effective Heart Disease Prediction Using Machine Learning Techniques
This study develops a model to predict cardiovascular diseases and reduce related fatalities. Models such as random forest (RF), decision tree (DT), multilayer perceptron (MLP), and XGBoost (XGB) were employed, with parameters optimized using GridSearchCV. The research concludes that the cross-validated multilayer perceptron outperformed the other algorithms, achieving 87.28% accuracy.
2. Blood Pressure Variables and Cardiovascular Risk: New Findings from ADVANCE
This paper examines the importance of blood pressure indices, namely Systolic Blood Pressure (SBP), Diastolic Blood Pressure (DBP), Pulse Pressure (PP, defined as mean(SBP) - mean(DBP)), and Mean Arterial Pressure (MAP, defined as DBP + (1/3)PP), in predicting the risk of cardiovascular diseases in a patient, using well-established models such as Cox proportional hazards regression.
Dataset Description and Feature Engineering
Size and Shape of the dataset
The dataset used for this project was obtained from Kaggle. The original dataset consisted of 70,000 records and 13 columns. Two columns were added for the derived features, and the first column (the ID) was dropped, bringing the modified dataset to 14 columns (including the target column).
Feature Description
Every feature in the dataset falls into one of the four categories below:
- Objective Feature: Factual Information
- Examination Feature: Results of Medical Examination
- Subjective Feature: Information given by the patient
- Derived Features: Features derived from already existing features. The two derived features in our dataset are the Mean Arterial Pressure (MAP) and the Pulse Pressure (PP), computed as shown below.
PP = Systolic Blood Pressure (SBP) - Diastolic Blood Pressure (DBP)
MAP = Diastolic Blood Pressure (DBP) + (1/3) Pulse Pressure (PP)
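As a sketch, the derived features can be computed with pandas as below. The column names ap_hi (SBP) and ap_lo (DBP) follow the Kaggle dataset, while the file path and the new column names PP and MAP are assumptions.

```python
import pandas as pd

# Load the Kaggle cardiovascular dataset (path is an assumption; the
# original file is semicolon-separated).
df = pd.read_csv("cardio_train.csv", sep=";")

# Derived features from the systolic (ap_hi) and diastolic (ap_lo) columns.
df["PP"] = df["ap_hi"] - df["ap_lo"]      # pulse pressure
df["MAP"] = df["ap_lo"] + df["PP"] / 3    # mean arterial pressure

# Drop the ID column, as described above.
df = df.drop(columns=["id"])
```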
Exploratory Data Analysis
Univariate analysis was performed to analyze the effectiveness of each feature in predicting the occurrence of CVD.
Box Plots
The box plots indicate that many values lie beyond the ±1.5 IQR mark and are outliers, so outlier detection is a must before training the models. Among the BP indices, Pulse Pressure shows the highest number of outliers, while Mean Arterial Pressure (MAP) and Systolic Blood Pressure (SBP) show the fewest.
Histograms
Histograms were used to analyze the distribution of participants, with or without CVD, across specific ranges of the numerical features. From the histograms, it was concluded that as a person's age, weight, or blood pressure increases, the number of people with CVD grows relative to the number without it, especially beyond 22,500 days of age (about 61 years) or 75 kg in weight.
Correlation Heat Map
The correlation heatmap shows the correlation between the different features of the data, including the target attribute. The value and color of each cell indicate the degree of correlation. Gender and height have a moderate correlation (around 0.5), with males tending to be taller than females. ap_lo and ap_hi have a strong correlation (close to 1), as a greater ap_lo generally accompanies a greater ap_hi. PP and MAP both correlate strongly with ap_hi and ap_lo. ap_hi and MAP have a moderate correlation (around 0.5) with CVD, as higher blood pressure generally means a greater risk of CVD.
Data Preprocessing
Outlier Detection and Standardization
The first step in our outlier detection involved removing all negative values, as well as values in the ap_hi, ap_lo, PP, and MAP columns beyond 500 mmHg. This step reduced the number of records in the dataset from 70,000 to 68,727.
After this initial pass, the box plots still showed the presence of outliers, so two copies of the dataset were created. On the first copy, the Z-Score method was used to detect outliers with a bound of ±2.75, reducing the number of records from 68,727 to 65,048.
On the second copy, the Local Outlier Factor (LOF) method was applied with a 20 percent contamination rate and 20 neighbors, reducing the number of records from 68,727 to 56,000.
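A minimal sketch of both outlier-detection passes with SciPy and scikit-learn, continuing from the dataframe df loaded earlier; the list of numerical columns is an assumption.

```python
import numpy as np
from scipy import stats
from sklearn.neighbors import LocalOutlierFactor

num_cols = ["age", "height", "weight", "ap_hi", "ap_lo", "PP", "MAP"]  # assumed

# Copy 1: Z-Score method, keeping rows where every numerical feature
# lies within the +/-2.75 bound.
z = np.abs(stats.zscore(df[num_cols]))
df_z = df[(z < 2.75).all(axis=1)]

# Copy 2: Local Outlier Factor with 20 neighbors and 20% contamination.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.20)
df_lof = df[lof.fit_predict(df[num_cols]) == 1]  # 1 = inlier, -1 = outlier
```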
Standardization was performed so that the numerical features have a mean of 0 and a standard deviation of 1, nullifying the effect of the different measurement scales of the various physical quantities.
Data Encoding
The categorical features were encoded in two ways: label-based encoding first, followed by one-hot encoding, and the models were trained on both versions. With label-based encoding the dataset had 13 features; with one-hot encoding the number of features increased to 15.
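A sketch of the one-hot encoding step; which columns were expanded is not stated in the text, so treating cholesterol and gluc as the multi-level categorical columns is an assumption.

```python
import pandas as pd

# Expand the multi-level categorical features; drop_first avoids redundant
# columns and keeps the total feature count at 15, matching the text.
df_ohe = pd.get_dummies(df_lof, columns=["cholesterol", "gluc"], drop_first=True)
```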
Methodologies
General Flow
Two copies of the dataset were made, and the Z-Score and Local Outlier Factor (LOF) methods were used for outlier detection on the first and second copies, respectively, as mentioned above. Data standardization was performed on each copy after a 70:30 train-test split. PCA was applied to the one-hot encoded dataset to further improve performance. K-fold cross-validation was applied to the PCA-transformed, LOF-cleaned data to determine the optimal hyperparameters for each model, except for Random Forest and MLP, for which PCA was not applied. A sketch of this flow follows.
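The overall flow might look like the following scikit-learn sketch. The random seeds, candidate grids, number of folds, and scoring metric are all assumptions; cardio is the target column name in the Kaggle dataset.

```python
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X = df_ohe.drop(columns=["cardio"])
y = df_ohe["cardio"]

# 70:30 train-test split, followed by standardization and PCA inside a
# pipeline so both are fitted on the training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", SVC()),
])

# Illustrative grid; the study's actual candidate values are not stated.
grid = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": [5, 8, 10], "clf__C": [0.1, 1, 10]},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1",
)
grid.fit(X_train, y_train)
```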
Models
Gaussian Naive Bayes
Since the numerical features are nearly normally distributed, as shown by the histograms, Gaussian Naive Bayes (GNB) was used as an initial starting point in the search for the most optimal model.
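A minimal baseline fit, reusing the train/test split from the sketch above:

```python
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
print("GNB test accuracy:", gnb.score(X_test, y_test))
```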
Logistic Regression
The threshold value for the cut-off probability was set at 0.5, and L2 regularization was used to train all the models. K-fold cross-validation was applied to the dataset cleaned using the LOF outlier detection method to determine the optimal value of the regularization constant and the best solver. The optimal parameters obtained are as follows:
Solver: newton-cg, Regularization Strength: 10
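In scikit-learn terms, assuming "Regularization Strength: 10" maps to the inverse-regularization parameter C=10 as reported by GridSearchCV:

```python
from sklearn.linear_model import LogisticRegression

# C is the inverse of the regularization strength in scikit-learn.
logreg = LogisticRegression(solver="newton-cg", penalty="l2", C=10)
logreg.fit(X_train, y_train)
```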
SVM
Support Vector Machines were trained with an RBF kernel and a regularization strength of 1 as the default parameters. PCA was then applied to the one-hot encoded data cleaned using LOF for outlier detection. K-fold cross-validation was applied to the training data, and the optimal parameters obtained are as follows:
Regularisation Strength: 1, Kernel: RBF
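These happen to match scikit-learn's defaults, so the fitted model is simply:

```python
from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=1)
svm.fit(X_train, y_train)
```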
Decision Trees
Decision trees were used for classification, with pre-pruning to reduce over-fitting on the training data. K-fold cross-validation was used on the training data after applying PCA to the dataset cleaned using the LOF outlier detection method. The hyperparameters obtained are as follows:
Max Depth: 7, Min Samples Split: 5, Min Samples Leaf: 1
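A sketch of the pre-pruned tree with the reported hyperparameters:

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning via depth and split/leaf limits.
tree = DecisionTreeClassifier(max_depth=7, min_samples_split=5, min_samples_leaf=1)
tree.fit(X_train, y_train)
```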
Random Forests
Random forests were used for binary classification, with k-fold cross-validation on the one-hot encoded data. PCA was not used because the optimal number of components was found to be 1, and with such a strong reduction the model would not capture the complexity of the dataset. The best hyperparameters were:
max_depth=15, max_features='log2', min_samples_leaf=4, n_estimators=300, bootstrap=True
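With scikit-learn, the reported configuration would be as below; note that PCA is skipped here, per the text.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=15,
    max_features="log2",
    min_samples_leaf=4,
    bootstrap=True,
)
rf.fit(X_train, y_train)  # one-hot encoded features, no PCA
```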
XGBoost
XGBoost was used to improve further on the random forest through gradient boosting, and optimal hyperparameters were calculated on the one-hot encoded training data after applying PCA to the dataset cleaned using the LOF method.
The optimal parameters are as follows:
Learning rate: 0.2, max_depth: 3, n_estimators: 100
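Assuming the xgboost Python package with its scikit-learn wrapper, the reported configuration is:

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(learning_rate=0.2, max_depth=3, n_estimators=100)
xgb.fit(X_train, y_train)
```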
Multi-Layer Perceptron
The model was applied with two hidden layers. Since the task at hand was binary classification, the sigmoid activation function was used at the output layer, while the ReLU activation function was used in the hidden layers. The default size of each hidden layer was 200, and the number of epochs was fixed at 20. The best parameters obtained are as follows:
Neurons in hidden layer-1 : 300, Neurons in hidden layer-2: 250
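A sketch using scikit-learn's MLPClassifier (the actual library used is not stated in the text), which applies ReLU in the hidden layers and a logistic (sigmoid) output for binary targets by default:

```python
from sklearn.neural_network import MLPClassifier

# max_iter stands in for the 20 training epochs mentioned above.
mlp = MLPClassifier(hidden_layer_sizes=(300, 250), activation="relu", max_iter=20)
mlp.fit(X_train, y_train)
```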
Training Workflow
Results
Data Cleaned Using Z-Score
Data Cleaned Using LOF
Analysis
In our scenario, false negatives are of greater concern than false positives, so unequal weight is given to misclassification and a model with higher recall is more desirable. Of all the models trained, the best recall was observed for Support Vector Machines (0.76921).
Given these unequal misclassification costs, the F1 score is a better overall metric for evaluating model performance than accuracy. However, a trade-off between accuracy and recall can be observed.
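The reported metrics can be computed per model with scikit-learn; an illustrative evaluation for one of the fitted classifiers:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_pred = svm.predict(X_test)
print("Recall:  ", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
```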
Conclusion
From the above analysis, it can be concluded that of all the models applied, the best performance (taking the F1 score as the overall evaluation metric) was achieved by the Decision Tree (0.73984), followed by the Multi-Layer Perceptron (0.73784). In terms of accuracy, the best result was observed for the Random Forest (0.74132), closely followed by an accuracy of 0.73919.
References
[1] Karolina Drożdż, Katarzyna Nabrdalik, Hanna Kwiendacz, Mirela Hendel, Anna Olejarz, Andrzej Tomasik, Wojciech Bartman, Jakub Nalepa, Janusz Gumprecht, and Gregory Y. H. Lip. Risk factors for cardiovascular disease in patients with metabolic-associated fatty liver disease: a machine learning approach. Cardiovascular Diabetology, 21(1):240, 2022.
[2] André-Pascal Kengne, Sébastien Czernichow, Rachel Huxley, Diederick Grobbee, Mark Woodward, Bruce Neal, Sophia Zoungas, Mark Cooper, Paul Glasziou, Pavel Hamet, et al. Blood pressure variables and cardiovascular risk: new findings from ADVANCE. Hypertension, 54(2):399–404, 2009.
[3] Svetlana Ulianova. Cardiovascular disease dataset. Kaggle, Jan 2019.
Team Members
Arnav Gupta
https://www.linkedin.com/in/arnavgupta-/
Karan Gupta
https://www.linkedin.com/in/karan-gupta-5869a6226
Shivesh Gulati
https://github.com/ShiveshGit
http://www.linkedin.com/in/shiveshgulati
Vishal Singh