BAS 221
Dr. Audun Runde
Dr. Manju Shah
Framingham Heart Study
Abstract
The Framingham Heart Study was an initiative founded in the late
1940s under the direction of what is now the National Heart, Lung,
and Blood Institute (NHLBI) to identify the common factors and
characteristics associated with cardiovascular disease (CVD). Having
followed CVD development through three generations of participants,
the study began with the recruitment of 5,209 men and women
between the ages of 30 and 62 from the town of Framingham,
Massachusetts, followed by five subsequent cohorts across the
generations to today. This paper examines data collected on a subset
of 4,240 participants, followed over a course of time, to predict
10-year risk of coronary heart disease (CHD). The study revolves
around analyzing the Framingham data subset to form a prediction
model of CHD from the risk-factor variables and their joint effects;
no attempt was made to write the programming material, which was
instead provided, so the focus here is on understanding the key
hypothesis being tested and delineating our findings based on the
results.
Methodology
A subset of the Framingham study dataset containing 4,240
participants, along with R code to build the prediction model, was
provided.
The dataset originates from the US government's effort, ongoing
since 1948, to better understand CVD by tracking cohorts of
participants from Framingham, Massachusetts.
Variables were drawn from laboratory, clinical, questionnaire,
and event data, covering CVD risk factors and markers,
surveillance and review of participants, and the outcomes of
adjudicated events for the occurrences of angina pectoris,
myocardial infarction, heart failure, and cerebrovascular disease.
Sixteen variables from the anonymized data subset were used in
the analysis to set up the hypothesis to be tested on predicting
10-year risk of CHD: the fifteen predictors male (gender), age,
education, currentSmoker, cigsPerDay, BPMeds, prevalentStroke,
prevalentHyp, diabetes, totChol, sysBP, diaBP, BMI, heartRate, and
glucose, plus the 10-year CHD outcome itself.
When split into the hypothesized risk factors, the variables fell
into the following groups:
o Behavioral Risk Factors:
currentSmoker: Whether the participant currently smokes
cigsPerDay: Number of cigarettes smoked per day
o Medical History Risk Factors:
BPmeds: On blood pressure medication at time of
first examination
prevalentStroke: Previously had a stroke
prevalentHyp: Currently hypertensive
diabetes: Currently has diabetes
o Risk Factors from First Examination:
totChol: Total Cholesterol (mg/dL)
sysBP: Systolic Blood Pressure (mmHg)
diaBP: Diastolic Blood Pressure (mmHg)
BMI: Body Mass Index, weight (kg)/height (m)^2
heartRate: Heart Rate (beats/minute)
glucose: Blood Glucose Level (mg/dL)
o Demographic Risk Factors:
male: Sex of participant
age: Age in years at first examination
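These groupings, along with the train/test split used for internal validation, can be sketched in code. The original analysis was provided in R; the Python sketch below is illustrative only, and the 70/30 split ratio and random seed are assumptions, not details taken from the provided code:

```python
import numpy as np

# The hypothesized risk-factor groups from the list above.
risk_factor_groups = {
    "behavioral": ["currentSmoker", "cigsPerDay"],
    "medical_history": ["BPMeds", "prevalentStroke", "prevalentHyp", "diabetes"],
    "first_examination": ["totChol", "sysBP", "diaBP", "BMI", "heartRate", "glucose"],
    "demographic": ["male", "age"],
}

# Split the 4,240 participants into training and test sets for internal
# validation. The 70/30 ratio and the seed are assumptions; the provided
# R code may have used a different split.
n = 4240
rng = np.random.default_rng(42)
idx = rng.permutation(n)
cut = int(0.7 * n)
train_idx, test_idx = idx[:cut], idx[cut:]
print(len(train_idx), len(test_idx))  # 2968 1272
```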
Results
The following are the results from running the provided R code.
The first part of the code applies the str() function to the
dataset; its primary purpose is to display the basic structure of
the dataset: the number of observations and, for each variable, its
name, type, and a sample of its values.
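For readers working in Python rather than R, a rough analogue of str() is pandas' DataFrame.info(). The sketch below uses a tiny hand-made stand-in for the dataset, since the actual subset file is not reproduced in this paper:

```python
import pandas as pd

# A tiny hand-made stand-in with a few of the 16 columns; the actual
# Framingham subset has 4,240 rows.
df = pd.DataFrame({
    "male": [1, 0, 1],
    "age": [39, 46, 48],
    "sysBP": [106.0, 121.0, 127.5],
    "TenYearCHD": [0, 0, 1],
})

# Rough Python analogue of R's str(): column names, dtypes,
# and non-null counts.
df.info()

# str() also shows sample values; head() serves the same purpose.
print(df.head())
```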
Turning to the logistic regression model fit on the training data,
the variables that contribute most to an accurate prediction come
from the demographic and first-examination groups; specifically,
gender, age, total cholesterol, and glucose are significant at the
99% confidence level. Smoking behavior, stroke history,
hypertension, and systolic blood pressure merit further
consideration, being significant at the 90% confidence level. The
model can be reduced to the significant variables, plus or minus the
less significant ones, while still capturing the prediction.
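The fitting step can be illustrated with a self-contained sketch. Since the original R code is not reproduced here, this Python version fits a logistic regression by Newton-Raphson (the same algorithm behind R's glm(..., family = binomial)) on synthetic data and reports Wald p-values as in R's summary() output. The variable names and effect sizes below are invented for illustration and are not the paper's estimates:

```python
import math
import numpy as np

# Synthetic stand-in for the training data: age and sysBP drive risk,
# heartRate deliberately does not. Effect sizes are invented.
rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(30, 62, n)
sysBP = rng.normal(132, 22, n)
heartRate = rng.normal(75, 12, n)
true_logit = -8.0 + 0.07 * age + 0.025 * sysBP
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = np.column_stack([np.ones(n), age, sysBP, heartRate])

# Newton-Raphson iterations on the logistic log-likelihood, the same
# algorithm R's glm(..., family = binomial) uses.
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p)
    hess = X.T @ (X * (p * (1 - p))[:, None])
    beta += np.linalg.solve(hess, grad)

# Wald z-tests (coefficient / standard error), as in R's summary():
# p = 2 * P(Z > |z|) = erfc(|z| / sqrt(2)).
se = np.sqrt(np.diag(np.linalg.inv(hess)))
pvals = np.array([math.erfc(abs(z) / math.sqrt(2.0)) for z in beta / se])
for name, b, pv in zip(["(Intercept)", "age", "sysBP", "heartRate"], beta, pvals):
    print(f"{name:12s} coef = {b:+.4f}   p = {pv:.3g}")
```

Variables with small p-values (here, age and sysBP) would be kept in a reduced model, mirroring the variable-selection step described above.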
With the model in hand, predicting the test set becomes possible.
Comparing the predictions against the actual outcomes yields a
confusion matrix: a 2x2 table counting true and false positives and
negatives. Each column of the confusion matrix represents a
predicted class while each row represents an actual class. In these
results, 11 instances were classified as true positives and 1,069 as
true negatives.
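A confusion matrix requires no special library. The sketch below uses small made-up vectors of outcomes and predicted probabilities (not the paper's actual test-set results) with a 0.5 classification threshold:

```python
import numpy as np

# Hypothetical actual outcomes (1 = developed CHD) and predicted
# probabilities; for illustration only.
actual = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
prob   = np.array([0.1, 0.3, 0.7, 0.2, 0.4, 0.05, 0.6, 0.8, 0.15, 0.25])
pred   = (prob > 0.5).astype(int)  # threshold of 0.5

# 2x2 confusion matrix: rows = actual class, columns = predicted class,
# matching R's table(actual, predicted) layout.
tn = np.sum((actual == 0) & (pred == 0))
fp = np.sum((actual == 0) & (pred == 1))
fn = np.sum((actual == 1) & (pred == 0))
tp = np.sum((actual == 1) & (pred == 1))
print("          pred=0  pred=1")
print(f"actual=0    {tn:4d}    {fp:4d}")
print(f"actual=1    {fn:4d}    {tp:4d}")

sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
print(f"sensitivity = {sensitivity:.3f}, specificity = {specificity:.3f}")
```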
Summary
Coronary heart disease (CHD) has been the leading cause of death
since 1921, accounting for an estimated 7.3 million deaths worldwide
in 2008. The Framingham Heart Study was created in the late 1940s by
the US government to better understand the underlying factors of
cardiovascular disease (CVD), tracking its initial cohort across
three generations, from which multiple risk factors were derived.
In this study, a data subset of 4,240 anonymized participants from
the Framingham Heart Study was used to predict 10-year risk of CHD
from 16 variables identified as potential risk factors. These risk
factors fell into behavioral, medical-history, first-examination,
and demographic categories. Training and test sets were created to
conduct internal validation within the dataset. A logistic
regression model was run on the training set to determine which
variables were significant in predicting the 10-year risk of CHD.
From there, the model was validated on the test set using a
confusion matrix, which produces counts of true and false positives
and negatives and, from them, the model's sensitivity and
specificity. To gauge expected performance, the AUC reduces the ROC
curve to a single value, where 0.5 corresponds to random guessing
and 1.0 to a perfect classifier; useful classifiers operate between
those bounds.
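The AUC has a convenient interpretation that makes it easy to compute directly: it equals the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case. A sketch with made-up probabilities (not the paper's actual results):

```python
import numpy as np

# Toy outcomes (1 = developed CHD within 10 years) and predicted risks.
actual = np.array([0, 0, 1, 0, 1, 1, 0, 0])
prob   = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.6, 0.3, 0.15])

# AUC = P(random positive ranked above random negative); ties count 1/2.
# 0.5 corresponds to random guessing, 1.0 to perfect ranking.
pos = prob[actual == 1]
neg = prob[actual == 0]
pairs = [(p > q) + 0.5 * (p == q) for p in pos for q in neg]
auc = float(np.mean(pairs))
print(f"AUC = {auc:.3f}")
```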
Analytics built on the Framingham Heart Study can help pinpoint
specific causes and guide the development of drugs that lower death
rates due to CHD. The resulting prediction model can also estimate
an individual's 10-year risk of CHD, helping at-risk individuals
lead longer lives.