BAS 221
Dr. Audun Runde
Dr. Manju Shah
Framingham Heart Study
Abstract
The Framingham Heart Study was an initiative founded in the late
1940s under the direction of what is now the National Heart, Lung,
and Blood Institute (NHLBI) to identify the common factors and
characteristics associated with cardiovascular disease (CVD). Having
followed CVD development through three generations of participants,
the study began with the recruitment of 5,209 men and women
between the ages of 30 and 62 from the town of Framingham,
Massachusetts, followed by five subsequent cohorts across the
generations to today. This paper examines data collected on a subset
of 4,240 participants, followed over a course of time, to predict
10-year risk of coronary heart disease (CHD). The study revolves
around analyzing the Framingham data subset to form a prediction
model of CHD from the risk-factor variables and their joint effects;
no attempt was made to write the programming material, which was
instead provided, so the focus here is on understanding the key
hypothesis being tested and delineating our findings based on the
results.
Methodology
A subset of the Framingham study dataset containing 4,240
participants, along with R code to build the prediction model, was
provided.
The dataset originates from the US government's effort, ongoing
since 1948, to better understand CVD by tracking cohorts of
participants from Framingham, Massachusetts.
Variables were drawn from laboratory, clinical, questionnaire,
and event data, covering CVD risk factors and markers,
surveillance and review of participants, and the outcomes of
adjudicated events for the occurrences of angina pectoris,
myocardial infarction, heart failure, and cerebrovascular disease.
Sixteen variables from the anonymized data subset were used in
the analysis to set up the hypothesis to be tested on predicting
10-year risk of CHD: the fifteen predictors male (gender), age,
education, currentSmoker, cigsPerDay, BPMeds, prevalentStroke,
prevalentHyp, diabetes, totChol, sysBP, diaBP, BMI, heartRate, and
glucose, plus the 10-year CHD outcome itself.
When split into the hypothesized risk factors, the variables fell
into the following groups:
o Behavioral Risk Factors:
currentSmoker: Whether the participant currently smokes
cigsPerDay: Number of cigarettes smoked per day
o Medical History Risk Factors:
BPmeds: On blood pressure medication at time of
first examination
prevalentStroke: Previously had a stroke
prevalentHyp: Currently hypertensive
diabetes: Currently has diabetes
o Risk Factors from First Examination:
totChol: Total Cholesterol (mg/dL)
sysBP: Systolic Blood Pressure (mmHg)
diaBP: Diastolic Blood Pressure (mmHg)
BMI: Body Mass Index, weight (kg)/height (m)^2
heartRate: Heart Rate (beats/minute)
glucose: Blood Glucose Level (mg/dL)
o Demographic Risk Factors:
male: Sex of participant
age: Age in years at first examination
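These groupings, along with the train/test split used for internal validation, can be sketched in code. The original analysis was provided in R; the Python sketch below is illustrative only, and the 70/30 split ratio and random seed are assumptions, not details taken from the provided code:

```python
import numpy as np

# The hypothesized risk-factor groups from the list above.
risk_factor_groups = {
    "behavioral": ["currentSmoker", "cigsPerDay"],
    "medical_history": ["BPMeds", "prevalentStroke", "prevalentHyp", "diabetes"],
    "first_examination": ["totChol", "sysBP", "diaBP", "BMI", "heartRate", "glucose"],
    "demographic": ["male", "age"],
}

# Split the 4,240 participants into training and test sets for internal
# validation. The 70/30 ratio and the seed are assumptions; the provided
# R code may have used a different split.
n = 4240
rng = np.random.default_rng(42)
idx = rng.permutation(n)
cut = int(0.7 * n)
train_idx, test_idx = idx[:cut], idx[cut:]
print(len(train_idx), len(test_idx))  # 2968 1272
```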
Results
The following are the results from running the provided R code.
The first part of the code applies the str() function to the
dataset; its primary purpose is to display the basic structure of
the dataset: the number of observations and, for each variable, its
name, type, and a sample of its values.
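For readers working in Python rather than R, a rough analogue of str() is pandas' DataFrame.info(). The sketch below uses a tiny hand-made stand-in for the dataset, since the actual subset file is not reproduced in this paper:

```python
import pandas as pd

# A tiny hand-made stand-in with a few of the 16 columns; the actual
# Framingham subset has 4,240 rows.
df = pd.DataFrame({
    "male": [1, 0, 1],
    "age": [39, 46, 48],
    "sysBP": [106.0, 121.0, 127.5],
    "TenYearCHD": [0, 0, 1],
})

# Rough Python analogue of R's str(): column names, dtypes,
# and non-null counts.
df.info()

# str() also shows sample values; head() serves the same purpose.
print(df.head())
```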
Turning to the logistic regression model fit on the training data,
the variables that contribute most to an accurate prediction come
from the demographic and first-examination groups; specifically,
gender, age, total cholesterol, and glucose are significant at the
99% confidence level. Smoking behavior, stroke history,
hypertension, and systolic blood pressure merit further
consideration, being significant at the 90% confidence level. The
model can be reduced to the significant variables, plus or minus the
less significant ones, while still capturing the prediction.
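The fitting step can be illustrated with a self-contained sketch. Since the original R code is not reproduced here, this Python version fits a logistic regression by Newton-Raphson (the same algorithm behind R's glm(..., family = binomial)) on synthetic data and reports Wald p-values as in R's summary() output. The variable names and effect sizes below are invented for illustration and are not the paper's estimates:

```python
import math
import numpy as np

# Synthetic stand-in for the training data: age and sysBP drive risk,
# heartRate deliberately does not. Effect sizes are invented.
rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(30, 62, n)
sysBP = rng.normal(132, 22, n)
heartRate = rng.normal(75, 12, n)
true_logit = -8.0 + 0.07 * age + 0.025 * sysBP
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = np.column_stack([np.ones(n), age, sysBP, heartRate])

# Newton-Raphson iterations on the logistic log-likelihood, the same
# algorithm R's glm(..., family = binomial) uses.
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p)
    hess = X.T @ (X * (p * (1 - p))[:, None])
    beta += np.linalg.solve(hess, grad)

# Wald z-tests (coefficient / standard error), as in R's summary():
# p = 2 * P(Z > |z|) = erfc(|z| / sqrt(2)).
se = np.sqrt(np.diag(np.linalg.inv(hess)))
pvals = np.array([math.erfc(abs(z) / math.sqrt(2.0)) for z in beta / se])
for name, b, pv in zip(["(Intercept)", "age", "sysBP", "heartRate"], beta, pvals):
    print(f"{name:12s} coef = {b:+.4f}   p = {pv:.3g}")
```

Variables with small p-values (here, age and sysBP) would be kept in a reduced model, mirroring the variable-selection step described above.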
With the model in hand, predicting the test set becomes possible.
Comparing the predictions against the actual outcomes yields a
confusion matrix: a 2x2 table counting true and false positives and
negatives. Each column of the confusion matrix represents a
predicted class while each row represents an actual class. In these
results, 11 instances were classified as true positives and 1,069 as
true negatives.
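A confusion matrix requires no special library. The sketch below uses small made-up vectors of outcomes and predicted probabilities (not the paper's actual test-set results) with a 0.5 classification threshold:

```python
import numpy as np

# Hypothetical actual outcomes (1 = developed CHD) and predicted
# probabilities; for illustration only.
actual = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
prob   = np.array([0.1, 0.3, 0.7, 0.2, 0.4, 0.05, 0.6, 0.8, 0.15, 0.25])
pred   = (prob > 0.5).astype(int)  # threshold of 0.5

# 2x2 confusion matrix: rows = actual class, columns = predicted class,
# matching R's table(actual, predicted) layout.
tn = np.sum((actual == 0) & (pred == 0))
fp = np.sum((actual == 0) & (pred == 1))
fn = np.sum((actual == 1) & (pred == 0))
tp = np.sum((actual == 1) & (pred == 1))
print("          pred=0  pred=1")
print(f"actual=0    {tn:4d}    {fp:4d}")
print(f"actual=1    {fn:4d}    {tp:4d}")

sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
print(f"sensitivity = {sensitivity:.3f}, specificity = {specificity:.3f}")
```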
Summary
Coronary heart disease (CHD) has been the leading cause of death
since 1921, accounting for an estimated 7.3 million deaths worldwide
in 2008. The Framingham Heart Study was created in the late 1940s by
the US government to better understand the underlying factors of
cardiovascular disease (CVD), tracking its initial cohort across
three generations, from which multiple risk factors were derived.
In this study, a data subset of 4,240 anonymized participants from
the Framingham Heart Study was used to predict 10-year risk of CHD
from 16 variables identified as potential risk factors. These risk
factors fell into behavioral, medical-history, first-examination,
and demographic categories. Training and test sets were created to
conduct internal validation within the dataset. A logistic
regression model was run on the training set to determine which
variables were significant in predicting the 10-year risk of CHD.
From there, the model was validated on the test set using a
confusion matrix, which produces counts of true and false positives
and negatives and, from them, the model's sensitivity and
specificity. To gauge expected performance, the AUC reduces the ROC
curve to a single value, where 0.5 corresponds to random guessing
and 1.0 to a perfect classifier; useful classifiers operate between
those bounds.
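The AUC has a convenient interpretation that makes it easy to compute directly: it equals the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case. A sketch with made-up probabilities (not the paper's actual results):

```python
import numpy as np

# Toy outcomes (1 = developed CHD within 10 years) and predicted risks.
actual = np.array([0, 0, 1, 0, 1, 1, 0, 0])
prob   = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.6, 0.3, 0.15])

# AUC = P(random positive ranked above random negative); ties count 1/2.
# 0.5 corresponds to random guessing, 1.0 to perfect ranking.
pos = prob[actual == 1]
neg = prob[actual == 0]
pairs = [(p > q) + 0.5 * (p == q) for p in pos for q in neg]
auc = float(np.mean(pairs))
print(f"AUC = {auc:.3f}")
```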
Analytics built on the Framingham Heart Study can help pinpoint
specific causes and guide the development of drugs that lower death
rates due to CHD. The resulting prediction model can also estimate
an individual's 10-year risk of CHD, helping at-risk individuals
lead longer lives.