A Regression Analysis of Suicidality in U.S. High School Students


Omair A. Khan
December 17, 2014

Abstract
This project develops a logistic regression model to explain adolescent suicidality using
several indicators. Using data from the 2013 Youth Risk Behavior Survey, the results indicate
that the odds of a suicide attempt are significantly related to one's sex, history of
depression, and BMI percentile. In the future, the method described in this paper could be
used with a larger number of predictors to create a better screening tool for youth suicide
prevention.

Contents
1 Introduction
2 Method
2.1 The Dataset
2.2 Treatment of the Data
2.3 Variables
3 Results and Discussion
4 Conclusions
A SAS Code
B Data and Figures

1 Introduction

Suicide is a serious public health problem that affects people of all ages. For youth between the
ages of 10 and 24, suicide is the third leading cause of death in the United States [CDC, 2014].
Approximately 4,600 young lives are lost to suicide each year. However, death from suicide is only
part of the problem: more young people survive suicide attempts than actually complete them. In
2013, data collected from the Youth Risk Behavior Survey (YRBS) showed that 17.0% of youth
seriously considered attempting suicide during the 12 months before the survey. Approximately
13.6% of high schoolers made a plan about how they would attempt suicide, and 8.0% of the youth
surveyed attempted suicide one or more times during the prior year [YRBS, 2013].
While these statistics are quite alarming, they provide an opportunity to create tools for the
prevention and reduction of suicides. Previous epidemiological research has shown that the best
predictor for suicide is whether a person has had a prior attempt [Joiner et al., 2005; Borges
et al., 2010; Glenn and Nock, 2014]. Depending on various psychosocial factors affiliated with
the perceived shame and stigma of suicide, some people may not be willing to share this type of
information with their primary care providers. One solution that is explored in this paper is to use
other easily accessible data to model the probability of a prior suicide attempt in order to counsel
youth about their future suicide risks.

2 Method

All data processing and statistical analyses were performed in SAS 9.4.

2.1 The Dataset

The Youth Risk Behavior Survey is a national survey conducted biennially by the Centers for
Disease Control and Prevention (CDC) to monitor various health-risk behaviors in adolescents. The
sampling frame for the 2013 national YRBS consisted of all regular public and private schools with
students in at least one of grades 9-12 in the 50 states and the District of Columbia [CDC, 2013].
The survey consists of 92 questions covering behaviors that contribute to injuries and violence,
tobacco use, alcohol and drug use, sexual behaviors, unhealthy dietary behaviors, and physical
inactivity. Of the 13,633 questionnaires completed in 148 schools, 13,583 passed quality control
and are included in the dataset for the analysis. Detailed information about the methodology and
sampling of the YRBS dataset can be found in [Brener et al., 2013].

2.2 Treatment of the Data

The dataset was downloaded from the CDC website as an ASCII file and converted into a SAS
dataset using the format and input programs (included in the Dropbox). The response variable and
predictors were isolated, and observations with missing values were removed. Binary variables were
then recoded to a (0,1) scheme and formats were applied. The dataset was then split into two smaller
datasets (YRBS1 and YRBS2) using the SURVEYSELECT procedure (selection method: simple random
sampling, seed: 1234, sampling rate: 0.5, sample sizes: 5574 and 5573). YRBS1 was used for the
initial model selection and YRBS2 for the validation procedure.
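
The split described above can be sketched outside SAS as follows. This is a minimal illustration in Python, not the procedure that was actually run: the function name split_half and the placeholder row ids are assumptions, and Python's random generator will not select the same rows as SURVEYSELECT even with the same seed. It only shows the idea of a seeded simple random sample without replacement, with the sampled half rounded up.

```python
import random


def split_half(records, seed=1234):
    """Simple random sample of half the records (rounded up); the rest form the holdout."""
    rng = random.Random(seed)
    n_selected = (len(records) + 1) // 2  # 0.5 sampling rate, rounded up
    selected_idx = set(rng.sample(range(len(records)), n_selected))
    selected = [r for i, r in enumerate(records) if i in selected_idx]
    holdout = [r for i, r in enumerate(records) if i not in selected_idx]
    return selected, holdout


# Placeholder row ids standing in for the 5574 + 5573 = 11147 complete cases.
records = list(range(11147))
yrbs1, yrbs2 = split_half(records)
print(len(yrbs1), len(yrbs2))
```

Fixing the seed makes the partition repeatable, which matters here because the model is selected on one half and checked on the other.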

2.3 Variables

The dependent variable in this study was QN29, based on item 29 on the YRBS 2013 questionnaire
("During the past 12 months, how many times did you actually attempt suicide?" with responses
ranging from 0 to 6 or more times). Variable QN29 was coded so that 0 represented no suicide
attempts and 1 represented 1 or more suicide attempts. There were about 100 candidate
independent variables, but for the sake of simplicity and proof of concept, three were chosen: SEX,
BMIPCT, and QN26. SEX is a binary variable for whether the youth is female or male, BMIPCT
is a continuous variable giving the youth's BMI percentile based on their height and weight data,
and QN26 is a binary variable consisting of a yes or no response to the question, "During the past
12 months, did you ever feel so sad or hopeless almost every day for two weeks or more in a row
that you stopped doing some usual activities?" The correlation matrix for the variables is shown in
Table 1. The strongest correlations with the response occur between QN29 and QN26 (r = 0.33)
and between QN29 and SEX (r = 0.11).

3 Results and Discussion

Logistic regression using forward, backward, and stepwise selection procedures (alpha = 0.05 for
all procedures) was run on the YRBS1 dataset to determine the best model. All three procedures
retained all three regressors as significant; higher-order interactions were non-significant (and
were removed from the code for ease of compiling output). The Hosmer-Lemeshow statistic indicates
that the model is a good fit (p-value > 0.05), and a significant chi-square test of the global null
hypothesis beta = 0 shows that we can reject it in favor of the alternative. The resulting logistic
equation is shown below, and the statistics on the coefficients are in Table 2.



\[
\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -1.5024 - 0.4481\,x_{\mathrm{SEX}} - 2.4406\,x_{\mathrm{QN26}} + 0.00606\,x_{\mathrm{BMIPCT}}
\]
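
To make the fitted equation concrete, the sketch below converts it into a predicted probability. It is an illustration in Python rather than part of the SAS analysis. The coefficients are those reported in Table 2, the function name predicted_probability is hypothetical, and the coding (SEX: 0 = female, 1 = male; QN26: 0 = yes, 1 = no) is assumed from the formats defined in the SAS code in Appendix A.

```python
import math

# Estimates reported in Table 2 (fit on YRBS1).
COEF = {"intercept": -1.5024, "SEX": -0.4481, "QN26": -2.4406, "BMIPCT": 0.00606}


def predicted_probability(sex, qn26, bmipct, coef=COEF):
    """Inverse-logit of the linear predictor.

    Assumed coding: sex 0 = female, 1 = male; qn26 0 = yes, 1 = no;
    bmipct is the BMI percentile (0 to 100).
    """
    logit = (coef["intercept"] + coef["SEX"] * sex
             + coef["QN26"] * qn26 + coef["BMIPCT"] * bmipct)
    return 1.0 / (1.0 + math.exp(-logit))


# Two illustrative profiles at the 50th BMI percentile.
p_female_depressed = predicted_probability(sex=0, qn26=0, bmipct=50)
p_male_not_depressed = predicted_probability(sex=1, qn26=1, bmipct=50)
print(round(p_female_depressed, 3), round(p_male_not_depressed, 3))
```

Under these assumptions the first profile has a predicted probability of roughly 0.23 and the second of roughly 0.02, which is the kind of contrast Figures 3 and 4 display across the BMI range.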
Odds ratios can be calculated by exponentiating the coefficients in the equation above. For
example, the odds of a prior suicide attempt are 56% higher for females than for males
(exp(0.4481) = 1.57), and a one-point increase in BMI percentile results in a small but still
significant 0.6% increase in the predicted odds of a prior suicide attempt. Note that the 95% CI
for the odds ratio of BMIPCT is close to 1 but does not include it (Figure 1). This is not the case
in the validation stage, however. Although the validation dataset YRBS2 produced a logistic
regression equation with all three regressors in it as well, the 95% CI for the odds ratio of
BMIPCT includes 1. This means we cannot conclude at alpha = 0.05 that BMI percentile has any
bearing on the odds of a prior suicide attempt in the validation data.
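
The odds ratios quoted above can be reproduced from the estimates and standard errors in Table 2. The sketch below is illustrative Python, not the SAS output. It assumes a Wald interval (estimate plus or minus 1.96 standard errors on the log-odds scale), and the function name is hypothetical. Because SEX is assumed to be coded 1 for males, the female-versus-male odds ratio uses the coefficient with its sign reversed.

```python
import math


def odds_ratio_with_ci(estimate, std_error, z=1.96):
    """Return (odds ratio, lower limit, upper limit) from a log-odds estimate."""
    return (
        math.exp(estimate),
        math.exp(estimate - z * std_error),
        math.exp(estimate + z * std_error),
    )


# Female vs. male: reverse the sign of the SEX coefficient (-0.4481).
or_female, lo_female, hi_female = odds_ratio_with_ci(0.4481, 0.1092)

# One-point increase in BMI percentile.
or_bmi, lo_bmi, hi_bmi = odds_ratio_with_ci(0.00606, 0.00189)

print(round(or_female, 3), round(or_bmi, 4), lo_bmi > 1.0)
```

The BMIPCT interval sits just above 1 under this approximation, which is why a modest change in the estimate on the validation half is enough for the interval to cover 1.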
The ROC curve (Figure 2) shows that each step of the stepwise regression produced a
better-fitting model (Step 1: QN26 entered, Step 2: SEX entered, Step 3: BMIPCT entered).
Tolerances and VIF values were calculated, and there was no sign of multicollinearity. The
predicted probability of an attempted suicide was calculated for each person in the dataset, and
Figures 3 and 4 show some of the results. The predicted probability increases with BMI percentile,
and females are more likely to have attempted suicide than males. We can also see within the
female group that having a history of depression (QN26) significantly increases the probability of
a prior suicide attempt (the same is true for males).
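
As a rough cross-check on the multicollinearity claim, variance inflation factors can be approximated from the pairwise correlations among the predictors in Table 1. The sketch below is an unweighted approximation in Python, not the weighted PROC REG calculation used in the analysis, and the function name is hypothetical. For three predictors, each VIF is the corresponding diagonal element of the inverse of their correlation matrix.

```python
def vif_three_predictors(r12, r13, r23):
    """VIFs for predictors 1, 2, 3 from their pairwise correlations.

    Uses the closed form for a 3x3 correlation matrix: each diagonal
    element of the inverse is a 2x2 cofactor divided by the determinant.
    """
    det = 1 + 2 * r12 * r13 * r23 - r12**2 - r13**2 - r23**2
    return (
        (1 - r23**2) / det,  # predictor 1
        (1 - r13**2) / det,  # predictor 2
        (1 - r12**2) / det,  # predictor 3
    )


# Predictor order: SEX, QN26, BMIPCT (correlations from Table 1).
vif_sex, vif_qn26, vif_bmipct = vif_three_predictors(
    r12=0.18291, r13=0.02559, r23=-0.01631
)
tolerances = [1 / v for v in (vif_sex, vif_qn26, vif_bmipct)]
print([round(v, 3) for v in (vif_sex, vif_qn26, vif_bmipct)])
```

All three values come out very close to 1, which agrees with the conclusion that the predictors are not collinear.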
To validate the regression equation, automatic procedures were run on the YRBS2 dataset and
all resulted in the full model with three regressors (interaction terms again were not significant and
were removed from the code to better produce output). The equation on this dataset is shown
below:
\[
\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -1.4374 - 0.3796\,x_{\mathrm{SEX}} - 2.4252\,x_{\mathrm{QN26}} + 0.00414\,x_{\mathrm{BMIPCT}}
\]
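
One way to see how close the two fits are is to evaluate both equations for the same respondent profile. The sketch below is illustrative Python, not part of the SAS analysis. It assumes the validation coefficients carry the same signs as the Table 2 estimates, and that SEX and QN26 are coded as in the formats in Appendix A (SEX: 0 = female; QN26: 0 = yes). The names FITTED, VALIDATION, and probability are hypothetical.

```python
import math

# (intercept, SEX, QN26, BMIPCT)
FITTED = (-1.5024, -0.4481, -2.4406, 0.00606)      # YRBS1, Table 2
VALIDATION = (-1.4374, -0.3796, -2.4252, 0.00414)  # YRBS2, signs assumed


def probability(coef, sex, qn26, bmipct):
    """Predicted probability of a prior attempt under one set of coefficients."""
    b0, b_sex, b_qn26, b_bmi = coef
    logit = b0 + b_sex * sex + b_qn26 * qn26 + b_bmi * bmipct
    return 1.0 / (1.0 + math.exp(-logit))


# Female reporting two or more weeks of depression, 50th BMI percentile.
p_fit = probability(FITTED, sex=0, qn26=0, bmipct=50)
p_val = probability(VALIDATION, sex=0, qn26=0, bmipct=50)
print(round(p_fit, 3), round(p_val, 3), round(abs(p_fit - p_val), 3))
```

For this profile the two equations differ by less than one percentage point, so the practical predictions are similar even though the BMIPCT interval behaves differently across the two halves.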

4 Conclusions

This project presents a proof of concept for using logistic regression to model prior suicide
attempts as a way to counsel adolescents on future mental health risks. The variables for sex,
history of depression, and BMI percentile were used to model this probability. Future studies could
include more regressors, with the aim of lowering the -2 Log Likelihood and increasing the area
under the ROC curve.

A SAS Code

/******************************************************
Filename: yrbs_regression.sas
Written by: Omair Khan
Date Created: November 20, 2014
Last Modified: December 16, 2014
This program generates a regression equation for
predicting the probability of a high school student
having attempted suicide based on their sex, history of
depression, and BMI percentile.
Input: Youth Risk Behavioral Survey Data 2013
Output: regression data
*******************************************************/
OPTIONS PS = 58 LS = 72 NODATE NONUMBER NOFMTERR;
LIBNAME library 'X:/STAT501/';
ODS LISTING GPATH = 'X:/STAT501/';
ODS RTF FILE = 'X:/STAT501/yrbs_regression.rtf';
%LET PREDICTORS = SEX QN26 BMIPCT;
PROC FORMAT;
VALUE SEX 0 = 'Female'
          1 = 'Male';
VALUE YES_NO 0 = 'Yes'
             1 = 'No';
RUN;
/***bring dataset into Work and apply labels***/
DATA YRBS0;
SET library.yrbs2013 (RENAME=(Q2 = Q2_CHAR));
SEX = input(Q2_CHAR, best.);
LABEL QN29 = 'At Least One Suicide Attempt'
      SEX = 'Sex'
      QN26 = 'At Least Two Weeks of Depression'
      BMIPCT = 'BMI Percentile';
RUN;
/***remove missing data and recode variables***/
DATA YRBS;
SET yrbs0;
IF QN29 = . THEN DELETE;
IF SEX = . THEN DELETE;
IF QN26 = . THEN DELETE;
IF BMIPCT = . THEN DELETE;
IF SEX = 1 THEN SEX = 0;
ELSE IF SEX = 2 THEN SEX = 1;
IF QN29 = 1 THEN QN29 = 0;
ELSE IF QN29 = 2 THEN QN29 = 1;
IF QN26 = 1 THEN QN26 = 0;
ELSE IF QN26 = 2 THEN QN26 = 1;
FORMAT SEX SEX.
QN26 YES_NO.
QN29 YES_NO.;
RUN;
/***Split data into two groups for regression and validation***/
TITLE ;
PROC SURVEYSELECT DATA=YRBS OUT=SPLITDATA METHOD=SRS SAMPRATE=0.5 SEED=1234 OUTALL;
RUN;
DATA YRBS1 (KEEP = QN29 &PREDICTORS) YRBS2 (KEEP = QN29 &PREDICTORS);
SET SPLITDATA;
IF SELECTED = 1 THEN
OUTPUT YRBS1;
ELSE OUTPUT YRBS2;
RUN;
/***eyeball for any potential issues of multicollinearity***/
TITLE 'Correlation Matrix';
PROC CORR DATA=YRBS1;
VAR QN29 &PREDICTORS;
RUN;
ODS GRAPHICS ON;
/***forward logistic regression***/
TITLE 'Forward Logistic Regression';
PROC LOGISTIC DATA = YRBS1 DESCENDING PLOTS = NONE;
MODEL QN29 = &PREDICTORS /
SELECTION = FORWARD
LACKFIT
RISKLIMITS
;
RUN;
QUIT;
/***backward logistic regression***/
TITLE 'Backward Logistic Regression';
PROC LOGISTIC DATA = YRBS1 DESCENDING PLOTS = NONE;
MODEL QN29 = &PREDICTORS /
SELECTION = BACKWARD
LACKFIT
RISKLIMITS
;
RUN;
QUIT;
/***stepwise logistic regression***/
TITLE 'Stepwise Logistic Regression';
PROC LOGISTIC DATA = YRBS1 DESCENDING PLOTS = ALL PLOTS=ROC(ID=PROB);
MODEL QN29 = &PREDICTORS /
SELECTION = STEPWISE
CTABLE PPROB = (0 to 1 by .1)
LACKFIT
RISKLIMITS
OUTROC = STEPROC1
;
OUTPUT OUT = STEPOUT P = STEPPROB XBETA = STEPLOGIT
LOWER = STEPLCI1 UPPER = STEPUCI1;
RUN;
QUIT;
/***produce weights to calculate tolerance and VIF***/
TITLE ;
DATA STEPWEIGHTED;
SET STEPOUT;
W = STEPPROB*(1-STEPPROB);
RUN;
/***ignore the statistics. only look at tolerance and VIF.***/
PROC REG DATA = STEPWEIGHTED;
WEIGHT W;
MODEL QN29 = &PREDICTORS / TOL VIF;
RUN;
/***data to plot with history of depression***/
DATA STEPOUTYES;
SET STEPOUT;
WHERE QN26 = 0;
RUN;
/***data to plot with no history of depression***/
DATA STEPOUTNO;
SET STEPOUT;
WHERE QN26 = 1;
RUN;
/***data to plot of only females***/
DATA STEPOUTFEMALE;
SET STEPOUT;
WHERE SEX = 0;
RUN;
/***data to plot of only males***/
DATA STEPOUTMALE;
SET STEPOUT;
WHERE SEX = 1;
RUN;
/***sort data for later plotting***/
PROC SORT DATA = STEPOUTYES;
BY BMIPCT;
RUN;
PROC SORT DATA = STEPOUTNO;
BY BMIPCT;
RUN;
PROC SORT DATA = STEPOUTFEMALE;
BY BMIPCT;
RUN;
PROC SORT DATA = STEPOUTMALE;
BY BMIPCT;
RUN;
/***plot data from stepwise regression***/
SYMBOL1 I = JOIN V = POINT L = 1 WIDTH = 3 C = PINK HEIGHT = 0.5 INTERPOL = SM45;
SYMBOL2 I = JOIN V = POINT L = 1 WIDTH = 3 C = TEAL HEIGHT = 0.5 INTERPOL = SM45;
PROC GPLOT DATA = STEPOUTYES;
TITLE 'Probability of Suicide Attempt vs BMI Percentile by Sex';
PLOT STEPPROB*BMIPCT = SEX;
RUN;
QUIT;
SYMBOL1 I = JOIN V = POINT L = 1 WIDTH = 3 C = MAROON HEIGHT = 0.5 INTERPOL = SM45;
SYMBOL2 I = JOIN V = POINT L = 1 WIDTH = 3 C = GREEN HEIGHT = 0.5 INTERPOL = SM45;
PROC GPLOT DATA = STEPOUTFEMALE;
TITLE 'Probability of Suicide Attempt for Females';
TITLE2 'by Response for Depression Screen';
PLOT STEPPROB*BMIPCT = QN26;
RUN;
QUIT;
/***validation forward logistic regression***/
TITLE 'Validation Forward Logistic Regression';
PROC LOGISTIC DATA = YRBS2 DESCENDING PLOTS = NONE;
MODEL QN29 = &PREDICTORS /
SELECTION = FORWARD
LACKFIT
RISKLIMITS
;
RUN;
QUIT;
/***validation backward logistic regression***/
TITLE 'Validation Backward Logistic Regression';
PROC LOGISTIC DATA = YRBS2 DESCENDING PLOTS = NONE;
MODEL QN29 = &PREDICTORS /
SELECTION = BACKWARD
LACKFIT
RISKLIMITS
;
RUN;
QUIT;
/***validation stepwise logistic regression***/
TITLE 'Validation Stepwise Logistic Regression';
PROC LOGISTIC DATA = YRBS2 DESCENDING PLOTS = NONE;
MODEL QN29 = &PREDICTORS /
SELECTION = STEPWISE
LACKFIT
RISKLIMITS
;
RUN;
QUIT;
TITLE ;
ODS RTF CLOSE;

B Data and Figures


Pearson Correlation Coefficients, N = 5574
Prob > |r| under H0: Rho=0 (shown in parentheses)

                                            QN29        SEX         QN26        BMIPCT
QN29    At Least One Suicide Attempt        1           0.11131     0.33189     -0.046
                                                        (<.0001)    (<.0001)    (0.0006)
SEX     Sex                                 0.11131     1           0.18291     0.02559
                                            (<.0001)                (<.0001)    (0.0561)
QN26    At Least Two Weeks of Depression    0.33189     0.18291     1           -0.01631
                                            (<.0001)    (<.0001)                (0.2233)
BMIPCT  BMI Percentile                      -0.046      0.02559     -0.01631    1
                                            (0.0006)    (0.0561)    (0.2233)

Table 1: Correlation Matrix

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -1.5024    0.1429           110.5731          <.0001
SEX         1    -0.4481    0.1092           16.8315           <.0001
QN26        1    -2.4406    0.1249           381.7570          <.0001
BMIPCT      1    0.00606    0.00189          10.3287           0.0013

Table 2: Coefficients
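
The columns of Table 2 are linked: each Wald chi-square is the squared ratio of the estimate to its standard error, and the p-value is the upper tail of a chi-square distribution with one degree of freedom. The sketch below is an illustrative check in Python, not SAS output, and the function names are hypothetical. Small differences from the table are expected because the printed estimates and standard errors are rounded.

```python
import math

# (estimate, standard error, reported Wald chi-square) from Table 2.
TABLE2 = {
    "Intercept": (-1.5024, 0.1429, 110.5731),
    "SEX": (-0.4481, 0.1092, 16.8315),
    "QN26": (-2.4406, 0.1249, 381.7570),
    "BMIPCT": (0.00606, 0.00189, 10.3287),
}


def wald_chi_square(estimate, std_error):
    """Wald statistic for H0: coefficient = 0 (1 degree of freedom)."""
    return (estimate / std_error) ** 2


def chi_square_1df_p_value(chi_square):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(chi_square / 2.0))


recomputed = {name: wald_chi_square(est, se) for name, (est, se, _) in TABLE2.items()}
p_bmipct = chi_square_1df_p_value(TABLE2["BMIPCT"][2])

for name, (_, _, reported) in TABLE2.items():
    print(name, round(recomputed[name], 2), reported)
print("BMIPCT p-value:", round(p_bmipct, 4))
```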

Figure 1: Odds Ratios

Figure 2: ROC Curve

Figure 3: Probability of Suicide Attempt vs BMI by Sex for a Positive History of Depression

Figure 4: Probability of Suicide Attempt vs BMI by History of Depression for Females
