
PH6420 Fall 2015: Assignment 5

Due: November 23, 2015

Write a separate program for each of Parts A and B. Be sure to answer all questions.
PART A:
1. Read the variables ptid, fatbl, and sodbl from the TOMHS dataset, creating a dataset called diet.

2. Run the UNIVARIATE procedure to get statistics for the variables fatbl and sodbl. Using an ODS
OUTPUT statement, capture the output from the quantiles table in a SAS dataset.

3. Run PROC CONTENTS on the output dataset. What are the variable names in the dataset?

4. Display the output dataset using PROC PRINT. What are the medians of the variables fatbl and
sodbl?

5. Create a subset of the dataset created in step 2 that includes only the rows for the 25%, 50%,
and 75% quantiles of each variable (use a WHERE statement). Display this dataset using PROC PRINT.
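As a hedged illustration of step 2, the following sketch shows the general shape of an ODS OUTPUT statement used with PROC UNIVARIATE. It assumes the dataset diet from step 1 already exists; the output dataset name quant is a hypothetical choice, not one given in the assignment.

```sas
/* Optional: write the names of the ODS tables the procedure produces to the log */
ods trace on;

/* Capture the Quantiles table in a dataset (quant is an assumed name) */
ods output Quantiles=quant;

proc univariate data=diet;
  var fatbl sodbl;
run;

ods trace off;
```

The ODS TRACE statements are not required. They are a convenient way to confirm the table name before writing the ODS OUTPUT statement.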

PART B:
This part may be more challenging so plan on taking more time on it. You may want to sketch the
program on paper first.
A text file called president_election_2012.csv (on class website) contains data on the 2012 presidential
election by county within Minnesota. The first 3 records are as follows:
MN,1,AITKIN,0301,4533,9142
MN,1,AITKIN,0401,4412,9142
MN,1,AITKIN,0501,80,9142

There are 6 variables: 1) state abbreviation (MN on all records); 2) a county id number (1-87); 3)
name of county; 4) candidate code (0301=Romney, 0401=Obama, 0501=Johnson); 5) number of
votes cast for that candidate in the county; 6) total votes cast in that county.
Note: There are three records for each county, one giving the votes for Romney, one giving the votes
for Obama, and one giving the votes for Johnson (libertarian candidate). Records for other candidates
have been removed.
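A minimal sketch of how a comma-delimited layout like this can be read with list input is shown here. The dataset name votes and the file path are assumptions (adjust the path to wherever the file is saved); the variable names are the ones requested in step 1 below.

```sas
data votes;                                   /* votes is an assumed dataset name */
  /* Assumed path: point this at the downloaded file */
  infile 'president_election_2012.csv' dlm=',';
  /* $ marks character variables; :$17. reads up to 17 characters */
  /* for county_name while still stopping at the delimiter        */
  input state $ county_id county_name :$17. candidate_id $
        candvotes totvotes;
run;
```

The colon modifier matters here: without it, the $17. informat would read a fixed 17 columns and run past the comma into the next field.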

The main goal is to write a SAS program that puts the candidate votes on the same record (one row
per county) so that the votes can be compared. You may use any technique, but the first step for any
technique is to read in the data as outlined below.
1) Create a SAS dataset by reading in the raw data, giving the variables the names state, county_id,
county_name, candidate_id, candvotes, and totvotes (use list input with an appropriate DLM option);
state, candidate_id, and county_name are character variables. You will need to use an informat
for county_name since its values can be up to 17 characters long.
2) In a second data step, remove the rows for the candidate Johnson. Use a SET statement with a
WHERE statement.
3) Use any technique to create a dataset (start a new data step) that has one observation per county
(reading in the dataset created in step 2), with the number of votes for each candidate in the county
on the same row, stored in separate variables. A PROC PRINT of the first 2 counties
in the new dataset should look like the following (variable names are above the data).
COUNTY_NAME  TOTVOTES  ROMNEY_VOTES  OBAMA_VOTES
AITKIN           9142          4533         4412
ANOKA          186461         93430        88611

4) Now add a data step that adds new variables to the dataset from step 3: 1) romney_pct (percent of
votes going to Romney), 2) obama_pct (percent of votes going to Obama), and 3) a variable
called winner = ROMNEY or OBAMA, depending on which candidate had more votes in the
county.
5) Using this final dataset, run procedures to answer the following questions:
a. How many counties did Romney and Obama each win?
b. Which county did each candidate do best in (the highest percentage of votes)?
c. Which county had the closest race between Romney and Obama (smallest difference in
percentage of votes)?
Note: There are several ways to get the answers for part b. One way is to sort the dataset by
percentage of votes for the candidate of interest and display the observations using PROC PRINT.
For part c you may want to compute a new variable that is the absolute value of the difference in
percent of votes.
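The assignment allows any technique for step 3. One possible approach, sketched here under stated assumptions, uses BY-group processing with RETAIN to carry each candidate's votes forward until the last record for the county. The dataset names votes, votes2, and bycounty are hypothetical, and the sketch assumes the records are already in county_id order, as the sample records suggest; otherwise run PROC SORT first.

```sas
/* Step 2: drop the Johnson records (votes is an assumed input dataset name) */
data votes2;
  set votes;
  where candidate_id ne '0501';
run;

/* Step 3: collapse to one row per county */
data bycounty;
  set votes2;
  by county_id;                       /* requires data sorted by county_id */
  retain romney_votes obama_votes;    /* keep values across rows of a county */
  if first.county_id then call missing(romney_votes, obama_votes);
  if candidate_id = '0301' then romney_votes = candvotes;
  else if candidate_id = '0401' then obama_votes = candvotes;
  if last.county_id then output;      /* write one record per county */
  keep county_name totvotes romney_votes obama_votes;
run;

proc print data=bycounty (obs=2);
run;
```

Other techniques, such as splitting the data into one dataset per candidate and merging them by county, or using PROC TRANSPOSE, would serve equally well.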
