
NUTR 319 INTERMEDIATE EPIDEMIOLOGY
SPRING 2013
HOMEWORK 1
TOPIC: Descriptive Analysis

Please answer all the questions in Homework 1 using the data set epidat1.txt. This data set is from a case-control study investigating esophageal cancer among 978 men (200 cancer cases and 778 population controls). Descriptive information regarding this data set was provided in Laboratory 1. Please review it carefully.

1. Produce frequency distributions for all variables in the dataset and check for any inconsistent/improbable and any out of range values (outliers).

2. Examine frequency distributions for all variables in the dataset separately for the cases and controls. If there are any inconsistencies or outliers, print the distribution of those variables and mark the inconsistent value(s) on the printout.

3. If any of the inconsistencies can be corrected based on the study information you were provided (Lab 1, page 3), please correct them and briefly describe the corrections that you made.

Note to students: When you have edits that need to be made to the original data set because of data entry errors or other problems you have two options:

   1. Overwrite the original raw data with the corrections.
   2. Make all data edits within your SAS program so you have a record of your changes.

We recommend the second option unless you are the data manager and can make global changes to the raw data file and can distribute these changes to all the dataset users. The second option will provide a permanent record of all the changes you make to the raw file and will make it easy for anyone to take your program and replicate your results. For organization, we suggest that you put all the changes close to the top of your SAS program so that every time you run the program the changes are made to the original raw data file.
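
As a rough illustration of the second option, a program might be organized as in the sketch below. The file path, the input layout, the variable names (id, case, age, alcohol, tobacco), and the example correction are all placeholders and are not taken from Lab 1; substitute the actual layout and whatever corrections you identify.

```sas
/* Sketch only. The path, column order, and variable names below are
   assumptions; use the layout described in Lab 1. */
data epi;
    infile 'epidat1.txt';            /* assumed: space-delimited, no header row */
    input id case age alcohol tobacco;

    /* ---- Data edits: keep every correction together, near the top ---- */
    /* Hypothetical correction, shown only to illustrate the pattern */
    if id = 123 then age = 25;
run;

/* Frequency distributions for the total sample */
proc freq data=epi;
    tables case age alcohol tobacco;
run;

/* The same distributions, separately for cases and controls */
proc sort data=epi;
    by case;
run;

proc freq data=epi;
    by case;
    tables age alcohol tobacco;
run;
```

Because the edits live in the DATA step, the raw file itself is never altered; the corrections are reapplied to the SAS dataset each time the program runs.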

4. Check whether the inconsistencies were corrected by producing a frequency distribution of the corrected variable(s).

5. Are there any inconsistencies in the data that cannot be corrected without examining the original questionnaire? Are there any outliers or improbable values that you would like to verify with the original questionnaire? If yes, indicate which ones.

6. Are there any missing values in the data? If yes, indicate the variable(s) and number of missing subjects.

7. Using the PROC UNIVARIATE procedure, produce descriptive statistics for the continuous (quantitative) variables in the dataset for both the total sample and the cases and controls separately. Examine the overall and case/control means for these variables. Are there any differences in the mean values for the cases and the controls? Now produce histograms and box plots to examine the shape of the distributions for these variables. In a few sentences, briefly describe the plots.

8. Create new dichotomous (binary) variables for the age and alcohol variables by using IF-THEN statements in a DATA step. Be careful not to make any overlapping categories. Name the new variables agegrp and alcgrp, respectively. Produce frequency distributions for these new variables. Define the categories as follows:

   agegrp: 0 = <60 years
           1 = 60+ years
   alcgrp: 0 = 0-39 gm/day
           1 = 40+ gm/day
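
One possible shape for questions 7 and 8 is sketched below. It assumes the data have already been read into a SAS dataset, here called epi, and that the case status, age, and alcohol variables are named case, age, and alcohol; all of these names are assumptions, as is the output dataset name epi2.

```sas
/* Sketch only. Dataset and variable names are assumed, not from Lab 1. */

/* Descriptive statistics and plots for the total sample */
proc univariate data=epi plots;
    var age alcohol;
    histogram age alcohol;
run;

/* The same, separately for cases and controls */
proc sort data=epi;
    by case;
run;

proc univariate data=epi plots;
    by case;
    var age alcohol;
    histogram age alcohol;
run;

/* Dichotomize with IF-THEN/ELSE so the categories cannot overlap */
data epi2;
    set epi;

    /* Test for missing first: SAS sorts a missing numeric value below
       every number, so "age < 60" alone would put missing ages in group 0 */
    if age = . then agegrp = .;
    else if age < 60 then agegrp = 0;       /* <60 years */
    else agegrp = 1;                        /* 60+ years */

    if alcohol = . then alcgrp = .;
    else if alcohol < 40 then alcgrp = 0;   /* 0-39 gm/day */
    else alcgrp = 1;                        /* 40+ gm/day */
run;

proc freq data=epi2;
    tables agegrp alcgrp;
run;
```

Chaining the conditions with ELSE means each subject falls into exactly one category, which is one way to guard against overlapping definitions.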

9. Please check the codes for the tobacco variable. Is there any specific pattern of characteristics for subjects with tobacco values of 8 or 9? Recode the tobacco variable to indicate to SAS that the values 8 or 9 are missing (coded as . in SAS) by using an IF-THEN statement.

10. Examine whether the proportion of heavy smokers (tobgrp=1 if 10+ cigarettes/day) and light smokers (tobgrp=0 if <10 cigarettes/day) is the same in cases and controls.
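
Questions 9 and 10 could be approached along the lines of the sketch below. It assumes a SAS dataset called epi that contains variables named case, tobacco, and tobgrp; these names, and the output dataset name epi3, are assumptions. The chi-square test is one possible way to compare the proportions and is not prescribed by the assignment.

```sas
/* Sketch only. Dataset and variable names are assumed, not from Lab 1. */

/* Inspect the subjects coded 8 or 9 before recoding them */
proc print data=epi;
    where tobacco in (8, 9);
run;

data epi3;
    set epi;
    if tobacco in (8, 9) then tobacco = .;   /* 8 and 9 become SAS missing */
run;

/* Confirm the recode: the MISSING option lists missing as its own level */
proc freq data=epi3;
    tables tobacco / missing;
run;

/* Heavy vs. light smokers by case status; the row percentages give the
   proportion in each smoking group among cases and among controls */
proc freq data=epi3;
    tables case*tobgrp / chisq;
run;
```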
