
Stata Reference Manual

What you should know about Stata


after taking the Stata introduction course

A collection of technical hints

Ivan Iachine, Lars Korsholm,


Henrik Støvring, Kirstin Vach, Werner Vach

Version 1.5, Feb., 2004


Contents

1 Entering commands
2 Online help
3 Producing output
4 The general syntax of Stata commands
5 Typical errors and error messages
6 Protection of files and data
7 Data checking
8 The graph command
9 Stratification using by
10 Generating new variables
11 Creating subsamples
12 Making tables in Stata
13 Categorization of variables
14 Using Stata as a pocket calculator: The display command
15 Loops in Stata
16 Working with do-files
17 Reshaping datasets
18 Working with string variables
19 Labels
20 Switching between labels, strings and numbers
21 Creating variables with statistics
22 Survival analysis commands
23 Online facilities
24 How to find a statistical method

1 Entering commands
New command:
Type the command in the “Stata Command” window. To execute it, press Enter.
See section 4 for the syntax of Stata commands.

Previous command:
Double-click the command in the “Review” window or press Page Up until you get the appropriate
command, then hit Enter. In general, Page Up and Page Down browse previously executed
commands.
Execute a do-file:
See Section 16.

2 Online help
Known command name:
Use the help menu or the command help:

. help ttest

-------------------------------------------------------------------------------
help for ttest, ttesti (manual: [R] ttest)
-------------------------------------------------------------------------------

Mean comparison tests


---------------------

ttest varname = # [if exp] [in range] [, level(#) ]


...

The command whelp opens a new window with the same information and clickable links.

Known name of statistical method:


Use the help menu or the command findit:

. findit paired

[R] ttest . . . . . . . . . . . . . . . . . . . . . Mean comparison tests


(help ttest)

FAQ . . . . Comparing the p-values between a paired t test and a signrank


. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . W. Sribney
3/97 Is my boss correct in saying that the p-value given with
a paired ttest should always be lower than the signrank?
http://www.stata.com/support/faqs/stat/signrank.html
...

The findit command often results in hints to the Stata Technical Bulletin (STB) and to commands
you can download from the internet. For more on online facilities see section 23.

3 Producing output
Create a log-file:
Use the command log using filename. Execute your commands and finish with the command
log close. The output is now stored in filename. Usually the file filename will have the extension
.smcl. See help log for further information.
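
The log-file workflow described above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the file name mysession is hypothetical, the text option (plain text instead of SMCL) is optional, and sysuse auto assumes Stata 8 or later, where it loads the auto example dataset shipped with Stata.

```stata
* Open a log, run some commands, close the log.
* "mysession" is a hypothetical file name; "text" requests a plain-text log.
log using mysession, replace text

* auto is the example dataset shipped with Stata (assumes Stata 8 or later)
sysuse auto, clear
summarize price mpg

log close
```

With the text option the resulting file can be opened in any editor; without it, Stata writes a .smcl file that is best viewed from within Stata.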

Copy and Paste:
Mark the desired output in the “Results” window, then copy it with Ctrl-C. Paste it into
the file of your choice with Ctrl-V. You may also copy and paste in the ordinary Windows
fashion using the mouse or menus.

4 The general syntax of Stata commands


Syntax:
The general syntax of a command is

commandname varlist selector, options

varlist can be one or several variable names, or it might be empty. In the case of several
variables it is possible to give the varlist as, say, var1-var5, which means all the variables
from var1 to var5 in the current order shown by describe, or you may use var*, which means
all the variables in the dataset that start with the letters “var”.
selector can be something like

if sex=="m"
if age>18
in 1/3

Any combination of these may be used as selector. Note that the “logical equal to” symbol is a
doubled equals sign, “==”. in 1/3 means the first through third observation in the data set (in the current order).
options vary from command to command. They are either single names (e.g. histo) or include
additional information in parentheses (e.g. bin(7) or xscale(0,20)).
Note: There is at most one comma in a Stata command!
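
To make the pattern commandname varlist selector, options concrete, here is a small sketch. It assumes Stata 8 or later and uses the auto example dataset shipped with Stata; the variable names (make, price, mpg, weight, foreign) belong to that dataset, not to the examples in this text.

```stata
* Load the example dataset shipped with Stata (assumes Stata 8 or later)
sysuse auto, clear

* varlist given as a range: all variables from price to weight in dataset order
describe price-weight

* varlist + if-selector + option, separated by exactly one comma
summarize price mpg if foreign==1 & price>5000, detail

* in-selector: first through third observation in the current order
list make price in 1/3
```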
Abbreviations:
Usually you can abbreviate command names and options. For example, the following two com-
mands are equivalent:

. regress bweight hypertension ..., robust


. reg bw hyp ..., r

Each command and option has a minimal number of letters that must be given; you can look this up using
the help command, where Stata underlines the minimal abbreviation.
You can also abbreviate variable names by their first letter(s), as long as the identification remains
unique. In the example above, bweight and hypertension must be the only variables beginning
with bw and hyp.

5 Typical errors and error messages


If you are using the Windows version of Stata, all error messages are shown in red.

Error messages:
Error messages try to inform you about what may be wrong, for example if you misspell a variable
name,

. tab variabble
variabble not found

if you use an incorrect option

. tabulate var1, by(var2)
by() invalid

or if the data is assumed to be sorted, but it is not sorted

. by var1: tabulate var2
not sorted

Below the error message in red is an r(xxx) code in blue. This code is clickable and provides
more details on what might be wrong and what you should do.
The logic of error messages:
Stata cannot know what you intend to do; it can only detect errors through syntax checks. This means
that you get only indirect hints. For example, if you forget to separate an option by a comma,
you will get:

. tabulate var1 var2 chi
chi not found

because Stata believes that you meant chi to be a variable. Or if you forget that by requires
parentheses, you get:

. table var1, by var2
by invalid

Here Stata does not realise that you forgot the parentheses; it believes that you tried to use by as
a single option. These examples show that error messages are often very cryptic.
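
For comparison, the corrected forms of the two kinds of mistakes above look roughly like this. The sketch assumes Stata 8 or later and the auto example dataset shipped with Stata; tabstat is used for the by() illustration only because its by() option is available across Stata versions.

```stata
* Example dataset shipped with Stata (assumes Stata 8 or later)
sysuse auto, clear

* The comma separates the option chi2 from the variable list
tabulate rep78 foreign, chi2

* An option that takes an argument needs parentheses
tabstat mpg, by(foreign)
```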

Some typical error messages and what they may indicate:


error message                     possible explanation                        solution/example
no; data in memory would be lost  changing a dataset without saving           save the dataset: save newdata.dta
                                                                              if you want to use a new dataset: clear
no variables defined              no data loaded                              use data.dta
not sorted                        before using a by-option the data           sort var1
                                  has to be sorted                            or use the bysort command: bysort var1: ...
xxx not found                     unknown variable (e.g. incorrect spelling)
                                  no comma before option                      e.g. tabulate var1 var2 chi
                                  blank after function name (not allowed)     e.g. di Binomial (20,10,0.5)
xxx() invalid                     incorrect/unknown option                    e.g. tab var1, by(var2)
                                                                              correct: by var2: tab var1
xxx invalid                       incorrect option (e.g. missing ())          e.g. table var1, by var2
xxx invalid name                  incorrect syntax (e.g. ; instead of :)      e.g. by var1; tab var2
no observations                   incorrect variable type                     e.g. regress STRINGVAR var
                                  variable with missings only                 e.g. regress MISSINGVAR var
=exp not allowed                  "==" is needed                              e.g. list var1 if var2=0
type mismatch                     wrong variable type for this operation      e.g. list var1 if STRINGVAR==0 (string variable)
                                                                              e.g. list var1 if var2=="0" (numeric variable)

The “not enough space to add more ...” error messages:


The default installation of Stata starts with a small amount of memory. This message means that your
data no longer fit into the memory allocated to Stata.
The quick solution:
Save your dataset, clear Stata, allocate more memory, and load the data again:

. save dummyname
file dummyname.dta saved

. clear

. set memory 16m

(16384k)

. use dummyname

. erase dummyname.dta

where 16m is 16 megabytes of RAM. Select the amount you need. See help memory.
The lasting solution:
If you are working with a do-file (and you should!), then insert at the top of the file:

set memory 16m

and rerun your do-file.


The (almost) permanent solution:
Right-click on the Stata icon, select Properties, and change the path field to, e.g.
C:\stata\wstata.exe /m16.

6 Protection of files and data


Stata tries to protect you from yourself so that you do not unintentionally lose data.

The clear and save commands:


When you have performed data manipulations and want to analyze a new dataset or want to exit
the session, Stata requires that you decide what to do with your present dataset. Either you must
specify save newdata or ignore the changes by typing clear. In the last case Stata also
accepts clear as an option, e.g.

. use nextdata, clear

or

. exit, clear

The replace option:


When you want to write to external files that already exist, Stata will refuse to overwrite them
unless you deliberately use the replace option, e.g.

. log using myfile


file myfile.log already exists
r(602);

. log using myfile, replace

This will overwrite the contents of myfile. The same applies to files containing graphs:

. scatter x y, saving(mygraph, replace)

Note: You can use the replace option, even if the corresponding file does not exist.

The replace command:


See section 10.

Be careful with your data!

7 Data checking
Before you analyse your data you should verify that they are “as expected”.
describe varlist : 

Gives an overview of your variables, storage type etc.


codebook varlist : 

Provides detailed information on each variable. See section 19.


tabulate and list:
The commands tab varname and list varname may give you “on screen” information on
varname but you have to look at the output and remember what you should look for.
The assert command:
The assert command lets you automate the confirmation process. The command does nothing
if everything is “as expected”, but stops with an error message if the assertion fails (and stops
executing your do-file). Some examples:
Simple arithmetic:

. assert 2+2==4

. assert 2>3
assertion is false
r(9);

If the variable age should contain the age in years (integers) and everyone should be between 20 and 50:

. assert age==int(age) & age>=20 & age<=50

If the variable sex contains the gender of the person as “F” or “M”

. assert sex=="F" | sex=="M"

If datein is the first time an object is observed and dateout is the last time:

. assert datein<dateout

This may fail if datein can be missing. If you want to run the assertion allowing for missing
values of datein:

. assert datein<dateout if datein!=.

When the assertion fails, you can list the illegitimate cases, e.g.

. list id sex if !(sex=="F" | sex=="M")

id sex
19. 19 f
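
In a do-file it can be convenient to list the offending cases automatically instead of stopping at the failed assertion. The following is one possible sketch, not the only way to do it: it relies on capture, which suppresses the error and stores the return code in _rc, and on a small made-up dataset entered with input. The data values are purely illustrative.

```stata
* Made-up illustration data: the third record has a lower-case code
clear
input id str1 sex age
1 "F" 34
2 "M" 27
3 "f" 45
end

* capture keeps the do-file running and stores the return code in _rc
capture assert sex=="F" | sex=="M"
if _rc != 0 {
    * the assertion failed: show the records that violate it
    list id sex if !(sex=="F" | sex=="M")
}
```

Whether you prefer a hard stop (plain assert) or a listing of the problems depends on how the do-file is used; the plain form is the safer default for analysis scripts.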

8 The graph command


We refer to chapter 14 in “Introduction to Stata 8” by Svend Juul, available from
http://www.biostat.au.dk/teaching/software/

9 Stratification using by
by-option:
Many commands in Stata allow or require stratification of your data into groups using the
by-option, e.g.

. gr size, box by(sex)

by-construct:
Other commands allow a preceding by for a stratified analysis, e.g.

. by sex: sum size

In both cases, you have to sort the data first:

. sort sex

There is no common rule for when by-constructs or by-options are allowed. However, this is
always indicated in the syntax description offered by the help command.
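
The sort-then-by sequence can be sketched as follows, together with the bysort shorthand mentioned in section 5. The example assumes Stata 8 or later and the auto example dataset shipped with Stata; foreign, rep78 and mpg are variables of that dataset.

```stata
* Example dataset shipped with Stata (assumes Stata 8 or later)
sysuse auto, clear

* Two steps: sort first, then use the by-construct
sort foreign
by foreign: summarize mpg

* One step: bysort sorts the data and then applies the by-construct
bysort rep78: summarize mpg
```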

10 Generating new variables


The generate command:
You can use the generate command to generate new variables. In the following example, we
generate a variable for body mass index, an indicator of overweight, and an indicator for absence
of fever, emesis and fatigue:

. l

weight height fever emesis fatigue


1. 54 1.73 1 0 1
2. 88 1.81 1 0 0
3. 102 1.77 0 0 0
4. 91 1.91 0 1 0
5. 74 1.66 0 1 1

. gen bmi=weight/height^2

. gen overw=bmi>25

. gen success=(~fever) & (~emesis) & (~fatigue)

. l

weight height fever emesis fatigue bmi overw success


1. 54 1.73 1 0 1 18.0427 0 0
2. 88 1.81 1 0 0 26.86121 1 0
3. 102 1.77 0 0 0 32.55769 1 1
4. 91 1.91 0 1 0 24.94449 0 0
5. 74 1.66 0 1 1 26.85441 1 0

Note: If you want to generate string variables, you have to specify the length of the string. See
Section 18.
The replace command:
If you want to overwrite existing values of a variable, you have to use the replace command instead of
the generate command. For example, if height is recorded in centimeters in the data set, but you
want to have it in meters, you just type

. replace height=height/100

A perhaps unexpected use of replace appears when you try to define a new variable with a
subgroup dependent definition. For example, if the limit for overweight differs between males and
females, typically you use code like

. generate overw=bmi>23 if sex=="m"


. replace overw=bmi>25 if sex=="f"

The reason is that the first statement already fills the variable overw with missing values
for all female subjects, which then have to be replaced by the second statement.
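
The generate/replace pair can be tried out on a few made-up records. The sketch below also shows a one-statement alternative using the cond() function; this alternative is an addition for illustration, and it only agrees with the two-step version when sex takes no values other than "m" and "f". All data values are invented.

```stata
* Made-up illustration data
clear
input str1 sex bmi
"m" 24
"f" 24
"m" 22
"f" 26
end

* Step 1 fills overw for males only; females get missing values
generate overw = bmi>23 if sex=="m"
* Step 2 replaces the missing values for females
replace overw = bmi>25 if sex=="f"

* Alternative (assumption: sex is always "m" or "f"): one statement with cond()
generate overw2 = cond(sex=="m", bmi>23, bmi>25)

* Both definitions agree on these data
assert overw == overw2
list
```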
Overview about available functions and operators:
When generating new variables, you can combine existing variables using many operators and functions.
help operators and help functions give an overview. The most important ones
are summarized in the following list.
. help operators
-------------------------------------------------------------------------------
help for operators (manual: [U] 20 Functions and expressions)
-------------------------------------------------------------------------------

Operators in expressions
------------------------

                                                  Relational
      Arithmetic              Logical         (numeric and string)
 -------------------    ------------------    -------------------
  +   addition            ~   not               >   greater than
  -   subtraction         |   or                <   less than
  *   multiplication      &   and               >=  > or equal
  /   division                                  <=  < or equal
  ^   power                                     ==  equal
                                                ~=  not equal

....

Note that the “equal to” symbol is a doubled equals sign, “==”.


. help functions

-------------------------------------------------------------------------------
help for functions (manual: [R] functions)
-------------------------------------------------------------------------------

...

Mathematical functions
----------------------

abs(x) absolute value


cos(x) cosine of radians
exp(x) exponentiation
ln(x) natural logarithm
log(x) same as ln(x)
log10(x) base 10 logarithm
sin(x) sine of radians
sqrt(x) square root
tan(x) tangent of radians
...

See also section 21.

11 Creating subsamples
There are two ways in which you can create subsamples. You can select a subset of your variables (vertical
selection) or you can select a subset of your observations (horizontal selection). For both procedures we
have the commands drop and keep.
For variables:
The data set has three variables ID, sex and income.
. drop income

which produces the same result as

. keep ID sex

For observations:
Drop all observations associated with female individuals (the code “f” in the variable sex indicates
a female)
. drop if sex=="f"

which produces the same result as

. keep if sex~="f"

The consequence of these commands is that the dataset in memory is permanently changed. The dataset on
disk is not affected until you issue the command save dataname, replace. To save under a new filename, type
save newdataname
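
A complete subsample workflow might look like the sketch below. It assumes Stata 8 or later and the auto example dataset shipped with Stata; the output file name foreigncars is hypothetical and the file is written to the current working directory.

```stata
* Example dataset shipped with Stata (assumes Stata 8 or later)
sysuse auto, clear

* Vertical selection: keep three variables
keep make price foreign

* Horizontal selection: keep only observations with foreign equal to 1
keep if foreign==1
count

* Save the subsample under a new (hypothetical) name;
* the original dataset on disk is left unchanged
save foreigncars, replace
```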

12 Making tables in Stata


The tabstat command:
You use tabstat when you want to display a series of summary statistics for one or several
variables.
tabstat varlist [, statistics(statname [...]) by(varname) columns(var|stat) long ]

where statname [...] are the summary statistics that you want to display.
. tabstat erateWL, s(n mean sd)

variable | N mean sd
-------------+------------------------------
erateWL | 170 .19375 .1836576
--------------------------------------------

If you want separate summary statistics for each group defined by varname you should use the
options by(varname) c(s) lo.
. tabstat erateS erateWL, s(n mean sd) c(s) by(gender) lo

gender variable | N mean sd


--------------------+------------------------------
Female erateS | 89 .5580524 .2354242
erateWL | 89 .1685393 .1723018
--------------------+------------------------------
Male erateS | 81 .5925926 .2487279
erateWL | 81 .2214506 .1926505
--------------------+------------------------------
Total erateS | 170 .5745098 .2417539
erateWL | 170 .19375 .1836576
---------------------------------------------------

See more details in help tabstat.
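
Since the dataset behind the output above is not available here, the same tabstat call can be tried on the auto example dataset shipped with Stata (assuming Stata 8 or later). The options are written out in full; s(), c(s) and lo in the text are their abbreviations.

```stata
* Example dataset shipped with Stata (assumes Stata 8 or later)
sysuse auto, clear

* Summary statistics for two variables, separately for each level of foreign
tabstat price mpg, statistics(n mean sd) by(foreign) columns(statistics) longstub
```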
The table command:
You use table when you want to display a series of summary statistics for each level of another
variable.
table rowvar [colvar [supercolvar] ...] [, contents(clist) row col [options] ]

The philosophy behind the syntax is that we want a table where for each value in the variable
rowvar (and colvar and supercolvar) the cell contains clist with layout format given
in options, where clist is a list of summary statistics on further variables. The option row adds
a row with the totals over all values of rowvar (col does the same for columns).
For details on the format options see help table.
. table treat, c(n dec med dec p5 dec p95 dec)

----------+-----------------------------------------------------------
treat | N(decrease) med(decrease) p5(decrease) p95(decrease)
----------+-----------------------------------------------------------
1 | 205 5.211085 -10.97878 23.59735
2 | 204 16.30814 -2.117609 33.61396
3 | 204 13.19776 -25.15851 30.93353
----------+-----------------------------------------------------------
The tabulate command:
You use the tabulate command when you want to investigate the association between two (or
more) variables.
tabulate varname1 varname2 [, all cell chi2 column exact gamma lrchi2 row taub V ...]

The interpretation of the syntax is that we tabulate the frequency count of varname1 versus
varname2 with various measures of association, including the common Pearson chi-squared, the
likelihood ratio chi-squared, Cramer’s V, Fisher’s exact test, Goodman and Kruskal’s gamma, and
Kendall’s tau-b.
. tab res treat, chi2

| treat
result | 1 2 3 | Total
-----------+---------------------------------+----------
1 | 74 21 56 | 151
2 | 71 47 35 | 153
3 | 36 57 52 | 145
4 | 24 79 61 | 164
-----------+---------------------------------+----------
Total | 205 204 204 | 613

Pearson chi2(6) = 75.7134 Pr = 0.000

It is possible to combine tabulate with summarize to obtain table-like output in a fast way.
. tab treat, summarize(dec)

| Summary of decrease
treat | Mean Std. Dev. Freq.
------------+------------------------------------
1 | 5.6048431 11.082792 205
2 | 15.710805 11.359821 204
3 | 9.3633245 17.387196 204
------------+------------------------------------
Total | 10.218785 14.193435 613

See help tabsum.


Specialized tables:
There exist a number of “table” commands designed for specific purposes: for epidemiologic
data see help epitab, for cross-sectional time-dependent data (also called panel data) see
help xttab, and for survival data see help ltable.

13 Categorization of variables
In many medical applications continuous variables are reduced to variables with a few categories like
“low”, “middle” and “high”. Stata supports this step with several functions and commands.
Categorizing a variable at specific cutpoints using the recode function:
If you want to categorize a variable at specific cutpoints, you can use the recode function as
in the following example. The new variable assigns to each value the upper limit of the interval
in which the value falls. Note that, to obtain the desired result, you have to ensure that the last
specified cutpoint is not smaller than the maximal value in your dataset (see generation of
catvar1). In general, the last specified value in the arguments of recode is not the last cutpoint,
but the value assigned to each value larger than the last but one argument. This property is used in
generating catvar2 to assign a missing value to all values larger than 110.

. list

var
1. 23
2. 56
3. 67
4. 123
5. 99
6. 17

. gen catvar1=recode(var,50,100,150)

. gen catvar2=recode(var,40,60,80,110,.)
(1 missing value generated)

. list

var catvar1 catvar2


1. 23 50 40
2. 56 100 60
3. 67 100 80
4. 123 150 .
5. 99 100 110
6. 17 50 40

If you want to recode the values of the grouped variable, you can use the recode command, or
you can use the egen command with the group function, which assigns the values 1, 2, 3 etc. to
the smallest, the next smallest etc. value. Both are illustrated in continuing our example:

. list

var catvar1 catvar2


1. 23 50 40
2. 56 100 60
3. 67 100 80
4. 123 150 .
5. 99 100 110
6. 17 50 40

. egen catvarg1=group(catvar1)

. recode catvar2 40=1 60=2 80=3 110=4


(5 changes made)

. list

var catvar1 catvar2 catvarg1
1. 23 50 1 1
2. 17 50 1 1
3. 56 100 2 2
4. 67 100 3 2
5. 99 100 4 2
6. 123 150 . 3

Note that using the group function implies that data are reordered.

Categorizing a variable at equidistant cutpoints using the autocode function:


autocode is an automated version of recode, which you can use if the cutpoints are equidistant.
You then only have to specify the number of intervals, the smallest cutpoint and the largest
cutpoint. Note that all values larger than the largest cutpoint get assigned the largest cutpoint, so
you should ensure that the largest cutpoint is larger than the maximal value in your dataset.
As categorization and recoding are always error-prone, you should always check the
result, for example by a cross tabulation. This is also illustrated in the following example.

. list

var
1. 23
2. 56
3. 67
4. 123
5. 99
6. 17

. gen catvar=autocode(var,5,0,100)

. list

var catvar
1. 23 40
2. 56 60
3. 67 80
4. 123 100
5. 99 100
6. 17 20

. tab var catvar, missing

| catvar
var | 20 40 60 80 100 | Total
-----------+-------------------------------------------------------+----------
17 | 1 0 0 0 0 | 1
23 | 0 1 0 0 0 | 1
56 | 0 0 1 0 0 | 1
67 | 0 0 0 1 0 | 1
99 | 0 0 0 0 1 | 1
123 | 0 0 0 0 1 | 1
-----------+-------------------------------------------------------+----------
Total | 1 1 1 1 2 | 6
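
With many distinct values a cross tabulation becomes unwieldy. As a complementary check (a sketch, not part of the original example), one can list the observations whose value exceeds the category they were assigned to; these are exactly the values beyond the largest cutpoint. The data are the six values used above.

```stata
* The six example values used in this section
clear
input var
23
56
67
123
99
17
end

generate catvar = autocode(var,5,0,100)

* Values above the largest cutpoint (100) are silently assigned to it:
* this lists them
list var catvar if var > catvar

* Within the covered range every value lies at or below its category limit
assert var <= catvar if var <= 100
```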

Categorizing a variable in groups of equal size using xtile:


The xtile command creates a new variable categorizing an existing variable in groups of (approximately)
equal size. The number of groups has to be specified using the nq option. This is
illustrated in the following example:

. list

var
1. 23
2. 56
3. 67
4. 123
5. 99
6. 17

. xtile cat2=var, nq(2)

. xtile cat3=var, nq(3)

. xtile cat4=var, nq(4)

. list

var cat2 cat3 cat4


1. 17 1 1 1
2. 23 1 1 1
3. 56 1 2 2
4. 67 2 2 3
5. 99 2 3 3
6. 123 2 3 4

Note that xtile reorders the dataset.

One can also use xtile to categorize at cutpoints defined by another variable. Combining it with
pctile makes it possible to categorize at percentiles of subgroups. For further details try help xtile
and look into the Stata reference manual.

14 Using Stata as a pocket calculator: The display command


The display command allows you to type in expressions and to look at the results. You can use all
operators and functions defined in Stata. Typical examples look like these:
. di 3+4
7

. di 10.6 - 2 * 7.35
-4.1

. di 3^4
81

. di (2.1 + 2.3)/(4.1 + 47.3)


.08560311

. di 2+3, 2+5.6, 3+6


5 7.6 9

. di 23.4-invnorm(0.995)*12.3, 23.4 + invnorm(0.995)*12.3


-8.2827004 55.0827
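
Intermediate results can be stored in scalars and shown with a display format, which makes longer calculations easier to read. The sketch below repeats the last calculation; the scalar names ci_lo and ci_hi are hypothetical, and invnorm() is the function name used in this text (newer Stata releases document it as invnormal()).

```stata
* Store the two limits in scalars (names are hypothetical)
scalar ci_lo = 23.4 - invnorm(0.995)*12.3
scalar ci_hi = 23.4 + invnorm(0.995)*12.3

* Mix text and formatted numbers in one display statement
display "interval: " %6.2f ci_lo " to " %6.2f ci_hi
```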

15 Loops in Stata
The for command:
You can execute a series of Stata commands with the command for. Example:

. for num 1/5: replace varX=varX/1000


The index X is substituted in each loop. num tells Stata that we use numerical values for X. 1/5
is the list of values 1 2 3 4 5. The ‘:’ indicates that the Stata commands to be executed in each
step of the loop follow.
It is possible to have several indices (separated by \). For example, I may wish to keep var1-var5
and have new variables var11-var15 in kilo scale.

. for num 1/5 \ num 11/15: generate varY=varX/1000

where \ tells Stata that a second index Y starts here.
Further, we may nest a for-loop within another for-loop to obtain matrix-form repetitions (see footnote 1).

. for A in num 1/5: for B in num 1/5: gen varAB=varA*varB

would generate 25 variables var11, var12, ..., var55.
If you use for combined with graph, remember the pause option. See also help foreach
and help forvalues in Stata 7. See the manual for further details.
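
The same loops can be written with forvalues and foreach, which the text points to. The following sketch assumes Stata 7 or later and is meant to be run from a do-file; it first creates made-up variables var1-var5 so that it is self-contained.

```stata
* Made-up data: three observations, variables var1-var5
clear
set obs 3
forvalues i = 1/5 {
    generate var`i' = `i'*1000
}

* Two indices: keep var1-var5 and create var11-var15 in kilo scale
forvalues i = 1/5 {
    local j = `i' + 10
    generate var`j' = var`i'/1000
}

* One index over a variable list: rescale var1-var5 in place
foreach v of varlist var1-var5 {
    replace `v' = `v'/1000
}

list
```

The local macro `i' plays the role of the index X in the for command; the second index is computed explicitly instead of being given as a parallel list.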

16 Working with do-files


What is a do-file:
A do-file is a flat text file (i.e. ASCII format) containing Stata commands.
Creating a do-file:
Open the “Do file editor”. Type in the commands you would ordinarily type in the “Command”
window. The editor is similar to for example “NotePad”.
Executing a do-file:
Press the Do button (number two top right).

Debugging a do-file:
Read the error messages. If this doesn’t help, try the command set trace on, which gives very
detailed information on command execution. It is reset to its original setting by set trace
off. The command set trace on places a “-” in front of each line that is executed. The last
line without a “-” sign contains the error. Often useful in combination with set more off.
Why use do-files:
For two reasons:

1. Gives you the option of modifying and re-running your commands, i.e. it is a time saver (in
the long run...).
2. Provides you with documentation on just how you arrived at your precious conclusions.
Comments in do-files:
It is fruitful to write comments to yourself or any reader in your do-files. You write comments by
beginning the line with an asterisk *, then Stata will ignore whatever is in that line.
A nice do file looks like:
log using filename, replace
* This do-file is an example
use data, clear
describe
... some other commands
log close
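
A slightly more defensive variant of such a do-file is sketched below. The additions are common conventions rather than requirements: capture log close avoids an error when a log is still open from an interrupted run, and set more off prevents the output from pausing. The file name example and the use of the auto dataset (shipped with Stata, loaded with sysuse in Stata 8 or later) are assumptions for illustration.

```stata
* example.do: a hypothetical do-file skeleton
* Close a log that may still be open from an earlier, interrupted run
capture log close
* Do not pause the output after each screenful
set more off

log using example, replace

* Example dataset shipped with Stata (assumes Stata 8 or later)
sysuse auto, clear
describe
summarize price mpg

log close
```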

Footnote 1 (Section 15, nested for-loops): This feature is new in Stata 6.

17 Reshaping datasets
Reshaping wide datasets:
Suppose you have the following dataset with measurements of nausea on 3 consecutive days after
chemotherapy:

. list in 1/3

id sex nausea1 nausea2 nausea3


1. 1 m 78 56 34
2. 2 f 83 45 67
3. 3 m 27 22 22

You would like to investigate the increase over time by a regression model. For this, you need a
data set, where each line corresponds to one day of one individual. You can use the reshape
command to achieve this:

. reshape long nausea, i(id) j(day)

. list in 1/9

id day sex nausea


1. 1 1 m 78
2. 1 2 m 56
3. 1 3 m 34
4. 2 1 f 83
5. 2 2 f 45
6. 2 3 f 67
7. 3 1 m 27
8. 3 2 m 22
9. 3 3 m 22

.
. regress nausea day, cluster(id)

In Stata’s terminology, you have changed a dataset from wide format to long format.
Note: The i-option specifies the logical unit, whereas the j-option specifies the variable which
indicates observations within a unit.
Reshaping long datasets:
Suppose you have the following dataset with measurements of nausea on 3 consecutive days after
chemotherapy:

id day sex nausea


1. 1 1 m 78
2. 1 2 m 56
3. 1 3 m 34
4. 2 1 f 83
5. 2 2 f 45
6. 2 3 f 67
7. 3 1 m 27
8. 3 2 m 22
9. 3 3 m 22

You would like to make a scatterplot of the measurement on day 2 versus the measurement on day
1. For this you need a dataset where you have the variables nausea1 and nausea2. You can use the
reshape command to achieve this:

. reshape wide nausea, i(id) j(day)

. list in 1/3

id nausea1 nausea2 nausea3 sex


1. 1 78 56 34 m
2. 2 83 45 67 f
3. 3 27 22 22 m

. gr nausea2 nausea1, twoway


.

In Stata’s terminology, you have changed a dataset from long format to wide format.
Note: If you switch from long to wide format, all variables not used as arguments for reshape must
be constant within each unit specified by the i-option. Otherwise, you get an error message.
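
Both directions can be tried in one go on the three example records shown above. The sketch enters the data with input so that it is self-contained; the sepby() option of list is only used to make the units visible and assumes Stata 8 or later.

```stata
* The three example records in wide format
clear
input id str1 sex nausea1 nausea2 nausea3
1 "m" 78 56 34
2 "f" 83 45 67
3 "m" 27 22 22
end

* Wide to long: one line per individual and day
reshape long nausea, i(id) j(day)
list, sepby(id)

* Long back to wide: sex is constant within id, so this works
reshape wide nausea, i(id) j(day)
list
```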
Reshaping several variables simultaneously with nonnumeric suffixes:
In reshaping datasets, the variables can also have nonnumeric suffixes, for example left and
right. In this case you have to specify the string option. You can also reshape several variables
simultaneously. Both are illustrated in the following example:

. list in 1/2

id sex eyeleft eyeright earleft earright


1. 1 m 1 1 0 0
2. 2 f 1 0 1 0

. reshape long eye ear, i(id) j(side) string

. list in 1/4

id side sex eye ear


1. 1 left m 1 0
2. 1 right m 1 0
3. 2 left f 1 1
4. 2 right f 0 0

. reshape wide eye ear, i(id) j(side) string

. list in 1/2

id eyeleft earleft eyeright earright sex


1. 1 1 0 1 0 m
2. 2 1 1 0 0 f

You can use the reshape command also for more complex situations. Take a look at the Stata
Reference Manual.

18 Working with string variables


Generating string variables:
If you want to generate a new string variable, you have to specify the length of the variable in the
generate statement, e.g.

. gen str3 s="abc"

Operations on strings:
If you want to concatenate strings, you can use the + operator:

. l

treat group
1. A 2
2. A 1

. gen str3 tr_gr=treat+" "+group

. l

treat group tr_gr


1. A 2 A 2
2. A 1 A 1

There exist many functions for working with strings, especially for switching from numbers to strings
and vice versa.

. help functions

-------------------------------------------------------------------------------
help for functions (manual: [R] functions)
-------------------------------------------------------------------------------

....

String functions
----------------

index(s1,s2) --- returns position in s1 in which s2 is first found or 0 if


s1 does not contain s2
length(s) --- returns length of string s
lower(s) --- returns lowercased variant of s
ltrim(s) --- returns s with leading blanks removed
real(s) --- converts s into a numeric value
rtrim(s) --- returns s with trailing blanks removed
string(n) --- converts n into a string
string(n,%fmt) --- converts n into a string with %fmt display format
substr(s,n1,n2) --- returns the substring of s starting at n1 for a length of
n2; if n1<0, starting position is interpreted as distance
from end of string; if n2==., the remaining portion of the
string is returned
trim(s) --- returns s with leading and trailing blanks removed
upper(s) --- returns uppercased variant of s

....

See also section 20.
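As a minimal sketch of how these functions combine, the following do-file lines split a code such as "ab-12" into its letter part and its number part. The variable names (code, clean, pos, prefix, num) and the data values are made up for illustration and do not come from the course material.

```stata
* Hypothetical data: codes with stray leading or trailing blanks
clear
input str8 code
"  ab-12 "
" cd-7"
end

* remove leading and trailing blanks
gen str8 clean = trim(code)
* position of the hyphen within the cleaned string
gen pos = index(clean, "-")
* letters before the hyphen, converted to upper case
gen str2 prefix = upper(substr(clean, 1, pos-1))
* everything after the hyphen, converted to a number
gen num = real(substr(clean, pos+1, .))
list
```

Here substr(clean, pos+1, .) relies on the rule from the help text that n2==. returns the remaining portion of the string.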

19 Labels
Labelling an existing variable:
If a variable is coded by numerical values, it is often useful to have the meaning of the values and
not the values themselves in tabulations and listings. You can achieve this by assigning labels to
the variable values using the label command:

. list

sex age
1. 0 17
2. 1 23

. label define labsex 0 male 1 female

. label values sex labsex

.
. list

sex age
1. male 17
2. female 23

Note: The labels only affect how the values are displayed. Internally, the values are still stored as
numbers, so you can only use sex as a numeric variable.
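The following sketch rebuilds the small example above and shows what this means in practice: conditions have to refer to the stored codes, not to the label text. The commented-out line is the kind of statement that would fail.

```stata
* Rebuild the two-observation example from the text
clear
input sex age
0 17
1 23
end
label define labsex 0 male 1 female
label values sex labsex

* Conditions must use the stored numeric codes
list if sex == 1

* The next line would stop with a type mismatch error, because sex is numeric:
* list if sex == "female"
```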
Distinguishing values and labels:
Once a variable is labelled, it can be difficult to find out what the real values are. The
codebook command always shows both the values and the labels:

. codebook sex

sex --------------------------------------------------------------- (unlabeled)


type: numeric (float)
label: labsex

range: [0,1] units: 1


unique values: 2 coded missing: 0 / 2

tabulation: Freq. Numeric Label


1 0 male
1 1 female

Note: If you import datasets from other systems, for example using StatTransfer, values are often
already labeled. Hence it is always a good idea to use codebook in the beginning.
Note: Some commands, for example list and tabulate, accept a nolabel option, which shows
the values instead of the labels.

20 Switching between labels, strings and numbers


Labels and Strings:
Sometimes, you would like to use the labels of a variable as strings, for example if you want to
create a new variable by concatenating. This is done by the decode command, and encode does
the opposite:

. list

sex age
1. male 17
2. female 23

. list, nolabel

sex age
1. 0 17
2. 1 23

. decode sex, gen(sexstr)

. list

sex age sexstr
1. male 17 male
2. female 23 female

. encode sexstr, gen(gender)

.
. list

sex age sexstr gender


1. male 17 male male
2. female 23 female female

. codebook gender

gender ------------------------------------------------------------ (unlabeled)


type: numeric (long)
label: gender

range: [1,2] units: 1


unique values: 2 coded missing: 0 / 2

tabulation: Freq. Numeric Label


1 1 female
1 2 male
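Note that in the codebook output above encode has numbered the strings alphabetically (female=1, male=2), which differs from the original coding of sex (male=0, female=1). If the original coding should be kept, one possible approach is to define the value label first and pass it to encode through its label() option. This is a sketch under the assumption that your Stata version supports label() as documented in the encode help; the variable names are taken from the example above.

```stata
clear
input str6 sexstr
"male"
"female"
end

* Define the coding we want to keep before calling encode
label define labsex 0 male 1 female

* label(labsex) asks encode to use the existing mapping
* instead of building a new alphabetical one
encode sexstr, gen(gender) label(labsex)
list, nolabel
```

It is worth checking the result with codebook gender afterwards, as recommended in section 19.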

Strings and Numbers:


The string function converts numbers to strings, and the real function converts strings to
numbers.

. list

sex age
1. female 23
2. male 17

. gen str2 agestr=string(age)

. gen years=real(agestr)

. list

sex age agestr years


1. female 23 23 23
2. male 17 17 17

. describe

Contains data
obs: 2
vars: 4
size: 56 (98.5% of memory free)
-------------------------------------------------------------------------------
1. sex float %9.0g labsex
2. age float %9.0g
3. agestr str2 %9s
4. years float %9.0g
-------------------------------------------------------------------------------
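Two details of these functions are easy to overlook and can be sketched as follows: string() accepts a display format as second argument, and real() returns a missing value when the string does not contain a number. The variable names agefmt and bad and the text "n/a" are illustrative assumptions.

```stata
clear
input age
23
17
end

* string() with a display format: one decimal place
gen str4 agefmt = string(age, "%4.1f")

* real() yields missing (.) if the string is not a number
gen bad = real("n/a")
list
```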

21 Creating variables with statistics
It is often necessary for an analysis to prepare the dataset by computing new variables with statistics,
for example the maximum value observed during a day or subject specific mean values. The following
illustrates some typical tools for this task.
Computing statistics over several variables using egen:
The egen command offers functions like rmax or rmean to compute a maximum or a mean
“rowwise”. This is illustrated in the following example, where we have for each subject and each
day a measurement at 6 o’clock, 12 o’clock and 18 o’clock. We can use rmax to compute the
maximum within each day:

. list in 1/6

subj day val6 val12 val18


1. 1 1 23.5 34.3 22.9
2. 1 2 25.8 33.6 27.8
3. 1 3 12.8 18.9 22.3
4. 2 1 14.5 17.9 22.8
5. 2 2 19.8 17.3 15.4
6. 2 3 33.9 30.3 27.8

. egen maxv=rmax(val6 val12 val18)

. list in 1/6

subj day val6 val12 val18 maxv


1. 1 1 23.5 34.3 22.9 34.3
2. 1 2 25.8 33.6 27.8 33.6
3. 1 3 12.8 18.9 22.3 22.3
4. 2 1 14.5 17.9 22.8 22.8
5. 2 2 19.8 17.3 15.4 19.8
6. 2 3 33.9 30.3 27.8 33.9

For this type of task egen offers the functions rmax, rmin, rmean, rsum, rsd and
robs, where robs gives the number of nonmissing values. Note that these functions
expect a list of variables separated by blanks. Do not confuse them with the functions mean,
min, max etc., which egen also offers for other purposes.
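The row functions skip missing values, which is where robs becomes useful. The sketch below uses two rows from the example with one measurement set to missing (an assumption made for illustration). It uses the function names given in this text; newer Stata releases document the same functions as rowmean and rownonmiss.

```stata
clear
input subj day val6 val12 val18
1 1 23.5 34.3 22.9
1 2 25.8    . 27.8
end

* mean over the nonmissing measurements of each row
egen meanv = rmean(val6 val12 val18)
* number of nonmissing measurements in each row
egen nobs = robs(val6 val12 val18)
list
```

In the second row the mean is based on two values only, and nobs records this.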
Computing statistics over several observations using collapse:
The collapse command allows you to compute statistics from groups of observations. Looking
at the last example, we might now be interested in taking the average over three days for each
subject. This can be done in the following way:

. list in 1/6

subj day val6 val12 val18 maxv


1. 1 1 23.5 34.3 22.9 34.3
2. 1 2 25.8 33.6 27.8 33.6
3. 1 3 12.8 18.9 22.3 22.3
4. 2 1 14.5 17.9 22.8 22.8
5. 2 2 19.8 17.3 15.4 19.8
6. 2 3 33.9 30.3 27.8 33.9

. collapse (mean) meanmax=maxv, by(subj)

. list in 1/2

subj meanmax
1. 1 30.06667
2. 2 25.5

You can generate several statistics simultaneously. For example,

. collapse (min) minval6=val6 (max) maxval6=val6, by(subj)

generates the minimum and maximum of the measurements at 6 o’clock over the three days for each
subject. Other statistics offered by collapse are median, sd, sum, iqr and all percentiles.
Note: If you have variables that are constant within the unit you would like to collapse, and
that you want to keep in the new dataset (for example the age and sex of a subject), you can
include them in the by-option. (For example: collapse ..., by(subj age sex))
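A small self-contained sketch of both points, several statistics at once and constant variables kept through by(), might look as follows. The val6 values are those of the example above; the age and sex values are invented for illustration.

```stata
clear
input subj day val6 age str1 sex
1 1 23.5 34 "m"
1 2 25.8 34 "m"
1 3 12.8 34 "m"
2 1 14.5 51 "f"
2 2 19.8 51 "f"
2 3 33.9 51 "f"
end

* one observation per subject; age and sex survive because they are in by()
collapse (min) minval6=val6 (max) maxval6=val6, by(subj age sex)
list
```

Keep in mind that collapse replaces the dataset in memory, so save your data first (see section 6).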

Computing statistics over several observations using egen:


Sometimes it is necessary to generate statistics over observations without reducing the dataset,
for example if you want to compare single values with subject specific mean values. The egen
command together with a by-option allows you to do this easily. In the following
example we have 6 measurements for each subject, and we would like to compare the values with
the subject specific means in order to check when a subject has a high or low value. This
can be done in the following way:

. list in 1/12

subj time value


1. 1 1 17.9
2. 1 2 23.7
3. 1 3 45.8
4. 1 4 37.2
5. 1 5 19.4
6. 1 6 20.8
7. 2 1 44.5
8. 2 2 48.7
9. 2 3 52.1
10. 2 4 46.7
11. 2 5 44.5
12. 2 6 40.3

. egen meanval=mean(value), by(subj)

. gen high=value>meanval

. list in 1/12

subj time value meanval high


1. 1 1 17.9 27.46667 0
2. 1 2 23.7 27.46667 0
3. 1 3 45.8 27.46667 1
4. 1 4 37.2 27.46667 1
5. 1 5 19.4 27.46667 0
6. 1 6 20.8 27.46667 0
7. 2 1 44.5 46.13333 0
8. 2 2 48.7 46.13333 1
9. 2 3 52.1 46.13333 1
10. 2 4 46.7 46.13333 1
11. 2 5 44.5 46.13333 0
12. 2 6 40.3 46.13333 0

egen also offers functions like min, max, median, sd, iqr, rank, sum and functions
for percentiles. A typical use of egen is standardizing a variable to the range 0-1 for each
subject. This looks like

. egen min=min(var), by(subject)
. egen max=max(var), by(subject)
. gen standvar=(var-min)/(max-min)
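The standardization can be tried on a few rows of the example data. This sketch uses the variable names subj and value from the example above and adds one precaution that is an assumption of this sketch, not part of the original recipe: the if qualifier leaves standvar missing when all values of a subject are equal, so that no division by zero is attempted.

```stata
clear
input subj value
1 17.9
1 23.7
1 45.8
2 44.5
2 52.1
2 40.3
end

egen minv = min(value), by(subj)
egen maxv = max(value), by(subj)
* smallest value of a subject becomes 0, largest becomes 1
gen standvar = (value - minv)/(maxv - minv) if maxv > minv
list
```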

22 Survival analysis commands
A characteristic feature of survival data is the presence of censoring and left truncation. Without censoring
and truncation the data are represented by the survival time variable $T$, which measures the duration of time
between the initial event and the final event. In the presence of censoring and truncation more variables are
required to represent the incomplete observation of the survival time $T$.

With censoring at time $C$ (e.g. end of followup) it is only possible to observe if the final event occurs
before time $C$. The final event indicator $D$ is equal to $1$ if $T \le C$ (i.e. uncensored observation) and it is
equal to $0$ if $T > C$ (i.e. censored observation). The censored survival time $\tilde{T}$ is equal to $T$ if $T \le C$
and is equal to $C$ if $T > C$. With left truncation at time $V$ the censored observations are only
observed if $\tilde{T} > V$ (otherwise no information is collected). Consequently, under right censoring and left
truncation the survival time is represented by three variables $(\tilde{T}, D, V)$. In Stata datasets these variables
are usually called time, event and time0 respectively. If all subjects enter at time $0$ (i.e. $V=0$) the
respective variable time0 may be omitted in the dataset.

Prepare the dataset for analysis:


In order to avoid entering the three variable names representing the survival time observations in
each survival analysis command, Stata requires an extra step before any survival analysis command
may be executed. This step is carried out using the stset command:
stset time, failure(event) enter(time0)
This ensures that the variables time, event and time0 will be used automatically by Stata
in all subsequent survival analysis commands to represent the censored observations. When all
subjects enter at time 0, the enter() option may be omitted:
stset time, failure(event)
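To make the roles of the variables concrete, the following sketch builds time and event from a survival time and a censoring time and then declares the data with stset. The variables t and c and their values are hypothetical; in real data you normally receive time and event directly, since the survival time of a censored subject is unknown.

```stata
clear
* t = survival time, c = censoring time (hypothetical values)
input t c
5 10
12 10
7 7
20 15
end

* observed time: the smaller of survival time and censoring time
gen time = min(t, c)
* event indicator: 1 if the final event was observed, 0 if censored
gen event = t <= c

stset time, failure(event)
list t c time event
```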
Kaplan-Meier plot:
The sts graph command will produce graphs of Kaplan-Meier estimates of the survival func-
tion:
sts graph

Kaplan-Meier plots with 95% CI:


The sts graph command may be combined with the by(indepvar) option to produce separate
Kaplan-Meier plots for subgroups of the data specified by the different values of indepvar. The
gwood option may be used to add pointwise 95% confidence intervals to the plots.
sts graph, by(indepvar) gwood

Kaplan-Meier at age 200:
The sts list command tabulates the Kaplan-Meier estimate at the specified time points:
sts list, at(200 201)
Note the argument at(200 201), where two time values are required, because without
“201”, at(200) will tabulate the Kaplan-Meier estimator at 200 equidistant time points.

Estimate median survival:


The stci command produces median estimates along with confidence intervals:
stci, median by(group)
Logrank test:
The sts test command may be used to compare survival in two or more groups. The groups
are defined by distinct values of the indepvar variable. The logrank option specifies that the
logrank test (the default) is to be used for the comparison:
sts test indepvar, logrank

Cox regression:
The stcox command is used to carry out analysis using the Cox regression model:
stcox indepvar1 indepvar2 ... indepvarN
This will report hazard ratio estimates. To produce estimates of regression coefficients the nohr
option may be used.
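The difference between the two forms of output can be inspected with the following sketch. It assumes a Stata version that provides the sysuse command and the example dataset cancer.dta shipped with Stata (with the variables studytime, died, age and drug); neither is part of the course material.

```stata
* cancer.dta is an example dataset shipped with Stata (assumption: sysuse is available)
sysuse cancer, clear
stset studytime, failure(died)

* default output: hazard ratios
stcox age drug

* same model, reported as regression coefficients
stcox age drug, nohr
```

The hazard ratios in the first table are the exponentiated coefficients of the second table.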

Increase memory size:
Sometimes the extra variables created by the stset command do not fit in the available memory.
In this case see section 5 for commands to increase the memory size. Note that you will have to
reload and re-stset the dataset after this operation.

23 Online facilities
Stata is web-aware in the sense that it offers commands that allow you to update and enhance your Stata
version, if you are connected to the Internet. The most important commands are:
update:
Typing update will give an overview of when your Stata system was last updated. The command
update query will check whether or not your Stata would benefit from an update. Finally you
can execute the command update all to update both your ado-files and executable.
findit:
In up-to-date Stata 7.0 the command findit will search all relevant Internet sites for Stata
material containing your search word. For example:

. findit smooth
13 Sep 2002 13:53:35
Keyword search
--------------

Keywords: smooth
Search: (1) Official help files, FAQs, and STBs
(2) Web resources from Stata and from other users

Search of official help files, FAQs, and STBs


---------------------------------------------

[R] kdensity . . . . . . . . . . . . Univariate kernel density estimation


(help kdensity)

[R] ksm . . . . . . . . . . . . . . . . . . . Smoothing including lowess


(help ksm)

<...cut...>

Example . Applied Survival Analysis: Regression Modeling of Time to Event Data


. . . . . . . . . . . . . . . . . . UCLA Academic Technology Services
9/01 http://www.ats.ucla.edu/stat/books/asa/default.htm
examples from the book Applied Survival Analysis:
Regression Modeling of Time to Event Data
by David W. Hosmer, Jr. and Stanley Lemeshow

<...cut...>

STB-53 sg128 . . . Some programs for growth estimation in fisheries biology


. . . Salgado-Ugarte, Martinez-Ramirez, Gomez-Marquez, & Pena-Mendoza
(help bevholt, fordwal, gullholt, gullplot, nlvbgf, ... if installed)
1/00 pp.35--47; STB Reprints Vol 9, pp.278--293
programs to estimate and plot the von Bertalanffy growth
function

STB-41 gr27 . . . . . . . . . An adaptive variable span running line smoother


(help autosmoo if installed) . . . . . . . . . . . . . . . P. Sasieni
1/98 pp.4--7; STB Reprints Vol 7, pp.63--68
smooths yvar on xvar where the smooth is a running line fit
with a variable span

<...cut...>

Web resources from Stata and other users
----------------------------------------

(contacting http://www.stata.com)

14 packages found (STB omitted)


-------------------------------

sthaz from http://www.sun.rhbnc.ac.uk/~uhss021/stata


sthaz. Smoothed hazard (transition/failure) rate plots. / Program by
Kenneth L. Simons. / Compute nonparametric estimates of smoothed hazard
rates, and create graphs / of the results. The program also can compute
and graph standard errors and / confidence bounds. The estimates use

hazplot from http://www.sun.rhbnc.ac.uk/~uhss021/stata


hazplot. Smoothed hazard (transition/failure) rate plots. / Program by
Kenneth L. Simons. / hazplot plots hazard rates or smoothed hazard rates.
It works only on data in / panel form with integer time variables, and the
data must have been stset / using the time0() option. For example, you

<...cut...>

6 references found in tables of contents


----------------------------------------

http://www.sun.rhbnc.ac.uk/~uhss021/stata/
Materials by Kenneth L. Simons / Here are assorted utilities for Stata. /
Check dummy (indicator) variables to ensure they are okay / Distance
between latitude & longitude coordinates / Count data points in a
geographic radius of each point / Create data points for extra geographic

http://www.stata.com/users/njc/
Materials by Nicholas J. Cox, University of Durham / Nicholas J. Cox
<N.J.Cox@durham.ac.uk> is a geographer at the University / of Durham and
a frequent contributor to Statalist. His areas of interest / include
graphics, smoothing, probability distributions, circular statistics, / and

<...cut...>

(end of search)

First you see what is in the reference manual, on the Stata FAQ pages, and in the STB, where STB
refers to the “Stata Technical Bulletin”, which is a journal where various enhancements (ado-files)
are published with examples of their use. Next you get results from searching the web resources
for user written resources.

Installation:
To install a specific package you found with findit just follow the blue clickable links.

24 How to find a statistical method


The following list should give you some hints as to where you can find specific statistical methods. Note
that Stata offers many more methods than shown in this list. The list should only help you to find the
corresponding Stata command. Hint: a lot of tables and simple calculations for epidemiologists are to be
found under epitab.

Description                               Stata-command
ANOVA                                     anova or oneway
χ²-test for contingency tables            tabulate var1 var2, chi, see also epitab
confidence intervals for                  ci or cii (immediate form)
  means
  proportions
  probabilities
  percentiles                             centile
contingency tables                        tabulate
correlation
  Spearman                                spearman var1 var2
  Pearson                                 pwcorr [varlist] or correlate [varlist]
cumulative distribution function          cdf from STB
Cox regression                            stcox indepvars
Fisher’s exact test                       tabulate var1 var2, exact, see also epitab
Friedman                                  friedman from STB, try search friedman
four fold table                           tabulate, see also epitab
interrater agreement test                 kappa var1 var2
Kaplan-Meier curves                       sts graph
kappa                                     kappa var1 var2
Kruskal-Wallis test                       kwallis
likelihood ratio test                     lrtest
linear regression                         regress depvar [varlist]
logistic regression                       logistic depvar [varlist]
log rank test                             sts test indepvar, logrank
Mann-Whitney two sample test              ranksum
mean, median, sd                          summarize [varlist]
                                          or table
meta analysis                             meta from STB, try search meta
McNemar test                              symmetry casevar controlvar
multiple linear regression                regress depvar [varlist]
OR (odds ratio)                           cc case-var ex-var
                                          or cci a b c d (immediate form)
percentiles                               table var1,c(p25 var2 ...) or centile
person years                              ir
relative frequencies                      tabulate
RR (relative risk)                        cs case-var ex-var
                                          or csi a b c d (immediate form)
risk ratio                                cs, csi or ir (for incidence data)
ROC curves                                roctab
                                          or rocfit from STB, try search roc
signtest                                  signtest
simple linear regression                  regress depvar [varlist]
t-test                                    ttest
trend tests                               nptrend
Wilcoxon matched-pairs signed-ranks test  signrank, see also signtest
Wilcoxon ranksum test                     ranksum
