Previous command:
Double-click the command in the “Review” window or press Page Up until you get the appropriate
command, then hit Enter. In general, Page Up and Page Down browse previously executed
commands.
Execute a do-file:
See Section 16.
2 Online help
Known command name:
Use the help menu or the command help:
. help ttest
-------------------------------------------------------------------------------
help for ttest, ttesti (manual: [R] ttest)
-------------------------------------------------------------------------------
The command whelp opens a new window with the same information and clickable links.
. findit paired
The findit command often returns pointers to the Stata Technical Bulletin (STB) and to commands
you can download from the internet. For more on online facilities see Section 23.
3 Producing output
Create a log-file:
Use the command log using filename. Execute your commands and finish with the command
log close. The output is now stored in filename. Usually the file filename will have the extension
.smcl. See help log for further information.
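A minimal sketch of a logging session, assuming the hypothetical log name mylog and a dataset that is already in memory:
. log using mylog, replace
. describe
. summarize
. log close
The replace option overwrites an existing mylog.smcl; omit it if you would rather get an error message than lose an old log.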
Copy and Paste:
Mark the desired output in the “Results” window, then copy it with Ctrl-C. Paste it into
the file of your choice with Ctrl-V. You may also do Copy and Paste in the ordinary Windows
fashion using the mouse or menus.
varlist can be one or several variable names, or it might be empty. In the case of several
variables it is possible to give the varlist as, say, var1-var5, which means all the variables
from var1 to var5 in the current order shown by describe, or you may use var*, which means
all the variables in the dataset that start with the letters “var”.
selector can be something like
if sex=="m"
if age>18
in 1/3
As selector we may use any combination of these. Note that the “logical equal to” symbol is a double
equals sign, “==”. in 1/3 means the first through third observation in the dataset (in the current order).
options vary from command to command. They are either single names (e.g. histo) or include
additional information in parentheses (e.g. bin(7) or xscale(0,20)).
Note: There is at most one comma in a Stata command!
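As an illustration of how varlist, selector and options combine, here is a sketch that assumes hypothetical variables age, weight and sex and a dataset with at least three observations:
. summarize age weight if sex=="m" & age>18 in 1/3, detail
Here age weight is the varlist, the if and in parts together form the selector, and detail is the only option, placed after the single comma.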
Abbreviations:
Usually you can abbreviate command names and options. For example, the following two commands
are equivalent:
Each command and option has a minimal abbreviation, which you can look up using
the help command. The letters that are required are underlined in the help text.
You can also abbreviate variable names by their first letter(s), as long as the identification remains
unique. In the example above bweight and hypertension must be the only variables beginning
with bw and hyp.
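A sketch of such an equivalent pair, assuming a dataset that contains the variables bweight and hypertension (the pair is illustrative and not necessarily the original example):
. tabulate bweight hypertension, chi2
. tab bw hyp, chi
Both lines request the same two-way table with a chi-squared test; the second uses abbreviations for the command, the variable names and the option.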
Error messages:
Error messages try to inform you about what may be wrong, for example if you misspell a variable
name,
. tab variabble
variabble not found
. tabulate var1, by(var2)
by() invalid
Below the error message in red is an r(xxx) code in blue. This code is clickable and provides
more details on what might be wrong and what you should do.
The logic of error messages:
Stata cannot know what you intend to do; it can only detect errors by checking the syntax. This means
that you get only indirect hints. For example, if you forget to separate an option by a comma,
you will get:
because Stata believes that you meant chi to be a variable. Or if you forget that by requires
parentheses, you get:
Here Stata does not realise that you forgot the parentheses; it believes that you tried to use by as
a single option. These examples show that error messages are often very cryptic.
. save dummyname
file dummyname.dta saved
. clear
. set memory 16m
(16384k)
. use dummyname
. erase dummyname.dta
where 16m means 16 megabytes of RAM. Select the amount you need. See help memory.
The lasting solution:
If you are working with a do-file (and you should!), then insert at the top of the file:
or
. exit, clear
This will overwrite the contents of myfile. The same applies to files containing graphs.
Note: You can use the replace option, even if the corresponding file does not exist.
7 Data checking
Before you analyse your data you should verify that they are “as expected”.
describe varlist :
. assert 2+2==4
. assert 2>3
assertion is false
r(9);
The variable age should contain the age in years (integers), with every value between 20 and 50.
The variable sex should contain the gender of the person as “F” or “M”.
datein is the first time an object is observed and dateout is the last time:
. assert datein<dateout
This may fail if datein can be missing. If you want to run the assertion while allowing for missing
values of datein, exclude those cases with an if-condition.
When the assertion fails, you list the illegitimate cases. E.g.
id sex
19. 19 f
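The checks described above could be written as follows. This is a sketch that assumes the variable names age, sex, datein, dateout and id used in this section:
. assert age==int(age) & age>=20 & age<=50
. assert sex=="F" | sex=="M"
. assert datein<dateout if datein!=.
. list id sex if sex!="F" & sex!="M"
The if-condition in the third line skips observations with missing datein, and the last line lists the cases that violate the second assertion.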
9 Stratification using by
by-option:
Many commands in Stata allow or require stratification of your data into groups using the
by-option, e.g.
by-construct:
Other commands allow a preceding by for a stratified analysis, e.g.
. sort sex
There is no common rule for when by-constructs or by-options are allowed. However, this is
always indicated in the syntax description offered by the help command.
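A sketch of both forms, assuming a numeric variable weight and a grouping variable sex with two levels:
. ttest weight, by(sex)
. sort sex
. by sex: summarize weight
The first line uses the by-option of ttest; the last two lines use the by-construct, which requires the data to be sorted by the by-variable.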
. l
. gen bmi=weight/height^2
. gen overw=bmi>25
. l
Note: If you want to generate string variables, you have to specify the length of the string. See
Section 18.
The replace command:
If you want to overwrite existing values of a variable, you have to use the replace command instead of
the generate command. For example, if height is recorded in centimeters in the dataset, but you
want to have it in meters, you just type
. replace height=height/100
A perhaps unexpected use of replace appears when you define a new variable with a
subgroup-dependent definition. For example, if the limit for overweight differs between males and
females, typically you use code like
The reason for this is that the first statement already fills the variable overw with missing values
for all female subjects, which then have to be replaced by the second statement.
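A sketch of this two-step pattern, assuming the variables bmi and sex and purely illustrative limits of 25 for males and 24 for females:
. gen overw=bmi>25 if sex=="m"
. replace overw=bmi>24 if sex=="f"
Note that a missing bmi counts as larger than any number in Stata, so you may want to add & bmi!=. to both if-conditions.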
Overview of available functions and operators:
When generating new variables, you can combine existing variables using many operators and functions.
The commands help operators and help functions give you an overview. The most important ones
are summarized in the following list.
. help operators
-------------------------------------------------------------------------------
help for operators (manual: [U] 20 Functions and expressions)
-------------------------------------------------------------------------------
Operators in expressions
------------------------
                                              Relational
 Arithmetic             Logical               (numeric and string)
 -------------------    ------------------    ---------------------
 +   addition           ~   not               >    greater than
 -   subtraction        |   or                <    less than
 *   multiplication     &   and               >=   > or equal
 /   division                                 <=   < or equal
 ^   power                                    ==   equal
                                              ~=   not equal
....
-------------------------------------------------------------------------------
help for functions (manual: [R] functions)
-------------------------------------------------------------------------------
...
Mathematical functions
----------------------
11 Creating subsamples
There are two ways in which you can create subsamples. You can select a subset of your variables (vertical
selection) or you can select a subset of your observations (horizontal selection). For both procedures we
have the commands drop and keep.
For variables:
The data set has three variables ID, sex and income.
. drop income
For observations:
Drop all observations associated with female individuals (the code “f” in the variable sex indicates
a female):
. drop if sex=="f"
The consequence of these commands is that the dataset in memory is permanently changed. The dataset on
disk is not affected until you issue save dataname, replace. To save under a new filename type
save newdataname
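The keep command works the other way round. A sketch using the same three variables ID, sex and income:
. keep ID sex
. keep if sex=="m"
The first line has the same effect as drop income. The second line retains only the observations coded “m”, which equals drop if sex=="f" only when sex takes no other values.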
where statname [...] are the summary statistics that you want to display.
. tabstat erateWL, s(n mean sd)
variable | N mean sd
-------------+------------------------------
erateWL | 170 .19375 .1836576
--------------------------------------------
If you want separate summary statistics for each group defined by varname you should use the
options by(varname) c(s) lo.
. tabstat erateS erateWL, s(n mean sd) c(s) by(gender) lo
See more details in help tabstat.
The table command:
You use table when you want to display a series of summary statistics for each level of another
variable.
table rowvar [colvar [supercolvar] ...] [, contents(clist) row col [options] ]
The philosophy behind the syntax is that we want a table where for each value of the variable
rowvar (and colvar and supercolvar) the cell contains clist with the layout format given
in options, where clist is a list of summary statistics of other variables. The option row adds
a row with the totals over all values of rowvar (similarly, col adds a column of totals).
For details on the format options see help table.
. table treat, c(n dec med dec p5 dec p95 dec)
----------+-----------------------------------------------------------
treat | N(decrease) med(decrease) p5(decrease) p95(decrease)
----------+-----------------------------------------------------------
1 | 205 5.211085 -10.97878 23.59735
2 | 204 16.30814 -2.117609 33.61396
3 | 204 13.19776 -25.15851 30.93353
----------+-----------------------------------------------------------
The tabulate command:
You use the tabulate command when you want to investigate the association between two (or
more) variables.
tabulate varname1 varname2 [, all cell chi2 column exact gamma lrchi2 row taub V ...]
The interpretation of the syntax is that we tabulate the frequency count of varname1 versus
varname2 with various measures of association, including the common Pearson chi-squared, the
likelihood ratio chi-squared, Cramer’s V, Fisher’s exact test, Goodman and Kruskal’s gamma, and
Kendall’s tau-b.
. tab res treat, chi2
| treat
result | 1 2 3 | Total
-----------+---------------------------------+----------
1 | 74 21 56 | 151
2 | 71 47 35 | 153
3 | 36 57 52 | 145
4 | 24 79 61 | 164
-----------+---------------------------------+----------
Total | 205 204 204 | 613
It is possible to combine tabulate with summarize to obtain table-like output in a fast way.
. tab treat, summarize(dec)
| Summary of decrease
treat | Mean Std. Dev. Freq.
------------+------------------------------------
1 | 5.6048431 11.082792 205
2 | 15.710805 11.359821 204
3 | 9.3633245 17.387196 204
------------+------------------------------------
Total | 10.218785 14.193435 613
13 Categorization of variables
In many medical applications continuous variables are reduced to variables with a few categories like
“low”, “middle” and “high”. Stata supports this step with several functions.
Categorizing a variable at specific cutpoints using the recode function:
If you want to categorize a variable at specific cutpoints, you can use the recode function as
in the following example. The new variable assigns to each value the upper bound of the interval
into which the value falls. Note that you have to ensure that the last specified cutpoint is not smaller
than the maximal value in your dataset in order to obtain the desired result (see generation of
catvar1). In general, the last argument of recode is not really a cutpoint,
but the value assigned to every value larger than the last but one argument. This property is used in
generating catvar2 to assign a missing value to all values larger than 110.
. list
var
1. 23
2. 56
3. 67
4. 123
5. 99
6. 17
. gen catvar1=recode(var,50,100,150)
. gen catvar2=recode(var,40,60,80,110,.)
(1 missing value generated)
. list
If you want to recode the values of the grouped variable, you can use the recode command, or
you can use the egen command with the group function, which assigns the values 1, 2, 3 etc. to
the smallest, the next smallest etc. value. Both are illustrated in continuing our example:
. list
. egen catvarg1=group(catvar1)
. list
var catvar1 catvar2 catvarg1
1. 23 50 1 1
2. 17 50 1 1
3. 56 100 2 2
4. 67 100 3 2
5. 99 100 4 2
6. 123 150 . 3
Note that using the group function implies that data are reordered.
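One way to obtain the codes 1 to 4 shown for catvar2 in the listing above is the recode command. The rules below are a sketch inferred from the listed values, not necessarily the command used originally:
. recode catvar2 40=1 60=2 80=3 110=4
Values not mentioned in a rule, here the missing value, are left unchanged.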
. list
var
1. 23
2. 56
3. 67
4. 123
5. 99
6. 17
. gen catvar=autocode(var,5,0,100)
. list
var catvar
1. 23 40
2. 56 60
3. 67 80
4. 123 100
5. 99 100
6. 17 20
| catvar
var | 20 40 60 80 100 | Total
-----------+-------------------------------------------------------+----------
17 | 1 0 0 0 0 | 1
23 | 0 1 0 0 0 | 1
56 | 0 0 1 0 0 | 1
67 | 0 0 0 1 0 | 1
99 | 0 0 0 0 1 | 1
123 | 0 0 0 0 1 | 1
-----------+-------------------------------------------------------+----------
Total | 1 1 1 1 2 | 6
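The autocode function used above, autocode(x,n,x0,x1), divides the range from x0 to x1 into n intervals of equal length and returns the upper bound of the interval containing x. For this example the result should therefore coincide with a recode call using the cutpoints 20, 40, 60, 80 and 100. A sketch to check this (catvarr is a hypothetical variable name):
. gen catvarr=recode(var,20,40,60,80,100)
. assert catvar==catvarr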
. list
var
1. 23
2. 56
3. 67
4. 123
5. 99
6. 17
. list
. di 10.6 - 2 * 7.35
-4.1
. di 3^4
81
15 Loops in Stata
The for command:
You can execute a series of Stata commands with the command for. Example:
The index X is substituted in each loop. num tells Stata that we use numerical values for X. 1/5
is the list of values 1 2 3 4 5. The “:” indicates that the Stata commands to be
executed in each step of the loop follow.
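A sketch of such a loop in the Stata 7 syntax described above, assuming an existing numeric variable x; the names pow1 to pow5 are hypothetical:
. for X in num 1/5: gen powX=x^X
In later Stata versions forvalues is the usual replacement for for. In a do-file the same loop could read:
forvalues i = 1/5 {
    gen pow`i' = x^`i'
}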
Debugging a do-file:
Read the error messages. If this doesn’t help, try the command set trace on, which gives very
detailed information on command execution. It is reset to its original setting by set trace
off. The command set trace on places a “-” in front of each line that is executed. The last
line without a “-” sign contains the error. It is often useful in combination with set more off.
Why use do-files:
For two reasons:
1. Gives you the option of modifying and re-running your commands, i.e. it is a time saver (in
the long run...).
2. Provides you with documentation on just how you arrived at your precious conclusions.
Comments in do-files:
It is fruitful to write comments to yourself or any reader in your do-files. You write comments by
beginning the line with an asterisk *, then Stata will ignore whatever is in that line.
A nice do file looks like:
log using filename, replace
* This do-file is an example
use data, clear
describe
... some other commands
log close
17 Reshaping datasets
Reshaping wide datasets:
Suppose you have the following dataset with measurements of nausea on 3 consecutive days after
chemotherapy:
. list in 1/3
You would like to investigate the increase over time by a regression model. For this, you need a
data set, where each line corresponds to one day of one individual. You can use the reshape
command to achieve this:
. list in 1/9
.
. regress nausea day, cluster(id)
In Stata’s terminology, you have changed a dataset from wide format to long format.
Note: The i-option specifies the logical unit, whereas the j-option specifies the variable which
indicates observations within a unit.
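A sketch of the corresponding reshape call, assuming the wide dataset holds the variables id, nausea1, nausea2 and nausea3:
. reshape long nausea, i(id) j(day)
Afterwards each subject has three lines, the new variable day takes the values 1, 2 and 3, and nausea holds the measurement of that day.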
Reshaping long datasets:
Suppose you have the following dataset with measurements of nausea on 3 consecutive days after
chemotherapy:
You would like to make a scatterplot of the measurement on day 2 versus the measurement on day
1. For this you need a dataset where you have the variables nausea1 and nausea2. You can use the
reshape command to achieve this:
. reshape wide nausea, i(id) j(day)
. list in 1/3
In Stata’s terminology, you have changed a dataset from long format to wide format.
Note: If you switch from long to wide format, all variables not used as arguments for reshape must
be constant within each unit specified by the i-option. Otherwise, you get an error message.
Reshaping several variables simultaneously with nonnumeric suffixes:
In reshaping datasets, the variables can also have nonnumeric suffixes, for example left and
right. In this case you have to specify the string option. You can also reshape several variables
simultaneously. Both are illustrated in the following example:
. list in 1/2
. list in 1/4
. list in 1/2
You can use the reshape command also for more complex situations. Take a look at the Stata
Reference Manual.
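A sketch of a reshape call with string suffixes and two variables at once, assuming a hypothetical wide dataset with one line per subject id and the variables painleft, painright, strengthleft and strengthright:
. reshape long pain strength, i(id) j(side) string
Afterwards each subject has two lines, with the string variable side equal to left or right, and the variables pain and strength.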
Operations on strings:
If you want to concatenate strings, you can use the + operator:
. l
treat group
1. A 2
2. A 1
. l
There are many functions for working with strings, especially for converting numbers to strings
and vice versa.
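A sketch of concatenation for the data listed above, where treat is a string variable and group is numeric; the name tg is hypothetical, and string() converts the number to a string first:
. gen str2 tg=treat+string(group)
. list
For the two observations shown this should give A2 and A1. The reverse conversion from a string to a number is done by real().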
. help functions
-------------------------------------------------------------------------------
help for functions (manual: [R] functions)
-------------------------------------------------------------------------------
....
String functions
----------------
....
19 Labels
Labelling an existing variable:
If a variable is coded by numerical values, it is often useful to have the meaning of the values and
not the values themselves in tabulations and listings. You can achieve this by assigning labels to
the variable values using the label command:
. list
sex age
1. 0 17
2. 1 23
. label define labsex 0 male 1 female
. label values sex labsex
. list
sex age
1. male 17
2. female 23
Note: The labels are only used when displaying the values. Internally, the values are still stored as
numbers, so in expressions you have to treat sex as a numeric variable.
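For example, with the labels defined above the females are still selected through the numeric code. A sketch:
. list if sex==1
. tabulate sex if sex==0
A condition such as sex=="female" fails with a type mismatch error, because sex is not a string variable.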
Distinguishing values and labels:
Once a variable is labelled, it can be difficult to find out what the real values are. The
codebook command always shows you both the values and the labels:
. codebook sex
Note: If you import datasets from other systems, for example using StatTransfer, values are often
already labeled. Hence it is always a good idea to use codebook in the beginning.
Note: Some commands, for example list and tabulate, allow a nolabel-option, such that
the values instead of the labels are shown.
. list
sex age
1. male 17
2. female 23
. list, nolabel
sex age
1. 0 17
2. 1 23
. list
sex age sexstr
1. male 17 male
2. female 23 female
.
. list
. codebook gender
. list
sex age
1. female 23
2. male 17
. gen years=real(agestr)
. list
. describe
Contains data
obs: 2
vars: 4
size: 56 (98.5% of memory free)
-------------------------------------------------------------------------------
1. sex float %9.0g labsex
2. age float %9.0g
3. agestr str2 %9s
4. years float %9.0g
-------------------------------------------------------------------------------
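The variables in the listings above can be produced with Stata's conversion tools. The following is a sketch that assumes the labelled variable sex and the numeric variable age; which commands were actually used is an assumption:
. decode sex, gen(sexstr)
. encode sexstr, gen(gender)
. gen str2 agestr=string(age)
. gen years=real(agestr)
decode turns a labelled numeric variable into a string variable, encode does the reverse and assigns the codes in alphabetical order of the strings, and string() and real() convert between numbers and strings.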
21 Creating variables with statistics
It is often necessary for an analysis to prepare the dataset by computing new variables with statistics,
for example the maximum value observed during a day or subject specific mean values. The following
illustrates some typical tools for this task.
Computing statistics over several variables using egen:
The egen command offers functions like rmax or rmean to compute a maximum or a mean
“rowwise”. This is illustrated in the following example, where we have for each subject and each
day a measurement at 6 o’clock, 12 o’clock and 18 o’clock. We can use rmax to compute the
maximum within each day:
. list in 1/6
. list in 1/6
egen offers for this type of tasks the functions rmax, rmin, rmean, rsum, rsd and
robs, where the latter gives the number of nonmissing observations. Note that these functions
expect a list of variables separated by blanks. Do not confuse them with the functions mean,
min, max etc., which are also offered by egen for other purposes.
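A sketch of the row-wise maximum, assuming the three measurements are stored in variables named val6, val12 and val18 and using the hypothetical name maxval for the result:
. egen maxval=rmax(val6 val12 val18)
In more recent Stata versions these functions are called rowmax, rowmin, rowmean, rowtotal, rowsd and rownonmiss.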
Computing statistics over several observations using collapse:
The collapse command allows you to compute statistics from groups of observations. Looking
at the last example, we might now be interested in taking the average over three days for each
subject. This can be done in the following way:
. list in 1/6
. list in 1/2
subj meanmax
1. 1 30.06667
2. 2 25.5
You can generate several statistics simultaneously, for example you can use collapse (min)
minval6=val6 (max) maxval6=val6, by(subj) in order to generate the minimum
and maximum of the measurements at 6 o’clock over the three days for each subject. Other statistics
offered by collapse are median, sd, sum, iqr and all percentiles.
Note: If you have variables that are constant within the unit you would like to collapse and
that you want to keep in the new dataset (for example the age and sex of a subject), you can
include them in the by-option. (For example: collapse ..., by(subj age sex))
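A sketch of the averaging step described in this section, assuming the daily maximum is stored in a variable called maxval and subjects are identified by subj:
. collapse (mean) meanmax=maxval, by(subj)
. list in 1/2
Since collapse replaces the dataset in memory, save your data first if you still need the original observations.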
. list in 1/12
. gen high=val>meanval
. list in 1/12
egen also offers functions like min, max, median, sd, iqr, rank, sum and functions
for percentiles. A typical use of egen is in standardizing a variable to the range 0-1 for each
subject. This looks like
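A sketch of this standardization, assuming a measurement val and a subject identifier subj; minval, maxval and stdval are hypothetical names:
. egen minval=min(val), by(subj)
. egen maxval=max(val), by(subj)
. gen stdval=(val-minval)/(maxval-minval)
For a subject whose values are all equal the denominator is zero and stdval becomes missing.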
22 Survival analysis commands
A characteristic feature of survival data is the presence of censoring and left truncation. Without censoring
and truncation the data are represented by the survival time variable T, which measures the duration of time
between the initial event and the final event. In the presence of censoring and truncation more variables are
required to represent the incomplete observation of the survival time T.
With censoring at time C (e.g. end of follow-up) it is only possible to observe whether the final event occurs
before time C. The final event indicator D is equal to 1 if T <= C (i.e. uncensored observation) and it is
equal to 0 if T > C (i.e. censored observation). The censored survival time X is equal to T if T <= C
and is equal to C if T > C. With left truncation at time V the censored observations are only
observed if X > V (otherwise no information is collected). Consequently, under right censoring and left
truncation the survival time is represented by the three variables (X, D, V). In Stata datasets these variables
are usually called time, event and time0 respectively. If all subjects enter at time 0 (i.e. V = 0) the
respective variable time0 may be omitted in the dataset.
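A sketch of declaring such data to Stata, assuming the variable names time, event and time0 described above and that event is coded 1 for an observed final event (check help stset for the exact options in your version):
. stset time, failure(event) enter(time time0)
. stset time, failure(event)
The second form applies when all subjects enter at time 0 and time0 is omitted.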
sts list, at(200 201)
Note the argument at(200 201), where two time values are required, because without
“201”, at(200) will tabulate the Kaplan-Meier estimator at 200 equidistant time points.
Cox regression:
The stcox command is used to carry out analysis using the Cox regression model:
stcox indepvar1 indepvar2 ... indepvarN
This will report hazard ratio estimates. To produce estimates of regression coefficients the nohr
option may be used.
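A sketch, assuming the data have already been declared with stset and contain the hypothetical covariates age and treat:
. stcox age treat
. stcox age treat, nohr
The first call reports hazard ratios, the second the regression coefficients; a hazard ratio is the exponential of the corresponding coefficient.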
Increase memory size:
Sometimes the extra variables created by the stset command do not fit in the available memory.
In this case see section 5 for commands to increase the memory size. Note, that you will have to
reload and re-stset the dataset after this operation.
23 Online facilities
Stata is web-aware in the sense that it offers commands that allow you to update and enhance your Stata
version if you are connected to the Internet. The most important commands are:
update:
Typing update will give an overview of when your Stata system was last updated. The command
update query will check whether or not your Stata would benefit from an update. Finally you
can execute the command update all to update both your ado-files and executable.
findit:
In up-to-date Stata 7.0 the command findit will search all relevant Internet sites for Stata material
containing your search word. For example:
. findit smooth
13 Sep 2002 13:53:35
Keyword search
--------------
Keywords: smooth
Search: (1) Official help files, FAQs, and STBs
(2) Web resources from Stata and from other users
<...cut...>
<...cut...>
<...cut...>
Web resources from Stata and other users
----------------------------------------
(contacting http://www.stata.com)
<...cut...>
http://www.sun.rhbnc.ac.uk/~uhss021/stata/
Materials by Kenneth L. Simons / Here are assorted utilities for Stata. /
Check dummy (indicator) variables to ensure they are okay / Distance
between latitude & longitude coordinates / Count data points in a
geographic radius of each point / Create data points for extra geographic
http://www.stata.com/users/njc/
Materials by Nicholas J. Cox, University of Durham / Nicholas J. Cox
<N.J.Cox@durham.ac.uk> is a geographer at the University / of Durham and
a frequent contributor to Statalist. His areas of interest / include
graphics, smoothing, probability distributions, circular statistics, / and
<...cut...>
(end of search)
First you see what is in the reference manual, on the Stata FAQ pages, and in the STB, where STB
refers to the “Stata Technical Bulletin”, which is a journal where various enhancements (ado-files)
are published with examples of their use. Next you get results from searching the web resources
for user written resources.
Installation:
To install a specific package you found with findit just follow the blue clickable links.
Description                               Stata command
ANOVA                                     anova or oneway
chi-squared test for contingency tables   tabulate var1 var2, chi, see also epitab
confidence intervals for                  ci or cii (immediate form)
  means
  proportions
  probabilities
  percentiles                             centile
contingency tables                        tabulate
correlation
  Spearman                                spearman var1 var2
  Pearson                                 pwcorr [varlist] or correlate [varlist]
cumulative distribution function          cdf from STB
Cox regression                            stcox indepvars
Fisher’s exact test                       tabulate var1 var2, exact, see also epitab
Friedman                                  friedman from STB, try search friedman
fourfold table                            tabulate, see also epitab
interrater agreement test                 kappa var1 var2
Kaplan-Meier curves                       sts graph
kappa                                     kappa var1 var2
Kruskal-Wallis test                       kwallis
likelihood ratio test                     lrtest
linear regression                         regress depvar [varlist]
logistic regression                       logistic depvar [varlist]
log rank test                             sts test indepvar, logrank
Mann-Whitney two sample test              ranksum
mean, median, sd                          summarize [varlist]
                                          or table
meta analysis                             meta from STB, try search meta
McNemar test                              symmetry casevar controlvar
multiple linear regression                regress depvar [varlist]
OR (odds ratio)                           cc case-var ex-var
                                          or cci a b c d (immediate form)
percentiles                               table var1,c(p25 var2 ...) or centile
person years                              ir
relative frequencies                      tabulate
RR (relative risk)                        cs case-var ex-var
                                          or csi a b c d (immediate form)
risk ratio                                cs, csi or ir (for incidence data)
ROC curves                                roctab
                                          or rocfit from STB, try search roc
signtest                                  signtest
simple linear regression                  regress depvar [varlist]
t-test                                    ttest
trend tests                               nptrend
Wilcoxon matched-pairs signed-ranks test  signrank, see also signtest
Wilcoxon ranksum test                     ranksum