1 Preliminaries
STATA is a command-line driven statistics package. This means that much like DOS, you need to
type commands into the software to make it execute any routine. While this is a bit more difficult than
menu-driven packages like SPSS, it is much faster and more flexible. This document is meant to get you
started working with STATA.
2 Getting Started
2.1 Why All those Windows?
STATA is a multiple-windowed environment. When you open STATA, you will see 4 windows.
1. Review window - The review window gives you a list of previously typed commands. You can access
these in two different ways. You can scroll through the previous commands with the scroll arrows and
click on a command, or you can hit the “page up” key, which brings the previous command into
the command window. If you hit the “page up” key twice, the next to last command
will pop up, and so forth.
2. Variables window - The variables window provides a list of the variables and their labels that are
in the currently loaded dataset.
3. STATA Command window - The STATA command window is where the user can interact with
STATA. This is where commands are typed in.
4. STATA Results window - The STATA results window shows you the results of the commands typed
into STATA.
The Graphics Window displays graphs as a result of a graph command being typed in the command
window. This window will not be visible when you first open STATA. Rather, it will pop up directly
following a graphical command.
This type of log file captures everything that comes up in the results window. There is another type of
log file - the command log - that captures only commands, not output. Because it contains only
commands, a command log allows you to replicate your analysis with just one command. The command
log is started with “cmdlog using filename”, and the resulting file can be opened in the STATA do-file
editor by typing:
doedit filename
The file can easily be run in STATA by typing in the command line:
do filename
or by clicking on the Tools>>Run menu option in the STATA do-file editor.
Log files can be suspended or closed. Suspending a log file is done by typing “log off”, which
temporarily stops output from being written to the log file. The log can then be turned back on by
typing “log on”. Closing a log file is done by typing “log close”. You may then open a new log file,
or reopen the same log file and add more information to it by specifying the append option of
“log using”.
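The log workflow described above can be sketched as follows. This is only an illustration: the file names mylog.log and mycmds.txt are placeholders, and the sketch assumes no log is already open when it starts.

```stata
* Placeholder file names: mylog.log and mycmds.txt are not from the handout.
* Start a full log (commands and output) and a command-only log.
log using mylog.log, replace text
cmdlog using mycmds.txt, replace

display "this line is recorded in the log"

* Suspend the full log, run something unrecorded, then resume.
log off
display "this line is not written to mylog.log"
log on

* Close both logs.
log close
cmdlog close

* Reopen the same log file and add more output to the end of it.
log using mylog.log, append text
display "this line is appended to the existing log"
log close
```

Using replace on the first “log using” overwrites any earlier file of the same name, while append keeps what is already there.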
2.3.3 infile/insheet
STATA’s infile command allows the user to read a plain-text data file into the program. This is
usually done from a .txt document. The data file should be delimited by tabs, spaces or commas. Insheet
is a similar command that is specifically designed for data written out of a spreadsheet program. With
insheet, the delimiter is specified as an option to the command, whereas with infile it is not. The syntax
of the infile command is as follows (see footnote 1):
infile varlist [_skip[(#)] [varlist [_skip[(#)] ...]]] using filename [if exp]
[in range][, automatic byvariable(#) clear ]
The syntax of the insheet command, along with a description of what the arguments to these and other
commands mean, is available from the help system by typing:
help infile1
help insheet
or more generally:
help <function>
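A minimal sketch of the two commands in use is given below. It assumes a STATA release that ships the sysuse and outsheet commands (the bundled auto data are used only to create small example files), and the file names mydata.csv and mydata.raw are placeholders.

```stata
* Build two small example files so the sketch is self-contained.
* mydata.csv keeps a header row of variable names; mydata.raw does not.
sysuse auto, clear
outsheet make price mpg using mydata.csv, comma replace
outsheet make price mpg using mydata.raw, comma nonames replace

* insheet: the delimiter is given as an option, and variable names
* are taken from the first row of the file.
insheet using mydata.csv, comma clear
describe

* infile: variable names (and the string type) are listed in the
* command itself, and the delimiter is not specified.
infile str18 make price mpg using mydata.raw, clear
list in 1/5
```

Note that infile needs the string variable declared as str18 in the variable list, which is the kind of information a dictionary would otherwise carry.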
Dictionaries. Dictionaries are a way to define variable types. STATA does not like to infile string
variables without a dictionary. Dictionary files include not only the data you want to input, but also a
dictionary command at the beginning. For an example, see “H:/GVPT622 F02/auto.dct”, which you can
open in a text editor. To use a dictionary with the infile statement, name the dictionary file after
“using” in place of a variable list. STATA has two basic types of variables, string and numeric:
1. String variables are those that contain at least one non-numeric character such as a letter or symbol.
STATA calls these “str” variables. There is always a number after the “str” which denotes how
many characters wide the variable is, so a variable that is str8 is 8 characters long.
2. Numeric variables are those containing only numbers (including possibly a decimal point). There
are different kinds of numeric variables: byte, int, long, float and double. They all have different
minima, maxima and precision toward zero. Type “help datatypes” for a more thorough discussion.
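To make the idea of a dictionary concrete, the sketch below writes a tiny dictionary file and then reads it. Everything in it is illustrative: the file name mydict.dct, the two variables and the two data rows are made up for the example, and the file is written with the file command only so that the sketch runs on its own (in practice you would create the file in a text editor).

```stata
* Write a small dictionary file: a dictionary block that declares
* the variable types, followed by the data themselves.
tempname fh
file open `fh' using mydict.dct, write text replace
file write `fh' "dictionary {" _n
file write `fh' "    str18 make" _n
file write `fh' "    int   price" _n
file write `fh' "}" _n
file write `fh' `""AMC Concord" 4099"' _n
file write `fh' `""Buick Regal" 5189"' _n
file close `fh'

* Read the data; the variable list comes from the dictionary.
infile using mydict.dct, clear

* describe shows the storage types declared in the dictionary.
describe
list
```

The describe output should report make as str18 and price as int, matching the two kinds of variables listed above.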
2.3.4 Stat-Transfer
By far, the easiest way to get data into STATA or nearly any other format for that matter, is with
Stat-Transfer. This program allows the user to take data in nearly any format (including SAS, SPSS,
Excel (or other spreadsheet), Access (or other database), Systat, Gauss, Limdep, Matlab, Statistica,
etc...) and transfer the data into any other format. One of the benefits is that variable names and labels
as well as value labels tend to be preserved across formats. Stat-Transfer is a Windows program that
should be on the statistical software menu in the graduate lab or in LeFrak.
The program works in 4 steps.
1. Choose the type of file you want to transfer.
2. Find the file on your computer.
3. Specify the type of file into which you want to transfer your data.
4. Hit “Transfer”.
For more advanced users, there are tabs for observations, variables and options that will help the user
tweak the program to produce more polished data, but specifying further options in these
tabs is often not necessary.
1 The square brackets [ and ] in the command syntax are not typed as part of the command. They mark
optional elements and are shown simply for clarity in the presentation.
The clear option allows data to be loaded even if data are currently loaded into the program and
have changed since the last save command was executed.
In the save command, old instructs the software to save the dataset in the format of the previous
version of STATA. You shouldn’t need this in the lab, but you will if you’re using STATA 7 elsewhere
and want to use the data in STATA 6 in the lab. Replace simply overwrites the existing dataset if there
is one that has the exact same name. The other options are irrelevant to your work.
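As a rough illustration of the clear and replace options, and assuming the passages above refer to the use and save commands, the sketch below saves a copy of the bundled auto data, changes the data in memory, and then reloads the saved copy. The file name mycopy is a placeholder. The old option is left out because it depends on the release (later releases provide a separate saveold command instead).

```stata
* Load example data and save a copy; replace overwrites mycopy.dta
* if a file with that name already exists.
sysuse auto, clear
save mycopy, replace

* Change the data in memory without saving.
generate price_k = price/1000

* Without clear, use would refuse to discard the unsaved change.
use mycopy, clear
describe
```

After the final use, the variable price_k is gone, because the reloaded file was saved before it was created.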
4 Graphing
STATA’s graphing capabilities are not the best of the statistical packages, but they are sufficient for
exploratory analysis. They are, however, probably not good enough for publication. There are many
possibilities. These can be broken down into two basic types - univariate and bivariate.
4.1.1 Histogram
Histograms place observations into categories (or bins), and the graph then shows the percentage of
the total observations that fall in each bin. The command in STATA is:
graph [variable] [weight] [if exp] [in range], histogram [common_options
bin(#) {freq | percent} normal[(#,#)] density(#)]
The “bin” argument allows you to set the number of categories into which the observations are placed.
A density curve can be imposed on the histogram.
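A short sketch of a histogram call follows. The syntax diagram above comes from an older release, in which histogram is an option of graph. The sketch instead assumes STATA 8 or later, where histogram is a command of its own, and uses the bundled auto data as an example.

```stata
sysuse auto, clear

* Ten bins, with the vertical axis in percent of observations.
histogram price, bin(10) percent

* The same ten bins with a normal density curve overlaid.
histogram price, bin(10) normal
```

Changing the number in bin() is the quickest way to see how sensitive the picture is to the choice of categories.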
4.1.2 Density Plots
A density plot is also called a “smoothed histogram”. In this graph, there are no bins. Instead, it is a
single smooth line that resembles the population density function more closely than the histogram
does. The command in STATA for this is:
kdensity varname [weight] [if exp] [in range] [, nograph
generate(newvarx newvard) n(#) width(#)
{biweight|cosine|epan|gauss|parzen|rectangle|triangle} normal
stud(#) at(varx) symbol(...) connect(...) title(string)
graph_options ]
The gauss option is probably the one that will be most useful. The biweight, cosine, epan (Epanechnikov),
parzen, rectangle and triangle options all control how observations are weighted (this is
analogous to deciding which bin they are in).
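The sketch below draws a density plot with the Gaussian kernel. It assumes a recent release, in which the kernel is chosen through the kernel() option. In older releases the kernel is named directly as an option, as in the syntax diagram above. The bundled auto data serve as the example.

```stata
sysuse auto, clear

* Kernel density estimate of price using the Gaussian kernel,
* with a normal density overlaid for comparison.
kdensity price, kernel(gaussian) normal
```

Rerunning the command with a different kernel name shows how little the overall shape usually depends on this choice.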
4.1.3 Boxplots
Boxplots, sometimes called “box and whisker” plots, are particularly good at showing the spread of a
distribution. The box represents the inter-quartile range (the range between the 25th and 75th percentiles).
The whiskers cover most of the rest of the observations, but some extreme outliers can still lie outside
the whiskers. The STATA command to make a boxplot is:
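A minimal sketch of a boxplot command is shown here, assuming STATA 8 or later, where boxplots are drawn with graph box (older releases use a box option on the graph command instead). The variables come from the bundled auto data and are only an example.

```stata
sysuse auto, clear

* A single boxplot of price.
graph box price

* Side-by-side boxplots of price for each category of foreign.
graph box price, over(foreign)
```

The second form is often the more useful one, since it lets you compare the spread of a variable across groups.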
5 Miscellaneous
There are a number of other commands that will become useful as you begin to use STATA on a
regular basis.
1. Describe - describe provides you with a list of properties of the variables specified or all of the
variables in the dataset if no variables are specified.
2. Summarize - summarize provides the number of observations, mean, standard deviation, min and
max for all of the variables specified or all variables in the data if none are specified.
3. Set Memory - the memory set function will be important when you are using large datasets.
Typing “set memory 100m” will set the memory at 100 megabytes. This should be sufficient for
nearly all of your projects. The upper bound is determined by the computer’s physical memory, so
if 100 megabytes is not enough and your computer has more memory available, you can set the
limit higher.
4. Labelling - Labelling variables and variable values is important to keeping your dataset manageable.
You will hear horror stories from many quantitative types about how they didn’t label variables and
variable values because they were sure they would always remember and then two years later after
having left the project sitting, they come back only to find they’ve forgotten everything about the
variables and their coding. You will need three different commands to properly label your variables.
(a) Label Variable - this command simply attaches a label to the variable name. So, if for instance
the name of the variable is ’var1’, and you label it ’party ID’, then ’party ID’ will show up in
all printed output containing that variable. The command in STATA is as follows:
label variable varname ["label"]
Where ’varname’ is the variable name (var1 in the example above) and label is the label you
want to apply to that variable name (party ID in the example above). So, to create the label
party ID for var1, we would type the following:
label variable var1 "party ID"
(b) Label Define - this command defines value labels. For instance, our party ID variable may have
republicans, independents and democrats. We want to make a label so that if we tabulate the
variable, instead of 0, 1 and 2 as categories, it shows republicans, independents and democrats.
The general STATA code is as follows:
label define lblname # "label" [# "label" ...] [, add modify nofix ]
Where lblname is the name you want to give to the label, like ’partyid’ for this case, # signifies
the number you want the label to apply and label is the descriptor. For this example, we would
type:
label define partyid 0 "republican" 1 "independent" 2 "democrat"
(c) Label Values - Finally, we can apply the new value label we defined ’partyid’, to the variable
of interest.
label values var1 partyid
More generally, the syntax is:
label values varname [lblname] [, nofix ]
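The commands in this section can be tried together on the bundled auto data, as in the sketch below. The variable rep78 and the label name repairlbl are chosen only for illustration, and the category descriptions are made up for the example.

```stata
sysuse auto, clear

* Properties and summary statistics for two variables.
describe price mpg
summarize price mpg

* The three labelling steps: variable label, value label definition,
* and attaching the value label to the variable.
label variable rep78 "Repair record 1978"
label define repairlbl 1 "poor" 2 "fair" 3 "average" 4 "good" 5 "excellent"
label values rep78 repairlbl

* The tabulation now shows the descriptors instead of 1 to 5.
tabulate rep78
```

Running describe rep78 afterwards shows both the variable label and the name of the attached value label.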
6 Resources
1. STATA’s website: www.stata.com has a number of useful resources, like help files and FAQs.
2. STATA also has a listserv called Statalist. For instructions on how to subscribe, consult the
Statalist FAQs located at http://www.stata.com/support/faqs/res/statalist.html.
3. Reference manuals are also a great source of information. Hopefully we will have them available to
you early in the semester.