Trainer's Printout!

TRAINER'S PRINTOUT FOR BEGINNERS' STATA

CLASS 1: INTRODUCTION TO STATA

All the answers to the exercises are in the soft copy of this document, in white-colored font. So,
wherever you see a place where it says "Answer" or "A:" followed by a blank space, you can highlight it
and change the font color to black to see the answer. Please only do this to check your answers. This
module is meant to be interactive. It is strongly recommended that you actually try every step of the
directions in Stata, not simply say "that sounds straightforward" and skip over it.

1. THE BASICS: WHAT STATA LOOKS LIKE AND HOW TO OPEN A DATASET

1) Now we will open the dataset we will work with. Find the file named intro.dta, located in
Stata > Beginners Stata > Data. Double-click on it. Stata will automatically launch.
a. If Stata is already running, a new Stata window will open.
b. Alternatively, if Stata is running, you can go to File > Open and choose your dataset.
2) Notice there are four sections, or windows: Review, Results, Command, Variables.

o COMMAND: You can tell Stata what to do by typing in commands. Click inside the
command window and type

display "Hello!"

o RESULTS: Here Stata displays the commands followed by the output that Stata has
produced (note what appeared as the result of the command you just typed in).

o VARIABLES: lists all the variables in the dataset. The variable window can act as a
shortcut for creating commands. Try clicking on one of the variables. It should appear in
the command window, eliminating the need for you to write it out.

o REVIEW: Lists all your prior commands. Notice that display "Hello!" now
appears there. You can click on it and it will appear in your command window.
Useful tip: When you are in the command window, you can scroll through your
previous commands using the PageUp and PageDown keys.

As you can guess from a glance at the variable list, this dataset contains math and reading scores, as
well as attendance rates, of students in various schools. Specifically, the dataset covers 1 class per
school in 4 different schools, 2 public and 2 private. In addition to scores, attendance and the school
variable, it also contains information on name, ID and gender of the student. The dataset is made up
specifically for this exercise, so please excuse any strangeness in the data.

3) Find the file named intro.xls in Stata > Beginners Stata > Raw. Double click on it. The file will
automatically launch in Excel.

4) Compare variable names (school, student, name, math, reading, female, class) in Excel and
Stata. See where they are located in Excel. Now compare with Stata.

5) Now let's look at what a dataset stored in Stata actually looks like (the way Stata sees it
and reads it). Type

browse

in the command window.

Notice that it resembles an Excel spreadsheet, although you cannot actually work with the
dataset or change it in that format. The columns show the variables and the rows show the
observations. Note that in earlier versions of Stata (10 and below), you cannot continue working
with a dataset (i.e. executing commands) while a browse window is open.

Notice that female takes the value of either 1 or 0 (guess what it means for female to equal
0?), school takes the values 1 to 4, and name takes the form of text. Exit out of the
browse window by clicking on the X in the upper right-hand corner. Notice that browse has
appeared in your Review window.

2. LOOKING AT YOUR DATA : SOME BASIC COMMANDS

A. Some of the most basic commands that we're about to show you follow a really simple
structure:

command

-or-

command variable

where "command" is the action you want done and "variable" is the one or more variables you
want it done to.

For example, you've already seen the command browse.

You can also browse specific variables. Try browsing the two variables:

browse female school

Note: This is very useful when the dataset is large and you are trying to check something specific.

B. Commands and variables can both be shortened. For most common commands the first 2 or 3
letters work. For example:

br math

C. SUMMARIZE (shortened: sum, summ) - gives you basic statistics about variable(s). Typed on its
own, this command gives you summary statistics for ALL variables. Type in:

summarize

As you can see, the output is the number of observations, mean, standard deviation, and the
min. and max. of all variables in the dataset.

Now let's try producing the summary statistics for just one variable. This is an extremely useful
function, both to use during cleaning and for production of summary statistics (for instance, for
the baseline). Let's get the summary stats for the math score. Type:

summarize math

What is the mean of math scores across the students? What is the standard deviation? What
was the lowest score?

Answers: 81.6, 18.9, and 38

Now produce the same statistics for the variable of reading scores.

Sometimes you also need more detail on the variable than just the number of observations, mean,
standard deviation, min. and max. That is why the summarize command has a detail option. Try this:

summarize math, detail

What happened? As you can see, it also produces a percentile breakdown and other stats, like
variance, that can sometimes come in handy.

D. TABULATE (tab) - lists all the values the variable takes in increasing (or alphabetical) order, tells
you how many of each value there are and what percentage each value constitutes. Type in:

tab school

As you can see, tab produced a list of the number and percentage of students by school. How many
students are there in school 1? What's the percentage of the total that belongs to school 3?
Answer: 9 and 25.6%

IMPORTANT! Sometimes a value of a variable will be missing (i.e. there's no data); this is quite
common. For instance, scroll back to where you summarized math and reading in the Results
window in Stata. Compare the total number of observations for reading scores with that for the
math scores. What is the difference?

A: There are 3 more observations for math than reading.

Tabulate (and many other commands) doesn't automatically show you the missing values, so it is
easy to forget about their existence. However, forgetting about them sometimes gives you
inaccurate stats. To see the missing values when you tabulate, type:

tab reading, missing


Compare this with:
tab reading

As you might have figured out, the missing values for numeric variables get coded as "." (a dot).
Notice that there are three missing values in reading. You can see them at the very bottom of
the list. Stata treats missing values as larger than any number and will always list them last.

You can also check this by typing

browse reading math

and confirming that there are missing values. However, browsing is not as useful if you have
really large datasets.

Always, always keep missing values in mind when doing any coding. Always ask: "How will this
piece of code affect missing values?" For example, if you ask Stata to find the mean of math
scores for students with attendance rates above 80%, those with missing values for attendance
will be included in this analysis, because Stata treats missing as larger than any number.
Would you have wanted this?
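The attendance example can be checked on a small made-up dataset. This is only a sketch: the five rows below are invented for illustration and are not taken from intro.dta. It uses Stata's missing() function to leave out missing values explicitly.

```stata
* Toy data, invented for illustration (not intro.dta)
clear
input math attendance
90 .95
70 .85
60 .70
85 .
75 .
end

* Missing (.) counts as larger than any number, so the two students
* with missing attendance also satisfy attendance > .80
count if attendance > .80

* Excluding missing values explicitly keeps only the students with a
* recorded attendance rate above 80%
count if attendance > .80 & !missing(attendance)
summarize math if attendance > .80 & !missing(attendance)
```

On this toy data the first count is 4 and the second is 2, so the two students with no attendance record would otherwise have slipped into the average.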

E. LIST - lists the values of variables (in the order they are in the dataset, repeating any duplicates)

list
list reading
list reading in 1/5

What did the last command do? Answer: listed only the first five observations

F. IF - a condition that qualifies any command to which it is applied

The command is only performed if the condition is met. For example, imagine we wanted the
average of the girls' math scores, i.e., the average of the scores of ONLY girls.

For this we would want to look at individuals who meet the condition of being female. In other
words, we want: female == 1. So, try:

summ math if female == 1

Note the double equal (==) sign. This is the way Stata signals that a condition is imposed,
meaning you're not changing the variable. If you just put one equal sign instead of two,
Stata will give you an error. You can try it to see what Stata error feedback looks like.

What did you find to be the average math score for girls? Answer: 80.1

Now find what the average score is for boys. First, what is the Stata condition of being male?
(Hint: remember, there is no male variable, so you have to use the female one.) Answer:
female == 0

Now how would you find the average score for the boys using that condition? Answer: summ
math if female == 0; the answer is 83.3

How do boys compare with girls? Answer: the boys' score is higher, although the difference does
not seem significant judging by the standard deviations.

For another example, let's see how this works with COUNT, a command that counts the number
of observations that satisfy the specified conditions. (If no conditions are specified, count
displays the number of observations in the data. Type in count to see what it does. You should
get the total number of observations in the dataset. A: 39) Now try:

count if female == 0

to see the number of boys in the dataset. A: 18
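The pattern of adding an if condition to a command can be tried on a tiny made-up dataset where the answers are easy to check by eye. The five rows below are invented for illustration and are not the intro.dta data.

```stata
* Toy data, invented for illustration (not intro.dta)
clear
input female math
1 80
1 70
0 90
0 60
0 75
end

* No condition: every observation is counted
count

* Same command, restricted to boys
count if female == 0

* Same idea with summarize: girls' math scores only
summarize math if female == 1
```

Here count reports 5 observations, count if female == 0 reports 3, and the girls' mean is 75 (the average of 80 and 70).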

G. SYNTAX - AND/OR - used to apply multiple conditions at the same time

Suppose you wanted to see the math scores not just for girls, but only for girls in school 3. Or
you wanted to know what the math score average was in schools 1 and 2 (but not 3 and 4). Or
suppose you wanted to get really fancy and wanted to see the math score averages for girls in
schools 1 and 2. How would you go about that? This is where and/or syntax comes in.

Suppose you have two conditions: A and B.


AND is used when both the condition A and the condition B have to be met. In the illustration
below, the AND condition is the purple-colored intersection between the two circles.

AND is coded as the symbol & in Stata. Stata doesn't recognize "and" (spelled out) as a
command and will not execute it. So, if you wanted to meet both the conditions A and B, you
would input:

command if A == true & B == true

Using our previous example, to see the score for girls in school 3, think of being a girl as
condition A and being in school 3 as condition B. We need the students to meet both conditions.
To get people who fall in both groups you have to use the AND function:

summarize math if female == 1 & school == 3

What is the average math score for girls in school 3? Answer: 70.5

OR is used by Stata to signify the group for which either of the conditions (including both
conditions at the same time) is met. In the illustration below, saying the person can meet
condition A or B would mean anything within the A and B circles (colored a pleasant violet color
this time) is acceptable; that is, anyone who's A, anyone who's B, and anyone who's both.
However, note that things in C that are not also part of B are outside our condition.

OR is denoted by the symbol | (typed with Shift + \).

So, to see the average of the math scores for schools 1 and 2, we say:

summarize math if school == 1 | school == 2

Answer: average is 81.9

Now that we have the basics down, you can combine these in various ways to create any
combination imaginable. Just as in math, certain operations take precedence over others, so always
use parentheses to make sure Stata does exactly what you want it to. Until you feel like you know
exactly how Stata reads each command, always check your output and make sure what
happened is what you wanted to happen.

For example, imagine we wanted to see the reading scores for people who have math scores
less than 65 or more than 90, due to some interest in looking at outliers. Think about how you
would do this.

A: You would look at people with scores less than 65 OR more than 90. What if you said
AND instead? The command would find no observations, because there are 0 people who have a
score below 65 and above 90 at the same time; that's impossible!

summ reading if math < 65 | math > 90

Now, what if you wanted to see this only for school 1? Use parentheses to delineate very clearly
for Stata what you want it to do. Basically, you want it to look at the people who have the
necessary scores AND who are in school 1.

summ reading if (math < 65 | math > 90) & school == 1

If you didn't put the parentheses there, Stata would read it as you wanting to summarize the
reading score for all people who have a math score below 65 (regardless of school) and the
separate group of those who are in school 1 AND have a math score over 90.
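The effect of the parentheses can be seen by counting observations on a small made-up dataset. The six rows below are invented for illustration and are not from intro.dta; the sketch relies on the fact that Stata evaluates & before |.

```stata
* Toy data, invented for illustration (not intro.dta)
clear
input school math reading
1 60 55
1 95 90
1 80 75
2 50 45
2 92 88
2 70 65
end

* With parentheses: (low OR high math score) AND in school 1
count if (math < 65 | math > 90) & school == 1

* Without parentheses, & is evaluated before |, so Stata reads this as
* math < 65 OR (math > 90 AND school == 1)
count if math < 65 | math > 90 & school == 1
```

The first count is 2 (the students with 60 and 95 in school 1). The second is 3, because the school 2 student with a math score of 50 is let in as well.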

To try another one, let's look at getting the average math score for girls in schools 1 and 2.

The answer is below, but try thinking it through yourself first. Sketch out a Venn diagram if you
need help writing the code.

A: summ math if female == 1 & (school == 1 | school == 2)

What would it mean if you forgot the parentheses?

A: Stata will think you want to see all females in school 1 plus everyone in school 2,
which makes no sense. So be careful not to do that!

3. MANIPULATING YOUR DATA: VARIABLE GENERATION

A. SAVE - saves the dataset


Before you start messing with the dataset, you have to remember this: DO NOT EVER
OVERWRITE THE ORIGINAL DATA! This means you do not ever make modifications to the
original data and then save them in the same dataset. Imagine you made a mistake in coding and
then saved the dataset with that mistake. What are you going to do now? (Note: always keep
extra untouched copies of the raw data in a separate folder, just in case.)

To avoid that, save the dataset under a different name (create a 2nd, modified version of the
data). Please put it in the Data folder, together with the original dataset you are using. You
should do this every time you make significant modifications to the data, and always do it if you
work with the raw data. To do that, say

save intro_modified.dta, replace

Note: If you just say save with no file name, Stata saves over the dataset that's open. The replace
option after the comma specifies that if there is already a dataset with that name in the location,
Stata will overwrite it with this one instead of creating a new one. Without replace, Stata refuses
to overwrite an existing file and gives you an error.

If you are working with a dataset that is ok to modify (for instance, you saved it in the beginning
of your work under a different name and now want to save all the changes you made), you can
just say save, replace without further specifications.

B. GENERATE - create or change contents of a variable

Generate is an extremely easy command. You basically assign a value to the new variable that
you're creating, using conditions and pre-existing information in the dataset. Try this:

gen test = 1

You should see a new variable appear in your dataset, called test. If you browse you will see
that this variable is uniformly equal to 1.

Note that for generation, since it's not a condition and you are in fact changing the variable, you
only need one equal sign instead of two.

Now let's try imposing conditions on variable creation using the if condition you just learned.

For instance, imagine that we just found out that two out of the four schools are private and two
are not. So we want to generate a private school variable (let's call it private). We know that
school 3 is a private school, so for now let's generate a variable for a private school that's equal
to 1 for school 3.

gen private = 1 if school == 3

Now browse the two relevant variables. private should be 1 when school is 3 and
missing when it isn't.

Note that Stata recognizes capitalization, so Private is a different variable from private.

C. REPLACE - modify contents of a variable

The same way generate creates a new variable equal to your specifications, replace does that for
existing variables. For instance, try:

replace test = 2

(and browse)

As you can see now, the values of the test variable that were equal to 1 before are now all equal
to 2 (so you're literally replacing the value of the variable with something else).

Now imagine that you learned that school 1 is private as well and you want to modify our private
school variable (private):

replace private = 1 if school == 1

Finally, note that the private variable is missing when it's not equal to 1. Now let's set it to 0
for schools 2 and 4:

replace private = 0 if school == 2 | school == 4

Note 1: Variables don't have to just be assigned a uniform value with conditions. You can instead
set the variable equal to another or to itself with modifications. For instance, suppose you
realized that you had asked how frequently someone had done something (like get water from a
well) a day. And now you wanted a weekly measure. You could create a new variable with the
old daily one multiplied by 7 (because there are 7 days a week) to create a variable that is a
weekly measure.

To show you the mechanics, suppose we wanted to make the test variable twice as big as it had
been before:

replace test = test*2
browse

As you can see the test variable is now equal to 4. As is standard for these operations, *
stands for multiplication. You can also use / for division, ^ for raising to a power, etc.
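The daily-to-weekly example from Note 1 can be sketched as follows. The variable names trips_per_day and trips_per_week and the four values are hypothetical, invented for illustration; they are not in intro.dta.

```stata
* Toy data, invented for illustration (not intro.dta)
clear
input trips_per_day
1
2
0
3
end

* A new variable built from an existing one: 7 days in a week
generate trips_per_week = trips_per_day * 7

* A variable modified in terms of itself
generate test = 1
replace test = test * 2

list
```

Keeping the daily variable and generating a separate weekly one, rather than replacing the original, leaves the raw answers available for checking.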

Note 2: Suppose you want to make a variable equal to 1 when school is NOT equal to 1. To
symbolize "not", Stata uses the symbol ~ (which is found to the left of your 1 key); the symbol !
works the same way. For instance, let's create a 2nd imaginary variable that tags all schools
except school 1:

gen test2 = 1 if school ~= 1

D. DROP - eliminates variables or observations

Now suppose you want to drop a variable. In our case, we have the test and test2 variables that
we definitely do not need and that clutter up the dataset. Use:

drop test test2

E. SORT - arranges the observations of the current data into ascending order based on the values
of the variables you list after the command. Try the following exercise:

Browse the data. Notice that right now the data is sorted by school. Suppose we wanted
Stata to look at the data in the order of the student id instead. Say:

sort student

Now browse again. You can see the data is sorted by the student id. What if you wanted Stata to
sort by student id within school? To do that, say:

sort school student

Browse to see what that did. There is actually no limit to the number of variables that you can
sort by at one time; Stata just reads them left to right and sorts in that order.

Certain commands will require you to sort the data beforehand, and sometimes it's a great trick
to use with commands that don't require it. However, you have to be very careful when using
sort.

If you sort by a non-unique set of variables (for example, by a household ID instead of an
individual one), observations within the group (in this case household ID) are sorted randomly.
What do we mean by that? Suppose your data was not sorted by individual id, and you sorted it
by school. Although the school ids would come out in order each time, the individuals within each
school would be in random order. So if individuals 1, 2, 3 belong to school 1 and individuals
4, 5 belong to school 2, and you kept sorting by school (by saying sort school) and then
listed them, you might get 3 different orders on 3 different runs, even though the
command remains the same:

      1st sort                2nd sort                3rd sort
school   individual     school   individual     school   individual
  1          1            1          3            1          2
  1          2            1          1            1          1
  1          3            1          2            1          3
  2          4            2          4            2          5
  2          5            2          5            2          4

So if you then ran a command that, say, assigned new ids to those people based on the
random order they just got, their IDs would actually change every time you re-ran the
program. This can end in a complete disaster resulting in months of extra work (true story). To
avoid this, make sure to sort by a unique id or use sort school, stable, which keeps
observations with the same value of school in the order they were already in, so the result is
the same every time you sort.
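The two safeguards for sort can be sketched on a small made-up dataset. The rows below are invented for illustration and are not from intro.dta. The isid command is not covered in this class: it is a built-in Stata check that stops with an error if the listed variables do not uniquely identify the observations.

```stata
* Toy data, invented for illustration (not intro.dta)
clear
input school student
2 5
1 3
1 1
2 4
1 2
end

* Option 1: sort on a set of variables that uniquely identifies each row.
* isid stops with an error if school + student is not unique.
isid school student
sort school student
list

* Option 2: sort on school alone, but keep observations that share the
* same school in the order they were already in
sort school, stable
list
```

Because the data was already ordered by student within school before the second sort, the stable option leaves that order untouched.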

F. SAVE, revisited - now that we have the dataset saved as intro_modified.dta and won't be
overwriting the original data, we can just save without specifications. NEVER OVERWRITE THE
ORIGINAL DATA!

save, replace

4. SOME MORE EXERCISES:


Note! All the answers are in white font below the questions, as before. There are also some hints
you can highlight and see; these are designated as "Hint:".

1.a. Find the average math score for male students who are in private school. Call the new variable you
create to do this male_pr, for "male in private school".

Answer: gen male_pr = 1 if female == 0 & private == 1
replace male_pr = 0 if male_pr ~= 1
sum math if male_pr == 1

Average: 91.1

Hint: First create a variable for male students in private school and then find the average.
To do this, generate a "tag" for male students who are in private school (a tag is a variable that tells you
when something fulfills certain requirements; it's usually 1 when the condition is true and 0 if not).

1.b. Now find out how the boys in private schools compare to the rest of the population (those who are
not boys in private school), and to the average score of other boys (not in private school).

A: summ math if male_pr == 1
summ math if female == 0 & male_pr ~= 1
(for other boys only)
summ math if male_pr ~= 1
(for everyone else)

2.a. Just like the math scores, the reading scores were supposed to have values from 0 to 100; however,
if you summarize or tabulate the variable you'll see that they instead range from 0 to 1. The percentages
are the same but are out of 1 instead of out of 100. Try fixing it by adjusting the scale correctly:

Hint: Remember when we talked about how you can set the variable equal to another or to itself with
modifications? (replace test = test*2)

A: replace reading = reading*100

2.b. Now suppose we're developing a program that wants to help students with failing reading scores
(less than 65). Generate a tag for those failing students so we can identify our sample. Call it
failing_reading.

A: gen failing_reading = 1 if reading < 65
replace failing_reading = 0 if reading >= 65

There's actually a more elegant and simple way to produce the same result, with one line instead of two:

gen failing_reading2 = (reading < 65)

Stata interprets the line above as creating a tag. So when you say

gen variable = (condition)

Stata will create a variable that is 1 when the condition is met and 0 when it's not. In this case
it will make failing_reading2 equal to 1 when the reading score is less than 65 and 0 when it's not.

2.c. If you recall, reading scores were missing for some people who didn't show up for the test. What
happened to the tag for those scores? What command would you use to check?

A: tab failing_reading if reading == .

(browse also works)

2.d. As you can see, the missing scores automatically got coded as 0! In this case the missing values
might be introducing bias, since it's possible, for instance, that the students chose to skip the test
because they knew they'd fail. In other circumstances you may want to code certain missings as
something else. Just always make sure to account for them and make sure you don't include them in (or
exclude them from) your code by accident.

As a recap, how did the missings get in there? Stata considers them the biggest value possible, so
whenever you generate a variable based on whether the value of another variable is more than a set
amount, you have to account for missing values.

In either case, we want to keep the tag as missing if the reading score was originally missing. Please fix
the previous mistake.

A: replace failing_reading = . if reading == .

2.e. In general, to avoid this problem, account for it when generating variables. Always be careful with
missing values. Try to generate failing_reading3 as a new variable for failing reading using the most
efficient method possible and accounting for the missing values.

A: gen failing_reading3 = (reading<65) if reading ~= .
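The difference between a tag that ignores missing values and one that accounts for them can be checked on a few made-up scores. The five values below and the names tag_naive and tag_safe are invented for illustration; missing() is a built-in Stata function that is equivalent here to writing reading == . by hand.

```stata
* Toy data, invented for illustration (not intro.dta)
clear
input reading
50
80
.
64
.
end

* Tag that ignores missings: a missing score is not < 65, so it is coded 0
generate tag_naive = (reading < 65)

* Tag that stays missing when the score itself is missing
generate tag_safe = (reading < 65) if !missing(reading)

* The missing option shows the difference between the two
tabulate tag_naive, missing
tabulate tag_safe, missing
```

In the first table the two students with no score are counted among the zeros; in the second they appear in their own missing row.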

2.f. Now imagine that instead of just helping the failing students, we wanted to split people into 3
achievement levels to best address each group's needs.

Generate a variable that divides the students into 3 tracked groups of different levels (call the variable
level): those below 65, those at 90 or above, and those in the middle. Assign the missings to the lowest
level just for the purposes of this exercise.

Answer: tab reading
(use this to eyeball the sizes of the groups)
gen level = 1 if reading < 65 | reading == .
replace level = 2 if reading >= 65 & reading < 90
replace level = 3 if reading >= 90 & reading ~= .

Note that the last line has to exclude missing values explicitly: since Stata treats missing as larger
than any number, reading >= 90 alone would move the missings from level 1 to level 3.

Hint: you don't have to set the new generated variable equal to just 1 or 0. You can create as many
groups as you want, just setting it equal to a new number each time. So make the value of the variable
for one group equal to 1, the second one equal to 2, the 3rd to 3, etc.

SUMMARY

Now you should know:
- What Stata is, what it looks like and how to operate it
- How to look at data and how to create some basic statistics
- How to use and/or in Stata correctly
- How to save data
- How to generate new variables, modify existing ones, and drop them
- How to sort your data

The commands you should have learned are:
display
browse
summarize
tabulate
list
if (condition)
& (and)
| (or)
save
generate
replace
drop
sort

CLASS 2: DO-FILES AND REPLICATION

5. CLASS 1 REVIEW AND BUILDING ON IT

Please open Stata itself by double-clicking on the Stata icon. Now let's open the original data file that
we used (intro.dta), but use a different method to do it than before. This new method involves
opening the file internally from Stata and will come in very handy soon.
Just to remind you, this dataset contains math and reading scores, as well as attendance rates, of
students in various schools. It also has information on their gender.

use "PUT DIRECTORY HERE/intro.dta"

(The quotation marks are needed whenever the directory name contains spaces.)

Now let's review some of the commands we learned last time. First, how do you find the average of
the attendance rate?
A: summ attendance

How would you look at all the values that attendance rate takes and how frequently they occur?
A: tab attendance

Now suppose the kids have to re-take the year of school if they miss more than 40% of classes. What
percentage of all the kids currently have attendance rates that are too low? (Tip: create a new
variable for this.)
A: gen lowattendance = (attendance < .60)
summarize lowattendance
The percentage is 12.8

What is the low-attendance percentage in private versus public schools? (Tip: you don't have to create
the private school variable anew, although you can. Instead just specify which schools are private
(1, 3) and which are public (2, 4).)
A: tab lowattendance if school == 1 | school == 3
tab lowattendance if school == 2 | school == 4
10.5% in private and 15% in public

What is the average attendance rate for boys in public schools?
A: sum attendance if (school == 2 | school == 4) & female == 0
80% average attendance

Pro tip: you don't have to type out the whole variable name; you just have to type
enough of it to make it unique for Stata to recognize it. So here, you could've just typed sum attend
or sum att, as long as no other variable starts with those letters. This obviously doesn't work
with datasets with variables named q1, q15, q15c, q156, etc.

6. DO-FILES: WHAT ARE THEY AND WHY DO YOU NEED THEM?

G. SO WHAT IS A DO-FILE?
Let's imagine the following situation: you just found out you have to present your results to a
partner or a PI, including all the averages you produced and comparisons you made. Suppose you
hadn't written them down.
How would you go about it? Would you reproduce everything you did for class 1 from scratch? Can
you do it? How long would it take you to do? Just re-typing all those commands into Stata in order
would take at least an hour.
Now imagine you need to make a tiny little tweak to an important variable early in the sequence
you just painstakingly typed, which will change all the subsequent results. What do you do?
An important feature of any good research project is that the results should be reproducible. For
Stata the easiest way to do this is to create a file that lists all your commands in order, so anyone
can re-run all your Stata work on a project anytime. Such text files linked to Stata are called do-files,
because they have the extension .do (like intro_exercise.do). These files feed commands directly into
Stata without you having to type or copy them into the command window.
An added bonus is that having do-files makes it very easy to fix your typos, re-order commands, and
create more complicated chains of commands that wouldn't work otherwise. You can now quickly
reproduce your work and build on it.
Finally, do-files make it possible for multiple people to work on a project, which is necessary for
cooperating with your PI and for handing the project off to a new PA/RA if you're working on a long
project or if you need help.

H. STARTING A DO-FILE
To start a new do-file in Stata, either:
- In the menu bar up top, go to Window > Do-File Editor > New Do-File
- OR press Ctrl + 8 (Ctrl + N if you're on a Mac)
A blank do-file will open.
Now go to your Review window and highlight all the commands you used today in your class 1
review (do this by clicking on the 1st command, holding Shift and clicking on the last command).
Now copy the highlighted commands (press Ctrl + C) and paste (Ctrl + V) them into the blank do-file
document.
Congratulations! You've now created your first do-file! That's all a do-file is: a list of commands that
you want Stata to execute and remember. Please edit your do-file by removing any commands that
were typed wrong, etc., so it runs smoothly.

I. EXECUTING COMMANDS
To execute a command in a do-file, highlight it and click on the "Execute (do)" button, the last
button on the menu (as seen below). The keyboard shortcut on a PC for this is Ctrl + D.
If you don't highlight something specific and just click Execute, Stata will run the entire file, line by
line.
Normally you would type commands directly into the do-file instead of copy-pasting them from
Stata, since it is much more efficient. So, from now on, if we ask you to execute a command please
type it out in the do-file and then execute it.

7. REPRODUCIBILITY

J. FOLDER STRUCTURE AND ORGANIZATION

When you first start the project, immediately set up a folder structure for data cleaning/analysis to
which you will conform. (Don't do this right now, but re-organize your folders later if they are not
structured this way already.) In general, try to keep everything related to the data analysis of the
project in one big folder and have several subfolders. Some typical subfolders that are good to have
are:
- Do (for your do-files)
- Raw (for your raw data)
- Data (for your processed data)
- Backup (for your raw data backup, plus some of the most important do-files or
  processed data. Keep it updated!)
  Note: you also need to back up your raw data elsewhere on the
  computer and externally, just in case.
Some people also use folders like:
- Graphs (for your graphic output)
- Tables or Output (for all other output)
- Communications or Discussions (to keep track of any data cleaning/analysis-
  related discussions, decisions, issues)
- Log (for your logs)

K. MASTER DO-FILE
Now suppose youre pretty far into dealing with your projects data and you have a whole bunch of
do-files that perform different stages of cleaning and analysis. Whats the best way to keep them
organized?
A master do-file is the main do-file for a project that calls all the other cleaning and analysis files in
order. Basically, you should be able to open the master do-file and click Execute, and have it
automatically start with the raw data, go through all the cleaning, and end up with the final output.
Another thing a master do-file should do is explain the project and the data cleaning and analysis
process cohesively. Once again, you want to be able to open the file and easily identify what the
project is, who the person responsible is, what the important variables are, what sort of cleaning
and analysis has been done, what the output is, and where to refer for further information and
detail on cleaning and analysis.
As you can see, the command to embed other do-files within the master is simply
do location_of_do-file/do-file_to_execute.do
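A master do-file along these lines might look like the sketch below. The folder placeholder and the three do-file names (01_import.do, 02_clean.do, 03_analysis.do) are hypothetical and only illustrate the idea; substitute your own.

```stata
* master.do - runs the whole project from raw data to final output
* NOTE: the folder and the do-file names below are placeholders (assumptions)

clear
set more off

* Stage 1: read the raw data and save it as a .dta file
do "location_of_do-files/01_import.do"

* Stage 2: clean and label the data
do "location_of_do-files/02_clean.do"

* Stage 3: produce the tables and graphs
do "location_of_do-files/03_analysis.do"
```

Because each stage is a separate file, you can rerun a single stage while you work on it and rerun the master when you want to rebuild everything from scratch.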

L. ANNOTATION
One of the most important things you can do for keeping your work replicable is to always, without
fail, annotate your coding as you go and keep your do-file organized. This is not only useful to you
(when you inevitably forget what you were trying to check when you tabulated this variable or why
you needed to clean up that variable), but to all the people who will take over your project or will
have to look at it later.
a. What is annotation?
What do we mean by annotating? It means that the do-file should be well-organized and
contain lots of comments (that Stata won't read as commands) explaining what each step of
your coding does and why it does it. For example, in your do-file (that you just created), you
can explain why you included the following commands:
gen lowattendance = ( attendance < .60 )
tab lowattendance if school == 1 | school == 3
tab lowattendance if school == 2 | school == 4

By leaving a comment on top of them that says:

* Now we will create a variable to see how many people are at risk of
* having to retake the class due to low attendance (less than 60%), and how that
* percentage differs by public and private schools

Note the * in front of each line of the comment. A * at the start of a line tells Stata not to
execute that line, and it covers that line only, so every line of the comment needs its own *.
This is one way to create comments. If the comment is long, however, or contains
several lines, it's better to use another commenting format:

/* If your comment is long format it like this so that you can break
it up into two different lines
*/

Stata will ignore anything in these brackets /* */

Note that for some versions of Stata the 2nd part of the command */ has to be located on a
new line for Stata to recognize it properly.

b. Purpose of the do-file


An important part of do-file annotation is the initial information about the do-file that is listed
at the beginning. So, when starting a do-file, at the very top you should always create a section
that contains the following information:
/*
Name: beginners_class2.do
Date Created: December 18, 2011
Date Last Modified: January 12, 2012
Created by: Gean Spektor
Modified By: GS
Last modified by:
Uses data: intro.dta
Creates data: intro_modified.dta
Description: This file is a part of exercises that are designed as an introduction to Stata
for beginners, used at IPA-JPAL Staff Training. This particular do-file is created during
Class 2 of the absolute beginners (Level 1) series in order to give the trainees their first
experience with a do-file and demonstrate some of the common practices for do-file
organization
*/
8. IMPORTING DATA
M. TYPICAL OPENING COMMANDS
You should always start your do-file with the following set of commands:
clear
set more off
set mem 50m

Clear - clears out any previous dataset that Stata has loaded.
If you tried to load a dataset right now within the Stata window you have open, it would give you
an error, saying "no; data in memory would be lost". This is because once you
already have a dataset open and have made changes within it (even ones you don't want to
save), Stata will not open another one on top of it. To avoid this issue, every time you write a do-
file that opens new data, start it with clear.
Set more off - Stata typically will only show enough output to fill up the results window, and
then pause until you press any key to tell it to continue (or show "more"). While this is useful
when you're looking through results, for the most part having to press a key several times to
get Stata to execute a set of commands becomes very annoying very fast. To avoid this, you want
to set more off, that is, tell Stata that you don't want it to ask whether you would like to see
more and that you want all the results processed at once.
Set memory - This command expands the amount of memory that Stata allocates to opening and
running the dataset. Memory needs to be increased any time you are running a dataset bigger
than 1 MB. (Stata 12 and later manage memory automatically, so this command only matters in
earlier versions.)
To determine how high to set the memory, look at the size of your dataset (right click on the
dataset and go to Properties) and then add about 20% more for processing and adding new
variables.
The syntax of the command is
set mem 30m

Where set mem is the set memory command (shortened), and 30m is how high you want the
memory set (in this case, 30 megabytes).
Other good commands to include in this initial set are:
Version of Stata - when multiple people work on a problem set, they sometimes end up using
different versions of Stata (in the same way someone might have Windows 7 and another person
Windows XP or a Mac). IPA typically works with Stata 10.0 or 11.0. The commands in Stata are
mostly standardized across versions, but occasionally the syntax changes slightly and a command
won't run properly. To avoid this, tell Stata right away which version of Stata
to use, setting it to the lowest one available in the group that will use the files. The syntax is
simply
version 10.0

Log close - after you are done with the training, you should look up how to log your do-file. A
log automatically creates a record of all the commands run and their results in a separate file (it's
different from a do-file in that it is created by Stata by recording what goes on in its results
window, including output). Logging is a great strategy for replication. However, it is impossible to
open a new log if one is already running. That is why you need to close a log by using the
command log close. However, if there is no log open, log close will create an error,
which will stop the running of your do-file. Cap, or capture, is a prefix to commands that tells
Stata to execute the command if it works and to skip over it if it produces an error. Generally it is
good to be careful with this prefix, since it can mask serious errors, but in this case it is
appropriate.
cap log close
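Put together, the opening block described above could look like the following sketch. The log file name my_first_log.log is a made-up example, and the log using / log close pair goes slightly beyond what this class covers, so treat it as a preview rather than a required step.

```stata
* Typical opening block for a do-file (the log file name is a placeholder)

* Drop any dataset currently in memory
clear
* Do not pause each time the Results window fills up
set more off
* Needed in Stata 11 and earlier; Stata 12 and later ignore it
set mem 50m
* Interpret the commands below using Stata 10 rules
version 10.0
* Close a log if one happens to be open; do nothing otherwise
cap log close

* Start recording commands and output in a text log
log using "my_first_log.log", replace
display "Setup complete"
log close
```

Run the block twice in a row: the second run does not stop with an error, because cap log close quietly handles the case where no log is open.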

Now let's add these commands to your existing do-file, and then save the do-file in the DoFile
subfolder in your Stata folder on the USB. Let's close the do-file we've created and open a new do-
file for the next few exercises. Please start it with the typical set of commands as well.
APPLYING WHAT YOU'VE LEARNED

Now that we have the new do-file open and set up (let's highlight those commands and run them),
we can learn how to load in data in different formats.
N. INSHEET - very frequently you will have to bring in raw data that's not in the .dta format yet.
For this you should use the insheet command. The syntax is:
insheet using yourfilepath/yourfile.xls, options

Where "options" represents all the different specifications you can input. For example, if you
know your data is tab or comma delimited, you can put tab or comma where "options" is. Stata
can determine by itself whether the data is tab or comma delimited if you don't specify anything.
Note that insheet reads delimited text, so the file has to contain tab- or comma-delimited text
whatever its extension is. For instance, let's insheet the Excel intro data from class one. The Excel
sheet is located in Stata > Beginners Stata > Raw on the USB.
insheet using yourfilepath/intro.xls
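To see insheet work without depending on any particular file, the sketch below first writes a small comma-delimited file and then reads it back. It uses Stata's built-in auto example dataset; toy_scores.csv is a made-up file name that gets created in your current working directory.

```stata
* Write a small comma-delimited text file from a built-in example dataset
sysuse auto, clear
keep make price mpg
outsheet using "toy_scores.csv", comma replace

* Read it back in. The comma option says how the file is delimited,
* and clear drops the data currently in memory first.
insheet using "toy_scores.csv", comma clear
describe
list in 1/5
```

In Stata 12 and later, native Excel workbooks can also be read directly with the import excel command.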

O. MERGE - now that you have the intro dataset loaded, imagine that it is actually the baseline
for a study, and you have a separate file with endline scores. Since you now want to compare
the baseline and endline scores, you want to merge these into a single file.

To do this you must have 2 datasets with a common unique identifier that has the same name
in both. In this case it's the student number (id).

Use the merge command. The merge command syntax differs depending
on the Stata version you use. In versions 10 and below, the data has to be sorted by the unique
identifier before it can be merged. So, you have to prep both of the datasets by sorting the
data first, and then merge. To do this say
sort id
save your_location/intro_imported.dta, replace

Now open the 2nd dataset that you will be merging, called intro_endline.dta:

use "C:\Users\IPA\Desktop\IPA Training\2012 STAFF TRAINING\Content\Stata\intro_endline.dta", clear

Note: please notice the , clear after the use. This automatically closes the previous dataset
and allows you to open up the new one without extra coding.
sort id

Now comes the merge command. While the 2nd dataset is open and sorted, you want to
merge using the 1st dataset (the one you sorted, saved and closed before). The format for it
is:
merge unique_id using location_of_first_dataset/firstdataset.dta

In this case the command is:


merge id using "your location/Data/intro_imported.dta"

As you can see, this created a dataset that contains the data from intro.dta matched with the
endline score data. Now you can play around and figure out whether the treatment group
has done better than the control, by how much, etc. You can save this dataset as something
else and explore it later.
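The commands above use the Stata 10 syntax. From Stata 11 onward, merge is written as merge 1:1 (or m:1, 1:m) and no longer needs the data to be sorted first. Here is a minimal sketch with two invented toy files; the file name toy_baseline.dta and all the scores are made up for illustration.

```stata
* Toy baseline file (values are invented)
clear
input id math_base
1 50
2 60
3 70
end
save "toy_baseline.dta", replace

* Toy endline file (values are invented)
clear
input id math_end
1 55
2 65
3 72
end

* Stata 11+ syntax: 1:1 states that id is unique in both files
merge 1:1 id using "toy_baseline.dta"

* _merge == 3 means the id was found in both files
tab _merge
list
```

Always tabulate _merge after merging: values of 1 or 2 point to ids that appear in only one of the two files.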

P. USE - earlier you learned how to load a dataset with the use command. Please do this for
the dataset ChildTest.dta, which you will find in the Data subfolder in the Stata folder on
your USB. Don't forget to put the clear command before loading in the dataset (or include it in
the options for use), since you have a different one open now!
use yourdirectory/ChildTest.dta

Q. RELATIVE REFERENCES
So far you've had to open datasets by providing the entire path to wherever the data is located
and save them accordingly. You've only had to open a few, but I bet it has gotten pretty annoying
already. Now, imagine you and your PI (or PC, or both) are working on the same do-file, whether
by keeping one in common on Dropbox, or sending it back and forth. If you spell out the entire
path to the file whenever you perform an operation, every time you switch users you would have
to change all the paths everywhere, which can be next to impossible in longer do-files.
Thankfully, there is a remedy to this problem.

This remedy is referred to as using relative references. There are several ways to do this, such as
the cd command and using globals; each one has its own advantages and disadvantages. The
simplest one, and most commonly used by IPA, is cd.

In essence, instead of spelling out C:/Desktop/etc. (which is called an absolute reference), cd
lets you tell Stata that you will be working from the X main folder, so that everything you refer to
from then on is looked up in that folder. The setting lasts until you cd somewhere else or close
Stata. Basically, instead of starting its addresses from the top of your computer, Stata will now
start them from the folder you specified. To specify a folder say:
cd C:/path_to_your_main_Stata_project_folder/main_project_folder

Now that you've specified your main folder (the one containing the Data, Do, Raw, etc. subfolders),
you only have to specify the path starting from there every time you use a file. For instance, if
you wanted to save this dataset as something else, after you set the cd you'd only need to say
save Data/ChildTest_modified.dta, replace

This is extremely useful if you are working with others. This way each person only has to modify
the cd command as long as everyone has identical folder structure within the main Stata project
folder.
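The sketch below shows a relative reference in action, followed by the globals alternative mentioned above. It uses the built-in auto dataset and whatever folder Stata is currently working in, so no real project path is assumed; auto_copy.dta and the global name projdir are made-up names.

```stata
* Show the current working directory (the folder set by cd)
pwd

* Relative references: no folder is spelled out, so Stata uses the working directory
sysuse auto, clear
save "auto_copy.dta", replace
use "auto_copy.dta", clear

* The globals alternative: store the project folder once, then reuse it.
* Here the global just copies the current directory; in a real project you
* would type your own project path between the quotes.
global projdir "`c(pwd)'"
use "$projdir/auto_copy.dta", clear
describe, short
```

With cd, each user edits one cd line at the top of the do-file; with a global, each user edits the one line that defines the global. Either way the rest of the do-file stays untouched.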
Now for practice specify your main folder using the cd command at the top of your do-file and
change the commands using absolute references to relative references.

9. NAMING AND LABELING

R. RENAME - sometimes it is useful to rename your variables, either because the old name was
incorrect or clunky, or, more often, because merging or creating a new variable
calls for clarification. For example, if you merge the baseline, midline and endline of a survey,
you might want to add _base, _mid and _end suffixes to all your variables so they don't get confused
in the merge. The syntax for this command is very simple:
rename old_name new_name
For example, for clarity:
rename survey surveyround

S. LABELING - It is always useful to label your data, to give yourself and others a good sense of
what each variable represents, what its values are, and generally to make the dataset nicer to
use. If you look at your variables window, you'll see that next to the variable name column,
there's a label column, which shows you the labels. This comes in especially handy if your
variables all have names like q1, q23b, etc.
There are two principal things you should be labeling: the variables themselves and the values of
these variables. In general, any command concerned with labeling will start with the word
label, such as label variable, label drop, label values, etc.

a. LABEL VARIABLE - Labeling a variable basically involves attaching an explanation for Stata
to display each time you call the variable up. To label a variable itself, simply type
label variable var "the label"
where var is the variable you want to label.

For instance, let's label the unique ID:

lab var childid "Child's ID Number"
Now label the standard_childtest variable with "What standard is the child in?"
lab var standard_childtest "What standard is the child in?"
b. LABELING VALUES - now suppose you have a variable with values that are supposed to
represent something. That is, instead of having a variable like math or reading
in the intro.dta dataset, where the values were just the scores themselves, you have a
variable like female, where 1 represents "yes" and 0 represents "no". Or, in an even
more complicated situation, think of a region variable, where 1 stands for northern
region, 2 for western, etc. How would you know which number is supposed to stand for which
region in that situation? You would definitely want to label these values so there's no confusion
as to what each number means. This involves two steps. First you have to create a label
(think of it as writing the address for a package on a piece of paper) and then you have to
attach it to a particular variable (just as you'd then have to glue the paper to the package).

i. Label Define - to define a label you have to list the values the variable takes next to
the label you want attached to each value. The format is as follows:
label define nameoflabel 1 "Label for value 1" 2 "Label for value 2" 3 "Label for value 3"

Where nameoflabel is whatever name you want to call the label itself.
Please note that this is NOT the name of the variable that you want to label,
although since Stata thinks of labels as different from variables, you can
actually use the variable name as a label name as well.

For example, let's define a label for different types of extra classes available:
label define extra_classes 0 "None" 1 "Individual Tutoring" 2 "Coaching" 3 "Special Free Classes" 4 "Both tutoring and free classes"
ii. Label Values - now you have to attach the label you created to the variable you
want labeled. The format is:
label values variable_you_want_labeled label_name

Here:

label values extraclasses extra_classes

Note that you don't have to attach a label to only one variable. You can in fact
attach the same label (for example, a yes-no one) to multiple variables at once. All
you have to do is say:

label values variable1 variable2 var3 label_name

c. There are several other useful label commands, such as label drop, which you can
teach yourself through the Stata help files once you learn about the help files and how to
use them in the 3rd class.
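The two labeling steps can be tried on a tiny invented dataset, as in the sketch below. The variables, values and label names (yesno, regionlbl) are all made up for illustration and are not part of ChildTest.dta.

```stata
* Tiny invented dataset
clear
input id female region
1 1 1
2 0 2
3 1 2
4 0 1
end

* Label the variables themselves
label variable female "Is the respondent female?"
label variable region "Region of residence"

* Step 1: define the value labels (write the address on the paper)
label define yesno 0 "No" 1 "Yes"
label define regionlbl 1 "Northern" 2 "Western"

* Step 2: attach them to variables (glue the paper to the package)
label values female yesno
label values region regionlbl

* Tables now show the words instead of the numbers
tab region female
label list
```

The underlying data are still the numbers 0, 1 and 2; the labels only change how Stata displays them.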
SUMMARY

Now you should know:
- What a do-file is, how to open one, how to execute commands in one
- How to make your work reproducible
o What the proper folder structure is for Stata work for your project
o How to use a master file
o How to annotate your files (always do it!)
- What are some necessary commands in a do-file
- How to import data
- How to label data

The commands you should've learned are:
do
clear
set more off
set mem
version
cap log close
insheet
merge
use
use, clear
cd
rename
label variable
label define
label values

CLASS 3: IPA RESOURCES, HOW TO LEARN MORE STATA ALL BY YOURSELF, AND MORE
COOL STUFF

Other Resources:
A. Stata FAQs
B. Statalist
C. SSC
D. Google

A. STATALIST

From http://www.stata.com/statalist:

"Do you know about the independently operated Stata listserver? Hosted at the Harvard School of
Public Health, Statalist is an email listserver where over 3,500 Stata users, from experts to neophytes,
maintain a lively dialogue about all things statistical and Stata."

Usually, the combination of Sharepoint, Random Help, and the Data Coordinator will be enough to
get answers to your questions. However, for very advanced questions, especially statistical ones, you
may have better luck with Statalist, where you can typically expect quick responses. Make sure
you've at least glanced over the Statalist FAQ, as you may be called out for noncompliance.

B. SSC (AND STATA JOURNAL, ETC.)

If you're looking for a sample do-file or user-written program, you may be able to find it on
Sharepoint. However, other organizations house user-written programs, including the Boston College
Statistical Software Components (SSC) archive and the Stata Journal. You can search for programs by
keyword using the netsearch command. For example, suppose I want to find a program that
calculates the Levenshtein edit distance. netsearch finds the user-written strgroup.

C. GOOGLE

If nothing else works, Google is always a useful resource. It may just bring you to a Stata FAQ or a
Statalist thread, but you might also find a little-known blog or learning site.

1. INTERNAL STATA RESOURCES: HELP FILES

The help command displays help information about the specified command or topic. Generally, it is
best to use Stata help when you are using a new command (or when you've forgotten the details
of an old one), since the help file will show you the appropriate syntax and describe the command
and its options. However, it can also be useful for looking up a topic (as
in, when you're not sure what the command itself should be), since Stata will bring up commands
relevant to the topic.

The syntax for the help command is simply

help command

Where "command" is the command or option on which you need information. Let's try this with a
command you already know: tabulate.

help tab

As you can see, typing in "help tab" pulls up a new window with the description of the command. On
occasion, when a command has several specifications or meanings, like tab, you will see a list of
commands from which to pick the one you want. In this case we want to look at tabulate oneway,
since this is the one we've learned so far (click on it now). In this window you can find info on:

- Title - provides a brief description of the command

- Syntax - explains how the command should be structured. In this case:
tabulate varname [if] [in] [weight] [, tabulate1_options]

While this looks scary at first, it is actually fairly straightforward once you get used to
it.

  - First of all, Stata gives you the command itself, bolded (in this case,
    tabulate)
  - Then Stata specifies whether you use the command on one variable or
    multiple, by saying either varname or varlist, in italics.
  - [if] and [in] tell you that this command can be used with
    restrictions imposed by the if or in qualifiers (which you haven't learned yet)
    a. Note that you don't actually need these brackets [] when using
       the command. They are just there to tell you that this part of
       the syntax is optional.
  - [, options] means that there are specifications allowed for this command,
    which you should list after a comma (and without brackets). These possible
    options are listed right underneath the main command in the syntax section,
    and also expanded upon in the options section

- Menu - lets you know where to find the command in the Stata menu. This is not
given for every command, since it is unusual to use the menu to navigate Stata and many
of the commands aren't accessible through it
- Description - provides a more detailed description of the command. In this case, tab
"produces one-way tables of frequency counts"
- Options - gives a more detailed explanation of each option, usually explaining what each
option achieves and how the correct syntax works for it
- Examples - for the most common commands, Stata will provide examples of their use,
usually with one of the built-in generic Stata datasets. Here you can see what the syntax
actually looks like if you aren't sure after the more technical descriptions above.
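To connect the syntax diagram to real commands, here is a short sketch that adds one optional piece at a time. It uses Stata's built-in auto dataset rather than the training data; the particular variables and conditions are chosen only for illustration.

```stata
sysuse auto, clear

* The command plus varname only
tabulate rep78

* Adding the optional [if] piece: only foreign cars
tabulate rep78 if foreign == 1

* Adding the optional [in] piece: only observations 1 through 20
tabulate rep78 in 1/20

* Adding an option after the comma: also show missing values
tabulate rep78 if foreign == 1, missing
```

Notice that none of the square brackets from the help file are typed: they only mark which pieces are optional.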

We will try using this for new commands shortly.

2. FUN BEGINS: GOOD-TO-KNOW COMMANDS AND AVOIDING COMMON ERRORS

First let's start a new do-file, type in the regular commands, set a cd and use the dataset from the
2nd class (ChildTest.dta).

In case you forget:

clear
set more off
set mem 10m

cd "Wherever your USB is/Stata/Beginners Stata"

use ChildTest.dta

Note that you don't need , clear after this particular use command, since you have just put
the clear command at the beginning of the do-file.

A. DUPLICATES
Now let's use what we learned about the help files and pull up the file for the duplicates
commands.
A: help duplicates
What does duplicates do? Is it just one command?
A: It reports, displays, lists, tags, or drops duplicate observations, depending on the
subcommand specified. Duplicates are observations with identical values either on all variables
if no varlist is specified or on a specified varlist.
Let's focus on duplicates report. What does the command do?
A: produces a table showing observations that occur as one or more copies and indicating
how many observations are "surplus" in the sense that they are the second (third, ...) copy of
the first of each group of duplicates.
Please figure out the syntax and use the command to look at the childid variable. Are there
any duplicates of the childid variable? If there are, how many IDs have more than one
observation assigned to them? Why is that?
A: Yes, there are duplicates. There is a surplus of 16,886 observations. However, try
browsing the dataset to figure out why this happened. Hint: look at the survey variable. You
will quickly see that the childid is still unique to each child, but there are multiple observations
because the data is in a long format (look it up!), and therefore there is an observation per
baseline and midline survey, rather than separate variables for each.
Just as the helpfile states, duplicates reports, displays, lists, tags, or drops duplicate
observations, depending on the subcommand specified.
Duplicates are observations with identical values either on all variables if no varlist is specified
or on a specified varlist. What that means is that if you use duplicates report and specify a
variable, it only checks for observations with identical values in that variable. However, if you just
say duplicates report (or drop or tag), Stata will look at observations that are identical across
all variables.
This is a great set of commands to use during merging (to make sure it happened properly), to
figure out unique identifiers, and in general during cleaning.
In addition to duplicates report, duplicates tag and duplicates drop come
in especially handy.
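The sketch below walks through report, tag and drop on a small invented dataset in long format, where child 103 was accidentally entered twice for the same survey round. The variable names mirror the training data, but the values are made up.

```stata
* Invented long-format data: one row per child per survey round
clear
input childid survey score
101 1 40
101 2 55
102 1 35
102 2 50
103 1 60
103 1 60
end

* childid alone repeats, because each child has one row per round
duplicates report childid

* childid and survey together should be unique; the report shows one surplus row
duplicates report childid survey

* Tag the problem rows: dup counts the extra copies of each row
duplicates tag childid survey, generate(dup)
list if dup > 0

* Drop rows that are identical on ALL variables, then check again
duplicates drop
duplicates report childid survey
```

Looking at the tagged rows before dropping anything is a good habit: two rows with the same ID are not always true duplicates.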
B. _n AND _N
While you as the user are expected to create and record most of the variables in the dataset,
there are several built-in system variables that are created and updated by Stata. Some of these
are produced automatically as a result of the latest command you ran in Stata and kept hidden
until you call them up, and some exist automatically for the entire dataset or its subset. _n and
_N are such variables.
Stata takes _n to mean the number of the line of the observation. That is, if your dataset has 20
observations, for the first listed (in whatever order you have it sorted then), _n would equal 1,
for the second, 2, and for the last, 20.
This has multiple wonderful uses when combined with other commands which we (or you alone)
will learn later.
_N is the variable for the total number of observations in the subgroup you specified, or the
entire dataset.
In general, _n and _N can be used for indexing your data, although they are most useful within
groups, as you will see shortly.
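A small invented dataset makes the difference between _n and _N easy to see. The school letters and scores below are made up; the last two commands use bysort, which is explained later in this class.

```stata
* Invented data: five students in two schools
clear
input str1 school score
"A" 70
"A" 85
"B" 60
"B" 90
"B" 75
end

* _n is the position of each observation in the current order: 1, 2, 3, 4, 5
gen row = _n

* _N is the total number of observations: 5 on every row
gen total = _N

* Within groups, both restart for each school
bysort school: gen row_in_school = _n
bysort school: gen n_in_school = _N

list, sepby(school)
```

Because _n depends on the sort order, the same observation can get a different _n after you sort the data differently.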

C. LOCALS AND LOOPS

1. LOCALS

Locals are a part of a larger category called macros, which are ways of storing lists of information,
whether string or numeric, for later use in Stata. The way locals in particular work is first you declare
a local, i.e., you tell Stata that from now on, whenever you say "A", you mean "B". Then you call a
local, meaning you use the new definition to pull up the meaning. Check out the syntax:

Locals are declared as follows. Think of it as copying "ipa & jpal" into the local awesome.

local awesome "ipa & jpal"

...and are called as follows (for the purposes of demonstration, let's use the display
command here, which as you remember displays what you ask).

disp "`awesome'"

Just to remind you, disp, or display, simply displays whatever you ask it to, as we learned in the
1st class. Without the local, if you wanted Stata to display ipa & jpal, you would have to type

disp "ipa & jpal"

Think of calling the local as just pasting "ipa & jpal" in place of `awesome'.

So now, since you declared awesome to mean "ipa & jpal", every time you say `awesome', Stata
will know you actually really mean "ipa & jpal" (obviously!). Please note the little brackets ` and '
that are used when calling up the local (but not when declaring it). Without them Stata will not
recognize the local as a local and will just treat it as another variable. The first character is a
backquote (located to the left of the "1" key along with the tilde). The second is just the single
quotation mark, found next to the Enter key. In the display command, we enclose `awesome' in double
quotes to indicate that it should be treated as a string.

This might not seem so handy right now. After all, why can't you just type "ipa & jpal" instead of
bothering with the local? However, locals are useful for dealing with lists of variables, particularly
those you need to reuse. For example, imagine if you needed to summarize a whole bunch of
variables repeatedly across the file. Instead of typing them over and over again, you can create a
local for the entire list of variables, then input them into the command.

For those of you familiar with regressions, let's say you want to add controls to a regression. You can
simply declare all the controls in a local and then refer to that in multiple regressions, saving you
loads of time. There are many other wonderful uses as well that are more advanced. You will
discover them all in time.
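As a sketch of the "list of controls" idea, the block below stores three variable names in a local and reuses the list in several commands. It uses the built-in auto dataset; the local name controls and the choice of variables are only illustrative.

```stata
sysuse auto, clear

* Declare the list once...
local controls mpg weight length

* ...and call it wherever the list is needed
summarize `controls'
regress price `controls'
regress price `controls' foreign

display "The controls are: `controls'"
```

Run the whole block at once. A local only exists while the do-file (or the highlighted selection) is running, so calling `controls' on its own afterwards gives you an empty list.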

For now let's try to get the hang of the format. Let's declare i as a local for the number 1, and then
generate a variable called one that is equal to 1 throughout by using our local.

A: local i 1
disp "`i'"
gen one = `i'

Optional:

There are lots of cool things you can do with locals that will allow you to write really great code. For
example, you can nest a local within a local. Try the following commands and see if you can figure
out what they do:

local a a1 a2 a3
local b b1 b2 b3
local ab `a' `b'

disp "`a'"
disp "`b'"
disp "`ab'"

local aone "the first one"
local 1 "one"
di "`aone'"
di "`a`1''"

2. LOOPS

In Stata, loops can be used to repeat a command for a sequence of numbers, variables, or other
listed inputs. Stata will literally loop over each item in a list you provide, and perform the same
action (that you specify) for all of them. There are several commands that work as loops: foreach,
forvalues, while, etc. (The if programming command, which is different from the conditional if you
learned before, often appears alongside them, but it branches rather than loops.) Foreach is by
far the most widely used, so we will learn that one. Plus, once you understand the syntax of one of
the loops, you can learn any of them easily.

The most basic foreach syntax runs the following way:

foreach letter in a b c d {
    disp "`letter'"
}
In this case foreach ... in is the command, a b c d is the list of items that you want to loop
over, and letter is the local that holds the current item on each pass through the loop. Please run
this from your do-file and see what it does.
Note that the opening bracket { has to come at the end of the 1st line where you declare the list, and
the closing bracket } has to come on a new line after the loop is finished.
Your output should look like the following:
a
b
c
d

Why? Because the command that we requested of Stata was display, so it simply looped over the list
and displayed each item (a through d). Stata takes each item in your list and plugs it into your
command where the local is, in the order of the list.
Note that for the foreach ... in command the items in the list don't have to be anything specific.
They can be numbers, they can be letters, they can be variables, parts of variables, strings or
numeric; basically, anything you can think of you can stick into foreach ... in, assuming the
command makes sense with it, of course. See the following for an example:
foreach i in 1 2 3 purple cow "purple cow" {
disp "`i'"
}
Now let's try it with more serious commands. Let's write a loop that tabulates the following
variables: childtestcomplete extraclasses readlevel

I will now teach you a process for writing the loop easily and without mistakes. It is entirely
unnecessary for a simple loop like this, but for more complicated loops I've found it useful. This
is definitely not a must (people create their loops in different ways), but give it a try!

First we write the command as it would be for the first variable in the loop. In this case:

tab childtestcomplete

Then we write the loop around it:

foreach _ in ___ {
tab childtestcomplete
}
Obviously now it's very easy to see where childtestcomplete and the other list variables go!
foreach _ in childtestcomplete extraclasses readlevel {
tab childtestcomplete
}
Finally, we come up with a name for the local and insert it into the declaration and the calling part of
the loop.
foreach var in childtestcomplete extraclasses readlevel {
    tab `var'
}
Don't forget the little brackets ` and ' when you're calling the local!
Ok, now let's try something even more interesting!

For starters, please check out the schooltype variable: tab it, codebook it, etc. You will see that it
has 5 types of schools (1-5). Now suppose we wanted to create a separate variable out of each of the
five types.

Before, we would've done it the slow way. We would've said

gen schooltype_1 = (schooltype == 1)
gen schooltype_2 = (schooltype == 2)

etc

But now we can write a loop! Observe:

foreach type in 1 2 3 4 5 {
    gen schooltype_`type' = (schooltype == `type')
}

That's 3 lines instead of the 5 we would've written. And one of them is just a bracket! Now imagine that
had been 20 values, or a hundred!
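For 20 or a hundred values you would not want to type out the list either. One way around that is forvalues, one of the other loop commands mentioned earlier, which walks through a range of numbers. The sketch below uses a small invented schooltype variable rather than the training data.

```stata
* Invented data: six schools of types 1 to 5
clear
input schooltype
1
2
3
4
5
3
end

* forvalues loops over the range 1/5 without listing each number
forvalues type = 1/5 {
    gen schooltype_`type' = (schooltype == `type')
}

list
```

Changing 1/5 to 1/100 is all it takes to handle a hundred values.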

Ok, now let's see what you can do!

Please write a loop that produces descriptions of each variable (using the codebook command) for at
least 6 variables in the dataset (any six)!

A: foreach var in childtestcomplete extraclasses readlevel standard_childtest survey tcselection {
    codebook `var'
}

Now suppose you are prepping this dataset (for now imagine it only contains a midline and no
baseline) for merging with the endline, and you want to rename all variables
whateverthenamewas_mid. I won't ask you to do this for all the variables (with a loop it
would be just as easy, though), but please do this for any 4.

A: foreach var in childtestcomplete extraclasses readlevel standard_childtest {
    rename `var' `var'_mid
}

Foreach has a couple of different specifications that can greatly simplify your code-writing. If you
have time I recommend you go to the foreach help file and read up on the varlist and
numlist options.
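As a taste of those two specifications, here is a brief sketch using the built-in auto dataset. The range price-rep78 covers the variables price, mpg and rep78 as they are ordered in that dataset; the _mid suffix simply echoes the exercise above.

```stata
sysuse auto, clear

* "of varlist" accepts ranges and wildcards instead of a typed-out list
foreach var of varlist price-rep78 {
    rename `var' `var'_mid
}
describe

* "of numlist" does the same for numbers: 1/3 expands to 1 2 3
foreach k of numlist 1/3 10 {
    display "k is `k'"
}
```

Note the word of instead of in: with of varlist, Stata checks that every item really is a variable in the dataset before the loop starts.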

D. BYSORT

Certain commands can also be performed in groups using the bysort command. What do I mean
by groups here? I mean that if you tell Stata to perform an action by variable, it will look at the
values the variable takes, group the observations by those values, and then perform the command
specified after that in those groups.

Let's look at the _N variable we recently learned for an example. Suppose we wanted to generate a
variable that is equal to the total number of observations within each schooltype for all observations
of that type. For this we would say:

bysort schooltype: gen num_obs_in_type = _N

If you now browse the dataset, you will see that the new variable is equal to 19,766 for all those in
government schools (because there are 19,766 total of those kids), 12,733 for all kids in private
schools, etc.
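
If you want to try the same idea away from the course data, the sketch below uses auto.dta, an example dataset that ships with Stata, as a stand-in. The new variable names num_obs_in_group and obs_number are made up for illustration.

```stata
* auto.dta ships with Stata; it stands in for the course dataset here
sysuse auto, clear

* Count the cars within each value of foreign (0 = Domestic, 1 = Foreign)
bysort foreign: gen num_obs_in_group = _N

* Within each group, _n numbers the observations 1, 2, 3, ...
bysort foreign: gen obs_number = _n

* Cross-check: the group sizes should match the frequencies shown here
tabulate foreign
```

Every car in the same group gets the same value of num_obs_in_group, while obs_number restarts at 1 in each group.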

You will see how useful this command can be in a minute.

E. EGEN

Egen is probably one of the top 10 most useful commands for intrepid IPA data cleaners. In essence,
egen is "generate on steroids". While for generate you have to specify the value that the new variable
should equal, egen allows you to make new variables using different functions. By functions we don't
just mean "sum" or "difference", but rather egen-specific functions that are tailor-made to improve
your life (in Stata, anyway). For a full list of these functions please go to the help file for egen, but
here are some of the easiest and most used.

The syntax is fairly straightforward, and looks very similar to gen:

egen newvar = function(of_whatever_other_variables)

Egen is very commonly used with bysort, although not every function of egen can be combined with
bysort. Let's see a few examples of functions to understand egen better!

a. TOTAL() creates a total (sum) of whatever variable you specified over whatever group
you specified (i.e. a column total). For example, imagine that you had villages in which you
collected information on the total kgs of rice harvested by multiple households per village.
You now want to create an aggregate number for the rice harvested in each village. You
would say

bysort village: egen village_total = total(kgs_collected_by_hh)

Please note that this new variable will be the same for every household in the same
village. So before trying to create statistics on the average of total harvest per village,
don't forget to first drop all but the first observation per village. How would you do that
using what we learned this class?
Hint: Use _n
A: bysort village: drop if _n ~= 1
b. MIN(), MAX() let you create a variable that is the minimum or maximum of another
variable over a group you specified (or the entire dataset). The syntax is the same as for
total(). Please come up with a new variable, named min_age, for the minimum age
of the student when taking the test, by the class (standard) the child is in.
A: bysort standard_childtest: egen min_age = min(age_childtest)

c. ROWNONMISS() counts how many non-missing values there are within each
row across the variables you specified. This can come in very handy during the cleaning
process.

d. OTHER EGEN FUNCTIONS: there are many other wonderful functions that create a total
over the row instead of the column, rank observations, create variables based on standard
deviations and kurtosis; the list goes on and on. You can see the complete list in the
egen help file, with really exhaustive descriptions for each function. It is a great idea
when you first begin cleaning your data (if it needs complicated cleaning) to look
through all the egen functions and think of how you could use them (and to periodically
come back and look through them again). I always discover some new ways to resolve
various coding problems whenever I look!
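
To see the egen functions above side by side, here is a minimal sketch on a tiny made-up dataset. The variables village, hh, kgs_rice and kgs_maize and all of their values are assumptions for illustration only. Instead of dropping observations, the last step flags one row per village, which is one possible alternative when you want to keep the household-level data.

```stata
* Toy data so the sketch runs on its own (not the course dataset)
clear
input village hh kgs_rice kgs_maize
1 1 100 20
1 2 150 .
2 1 80 10
2 2 . .
2 3 70 30
end

* Column total within each village; total() treats missing values as 0
bysort village: egen village_total = total(kgs_rice)

* Smallest household rice harvest within each village
bysort village: egen village_min = min(kgs_rice)

* Number of non-missing crop values in each row (not combined with by)
egen crops_reported = rownonmiss(kgs_rice kgs_maize)

* Flag one observation per village instead of dropping the rest
bysort village (hh): gen first_in_village = (_n == 1)
summarize village_total if first_in_village
```

Restricting summarize to the flagged rows gives village-level statistics without counting large villages several times.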

F. CONVERTING STRING AND NUMERIC VARIABLES

a. DATATYPES: STRING AND NUMERIC

There are two types of variables within data: string and numeric.
What are strings? Anything that has a non-numeric character in it: punctuation (except
the missing-value marker), letters, any other sort of character. Anything that's not a number!
What are numerics? As you might have guessed by now, they are the variables that only
contain numbers.
For example, a variable that has values (1, 2, 5, 6.5, 7, 99) is numeric. A variable with
continuous values (like a math score) is also numeric. A variable that contains something
like names of a village, or an "other, specify" type of variable (meaning a variable where
people have to write in their own answer), is string.
To check whether a variable is string or numeric, there are various methods. First of
all, you can describe the variable or codebook it. The easiest way, however, is to look at
the Type column in your variables window. Any string will have "str" in its type (like
str4, str11, str23), where the number after str (4, 11, 23) is the maximum number of
characters the variable can hold. All the other types are different subsets of numeric
variables. You can learn more about each subset through the help files once you advance a bit more.

For an exercise, what type are the following variables?


1. standard_childtest A: numeric
2. householdid A: string
3. tcstatus A: string

b. DESTRING (and tostring) - convert string variables to numeric variables and vice versa.
Destring converts string variables to numeric, and tostring converts numeric variables to
string.

These commands are highly useful, since when you import data, variables are sometimes
imported as an incorrect type. This happens most often with numerics being recorded as
strings when there's some extraneous character in the numeric by mistake. For example,
imagine a variable that has values 1, 2, and 9. However, while typing, the data entry
operator messes up and types "9o" for one of the observations. Stata will now refuse to
automatically make the variable numeric, since it sees a non-numeric character in one of the
observations. However, in your cleaning process you find this mistake and fix it, and now
need to make the variable numeric to work with it. This is where destring comes in.

The syntax for destring is as follows:


destring varlist, replace [destring_options]

If you look at the different datatypes in our dataset, you will notice that the householdid
is string, when in this case it actually needs to be a numeric, like the childid. Try to
convert it using the syntax above.
A: destring householdid, replace

As you can see, it's very easy. Check the variable to see that nothing has gone awry. In
general, it is better to generate a new variable instead of replacing the old one (you do
that by saying ", gen(name_of_newvar)" instead of ", replace"), just so it's
easy to compare the variables side by side.

c. ENCODE (and decode): if you look at the encode help file, it will tell you that encode
"creates a new variable based on a string variable, creating, adding to, or just using the
value label newvar or, if specified, name". But what on earth does that mean? Allow us
to illustrate with an example:

Note that while the treatment-control selection variable is numeric, the treatment-
control status is a string, which makes it much harder to work with. There is actually
absolutely no reason why this variable should be a string, so we should convert it.
Here's the difficulty, though: if we simply use destring, Stata will refuse to convert
the variable because it contains non-numeric characters, and if we add the force option,
every non-numeric value will become missing, which in this case means every value.
Naturally, we could just painstakingly generate a new variable by hand, then label it,
but it would take ages. Thankfully, encode provides us with a simple and elegant solution.
encode tcstatus, gen(tcstatus_num)
If you look at the new variable generated (tcstatus_num), you can see that
encode took the existing string variable, assigned each category a number, and then
also created a label and labeled the new numerical variable with the string values
from the old variable.
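
The conversions in this section can be tried out on a tiny made-up dataset, as in the minimal sketch below. The variables answer and status and their values are assumptions for illustration and only mimic the "9o" typo and the treatment-control string described above.

```stata
* Toy data so the sketch runs on its own (not the course dataset)
clear
input str3 answer str10 status
"1" "Treatment"
"2" "Control"
"9o" "Control"
"9" "Treatment"
end

* Both variables show a str type in the describe output
describe answer status

* Fix the data-entry typo first, then convert; gen() keeps the original
replace answer = "9" if answer == "9o"
destring answer, gen(answer_num)

* encode numbers the categories alphabetically and attaches a value label
encode status, gen(status_num)
tabulate status_num
tabulate status_num, nolabel
```

Comparing the two tabulate outputs shows the labels next to the underlying numbers that encode assigned.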

SUMMARY
Now you should know:
- Where to find resources to help you with Stata online
- How to use a help file
- Some really awesome high-level commands that will make you a Stata wiz in no time!
- Like how to set up relative references, search for duplicates, how to use locals and
loops, how to use egen
- What different datatypes there are and how to convert strings and numerics

The commands you should've learned are:
help
cfout
cfby
readreplace
bcstats
cd
, clear
duplicates report
_n and _N
local
foreach
bysort
egen
total(), min(), max(), rownonmiss()
destring
encode
