How to Make a Do File in Stata
Save and Reuse your Work in .do Files
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How can .do files make my work more reproducible?
How do I run my or someone else's .do file?
Why should I care about code quality?
How do I make my code more legible?
Objectives
Run commands and .do files from the Stata command line.
Run .do files from Unix shell or the Windows terminal.
Log your results window.
Understand and use local macros.
Running .do files
Take the commands you copied into the .do file editor in Episode 3 and save them. Create a code folder inside your project folder stata-economics and save the file there as code/read_reshape_gdp.do.
You can use basic shell commands such as cd, pwd, ls and mkdir in Stata.
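A minimal sketch of these shell-style commands as typed in Stata. The project path is an assumption borrowed from the challenge further down; replace it with your own.

```stata
* Show the current working directory.
pwd
* Move into the project folder (assumed path; use your own).
cd "/home/user/stata-economics"
* List the contents of the folder.
ls
* Create the code folder; capture suppresses the error if it already exists.
capture mkdir "code"
```

Without capture, mkdir stops with an error when the folder is already there, which would halt a .do file on its second run.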
To run the .do file, use the do command.
do code/read_reshape_gdp.do
. do code/read_reshape_gdp.do

. import delimited "https://raw.githubusercontent.com/korenmiklos/dc-economics-data/mas
> ter/data/web/gdp.csv", varnames(1) bindquotes(strict) encoding("utf-8") clear
(31 vars, 264 obs)

. reshape long gdp, i(countrycode) j(year)
(note: j = 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2
> 005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      264   ->    7656
Number of variables                  31   ->       4
j variable (29 values)                    ->   year
xij variables:
            gdp1990 gdp1991 ... gdp2018   ->   gdp
-----------------------------------------------------------------------------

. rename gdp gdp_per_capita

. save "data/derived/gdp_per_capita.dta"
file data/derived/gdp_per_capita.dta already exists
r(602);

end of do-file

r(602);
The .do file is executed line by line and we see its output as Stata executes.
As in Episode 3, Stata lets us know that the file already exists and is unwilling to replace it. Because we are using a .do file to create this file, it is totally safe to overwrite it. If we make an error, we can fix it and rerun do code/read_reshape_gdp.do. That is the whole point of .do files: to make your work more reproducible.
Change the last line of the .do file to
save "data/derived/gdp_per_capita.dta", replace
and rerun it.
. save "data/derived/gdp_per_capita.dta", replace
file data/derived/gdp_per_capita.dta saved
Never execute just part of a .do file
The .do file editor lets you execute selected lines from your .do file. Never do this. You will not know what state your data is in before clicking that button and you may forget to execute the rest of your .do file. For example, you may omit a crucial save command and your data will be lost. Always execute your .do file in its entirety from the command line by running do code/read_wdi_variables.do.
If you are tempted to run your .do file in parts, it is a good indication that it is too long. Try breaking it up into multiple .do files.
Challenge
Change your current working directory to /home/user/stata-economics/data. How can you run the .do file at /home/user/stata-economics/code/read_reshape_gdp.do?
Solution
You can run the .do file with its relative path, do "../code/read_reshape_gdp.do". However, the last command uses a relative path, data/derived/gdp_per_capita.dta. Starting from the current directory, it would save the dataset under data/data/derived/gdp_per_capita.dta, a nonexistent directory! Change to the upper level directory first.
cd ..
do "code/read_reshape_gdp.do"
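One way to catch a wrong working directory early is to check it before running anything. This is a sketch only; it assumes the project layout used in this episode, with the script stored under code/.

```stata
* Print the directory Stata is currently working in.
display "Working directory: `c(pwd)'"
* Stop with an error unless the script is reachable via its relative path,
* which is only the case when we are in the main project folder.
confirm file "code/read_reshape_gdp.do"
* Safe to run now: relative paths inside the script resolve as intended.
do "code/read_reshape_gdp.do"
```

If confirm file fails, Stata stops before any data is read or written, so nothing ends up in the wrong place.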
Your .do file begins with loading a dataset and ends with saving one. It leaves no other trace.
Happy Together… ♪
Mistakes often happen and you should be prepared to minimize them.
- Never modify the raw data files. Save the results of your data cleaning in a new file.
- Every data file is created by a script. Convert your interactive data cleaning session to a .do file.
- No data file is modified by multiple scripts.
- Intermediate steps are saved in different files (or kept in temporary files) than the final dataset.
The goal of these rules is that you can unambiguously answer the question "how was this data file created?" You will pose this question countless times even if you work by yourself.
Under these rules, most of your .do files will begin with use ..., clear and end with save ..., replace. You have automated your work and should not be afraid to use the options clear and replace. You will also use "destructive" commands like keep, drop, collapse and reshape more freely.
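A minimal sketch of a .do file that follows this pattern. The output file name gdp_per_capita_since_2000.dta and the year cutoff are hypothetical, chosen only for illustration.

```stata
* Begin by loading a dataset, discarding whatever is in memory.
use "data/derived/gdp_per_capita.dta", clear
* Destructive step: safe because the input file on disk is untouched.
keep if year >= 2000
* End by saving to a new file (hypothetical name), overwriting old output.
save "data/derived/gdp_per_capita_since_2000.dta", replace
```

Because the input and output are different files, rerunning the script always starts from the same data and gives the same result.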
Challenge
What is wrong with the following .do file?
...
rename gdp gdp_per_capita
save "data/derived/gdp_per_capita.dta"
label variable gdp_per_capita "GDP per capita (2011 USD at PPP)"
save "data/derived/gdp_per_capita.dta"
Solution
There is no error in the .do file but it saves two different versions of gdp_per_capita.dta under the same name. You cannot be sure which version the data file contains. For example, if the label command fails with an error, the .dta file will not contain the variable label, and you will be surprised.
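One possible fix, sketched with the commands from earlier in this episode: make all changes first and save once, at the very end.

```stata
* Read and reshape the data as before.
import delimited "https://raw.githubusercontent.com/korenmiklos/dc-economics-data/master/data/web/gdp.csv", varnames(1) bindquotes(strict) encoding("utf-8") clear
reshape long gdp, i(countrycode) j(year)
rename gdp gdp_per_capita
label variable gdp_per_capita "GDP per capita (2011 USD at PPP)"
* Save once, after all changes; replace makes the script rerunnable.
save "data/derived/gdp_per_capita.dta", replace
```

If any command fails, the script stops before the save, so the file on disk is either the complete new version or the previous one, never a half-finished one.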
Break up your work (optional)
We are loading a dataset from the web. For larger datasets, this can be frustratingly slow and we do not want to redo it every time we change something in our .do file. We can put this step in a separate .do file.
The copy command is similar to the shell command cp in that copy x y copies a file from location x to location y. But Stata's copy command has the added feature that it can also copy from a URL.
mkdir "data/raw/web"
copy "https://raw.githubusercontent.com/korenmiklos/dc-economics-data/master/data/web/gdp.csv" "data/raw/web/gdp.csv"
Keep raw data separate from data that you are working on to make sure you do not accidentally overwrite it. Even though you are only running this copy command once, add it to a .do file. This is a record of what you did: where you downloaded the data from and where you put it.
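Both mkdir and copy stop with an error when their target already exists, so a download script written as above fails on a second run. The following is a sketch of one way to make it rerunnable without downloading again; the guard logic is an addition for illustration, not part of the original lesson.

```stata
* Create the folders one level at a time; ignore "already exists" errors.
capture mkdir "data/raw"
capture mkdir "data/raw/web"
* Check whether the raw file is already there; _rc is nonzero if it is not.
capture confirm file "data/raw/web/gdp.csv"
if _rc != 0 {
    * Download only when no local copy exists, so raw data is never overwritten.
    copy "https://raw.githubusercontent.com/korenmiklos/dc-economics-data/master/data/web/gdp.csv" "data/raw/web/gdp.csv"
}
```

The alternative, copy with the replace option, would download and overwrite the raw file on every run, which is slower and goes against keeping raw data untouched.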
Challenge
Create two .do files, read_gdp.do and reshape_gdp.do, to create a local copy of the GDP data and to reshape and save it, respectively.
Solution
The content of code/read_gdp.do:
copy "https://raw.githubusercontent.com/korenmiklos/dc-economics-data/master/data/web/gdp.csv" "data/raw/web/gdp.csv"
(Note that the mkdir is not included.) The content of code/reshape_gdp.do:
import delimited "data/raw/web/gdp.csv", varnames(1) bindquotes(strict) encoding("utf-8") clear
reshape long gdp, i(countrycode) j(year)
rename gdp gdp_per_capita
save "data/derived/gdp_per_capita.dta", replace
When you change something in your data cleaning (for example, you add variable labels), you only have to rerun the second .do file.
If you have many .do files (you should!), you should note the order in which they have to be run. One way to do that is to create a "master" .do file, which calls every other .do file. This also shows your coauthor how to run your code. For example, the master .do file below makes it explicit that read_gdp.do and reshape_gdp.do expect to be run from outside the code folder. You can also note it in a comment.
* run this from the main project folder, one level up from data/ and code/
do code/read_gdp.do
do code/reshape_gdp.do
Another useful convention is to number your .do files in the order in which they run, 01_read_gdp.do, 02_reshape_gdp.do. This is super helpful to get a quick overview of how to run your code, but does not quite substitute for a master .do file and comments.
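A master .do file is also a natural place to log the results window. The sketch below combines the numbered file names from the paragraph above with Stata's log command; the log file name master.log is an assumption.

```stata
* run this from the main project folder, one level up from data/ and code/
* Close any log left open by an earlier, interrupted run.
capture log close
* Record everything shown in the results window as plain text.
log using "master.log", text replace
do code/01_read_gdp.do
do code/02_reshape_gdp.do
log close
```

The text option writes a plain-text log that opens in any editor; without it Stata uses its own SMCL format.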
Scalars and macros
Macros are useful for storing values and reusing them later. They are the most powerful feature of Stata programming.
There are two types of macros, local and global. Local macros are valid only in a single execution of commands in do-files. Global macros will persist until you delete them or the session is ended. Precisely because global macros are persistent you might inadvertently use the wrong value. We therefore recommend the use of local macros and this is what we cover first.
. local begin_year 1991
. local name value
. display `begin_year'
1991
. display "`name'"
value
To evaluate a macro "name" to its "value," put a backtick before the name and a single quote after it.
. display `begin_year`
`begin_year` invalid name
r(198);
. display 'begin_year'
'begin_year' invalid name
r(198);
Macros are evaluated as part of the command. They are not variables.
. local name value
. display `name'
value not found
r(111);
The second line evaluates to display value and Stata does not have any object called "value."
Because macros are evaluated before a command is run, they can be part of the command.
. local begin_year 1991
. local outcome gdp_per_capita
. summarize `outcome' if year >= `begin_year'

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |      6,251    15331.78    17967.28   354.2845   135318.8

. summarize gdp_per_capita if year >= 1991

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |      6,251    15331.78    17967.28   354.2845   135318.8
The two summarize commands do exactly the same thing.
The macro can be any part of the command; for example, you can attach it to variable names.
. local entity country
. describe `entity'code

              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------
countrycode     str3    %9s                   Country Code

. describe `entity'name

              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------
countryname     str52   %52s                  Country Name
Gotcha
Stata does not stop if you use an undefined macro name. It simply uses an empty string for its value. Watch out for typos in macro names!
. describe `enty'name
variable name not found
r(111);
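A small sketch of two ways to inspect a macro before relying on it. The assert guard is a suggestion for illustration, not something the lesson prescribes.

```stata
local entity country
* The misspelled macro enty was never defined, so it expands to nothing.
display "[`enty']"
* Guard: stop with an error if the macro we rely on is empty.
assert "`entity'" != ""
* Show the stored value; local macros are listed with a leading underscore.
macro list _entity
```

The display line prints a pair of empty brackets, which makes the silent empty expansion visible.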
Challenge
What does the following code do?
local A a
local B 4
generate `A' = `B'
1. Creates a variable called A with the value 4.
2. Creates a variable called a with the value 4.
3. Creates a variable called A with the value "B".
4. Creates a variable called a with the value "B".
Solution
The correct answer is 2.
Challenge
What does the following code do?
local A a
local B 4
generate `A' = `B'
local C c
generate `C' = `A' + `B'
1. Creates a variable called c with the value 4.
2. Creates a variable called c with the value "AB".
3. Creates a variable called C with the value 8.
4. Creates a variable called c with the value 8.
Solution
The correct answer is 4. `A' evaluates to a, which is a variable with the value 4. `B' evaluates to 4, so the variable c becomes 8.
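To check the answer yourself, the challenge code can be run on a small empty dataset. The clear and set obs 3 lines are assumptions added so that generate has observations to fill; the comments spell out what each line looks like after macro evaluation.

```stata
* Start from an empty dataset with three observations (assumed setup).
clear
set obs 3
local A a
local B 4
* After macro evaluation this reads: generate a = 4
generate `A' = `B'
local C c
* After macro evaluation this reads: generate c = a + 4
generate `C' = `A' + `B'
list
```

Writing out the evaluated command by hand, as in the comments, is a useful habit whenever a line with macros behaves unexpectedly.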
use "data/derived/gdp_per_capita.dta", clear
local begin_year 1991
local end_year 2010
keep if (year >= `begin_year') & (year <= `end_year')
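The section above recommends local macros over global ones. For comparison, here is a minimal sketch of both syntaxes side by side; the global macro name data_dir is hypothetical.

```stata
* A local macro: referenced with a backtick and a single quote.
local begin_year 1991
display `begin_year'
* A global macro: referenced with a dollar sign.
global data_dir "data/derived"
display "$data_dir"
* Drop the global when done so a stale value cannot leak into later work.
macro drop data_dir
```

The local disappears on its own when the .do file finishes, while the global would stay until it is dropped or the session ends, which is why it is removed explicitly here.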
Challenge (optional)
Use data/derived/gdp_per_capita.dta and create an index of GDP per capita for each country in each year, relative to the base year 2000. Store the base year in a local macro called base_year. This index should take the value 100 in the base year.
Solution
use "data/derived/gdp_per_capita.dta", clear
local base_year 2000
egen gdp_per_capita_`base_year' = mean(cond(year == `base_year', gdp_per_capita, .)), by(countrycode)
generate gdp_per_capita_index = gdp_per_capita / gdp_per_capita_`base_year' * 100
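A sketch of how the solution could verify itself. The variable names gdp_base and gdp_index are shortened, hypothetical variants of those in the solution, and storing them as double with a small tolerance in the check is an assumption made to avoid rounding surprises.

```stata
use "data/derived/gdp_per_capita.dta", clear
local base_year 2000
* Base-year GDP per capita, repeated on every row of the same country.
egen double gdp_base = mean(cond(year == `base_year', gdp_per_capita, .)), by(countrycode)
generate double gdp_index = gdp_per_capita / gdp_base * 100
* Stop with an error if the index differs from 100 in the base year.
assert abs(gdp_index - 100) < 1e-6 if year == `base_year' & !missing(gdp_index)
```

Because assert halts the .do file when the condition fails, a mistake in the base-year logic cannot pass unnoticed into later steps.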
Key Points
Add commands to a .do file.
Run .do files en bloc, not by parts.
Check what directory you are running .do files from.
Source: https://datacarpentry.org/stata-economics/05-programming/