Data Management Workshop: Stata Commands for Statistical Analysis

undefined
 
STATA data
management
 
UCLA OARC
STATISTICAL
METHODS AND
DATA ANALYTICS
 
Purpose of the workshop
 
 
Data sets usually do not arrive in a state immediately ready for analysis
Variables need to be cleaned
E.g. missing data codes like -99 can invalidate analyses if left untouched
Data errors should be corrected
Variables need to be generated
Unnecessary data should perhaps be dropped for efficiency
Data sets may need to be merged together
 
This workshop covers commonly used Stata commands to prepare data sets for statistical
analysis
 
 
 
Inspecting variables
 
Creating variables
 
Selecting observations and variables
 
Missing data
 
Date variables
 
String variables
 
Appending data sets
 
Merging data sets
 
Looping
 
Processing data by groups
 
Topics
undefined
 
Preliminaries
 
 
 
 
 
save
load
 
Use do-files
 
 
Write and run code in do-files, text files where Stata commands can be saved.
A record of the commands used
Can easily make adjustments to code later
 
To run code from the do-file, highlight code and then Ctrl-D or click “Execute(do)” icon
 
Comments:  precede text with 
*
 or enclose text within 
/*
 and 
*/
*not executed
/* this is a comment */
 
You can also place comments on the same line as a command following 
//
 
tab x 
// also a comment
 
Reading syntax specifications in this
workshop
 
 
Unitalicized words should be typed as is
 
Italicized words should be replaced with the appropriate variable, command, or value
 
For example:
 
merge 1:1 
varlist
 using 
filename
 
varlist
 
should be replaced with one or more variable names
 
filename
 
should be replaced with a file’s name
 
The rest should be typed as is
 
 
varlist 
will be used throughout the workshop to mean one or more variable names
 
Workshop dataset
 
 
This dataset contains fake hospital patient data, with patients as rows of data (no patient id,
though)
 
A mix of continuous and discrete numeric variables as well as string variables
 
Some data errors and missing data codes in the data
 
Each patient is also linked to a doctor (docid).
 
Another dataset containing doctor variables will be merged into this dataset later in the
seminar.
 
save
 and 
use
 
 
We recommend that you save your data set
under a different name after performing some
data management on it
Always good to have untouched raw, data set
You may change your mind later about how
variables should be generated/cleaned/etc
 
To quickly 
save
 and 
load
 your data in Stata, 
save
as a Stata file (usually .dta)
 
use
 loads Stata data files
 
 
* save data as Stata data set, overwrite
 
save data_clean.dta, replace
 
 
* .dta automatically added so can omit it
 
save data_clean, replace
 
 
* load data_clean, clear memory first
 
use data_clean, clear
undefined
 
Inspecting
variables
 
 
 
 
browse
summarize
tabulate
codebook
 
browse
 
 
Spreadsheet style window view
 
browse
or click on the spreadsheet and magnifying glass icon in the Stata toolbar
 
Numeric variables appear black
 
String variables appear red
 
Numeric variables with value labels appear blue
 
For large data sets, we need commands that allow us to quickly inspect variables
 
summarize
 
 
summarize
 (abbreviated 
summ
) provides
summary statistics for numeric variables,
which may be useful for data management:
Number of non-missing observations
Does this variable have the correct number of missing
observations?
Mean and standard deviation
Are there any data errors or missing data codes that make
the mean or standard deviation seem implausible?
Minimum and maximum
Are the min and max within the plausible range for this
variable?
 
 
*summary stats for variable y
 
summ y
 
 
*summary stats for all numeric variables
 
summ
 
tabulate
 
 
tabulate
 (abbreviated 
tab 
here
)
frequency tables (# observations per value)
string or numeric variables ok
but not continuous numeric variables
 
Questions to ask with 
tab
:
Is this the number of categories I expect?
Are the categories numbered and labeled as I
expect?
Do these frequencies look right?
 
Important data management options
miss
  - treat missing values as another category
nolabel
 – remove labels to inspect underlying
numeric representation
 
 
 
*table of frequencies of race
 
* display missing
 
tab race, miss
 
 
*2-way table of frequencies
 
tab race gender
 
codebook
 
Detailed information about variables
codebook
 provides:
For 
all
 variables
Number of unique values
Number of missing values
Value labels, if applied
For 
numeric
 variables
Range
quantiles, means and standard deviation for continuous
variables
Frequencies for discrete variables
For 
string
 variables
frequencies
warnings about leading and trailing blanks
 
 
*detailed info about variables x and y
 
codebook x y
 
 
*detailed info about all variables
 
codebook
undefined
 
Creating
variables
 
 
 
 
generate
help functions
egen
 
generate
 variables
 
 
generate
 (abbreviated 
gen
 or even 
g
)  is
the basic command to create variables
Often from other existing variables
if the value of the input variable is missing, the
generate
d value will be missing as well.
 
We will be using 
gen
 throughout this seminar
 
 
*sum of tests
* if any test=., testsum=.
 
gen testsum = test1 + test2 + test3
 
 
*reverse-code a 1,2,3 variable
 
gen revvar = 4-var
 
Functions to use with 
generate
 
 
We can use functions to perform some
operation on a variable to generate another
variable
 
Later sections of this seminar will take a
focused look at the function groups 
Date and
Time 
and 
String
 
We will use many various functions from other
groups as well
 
Most of these functions accept no more than
one variable as an input
 
 
*get table of function help pages
 
help functions
 
 
*random number (0,1) for each obs
 
gen x = runiform()
 
 
*running (cumulative) sum of x
 
gen sumx = sum(x)
 
 
*extract year from date variable
 
gen yr
_birth
= year(date_birth)
 
 
*extract 1st 3 numbers of phone number
 
gen areacode = substr(phone, 1, 3)
 
egen
, extended generation command
 
 
egen
 (extended generate) creates variables
with its own, 
exclusive
 set of functions, which
include:
Many functions that accept multiple variables as
arguments
cut() 
function to create categorical variable
from continuous variable
group() 
function to create a new grouping
variable that is the crossing of two grouping
variables
 
*mean of 3 y variables
egen ymean = rowmean(y1 y2 y3)
 
*how many missing values in y vars
egen ymiss = rowmiss(y1 y2 y3)
 
* how many of the y vars equal 2 or 3
egen y2or3 = anycount(y1 y2 y3), values(2 3)
 
 
*age category variable:
* 
0-17, 18-59, 60-120, with value labels
egen agecat = cut(age), at(0, 18, 60, 120) label
 
 
*create agecat-by-race variable, with value labels
egen agecat_race = group(agecat, race), label
undefined
 
Selecting
observations
and variables
 
 
 
if
<>=!~&|
r
eplace
recode
d
rop
keep
 
i
f: 
selecting by condition
 
 
select observations that meet a certain
condition
if
 clause usually placed after the command
specification, but before the comma that
marks the beginning of the list of options.
 
 
*tab x for age > 60, include missing
 
tab x if age > 60, miss
 
Logical and relational operators
 
 
*summary stats for y
 
* for obs where insured not equal to 1
 
summ y if insured != 1
 
 
*tab x for obs where
 
*  age < 60 and insured equal to 1
 
tab x if (age < 60) & (insured == 1)
 
replace
 and 
if
 
 
The 
replace
 command is used to replace
the values of an existing variable with new
values
 
Typically, 
replace
 is used with 
if
 to replace
the values in a subset of observation
 
 
*binary variable coding whether pain is
greater than 6
 
gen 
highpain
 = 0
 
replace highpain = 1 if pain > 6
 
 
Change variable coding with 
recode
 
 
Variables are often not coded the way we
want
too many categories or with values out of order.
 
With 
recode
, we can handle all recodes for a
variable in a single step.
 
 
*recode (0,1,2)->3  and (6,7)->5
recode income_cat (0 1 2 = 3) (6 7 = 5)
 
keep 
and
 drop: 
filtering
observations
 
 
Drop unneeded observations using:
 
drop if 
exp
 
Where
 
exp
 
expresses some condition that if
true, will cause an observation to be dropped
 
Conversely, we can keep only observations
that satisfy some condition with
 
keep if 
exp
 
 
* drop observations where age < 18
 
drop if age < 18
 
 
* same thing as above
 
keep if age >= 18
 
drop
 (or 
keep
) variables
 
 
Unneeded variables can be dropped with:
 
drop 
varlist
 
where 
varlist
 is a list of variables to drop
 
See examples for some shortcuts to drop
many variables at once
 
Or, if you need to drop most variables, you
can 
keep
 a few
 
 
* drop variables x y z
 
drop x y z
 
 
* drop all variables in consecutive columns
 
*   from age-dob
 
drop age-dob
 
 
* drop all variables that begin with “pat”
 
drop pat*
 
 
* drop all variables but age
 
keep age
undefined
 
Missing Data
 
 
 
 
. “” .a .b
misstable summarize
mvdecode
missing()
 
Missing data
 
 
Missing values are very common in real data,
and it is important for an analyst to be aware
of missingness in the data set
Hopefully, you know any missing data codes
(e.g. -99)
 
When reading in data from a text or Excel file,
missing data can be represented by an empty
field.
 
 
 
* overview of how missing values work in
Stata and Stata commands for missing values
 
help missing
 
Missing values in Stata
 
 
. is missing for numeric variables (also called
sysmiss
)
 
“”
 is missing for string variables (also called
blank
).
 
 
Additional missing data values are available
.a
 through 
.z
can be used to represent different types of missing
(refusal, don’t know, etc.).
 
 
Missing values are very large numbers in Stata
all non-missing numbers  
< . < .a < .b < … < .z
 
 
replace stringvar = “” if stringvar == “-99”
 
 
* replace with sysmiss if -99 (skipped)
 
replace numvar = . if numvar == -99
 
 
* use .a for different missing data code -98
 
*  (e.g. refused to answer)
 
replace numvar = .a if numvar == -98
 
 
 
 
misstable summarize
: finding
existing missing values
 
 
misstable summarize 
produces a table
of missing values (of all types) across a set of
variables
 
column 
Obs=. 
counts the number of
missing values equal to .
 
column 
Obs>. 
counts the number of
missing values other than ., such as .a and .b
 
column 
Obs<. 
and the entire right-hand
section 
Obs<. 
address non-missing values
 
 
*table of missing values across all variables
 
misstable summarize
 
Detecting missing data codes
 
 
extreme numeric values are often used to
represent missing values
-99 or 999
 
Undetected, these missing data codes can be
included as real data in statistical analysis
 
Use 
summarize
 and 
graph boxplot 
to
look for missing data codes across many
variables at once
 
 
*boxplot of variables
 
graph box lungcapacity test1 test2
 
Use 
mvdecode
 to convert user-defined
missing data codes to missing values
 
 
We can quickly convert all user-defined
missing codes to system missing values across
all numeric variables with 
mvdecode
.
Unfortunately, 
mvdecode
 will not work at all
on string variables.
 
*convert -99 to . for all variables
mvdecode _all, mv(-99)
 
*convert -99 to . and -98 to .a
mvdecode _all, mv(-99 =. \ -98=.a)
 
 
 
 
 
The 
missing()
 function
 
 
The 
missing()
 function returns
TRUE if the value is any one of
Stata’s missing values (
.
, 
.a
, 
.b
,
etc)
 
 
*eligible if not missing for lungcapacity
 
gen eligible = 0
 
replace eligible = 1 if !missing(lungcapacity)
 
Be careful with relational operators
when working with missing values
 
 
Missing values are very large numbers in Stata
all non-missing numbers  
< . < .a < .b < … < .z
 
 
Thus, 
(. > 50) 
results in 
TRUE
 
 
When creating variables from other variables,
make sure you know how you want to handle
missing values
 
 
* we want hightest1 = 1 if test1 > 10
 
*  but . > 10, so this is not right
 
gen hightest1 = 0
 
replace hightest1 = 1 if test1 > 10
 
 
*now hightest1 will be . when test1 is .
 
replace hightest1 = . if test1 == .
 
 
undefined
 
Date variables
 
 
 
 
help datetime
date()
format
year() month() day()
 
Dates as strings and numbers
 
 
In Stata we can store dates as strings…
“January 2, 2021”
“1-2-2021”
 
However, dates should be represented
numerically in Stata if the date 
data
 are
needed
To create analysis variables
For plotting
 
Numeric dates (with day, month, and year
data) in Stata are represented by the number
of days since January 1, 1960 (a reference
date)
January 2, 2021 = 22,280
 
 
 
* Overview of how Stata handles dates
 
help datetime
 
 
*Use codebook or describe to determine
whether your date variable is a string or
number
 
codebook date_var
 
 
*Alternatively, we can look at the variable
in the browser: red=string, blue/black=number
 
browse date_var
 
 
date(): 
converting string dates to
numeric
 
 
Use the 
date() 
function to convert string
dates to numeric in Stata
 
The generic syntax, for example with 
gen,
 is:
 
gen 
varname
 = date(
stringdate
,
mask
)
stringdate
 is a variable
mask
 is a code that specifies the order of the
components of the date
 
For day, month, year dates:
mask
 is 
“MDY” 
if the order is month, day, year
mask
 is 
“DMY” 
if the order is day, month, year
 
 
 
* create numeric version of date of birth
 
*  order is month, day, year
 
gen newdob = date(dob, “MDY”)
 
The 
date() 
function accepts dates in
many formats
 
 
String dates are recorded in many different
formats, but fortunately, the 
date()
function is flexible in what it accepts as inputs
 
 
 
*just display commands to show date() usage
 
. di date("March 5, 2021", "MDY")
 
22344
 
 
. di date("Mar 5, 2021", "MDY")
 
22344
 
 
. di date("3/5/2021", "MDY")
 
22344
 
 
. di date("3-5-2021", "MDY")
 
22344
 
 
 
*add 20 or 19 to the mask if year is 2-digit
 
. di date("3-5-21", "MD20Y")
 
22344
 
 
. di date("3-5-21", "MD19Y")
 
-14181
 
Formatting numeric dates
 
 
After conversion using 
date()
, the resulting
variable will be filled with numbers
representing dates, but can be hard to read
directly as dates
 
Stata’s 
format
 command controls how
variables are displayed
 
The format 
%td
 formats numbers as dates
22344
 will appear as 
2mar2021
 after applying
the format
 
 
*apply date format to variable newdob
 
format 
newdob
 %td
 
Date arithmetic
 
 
Once dates are stored as numeric variables in
Stata, we can perform date arithmetic, for
example, to find the number of days between
two dates
 
 
* length of stay
 
gen los = discharge_date – admit_date
 
Functions to extract components of
dates
 
 
At times, we will need to extract one of the
components of a date, such as the year
 
Each of these functions returns a number:
year()
: numeric year
month()
: 1 to 12
day()
: day of the month
 
 
*year of birth
 
gen yob = year(dob)
 
 
*month of birth
 
gen mob = month(dob)
undefined
 
String variables
 
 
 
 
help string functions
strtrim()
substr()
+
encode
destring
 
Strings in Stata
 
 
Strings are just a series of characters
 
Variables can generally be stored as either
numeric or string
 
String values are surrounded by quotes when
specifying a Stata command
String missing is 
“”
 
 
 
Many estimation commands in Stata will not
work with string variables.
 
 
* z now missing if z was “-99”
 
replace z = “” if z==“-99”
 
 
String functions
 
 
Stata provides a number of functions to clean,
combine, and extract components of string
variables
 
We will examine a few in detail shortly
 
Other useful functions:
strlen()
: length of strings
strpos()
: the position in a string where a
substring is found
strofreal()
: convert number to string with
a specified format
 
 
* help page for all string functions
 
help string functions
 
strtrim()
: trimming white space
 
 
String variables often arrive messy, with
unnecessary spaces at the beginning or end of
the string
“  May 3, 2001 “
 
Extra spaces can complicate string matching
and extracting components
tab
 will treat 
“ UCLA” 
and 
“UCLA” 
as
separate categories
 
strtrim() 
removes whitespace from the
beginning and end of strings
 
* some of these categories should be combined
tab hospital
 
* remove all leading and trailing whitespace
replace hospital = strtrim(hospital)
tab hospital
 
substr()
: extracting a substring
 
 
substr() 
extracts a substring from a
longer string
 
substr(
s
, 
n1
, 
n2
)
s
 is a string value or string variable
n1
 is the starting position
negative number counts from the end
n2
 is the length of the string
 
*area code starts at 1, length=3
gen areacode = substr(phone, 1, 3)
 
*extract last 5 characters as zip code
gen zipcode
 = substr(address, -5, 5)
 
+
: concatenating strings
 
 
Strings can be joined or concatenated
together in a Stata command with a 
+
String variables can be combined with string
constants this way
 
 
*create full name variable: “Last, First”
 
gen fullname = lastname + “, “ + firstname
 
 
String matching
 
 
For matching strings exactly, we can use the 
==
operator
 
For more flexible pattern matching, Stata has a
number of functions that use 
regular expressions
e.g. to find strings that contain either 
“US” 
or
“USA” 
and may or may not contain periods
“US”
“U.S.”
“USA”
“U.S.A.”
See 
help regexm
See 
help strmatch
 
Regular expressions are beyond the scope of this
workshop
 
 
* for regular expression functions
 
help regexm
 
Encoding strings into numeric variables
 
 
Categorical variables are often initially coded as
strings
But most estimation commands require numeric
variables
 
encode
 converts string variables to numeric
assigns a numeric value to each distinct string
value
applies value labels of the original string values
Use the 
gen() 
option to create a new variable
Or instead use 
replace
 to overwrite the original string
variable
 
The ordering of the categories is alphabetical
 
Remember, when browsing, string variables will
appear red while numeric variables with labels
will appear blue
 
 
* convert hospital to numeric, generate new
variable
 
encode hospital, gen(hospnum)
 
 
Convert number variables stored as
strings to numeric with 
destring
 
 
Sometimes, variables with number values are
loaded as strings into Stata.
 
This can happen can a character value is
mistakenly entered, or a non-numeric code (e.g.
“NA”
) is used for missing.
Or the variable was saved as a string in other
software (e.g. Excel)
 
We do not want to use 
encode
 here, because
the variable values are truly numbers rather than
categories
we would not want the “1.25” converted to a
category.
 
Instead, we can use 
destring
, which directly
translates numbers stored as strings into the
numbers themselves.
 
*convert string wbc to numeric and overwrite
destring wbc, replace
undefined
 
Appending
data sets
 
 
 
 
 
 
append
 
Appending: adding observations
 
 
We often wish to combine data sets that are split into multiple files that have the same
variables (more or less)
Data collected over several waves of time
Data collected from different labs/sources
 
Loaded in
Stata
 
On hard
drive
 
=
 
+
 
append
 
 
With Stata’s 
append
, data set files stored on
hard drives are appended to the data set
currently loaded in Stata (the 
master
 data set)
 
Syntax:
 
append using 
filename
[
filename
…]
 
Multiple 
filename
s can be appended to the
master data set
 
Variables with the same name should have
the same type (string, float, etc.)
use the 
force
 option for mismatches
 
 
*append data set data2
append using data2
 
*gen() option creates new variable source
*  source=1 if obs comes from 1st data set
*  source=2 if obs comes from 2nd data set
append using data2, gen(source)
 
Unshared variables will have missing
 
 
Variables that do not appear in all datasets will have missing values where they were omitted
 
+
 
=
undefined
 
Merging data
sets
 
 
 
 
 
 
merge
 
One-to-one merging
 
 
When we merge datasets, we add more columns of variables.
 
Datasets to be merged should generally be matched on an id variables that appears in both
datasets
 
In the most basic merge, each id appears once in each file
 
Loaded in Stata
(
master
 dataset)
 
On hard drive
(
using
 dataset)
 
=
 
+
 
merge 1:1 
for
 
one-to-one merging
 
 
Basic syntax
:
 
merge 1:1 
varlist
 using
filename
varlist 
is one or more variables used to
match observations
filename
 
is the data set stored elsewhere to
be merged into the data set currently loaded in
Stata
 
 
*one-to-one merge example
 
merge 1:1 id using morevars.dta
 
_merge
: understanding the resulting
merge
 
 
After running a  
merge
, Stata will
ouput a table similar to the table
pictured on the right
 
Stata also automatically adds a new
variable, 
_merge
, to the merged
data set, where:
_merge==1 
if the observation’s id is
only found in the master file
_merge==2 
if the observation’s id is
only found in the using file
_merge==3 
if the observation’s
merge id was matched in both files
 
 
 
 
    Result                      Number of obs
 
    -----------------------------------------
 
    Not matched                             5
 
        from master                         5  (_merge==1)
 
        from using                          0  (_merge==2)
 
 
    Matched                               200  (_merge==3)
 
    -----------------------------------------
 
 
 
* drop unmatched observations
 
drop if _merge != 3
 
Many-to-one merges
 
 
Many-to-one merge: unique id values appear multiple times in the master data set, but only
once in the using dataset
 
 
 
Loaded in Stata
(
master
 dataset)
 
On hard drive
(
using
 dataset)
 
=
 
+
 
Merging on 
Doc_ID
 
_merge==2 
means this 
Doc_ID
only appears 
in using dataset
 
One-to-many merges
 
 
One-to-many merge: unique id values appear once each in the master data set, but can appear
many times in the using dataset
 
 
 
Loaded in Stata
(
master
 dataset)
 
On hard drive
(
using
 dataset)
 
=
 
+
 
Merging on 
Doc_ID
 
_merge==1 
means this 
Doc_ID
only appears 
in master dataset
 
merge m:1
 and  
merge 1:m
 
 
Many-to-one
:
 
merge m:1 
varlist
 using
filename
 
 
One-to-many
:
 
merge 1:m 
varlist
 using
filename
 
varlist 
is one or more id variables
filename
 
is using data set file name
 
 
 
*each docid appears multiple times in
* master data, once in using data
merge m:1 docid using “dm_doctor_data.dta"
undefined
 
Looping
 
 
 
 
 
foreach
forvalues
 
Loops
 
 
Loops are a programmer’s tool to perform some task repeatedly over a set of items.
 
For example, this Stata loop:
 
foreach var of varlist x y z {
 
  summ `var’
 
  gen copy_`var’ = `var’
 
}
 
Is equivalent to running
 
summ x
 
gen copy_x = x
 
summ y
 
gen copy_y = y
 
summ z
 
gen copy_z = z
var
 is known as a Stata 
macro variable
, a temporary
variable (not related to the data set) that can hold string
values
 
The word 
var
 is arbitrary and can be any word you like
 
varlist 
tells Stata that the strings following are
variable names
 
Loops
 
 
Loops are a programmer’s tool to perform some task repeatedly over a set of items.
 
For example, this Stata loop:
 
foreach var of varlist x y z {
 
  summ `var’
 
  gen copy_`var’ = `var’
 
}
 
 
1
st
 pass through loop:
 
summ x
 
gen copy_x = x
Initially
 var 
will be set equal to 
x, 
and then the
commands within 
{}
 are run;  wherever 
`var’
appears, replace with 
x
.
 
Loops
 
 
Loops are a programmer’s tool to perform some task repeatedly over a set of items.
 
For example, this Stata loop:
 
foreach var of varlist x y z {
 
  summ `var’
 
  gen copy_`var’ = `var’
 
}
 
 
2
nd
 pass through loop:
 
summ y
 
gen copy_y = y
Then
 var 
becomes
 y, 
the commands within
 {}
are run
,
 replace 
`var’
 with
 y
 
Loops
 
 
Loops are a programmer’s tool to perform some task repeatedly over a set of items.
 
For example, this Stata loop:
 
foreach var of varlist x y z {
 
  summ `var’
 
  gen copy_`var’ = `var’
 
}
 
 
3
rd
 pass through loop:
 
summ z
 
gen copy_z = z
Repeat for 
z …
 
Stata loops
 
 
In general, Stata loops will consist of:
A macro variable that sequentially takes on the
values of a set of items
A set of commands within 
{}
Calls to the contents of the macro variable with
`
macro
, where 
macro
 is the macro variable
name
when the loop commands execute, `
macro
’ will be
replaced by its current contents
 
 
 
* ^ means “to the power”
 
forvalues i = 2/4 {
 
   gen test`i’ = test^`i’
 
   summ test`i’
 
}
 
 
 
*loop above is equivalent to
 
gen test2 = test^2
 
summ test2
 
gen test3 = test^3
 
summ test3
 
gen test4 = test*4
 
summ test4
 
 
foreach: 
looping over variables
 
foreach 
macro
 of varlist 
varlist
 {
  
commands
}
 
macro
: name of variable that takes on values in
varlist
 
varlist
: a list of variable names
 
commands
: Stata commands to run each time loop
iterates
Use 
`’
 with macro name to substitute contents of macro
 
Opening 
{
 must be on the first line
 
Closing 
}
 must be by itself on the last line
 
foreach 
Can loop over many other kinds of lists
besides variables
 
 
* for more help with foreach loops
 
help foreach
 
 
*reverse-code three 1-7 Likert scale variables
 
* apply variable label
 
foreach var of varlist T1 T2 T3 {
 
  gen rev_`var’ = 8 - `var’
 
  label var rev_`var’ “Reverse-coded `var’”
 
}
 
 
 
*create copies of four variables
 
* where 4 is recoded to 3
 
foreach var of varlist X1 X2 X3 X4 {
 
  gen new_`var’ = `var’
 
  recode new_`var’ (4=3)
 
}
 
 
 
Use 
forvalues
 to loop over numbers
 
foreach 
macro
 = 
range
 {
  
commands
}
 
macro
: name of variable that takes on values
in 
varlist
 
range
: a range of numbers, e.g. 
1/10
 or
1(2)
11
 
commands
: Stata commands to run each
time loop iterates
 
Opening 
{
 must be on the first line
 
Closing 
}
 must be by itself on the last line
 
 
*create dummy variables for
 
* different age cutoffs
 
forvalues i = 40(10)60 {
 
  gen age`i’ = 0
 
  replace age`i’ = 1 if age >= `i’
 
  replace age`i’ = . if age == .
 
}
 
 
*equivalent to
 
gen age40 = 0
 
replace age40 = 1 if age >= 40
 
replace age40 = . if age == .
 
gen age50 = 0
 
replace age50 = 1 if age >= 50
 
replace age50 = . if age == .
 
gen age60 = 0
 
replace age60 = 1 if age >= 60
 
replace age60 = . if age == .
undefined
 
Processing data
by group
 
 
 
 
 
sort
by:
b
ysort:
varname[n]
 
Grouped data
 
 
Many datasets consist of grouped observations
observations were sampled in clusters (e.g.
students sampled from schools)
repeated measurements of the same individual
 
With grouped data, we often want to generate
variables and process data by groups
 
Examples:
The mean SAT math score for each classroom of
students
In a longitudinal study of depression scores, a
variable that records each person’s first (baseline)
value of depression
Or a variable that represents depression from the previous
timepoint (lagged depression)
 
 
 
 
 
mean 
SAT_math 
by 
Class_ID
 
The 
by
 prefix
 
 
For processing by groups, we use the prefix 
by 
and a group variable
, which precede other
Stata commands
 
In general the syntax will be:
by 
varlist
: 
stata_cmd
varlist
 is one or more grouping variables by which the data are to be processed
stata_cmd
 
is the Stata command to run on each group of data.
 
Data must be sorted before processing
by group
 
 
Data must be sorted by the grouping variable
before processing 
by
 that variable
Either use the 
sort
 command on the grouping
before running any commands with 
by…
…or use the prefix 
bysort (
instead of 
by
),
which sorts the data by the grouping variable(s)
before processing by group
 
 
*2 steps, sort first; summ age by doctor
 
sort dobcid
 
by docid: summ age
 
 
* same as above in one step
 
bysort docid: summ age
 
Generating statistics by group
 
 
some 
egen
 functions can be used with 
by
 to
generate statistics by group.
including 
mean
, 
max
, and 
sd
 
*mean, max, and standard dev of age
*  of patients within each doctor
by docid: egen mean_age = mean(age)
by docid: egen max_age = max(age)
by docid: egen sd_age = sd(age)
 
Longitudinal data (multiple rows per
subject)
 
 
Longitudinal designs repeatedly measure
subjects over time
 
If repeated measurements are recorded on
separate rows of data (i.e. long data), then
there will usually be both an ID variable and a
time variable
 
Data should generally be sorted by the ID
variable and then the time variable before by-
ID processing
However, we usually only want to process by the
ID variable
So, 
sort
 first, then use 
by: 
(2 steps)
 
 
depress
 at 
time=0
 
lagged 
depress
 
Specifying the value of a variable from a
particular observation
 
 
If we want to use the value of a variable from a
particular observation number, e.g.
observation 1, we can use this syntax:
 
varname
[
n
]
 
So, 
math[3]
 is the value of the 
math
variable from the third observation
 
If used with 
by:
, then it is the 
n
th value from
within each group
by classid: math[3]
, third math value
within each class
 
 
*sort by id and time first
 
sort id time
 
 
*baseline (first) value of depression per id
 
by id: depress0 = depress[1]
 
 
System variables 
_n
 and 
_N
 
 
System variables are created and updated by
Stata
 
 
_n 
is the number of the current observation
 
_N 
is the total number of observations
 
With 
by
:
_n
 is the current observation in a group
 
_N 
is the total number of observations in the
current group.
 
These can then be used to create variables for
longitudinal data
 
 
*last depression score per id
 
* allows for different number of timepoints
 
by id: gen depress_last = depress[_N]
 
 
*lagged depression
 
* first obs will be . per id
 
by id: gen depress_lag = depress[_n-1]
 
 
Summing within group
 
 
When used with generate, sum() creates a
running sum
gen sumx = sum(x) 
// running sum
of x
 
With a 
by
-group specification, we get running
sums by group.
The running sum may itself be a useful variable
We can also pull the value from the last
observation (using 
_N
) within each group to
create a total sum variable
 
 
 
*running sum of adverse life events (ale)
 
*  per id
 
by id: gen sum_ale = sum(ale)
 
 
*total number of adverse life events per id
 
by id: gen total_ale = sum_ale[_N]
undefined
 
References and
Further
Learning
 
 
References
 
 
Mitchell, Michael N. 2010. 
Data Management Using Stata: A Practical Handbook
. Stata Press.
 
Stata YouTube channel
 – videos for both data management and data analysis made by Stata,
and a 
list
 of links to their videos on their home site
 
Data management FAQ 
on Stata home site
 
UCLA OARC Stata pages
 – our own pages on data management and data analysis
undefined
 
THANK YOU!
 
Slide Note
Embed
Share

This workshop conducted by UCLA's OARC covers essential Stata commands for preparing data sets for statistical analysis. It includes topics like inspecting and creating variables, handling missing data, merging datasets, and processing data efficiently. Participants will learn to use do-files, write code, and read syntax specifications to enhance data management skills.

  • Data Management
  • Statistical Analysis
  • Stata Commands
  • UCLA Workshop

Uploaded on Jul 10, 2024 | 6 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. UCLA OARC STATISTICAL METHODS AND DATA ANALYTICS STATA data management

  2. Purpose of the workshop Data sets usually do not arrive in a state immediately ready for analysis Variables need to be cleaned E.g. missing data codes like -99 can invalidate analyses if left untouched Data errors should be corrected Variables need to be generated Unnecessary data should perhaps be dropped for efficiency Data sets may need to be merged together This workshop covers commonly used Stata commands to prepare data sets for statistical analysis

  3. Topics Inspecting variables Creating variables Selecting observations and variables Missing data Date variables String variables Appending data sets Merging data sets Looping Processing data by groups

  4. save Preliminaries load

  5. Use do-files Write and run code in do-files, text files where Stata commands can be saved. A record of the commands used Can easily make adjustments to code later To run code from the do-file, highlight code and then Ctrl-D or click Execute(do) icon Comments: precede text with * or enclose text within /*and */ *not executed /* this is a comment */ You can also place comments on the same line as a command following // tab x // also a comment

  6. Reading syntax specifications in this workshop Unitalicized words should be typed as is Italicized words should be replaced with the appropriate variable, command, or value For example: merge 1:1 varlist using filename varlist should be replaced with one or more variable names filename should be replaced with a file s name The rest should be typed as is varlist will be used throughout the workshop to mean one or more variable names

  7. Workshop dataset This dataset contains fake hospital patient data, with patients as rows of data (no patient id, though) A mix of continuous and discrete numeric variables as well as string variables Some data errors and missing data codes in the data Each patient is also linked to a doctor (docid). Another dataset containing doctor variables will be merged into this dataset later in the seminar.

  8. saveand use * save data as Stata data set, overwrite save data_clean.dta, replace We recommend that you save your data set under a different name after performing some data management on it Always good to have untouched raw, data set You may change your mind later about how variables should be generated/cleaned/etc * .dta automatically added so can omit it save data_clean, replace * load data_clean, clear memory first use data_clean, clear To quickly save and load your data in Stata, save as a Stata file (usually .dta) useloads Stata data files

  9. browse Inspecting variables summarize tabulate codebook

  10. browse Spreadsheet style window view browse or click on the spreadsheet and magnifying glass icon in the Stata toolbar Numeric variables appear black String variables appear red Numeric variables with value labels appear blue For large data sets, we need commands that allow us to quickly inspect variables

  11. summarize summarize(abbreviated summ) provides summary statistics for numeric variables, which may be useful for data management: Number of non-missing observations *summary stats for variable y summ y *summary stats for all numeric variables summ Does this variable have the correct number of missing observations? Mean and standard deviation Are there any data errors or missing data codes that make the mean or standard deviation seem implausible? Minimum and maximum Are the min and max within the plausible range for this variable?

  12. tabulate tabulate(abbreviated tab here) frequency tables (# observations per value) string or numeric variables ok but not continuous numeric variables *table of frequencies of race * display missing tab race, miss *2-way table of frequencies tab race gender Questions to ask with tab: Is this the number of categories I expect? Are the categories numbered and labeled as I expect? Do these frequencies look right? Important data management options miss -treat missing values as another category nolabel remove labels to inspect underlying numeric representation

  13. codebook *detailed info about variables x and y codebook x y Detailed information about variables codebookprovides: For all variables Number of unique values Number of missing values Value labels, if applied For numeric variables Range quantiles, means and standard deviation for continuous variables Frequencies for discrete variables For string variables frequencies warnings about leading and trailing blanks *detailed info about all variables codebook

  14. generate Creating variables help functions egen

  15. generatevariables generate(abbreviated genor even g) is the basic command to create variables Often from other existing variables if the value of the input variable is missing, the generated value will be missing as well. *sum of tests * if any test=., testsum=. gen testsum = test1 + test2 + test3 *reverse-code a 1,2,3 variable gen revvar = 4-var We will be using genthroughout this seminar

  16. Functions to use with generate *get table of function help pages help functions We can use functions to perform some operation on a variable to generate another variable *random number (0,1) for each obs gen x = runiform() Later sections of this seminar will take a focused look at the function groups Date and Time and String *running (cumulative) sum of x gen sumx = sum(x) *extract year from date variable gen yr_birth= year(date_birth) We will use many various functions from other groups as well *extract 1st 3 numbers of phone number gen areacode = substr(phone, 1, 3) Most of these functions accept no more than one variable as an input

  17. egen, extended generation command egen(extended generate) creates variables with its own, exclusiveset of functions, which include: Many functions that accept multiple variables as arguments cut() function to create categorical variable from continuous variable group() function to create a new grouping variable that is the crossing of two grouping variables *mean of 3 y variables egen ymean = rowmean(y1 y2 y3) *how many missing values in y vars egen ymiss = rowmiss(y1 y2 y3) * how many of the y vars equal 2 or 3 egen y2or3 = anycount(y1 y2 y3), values(2 3) *age category variable: * 0-17, 18-59, 60-120, with value labels egen agecat = cut(age), at(0, 18, 60, 120) label *create agecat-by-race variable, with value labels egen agecat_race = group(agecat, race), label

  18. if Selecting observations and variables <>=!~&| replace recode drop keep

  19. if: selecting by condition *tab x for age > 60, include missing tab x if age > 60, miss select observations that meet a certain condition ifclause usually placed after the command specification, but before the comma that marks the beginning of the list of options.

  20. Logical and relational operators *summary stats for y * for obs where insured not equal to 1 summ y if insured != 1 Relational Operators > greater than <= less than or equal *tab x for obs where * age < 60 and insured equal to 1 tab x if (age < 60) & (insured == 1) == equal != not equal ~= not equal Logical Operators & and | or ! not ~ not

  21. replaceand if The replacecommand is used to replace the values of an existing variable with new values *binary variable coding whether pain is greater than 6 gen highpain = 0 replace highpain = 1 if pain > 6 Typically, replace is used with ifto replace the values in a subset of observation

  22. Change variable coding with recode *recode (0,1,2)->3 and (6,7)->5 recode income_cat (0 1 2 = 3) (6 7 = 5) Variables are often not coded the way we want too many categories or with values out of order. With recode, we can handle all recodes for a variable in a single step.

  23. keep and drop: filtering observations * drop observations where age < 18 drop if age < 18 Drop unneeded observations using: drop if exp * same thing as above keep if age >= 18 Where exp expresses some condition that if true, will cause an observation to be dropped Conversely, we can keep only observations that satisfy some condition with keep if exp

  24. drop(or keep) variables * drop variables x y z drop x y z Unneeded variables can be dropped with: drop varlist * drop all variables in consecutive columns * from age-dob drop age-dob where varlistis a list of variables to drop See examples for some shortcuts to drop many variables at once * drop all variables that begin with pat drop pat* Or, if you need to drop most variables, you can keepa few * drop all variables but age keep age

  25. . .a .b misstable summarize Missing Data mvdecode missing()

  26. Missing data * overview of how missing values work in Stata and Stata commands for missing values help missing Missing values are very common in real data, and it is important for an analyst to be aware of missingness in the data set Hopefully, you know any missing data codes (e.g. -99) When reading in data from a text or Excel file, missing data can be represented by an empty field.

  27. Missing values in Stata replace stringvar = if stringvar == -99 . is missing for numeric variables (also called sysmiss) * replace with sysmiss if -99 (skipped) replace numvar = . if numvar == -99 is missing for string variables (also called blank). * use .a for different missing data code -98 * (e.g. refused to answer) replace numvar = .a if numvar == -98 Additional missing data values are available .a through .z can be used to represent different types of missing (refusal, don t know, etc.). Missing values are very large numbers in Stata all non-missing numbers < . < .a < .b < < .z

  28. misstable summarize: finding existing missing values misstable summarize produces a table of missing values (of all types) across a set of variables *table of missing values across all variables misstable summarize column Obs=. counts the number of missing values equal to . column Obs>. counts the number of missing values other than ., such as .a and .b column Obs<. and the entire right-hand section Obs<. address non-missing values

  29. Detecting missing data codes *boxplot of variables graph box lungcapacity test1 test2 extreme numeric values are often used to represent missing values -99 or 999 Undetected, these missing data codes can be included as real data in statistical analysis Use summarizeand graph boxplot to look for missing data codes across many variables at once

  30. Use mvdecodeto convert user-defined missing data codes to missing values *convert -99 to . for all variables mvdecode _all, mv(-99) We can quickly convert all user-defined missing codes to system missing values across all numeric variables with mvdecode. *convert -99 to . and -98 to .a mvdecode _all, mv(-99 =. \ -98=.a) Unfortunately, mvdecode will not work at all on string variables.

  31. The missing()function The missing()function returns TRUE if the value is any one of Stata s missing values (., .a, .b, etc) *eligible if not missing for lungcapacity gen eligible = 0 replace eligible = 1 if !missing(lungcapacity)

  32. Be careful with relational operators when working with missing values * we want hightest1 = 1 if test1 > 10 * but . > 10, so this is not right gen hightest1 = 0 replace hightest1 = 1 if test1 > 10 Missing values are very large numbers in Stata all non-missing numbers < . < .a < .b < < .z *now hightest1 will be . when test1 is . replace hightest1 = . if test1 == . Thus, (. > 50) results in TRUE When creating variables from other variables, make sure you know how you want to handle missing values

  33. help datetime date() Date variables format year() month() day()

  34. Dates as strings and numbers * Overview of how Stata handles dates help datetime In Stata we can store dates as strings January 2, 2021 1-2-2021 *Use codebook or describe to determine whether your date variable is a string or number codebook date_var However, dates should be represented numerically in Stata if the date dataare needed To create analysis variables For plotting *Alternatively, we can look at the variable in the browser: red=string, blue/black=number browse date_var Numeric dates (with day, month, and year data) in Stata are represented by the number of days since January 1, 1960 (a reference date) January 2, 2021 = 22,280

  35. date(): converting string dates to numeric Use the date() function to convert string dates to numeric in Stata * create numeric version of date of birth * order is month, day, year gen newdob = date(dob, MDY ) The generic syntax, for example with gen, is: gen varname = date(stringdate, mask) stringdateis a variable maskis a code that specifies the order of the components of the date For day, month, year dates: maskis MDY if the order is month, day, year maskis DMY if the order is day, month, year

  36. The date() function accepts dates in many formats *just display commands to show date() usage . di date("March 5, 2021", "MDY") 22344 String dates are recorded in many different formats, but fortunately, the date() function is flexible in what it accepts as inputs . di date("Mar 5, 2021", "MDY") 22344 . di date("3/5/2021", "MDY") 22344 . di date("3-5-2021", "MDY") 22344 *add 20 or 19 to the mask if year is 2-digit . di date("3-5-21", "MD20Y") 22344 . di date("3-5-21", "MD19Y") -14181

  37. Formatting numeric dates After conversion using date(), the resulting variable will be filled with numbers representing dates, but can be hard to read directly as dates *apply date format to variable newdob format newdob %td Stata s format command controls how variables are displayed The format %td formats numbers as dates 22344will appear as 2mar2021after applying the format

  38. Date arithmetic * length of stay gen los = discharge_date admit_date Once dates are stored as numeric variables in Stata, we can perform date arithmetic, for example, to find the number of days between two dates

  39. Functions to extract components of dates *year of birth gen yob = year(dob) At times, we will need to extract one of the components of a date, such as the year *month of birth gen mob = month(dob) Each of these functions returns a number: year(): numeric year month(): 1 to 12 day(): day of the month

  40. help string functions strtrim() String variables substr() + encode destring

  41. Strings in Stata * z now missing if z was -99 replace z = if z== -99 Strings are just a series of characters Variables can generally be stored as either numeric or string String values are surrounded by quotes when specifying a Stata command String missing is Many estimation commands in Stata will not work with string variables.

  42. String functions * help page for all string functions help string functions Stata provides a number of functions to clean, combine, and extract components of string variables We will examine a few in detail shortly Other useful functions: strlen(): length of strings strpos(): the position in a string where a substring is found strofreal(): convert number to string with a specified format

  43. strtrim(): trimming white space * some of these categories should be combined tab hospital String variables often arrive messy, with unnecessary spaces at the beginning or end of the string May 3, 2001 * remove all leading and trailing whitespace replace hospital = strtrim(hospital) tab hospital Extra spaces can complicate string matching and extracting components tabwill treat UCLA and UCLA as separate categories strtrim() removes whitespace from the beginning and end of strings

  44. substr(): extracting a substring substr() extracts a substring from a longer string *area code starts at 1, length=3 gen areacode = substr(phone, 1, 3) *extract last 5 characters as zip code gen zipcode = substr(address, -5, 5) substr(s, n1, n2) sis a string value or string variable n1is the starting position negative number counts from the end n2is the length of the string

  45. +: concatenating strings *create full name variable: Last, First gen fullname = lastname + , + firstname Strings can be joined or concatenated together in a Stata command with a + String variables can be combined with string constants this way

  46. String matching For matching strings exactly, we can use the == operator * for regular expression functions help regexm For more flexible pattern matching, Stata has a number of functions that use regular expressions e.g. to find strings that contain either US or USA and may or may not contain periods US U.S. USA U.S.A. See help regexm See help strmatch Regular expressions are beyond the scope of this workshop

  47. Encoding strings into numeric variables Categorical variables are often initially coded as strings But most estimation commands require numeric variables * convert hospital to numeric, generate new variable encode hospital, gen(hospnum) encodeconverts string variables to numeric assigns a numeric value to each distinct string value applies value labels of the original string values Use the gen() option to create a new variable Or instead use replaceto overwrite the original string variable The ordering of the categories is alphabetical Remember, when browsing, string variables will appear red while numeric variables with labels will appear blue

  48. Convert number variables stored as strings to numeric with destring Sometimes, variables with number values are loaded as strings into Stata. *convert string wbc to numeric and overwrite destring wbc, replace This can happen can a character value is mistakenly entered, or a non-numeric code (e.g. NA ) is used for missing. Or the variable was saved as a string in other software (e.g. Excel) We do not want to use encodehere, because the variable values are truly numbers rather than categories we would not want the 1.25 converted to a category. Instead, we can use destring, which directly translates numbers stored as strings into the numbers themselves.

  49. Appending data sets append

  50. Appending: adding observations We often wish to combine data sets that are split into multiple files that have the same variables (more or less) Data collected over several waves of time Data collected from different labs/sources ID DOB Weight Insured 101 3-3-1981 175 1 ID DOB Weight Insured Loaded in Stata 102 2-14-1975 213 0 101 3-3-1981 175 1 103 12-10-1990 198 1 102 2-14-1975 213 0 = 103 12-10-1990 198 1 + 201 4-29-1970 150 0 ID DOB Weight Insured 202 12-15-1963 254 0 On hard drive 201 4-29-1970 150 0 203 1-10-1962 199 1 202 12-15-1963 254 0 203 1-10-1962 199 1

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#