Data Management Workshop: Stata Commands for Statistical Analysis

undefined

STATA data

management

UCLA OARC

STATISTICAL

METHODS AND

DATA ANALYTICS

Purpose of the workshop

Data sets usually do not arrive in a state immediately ready for analysis

◦

Variables need to be cleaned

◦

E.g. missing data codes like -99 can invalidate analyses if left untouched

◦

Data errors should be corrected

◦

Variables need to be generated

◦

Unnecessary data should perhaps be dropped for efficiency

◦

Data sets may need to be merged together

This workshop covers commonly used Stata commands to prepare data sets for statistical

analysis

Inspecting variables

Creating variables

Selecting observations and variables

Missing data

Date variables

String variables

Appending data sets

Merging data sets

Looping

Processing data by groups

Topics

undefined

Preliminaries

save

load

Use do-files

Write and run code in do-files, text files where Stata commands can be saved.

◦

A record of the commands used

◦

Can easily make adjustments to code later

To run code from the do-file, highlight code and then Ctrl-D or click “Execute(do)” icon

Comments:  precede text with

 or enclose text within

/*

and

*/

◦

*not executed

◦

/* this is a comment */

You can also place comments on the same line as a command following

//

tab x

// also a comment

Reading syntax specifications in this

workshop

Unitalicized words should be typed as is

Italicized words should be replaced with the appropriate variable, command, or value

For example:

merge 1:1

varlist

 using

filename

varlist

should be replaced with one or more variable names

filename

should be replaced with a file’s name

The rest should be typed as is

varlist

will be used throughout the workshop to mean one or more variable names

Workshop dataset

This dataset contains fake hospital patient data, with patients as rows of data (no patient id,

though)

A mix of continuous and discrete numeric variables as well as string variables

Some data errors and missing data codes in the data

Each patient is also linked to a doctor (docid).

Another dataset containing doctor variables will be merged into this dataset later in the

seminar.

save

and

use

We recommend that you save your data set

under a different name after performing some

data management on it

◦

Always good to have untouched raw, data set

◦

You may change your mind later about how

variables should be generated/cleaned/etc

To quickly

save

and

load

 your data in Stata,

save

as a Stata file (usually .dta)

use

 loads Stata data files

* save data as Stata data set, overwrite

save data_clean.dta, replace

* .dta automatically added so can omit it

save data_clean, replace

* load data_clean, clear memory first

use data_clean, clear

undefined

Inspecting

variables

browse

summarize

tabulate

codebook

browse

Spreadsheet style window view

browse

◦

or click on the spreadsheet and magnifying glass icon in the Stata toolbar

Numeric variables appear black

String variables appear red

Numeric variables with value labels appear blue

For large data sets, we need commands that allow us to quickly inspect variables

summarize

summarize

 (abbreviated

summ

) provides

summary statistics for numeric variables,

which may be useful for data management:

◦

Number of non-missing observations

◦

Does this variable have the correct number of missing

observations?

◦

Mean and standard deviation

◦

Are there any data errors or missing data codes that make

the mean or standard deviation seem implausible?

◦

Minimum and maximum

◦

Are the min and max within the plausible range for this

variable?

*summary stats for variable y

summ y

*summary stats for all numeric variables

summ

tabulate

tabulate

 (abbreviated

tab

here

◦

frequency tables (# observations per value)

◦

string or numeric variables ok

◦

but not continuous numeric variables

Questions to ask with

tab

◦

Is this the number of categories I expect?

◦

Are the categories numbered and labeled as I

expect?

◦

Do these frequencies look right?

Important data management options

◦

miss

  - treat missing values as another category

◦

nolabel

 – remove labels to inspect underlying

numeric representation

*table of frequencies of race

* display missing

tab race, miss

*2-way table of frequencies

tab race gender

codebook

Detailed information about variables

codebook

 provides:

◦

For

all

 variables

◦

Number of unique values

◦

Number of missing values

◦

Value labels, if applied

◦

For

numeric

 variables

◦

Range

◦

quantiles, means and standard deviation for continuous

variables

◦

Frequencies for discrete variables

◦

For

string

 variables

◦

frequencies

◦

warnings about leading and trailing blanks

*detailed info about variables x and y

codebook x y

*detailed info about all variables

codebook

undefined

Creating

variables

generate

help functions

egen

generate

 variables

generate

 (abbreviated

gen

 or even

)  is

the basic command to create variables

◦

Often from other existing variables

◦

if the value of the input variable is missing, the

generate

d value will be missing as well.

We will be using

gen

 throughout this seminar

*sum of tests

* if any test=., testsum=.

gen testsum = test1 + test2 + test3

*reverse-code a 1,2,3 variable

gen revvar = 4-var

Functions to use with

generate

We can use functions to perform some

operation on a variable to generate another

variable

Later sections of this seminar will take a

focused look at the function groups

Date and

Time

and

String

We will use many various functions from other

groups as well

Most of these functions accept no more than

one variable as an input

*get table of function help pages

help functions

*random number (0,1) for each obs

gen x = runiform()

*running (cumulative) sum of x

gen sumx = sum(x)

*extract year from date variable

gen yr

_birth

= year(date_birth)

*extract 1st 3 numbers of phone number

gen areacode = substr(phone, 1, 3)

egen

, extended generation command

egen

 (extended generate) creates variables

with its own,

exclusive

 set of functions, which

include:

◦

Many functions that accept multiple variables as

arguments

◦

cut()

function to create categorical variable

from continuous variable

◦

group()

function to create a new grouping

variable that is the crossing of two grouping

variables

*mean of 3 y variables

egen ymean = rowmean(y1 y2 y3)

*how many missing values in y vars

egen ymiss = rowmiss(y1 y2 y3)

* how many of the y vars equal 2 or 3

egen y2or3 = anycount(y1 y2 y3), values(2 3)

*age category variable:

0-17, 18-59, 60-120, with value labels

egen agecat = cut(age), at(0, 18, 60, 120) label

*create agecat-by-race variable, with value labels

egen agecat_race = group(agecat, race), label

undefined

Selecting

observations

and variables

if

<>=!~&|

eplace

recode

rop

keep

f:

selecting by condition

select observations that meet a certain

condition

if

 clause usually placed after the command

specification, but before the comma that

marks the beginning of the list of options.

*tab x for age > 60, include missing

tab x if age > 60, miss

Logical and relational operators

*summary stats for y

* for obs where insured not equal to 1

summ y if insured != 1

*tab x for obs where

*  age < 60 and insured equal to 1

tab x if (age < 60) & (insured == 1)

replace

and

if

The

replace

 command is used to replace

the values of an existing variable with new

values

Typically,

replace

 is used with

if

 to replace

the values in a subset of observation

*binary variable coding whether pain is

greater than 6

gen

highpain

= 0

replace highpain = 1 if pain > 6

Change variable coding with

recode

Variables are often not coded the way we

want

◦

too many categories or with values out of order.

With

recode

, we can handle all recodes for a

variable in a single step.

*recode (0,1,2)->3  and (6,7)->5

recode income_cat (0 1 2 = 3) (6 7 = 5)

keep

and

 drop:

filtering

observations

Drop unneeded observations using:

drop if

exp

Where

exp

expresses some condition that if

true, will cause an observation to be dropped

Conversely, we can keep only observations

that satisfy some condition with

keep if

exp

* drop observations where age < 18

drop if age < 18

* same thing as above

keep if age >= 18

drop

(or

keep

) variables

Unneeded variables can be dropped with:

drop

varlist

where

varlist

 is a list of variables to drop

See examples for some shortcuts to drop

many variables at once

Or, if you need to drop most variables, you

can

keep

 a few

* drop variables x y z

drop x y z

* drop all variables in consecutive columns

*   from age-dob

drop age-dob

* drop all variables that begin with “pat”

drop pat*

* drop all variables but age

keep age

undefined

Missing Data

. “” .a .b

misstable summarize

mvdecode

missing()

Missing data

Missing values are very common in real data,

and it is important for an analyst to be aware

of missingness in the data set

◦

Hopefully, you know any missing data codes

(e.g. -99)

When reading in data from a text or Excel file,

missing data can be represented by an empty

field.

* overview of how missing values work in

Stata and Stata commands for missing values

help missing

Missing values in Stata

. is missing for numeric variables (also called

sysmiss

“”

 is missing for string variables (also called

blank

).

Additional missing data values are available

◦

.a

 through

.z

◦

can be used to represent different types of missing

(refusal, don’t know, etc.).

Missing values are very large numbers in Stata

◦

all non-missing numbers

< . < .a < .b < … < .z

replace stringvar = “” if stringvar == “-99”

* replace with sysmiss if -99 (skipped)

replace numvar = . if numvar == -99

* use .a for different missing data code -98

*  (e.g. refused to answer)

replace numvar = .a if numvar == -98

misstable summarize

: finding

existing missing values

misstable summarize

produces a table

of missing values (of all types) across a set of

variables

column

Obs=.

counts the number of

missing values equal to .

column

Obs>.

counts the number of

missing values other than ., such as .a and .b

column

Obs<.

and the entire right-hand

section

Obs<.

address non-missing values

*table of missing values across all variables

misstable summarize

Detecting missing data codes

extreme numeric values are often used to

represent missing values

◦

-99 or 999

Undetected, these missing data codes can be

included as real data in statistical analysis

Use

summarize

and

graph boxplot

to

look for missing data codes across many

variables at once

*boxplot of variables

graph box lungcapacity test1 test2

Use

mvdecode

 to convert user-defined

missing data codes to missing values

We can quickly convert all user-defined

missing codes to system missing values across

all numeric variables with

mvdecode

Unfortunately,

mvdecode

 will not work at all

on string variables.

*convert -99 to . for all variables

mvdecode _all, mv(-99)

*convert -99 to . and -98 to .a

mvdecode _all, mv(-99 =. \ -98=.a)

The

missing()

 function

The

missing()

 function returns

TRUE if the value is any one of

Stata’s missing values (

.a

.b

etc)

*eligible if not missing for lungcapacity

gen eligible = 0

replace eligible = 1 if !missing(lungcapacity)

Be careful with relational operators

when working with missing values

Missing values are very large numbers in Stata

◦

all non-missing numbers

< . < .a < .b < … < .z

Thus,

(. > 50)

results in

TRUE

When creating variables from other variables,

make sure you know how you want to handle

missing values

* we want hightest1 = 1 if test1 > 10

*  but . > 10, so this is not right

gen hightest1 = 0

replace hightest1 = 1 if test1 > 10

*now hightest1 will be . when test1 is .

replace hightest1 = . if test1 == .

undefined

Date variables

help datetime

date()

format

year() month() day()

Dates as strings and numbers

In Stata we can store dates as strings…

◦

“January 2, 2021”

◦

“1-2-2021”

However, dates should be represented

numerically in Stata if the date

data

are

needed

◦

To create analysis variables

◦

For plotting

Numeric dates (with day, month, and year

data) in Stata are represented by the number

of days since January 1, 1960 (a reference

date)

◦

January 2, 2021 = 22,280

* Overview of how Stata handles dates

help datetime

*Use codebook or describe to determine

whether your date variable is a string or

number

codebook date_var

*Alternatively, we can look at the variable

in the browser: red=string, blue/black=number

browse date_var

date():

converting string dates to

numeric

Use the

date()

function to convert string

dates to numeric in Stata

The generic syntax, for example with

gen,

is:

gen

varname

 = date(

stringdate

mask

◦

stringdate

 is a variable

◦

mask

 is a code that specifies the order of the

components of the date

For day, month, year dates:

◦

mask

is

“MDY”

if the order is month, day, year

◦

mask

is

“DMY”

if the order is day, month, year

* create numeric version of date of birth

*  order is month, day, year

gen newdob = date(dob, “MDY”)

The

date()

function accepts dates in

many formats

String dates are recorded in many different

formats, but fortunately, the

date()

function is flexible in what it accepts as inputs

*just display commands to show date() usage

. di date("March 5, 2021", "MDY")

. di date("Mar 5, 2021", "MDY")

. di date("3/5/2021", "MDY")

. di date("3-5-2021", "MDY")

*add 20 or 19 to the mask if year is 2-digit

. di date("3-5-21", "MD20Y")

. di date("3-5-21", "MD19Y")

-14181

Formatting numeric dates

After conversion using

date()

, the resulting

variable will be filled with numbers

representing dates, but can be hard to read

directly as dates

Stata’s

format

 command controls how

variables are displayed

The format

%td

 formats numbers as dates

◦

 will appear as

2mar2021

 after applying

the format

*apply date format to variable newdob

format

newdob

%td

Date arithmetic

Once dates are stored as numeric variables in

Stata, we can perform date arithmetic, for

example, to find the number of days between

two dates

* length of stay

gen los = discharge_date – admit_date

Functions to extract components of

dates

At times, we will need to extract one of the

components of a date, such as the year

Each of these functions returns a number:

◦

year()

: numeric year

◦

month()

: 1 to 12

◦

day()

: day of the month

*year of birth

gen yob = year(dob)

*month of birth

gen mob = month(dob)

undefined

String variables

help string functions

strtrim()

substr()

encode

destring

Strings in Stata

Strings are just a series of characters

Variables can generally be stored as either

numeric or string

String values are surrounded by quotes when

specifying a Stata command

◦

String missing is

“”

Many estimation commands in Stata will not

work with string variables.

* z now missing if z was “-99”

replace z = “” if z==“-99”

String functions

Stata provides a number of functions to clean,

combine, and extract components of string

variables

We will examine a few in detail shortly

Other useful functions:

◦

strlen()

: length of strings

◦

strpos()

: the position in a string where a

substring is found

◦

strofreal()

: convert number to string with

a specified format

* help page for all string functions

help string functions

strtrim()

: trimming white space

String variables often arrive messy, with

unnecessary spaces at the beginning or end of

the string

◦

“  May 3, 2001 “

Extra spaces can complicate string matching

and extracting components

◦

tab

 will treat

“ UCLA”

and

“UCLA”

as

separate categories

strtrim()

removes whitespace from the

beginning and end of strings

* some of these categories should be combined

tab hospital

* remove all leading and trailing whitespace

replace hospital = strtrim(hospital)

tab hospital

substr()

: extracting a substring

substr()

extracts a substring from a

longer string

substr(

n1

n2

◦

 is a string value or string variable

◦

n1

 is the starting position

◦

negative number counts from the end

◦

n2

 is the length of the string

•

*area code starts at 1, length=3

•

gen areacode = substr(phone, 1, 3)

•

*extract last 5 characters as zip code

•

gen zipcode

 = substr(address, -5, 5)

: concatenating strings

Strings can be joined or concatenated

together in a Stata command with a

◦

String variables can be combined with string

constants this way

*create full name variable: “Last, First”

gen fullname = lastname + “, “ + firstname

String matching

For matching strings exactly, we can use the

==

operator

For more flexible pattern matching, Stata has a

number of functions that use

regular expressions

◦

e.g. to find strings that contain either

“US”

or

“USA”

and may or may not contain periods

◦

“US”

◦

“U.S.”

◦

“USA”

◦

“U.S.A.”

◦

See

help regexm

◦

See

help strmatch

Regular expressions are beyond the scope of this

workshop

* for regular expression functions

help regexm

Encoding strings into numeric variables

Categorical variables are often initially coded as

strings

◦

But most estimation commands require numeric

variables

encode

 converts string variables to numeric

◦

assigns a numeric value to each distinct string

value

◦

applies value labels of the original string values

◦

Use the

gen()

option to create a new variable

◦

Or instead use

replace

 to overwrite the original string

variable

The ordering of the categories is alphabetical

Remember, when browsing, string variables will

appear red while numeric variables with labels

will appear blue

* convert hospital to numeric, generate new

variable

encode hospital, gen(hospnum)

Convert number variables stored as

strings to numeric with

destring

Sometimes, variables with number values are

loaded as strings into Stata.

This can happen can a character value is

mistakenly entered, or a non-numeric code (e.g.

“NA”

) is used for missing.

◦

Or the variable was saved as a string in other

software (e.g. Excel)

We do not want to use

encode

 here, because

the variable values are truly numbers rather than

categories

◦

we would not want the “1.25” converted to a

category.

Instead, we can use

destring

, which directly

translates numbers stored as strings into the

numbers themselves.

*convert string wbc to numeric and overwrite

destring wbc, replace

undefined

Appending

data sets

append

Appending: adding observations

We often wish to combine data sets that are split into multiple files that have the same

variables (more or less)

◦

Data collected over several waves of time

◦

Data collected from different labs/sources

Loaded in

Stata

On hard

drive

append

With Stata’s

append

, data set files stored on

hard drives are appended to the data set

currently loaded in Stata (the

master

 data set)

Syntax:

append using

filename

filename

…]

Multiple

filename

s can be appended to the

master data set

Variables with the same name should have

the same type (string, float, etc.)

◦

use the

force

 option for mismatches

•

*append data set data2

•

append using data2

*gen() option creates new variable source

*  source=1 if obs comes from 1st data set

*  source=2 if obs comes from 2nd data set

append using data2, gen(source)

Unshared variables will have missing

Variables that do not appear in all datasets will have missing values where they were omitted

undefined

Merging data

sets

merge

One-to-one merging

When we merge datasets, we add more columns of variables.

Datasets to be merged should generally be matched on an id variables that appears in both

datasets

In the most basic merge, each id appears once in each file

Loaded in Stata

master

 dataset)

On hard drive

using

 dataset)

merge 1:1

for

one-to-one merging

Basic syntax

merge 1:1

varlist

 using

filename

◦

varlist

is one or more variables used to

match observations

◦

filename

is the data set stored elsewhere to

be merged into the data set currently loaded in

Stata

*one-to-one merge example

merge 1:1 id using morevars.dta

_merge

: understanding the resulting

merge

After running a

merge

, Stata will

ouput a table similar to the table

pictured on the right

Stata also automatically adds a new

variable,

_merge

, to the merged

data set, where:

◦

_merge==1

if the observation’s id is

only found in the master file

◦

_merge==2

if the observation’s id is

only found in the using file

◦

_merge==3

if the observation’s

merge id was matched in both files

    Result                      Number of obs

    -----------------------------------------

    Not matched                             5

        from master                         5  (_merge==1)

        from using                          0  (_merge==2)

    Matched                               200  (_merge==3)

    -----------------------------------------

* drop unmatched observations

drop if _merge != 3

Many-to-one merges

Many-to-one merge: unique id values appear multiple times in the master data set, but only

once in the using dataset

Loaded in Stata

master

 dataset)

On hard drive

using

 dataset)

Merging on

Doc_ID

_merge==2

means this

Doc_ID

only appears

in using dataset

One-to-many merges

One-to-many merge: unique id values appear once each in the master data set, but can appear

many times in the using dataset

Loaded in Stata

master

 dataset)

On hard drive

using

 dataset)

Merging on

Doc_ID

_merge==1

means this

Doc_ID

only appears

in master dataset

merge m:1

and

merge 1:m

Many-to-one

merge m:1

varlist

 using

filename

One-to-many

merge 1:m

varlist

 using

filename

◦

varlist

is one or more id variables

◦

filename

is using data set file name

*each docid appears multiple times in

* master data, once in using data

merge m:1 docid using “dm_doctor_data.dta"

undefined

Looping

foreach

forvalues

Loops

Loops are a programmer’s tool to perform some task repeatedly over a set of items.

For example, this Stata loop:

foreach var of varlist x y z {

  summ `var’

  gen copy_`var’ = `var’

Is equivalent to running

summ x

gen copy_x = x

summ y

gen copy_y = y

summ z

gen copy_z = z

var

 is known as a Stata

macro variable

, a temporary

variable (not related to the data set) that can hold string

values

The word

var

 is arbitrary and can be any word you like

varlist

tells Stata that the strings following are

variable names

Loops

Loops are a programmer’s tool to perform some task repeatedly over a set of items.

For example, this Stata loop:

foreach var of varlist x y z {

  summ `var’

  gen copy_`var’ = `var’

st

 pass through loop:

summ x

gen copy_x = x

Initially

var

will be set equal to

x,

and then the

commands within

{}

 are run;  wherever

`var’

appears, replace with

Loops

Loops are a programmer’s tool to perform some task repeatedly over a set of items.

For example, this Stata loop:

foreach var of varlist x y z {

  summ `var’

  gen copy_`var’ = `var’

nd

 pass through loop:

summ y

gen copy_y = y

Then

var

becomes

y,

the commands within

{}

are run

 replace

`var’

 with

Loops

Loops are a programmer’s tool to perform some task repeatedly over a set of items.

For example, this Stata loop:

foreach var of varlist x y z {

  summ `var’

  gen copy_`var’ = `var’

rd

 pass through loop:

summ z

gen copy_z = z

Repeat for

z …

Stata loops

In general, Stata loops will consist of:

◦

A macro variable that sequentially takes on the

values of a set of items

◦

A set of commands within

{}

◦

Calls to the contents of the macro variable with

macro

’

, where

macro

 is the macro variable

name

◦

when the loop commands execute, `

macro

’ will be

replaced by its current contents

* ^ means “to the power”

forvalues i = 2/4 {

   gen test`i’ = test^`i’

   summ test`i’

*loop above is equivalent to

gen test2 = test^2

summ test2

gen test3 = test^3

summ test3

gen test4 = test*4

summ test4

foreach:

looping over variables

foreach

macro

 of varlist

varlist

commands

macro

: name of variable that takes on values in

varlist

varlist

: a list of variable names

commands

: Stata commands to run each time loop

iterates

◦

Use

`’

 with macro name to substitute contents of macro

Opening

 must be on the first line

Closing

 must be by itself on the last line

foreach

Can loop over many other kinds of lists

besides variables

* for more help with foreach loops

help foreach

*reverse-code three 1-7 Likert scale variables

* apply variable label

foreach var of varlist T1 T2 T3 {

  gen rev_`var’ = 8 - `var’

  label var rev_`var’ “Reverse-coded `var’”

*create copies of four variables

* where 4 is recoded to 3

foreach var of varlist X1 X2 X3 X4 {

  gen new_`var’ = `var’

  recode new_`var’ (4=3)

Use

forvalues

 to loop over numbers

foreach

macro

range

commands

macro

: name of variable that takes on values

in

varlist

range

: a range of numbers, e.g.

1/10

or

1(2)

commands

: Stata commands to run each

time loop iterates

Opening

 must be on the first line

Closing

 must be by itself on the last line

*create dummy variables for

* different age cutoffs

forvalues i = 40(10)60 {

  gen age`i’ = 0

  replace age`i’ = 1 if age >= `i’

  replace age`i’ = . if age == .

*equivalent to

gen age40 = 0

replace age40 = 1 if age >= 40

replace age40 = . if age == .

gen age50 = 0

replace age50 = 1 if age >= 50

replace age50 = . if age == .

gen age60 = 0

replace age60 = 1 if age >= 60

replace age60 = . if age == .

undefined

Processing data

by group

sort

by:

ysort:

varname[n]

Grouped data

Many datasets consist of grouped observations

◦

observations were sampled in clusters (e.g.

students sampled from schools)

◦

repeated measurements of the same individual

With grouped data, we often want to generate

variables and process data by groups

Examples:

◦

The mean SAT math score for each classroom of

students

◦

In a longitudinal study of depression scores, a

variable that records each person’s first (baseline)

value of depression

◦

Or a variable that represents depression from the previous

timepoint (lagged depression)

mean

SAT_math

by

Class_ID

The

by

 prefix

For processing by groups, we use the prefix

by

and a group variable

, which precede other

Stata commands

In general the syntax will be:

◦

by

varlist

stata_cmd

◦

varlist

 is one or more grouping variables by which the data are to be processed

◦

stata_cmd

is the Stata command to run on each group of data.

Data must be sorted before processing

by group

Data must be sorted by the grouping variable

before processing

by

 that variable

◦

Either use the

sort

 command on the grouping

before running any commands with

by…

◦

…or use the prefix

bysort (

instead of

by

),

which sorts the data by the grouping variable(s)

before processing by group

*2 steps, sort first; summ age by doctor

sort dobcid

by docid: summ age

* same as above in one step

bysort docid: summ age

Generating statistics by group

some

egen

 functions can be used with

by

to

generate statistics by group.

◦

including

mean

max

, and

sd

*mean, max, and standard dev of age

*  of patients within each doctor

by docid: egen mean_age = mean(age)

by docid: egen max_age = max(age)

by docid: egen sd_age = sd(age)

Longitudinal data (multiple rows per

subject)

Longitudinal designs repeatedly measure

subjects over time

If repeated measurements are recorded on

separate rows of data (i.e. long data), then

there will usually be both an ID variable and a

time variable

Data should generally be sorted by the ID

variable and then the time variable before by-

ID processing

◦

However, we usually only want to process by the

ID variable

◦

So,

sort

 first, then use

by:

(2 steps)

depress

at

time=0

lagged

depress

Specifying the value of a variable from a

particular observation

If we want to use the value of a variable from a

particular observation number, e.g.

observation 1, we can use this syntax:

varname

So,

math[3]

 is the value of the

math

variable from the third observation

If used with

by:

, then it is the

th value from

within each group

◦

by classid: math[3]

, third math value

within each class

*sort by id and time first

sort id time

*baseline (first) value of depression per id

by id: depress0 = depress[1]

System variables

_n

and

_N

System variables are created and updated by

Stata

_n

is the number of the current observation

_N

is the total number of observations

With

by

◦

_n

 is the current observation in a group

◦

_N

is the total number of observations in the

current group.

These can then be used to create variables for

longitudinal data

*last depression score per id

* allows for different number of timepoints

by id: gen depress_last = depress[_N]

*lagged depression

* first obs will be . per id

by id: gen depress_lag = depress[_n-1]

Summing within group

When used with generate, sum() creates a

running sum

◦

gen sumx = sum(x)

// running sum

of x

With a

by

-group specification, we get running

sums by group.

◦

The running sum may itself be a useful variable

◦

We can also pull the value from the last

observation (using

_N

) within each group to

create a total sum variable

*running sum of adverse life events (ale)

*  per id

by id: gen sum_ale = sum(ale)

*total number of adverse life events per id

by id: gen total_ale = sum_ale[_N]

undefined

References and

Further

Learning

References

Mitchell, Michael N. 2010.

Data Management Using Stata: A Practical Handbook

. Stata Press.

Stata YouTube channel

 – videos for both data management and data analysis made by Stata,

and a

list

 of links to their videos on their home site

Data management FAQ

on Stata home site

UCLA OARC Stata pages

 – our own pages on data management and data analysis

undefined

THANK YOU!

Slide Note

Embed Share

Download

This workshop conducted by UCLA's OARC covers essential Stata commands for preparing data sets for statistical analysis. It includes topics like inspecting and creating variables, handling missing data, merging datasets, and processing data efficiently. Participants will learn to use do-files, write code, and read syntax specifications to enhance data management skills.

keyaan Follow

Uploaded on Jul 10, 2024 | 6 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

UCLA OARC STATISTICAL METHODS AND DATA ANALYTICS STATA data management

Purpose of the workshop Data sets usually do not arrive in a state immediately ready for analysis Variables need to be cleaned E.g. missing data codes like -99 can invalidate analyses if left untouched Data errors should be corrected Variables need to be generated Unnecessary data should perhaps be dropped for efficiency Data sets may need to be merged together This workshop covers commonly used Stata commands to prepare data sets for statistical analysis

Topics Inspecting variables Creating variables Selecting observations and variables Missing data Date variables String variables Appending data sets Merging data sets Looping Processing data by groups

save Preliminaries load

Use do-files Write and run code in do-files, text files where Stata commands can be saved. A record of the commands used Can easily make adjustments to code later To run code from the do-file, highlight code and then Ctrl-D or click Execute(do) icon Comments: precede text with * or enclose text within /*and */ *not executed /* this is a comment */ You can also place comments on the same line as a command following // tab x // also a comment

Reading syntax specifications in this workshop Unitalicized words should be typed as is Italicized words should be replaced with the appropriate variable, command, or value For example: merge 1:1 varlist using filename varlist should be replaced with one or more variable names filename should be replaced with a file s name The rest should be typed as is varlist will be used throughout the workshop to mean one or more variable names

Workshop dataset This dataset contains fake hospital patient data, with patients as rows of data (no patient id, though) A mix of continuous and discrete numeric variables as well as string variables Some data errors and missing data codes in the data Each patient is also linked to a doctor (docid). Another dataset containing doctor variables will be merged into this dataset later in the seminar.

saveand use * save data as Stata data set, overwrite save data_clean.dta, replace We recommend that you save your data set under a different name after performing some data management on it Always good to have untouched raw, data set You may change your mind later about how variables should be generated/cleaned/etc * .dta automatically added so can omit it save data_clean, replace * load data_clean, clear memory first use data_clean, clear To quickly save and load your data in Stata, save as a Stata file (usually .dta) useloads Stata data files

browse Inspecting variables summarize tabulate codebook

browse Spreadsheet style window view browse or click on the spreadsheet and magnifying glass icon in the Stata toolbar Numeric variables appear black String variables appear red Numeric variables with value labels appear blue For large data sets, we need commands that allow us to quickly inspect variables

summarize summarize(abbreviated summ) provides summary statistics for numeric variables, which may be useful for data management: Number of non-missing observations *summary stats for variable y summ y *summary stats for all numeric variables summ Does this variable have the correct number of missing observations? Mean and standard deviation Are there any data errors or missing data codes that make the mean or standard deviation seem implausible? Minimum and maximum Are the min and max within the plausible range for this variable?

tabulate tabulate(abbreviated tab here) frequency tables (# observations per value) string or numeric variables ok but not continuous numeric variables *table of frequencies of race * display missing tab race, miss *2-way table of frequencies tab race gender Questions to ask with tab: Is this the number of categories I expect? Are the categories numbered and labeled as I expect? Do these frequencies look right? Important data management options miss -treat missing values as another category nolabel remove labels to inspect underlying numeric representation

codebook *detailed info about variables x and y codebook x y Detailed information about variables codebookprovides: For all variables Number of unique values Number of missing values Value labels, if applied For numeric variables Range quantiles, means and standard deviation for continuous variables Frequencies for discrete variables For string variables frequencies warnings about leading and trailing blanks *detailed info about all variables codebook

generate Creating variables help functions egen

generatevariables generate(abbreviated genor even g) is the basic command to create variables Often from other existing variables if the value of the input variable is missing, the generated value will be missing as well. *sum of tests * if any test=., testsum=. gen testsum = test1 + test2 + test3 *reverse-code a 1,2,3 variable gen revvar = 4-var We will be using genthroughout this seminar

Functions to use with generate *get table of function help pages help functions We can use functions to perform some operation on a variable to generate another variable *random number (0,1) for each obs gen x = runiform() Later sections of this seminar will take a focused look at the function groups Date and Time and String *running (cumulative) sum of x gen sumx = sum(x) *extract year from date variable gen yr_birth= year(date_birth) We will use many various functions from other groups as well *extract 1st 3 numbers of phone number gen areacode = substr(phone, 1, 3) Most of these functions accept no more than one variable as an input

egen, extended generation command egen(extended generate) creates variables with its own, exclusiveset of functions, which include: Many functions that accept multiple variables as arguments cut() function to create categorical variable from continuous variable group() function to create a new grouping variable that is the crossing of two grouping variables *mean of 3 y variables egen ymean = rowmean(y1 y2 y3) *how many missing values in y vars egen ymiss = rowmiss(y1 y2 y3) * how many of the y vars equal 2 or 3 egen y2or3 = anycount(y1 y2 y3), values(2 3) *age category variable: * 0-17, 18-59, 60-120, with value labels egen agecat = cut(age), at(0, 18, 60, 120) label *create agecat-by-race variable, with value labels egen agecat_race = group(agecat, race), label

if Selecting observations and variables <>=!~&| replace recode drop keep

if: selecting by condition *tab x for age > 60, include missing tab x if age > 60, miss select observations that meet a certain condition ifclause usually placed after the command specification, but before the comma that marks the beginning of the list of options.

Logical and relational operators *summary stats for y * for obs where insured not equal to 1 summ y if insured != 1 Relational Operators > greater than <= less than or equal *tab x for obs where * age < 60 and insured equal to 1 tab x if (age < 60) & (insured == 1) == equal != not equal ~= not equal Logical Operators & and | or ! not ~ not

replaceand if The replacecommand is used to replace the values of an existing variable with new values *binary variable coding whether pain is greater than 6 gen highpain = 0 replace highpain = 1 if pain > 6 Typically, replace is used with ifto replace the values in a subset of observation

Change variable coding with recode *recode (0,1,2)->3 and (6,7)->5 recode income_cat (0 1 2 = 3) (6 7 = 5) Variables are often not coded the way we want too many categories or with values out of order. With recode, we can handle all recodes for a variable in a single step.

keep and drop: filtering observations * drop observations where age < 18 drop if age < 18 Drop unneeded observations using: drop if exp * same thing as above keep if age >= 18 Where exp expresses some condition that if true, will cause an observation to be dropped Conversely, we can keep only observations that satisfy some condition with keep if exp

drop(or keep) variables * drop variables x y z drop x y z Unneeded variables can be dropped with: drop varlist * drop all variables in consecutive columns * from age-dob drop age-dob where varlistis a list of variables to drop See examples for some shortcuts to drop many variables at once * drop all variables that begin with pat drop pat* Or, if you need to drop most variables, you can keepa few * drop all variables but age keep age

. .a .b misstable summarize Missing Data mvdecode missing()

Missing data * overview of how missing values work in Stata and Stata commands for missing values help missing Missing values are very common in real data, and it is important for an analyst to be aware of missingness in the data set Hopefully, you know any missing data codes (e.g. -99) When reading in data from a text or Excel file, missing data can be represented by an empty field.

Missing values in Stata replace stringvar = if stringvar == -99 . is missing for numeric variables (also called sysmiss) * replace with sysmiss if -99 (skipped) replace numvar = . if numvar == -99 is missing for string variables (also called blank). * use .a for different missing data code -98 * (e.g. refused to answer) replace numvar = .a if numvar == -98 Additional missing data values are available .a through .z can be used to represent different types of missing (refusal, don t know, etc.). Missing values are very large numbers in Stata all non-missing numbers < . < .a < .b < < .z

misstable summarize: finding existing missing values misstable summarize produces a table of missing values (of all types) across a set of variables *table of missing values across all variables misstable summarize column Obs=. counts the number of missing values equal to . column Obs>. counts the number of missing values other than ., such as .a and .b column Obs<. and the entire right-hand section Obs<. address non-missing values

Detecting missing data codes *boxplot of variables graph box lungcapacity test1 test2 extreme numeric values are often used to represent missing values -99 or 999 Undetected, these missing data codes can be included as real data in statistical analysis Use summarizeand graph boxplot to look for missing data codes across many variables at once

Use mvdecodeto convert user-defined missing data codes to missing values *convert -99 to . for all variables mvdecode _all, mv(-99) We can quickly convert all user-defined missing codes to system missing values across all numeric variables with mvdecode. *convert -99 to . and -98 to .a mvdecode _all, mv(-99 =. \ -98=.a) Unfortunately, mvdecode will not work at all on string variables.

The missing()function The missing()function returns TRUE if the value is any one of Stata s missing values (., .a, .b, etc) *eligible if not missing for lungcapacity gen eligible = 0 replace eligible = 1 if !missing(lungcapacity)

Be careful with relational operators when working with missing values * we want hightest1 = 1 if test1 > 10 * but . > 10, so this is not right gen hightest1 = 0 replace hightest1 = 1 if test1 > 10 Missing values are very large numbers in Stata all non-missing numbers < . < .a < .b < < .z *now hightest1 will be . when test1 is . replace hightest1 = . if test1 == . Thus, (. > 50) results in TRUE When creating variables from other variables, make sure you know how you want to handle missing values

help datetime date() Date variables format year() month() day()

Dates as strings and numbers * Overview of how Stata handles dates help datetime In Stata we can store dates as strings January 2, 2021 1-2-2021 *Use codebook or describe to determine whether your date variable is a string or number codebook date_var However, dates should be represented numerically in Stata if the date dataare needed To create analysis variables For plotting *Alternatively, we can look at the variable in the browser: red=string, blue/black=number browse date_var Numeric dates (with day, month, and year data) in Stata are represented by the number of days since January 1, 1960 (a reference date) January 2, 2021 = 22,280

date(): converting string dates to numeric Use the date() function to convert string dates to numeric in Stata * create numeric version of date of birth * order is month, day, year gen newdob = date(dob, MDY ) The generic syntax, for example with gen, is: gen varname = date(stringdate, mask) stringdateis a variable maskis a code that specifies the order of the components of the date For day, month, year dates: maskis MDY if the order is month, day, year maskis DMY if the order is day, month, year

The date() function accepts dates in many formats *just display commands to show date() usage . di date("March 5, 2021", "MDY") 22344 String dates are recorded in many different formats, but fortunately, the date() function is flexible in what it accepts as inputs . di date("Mar 5, 2021", "MDY") 22344 . di date("3/5/2021", "MDY") 22344 . di date("3-5-2021", "MDY") 22344 *add 20 or 19 to the mask if year is 2-digit . di date("3-5-21", "MD20Y") 22344 . di date("3-5-21", "MD19Y") -14181

Formatting numeric dates After conversion using date(), the resulting variable will be filled with numbers representing dates, but can be hard to read directly as dates *apply date format to variable newdob format newdob %td Stata s format command controls how variables are displayed The format %td formats numbers as dates 22344will appear as 2mar2021after applying the format

Date arithmetic * length of stay gen los = discharge_date admit_date Once dates are stored as numeric variables in Stata, we can perform date arithmetic, for example, to find the number of days between two dates

Functions to extract components of dates *year of birth gen yob = year(dob) At times, we will need to extract one of the components of a date, such as the year *month of birth gen mob = month(dob) Each of these functions returns a number: year(): numeric year month(): 1 to 12 day(): day of the month

help string functions strtrim() String variables substr() + encode destring

Strings in Stata * z now missing if z was -99 replace z = if z== -99 Strings are just a series of characters Variables can generally be stored as either numeric or string String values are surrounded by quotes when specifying a Stata command String missing is Many estimation commands in Stata will not work with string variables.

String functions * help page for all string functions help string functions Stata provides a number of functions to clean, combine, and extract components of string variables We will examine a few in detail shortly Other useful functions: strlen(): length of strings strpos(): the position in a string where a substring is found strofreal(): convert number to string with a specified format

strtrim(): trimming white space * some of these categories should be combined tab hospital String variables often arrive messy, with unnecessary spaces at the beginning or end of the string May 3, 2001 * remove all leading and trailing whitespace replace hospital = strtrim(hospital) tab hospital Extra spaces can complicate string matching and extracting components tabwill treat UCLA and UCLA as separate categories strtrim() removes whitespace from the beginning and end of strings

substr(): extracting a substring substr() extracts a substring from a longer string *area code starts at 1, length=3 gen areacode = substr(phone, 1, 3) *extract last 5 characters as zip code gen zipcode = substr(address, -5, 5) substr(s, n1, n2) sis a string value or string variable n1is the starting position negative number counts from the end n2is the length of the string

+: concatenating strings *create full name variable: Last, First gen fullname = lastname + , + firstname Strings can be joined or concatenated together in a Stata command with a + String variables can be combined with string constants this way

String matching For matching strings exactly, we can use the == operator * for regular expression functions help regexm For more flexible pattern matching, Stata has a number of functions that use regular expressions e.g. to find strings that contain either US or USA and may or may not contain periods US U.S. USA U.S.A. See help regexm See help strmatch Regular expressions are beyond the scope of this workshop

Encoding strings into numeric variables Categorical variables are often initially coded as strings But most estimation commands require numeric variables * convert hospital to numeric, generate new variable encode hospital, gen(hospnum) encodeconverts string variables to numeric assigns a numeric value to each distinct string value applies value labels of the original string values Use the gen() option to create a new variable Or instead use replaceto overwrite the original string variable The ordering of the categories is alphabetical Remember, when browsing, string variables will appear red while numeric variables with labels will appear blue

Convert number variables stored as strings to numeric with destring Sometimes, variables with number values are loaded as strings into Stata. *convert string wbc to numeric and overwrite destring wbc, replace This can happen can a character value is mistakenly entered, or a non-numeric code (e.g. NA ) is used for missing. Or the variable was saved as a string in other software (e.g. Excel) We do not want to use encodehere, because the variable values are truly numbers rather than categories we would not want the 1.25 converted to a category. Instead, we can use destring, which directly translates numbers stored as strings into the numbers themselves.

Appending data sets append

Appending: adding observations We often wish to combine data sets that are split into multiple files that have the same variables (more or less) Data collected over several waves of time Data collected from different labs/sources ID DOB Weight Insured 101 3-3-1981 175 1 ID DOB Weight Insured Loaded in Stata 102 2-14-1975 213 0 101 3-3-1981 175 1 103 12-10-1990 198 1 102 2-14-1975 213 0 = 103 12-10-1990 198 1 + 201 4-29-1970 150 0 ID DOB Weight Insured 202 12-15-1963 254 0 On hard drive 201 4-29-1970 150 0 203 1-10-1962 199 1 202 12-15-1963 254 0 203 1-10-1962 199 1

Data Management Workshop: Stata Commands for Statistical Analysis

Download Presentation

Presentation Transcript

Related

More Related Content