Introduction to Data Manipulation in R with dplyr

Data manipulation in R: dplyr
EPID 799C
Wednesday, Sept. 27, 2017
Today's Outline
Key functions of dplyr
Review of coding for key variables in births dataset
Practice dplyr coding with births dataset
Key Functions
select()
Picks variables (columns) based on their names.
filter()
Picks observations (rows) based on their values.
arrange()
Changes the ordering of the rows based on their values.
summarise()
Reduces multiple values down to a single summary value.
mutate()
Adds new variables that are functions of existing variables.
group_by()
Performs data operations on groups that are defined by variables.
For more, see this great resource: 
Data Wrangling Cheat Sheet
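To make the verbs above concrete, here is a minimal sketch that calls each one on its own. The data frame toy and its values are hypothetical and are not the course births dataset; only the column names echo variables used later in these slides.

```r
library(dplyr)

# Hypothetical toy data; the real births dataset has many more rows and columns.
toy <- data.frame(
  mage    = c(24, 31, 19, 35, 28, 22),
  wksgest = c(39, 35, 40, 36, 38, 41),
  cigdur  = c("N", "Y", "N", "N", "Y", "N")
)

picked   <- select(toy, mage, wksgest)                         # keep two columns
early    <- filter(toy, wksgest < 37)                          # keep rows under 37 weeks
sorted   <- arrange(toy, mage)                                 # sort rows by maternal age
with_new <- mutate(toy, preterm = ifelse(wksgest < 37, 1, 0))  # add a new column
overall  <- summarise(toy, mean_age = mean(mage))              # collapse to one row
by_smoke <- summarise(group_by(toy, cigdur), births = n())     # one row per group
```

Each call takes a data frame as its first argument and returns a new data frame, which is what makes the verbs easy to chain.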
The pipe operator %>% enables you to pass the output from one function to the input of the next function.
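A small sketch of what the pipe buys you, assuming a hypothetical toy data frame (not the births dataset): the same two steps are written once as nested calls and once with %>%.

```r
library(dplyr)

# Hypothetical toy data for illustration only.
toy <- data.frame(
  mage    = c(24, 31, 19, 35, 28, 22),
  wksgest = c(39, 35, 40, 36, 38, 41)
)

# Nested calls: read from the inside out.
nested <- arrange(filter(toy, wksgest >= 37), mage)

# Piped calls: read top to bottom; each result becomes the
# first argument of the next function.
piped <- toy %>%
  filter(wksgest >= 37) %>%
  arrange(mage)

# Both approaches give the same result.
identical(nested, piped)
```

The piped version tends to be easier to read once more than two steps are involved.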
Basic structure of dplyr code
Dataset %>%
  Select rows or columns to manipulate %>%
  Arrange or group the data %>%
  Calculate statistics or new variables of interest
summary <- Dataset %>%
  Select rows or columns to manipulate %>%
  Arrange or group the data %>%
  Calculate statistics, new variables
Assigning the result with <- creates a new object named summary that stores the output.
Otherwise, the output is just printed in the console.
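The difference between printing and storing can be sketched as follows. The toy data frame is hypothetical, and the object is named age_summary here (an arbitrary choice) so that it is not confused with base R's summary() function.

```r
library(dplyr)

# Hypothetical toy data for illustration only.
toy <- data.frame(
  pnc5 = c(1, 1, 0, 0, 1),
  mage = c(24, 31, 19, 35, 28)
)

# Without assignment, the result is only printed to the console.
toy %>%
  group_by(pnc5) %>%
  summarise(mean_age = mean(mage))

# With assignment, the result is stored as an object for later use.
age_summary <- toy %>%
  group_by(pnc5) %>%
  summarise(mean_age = mean(mage))

# The stored object can be reused, for example to pull out one column.
age_summary$mean_age
```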
Manipulating the births dataset with dplyr
Key variables of interest in today’s examples
Preterm birth
Early prenatal care
Maternal age
Smoking during pregnancy
Race/ethnicity
County of residence
Review of coding for each variable
Preterm Birth
births$wksgest[births$wksgest == 99] <- NA            # recode 99 to missing
births$preterm <- ifelse(births$wksgest < 37, 1, 0)   # 1 = fewer than 37 weeks
births$preterm_f <- factor(births$preterm,
                           levels = c(1, 0),
                           labels = c("preterm", "term"))
Early Prenatal Care
births$mdif[births$mdif == 99] <- NA               # recode 99 to missing
births$pnc5 <- ifelse(births$mdif <= 5, 1, 0)      # 1 = mdif of 5 or less
births$pnc5_f <- factor(births$pnc5,
                        levels = c(1, 0),
                        labels = c("Early prenatal care",
                                   "No early care"))
Maternal Age
births$mage[births$mage==99] <- NA
Maternal Smoking during Pregnancy
births$cigdur[births$cigdur == "U"] <- NA             # recode "U" to missing
births$smoke <- ifelse(births$cigdur == "Y", 1, 0)    # 1 = cigdur is "Y"
births$smoke_f <- factor(births$smoke,
                         levels = c(1, 0),
                         labels = c("Smoker", "Nonsmoker"))
Format Helper
This CSV file contains labels for the levels of the race/ethnicity and county variables, which saves you from typing them out.
Save the file to your computer and then read it into RStudio:
formatter <- read.csv("birth-format-helper-2012.csv",
                      stringsAsFactors = FALSE)
Maternal Race/Ethnicity
births$race_f <- factor(births$mrace,
                        levels = 1:4,
                        labels = formatter[formatter$variable == "mrace", ]$recode,
                        ordered = TRUE)
births$raceeth <- ifelse(births$mrace == 1 & births$methnic == "N", "WnH",
                  ifelse(births$mrace == 1 & births$methnic == "Y", "WH",
                  ifelse(births$mrace == 2, "AA",
                  ifelse(births$mrace == 3, "AI/AN", "Other"))))
births$raceeth_f <- factor(births$raceeth,
                           levels = c("WnH", "AA", "WH", "AI/AN", "Other"))
County of Residence in NC
births$county <- factor(births$cores,
                        levels = formatter[formatter$variable == "cores", ]$code,
                        labels = formatter[formatter$variable == "cores", ]$recode,
                        ordered = TRUE)
Practice Problems using dplyr
1) Calculate the number of births by early prenatal care (received early care vs. no early care). Exclude the births with missing values for prenatal care or preterm.
Pseudo-code:
use the births dataset %>%
  exclude births with missing pnc5_f or preterm %>%
  group births by pnc5_f %>%
  summarize number of births in each group
The syntax for a summary statistic is to name the new variable and set it equal to the function of interest:
summarise(number = n())
Here, number is a name you choose. The function n() counts the number of observations and takes no arguments.
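One possible way to turn the pseudo-code for problem 1 into dplyr code is sketched below. The births data frame here is a hypothetical eight-row stand-in so the example runs on its own; with the real dataset you would skip the data.frame() step.

```r
library(dplyr)

# Hypothetical toy stand-in for the births dataset (NA marks a missing value).
births <- data.frame(
  pnc5    = c(1, 1, 0, NA, 1, 0, 1, 0),
  preterm = c(0, 1, 0, 0, NA, 1, 1, 0)
)
births$pnc5_f <- factor(births$pnc5, levels = c(1, 0),
                        labels = c("Early prenatal care", "No early care"))

counts <- births %>%
  filter(!is.na(pnc5_f), !is.na(preterm)) %>%  # drop rows missing either variable
  group_by(pnc5_f) %>%                         # one group per care category
  summarise(number = n())                      # count births in each group

counts
```

Separating conditions with a comma inside filter() keeps only the rows where all of them are true.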
2) Calculate the number of births and the average age of mothers in those same groups (received early care vs. no early care).
Pseudo-code:
use the births dataset %>%
  exclude births with missing pnc5_f or preterm %>%
  group births by pnc5_f %>%
  summarize number of births and average age in each group
Are there mothers with missing age? Within the mean function, include an argument to remove missing values prior to calculating the average age.
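A sketch of problem 2 under the same assumptions (a hypothetical toy births data frame, with one maternal age deliberately set to NA). It computes the mean both ways to show why na.rm = TRUE matters.

```r
library(dplyr)

# Hypothetical toy stand-in for the births dataset (NA marks a missing value).
births <- data.frame(
  pnc5    = c(1, 1, 0, NA, 1, 0, 1, 0),
  preterm = c(0, 1, 0, 0, NA, 1, 1, 0),
  mage    = c(24, 31, 19, 35, 28, NA, 22, 40)
)
births$pnc5_f <- factor(births$pnc5, levels = c(1, 0),
                        labels = c("Early prenatal care", "No early care"))

by_care <- births %>%
  filter(!is.na(pnc5_f), !is.na(preterm)) %>%
  group_by(pnc5_f) %>%
  summarise(
    number       = n(),
    mean_age_raw = mean(mage),                # NA if any age in the group is missing
    mean_age     = mean(mage, na.rm = TRUE)   # drops missing ages before averaging
  )

by_care
```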
3) In addition to the number of births and average age of mothers by early care, calculate the number and percentage of preterm births in these two groups.
Pseudo-code:
use the births dataset %>%
  exclude births with missing pnc5_f or preterm %>%
  group births by pnc5_f %>%
  summarize total births, average age, number of preterm, % preterm
Different ways to calculate % preterm:
1) Within summarise(): using the function mean(preterm) OR sum(preterm)/n()
2) Within mutate(): as a function of the variables you created in summarise() for numbers of preterm and total births
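Both ways of getting the percentage can be sketched side by side. As before, the births data frame is a hypothetical toy stand-in, and the column names total, n_preterm, pct_preterm, and pct_preterm2 are arbitrary choices.

```r
library(dplyr)

# Hypothetical toy stand-in for the births dataset (NA marks a missing value).
births <- data.frame(
  pnc5    = c(1, 1, 0, NA, 1, 0, 1, 0),
  preterm = c(0, 1, 0, 0, NA, 1, 1, 0),
  mage    = c(24, 31, 19, 35, 28, NA, 22, 40)
)
births$pnc5_f <- factor(births$pnc5, levels = c(1, 0),
                        labels = c("Early prenatal care", "No early care"))

result <- births %>%
  filter(!is.na(pnc5_f), !is.na(preterm)) %>%
  group_by(pnc5_f) %>%
  summarise(
    total       = n(),
    mean_age    = mean(mage, na.rm = TRUE),
    n_preterm   = sum(preterm),               # preterm is coded 0/1, so sum = count
    pct_preterm = 100 * mean(preterm)         # method 1: inside summarise()
  ) %>%
  mutate(pct_preterm2 = 100 * n_preterm / total)  # method 2: inside mutate()

result
```

The two percentage columns agree here because rows with missing preterm were filtered out first, so n() and the number of non-missing preterm values are the same.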
4) Building on the code you've already written, calculate the percentage of smokers in the same two groups (early care vs. no early care).
Pseudo-code:
use the births dataset %>%
  group births by pnc5_f %>%
  summarize total births, average age, # preterm, # smokers, % preterm, % smokers
You could try out different methods for calculating % smokers and % preterm:
within summarise (using built-in functions)
within mutate (using the new variables you've created)
**Note: should you use the numeric variables (smoke, preterm) or factor variables (smoke_f, preterm_f) for calculating these summary statistics?**
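A possible extension for problem 4, again on a hypothetical toy births data frame. This sketch assumes the filter step from the earlier problems is kept, and it uses the numeric 0/1 variables, since sum() and mean() need numbers; the factor versions are better suited to labelling and grouping.

```r
library(dplyr)

# Hypothetical toy stand-in for the births dataset (NA marks a missing value).
births <- data.frame(
  pnc5    = c(1, 1, 0, NA, 1, 0, 1, 0),
  preterm = c(0, 1, 0, 0, NA, 1, 1, 0),
  smoke   = c(0, 1, 0, 0, 1, NA, 0, 1),
  mage    = c(24, 31, 19, 35, 28, NA, 22, 40)
)
births$pnc5_f <- factor(births$pnc5, levels = c(1, 0),
                        labels = c("Early prenatal care", "No early care"))

result <- births %>%
  filter(!is.na(pnc5_f), !is.na(preterm)) %>%
  group_by(pnc5_f) %>%
  summarise(
    total       = n(),
    mean_age    = mean(mage, na.rm = TRUE),
    n_preterm   = sum(preterm),
    n_smoke     = sum(smoke, na.rm = TRUE),
    pct_preterm = 100 * mean(preterm),
    # Percentage among births with known smoking status only.
    pct_smoke   = 100 * mean(smoke, na.rm = TRUE)
  )

result
```

Because smoke still contains missing values, dividing n_smoke by total would use a different denominator than mean(smoke, na.rm = TRUE); which one is appropriate depends on how you want to treat unknown smoking status.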
5) Onto a new example: Calculate the total births in each maternal race/ethnicity group (WnH, AA, WH, AI/AN, Other). For each group, also calculate the prevalence of early care and the prevalence of preterm birth.
Pseudo-code:
use the births dataset %>%
  group births by raceeth_f %>%
  summarize the numbers of total births, births with early care, and preterm births in each group %>%
  calculate percentage of early care, percentage of preterm in each group
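The summarise-then-mutate pattern in this pseudo-code might look like the sketch below. The toy births data frame is hypothetical and has no missing values; with the real data you would need to filter out or otherwise handle missing pnc5 and preterm first.

```r
library(dplyr)

# Hypothetical toy stand-in for the births dataset (no missing values here).
births <- data.frame(
  raceeth = c("WnH", "WnH", "AA", "AA", "WH", "AI/AN", "Other", "WnH"),
  pnc5    = c(1, 1, 0, 1, 1, 0, 1, 0),
  preterm = c(0, 1, 1, 0, 0, 0, 0, 0)
)
births$raceeth_f <- factor(births$raceeth,
                           levels = c("WnH", "AA", "WH", "AI/AN", "Other"))

by_race <- births %>%
  group_by(raceeth_f) %>%
  summarise(
    total     = n(),
    n_early   = sum(pnc5),      # births with early care
    n_preterm = sum(preterm)    # preterm births
  ) %>%
  mutate(
    pct_early   = 100 * n_early / total,
    pct_preterm = 100 * n_preterm / total
  )

by_race
```

The rows come out in the order of the factor levels, which is one reason to group by raceeth_f rather than the character variable raceeth.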
6) Final example: Calculate the prevalence of early prenatal care and the prevalence of preterm birth by NC county of residence.
Pseudo-code:
use the births dataset %>%
  group births by county %>%
  summarize the number of births with early care, number of preterm births in each county %>%
  calculate prevalence of early care and preterm in each county
This output is large. How would you store it instead of just printing it to the console?
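One answer to the closing question is to assign the result to an object and then inspect only part of it. In this sketch the toy births data frame and the object name county_summary are hypothetical, and only three counties appear so the example stays small.

```r
library(dplyr)

# Hypothetical toy stand-in for the births dataset (no missing values here).
births <- data.frame(
  county  = factor(c("Orange", "Durham", "Wake", "Wake",
                     "Durham", "Orange", "Wake", "Wake")),
  pnc5    = c(1, 0, 1, 1, 1, 1, 0, 1),
  preterm = c(0, 1, 0, 0, 0, 1, 0, 1)
)

# Store the full table instead of printing it.
county_summary <- births %>%
  group_by(county) %>%
  summarise(
    total     = n(),
    n_early   = sum(pnc5),
    n_preterm = sum(preterm)
  ) %>%
  mutate(
    pct_early   = 100 * n_early / total,
    pct_preterm = 100 * n_preterm / total
  )

# Look at just the counties with the highest preterm prevalence.
county_summary %>%
  arrange(desc(pct_preterm)) %>%
  head(2)
```

Once stored, the table can be sorted, filtered, or plotted without recomputing it.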