Introduction to Data Manipulation in R with dplyr

Data manipulation in R: dplyr

EPID 799C

Wednesday, Sept. 27, 2017

Today's Outline

• Key functions of dplyr
• Review of coding for key variables in births dataset
• Practice dplyr coding with births dataset

Key Functions

• select(): Picks variables (columns) based on their names.
• filter(): Picks observations (rows) based on their values.
• arrange(): Changes the ordering of the rows based on their values.
• summarise(): Reduces multiple values down to a single summary value.
• mutate(): Adds new variables that are functions of existing variables.
• group_by(): Performs data operations on groups that are defined by variables.

For more, see this great resource: Data Wrangling Cheat Sheet
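As a rough illustration, the six verbs can be tried one at a time on a small made-up data frame. The data frame `toy` and its values are hypothetical stand-ins, not the course's births dataset; only the dplyr functions themselves come from the slide.

```r
library(dplyr)

# Hypothetical toy data; not the course's births dataset
toy <- data.frame(
  id      = 1:6,
  mage    = c(24, 31, 19, 35, 28, 22),
  preterm = c(0, 1, 0, 0, 1, 0),
  county  = c("Orange", "Durham", "Orange", "Wake", "Durham", "Wake")
)

cols_kept <- select(toy, id, mage)                  # keep two columns
rows_kept <- filter(toy, mage >= 25)                # keep rows meeting a condition
sorted    <- arrange(toy, mage)                     # order rows by mage, ascending
overall   <- summarise(toy, mean_age = mean(mage))  # collapse to one summary row
with_flag <- mutate(toy, teen = mage < 20)          # add a derived column
by_county <- summarise(group_by(toy, county), births = n())  # one row per county
```

Each call takes a data frame as its first argument and returns a new data frame, leaving `toy` unchanged.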

The pipe operator %>% enables you to pass the output from one function to the input of the next function.
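A minimal sketch of what the pipe does, assuming a made-up data frame `toy` (hypothetical, for illustration only): the nested call and the piped version perform the same two steps, but the piped version reads from top to bottom.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- data.frame(
  mage    = c(24, 31, 19, 35, 28, 22),
  preterm = c(0, 1, 0, 0, 1, 0)
)

# Nested calls read inside-out: filter first, then summarise
nested <- summarise(filter(toy, mage >= 20), pct_preterm = 100 * mean(preterm))

# The same steps with the pipe read top to bottom
piped <- toy %>%
  filter(mage >= 20) %>%
  summarise(pct_preterm = 100 * mean(preterm))
```

In both versions the filtered data frame becomes the first argument of summarise().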


Basic structure of dplyr code

Dataset %>%
  Select rows or columns to manipulate %>%
  Arrange or group the data %>%
  Calculate statistics or new variables of interest

To keep the result, assign the pipeline to a name:

summary <-
  Dataset %>%
  Select rows or columns to manipulate %>%
  Arrange or group the data %>%
  Calculate statistics, new variables

This creates a new object named summary that stores the output. Otherwise, the output is just printed in the console.
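The difference between printing and storing might look like this on a made-up data frame. The names `toy`, `grp`, `value`, and `grp_summary` are hypothetical; the stored object is called `grp_summary` here rather than `summary` only to avoid confusion with base R's summary() function.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- data.frame(
  grp   = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)

# Without assignment, the result is only printed to the console
toy %>%
  group_by(grp) %>%
  summarise(mean_value = mean(value))

# With assignment, the result is stored and can be reused later
grp_summary <- toy %>%
  group_by(grp) %>%
  summarise(mean_value = mean(value))

grp_summary$mean_value
```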

Manipulating the births dataset with dplyr

• Key variables of interest in today's examples
  • Preterm birth
  • Early prenatal care
  • Maternal age
  • Smoking during pregnancy
  • Race/ethnicity
  • County of residence
• Review of coding for each variable

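Preterm Birth

# 99 is the missing-value code for weeks of gestation; recode it to NA
births$wksgest[births$wksgest == 99] <- NA
# 1 = fewer than 37 weeks, 0 = otherwise (NA stays NA)
births$preterm <- ifelse(births$wksgest < 37, 1, 0)
births$preterm_f <- factor(births$preterm,
                           levels = c(1, 0),
                           labels = c("preterm", "term"))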
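For comparison, the same three recoding steps could be written in dplyr style with mutate(). This is an alternative sketch, not the slide's method: it uses dplyr's na_if() and if_else() on a hypothetical data frame `toy` whose `wksgest` values are made up.

```r
library(dplyr)

# Hypothetical toy data; 99 stands in for a missing value as on the slide
toy <- data.frame(wksgest = c(39, 35, 99, 40, 36))

toy <- toy %>%
  mutate(
    wksgest   = na_if(wksgest, 99),            # 99 -> NA
    preterm   = if_else(wksgest < 37, 1, 0),   # NA stays NA
    preterm_f = factor(preterm,
                       levels = c(1, 0),
                       labels = c("preterm", "term"))
  )

table(toy$preterm_f, useNA = "ifany")
```

Inside one mutate() call, later expressions can use columns defined earlier in the same call, which is why `preterm` sees the already-recoded `wksgest`.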

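Early Prenatal Care

# 99 is the missing-value code; recode it to NA
births$mdif[births$mdif == 99] <- NA
# 1 = mdif of 5 or less, 0 = otherwise (NA stays NA)
births$pnc5 <- ifelse(births$mdif <= 5, 1, 0)
births$pnc5_f <- factor(births$pnc5,
                        levels = c(1, 0),
                        labels = c("Early prenatal care",
                                   "No early care"))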

Maternal Age

# 99 is the missing-value code; recode it to NA
births$mage[births$mage == 99] <- NA

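Maternal Smoking during Pregnancy

# "U" is treated as missing; recode it to NA
births$cigdur[births$cigdur == "U"] <- NA
# 1 = "Y", 0 = any other non-missing value
births$smoke <- ifelse(births$cigdur == "Y", 1, 0)
births$smoke_f <- factor(births$smoke,
                         levels = c(1, 0),
                         labels = c("Smoker", "Nonsmoker"))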

Format Helper

• CSV file with labels for the levels of the race/ethnicity and county variables to save you from typing them out
• Save the file to your computer and then read it into RStudio

formatter <- read.csv("birth-format-helper-2012.csv",
                      stringsAsFactors = FALSE)

Maternal Race/Ethnicity

births$race_f <- factor(births$mrace,
                        levels = 1:4,
                        labels = formatter[formatter$variable == "mrace", ]$recode,
                        ordered = TRUE)

# Combine race (mrace) and ethnicity (methnic) into one five-level variable
births$raceeth <- ifelse(births$mrace == 1 & births$methnic == "N", "WnH",
                  ifelse(births$mrace == 1 & births$methnic == "Y", "WH",
                  ifelse(births$mrace == 2, "AA",
                  ifelse(births$mrace == 3, "AI/AN", "Other"))))

births$raceeth_f <- factor(births$raceeth,
                           levels = c("WnH", "AA", "WH", "AI/AN", "Other"))
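The nested ifelse() calls could also be written with dplyr's case_when(), which some readers find easier to scan. This is a swapped-in alternative, not the slide's code, shown on a hypothetical data frame `toy`. One difference to be aware of: a row whose condition evaluates to NA falls through to the final TRUE branch here, whereas the nested ifelse() version would return NA for it.

```r
library(dplyr)

# Hypothetical toy rows using the same codes as the slide
toy <- data.frame(
  mrace   = c(1, 1, 2, 3, 4),
  methnic = c("N", "Y", "N", "N", "N")
)

toy <- toy %>%
  mutate(
    raceeth = case_when(
      mrace == 1 & methnic == "N" ~ "WnH",
      mrace == 1 & methnic == "Y" ~ "WH",
      mrace == 2                  ~ "AA",
      mrace == 3                  ~ "AI/AN",
      TRUE                        ~ "Other"   # everything not matched above
    ),
    raceeth_f = factor(raceeth,
                       levels = c("WnH", "AA", "WH", "AI/AN", "Other"))
  )
```

Conditions are checked in order, and the first one that is TRUE determines the value.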

County of Residence in NC

births$county <- factor(births$cores,
                        levels = formatter[formatter$variable == "cores", ]$code,
                        labels = formatter[formatter$variable == "cores", ]$recode,
                        ordered = TRUE)

Practice Problems using dplyr

1) Calculate the numbers of births by early prenatal care (received early care vs. no early care). Exclude the births with missing values for prenatal care or preterm.

Pseudo-code:

use the births dataset %>%
  exclude births with missing pnc5_f or preterm %>%
  group births by pnc5_f %>%
  summarize number of births in each group

The syntax for a summary statistic is to name the new variable and set it equal to the function of interest:

summarise(number = n())

Here number is a name you choose, and the function n() counts the number of observations (it takes no arguments).
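One way the pseudo-code might translate into dplyr, sketched on a made-up stand-in for the births data. The data frame `toy` and its values are hypothetical; the column names `pnc5_f` and `preterm` follow the slides.

```r
library(dplyr)

# Hypothetical toy stand-in for the births data
toy <- data.frame(
  pnc5_f  = factor(c("Early prenatal care", "No early care", NA,
                     "Early prenatal care", "Early prenatal care",
                     "No early care")),
  preterm = c(0, 1, 0, NA, 1, 0)
)

counts <- toy %>%
  filter(!is.na(pnc5_f), !is.na(preterm)) %>%  # drop rows missing either variable
  group_by(pnc5_f) %>%
  summarise(number = n())                       # count rows in each group
```

Separating conditions with a comma inside filter() keeps only rows where all of them are TRUE.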

2) Calculate the numbers of births and average age of mothers in those same groups (received early care vs. no early care).

Pseudo-code:

use the births dataset %>%
  exclude births with missing pnc5_f or preterm %>%
  group births by pnc5_f %>%
  summarize number of births, average age, in each group

Are there mothers with missing age? Within the mean function, include an argument to remove missing values prior to calculating the average age.
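The argument in question is `na.rm = TRUE`. A small sketch on hypothetical data (the data frame `toy` and its values are made up) shows the effect of including it or leaving it out:

```r
library(dplyr)

# Hypothetical toy data with one missing maternal age
toy <- data.frame(
  pnc5_f = c("Early prenatal care", "Early prenatal care",
             "No early care", "No early care"),
  mage   = c(30, NA, 22, 26)
)

ages <- toy %>%
  group_by(pnc5_f) %>%
  summarise(
    number      = n(),
    mean_age_na = mean(mage),                # NA if any age in the group is missing
    mean_age    = mean(mage, na.rm = TRUE)   # missing ages dropped before averaging
  )
```

Note that n() still counts the mother with the missing age, so the count and the mean can be based on different numbers of rows.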

3) In addition to the numbers of births and average age of mothers by early care, calculate the number and percentage of preterm births in these two groups.

Pseudo-code:

use the births dataset %>%
  exclude births with missing pnc5_f or preterm %>%
  group births by pnc5_f %>%
  summarize total births, average age, number of preterm, % preterm

Different ways to calculate % preterm:

1) Within summarise(): using the function mean(preterm) OR sum(preterm)/n()
2) Within mutate(): as a function of the variables you created in summarise() for numbers of preterm and total births
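Both approaches can be sketched side by side on made-up data. The data frame `toy`, its shortened group labels, and the column names `total`, `n_preterm`, `pct_summary`, and `pct_mutate` are hypothetical. The two results agree here because `preterm` has no missing values, as would be the case after the exclusion step in the pseudo-code.

```r
library(dplyr)

# Hypothetical toy data with no missing values
toy <- data.frame(
  pnc5_f  = c("Early", "Early", "Early", "Early", "None", "None"),
  preterm = c(0, 1, 0, 0, 1, 0)
)

result <- toy %>%
  group_by(pnc5_f) %>%
  summarise(
    total       = n(),
    n_preterm   = sum(preterm),
    pct_summary = 100 * mean(preterm)            # way 1: inside summarise()
  ) %>%
  mutate(pct_mutate = 100 * n_preterm / total)   # way 2: from the summary columns
```

The mean of a 0/1 variable is the proportion of 1s, which is why mean(preterm) gives the same answer as sum(preterm)/n().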

4) Continuing to build on the code you've already written, calculate the percentage of smokers in the same two groups (early care vs. no early care).

Pseudo-code:

use the births dataset %>%
  group births by pnc5_f %>%
  summarize total births, average age, # preterm, # smokers, % preterm, % smokers

You could try out different methods for calculating % smokers and % preterm:

• within summarise (using built-in functions)
• within mutate (using the new variables you've created)

**Note: should you use the numeric variables (smoke, preterm) or factor variables (smoke_f, preterm_f) for calculating these summary statistics?**
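One way to explore the question in the note is to try both kinds of variable on a tiny made-up example. The data frame `toy` is hypothetical; `smoke` and `smoke_f` are coded as on the earlier slide.

```r
library(dplyr)

# Hypothetical toy data coded like the slide's smoke and smoke_f
toy <- data.frame(smoke = c(1, 0, 0, 1, 0))
toy$smoke_f <- factor(toy$smoke,
                      levels = c(1, 0),
                      labels = c("Smoker", "Nonsmoker"))

pct <- toy %>%
  summarise(
    pct_numeric = 100 * mean(smoke),                # works directly on the 0/1 variable
    pct_factor  = 100 * mean(smoke_f == "Smoker")   # a factor needs a comparison first
  )

# mean() applied directly to a factor returns NA (with a warning)
factor_mean <- suppressWarnings(mean(toy$smoke_f))
```

The numeric 0/1 version is the more direct choice for sums and means, while the factor version is mainly useful for labels in output.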

5) On to a new example: Calculate the total births in each maternal race/ethnicity group (WnH, AA, WH, AI/AN, Other). For each group, also calculate the prevalence of early care and prevalence of preterm birth.

Pseudo-code:

use the births dataset %>%
  group births by raceeth_f %>%
  summarize the numbers of total births, births with early care, and preterm births in each group %>%
  calculate percentage of early care, percentage of preterm in each group
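A possible translation of this pseudo-code, sketched on hypothetical data: the data frame `toy`, its values, and the output column names are made up. It assumes the denominators should be all births in the group, so a birth with missing `pnc5` still counts in `total`; excluding such births first would be an equally reasonable choice.

```r
library(dplyr)

# Hypothetical toy data; AI/AN and Other have no rows here
toy <- data.frame(
  raceeth_f = factor(c("WnH", "WnH", "AA", "AA", "AA", "WH"),
                     levels = c("WnH", "AA", "WH", "AI/AN", "Other")),
  pnc5      = c(1, 1, 0, 1, NA, 1),
  preterm   = c(0, 1, 0, 0, 1, 0)
)

by_raceeth <- toy %>%
  group_by(raceeth_f) %>%
  summarise(
    total     = n(),
    n_early   = sum(pnc5, na.rm = TRUE),      # missing values are skipped in the sum
    n_preterm = sum(preterm, na.rm = TRUE)
  ) %>%
  mutate(
    pct_early   = 100 * n_early / total,
    pct_preterm = 100 * n_preterm / total
  )
```

Groups appear in the order of the factor levels, and levels with no rows are dropped by default.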

6) Final example: Calculate the prevalence of early prenatal care and prevalence of preterm birth by NC county of residence.

Pseudo-code:

use the births dataset %>%
  group births by county %>%
  summarize the number of births with early care, number of preterm births in each county %>%
  calculate prevalence of early care and preterm in each county

This output is large. How would you store it instead of just printing it to the console?
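One answer is to assign the pipeline to a name and then inspect only part of the stored result. The sketch below uses a hypothetical data frame `toy` with three made-up counties; `county_summary` and the output column names are also assumptions. It adds a total count per county because the prevalence calculation needs a denominator.

```r
library(dplyr)

# Hypothetical toy data with three counties
toy <- data.frame(
  county  = c("Wake", "Wake", "Durham", "Durham", "Durham", "Orange"),
  pnc5    = c(1, 0, 1, 1, 0, 1),
  preterm = c(0, 0, 1, 0, 0, 1)
)

# Assigning the result stores it instead of printing it
county_summary <- toy %>%
  group_by(county) %>%
  summarise(
    total     = n(),
    n_early   = sum(pnc5, na.rm = TRUE),
    n_preterm = sum(preterm, na.rm = TRUE)
  ) %>%
  mutate(
    prev_early   = n_early / total,
    prev_preterm = n_preterm / total
  ) %>%
  arrange(desc(prev_preterm))   # highest preterm prevalence first

# Look at just the first few rows of the stored table
head(county_summary, 3)
```

With the result stored, it can also be filtered, plotted, or written to a file later without rerunning the summary.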
