Introduction to Data Manipulation in R with dplyr


Explore the essential functions of dplyr for data manipulation in R, focusing on key operations like selecting variables, filtering observations, rearranging rows, summarizing data, adding new variables, and grouping operations. Discover the basic structure of dplyr code to efficiently manipulate and analyze datasets.


Uploaded on Sep 17, 2024



Presentation Transcript


  1. Data manipulation in R: dplyr
     EPID 799C, Wednesday, Sept. 27, 2017

  2. Today's Outline
     Key functions of dplyr
     Review of coding for key variables in the births dataset
     Practice dplyr coding with the births dataset

  3. Key Functions
     select(): picks variables (columns) based on their names.
     filter(): picks observations (rows) based on their values.
     arrange(): changes the ordering of the rows based on their values.
     summarise(): reduces multiple values down to a single summary value.
     mutate(): adds new variables that are functions of existing variables.
     group_by(): defines groups by one or more variables so that later operations are performed within each group.
     For more, see this great resource: Data Wrangling Cheat Sheet


  5. Key Functions (continued)
     The pipe operator %>% passes the output of one function to the next function as its first argument.
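As a rough illustration of how the six verbs and the pipe fit together, here is a minimal sketch on a small made-up data frame. The object `toy` and its columns (`id`, `group`, `value`) are invented for this example and are not part of the course's births dataset.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(
  id    = 1:6,
  group = c("a", "a", "b", "b", "b", "c"),
  value = c(10, 20, 5, 15, 25, 30)
)

result <- toy %>%
  select(group, value) %>%                  # keep two columns
  filter(value >= 10) %>%                   # keep rows with value of at least 10
  mutate(value_half = value / 2) %>%        # add a derived column
  group_by(group) %>%                       # define groups
  summarise(count = n(),                    # one row per group
            mean_half = mean(value_half)) %>%
  arrange(desc(mean_half))                  # largest group mean first

print(result)
```

Each step receives the data frame produced by the step before it, so the chain reads top to bottom in the order the operations are applied.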

  6. Basic structure of dplyr code
     Dataset %>%
       Select rows or columns to manipulate %>%
       Arrange or group the data %>%
       Calculate statistics or new variables of interest

  7. Basic structure of dplyr code
     summary <- Dataset %>%
       Select rows or columns to manipulate %>%
       Arrange or group the data %>%
       Calculate statistics, new variables
     The assignment creates a new object named summary that stores the output. Otherwise, the output is just printed in the console.
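The difference between printing and storing can be sketched as follows. The data frame `toy` and the object name `group_means` are assumptions made for this example.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(group = c("a", "a", "b"), value = c(1, 3, 10))

# Without assignment, the result is printed and then discarded
toy %>%
  group_by(group) %>%
  summarise(mean_value = mean(value))

# With assignment, the result is stored and can be reused later
group_means <- toy %>%
  group_by(group) %>%
  summarise(mean_value = mean(value))

group_means$mean_value
```

The slide names the stored object `summary`. That works, but a different name such as `group_means` avoids confusion with base R's `summary()` function.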

  8. Manipulating the births dataset with dplyr
     Key variables of interest in today's examples:
     Preterm birth
     Early prenatal care
     Maternal age
     Smoking during pregnancy
     Race/ethnicity
     County of residence
     Review of coding for each variable

  9. Preterm Birth
     births$wksgest[births$wksgest == 99] <- NA
     births$preterm <- ifelse(births$wksgest < 37, 1, 0)
     births$preterm_f <- factor(births$preterm, levels = c(1, 0), labels = c("preterm", "term"))
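The same recoding could also be written in dplyr style with `mutate()`. This is one possible equivalent, not the approach the slides use. The data frame `births_toy` is invented here, and `na_if()` is dplyr's helper for turning a sentinel value into NA.

```r
library(dplyr)

# Toy gestational ages in weeks; 99 is the missing-value code
births_toy <- data.frame(wksgest = c(40, 36, 99, 38, 32))

births_toy <- births_toy %>%
  mutate(
    wksgest   = na_if(wksgest, 99),            # recode 99 to NA
    preterm   = ifelse(wksgest < 37, 1, 0),    # NA input stays NA
    preterm_f = factor(preterm, levels = c(1, 0),
                       labels = c("preterm", "term"))
  )

print(births_toy)
```

Later columns in a single `mutate()` call can use columns created earlier in the same call, which is why `preterm` sees the cleaned `wksgest`.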

  10. Early Prenatal Care
      births$mdif[births$mdif == 99] <- NA
      births$pnc5 <- ifelse(births$mdif <= 5, 1, 0)
      births$pnc5_f <- factor(births$pnc5, levels = c(1, 0), labels = c("Early prenatal care", "No early care"))

  11. Maternal Age
      births$mage[births$mage == 99] <- NA

  12. Maternal Smoking during Pregnancy
      births$cigdur[births$cigdur == "U"] <- NA
      births$smoke <- ifelse(births$cigdur == "Y", 1, 0)
      births$smoke_f <- factor(births$smoke, levels = c(1, 0), labels = c("Smoker", "Nonsmoker"))

  13. Format Helper
      A CSV file with labels for the levels of the race/ethnicity and county variables, to save you from typing them out.
      Save the file to your computer and then read it into RStudio:
      formatter <- read.csv("birth-format-helper-2012.csv", stringsAsFactors = FALSE)

  14. Maternal Race/Ethnicity
      births$race_f <- factor(births$mrace, levels = 1:4,
                              labels = formatter[formatter$variable == "mrace", ]$recode,
                              ordered = TRUE)
      births$raceeth <- ifelse(births$mrace == 1 & births$methnic == "N", "WnH",
                        ifelse(births$mrace == 1 & births$methnic == "Y", "WH",
                        ifelse(births$mrace == 2, "AA",
                        ifelse(births$mrace == 3, "AI/AN", "Other"))))
      births$raceeth_f <- factor(births$raceeth, levels = c("WnH", "AA", "WH", "AI/AN", "Other"))
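Nested `ifelse()` calls can be hard to read. A possible alternative, not used in the slides, is dplyr's `case_when()`, sketched here on an invented data frame `births_toy` with no missing values. One caveat: the two approaches treat missing values differently. Nested `ifelse()` can return NA when `mrace` or `methnic` is missing, while the final `TRUE ~ "Other"` branch below would label those rows "Other".

```r
library(dplyr)

# Toy codes, invented for illustration; no missing values
births_toy <- data.frame(
  mrace   = c(1, 1, 2, 3, 4, 1),
  methnic = c("N", "Y", "N", "N", "N", "U")
)

births_toy <- births_toy %>%
  mutate(
    raceeth = case_when(
      mrace == 1 & methnic == "N" ~ "WnH",
      mrace == 1 & methnic == "Y" ~ "WH",
      mrace == 2                  ~ "AA",
      mrace == 3                  ~ "AI/AN",
      TRUE                        ~ "Other"   # everything not matched above
    ),
    raceeth_f = factor(raceeth, levels = c("WnH", "AA", "WH", "AI/AN", "Other"))
  )

print(births_toy)
```

Conditions are checked in order and the first match wins, which mirrors the order of the nested `ifelse()` calls.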

  15. County of Residence in NC
      births$county <- factor(births$cores,
                              levels = formatter[formatter$variable == "cores", ]$code,
                              labels = formatter[formatter$variable == "cores", ]$recode,
                              ordered = TRUE)

  16. Practice Problems using dplyr
      1) Calculate the number of births by early prenatal care (received early care vs. no early care). Exclude the births with missing values for prenatal care or preterm.
      Pseudo-code:
        use the births dataset %>%
        exclude births with missing pnc5_f or preterm %>%
        group births by pnc5_f %>%
        summarize number of births in each group
      Syntax for summary statistics: name the new variable and set it equal to the function of interest, for example summarise(number = n()), where "number" is a name you choose.
      The function n() counts the number of observations in each group and takes no arguments.
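One way the pseudo-code for problem 1 might translate into dplyr is sketched below. The data frame `births_toy` is invented so the example runs on its own; with the course data you would start from `births` instead.

```r
library(dplyr)

care_levels <- c("Early prenatal care", "No early care")

# Toy records, invented for illustration; two rows have a missing value
births_toy <- data.frame(
  pnc5_f  = factor(c("Early prenatal care", "Early prenatal care",
                     "No early care", NA, "No early care",
                     "Early prenatal care"),
                   levels = care_levels),
  preterm = c(0, 1, 0, 1, NA, 0)
)

counts <- births_toy %>%
  filter(!is.na(pnc5_f), !is.na(preterm)) %>%  # drop rows missing either variable
  group_by(pnc5_f) %>%
  summarise(number = n())                      # count rows in each group

print(counts)
```

Separating conditions with a comma inside `filter()` keeps only the rows where all of them are TRUE.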

  17. Practice Problems using dplyr
      2) Calculate the number of births and the average age of mothers in those same groups (received early care vs. no early care).
      Pseudo-code:
        use the births dataset %>%
        exclude births with missing pnc5_f or preterm %>%
        group births by pnc5_f %>%
        summarize number of births, average age, in each group
      Are there mothers with missing age? Within the mean function, include an argument to remove missing values prior to calculating the average age.
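The effect of missing ages on `mean()` can be seen in a small sketch. The data frame `births_toy` and the column name `mean_age_raw` are invented for this example. `na.rm = TRUE` is the standard argument of `mean()` for dropping missing values.

```r
library(dplyr)

care_levels <- c("Early prenatal care", "No early care")

# Toy records, invented for illustration; one mother has a missing age
births_toy <- data.frame(
  pnc5_f  = factor(care_levels[c(1, 1, 2, 2, 1)], levels = care_levels),
  preterm = c(0, 1, 0, 1, 0),
  mage    = c(25, NA, 30, 20, 35)
)

age_summary <- births_toy %>%
  filter(!is.na(pnc5_f), !is.na(preterm)) %>%
  group_by(pnc5_f) %>%
  summarise(
    number       = n(),
    mean_age_raw = mean(mage),               # NA if any age in the group is missing
    mean_age     = mean(mage, na.rm = TRUE)  # missing ages dropped first
  )

print(age_summary)
```

Note that `number` still counts every birth in the group, including the one whose age is missing, so the mean is based on fewer records than the count suggests.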

  18. Practice Problems using dplyr
      3) In addition to the number of births and average age of mothers by early care, calculate the number and percentage of preterm births in these two groups.
      Pseudo-code:
        use the births dataset %>%
        exclude births with missing pnc5_f or preterm %>%
        group births by pnc5_f %>%
        summarize total births, average age, number of preterm, % preterm
      Different ways to calculate % preterm:
        1) Within summarise(): using the function mean(preterm) OR sum(preterm)/n()
        2) Within mutate(): as a function of the variables you created in summarise() for the number of preterm births and total births
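The two routes to the percentage of preterm births can be compared side by side. This is a sketch on invented data: `births_toy` and the column names `pct_in_summarise` and `pct_in_mutate` are assumptions made for the example.

```r
library(dplyr)

care_levels <- c("Early prenatal care", "No early care")

# Toy records, invented for illustration; no missing values
births_toy <- data.frame(
  pnc5_f  = factor(rep(care_levels, each = 4), levels = care_levels),
  preterm = c(0, 0, 0, 1,   1, 1, 0, 0)
)

preterm_summary <- births_toy %>%
  filter(!is.na(pnc5_f), !is.na(preterm)) %>%
  group_by(pnc5_f) %>%
  summarise(
    total            = n(),
    n_preterm        = sum(preterm),
    pct_in_summarise = 100 * mean(preterm)   # method 1: mean of a 0/1 variable
  ) %>%
  mutate(pct_in_mutate = 100 * n_preterm / total)  # method 2: from summary columns

print(preterm_summary)
```

Both methods agree here because the mean of a 0/1 variable is the proportion of ones, provided rows with missing `preterm` have already been excluded.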

  19. Practice Problems using dplyr
      4) Continuing to build on the code you've already written, calculate the percentage of smokers in the same two groups (early care vs. no early care).
      Pseudo-code:
        use the births dataset %>%
        group births by pnc5_f %>%
        summarize total births, average age, # preterm, # smokers, % preterm, % smokers
      You could try out different methods for calculating % smokers and % preterm:
        within summarise (using built-in functions)
        within mutate (using the new variables you've created)
      **Note: should you use the numeric variables (smoke, preterm) or factor variables (smoke_f, preterm_f) for calculating these summary statistics?**
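The numeric-versus-factor question can be explored with a short sketch. The data frame `births_toy` is invented, and the comparison `smoke_f == "Smoker"` is one possible way to get a proportion from a factor, not something the slides prescribe.

```r
library(dplyr)

care_levels <- c("Early prenatal care", "No early care")

# Toy records, invented for illustration; one smoking status is missing
births_toy <- data.frame(
  pnc5_f = factor(rep(care_levels, each = 3), levels = care_levels),
  smoke  = c(1, 0, NA,   0, 0, 1)
)
births_toy$smoke_f <- factor(births_toy$smoke, levels = c(1, 0),
                             labels = c("Smoker", "Nonsmoker"))

# mean() on a factor returns NA with a warning, so it cannot be used directly
factor_mean <- suppressWarnings(mean(births_toy$smoke_f))

smoke_summary <- births_toy %>%
  group_by(pnc5_f) %>%
  summarise(
    total       = n(),
    n_smokers   = sum(smoke, na.rm = TRUE),
    pct_smoke   = 100 * mean(smoke, na.rm = TRUE),               # numeric 0/1 variable
    pct_smoke_f = 100 * mean(smoke_f == "Smoker", na.rm = TRUE)  # factor via comparison
  )

print(smoke_summary)
```

With `na.rm = TRUE`, the percentages use only births with known smoking status as the denominator, which can differ from `total`.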

  20. Practice Problems using dplyr
      5) On to a new example: calculate the total births in each maternal race/ethnicity group (WnH, AA, WH, AI/AN, Other). For each group, also calculate the prevalence of early care and the prevalence of preterm birth.
      Pseudo-code:
        use the births dataset %>%
        group births by raceeth_f %>%
        summarize the numbers of total births, births with early care, and preterm births in each group %>%
        calculate percentage of early care, percentage of preterm in each group

  21. Practice Problems using dplyr
      6) Final example: calculate the prevalence of early prenatal care and the prevalence of preterm birth by NC county of residence.
      Pseudo-code:
        use the births dataset %>%
        group births by county %>%
        summarize the number of births with early care, number of preterm births in each county %>%
        calculate prevalence of early care and preterm in each county
      This output is large; how would you store it instead of just printing it to the console?
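One possible answer to the storage question is to assign the result to an object and, if needed, write it to a file. The sketch below uses an invented data frame `births_toy` with three county names and no missing values. The object name `county_summary` and the temporary file path are assumptions for the example.

```r
library(dplyr)

# Toy records, invented for illustration; no missing values
births_toy <- data.frame(
  county  = c("Orange", "Orange", "Durham", "Durham", "Wake"),
  pnc5    = c(1, 0, 1, 1, 1),
  preterm = c(0, 1, 0, 0, 1)
)

# Assigning the pipeline's result stores it instead of printing it
county_summary <- births_toy %>%
  group_by(county) %>%
  summarise(
    births    = n(),
    n_early   = sum(pnc5),
    n_preterm = sum(preterm)
  ) %>%
  mutate(
    prev_early   = n_early / births,
    prev_preterm = n_preterm / births
  )

head(county_summary)                       # inspect the first few rows only

out_file <- tempfile(fileext = ".csv")     # temporary path, for this example only
write.csv(county_summary, out_file, row.names = FALSE)
```

With real data containing missing values, the sums and denominators would need the same missing-value handling discussed in the earlier problems.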
