Exploring R Programming II: Types, Control, Functions - Fall 2018
Overview of types, control structures, and functions in an R programming course. Warmup activities include exploring datasets, using various functions to analyze data, and working with factors and dates. Introduces packages and factors, with questions and exercises for practice.
R Programming II: more on types, control, & functions EPID 799C Fall 2018
Overview: Warmup! Load and tour the .rdata and _sm.csv datasets (start with me); questions from last class; factors & dates; control: conditionals & loops; functions.
Warmup 1: Tour (full/narrow/recoded) Dataset
Using births_sm from the .rdata file, answer the questions below. Also note the resources folder! Use these functions (and others): dim() summary() table() hist() plot()
1. How many observations are there, and how many variables are in the (small) births dataset? (Hint: see HW1!)
2. What is the average maternal age (mage)?
3. Make a histogram of gestational age (WKSGEST). What are the minimum and maximum gestational ages?
4. How many mothers smoked (smoker_f)?
5. Make a scatterplot of maternal age versus gestational age.
Warmup 2: Tour (full/narrow/unrecoded) Dataset
Now use read.csv() to read births2012_sm.csv.
1. How many observations are there, and how many variables are in the births dataset?
2. How do the types of variables compare to those in births_sm? (Hint: we've got some recoding to do!)
3. What is the average maternal age (mage) now? How many mothers have the value 99?
4. How many mothers smoked (smoker_CIGDUR)?
5. Make a histogram of gestational age (WKSGEST). What are the minimum and maximum (non-99) gestational ages?
6. HW1: Similar questions, but with births2012.csv, the full, wide, unrecoded dataset. Note: much bigger!
Questions at end of last class: spaces; Boolean subsetting; project reminders; keystrokes for flipping / to \ on PC; a little group work today.
But 1st! Packages
Packages are extensions to base R. They contain additional functions, documentation for using them, and sample data. Packages are available from the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/available_packages_by_name.html
The tidyverse is a set of packages for data manipulation, exploration, and visualization. They share a common design and work in harmony. We'll be using it extensively.
install.packages('tidyverse') # only need to run once
library(tidyverse) # run once per R session to load it
# Can also directly reference unattached functions w/ ::
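To make the `::` remark concrete, here is a minimal sketch that needs no installation, because the stats and utils packages ship with R. The vector `vals` and the variable `has_tidyverse` are made up for illustration.

```r
# stats and utils ship with R, so this runs without installing anything.
vals <- c(2, 4, 4, 4, 5, 5, 7, 9)
stats::sd(vals)        # same function as sd(vals), with the package named explicitly
utils::head(vals, 3)   # first three values

# requireNamespace() checks whether a package is installed without attaching it.
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)
has_tidyverse
```

Naming the package with `::` is handy for one-off calls and for avoiding name clashes between packages.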
Factors & Dates! We've got the functions to make sense of these now!
Factors
Now we're ready: What is a factor? Let's find out!
Create one: roles = factor(c("student", "faculty", "staff"))
Find out: use str(), class(), levels(), attributes(), as.numeric(), typeof() on roles
How are factors different? Why are they here? Also see: ordered()
Stuck with factors? Check out the forcats package at http://forcats.tidyverse.org/ and this on factors: http://r4ds.had.co.nz/factors.html
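One way to run the exploration suggested on this slide is sketched below. The `roles` factor is the one from the slide; the `sizes` vector is a made-up example for ordered().

```r
roles <- factor(c("student", "faculty", "staff"))

class(roles)       # "factor"
typeof(roles)      # "integer": factors are stored as integer codes
levels(roles)      # levels are sorted alphabetically by default
as.numeric(roles)  # the underlying codes, one per element

# ordered() builds a factor whose levels have a meaningful order.
sizes <- ordered(c("low", "high", "mid"), levels = c("low", "mid", "high"))
sizes < "high"     # comparisons work on ordered factors
```

The key point is that a factor is an integer vector plus a lookup table of labels, which is why as.numeric() returns codes rather than the original text.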
More on Factors
Factors have lots of consequences: treatment in models (think: SAS CLASS statement), printing, etc. We're introducing them here, but really, spend 5 minutes (during class? ;) skimming forcats at some point. We'll use base R today.
More on Factors
my_factor = factor(c("a", "b", "c"))
my_factor[2]
my_factor[2] = "d" # Nope!! "d" is not an existing level, so this gives NA with a warning
str(my_factor)
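The failed assignment on this slide can be worked around in base R. The sketch below shows the failure and two common fixes; the variable names `bad`, `ok`, and `ok2` are illustrative only.

```r
my_factor <- factor(c("a", "b", "c"))

# Assigning a value that is not a level gives NA (with a warning).
bad <- my_factor
suppressWarnings(bad[2] <- "d")
is.na(bad[2])

# Option 1: extend the levels first, then assign.
ok <- my_factor
levels(ok) <- c(levels(ok), "d")
ok[2] <- "d"
ok

# Option 2: convert to character, edit, and convert back.
ok2 <- as.character(my_factor)
ok2[2] <- "d"
ok2 <- factor(ok2)
levels(ok2)   # "b" is gone because no element uses it any more
```

Option 1 keeps the unused level "b"; option 2 rebuilds the levels from the values that remain.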
More on Factors
sex_vals = c(1, 2, 1, 1, 2, 3)
sex_f = factor(sex_vals, levels = 1:2, labels=c("F", "M"), exclude = 3)
More on Factors
# cut is nice for binning continuous values
vals = 1:10
cut(vals, breaks = 3)
# ^ works, but prefer more control!
cut(vals, breaks = quantile(vals, seq(0,1,.25)))
# ^ getting more complicated!
cut(vals, breaks = c(0, 3, 6, 10))
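One detail of cut() that is easy to miss: intervals are open on the left by default, so a break placed exactly at the minimum drops that value. A minimal sketch, with made-up labels "low", "mid", and "high":

```r
vals <- 1:10
q_breaks <- quantile(vals, seq(0, 1, 0.25))

# Intervals are (lower, upper] by default, so the minimum value falls outside.
bins_default <- cut(vals, breaks = q_breaks)
sum(is.na(bins_default))

# include.lowest = TRUE closes the first interval so the minimum is kept.
bins_closed <- cut(vals, breaks = q_breaks, include.lowest = TRUE)
table(bins_closed, useNA = "ifany")

# Explicit breaks plus labels give the most control.
bins_named <- cut(vals, breaks = c(0, 3, 6, 10), labels = c("low", "mid", "high"))
table(bins_named)
```

Starting the explicit breaks at 0 rather than 1 is what keeps the value 1 inside the first bin in the last call.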
More on Factors
plot(births_sm$mage)
mage_f = cut(births_sm$mage, breaks = c(0,20,30,max(births_sm$mage, na.rm=TRUE)))
plot(mage_f); table(mage_f)
mage_f = cut(births_sm$mage, breaks = c(0,20,30,max(births_sm$mage, na.rm=TRUE)), labels = c("Under 20", "20-30", "30+"))
plot(mage_f); table(mage_f)
More on Factors: Later on
sex_df = data.frame(sex_vals = 1:2, sex_f = c("F", "M")) # strings became factors by default before R 4.0; newer versions need stringsAsFactors = TRUE
test_sf = data.frame(id = 1:10, sex_vals = sample(1:3, 10, replace = TRUE))
test_sf = merge(test_sf, sex_df, all.x = TRUE)
test_sf
# plus MANY other ways we'll learn.
Factors & Dates! We've got the functions to make sense of these now!
Raw Dates
What is a date? Two things really:
today_date = as.Date("2018/08/29")
typeof(today_date); class(today_date)
today_date_lt = as.POSIXlt("2018/08/29")
typeof(today_date_lt); class(today_date_lt)
(Try our other exploration functions too!)
But suggestion: futzing with dates can be a hassle. We'll use the lubridate package to make that easier, later. Skim this (<2 min): http://r4ds.had.co.nz/dates-and-times.html
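To see the "two things" behind base R dates, the sketch below peeks under the class attribute. It uses only base R; the time zone "UTC" is an assumption added so the POSIXlt result does not depend on the local machine.

```r
today_date <- as.Date("2018/08/29")

# A Date is a number of days since 1970-01-01 with a class attribute on top.
unclass(today_date)
as.numeric(as.Date("1970-01-02"))  # one day after the origin

# Arithmetic therefore works in days.
today_date + 7
as.numeric(as.Date("2018-12-31") - today_date)  # days left in 2018

# POSIXlt is a list of components (note: year counts from 1900, mon from 0).
lt <- as.POSIXlt("2018/08/29", tz = "UTC")
lt$year + 1900
lt$mon + 1
lt$mday
```

This is why typeof() reports "double" for a Date and "list" for a POSIXlt object.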
Lubridate
library(lubridate)
set.seed(1) # specific random seed for sample()
date_df = data.frame(year = rep(2018,5), month=sample(1:12, 5), day=sample(1:31, 5))
date_df$date_d = ymd(paste0(date_df$year, "/", date_df$month, "/", date_df$day))
date_df$date_d = make_date(date_df$year, date_df$month, date_df$day)
# ^ same
Lubridate
date_df$nextdate_d = date_df$date_d + 1 # assumes day
date_df$nextmonth_d = date_df$date_d + ddays(31)
date_df$nextmonth2_d = date_df$date_d
month(date_df$nextmonth2_d) = month(date_df$nextmonth2_d) + 1 # can roll over
date_df$wday = wday(date_df$date_d, label = TRUE)
plot(date_df$wday)
date_df
Control: Iteration & Conditionals
if() {}, else if() {}, else {}, ifelse(), for() {} and always, vectors
Control To do or not to do or do repeatedly. R has traditional control structures but also benefits from vectorization. Setting that aside, traditional control first:
Conditionals
Simple:
if (boolean_test) {
  # do stuff
}
Complex:
if (boolean_test) {
  # do stuff
} else if (test2) {
  # do stuff
} else {
  # do stuff
}
Conditionals
A conditional expression in R has the following form: if (condition) { expressions }
The condition is an expression that should produce a single logical value, and the expressions are only run if the result of the condition is TRUE. The curly braces are not necessary, but it is good practice to always include them; if the braces are omitted, only the first complete expression following the condition is run.
It is also possible to have an else clause: if (condition) { trueExpressions } else { falseExpressions }
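A small worked example of the if / else if / else form described above. The helper `describe_gestation()` is hypothetical, and it assumes the usual 37-week cutoff for preterm birth.

```r
# A hypothetical helper that classifies ONE gestational age in weeks.
describe_gestation <- function(weeks) {
  if (is.na(weeks)) {
    "missing"
  } else if (weeks < 37) {   # assumed cutoff for preterm
    "preterm"
  } else {
    "term"
  }
}

describe_gestation(35)
describe_gestation(40)
describe_gestation(NA)

# if() needs a single logical value; summarise a vector with any() or all() first.
weeks_vec <- c(35, 40, 38)
if (any(weeks_vec < 37)) {
  message("At least one preterm birth")
}
```

Checking for NA first matters: comparing NA with 37 yields NA, and if() stops with an error when its condition is NA.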
Conditionals
switch() is another useful tool if your if statements start to get very long. Also see the help file. It can accept numbers too.
f = function(x, y, op) {
  switch(op,
    plus = x + y,
    minus = x - y,
    times = x * y,
    divide = x / y,
    stop("Unknown op!")
  )
}
f(1, 2, "plus"); f(1, 2, "minus")
Conditionals
Common usage: iterate (coming up) through a vector, assigning a value, using the ifelse() function.
new_data = ifelse(boolean_vector, a, b)
This works element by element: it returns a where boolean_vector is TRUE and b where it isn't.
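As a sketch of ifelse() in the kind of recoding the warmups hint at, the example below treats 99 as a missing-value code. The `mage` values are made up; only the idea of a 99 code comes from the warmup questions.

```r
# Made-up values; 99 stands in for "missing", as in the unrecoded births data.
mage <- c(24, 99, 31, 18, 99)

# ifelse() tests every element and picks from the 2nd or 3rd argument.
mage_clean <- ifelse(mage == 99, NA, mage)
mage_clean

age_group <- ifelse(mage_clean >= 30, "30+", "under 30")
age_group                       # NA stays NA

mean(mage_clean, na.rm = TRUE)  # average without the 99s
```

No explicit loop is needed: ifelse() is vectorized, so it handles the whole column in one call.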
Speaking of Iteration
for (var.name in vector.of.values){
  # do stuff
}
while (condition.is.true){
  # do stuff
}
# But don't forget about vectorization! And infinite loops!
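A runnable version of the two loop templates, using made-up numbers. The while loop shows the piece that prevents an infinite loop: the variable tested in the condition changes inside the body.

```r
# for: visit each value of a vector in turn.
total <- 0
for (value in c(2, 4, 6)) {
  total <- total + value
}
total

# while: repeat until the condition becomes FALSE.
# n changes inside the loop; without that, the loop would never end.
n <- 1
steps <- 0
while (n < 100) {
  n <- n * 2
  steps <- steps + 1
}
c(n = n, steps = steps)

# Vectorized equivalent of the for loop above.
sum(c(2, 4, 6))
```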
Iteration: Examples
for (i in 1:nrow(births_sm)){
  # ^ seq_along is technically safer; more later
  cat("Record", i, "is a birth with", births_sm$plur[i], "baby\n")
}
cols_to_use = c("wksgest", "weeknum", "sex_f")
for (this_name in cols_to_use){
  print(summary(births_sm[,this_name]))
}
# ^ note I'm iterating over a character vector!
# ^ iteration (and functions) are by default quiet.
# print/cat sidenote
Iteration: Examples
# But don't forget about vectorization!
births$older_mage = NA
for (i in 1:nrow(births)){
  births$older_mage[i] = births$mage[i] + 1
}
# ...and don't forget about the index! ALSO: seq_along()
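The loop on this slide and its vectorized equivalent can be compared side by side. The sketch uses a small stand-in data frame, `births_demo`, with made-up values rather than the course dataset, and also shows why seq_along() and seq_len() are safer than 1:n.

```r
# A small stand-in data frame (made-up values, not the course dataset).
births_demo <- data.frame(mage = c(24, 31, 18))

# Loop version, with seq_len() instead of 1:nrow().
births_demo$older_loop <- NA
for (i in seq_len(nrow(births_demo))) {
  births_demo$older_loop[i] <- births_demo$mage[i] + 1
}

# Vectorized version: one line, same result.
births_demo$older_vec <- births_demo$mage + 1
identical(births_demo$older_loop, births_demo$older_vec)

# Why 1:n is risky: with nothing to loop over, it counts down from 1 to 0.
empty <- numeric(0)
1:length(empty)    # two values, so a loop body would run twice
seq_along(empty)   # zero values, so a loop body would not run at all
```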
Sidenote: Functionals
Functions that take functions.
apply family: lapply(), sapply(), vapply()
purrr package: map(), mapn()
We are not there yet, and this is admittedly tough, but I want you to keep this possibility in mind!
sapply(births_sm, is.numeric) # base
purrr::map_lgl(births_sm, is.numeric)
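A base-R-only sketch of the apply family, run on the built-in iris data so it works without the course files. The names `col_classes` and `col_means` are illustrative.

```r
# iris ships with R: four numeric columns and one factor.
sapply(iris, is.numeric)     # named logical vector, one entry per column
sapply(iris[1:4], mean)      # apply mean() to each numeric column

# lapply() always returns a list.
col_classes <- lapply(iris, class)

# vapply() makes you declare the expected result type and length up front.
col_means <- vapply(iris[1:4], mean, FUN.VALUE = numeric(1))
col_means
```

vapply() is the strictest of the three: it stops with an error if any result does not match FUN.VALUE, which makes it a safer choice inside functions.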
More Functions
Trying to introduce in context, but some of this we just have to hit!
Already: str, class, summary, head, tail, setwd, getwd, View, plot, dim, nrow/ncol, sd, hist, table, typeof, sum, and more!
Today: print, cat, switch, ifelse, read.csv, write.csv, sample, rep, seq, tolower
We Try: A bit more on functions
After using read.csv() to get the .csv data in, subset it to the head, and write.csv() that out.
What is iris? (see data())
Save data.frame iris to iris_lower, and change all the variable names to lower case (so we don't have to memorize the capitalization!)
Functions: Write your own. Often *super* useful.
We Try: The best way to learn functions is to write our own
hello_world = function(this_name = ""){
  paste0("Hello World! It's ",this_name,"!")
}
greeting = hello_world("Mike")
# ^ note: will return last value if return() is left out!
my_first_function(1, 2) # calling it like this, or
my_first_function(param2 = 2, param = 1) # this, or
my_first_function(1) # this
R functions are scoped (e.g. variables created inside don't exist outside). Arguments behave as if passed by value, but R copies lazily (smart: no new copy of what's passed in is made unless it is changed inside the function).
Let's write get_older together.
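The slide names get_older but does not show it, so the version below is only a guess at what such a function might look like; its arguments and body are assumptions. The second half illustrates the scoping and copying behaviour described above with a hypothetical `add_column()` helper.

```r
# A guess at get_older(): the name comes from the slide, the body is assumed.
get_older <- function(age, years = 1) {
  older <- age + years   # 'older' exists only inside the function
  older                  # the last value is returned, no return() needed
}

get_older(30)                   # uses the default years = 1
get_older(30, 5)                # positional arguments
get_older(years = 5, age = 30)  # named arguments, in any order
get_older(c(20, 30, 40))        # works on a whole vector at once

# Changes made to an argument inside a function do not leak out.
add_column <- function(df) {
  df$flag <- TRUE
  df
}
original <- data.frame(id = 1:3)
changed <- add_column(original)
names(original)   # unchanged
names(changed)    # has the extra column
```

To keep a change made inside a function, assign the returned value, as `changed` does here.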
Functions: Operators? C'mon, I thought we were doing functions!
Reminder: Operators are functions!
`+`(1, 2)
`%in%`(1:4, c(2,3))
So are assignment, indexing and (technically) function calls themselves. Meaning, hey, you can easily define your own binary operators. More often we'll define our own functions, but important to know: EVERYTHING is a function / object, and can be passed around.
`%add_wrong%` <- function(a, b) {a + b+1}
a %add_wrong% b
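A short sketch of both ideas on this slide: operators called in function form, and a user-defined infix operator. The names `%not_in%` and `apply_twice` are hypothetical, chosen for illustration.

```r
# Calling operators in function form.
`+`(1, 2)
`[`(c(10, 20, 30), 2)      # same as c(10, 20, 30)[2]

# A hypothetical infix operator: the inverse of %in%.
`%not_in%` <- function(x, table) {
  !(x %in% table)
}
1:4 %not_in% c(2, 3)

# Functions are objects, so they can be passed as arguments.
apply_twice <- function(f, x) f(f(x))
apply_twice(sqrt, 16)
```

Any function whose name starts and ends with `%` can be used between two arguments this way.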
Functions: classes and functions. You'll never need to know this until you do.
Classes and functions
How do functions like dim() or plot() know how to handle all these things? Technically, they're generics: they dispatch to class-specific methods like dim.data.frame or plot.factor, chosen based on the class() of the object. Look up help for plot.factor and dim.data.frame.
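The dispatch mechanism can be tried with a tiny generic of our own. Everything named `describe` below is hypothetical and exists only to show how UseMethod() picks a method from the class of its first argument.

```r
# A hypothetical generic with two methods and a fallback.
describe <- function(x, ...) {
  UseMethod("describe")
}
describe.factor <- function(x, ...) {
  paste("factor with", nlevels(x), "levels")
}
describe.data.frame <- function(x, ...) {
  paste("data frame with", nrow(x), "rows and", ncol(x), "columns")
}
describe.default <- function(x, ...) {
  paste("object of class", class(x)[1])
}

describe(factor(c("a", "b", "a")))
describe(iris)
describe(1:3)

# Calling a method directly gives the same answer as going through the generic.
identical(dim(iris), dim.data.frame(iris))
```

The naming pattern generic.class is what ties a method to its generic, which is how dim() ends up running dim.data.frame() on a data frame.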
Ready for Reality! Introducing the class dataset
Putting it all together
We have basic data types to hold our vectorized, atomic data. We have a growing wealth of functions to operate on them, usually on a whole vector (think "column") at once. We can write our own if we need to. We have powerful subsetting (see L2 for the full rundown) to select, rewrite, extract and perform other actions on slices of our data.
NC Birth Data
The small dataset contains (N) columns of (M) rows of data. Check the documentation for what these values really mean: Mdif, visits, wksgest, mrace, cores, bfed. Again, the overall question (in more detail later): does prenatal care reduce preterm birth?
You try: Tour the Real Dataset!
Now use read.csv() to read births2012_sm.csv.
1. How many observations are there, and how many variables are in the (full/wide) births dataset?
2. What is the average maternal age (mage) now? How many mothers have the value 99?
3. How many mothers smoked (smoker_CIGDUR)?
4. Make a histogram of gestational age (WKSGEST). What are the minimum and maximum (non-99) gestational ages?
Next Class: We start to recode! It'll feel like real work!