Introduction to R Programming for Beginners: Data and Programming Basics

 
R programming for beginners
 
Data and Programming Basics
 
Jason Gullifer
 
 
 
Datasets for this section
 
Disney Vices
If you downloaded the workshop zip, the file is located at "docs/datasets/disney_vices.csv"
 
Or
 
"https://goo.gl/KGQ90a"
 
Topics to cover here
 
Data Checking
General
 
Data Frames
 
Referencing data
 
Factors
 
NA values
 
Changing Types
 
Optimal data formatting
 
"Tidy" data is easy to work with
Easy to aggregate and play around with (e.g., with dplyr)

Easy to visualize (e.g., with ggplot2)

Easy to model (e.g., with lme4)

Each column is a variable, each row is an observation

Tidy (one column per variable):
trial  subject  RT   condition
1      1        250  a
2      1        350  a
3      1        257  b
4      1        600  b
1      2        302  a
2      2        310  a
3      2        305  b
4      2        312  b

Not tidy (RT is repeated in a separate column for each condition):
trial  subject  RT   cond_a  cond_b
1      1        250  250
2      1        350  350
3      1        257          257
4      1        600          600
1      2        302  302
2      2        310  310
3      2        305          305
4      2        312          312

There are packages to fix untidy data (e.g., tidyr)
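A minimal sketch of going from the wide layout to the tidy one, using only base R. The data frame names (wide, tidy) and the NA placeholders for the empty cells are assumptions for illustration; tidyr offers more general tools for the same job.

```r
# Hypothetical "wide" version of the example table:
# RT is split across one column per condition, with NA in the empty cells.
wide <- data.frame(
  trial   = rep(1:4, times = 2),
  subject = rep(1:2, each = 4),
  cond_a  = c(250, 350, NA, NA, 302, 310, NA, NA),
  cond_b  = c(NA, NA, 257, 600, NA, NA, 305, 312)
)

# Collapse the two condition columns into one RT column plus a condition label.
in_a <- !is.na(wide$cond_a)
tidy <- data.frame(
  trial     = wide$trial,
  subject   = wide$subject,
  RT        = ifelse(in_a, wide$cond_a, wide$cond_b),
  condition = ifelse(in_a, "a", "b")
)

print(tidy)
```

Each row of tidy is now one observation, and condition is a single variable.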
 
Optimal data formatting
 
Things to look for
 
Data types are what you expect for important variables
E.g., reaction time data should be numeric
Categorical data should be factors

Data look orderly (columns are column-like), and the numbers of rows and columns are what you expect
 
A word on experiment design
 
Always good to over-collect / over-specify data
 
Add variables that might be important to input files
 
Easier than merging these data points back in when you
realize you need them
 
Checking your data
 
Add this to a script! Maybe vices.R or something
 
vices <- read.csv("docs/datasets/disney_vices.csv", stringsAsFactors = FALSE)
 
Checking the import went okay
 
Check the first few rows of your data
head(vices)
 
Check the data types
str(vices)
 
Do we see any problems?
Feel free to add these commands to your script. Comment them even!
 
head(vices) #print first few lines of a file
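To see what kind of problem str() can reveal, here is a small sketch on a made-up three-row stand-in for the workshop file. The rows and the name toy are hypothetical; only the column names and the "seventy-one" entry come from this workshop.

```r
# Hypothetical three-row stand-in for the workshop file; the values are made up.
csv_text <- "Movie,Year,Length_Minutes,Alcohol_Seconds
Movie A,1950,seventy-one,12
Movie B,1988,83,0
Movie C,1995,90,40"

toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

head(toy)  # print the first few rows
str(toy)   # print the type of each column

# One non-numeric entry is enough to make the whole column character.
sapply(toy, class)
```

A column that should be numeric but shows up as chr in str() is a sign that at least one entry could not be read as a number.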
 
Checking the dimensions of your
data
 
Check number of rows and columns
nrow(vices)
ncol(vices)
dim(vices)
 
Check the names of columns
colnames(vices)
 
After general checks
 
We should check specific pieces of our dataset
(generally our different columns)
 
So first we should learn to reference our data
 
Referencing your data
 
 
Referencing data
 
Typically your csv data will be in format:
Columns: measures / variables / etc.
Rows: observations
 
 
Referencing by index: [row, column] (mnemonic: "RC cola")
vices[1,] (first row)
vices[,1] (first column)
vices[1,1] (first row, first column)

Referencing by column name
Note: on import, spaces in column names become periods, and names that start with a number get an "X" prefix; also tab-completion!!
 
vices$Movie
 
vices$Length_Categorical
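A quick sketch of both referencing styles on a made-up data frame (toy and its values are hypothetical). The last line shows what make.names(), which read.csv() applies to column names by default, does to names that are not valid in R.

```r
# Hypothetical toy data frame; values are made up for illustration.
toy <- data.frame(
  Movie = c("Movie A", "Movie B", "Movie C"),
  Year  = c(1950, 1988, 1995),
  stringsAsFactors = FALSE
)

toy[1, ]    # first row (a one-row data frame)
toy[, 1]    # first column (a vector)
toy[1, 1]   # the single cell in row 1, column 1
toy$Movie   # the same column, referenced by name

# How column names that are not valid R names get rewritten on import:
make.names(c("Length Minutes", "1st Place"))
```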
 
Rosenbaum (2007)
 
Subsetting your data
 
Take a look at the dataset for years before 1988
vices[vices$Year < 1988, ]

Or from 1988 onward (1988 inclusive)
vices[vices$Year >= 1988, ]

How would we store these as new datasets?
oldest_vices <- vices[vices$Year < 1988, ]

Are there any movies that came out in 1988?
vices$Movie[vices$Year == 1988]
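The same pattern on a made-up data frame, as a sketch (toy, older, and newer are hypothetical names). The logical test goes before the comma because it picks rows; leaving the column slot empty keeps every column.

```r
# Hypothetical toy data frame; values are made up for illustration.
toy <- data.frame(
  Movie = c("Movie A", "Movie B", "Movie C", "Movie D"),
  Year  = c(1950, 1988, 1995, 1961),
  stringsAsFactors = FALSE
)

older <- toy[toy$Year < 1988, ]    # rows before 1988, all columns
newer <- toy[toy$Year >= 1988, ]   # rows from 1988 onward, all columns

# A single column, filtered by a condition on another column.
toy$Movie[toy$Year == 1988]
```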
 
For you:
 
Make a new dataset "disney", that includes only disney
movies
 
Make a new dataset "alcohol" that includes only movies
with > 0 alcohol use
 
Make a new dataset "tobacco" that includes only movies
with > 0 tobacco use
 
Make a new dataset "alc_tobac" that includes only movies
with > 0 alcohol AND > 0 tobacco use
& = and
| = or
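A small sketch of how & and | combine two conditions, on made-up numbers (toy, both, and either are hypothetical names). It is meant as a pattern to adapt, not as the exercise answer for the real dataset.

```r
# Hypothetical toy data frame; values are made up for illustration.
toy <- data.frame(
  Movie           = c("Movie A", "Movie B", "Movie C", "Movie D"),
  Alcohol_Seconds = c(0, 12, 40, 0),
  Tobacco_Seconds = c(0, 0, 25, 9),
  stringsAsFactors = FALSE
)

# & keeps rows where BOTH conditions are TRUE.
both <- toy[toy$Alcohol_Seconds > 0 & toy$Tobacco_Seconds > 0, ]

# | keeps rows where AT LEAST ONE condition is TRUE.
either <- toy[toy$Alcohol_Seconds > 0 | toy$Tobacco_Seconds > 0, ]

both$Movie
either$Movie
```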
 
 
Changing data
 
Once we can reference data, we can also change
that data by storing something new to it
 
Problematic observation in Length_Minutes
 
vices$Length_Minutes
 
At least two ways to reference and fix?
 
Changing data
 
At least two ways to reference
vices$Length_Minutes[vices$Length_Minutes == "seventy-one"]

vices[6,4]

How to fix?
vices$Length_Minutes[vices$Length_Minutes == "seventy-one"] <- "71"

But still a problem: the column is still character, not numeric
Make sure to save this to your script!!
Good stuff to have in your header after you load the data
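A sketch of the fix on a made-up column (toy and bad are hypothetical names). Note that indexing a data frame with a logical vector and no comma selects columns rather than rows, so the condition is applied to the column itself here.

```r
# Hypothetical toy data frame; values are made up for illustration.
toy <- data.frame(
  Movie          = c("Movie A", "Movie B", "Movie C"),
  Length_Minutes = c("seventy-one", "83", "90"),
  stringsAsFactors = FALSE
)

# Find the problematic entry, then overwrite only that entry.
bad <- toy$Length_Minutes == "seventy-one"
which(bad)                       # row number of the bad value
toy$Length_Minutes[bad] <- "71"

toy$Length_Minutes
class(toy$Length_Minutes)        # the column is still character
```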
 
Executing functions on parts of your data

We can run functions on pieces of our data
For example sum() or mean() on columns
mean(vices$Alcohol_Seconds)
median(vices$Alcohol_Seconds)

We can get distributions of values with table()

table(vices$Movie)

table(vices$Year)
Is everything what you expected it to be? Is anything missing?
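A sketch of these functions on made-up numbers (toy is a hypothetical name). The last two lines show what happens when a value is missing, which is worth knowing before trusting a mean.

```r
# Hypothetical toy data frame; values are made up for illustration.
toy <- data.frame(
  Year            = c(1950, 1988, 1988, 1995),
  Alcohol_Seconds = c(0, 12, 40, 8)
)

mean(toy$Alcohol_Seconds)     # arithmetic mean of the column
median(toy$Alcohol_Seconds)   # middle value of the column
table(toy$Year)               # how many rows there are for each year

# A single NA makes the mean NA unless we ask R to drop missing values.
with_na <- c(toy$Alcohol_Seconds, NA)
mean(with_na)
mean(with_na, na.rm = TRUE)
```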
 
 
 
Creating new variables/columns
 
Movies before 1988 are "Very Old" while those that came out in 1988 or later are simply "Old"
We know there are no "recent" movies here given the table()

vices$Age (doesn't exist yet)

vices$Age <- NA
vices$Age[vices$Year < 1988] <- "Very Old"
vices$Age[vices$Year >= 1988] <- "Old"
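The same steps on a made-up data frame, as a sketch (toy is a hypothetical name). Starting the column as NA is one possible choice: any row that neither condition covers stays visibly missing.

```r
# Hypothetical toy data frame; values are made up for illustration.
toy <- data.frame(Year = c(1950, 1988, 1995))

toy$Age <- NA                              # create the column, all missing
toy$Age[toy$Year < 1988]  <- "Very Old"    # fill rows before 1988
toy$Age[toy$Year >= 1988] <- "Old"         # fill rows from 1988 onward

toy
table(toy$Age)    # check how many rows got each label
```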
 
For you:
 
Combining subsetting and functions
Compute the mean Alcohol_Seconds for Disney movies (store it to a new variable)
Compute the mean Tobacco_Seconds for Disney movies (store it to a new variable)

Compute the mean Alcohol_Seconds for the second-most represented company in the dataset (store it to a new variable)
Compute the mean Tobacco_Seconds for the second-most represented company in the dataset (store it to a new variable)

Compute the mean Alcohol_Seconds for all movies besides Disney movies (store it to a new variable)
Compute the mean Tobacco_Seconds for all movies besides Disney movies (store it to a new variable)
(! is not)

Get the mean Length_Minutes of all movies
 
Changing data types
 
 
R has useful "as" functions
 
 
as.factor()
 
as.character()
 
as.numeric()
 
as.matrix()
 
as.data.frame()
 
as.matrix(vices)
Be really careful converting factors to numeric!!! as.numeric() on a factor returns the internal level codes, not the original values
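A short sketch of why the factor-to-numeric conversion needs care, on a made-up factor (f and the other names are hypothetical). Going through as.character() first recovers the values you see printed.

```r
# A hypothetical factor whose labels look like numbers.
f <- factor(c("71", "83", "90", "83"))

codes  <- as.numeric(f)                  # internal level codes, not the labels
values <- as.numeric(as.character(f))    # the numbers the labels spell out

codes
values

# Text that is not a number becomes NA (R also prints a warning).
not_a_number <- suppressWarnings(as.numeric("seventy-one"))
not_a_number
```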
 
Now we know
 
how to reference data
how to reference subsets of data
how to store new variables
how to change datatypes
 
How might we finally fix Length_Minutes?
Add this to your script
 
For you:
 
 
We might want to know which company puts out
movies with the most alcohol or tobacco usage
 
What are some steps we might take to get at this
question?
 
 
That was a lot of typing!
 
For you:
 
However, some companies put out movies that are, on average, longer than those of other companies.

Smoking and alcohol use durations could be confounded with the length of the movies

How to correct for this?
More typing? No, you saved everything in a script
Remember to make sure units are the same (lengths are in minutes, usage is in seconds)
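One possible approach, sketched on made-up numbers: express usage as a proportion of each movie's running time, after putting both in seconds. The names toy, Length_Seconds, and Alcohol_Prop are hypothetical and not part of the workshop dataset.

```r
# Hypothetical toy data frame; values are made up for illustration.
toy <- data.frame(
  Length_Minutes  = c(71, 83, 90),
  Alcohol_Seconds = c(0, 12, 40)
)

# Put both measures in the same unit (seconds) before dividing.
toy$Length_Seconds <- toy$Length_Minutes * 60

# Proportion of the running time that shows alcohol use.
toy$Alcohol_Prop <- toy$Alcohol_Seconds / toy$Length_Seconds

toy
```

Comparing means of a proportion like this, rather than raw seconds, keeps longer movies from looking worse simply because they are longer.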
 
Next up:
 
How to save yourself even more typing
Learn the fundamental concepts of R programming for beginners, covering topics such as data checking, changing types, working with data frames, factors, NA values, and optimal data formatting. Understand the importance of tidy data and experiment design, along with practical examples and tips to enhance your skills in R programming.


Uploaded on Sep 21, 2024

