Managing Data Sets in R - Learn, Analyze, Visualize
Dive into managing data sets in R with Nicole Lama. Discover how to open, manipulate, and analyze large data sets, understand big data vs. large data sets, and overcome challenges when working with R. Explore the basics of visualizing data and utilizing R for statistical analysis using various file types.
Training Up: Managing Data Sets in R Nicole Lama
Goals
Open large data sets of various types into R
Manipulate/access data in R
Learn how to do basic statistical analysis
Learn the basics of visualizing the data
Big Data vs. Large Data Set
Big data: a data set that will not easily fit into the available RAM of a system
Large data set: data that is generally cumbersome to work with because of its size
Data Size
Medium data sets: < 2 GB
Large data sets: ~2-10 GB
Big data: > 10 GB; requires distributed, large-scale computing
R Cons
Uses only one core by default
Reads all data into memory rather than streaming it on demand
Can slow down your computer
Why R for Big Data?
R has libraries for just about any statistical analysis possible
R is open source (woo hoo! Free software!)
File Types
.tsv: tab separated
.csv: comma separated
.txt: generic text file, could be separated by spaces
.json: JavaScript Object Notation
Opening File Types
.tsv: read.table("myFile.tsv", sep = "\t")
.csv: read.csv("myFile.csv")
.txt: read.table("myFile.txt") (the default sep = "" splits on any whitespace)
.json: fromJSON(file = "myFile.json") (requires the rjson library)
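As a minimal, self-contained sketch of the calls above (the file contents and column names here are made up for illustration), you can round-trip a small CSV through a temp file:

```r
# Write a tiny CSV to a temp file, then read it back with read.csv().
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), tmp, row.names = FALSE)

df <- read.csv(tmp, stringsAsFactors = FALSE)
str(df)  # a data frame with 3 rows and 2 columns

# The same pattern for the other formats on the slide:
# df <- read.table("myFile.tsv", sep = "\t", header = TRUE)   # tab separated
# df <- read.table("myFile.txt", header = TRUE)               # any whitespace
# library(rjson); obj <- fromJSON(file = "myFile.json")       # JSON
```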
Opening File Types
UH OH! My data is:
Throwing an out-of-memory error*
Taking forever to load!
Taking forever to run an analysis on!
Issue: You have too much data
*Out-of-memory errors (e.g. "cannot allocate vector of size ...") can also mean the values your script/function works with are too large. Use significant values, use a different function, or transform your values before proceeding
Too Much Data
Some easy fixes:
1. Down-sample your data with sample()
2. Select only the data that is relevant to your analysis (maybe you only need two columns for your analysis)
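A short sketch of both fixes (the data frame and column names are invented for the example):

```r
# Down-sample a large data frame to 10% of its rows with sample().
set.seed(42)                                  # reproducible sampling
big <- data.frame(id = 1:10000, value = rnorm(10000))

keep  <- sample(nrow(big), size = nrow(big) * 0.10)
small <- big[keep, ]
nrow(small)                                   # 1000

# Or keep only the columns the analysis actually needs:
two_cols <- big[, c("id", "value")]
```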
Too Much Data
Some easy fixes:
1. Use the data.table package: like data.frame, but more efficient
2. Split your data into chunks: just read in, say, 200 MB of data at a time
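Chunked reading can be sketched in base R with an open connection, reading a fixed number of rows per pass (the file, chunk size, and column name below are illustrative):

```r
# Write a 1000-row CSV, then read it back 300 rows at a time.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1000), tmp, row.names = FALSE)

con <- file(tmp, "r")
invisible(readLines(con, n = 1))   # consume the header line once
chunk_size <- 300
total <- 0
repeat {
  chunk <- read.csv(con, header = FALSE, nrows = chunk_size,
                    col.names = "x")
  total <- total + nrow(chunk)     # process each chunk here
  if (nrow(chunk) < chunk_size) break   # short chunk means end of file
}
close(con)
total                              # 1000
```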
Working with Data Sets < 2 GB
Pre-define column classes and specify the number of rows:
bigfile.sample <- read.csv("data/SAT_Results2014.csv", stringsAsFactors = FALSE, header = TRUE, nrows = 20)
bigfile.colclass <- sapply(bigfile.sample, class)
bigfile.raw <- tbl_df(read.csv("data/SAT_Results2014.csv", stringsAsFactors = FALSE, header = TRUE, nrows = 10000, colClasses = bigfile.colclass, comment.char = ""))
(tbl_df() comes from the dplyr package)
https://rpubs.com/msundar/large_data_analysis
Working with Data Sets ~2-10 GB
Use the bigmemory library (biganalytics and bigtabulate for manipulation)
Find more computational power, especially if the file size is approaching 10 GB or more
https://rpubs.com/msundar/large_data_analysis
Still Too Slow?
1. Profile (time your functions) with system.time(func()); tweak the function or its arguments until the run time improves
2. Byte-compile a hot function with compiler::cmpfun(); each function is compiled only once
Nope, Still Slow
1. Use parallelism with the doMC package (in my opinion, not worth it with fewer than 4 processors)
2. Use a supercomputer: Longleaf, Dogwood, or other private clusters through Amazon, IBM, Google, NASA
Bottom Line
You should always try to read your data in dynamically if possible
If you use R on Windows, memory.size() reports current memory use and memory.limit() adjusts the allocation
On 32-bit R, the practical maximum is 2-3 GB
You can never have more than (2^31)-1 (2,147,483,647) rows or columns
But Most Importantly
We still don't know the proper way to handle large data sets, and it is a hot topic in research
https://underthecblog.org/2014/09/16/big-data-big-problems/
Back to Opening Files
After opening your file (.tsv, .csv, .txt, etc.) in R, it is usually stored as a data frame
A data frame is a special object in R that organizes data by row and column and is easily manipulated
Many functions in R require data to be stored as a data frame
*Check whether something is a data frame with is.data.frame(<obj>)
Cleaning Up Your Data
Remove missing values: na.omit()
Exclude missing values from an analysis with the na.rm = TRUE argument
Check for missing values with is.na(); is.nan() catches only NaN
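A compact sketch of the three tools above on made-up data:

```r
# NA marks missing data; NaN marks undefined numeric results (e.g. 0/0).
x <- c(1, 2, NA, 4, NaN)

is.na(x)                 # TRUE for both NA and NaN
is.nan(x)                # TRUE only for NaN

mean(x)                  # NA: missing values propagate by default
mean(x, na.rm = TRUE)    # 2.333...; NA and NaN are dropped first

df <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))
clean <- na.omit(df)     # drops every row containing a missing value
nrow(clean)              # 1
```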
Adding to a Data.Frame in R
rbind(df, list(information)) # add a row
cbind(df, colName = c(information)) # add a column
*df = data.frame
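For example (the data frame and column names are invented for illustration):

```r
df <- data.frame(name = c("a", "b"), score = c(90, 85))

df <- rbind(df, list(name = "c", score = 78))   # append one row
df <- cbind(df, passed = df$score >= 80)        # append a new column

nrow(df)   # 3
ncol(df)   # 3
```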
Removing from a Data.Frame in R
df$colName <- NULL # remove a column
df <- df[-1, ] # delete rows by reassignment
*df = data.frame
Accessing a Data.Frame in R
df["colName"] # pulls out the column called colName (as a data frame)
df$colName # pulls out the column as a vector
df[[3]] # pulls out the 3rd column as a vector
df[1, ] # accesses a row
*df = data.frame
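The accessors above, run on a small invented data frame:

```r
df <- data.frame(id = 1:3, score = c(10, 20, 30),
                 grade = c("A", "B", "C"))

df["score"]      # a one-column data frame
df$score         # the column as a vector: 10 20 30
df[[3]]          # third column as a vector: "A" "B" "C"
df[1, ]          # the first row
df[2, "score"]   # a single cell: 20
```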
Filtering Data in R
filt_df <- df[c(1:15), c(3, 4, 7:9)] # filter with row/column indices
filt_df <- df[-c(1:15), -c(3, 4, 7:9)] # the inverse of the above, with -
# Use subset(df, condition to filter by, select = columns to keep)
subset_df <- subset(df, property == 2, select = c("colName1", "colName2"))
Statistical Analysis in R
Summary statistics (min, quartiles, median, mean, max): summary(<list>)
Statistical Analysis in R
Covariance: cov(x, y)
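Both slides in one sketch, on two small invented vectors:

```r
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 5, 7, 9)

summary(x)    # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
mean(x)       # 6
median(x)     # 6

cov(x, y)     # 10: x and y move together
cor(x, y)     # 1: and do so perfectly linearly
```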
Statistical Analysis in R
Linear models: lm(y ~ x)
Scatter plot: visualize the linear relationship between the predictor and response
Box plot: spot any outlier observations in the variable; outliers in your predictor can drastically affect the predictions, since they can easily change the direction/slope of the line of best fit
Density plot: see the distribution of the predictor variable (X); ideally close to a normal distribution (a bell-shaped curve), without being skewed to the left or right
http://r-statistics.co
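A minimal lm() sketch on synthetic data with a known slope (the data-generating numbers are my own, chosen so the fit is easy to check):

```r
set.seed(1)
x <- 1:50
y <- 3 * x + 5 + rnorm(50, sd = 2)   # true slope 3, intercept 5, some noise

fit <- lm(y ~ x)
coef(fit)                 # intercept near 5, slope near 3
summary(fit)$r.squared    # close to 1 for this nearly linear data

# The diagnostic views mentioned on the slide:
# plot(x, y)         scatter plot of predictor vs. response
# boxplot(x)         spot outliers in the predictor
# plot(density(x))   distribution of the predictor
```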
Visualizing Statistical Data
Histograms: hist(data)
Scatter plots: plot()
Bar plots: barplot()
Line graphs: plot(type = "l") (lines() adds lines to an existing plot)
Box plots: boxplot()
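A sketch of the histogram call that also works without a graphics device, using plot = FALSE to compute the bins only (the data is invented):

```r
set.seed(7)
data <- rnorm(1000)

h <- hist(data, plot = FALSE)   # compute breaks and counts without drawing
sum(h$counts)                   # 1000: every observation lands in a bin

# The drawing versions (run these interactively or after opening a device):
# hist(data)                    histogram
# plot(x, y)                    scatter plot
# barplot(table(groups))        bar plot of counts
# plot(x, y, type = "l")        line graph
# boxplot(data)                 box plot
```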
Visualizing Statistical Data All graphing functions in R take aesthetic arguments such as col (color). Look at documentation for all options. For even more aesthetically pleasing plots, I recommend using the ggplot2 library.
Writing out Data
write.table(myData, "c:/myFile.tsv", sep = "\t")
write.table(mydata, "c:/mydata.txt", sep = "\t")
To an Excel spreadsheet:
library(xlsx)
write.xlsx(mydata, "c:/mydata.xlsx")
Exporting Graphs
In RStudio, you can generate a graph and simply click the Export button to save it onto your local drive
References
Big Data: http://www.columbia.edu/~sjm2186/EPIC_R/EPIC_R_BigData.pdf
Large Scale Data Analysis in R: https://rpubs.com/msundar/large_data_analysis
Taking R to the Limit: https://www.slideshare.net/bytemining/r-hpc
Statistics in R: http://r-statistics.co
Missing Data: https://www.statmethods.net/input/missingdata.html
Bigmemory
Points to a location in memory rather than reading a large file into RAM (uses a pointer to the data)
https://rpubs.com/msundar/large_data_analysis
Bigmemory
library(bigmemory)
library(biganalytics)
library(bigtabulate)

# Create a big.matrix
setwd("/Users/sundar/dev")
school.matrix <- read.big.matrix(
  "./numeric_matrix_SAT__College_Board__2010_School_Level_Results.csv",
  type = "integer", header = TRUE,
  backingfile = "school.bin", descriptorfile = "school.desc",
  extraCols = NULL)

# Get the location of the pointer to school.matrix
desc <- describe(school.matrix)
str(school.matrix)
## Formal class 'big.matrix' [package "bigmemory"] with 1 slot
## ..@ address:<externalptr>

# Process the big matrix in the active session
colsums.session1 <- sum(as.numeric(school.matrix[, 3]))
colsums.session1
https://rpubs.com/msundar/large_data_analysis
Parallelism in R
Use the fread() option from the data.table package
Parallel processing with the doMC package:
library(doMC)
registerDoMC(cores = 4)
https://rpubs.com/msundar/large_data_analysis
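The doMC pattern from the slide needs the foreach and doMC packages (and is Unix-only), so it is left as comments below; the runnable part uses base R's built-in parallel package, with a toy squared-numbers task of my own as the workload:

```r
# Slide's pattern (requires install.packages(c("foreach", "doMC"))):
# library(doMC)
# registerDoMC(cores = 4)
# results <- foreach(i = 1:4) %dopar% i^2

# Base-R equivalent; mc.cores > 1 is not supported on Windows.
library(parallel)
results <- mclapply(1:4, function(i) i^2, mc.cores = 2)
unlist(results)   # 1 4 9 16
```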
Memory/Data Type
Char: 24 MB
Short: 48 MB
Int: 96 MB
Double: 192 MB
R Syntax
Variable: Var <- 3
Function: Func <- function(a, b) { return(a) }
Argument: Func(1, 2)
Library: library(ggplot2)
Programming Terms
For loops: repeat something a certain number of times
While loops: repeat something until a condition is met
If statements: do something only if a condition is met
R Syntax
For loop:
numbers <- c(1, 2, 3, 4)
for (num in numbers) {
  print(num * 2)
}
R Syntax
While loop:
i <- 0
while (i < 6) {
  print(i)
  i <- i + 1
}
R Syntax
If statement:
x <- 0
if (x < 0) {
  print("Negative number")
}