
Training Up: Managing Data Sets in R
Nicole Lama
Goals
Open large data sets of various types in R
Manipulate/access data in R
Learn how to do basic statistical analysis
Basics of visualizing the data
Big Data vs Large Data Set
Big Data: a dataset that will not easily fit into the available RAM of a system
Large Data Set: data that is generally cumbersome to work with because of its size
Data Size
Medium datasets: < 2 GB
Large datasets: ~2-10 GB
Big Data: > 10 GB (requires distributed, large-scale computing)
R Cons
Will only use one core
Reads all data into memory rather than reading it on demand
Can slow down your computer
Why R for Big Data?
R has libraries for just about any statistical analysis possible
R is open source (woo hoo! Free software!)
File Types
.tsv : tab separated
.csv : comma separated
.txt : generic text file, could be separated by spaces
.json : JavaScript Object Notation
Opening File Types
.tsv : read.table("myFile.tsv", sep = "\t")
.csv : read.csv("myFile.csv")
.txt : read.table("myFile.txt")  # the default sep = "" splits on any whitespace ('\s' is not a valid separator in R)
.json : fromJSON(file = "myFile.json")  # rjson library
Opening File Types – UH OH!
My Data is…
Throwing an “out of workspace” error
Taking forever to load!
Taking forever to run an analysis on
Opening File Types – UH OH!
My Data is…
Throwing an "out of workspace" error*
Taking forever to load!
Taking forever to run an analysis on
Issue: You have too much data
*Out of workspace can also mean the values your script/function works with are too large. Use fewer significant digits, use a different function, or transform your values before proceeding.
Too Much Data
Some easy fixes:
1. Down-sample your data: sample()
2. Select only the data that is relevant to your analysis (maybe you only need two columns)
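A minimal sketch of down-sampling with sample(); the data frame big_df and the sample size here are made up for illustration:

```r
# Stand-in for a real large data set.
big_df <- data.frame(x = rnorm(100000), y = rnorm(100000))

set.seed(42)                              # make the sample reproducible
keep <- sample(nrow(big_df), size = 1000) # 1,000 random row indices
small_df <- big_df[keep, ]

nrow(small_df)   # 1000
```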
Too Much Data
Some easy fixes:
1. Use the data.table package (like data.frame but more efficient)
2. Split up your data into chunks: just read in 200 MB of data at a time
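One way to sketch chunked reading with plain read.csv(), using skip/nrows; the file and the 4-row chunk size are illustrative stand-ins for a real file and a ~200 MB chunk:

```r
# Build a tiny example CSV so the sketch is self-contained.
infile <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, val = runif(10)), infile, row.names = FALSE)

chunk_size <- 4
col_names <- names(read.csv(infile, nrows = 1))  # grab the header once
total <- 0
skip <- 1                                        # skip the header row
repeat {
  chunk <- tryCatch(
    read.csv(infile, skip = skip, nrows = chunk_size,
             header = FALSE, col.names = col_names),
    error = function(e) NULL)                    # no rows left
  if (is.null(chunk) || nrow(chunk) == 0) break
  total <- total + nrow(chunk)                   # process the chunk here
  skip <- skip + chunk_size
}
total   # 10 rows seen, 4 + 4 + 2
```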
Working with Data Sets < 2 GB
Pre-define column classes and specify the number of rows:

bigfile.sample <- read.csv("data/SAT_Results2014.csv",
                           stringsAsFactors = FALSE, header = TRUE, nrows = 20)

bigfile.colclass <- sapply(bigfile.sample, class)

bigfile.raw <- tbl_df(read.csv("data/SAT_Results2014.csv",  # tbl_df() comes from dplyr
                               stringsAsFactors = FALSE, header = TRUE,
                               nrows = 10000, colClasses = bigfile.colclass,
                               comment.char = ""))
https://rpubs.com/msundar/large_data_analysis
Working with Data Sets ~ 2-10 GB
Use the bigmemory library
biganalytics, bigtabulate for manipulation
Find more computational power, especially if the file size approaches 10 GB
https://rpubs.com/msundar/large_data_analysis
Still Too Slow?
1. Profile (time your functions)
system.time(func())
Tweak the function or its arguments until the run time improves
2. Byte-compile functions with cmpfun() from the compiler package
enableJIT() auto-compiles every function once, then reuses the compiled version
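A small sketch of both steps, with a deliberately loop-heavy toy function (the function itself is made up; note that recent versions of R JIT-compile by default, so the gain from cmpfun() may be modest):

```r
library(compiler)

# Toy loop-heavy function to time.
slow_sum <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + i
  s
}

t_plain <- system.time(slow_sum(1e6))  # elapsed time, uncompiled

fast_sum <- cmpfun(slow_sum)           # byte-compile once, reuse after
t_comp <- system.time(fast_sum(1e6))

fast_sum(100)   # 5050
```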
Nope, Still Slow
1. Use parallelism
Not worth it if you have fewer than 4 processors (IMO)
doMC package
2. Use a supercomputer
Longleaf, Dogwood, or other private clusters through Amazon, IBM, Google, NASA
Bottom Line
You should always try to dynamically read in your data if possible
If you use R and need to check or change memory allocation, use memory.size() (Windows only)
If you are on 32-bit R, the max is 2-3 GB
You can never have more than 2^31 - 1 (2,147,483,647) rows or columns
But Most Importantly…
We still don't know the "proper" way to handle large datasets; it is a hot topic in research
https://underthecblog.org/2014/09/16/big-data-big-problems/
Back to Opening Files
After opening your file (tsv, csv, txt, etc.) in R, it is usually stored as a data frame
A data frame is a special object in R which organizes data by row and column and is easily manipulated
Many functions in R require data to be stored as data frames
*Check whether something is a data frame with: is.data.frame(<obj>)
Cleaning Up Your Data
Remove missing values: na.omit()
Exclude missing values from an analysis with the na.rm = TRUE argument
Check for missing values with is.na(); is.nan() catches NaN specifically
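A short sketch of all three on a toy data frame (the data are made up):

```r
# Toy data frame with one NA and one NaN.
df <- data.frame(x = c(1, NA, 3), y = c(4, 5, NaN))

clean <- na.omit(df)       # drop every row containing NA or NaN
nrow(clean)                # 1

mean(df$x, na.rm = TRUE)   # 2 -- the NA is excluded from the calculation

is.na(df$x)                # TRUE where values are missing
is.nan(df$y)               # TRUE only for NaN (a plain NA is not NaN)
```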
Adding to a Data.Frame in R
rbind(df, list(information))  # add a row
cbind(df, colName = c(information))  # add a column
* df = data.frame
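A concrete sketch with an illustrative data frame, showing rbind() adding a row and cbind() adding a column:

```r
# Small illustrative data frame.
df <- data.frame(name = c("a", "b"), score = c(1, 2))

df <- rbind(df, list(name = "c", score = 3))    # add one row
df <- cbind(df, passed = c(TRUE, TRUE, FALSE))  # add one column

dim(df)   # 3 rows, 3 columns
```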
Removing from a Data.Frame in R
df$colName <- NULL  # remove a column
df <- df[-1,]  # delete rows by reassignment
* df = data.frame
Accessing a Data.Frame in R
df["colName"]  # pulls out the column called colName
df$colName  # same column, as a vector
df[[3]]  # pulls out the 3rd column
df[1,]  # access a row
* df = data.frame
Filtering Data in R
filt_df <- df[c(1:15), c(3,4,7:9)]  # filter with row/col indices
filt_df <- df[-c(1:15), -c(3,4,7:9)]  # do the inverse of the above with -
# Use subset(df, property you want to filter by, select = columns to keep)
subset_df <- subset(df, property == 2, select = c("colName1", "colName2"))
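The same calls on a small illustrative data frame (column names borrowed from the slide; the values are made up):

```r
# Illustrative data frame to filter.
df <- data.frame(property = c(1, 2, 2, 3),
                 colName1 = c("a", "b", "c", "d"),
                 colName2 = c(10, 20, 30, 40))

filt_df <- df[1:2, c(1, 3)]   # rows 1-2, columns 1 and 3

subset_df <- subset(df, property == 2,
                    select = c("colName1", "colName2"))
nrow(subset_df)   # 2 rows match property == 2
```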
Statistical Analysis in R
Summary statistics (min, quartiles, median, mean): summary(<object>)
Statistical Analysis in R
Covariance: cov(x, y)
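A quick sketch of both on two made-up vectors (here y tracks x exactly, so the correlation is 1):

```r
x <- c(2, 4, 6, 8)
y <- c(1, 3, 5, 7)

summary(x)   # Min, quartiles, median, mean, Max
mean(x)      # 5
median(x)    # 5

cov(x, y)    # sample covariance; positive since x and y move together
cor(x, y)    # 1 -- perfectly correlated
```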
Statistical Analysis in R
Linear models: lm(y ~ x)
Scatter plot: visualize the linear relationship between the predictor and response
Box plot: spot any outlier observations in the variable. Outliers in your predictor can drastically affect the predictions, since they can easily change the direction/slope of the line of best fit.
Density plot: see the distribution of the predictor variable (X). Ideally a close-to-normal distribution (a bell-shaped curve), without left or right skew, is preferred.
http://r-statistics.co
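A minimal lm() sketch on synthetic data (the true slope and intercept below are invented so the fit can be checked against them):

```r
# Synthetic data: y = 2x + 5 plus a little noise.
set.seed(1)
x <- 1:50
y <- 2 * x + 5 + rnorm(50, sd = 0.5)

fit <- lm(y ~ x)
coef(fit)                 # intercept near 5, slope near 2
summary(fit)$r.squared    # close to 1 for this nearly linear data
```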
Visualizing Statistical Data
Histograms: hist(data)
Scatter plots: plot()
Bar plots: barplot()
Line graphs: plot(..., type = "l") or lines()
Box plots: boxplot()
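A sketch drawing two of these to a PNG file rather than the screen device, which also works on headless servers; the output path and the simulated data are illustrative:

```r
# Save a histogram and a box plot side by side to a PNG.
out <- tempfile(fileext = ".png")
png(out, width = 600, height = 400)

data <- rnorm(1000)
par(mfrow = c(1, 2))   # two panels in one figure
hist(data, col = "steelblue", main = "Histogram")
boxplot(data, main = "Box plot")

dev.off()
file.exists(out)   # TRUE
```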
Visualizing Statistical Data
All graphing functions in R take aesthetic arguments
such as col (color). Look at documentation for all
options.
For even more aesthetically pleasing plots, I
recommend using the ggplot2 library.
ggplot2
Writing out Data
write.table(myData, "c:/myFile.tsv", sep = "\t")
write.table(mydata, "c:/mydata.txt", sep = "\t")
To an Excel spreadsheet:
library(xlsx)
write.xlsx(mydata, "c:/mydata.xlsx")
Exporting Graphs
In RStudio, you can generate a graph and simply click the Export button to save it onto your local drive
References
Big Data:
http://www.columbia.edu/~sjm2186/EPIC_R/EPIC_R_BigData.pdf
Large Scale Data Analysis in R:
https://rpubs.com/msundar/large_data_analysis
Taking R to the Limit:
https://www.slideshare.net/bytemining/r-hpc
Statistics in R:
http://r-statistics.co
https://www.statmethods.net/input/missingdata.html
Supplementary Information
Bigmemory
https://rpubs.com/msundar/large_data_analysis
Will point to a location in memory rather than reading in a large file (uses a pointer to the data)
Bigmemory
https://rpubs.com/msundar/large_data_analysis

library(bigmemory)
library(biganalytics)
library(bigtabulate)

# Create big.matrix
setwd("/Users/sundar/dev")
school.matrix <- read.big.matrix(
  "./numeric_matrix_SAT__College_Board__2010_School_Level_Results.csv",
  type = "integer", header = TRUE, backingfile = "school.bin",
  descriptorfile = "school.desc", extraCols = NULL)

# Get the location of the pointer to school.matrix.
desc <- describe(school.matrix)
str(school.matrix)
## Formal class 'big.matrix' [package "bigmemory"] with 1 slot
## ..@ address:<externalptr>

# Process the big matrix in the active session.
colsums.session1 <- sum(as.numeric(school.matrix[, 3]))
colsums.session1
Parallelism in R
Use the fread() function from the data.table package
Parallel processing with the doMC package
https://rpubs.com/msundar/large_data_analysis

library(doMC)
registerDoMC(cores = 4)
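doMC is Unix-only; as a cross-platform alternative, here is a small sketch using the base parallel package (the 2-worker cluster and the squaring task are illustrative, not from the slides):

```r
library(parallel)

# Spin up a small cluster of worker processes.
# detectCores() reports how many the machine actually has.
cl <- makeCluster(2)

# Square the numbers 1..10 across the workers.
squares <- parSapply(cl, 1:10, function(i) i^2)

stopCluster(cl)   # always release the workers
squares           # 1 4 9 ... 100
```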
Memory/Data Type
Char: 24 MB
Int:  96 MB
Double: 192 MB
Short: 48 MB
R Syntax
Variable: var <- 3
Function: func <- function(a, b){ return(a) }
Argument: func(1, 2)
Library: library("ggplot2")
Programming Terms
For loops: repeat something a certain number of times
While loops: repeat something until a condition is met
If statements: do something only if a condition is met
R Syntax
For loop
numbers <- cbind(1,2,3,4)
for (num in numbers) {
        print(num*2)
}
R Syntax
While loop
i <- 0
while (i < 6) {
        print(i)
        i <- i + 1
}
R Syntax
If statement
x <- 0
if (x < 0) {
        print("Negative number")
}