Managing Data Sets in R - Learn, Analyze, Visualize
Dive into managing data sets in R with Nicole Lama. Discover how to open, manipulate, and analyze large data sets, understand big data vs. large data sets, and overcome challenges when working with R. Explore the basics of visualizing data and utilizing R for statistical analysis using various file types.
Training Up: Managing Data Sets in R Nicole Lama
Goals
Open large data sets of various types into R
Manipulate/access data in R
Learn how to do basic statistical analysis
Learn the basics of visualizing the data
Big Data vs. Large Data Set
Big data: a data set that will not easily fit into the available RAM of a system
Large data set: data that is generally cumbersome to work with because of its size
Data Size
Medium data sets: < 2 GB
Large data sets: ~2-10 GB
Big data: > 10 GB; requires distributed, large-scale computing
R Cons
Uses only one core by default
Reads all data into memory rather than streaming it on demand
Can slow down your computer
Why R for Big Data?
R has libraries for just about any statistical analysis possible
R is open source (woo hoo! Free software!)
File Types
.tsv: tab separated
.csv: comma separated
.txt: generic text file, could be separated by spaces
.json: JavaScript Object Notation
Opening File Types
.tsv: read.table("myFile.tsv", sep = "\t")
.csv: read.csv("myFile.csv")
.txt: read.table("myFile.txt") (the default sep = "" splits on any whitespace)
.json: fromJSON(file = "myFile.json") (requires the rjson library)
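As a minimal, self-contained sketch of the calls above (the file contents and column names here are made up for illustration), you can round-trip a small CSV through a temp file:

```r
# Write a tiny CSV to a temp file, then read it back with read.csv().
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), tmp, row.names = FALSE)

df <- read.csv(tmp, stringsAsFactors = FALSE)
str(df)  # a data frame with 3 rows and 2 columns

# The same pattern for the other formats on the slide:
# df <- read.table("myFile.tsv", sep = "\t", header = TRUE)   # tab separated
# df <- read.table("myFile.txt", header = TRUE)               # any whitespace
# library(rjson); obj <- fromJSON(file = "myFile.json")       # JSON
```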
Opening File Types
UH OH! My data is:
Throwing an out-of-memory error*
Taking forever to load!
Taking forever to run an analysis on!
Issue: You have too much data
*Out-of-memory errors (e.g. "cannot allocate vector of size ...") can also mean the values your script/function works with are too large. Use significant values, use a different function, or transform your values before proceeding
Too Much Data
Some easy fixes:
1. Down-sample your data with sample()
2. Select only the data that is relevant to your analysis (maybe you only need two columns for your analysis)
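A short sketch of both fixes (the data frame and column names are invented for the example):

```r
# Down-sample a large data frame to 10% of its rows with sample().
set.seed(42)                                  # reproducible sampling
big <- data.frame(id = 1:10000, value = rnorm(10000))

keep  <- sample(nrow(big), size = nrow(big) * 0.10)
small <- big[keep, ]
nrow(small)                                   # 1000

# Or keep only the columns the analysis actually needs:
two_cols <- big[, c("id", "value")]
```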
Too Much Data
Some easy fixes:
1. Use the data.table package: like data.frame, but more efficient
2. Split your data into chunks: just read in, say, 200 MB of data at a time
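Chunked reading can be sketched in base R with an open connection, reading a fixed number of rows per pass (the file, chunk size, and column name below are illustrative):

```r
# Write a 1000-row CSV, then read it back 300 rows at a time.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1000), tmp, row.names = FALSE)

con <- file(tmp, "r")
invisible(readLines(con, n = 1))   # consume the header line once
chunk_size <- 300
total <- 0
repeat {
  chunk <- read.csv(con, header = FALSE, nrows = chunk_size,
                    col.names = "x")
  total <- total + nrow(chunk)     # process each chunk here
  if (nrow(chunk) < chunk_size) break   # short chunk means end of file
}
close(con)
total                              # 1000
```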
Working with Data Sets < 2 GB
Pre-define column classes and specify the number of rows:
bigfile.sample <- read.csv("data/SAT_Results2014.csv", stringsAsFactors = FALSE, header = TRUE, nrows = 20)
bigfile.colclass <- sapply(bigfile.sample, class)
bigfile.raw <- tbl_df(read.csv("data/SAT_Results2014.csv", stringsAsFactors = FALSE, header = TRUE, nrows = 10000, colClasses = bigfile.colclass, comment.char = ""))
(tbl_df() comes from the dplyr package)
https://rpubs.com/msundar/large_data_analysis
Working with Data Sets ~2-10 GB
Use the bigmemory library (biganalytics and bigtabulate for manipulation)
Find more computational power, especially if the file size is approaching 10 GB or more
https://rpubs.com/msundar/large_data_analysis
Still Too Slow?
1. Profile (time your functions) with system.time(func()); tweak the function or its arguments until the run time improves
2. Byte-compile a hot function with compiler::cmpfun(); each function is compiled only once
Nope, Still Slow
1. Use parallelism with the doMC package (in my opinion, not worth it with fewer than 4 processors)
2. Use a supercomputer: Longleaf, Dogwood, or other private clusters through Amazon, IBM, Google, NASA
Bottom Line
You should always try to read your data in dynamically if possible
If you use R on Windows, memory.size() reports current memory use and memory.limit() adjusts the allocation
On 32-bit R, the practical maximum is 2-3 GB
You can never have more than (2^31)-1 (2,147,483,647) rows or columns
But Most Importantly
We still don't know the proper way to handle large data sets, and it is a hot topic in research
https://underthecblog.org/2014/09/16/big-data-big-problems/
Back to Opening Files
After opening your file (.tsv, .csv, .txt, etc.) in R, it is usually stored as a data frame
A data frame is a special object in R that organizes data by row and column and is easily manipulated
Many functions in R require data to be stored as a data frame
*Check whether something is a data frame with is.data.frame(<obj>)
Cleaning Up Your Data
Remove missing values: na.omit()
Exclude missing values from an analysis with the na.rm = TRUE argument
Check for missing values with is.na(); is.nan() catches only NaN
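A compact sketch of the three tools above on made-up data:

```r
# NA marks missing data; NaN marks undefined numeric results (e.g. 0/0).
x <- c(1, 2, NA, 4, NaN)

is.na(x)                 # TRUE for both NA and NaN
is.nan(x)                # TRUE only for NaN

mean(x)                  # NA: missing values propagate by default
mean(x, na.rm = TRUE)    # 2.333...; NA and NaN are dropped first

df <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))
clean <- na.omit(df)     # drops every row containing a missing value
nrow(clean)              # 1
```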
Adding to a Data.Frame in R
rbind(df, list(information)) # add a row
cbind(df, colName = c(information)) # add a column
*df = data.frame
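For example (the data frame and column names are invented for illustration):

```r
df <- data.frame(name = c("a", "b"), score = c(90, 85))

df <- rbind(df, list(name = "c", score = 78))   # append one row
df <- cbind(df, passed = df$score >= 80)        # append a new column

nrow(df)   # 3
ncol(df)   # 3
```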
Removing from a Data.Frame in R
df$colName <- NULL # remove a column
df <- df[-1, ] # delete rows by reassignment
*df = data.frame
Accessing a Data.Frame in R
df["colName"] # pulls out the column called colName (as a data frame)
df$colName # pulls out the column as a vector
df[[3]] # pulls out the 3rd column as a vector
df[1, ] # accesses a row
*df = data.frame
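The accessors above, run on a small invented data frame:

```r
df <- data.frame(id = 1:3, score = c(10, 20, 30),
                 grade = c("A", "B", "C"))

df["score"]      # a one-column data frame
df$score         # the column as a vector: 10 20 30
df[[3]]          # third column as a vector: "A" "B" "C"
df[1, ]          # the first row
df[2, "score"]   # a single cell: 20
```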
Filtering Data in R
filt_df <- df[c(1:15), c(3, 4, 7:9)] # filter with row/column indices
filt_df <- df[-c(1:15), -c(3, 4, 7:9)] # the inverse of the above, with -
# Use subset(df, condition to filter by, select = columns to keep)
subset_df <- subset(df, property == 2, select = c("colName1", "colName2"))
Statistical Analysis in R
Summary statistics (min, quartiles, median, mean, max): summary(<list>)
Statistical Analysis in R
Covariance: cov(x, y)
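Both slides in one sketch, on two small invented vectors:

```r
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 5, 7, 9)

summary(x)    # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
mean(x)       # 6
median(x)     # 6

cov(x, y)     # 10: x and y move together
cor(x, y)     # 1: and do so perfectly linearly
```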
Statistical Analysis in R
Linear models: lm(y ~ x)
Scatter plot: visualize the linear relationship between the predictor and response
Box plot: spot any outlier observations in the variable; outliers in your predictor can drastically affect the predictions, since they can easily change the direction/slope of the line of best fit
Density plot: see the distribution of the predictor variable (X); ideally close to a normal distribution (a bell-shaped curve), without being skewed to the left or right
http://r-statistics.co
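A minimal lm() sketch on synthetic data with a known slope (the data-generating numbers are my own, chosen so the fit is easy to check):

```r
set.seed(1)
x <- 1:50
y <- 3 * x + 5 + rnorm(50, sd = 2)   # true slope 3, intercept 5, some noise

fit <- lm(y ~ x)
coef(fit)                 # intercept near 5, slope near 3
summary(fit)$r.squared    # close to 1 for this nearly linear data

# The diagnostic views mentioned on the slide:
# plot(x, y)         scatter plot of predictor vs. response
# boxplot(x)         spot outliers in the predictor
# plot(density(x))   distribution of the predictor
```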
Visualizing Statistical Data
Histograms: hist(data)
Scatter plots: plot()
Bar plots: barplot()
Line graphs: plot(type = "l") (lines() adds lines to an existing plot)
Box plots: boxplot()
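A sketch of the histogram call that also works without a graphics device, using plot = FALSE to compute the bins only (the data is invented):

```r
set.seed(7)
data <- rnorm(1000)

h <- hist(data, plot = FALSE)   # compute breaks and counts without drawing
sum(h$counts)                   # 1000: every observation lands in a bin

# The drawing versions (run these interactively or after opening a device):
# hist(data)                    histogram
# plot(x, y)                    scatter plot
# barplot(table(groups))        bar plot of counts
# plot(x, y, type = "l")        line graph
# boxplot(data)                 box plot
```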
Visualizing Statistical Data All graphing functions in R take aesthetic arguments such as col (color). Look at documentation for all options. For even more aesthetically pleasing plots, I recommend using the ggplot2 library.
Writing out Data
write.table(myData, "c:/myFile.tsv", sep = "\t")
write.table(mydata, "c:/mydata.txt", sep = "\t")
To an Excel spreadsheet:
library(xlsx)
write.xlsx(mydata, "c:/mydata.xlsx")
Exporting Graphs
In RStudio, you can generate a graph and simply click the Export button to save it onto your local drive
References
Big Data: http://www.columbia.edu/~sjm2186/EPIC_R/EPIC_R_BigData.pdf
Large Scale Data Analysis in R: https://rpubs.com/msundar/large_data_analysis
Taking R to the Limit: https://www.slideshare.net/bytemining/r-hpc
Statistics in R: http://r-statistics.co
Missing Data: https://www.statmethods.net/input/missingdata.html
Bigmemory
Points to a location in memory rather than reading a large file into RAM (uses a pointer to the data)
https://rpubs.com/msundar/large_data_analysis
Bigmemory
library(bigmemory)
library(biganalytics)
library(bigtabulate)

# Create a big.matrix
setwd("/Users/sundar/dev")
school.matrix <- read.big.matrix(
  "./numeric_matrix_SAT__College_Board__2010_School_Level_Results.csv",
  type = "integer", header = TRUE,
  backingfile = "school.bin", descriptorfile = "school.desc",
  extraCols = NULL)

# Get the location of the pointer to school.matrix
desc <- describe(school.matrix)
str(school.matrix)
## Formal class 'big.matrix' [package "bigmemory"] with 1 slot
## ..@ address:<externalptr>

# Process the big matrix in the active session
colsums.session1 <- sum(as.numeric(school.matrix[, 3]))
colsums.session1
https://rpubs.com/msundar/large_data_analysis
Parallelism in R
Use the fread() option from the data.table package
Parallel processing with the doMC package:
library(doMC)
registerDoMC(cores = 4)
https://rpubs.com/msundar/large_data_analysis
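The doMC pattern from the slide needs the foreach and doMC packages (and is Unix-only), so it is left as comments below; the runnable part uses base R's built-in parallel package, with a toy squared-numbers task of my own as the workload:

```r
# Slide's pattern (requires install.packages(c("foreach", "doMC"))):
# library(doMC)
# registerDoMC(cores = 4)
# results <- foreach(i = 1:4) %dopar% i^2

# Base-R equivalent; mc.cores > 1 is not supported on Windows.
library(parallel)
results <- mclapply(1:4, function(i) i^2, mc.cores = 2)
unlist(results)   # 1 4 9 16
```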
Memory/Data Type
Char: 24 MB
Short: 48 MB
Int: 96 MB
Double: 192 MB
R Syntax
Variable: Var <- 3
Function: Func <- function(a, b) { return(a) }
Argument: Func(1, 2)
Library: library(ggplot2)
Programming Terms
For loops: repeat something a certain number of times
While loops: repeat something until a condition is met
If statements: do something only if a condition is met
R Syntax
For loop:
numbers <- c(1, 2, 3, 4)
for (num in numbers) {
  print(num * 2)
}
R Syntax
While loop:
i <- 0
while (i < 6) {
  print(i)
  i <- i + 1
}
R Syntax
If statement:
x <- 0
if (x < 0) {
  print("Negative number")
}