Preliminary Steps in Setting Up a Hadoop Environment
Logging into the VM, changing your password, transferring files into HDFS, setting up RStudio for MapReduce programming, and running a first MapReduce program are the essential preliminary steps in establishing a Hadoop environment for data processing tasks.
Wordcount CSCE 587 Spring 2017
Preliminary steps in the VM
First: log in to the VM
Ex: ssh -p 555 vm-hadoop-XX.cse.sc.edu
where XX is the VM number assigned to you
NOTE: port 555 is for the VM and port 222 is for the actual machines
If you haven't already done so, change your password
Ex: passwd
You will be prompted for your current password, then for a new password
Preliminary steps in the VM
Load data into the Linux filesystem of the VM:
use SSH secure file transfer, or use vi to create a text file, or use scp:
Ex: [student@sandbox ~]$ scp -P222 rose@l-1d39-08.cse.sc.edu:public_html/587/greeneggsandham.txt g.txt
scp: secure copy command
-P222: use port 222 (NOTE: this is a different port than the one used to connect to the VM)
source file: rose@l-1d39-08.cse.sc.edu:public_html/587/greeneggsandham.txt
destination file: g.txt
Or use wget to transfer a file from the web:
Ex: [student@sandbox ~]$ wget -O g.txt https://cse.sc.edu/~rose/587/greeneggsandham.txt
wget: free utility for non-interactive download of files from the web
-O g.txt: save the download as g.txt (without -O, wget would treat g.txt as a second URL)
source file: https://cse.sc.edu/~rose/587/greeneggsandham.txt
destination file: g.txt
Preliminary steps in the VM
Transfer the file from the VM Linux filesystem to the Hadoop filesystem:
hadoop fs -put g.txt /user/share/student/
Convince yourself by checking the HDFS:
Ex: [student@sandbox ~]$ hadoop fs -ls /user/share/student/
Found 1 items
-rw-r--r--   1 student hdfs       1746 2017-11-01 21:13 /user/share/student/g.txt
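Once the rhdfs connection on the next slide has been initialized, the same transfer can also be done from inside R. A minimal sketch, assuming rhdfs provides hdfs.put() as documented for the RHadoop packages:
# Assumes library(rhdfs) and hdfs.init() have already been run (next slide)
# hdfs.put(src, dest) copies a local file into HDFS
hdfs.put("g.txt", "/user/share/student/g.txt")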
Preliminary RStudio steps
# Set environment variables
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.0.0-2557.jar")
# Load the following packages in the following order
library(rhdfs)
library(rmr2)
# Initialize the connection from RStudio to Hadoop
hdfs.init()
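A quick sanity check that the connection works, using rhdfs's hdfs.ls(); the listing assumes g.txt was uploaded as on the previous slide:
# List the HDFS directory from R; should show g.txt if the put succeeded
hdfs.ls("/user/share/student")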
Our first mapreduce program
# Map: split each input line on whitespace and emit a (word, 1) pair per word
primo_map = function(k, lines) {
  words.list = strsplit(lines, '\\s')
  words = unlist(words.list)
  return(keyval(words, 1))
}
# Reduce: sum the counts emitted for each word
primo_reduce = function(word, counts) {
  keyval(word, sum(counts))
}
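Since the mapper is an ordinary R function, it can be exercised by itself, outside Hadoop. A small sketch; the sample line is made up, and keys()/values() are rmr2's accessors for keyval objects:
# Call the mapper directly on one made-up line of text
kv = primo_map(NULL, "Sam I am I am Sam")
keys(kv)    # the individual words
values(kv)  # a 1 for every word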
Our first mapreduce program
primo_wordcount = function(primo_input, primo_output = NULL) {
  mapreduce(input = primo_input,
            output = primo_output,
            input.format = "text",
            map = primo_map,
            reduce = primo_reduce)
}
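For a fast smoke test, rmr2 can run the same map and reduce functions entirely inside the R session via its local backend. A minimal sketch; it calls mapreduce() directly rather than primo_wordcount() because to.dfs() writes rmr2's native format, not the "text" input format that primo_wordcount hardcodes:
# Switch rmr2 to the in-process local backend (no HDFS, no cluster)
rmr.options(backend = "local")
kv = mapreduce(input = to.dfs("Sam I am Sam"),
               map = primo_map,
               reduce = primo_reduce)
from.dfs(kv)
# Switch back before submitting the real job
rmr.options(backend = "hadoop")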
Submitting our first mapreduce job
# Specify the path
hdfs.root = '/user/share/student'
# Append the data filename to the pathname
hdfs.data = file.path(hdfs.root, 'g.txt')
# Append the output filename to the pathname
hdfs.out = file.path(hdfs.root, 'primo_out')
Submitting our first mapreduce job
# Invoke your map-reduce functions on your input file and output file
out = primo_wordcount(hdfs.data, hdfs.out)
# Pour yourself a cup of coffee and wait ...
# Hadoop is fast enough, but the VM is deadly slow ...
Note: you cannot overwrite existing files
# If "primo_out" already exists, the mapreduce job will fail and you will
# have to delete "primo_out" first:
# hadoop fs -rmr /user/share/student/primo_out
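The same cleanup can be scripted from R so the job never trips over a leftover output directory. A hedged sketch, assuming this rhdfs version exposes hdfs.exists() and hdfs.rm(), both of which appear in the RHadoop rhdfs function list:
# Remove the output directory if a previous run left it behind
if (hdfs.exists(hdfs.out)) {
  hdfs.rm(hdfs.out)  # assumed to delete the directory, like hadoop fs -rmr
}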
VM: Check for changes to HDFS
[student@sandbox ~]$ hadoop fs -ls /user/share/student/
Found 4 items
-rw-r--r--   1 student hdfs       1746 2016-03-03 12:36 /user/share/student/g.txt
drwxr-xr-x   - student hdfs          0 2015-11-09 18:41 /user/share/student/out
drwxr-xr-x   - student hdfs          0 2015-11-09 18:47 /user/share/student/out2
drwxr-xr-x   - student hdfs          0 2016-03-03 12:56 /user/share/student/primo_out
RStudio: Fetch the results from HDFS
results = from.dfs(out)
results.df = as.data.frame(results, stringsAsFactors = F)
colnames(results.df) = c('word', 'count')
# head(results.df)
results.df
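To make the output easier to read, the data frame can be sorted by count; this is plain base R, nothing rmr2-specific:
# Show the ten most frequent words first
results.df = results.df[order(results.df$count, decreasing = TRUE), ]
head(results.df, 10)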