Linear Regression in RHadoop via MapReduce

What is RHadoop?

RHadoop is a collection of five R packages that allow users to manage and analyze data with Hadoop. The packages are tested before each release on recent releases of the Cloudera and Hortonworks Hadoop distributions, and should be broadly compatible with open source Hadoop and MapR's distribution.

rhdfs
rhbase
plyrmr
rmr2
ravro

To perform linear regression we need two of these packages:

rhdfs: This package provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS from within R. Install this package only on the node that will run the R client.

rmr2: A package that allows R developers to perform statistical analysis in R via Hadoop MapReduce functionality on a Hadoop cluster. Install this package on every node in the cluster.

Performing linear regression with RHadoop via MapReduce in R

Set up environment:

# Adjust these paths to match the local Hadoop installation.
Sys.setenv(HADOOP_CMD='/usr/bin/hadoop')
Sys.setenv(HADOOP_HOME='/usr/lib/hadoop-0.20-mapreduce')
Sys.setenv(HADOOP_STREAMING='/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-mr1-cdh5.7.1.jar')
library(rJava)
library(rmr2)
library(rhdfs)
hdfs.init()

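Load the data (the response y is the first column, the predictors X are the remaining columns):

# Download the CASP data set as a data frame with 10 columns.
table <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv', sep = ",")
# Flatten the data frame column by column and rebuild it as a 10-column numeric matrix.
table <- as.numeric(unlist(table))
table <- matrix(table, ncol = 10)
# Write the matrix to the distributed file system so that the map tasks can read it.
X1 <- to.dfs(table)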
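
As a brief reminder of why the two jobs below are enough, here is the standard least-squares result they rely on. The notation X_k and y_k for the block of rows handled by the k-th map call is introduced here only for illustration, and the formula assumes that t(X) %*% X is invertible.

```latex
\hat{\beta} = \left(X^{\top}X\right)^{-1} X^{\top}y,
\qquad
X^{\top}X = \sum_{k} X_k^{\top}X_k,
\qquad
X^{\top}y = \sum_{k} X_k^{\top}y_k
```

Each map call therefore only needs its own block of rows, and the per-block results are small: with nine predictors, a 9 x 9 matrix and a vector of length 9.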

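1st map-reduce job: calculate t(X) %*% X

mapper <- function(., Xr) {
  # Drop the first column (the response y) and keep only the predictors.
  Xr <- Xr[, -1]
  # Emit this chunk's cross-product under a single key so that all chunks are summed together.
  keyval(1, list(t(Xr) %*% Xr))
}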

Reduce function sums a list of matrices

reducer <- function(., A) {
  # A is the list of matrices emitted under key 1; add them element-wise.
  keyval(1, list(Reduce('+', A)))
}
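
The map and reduce steps above can be tried out without a cluster. The following is a minimal base R sketch, not the rmr2 pipeline itself: the matrix `X`, the four-way split, and the names `chunks`, `partials`, and `XtX_chunked` are all hypothetical and chosen only for illustration.

```r
# Base R sketch (no Hadoop needed): emulate the mapper and reducer on row chunks.
set.seed(1)
X <- matrix(rnorm(60), ncol = 3)   # hypothetical 20 x 3 matrix of predictors

# Split the rows into 4 chunks, standing in for the blocks each map call receives.
chunk_id <- rep(1:4, each = 5)
chunks <- lapply(split(seq_len(nrow(X)), chunk_id),
                 function(rows) X[rows, , drop = FALSE])

# "Map": each chunk contributes its own cross-product matrix.
partials <- lapply(chunks, function(Xr) t(Xr) %*% Xr)

# "Reduce": sum the list of matrices element-wise.
XtX_chunked <- Reduce(`+`, partials)

# Compare with the cross-product computed on the full matrix in one step.
print(all.equal(XtX_chunked, t(X) %*% X))
```

The comparison uses `all.equal` rather than `identical` because summing in a different order can change the last few bits of a floating-point result.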

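2nd map-reduce job: calculate t(X) %*% y

mapper2 <- function(., Xr) {
  # The first column is the response; the remaining columns are the predictors.
  yr <- Xr[, 1]
  Xr <- Xr[, -1]
  # Emit this chunk's contribution to t(X) %*% y under a single key.
  keyval(1, list(t(Xr) %*% yr))
}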

Calculate t(X) %*% X

XtX <- values(
  from.dfs(
    mapreduce(
      input = X1,
      map = mapper,
      reduce = reducer,
      combine = TRUE)))[[1]]

Calculate t(X) %*% y

Xty <- values(
  from.dfs(
    mapreduce(
      input = X1,
      map = mapper2,
      reduce = reducer,
      combine = TRUE)))[[1]]
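
Both calls turn on the combine argument, which in rmr2 reuses the reduce function as a combiner on partial groups of map output before the final reduce. That is safe here because matrix addition gives the same total however the terms are grouped. The base R sketch below illustrates the idea with made-up partial results; the helper `sum_matrices` and the grouping into two sets of three are hypothetical and not part of rmr2.

```r
# Base R sketch: a staged (combine, then reduce) sum equals a single-stage sum.
set.seed(2)

# Hypothetical partial results, standing in for what six map calls would emit.
partials <- replicate(6, {
  Xr <- matrix(rnorm(12), ncol = 3)
  t(Xr) %*% Xr
}, simplify = FALSE)

# Same operation as the reducer: add a list of matrices element-wise.
sum_matrices <- function(A) Reduce(`+`, A)

# Single stage: the reducer sees all six matrices at once.
direct <- sum_matrices(partials)

# Two stages: a "combiner" sums each group of three, then the reducer sums the two results.
groups <- split(seq_along(partials), rep(1:2, each = 3))
combined <- lapply(groups, function(idx) sum_matrices(partials[idx]))
staged <- sum_matrices(combined)

print(all.equal(direct, staged))
```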

Solution

beta<-solve(XtX, Xty)
beta
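
One way to sanity-check this kind of solution is to compare it with `lm` on data small enough to fit in memory. The sketch below uses synthetic data, so `X`, `y`, and the coefficient values are hypothetical. It also assumes, as the mappers above suggest, that the model has no intercept, since no column of ones is added to the predictors; the last few lines show how an intercept could be included if one is wanted.

```r
# Base R sketch: normal-equation solution versus lm() on synthetic data.
set.seed(3)
n <- 100
X <- matrix(rnorm(n * 3), ncol = 3)                        # hypothetical predictors
y <- as.vector(X %*% c(2, -1, 0.5) + rnorm(n, sd = 0.1))   # hypothetical response

# Same final step as above: solve (t(X) %*% X) beta = t(X) %*% y.
beta_ne <- solve(t(X) %*% X, t(X) %*% y)

# lm() without an intercept ("- 1") fits the same model.
fit <- lm(y ~ X - 1)
print(all.equal(as.vector(beta_ne), unname(coef(fit))))

# To include an intercept, prepend a column of ones to the predictors.
X1 <- cbind(1, X)
beta_int <- solve(t(X1) %*% X1, t(X1) %*% y)
fit_int <- lm(y ~ X)
print(all.equal(as.vector(beta_int), unname(coef(fit_int))))
```

In the map-reduce version, the equivalent change would be to bind a column of ones to `Xr` inside both mappers, which is an adaptation and not something the code above does.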

Juan Garcia | LinkedIn
