Linear Regression in RHadoop via MapReduce
RHadoop is a collection of five R packages that allow users to manage and analyze data with Hadoop. The packages are tested before every release on recent releases of the Cloudera and Hortonworks Hadoop distributions, and should be broadly compatible with open source Hadoop and MapR's distribution.
rhdfs
rhbase
plyrmr
rmr2
ravro
To fit the linear regression we need two of these packages:
rhdfs: This package provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS from within R. Install this package only on the node that will run the R client.
rmr2: A package that allows R developers to perform statistical analysis in R via Hadoop MapReduce functionality on a Hadoop cluster. Install this package on every node in the cluster.
Set up environment:
Sys.setenv(HADOOP_CMD='/usr/bin/hadoop')
Sys.setenv(HADOOP_HOME='/usr/lib/hadoop-0.20-mapreduce')
Sys.setenv(HADOOP_STREAMING='/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-mr1-cdh5.7.1.jar')
library(rJava)
library(rmr2)
library(rhdfs)
hdfs.init()
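Load the data and write it to HDFS. The first column is the response 'y'; the remaining nine columns form 'X':
# Read the CSV file into a data frame
table <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv', sep=",")
# Flatten to a numeric vector (column by column) and rebuild it as a 10-column matrix
table <- as.numeric(unlist(table))
table <- matrix(table, ncol=10)
# Write the matrix to HDFS so the MapReduce jobs can read it
X1 <- to.dfs(table)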
1st map-reduce to calculate t(X)*X
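mapper = function(., Xr) {
  # Drop the response column, keeping only the predictors of this chunk
  Xr <- Xr[, -1]
  # Emit the chunk's partial cross-product under a single key
  keyval(1, list(t(Xr) %*% Xr))
}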
Reduce function sums a list of matrices
reducer = function(., A) {
  # A is the list of partial matrices emitted under key 1; add them element-wise
  keyval(1, list(Reduce('+', A)))
}
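A single summing reducer is enough because the cross-product of a row-partitioned matrix equals the sum of the cross-products of its chunks. The following is a minimal local sketch in base R, with no Hadoop involved; the synthetic matrix `X` and the chunk size of 25 rows are assumptions chosen purely for illustration:

```r
# Synthetic stand-in for the predictor matrix (hypothetical data)
set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)

# Split the rows into four blocks, mimicking the chunks handed to mappers
chunk_id <- rep(1:4, each = 25)
chunks <- lapply(split(seq_len(nrow(X)), chunk_id),
                 function(rows) X[rows, , drop = FALSE])

# "Map" step: one partial cross-product per chunk
partials <- lapply(chunks, function(Xr) t(Xr) %*% Xr)

# "Reduce" step: sum the partial matrices
XtX_chunked <- Reduce('+', partials)

# Compare with the cross-product computed on the full matrix
all.equal(XtX_chunked, t(X) %*% X)
```

Since matrix addition is associative and commutative, the same summing function can also be applied to intermediate groups of partial results before the final sum, which is what makes it usable as a combiner.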
2nd map-reduce to calculate t(X)*y
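mapper2 = function(., Xr) {
  # The first column of the chunk is the response
  yr <- Xr[, 1]
  # The remaining columns are the predictors
  Xr <- Xr[, -1]
  # Emit the chunk's partial product t(Xr) %*% yr under a single key
  keyval(1, list(t(Xr) %*% yr))
}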
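Both jobs rely on the same decomposition. Writing $X_r$ and $y_r$ for the rows of $X$ and $y$ that fall into chunk $r$ (notation introduced here for illustration only), and assuming $X^\top X$ is invertible, the quantities needed for the least-squares estimate split into per-chunk sums:

```latex
X^\top X = \sum_r X_r^\top X_r, \qquad
X^\top y = \sum_r X_r^\top y_r, \qquad
\hat{\beta} = \left(X^\top X\right)^{-1} X^\top y .
```

Each mapper computes one term of a sum, and the reducer adds the terms, so only small matrices (one row and column per predictor) ever leave the cluster.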
Calculate t(X)*X
XtX <- values(
  from.dfs(
    mapreduce(
      input = X1,
      map = mapper,
      reduce = reducer,
      combine = TRUE)))[[1]]
Calculate t(X)*y
Xty <- values(
  from.dfs(
    mapreduce(
      input = X1,
      map = mapper2,
      reduce = reducer,
      combine = TRUE)))[[1]]
Solution
beta<-solve(XtX, Xty)
beta
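As a sanity check, the normal-equations solution can be compared with R's built-in `lm()` on a small synthetic data set. This is a local sketch rather than part of the Hadoop workflow; the data and the names `X`, `y`, `beta_true`, and `fit` are assumptions made for illustration. Because the mappers never add a column of ones, the model has no intercept, which corresponds to `- 1` in the `lm()` formula:

```r
# Synthetic regression problem with known coefficients (hypothetical data)
set.seed(42)
X <- matrix(rnorm(200 * 3), ncol = 3)
beta_true <- c(2, -1, 0.5)
y <- as.numeric(X %*% beta_true + rnorm(200, sd = 0.1))

# Same final step as above: solve t(X) %*% X %*% beta = t(X) %*% y
XtX <- t(X) %*% X
Xty <- t(X) %*% y
beta <- solve(XtX, Xty)

# Reference fit without an intercept
fit <- lm(y ~ X - 1)

# The two estimates should agree up to floating-point error
all.equal(as.numeric(beta), as.numeric(coef(fit)))
```

If an intercept is wanted, one option would be to prepend a column of ones to `Xr` in both mappers, so that the extra coefficient is estimated along with the others.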