Aggregating CSV Data (Spatial Binning)
This tutorial shows you how to take a CSV file containing latitude and longitude information, copy it to HDFS, and aggregate the dataset into bins for a useful visual analysis.
View the corresponding blog post.
- The sample data being used are 2013 NYC taxi data. This dataset is freely available for download. For this tutorial, we will only download trip_data_1.csv.zip for simplicity.
- Download trip_data_1.csv.zip, unzip it, and store it on your cluster. I am storing mine in `gis-tools-for-hadoop/samples/data/taxi-data`.
- Open your terminal (on Windows, use Cygwin) and move the folder containing the data from the local directory (`gis-tools-for-hadoop/samples/data/taxi-data`) to HDFS by using the put command. To do this, make a directory (if necessary), and then move the folder containing the CSV:

  ```
  # make a directory if needed
  hadoop fs -mkdir taxidemo

  # hadoop fs -put /path/on/localsystem /path/to/hdfs
  hadoop fs -put gis-tools-for-hadoop/samples/data/taxi-data taxidemo

  # check that it worked:
  hadoop fs -ls taxidemo
  ```

  You should see something like:

  ```
  Found 1 items
  drwxr-xr-x   - sarah hdfs          0 2015-03-19 12:00 taxidemo/taxi-data
  ```
- Start hive in your terminal and add the appropriate jars (similar to the sample example):

  ```
  hive

  add jar
    gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar
    gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar;
  ```
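  If you want to confirm that both jars were registered in the current Hive session, the `list jars;` resource command prints what has been added. This is an optional check of my own, not part of the original steps:

  ```
  -- optional: confirm the jars were added to this Hive session
  list jars;
  ```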
- Drop the table in case it already exists, and then create a new table that the taxi data will reside in:

  ```
  drop table taxi_demo;

  CREATE EXTERNAL TABLE taxi_demo (
    medallion STRING, hack_license STRING, vendor_id STRING, rate_code STRING,
    store_and_fwd_flag STRING, pickup_datetime STRING, dropoff_datetime STRING,
    passenger_count DOUBLE, trip_time_in_secs DOUBLE, trip_distance DOUBLE,
    pickup_longitude DOUBLE, pickup_latitude DOUBLE,
    dropoff_longitude DOUBLE, dropoff_latitude DOUBLE)
  ROW FORMAT delimited fields terminated by ','
  STORED AS textfile
  tblproperties ("skip.header.line.count"="1");
  ```

  I found the schema of the data by typing the following in the terminal (not in hive)

  ```
  head -2 trip_data_1.csv > header_taxi.csv
  ```

  and then viewing the resulting CSV. You can always find out what folder you are currently in by typing `pwd`.

- Check that the table was made correctly:

  ```
  describe taxi_demo;
  ```
- Load the taxi CSV data into the table:

  ```
  LOAD DATA INPATH 'taxidemo/taxi-data/trip_data_1.csv' OVERWRITE INTO TABLE taxi_demo;
  ```
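  As an optional sanity check (my addition, not part of the original walkthrough), you can look at a few rows to make sure the columns line up with the CSV:

  ```
  -- optional: spot-check a few loaded rows
  SELECT medallion, pickup_datetime, dropoff_longitude, dropoff_latitude
  FROM taxi_demo
  LIMIT 5;
  ```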
- Create the temporary functions that will be used in aggregating bins:

  ```
  create temporary function ST_Bin as 'com.esri.hadoop.hive.ST_Bin';
  create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
  create temporary function ST_BinEnvelope as 'com.esri.hadoop.hive.ST_BinEnvelope';
  ```

  A full list of Hive UDFs can be found here.
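  Before running the full aggregation, it can help to smoke-test the new functions on a single row. The query below is a small optional check of my own, not part of the original tutorial:

  ```
  -- optional: confirm the spatial UDFs resolve by binning a single dropoff point
  SELECT ST_Bin(0.001, ST_Point(dropoff_longitude, dropoff_latitude)) AS bin_id
  FROM taxi_demo
  LIMIT 1;
  ```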
- Try out the aggregation:

  ```
  FROM (SELECT ST_Bin(0.001, ST_Point(dropoff_longitude, dropoff_latitude)) bin_id, *
        FROM taxi_demo) bins
  SELECT ST_BinEnvelope(0.001, bin_id) shape, COUNT(*) count
  GROUP BY bin_id;
  ```

  The value 0.001 refers to 0.001 degrees (the unit the data is in). This can easily be changed if you want more or less detail.
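  For example, if 0.001-degree bins are too fine for an overview map, the same query with a coarser bin size (0.01 degrees here, purely as an illustration) produces fewer, larger bins:

  ```
  -- the same aggregation with a coarser, illustrative 0.01-degree bin size
  FROM (SELECT ST_Bin(0.01, ST_Point(dropoff_longitude, dropoff_latitude)) bin_id, *
        FROM taxi_demo) bins
  SELECT ST_BinEnvelope(0.01, bin_id) shape, COUNT(*) count
  GROUP BY bin_id;
  ```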
- If it ran without errors, create a new hive table to save the results to:

  ```
  drop table taxi_agg;

  CREATE TABLE taxi_agg (area BINARY, count DOUBLE)
  ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
  STORED AS INPUTFORMAT 'com.esri.json.hadoop.UnenclosedEsriJsonInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
  ```

  In 'com.esri.json.hadoop.UnenclosedEsriJsonInputFormat', the choice of "Unenclosed" or "Enclosed" must match what is used in the JSON to Features tool in Step 4.
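  As an optional check of my own, describing the new table should show the two columns defined above:

  ```
  -- optional: confirm the output table exists with the expected schema
  describe taxi_agg;
  ```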
- Rerun the query, and save the results to the new table `taxi_agg`:

  ```
  FROM (SELECT ST_Bin(0.001, ST_Point(dropoff_longitude, dropoff_latitude)) bin_id, *
        FROM taxi_demo) bins
  INSERT OVERWRITE TABLE taxi_agg
  SELECT ST_BinEnvelope(0.001, bin_id) shape, COUNT(*) count
  GROUP BY bin_id;
  ```
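  Before moving to ArcMap, one last optional check (again my own addition) is to confirm that the aggregation actually wrote rows to the table:

  ```
  -- optional: confirm the aggregation produced output rows
  SELECT COUNT(*) FROM taxi_agg;
  ```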
- To view the new table results in ArcMap, follow steps 5 through 9 in the Geoprocessing tools for Hadoop tutorial.