
Simple ETLLib Tutorial


Welcome to a short guide on how to install, configure, and use ETLLib. For the purposes of this tutorial, we will assume you have a single CSV file with tens of thousands of rows. You can use ETLLib to go from that one large CSV file to many individual JSON files that you can then feed to tika-similarity.

Preparation: Turning your CSV into a TSV

So the first step is to take your CSV and turn it into a TSV. We'll use a Python solution for this, which works the same whether you are on Linux, Mac, or Windows. Note that for this step you will need Python 2.7. You can use pyenv to get a Python 2.7 version (I personally used 2.7.18); pyenv works on both Mac and *nix systems. See here.
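For example, assuming you already have pyenv installed and on your PATH, getting 2.7.18 looks like this:

  1. pyenv install 2.7.18
  2. pyenv local 2.7.18
  3. python --version (should print Python 2.7.18)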

The expanded tutorial with concise explanations and screenshots is here.

  1. Install CSVKit. pip install csvkit==0.9.2
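You can sanity-check the install with pip show csvkit, which should report version 0.9.2.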

If your dataset is split into many CSV files of 10k rows each, run the command below on each part; either way, you will want to process your entire dataset of hundreds of thousands of rows. Here's the command to quickly generate a TSV from a 10k-row source CSV (assume it's called 10000 data.csv):

  1. csvformat -T 10000\ data.csv > 10000\ data.tsv
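If you'd prefer to skip csvkit, here is a minimal standard-library sketch of the same conversion (same file names as above; note the binary-mode opens that Python 2.7's csv module expects):

# csv_to_tsv.py -- minimal sketch of the csvformat -T step above,
# using only the Python 2.7 standard library.
import csv

with open('10000 data.csv', 'rb') as src, open('10000 data.tsv', 'wb') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst, delimiter='\t')
    for row in reader:
        writer.writerow(row)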

OK, now we have the TSV file.

Installing ETLLib

Let's grab it:

  1. git clone git@github.com:chrismattmann/etllib.git

Install libmagic

Per the ETLLib instructions, you need libmagic installed. Since I was on a Mac, I installed it with brew.

  1. brew install libmagic (*nix systems will vary).
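(On Debian/Ubuntu, something like sudo apt-get install libmagic-dev should do it, and on Fedora/RHEL, sudo dnf install file-devel; the exact package name varies by distribution, so treat these as starting points.)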

Once libmagic is installed, this command should work:

  1. man libmagic

OK, with libmagic installed, you're ready to install ETLLib.

Install ETLLib

  1. cd etllib && python setup.py install (make sure again that you are using Python 2.7.x; as I noted, mine is 2.7.18)
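If you want a quick sanity check that the console scripts landed on your PATH, which tsvtojson and which repackage should both print a path.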

ETLLib should install fine at this point, and once it has, you will have access to the commands listed on the ETLLib Home Page.

In particular, we will use two commands from this library: tsvtojson and repackage. The first command takes a big TSV file of objects and converts it into one big aggregate JSON file of objects. The second command splits that big aggregate JSON file up into individual JSON files.

To use tsvtojson you will need two configuration files, which I'll provide here. The first is encoding.conf and the second is colheaders.conf. encoding.conf tells the command which supported text encodings are present in the file. colheaders.conf tells the command which column header names to use as the JSON field names for each row.

Sample colheaders.conf

storyPrimaryID
storyID
userID
userPrimaryID
gender
age
title
narrative
media
accountCreateDte
interests

(This assumes an 11-column schema with those headers; this tutorial was sourced from a social media sample dataset called pixstory with this schema. Your own schema may vary.)

Sample encoding.conf file

utf-8
us-asci

OK, so for me, I dropped those two files into a folder called conf, and then I created two data directories: aggregate-json to hold the aggregate JSON output from tsvtojson, and json to hold the 10k individual JSON files output by repackage.
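In commands, that directory layout is just:

  1. mkdir -p conf aggregate-json json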

Run tsvtojson on your TSV file

So now you're ready to run the tsvtojson command on your TSV file.

  1. tsvtojson -t 10000\ data.tsv -j aggregate-json/aggregate.json -c conf/colheaders.conf -o pixstoryposts -e conf/encoding.conf -s 0.8 -v

On my computer, it output:

tsvtojson -t 10000\ data.tsv -j aggregate-json/aggregate.json -c conf/colheaders.conf -o pixstoryposts -e conf/encoding.conf -s 0.8 -v
['utf-8', 'us-asci']
['storyPrimaryID', 'storyID', 'userID', 'userPrimaryID', 'gender', 'age', 'title', 'narrative', 'media', 'accountCreateDte', 'interests']
Deduping list of structs. Count: [10001]
After dedup. Count: [10001]
Near duplicates detection.
Filtered 0 near duplicates.
After near duplicates. Count: [10000]
Writing output file: [aggregate-json/aggregate.json]

Let's break down what you are seeing. First, when I ran the command, I gave it the two conf files. Then I gave it the parameter -o pixstoryposts: when you create a big aggregate JSON file, you need something to call the objects in it, and I called them pixstoryposts. You also see that I provided the -s 0.8 flag. During processing, you can use Jaccard similarity to drop duplicates based on a similarity threshold between 0 and 1; I told it to drop any entries that were 0.8 or more similar under that metric. You'll note that it dropped 0 near duplicates. Finally, I passed the -v flag for verbose output, just to get all the printed messages.
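For intuition, the Jaccard similarity of two texts is the size of the intersection of their token sets divided by the size of the union. Here is a minimal sketch of the metric itself (illustrative only; etllib's near-duplicate pass may tokenize differently):

# Minimal sketch of Jaccard similarity over whitespace tokens.
# Illustrative only -- not etllib's exact implementation.
def jaccard(a, b):
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / float(len(ta | tb))

print(jaccard('the cat sat', 'the cat sat down'))  # prints 0.75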

Run repackage

You can now confirm that you have generated the aggregate-json/aggregate.json file. If you have that, you are ready to run repackage.
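If you want to peek inside it, and assuming the aggregate file keys the list of objects by the -o object name you passed (pixstoryposts here; that key name is my assumption), a quick count looks like:

# peek.py -- count the objects in the aggregate JSON file.
# Assumes the top-level key matches the -o object name (an assumption).
import json

with open('aggregate-json/aggregate.json') as f:
    data = json.load(f)
print(len(data['pixstoryposts']))  # expect about 10000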

Here is the command to run:

  1. repackage -j ../aggregate-json/aggregate.json -o pixstoryposts -v

So let's break down the command, which I ran from inside the directory where I want the output files to reside: I cd'd into json first, and then ran the command from there.

I'm not going to paste the full output from the command, which looks like a bunch of:

Writing json file: [/Users/mattmann/src/dsci550/data/json/09716830-9698-4f51-a707-b847d2c2aa7c.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/4f05fd11-d6df-4dda-a7ce-a2c8f8bc7ccf.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/31b05c6e-51fc-41e0-982f-a85b8ee34fb6.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/ad33411f-8e87-4946-b4f8-c75f818891b0.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/807cecec-4cf0-4de6-8ef1-51a59305bcfd.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/9f7115c5-a9fe-4721-bf21-d4213cd5b19f.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/c0553aff-8ee3-4077-86af-1beb5bbb99ab.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/a5cb7398-c083-4c00-88ae-b7f601abdddc.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/98c742f6-c799-4b2f-a6b8-e604af6a3f9a.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/ae684b2b-0de0-4850-8683-df347d638c35.json]

The -j ../aggregate-json/aggregate.json argument is the JSON file with the 10k objects that you want to split into 10k individual files. Then you pass -o pixstoryposts (the object name from the tsvtojson command). Finally, I passed the -v flag for verbosity.
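A quick way to confirm the split worked, run from inside the json directory:

  1. ls | wc -l (should print about 10000)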

That's it!

You then have 10k JSON files and are ready for tika-similarity. Hope this guide was helpful; try it out! That should take care of your issues with ETLLib. Note that I didn't have to change anything in the code, and that it works with both Python 2.7 and Python 3.