Basic instructions for configuring and running the set of experiments with Peel.
| System | KMeans | Grouping | Connected Components |
|---|---|---|---|
| Spark | ✓ | ✓ | ❌ |
| Flink | ✓ | ✓ | ❌ |
Check out and install the Peel package in your local Maven repository:
# clone Peel project
git clone git@github.com:citlab/peel.git
# install Peel modules locally (run from inside the cloned folder)
cd peel
mvn install -DskipTests
You need to configure a Procrustes distribution. Check out the Procrustes sources and do:
# create binary package
mvn package -DskipTests
# move binary package into a separate folder
cp procrustes-dist/target/procrustes-dist-1.0-SNAPSHOT-bin $PROCRUSTES_INSTALL_DIR
The distribution package has the following structure:
config # env. configuration and experiment fixtures
datagens # data generators
datasets # data sets
downloads # system downloads (empty)
jobs # experiment jobs
\-- procrustes-flink-${VERSION}.jar # Flink experiment jobs
\-- procrustes-spark-${VERSION}.jar # Spark experiment jobs
lib # Peel libraries
log # Peel log
peel # Peel CLI tool
results # experiment results
systems # bash utils
You need to create a configuration folder for each host where you plan to run Procrustes experiments.
Look up the $HOSTNAME of your developer machine and create a corresponding folder under config.
Use the localhost-sample configuration as a starting point and adapt its values to suit your environment:
cd config
mkdir $HOSTNAME
cp -R localhost-sample/* $HOSTNAME
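The copy step above overwrites files if the folder already exists. A minimal sketch of a guarded variant follows; `ensure_host_config` is a hypothetical helper for illustration, not part of Peel or Procrustes, and it assumes the template folder lives next to the host folders under config:

```shell
# Sketch: create a host configuration folder from a template, but only if
# no folder (or link) for that host exists yet.
# ensure_host_config is a hypothetical helper, not part of Peel.
ensure_host_config() {
  local config_dir="$1"                    # path to the config folder
  local host="$2"                          # host name, e.g. "$HOSTNAME"
  local template="${3:-localhost-sample}"  # template folder to copy from

  if [ -e "$config_dir/$host" ]; then
    echo "config for $host already exists, leaving it untouched"
    return 0
  fi
  if [ ! -d "$config_dir/$template" ]; then
    echo "template $template not found in $config_dir" >&2
    return 1
  fi
  mkdir -p "$config_dir/$host"
  # "template/." copies the folder contents, including hidden files
  cp -R "$config_dir/$template/." "$config_dir/$host"
}

# Example (from the distribution root): ensure_host_config config "$HOSTNAME"
```

Existing configurations are never modified, so the helper is safe to call repeatedly.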
For usage on the wally cluster, you can just create a soft link to wally:
cd config
ln -s wally $HOSTNAME
If you want to set up a configuration for a different distributed environment, create a folder with the $HOSTNAME of the environment master.
Use the wally configuration as a starting point:
cd config
mkdir $HOSTNAME
cp -R wally/* $HOSTNAME
The util/sync folder contains bash scripts that use rsync to deploy and synchronize data between the developer machine and the master of the distributed environment.
To use them, first configure the remote host values in the util/sync/${host}.config file. After that, you can run:
util/sync/fetch_all.sh $host_name # fetches the Procrustes package from $host_name
util/sync/fetch_log.sh $host_name # fetches the Procrustes logs from $host_name
util/sync/push_all.sh $host_name  # pushes the Procrustes package to $host_name
The following Peel commands start and stop individual systems (and hopefully save some time).
# HDFS 1
./peel sys:setup hdfs-1 # starts HDFS-1
./peel sys:teardown hdfs-1 # stops HDFS-1
# HDFS 2
./peel sys:setup hdfs-2 # starts HDFS-2
./peel sys:teardown hdfs-2 # stops HDFS-2
# Zookeeper
./peel sys:setup zookeeper # starts Zookeeper
./peel sys:teardown zookeeper # stops Zookeeper
# Flink
./peel sys:setup flink # starts Flink
./peel sys:teardown flink # stops Flink
# Spark
./peel sys:setup spark # starts Spark
./peel sys:teardown spark # stops Spark
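When several systems are needed at once, the commands above can be wrapped in two small helpers. This is a sketch only: `setup_systems`, `teardown_systems`, and the `PEEL` variable are names introduced here for illustration, and the assumption that systems should be stopped in the reverse of their start order (for example Flink before the HDFS instance it uses) is not stated in the Peel documentation quoted here.

```shell
# Sketch: start several systems in order and stop them in reverse order.
# PEEL, setup_systems and teardown_systems are hypothetical names.
PEEL="${PEEL:-./peel}"

setup_systems() {
  local sys
  for sys in "$@"; do
    # stop at the first system that fails to start
    "$PEEL" sys:setup "$sys" || return 1
  done
}

teardown_systems() {
  local sys reversed=""
  # build the reversed list; system names contain no whitespace
  for sys in "$@"; do
    reversed="$sys $reversed"
  done
  for sys in $reversed; do
    "$PEEL" sys:teardown "$sys"
  done
}

# Example: setup_systems hdfs-2 flink && teardown_systems hdfs-2 flink
```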
You can set up the systems, run the experiment, and tear down the systems using three separate commands:
# setup all systems upon which this experiment depends
./peel exp:setup kmeans.default kmeans.single-run
# execute run no. 1 of the experiment
./peel exp:run kmeans.default kmeans.single-run --just --run 1
# teardown all systems upon which this experiment depends
./peel exp:teardown kmeans.default kmeans.single-run
Alternatively, you can do it all in one step:
# setup, execute run no. 1, and teardown in one step
./peel exp:run kmeans.default kmeans.single-run --run 1
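The three-command form pays off when more than one run is executed against the same system setup. A minimal sketch, assuming the experiment defines at least as many runs as requested; `run_experiment_n_times` and the `PEEL` variable are hypothetical names, while the Peel commands and the `--just` and `--run` flags are the ones shown above:

```shell
# Sketch: set up once, execute runs 1..N with --just, then tear down.
# PEEL and run_experiment_n_times are hypothetical names.
PEEL="${PEEL:-./peel}"

run_experiment_n_times() {
  local suite="$1" experiment="$2" runs="$3" run
  "$PEEL" exp:setup "$suite" "$experiment" || return 1
  run=1
  while [ "$run" -le "$runs" ]; do
    # --just skips the per-run setup and teardown of the systems
    "$PEEL" exp:run "$suite" "$experiment" --just --run "$run" || break
    run=$((run + 1))
  done
  # always tear down, even if a run failed
  "$PEEL" exp:teardown "$suite" "$experiment"
}

# Example: run_experiment_n_times kmeans.default kmeans.single-run 3
```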
Logically connected experiments are organized in suites. To run all experiments in a suite, use the suite:run command:
./peel suite:run kmeans.default