This project is an implementation of the Snowball algorithm on the distributed data processing platforms Apache Spark and Apache Flink.
In order to use the Spark version of the implementation, cd into MainTask-spark/MainTask and compile the project with mvn package.
Then you can use the following command to start the program:
$PATH_TO_SPARK_SUBMIT --total-executor-cores $EXECUTOR_CORES \
    --class de.hpi.fgis.dbda.textmining.maintask.App --master $SPARK_MASTER \
    target/maintask-0.0.1-SNAPSHOT.jar $OUTPUT_PATH $PATH_TO_SEED_TUPLES $INPUT_FILES
where the variables mean the following (a complete example invocation is given after this list):
$PATH_TO_SPARK_SUBMIT: the path to your spark-submit executable.
$EXECUTOR_CORES: the total number of executor cores the job may use across the cluster (passed to --total-executor-cores).
$SPARK_MASTER: the URL of the Spark master node (e.g., spark://host:7077).
$OUTPUT_PATH: the path to write the output (Organization, Location) tuples to.
$PATH_TO_SEED_TUPLES: the path to the seed tuples the algorithm needs. This should be a TSV file with one ORG \t LOC pair per line.
$INPUT_FILES: an arbitrary number of input files. In our tests the New York Times archive was used. The files should already be NER tagged and split into one sentence per line.
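For example, a seed tuples file could contain lines like the following (hypothetical entries, shown only to illustrate the ORG \t LOC format):

Microsoft \t Redmond
Deutsche Bank \t Frankfurt

A complete spark-submit invocation might then look like this; the Spark path, core count, master URL, and data paths below are placeholders, not values from our setup:

/opt/spark/bin/spark-submit --total-executor-cores 8 \
    --class de.hpi.fgis.dbda.textmining.maintask.App --master spark://master:7077 \
    target/maintask-0.0.1-SNAPSHOT.jar /data/output /data/seed_tuples.tsv /data/nyt/part-*.txt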
In order to use the Flink version of the implementation, cd into MainTask-flink/MainTask-flink and compile the project with mvn package.
Then you can use the following command to start the program:
$PATH_TO_FLINK run --parallelism $PARALLELISM \
    --class de.hpi.fgis.dbda.textmining.MainTask_flink.App \
    target/MainTask-flink-0.0.1-SNAPSHOT-flink-fat-jar.jar \
    --windowSize $WINDOWSIZE --minimalClusterSize $MINIMAL_CLUSTER_SIZE \
    --degreeOfMatchThreshold $DEGREE_OF_MATCH_THRESHOLD --similarityThreshold $SIMILARITY_THRESHOLD \
    --tupleConfidenceThreshold $TUPLE_CONFIDENCE_THRESHOLD --maxDistance $MAX_DISTANCE \
    --iterations $ITERATIONS $ALREADY_TAGGED \
    --output $OUTPUT_PATH --seedTuples $PATH_TO_SEED_TUPLES $INPUT_FILES
where the variables mean the following (for the algorithm-specific parameters, the values we used in our experiments are given in parentheses); a complete example invocation is given after this list:
$PATH_TO_FLINK: the path to your flink executable.
$PARALLELISM: the degree of parallelism the job is run with (passed to --parallelism).
$WINDOWSIZE: the size of the window of words that belong to a tuple's context. (5)
$MINIMAL_CLUSTER_SIZE: the minimum number of patterns a cluster must contain to be used further. (5)
$DEGREE_OF_MATCH_THRESHOLD: the similarity threshold for matching patterns with tuple contexts. (0.6)
$SIMILARITY_THRESHOLD: the similarity threshold for clustering the patterns. (0.4)
$TUPLE_CONFIDENCE_THRESHOLD: the confidence threshold for filtering out candidate tuples. (0.9)
$MAX_DISTANCE: the maximum distance between found Organizations and Locations. (5)
$ITERATIONS: the number of iterations the algorithm is run for.
$ALREADY_TAGGED: pass the --alreadyTagged flag if the input files are already split into sentences and NER tagged; omit it otherwise.
$OUTPUT_PATH: the path to write the output (Organization, Location) tuples to.
$PATH_TO_SEED_TUPLES: the path to the seed tuples the algorithm needs. This should be a TSV file with one ORG \t LOC pair per line.
$INPUT_FILES: an arbitrary number of input files. In our tests the New York Times archive was used. If the files are already NER tagged and split into one sentence per line, pass the --alreadyTagged flag described above.
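Putting this together, an invocation with the algorithm parameters from our experiments might look like this; the flink path, parallelism, iteration count, and data paths below are placeholders, not values from our setup:

/opt/flink/bin/flink run --parallelism 8 \
    --class de.hpi.fgis.dbda.textmining.MainTask_flink.App \
    target/MainTask-flink-0.0.1-SNAPSHOT-flink-fat-jar.jar \
    --windowSize 5 --minimalClusterSize 5 \
    --degreeOfMatchThreshold 0.6 --similarityThreshold 0.4 \
    --tupleConfidenceThreshold 0.9 --maxDistance 5 \
    --iterations 3 --alreadyTagged \
    --output /data/output --seedTuples /data/seed_tuples.tsv /data/nyt/part-*.txt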