
(BigCAT) Spark Downsampler

Command-line tool for downsampling label data (stored as LabelMultisetTypes in an N5 dataset) with Spark.

Compile

Because this repository depends on a branch of BigCAT that is not yet merged into master, compiling from source requires checking out and building that branch first (which in turn requires the latest version of N5, which must also be compiled).

  • Clone N5 to any location.

  • Use Maven to install N5 1.1.4-SNAPSHOT into your local repository:

mvn clean install

  • Then, clone shrucis1/bigcat to a location of your choice.

  • Switch to the n5cacheloader branch, which contains the necessary changes:

git checkout n5cacheloader

  • Use Maven to install this branch of BigCAT into your local repository:

mvn clean install

  • Finally, clone this repository; it should now compile. (A consolidated sketch of the whole sequence follows this list.)

  • To make a "fat jar" with all dependencies added, run:

mvn clean compile assembly:single
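
For convenience, the whole sequence can be run as the following shell sketch. The saalfeldlab/n5 repository URL is an assumption about where N5 lives; the fork and branch names are the ones given above.

# Build and install N5 first (repository location assumed)
git clone https://github.com/saalfeldlab/n5.git
cd n5
mvn clean install
cd ..

# Build and install the n5cacheloader branch of the BigCAT fork
git clone https://github.com/shrucis1/bigcat.git
cd bigcat
git checkout n5cacheloader
mvn clean install
cd ..

# Build this repository, producing the fat jar under target/
git clone https://github.com/nthistle/bigcat-spark-downsampler.git
cd bigcat-spark-downsampler
mvn clean compile assembly:single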

Downsampling

Downsampling Operation

Downsampling works by merging groups of LabelMultisetEntryLists:

For instance, consider a block of pixels where each list has a single entry with a count of 1 and a label of either 83 or 64. The block is downsampled into a single list with two entries: one with label 83 and the total count of the pixels labeled 83, the other with label 64 and its corresponding count.

The motivation for this method of downsampling is that it loses no information except where each label occurs within a downsampled block.

When converting to a color for visualization purposes, this method also allows a weighted color to be calculated from the colors of each label within a downsampled pixel.
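
To make the merging rule concrete, here is a minimal, self-contained Java sketch. It models each pixel as a plain Map from label ID to count instead of the actual LabelMultisetEntryList API, so the class and method names below are illustrative, not the tool's real code.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeSketch {
    // Merge the per-pixel label counts of a downsampling block into a
    // single multiset: counts for the same label are summed.
    static Map<Long, Integer> merge(List<Map<Long, Integer>> block) {
        Map<Long, Integer> merged = new HashMap<>();
        for (Map<Long, Integer> pixel : block) {
            pixel.forEach((label, count) -> merged.merge(label, count, Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        // The example from the text: every pixel holds one entry with
        // count 1 and a label of either 83 or 64.
        List<Map<Long, Integer>> block = List.of(
                Map.of(83L, 1), Map.of(83L, 1), Map.of(64L, 1), Map.of(83L, 1));
        System.out.println(merge(block)); // prints {64=1, 83=3} (order may vary)
    }
}

The result is a single list with two entries, labels 83 and 64, each carrying its total count, exactly as described above.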

Usage

java -Dspark.master=[spark_master] -jar target/bigcat-spark-downsampler-0.0.1-SNAPSHOT-jar-with-dependencies.jar [args]

Arguments

  • --compression, -c Compression type to use in output N5 dataset. Default: RAW

  • --factor, -f Factor by which to downscale the input image (Required)

  • --idatasetname, --idata, -id Input dataset name (N5 relative path from group) (Required)

  • --igroupname, --igroup, -ig Input group name (N5 group) (Required)

  • --odatasetname, --odata, -od Output dataset name (N5 relative path from group) (Required)

  • --ogroupname, --ogroup, -og Output group name (N5 group). Defaults to input group name

  • --parallelblocks, -pb Size of the blocks (in cells) to parallelize with Spark. Defaults to [16, 16, ... 16]

Note that the spark.master property must be set when running as well. See the Spark documentation on master URLs (https://spark.apache.org/docs/latest/submitting-applications.html#master-urls) for more information.

Example

java -Dspark.master=local[*] -jar target/bigcat-spark-downsampler-0.0.1-SNAPSHOT-jar-with-dependencies.jar -ig ~/cremi-n5/ -id sampleA-fullres -od sampleA-8x8x2 -f 8,8,2 -c GZIP -pb 4,4,4

This would downsample the N5 label dataset at ~/cremi-n5/sampleA-fullres by a factor of 8x8x2 and write it to an N5 dataset (with GZIP compression) at ~/cremi-n5/sampleA-8x8x2.

Note that the output group name is not specified, and defaults to the same as the input group name.

Also, spark.master is set to local[*], which, according to Spark documentation, will

Run Spark locally with as many worker threads as logical cores on your machine.

The parallel block size determines the size of each unit of work Spark parallelizes over; 4x4x4 yields blocks of 64 cells each. It has no effect on the output dataset, but if it is set too high relative to the input dataset size, there may not be enough blocks to give every worker thread something to work on (wasting the available parallelism).
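
As a rough illustration (numbers are hypothetical): an input of 64x64x64 cells split with -pb 16,16,16 yields 4 x 4 x 4 = 64 parallel blocks, enough to keep the 8 threads of a typical local[*] run busy, whereas -pb 64,64,64 would yield a single block and use only one thread.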
