A data pipeline that generates a master spreadsheet of all datasets shared by clusters in the Critical Zone Collaborative Network.
This workflow uses Snakemake (see the Snakemake documentation for installation instructions) as a pipelining tool to retrieve, clean, and munge the spreadsheet to maximize readability.
First, create a Conda environment with all the required packages by running the following command: conda env create -f environment.yaml
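After activating the new environment (conda activate <name>, where <name> is the environment name defined in environment.yaml), run the Snakemake pipeline with this command: snakemake --cores 1 -s Snakefile.smk --forceall
The --forceall flag reruns every rule even if its outputs already exist, and --cores 1 limits the run to a single core.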
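If you prefer to launch the pipeline from a script, the command above can be wrapped in a few lines of Python. This is only a sketch: the helper names `build_snakemake_command` and `run_pipeline` are hypothetical and not part of this repository, and the optional `-n` flag (Snakemake's dry run) is added here for convenience so you can preview the planned jobs first.

```python
import subprocess


def build_snakemake_command(snakefile="Snakefile.smk", cores=1,
                            force_all=True, dry_run=False):
    """Return the snakemake invocation as an argument list.

    Hypothetical helper; the defaults mirror the command in this README.
    """
    cmd = ["snakemake", "--cores", str(cores), "-s", snakefile]
    if force_all:
        cmd.append("--forceall")  # rerun every rule, even if outputs exist
    if dry_run:
        cmd.append("-n")  # only list the planned jobs, execute nothing
    return cmd


def run_pipeline(**options):
    """Run the pipeline; raises CalledProcessError if snakemake fails."""
    subprocess.run(build_snakemake_command(**options), check=True)
```

Calling `run_pipeline(dry_run=True)` from the repository root, inside the Conda environment, should print the job plan without running anything; `run_pipeline()` then performs the full run.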
When the jobs are done, the master spreadsheet containing all cluster datasets is written to a newly created out/ folder inside 3_munge/.
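To confirm that a run produced output, you can list what ended up in that folder. The snippet below is a minimal sketch: `list_outputs` is a hypothetical helper, and it assumes the script is run from the repository root so that the relative path `3_munge/out` resolves. It makes no assumption about the spreadsheet's file name.

```python
from pathlib import Path


def list_outputs(out_dir="3_munge/out"):
    """Return the sorted names of the files in the output folder.

    Hypothetical helper; the default path follows this README.
    """
    folder = Path(out_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"{folder} not found; did the pipeline finish?")
    return sorted(p.name for p in folder.iterdir() if p.is_file())
```

An empty list or a `FileNotFoundError` suggests the pipeline did not complete, in which case the Snakemake log is the first place to look.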