Before running anything, make sure the pipeline is up to date by doing the following:
Go into the splicing pipeline folder
cd /projects/anczukow-lab/splicing_pipeline/splicing-pipelines-nf/
Update your local copy of the pipeline. Note: you will need to enter your GitHub username and password.
git pull
Note: if you have not successfully completed the pipeline test, see here
This pipeline can be run on Sumner in three ways:
- Input a reads.csv file to input fastq files and run the pipeline in its entirety.
- Input a reads.csv file to input fastq files and run the pipeline until the STAR mapping step, with the --test parameter set to true.
- Input a bams.csv file to input BAMs and run the steps of the pipeline following STAR mapping (Stringtie and rMATS).
The cacheDir stores Singularity images. It is set in splicing-pipelines-nf/conf/executors/sumner.config. For non-Anczukow users, this should be changed to a home directory.
All analyses should be run in /projects/anczukow-lab/NGS_analysis. If your dataset already has a folder here, use that directory. Otherwise, create a new directory with the same name as the one found in the /projects/anczukow-lab/fastq_files/ folder (example dataset directory: /projects/anczukow-lab/NGS_analysis/Dataset_4_MYC_MCF10A).
Create a new run directory within the appropriate dataset directory, named in the format runNumber_initials_date, for example run1_LU_20200519 (example run directory: /projects/anczukow-lab/NGS_analysis/Dataset_4_MYC_MCF10A/run1_LU_20200519).
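The naming convention above can be scripted so the date is always formatted consistently. This is a minimal sketch; `make_run_dir` is a hypothetical helper and is not part of the pipeline.

```shell
# Hypothetical helper: create <dataset_dir>/run<runNumber>_<initials>_<date>
# and print the resulting path.
make_run_dir() {
  dataset_dir="$1"                    # e.g. /projects/anczukow-lab/NGS_analysis/Dataset_4_MYC_MCF10A
  run_number="$2"                     # e.g. 1
  initials="$3"                       # e.g. LU
  run_date="${4:-$(date +%Y%m%d)}"    # defaults to today, formatted like 20200519
  run_dir="${dataset_dir}/run${run_number}_${initials}_${run_date}"
  mkdir -p "$run_dir" && printf '%s\n' "$run_dir"
}
```

For example, `cd "$(make_run_dir /projects/anczukow-lab/NGS_analysis/Dataset_4_MYC_MCF10A 1 LU)"` would create today's run directory and move into it.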
Input reads are specified by the reads input parameter, which gives the path to a CSV file. The format of the CSV file varies slightly with the data; see examples for:
- single-end - must contain columns for sample_id and fastq
- paired-end - must contain columns for sample_id, fastq1 and fastq2
The reads.csv column names must match the above [single-end] and [paired-end] examples. The sample_id can be anything, but each one must be unique. The fastq column(s) should contain the path to the FASTQ files (publicly accessible ftp, s3 and gs links are also accepted). You can create this file on your local computer in Excel and use WinSCP to move it to Sumner, or create it with nano on the cluster.
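As an illustration of the layout described above, the sketch below writes a paired-end reads.csv and checks the two rules stated in the text: the header must match, and every sample_id must be unique. Both function names, the sample names and the FASTQ paths are placeholders, and the column order is assumed to follow the order in which the columns are listed above.

```shell
# Write a paired-end reads.csv to the path given as $1.
# Sample names and FASTQ paths are placeholders.
write_example_reads_csv() {
  cat > "$1" <<'EOF'
sample_id,fastq1,fastq2
control_rep1,/path/to/control_rep1_R1.fastq.gz,/path/to/control_rep1_R2.fastq.gz
case_rep1,/path/to/case_rep1_R1.fastq.gz,/path/to/case_rep1_R2.fastq.gz
EOF
}

# Check the header and that every sample_id is unique.
check_reads_csv() {
  # Strip a trailing carriage return in case the file was saved by Excel.
  header="$(head -n 1 "$1" | tr -d '\r')"
  case "$header" in
    sample_id,fastq|sample_id,fastq1,fastq2) ;;
    *) echo "unexpected header: $header" >&2; return 1 ;;
  esac
  # uniq -d prints only the values that occur more than once.
  dupes="$(tail -n +2 "$1" | cut -d, -f1 | sort | uniq -d)"
  if [ -n "$dupes" ]; then
    echo "duplicate sample_id: $dupes" >&2
    return 1
  fi
}
```

Running `check_reads_csv reads.csv` before submitting a job catches a renamed column or a repeated sample_id early.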
There should be one reads.csv file per dataset. If your dataset already has a reads.csv file, proceed to step 2.
Each rMATS comparison must be specified with a comparison name as well as the sample_id as specified in the reads file. See example rmats_pairs.txt. Each line in the file corresponds to an rMATS execution. The first column corresponds to a unique name/id for the rMATS comparison (this will be used for the output folder/file names).
- Replicates should be comma separated, and the samples for the b1/b2 files (i.e. control and case) should be space separated. b1 is the control and b2 is the case. See examples:
comparison_id[space]sample1[space]sample2

comparison1_id[space]sample1[space]sample2
comparison2_id[space]sample3[space]sample4

comparison1_id[space]sample1replicate1,sample1replicate2,sample1replicate3[space]sample2replicate1,sample2replicate2,sample2replicate3
comparison2_id[space]sample3replicate1,sample3replicate2,sample3replicate3[space]sample4replicate1,sample4replicate2,sample4replicate3

comparison_id[space]sample1,sample2,sample3
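Because every sample named in rmats_pairs.txt has to match a sample_id in the input CSV, a quick cross-check can catch typos before a run. The sketch below is a hypothetical helper, not part of the pipeline, and assumes sample_id is the first column of the CSV.

```shell
# Example rmats_pairs.txt line (name, then b1 = control replicates, then b2 = case replicates):
#   MYC_vs_control control_rep1,control_rep2 MYC_rep1,MYC_rep2

# Hypothetical helper: check that every sample in rmats_pairs.txt ($1)
# appears in the sample_id column of the input CSV ($2).
check_rmats_pairs() {
  pairs_file="$1"
  input_csv="$2"
  # Assumes sample_id is the first CSV column; skip the header row.
  ids="$(tail -n +2 "$input_csv" | cut -d, -f1 | tr -d '\r')"
  rc=0
  while read -r comparison groups || [ -n "$comparison" ]; do
    [ -z "$comparison" ] && continue          # skip blank lines
    # Split the groups on both commas and spaces to get one sample per word.
    for sample in $(printf '%s' "$groups" | tr ', ' '\n\n'); do
      if ! printf '%s\n' "$ids" | grep -qxF -- "$sample"; then
        echo "$comparison: unknown sample_id $sample" >&2
        rc=1
      fi
    done
  done < "$pairs_file"
  return "$rc"
}
```

`check_rmats_pairs rmats_pairs.txt reads.csv` returns a non-zero status and names each unmatched sample if anything is misspelled.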
This config file will be specific to your user and analysis. You do not need to edit the pipeline code to configure the pipeline. Descriptions of all possible parameters and their default values can be found here and here.
To create your own custom config (to specify your input parameters) you can copy and edit this example config file.
VERY IMPORTANT NOTES
- Each time you run the pipeline, go through all possible parameters to ensure you are creating a config ideal for your data. If you do not specify a value for a parameter, the default will be used. All parameters used can be found in the log file. WHEN IN DOUBT, SPECIFY ALL PARAMETERS!
- You must name your config file NF_splicing_pipeline.config (as specified in main.pbs).
- Your NF_splicing_pipeline.config must be in the directory in which you are running your analysis.
- The readlength here should be the length of the reads. If the read length is not a multiple of 5 (e.g. 76 or 151), set readlength to the nearest multiple of 5 (e.g. 75 or 150). This extra base is an artifact of Illumina sequencing.
- To run the full pipeline, you must specify the following: reads.csv, rmats_pairs.txt, readlength, assembly_name, star_index, and reference gtf. File locations can be given as a relative path from the directory in which you run Nextflow, an absolute path or a link.
- The STAR indexes must be generated prior to executing the pipeline (this is a separate step).
- Currently, the two options for genomes are hg38 and mm10. If you wish to use a newer version of the genome, you will need to add this to the post-processing script.
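The readlength rule in the notes above is simple integer arithmetic. The sketch below shows one way to read the length of the first read from a FASTQ stream and round it to the nearest multiple of 5. Both function names are hypothetical, and the first helper assumes the first record's sequence sits on line 2 of the file.

```shell
# Print the length of the first read in a FASTQ stream on stdin.
# Assumes the first record's sequence is on line 2.
first_read_length() {
  awk 'NR == 2 { print length($0); exit }'
}

# Round a read length to the nearest multiple of 5:
# add 2, then integer-divide and multiply back.
round_readlength() {
  echo $(( ($1 + 2) / 5 * 5 ))
}
```

For example, `round_readlength 76` prints 75 and `round_readlength 151` prints 150. For a gzipped file, `zcat sample_R1.fastq.gz | first_read_length` gives the value to round (the file name is a placeholder).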
Ensure you have NF_splicing_pipeline.config in this directory.
Run the pipeline!
sbatch /projects/anczukow-lab/splicing_pipeline/splicing-pipelines-nf/main.pbs
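Since the job expects NF_splicing_pipeline.config in the directory it is submitted from, a small guard before the sbatch call can save a failed submission. This is only a sketch; `preflight` is a hypothetical helper and not part of the pipeline.

```shell
# Hypothetical helper: fail early if the run directory has no config file.
preflight() {
  run_dir="${1:-.}"    # defaults to the current directory
  if [ ! -f "$run_dir/NF_splicing_pipeline.config" ]; then
    echo "NF_splicing_pipeline.config not found in $run_dir" >&2
    return 1
  fi
}

# Usage on Sumner, from inside the run directory:
#   preflight . && sbatch /projects/anczukow-lab/splicing_pipeline/splicing-pipelines-nf/main.pbs
```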
Create a new run directory within the appropriate dataset directory, named in the format runNumber_initials_date, for example run1_LU_20200519 (example run directory: /projects/anczukow-lab/NGS_analysis/Dataset_4_MYC_MCF10A/run1_LU_20200519).
Input reads are specified by the bams input parameter, which gives the path to a CSV file.
- The file must contain columns for sample_id, bam, and bam.bai
The bams.csv column names must match the columns listed above. The sample_id can be anything, but each one must be unique. The bam column should contain the path to the BAM files. The bam.bai column should contain the path to the BAM.BAI files. You can create this file on your local computer in Excel and use WinSCP to move it to Sumner, or create it with nano on the cluster.
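When the BAM files already sit in one directory, bams.csv can be generated instead of typed by hand. The sketch below is a hypothetical helper that makes two assumptions: each index lives next to its BAM as `<name>.bam.bai`, and the BAM file name (minus `.bam`) is an acceptable sample_id.

```shell
# Hypothetical helper: print a bams.csv for every *.bam file in a directory.
# Assumes each index is stored next to its BAM as <name>.bam.bai.
make_bams_csv() {
  bam_dir="$1"
  echo "sample_id,bam,bam.bai"
  for bam in "$bam_dir"/*.bam; do
    [ -e "$bam" ] || continue                  # the glob matched nothing
    sample_id="$(basename "$bam" .bam)"        # sample_id taken from the file name
    echo "${sample_id},${bam},${bam}.bai"
  done
}
```

Usage would look like `make_bams_csv /path/to/bams > bams.csv` (the directory is a placeholder). The sample_id values produced this way are the names to use in rmats_pairs.txt.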
Supplying the bams.csv signals the pipeline to skip its first steps and start with Stringtie. No other parameter is needed.
Each rMATS comparison must be specified with a comparison name as well as the sample_id as specified in the bams.csv file. See example rmats_pairs.txt. Each line in the file corresponds to an rMATS execution. The first column corresponds to a unique name/id for the rMATS comparison (this will be used for the output folder/file names).
- Replicates should be comma separated, and the samples for the b1/b2 files (i.e. case and control) should be space separated. See examples:
comparison_id[space]sample1[space]sample2

comparison1_id[space]sample1[space]sample2
comparison2_id[space]sample3[space]sample4

comparison1_id[space]sample1replicate1,sample1replicate2,sample1replicate3[space]sample2replicate1,sample2replicate2,sample2replicate3
comparison2_id[space]sample3replicate1,sample3replicate2,sample3replicate3[space]sample4replicate1,sample4replicate2,sample4replicate3

comparison_id[space]sample1,sample2,sample3
This config file will be specific to your user and analysis. You do not need to edit the pipeline code to configure the pipeline. Descriptions of all possible parameters and their default values can be found here and here.
To create your own custom config (to specify your input parameters) you can copy and edit this example config file.
VERY IMPORTANT NOTES
- Each time you run the pipeline, go through all possible parameters to ensure you are creating a config ideal for your data. If you do not specify a value for a parameter, the default will be used. All parameters used can be found in the log file. WHEN IN DOUBT, SPECIFY ALL PARAMETERS!
- You must name your config file NF_splicing_pipeline.config (as specified in main.pbs).
- Your NF_splicing_pipeline.config must be in the directory in which you are running your analysis.
- The readlength here should be the length of the reads. If the read length is not a multiple of 5 (e.g. 76 or 151), set readlength to the nearest multiple of 5 (e.g. 75 or 150). This extra base is an artifact of Illumina sequencing.
- Currently, the two options for genomes are hg38 and mm10. If you wish to use a newer version of the genome, you will need to add this to the post-processing script.
Ensure you have NF_splicing_pipeline.config in this directory.
Run the pipeline!
sbatch /projects/anczukow-lab/splicing_pipeline/splicing-pipelines-nf/main.pbs