
Commit 49cd023

Vlad-Dembrovskyi, cgpu, imendes93, and angarb authored
Pipeline additions for v2.1 (#295)
* Fixes env gtex issue #290 (#294)
* Change env() to stdout to save sample_name in gen3_drs
* Fix No such property: baseName for class: String
* Gen3-DRS prints md5 "file is good" to log not stdout
* Improves gen3-drs md5 error message
* Changes gtex input to support new manifest file format [#289] (#296)
* Updates ch_gtex_gen3_ids items #289
* Remove duplicate val(obj_id) in input of gen3-drs (Co-authored-by: cgpu <[email protected]>)
* Comments our fasta requirement for gen3-drs input (#297)
* Update usage.md that genome_fasta is only for CRAM
* Update usage.md typo
* Fix missing file from path issue
* change GLS executor from parameter to scope (#305)
* Remove gtex (#299)
* Remove mentions of old GTEX download option from main.nf
* Remove mentions of old GTEX download option from help
* Remove mentions of old GTEX download option from usage.md
* Renames Gen3-DRS into new GTEX download option
* Renames Gen3-DRS into new GTEX download opt in usage.md
* Dev v2.1 #287 - Simplify the Gen3-DRS download option (#304)
* Update usage.md
* Update run_on_sumner.md
* add dockerfile for csvtoolkit
* add process to convert manifest json to csv
* add process to filter manifest by file passed through --reads
* update help message
* fix bug on variable declaration
* Update nextflow.config - fix typo
* Revert "Merge branch 'master' into dev-v2.1-#287" (this reverts commit be2c2ab, reversing changes made to 04285ef)
* Update main.nf
* patch projectDir error
* Fix oublishDir path for manifest
* Fix typo
* Update filter_manifest.py
* fix bug on saving filenames that were not in manifest file
* remove logging of samples not found in manifest
* Makes filter_manifest txt output optional (Co-authored-by: angarb <[email protected]>, Vlad-Dembrovskyi <[email protected]>)
* Rename examples/gen3/README.md to examples/GTEX/README.md (editing folder name to match new "download_from" name)
* Update and rename GEN3_DRS_config.md to GTEX_config.md (updating parameters)
* Delete examples/gen3 directory
* Update usage.md (moving this information)
* Update README.md
* Delete PRJNA453538.SraRunTable.txt (not needed)
* Delete MCF10_MYCER.datafiles.csv (not needed)
* Create reads.csv (adding reads.csv example)
* Create manifest.json (adding example manifest.json)
* Update run_on_cloudos.md
* Update Copying_Files_From_Sumner_to_Cloud.md (made neater)
* Create Star_Index_Generation.md

Co-authored-by: cgpu <[email protected]>
Co-authored-by: imendes93 <[email protected]>
Co-authored-by: angarb <[email protected]>
1 parent 33ba660 commit 49cd023

20 files changed: +220 -235 lines changed

bin/filter_manifest.py

+33
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

import pandas as pd


def __main__():
    manifest = sys.argv[1]
    reads = sys.argv[2]
    print("Input manifest file:", manifest)
    print("Input read file: ", reads)

    manifest_df = pd.read_csv(manifest, index_col=None, header=0, delimiter=",")

    if reads != "PASS":
        # Subset the manifest to the file names listed in the reads file
        reads_df = pd.read_csv(reads, index_col=None, header=0, delimiter=",")
        manifest_df = manifest_df[manifest_df['file_name'].isin(reads_df['file_name'].tolist())]

    if manifest_df.empty:
        print("Manifest file is empty after filtering.")
        # sys.exit() takes a single argument; a string message exits with status 1
        sys.exit("Manifest file is empty after filtering.")
    else:
        print("Number of samples in filtered manifest:")
        print(len(manifest_df))

    # save final manifest file
    manifest_df.to_csv("filtered_manifest.csv", sep=",", index=False)


if __name__ == "__main__":
    __main__()
```
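As a quick illustration of the subsetting this script performs, the sketch below applies the same `isin()` filter to a toy in-memory manifest; all file names and ids are hypothetical placeholders:

```python
# Toy demonstration of the filter used in bin/filter_manifest.py.
# File names and ids are hypothetical placeholders.
import pandas as pd

manifest_df = pd.DataFrame({
    "md5sum": ["aaa", "bbb", "ccc"],
    "file_name": ["sample1.bam", "sample2.bam", "sample3.bam"],
    "object_id": ["dg.ANV0/1", "dg.ANV0/2", "dg.ANV0/3"],
    "file_size": [100, 200, 300],
})
reads_df = pd.DataFrame({"file_name": ["sample2.bam"]})

filtered = manifest_df[manifest_df["file_name"].isin(reads_df["file_name"].tolist())]
print(filtered)  # only the sample2.bam row remains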

conf/examples/GEN3_DRS_config.md

-18
This file was deleted.

conf/examples/GTEX_config.md

+17
A minimal set of params needed to run when the download option is GTEX. Testing was done with the following params in a dev environment.

```yaml
params {
    reads          = splicing-pipelines-nf/examples/GTEX/reads.csv
    manifest       = manifest.json
    run_name       = gtex_gen3
    download_from  = GTEX
    key_file       = credentials.json
    gtf            = gencode.v32.primary_assembly.annotation.gtf
    star_index     = /mnt/shared/gcp-user/session_data/star_75
    assembly_name  = GRCh38
    readlength     = 75
    stranded       = false
    gc_disk_size   = 200.GB
}
```

conf/executors/google.config

+3-1
```diff
@@ -20,8 +20,10 @@ params {
   gc_disk_size = "2000 GB"

   cleanup = false // Don't change, otherwise CloudOS jobs won't be resumable by default even if user wants to.
+}

-executor = 'google-lifesciences'
+executor {
+  name = 'google-lifesciences'
 }

 process {
```

containers/csvkit/Dockerfile

+7
```dockerfile
FROM nfcore/base:1.9
LABEL authors="[email protected]" \
      description="Docker image containing csvkit toolkit, including in2csv"

COPY environment.yml /
RUN conda env create -f /environment.yml && conda clean -a
ENV PATH /opt/conda/envs/csvkit/bin:$PATH
```

containers/csvkit/environment.yml

+9
```yaml
name: csvkit
channels:
  - conda-forge
  - bioconda
  - defaults
  - anaconda
dependencies:
  - python=3.8
  - csvkit=1.0.5
```
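This image packages csvkit so the pipeline can run `in2csv manifest.json > manifest.csv`. As a rough stdlib-only illustration of what that conversion does for a manifest shaped like the flat JSON array in `examples/GTEX/manifest.json` (an assumption; `in2csv` itself handles many more input shapes):

```python
# Minimal sketch of `in2csv manifest.json > manifest.csv` for a flat
# JSON array of objects. Assumes the shape of examples/GTEX/manifest.json;
# csvkit's in2csv supports many more input formats.
import csv
import json

with open("manifest.json") as fh:
    records = json.load(fh)

with open("manifest.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
```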
docs/Copying_Files_From_Sumner_to_Cloud.md

+12-14
```diff
@@ -1,34 +1,32 @@
-//add singularity to $PATH:
+# Moving files from HPC to Cloud (particular to JAX Sumner)
+
+#### Add singularity to $PATH:
 module load singularity

-//make some convenience commands to reduce typing (note we changed container name so we can accommodate other cloud providers):
+#### Make some convenient commands to reduce typing:
 alias gcloud="singularity exec /projects/researchit/crf/containers/gcp_sdk.sif gcloud"
 alias gsutil="singularity exec /projects/researchit/crf/containers/gcp_sdk.sif gsutil"

-//login to gcloud; this will return a url that you need to paste into a browser, which
-//will take you through the google authentication process; you can use your jax
-//email as userid and jax password to get in. Once you authenticate, it will display
-//a code that you need to paste into the prompt provided in your ssh session on Sumner:
+#### Login to gcloud; this will return a url that you need to paste into a browser, which will take you through the google authentication process; you can use your jax email as userid and jax password to get in. Once you authenticate, it will display a code that you need to paste into the prompt provided in your ssh session on Sumner:
 gcloud auth login --no-launch-browser

-//see which projects you have access to:
+#### See which projects you have access to:
 gcloud projects list

-//what is the project you are currently associated with:
+#### What is the project you are currently associated with:
 gcloud config list project

-//change project association:
+#### Change project association:
 gcloud config set project my-project

-//see what buckets are associated with my-project:
+#### See what buckets are associated with my-project:
 gsutil ls

-//see contents of a particular bucket:
+#### See contents of a particular bucket:
 gsutil ls -l gs://my-bucket

-//recursively copy large directory from filesystem accessible on Sumner to your bucket:
+#### Recursively copy large directory from file system accessible on Sumner to your bucket:
 gsutil -m -o GSUtil:parallel_composite_upload_threshold=150M cp -r my_dir gs://my_bucket/my_dir

-//recursively copy a directory from your bucket to an existing directory on Sumner:
+#### Recursively copy a directory from your bucket to an existing directory on Sumner:
 gsutil -m -o GSUtil:parallel_composite_upload_threshold=150M cp -r gs://my_bucket/my_dir my_dir
```

docs/Star_Index_Generation.md

+8
## Generating STAR Indices

To run the pipeline, you will need STAR indices (preferably ones that match your read length).

This might be a helpful resource for generating multiple STAR indices:
https://github.com/TheJacksonLaboratory/Star_indices

This is also a useful resource: https://github.com/alexdobin/STAR
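A note on why indices are tied to read length: STAR's `--sjdbOverhang` should generally be set to read length minus 1, so a separate index per read length is typical. The sketch below only assembles the `STAR --runMode genomeGenerate` commands; the FASTA/GTF paths, read lengths, and thread count are hypothetical placeholders:

```python
# Sketch: assemble one STAR genomeGenerate command per read length.
# --sjdbOverhang is typically (read length - 1), per the STAR manual.
# Paths, read lengths, and thread count below are hypothetical placeholders.
import shlex

fasta = "GRCh38.primary_assembly.genome.fa"
gtf = "gencode.v32.primary_assembly.annotation.gtf"

for read_length in (75, 100, 150):
    cmd = [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", f"star_{read_length}",
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        "--sjdbOverhang", str(read_length - 1),
        "--runThreadN", "8",
    ]
    print(shlex.join(cmd))  # inspect, then run with subprocess.run(cmd, check=True)
```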

docs/run_on_cloudos.md

+1
```diff
@@ -98,3 +98,4 @@ Note, if you had to resume your job, the above method will not work **sad face**
 ### Helpful Tips
 [Import the Pipeline](https://github.com/TheJacksonLaboratory/splicing-pipelines-nf/blob/master/docs/import_pipeline)
 [Information on how to run TCGA](https://github.com/TheJacksonLaboratory/splicing-pipelines-nf/blob/master/docs/Running_TCGA.md)
+[Information on how to run GTEX](https://github.com/TheJacksonLaboratory/splicing-pipelines-nf/blob/dev-v2.1/examples/GTEX/README.md)
```

docs/usage.md

+4-45
````diff
@@ -103,13 +103,13 @@ Input files:
                             (default: no rmats_pairs specified)
   --run_name                User specified name used as prefix for output files
                             (defaut: no prefix, only date and time)
-  --download_from           Database to download FASTQ/BAMs from (available = 'TCGA', 'GTEX' or 'GEN3-DRS',
-                            'SRA', 'FTP') (string)
+  --download_from           Database to download FASTQ/BAMs from (available = 'TCGA', 'GTEX', 'SRA', 'FTP')
+                            (string)
                             false should be used to run local files on the HPC (Sumner).
                             'TCGA' can also be used to download GDC data including HCMI data.
                             (default: false)
-  --key_file                For downloading reads, use TCGA authentication token (TCGA) or dbGAP repository
-                            key (GTEx, path) or credentials.json file in case of 'GEN3-DRS'
+  --key_file                For downloading reads, use TCGA authentication token (TCGA) or
+                            credentials.json file in case of 'GTEX'.
                             (default: false)

 Main arguments:
@@ -246,44 +246,3 @@ Some useful ones include (specified in main.pbs):
 - `-with-trace` eg `-with-trace trace.txt` which gives a [trace report](https://www.nextflow.io/docs/latest/tracing.html?highlight=dag#trace-report) for resource consumption by the pipeline
 - `-with-dag` eg `-with-dag flowchart.png` which produces the [DAG visualisation](https://www.nextflow.io/docs/latest/tracing.html?highlight=dag#dag-visualisation) graph showing each of the different processes and the connections between them (the channels)

-## Run with data from AnviL Gen3-DRS
-
-You will be needing two things from - https://gen3.theanvil.io/
-
-1. manifest file
-2. credentials file
-
-Original downloaded `manifest.json` file need to be converted into `manifest.csv` in order to be accepted in `--reads`, for doing that you can do this -
-
-```bash
-pip install csvkit
-in2csv manifest.json > manifest.csv
-```
-
-NOTE: Make sure the `manifest.csv` file have five columns, Check from [examples](../examples/gen3/)
-
-Downloaded `credentials.json` file can be provided in `--key` param.
-
-NOTE: Make sure `credentials.json` is a latest one. They have expiry dates when you download.
-
-If you running with AnviL Gen3-DRS files you also need to provide a Genome fasta file with `--genome_fasta`, which will be used to convert CRAM files to BAM format.
-
-For a minimal params list check [gen3_drs.config](../conf/examples/GEN3_DRS_config.md)
-
-### Extract based on a bam query list
-
-If you have a list of bam file names of interest, extract the manifest file -
-
-```bash
-# Get all the bam files name into a txt file
-cut -d, -f4 query_list.csv > bam_files_list.txt
-# Extract those bam files list from manifest.csv
-grep -f bam_files_list.txt -i manifest.csv > manifest.csv
-```
-
-Here `query_list.csv` should look something like -
-
-```csv
-file_name,sequencing_assay,data_format,file_name,sample_id,participant_id,tissue,age,gender
-GTEX-PPPP-XXX-XX-XXXXX,RNA-Seq,bam,GTEX-PPPP-XXX-XX-XXXXX.Aligned.sortedByCoord.out.patched.md.bam,GTEX-PPPP-XXX-XX-XXXXX,GTEX-PPPP,Breast,21,Female
-```
````

examples/GTEX/README.md

+21
## Run with GTEX data

You can run the pipeline on GTEX data obtained directly from Gen3-DRS if you specify the input option:
```
--download_from 'GTEX'
```

You will need two things from https://gen3.theanvil.io/:

1. [manifest file](https://github.com/TheJacksonLaboratory/splicing-pipelines-nf/blob/dev-v2.1/examples/GTEX/manifest.json)
2. credentials file

The original downloaded `manifest.json` will be converted into `manifest.csv` by the pipeline using https://csvkit.readthedocs.io/en/latest/

The `manifest.csv` will be subset using the `reads.csv` file provided in the `--reads` param. (This allows you to download a complete manifest and later select only the samples of interest.) For example: [gtex.reads](https://github.com/TheJacksonLaboratory/splicing-pipelines-nf/blob/dev-v2.1/examples/GTEX/reads.csv)

The downloaded `credentials.json` file can be provided in the `--key_file` param.
NOTE: Make sure `credentials.json` is a recent one; credentials have expiration dates from the time you download them.

If you are running with AnVIL Gen3-DRS to download CRAM files, you also need to provide a genome FASTA file with `--genome_fasta`, which will be used to convert CRAM files to BAM format. If you are downloading BAM files, you can skip this parameter.

For a minimal params list, check [gtex.config](../conf/examples/GTEX_config.md)
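The commit list above mentions that the Gen3-DRS download prints an md5 "file is good" check. As a hedged illustration only (not the pipeline's actual process code), verifying a downloaded file against its `md5sum` entry in the converted manifest could look like this, with `sample.bam` as a hypothetical placeholder:

```python
# Sketch: compare a downloaded file's md5 with its manifest entry.
# Illustration only, not the pipeline's actual check; names are placeholders.
import hashlib

import pandas as pd

def md5_of(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

manifest_df = pd.read_csv("manifest.csv")
expected = manifest_df.set_index("file_name").loc["sample.bam", "md5sum"]
print("file is good" if md5_of("sample.bam") == expected else "md5 mismatch")
```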

examples/GTEX/manifest.json

+14
```json
[
  {
    "md5sum": "x1x111xxx1xxxxx1xx1x1x11xxx11111",
    "file_name": "GTEX-XXXXX-XXXX-XX-XXXXX.Aligned.sortedByCoord.out.patched.md.bam",
    "object_id": "dg.ANV0/yyyyyyyy-yyyy-yyyy-yyyyyyyyyyyy",
    "file_size": 123321365
  },
  {
    "md5sum": "x2x222xxx2xxxxx2xx2x2x22xxx22222",
    "file_name": "GTEX-XXXXX-XXXX-XX-XXXXZ.Aligned.sortedByCoord.out.patched.md.bam",
    "object_id": "dg.ANV0/yyyyyyyy-yyyy-yyyy-yyyyyyyyyzzz",
    "file_size": 123321369
  }
]
```

examples/GTEX/reads.csv

+6
```csv
sample_id
GTEX-XXXXX-XXXX-XX-XXXXX.Aligned.sortedByCoord.out.patched.md.bam
GTEX-XXXXX-XXXX-XX-XXXXX.Aligned.sortedByCoord.out.patched.md.bam
GTEX-XXXXX-XXXX-XX-XXXXX.Aligned.sortedByCoord.out.patched.md.bam
GTEX-XXXXX-XXXX-XX-XXXXX.Aligned.sortedByCoord.out.patched.md.bam
GTEX-XXXXX-XXXX-XX-XXXXX.Aligned.sortedByCoord.out.patched.md.bam
```

examples/analyses/MCF10_MYCER.datafiles.csv

-65
This file was deleted.
