samplesheet input is not detecting entire values in sample column #378

wvictor14 · 2024-02-09T22:40:53Z

Description of the bug

When underscores (possibly only above a certain number?) are used in the sample column, sometimes only a substring of the value is read in rather than the whole value.

E.g. BATCH_DATE_SAMPLE is read in as BATCH_DATE in the following example. The impact is that when the truncated value is not unique, the pipeline erroneously treats multiple distinct rows as replicates.

See screenshots of an example samplesheet and running pipeline for example

Command used and terminal output

No response

Relevant files

No response

System information

methylseq version 2.6.0

ewels · 2024-02-09T22:59:49Z

I suspect that this is the offending code:

methylseq/workflows/methylseq.nf

Lines 101 to 103 in 54f823e

    
    def meta_clone = meta.clone()
    parts = meta_clone.id.split('_')
    meta_clone.id = parts.length > 1 ? parts[0..-2].join('_') : meta_clone.id

I'm not 100% sure whether this is a bug or a feature. If it's a feature, then it should have better docs.
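To make the effect concrete, here is a minimal standalone Groovy sketch of those three lines applied to a plain map. It is not the pipeline code itself: the map literal stands in for the real meta object (an assumption), and the sample name is the one from the report above.

```groovy
// Standalone sketch, not pipeline code: `meta` here is just a plain map
// standing in for the real meta object (assumption).
def meta = [id: 'BATCH_DATE_SAMPLE']

def meta_clone = meta.clone()
def parts = meta_clone.id.split('_')   // three parts: BATCH, DATE, SAMPLE
// Drop the last underscore-separated part when there is more than one part.
meta_clone.id = parts.length > 1 ? parts[0..-2].join('_') : meta_clone.id

assert meta_clone.id == 'BATCH_DATE'    // the trailing _SAMPLE is gone
assert meta.id == 'BATCH_DATE_SAMPLE'   // the original map is untouched
```

If this is what happens, any two rows that differ only after the last underscore would end up with the same id.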

ewels · 2024-02-09T23:00:17Z

I think that this issue is essentially the inverse of #351 (here it's happening by accident, there it was the desired behaviour).

flerpan01 · 2024-03-22T14:37:21Z

I'm so happy I found this issue; I was pulling my hair out the whole day thinking my code was wrong. Quick fix: I changed the underscore to a dot in my samplesheet.csv (F0_1 -> F0.1).
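A small Groovy sketch of why the dot workaround helps, assuming the trimming logic quoted earlier in this thread is what runs. The helper name trimId is hypothetical and only wraps that logic for illustration.

```groovy
// Hypothetical helper wrapping the logic quoted earlier in the thread.
def trimId(String id) {
    def parts = id.split('_')
    return parts.length > 1 ? parts[0..-2].join('_') : id
}

assert trimId('F0_1') == 'F0'     // underscore: the last part is dropped
assert trimId('F0.1') == 'F0.1'   // dot: nothing to split on, name kept whole
```

Only underscores are treated as separators, so a name without any underscore passes through unchanged.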

CathyXD · 2024-04-15T00:58:25Z

I've encountered the same issue: the sample names were read in incompletely, which caused errors later on. My sequencing was paired-end with 4 lanes per sample, so I may also have the problem mentioned in #381. Could anyone provide an updated workable samplesheet.csv example? Really confused now.

AdrijaK · 2024-05-30T11:24:21Z

@CathyXD, regarding "Could anyone provide an updated workable samplesheet.csv example?": if you append an extra suffix after a final underscore to each sample name (e.g. _x on every sample), only that suffix is stripped, so the original names stay intact and the samples will not be concatenated. A similar thing works for the nf-core/chipseq pipeline, where everything before the last underscore is used to infer group names.

The pipeline decides to pool the samples in this bit of code:

        .map {
            meta, fastq ->
            def meta_clone = meta.clone()
            parts = meta_clone.id.split('_')
            meta_clone.id = parts.length > 1 ? parts[0..-2].join('_') : meta_clone.id
            [ meta_clone, fastq ]
        }

The line meta_clone.id = parts.length > 1 ? parts[0..-2].join('_') : meta_clone.id splits the sample name on underscores and then checks whether there is more than one part:

If there are no underscores, the whole sample name is used and no concatenation is triggered.
If there is at least one underscore, everything before the last underscore is used to identify the files to be concatenated.

Here are some examples:
Example 1: no underscores, so no samples are pooled:

input:
sample1
sample2

output:
sample1
sample2

Example 2: one underscore, so the samples are pooled based on everything before the last underscore:

input:
sample_1
sample_2

output:
sample
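The two examples, plus the suffix workaround, can be checked with a short standalone Groovy sketch. The helper name trimId is hypothetical, and groupBy is only a stand-in for however the pipeline actually groups rows by id (an assumption).

```groovy
// Hypothetical helper wrapping the id-trimming line quoted above.
def trimId(String id) {
    def parts = id.split('_')
    return parts.length > 1 ? parts[0..-2].join('_') : id
}

// Examples 1 and 2: group sample names by their trimmed id.
def pooled = ['sample1', 'sample2', 'sample_1', 'sample_2'].groupBy { trimId(it) }
assert pooled['sample1'] == ['sample1']               // not pooled
assert pooled['sample2'] == ['sample2']               // not pooled
assert pooled['sample'] == ['sample_1', 'sample_2']   // pooled together

// Suffix workaround: an extra _x is the part that gets stripped instead.
def suffixed = ['sample_1_x', 'sample_2_x'].groupBy { trimId(it) }
assert suffixed.keySet().toList() == ['sample_1', 'sample_2']
```

With the suffix in place, each trimmed id is the original sample name, so the rows stay separate.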

imdanique · 2024-07-15T13:03:15Z

Thank you!! I've spent a week trying to figure out why the pipeline was strangely concatenating my input fastqs. Could you please add this info about naming conventions to the README or fix the code?

sateeshperi · 2024-10-27T12:50:00Z

Fixed in 2.7.1. Please report back if there are any issues. Thank you!

wvictor14 added the bug label Feb 9, 2024

sateeshperi mentioned this issue Feb 22, 2024 in Fix/samplename join #381 (merged)

sateeshperi added this to the 2.7.0 milestone Sep 27, 2024

sateeshperi closed this as completed Oct 27, 2024
