Merging of previously sorted BAM files #1990

Open
fgvieira opened this issue Jan 29, 2025 · 2 comments

@fgvieira

fgvieira commented Jan 29, 2025

Given several BAM files that are already sorted (either by queryname or by coordinate) but have different headers, is it possible to merge them with MergeSamFiles while keeping the original sort order?

I am working with very large BAM files and sorting the BAM file after merging would take a considerable amount of time!

Thanks,

@lbergelson
Member

MergeSamFiles should merge multiple sorted files into a single file with consistent sorting, IF the files are all sorted the same way. This should not require additional sorting afterwards. There is a log message when it identifies this case:

"Input files are in same order as output so sorting to temp directory is not needed."
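For intuition on why no extra sort is needed in that case: combining inputs that already share a sort order is a streaming k-way merge. The Python sketch below illustrates the general technique only. It is not Picard's code, and the record layout (reference index, position, read name) and the file contents are made up for the example.

```python
import heapq


def merge_sorted_streams(streams, key):
    """Lazily merge iterables that are each already sorted by `key`."""
    # heapq.merge keeps only one pending record per input stream,
    # so the combined output is sorted without a global re-sort.
    return heapq.merge(*streams, key=key)


def coordinate_key(record):
    # Toy record layout (hypothetical): (reference_index, position, read_name)
    return (record[0], record[1])


# Three toy "files", each already in coordinate order.
file1 = [(0, 100, "r1"), (0, 500, "r4"), (1, 20, "r7")]
file2 = [(0, 250, "r2"), (1, 10, "r5")]
file3 = [(0, 300, "r3"), (1, 15, "r6")]

merged = list(merge_sorted_streams([file1, file2, file3], coordinate_key))
```

The same idea applies to queryname order as long as every input was sorted with the same comparison; only the key function changes.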

There is also an argument SORT_ORDER, which defaults to coordinate sorting. If your files are in queryname order you should specify that with this argument, or they will be re-sorted during the merge.

If you're not seeing that message, or if there is a conflict with SORT_ORDER, you may have to specify ASSUME_SORTED=true. This can be dangerous, because the tool then takes your word that the inputs are already in the requested order, whatever their headers say.
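Before choosing SORT_ORDER or reaching for ASSUME_SORTED, it can help to confirm what each header declares. The sketch below is a minimal Python example that reads the SO field from the @HD line of a header given as text (for instance, the output of `samtools view -H file.bam`). The function names and the sample headers are hypothetical, and it only inspects the declared order; it does not verify that the records really follow it.

```python
def sort_order(header_text):
    """Return the SO value from the @HD line, or None if it is absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # SAM header fields are tab-separated TAG:VALUE pairs.
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


def common_sort_order(headers):
    """Return the shared SO value, or None if the headers disagree."""
    orders = {sort_order(h) for h in headers}
    return orders.pop() if len(orders) == 1 else None


# Hypothetical header snippets for illustration.
queryname_header = (
    "@HD\tVN:1.5\tSO:queryname\tSS:queryname:lexicographical\n"
    "@SQ\tSN:chr1\tLN:1000"
)
coordinate_header = "@HD\tVN:1.5\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000"

agreed = common_sort_order([queryname_header, queryname_header])
mixed = common_sort_order([queryname_header, coordinate_header])
```

If the result is a single value, that is the value to pass as SORT_ORDER; if the headers disagree, at least one input would need sorting first.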

@fgvieira
Author

fgvieira commented Feb 12, 2025

I tried it on three files sorted by queryname (@HD VN:1.5 SO:queryname SS:queryname:lexicographical), and I get the following error:

$ picard MergeSamFiles -Xmx100G --MAX_RECORDS_IN_RAM 25000000 -I file1.bam -I file2.bam -I file3.bam -SO queryname -O out.bam
[...]
Exception in thread "main" htsjdk.samtools.util.SequenceUtil$SequenceListsDifferException: Sequence dictionaries are not the same size (65119, 82879)
        at htsjdk.samtools.util.SequenceUtil.assertSequenceListsEqual(SequenceUtil.java:259)
        at htsjdk.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:342)
        at htsjdk.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:328)
        at htsjdk.samtools.SamFileHeaderMerger.getSequenceDictionary(SamFileHeaderMerger.java:530)
        at htsjdk.samtools.SamFileHeaderMerger.<init>(SamFileHeaderMerger.java:164)
        at picard.sam.MergeSamFiles.doWork(MergeSamFiles.java:199)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:280)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:105)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:115)

I guess this error is because the files have different headers (their sequence dictionaries differ in size), no?
I added the option -MSD (MERGE_SEQUENCE_DICTIONARIES) and it seems to work, but it is taking quite a long time for files of ~2Mb (samtools took 5 seconds):

Feb 12, 2025 1:37:28 PM com.intel.gkl.NativeLibraryLoader load
[Wed Feb 12 13:37:28 CET 2025] MergeSamFiles --INPUT file1.bam --INPUT file2.bam --INPUT file3.bam --OUTPUT out.bam --SORT_ORDER queryname --MERGE_SEQUENCE_DICTIONARIES true --MAX_RECORDS_IN_RAM 25000000 --ASSUME_SORTED false --USE_THREADING false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed Feb 12 13:37:29 CET 2025] Executing as XXX on Linux 4.18.0-553.16.1.el8_10.x86_64 amd64; OpenJDK 64-Bit Server VM 22.0.1-internal-adhoc.conda.src; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:3.1.1
INFO    2025-02-12 13:37:32     MergeSamFiles   Input files are in same order as output so sorting to temp directory is not needed.

And in the cases where it does need to sort, is the sorting threaded?
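To see why the sequence dictionaries were reported as different sizes (65119 vs 82879 entries above), the @SQ lines of two headers can be compared directly. The sketch below is a minimal Python example working on header text; the function names and the toy headers are hypothetical, and it does not reproduce what MERGE_SEQUENCE_DICTIONARIES does internally.

```python
def sequence_dictionary(header_text):
    """Return [(name, length), ...] from the @SQ lines, in header order."""
    entries = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            # Each field is TAG:VALUE; split on the first colon only.
            fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
            entries.append((fields["SN"], int(fields["LN"])))
    return entries


def compare_dictionaries(dict_a, dict_b):
    """Summarise how two sequence dictionaries differ."""
    names_a = {name for name, _ in dict_a}
    names_b = {name for name, _ in dict_b}
    return {
        "sizes": (len(dict_a), len(dict_b)),
        "only_in_a": sorted(names_a - names_b),
        "only_in_b": sorted(names_b - names_a),
    }


# Hypothetical headers for illustration.
header_a = "@HD\tVN:1.5\tSO:queryname\n@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chr2\tLN:800"
header_b = header_a + "\n@SQ\tSN:chr3\tLN:500"

report = compare_dictionaries(
    sequence_dictionary(header_a), sequence_dictionary(header_b)
)
```

A report with non-empty `only_in_a` or `only_in_b` lists shows which sequences are missing from one input.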
