Add ability to strip sequences/qualities from SAM/BAM files #102

Open
unode opened this issue Feb 28, 2019 · 7 comments

Comments

@unode
Member

unode commented Feb 28, 2019

When working with very large SAM files it is often convenient to remove sequence and quality information to reduce storage and improve I/O.

Following from this it would be convenient to have a stripSeqQual function that replaces the two fields with *.
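
A minimal sketch of what such a function might do, written in Python purely for illustration (it is not the NGLess implementation). The name `strip_seq_qual` is hypothetical; the only facts relied on are that SEQ and QUAL are the 10th and 11th tab-separated columns of a SAM alignment record and that `*` marks a field as not stored.

```python
def strip_seq_qual(line: str) -> str:
    """Replace SEQ and QUAL (columns 10 and 11) of a SAM record with '*'."""
    # Header lines start with '@' and carry no sequence data.
    if line.startswith("@"):
        return line
    # Remember whether the line had a trailing newline so we can restore it.
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return line  # not a full alignment record; leave untouched
    fields[9] = "*"   # SEQ
    fields[10] = "*"  # QUAL
    return "\t".join(fields) + newline


# Hypothetical example record (optional tags after column 11 are preserved).
record = "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
print(strip_seq_qual(record))
```

All other columns, including optional tags, pass through unchanged, so downstream tools that only need positions and CIGAR strings can still read the output.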

@luispedro
Member

Could also be a flag when saving (to the write function). Not 100% sure what is the best model.

@sureyeaah

Hi, I would like to work on this.

So I've written a stripSeqQual function in the Data.Sam module. Where should that function be called?

@unode
Member Author

unode commented Feb 27, 2020

Luis's suggestion above was to add an extra argument to the write() function.

Something like:

data = input("file.sam")
write(data, ofile="output.sam", remove_sequence_qualities=true)

My initial idea was to have it as part of a select block. So:

data = input("file.sam")

newdata = select(data) using |mr|:
    mr = mr.filter(min_match_size=45, min_identity_pc=90, action={unmatch})
    mr.remove_sequence_qualities()

write(newdata, ofile="output.sam")

The second interface has a few more use-cases but we didn't reach a decision on which to implement.

@luispedro thoughts?

@luispedro
Member

The write function already has a format_flags argument, so it could be write(newdata, ofile="...", format_flags={no_qualities}).
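
To illustrate the write-time model, here is a rough Python sketch (not NGLess code). `write_sam` is a hypothetical stand-in for the writer, and it assumes that a `no_qualities` flag would replace both SEQ and QUAL with `*`, as the issue title asks; the actual semantics of the flag are not settled in this thread.

```python
import io
from typing import Iterable, TextIO


def write_sam(lines: Iterable[str], out: TextIO,
              format_flags: frozenset = frozenset()) -> None:
    """Write SAM lines, optionally stripping SEQ/QUAL as they are written."""
    # Assumption: 'no_qualities' strips both SEQ (col 10) and QUAL (col 11).
    strip = "no_qualities" in format_flags
    for line in lines:
        line = line.rstrip("\n")
        if strip and not line.startswith("@"):
            fields = line.split("\t")
            if len(fields) >= 11:
                fields[9] = fields[10] = "*"
                line = "\t".join(fields)
        out.write(line + "\n")


# Hypothetical two-line SAM input: one header line, one alignment record.
sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
buf = io.StringIO()
write_sam(sam, buf, format_flags=frozenset({"no_qualities"}))
print(buf.getvalue(), end="")
```

Because the stripping happens in the same pass that serializes each record, no user-written block has to be interpreted per read.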

@unode: what use-cases do you see with the second interface? I am not against it, but the write version is more straightforward to code and can be very fast (interpreting blocks is still a bit slow).

@unode
Member Author

unode commented Feb 27, 2020

The main case I envision is optimization. Assuming a long pipeline using SAM/BAM that doesn't require qualities, removing them early could speed up processing by reducing I/O.

I had a couple of such cases in the past but wouldn't call it a frequent use-case.
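
As a rough illustration of the I/O argument for uncompressed SAM, the Python sketch below compares the size of a single made-up 100 bp record before and after stripping. The record contents and the helper name `stripped` are hypothetical, and real savings depend on read length, optional tags, and (for BAM) compression.

```python
def stripped(record: str) -> str:
    """Return the record with SEQ and QUAL (columns 10 and 11) set to '*'."""
    fields = record.split("\t")
    fields[9] = fields[10] = "*"
    return "\t".join(fields)


# Made-up 100 bp read: SEQ and QUAL are each 100 characters long.
seq = "ACGT" * 25
qual = "I" * 100
record = "\t".join(["r1", "0", "chr1", "100", "60", "100M", "*", "0", "0", seq, qual])

before = len(record.encode("ascii"))
after = len(stripped(record).encode("ascii"))
saved = before - after
print(f"{before} bytes -> {after} bytes ({saved / before:.0%} smaller)")
```

In this toy record nearly all of the bytes sit in the two stripped columns, which is why dropping them early can matter for long pipelines.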

@luispedro
Member

In principle, we could move the stripping to earlier in the pipeline as an optimization later without changing the user-visible interface.

@unode
Member Author

unode commented Feb 27, 2020

Ok, so write(newdata, ofile="...", format_flags={no_qualities}) and we revisit in the future if necessary.

@sureyeaah can you also add a line to Edit-me in your pull request? Thanks

3 participants