HsMetrics PCT_USABLE_BASES_ON_BAIT definition/calculation error #1996

Open
ioguztuncay opened this issue Feb 7, 2025 · 3 comments · May be fixed by #1998


@ioguztuncay

Hi!

On the webpage https://broadinstitute.github.io/picard/picard-metric-definitions.html#HsMetrics, the HsMetrics output named "PCT_USABLE_BASES_ON_BAIT" is defined as "The number of aligned, de-duped, on-bait bases out of the PF bases available." However, line 91 of https://github.com/broadinstitute/picard/blob/master/src/main/java/picard/analysis/directed/HsMetricCollector.java, as well as lines 531-550 of https://github.com/broadinstitute/picard/blob/master/src/main/java/picard/analysis/directed/TargetMetricsCollector.java, show that this metric uses aligned on-bait bases without excluding duplicates. This results in discrepancies between PCT_USABLE_BASES_ON_BAIT and PCT_USABLE_BASES_ON_TARGET, because the latter is calculated using de-duped counts. Just wanted to raise the issue so that the definition can be corrected!

Best regards

@lbergelson
Member

Thanks for reporting this. It looks like a bug to me. We'll look into it and try to verify that.

@yfarjoun Any thoughts on this? Looks fishy to me.

@yfarjoun
Contributor

yfarjoun commented Feb 12, 2025

The code seems to be confused altogether: the numerators and denominators do not agree on whether to use deduped or non-deduped counts. I agree that there's a problem, but I think it goes beyond the documentation.

yfarjoun added a commit to yfarjoun/picard that referenced this issue Feb 14, 2025
closes: broadinstitute#1996

The documentation in HsMetrics class was inaccurate regarding the filtering of reads that go into the PCT_USABLE_BASES_ON_BAIT. It now correctly reflects the fact that the reads/bases that go into this calculation are _not_ unique, i.e. duplicate reads are counted.
@yfarjoun yfarjoun linked a pull request Feb 14, 2025 that will close this issue
@yfarjoun
Contributor

yfarjoun commented Feb 14, 2025

Having spoken with @tfenne offline, I now understand that the confusion I mentioned earlier is by design:

BAIT-related information is supposed to inform on the performance of the selection, thus duplicates are not filtered out (duplicate reads are counted).
TARGET-related information is supposed to inform on the overall efficiency of the assay, thus the numerator is filtered but the denominator is not.

PCT_USABLE_BASES_ON_BAIT=20% means that 20% of the bases that you sequenced were found on the baits. This enables the lab to tweak the selection process without regard to the PCR process, the insert sizes, etc.

PCT_USABLE_BASES_ON_TARGET=20% means that 20% of the bases that you sequenced could be used for calling variants in the target region. This combines the effect of the PCR and the selection and insert size (among other things), and may serve as an overall efficiency metric.
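
To make the distinction concrete, here is a toy sketch in Python (not Picard's actual code). The `Read` class, its fields, and both function names are hypothetical, and the target-side filtering is simplified to duplicate removal only, whereas the real collector applies further filters. It only illustrates the idea described above: both metrics share the same unfiltered PF-base denominator, but only the target numerator drops duplicates.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    # Hypothetical per-read summary, for illustration only.
    pf_bases: int         # PF bases sequenced for this read
    on_bait_bases: int    # aligned bases overlapping a bait
    on_target_bases: int  # aligned bases overlapping a target
    is_duplicate: bool    # marked as a duplicate


def pct_usable_bases_on_bait(reads: List[Read]) -> float:
    # Numerator keeps duplicates: it describes how well selection worked.
    total_pf = sum(r.pf_bases for r in reads)
    on_bait = sum(r.on_bait_bases for r in reads)
    return on_bait / total_pf


def pct_usable_bases_on_target(reads: List[Read]) -> float:
    # Numerator drops duplicates; the denominator stays unfiltered.
    total_pf = sum(r.pf_bases for r in reads)
    on_target = sum(r.on_target_bases for r in reads if not r.is_duplicate)
    return on_target / total_pf


# Made-up example data: four reads of 100 PF bases each, one a duplicate.
reads = [
    Read(pf_bases=100, on_bait_bases=80, on_target_bases=60, is_duplicate=False),
    Read(pf_bases=100, on_bait_bases=80, on_target_bases=60, is_duplicate=True),
    Read(pf_bases=100, on_bait_bases=0, on_target_bases=0, is_duplicate=False),
    Read(pf_bases=100, on_bait_bases=40, on_target_bases=20, is_duplicate=False),
]

print(pct_usable_bases_on_bait(reads))    # 200 / 400
print(pct_usable_bases_on_target(reads))  # 80 / 400
```

In this made-up data the duplicate read still contributes its 80 on-bait bases to the bait metric, but its 60 on-target bases are excluded from the target metric, which is why the two numbers are not directly comparable.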
