-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create a new tool, FilterReadsByFlowCellLocation for filtering out reads with specific flow cell coordinates #1992
base: master
Are you sure you want to change the base?
Conversation
…s with specific flow cell locations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally looks good, just a few small questions and potential to refactor the tests.
// Example format: @HWUSI-EAS100R:6:73:941:1973#0/1 | ||
// or: @EAS139:136:FC706VJ:2:2104:15343:197393 | ||
try { | ||
String[] parts = readName.split(":"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason to parse the reads yourself instead of using ReadNameParser
?
// Write read to output if it doesn't match filter criteria | ||
writer.addAlignment(read); | ||
} | ||
} finally { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use try-with-resources instead.
* @throws IOException if an I/O error occurs. | ||
*/ | ||
private File createSamFile(String[] readNames) throws IOException { | ||
File tmpSam = File.createTempFile("FilterFlowCellEdgeReadsTest_input", ".sam"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
tmpSam.deleteOnExit(); | ||
|
||
// Create a minimal SAM file header. | ||
SAMFileHeader header = new SAMFileHeader(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
* @return the temporary SAM file. | ||
* @throws IOException if an I/O error occurs. | ||
*/ | ||
private File createSamFile(String[] readNames) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
readNames final
*/ | ||
private int countRecords(File samFile) throws IOException { | ||
int count = 0; | ||
try (SamReader reader = SamReaderFactory.makeDefault().open(samFile)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
private int countRecords(File samFile) throws IOException { | ||
int count = 0; | ||
try (SamReader reader = SamReaderFactory.makeDefault().open(samFile)) { | ||
for (SAMRecord rec : reader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
public class FilterReadsByFlowCellLocationTest { | ||
|
||
// Temporary files for input and output. | ||
private File inputSam; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why share these fields across tests? You're just regenerating them in each test
} | ||
|
||
@AfterMethod | ||
public void tearDown() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't think you need this, because I think you can keep the inputSam and outputSam objects local to the tests instead of as class fields
* – One read with non-matching coordinates ("2000:2000") that should be retained. | ||
*/ | ||
@Test | ||
public void testMixedReads() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you easily refactor this into 1 test and a DataProvider?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine overall, I'm kind of skeptical this tool needs to exist. It seems like it should just be a read filter. There are also some awkward AI decisions, especially in the tests. I also found what is either a bug or an incorrect example of how illumina read names are structured.
import java.io.IOException; | ||
|
||
@CommandLineProgramProperties( | ||
summary = "Filters out reads with specific flowcell coordinates", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switch summary / one line summary
import picard.cmdline.CommandLineProgram; | ||
import picard.cmdline.StandardOptionDefinitions; | ||
import picard.cmdline.programgroups.ReadDataManipulationProgramGroup; | ||
import htsjdk.samtools.util.Log; // Added this import |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unecessary comment
public String OUTPUT; | ||
|
||
@Argument(shortName = "X", | ||
doc = "X coordinate to filter (default: 1000)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't write the default into the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might want to add a minValue annotation to disallow negative numbers
optional = true) | ||
public int X_COORD = 1000; | ||
|
||
@Argument(shortName = "Y", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
programGroup = ReadDataManipulationProgramGroup.class | ||
) | ||
public class FilterReadsByFlowCellLocation extends CommandLineProgram { | ||
// Initialize logger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Useless comment
|
||
private boolean hasFlowcellCoordinates(String readName) { | ||
// Parse Illumina read name format | ||
// Example format: @HWUSI-EAS100R:6:73:941:1973#0/1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to make use of the ReadNameParser which handles illumina read name -> coordinates, it also has support for abitrary read name schemes as well. (That would require exposing an additional argument though.
optional = true) | ||
public int Y_COORD = 1000; | ||
|
||
private boolean hasFlowcellCoordinates(String readName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename this to "matchesFlowCellCoordinate", make it static, and make it take x and y as inputs. Currently it sounds like it's just checking if the readname includes coordinates, which is different than if it matches the bad coordinate.
.referenceSequence(REFERENCE_SEQUENCE) | ||
.open(new File(INPUT)); | ||
|
||
final SAMFileHeader header = reader.getFileHeader(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'd be better off declaring these in a try-with-resources header
import java.io.IOException; | ||
|
||
/** | ||
* Unit tests for FilterFlowCellEdgeReads using TestNG. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
useless
/** | ||
* Unit tests for FilterFlowCellEdgeReads using TestNG. | ||
*/ | ||
public class FilterReadsByFlowCellLocationTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We typically create all our files within the test method and don't use setup-tear down. Shared fields like this has makes things complicated in ways we would rather avoid. I believe this is also leaking index files which should ideally be cleaned up.
These tests just aren't idiosyncratic. It would probably be better use dataprovider instead of multiple tests like this.
Oh, I'm late too the party... |
|
||
@CommandLineProgramProperties( | ||
summary = "Filters out reads with specific flowcell coordinates", | ||
oneLineSummary = "Removes reads from specific flowcell positions from BAM/CRAM files", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filter out reads from 'a' specific flowcell position. It only does 1.
I would probably add a toplevel comment describing what problem this works around also because it's pretty specific.
I wonder if it makes sense to put this as a filter in FilterSamReads.
…On Wed, Feb 5, 2025 at 10:11 AM Mark Fleharty ***@***.***> wrote:
Filter out reads with specific flow cell coordinates
Description This pull request adds a new Picard
tool—FilterFlowCellEdgeReads—which is designed to filter out reads from
SAM/BAM files based on their flow cell coordinates. The tool parses the
read names (assuming an Illumina read name format) and extracts the X and Y
coordinates from the last two fields. If a read’s coordinates match the
default values (X_COORD = 1000 and Y_COORD = 1000), the read is omitted
from the output file. Checklist (never delete this)
Never delete this, it is our record that procedure was followed. If you
find that for whatever reason one of the checklist points doesn't apply to
your PR, you can leave it unchecked but please add an explanation below.
Content
- Added or modified tests to cover changes and any new functionality
- Edited the README / documentation (if applicable)
- All tests passing on github actions
Review
- Final thumbs-up from reviewer
- Rebase, squash and reword as applicable
For more detailed guidelines, see
https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests
------------------------------
You can view, comment on, or merge this pull request online at:
#1992
Commit Summary
- 7d6d220
<7d6d220>
Create a new tool, FilterFlowCellEdgeReadsTest for filtering out reads with
specific flow cell locations
File Changes
(2 files <https://github.com/broadinstitute/picard/pull/1992/files>)
- *A* src/main/java/picard/sam/FilterFlowCellEdgeReads.java
<https://github.com/broadinstitute/picard/pull/1992/files#diff-c744530664f2b9e0cc159caa30cb429a355984cc1d5c017e9c1720b39b581d67>
(106)
- *A* src/test/java/picard/sam/FilterFlowCellEdgeReadsTest.java
<https://github.com/broadinstitute/picard/pull/1992/files#diff-7f964701a7769a5ba75506123eb230b6041fbdf4bf094f3528caac2e2335371e>
(163)
Patch Links:
- https://github.com/broadinstitute/picard/pull/1992.patch
- https://github.com/broadinstitute/picard/pull/1992.diff
—
Reply to this email directly, view it on GitHub
<#1992>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAU6JUT2QFZQZTK5CREY2TT2OIS3JAVCNFSM6AAAAABWRJ5MX2VHI2DSMVQWIX3LMV43ASLTON2WKOZSHAZTGMRSGQ4TCOI>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
@yfarjoun I think that is a very good suggestion |
The maximum length of a Java array is not exactly Integer.MAX_VALUE, but slightly less due to the space taken up by the object header. The exact maximum differs depending on the platform and Java version. This was already accounted for in one instance, but not others. This commit fixes the other instances and changes the maximum size in the existing instance to Integer.MAX_VALUE - 32 instead of Integer.MAX_VALUE - 5 to decrease the likelihood of allocation failures on different Java versions and platforms.
@fleharty You have a test failure now.
Sorry this is more painful than it ideally would be. |
Filter out reads with specific flow cell coordinates
Description
This pull request adds a new Picard tool—FilterReadsByFlowCellLocation—which is designed to filter out reads from SAM/BAM files based on their flow cell coordinates. The tool parses the read names (assuming an Illumina read name format) and extracts the X and Y coordinates from the last two fields. If a read’s coordinates match the default values (X_COORD = 1000 and Y_COORD = 1000), the read is omitted from the output file.
Checklist (never delete this)
Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.
Content
Review
For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests