-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GenomicData
should include start/end for both start_exon and end_exon
#327
Comments
Not sure I understand the proposed model here. An exon model should not have offsets; offsets are used for describing a junction in a fusion, not an exon in a transcript. |
@ahwagner So should offsets be calculated in FUSOR? I was wondering this yesterday when talking with @jarbesfeld |
@ahwagner the only reason why we have offsets is because this value is used in FUSOR, so I would like to remove if it makes sense to you |
autocomplete gotcha! |
ug |
I'm having a bit of trouble understanding the ask -- are |
I also had this question, and am seeking clarification on this from the CIViC devs. |
@ahwagner @jsstevenson sorry, I did not provide an update to the models I started working on |
I didn't update genomic yet, but it would look similar class Exon(BaseModelForbidExtra):
"""Model for representing exon data"""
number: StrictInt
genomic_start: StrictInt = Field(..., description="Genomic position that aligns to the start of the exon")
genomic_end: StrictInt = Field(..., description="Genomic position that aligns to the end of the exon")
offset_start: StrictInt | None = 0
offset_end: StrictInt | None = 0
class TranscriptExonData(BaseModelForbidExtra):
"""Model containing transcript exon data."""
transcript: StrictStr
exon: Exon
gene: StrictStr
chr: StrictStr
strand: Strand
model_config = ConfigDict(
json_schema_extra={
"example": {
"chr": "NC_000001.11",
"gene": "TPM3",
"exon": {
"number": 1,
"genomic_start": 154191901,
"genomic_end": 154192135,
},
"transcript": "NM_152263.3",
"strand": Strand.NEGATIVE,
}
}
) |
I had asked @jarbesfeld if transcript coords are needed and he said no (for FUSOR). |
I think that it makes sense to keep cool-seq-tool focused on context translation operations; given a transcript and coordinate on that transcript, we should be able to map down to the genome using an alignment from UTA. Given a genomic coordinate and a transcript, we should be able to map back up to the transcript coordinates. I read through the documentation for this function, to ensure I understood its intent.
I think we should be clear what is meant by "exon representation" in this definition, which I assume follows the VICC exon-based transcript segment representation. Exon representation here is defined by the VICC spec, so should be compatible with that. Don't have time to do a full write-up here, but let's set aside a little time to discuss what we are going to use this method for. Are the proposed models representative of the result object from this query? |
Tagging #224 -- I think if we are going to make breaking changes in this module, it might be nice to cover other desired changes in the same sprint |
@jsstevenson Ya, a lot of it will be cleaned up in this issue. I'll do my best to split it out into smaller PRs. |
I'm a little confused about this class Exon(BaseModelForbidExtra):
"""Model for representing exon data"""
number: StrictInt
genomic_start: StrictInt = Field(..., description="Genomic position that aligns to the start of the exon")
genomic_end: StrictInt = Field(..., description="Genomic position that aligns to the end of the exon")
offset_start: StrictInt | None = 0
offset_end: StrictInt | None = 0 Currently, the mapper returns a genomic position, the exon within which that position lies, and the offset corresponding to where within that exon the genomic position lies (e.g. Second -- why do we need both |
@jsstevenson For the second question, I think we would need both Under these conditions, a genomic coordinate falling within an exon represents a positive offset from the beginning of that exon if it is starting at that coordinate, or a negative offset from the end of that exon if it is ending at that coordinate. Likewise, an RNA that includes exon-adjacent intronic sequence would have a genomic coordinate with a negative offset from the beginning of that exon if it is starting at that coordinate, or a positive offset from the end of that exon if it is ending at that coordinate. |
Mmmm. From an API design perspective, rather than providing two mutually-exclusive properties where the one being set implies an additional piece of information, I wonder if we'd rather have a single property plus a property make that additional piece of information explicit. |
Yes, these are the concerns I had with this. I think that the CIViC team had some other requirements beyond a simple mapping for which they needed to look up both ends of an exon (maybe @susannasiebert can comment on this), but I agree with @jsstevenson that doesn't seem like an intuitive response for a genomic->exon translation method. Perhaps, once we understand what the CIViC team needed a full exon model for we can design distinct endpoints that accommodate that use case.
Yes. For genomic -> exon mappings, I think we would need the following information as 3 properties, as described in cancervariants/fusor#172 (comment):
|
@jsstevenson @ahwagner sorry for the confusion. For the past few days, @jarbesfeld and I have been discussing on how to represent the output in Cool-Seq-Tool. We have not been updating the schemas due to back and forth conversation. We believe Cool-Seq-Tool should not follow VICC or GA4GH standards. Currently, the only GA4GH convention that is followed in Cool-Seq-Tool is returning inter-residue coordinates. We envision Cool-Seq-Tool to eventually live in Biocommons, so we believe that VICC/GA4GH standards should be implemented in downstream applications (such as the Variation Normalizer and FUSOR/CurFu). Here is an example of input/output in the Cool-Seq-Tool mapper. This example is on the positive strand and no offset. UTA uses inter-residue based coordinates.
This await mapper.genomic_to_transcript_exon_coordinates(
alt_ac="NC_000011.10",
start=9575888,
end=9589767,
strand=Strand.POSITIVE,
transcript="NM_003390.3",
residue_mode=ResidueMode.RESIDUE
) and await mapper.transcript_to_genomic_coordinates(
transcript="NM_003390.3",
exon_start=2,
exon_end=11,
residue_mode=ResidueMode.RESIDUE
) will both return: {
"gene": "WEE1",
"alt_ac": "NC_000011.10",
"exon_start": {
"ord": 1,
"tx_start_i": 829,
"tx_end_i": 1035,
"alt_start_i": 9575887,
"alt_end_i": 9576093,
},
"exon_start_offset": 0, # input.start - alt_start_i - 1
"exon_end": {
"ord": 10,
"tx_start_i": 2040,
"tx_end_i": 3359,
"alt_start_i": 9588448,
"alt_end_i": 9589767
},
"exon_end_offset": 0, # input.end - alt_end_i
"tx_ac": "NM_003390.3",
"strand": Strand.POSITIVE,
}
@ahwagner @jsstevenson Does this help resolve any confusion? What do you think of the output? We weren't sure on if transcript coordinates should be included. |
No–I feel like there are two conversations going on here. @jsstevenson and I are asking why we need an exon model for genomic <-> exon coordinate conversion. This is independent of schema and/or what data standards we want to apply. |
@ahwagner There was an earlier comment on assuming we were using VICC exon representation, so that's why we made a comment about the standards. Please ignore the outdated models in the PR description and from the July 25 comment. The previous |
The issue I'm trying to gain clarity on is what operation we are performing here. What is the use case for this feature? When you want to convert a transcript segment to two genomic coordinates on a chromosome, or when you want to convert two genomic coordinates on a chromosome to a transcript segment? What if you only have one coordinate? In most real-world cases, we will have one coordinate from each of the ends of two transcripts that form a junction, not a full segment. Therefore, I was expecting this method to convert a genomic coordinate and direction to an exon-based representation of a junction, and vice versa. |
Do we want to meet to go over the proposed overhaul? (Conveniently I am out of the office today but probably could meet virtually at 2) |
Given same example above, the previous output would give: {
"chr": "NC_000011.10",
"gene": "WEE1",
"start": 9575887,
"exon_start": 2,
"exon_start_offset": -205,
"end": 9589767,
"exon_end": 11,
"exon_end_offset": 1318,
"transcript": "NM_003390.3",
"strand": Strand.POSITIVE,
} This is the suggested output: {
"gene": "WEE1",
"alt_ac": "NC_000011.10",
"exon_start": {
"ord": 1,
"tx_start_i": 829,
"tx_end_i": 1035,
"alt_start_i": 9575887,
"alt_end_i": 9576093,
},
"exon_start_offset": 0, # input.start - alt_start_i - 1
"exon_end": {
"ord": 10,
"tx_start_i": 2040,
"tx_end_i": 3359,
"alt_start_i": 9588448,
"alt_end_i": 9589767
},
"exon_end_offset": 0, # input.end - alt_end_i
"tx_ac": "NM_003390.3",
"strand": Strand.POSITIVE,
} Things to point out:
Some use cases:
@jarbesfeld : This does allow both. That's why we have the
@jarbesfeld : Our example output has this since we have the exon number + offset, so it's a question of how to structure the data. |
I'm going to close this for now until CIViC team confirms if this is actually needed. I will make new issues for changes requested by @ahwagner |
Feature description
We should be returning both start and end coordinates for exons.
Use case
Griffith lab would like to use
ExonGenomicCoordsMapper
via Fusion Builder API and have said they need both coordinates.Proposed solution
GenomicData
to include start/end for both start_exon and end_exontranscript_to_genomic_coordinates
andgenomic_to_transcript_exon_coordinates
will use this updated modelAlternatives considered
No response
Implementation details
Thinking of updating the structure to something like this:
Potential Impact
No response
Additional context
No response
Contribution
Yes, I can create a PR for this feature.
The text was updated successfully, but these errors were encountered: