How will the seqcol compatibility flags be encoded? #7
Comments
This enables you to answer the question "given seqcols A and B how compatible are they?" but sometimes one needs to answer the question "Here's seqcol A, what other seqcol is compatible with it?" Is that out of scope? |
In fact, this is exactly a use case I raised earlier (user story #8, sub use-case 3 in the google doc for reference). My feeling is that this would be very useful. In our discussion on it, there wasn't a lot of enthusiasm from others in the group so far. I think it's reasonable to think of this as separate from the seqcol spec, but relying on it. Essentially, the specification would define the flags and the basic A vs B function. Then, anyone (like me) could build on top, for example, by pre-computing this function result for all pairwise combinations of a set of seqcols, and thereby provide what you suggest. So, it would be enabled by, but not part of, the specification. If you feel like this should be part of the main specification, I'd be interested, and you should come to the refget call tomorrow to discuss. |
In terms of track analysis, there is also the variable In practice, given track This line of reasoning can be expanded to a number of track files, solving the question: which seqcoll allows me to compare all of these track files? |
@svelnugu yes I think I understand what you mean. But I don't quite understand what you mean about the Given a "track" (I guess, say, either a wiggle-style signal file or a bed-style interval file), would you have the "chrom sizes" file that generated it? This identifies the set of chromosomes and their lengths, which can be used to represent a coordinate system (or, a sequenceless sequence collection, if you will). You can use the compatibility flags to loop through potential sequence collections to identify which ones meet the requirements you're looking for (like A subset of B, with matching names/lengths, for example). Then, given multiple of these, you could do the same thing for each and take the intersection of the results. That procedure would answer your question "which seqcoll allows me to compare all of these track files?". I guess one question at hand is: is it sufficient for the specification to define the binary flags, and the function for computing them on 2 sequence collections? And then leave the looping up to a project that would build on that? Or should that be included as part of the fundamental seqcol specification? A related issue I see with what you bring up is this: if you lack the "chrom sizes" file for your "track", then this wouldn't work. You can't guess at the length of the chromosomes for a bed file, for example. Perhaps it's possible with a wiggle-style track that included all bases. In theory you could say "find me the seqcols that have big enough chroms to accommodate this", but that's dangerous and much more likely to be wrong... and there's no way to find "at least as big as" given the proposal above; you can only identify "lengths match exactly" -- which is, I think, as it should be. |
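The looping procedure described in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a coordinate system is modelled as a plain dict of name to length, and `registry`, `compatible_seqcols`, and `shared_seqcols` are hypothetical names, not part of any seqcol specification or library.

```python
def is_subset(a, b):
    """True if every (name, length) pair of a also occurs in b."""
    # dict item views are set-like, so <= is a subset test on (name, length) pairs
    return a.items() <= b.items()


def compatible_seqcols(chrom_sizes, registry):
    """Ids of registered collections that contain the given chrom sizes."""
    return {
        seqcol_id
        for seqcol_id, sizes in registry.items()
        if is_subset(chrom_sizes, sizes)
    }


def shared_seqcols(tracks, registry):
    """Intersection of the compatible collections over several tracks."""
    results = [compatible_seqcols(t, registry) for t in tracks]
    return set.intersection(*results) if results else set()


# Hypothetical registry: id -> {sequence name: length}
registry = {
    "asmA": {"chr1": 100, "chr2": 200, "chrM": 16},
    "asmB": {"chr1": 100, "chr2": 200},
    "asmC": {"chr1": 120, "chr2": 200},
}
track1 = {"chr1": 100, "chr2": 200}
track2 = {"chr1": 100, "chrM": 16}

print(compatible_seqcols(track1, registry))
print(shared_seqcols([track1, track2], registry))
```

Here only exact name/length matches count, which mirrors the "lengths match exactly" behaviour discussed above; the loop lives outside the pairwise comparison function.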
I was assuming that you did not have the "chrom sizes" info, which you would typically not have available, as the connection between metadata and track files is usually not preserved (hence, the need of FAIRtracks). If I recall correctly, the "chrom sizes" info might be embedded in e.g. BigBed/BigWig, but it is at least not present in most textual formats. So you might have to estimate this from the track file, which was what I meant by Also, if you have "chrom sizes" info, some sequences might have been dropped from the real seqcoll, say |
So reading your full answer, I see you comment on the lack of "chrom sizes". I agree that it is dangerous and possibly wrong, but still we do not live in a perfect world, at least not as bioinformaticians. I do believe that, by combining a set of registered seqcolls with a BED-type track file, one can typically be quite certain of at least the major version of the reference genome. See https://doi.org/10.1186/s13059-017-1312-1 for a tool that attempts to predict this. Of course, we are way outside the responsibilities of a standard of sequence collection identifiers. But it would still be nice if the specification allowed for tools like this to be built, with the bioinformatician taking responsibility for any errors... |
So I guess I am arguing for the inclusion of an "at least as big as" relation... |
But now you confused me!... :) This was the algorithm I had in mind: Given a seqcoll
For many track files, it is just a matter of adding another loop. So I think your suggestion would allow for this then? Edit: comparison of names is also needed. |
The alternative algorithm is this: Given a seqcoll
As you see, it will make the implementation much easier... It is not a matter of necessity, but convenience. In my mind, that is a strong argument for inclusion of an "at least as big as" relation, but then I am a bit practically oriented. |
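To make the "at least as big as" idea concrete, here is a rough Python sketch of how a client tool might apply it to a BED-style file that lacks chrom sizes. It is a heuristic built on assumptions, not a proposal for the specification: `observed_extent`, `at_least_as_big`, and the dict-based registry are hypothetical names, and, as noted earlier in the thread, a match here can easily be a false positive.

```python
def observed_extent(bed_lines):
    """Largest end coordinate seen per sequence name in BED-style lines."""
    extent = {}
    for line in bed_lines:
        # Skip blank lines and common BED header lines
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, _start, end = line.split()[:3]
        extent[chrom] = max(extent.get(chrom, 0), int(end))
    return extent


def at_least_as_big(extent, chrom_sizes):
    """True if every observed name exists and is long enough to hold the data."""
    # BED end coordinates are exclusive, so end == length is still valid
    return all(
        chrom in chrom_sizes and chrom_sizes[chrom] >= end
        for chrom, end in extent.items()
    )


bed = ["track name=example", "chr1\t10\t110", "chr2\t0\t50"]
extent = observed_extent(bed)

# Hypothetical registry: id -> {sequence name: length}
registry = {
    "asmB": {"chr1": 100, "chr2": 200},
    "asmC": {"chr1": 120, "chr2": 200},
}
candidates = [sid for sid, sizes in registry.items() if at_least_as_big(extent, sizes)]
print(extent, candidates)
```

Note that the name check is part of the test, in line with the remark above that comparison of names is also needed.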
To conclude, I think I agree an "at least as big as" relation is out of scope, as we must assume that a seqcol is a real seqcol for the definitions to make sense. But would there be a point in providing a way to query a database for all matching seqcols, given a combination of sequence names, lengths, topology, and content, but without a seqcoll |
Couple of things I noted in today's meeting:
1. Input of the function: The compatibility function works at level 1 of recursion to assess compatibility. However, the input can be at level 0 (SeqCol digest) if the digests are known from the server.
2. Reporting: The use of a binary flag for reporting the comparison seems unnecessarily cryptic when we don't have this much to report. I would suggest reporting the results in an explicit way in JSON. We discussed two ways:
{
"sequences": {
"ALL_A_IN_B": "true",
"ALL_B_IN_A": "true",
"ANY_SHARED": "true",
"SAME_ORDER": "true"
},
"lengths": {
"ALL_A_IN_B": "true",
"ALL_B_IN_A": "true",
"ANY_SHARED": "true",
"SAME_ORDER": "true"
},
...
}
Enumerating the comparisons:
{
"ALL_A_IN_B": {
"sequences": "true",
"lengths": "true",
"names": "true",
"topologies": "true"
},
"ALL_B_IN_A": {
"sequences": "true",
"lengths": "true",
"names": "true",
"topologies": "true"
},
...
} |
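A minimal Python sketch of how the two JSON layouts above could be produced from level-1 arrays follows. Everything here is an assumption made for illustration: the function names are hypothetical, `SAME_ORDER` is interpreted as "the shared elements appear in the same relative order", and real JSON booleans are emitted rather than the `"true"` strings shown above.

```python
import json

COMPARISONS = ("ALL_A_IN_B", "ALL_B_IN_A", "ANY_SHARED", "SAME_ORDER")


def compare_array(a, b):
    """Compare two level-1 arrays (e.g. the names of A and the names of B)."""
    sa, sb = set(a), set(b)
    return {
        "ALL_A_IN_B": sa <= sb,
        "ALL_B_IN_A": sb <= sa,
        "ANY_SHARED": bool(sa & sb),
        # Assumed meaning: shared elements occur in the same relative order
        "SAME_ORDER": [x for x in a if x in sb] == [x for x in b if x in sa],
    }


def report_by_array(A, B):
    """First layout: one block per array, listing each comparison."""
    return {key: compare_array(A[key], B[key]) for key in A if key in B}


def report_by_comparison(A, B):
    """Second layout: one block per comparison, listing each array."""
    by_array = report_by_array(A, B)
    return {c: {key: res[c] for key, res in by_array.items()} for c in COMPARISONS}


A = {"names": ["chr1", "chr2"], "lengths": [100, 200]}
B = {"names": ["chr2", "chr1"], "lengths": [200, 100]}

print(json.dumps(report_by_array(A, B), indent=2))
print(json.dumps(report_by_comparison(A, B), indent=2))
```

Both layouts carry the same information; the second is just a pivot of the first, so the choice is mostly about which grouping clients find easier to read.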
Trying to write up the idea I had at the last meeting in 12 mins before the next one... So what if the compatibility API worked more or less the same as in the existing recursive results? Instead of true/false values for the various comparisons, one would receive a digest of the resulting comparison array. Let me show an example:
In order for something like this to work, we need to have a general management of order, somehow, which I also have some thoughts about, that I can add later. |
Regarding order: As with the other arrays, I believe there are scenarios where the order of the sequences does not matter, and others where the order does matter. An example of the first is for specifying a coordinate system for use with e.g. BED files, depending on the implementation of the client software (i.e. if the client in any case sorts the sequences lexicographically). An example of a scenario where the order matters is for reproducing the results of a sequence mapper. Hence, I believe Example, using current idea that preserves order:
One would need to use the compatibility function to see that they are basically the same, except the order, and one might have no reason for checking this out if one is only looking at the 0-level recursion. Compare this to:
Regarding the canonical ordering, I suggest to order by If the EDIT: Just realized that this idea would also work without the |
Update on the ordering issue: I realized after writing the above that the obvious ordering would be on decreasing instead of increasing length, as the 'chr1' is typically the longest chromosome. i.e.:
This has no consequence on the rest of the writeup, but will add the additional argument that the default order is similar to what is usual for large sequences (although it will most probably break down at some point for smaller sequences, e.g.: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39#/st_alt-assembly-unit). |
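As a rough illustration of the decreasing-length ordering mentioned above, the following Python sketch sorts (name, length) pairs so that two collections differing only in order end up with the same canonical list. The tie-break on name is an assumption added here to make the order deterministic, and `canonical_order` is a hypothetical helper name.

```python
def canonical_order(seqs):
    """Sort (name, length) pairs by decreasing length, then by name."""
    # The name tie-break is an assumption; it only matters for equal lengths
    return sorted(seqs, key=lambda s: (-s[1], s[0]))


a = [("chrM", 16), ("chr1", 100), ("chr2", 80)]
b = [("chr2", 80), ("chrM", 16), ("chr1", 100)]

print(canonical_order(a))
print(canonical_order(a) == canonical_order(b))
```

With this ordering the longest sequence comes first, which matches the usual position of 'chr1' in large assemblies.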
I'm going to close this issue as I believe we've come to a resolution, which is now specified in the ADR in PR #22. To summarize briefly, we moved away from the idea of compatibility flags and came up with a bit more information-rich return value for what we now call the For the question of a 'search' function, I've raised a new issue: #28 |
One of the use cases for sequence collections is to determine compatibility between two given sequence collections. Input is 2 sequence collections, and output is an assessment of compatibility between those sequence collections.
As a refresher from the use cases document:
Some examples of compatibility levels are:
In a Python notebook I've demonstrated an implementation of this, which may give you an idea of how this works:
https://github.com/refgenie/seqcol/blob/master/advanced.ipynb
With a compare function implementation here if you're interested: https://github.com/refgenie/seqcol/blob/ff5769bf92a2da01b24d75fbff428a30709d1123/seqcol/seqcol.py#L71
The important component for discussion is: how will we encode compatibility? My proposal was to use a flag system (think SAM flags), so a bit vector indicates the result of a bunch of comparisons. Here's an example:
{1: 'CONTENT_ALL_A_IN_B',
2: 'CONTENT_ALL_B_IN_A',
4: 'LENGTHS_ALL_A_IN_B',
8: 'LENGTHS_ALL_B_IN_A',
16: 'NAMES_ALL_A_IN_B',
32: 'NAMES_ALL_B_IN_A',
64: 'TOPO_ALL_A_IN_B',
128: 'TOPO_ALL_B_IN_A',
256: 'CONTENT_ANY_SHARED',
512: 'LENGTHS_ANY_SHARED',
1024: 'NAMES_ANY_SHARED',
2048: 'CONTENT_A_ORDER',
4096: 'CONTENT_B_ORDER'}
I think with these flags, you can make any of the compatibility assessments listed above. But am I missing anything? It's open for discussion, what flags should we provide as part of the specification? How should we order them?
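For readers less familiar with SAM-style flags, here is a minimal Python sketch of how the table above could be encoded and decoded. It is not the implementation linked earlier; `set_flags` and `decode` are hypothetical helpers, only the set-based flags are computed, and the two ORDER flags are left out because their exact meaning is not spelled out here. Arrays are also compared independently, so the pairing between a name and its length is ignored.

```python
# Flag table copied from the proposal above
FLAGS = {
    1: "CONTENT_ALL_A_IN_B", 2: "CONTENT_ALL_B_IN_A",
    4: "LENGTHS_ALL_A_IN_B", 8: "LENGTHS_ALL_B_IN_A",
    16: "NAMES_ALL_A_IN_B", 32: "NAMES_ALL_B_IN_A",
    64: "TOPO_ALL_A_IN_B", 128: "TOPO_ALL_B_IN_A",
    256: "CONTENT_ANY_SHARED", 512: "LENGTHS_ANY_SHARED",
    1024: "NAMES_ANY_SHARED",
    2048: "CONTENT_A_ORDER", 4096: "CONTENT_B_ORDER",
}
BIT = {name: bit for bit, name in FLAGS.items()}


def set_flags(a, b):
    """a, b: dicts mapping a prefix such as 'LENGTHS' to a list of values."""
    value = 0
    for prefix in ("CONTENT", "LENGTHS", "NAMES", "TOPO"):
        if prefix not in a or prefix not in b:
            continue  # attribute missing from one side: leave its bits unset
        sa, sb = set(a[prefix]), set(b[prefix])
        checks = {
            f"{prefix}_ALL_A_IN_B": sa <= sb,
            f"{prefix}_ALL_B_IN_A": sb <= sa,
            f"{prefix}_ANY_SHARED": bool(sa & sb),
        }
        for name, passed in checks.items():
            if passed and name in BIT:  # e.g. TOPO has no ANY_SHARED flag
                value |= BIT[name]
    return value


def decode(value):
    """List the names of all flags set in an integer value."""
    return [name for bit, name in sorted(FLAGS.items()) if value & bit]


a = {"NAMES": ["chr1", "chr2"], "LENGTHS": [100, 200]}
b = {"NAMES": ["chr1", "chr2", "chrM"], "LENGTHS": [100, 200, 16]}
value = set_flags(a, b)
print(value)          # 1556
print(decode(value))
```

A client would test a single property with a bitwise AND, for example `value & BIT["NAMES_ALL_A_IN_B"]`, exactly as one does with SAM flags.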