Use case: a digest for a collection of sequences #76
Comments
Yes, you can do this easily. If you just need sequences, you'll already have these built in, as level 1 digests anyway -- that's for a digest that respects input sort order. If you don't care about sort order, then you'll want to implement the |
Name-invariant digests (Redux)
Note
I wrote most of the following comment before today's meeting, which is why my suggestions below don't really focus on the use case @ahwagner described in the meeting. For the use case in question, I agree with @nsheff that the best would be to base a solution on canonical seqcol digests (based on
With that in mind, please regard the following suggestion for a name-invariant digest as an idea spawned by this issue and not as a solution to the specific use case, which I don't think should use it.
Recap of the problem
So, as mentioned by @ahwagner, the current specification (using the recommended minimal schema) does not support top-level name-invariant digests, e.g. digests that are based on only
Arguments for requiring
|
I think that there are a few related challenges here. I am getting the sense that it is a bit late in the game to be voicing a dissenting opinion on prior design decisions, but I want to reiterate that I am still struggling with the concept of sequence names (effectively local keys) being an inherent attribute for global identifiers. Similarly, I am struggling with the notion that sequence order is meaningful. As I am learning more about the spec, I have come to understand that these requirements are in place because SeqCol computed identifiers really aren't intended for global comparison operations, only for retrieving the content of sequence collections (where names and order are meaningful), and all comparison operations are expected to be handled by the comparison function. But if that's the case, why do we care about a top-level digest at all? Why not have this be a system-assigned ID?
Putting those questions aside, it is not clear to me how sequence collections might work for the concept I have in mind. For example, to check whether another server has the same collection of 10,000 gene sequences that I hold locally, I need to take the following steps per the current SeqCol draft:
Whereas the workflow I am envisioning would look like:
It seems to me like the above could work with a dedicated endpoint that enables this exact search pattern, relying on the appropriate "level 1" digest instead of the "top level" digest. Other possible queries in such a system may look like:
Just a few thoughts about this. The other thought I will throw out there is how we might implement sequence collections in Biocommons. SeqRepo, like RefGet, does not care about sequence name. SeqRepo may contain collections of sequences, with each collection unordered and each sequence potentially having multiple names. It may be useful to describe these collections by computed object identifiers, but they do not fit the assumptions of the Sequence Collections spec. I also want to make clear these are thoughts about future use cases, nothing that is in production at the moment. So it might not be worth trying to make these cases work until they are a little more fleshed out. |
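To make the search pattern in this comment concrete, here is a rough sketch of a lookup keyed on sequence content alone, ignoring names and order. Everything in it is hypothetical: `unordered_key`, `CollectionIndex`, the `SQ.` example values, and the choice of a plain SHA-512 over sorted digests are illustrative assumptions, not part of the SeqCol draft, SeqRepo, or any existing implementation.

```python
import hashlib
from collections import defaultdict


def unordered_key(seq_digests):
    """Hypothetical order-invariant key: SHA-512 over the sorted, newline-joined digests."""
    joined = "\n".join(sorted(seq_digests))
    return hashlib.sha512(joined.encode("utf-8")).hexdigest()


class CollectionIndex:
    """Toy in-memory stand-in for a server-side search by sequence content."""

    def __init__(self):
        # Maps a content key to every collection ID that has exactly those sequences.
        self._ids_by_key = defaultdict(list)

    def add(self, collection_id, seq_digests):
        self._ids_by_key[unordered_key(seq_digests)].append(collection_id)

    def find(self, seq_digests):
        # Names and order play no role; only the set of sequence digests matters.
        return list(self._ids_by_key.get(unordered_key(seq_digests), []))


index = CollectionIndex()
index.add("genes-v1", ["SQ.aaa", "SQ.bbb"])
matches = index.find(["SQ.bbb", "SQ.aaa"])  # same content, different order
```

With an index like this, a client would compute one key locally and send a single query, instead of retrieving and comparing full collections.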
Yes! This is right -- we are trying to get away from the idea that digests are the only way to compare things. There are too many different ways to compare; hence the comparison endpoint. It goes beyond what you can do with a digest check. Otherwise, every type of comparison would need its own digest.
!! Where to start... a few brief reasons:
To your use case,
Yes, if you're unsure of the names, then what you say is accurate. This is not a use case we had envisioned solving with v1. It is in fact what we have called the "search function": #28. We had planned to address this use case in 1.1. It was just taking too long to get 1.0 finished, so we had to cut it off somewhere...
Yes, this is another way this could be solved. In fact, my implementation has always been able to do this, but again, we had put this aside for 1.0. I just wrote a proposal that I think is basically this: #80
And finally:
Also slated for 1.1: #61 Basically, I think you're bringing a lot of use cases that we had been intending to address "next" |
I was reading through the draft of the SeqCol specification. We have an envisioned use case, complementing federated VRS, that would benefit from a unique digest representing a sequence collection by sequence content only, with no sequence names.
As the specification allows for the digest serialization of sequence collections by name and length only (with no sequence digests), would it make sense to also enable sequence collections to allow for length and sequence digests only (with no names)? Or even just sequence digests only (with no lengths or names)?
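As a rough illustration of what a "sequence digests only" digest could look like, here is a minimal sketch. It assumes the GA4GH-style truncated digest (the first 24 bytes of SHA-512, base64url-encoded) and approximates canonical JSON with Python's compact `json.dumps`, which is only adequate for arrays of plain ASCII strings. The function names, the `order_invariant` flag, and the `SQ.` example values are hypothetical and do not come from the SeqCol draft.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """Truncated digest: first 24 bytes of SHA-512, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def array_digest(values):
    """Digest a JSON array of strings (compact form; an approximation of canonical JSON)."""
    canonical = json.dumps(values, separators=(",", ":"), ensure_ascii=False)
    return sha512t24u(canonical.encode("utf-8"))


def sequences_only_digest(seq_digests, order_invariant=False):
    """Hypothetical name-free digest over sequence digests alone.

    With order_invariant=True the digests are sorted first, so two
    collections with the same sequences in a different order match.
    """
    items = sorted(seq_digests) if order_invariant else list(seq_digests)
    return array_digest(items)


a = sequences_only_digest(["SQ.aaa", "SQ.bbb"], order_invariant=True)
b = sequences_only_digest(["SQ.bbb", "SQ.aaa"], order_invariant=True)
same_content = a == b  # order no longer matters once the inputs are sorted
```

Including lengths as well would amount to digesting a second array alongside this one; whether that adds value when the sequence digests are already present is part of the question above.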