Use case: a digest for a collection of sequences #76

Open
ahwagner opened this issue May 16, 2024 · 4 comments

Comments

@ahwagner
Member

ahwagner commented May 16, 2024

I was reading through the draft of the SeqCol specification. We have an envisioned use case, complementing federated VRS, that would benefit from a unique digest representing a sequence collection by sequence content only, with no sequence names.

As the specification allows for the digest serialization of sequence collections by name and length only (with no sequence digests), would it make sense to also enable sequence collections to allow for length and sequence digests only (with no names)? Or even just sequence digests only (with no lengths or names)?

@nsheff
Member

nsheff commented May 16, 2024

Yes, you can do this easily.

If you just need sequences, you already have this built in: the level 1 digest of the sequences attribute. That digest respects input sort order.

If you don't care about sort order, then you'll want to implement the sorted_sequences attribute, which is described here: #71
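As a rough illustration of the difference between the two digests, here is a minimal Python sketch. It assumes a GA4GH-style truncated digest (the first 24 bytes of SHA-512, base64url-encoded) and uses compact `json.dumps` output as a stand-in for RFC 8785 canonicalization, which should agree for plain ASCII string arrays. The helper names and the `SQ.` values are placeholders, not taken from the spec or from real refget digests.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """Truncated digest: first 24 bytes of SHA-512, base64url-encoded (32 chars)."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def level1_digest(values: list) -> str:
    """Digest one attribute array. Compact JSON stands in for RFC 8785 here."""
    canonical = json.dumps(values, separators=(",", ":"), ensure_ascii=False)
    return sha512t24u(canonical.encode("utf-8"))


# Placeholder sequence digests (hypothetical values, not real refget digests).
sequences = ["SQ.ccc", "SQ.aaa", "SQ.bbb"]
reordered = ["SQ.aaa", "SQ.bbb", "SQ.ccc"]

# The plain level 1 digest of the sequences array depends on input order...
ordered_differs = level1_digest(sequences) != level1_digest(reordered)
# ...whereas digesting the sorted array gives the same value for both orders.
sorted_matches = level1_digest(sorted(sequences)) == level1_digest(sorted(reordered))

print(ordered_differs, sorted_matches)  # True True
```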

@sveinugu
Collaborator

sveinugu commented May 29, 2024

Name-invariant digests (Redux)

Note

I wrote most of the following comment before today's meeting, which is why my suggestions below don't really focus on the use case @ahwagner described in the meeting. For the use case in question, I agree with @nsheff that the best approach would be to base a solution on canonical seqcol digests (based on names, lengths, and sequences), as these should be easily available to Beacon providers (or at least will be once seqcol is adopted by the repositories). This allows for solving the particular current problem using the level 1 digests of the sorted-sequences array, while at the same time providing interoperability with eventual later solutions that make use of names and/or lengths.

With that in mind, please regard the following suggestion for a name-invariant digest as an idea spawned by this issue and not as a solution to the specific use case, which I don't think should use it.

Recap of the problem

So, as mentioned by @ahwagner, the current specification (using the recommended minimal schema) does not support top-level name-invariant digests, i.e. digests that are based on only sequences or on the combination of lengths and sequences. This is because the names and lengths attributes are both required and inherent in the minimal recommended schema.

Arguments for requiring lengths

As mentioned elsewhere, I have always agreed with requiring lengths. I believe that for many use cases it will be very useful to have the lengths precomputed and easily available, as it is otherwise costly for users to extract this information from the sequences when they need it. For seqcol providers, either the lengths are already easily available, or there is at most a one-time cost of calculating the lengths from the sequences themselves. Requiring the lengths also shouldn't cause downstream problems. Use cases that require simple single-attribute digests could be solved through direct use of the level 1 digests that the seqcol standard provides in any case. Use cases that require other types of digests could be solved by extending the schema with additional non-inherent attributes, as exemplified in the current use case by the suggestion to make use of the level 1 digest of the sorted-sequences attribute.

Problem with requiring names

On the other hand, I still think requiring names could be problematic in certain contexts. This is NOT due to use cases which require name-invariant digests, as those can still be supported through level 1 digests of custom non-inherent attributes (imagine for instance a sequence-length-pairs attribute). However, the problem appears in cases where there TRULY are no known names for the sequences.

[Note, however, that I don't think this is the case for the current Beacon example (as explained in the meeting). The names should be known by the Beacon providers if they have used standard reference genomes with names to call the variants in the first place.]

Issues with using order as names

So for cases where there truly are no names available, @nsheff has previously suggested that we could recommend using the order of the sequences as the default names array, e.g. 1,2,3,... I have never really liked this suggestion, as it (by definition) incorporates information about the order into the names array. This would, for instance, make the sorted-name-length-pairs array depend on the order of the sequences, rendering it useless, as the whole point of this attribute was to provide an order-invariant coordinate system digest.
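To make the problem concrete, here is a small Python sketch. It only loosely follows the idea of the sorted-name-length-pairs attribute (digest each name/length pair, sort the pair digests, digest the sorted list). The helper names and the digest construction are assumptions for illustration, not the spec's exact algorithm.

```python
import base64
import hashlib
import json


def digest(obj) -> str:
    """Illustrative digest: compact sorted-key JSON, SHA-512 truncated to 24 bytes."""
    canonical = json.dumps(obj, separators=(",", ":"), sort_keys=True)
    raw = hashlib.sha512(canonical.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


def sorted_name_length_pairs_digest(names, lengths) -> str:
    """Digest each name/length pair, sort the pair digests, then digest the list."""
    pair_digests = sorted(
        digest({"length": length, "name": name}) for name, length in zip(names, lengths)
    )
    return digest(pair_digests)


# The same two sequences, listed in two different orders.
lengths_a = [1000, 250]
lengths_b = [250, 1000]

# Real names travel with their sequences, so the digest ignores order.
named_a = sorted_name_length_pairs_digest(["chr1", "chrM"], lengths_a)
named_b = sorted_name_length_pairs_digest(["chrM", "chr1"], lengths_b)

# Positional names ("1", "2", ...) stay put while the sequences move.
positional_a = sorted_name_length_pairs_digest(["1", "2"], lengths_a)
positional_b = sorted_name_length_pairs_digest(["1", "2"], lengths_b)

print(named_a == named_b)            # True
print(positional_a == positional_b)  # False
```

With positional names, reordering the collection changes which length each name is paired with, so the supposedly order-invariant digest changes too.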

Suggestion: using sequences as names?

So my alternative suggestion is to still require names to be defined, but to recommend another default solution when names are not available. Since I cannot conceive of a context where you would only have the lengths array available, I think it is safe to assume that in those cases you will have a sequences array. So what if the default or recommended solution for cases where names are truly not available was simply to duplicate the sequences array and use that also as the names array? Then you wouldn't need to invent names, and the level 1 digest of the sorted-name-length-pairs array would still be order-invariant. (Here, I will assume that most implementations will not actually duplicate the array, but instead refer to the same array in both places.)
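A minimal sketch of this fallback, in Python. The function `with_sequences_as_names` is a hypothetical helper, the pair digest is a simplified stand-in for the spec's algorithm, and the `SQ.` values are placeholders. Note that the names entry refers to the same list object as the sequences entry rather than copying it.

```python
import base64
import hashlib
import json


def digest(obj) -> str:
    """Illustrative digest: compact sorted-key JSON, SHA-512 truncated to 24 bytes."""
    canonical = json.dumps(obj, separators=(",", ":"), sort_keys=True)
    raw = hashlib.sha512(canonical.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


def with_sequences_as_names(collection: dict) -> dict:
    """If no names are given, reuse the sequences array as the names array."""
    if "names" in collection:
        return collection
    filled = dict(collection)
    filled["names"] = collection["sequences"]  # same list object, not a copy
    return filled


def sorted_name_length_pairs_digest(collection: dict) -> str:
    """Digest each name/length pair, sort the pair digests, then digest the list."""
    pairs = zip(collection["names"], collection["lengths"])
    return digest(sorted(digest({"length": l, "name": n}) for n, l in pairs))


# The same nameless collection in two different orders (placeholder digests).
first = with_sequences_as_names({"sequences": ["SQ.aaa", "SQ.bbb"], "lengths": [1000, 250]})
second = with_sequences_as_names({"sequences": ["SQ.bbb", "SQ.aaa"], "lengths": [250, 1000]})

order_invariant = (
    sorted_name_length_pairs_digest(first) == sorted_name_length_pairs_digest(second)
)
print(order_invariant)  # True
```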

What about duplicate sequences?

There is a catch, however. So one reason for requiring a names array was for implementations and downstream users to always have available a "primary key" for each sequence in the collection that would be guaranteed to be locally unique (i.e. unique within the collection). So if we require the names to be locally unique, while we do allow duplicate sequences, then any sequence collections with duplicate sequences cannot just duplicate the full sequences array and use it as the names.

Did we enforce unique names? Where?

As an aside, I don't think that the current spec actually requires the values in the names array to be locally unique. So in that case: did we forget to require uniqueness or did we change our minds about this?

Simple solution for duplicate sequences

So if duplicate sequences are to be allowed but not duplicate names, a suggestion to solve this discrepancy would then be to simply define this as out of scope, i.e. just stating that if your use case includes possibly duplicate sequences and no names, then you cannot use seqcol directly. You should still be able to use seqcol, though, through e.g. merging sequence duplicates together into a single entry (possibly storing the duplicate counts in a custom duplicate_counts array).
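The merging step could look roughly like the following Python sketch. The function name `merge_duplicates` is hypothetical, and `duplicate_counts` is the custom attribute suggested above, not something defined by the spec. It assumes identical sequence digests always have identical lengths.

```python
def merge_duplicates(sequences: list, lengths: list) -> dict:
    """Collapse repeated sequences into single entries and record how often each occurred."""
    counts = {}
    length_of = {}
    for seq, length in zip(sequences, lengths):
        counts[seq] = counts.get(seq, 0) + 1
        length_of[seq] = length  # identical sequences are assumed to share a length
    unique = list(counts)  # dicts preserve first-seen order
    return {
        "names": unique,  # sequences double as names, now locally unique
        "sequences": unique,
        "lengths": [length_of[seq] for seq in unique],
        "duplicate_counts": [counts[seq] for seq in unique],
    }


# Placeholder sequence digests; "SQ.a" appears twice.
merged = merge_duplicates(["SQ.a", "SQ.b", "SQ.a"], [4, 7, 4])
print(merged["sequences"])         # ['SQ.a', 'SQ.b']
print(merged["duplicate_counts"])  # [2, 1]
```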

Include in spec?

So could this be a relatively simple late inclusion into the standard, as a way to provide interoperable digests in cases where sequence names are not actually available?

@ahwagner
Member Author

ahwagner commented May 30, 2024

I think that there are a few related challenges here.

I am getting the sense that it is a bit late in the game to be voicing a dissenting opinion on prior design decisions, but I want to reiterate that I am still struggling with the concept of sequence names (effectively local keys) being an inherent attribute for global identifiers. Similarly, I am struggling with the notion that sequence order is meaningful.

As I am learning more about the spec, I have come to understand that these requirements are in place because SeqCol computed identifiers really aren't intended for global comparison operations, only for retrieving the content of sequence collections (where names and order are meaningful), and all comparison operations are expected to be handled by the comparison function. But if that's the case, why do we care about a top-level digest at all? Why not have this be a system-assigned ID?

Putting those questions aside, how sequence collections might work for the concept I have in mind is not clear to me. For example, for me to compare a collection of 10,000 gene sequences from a local collection to see if another server had that same collection of sequences, I need to take the following steps per the current SeqCol draft:

  1. Somehow get a list of sequence collection digests hosted by the other server (is there an endpoint for this?)
  2. Then, for each sequence collection digest:
    a. POST to the comparison endpoint a sequence collection containing names, lengths, and digests of all 10,000 genes
    b. Check the response for array_elements.a_count.sequences == array_elements.b_count.sequences == array_elements.a_and_b_count.sequences
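The steps above can be sketched in Python as follows. The response field names are the ones quoted in step 2b. Everything else is an assumption for illustration: `compare` is a hypothetical callable standing in for the POST to the comparison endpoint, and the stubbed responses are made up.

```python
from typing import Callable, Iterable, Optional


def same_sequence_set(comparison: dict) -> bool:
    """Step 2b: all three sequence counts must agree."""
    elements = comparison["array_elements"]
    a = elements["a_count"]["sequences"]
    b = elements["b_count"]["sequences"]
    shared = elements["a_and_b_count"]["sequences"]
    return a == b == shared


def find_matching_collection(
    remote_digests: Iterable[str], compare: Callable[[str], dict]
) -> Optional[str]:
    """Step 2: compare the local collection against every remote digest in turn."""
    for remote_digest in remote_digests:
        if same_sequence_set(compare(remote_digest)):
            return remote_digest
    return None


def fake_compare(remote_digest: str) -> dict:
    """Stub for the comparison endpoint; a real client would POST the local collection."""
    shared = 10000 if remote_digest == "digest-B" else 42
    return {
        "array_elements": {
            "a_count": {"sequences": 10000},
            "b_count": {"sequences": 10000},
            "a_and_b_count": {"sequences": shared},
        }
    }


match = find_matching_collection(["digest-A", "digest-B"], fake_compare)
print(match)  # digest-B
```

The loop makes the cost visible: one full comparison request per collection hosted by the other server.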

Whereas the workflow I am envisioning would look like:

  1. Query a server to ask if my computed ID for the unordered set of 10k gene sequences is supported by that instance

It seems to me like the above could work with a dedicated endpoint that enables this exact search pattern, relying on the appropriate "level 1" digest instead of the "top level" digest.
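One way such a lookup might work on the server side is an index keyed by the level 1 digest of the sorted sequences array. The Python sketch below is purely hypothetical: the index structure, function names, collection IDs, and `SQ.` values are all invented for illustration, and the digest uses compact JSON as a stand-in for RFC 8785 canonicalization.

```python
import base64
import hashlib
import json


def level1_digest(values: list) -> str:
    """Illustrative digest: compact JSON, SHA-512 truncated to 24 bytes, base64url."""
    canonical = json.dumps(values, separators=(",", ":"), ensure_ascii=False)
    raw = hashlib.sha512(canonical.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


def build_sorted_sequences_index(collections: dict) -> dict:
    """Map each sorted-sequences digest to the IDs of the collections that share it."""
    index = {}
    for collection_id, sequences in collections.items():
        key = level1_digest(sorted(sequences))
        index.setdefault(key, []).append(collection_id)
    return index


# Hypothetical server contents (placeholder sequence digests).
server_collections = {
    "col-1": ["SQ.b", "SQ.a"],
    "col-2": ["SQ.a", "SQ.c"],
}
index = build_sorted_sequences_index(server_collections)

# The client computes the same digest locally and asks with a single lookup.
my_id = level1_digest(sorted(["SQ.a", "SQ.b"]))
hits = index.get(my_id, [])
print(hits)  # ['col-1']
```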

Other possible queries in such a system may look like:

  1. Query a server for all sequence collections that overlap a set of sequences (with similar metrics regarding set overlap as the current comparison endpoint)
  2. Query a server for a list of all sequence collection IDs

Just a few thoughts about this. The other thought I will throw out there is how we might implement sequence collections in Biocommons. SeqRepo, like RefGet, does not care about sequence name. SeqRepo may contain collections of sequences, with each collection unordered and each sequence potentially having multiple names. It may be useful to describe these collections by computed object identifiers, but they do not fit the assumptions of the Sequence Collections spec.

I also want to make clear these are thoughts about future use cases, nothing that is in production at the moment. So it might not be worth trying to make these cases work until they are a little more fleshed out.

@nsheff
Member

nsheff commented Jun 12, 2024

> I have come to understand that these requirements are in place because SeqCol computed identifiers really aren't intended for global comparison operations, only for retrieving the content of sequence collections (where names and order are meaningful), and all comparison operations are expected to be handled by the comparison function.

Yes! This is right -- we are trying to get away from the idea that digests are the only way to compare things. There are too many different ways to compare; hence the comparison endpoint. It goes beyond what you can do with a digest check. Otherwise, every type of comparison would need its own digest.

> But if that's the case, why do we care about a top-level digest at all? Why not have this be a system-assigned ID?

!! Where to start... a few brief reasons:

  • it allows federation, so the same collection gets a consistent digest across different servers
  • it allows you to build a consistent digest for a custom genome that isn't even in any server, but you know won't clash with one that is.
  • it is still useful for identity comparisons when names/order matter (which is, for example, necessary for reproducible analysis). That wasn't your comparison use case, but it is someone's.
  • it provides a publishable string that you can use to refer to a reference genome in a paper instead of "hg38", that is then very clear about exactly what you used. What this means is unambiguous and decentralized; it does not depend on an authority.
  • it is useful for the same reasons refget digests are more useful than sequence names in some cases: they are algorithm-computed, and confirmable. So I can confirm that a sequence collection is what you say it is by computing its digest.
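The last point can be sketched in Python as below. This only loosely follows the layered scheme discussed in this thread (digest each attribute array to get level 1 digests, then digest the object of those digests). It assumes every attribute passed in is inherent, uses compact sorted-key JSON as a stand-in for RFC 8785 canonicalization, and uses placeholder `SQ.` values, so the digests it produces should not be expected to match a real implementation.

```python
import base64
import hashlib
import json


def ga4gh_style_digest(obj) -> str:
    """Illustrative digest: compact sorted-key JSON, SHA-512 truncated to 24 bytes."""
    canonical = json.dumps(obj, separators=(",", ":"), sort_keys=True, ensure_ascii=False)
    raw = hashlib.sha512(canonical.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


def top_level_digest(collection: dict) -> str:
    """Digest each attribute array (level 1), then digest the object of those digests."""
    level1 = {attr: ga4gh_style_digest(values) for attr, values in collection.items()}
    return ga4gh_style_digest(level1)


def verify(collection: dict, claimed_digest: str) -> bool:
    """Confirm a collection is what it claims to be by recomputing its digest."""
    return top_level_digest(collection) == claimed_digest


# Placeholder collection (hypothetical names, lengths, and sequence digests).
genome = {
    "names": ["chr1", "chrM"],
    "lengths": [1000, 250],
    "sequences": ["SQ.aaa", "SQ.bbb"],
}
published = top_level_digest(genome)

# Anyone holding the same content can recompute and confirm the published string.
confirmed = verify(genome, published)

# A collection with a renamed sequence no longer matches.
renamed = dict(genome, names=["1", "chrM"])
tampered_ok = verify(renamed, published)

print(confirmed, tampered_ok)  # True False
```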

To your use case,

> for me to compare a collection of 10,000 gene sequences from a local collection to see if another server had that same collection of sequences

Yes, if you're unsure of the names, then what you say is accurate. This is not a use case we had envisioned would be solved with v1. It is in fact what we have called the "search function": #28

We had planned to address this use case in 1.1. It was just taking too long to get 1.0 finished and so we had to cut it off somewhere...

> It seems to me like the above could work with a dedicated endpoint that enables this exact search pattern, relying on the appropriate "level 1" digest instead of the "top level" digest.

Yes, this is another way this could be solved. In fact, my implementation has always been able to do this, but again, we had put this aside for 1.0. I just wrote a proposal that I guess is basically this: #80

And finally:

> Somehow get a list of sequence collection digests hosted by the other server (is there an endpoint for this?)

Also slated for 1.1: #61

Basically, I think you're bringing a lot of use cases that we had been intending to address "next".
