Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How will the seqcol compatibility flags be encoded? #7

Closed
nsheff opened this issue Nov 25, 2020 · 15 comments
Closed

How will the seqcol compatibility flags be encoded? #7

nsheff opened this issue Nov 25, 2020 · 15 comments

Comments

@nsheff
Copy link
Member

nsheff commented Nov 25, 2020

One of the use cases for sequence collections is to determine compatibility between two given sequence collections. Input is 2 sequence collections, and output is an assessment of compatibility between those sequence collections.

As a refresher from the use cases document:

As a user I want to compare the two sequence collections used by two separate analyses so I can understand how comparable and compatible their resulting data are. I may want to dial up and down the level of compatibility (sequence content, names, lengths).

Some examples of compatibility levels are:

  • Strict identity. For example, a bowtie2 index produced from one sequence collection that differs in any aspect (sequence name, order difference, etc), will not necessarily produce the same output. Therefore, we must be able to identify that two sequence collections are identical in terms of sequence content, sequence name, and sequence order.
  • Order-relaxed identity. A downstream process that treats each sequence independently and re-orders its results will return identical results as long as the sequence content and names are identical, even if the order doesn’t match. Therefore, we’d like to be able to say “these two sequence collections have identical content and sequence names, but differ in order”.
  • Name-relaxed identity. Some analysis (for example, a salmon alignment I believe) will be identical regardless of the chromosome names, as it considers the digest of the sequence only. Thus, we’d like to be able to say “These sequence collections have identical content, even if their names and/or orders differ.”
  • Length-only compatible. The weakest compatibility is two sequence collections that have the same set of lengths, though the sequences themselves may differ. In this case we may or may not require name identity. For example, a set of ATAC-seq peaks that are annotated on a particular genome could be used in a separate process that had been aligned to a different genome, with different sequences -- as long as the lengths and names were shared between the two analyses.

In a python notebook I've demonstrated an implementation of this , which may give you an idea of how this works:

https://github.com/refgenie/seqcol/blob/master/advanced.ipynb

With a compare function implementation here if you're interested: https://github.com/refgenie/seqcol/blob/ff5769bf92a2da01b24d75fbff428a30709d1123/seqcol/seqcol.py#L71

The important component for discussion is: how will we encode compatibility? My proposal was to use a flag system (think SAM flags), so a bit vector indicates the result of a bunch of comparisons. Here's an example:

{1: 'CONTENT_ALL_A_IN_B',
2: 'CONTENT_ALL_B_IN_A',
4: 'LENGTHS_ALL_A_IN_B',
8: 'LENGTHS_ALL_B_IN_A',
16: 'NAMES_ALL_A_IN_B',
32: 'NAMES_ALL_B_IN_A',
64: 'TOPO_ALL_A_IN_B',
128: 'TOPO_ALL_B_IN_A',
256: 'CONTENT_ANY_SHARED',
512: 'LENGTHS_ANY_SHARED',
1024: 'NAMES_ANY_SHARED',
2048: 'CONTENT_A_ORDER',
4096: 'CONTENT_B_ORDER'}

I think with these flags, you can make any of the compatibility assessments listed above. But am I missing anything? It's open for discussion, what flags should we provide as part of the specification? How should we order them?

@yfarjoun
Copy link

This enables you to answer the question "given seqcols A and B how compatible are they?" but sometimes one needs to answer the question "Here's seqcol A, what other seqcol is compatible with it?"

Is that out of scope?

@nsheff
Copy link
Member Author

nsheff commented Dec 8, 2020

In fact, this is exactly a use case I raised earlier (user story #8, sub use-case 3 in the google doc for reference). My feeling is that this would be very useful. In our discussion on it, there wasn't a lot of enthusiasm from others in the group so far.

I think it's reasonable to think of this as separate from the seqcol spec, but relying on it. Essentially, the specification would define the flags and the basic A vs B function. Then, anyone (like me) could build on top, for example, by pre-computing this function result for all pairwise combinations of a set of seqcols, and thereby provide what you suggest. So, it would be enabled by, but not part of, the specification.

If you feel like this should be part of the main specification, I'd be interested, and you should come to the refget call tomorrow to discuss.

@sveinugu
Copy link
Collaborator

sveinugu commented Dec 9, 2020

In terms of track analysis, there is also the variable C, which refers to the contents of a track file, i.e. which sequences the track file is making use of, which might be only a subset of the full sequence collection it is made from, say B. I think one should be able to make use of the logic above with the addition of a pre-analysis step where the real seqcoll B is reduced to the filtered B'. But it is at least another use case variant that would be incredibly useful for that domain.

In practice, given track C one typically would not know exactly which B has been used to create it, so one could envision an algorithm where one loops over all known B's to find all compatible seqcolls for the track, and then ask which of these are compatible with a given secqoll A?

This line of reasoning can be expanded to a number of track files, solving the question: which seqcoll allows me to compare all of these track files?

@nsheff
Copy link
Member Author

nsheff commented Dec 9, 2020

@svelnugu yes I think I understand what you mean. But I don't quite understand what you mean about the B seqcol and reduced to B'.

Given a "track" (I guess, say, either a wiggle-style signal file or a bed-style interval file), would you have the "chrom sizes" file that generated it? This identifiers the set of chromosomes and their lengths, which can be used to represent a coordinate system (or, a sequenceless sequence collection, if you will). You can use the compatibility flags to loop through potential sequence collections to identify which ones meet the requirements you're looking for (like A subset of B, with matching names/lengths, for example).

Then, given multiple of these, you could do the same thing, and then take the intersection of all of them. That procedure would answer your question "which seqcoll allows me to compare all of these track files?".

I guess one question at hand is: is it sufficient for the specification to define the binary flags, and the function for computing them on 2 sequence collections? And then leave the looping up to a project that would build on that? Or should that be included as part of the fundamental seqcol specification?

A related issue I see with what you bring up is this: if you lack the "chrom sizes" file for your "track", then this wouldn't work. You can't guess at the length of the chromosomes for a bed file, for example. Perhaps it's possible with a wiggle-style track that included all bases. In theory you could say "find me the seqcols that have big enough chroms to accommodate this, but that's dangerous and much more likely to be wrong.... and there's no way to find "at least as big as" given the proposal above, you can only identify "lengths match exactly" -- which is, I think, as it should be.

@sveinugu
Copy link
Collaborator

sveinugu commented Dec 9, 2020

I was assuming that you did not have the "chrom sizes" info, which you would typically not have available as the connection between metadata and track files are usually not preserved (hence, the need of FAIRtracks). If I recall correctly, the "chrom sizes" info might be embedded in e.g. BigBed/BigWig, but it is at least not present in most textual formats. So you might have to estimate this from the track file, which was what I meant by B'.

Also, if you have "chrom sizes" info, some sequences might have been dropped from the real seqcoll, say chrM or the alternative sequences. So B' might not be the full seqcoll B as registered in an official repository, but I don't know if that distinction is relevant in this context.

@sveinugu
Copy link
Collaborator

sveinugu commented Dec 9, 2020

So reading your full answer, I see you comment on the lack of "chrom sizes". I agree that it is dangerous and possibly wrong, but still we do not live in a perfect world, at least not as bioinformaticians. I do believe one can from the combination of a set of registered seqcolls and a BED-type track file typically be quite certain of at least the major version of the reference genome. See https://doi.org/10.1186/s13059-017-1312-1 for a tool that attempts to predict this.

Of course, we are way outside the responsibilities of a standard of sequence collection identifiers. But it would still be nice if the specification allowed for tools like this to be built, with the bioinformatician taking responsibility for any errors...

@sveinugu
Copy link
Collaborator

sveinugu commented Dec 9, 2020

So I guess I am arguing for the inclusion of an "at least as big as" relation...

@sveinugu
Copy link
Collaborator

sveinugu commented Dec 9, 2020

But now you confused me!... :) This was the algorithm I had in mind:

Given a seqcoll A and a track C:

For each registered seqcoll B (will we support fetching such a list??):

If C is compatible with B (as defined by the tool)

Find the minimal subset B' of B that allows compatibility (still defined by the tool)

Is B' a subset of A? (given lengths, names and topology)

For many track files, it is just a manner of adding another loop. So I think your suggestion would allow for this then?

Edit: comparison of names is also needed.

@sveinugu
Copy link
Collaborator

sveinugu commented Dec 9, 2020

The alternative algorithm is this:

Given a seqcoll A and a track C:

Define C' as a mock seqcol with all the required names and the minimal lengths (up to the end of the last region for each sequence)
Is C' is compatible with A? (given "at least as big as", names and topology)

As you see, it will make the implementation much easier... It is not a matter of necessity, but convenience. In my mind, that is a strong argument for inclusion of an "at least as big as" relation, but then I am a bit practically oriented.

@sveinugu
Copy link
Collaborator

sveinugu commented Dec 9, 2020

To conclude, I think I agree a "at least as big as" relation is out of scope, as we must assume that a seqcol is a real seqcol for the definitions to make sense.

But would it be a point to provide a way to query a database for all matching seqcols, given a combination of sequence names, lengths, topology, and content, but without a seqcoll A to query with? Is there a defined use case for this?

@tcezard
Copy link
Collaborator

tcezard commented May 5, 2021

Couple of things I noted in today's meeting:

1. Input of the function

The compatibility function works at level 1 of recursion to assess compatibility. However the input can be at level 0 (SeqCol digest) if the digests are known from the server

2. Reporting

The use of binary flag for reporting the comparison seems unnecessary cryptic when we don't have this much to report. I would suggest reporting the results in an explicit way in JSON. we discussed two ways:
Enumerating all the available arrays:

{
  "sequences": {
    "ALL_A_IN_B": "true",
    "ALL_B_IN_A": "true",
    "ANY_SHARED": "true",
    "SAME_ORDER": "true"
  },
  "lengths": {
    "ALL_A_IN_B": "true",
    "ALL_B_IN_A": "true",
    "ANY_SHARED": "true",
    "SAME_ORDER": "true"
  },
...
}

Enumerating the comparisons:

{
  "ALL_A_IN_B": {
    "sequences": "true",
    "lengths": "true",
    "names": "true",
    "topologies": "true"
  },
  "ALL_B_IN_A": {
    "sequences": "true",
    "lengths": "true",
    "names": "true",
    "topologies": "true"
  },
...
}

@sveinugu
Copy link
Collaborator

sveinugu commented May 19, 2021

Trying to write up the idea I had at the last meeting in 12 mins before the next one...

So what if the compatibility API worked more or less the same as in the existing recursive results. So one would instead of true/false values for the various comparisons instead receive a digest of the resulting comparison array. Let me show an example:

https//seqcolserver.org/digest/114b0a/1:
{
    'lengths': [12345, 23456, 34567],
    'names': ['1', '2', '3'],
    'sequences': ['acb341', 'c23ba4', '2314af']
}

https//seqcolserver.org/digest/223a9d:
{
    'lengths': [12345, 23456, 34567, 45678],
    'names': ['chr1', 'chr2', 'chr3', 'chrZ'],
    'sequences': ['acb341', 'c23ba4', '2314af', '32d2bd']
}

https//seqcolserver.org/compare/114b0a/223a9d/0:
{
    'lenghts': {'only_a': {}, 'a_and_b': 'cabb2d', 'only_b': '497a8b'},
    'names': {'only_a': '3678ab', 'a_and_b': {}, 'only_b': '2847ab'),
     'sequences': {'only_a': {}, 'a_and_b': '867a8d', 'only_b': '126ba8'}
}

https//seqcolserver.org/compare/114b0a/223a9d/1:
{
    'lenghts': {'only_a': {}, 'a_and_b': [12345, 23456, 34567], 'only_b': [45678]},
    'names': {'only_a': ['1', '2', '3'], 'a_and_b': {}, 'only_b': ['chr1', 'chr2', 'chr3', 'chrZ']),
     'sequences': {'only_a': {}, 'a_and_b': ['acb341', 'c23ba4', '2314af'], 'only_b': ['32d2bd']}
}

https//seqcolserver.org/compare/114b0a/223a9d/2:
{
    'lenghts': {'only_a': {}, 'a_and_b': [12345, 23456, 34567], 'only_b': [45678]},
    'names': {'only_a': ['1', '2', '3'], 'a_and_b': {}, 'only_b': ['chr1', 'chr2', 'chr3', 'chrZ']),
     'sequences': {'only_a': {}, 'a_and_b': ['agattagatagata...', 'cagctagtctaca...', 'cgtctagtctgagc...'], 'only_b': ['aatcctagcctac...']}
}

In order for something like this to work, we need to have a general management of order, somehow, which I also have some thoughts about, that I can add later.

@sveinugu
Copy link
Collaborator

sveinugu commented May 19, 2021

Regarding order: As with the other arrays, I believe there are scenarios where the order of the sequences does not matter, and others where the order does matter. An example of the first is for specifying a coordinate system for use with e.g. BED files, depending on the implementation of the client software (i.e. if the client in any case sorts the sequences lexicographically). An example of a scenario where the order matters is for reproducing the results of a sequence mapper. Hence, I believe order should be defined as a separate array, with the rest of the sequences ordered in a uniform way.

Example, using current idea that preserves order:

https//seqcolserver.org/digest/3bacd6/1:
{
    'lengths': [12345, 23456, 34567, 456],
    'names': ['chr1', 'chr2', 'chr3', 'unplaced_12345'],
    'sequences': ['acb341', 'c23ba4', '2314af', '32d2bd']
}

has the same contents as the following collection, except the ordering:

https//seqcolserver.org/digest/746ac2/1:
{
    'lengths': [456, 12345, 23456, 34567],
    'names': ['unplaced_12345', 'chr1', 'chr2', 'chr3'],
    'sequences': ['32d2bd', 'acb341', 'c23ba4', '2314af']
}

However, the 0-level digests will not match at all:

https//seqcolserver.org/digest/3bacd6/0:
{
    'lengths': '36cd24',
    'names': 'fd1567',
    'sequences': '53f7cb'
}

https//seqcolserver.org/digest/746ac2/0:
{
    'lengths': '68bdd5',
    'names': '170a01',
    'sequences': '34f9db'
}

One would need to use the compatibility function to see that they are basically the same, except the order, and one might have no reason for checking this out if one is only looking at the 0-level recursion. Compare this to:

https//seqcolserver.org/digest/67dc71/1:
{
    'lengths': [456, 12345, 23456, 34567],
    'names': ['unplaced_12345', 'chr1', 'chr2', 'chr3'],
    'sequences': ['32d2bd', 'acb341', 'c23ba4', '2314af'],
    'order': [3, 0, 1, 2]
}

https//seqcolserver.org/digest/401d99/1:
{
    'lengths': [456, 12345, 23456, 34567],
    'names': ['unplaced_12345', 'chr1', 'chr2', 'chr3'],
    'sequences': ['32d2bd', 'acb341', 'c23ba4', '2314af'],
    'order': [0, 1, 2, 3]
}

Here, the 0-level output is much more informative:

https//seqcolserver.org/digest/67dc71/0:
{
    'lengths': '68bdd5',
    'names': '170a01',
    'sequences': '34f9db',
    'order': '8277cc'
}

https//seqcolserver.org/digest/401d99/0:
{
    'lengths': '68bdd5',
    'names': '170a01',
    'sequences': '34f9db',
    'order': '10ddd5'
}

Regarding the canonical ordering, I suggest to order by length as the main array. This would require length to always be included as one of the arrays, but this is something I have argued for previously, independently of this use case. One argument for that was that having a sequence collection of just names seems not very usable and out of scope. Also, in the case of a sequence collection with only sequences and no names, having the length array mandatory makes it possible to filter the second level recursion (when the sequences are actually downloaded) by sequence length. It would also be helpful for predefining a data structure to hold the second level information before download, to make sure there is space on the disk, etc.

If the lengths are the same for two sequences, I suggest the lexicographical order of the names array should break the tie, or if it is not present (or strangely does not break the tie), the sequences digest should have the final word. I don't know if we will support the possibility of duplicate sequences in a collection (e.g. for reads), but if so one might want to define the canonical order depending on also the other arrays (e.g. topology).

EDIT: Just realized that this idea would also work without the lengths array being mandatory.

@sveinugu
Copy link
Collaborator

sveinugu commented Jun 2, 2021

https//seqcolserver.org/digest/401d99/1:
{
'lengths': [456, 12345, 23456, 34567],
'names': ['unplaced_12345', 'chr1', 'chr2', 'chr3'],
'sequences': ['32d2bd', 'acb341', 'c23ba4', '2314af'],
'order': [0, 1, 2, 3]
}

Update on the ordering issue: I realized after writing the above that the obvious ordering would be on decreasing instead of increasing length, as the 'chr1' is typically the longest chromosome. i.e.:

https//seqcolserver.org/digest/401d99/1:
{
    'lengths': [34567, 23456, 12345, 456],
    'names': ['chr1', 'chr2', 'chr3', 'unplaced_12345'],
    'sequences': ['32d2bd', 'acb341', 'c23ba4', '2314af'],
    'order': [0, 1, 2, 3]
}

This has no consequence on the rest of the writeup, but will add the additional argument that the default order is similar to what is usual for large sequences (although it will most probably break down at some point for smaller sequences, e.g.: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39#/st_alt-assembly-unit).

@nsheff
Copy link
Member Author

nsheff commented Feb 1, 2022

I'm going to close this issue as I believe we've come a resolution, which is now specified in the ADR in PR #22.

To summarize briefly, we moved away from the idea of compatibility flags and came up with a bit more information-rich return value for what we now call the comparison function.

For the question of a 'search' function, I've raised a new issue: #28
For the comments on order, these have a lot of further discussion in other issues, e.g. #5

@nsheff nsheff closed this as completed Feb 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants