# Efficient indexing of collections of signatures
```{=latex}
\begin{epigraphs}
\qitem{A complex system that works is invariably found to have evolved from a simple system that worked.
A complex system designed from scratch never works and cannot be patched up to make it work.
You have to start over with a working simple system.}%
{John Gall}
\end{epigraphs}
```
## Introduction
<!-- TODO
- Public genome databases
- Exponential growth
- Challenges in indexing, searching and updating the indices for collections of datasets
- Methods for indexing genomic datasets
-->
Searching for matches in large collections of datasets is challenging when hundreds of thousands of them are available,
especially if they are partitioned and not all stored in the same place,
or too large to fit in a single system.
Efficient methods for searching sequencing datasets use exact $k$-mer matching instead of relying on sequence alignment,
but sensitivity is reduced since they cannot handle sequencing errors and biological variation as well as alignment-based methods can.
<!-- cite some methods, including SBT and Mantis -->
<!--
CTBQ: Additional points to raise: in-memory representation of sketches may be too big (!!),
goal here is on disk storage/low minimum memory for "extremely large data" situation.
Also/in addition, want ability to do incremental loading of things.
Note we are not talking here about situations where the indices themselves are too big to download,
could maybe include forward pointer to chp4.
Note, in this chapter you could also include distinction in performance between SBT and LCA DB,
to whit: large scaled works well with LCA (small DB, ~tolerable memory, load all at once, then quite fast)
but low scaled may work (much) better with SBT.
-->
Indexing strategies for querying large collections of sequencing datasets
can be classified as _$k$-mer aggregative_ methods and _color aggregative_ methods [@marchet_data_2019].
_$k$-mer aggregative_ methods index the $k$-mer composition for each individual dataset,
and then build structures for retrieving the datasets where the query $k$-mers are present.
_Color aggregative_ methods index the $k$-mer composition of all datasets together,
and then assign each $k$-mer a color representing the set of datasets where it is present.
This reduces space requirements,
since each $k$-mer is stored only once,
but requires extra structures for storing which datasets each color represents.
Both strategies support the same class of queries,
but with different trade-offs and optimizations:
_$k$-mer aggregative_ methods favor threshold queries
("what datasets contain more than 60% of the query $k$-mers?")
while _color aggregative_ methods tend to be more efficient for specific $k$-mer
queries ("what datasets contain this query $k$-mer?").
<!-- how to dive into hierarchical index and inverted index below? -->
### Hierarchical index
<!-- 'k-mer aggregative methods in (marchet 2019)' -->
Bloofi [@crainiceanu_bloofi:_2015] is a hierarchical index structure that
extends the basic Bloom Filter query to collections of Bloom Filters.
Instead of calculating the union of all Bloom Filters in the collection
(which would allow answering if an element is present in any of them)
it defines a tree structure where the original Bloom Filters are leaves,
and internal nodes are the union of all the Bloom Filters in their subtrees.
Searching is based on a breadth-first search,
with subtrees being pruned from the search when no matches are found at an internal node.
Bloofi can also be partitioned across a network,
with each network node containing a subtree of the original tree and only being
accessed if the search requires it.
In genomic contexts,
a hierarchical index is a _$k$-mer aggregative_ method,
with each dataset represented by its $k$-mer composition and stored in a data structure that allows querying for $k$-mer presence.
The Sequence Bloom Tree [@solomon_fast_2016] adapts Bloofi for genomics and rephrases the search problem as _experiment discovery_:
given a query sequence $Q$ and a threshold $\theta$,
which experiments contain at least $\theta$ of the original query $Q$?
Experiments are encoded in Bloom Filters containing the $k$-mer composition of transcriptomes,
and queries are transcripts.
Further developments of the SBT approach focus on clustering similar datasets to prune the search
earlier [@sun_allsome_2017] and on more efficient representations for the
internal nodes [@solomon_improved_2017] and clusterings [@harris_improved_2018] that use less storage space and memory.
<!--
example figure for SBT:
http://www.texample.net/tikz/examples/merge-sort-recursion-tree/
-->
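To make the hierarchical search concrete, the sketch below is a minimal, illustrative version of a Bloofi/SBT-style index: plain Python sets stand in for Bloom Filters, internal nodes store the union of their subtrees, and a breadth-first search prunes any subtree whose node contains too small a fraction of the query $k$-mers. All names are hypothetical and chosen for clarity; a real implementation would use probabilistic filters and load nodes from disk on demand.

```python
from dataclasses import dataclass, field
from typing import Optional

# Minimal illustration of a hierarchical (Bloofi/SBT-style) index.
# Python sets stand in for Bloom Filters.

@dataclass
class Node:
    kmers: set                      # union of all k-mers in this subtree
    name: Optional[str] = None      # set for leaves (dataset identifier)
    children: list = field(default_factory=list)

def build_tree(datasets):
    """Build a simple binary tree with datasets as leaves."""
    nodes = [Node(kmers=set(kmers), name=name) for name, kmers in datasets]
    while len(nodes) > 1:
        paired = []
        for i in range(0, len(nodes), 2):
            group = nodes[i:i + 2]
            union = set().union(*(n.kmers for n in group))
            paired.append(Node(kmers=union, children=group))
        nodes = paired
    return nodes[0]

def search(root, query, threshold):
    """Breadth-first search, pruning subtrees below the threshold."""
    matches = []
    frontier = [root]
    while frontier:
        node = frontier.pop(0)
        contained = len(query & node.kmers) / len(query)
        if contained < threshold:
            continue                 # prune this subtree
        if node.name is not None:
            matches.append((node.name, contained))
        else:
            frontier.extend(node.children)
    return matches

datasets = [("exp1", {"ACG", "CGT", "GTA"}),
            ("exp2", {"ACG", "TTT", "GGG"}),
            ("exp3", {"CCC", "AAA", "TTT"})]
root = build_tree(datasets)
print(search(root, {"ACG", "CGT", "TTT"}, threshold=0.5))
# exp1 and exp2 pass the threshold; the exp3 subtree is pruned early.
```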
### Inverted index {#inverted-index}
<!-- 'color- aggregative methods in (marchet 2019)' -->
An inverted index is a mapping from words in a document back to their locations inside the document,
and is commonly used in information retrieval systems to find the occurrences of
words in a text [@ziviani_compression_2000].
Another example is the index at the back of a book,
containing a list of topics and the pages where they appear.
When indexing the $k$-mer decomposition of genomic datasets,
the inverted index is a _color aggregative_ method,
representable as a map from every $k$-mer in the $k$-mer composition of the datasets in the collection back to
the datasets from which it originated.
Just as words can appear more than once in a text,
$k$-mers can show up in more than one dataset,
and so the inverted index maps a $k$-mer to a list of datasets.
For efficiency,
$k$-mers are typically hashed and their integer representations (_hashes_) are used
instead.
`Kraken` [@wood_kraken:_2014] uses a special case of this structure,
with a taxonomic ID (taxon) representing dataset identity.
Datasets share the same ID if they belong to the same taxon,
and if a hash is present in more than one dataset
Kraken collapses the list of taxa to their lowest common ancestor (LCA),
which lowers the memory requirements for storing the index.
This LCA approach leads to decreased precision and sensitivity over time [@nasko_refseq_2018],
since new datasets are frequently added to reference databases and the chance of a $k$-mer being present in multiple datasets increases.
Efficient storage of the list of signature IDs can also be achieved by representing the list as colors,
where a color can represent one or more datasets (if a hash is present in many of them).
Mantis [@pandey_mantis:_2018] uses this hash-to-color mapping
(and an auxiliary color table) to achieve reduced memory usage,
and stores the mapping in Counting Quotient Filters [@pandey_general-purpose_2017],
an alternative to Bloom Filters that also supports counting and resizing.
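The core idea can be shown in a few lines of Python; the snippet below is a toy inverted index over hashed $k$-mers, with Python's built-in `hash` standing in for a real hash function and made-up dataset names, not the Kraken or Mantis data structures.

```python
from collections import defaultdict

# Toy inverted index: hashed k-mers -> set of dataset IDs.
# A color-aggregative variant would replace the set with a color ID
# pointing into a separate color -> {dataset IDs} table.

def kmers(seq, k=4):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_inverted_index(datasets, k=4):
    index = defaultdict(set)
    for dataset_id, seq in datasets.items():
        for kmer in kmers(seq, k):
            index[hash(kmer)].add(dataset_id)   # stand-in hash function
    return index

def query(index, seq, k=4):
    """Which datasets contain each query k-mer?"""
    return {kmer: index.get(hash(kmer), set()) for kmer in kmers(seq, k)}

datasets = {"genomeA": "ACGTACGT", "genomeB": "ACGTTTTT"}
index = build_inverted_index(datasets)
print(query(index, "ACGTA"))
# ACGT maps to both genomes, CGTA only to genomeA.
```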
## Specialized indices for Scaled MinHash sketches
`sourmash` [@brown_sourmash:_2016] is a software package for large-scale sequence data comparisons based on MinHash sketches.
The first version implemented operations for computing,
comparing and plotting distance matrices for _MinHash_ sketches;
version 2 [@pierce_large-scale_2019] introduces _Scaled MinHash_ sketches
and indices for this new sketch format.
Indices support a common set of operations
(insertion, search, and returning all signatures are the main ones),
allowing them to be used interchangeably depending on the use case,
performance requirements and available computational resources.
The simplest index is the `LinearIndex`,
a list of signatures.
Search operations are executed sequentially,
and insertions append new signatures to the end of the list.
Internally,
`sourmash` uses the `LinearIndex` as the default index for lists of
signatures provided on the command line.
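A stripped-down version of this abstraction is sketched below; the class and method names are illustrative stand-ins rather than the actual sourmash API, and the query object's `similarity` method is assumed.

```python
from abc import ABC, abstractmethod

# Illustrative Index abstraction: names are hypothetical and chosen
# for clarity, not taken from the sourmash codebase.

class Index(ABC):
    @abstractmethod
    def insert(self, signature): ...

    @abstractmethod
    def signatures(self):
        """Return all signatures stored in the index."""

    def search(self, query, threshold=0.1):
        """Default search: check every signature sequentially."""
        for sig in self.signatures():
            score = query.similarity(sig)   # assumed method on signatures
            if score >= threshold:
                yield score, sig

class LinearIndex(Index):
    """Simplest index: a flat list of signatures."""
    def __init__(self, signatures=()):
        self._signatures = list(signatures)

    def insert(self, signature):
        self._signatures.append(signature)

    def signatures(self):
        return iter(self._signatures)
```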
#### MinHash Bloom Tree
<!-- TODO: discuss insertion? -->
The _MinHash Bloom Tree_ (_MHBT_) is a variation of the _Sequence Bloom Tree_ (_SBT_)
that uses Scaled MinHash sketches as leaf nodes instead of the Bloom Filters used in
the SBT.
The search operation in SBTs is a breadth-first search starting at the root of the tree,
using a threshold on the fraction of the original query $k$-mers present to decide when to prune the search.
MHBTs use a query Scaled MinHash sketch instead,
but keep the same search approach.
The query threshold approach introduced in [@solomon_fast_2016] is
equivalent to the containment
$$C(Q, S) = \frac{\vert Q \cap S \vert }{\vert Q \vert}$$
described in [@broder_resemblance_1997],
where $Q$ is the query and $S$ is a Scaled MinHash sketch in the collection.
For internal nodes $n$ (which are Bloom Filters) the containment of the query Scaled MinHash sketch $Q$ is
$$C(Q, n) = \frac{\vert \{\,h \in Q \mid h \in n\,\} \vert}{\vert Q \vert},$$
the same containment score defined in [@koslicki_improving_2019] for the _Containment MinHash_ to _Bloom Filter_ comparison.
MHBTs support both containment and similarity queries.
For internal nodes the containment $C(Q,n)$ is used as an upper bound on the similarity $J(Q, n)$:
\begin{equation}
\begin{split}
C(Q, n) &\ge J(Q, n) \\
\frac{\vert Q \cap n \vert }{\vert Q \vert} &\ge \frac{\vert Q \cap n \vert }{\vert Q \cup n \vert}
\end{split}
\end{equation}
since $\vert Q \cup n \vert \ge \vert Q \vert$.
When a leaf node is reached, the similarity $J(Q, S)$ is calculated against its Scaled MinHash sketch $S$
and declared a match if it is above the threshold $t$.
Because an upper bound is used,
extra nodes may be checked,
but this simplifies the implementation and provides better correctness guarantees.
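The pruning rule amounts to a few lines of code. The sketch below is illustrative only (not the sourmash implementation): sets of hashes stand in for both the Bloom Filters at internal nodes and the Scaled MinHash sketches at leaves, and the node class is hypothetical.

```python
from dataclasses import dataclass, field

# Illustrative MHBT-style search: containment against internal nodes
# (an upper bound on Jaccard similarity) is used to prune subtrees;
# Jaccard is only computed at the leaves.

@dataclass
class MHBTNode:
    hashes: set
    name: str = ""
    children: list = field(default_factory=list)

    @property
    def is_leaf(self):
        return not self.children

def containment(query, hashes):
    return len(query & hashes) / len(query)

def jaccard(query, hashes):
    return len(query & hashes) / len(query | hashes)

def mhbt_search(root, query, threshold):
    matches = []
    frontier = [root]
    while frontier:
        node = frontier.pop(0)       # breadth-first traversal
        if node.is_leaf:
            score = jaccard(query, node.hashes)
            if score >= threshold:
                matches.append((score, node.name))
        elif containment(query, node.hashes) >= threshold:
            # C(Q, n) >= J(Q, n), so pruning here cannot miss a match.
            frontier.extend(node.children)
    return matches
```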
#### LCA index
<!-- TODO: discuss insertion? -->
The LCA index in sourmash is an inverted index that stores a mapping from hashes
in a collection of signatures to a list of IDs for the signatures containing each hash.
Despite the name,
the list of signature IDs is not collapsed to the lowest common ancestor (as in Kraken);
the LCA is calculated as needed by downstream methods using taxonomy information
that is stored separately in the LCA index.
The mapping from hashes to signature IDs in the LCA index is an implicit representation of the original signatures used to build the index,
and so returning the signatures is implemented by rebuilding the original signatures on-the-fly.
Search in an LCA index matches the hashes in the query to the lists of signature IDs containing them,
using a counter data structure to sort results by the number of matching hashes per signature ID.
The rebuilt signatures are then returned as matches based on the signature ID,
with containment or similarity to the query calculated against the rebuilt signatures.
`mash screen` [@ondov_mash_2019] uses a similar index,
but it is constructed on-the-fly using the distinct hashes in a sketch collection as keys,
with values being counters initially set to zero.
As the query is processed,
matching hashes have their counts incremented,
and after all hashes in the query are processed the counters for each sketch in the collection are
checked to quantify the containment/similarity of that sketch in the query.
The LCA index takes the opposite approach,
storing the mapping and reconstructing the sketches on-the-fly when needed.
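A toy version of this counting search is sketched below; the actual LCA index also stores taxonomy and reconstructs full signature objects, which are omitted here, and all names are illustrative rather than the sourmash API.

```python
from collections import Counter, defaultdict

# Toy LCA-index-style search: an inverted mapping from hashes to
# signature IDs, with a Counter ranking candidates by shared hashes.

def build_lca_index(sketches):
    """sketches: dict of signature ID -> set of hashes."""
    hash_to_ids = defaultdict(list)
    for sig_id, hashes in sketches.items():
        for h in hashes:
            hash_to_ids[h].append(sig_id)
    return hash_to_ids

def search(hash_to_ids, sketches, query, threshold=0.1):
    counts = Counter()
    for h in query:
        for sig_id in hash_to_ids.get(h, ()):
            counts[sig_id] += 1
    results = []
    for sig_id, _ in counts.most_common():
        # The real index rebuilds the sketch from the hash -> ID
        # mapping; here the original sketches are simply kept around.
        sketch = sketches[sig_id]
        containment = len(query & sketch) / len(query)
        if containment >= threshold:
            results.append((containment, sig_id))
    return results

sketches = {"sigA": {1, 2, 3, 4}, "sigB": {3, 4, 5, 6}}
index = build_lca_index(sketches)
print(search(index, sketches, query={2, 3, 4}))
# Highest-containment matches first: sigA (1.0), then sigB (~0.67).
```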
## Results
### Index construction
To evaluate MHBT and LCA index construction, a GenBank snapshot from July 18, 2020,
containing 725,331 assembled genomes (
5,282 Archaea,
673,414 Bacteria,
6,601 Fungi,
933 Protozoa and
39,101 Viral) <!-- TODO add total data size here? need to calculate... -->
was used to measure runtime,
memory consumption and final index size.
MHBT indices were built with $scaled=1000$,
and LCA indices used $scaled=10000$.
Table \@ref(tab:lca-index) shows the results for the LCA index,
and Table \@ref(tab:mhbt-index) for the MHBT index.
Table: (\#tab:lca-index) Results for LCA indexing,
with $scaled=10000$ and $k=21$.
| Domain | Runtime (s) | Memory (MB)| Size (MB) |
|:---------|------------:|-----------:|----------:|
| Viral | 57 | 33 | 2 |
| Archaea | 58 | 30 | 5 |
| Protozoa | 231 | 3 | 17 |
| Fungi | 999 | 3 | 65 |
| Bacteria | 12,717 | 857 | 446 |
Table: (\#tab:mhbt-index) Results for MHBT indexing,
with $scaled=1000$, $k=21$ and internal nodes (Bloom Filters)
using 10000 slots for storage.
| Domain | Runtime (s) | Memory (MB)| Size (MB) |
|:---------|------------:|-----------:|----------:|
| Viral | 126 | 326 | 77 |
| Archaea | 111 | 217 | 100 |
| Protozoa | 206 | 753 | 302 |
| Fungi | 1,161 | 3,364 | 1,585 |
| Bacteria | 32,576 | 47,445 | 24,639 |
Index sizes are affected more by the number of genomes inserted than by the
sizes of the individual _Scaled MinHash_ sketches.
Although Protozoan and Fungal _Scaled MinHash_ sketches are larger individually,
the Bacterial indices are an order of magnitude larger for both index types, since
they contain two orders of magnitude more genomes.
Comparisons between LCA and MHBT index sizes must account for their different scaled parameters,
but as shown in Chapter [1](#chp-scaled), a _Scaled MinHash_ sketch with $scaled=1000$ is expected to be
ten times smaller when downsampled to $scaled=10000$.
Even so,
MHBT indices are more than ten times larger than their LCA counterparts,
since they store extra caching information
(the internal nodes)
to avoid loading all the data into memory during search.
LCA indices also contain extra data
(the list of datasets containing each hash),
but this overhead is smaller than the storage required for the MHBT internal nodes.
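The ten-fold relationship comes from how downsampling works: a _Scaled MinHash_ keeps the hashes below $2^{64}/s$, so raising $s$ from 1,000 to 10,000 retains roughly a tenth of the hashes. A minimal sketch of the operation (illustrative, not the sourmash implementation):

```python
# Downsampling a Scaled MinHash sketch: keep only the hashes below the
# new (smaller) cutoff. Going from scaled=1000 to scaled=10000 keeps
# roughly 1/10 of the hashes, and so roughly 1/10 of the space.

H = 2 ** 64                       # size of the 64-bit hash space

def max_hash(scaled):
    return H // scaled

def downsample(hashes, old_scaled, new_scaled):
    assert new_scaled >= old_scaled, "can only downsample to a larger scaled"
    cutoff = max_hash(new_scaled)
    return {h for h in hashes if h < cutoff}
```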
<!--
CAMI 2 refseq, 141k signatures, scaled=2000
LCA: 18.3 GB
SBT: 524 MB
sig.name(): 5078 MB
-->
### Similarity queries on sourmash indices
<!--
/usr/bin/time -v -o timings/bacteria_lca.txt -- sourmash search -k 21 --scaled 10000 sigs/bacteria/GCA_002846625.1.sig lca/genbank-bacteria-k21-scaled10k.lca.json.gz
/usr/bin/time -v -o timings/bacteria_sbt.txt -- sourmash search -k 21 --scaled 1000 sigs/bacteria/GCA_002846625.1.sig sbt/genbank-bacteria-x1e4-k21.sbt.zip
/usr/bin/time -v -o timings/fungi_lca.txt -- sourmash search -k 21 --scaled 10000 sigs/fungi/GCA_900069095.1.sig lca/genbank-fungi-k21-scaled10k.lca.json.gz
/usr/bin/time -v -o timings/fungi_sbt.txt -- sourmash search -k 21 --scaled 1000 sigs/fungi/GCA_900069095.1.sig sbt/genbank-fungi-x1e4-k21.sbt.zip
/usr/bin/time -v -o timings/protozoa_lca.txt -- sourmash search -k 21 --scaled 10000 sigs/protozoa/GCA_013420745.1.sig lca/genbank-protozoa-k21-scaled10k.lca.json.gz
/usr/bin/time -v -o timings/protozoa_sbt.txt -- sourmash search -k 21 --scaled 1000 sigs/protozoa/GCA_013420745.1.sig sbt/genbank-protozoa-x1e4-k21.sbt.zip
/usr/bin/time -v -o timings/archaea_lca.txt -- sourmash search -k 21 --scaled 10000 sigs/archaea/GCA_000230485.1.sig lca/genbank-archaea-k21-scaled10k.lca.json.gz
/usr/bin/time -v -o timings/archaea_sbt.txt -- sourmash search -k 21 --scaled 1000 sigs/archaea/GCA_000230485.1.sig sbt/genbank-archaea-x1e4-k21.sbt.zip
/usr/bin/time -v -o timings/viral_lca.txt -- sourmash search -k 21 --scaled 10000 sigs/viral/GCA_006401735.1.sig lca/genbank-viral-k21-scaled10k.lca.json.gz
/usr/bin/time -v -o timings/viral_sbt.txt -- sourmash search -k 21 --scaled 1000 sigs/viral/GCA_006401735.1.sig sbt/genbank-viral-x1e4-k21.sbt.zip
-->
To evaluate the search performance of MHBT and LCA indices,
each of the previously described indices built from the GenBank domains was used
to execute similarity searches (finding datasets in a collection that are similar to a query),
using an appropriate query for each domain.
All queries were selected from the relevant domain and run against both the MHBT ($scaled=1000$) and LCA ($scaled=10000$) indices,
with $k=21$.
Table: (\#tab:search-runtime) Running time in seconds for similarity search
using LCA ($scaled=10000$) and MHBT ($scaled=1000$) indices.
| | Viral | Archaea | Protozoa | Fungi | Bacteria |
|:----------|-----------:|-----------:|-----------:|-------------:|--------------:|
| LCA | 1.06 | 1.42 | 5.40 | 26.92 | 231.26 |
| SBT | 1.32 | 3.77 | 43.51 | 244.77 | 3,185.88 |
Table: (\#tab:search-memory) Memory consumption in megabytes for similarity search
using LCA ($scaled=10000$) and MHBT ($scaled=1000$) indices.
| | Viral | Archaea | Protozoa | Fungi | Bacteria |
|:----------|--------:|--------:|---------:|----------:|--------------:|
| LCA | 223 | 240 | 798 | 3,274 | 20,926 |
| SBT | 163 | 125 | 332 | 1,656 | 2,290 |
Table \@ref(tab:search-runtime) shows the running time for both indices.
For small indices (Viral and Archaea) the LCA running time is dominated by loading the index into memory,
but for larger indices this cost is amortized by the faster search.
The difference is clearest for the Bacteria indices,
where the LCA search completes in 3 minutes and 51 seconds,
while the SBT search takes about 53 minutes.
When comparing memory consumption,
the situation is reversed.
Table \@ref(tab:search-memory) shows that the LCA index uses at least twice the memory of the MHBT across all domains,
and for larger indices like Bacteria it uses as much as 10 times the memory of
the MHBT index for the same data.
For both runtime and memory consumption,
it is worth noting that the LCA index holds a tenth of the data indexed by the MHBT,
due to its larger scaled value.
This highlights the trade-off between speed and memory consumption for the two approaches,
especially for larger indices.
<!--
### Specificity of taxonomic $k$-mer assignments
- k=21,31,51
- scaled=10k
- using genbank bacteria 2020.07.18
- TODO: weird bump for superkingdom with k=21
-->
## Discussion
### Choosing an index
The Linear index is appropriate for operations that must check every signature,
since it doesn't have any indexing overhead.
An example is building a distance matrix for comparing signatures all-against-all.
Search operations greatly benefit from extra indexing structure.
The MHBT index and $k$-mer aggregative methods in general are appropriate for searches with query thresholds,
like searching for similarity or containment of a query in a collection of datasets.
The LCA index and color aggregative methods are appropriate for querying which datasets contain a specific query $k$-mer.
As implemented in sourmash,
the MHBT index is more memory efficient because the data can stay in external memory and only the tree structure for the index
needs to be loaded into main memory,
with data for the datasets and internal nodes loaded and unloaded on demand.
The LCA index must be loaded into main memory before it can be used,
but once loaded it is faster,
especially for operations that need to summarize $k$-mer assignments or that require repeated searches.
Due to these characteristics,
the LCA index is the most appropriate choice when memory usage is not a concern,
since it is faster.
The MHBT index is currently recommended for situations where memory is limited,
such as with smaller scaled values ($s\le2000$)
that increase the size of signatures,
or when there is a large number (hundreds of thousands or more) of datasets to index.
### Converting between indices
Both the MHBT and the LCA index can recover the original sketch collection.
The MHBT outputs all of its leaf nodes,
while the LCA index reconstructs each sketch from the hash-to-dataset-ID mapping.
This allows trade-offs between storage efficiency,
distribution,
updating and query performance.
Because both can return the original sketch collection,
it is also possible to convert one index into the other.
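Given the illustrative `Index` interface sketched earlier, conversion reduces to iterating over the signatures of one index and inserting them into the other (hypothetical names again):

```python
def convert(source_index, target_index):
    """Copy every signature from one index into another.

    Works for any pair of indices implementing the illustrative Index
    interface sketched above (signatures() and insert()).
    """
    for signature in source_index.signatures():
        target_index.insert(signature)
    return target_index
```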
### Limitations and future directions
_Scaled MinHash_ sketches are fundamentally a subset of the $k$-mer composition of a dataset,
and so any of the techniques described in [@marchet_data_2019] are potential
candidates for improving current indices or implementing new ones.
The MHBT index can be improved with more efficient representations for the internal nodes [@solomon_improved_2017]
and by constructing the tree via clustering [@harris_improved_2018],
while the LCA index can store the list of signature IDs more efficiently by representing it as colors [@pandey_mantis:_2018].
The memory consumption of the LCA index can also be tackled by implementing it in
external memory using memory-mapped files,
letting the operating system cache and unload pages as needed.
Current indices are also single-threaded,
and don't benefit from multicore systems.
Both indices can be used in parallel by loading them read-only and sharing them across multiple searches,
but it is also possible to parallelize single queries,
by partitioning the LCA index and assigning each partition to a thread,
or by using a work-stealing thread pool to expand the MHBT search frontier in parallel.
In any case,
the current implementations serve as a baseline for future scalability and can
be used to guide optimization and avoid extraneous overhead and common failings
of such projects [@mcsherry_scalability_2015].
## Conclusion
_Scaled MinHash_ sketches allow scaling analysis to thousands of datasets,
but efficiently searching and sharing them can benefit from data structures that
index and optimize these use cases.
This chapter introduces an index abstraction that can be trivially implemented
using a list of sketches (_Linear index_), together with more advanced implementations
based on inverted indices (_LCA index_) and hierarchical indices (_MHBT_)
that provide options for fast and memory-efficient operations,
as well as making it easier to share and analyze collections of sketches.
All these functionalities are implemented in `sourmash`,
a software package exposing these features as a command-line program as well as
a Python API for data exploration and methods prototyping.
These indices also serve as another set of building blocks for constructing more advanced
methods for solving other relevant biological problems like taxonomic profiling,
described in Chapter [3](#chp-gather),
and approaches for increasing the resilience and shareability of biological
sequencing data,
described in Chapter [5](#chp-decentralizing).
## Methods
### Implementation
`sourmash` is a software package implemented in Python
(the command-line interface and the API for data exploration)
and Rust (the core data structures, for performance).
Both _Scaled_ and regular _MinHash_ sketches are available,
calculated using the _MurmurHash3_ hash function
(the lower 64 bits of the 128-bit version) with a seed of 42
and stored in memory as a sorted vector.
Serialization to and deserialization from JSON are implemented using the `serde` crate,
and sketches also support abundance tracking for the hashes.
The _LCA_ and _MHBT_ indices are implemented at the Python level,
and the _MHBT_ supports multiple storage backends
(hidden directories, Zip files, IPFS and Redis)
depending on the use case requirements.
The _MHBT_ is implemented as a specialization of an _SBT_,
replacing the Bloom Filters in the leaf nodes of the latter with _Scaled MinHash_
sketches.
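For illustration, the hashing step might be reproduced in Python with the `mmh3` package, which exposes the 128-bit x64 MurmurHash3 variant as two 64-bit values; treating the first value as the "lower 64 bits", and the canonical $k$-mer handling shown here, are assumptions and may not reproduce sourmash's exact hash values.

```python
import mmh3

# Illustrative k-mer hashing: MurmurHash3 x64_128 with seed 42, keeping
# one 64-bit half, applied to the canonical (lexicographically smaller
# of forward/reverse-complement) k-mer. This mirrors the description in
# the text but is not guaranteed to match sourmash's exact hash values.

REVCOMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    rc = kmer.translate(REVCOMP)[::-1]
    return min(kmer, rc)

def hash_kmer(kmer, seed=42):
    # hash64 returns the 128-bit hash as two 64-bit integers;
    # the first one is used as the 64-bit hash value here.
    return mmh3.hash64(canonical(kmer), seed=seed, signed=False)[0]

def scaled_sketch(sequence, ksize=21, scaled=1000):
    max_hash = 2 ** 64 // scaled
    hashes = set()
    for i in range(len(sequence) - ksize + 1):
        h = hash_kmer(sequence[i:i + ksize])
        if h < max_hash:
            hashes.add(h)
    return sorted(hashes)          # stored as a sorted vector, as above
```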
### Experiments
Experiments are implemented as `snakemake` workflows and use `conda` for
managing dependencies,
allowing the results to be reproduced with a single command:
`snakemake --use-conda`.
This downloads all data,
installs dependencies and generates the data used for the analysis.
The analysis and figure generation code is contained in a Jupyter Notebook,
and can be executed anywhere Jupyter is supported,
including a local installation or Binder,
a service that deploys live Jupyter environments on cloud instances.
Instructions are available at https://doi.org/10.5281/zenodo.4012667.