<!--
The {#rmd-basics} text after the chapter declaration will allow us to link throughout the document back to the beginning of Chapter 1. These labels will automatically be generated (if not specified) by changing the spaces to hyphens and capital letters to lowercase. Look for the reference to this label at the beginning of Chapter 2.
-->
# Accurate containment estimation with Scaled MinHash sketches {#chp-scaled}
```{=latex}
\begin{epigraphs}
\qitem{The aim of science is not to open the door to infinite wisdom, but to set a limit to infinite error.}%
{Bertolt Brecht}
\end{epigraphs}
```
## Introduction
New computational methods are required to analyze the growing volume of sequencing data made possible by lower sequencing costs,
since traditional approaches like alignment don't scale to data of this magnitude.
An interesting class of algorithms are sketches [@gibbons_synopsis_1999]:
sublinear-space representations of the original data that support queries for specific properties,
using hashing techniques to provide statistical guarantees on the precision of the answer to a query.
These probabilistic data structures allow a memory/accuracy trade-off:
using more memory leads to more accurate results,
but in memory-constrained situations the results are still bounded by an expected error rate.
### MinHash sketch: similarity and containment
The MinHash sketch [@broder_resemblance_1997] was developed at Altavista in the context of document clustering and deduplication.
It provides an estimate of the Jaccard similarity
(called **resemblance** in the original article)
$$ J(A, B) = \frac{\vert A \cap B \vert }{\vert A \cup B \vert} $$
and the **containment** of two documents
$$C(A, B) = \frac{\vert A \cap B \vert }{\vert A \vert}$$
estimating how much of document $A$ is contained in document $B$.
These estimates depend on two processes:
converting documents to sets (*Shingling*),
and transforming large sets into short signatures,
while preserving similarity (*Min-Hashing*).
In the original use case the *$w$-shingling* $\Omega$ of a document $D$ is defined as the set of all contiguous subsequences of $w$ words contained in $D$.
*Min-hashing* is the process of creating $W = \{\,h(x) \mid \forall x \in \Omega\,\}$,
where $h(x)$ is a uniform hash function,
and then either
a) keeping the $n$ smallest elements ($\mathbf{MIN}_n(W)$, a MinHash), or
b) keeping elements that are $0 \mod m$ ($\mathbf{MOD}_m(W)$, a ModHash).
$\mathbf{MIN}_n(W)$ is fixed-sized (length $n$) and supports similarity estimation,
but doesn't support containment.
$\mathbf{MOD}_m(W)$ supports both similarity and containment,
with the drawback of not being fixed-sized anymore,
growing with the complexity of the document.
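To make the two constructions concrete,
the minimal Python sketch below builds $\mathbf{MIN}_n(W)$ and $\mathbf{MOD}_m(W)$ from the $w$-shingles of a toy document;
the `hashlib`-based hash is only a stand-in for $h(x)$,
and the names and parameter values are illustrative assumptions rather than any particular implementation.

```python
import hashlib

def h(shingle: str) -> int:
    """64-bit hash of a shingle (stand-in for the uniform hash function h(x))."""
    return int.from_bytes(hashlib.sha256(shingle.encode()).digest()[:8], "big")

def w_shingling(words: list[str], w: int) -> set[str]:
    """All contiguous runs of w words in the document."""
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def min_n(W: set[int], n: int) -> set[int]:
    """MIN_n(W): keep the n smallest hash values (fixed size)."""
    return set(sorted(W)[:n])

def mod_m(W: set[int], m: int) -> set[int]:
    """MOD_m(W): keep hash values that are 0 mod m (size grows with |W|)."""
    return {x for x in W if x % m == 0}

doc = "the aim of science is not to open the door to infinite wisdom".split()
W = {h(s) for s in w_shingling(doc, w=2)}
print(min_n(W, n=4), mod_m(W, m=4))
```

Note how $\mathbf{MIN}_n(W)$ is capped at $n$ elements while $\mathbf{MOD}_m(W)$ grows with the number of distinct shingles.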
### Mash and genomic MinHash
Mash [@ondov_mash:_2016] was the first implementation of MinHash in a genomic context,
relating the $w$-shingling of a document to the $k$-mer composition of genomic datasets
and using the fixed-size $\mathbf{MIN}_n(W)$ formulation for the signature.
Mash needs extra information (the genome size of the organism being queried) to account for genomic complexity,
and it derives a new score,
the Mash distance,
to bring the estimate closer to previously existing similarity measures in biology
(Average Nucleotide Identity).
This extra information is required because a fixed-size MinHash yields different degrees of accuracy when comparing highly-diverged organisms
(bacteria to animals, for example),
and the effect is even more extreme for complex datasets like metagenomes.
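For context,
the Mash distance converts the Jaccard estimate $j$ obtained from two sketches into a distance under a Poisson model of random $k$-mer mutation,
$$D = -\frac{1}{k}\ln{\left(\frac{2j}{1+j}\right)}$$
so that $1 - D$ approximates the Average Nucleotide Identity between closely related genomes.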
### Containment MinHash
CMash [@koslicki_improving_2019] implements the Containment MinHash with the goal of allowing containment estimates between sets of very different sizes.
The construction is the same as MinHash for the smaller dataset,
but the $k$-mer composition of the larger dataset is stored in a Bloom Filter [@bloom_spacetime_1970],
which can also be calculated while streaming the data.
The containment is defined as the number of hashes from the MinHash $A$ present in the Bloom Filter $B$,
divided by the size of $A$:
$$C(A, B) = \frac{\vert \{\,h \in B \mid \forall h \in \mathbf{MIN}_n(W)\,\} \vert}{\vert \mathbf{MIN}_n(W) \vert}$$
Since the Containment MinHash approach does not perform MinHash-to-MinHash comparisons,
it also avoids storing a large set of hashes in order to keep the
containment estimate within small error bounds (less than 1\%).
This formulation enables the comparison of metagenomes (stored in a Bloom Filter)
against a collection of MinHash sketches (one for each reference dataset),
a use case not supported by Mash originally.
For repeated analysis of a query,
\emph{CMash} has the advantage that the query only needs to be processed into a Bloom Filter once,
and the filter can later be reused when the collection of reference MinHash sketches is updated.
This removes the need to keep the original data for the query,
and since the Bloom Filter is much smaller than the original dataset in most cases,
it leads to storage savings for this use case.
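A minimal Python sketch of this construction is shown below;
the Bloom Filter here is a deliberately simple stand-in
(a single bit array with a few salted hashes)
rather than CMash's actual implementation,
and the sizes and names are illustrative assumptions.

```python
import hashlib

def h64(item: str, salt: int = 0) -> int:
    """64-bit salted hash, used here for Bloom Filter positions."""
    return int.from_bytes(hashlib.sha256(f"{salt}:{item}".encode()).digest()[:8], "big")

class BloomFilter:
    """Toy Bloom Filter: one bit array queried with a few salted hash functions."""
    def __init__(self, num_bits: int = 2 ** 20, num_hashes: int = 3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        return [h64(item, salt) % self.num_bits for salt in range(1, self.num_hashes + 1)]

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def containment(minhash_kmers: set[str], metagenome_bf: BloomFilter) -> float:
    """C(A, B): fraction of the reference MinHash's k-mers found in the query's Bloom Filter."""
    return sum(1 for kmer in minhash_kmers if kmer in metagenome_bf) / len(minhash_kmers)
```

The metagenome's $k$-mers are inserted into the Bloom Filter once (possibly while streaming),
and each reference MinHash is then checked against it.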
### Containment score and Mash Screen
\emph{Mash Screen} [@ondov_mash_2019] is a new method implemented in Mash for calculating containment scores.
Given a collection of reference MinHash sketches and a query sequence mixture (a metagenome, for example),
\emph{Mash Screen} builds a mapping from each distinct hash present in the reference MinHash sketches to a count of how many times that hash was observed.
The query sequence mixture is decomposed and hashed with the same parameters $k$
(from the $k$-mer composition) and $h$ (the hash function) used to generate the reference MinHash sketches,
and for each query hash that is also present in the mapping the corresponding counter is updated.
Finally,
once the query has been fully processed,
each reference MinHash sketch is summarized by checking the counts for each of its hashes in the mapping and producing a final containment estimate.
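The bookkeeping described above can be sketched in a few lines of Python;
hash values stand in for the actual Mash machinery,
and the summary statistic shown (fraction of a sketch's hashes observed at least once) is a simplification,
so treat it as an illustration rather than Mash Screen's implementation.

```python
from collections.abc import Iterable

def screen(references: dict[str, set[int]], query_hashes: Iterable[int]) -> dict[str, float]:
    """Stream a query against a collection of reference sketches, Mash Screen style."""
    # 1. Collect every distinct hash across the reference sketches into one mapping.
    counts = {h: 0 for sketch in references.values() for h in sketch}

    # 2. Stream the query: bump the counter of each hash that belongs to some reference.
    for h in query_hashes:
        if h in counts:
            counts[h] += 1

    # 3. Summarize each reference sketch from the observed counts.
    return {name: sum(1 for h in sketch if counts[h] > 0) / len(sketch)
            for name, sketch in references.items()}
```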
<!--
`mash screen` also defines a new metric,
the Mash Containment,
to model the $k$-mer survival
-->
Comparing \emph{Mash Screen} to \emph{CMash},
the main difference is that the former streams the query as raw sequencing data,
while the latter generates a Bloom Filter for the query first.
\emph{Mash Screen} avoids repeated membership checks to a Bloom Filter by collecting all distinct hashes across the reference MinHash sketches first,
and then updating a counter associated with each hash if it is observed in the query.
After the query finishes streaming,
the counts are summarized against each sketch in the collection.
\emph{Mash Screen} needs the original query data during reanalysis,
because adding new sketches to the collection of references might introduce hashes not observed before.
For large-scale projects like the reanalysis of all SRA metagenomes,
this requirement means continuously storing or re-downloading many petabytes of data.
## Scaled MinHash {#scaled-minhash}
The Scaled MinHash is a mix of MinHash and ModHash.
From the former it keeps the smallest elements,
and from the latter it adopts the dynamic size to allow containment estimation.
Instead of taking $0 \mod m$ elements like $\mathbf{MOD}_m(W)$,
a Scaled MinHash uses a parameter $s$ to select a subset of $W$:
$$\mathbf{SCALED}_s(W) = \{\,w \leq \frac{H}{s} \mid \forall w \in W\,\}$$
where $H$ is the largest possible value in the domain of $h(x)$ and
$\frac{H}{s}$ is the \emph{maximum hash} value in the Scaled MinHash.
Given a uniform hash function $h$ and $s=m$,
the cardinalities of $\mathbf{SCALED}_s(W)$ and $\mathbf{MOD}_m(W)$ converge for large $\vert W \vert$.
The main difference is the range of possible values in the hash space,
since the Scaled MinHash range is contiguous and the ModHash range is not.
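In code the selection rule is a single comparison against the fixed threshold $\frac{H}{s}$;
the sketch below assumes a 64-bit hash domain and uses a `hashlib`-based hash as a stand-in for a fast uniform hash function such as the MurmurHash3 used in `smol` (see Methods),
and omits practical details like reverse-complement canonicalization.

```python
import hashlib

H = 2 ** 64 - 1  # largest possible value in the hash domain

def h(kmer: str) -> int:
    """64-bit hash of a k-mer (stand-in for MurmurHash3)."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")

def kmers(seq: str, k: int = 21):
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def scaled_minhash(seq: str, k: int = 21, scaled: int = 2000) -> set[int]:
    """SCALED_s(W): keep every hash value at or below the maximum hash H/s."""
    max_hash = H // scaled
    return {hv for hv in (h(kmer) for kmer in kmers(seq, k)) if hv <= max_hash}
```

On average one in every $s$ distinct $k$-mers hashes below the threshold,
so the sketch size tracks $\frac{\vert W \vert}{s}$.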
Figure \ref{fig:minhashes} shows an example comparing MinHash, ModHash and Scaled MinHash with the same parameter value.
\begin{figure}%[ht]
\begin{subfigure}{\textwidth}
\centering
\begin{tikzpicture}
\draw[] (0,0) -- (7.75,0) ; %edit here for the axis
\foreach \x in {0,...,31} % edit here for the vertical lines
\draw[shift={(\x/4,0)},color=black] (0pt,2pt) -- (0pt,-2pt);
\foreach \x in {0,8,16,24,31} % edit here for the numbers
\draw[shift={(\x/4,0)},color=black] (0pt,0pt) -- (0pt,-3pt) node[below] {$\x$};
\foreach \x in {0,...,31} % edit here for the markers
\node[mark size=3pt] at (\x/4,0) {\pgfuseplotmark{triangle}};
\foreach \x in {0,3,6,9,12,15,18,21,24,27,30} % edit here for the markers
\draw[shift={(\x/4,6pt)},color=red] (0,2pt) -- (0,-2pt);
\foreach \x in {0,3,6,9} % edit here for the markers
\node[mark size=3pt,color=red] at (\x/4,0) {\pgfuseplotmark{triangle*}};
\end{tikzpicture}
\caption{$\mathbf{MIN}_4(W)$}
\label{fig:sub-first}
\end{subfigure}\vspace{0.02\textheight}
\begin{subfigure}{\textwidth}
\centering
\begin{tikzpicture}
\draw[] (0,0) -- (7.75,0) ; %edit here for the axis
\foreach \x in {0,...,31} % edit here for the vertical lines
\draw[shift={(\x/4,0)},color=black] (0pt,2pt) -- (0pt,-2pt);
\foreach \x in {0,8,16,24,31} % edit here for the numbers
\draw[shift={(\x/4,0)},color=black] (0pt,0pt) -- (0pt,-3pt) node[below] {$\x$};
\foreach \x in {0,4,8,12,16,20,24,28} % edit here for the markers
\node[mark size=3pt] at (\x/4,0) {\pgfuseplotmark{oplus}};
\foreach \x in {0,3,6,9,12,15,18,21,24,27,30} % edit here for the markers
\draw[shift={(\x/4,6pt)},color=red] (0,2pt) -- (0,-2pt);
\foreach \x in {0,12,24} % edit here for the markers
\node[mark size=3pt,color=red] at (\x/4,0) {\pgfuseplotmark{oplus*}};
\end{tikzpicture}
\caption{$\mathbf{MOD}_4(W)$}
\label{fig:sub-second}
\end{subfigure}\vspace{0.02\textheight}
\begin{subfigure}{\textwidth}
\centering
\begin{tikzpicture}
\draw[] (0,0) -- (7.75,0) ; %edit here for the axis
\foreach \x in {0,...,31} % edit here for the vertical lines
\draw[shift={(\x/4,0)},color=black] (0pt,2pt) -- (0pt,-2pt);
\foreach \x in {0,8,16,24,31} % edit here for the numbers
\draw[shift={(\x/4,0)},color=black] (0pt,0pt) -- (0pt,-3pt) node[below] {$\x$};
\foreach \x in {0,1,2,3,4,5,6,7} % edit here for the markers
\node[mark size=3pt] at (\x/4,0) {\pgfuseplotmark{diamond}};
\foreach \x in {0,3,6,9,12,15,18,21,24,27,30} % edit here for the markers
\draw[shift={(\x/4,6pt)},color=red] (0,2pt) -- (0,-2pt);
\foreach \x in {0,3,6} % edit here for the markers
\node[mark size=3pt,color=red] at (\x/4,0) {\pgfuseplotmark{oplus*}};
\end{tikzpicture}
\caption{$\mathbf{SCALED}_4(W)$}
\label{fig:sub-third}
\end{subfigure}\vspace{0.02\textheight}
\caption{Comparing the behavior of different MinHash approaches with the same parameter
value $n=m=s=4$ and a hash function with domain $[0,31]$ (5 bits). Possible
values for each MinHash are represented with hollow shapes. Elements from
$W=\{\,w \in [0,31] \mid w \equiv 0 \pmod 3\,\}$ are marked with red lines
above the axis.}
\label{fig:minhashes}
\end{figure}
## Results
### Comparison with other containment estimation methods
In this section the _Scaled MinHash_ method implemented in `smol`
is compared to CMash (_Containment MinHash_)
and Mash Screen (_Containment Score_) for containment queries
in the Shakya dataset [@shakya_comparative_2013],
a synthetic mock metagenomic community of known bacterial and archaeal organisms,
including the low-coverage and contaminant genomes described in [@awad_evaluating_2017] and [@ondov_mash_2019].
`smol` is a minimal implementation of _Scaled MinHash_ for demonstrating the method;
it doesn't include many features required for working with real biological data,
but its smaller code base makes it a more readable and concise example of the method.
For _Mash Screen_ the ratio of hashes matched by total hashes is used instead of the _Containment Score_,
since the latter uses a $k$-mer survival process modeled as a Poisson process
first introduced in [@fan_assembly_2015] and later used in the _Mash distance_ [@ondov_mash:_2016] and
_Containment score_ [@ondov_mash_2019] formulations.
Experiments use $k=\{21, 31, 51\}$
(except for Mash, which only supports $k \le 32$).
Mash and CMash were run with $n=\{1000, 10000\}$
to evaluate the containment estimates when using larger sketches,
with sizes comparable to the Scaled MinHash sketches with $scaled=1000$.
The truth set is calculated using an exact $k$-mer counter implemented with a
_HashSet_ data structure in the Rust programming language [@matsakis_rust_2014].
```{r minhash1000, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, cache=TRUE, out.width="100%", auto_pdf=TRUE, fig.cap='(ref:minhash1000)', fig.show="hold", fig.align="center"}
knitr::include_graphics('../experiments/smol_gather/figures/containment.pdf')
```
(ref:minhash1000) Letter-value plot [@hofmann_letter-value_2017] of the
differences between each method's containment estimate and the ground truth (exact).
Each method is evaluated for $k=\{21,31,51\}$,
except for `Mash` with $k=51$,
since `Mash` doesn't support $k>32$.
**A**: Using all 68 reference genomes found in previous articles.
**B**: Excluding low coverage genomes identified in previous articles.
All methods are within 1\% of the exact containment on average (Figure \ref{fig:minhash1000} A),
with `CMash` consistently underestimating the containment for large $k$ and overestimating for
small $k$.
`Mash Screen` with $n=10000$ has the smallest difference to ground truth for $k=\{21, 31\}$,
followed by `smol` with `scaled=1000` and `Mash Screen` with $n=1000$.
To evaluate the effect of the low-coverage and contaminant genomes
previously detected in this dataset,
Figure \ref{fig:minhash1000} B shows results with these genomes removed.
The number of outliers is greatly reduced,
with all methods mostly within 1\% absolute difference to the ground truth.
`CMash` still has some outliers with up to 8\% difference to the ground truth.
<!-- TODO
* runtimes?
-->
### Scaled MinHash sketch sizes across GenBank domains
To compare sketch sizes across assembled genomes and to identify where sketches are too
small to properly estimate containment and similarity,
514,209 assembled genomes (
5,234 Archaea,
464,485 Bacteria,
6,984 Fungi,
1,093 Protozoa and
36,413 Viral genomes)
from GenBank [@benson_genbank_2017] were used for computing _Scaled MinHash_ sketches with $scaled=2000$ and $k=21$.
Figure \ref{fig:scaledGenBankSizesBoxen} shows letter-value plots [@hofmann_letter-value_2017] of the sketch sizes.
The mean sketch size ranges from 20 for viruses to 18,115 for protozoans,
indicating that _Scaled MinHash_ sketches at $scaled=2000$ are too small for accurate containment and similarity estimates
in most viral genomes,
but are large enough for archaea (mean sketch size = $767$) and larger organisms.
```{r scaledGenBankSizesBoxen, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, cache=TRUE, fig.cap='(ref:genbank-sizes-boxen)', out.width="100%", fig.show="hold", fig.align="center"}
knitr::include_graphics('../experiments/sizes/figures/sizes.pdf')
```
(ref:genbank-sizes-boxen) Letter-value plot of _Scaled MinHash_
sketch sizes ($scaled=2000, k=21$) for 514,209 assembled genomes in GenBank,
separated by domain.
Figure \ref{fig:scaledGenBankSizesScatter} uses the same data as Figure \ref{fig:scaledGenBankSizesBoxen},
but as a scatter plot of sketch size against the number of base pairs and the number of unique 21-mers
for each genome.
The line in subfigure **B** shows the expected sketch size $size = \frac{unique}{scaled}$ for $scaled=2000$;
for example,
a genome with 4 million unique 21-mers has an expected sketch size of about 2,000 hashes.
The observed sizes converge to this line for larger genomes.
Subfigures **A** and **B** differ because repetitive genomic sequences mean the number of base pairs does not
reflect the number of unique $k$-mers in a genome.
```{r scaledGenBankSizesScatter, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, cache=TRUE, fig.cap='(ref:genbank-sizes-scatter)', out.width="100%", fig.show="hold", fig.align="center"}
knitr::include_graphics('../experiments/sizes/figures/sizes_bp_unique.png')
```
(ref:genbank-sizes-scatter) Scaled MinHash sketch sizes ($scaled=2000, k=21$) over GenBank domains.
**A** Using number of base pairs in the original genome (including repeats).
**B** Using unique 21-mers instead. Line is expected sketch size for $scaled=2000$.
## Discussion
### Operations without the original data
<!-- TODO: yuck, better subsection title
"Operating on sketches?"
-->
Once a Scaled MinHash is calculated there are many operations that can be applied without depending on the original data,
saving storage space and allowing analyses to scale to thousands of datasets.
Most of these operations are also possible with MinHash and ModHash,
with caveats.
One example of these operations is \emph{downsampling}:
the contiguous value range of Scaled MinHash sketches allows deriving $\mathbf{SCALED}_{s'}(W)$ sketches for any $s' \ge s$ using only $\mathbf{SCALED}_{s}(W)$.
MinHash and ModHash can also support this operation,
as long as $n' \le n$ and $m'$ is a multiple of $m$.
Because Scaled MinHash sketches collect every value below a fixed threshold,
once a value is selected it is never discarded.
This is useful in streaming contexts:
any operation that used a previously selected value can be cached and updated as new values arrive.
$\mathbf{MOD}_m(W)$ has similar properties,
but this is not the case for $\mathbf{MIN}_n(W)$,
since after $n$ values are selected any displacement caused by new data can invalidate previous calculations.
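Both properties reduce to simple filters over the retained hash values;
the sketch below (function names are illustrative) shows downsampling to a larger $s'$ and a streaming update that never displaces a previously selected hash.

```python
H = 2 ** 64 - 1  # largest possible value in the hash domain

def downsample(sketch: set[int], scaled_to: int) -> set[int]:
    """Derive SCALED_{s'}(W) from SCALED_s(W), for any s' >= s."""
    max_hash = H // scaled_to
    return {hv for hv in sketch if hv <= max_hash}

def streaming_update(sketch: set[int], new_hashes, scaled: int) -> None:
    """Add newly observed hashes in place; earlier selections are never removed."""
    max_hash = H // scaled
    sketch.update(hv for hv in new_hashes if hv <= max_hash)
```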
Abundance tracking is another extension to MinHash sketches,
keeping a count of how many times a value appeared in the original data.
This allows filtering for low-abundance values,
as implemented in Finch [@bovee_finch:_2018],
another MinHash sketching software for genomics.
Filtering values that only appeared once was implemented before in Mash by using a Bloom Filter and only adding values after they were seen once,
with later versions also implementing an extra counter array to keep track of counts for each value in the MinHash.
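A hedged sketch of abundance tracking on top of a Scaled MinHash,
keeping a count per retained hash and filtering low-abundance values;
the cutoff and function names are illustrative and do not reproduce Finch's or Mash's actual interfaces.

```python
from collections import Counter

H = 2 ** 64 - 1

def scaled_with_abundance(hashes, scaled: int) -> Counter:
    """Count how many times each retained hash value was observed in the data."""
    max_hash = H // scaled
    return Counter(hv for hv in hashes if hv <= max_hash)

def filter_low_abundance(abundances: Counter, min_count: int = 2) -> set[int]:
    """Keep hashes seen at least min_count times (dropping likely sequencing errors)."""
    return {hv for hv, count in abundances.items() if count >= min_count}
```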
<!-- TODO: discuss here how abundance tracking in MinHash is not "correct",
because it is not a proper weighted subsample of the data?
Note that Scaled MinHash is a proper weighted subsample
-->
Other operations are adding and subtracting hash values from a Scaled MinHash sketch,
allowing post-processing and filtering.
Although this is also possible for $\mathbf{MIN}_n(W)$,
in practice it requires oversampling (using a larger $n$) to account for possibly having fewer than $n$ values after filtering
(the approach taken by Finch [@bovee_finch:_2018]).
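Since a Scaled MinHash is just a set of hash values below a fixed threshold,
these operations are plain set arithmetic and need no oversampling;
a small illustrative sketch:

```python
def subtract_hashes(sketch: set[int], to_remove: set[int]) -> set[int]:
    """Remove specific hash values (for example, hashes traced to a contaminant)."""
    return sketch - to_remove

def add_hashes(sketch: set[int], extra: set[int], max_hash: int) -> set[int]:
    """Add hash values, keeping only those within the sketch's maximum hash."""
    return sketch | {hv for hv in extra if hv <= max_hash}
```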
<!--
What is the difference with modulo hash (per Broder 1997)? Modulo is expensive vs less-than operator; convertible between minhash and modhash.
address brad’s comment on f1000 paper: the existing modulo approach has no guarantees on equal-sized
(or even equal-fraction as the manuscript claims elsewhere) sub-sampling
Features of scaled:
* subsampling to lower densities
* streaming guarantees (never lose hash value - containment never decreases as you get more data)
* add and subtract hash values (category, “operations directly on sketches”?)
* abundance filter hash values (operations directly on sketches)
Could also alternatively phrase scaled as lossy compression, more so than minhash.
-->
### Limitations
A drawback of Scaled MinHash when compared to regular MinHash sketches is the size:
the MinHash parameter $n$ sets an upper bound on the size of the sketch,
independent of the size of the original data.
Scaled MinHash sketches grow proportionally to the original data cardinality,
and in the worst case can have up to $\frac{H}{s}$ items.
_Scaled MinHash_ sketches offer a fixed range of possible hash values,
but with reduced sensitivity for small datasets when using larger $s$ (scaled) values.
A biological example is viruses:
at $s=2000$ many viruses are too small to consistently have a hashed value
selected by the _Scaled MinHash_ approach.
Other _MinHash_ approaches sidestep the problem by hashing and streaming the query dataset (`Mash Screen`)
or loading the query dataset into an approximate membership query data structure (`CMash`) to allow comparisons
across the variable range of possible hash values,
but both solutions require the original data or a data representation more limited than _Scaled MinHash_.
The consistency of operating on the same data structure also allows further
methods to be developed using only _Scaled MinHash_ sketches and their features,
especially if large collections of _Scaled MinHash_ sketches are available.
<!-- TODO
- increased size (compared with minhash, which is constant size)
- detection threshold (viruses)
- CTB: include poisson statistics and plot of how many windows are missed (from charcoal stats)
* poisson statistics / random sequence foo
+ “A value of 2000 means that we are 99.98% likely to choose one k-mer per 10kb window,
or to rephrase it, 9998 of 10000 windows of size 10kb will have at least one chosen k-mer in them.”
see [numbers from charcoal](https://github.com/dib-lab/charcoal/blob/master/stats/stats.ipynb).
-->
## Conclusion and Future Work
_Scaled MinHash_ sketches are simple to implement and analyze,
with consistent guarantees for the range of values and subsetting properties when
applied to datasets.
Containment and similarity operations between _Scaled MinHash_ sketches
avoid the need to access the original data or more limited representations that only allow membership query,
and serve as a proxy for large scale comparisons between hundreds or thousands of datasets.
Small genomes require low scaled values in order to properly estimate containment and similarity,
and exact $k$-mer matching is brittle when considering evolutionarily-diverged organisms.
While some of these problems can be overcome in future work,
_Scaled MinHash_ sketches can serve as a prefilter for more accurate and
computationally expensive applications,
allowing these methods to be used in larger scales by avoiding processing data
that is unlikely to return usable results.
_Scaled MinHash_ sketches are effective basic building blocks for creating a software
ecosystem that allows practical applications,
including taxonomic classification in metagenomes and large-scale indexing and searching of public genomic databases.
## Methods
### Implementation
#### smol
`smol` is a minimal implementation of the Scaled MinHash sketch and the gather method, used for simulations and for verifying results against more featureful tools.
There are two compatible versions,
one in Python and another in Rust,
due to performance requirements when processing large datasets (like metagenomes).
Both implementations use each language's standard library set type
(`set` for Python, `HashSet` for Rust)
for storing hashes and for efficient set operations (intersection and difference).
Experiments used the Rust version for calculating Scaled MinHash sketches,
and the Python version for running gather and reporting containment scores.
Since they serialize the sketches to a compatible JSON format,
they can be used interchangeably;
while computing Scaled MinHash sketches is orders of magnitude faster in Rust,
gather running times are similar for both and on the order of seconds.
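The containment and similarity reported by these tools reduce to set operations over the stored hashes;
a minimal Python rendering of that core,
assuming both sketches were computed with the same $k$ and scaled values
(function names are illustrative, not `smol`'s exact interface):

```python
def containment(a: set[int], b: set[int]) -> float:
    """C(A, B): shared hashes divided by the size of sketch A."""
    return len(a & b) / len(a) if a else 0.0

def similarity(a: set[int], b: set[int]) -> float:
    """J(A, B): shared hashes divided by the size of the union of both sketches."""
    return len(a & b) / len(a | b) if (a or b) else 0.0
```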
The Python version has two external dependencies:
`screed` for sequence parsing,
and `mmh3` for the MurmurHash3 hash function.
Other modules from the standard library are used for JSON serialization (`json`)
and command line parsing (`argparse`).
The Rust version has four direct external dependencies:
`needletail` for sequence parsing and normalization
(similar to what `screed` does in the Python version),
`murmurhash3` for the MurmurHash3 hash function,
`serde_json` for JSON serialization and `structopt` for command line parsing.
Full source is available in Appendix [A](#smol-source-code).
#### exact
`exact` is an exact k-mer counting implementation in Rust.
It also uses the standard library `HashSet` to store the $k$-mer composition of
a dataset,
and `needletail` for sequence parsing.
It differs from `smol` because it stores all canonical $k$-mers,
instead of a subset.
The `bincode` library is used for serialization,
since it provides a simpler binary format output for Rust data types.
The goal is not to have a general purpose tool,
but only a way to calculate the ground truth for containment in a dataset,
and so a more compatible serialization format
(like JSON in the `smol` case)
is not required.
### Experiments
Experiments are implemented in `snakemake` workflows and use `conda` for
managing dependencies,
allowing reproducibility of the results with one command:
`snakemake --use-conda`.
This will download all data,
install dependencies and generate the data used for analysis.
The analysis is contained in a Jupyter Notebook
and can be executed anywhere Jupyter is supported,
including a local installation or Binder,
a service that deploys a live Jupyter environment on cloud instances.
Instructions are available at https://doi.org/10.5281/zenodo.4012667