%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Please note that whilst this template provides a
% preview of the typeset manuscript for submission, it
% will not necessarily be the final publication layout.
%
% letterpaper/a4paper: US/UK paper size toggle
% num-refs/alpha-refs: numeric/author-year citation and bibliography toggle
%\documentclass[letterpaper]{oup-contemporary}
\documentclass[a4paper,num-refs]{oup-contemporary}
%%% Journal toggle; only specific options recognised.
%%% (Only "gigascience" and "general" are implemented now. Support for other journals is planned.)
\journal{gigascience}
\usepackage{graphicx}
\usepackage{siunitx}
% Additional packages used for code formatting
\usepackage{minted}
\usepackage{tcolorbox}
\usepackage{etoolbox}
% Additional packages for data model figure
\usepackage{forest}
\usepackage{tikz}
\usetikzlibrary{quotes,arrows.meta,3d}
% Local definitions to systematise tool naming
\newcommand{\sgapi}[1]{\texttt{#1}}
\newcommand{\toolname}[1]{\texttt{#1}}
\newcommand{\sgkit}{\texttt{sgkit}}
%%% Flushend: You can add this package to automatically balance the final page, but if things go awry (e.g. section contents appearing out-of-order or entire blocks or paragraphs are coloured), remove it!
% \usepackage{flushend}
% Macros etc for data model figure
\input{diagrams/data-model-preamble.tex}
\title{Scalable analysis of genetic variation data in Python}
%%% Use the \authfn to add symbols for additional footnotes, if any. 1 is reserved for correspondence emails; then continuing with 2 etc for contributions.
\author[1,\authfn{1},\authfn{2}]{First Author}
\author[2,\authfn{1},\authfn{2}]{Second Author}
\author[2]{Third Author}
\author[2,\authfn{1}]{Fourth Author}
\affil[1]{First Institution}
\affil[2]{Second Institution}
%%% Author Notes
\authnote{\authfn{1}[email protected]; [email protected]}
\authnote{\authfn{2}Contributed equally.}
%%% Paper category
\papercat{Paper}
%%% "Short" author for running page header
\runningauthor{First et al.}
%%% Should only be set by an editor
\jvolume{00}
\jnumber{0}
\jyear{2024}
\begin{document}
\begin{frontmatter}
\maketitle
% The Abstract (250 words maximum) should be structured to
% include the following details:
% \textbf{Background}, the context and purpose of
% the study;
% \textbf{Results}, the main findings;
% \textbf{Conclusions}, brief
% summary and potential implications. Please minimize the use of abbreviations
% and do not cite references in the abstract.
% The Abstract (250 words maximum) should be structured to
% include the following details:
% \textbf{Background}, the context and purpose of % the study;
% \textbf{Results}, the main findings;
% \textbf{Conclusions}, brief summary and potential implications.
%% NOTE: this is much too long currently, but keeping for now so we
% can see which stuff migrates to the intro
\begin{abstract}
\textbf{Background:}
Vast quantities of genetic variation data have been collected to
address fundamental scientific questions across multiple disciplines.
Genetic variation datasets are usually modelled as a ``genotype matrix'',
where rows denote positions on the genome where variation occurs and
columns represent the state of sequenced samples at that position.
The largest dataset available today, the UK Biobank,
has a genotype matrix of approximately 1,000,000,000 rows and 500,000 columns,
along with metadata and quality control information of similar dimension.
Processing such matrices on a single computer is not feasible,
and computations must be distributed to complete in a reasonable timeframe.
Classical methods in genomics are inherently single-computer focused,
oriented around either streaming data through Unix pipelines or
storing the entire genotype matrix in memory.
The standard file format for genetic variation data (VCF: Variant Call Format)
is also fundamentally unsuited to distributed computation and is a severe
bottleneck at the Biobank scale.
Researchers either perform ad-hoc distribution by splitting across VCF files
and dispatching jobs using workflow managers,
or convert data into the undocumented and/or proprietary formats
used by the few specialised tools that do support automatic distribution
of calculations on a genotype matrix.
\textbf{Results:} We present sgkit, a scalable toolkit for processing genetic
variation data built on widely-used technologies from the Python data-science
ecosystem. A key part of this approach is to use Zarr, a cloud-native
format for storing multi-dimensional numerical data, widely used
across the sciences to store petabyte-scale datasets, and with
multiple implementations. We show that data stored in this format has
compression levels comparable to specialised methods for compressing the
genotype matrix. We show that performing calculations on the genotype
matrix on a single computer is much more efficient at scale using sgkit than standard
approaches based on VCF, and is comparable with state-of-the-art methods
using compiled languages. Moreover, sgkit can also natively distribute
computation over a cluster using Dask, with no changes required to client code.
We show that sgkit is competitive with Hail (the most popular package for distributed
analysis of genetic variation data) in terms of performance, but is
much more flexible and extensible. We demonstrate the
current capabilities of sgkit to perform dataset quality control,
and its applications in statistical, population and quantitative genetics, as well as phylogenetics.
\textbf{Conclusions:}
Sgkit enables the interactive analysis of Biobank-scale genetic variation
datasets via familiar and powerful Python data science tools, boosting
researcher productivity. Leveraging these tools also substantially
reduces software development effort for creating new methods.
The vcf-zarr specification may additionally provide a useful starting
point for defining a standard computational format for genetic variation data.
Such a standard, based on the language-agnostic Zarr format, would enable a
diverse ecosystem of tools to efficiently access subsets of the data without
copying, and substantially reduce research costs.
\end{abstract}
\begin{keywords}
Keyword1; keyword 2; keyword 3 (Three to ten keywords representing the main content of the article)
\end{keywords}
\end{frontmatter}
%%% Key points will be printed at top of second page
\begin{keypoints*}
\begin{itemize}
\item This is the first point
\item This is the second point
\item One last point.
\end{itemize}
\end{keypoints*}
\section{Background}
[TODO rewrite this based on the first paragraph of the abstract above
when cutting it down.]
The study of genetics is currently transforming into a true big-data science.
Since the Human Genome Project, genomics has required working with
significant volumes of data, and the specialised methods
required to transform the raw biological signals into analysable data,
such as assembly[cite], mapping[cite] and variant calling[cite], are compute intensive.
However, beyond these initial steps required to generate the data, the actual
analysis of the genetic variation could typically be performed on a single
computer, and often with ad-hoc software written using shell utilities
and scripting languages such as Perl.
From the Human Genome Project, through 1000 Genomes, and now with
population scale genomics datasets such as UK Biobank, GeL, etc,
the requirements for analysing the data have far outgrown these
methods. Simple queries can require minutes to complete, which
has a negative effect on researcher productivity, as well as limiting
the analyses that can be performed. The financial and
environmental~\citep{grealey2022carbon,lannelongue2023greener} cost of working
with such datasets is also substantial.
[Summary of what we do and paper roadmap]
In this paper we present sgkit, a Python library for large-scale genetic data
analysis that supports efficient and convenient distributed computation based
on open data formats and established technologies. We begin by examining the
fundamental question of how we can store observed genotypes for millions of
samples and hundreds of millions of variant sites in a way that is compact,
supports efficient retrieval for subsets of both samples and sites, and leads
to efficient distributed computation. We demonstrate that the straightforward
approach of storing genotypes in compressed rectangular chunks using the Zarr
library enables efficient access by sample, and yields compression levels and
computation speeds similar to specialised methods for genetic data. These
chunks also provide a natural granularity at which
to parallelise work, and by using the Dask library sgkit can easily and
transparently execute tasks concurrently within a single computer or across a
large cluster. Concurrent and distributed programming is notoriously difficult,
but Dask's high-level array-oriented operations, coupled with just-in-time
(JIT) compilation via numba~\citep{lam2015numba}, makes it straightforward to
write performant distributed code in Python.
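For example, the following minimal sketch (illustrative only, and not part
of the sgkit API; the \texttt{count\_hets} kernel and the array sizes are
hypothetical) shows the general pattern of applying a numba-compiled kernel
independently to each chunk of a Dask array:
\begin{minted}[fontsize=\footnotesize]{python}
import dask.array as da
import numba
import numpy as np

@numba.njit
def count_hets(g):
    # g is one chunk of genotype calls with shape (variants, samples, ploidy)
    out = np.zeros(g.shape[0], dtype=np.int64)
    for j in range(g.shape[0]):
        for k in range(g.shape[1]):
            if g[j, k, 0] != g[j, k, 1]:
                out[j] += 1
    return out

# A chunked (lazy) array of diploid genotype calls; in practice this would
# be backed by Zarr storage rather than random data.
gt = da.random.randint(0, 2, size=(100_000, 1_000, 2), chunks=(10_000, 1_000, 2))
# The kernel runs on each chunk in parallel, locally or on a cluster.
het_counts = gt.map_blocks(count_hets, drop_axis=(1, 2), dtype=np.int64)
print(het_counts.compute()[:5])
\end{minted}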
Storing and processing genotype data is of course only one part of
genetic data analysis, and real-world analyses require large volumes
of additional data associated with variants, samples and genotypes calls.
To help manage this complexity
the basic unit of sgkit's interface is an Xarray
dataset~\citep{hoyer2017xarray}, which provides a flexible container
for labelled multi-dimensional arrays. Thus, arbitrary numerical data
can be associated with any of the relevant dimensions, and stored
efficiently in chunked Zarr format.
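As a minimal sketch (a toy example; the \sgapi{call\_genotype},
\sgapi{variant\_position} and \sgapi{sample\_id} names follow sgkit's
conventions, while the derived variable is hypothetical):
\begin{minted}[fontsize=\footnotesize]{python}
import numpy as np
import xarray as xr

# Toy dataset laid out along the dimensions used by sgkit
ds = xr.Dataset({
    "call_genotype": (("variants", "samples", "ploidy"),
                      np.zeros((100, 10, 2), dtype="i1")),
    "variant_position": ("variants", np.arange(100, dtype="i4")),
    "sample_id": ("samples", np.array([f"S{i}" for i in range(10)])),
})
# A derived per-variant quantity (hypothetical name) attached to the
# same labelled dimensions...
ds["variant_call_fraction"] = (
    (ds.call_genotype >= 0).all(dim="ploidy").mean(dim="samples"))
# ...and the whole dataset written to chunked Zarr storage.
ds.to_zarr("toy.zarr", mode="w")
\end{minted}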
Using these fundamental building blocks sgkit currently provides
functionality across four key themes of general quality-control,
statistical, population and quantitative genetics. In
each case we briefly discuss available software, and showcase
the functionality via case-studies.
% Note: wishlist, just writing down some stuff that sounds good!
% Also four-key themes doesn't quite fit with 5 sections mentioned
% below - it's a WIP.
We first demonstrate interactive quality-control of [large dataset]
on [cluster], demonstrating how users can quickly summarise
large-scale data in a Jupyter notebook. We then illustrate the
statistical genetics functionality by performing a GWAS
on [something]. We showcase support for
population genetic data by [doing something from one of Alistairs
notebooks?].
We then demonstrate support for quantitative genetics
applications by applying relatedness matrix functions to [something].
% Note: it would be quite neat if we could say this is support
% for phylogenetics operations. Coupled with upstream SeqData
% we might actually be able to do this for some public SARS-CoV-2
% data or something, and make a neighbour joining tree using
% scipy (which is a terrible method, so would have to fully
% acknowledge).
Finally, we demonstrate sgkit's support for heterogeneous computing
and the general potential for using GPUs by computing a
pairwise distance matrix for [something], and demonstrate
X-fold speedup using our GPU implementation.
% Maybe this belongs in the discussion - we can leave it here for now.
Sgkit builds on the four key technologies of Zarr, Dask, Numba and Xarray,
widely used throughout the sciences, to bring efficient large-scale
genetics computation within the reach of many.
Sgkit is by design an open ecosystem.
The data model is well-documented,
based on open standards, and extensible to include arbitrary additional
variables and dimensions. Users are not restricted to a limited palette
of analyses, but have the full power of the Python data-science stack
available, and can write their own operations in terms of distributed
array operations or optimised numerical code. Finally, sgkit's
development model is open, with a codebase that is specifically
designed for ease of contribution,
% This is horrible, put in something better later.
and development processes informed
by combined decades of experience in open-source projects.
\section{Results}
\subsection{Overview of narrative for draft}
This subsection is a high-level summary of the main sections and
their narrative connections. \textbf{The section is for authors only and
will be deleted
once the sections themselves have matured reasonably}
\paragraph{Storing genetic variation data}
High level discussion of
strategies used to store genotype matrix and accompanying metadata.
Emphasis on access by variant, not sample. Introduce Zarr, and
refer to Fig~\ref{fig-data-storage}A to explain columnar storage,
and Fig~\ref{fig-data-storage}B to show that is competitive
with highly specialised approaches.
Note that the FC simulated dataset is probably a best-case
scenario for methods specialised to patterns in the genotype matrix.
Discuss storing additional data, and advantages of the
columnar approach (refs to genomics tools that do the same).
\paragraph{Computing on the genotype matrix}
Compressing data isn't ultimately the point - we want to use it.
We discuss the problem of running computations on the entire
genotype matrix, and how different formats limit what can be done.
We illustrate the discussion by the afdist operation from
the bcftools plugins package. This is perhaps a contrived example, but
we chose it because it requires examining the entire genotype matrix
and there is an efficient implementation available in bcftools.
It is also a good example because the standard approach of
divvying up the genome into chunks via HPC schedulers or workflow
managers is quite tricky.
We discuss the software development aspect, and how we developed
the afdist-savvy program. We discuss the inherent limitations
of the decode-to-vcf strategy, and how this inevitably
leads to bottlenecks. We discuss parallelism, and why
bcftools and genozip can make limited use of available cores.
Also that single-machine parallelism is inevitably constrained
by IO (which is why savvy's thread utilisation drops off).
Figure: \ref{fig-whole-matrix-compute}. Also
supplementary figure ~\ref{fig-whole-matrix-compute-supplemental}.
\paragraph{Subsetting the genotype matrix}
Existing work overwhelmingly emphasises accessibility by variant.
We demonstrate that sgkit/zarr is good for getting subsets
in terms of both variants and samples via afdist on various subsets.
Discuss why storing AF values for crude population groupings is
intrinsically bad, and that it should be easy to do calculations
on arbitrary subsets. Also, participants regularly drop out of
large studies like UKB - do you update the AF values each time?
\paragraph{Distributed computation}
Why do we need distributed compute? Summary of current
approaches (chunk up along the genome using HPC or workflow
managers). Advantages of true distributed compute, MapReduce
etc. Summary of earlier efforts like ADAM.
Discuss Hail. It's awesome, has enabled current
generation of mega-scale projects. Drawbacks are the
architecture has emphasised performance etc over adaptability
and simplicity. File format is undocumented, etc.
Discuss why the sgkit architecture is suited to distributed
computation. Discuss Zarr's use in cloud stores, and existing
deployments. Discuss Dask.
We do not attempt a head-to-head performance comparison with
Hail here, but defer to Liangde's thesis~\citep{li2022efficient}
with brief summary.
Forward ref to following sections on applications to demonstrate
sgkit in a distributed context.
\paragraph{Quality control of large datasets}
Outline the importance of QC, and existing approaches.
Summarise sgkit's general QC capabilities.
We demonstrate sgkit on quality control operations on a large
real dataset. We convert large messy VCF with lots of additional
fields. Quote file size numbers. We do some QC operations on
the dataset, using a good size distributed cluster, using
some of the additional fields. Quote times. Discuss Xarray
here, and possibly illustrate the labelled dimensioned arrays
here via a figure (perhaps a screenshot/pdf conversion of
a Jupyter notebook rendering of the big messy dataset,
showing the various fields).
\paragraph{Statistical genetics}
Quick summary of statgen and existing toolchain.
Summarise sgkit's statgen capabilities.
Demonstrate sgkit doing a GWAS on real data using distributed cluster.
Quote numbers.
\paragraph{Population genetics}
Quick summary of popgen and existing toolchain.
Summarise sgkit's popgen capabilities.
Demonstrate
sgkit doing popgen analysis based on MalariaGEN data (on cluster?)
Quote numbers.
\paragraph{Quantitative genetics}
Quick summary of quantgen and existing toolchain.
Summarise sgkit's quantgen capabilities. Emphasise the importance
of mixed ploidy support.
Demonstrate sgkit doing some relatedness matrix analyses.
Quote numbers, possibly relative to existing tools.
\paragraph{Phylogenetics}
Quick summary of phylogenetics and existing toolchain.
Phylogenetics is not one of sgkit's core capabilities, but there's no
reason it can't support important phylogenetic analyses. Workflows are also becoming
VCF-based, and massive datasets like SC2 would definitely
benefit from the ability to distribute.
We demonstrate the potential by computing a pairwise
distance matrix from XX samples, and creating a neighbour
joining tree using scipy (which we emphasise is a terrible
idea, but still). We showcase sgkit's support for heterogeneous
computing by getting an X-fold speedup on a GPU.
\subsection{Storing genetic variation data}
Storage is the foundation on which any large-scale data-driven computation
rests: how we encode and organise data on persistent storage (the format)
fundamentally affects and limits the efficiencies of downstream calculations.
Different formats are optimised for different purposes, and there are
often many competing requirements when designing a data storage format.
% TODO is there some literature we could cite on this? I'm making up the
% term "computational format" here, presumably there is lots prior thought
% on these ideas?
For our purposes here, it is important to distinguish between \emph{archival}
and \emph{computational} formats. An archival format is optimised for
maximum interoperability and future-proofing, and uses the simplest and
most robust technologies to ensure the long-term usability of the data.
Reasonable compression levels are important because
of storage and transfer costs, but the efficiency of data access is a
secondary concern.
A computational format, on the other hand, is optimised
for calculations, for \emph{using} the data, today. A computational format
is designed with particular data access patterns in mind, ultimately
to make a broad range of queries and downstream computations as efficient as
possible.
% Note: a quick explanation of VCF for CS people coming in fresh
VCF~\citep{danecek2011variant} is a text-based format in which each line
describes observations
and metadata for a single variant (a position on the genome at which
variation occurs in the sample). The first eight tab-delimited
columns are mandatory, and subsequent columns describe the per-sample
genotype calls and quality control data.
[WIP: currently working on weaving together the stuff below into
some sort of narrative]
As has been noted many times~\citep[e.g.][]{kelleher2013processing,layer2016efficient,li2016bgt},
the shortcomings of VCF (and to a lesser degree, BCF) as a storage
format and as a basis for efficient computation have long been
recognised, and many alternatives have been proposed.
Purely archival formats try to get the maximum possible compression levels,
without much attention to query performance. Typically these use
highly specialised approaches to maximise compression performance
on genotype data.
These include
TGC~\citep{deorowicz2013genome},
GTShark~\citep{deorowicz2019gtshark} and
VCFShark~\citep{deorowicz2021vcfshark}.
SpVCF~\citep{lin2020sparse} is an extension of the VCF text
format that maximises compatibility, while gaining substantial
improvements in file sizes.
Many tools focus on balancing compression performance with
being able to run queries, typically outputting VCF; examples include
BGT~\citep{li2016bgt},
GQT~\citep{layer2016efficient},
SAVVY~\citep{lefaive2021sparse},
XSI~\citep{wertenbroek2022xsi},
GTC~\citep{danek2018gtc},
GTRAC~\citep{tatwawadi2016gtrac},
genozip~\citep{lan2020genozip,lan2021genozip} and
GBC~\citep{zhang2023gbc}.
SeqArray~\citep{zheng2017seqarray} and the GDS format~\citep{zheng2012high}
provide efficient storage and programmatic access in R.
Many recent tools are based on the Positional Burrows--Wheeler
Transform (PBWT)~\citep{durbin2014efficient}, which takes advantage
of the underlying haplotype structure of genomic data to provide
concise storage, and efficient haplotype matching.
The approach taken by \toolname{plink}~\citep{purcell2007plink,chang2015second} is to
store data in an uncompressed binary format. This has some major
advantages.
BGEN \citep{band2018bgen} is optimised for storing large-scale
genotype array data and stores probabilities, and is used
in particular by UK Biobank~\citep{bycroft2018genome}.
[Should consider the ``hard-call'' nature of most of these formats]
Many methods working with large scale genetic
data now use the PBWT or variants of
it~\citep[e.g.][]{li2016bgt,lefaive2021sparse,wertenbroek2022xsi}. PBWTs have
disadvantages, however: data must be phased, and compression level
is quite sensitive to the presence of errors [citation? people have
worked around this]
Several tools provide querying interfaces in their CLIs
(reminiscent of SQL) to provide easy access to particular samples, or
variants~\cite{li2016bgt,layer2016efficient}.
The more recent XSI format~\citep{wertenbroek2022xsi}
uses a combination of strategies to achieve excellent compression
performance, and can support some calculations directly on the
compressed format. It provides a command line interface and a
C API.
The SAV format is an extension of BCF that takes advantage
of the sparseness of large scale genomic datasets~\citep{lefaive2021sparse}.
It provides a command line interface and a C++ API.
\begin{figure}
\begin{tabular}{c}
\resizebox{7cm}{!}{\input{diagrams/data-model.tex}} \\
\includegraphics[width=7cm]{figures/data-scaling}
\end{tabular}
\caption{Chunked compressed storage of the genotype matrix using Zarr.
(A) Schematic of the chunked array storage used in sgkit.
(B) Compression performance compared with BCF, savvy and genozip
based on simulated data (see text for details).
\label{fig-data-storage}}
\end{figure}
High level discussion of
strategies used to store genotype matrix and accompanying metadata.
Emphasis on access by variant, not sample. Introduce Zarr, and
refer to Fig~\ref{fig-data-storage}A to explain columnar storage,
and Fig~\ref{fig-data-storage}B to show that is competitive
with highly specialised approaches.
Note that the FC simulated dataset is probably a best-case
scenario for methods specialised to patterns in the genotype matrix.
Discuss storing additional data, and advantages of the
columnar approach (refs to genomics tools that do the same).
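The following sketch expresses the storage scheme of
Figure~\ref{fig-data-storage}A directly in terms of the Zarr Python API
(a toy example assuming the Zarr v2-style API; the array names and chunk
sizes are illustrative):
\begin{minted}[fontsize=\footnotesize]{python}
import numpy as np
import zarr

n_variants, n_samples, ploidy = 100_000, 1_000, 2
root = zarr.group(store=zarr.DirectoryStore("genotypes.zarr"), overwrite=True)
# The genotype matrix is split into compressed rectangular chunks along
# both the variants and samples dimensions.
gt = root.zeros(
    "call_genotype",
    shape=(n_variants, n_samples, ploidy),
    chunks=(10_000, 1_000, ploidy),
    dtype="i1",
)
gt[:10_000] = np.random.randint(0, 2, size=(10_000, n_samples, ploidy), dtype="i1")
# Other fields are stored as separate ("columnar") arrays that share the
# same dimensions, and can be read independently of the genotypes.
root.array("variant_position", np.arange(n_variants, dtype="i4"), chunks=(10_000,))
\end{minted}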
\subsection{Computing on the genotype matrix}
Compressing data isn't ultimately the point - we want to use it.
We discuss the problem of running computations on the entire
genotype matrix, and how different formats limit what can be done.
We illustrate the discussion by the afdist operation from
the bcftools plugins package. This is perhaps a contrived example, but
we chose it because it requires examining the entire genotype matrix
and there is an efficient implementation available in bcftools.
It is also a good example because the standard approach of
divvying up the genome into chunks via HPC schedulers or workflow
managers is quite tricky.
We discuss the software development aspect, and how we developed
the afdist-savvy program. We discuss the inherent limitations
of the decode-to-vcf strategy, and how this inevitably
leads to bottlenecks. We discuss parallelism, and why
bcftools and genozip can make limited use of available cores.
Also that single-machine parallelism is inevitably constrained
by IO (which is why savvy's thread utilisation drops off).
Figure: \ref{fig-whole-matrix-compute}. Also
supplementary figure ~\ref{fig-whole-matrix-compute-supplemental}.
\begin{figure*}
\begin{tabular}{cc}
\includegraphics[width=6cm]{figures/whole-matrix-compute} &
\begin{tcolorbox}[width=7cm]
\begin{minted}[fontsize=\footnotesize]{python}
import numpy as np
import pandas as pd
import sgkit

def sgkit_afdist(ds, num_bins=10):
    ds_vs = sgkit.variant_stats(ds).compute()
    bins = np.linspace(0, 1.0, num_bins + 1)
    bins[-1] += 0.01
    af = ds_vs.variant_allele_frequency.values[:, 1]
    pRA = 2 * af * (1 - af)
    pAA = af * af
    hets = ds_vs.variant_n_het.values
    homs = ds_vs.variant_n_hom_alt.values
    a = np.bincount(
        np.digitize(pRA, bins),
        weights=hets,
        minlength=num_bins + 1)
    b = np.bincount(
        np.digitize(pAA, bins),
        weights=homs,
        minlength=num_bins + 1)
    count = (a + b).astype(int)
    return pd.DataFrame({
        "start": bins[:-1],
        "stop": bins[1:],
        "prob_dist": count[1:]})

ds = sgkit.load_dataset(ds_path)
df = sgkit_afdist(ds)
\end{minted}
\end{tcolorbox}
\end{tabular}
\caption{Whole-matrix compute performance with increasing sample size.
(A) Total CPU time required to run \texttt{bcftools +afdist}
and equivalent operations in a single thread for various tools.
Elapsed time is also reported (dotted line).
(B) Relative speedup with 8 threads.
(C) The source code used for the sgkit version of afdist used in these
benchmarks. See Figure~\ref{fig-whole-matrix-compute-supplemental} for
more detailed comparisons with different numbers of threads, and comparing
the effects of HDD vs SSD storage.
\label{fig-whole-matrix-compute}}
\end{figure*}
\subsection{Subsetting the genotype matrix}
Existing work overwhelmingly emphasises accessibility by variant.
We demonstrate that sgkit/zarr is good for getting subsets
in terms of both variants and samples via afdist on various subsets.
Discuss why storing AF values for crude population groupings is
intrinsically bad, and that it should be easy to do calculations
on arbitrary subsets. Also, participants regularly drop out of
large studies like UKB - do you update the AF values each time?
\begin{figure*}
\begin{tabular}{cc}
\includegraphics[width=6cm]{figures/subset-matrix-compute} &
\begin{tcolorbox}[width=7cm]
\begin{minted}[fontsize=\footnotesize]{python}
# TODO not sure how best to illustrate this. Trying out various
# options
ds = sgkit.load_dataset(ds_path)
dss = ds.isel(
    variants=slice(...),
    samples=slice(...))
df = sgkit_afdist(dss)

# bcftools
bcftools view PATH -S ... -r ...
    | bcftools +fill-tags | bcftools +af-dist

# genozip
genocat PATH -r ... -s ...
    | bcftools +fill-tags | bcftools +af-dist

# Complex query
df = pd.read_csv("sample_info.csv",
    sep="\t").set_index("sample_id")
dss = ds.isel(
    samples=df.decade[ds.sample_id.values] >= 1950)
sgkit_afdist(dss)

# Corresponding shell
% tail -n+2 sample_info.csv | awk '$2 >= 1950'
    | cut -f1 > samples.txt
bcftools view PATH -s samples.txt
    | bcftools +fill-tags | bcftools +af-dist
\end{minted}
\end{tcolorbox}
\end{tabular}
\caption{Compute performance on subsets of the matrix.
(A) Total CPU time required to run the afdist calculation for
a subset of 10 samples and 10000 variants from the middle of the matrix.
(B) Same for n/2 samples. Genozip did not run in this case for
$n > 10^4$ samples because it does not support a file to specify
sample IDs, and the command line was therefore too long for the shell
to execute.
(C) A simplified version of the source code used for these benchmarks.
\label{fig-subset-matrix-compute}}
\end{figure*}
\subsection{Distributed computation}
Distributed computation has been the reality in the analysis
of genetic variation for many years,
where data is most often provided as per-chromosome VCFs.
This chunking provides a fairly straightforward way to parallelise
computations, using HPC schedulers to split work per-file.
When more fine-grained parallelism is required, users must
provide exact coordinates of the genome regions
as part of their submission scripts. Combining results
is a manual operation.
Workflow engines such as
Snakemake~\cite{koster2012snakemake,molder2021sustainable}
and [TODO mention WDL, Cromwell, etc?] make such analyses
much easier, but there is still a great deal of room for
mistakes.
Per-chromosome VCFs are not practical for biobank scale
projects, however, and variant call data tends to be split
into many (still large) files.
For example, the VCFs for 150K UKB WGS data~\cite{halldorsson2022sequences}
are provided in 50kb chunks~\cite{browning2023statistical}, resulting in
hundreds of files per chromosome.
While these files provide a natural
unit for distributing work, the details of how datasets are split
differ, and there is essentially no interoperability among
large-scale datasets: the VCFs are so cumbersome that a separate
orchestration layer is required to access the data in parallel.
Despite this long history of de-facto distributed computation
and the clear need for better methods of accessing genetic
variation data in a distributed fashion, there is almost
no scientific literature on the topic (in stark contrast to the
study of methods to compress a VCF into a single file,
as discussed in section XXX).
[This is a really rough outline, also not sure how exactly
we sequence the ideas here. Just getting paragraph fragments
on paper. We do need to explain something of the hadoop ecosystem
to readers who are not familiar, though.]
Distributed analytics for large-scale data is a mature
technology, with many different approaches. Hadoop~\citep{white2012hadoop}
is the original method, and remains the basis of the
dominant ecosystem. HDFS. Spark [CITATION?] lets us
do more complex analytics. Parquet [CITATION?]
Although the Hadoop ecosystem has been hugely successful
in many domains, it has had a limited impact in genomics.
[List some things that have been tried for e.g.\ variant calling,
like ADAM~\citep{nothaft2015rethinking}].
[Briefly mention two approaches using Hadoop clusters
~\citep{boufea2017managing,fan2020variant}, to not very
exciting results].
[Hadoop/Spark/Parquet are fundamentally unsuited to genetic
variation data [WHY]?]
Discuss Hail. It's awesome, has enabled current
generation of mega-scale projects. Drawbacks are the
architecture has emphasised performance etc over adaptability
and simplicity. Uses bits of Spark for X, but custom YZ .
File format is undocumented, etc.
Discuss why the sgkit architecture is suited to distributed
computation. Discuss Zarr's use in cloud stores, and existing
deployments. Discuss Dask.
We do not attempt a head-to-head performance comparison with
Hail here, but defer to Liangde's thesis~\citep{li2022efficient}
with brief summary.
Forward ref to following sections on applications to demonstrate
sgkit in a distributed context.
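The following sketch illustrates the sense in which sgkit computations are
distributed without changes to client code: the same program runs locally
or on a cluster, depending only on how the Dask client is configured (the
scheduler address and dataset path are placeholders):
\begin{minted}[fontsize=\footnotesize]{python}
import sgkit
from dask.distributed import Client

# Connect to a Dask cluster; Client() with no arguments would instead
# start a local cluster using the cores of a single machine.
client = Client("tcp://scheduler-address:8786")

ds = sgkit.load_dataset("dataset.zarr")   # lazily opens the chunked Zarr arrays
ds = sgkit.variant_stats(ds)              # builds a Dask task graph, chunk by chunk
af = ds.variant_allele_frequency.compute()  # graph is executed across the cluster
\end{minted}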
\subsection{Quality control of large datasets}
Outline the importance of QC, and existing approaches.
Summarise sgkit's general QC capabilities.
We demonstrate sgkit on quality control operations on a large
real dataset. We convert large messy VCF with lots of additional
fields. Quote file size numbers. We do some QC operations on
the dataset, using a good size distributed cluster, using
some of the additional fields. Quote times. Discuss Xarray
here, and possibly illustrate the labelled dimensioned arrays
here via a figure (perhaps a screenshot/pdf conversion of
a Jupyter notebook rendering of the big messy dataset,
showing the various fields).
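As a minimal sketch of the intended workflow (the thresholds and dataset
path are placeholders, and we assume the per-variant and per-sample summary
variables produced by \sgapi{variant\_stats} and \sgapi{sample\_stats}):
\begin{minted}[fontsize=\footnotesize]{python}
import sgkit

ds = sgkit.load_dataset("dataset.zarr")
# Per-variant and per-sample summaries commonly used for QC
ds = sgkit.variant_stats(ds)
ds = sgkit.sample_stats(ds)
# Filtering is ordinary Xarray indexing over the labelled dimensions
ds_qc = ds.isel(variants=(ds.variant_call_rate > 0.95).values)
ds_qc = ds_qc.isel(samples=(ds_qc.sample_call_rate > 0.99).values)
\end{minted}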
\subsection{Statistical Genetics}
Statistical genetics is largely concerned with disease associations
in humans, and the success of genome wide association studies (GWAS) has led
to explosive growth in the field~\citep{uffelmann2021genome}.
Sample size is the single most
important factor in driving statistical power, and
genotype array data has been collected for millions of samples
[more explanation and citations].
GWAS typically use genotype array data because of
its continuing cost-effectiveness, although exome sequencing has been
used~\citep{lek2016analysis,backman2021exome}
and large-scale whole genome sequence data is becoming increasingly
available~\citep{halldorsson2022sequences,uk2023whole}.
Genotype array data is usually imputed using software such
as BEAGLE~\citep{browning2018one} or IMPUTE~\citep{howie2011genotype}.
% Mention phasing?
The \toolname{plink}~\citep{purcell2007plink} package is the most
commonly used software for GWAS and has a large range of functionality
required for quality control, association testing and analysis.
More recently, software packages have emerged that are specifically
designed to deal with the very large sample sizes that are becoming
more prevalent: for example, SAIGE~\citep{zhou2018efficiently},
Bolt-LMM~\citep{loh2015efficient} and
REGENIE~\citep{mbatchou2021computationally} are designed to
detect associations in very large cohorts.
The Hail software [citation] has been used [to do lots of things]
[Mention qgg somewhere\citep{rohde2020qgg}]
Because of the relative homogeneity of the data sources and
required tasks, programs like \toolname{plink} are able to provide
an end-to-end analysis pipeline from the source genotype data,
through quality control, to association testing and Manhattan plots.
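To give a flavour of how such an analysis looks in sgkit, the following
minimal sketch runs a linear-regression GWAS using
\sgapi{gwas\_linear\_regression} (the dosage construction and the covariate
and trait variable names are placeholders for illustration):
\begin{minted}[fontsize=\footnotesize]{python}
import sgkit

ds = sgkit.load_dataset("dataset.zarr")
# A crude dosage from hard calls; imputed dosages would normally be used,
# and missing calls would need to be handled explicitly.
ds["call_dosage"] = ds.call_genotype.sum(dim="ploidy")
ds = sgkit.gwas_linear_regression(
    ds,
    dosage="call_dosage",
    add_intercept=True,
    covariates=["age", "sex"],   # assumed sample variables in the dataset
    traits=["phenotype"],        # assumed sample variable in the dataset
)
\end{minted}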
\subsection{Population Genetics}
[Things to work in here:
1) Classically popgen is concerned with allele frequencies, and so
used a variety of different data types to access this.
2) Data was at a pretty small scale, so you could do the entire
analysis within a single monolithic package like Arlequin.
3) people have been transitioning to WGS data recently.
]
Population geneticists study evolution at the population scale,
characterising the evolutionary forces shaping the diversity
within and between populations. Compared to statistical genetics,
which is largely concerned with humans, population geneticists
study a broad range of organisms across the tree of life, both well
characterised ``model'' species, and those for which the most basic
genetic information (such as the sizes and number of chromosomes,
and ploidy level) must be determined.
Sample sizes are also much smaller,
with dozens to a few hundred genomes per species being typical.
However, projects such as Darwin Tree of
Life~\citep{darwin2022sequence} are currently sequencing tens of
thousands of different species.
% JK: I'm making this up - need to get something concrete to back it up.
% Is there any review paper summarising computation in popgen?
[THIS IS A ROUGH DRAFT: I'm just getting in a paragraph here with citations
to tools/packages that are important in PopGen. We can get a narrative
through it later.]
See \citep{excoffier2006computer} for a review of computational methods
in classical population genetics.
Because of the relatively small volumes of data and the variety
of different organisms, statistical analysis of population genetic
data has tended to be done by custom scripts. Tools such
as \toolname{plink}~\citep{purcell2007plink}
are often not applicable because of the built-in
assumptions that are appropriate only for humans and other model organisms.
Programs such as Arlequin~\citep{excoffier2005arlequin,excoffier2010arlequin}
contain a large suite of features, but can only be accessed by GUI
or command line interfaces.
BioPerl~\citep{stajich2002bioperl} and BioPython~\citep{cock2009biopython}
are more concerned with alignments and providing access to command-line
programs, and are oriented towards flexibility rather than
performance.
The \toolname{scikit-allel}~\citep{miles2023scikit}
and \toolname{pylibseq}~\citep{thornton2003libsequence}
libraries provide efficient programmatic
access to a rich suite of population genetic statistics, but have limited
support for parallelisation.
STRUCTURE~\citep{pritchard2000inference}
ADMIXTOOLS~\citep{patterson2012ancient}
% Discussion of simulated data and tree-based stuff.
Population geneticists are often interested in comparing with simulated data.
Until recently, this was done by processing
custom text formats output by programs such as
\toolname{ms}~\citep{hudson2002generating} and accompanying
scripts to compute basic statistics.
% JK: This is a bit of a self-cite fest here, but I think people would
% wonder what the connection between sgkit and tskit is if we don't
% explain it. Happy to discuss if anyone doesn't like it, though.
The \toolname{msprime}~\citep{kelleher2016efficient} simulator took a
different approach by providing a Python API which gives
access to the underlying genealogical trees, which often allow
statistics to be calculated more efficiently than standard
matrix-based approaches~\citep{ralph2020efficiently}.
These methods have been subsequently generalised to
other forms of
simulation~\citep{kelleher2018efficient,haller2018tree}
as well as being inferred from observed variation
data~\citep{kelleher2019inferring,wohns2022unified},
supported by the open-source \toolname{tskit}
library~\citep{ralph2020efficiently}.
See~\cite{baumdicker2021efficient} for further discussion on
computation in population genetic simulations, and the
\texttt{tskit} software ecosystem.
% Need to say this somewhere, or people will wonder. This is a
% rough draft
While tree-based methods are excellent for simulated data
and attractive for real-data analysis, they are not a solution
to all problems. Firstly, not all datasets are appropriate
for tree inference and secondly, before we infer trees we must
first perform extensive quality control of the raw data, necessarily
in variant-matrix form.
Sgkit provides a number of methods for computing statistics in population
genetics. All methods work on an Xarray dataset that has a number of well-known
variables defined [TODO: see section...], which for population genetics are the
genotype calls, stored in the \sgapi{call\_genotype} variable. The methods return a
new dataset containing variables that store the computed statistics.
Before running the methods, the dataset is usually divided into windows along
the genome, using the \sgapi{window\_by} functions, which tell sgkit to produce
per-window statistics. For example, \sgapi{window\_by\_position} creates windows that
are a fixed number of base pairs, while \sgapi{window\_by\_interval} creates windows
corresponding to arbitrary user-defined intervals.
It's common in population genetics to group samples into populations, which in
sgkit are referred to as \emph{cohorts}. There are two types of statistics: one-way
statistics where there is a single statistic for each cohort, and multi-way
statistics where there is a statistic between each pair, triple, etc of
cohorts. [TODO: do we need to say how cohorts are defined?]
The methods for one-way statistics include \sgapi{diversity} for computing mean
genetic diversity, \sgapi{Tajimas\_D} for computing Tajima’s D, and
\sgapi{Garud\_H} for
computing the H1, H12, H123 and H2/H1 statistics~\citep{garud2015recent}.
The methods for multi-way statistics include \sgapi{divergence} and
\sgapi{Fst} for
computing mean genetic divergence and $F_{ST}$ (respectively) between pairs of
cohorts, and \sgapi{pbs} for computing the population branching statistic between
cohort triples.
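A minimal sketch of this workflow (the cohort assignment is a toy
placeholder; in practice cohorts are integer codes derived from sample
metadata and stored in the \sgapi{sample\_cohort} variable):
\begin{minted}[fontsize=\footnotesize]{python}
import numpy as np
import sgkit

ds = sgkit.load_dataset("dataset.zarr")
# Assign each sample to one of two cohorts (toy split: first half / second half)
n = ds.sizes["samples"]
ds["sample_cohort"] = ("samples", np.repeat([0, 1], [n // 2, n - n // 2]))
# Fixed 10 kb windows along the genome
ds = sgkit.window_by_position(ds, size=10_000)
# One-way statistics: one value per window and cohort
ds = sgkit.diversity(ds)
ds = sgkit.Tajimas_D(ds)
# Multi-way statistics: one value per window and pair of cohorts
ds = sgkit.Fst(ds)
\end{minted}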
We converted phased Ag1000G haplotype data in Zarr format
%[@https://www.malariagen.net/data/ag1000g-phase-2-ar1]
to sgkit's Zarr format
using the \sgapi{read\_scikit\_allel\_vcfzarr} function.
The data contained 1,164
samples at 39,604,636 sites, and was [TODO] MB on disk before conversion, and
[TODO] MB after conversion to sgkit's Zarr format. Data for the X chromosome
was discarded since it was not available for all samples. The conversion took
[TODO: X minutes Y seconds], including a postprocessing ``rechunk'' step to
ensure that the data was suitably chunked for the subsequent analysis.
We grouped samples into 15 cohorts based on location using simple Pandas and
Xarray data manipulation functions. We then set up half-overlapping windows
along the genome using sgkit's \sgapi{window\_by\_position} function with a fixed
window size of 6000 base pairs, and a step size of 3000 base pairs.
We computed H statistics for one cohort using the \sgapi{Garud\_H} function in sgkit.
This operation took [TODO: X minutes Y seconds]. Finally, we verified that the
output was concordant with scikit-allel. [TODO: how hard to reproduce the
scikit-allel visualization?]
[TODO: summarize the biologically interesting thing here]
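In outline, this analysis follows the steps just described (the paths, the
metadata column name and the exact cohort encoding shown here are
placeholders):
\begin{minted}[fontsize=\footnotesize]{python}
import pandas as pd
import sgkit

ds = sgkit.load_dataset("ag1000g.zarr")
# Derive integer cohort codes from sample metadata using ordinary Pandas
meta = pd.read_csv("samples.meta.csv").set_index("sample_id")
cohorts = meta.loc[ds.sample_id.values, "location"].astype("category").cat.codes
ds["sample_cohort"] = ("samples", cohorts.to_numpy())
# Half-overlapping windows: 6000 bp wide with a 3000 bp step
ds = sgkit.window_by_position(ds, size=6000, step=3000)
# Garud H statistics (H1, H12, H123, H2/H1) per window and cohort
ds = sgkit.Garud_H(ds)
\end{minted}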
\subsection{Quantitative Genetics}
[TODO lacking a lot of detail here in terms of compute and software.
https://github.com/pystatgen/sgkit-publication/issues/26]
Quantitative genetics is classically concerned with understanding
the link between continuously variable quantitative traits
with complex genetics~\citep{hill2010understanding}.
Through decades of selective breeding, quantitative genetics methods have
led to X-fold increases in milk yield [etc, citations].
They are also applied to
natural populations~\citep{wilson2010ecologist}.
Quantitative genetics methods are used across a wide variety of
organisms, with many species of agricultural interest having
variable ploidy levels etc.
Much of classical quantitative genetics uses observed pedigree structure
in mixed models to partition the variance of phenotypes, without
reference to any genomic data.
Fitting generalised linear mixed models (LMMs) using
software such as ASReml~\citep{butler2009asreml}
and MCMCglmm~\citep{hadfield2010mcmc} requires only the pedigree
information and trait data.
Genetic marker data~\citep{bernardo2008molecular} of various
types (RFLPs, microsats etc) required statistical models.
BLUPs include genetic information~\citep{endelman2011ridge}.
Many packages are available in R [cites]. Because much of the computation
in quantitative genetics has revolved around solving
general statistical models using linear algebra, R is well suited to these tasks.
However, sample sizes are growing ever larger, particularly
in agricultural applications.
For example,
the 1000 bull genomes project~\citep{hayes20191000}
comprises close to 7000 genomes, embedded in multi-generation pedigrees
with millions of animals and extensive SNP array genotype and phenotype
data \citep[e.g.][]{cesarani2022mulibreed}.
[Methods are hitting a computational wall]
Much of the functionality required for quantitative genetics work
is therefore provided by access to efficient and scalable linear algebra
routines (a core element of the PyData ecosystem), the ability
to access genetic variation data from sources such as plink and VCF,
and some specialised routines to handle pedigree data and to
estimate relatedness matrices. Sgkit provides [this], with a particular
focus on supporting autopolyploids.
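As an illustration of how such relatedness computations map onto the PyData
stack, the following sketch computes a VanRaden-style genomic relationship
matrix directly with Dask array operations (a generic sketch rather than a
specific sgkit routine; diploid calls with no missing data are assumed):
\begin{minted}[fontsize=\footnotesize]{python}
import dask.array as da
import sgkit

ds = sgkit.load_dataset("dataset.zarr")
# Allele dosage per variant and sample, as a chunked Dask array
dosage = ds.call_genotype.sum(dim="ploidy").data
p = dosage.mean(axis=1) / 2           # per-variant allele frequency
z = dosage - 2 * p[:, None]           # centred dosages
denom = 2 * (p * (1 - p)).sum()
grm = (z.T @ z) / denom               # samples x samples relationship matrix
grm = grm.compute()                   # executed in parallel, chunk by chunk
\end{minted}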
\section{Discussion}
Outline of a possible narrative by paragraph:
\begin{itemize}
\item When VCF was introduced as part of the work of the 1000 Genomes
project the software landscape was a real mess. It has entirely
succeeded in standardising the output of variant callers, and methods
have interoperated well for the past decade. As a method of interchange
following classical file-oriented paradigms it works well enough.
It is also a good archival format: when compressed with a well established
method like gzip and combined with the well-documented description of
the text, we can be sure that files will still be readable decades
from now.
\item VCF is holding back large-scale genomics, however, and in particular the
analysis of population scale genetic variation datasets like UKB and GeL.
Describe the current state of affairs. Variant datasets are chunked up
into reasonable sized bits (10s of gigabytes) so that they can be processed
in parallel. Doing anything with them requires interacting with workflow
managers. In practice, it is not possible to directly write programs to
work across multiple large-scale datasets: although the chunks of data
are given in VCF, orchestrating how the chunks are processed and
the results combined involves substantial work, involving the workflow
manager, and the details of how chromosomes are split into chunks.
Cite examples like the phasing pipelines for UKB from Brownings etc.
\item Distributed computing is the reality, and increasingly datasets
are being mandatorily stored in public clouds as part of
Trusted Research Environments. Cloud object stores are the reality
because it is much cheaper. POSIX file systems are emulated on top of
this. Benefits of breaking up data into manageable pieces are that
compute resources can be localised to the data. Compute resources are
most effectively obtained as short-lived, small VMs that operate
as part of a large cluster. Large vertically scaled ``pet'' servers
are much more expensive, and just not feasible for the current
scale of data. Lots of other architectural
benefits from the new reality of cloud-based storage.
\item Classical methods based around Unix pipelines and streaming
files between processes on a single machine are therefore a
significant mismatch to the reality of large-scale genetic variation
datasets. The majority of methods suggested to make working
with VCF data more efficient are based on these ideas: they all
create a single large file and regard distribution of work
in a cluster as extrinsic. Most output VCF as the result of queries,
which also assumes that the VCF data can be efficiently processed.
This is a reasonable assumption for thousands of samples, but not
for larger datasets.
\item The key issue here is that VCF is not a \emph{computational}
format. Contrast VCF with plink, which stores data on disk in a
form that can be computed on directly. Plink is limited in what
can be described though, does not compress, and is not designed
to be distributed.
\item The large-scale analysis of genetic variation data does not
need a more efficient file format, which allows files to be downloaded
somewhat more quickly, and subsets extracted somewhat more efficiently
as VCF. It needs a standard storage platform that allows data to be
processed directly without conversion, subsetted efficiently across
both the variant and sample dimensions, with first-class support
for all the necessary additional information (GT values etc),
and that is suited to the realities of today's cloud-based
storage and processing facilities.
\item We suggest that the model used by Zarr is a good choice for this.
We have shown that chunks of the genotype matrix compressed with standard
methods are competitive with highly specialised methods for
genotype data. In reality most of the storage is for non-genotype
data anyway, where everyone is essentially doing the same thing.
Columnar storage allows getting just (e.g.) the variant positions