-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathindex.Rmd
161 lines (134 loc) · 6.45 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
---
# UCD thesis fields
title: "Decentralizing Indices for Genomic Data"
author: "Luiz Carlos Irber Junior"
year: "2020"
month: "September"
program: "Computer Science"
uccampus: "DAVIS"
report: "DISSERTATION"
degree: "DOCTOR OF PHILOSOPHY"
chair: "C. Titus Brown"
signature1: "Cindy Rubio González"
signature2: "David M Koslicki"
signature3: "Sam Díaz-Muñoz"
dedication: |
To Stéfanie, who came with me to the frozen lands, drove us through the
corn fields, but more importantly made my life complete and this
dissertation possible. 42! =]
To my mom Eliane, dad Luiz and sister Cris,
who always supported and cherished me,
even when I was stubborn and insisted on outlandish ideas.
To tio Elói, who always gave me a constant flux of new ideas and information even when I lived so far from the center.
And especially to Paraíba, a brother who flew away too soon.
acknowledgments: |
```{=latex}
\begin{epigraphs}
\qitem{To be whole is to be part; true voyage is return.}%
{Ursula K. Le Guin}
\end{epigraphs}
```
To Titus, who believed in someone that cold e-mailed him and brought me into
an environment where I could grow and help others grow together with me,
and supported all my weird ideas even when it took some time for us to figure
out if they were actually good.
Happy that many succeeded and planted seeds for so many more!
Phil, Taylor, Tessa, the #TeamSourmash in-lab,
as well as all 26 (and growing) sourmash contributors on GitHub and everywhere else.
It was your feedback and suggestions that made all the work of
building and maintaining sourmash worth it.
Lab colleagues, both present and past,
first at the GED and later at the DIB Lab,
for creating the awesome environment where we all do science together and for
teaching me so much about science but also the world in general.
My godparents tio Cassol, tio Carlinhos, Beto, tia Môa and tia Lourdes,
as well as tias Nita, Chica and Ângela, tata Ana, and all my family, who raised me to be who I am.
Ladi, Délio, Janete, Rubens and all the Fares and Sabbag, who brought me into
their family and were present even when me and Stéfanie moved so far from
home.
Pacu, Frank, Lacraia, Shaolin, Gretchen, Baba, Alphalpha, Sinfa, Marcos, Balboa, and all the
friends from college who shared the journey through Computer Engineering and
beyond.
And stop lying, Alphalpha! =P
Gabriel (GG), who I met in college and was always close since then,
sharing both the ups and downs of life.
Gui, Léo, Arnaldo, Bia, and all the friends from INPE and oceanography,
who showed that doing a PhD was not impossible and helped me figure out how to achieve it.
Sara, Tarci, Dario and everyone from Taqueria Davis,
who not only fed us with their delicious food but also became our family in Davis.
André, Osvaldo, Perdido, Tiago, and all the friends in São Carlos,
who expanded my world with their views and friendship.
Petro and César, all our trips to the International Free Software Forum in
Porto Alegre, and sharing the discovery that a different world was possible.
Hélio, Zé and all the great professors during college,
who helped grow my love for computer science with their example,
attention and support.
Luciano Ramalho and everyone from the Brazilian Python community,
who showed me that programming was more fun when done together,
sharing each other's problems and solutions.
Professors Cindy, Sam, and Dave at UC Davis,
Charles at Michigan State,
David at Penn State and Rob at UMD,
who shed light on my path through this PhD.
And finally Ada and Runa, who showed me what computers are really for:
sleeping.
abstract: |
Biology as a field is being transformed by the increasing availability of data, especially genomic sequencing data.
Computational methods that can adapt and take advantage of this data deluge are essential
for exploring and providing insights for new hypotheses,
helping to unveil the biological processes that were previously expensive or even impossible to study.
This dissertation introduces data structures and approaches for scaling data
analysis to hundreds of thousands of DNA sequencing datasets using _Scaled MinHash_ sketches,
a reduced space representation of the original datasets that can lower computational
requirements for similarity and containment estimation;
_MHBT_ and _LCA_ indices,
structures for indexing and searching large collections of _Scaled MinHash_ sketches;
`gather`, a new top-down approach for decomposing datasets into a collection of
reference components that can be implemented efficiently with _Scaled MinHash_
sketches and _MHBT_ and _LCA_ indices;
`wort`,
a distributed system for large scale sketch computation across heterogeneous systems,
from laptops to academic clusters and cloud instances,
including prototypes for containment searches across millions of datasets;
as well as explorations on how to facilitate sharing and increase the
resilience of sketches collections built from public genomic data.
# End of UCD thesis fields
knit: "bookdown::render_book"
site: bookdown::bookdown_site
output:
aggiedown::thesis_pdf:
latex_engine: pdflatex
# aggiedown::thesis_gitbook: default
# aggiedown::thesis_word: default
# aggiedown::thesis_epub: default
bibliography: bib/thesis.bib
# Download your specific bibliography database file and refer to it in the line above.
csl: csl/ecology.csl
# Download your specific csl file and refer to it in the line above.
link-citations: yes
linkcolor: blue
urlcolor: blue
citecolor: blue
lot: true
lof: true
#space_between_paragraphs: true
# Delete the # at the beginning of the previous line if you'd like
# to have a blank new line between each paragraph
---
<!--
Above is the YAML (YAML Ain't Markup Language) header that includes a lot of metadata used to produce the document. Be careful with spacing in this header!
If you'd like to include a comment that won't be produced in your resulting file enclose it in a block like this.
-->
<!--
If you receive a duplicate label error after knitting, make sure to delete the index.Rmd file and then knit again.
```{r include_packages, include = FALSE}
# This chunk ensures that the aggiedown package is
# installed and loaded. This aggiedown package includes
# the template files for the thesis.
if(!require(devtools))
install.packages("devtools", repos = "http://cran.rstudio.com")
if(!require(aggiedown))
devtools::install_github("ryanpeek/aggiedown")
library(aggiedown)
```
-->