-
Notifications
You must be signed in to change notification settings - Fork 8
/
README.qmd
148 lines (100 loc) · 3.26 KB
/
README.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
---
format: gfm
---
```{r pkg-knitr-opts, include=FALSE}
hrbrpkghelpr::global_opts()
```
```{r badges, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::stinking_badges()
```
# tdigest
Wicked Fast, Accurate Quantiles Using 't-Digests'
## Description
The t-Digest construction algorithm uses a variant of 1-dimensional
k-means clustering to produce a very compact data structure that allows
accurate estimation of quantiles. This t-Digest data structure can be used
to estimate quantiles, compute other rank statistics or even to estimate
related measures like trimmed means. The advantage of the t-Digest over
previous digests for this purpose is that the t-Digest handles data with
full floating point resolution. The accuracy of quantile estimates produced
by t-Digests can be orders of magnitude more accurate than those produced
by previous digest algorithms. Methods are provided to create and update
t-Digests and retrieve quantiles from the accumulated distributions.
See [the original paper by Ted Dunning & Otmar Ertl](https://arxiv.org/abs/1902.04023) for more details on t-Digests.
## What's Inside The Tin
The following functions are implemented:
```{r ingredients, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::describe_ingredients()
```
## Installation
```{r install-ex, results='asis', echo = FALSE}
hrbrpkghelpr::install_block()
```
## Usage
```{r lib-ex}
library(tdigest)
# current version
packageVersion("tdigest")
```
### Basic (Low-level interface)
```{r basic}
td <- td_create(10)
td
td_total_count(td)
td_add(td, 0, 1) %>%
td_add(10, 1)
td_total_count(td)
td_value_at(td, 0.1) == 0
td_value_at(td, 0.5) == 5
quantile(td)
```
#### Bigger (and Vectorised)
```{r bigger}
td <- tdigest(c(0, 10), 10)
is_tdigest(td)
td_value_at(td, 0.1) == 0
td_value_at(td, 0.5) == 5
set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)
td_total_count(td)
tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
quantile(td)
```
#### Serialization
These [de]serialization functions make it possible to create & populate a tdigest,
serialize it out, read it in at a later time and continue populating it enabling
compact distribution accumulation & storage for large, "continuous" datasets.
```{r serialize}
set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)
tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
str(in_r <- as.list(td), 1)
td2 <- as_tdigest(in_r)
tquantile(td2, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
identical(in_r, as.list(td2))
```
#### ALTREP-aware
```{r altrep}
N <- 1000000
x.altrep <- seq_len(N) # this is an ALTREP in R version >= 3.5.0
td <- tdigest(x.altrep)
td[0.1]
td[0.5]
length(td)
```
#### Proof it's faster
```{r benchmark, cache=TRUE}
microbenchmark::microbenchmark(
tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),
r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
)
```
## tdigest Metrics
```{r cloc, echo=FALSE}
cloc::cloc_pkg_md()
```
## Code of Conduct
Please note that this project is released with a Contributor Code of Conduct.
By participating in this project you agree to abide by its terms.