@@ -43,6 +43,8 @@ local({
hook_source <- knitr::knit_hooks$get("document")
knitr::knit_hooks$set(document = clean_output)
})
+
+ Sys.setenv(DUCKPLYR_META_SKIP = TRUE)
```

# duckplyr <a href="https://duckplyr.tidyverse.org"><img src="man/figures/logo.png" align="right" height="138" /></a>
@@ -56,7 +58,7 @@ local({

[dplyr](https://dplyr.tidyverse.org/) is the grammar of data manipulation in the tidyverse.
The duckplyr package will run all of your existing dplyr code with identical results, using [DuckDB](https://duckdb.org/) where possible to compute the results faster.
- In addition, you can analyze larger-than-memory datasets straight from files on your disk or from S3 storage.
+ In addition, you can analyze larger-than-memory datasets straight from files on your disk or from the web.
If you are new to dplyr, the best place to start is the [data transformation chapter](https://r4ds.hadley.nz/data-transform) in R for Data Science.


@@ -81,10 +83,9 @@ Or from [GitHub](https://github.com/) with:
pak::pak("tidyverse/duckplyr")
```

- ## Example
+ ## Drop-in replacement for dplyr

- Calling `library(duckplyr)` overwrites dplyr methods,
- enabling duckplyr instead for the entire session.
+ Calling `library(duckplyr)` overwrites dplyr methods, enabling duckplyr for the entire session.

```{r attach}
library(conflicted)
@@ -103,7 +104,7 @@ conflict_prefer("filter", "dplyr", quiet = TRUE)
```

The following code aggregates the inflight delay by year and month for the first half of the year.
- We use a variant of the `nycflights13::flights` dataset that removes an incompatibility with duckplyr.
+ We use a variant of the `nycflights13::flights` dataset that works around an incompatibility with duckplyr.

```{r}
flights_df()
@@ -130,7 +131,7 @@ Nothing has been computed yet.
Querying the number of rows, or a column, starts the computation:

```{r}
- system.time(print(out$month))
+ out$month
```

Note that, unlike dplyr, the results are not ordered, see `?config` for details.
@@ -146,6 +147,72 @@ Restart R, or call `duckplyr::methods_restore()` to revert to the default dplyr
duckplyr::methods_restore()
```

+ ## Analyzing larger-than-memory data
+
+ An extended variant of this dataset is also available for download as Parquet files.
+
+ ```{r}
+ year <- 2022:2024
+ base_url <- "https://blobs.duckdb.org/flight-data-partitioned/"
+ files <- paste0("Year=", year, "/data_0.parquet")
+ urls <- paste0(base_url, files)
+ urls
+ ```
+
+ Using the httpfs DuckDB extension, we can query these files directly from R, without even downloading them first.
+
+ ```{r}
+ duck_exec("INSTALL httpfs")
+ duck_exec("LOAD httpfs")
+
+ flights <- duck_parquet(urls)
+ ```
+
+ Unlike with local data frames, the default is to disallow automatic materialization of the results on access.
+
+ ```{r error = TRUE}
+ nrow(flights)
+ ```
+
+ Queries on the remote data are executed lazily, and the results are not materialized until explicitly requested.
+ For printing, only the first few rows of the result are fetched.
+
+ ```{r cache = TRUE}
+ flights
+ ```
+
+ ```{r cache = TRUE}
+ flights |>
+   count(Year)
+ ```
+
+ Complex queries can be executed on the remote data.
+ Note how only the relevant columns are fetched and the 2024 data isn't even touched, as it's not needed for the result.
+
+ ```{r cache = TRUE}
+ out <-
+   flights |>
+   filter(!is.na(DepDelay), !is.na(ArrDelay)) |>
+   mutate(InFlightDelay = ArrDelay - DepDelay) |>
+   summarize(
+     .by = c(Year, Month),
+     MeanInFlightDelay = mean(InFlightDelay),
+     MedianInFlightDelay = median(InFlightDelay),
+   ) |>
+   filter(Year < 2024)
+
+ out |>
+   explain()
+
+ out |>
+   print() |>
+   system.time()
+ ```
+
+ Over 10M rows analyzed in about 10 seconds over the internet, that's not bad.
+ Of course, working with Parquet, CSV, or JSON files downloaded locally is possible as well.
+
+
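Since the README describes duckplyr as a drop-in replacement for dplyr, the aggregation added in this hunk can be tried offline on a small in-memory data frame with plain dplyr before pointing it at the remote Parquet files. The following is a minimal sketch and not part of the change: the `toy` data frame and its values are made up for illustration, only the column names come from the diff, and it assumes dplyr 1.1.0 or later for the `.by` argument.

```r
library(dplyr)

# Hypothetical toy data; only the column names mirror the remote dataset.
toy <- data.frame(
  Year = c(2022, 2022, 2022, 2023, 2024),
  Month = c(1, 1, 2, 1, 1),
  DepDelay = c(10, 20, NA, 5, 0),
  ArrDelay = c(5, 30, 12, 5, 7)
)

out <-
  toy |>
  # Drop flights with a missing departure or arrival delay.
  filter(!is.na(DepDelay), !is.na(ArrDelay)) |>
  # Delay accumulated in the air: arrival delay minus departure delay.
  mutate(InFlightDelay = ArrDelay - DepDelay) |>
  # One row per year and month, with mean and median in-flight delay.
  summarize(
    .by = c(Year, Month),
    MeanInFlightDelay = mean(InFlightDelay),
    MedianInFlightDelay = median(InFlightDelay)
  ) |>
  # Keep only years before 2024, as in the remote query.
  filter(Year < 2024)

print(out)
```

In this toy run the single 2024 row is discarded by the final filter, which is the same condition that, per the README text, lets the remote query skip the 2024 file.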
## Using duckplyr in other packages

Refer to `vignette("developers", package = "duckplyr")`.