textrecipes contains extra steps for the recipes package for preprocessing text data.
You can install the released version of textrecipes from CRAN with:
install.packages("textrecipes")
Install the development version from GitHub with:
# install.packages("pak")
pak::pak("tidymodels/textrecipes")
The following example walks through the steps needed to convert a
character variable to the TF-IDF of its tokenized words, after removing
stop words and keeping only the 10 most frequent words in each variable.
The preprocessing is applied to the variables medium and artist.
library(recipes)
library(textrecipes)
library(modeldata)
data("tate_text")
okc_rec <- recipe(~ medium + artist, data = tate_text) %>%
  step_tokenize(medium, artist) %>%
  step_stopwords(medium, artist) %>%
  step_tokenfilter(medium, artist, max_tokens = 10) %>%
  step_tfidf(medium, artist)
okc_obj <- okc_rec %>%
  prep()
str(bake(okc_obj, tate_text))
#> tibble [4,284 × 20] (S3: tbl_df/tbl/data.frame)
#> $ tfidf_medium_colour : num [1:4284] 2.31 0 0 0 0 ...
#> $ tfidf_medium_etching : num [1:4284] 0 0.86 0.86 0.86 0 ...
#> $ tfidf_medium_gelatin : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_medium_lithograph : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_medium_paint : num [1:4284] 0 0 0 0 2.35 ...
#> $ tfidf_medium_paper : num [1:4284] 0 0.422 0.422 0.422 0 ...
#> $ tfidf_medium_photograph : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_medium_print : num [1:4284] 0 0 0 0 0 ...
#> $ tfidf_medium_screenprint: num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_medium_silver : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_akram : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_beuys : num [1:4284] 0 0 0 0 0 ...
#> $ tfidf_artist_ferrari : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_john : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_joseph : num [1:4284] 0 0 0 0 0 ...
#> $ tfidf_artist_león : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_richard : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_schütte : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_thomas : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_zaatari : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
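To make the pipeline above more concrete, here is a minimal base R sketch of the same four ideas (tokenize, drop stop words, keep the most frequent tokens, compute TF-IDF) on a toy corpus. It is an illustration under stated assumptions, not how textrecipes implements these steps: the corpus, the stop-word list, and the whitespace tokenizer are made up for the example, and the textbook weighting used here (relative term frequency times log(n_docs / document frequency)) may differ from the weighting and normalization that step_tfidf() applies by default.

```r
# Toy corpus standing in for a character column (hypothetical data).
docs <- c(
  "etching on paper",
  "oil paint on canvas",
  "etching and aquatint on paper",
  "screenprint on paper"
)

# 1. Tokenize: lowercase, then split on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# 2. Remove stop words (a tiny hand-picked list, for illustration only).
stop_words <- c("on", "and")
tokens <- lapply(tokens, function(x) x[!x %in% stop_words])

# 3. Keep only the `max_tokens` most frequent tokens in the corpus.
max_tokens <- 2
counts <- sort(table(unlist(tokens)), decreasing = TRUE)
vocab <- sort(names(counts)[seq_len(min(max_tokens, length(counts)))])
tokens <- lapply(tokens, function(x) x[x %in% vocab])

# 4. Term frequency: share of each document's kept tokens per term.
tf <- do.call(rbind, lapply(tokens, function(x) {
  n <- as.numeric(table(factor(x, levels = vocab)))
  if (length(x) == 0) n else n / length(x)
}))

# 5. Inverse document frequency: log(n_docs / n_docs containing the term).
doc_freq <- colSums(tf > 0)
idf <- log(length(docs) / doc_freq)

# 6. TF-IDF: weight each term-frequency column by its IDF.
tfidf <- sweep(tf, 2, idf, `*`)
colnames(tfidf) <- paste0("tfidf_", vocab)
tfidf
```

As in the baked output above, each row is a document and each column is one retained token. A document whose tokens were all filtered out gets zeros in every column.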
As of version 0.4.0, step_lda() no longer accepts character variables and instead takes tokenlist variables.
The following recipe
recipe(~text_var, data = data) %>%
  step_lda(text_var)
can be replaced with the following recipe to achieve the same results:
lda_tokenizer <- function(x) text2vec::word_tokenizer(tolower(x))
recipe(~text_var, data = data) %>%
  step_tokenize(text_var,
    custom_token = lda_tokenizer
  ) %>%
  step_lda(text_var)
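The custom_token argument expects a function that takes a character vector and returns a list of character vectors, one element per input string. As a rough sketch of that contract, the hypothetical simple_tokenizer() below does the same kind of job using base R only. It is an assumption-laden stand-in, and it will not necessarily split text the same way text2vec::word_tokenizer() does.

```r
# Hypothetical custom tokenizer: lowercase the text, then split on runs of
# non-alphanumeric characters. Returns a list with one character vector
# per input string.
simple_tokenizer <- function(x) {
  pieces <- strsplit(tolower(x), "[^[:alnum:]]+")
  # Drop the empty strings that strsplit() leaves behind when the text
  # starts with punctuation or whitespace.
  lapply(pieces, function(p) p[nzchar(p)])
}

simple_tokenizer(c("Oil paint on canvas", "Etching, aquatint"))
```

A function with this shape could be supplied as custom_token = simple_tokenizer in step_tokenize() in place of lda_tokenizer, at the cost of no longer matching the pre-0.4.0 results exactly.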
This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
- For questions and discussions about tidymodels packages, modeling, and machine learning, please post on RStudio Community.
- If you think you have encountered a bug, please submit an issue.
- Either way, learn how to create and share a reprex (a minimal, reproducible example), to clearly communicate about your code.
- Check out further details on contributing guidelines for tidymodels packages and how to get help.