diff --git a/README.md b/README.md
index fbff8b342..a31037910 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,7 @@ framework?** Start [here](https://JuliaAI.github.io/MLJ.jl/dev/quick_start_guide
MLJ was initially created as a Tools, Practices and Systems project at
the [Alan Turing Institute](https://www.turing.ac.uk/)
-in 2019. Current funding is provided by a [New Zealand Strategic
+in 2019. Funding has also been provided by a [New Zealand Strategic
Science Investment
Fund](https://www.mbie.govt.nz/science-and-technology/science-and-innovation/funding-information-and-opportunities/investment-funds/strategic-science-investment-fund/ssif-funded-programmes/university-of-auckland/)
awarded to the University of Auckland.
@@ -50,6 +50,7 @@ awarded to the University of Auckland.
MLJ has been developed with the support of the following organizations:
"
- ],
- "text/latex": [
- "\\begin{tabular}{r|cc}\n",
- "\t& measure & measurement\\\\\n",
- "\t\\hline\n",
- "\t& Measure & Measurem…\\\\\n",
- "\t\\hline\n",
- "\t1 & BrierLoss() & $0.313 \\pm 0.014$ \\\\\n",
- "\t2 & AreaUnderCurve() & $0.789 \\pm 0.023$ \\\\\n",
- "\t3 & Accuracy() & $0.781 \\pm 0.014$ \\\\\n",
- "\\end{tabular}\n"
- ],
- "text/plain": [
- "\u001b[1m3×2 DataFrame\u001b[0m\n",
- "\u001b[1m Row \u001b[0m│\u001b[1m measure \u001b[0m\u001b[1m measurement \u001b[0m\n",
- "\u001b[1m \u001b[0m│\u001b[90m Measure \u001b[0m\u001b[90m Measuremen… \u001b[0m\n",
- "─────┼───────────────────────────────\n",
- " 1 │ BrierLoss() 0.313±0.014\n",
- " 2 │ AreaUnderCurve() 0.789±0.023\n",
- " 3 │ Accuracy() 0.781±0.014"
- ]
- },
- "execution_count": 63,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "confidence_intervals_basic_model"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As each pair of intervals overlap, it's doubtful the small changes\n",
- "here can be assigned statistical significance. Default `booster`\n",
- "hyper-parameters do a pretty good job."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Testing the final model"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We now determine the performance of our model on our\n",
- "lock-and-throw-away-the-key holdout set. To demonstrate\n",
- "deserialization, we'll pretend we're in a new Julia session (but\n",
- "have called `import`/`using` on the same packages). Then the\n",
- "following should suffice to recover our model trained under\n",
- "\"Hyper-parameter optimization\" above:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 64,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Machine{ProbabilisticTunedModel{RandomSearch,…},…} trained 1 time; caches data\n",
- " model: MLJTuning.ProbabilisticTunedModel{RandomSearch, MLJIteration.ProbabilisticIteratedModel{MLJBase.ProbabilisticPipeline{NamedTuple{(:continuous_encoder, :feature_selector, :evo_tree_classifier), Tuple{Unsupervised, Unsupervised, Probabilistic}}, MLJModelInterface.predict}}}\n",
- " args: \n"
- ]
- },
- "execution_count": 64,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "mach_restored = machine(\"tuned_iterated_pipe.jlso\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We compute predictions on the holdout set:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 65,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " \u001b[1mUnivariateFinite{Multiclass{2}}\u001b[22m \n",
- " \u001b[90m┌ ┐\u001b[39m \n",
- " \u001b[0mNo \u001b[90m┤\u001b[39m\u001b[38;5;2m■■■■■■■■■■■■■■■■■\u001b[39m\u001b[0m 0.44551543781072306 \u001b[90m \u001b[39m \n",
- " \u001b[0mYes \u001b[90m┤\u001b[39m\u001b[38;5;2m■■■■■■■■■■■■■■■■■■■■■\u001b[39m\u001b[0m 0.554484562189277 \u001b[90m \u001b[39m \n",
- " \u001b[90m└ ┘\u001b[39m "
- ]
- },
- "execution_count": 65,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "ŷ_tuned = predict(mach_restored, Xtest);\n",
- "ŷ_tuned[1]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "And can compute the final performance measures:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 66,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "┌ Info: Tuned model measurements on test:\n",
- "│ brier_loss(ŷ_tuned, ytest) |> mean = 0.26784766342462435\n",
- "│ auc(ŷ_tuned, ytest) = 0.8467741935483871\n",
- "│ accuracy(mode.(ŷ_tuned), ytest) = 0.8104265402843602\n",
- "└ @ Main In[66]:1\n"
- ]
- }
- ],
- "source": [
- "@info(\"Tuned model measurements on test:\",\n",
- " brier_loss(ŷ_tuned, ytest) |> mean,\n",
- " auc(ŷ_tuned, ytest),\n",
- " accuracy(mode.(ŷ_tuned), ytest)\n",
- " )"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For comparison, here's the performance for the basic pipeline model"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 67,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "┌ Info: Basic model measurements on test set:\n",
- "│ brier_loss(ŷ_basic, ytest) |> mean = 0.2815220182102922\n",
- "│ auc(ŷ_basic, ytest) = 0.8496543778801844\n",
- "│ accuracy(mode.(ŷ_basic), ytest) = 0.8009478672985781\n",
- "└ @ Main In[67]:6\n"
- ]
- }
- ],
- "source": [
- "mach_basic = machine(pipe, X, y)\n",
- "fit!(mach_basic, verbosity=0)\n",
- "\n",
- "ŷ_basic = predict(mach_basic, Xtest);\n",
- "\n",
- "@info(\"Basic model measurements on test set:\",\n",
- " brier_loss(ŷ_basic, ytest) |> mean,\n",
- " auc(ŷ_basic, ytest),\n",
- " accuracy(mode.(ŷ_basic), ytest)\n",
- " )"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "---\n",
- "\n",
- "*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Julia 1.6.5",
- "language": "julia",
- "name": "julia-1.6"
- },
- "language_info": {
- "file_extension": ".jl",
- "mimetype": "application/julia",
- "name": "julia",
- "version": "1.6.5"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 3
-}
diff --git a/examples/telco/notebook.jl b/examples/telco/notebook.jl
deleted file mode 100644
index d42a46923..000000000
--- a/examples/telco/notebook.jl
+++ /dev/null
@@ -1,757 +0,0 @@
-# # MLJ for Data Scientists in Two Hours
-
-# An application of the [MLJ
-# toolbox](https://juliaai.github.io/MLJ.jl/dev/) to the
-# Telco Customer Churn dataset, aimed at practicing data scientists
-# new to MLJ (Machine Learning in Julia). This tutorial does not
-# cover exploratory data analysis.
-
-# MLJ is a *multi-paradigm* machine learning toolbox (i.e., not just
-# deep learning).
-
-# For other MLJ learning resources see the [Learning
-# MLJ](https://juliaai.github.io/MLJ.jl/dev/learning_mlj/)
-# section of the
-# [manual](https://juliaai.github.io/MLJ.jl/dev/).
-
-# **Topics covered**: Grabbing and preparing a dataset, basic
-# fit/predict workflow, constructing a pipeline to include data
-# pre-processing, estimating performance metrics, ROC curves, confusion
-# matrices, feature importance, basic feature selection, controlling iterative
-# models, hyper-parameter optimization (tuning).
-
-# **Prerequisites for this tutorial.** Previous experience building,
-# evaluating, and optimizing machine learning models using
-# scikit-learn, caret, MLR, weka, or a similar tool. No previous
-# experience with MLJ. Only fairly basic familiarity with Julia is
-# required. Uses
-# [DataFrames.jl](https://dataframes.juliadata.org/stable/) but in a
-# minimal way ([this
-# cheatsheet](https://ahsmart.com/pub/data-wrangling-with-data-frames-jl-cheat-sheet/index.html)
-# may help).
-
-# **Time.** Between two and three hours, first time through.
-
-
-# ## Summary of methods and types introduced
-
-# |code | purpose|
-# |:-------|:-------------------------------------------------------|
-# | `OpenML.load(id)` | grab a dataset from [OpenML.org](https://www.openml.org)|
-# | `scitype(X)` | inspect the scientific type (scitype) of object `X`|
-# | `schema(X)` | inspect the column scitypes (scientific types) of a table `X`|
-# | `coerce(X, ...)` | fix column encodings to get appropriate scitypes|
-# | `partition(data, frac1, frac2, ...; rng=...)` | vertically split `data`, which can be a table, vector or matrix|
-# | `unpack(table, f1, f2, ...)` | horizontally split `table` based on conditions `f1`, `f2`, ..., applied to column names|
-# | `@load ModelType pkg=...` | load code defining a model type|
-# | `input_scitype(model)` | inspect the scitype that a model requires for features (inputs)|
-# | `target_scitype(model)`| inspect the scitype that a model requires for the target (labels)|
-# | `ContinuousEncoder` | built-in model type for re-encoding all features as `Continuous`|
-# | `model1 ∣> model2 ∣> ...` | combine multiple models into a pipeline|
-# | `measures("under curve")` | list all measures (metrics) with string "under curve" in documentation|
-# | `accuracy(yhat, y)` | compute accuracy of predictions `yhat` against ground truth observations `y`|
-# | `auc(yhat, y)`, `brier_loss(yhat, y)` | evaluate two probabilistic measures (`yhat` a vector of probability distributions)|
-# | `machine(model, X, y)` | bind `model` to training data `X` (features) and `y` (target)|
-# | `fit!(mach, rows=...)` | train machine using specified rows (observation indices)|
-# | `predict(mach, rows=...)` | make in-sample model predictions given specified rows|
-# | `predict(mach, Xnew)` | make predictions given new features `Xnew`|
-# | `fitted_params(mach)` | inspect learned parameters|
-# | `report(mach)` | inspect other outcomes of training|
-# | `confmat(yhat, y)` | confusion matrix for predictions `yhat` and ground truth `y`|
-# | `roc(yhat, y)` | compute points on the receiver-operator characteristic|
-# | `StratifiedCV(nfolds=6)` | 6-fold stratified cross-validation resampling strategy|
-# | `Holdout(fraction_train=0.7)` | holdout resampling strategy|
-# | `evaluate(model, X, y; resampling=..., options...)` | estimate performance metrics for `model` using the data `X`, `y`|
-# | `FeatureSelector()` | transformer for selecting features|
-# | `Step(3)` | iteration control for stepping 3 iterations|
-# | `NumberSinceBest(6)`, `TimeLimit(60/5)`, `InvalidValue()` | iteration control stopping criteria|
-# | `IteratedModel(model=..., controls=..., options...)` | wrap an iterative `model` in control strategies|
-# | `range(model, :some_hyperparam, lower=..., upper=...)` | define a numeric range|
-# | `RandomSearch()` | random search tuning strategy|
-# | `TunedModel(model=..., tuning=..., options...)` | wrap the supervised `model` in specified `tuning` strategy|
-
-
-# ## Instantiate a Julia environment
-
-# The following code replicates precisely the set of Julia packages
-# used to develop this tutorial. If this is your first time running
-# the notebook, package instantiation and pre-compilation may take a
-# minute or so to complete. **This step will fail** if the [correct
-# Manifest.toml and Project.toml
-# files](https://github.com/JuliaAI/MLJ.jl/tree/dev/examples/telco)
-# are not in the same directory as this notebook.
-
-using Pkg
-Pkg.activate(@__DIR__) # get env from TOML files in same directory as this notebook
-Pkg.instantiate()
-
-
-# ## Warm up: Building a model for the iris dataset
-
-# Before turning to the Telco Customer Churn dataset, we very quickly
-# build a predictive model for Fisher's well-known iris data set, as a way of
-# introducing the main actors in any MLJ workflow. Details that you
-# don't fully grasp should become clearer in the Telco study.
-
-# This section is a condensed adaptation of the [Getting Started
-# example](https://juliaai.github.io/MLJ.jl/dev/getting_started/#Fit-and-predict)
-# in the MLJ documentation.
-
-# First, using the built-in iris dataset, we load and inspect the features
-# `X_iris` (a table) and target variable `y_iris` (a vector):
-
-using MLJ
-
-#-
-
-const X_iris, y_iris = @load_iris;
-schema(X_iris)
-
-#-
-
-y_iris[1:4]
-
-#
-
-levels(y_iris)
-
-# We load a decision tree model, from the package DecisionTree.jl:
-
-DecisionTree = @load DecisionTreeClassifier pkg=DecisionTree # model type
-model = DecisionTree(min_samples_split=5) # model instance
-
-# In MLJ, a *model* is just a container for hyperparameters of
-# some learning algorithm. It does not store learned parameters.
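-
-# For example (a quick illustration), the hyper-parameter we set above
-# is an ordinary, mutable field of the model struct:
-
-model.min_samples_split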
-
-# Next, we bind the model together with the available data in what's
-# called a *machine*:
-
-mach = machine(model, X_iris, y_iris)
-
-# A machine is essentially just a model (ie, hyperparameters) plus data, but
-# it additionally stores *learned parameters* (the tree) once it is
-# trained on some view of the data:
-
-train_rows = vcat(1:60, 91:150); # some row indices (observations are rows not columns)
-fit!(mach, rows=train_rows)
-fitted_params(mach)
-
-# A machine stores some other information enabling [warm
-# restart](https://juliaai.github.io/MLJ.jl/dev/machines/#Warm-restarts)
-# for some models, but we won't go into that here. You are allowed to
-# access and mutate the `model` parameter:
-
-mach.model.min_samples_split = 10
-fit!(mach, rows=train_rows) # re-train with new hyperparameter
-
-# Now we can make predictions on some other view of the data, as in
-
-predict(mach, rows=71:73)
-
-# or on completely new data, as in
-
-Xnew = (sepal_length = [5.1, 6.3],
- sepal_width = [3.0, 2.5],
- petal_length = [1.4, 4.9],
- petal_width = [0.3, 1.5])
-yhat = predict(mach, Xnew)
-
-# These are probabilistic predictions which can be manipulated using a
-# widely adopted interface defined in the Distributions.jl
-# package. For example, we can get raw probabilities like this:
-
-pdf.(yhat, "virginica")
-
-
-# We now turn to the Telco dataset.
-
-
-# ## Getting the Telco data
-
-import DataFrames
-
-#-
-
-data = OpenML.load(42178) # data set from OpenML.org
-df0 = DataFrames.DataFrame(data)
-first(df0, 4)
-
-# The object of this tutorial is to build and evaluate supervised
-# learning models to predict the `:Churn` variable, a binary variable
-# measuring customer retention, based on the other relevant
-# variables.
-
-# In the table, observations correspond to rows, and features to
-# columns, which is the convention for representing all
-# two-dimensional data in MLJ.
-
-
-# ## Type coercion
-
-# > Introduces: `scitype`, `schema`, `coerce`
-
-# A ["scientific
-# type"](https://juliaai.github.io/ScientificTypes.jl/dev/) or
-# *scitype* indicates how MLJ will *interpret* data. For example,
-# `typeof(3.14) == Float64`, while `scitype(3.14) == Continuous` and
-# also `scitype(3.14f0) == Continuous`. In MLJ, model data
-# requirements are articulated using scitypes.
-
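-# For instance, compare the machine types and scitypes of a few
-# scalars (a minimal illustration):
-
-(typeof(3.14), scitype(3.14), scitype(3), scitype("churn"))
-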
-# Here are common "scalar" scitypes:
-
-# 
-
-# There are also container scitypes. For example, the scitype of any
-# `N`-dimensional array is `AbstractArray{S, N}`, where `S` is the scitype of the
-# elements:
-
-scitype(["cat", "mouse", "dog"])
-
-# The `schema` operator summarizes the column scitypes of a table:
-
-schema(df0) |> DataFrames.DataFrame # converted to DataFrame for better display
-
-# All of the fields being interpreted as `Textual` are really
-# something else, either `Multiclass` or, in the case of
-# `:TotalCharges`, `Continuous`. In fact, `:TotalCharges` is
-# mostly floats wrapped as strings. However, it needs special
-# treatment because some elements consist of a single space, " ",
-# which we'll treat as "0.0".
-
-fix_blanks(v) = map(v) do x
- if x == " "
- return "0.0"
- else
- return x
- end
-end
-
-df0.TotalCharges = fix_blanks(df0.TotalCharges);
-
-# Coercing the `:TotalCharges` type to ensure a `Continuous` scitype:
-
-coerce!(df0, :TotalCharges => Continuous);
-
-# Coercing all remaining `Textual` data to `Multiclass`:
-
-coerce!(df0, Textual => Multiclass);
-
-# Finally, we'll coerce our target variable `:Churn` to be
-# `OrderedFactor`, rather than `Multiclass`, to enable a reliable
-# interpretation of metrics like "true positive rate". By convention,
-# the first class is the negative one:
-
-coerce!(df0, :Churn => OrderedFactor)
-levels(df0.Churn) # to check order
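-
-# By the stated convention, the positive class is therefore the
-# second one:
-
-levels(df0.Churn)[2]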
-
-# Re-inspecting the scitypes:
-
-schema(df0) |> DataFrames.DataFrame
-
-
-# ## Preparing a holdout set for final testing
-
-# > Introduces: `partition`
-
-# To reduce training times for the purposes of this tutorial, we're
-# going to dump 90% of observations (after shuffling) and split off
-# 30% of the remainder for use as a lock-and-throw-away-the-key
-# holdout set:
-
-df, df_test, df_dumped = partition(df0, 0.07, 0.03, # in ratios 7:3:90
- stratify=df0.Churn,
- rng=123);
-
-# The reader interested in including all data can instead do
-# `df, df_test = partition(df0, 0.7, stratify=df0.Churn, rng=123)`.
-
-
-# ## Splitting data into target and features
-
-# > Introduces: `unpack`
-
-# In the following call, the column with name `:Churn` is copied over
-# to a vector `y`, and every remaining column, except `:customerID`
-# (which contains no useful information) goes into a table `X`. Here
-# `:Churn` is the target variable for which we seek predictions, given
-# new versions of the features `X`.
-
-const y, X = unpack(df, ==(:Churn), !=(:customerID));
-schema(X).names
-
-#-
-
-intersect([:Churn, :customerID], schema(X).names)
-
-# We'll do the same for the holdout data:
-
-const ytest, Xtest = unpack(df_test, ==(:Churn), !=(:customerID));
-
-# ## Loading a model and checking type requirements
-
-# > Introduces: `@load`, `input_scitype`, `target_scitype`
-
-# For tools helping us to identify suitable models, see the [Model
-# Search](https://juliaai.github.io/MLJ.jl/dev/model_search/#model_search)
-# section of the manual. We will build a gradient tree-boosting model,
-# a popular first choice for structured data like we have here. Model
-# code is contained in a third-party package called
-# [EvoTrees.jl](https://github.com/Evovest/EvoTrees.jl) which is
-# loaded as follows:
-
-Booster = @load EvoTreeClassifier pkg=EvoTrees
-
-# Recall that a *model* is just a container for some algorithm's
-# hyperparameters. Let's create a `Booster` with default values for
-# the hyperparameters:
-
-booster = Booster()
-
-# This model is appropriate for the kind of target variable we have because of
-# the following passing test:
-
-scitype(y) <: target_scitype(booster)
-
-# However, our features `X` cannot be directly used with `booster`:
-
-scitype(X) <: input_scitype(booster)
-
-# As it turns out, this is because `booster`, like the majority of MLJ
-# supervised models, expects the features to be `Continuous`. (With
-# some experience, this can be gleaned from `input_scitype(booster)`.)
-# So we need categorical feature encoding, discussed next.
-
-
-# ## Building a model pipeline to incorporate feature encoding
-
-# > Introduces: `ContinuousEncoder`, pipeline operator `|>`
-
-# The built-in `ContinuousEncoder` model transforms an arbitrary table
-# to a table whose features are all `Continuous` (dropping any fields
-# it does not know how to encode). In particular, all `Multiclass`
-# features are one-hot encoded.
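-
-# To see the encoder in isolation, here's a minimal sketch applied to
-# a toy table (the table and its column names are hypothetical):
-
-X_demo = coerce((gender = ["F", "M", "F"], charge = [1.0, 2.0, 3.0]),
-                :gender => Multiclass)
-mach_demo = machine(ContinuousEncoder(), X_demo) |> fit!
-transform(mach_demo, X_demo)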
-
-# A *pipeline* is a stand-alone model that internally combines one or
-# more models in a linear (non-branching) pipeline. Here's a pipeline
-# that adds the `ContinuousEncoder` as a pre-processor to the
-# gradient tree-boosting model above:
-
-pipe = ContinuousEncoder() |> booster
-
-# Note that the component models appear as hyperparameters of
-# `pipe`. Pipelines are an implementation of a more general [model
-# composition](https://juliaai.github.io/MLJ.jl/dev/composing_models/#Composing-Models)
-# interface provided by MLJ that advanced users may want to learn about.
-
-# From the above display, we see that component model hyperparameters
-# are now *nested*, but they are still accessible (important in hyperparameter
-# optimization):
-
-pipe.evo_tree_classifier.max_depth
-
-
-# ## Evaluating the pipeline model's performance
-
-# > Introduces: `measures` (function), **measures:** `brier_loss`, `auc`, `accuracy`;
-# > `machine`, `fit!`, `predict`, `fitted_params`, `report`, `roc`, **resampling strategy** `StratifiedCV`, `evaluate`, `FeatureSelector`
-
-# Without touching our test set `Xtest`, `ytest`, we will estimate the
-# performance of our pipeline model, with default hyperparameters, in
-# two different ways:
-
-# **Evaluating by hand.** First, we'll do this "by hand" using the `fit!` and `predict`
-# workflow illustrated for the iris data set above, using a
-# holdout resampling strategy. At the same time we'll see how to
-# generate a **confusion matrix**, **ROC curve**, and inspect
-# **feature importances**.
-
-# **Automated performance evaluation.** Next we'll apply the more
-# typical and convenient `evaluate` workflow, but using `StratifiedCV`
-# (stratified cross-validation) which is more informative.
-
-# In any case, we need to choose some measures (metrics) to quantify
-# the performance of our model. For a complete list of measures, run
-# `measures()`. Alternatively, we can search for relevant ones:
-
-measures("Brier")
-
-# We will be primarily using `brier_loss`, but also `auc` (area under
-# the ROC curve) and `accuracy`.
-
-
-# ### Evaluating by hand (with a holdout set)
-
-# Our pipeline model can be trained just like the decision tree model
-# we built for the iris data set. Binding all non-test data to the
-# pipeline model:
-
-mach_pipe = machine(pipe, X, y)
-
-# We already encountered the `partition` method above. Here we apply
-# it to row indices, instead of data containers, as `fit!` and
-# `predict` only need a *view* of the data to work.
-
-train, validation = partition(1:length(y), 0.7)
-fit!(mach_pipe, rows=train)
-
-# We note in passing that we can access two kinds of information from a trained machine:
-
-# - The **learned parameters** (eg, coefficients of a linear model): We use `fitted_params(mach_pipe)`
-# - Other **by-products of training** (eg, feature importances): We use `report(mach_pipe)`
-
-fp = fitted_params(mach_pipe);
-keys(fp)
-
-# For example, we can check that the encoder did not actually drop any features:
-
-Set(fp.continuous_encoder.features_to_keep) == Set(schema(X).names)
-
-# And, from the report, extract feature importances:
-
-rpt = report(mach_pipe)
-keys(rpt.evo_tree_classifier)
-
-#-
-
-fi = rpt.evo_tree_classifier.feature_importances
-feature_importance_table =
- (feature=Symbol.(first.(fi)), importance=last.(fi)) |> DataFrames.DataFrame
-
-# For models not reporting feature importances, we recommend the
-# [Shapley.jl](https://expandingman.gitlab.io/Shapley.jl/) package.
-
-# Returning to predictions and evaluations of our measures:
-
-ŷ = predict(mach_pipe, rows=validation);
-@info("Measurements",
- brier_loss(ŷ, y[validation]) |> mean,
- auc(ŷ, y[validation]),
- accuracy(mode.(ŷ), y[validation])
- )
-
-# Note that we need `mode` in the last case because `accuracy` expects
-# point predictions, not probabilistic ones. (One can alternatively
-# use `predict_mode` to generate the predictions.)
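-
-# For example, this is equivalent to the `mode.(ŷ)` computation above:
-
-predict_mode(mach_pipe, rows=validation)[1:3]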
-
-# While we're here, let's also generate a **confusion matrix** and
-# [receiver-operator
-# characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
-# (ROC):
-
-confmat(mode.(ŷ), y[validation])
-
-# Note: Importing the plotting package and calling the plotting
-# functions for the first time can take a minute or so.
-
-using Plots
-
-#-
-
-roc_curve = roc(ŷ, y[validation])
-plt = scatter(roc_curve, legend=false)
-plot!(plt, xlab="false positive rate", ylab="true positive rate")
-plot!([0, 1], [0, 1], linewidth=2, linestyle=:dash, color=:black)
-
-
-# ### Automated performance evaluation (more typical workflow)
-
-# We can also get performance estimates with a single call to the
-# `evaluate` function, which also allows for more complicated
-# resampling - in this case stratified cross-validation. To make this
-# more comprehensive, we set `repeats=3` below to make our
-# cross-validation "Monte Carlo" (3 random size-6 partitions of the
-# observation space, for a total of 18 folds) and set
-# `acceleration=CPUThreads()` to parallelize the computation.
-
-# We choose a `StratifiedCV` resampling strategy; the complete list of options is
-# [here](https://juliaai.github.io/MLJ.jl/dev/evaluating_model_performance/#Built-in-resampling-strategies).
-
-e_pipe = evaluate(pipe, X, y,
- resampling=StratifiedCV(nfolds=6, rng=123),
- measures=[brier_loss, auc, accuracy],
- repeats=3,
- acceleration=CPUThreads())
-
-# (There is also a version of `evaluate` for machines. Query the
-# `evaluate` and `evaluate!` doc-strings to learn more about these
-# functions and what the `PerformanceEvaluation` object `e_pipe` records.)
-
-# While [less than ideal](https://arxiv.org/abs/2104.00673), let's
-# adopt the common practice of using the standard error of a
-# cross-validation score as an estimate of the uncertainty of a
-# performance measure's expected value. Here's a utility function to
-# calculate 95% confidence intervals for our performance estimates based
-# on this practice, and its application to the current evaluation:
-
-using Measurements
-
-#-
-
-function confidence_intervals(e)
- factor = 2.0 # to get level of 95%
- measure = e.measure
- nfolds = length(e.per_fold[1])
- measurement = [e.measurement[j] ± factor*std(e.per_fold[j])/sqrt(nfolds - 1)
- for j in eachindex(measure)]
- table = (measure=measure, measurement=measurement)
- return DataFrames.DataFrame(table)
-end
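-
-# (The `±` above comes from Measurements.jl: `a ± b` constructs a
-# `Measurement`, which displays as an interval.)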
-
-const confidence_intervals_basic_model = confidence_intervals(e_pipe)
-
-
-# ## Filtering out unimportant features
-
-# > Introduces: `FeatureSelector`
-
-# Before continuing, we'll modify our pipeline to drop those features
-# with low feature importance, to speed up later optimization:
-
-unimportant_features = filter(:importance => <(0.005), feature_importance_table).feature
-
-pipe2 = ContinuousEncoder() |>
- FeatureSelector(features=unimportant_features, ignore=true) |> booster
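-
-# Note that `ignore=true` means the listed features are *excluded*;
-# with the default `ignore=false`, `features` would instead specify
-# the features to keep.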
-
-
-# ## Wrapping our iterative model in control strategies
-
-# > Introduces: **control strategies:** `Step`, `NumberSinceBest`, `TimeLimit`, `InvalidValue`, **model wrapper** `IteratedModel`, **resampling strategy:** `Holdout`
-
-# We want to optimize the hyperparameters of our model. Since our
-# model is iterative, these parameters include the (nested) iteration
-# parameter `pipe.evo_tree_classifier.nrounds`. Sometimes this
-# parameter is optimized first, fixed, and then maybe optimized again
-# after the other parameters. Here we take a more principled approach,
-# **wrapping our model in a control strategy** that makes it
-# "self-iterating". The strategy applies a stopping criterion to
-# *out-of-sample* estimates of the model performance, constructed
-# using an internally constructed holdout set. In this way, we avoid
-# some data hygiene issues, and, when we subsequently optimize other
-# parameters, we will always be using an optimal number of
-# iterations.
-
-# Note that this approach can be applied to any iterative MLJ model,
-# eg, the neural network models provided by
-# [MLJFlux.jl](https://github.com/FluxML/MLJFlux.jl).
-
-# First, we select appropriate controls from [this
-# list](https://juliaai.github.io/MLJ.jl/dev/controlling_iterative_models/#Controls-provided):
-
-controls = [
-    Step(1),              # to increment the iteration parameter (the nested `nrounds`)
- NumberSinceBest(4), # main stopping criterion
- TimeLimit(2/3600), # never train more than 2 sec
- InvalidValue() # stop if NaN or ±Inf encountered
-]
-
-# Now we wrap our pipeline model using the `IteratedModel` wrapper,
-# being sure to specify the `measure` on which internal estimates of
-# the out-of-sample performance will be based:
-
-iterated_pipe = IteratedModel(model=pipe2,
- controls=controls,
- measure=brier_loss,
- resampling=Holdout(fraction_train=0.7))
-
-# We've set `resampling=Holdout(fraction_train=0.7)` to arrange that
-# data attached to our model should be internally split into a train
-# set (70%) and a holdout set (30%) for determining the out-of-sample
-# estimate of the Brier loss.
-
-# For demonstration purposes, let's bind `iterated_pipe` to all data
-# not in our don't-touch holdout set, and train on all of that data:
-
-mach_iterated_pipe = machine(iterated_pipe, X, y)
-fit!(mach_iterated_pipe);
-
-# To recap, internally this training is split into two separate steps:
-
-# - A controlled iteration step, training on the internal 70% train set, with the total number of iterations determined by the specified stopping criteria (based on out-of-sample performance estimates computed on the internal 30% holdout set)
-# - A final step that trains the atomic model on *all* available
-# data using the number of iterations determined in the first step. Calling `predict` on `mach_iterated_pipe` means using the learned parameters of the second step.
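-
-# Details of the controlled iteration, including which stopping
-# criterion fired, can be inspected in the machine's report (a hedged
-# sketch, assuming the report exposes the same `controls` field we
-# query for the best model below):
-
-report(mach_iterated_pipe).controls |> collect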
-
-
-# ## Hyper-parameter optimization (model tuning)
-
-# > Introduces: `range`, **model wrapper** `TunedModel`, `RandomSearch`
-
-# We now turn to hyperparameter optimization. A tool not discussed
-# here is the `learning_curve` function, which can be useful when
-# wanting to visualize the effect of changes to a *single*
-# hyperparameter (which could be an iteration parameter). See, for
-# example, [this section of the
-# manual](https://juliaai.github.io/MLJ.jl/dev/learning_curves/)
-# or [this
-# tutorial](https://github.com/ablaom/MLJTutorial.jl/blob/dev/notebooks/04_tuning/notebook.ipynb).
-
-# Fine tuning the hyperparameters of a gradient booster can be
-# somewhat involved. Here we settle for simultaneously optimizing two
-# key parameters: `max_depth` and `η` (the learning rate).
-
-# Like iteration control, **model optimization in MLJ is implemented as
-# a model wrapper**, called `TunedModel`. After wrapping a model in a
-# tuning strategy and binding the wrapped model to data in a machine
-# called `mach`, calling `fit!(mach)` instigates a search for optimal
-# model hyperparameters, within a specified range, and then uses all
-# supplied data to train the best model. To predict using that model,
-# one then calls `predict(mach, Xnew)`. In this way the wrapped model
-# may be viewed as a "self-tuning" version of the unwrapped
-# model. That is, wrapping the model simply transforms certain
-# hyperparameters into learned parameters (just as `IteratedModel`
-# does for an iteration parameter).
-
-# To start with, we define ranges for the parameters of
-# interest. Since these parameters are nested, let's display our
-# model to a greater depth:
-
-show(iterated_pipe, 2)
-
-#-
-
-p1 = :(model.evo_tree_classifier.η)
-p2 = :(model.evo_tree_classifier.max_depth)
-
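-# Note that `scale=x->10^x` means `η` ranges over 10^(-2) ≈ 0.01 to
-# 10^(-0.5) ≈ 0.32, sampled on a log scale by the random search below:
-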
-r1 = range(iterated_pipe, p1, lower=-2, upper=-0.5, scale=x->10^x)
-r2 = range(iterated_pipe, p2, lower=2, upper=6)
-
-# Nominal ranges are defined by specifying `values` instead of `lower`
-# and `upper`.
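-
-# For example (purely illustrative; we don't use this range below):
-
-r_nominal = range(iterated_pipe, :(model.feature_selector.ignore),
-                  values=[true, false])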
-
-# Next, we choose an optimization strategy from [this
-# list](https://juliaai.github.io/MLJ.jl/dev/tuning_models/#Tuning-Models):
-
-tuning = RandomSearch(rng=123)
-
-# Then we wrap the model, specifying a `resampling` strategy and a
-# `measure`, as we did for `IteratedModel`. In fact, we can include a
-# battery of `measures`; by default, optimization is with respect to
-# performance estimates based on the first measure, but estimates for
-# all measures can be accessed from the model's `report`.
-
-# The keyword `n` specifies the total number of models (sets of
-# hyperparameters) to evaluate.
-
-tuned_iterated_pipe = TunedModel(model=iterated_pipe,
- range=[r1, r2],
- tuning=tuning,
- measures=[brier_loss, auc, accuracy],
- resampling=StratifiedCV(nfolds=6, rng=123),
- acceleration=CPUThreads(),
- n=40)
-
-# To save time, we skip the `repeats` option here.
-
-# Binding our final model to data and training:
-
-mach_tuned_iterated_pipe = machine(tuned_iterated_pipe, X, y)
-fit!(mach_tuned_iterated_pipe)
-
-# As explained above, the training we have just performed was split
-# internally into two separate steps:
-
-# - A step to determine the parameter values that optimize the aggregated cross-validation scores
-# - A final step that trains the optimal model on *all* available data. Future predictions `predict(mach_tuned_iterated_pipe, ...)` are based on this final training step.
-
-# From `report(mach_tuned_iterated_pipe)` we can extract details about
-# the optimization procedure. For example:
-
-rpt2 = report(mach_tuned_iterated_pipe);
-best_booster = rpt2.best_model.model.evo_tree_classifier
-
-#-
-
-@info "Optimal hyperparameters:" best_booster.max_depth best_booster.η;
-
-# Using the `confidence_intervals` function we defined earlier:
-
-e_best = rpt2.best_history_entry
-confidence_intervals(e_best)
-
-# Digging a little deeper, we can learn what stopping criterion was
-# applied in the case of the optimal model, and how many iterations
-# were required:
-
-rpt2.best_report.controls |> collect
-
-# Finally, we can visualize the optimization results:
-
-plot(mach_tuned_iterated_pipe, size=(600,450))
-
-
-# ## Saving our model
-
-# > Introduces: `MLJ.save`
-
-# Here's how to serialize our final, trained self-iterating,
-# self-tuning pipeline machine:
-
-MLJ.save("tuned_iterated_pipe.jlso", mach_tuned_iterated_pipe)
-
-
-# We'll deserialize this in "Testing the final model" below.
-
-# ## Final performance estimate
-
-# Finally, to get an even more accurate estimate of performance, we
-# can evaluate our model using stratified cross-validation and all the
-# data attached to our machine. Because this evaluation implies
-# [nested
-# resampling](https://mlr.mlr-org.com/articles/tutorial/nested_resampling.html),
-# this computation takes quite a bit longer than the previous one
-# (which is being repeated six times, using 5/6 of the data each
-# time):
-
-e_tuned_iterated_pipe = evaluate(tuned_iterated_pipe, X, y,
- resampling=StratifiedCV(nfolds=6, rng=123),
- measures=[brier_loss, auc, accuracy])
-
-#-
-
-confidence_intervals(e_tuned_iterated_pipe)
-
-# For comparison, here are the confidence intervals for the basic
-# pipeline model (no feature selection and default hyperparameters):
-
-confidence_intervals_basic_model
-
-# As each pair of intervals overlaps, it's doubtful that the small
-# changes here can be assigned statistical significance. The default
-# `booster` hyper-parameters do a pretty good job.
-
-
-# ## Testing the final model
-
-# We now determine the performance of our model on our
-# lock-and-throw-away-the-key holdout set. To demonstrate
-# deserialization, we'll pretend we're in a new Julia session (but
-# have called `import`/`using` on the same packages). Then the
-# following should suffice to recover our model trained under
-# "Hyper-parameter optimization" above:
-
-mach_restored = machine("tuned_iterated_pipe.jlso")
-
-# We compute predictions on the holdout set:
-
-ŷ_tuned = predict(mach_restored, Xtest);
-ŷ_tuned[1]
-
-# And we can compute the final performance measures:
-
-@info("Tuned model measurements on test:",
- brier_loss(ŷ_tuned, ytest) |> mean,
- auc(ŷ_tuned, ytest),
- accuracy(mode.(ŷ_tuned), ytest)
- )
-
-# For comparison, here's the performance of the basic pipeline model:
-
-mach_basic = machine(pipe, X, y)
-fit!(mach_basic, verbosity=0)
-
-ŷ_basic = predict(mach_basic, Xtest);
-
-@info("Basic model measurements on test set:",
- brier_loss(ŷ_basic, ytest) |> mean,
- auc(ŷ_basic, ytest),
- accuracy(mode.(ŷ_basic), ytest)
- )
diff --git a/examples/telco/notebook.pluto.jl b/examples/telco/notebook.pluto.jl
deleted file mode 100644
index 345837476..000000000
--- a/examples/telco/notebook.pluto.jl
+++ /dev/null
@@ -1,1321 +0,0 @@
-### A Pluto.jl notebook ###
-# v0.16.0
-
-using Markdown
-using InteractiveUtils
-
-# ╔═╡ f0cc864c-8b26-441f-9bca-7c69b794f8ce
-md"# MLJ for Data Scientists in Two Hours"
-
-# ╔═╡ 8a6670b8-96a8-4a5d-b795-033f6f2a0674
-md"""
-An application of the [MLJ
-toolbox](https://juliaai.github.io/MLJ.jl/dev/) to the
-Telco Customer Churn dataset, aimed at practicing data scientists
-new to MLJ (Machine Learning in Julia). This tutorial does not
-cover exploratory data analysis.
-"""
-
-# ╔═╡ aa49e638-95dc-4249-935f-ddf6a6bfbbdd
-md"""
-MLJ is a *multi-paradigm* machine learning toolbox (i.e., not just
-deep learning).
-"""
-
-# ╔═╡ b04c4790-59e0-42a3-af2a-25235e544a31
-md"""
-For other MLJ learning resources see the [Learning
-MLJ](https://juliaai.github.io/MLJ.jl/dev/learning_mlj/)
-section of the
-[manual](https://juliaai.github.io/MLJ.jl/dev/).
-"""
-
-# ╔═╡ 4eb8dff4-c23a-4b41-8af5-148d95ea2900
-md"""
-**Topics covered**: Grabbing and preparing a dataset, basic
-fit/predict workflow, constructing a pipeline to include data
-pre-processing, estimating performance metrics, ROC curves, confusion
-matrices, feature importance, basic feature selection, controlling iterative
-models, hyperparameter optimization (tuning).
-"""
-
-# ╔═╡ a583d175-d623-4888-a6bf-47194d7e8e12
-md"""
-**Prerequisites for this tutorial.** Previous experience building,
-evaluating, and optimizing machine learning models using
-scikit-learn, caret, MLR, weka, or a similar tool. No previous
-experience with MLJ. Only fairly basic familiarity with Julia is
-required. Uses
-[DataFrames.jl](https://dataframes.juliadata.org/stable/) but in a
-minimal way ([this
-cheatsheet](https://ahsmart.com/pub/data-wrangling-with-data-frames-jl-cheat-sheet/index.html)
-may help).
-"""
-
-# ╔═╡ 4830fc64-3d70-4869-828a-4cc485149963
-md"**Time.** Between two and three hours, first time through."
-
-# ╔═╡ 28197138-d6b7-433c-9e54-3f7b3ca87ecb
-md"## Summary of methods and types introduced"
-
-# ╔═╡ aad8dc13-c0f9-4090-ba1f-6363f43ec697
-md"""
-|code | purpose|
-|:-------|:-------------------------------------------------------|
-| `OpenML.load(id)` | grab a dataset from [OpenML.org](https://www.openml.org)|
-| `scitype(X)` | inspect the scientific type (scitype) of object `X`|
-| `schema(X)` | inspect the column scitypes (scientific types) of a table `X`|
-| `coerce(X, ...)` | fix column encodings to get appropriate scitypes|
-| `partition(data, frac1, frac2, ...; rng=...)` | vertically split `data`, which can be a table, vector or matrix|
-| `unpack(table, f1, f2, ...)` | horizontally split `table` based on conditions `f1`, `f2`, ..., applied to column names|
-| `@load ModelType pkg=...` | load code defining a model type|
-| `input_scitype(model)` | inspect the scitype that a model requires for features (inputs)|
-| `target_scitype(model)`| inspect the scitype that a model requires for the target (labels)|
-| `ContinuousEncoder` | built-in model type for re-encoding all features as `Continuous`|
-| `model1 ∣> model2 ∣> ...` | combine multiple models into a pipeline|
-| `measures("under curve")` | list all measures (metrics) with string "under curve" in documentation|
-| `accuracy(yhat, y)` | compute accuracy of predictions `yhat` against ground truth observations `y`|
-| `auc(yhat, y)`, `brier_loss(yhat, y)` | evaluate two probabilistic measures (`yhat` a vector of probability distributions)|
-| `machine(model, X, y)` | bind `model` to training data `X` (features) and `y` (target)|
-| `fit!(mach, rows=...)` | train machine using specified rows (observation indices)|
-| `predict(mach, rows=...)` | make in-sample model predictions given specified rows|
-| `predict(mach, Xnew)` | make predictions given new features `Xnew`|
-| `fitted_params(mach)` | inspect learned parameters|
-| `report(mach)` | inspect other outcomes of training|
-| `confmat(yhat, y)` | confusion matrix for predictions `yhat` and ground truth `y`|
-| `roc(yhat, y)` | compute points on the receiver-operator characteristic|
-| `StratifiedCV(nfolds=6)` | 6-fold stratified cross-validation resampling strategy|
-| `Holdout(fraction_train=0.7)` | holdout resampling strategy|
-| `evaluate(model, X, y; resampling=..., options...)` | estimate performance metrics for `model` using the data `X`, `y`|
-| `FeatureSelector()` | transformer for selecting features|
-| `Step(3)` | iteration control for stepping 3 iterations|
-| `NumberSinceBest(6)`, `TimeLimit(60/5)`, `InvalidValue()` | iteration control stopping criteria|
-| `IteratedModel(model=..., controls=..., options...)` | wrap an iterative `model` in control strategies|
-| `range(model, :some_hyperparam, lower=..., upper=...)` | define a numeric range|
-| `RandomSearch()` | random search tuning strategy|
-| `TunedModel(model=..., tuning=..., options...)` | wrap the supervised `model` in specified `tuning` strategy|
-"""
-
-# ╔═╡ 7c0464a0-4114-46bf-95ea-2955abd45275
-md"## Instantiate a Julia environment"
-
-# ╔═╡ 256869d0-5e0c-42af-b1b3-dd47e367ba54
-md"""
-The following code replicates precisely the set of Julia packages
-used to develop this tutorial. If this is your first time running
-the notebook, package instantiation and pre-compilation may take a
-minute or so to complete. **This step will fail** if the [correct
-Manifest.toml and Project.toml
-files](https://github.com/JuliaAI/MLJ.jl/tree/dev/examples/telco)
-are not in the same directory as this notebook.
-"""
-
-# ╔═╡ 60fe49c1-3434-4a77-8d7e-8e449afd1c48
-begin
- using Pkg
- Pkg.activate(@__DIR__) # get env from TOML files in same directory as this notebook
- Pkg.instantiate()
-end
-
-# ╔═╡ e8c13e9d-7910-4a0e-a949-7f3bd292fe31
-md"## Warm up: Building a model for the iris dataset"
-
-# ╔═╡ 6bf6ef98-302f-478c-8514-99938a2932db
-md"""
-Before turning to the Telco Customer Churn dataset, we very quickly
-build a predictive model for Fisher's well-known iris data set, as a way of
-introducing the main actors in any MLJ workflow. Details that you
-don't fully grasp should become clearer in the Telco study.
-"""
-
-# ╔═╡ 33ca287e-8cba-47d1-a0de-1721c1bc2df2
-md"""
-This section is a condensed adaptation of the [Getting Started
-example](https://juliaai.github.io/MLJ.jl/dev/getting_started/#Fit-and-predict)
-in the MLJ documentation.
-"""
-
-# ╔═╡ d5b8bf1d-6c9a-46c1-bca8-7eec7950fd82
-md"""
-First, using the built-in iris dataset, we load and inspect the features
-`X_iris` (a table) and target variable `y_iris` (a vector):
-"""
-
-# ╔═╡ 9b7b9ade-9318-4b95-9873-1b1430e635cc
-using MLJ
-
-# ╔═╡ 4f3f061d-259d-4479-b43e-d0ebe87da176
-begin
- const X_iris, y_iris = @load_iris;
- schema(X_iris)
-end
-
-# ╔═╡ a546f372-4ae0-48c2-9009-4cfb2012998f
-y_iris[1:4]
-
-# ╔═╡ f8f79dd2-b6de-48e8-abd3-cb77d7a79683
-levels(y_iris)
-
-# ╔═╡ 617d63f6-8a62-40db-879e-dc128f3db7b3
-md"We load a decision tree model, from the package DecisionTree.jl:"
-
-# ╔═╡ ceda7ed0-3b98-44b3-a367-9aa3c6cf34d0
-begin
- DecisionTree = @load DecisionTreeClassifier pkg=DecisionTree # model type
- model = DecisionTree(min_samples_split=5) # model instance
-end
-
-# ╔═╡ e69cd764-e4b8-4ff8-bf32-94657e65284e
-md"""
-In MLJ, a *model* is just a container for hyperparameters of
-some learning algorithm. It does not store learned parameters.
-"""
-
-# ╔═╡ 4a175e0f-4b87-4b53-9afd-65a5b5facac9
-md"""
-Next, we bind the model together with the available data in what's
-called a *machine*:
-"""
-
-# ╔═╡ 3c23e9f8-49bd-4cde-b6c8-54f5ed90a964
-mach = machine(model, X_iris, y_iris)
-
-# ╔═╡ 6ad971c7-9fe5-45a9-9292-fe822525fc77
-md"""
-A machine is essentially just a model (ie, hyperparameters) plus data, but
-it additionally stores *learned parameters* (the tree) once it is
-trained on some view of the data:
-"""
-
-# ╔═╡ 63151d31-8428-4ebd-ae5e-4b83dcbc9675
-begin
- train_rows = vcat(1:60, 91:150); # some row indices (observations are rows not columns)
- fit!(mach, rows=train_rows)
- fitted_params(mach)
-end
-
-# ╔═╡ 0f978839-cc95-4c3a-8a29-32f11452654a
-md"""
-A machine stores some other information enabling [warm
-restart](https://juliaai.github.io/MLJ.jl/dev/machines/#Warm-restarts)
-for some models, but we won't go into that here. You are allowed to
-access and mutate the `model` parameter:
-"""
-
-# ╔═╡ 5edc98cd-df7d-428a-a5f1-151fcbe229a2
-begin
- mach.model.min_samples_split = 10
- fit!(mach, rows=train_rows) # re-train with new hyperparameter
-end
-
-# ╔═╡ 1848109f-c94c-4cdd-81bc-2d5603785a09
-md"Now we can make predictions on some other view of the data, as in"
-
-# ╔═╡ eee948b7-2b42-439e-9d87-1d1fbb0e3997
-predict(mach, rows=71:73)
-
-# ╔═╡ 7c98e10b-1fa3-4ab6-b950-fddef2a1fb10
-md"or on completely new data, as in"
-
-# ╔═╡ 413d9c3c-a68e-4bf0-951b-64cb2a36c8e3
-begin
- Xnew = (sepal_length = [5.1, 6.3],
- sepal_width = [3.0, 2.5],
- petal_length = [1.4, 4.9],
- petal_width = [0.3, 1.5])
- yhat = predict(mach, Xnew)
-end
-
-# ╔═╡ eb79663c-c671-4c4c-b0e6-441461cc8770
-md"""
-These are probabilistic predictions which can be manipulated using a
-widely adopted interface defined in the Distributions.jl
-package. For example, we can get raw probabilities like this:
-"""
-
-# ╔═╡ ae9e9377-552c-4186-8cb0-de601961bc02
-pdf.(yhat, "virginica")
-
-# ╔═╡ 5c70ee06-edb9-4789-a87d-950c50fb2955
-md"We now turn to the Telco dataset."
-
-# ╔═╡ bde7bc43-a4bb-4a42-8448-4bcb089096cd
-md"## Getting the Telco data"
-
-# ╔═╡ d5a9e9b8-c67e-4bdd-a012-a84240254fb6
-import DataFrames
-
-# ╔═╡ 8de2c7b4-1950-4f05-bbdd-9799f7bb2e60
-begin
- data = OpenML.load(42178) # data set from OpenML.org
- df0 = DataFrames.DataFrame(data)
- first(df0, 4)
-end
-
-# ╔═╡ 768f369a-5f3d-4dcc-97a7-88e3af4f10ee
-md"""
-The object of this tutorial is to build and evaluate supervised
-learning models to predict the `:Churn` variable, a binary variable
-measuring customer retention, based on the other relevant
-variables.
-"""
-
-# ╔═╡ bc4c0eb9-3b5f-49fc-b372-8c9866e51852
-md"""
-In the table, observations correspond to rows, and features to
-columns, which is the convention for representing all
-two-dimensional data in MLJ.
-"""
-
-# ╔═╡ f04becbb-42a6-409f-8f3d-b82f1e7b6f7a
-md"## Type coercion"
-
-# ╔═╡ 1d08ed75-5377-4971-ab04-579e5608ae53
-md"> Introduces: `scitype`, `schema`, `coerce`"
-
-# ╔═╡ eea585d4-d55b-4ccc-86cf-35420d9e6995
-md"""
-A ["scientific
-type"](https://juliaai.github.io/ScientificTypes.jl/dev/) or
-*scitype* indicates how MLJ will *interpret* data. For example,
-`typeof(3.14) == Float64`, while `scitype(3.14) == Continuous` and
-also `scitype(3.14f0) == Continuous`. In MLJ, model data
-requirements are articulated using scitypes.
-"""
-
-# ╔═╡ ca482134-299b-459a-a29a-8d0445351914
-md"Here are common \"scalar\" scitypes:"
-
-# ╔═╡ 5c7c97e4-fd8b-4ae5-be65-28d0fcca50a8
-md""
-
-# ╔═╡ fd103be7-6fc9-43bc-9a2e-c468345d87d1
-md"""
-There are also container scitypes. For example, the scitype of any
-`N`-dimensional array is `AbstractArray{S, N}`, where `S` is the scitype of the
-elements:
-"""
-
-# ╔═╡ 9d2f0b19-2942-47ac-b5fa-cb79ebf595ef
-scitype(["cat", "mouse", "dog"])
-
-# ╔═╡ 1d18aca8-11f4-4104-91c5-524da38aa391
-md"The `schema` operator summarizes the column scitypes of a table:"
-
-# ╔═╡ accd3a09-5371-45ab-ad90-b4405b216771
-schema(df0) |> DataFrames.DataFrame # converted to DataFrame for better display
-
-# ╔═╡ bde0406f-3964-42b8-895b-997a92b731e0
-md"""
-All of the fields being interpreted as `Textual` are really
-something else, either `Multiclass` or, in the case of
-`:TotalCharges`, `Continuous`. In fact, `:TotalCharges` is
-mostly floats wrapped as strings. However, it needs special
-treatment because some elements consist of a single space, " ",
-which we'll treat as "0.0".
-"""
-
-# ╔═╡ 25e60566-fc64-4f35-a525-e9b04a4bd246
-begin
- fix_blanks(v) = map(v) do x
- if x == " "
- return "0.0"
- else
- return x
- end
- end
-
- df0.TotalCharges = fix_blanks(df0.TotalCharges);
-end
-
-# ╔═╡ 441317a8-3b72-4bf1-80f0-ba7781e173cf
-md"Coercing the `:TotalCharges` type to ensure a `Continuous` scitype:"
-
-# ╔═╡ d2f2b67e-d06e-4c0b-9cbc-9a2c39793333
-coerce!(df0, :TotalCharges => Continuous);
-
-# ╔═╡ 7c93b820-cbce-48be-b887-2b16710e5502
-md"Coercing all remaining `Textual` data to `Multiclass`:"
-
-# ╔═╡ 5ac4a13e-9983-41dd-9452-5298a8a4a401
-coerce!(df0, Textual => Multiclass);
-
-# ╔═╡ 9f28e6c3-5f18-4fb2-b017-96f1602f2cad
-md"""
-Finally, we'll coerce our target variable `:Churn` to be
-`OrderedFactor`, rather than `Multiclass`, to enable a reliable
-interpretation of metrics like "true positive rate". By convention,
-the first class is the negative one:
-"""
-
-# ╔═╡ c9506e5b-c111-442c-8be2-2a4017c45345
-begin
- coerce!(df0, :Churn => OrderedFactor)
- levels(df0.Churn) # to check order
-end
-
-# ╔═╡ 8c865df5-ac95-438f-a7ad-80d84f5b0070
-md"Re-inspecting the scitypes:"
-
-# ╔═╡ 22690828-0992-4ed0-8378-5aa686f0b407
-schema(df0) |> DataFrames.DataFrame
-
-# ╔═╡ 3bd753fc-163a-4b50-9f43-49763e8691a1
-md"## Preparing a holdout set for final testing"
-
-# ╔═╡ 7b12673f-cc63-4cd9-bb0e-3a45761c733a
-md"> Introduces: `partition`"
-
-# ╔═╡ 3c98ab4e-47cd-4247-9979-7044b2f2df33
-md"""
-To reduce training times for the purposes of this tutorial, we're
-going to dump 90% of observations (after shuffling) and split off
-30% of the remainder for use as a lock-and-throw-away-the-key
-holdout set:
-"""
-
-# ╔═╡ ea010966-bb7d-4c79-b2a1-fbbde543f620
-df, df_test, df_dumped = partition(df0, 0.07, 0.03, # in ratios 7:3:90
- stratify=df0.Churn,
- rng=123);
-
-# ╔═╡ ecb7c90a-4c93-483c-910b-f8a9a217eff2
-md"""
-The reader interested in including all data can instead do
-`df, df_test = partition(df0, 0.7, stratify=df0.Churn, rng=123)`.
-"""
-
-# ╔═╡ 81aa1aa9-27d9-423f-aa36-d2e8546da46a
-md"## Splitting data into target and features"
-
-# ╔═╡ fff47e76-d7ac-4377-88a1-a2bf91434413
-md"> Introduces: `unpack`"
-
-# ╔═╡ f1dbbc22-9844-4e1c-a1cb-54e34396a855
-md"""
-In the following call, the column with name `:Churn` is copied over
-to a vector `y`, and every remaining column, except `:customerID`
-(which contains no useful information) goes into a table `X`. Here
-`:Churn` is the target variable for which we seek predictions, given
-new versions of the features `X`.
-"""
-
-# ╔═╡ 20713fb3-3a3b-42e7-8037-0de5806e1a54
-begin
- const y, X = unpack(df, ==(:Churn), !=(:customerID));
- schema(X).names
-end
-
-# ╔═╡ 369cb359-8512-4f80-9961-a09632c33fb0
-intersect([:Churn, :customerID], schema(X).names)
-
-# ╔═╡ af484868-b29c-4808-b7cc-46f4ef988c68
-md"We'll do the same for the holdout data:"
-
-# ╔═╡ 8f3eda66-6183-4fe4-90fb-7f2721f6fcc7
-const ytest, Xtest = unpack(df_test, ==(:Churn), !=(:customerID));
-
-# ╔═╡ 598142b5-0828-49ac-af65-ccd25ecb9818
-md"## Loading a model and checking type requirements"
-
-# ╔═╡ c5309d64-7409-4148-8890-979691212d9b
-md"> Introduces: `@load`, `input_scitype`, `target_scitype`"
-
-# ╔═╡ f97969e2-c15c-42cf-a6fa-eaf14df5d44b
-md"""
-For tools helping us to identify suitable models, see the [Model
-Search](https://juliaai.github.io/MLJ.jl/dev/model_search/#model_search)
-section of the manual. We will build a gradient tree-boosting model,
-a popular first choice for structured data like we have here. Model
-code is contained in a third-party package called
-[EvoTrees.jl](https://github.com/Evovest/EvoTrees.jl) which is
-loaded as follows:
-"""
-
-# ╔═╡ edf1d726-542a-4fc1-8025-5084804a9f6c
-Booster = @load EvoTreeClassifier pkg=EvoTrees
-
-# ╔═╡ ee41a121-df0d-40ad-9e90-09023d201062
-md"""
-Recall that a *model* is just a container for some algorithm's
-hyperparameters. Let's create a `Booster` with default values for
-the hyperparameters:
-"""
-
-# ╔═╡ be6cef47-d585-45fe-b7bb-2f33ef765cc0
-booster = Booster()
-
-# ╔═╡ 452edd51-b878-46d0-9625-01172c4a0081
-md"""
-This model is appropriate for the kind of target variable we have because of
-the following passing test:
-"""
-
-# ╔═╡ fdd20843-c981-41ad-af4f-11c7de9e21dd
-scitype(y) <: target_scitype(booster)
-
-# ╔═╡ 2b25f9cb-d12d-4a6f-8dba-241d9b744683
-md"However, our features `X` cannot be directly used with `booster`:"
-
-# ╔═╡ b139c43b-382b-415a-a6e5-2f0b4dca5c59
-scitype(X) <: input_scitype(booster)
-
-# ╔═╡ 90a7fe9a-934f-445a-8550-20498aa03ed0
-md"""
-As it turns out, this is because `booster`, like the majority of MLJ
-supervised models, expects the features to be `Continuous`. (With
-some experience, this can be gleaned from `input_scitype(booster)`.)
-So we need categorical feature encoding, discussed next.
-"""
-
-# ╔═╡ fc97e937-66ca-4434-9e7b-5b943cf6b560
-md"## Building a model pipeline to incorporate feature encoding"
-
-# ╔═╡ b1d95557-51fc-4f1e-bce6-0a1379cc1259
-md"> Introduces: `ContinuousEncoder`, pipeline operator `|>`"
-
-# ╔═╡ 38961daa-dd5f-4f97-9608-b5032c116833
-md"""
-The built-in `ContinuousEncoder` model transforms an arbitrary table
-to a table whose features are all `Continuous` (dropping any fields
-it does not know how to encode). In particular, all `Multiclass`
-features are one-hot encoded.
-"""
-
-# ╔═╡ ccd3ffd0-ed41-497d-b473-4aa968e6937a
-md"""
-A *pipeline* is a stand-alone model that internally combines one or
-more models in a linear (non-branching) pipeline. Here's a pipeline
-that adds the `ContinuousEncoder` as a pre-processor to the
-gradient tree-boosting model above:
-"""
-
-# ╔═╡ 35cf2011-4fb6-4e7a-8d9e-644a1b3cc6b6
-pipe = ContinuousEncoder() |> booster
-
-# ╔═╡ 4b6e3239-fd47-495a-ac0a-23ded81445da
-md"""
-Note that the component models appear as hyperparameters of
-`pipe`. Pipelines are an implementation of a more general [model
-composition](https://juliaai.github.io/MLJ.jl/dev/composing_models/#Composing-Models)
-interface provided by MLJ that advanced users may want to learn about.
-"""
-
-# ╔═╡ ce9a3479-601a-472a-8535-1a088a6a3228
-md"""
-From the above display, we see that component model hyperparameters
-are now *nested*, but they are still accessible (important in hyperparameter
-optimization):
-"""
-
-# ╔═╡ 45c2d8f9-cbf9-4cda-a39f-7893c73eef39
-pipe.evo_tree_classifier.max_depth
-
-# ╔═╡ 232b7775-1f79-43f7-bcca-51a27994a151
-md"## Evaluating the pipeline model's performance"
-
-# ╔═╡ ac84c13f-391f-405a-9b33-09b4b6661170
-md"""
-> Introduces: `measures` (function), **measures:** `brier_loss`, `auc`, `accuracy`;
-> `machine`, `fit!`, `predict`, `fitted_params`, `report`, `roc`, **resampling strategy** `StratifiedCV`, `evaluate`, `FeatureSelector`
-"""
-
-# ╔═╡ 442078e4-a695-471c-b45d-88d068bb0fa2
-md"""
-Without touching our test set `Xtest`, `ytest`, we will estimate the
-performance of our pipeline model, with default hyperparameters, in
-two different ways:
-"""
-
-# ╔═╡ 23fb37b0-08c1-4688-92c8-6db3a590d963
-md"""
-**Evaluating by hand.** First, we'll do this "by hand" using the `fit!` and `predict`
-workflow illustrated for the iris data set above, using a
-holdout resampling strategy. At the same time we'll see how to
-generate a **confusion matrix**, **ROC curve**, and inspect
-**feature importances**.
-"""
-
-# ╔═╡ 1def55f5-71fc-4257-abf5-96f457eb2bdf
-md"""
-**Automated performance evaluation.** Next we'll apply the more
-typical and convenient `evaluate` workflow, but using `StratifiedCV`
-(stratified cross-validation) which is more informative.
-"""
-
-# ╔═╡ 2c4016c6-dc7e-44ec-8a60-2e0214c059f5
-md"""
-In any case, we need to choose some measures (metrics) to quantify
-the performance of our model. For a complete list of measures, run
-`measures()`. Alternatively, we can search for relevant ones:
-"""
-
-# ╔═╡ 236c098c-0189-4a96-a38a-beb5c7157b57
-measures("Brier")
-
-# ╔═╡ 9ffe1654-7e32-4c4b-81f5-507803ea9ed6
-md"""
-We will be primarily using `brier_loss`, but also `auc` (area under
-the ROC curve) and `accuracy`.
-"""
-
-# ╔═╡ 48413a7b-d01b-4005-9b21-6d4eb642d87e
-md"### Evaluating by hand (with a holdout set)"
-
-# ╔═╡ 84d7bcbc-df3c-4602-b98c-3df1f31879bf
-md"""
-Our pipeline model can be trained just like the decision tree model
-we built for the iris data set. Binding all non-test data to the
-pipeline model:
-"""
-
-# ╔═╡ f8552404-f95f-4484-92b7-4ceba56e97ad
-mach_pipe = machine(pipe, X, y)
-
-# ╔═╡ e0b1bffb-dbd5-4b0a-b122-214de244406c
-md"""
-We already encountered the `partition` method above. Here we apply
-it to row indices, instead of data containers, as `fit!` and
-`predict` only need a *view* of the data to work.
-"""
-
-# ╔═╡ d8ad2c21-fd27-44fb-8a4b-c83694978e38
-begin
- train, validation = partition(1:length(y), 0.7)
- fit!(mach_pipe, rows=train)
-end
-
-# ╔═╡ ceff13ba-a526-4ecb-a8b6-f7ff516dedc4
-md"We note in passing that we can access two kinds of information from a trained machine:"
-
-# ╔═╡ dcf0eb93-2368-4158-81e1-74ef03c2e79e
-md"""
-- The **learned parameters** (eg, coefficients of a linear model): We use `fitted_params(mach_pipe)`
-- Other **by-products of training** (eg, feature importances): We use `report(mach_pipe)`
-"""
-
-# ╔═╡ 442be3e3-b2e9-499d-a04e-0b76409c14a7
-begin
- fp = fitted_params(mach_pipe);
- keys(fp)
-end
-
-# ╔═╡ f03fdd56-6c30-4d73-b979-3458f2f26667
-md"For example, we can check that the encoder did not actually drop any features:"
-
-# ╔═╡ 3afbe65c-18bf-4576-97e3-6677afc6ca9f
-Set(fp.continuous_encoder.features_to_keep) == Set(schema(X).names)
-
-# ╔═╡ 527b193b-737d-4701-b10e-562ce21caa04
-md"And, from the report, extract feature importances:"
-
-# ╔═╡ b332842d-6397-45c6-8f78-ab549ef1544e
-begin
- rpt = report(mach_pipe)
- keys(rpt.evo_tree_classifier)
-end
-
-# ╔═╡ 3035e6d0-3801-424d-a8a4-a53f5149481e
-begin
- fi = rpt.evo_tree_classifier.feature_importances
- feature_importance_table =
- (feature=Symbol.(first.(fi)), importance=last.(fi)) |> DataFrames.DataFrame
-end
-
-# ╔═╡ 8b67dcac-6a86-4502-8704-dbab8e09b4f1
-md"""
-For models not reporting feature importances, we recommend the
-[Shapley.jl](https://expandingman.gitlab.io/Shapley.jl/) package.
-"""
-
-# ╔═╡ 9cd42f39-942e-4011-a02f-27b6c05e4d02
-md"Returning to predictions and evaluations of our measures:"
-
-# ╔═╡ 0fecce89-102f-416f-bea4-3d467d48781c
-begin
- ŷ = predict(mach_pipe, rows=validation);
- @info("Measurements",
- brier_loss(ŷ, y[validation]) |> mean,
- auc(ŷ, y[validation]),
- accuracy(mode.(ŷ), y[validation])
- )
-end
-
-# ╔═╡ 0b11a81e-42a7-4519-97c5-2979af8a507d
-md"""
-Note that we need `mode` in the last case because `accuracy` expects
-point predictions, not probabilistic ones. (One can alternatively
-use `predict_mode` to generate the predictions.)
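-
-For example, assuming `mach_pipe` and `validation` are as above, the
-following sketch computes the same accuracy without broadcasting `mode`:
-
-```julia
-ŷ_point = predict_mode(mach_pipe, rows=validation)  # point predictions directly
-accuracy(ŷ_point, y[validation])
-```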
-"""
-
-# ╔═╡ e6f6fc92-f9e7-4472-b639-63c76c72c513
-md"""
-While we're here, let's also generate a **confusion matrix** and a
-[receiver operating
-characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
-(ROC) curve:
-"""
-
-# ╔═╡ f3933688-1ea4-4239-8f5a-e1ee9eb5c15c
-confmat(mode.(ŷ), y[validation])
-
-# ╔═╡ ce90835f-6709-4c34-adc5-e671db8bca5d
-md"""
-Note: Importing the plotting package and calling the plotting
-functions for the first time can take a minute or so.
-"""
-
-# ╔═╡ 15996832-b736-45f4-86f1-0ea38de21abb
-using Plots
-
-# ╔═╡ 3c4984d2-9e1a-450b-a55c-0cb44ab816d7
-begin
- roc_curve = roc(ŷ, y[validation])
- plt = scatter(roc_curve, legend=false)
- plot!(plt, xlab="false positive rate", ylab="true positive rate")
- plot!([0, 1], [0, 1], linewidth=2, linestyle=:dash, color=:black)
-end
-
-# ╔═╡ 27221b90-8506-4857-be86-92ecfd0d2343
-md"### Automated performance evaluation (more typical workflow)"
-
-# ╔═╡ 445146e5-e45e-450d-9cf1-facf39e3f302
-md"""
-We can also get performance estimates with a single call to the
-`evaluate` function, which also allows for more complicated
-resampling - in this case stratified cross-validation. To make this
-more comprehensive, we set `repeats=3` below to make our
-cross-validation "Monte Carlo" (3 random size-6 partitions of the
-observation space, for a total of 18 folds) and set
-`acceleration=CPUThreads()` to parallelize the computation.
-"""
-
-# ╔═╡ 562887bb-b7fb-430f-b61c-748aec38e674
-md"""
-We choose a `StratifiedCV` resampling strategy; the complete list of options is
-[here](https://juliaai.github.io/MLJ.jl/dev/evaluating_model_performance/#Built-in-resampling-strategies).
-"""
-
-# ╔═╡ f9be989e-2604-44c2-9727-ed822e4fd85d
-e_pipe = evaluate(pipe, X, y,
- resampling=StratifiedCV(nfolds=6, rng=123),
- measures=[brier_loss, auc, accuracy],
- repeats=3,
- acceleration=CPUThreads())
-
-# ╔═╡ ff7cfc36-b9fc-4570-b2f2-e08965e5be66
-md"""
-(There is also a version of `evaluate` for machines. Query the
-`evaluate` and `evaluate!` doc-strings to learn more about these
-functions and what the `PerformanceEvaluation` object `e_pipe` records.)
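-
-For example, here's a minimal sketch of the machine variant (note that
-`evaluate!` retrains the machine passed to it):
-
-```julia
-evaluate!(mach_pipe,
-          resampling=StratifiedCV(nfolds=6, rng=123),
-          measures=[brier_loss, auc, accuracy])
-```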
-"""
-
-# ╔═╡ 45a4d300-a392-485c-897c-f79712f9ec7c
-md"""
-While [less than ideal](https://arxiv.org/abs/2104.00673), let's
-adopt the common practice of using the standard error of a
-cross-validation score as an estimate of the uncertainty of a
-performance measure's expected value. Here's a utility function to
-calculate 95% confidence intervals for our performance estimates based
-on this practice, and its application to the current evaluation:
-"""
-
-# ╔═╡ 0f76a79f-8675-4ec1-a543-f9324a87efad
-using Measurements
-
-# ╔═╡ fc641df4-693c-4007-8657-9fba0caf3cb7
-begin
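-    # Return a table of measure ± 2*(standard error over folds), one row
-    # per measure in the `PerformanceEvaluation` object `e`: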
- function confidence_intervals(e)
- factor = 2.0 # to get level of 95%
- measure = e.measure
- nfolds = length(e.per_fold[1])
- measurement = [e.measurement[j] ± factor*std(e.per_fold[j])/sqrt(nfolds - 1)
- for j in eachindex(measure)]
- table = (measure=measure, measurement=measurement)
- return DataFrames.DataFrame(table)
- end
-
- const confidence_intervals_basic_model = confidence_intervals(e_pipe)
-end
-
-# ╔═╡ c3f71e42-8bbe-47fe-a217-e58f442fc85c
-md"## Filtering out unimportant features"
-
-# ╔═╡ db354064-c2dd-4e6a-b8ad-0340f15a03ba
-md"> Introduces: `FeatureSelector`"
-
-# ╔═╡ 3bbb26ed-7d1e-46ac-946d-b124a8db5f7c
-md"""
-Before continuing, we'll modify our pipeline to drop those features
-with low feature importance, to speed up later optimization:
-"""
-
-# ╔═╡ cdfe840d-4e87-467f-b582-dfcbeb05bcc5
-begin
- unimportant_features = filter(:importance => <(0.005), feature_importance_table).feature
-
- pipe2 = ContinuousEncoder() |>
- FeatureSelector(features=unimportant_features, ignore=true) |> booster
-end
-
-# ╔═╡ de589473-4c37-4f70-9143-546b2286a5fe
-md"## Wrapping our iterative model in control strategies"
-
-# ╔═╡ 0b80d69c-b60c-4cf4-a7d8-c3b94fb18495
-md"> Introduces: **control strategies:** `Step`, `NumberSinceBest`, `TimeLimit`, `InvalidValue`, **model wrapper** `IteratedModel`, **resampling strategy:** `Holdout`"
-
-# ╔═╡ 19e7e4c9-95c0-49d6-8396-93fa872d2512
-md"""
-We want to optimize the hyperparameters of our model. Since our
-model is iterative, these parameters include the (nested) iteration
-parameter `pipe.evo_tree_classifier.nrounds`. Sometimes this
-parameter is optimized first, fixed, and then maybe optimized again
-after the other parameters. Here we take a more principled approach,
-**wrapping our model in a control strategy** that makes it
-"self-iterating". The strategy applies a stopping criterion to
-*out-of-sample* estimates of the model performance, computed
-using an internally constructed holdout set. In this way, we avoid
-some data hygiene issues, and, when we subsequently optimize other
-parameters, we will always be using an optimal number of
-iterations.
-"""
-
-# ╔═╡ 6ff08b40-906f-4154-a4ac-2ddb495858ce
-md"""
-Note that this approach can be applied to any iterative MLJ model,
-eg, the neural network models provided by
-[MLJFlux.jl](https://github.com/FluxML/MLJFlux.jl).
-"""
-
-# ╔═╡ 8fc99d35-d8cc-455f-806e-1bc580dc349d
-md"""
-First, we select appropriate controls from [this
-list](https://juliaai.github.io/MLJ.jl/dev/controlling_iterative_models/#Controls-provided):
-"""
-
-# ╔═╡ 29f33708-4a82-4acc-9703-288eae064e2a
-controls = [
-    Step(1),              # to increment the iteration parameter (`evo_tree_classifier.nrounds`)
- NumberSinceBest(4), # main stopping criterion
- TimeLimit(2/3600), # never train more than 2 sec
- InvalidValue() # stop if NaN or ±Inf encountered
-]
-
-# ╔═╡ 9f80b5b5-95b9-4f01-b2c3-413ae5867f7d
-md"""
-Now we wrap our pipeline model using the `IteratedModel` wrapper,
-being sure to specify the `measure` on which internal estimates of
-the out-of-sample performance will be based:
-"""
-
-# ╔═╡ fd2e2ee4-c256-4fde-93d8-53c527b0a48c
-iterated_pipe = IteratedModel(model=pipe2,
- controls=controls,
- measure=brier_loss,
- resampling=Holdout(fraction_train=0.7))
-
-# ╔═╡ bb7a34eb-4bf1-41d9-af98-8dffdf3118fc
-md"""
-We've set `resampling=Holdout(fraction_train=0.7)` to arrange that
-data attached to our model should be internally split into a train
-set (70%) and a holdout set (30%) for determining the out-of-sample
-estimate of the Brier loss.
-"""
-
-# ╔═╡ 71bead68-2d38-468c-8620-127a8c4021ec
-md"""
-For demonstration purposes, let's bind `iterated_pipe` to all data
-not in our don't-touch holdout set, and train on all of that data:
-"""
-
-# ╔═╡ f245005b-ca1e-4c38-a1f6-fec4c3edfa7b
-begin
- mach_iterated_pipe = machine(iterated_pipe, X, y)
- fit!(mach_iterated_pipe);
-end
-
-# ╔═╡ 16d8cd4e-ca20-475a-82f7-021485ee0115
-md"To recap, internally this training is split into two separate steps:"
-
-# ╔═╡ ab59d744-51f8-4364-9ecb-94593d972599
-md"""
-- A controlled iteration step, training on the internal 70% train set, with the total number of iterations determined by the specified stopping criteria (based on out-of-sample performance estimates computed on the internal 30% holdout set)
-- A final step that trains the atomic model on *all* available
- data using the number of iterations determined in the first step. Calling `predict` on `mach_iterated_pipe` means using the learned parameters of the second step.
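-
-For example, to inspect the control logs from the first step, we can
-collect the `controls` entry of the machine's report:
-
-```julia
-report(mach_iterated_pipe).controls |> collect
-```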
-"""
-
-# ╔═╡ ed9f284e-1e45-46e7-b54b-e44bea97c579
-md"## Hyperparameter optimization (model tuning)"
-
-# ╔═╡ b8b0e5ee-8468-4e02-9121-4e392242994e
-md"> Introduces: `range`, **model wrapper** `TunedModel`, `RandomSearch`"
-
-# ╔═╡ 50372dc8-bb10-4f9e-b221-79886442efe7
-md"""
-We now turn to hyperparameter optimization. A tool not discussed
-here is the `learning_curve` function, which can be useful when
-wanting to visualize the effect of changes to a *single*
-hyperparameter (which could be an iteration parameter). See, for
-example, [this section of the
-manual](https://juliaai.github.io/MLJ.jl/dev/learning_curves/)
-or [this
-tutorial](https://github.com/ablaom/MLJTutorial.jl/blob/dev/notebooks/04_tuning/notebook.ipynb).
-"""
-
-# ╔═╡ 3e385eb4-0a44-40f6-8df5-b7371beb6b3f
-md"""
-Fine tuning the hyperparameters of a gradient booster can be
-somewhat involved. Here we settle for simultaneously optimizing two
-key parameters: `max_depth` and `η` (the learning rate).
-"""
-
-# ╔═╡ caa5153f-6633-41f7-a475-d52dc8eba727
-md"""
-Like iteration control, **model optimization in MLJ is implemented as
-a model wrapper**, called `TunedModel`. After wrapping a model in a
-tuning strategy and binding the wrapped model to data in a machine
-called `mach`, calling `fit!(mach)` instigates a search for optimal
-model hyperparameters, within a specified range, and then uses all
-supplied data to train the best model. To predict using that model,
-one then calls `predict(mach, Xnew)`. In this way the wrapped model
-may be viewed as a "self-tuning" version of the unwrapped
-model. That is, wrapping the model simply transforms certain
-hyperparameters into learned parameters (just as `IteratedModel`
-does for an iteration parameter).
-"""
-
-# ╔═╡ 0b74bfe8-bff6-469d-804b-89f5009710b0
-md"""
-To start with, we define ranges for the parameters of
-interest. Since these parameters are nested, let's display our
-model to a greater depth to see their full names:
-"""
-
-# ╔═╡ b17f6f89-2fd4-46de-a14a-b0adc2955e1c
-show(iterated_pipe, 2)
-
-# ╔═╡ db7a8b4d-611b-4e5e-bd20-120a7a4020d0
-begin
- p1 = :(model.evo_tree_classifier.η)
- p2 = :(model.evo_tree_classifier.max_depth)
-
- r1 = range(iterated_pipe, p1, lower=-2, upper=-0.5, scale=x->10^x)
- r2 = range(iterated_pipe, p2, lower=2, upper=6)
-end
-
-# ╔═╡ f46af08e-ddb9-4aec-93a0-9d99274137e8
-md"""
-Nominal ranges are defined by specifying `values` instead of `lower`
-and `upper`.
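-
-For example, here's a hypothetical nominal range over the booster's
-`nbins` hyperparameter (not tuned in this tutorial):
-
-```julia
-range(iterated_pipe, :(model.evo_tree_classifier.nbins), values=[16, 32, 64])
-```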
-"""
-
-# ╔═╡ af3023e6-920f-478d-af76-60dddeecbe6c
-md"""
-Next, we choose an optimization strategy from [this
-list](https://juliaai.github.io/MLJ.jl/dev/tuning_models/#Tuning-Models):
-"""
-
-# ╔═╡ 93c17a9b-b49c-4780-9074-c069a0e97d7e
-tuning = RandomSearch(rng=123)
-
-# ╔═╡ 21f14f18-cc5a-4ed9-ac4a-024c5894013e
-md"""
-Then we wrap the model, specifying a `resampling` strategy and a
-`measure`, as we did for `IteratedModel`. In fact, we can include a
-battery of `measures`; by default, optimization is with respect to
-performance estimates based on the first measure, but estimates for
-all measures can be accessed from the model's `report`.
-"""
-
-# ╔═╡ 147cc2c5-f3c2-4aec-82ca-0687059409ae
-md"""
-The keyword `n` specifies the total number of models (sets of
-hyperparameters) to evaluate.
-"""
-
-# ╔═╡ 2996c4f0-567a-4c5a-9e9e-2379bd3c438e
-tuned_iterated_pipe = TunedModel(model=iterated_pipe,
- range=[r1, r2],
- tuning=tuning,
- measures=[brier_loss, auc, accuracy],
- resampling=StratifiedCV(nfolds=6, rng=123),
- acceleration=CPUThreads(),
- n=40)
-
-# ╔═╡ fbf15fc2-347a-432a-bf9e-f9077f3deea4
-md"To save time, we skip the `repeats` here."
-
-# ╔═╡ 3922d490-8018-4b8c-9b74-5a67b6e8b15f
-md"Binding our final model to data and training:"
-
-# ╔═╡ d59dc0af-8eb7-4ac4-b1f4-bf9663e97bb7
-begin
- mach_tuned_iterated_pipe = machine(tuned_iterated_pipe, X, y)
- fit!(mach_tuned_iterated_pipe)
-end
-
-# ╔═╡ ed1d8575-1c50-4bc7-8dca-7ecb1b94fa1b
-md"""
-As explained above, the training we have just performed was split
-internally into two separate steps:
-"""
-
-# ╔═╡ c392b949-7e64-4f3e-aeca-6e3d5d94d8fa
-md"""
-- A step to determine the parameter values that optimize the aggregated cross-validation scores
-- A final step that trains the optimal model on *all* available data. Future predictions `predict(mach_tuned_iterated_pipe, ...)` are based on this final training step.
-"""
-
-# ╔═╡ 954a354f-0a1d-4a7f-8aa0-20af95403dd9
-md"""
-From `report(mach_tuned_iterated_pipe)` we can extract details about
-the optimization procedure. For example:
-"""
-
-# ╔═╡ 0f9e72c1-fb6e-4ea7-a120-518242409f79
-begin
- rpt2 = report(mach_tuned_iterated_pipe);
- best_booster = rpt2.best_model.model.evo_tree_classifier
-end
-
-# ╔═╡ 242b8047-0508-4ce2-bcf7-0be279ee1434
-@info "Optimal hyperparameters:" best_booster.max_depth best_booster.η;
-
-# ╔═╡ af6d8f76-f96e-4136-9df6-c78b3bed8b80
-md"Using the `confidence_intervals` function we defined earlier:"
-
-# ╔═╡ ec5e230c-397a-4e07-b9cc-1427739824b3
-begin
- e_best = rpt2.best_history_entry
- confidence_intervals(e_best)
-end
-
-# ╔═╡ a0a591d4-ae53-4db4-9051-24fa20a24653
-md"""
-Digging a little deeper, we can learn what stopping criterion was
-applied in the case of the optimal model, and how many iterations
-were required:
-"""
-
-# ╔═╡ 8f18ff72-39db-43f2-ac11-6a06d822d067
-rpt2.best_report.controls |> collect
-
-# ╔═╡ d28bdf7e-2ba5-4a97-8d27-497b9a4e8f4b
-md"Finally, we can visualize the optimization results:"
-
-# ╔═╡ e0dba28c-5f07-4305-a8e6-955351cd26f5
-plot(mach_tuned_iterated_pipe, size=(600,450))
-
-# ╔═╡ d92c7d6c-6eda-4083-bf7b-99c47ef72fd2
-md"## Saving our model"
-
-# ╔═╡ 7626383d-5212-4ee5-9b3c-c87e36798d40
-md"> Introduces: `MLJ.save`"
-
-# ╔═╡ f38bca90-a19e-4faa-bc52-7a03f8a4f046
-md"""
-Here's how to serialize our final, trained self-iterating,
-self-tuning pipeline machine:
-"""
-
-# ╔═╡ 36e8600e-f5ee-4e5f-9814-5dd23028b7dd
-MLJ.save("tuned_iterated_pipe.jlso", mach_tuned_iterated_pipe)
-
-# ╔═╡ 4de39bbb-e421-4e1b-aea9-6b095d52d246
-md"We'll deserialize this in \"Testing the final model\" below."
-
-# ╔═╡ cf4e89cd-998c-44d1-8a6a-3fad94d47b89
-md"## Final performance estimate"
-
-# ╔═╡ cd043584-6881-45df-ab7f-23dfd6fe43e8
-md"""
-Finally, to get an even more accurate estimate of performance, we
-can evaluate our model using stratified cross-validation and all the
-data attached to our machine. Because this evaluation implies
-[nested
-resampling](https://mlr.mlr-org.com/articles/tutorial/nested_resampling.html),
-this computation takes quite a bit longer than the previous one
-(which is now repeated six times, each time using 5/6ths of the
-data):
-"""
-
-# ╔═╡ 9c90d7a7-1966-4d21-873e-e6fb8e7dca1a
-e_tuned_iterated_pipe = evaluate(tuned_iterated_pipe, X, y,
- resampling=StratifiedCV(nfolds=6, rng=123),
- measures=[brier_loss, auc, accuracy])
-
-# ╔═╡ f5058fe5-53de-43ad-9dd4-420f3ba8803c
-confidence_intervals(e_tuned_iterated_pipe)
-
-# ╔═╡ 669b492b-69a7-4d88-b994-ae59732958cb
-md"""
-For comparison, here are the confidence intervals for the basic
-pipeline model (no feature selection and default hyperparameters):
-"""
-
-# ╔═╡ 5c1754b1-21c2-4173-9aa5-a206354b401f
-confidence_intervals_basic_model
-
-# ╔═╡ 675f06bc-6488-45e2-b666-1307eccc221d
-md"""
-As each pair of intervals overlap, it's doubtful the small changes
-here can be assigned statistical significance. Default `booster`
-hyperparameters do a pretty good job.
-"""
-
-# ╔═╡ 19bb6a91-8c9d-4ed0-8f9a-f7439f35ea8f
-md"## Testing the final model"
-
-# ╔═╡ fe32c8ab-2f9e-4bb4-a8bb-11a6d1761f50
-md"""
-We now determine the performance of our model on our
-lock-and-throw-away-the-key holdout set. To demonstrate
-deserialization, we'll pretend we're in a new Julia session (but
-have called `import`/`using` on the same packages). Then the
-following should suffice to recover our model trained under
-"Hyperparameter optimization" above:
-"""
-
-# ╔═╡ 695926dd-3faf-48a0-8c71-7d4e18e2f699
-mach_restored = machine("tuned_iterated_pipe.jlso")
-
-# ╔═╡ 24ca2761-e264-4cf9-a590-db6dcb21b2d3
-md"We compute predictions on the holdout set:"
-
-# ╔═╡ 9883bd87-d70f-4fea-bec6-3fa17d8c7b35
-begin
- ŷ_tuned = predict(mach_restored, Xtest);
- ŷ_tuned[1]
-end
-
-# ╔═╡ 9b7a12e7-2b3c-445a-97eb-2de8afd657be
-md"And can compute the final performance measures:"
-
-# ╔═╡ 7fa067b5-7e77-4dda-bba0-54e6f740a5b5
-@info("Tuned model measurements on test:",
- brier_loss(ŷ_tuned, ytest) |> mean,
- auc(ŷ_tuned, ytest),
- accuracy(mode.(ŷ_tuned), ytest)
- )
-
-# ╔═╡ f4dc71cf-64c8-4857-94c0-45caa9808777
-md"For comparison, here's the performance for the basic pipeline model"
-
-# ╔═╡ 00257755-7145-45e6-adf5-76545beae885
-begin
- mach_basic = machine(pipe, X, y)
- fit!(mach_basic, verbosity=0)
-
- ŷ_basic = predict(mach_basic, Xtest);
-
- @info("Basic model measurements on test set:",
- brier_loss(ŷ_basic, ytest) |> mean,
- auc(ŷ_basic, ytest),
- accuracy(mode.(ŷ_basic), ytest)
- )
-end
-
-# ╔═╡ 135dac9b-0bd9-4e1d-8715-73a20e2ae31b
-md"""
----
-
-*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*
-"""
-
-# ╔═╡ Cell order:
-# ╟─f0cc864c-8b26-441f-9bca-7c69b794f8ce
-# ╟─8a6670b8-96a8-4a5d-b795-033f6f2a0674
-# ╟─aa49e638-95dc-4249-935f-ddf6a6bfbbdd
-# ╟─b04c4790-59e0-42a3-af2a-25235e544a31
-# ╟─4eb8dff4-c23a-4b41-8af5-148d95ea2900
-# ╟─a583d175-d623-4888-a6bf-47194d7e8e12
-# ╟─4830fc64-3d70-4869-828a-4cc485149963
-# ╟─28197138-d6b7-433c-9e54-3f7b3ca87ecb
-# ╟─aad8dc13-c0f9-4090-ba1f-6363f43ec697
-# ╟─7c0464a0-4114-46bf-95ea-2955abd45275
-# ╟─256869d0-5e0c-42af-b1b3-dd47e367ba54
-# ╠═60fe49c1-3434-4a77-8d7e-8e449afd1c48
-# ╟─e8c13e9d-7910-4a0e-a949-7f3bd292fe31
-# ╟─6bf6ef98-302f-478c-8514-99938a2932db
-# ╟─33ca287e-8cba-47d1-a0de-1721c1bc2df2
-# ╟─d5b8bf1d-6c9a-46c1-bca8-7eec7950fd82
-# ╠═9b7b9ade-9318-4b95-9873-1b1430e635cc
-# ╠═4f3f061d-259d-4479-b43e-d0ebe87da176
-# ╠═a546f372-4ae0-48c2-9009-4cfb2012998f
-# ╠═f8f79dd2-b6de-48e8-abd3-cb77d7a79683
-# ╟─617d63f6-8a62-40db-879e-dc128f3db7b3
-# ╠═ceda7ed0-3b98-44b3-a367-9aa3c6cf34d0
-# ╟─e69cd764-e4b8-4ff8-bf32-94657e65284e
-# ╟─4a175e0f-4b87-4b53-9afd-65a5b5facac9
-# ╠═3c23e9f8-49bd-4cde-b6c8-54f5ed90a964
-# ╟─6ad971c7-9fe5-45a9-9292-fe822525fc77
-# ╠═63151d31-8428-4ebd-ae5e-4b83dcbc9675
-# ╟─0f978839-cc95-4c3a-8a29-32f11452654a
-# ╠═5edc98cd-df7d-428a-a5f1-151fcbe229a2
-# ╟─1848109f-c94c-4cdd-81bc-2d5603785a09
-# ╠═eee948b7-2b42-439e-9d87-1d1fbb0e3997
-# ╟─7c98e10b-1fa3-4ab6-b950-fddef2a1fb10
-# ╠═413d9c3c-a68e-4bf0-951b-64cb2a36c8e3
-# ╟─eb79663c-c671-4c4c-b0e6-441461cc8770
-# ╠═ae9e9377-552c-4186-8cb0-de601961bc02
-# ╟─5c70ee06-edb9-4789-a87d-950c50fb2955
-# ╟─bde7bc43-a4bb-4a42-8448-4bcb089096cd
-# ╠═d5a9e9b8-c67e-4bdd-a012-a84240254fb6
-# ╠═8de2c7b4-1950-4f05-bbdd-9799f7bb2e60
-# ╟─768f369a-5f3d-4dcc-97a7-88e3af4f10ee
-# ╟─bc4c0eb9-3b5f-49fc-b372-8c9866e51852
-# ╟─f04becbb-42a6-409f-8f3d-b82f1e7b6f7a
-# ╟─1d08ed75-5377-4971-ab04-579e5608ae53
-# ╟─eea585d4-d55b-4ccc-86cf-35420d9e6995
-# ╟─ca482134-299b-459a-a29a-8d0445351914
-# ╟─5c7c97e4-fd8b-4ae5-be65-28d0fcca50a8
-# ╟─fd103be7-6fc9-43bc-9a2e-c468345d87d1
-# ╠═9d2f0b19-2942-47ac-b5fa-cb79ebf595ef
-# ╟─1d18aca8-11f4-4104-91c5-524da38aa391
-# ╠═accd3a09-5371-45ab-ad90-b4405b216771
-# ╟─bde0406f-3964-42b8-895b-997a92b731e0
-# ╠═25e60566-fc64-4f35-a525-e9b04a4bd246
-# ╟─441317a8-3b72-4bf1-80f0-ba7781e173cf
-# ╠═d2f2b67e-d06e-4c0b-9cbc-9a2c39793333
-# ╟─7c93b820-cbce-48be-b887-2b16710e5502
-# ╠═5ac4a13e-9983-41dd-9452-5298a8a4a401
-# ╟─9f28e6c3-5f18-4fb2-b017-96f1602f2cad
-# ╠═c9506e5b-c111-442c-8be2-2a4017c45345
-# ╟─8c865df5-ac95-438f-a7ad-80d84f5b0070
-# ╠═22690828-0992-4ed0-8378-5aa686f0b407
-# ╟─3bd753fc-163a-4b50-9f43-49763e8691a1
-# ╟─7b12673f-cc63-4cd9-bb0e-3a45761c733a
-# ╟─3c98ab4e-47cd-4247-9979-7044b2f2df33
-# ╠═ea010966-bb7d-4c79-b2a1-fbbde543f620
-# ╟─ecb7c90a-4c93-483c-910b-f8a9a217eff2
-# ╟─81aa1aa9-27d9-423f-aa36-d2e8546da46a
-# ╟─fff47e76-d7ac-4377-88a1-a2bf91434413
-# ╟─f1dbbc22-9844-4e1c-a1cb-54e34396a855
-# ╠═20713fb3-3a3b-42e7-8037-0de5806e1a54
-# ╠═369cb359-8512-4f80-9961-a09632c33fb0
-# ╟─af484868-b29c-4808-b7cc-46f4ef988c68
-# ╠═8f3eda66-6183-4fe4-90fb-7f2721f6fcc7
-# ╟─598142b5-0828-49ac-af65-ccd25ecb9818
-# ╟─c5309d64-7409-4148-8890-979691212d9b
-# ╟─f97969e2-c15c-42cf-a6fa-eaf14df5d44b
-# ╠═edf1d726-542a-4fc1-8025-5084804a9f6c
-# ╟─ee41a121-df0d-40ad-9e90-09023d201062
-# ╠═be6cef47-d585-45fe-b7bb-2f33ef765cc0
-# ╟─452edd51-b878-46d0-9625-01172c4a0081
-# ╠═fdd20843-c981-41ad-af4f-11c7de9e21dd
-# ╟─2b25f9cb-d12d-4a6f-8dba-241d9b744683
-# ╠═b139c43b-382b-415a-a6e5-2f0b4dca5c59
-# ╟─90a7fe9a-934f-445a-8550-20498aa03ed0
-# ╟─fc97e937-66ca-4434-9e7b-5b943cf6b560
-# ╟─b1d95557-51fc-4f1e-bce6-0a1379cc1259
-# ╟─38961daa-dd5f-4f97-9608-b5032c116833
-# ╟─ccd3ffd0-ed41-497d-b473-4aa968e6937a
-# ╠═35cf2011-4fb6-4e7a-8d9e-644a1b3cc6b6
-# ╟─4b6e3239-fd47-495a-ac0a-23ded81445da
-# ╟─ce9a3479-601a-472a-8535-1a088a6a3228
-# ╠═45c2d8f9-cbf9-4cda-a39f-7893c73eef39
-# ╟─232b7775-1f79-43f7-bcca-51a27994a151
-# ╟─ac84c13f-391f-405a-9b33-09b4b6661170
-# ╟─442078e4-a695-471c-b45d-88d068bb0fa2
-# ╟─23fb37b0-08c1-4688-92c8-6db3a590d963
-# ╟─1def55f5-71fc-4257-abf5-96f457eb2bdf
-# ╟─2c4016c6-dc7e-44ec-8a60-2e0214c059f5
-# ╠═236c098c-0189-4a96-a38a-beb5c7157b57
-# ╟─9ffe1654-7e32-4c4b-81f5-507803ea9ed6
-# ╟─48413a7b-d01b-4005-9b21-6d4eb642d87e
-# ╟─84d7bcbc-df3c-4602-b98c-3df1f31879bf
-# ╠═f8552404-f95f-4484-92b7-4ceba56e97ad
-# ╟─e0b1bffb-dbd5-4b0a-b122-214de244406c
-# ╠═d8ad2c21-fd27-44fb-8a4b-c83694978e38
-# ╟─ceff13ba-a526-4ecb-a8b6-f7ff516dedc4
-# ╟─dcf0eb93-2368-4158-81e1-74ef03c2e79e
-# ╠═442be3e3-b2e9-499d-a04e-0b76409c14a7
-# ╟─f03fdd56-6c30-4d73-b979-3458f2f26667
-# ╠═3afbe65c-18bf-4576-97e3-6677afc6ca9f
-# ╟─527b193b-737d-4701-b10e-562ce21caa04
-# ╠═b332842d-6397-45c6-8f78-ab549ef1544e
-# ╠═3035e6d0-3801-424d-a8a4-a53f5149481e
-# ╟─8b67dcac-6a86-4502-8704-dbab8e09b4f1
-# ╟─9cd42f39-942e-4011-a02f-27b6c05e4d02
-# ╠═0fecce89-102f-416f-bea4-3d467d48781c
-# ╟─0b11a81e-42a7-4519-97c5-2979af8a507d
-# ╟─e6f6fc92-f9e7-4472-b639-63c76c72c513
-# ╠═f3933688-1ea4-4239-8f5a-e1ee9eb5c15c
-# ╟─ce90835f-6709-4c34-adc5-e671db8bca5d
-# ╠═15996832-b736-45f4-86f1-0ea38de21abb
-# ╠═3c4984d2-9e1a-450b-a55c-0cb44ab816d7
-# ╟─27221b90-8506-4857-be86-92ecfd0d2343
-# ╟─445146e5-e45e-450d-9cf1-facf39e3f302
-# ╟─562887bb-b7fb-430f-b61c-748aec38e674
-# ╠═f9be989e-2604-44c2-9727-ed822e4fd85d
-# ╟─ff7cfc36-b9fc-4570-b2f2-e08965e5be66
-# ╟─45a4d300-a392-485c-897c-f79712f9ec7c
-# ╠═0f76a79f-8675-4ec1-a543-f9324a87efad
-# ╠═fc641df4-693c-4007-8657-9fba0caf3cb7
-# ╟─c3f71e42-8bbe-47fe-a217-e58f442fc85c
-# ╟─db354064-c2dd-4e6a-b8ad-0340f15a03ba
-# ╟─3bbb26ed-7d1e-46ac-946d-b124a8db5f7c
-# ╠═cdfe840d-4e87-467f-b582-dfcbeb05bcc5
-# ╟─de589473-4c37-4f70-9143-546b2286a5fe
-# ╟─0b80d69c-b60c-4cf4-a7d8-c3b94fb18495
-# ╟─19e7e4c9-95c0-49d6-8396-93fa872d2512
-# ╟─6ff08b40-906f-4154-a4ac-2ddb495858ce
-# ╟─8fc99d35-d8cc-455f-806e-1bc580dc349d
-# ╠═29f33708-4a82-4acc-9703-288eae064e2a
-# ╟─9f80b5b5-95b9-4f01-b2c3-413ae5867f7d
-# ╠═fd2e2ee4-c256-4fde-93d8-53c527b0a48c
-# ╟─bb7a34eb-4bf1-41d9-af98-8dffdf3118fc
-# ╟─71bead68-2d38-468c-8620-127a8c4021ec
-# ╠═f245005b-ca1e-4c38-a1f6-fec4c3edfa7b
-# ╟─16d8cd4e-ca20-475a-82f7-021485ee0115
-# ╟─ab59d744-51f8-4364-9ecb-94593d972599
-# ╟─ed9f284e-1e45-46e7-b54b-e44bea97c579
-# ╟─b8b0e5ee-8468-4e02-9121-4e392242994e
-# ╟─50372dc8-bb10-4f9e-b221-79886442efe7
-# ╟─3e385eb4-0a44-40f6-8df5-b7371beb6b3f
-# ╟─caa5153f-6633-41f7-a475-d52dc8eba727
-# ╟─0b74bfe8-bff6-469d-804b-89f5009710b0
-# ╠═b17f6f89-2fd4-46de-a14a-b0adc2955e1c
-# ╠═db7a8b4d-611b-4e5e-bd20-120a7a4020d0
-# ╟─f46af08e-ddb9-4aec-93a0-9d99274137e8
-# ╟─af3023e6-920f-478d-af76-60dddeecbe6c
-# ╠═93c17a9b-b49c-4780-9074-c069a0e97d7e
-# ╟─21f14f18-cc5a-4ed9-ac4a-024c5894013e
-# ╟─147cc2c5-f3c2-4aec-82ca-0687059409ae
-# ╠═2996c4f0-567a-4c5a-9e9e-2379bd3c438e
-# ╟─fbf15fc2-347a-432a-bf9e-f9077f3deea4
-# ╟─3922d490-8018-4b8c-9b74-5a67b6e8b15f
-# ╠═d59dc0af-8eb7-4ac4-b1f4-bf9663e97bb7
-# ╟─ed1d8575-1c50-4bc7-8dca-7ecb1b94fa1b
-# ╟─c392b949-7e64-4f3e-aeca-6e3d5d94d8fa
-# ╟─954a354f-0a1d-4a7f-8aa0-20af95403dd9
-# ╠═0f9e72c1-fb6e-4ea7-a120-518242409f79
-# ╠═242b8047-0508-4ce2-bcf7-0be279ee1434
-# ╟─af6d8f76-f96e-4136-9df6-c78b3bed8b80
-# ╠═ec5e230c-397a-4e07-b9cc-1427739824b3
-# ╟─a0a591d4-ae53-4db4-9051-24fa20a24653
-# ╠═8f18ff72-39db-43f2-ac11-6a06d822d067
-# ╟─d28bdf7e-2ba5-4a97-8d27-497b9a4e8f4b
-# ╠═e0dba28c-5f07-4305-a8e6-955351cd26f5
-# ╟─d92c7d6c-6eda-4083-bf7b-99c47ef72fd2
-# ╟─7626383d-5212-4ee5-9b3c-c87e36798d40
-# ╟─f38bca90-a19e-4faa-bc52-7a03f8a4f046
-# ╠═36e8600e-f5ee-4e5f-9814-5dd23028b7dd
-# ╟─4de39bbb-e421-4e1b-aea9-6b095d52d246
-# ╟─cf4e89cd-998c-44d1-8a6a-3fad94d47b89
-# ╟─cd043584-6881-45df-ab7f-23dfd6fe43e8
-# ╠═9c90d7a7-1966-4d21-873e-e6fb8e7dca1a
-# ╠═f5058fe5-53de-43ad-9dd4-420f3ba8803c
-# ╟─669b492b-69a7-4d88-b994-ae59732958cb
-# ╠═5c1754b1-21c2-4173-9aa5-a206354b401f
-# ╟─675f06bc-6488-45e2-b666-1307eccc221d
-# ╟─19bb6a91-8c9d-4ed0-8f9a-f7439f35ea8f
-# ╟─fe32c8ab-2f9e-4bb4-a8bb-11a6d1761f50
-# ╠═695926dd-3faf-48a0-8c71-7d4e18e2f699
-# ╟─24ca2761-e264-4cf9-a590-db6dcb21b2d3
-# ╠═9883bd87-d70f-4fea-bec6-3fa17d8c7b35
-# ╟─9b7a12e7-2b3c-445a-97eb-2de8afd657be
-# ╠═7fa067b5-7e77-4dda-bba0-54e6f740a5b5
-# ╟─f4dc71cf-64c8-4857-94c0-45caa9808777
-# ╠═00257755-7145-45e6-adf5-76545beae885
-# ╟─135dac9b-0bd9-4e1d-8715-73a20e2ae31b
diff --git a/examples/telco/notebook.pluto.jl.html b/examples/telco/notebook.pluto.jl.html
deleted file mode 100644
index 2d39e7ed9..000000000
--- a/examples/telco/notebook.pluto.jl.html
+++ /dev/null
@@ -1,65 +0,0 @@
-
-⚡ Pluto.jl ⚡
-
diff --git a/examples/telco/notebook.unexecuted.ipynb b/examples/telco/notebook.unexecuted.ipynb
deleted file mode 100644
index 71d109350..000000000
--- a/examples/telco/notebook.unexecuted.ipynb
+++ /dev/null
@@ -1,1855 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# MLJ for Data Scientists in Two Hours"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "An application of the [MLJ\n",
- "toolbox](https://juliaai.github.io/MLJ.jl/dev/) to the\n",
- "Telco Customer Churn dataset, aimed at practicing data scientists\n",
- "new to MLJ (Machine Learning in Julia). This tutorial does not\n",
- "cover exploratory data analysis."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "MLJ is a *multi-paradigm* machine learning toolbox (i.e., not just\n",
- "deep-learning)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For other MLJ learning resources see the [Learning\n",
- "MLJ](https://juliaai.github.io/MLJ.jl/dev/learning_mlj/)\n",
- "section of the\n",
- "[manual](https://juliaai.github.io/MLJ.jl/dev/)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Topics covered**: Grabbing and preparing a dataset, basic\n",
- "fit/predict workflow, constructing a pipeline to include data\n",
- "pre-processing, estimating performance metrics, ROC curves, confusion\n",
- "matrices, feature importance, basic feature selection, controlling iterative\n",
- "models, hyper-parameter optimization (tuning)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Prerequisites for this tutorial.** Previous experience building,\n",
- "evaluating, and optimizing machine learning models using\n",
- "scikit-learn, caret, MLR, weka, or similar tool. No previous\n",
- "experience with MLJ. Only fairly basic familiarity with Julia is\n",
- "required. Uses\n",
- "[DataFrames.jl](https://dataframes.juliadata.org/stable/) but in a\n",
- "minimal way ([this\n",
- "cheatsheet](https://ahsmart.com/pub/data-wrangling-with-data-frames-jl-cheat-sheet/index.html)\n",
- "may help)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Time.** Between two and three hours, first time through."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Summary of methods and types introduced"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "|code | purpose|\n",
- "|:-------|:-------------------------------------------------------|\n",
- "| `OpenML.load(id)` | grab a dataset from [OpenML.org](https://www.openml.org)|\n",
- "| `scitype(X)` | inspect the scientific type (scitype) of object `X`|\n",
- "| `schema(X)` | inspect the column scitypes (scientific types) of a table `X`|\n",
- "| `coerce(X, ...)` | fix column encodings to get appropriate scitypes|\n",
- "| `partition(data, frac1, frac2, ...; rng=...)` | vertically split `data`, which can be a table, vector or matrix|\n",
- "| `unpack(table, f1, f2, ...)` | horizontally split `table` based on conditions `f1`, `f2`, ..., applied to column names|\n",
- "| `@load ModelType pkg=...` | load code defining a model type|\n",
- "| `input_scitype(model)` | inspect the scitype that a model requires for features (inputs)|\n",
- "| `target_scitype(model)`| inspect the scitype that a model requires for the target (labels)|\n",
- "| `ContinuousEncoder` | built-in model type for re-encoding all features as `Continuous`|\n",
- "| `model1 ∣> model2 ∣> ...` | combine multiple models into a pipeline|\n",
- "| `measures(\"under curve\")` | list all measures (metrics) with string \"under curve\" in documentation|\n",
- "| `accuracy(yhat, y)` | compute accuracy of predictions `yhat` against ground truth observations `y`|\n",
- "| `auc(yhat, y)`, `brier_loss(yhat, y)` | evaluate two probabilistic measures (`yhat` a vector of probability distributions)|\n",
- "| `machine(model, X, y)` | bind `model` to training data `X` (features) and `y` (target)|\n",
- "| `fit!(mach, rows=...)` | train machine using specified rows (observation indices)|\n",
- "| `predict(mach, rows=...)`, | make in-sample model predictions given specified rows|\n",
- "| `predict(mach, Xnew)` | make predictions given new features `Xnew`|\n",
- "| `fitted_params(mach)` | inspect learned parameters|\n",
- "| `report(mach)` | inspect other outcomes of training|\n",
- "| `confmat(yhat, y)` | confusion matrix for predictions `yhat` and ground truth `y`|\n",
- "| `roc(yhat, y)` | compute points on the receiver-operator Characteristic|\n",
- "| `StratifiedCV(nfolds=6)` | 6-fold stratified cross-validation resampling strategy|\n",
- "| `Holdout(fraction_train=0.7)` | holdout resampling strategy|\n",
- "| `evaluate(model, X, y; resampling=..., options...)` | estimate performance metrics `model` using the data `X`, `y`|\n",
- "| `FeatureSelector()` | transformer for selecting features|\n",
- "| `Step(3)` | iteration control for stepping 3 iterations|\n",
- "| `NumberSinceBest(6)`, `TimeLimit(60/5), InvalidValue()` | iteration control stopping criteria|\n",
- "| `IteratedModel(model=..., controls=..., options...)` | wrap an iterative `model` in control strategies|\n",
- "| `range(model, :some_hyperparam, lower=..., upper=...)` | define a numeric range|\n",
- "| `RandomSearch()` | random search tuning strategy|\n",
- "| `TunedModel(model=..., tuning=..., options...)` | wrap the supervised `model` in specified `tuning` strategy|"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Instantiate a Julia environment"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The following code replicates precisely the set of Julia packages\n",
- "used to develop this tutorial. If this is your first time running\n",
- "the notebook, package instantiation and pre-compilation may take a\n",
- "minute or so to complete. **This step will fail** if the [correct\n",
- "Manifest.toml and Project.toml\n",
- "files](https://github.com/JuliaAI/MLJ.jl/tree/dev/examples/telco)\n",
- "are not in the same directory as this notebook."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "using Pkg\n",
- "Pkg.activate(@__DIR__) # get env from TOML files in same directory as this notebook\n",
- "Pkg.instantiate()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Warm up: Building a model for the iris dataset"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Before turning to the Telco Customer Churn dataset, we very quickly\n",
- "build a predictive model for Fisher's well-known iris data set, as way of\n",
- "introducing the main actors in any MLJ workflow. Details that you\n",
- "don't fully grasp should become clearer in the Telco study."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "This section is a condensed adaption of the [Getting Started\n",
- "example](https://juliaai.github.io/MLJ.jl/dev/getting_started/#Fit-and-predict)\n",
- "in the MLJ documentation."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "First, using the built-in iris dataset, we load and inspect the features\n",
- "`X_iris` (a table) and target variable `y_iris` (a vector):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "using MLJ"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "const X_iris, y_iris = @load_iris;\n",
- "schema(X_iris)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "y_iris[1:4]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "levels(y_iris)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We load a decision tree model, from the package DecisionTree.jl:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "DecisionTree = @load DecisionTreeClassifier pkg=DecisionTree # model type\n",
- "model = DecisionTree(min_samples_split=5) # model instance"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In MLJ, a *model* is just a container for hyper-parameters of\n",
- "some learning algorithm. It does not store learned parameters."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next, we bind the model together with the available data in what's\n",
- "called a *machine*:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "mach = machine(model, X_iris, y_iris)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "A machine is essentially just a model (ie, hyper-parameters) plus data, but\n",
- "it additionally stores *learned parameters* (the tree) once it is\n",
- "trained on some view of the data:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "train_rows = vcat(1:60, 91:150); # some row indices (observations are rows not columns)\n",
- "fit!(mach, rows=train_rows)\n",
- "fitted_params(mach)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "A machine stores some other information enabling [warm\n",
- "restart](https://juliaai.github.io/MLJ.jl/dev/machines/#Warm-restarts)\n",
- "for some models, but we won't go into that here. You are allowed to\n",
- "access and mutate the `model` parameter:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "mach.model.min_samples_split = 10\n",
- "fit!(mach, rows=train_rows) # re-train with new hyper-parameter"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now we can make predictions on some other view of the data, as in"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "predict(mach, rows=71:73)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "or on completely new data, as in"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "Xnew = (sepal_length = [5.1, 6.3],\n",
- " sepal_width = [3.0, 2.5],\n",
- " petal_length = [1.4, 4.9],\n",
- " petal_width = [0.3, 1.5])\n",
- "yhat = predict(mach, Xnew)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "These are probabilistic predictions which can be manipulated using a\n",
- "widely adopted interface defined in the Distributions.jl\n",
- "package. For example, we can get raw probabilities like this:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "pdf.(yhat, \"virginica\")"
- ]
- },
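- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Other operations from that interface work similarly. For example, a sketch\n",
- "extracting the modal (most likely) classes, and the probability of\n",
- "\"setosa\" for the first observation:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "mode.(yhat)\n",
- "pdf(yhat[1], \"setosa\")"
- ]
- },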
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We now turn to the Telco dataset."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Getting the Telco data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import DataFrames"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "data = OpenML.load(42178) # data set from OpenML.org\n",
- "df0 = DataFrames.DataFrame(data)\n",
- "first(df0, 4)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The object of this tutorial is to build and evaluate supervised\n",
- "learning models to predict the `:Churn` variable, a binary variable\n",
- "measuring customer retention, based on other variables that are\n",
- "relevant."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In the table, observations correspond to rows, and features to\n",
- "columns, which is the convention for representing all\n",
- "two-dimensional data in MLJ."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Type coercion"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "> Introduces: `scitype`, `schema`, `coerce`"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "A [\"scientific\n",
- "type\"](https://juliaai.github.io/ScientificTypes.jl/dev/) or\n",
- "*scitype* indicates how MLJ will *interpret* data. For example,\n",
- "`typeof(3.14) == Float64`, while `scitype(3.14) == Continuous` and\n",
- "also `scitype(3.14f0) == Continuous`. In MLJ, model data\n",
- "requirements are articulated using scitypes."
- ]
- },
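- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For example (a quick sketch): an integer is interpreted as a count, and a\n",
- "string as textual data:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "(scitype(3.14), scitype(3), scitype(\"cat\"))"
- ]
- },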
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Here are common \"scalar\" scitypes:"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "There are also container scitypes. For example, the scitype of any\n",
- "`N`-dimensional array is `AbstractArray{S, N}`, where `S` is the scitype of the\n",
- "elements:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "scitype([\"cat\", \"mouse\", \"dog\"])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The `schema` operator summarizes the column scitypes of a table:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "schema(df0) |> DataFrames.DataFrame # converted to DataFrame for better display"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "All of the fields being interpreted as `Textual` are really\n",
- "something else, either `Multiclass` or, in the case of\n",
- "`:TotalCharges`, `Continuous`. In fact, `:TotalCharges` is\n",
- "mostly floats wrapped as strings. However, it needs special\n",
- "treatment because some elements consist of a single space, \" \",\n",
- "which we'll treat as \"0.0\"."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "fix_blanks(v) = map(v) do x\n",
- " if x == \" \"\n",
- " return \"0.0\"\n",
- " else\n",
- " return x\n",
- " end\n",
- "end\n",
- "\n",
- "df0.TotalCharges = fix_blanks(df0.TotalCharges);"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Coercing the `:TotalCharges` type to ensure a `Continuous` scitype:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "coerce!(df0, :TotalCharges => Continuous);"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Coercing all remaining `Textual` data to `Multiclass`:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "coerce!(df0, Textual => Multiclass);"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Finally, we'll coerce our target variable `:Churn` to be\n",
- "`OrderedFactor`, rather than `Multiclass`, to enable a reliable\n",
- "interpretation of metrics like \"true positive rate\". By convention,\n",
- "the first class is the negative one:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "coerce!(df0, :Churn => OrderedFactor)\n",
- "levels(df0.Churn) # to check order"
- ]
- },
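- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The order displayed is the one we want. Were it not, it could be corrected\n",
- "with `levels!` from CategoricalArrays.jl, as in this hypothetical sketch:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# levels!(df0.Churn, [\"No\", \"Yes\"]) # hypothetical; the order is already correct"
- ]
- },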
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Re-inspecting the scitypes:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "schema(df0) |> DataFrames.DataFrame"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Preparing a holdout set for final testing"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "> Introduces: `partition`"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "To reduce training times for the purposes of this tutorial, we're\n",
- "going to dump 90% of observations (after shuffling) and split off\n",
- "30% of the remainder for use as a lock-and-throw-away-the-key\n",
- "holdout set:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "df, df_test, df_dumped = partition(df0, 0.07, 0.03, # in ratios 7:3:90\n",
- " stratify=df0.Churn,\n",
- " rng=123);"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The reader interested in including all data can instead do\n",
- "`df, df_test = partition(df0, 0.7, stratify=df0.Churn, rng=123)`."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Splitting data into target and features"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "> Introduces: `unpack`"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In the following call, the column with name `:Churn` is copied over\n",
- "to a vector `y`, and every remaining column, except `:customerID`\n",
- "(which contains no useful information) goes into a table `X`. Here\n",
- "`:Churn` is the target variable for which we seek predictions, given\n",
- "new versions of the features `X`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "const y, X = unpack(df, ==(:Churn), !=(:customerID));\n",
- "schema(X).names"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "intersect([:Churn, :customerID], schema(X).names)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We'll do the same for the holdout data:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "const ytest, Xtest = unpack(df_test, ==(:Churn), !=(:customerID));"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Loading a model and checking type requirements"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "> Introduces: `@load`, `input_scitype`, `target_scitype`"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For tools helping us to identify suitable models, see the [Model\n",
- "Search](https://juliaai.github.io/MLJ.jl/dev/model_search/#model_search)\n",
- "section of the manual. We will build a gradient tree-boosting model,\n",
- "a popular first choice for structured data like we have here. Model\n",
- "code is contained in a third-party package called\n",
- "[EvoTrees.jl](https://github.com/Evovest/EvoTrees.jl) which is\n",
- "loaded as follows:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "Booster = @load EvoTreeClassifier pkg=EvoTrees"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Recall that a *model* is just a container for some algorithm's\n",
- "hyper-parameters. Let's create a `Booster` with default values for\n",
- "the hyper-parameters:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "booster = Booster()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "This model is appropriate for the kind of target variable we have because of\n",
- "the following passing test:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "scitype(y) <: target_scitype(booster)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "However, our features `X` cannot be directly used with `booster`:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "scitype(X) <: input_scitype(booster)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As it turns out, this is because `booster`, like the majority of MLJ\n",
- "supervised models, expects the features to be `Continuous`. (With\n",
- "some experience, this can be gleaned from `input_scitype(booster)`.)\n",
- "So we need categorical feature encoding, discussed next."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Building a model pipeline to incorporate feature encoding"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "> Introduces: `ContinuousEncoder`, pipeline operator `|>`"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The built-in `ContinuousEncoder` model transforms an arbitrary table\n",
- "to a table whose features are all `Continuous` (dropping any fields\n",
- "it does not know how to encode). In particular, all `Multiclass`\n",
- "features are one-hot encoded."
- ]
- },
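- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As a quick sketch, here is the encoder applied on its own to our features\n",
- "`X` (the pipeline defined below does this automatically):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "mach_enc = machine(ContinuousEncoder(), X)\n",
- "fit!(mach_enc)\n",
- "schema(transform(mach_enc, X)) |> DataFrames.DataFrame"
- ]
- },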
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "A *pipeline* is a stand-alone model that internally combines one or\n",
- "more models in a linear (non-branching) pipeline. Here's a pipeline\n",
- "that adds the `ContinuousEncoder` as a pre-processor to the\n",
- "gradient tree-boosting model above:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "pipe = ContinuousEncoder() |> booster"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Note that the component models appear as hyper-parameters of\n",
- "`pipe`. Pipelines are an implementation of a more general [model\n",
- "composition](https://juliaai.github.io/MLJ.jl/dev/composing_models/#Composing-Models)\n",
- "interface provided by MLJ that advanced users may want to learn about."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "From the above display, we see that component model hyper-parameters\n",
- "are now *nested*, but they are still accessible (important in hyper-parameter\n",
- "optimization):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "pipe.evo_tree_classifier.max_depth"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Evaluating the pipeline model's performance"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "> Introduces: `measures` (function), **measures:** `brier_loss`, `auc`, `accuracy`;\n",
- "> `machine`, `fit!`, `predict`, `fitted_params`, `report`, `roc`, **resampling strategy** `StratifiedCV`, `evaluate`, `FeatureSelector`"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Without touching our test set `Xtest`, `ytest`, we will estimate the\n",
- "performance of our pipeline model, with default hyper-parameters, in\n",
- "two different ways:"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Evaluating by hand.** First, we'll do this \"by hand\" using the `fit!` and `predict`\n",
- "workflow illustrated for the iris data set above, using a\n",
- "holdout resampling strategy. At the same time we'll see how to\n",
- "generate a **confusion matrix**, **ROC curve**, and inspect\n",
- "**feature importances**."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Automated performance evaluation.** Next we'll apply the more\n",
- "typical and convenient `evaluate` workflow, but using `StratifiedCV`\n",
- "(stratified cross-validation) which is more informative."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In any case, we need to choose some measures (metrics) to quantify\n",
- "the performance of our model. For a complete list of measures, one\n",
- "does `measures()`. Or we also can do:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "measures(\"Brier\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We will be primarily using `brier_loss`, but also `auc` (area under\n",
- "the ROC curve) and `accuracy`."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Evaluating by hand (with a holdout set)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Our pipeline model can be trained just like the decision tree model\n",
- "we built for the iris data set. Binding all non-test data to the\n",
- "pipeline model:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "mach_pipe = machine(pipe, X, y)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We already encountered the `partition` method above. Here we apply\n",
- "it to row indices, instead of data containers, as `fit!` and\n",
- "`predict` only need a *view* of the data to work."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "train, validation = partition(1:length(y), 0.7)\n",
- "fit!(mach_pipe, rows=train)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We note in passing that we can access two kinds of information from a trained machine:"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "- The **learned parameters** (eg, coefficients of a linear model): We use `fitted_params(mach_pipe)`\n",
- "- Other **by-products of training** (eg, feature importances): We use `report(mach_pipe)`"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "fp = fitted_params(mach_pipe);\n",
- "keys(fp)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For example, we can check that the encoder did not actually drop any features:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "Set(fp.continuous_encoder.features_to_keep) == Set(schema(X).names)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "And, from the report, extract feature importances:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "rpt = report(mach_pipe)\n",
- "keys(rpt.evo_tree_classifier)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "fi = rpt.evo_tree_classifier.feature_importances\n",
- "feature_importance_table =\n",
- " (feature=Symbol.(first.(fi)), importance=last.(fi)) |> DataFrames.DataFrame"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For models not reporting feature importances, we recommend the\n",
- "[Shapley.jl](https://expandingman.gitlab.io/Shapley.jl/) package."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Returning to predictions and evaluations of our measures:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "ŷ = predict(mach_pipe, rows=validation);\n",
- "@info(\"Measurements\",\n",
- " brier_loss(ŷ, y[validation]) |> mean,\n",
- " auc(ŷ, y[validation]),\n",
- " accuracy(mode.(ŷ), y[validation])\n",
- " )"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Note that we need `mode` in the last case because `accuracy` expects\n",
- "point predictions, not probabilistic ones. (One can alternatively\n",
- "use `predict_mode` to generate the predictions.)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "While we're here, lets also generate a **confusion matrix** and\n",
- "[receiver-operator\n",
- "characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)\n",
- "(ROC):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "confmat(mode.(ŷ), y[validation])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Note: Importing the plotting package and calling the plotting\n",
- "functions for the first time can take a minute or so."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "using Plots"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "roc_curve = roc(ŷ, y[validation])\n",
- "plt = scatter(roc_curve, legend=false)\n",
- "plot!(plt, xlab=\"false positive rate\", ylab=\"true positive rate\")\n",
- "plot!([0, 1], [0, 1], linewidth=2, linestyle=:dash, color=:black)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Automated performance evaluation (more typical workflow)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can also get performance estimates with a single call to the\n",
- "`evaluate` function, which also allows for more complicated\n",
- "resampling - in this case stratified cross-validation. To make this\n",
- "more comprehensive, we set `repeats=3` below to make our\n",
- "cross-validation \"Monte Carlo\" (3 random size-6 partitions of the\n",
- "observation space, for a total of 18 folds) and set\n",
- "`acceleration=CPUThreads()` to parallelize the computation."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We choose a `StratifiedCV` resampling strategy; the complete list of options is\n",
- "[here](https://juliaai.github.io/MLJ.jl/dev/evaluating_model_performance/#Built-in-resampling-strategies)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "e_pipe = evaluate(pipe, X, y,\n",
- " resampling=StratifiedCV(nfolds=6, rng=123),\n",
- " measures=[brier_loss, auc, accuracy],\n",
- " repeats=3,\n",
- " acceleration=CPUThreads())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "(There is also a version of `evaluate` for machines. Query the\n",
- "`evaluate` and `evaluate!` doc-strings to learn more about these\n",
- "functions and what the `PerformanceEvaluation` object `e_pipe` records.)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "While [less than ideal](https://arxiv.org/abs/2104.00673), let's\n",
- "adopt the common practice of using the standard error of a\n",
- "cross-validation score as an estimate of the uncertainty of a\n",
- "performance measure's expected value. Here's a utility function to\n",
- "calculate 95% confidence intervals for our performance estimates based\n",
- "on this practice, and it's application to the current evaluation:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "using Measurements"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "function confidence_intervals(e)\n",
- " factor = 2.0 # to get level of 95%\n",
- " measure = e.measure\n",
- " nfolds = length(e.per_fold[1])\n",
- " measurement = [e.measurement[j] ± factor*std(e.per_fold[j])/sqrt(nfolds - 1)\n",
- " for j in eachindex(measure)]\n",
- " table = (measure=measure, measurement=measurement)\n",
- " return DataFrames.DataFrame(table)\n",
- "end\n",
- "\n",
- "const confidence_intervals_basic_model = confidence_intervals(e_pipe)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Filtering out unimportant features"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "> Introduces: `FeatureSelector`"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Before continuing, we'll modify our pipeline to drop those features\n",
- "with low feature importance, to speed up later optimization:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "unimportant_features = filter(:importance => <(0.005), feature_importance_table).feature\n",
- "\n",
- "pipe2 = ContinuousEncoder() |>\n",
- " FeatureSelector(features=unimportant_features, ignore=true) |> booster"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Wrapping our iterative model in control strategies"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "> Introduces: **control strategies:** `Step`, `NumberSinceBest`, `TimeLimit`, `InvalidValue`, **model wrapper** `IteratedModel`, **resampling strategy:** `Holdout`"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We want to optimize the hyper-parameters of our model. Since our\n",
- "model is iterative, these parameters include the (nested) iteration\n",
- "parameter `pipe.evo_tree_classifier.nrounds`. Sometimes this\n",
- "parameter is optimized first, fixed, and then maybe optimized again\n",
- "after the other parameters. Here we take a more principled approach,\n",
- "**wrapping our model in a control strategy** that makes it\n",
- "\"self-iterating\". The strategy applies a stopping criterion to\n",
- "*out-of-sample* estimates of the model performance, constructed\n",
- "using an internally constructed holdout set. In this way, we avoid\n",
- "some data hygiene issues, and, when we subsequently optimize other\n",
- "parameters, we will always being using an optimal number of\n",
- "iterations."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Note that this approach can be applied to any iterative MLJ model,\n",
- "eg, the neural network models provided by\n",
- "[MLJFlux.jl](https://github.com/FluxML/MLJFlux.jl)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "First, we select appropriate controls from [this\n",
- "list](https://juliaai.github.io/MLJ.jl/dev/controlling_iterative_models/#Controls-provided):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "controls = [\n",
- " Step(1), # to increment iteration parameter (`pipe.nrounds`)\n",
- " NumberSinceBest(4), # main stopping criterion\n",
- " TimeLimit(2/3600), # never train more than 2 sec\n",
- " InvalidValue() # stop if NaN or ±Inf encountered\n",
- "]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now we wrap our pipeline model using the `IteratedModel` wrapper,\n",
- "being sure to specify the `measure` on which internal estimates of\n",
- "the out-of-sample performance will be based:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "iterated_pipe = IteratedModel(model=pipe2,\n",
- " controls=controls,\n",
- " measure=brier_loss,\n",
- " resampling=Holdout(fraction_train=0.7))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We've set `resampling=Holdout(fraction_train=0.7)` to arrange that\n",
- "data attached to our model should be internally split into a train\n",
- "set (70%) and a holdout set (30%) for determining the out-of-sample\n",
- "estimate of the Brier loss."
- ]
- },
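-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "(As an aside, `Holdout` also accepts `shuffle` and `rng` options,\n",
-    "should a reproducible shuffled split be preferred; an illustration\n",
-    "only, not used here:)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# illustration only: a shuffled, reproducible internal split\n",
-    "Holdout(fraction_train=0.7, shuffle=true, rng=123)"
-   ]
-  },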
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For demonstration purposes, let's bind `iterated_model` to all data\n",
- "not in our don't-touch holdout set, and train on all of that data:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "mach_iterated_pipe = machine(iterated_pipe, X, y)\n",
- "fit!(mach_iterated_pipe);"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "To recap, internally this training is split into two separate steps:"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "- A controlled iteration step, training on the holdout set, with the total number of iterations determined by the specified stopping criteria (based on the out-of-sample performance estimates)\n",
- "- A final step that trains the atomic model on *all* available\n",
- " data using the number of iterations determined in the first step. Calling `predict` on `mach_iterated_pipe` means using the learned parameters of the second step."
- ]
- },
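-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We can check how many iterations were actually applied by querying\n",
-    "the machine's report. (We are assuming here that the relevant report\n",
-    "field is called `n_iterations`; inspect `report(mach_iterated_pipe)`\n",
-    "if your version differs.)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# `n_iterations` is assumed to be the report field recording the\n",
-    "# total number of iterations applied by `IteratedModel`:\n",
-    "report(mach_iterated_pipe).n_iterations"
-   ]
-  },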
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Hyper-parameter optimization (model tuning)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "> Introduces: `range`, **model wrapper** `TunedModel`, `RandomSearch`"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We now turn to hyper-parameter optimization. A tool not discussed\n",
- "here is the `learning_curve` function, which can be useful when\n",
- "wanting to visualize the effect of changes to a *single*\n",
- "hyper-parameter (which could be an iteration parameter). See, for\n",
- "example, [this section of the\n",
- "manual](https://juliaai.github.io/MLJ.jl/dev/learning_curves/)\n",
- "or [this\n",
- "tutorial](https://github.com/ablaom/MLJTutorial.jl/blob/dev/notebooks/04_tuning/notebook.ipynb)."
- ]
- },
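-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "For a flavour of `learning_curve`, here is a sketch, commented out\n",
-    "because the one-dimensional range `r` is hypothetical (ranges are\n",
-    "constructed with `range`, as demonstrated shortly):"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# sketch only; `r` is a hypothetical one-dimensional range:\n",
-    "# curve = learning_curve(mach_iterated_pipe;\n",
-    "#                        range=r, resampling=Holdout(), measure=brier_loss)\n",
-    "# plot(curve.parameter_values, curve.measurements,\n",
-    "#      xlab=curve.parameter_name, ylab=\"Brier loss\")"
-   ]
-  },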
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Fine tuning the hyper-parameters of a gradient booster can be\n",
- "somewhat involved. Here we settle for simultaneously optimizing two\n",
- "key parameters: `max_depth` and `η` (learning_rate)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Like iteration control, **model optimization in MLJ is implemented as\n",
- "a model wrapper**, called `TunedModel`. After wrapping a model in a\n",
- "tuning strategy and binding the wrapped model to data in a machine\n",
- "called `mach`, calling `fit!(mach)` instigates a search for optimal\n",
- "model hyperparameters, within a specified range, and then uses all\n",
- "supplied data to train the best model. To predict using that model,\n",
- "one then calls `predict(mach, Xnew)`. In this way the wrapped model\n",
- "may be viewed as a \"self-tuning\" version of the unwrapped\n",
- "model. That is, wrapping the model simply transforms certain\n",
- "hyper-parameters into learned parameters (just as `IteratedModel`\n",
- "does for an iteration parameter)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "To start with, we define ranges for the parameters of\n",
- "interest. Since these parameters are nested, let's force a\n",
- "display of our model to a larger depth:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "show(iterated_pipe, 2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "p1 = :(model.evo_tree_classifier.η)\n",
- "p2 = :(model.evo_tree_classifier.max_depth)\n",
- "\n",
- "r1 = range(iterated_pipe, p1, lower=-2, upper=-0.5, scale=x->10^x)\n",
- "r2 = range(iterated_pipe, p2, lower=2, upper=6)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Nominal ranges are defined by specifying `values` instead of `lower`\n",
- "and `upper`."
- ]
- },
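-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "For example, here's a nominal range over the booster's `nbins`\n",
-    "hyper-parameter (an illustration only; `nbins` is not tuned in this\n",
-    "tutorial):"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# illustration only: a nominal range is specified with `values`\n",
-    "r_nominal = range(iterated_pipe, :(model.evo_tree_classifier.nbins),\n",
-    "                  values=[32, 64, 128])"
-   ]
-  },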
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next, we choose an optimization strategy from [this\n",
- "list](https://juliaai.github.io/MLJ.jl/dev/tuning_models/#Tuning-Models):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "tuning = RandomSearch(rng=123)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Then we wrap the model, specifying a `resampling` strategy and a\n",
- "`measure`, as we did for `IteratedModel`. In fact, we can include a\n",
- "battery of `measures`; by default, optimization is with respect to\n",
- "performance estimates based on the first measure, but estimates for\n",
- "all measures can be accessed from the model's `report`."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The keyword `n` specifies the total number of models (sets of\n",
- "hyper-parameters) to evaluate."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "tuned_iterated_pipe = TunedModel(model=iterated_pipe,\n",
- " range=[r1, r2],\n",
- " tuning=tuning,\n",
- " measures=[brier_loss, auc, accuracy],\n",
- " resampling=StratifiedCV(nfolds=6, rng=123),\n",
- " acceleration=CPUThreads(),\n",
- " n=40)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "To save time, we skip the `repeats` here."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Binding our final model to data and training:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "mach_tuned_iterated_pipe = machine(tuned_iterated_pipe, X, y)\n",
- "fit!(mach_tuned_iterated_pipe)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As explained above, the training we have just performed was split\n",
- "internally into two separate steps:"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "- A step to determine the parameter values that optimize the aggregated cross-validation scores\n",
- "- A final step that trains the optimal model on *all* available data. Future predictions `predict(mach_tuned_iterated_pipe, ...)` are based on this final training step."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "From `report(mach_tuned_iterated_pipe)` we can extract details about\n",
- "the optimization procedure. For example:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "rpt2 = report(mach_tuned_iterated_pipe);\n",
- "best_booster = rpt2.best_model.model.evo_tree_classifier"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "@info \"Optimal hyper-parameters:\" best_booster.max_depth best_booster.η;"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Using the `confidence_intervals` function we defined earlier:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "e_best = rpt2.best_history_entry\n",
- "confidence_intervals(e_best)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Digging a little deeper, we can learn what stopping criterion was\n",
- "applied in the case of the optimal model, and how many iterations\n",
- "were required:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "rpt2.best_report.controls |> collect"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Finally, we can visualize the optimization results:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "plot(mach_tuned_iterated_pipe, size=(600,450))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Saving our model"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "> Introduces: `MLJ.save`"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Here's how to serialize our final, trained self-iterating,\n",
- "self-tuning pipeline machine:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "MLJ.save(\"tuned_iterated_pipe.jlso\", mach_tuned_iterated_pipe)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We'll deserialize this in \"Testing the final model\" below."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Final performance estimate"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Finally, to get an even more accurate estimate of performance, we\n",
- "can evaluate our model using stratified cross-validation and all the\n",
- "data attached to our machine. Because this evaluation implies\n",
- "[nested\n",
- "resampling](https://mlr.mlr-org.com/articles/tutorial/nested_resampling.html),\n",
- "this computation takes quite a bit longer than the previous one\n",
- "(which is being repeated six times, using 5/6th of the data each\n",
- "time):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "e_tuned_iterated_pipe = evaluate(tuned_iterated_pipe, X, y,\n",
- " resampling=StratifiedCV(nfolds=6, rng=123),\n",
- " measures=[brier_loss, auc, accuracy])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "confidence_intervals(e_tuned_iterated_pipe)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For comparison, here are the confidence intervals for the basic\n",
- "pipeline model (no feature selection and default hyperparameters):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "confidence_intervals_basic_model"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As each pair of intervals overlap, it's doubtful the small changes\n",
- "here can be assigned statistical significance. Default `booster`\n",
- "hyper-parameters do a pretty good job."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Testing the final model"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We now determine the performance of our model on our\n",
- "lock-and-throw-away-the-key holdout set. To demonstrate\n",
- "deserialization, we'll pretend we're in a new Julia session (but\n",
- "have called `import`/`using` on the same packages). Then the\n",
- "following should suffice to recover our model trained under\n",
- "\"Hyper-parameter optimization\" above:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "mach_restored = machine(\"tuned_iterated_pipe.jlso\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We compute predictions on the holdout set:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "ŷ_tuned = predict(mach_restored, Xtest);\n",
- "ŷ_tuned[1]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "And can compute the final performance measures:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "@info(\"Tuned model measurements on test:\",\n",
- " brier_loss(ŷ_tuned, ytest) |> mean,\n",
- " auc(ŷ_tuned, ytest),\n",
- " accuracy(mode.(ŷ_tuned), ytest)\n",
- " )"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For comparison, here's the performance for the basic pipeline model"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "mach_basic = machine(pipe, X, y)\n",
- "fit!(mach_basic, verbosity=0)\n",
- "\n",
- "ŷ_basic = predict(mach_basic, Xtest);\n",
- "\n",
- "@info(\"Basic model measurements on test set:\",\n",
- " brier_loss(ŷ_basic, ytest) |> mean,\n",
- " auc(ŷ_basic, ytest),\n",
- " accuracy(mode.(ŷ_basic), ytest)\n",
- " )"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "---\n",
- "\n",
- "*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Julia 1.6.5",
- "language": "julia",
- "name": "julia-1.6"
- },
- "language_info": {
- "file_extension": ".jl",
- "mimetype": "application/julia",
- "name": "julia",
- "version": "1.6.5"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 3
-}
diff --git a/examples/telco/scitypes.png b/examples/telco/scitypes.png
deleted file mode 100644
index 8e04fadf1..000000000
Binary files a/examples/telco/scitypes.png and /dev/null differ
diff --git a/examples/telco/tuned_iterated_pipe.jlso b/examples/telco/tuned_iterated_pipe.jlso
deleted file mode 100644
index 0ce900a29..000000000
Binary files a/examples/telco/tuned_iterated_pipe.jlso and /dev/null differ
diff --git a/material/DFKI.jpg b/material/DFKI.jpg
new file mode 100644
index 000000000..f845efd9d
Binary files /dev/null and b/material/DFKI.jpg differ
diff --git a/material/DFKI.png b/material/DFKI.png
new file mode 100644
index 000000000..8c359e352
Binary files /dev/null and b/material/DFKI.png differ