From a89faf7b479f1f4b359ba7f7a4791b529da455b0 Mon Sep 17 00:00:00 2001 From: abhro <5664668+abhro@users.noreply.github.com> Date: Sun, 21 Apr 2024 22:57:50 -0400 Subject: [PATCH 01/24] Use repl language tag for sample --- docs/src/transformers.md | 31 +++++++++++++++---------------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/docs/src/transformers.md b/docs/src/transformers.md index f03cdb92f..9b868e967 100644 --- a/docs/src/transformers.md +++ b/docs/src/transformers.md @@ -193,18 +193,17 @@ K-means clustering algorithm assigns one of three labels 1, 2, 3 to the input features of the iris data set and compares them with the actual species recorded in the target (not seen by the algorithm). -```julia -import Random.seed! -seed!(123) +```julia-repl +julia> import Random.seed! +julia> seed!(123) -X, y = @load_iris; -KMeans = @load KMeans pkg=ParallelKMeans -kmeans = KMeans() -mach = machine(kmeans, X) |> fit! +julia> X, y = @load_iris; +julia> KMeans = @load KMeans pkg=ParallelKMeans +julia> kmeans = KMeans() +julia> mach = machine(kmeans, X) |> fit! -# transforming: -Xsmall = transform(mach); -selectrows(Xsmall, 1:4) |> pretty +julia> # transforming: +julia> Xsmall = transform(mach); julia> selectrows(Xsmall, 1:4) |> pretty ┌─────────────────────┬────────────────────┬────────────────────┐ │ x1 │ x2 │ x3 │ @@ -217,10 +216,10 @@ julia> selectrows(Xsmall, 1:4) |> pretty │ 0.26919199999998966 │ 26.28656804733727 │ 11.64392098898145 │ └─────────────────────┴────────────────────┴────────────────────┘ -# predicting: -yhat = predict(mach); -compare = zip(yhat, y) |> collect; -compare[1:8] +julia> # predicting: +julia> yhat = predict(mach); +julia> compare = zip(yhat, y) |> collect; +julia> compare[1:8] 8-element Array{Tuple{CategoricalValue{Int64,UInt32},CategoricalString{UInt32}},1}: (1, "setosa") (1, "setosa") @@ -231,7 +230,7 @@ compare[1:8] (1, "setosa") (1, "setosa") -compare[51:58] +julia> compare[51:58] 8-element Array{Tuple{CategoricalValue{Int64,UInt32},CategoricalString{UInt32}},1}: (2, "versicolor") (3, "versicolor") @@ -242,7 +241,7 @@ compare[51:58] (3, "versicolor") (3, "versicolor") -compare[101:108] +julia> compare[101:108] 8-element Array{Tuple{CategoricalValue{Int64,UInt32},CategoricalString{UInt32}},1}: (2, "virginica") (3, "virginica") From 8e45385760e5a87136afd6f3442287514c6b459b Mon Sep 17 00:00:00 2001 From: Abhro <5664668+abhro@users.noreply.github.com> Date: Mon, 22 Apr 2024 17:00:02 -0400 Subject: [PATCH 02/24] Update language tags for code samples --- docs/src/about_mlj.md | 10 +-- docs/src/common_mlj_workflows.md | 13 ++-- docs/src/controlling_iterative_models.md | 1 - docs/src/evaluating_model_performance.md | 12 ++-- docs/src/getting_started.md | 10 +-- docs/src/internals.md | 2 +- docs/src/learning_curves.md | 2 +- docs/src/learning_networks.md | 79 +++++++++++------------ docs/src/linear_pipelines.md | 2 +- docs/src/loading_model_code.md | 8 +-- docs/src/machines.md | 4 +- docs/src/mlj_cheatsheet.md | 52 +++++++++------ docs/src/model_search.md | 14 ++-- docs/src/preparing_data.md | 36 +++++------ docs/src/simple_user_defined_models.md | 33 +++------- docs/src/weights.md | 4 +- docs/src/working_with_categorical_data.md | 34 +++++----- 17 files changed, 155 insertions(+), 161 deletions(-) diff --git a/docs/src/about_mlj.md b/docs/src/about_mlj.md index a33daba26..bf63f8dcc 100755 --- a/docs/src/about_mlj.md +++ b/docs/src/about_mlj.md @@ -110,7 +110,7 @@ X, y = @load_reduced_ames; Evaluating the "self-tuning" pipeline model's performance using 
5-fold cross-validation (implies multiple layers of nested resampling): -```julia +```julia-repl julia> evaluate(self_tuning_pipe, X, y, measures=[l1, l2], resampling=CV(nfolds=5, rng=123), @@ -229,19 +229,19 @@ installed in a new [environment](https://julialang.github.io/Pkg.jl/v1/environments/) to avoid package conflicts. You can do this with -```julia +```julia-repl julia> using Pkg; Pkg.activate("my_MLJ_env", shared=true) ``` Installing MLJ is also done with the package manager: -```julia +```julia-repl julia> Pkg.add("MLJ") ``` **Optional:** To test your installation, run -```julia +```julia-repl julia> Pkg.test("MLJ") ``` @@ -252,7 +252,7 @@ environment to make model-specific code available. This happens automatically when you use MLJ's interactive load command `@iload`, as in -```julia +```julia-repl julia> Tree = @iload DecisionTreeClassifier # load type julia> tree = Tree() # instance ``` diff --git a/docs/src/common_mlj_workflows.md b/docs/src/common_mlj_workflows.md index 2b7cfaec9..5c684fcc1 100644 --- a/docs/src/common_mlj_workflows.md +++ b/docs/src/common_mlj_workflows.md @@ -14,9 +14,9 @@ channing = (Sex = rand(["Male","Female"], 462), coerce!(channing, :Sex => Multiclass) ``` -```julia -import RDatasets -channing = RDatasets.dataset("boot", "channing") +```julia-repl +julia> import RDatasets +julia> channing = RDatasets.dataset("boot", "channing") julia> first(channing, 4) 4×5 DataFrame @@ -40,7 +40,7 @@ Horizontally splitting data and shuffling rows. Here `y` is the `:Exit` column and `X` everything else: ```@example workflows -y, X = unpack(channing, ==(:Exit), rng=123); +y, X = unpack(channing, ==(:Exit), rng=123); nothing # hide ``` @@ -514,7 +514,8 @@ curve = learning_curve(mach, ```julia using Plots -plot(curve.parameter_values, curve.measurements, xlab=curve.parameter_name, xscale=curve.parameter_scale) +plot(curve.parameter_values, curve.measurements, + xlab=curve.parameter_name, xscale=curve.parameter_scale) ``` ![](img/workflows_learning_curve.png) @@ -534,7 +535,7 @@ curve = learning_curve(mach, ```julia plot(curve.parameter_values, curve.measurements, -xlab=curve.parameter_name, xscale=curve.parameter_scale) + xlab=curve.parameter_name, xscale=curve.parameter_scale) ``` ![](img/workflows_learning_curves.png) diff --git a/docs/src/controlling_iterative_models.md b/docs/src/controlling_iterative_models.md index 77924219d..e0cabdb40 100644 --- a/docs/src/controlling_iterative_models.md +++ b/docs/src/controlling_iterative_models.md @@ -253,7 +253,6 @@ In the code, `wrapper` is an object that wraps the training machine in this example). 
```julia - import IterationControl # or MLJ.IterationControl struct IterateFromList diff --git a/docs/src/evaluating_model_performance.md b/docs/src/evaluating_model_performance.md index 448283c57..c378b46ec 100644 --- a/docs/src/evaluating_model_performance.md +++ b/docs/src/evaluating_model_performance.md @@ -27,7 +27,7 @@ using MLJ X = (a=rand(12), b=rand(12), c=rand(12)); y = X.a + 2X.b + 0.05*rand(12); model = (@load RidgeRegressor pkg=MultivariateStats verbosity=0)() -cv=CV(nfolds=3) +cv = CV(nfolds=3) evaluate(model, X, y, resampling=cv, measure=l2, verbosity=0) ``` @@ -51,8 +51,8 @@ Multiple measures are specified as a vector: evaluate!( mach, resampling=cv, - measures=[l1, rms, rmslp1], - verbosity=0, + measures=[l1, rms, rmslp1], + verbosity=0, ) ``` @@ -70,7 +70,7 @@ evaluate!( mach, resampling=CV(nfolds=3), measure=[l2, rsquared], - weights=weights, + weights=weights, ) ``` @@ -91,8 +91,8 @@ fold1 = 1:6; fold2 = 7:12; evaluate!( mach, resampling = [(fold1, fold2), (fold2, fold1)], - measures=[l1, l2], - verbosity=0, + measures=[l1, l2], + verbosity=0, ) ``` diff --git a/docs/src/getting_started.md b/docs/src/getting_started.md index 8a44c8c6a..6ddf5f89f 100644 --- a/docs/src/getting_started.md +++ b/docs/src/getting_started.md @@ -5,14 +5,14 @@ For an outline of MLJ's **goals** and **features**, see This page introduces some MLJ basics, assuming some familiarity with machine learning. For a complete list of other MLJ learning resources, -see [Learning MLJ](@ref). +see [Learning MLJ](@ref). MLJ collects together the functionality provided by mutliple packages. To learn how to install components separately, run `using MLJ; @doc MLJ`. This section introduces only the most basic MLJ operations and concepts. It assumes MLJ has been successfully installed. See -[Installation](@ref) if this is not the case. +[Installation](@ref) if this is not the case. 
```@setup doda @@ -31,7 +31,7 @@ column vectors: ```@repl doda using MLJ iris = load_iris(); -selectrows(iris, 1:3) |> pretty +selectrows(iris, 1:3) |> pretty schema(iris) ``` @@ -114,8 +114,8 @@ computing the mode of each prediction): ```@repl doda evaluate(tree, X, y, resampling=CV(shuffle=true), - measures=[log_loss, accuracy], - verbosity=0) + measures=[log_loss, accuracy], + verbosity=0) ``` Under the hood, `evaluate` calls lower level functions `predict` or diff --git a/docs/src/internals.md b/docs/src/internals.md index f31d4ede8..47bdf5335 100755 --- a/docs/src/internals.md +++ b/docs/src/internals.md @@ -49,7 +49,7 @@ function fit!(mach::Machine; rows=nothing, force=false, verbosity=1) end rows_have_changed = (!isdefined(mach, :previous_rows) || - rows != mach.previous_rows) + rows != mach.previous_rows) args = [MLJ.selectrows(arg, rows) for arg in mach.args] diff --git a/docs/src/learning_curves.md b/docs/src/learning_curves.md index 42847171a..19f011dc0 100644 --- a/docs/src/learning_curves.md +++ b/docs/src/learning_curves.md @@ -48,7 +48,7 @@ used using `rngs=...` (an integer automatically generates the number specified): ```@example hooking -atom.lambda= 7.3 +atom.lambda = 7.3 r_n = range(ensemble, :n, lower=1, upper=50) curves = MLJ.learning_curve(mach; range=r_n, diff --git a/docs/src/learning_networks.md b/docs/src/learning_networks.md index 46e688941..84007be3f 100644 --- a/docs/src/learning_networks.md +++ b/docs/src/learning_networks.md @@ -320,18 +320,18 @@ has the same signature as `MLJModelInterface.fit`): import MLJBase function MLJBase.prefit(composite::CompositeA, verbosity, X, y) - # the learning network from above: - Xs = source(X) - ys = source(y) - mach1 = machine(:preprocessor, Xs) - x = transform(mach1, Xs) - mach2 = machine(:classifier, x, ys) - yhat = predict(mach2, x) - - verbosity > 0 && @info "I'm a noisy fellow!" - - # return "learning network interface": - return (; predict=yhat) + # the learning network from above: + Xs = source(X) + ys = source(y) + mach1 = machine(:preprocessor, Xs) + x = transform(mach1, Xs) + mach2 = machine(:classifier, x, ys) + yhat = predict(mach2, x) + + verbosity > 0 && @info "I'm a noisy fellow!" + + # return "learning network interface": + return (; predict=yhat) end ``` @@ -594,10 +594,10 @@ using MLJ import MLJBase mutable struct CompositeE <: DeterministicNetworkComposite - clusterer # `:kmeans` or `:kmedoids` - k::Int # number of clusters - solver # a ridge regression parameter we want to expose - c::Float64 # a "coupling" coefficient + clusterer # `:kmeans` or `:kmedoids` + k::Int # number of clusters + solver # a ridge regression parameter we want to expose + c::Float64 # a "coupling" coefficient end ``` @@ -610,26 +610,26 @@ KMedoids = @load KMedoids pkg=Clustering verbosity=0 function MLJBase.prefit(composite::CompositeE, verbosity, X, y) - Xs = source(X) - ys = source(y) + Xs = source(X) + ys = source(y) - k = composite.k - solver = composite.solver - c = composite.c + k = composite.k + solver = composite.solver + c = composite.c - clusterer = composite.clusterer == :kmeans ? KMeans(; k) : KMedoids(; k) - mach1 = machine(clusterer, Xs) - Xsmall = transform(mach1, Xs) + clusterer = composite.clusterer == :kmeans ? 
KMeans(; k) : KMedoids(; k) + mach1 = machine(clusterer, Xs) + Xsmall = transform(mach1, Xs) - # the coupling - ridge regularization depends on the number of - # clusters `k` and the coupling coefficient `c`: - lambda = exp(-c/k) + # the coupling - ridge regularization depends on the number of + # clusters `k` and the coupling coefficient `c`: + lambda = exp(-c/k) - ridge = RidgeRegressor(; lambda, solver) - mach2 = machine(ridge, Xsmall, ys) - yhat = predict(mach2, Xsmall) + ridge = RidgeRegressor(; lambda, solver) + mach2 = machine(ridge, Xsmall, ys) + yhat = predict(mach2, Xsmall) - return (predict=yhat,) + return (predict=yhat,) end ``` @@ -748,20 +748,19 @@ Q = @node sqrt(Z) (so that `Q() == 4`). Here's a more complicated application of `@node` to row-shuffle a table: -```julia -using Random -X = (x1 = [1, 2, 3, 4, 5], - x2 = [:one, :two, :three, :four, :five]) -rows(X) = 1:nrows(X) +```julia-repl +julia> using Random +julia> X = (x1 = [1, 2, 3, 4, 5], + x2 = [:one, :two, :three, :four, :five]) +julia> rows(X) = 1:nrows(X) -Xs = source(X) -rs = @node rows(Xs) -W = @node selectrows(Xs, @node shuffle(rs)) +julia> Xs = source(X) +julia> rs = @node rows(Xs) +julia> W = @node selectrows(Xs, @node shuffle(rs)) julia> W() (x1 = [5, 1, 3, 2, 4], x2 = Symbol[:five, :one, :three, :two, :four],) - ``` **Important.** An argument not in global scope is assumed by `@node` to be a node or diff --git a/docs/src/linear_pipelines.md b/docs/src/linear_pipelines.md index 9ab9fc283..394e473be 100644 --- a/docs/src/linear_pipelines.md +++ b/docs/src/linear_pipelines.md @@ -29,7 +29,7 @@ model type `KNNRegressor` assumes the features are all `X` with `coerce(X, :age=>Continuous)` - standardizing continuous features and one-hot encoding the `Multiclass` features using the `ContinuousEncoder` model - + However, we can avoid separately applying these preprocessing steps (two of which require `fit!` steps) by combining them with the supervised `KKNRegressor` model in a new *pipeline* model, using diff --git a/docs/src/loading_model_code.md b/docs/src/loading_model_code.md index f3ce17a1e..850e5a262 100644 --- a/docs/src/loading_model_code.md +++ b/docs/src/loading_model_code.md @@ -32,7 +32,7 @@ provided by the package. Then, to determine which package provides the MLJ interface you call `load_path`: -```julia +```julia-repl julia> load_path("DecisionTreeClassifier", pkg="DecisionTree") "MLJDecisionTreeInterface.DecisionTreeClassifier" ``` @@ -41,7 +41,7 @@ In this case, we see that the package required is MLJDecisionTreeInterface.jl. If this package is not in `my_env` (do `Pkg.status()` to check) you add it by running -```julia +```julia-repl julia> Pkg.add("MLJDecisionTreeInterface"); ``` @@ -49,14 +49,14 @@ So long as `my_env` is the active environment, this action need never be repeated (unless you run `Pkg.rm("MLJDecisionTreeInterface")`). 
You are now ready to instantiate a decision tree classifier: -```julia +```julia-repl julia> Tree = @load DecisionTree pkg=DecisionTree julia> tree = Tree() ``` which is equivalent to -```julia +```julia-repl julia> import MLJDecisionTreeInterface.DecisionTreeClassifier julia> Tree = MLJDecisionTreeInterface.DecisionTreeClassifier julia> tree = Tree() diff --git a/docs/src/machines.md b/docs/src/machines.md index 68eb9cddc..7ad56c935 100644 --- a/docs/src/machines.md +++ b/docs/src/machines.md @@ -33,7 +33,7 @@ Generally, changing a hyperparameter triggers retraining on calls to subsequent `fit!`: ```@repl machines -forest.bagging_fraction=0.5 +forest.bagging_fraction = 0.5 fit!(mach, verbosity=2); ``` @@ -41,7 +41,7 @@ However, for this iterative model, increasing the iteration parameter only adds models to the existing ensemble: ```@repl machines -forest.n=15 +forest.n = 15 fit!(mach, verbosity=2); ``` diff --git a/docs/src/mlj_cheatsheet.md b/docs/src/mlj_cheatsheet.md index 397c7690b..1c0573d42 100644 --- a/docs/src/mlj_cheatsheet.md +++ b/docs/src/mlj_cheatsheet.md @@ -34,14 +34,15 @@ With additional conditions: models() do model matching(model, X, y) && model.prediction_type == :probabilistic && - model.is_pure_julia + model.is_pure_julia end ``` -`Tree = @load DecisionTreeClassifier pkg=DecisionTree` imports "DecisionTreeClassifier" type and binds it to `Tree` -`tree = Tree()` to instantiate a `Tree`. +`Tree = @load DecisionTreeClassifier pkg=DecisionTree` imports "DecisionTreeClassifier" type and binds it to `Tree`. -`tree2 = Tree(max_depth=2)` instantiates a tree with different hyperparameter +`tree = Tree()` to instantiate a `Tree`. + +`tree2 = Tree(max_depth=2)` instantiates a tree with different hyperparameter `Ridge = @load RidgeRegressor pkg=MultivariateStats` imports a type for a model provided by multiple packages @@ -96,20 +97,26 @@ y, X = unpack(channing, Splitting row indices into train/validation/test, with seeded shuffling: -`train, valid, test = partition(eachindex(y), 0.7, 0.2, rng=1234)` for 70:20:10 ratio +```julia-repl +julia> train, valid, test = partition(eachindex(y), 0.7, 0.2, rng=1234) # for 70:20:10 ratio +``` For a stratified split: -`train, test = partition(eachindex(y), 0.8, stratify=y)` +```julia-repl +julia> train, test = partition(eachindex(y), 0.8, stratify=y) +``` Split a table or matrix `X`, instead of indices: -`Xtrain, Xvalid, Xtest = partition(X, 0.5, 0.3, rng=123)` +```julia-repl +julia> Xtrain, Xvalid, Xtest = partition(X, 0.5, 0.3, rng=123) +``` Getting data from [OpenML](https://www.openml.org): - -`table = OpenML.load(91)` - +```julia-repl +julia> table = OpenML.load(91) +``` Creating synthetic classification data: `X, y = make_blobs(100, 2)` (also: `make_moons`, `make_circles`) @@ -121,12 +128,17 @@ Creating synthetic regression data: ## Machine construction Supervised case: - -`model = KNNRegressor(K=1)` and `mach = machine(model, X, y)` +```julia-repl +julia> model = KNNRegressor(K=1) +julia> mach = machine(model, X, y) +``` Unsupervised case: -`model = OneHotEncoder()` and `mach = machine(model, X)` +```julia-repl +julia> model = OneHotEncoder() +julia> mach = machine(model, X) +``` ## Fitting @@ -283,7 +295,7 @@ Externals include: `PCA` (in MultivariateStats), `KMeans`, `KMedoids` (in Cluste ## Pipelines -`pipe = (X -> coerce(X, :height=>Continuous)) |> OneHotEncoder |> KNNRegressor(K=3)` +`pipe = (X -> coerce(X, :height=>Continuous)) |> OneHotEncoder |> KNNRegressor(K=3)` Unsupervised: @@ -311,9 +323,10 @@ Supervised, with final node 
`yhat` returning point predictions: ```julia @from_network machine(Deterministic(), Xs, ys; predict=yhat) begin mutable struct Composite - reducer=network_pca - regressor=network_knn + reducer=network_pca + regressor=network_knn end +end ``` Here `network_pca` and `network_knn` are models appearing in the @@ -327,6 +340,7 @@ Supervised, with `yhat` final node returning probabilistic predictions: reducer=network_pca classifier=network_tree end +end ``` Unsupervised, with final node `Xout`: @@ -334,8 +348,8 @@ Unsupervised, with final node `Xout`: ```julia @from_network machine(Unsupervised(), Xs; transform=Xout) begin mutable struct Composite - reducer1=network_pca - reducer2=clusterer + reducer1=network_pca + reducer2=clusterer end end -```UnivariateTimeTypeToContinuous +``` diff --git a/docs/src/model_search.md b/docs/src/model_search.md index 1f3cfff49..b6e344fa7 100644 --- a/docs/src/model_search.md +++ b/docs/src/model_search.md @@ -5,7 +5,7 @@ properties, without loading all the packages containing model code. In turn, this allows one to efficiently find all models solving a given machine learning task. The task itself is specified with the help of the `matching` method, and the search executed with the `models` -methods, as detailed below. +methods, as detailed below. For commonly encountered problems with model search, see also [Preparing Data](@ref). @@ -33,27 +33,29 @@ info("PCA") So a "model" in the present context is just a named tuple containing metadata, and not an actual model type or instance. If two models with the same name occur in different packages, the package name must be -specified, as in `info("LinearRegressor", pkg="GLM")`. +specified, as in `info("LinearRegressor", pkg="GLM")`. Model document strings can be retreived, without importing the defining code, using the `doc` function: -``` +```julia doc("DecisionTreeClassifier", pkg="DecisionTree") ``` ## General model queries -We list all models (named tuples) using `models()`, and list the models for which code is already loaded with `localmodels()`: +We list all models (named tuples) using `models()`, and list the models for +which code is already loaded with `localmodels()`: ```@repl tokai localmodels() localmodels()[2] ``` -One can search for models containing specified strings or regular expressions in their `docstring` attributes, as in +One can search for models containing specified strings or regular expressions in +their `docstring` attributes, as in -```@repl tokai +```@repl tokai models("forest") ``` diff --git a/docs/src/preparing_data.md b/docs/src/preparing_data.md index e87260707..c796f96b2 100644 --- a/docs/src/preparing_data.md +++ b/docs/src/preparing_data.md @@ -2,9 +2,9 @@ ## Splitting data -MLJ has two tools for splitting data. To split data *vertically* (that -is, to split by observations) use [`partition`](@ref). This is commonly applied to a -vector of observation *indices*, but can also be applied to datasets +MLJ has two tools for splitting data. To split data *vertically* (that is, +to split by observations) use [`partition`](@ref). This is commonly applied to +a vector of observation *indices*, but can also be applied to datasets themselves, provided they are vectors, matrices or tables. 
To split tabular data *horizontally* (i.e., break up a table based on @@ -39,18 +39,17 @@ models(matching(X, y)) Or are unsure about the source of the following warning: -```julia -Tree = @load DecisionTreeRegressor pkg=DecisionTree verbosity=0 -tree = Tree(); -julia> machine(tree, X, y) +```julia-repl +julia> Tree = @load DecisionTreeRegressor pkg=DecisionTree verbosity=0 +julia> tree = Tree(); julia> machine(tree, X, y) -┌ Warning: The scitype of `X`, in `machine(model, X, ...)` is incompatible with `model=DecisionTreeRegressor @378`: +┌ Warning: The scitype of `X`, in `machine(model, X, ...)` is incompatible with `model=DecisionTreeRegressor @378`: │ scitype(X) = Table{Union{AbstractVector{Continuous}, AbstractVector{Count}, AbstractVector{Textual}, AbstractVector{Union{Missing, Textual}}}} │ input_scitype(model) = Table{var"#s46"} where var"#s46"<:Union{AbstractVector{var"#s9"} where var"#s9"<:Continuous, AbstractVector{var"#s9"} where var"#s9"<:Count, AbstractVector{var"#s9"} where var"#s9"<:OrderedFactor}. └ @ MLJBase ~/Dropbox/Julia7/MLJ/MLJBase/src/machines.jl:103 Machine{DecisionTreeRegressor,…} @198 trained 0 times; caches data - args: + args: 1: Source @628 ⏎ `Table{Union{AbstractVector{Continuous}, AbstractVector{Count}, AbstractVector{Textual}, AbstractVector{Union{Missing, Textual}}}}` 2: Source @544 ⏎ `AbstractVector{Continuous}` ``` @@ -75,16 +74,14 @@ intended scientific interpretation. If `height` in the above example is intended to be `Continuous`, `mark` is supposed to be `OrderedFactor`, and `admitted` a (binary) `Multiclass`, then we can do - - + ```@example poot X_coerced = coerce(X, :height=>Continuous, :mark=>OrderedFactor, :admitted=>Multiclass); schema(X_coerced) ``` **Data transformations:** We carry out conventional data -transformations, such as missing value imputation and feature -encoding: +transformations, such as missing value imputation and feature encoding: ```@example poot imputer = FillImputer() @@ -123,18 +120,17 @@ Also relevant is the section, [Working with Categorical Data](@ref). ## Data transformation -MLJ's Built-in transformers are documented at [Transformers and Other Unsupervised Models](@ref). The most relevant in the present context - are: [`ContinuousEncoder`](@ref), [`OneHotEncoder`](@ref), - [`FeatureSelector`](@ref) and [`FillImputer`](@ref). A Gaussian - mixture models imputer is provided by BetaML, which can be loaded - with +MLJ's Built-in transformers are documented at [Transformers and Other Unsupervised Models](@ref). +The most relevant in the present context are: [`ContinuousEncoder`](@ref), +[`OneHotEncoder`](@ref), [`FeatureSelector`](@ref) and [`FillImputer`](@ref). +A Gaussian mixture models imputer is provided by BetaML, which can be loaded with ```julia MissingImputator = @load MissingImputator pkg=BetaML ``` [This MLJ -Workshop](https://github.com/ablaom/MachineLearningInJulia2020), and the "End-to-end -examples" in [Data Science in Julia +Workshop](https://github.com/ablaom/MachineLearningInJulia2020), and +the "End-to-end examples" in [Data Science in Julia tutorials](https://alan-turing-institute.github.io/DataScienceTutorials.jl/) give further illustrations of data preprocessing in MLJ. 
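A minimal sketch of one of these built-in transformers in action, using a made-up
two-column table, shows how `ContinuousEncoder` can produce an all-`Continuous`
table of the kind needed to resolve scitype mismatches like the warning above:

```julia
using MLJ

# hypothetical table mixing Continuous and Multiclass columns:
X = (height = [152.0, 148.0, 163.0],
     gender = coerce(["male", "female", "female"], Multiclass))

encoder = ContinuousEncoder()      # one-hot encodes the Multiclass column
mach = machine(encoder, X) |> fit!
Xcont = transform(mach, X)
schema(Xcont)                      # all columns now have Continuous scitype
```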
diff --git a/docs/src/simple_user_defined_models.md b/docs/src/simple_user_defined_models.md index 57590701a..ffc64920b 100755 --- a/docs/src/simple_user_defined_models.md +++ b/docs/src/simple_user_defined_models.md @@ -38,9 +38,10 @@ For an unsupervised model, implement `transform` and, optionally, Here's a quick-and-dirty implementation of a ridge regressor with no intercept: -```julia +```@example regressor_example import MLJBase using LinearAlgebra +MLJBase.color_off() # hide mutable struct MyRegressor <: MLJBase.Deterministic lambda::Float64 @@ -51,32 +52,14 @@ MyRegressor(; lambda=0.1) = MyRegressor(lambda) function MLJBase.fit(model::MyRegressor, verbosity, X, y) x = MLJBase.matrix(X) # convert table to matrix fitresult = (x'x + model.lambda*I)\(x'y) # the coefficients - cache=nothing - report=nothing + cache = nothing + report = nothing return fitresult, cache, report end # predict uses coefficients to make a new prediction: MLJBase.predict(::MyRegressor, fitresult, Xnew) = MLJBase.matrix(Xnew) * fitresult -``` - -``` @setup regressor_example -using MLJ -import MLJBase -using LinearAlgebra -MLJBase.color_off() -mutable struct MyRegressor <: MLJBase.Deterministic - lambda::Float64 -end -MyRegressor(; lambda=0.1) = MyRegressor(lambda) -function MLJBase.fit(model::MyRegressor, verbosity, X, y) - x = MLJBase.matrix(X) - fitresult = (x'x + model.lambda*I)\(x'y) - cache=nothing - report=nothing - return fitresult, cache, report -end -MLJBase.predict(::MyRegressor, fitresult, Xnew) = MLJBase.matrix(Xnew) * fitresult +nothing # hide ``` After loading this code, all MLJ's basic meta-algorithms can be applied to `MyRegressor`: @@ -116,9 +99,9 @@ MLJBase.predict(model::MyClassifier, fitresult, Xnew) = [fitresult for r in 1:nrows(Xnew)] ``` -```julia -julia> X, y = @load_iris -julia> mach = fit!(machine(MyClassifier(), X, y)) +```julia-repl +julia> X, y = @load_iris; +julia> mach = fit!(machine(MyClassifier(), X, y)); julia> predict(mach, selectrows(X, 1:2)) 2-element Array{UnivariateFinite{String,UInt32,Float64},1}: UnivariateFinite(setosa=>0.333, versicolor=>0.333, virginica=>0.333) diff --git a/docs/src/weights.md b/docs/src/weights.md index 3789faf21..f17cf07b6 100644 --- a/docs/src/weights.md +++ b/docs/src/weights.md @@ -2,7 +2,7 @@ In machine learning it is possible to assign each observation an independent significance, or *weight*, either in **training** or in -**performance evaluation**, or both. +**performance evaluation**, or both. There are two kinds of weights in use in MLJ: @@ -11,7 +11,7 @@ There are two kinds of weights in use in MLJ: - *class weights* refer to dictionaries keyed on the target classes (levels) for use in classification problems - + ## Specifying weights in training diff --git a/docs/src/working_with_categorical_data.md b/docs/src/working_with_categorical_data.md index ee2dfc221..5ddf5250a 100644 --- a/docs/src/working_with_categorical_data.md +++ b/docs/src/working_with_categorical_data.md @@ -65,14 +65,13 @@ see [here](https://github.com/JuliaAI/ScientificTypesBase.jl#more-on-the-table-type).) 
```@example hut -import DataFrames.DataFrame -X = DataFrame( - name = ["Siri", "Robo", "Alexa", "Cortana"], - gender = ["male", "male", "Female", "female"], - likes_soup = [true, false, false, true], - height = [152, missing, 148, 163], - rating = [2, 5, 2, 1], - outcome = ["rejected", "accepted", "accepted", "rejected"]) +import DataFrames: DataFrame +X = DataFrame( name = ["Siri", "Robo", "Alexa", "Cortana"], + gender = ["male", "male", "Female", "female"], + likes_soup = [true, false, false, true], + height = [152, missing, 148, 163], + rating = [2, 5, 2, 1], + outcome = ["rejected", "accepted", "accepted", "rejected"]) schema(X) ``` @@ -102,11 +101,12 @@ levels(X.outcome) !!! warning "Changing levels of categorical data" - The order of levels should generally be changed - early in your data science workflow and then not again. Similar - remarks apply to *adding* levels (which is possible; see the - [CategorialArrays.jl documentation](https://juliadata.github.io/CategoricalArrays.jl/stable/)). MLJ supervised and unsupervised models assume levels - and their order do not change. + The order of levels should generally be changed early in your + data science workflow and then not again. Similar remarks apply + to *adding* levels (which is possible; see the + [CategorialArrays.jl documentation](https://juliadata.github.io/CategoricalArrays.jl/stable/)). + MLJ supervised and unsupervised models assume levels and their + order do not change. Coercing all remaining types simultaneously: @@ -157,7 +157,7 @@ or contain previously unseen classes. ## New or missing levels in production data -!!! warning +!!! warning Unpredictable behavior may result whenever `Finite` categorical data presents in a production set with different classes (levels) from those presented during training @@ -179,7 +179,7 @@ Xproduction == X[2:3,:] So far, so good. But the following operation throws an error: -```julia +```julia-repl julia> transform(mach, Xproduction) == transform(mach, X[2:3,:]) ERROR: Found category level mismatch in feature `x`. Consider using `levels!` to ensure fitted and transforming features have the same category levels. ``` @@ -287,7 +287,7 @@ came. Use `get(val)` to extract the raw label from a value `val`. 
Despite the distinction that exists between a value (element) and a label, the two are the same, from the point of `==` and `in`: -```@julia +```julia v[1] == 'A' # true 'A' in v # true ``` @@ -338,7 +338,7 @@ d_vec = UnivariateFinite(["no", "yes"], probs, pool=v) Or, equivalently: -```@julia +```julia d_vec = UnivariateFinite(["no", "yes"], yes_probs, augment=true, pool=v) ``` From fddc289838753e05f0af3cd0fe72c15b3619182a Mon Sep 17 00:00:00 2001 From: abhro <5664668+abhro@users.noreply.github.com> Date: Wed, 24 Apr 2024 00:16:56 -0400 Subject: [PATCH 03/24] Follow blue style in docs/src/working_with_categorical_data.md Co-authored-by: Anthony Blaom, PhD --- docs/src/working_with_categorical_data.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/docs/src/working_with_categorical_data.md b/docs/src/working_with_categorical_data.md index 5ddf5250a..f1506f2f5 100644 --- a/docs/src/working_with_categorical_data.md +++ b/docs/src/working_with_categorical_data.md @@ -66,12 +66,14 @@ see ```@example hut import DataFrames: DataFrame -X = DataFrame( name = ["Siri", "Robo", "Alexa", "Cortana"], - gender = ["male", "male", "Female", "female"], - likes_soup = [true, false, false, true], - height = [152, missing, 148, 163], - rating = [2, 5, 2, 1], - outcome = ["rejected", "accepted", "accepted", "rejected"]) +X = DataFrame( + name = ["Siri", "Robo", "Alexa", "Cortana"], + gender = ["male", "male", "Female", "female"], + likes_soup = [true, false, false, true], + height = [152, missing, 148, 163], + rating = [2, 5, 2, 1], + outcome = ["rejected", "accepted", "accepted", "rejected"], +) schema(X) ``` From 3d6d15fdc2d1d0e84c0e6c28f511f8f13bb9e09e Mon Sep 17 00:00:00 2001 From: abhro <5664668+abhro@users.noreply.github.com> Date: Mon, 29 Apr 2024 17:17:08 -0400 Subject: [PATCH 04/24] Update mlj_cheatsheet.md --- docs/src/mlj_cheatsheet.md | 43 ++++++++++++++++++++++---------------- 1 file changed, 25 insertions(+), 18 deletions(-) diff --git a/docs/src/mlj_cheatsheet.md b/docs/src/mlj_cheatsheet.md index 1c0573d42..a529bb162 100644 --- a/docs/src/mlj_cheatsheet.md +++ b/docs/src/mlj_cheatsheet.md @@ -97,47 +97,52 @@ y, X = unpack(channing, Splitting row indices into train/validation/test, with seeded shuffling: -```julia-repl -julia> train, valid, test = partition(eachindex(y), 0.7, 0.2, rng=1234) # for 70:20:10 ratio +```julia +train, valid, test = partition(eachindex(y), 0.7, 0.2, rng=1234) # for 70:20:10 ratio ``` For a stratified split: -```julia-repl -julia> train, test = partition(eachindex(y), 0.8, stratify=y) +```julia +train, test = partition(eachindex(y), 0.8, stratify=y) ``` Split a table or matrix `X`, instead of indices: -```julia-repl -julia> Xtrain, Xvalid, Xtest = partition(X, 0.5, 0.3, rng=123) +```julia +Xtrain, Xvalid, Xtest = partition(X, 0.5, 0.3, rng=123) ``` Getting data from [OpenML](https://www.openml.org): -```julia-repl -julia> table = OpenML.load(91) +```julia +table = OpenML.load(91) ``` Creating synthetic classification data: -`X, y = make_blobs(100, 2)` (also: `make_moons`, `make_circles`) +```julia +X, y = make_blobs(100, 2) +``` +(also: `make_moons`, `make_circles`) Creating synthetic regression data: -`X, y = make_regression(100, 2)` +```julia +X, y = make_regression(100, 2) +``` ## Machine construction Supervised case: -```julia-repl -julia> model = KNNRegressor(K=1) -julia> mach = machine(model, X, y) +```julia +model = KNNRegressor(K=1) +mach = machine(model, X, y) ``` Unsupervised case: -```julia-repl -julia> model = OneHotEncoder() 
-julia> mach = machine(model, X) +```julia +model = OneHotEncoder() +mach = machine(model, X) ``` ## Fitting @@ -308,8 +313,10 @@ Concatenation: ## Define a supervised learning network: -`Xs = source(X)` -`ys = source(y)` +```julia +Xs = source(X) +ys = source(y) +``` ... define further nodal machines and nodes ... From ae2815167fb5e219acb3e6ddb0af310d73302f21 Mon Sep 17 00:00:00 2001 From: abhro <5664668+abhro@users.noreply.github.com> Date: Mon, 29 Apr 2024 22:50:39 -0400 Subject: [PATCH 05/24] Consistenly use @example in common_mlj_workflows.md Remove julia> prompts, replace with @example macro --- docs/src/common_mlj_workflows.md | 28 +++++++++++----------------- 1 file changed, 11 insertions(+), 17 deletions(-) diff --git a/docs/src/common_mlj_workflows.md b/docs/src/common_mlj_workflows.md index 5c684fcc1..8e31bc71d 100644 --- a/docs/src/common_mlj_workflows.md +++ b/docs/src/common_mlj_workflows.md @@ -14,19 +14,13 @@ channing = (Sex = rand(["Male","Female"], 462), coerce!(channing, :Sex => Multiclass) ``` -```julia-repl -julia> import RDatasets -julia> channing = RDatasets.dataset("boot", "channing") +```julia +import RDatasets +channing = RDatasets.dataset("boot", "channing") +``` -julia> first(channing, 4) -4×5 DataFrame - Row │ Sex Entry Exit Time Cens - │ Cat… Int32 Int32 Int32 Int32 -─────┼────────────────────────────────── - 1 │ Male 782 909 127 1 - 2 │ Male 1020 1128 108 1 - 3 │ Male 856 969 113 1 - 4 │ Male 915 957 42 1 +```@example +first(channing, 4) ``` Inspecting metadata, including column scientific types: @@ -47,10 +41,10 @@ nothing # hide Here `y` is the `:Exit` column and `X` everything else except `:Time`: ```@example workflows -y, X = unpack(channing, - ==(:Exit), - !=(:Time); - rng=123); +y, X = unpack(channing, + ==(:Exit), + !=(:Time); + rng=123); scitype(y) ``` @@ -159,7 +153,7 @@ tree = Tree(min_samples_split=5, max_depth=4) or -```@julia +```julia tree = (@load DecisionTreeClassifier)() tree.min_samples_split = 5 tree.max_depth = 4 From 9f274adcbf459e7076f69fa0fdc9b5a00a866423 Mon Sep 17 00:00:00 2001 From: Abhro <5664668+abhro@users.noreply.github.com> Date: Fri, 3 May 2024 10:25:59 -0400 Subject: [PATCH 06/24] Fix @example namespace in common workflows --- docs/src/common_mlj_workflows.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/src/common_mlj_workflows.md b/docs/src/common_mlj_workflows.md index 8e31bc71d..9380ba5a5 100644 --- a/docs/src/common_mlj_workflows.md +++ b/docs/src/common_mlj_workflows.md @@ -19,7 +19,7 @@ import RDatasets channing = RDatasets.dataset("boot", "channing") ``` -```@example +```@example workflows first(channing, 4) ``` @@ -88,7 +88,7 @@ nothing # hide Or, if already horizontally split: ```@example workflows -(Xtrain, Xtest), (ytrain, ytest) = partition((X, y), 0.6, multi=true, rng=123) +(Xtrain, Xtest), (ytrain, ytest) = partition((X, y), 0.6, multi=true, rng=123) ``` @@ -108,7 +108,7 @@ ms[6] ``` ```@example workflows -models("Tree"); +models("Tree") ``` A more refined search: @@ -224,7 +224,7 @@ Run `measures()` to list all losses and scores and their aliases ("instances"). 
Predict on the new data set: ```@example workflows -Xnew = (FL = rand(3), RW = rand(3), CL = rand(3), CW = rand(3), BD =rand(3)) +Xnew = (FL = rand(3), RW = rand(3), CL = rand(3), CW = rand(3), BD = rand(3)) predict(mach, Xnew) # a vector of distributions ``` From 367db46a2861cb34b24cadbca9200e0b29334468 Mon Sep 17 00:00:00 2001 From: Abhro <5664668+abhro@users.noreply.github.com> Date: Fri, 3 May 2024 10:26:35 -0400 Subject: [PATCH 07/24] Break up predicting transformers into separate @example blocks --- docs/Project.toml | 1 + docs/src/transformers.md | 95 ++++++++++++++++------------------------ 2 files changed, 38 insertions(+), 58 deletions(-) diff --git a/docs/Project.toml b/docs/Project.toml index b9718b4ea..d86571e62 100755 --- a/docs/Project.toml +++ b/docs/Project.toml @@ -15,6 +15,7 @@ MLJLinearModels = "6ee0df7b-362f-4a72-a706-9e79364fb692" MLJMultivariateStatsInterface = "1b6a4a23-ba22-4f51-9698-8599985d3728" Missings = "e1d29d7a-bbdc-5cf2-9ac0-f12de2c33e28" NearestNeighborModels = "636a865e-7cf4-491e-846c-de09b730eb36" +ParallelKMeans = "42b8e9d4-006b-409a-8472-7f34b3fb58af" Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c" ScientificTypesBase = "30f210dd-8aff-4c5f-94ba-8e64358c1161" StatisticalMeasures = "a19d573c-0a75-4610-95b3-7071388c7541" diff --git a/docs/src/transformers.md b/docs/src/transformers.md index 9b868e967..805e715d8 100644 --- a/docs/src/transformers.md +++ b/docs/src/transformers.md @@ -193,62 +193,41 @@ K-means clustering algorithm assigns one of three labels 1, 2, 3 to the input features of the iris data set and compares them with the actual species recorded in the target (not seen by the algorithm). -```julia-repl -julia> import Random.seed! -julia> seed!(123) - -julia> X, y = @load_iris; -julia> KMeans = @load KMeans pkg=ParallelKMeans -julia> kmeans = KMeans() -julia> mach = machine(kmeans, X) |> fit! - -julia> # transforming: -julia> Xsmall = transform(mach); -julia> selectrows(Xsmall, 1:4) |> pretty -┌─────────────────────┬────────────────────┬────────────────────┐ -│ x1 │ x2 │ x3 │ -│ Float64 │ Float64 │ Float64 │ -│ Continuous │ Continuous │ Continuous │ -├─────────────────────┼────────────────────┼────────────────────┤ -│ 0.0215920000000267 │ 25.314260355029603 │ 11.645232464391299 │ -│ 0.19199200000001326 │ 25.882721893491123 │ 11.489658693899486 │ -│ 0.1699920000000077 │ 27.58656804733728 │ 12.674412792260142 │ -│ 0.26919199999998966 │ 26.28656804733727 │ 11.64392098898145 │ -└─────────────────────┴────────────────────┴────────────────────┘ - -julia> # predicting: -julia> yhat = predict(mach); -julia> compare = zip(yhat, y) |> collect; -julia> compare[1:8] -8-element Array{Tuple{CategoricalValue{Int64,UInt32},CategoricalString{UInt32}},1}: - (1, "setosa") - (1, "setosa") - (1, "setosa") - (1, "setosa") - (1, "setosa") - (1, "setosa") - (1, "setosa") - (1, "setosa") - -julia> compare[51:58] -8-element Array{Tuple{CategoricalValue{Int64,UInt32},CategoricalString{UInt32}},1}: - (2, "versicolor") - (3, "versicolor") - (2, "versicolor") - (3, "versicolor") - (3, "versicolor") - (3, "versicolor") - (3, "versicolor") - (3, "versicolor") - -julia> compare[101:108] -8-element Array{Tuple{CategoricalValue{Int64,UInt32},CategoricalString{UInt32}},1}: - (2, "virginica") - (3, "virginica") - (2, "virginica") - (2, "virginica") - (2, "virginica") - (2, "virginica") - (3, "virginica") - (2, "virginica") +```@setup predtrans +using MLJ +``` + +```@example predtrans +import Random.seed! 
+seed!(123) + +X, y = @load_iris; +KMeans = @load KMeans pkg=ParallelKMeans +kmeans = KMeans() +mach = machine(kmeans, X) |> fit! +nothing # hide +``` + +Transforming: +```@example predtrans +Xsmall = transform(mach); +selectrows(Xsmall, 1:4) |> pretty +``` + +Predicting: +```@example predtrans +yhat = predict(mach); +compare = zip(yhat, y) |> collect; +``` + +```@example predtrans +compare[1:8] +``` + +```@example predtrans +compare[51:58] +``` + +```@example predtrans +compare[101:108] ``` From f86b01bf1d0119cbd897109a36637e4a6f20fd9e Mon Sep 17 00:00:00 2001 From: Abhro <5664668+abhro@users.noreply.github.com> Date: Fri, 3 May 2024 10:27:33 -0400 Subject: [PATCH 08/24] Use @example instead of pre-built repl sample in learning_networks.md --- docs/src/learning_networks.md | 24 +++++++++++------------- 1 file changed, 11 insertions(+), 13 deletions(-) diff --git a/docs/src/learning_networks.md b/docs/src/learning_networks.md index 84007be3f..c275ba543 100644 --- a/docs/src/learning_networks.md +++ b/docs/src/learning_networks.md @@ -748,19 +748,17 @@ Q = @node sqrt(Z) (so that `Q() == 4`). Here's a more complicated application of `@node` to row-shuffle a table: -```julia-repl -julia> using Random -julia> X = (x1 = [1, 2, 3, 4, 5], - x2 = [:one, :two, :three, :four, :five]) -julia> rows(X) = 1:nrows(X) - -julia> Xs = source(X) -julia> rs = @node rows(Xs) -julia> W = @node selectrows(Xs, @node shuffle(rs)) - -julia> W() -(x1 = [5, 1, 3, 2, 4], - x2 = Symbol[:five, :one, :three, :two, :four],) +```@example +using MLJ, Random +X = (x1 = [1, 2, 3, 4, 5], + x2 = [:one, :two, :three, :four, :five]) +rows(X) = 1:nrows(X) + +Xs = source(X) +rs = @node rows(Xs) +W = @node selectrows(Xs, @node shuffle(rs)) + +W() ``` **Important.** An argument not in global scope is assumed by `@node` to be a node or From 211bcf9ac1fb617b36c5e1f7209aa546c32addca Mon Sep 17 00:00:00 2001 From: Abhro R <5664668+abhro@users.noreply.github.com> Date: Tue, 14 May 2024 21:58:30 -0400 Subject: [PATCH 09/24] Do mechanical fixes of spacing, semicolons, and punc --- docs/src/about_mlj.md | 14 ++++++-------- docs/src/adding_models_for_general_use.md | 2 +- docs/src/api.md | 0 docs/src/common_mlj_workflows.md | 6 +++--- docs/src/controlling_iterative_models.md | 16 ++++++++-------- docs/src/evaluating_model_performance.md | 3 +-- docs/src/frequently_asked_questions.md | 0 docs/src/getting_started.md | 2 +- docs/src/glossary.md | 4 ++-- docs/src/img/two_model_stack.png | Bin docs/src/img/wrapped_ridge.png | Bin docs/src/internals.md | 0 docs/src/learning_networks.md | 2 +- docs/src/loading_model_code.md | 2 +- docs/src/machines.md | 8 ++++---- docs/src/preparing_data.md | 9 ++++----- docs/src/transformers.md | 8 ++++---- docs/src/tuning_models.md | 6 +++--- 18 files changed, 39 insertions(+), 43 deletions(-) mode change 100755 => 100644 docs/src/about_mlj.md mode change 100755 => 100644 docs/src/adding_models_for_general_use.md mode change 100755 => 100644 docs/src/api.md mode change 100755 => 100644 docs/src/frequently_asked_questions.md mode change 100755 => 100644 docs/src/img/two_model_stack.png mode change 100755 => 100644 docs/src/img/wrapped_ridge.png mode change 100755 => 100644 docs/src/internals.md diff --git a/docs/src/about_mlj.md b/docs/src/about_mlj.md old mode 100755 new mode 100644 index 59afc2269..86012b056 --- a/docs/src/about_mlj.md +++ b/docs/src/about_mlj.md @@ -1,6 +1,6 @@ # About MLJ -MLJ (Machine Learning in Julia) is a toolbox written in Julia +MLJ (Machine Learning in Julia) is a toolbox written in 
Julia providing a common interface and meta-algorithms for selecting, tuning, evaluating, composing and comparing [over 180 machine learning models](@ref model_list) written in Julia and other languages. In @@ -22,8 +22,7 @@ The first code snippet below creates a new Julia environment [Installation](@ref) for more on creating a Julia environment for use with MLJ. -Julia installation instructions are -[here](https://julialang.org/downloads/). +Julia installation instructions are [here](https://julialang.org/downloads/). ```julia using Pkg @@ -44,7 +43,7 @@ Loading and instantiating a gradient tree-boosting model: using MLJ Booster = @load EvoTreeRegressor # loads code defining a model type booster = Booster(max_depth=2) # specify hyper-parameter at construction -booster.nrounds=50 # or mutate afterwards +booster.nrounds = 50 # or mutate afterwards ``` This model is an example of an iterative model. As it stands, the @@ -92,7 +91,7 @@ it "self-tuning": ```julia self_tuning_pipe = TunedModel(model=pipe, tuning=RandomSearch(), - ranges = max_depth_range, + ranges=max_depth_range, resampling=CV(nfolds=3, rng=456), measure=l1, acceleration=CPUThreads(), @@ -105,7 +104,7 @@ Loading a selection of features and labels from the Ames House Price dataset: ```julia -X, y = @load_reduced_ames; +X, y = @load_reduced_ames ``` Evaluating the "self-tuning" pipeline model's performance using 5-fold cross-validation (implies multiple layers of nested resampling): @@ -155,8 +154,7 @@ Extract: * Consistent interface to handle probabilistic predictions. -* Extensible [tuning - interface](https://github.com/JuliaAI/MLJTuning.jl), +* Extensible [tuning interface](https://github.com/JuliaAI/MLJTuning.jl), to support a growing number of optimization strategies, and designed to play well with model composition. diff --git a/docs/src/adding_models_for_general_use.md b/docs/src/adding_models_for_general_use.md old mode 100755 new mode 100644 index a63a8ac3f..c51836c04 --- a/docs/src/adding_models_for_general_use.md +++ b/docs/src/adding_models_for_general_use.md @@ -5,4 +5,4 @@ suitable for addition to the MLJ Model Registry, consult the [MLJModelInterface. documentation](https://juliaai.github.io/MLJModelInterface.jl/dev/). For quick-and-dirty user-defined models see [Simple User Defined -Models](simple_user_defined_models.md). +Models](simple_user_defined_models.md). diff --git a/docs/src/api.md b/docs/src/api.md old mode 100755 new mode 100644 diff --git a/docs/src/common_mlj_workflows.md b/docs/src/common_mlj_workflows.md index 72491a81f..a7fbc69af 100644 --- a/docs/src/common_mlj_workflows.md +++ b/docs/src/common_mlj_workflows.md @@ -55,7 +55,7 @@ Horizontally splitting data and shuffling rows. Here `y` is the `:Exit` column and `X` a table with everything else: ```@example workflows -y, X = unpack(channing, ==(:Exit), rng=123); +y, X = unpack(channing, ==(:Exit), rng=123) nothing # hide ``` @@ -202,7 +202,7 @@ Do `measures()` to list all losses and scores and their aliases, or refer to the StatisticalMeasures.jl [docs](https://juliaai.github.io/StatisticalMeasures.jl/dev/). 
-## Basic fit/evaluate/predict by hand: +## Basic fit/evaluate/predict by hand *Reference:* [Getting Started](index.md), [Machines](machines.md), [Evaluating Model Performance](evaluating_model_performance.md), [Performance Measures](performance_measures.md) @@ -496,7 +496,7 @@ Tree = @load DecisionTreeRegressor pkg=DecisionTree verbosity=0 tree_with_target = TransformedTargetModel(model=Tree(), transformer=y -> log.(y), inverse = z -> exp.(z)) -pipe2 = (X -> coerce(X, :age=>Continuous)) |> OneHotEncoder() |> tree_with_target; +pipe2 = (X -> coerce(X, :age=>Continuous)) |> OneHotEncoder() |> tree_with_target nothing # hide ``` diff --git a/docs/src/controlling_iterative_models.md b/docs/src/controlling_iterative_models.md index e0cabdb40..bbe8ddaa0 100644 --- a/docs/src/controlling_iterative_models.md +++ b/docs/src/controlling_iterative_models.md @@ -98,7 +98,7 @@ control | description [`TimeLimit`](@ref EarlyStopping.TimeLimit)`(t=0.5)` | Stop after `t` hours | yes [`NumberLimit`](@ref EarlyStopping.NumberLimit)`(n=100)` | Stop after `n` applications of the control | yes [`NumberSinceBest`](@ref EarlyStopping.NumberSinceBest)`(n=6)` | Stop when best loss occurred `n` control applications ago | yes -[`InvalidValue`](@ref IterationControl.InvalidValue)() | Stop when `NaN`, `Inf` or `-Inf` loss/training loss encountered | yes +[`InvalidValue`](@ref IterationControl.InvalidValue)() | Stop when `NaN`, `Inf` or `-Inf` loss/training loss encountered | yes [`Threshold`](@ref EarlyStopping.Threshold)`(value=0.0)` | Stop when `loss < value` | yes [`GL`](@ref EarlyStopping.GL)`(alpha=2.0)` | † Stop after the "generalization loss (GL)" exceeds `alpha` | yes [`PQ`](@ref EarlyStopping.PQ)`(alpha=0.75, k=5)` | † Stop after "progress-modified GL" exceeds `alpha` | yes @@ -109,15 +109,15 @@ control | description [`Error`](@ref IterationControl.Error)`(predicate; f="")` | Log to `Error` the value of `f` or `f(mach)`, if `predicate(mach)` holds and then stop | yes [`Callback`](@ref IterationControl.Callback)`(f=mach->nothing)`| Call `f(mach)` | yes [`WithNumberDo`](@ref IterationControl.WithNumberDo)`(f=n->@info(n))` | Call `f(n + 1)` where `n` is the number of complete control cycles so far | yes -[`WithIterationsDo`](@ref MLJIteration.WithIterationsDo)`(f=i->@info("iterations: $i"))`| Call `f(i)`, where `i` is total number of iterations | yes +[`WithIterationsDo`](@ref MLJIteration.WithIterationsDo)`(f=i->@info("iterations: $i"))` | Call `f(i)`, where `i` is total number of iterations | yes [`WithLossDo`](@ref IterationControl.WithLossDo)`(f=x->@info("loss: $x"))` | Call `f(loss)` where `loss` is the current loss | yes -[`WithTrainingLossesDo`](@ref IterationControl.WithTrainingLossesDo)`(f=v->@info(v))` | Call `f(v)` where `v` is the current batch of training losses | yes -[`WithEvaluationDo`](@ref MLJIteration.WithEvaluationDo)`(f->e->@info("evaluation: $e))`| Call `f(e)` where `e` is the current performance evaluation object | yes +[`WithTrainingLossesDo`](@ref IterationControl.WithTrainingLossesDo)`(f=v->@info(v))` | Call `f(v)` where `v` is the current batch of training losses | yes +[`WithEvaluationDo`](@ref MLJIteration.WithEvaluationDo)`(f->e->@info("evaluation: $e))` | Call `f(e)` where `e` is the current performance evaluation object | yes [`WithFittedParamsDo`](@ref MLJIteration.WithFittedParamsDo)`(f->fp->@info("fitted_params: $fp))`| Call `f(fp)` where `fp` is fitted parameters of training machine | yes -[`WithReportDo`](@ref MLJIteration.WithReportDo)`(f->e->@info("report: $e))`| Call 
`f(r)` where `r` is the training machine report | yes -[`WithModelDo`](@ref MLJIteration.WithModelDo)`(f->m->@info("model: $m))`| Call `f(m)` where `m` is the model, which may be mutated by `f` | yes -[`WithMachineDo`](@ref MLJIteration.WithMachineDo)`(f->mach->@info("report: $mach))`| Call `f(mach)` wher `mach` is the training machine in its current state | yes -[`Save`](@ref MLJIteration.Save)`(filename="machine.jls")`|Save current training machine to `machine1.jls`, `machine2.jsl`, etc | yes +[`WithReportDo`](@ref MLJIteration.WithReportDo)`(f->e->@info("report: $e))`| Call `f(r)` where `r` is the training machine report | yes +[`WithModelDo`](@ref MLJIteration.WithModelDo)`(f->m->@info("model: $m))`| Call `f(m)` where `m` is the model, which may be mutated by `f` | yes +[`WithMachineDo`](@ref MLJIteration.WithMachineDo)`(f->mach->@info("report: $mach))`| Call `f(mach)` wher `mach` is the training machine in its current state | yes +[`Save`](@ref MLJIteration.Save)`(filename="machine.jls")` | Save current training machine to `machine1.jls`, `machine2.jsl`, etc | yes > Table 1. Atomic controls. Some advanced options are omitted. diff --git a/docs/src/evaluating_model_performance.md b/docs/src/evaluating_model_performance.md index c378b46ec..3999a1d97 100644 --- a/docs/src/evaluating_model_performance.md +++ b/docs/src/evaluating_model_performance.md @@ -96,7 +96,7 @@ evaluate!( ) ``` -Or the user can define their own re-usable `ResamplingStrategy` objects, - see [Custom +Or the user can define their own re-usable `ResamplingStrategy` objects; see [Custom resampling strategies](@ref) below. @@ -170,4 +170,3 @@ function train_test_pairs(holdout::Holdout, rows) return [(train, test),] end ``` - diff --git a/docs/src/frequently_asked_questions.md b/docs/src/frequently_asked_questions.md old mode 100755 new mode 100644 diff --git a/docs/src/getting_started.md b/docs/src/getting_started.md index 6ddf5f89f..5459597be 100644 --- a/docs/src/getting_started.md +++ b/docs/src/getting_started.md @@ -260,7 +260,7 @@ evaluate!(mach, resampling=Holdout(fraction_train=0.7), Changing a hyperparameter and re-evaluating: ```@repl doda -tree.max_depth = 3 +tree.max_depth = 3; evaluate!(mach, resampling=Holdout(fraction_train=0.7), measures=[log_loss, accuracy], verbosity=0) diff --git a/docs/src/glossary.md b/docs/src/glossary.md index 94aa925e9..02a8a942d 100755 --- a/docs/src/glossary.md +++ b/docs/src/glossary.md @@ -93,8 +93,8 @@ wrapped in an associated operation (e.g., `predict` or `inverse_transform`). It consists primarily of: 1. An operation, static or dynamic. -1. A machine, or `nothing` if the operation is static. -1. Upstream connections to other nodes, specified by a list of +2. A machine, or `nothing` if the operation is static. +3. Upstream connections to other nodes, specified by a list of *arguments* (one for each argument of the operation). These are the arguments on which the operation "acts" when the node `N` is called, as in `N()`. 
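As a concrete sketch, using made-up data and the built-in `Standardizer`, such a
node might be constructed and called like this:

```julia
using MLJ

X = (x1 = rand(5), x2 = rand(5))      # any Tables.jl-compatible table

Xs = source(X)                         # a source node
mach = machine(Standardizer(), Xs)     # a machine with a nodal training argument
W = transform(mach, Xs)                # a node: operation + machine + upstream arguments
fit!(W)                                # train the machines the node depends on
W()                                    # call the node to get the transformed data
W(rows=1:2)                            # ... or just the first two rows
```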
diff --git a/docs/src/img/two_model_stack.png b/docs/src/img/two_model_stack.png old mode 100755 new mode 100644 diff --git a/docs/src/img/wrapped_ridge.png b/docs/src/img/wrapped_ridge.png old mode 100755 new mode 100644 diff --git a/docs/src/internals.md b/docs/src/internals.md old mode 100755 new mode 100644 diff --git a/docs/src/learning_networks.md b/docs/src/learning_networks.md index c275ba543..0eebc8a0c 100644 --- a/docs/src/learning_networks.md +++ b/docs/src/learning_networks.md @@ -224,7 +224,7 @@ A more complicated learning network may contain machines that can be trained in parallel. In that case, a call like the following may speed up training: ```@example 42 -tree.max_depth=2 +tree.max_depth = 2 fit!(yhat, acceleration=CPUThreads()) nothing # hide ``` diff --git a/docs/src/loading_model_code.md b/docs/src/loading_model_code.md index 850e5a262..d0489ffa3 100644 --- a/docs/src/loading_model_code.md +++ b/docs/src/loading_model_code.md @@ -42,7 +42,7 @@ MLJDecisionTreeInterface.jl. If this package is not in `my_env` (do `Pkg.status()` to check) you add it by running ```julia-repl -julia> Pkg.add("MLJDecisionTreeInterface"); +julia> Pkg.add("MLJDecisionTreeInterface") ``` So long as `my_env` is the active environment, this action need never diff --git a/docs/src/machines.md b/docs/src/machines.md index 7ad56c935..b52579e9b 100644 --- a/docs/src/machines.md +++ b/docs/src/machines.md @@ -33,7 +33,7 @@ Generally, changing a hyperparameter triggers retraining on calls to subsequent `fit!`: ```@repl machines -forest.bagging_fraction = 0.5 +forest.bagging_fraction = 0.5; fit!(mach, verbosity=2); ``` @@ -41,7 +41,7 @@ However, for this iterative model, increasing the iteration parameter only adds models to the existing ensemble: ```@repl machines -forest.n = 15 +forest.n = 15; fit!(mach, verbosity=2); ``` @@ -138,8 +138,8 @@ such as a vector of per-observation weights (in which case --------------------|-----------------------------|-------------------------------------- `Deterministic <: Supervised` | `machine(model, X, y, extras...)` | `predict(mach, Xnew)`, `transform(mach, Xnew)`, `inverse_transform(mach, Xout)` `Probabilistic <: Supervised` | `machine(model, X, y, extras...)` | `predict(mach, Xnew)`, `predict_mean(mach, Xnew)`, `predict_median(mach, Xnew)`, `predict_mode(mach, Xnew)`, `transform(mach, Xnew)`, `inverse_transform(mach, Xout)` -`Unsupervised` (except `Static`) | `machine(model, X)` | `transform(mach, Xnew)`, `inverse_transform(mach, Xout)`, `predict(mach, Xnew)` -`Static` | `machine(model)` | `transform(mach, Xnews...)`, `inverse_transform(mach, Xout)` +`Unsupervised` (except `Static`) | `machine(model, X)` | `transform(mach, Xnew)`, `inverse_transform(mach, Xout)`, `predict(mach, Xnew)` +`Static` | `machine(model)` | `transform(mach, Xnews...)`, `inverse_transform(mach, Xout)` All operations on machines (`predict`, `transform`, etc) have exactly one argument (`Xnew` or `Xout` above) after `mach`, the machine diff --git a/docs/src/preparing_data.md b/docs/src/preparing_data.md index 5b73dfa5f..763bb522e 100644 --- a/docs/src/preparing_data.md +++ b/docs/src/preparing_data.md @@ -40,7 +40,7 @@ models(matching(X, y)) Or are unsure about the source of the following warning: ```julia-repl -julia> Tree = @load DecisionTreeRegressor pkg=DecisionTree verbosity=0 +julia> Tree = @load DecisionTreeRegressor pkg=DecisionTree verbosity=0; julia> tree = Tree(); julia> machine(tree, X, y) @@ -57,7 +57,7 @@ Machine{DecisionTreeRegressor,…} @198 trained 0 times; caches data The 
meaning of the warning is: - The input `X` is a table with column scitypes `Continuous`, `Count`, and `Textual` and `Union{Missing, Textual}`, which can also see by inspecting the schema: - + ```@example poot schema(X) ``` @@ -72,8 +72,7 @@ above, with links to further documentation given below: **Scientific type coercion:** We coerce machine types to obtain the intended scientific interpretation. If `height` in the above example is intended to be `Continuous`, `mark` is supposed to be -`OrderedFactor`, and `admitted` a (binary) `Multiclass`, then we can -do +`OrderedFactor`, and `admitted` a (binary) `Multiclass`, then we can do ```@example poot X_coerced = coerce(X, :height=>Continuous, :mark=>OrderedFactor, :admitted=>Multiclass); @@ -82,7 +81,7 @@ schema(X_coerced) **Data transformations:** We carry out conventional data transformations, such as missing value imputation and feature encoding: - + ```@example poot imputer = FillImputer() mach = machine(imputer, X_coerced) |> fit! diff --git a/docs/src/transformers.md b/docs/src/transformers.md index 805e715d8..0c32ed2e6 100644 --- a/docs/src/transformers.md +++ b/docs/src/transformers.md @@ -201,7 +201,7 @@ using MLJ import Random.seed! seed!(123) -X, y = @load_iris; +X, y = @load_iris KMeans = @load KMeans pkg=ParallelKMeans kmeans = KMeans() mach = machine(kmeans, X) |> fit! @@ -210,14 +210,14 @@ nothing # hide Transforming: ```@example predtrans -Xsmall = transform(mach); +Xsmall = transform(mach) selectrows(Xsmall, 1:4) |> pretty ``` Predicting: ```@example predtrans -yhat = predict(mach); -compare = zip(yhat, y) |> collect; +yhat = predict(mach) +compare = zip(yhat, y) |> collect ``` ```@example predtrans diff --git a/docs/src/tuning_models.md b/docs/src/tuning_models.md index 709547355..026f56112 100644 --- a/docs/src/tuning_models.md +++ b/docs/src/tuning_models.md @@ -121,7 +121,7 @@ Predicting on new input observations using the optimal model, *trained on all the data* bound to `mach`: ```@example goof -Xnew = MLJ.table(rand(3, 10)); +Xnew = MLJ.table(rand(3, 10)); predict(mach, Xnew) ``` @@ -178,7 +178,7 @@ self_tuning_knn = TunedModel( resampling = CV(nfolds=4, rng=1234), tuning = Grid(resolution=5), range = K_range, - measure=BrierLoss() + measure = BrierLoss() ); mach = machine(self_tuning_knn, X, y); @@ -193,7 +193,7 @@ self_tuning_knn = TunedModel( resampling = CV(nfolds=4, rng=1234), tuning = Grid(resolution=5), range = K_range, - measure=MisclassificationRate() + measure = MisclassificationRate() ) mach = machine(self_tuning_knn, X, y); From c7b5d3ac85163c542ca2f6e8dbb1316fa6d0f186 Mon Sep 17 00:00:00 2001 From: Abhro R <5664668+abhro@users.noreply.github.com> Date: Tue, 14 May 2024 21:58:58 -0400 Subject: [PATCH 10/24] Fix indentation of markdown line --- docs/src/common_mlj_workflows.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/common_mlj_workflows.md b/docs/src/common_mlj_workflows.md index a7fbc69af..d60bc4096 100644 --- a/docs/src/common_mlj_workflows.md +++ b/docs/src/common_mlj_workflows.md @@ -165,7 +165,7 @@ nothing # hide ## Instantiating a model - *Reference:* [Getting Started](@ref), [Loading Model Code](@ref) +*Reference:* [Getting Started](@ref), [Loading Model Code](@ref) Assumes `MLJDecisionTreeClassifier` is in your environment. 
Otherwise, try interactive loading with `@iload`: From 925ec42d559209b57fbad3ca037b90624b275c7d Mon Sep 17 00:00:00 2001 From: Abhro R <5664668+abhro@users.noreply.github.com> Date: Tue, 14 May 2024 22:00:42 -0400 Subject: [PATCH 11/24] Move hidden example block to setup --- docs/src/common_mlj_workflows.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/src/common_mlj_workflows.md b/docs/src/common_mlj_workflows.md index d60bc4096..11ff279bf 100644 --- a/docs/src/common_mlj_workflows.md +++ b/docs/src/common_mlj_workflows.md @@ -373,8 +373,8 @@ z = transform(mach, y); *Reference:* [Tuning Models](tuning_models.md) -```@example workflows -X, y = @load_iris; nothing # hide +```@setup workflows +X, y = @load_iris ``` Define a model with nested hyperparameters: From 2a1202f61bb61606054331a1e98a34813e22ffd7 Mon Sep 17 00:00:00 2001 From: Abhro R <5664668+abhro@users.noreply.github.com> Date: Tue, 14 May 2024 22:02:38 -0400 Subject: [PATCH 12/24] Pull code sample into list --- docs/src/preparing_data.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/src/preparing_data.md b/docs/src/preparing_data.md index 763bb522e..a8c49cc4f 100644 --- a/docs/src/preparing_data.md +++ b/docs/src/preparing_data.md @@ -58,9 +58,9 @@ The meaning of the warning is: - The input `X` is a table with column scitypes `Continuous`, `Count`, and `Textual` and `Union{Missing, Textual}`, which can also see by inspecting the schema: -```@example poot -schema(X) -``` + ```@example poot + schema(X) + ``` - The model requires a table whose column element scitypes subtype `Continuous`, an incompatibility. From f8518f4536f163fe8f047308e38fed7ef833c971 Mon Sep 17 00:00:00 2001 From: Abhro R <5664668+abhro@users.noreply.github.com> Date: Tue, 14 May 2024 22:05:24 -0400 Subject: [PATCH 13/24] Use proper markdown lists --- docs/src/glossary.md | 28 ++++++++++++++-------------- docs/src/mlj_cheatsheet.md | 20 ++++++++++---------- 2 files changed, 24 insertions(+), 24 deletions(-) mode change 100755 => 100644 docs/src/glossary.md diff --git a/docs/src/glossary.md b/docs/src/glossary.md old mode 100755 new mode 100644 index 02a8a942d..425f45ec7 --- a/docs/src/glossary.md +++ b/docs/src/glossary.md @@ -47,20 +47,20 @@ on a fit-result (e.g., a broadcasted logarithm) which is then called An object consisting of: -(1) A model - -(2) A fit-result (undefined until training) - -(3) *Training arguments* (one for each data argument of the model's -associated `fit` method). A training argument is data used for -training (subsampled by specifying `rows=...` in `fit!`) but also in -evaluation (subsampled by specifying `rows=...` in `predict`, -`predict_mean`, etc). Generally, there are two training arguments for -supervised models, and just one for unsupervised models. Each argument -is either a `Source` node, wrapping concrete data supplied to the -`machine` constructor, or a `Node`, in the case of a learning network -(see below). Both kinds of nodes can be *called* with an optional -`rows=...` keyword argument to (lazily) return concrete data. +1. A model + +2. A fit-result (undefined until training) + +3. *Training arguments* (one for each data argument of the model's + associated `fit` method). A training argument is data used for + training (subsampled by specifying `rows=...` in `fit!`) but also in + evaluation (subsampled by specifying `rows=...` in `predict`, + `predict_mean`, etc). 
Generally, there are two training arguments for + supervised models, and just one for unsupervised models. Each argument + is either a `Source` node, wrapping concrete data supplied to the + `machine` constructor, or a `Node`, in the case of a learning network + (see below). Both kinds of nodes can be *called* with an optional + `rows=...` keyword argument to (lazily) return concrete data. In addition, machines store "report" metadata, for recording algorithm-specific statistics of training (eg, an internal estimate of diff --git a/docs/src/mlj_cheatsheet.md b/docs/src/mlj_cheatsheet.md index fec42f2a0..11873358b 100644 --- a/docs/src/mlj_cheatsheet.md +++ b/docs/src/mlj_cheatsheet.md @@ -66,11 +66,11 @@ type | scitype Use `schema(X)` to get the column scitypes of a table `X` -`coerce(y, Multiclass)` attempts coercion of all elements of `y` into scitype `Multiclass` +- `coerce(y, Multiclass)` attempts coercion of all elements of `y` into scitype `Multiclass` -`coerce(X, :x1 => Continuous, :x2 => OrderedFactor)` to coerce columns `:x1` and `:x2` of table `X`. +- `coerce(X, :x1 => Continuous, :x2 => OrderedFactor)` to coerce columns `:x1` and `:x2` of table `X`. -`coerce(X, Count => Continuous)` to coerce all columns with `Count` scitype to `Continuous`. +- `coerce(X, Count => Continuous)` to coerce all columns with `Count` scitype to `Continuous`. ## Ingesting data @@ -152,11 +152,11 @@ mach = machine(model, X) ## Prediction -Supervised case: `predict(mach, Xnew)` or `predict(mach, rows=1:100)` +- Supervised case: `predict(mach, Xnew)` or `predict(mach, rows=1:100)` -Similarly, for probabilistic models: `predict_mode`, `predict_mean` and `predict_median`. + Similarly, for probabilistic models: `predict_mode`, `predict_mean` and `predict_median`. -Unsupervised case: `transform(mach, rows=1:100)` or `inverse_transform(mach, rows)`, etc. +- Unsupervised case: `transform(mach, rows=1:100)` or `inverse_transform(mach, rows)`, etc. ## Inspecting objects @@ -302,13 +302,13 @@ Externals include: `PCA` (in MultivariateStats), `KMeans`, `KMedoids` (in Cluste `pipe = (X -> coerce(X, :height=>Continuous)) |> OneHotEncoder |> KNNRegressor(K=3)` -Unsupervised: +- Unsupervised: -`pipe = Standardizer |> OneHotEncoder` + `pipe = Standardizer |> OneHotEncoder` -Concatenation: +- Concatenation: -`pipe1 |> pipe2` or `model |> pipe` or `pipe |> model`, etc + `pipe1 |> pipe2` or `model |> pipe` or `pipe |> model`, etc. 
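A pipeline assembled this way is trained like any other model. Below is a minimal sketch on synthetic data; it assumes `NearestNeighborModels` is available in the active environment to supply the `KNNRegressor` atom:

```julia
using MLJ

# synthetic table with two continuous features and one categorical feature
X = coerce((x1 = rand(100),
            x2 = rand(100),
            x3 = rand(["a", "b", "c"], 100)),
           :x3 => Multiclass)
y = rand(100)

KNNRegressor = @load KNNRegressor pkg=NearestNeighborModels

# standardize the continuous features, one-hot encode the categorical one, then regress
pipe = Standardizer() |> OneHotEncoder() |> KNNRegressor(K=3)

mach = machine(pipe, X, y) |> fit!
yhat = predict(mach, X)
```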
## Advanced model composition techniques From ad9129b8525392b94d8ba10820647e28eec07f6e Mon Sep 17 00:00:00 2001 From: Abhro R <5664668+abhro@users.noreply.github.com> Date: Tue, 14 May 2024 22:43:46 -0400 Subject: [PATCH 14/24] Use example block for workflows --- docs/src/common_mlj_workflows.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/docs/src/common_mlj_workflows.md b/docs/src/common_mlj_workflows.md index 11ff279bf..dac37e18b 100644 --- a/docs/src/common_mlj_workflows.md +++ b/docs/src/common_mlj_workflows.md @@ -23,14 +23,16 @@ MLJ_VERSION ## Data ingestion ```@setup workflows -# to avoid RDatasets as a doc dependency: +# to avoid RDatasets as a doc dependency, generate synthetic data with +# similar parameters, with the first four rows mimicking the original dataset +# for display purposes color_off() import DataFrames -channing = (Sex = rand(["Male","Female"], 462), - Entry = rand(Int, 462), - Exit = rand(Int, 462), - Time = rand(Int, 462), - Cens = rand(Int, 462)) |> DataFrames.DataFrame +channing = (Sex = [repeat(["Male"], 4)..., rand(["Male","Female"], 458)...], + Entry = Int32[782, 1020, 856, 915, rand(733:1140, 458)...], + Exit = Int32[909, 1128, 969, 957, rand(777:1207, 458)...], + Time = Int32[127, 108, 113, 42, rand(0:137, 458)...], + Cens = Int32[1, 1, 1, 1, rand(0:1, 458)...]) |> DataFrames.DataFrame coerce!(channing, :Sex => Multiclass) ``` @@ -41,7 +43,7 @@ channing = RDatasets.dataset("boot", "channing") ``` ```@example workflows -first(channing, 4) +first(channing, 4) |> pretty ``` Inspecting metadata, including column scientific types: From da2e45a6f69eefb920131d819c26942ecd904116 Mon Sep 17 00:00:00 2001 From: Abhro R <5664668+abhro@users.noreply.github.com> Date: Tue, 14 May 2024 22:44:07 -0400 Subject: [PATCH 15/24] Remove lambdas --- docs/src/learning_networks.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/src/learning_networks.md b/docs/src/learning_networks.md index 0eebc8a0c..d4474aafa 100644 --- a/docs/src/learning_networks.md +++ b/docs/src/learning_networks.md @@ -125,7 +125,7 @@ broadcasted versions of `log`, `exp`, `mean`, `mode` and `median`. A function li is not overloaded, so that `Q = sqrt(Z)` will throw an error. Instead, we do ```@example 42 -Q = node(z->sqrt(z), Z) +Q = node(sqrt, Z) Z() ``` @@ -736,7 +736,7 @@ There is also an experimental macro [`@node`](@ref). 
If `Z` is an `AbstractNode` source(16)`, say) then instead of ```julia -Q = node(z->sqrt(z), Z) +Q = node(sqrt, Z) ``` one can do From 72f2be25ac8eee80f65247c81b93c533f1470987 Mon Sep 17 00:00:00 2001 From: Abhro R <5664668+abhro@users.noreply.github.com> Date: Tue, 14 May 2024 22:45:10 -0400 Subject: [PATCH 16/24] Use repl blocks for user defined models --- docs/src/simple_user_defined_models.md | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) mode change 100755 => 100644 docs/src/simple_user_defined_models.md diff --git a/docs/src/simple_user_defined_models.md b/docs/src/simple_user_defined_models.md old mode 100755 new mode 100644 index ffc64920b..e3e779aed --- a/docs/src/simple_user_defined_models.md +++ b/docs/src/simple_user_defined_models.md @@ -39,9 +39,9 @@ For an unsupervised model, implement `transform` and, optionally, Here's a quick-and-dirty implementation of a ridge regressor with no intercept: ```@example regressor_example +using MLJ; color_off() # hide import MLJBase using LinearAlgebra -MLJBase.color_off() # hide mutable struct MyRegressor <: MLJBase.Deterministic lambda::Float64 @@ -65,11 +65,11 @@ nothing # hide After loading this code, all MLJ's basic meta-algorithms can be applied to `MyRegressor`: ```@repl regressor_example +using MLJ # hide X, y = @load_boston; model = MyRegressor(lambda=1.0) regressor = machine(model, X, y) evaluate!(regressor, resampling=CV(), measure=rms, verbosity=0) - ``` ## A simple probabilistic classifier @@ -78,7 +78,8 @@ The following probabilistic model simply fits a probability distribution to the `MultiClass` training target (i.e., ignores `X`) and returns this pdf for any new pattern: -```julia +```@example classifier_example +using MLJ # hide import MLJBase import Distributions @@ -99,11 +100,8 @@ MLJBase.predict(model::MyClassifier, fitresult, Xnew) = [fitresult for r in 1:nrows(Xnew)] ``` -```julia-repl -julia> X, y = @load_iris; -julia> mach = fit!(machine(MyClassifier(), X, y)); -julia> predict(mach, selectrows(X, 1:2)) -2-element Array{UnivariateFinite{String,UInt32,Float64},1}: - UnivariateFinite(setosa=>0.333, versicolor=>0.333, virginica=>0.333) - UnivariateFinite(setosa=>0.333, versicolor=>0.333, virginica=>0.333) +```@repl classifier_example +X, y = @load_iris; +mach = machine(MyClassifier(), X, y) |> fit!; +predict(mach, selectrows(X, 1:2)) ``` From c24a96b27cfec242c870b69ae86af85957230d3d Mon Sep 17 00:00:00 2001 From: Abhro R <5664668+abhro@users.noreply.github.com> Date: Tue, 14 May 2024 22:50:52 -0400 Subject: [PATCH 17/24] Use bigger fences for cheatsheet code --- docs/src/mlj_cheatsheet.md | 63 +++++++++++++++++++++++++++----------- 1 file changed, 45 insertions(+), 18 deletions(-) diff --git a/docs/src/mlj_cheatsheet.md b/docs/src/mlj_cheatsheet.md index 11873358b..ac58eb6a1 100644 --- a/docs/src/mlj_cheatsheet.md +++ b/docs/src/mlj_cheatsheet.md @@ -38,7 +38,10 @@ models() do model end ``` -`Tree = @load DecisionTreeClassifier pkg=DecisionTree` imports "DecisionTreeClassifier" type and binds it to `Tree`. +```julia +Tree = @load DecisionTreeClassifier pkg=DecisionTree +``` +imports "DecisionTreeClassifier" type and binds it to `Tree`. `tree = Tree()` to instantiate a `Tree`. 
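For orientation, the load-instantiate-fit cycle looks something like the following sketch, which assumes `MLJDecisionTreeInterface` is installed in the active environment:

```julia
using MLJ

Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree(max_depth=3)         # instance with a non-default hyperparameter

X, y = @load_iris
mach = machine(tree, X, y) |> fit!
yhat = predict_mode(mach, X)     # point predictions from a probabilistic model
```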
@@ -90,8 +93,8 @@ Same as above but exclude `:Time` column from `X`: using RDatasets channing = dataset("boot", "channing") y, X = unpack(channing, - ==(:Exit), # y is the :Exit column - !=(:Time); # X is the rest, except :Time + ==(:Exit), # y is the :Exit column + !=(:Time); # X is the rest, except :Time rng=123) ``` @@ -117,8 +120,8 @@ Getting data from [OpenML](https://www.openml.org): ```julia table = OpenML.load(91) ``` -Creating synthetic classification data: +Creating synthetic classification data: ```julia X, y = make_blobs(100, 2) ``` @@ -147,8 +150,10 @@ mach = machine(model, X) ## Fitting -`fit!(mach, rows=1:100, verbosity=1, force=false)` (defaults shown) - +The `fit!` function can be used to fit a machine (defaults shown): +```julia +fit!(mach, rows=1:100, verbosity=1, force=false) +``` ## Prediction @@ -188,11 +193,17 @@ pkg="MultivariateStats")` gets all properties (aka traits) of registered models ## Performance estimation -`evaluate(model, X, y, resampling=CV(), measure=rms, operation=predict, weights=..., verbosity=1)` +```julia +evaluate(model, X, y, resampling=CV(), measure=rms, operation=predict, weights=..., verbosity=1) +``` -`evaluate!(mach, resampling=Holdout(), measure=[rms, mav], operation=predict, weights=..., verbosity=1)` +```julia +evaluate!(mach, resampling=Holdout(), measure=[rms, mav], operation=predict, weights=..., verbosity=1) +``` -`evaluate!(mach, resampling=[(fold1, fold2), (fold2, fold1)], measure=rms)` +```julia +evaluate!(mach, resampling=[(fold1, fold2), (fold2, fold1)], measure=rms) +``` ## Resampling strategies (`resampling=...`) @@ -212,7 +223,9 @@ or a list of pairs of row indices: ### Tuning model wrapper -`tuned_model = TunedModel(model=…, tuning=RandomSearch(), resampling=Holdout(), measure=…, operation=predict, range=…)` +```julia +tuned_model = TunedModel(model=…, tuning=RandomSearch(), resampling=Holdout(), measure=…, operation=predict, range=…) +``` ### Ranges for tuning (`range=...`) @@ -242,20 +255,28 @@ Also available: `LatinHyperCube`, `Explicit` (built-in), `MLJTreeParzenTuning`, For generating a plot of performance against parameter specified by `range`: -`curve = learning_curve(mach, resolution=30, resampling=Holdout(), measure=…, operation=predict, range=…, n=1)` +```julia +curve = learning_curve(mach, resolution=30, resampling=Holdout(), measure=…, operation=predict, range=…, n=1) +``` -`curve = learning_curve(model, X, y, resolution=30, resampling=Holdout(), measure=…, operation=predict, range=…, n=1)` +```julia +curve = learning_curve(model, X, y, resolution=30, resampling=Holdout(), measure=…, operation=predict, range=…, n=1) +``` If using Plots.jl: -`plot(curve.parameter_values, curve.measurements, xlab=curve.parameter_name, xscale=curve.parameter_scale)` +```julia +plot(curve.parameter_values, curve.measurements, xlab=curve.parameter_name, xscale=curve.parameter_scale) +``` ## Controlling iterative models Requires: `using MLJIteration` -`iterated_model = IteratedModel(model=…, resampling=Holdout(), measure=…, controls=…, retrain=false)` +```julia +iterated_model = IteratedModel(model=…, resampling=Holdout(), measure=…, controls=…, retrain=false) +``` ### Controls @@ -291,16 +312,22 @@ Externals include: `PCA` (in MultivariateStats), `KMeans`, `KMedoids` (in Cluste ## Ensemble model wrapper -`EnsembleModel(atom=…, weights=Float64[], bagging_fraction=0.8, rng=GLOBAL_RNG, n=100, parallel=true, out_of_bag_measure=[])` +```julia +EnsembleModel(atom=…, weights=Float64[], bagging_fraction=0.8, rng=GLOBAL_RNG, n=100, 
parallel=true, out_of_bag_measure=[]) +``` ## Target transformation wrapper -`TransformedTargetModel(model=ConstantClassifier(), target=Standardizer())` +```julia +TransformedTargetModel(model=ConstantClassifier(), target=Standardizer()) +``` ## Pipelines -`pipe = (X -> coerce(X, :height=>Continuous)) |> OneHotEncoder |> KNNRegressor(K=3)` +```julia +pipe = (X -> coerce(X, :height=>Continuous)) |> OneHotEncoder |> KNNRegressor(K=3) +``` - Unsupervised: @@ -312,4 +339,4 @@ Externals include: `PCA` (in MultivariateStats), `KMeans`, `KMedoids` (in Cluste ## Advanced model composition techniques -See the [Composing Models](@ref) section of the MLJ manual. +See the [Composing Models](@ref) section of the MLJ manual. From 331bac89c1663208b92a6630162f6216553c7df9 Mon Sep 17 00:00:00 2001 From: Abhro R <5664668+abhro@users.noreply.github.com> Date: Tue, 14 May 2024 22:54:12 -0400 Subject: [PATCH 18/24] Promote headers in cheatsheet --- docs/src/mlj_cheatsheet.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/src/mlj_cheatsheet.md b/docs/src/mlj_cheatsheet.md index ac58eb6a1..57cb24c46 100644 --- a/docs/src/mlj_cheatsheet.md +++ b/docs/src/mlj_cheatsheet.md @@ -219,15 +219,15 @@ or a list of pairs of row indices: `[(train1, eval1), (train2, eval2), ... (traink, evalk)]` -## Tuning -### Tuning model wrapper + +## Tuning model wrapper ```julia tuned_model = TunedModel(model=…, tuning=RandomSearch(), resampling=Holdout(), measure=…, operation=predict, range=…) ``` -### Ranges for tuning (`range=...`) +## Ranges for tuning (`range=...`) If `r = range(KNNRegressor(), :K, lower=1, upper = 20, scale=:log)` @@ -242,7 +242,7 @@ Nested ranges: Use dot syntax, as in `r = range(EnsembleModel(atom=tree), :(atom Can specify multiple ranges, as in `range=[r1, r2, r3]`. For more range options do `?Grid` or `?RandomSearch` -### Tuning strategies +## Tuning strategies `RandomSearch(rng=1234)` for basic random search @@ -251,7 +251,7 @@ Can specify multiple ranges, as in `range=[r1, r2, r3]`. For more range options Also available: `LatinHyperCube`, `Explicit` (built-in), `MLJTreeParzenTuning`, `ParticleSwarm`, `AdaptiveParticleSwarm` (3rd-party packages) -#### Learning curves +### Learning curves For generating a plot of performance against parameter specified by `range`: From 0acc87642bb5d9befe217046a87a0404861c167f Mon Sep 17 00:00:00 2001 From: Abhro R <5664668+abhro@users.noreply.github.com> Date: Tue, 14 May 2024 23:10:15 -0400 Subject: [PATCH 19/24] Use Clustering.jl instead of ParallelKMeans --- docs/Project.toml | 1 - docs/src/transformers.md | 2 +- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/Project.toml b/docs/Project.toml index d86571e62..b9718b4ea 100755 --- a/docs/Project.toml +++ b/docs/Project.toml @@ -15,7 +15,6 @@ MLJLinearModels = "6ee0df7b-362f-4a72-a706-9e79364fb692" MLJMultivariateStatsInterface = "1b6a4a23-ba22-4f51-9698-8599985d3728" Missings = "e1d29d7a-bbdc-5cf2-9ac0-f12de2c33e28" NearestNeighborModels = "636a865e-7cf4-491e-846c-de09b730eb36" -ParallelKMeans = "42b8e9d4-006b-409a-8472-7f34b3fb58af" Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c" ScientificTypesBase = "30f210dd-8aff-4c5f-94ba-8e64358c1161" StatisticalMeasures = "a19d573c-0a75-4610-95b3-7071388c7541" diff --git a/docs/src/transformers.md b/docs/src/transformers.md index 0c32ed2e6..59373b9eb 100644 --- a/docs/src/transformers.md +++ b/docs/src/transformers.md @@ -202,7 +202,7 @@ import Random.seed! 
seed!(123) X, y = @load_iris -KMeans = @load KMeans pkg=ParallelKMeans +KMeans = @load KMeans pkg=Clustering kmeans = KMeans() mach = machine(kmeans, X) |> fit! nothing # hide From 18e9c9ff1208623de170c6302c033ae45c551bc4 Mon Sep 17 00:00:00 2001 From: abhro <5664668+abhro@users.noreply.github.com> Date: Wed, 15 May 2024 09:57:57 -0400 Subject: [PATCH 20/24] Remove unsupported use of info() from cheatsheet --- docs/src/mlj_cheatsheet.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/src/mlj_cheatsheet.md b/docs/src/mlj_cheatsheet.md index 57cb24c46..4a8edc82a 100644 --- a/docs/src/mlj_cheatsheet.md +++ b/docs/src/mlj_cheatsheet.md @@ -173,8 +173,6 @@ fit!(mach, rows=1:100, verbosity=1, force=false) `info(ConstantRegressor())`, `info("PCA")`, `info("RidgeRegressor", pkg="MultivariateStats")` gets all properties (aka traits) of registered models -`info(rms)` gets all properties of a performance measure - `schema(X)` get column names, types and scitypes, and nrows, of a table `X` `scitype(X)` gets the scientific type of `X` From 739ca21661f9ee6fde6330c6fbb1dffca8995bcf Mon Sep 17 00:00:00 2001 From: abhro <5664668+abhro@users.noreply.github.com> Date: Wed, 15 May 2024 16:37:38 -0400 Subject: [PATCH 21/24] Remove comments to have not as wide code lines --- docs/src/mlj_cheatsheet.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/docs/src/mlj_cheatsheet.md b/docs/src/mlj_cheatsheet.md index 4a8edc82a..5cd24bea5 100644 --- a/docs/src/mlj_cheatsheet.md +++ b/docs/src/mlj_cheatsheet.md @@ -92,11 +92,12 @@ Same as above but exclude `:Time` column from `X`: ```julia using RDatasets channing = dataset("boot", "channing") -y, X = unpack(channing, - ==(:Exit), # y is the :Exit column - !=(:Time); # X is the rest, except :Time - rng=123) +y, X = unpack(channing, + ==(:Exit), + !=(:Time); + rng=123) ``` +Here, `y` is assigned the `:Exit` column, and `X` is assigned the rest, except `:Time`. Splitting row indices into train/validation/test, with seeded shuffling: From f2113224e49551792394af6f27f1e92a247c2154 Mon Sep 17 00:00:00 2001 From: abhro <5664668+abhro@users.noreply.github.com> Date: Wed, 15 May 2024 16:39:46 -0400 Subject: [PATCH 22/24] Add description of data coercion in cheatsheet --- docs/src/mlj_cheatsheet.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/src/mlj_cheatsheet.md b/docs/src/mlj_cheatsheet.md index 5cd24bea5..6f4c7dd84 100644 --- a/docs/src/mlj_cheatsheet.md +++ b/docs/src/mlj_cheatsheet.md @@ -69,6 +69,8 @@ type | scitype Use `schema(X)` to get the column scitypes of a table `X` +To coerce the data into different scitypes, use the `coerce` function: + - `coerce(y, Multiclass)` attempts coercion of all elements of `y` into scitype `Multiclass` - `coerce(X, :x1 => Continuous, :x2 => OrderedFactor)` to coerce columns `:x1` and `:x2` of table `X`. 
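To make the coercion patterns concrete, here is a small sketch using a made-up three-column table (`DataFrames` is assumed to be available):

```julia
using MLJ
import DataFrames as DF

X = DF.DataFrame(height   = [1.85, 1.67, 1.50],
                 mark     = [12, 14, 11],
                 admitted = ["yes", "no", "yes"])
schema(X)        # height is Continuous, mark is Count, admitted is Textual

Xfixed = coerce(X, :mark => OrderedFactor, :admitted => Multiclass)
schema(Xfixed)   # mark is now OrderedFactor, admitted is now Multiclass
```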
From d0796444f1941d3b468428198b554734db632e23 Mon Sep 17 00:00:00 2001 From: abhro <5664668+abhro@users.noreply.github.com> Date: Wed, 15 May 2024 16:44:37 -0400 Subject: [PATCH 23/24] Update docs/src/mlj_cheatsheet.md Co-authored-by: Essam --- docs/src/mlj_cheatsheet.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/mlj_cheatsheet.md b/docs/src/mlj_cheatsheet.md index 6f4c7dd84..c2dd70e0b 100644 --- a/docs/src/mlj_cheatsheet.md +++ b/docs/src/mlj_cheatsheet.md @@ -228,7 +228,7 @@ or a list of pairs of row indices: tuned_model = TunedModel(model=…, tuning=RandomSearch(), resampling=Holdout(), measure=…, operation=predict, range=…) ``` -## Ranges for tuning (`range=...`) +## Ranges for tuning `(range=...)` If `r = range(KNNRegressor(), :K, lower=1, upper = 20, scale=:log)` From 650ebbd6dd3fc33a041a18451f9320a46ddd4b8f Mon Sep 17 00:00:00 2001 From: abhro <5664668+abhro@users.noreply.github.com> Date: Wed, 15 May 2024 22:35:02 -0400 Subject: [PATCH 24/24] Remove other occurence of `info` on measure --- docs/src/mlj_cheatsheet.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/mlj_cheatsheet.md b/docs/src/mlj_cheatsheet.md index c2dd70e0b..ea8d0d032 100644 --- a/docs/src/mlj_cheatsheet.md +++ b/docs/src/mlj_cheatsheet.md @@ -299,7 +299,7 @@ Wraps: `MLJIteration.skip(control, predicate=1)`, `IterationControl.with_state_d Do `measures()` to get full list. -`info(rms)` to list properties (aka traits) of the `rms` measure +`?rms` in the REPL can provide information about the `rms` measure, and can be used with any measure or their aliases. ## Transformers
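A measure such as `rms` is most often passed directly to `evaluate`. A minimal sketch, assuming `NearestNeighborModels` is available in the active environment:

```julia
using MLJ

X, y = make_regression(100, 3)   # synthetic regression data
KNNRegressor = @load KNNRegressor pkg=NearestNeighborModels

evaluate(KNNRegressor(K=5), X, y,
         resampling=CV(nfolds=5, rng=123),
         measures=[rms, mav])
```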