Jchemo was initially dedicated to partial least squares regression (PLSR) and discrimination (PLSDA) methods and their extensions, in particular locally weighted PLS models (kNN-LWPLS-R and -DA; e.g. https://doi.org/10.1002/cem.3209). Since then, the package has been expanded with many other dimension-reduction, regression, and discrimination methods.
Why the name Jchemo? Because the package is oriented towards chemometrics, in brief the use of biometrics for chemistry. However, most of the provided methods are generic and apply to other types of data than chemistry.
Suppose training data (X, Y), and predictions expected from new data Xnew, using a PLSR model with 15 latent variables (LVs). The workflow is as follows:
- An object, e.g. model (or any other name), is built from the given learning model and its optional parameters. This object contains three sub-objects: algo (the learning algorithm), fitm (the fitted model, empty at this stage) and kwargs (the specified keyword arguments).
- Function fit! fits the model to the data, which fills sub-object fitm above.
- Function predict runs the predictions.
model = plskern(nlv = 15, scal = true)
fit!(model, X, Y)
pred = predict(model, Xnew).pred
We can check the contents of object model
@names model
(:algo, :fitm, :kwargs)
An alternative syntax for the keyword arguments is
nlv = 15 ; scal = true
model = plskern(; nlv, scal)
After model fitting, the matrices of the PLS scores can be obtained from function transf
T = transf(model, X) # can also be obtained directly by: model.fitm.T
Tnew = transf(model, Xnew)
Other sample workflows are given at the end of this README.
Jchemo is organized around:
- transform operators (that have a function transf),
- predictors (that have a function predict),
- utility functions.
Some models, such as PLSR models, are both a transform operator and a predictor.
Ad hoc pipelines of operations can also be built. In Jchemo, a pipeline is a chain of K modeling steps containing
- either K transform steps,
- or K - 1 transform steps and a final prediction step.
Pipelines are built with function pip.
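For instance, a minimal sketch of a two-step pipeline (one transform step followed by a final prediction step), using functions already shown in this README (snv, plskern, pip) and the data (X, Y, Xnew) of the workflow above:
## Pipeline = SNV preprocessing followed by a 15-LV PLSR
model1 = snv()
model2 = plskern(nlv = 15)
model = pip(model1, model2)
fit!(model, X, Y)
pred = predict(model, Xnew).pred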
Keyword arguments
The keyword arguments required/allowed in a function can be found in the Index of functions section of the documentation, or in the REPL by displaying the function's help page, for instance for function plskern
julia> ?plskern
Default values can be displayed in the REPL with macro @pars
julia> @pars plskern
Jchemo.ParPlsr
nlv: Int64 1
scal: Bool false
Multi-threading
Some functions (e.g. those using kNN selections) use multi-threading to speed up the computations. Taking advantage of this requires specifying a relevant number of threads (for instance from the Settings menu of the VS Code Julia extension and the file settings.json).
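For instance, the number of threads available in the running Julia session can be checked in the REPL (in the VS Code Julia extension, the setting to edit in settings.json is typically "julia.NumThreads"; this name is given as an indication and should be checked in the extension documentation):
julia> Threads.nthreads()   ## nb. threads available in the current session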
Plotting
Jchemo uses Makie for plotting. Displaying the plots requires installing and loading one of the Makie backends (CairoMakie or GLMakie).
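As a minimal sketch (assuming CairoMakie is installed; GLMakie would work the same way), the scores of a PCA fitted on simulated data can be plotted with a basic Makie scatter:
using Jchemo, CairoMakie
X = rand(100, 10)
model = pcasvd(nlv = 2)
fit!(model, X)
T = transf(model, X)   ## PCA score matrix
scatter(T[:, 1], T[:, 2]; axis = (xlabel = "PC 1", ylabel = "PC 2"))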
Datasets
The datasets used as examples in the function help pages are stored in package JchemoData.jl, a repository of datasets on chemometrics and other domains. Examples of scripts demonstrating Jchemo are also available in the pedagogical project JchemoDemo.
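As a sketch of how such a dataset can be loaded (the dataset name and file layout below are assumptions to be checked in JchemoData; the .jld2 files are read with package JLD2):
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))   ## root of the JchemoData package
db = joinpath(path_jdat, "data/cassav/cassav.jld2")   ## assumed dataset and path
@load db dat
X = dat.X ; Y = dat.Y   ## field names depend on the dataset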
Two grid-search functions are available to tune the predictors. The syntax is generic for all the functions (see the respective help pages for sample workflows). These tuning tools have been specifically accelerated for models based on latent variables and ridge regularization.
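As a sketch (the grid-search function gridscore, the score function rmsep and the exact keyword arguments below are assumptions to be checked against the help pages), tuning the number of LVs of a PLSR model on a validation set could look like:
using Jchemo
Xcal = rand(100, 50) ; Ycal = rand(100, 2)   ## calibration data
Xval = rand(30, 50) ; Yval = rand(30, 2)     ## validation data
model = plskern()
nlv = 0:20   ## grid of nb. LVs to evaluate
res = gridscore(model, Xcal, Ycal, Xval, Yval; score = rmsep, nlv)   ## assumed signature
## res contains the prediction score obtained for each value of nlv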
using Jchemo, BenchmarkTools
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 16 × Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 23 on 16 virtual cores
Environment:
JULIA_EDITOR = code
n = 10^6 # nb. observations (samples)
p = 500 # nb. X-variables (features)
q = 10 # nb. Y-variables to predict
nlv = 25 # nb. PLS latent variables
X = rand(n, p)
Y = rand(n, q)
zX = Float32.(X)
zY = Float32.(Y)
## Float64
model = plskern(; nlv)
@benchmark fit!($model, $X, $Y)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 7.532 s (1.07% GC) to evaluate,
with a memory estimate of 4.09 GiB, over 2677 allocations.
## Float32
@benchmark fit!($model, $zX, $zY)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min … max): 3.956 s … 4.148 s ┊ GC (min … max): 0.82% … 3.95%
Time (median): 4.052 s ┊ GC (median): 2.42%
Time (mean ± σ): 4.052 s ± 135.259 ms ┊ GC (mean ± σ): 2.42% ± 2.21%
█ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
3.96 s Histogram: frequency by time 4.15 s <
Memory estimate: 2.05 GiB, allocs estimate: 2677.
## (NB.: multi-threading is not used in plskern)
To install Jchemo
- From the official Julia registry, run in the Pkg REPL
pkg> add Jchemo
or for a specific version, for instance
pkg> add Jchemo@0.8.5
- For the current development version (potentially not stable)
pkg> add https://github.com/mlesnoff/Jchemo.jl.git
Warning
Before updating the package, it is recommended to have a look at What changed to avoid potential problems due to breaking changes.
n = 150 ; p = 200
q = 2 ; m = 50
Xtrain = rand(n, p)
Ytrain = rand(n, q)
ytrain = Ytrain[:, 1]   ## univariate response used in some examples below
Xtest = rand(m, p)
Ytest = rand(m, q)
Consider a signal preprocessing with the Savitzky-Golay filter, using function savgol
## Below, the order of the kwargs is not important but the argument
## names have to be correct.
## Model definition
## (below, the name 'model' can be replaced by any other name)
npoint = 11 ; deriv = 2 ; degree = 3
model = savgol(; npoint, deriv, degree)
## Fitting
fit!(model, Xtrain)
## Transformed (= preprocessed) data
Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)
Several preprocessing steps can be applied sequentially to the data by building a pipeline.
Consider a principal component analysis, using SVD and function pcasvd
nlv = 15 # nb. principal components
model = pcasvd(; nlv)
fit!(model, Xtrain)
## Score matrices
Ttrain = transf(model, Xtrain) # same as: model.fitm.T
Ttest = transf(model, Xtest)
## Model summary (% of explained variance, etc.)
summary(model, Xtrain)
For a preliminary scaling of the data before the PCA
nlv = 15 ; scal = true
model = pcasvd(; nlv, scal)
fit!(model, Xtrain)
Consider a (Gaussian) kernel partial least squares regression (KPLSR), using function kplsr
nlv = 15 # nb. latent variables
kern = :krbf ; gamma = .001
model = kplsr(; nlv, kern, gamma)
fit!(model, Xtrain, ytrain)
## PLS score matrices can be computed by:
Ttrain = transf(model, Xtrain) # = model.fitm.T
Ttest = transf(model, Xtest)
## Model summary
summary(model, Xtrain)
## Y-Predictions
pred = predict(model, Xtest).pred
Consider a data preprocessing by standard normal variate (SNV) transformation followed by a Savitzky-Golay filter and a polynomial de-trending transformation
## Model definitions
model1 = snv()
model2 = savgol(npoint = 5, deriv = 1, degree = 2)
model3 = detrend_pol()
## Pipeline building and fitting
model = pip(model1, model2, model3)
fit!(model, Xtrain)
## Transformed data
Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)
Consider a support vector machine regression model implemented on preliminarily computed PCA scores (PCA-SVMR)
nlv = 15
kern = :krbf ; gamma = .001 ; cost = 1000
model1 = pcasvd(; nlv)
model2 = svmr(; kern, gamma, cost)
model = pip(model1, model2)
fit!(model, Xtrain, ytrain)
## Y-predictions
pred = predict(model, Xtest).pred
Data preprocessing step(s) can of course be implemented before the model(s)
nlv = 15
kern = :krbf ; gamma = .001 ; cost = 1000
model1 = detrend_pol(degree = 2) # polynomial de-trending of degree 2
model2 = pcasvd(; nlv)
model3 = svmr(; kern, gamma, cost)
model = pip(model1, model2, model3)
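As in the previous example, the pipeline can then be fitted and used for Y-predictions (same simulated data as above):
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred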
The LWR algorithm of Naes et al. (1990) consists in a preliminary global PCA on the data, followed by a kNN locally weighted multiple linear regression (kNN-LWMLR) on the global PCA scores
nlv = 25
metric = :eucl ; h = 2 ; k = 200
model1 = pcasvd(; nlv)
model2 = lwmlr(; metric, h, k)
model = pip(model1, model2)
Naes et al., 1990. Analytical Chemistry 62, 664–673.
The pipeline of Shen et al. (2019) consists in a preliminary global PLSR on the data, followed by a kNN-PLSR on the global PLSR scores
nlv = 25
metric = :mah ; h = Inf ; k = 200
model1 = plskern(; nlv)
model2 = lwplsr(; metric, h, k)
model = pip(model1, model2)
Shen et al., 2019. Journal of Chemometrics, 33(5) e3117.
Matthieu Lesnoff
contact: [email protected]
- Cirad, UMR Selmet, Montpellier, France
- ChemHouse, Montpellier
Lesnoff, M. 2021. Jchemo: Chemometrics and machine learning on high-dimensional data with Julia. https://github.com/mlesnoff/Jchemo. UMR SELMET, Univ Montpellier, CIRAD, INRA, Institut Agro, Montpellier, France
- G. Cornu (Cirad) https://ur-forets-societes.cirad.fr/en/l-unite/l-equipe
- M. Metz (Pellenc ST, Pertuis, France)
- L. Plagne, F. Févotte (Triscale.innov) https://www.triscale-innov.com
- R. Vezy (Cirad) https://www.youtube.com/channel/UCxArXLI-gxlTmWGGgec5D7w