Jchemo.jl

Chemometrics and machine learning on high-dimensional data with Julia

Project status: Active (the project has reached a stable, usable state and is being actively developed).

About

Jchemo was initially dedicated to partial least squares regression (PLSR) and discrimination (PLSDA) methods and their extensions, in particular locally weighted PLS models (KNN-LWPLS-R & -DA; e.g. https://doi.org/10.1002/cem.3209). The package has since been expanded with many other dimension-reduction, regression, and discrimination methods.

Why the name Jchemo? Because the package is oriented towards chemometrics, in brief the use of biometrics for chemistry. Most of the provided methods are nevertheless generic and apply to other types of data than chemistry.

Sample workflow

Suppose training data (X, Y) and predictions expected for new data Xnew, using a PLSR model with 15 latent variables (LVs). The workflow is as follows:

  1. An object, e.g. model (or any other name), is built from the given learning model and its parameters, if any. This object contains three sub-objects:
    • algo (the learning algorithm),
    • fitm (the fitted model, empty at this stage),
    • kwargs (the specified keyword arguments).
  2. Function fit! fits the model to the data, which fills sub-object fitm above.
  3. Function predict runs the predictions.
model = plskern(nlv = 15, scal = true)
fit!(model, X, Y)
pred = predict(model, Xnew).pred

We can check the contents of object model

@names model

(:algo, :fitm, :kwargs)

An alternative syntax for the keyword arguments is

nlv = 15 ; scal = true
model = plskern(; nlv, scal)

After model fitting, the matrices of the PLS scores can be obtained with function transf

T = transf(model, X)   # can also be obtained directly by: model.fitm.T
Tnew = transf(model, Xnew)

Other sample workflows are given at the end of this README.

Package structure

Jchemo is organized between

  • transform operators (that have a function transf),
  • predictors (that have a function predict),
  • utility functions.

Some models, such as PLSR models, are both a transform operator and a predictor.

Ad hoc pipelines of operations can also be built. In Jchemo, a pipeline is a chain of K modeling steps containing

  • either K transform steps,
  • or K - 1 transform steps and a final prediction step.

The pipelines are built with function pip.
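For instance, a minimal sketch of a two-step pipeline chaining one transform step (SNV preprocessing) and a final prediction step (PLSR); both functions are detailed in the examples at the end of this README:

model = pip(snv(), plskern(nlv = 15))
fit!(model, X, Y)
pred = predict(model, Xnew).pred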

Keyword arguments

The keyword arguments required or allowed in a function can be found in the Index of functions section of the documentation, or in the REPL by displaying the function's help page, for instance for function plskern

julia> ?plskern

Default values can be displayed in the REPL with macro @pars

julia> @pars plskern

Jchemo.ParPlsr
  nlv: Int64 1
  scal: Bool false

Multi-threading

Some functions (e.g. those using kNN selections) use multi-threading to speed up the computations. Taking advantage of this requires specifying a relevant number of threads (for instance from the Settings menu of the VsCode Julia extension and the file settings.json).
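For instance, when Julia is started from a terminal, the number of threads can be set with the --threads option and then checked within the session:

## From a terminal, start Julia with e.g.: julia --threads auto
## Then, in the REPL, check the number of threads available to the session:
julia> Threads.nthreads()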

Plotting

Jchemo uses Makie for plotting. Displaying the plots requires installing and loading one of the Makie backends (CairoMakie or GLMakie).
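As an illustration, a minimal sketch plotting the rows of a matrix of fictive spectra with Jchemo's function plotsp (assuming that backend CairoMakie is installed):

using Jchemo, CairoMakie
X = rand(10, 200)   # 10 fictive spectra of 200 points
plotsp(X).f         # field f contains the Makie figure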

Datasets

The datasets used as examples in the function help pages are stored in package JchemoData.jl, a repository of datasets on chemometrics and other domains. Examples of scripts demonstrating Jchemo are also available in the pedagogical project JchemoDemo.
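As an illustration, the pattern used in the function help pages to load one of these datasets (a sketch assuming that the JLD2 package is installed; 'cassav' is one of the datasets stored in JchemoData):

using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat   # dat contains the dataset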

Tuning predictive models

Two grid-search functions are available to tune the predictors:

  • gridscore (tuning on a validation dataset),
  • gridcv (tuning by cross-validation).

The syntax is generic for all the functions (see the respective help pages for sample workflows). These tuning tools have been specifically accelerated for models based on latent variables and ridge regularization.
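As an illustration, a hedged sketch of tuning the number of LVs of a PLSR model with gridscore, assuming validation data (Xval, Yval) and the keyword syntax documented in the function's help page:

nlv = 0:30   # grid of nb. LVs to evaluate
model = plskern()
res = gridscore(model, Xtrain, Ytrain, Xval, Yval; score = rmsep, nlv)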

Benchmark

using Jchemo, BenchmarkTools
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 23 on 16 virtual cores
Environment:
  JULIA_EDITOR = code

Multi-variate PLSR with n = 1e6 observations

n = 10^6  # nb. observations (samples)
p = 500   # nb. X-variables (features)
q = 10    # nb. Y-variables to predict
nlv = 25  # nb. PLS latent variables
X = rand(n, p)
Y = rand(n, q)
zX = Float32.(X)
zY = Float32.(Y)
## Float64
model = plskern(; nlv)
@benchmark fit!($model, $X, $Y)

BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 7.532 s (1.07% GC) to evaluate,
 with a memory estimate of 4.09 GiB, over 2677 allocations.
## Float32 
@benchmark fit!($model, $zX, $zY) 

BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.956 s … 4.148 s  ┊ GC (min … max): 0.82% … 3.95%
 Time  (median):     4.052 s               ┊ GC (median):    2.42%
 Time  (mean ± σ):   4.052 s ± 135.259 ms  ┊ GC (mean ± σ):  2.42% ± 2.21%

  █                                                        █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.96 s         Histogram: frequency by time         4.15 s <

 Memory estimate: 2.05 GiB, allocs estimate: 2677.

## (NB.: multi-threading is not used in plskern) 

Installation

To install Jchemo

  • From the official Julia repo, run in the Pkg REPL:

pkg> add Jchemo

or, for a specific version, for instance:

pkg> add Jchemo@0.8.5

  • For the current development version (potentially unstable):

pkg> add https://github.com/mlesnoff/Jchemo.jl.git

Warning

Before updating the package, it is recommended to have a look at What changed to avoid potential problems due to breaking changes.

Examples of syntax

Some fictive data

n = 150 ; p = 200 
q = 2 ; m = 50 
Xtrain = rand(n, p)
Ytrain = rand(n, q) 
Xtest = rand(m, p)
Ytest = rand(m, q) 

Transform operations

a) Example of signal preprocessing

Consider a signal preprocessing with the Savitzky-Golay filter, using function savgol

## Below, the order of the kwargs is not important but the argument 
## names have to be correct.

## Model definition
## (below, the name 'model' can be replaced by any other name)
npoint = 11 ; deriv = 2 ; degree = 3
model = savgol(; npoint, deriv, degree)

## Fitting
fit!(model, Xtrain)

## Transformed (= preprocessed) data
Xptrain = transf(model, Xtrain)  
Xptest = transf(model, Xtest)

Several preprocessing steps can be applied sequentially to the data by building a pipeline (see the Pipelines section below).

b) Example of PCA

Consider a principal component analysis, using SVD and function pcasvd

nlv = 15  # nb. principal components
model = pcasvd(; nlv)
fit!(model, Xtrain)

## Score matrices
Ttrain = transf(model, Xtrain) # same as:  model.fitm.T
Ttest = transf(model, Xtest)

## Model summary (% of explained variance, etc.)
summary(model, Xtrain)

For a preliminary scaling of the data before the PCA

nlv = 15 ; scal = true
model = pcasvd(; nlv, scal)
fit!(model, Xtrain)

Prediction models

a) Example of KPLSR

Consider a (Gaussian) kernel partial least squares regression (KPLSR), using function kplsr

nlv = 15  # nb. latent variables
kern = :krbf ; gamma = .001 
model = kplsr(; nlv, kern, gamma)
fit!(model, Xtrain, Ytrain)

## PLS score matrices can be computed by:
Ttrain = transf(model, Xtrain)   # = model.fitm.T
Ttest = transf(model, Xtest)

## Model summary
summary(model, Xtrain)

## Y-Predictions
pred = predict(model, Xtest).pred

Pipelines

a) Example of chained preprocessing

Consider a data preprocessing by standard normal variate (SNV) transformation, followed by a Savitzky-Golay filter and a polynomial de-trending transformation

## Model definitions
model1 = snv()
model2 = savgol(npoint = 5, deriv = 1, degree = 2)
model3 = detrend_pol()  

## Pipeline building and fitting
model = pip(model1, model2, model3)
fit!(model, Xtrain)

## Transformed data
Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)

b) Example of PCA-SVMR

Consider a support vector machine regression model implemented on preliminary computed PCA scores (PCA-SVMR)

nlv = 15
kern = :krbf ; gamma = .001 ; cost = 1000
model1 = pcasvd(; nlv)
model2 = svmr(; kern, gamma, cost)
model = pip(model1, model2)
fit!(model, Xtrain, Ytrain)

## Y-predictions
pred = predict(model, Xtest).pred

Step(s) of data preprocessing can obviously be implemented before the model(s)

nlv = 15
kern = :krbf ; gamma = .001 ; cost = 1000
model1 = detrend_pol(degree = 2)   # polynomial de-trending of degree 2
model2 = pcasvd(; nlv)
model3 = svmr(; kern, gamma, cost)
model = pip(model1, model2, model3)

c) Example of LWR (Naes et al. 1990)

The LWR algorithm of Naes et al. (1990) consists of a preliminary global PCA on the data, followed by a kNN locally weighted multiple linear regression (kNN-LWMLR) on the global PCA scores

nlv = 25
metric = :eucl ; h = 2 ; k = 200
model1 = pcasvd(; nlv)
model2 = lwmlr(; metric, h, k)
model = pip(model1, model2)

Naes et al., 1990. Analytical Chemistry 62, 664–673.

d) Example of Shen et al. 2019

The pipeline of Shen et al. (2019) consists of a preliminary global PLSR on the data, followed by a kNN-PLSR on the global PLSR scores

nlv = 25
metric = :mah ; h = Inf ; k = 200
model1 = plskern(; nlv)
model2 = lwplsr(; metric, h, k)
model = pip(model1, model2)

Shen et al., 2019. Journal of Chemometrics 33(5), e3117.

Credit

Author

Matthieu Lesnoff
contact: [email protected]

How to cite

Lesnoff, M. 2021. Jchemo: Chemometrics and machine learning on high-dimensional data with Julia. https://github.com/mlesnoff/Jchemo. UMR SELMET, Univ Montpellier, CIRAD, INRA, Institut Agro, Montpellier, France

Acknowledgments
