# Improve `WhittakerSmooth`, `AirPLS`, and `ArPLS` performance - sparse matrix operations (#44)
Hello,

**Suspected main problem**

As far as I can judge, the problem affects both general sparse matrices like

```
A = [[ a 0 0 0 b]
     [ 0 c 0 d 0]
     [ 0 e f 0 g]
     [ 0 0 0 h 0]
     [ 0 0 0 i j]]
```

and the highly structured sparse format encountered in all the described functionalities (I'll call them Whittaker-like), which is banded (only a few central diagonals are non-zero) and symmetric, e.g.,

```
B = [[ a b 0 0 0]
     [ b c d 0 0]
     [ 0 d e f 0]
     [ 0 0 f g h]
     [ 0 0 0 h i]]
```

Usually, sparse solvers rely on LU decomposition, but as far as I'm aware, the LU decomposition of a general sparse matrix like `A` does not preserve the sparsity pattern (fill-in).

**Proposed solution**

In contrast to this, banded storage keeps the bandwidth fixed, so SciPy's dedicated banded solvers can be used:

```python
import numpy as np
from scipy.linalg import solve_banded, solveh_banded

a_banded = ...  # matrix to invert in banded storage with `n_upp` super- and `n_upp` sub-diagonals above or below the main diagonal
b = ...  # right-hand side vector

try:
    # Attempt banded Cholesky decomposition
    x = solveh_banded(
        a_banded[0 : n_upp + 1, ::],  # taking only the superdiagonals and the main diagonal since symmetric
        b,
        lower=False,  # only the superdiagonals and main diagonal were taken
    )
except np.linalg.LinAlgError:
    # Fall back to banded LU decomposition
    x = solve_banded(
        (n_upp, n_upp),  # number of sub- and superdiagonals
        a_banded,
        b,
    )
```

**Further tweaks**

What I just described is as far as one can get for the general case of arbitrary banded matrices.

Besides, there is another added bonus: all these functions can solve for several right-hand sides at once.
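As a runnable illustration of the banded approach above (my own sketch, not code from the branch; `n`, `lam`, and the test signal are made up): for a second-order difference penalty, the Whittaker system matrix A = I + lam * DᵀD is penta-diagonal and symmetric positive definite, so `solveh_banded` applies directly, with `solve_banded` as the LU fallback.

```python
import numpy as np
from scipy.linalg import solve_banded, solveh_banded

n = 200
n_upp = 2  # second-order differences give a penta-diagonal system
lam = 100.0
rng = np.random.default_rng(0)
y = np.sin(np.linspace(0.0, 4.0 * np.pi, n)) + 0.1 * rng.standard_normal(n)

# Dense A = I + lam * D^T D for clarity only; a real implementation would
# build the few non-zero diagonals directly without the dense detour
D = np.diff(np.eye(n), n=2, axis=0)
A = np.eye(n) + lam * (D.T @ D)

# Convert to SciPy's banded storage: ab[n_upp + i - j, j] == A[i, j]
a_banded = np.zeros((2 * n_upp + 1, n))
for offset in range(-n_upp, n_upp + 1):
    diag = np.diag(A, k=offset)
    if offset >= 0:
        a_banded[n_upp - offset, offset:] = diag
    else:
        a_banded[n_upp - offset, : n + offset] = diag

try:
    # banded Cholesky on the upper triangle (rows 0 .. n_upp hold the
    # superdiagonals and the main diagonal)
    x = solveh_banded(a_banded[: n_upp + 1, :], y, lower=False)
except np.linalg.LinAlgError:
    # banded LU fallback for the non-positive-definite case
    x = solve_banded((n_upp, n_upp), a_banded, y)
```

For a fixed bandwidth, both solvers run in O(n), versus the fill-in-prone general sparse LU.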
Hi @IruNikZe thanks a lot for the fantastic explanation, it looks super promising! I will turn this issue into a branch and you can add your changes there!!
@IruNikZe I have assigned you to that issue, and this is the branch you can add your changes to 😄 😄: `44-improve-airpls-and-arpls-performance-sparse-matrix-operations`
Hi-hi,

**1) Numerical instability of Whittaker smoother**

*Problem / Cause*

All in all, this blows up the condition number of the matrix to invert.

*Partial solution*

(1) and also a Rust implementation, which would probably make the package more scalable as functionality grows and new external functionalities are required, which can be kind of a nightmare in Cython.

*Added bonus*

**2) Performance of baseline algorithms**

*Problem / Cause / Solution / Implications*

Any feedback is highly welcome, but I guess when the draft pull request is submitted, things will get clearer 😁
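To make the instability concrete, here is a small demonstration (my own, not from the thread) of how the condition number of the Whittaker system matrix A = I + lam * DᵀD grows with the smoothing parameter lam:

```python
import numpy as np

n = 300
D = np.diff(np.eye(n), n=2, axis=0)  # second-order difference matrix
P = D.T @ D                          # penalty matrix

for lam in (1e2, 1e6, 1e10):
    cond = np.linalg.cond(np.eye(n) + lam * P)
    print(f"lam = {lam:.0e}  ->  cond(A) = {cond:.2e}")
```

Since the smallest eigenvalue of A stays near 1 while the largest grows like lam times the largest eigenvalue of P, the condition number grows roughly linearly in lam, which is why large penalties become numerically fragile in double precision.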
Hello,
None of the changes broke the API; however, it is extended quite a bit. Besides that, there are some further additions, and a few points remain for me to finish.
Since the Pull Request will be quite massive already, I would for now not touch the baseline initialisation (which will require new classes) and open a Pull Request with what is there this week (as a draft due to the package docs). @paucablop Should we leave the branch open and then do another Pull Request when the new baseline initialisation is finished to make sure everything is updated by the branch linked to this issue? I wish a great start into the week 👋 😸
I'd already like to add this tiny teaser for the automated smoothing as a result of the tests. Even though the smoothing itself is repeated quite often to find the optimum lambda (around 40 times or so), this still runs in less than 5 milliseconds on my relatively old machine 🏃
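The thread does not show how the optimum lambda is actually selected, so here is only a generic sketch of the repeated-smoothing pattern described above: evaluate a scoring criterion over a log-spaced lambda grid (about 40 candidates) and keep the argmin. Both `smooth` and the `score` callable are hypothetical stand-ins, not the package's API.

```python
import numpy as np
from scipy.sparse import diags, eye
from scipy.sparse.linalg import spsolve

def smooth(y, lam):
    # plain Whittaker smooth with a second-order difference penalty
    n = y.size
    D = diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    return spsolve((eye(n) + lam * (D.T @ D)).tocsc(), y)

def best_lambda(y, score, lambdas=np.logspace(-2, 8, 40)):
    # one smooth per candidate lambda; with banded solvers each solve is cheap
    return min(lambdas, key=lambda lam: score(y, smooth(y, lam)))

rng = np.random.default_rng(7)
t = np.linspace(0.0, 3.0 * np.pi, 500)
y = np.sin(t) + 0.1 * rng.standard_normal(500)
# hypothetical criterion: residual standard deviation should match the noise level
lam_opt = best_lambda(y, score=lambda y_, z_: abs(np.std(y_ - z_) - 0.1))
```

The grid search is embarrassingly simple, which is why making each individual solve fast matters so much when it is repeated dozens of times.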
@paucablop In case you already want to see the new functionalities: after you pull the current version of the issue/feature branch, all you need to do is:
Then, you can go through the notebook. It has short documentation as Markdown cells wherever required.
I timed the implementation against this Rust-based package.
I was not really able to reproduce our results, though, because the weights seem to be defined in a different manner (from 0 to 1 rather than variance-based). So, I guess we are on a good track here. In the future, this could be further improved by moving the main logic to Cython or Rust.
Hi @MothNik I went through the branch, and it looks really good, very nice job, and also very well documented 😄 Super exciting to start testing it out. Also thanks a lot for enabling a nice testing environment. I really like the auto lambda functionality 🤩! I have started testing it out on my end (I could not wait 😝 ) and will keep doing so during this week (I have so many use cases I want to test the functions 😋). I think that it would be nice to start the PR process when you are ready. I could give you a hand with the documentation (eventually, I would like to change the documentation site, but I think for now we keep it as it is). I usually like to keep the branches short, but since this would be a very related development, I think it is a good idea to leave the branch open and do another PR later on. I am very much looking forward to the PR, and once again, thank you very much for your brilliant contribution!
Hi @paucablop 👋 You said you had many use cases, but the automatic smoothing requires weights, and I was not sure whether you can get weights easily. Usually, one uses the reciprocal of the standard deviations from replicate measurements, but I don't know if you have them at hand. This made me think 🧠 🏋️ that, more generally, probably not all users can provide them, and limiting the automated smoothing to this scenario seems a bit "discriminating" to me 👀 Therefore, I added a utility function to estimate the standard deviation in a data-driven fashion from the signal alone. I adapted the so-called DER SNR metric, but spiced it up a little 🌶️ to enable not only global but also local noise estimation for signals where the noise level is not constant 🔉🔊🔉. The results for the test signal look promising. The function has a docstring that is more reminiscent of a SciPy function, but I tried to keep the API as open as possible because it's the first implementation of such a feature. However, you can get really far with the defaults already. I demonstrated it in Section 7 of the new, improved playground notebook: whittaker_show_v2_noise_estim.txt (the setup is the same as before).
I will finish the final timings of the improvements by the end of this week, but otherwise I'm ready for the Pull Request. Best regards
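For reference, the classical global DER SNR noise estimate that the comment above adapts can be sketched in a few lines (this is the textbook global version, not the package's extended local variant; the function name and test signal are my own):

```python
import numpy as np

def der_snr_noise(signal):
    """Global noise (standard deviation) estimate via the DER SNR metric."""
    s = np.asarray(signal, dtype=float)
    # 5-point second difference 2*s[i] - s[i-2] - s[i+2]: for i.i.d. Gaussian
    # noise its std is sqrt(6)*sigma, and dividing the median absolute value
    # by 0.6745 (i.e., multiplying by 1.482602) turns it into a std estimate
    return 1.482602 / np.sqrt(6.0) * np.median(
        np.abs(2.0 * s[2:-2] - s[:-4] - s[4:])
    )

rng = np.random.default_rng(42)
x = np.linspace(0.0, 2.0 * np.pi, 5000)
noisy = np.sin(x) + rng.normal(scale=0.05, size=x.size)
sigma_hat = der_snr_noise(noisy)  # close to the true noise level of 0.05
```

The wide stencil makes the estimate insensitive to smooth signal structure, which is what allows estimating the noise from the signal alone.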
@paucablop whittaker_show_v3_noise_estim.txt

Do you have any strong opinion on this? I can easily revert the last commit I did.
@MothNik Just saw the PR, very exciting!! I have been checking the three functions with some of my use cases, and they are working really well. They are significantly faster: on the data I used to benchmark the two versions, I could see around an order of magnitude improvement 🚀🚀, as well as better numerical stability (spectra of around 3500 datapoints) 🦾🦾. Very good job!!! I indeed have the opportunity to estimate the weights from the inverse standard deviation of several measurements, but I know it is not always the case, so I am super happy you suggested a method to do so. I did not know this method before, but after reading the attached paper you included, it seems like a very smart idea and a brilliant addition to the methods 💡💡 I will jump to the PR right away!!! 😄😄 Thanks again for the great job @MothNik !!
@paucablop You're welcome! 😸
@paucablop Renamed and tested 👍 |
## Description

AirPLS (Adaptive Iteratively Reweighted Penalized Least Squares) and ArPLS (Asymmetrically Reweighted Penalized Least Squares) are powerful algorithms for removing complex non-linear baselines from spectral signals. However, their computational cost can be significant, especially when processing large numbers of spectra. Currently, we use the `csc_matrix` representation from `scipy.sparse` to optimize performance, but further improvements are needed.

## Improving Attempts
To improve the performance, I have tried just-in-time compilation of some key functions using `numba`. However, `numba` does not support the `csc_matrix` type, and I cannot JIT compile the code. To overcome this issue, I thought of looking for a `numba`-compatible representation of sparse matrices, but could not find one. Therefore, I have created my own, together with some functions to perform basic algebra operations with them (code in this Gist). Unfortunately, this did not improve the performance over the current implementation.

## Hacktoberfest Challenge
We invite open source developers to contribute to our project during Hacktoberfest. The goal is to improve the performance of both algorithms.
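For contributors who want a minimal mental model of what needs to be sped up: the core of all three routines is a penalized weighted least-squares solve. A simplified stand-in (not the package's actual code; it assumes a second-order difference penalty) using the current `csc_matrix` approach looks like:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def whittaker_solve(y, w, lam):
    """Solve (W + lam * D^T D) z = W y with sparse csc matrices."""
    n = y.size
    # second-order difference matrix D and diagonal weight matrix W
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n)).tocsc()
    W = sparse.diags(w).tocsc()
    A = (W + lam * (D.T @ D)).tocsc()
    return spsolve(A, w * y)
```

AirPLS and ArPLS repeat a solve like this once per reweighting iteration, which is why speeding up the linear solve dominates the total runtime.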
Here are some ideas to work on:

- JIT compilation of key functions using `numba`

## How to Contribute
Here are the contributing guidelines.
## Contact
We can have the conversation in the Issue or the Discussion.
## Resources
Here are some relevant resources and references for understanding the theory and implementation of the AirPLS and ArPLS algorithms: