
Explore Numba Function Inlining #910

Open
seanlaw opened this issue Sep 6, 2023 · 8 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@seanlaw
Contributor

seanlaw commented Sep 6, 2023

In functions like _stump, the sub-functions _compute_diagonal and core._shift_insert_at_index may be called a lot. Therefore, we may gain some (small) performance improvement by adding @njit(inline='always') to those (and other) functions and possibly reduce the overhead cost of calling a function repeatedly. See more here.
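The hint could be applied like this. Note this is a simplified stand-in for `core._shift_insert_at_index`, not STUMPY's actual implementation, with a no-op fallback so the sketch runs even where Numba isn't installed:

```python
# Simplified sketch (stand-in for core._shift_insert_at_index, not the real
# STUMPY code), with a no-op fallback when Numba is unavailable.
import numpy as np

try:
    from numba import njit
except ImportError:
    def njit(*args, **kwargs):
        # Accept both @njit and @njit(inline='always') usage
        if args and callable(args[0]):
            return args[0]
        return lambda func: func

@njit(inline='always')
def _shift_insert_at_index(a, idx, v):
    # Shift a[idx:] right by one and insert v at idx (drops the last element)
    for i in range(a.shape[0] - 1, idx, -1):
        a[i] = a[i - 1]
    a[idx] = v

@njit
def caller(a):
    # The inline='always' hint asks Numba to paste the callee's body here,
    # avoiding per-call overhead inside hot loops
    _shift_insert_at_index(a, 1, -1.0)
    return a

out = caller(np.array([1.0, 2.0, 3.0, 4.0]))
print(out)  # [ 1. -1.  2.  3.]
```

The hint only matters at call sites inside other jitted functions; calling an inlined function directly from Python goes through the normal dispatch path either way.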

@seanlaw seanlaw added the enhancement New feature or request label Sep 6, 2023
@seanlaw seanlaw added the help wanted Extra attention is needed label Sep 18, 2023
@seanlaw seanlaw added this to the Python 1.12.1/1.13.0 Release milestone May 8, 2024
@joehiggi1758
Contributor

Hey @seanlaw hope you had a great day!

This one looks like a relatively straightforward next contribution for me. Is this still an issue, and if so, is it cool if I take a crack at it?

@seanlaw
Contributor Author

seanlaw commented Jul 30, 2024

@joehiggi1758 Sounds good. The most important thing here is to perform timing comparisons to ensure that adding this would actually make things faster. I recommend searching around previous issues where @NimaSarajpoor performed timing comparisons. It's possible that it doesn't make things faster.

@joehiggi1758
Contributor

joehiggi1758 commented Sep 5, 2024

Hey @seanlaw !

Finally got a chance to look into this, and I've concluded that we would see at best marginal improvements in the performance of _compute_diagonal and core._shift_insert_at_index if we add @njit(inline='always').

Generally, I approached my timing comparisons for both functions the same way (and tried to stay in line with the approach @NimaSarajpoor took here)...

  1. Prepared fake data to pass into both versions of each function (with Numba inlining and without), holding all else constant
  2. Performed compilations up front (with warm-ups) so that compilation time would not skew the inlined results
  3. Iterated 100x (except for _compute_diagonal at n=50000, where iterations=10 due to the constraints of my local hardware) and took the average completion times
  4. Compared the average completion times of the inlined and non-inlined versions of each function
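As a rough sketch, the four steps above might look like the following. The toy kernels are stand-ins of my own, not STUMPY code, and a no-op fallback keeps it runnable without Numba:

```python
# Rough timing-comparison sketch for the four steps above. The toy kernels
# (_sq_inline / _sq) are stand-ins, not STUMPY code.
import time
import numpy as np

try:
    from numba import njit
except ImportError:
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda func: func

@njit(inline='always')
def _sq_inline(x):
    return x * x

@njit
def _sq(x):
    return x * x

@njit
def total_inlined(a):
    s = 0.0
    for x in a:
        s += _sq_inline(x)  # call site Numba is asked to inline
    return s

@njit
def total_plain(a):
    s = 0.0
    for x in a:
        s += _sq(x)  # ordinary nested call
    return s

def avg_time(func, a, n_iter=100):
    func(a)  # step 2: warm-up so compilation doesn't skew the timing
    start = time.perf_counter()
    for _ in range(n_iter):  # step 3: average over many iterations
        func(a)
    return (time.perf_counter() - start) / n_iter

a = np.random.rand(10_000)  # step 1: fake data, held constant for both versions
gain = avg_time(total_plain, a) - avg_time(total_inlined, a)  # step 4
print(f"avg gain from inlining: {gain:.2e} s")
```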

My results for _compute_diagonal were as follows...

| function | n | m | iterations | avg_time_w_inline | avg_time_wo_inline | avg_efficiency_gain |
| --- | --- | --- | --- | --- | --- | --- |
| compute_diagonal | 10000 | 50 | 100 | 1.009117 | 1.017874 | 0.008757 |
| compute_diagonal | 20000 | 50 | 100 | 4.60779 | 4.395936 | -0.211854 |
| compute_diagonal | 50000 | 50 | 10 | 36.255343 | 36.938308 | 0.682965 |

My results for _shift_insert_at_index were as follows...

| function | n | m | iterations | avg_time_w_inline | avg_time_wo_inline | avg_efficiency_gain |
| --- | --- | --- | --- | --- | --- | --- |
| shift_insert_at_index | 10000 | 50 | 100 | 0.000014 | 0.000014 | 0 |
| shift_insert_at_index | 20000 | 50 | 100 | 0.000021 | 0.000024 | 0.000002 |
| shift_insert_at_index | 50000 | 50 | 100 | 0.000057 | 0.000055 | 0.000002 |

Please let me know if you have any questions, comments, or feedback, as I'd be more than happy to explain further! This one was a great learning opportunity for me and I had to do a decent amount of research, but I'm more than open to areas of opportunity since this was my first time doing a true timing comparison!

@seanlaw
Contributor Author

seanlaw commented Sep 6, 2024

@joehiggi1758 This is great. I agree that it doesn't seem to make anything faster. Would you mind trying one more for shift_insert_at_index but with varying values of k (i.e., k in range(5, 50, 5))?

Finally, would you mind posting a question to the Numba Discourse channel to share these results and to maybe get some clarification as to "when" having `inline='always'` would actually make a difference?

@joehiggi1758
Contributor

@seanlaw you got it!

I'll post on the numba channel shortly and tag you!

Regarding the varying levels of k for shift_insert_at_index here are the results...

| function | n | m | k | iterations | avg_time_w_inline | avg_time_wo_inline | avg_efficiency_gain |
| --- | --- | --- | --- | --- | --- | --- | --- |
| shift_insert_at_index | 10000 | 50 | 5 | 100 | 0.000011 | 0.00001 | -0.000002 |
| shift_insert_at_index | 10000 | 50 | 10 | 100 | 0.00002 | 0.000031 | 0.000011 |
| shift_insert_at_index | 10000 | 50 | 15 | 100 | 0.000011 | 0.000011 | 0 |
| shift_insert_at_index | 10000 | 50 | 20 | 100 | 0.00001 | 0.000036 | 0.000025 |
| shift_insert_at_index | 10000 | 50 | 25 | 100 | 0.000012 | 0.000012 | 0 |
| shift_insert_at_index | 10000 | 50 | 30 | 100 | 0.000011 | 0.000013 | 0.000002 |
| shift_insert_at_index | 10000 | 50 | 35 | 100 | 0.000014 | 0.000026 | 0.000012 |
| shift_insert_at_index | 10000 | 50 | 40 | 100 | 0.000023 | 0.00005 | 0.000027 |
| shift_insert_at_index | 10000 | 50 | 45 | 100 | 0.000018 | 0.000023 | 0.000005 |
| shift_insert_at_index | 20000 | 50 | 5 | 100 | 0.000015 | 0.000016 | 0.000001 |
| shift_insert_at_index | 20000 | 50 | 10 | 100 | 0.000025 | 0.000015 | -0.00001 |
| shift_insert_at_index | 20000 | 50 | 15 | 100 | 0.000016 | 0.000015 | 0 |
| shift_insert_at_index | 20000 | 50 | 20 | 100 | 0.00003 | 0.000061 | 0.00003 |
| shift_insert_at_index | 20000 | 50 | 25 | 100 | 0.000014 | 0.000014 | 0 |
| shift_insert_at_index | 20000 | 50 | 30 | 100 | 0.000041 | 0.000015 | -0.000027 |
| shift_insert_at_index | 20000 | 50 | 35 | 100 | 0.000016 | 0.000021 | 0.000005 |
| shift_insert_at_index | 20000 | 50 | 40 | 100 | 0.000029 | 0.000043 | 0.000014 |
| shift_insert_at_index | 20000 | 50 | 45 | 100 | 0.000019 | 0.000054 | 0.000035 |
| shift_insert_at_index | 50000 | 50 | 5 | 100 | 0.000029 | 0.000034 | 0.000005 |
| shift_insert_at_index | 50000 | 50 | 10 | 100 | 0.000051 | 0.000028 | -0.000023 |
| shift_insert_at_index | 50000 | 50 | 15 | 100 | 0.000024 | 0.000024 | 0 |
| shift_insert_at_index | 50000 | 50 | 20 | 100 | 0.000048 | 0.000029 | -0.000019 |
| shift_insert_at_index | 50000 | 50 | 25 | 100 | 0.000027 | 0.000047 | 0.00002 |
| shift_insert_at_index | 50000 | 50 | 30 | 100 | 0.000027 | 0.000034 | 0.000007 |
| shift_insert_at_index | 50000 | 50 | 35 | 100 | 0.000062 | 0.000029 | -0.000033 |
| shift_insert_at_index | 50000 | 50 | 40 | 100 | 0.000029 | 0.000041 | 0.000012 |
| shift_insert_at_index | 50000 | 50 | 45 | 100 | 0.000057 | 0.000033 | -0.000024 |

@seanlaw
Contributor Author

seanlaw commented Sep 6, 2024

> I'll post on the numba channel shortly and tag you!

Perfect! Thank you

Recently, I tried inlining some other code and also found zero improvement, so this bewilders me.

@seanlaw
Contributor Author

seanlaw commented Sep 16, 2024

@joehiggi1758 It also looks like we might be able to add inline=True to GPU device functions like core._gpu_searchsorted_left and core._gpu_searchsorted_right.

You should be able to test using Google Colab. @NimaSarajpoor Have you tried this already? It's possible that the CUDA compiler already aggressively inlines device functions automatically so there might not be any gains here.
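For reference, a rough sketch of what an inlined device function could look like. The simplified binary search below is my own stand-in, not STUMPY's actual core._gpu_searchsorted_left, and a CPU mirror of the same logic keeps the example runnable without a GPU:

```python
# Rough sketch only: a simplified left-bisect stand-in for what
# core._gpu_searchsorted_left might do. The CPU reference mirrors the
# device-function logic so this runs without a GPU; cuda.jit's inline
# flag is the hint under discussion.
import numpy as np

def searchsorted_left(a, v):
    # CPU mirror of the device-function logic below
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] < v:
            lo = mid + 1
        else:
            hi = mid
    return lo

try:
    from numba import cuda

    @cuda.jit(device=True, inline=True)  # ask Numba to inline this device function
    def _gpu_searchsorted_left(a, v):
        lo, hi = 0, len(a)
        while lo < hi:
            mid = (lo + hi) // 2
            if a[mid] < v:
                lo = mid + 1
            else:
                hi = mid
        return lo
except Exception:
    # Numba/CUDA unavailable (or a different API version); the CPU
    # reference above still shows the logic that would be inlined
    pass

a = np.array([1.0, 2.0, 4.0, 8.0])
print(searchsorted_left(a, 4.0) == np.searchsorted(a, 4.0, side="left"))  # True
```

Note that this only hints at Numba-IR-level inlining; as mentioned, the CUDA compiler may already be inlining small device functions on its own, so the measured gain could well be zero.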

@NimaSarajpoor
Collaborator

I came here to share a comment and noticed that my name was already mentioned! Sorry for the late response!

In fact, I did test it to see if it improves the recursive function I used in the 6-step FFT algorithm (NOT the GPU functions).

Since my function was recursive, I had to limit the recursion depth. See the inlining section of the Numba docs at https://numba.pydata.org/numba-doc/dev/developer/inlining.html. I did not play with it that much, though.
