
Explore Numba Function Inlining #910

Open
seanlaw opened this issue Sep 6, 2023 · 8 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@seanlaw
Contributor

seanlaw commented Sep 6, 2023

In functions like _stump, the sub-functions _compute_diagonal and core._shift_insert_at_index may be called a lot. Therefore, we may gain some (small) performance improvement by adding @njit(inline='always') to those (and other) functions and possibly reduce the overhead cost of calling a function repeatedly. See more here.
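The hint could be applied like this. Note this is a simplified stand-in for `core._shift_insert_at_index`, not STUMPY's actual implementation, with a no-op fallback so the sketch runs even where Numba isn't installed:

```python
# Simplified sketch (stand-in for core._shift_insert_at_index, not the real
# STUMPY code), with a no-op fallback when Numba is unavailable.
import numpy as np

try:
    from numba import njit
except ImportError:
    def njit(*args, **kwargs):
        # Accept both @njit and @njit(inline='always') usage
        if args and callable(args[0]):
            return args[0]
        return lambda func: func

@njit(inline='always')
def _shift_insert_at_index(a, idx, v):
    # Shift a[idx:] right by one and insert v at idx (drops the last element)
    for i in range(a.shape[0] - 1, idx, -1):
        a[i] = a[i - 1]
    a[idx] = v

@njit
def caller(a):
    # The inline='always' hint asks Numba to paste the callee's body here,
    # avoiding per-call overhead inside hot loops
    _shift_insert_at_index(a, 1, -1.0)
    return a

out = caller(np.array([1.0, 2.0, 3.0, 4.0]))
print(out)  # [ 1. -1.  2.  3.]
```

The hint only matters at call sites inside other jitted functions; calling an inlined function directly from Python goes through the normal dispatch path either way.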

@seanlaw seanlaw added the enhancement New feature or request label Sep 6, 2023
@seanlaw seanlaw added the help wanted Extra attention is needed label Sep 18, 2023
@seanlaw seanlaw added this to the Python 1.12.1/1.13.0 Release milestone May 8, 2024
@joehiggi1758
Contributor

Hey @seanlaw hope you had a great day!

This one looks like a relatively straightforward next contribution for me. Is this still an issue, and if so, is it cool if I take a crack at it?

@seanlaw
Contributor Author

seanlaw commented Jul 30, 2024

@joehiggi1758 Sounds good. The most important thing here is to perform timing comparisons to ensure that adding this would actually make things faster. I recommend searching around previous issues where @NimaSarajpoor performed timing comparisons. It's possible that it doesn't make things faster.

@joehiggi1758
Contributor

joehiggi1758 commented Sep 5, 2024

Hey @seanlaw !

Finally got a chance to look into this, and I've concluded that we would see at best marginal improvements in the performance of _compute_diagonal and core._shift_insert_at_index if we add @njit(inline='always').

Generally, I approached my timing comparisons for both functions the same way (and tried to stay in line with the approach @NimaSarajpoor took here)...

  1. Prepared fake data to pass into both versions of each function (with Numba inlining and without), holding all else constant
  2. Performed compilations up front (with warm-ups) so that compilation time would not skew the inlined results
  3. Iterated 100x (except for _compute_diagonal at n=50000, where iterations=10 due to the constraints of my local hardware) and took the average completion times
  4. Compared the average completion times of the inlined and non-inlined versions of each function
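As a rough sketch, the four steps above might look like the following. The toy kernels are stand-ins of my own, not STUMPY code, and a no-op fallback keeps it runnable without Numba:

```python
# Rough timing-comparison sketch for the four steps above. The toy kernels
# (_sq_inline / _sq) are stand-ins, not STUMPY code.
import time
import numpy as np

try:
    from numba import njit
except ImportError:
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda func: func

@njit(inline='always')
def _sq_inline(x):
    return x * x

@njit
def _sq(x):
    return x * x

@njit
def total_inlined(a):
    s = 0.0
    for x in a:
        s += _sq_inline(x)  # call site Numba is asked to inline
    return s

@njit
def total_plain(a):
    s = 0.0
    for x in a:
        s += _sq(x)  # ordinary nested call
    return s

def avg_time(func, a, n_iter=100):
    func(a)  # step 2: warm-up so compilation doesn't skew the timing
    start = time.perf_counter()
    for _ in range(n_iter):  # step 3: average over many iterations
        func(a)
    return (time.perf_counter() - start) / n_iter

a = np.random.rand(10_000)  # step 1: fake data, held constant for both versions
gain = avg_time(total_plain, a) - avg_time(total_inlined, a)  # step 4
print(f"avg gain from inlining: {gain:.2e} s")
```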

My results for _compute_diagonal were as follows...

| function | n | m | iterations | avg_time_w_inline | avg_time_wo_inline | avg_efficiency_gain |
| --- | --- | --- | --- | --- | --- | --- |
| compute_diagonal | 10000 | 50 | 100 | 1.009117 | 1.017874 | 0.008757 |
| compute_diagonal | 20000 | 50 | 100 | 4.60779 | 4.395936 | -0.211854 |
| compute_diagonal | 50000 | 50 | 10 | 36.255343 | 36.938308 | 0.682965 |

My results for _shift_insert_at_index were as follows...

| function | n | m | iterations | avg_time_w_inline | avg_time_wo_inline | avg_efficiency_gain |
| --- | --- | --- | --- | --- | --- | --- |
| shift_insert_at_index | 10000 | 50 | 100 | 0.000014 | 0.000014 | 0 |
| shift_insert_at_index | 20000 | 50 | 100 | 0.000021 | 0.000024 | 0.000002 |
| shift_insert_at_index | 50000 | 50 | 100 | 0.000057 | 0.000055 | 0.000002 |

Please let me know if you have any questions, comments, or feedback, as I'd be more than happy to explain further! This one was a great learning opportunity for me and I had to do a decent amount of research, but I'm more than open to areas of opportunity since this was my first time doing a true timing comparison!

@seanlaw
Contributor Author

seanlaw commented Sep 6, 2024

@joehiggi1758 This is great. I agree that it doesn't seem to make anything faster. Would you mind trying one more for shift_insert_at_index but with varying values of k (i.e., k in range(5, 50, 5))?

Finally, would you mind posting a question to the Numba Discourse channel to share these results and to maybe get some clarification as to "when" having `inline='always'` would actually make a difference?

@joehiggi1758
Contributor

@seanlaw you got it!

I'll post on the numba channel shortly and tag you!

Regarding the varying levels of k for shift_insert_at_index here are the results...

| function | n | m | k | iterations | avg_time_w_inline | avg_time_wo_inline | avg_efficiency_gain |
| --- | --- | --- | --- | --- | --- | --- | --- |
| shift_insert_at_index | 10000 | 50 | 5 | 100 | 0.000011 | 0.00001 | -0.000002 |
| shift_insert_at_index | 10000 | 50 | 10 | 100 | 0.00002 | 0.000031 | 0.000011 |
| shift_insert_at_index | 10000 | 50 | 15 | 100 | 0.000011 | 0.000011 | 0 |
| shift_insert_at_index | 10000 | 50 | 20 | 100 | 0.00001 | 0.000036 | 0.000025 |
| shift_insert_at_index | 10000 | 50 | 25 | 100 | 0.000012 | 0.000012 | 0 |
| shift_insert_at_index | 10000 | 50 | 30 | 100 | 0.000011 | 0.000013 | 0.000002 |
| shift_insert_at_index | 10000 | 50 | 35 | 100 | 0.000014 | 0.000026 | 0.000012 |
| shift_insert_at_index | 10000 | 50 | 40 | 100 | 0.000023 | 0.00005 | 0.000027 |
| shift_insert_at_index | 10000 | 50 | 45 | 100 | 0.000018 | 0.000023 | 0.000005 |
| shift_insert_at_index | 20000 | 50 | 5 | 100 | 0.000015 | 0.000016 | 0.000001 |
| shift_insert_at_index | 20000 | 50 | 10 | 100 | 0.000025 | 0.000015 | -0.00001 |
| shift_insert_at_index | 20000 | 50 | 15 | 100 | 0.000016 | 0.000015 | 0 |
| shift_insert_at_index | 20000 | 50 | 20 | 100 | 0.00003 | 0.000061 | 0.00003 |
| shift_insert_at_index | 20000 | 50 | 25 | 100 | 0.000014 | 0.000014 | 0 |
| shift_insert_at_index | 20000 | 50 | 30 | 100 | 0.000041 | 0.000015 | -0.000027 |
| shift_insert_at_index | 20000 | 50 | 35 | 100 | 0.000016 | 0.000021 | 0.000005 |
| shift_insert_at_index | 20000 | 50 | 40 | 100 | 0.000029 | 0.000043 | 0.000014 |
| shift_insert_at_index | 20000 | 50 | 45 | 100 | 0.000019 | 0.000054 | 0.000035 |
| shift_insert_at_index | 50000 | 50 | 5 | 100 | 0.000029 | 0.000034 | 0.000005 |
| shift_insert_at_index | 50000 | 50 | 10 | 100 | 0.000051 | 0.000028 | -0.000023 |
| shift_insert_at_index | 50000 | 50 | 15 | 100 | 0.000024 | 0.000024 | 0 |
| shift_insert_at_index | 50000 | 50 | 20 | 100 | 0.000048 | 0.000029 | -0.000019 |
| shift_insert_at_index | 50000 | 50 | 25 | 100 | 0.000027 | 0.000047 | 0.00002 |
| shift_insert_at_index | 50000 | 50 | 30 | 100 | 0.000027 | 0.000034 | 0.000007 |
| shift_insert_at_index | 50000 | 50 | 35 | 100 | 0.000062 | 0.000029 | -0.000033 |
| shift_insert_at_index | 50000 | 50 | 40 | 100 | 0.000029 | 0.000041 | 0.000012 |
| shift_insert_at_index | 50000 | 50 | 45 | 100 | 0.000057 | 0.000033 | -0.000024 |

@seanlaw
Contributor Author

seanlaw commented Sep 6, 2024

> I'll post on the numba channel shortly and tag you!

Perfect! Thank you

Recently, I tried inlining some other code and also found zero improvement, so this bewilders me.

@seanlaw
Contributor Author

seanlaw commented Sep 16, 2024

@joehiggi1758 It also looks like we might be able to add inline=True to GPU device functions like core._gpu_searchsorted_left and core._gpu_searchsorted_right.

You should be able to test using Google Colab. @NimaSarajpoor Have you tried this already? It's possible that the CUDA compiler already aggressively inlines device functions automatically so there might not be any gains here.
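For reference, a rough sketch of what an inlined device function could look like. The simplified binary search below is my own stand-in, not STUMPY's actual core._gpu_searchsorted_left, and a CPU mirror of the same logic keeps the example runnable without a GPU:

```python
# Rough sketch only: a simplified left-bisect stand-in for what
# core._gpu_searchsorted_left might do. The CPU reference mirrors the
# device-function logic so this runs without a GPU; cuda.jit's inline
# flag is the hint under discussion.
import numpy as np

def searchsorted_left(a, v):
    # CPU mirror of the device-function logic below
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] < v:
            lo = mid + 1
        else:
            hi = mid
    return lo

try:
    from numba import cuda

    @cuda.jit(device=True, inline=True)  # ask Numba to inline this device function
    def _gpu_searchsorted_left(a, v):
        lo, hi = 0, len(a)
        while lo < hi:
            mid = (lo + hi) // 2
            if a[mid] < v:
                lo = mid + 1
            else:
                hi = mid
        return lo
except Exception:
    # Numba/CUDA unavailable (or a different API version); the CPU
    # reference above still shows the logic that would be inlined
    pass

a = np.array([1.0, 2.0, 4.0, 8.0])
print(searchsorted_left(a, 4.0) == np.searchsorted(a, 4.0, side="left"))  # True
```

Note that this only hints at Numba-IR-level inlining; as mentioned, the CUDA compiler may already be inlining small device functions on its own, so the measured gain could well be zero.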

@NimaSarajpoor
Collaborator

I came here to share a comment and noticed that my name was already mentioned! Sorry for the late response!

In fact, I did test it to see if it improves the recursive function I used in the 6-step FFT algorithm (NOT the GPU functions).

Since my function was recursive, I had to limit the recursion depth. See the inlining section of the Numba docs at https://numba.pydata.org/numba-doc/dev/developer/inlining.html. I did not play with it that much, though.
