PERF: Improve merge performance #57559

phofl · 2024-02-21T23:22:01Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This avoids a bunch of unnecessary checks; might be worth it.

| Change   | Before [b2b1aae3] <merge~1>   | After [aebecfe9] <merge>   |   Ratio | Benchmark (Parameter)                                                                       |
|----------|-------------------------------|----------------------------|---------|---------------------------------------------------------------------------------------------|
| -        | 46.9±3μs                      | 41.9±0.5μs                 |    0.89 | join_merge.ConcatIndexDtype.time_concat_series('string[pyarrow]', 'monotonic', 0, True)     |
| -        | 210±7ms                       | 187±5ms                    |    0.89 | join_merge.I8Merge.time_i8merge('left')                                                     |
| -        | 2.11±0.2ms                    | 1.88±0.01ms                |    0.89 | join_merge.MergeEA.time_merge('Float32', False)                                             |
| -        | 17.1±3ms                      | 15.2±0.2ms                 |    0.89 | join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'inner')        |
| -        | 1.06±0.04ms                   | 907±20μs                   |    0.86 | join_merge.Merge.time_merge_dataframe_integer_2key(False)                                   |
| -        | 7.99±0.9ms                    | 6.82±0.06ms                |    0.85 | join_merge.ConcatIndexDtype.time_concat_series('string[pyarrow]', 'non_monotonic', 1, True) |
| -        | 224±10ms                      | 191±6ms                    |    0.85 | join_merge.I8Merge.time_i8merge('inner')                                                    |
| -        | 513±90μs                      | 425±5μs                    |    0.83 | join_merge.ConcatIndexDtype.time_concat_series('string[python]', 'monotonic', 0, False)     |
| -        | 519±100μs                     | 421±2μs                    |    0.81 | join_merge.ConcatIndexDtype.time_concat_series('string[python]', 'non_monotonic', 0, False) |
| -        | 1.18±0.6ms                    | 751±7μs                    |    0.63 | join_merge.ConcatIndexDtype.time_concat_series('int64[pyarrow]', 'has_na', 1, False)        |
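For reading the table: the Ratio column appears to be After divided by Before, so values below 1 mean the benchmark got faster. A minimal sketch of that arithmetic, using the rounded means shown in two of the rows. The helper name `ratio` is made up for illustration, and the benchmark tool presumably computes from unrounded timings, so the last digit of other rows may differ slightly.

```python
def ratio(before: float, after: float) -> float:
    """Return after / before; a value below 1.0 means the change is faster."""
    return after / before


# Rounded means copied from the table above; units match within each row.
concat_series = ratio(before=46.9, after=41.9)   # microseconds
i8merge_left = ratio(before=210.0, after=187.0)  # milliseconds

print(f"{concat_series:.2f}")  # prints 0.89
print(f"{i8merge_left:.2f}")   # prints 0.89
```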

WillAyd · 2024-02-21T23:41:18Z

pandas/_libs/hashtable_class_helper.pxi.in

@@ -1142,6 +1149,7 @@ cdef class StringHashTable(HashTable):
                # ignore_na is True), we can skip the actual value, and
                # replace the label with na_sentinel directly
                labels[i] = na_sentinel
+                seen_na = True


Instead of trying to find this by iterating element by element, can we get the same performance by querying up front for missing values? That approach would work better in a future where we are more Arrow-based.

Wouldn't we expect a two-pass implementation to be slower? I don't think we should be making decisions based on a potential Arrow-based future, since that would likely need a completely separate implementation anyway.
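To make the two options in this exchange concrete, here is a minimal pure-Python sketch. It is not the Cython hashtable code from the diff: the function names are hypothetical, `None` stands in for a missing value, and a dict stands in for the hash table. The one-pass version sets `seen_na` inside the labelling loop, as the diff does; the two-pass version asks about missing values up front.

```python
NA_SENTINEL = -1  # assumed sentinel value, mirroring na_sentinel in the diff


def factorize_one_pass(values):
    """Assign integer labels and record whether a missing value was seen, in one loop."""
    table = {}
    labels = []
    seen_na = False
    for val in values:
        if val is None:
            # Skip the actual value and use the sentinel as its label.
            labels.append(NA_SENTINEL)
            seen_na = True
            continue
        # len(table) is evaluated first, so a new key gets the next free label.
        labels.append(table.setdefault(val, len(table)))
    return labels, seen_na


def factorize_two_pass(values):
    """Same result, but the missing-value question is answered in a separate pass."""
    seen_na = any(val is None for val in values)  # pass 1: null check only
    table = {}
    labels = [
        NA_SENTINEL if val is None else table.setdefault(val, len(table))
        for val in values
    ]  # pass 2: labelling only
    return labels, seen_na


print(factorize_one_pass(["a", None, "b", "a"]))  # ([0, -1, 1, 0], True)
```

Both functions return the same result; which one is faster depends on how cheap the separate null check can be made, which is the point under discussion.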

Arrow strings are using a completely different code path, this is just for our own strings

> wouldn't we expect a two-pass implementation to be slower?

Generally, I would expect a two-pass solution to be faster. Our null-checking implementations are pretty naive and always go elementwise with a boolean value. The Arrow implementation iterates 64 bits at a time, so checking for NA can be up to 64x as fast as this kind of loop. I'm not sure what NumPy does internally, but with a byte mask that same type of logic would be up to 8x as fast.

By separating that out into a different has_na().any() type of check, you let other libraries determine that much faster than we can.

To be clear, not a blocker for now; just something to think about as we make our code base more Arrow-friendly.
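The "64 bits at a time" idea can be sketched as follows. This is an illustration only, not Arrow's or pandas' implementation: the names `pack_validity` and `has_na` are hypothetical, Python integers stand in for 64-bit words, and a set bit is assumed to mean "value present". The check compares a whole word against an all-ones pattern instead of testing each element.

```python
WORD_BITS = 64
FULL_WORD = (1 << WORD_BITS) - 1  # all 64 validity bits set


def pack_validity(values):
    """Pack 'is valid' flags into 64-bit words, least significant bit first."""
    words = []
    for start in range(0, len(values), WORD_BITS):
        word = 0
        for offset, val in enumerate(values[start:start + WORD_BITS]):
            if val is not None:
                word |= 1 << offset
        words.append(word)
    return words


def has_na(words, length):
    """Check for any missing value one word (up to 64 elements) at a time."""
    full_words, tail = divmod(length, WORD_BITS)
    for word in words[:full_words]:
        if word != FULL_WORD:  # one comparison covers 64 elements
            return True
    if tail:
        # Only the low `tail` bits of the last word carry data.
        tail_mask = (1 << tail) - 1
        return (words[full_words] & tail_mask) != tail_mask
    return False


values = ["x"] * 130
print(has_na(pack_validity(values), len(values)))  # False
values[100] = None
print(has_na(pack_validity(values), len(values)))  # True
```

In pure Python the packing step dominates the cost, so this shows the shape of the check rather than its speed; the benefit only appears when the bitmap already exists, as it would in a bit-packed validity buffer.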

> Arrow strings are using a completely different code path, this is just for our own strings

As I said, Arrow dispatches to pyarrow compute functions for those things, so it won't impact this part for now anyway.

WillAyd · 2024-02-22T19:43:28Z

pandas/_libs/hashtable_class_helper.pxi.in

@@ -1437,6 +1449,8 @@ cdef class PyObjectHashTable(HashTable):
                idx = self.table.vals[k]
                labels[i] = idx

+        if return_inverse and return_labels_only:
+            return labels.base, seen_na  # .base -> underlying ndarray


So the second value returned by this is either a bool or an ndarray, right? Instead of having dynamic return types like that, I think it would be better to just pass in a reference to seen_na; the caller can choose to ignore the result altogether if they'd like. That way you don't need a return_labels_only argument and can be consistent in what is returned.

Yeah this was my first idea as well, but it isn't a good idea. We drop the result but we still set an internal variable that signals that there is an external view created, which triggers a copy later in the process because we call into the same hashtable again. Avoiding this copy is part of the performance improvement, so we can't use the same signature.

I don't understand how passing the seen_na variable by reference as opposed to having a dynamic return type affects that. That seems like a code organization issue outside of this?

That doesn't matter, but returning the same result as when return_labels_only=False matters, so we need another if condition anyway.
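The two signatures debated in this thread can be contrasted with a small pure-Python sketch. Every name here is hypothetical (`NaFlag` stands in for a by-reference bool, `_factorize` for the shared labelling loop), and the sketch deliberately does not model the external-view bookkeeping described above, which is the stated reason the PR keeps a separate branch.

```python
from dataclasses import dataclass

NA_SENTINEL = -1  # assumed sentinel value


@dataclass
class NaFlag:
    """Hypothetical mutable holder standing in for a by-reference bool."""
    seen: bool = False


def _factorize(values):
    """Shared loop: return uniques, labels, and whether a missing value was seen."""
    table = {}
    labels = []
    seen_na = False
    for val in values:
        if val is None:
            labels.append(NA_SENTINEL)
            seen_na = True
        else:
            labels.append(table.setdefault(val, len(table)))
    return list(table), labels, seen_na


def factorize_dynamic(values, return_labels_only=False):
    """Return shape depends on the flag: (labels, seen_na) or (uniques, labels)."""
    uniques, labels, seen_na = _factorize(values)
    if return_labels_only:
        return labels, seen_na
    return uniques, labels


def factorize_out_param(values, na_flag):
    """Always return (uniques, labels); report seen_na through the holder."""
    uniques, labels, na_flag.seen = _factorize(values)
    return uniques, labels


flag = NaFlag()
print(factorize_dynamic(["a", None, "a"], return_labels_only=True))  # ([0, -1, 0], True)
print(factorize_out_param(["a", None, "a"], flag), flag.seen)        # (['a'], [0, -1, 0]) True
```

The out-parameter form keeps one return shape but still materializes the uniques, which is the part the labels-only branch is meant to avoid.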

phofl · 2024-03-16T17:31:13Z

@WillAyd good to merge?

github-actions · 2024-04-18T00:05:48Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

PERF: Improve merge performance

aebecfe

phofl added the Performance (memory or execution speed performance) and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Feb 21, 2024

phofl requested a review from WillAyd as a code owner February 21, 2024 23:22

WillAyd reviewed Feb 21, 2024

Fix tests

0349979

WillAyd reviewed Feb 22, 2024

github-actions bot added the Stale label Apr 18, 2024

phofl commented Feb 21, 2024

WillAyd Feb 21, 2024

jbrockmendel Feb 22, 2024

phofl Feb 22, 2024

WillAyd Feb 22, 2024

WillAyd Feb 22, 2024

phofl Feb 22, 2024

WillAyd Feb 22, 2024

phofl Mar 16, 2024

WillAyd Mar 18, 2024

phofl Mar 18, 2024 (edited)
