Support non-trivial `classes_` in `LogisticRegression` #6346

jcrist · 2025-02-20T19:32:30Z

Scikit-Learn's LogisticRegression is somewhat unusual in that it natively supports complex labels (e.g. raw strings, categories, or ints that aren't monotonically increasing from 0), rather than requiring that the labels be pre-encoded. It does this by internally using a LabelEncoder during fit, then converting the predicted labels back to the original label dtype in predict.
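A minimal sketch of that encode-on-fit, decode-on-predict idea, using plain numpy (`np.unique`) as a stand-in for LabelEncoder. The label values and the `predicted_indices` array are made up for illustration and are not taken from sklearn or cuml internals:

```python
import numpy as np

# Raw string labels, as a user might pass to fit()
y = np.array(["cat", "dog", "cat", "bird", "dog"])

# "fit": sorted unique classes, plus each label's integer index into them
classes, encoded = np.unique(y, return_inverse=True)

# "predict": the solver works in index space; indexing into the stored
# classes maps those indices back to the original label dtype
predicted_indices = np.array([2, 0, 1])  # hypothetical model output
decoded = classes[predicted_indices]

print(classes)   # sorted unique labels
print(encoded)   # integer codes seen by the solver
print(decoded)   # labels returned to the user
```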

This PR adds support for this in cuml, improving compatibility with sklearn. This required:

- A small improvement to LabelEncoder to support all the input types that cuml natively supports.
- Addition of a LabelEncoder in LogisticRegression.fit
- Changing how we store classes_. Previously this used the descriptor functionality to support different container types. However, CumlArray/cupy/numba don't support non-numeric types, so we can't use that to store the classes anymore. We now always store classes_ as a numpy array. I think this is fine - the size of classes is small - and also makes us a bit more compatible with sklearn since we can better ensure our dtypes match theirs.
- Changing predict to convert the numeric output back into the original classes. This was complicated since CumlArray/cupy/numba don't support non-numeric types, which means that most of our existing output_type machinery fails for these cases. I hacked something in that I think is sufficient, but it's definitely a hack.
- Addition of a new test to check things work properly across dtypes and container types.
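To illustrate why a host-side numpy array is a convenient home for classes_, here is a rough sketch of a label round trip across several dtypes. The `round_trip` helper and the sample label sets are hypothetical and use numpy only; this is not the test added in the PR:

```python
import numpy as np


def round_trip(y):
    """Encode labels to indices, then decode them back (hypothetical helper)."""
    classes, indices = np.unique(y, return_inverse=True)
    return classes, classes[indices]


# Labels that are not simply 0..n-1: sparse ints, floats, and strings
label_sets = [
    np.array([3, 7, 3, 11], dtype=np.int32),
    np.array([0.5, 2.5, 0.5], dtype=np.float32),
    np.array(["a", "c", "b", "a"]),
]

results = []
for y in label_sets:
    classes, decoded = round_trip(y)
    # The numpy array keeps the original dtype, including string dtypes
    results.append(
        (classes.dtype == y.dtype, decoded.dtype == y.dtype, np.array_equal(decoded, y))
    )

print(results)
```

Because the stored classes keep the dtype of the training labels, the decoded predictions come back in that same dtype without any extra bookkeeping.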

This is an alternative to #6328. The fix here doesn't add any additional state, and I believe the test cases added here provide better coverage of the behavior we're trying to ensure.

jcrist · 2025-02-20T19:34:03Z

python/cuml/cuml/linear_model/logistic_regression.pyx

+        if is_numeric:
+            if (self.classes_ == np.arange(nclasses)).all():
+                # Fast path for common case of monotonically increasing numeric classes
+                out = indices.to_output("cupy", output_dtype=self.classes_.dtype)


Note that any code working prior to this change will take this path.
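As a rough illustration of when that fast path applies, the check can be reproduced on the host with numpy alone. `takes_fast_path` is a hypothetical helper written for this sketch, and its dtype test is only an assumed stand-in for the `is_numeric` flag in the diff:

```python
import numpy as np


def takes_fast_path(classes):
    """Return True when classes are exactly 0, 1, ..., n-1 (hypothetical helper)."""
    classes = np.asarray(classes)
    # Assumed equivalent of `is_numeric`: strings and objects never qualify
    if not np.issubdtype(classes.dtype, np.number):
        return False
    return bool((classes == np.arange(len(classes))).all())


print(takes_fast_path([0, 1, 2]))    # labels already match their indices
print(takes_fast_path([0, 2, 5]))    # numeric, but needs decoding
print(takes_fast_path(["a", "b"]))   # non-numeric, needs decoding
```

Only labels that already coincide with their encoded indices skip the decode step, which is why previously supported inputs keep their old behavior.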

csadorf

From my initial review, this looks overall very sound. It's a bit of a hack to work around the CumlArray limitation, but that was expected and we can address it moving forward.

I have not yet had the chance to investigate whether our tests fully cover cases where we call LogisticRegression.predict() internally and whether we maintain behavior in that case. Since we are replicating the api decorator in predict(), there is a small chance that we are not covering all edge cases. That is not the only reason I'm not approving just yet.

jcrist · 2025-02-20T23:32:01Z

Looks like there's some test failures that I missed fixing locally. Most look pretty straightforward - if we want to still try and get this in pre-patch I can work on resolving these tomorrow.

viclafargue

LGTM, would just add a small test to prevent any regression for issue https://github.com/rapidsai/cuml-accel/issues/94

jcrist · 2025-02-21T18:01:18Z

That case should already be sufficiently tested by test_logistic_regression_complex_classes, added here (the int32 and float32 cases, to be specific).

csadorf · 2025-02-21T18:29:51Z

> Looks like there's some test failures that I missed fixing locally. Most look pretty straightforward - if we want to still try and get this in pre-patch I can work on resolving these tomorrow.

@jcrist and I just had a brief offline chat and agreed that it's worth trying to address the issues before code freeze.

Previously assumptions were made that prevented supporting all the possible input types `cuml` normally supports. `LabelEncoder` should probably be fixed to play nicely with cuml's output type handling, but that issue is beyond the requirements of this PR.

Scikit-learn's `LogisticRegression` contains support for non-trivial classes (those that would typically require encoding before processing). This PR adds support for that in both `fit` and `predict`. This is complicated by `CumlArray`/`cupy`/`numba` not supporting non-numeric types, which means we need to special case the output handling in `predict`. It's gross, but functional.

jcrist · 2025-02-21T22:31:21Z

I think I've fixed all the test failures, but who knows. Also added a fix for categorical y support, which was another sklearn compatibility bug we had.

jcrist requested a review from a team as a code owner February 20, 2025 19:32

jcrist requested review from csadorf and vyasr February 20, 2025 19:32

github-actions bot added the Cython / Python (Cython or Python issue) label Feb 20, 2025

jcrist added the improvement (Improvement / enhancement to an existing function), cuml-cpu, and non-breaking (Non-breaking change) labels Feb 20, 2025

jcrist force-pushed the logreg-complex-classes branch from f36aaf7 to 8eb40ba Compare February 20, 2025 19:35

jcrist mentioned this pull request Feb 20, 2025

Fix LogisticRegression use with labels of string dtype #6328

Open

csadorf reviewed Feb 20, 2025

dantegd mentioned this pull request Feb 21, 2025

Backport release 25.04 PRs for patch release version 25.02.01 #6329

Draft

20 tasks

viclafargue reviewed Feb 21, 2025

jcrist added 5 commits February 21, 2025 22:06

Support categorical inputs for y in LogisticRegression

783a3c3

Support float16 dtypes

032f504

Workaround bug in cudf.pandas handling

17c4d07

jcrist force-pushed the logreg-complex-classes branch from 8eb40ba to 17c4d07 Compare February 21, 2025 22:08

jcrist mentioned this pull request Feb 22, 2025

A few GPU<->CPU interop fixes #6355

Open
