Fix `LogisticRegression` use with labels of string dtype #6328

viclafargue · 2025-02-17T14:22:33Z

No description provided.

csadorf

Thank you very much! I have one question before moving forward.

csadorf · 2025-02-18T15:33:20Z

python/cuml/cuml/linear_model/logistic_regression.pyx

+            self.classes__, y = np.unique(y, return_inverse=True)
+            self.numeric_classes_ = np.arange(len(self.classes__))


What is the motivation for introducing new estimator attributes? I'd suggest that we just convert the provided labels into numeric classes and drop them otherwise.

The motivation was to reproduce the behavior of a scikit-learn estimator:

>>> print(cuLogRegModel.classes_)
['setosa' 'versicolor' 'virginica']

But to be frank, this is a small detail, and making it work involves solving a lot of edge cases in CI. Since we need this merged very soon, it is safer to go with a simpler version that just converts the inputs to numeric classes.
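For readers following along, here is a minimal sketch of the simpler conversion being discussed, using only NumPy. The sample labels and variable names are illustrative assumptions and are not taken from the cuML source or its tests.

```python
import numpy as np

# Illustrative string labels (an assumption, not data from this PR).
y = np.array(["setosa", "virginica", "versicolor", "setosa"])

# np.unique returns the sorted distinct labels; with return_inverse=True it
# also returns, for each element of y, its index into that sorted array.
classes, y_encoded = np.unique(y, return_inverse=True)

# Indexing the class array with the integer codes recovers the original labels.
y_decoded = classes[y_encoded]
```

The integer codes can be fed to a solver that only understands numeric targets, while the sorted label array is what a `classes_` attribute would expose.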

I'm not sure that it is such a small detail, but I'd support merging the simpler version and leaving this as a follow-up.

Why not use the LabelEncoder to handle all this complexity? It comes with the right value in its classes_ attribute "for free".

If we merge something that is different from what scikit-learn does, we just create more problems for ourselves IMHO. We put effort into a half-baked fix, we have to deal with users who report breakage, and then we need to do the "right thing" later anyway.
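A short sketch of what the `LabelEncoder` suggestion looks like in isolation, using scikit-learn's `sklearn.preprocessing.LabelEncoder`. The sample labels are illustrative assumptions; this is not the code from this PR.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative string labels (an assumption, not data from this PR).
y = np.array(["setosa", "virginica", "versicolor", "setosa"])

encoder = LabelEncoder()
# fit_transform learns the sorted distinct labels and returns integer codes.
y_encoded = encoder.fit_transform(y)

# The original labels are available on the fitted encoder.
classes = encoder.classes_

# inverse_transform maps integer codes back to the original labels.
y_back = encoder.inverse_transform(y_encoded)
```

The encoder bundles the forward mapping, the inverse mapping, and the `classes_` array in one object, which is the "for free" part of the suggestion.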

@betatim I'm in favor of the LabelEncoder approach, which was also previously suggested by @jcrist. @jcrist, do you already have an implementation that we can reference?

The difficulty isn't in producing the unique classes, but in making sure that the numeric version of the classes can be used internally by cuML while the text version is exposed through the classes_ attribute. I just pushed a solution that might pass CI. And sure, we can use the LabelEncoder estimator instead of the unique function if that helps; in particular, it could handle https://github.com/rapidsai/cuml-accel/issues/94.

I don't have an implementation yet; I was going to finish up my other issue first. I haven't done more than look at how scikit-learn does it (with a LabelEncoder). I am wary of the additional fitted attributes here (numeric_classes_ and classes__). The scikit-learn implementation doesn't need this extra state and manages to handle things just fine. It would be nice to avoid increasing the amount of state stored on the class if possible (but again, apologies, I haven't looked further into how we might achieve that).

jcrist · 2025-02-20T19:40:09Z

I've opened #6346 as an alternative fix for this issue. That patch doesn't add any additional state, and I believe better handles matching the output type and dtype for predict. The test case there at least provides more coverage of the behavior we're targeting, so whichever fix we apply I think we should add something like that test.
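To illustrate the "no additional state" idea in general terms, here is a toy sketch in which `classes_` is the only label-related fitted attribute and predictions are decoded by indexing into it. `ToyCentroidClassifier` is a hypothetical stand-in written for this example; it is not the code in #6346, cuML, or scikit-learn.

```python
import numpy as np


class ToyCentroidClassifier:
    """Hypothetical nearest-centroid estimator, for illustration only."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        # classes_ is the only label-related state kept after fitting.
        self.classes_, y_idx = np.unique(np.asarray(y), return_inverse=True)
        y_idx = y_idx.ravel()
        # Internally the model works with integer codes 0..n_classes-1.
        self.centroids_ = np.stack(
            [X[y_idx == k].mean(axis=0) for k in range(len(self.classes_))]
        )
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Squared distance from every sample to every class centroid.
        dists = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        # Decode integer predictions back to the original label dtype.
        return self.classes_[dists.argmin(axis=1)]


X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y = np.array(["a", "a", "b", "b"])
model = ToyCentroidClassifier().fit(X, y)
pred = model.predict(np.array([[0.05, 0.1], [5.1, 5.0]]))
```

Because the predictions come from indexing `classes_`, they carry the same dtype as the labels passed to `fit`, which is the kind of behavior a test for this fix could check.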

Fix LogisticRegression use with labels of string dtype

5add4b9

viclafargue requested a review from a team as a code owner February 17, 2025 14:22

viclafargue requested review from teju85 and vyasr February 17, 2025 14:22

github-actions bot added the Cython / Python label Feb 17, 2025

Adding pytest

14a1717

dantegd added the cuml-cpu label Feb 18, 2025

viclafargue added the bug and non-breaking labels Feb 18, 2025

csadorf requested changes Feb 18, 2025


viclafargue added 4 commits February 18, 2025 18:15

fix issues

43e09d7

Reversing last commit + fix to pass CI

f991f6c

Use LabelEncoder and add test for issue rapidsai#94

2d07d85

Adding prediction

f6990dc

dantegd mentioned this pull request Feb 20, 2025

Backport release 25.04 PRs for patch release version 25.02.01 #6329

Draft

20 tasks

jcrist mentioned this pull request Feb 20, 2025

Support non-trivial classes_ in LogisticRegression #6346

Open

viclafargue commented Feb 17, 2025

csadorf left a comment

csadorf Feb 18, 2025

viclafargue Feb 18, 2025

csadorf Feb 18, 2025

betatim Feb 19, 2025

csadorf Feb 19, 2025

viclafargue Feb 19, 2025

jcrist Feb 19, 2025

jcrist commented Feb 20, 2025
