Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix LogisticRegression use with labels of string dtype #6328

Open
wants to merge 6 commits into
base: branch-25.04
Choose a base branch
from

Conversation

viclafargue
Copy link
Contributor

No description provided.

@viclafargue viclafargue requested a review from a team as a code owner February 17, 2025 14:22
@viclafargue viclafargue requested review from teju85 and vyasr February 17, 2025 14:22
@github-actions github-actions bot added the Cython / Python Cython or Python issue label Feb 17, 2025
@viclafargue viclafargue added bug Something isn't working non-breaking Non-breaking change labels Feb 18, 2025
Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much! I have one question before moving forward.

Comment on lines 305 to 306
self.classes__, y = np.unique(y, return_inverse=True)
self.numeric_classes_ = np.arange(len(self.classes__))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation for introducing new estimator attributes? I'd suggest that we just convert the provided labels into numeric classes and drop them otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation was to reproduce the behavior of a Scikit-Learn estimator :

>>> print(cuLogRegModel.classes_)
['setosa' 'versicolor' 'virginica']

But, to be frank this is a small detail and making it work involves solving a lot of edge cases in CI. Since we need this to be merged very soon, it is safer to go with a simpler version that just converts the inputs to numeric classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that it is such a small detail, but I'd support merging the simpler version and leaving this as a follow-up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the LabelEncoder to handle all this complexity? It comes with the right value in its classes_ attribute "for free".

If we merge something that is different from what scikit-learn does we just create more problems for ourselves IMHO. We put in effort to a half baked fix, we have to deal with users who report breakage and then we need to do the "right thing" anyway later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@betatim I'm in favor of using the LabelEncoder approach which was also previously suggested by @jcrist . @jcrist Do you already have an implementation that we can reference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difficulty isn't in producing the unique classes, but to make sure that the numeric version of classes can be used internally by cuML while the text version can be displayed through the classes_ attribute. Just pushed a solution that might pass CI. Then, sure we can use the LabelEncoder estimator instead of the unique function if that helps, it could especially handle https://github.com/rapidsai/cuml-accel/issues/94.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have an implementation yet, was going to finish up my other issue first. Haven't done more than look at how scikit-learn does it (with a LabelEncoder). I am wary of the additional fitted attribute here (numeric_classes_ and classes__). The scikit-learn implementation doesn't need this extra state and manages to handle things just fine - it would be nice to avoid increasing the amount of state stored on the class if possible (but again, apologies, I haven't looked more into how we might achieve that).

@jcrist
Copy link
Member

jcrist commented Feb 20, 2025

I've opened #6346 as an alternative fix for this issue. That patch doesn't add any additional state, and I believe better handles matching the output type and dtype for predict. The test case there at least provides more coverage of the behavior we're targeting, so whichever fix we apply I think we should add something like that test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuml-cpu Cython / Python Cython or Python issue non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants