Fix LogisticRegression use with labels of string dtype #6328
base: branch-25.04
Conversation
Thank you very much! I have one question before moving forward.
self.classes__, y = np.unique(y, return_inverse=True)
self.numeric_classes_ = np.arange(len(self.classes__))
What is the motivation for introducing new estimator attributes? I'd suggest that we just convert the provided labels into numeric classes and drop them otherwise.
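For what it's worth, a minimal sketch of that suggestion (illustrative only, not the actual cuML fit path): encode the labels once at the start of fit and keep only the integer codes.

import numpy as np

def encode_labels(y):
    # np.unique sorts the distinct labels and, with return_inverse=True, returns the
    # integer code of each sample; under this suggestion the label values themselves
    # would simply be discarded after this point.
    _, codes = np.unique(y, return_inverse=True)
    return codes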
The motivation was to reproduce the behavior of a scikit-learn estimator:
>>> print(cuLogRegModel.classes_)
['setosa' 'versicolor' 'virginica']
But, to be frank, this is a small detail, and making it work involves solving a lot of edge cases in CI. Since we need this to be merged very soon, it is safer to go with a simpler version that just converts the inputs to numeric classes.
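For context, this is the plain scikit-learn behavior being referenced (scikit-learn only, not cuML code):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

data = load_iris()
X, y = data.data, data.target_names[data.target]   # string labels: 'setosa', 'versicolor', 'virginica'
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.classes_)        # ['setosa' 'versicolor' 'virginica']
print(clf.predict(X[:1]))  # ['setosa'] -- predictions come back as strings, not integer codes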
I'm not sure that it is such a small detail, but I'd support merging the simpler version and leaving this as a follow-up.
Why not use the LabelEncoder to handle all this complexity? It comes with the right value in its classes_ attribute "for free".
If we merge something that is different from what scikit-learn does, we just create more problems for ourselves IMHO: we put effort into a half-baked fix, we have to deal with users who report breakage, and then we need to do the "right thing" anyway later.
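For illustration, a rough sketch of what that could look like inside fit/predict. This uses scikit-learn's LabelEncoder (cuML's own encoder could presumably play the same role) and hypothetical _fit_on_codes / _predict_codes helpers standing in for the existing numeric code path; none of this is the actual cuML implementation.

from sklearn.preprocessing import LabelEncoder

def fit(self, X, y):
    le = LabelEncoder()
    codes = le.fit_transform(y)     # integer codes 0..n_classes-1
    self.classes_ = le.classes_     # the original labels, "for free"
    self._fit_on_codes(X, codes)    # hypothetical: the existing numeric-only fit path
    return self

def predict(self, X):
    codes = self._predict_codes(X)  # hypothetical: returns integer class codes
    return self.classes_[codes]     # map codes back to the labels the user passed in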
The difficulty isn't in producing the unique classes, but in making sure that the numeric version of the classes can be used internally by cuML while the text version is exposed through the classes_ attribute. I just pushed a solution that might pass CI. Then, sure, we can use the LabelEncoder estimator instead of the unique function if that helps; it could especially help with https://github.com/rapidsai/cuml-accel/issues/94.
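As a standalone illustration of that split (just numpy, independent of any estimator):

import numpy as np

y = np.array(['versicolor', 'setosa', 'virginica', 'setosa'])
classes, codes = np.unique(y, return_inverse=True)
print(classes)         # ['setosa' 'versicolor' 'virginica'] -> what classes_ should display
print(codes)           # [1 0 2 0]                           -> what the solver works on internally
print(classes[codes])  # back to the original strings, for predict output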
I don't have an implementation yet; I was going to finish up my other issue first. I haven't done more than look at how scikit-learn does it (with a LabelEncoder). I am wary of the additional fitted attributes here (numeric_classes_ and classes__). The scikit-learn implementation doesn't need this extra state and manages to handle things just fine; it would be nice to avoid increasing the amount of state stored on the class if possible (but again, apologies, I haven't looked more into how we might achieve that).
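A small sketch of the point about extra state (names here are assumptions, not cuML's API): since the internal codes are always 0..n_classes-1, the numeric view can be derived from classes_ on demand rather than stored as a second fitted attribute.

import numpy as np

def decode_predictions(classes_, raw_codes):
    # raw_codes: integer predictions (0..n_classes-1) coming out of the internal solver.
    # classes_ alone is enough to recover the user's labels; a separate numeric_classes_
    # attribute would only ever hold np.arange(len(classes_)).
    return classes_.take(raw_codes)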
I've opened #6346 as an alternative fix for this issue. That patch doesn't add any additional state, and I believe better handles matching the output type and dtype for …