-
Notifications
You must be signed in to change notification settings - Fork 545
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEA Add support for accepting a Numpy RandomState #6150
base: branch-25.02
Are you sure you want to change the base?
Conversation
In addition to accepting integers you can now also pass a RandomState object. It is used to derive an integer to use a seed.
The failures seemed to be related to some dask timeout which is unrelated I think. Lets see what happens for the latest commit For my education, why does it say I requested a code review from people? I don't remember clicking any buttons :-/ |
cuml.UMAP, | ||
], | ||
) | ||
def test_random_state_argument(Estimator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a quick test here that the results are the same with the seed, or is that tested in the individual algo tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the results will be the same because RandomState(42)
will not lead to 42
being passed as the seed to the internal functions that cuml calls.
We can't pass any form of "RNG state" to the internal functions, we can just pass an integer. So I think the best we can do when a RandomState
is passed in is to use it to generate a uint64
and use that as seed for the internal functions. I think this is better than trying to extract the (original) seed from the RandomState
because that way you get a different value if the random state has been used previously.
For example in this (contrived) example I think the two RFs should not both use 42
as the seed internally as they are two separate instances.
rs = RandomState(42)
rf1 = cuml.RandomForestClassifier(random_state=rs)
rf2 = cuml.RandomForestClassifier(random_state=rs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
python/cuml/cuml/cluster/kmeans.pyx
Outdated
@@ -302,6 +306,11 @@ class KMeans(UniversalBase, | |||
else None), | |||
check_dtype=check_dtype) | |||
|
|||
# XXX Should deriving a seed from a random state be idempotent? Should repeated | |||
# XXX calls of `fit` create new seeds or not? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do people think about this? Should we re-derive a seed each time fit
is called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is an excellent question... what would be the behavior of sklearn?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you pass an int
each call to fit
is the same, but if you pass a random state it keeps getting forwarded, so each fit
is different. (I think it is at least somewhat unclear what should happen, at least within scikit-learn we've not really been able to converge on something :-/)
I think here I'd vote for deriving a new seed each time. My thinking is that that way we match scikit-learn (no need to somehow special case this for the accelerator). Even if I can't justify why having a new seed each time is "the right thing to do"
In addition to accepting integers you can now also pass a RandomState object. It is used to derive an integer to use a seed.
add support for cupy random state objectssomething for a new PRCloses #4753