
StatefulDataLoader shuffling ignores system RNGs at initialization; in fact, the internal random state is identical for all (?) new dataloaders #1440

Open
gailweiss opened this issue Feb 4, 2025 · 2 comments · May be fixed by #1441

Comments


gailweiss commented Feb 4, 2025

🚀 The feature

Brief description

The random generator in a newly created StatefulDataLoader should be randomly seeded, but it is currently deterministic. Random seeding would make different DataLoaders generate different shuffles of the data, more in line with how 'normal' DataLoaders behave. Such a change is, to my understanding, not in conflict with the statefulness of the dataloaders: the generator can still be stored and loaded, there is just no reason for it to start from the same state every time.
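
For comparison, a plain torch.utils.data.DataLoader with shuffle=True and no explicit generator draws a fresh seed from torch's global RNG for each shuffle (at least in recent torch versions), so two new loaders almost always produce different orders, and torch.manual_seed makes the whole sequence reproducible. A minimal sketch of that baseline behaviour:

import torch
from torch.utils.data import DataLoader

torch.manual_seed(0)
dl1 = DataLoader(list(range(10)), batch_size=1, shuffle=True)
dl2 = DataLoader(list(range(10)), batch_size=1, shuffle=True)
order1 = [b.item() for b in dl1]
order2 = [b.item() for b in dl2]
print(order1 == order2)  # almost always False: each loader draws its own seed from the global RNG
# the pair (order1, order2) is nevertheless reproducible across runs thanks to manual_seed(0)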

Current state

At the moment, all StatefulDataLoaders with shuffling enabled start from the same internal RNG state, regardless of the environment's RNG state at the time they are created. For example:

import random

import numpy as np
import pytorch_lightning as pl  # provides pl.seed_everything, used below
import torch
from torchdata.stateful_dataloader import StatefulDataLoader as DataLoader

def get_dl(d):
    return DataLoader(d, batch_size=1, shuffle=True)

def get_generator(dl):
    return dl.state_dict()["_index_sampler_state"]["sampler_iter_state"]["generator"]

def same_order(dl1, dl2):
    order1 = [b.item() for b in dl1]
    order2 = [b.item() for b in dl2]
    assert len(order1)>0  # not accidentally on an empty one
    return order1 == order2

dl1, dl2 = get_dl(list(range(10))), get_dl(list(range(100)))
print("different dataloaders (on different dataset!) start from same RNG?:", False not in (get_generator(dl1) == get_generator(dl2)))

dl1, dl2 = get_dl(list(range(10))), get_dl(list(range(10)))
print("new dataloaders on same dataset create same order?: ", same_order(dl1, dl2))

print("trying again, now forcing the environment random state to be sure")

def seed_all(seed):
    for f in [pl.seed_everything, random.seed, torch.mps.manual_seed,
              torch.manual_seed, np.random.seed]:
        f(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_all(0)
dl1 = get_dl(list(range(10)))
seed_all(1)
dl2 = get_dl(list(range(10)))

print("different dataloaders (started with different random environment) start from same RNG?:", False not in (get_generator(dl1) == get_generator(dl2)))
print("new dataloaders (started with different random environment) on same dataset create same order?: ", same_order(dl1, dl2))

Yields:

different dataloaders (on different dataset!) start from same RNG?: True
new dataloaders on same dataset create same order?:  True
trying again, now forcing the environment random state to be sure
different dataloaders (started with different random environment) start from same RNG?: True
new dataloaders (started with different random environment) on same dataset create same order?:  True

This behaviour does not seem necessary for the "statefulness" of the dataloaders: their state_dict contains a tensor controlling the shuffles, so whichever random state a dataloader currently has can be saved and loaded as needed; new dataloaders don't all need to start from the same random state.
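
For reference, the stored tensor appears to have the same shape and dtype as the state blob of a plain torch.Generator (a quick check on my build; paths as in the snippet above):

import torch
from torchdata.stateful_dataloader import StatefulDataLoader as DataLoader

dl = DataLoader(list(range(10)), batch_size=1, shuffle=True)
blob = dl.state_dict()["_index_sampler_state"]["sampler_iter_state"]["generator"]
print(blob.dtype, blob.shape)  # torch.uint8, torch.Size([5056]) here
print(torch.Generator().get_state().dtype, torch.Generator().get_state().shape)  # same layout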

Request

I normally expect new shuffling dataloaders to create different shuffles from each other, and in particular to be sensitive to the environment's random state at initialization (and I assume I have this kind of variety when testing runs on different random seeds).

I think this can be remedied by setting the generator tensor (dl.state_dict()["_index_sampler_state"]["sampler_iter_state"]["generator"]) randomly at initialization. I would use a workaround along the following lines, if only I understood what the generator tensor is composed of and how to construct a legal one:

# proposal that doesn't work because I don't understand what the generator is
# composed of, and thus do not know how to make a legal one
import torch
from torchdata.stateful_dataloader import StatefulDataLoader as DataLoader

def get_generator(sd):
    return sd["_index_sampler_state"]["sampler_iter_state"]["generator"]

def set_generator(sd, t):
    sd["_index_sampler_state"]["sampler_iter_state"]["generator"] = t

def shuffling_dl_getter(d, batch_size):
    dl = DataLoader(d, batch_size=batch_size, shuffle=True)
    sd = dl.state_dict()
    g = get_generator(sd)
    # random bytes in the observed value range -- unfortunately won't be accepted when it comes to use
    random_initial_generator = torch.randint(int(g.min()), int(g.max()) + 1, g.shape).byte()
    set_generator(sd, random_initial_generator)
    dl.load_state_dict(sd)
    return dl

d = list(range(10))
dl = shuffling_dl_getter(d, 1)  # succeeds, but then
sorted([b.item() for b in dl]) == d  # won't successfully run, complaining of an invalid mt19937 state

I would appreciate a "correct" version of the code in shuffling_dl_getter above being added to the initialization of StatefulDataLoader! Unfortunately I don't understand the composition of the generator tensor, so I can't build a 'good' one myself. In particular, I notice that g is a tensor of length 5056 with many 0s and many higher numbers, while an mt19937 state should have length 624, and I don't know what all the extra content is.
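
If that observation is right, a sketch of the missing piece could look like this: assuming the stored blob has the same format as torch.Generator().get_state() (also a 5056-byte uint8 tensor on CPU, i.e. the mt19937 state plus the generator's bookkeeping fields), a freshly seeded generator's state should be a "legal" value. I haven't verified this against the library internals:

import torch
from torchdata.stateful_dataloader import StatefulDataLoader as DataLoader

def set_generator(sd, t):
    sd["_index_sampler_state"]["sampler_iter_state"]["generator"] = t

def shuffling_dl_getter(d, batch_size):
    dl = DataLoader(d, batch_size=batch_size, shuffle=True)
    sd = dl.state_dict()
    fresh = torch.Generator()
    fresh.seed()  # non-deterministic seed from system entropy
    set_generator(sd, fresh.get_state())  # a complete RNG state blob, unlike random bytes
    dl.load_state_dict(sd)
    return dl

d = list(range(10))
dl = shuffling_dl_getter(d, 1)
print(sorted(b.item() for b in dl) == d)  # expected True if the assumption holds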

Motivation, pitch

When testing the sensitivity of an architecture or training routine to the random state, I assume that the data order is changing too (and not just the network's initial weights and the dropout masks throughout training).

Alternatives

If I could, I would use the shuffling_dl_getter code described above to obtain randomly initialized StatefulDataLoaders myself; unfortunately, it is not clear to me how to construct legal random states for the dataloaders.

Additional context

No response

@ramanishsingh
Contributor

Hi @gailweiss
Thanks for raising this issue.
I was able to reproduce your findings.

Let me take a deeper look into the issue.
Meanwhile, one way to get a random sampling order is to pass an explicit RandomSampler.
Let me know if this example works for you:

import torch
from torch.utils.data import RandomSampler
from torchdata.stateful_dataloader import StatefulDataLoader as DataLoader

def get_dl(d, generator):
    sampler = RandomSampler(d, generator=generator)

    return DataLoader(d, batch_size=1, sampler=sampler)


def same_order(dl1, dl2):
    order1 = [b.item() for b in dl1]
    order2 = [b.item() for b in dl2]
    print("order1", order1)
    print("order2", order2)
    assert len(order1) > 0  # not accidentally on an empty one
    return order1 == order2


seed = 0
generator = torch.Generator()
generator.manual_seed(seed)
dl1 = get_dl(list(range(10)), generator)

seed = 1
generator = torch.Generator()  # you can create a new generator or reuse the old one
generator.manual_seed(seed)

dl2 = get_dl(list(range(10)), generator)


print(
    "new dataloaders (started with different generators) on same dataset create same order?: ",
    same_order(dl1, dl2),
)

output:

order1 [4, 1, 7, 5, 3, 9, 0, 8, 6, 2]
order2 [5, 6, 1, 2, 0, 8, 9, 3, 7, 4]
new dataloaders (started with different generators) on same dataset create same order?:  False

Let me know if this works for you.


gailweiss commented Feb 4, 2025

Dear @ramanishsingh , thanks for looking into this! Your solution seems to work, thank you!

If I may suggest: I found I can even fix the issue by using the generator argument of the StatefulDataLoader constructor, which removes the dependency on the RandomSampler import. I have also verified that the state can still be properly saved and loaded with this initialization, i.e. there is indeed no dependency on the fixed initial generator of the StatefulDataLoader:

import torch
from torchdata.stateful_dataloader import StatefulDataLoader as DataLoader

def get_dl(d):
    gen = torch.Generator()
    gen.seed()  # seed from system entropy; a fresh torch.Generator otherwise starts from a fixed default seed
    return DataLoader(d, batch_size=1, shuffle=True, generator=gen)

def get_generator(dl):
    return dl.state_dict()["_index_sampler_state"]["sampler_iter_state"]["generator"]

def same_order(dl1, dl2):
    order1 = [b.item() for b in dl1]
    order2 = [b.item() for b in dl2]
    assert len(order1)>0  # not accidentally on an empty one
    return order1 == order2

def same_generator(dl1, dl2):
    return False not in (get_generator(dl1) == get_generator(dl2))

d = list(range(10))
dl1, dl2 = get_dl(d), get_dl(d)
print("dataloaders have same generator?", same_generator(dl1, dl2))
print("dataloaders have same order?", same_order(dl1, dl2))

sd = dl1.state_dict()
dl2.load_state_dict(sd)

# handle broken epoch after loading state dict, see https://github.com/pytorch/data/issues/1437
for b in dl2:
    pass

print("transferring state_dict between dataloaders with different initial random generators succeeds?", same_order(dl1, dl2))

output:

dataloaders have same generator? False
dataloaders have same order? False
transferring state_dict between dataloaders with different initial random generators succeeds? True

It seems that, in this case, simply adding these lines to the __init__ function of StatefulDataLoader (before the assignment self.generator = generator) should solve the issue (the current default value of the generator argument is None):

if generator is None:
    generator = torch.Generator()
    generator.seed()
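
Until such a change lands, the same idea can also be applied from user code with a small wrapper (the class name below is just illustrative):

import torch
from torchdata.stateful_dataloader import StatefulDataLoader

class RandomlySeededStatefulDataLoader(StatefulDataLoader):  # illustrative name
    def __init__(self, *args, generator=None, **kwargs):
        if generator is None:
            generator = torch.Generator()
            generator.seed()  # non-deterministic seed, as in the proposed fix
        super().__init__(*args, generator=generator, **kwargs)

dl = RandomlySeededStatefulDataLoader(list(range(10)), batch_size=1, shuffle=True)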
