Rescalability via IBM dataset layers #1372

Closed
wants to merge 71 commits into from
45b0bce
Add distributed datasets
daviswer Nov 22, 2024
e486614
Formatting, commenting
daviswer Nov 22, 2024
10e45b9
Add demo script
daviswer Nov 23, 2024
10a6f66
Datapath None
daviswer Nov 23, 2024
0281897
Shift dummydata seeding to setup, dummy path handling
daviswer Nov 23, 2024
a175c3c
Actually create dummy data folders
daviswer Nov 23, 2024
957a5bf
Remove cfg ref
daviswer Nov 23, 2024
2e9bdf0
Remove double () call
daviswer Nov 23, 2024
e475eec
Fix dist checkpoint import
daviswer Nov 23, 2024
eac8ef6
Check ckp subfolder existence, not working folder
daviswer Nov 23, 2024
afd0169
Save vals for checking
daviswer Nov 23, 2024
031d67c
Load dummy gen state always
daviswer Nov 23, 2024
d9a575b
Setup calls in dummy
daviswer Nov 23, 2024
157f90b
Diag print
daviswer Nov 23, 2024
91f1b14
Remove sampling
daviswer Nov 23, 2024
b3569e3
Path in dummy build
daviswer Nov 23, 2024
0faea8c
Path in dummy build
daviswer Nov 23, 2024
0be44e4
Scalable off
daviswer Nov 23, 2024
c54aed2
Build data folder early
daviswer Nov 23, 2024
a16ffb1
Avoid resetting gen each state dict call
daviswer Nov 23, 2024
b645aea
Diag print off, all datasets on
daviswer Nov 23, 2024
ceffd24
Stop saving vals
daviswer Nov 23, 2024
d2eb12e
Attempt single blob save
daviswer Jan 14, 2025
ada91ec
Attempt single blob load
daviswer Jan 14, 2025
9bf8f3d
Prevent loading in place
daviswer Jan 14, 2025
934d37b
Cleanup
daviswer Jan 14, 2025
8d0cfd8
ScalableReader changes
daviswer Feb 6, 2025
e633e60
Fix datapath folder creation
daviswer Feb 6, 2025
1f2e37a
Create datapath subfolder, data only when nonexistent
daviswer Feb 6, 2025
0acdf05
Build data only rank 0
daviswer Feb 6, 2025
d146017
Pad chunks to make batchable
daviswer Feb 6, 2025
0fd38e8
give time for data to construct
daviswer Feb 6, 2025
e000b81
Fix pad fn
daviswer Feb 6, 2025
5bbd0d1
reader yield list not tensor
daviswer Feb 6, 2025
888bc19
No arg for repl placement
daviswer Feb 6, 2025
9c1699d
typo fix
daviswer Feb 6, 2025
c551a07
De-dtensorfy in load
daviswer Feb 6, 2025
4675681
Full tensor (apparently replicated doesn't force on load)
daviswer Feb 6, 2025
65744ac
Shard load, full tensor sendaround
daviswer Feb 6, 2025
88ab3c7
Chunksize 40
daviswer Feb 6, 2025
a34a5fc
Intermediate diag mkdir
daviswer Feb 6, 2025
763f60e
Time for other ranks to save
daviswer Feb 6, 2025
476c5a6
exist ok diag subf
daviswer Feb 6, 2025
ba00c20
Corrected step counting
daviswer Feb 6, 2025
0fd2b15
Fix followup nstep scaling
daviswer Feb 10, 2025
fcfee89
diag print
daviswer Feb 10, 2025
57164ca
diag print2
daviswer Feb 10, 2025
068ab32
diag print3
daviswer Feb 10, 2025
dd7d569
diag print4
daviswer Feb 10, 2025
7fa868f
diag print5
daviswer Feb 10, 2025
473e9ff
diag print6
daviswer Feb 10, 2025
bf22ce9
diag print7
daviswer Feb 10, 2025
8307e15
Diag save
daviswer Feb 10, 2025
444547f
Diag save2
daviswer Feb 10, 2025
c94b4ae
Flattenang
daviswer Feb 10, 2025
53a89b5
Flattenang 2
daviswer Feb 10, 2025
ad72ca0
Flattenang 3
daviswer Feb 10, 2025
c267675
Diag print (sigh)
daviswer Feb 10, 2025
03b4b3a
Diag print (sigh)2
daviswer Feb 10, 2025
da5991b
Attempt key-free load impl
daviswer Feb 19, 2025
9037800
Allow full run
daviswer Feb 19, 2025
5f10ac1
Direct import
daviswer Feb 19, 2025
8931620
Precise import
daviswer Feb 19, 2025
3a6e255
gloo backend
daviswer Feb 19, 2025
ba96958
Diag print
daviswer Feb 19, 2025
3ffb475
Specify keys
daviswer Feb 19, 2025
95cf494
Set constructor
daviswer Feb 19, 2025
4a592b7
Avoid popping keys mid iter
daviswer Feb 19, 2025
c37b8ba
Diag print
daviswer Feb 19, 2025
0b09fd4
diag print off
daviswer Feb 19, 2025
71b78dc
Clean up and comment out
daviswer Feb 25, 2025
Corrected step counting
daviswer committed Feb 6, 2025
commit ba00c20869856873262eea102c511f8503824c70
6 changes: 4 additions & 2 deletions examples/ibm_rescaling/rescaling_demo.py
@@ -95,12 +95,12 @@
 
 avoid = []
 for i, inp in enumerate(data):
-    if i == args.n_steps:
+    avoid.append(inp[:,0])
+    if i == args.n_steps-1:
         if rank == 0:
             print("Iteration complete!")
         save_distributed_state_dict(data, ckpt_path, mesh)
         break
-    avoid.append(inp[:,0])
 avoid = torch.cat(avoid)
 # Get all vals onto each rank
 avoid = dist.tensor.DTensor.from_local(
@@ -140,6 +140,8 @@
 # Diag save
 os.makedirs(os.path.join(args.ckpt_path, "diag"), exist_ok=True)
 torch.save(data.state_dict(), os.path.join(args.ckpt_path, "diag", f"loader_state_{rank}.pth"))
+if rank == 0:
+    torch.save(vals, os.path.join(args.ckpt_path, "diag", "vals.pth"))
 time.sleep(10)
 
 # Perform data coverage check on rank 0 only
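
A note on why the first hunk matters, as a minimal sketch rather than the demo's actual code. With the old ordering, the loop only broke once the loader had already yielded batch index `n_steps`, so the checkpointed loader state was one batch ahead of what `avoid` recorded. The toy below replaces the real dataloader with a plain iterator; `run_old`, `run_new`, and the `consumed` counter are hypothetical names used only for illustration.

```python
def run_old(n_steps, data):
    """Pre-fix ordering: test for the stop step before recording the batch."""
    seen, consumed = [], 0
    for i, inp in enumerate(data):
        consumed += 1  # the loader has already advanced past this batch
        if i == n_steps:
            break  # the checkpoint would be saved here
        seen.append(inp)
    return seen, consumed


def run_new(n_steps, data):
    """Post-fix ordering: record the batch first, stop at index n_steps - 1."""
    seen, consumed = [], 0
    for i, inp in enumerate(data):
        consumed += 1
        seen.append(inp)
        if i == n_steps - 1:
            break  # the checkpoint would be saved here
    return seen, consumed


old_seen, old_consumed = run_old(3, iter(range(10)))
new_seen, new_consumed = run_new(3, iter(range(10)))
print(len(old_seen), old_consumed)  # 3 4
print(len(new_seen), new_consumed)  # 3 3
```

In the old ordering one batch is drawn from the loader but never added to the recorded list, which would presumably show up as a gap in the later data coverage check after reloading the checkpoint.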