NVIDIA NeMo 2.0 Serialization Issue: I am facing the same serialization issue with fiddle #12296

Open
bjohn22 opened this issue on Feb 20, 2025 · 1 comment
Labels: bug (Something isn't working)

Comments


bjohn22 commented Feb 20, 2025

Describe the bug
Training runs through max_steps, but when the end-of-training checkpoint context is dumped, fiddle fails to serialize the TabularTokenizer attached to the model and the run crashes with UnserializableValueError (full log and traceback below):
Validation: iteration 39/8
Validation: iteration 40/8
Validation: iteration 41/8
Validation: iteration 42/8
Validation: iteration 43/8
Validation: iteration 44/8
Validation: iteration 45/8
Validation: iteration 46/8
Validation: iteration 47/8
Validation: iteration 48/8
Epoch 0, global step 99: 'reduced_train_loss' reached 1.57371 (best 1.57371), saving model to '/opt/project/temp/nemo_experiments/default/2025-02-20_21-14-28/checkpoints/default--reduced_train_loss=1.5737-epoch=0-consumed_samples=400.0.ckpt' as top 1
[NeMo W 2025-02-20 21:14:42 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use matplotlib.colormaps[name] or matplotlib.colormaps.get_cmap() or pyplot.get_cmap() instead.
cm = get_cmap("Set1")

Root directory is: /opt/project/temp
Current location is: /opt/project/temp
full path string: /opt/project/temp/data_input/tabular_data.bin
check path exists: False
[NeMo I 2025-02-20 21:14:42 model_checkpoint:497] Scheduled async checkpoint save for /opt/project/temp/nemo_experiments/default/2025-02-20_21-14-28/checkpoints/default--reduced_train_loss=1.5737-epoch=0-consumed_samples=400.0.ckpt
Trainer.fit stopped: max_steps=100 reached.
[NeMo I 2025-02-20 21:14:42 model_checkpoint:497] Scheduled async checkpoint save for /opt/project/temp/nemo_experiments/default/2025-02-20_21-14-28/checkpoints/default--reduced_train_loss=1.5737-epoch=0-consumed_samples=400.0-last.ckpt
[NeMo W 2025-02-20 21:14:43 dist_ckpt_io:155] Some async checkpoint saves might be not finalized properly.
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/project/temp/tabular_gpt_end2end_concise.py", line 374, in
[rank0]: main()
[rank0]: File "/opt/project/temp/tabular_gpt_end2end_concise.py", line 314, in main
[rank0]: llm.pretrain(
[rank0]: File "/opt/NeMo/nemo/collections/llm/api.py", line 150, in pretrain
[rank0]: return train(
[rank0]: File "/opt/NeMo/nemo/collections/llm/api.py", line 107, in train
[rank0]: trainer.fit(model, data)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
[rank0]: call._call_and_handle_interrupt(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]: return function(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
[rank0]: self._run(model, ckpt_path=ckpt_path)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
[rank0]: results = self._run_stage()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
[rank0]: self.fit_loop.run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 211, in run
[rank0]: self.on_run_end()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 405, in on_run_end
[rank0]: call._call_callback_hooks(trainer, "on_train_end")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 218, in _call_callback_hooks
[rank0]: fn(trainer, trainer.lightning_module, *args, **kwargs)
[rank0]: File "/opt/NeMo/nemo/lightning/pytorch/callbacks/model_checkpoint.py", line 294, in on_train_end
[rank0]: TrainerContext.from_trainer(trainer).io_dump(
[rank0]: File "/opt/NeMo/nemo/lightning/io/mixin.py", line 238, in io_dump
[rank0]: json = serialization.dump_json(io)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 826, in dump_json
[rank0]: return json.dumps(Serialization(value, pyref_policy).result, indent=indent)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 527, in init
[rank0]: _ROOT_KEY: self._serialize(self._root, (), all_paths=((),)),
[rank0]: File "/opt/NeMo/nemo/lightning/io/fdl_torch.py", line 131, in _modified_serialize
[rank0]: return self._original_serialize(value, current_path, all_paths)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 662, in _serialize
[rank0]: serialized_value = self._serialize(
[rank0]: File "/opt/NeMo/nemo/lightning/io/fdl_torch.py", line 131, in _modified_serialize
[rank0]: return self._original_serialize(value, current_path, all_paths)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 662, in _serialize
[rank0]: serialized_value = self._serialize(
[rank0]: File "/opt/NeMo/nemo/lightning/io/fdl_torch.py", line 131, in _modified_serialize
[rank0]: return self._original_serialize(value, current_path, all_paths)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 651, in _serialize
[rank0]: raise UnserializableValueError(msg)
[rank0]: fiddle._src.experimental.serialization.UnserializableValueError: Unserializable value <nemo.collections.common.tokenizers.tabular_tokenizer.TabularTokenizer object at 0x7718cd0ea6b0> of type <class 'nemo.collections.common.tokenizers.tabular_tokenizer.TabularTokenizer'>. Error occurred at path '.model.tokenizer'.

Process finished with exit code 1
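
For context, the failure can be reproduced with fiddle alone, independent of NeMo: roughly speaking, dump_json handles primitives, containers, Buildables, and python references, but not arbitrary live objects stored as config values. A minimal sketch (the Tokenizer class below is a hypothetical stand-in for NeMo's TabularTokenizer):

import fiddle as fdl
from fiddle.experimental import serialization

class Tokenizer:  # hypothetical stand-in for TabularTokenizer
    def __init__(self, vocab_path, delimiter=","):
        self.vocab_path = vocab_path
        self.delimiter = delimiter

def build_model(tokenizer):
    return tokenizer

# Storing a live instance (rather than a fdl.Config describing how to build one)
# makes the value unserializable, mirroring the '.model.tokenizer' error above.
cfg = fdl.Config(build_model, tokenizer=Tokenizer("vocab.pkl"))
serialization.dump_json(cfg)  # raises UnserializableValueError at path '.tokenizer'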

Steps/Code to reproduce bug

# ----------------------------------------------------
# 5) NeMo 2.0 GPT TRAINING
# ----------------------------------------------------
# Imports used below (assumed; the snippet starts at step 5 of a longer script):
from nemo.collections import llm
from nemo import lightning as nl
from nemo.collections.common.tokenizers.tabular_tokenizer import TabularTokenizer
from megatron.core.optimizer import OptimizerConfig

print("==> Starting GPT training with NeMo 2.0...")


# GPT Config
gpt_config = llm.GPTConfig(
    num_layers=4,
    hidden_size=1024,
    ffn_hidden_size=4096,
    num_attention_heads=16,
    seq_length=1024,
    init_method_std=0.023,
    hidden_dropout=0.1,
    attention_dropout=0.1,
    layernorm_epsilon=1e-5,
    make_vocab_size_divisible_by=128,
)
real_path = "/opt/project/temp/data_input/tabular_data_text_document"
# DataModule
data_module = llm.PreTrainingDataModule(
    paths={
        "train": [real_path],
        "validation": [real_path],
        "test": [real_path],
    },
    global_batch_size=4,
    micro_batch_size=1,
    seq_length=1024,
    num_workers=4,
    pin_memory=True,
    tokenizer=TabularTokenizer(TOKENIZER_PICKLE, delimiter=","),
)

# GPT Model
model = llm.GPTModel(
    gpt_config,
    tokenizer=data_module.tokenizer,
)

# Trainer
devices = 1
strategy = nl.MegatronStrategy(tensor_model_parallel_size=1)
trainer = nl.Trainer(
    devices=devices,
    accelerator="gpu",
    strategy=strategy,
    plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
    max_epochs=1,           # small for demo
    max_steps=100,        # small for demo
    log_every_n_steps=50,
    val_check_interval=50,
    accumulate_grad_batches=1,
    # gradient_clip_val=1.0,
)

# Checkpoints & Logging
ckpt = nl.ModelCheckpoint(
    save_last=True,
    monitor="reduced_train_loss",
    save_top_k=1,
    save_on_train_epoch_end=True,
    always_save_context=False,  # <-- added
    save_context_on_train_end=True,  # <-- added
)
logger = nl.NeMoLogger(ckpt=ckpt)

# Optimizer
optim = nl.MegatronOptimizerModule(
    config=OptimizerConfig(
        optimizer="adam",
        lr=2e-4,
        use_distributed_optimizer=True,
        weight_decay=0.01,
        clip_grad=1.0,
       # betas=(0.9, 0.98),
    ),
    lr_scheduler=nl.lr_scheduler.CosineAnnealingScheduler(
        warmup_steps=100,
        constant_steps=0,
        min_lr=1e-5,
    ),
)

# Launch training
llm.pretrain(
    model=model,
    data=data_module,
    trainer=trainer,
    log=logger,
    optim=optim,
)
print("==> GPT training completed.")

Expected behavior

Expected a complete run directory, e.g.:
nemo_experiments
└── default
    ├── 2025-02-20_00-33-45
    │   ├── checkpoints
    │   │   └── default--reduced_train_loss=1.5739-epoch=0-consumed_samples=400.0
    │   │       ├── context
    │   │       │   └── io.json
    │   │       └── weights
    │   │           ├── __0_0.distcp
    │   │           ├── __0_1.distcp
    │   │           └── common.pt
    │   ├── default--reduced_train_loss=1.5739-epoch=0-consumed_samples=400.0-unfinished
    │   ├── cmd-args.log
    │   ├── lightning_logs.txt
    │   ├── nemo_error_log.txt
    │   └── nemo_log_globalrank-0_localrank-0.txt
    └── 2025-02-20_00-49-19

Environment overview (please complete the following information)

  • Environment location: NVIDIA NeMo 2.0 Docker container, run through PyCharm on a Linux desktop with an NVIDIA RTX 3090

Environment details

Additional context
I am reproducing this NeMo 1.0 tutorial, following all of the migration steps in the NVIDIA NeMo Framework user guide documentation:
https://github.com/NVIDIA/NeMo/blob/main/tutorials/nlp/Megatron_Synthetic_Tabular_Data_Generation.ipynb
GPU: NVIDIA RTX 3090

bjohn22 added the bug (Something isn't working) label on Feb 20, 2025

bjohn22 commented Feb 20, 2025

A closed issue reported the same bug: #11931
