Describe the bug
Validation: iteration 39/8
Validation: iteration 40/8
Validation: iteration 41/8
Validation: iteration 42/8
Validation: iteration 43/8
Validation: iteration 44/8
Validation: iteration 45/8
Validation: iteration 46/8
Validation: iteration 47/8
Validation: iteration 48/8
Epoch 0, global step 99: 'reduced_train_loss' reached 1.57371 (best 1.57371), saving model to '/opt/project/temp/nemo_experiments/default/2025-02-20_21-14-28/checkpoints/default--reduced_train_loss=1.5737-epoch=0-consumed_samples=400.0.ckpt' as top 1
[NeMo W 2025-02-20 21:14:42 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use matplotlib.colormaps[name] or matplotlib.colormaps.get_cmap() or pyplot.get_cmap() instead.
cm = get_cmap("Set1")
Root directory is: /opt/project/temp
Current location is: /opt/project/temp
full path string: /opt/project/temp/data_input/tabular_data.bin
check path exists: False
[NeMo I 2025-02-20 21:14:42 model_checkpoint:497] Scheduled async checkpoint save for /opt/project/temp/nemo_experiments/default/2025-02-20_21-14-28/checkpoints/default--reduced_train_loss=1.5737-epoch=0-consumed_samples=400.0.ckpt Trainer.fit stopped: max_steps=100 reached.
[NeMo I 2025-02-20 21:14:42 model_checkpoint:497] Scheduled async checkpoint save for /opt/project/temp/nemo_experiments/default/2025-02-20_21-14-28/checkpoints/default--reduced_train_loss=1.5737-epoch=0-consumed_samples=400.0-last.ckpt
[NeMo W 2025-02-20 21:14:43 dist_ckpt_io:155] Some async checkpoint saves might be not finalized properly.
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/project/temp/tabular_gpt_end2end_concise.py", line 374, in <module>
[rank0]: main()
[rank0]: File "/opt/project/temp/tabular_gpt_end2end_concise.py", line 314, in main
[rank0]: llm.pretrain(
[rank0]: File "/opt/NeMo/nemo/collections/llm/api.py", line 150, in pretrain
[rank0]: return train(
[rank0]: File "/opt/NeMo/nemo/collections/llm/api.py", line 107, in train
[rank0]: trainer.fit(model, data)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
[rank0]: call._call_and_handle_interrupt(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]: return function(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
[rank0]: self._run(model, ckpt_path=ckpt_path)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
[rank0]: results = self._run_stage()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
[rank0]: self.fit_loop.run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 211, in run
[rank0]: self.on_run_end()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 405, in on_run_end
[rank0]: call._call_callback_hooks(trainer, "on_train_end")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 218, in _call_callback_hooks
[rank0]: fn(trainer, trainer.lightning_module, *args, **kwargs)
[rank0]: File "/opt/NeMo/nemo/lightning/pytorch/callbacks/model_checkpoint.py", line 294, in on_train_end
[rank0]: TrainerContext.from_trainer(trainer).io_dump(
[rank0]: File "/opt/NeMo/nemo/lightning/io/mixin.py", line 238, in io_dump
[rank0]: json = serialization.dump_json(io)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 826, in dump_json
[rank0]: return json.dumps(Serialization(value, pyref_policy).result, indent=indent)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 527, in __init__
[rank0]: _ROOT_KEY: self._serialize(self._root, (), all_paths=((),)),
[rank0]: File "/opt/NeMo/nemo/lightning/io/fdl_torch.py", line 131, in _modified_serialize
[rank0]: return self._original_serialize(value, current_path, all_paths)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 662, in _serialize
[rank0]: serialized_value = self._serialize(
[rank0]: File "/opt/NeMo/nemo/lightning/io/fdl_torch.py", line 131, in _modified_serialize
[rank0]: return self._original_serialize(value, current_path, all_paths)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 662, in _serialize
[rank0]: serialized_value = self._serialize(
[rank0]: File "/opt/NeMo/nemo/lightning/io/fdl_torch.py", line 131, in _modified_serialize
[rank0]: return self._original_serialize(value, current_path, all_paths)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 651, in _serialize
[rank0]: raise UnserializableValueError(msg)
[rank0]: fiddle._src.experimental.serialization.UnserializableValueError: Unserializable value <nemo.collections.common.tokenizers.tabular_tokenizer.TabularTokenizer object at 0x7718cd0ea6b0> of type <class 'nemo.collections.common.tokenizers.tabular_tokenizer.TabularTokenizer'>. Error occurred at path '.model.tokenizer'.
Process finished with exit code 1
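For context, the failure mode is generic: fiddle's `dump_json` walks the config graph and raises as soon as it reaches an object it has no serializer registered for, here the `TabularTokenizer` at path `.model.tokenizer`. A minimal stdlib analogue of the same mechanism (illustrative only; `OpaqueTokenizer` is a made-up stand-in, not a NeMo class):

```python
import json

# Stand-in for an object that carries no JSON-serializable
# representation, analogous to TabularTokenizer in the traceback above.
class OpaqueTokenizer:
    pass

# A nested config graph with the opaque object at model.tokenizer,
# mirroring the path '.model.tokenizer' reported by fiddle.
config = {"model": {"tokenizer": OpaqueTokenizer()}}

try:
    json.dumps(config)
    error = None
except TypeError as exc:
    # json.dumps fails the same way fiddle's serializer does when it
    # meets an object it cannot encode.
    error = str(exc)

print("serialization error:", error)
```

The fix in NeMo's case presumably has to register (or substitute) a serializable representation of the tokenizer before `io_dump` runs at `on_train_end`.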
Steps/Code to reproduce bug
Expected behavior
Expected a complete checkpoint directory tree:
nemo_experiments
└── default
├── 2025-02-20_00-33-45
│ ├── checkpoints
│ │ └── default--reduced_train_loss=1.5739-epoch=0-consumed_samples=400.0
│ │ ├── context
│ │ │ └── io.json
│ │ └── weights
│ │ ├── __0_0.distcp
│ │ ├── __0_1.distcp
│ │ └── common.pt
│ ├── default--reduced_train_loss=1.5739-epoch=0-consumed_samples=400.0-unfinished
│ ├── cmd-args.log
│ ├── lightning_logs.txt
│ ├── nemo_error_log.txt
│ └── nemo_log_globalrank-0_localrank-0.txt
└── 2025-02-20_00-49-19
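To make "complete" concrete, here is a quick check one could run against a checkpoint directory (an illustrative sketch; `checkpoint_is_complete` is a hypothetical helper, not part of NeMo). It looks for the `context/io.json` and `weights/*.distcp` files shown in the tree above — exactly the `io.json` that the failed `io_dump` never wrote:

```python
from pathlib import Path
import tempfile

def checkpoint_is_complete(ckpt_dir: Path) -> bool:
    """Return True if the checkpoint dir has the completed NeMo 2.0
    layout sketched above: context/io.json plus distcp weight shards."""
    has_context = (ckpt_dir / "context" / "io.json").is_file()
    weights_dir = ckpt_dir / "weights"
    has_weights = weights_dir.is_dir() and any(weights_dir.glob("*.distcp"))
    return has_context and has_weights

# Build a mock checkpoint tree in a temp dir to exercise the check.
root = Path(tempfile.mkdtemp())
ckpt = root / "default--reduced_train_loss=1.5739-epoch=0-consumed_samples=400.0"
(ckpt / "context").mkdir(parents=True)
(ckpt / "weights").mkdir()
(ckpt / "context" / "io.json").write_text("{}")
(ckpt / "weights" / "__0_0.distcp").write_bytes(b"")

print(checkpoint_is_complete(ckpt))
```

In the failing run, the `context/io.json` half of this check is what never materializes, leaving the `-unfinished` marker behind.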
Environment overview (please complete the following information)
Running the NVIDIA NeMo 2.0 Docker installation via PyCharm on a Linux desktop with an NVIDIA RTX 3090.
Environment details
Additional context
I am reproducing this NeMo 1.0 tutorial, following all the migration steps provided in the NVIDIA NeMo Framework User Guide documentation:
https://github.com/NVIDIA/NeMo/blob/main/tutorials/nlp/Megatron_Synthetic_Tabular_Data_Generation.ipynb
GPU: NVIDIA RTX 3090.