keras.Model.fit fails when running distributed training with TF backend and MirroredStrategy #21069

Open
roebel opened this issue Mar 19, 2025 · 1 comment
Labels: keras-team-review-pending, type:Bug

roebel commented Mar 19, 2025

Hello

For quite a while I have been trying to train my model with Keras 3 and the TF backend (TF2.18) using distributed training with MirroredStrategy. Unable to get training with Model.fit to run successfully, I tried one of the examples from the Keras documentation instead. The training example runs fine under Keras 2/TF2.18 with MirroredStrategy for 1 and 2 devices, and it also runs fine under Keras 3 with one device. With 2 devices, however, the fit function fails with this error under TF2.17:

    132         to_union_indices = tf.gather(indices_indices, union_indices)
    133         values_with_leading_zeros = tf.concat(
--> 134             [tf.zeros((1,) + values.shape[1:], values.dtype), values], axis=0
    135         )
    136         return tf.gather(values_with_leading_zeros, to_union_indices)

ValueError: Cannot convert a partially known TensorShape (1, None) to a Tensor.

and with this error under TF2.18:

Epoch 1/2
Traceback (most recent call last):
  File "/Users/roebel/pysrc/FSQCodec/./scripts/test_mirrorstrategy_func.py", line 90, in <module>
    model.fit(
  File "/u/formes/share/packages/manaconda3/envs/tf2.18/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/u/formes/share/packages/manaconda3/envs/tf2.18/lib/python3.10/site-packages/keras/src/backend/tensorflow/core.py", line 141, in convert_to_tensor
    return tf.convert_to_tensor(x, dtype=dtype)
ValueError: None values not supported.
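
For reference, the TF2.17 message can be illustrated outside Keras. The sketch below assumes the failure mechanism is what the message suggests: `values` reaches the flagged line with an unknown non-leading dimension, so `(1,) + values.shape[1:]` is the static shape `(1, None)`, which `tf.zeros` cannot turn into a shape tensor. The function names and the input signature are hypothetical and chosen only for this illustration. The second function builds the shape from `tf.shape()` instead; it is shown for contrast and is not claimed to be the right fix for Keras.

```python
import tensorflow as tf

# Hypothetical input signature: both dimensions are unknown while tracing,
# so the static shape of `values` inside the functions is (None, None).
SPEC = tf.TensorSpec(shape=[None, None], dtype=tf.float32)


@tf.function(input_signature=[SPEC])
def prepend_zero_row_static(values):
    # Same pattern as the line flagged in the TF2.17 traceback: the static
    # shape (1, None) is only partially known and cannot become a tensor.
    zeros = tf.zeros((1,) + values.shape[1:], values.dtype)
    return tf.concat([zeros, values], axis=0)


@tf.function(input_signature=[SPEC])
def prepend_zero_row_dynamic(values):
    # Contrast case: tf.shape() is evaluated at run time, so unknown
    # static dimensions are not a problem.
    row_shape = tf.concat([[1], tf.shape(values)[1:]], axis=0)
    zeros = tf.zeros(row_shape, values.dtype)
    return tf.concat([zeros, values], axis=0)


if __name__ == "__main__":
    x = tf.ones([2, 3])
    try:
        prepend_zero_row_static(x)
    except ValueError as err:
        print("static shape variant failed:", err)
    print(prepend_zero_row_dynamic(x))
```

Whether the same condition is what triggers the failure under MirroredStrategy with 2 devices is not established here; the sketch only shows how this class of error arises.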

I've put the code, together with the results of running it under Keras 2, here, and the version configured for running with Keras 3 here.

Many thanks for your help.


roebel commented Mar 20, 2025

Having seen a few recent bugs related to distributed training producing incorrect results, I'd like to add that in the present case the fit function does not even start to produce any results. It simply fails during the first invocation.

dhantule added the keras-team-review-pending label on Mar 26, 2025