keras.Model.fit fails running distributed training with TF backend and MirroredStrategy #21069

roebel · 2025-03-19T16:58:31Z

Hello

For quite a while I have tried to train my model with Keras 3 and the TF backend (TF 2.18) using distributed training and MirroredStrategy. Not being able to run the training with Model.fit successfully, I turned to one of the examples from the Keras documentation. The training example runs fine under Keras 2/TF 2.18 with MirroredStrategy for 1 and 2 devices, and it also runs fine under Keras 3 with one device. With 2 devices, the fit function fails with this error under TF 2.17:

    132         to_union_indices = tf.gather(indices_indices, union_indices)
    133         values_with_leading_zeros = tf.concat(
--> 134             [tf.zeros((1,) + values.shape[1:], values.dtype), values], axis=0
    135         )
    136         return tf.gather(values_with_leading_zeros, to_union_indices)

ValueError: Cannot convert a partially known TensorShape (1, None) to a Tensor.
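
For readers unfamiliar with this message: TensorFlow raises it when a static shape that still contains an unknown dimension (None) is passed where concrete sizes are needed, as in the tf.zeros call on line 134 of the traceback. The following is a minimal sketch that reproduces the same class of error outside Keras and shows the usual workaround of building the shape from tf.shape at run time. The function names are hypothetical, and this is not claimed to be the fix for the Keras code above; it only illustrates the mechanism under the assumption that values has an unknown second dimension during tracing.

```python
import tensorflow as tf

# Assumption for illustration: a 2-D float input whose dimensions are
# unknown while the function is traced.
SPEC = [tf.TensorSpec(shape=[None, None], dtype=tf.float32)]


@tf.function(input_signature=SPEC)
def prepend_zero_row_static(values):
    # Same expression as in the traceback: values.shape[1:] is (None,),
    # so the requested shape is the partially known (1, None).
    zero_row = tf.zeros((1,) + values.shape[1:], values.dtype)
    return tf.concat([zero_row, values], axis=0)


@tf.function(input_signature=SPEC)
def prepend_zero_row_dynamic(values):
    # tf.shape returns the run-time shape as a tensor, so no dimension
    # is None when tf.zeros is evaluated.
    zero_row_shape = tf.concat([[1], tf.shape(values)[1:]], axis=0)
    zero_row = tf.zeros(zero_row_shape, values.dtype)
    return tf.concat([zero_row, values], axis=0)


def static_version_fails():
    """Return True if tracing the static variant raises ValueError."""
    try:
        prepend_zero_row_static(tf.ones((2, 3)))
    except ValueError:
        return True
    return False
```

Calling prepend_zero_row_dynamic(tf.ones((2, 3))) returns a 3x3 tensor whose first row is zeros, while static_version_fails() reports that the static variant cannot be traced.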

and with this error under TF 2.18:

Epoch 1/2
Traceback (most recent call last):
  File "/Users/roebel/pysrc/FSQCodec/./scripts/test_mirrorstrategy_func.py", line 90, in <module>
    model.fit(
  File "/u/formes/share/packages/manaconda3/envs/tf2.18/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/u/formes/share/packages/manaconda3/envs/tf2.18/lib/python3.10/site-packages/keras/src/backend/tensorflow/core.py", line 141, in convert_to_tensor
    return tf.convert_to_tensor(x, dtype=dtype)
ValueError: None values not supported.

I've put the code with results for running under Keras 2 here and the version configured for running with Keras 3 here.

Many thanks for your help.

roebel · 2025-03-20T09:44:52Z

Having seen a few recent bugs related to distributed training producing incorrect results, I'd like to add that in the present case the fit function does not even start to produce any results. It simply fails during the first invocation.

github-actions bot assigned sachinprasadhs Mar 19, 2025

sonali-kumari1 added the type:Bug label Mar 20, 2025

dhantule added the keras-team-review-pending label Mar 26, 2025
