The paper states that training is divided into Text-to-Image Pre-training and Text-to-Video Pre-training. How is image data used to train 3D full attention, given that image data has one fewer dimension than video data? Can 3D full attention trained on image data continue to be trained on video data without modification?
Thanks for your question. The Text-to-Image Pre-training phase actually uses 2D attention; the model then switches to 3D attention in the Text-to-Video Pre-training phase.
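The reply does not say how the switch is implemented, so here is only a minimal sketch of one common way it can work, not the authors' code. It assumes the attention layer operates on a flat token sequence, so that "2D attention" means attending over H*W tokens of one image and "3D attention" means attending over T*H*W tokens of a video, with the same projection weights in both cases. The function names (`self_attention`, `attention_2d`, `attention_3d`) are hypothetical, positional embeddings are left out, and plain Python lists are used to keep the example dependency-free.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, wq, wk, wv):
    """Single-head self-attention over a flat (N, D) token list.

    The weights are (D, D) and do not depend on N, which is why the same
    layer can process H*W image tokens or T*H*W video tokens.
    """
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scale = 1.0 / math.sqrt(len(wq[0]))
    scores = matmul(q, list(zip(*k)))  # q @ k^T, shape (N, N)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v)

def attention_2d(image, wq, wk, wv):
    """Assumed 2D attention: image (H, W, D) flattened to H*W tokens."""
    tokens = [pixel for row in image for pixel in row]
    return self_attention(tokens, wq, wk, wv)

def attention_3d(video, wq, wk, wv):
    """Assumed 3D full attention: video (T, H, W, D) flattened to T*H*W tokens."""
    tokens = [pixel for frame in video for row in frame for pixel in row]
    return self_attention(tokens, wq, wk, wv)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
D, H, W, T = 4, 2, 3, 5  # toy sizes, chosen only for illustration
wq, wk, wv = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)

image = [rand_matrix(W, D) for _ in range(H)]                      # (H, W, D)
video = [[rand_matrix(W, D) for _ in range(H)] for _ in range(T)]  # (T, H, W, D)

out_image = attention_2d(image, wq, wk, wv)    # H*W output tokens
out_single = attention_3d([image], wq, wk, wv)  # the image as a 1-frame video
out_video = attention_3d(video, wq, wk, wv)    # T*H*W output tokens
```

Under these assumptions, treating an image as a one-frame video gives the same result from the 3D path as from the 2D path, so weights learned in the image phase can be loaded unchanged when training moves to video. Whether the actual model also changes its positional embeddings or other components at that point is not stated in the thread.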