Question about Text-to-Image Pretraining #45

Shuaizhang7 · 2025-02-06T11:38:58Z

The paper states that training is divided into Text-to-Image Pre-training and Text-to-Video Pre-training. How is image data used to train 3D full attention, given that an image has one fewer dimension than a video? And can the 3D full attention trained on image data continue training on video data without modification?

TikhonovDongqiudi · 2025-02-19T04:32:07Z

Thanks for your question. Actually, the Text-to-Image Pre-training phase uses 2D attention, and the model then switches to 3D attention in the Text-to-Video Pre-training phase.
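To illustrate why such a switch can work without changing the layer, here is a minimal NumPy sketch. It is not the repository's implementation. It assumes that "2D attention" means full self-attention over the H*W tokens of a single frame, that "3D attention" means the same operation over all T*H*W tokens of a clip, and that an image is treated as a one-frame video. The function name `full_attention`, the single-head layout, and all shapes are hypothetical.

```python
import numpy as np

def full_attention(x, wq, wk, wv):
    """Single-head self-attention over every token of x, shape (T, H, W, C).

    Hypothetical helper for illustration only, not the model's actual layer.
    """
    t, h, w, c = x.shape
    tokens = x.reshape(t * h * w, c)                # flatten time and space into one sequence
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])         # scaled dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all tokens
    return (weights @ v).reshape(t, h, w, -1)

rng = np.random.default_rng(0)
C = 8
wq, wk, wv = (rng.standard_normal((C, C)) for _ in range(3))

image = rng.standard_normal((1, 4, 4, C))  # T = 1: attention is purely spatial (2D)
video = rng.standard_normal((5, 4, 4, C))  # T = 5: attention spans space and time (3D)

image_out = full_attention(image, wq, wk, wv)  # same weights as below
video_out = full_attention(video, wq, wk, wv)
```

Under these assumptions, the projection matrices depend only on the channel size C and not on the number of tokens, so weights learned on images can be reused on video clips as they are. Whether the positional embeddings or other components need changes at the switch is not stated in this thread.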
