Hi, thanks for your awesome work! I recently ran the public naive SFT and CIL 512 checkpoints following the B.3 evaluation setting and got the following results, which differ from those reported in the paper.
| Method | FVD (2048 videos, 16 frames) | IS (10k) |
| --- | --- | --- |
| pre fs=3 | 434.60 | 13.16 |
| pre fs=8 | 288.01 | 13.31 |
| cil fs=8 | 409.69 | 13.68 |
| sft fs=8 | 373.77 | 13.19 |
I noticed the paper says "we sample 16 frames at 3 fps", so I changed the inference shell script to FS=3 (note that FS is actually fps in DynamiCrafter), but found that the pretrained checkpoint has a higher FVD than reported in the paper. I also tried FS=8 (fps = 24/3 = 8), but found that CIL at the same fps has a larger FVD than the pretrained model.
So I would like to kindly request more detail about the evaluation protocol in the paper: do you evaluate FVD on exactly 2048 generated videos, or do you generate more videos and randomly sample 2048 of them each time?
Second, I used RAFT to calculate the motion score on the 2048 generated videos following the paper's description on page 17 (a sketch of the computation I mean is included after the table below), and got the following results:
| Method | Motion Score |
| --- | --- |
| pre fs=3 | 91.5 |
| pre fs=8 | 78.17 |
| cil fs=8 | 73.78 |
| sft fs=8 | 85.89 |
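For reference, here is a minimal sketch of the kind of computation I mean; it assumes the motion score is the mean RAFT optical-flow magnitude between consecutive frames, averaged over pixels and frame pairs (the helper name and the exact averaging are illustrative, not necessarily the paper's definition):

```python
# Minimal sketch: motion score as mean optical-flow magnitude from RAFT.
import torch
import torchvision.transforms.functional as TF
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
model = raft_large(weights=Raft_Large_Weights.DEFAULT).eval().to(device)

@torch.no_grad()
def motion_score(frames: torch.Tensor) -> float:
    """frames: (T, 3, H, W) uint8 tensor of one video; H and W divisible by 8."""
    frames = TF.convert_image_dtype(frames, torch.float32) * 2 - 1   # RAFT expects [-1, 1]
    prev, nxt = frames[:-1].to(device), frames[1:].to(device)
    flow = model(prev, nxt)[-1]                          # (T-1, 2, H, W), last refinement iteration
    magnitude = torch.linalg.vector_norm(flow, dim=1)    # per-pixel flow magnitude
    return magnitude.mean().item()                       # average over pixels and frame pairs
```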
It seems that SFT has a higher motion score than the pretrained model, which contradicts Table 1. Could you help me figure out what is going wrong?
Hi @LZY-the-boys, thank you for your attention to our work, and sorry for the late reply. The evaluation settings are clarified as follows:
For UCF101, we sample 16 frames at 3 fps. In other words, 3 frames per second of each UCF101 video are sampled, giving a total of 16 frames that serve as ground truth for evaluation. Here, the fps does not refer to the FPS in the DynamiCrafter setting, which remains at its default value from the official repository.
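For concreteness, a minimal sketch of this ground-truth sampling is below; the use of torchvision's read_video, the rounding of indices, and starting from the first frame are illustrative assumptions, not our exact preprocessing code:

```python
# Sketch: sample 16 ground-truth frames at 3 fps from a UCF101 clip.
import torch
from torchvision.io import read_video

def sample_ground_truth(path: str, target_fps: float = 3.0, num_frames: int = 16) -> torch.Tensor:
    video, _, info = read_video(path, pts_unit="sec", output_format="TCHW")
    src_fps = info["video_fps"]
    step = src_fps / target_fps                      # source frames per sampled frame
    idx = (torch.arange(num_frames) * step).round().long()
    idx = idx.clamp(max=video.shape[0] - 1)          # guard against clips shorter than needed
    return video[idx]                                # (16, C, H, W) uint8 frames
```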
We generate 10K videos to evaluate the IS metric, and randomly sample 2048 of them for FVD evaluation.
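As a small illustration of that subset selection (the directory layout, file pattern, and fixed seed are assumptions for the sketch, not our exact pipeline):

```python
# Sketch: IS uses all 10K generated videos; FVD uses a random subset of 2048.
import random
from pathlib import Path

generated = sorted(Path("generated_videos").glob("*.mp4"))  # hypothetical output directory
assert len(generated) >= 2048
rng = random.Random(0)                     # fixed seed so the subset is reproducible
fvd_subset = rng.sample(generated, 2048)   # files passed to the FVD evaluation
```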
Notably, the bash scripts in our repository apply the inference strategy Analytic-Init, which makes them unsuitable for direct use in the ablation of training strategies. For example, in inference_512.sh the initial timestep M is set to 940 (0.94T), while in inference_CIL_512.sh it is 1000 (T). Also, in both scripts, whether_analytic_init is set to true. For a fair validation of the effectiveness of TimeNoise, you need to set M to 1000 and whether_analytic_init=0. I guess this is probably also the cause of the inconsistency between your results and those in the paper, especially the motion scores.
To validate the improvement in motion score, it is also recommended to run our scripts to generate some demos following readme.md. I believe the visual comparisons can illustrate the relative degree of motion between SFT and TimeNoise.