sageattn_qk_int8_pv_fp16_cuda black output with pv_accum fp16, results in black screen in opensora #93

nighting0le01 opened this issue Jan 22, 2025 · 10 comments


@nighting0le01

sageattn_qk_int8_pv_fp16_cuda produces black output with pv_accum fp16, which results in a black screen in opensora. Help appreciated.

@jason-huang03
Member

Which version of opensora do you use?

@nighting0le01
Author

I use 1.2 with bf16.

@asahni04

asahni04 commented Jan 25, 2025

@jason-huang03 @jt-zhang I am seeing similar issues. Any updates? My issue: #94

@asahni04

@jason-huang03 @jt-zhang any suggestions?

@jason-huang03
Member

Have you tried kernels that offer higher precision, like those with fp32 or fp16+fp32 accumulation?
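
For reference, switching the accumulator is a keyword argument on the kernel. A minimal sketch, assuming the signature documented in the SageAttention README (tensor shapes and argument names may differ across versions, so check your installed build):

```python
import torch
from sageattention import sageattn_qk_int8_pv_fp16_cuda

# Toy tensors in (batch, heads, seq_len, head_dim) layout ("HND").
q = torch.randn(1, 16, 4096, 64, dtype=torch.float16, device="cuda")
k = torch.randn(1, 16, 4096, 64, dtype=torch.float16, device="cuda")
v = torch.randn(1, 16, 4096, 64, dtype=torch.float16, device="cuda")

# pv_accum_dtype="fp16" is fastest but can overflow (black/NaN output);
# "fp32" and "fp16+fp32" widen the accumulator at some speed cost.
o = sageattn_qk_int8_pv_fp16_cuda(
    q, k, v,
    tensor_layout="HND",
    is_causal=False,
    pv_accum_dtype="fp32",  # or "fp16+fp32" as a middle ground
)
```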

@jason-huang03
Member

fp16 has a limited range and may overflow when used as the accumulator.
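
To illustrate the range limit with plain PyTorch (not the kernel itself):

```python
import torch

# float16 tops out at 65504, so a long accumulation of moderate values
# saturates to inf; downstream that shows up as black / NaN frames.
x = torch.full((4096,), 100.0, dtype=torch.float16)
print(x.sum())                     # inf: 4096 * 100 = 409600 > 65504
print(x.sum(dtype=torch.float32))  # tensor(409600.): an fp32 accumulator is fine
```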

@asahni04

Hi @jason-huang03, I did. Specifically, i2i runs into this issue; t2i seems fine with fp16 accumulation. With fp32 and fp16+fp32 it does work, but is barely faster than FA-2. I even tried smooth_v, and it does not solve the black/NaN issue. Please suggest what I can do to get a speedup on A100. Will qk_quant_gran have an effect?

@jason-huang03
Member

I believe qk_quant_gran will not have an effect because your issue seems to be an overflow problem. By the way, what is the sequence length of attention in the model?

@nighting0le01
Author

> I believe qk_quant_gran will not have an effect because your issue seems to be an overflow problem. By the way, what is the sequence length of attention in the model?

Hey @jason-huang03, would qk_quant_gran at least speed up fp16+fp32? What are the tradeoffs associated with its setting?

@jason-huang03
Member

I believe "per_warp" might be a little faster, "per_thread" will be a little more accurate.
