VLM: compile compatibility #35724
base: main
Conversation
if past_key_value is not None:
    if not isinstance(past_key_value, EncoderDecoderCache):
        curr_past_key_value = past_key_value
    else:
I don't know why, but the OPT model works as decoder-only while its attention is written as cross-attention (not used anywhere in the codebase). So we need to somehow keep BC while using the new DynamicCache.
As a workaround I simply added a check on the cache instance. Another option is to accept and return only the correct cache (self- or cross-attention), but that would require changes in all encoder-decoder models and thus break BC.
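A minimal sketch of the workaround described above (not the exact PR code): the attention layer may receive either a plain `DynamicCache` (decoder-only usage, the common case for OPT) or an `EncoderDecoderCache`, and picks the right sub-cache based on the instance type. The helper name and `is_cross_attention` argument are illustrative.

```python
from transformers.cache_utils import DynamicCache, EncoderDecoderCache


def select_cache(past_key_value, is_cross_attention: bool):
    """Return the cache object this attention layer should read from / write to."""
    if past_key_value is None:
        return None
    if isinstance(past_key_value, EncoderDecoderCache):
        # Encoder-decoder style cache: pick the self- or cross-attention part.
        if is_cross_attention:
            return past_key_value.cross_attention_cache
        return past_key_value.self_attention_cache
    # Decoder-only usage: the cache (e.g. DynamicCache) is used as-is.
    return past_key_value
```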
ignore_index (`int`, *optional*, defaults to -100):
    The ignore index for the loss function.
I believe this can be removed since it is no longer used when merging inputs. We could also deprecate it properly, but I don't think anyone uses it.
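For context, `ignore_index` only matters if the model's loss actually uses it; a minimal illustration of the standard pattern (shapes and values here are made up, not from the PR):

```python
import torch
import torch.nn.functional as F

# Hypothetical logits (batch, seq_len, vocab) and labels (batch, seq_len);
# positions set to -100 are skipped by the loss.
logits = torch.randn(2, 6, 32)
labels = torch.randint(0, 32, (2, 6))
labels[:, :2] = -100  # e.g. mask out prompt/image positions

loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
)
```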
half_batch_size = self.model_tester.batch_size // 2
inputs_dict_1 = {k: v[:half_batch_size, ...] for k, v in inputs_dict.items() if "head_mask" not in k}
inputs_dict_2 = {
    k: v[half_batch_size : half_batch_size * 2, ...]
    for k, v in inputs_dict.items()
    if "head_mask" not in k
}
self.assertTrue(
    inputs_dict_1[model_class.main_input_name].shape == inputs_dict_2[model_class.main_input_name].shape
)
Some models cannot generate from `input_ids` alone, so we pass the whole dict except for the `head_mask` keys (we're removing head mask soon anyway).
@@ -83,14 +83,14 @@ def __init__(
     moe_intermediate_size=4,
     moe_num_experts=4,
     moe_topk=2,
-    num_attention_heads=20,
+    num_attention_heads=8,
Aria's tester had a hidden size of 32 and 20 heads, which caused division problems in tests when inferring `head_dim`.
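A quick illustration of the division issue (the inference rule `hidden_size // num_attention_heads` is the common convention; the numbers are from the tester config above):

```python
hidden_size = 32
assert hidden_size % 20 != 0  # old config: 32 / 20 is not an integer, head_dim inference breaks
assert hidden_size % 8 == 0   # new config: head_dim = 32 // 8 = 4
```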
Ready for review. The failing test is flaky; otherwise everything is passing on my end, including the slow tests for compile/StaticCache.
What does this PR do?
As per the title, this adds flags in VLMs where needed, removes test skips, and makes sure VLMs are compile compatible. For BLIP models it also adds the new cache format in OPT, which is one of the backbones. Now all official BLIP models support static cache and thus compile.
NOTE:
- Models that need a custom `prepare_inputs_for_generation` skip `test_compile_forward`, which compiles the model for the pre-fill phase. The test for decoding-stage compile is green, so I'm leaving the flag as `True`.
- `-k compile_forward` and `-k static_` were run for all models and are passing. Some models needed to have the flag turned off, since MoE models cannot currently be compiled (dynamic control flow).

How to run compile and export for VLMs (see the sketch below):
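The original description ends here; the following is a minimal sketch of how one might exercise static cache plus `torch.compile` with a BLIP-2 checkpoint. The checkpoint, image URL, prompt, and generation settings are illustrative, not taken from the PR.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "Salesforce/blip2-opt-2.7b"  # any compile-compatible VLM should work similarly
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, text="Question: what is in the picture? Answer:", return_tensors="pt").to(
    "cuda", torch.float16
)

# Static cache + compiled forward for the decoding step.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(processor.batch_decode(out, skip_special_tokens=True))
```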