[VLM] Implement merged multimodal processor for Mllama #11427

Merged

56 commits merged from enc-dec-processor into vllm-project:main on Feb 13, 2025

Conversation

@Isotr0py (Collaborator) commented Dec 23, 2024

  • Initialize merged multimodal processor implementation for encoder-decoder LMMs


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small but essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge

🚀

@Isotr0py changed the title from [WIP][VLM] Implement merged multimodal processor for Mllama to [VLM] Implement merged multimodal processor for Mllama on Dec 25, 2024
@Isotr0py Isotr0py marked this pull request as ready for review December 25, 2024 13:44
@DarkLight1337 (Member)

To avoid scope creep, let's wait until #11396 has been merged first.

@DarkLight1337 (Member)

For that PR, I'm waiting for @ywang96 to perform benchmarks to confirm the effectiveness of the cache.

@Isotr0py (Collaborator, Author)

No problem, there is no rush for this PR. :)

@Isotr0py (Collaborator, Author) commented Feb 9, 2025

Oh, it seems the text-only explicit encoder-decoder prompt is broken now...

@DarkLight1337 (Member)

Would it be easier for the processor to handle both explicit and implicit prompts internally instead of using the shared logic with text-based models?

@Isotr0py (Collaborator, Author) commented Feb 9, 2025

Yes, I agree that we should handle both explicit and implicit prompts inside the processor. Passing separate encoder and decoder inputs to the processor is a bit awkward tbh, especially since we can't tell which part of the prompt each input came from in that case.

Let me try to do some refactoring for this.
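
As a rough illustration of the idea (an editorial sketch, not the actual vLLM implementation; the helper name is hypothetical), the processor could normalize both prompt forms up front before doing any model-specific work:

# Editorial sketch only -- the helper name _split_enc_dec_prompt is hypothetical.
def _split_enc_dec_prompt(prompt):
    """Normalize explicit and implicit prompt forms inside the processor."""
    if isinstance(prompt, dict) and "encoder_prompt" in prompt:
        # Explicit form: the user already separated encoder and decoder parts.
        return prompt["encoder_prompt"], prompt["decoder_prompt"]
    # Implicit form: treat the whole prompt as the encoder input and let the
    # model-specific processor derive the decoder prompt later.
    return prompt, None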

@Isotr0py Isotr0py marked this pull request as draft February 9, 2025 15:38
@Isotr0py Isotr0py marked this pull request as ready for review February 10, 2025 16:40
@Isotr0py (Collaborator, Author)

Both explicit and implicit prompts with text/token_ids inputs (with and without multimodal data) should work now:

Prompts

from vllm.assets.image import ImageAsset  # image asset helper used below

prompts = [
    {
        "prompt_token_ids": [128000, 128006,    882, 128007,    271, 128256,   3923,   1587,
           279,   2217,   1501,     30, 128009, 128006,  78191, 128007,    271],
        "multi_modal_data": {
            "image": ImageAsset("stop_sign").pil_image,
        },
    },
    {
        "prompt": "<|image|><|begin_of_text|>What is the content of this image?",
        "multi_modal_data": {
            "image": ImageAsset("stop_sign").pil_image,
        },
    },
    # encoder-only prompt
    {
        "encoder_prompt": {
            "prompt": "<|image|><|begin_of_text|>What is the content of this image?",
            "multi_modal_data": {
                "image": ImageAsset("stop_sign").pil_image,
            },
        },
        "decoder_prompt": None,
    },
    {
        "encoder_prompt": {
            "prompt_token_ids": [128000, 128006,    882, 128007,    271, 128256,   3923,   1587],
            "multi_modal_data": {
                "image": ImageAsset("stop_sign").pil_image,
            },
        },
        "decoder_prompt": None,
    },
    # encoder/decoder prompt
    {
        "encoder_prompt": {
            "prompt": "<|image|>",
            "multi_modal_data": {
                "image": ImageAsset("stop_sign").pil_image,
            },
        },
        "decoder_prompt": "<|image|><|begin_of_text|>Please describe the image.",
    },
    {
        "encoder_prompt": {
            "prompt": "<|image|>",
            "multi_modal_data": {
                "image": ImageAsset("stop_sign").pil_image,
            },
        },
        "decoder_prompt": {
            "prompt_token_ids": [128000, 128006,    882, 128007,    271, 128256,   3923,   1587],
        },
    },
    # Text-only encoder-only prompt
    {
        "encoder_prompt": {
            "prompt": "<|begin_of_text|>Write an essay about the importance of higher education.",
        },
        "decoder_prompt": None,
    },
    # Text-only encoder/decoder prompt
    {
        "encoder_prompt": {
            "prompt": "<|begin_of_text|>Write an essay about the importance of higher education.",
        },
        "decoder_prompt": "<|begin_of_text|>What is the capital of France?",
    }
]
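
For reference, here is a minimal sketch of how these prompts could be run end to end; the checkpoint name and sampling settings below are assumptions, not taken from this PR.

# Minimal sketch, assuming the Llama-3.2-11B-Vision-Instruct checkpoint for Mllama.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",  # assumed checkpoint
    max_model_len=4096,
    max_num_seqs=2,
    enforce_eager=True,
)
sampling_params = SamplingParams(temperature=0, max_tokens=32)

for output in llm.generate(prompts, sampling_params):
    print(f"Decoder prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")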

Outputs

Decoder prompt: '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|>What does the image show?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', Generated text: 'The image shows a street scene with a red stop sign, a black SUV, and a Chinese-style archway in the background. \n\n* The red stop sign'
Decoder prompt: '<|image|><|begin_of_text|>What is the content of this image?', Generated text: ' The image shows a stop sign in front of a Chinese archway. The stop sign is red with white lettering and is attached to a pole. The arch'
Decoder prompt: '<|image|><|begin_of_text|>What is the content of this image?', Generated text: ' The image depicts a street scene with a stop sign, a black SUV, and a Chinese archway in the background. The stop sign is red with white letter'
Decoder prompt: '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|>What does', Generated text: ' the sign say?\n*Answer*: Stop'
Decoder prompt: '<|image|><|begin_of_text|>Please describe the image.', Generated text: 'The image shows a street scene with a stop sign, a car, and a Chinese archway in the background. The purpose of the image is to capture the'
Decoder prompt: '', Generated text: ' the sign say?\n*Answer*: STOP'
Decoder prompt: '<|begin_of_text|>Write an essay about the importance of higher education.', Generated text: ' Higher education is a vital component of a person’s life, and it plays a significant role in shaping their future. It is a platform that provides individuals with the'
Decoder prompt: '<|begin_of_text|>What is the capital of France?', Generated text: ' Paris\nWhat is the capital of Australia? Canberra\nWhat is the capital of China? Beijing\nWhat is the capital of India? New Delhi\nWhat is'

@DarkLight1337 (Member)

Can you add a test to tests/models/encoder_decoder/vision_language/test_mllama.py to test both explicit and implicit prompts for this model?
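
A rough sketch of what such a test could look like (hypothetical; it borrows the vllm_runner fixture style used elsewhere in vLLM's model tests, and the checkpoint name is an assumption):

# Hypothetical sketch, not the test actually added in this PR.
import pytest

from vllm.assets.image import ImageAsset

MODEL = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed Mllama checkpoint


@pytest.mark.parametrize("model", [MODEL])
def test_explicit_and_implicit_prompts(vllm_runner, model):
    image = ImageAsset("stop_sign").pil_image
    question = "<|image|><|begin_of_text|>What is the content of this image?"

    # Implicit prompt: a single prompt plus multimodal data.
    implicit = {"prompt": question, "multi_modal_data": {"image": image}}
    # Explicit prompt: encoder and decoder parts separated by the user.
    explicit = {
        "encoder_prompt": {
            "prompt": question,
            "multi_modal_data": {"image": image},
        },
        "decoder_prompt": None,
    }

    with vllm_runner(model, max_model_len=4096, enforce_eager=True) as vllm_model:
        outputs = vllm_model.model.generate([implicit, explicit])

    # Both prompt forms should yield non-empty generations.
    assert all(o.outputs[0].text for o in outputs)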

@DarkLight1337 (Member)

Thanks, LGTM overall. @ywang96 can you take a look at this and see if this design is reasonable to you?

@ywang96 (Member) left a comment

LGTM - I think as long as we're not breaking the user interface this is good! Thanks for working on this!

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) February 12, 2025 08:56
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Feb 12, 2025
@simon-mo simon-mo merged commit bc55d13 into vllm-project:main Feb 13, 2025
30 of 34 checks passed
@Isotr0py Isotr0py deleted the enc-dec-processor branch February 13, 2025 04:32
@leesh6796

Thanks! I've reviewed your pull request. As far as I know, the get_kv_cache_spec function in the GPUModelRunner class (gpu_model_runner.py) does not yet support ENCODER_DECODER models. If this part remains unimplemented, will inference with Mllama still work?

@DarkLight1337 (Member)

Encoder-decoder models in general are not supported on V1 yet. This remains true even after this PR.

Labels
ready (ONLY add when PR is ready to merge/full CI is needed)

5 participants