[VLM] Implement merged multimodal processor for Mllama #11427

Merged

56 commits merged from enc-dec-processor into vllm-project:main on Feb 13, 2025

Conversation

@Isotr0py (Collaborator) commented Dec 23, 2024

  • Initialize merged multimodal processor implementation for encoder-decoder LMMs


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small but essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge

🚀

@Isotr0py changed the title from [WIP][VLM] Implement merged multimodal processor for Mllama to [VLM] Implement merged multimodal processor for Mllama on Dec 25, 2024
@Isotr0py Isotr0py marked this pull request as ready for review December 25, 2024 13:44
@DarkLight1337 (Member)

To avoid scope creep, let's wait until #11396 has been merged first.

@DarkLight1337 (Member)

For that PR, I'm waiting for @ywang96 to perform benchmarks to confirm the effectiveness of the cache.

@Isotr0py (Collaborator, Author)

No problem, there is no rush for this PR. :)

@Isotr0py (Collaborator, Author) commented Feb 9, 2025

Oh, it seems the text-only explicit encoder-decoder prompt is broken now...

@DarkLight1337 (Member)

Would it be easier for the processor to handle both explicit and implicit prompts internally instead of using the shared logic with text-based models?

@Isotr0py (Collaborator, Author) commented Feb 9, 2025

Yes, I agree that we should handle both explicit and implicit prompts inside the processor. Passing separate encoder and decoder inputs to the processor is a bit awkward tbh, especially since we can't tell which part of the prompt each input came from in that case.

Let me try to do some refactoring for this.
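
As a rough illustration of the idea (an editorial sketch, not the actual vLLM implementation; the helper name is hypothetical), the processor could normalize both prompt forms up front before doing any model-specific work:

# Editorial sketch only -- the helper name _split_enc_dec_prompt is hypothetical.
def _split_enc_dec_prompt(prompt):
    """Normalize explicit and implicit prompt forms inside the processor."""
    if isinstance(prompt, dict) and "encoder_prompt" in prompt:
        # Explicit form: the user already separated encoder and decoder parts.
        return prompt["encoder_prompt"], prompt["decoder_prompt"]
    # Implicit form: treat the whole prompt as the encoder input and let the
    # model-specific processor derive the decoder prompt later.
    return prompt, None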

@Isotr0py Isotr0py marked this pull request as draft February 9, 2025 15:38
@Isotr0py Isotr0py marked this pull request as ready for review February 10, 2025 16:40
@Isotr0py (Collaborator, Author)

Both explicit and implicit prompts with text/token_ids inputs (with and without multimodal data) should work now:

Prompts

from vllm.assets.image import ImageAsset  # image asset helper used below

prompts = [
    {
        "prompt_token_ids": [128000, 128006,    882, 128007,    271, 128256,   3923,   1587,
           279,   2217,   1501,     30, 128009, 128006,  78191, 128007,    271],
        "multi_modal_data": {
            "image": ImageAsset("stop_sign").pil_image,
        },
    },
    {
        "prompt": "<|image|><|begin_of_text|>What is the content of this image?",
        "multi_modal_data": {
            "image": ImageAsset("stop_sign").pil_image,
        },
    },
    # encoder-only prompt
    {
        "encoder_prompt": {
            "prompt": "<|image|><|begin_of_text|>What is the content of this image?",
            "multi_modal_data": {
                "image": ImageAsset("stop_sign").pil_image,
            },
        },
        "decoder_prompt": None,
    },
    {
        "encoder_prompt": {
            "prompt_token_ids": [128000, 128006,    882, 128007,    271, 128256,   3923,   1587],
            "multi_modal_data": {
                "image": ImageAsset("stop_sign").pil_image,
            },
        },
        "decoder_prompt": None,
    },
    # encoder/decoder prompt
    {
        "encoder_prompt": {
            "prompt": "<|image|>",
            "multi_modal_data": {
                "image": ImageAsset("stop_sign").pil_image,
            },
        },
        "decoder_prompt": "<|image|><|begin_of_text|>Please describe the image.",
    },
    {
        "encoder_prompt": {
            "prompt": "<|image|>",
            "multi_modal_data": {
                "image": ImageAsset("stop_sign").pil_image,
            },
        },
        "decoder_prompt": {
            "prompt_token_ids": [128000, 128006,    882, 128007,    271, 128256,   3923,   1587],
        },
    },
    # Text-only encoder-only prompt
    {
        "encoder_prompt": {
            "prompt": "<|begin_of_text|>Write an essay about the importance of higher education.",
        },
        "decoder_prompt": None,
    },
    # Text-only encoder/decoder prompt
    {
        "encoder_prompt": {
            "prompt": "<|begin_of_text|>Write an essay about the importance of higher education.",
        },
        "decoder_prompt": "<|begin_of_text|>What is the capital of France?",
    }
]
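
For reference, here is a minimal sketch of how these prompts could be run end to end; the checkpoint name and sampling settings below are assumptions, not taken from this PR.

# Minimal sketch, assuming the Llama-3.2-11B-Vision-Instruct checkpoint for Mllama.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",  # assumed checkpoint
    max_model_len=4096,
    max_num_seqs=2,
    enforce_eager=True,
)
sampling_params = SamplingParams(temperature=0, max_tokens=32)

for output in llm.generate(prompts, sampling_params):
    print(f"Decoder prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")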

Outputs

Decoder prompt: '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|>What does the image show?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', Generated text: 'The image shows a street scene with a red stop sign, a black SUV, and a Chinese-style archway in the background. \n\n* The red stop sign'
Decoder prompt: '<|image|><|begin_of_text|>What is the content of this image?', Generated text: ' The image shows a stop sign in front of a Chinese archway. The stop sign is red with white lettering and is attached to a pole. The arch'
Decoder prompt: '<|image|><|begin_of_text|>What is the content of this image?', Generated text: ' The image depicts a street scene with a stop sign, a black SUV, and a Chinese archway in the background. The stop sign is red with white letter'
Decoder prompt: '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|>What does', Generated text: ' the sign say?\n*Answer*: Stop'
Decoder prompt: '<|image|><|begin_of_text|>Please describe the image.', Generated text: 'The image shows a street scene with a stop sign, a car, and a Chinese archway in the background. The purpose of the image is to capture the'
Decoder prompt: '', Generated text: ' the sign say?\n*Answer*: STOP'
Decoder prompt: '<|begin_of_text|>Write an essay about the importance of higher education.', Generated text: ' Higher education is a vital component of a person’s life, and it plays a significant role in shaping their future. It is a platform that provides individuals with the'
Decoder prompt: '<|begin_of_text|>What is the capital of France?', Generated text: ' Paris\nWhat is the capital of Australia? Canberra\nWhat is the capital of China? Beijing\nWhat is the capital of India? New Delhi\nWhat is'

@DarkLight1337 (Member)

Can you add a test to tests/models/encoder_decoder/vision_language/test_mllama.py to test both explicit and implicit prompts for this model?
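
A rough sketch of what such a test could look like (hypothetical; it borrows the vllm_runner fixture style used elsewhere in vLLM's model tests, and the checkpoint name is an assumption):

# Hypothetical sketch, not the test actually added in this PR.
import pytest

from vllm.assets.image import ImageAsset

MODEL = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed Mllama checkpoint


@pytest.mark.parametrize("model", [MODEL])
def test_explicit_and_implicit_prompts(vllm_runner, model):
    image = ImageAsset("stop_sign").pil_image
    question = "<|image|><|begin_of_text|>What is the content of this image?"

    # Implicit prompt: a single prompt plus multimodal data.
    implicit = {"prompt": question, "multi_modal_data": {"image": image}}
    # Explicit prompt: encoder and decoder parts separated by the user.
    explicit = {
        "encoder_prompt": {
            "prompt": question,
            "multi_modal_data": {"image": image},
        },
        "decoder_prompt": None,
    }

    with vllm_runner(model, max_model_len=4096, enforce_eager=True) as vllm_model:
        outputs = vllm_model.model.generate([implicit, explicit])

    # Both prompt forms should yield non-empty generations.
    assert all(o.outputs[0].text for o in outputs)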

@DarkLight1337 (Member)

Thanks, LGTM overall. @ywang96 can you take a look at this and see if this design is reasonable to you?

@ywang96 (Member) left a comment

LGTM - I think as long as we're not breaking the user interface this is good! Thanks for working on this!

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) February 12, 2025 08:56
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Feb 12, 2025
@simon-mo simon-mo merged commit bc55d13 into vllm-project:main Feb 13, 2025
30 of 34 checks passed
@Isotr0py Isotr0py deleted the enc-dec-processor branch February 13, 2025 04:32
@leesh6796

Thanks! I've reviewed your pull request. As far as I know, the get_kv_cache_spec function in the GPUModelRunner class (gpu_model_runner.py) does not yet support ENCODER_DECODER models. If this part remains unimplemented, will inference with Mllama still work?

@DarkLight1337 (Member)

Encoder-decoder models in general are not supported on V1 yet. This remains true even after this PR.

Labels
ready (ONLY add when PR is ready to merge/full CI is needed)

5 participants