Fix SD2.X clip single file load projection_dim #10770

Open · Teriks wants to merge 1 commit into main
Conversation

@Teriks (Contributor) commented Feb 11, 2025

Infer projection_dim from the checkpoint before loading from pretrained, overriding any incorrect hub config.

The hub configuration for SD2.X specifies projection_dim=512, which is incorrect for SD2.X checkpoints downloaded from CivitAI and similar sites.

Previously, an exception was thrown when attempting load_model_dict_into_meta for SD2.X single-file checkpoints.

Such LDM models usually require projection_dim=1024 for the CLIP text encoder.
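
A minimal sketch of that inference (the checkpoint key and helper name here are illustrative assumptions, not the exact patch): read the encoder width out of a checkpoint tensor and use it to override the hub config's projection_dim before any weights are loaded.

def infer_projection_dim(checkpoint, fallback=1024):
    # SD2.X OpenCLIP checkpoints stack q/k/v into one in_proj_weight of
    # shape [3 * hidden, hidden]; the key name assumes the usual LDM layout.
    key = "cond_stage_model.model.transformer.resblocks.0.attn.in_proj_weight"
    if key in checkpoint:
        return int(checkpoint[key].shape[0]) // 3
    return fallback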

Who can review?

@sayakpaul @yiyixuxu @DN6

@DN6 (Collaborator) commented Feb 14, 2025

@Teriks could you share an example I can use to reproduce the error? Along with a link to the checkpoint you're trying to use?

@Teriks (Contributor, Author) commented Feb 14, 2025

@DN6

Model page: https://civitai.com/models/2711/21-sd-modern-buildings-style-md

Checkpoint: https://civitai.com/api/download/models/3002?type=Model&format=PickleTensor&size=full&fp=fp16

Original Config: https://civitai.com/api/download/models/3002?type=Config&format=Other

Here is a script that reproduces the error, using the checkpoint above.

This exception happens with any LDM checkpoint hosted on CivitAI under the SD2.0 and SD2.1 categories.

Some models probably need additional config to make them work; the fix I am applying just makes most of them function out of the box.

import diffusers

# Model page: https://civitai.com/models/2711/21-sd-modern-buildings-style-md
# The ckpt and YAML config are from that page:
# https://civitai.com/api/download/models/3002?type=Model&format=PickleTensor&size=full&fp=fp16
# https://civitai.com/api/download/models/3002?type=Config&format=Other

# This will fail with an exception.
pipe = diffusers.StableDiffusionPipeline.from_single_file(
    '21SDModernBuildings_midjourneyBuildings.ckpt',
    original_config='21SDModernBuildings_midjourneyBuildings.yaml')

This fails with the following exception because projection_dim for the text_encoder is wrong in the hub config (taken from SD2.1) for this model:

Fetching 10 files: 100%|██████████| 10/10 [00:00<?, ?it/s]
Loading pipeline components...:  33%|███▎      | 2/6 [00:00<00:00,  8.02it/s]
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    pipe = diffusers.StableDiffusionPipeline.from_single_file(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\huggingface_hub\utils\_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\loaders\single_file.py", line 495, in from_single_file
    loaded_sub_model = load_single_file_sub_model(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\loaders\single_file.py", line 113, in load_single_file_sub_model
    loaded_sub_model = create_diffusers_clip_model_from_ldm(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\loaders\single_file_utils.py", line 1571, in create_diffusers_clip_model_from_ldm
    unexpected_keys = load_model_dict_into_meta(model, diffusers_format_checkpoint, dtype=torch_dtype)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\models\model_loading_utils.py", line 230, in load_model_dict_into_meta
    raise ValueError(
ValueError: Cannot load  because text_model.encoder.layers.0.self_attn.q_proj.weight expected shape torch.Size([1024, 1024]), but got torch.Size([512, 1024]). If you want to instead overwrite randomly initialized weights, please make sure to pass both `low_cpu_mem_usage=False` and `ignore_mismatched_sizes=True`. For more information, see also: https://github.com/huggingface/diffusers/issues/1619#issuecomment-1345604389 as an example.

@hlky (Collaborator) commented Feb 18, 2025

if text_proj_key in checkpoint:
    text_proj_dim = int(checkpoint[text_proj_key].shape[0])
elif hasattr(text_model.config, "projection_dim"):
    text_proj_dim = text_model.config.projection_dim
else:
    text_proj_dim = LDM_OPEN_CLIP_TEXT_PROJECTION_DIM

text_model_dict[diffusers_key + ".q_proj.weight"] = weight_value[:text_proj_dim, :].clone().detach()
text_model_dict[diffusers_key + ".k_proj.weight"] = (
    weight_value[text_proj_dim : text_proj_dim * 2, :].clone().detach()
)
text_model_dict[diffusers_key + ".v_proj.weight"] = weight_value[text_proj_dim * 2 :, :].clone().detach()

We're getting text_proj_dim from either the text_projection key, config.projection_dim, or LDM_OPEN_CLIP_TEXT_PROJECTION_DIM (which is hard-coded at 1024), then using it to split qkv.

The issue is with both the text_projection key path and config.projection_dim.
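
For illustration, a shape-only toy of that split going wrong when the dimension comes from projection_dim=512 rather than the encoder's hidden size (values match the traceback above):

import torch

hidden_size = 1024  # SD2.X OpenCLIP text encoder width
in_proj_weight = torch.zeros(3 * hidden_size, hidden_size)  # stacked q/k/v

q_wrong = in_proj_weight[:512, :]    # split with the hub config's projection_dim
print(q_wrong.shape)                 # torch.Size([512, 1024]) -> the mismatch in the traceback

q_right = in_proj_weight[:1024, :]   # split with the actual hidden size
print(q_right.shape)                 # torch.Size([1024, 1024]) -> matches q_proj.weight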

config.projection_dim is 512 because it's used as the final output shape in CLIPTextModelWithProjection:

self.text_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False)

config.hidden_size is what CLIPAttention uses for the q_proj shape:

self.embed_dim = config.hidden_size
self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)

So for SD2.X the text_projection weight has shape [projection_dim, hidden_size]:

>>> torch.nn.Linear(1024, 512).state_dict()["weight"].shape
torch.Size([512, 1024])
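
Both shapes can be reproduced standalone from a transformers config with these values (an illustration, not code from diffusers):

from transformers import CLIPTextConfig, CLIPTextModelWithProjection

# hidden_size=1024 with projection_dim=512, as in the SD2.X hub config
# discussed above; other config values are left at their defaults.
config = CLIPTextConfig(hidden_size=1024, projection_dim=512)
model = CLIPTextModelWithProjection(config)

print(model.text_projection.weight.shape)
# torch.Size([512, 1024]) -> [projection_dim, hidden_size]

print(model.text_model.encoder.layers[0].self_attn.q_proj.weight.shape)
# torch.Size([1024, 1024]) -> [hidden_size, hidden_size]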

text_proj_dim = int(checkpoint[text_proj_key].shape[0])

This should be shape[1] (the hidden size), and the config.projection_dim path can use config.hidden_size instead.
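
A sketch of that suggestion applied to the snippet quoted above (names come from the quoted code; this is a proposal, not the merged change):

if text_proj_key in checkpoint:
    # Linear weights are [out_features, in_features], so shape[1] is
    # hidden_size -- the dimension the q/k/v split actually needs.
    text_proj_dim = int(checkpoint[text_proj_key].shape[1])
elif hasattr(text_model.config, "hidden_size"):
    text_proj_dim = text_model.config.hidden_size
else:
    text_proj_dim = LDM_OPEN_CLIP_TEXT_PROJECTION_DIM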

WDYT @DN6?

Teriks added a commit to Teriks/dgenerate that referenced this pull request Feb 18, 2025