Fix SD2.X clip single file load projection_dim #10770

Open · Teriks wants to merge 1 commit into main
Conversation

@Teriks (Contributor) commented Feb 11, 2025

Infer projection_dim from the checkpoint before loading from pretrained, overriding any incorrect hub config.

The hub configuration for SD2.X specifies projection_dim=512, which is incorrect for SD2.X checkpoints downloaded from CivitAI and similar sites.

Previously, an exception was thrown when attempting load_model_dict_into_meta for SD2.X single-file checkpoints.

Such LDM models usually require projection_dim=1024 for the CLIP text encoder.
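
A minimal sketch of that inference (the checkpoint key and helper name here are illustrative assumptions, not the exact patch): read the encoder width out of a checkpoint tensor and use it to override the hub config's projection_dim before any weights are loaded.

def infer_projection_dim(checkpoint, fallback=1024):
    # SD2.X OpenCLIP checkpoints stack q/k/v into one in_proj_weight of
    # shape [3 * hidden, hidden]; the key name assumes the usual LDM layout.
    key = "cond_stage_model.model.transformer.resblocks.0.attn.in_proj_weight"
    if key in checkpoint:
        return int(checkpoint[key].shape[0]) // 3
    return fallback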

Who can review?

@sayakpaul @yiyixuxu @DN6

@DN6 (Collaborator) commented Feb 14, 2025

@Teriks could you share an example I can use to reproduce the error? Along with a link to the checkpoint you're trying to use?

@Teriks (Contributor, Author) commented Feb 14, 2025

@DN6

Model page: https://civitai.com/models/2711/21-sd-modern-buildings-style-md

Checkpoint: https://civitai.com/api/download/models/3002?type=Model&format=PickleTensor&size=full&fp=fp16

Original Config: https://civitai.com/api/download/models/3002?type=Config&format=Other

Here is a script that reproduces the error, using the checkpoint above.

This exception happens with any LDM checkpoint hosted on CivitAI under the SD2.0 and SD2.1 categories.

Some models probably need additional config to make them work; the fix I am applying just makes most of them function out of the box.

import diffusers

# Model page: https://civitai.com/models/2711/21-sd-modern-buildings-style-md
# The ckpt and YAML config are from that page:
# https://civitai.com/api/download/models/3002?type=Model&format=PickleTensor&size=full&fp=fp16
# https://civitai.com/api/download/models/3002?type=Config&format=Other

# This will fail with an exception.
pipe = diffusers.StableDiffusionPipeline.from_single_file(
    '21SDModernBuildings_midjourneyBuildings.ckpt',
    original_config='21SDModernBuildings_midjourneyBuildings.yaml')

This fails with the following exception because projection_dim for the text_encoder is wrong in the hub config (taken from SD2.1) for this model:

Fetching 10 files: 100%|██████████| 10/10 [00:00<?, ?it/s]
Loading pipeline components...:  33%|███▎      | 2/6 [00:00<00:00,  8.02it/s]
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    pipe = diffusers.StableDiffusionPipeline.from_single_file(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\huggingface_hub\utils\_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\loaders\single_file.py", line 495, in from_single_file
    loaded_sub_model = load_single_file_sub_model(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\loaders\single_file.py", line 113, in load_single_file_sub_model
    loaded_sub_model = create_diffusers_clip_model_from_ldm(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\loaders\single_file_utils.py", line 1571, in create_diffusers_clip_model_from_ldm
    unexpected_keys = load_model_dict_into_meta(model, diffusers_format_checkpoint, dtype=torch_dtype)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\models\model_loading_utils.py", line 230, in load_model_dict_into_meta
    raise ValueError(
ValueError: Cannot load  because text_model.encoder.layers.0.self_attn.q_proj.weight expected shape torch.Size([1024, 1024]), but got torch.Size([512, 1024]). If you want to instead overwrite randomly initialized weights, please make sure to pass both `low_cpu_mem_usage=False` and `ignore_mismatched_sizes=True`. For more information, see also: https://github.com/huggingface/diffusers/issues/1619#issuecomment-1345604389 as an example.

@hlky (Collaborator) commented Feb 18, 2025

if text_proj_key in checkpoint:
    text_proj_dim = int(checkpoint[text_proj_key].shape[0])
elif hasattr(text_model.config, "projection_dim"):
    text_proj_dim = text_model.config.projection_dim
else:
    text_proj_dim = LDM_OPEN_CLIP_TEXT_PROJECTION_DIM

text_model_dict[diffusers_key + ".q_proj.weight"] = weight_value[:text_proj_dim, :].clone().detach()
text_model_dict[diffusers_key + ".k_proj.weight"] = (
    weight_value[text_proj_dim : text_proj_dim * 2, :].clone().detach()
)
text_model_dict[diffusers_key + ".v_proj.weight"] = weight_value[text_proj_dim * 2 :, :].clone().detach()

We're getting text_proj_dim from either the text_projection key, config.projection_dim, or LDM_OPEN_CLIP_TEXT_PROJECTION_DIM (which is hard-coded at 1024), then using it to split qkv.

The issue is with both the text_projection key path and config.projection_dim.
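
For illustration, a shape-only toy of that split going wrong when the dimension comes from projection_dim=512 rather than the encoder's hidden size (values match the traceback above):

import torch

hidden_size = 1024  # SD2.X OpenCLIP text encoder width
in_proj_weight = torch.zeros(3 * hidden_size, hidden_size)  # stacked q/k/v

q_wrong = in_proj_weight[:512, :]    # split with the hub config's projection_dim
print(q_wrong.shape)                 # torch.Size([512, 1024]) -> the mismatch in the traceback

q_right = in_proj_weight[:1024, :]   # split with the actual hidden size
print(q_right.shape)                 # torch.Size([1024, 1024]) -> matches q_proj.weight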

config.projection_dim is 512 because it's used as the final output shape in CLIPTextModelWithProjection:

self.text_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False)

config.hidden_size is what CLIPAttention uses for the q_proj shape:

self.embed_dim = config.hidden_size
self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)

So for SD2.X the text_projection weight has shape [projection_dim, hidden_size]:

>>> torch.nn.Linear(1024, 512).state_dict()["weight"].shape
torch.Size([512, 1024])
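
Both shapes can be reproduced standalone from a transformers config with these values (an illustration, not code from diffusers):

from transformers import CLIPTextConfig, CLIPTextModelWithProjection

# hidden_size=1024 with projection_dim=512, as in the SD2.X hub config
# discussed above; other config values are left at their defaults.
config = CLIPTextConfig(hidden_size=1024, projection_dim=512)
model = CLIPTextModelWithProjection(config)

print(model.text_projection.weight.shape)
# torch.Size([512, 1024]) -> [projection_dim, hidden_size]

print(model.text_model.encoder.layers[0].self_attn.q_proj.weight.shape)
# torch.Size([1024, 1024]) -> [hidden_size, hidden_size]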

text_proj_dim = int(checkpoint[text_proj_key].shape[0])

This should be shape[1] (the hidden size), and the config.projection_dim path can use config.hidden_size instead.
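
A sketch of that suggestion applied to the snippet quoted above (names come from the quoted code; this is a proposal, not the merged change):

if text_proj_key in checkpoint:
    # Linear weights are [out_features, in_features], so shape[1] is
    # hidden_size -- the dimension the q/k/v split actually needs.
    text_proj_dim = int(checkpoint[text_proj_key].shape[1])
elif hasattr(text_model.config, "hidden_size"):
    text_proj_dim = text_model.config.hidden_size
else:
    text_proj_dim = LDM_OPEN_CLIP_TEXT_PROJECTION_DIM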

WDYT @DN6?

Teriks added a commit to Teriks/dgenerate that referenced this pull request Feb 18, 2025