
selfattention block: Remove the fc linear layer if it is not used #8325

Open · wants to merge 4 commits into base: dev

Conversation

johnzielke
Contributor

Description

When include_fc=False, the nn.Linear layer is unused. This leads to errors and warnings when training with the PyTorch Distributed Data Parallel (DDP) infrastructure, since the parameters of the nn.Linear layer will not have gradients attached.
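
A minimal sketch of the idea behind the change (the class name and attributes below are illustrative, not the exact MONAI code): only instantiate the projection when it is actually used, so DDP does not see parameters that never receive gradients.

```python
import torch.nn as nn


class SABlockSketch(nn.Module):
    """Illustrative only: create the fc projection only when it is used."""

    def __init__(self, hidden_size: int, include_fc: bool = True) -> None:
        super().__init__()
        self.include_fc = include_fc
        # Before the fix the nn.Linear was always created; with include_fc=False its
        # parameters never receive gradients, which DDP reports as unused parameters.
        # nn.Identity has no parameters, so nothing is left dangling.
        self.out_proj = nn.Linear(hidden_size, hidden_size) if include_fc else nn.Identity()

    def forward(self, x):
        # ... attention computation elided ...
        return self.out_proj(x)
```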

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

@johnzielke force-pushed the bugfix/attention-remove-unused-parameters branch 3 times, most recently from 1bb3ce5 to 892edc6 on February 4, 2025 15:09
@johnzielke force-pushed the bugfix/attention-remove-unused-parameters branch from 892edc6 to 547ac94 on February 4, 2025 15:10
@ericspod
Member

ericspod commented Feb 5, 2025

Thanks for the contribution! In itself I think it's fine, however we have to check that this won't break old weights. We have the load_old_state_dict method for doing this with DiffusionModelUNet, which might not work if out_proj doesn't have weight or bias components. There are other load_old_state_dict methods doing this for other networks that should also be looked at.

We still want to maintain backwards compatibility with old stored weights, at least for now, but we should discuss when to deprecate these methods. CC @virginiafdez

@johnzielke
Contributor Author

If that's the only concern, I could update that method to ignore that key in the appropriate cases.
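
A hypothetical sketch of what "ignoring that key" could look like when mapping an old checkpoint onto a network built with include_fc=False; the helper name and key layout below are assumptions for illustration, not MONAI's actual load_old_state_dict.

```python
import torch
import torch.nn as nn


def filter_old_state_dict(old_state_dict: dict, model: nn.Module) -> dict:
    """Drop old checkpoint entries (e.g. '...out_proj.weight') that no longer
    exist in the new model because the fc layer was not created."""
    new_keys = set(model.state_dict().keys())
    return {k: v for k, v in old_state_dict.items() if k in new_keys}


# usage sketch:
# old_sd = torch.load("old_weights.pt")
# model.load_state_dict(filter_old_state_dict(old_sd, model), strict=False)
```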

@ericspod
Member

> If that's the only concern, I could update that method to ignore that key in the appropriate cases.

Please do have a look at how old state is loaded and see if there are any issues; otherwise yes, we should be good here. I've updated your branch after we've done a lot of test refactoring; we should perhaps also include a test that checks whether the network does or does not have the fc layer when appropriate.
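
A sketch of the kind of test suggested here; the SABlock import path, constructor arguments, and the out_proj attribute name are assumptions based on this discussion rather than verified API.

```python
import unittest

import torch.nn as nn

from monai.networks.blocks.selfattention import SABlock  # assumed import path


class TestSABlockIncludeFC(unittest.TestCase):
    def test_fc_layer_present(self):
        block = SABlock(hidden_size=64, num_heads=4, include_fc=True)
        self.assertIsInstance(block.out_proj, nn.Linear)

    def test_fc_layer_absent(self):
        block = SABlock(hidden_size=64, num_heads=4, include_fc=False)
        # with the fix, no unused nn.Linear parameters should remain
        self.assertNotIsInstance(block.out_proj, nn.Linear)


if __name__ == "__main__":
    unittest.main()
```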

@johnzielke
Contributor Author

I pushed the discussed changes. I wanted to test load_old_state_dict as well, but it seems there is no test yet that covers loading without cross-attention. I did not dive all the way into where the old state dicts are stored, etc. Is there an easy-to-use old state dict I could use in the test_compatibility_with_monai_generative() test?
