F8I4 Grouped Gemm Optimization for Sparse M #3854
Open

jwfromm wants to merge 1 commit into pytorch:main from jwfromm:export-D71510967
+108
−46
Conversation
This pull request was exported from Phabricator. Differential Revision: D71510967
Summary: X-link: facebookresearch/FBGEMM#945 In cases where there are many groups but few have a non-zero number of routed tokens, we pay a high overhead. For example, if a single token is routed to one of 128 experts, the compute is the same as one token being routed to a single expert, yet the runtime is much higher, presumably due to kernel inefficiencies in looping over the empty groups. This diff changes how kernel arguments are set up so that the grouped GEMM runs over min(total_M, groups). This lets us skip the many groups where no compute is required and considerably improves performance in those cases. As an example of the effect of this diff, when total_M is 1 and there are 128 groups, latency is 3X lower thanks to this change. Reviewed By: jiawenliu64 Differential Revision: D71510967
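The idea above can be sketched on the host side: given per-group routed-token counts, only groups with a non-zero M produce a GEMM problem, so the kernel argument list has at most min(total_M, groups) entries. This is a minimal illustrative sketch, not FBGEMM's actual argument-setup code; the function and field names (`build_grouped_gemm_args`, `a_offset`) are hypothetical.

```python
def build_grouped_gemm_args(m_sizes, n, k):
    """Build per-problem arguments for a grouped GEMM, skipping empty groups.

    m_sizes[g] is the number of tokens routed to group (expert) g; every
    problem shares the same N and K. Hypothetical sketch of the technique
    described in the summary, not FBGEMM's real API.
    """
    args = []
    row_offset = 0  # row into the concatenated activation matrix A
    for g, m in enumerate(m_sizes):
        if m > 0:
            # Only groups with routed tokens become a GEMM problem.
            args.append({"group": g, "m": m, "n": n, "k": k,
                         "a_offset": row_offset})
        row_offset += m
    # With total_M tokens spread over len(m_sizes) groups, at most
    # min(total_M, groups) groups can be non-empty, so the grouped kernel
    # iterates over far fewer problems when routing is sparse.
    assert len(args) <= min(sum(m_sizes), len(m_sizes))
    return args
```

In the extreme case from the summary (total_M = 1, 128 groups), this produces a single problem instead of 128, which is where the latency win comes from.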
jwfromm force-pushed from 01ca3be to 7b99508
jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request on Mar 21, 2025