Muon Optimizer Integration #159

Open · 4 tasks
tscholak opened this issue Feb 24, 2025 · 1 comment
Labels: enhancement (New feature or request), need update

Comments

@tscholak (Collaborator)

tscholak commented Feb 24, 2025

🎯 Goal (What & Why)

Integrate the Muon optimizer into Fast-LLM to improve computational efficiency and downstream model performance.

Muon offers roughly 2x computational efficiency over AdamW and has demonstrated strong performance on math and code benchmarks; see https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf. The switch aims to reduce training FLOPs while maintaining or improving model quality.

Integrating Muon involves significant work, but the potential gains in computational efficiency and performance may justify the investment. The staged approach below, from PoC to full integration, ensures that we only execute in full if tangible benefits are demonstrated early.

🚀 Execution Plan

Step 1: Proof of Concept (PoC)

  • Approach: Implement the Muon optimizer in a minimal, throwaway fashion (see the sketch after this list).
  • Validation: Run a controlled experiment comparing Muon to AdamW on existing benchmarks.
  • Success Criteria: Demonstrate comparable or superior performance with reduced compute cost.
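
For reference, a minimal PyTorch sketch of the core Muon update (momentum followed by Newton-Schulz orthogonalization of each 2-D weight's update), based on the publicly described algorithm. All names are illustrative; nothing here is Fast-LLM API.

```python
# Minimal PoC sketch of Muon in plain PyTorch, following the publicly described
# algorithm (momentum followed by Newton-Schulz orthogonalization of each 2-D
# update). Names are illustrative; this is not Fast-LLM code.
import torch


def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2-D tensor with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)


class Muon(torch.optim.Optimizer):
    """PoC optimizer for 2-D weight matrices only; embeddings, norms, and the
    output head would stay on AdamW in a real run."""

    def __init__(self, params, lr=0.02, momentum=0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(p)
                buf = state["momentum_buffer"]
                buf.mul_(group["momentum"]).add_(p.grad)
                p.add_(newton_schulz(buf), alpha=-group["lr"])
```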

Step 2: Proper Integration

  • Refactoring:
    • Introduce an abstraction layer for optimizers in Fast-LLM.
    • Make optimizer choice configurable, allowing both Muon and AdamW.
    • Implement configuration-driven optimizer selection and setup (see the sketch after this list).
  • Testing:
    • Ensure that both Muon and AdamW can be used interchangeably.
    • Maintain existing test coverage and add new tests for Muon integration.
  • Interaction with ZeRO:
    • Evaluate the integration with ZeRO, especially for distributed training.
    • Implement necessary changes to handle the Muon optimizer in a distributed setup.
    • Test for stability and performance in multi-node configurations.
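
As a rough illustration of what configuration-driven selection could look like: the config class, field names, and import path below are hypothetical and do not reflect Fast-LLM's actual abstractions; `Muon` refers to a PoC class like the one sketched in Step 1.

```python
# Hypothetical sketch of configuration-driven optimizer selection; the config
# class, field names, and module path are illustrative only.
import dataclasses

import torch

from fast_llm.optim.muon import Muon  # hypothetical path to the PoC class from Step 1


@dataclasses.dataclass
class OptimizerConfig:
    type: str = "adamw"      # "adamw" or "muon"
    lr: float = 3e-4
    weight_decay: float = 0.1
    momentum: float = 0.95   # used by Muon only


def build_optimizer(model: torch.nn.Module, config: OptimizerConfig) -> torch.optim.Optimizer:
    if config.type == "adamw":
        return torch.optim.AdamW(model.parameters(), lr=config.lr, weight_decay=config.weight_decay)
    if config.type == "muon":
        # Muon only handles 2-D weight matrices; embeddings, norms, and the output
        # head would stay on AdamW, so a full integration likely wraps two optimizers.
        matrix_params = [p for p in model.parameters() if p.ndim == 2]
        return Muon(matrix_params, lr=config.lr, momentum=config.momentum)
    raise ValueError(f"Unknown optimizer type: {config.type}")
```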

Step 3: Long-Term Optimizations

  • Advanced Features:
    • Investigate Triton kernel development to optimize Muon further.
    • Explore advanced configurations and hyper-parameter tuning for Muon.
  • Documentation & Maintenance:
    • Write documentation for the new optimizer integration.
    • Ensure that future maintainers can understand and extend the integration.

📌 Acceptance Criteria (Must-Haves for Completion)

  • The Muon optimizer is functional and tested in both standalone and distributed modes.
  • The implementation includes documentation on optimizer selection and configuration.
  • The PR includes a performance/impact summary comparing Muon vs. AdamW on relevant benchmarks.
  • Refactoring is limited to what is necessary for integrating Muon and allowing optimizer configurability.

🛠️ Project Management

  • Assign the project to the Fast-LLM project.
  • Set the Estimate field (in days) in the GitHub project.
  • Use the Size field to categorize the PR size (Large).
  • Assign an owner when opening the issue.
tscholak added the enhancement (New feature or request) and need update labels on Feb 24, 2025
@toothacher17

@tscholak thanks for your attention!

We have an even simpler implementation of Muon than Keller's, as described in our paper: https://github.com/MoonshotAI/Moonlight/blob/master/examples/toy_train.py

It might be helpful for the minimal proof-of-concept implementation (no distributed support at all, and it adjusts Muon as described in our paper).
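
For reference, a rough sketch of that adjustment (rescaling the per-matrix learning rate so Muon's update RMS stays comparable to AdamW's); the 0.2 · sqrt(max(A, B)) factor and the names below are an assumption based on one reading of the paper, not a quote of the actual code.

```python
# Rough sketch of the RMS-matching adjustment: rescale the per-matrix learning
# rate so Muon's update RMS stays comparable to AdamW's. The scaling factor and
# function name are assumptions for illustration only.
import math

import torch


def adjust_lr_for_muon(lr: float, shape: torch.Size) -> float:
    # Larger matrices get a proportionally larger step so the element-wise
    # update RMS is roughly constant across layer shapes.
    A, B = shape[0], shape[1]
    return lr * 0.2 * math.sqrt(max(A, B))
```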
