Muon Optimizer Integration #159

Open · 4 tasks
tscholak opened this issue Feb 24, 2025 · 1 comment
Labels: enhancement (New feature or request), need update

Comments

@tscholak (Collaborator)

tscholak commented Feb 24, 2025

🎯 Goal (What & Why)

Integrate the Muon optimizer into Fast-LLM to improve computational efficiency and downstream model performance.

Muon offers roughly 2x computational efficiency over AdamW and has demonstrated strong performance on math and code benchmarks; see https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf. The switch aims to reduce training FLOPs while maintaining or improving model quality.

Integrating Muon involves significant work, but the potential gains in computational efficiency and performance may justify the investment. The staged approach below, from PoC to full integration, ensures that we only execute in full if tangible benefits are demonstrated early.

🚀 Execution Plan

Step 1: Proof of Concept (PoC)

  • Approach: Implement the Muon optimizer in a minimal, throwaway fashion (see the sketch after this list).
  • Validation: Run a controlled experiment comparing Muon to AdamW on existing benchmarks.
  • Success Criteria: Demonstrate comparable or superior performance with reduced compute cost.
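
For reference, a minimal PyTorch sketch of the core Muon update (momentum followed by Newton-Schulz orthogonalization of each 2-D weight's update), based on the publicly described algorithm. All names are illustrative; nothing here is Fast-LLM API.

```python
# Minimal PoC sketch of Muon in plain PyTorch, following the publicly described
# algorithm (momentum followed by Newton-Schulz orthogonalization of each 2-D
# update). Names are illustrative; this is not Fast-LLM code.
import torch


def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2-D tensor with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)


class Muon(torch.optim.Optimizer):
    """PoC optimizer for 2-D weight matrices only; embeddings, norms, and the
    output head would stay on AdamW in a real run."""

    def __init__(self, params, lr=0.02, momentum=0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(p)
                buf = state["momentum_buffer"]
                buf.mul_(group["momentum"]).add_(p.grad)
                p.add_(newton_schulz(buf), alpha=-group["lr"])
```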

Step 2: Proper Integration

  • Refactoring:
    • Introduce an abstraction layer for optimizers in Fast-LLM.
    • Make optimizer choice configurable, allowing both Muon and AdamW.
    • Implement configuration-driven optimizer selection and setup (see the sketch after this list).
  • Testing:
    • Ensure that both Muon and AdamW can be used interchangeably.
    • Maintain existing test coverage and add new tests for Muon integration.
  • Interaction with ZeRO:
    • Evaluate the integration with ZeRO, especially for distributed training.
    • Implement necessary changes to handle the Muon optimizer in a distributed setup.
    • Test for stability and performance in multi-node configurations.
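
As a rough illustration of what configuration-driven selection could look like: the config class, field names, and import path below are hypothetical and do not reflect Fast-LLM's actual abstractions; `Muon` refers to a PoC class like the one sketched in Step 1.

```python
# Hypothetical sketch of configuration-driven optimizer selection; the config
# class, field names, and module path are illustrative only.
import dataclasses

import torch

from fast_llm.optim.muon import Muon  # hypothetical path to the PoC class from Step 1


@dataclasses.dataclass
class OptimizerConfig:
    type: str = "adamw"      # "adamw" or "muon"
    lr: float = 3e-4
    weight_decay: float = 0.1
    momentum: float = 0.95   # used by Muon only


def build_optimizer(model: torch.nn.Module, config: OptimizerConfig) -> torch.optim.Optimizer:
    if config.type == "adamw":
        return torch.optim.AdamW(model.parameters(), lr=config.lr, weight_decay=config.weight_decay)
    if config.type == "muon":
        # Muon only handles 2-D weight matrices; embeddings, norms, and the output
        # head would stay on AdamW, so a full integration likely wraps two optimizers.
        matrix_params = [p for p in model.parameters() if p.ndim == 2]
        return Muon(matrix_params, lr=config.lr, momentum=config.momentum)
    raise ValueError(f"Unknown optimizer type: {config.type}")
```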

Step 3: Long-Term Optimizations

  • Advanced Features:
    • Investigate Triton kernel development to optimize Muon further.
    • Explore advanced configurations and hyper-parameter tuning for Muon.
  • Documentation & Maintenance:
    • Write documentation for the new optimizer integration.
    • Ensure that future maintainers can understand and extend the integration.

📌 Acceptance Criteria (Must-Haves for Completion)

  • The Muon optimizer is functional and tested in both standalone and distributed modes.
  • The implementation includes documentation on optimizer selection and configuration.
  • The PR includes a performance/impact summary comparing Muon vs. AdamW on relevant benchmarks.
  • Refactoring is limited to what is necessary for integrating Muon and allowing optimizer configurability.

🛠️ Project Management

  • Assign the project to the Fast-LLM project.
  • Set the Estimate field (in days) in the GitHub project.
  • Use the Size field to categorize the PR size (Large).
  • Assign an owner when opening the issue.
tscholak added the enhancement (New feature or request) and need update labels on Feb 24, 2025
@toothacher17

@tscholak thanks for your attention!

We have an even simpler implementation of Muon than Keller's, as described in our paper: https://github.com/MoonshotAI/Moonlight/blob/master/examples/toy_train.py

It might be helpful for the minimal proof-of-concept implementation (no distributed support at all, and it adjusts Muon as described in our paper).
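
For reference, a rough sketch of that adjustment (rescaling the per-matrix learning rate so Muon's update RMS stays comparable to AdamW's); the 0.2 · sqrt(max(A, B)) factor and the names below are an assumption based on one reading of the paper, not a quote of the actual code.

```python
# Rough sketch of the RMS-matching adjustment: rescale the per-matrix learning
# rate so Muon's update RMS stays comparable to AdamW's. The scaling factor and
# function name are assumptions for illustration only.
import math

import torch


def adjust_lr_for_muon(lr: float, shape: torch.Size) -> float:
    # Larger matrices get a proportionally larger step so the element-wise
    # update RMS is roughly constant across layer shapes.
    A, B = shape[0], shape[1]
    return lr * 0.2 * math.sqrt(max(A, B))
```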
