Multi-GPU Reduction & MatMul #1
Labels: enhancement, good first issue, help wanted
The current kernels are designed for single-GPU execution. Let's scale them to multi-GPU systems, ideally using TMA and cooperative groups.
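As a possible starting point for the reduction half of this request, here is a minimal sketch, not the repository's actual kernels. It assumes a plain sum over `float` data split into equal contiguous chunks, one per visible device. Each device reduces its chunk with cooperative-groups warp tiles, and the host adds the per-device partial sums. The names `partial_sum` and `CUDA_CHECK` are hypothetical. TMA is deliberately left out, since it requires Hopper-class hardware, and a real version would probably combine the partials on-device (for example over peer-to-peer access or NCCL) rather than on the host.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical helper: abort with a message on any CUDA runtime error.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "%s failed: %s\n", #call,                \
                         cudaGetErrorString(err_));                       \
            std::exit(1);                                                 \
        }                                                                 \
    } while (0)

// Hypothetical kernel: sums `n` floats from `in` into the single float `*out`.
// Block size must be a multiple of 32 so every warp tile is full.
__global__ void partial_sum(const float* in, size_t n, float* out) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    // Grid-stride loop: each thread accumulates a private partial sum.
    float acc = 0.0f;
    size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    for (size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        acc += in[i];
    }

    // Warp-level reduction, then one atomic add per warp.
    acc = cg::reduce(warp, acc, cg::plus<float>());
    if (warp.thread_rank() == 0) atomicAdd(out, acc);
}

int main() {
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count == 0) {
        std::printf("no CUDA devices found\n");
        return 0;
    }

    const size_t n = size_t{1} << 22;
    std::vector<float> host(n, 1.0f);  // all ones, so the expected sum is n
    const size_t chunk = (n + device_count - 1) / device_count;

    std::vector<float*> d_in(device_count, nullptr), d_out(device_count, nullptr);
    std::vector<size_t> len(device_count, 0);

    // Launch one partial reduction per device; kernel launches are asynchronous,
    // so the devices work on their chunks concurrently.
    for (int d = 0; d < device_count; ++d) {
        size_t begin = static_cast<size_t>(d) * chunk;
        if (begin >= n) continue;
        len[d] = std::min(chunk, n - begin);
        CUDA_CHECK(cudaSetDevice(d));
        CUDA_CHECK(cudaMalloc(&d_in[d], len[d] * sizeof(float)));
        CUDA_CHECK(cudaMalloc(&d_out[d], sizeof(float)));
        CUDA_CHECK(cudaMemcpy(d_in[d], host.data() + begin,
                              len[d] * sizeof(float), cudaMemcpyHostToDevice));
        CUDA_CHECK(cudaMemset(d_out[d], 0, sizeof(float)));
        partial_sum<<<256, 256>>>(d_in[d], len[d], d_out[d]);
        CUDA_CHECK(cudaGetLastError());
    }

    // Collect the per-device partial sums and combine them on the host.
    double total = 0.0;
    for (int d = 0; d < device_count; ++d) {
        if (len[d] == 0) continue;
        float partial = 0.0f;
        CUDA_CHECK(cudaSetDevice(d));
        // Blocking copy on the default stream; waits for the kernel to finish.
        CUDA_CHECK(cudaMemcpy(&partial, d_out[d], sizeof(float),
                              cudaMemcpyDeviceToHost));
        total += partial;
        CUDA_CHECK(cudaFree(d_in[d]));
        CUDA_CHECK(cudaFree(d_out[d]));
    }

    std::printf("devices used: %d, sum = %.1f\n", device_count, total);
    return (total == static_cast<double>(n)) ? 0 : 1;
}
```

Build with something like `nvcc -O2 -o multi_gpu_sum multi_gpu_sum.cu` (the file name is arbitrary). The same chunk-per-device pattern could carry over to MatMul by splitting the output matrix into row blocks, though that case also has to decide how the shared input matrix is replicated or streamed across devices.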