feat: experimental sharding backend #1544

Open

polvalente wants to merge 30 commits into main from pv-feat/experimental-sharding-backend

Conversation

polvalente (Contributor)

Adds a proof-of-concept implementation for sharding as an Nx meta-compiler.

In the current proposal, we shard inputs according to an arbitrary slicing configuration, and the compiler then does its best to propagate those slices through to the output. The compiler can then build a separate {args, function, reducer} tuple for each of the output shards, where:

  • args: the input arguments sliced down to the data sections required for that specific output shard
  • function: a new compilation of the input function, specialized to the sliced arguments
  • reducer: a function responsible for inserting the result shard into the correct place in an output accumulator tensor (a minimal sketch of this follows the list)
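
For intuition, a reducer for a single output shard could look something like the sketch below. The shapes, the start indices, and the accumulator name are made up for illustration and are not the PR's actual implementation.

# Hypothetical sketch: the accumulator is just a tensor with the full output
# shape, and the reducer writes one shard's result into it at the offsets
# corresponding to that shard's slice of the output.
output_acc = Nx.broadcast(0.0, {2, 2})

reducer = fn shard_result, acc ->
  # e.g. this particular shard covers row 0, column 1 of a {2, 2} output;
  # the [0, 1] start indices are invented for this example
  Nx.put_slice(acc, [0, 1], shard_result)
end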

The following example shows how the example function can be split into 4 separate shards: arg1 contributes 2 shards, and only the first axis of arg0 can be sharded (also into 2 shards), because its other two axes are connected to contracting axes of the dot product. This gives 2 × 2 = 4 output shards.

arg0_sharding = %{} # inputs are taken to be fully sharded if no specification is given
arg1_sharding = %{4 => [0..0, 1..1]} # axis 4 is split into two single-element shards

Nx.default_backend(Nx.BinaryBackend)

fun = fn l, r ->
  x = Nx.add(l, Nx.tensor([[1]]))
  x = Nx.transpose(x, axes: [0, 2, 1])
  y = Nx.subtract(r, 1)
  y = Nx.squeeze(y, axes: [0, 1])
  Nx.dot(x, [2, 1], y, [1, 0])
end

# fun = &Nx.dot(&1, [1, 2], &2, [1, 0])
# fun = &Nx.add(&1, &2)

inputs = [
  Nx.iota({2, 2, 3}, type: :f32),
  Nx.add(Nx.iota({1, 1, 3, 2, 2}), 10)
]

# `output_holder` is the accumulator tensor with the full output shape;
# `shards` is a list of {args, fun, caster} tuples, one per output shard
{output_holder, shards} =
  Nx.Defn.jit_apply(
    fun,
    inputs,
    compiler: Nx.Defn.ShardingCompiler,
    sharding_config: [arg0_sharding, arg1_sharding],
    sharding_compiler: Nx.Defn.Evaluator,
    sharding_compiler_options: []
  )

sharded_result =
  shards
  |> Task.async_stream(fn {arg, fun, caster} ->
    # dbg/1 prints the worker pid, showing that each shard runs in its own process
    dbg(self())
    {fun.(arg), caster}
  end)
  |> Enum.reduce(output_holder, fn {:ok, {result, caster}}, acc ->
    caster.(result, acc)
  end)
  |> IO.inspect()

# Ensure that the sharded result is the same as the result for the function applied to the unsharded inputs
IO.inspect(Nx.equal(sharded_result, apply(fun, inputs)) |> Nx.all() |> Nx.to_number() |> Kernel.==(1))
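
As a side note, if the fully-sharded default for arg0 were spelled out by hand in the same configuration format used for arg1_sharding, it should presumably look like the snippet below; this is an assumption about the config format, not something the example above relies on.

# Hypothetical explicit form of the empty arg0 config: axis 0 (size 2) is
# split into two single-element shards, while axes 1 and 2 are left
# unsharded because they feed the contracting axes of the dot product.
arg0_sharding_explicit = %{0 => [0..0, 1..1]}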

polvalente force-pushed the pv-feat/experimental-sharding-backend branch from e67a399 to 68565bc on October 11, 2024
polvalente self-assigned this on October 29, 2024