Dispatch size exceeds WebGPU limits in Y or Z dimension #8373

Open

arcman7 opened this issue Sep 5, 2024 · 5 comments

arcman7 commented Sep 5, 2024

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow.js): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow.js installed from (npm or script link): npm
  • TensorFlow.js version (use command below): @tensorflow/tfjs-backend-webgpu: ^4.21.0
  • Browser version: Google Chrome 128.0.6613.114 (Official Build) (64-bit)
  • Tensorflow.js Converter Version: Not specified

Describe the current behavior

When attempting to perform matrix multiplication using tf.matMul(r, s) where tensor r has shape [1493284, 3, 3] and tensor s has shape [1493284, 3, 3], an error is thrown: "Dispatch size exceeds WebGPU limits in Y or Z dimension."

The error occurs in the reshapeDispatch function of the WebGPU backend when it tries to handle a dispatch shape of [1, 1, 1493284].

Describe the expected behavior

The matrix multiplication should be performed successfully without throwing an error related to dispatch size limits.

Standalone code to reproduce the issue

// Assumes the WebGPU backend is set up and active.
await tf.setBackend('webgpu');
const r = tf.zeros([1493284, 3, 3]);
const s = tf.zeros([1493284, 3, 3]);
const l = tf.matMul(r, s); // throws "Dispatch size exceeds WebGPU limits in Y or Z dimension."

Other info / logs
The error is triggered by this assertion in the WebGPU backend code:

tf.util.assert(
    dispatch[0] > MAX_COMPUTE_PER_DIMENSION_DISPATCH_SIZE &&
        layout.y === undefined && layout.z === undefined,
    function () { return 'Dispatch size exceeds WebGPU limits in Y or Z dimension.'; });

This occurs in the reshapeDispatch function, which is called to handle the dispatch shape [1, 1, 1493284] generated by the matrix multiplication operation.

The full implementation of the reshapeDispatch function can be found at:
https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-backend-webgpu/dist/tf-backend-webgpu.js (around line 1252)
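
For what it's worth, a minimal sketch (not the tfjs implementation) of how an oversized flat workgroup count could be folded into the X and Y dispatch dimensions instead of failing the assertion:

// Sketch only: fold a flat workgroup count into [x, y, z] so that no single
// dimension exceeds the device's maxComputeWorkgroupsPerDimension.
function spreadDispatch(
    totalWorkgroups: number, maxPerDim: number): [number, number, number] {
  if (totalWorkgroups <= maxPerDim) {
    return [totalWorkgroups, 1, 1];
  }
  const x = maxPerDim;
  const y = Math.ceil(totalWorkgroups / maxPerDim);
  if (y > maxPerDim) {
    throw new Error('Total dispatch size exceeds WebGPU limits.');
  }
  // The shader must then recover the flat index from (x, y) and discard the
  // overshoot, since x * y can be slightly larger than totalWorkgroups.
  return [x, y, 1];
}

// e.g. spreadDispatch(1493284, 65535) -> [65535, 23, 1]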

arcman7 added the type:bug label on Sep 5, 2024

arcman7 commented Sep 5, 2024

Note: This same matrix multiplication operation works fine using the webgl backend.

shmishra99 (Contributor) commented

Hi @arcman7 ,

I've successfully replicated the issue you're experiencing and am seeing the same error.

I'm currently investigating further and will update you with my findings as soon as possible.

Thank You!!

@shmishra99 shmishra99 self-assigned this Sep 6, 2024

arcman7 commented Sep 7, 2024

@shmishra99 I don't know whether this is useful to you, but in my own WebGPU helper code I use something like this:

interface AdapterLimits {
  maxComputeInvocationsPerWorkgroup: number;
  maxComputeWorkgroupSizeX: number;
  maxComputeWorkgroupSizeY: number;
  maxComputeWorkgroupSizeZ: number;
  maxComputeWorkgroupsPerDimension: number;
}

interface WorkgroupConfig {
  workgroupSize: [number, number, number];
  dispatchSize: [number, number, number];
}

export function calculateWorkgroups(totalThreads: number, limits: AdapterLimits): WorkgroupConfig {
  const maxSize = Math.min(
    limits.maxComputeInvocationsPerWorkgroup,
    totalThreads
  );

  let x = Math.floor(Math.sqrt(maxSize));
  let y = Math.floor(maxSize / x);
  let z = 1;

  // Adjust x and y if they exceed their respective limits
  if (x > limits.maxComputeWorkgroupSizeX) {
    x = limits.maxComputeWorkgroupSizeX;
    y = Math.floor(maxSize / x);
  }

  if (y > limits.maxComputeWorkgroupSizeY) {
    y = limits.maxComputeWorkgroupSizeY;
    x = Math.floor(maxSize / y);
  }

  // If x or y still exceed their limits, use the z dimension
  if (x > limits.maxComputeWorkgroupSizeX || y > limits.maxComputeWorkgroupSizeY) {
    x = Math.min(x, limits.maxComputeWorkgroupSizeX);
    y = Math.min(y, limits.maxComputeWorkgroupSizeY);
    z = Math.floor(maxSize / (x * y));
    z = Math.min(z, limits.maxComputeWorkgroupSizeZ);
  }

  // Ensure the total doesn't exceed maxComputeInvocationsPerWorkgroup
  const total = x * y * z;
  if (total > limits.maxComputeInvocationsPerWorkgroup) {
    const scale = Math.cbrt(limits.maxComputeInvocationsPerWorkgroup / total);
    x = Math.floor(x * scale);
    y = Math.floor(y * scale);
    z = Math.floor(z * scale);
  }

  // Calculate the number of workgroups to dispatch
  const dispatchX = Math.ceil(totalThreads / (x * y * z));
  const dispatchY = 1;
  const dispatchZ = 1;

  return {
    workgroupSize: [x, y, z],
    dispatchSize: [dispatchX, dispatchY, dispatchZ]
  };
}

But then you still have to handle computing flat global ids correctly wherever it applies.
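
For illustration, a usage sketch of this helper (assuming a WebGPU adapter/device is available; the thread count here is one thread per output element of the matMul from this issue):

// Usage sketch only (assumes WebGPU types are available and an adapter exists).
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice();

const limits: AdapterLimits = {
  maxComputeInvocationsPerWorkgroup: device.limits.maxComputeInvocationsPerWorkgroup,
  maxComputeWorkgroupSizeX: device.limits.maxComputeWorkgroupSizeX,
  maxComputeWorkgroupSizeY: device.limits.maxComputeWorkgroupSizeY,
  maxComputeWorkgroupSizeZ: device.limits.maxComputeWorkgroupSizeZ,
  maxComputeWorkgroupsPerDimension: device.limits.maxComputeWorkgroupsPerDimension,
};

// One thread per output element of the [1493284, 3, 3] matMul.
const { workgroupSize, dispatchSize } = calculateWorkgroups(1493284 * 3 * 3, limits);
console.log(workgroupSize, dispatchSize);

// In the shader, the flat global id is then reconstructed from the 1D dispatch
// and the (possibly multi-dimensional) workgroup, e.g. in WGSL:
//   let localFlat = local_invocation_id.x
//                 + local_invocation_id.y * wgX
//                 + local_invocation_id.z * wgX * wgY;
//   let flatId = workgroup_id.x * (wgX * wgY * wgZ) + localFlat;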

gaikwadrahul8 (Contributor) commented

Hi, @arcman7

I apologize for the delayed response. It seems the maximum value of maxComputeWorkgroupsPerDimension is 65535 (please refer to the official documentation); this limit is what gets assigned to MAX_COMPUTE_PER_DIMENSION_DISPATCH_SIZE. The assertion below checks whether the dispatch size in the X dimension exceeds that maximum and whether the layout is 1D or 2D (i.e. has no Y or Z dimensions); when this check does not hold, an error is thrown indicating that the dispatch size exceeds WebGPU limits.

tf.util.assert(
    dispatch[0] > MAX_COMPUTE_PER_DIMENSION_DISPATCH_SIZE &&
        layout.y === undefined && layout.z === undefined,
    function () { return 'Dispatch size exceeds WebGPU limits in Y or Z dimension.'; });
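
For reference, the actual per-dimension limit can be queried at runtime; 65535 is the WebGPU default, though an adapter may expose more (sketch):

// Sketch: query the per-dimension dispatch limit of the current adapter.
const adapter = await navigator.gpu.requestAdapter();
console.log(adapter?.limits.maxComputeWorkgroupsPerDimension); // typically 65535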

Case 01: When I tried values less than or equal to 65535, it works as expected. I've added the output log below for your reference.

[screenshot: output log, Case 01]

Case 02: When I tried values greater than 65535, it throws the same error message, "Dispatch size exceeds WebGPU limits in Y or Z dimension." I've added the output log below for your reference.

[screenshot: output log, Case 02]

It seems you'll have to use values less than or equal to 65535 due to this maximum limit. If I have missed something here, please let me know.

Thank you for your cooperation and patience.


arcman7 commented Sep 11, 2024

Hi @gaikwadrahul8, it is my understanding that maxComputeWorkgroupsPerDimension is the maximum for a single dimension. The code I posted above serves as an example of how to handle dispatching the much larger computations that WebGPU allows for. In the case where I have 1493284 rows of data to process, I would expect tfjs to internally batch the calls like so:

const N = 1493284;
dispatchWorkgroups(
  maxComputeWorkgroupsPerDimension,                 // x
  Math.ceil(N / maxComputeWorkgroupsPerDimension),  // y
  1                                                 // z
);

You could just as well first divide N by the workgroup size (256 is the default maximum) when N is greater than maxComputeWorkgroupsPerDimension.
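
For illustration, a sketch of that alternative with the numbers from this issue (assuming a 256-thread workgroup and the 65535 per-dimension default):

// Sketch: account for the workgroup size first, then spread the resulting
// workgroup count across x and (only if still necessary) y.
const N = 1493284;
const workgroupThreads = 256;   // assumed workgroup size
const maxPerDim = 65535;        // maxComputeWorkgroupsPerDimension default

const totalWorkgroups = Math.ceil(N / workgroupThreads);  // 5834
const x = Math.min(totalWorkgroups, maxPerDim);           // 5834
const y = Math.ceil(totalWorkgroups / maxPerDim);         // 1
// pass.dispatchWorkgroups(x, y, 1);  // fits in a single dimension here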
