redesign for faster cpu/gpu synch #1869
So far it looks great. I read about halfway through, but I haven't read the important bits yet: the changes in `event.h`, `transforms.cpp`, and `fence.h`. The API changes look really great IMHO and unify the CPU and GPU primitive implementations.
```cpp
// Remove the output if it was donated to by an input
if (auto it = buffers.find(arr.data_shared_ptr()); it != buffers.end()) {
  buffers.erase(it);
}
```
Shouldn't this also be done for the siblings? Otherwise we guarantee they won't be donated later on. Something like the following in the loop ought to do it:

```cpp
if (buffers.find(s.data_shared_ptr()) == buffers.end()) {
  buffers.insert(s.data_shared_ptr());
}
```
Unfortunately it's not so simple. You'll notice that right now siblings are also never donated in the Metal task submission. The problem is that siblings can have 0 references after being detached (because they may not have parents in the graph), and in that case it's a bug not to hold their buffers. The optimization I did a while ago that was a bit buggy (#1858) also fixed this issue.

I actually have a diff on top of this branch that tries to resolve this by doing something similar to #1858, which is easier now because the task submission is on the main thread, so we don't have to worry about race conditions when checking the use-count. But there is still the matter of handling `async_eval` correctly (references to user-held arrays can be deleted before we are done with the computation).
```cpp
  return;
}
scheduler::enqueue(stream_, [arrays = std::move(arrays)]() {});
```
Not a problem necessarily, but `add_temporaries` on the CPU `CommandEncoder` has to be called after `dispatch`, while on the GPU it can be called before. One solution would be for the `CommandEncoder` to collect temporaries and then have `cpu::eval` dispatch the task that keeps them in memory, possibly together with the rest of the buffers.

This would require `cpu::eval` to know about the `CommandEncoder`, but that is already the case on the Metal side, so I don't think it's a problem. It would allow us to use `add_temporary` where it makes intuitive sense and reduce the number of tasks in the queue.
I think it's a very good suggestion, as it's really not good that `add_temporary` has to be called at just the right spot; otherwise it could fail silently.
This is ready for review. I left some notes at the top on the important changes; I think that should guide the files that need special attention.

I'll share some benchmarks shortly.
Looks great! I still need to go through the fence and event parts in detail, will do that soon!
```cpp
wt_stride_O = wt.strides()[0],
wt_stride_H = wt.strides()[1],
wt_stride_W = wt.strides()[2],
wt_stride_C = wt.strides()[3],

out_stride_N = out.strides()[0],
out_stride_H = out.strides()[1],
out_stride_W = out.strides()[2],
out_stride_O = out.strides()[3],

padding,
wt_strides,
```
It looks like we end up making two copies of the `wt_strides` here.

These are the slower CPU conv kernels, so it's not an urgent change though.
It's not two copies; the naming here is a bit unfortunate:

- `wt_strides` is the strides for applying the kernel.
- `wt.strides()` and `wt_stride_*` are the strides of the underlying array.

This is as it was, though it is due for a bit of a clean-up.
Fast synchronization works well with the redesign. On an M1 Max, synch latencies are 10x lower, which matches doing custom synchronization within the primitive:

```
All Reduce: time per iteration 0.232273 (ms)
All Reduce: time per iteration 0.019380 (ms)
```
The GPU benchmarks are not noticeably different. Generation speed is unchanged:

```
Pre: Generation: 512 tokens, 129.661 tokens-per-sec
```

Transformer training speed / memory use is unchanged.
I added a limit on the number of outstanding tasks in the CPU task encoder (set to 100; it could possibly go lower with minimal perf loss) to avoid memory blow-up. Basically it keeps the eval from running too far ahead of the actual work while still allowing some pipelining. For example, for the following:

Instead of using 16GB it uses 1.6GB with the limit. It's pretty similar now to the way we limit the number of outstanding command buffers.
Notes on important changes:

- `eval_cpu` implementations are now logically split in two parts and look a lot like how `eval_gpu` looks.
- `add_input_array`, `add_output_array`, `add_temporary`
- `eval_cpu` and `eval_gpu` are now called on the main thread. It didn't make much sense from a perf standpoint to have sub-threads for this, and it was overly complicated from a synchronization standpoint (as one would need to synchronize for both setup and completion). Furthermore, this really simplifies things like thread safety of the eval moving forward. I think it's overall a nice change. In fact, there is now some room to pipeline CPU setup with CPU work, so there is potential for speedups there.
- `fence` is for synchronization within an eval across streams. Use `event` for synchronization outside the eval loop or across multiple eval loops (e.g. when using `async_eval`).
- (`array::move_shared_buffer` API). This was a bit tedious, so instead the eval driver handles it by not retaining input buffers that got donated to their respective outputs.

CPU-only benchmarks: