Work around crashes and incorrect results in scan-based algorithms when compiling with -O0 #1997
Conversation
I have never seen examples of this usage in our code before.
I don't have a strong preference between the way I implemented it and directly adding try ... catch throughout this header. I initially did this to avoid having many try/catch statements here. Let me leave this open first for others' opinions, to see if they prefer a more functional approach or just directly adding try...catch throughout.
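For readers following along, here is a minimal sketch of what such a functional approach could look like; `__invoke_with_fallback` and the error-code check are illustrative assumptions, not the actual names or logic in this PR:

```cpp
#include <sycl/sycl.hpp>

// Hypothetical helper: wraps the primary call in the one try/catch so
// call sites in the header don't repeat exception-handling blocks.
template <typename _MainFn, typename _FallbackFn>
auto
__invoke_with_fallback(_MainFn __main, _FallbackFn __fallback)
{
    try
    {
        return __main();
    }
    catch (const sycl::exception& __e)
    {
        // Assumed error-code filter: only the expected JIT failure is
        // handled here; everything else propagates to the user.
        if (__e.code() != sycl::errc::kernel_not_supported)
            throw;
        return __fallback();
    }
}
```

Note that both callables must return the same type for the deduced return type to work, which is exactly the return-type alignment issue discussed later in this thread.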
```cpp
using _ScanKernel = oneapi::dpl::__par_backend_hetero::__internal::__kernel_name_generator<
    __reduce_then_scan_scan_kernel, _CustomName, _InRng, _OutRng, _GenScanInput, _ReduceOp, _ScanInputTransform,
    _WriteOp, _InitType, _Inclusive, _IsUniquePattern>;
static auto __kernels = __internal::__kernel_compiler<_ReduceKernel, _ScanKernel>::__compile(__exec);
```
I wanted to point this unique case out and ensure there are no issues with this approach. Benchmarking the single work-group sizes, I was seeing some overheads after switching from the kernel provider to the kernel compiler. Some of the overheads seem related to the kernel bundle, which is unavoidable. However, `__kernel_compiler` creates a `std::vector` to call `sycl::get_kernel_bundle`. This allocation/deallocation overhead on each call was leading to measurable slowdowns with small input sizes. To fix this, I have made the variable `static`, as `__kernels` should be the same for each call. This is beneficial assuming the application makes multiple calls to the scan-based algorithm, which I assume is the most common case.
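As a minimal sketch of the pattern described above (illustrative names, not the actual `__kernel_compiler` internals):

```cpp
#include <sycl/sycl.hpp>

// The bundle is built on the first call per template instantiation and
// reused afterwards, avoiding the per-call vector allocation and
// sycl::get_kernel_bundle overhead.
template <typename _KernelName>
const sycl::kernel_bundle<sycl::bundle_state::executable>&
__get_compiled_bundle(const sycl::queue& __q)
{
    static auto __bundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(
        __q.get_context(), {sycl::get_kernel_id<_KernelName>()});
    return __bundle;
}
```

Usage then pays the compilation cost only on the first call per instantiation, which is the trade-off being discussed in this thread.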
> I wanted to point this unique case out and ensure there are no issues with this approach.

I am thinking about possible issues with uniqueness of the kernels in some corner cases. Is it possible to end up with different kernels from the same instantiation of `__parallel_transform_reduce_then_scan`? I can imagine, for example:

```cpp
// no kernel specified: unnamed lambda case
sycl::queue queue_a{selector_vendor_a{}};
sycl::queue queue_b{selector_vendor_b{}};
// policy_a and policy_b have the same type
dpl::execution::device_policy policy_a(queue_a);
dpl::execution::device_policy policy_b(queue_b);
// ... containers and predicates have the same types
dpl::copy_if(policy_a, ...);
dpl::copy_if(policy_b, ...); // will it use the kernels compiled for queue_a (and thus device "a")?
```

Such cases are highly unlikely, and I cannot think of any others. I would rather document this as a known limitation with a workaround (name the kernel) than compromise performance (if the overhead is large enough).
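For reference, the suggested workaround could look like the following sketch (the kernel names are hypothetical). Naming the kernels gives the two policies distinct types, so each algorithm instantiation, and any `static` state inside it, is separate:

```cpp
// distinct kernel names => distinct policy types => distinct instantiations
dpl::execution::device_policy<class vendor_a_kernels> policy_a(queue_a);
dpl::execution::device_policy<class vendor_b_kernels> policy_b(queue_b);
dpl::copy_if(policy_a, ...); // compiles and caches kernels for device "a"
dpl::copy_if(policy_b, ...); // separate instantiation, compiled for device "b"
```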
Thanks, I agree the risk here is low, and I believe this change is worth it if we care about the small-to-medium input size cases. Will take a further look as well and see if we need to add a known limitation.
We will need to have a proper fix.

Generally, caching kernels is not our business; it is the SYCL runtime's. Caching memory is not our business; it is a memory manager's. If we do caching ourselves, we need to check whether the cached value is valid for the call - for example, whether it was compiled for the device on which the algorithm is called. Documenting it as a limitation is not a long-term substitute for a proper, reliable implementation.
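A sketch of what such validation could look like, extending the earlier caching pattern (illustrative; a production version would need a thread-safe cache keyed by device and context rather than function-local statics):

```cpp
template <typename _KernelName>
sycl::kernel_bundle<sycl::bundle_state::executable>
__get_compiled_bundle_checked(const sycl::queue& __q)
{
    static sycl::device __cached_device = __q.get_device();
    static auto __bundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(
        __q.get_context(), {sycl::get_kernel_id<_KernelName>()});
    if (__q.get_device() != __cached_device)
    {
        // Cached value is not valid for this call: recompile for the new device.
        __cached_device = __q.get_device();
        __bundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(
            __q.get_context(), {sycl::get_kernel_id<_KernelName>()});
    }
    return __bundle;
}
```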
Thank you for the response. Since this does not seem like a good option to keep, it might be worth incurring the performance hit and removing the `static` usage. The performance impact I've measured with this caching approach is around a constant 3-4 us for small inputs after the first call. When the additional overhead is incurred on each call, it adds about another 4-5 us, for a total hit of 8-9 us on an Intel Data Center GPU Max 1550 with `inclusive_scan`.

Once we scale our inputs past cache sizes, the impact of this change is minimized, with an even lower threshold if we are loading data from global memory in the cold-cache case. If this performance impact seems acceptable, then I can open a PR to make this single-line change.
The overarching issue here is that the SYCL specification does not specify recovery behavior after handling this sort of exception, which has led to the need for this workaround using kernel bundles. I've been told that this has been discussed at the Khronos working group before, but there was no consensus.
I agree with the approach, but I need to review the PR more thoroughly. Indeed, according to the SYCL 2020 spec, launching a kernel decorated with the `reqd_sub_group_size` attribute may throw an exception depending on what is inside that kernel...

I have a question regarding this part:

> Secondly, after discussion with compiler developers, kernel compilation must be separated from execution of the kernel to prevent corruption of the underlying `sycl::queue`

Is this a limitation/bug of the DPC++ SYCL implementation?
```diff
@@ -1099,7 +1099,7 @@ struct __write_to_id_if_else
 template <typename _ExecutionPolicy, typename _Range1, typename _Range2, typename _UnaryOperation, typename _InitType,
           typename _BinaryOperation, typename _Inclusive>
 auto
-__parallel_transform_scan(oneapi::dpl::__internal::__device_backend_tag __backend_tag, _ExecutionPolicy&& __exec,
+__parallel_transform_scan(oneapi::dpl::__internal::__device_backend_tag __backend_tag, const _ExecutionPolicy& __exec,
```
Let's keep the original variant with a forwarding reference, which is aligned with other patterns, unless the change is necessary.
There's a tricky scenario with aligning the future return type from the different potential paths when we use the forwarding reference. I will take another look and see if it can be implemented with the forwarding reference.
To clarify this issue: there are some instances where forwarding the execution policy in one path and not forwarding it in the other could result in differing future return types, since the `_ExecutionPolicy` parameter of the `__result_and_scratch_storage` class template, which is bundled in the returned future, may or may not be a reference to `device_policy`. Making this a const l-value reference aligns the return types.
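A toy reduction of the mismatch (not oneDPL code): with a forwarding reference, the deduced policy type differs between the lvalue and rvalue call paths, so the storage type bundled into the future differs too.

```cpp
#include <type_traits>

struct Policy {};
template <typename P> struct Storage {}; // stand-in for __result_and_scratch_storage

template <typename P>
auto make_future(P&&) { return Storage<P>{}; }

Policy p;
// lvalue path deduces P = Policy&, rvalue path deduces P = Policy:
// two distinct Storage instantiations, hence two distinct return types.
static_assert(std::is_same_v<decltype(make_future(p)), Storage<Policy&>>);
static_assert(std::is_same_v<decltype(make_future(Policy{})), Storage<Policy>>);
```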
Let me preface this by saying I'm not arguing for any changes in the code here, but this should be considered in a refactor of `__result_and_scratch_storage` (#2003):

This seems like it reveals an issue in the design of `__result_and_scratch_storage`. We shouldn't have to tiptoe around the type in our surrounding function definitions. The constructor of `__result_and_scratch_storage` takes a `const _ExecutionPolicy&`, but we also frequently use `_ExecutionPolicy` explicitly, without `std::decay_t`, when defining `__result_and_scratch_storage` types. This seems ripe for problems, and we should try to figure out a way to consistently arrive at the type of an appropriate `__result_and_scratch_storage`.
The `ExecutionPolicy` type should not be used in templates, etc. without first stripping cv-qualifiers and references. According to the standard, even the check of the `is_execution_policy` trait during overload resolution should use `decay`/`remove_cvref`. It sounds like the `__result_and_scratch_storage` implementation or its usage has a bug that needs to be fixed instead of changing the pattern signatures, and that is orthogonal to any refactoring.
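In code, the suggested fix is essentially the following sketch (`_ValueType` is a placeholder):

```cpp
// Strip references and cv-qualifiers once, then use the decayed type
// everywhere the storage type is named.
using _Policy  = std::decay_t<_ExecutionPolicy>;
using _Storage = __result_and_scratch_storage<_Policy, _ValueType>;
```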
Yes, I can make a fix for this to be included in this release. My thought was to figure out some way to get the proper type that didn't require the decay, to make usage easier, but we should first fix the bug.
I believe #2031 should resolve any issues.
Does it also revert the changes made by this patch? That should be a good test :)
No, but I can make that change.
> Is this a limitation/bug of the DPC++ SYCL implementation?

I think it is more of a limitation of the existing DPC++ implementation. From my discussions, it seems there is no easy fix, which is why this workaround was recommended. From their perspective, it is not a bug, as the SYCL specification does not specify any behavior regarding resubmissions to a queue after an exception such as this.
IGC intentionally forces a sub-group size of 16 on certain iGPUs to work around a known issue. We have to determine this by first compiling the kernels to see if the required sub-group size is respected.
@dmitriy-sobolev After looking through the PR, do you have any thoughts on Sergey's question here? Is the functional exception handler approach preferred in your opinion, or just directly adding try-catch throughout parallel_backend_sycl.h?
I slightly prefer Sergey's option.
@SergeyKopienko @dmitriy-sobolev I have removed the `static` usage. I will run some internal testing to fully verify the change.
```diff
@@ -1329,45 +1346,51 @@ __parallel_reduce_by_segment_reduce_then_scan(oneapi::dpl::__internal::__device_
     assert(__n > 1);
     return __parallel_transform_reduce_then_scan(
         __backend_tag, std::forward<_ExecutionPolicy>(__exec),
-        oneapi::dpl::__ranges::make_zip_view(std::forward<_Range1>(__keys), std::forward<_Range2>(__values)),
-        oneapi::dpl::__ranges::make_zip_view(std::forward<_Range3>(__out_keys), std::forward<_Range4>(__out_values)),
+        oneapi::dpl::__ranges::zip_view(std::forward<_Range1>(__keys), std::forward<_Range2>(__values)),
```
This change was necessary in a few places. When the range forwarding was removed in some places, `make_zip_view` resulted in a compiler error. It seemed to be caused by being unable to process l-value references to ranges. With the previous approach, we captured by value in a lambda and moved it in, so this was a non-issue. It looks like it is an instance of this issue: #1805.
clang-format wants to add an additional 4-space indentation throughout both of the scan kernel implementations due to some of the changes. I disagree with this, as we already have a lot of indentation here.
```cpp
        _GenMask{__pred}, _WriteOp{}, /*_IsUniquePattern=*/std::false_type{});
    try
    {
        using _GenMask = oneapi::dpl::__par_backend_hetero::__gen_mask<_UnaryPredicate>;
```
I think it is better to move the typedefs, here and in the other try/catch blocks, outside of the try/catch.
Either way is not a big deal to me, but my thought is that constraining the scope of the alias as much as possible makes the most sense. Since it's only needed in this try...catch, limiting it to that scope makes sense to me.
The opposite opinion: include in the `try` block only the code which may really generate an exception. When you write aliases inside it, you make the reader parse what was written there just to understand that it's only aliases. But it's up to you.
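A sketch of the shape being suggested, reusing the alias from the snippet above (the fallback call, path names, and error-code check are illustrative assumptions, not the PR's actual code):

```cpp
using _GenMask = oneapi::dpl::__par_backend_hetero::__gen_mask<_UnaryPredicate>;
// Only code that can actually throw lives in the try block.
try
{
    return __parallel_reduce_then_scan_path(/*...*/ _GenMask{__pred} /*...*/);
}
catch (const sycl::exception& __e)
{
    if (__e.code() != sycl::errc::kernel_not_supported)
        throw;                          // unrelated failures go to the user
    return __legacy_scan_path(/*...*/); // hypothetical fallback
}
```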
I see your point here. I have made these changes.
A minor comment. I am ready to approve the PR once it has been addressed.
LGTM
Reapproving. Please wait for the CI to finish before merging.
LGTM
(reverting part of #1997)
On certain integrated graphics architectures, sub-group sizes of 32 are not supported for kernels with certain properties when compiled with -O0 using the `icpx` compiler. The compiler is normally able to work around this issue by compiling to a sub-group size of 16 instead. However, in cases where an explicit sub-group size is required, the compiler throws an exception at JIT time. This issue directly affects our reduce-then-scan implementation, which has a required sub-group size of 32.

To properly work around this issue, several things must be done. Firstly, exception handling is implemented to catch this synchronous exception while re-throwing any other exceptions back to the user. Secondly, after discussion with compiler developers, kernel compilation must be separated from execution of the kernel to prevent corruption of the underlying `sycl::queue` that occurs when this exception is thrown after implicit buffer accessor dependencies around the kernel have been established. To do this, kernel bundles are used to first compile the kernel before executing.
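Putting the two steps together, a condensed sketch could look like the following (illustrative names; the exact error code this JIT failure maps to is an assumption, since the spec does not pin it down):

```cpp
#include <sycl/sycl.hpp>

template <typename _KernelName, typename _SubmitFn, typename _FallbackFn>
auto
__run_with_sg32_workaround(sycl::queue& __q, _SubmitFn __submit, _FallbackFn __fallback)
{
    try
    {
        // Step 1: compile separately, before any implicit buffer accessor
        // dependencies are established, so a throw leaves the queue intact.
        auto __bundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(
            __q.get_context(), {sycl::get_kernel_id<_KernelName>()});
        // Step 2: only now submit the pre-compiled kernel.
        return __submit(__bundle);
    }
    catch (const sycl::exception& __e)
    {
        if (__e.code() != sycl::errc::kernel_not_supported)
            throw;           // re-throw any other exception to the user
        return __fallback(); // path without reqd_sub_group_size(32)
    }
}
```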