
Work around crashes and incorrect results in scan-based algorithms when compiling with -O0 #1997

Merged
merged 32 commits into main
Jan 30, 2025

Conversation

mmichel11
Contributor

@mmichel11 mmichel11 commented Jan 10, 2025

On certain integrated graphics architectures, a sub-group size of 32 is not supported for kernels with certain properties when compiled with -O0 using the icpx compiler. The compiler is normally able to work around this issue by compiling to a sub-group size of 16 instead. However, in cases in which an explicit sub-group size is required, the compiler throws an exception at JIT time. This issue directly affects our reduce-then-scan implementation, which has a required sub-group size of 32.

To properly work around this issue, several things must be done. Firstly, exception handling is implemented to catch this synchronous exception while re-throwing any other exceptions back to the user. Secondly, after discussion with compiler developers, kernel compilation must be separated from execution of the kernel to prevent corruption of the underlying sycl::queue that occurs when this exception is thrown after implicit buffer accessor dependencies around the kernel have been established. To do this, kernel bundles are used to compile the kernel before executing it.
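The catch-and-fallback pattern described above can be sketched in plain C++. The exception type and error-code enum below are stand-ins for sycl::exception and sycl::errc::kernel_not_supported, not the actual oneDPL code; the real implementation compares the code() of a caught sycl::exception.

```cpp
#include <stdexcept>
#include <string>

// Stand-in for sycl::errc (illustrative only).
enum class errc { kernel_not_supported, other };

// Stand-in for sycl::exception carrying an error code.
struct backend_exception : std::runtime_error
{
    errc __code;
    backend_exception(errc __c, const std::string& __msg) : std::runtime_error(__msg), __code(__c) {}
};

// Try the preferred (sub-group-size-32) path first. If kernel compilation
// reports that the kernel is unsupported, run the fallback path instead;
// any other synchronous exception is re-thrown to the user unchanged.
template <typename _Preferred, typename _Fallback>
int
__run_with_fallback(_Preferred __preferred, _Fallback __fallback)
{
    try
    {
        return __preferred();
    }
    catch (const backend_exception& __e)
    {
        if (__e.__code == errc::kernel_not_supported)
            return __fallback();
        throw; // unrelated errors propagate to the caller
    }
}
```

In the PR itself the preferred path first compiles the kernels through a kernel bundle, so this exception surfaces before any buffer accessor dependencies are attached to the queue.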

@mmichel11 mmichel11 added this to the 2022.8.0 milestone Jan 10, 2025
@mmichel11 mmichel11 changed the title [Draft] Workaround crashes and incorrect results in scan-based algorithms with integrated graphics when compiling with -O0 [Draft] Work around crashes and incorrect results in scan-based algorithms when compiling with -O0 Jan 10, 2025
@SergeyKopienko
Contributor

SergeyKopienko commented Jan 14, 2025

I have never seen examples of std::optional usage in our code before.
From my point of view, we could simply write try / catch instead of __handle_sync_sycl_exception and not use std::optional at all.

@mmichel11 mmichel11 changed the title [Draft] Work around crashes and incorrect results in scan-based algorithms when compiling with -O0 Work around crashes and incorrect results in scan-based algorithms when compiling with -O0 Jan 14, 2025
@mmichel11 mmichel11 marked this pull request as ready for review January 14, 2025 14:29
@mmichel11
Contributor Author

I never seen before in our code the examples of usage std::optional. From my point of view we may simple write try / catch instead of __handle_sync_sycl_exception and don't use std::optional at all.

I don't have a strong preference between the way I implemented it and directly adding try ... catch throughout this header. I initially did this to avoid having many try/catch statements here.

Let me leave this open first for others' opinions, to see whether they prefer a more functional approach or just directly adding try...catch throughout.

using _ScanKernel = oneapi::dpl::__par_backend_hetero::__internal::__kernel_name_generator<
__reduce_then_scan_scan_kernel, _CustomName, _InRng, _OutRng, _GenScanInput, _ReduceOp, _ScanInputTransform,
_WriteOp, _InitType, _Inclusive, _IsUniquePattern>;
static auto __kernels = __internal::__kernel_compiler<_ReduceKernel, _ScanKernel>::__compile(__exec);
Contributor Author
@mmichel11 mmichel11 Jan 15, 2025
I wanted to point out this unique case and ensure there are no issues with this approach. When benchmarking the single work-group sizes, I saw some overheads after switching from the kernel provider to the kernel compiler.

Some of the overhead seems related to the kernel bundle and is unavoidable. However, __kernel_compiler creates a std::vector to call sycl::get_kernel_bundle. This allocation / deallocation overhead on each call was leading to measurable slowdowns with small input sizes. To fix this, I have made the variable static, as __kernels should be the same for each call.

This is beneficial assuming the application makes multiple calls to the scan-based algorithm which I assume is the most common case.
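The function-local static being discussed is the standard C++ "magic static" pattern: the initializer runs once on the first call (thread-safely since C++11) and the same object is reused afterwards. A minimal sketch, with hypothetical stand-ins for the real kernel bundle and __kernel_compiler::__compile:

```cpp
// Hypothetical stand-in: the real code compiles a sycl::kernel_bundle.
struct __kernel_bundle { int __id; };

static int __compile_count = 0;

__kernel_bundle
__compile()
{
    // Expensive step: kernel bundle creation plus the std::vector allocation
    // mentioned in the comment above.
    ++__compile_count;
    return {1};
}

// The static local means the bundle is compiled on the first call only and
// reused for every subsequent call of the same template instantiation.
__kernel_bundle&
__get_kernels()
{
    static __kernel_bundle __kernels = __compile();
    return __kernels;
}
```

This is also what makes the corner case raised below possible: the cache is keyed by the function's template instantiation, not by the device targeted by the policy's queue.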

Contributor
@dmitriy-sobolev dmitriy-sobolev Jan 24, 2025

I wanted to point this unique case out and ensure there are no issues with this approach.

I am thinking about possible issues with uniqueness of the kernels in some corner cases. Is it possible to end up with different kernels with the same instantiation of __parallel_transform_reduce_then_scan? I can imagine that, for example:

// no kernel specified: unnamed lambda case
sycl::queue queue_a{selector_vendor_a{}};
sycl::queue queue_b{selector_vendor_b{}};

// policy_a and policy_b have the same type
dpl::execution::device_policy policy_a(queue_a);
dpl::execution::device_policy policy_b(queue_b);

// ... containers and predicates have the same types
dpl::copy_if(policy_a, ...);
dpl::copy_if(policy_b, ...); // will it use the kernels compiled for the queue_a (and thus device "a")?

Such cases are highly unlikely, and I cannot think of any others. I would rather document this as a known limitation with a workaround (name the kernel) than compromise performance (if it is large enough overhead).

Contributor Author

Thanks, I agree the risk here is low, and I believe this change is worth it if we care about the small-to-medium input size cases. Will take a further look as well and see if we need to add a known limitation.

Contributor

We will need to have a proper fix.
Generally, caching kernels is not our business, but that of the SYCL runtime, just as caching memory is the business of a memory manager.
If we do caching ourselves, we need to check that the cached value is valid for the call, for example, that it was compiled for the device on which the algorithm is called. Documenting it as a limitation is not a long-term substitute for a proper, reliable implementation.

Contributor Author
@mmichel11 mmichel11 Jan 31, 2025

Thank you for the response. Since this does not seem like a good option to keep, it might be worth incurring the performance hit and removing the static usage. With this caching approach, I've measured the performance impact to be a constant 3-4 us for small inputs after the first call. When incurring the additional overhead on each call, it adds about another 4-5 us, for a total hit of 8-9 us on an Intel Data Center GPU Max 1550 with inclusive_scan.

Once our inputs scale past cache sizes, the impact of this change is minimized, with a lower threshold if we are loading data from global memory in the cold-cache case. If this performance impact seems acceptable, then I can open a PR to make this single-line change.

The overarching issue here is that the SYCL specification does not specify recovery behavior after handling this sort of exception, which has led to the need for this workaround using kernel bundles. I've been told it has been discussed at the Khronos working group before, but there was no consensus.

Contributor
@dmitriy-sobolev dmitriy-sobolev left a comment

I agree with the approach, but I need to review the PR more thoroughly. Indeed, according to the SYCL 2020 spec, launching a kernel decorated with the reqd_sub_group_size attribute may throw an exception depending on what is inside that kernel...

I have a question regarding that part:

Secondly, after discussion with compiler developers, kernel compilation must be separated from execution of the kernel to prevent corruption of the underlying sycl::queue

Is it a limitation/bug of the DPC++ SYCL implementation?

include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h (outdated, resolved)
@@ -1099,7 +1099,7 @@ struct __write_to_id_if_else
template <typename _ExecutionPolicy, typename _Range1, typename _Range2, typename _UnaryOperation, typename _InitType,
typename _BinaryOperation, typename _Inclusive>
auto
__parallel_transform_scan(oneapi::dpl::__internal::__device_backend_tag __backend_tag, _ExecutionPolicy&& __exec,
__parallel_transform_scan(oneapi::dpl::__internal::__device_backend_tag __backend_tag, const _ExecutionPolicy& __exec,
Contributor
@dmitriy-sobolev dmitriy-sobolev Jan 24, 2025

Let's keep the original variant with a forwarding reference, which is aligned with other patterns, unless the change is necessary.

Contributor Author

There's a tricky scenario with aligning the future return type from the different potential paths when we use the forwarding reference. I will take another look and see if it can be implemented with the forwarding reference.

Contributor Author

To clarify this issue: there are some instances where forwarding the execution policy in one path but not the other could result in differing future return types, since the _ExecutionPolicy class template parameter of __result_and_scratch_storage, which is bundled in the returned future, may or may not be a reference to device_policy. Making this a const l-value reference aligns the return types.

Contributor
@danhoeflinger danhoeflinger Jan 30, 2025

Let me preface this by saying I'm not arguing for any changes in the code here, but this should be considered in a refactor of __result_and_scratch_storage (#2003):

This seems to reveal an issue in the design of __result_and_scratch_storage. We shouldn't have to tiptoe around the type in our surrounding function definitions.

The constructor of __result_and_scratch_storage takes a const _ExecutionPolicy&, but we also frequently use _ExecutionPolicy without std::decay_t when defining the types of __result_and_scratch_storage objects. This seems ripe for problems, and we should try to figure out a way to consistently arrive at the type of an appropriate __result_and_scratch_storage.

Contributor
@akukanov akukanov Jan 30, 2025

The _ExecutionPolicy type should not be used in templates etc. without first being cleared of cv-qualifiers and references. According to the standard, even the check of the is_execution_policy trait used to determine the overload set should use decay/remove_cvref. It sounds like the __result_and_scratch_storage implementation or its usage has a bug that needs to be fixed instead of changing the pattern signatures, and it is orthogonal to any refactoring.
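The point about clearing cv-qualifiers and references can be shown with a small type-level check. The types below are illustrative stand-ins, not the real oneDPL classes:

```cpp
#include <type_traits>

struct device_policy {};

// If a storage class is parameterized directly on the deduced policy type,
// __storage<device_policy&> and __storage<const device_policy&> are
// distinct, unrelated types.
template <typename _ExecutionPolicy>
struct __storage {};

// Applying std::decay_t first maps every qualified/reference form of the
// policy to a single storage type.
template <typename _ExecutionPolicy>
using __storage_t = __storage<std::decay_t<_ExecutionPolicy>>;

// Without decay, a forwarding reference and a const l-value reference disagree:
static_assert(!std::is_same_v<__storage<device_policy&>, __storage<const device_policy&>>);

// With decay, the types coincide regardless of how the policy was passed:
static_assert(std::is_same_v<__storage_t<device_policy&>, __storage_t<const device_policy&>>);
static_assert(std::is_same_v<__storage_t<device_policy&&>, __storage_t<device_policy>>);
```

This is the same reason the standard requires decay before checking is_execution_policy when forming overload sets.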

Contributor

Yes, I can make a fix for this to be included in this release. My thought was to figure out some way to get the proper type that didn't require the decay, to make usage easier, but we should first fix the bug.

Contributor

I believe #2031 should resolve any issues.

Contributor

Does it also revert the changes made by this patch? That should be a good test :)

Contributor

No, but I can make that change.


@mmichel11
Contributor Author

mmichel11 commented Jan 24, 2025

I agree with the approach, but I need to review the PR more thoroughly. Indeed, according to the SYCL 2020 spec, launching a kernel decorated with the reqd_sub_group_size attribute may throw an exception depending on what is inside that kernel...

I have a question regarding that part:

Secondly, after discussion with compiler developers, kernel compilation must be separated from execution of the kernel to prevent corruption of the underlying sycl::queue

Is it a limitation/bug of the DPC++ SYCL implementation?

I think it is more of a limitation of the existing DPC++ implementation. From my discussion, it seems there is no easy fix, which is why this workaround was recommended. From their perspective, it is not a bug, as the SYCL specification does not specify any behavior regarding resubmissions to a queue after an exception such as this.

IGC intentionally forces a sub-group size of 16 on certain iGPUs to
work around a known issue. We have to determine this by first compiling
the kernels to see if the required sub-group size is respected.

Signed-off-by: Matthew Michel <[email protected]>
@mmichel11
Contributor Author

I never seen before in our code the examples of usage std::optional. From my point of view we may simple write try / catch instead of __handle_sync_sycl_exception and don't use std::optional at all.

I don't have a strong preference between the way I implemented it and directly adding try ... catch throughout this header. I initially did this to avoid having many try/catch statements here.

Let me leave this open first for others' opinions to see if they prefer a more functional approach or just directly adding try...catch throughout.

@dmitriy-sobolev After looking through the PR, do you have any thoughts on Sergey's question here? Is the functional exception handler approach preferred in your opinion or just directly adding try-catch throughout parallel_backend_sycl.h?

@dmitriy-sobolev
Contributor

I never seen before in our code the examples of usage std::optional. From my point of view we may simple write try / catch instead of __handle_sync_sycl_exception and don't use std::optional at all.

I don't have a strong preference between the way I implemented it and directly adding try ... catch throughout this header. I initially did this to avoid having many try/catch statements here.
Let me leave this open first for others' opinions to see if they prefer a more functional approach or just directly adding try...catch throughout.

@dmitriy-sobolev After looking through the PR, do you have any thoughts on Sergey's question here? Is the functional exception handler approach preferred in your opinion or just directly adding try-catch throughout parallel_backend_sycl.h?

I slightly prefer Sergey's option. __handle_sync_sycl_exception does not provide much reuse as an abstraction, but adds complexity. In contrast, __bypass_sycl_kernel_not_supported is helpful here.

@mmichel11
Contributor Author

@SergeyKopienko @dmitriy-sobolev I have removed __handle_sync_sycl_exception, turned __bypass_sycl_kernel_not_supported into a function, and made some other necessary changes to directly use try...catch.

I will run some internal testing to fully verify the change.

@@ -1329,45 +1346,51 @@ __parallel_reduce_by_segment_reduce_then_scan(oneapi::dpl::__internal::__device_
assert(__n > 1);
return __parallel_transform_reduce_then_scan(
__backend_tag, std::forward<_ExecutionPolicy>(__exec),
oneapi::dpl::__ranges::make_zip_view(std::forward<_Range1>(__keys), std::forward<_Range2>(__values)),
oneapi::dpl::__ranges::make_zip_view(std::forward<_Range3>(__out_keys), std::forward<_Range4>(__out_values)),
oneapi::dpl::__ranges::zip_view(std::forward<_Range1>(__keys), std::forward<_Range2>(__values)),
Contributor Author

This change was necessary in a few places. When the range forwarding was removed in some places, make_zip_view resulted in a compiler error, seemingly caused by an inability to process l-value references to ranges. With the previous approach, we captured by value in a lambda and moved it in, so this was a non-issue.

It looks like it is an instance of this issue: #1805.

@mmichel11
Contributor Author

clang-format wants to add an additional 4-space indentation throughout both scan kernel implementations due to some of the changes. I disagree with this, as we already have a lot of indentation here.

_GenMask{__pred}, _WriteOp{}, /*_IsUniquePattern=*/std::false_type{});
try
{
using _GenMask = oneapi::dpl::__par_backend_hetero::__gen_mask<_UnaryPredicate>;
Contributor

I think it is better to move the typedefs at this place, and in other try / catch blocks, outside of the try / catch.

Contributor Author

Either way is not a big deal to me, but my thought is that constraining the scope of the alias as much as possible makes the most sense. Since it's only needed in this try...catch, it makes sense to me to limit it to this scope.

Contributor

The opposite opinion: include in the try block only the code which may really generate an exception.
By writing aliases inside it, you make me read what was written here and understand that it's only aliases.
But up to you.

Contributor Author

I see your point here. I have made these changes.

Signed-off-by: Matthew Michel <[email protected]>
Contributor
@dmitriy-sobolev dmitriy-sobolev left a comment

A minor comment. I am ready to approve the PR once it has been addressed.

Contributor
@dmitriy-sobolev dmitriy-sobolev left a comment

LGTM

Signed-off-by: Matthew Michel <[email protected]>
Contributor
@dmitriy-sobolev dmitriy-sobolev left a comment

Reapproving. Please wait for the CI to finish before merging.

Contributor
@SergeyKopienko SergeyKopienko left a comment

LGTM

@mmichel11 mmichel11 merged commit 0140666 into main Jan 30, 2025
20 of 22 checks passed
@mmichel11 mmichel11 deleted the dev/mmichel11/reduce_then_scan_kernel_bundle_fallback branch January 30, 2025 19:56
danhoeflinger added a commit that referenced this pull request Jan 31, 2025
(reverting part of #1997)

Signed-off-by: Dan Hoeflinger <[email protected]>