Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[L0 v2] introduce raii wrapper for UR handles #2552

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

igchor
Copy link
Member

@igchor igchor commented Jan 13, 2025

Some entities (e.g. devices) do not need to be retained as they are owned by the platform. For such cases, only validate RefCount instead of acutally increasing/decreasing it.

It doesn't seem to be necessary to retain context whenever it's used to construct queue, etc. However, the spec is not very clear on this. I'm wondering if we should we update the spec or add retains on context?

Also, there are a few instances where I think calling retain() might be justified (we are not calling it right now) - e.g. when passing ur_mem_handle_t to urKernelSetArgMemObj / urKernelSetArgSampler.

@github-actions github-actions bot added the level-zero L0 adapter specific issues label Jan 13, 2025
Copy link
Contributor

@pbalcer pbalcer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem to be necessary to retain context whenever it's used to construct queue, etc. However, the spec is not very clear on this. I'm wondering if we should we update the spec or add retains on context?

hm, I think we should tie lifetime of context to queues/command buffers/programs/pools and all the other entities created from a context. But not kernels or individual enqueues. These retain their parent objects (like e.g., a queue), so the lifetime relationship between, e.g., an enqueue and context is transitive. So it should be safe.

Generally, my opinion is that it should be safe to release the context while there are some pending operations.

@@ -98,13 +98,14 @@ struct ur_event_handle_t_ : _ur_object {
uint64_t getEventEndTimestamp();

protected:
// Do not use ref couting on context to avoid circular dependency.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need something like a weak pointer? You added quite a few comments like that. It would be better to enforce this using a type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I could just do an alias for this, and add some explanation in the comment above that alias definition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion would be to create a weak<> type that has deleted copy constructors/operators and can only be created from a ref-counted ptr (using something like .downgrade() or whatever).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but in some situations there might be no rc<> instance. For example, when you create an eventPool in the context constructor, you need to pass ur_context_handle_t (this) to the eventPool. If it would only accept weak<> and if weak<> can only be created from rc<>, you would need a temporary rc<>. This is a bit verbose but on the other hand, you can use rc_val_only (to do ref count validation) so this could provider us with some more checks.

See the latest commit for how it would look like. Also, I guess if we decide to do that, then probably all handles in v2 should be replaced with rc<> rc_val_only<> or weak<>.

Copy link
Member Author

@igchor igchor Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having weak<> could also allow us to check the pointers for NULL (only allow rc<> and weak<> to be != NULL). If we ever need the value to be null (2 places right now), we can use optional (see latest commit) - this can get a bit verbose, but we shouldn't rely on null values too much I think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer to have a weak be nullable, like normal std::weak. So that when the original rc pointer goes out of scope, we can detect that.

For example, when you create an eventPool in the context constructor, you need to pass ur_context_handle_t (this) to the eventPool.

That's a scenario for the rc_val_only type, because context must outlive a queue/command buffer associated with the eventPool. I think weak is for when we create objects that might possibly outlive their "parent" objects.

Copy link
Member Author

@igchor igchor Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a scenario for the rc_val_only type, because context must outlive a queue/command buffer associated with the eventPool. I think weak is for when we create objects that might possibly outlive their "parent" objects.

Yeah, but the problem is that if you use rc_val_only then it will always fail during release because of how urContextRelease (and other functions are implemented) - we first decrement the RefCount (to 0) and only then call a destructor of the context (which in turn will destroy the eventPools, etc.).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tbh, I think I'll remove rc_val_only, as it's only applicable to device and platforms and there's a small chance we will have lifetime issues with those. We could just add some checks to the device/platform classes if we really want to.

@igchor
Copy link
Member Author

igchor commented Jan 14, 2025

It doesn't seem to be necessary to retain context whenever it's used to construct queue, etc. However, the spec is not very clear on this. I'm wondering if we should we update the spec or add retains on context?

hm, I think we should tie lifetime of context to queues/command buffers/programs/pools and all the other entities created from a context. But not kernels or individual enqueues. These retain their parent objects (like e.g., a queue), so the lifetime relationship between, e.g., an enqueue and context is transitive. So it should be safe.

Generally, my opinion is that it should be safe to release the context while there are some pending operations.

Yeah, that makes sense, I think at least CUDA and HIP are doing something similar. I will add this and perhaps some tests for that.

Some entities (e.g. devices) do not need to be retained
as they are owned by the platform. For such cases, only
validate RefCount instead of acutally increasing/decreasing it.
@igchor igchor force-pushed the v2_raii_wrapper_ur branch from 66b449b to fe4922a Compare January 14, 2025 20:26

// This version of ref_counted validates that the ref count is not zero on every
// release and retain in debug mode, and does nothing in the release mode.
// Used for types that should always be alibe during the adapter lifetime (e.g.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Used for types that should always be alibe during the adapter lifetime (e.g.
// Used for types that should always be alive during the adapter lifetime (e.g.

// Used for types that should always be alibe during the adapter lifetime (e.g.
// devices).
template <typename URHandle>
using rc_val_only =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I don't like the name :D

durable, NonNull, Live, hm, dunno. Basically what we want to say is that this object's lifetime is guaranteed to exceed that of the containing object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, me too. non_null/not_null might good, or maybe checked?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like checked.

@@ -104,5 +104,114 @@ using ze_context_handle_t =
using ze_command_list_handle_t =
ze_handle_wrapper<::ze_command_list_handle_t, zeCommandListDestroy>;

template <typename URHandle, ur_result_t (*retain)(URHandle),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be tricky to get right for people.
please create a small writeup in comments on when to use which type of pointer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
level-zero L0 adapter specific issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants