-
Notifications
You must be signed in to change notification settings - Fork 134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ABI break] Add new structs with version info and readonly flag #101
Conversation
include/dlpack/dlpack.h
Outdated
/*! \brief Mark the data readonly. */ | ||
uint8_t readonly; | ||
/*! | ||
* \brief Endianness of the data. 1 for non-native endianness and | ||
* 0 for native endianness. | ||
*/ | ||
uint8_t endianness; | ||
/*! \brief Alignment of the data. */ | ||
uint32_t alignment; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes still need to be discussed. Since we are breaking the ABI by adding the version info, it would be good to discuss these here and add them too (so we don't need to break the ABI again)
include/dlpack/dlpack.h
Outdated
*/ | ||
struct { | ||
uint8_t dlpack; | ||
uint8_t abi; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let us define DLPackVersion struct, for backward compatibility, consider put it in the end(after byte_offset), we need to consider the alignment properties of these fields. Specifically, Version should align to 32 or 64 bit.
Putting them after the main struct will likely help alignment of the current fields as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will also make followup compiler support easier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
docs/source/python_spec.rst
Outdated
The producer must set the ``PyCapsule`` name to ``"dltensor"`` if ABI | ||
version 1 is requested and ``"versioned_dltensor"`` if ABI version >= 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked in #98 (review) and I will ask again: With all these changes flush in, shouldn't we bump the DLPack version at least one more time to include #98 #100 before breaking ABIs? Otherwise, ABI version is never set to 1 in any outside work; it only lives in the dev branch here, and this doc change doesn't really make much sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, in the release note we can warn downstream that in the next release we'll introduce ABI breaking changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently this is the third time I am asking the same question: #34 (comment). Please, can we get it addressed? @rgommers @tqchen @tirthasheshpatel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked in #98 (review) and I will ask again: With all these changes flush in, shouldn't we bump the DLPack version at least one more time to include #98 #100 before breaking ABIs? Otherwise, ABI version is never set to 1 in any outside work; it only lives in the dev branch here, and this doc change doesn't really make much sense.
Oh, yes, thanks for noticing! I will propose a PR bumping the DLPack version and rebase here once that's merged.
Apparently this is the third time I am asking the same question: #34 (comment).
Sorry for not answering sooner!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have bumped the DLPack version here itself since it makes sense to upgrade both the DLPack and ABI version for an ABI break. Sounds good?
docs/source/python_spec.rst
Outdated
* ``__dlpack__`` should accept a ``version`` keyword (a Python tuple | ||
``(dlpack_version, dlpack_abi_version)``) which is set to ``None`` by default. | ||
Consumers can use this kwarg to request certain versions of DLPack. If | ||
``version=None`` or the ABI version 1 is requested: | ||
|
||
* a capsule named ``"dltensor"`` which uses the old ABI should be returned | ||
(if the producer wants to keep supporting it) or | ||
* a ``BufferError`` should be raised (if the producer doesn't want to keep | ||
support for the old ABI) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry, but I think we are moving too fast in this PR. Removing __dlpack_info__
is a design flaw (see also #34 (comment)): Without the producer providing this info upon the consumer's request, how does the consumer know
- the max version supported by the producer (without a try-except loop until no error)?
- the sensible combination of
dlpack_version
anddlpack_abi_version
? For example I can pass meaningless combos like(0.5, 2)
(such combo does not exist).
Shouldn't the burden be fallen on the producer since they know the best? Maybe we should move the discussion back to #34?
An ABI break may also be an opportunity to reexamine the signed/unsignedness of all |
Thanks for the help so far everyone! I think the discussion on #34 has settled on adding the |
I originally was in favor of capsule renaming, but I think it is not needed given the consensus around the protocol. While it does provide another layer of safety, I think it is redundant. |
There are still a number of things I find confusing about the Python specification. Some of them are related to this specific PR. Some of them are things that are unclear in general.
CUDA streams are not a natural representation in CUDA. Even in C++, they are simply a pointer to an opaque NVIDIA data structure. Are they passed as capsules, uintptr_t cast to a python int, etc.? The documented interface lacks type annotations to make this clear.
|
Yeah, making the signatures explicit should make it easier to understand the spec, thanks.
|
Okay, I have removed capsule renaming now. Also, I have added release notes and updated the DLPack diagram with the new structs. This PR is ready for review again from my side. |
@tirthasheshpatel just realized that we checkedin the png image to the repo in a previous path. Is it possible to checkin the image to a different location/repo? On one hand we would like to keep repo self-contained, on the other hand, it can be a bit annoying for a repo to contain a rich history of binaries through multiple revisions. One approach is to send a PR to a separate repo instead (we used https://github.com/dmlc/web-data) for some of that purposes then link to via https://gitcontent |
Sounds good! I will remove the image here and propose a PR on dmlc/web-data. |
doesn't support any version below the producer's maximum version, a | ||
``BufferError`` should be raised. Similarly, If the producer doesn't | ||
support the requested version, it should raise a ``BufferError``. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the consumer does not specify a version? As I understand things, for the forseeable future the producer should return a V1 structure. So effectively the default is 1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the consumer does not specify a version? As I understand things, for the forseeable future the producer should return a V1 structure. So effectively the default is
1
.
Yes, the default is 1. I will update the spec to say that instead.
I actually find this part of the documentation quite confusing (That said, I am only familiar with the CUDA part and can't speak about ROCm). There are three concepts in CUDA that play a role here:
Of these three, only the CUDA device ID is representable as an integer. Both CUDA context and CUDA stream are represented by opaque handles to a proprietary data structures, in both CUDA runtime and driver APIs. Essentially a I think that there is I think a fundamental confusion between the word "stream" and "CUDA context" or "CUDA device ID" in the DLPack documentation. I suspect that will you want to call this "device ID" and not "stream" and also establish the convention that CUDA memory allocations are assumed to be located in the primary context associated with the associated CUDA device ID. There is one situation in which the notion of a stream would make sense to me: if the tensor is not yet computed and the |
DLPackVersion version; | ||
/*! \brief Mark the data readonly. */ | ||
uint8_t readonly; | ||
} DLTensorVersioned; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually one thing that might be useful is to think about how to minimizing the change. Specifically, in light of S0 in #104. We may not need to introduce a separate versioned struct.
The old framework can still use the DLTensor(as implemented in the versioned version) as it is, as long as there is no reliance of sizeof(DLTensor). Perhaps we can rename the old one as DLTensorLegacy; In the similar spirit, it is safe to static_cast
a DLTensor* to DLTensorLegacy*, as a result, we do not need to introduce two versions of DLManagedTensor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old framework can still use the DLTensor(as implemented in the versioned version) as it is, as long as there is no reliance of sizeof(DLTensor). Perhaps we can rename the old one as DLTensorLegacy; In the similar spirit, it is safe to
static_cast
a DLTensor* to DLTensorLegacy*, as a result, we do not need to introduce two versions of DLManagedTensor
But DLManagedTensor doesn't have a pointer to the DLTensor struct. Instead, it is just a plain DLTensor dl_tensor
field. So, changing the sizeof(DLTensor) will change alignment and size of DLManagedTensor. Which is why we need to rename both structs. I also considered renaming the old structs to DLTensorLegacy
and DLManagedTensorLegacy
but it seems equivalent to instead add new structs with a different name (new implementations will be able to see both and can use the struct they want to export...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, you are indeed right. This would make the change ABI breaking. Although it would make future ABI change hard because all of them will be non-backward compact due to the change of deleter offset. Would be great to have more deliberation of this topic.
One possible choice is to append other flags(version, read_only) to DLManagedTensor instead. This of course won't be too ideal in case we want to change fields in DLTensor, but would maintain backward compact
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this just push the problem one abstraction layer higher? Now DLManagedTensor
has a breaking ABI change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, I see. If we move them to DLManagedTensor, we can ensure the location of the deleter function is still at the same offset. Then the code
DLManagedTensorVersioned *managed = (DLManagedTensorVersioned *)PyCapsule_GetPointer(self, "dltensor");
if (managed->deleter) {
managed->deleter(managed);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although not ideal, I too think it'd be better to add the new fields to the DLManagedTensor instead. This would eliminate the requirement of the *Versioned
structs and also make it easier to implement support for the new ABI.
on the array object, which will be called from within ``from_dlpack``, | ||
to access the data, to get the maximum supported DLPack version, and | ||
to query what device the array is on (may be needed to pass in the | ||
correct stream, e.g. in the case of multiple GPUs). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you break this into 3 separate points? They can be subpoints of 2
@@ -96,7 +112,7 @@ C API which is called either when the refcount on the capsule named | |||
PyObject *type, *value, *traceback; | |||
PyErr_Fetch(&type, &value, &traceback); | |||
|
|||
DLManagedTensor *managed = (DLManagedTensor *)PyCapsule_GetPointer(self, "dltensor"); | |||
DLManagedTensorVersioned *managed = (DLManagedTensorVersioned *)PyCapsule_GetPointer(self, "dltensor"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can we distinguish between these two cases? For legacy capsules, we must cast to the older struct, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can we distinguish between these two cases? For legacy capsules, we must cast to the older struct, right?
The offset of dl_tensor.device
doesn't change with this ABI breaking change (and we also don't anticipate such a change in the future). So, casting it to either struct and accessing the field should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the location of the deleter function called below does change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the location of the deleter function called below does change.
Oh, sorry. I misunderstood: I thought you were talking about the __dlpack_device__
method. We can have different deleter functions (one for each supported ABI). Not the cleanest way, but can be done using templating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, this is code created by the producer, not the consumer. When the consumer calls obj.__dlpack__
, the producer creates a capsule, and the capsule's deleter function consists of the code here.
Co-authored-by: Matti Picus <[email protected]>
closed in favor of #113 We would like to thanks @tirthasheshpatel for your effort on bringing in this change onward |
Closes #34
Supersedes #72
This PR adds version info to the
DLTensor
struct. This is under discussion in #34 (see this comment). Since this is an ABI break, the ABI version has been incremented to 2 and the DLPack Version has been bumped to v0.7.0. This also updates the Python spec with a detailed upgrade policy and future ABI compatibility. The upgrade policy has been designed such that the libraries that want to keep supporting (exporting and importing) the old ABI or support multiple DLPack and ABI versions can do so without breaking compatibility with the old Python API.