Replies: 8 comments
-
Perhaps CUDA Python (https://developer.nvidia.com/cuda-python) should also be reviewed for feasibility.
-
CUDA Python is a much lower-level library than CuPy or Numba, and it's positioning itself as a replacement for the CUDA backends of those two, which makes sense to reduce the amount of duplicated code and functionality. Unless we need some specific functionality from that low level, we probably don't want to use CUDA Python: it is difficult and cumbersome out of the necessity of being direct bindings for CUDA calls. CuPy and Numba provide easy access to this functionality in a way that makes more sense; they provide the higher-level interface on top of CUDA Python that we'd want to use.
-
@atbenmurray says that "we should choose something that doesn't add any burden on the user". This would require that the absence of CuPy be handled internally, so we would need a mechanism to detect whether it's installed and, if not, convert tensors to NumPy, use NumPy calls instead, then possibly convert back. If CuPy is present then we can freely convert tensors to CuPy arrays and do the operation there.
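As a rough sketch of such a mechanism (the `HAS_CUPY` flag and `_normalize` helper here are hypothetical illustrations, not existing MONAI API):

```python
import numpy as np
import torch

# detect CuPy once at import time so the user never has to care
try:
    import cupy as cp
    HAS_CUPY = True
except ImportError:
    HAS_CUPY = False

def _normalize(t: torch.Tensor) -> torch.Tensor:
    """Min-max normalize a tensor, using CuPy on GPU when available."""
    if HAS_CUPY and t.is_cuda:
        arr = cp.asarray(t)  # zero-copy view via __cuda_array_interface__
        out = (arr - arr.min()) / (arr.max() - arr.min())
        return torch.as_tensor(out, device=t.device)
    arr = t.detach().cpu().numpy()  # host copy when CuPy is absent
    out = (arr - arr.min()) / (arr.max() - arr.min())
    return torch.from_numpy(out).to(t.device)
```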
-
One feature that is very useful is the ability to define CUDA kernels without having to write them directly in CUDA. CuPy provides mechanisms for doing so with templates for element-wise, reduction, and raw kernels, but these require the user to provide CUDA code to fill in the template. The JIT facility can take decorated Python functions and convert them to CUDA code, like Numba's CUDA library does, but it is still experimental and has a few issues that make it difficult to use. Consider the following:

```python
import cupy as cp
from cupyx import jit

@jit.rawkernel()
def normalize(arr, y, mina, maxa, size):
    # grid-stride loop over the flat array
    tid = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x
    ntid = jit.gridDim.x * jit.blockDim.x
    for i in range(tid, size, ntid):
        y[i] = (arr[i] - mina) / (maxa - mina)

x = cp.random.rand(5)
y = cp.zeros(5)
# scalar arguments need explicit conversion to host values with fixed dtypes
normalize[2, 2](x, y, cp.float32(x.min().get()), cp.float32(x.max().get()), x.shape[0])
print(x)
print(y)
```

This works, but the odd type conversions make it awkward, and the error messages typically refer to a CUDA error that isn't apparent from this code. The equivalent in Numba:
```python
import numpy as np
from numba import cuda

@cuda.jit
def normalize(arr, y, mina, maxa, size):
    # grid-stride loop over the flat array
    tid = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    ntid = cuda.gridDim.x * cuda.blockDim.x
    for i in range(tid, size, ntid):
        y[i] = (arr[i] - mina) / (maxa - mina)

x = np.random.rand(5)
y = np.zeros(5)
normalize[2, 2](x, y, x.min(), x.max(), x.shape[0])
print(x)
print(y)
```

Interfacing Numba CUDA with PyTorch tensors is also possible to avoid copying. In addition, Numba-compiled CPU functions are automatically accessible in its CUDA-compiled functions (with limitations) through its translation facility, so in many cases only one definition is needed for both a CPU and a GPU algorithm.
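As a sketch of that zero-copy interfacing, `numba.cuda.as_cuda_array` wraps a CUDA tensor's memory without copying, so kernel writes appear in the original tensor (the `scale` kernel is just an illustration):

```python
import torch
from numba import cuda

@cuda.jit
def scale(arr, factor):
    # one thread per element, guarded against overrun
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] *= factor

t = torch.rand(1024, device="cuda")
d = cuda.as_cuda_array(t)  # shares memory with the tensor, no copy
scale[4, 256](d, 2.0)      # the result is visible in t afterwards
```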
-
Converting between CuPy arrays and PyTorch tensors isn't free either: even when no data is copied, there is a small cost on the order of microseconds per conversion that is avoided if you keep operations entirely in one library or the other. This is totally reasonable but needs to be considered when mixing operations.
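For illustration, assuming versions of CuPy and PyTorch recent enough to support the DLPack protocol directly, a zero-copy round trip still pays this per-hop overhead:

```python
import cupy as cp
import torch

t = torch.rand(256, 256, device="cuda")
c = cp.from_dlpack(t)      # tensor -> CuPy array, no data copy
t2 = torch.from_dlpack(c)  # CuPy array -> tensor, no data copy
# each hop avoids copying but still costs microseconds in a tight loop
```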
-
Adding @drbeh for further discussion. Thanks.
-
Thanks @ericspod for your detailed experiments. It looks like mixing CuPy (or NumPy) with PyTorch within a transform is not a very good idea, as constantly going back and forth causes some overhead, as you mentioned. However, this overhead should be negligible when it happens only once or twice in the entire pipeline (for each iteration), and even less significant when using batch transforms (fewer iterations). Having said that, it seems to me that, except for a few operations/transforms, we should be able to depend solely on PyTorch for development and maintenance, and keep (and improve) the CuPy and NumPy converters for interoperability with external libraries. The benefit of this approach is obvious, at least in terms of consistency and maintenance, but do you think it would be a viable approach in terms of computational efficiency?
-
@drbeh I think so, we should focus on PyTorch. However, my issue with some of the things I was working on was finding a fast way to resize images and map coordinates. Without an implementation of the latter in PyTorch, the best one is in SciPy or cupyx's SciPy layer. That was a source of inefficiency on the scale of microseconds for 2D images, but it will add up significantly for 3D images in transform pipelines. We should prioritize PyTorch as the first-class implementation environment for transforms and keep in mind how we can port these missing features in the future as needed.
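For reference, a sketch of the coordinate-mapping gap: `cupyx.scipy.ndimage.map_coordinates` resamples an image at arbitrary coordinates, something PyTorch has no direct equivalent for (`grid_sample` is the closest, with different conventions):

```python
import cupy as cp
from cupyx.scipy.ndimage import map_coordinates

img = cp.random.rand(64, 64)

# resample onto a 128x128 grid of (row, col) coordinates with cubic splines
rows, cols = cp.meshgrid(cp.linspace(0, 63, 128),
                         cp.linspace(0, 63, 128), indexing="ij")
resampled = map_coordinates(img, cp.stack([rows, cols]), order=3)
```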
-
**Is your feature request related to a problem? Please describe.**
Using CuPy in transforms introduces a host of advantages and issues to consider. CuPy provides a number of CUDA-accelerated facilities not provided by PyTorch. One important aspect is that it replicates the NumPy API, so it's possible to define/rewrite code to use one or the other based on which library the inputs come from. This does add CuPy as a new dependency, either as an optional one that is difficult to integrate or as a hard requirement.
**Describe the solution you'd like**
Transforms can be defined with NumPy, PyTorch, CuPy, or some combination of these libraries. We should investigate how to define transforms that use the best combination of them. Since CuPy is meant to be a drop-in replacement for NumPy, this may allow us to write code only once and select which library to use, but the details will probably make this less straightforward than anticipated. We will always need NumPy to support users' custom transforms using libraries based on it, so we can't state a hard requirement for transforms to accept and return only CuPy arrays or tensors without causing copying inefficiencies for those transforms.
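A sketch of that write-once idea, using `cupy.get_array_module` to dispatch on the input's type (the `zscore` transform here is a hypothetical example):

```python
import numpy as np

try:
    import cupy as cp
    get_array_module = cp.get_array_module  # returns numpy or cupy per input
except ImportError:
    def get_array_module(*args):
        return np  # without CuPy everything is NumPy

def zscore(img):
    xp = get_array_module(img)
    return (img - xp.mean(img)) / xp.std(img)  # same code for both backends
```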
CuPy doesn't cover the whole NumPy/SciPy API; there are missing routines and submodules such as `scipy.interpolate`. There are also features in PyTorch that CuPy may provide (or have a better version of), so it's going to be a mix as to which provides the better GPU implementation of a given operation.

CuPy has a comprehensive interface for defining custom kernels in CUDA without having to deal with CUDA and its compilation directly. CUDA code can be provided in snippets to use with a kernel template routine, or translated from Python code using function decorators in a similar way to Numba's CUDA support. This can be really useful in providing fast operations which could be done in NumPy/CuPy directly but would benefit from being compiled into a faster direct version, which is the whole rationale behind Numba.
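For example, the element-wise template takes a CUDA snippet and handles compilation, type handling, and broadcasting itself (this follows the pattern from CuPy's own documentation):

```python
import cupy as cp

# the CUDA C snippet fills in the body of a generated element-wise kernel
squared_diff = cp.ElementwiseKernel(
    "float32 x, float32 y",   # input arguments
    "float32 z",              # output argument
    "z = (x - y) * (x - y)",  # CUDA C body
    "squared_diff",           # kernel name
)

a = cp.arange(10, dtype=cp.float32)
b = cp.arange(10, dtype=cp.float32)[::-1]
print(squared_diff(a, b))
```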
**Describe alternatives you've considered**
With all these pros and cons, the question is where CuPy fits in with transforms. It's technically redundant in that it provides GPU operations which we can also get from PyTorch. For CPU transforms we can use NumPy or PyTorch, but we will often want NumPy for interoperation or other features not in PyTorch, so we'll never go the PyTorch-only route. I had discussed Numba before as a way to define CUDA code similarly to CuPy, but with the advantage of compiling CPU code as well, often reusing the same definitions.
If we want to re-engineer the transforms to better support GPU computation, we can do it with PyTorch in conjunction with NumPy for CPU-only work, so is there a need for CuPy (or Numba)? Does either provide enough advantage to be made a hard requirement and tightly integrated with the transform definitions?
**Additional context**
One comparison I've made is with resizing an image. In NumPy or SciPy there are options for interpolation or zoom, but these are slow. The `Resize` transform uses `torch.nn.functional.interpolate`. In CuPy there's `cupyx.scipy.ndimage.zoom`, which can achieve the same thing but is still much slower than `interpolate`; however, it can do tricubic interpolation, which `interpolate` cannot. Resizing images is important for the way I was defining smooth fields for various applications, so there's a definite need for a fast solution.

My limited experience in playing with CuPy transforms is that it can be easily mixed with PyTorch code to use whichever component is the faster implementation. Some of our existing transforms can accept CuPy arrays as input now and operate using only its routines, but more study needs to be done to ensure unnecessary copies and other inefficiencies are minimized.
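As a sketch of this comparison (illustrative only, not a benchmark), the same 2x upscale in both libraries:

```python
import cupy as cp
import torch
import torch.nn.functional as F
from cupyx.scipy.ndimage import zoom

t = torch.rand(1, 1, 64, 64, device="cuda")  # NCHW layout as interpolate expects
out_torch = F.interpolate(t, scale_factor=2, mode="bicubic", align_corners=False)

c = cp.asarray(t[0, 0])           # zero-copy view of the tensor's memory
out_cupy = zoom(c, 2.0, order=3)  # cubic spline interpolation
```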