Replies: 8 comments
-
Perhaps CUDA Python (https://developer.nvidia.com/cuda-python) should also be reviewed for feasibility.
-
CUDA Python is a much lower-level library than CuPy or Numba, and it's positioning itself as a replacement for the CUDA backends of those two, which makes sense to reduce the amount of duplicated code and functionality. Unless we need some specific functionality from that low level, we probably don't want to use CUDA Python: it is difficult and cumbersome out of the necessity of being direct bindings for CUDA calls. CuPy and Numba provide easy access to this functionality in a way that makes more sense; they provide the higher-level interface on top of CUDA Python that we'd want to use.
-
@atbenmurray says that "we should choose something that doesn't add any burden on the user". This would require that the absence of CuPy be handled internally, so we would need a mechanism to detect whether it's installed and, if not, convert tensors to NumPy, use NumPy calls instead, then possibly convert back. If CuPy is present then we can freely convert tensors to CuPy arrays and do the operation there.
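As a rough sketch of such a mechanism (the `HAS_CUPY` flag and `_normalize` helper here are hypothetical illustrations, not existing MONAI API):

```python
import numpy as np
import torch

# detect CuPy once at import time so the user never has to care
try:
    import cupy as cp
    HAS_CUPY = True
except ImportError:
    HAS_CUPY = False

def _normalize(t: torch.Tensor) -> torch.Tensor:
    """Min-max normalize a tensor, using CuPy on GPU when available."""
    if HAS_CUPY and t.is_cuda:
        arr = cp.asarray(t)  # zero-copy view via __cuda_array_interface__
        out = (arr - arr.min()) / (arr.max() - arr.min())
        return torch.as_tensor(out, device=t.device)
    arr = t.detach().cpu().numpy()  # host copy when CuPy is absent
    out = (arr - arr.min()) / (arr.max() - arr.min())
    return torch.from_numpy(out).to(t.device)
```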
-
One feature that is very useful is the ability to define CUDA kernels without having to write them directly in CUDA. CuPy provides mechanisms for doing so with templates for element-wise, reduction, and raw kernels, but these require the user to provide CUDA code to fill in the template. The JIT facility can take decorated Python functions and convert them to CUDA code, like Numba's CUDA library does, but it is still experimental and has a few issues that make it difficult to use. Consider the following:

```python
import cupy as cp
from cupyx import jit

@jit.rawkernel()
def normalize(arr, y, mina, maxa, size):
    # grid-stride loop over the flat array
    tid = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x
    ntid = jit.gridDim.x * jit.blockDim.x
    for i in range(tid, size, ntid):
        y[i] = (arr[i] - mina) / (maxa - mina)

x = cp.random.rand(5)
y = cp.zeros(5)
# scalar arguments need explicit conversion to host values with fixed dtypes
normalize[2, 2](x, y, cp.float32(x.min().get()), cp.float32(x.max().get()), x.shape[0])
print(x)
print(y)
```

This works, but the odd type conversions make it awkward, and the error messages typically refer to a CUDA error that isn't apparent from this code. The equivalent in Numba:
```python
import numpy as np
from numba import cuda

@cuda.jit
def normalize(arr, y, mina, maxa, size):
    # grid-stride loop over the flat array
    tid = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    ntid = cuda.gridDim.x * cuda.blockDim.x
    for i in range(tid, size, ntid):
        y[i] = (arr[i] - mina) / (maxa - mina)

x = np.random.rand(5)
y = np.zeros(5)
normalize[2, 2](x, y, x.min(), x.max(), x.shape[0])
print(x)
print(y)
```

Interfacing Numba CUDA with PyTorch tensors is also possible to avoid copying. In addition, Numba-compiled CPU functions are automatically accessible in its CUDA-compiled functions (with limitations) through its translation facility, so in many cases only one definition is needed for both a CPU and a GPU algorithm.
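As a sketch of that zero-copy interfacing, `numba.cuda.as_cuda_array` wraps a CUDA tensor's memory without copying, so kernel writes appear in the original tensor (the `scale` kernel is just an illustration):

```python
import torch
from numba import cuda

@cuda.jit
def scale(arr, factor):
    # one thread per element, guarded against overrun
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] *= factor

t = torch.rand(1024, device="cuda")
d = cuda.as_cuda_array(t)  # shares memory with the tensor, no copy
scale[4, 256](d, 2.0)      # the result is visible in t afterwards
```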
-
Converting between CuPy arrays and PyTorch tensors isn't free either: even when no data is copied, there is a small cost on the order of microseconds per conversion that is avoided if you keep operations entirely in one library or the other. This is totally reasonable but needs to be considered when mixing operations.
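For illustration, assuming versions of CuPy and PyTorch recent enough to support the DLPack protocol directly, a zero-copy round trip still pays this per-hop overhead:

```python
import cupy as cp
import torch

t = torch.rand(256, 256, device="cuda")
c = cp.from_dlpack(t)      # tensor -> CuPy array, no data copy
t2 = torch.from_dlpack(c)  # CuPy array -> tensor, no data copy
# each hop avoids copying but still costs microseconds in a tight loop
```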
-
Adding @drbeh for further discussion. Thanks.
-
Thanks @ericspod for your detailed experiments. It looks like mixing CuPy (or NumPy) with PyTorch within a transform is not a very good idea, as constantly going back and forth causes some overhead, as you mentioned. However, this overhead should be negligible when it happens only once or twice in the entire pipeline (for each iteration), and even less significant when using batch transforms (fewer iterations). Having said that, it seems to me that, except for a few operations/transforms, we should be able to depend solely on PyTorch for development and maintenance, and keep (and improve) the CuPy and NumPy converters for interoperability with external libraries. The benefit of this approach is obvious, at least in terms of consistency and maintenance, but do you think it would be a viable approach in terms of computational efficiency?
-
@drbeh I think so, we should focus on PyTorch. However, my issue with some of the things I was working on was finding a fast way to resize images and map coordinates. Without an implementation of the latter in PyTorch, the best one is in SciPy or cupyx's SciPy layer. That was a source of inefficiency on the scale of microseconds for 2D images, but it will add up significantly for 3D images in transform pipelines. We should prioritize PyTorch as the first-class implementation environment for transforms and keep in mind how we can port these missing features in the future as needed.
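For reference, a sketch of the coordinate-mapping gap: `cupyx.scipy.ndimage.map_coordinates` resamples an image at arbitrary coordinates, something PyTorch has no direct equivalent for (`grid_sample` is the closest, with different conventions):

```python
import cupy as cp
from cupyx.scipy.ndimage import map_coordinates

img = cp.random.rand(64, 64)

# resample onto a 128x128 grid of (row, col) coordinates with cubic splines
rows, cols = cp.meshgrid(cp.linspace(0, 63, 128),
                         cp.linspace(0, 63, 128), indexing="ij")
resampled = map_coordinates(img, cp.stack([rows, cols]), order=3)
```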
-
**Is your feature request related to a problem? Please describe.**
Using CuPy in transforms introduces a host of advantages and issues to consider. CuPy provides a number of CUDA-accelerated facilities not provided by PyTorch. One important aspect is that it replicates the NumPy API, so it's possible to define/rewrite code to use one or the other based on which library the inputs come from. This does add CuPy as a new dependency, either as an optional one that is difficult to integrate or as a hard requirement.
**Describe the solution you'd like**
Transforms can be defined with NumPy, PyTorch, CuPy, or some combination of these libraries. We should investigate how to define transforms that use the best combination of them. Since CuPy is meant to be a drop-in replacement for NumPy, this may allow us to write code only once and select which library to use, but the details will probably make this less straightforward than anticipated. We will always need NumPy to support users' custom transforms using libraries based on it, so we can't state a hard requirement for transforms to accept and return only CuPy arrays or tensors without causing copying inefficiencies for those transforms.
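A sketch of that write-once idea, using `cupy.get_array_module` to dispatch on the input's type (the `zscore` transform here is a hypothetical example):

```python
import numpy as np

try:
    import cupy as cp
    get_array_module = cp.get_array_module  # returns numpy or cupy per input
except ImportError:
    def get_array_module(*args):
        return np  # without CuPy everything is NumPy

def zscore(img):
    xp = get_array_module(img)
    return (img - xp.mean(img)) / xp.std(img)  # same code for both backends
```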
CuPy doesn't cover the whole NumPy/SciPy API; there are missing routines and submodules such as `scipy.interpolate`. There are also features in PyTorch that CuPy may provide (or have a better version of), so it's going to be a mix as to which provides the better GPU implementation of a given operation.

CuPy has a comprehensive interface for defining custom kernels in CUDA without having to deal with CUDA and its compilation directly. CUDA code can be provided in snippets to use with a kernel template routine, or translated from Python code using function decorators in a similar way to Numba's CUDA support. This can be really useful in providing fast operations which could be done in NumPy/CuPy directly but would benefit from being compiled into a faster direct version, which is the whole rationale behind Numba.
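For example, the element-wise template takes a CUDA snippet and handles compilation, type handling, and broadcasting itself (this follows the pattern from CuPy's own documentation):

```python
import cupy as cp

# the CUDA C snippet fills in the body of a generated element-wise kernel
squared_diff = cp.ElementwiseKernel(
    "float32 x, float32 y",   # input arguments
    "float32 z",              # output argument
    "z = (x - y) * (x - y)",  # CUDA C body
    "squared_diff",           # kernel name
)

a = cp.arange(10, dtype=cp.float32)
b = cp.arange(10, dtype=cp.float32)[::-1]
print(squared_diff(a, b))
```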
**Describe alternatives you've considered**
With all these pros and cons, the question is where CuPy fits in with transforms. It's technically redundant in that it provides GPU operations which we can also get from PyTorch. For CPU transforms we can use NumPy or PyTorch, but we will often want NumPy for interoperation or other features not in PyTorch, so we'll never go the PyTorch-only route. I had discussed Numba before as a way to define CUDA code similarly to CuPy, but with the advantage of compiling CPU code as well, often reusing the same definitions.
If we want to re-engineer the transforms to better support GPU computation, we can do it with PyTorch in conjunction with NumPy for CPU-only work, so is there a need for CuPy (or Numba)? Does either provide enough advantage to be made a hard requirement and tightly integrated with the transform definitions?
**Additional context**
One comparison I've made is with resizing an image. In NumPy or SciPy there are options for interpolation or zoom, but these are slow. The `Resize` transform uses `torch.nn.functional.interpolate`. In CuPy there's `cupyx.scipy.ndimage.zoom`, which can achieve the same thing but is still much slower than `interpolate`; however, it can do tricubic interpolation, which `interpolate` cannot. Resizing images is important for the way I was defining smooth fields for various applications, so there's a definite need for a fast solution.

My limited experience in playing with CuPy transforms is that it can be easily mixed with PyTorch code to use whichever component is the faster implementation. Some of our existing transforms can accept CuPy arrays as input now and operate using only its routines, but more study needs to be done to ensure unnecessary copies and other inefficiencies are minimized.
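As a sketch of this comparison (illustrative only, not a benchmark), the same 2x upscale in both libraries:

```python
import cupy as cp
import torch
import torch.nn.functional as F
from cupyx.scipy.ndimage import zoom

t = torch.rand(1, 1, 64, 64, device="cuda")  # NCHW layout as interpolate expects
out_torch = F.interpolate(t, scale_factor=2, mode="bicubic", align_corners=False)

c = cp.asarray(t[0, 0])           # zero-copy view of the tensor's memory
out_cupy = zoom(c, 2.0, order=3)  # cubic spline interpolation
```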