-
I have implemented a custom Vicsek-style alignment force using
I was wondering whether you could share experiences or recommendations for accelerating custom force computations like this. I know that building an external component (plugin) in C++/CUDA is possible, and I have looked at the example pair plugin and the pair plugin collection by @ianrgraham. However, accessing particle properties (velocities) does not seem trivial in these pair examples. Is there a way to do it? Alternatively, is there an easier route using CuPy kernels or Numba CUDA kernels that would yield similar performance? Thank you for your help.
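For readers unfamiliar with the force in question, here is a minimal CPU sketch of a Vicsek-style alignment interaction: each particle is pulled toward the mean velocity direction of its neighbors. The function name, the padded neighbor-array layout, and the exact force law are all illustrative assumptions, not the actual implementation or any HOOMD API:

```python
import numpy as np

def vicsek_alignment_force(velocities, neighbor_ids, n_neigh, strength=1.0):
    """Toy sketch of a Vicsek-style alignment force.

    Each particle feels a force of magnitude `strength` along the mean
    velocity direction of its neighbors. `neighbor_ids` is a padded
    (N, max_neigh) index array; `n_neigh[i]` gives particle i's neighbor
    count. All names and the force law are hypothetical, for illustration.
    """
    forces = np.zeros_like(velocities)
    for i in range(velocities.shape[0]):
        k = n_neigh[i]
        if k == 0:
            continue
        mean_v = velocities[neighbor_ids[i, :k]].mean(axis=0)
        norm = np.linalg.norm(mean_v)
        if norm > 0.0:
            forces[i] = strength * mean_v / norm
    return forces
```

The inner loop over neighbors is exactly the part that a CuPy or Numba CUDA kernel would parallelize, one thread per particle.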
-
Are you using this with
-
Thank you @joaander! I will look into implementing it directly in C++/CUDA. I am currently using a similar approach, which I hope works out: I wrote a CuPy CUDA kernel that gave some speedup, but profiling with NVIDIA Nsight Systems shows that a lot of data (the neighbor list) is still being transferred between device and host. This data should remain on the device. Is there a way to access the memory address of the
The code looks like this now:
-
local_pair_list is designed for convenience, not zero-copy. You should use gpu_local_nlist_arrays with cupy.
I still recommend a direct C++ implementation. This opens the possibility of using highly optimized multiple-threads-per-particle code, autotuners, parameter dictionaries, etc. If you are already writing C++ for cupy, then the only additional work is to add the C++/Python interface and to run cmake to configure and build the code.
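To illustrate the flat neighbor-list layout involved here, the sketch below walks a CSR-like structure on the CPU with NumPy: particle i's neighbors occupy nlist[head_list[i] : head_list[i] + n_neigh[i]]. This mirrors my understanding of the head_list/n_neigh/nlist arrays that gpu_local_nlist_arrays exposes, but the function name and example data are invented; on the GPU the same loop body would sit inside a cupy or Numba kernel operating on the zero-copy device arrays:

```python
import numpy as np

def neighbor_sum(quantity, head_list, n_neigh, nlist):
    """Sum a per-particle quantity over each particle's neighbors.

    Uses a flat (CSR-like) neighbor list: the neighbors of particle i
    are nlist[head_list[i] : head_list[i] + n_neigh[i]]. Hypothetical
    helper, written to mirror the head_list/n_neigh/nlist layout; it is
    not part of the HOOMD API.
    """
    out = np.zeros_like(quantity)
    for i in range(len(n_neigh)):
        start = head_list[i]
        neighbors = nlist[start:start + n_neigh[i]]
        out[i] = quantity[neighbors].sum(axis=0)
    return out
```

Keeping all four arrays on the device and indexing them inside a kernel is what avoids the host/device transfers seen in the profile.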