Commit b209cc1

Authored by jammm, KaoCC, takahiroharada, mehmetoguzderin, and meistdan
Sync with amdadvtech/Orochi (#73)
* Add gitignore to the repository Signed-off-by: Chih-Chen Kao <[email protected]> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <[email protected]> * Feature/oro 0 radix sort (#19) * [ORO-0] Working 8 bit radix sort. * [ORO-0] Some optimization. * Create LICENSE * Update README.md (#15) * Feature/oro 0 raw get set (#19) * [ORO-0] Rename setter and getter. * [ORO-0] Fix when there is a dll but no device. * [ORO-0] Deletion function. * [ORO-0] Multi processor count. * [ORO-0] Extended the sort to more than 8 bits. Implemented tests. * [ORO-0] Moved temp buffer allocation out from the sort(). * [ORO-0] README. References. * [ORO-0] Debug flag. * Refactor the code to add the basic constructs to support selecting different scan algorithms. Add different implementation of the scan algorithm: CPU, single WG and all WG . Signed-off-by: Chih-Chen Kao <[email protected]> * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <[email protected]> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * Optimization: Implement the single-pass kernel for GPU parallel scan. Fix a GPU memory bug. Signed-off-by: Chih-Chen Kao <[email protected]> * Feature/oro 0 kernel cache (#4) * [ORO-0] Cache kernel. * [ORO-0] Support newer HIP builds on windows (#22) * [ORO-0] Unit test. (#23) * Fix LDS scan bug. The previous implementation would lead to an error when the wavefront (wrap) size is not equal to the size of a workgroup (block). Since not all threads run simultaneously, for an input arrays larger than the wavefront size, the previous algorithm will not work because it performs the scan in-place on the input array. The results of one wavefront (wrap) will be overwritten by work items (threads) in another wavefront (wrap). Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize the LDS scan algorithm. (#6) * Optimize the LDS scan algorithm. 
This version does not require a temp buffer and can support a LDS input size up to 2 times the workgroup size. Signed-off-by: Chih-Chen Kao <[email protected]> * Support an input array in LDS that is 2 times the WG size. Signed-off-by: Chih-Chen Kao <[email protected]> * Feature/oro 0 clean up (#7) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <[email protected]> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * Feature/oro 0 clean up (#10) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <[email protected]> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * [ORO-0] SortKernel1. Less complex. (#8) SortKernel (occupancy: 8) - vgpr: 128 - lds: 6704 SortKernel1 (occupancy: 9) - vgpr: 106 - lds 7720 * [ORO-0] Kernel execution time check. * Fix the memory access pattern and change it to coalesced memory access. (#11) Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Single kernel sort for small keys. (#12) * Optimize the Count kernel for less LDS usage to achieve full occupancy (#13) * Optimize the Count kernel to let it use less LDS and could achieve full occupancy. Signed-off-by: Chih-Chen Kao <[email protected]> * Remove __threadfence_block() Removes the boundary check in the inner loop. The upper bound is set only once before going into the loop. Signed-off-by: Chih-Chen Kao <[email protected]> * Introduce DRIVER and RTC APIs * Disable enum-variant * Improve paths * Add fields * Update Vulkan test * Define CUDA in terms of DRIVER and RTC * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <[email protected]> * Fix nItemsPerWI and enable the version with shared LDS. 
Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: takahiroharada <[email protected]> * Merging another merge (#18) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <[email protected]> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <[email protected]> * Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15) * Calculate the number of WGs based on LDS and max-thread-per-WGP. Signed-off-by: Chih-Chen Kao <[email protected]> * Add a workaround for CUDA. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <[email protected]> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. 
Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Implement key-value pair sorting (#17) * Add gitignore to the repository Signed-off-by: Chih-Chen Kao <[email protected]> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <[email protected]> * Add basic structure for key-value pair sorting. Fix an error in single pass sort Signed-off-by: Chih-Chen Kao <[email protected]> * Add Value data in the test and sort it according to keys. Signed-off-by: Chih-Chen Kao <[email protected]> * Support Key only sorting. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Make single pass kernel non compile time switch. * Support both Key-Only & Key-Value pair sort kernels Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Test change. * [ORO-0] A bug. * [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible. Co-authored-by: Takahiro Harada <[email protected]> Co-authored-by: takahiroharada <[email protected]> * [ORO-0] Revert demo code. * Fix missing CUDA properties. (#26) * Update Orochi.cpp * [ORO-0] Clean up. * [ORO-0] OroUtils. (#27) * [ORO-0] OroUtils. * [ORO-0] Linux build fix. * [ORO-0] Forgot to add. * [ORO-0] Linux build fix. * [ORO-0] Clean up. Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> * Add kernel path and include dir to the functions. (#20) Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] BakeKernel. (#21) * [ORO-0] BakeKernel. 
* Update tools/genArgs.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/genArgs.py dead code removal * Update tools/stringify.py dead code removal * fix include Signed-off-by: Chih-Chen Kao <[email protected]> * fix script Signed-off-by: Chih-Chen Kao <[email protected]> * fix Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Fix Orochi CUDA API (#23) Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Linux build fix. (#22) * [ORO-0] Linux build fix. * Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Quick fix for old linux gcc which does not support std::exclusive_scan (#24) Quick fix for old linux gcc which does not support std::exclusive_scan Signed-off-by: Chih-Chen Kao <[email protected]> * Fix the kernel cache bug. (#25) Fix the kernel cache bug. The function should not return the oroFunctions that are created previously solely based on the names because they might be invalid. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Remove static variables. (#26) * [ORO-0] Remove static variables. * [ORO-0] Applied the suggestions. * [ORO-0] Linux regression fix. * Fix OrochiUtils::getFunctionFromString API (#27) Signed-off-by: Chih-Chen Kao <[email protected]> * Adding missing assert (#28) * Adding missing assert * Adding more asserts * Feature/oro 0 gpuopen merge (#31) * Fix oroGetDeviceProperties in cuda path. * Fix linux crash (#29) * [ORO-0] Added missing file. 
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Fix hipGetErrorString (#32) * [ORO-0] Fix hipGetErrorString It was incorrectly importing this API. Import the correct API in hipew. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33) * Add missing path on Apple config. (#34) * [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38) * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Add hiprtc.dll and comgr dll Co-authored-by: takahiroharada <[email protected]> * fix footnote markdown format (#39) * Fix orochi utils issue in unit tests Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: PixelClear <[email protected]> * remove space after -I (#33) * Feature/oro 0 gpuopen merge 2 (#32) * Fix oroGetDeviceProperties in cuda path. * Fix linux crash (#29) * [ORO-0] Added missing file. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Fix hipGetErrorString (#32) * [ORO-0] Fix hipGetErrorString It was incorrectly importing this API. Import the correct API in hipew. 
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33) * Add missing path on Apple config. (#34) * [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38) * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Add hiprtc.dll and comgr dll Co-authored-by: takahiroharada <[email protected]> * fix footnote markdown format (#39) * Feature/oro 0 amdadvtech merge (#43) * Add gitignore to the repository Signed-off-by: Chih-Chen Kao <[email protected]> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <[email protected]> * Feature/oro 0 radix sort (#19) * [ORO-0] Working 8 bit radix sort. * [ORO-0] Some optimization. * Create LICENSE * Update README.md (#15) * Feature/oro 0 raw get set (#19) * [ORO-0] Rename setter and getter. * [ORO-0] Fix when there is a dll but no device. * [ORO-0] Deletion function. * [ORO-0] Multi processor count. * [ORO-0] Extended the sort to more than 8 bits. Implemented tests. * [ORO-0] Moved temp buffer allocation out from the sort(). * [ORO-0] README. References. * [ORO-0] Debug flag. * Refactor the code to add the basic constructs to support selecting different scan algorithms. Add different implementation of the scan algorithm: CPU, single WG and all WG . Signed-off-by: Chih-Chen Kao <[email protected]> * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <[email protected]> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * Optimization: Implement the single-pass kernel for GPU parallel scan. Fix a GPU memory bug. Signed-off-by: Chih-Chen Kao <[email protected]> * Feature/oro 0 kernel cache (#4) * [ORO-0] Cache kernel. 
* [ORO-0] Support newer HIP builds on windows (#22) * [ORO-0] Unit test. (#23) * Fix LDS scan bug. The previous implementation would lead to an error when the wavefront (wrap) size is not equal to the size of a workgroup (block). Since not all threads run simultaneously, for an input arrays larger than the wavefront size, the previous algorithm will not work because it performs the scan in-place on the input array. The results of one wavefront (wrap) will be overwritten by work items (threads) in another wavefront (wrap). Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize the LDS scan algorithm. (#6) * Optimize the LDS scan algorithm. This version does not require a temp buffer and can support a LDS input size up to 2 times the workgroup size. Signed-off-by: Chih-Chen Kao <[email protected]> * Support an input array in LDS that is 2 times the WG size. Signed-off-by: Chih-Chen Kao <[email protected]> * Feature/oro 0 clean up (#7) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <[email protected]> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * Feature/oro 0 clean up (#10) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <[email protected]> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * [ORO-0] SortKernel1. Less complex. (#8) SortKernel (occupancy: 8) - vgpr: 128 - lds: 6704 SortKernel1 (occupancy: 9) - vgpr: 106 - lds 7720 * [ORO-0] Kernel execution time check. * Fix the memory access pattern and change it to coalesced memory access. (#11) Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Single kernel sort for small keys. (#12) * Optimize the Count kernel for less LDS usage to achieve full occupancy (#13) * Optimize the Count kernel to let it use less LDS and could achieve full occupancy. 
Signed-off-by: Chih-Chen Kao <[email protected]> * Remove __threadfence_block() Removes the boundary check in the inner loop. The upper bound is set only once before going into the loop. Signed-off-by: Chih-Chen Kao <[email protected]> * Introduce DRIVER and RTC APIs * Disable enum-variant * Improve paths * Add fields * Update Vulkan test * Define CUDA in terms of DRIVER and RTC * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <[email protected]> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: takahiroharada <[email protected]> * Merging another merge (#18) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <[email protected]> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <[email protected]> * Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15) * Calculate the number of WGs based on LDS and max-thread-per-WGP. Signed-off-by: Chih-Chen Kao <[email protected]> * Add a workaround for CUDA. 
Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <[email protected]> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Implement key-value pair sorting (#17) * Add gitignore to the repository Signed-off-by: Chih-Chen Kao <[email protected]> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <[email protected]> * Add basic structure for key-value pair sorting. Fix an error in single pass sort Signed-off-by: Chih-Chen Kao <[email protected]> * Add Value data in the test and sort it according to keys. Signed-off-by: Chih-Chen Kao <[email protected]> * Support Key only sorting. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Make single pass kernel non compile time switch. * Support both Key-Only & Key-Value pair sort kernels Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Test change. * [ORO-0] A bug. * [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible. Co-authored-by: Takahiro Harada <[email protected]> Co-authored-by: takahiroharada <[email protected]> * [ORO-0] Revert demo code. * Fix missing CUDA properties. (#26) * Update Orochi.cpp * [ORO-0] Clean up. * [ORO-0] OroUtils. (#27) * [ORO-0] OroUtils. * [ORO-0] Linux build fix. 
* [ORO-0] Forgot to add. * [ORO-0] Linux build fix. * [ORO-0] Clean up. Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> * Add kernel path and include dir to the functions. (#20) Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] BakeKernel. (#21) * [ORO-0] BakeKernel. * Update tools/genArgs.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/genArgs.py dead code removal * Update tools/stringify.py dead code removal * fix include Signed-off-by: Chih-Chen Kao <[email protected]> * fix script Signed-off-by: Chih-Chen Kao <[email protected]> * fix Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Fix Orochi CUDA API (#23) Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Linux build fix. (#22) * [ORO-0] Linux build fix. * Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Quick fix for old linux gcc which does not support std::exclusive_scan (#24) Quick fix for old linux gcc which does not support std::exclusive_scan Signed-off-by: Chih-Chen Kao <[email protected]> * Fix the kernel cache bug. (#25) Fix the kernel cache bug. The function should not return the oroFunctions that are created previously solely based on the names because they might be invalid. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Remove static variables. (#26) * [ORO-0] Remove static variables. * [ORO-0] Applied the suggestions. * [ORO-0] Linux regression fix. 
* Fix OrochiUtils::getFunctionFromString API (#27) Signed-off-by: Chih-Chen Kao <[email protected]> * Adding missing assert (#28) * Adding missing assert * Adding more asserts * Feature/oro 0 gpuopen merge (#31) * Fix oroGetDeviceProperties in cuda path. * Fix linux crash (#29) * [ORO-0] Added missing file. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Fix hipGetErrorString (#32) * [ORO-0] Fix hipGetErrorString It was incorrectly importing this API. Import the correct API in hipew. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33) * Add missing path on Apple config. (#34) * [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38) * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Add hiprtc.dll and comgr dll Co-authored-by: takahiroharada <[email protected]> * fix footnote markdown format (#39) * Fix orochi utils issue in unit tests Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: PixelClear <[email protected]> Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> Co-authored-by: Daniel Meister <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: PixelClear <[email protected]> * [ORO-0] bitcode/cubin linking APIs (#40) * [ORO-0] Link apis. 
* [ORO-0] Forgot to add. * [ORO-0] Linking test. * [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize * [ORO-0] Update link unit tests with comments * [ORO-0] Change test for CUBIN instead of PTX * [ORO-0] Fix loadfile to use binary mode, remove printf in kernel * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Load amdhip first, then hiprtc. * [ORO-0] Remove assert from hiprtc library checks * [ORO-0] Add gfx1030 bitcode for navi21 * [MNN-0] Fix premake and add more link testcases * [ORO-0] Update a link_null_name testcase * [ORO-0] Make unit tests more stable on CUDA * [ORO-0] Update bitcode for gfx1030 * [ORO-0] Add bitcodes for navi1,2, vega * [ORO-0] Add hiprtc.dll and comgr dll * [ORO-0] Add gfx906 bitcodes * [ORO-0] Support unit tests on both HIP and CUDA * [ORO-0] Update dlls and bitcodes * [ORO-0] Update bitcodes and generation script * [ORO-0] Minor fixes in bundled bitcode unit tests * [ORO-0] Fix typo in options * [ORO-0] Fix getCUBIN/PTX signatures * [ORO-0] Fix unit tests and generate fatbin for CUDA * [ORO-0] Regenerate fatbin and fix script * [ORO-0] Cleanup * [ORO-0] Update bundled bitcodes to only contain navi21 for now * [ORO-0] Updated bundled bitcode * [ORO-0] add ORO_LAUNCH_PARAMS_* * [ORO-0] Add unit test for orortcLinkAddFile * [ORO-0] Add unittest scripts for TC * [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA * [ORO-0] Add bitcode+bundled bitcode link test * [ORO-0] Cleanup * [ORO-0] Fix typo in script * [ORO-0] Update linux TC script Co-authored-by: takahiroharada <[email protected]> * [ORO-0] Get global memory size for CUDA (#44) * [ORO-0] Update HIP dll's for bitcode linking support * [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46) * [ORO-0] Update HIP dll's for bitcode linking support * [ORO-0] Add getLoweredName testcase * [ORO-0] Update unittest filter * [ORO-0] Update loweredName test * [ORO-0] Add missing test kernel * 
[ORO-0] Fix loweredName test * [ORO-0] Fix linux compilation * [ORO-0] Remove printf from test kernel (#37) * [ORO-0] Allow usage of libhiprtc64.so if exists * [ORO-0] Fix linux loading of libhiprtc.so Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Takahiro Harada <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> Co-authored-by: Daniel Meister <[email protected]> Co-authored-by: PixelClear <[email protected]> * Feature/oro 0 radix sort stream (#34) * Initial commit * Streams to the configuration * Mutex in OrochiUtils * Feature/oro 0 radix sort mutex baking (#36) * Locking other methods in OrochiUtils * Removing mutex from static methods * Making mutex and map static * Removing static from OrochiUtils * Removing static from OrochiUtils * Support Precompiled Kernels in Orochi (#37) * Add bitcode support: getFunctionFromPrecompiledBinary Signed-off-by: Chih-Chen Kao <[email protected]> * Add bitcode and the script to generate it. Signed-off-by: Chih-Chen Kao <[email protected]> * rewrite OROASSERT. Fix include file order. 
Signed-off-by: Chih-Chen Kao <[email protected]> * Use string instead of const char* Signed-off-by: Chih-Chen Kao <[email protected]> * Rename the option from bitcode to precompiled Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Add bitcode script for nvidia fatbin * [ORO-0] CUDA - hipfb->fatbin rename Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Aaryaman Vasishta <[email protected]> * Feature/oro 0 resource limits (#38) * Adding limit functions * Removing enum * Removing enum * Limit enum * char string Windows API (#39) * [ORO-0] Update precompiled radix sort kernels to use -ffast-math (#42) * [ORO-0] Update precompiled radix sort kernels to use -ffast-math * [ORO-0] Update RadixSort fatbin for NVIDIA and use fast math * [ORO-0] Function pointer test. (#40) * [ORO-0] Function pointer test. * [ORO-0] launch2d. * [ORO-0] Event, OroStopwatch. * Implement GpuMemory to handle device memory operations. Signed-off-by: Chih-Chen Kao <[email protected]> * Sync with GPUOpen/LibrariesAndSDKs/Orochi (#44) * Fix oroGetDeviceProperties in cuda path. * Fix linux crash (#29) * [ORO-0] Added missing file. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Fix hipGetErrorString (#32) * [ORO-0] Fix hipGetErrorString It was incorrectly importing this API. Import the correct API in hipew. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33) * Add missing path on Apple config. (#34) * [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38) * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. 
* [ORO-0] Created win64 subdir. * [ORO-0] Add hiprtc.dll and comgr dll Co-authored-by: takahiroharada <[email protected]> * fix footnote markdown format (#39) * Feature/oro 0 amdadvtech merge (#43) * Add gitignore to the repository Signed-off-by: Chih-Chen Kao <[email protected]> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <[email protected]> * Feature/oro 0 radix sort (#19) * [ORO-0] Working 8 bit radix sort. * [ORO-0] Some optimization. * Create LICENSE * Update README.md (#15) * Feature/oro 0 raw get set (#19) * [ORO-0] Rename setter and getter. * [ORO-0] Fix when there is a dll but no device. * [ORO-0] Deletion function. * [ORO-0] Multi processor count. * [ORO-0] Extended the sort to more than 8 bits. Implemented tests. * [ORO-0] Moved temp buffer allocation out from the sort(). * [ORO-0] README. References. * [ORO-0] Debug flag. * Refactor the code to add the basic constructs to support selecting different scan algorithms. Add different implementation of the scan algorithm: CPU, single WG and all WG . Signed-off-by: Chih-Chen Kao <[email protected]> * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <[email protected]> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * Optimization: Implement the single-pass kernel for GPU parallel scan. Fix a GPU memory bug. Signed-off-by: Chih-Chen Kao <[email protected]> * Feature/oro 0 kernel cache (#4) * [ORO-0] Cache kernel. * [ORO-0] Support newer HIP builds on windows (#22) * [ORO-0] Unit test. (#23) * Fix LDS scan bug. The previous implementation would lead to an error when the wavefront (wrap) size is not equal to the size of a workgroup (block). Since not all threads run simultaneously, for an input arrays larger than the wavefront size, the previous algorithm will not work because it performs the scan in-place on the input array. 
The results of one wavefront (wrap) will be overwritten by work items (threads) in another wavefront (wrap). Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize the LDS scan algorithm. (#6) * Optimize the LDS scan algorithm. This version does not require a temp buffer and can support a LDS input size up to 2 times the workgroup size. Signed-off-by: Chih-Chen Kao <[email protected]> * Support an input array in LDS that is 2 times the WG size. Signed-off-by: Chih-Chen Kao <[email protected]> * Feature/oro 0 clean up (#7) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <[email protected]> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * Feature/oro 0 clean up (#10) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <[email protected]> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * [ORO-0] SortKernel1. Less complex. (#8) SortKernel (occupancy: 8) - vgpr: 128 - lds: 6704 SortKernel1 (occupancy: 9) - vgpr: 106 - lds 7720 * [ORO-0] Kernel execution time check. * Fix the memory access pattern and change it to coalesced memory access. (#11) Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Single kernel sort for small keys. (#12) * Optimize the Count kernel for less LDS usage to achieve full occupancy (#13) * Optimize the Count kernel to let it use less LDS and could achieve full occupancy. Signed-off-by: Chih-Chen Kao <[email protected]> * Remove __threadfence_block() Removes the boundary check in the inner loop. The upper bound is set only once before going into the loop. Signed-off-by: Chih-Chen Kao <[email protected]> * Introduce DRIVER and RTC APIs * Disable enum-variant * Improve paths * Add fields * Update Vulkan test * Define CUDA in terms of DRIVER and RTC * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. 
(#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <[email protected]> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: takahiroharada <[email protected]> * Merging another merge (#18) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <[email protected]> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <[email protected]> * Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15) * Calculate the number of WGs based on LDS and max-thread-per-WGP. Signed-off-by: Chih-Chen Kao <[email protected]> * Add a workaround for CUDA. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <[email protected]> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Print driver version. * [ORO-0] Repro case. 
* Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Implement key-value pair sorting (#17) * Add gitignore to the repository Signed-off-by: Chih-Chen Kao <[email protected]> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <[email protected]> * Add basic structure for key-value pair sorting. Fix an error in single pass sort Signed-off-by: Chih-Chen Kao <[email protected]> * Add Value data in the test and sort it according to keys. Signed-off-by: Chih-Chen Kao <[email protected]> * Support Key only sorting. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Make single pass kernel non compile time switch. * Support both Key-Only & Key-Value pair sort kernels Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Test change. * [ORO-0] A bug. * [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible. Co-authored-by: Takahiro Harada <[email protected]> Co-authored-by: takahiroharada <[email protected]> * [ORO-0] Revert demo code. * Fix missing CUDA properties. (#26) * Update Orochi.cpp * [ORO-0] Clean up. * [ORO-0] OroUtils. (#27) * [ORO-0] OroUtils. * [ORO-0] Linux build fix. * [ORO-0] Forgot to add. * [ORO-0] Linux build fix. * [ORO-0] Clean up. Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> * Add kernel path and include dir to the functions. (#20) Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] BakeKernel. (#21) * [ORO-0] BakeKernel. 
* Update tools/genArgs.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/genArgs.py dead code removal * Update tools/stringify.py dead code removal * fix include Signed-off-by: Chih-Chen Kao <[email protected]> * fix script Signed-off-by: Chih-Chen Kao <[email protected]> * fix Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Fix Orochi CUDA API (#23) Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Linux build fix. (#22) * [ORO-0] Linux build fix. * Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Quick fix for old linux gcc which does not support std::exclusive_scan (#24) Quick fix for old linux gcc which does not support std::exclusive_scan Signed-off-by: Chih-Chen Kao <[email protected]> * Fix the kernel cache bug. (#25) Fix the kernel cache bug. The function should not return the oroFunctions that are created previously solely based on the names because they might be invalid. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Remove static variables. (#26) * [ORO-0] Remove static variables. * [ORO-0] Applied the suggestions. * [ORO-0] Linux regression fix. * Fix OrochiUtils::getFunctionFromString API (#27) Signed-off-by: Chih-Chen Kao <[email protected]> * Adding missing assert (#28) * Adding missing assert * Adding more asserts * Feature/oro 0 gpuopen merge (#31) * Fix oroGetDeviceProperties in cuda path. * Fix linux crash (#29) * [ORO-0] Added missing file. 
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Fix hipGetErrorString (#32) * [ORO-0] Fix hipGetErrorString It was incorrectly importing this API. Import the correct API in hipew. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33) * Add missing path on Apple config. (#34) * [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38) * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Add hiprtc.dll and comgr dll Co-authored-by: takahiroharada <[email protected]> * fix footnote markdown format (#39) * Fix orochi utils issue in unit tests Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: PixelClear <[email protected]> Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> Co-authored-by: Daniel Meister <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: PixelClear <[email protected]> * [ORO-0] bitcode/cubin linking APIs (#40) * [ORO-0] Link apis. * [ORO-0] Forgot to add. * [ORO-0] Linking test. 
* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize * [ORO-0] Update link unit tests with comments * [ORO-0] Change test for CUBIN instead of PTX * [ORO-0] Fix loadfile to use binary mode, remove printf in kernel * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Load amdhip first, then hiprtc. * [ORO-0] Remove assert from hiprtc library checks * [ORO-0] Add gfx1030 bitcode for navi21 * [MNN-0] Fix premake and add more link testcases * [ORO-0] Update a link_null_name testcase * [ORO-0] Make unit tests more stable on CUDA * [ORO-0] Update bitcode for gfx1030 * [ORO-0] Add bitcodes for navi1,2, vega * [ORO-0] Add hiprtc.dll and comgr dll * [ORO-0] Add gfx906 bitcodes * [ORO-0] Support unit tests on both HIP and CUDA * [ORO-0] Update dlls and bitcodes * [ORO-0] Update bitcodes and generation script * [ORO-0] Minor fixes in bundled bitcode unit tests * [ORO-0] Fix typo in options * [ORO-0] Fix getCUBIN/PTX signatures * [ORO-0] Fix unit tests and generate fatbin for CUDA * [ORO-0] Regenerate fatbin and fix script * [ORO-0] Cleanup * [ORO-0] Update bundled bitcodes to only contain navi21 for now * [ORO-0] Updated bundled bitcode * [ORO-0] add ORO_LAUNCH_PARAMS_* * [ORO-0] Add unit test for orortcLinkAddFile * [ORO-0] Add unittest scripts for TC * [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA * [ORO-0] Add bitcode+bundled bitcode link test * [ORO-0] Cleanup * [ORO-0] Fix typo in script * [ORO-0] Update linux TC script Co-authored-by: takahiroharada <[email protected]> * [ORO-0] Get global memory size for CUDA (#44) * [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46) * [ORO-0] Update HIP dll's for bitcode linking support * [ORO-0] Add getLoweredName testcase * [ORO-0] Update unittest filter * [ORO-0] Update loweredName test * [ORO-0] Add missing test kernel * [ORO-0] Fix loweredName test * [ORO-0] Fix linux compilation * [ORO-0] Remove printf from test kernel 
(#37) * [ORO-0] Fix linux loading of libhiprtc.so (#49) * [ORO-0] Update test scripts (#50) * [ORO-0] Update scripts for linux (#51) * [ORO-0] Add new scripts (#52) * [ORO-0] Add new scripts * [ORO-0] Add execute permissions to scripts * Fix Unit Test: getErrorString (#54) Signed-off-by: Chih-Chen Kao <[email protected]> Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Support hiprtc0504 (#55) * [ORO-0] Update hiprtc and orortc error codes (#57) * [ORO-0] Update test scripts to delete cache before running (#58) * [ORO-0] Update hiprtc dlls * [ORO-0] Support gfx1100,gfx1102 for radix sort kernel precompilation * Fix apt python installation (#63) Update checkout version Signed-off-by: Chih-Chen Kao <[email protected]> Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] OrochiUtils update. (#61) * [ORO-0] Add WMMA test (#62) * [ORO-0] Add WMMA test * [ORO-0] Add a comment for WMMA * [ORO-0] Cleanup * [ORO-0] Add a couple more comments * [ORO-0] Remove hip_runtime include * [ORO-0] Cleanup * [ORO-0] Fix comment * [ORO-0] Add Copyright notice * [ORO-0] Load binary from the directory where DLL is. * [ORO-0] Fix for linux. --------- Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Takahiro Harada <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> Co-authored-by: Daniel Meister <[email protected]> Co-authored-by: PixelClear <[email protected]> * [ORO-0] Remove unnecessary template. * [ORO-0] Clean up. Added python script kernelCompile.py for compilation. (#46) * [ORO-0] Clean up. Added python script kernelCompile.py for compilation. * [ORO-0] hipsdk should be next to orochi dir. 
* Update ParallelPrimitives/RadixSortKernels.h Remove commented line --------- Co-authored-by: Chih-Chen Kao <[email protected]> * [ORO-0] add automatic arch selection (#47) * [ORO-0] add automatic arch selection * [ORO-0] Refactor and error output when it cannot find llc. --------- Co-authored-by: takahiroharada <[email protected]> * Feature/oro 0 flexible rtc error handling cherrypick (#48) * add a handler for RTC load failure case on cuda. * [ORO-0] add a handler for RTC load failure case on hip. * [ORO-0] add cuda 12.0 sdk in nvrtc path * [ORO-0] Remove non bundled bitcode tests. Clean up. * [ORO-0] Clean up. * [ORO-0] Add hiprtcGetBitcodeSize back. * Update Orochi.cpp * Update Orochi.cpp * [ORO-0] Fix for multi-GPU/iGPU * [HIPSDK-0] compute-22.40-osdb/36/ * [ORO-0] compute-23.10-osdb/9/ * [ORO-0] Update dll names * [ORO-0] implement new test for managed memory, enable managed memory api, fix all warnings and cleanup * [ORO-0] fix compile issues * [ORO-0] fix declaration of oroManagedMalloc * [ORO-0] change streaming kernel * [ORO-0] enable it on windows too * [ORO-0] add more asserts * [ORO-0] update kernel * [ORO-0] add host copy times * [ORO-0] add malloc times * Refactor Count Signed-off-by: Chih-Chen Kao <[email protected]> * Refactor Radix Sort class: - Now the tmp buffer is allocated internally. - All GPU memory buffers are changed to the GpuMemory class - `configure` will now calculate the total number of GPU blocks for the count and the scan kernel - The client does not need to call configure explicitly - Refactor function parameters - Remove count reference kernel Signed-off-by: Chih-Chen Kao <[email protected]> * This commit does the following: - Support setting the number of threads per block (a.k.a. block size) dynamically - Refactor `exclusiveScanCpu` - Extend `printKernelInfo`. 
Signed-off-by: Chih-Chen Kao <[email protected]> * The 1st working example for the radix sort optimization Signed-off-by: Chih-Chen Kao <[email protected]> * Support configuring dynamic "NUM_WARPS_PER_BLOCK" in the sort kernel Compute the optimal number of inputs for each block to handle. Refactor the usage of stopwatch Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] add hiprtc future dll names in hiprtc path * Add linux paths and dll names (#66) * [ORO-0] Change path and rtc dll names * [ORO-0] Make scripts executable * [ORO-0] Add hiprtc path * [ORO-0] Remove ParallelPrimitives, test/radix sort * [ORO-0] Edit premake --------- Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: Takahiro Harada <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: Daniel Meister <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: PixelClear <[email protected]> Co-authored-by: Richard Geslot <[email protected]> Co-authored-by: Atsushi Yoshimura <[email protected]> Co-authored-by: Atsushi.Yoshimura <[email protected]>
1 parent df8a401 commit b209cc1

16 files changed, +287 -50 lines changed

.gitattributes

+1
@@ -0,0 +1 @@
+*.dll filter=lfs diff=lfs merge=lfs -text

Orochi/Orochi.cpp

+5
@@ -569,6 +569,11 @@ oroError OROAPI oroMemAllocPitch(oroDeviceptr* dptr, size_t* pPitch, size_t Widt
 {
     return oroErrorUnknown;
 }
+oroError OROAPI oroMallocManaged(oroDeviceptr* dptr, size_t bytesize, oroManagedMemoryAttachFlags flags)
+{
+    __ORO_FUNC1( MemAllocManaged((CUdeviceptr*)dptr, bytesize, (CUmemAttach_flags_enum)flags), MallocManaged( dptr, bytesize, (HIPmemAttach_flags_enum)flags ) );
+    return oroErrorUnknown;
+}
 oroError OROAPI oroFree(oroDeviceptr dptr)
 {
     __ORO_FUNC1( MemFree( dptr ), Free( dptr ) );
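The new `oroMallocManaged` follows the same shape as its neighbours: one macro invocation carrying a CUDA call and a HIP call, then a fallback `return oroErrorUnknown;`. The definition of `__ORO_FUNC1` is not part of this diff, so the following is only a minimal sketch of how such a dispatch could work, assuming the macro returns from the enclosing function whenever a backend is active. Every name in it (`Api`, `g_api`, `DISPATCH`, the stub functions) is hypothetical and not taken from Orochi.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins; none of these names come from Orochi.
enum class Api { Cuda, Hip, None };
enum Error { Success = 0, ErrorUnknown = 999 };

static Api g_api = Api::None;  // which backend is active, chosen at init time
static std::string g_lastCall; // records which stub ran, for illustration only

// Stub backends that pretend to allocate and report success (0).
inline int cudaAllocManagedStub( void** p, std::size_t ) { g_lastCall = "cuda"; *p = nullptr; return 0; }
inline int hipAllocManagedStub( void** p, std::size_t ) { g_lastCall = "hip"; *p = nullptr; return 0; }

// Assumed behaviour: return the active backend's result from the enclosing
// function; fall through when no backend is active.
#define DISPATCH( cudaCall, hipCall )                                  \
    if( g_api == Api::Cuda ) return static_cast<Error>( cudaCall );    \
    if( g_api == Api::Hip ) return static_cast<Error>( hipCall );

Error allocManaged( void** p, std::size_t n )
{
    DISPATCH( cudaAllocManagedStub( p, n ), hipAllocManagedStub( p, n ) );
    return ErrorUnknown; // reached only when no backend is active
}
```

Under that assumption, the trailing `return` in the real function acts as the "no API loaded" path rather than as dead code.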

Orochi/Orochi.h

+8-1
@@ -496,6 +496,13 @@ typedef enum OROmem_range_attribute_enum {
     ORO_MEM_RANGE_ATTRIBUTE_LAST_PREFETCH_LOCATION = 4,
 } PPmem_range_attribute;

+typedef enum oroManagedMemoryAttachFlags
+{
+    oroMemAttachGlobal = 0x1,
+    oroMemAttachHost = 0x2,
+    oroMemAttachSingle = 0x4,
+}oroManagedMemoryAttachFlags;
+
 typedef enum oroJitOption {
     oroJitOptionMaxRegisters = 0,
     oroJitOptionThreadsPerBlock,
@@ -641,6 +648,7 @@ oroError OROAPI oroMemGetInfo(size_t* free, size_t* total);
 oroError OROAPI oroMalloc(oroDeviceptr* dptr, size_t bytesize);
 oroError OROAPI oroMalloc2(oroDeviceptr* dptr, size_t bytesize);
 oroError OROAPI oroMemAllocPitch(oroDeviceptr* dptr, size_t* pPitch, size_t WidthInBytes, size_t Height, unsigned int ElementSizeBytes);
+oroError OROAPI oroMallocManaged(oroDeviceptr* dptr, size_t bytesize, oroManagedMemoryAttachFlags flags);
 oroError OROAPI oroFree(oroDeviceptr dptr);
 oroError OROAPI oroFree2(oroDeviceptr dptr);
 //oroError OROAPI oroMemGetAddressRange(oroDeviceptr* pbase, size_t* psize, oroDeviceptr dptr);
@@ -650,7 +658,6 @@ oroError OROAPI oroFree2(oroDeviceptr dptr);
 oroError OROAPI oroHostRegister(void* p, size_t bytesize, unsigned int Flags);
 oroError OROAPI oroHostGetDevicePointer(oroDeviceptr* pdptr, void* p, unsigned int Flags);
 //oroError OROAPI oroHostGetFlags(unsigned int* pFlags, void* p);
-//oroError OROAPI oroMallocManaged(oroDeviceptr* dptr, size_t bytesize, unsigned int flags);
 //oroError OROAPI oroDeviceGetByPCIBusId(hipDevice_t* dev, const char* pciBusId);
 //oroError OROAPI oroDeviceGetPCIBusId(char* pciBusId, int len, hipDevice_t dev);
 oroError OROAPI oroHostUnregister(void* p);
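The implementation in Orochi.cpp passes `flags` to each backend with a plain C-style cast, which is only safe if the numeric values of `oroManagedMemoryAttachFlags` agree with the backend's own attach flags. A minimal sketch of how that assumption could be pinned down at compile time is shown here. The `oro*` values are copied from the hunk above; `BackendAttachFlags` is a hypothetical stand-in whose values (0x1, 0x2, 0x4) are assumed to mirror the CUDA driver's `CU_MEM_ATTACH_GLOBAL`, `CU_MEM_ATTACH_HOST` and `CU_MEM_ATTACH_SINGLE`; real code would check against the actual backend headers instead.

```cpp
#include <cassert>

// Values copied from the enum added in Orochi.h.
enum oroManagedMemoryAttachFlags
{
    oroMemAttachGlobal = 0x1,
    oroMemAttachHost = 0x2,
    oroMemAttachSingle = 0x4,
};

// Hypothetical stand-in for a backend enum; the values are an assumption.
enum BackendAttachFlags
{
    BackendAttachGlobal = 0x1,
    BackendAttachHost = 0x2,
    BackendAttachSingle = 0x4,
};

// If either side ever changes a value, the build fails instead of the cast
// silently selecting a different attach mode.
static_assert( static_cast<int>( oroMemAttachGlobal ) == static_cast<int>( BackendAttachGlobal ), "global flag mismatch" );
static_assert( static_cast<int>( oroMemAttachHost ) == static_cast<int>( BackendAttachHost ), "host flag mismatch" );
static_assert( static_cast<int>( oroMemAttachSingle ) == static_cast<int>( BackendAttachSingle ), "single flag mismatch" );

// The conversion itself is then a value-preserving cast.
constexpr BackendAttachFlags toBackend( oroManagedMemoryAttachFlags f )
{
    return static_cast<BackendAttachFlags>( f );
}
```

Compared with the commented-out declaration that took `unsigned int flags`, the typed enum also keeps callers from passing arbitrary integers without an explicit cast.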

Orochi/OrochiUtils.cpp

-4
@@ -374,10 +374,6 @@ struct OrochiUtilsImpl
     static std::string getCacheName( const std::string& path, const std::string& kernelname ) noexcept { return path + kernelname; }
 };

-OrochiUtils::OrochiUtils() { m_cacheDirectory = "./cache/"; }
-
-OrochiUtils::~OrochiUtils() {}
-
 bool OrochiUtils::readSourceCode( const std::string& path, std::string& sourceCode, std::vector<std::string>* includes ) { return OrochiUtilsImpl::readSourceCode( path, sourceCode, includes ); }

 oroFunction OrochiUtils::getFunctionFromFile( oroDevice device, const char* path, const char* funcName, std::vector<const char*>* optsIn )
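The out-of-line constructor removed here only assigned `m_cacheDirectory`, and the destructor was empty. The header side of this commit replaces them with defaulted special members, deleted copy and move operations, and a default member initializer. A minimal sketch of that pattern on a hypothetical `KernelCache` class (not Orochi code; only the member names echo the diff) could look like this:

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <type_traits>

// Hypothetical class illustrating the pattern; not part of Orochi.
class KernelCache
{
  public:
    KernelCache() = default;
    KernelCache( const KernelCache& ) = delete;
    KernelCache& operator=( const KernelCache& ) = delete;
    KernelCache( KernelCache&& ) = delete;
    KernelCache& operator=( KernelCache&& ) = delete;
    ~KernelCache() = default;

    // The default member initializer replaces the assignment that used to
    // live in the constructor body.
    std::string m_cacheDirectory = "./cache/";
    std::recursive_mutex m_mutex;
};

// The compiler now enforces what the declarations state.
static_assert( std::is_default_constructible<KernelCache>::value, "must be default constructible" );
static_assert( !std::is_copy_constructible<KernelCache>::value, "must not be copyable" );
static_assert( !std::is_move_constructible<KernelCache>::value, "must not be movable" );
```

A `std::recursive_mutex` member already makes the implicit copy and move operations unavailable, so deleting them explicitly mainly documents the intent and gives a clearer diagnostic at the call site.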

Orochi/OrochiUtils.h

+15-4
@@ -33,8 +33,12 @@ class OrochiUtils
     int x, y, z, w;
 };

-OrochiUtils();
-~OrochiUtils();
+OrochiUtils() = default;
+OrochiUtils(const OrochiUtils&) = delete;
+OrochiUtils& operator=(const OrochiUtils&) = delete;
+OrochiUtils(OrochiUtils&&) = delete;
+OrochiUtils& operator=(OrochiUtils&&) = delete;
+~OrochiUtils() = default;

 oroFunction getFunctionFromPrecompiledBinary( const std::string& path, const std::string& funcName );

@@ -50,12 +54,19 @@ class OrochiUtils
 static void launch2D( oroFunction func, int nx, int ny, const void** args, int wgSizeX = 8, int wgSizeY = 8, unsigned int sharedMemBytes = 0, oroStream stream = 0 );

 template<typename T>
-static void malloc( T*& ptr, int n )
+static void malloc( T*& ptr, size_t n )
 {
     oroError e = oroMalloc( (oroDeviceptr*)&ptr, sizeof( T ) * n );
     OROASSERT( e == oroSuccess, 0 );
 }

+template<typename T>
+static void mallocManaged( T*& ptr, size_t n, oroManagedMemoryAttachFlags flags )
+{
+    oroError e = oroMallocManaged( (oroDeviceptr*)&ptr, sizeof( T ) * n, flags );
+    OROASSERT( e == oroSuccess, 0 );
+}
+
 template<typename T>
 static void free( T* ptr )
 {
@@ -123,7 +134,7 @@ class OrochiUtils
 }

 public:
-std::string m_cacheDirectory;
+std::string m_cacheDirectory = "./cache/";
 std::recursive_mutex m_mutex;
 std::unordered_map<std::string, oroFunction> m_kernelMap;
 };
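The `malloc` helper now takes its element count as `size_t` instead of `int`, and the new `mallocManaged` does the same. With an `int` count, a caller cannot request more than `INT_MAX` elements, and on typical platforms a negative count converts to a very large unsigned value inside `sizeof( T ) * n`. The sketch below shows the same typed-wrapper shape on the host only. `fakeMalloc` and `typedMalloc` are hypothetical names, and `std::malloc` stands in for the device allocator so the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical host-side stand-in for a device allocator: 0 means success.
inline int fakeMalloc( void** out, std::size_t bytes )
{
    *out = std::malloc( bytes );
    return *out != nullptr ? 0 : 1;
}

// Byte size of n elements of T, computed entirely in size_t.
template<typename T>
constexpr std::size_t byteCount( std::size_t n )
{
    return sizeof( T ) * n;
}

// Typed wrapper in the spirit of the helper above: the count is size_t, so
// the multiplication never passes through a signed int.
template<typename T>
bool typedMalloc( T*& ptr, std::size_t n )
{
    void* raw = nullptr;
    const bool ok = fakeMalloc( &raw, byteCount<T>( n ) ) == 0;
    ptr = static_cast<T*>( raw );
    return ok;
}
```

This sketch returns a `bool` rather than asserting, which is a deliberate difference from the helper in the diff: it leaves the failure policy to the caller.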
