* Fix oroGetDeviceProperties in cuda path.
* Fix linux crash (#29)
* [ORO-0] Added missing file.
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)
* [ORO-0] Skip compilation of vulkan test on Linux
* [ORO-0] Update kernelExec unit test - remove printf
* [ORO-0] Remove cout
* [ORO-0] Fix hipGetErrorString (#32)
* [ORO-0] Fix hipGetErrorString
It was incorrectly importing this API. Import the correct API in hipew.
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)
* [ORO-0] Skip compilation of vulkan test on Linux
* [ORO-0] Update kernelExec unit test - remove printf
* [ORO-0] Remove cout
* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)
* Add missing path on Apple config. (#34)
* [ORO-0] Adding hiprtc+comgr dlls to work around the regression in 22.7.1 driver (#38)
* [ORO-0] Adding hiprtc to work around the regression in the 22.7.1 driver released on 7/26/2022.
* [ORO-0] Created win64 subdir.
* [ORO-0] Add hiprtc.dll and comgr dll
Co-authored-by: takahiroharada <[email protected]>
* fix footnote markdown format (#39)
* Feature/oro 0 amdadvtech merge (#43)
* Add gitignore to the repository
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix missing CUDA properties. (#16)
Signed-off-by: Chih-Chen Kao <[email protected]>
* Feature/oro 0 radix sort (#19)
* [ORO-0] Working 8 bit radix sort.
* [ORO-0] Some optimization.
* Create LICENSE
* Update README.md (#15)
* Feature/oro 0 raw get set (#19)
* [ORO-0] Rename setter and getter.
* [ORO-0] Fix when there is a dll but no device.
* [ORO-0] Deletion function.
* [ORO-0] Multi processor count.
* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.
* [ORO-0] Moved temp buffer allocation out from the sort().
* [ORO-0] README. References.
* [ORO-0] Debug flag.
* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WG.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Squashed commit of the following:
commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date: Wed Apr 13 10:48:35 2022 -0700
[ORO-0] Fix nvrtc.
* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Feature/oro 0 kernel cache (#4)
* [ORO-0] Cache kernel.
* [ORO-0] Support newer HIP builds on windows (#22)
* [ORO-0] Unit test. (#23)
* Fix LDS scan bug.
The previous implementation produced incorrect results when the wavefront (warp) size was not equal to the size of a workgroup (block).
Since not all threads run simultaneously, the previous algorithm does not work for an input array larger than the wavefront size,
because it performs the scan in place on the input array: the results of one wavefront (warp) are overwritten by work items (threads) in another wavefront (warp).
Signed-off-by: Chih-Chen Kao <[email protected]>
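The hazard described above can be reproduced on the CPU. The following is a minimal sketch, not the Orochi kernel: it assumes a Hillis-Steele style inclusive scan, and the function names `scanInPlace` and `scanDoubleBuffered` are hypothetical. Work items are simulated in groups of `waveSize`; each group reads all of its inputs before writing (lockstep), but groups run one after another with no barrier between them. Double buffering is shown as one common remedy and is not necessarily what this commit does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// In-place scan, simulated wave by wave. Within a wave, all reads happen
// before any write; a later wave can therefore read values that an earlier
// wave already updated in the same step.
std::vector<int> scanInPlace(std::vector<int> data, std::size_t waveSize)
{
    assert(waveSize > 0);
    const std::size_t n = data.size();
    for (std::size_t d = 1; d < n; d *= 2)
    {
        for (std::size_t base = 0; base < n; base += waveSize)
        {
            const std::size_t end = std::min(base + waveSize, n);
            std::vector<int> regs(end - base); // per-work-item registers
            for (std::size_t i = base; i < end; ++i)
                regs[i - base] = data[i] + (i >= d ? data[i - d] : 0);
            for (std::size_t i = base; i < end; ++i)
                data[i] = regs[i - base];
        }
    }
    return data;
}

// Same scan, but every step reads from one buffer and writes to another,
// so the order in which waves run no longer matters.
std::vector<int> scanDoubleBuffered(std::vector<int> data)
{
    const std::size_t n = data.size();
    std::vector<int> next(n);
    for (std::size_t d = 1; d < n; d *= 2)
    {
        for (std::size_t i = 0; i < n; ++i)
            next[i] = data[i] + (i >= d ? data[i - d] : 0);
        data.swap(next);
    }
    return data;
}
```

With a wave size equal to the array length, the in-place version matches the double-buffered one; with a smaller wave size the two diverge, which is the failure mode the commit message describes.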
* Optimize the LDS scan algorithm. (#6)
* Optimize the LDS scan algorithm.
This version does not require a temp buffer and can support an LDS input size of up to 2 times the workgroup size.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Support an input array in LDS that is 2 times the WG size.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Feature/oro 0 clean up (#7)
* Squashed commit of the following:
commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date: Wed Apr 13 10:48:35 2022 -0700
[ORO-0] Fix nvrtc.
* [ORO-0] Clean up.
* Feature/oro 0 clean up (#10)
* Squashed commit of the following:
commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date: Wed Apr 13 10:48:35 2022 -0700
[ORO-0] Fix nvrtc.
* [ORO-0] Clean up.
* [ORO-0] SortKernel1. Less complex. (#8)
SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720
* [ORO-0] Kernel execution time check.
* Fix the memory access pattern and change it to coalesced memory access. (#11)
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Single kernel sort for small keys. (#12)
* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)
* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Remove __threadfence_block()
Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.
Signed-off-by: Chih-Chen Kao <[email protected]>
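As an illustration of setting the upper bound once before the loop, here is a minimal CPU-side sketch. It is not the Orochi kernel; `sumItems` and its parameters are hypothetical names chosen for the example. The range is clamped to the array size a single time, so the loop body needs no per-iteration bounds check.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sums up to itemsPerWorkItem elements starting at begin.
// The end of the range is clamped once, outside the loop.
int sumItems(const std::vector<int>& items, std::size_t begin, std::size_t itemsPerWorkItem)
{
    const std::size_t end = std::min(begin + itemsPerWorkItem, items.size());
    int sum = 0;
    for (std::size_t i = begin; i < end; ++i) // no bounds check inside the body
        sum += items[i];
    return sum;
}
```

If `begin` is already past the end of the array, the clamped `end` is smaller than `begin` and the loop does not run.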
* Introduce DRIVER and RTC APIs
* Disable enum-variant
* Improve paths
* Add fields
* Update Vulkan test
* Define CUDA in terms of DRIVER and RTC
* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)
* Fix a minor issue in CountKernel to make it more robust.
Implement a single-pass 8-bit local sort.
Implement a single-pass 8-bit local sort with shared bins.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix nItemsPerWI and enable the version with shared LDS.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Print driver version.
* [ORO-0] Repro case.
* Fix SORT_WG_SIZE.
Fix stable sort order.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
* Merging another merge (#18)
* Fix a minor issue in CountKernel to make it more robust.
Implement a single-pass 8-bit local sort.
Implement a single-pass 8-bit local sort with shared bins.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix nItemsPerWI and enable the version with shared LDS.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Print driver version.
* [ORO-0] Repro case.
* Fix SORT_WG_SIZE.
Fix stable sort order.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)
* Calculate the number of WGs based on LDS and max-thread-per-WGP.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Add a workaround for CUDA.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)
* Fix a minor issue in CountKernel to make it more robust.
Implement a single-pass 8-bit local sort.
Implement a single-pass 8-bit local sort with shared bins.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix nItemsPerWI and enable the version with shared LDS.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Print driver version.
* [ORO-0] Repro case.
* Fix SORT_WG_SIZE.
Fix stable sort order.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
* Implement key-value pair sorting (#17)
* Add gitignore to the repository
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix missing CUDA properties. (#16)
Signed-off-by: Chih-Chen Kao <[email protected]>
* Add basic structure for key-value pair sorting.
Fix an error in single pass sort
Signed-off-by: Chih-Chen Kao <[email protected]>
* Add Value data in the test and sort it according to keys.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Support Key only sorting.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Make single pass kernel non compile time switch.
* Support both Key-Only & Key-Value pair sort kernels
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Test change.
* [ORO-0] A bug.
* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.
Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
* [ORO-0] Revert demo code.
* Fix missing CUDA properties. (#26)
* Update Orochi.cpp
* [ORO-0] Clean up.
* [ORO-0] OroUtils. (#27)
* [ORO-0] OroUtils.
* [ORO-0] Linux build fix.
* [ORO-0] Forgot to add.
* [ORO-0] Linux build fix.
* [ORO-0] Clean up.
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
* Add kernel path and include dir to the functions. (#20)
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] BakeKernel. (#21)
* [ORO-0] BakeKernel.
* Update tools/genArgs.py
commented code removal
* Update tools/stringify.py
commented code removal
* Update tools/stringify.py
commented code removal
* Update tools/stringify.py
commented code removal
* Update tools/genArgs.py
dead code removal
* Update tools/stringify.py
dead code removal
* fix include
Signed-off-by: Chih-Chen Kao <[email protected]>
* fix script
Signed-off-by: Chih-Chen Kao <[email protected]>
* fix
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
* Fix Orochi CUDA API (#23)
Fix Orochi CUDA API
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Linux build fix. (#22)
* [ORO-0] Linux build fix.
* Fix Orochi CUDA API
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)
Quick fix for old linux gcc which does not support std::exclusive_scan
Signed-off-by: Chih-Chen Kao <[email protected]>
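For toolchains whose standard library lacks `std::exclusive_scan`, a hand-written loop is a common substitute. The sketch below is an assumption about what such a fallback can look like, not the code from this commit; `exclusiveScan` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] = init + in[0] + ... + in[i-1].
// Reads in[i] before writing out[i], so in and out may be the same vector.
template <typename T>
void exclusiveScan(const std::vector<T>& in, std::vector<T>& out, T init)
{
    out.resize(in.size());
    T running = init;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        const T value = in[i];
        out[i] = running;
        running += value;
    }
}
```

Calling it with `{3, 1, 4, 1, 5}` and an initial value of 0 yields `{0, 3, 4, 8, 9}`.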
* Fix the kernel cache bug. (#25)
Fix the kernel cache bug.
The function should not return previously created oroFunctions based solely on their names, because they might be invalid.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Remove static variables. (#26)
* [ORO-0] Remove static variables.
* [ORO-0] Applied the suggestions.
* [ORO-0] Linux regression fix.
* Fix OrochiUtils::getFunctionFromString API (#27)
Signed-off-by: Chih-Chen Kao <[email protected]>
* Adding missing assert (#28)
* Adding missing assert
* Adding more asserts
* Feature/oro 0 gpuopen merge (#31)
* Fix oroGetDeviceProperties in cuda path.
* Fix linux crash (#29)
* [ORO-0] Added missing file.
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)
* [ORO-0] Skip compilation of vulkan test on Linux
* [ORO-0] Update kernelExec unit test - remove printf
* [ORO-0] Remove cout
* [ORO-0] Fix hipGetErrorString (#32)
* [ORO-0] Fix hipGetErrorString
It was incorrectly importing this API. Import the correct API in hipew.
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)
* [ORO-0] Skip compilation of vulkan test on Linux
* [ORO-0] Update kernelExec unit test - remove printf
* [ORO-0] Remove cout
* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)
* Add missing path on Apple config. (#34)
* [ORO-0] Adding hiprtc+comgr dlls to work around the regression in 22.7.1 driver (#38)
* [ORO-0] Adding hiprtc to work around the regression in the 22.7.1 driver released on 7/26/2022.
* [ORO-0] Created win64 subdir.
* [ORO-0] Add hiprtc.dll and comgr dll
Co-authored-by: takahiroharada <[email protected]>
* fix footnote markdown format (#39)
* Fix orochi utils issue in unit tests
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>
* [ORO-0] bitcode/cubin linking APIs (#40)
* [ORO-0] Link apis.
* [ORO-0] Forgot to add.
* [ORO-0] Linking test.
* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize
* [ORO-0] Update link unit tests with comments
* [ORO-0] Change test for CUBIN instead of PTX
* [ORO-0] Fix loadfile to use binary mode, remove printf in kernel
* [ORO-0] Adding hiprtc to work around the regression in the 22.7.1 driver released on 7/26/2022.
* [ORO-0] Created win64 subdir.
* [ORO-0] Load amdhip first, then hiprtc.
* [ORO-0] Remove assert from hiprtc library checks
* [ORO-0] Add gfx1030 bitcode for navi21
* [MNN-0] Fix premake and add more link testcases
* [ORO-0] Update a link_null_name testcase
* [ORO-0] Make unit tests more stable on CUDA
* [ORO-0] Update bitcode for gfx1030
* [ORO-0] Add bitcodes for navi1,2, vega
* [ORO-0] Add hiprtc.dll and comgr dll
* [ORO-0] Add gfx906 bitcodes
* [ORO-0] Support unit tests on both HIP and CUDA
* [ORO-0] Update dlls and bitcodes
* [ORO-0] Update bitcodes and generation script
* [ORO-0] Minor fixes in bundled bitcode unit tests
* [ORO-0] Fix typo in options
* [ORO-0] Fix getCUBIN/PTX signatures
* [ORO-0] Fix unit tests and generate fatbin for CUDA
* [ORO-0] Regenerate fatbin and fix script
* [ORO-0] Cleanup
* [ORO-0] Update bundled bitcodes to only contain navi21 for now
* [ORO-0] Updated bundled bitcode
* [ORO-0] add ORO_LAUNCH_PARAMS_*
* [ORO-0] Add unit test for orortcLinkAddFile
* [ORO-0] Add unittest scripts for TC
* [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA
* [ORO-0] Add bitcode+bundled bitcode link test
* [ORO-0] Cleanup
* [ORO-0] Fix typo in script
* [ORO-0] Update linux TC script
Co-authored-by: takahiroharada <[email protected]>
* [ORO-0] Get global memory size for CUDA (#44)
* [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46)
* [ORO-0] Update HIP dll's for bitcode linking support
* [ORO-0] Add getLoweredName testcase
* [ORO-0] Update unittest filter
* [ORO-0] Update loweredName test
* [ORO-0] Add missing test kernel
* [ORO-0] Fix loweredName test
* [ORO-0] Fix linux compilation
* [ORO-0] Remove printf from test kernel (#37)
* [ORO-0] Fix linux loading of libhiprtc.so (#49)
* [ORO-0] Update test scripts (#50)
* [ORO-0] Update scripts for linux (#51)
* [ORO-0] Add new scripts (#52)
* [ORO-0] Add new scripts
* [ORO-0] Add execute permissions to scripts
* Fix Unit Test: getErrorString (#54)
Signed-off-by: Chih-Chen Kao <[email protected]>
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Support hiprtc0504 (#55)
* [ORO-0] Update hiprtc and orortc error codes (#57)
* [ORO-0] Update test scripts to delete cache before running (#58)
* [ORO-0] Update hiprtc dlls
* [ORO-0] Support gfx1100,gfx1102 for radix sort kernel precompilation
* Fix apt python installation (#63)
Update checkout version
Signed-off-by: Chih-Chen Kao <[email protected]>
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] OrochiUtils update. (#61)
* [ORO-0] Add WMMA test (#62)
* [ORO-0] Add WMMA test
* [ORO-0] Add a comment for WMMA
* [ORO-0] Cleanup
* [ORO-0] Add a couple more comments
* [ORO-0] Remove hip_runtime include
* [ORO-0] Cleanup
* [ORO-0] Fix comment
* [ORO-0] Add Copyright notice
* [ORO-0] Load binary from the directory where DLL is.
* [ORO-0] Fix for linux.
---------
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: PixelClear <[email protected]>