Commit b209cc1
Sync with amdadvtech/Orochi (#73)
* Add gitignore to the repository
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix missing CUDA properties. (#16)
Signed-off-by: Chih-Chen Kao <[email protected]>
* Feature/oro 0 radix sort (#19)
* [ORO-0] Working 8 bit radix sort.
* [ORO-0] Some optimization.
* Create LICENSE
* Update README.md (#15)
* Feature/oro 0 raw get set (#19)
* [ORO-0] Rename setter and getter.
* [ORO-0] Fix when there is a dll but no device.
* [ORO-0] Deletion function.
* [ORO-0] Multi processor count.
* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.
* [ORO-0] Moved temp buffer allocation out from the sort().
* [ORO-0] README. References.
* [ORO-0] Debug flag.
* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WG.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Squashed commit of the following:
commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date: Wed Apr 13 10:48:35 2022 -0700
[ORO-0] Fix nvrtc.
* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Feature/oro 0 kernel cache (#4)
* [ORO-0] Cache kernel.
* [ORO-0] Support newer HIP builds on windows (#22)
* [ORO-0] Unit test. (#23)
* Fix LDS scan bug.
The previous implementation would produce an error when the wavefront (warp) size is not equal to the size of a workgroup (block).
Since not all threads run simultaneously, for input arrays larger than the wavefront size, the previous algorithm will not work
because it performs the scan in-place on the input array. The results of one wavefront (warp) will be overwritten by work items (threads) in another wavefront (warp).
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize the LDS scan algorithm. (#6)
* Optimize the LDS scan algorithm.
This version does not require a temp buffer and can support an LDS input size up to 2 times the workgroup size.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Support an input array in LDS that is 2 times the WG size.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Feature/oro 0 clean up (#7)
* Squashed commit of the following:
commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date: Wed Apr 13 10:48:35 2022 -0700
[ORO-0] Fix nvrtc.
* [ORO-0] Clean up.
* Feature/oro 0 clean up (#10)
* Squashed commit of the following:
commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date: Wed Apr 13 10:48:35 2022 -0700
[ORO-0] Fix nvrtc.
* [ORO-0] Clean up.
* [ORO-0] SortKernel1. Less complex. (#8)
SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720
* [ORO-0] Kernel execution time check.
* Fix the memory access pattern and change it to coalesced memory access. (#11)
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Single kernel sort for small keys. (#12)
* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)
* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Remove __threadfence_block()
Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Introduce DRIVER and RTC APIs
* Disable enum-variant
* Improve paths
* Add fields
* Update Vulkan test
* Define CUDA in terms of DRIVER and RTC
* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)
* Fix a minor issue in CountKernel to make it more robust.
Implement a single-pass 8-bit local sort.
Implement a single-pass 8-bit local sort with shared bins.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix nItemsPerWI and enable the version with shared LDS.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Print driver version.
* [ORO-0] Repro case.
* Fix SORT_WG_SIZE.
Fix stable sort order.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
* Merging another merge (#18)
* Fix a minor issue in CountKernel to make it more robust.
Implement a single-pass 8-bit local sort.
Implement a single-pass 8-bit local sort with shared bins.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix nItemsPerWI and enable the version with shared LDS.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Print driver version.
* [ORO-0] Repro case.
* Fix SORT_WG_SIZE.
Fix stable sort order.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)
* Calculate the number of WGs based on LDS and max-thread-per-WGP.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Add a workaround for CUDA.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)
* Fix a minor issue in CountKernel to make it more robust.
Implement a single-pass 8-bit local sort.
Implement a single-pass 8-bit local sort with shared bins.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix nItemsPerWI and enable the version with shared LDS.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Print driver version.
* [ORO-0] Repro case.
* Fix SORT_WG_SIZE.
Fix stable sort order.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
* Implement key-value pair sorting (#17)
* Add gitignore to the repository
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix missing CUDA properties. (#16)
Signed-off-by: Chih-Chen Kao <[email protected]>
* Add basic structure for key-value pair sorting.
Fix an error in single pass sort
Signed-off-by: Chih-Chen Kao <[email protected]>
* Add Value data in the test and sort it according to keys.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Support Key only sorting.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Make single pass kernel non compile time switch.
* Support both Key-Only & Key-Value pair sort kernels
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Test change.
* [ORO-0] A bug.
* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.
Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
* [ORO-0] Revert demo code.
* Fix missing CUDA properties. (#26)
* Update Orochi.cpp
* [ORO-0] Clean up.
* [ORO-0] OroUtils. (#27)
* [ORO-0] OroUtils.
* [ORO-0] Linux build fix.
* [ORO-0] Forgot to add.
* [ORO-0] Linux build fix.
* [ORO-0] Clean up.
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
* Add kernel path and include dir to the functions. (#20)
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] BakeKernel. (#21)
* [ORO-0] BakeKernel.
* Update tools/genArgs.py
commented code removal
* Update tools/stringify.py
commented code removal
* Update tools/stringify.py
commented code removal
* Update tools/stringify.py
commented code removal
* Update tools/genArgs.py
dead code removal
* Update tools/stringify.py
dead code removal
* fix include
Signed-off-by: Chih-Chen Kao <[email protected]>
* fix script
Signed-off-by: Chih-Chen Kao <[email protected]>
* fix
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
* Fix Orochi CUDA API (#23)
Fix Orochi CUDA API
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Linux build fix. (#22)
* [ORO-0] Linux build fix.
* Fix Orochi CUDA API
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)
Quick fix for old linux gcc which does not support std::exclusive_scan
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix the kernel cache bug. (#25)
Fix the kernel cache bug.
The function should not return previously created oroFunctions based solely on their names because they might be invalid.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Remove static variables. (#26)
* [ORO-0] Remove static variables.
* [ORO-0] Applied the suggestions.
* [ORO-0] Linux regression fix.
* Fix OrochiUtils::getFunctionFromString API (#27)
Signed-off-by: Chih-Chen Kao <[email protected]>
* Adding missing assert (#28)
* Adding missing assert
* Adding more asserts
* Feature/oro 0 gpuopen merge (#31)
* Fix oroGetDeviceProperties in cuda path.
* Fix linux crash (#29)
* [ORO-0] Added missing file.
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)
* [ORO-0] Skip compilation of vulkan test on Linux
* [ORO-0] Update kernelExec unit test - remove printf
* [ORO-0] Remove cout
* [ORO-0] Fix hipGetErrorString (#32)
* [ORO-0] Fix hipGetErrorString
It was incorrectly importing this API. Import the correct API in hipew.
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)
* [ORO-0] Skip compilation of vulkan test on Linux
* [ORO-0] Update kernelExec unit test - remove printf
* [ORO-0] Remove cout
* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)
* Add missing path on Apple config. (#34)
* [ORO-0] Adding hiprtc+comgr dlls to work around the regression in 22.7.1 driver (#38)
* [ORO-0] Adding hiprtc to work around the regression in 22.7.1 driver released on 7/26/2022.
* [ORO-0] Created win64 subdir.
* [ORO-0] Add hiprtc.dll and comgr dll
Co-authored-by: takahiroharada <[email protected]>
* fix footnote markdown format (#39)
* Fix orochi utils issue in unit tests
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>
* remove space after -I (#33)
* Feature/oro 0 gpuopen merge 2 (#32)
* Fix oroGetDeviceProperties in cuda path.
* Fix linux crash (#29)
* [ORO-0] Added missing file.
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)
* [ORO-0] Skip compilation of vulkan test on Linux
* [ORO-0] Update kernelExec unit test - remove printf
* [ORO-0] Remove cout
* [ORO-0] Fix hipGetErrorString (#32)
* [ORO-0] Fix hipGetErrorString
It was incorrectly importing this API. Import the correct API in hipew.
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)
* [ORO-0] Skip compilation of vulkan test on Linux
* [ORO-0] Update kernelExec unit test - remove printf
* [ORO-0] Remove cout
* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)
* Add missing path on Apple config. (#34)
* [ORO-0] Adding hiprtc+comgr dlls to work around the regression in 22.7.1 driver (#38)
* [ORO-0] Adding hiprtc to work around the regression in 22.7.1 driver released on 7/26/2022.
* [ORO-0] Created win64 subdir.
* [ORO-0] Add hiprtc.dll and comgr dll
Co-authored-by: takahiroharada <[email protected]>
* fix footnote markdown format (#39)
* Feature/oro 0 amdadvtech merge (#43)
* Add gitignore to the repository
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix missing CUDA properties. (#16)
Signed-off-by: Chih-Chen Kao <[email protected]>
* Feature/oro 0 radix sort (#19)
* [ORO-0] Working 8 bit radix sort.
* [ORO-0] Some optimization.
* Create LICENSE
* Update README.md (#15)
* Feature/oro 0 raw get set (#19)
* [ORO-0] Rename setter and getter.
* [ORO-0] Fix when there is a dll but no device.
* [ORO-0] Deletion function.
* [ORO-0] Multi processor count.
* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.
* [ORO-0] Moved temp buffer allocation out from the sort().
* [ORO-0] README. References.
* [ORO-0] Debug flag.
* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WG.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Squashed commit of the following:
commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date: Wed Apr 13 10:48:35 2022 -0700
[ORO-0] Fix nvrtc.
* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Feature/oro 0 kernel cache (#4)
* [ORO-0] Cache kernel.
* [ORO-0] Support newer HIP builds on windows (#22)
* [ORO-0] Unit test. (#23)
* Fix LDS scan bug.
The previous implementation would produce an error when the wavefront (warp) size is not equal to the size of a workgroup (block).
Since not all threads run simultaneously, for input arrays larger than the wavefront size, the previous algorithm will not work
because it performs the scan in-place on the input array. The results of one wavefront (warp) will be overwritten by work items (threads) in another wavefront (warp).
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize the LDS scan algorithm. (#6)
* Optimize the LDS scan algorithm.
This version does not require a temp buffer and can support an LDS input size up to 2 times the workgroup size.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Support an input array in LDS that is 2 times the WG size.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Feature/oro 0 clean up (#7)
* Squashed commit of the following:
commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date: Wed Apr 13 10:48:35 2022 -0700
[ORO-0] Fix nvrtc.
* [ORO-0] Clean up.
* Feature/oro 0 clean up (#10)
* Squashed commit of the following:
commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date: Wed Apr 13 10:48:35 2022 -0700
[ORO-0] Fix nvrtc.
* [ORO-0] Clean up.
* [ORO-0] SortKernel1. Less complex. (#8)
SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720
* [ORO-0] Kernel execution time check.
* Fix the memory access pattern and change it to coalesced memory access. (#11)
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Single kernel sort for small keys. (#12)
* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)
* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Remove __threadfence_block()
Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Introduce DRIVER and RTC APIs
* Disable enum-variant
* Improve paths
* Add fields
* Update Vulkan test
* Define CUDA in terms of DRIVER and RTC
* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)
* Fix a minor issue in CountKernel to make it more robust.
Implement a single-pass 8-bit local sort.
Implement a single-pass 8-bit local sort with shared bins.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix nItemsPerWI and enable the version with shared LDS.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Print driver version.
* [ORO-0] Repro case.
* Fix SORT_WG_SIZE.
Fix stable sort order.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
* Merging another merge (#18)
* Fix a minor issue in CountKernel to make it more robust.
Implement a single-pass 8-bit local sort.
Implement a single-pass 8-bit local sort with shared bins.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix nItemsPerWI and enable the version with shared LDS.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Print driver version.
* [ORO-0] Repro case.
* Fix SORT_WG_SIZE.
Fix stable sort order.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)
* Calculate the number of WGs based on LDS and max-thread-per-WGP.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Add a workaround for CUDA.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)
* Fix a minor issue in CountKernel to make it more robust.
Implement a single-pass 8-bit local sort.
Implement a single-pass 8-bit local sort with shared bins.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix nItemsPerWI and enable the version with shared LDS.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Print driver version.
* [ORO-0] Repro case.
* Fix SORT_WG_SIZE.
Fix stable sort order.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
* Implement key-value pair sorting (#17)
* Add gitignore to the repository
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix missing CUDA properties. (#16)
Signed-off-by: Chih-Chen Kao <[email protected]>
* Add basic structure for key-value pair sorting.
Fix an error in single pass sort
Signed-off-by: Chih-Chen Kao <[email protected]>
* Add Value data in the test and sort it according to keys.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Support Key only sorting.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Make single pass kernel non compile time switch.
* Support both Key-Only & Key-Value pair sort kernels
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Test change.
* [ORO-0] A bug.
* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.
Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
* [ORO-0] Revert demo code.
* Fix missing CUDA properties. (#26)
* Update Orochi.cpp
* [ORO-0] Clean up.
* [ORO-0] OroUtils. (#27)
* [ORO-0] OroUtils.
* [ORO-0] Linux build fix.
* [ORO-0] Forgot to add.
* [ORO-0] Linux build fix.
* [ORO-0] Clean up.
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
* Add kernel path and include dir to the functions. (#20)
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] BakeKernel. (#21)
* [ORO-0] BakeKernel.
* Update tools/genArgs.py
commented code removal
* Update tools/stringify.py
commented code removal
* Update tools/stringify.py
commented code removal
* Update tools/stringify.py
commented code removal
* Update tools/genArgs.py
dead code removal
* Update tools/stringify.py
dead code removal
* fix include
Signed-off-by: Chih-Chen Kao <[email protected]>
* fix script
Signed-off-by: Chih-Chen Kao <[email protected]>
* fix
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
* Fix Orochi CUDA API (#23)
Fix Orochi CUDA API
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Linux build fix. (#22)
* [ORO-0] Linux build fix.
* Fix Orochi CUDA API
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)
Quick fix for old linux gcc which does not support std::exclusive_scan
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix the kernel cache bug. (#25)
Fix the kernel cache bug.
The function should not return previously created oroFunctions based solely on their names because they might be invalid.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Remove static variables. (#26)
* [ORO-0] Remove static variables.
* [ORO-0] Applied the suggestions.
* [ORO-0] Linux regression fix.
* Fix OrochiUtils::getFunctionFromString API (#27)
Signed-off-by: Chih-Chen Kao <[email protected]>
* Adding missing assert (#28)
* Adding missing assert
* Adding more asserts
* Feature/oro 0 gpuopen merge (#31)
* Fix oroGetDeviceProperties in cuda path.
* Fix linux crash (#29)
* [ORO-0] Added missing file.
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)
* [ORO-0] Skip compilation of vulkan test on Linux
* [ORO-0] Update kernelExec unit test - remove printf
* [ORO-0] Remove cout
* [ORO-0] Fix hipGetErrorString (#32)
* [ORO-0] Fix hipGetErrorString
It was incorrectly importing this API. Import the correct API in hipew.
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)
* [ORO-0] Skip compilation of vulkan test on Linux
* [ORO-0] Update kernelExec unit test - remove printf
* [ORO-0] Remove cout
* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)
* Add missing path on Apple config. (#34)
* [ORO-0] Adding hiprtc+comgr dlls to work around the regression in 22.7.1 driver (#38)
* [ORO-0] Adding hiprtc to work around the regression in 22.7.1 driver released on 7/26/2022.
* [ORO-0] Created win64 subdir.
* [ORO-0] Add hiprtc.dll and comgr dll
Co-authored-by: takahiroharada <[email protected]>
* fix footnote markdown format (#39)
* Fix orochi utils issue in unit tests
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>
* [ORO-0] bitcode/cubin linking APIs (#40)
* [ORO-0] Link apis.
* [ORO-0] Forgot to add.
* [ORO-0] Linking test.
* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize
* [ORO-0] Update link unit tests with comments
* [ORO-0] Change test for CUBIN instead of PTX
* [ORO-0] Fix loadfile to use binary mode, remove printf in kernel
* [ORO-0] Adding hiprtc to work around the regression in 22.7.1 driver released on 7/26/2022.
* [ORO-0] Created win64 subdir.
* [ORO-0] Load amdhip first, then hiprtc.
* [ORO-0] Remove assert from hiprtc library checks
* [ORO-0] Add gfx1030 bitcode for navi21
* [MNN-0] Fix premake and add more link testcases
* [ORO-0] Update a link_null_name testcase
* [ORO-0] Make unit tests more stable on CUDA
* [ORO-0] Update bitcode for gfx1030
* [ORO-0] Add bitcodes for navi1,2, vega
* [ORO-0] Add hiprtc.dll and comgr dll
* [ORO-0] Add gfx906 bitcodes
* [ORO-0] Support unit tests on both HIP and CUDA
* [ORO-0] Update dlls and bitcodes
* [ORO-0] Update bitcodes and generation script
* [ORO-0] Minor fixes in bundled bitcode unit tests
* [ORO-0] Fix typo in options
* [ORO-0] Fix getCUBIN/PTX signatures
* [ORO-0] Fix unit tests and generate fatbin for CUDA
* [ORO-0] Regenerate fatbin and fix script
* [ORO-0] Cleanup
* [ORO-0] Update bundled bitcodes to only contain navi21 for now
* [ORO-0] Updated bundled bitcode
* [ORO-0] add ORO_LAUNCH_PARAMS_*
* [ORO-0] Add unit test for orortcLinkAddFile
* [ORO-0] Add unittest scripts for TC
* [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA
* [ORO-0] Add bitcode+bundled bitcode link test
* [ORO-0] Cleanup
* [ORO-0] Fix typo in script
* [ORO-0] Update linux TC script
Co-authored-by: takahiroharada <[email protected]>
* [ORO-0] Get global memory size for CUDA (#44)
* [ORO-0] Update HIP dll's for bitcode linking support
* [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46)
* [ORO-0] Update HIP dll's for bitcode linking support
* [ORO-0] Add getLoweredName testcase
* [ORO-0] Update unittest filter
* [ORO-0] Update loweredName test
* [ORO-0] Add missing test kernel
* [ORO-0] Fix loweredName test
* [ORO-0] Fix linux compilation
* [ORO-0] Remove printf from test kernel (#37)
* [ORO-0] Allow usage of libhiprtc64.so if exists
* [ORO-0] Fix linux loading of libhiprtc.so
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: PixelClear <[email protected]>
* Feature/oro 0 radix sort stream (#34)
* Initial commit
* Streams to the configuration
* Mutex in OrochiUtils
* Feature/oro 0 radix sort mutex baking (#36)
* Locking other methods in OrochiUtils
* Removing mutex from static methods
* Making mutex and map static
* Removing static from OrochiUtils
* Removing static from OrochiUtils
* Support Precompiled Kernels in Orochi (#37)
* Add bitcode support: getFunctionFromPrecompiledBinary
Signed-off-by: Chih-Chen Kao <[email protected]>
* Add bitcode and the script to generate it.
Signed-off-by: Chih-Chen Kao <[email protected]>
* rewrite OROASSERT. Fix include file order.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Use string instead of const char*
Signed-off-by: Chih-Chen Kao <[email protected]>
* Rename the option from bitcode to precompiled
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Add bitcode script for nvidia fatbin
* [ORO-0] CUDA - hipfb->fatbin rename
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
* Feature/oro 0 resource limits (#38)
* Adding limit functions
* Removing enum
* Removing enum
* Limit enum
* char string Windows API (#39)
* [ORO-0] Update precompiled radix sort kernels to use -ffast-math (#42)
* [ORO-0] Update precompiled radix sort kernels to use -ffast-math
* [ORO-0] Update RadixSort fatbin for NVIDIA and use fast math
* [ORO-0] Function pointer test. (#40)
* [ORO-0] Function pointer test.
* [ORO-0] launch2d.
* [ORO-0] Event, OroStopwatch.
* Implement GpuMemory to handle device memory operations.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Sync with GPUOpen/LibrariesAndSDKs/Orochi (#44)
* Fix oroGetDeviceProperties in cuda path.
* Fix linux crash (#29)
* [ORO-0] Added missing file.
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)
* [ORO-0] Skip compilation of vulkan test on Linux
* [ORO-0] Update kernelExec unit test - remove printf
* [ORO-0] Remove cout
* [ORO-0] Fix hipGetErrorString (#32)
* [ORO-0] Fix hipGetErrorString
It was incorrectly importing this API. Import the correct API in hipew.
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)
* [ORO-0] Skip compilation of vulkan test on Linux
* [ORO-0] Update kernelExec unit test - remove printf
* [ORO-0] Remove cout
* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)
* Add missing path on Apple config. (#34)
* [ORO-0] Adding hiprtc+comgr dlls to work around the regression in 22.7.1 driver (#38)
* [ORO-0] Adding hiprtc to work around the regression in 22.7.1 driver released on 7/26/2022.
* [ORO-0] Created win64 subdir.
* [ORO-0] Add hiprtc.dll and comgr dll
Co-authored-by: takahiroharada <[email protected]>
* fix footnote markdown format (#39)
* Feature/oro 0 amdadvtech merge (#43)
* Add gitignore to the repository
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix missing CUDA properties. (#16)
Signed-off-by: Chih-Chen Kao <[email protected]>
* Feature/oro 0 radix sort (#19)
* [ORO-0] Working 8 bit radix sort.
* [ORO-0] Some optimization.
* Create LICENSE
* Update README.md (#15)
* Feature/oro 0 raw get set (#19)
* [ORO-0] Rename setter and getter.
* [ORO-0] Fix when there is a dll but no device.
* [ORO-0] Deletion function.
* [ORO-0] Multi processor count.
* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.
* [ORO-0] Moved temp buffer allocation out from the sort().
* [ORO-0] README. References.
* [ORO-0] Debug flag.
* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WG.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Squashed commit of the following:
commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date: Wed Apr 13 10:48:35 2022 -0700
[ORO-0] Fix nvrtc.
* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Feature/oro 0 kernel cache (#4)
* [ORO-0] Cache kernel.
* [ORO-0] Support newer HIP builds on windows (#22)
* [ORO-0] Unit test. (#23)
* Fix LDS scan bug.
The previous implementation would produce an error when the wavefront (warp) size is not equal to the size of a workgroup (block).
Since not all threads run simultaneously, for input arrays larger than the wavefront size, the previous algorithm will not work
because it performs the scan in-place on the input array. The results of one wavefront (warp) will be overwritten by work items (threads) in another wavefront (warp).
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize the LDS scan algorithm. (#6)
* Optimize the LDS scan algorithm.
This version does not require a temp buffer and can support an LDS input size up to 2 times the workgroup size.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Support an input array in LDS that is 2 times the WG size.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Feature/oro 0 clean up (#7)
* Squashed commit of the following:
commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date: Wed Apr 13 10:48:35 2022 -0700
[ORO-0] Fix nvrtc.
* [ORO-0] Clean up.
* Feature/oro 0 clean up (#10)
* Squashed commit of the following:
commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date: Wed Apr 13 10:48:35 2022 -0700
[ORO-0] Fix nvrtc.
* [ORO-0] Clean up.
* [ORO-0] SortKernel1. Less complex. (#8)
SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720
* [ORO-0] Kernel execution time check.
* Fix the memory access pattern and change it to coalesced memory access. (#11)
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Single kernel sort for small keys. (#12)
* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)
* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Remove __threadfence_block()
Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Introduce DRIVER and RTC APIs
* Disable enum-variant
* Improve paths
* Add fields
* Update Vulkan test
* Define CUDA in terms of DRIVER and RTC
* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)
* Fix a minor issue in CountKernel to make it more robust.
Implement a single-pass 8-bit local sort.
Implement a single-pass 8-bit local sort with shared bins.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix nItemsPerWI and enable the version with shared LDS.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Print driver version.
* [ORO-0] Repro case.
* Fix SORT_WG_SIZE.
Fix stable sort order.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
* Merging another merge (#18)
* Fix a minor issue in CountKernel to make it more robust.
Implement a single-pass 8-bit local sort.
Implement a single-pass 8-bit local sort with shared bins.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix nItemsPerWI and enable the version with shared LDS.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Print driver version.
* [ORO-0] Repro case.
* Fix SORT_WG_SIZE.
Fix stable sort order.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)
* Calculate the number of WGs based on LDS and max-thread-per-WGP.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Add a workaround for CUDA.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)
* Fix a minor issue in CountKernel to make it more robust.
Implement a single-pass 8-bit local sort.
Implement a single-pass 8-bit local sort with shared bins.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix nItemsPerWI and enable the version with shared LDS.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Print driver version.
* [ORO-0] Repro case.
* Fix SORT_WG_SIZE.
Fix stable sort order.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
* Implement key-value pair sorting (#17)
* Add gitignore to the repository
Signed-off-by: Chih-Chen Kao <[email protected]>
* Fix missing CUDA properties. (#16)
Signed-off-by: Chih-Chen Kao <[email protected]>
* Add basic structure for key-value pair sorting.
Fix an error in single pass sort
Signed-off-by: Chih-Chen Kao <[email protected]>
* Add Value data in the test and sort it according to keys.
Signed-off-by: Chih-Chen Kao <[email protected]>
* Support Key only sorting.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Make single pass kernel non compile time switch.
* Support both Key-Only & Key-Value pair sort kernels
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Test change.
* [ORO-0] A bug.
* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.
Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
* [ORO-0] Revert demo code.
* Fix missing CUDA properties. (#26)
* Update Orochi.cpp
* [ORO-0] Clean up.
* [ORO-0] OroUtils. (#27)
* [ORO-0] OroUtils.
* [ORO-0] Linux build fix.
* [ORO-0] Forgot to add.
* [ORO-0] Linux build fix.
* [ORO-0] Clean up.
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
* Add kernel path and include dir to the functions. (#20)
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] BakeKernel. (#21)
* [ORO-0] BakeKernel.
* Update tools/genArgs.py
commented code removal
* Update tools/stringify.py
commented code removal
* Update tools/stringify.py
commented code removal
* Update tools/stringify.py
commented code removal
* Update tools/genArgs.py
dead code removal
* Update tools/stringify.py
dead code removal
* fix include
Signed-off-by: Chih-Chen Kao <[email protected]>
* fix script
Signed-off-by: Chih-Chen Kao <[email protected]>
* fix
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
* Fix Orochi CUDA API (#23)
Fix Orochi CUDA API
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Linux build fix. (#22)
* [ORO-0] Linux build fix.
* Fix Orochi CUDA API
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)
Quick fix for old linux gcc which does not support std::exclusive_scan
Signed-off-by: Chih-Chen Kao <[email protected]>
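For context, `std::exclusive_scan` is a C++17 algorithm from `<numeric>` that older GCC standard libraries do not ship. The commit does not show how the quick fix was written; the following is only a minimal sketch of one possible fallback, a sequential loop. The name `exclusiveScanFallback` is hypothetical and is not taken from Orochi.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (an assumption, not Orochi's actual code):
// sequential exclusive prefix sum that avoids std::exclusive_scan.
// out[i] = init + in[0] + ... + in[i-1], so out[0] == init.
template <typename T>
std::vector<T> exclusiveScanFallback(const std::vector<T>& in, T init = T{})
{
    std::vector<T> out(in.size());
    T running = init;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out[i] = running;  // sum of all elements strictly before index i
        running += in[i];  // fold in the current element for the next slot
    }
    return out;
}
```

Under these assumptions, an input of `{3, 1, 4, 1, 5}` with the default initial value would give `{0, 3, 4, 8, 9}`.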
* Fix the kernel cache bug. (#25)
Fix the kernel cache bug.
The function should not return previously created oroFunctions based solely on their names, because they might be invalid.
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Remove static variables. (#26)
* [ORO-0] Remove static variables.
* [ORO-0] Applied the suggestions.
* [ORO-0] Linux regression fix.
* Fix OrochiUtils::getFunctionFromString API (#27)
Signed-off-by: Chih-Chen Kao <[email protected]>
* Adding missing assert (#28)
* Adding missing assert
* Adding more asserts
* Feature/oro 0 gpuopen merge (#31)
* Fix oroGetDeviceProperties in cuda path.
* Fix linux crash (#29)
* [ORO-0] Added missing file.
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)
* [ORO-0] Skip compilation of vulkan test on Linux
* [ORO-0] Update kernelExec unit test - remove printf
* [ORO-0] Remove cout
* [ORO-0] Fix hipGetErrorString (#32)
* [ORO-0] Fix hipGetErrorString
It was incorrectly importing this API. Import the correct API in hipew.
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)
* [ORO-0] Skip compilation of vulkan test on Linux
* [ORO-0] Update kernelExec unit test - remove printf
* [ORO-0] Remove cout
* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)
* Add missing path on Apple config. (#34)
* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)
* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.
* [ORO-0] Created win64 subdir.
* [ORO-0] Add hiprtc.dll and comgr dll
Co-authored-by: takahiroharada <[email protected]>
* fix footnote markdown format (#39)
* Fix orochi utils issue in unit tests
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>
* [ORO-0] bitcode/cubin linking APIs (#40)
* [ORO-0] Link apis.
* [ORO-0] Forgot to add.
* [ORO-0] Linking test.
* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize
* [ORO-0] Update link unit tests with comments
* [ORO-0] Change test for CUBIN instead of PTX
* [ORO-0] Fix loadfile to use binary mode, remove printf in kernel
* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.
* [ORO-0] Created win64 subdir.
* [ORO-0] Load amdhip first, then hiprtc.
* [ORO-0] Remove assert from hiprtc library checks
* [ORO-0] Add gfx1030 bitcode for navi21
* [MNN-0] Fix premake and add more link testcases
* [ORO-0] Update a link_null_name testcase
* [ORO-0] Make unit tests more stable on CUDA
* [ORO-0] Update bitcode for gfx1030
* [ORO-0] Add bitcodes for navi1,2, vega
* [ORO-0] Add hiprtc.dll and comgr dll
* [ORO-0] Add gfx906 bitcodes
* [ORO-0] Support unit tests on both HIP and CUDA
* [ORO-0] Update dlls and bitcodes
* [ORO-0] Update bitcodes and generation script
* [ORO-0] Minor fixes in bundled bitcode unit tests
* [ORO-0] Fix typo in options
* [ORO-0] Fix getCUBIN/PTX signatures
* [ORO-0] Fix unit tests and generate fatbin for CUDA
* [ORO-0] Regenerate fatbin and fix script
* [ORO-0] Cleanup
* [ORO-0] Update bundled bitcodes to only contain navi21 for now
* [ORO-0] Updated bundled bitcode
* [ORO-0] add ORO_LAUNCH_PARAMS_*
* [ORO-0] Add unit test for orortcLinkAddFile
* [ORO-0] Add unittest scripts for TC
* [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA
* [ORO-0] Add bitcode+bundled bitcode link test
* [ORO-0] Cleanup
* [ORO-0] Fix typo in script
* [ORO-0] Update linux TC script
Co-authored-by: takahiroharada <[email protected]>
* [ORO-0] Get global memory size for CUDA (#44)
* [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46)
* [ORO-0] Update HIP dll's for bitcode linking support
* [ORO-0] Add getLoweredName testcase
* [ORO-0] Update unittest filter
* [ORO-0] Update loweredName test
* [ORO-0] Add missing test kernel
* [ORO-0] Fix loweredName test
* [ORO-0] Fix linux compilation
* [ORO-0] Remove printf from test kernel (#37)
* [ORO-0] Fix linux loading of libhiprtc.so (#49)
* [ORO-0] Update test scripts (#50)
* [ORO-0] Update scripts for linux (#51)
* [ORO-0] Add new scripts (#52)
* [ORO-0] Add new scripts
* [ORO-0] Add execute permissions to scripts
* Fix Unit Test: getErrorString (#54)
Signed-off-by: Chih-Chen Kao <[email protected]>
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] Support hiprtc0504 (#55)
* [ORO-0] Update hiprtc and orortc error codes (#57)
* [ORO-0] Update test scripts to delete cache before running (#58)
* [ORO-0] Update hiprtc dlls
* [ORO-0] Support gfx1100,gfx1102 for radix sort kernel precompilation
* Fix apt python installation (#63)
Update checkout version
Signed-off-by: Chih-Chen Kao <[email protected]>
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] OrochiUtils update. (#61)
* [ORO-0] Add WMMA test (#62)
* [ORO-0] Add WMMA test
* [ORO-0] Add a comment for WMMA
* [ORO-0] Cleanup
* [ORO-0] Add a couple more comments
* [ORO-0] Remove hip_runtime include
* [ORO-0] Cleanup
* [ORO-0] Fix comment
* [ORO-0] Add Copyright notice
* [ORO-0] Load binary from the directory where DLL is.
* [ORO-0] Fix for linux.
---------
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: PixelClear <[email protected]>
* [ORO-0] Remove unnecessary template.
* [ORO-0] Clean up. Added python script kernelCompile.py for compilation. (#46)
* [ORO-0] Clean up. Added python script kernelCompile.py for compilation.
* [ORO-0] hipsdk should be next to orochi dir.
* Update ParallelPrimitives/RadixSortKernels.h
Remove commented line
---------
Co-authored-by: Chih-Chen Kao <[email protected]>
* [ORO-0] add automatic arch selection (#47)
* [ORO-0] add automatic arch selection
* [ORO-0] Refactor and error output when it cannot find llc.
---------
Co-authored-by: takahiroharada <[email protected]>
* Feature/oro 0 flexible rtc error handling cherrypick (#48)
* add a handler for RTC load failure case on cuda.
* [ORO-0] add a handler for RTC load failure case on hip.
* [ORO-0] add cuda 12.0 sdk in nvrtc path
* [ORO-0] Remove non bundled bitcode tests. Clean up.
* [ORO-0] Clean up.
* [ORO-0] Add hiprtcGetBitcodeSize back.
* Update Orochi.cpp
* Update Orochi.cpp
* [ORO-0] Fix for multi-GPU/iGPU
* [HIPSDK-0] compute-22.40-osdb/36/
* [ORO-0] compute-23.10-osdb/9/
* [ORO-0] Update dll names
* [ORO-0] implement new test for managed memory, enable managed memory api, fix all warnings and cleanup
* [ORO-0] fix compile issues
* [ORO-0] fix declaration of oroManagedMalloc
* [ORO-0] change streaming kernel
* [ORO-0] enable it on windows too
* [ORO-0] add more asserts
* [ORO-0] update kernel
* [ORO-0] add host copy times
* [ORO-0] add malloc times
* Refactor Count
Signed-off-by: Chih-Chen Kao <[email protected]>
* Refactor Radix Sort class:
- Now the tmp buffer is allocated internally.
- All GPU memory buffers are changed to the GpuMemory class
- `configure` will now calculate the total number of GPU blocks for the count and the scan kernel
- The client does not need to call configure explicitly
- Refactor function parameters
- Remove count reference kernel
Signed-off-by: Chih-Chen Kao <[email protected]>
* Add `const`
Signed-off-by: Chih-Chen Kao <[email protected]>
* This commit does the following:
- Support setting the number of threads per block (a.k.a. block size) dynamically
- Refactor `exclusiveScanCpu`
- Extend `printKernelInfo`.
Signed-off-by: Chih-Chen Kao <[email protected]>
* The 1st working example for the radix sort optimization
Signed-off-by: Chih-Chen Kao <[email protected]>
* Support configuring dynamic "NUM_WARPS_PER_BLOCK" in the sort kernel
Compute the optimal number of inputs for each block to handle.
Refactor the usage of stopwatch
Signed-off-by: Chih-Chen Kao <[email protected]>
* [ORO-0] add hiprtc future dll names in hiprtc path
* Add linux paths and dll names (#66)
* [ORO-0] Change path and rtc dll names
* [ORO-0] Make scripts executable
* [ORO-0] Add hiprtc path
* [ORO-0] Remove ParallelPrimitives, test/radix sort
* [ORO-0] Edit premake
---------
Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>
Co-authored-by: Richard Geslot <[email protected]>
Co-authored-by: Atsushi Yoshimura <[email protected]>
Co-authored-by: Atsushi.Yoshimura <[email protected]>
1 parent df8a401 commit b209cc1
File tree
16 files changed, +287 -50 lines changed
- Orochi
- UnitTest
- bitcodes
- contrib
- bin/win64
- cuew/src
- hipew
- include
- src
- scripts
.gitattributes (+1)