Is your feature request related to a problem? Please describe.
As of 25.02, the implementation for hash-based groupby aggregations tracks partial aggregations in a table sized to double the row count of the input table. Once all of the input rows have been processed, the implementation runs a compaction step to extract the non-empty entries. When the number of distinct groups is smaller than the number of rows, this approach leads to excess memory usage as well as an additional memory copy.
Describe the solution you'd like
Introduce a hash table that tracks an output index alongside each key's hash value. When a new group is found, atomically increment a "write counter" to claim that group's output index. When processing the aggregation kinds, update the partial aggregations at the output index for that group.
With this idea, the output data will not require another gather. The output data will be created contiguously and will (probably) be available for output column creation without another copy. The cost is some additional atomics pressure when cardinality is high. It may be that "sparse" partial aggregations are preferred when the cardinality is close to the row count (needs confirmation).
Additional context
The current setup for Keys, Values, and Aggregations uses a Struct of Arrays (SoA) format, leading to numerous random memory accesses. Consider exploring a conversion from SoA to Array of Structs (AoS).
There is an internal reference for NVIDIA internal developers here. If you are an external developer and would like to learn more, please contact [email protected].