Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement xxhash algorithms as part of the expression API #14044

Open
HectorPascual opened this issue Jan 8, 2025 · 13 comments · May be fixed by #14367
Open

Implement xxhash algorithms as part of the expression API #14044

HectorPascual opened this issue Jan 8, 2025 · 13 comments · May be fixed by #14367
Assignees
Labels
enhancement New feature or request

Comments

@HectorPascual
Copy link

HectorPascual commented Jan 8, 2025

Is your feature request related to a problem or challenge?

I am currently in need of using datafusion SQL query engine (through delta-rs merge operation) to hash with a specific hashing algorithm.

https://xxhash.com/
https://github.com/Cyan4973/xxHash
Rust implementation : https://crates.io/crates/twox-hash

Describe the solution you'd like

I'd like the hashing functions xxhash32 and xxhash64 to be part of datafusion expression API so the functions can be used in SQL context for hashing data.

@HectorPascual HectorPascual added the enhancement New feature or request label Jan 8, 2025
@alamb
Copy link
Contributor

alamb commented Jan 8, 2025

You could also implement this as a user defined function perhaps

@HectorPascual
Copy link
Author

Hi, thanks for the reply.

That is true, although, my concern comes from this other issue raised in delta-rs project (link below), since I need to use this hash operation in a Delta merge operation.

Unless I change the codebase for the delta-rs project, I'd be unable to create an UDF (not sure if there's a way to create and register it in the delta-rs context in python, sounds complicated)

delta-io/delta-rs#3105 (comment)

@Spaarsh
Copy link
Contributor

Spaarsh commented Jan 19, 2025

Can I take this up? I shall add the new functions in a new file in crypto/xxhas.rs similar to the crypto/sha224.rs file and import them in the crypto/mod.rs file. And then add some tests.

Do tell me if I'm missing something!

@Spaarsh
Copy link
Contributor

Spaarsh commented Jan 19, 2025

take

@alamb
Copy link
Contributor

alamb commented Jan 22, 2025

@Spaarsh sounds like a good plan

@ion-elgreco
Copy link

Perhaps consider adding a non cryptographic module and add wyhash as well (here some of the crates I've used for similar usecase but in polars: https://github.com/ion-elgreco/polars-hash/blob/main/polars_hash%2FCargo.toml)

@Spaarsh
Copy link
Contributor

Spaarsh commented Jan 23, 2025

Perhaps consider adding a non cryptographic module and add wyhash as well (here some of the crates I've used for similar usecase but in polars: https://github.com/ion-elgreco/polars-hash/blob/main/polars_hash%2FCargo.toml)

Sure. I'll take a look at it and get back to you in case of any doubts! Thanks!

@Spaarsh
Copy link
Contributor

Spaarsh commented Jan 26, 2025

@HectorPascual is this the output that you expect?

> SELECT 
    xxhash32(column1) AS xxhash32_result
FROM (
    SELECT UNNEST(ARRAY[1, 2, 3, 4, 5]) AS column1
) AS subquery;
+-----------------+
| xxhash32_result |
+-----------------+
| b6ecc8b2        |
| 0c436334        |
| 9cec73c4        |
| 6dd9af36        |
| 1c5751f3        |
+-----------------+
xxhash32 output screenshot

Image

> SELECT 
    xxhash64(column1) AS xxhash64_result
FROM (
    SELECT UNNEST(ARRAY[1, 2, 3, 4, 5]) AS column1
) AS subquery;
+------------------+
| xxhash64_result  |
+------------------+
| b7b41276360564d4 |
| 6021b5621680598b |
| 26167c2af5162ca4 |
| 913914322ca46b89 |
| 6a81b47405b648ed |
+------------------+
xxhash64 output screenshot

Image

Thanks!

@HectorPascual
Copy link
Author

@HectorPascual is this the output that you expect?

> SELECT 
    xxhash32(column1) AS xxhash32_result
FROM (
    SELECT UNNEST(ARRAY[1, 2, 3, 4, 5]) AS column1
) AS subquery;
+-----------------+
| xxhash32_result |
+-----------------+
| b6ecc8b2        |
| 0c436334        |
| 9cec73c4        |
| 6dd9af36        |
| 1c5751f3        |
+-----------------+

xxhash32 output screenshot
Image

> SELECT 
    xxhash64(column1) AS xxhash64_result
FROM (
    SELECT UNNEST(ARRAY[1, 2, 3, 4, 5]) AS column1
) AS subquery;
+------------------+
| xxhash64_result  |
+------------------+
| b7b41276360564d4 |
| 6021b5621680598b |
| 26167c2af5162ca4 |
| 913914322ca46b89 |
| 6a81b47405b648ed |
+------------------+

xxhash64 output screenshot
Image

Thanks!

Yes, that looks good!

@Spaarsh
Copy link
Contributor

Spaarsh commented Jan 27, 2025

@alamb @HectorPascual I would like to have your input on this:

  1. I was wondering how to add support for a None or empty values. Should I replace the input with "NULL" string whenever a None value is entered?

  2. There is also the concern of not taking the seed from the user. Currently the seed is given a default value and I think having the user's control over it be helpful.

  3. If we decide to ask for a seed by the user, do we keep the default to be 0? Wouldn't obfuscating that default values cause issues for the user?
    I have added support for individual values of the types Utf8, Int32, Int64, UInt32 and UInt64.

These are the query results:

> SELECT xxhash64('1') AS xxhash64_result;
+------------------+
| xxhash64_result  |
+------------------+
| b7b41276360564d4 |
+------------------+
1 row(s) fetched. 
Elapsed 0.004 seconds.
> SELECT xxhash64(1) AS xxhash64_result;
+------------------+
| xxhash64_result  |
+------------------+
| b7b41276360564d4 |
+------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.
> SELECT xxhash32('1') AS xxhash32_result;
+-----------------+
| xxhash32_result |
+-----------------+
| b6ecc8b2        |
+-----------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.
> SELECT xxhash32(1) AS xxhash32_result;
+-----------------+
| xxhash32_result |
+-----------------+
| b6ecc8b2        |
+-----------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

If this is alright, then I will add the tests for these functions.

Perhaps consider adding a non cryptographic module and add wyhash as well (here some of the crates I've used for similar usecase but in polars: https://github.com/ion-elgreco/polars-hash/blob/main/polars_hash%2FCargo.toml)

@alamb should I open a new issue for adding wyhash and the other functions? Or they can be added along with this PR?

@HectorPascual
Copy link
Author

HectorPascual commented Jan 27, 2025

Looks good, I can try tomorrow hashing the same inputs with the Python xxhash module and double check the output of the results. @Spaarsh

Let me gather some feedback from colleagues on your open concerns and come back later with details.

@HectorPascual
Copy link
Author

Hi @Spaarsh,

The hashes match to the python module :

Image

In regards to your inputs :

  1. The python module breaks when entering a None,
    Image
    You can for instance replace None with '' (empty string)

2-3. On the seed side, for the python module the default seed is 0, see : https://github.com/ifduyue/python-xxhash#:~:text=An%20optional%20seed%20(default%20is%200)%20can%20be%20used%20to%20alter%20the%20result%20predictably%3A
Is there an option to have this as an optional argument?

Thanks!

@Spaarsh
Copy link
Contributor

Spaarsh commented Jan 30, 2025

Thanks @HectorPascual! I'm opening PR from here on for transperancy's sake!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants