Adding bloom command meta data, bloom group and bloom data type documentaion #233
base: main
as well Signed-off-by: zackcam <[email protected]>
Very interesting!
I skimmed through it very quickly. The documentation itself looks great AFAICT. I can do a more detailed review later.
The commands look very much like built-in commands. It's not mentioned anywhere that it's a separate module that users need to install. I think we should mention it on the bloom filters topic page with a link to the GitHub repo. The BF command pages should link to that topic page, so the pages are all linked together.
To build man pages, the scripts in this repo need to be able to take multiple command JSON files. This needs to be added to the Makefile, the README and maybe the python scripts too. Please try to build the man pages as described in the README of this repo.
Many of the spellcheck errors can be fixed simply by writing the command names in backticks. Stuff in backticks is excluded from spellcheck IIRC.
Yes, something like that would be good. In your screenshot it looks like the "Extensions" sub-heading is part of "Module Data Types" though, because of the levels of the headings used. If we do this, then "Module Data Types" should be a level-2 heading and "Bloom Filter" a level-3 heading under it.
How about just mentioning the module within the description? Something like this?
## Bloom Filter
[Bloom filters](bloomfilters.md) provides a space efficient probabilistic data structure that allows checking if an element is a member of a set. False positives are possible, but it guarantees no false negatives.
+Bloom filters are provided by the module `valkey-bloom`.
For more information, see:
* [Overview of Bloom Filters](bloomfilters.md)
* [Bloom filter command reference](../commands/#bloom)
+* [The valkey-bloom module on GitHub](https://github.com/valkey-io/valkey-bloom/)
@zuiderkwast I also wanted to get your input about how we should structure the modules to make it clear they aren't part of the core. The current structure is they are intermingled. I don't really have an opinion yet, but one alternative would be to at least separate them in a separate folder structure and clarify which module they are a part of.
Are you talking about the URLs of the commands? I like that it's a flat structure, just like the commands are in a global flat namespace. The But we should definitely show it in some way. A line somewhere on each command page would be good. I hope we can generate it in some way from an optional key in the command JSON file or something like that.
I don't have a strong preference one way or the other about flat/nested, so sticking with flat is OK for me.
Yeah, I guess immediately let's make sure there is something in the JSON file. Maybe |
Not a super deep review. I think we should indicate more clearly that the commands are from a module and not part of the core. That can maybe come from the JSON docs only, though.
commands/bf.add.md
Outdated
* key (required) - A Valkey key of Bloom data type | ||
* item (required) - Item to add |
* key (required) - A Valkey key of Bloom data type | |
* item (required) - Item to add |
We typically omit this, since the usage would be included at the top which will indicate if something is required.
Yeah, makes sense. I removed all these from the bloom commands and, where I think the arguments needed explaining, updated the heading name.
commands/bf.add.md
Outdated
@@ -0,0 +1,12 @@ | |||
Adds an item to a bloom filter, if the specified filter does not exist creates a default bloom filter with that name. |
Adds an item to a bloom filter, if the specified filter does not exist creates a default bloom filter with that name. | |
Adds an item to a bloom filter, if the specified bloom filter does not exist creates a bloom filter with default configurations with that name. | |
If you want to create a bloom filter with non-standard options, use the `BF.INSERT` command. |
Updated and made it less wordy as well by removing 'specified' from the description
commands/bf.exists.md
Outdated
@@ -0,0 +1,16 @@ | |||
Determines if a specified item has been added to the specified bloom filter. | |||
Syntax |
Syntax |
commands/bf.info.md
Outdated
@@ -0,0 +1,35 @@ | |||
Returns information about a bloomfilter | |||
|
|||
## Arguments |
These need to be kept because they include the info data, but I would change this to be about info fields or something.
commands/bf.info.md
Outdated
## Arguments | ||
* key (required) - A valkey key of bloom data type | ||
* CAPACITY (optional) - Returns the number of unique items that would need to be added before scaling would happen | ||
* SIZE (optional) - Returns the memory size which is the number of bytes allocated |
* SIZE (optional) - Returns the memory size which is the number of bytes allocated | |
* SIZE (optional) - Returns the number of bytes allocated |
Why waste time say lot word when few word do trick?
topics/data-types.md
Outdated
@@ -92,6 +92,14 @@ The [HyperLogLog](hyperloglogs.md) data structures provide probabilistic estimat | |||
* [Overview of HyperLogLog](hyperloglogs.md) | |||
* [HyperLogLog command reference](../commands/#hyperloglog) | |||
|
|||
## Bloom Filter | |||
|
|||
[Bloom filters](bloomfilters.md) provides a space efficient probabilistic data structure that allows checking if an element is a member of a set. False positives are possible, but it guarantees no false negatives. |
I would translate this to english with an example.
I tried to make this more understandable but I think potentially having what I use in the exists and mexists commands could also work if the new version still isn't great
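To illustrate the "definitely not in the set, or might be in the set" behaviour discussed above, here is a minimal Python sketch of a Bloom filter. It is a toy for explanation only: the class name `TinyBloom`, the SHA-256 based hashing, and the sizes are assumptions made for this example, not how valkey-bloom is implemented.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter: a bit array plus k salted hashes (illustration only)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item: str):
        # Derive k bit positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely never added"; True means "probably added".
        return all(self.bits[pos] for pos in self._positions(item))


bloom = TinyBloom()
bloom.add("user:42")
print(bloom.might_contain("user:42"))  # True: an added item is never reported missing
```

A lookup for an item that was never added usually returns `False`, but it can return `True` if other items happened to set the same bits. That is the false positive case the description needs to convey.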
topics/bloomfilters.md
Outdated
|
||
Bloom filters are a space efficient probabilistic data structure that allows checking whether an element is member of a set. False positives are possible, but it guarantees no false negatives. | ||
|
||
## Bloom commands |
Our other examples include the "basic commands" up front, and then the more sophisticated commands later. I think we should do the same.
topics/bloomfilters.md
Outdated
|
||
**Financial fraud detection** | ||
|
||
Bloom filters can help answer the question "Has the user paid from this location before?", which can then give insights if there has been suspicious activity in shopping habits. |
Is this a real use case? The false positive here is not ideal, since it might make it seem like a transaction is legitimate when it is not.
Updated this use case to be more about card fraud instead of location based checking
topics/bloomfilters.md
Outdated
|
||
Bloom filters can help answer the question "Has the user paid from this location before?", which can then give insights if there has been suspicious activity in shopping habits. | ||
|
||
For the above each user would have a Bloom filter which is then checked for every transaction. |
Might just merge this into the previous paragraph.
topics/bloomfilters.md
Outdated
|
||
**Check if URL's are malicious** | ||
|
||
Bloom filters can answer the question is a URL malicious. Any URL inputted would be checked against a malicious URL bloom filter. |
Bloom filters can answer the question is a URL malicious. Any URL inputted would be checked against a malicious URL bloom filter. | |
Bloom filters can answer the question "is a URL malicious?". Any URL inputted would be checked against a malicious URL bloom filter. |
Not a complete review.
We need to think about what we want regarding
- How to show which module a command belongs to and how to store this in the JSON file(s).
- What to show in the `Since` fields. If we'll release some valkey-with-modules bundle, then the version number should probably follow valkey's versioning(?).
resp2_replies.json
Outdated
"* [Integer reply](../topics/protocol.md#integers): '1'. The item was successfully added", | ||
"* [Integer reply](../topics/protocol.md#integers): '0'. The item already existed in the bloom filter", |
With the single quotes it looks a bit like string literals. Use backticks instead to mark it as code? This seems to be how some other commands' integer replies are documented.
"* [Integer reply](../topics/protocol.md#integers): '1'. The item was successfully added", | |
"* [Integer reply](../topics/protocol.md#integers): '0'. The item already existed in the bloom filter", | |
"* [Integer reply](../topics/protocol.md#integers): `1` if the item was successfully added", | |
"* [Integer reply](../topics/protocol.md#integers): `0` if the item already existed in the bloom filter", |
Compare to for example this one:
"CLIENT UNBLOCK": [
"One of the following:",
"* [Integer reply](../topics/protocol.md#integers): `0` if the client was unblocked successfully.",
"* [Integer reply](../topics/protocol.md#integers): `1` if the client wasn't unblocked."
],
Makes sense, updated both add and exists in both response files to follow this
topics/bloomfilters.md
Outdated
|
||
Example usage for a default bloom object: | ||
``` | ||
127.0.0.1:6379> bf.insert validate_scale_fail VALIDATESCALETO 26214301 |
Uppercase all commands like `BF.INSERT` and fixed tokens. It makes it easier to see what is fixed and what is variable.
topics/bloomfilters.md
Outdated
|
||
## Common use cases for bloom filters | ||
|
||
**Financial fraud detection** |
These look like sub-headings, so I think we should mark them as such. It's semantically more correct. (The others too; not only this one.)
**Financial fraud detection** | |
### Financial fraud detection |
commands/bf.add.md
Outdated
@@ -0,0 +1,12 @@ | |||
Adds an item to a bloom filter, if the specified filter does not exist creates a default bloom filter with that name. | |||
## Arguments |
An empty line before and after headings, before and after bullet lists, etc. makes it more likely to be rendered correctly on the website, in man pages and on GitHub. They all use different Markdown implementations with some subtle differences.
## Arguments | |
## Arguments | |
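The blank-line convention above can be checked mechanically. The following is a rough Python sketch of such a check; the function name `headings_without_blank_lines` is hypothetical and nothing like it is claimed to exist in this repo's scripts. It only looks at ATX (`#`) headings and skips fenced code blocks, so shell-style comments inside examples are not reported.

```python
import re

HEADING = re.compile(r"#{1,6}\s")  # ATX heading: one to six '#' then whitespace
FENCE = "`" * 3                    # marker that opens/closes a fenced code block


def headings_without_blank_lines(markdown: str) -> list:
    """Return 1-based line numbers of headings not surrounded by blank lines."""
    lines = markdown.splitlines()
    problems = []
    in_fence = False
    for i, line in enumerate(lines):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence or not HEADING.match(line):
            continue
        before_ok = i == 0 or lines[i - 1].strip() == ""
        after_ok = i == len(lines) - 1 or lines[i + 1].strip() == ""
        if not (before_ok and after_ok):
            problems.append(i + 1)
    return problems


sample = "Adds an item to a bloom filter.\n## Arguments\n\n* key\n"
print(headings_without_blank_lines(sample))  # [2]
```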
resp3_replies.json
Outdated
"[Array reply](../topics/protocol.md#arrays): List of information about the bloom filter.", | ||
"When an optional argument is provided:", | ||
"* [Integer reply](../topics/protocol.md#integers): argument value", | ||
"* [String reply??](../topics/protocol.md#simple-strings): argument value", |
Why "String reply??" with double question marks??
That was accidentally left over. We only have a string reply when one of the optional arguments is provided, so I meant to come back to this and clarify which case returns a string and which returns an integer.
I think for now we should show the independent module's version number, since we got alignment on that. Internally at AWS we are planning on reviving valkey-io/valkey#408 and posting some suggestions. Once that has alignment, we can maybe add more information about where it's available (i.e. Valkey core since 10.0, valkey-bloom since 1.0)
…to generate bloom man pages Signed-off-by: zackcam <[email protected]>
commands/bf.add.md
Outdated
@@ -0,0 +1,12 @@ | |||
Adds an item to a bloom filter, if the specified bloom filter does not exist creates a bloom filter with default configurations with that name. |
Adds an item to a bloom filter, if the specified bloom filter does not exist creates a bloom filter with default configurations with that name. | |
Adds a single item to a bloom filter. If the specified bloom filter does not exist, a bloom filter is created with the provided name with default properties. |
commands/bf.add.md
Outdated
@@ -0,0 +1,12 @@ | |||
Adds an item to a bloom filter, if the specified bloom filter does not exist creates a bloom filter with default configurations with that name. | |||
|
|||
If you want to create a bloom filter with non-standard options, use the `BF.INSERT` or `BF.RESERVE` command. |
By non-standard options, you mean the non default properties. Right?
If you want to create a bloom filter with non-standard options, use the `BF.INSERT` or `BF.RESERVE` command. | |
To add multiple items to a bloom filter, you can use the BF.MADD or BF.INSERT commands. | |
If you want to create a bloom filter with non-default properties, use the `BF.INSERT` or `BF.RESERVE` command. |
Yeah non standard meant non default, but agree makes more sense to say non default and that keeps it consistent
commands/bf.card.md
Outdated
@@ -0,0 +1,12 @@ | |||
Gets the cardinality of a Bloom filter - number of items that have been successfully added to a Bloom filter. |
Gets the cardinality of a Bloom filter - number of items that have been successfully added to a Bloom filter. | |
Returns the cardinality of a Bloom filter which is the number of items that have been successfully added to it. |
commands/bf.card.md
Outdated
1 | ||
127.0.0.1:6379> BF.CARD key | ||
1 | ||
127.0.0.1:6379> BF.CARD missing |
127.0.0.1:6379> BF.CARD missing | |
127.0.0.1:6379> BF.CARD nonexistentkey |
commands/bf.exists.md
Outdated
@@ -0,0 +1,19 @@ | |||
Determines if an item has been added to the bloom filter. |
Determines if an item has been added to the bloom filter. | |
Determines if an item has been added to the bloom filter previously. |
commands/bf.insert.md
Outdated
* SEED seed - The seed the hash functions will use | ||
* NONSCALING - Will make it so the filter can not scale | ||
* VALIDATESCALETO `validatescaleto` - Checks if the filter could scale to this capacity and if not show an error and don’t create the bloom filter | ||
* ITEMS item - One or more items we will add to the bloom filter |
* ITEMS item - One or more items we will add to the bloom filter | |
* ITEMS item - One or more items to be added to the bloom filter |
commands/bf.insert.md
Outdated
* TIGHTENING `tightening_ratio` - The tightening ratio for the bloom filter | ||
* SEED seed - The seed the hash functions will use | ||
* NONSCALING - Will make it so the filter can not scale | ||
* VALIDATESCALETO `validatescaleto` - Checks if the filter could scale to this capacity and if not show an error and don’t create the bloom filter |
* VALIDATESCALETO `validatescaleto` - Checks if the filter could scale to this capacity and if not show an error and don’t create the bloom filter | |
* VALIDATESCALETO `validatescaleto` - Validates if the filter can scale out and reach to this capacity based on limits and if not, return an error without creating the bloom filter |
commands/bf.insert.md
Outdated
* NOCREATE - Will not create the bloom filter and add items if the filter does not exist already | ||
* TIGHTENING `tightening_ratio` - The tightening ratio for the bloom filter | ||
* SEED seed - The seed the hash functions will use | ||
* NONSCALING - Will make it so the filter can not scale |
@zackcam - you can follow wording from BF.RESERVE
commands/bf.insert.md
Outdated
* TIGHTENING `tightening_ratio` - The tightening ratio for the bloom filter | ||
* SEED seed - The seed the hash functions will use |
TODO: Add more wording
commands/bf.insert.md
Outdated
127.0.0.1:6379> BF.INSERT key ITEMS item1 item2 | ||
1) (integer) 1 | ||
2) (integer) 1 | ||
# This does not update the capcity but uses the origianl filters values |
# This does not update the capcity but uses the origianl filters values | |
# This does not update the capacity since the filter already exists. It only adds the provided items. |
I only reviewed the Command Documentation. I will need to review the remaining sections next.
topics/data-types.md
Outdated
@@ -92,6 +92,17 @@ The [HyperLogLog](hyperloglogs.md) data structures provide probabilistic estimat | |||
* [Overview of HyperLogLog](hyperloglogs.md) | |||
* [HyperLogLog command reference](../commands/#hyperloglog) | |||
|
|||
## Bloom Filter | |||
|
|||
[Bloom filters](bloomfilters.md) are a space efficient data type that can tell you if something is definitely not in a set, or it might be in the set. |
[Bloom filters](bloomfilters.md) are a space efficient data type that can tell you if something is definitely not in a set, or it might be in the set. | |
[Bloom filters](bloomfilters.md) are a space efficient probabilistic data type that can be used to check if item/s are definitely not present in a set, or if they exist within the set (with the configured false positive rate). |
@@ -0,0 +1,108 @@ | |||
--- | |||
title: "Bloom Filters" |
Could we also include the section/s below:
- Scaling / Non Scaling Filters and their implications
Added this section and instead of putting the implication in performance added a subsection to the scaling and non scaling section
topics/bloomfilters.md
Outdated
|
||
Error rate - 0.01 | ||
|
||
Expansion - 2 |
As we mention in command documentation, let us clarify the scaling and non scaling cases where expansion is nil.
12) (integer) 2 | ||
13) Max scaled capacity | ||
14) (integer) 26214300 | ||
``` |
We can include advanced / additional properties here as a sub section within the "Default Properties":
- Tightening Ratio - We do not recommend tuning this unless there is a specific use case for lower memory usage (with higher false positive) or vice versa.
- Seed - This is only useful if a user has a specific 32 byte seed they want their bloom filters to use.
Added these in an advanced properties section.
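As a worked example of how capacity and expansion interact, the sketch below sums the capacities of the sub-filters in a scaling filter. It assumes that each new sub-filter's capacity is the previous one's multiplied by the expansion rate; the thread does not spell this out, and the function name `scaled_capacity` is made up for the example.

```python
def scaled_capacity(initial: int, expansion: int, num_filters: int) -> int:
    """Total capacity of a scaling filter made of `num_filters` sub-filters.

    Assumption: sub-filter i holds initial * expansion**i items.
    """
    return sum(initial * expansion**i for i in range(num_filters))


# Defaults quoted in this thread: capacity 100, expansion 2.
for n in (1, 2, 3, 18):
    print(n, scaled_capacity(100, 2, n))
# 1 100
# 2 300
# 3 700
# 18 26214300
```

With the defaults, 18 sub-filters give 100 × (2^18 − 1) = 26214300, which is the same number shown as "Max scaled capacity" in the `BF.INFO` output quoted above. Whether the module actually expresses its limit as a number of sub-filters is an inference from this arithmetic, not something stated in the thread.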
topics/bloomfilters.md
Outdated
|
||
Most bloom commands are O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. This means that both BF.ADD and BF.EXISTS are both O(n) as they only work with one 1 item. | ||
|
||
There are a few bloom commands that are O(1) as they don't work on items but instead work on the data about the bloom filter itself. |
Do you mean BF.CARD and BF.INFO? Maybe you can list the ones you are referring to here
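For context on the O(n * k) figure in the quoted paragraph, here is a rough Python sketch that counts hash evaluations, using the paragraph's own naming (n hash functions, k items). The helper `add_items` and the SHA-256 based hashing are assumptions for illustration and do not reflect the module's actual hashing.

```python
import hashlib
from typing import Sequence


def add_items(bits: bytearray, items: Sequence[str], num_hashes: int) -> int:
    """Set the bits for every item and return the number of hash evaluations."""
    evaluations = 0
    for item in items:                # k items
        for i in range(num_hashes):   # n hash functions per item
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            bits[int.from_bytes(digest[:8], "big") % len(bits)] = 1
            evaluations += 1
    return evaluations


bits = bytearray(4096)
single = add_items(bits, ["a"], num_hashes=7)           # like adding one item
batch = add_items(bits, ["b", "c", "d"], num_hashes=7)  # like adding several items
print(single, batch)  # 7 21
```

A single-item call costs n evaluations and a batch of k items costs n * k. Commands that only read the filter's metadata do no per-item hashing at all, which is why they can be O(1).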
Signed-off-by: zackcam <[email protected]>
topics/bloomfilters.md
Outdated
Capacity - 100 | ||
|
||
Error rate - 0.01 | ||
|
||
Expansion - 2 |
In addition to mentioning the default value, can we follow the standard wording from "command documentation" to have a one liner to explain these properties?
topics/bloomfilters.md
Outdated
Introduction to Bloom Filters | ||
--- | ||
|
||
The bloom filter data type is taken from a [separate module](https://github.com/valkey-io/valkey-bloom) that users will need to install in order to use. |
We can move this sentence to be after the introduction statement.
The bloom filter data type is taken from a [separate module](https://github.com/valkey-io/valkey-bloom) that users will need to install in order to use. | |
In Valkey, the bloom filter data type / commands are implemented in the [valkey-bloom module](https://github.com/valkey-io/valkey-bloom) which is an official valkey module compatible with versions 8.0 and above. Users will need to load this module onto their valkey server in order to use this feature. |
topics/bloomfilters.md
Outdated
|
||
### When should you use scaling vs non-scaling filters | ||
|
||
If the data size is known and fixed then using a non-scaling bloom filter is preferred, for example a static dictionary could use a non scaling bloom filter as the amount of items should be fixed. Likewise the reverse case for dynamic data and unknown final sizes is when you should use a scaling bloom filters. |
We can consider briefly explaining benefits of non scaling (better performance and less memory overhead) and its drawbacks - it will error out when it reaches capacity. If you don't want to hit an error and want use-as-you-go capacity, scaling is better, but it uses more memory for the additional capacity which is available. Also, more filters (e.g. >500-1000) means higher command latencies.
topics/bloomfilters.md
Outdated
|
||
Seed - The seed used by the bloom filter can be specified by the user in the BF.INSERT command. This property is only useful if you have a specific 32 byte seed that you want your bloom filter to use. By defualt every bloom filter will use a random seed. | ||
|
||
Tightening Ratio - We do not recommend fine tuning this unless there is a specific use case for lower memory usage with higher false positive or vice versa. |
Let's clarify that the BF.INSERT command can help specify this
topics/bloomfilters.md
Outdated
|
||
If the data size is known and fixed then using a non-scaling bloom filter is preferred, for example a static dictionary could use a non scaling bloom filter as the amount of items should be fixed. Likewise the reverse case for dynamic data and unknown final sizes is when you should use a scaling bloom filters. | ||
|
||
## Default bloom properties |
Optional : We could add a monitoring section and briefly go over the INFO BF command response
section to bloomfilter topic, cleaned up other bloomfilter topic sections
Makefile
Outdated
$(eval VALKEY_ROOTS := $(VALKEY_ROOT) $(VALKEY_BLOOM_ROOT)) | ||
$(eval FINAL_ROOT := $(firstword $(foreach root,$(VALKEY_ROOTS),$(if $(wildcard $(root)/src/commands/$*.json),$(root))))) | ||
$(if $(FINAL_ROOT),,$(eval FINAL_ROOT := $(lastword $(VALKEY_ROOTS)))) |
This looks complicated. We should try to do something more readable.
I don't have a clear idea about how to make this simpler, but maybe we can define the rules differently, or maybe it becomes clearer if we use `$(call ...)` instead of `$(eval ...)`.
Or define a separate rule for the `bf.` commands like this?
$(MAN_DIR)/man3/bf.%.3valkey.gz: commands/bf.%.md ...
For future modules I think making it separate could be detrimental. This way, any future module will only need to add its equivalent of `VALKEY_BLOOM_ROOT` to the first eval. Also, keeping it in this sort of format means that if changes are needed in the future, they will only be needed here for all commands. I am going to update the last if/eval, though, so it doesn't actually need to have found a match and will only create on found matches. If this still doesn't sound ideal, let me know and I'll look more in depth at how to make it clearer (either by changing how this is done or by separating steps).
OK, makes sense. Maybe just the first line `$(eval VALKEY_ROOTS := $(VALKEY_ROOT) $(VALKEY_BLOOM_ROOT))` can be on top level instead of inside the rule?
VALKEY_ROOTS := $(VALKEY_ROOT) $(VALKEY_BLOOM_ROOT)
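For readers less used to nested `$(eval ...)` and `$(foreach ...)`, the lookup being discussed amounts to "take the first root that has `src/commands/<command>.json`". Here is a Python sketch of that logic, purely as an explanation and not a proposed replacement for the Makefile. The function name `find_command_root` is hypothetical, and returning `None` when nothing matches reflects the author's stated plan to drop the `lastword` fallback.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def find_command_root(command: str, roots: Iterable[Path]) -> Optional[Path]:
    """Return the first root containing src/commands/<command>.json, else None."""
    for root in roots:
        if (root / "src" / "commands" / f"{command}.json").is_file():
            return root
    return None


# Demo with throwaway directories standing in for the two checkouts.
with tempfile.TemporaryDirectory() as tmp:
    valkey_root = Path(tmp) / "valkey"
    bloom_root = Path(tmp) / "valkey-bloom"
    for root, name in ((valkey_root, "get"), (bloom_root, "bf.add")):
        (root / "src" / "commands").mkdir(parents=True)
        (root / "src" / "commands" / f"{name}.json").write_text("{}")

    roots = [valkey_root, bloom_root]
    print(find_command_root("bf.add", roots).name)  # valkey-bloom
    print(find_command_root("missing", roots))      # None
```

Adding a future module would then mean appending one more entry to `roots`, which mirrors the argument for keeping a single `VALKEY_ROOTS` list.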
It starts to look good. I just put a few comments on formatting and such things.
I didn't review the actual docs of the module carefully, because I don't know it very well. To me, it's enough if that part is reviewed by you, the module authors.
README.md
Outdated
sudo make install INSTALL_MAN_DIR=/usr/local/share/man | ||
|
||
Prerequisites: GNU Make, Python 3, Python 3 YAML (pyyaml), Pandoc. | ||
Additionally, the scripts need access to the valkey code repo, | ||
Additionally, the scripts need access to the valkey and valkey-bloom code repos, |
Mention that valkey-bloom is optional and that those pages are excluded if the valkey-bloom path is not provided.
commands/commands
Outdated
@@ -0,0 +1 @@ | |||
../valkey-doc/commands |
Don't add this file (or symlink).
A Bloom filter has two possible responses when you check if an item exists: | ||
|
||
* 0 - The item definitely does not exist since with bloom filters, false negatives are not possible. | ||
|
||
* 1 - The item exists with a given false positive (`fp`) percentage. There is an `fp` rate % chance that the item does not exist. You can create bloom filters with a more strict false positive rate as needed. |
Don't include the reply docs here. They are added in `resp2_replies.json` and `resp3_replies.json`, so if we add them here too they will appear twice.
127.0.0.1:6379> BF.INFO key
1) Capacity
2) (integer) 100
3) Size
4) (integer) 384
5) Number of filters
6) (integer) 1
7) Number of items inserted
This doesn't match the documentation of the field names above. I would expect the field names to be `CAPACITY`, `SIZE`, `FILTERS` etc. rather than "Capacity", "Size", "Number of filters", ...
Btw, why are these uppercase? In the INFO command, the field names are lowercase.
commands/bf.insert.md
Outdated

## Insert Fields

* CAPACITY `capacity` - The number of unique items that would need to be added before a scale out occurs or (non scaling) before it rejects addition of unique items.
Here, `CAPACITY` is a keyword and `capacity` is a placeholder for a number?
I suggest we use this formatting instead, with backticks for the keyword and italics for the variable:
- * CAPACITY `capacity` - The number of unique items that would need to be added before a scale out occurs or (non scaling) before it rejects addition of unique items.
+ * `CAPACITY` *capacity* - The number of unique items that would need to be added before a scale out occurs or (non scaling) before it rejects addition of unique items.
A Bloom filter has two possible responses when you check if an item exists:

* 0 - The item definitely does not exist since with bloom filters, false negatives are not possible.

* 1 - The item exists with a given false positive (`fp`) percentage. There is an `fp` rate % chance that the item does not exist. You can create bloom filters with a more strict false positive rate as needed.
Skip this. Responses are documented in the response JSON files.
(I know, I don't like it. It's unnecessarily complex. I want to move the reply docs into the markdown files some day. But for now, let's just follow the existing structure.)
I think this was wanted to make it explicit how the false positive rate affects the exists command and determining whether an item is present. I could try to reword it so that it explains false positives without relying on the response, but I think showing the response makes it more understandable.
The rendered page is showing the response from `resp2_replies.json` etc. but unfortunately it gets added at the bottom of the web page.
(On the generated man pages, the reply section gets inserted before Examples, which I think is a better place.)
You can keep this text here if you think it's better, and keep it brief in `resp{2,3}_replies.json` so there is not too much duplicated text.
I agree that it has some slight duplication, but I like having this explained, as one of the main behaviours of bloom filters is the false positive rate. I am happy to change it if others would rather not have the duplication.
I don't mind, but feel free to formulate it in a way so that it doesn't look too much like duplication.
resp2_replies.json
Outdated
"* [Integer reply](../topics/protocol.md#integers): `1` if the item exists in the bloom filter",
"* [Integer reply](../topics/protocol.md#integers): `0` if the bloom filter does not exist or the item has not been added to the bloom filter",
"",
"The command will fail if the wrong number of arguments are provided"
No need to mention error for wrong number of arguments. All commands will return syntax error in this case. This is implicit and we don't need to mention it for every command.

```
127.0.0.1:6379> info bf
# bf_bloom_core_metrics
```
For the INFO command, these section headings match the argument in uppercase, so I would expect `# BF` here, with a blank line below it.
- `# bf_bloom_core_metrics`
+ `# BF`
Are these fields matching redis bloom info fields or are they invented in valkey-bloom?
That title is determined from the bloom module (https://github.com/valkey-io/valkey-bloom/blob/unstable/src/metrics.rs#L17) and that output is exactly what I get when running `info bf`. I'm pretty sure they were invented in valkey-bloom.
> That title is determined from the bloom module

Then maybe valkey-bloom doesn't exactly behave as documented for the INFO command:

> Lines can contain a section name (starting with a # character) or a property. All the properties are in the form of field:value terminated by \r\n.

These lines with # are the section names you can also use as argument for fetching a single section. They're not comments.
Or can you do `INFO bf_bloom_core_metrics` too?

> I'm pretty sure they were invented in valkey-bloom.

Then I'm wondering why the prefix of each field is `bf_bloom` and not just `bf`? BF stands for bloom filter already, right?
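
For reference, the INFO format quoted above (section lines starting with `#`, then `field:value` properties terminated by `\r\n`) can be sketched as a small parser. This is only an illustration: `parse_info` is a hypothetical helper, not part of Valkey or valkey-bloom, and the sample values are made up.

```python
def parse_info(text):
    """Group INFO-style output into {section: {field: value}}."""
    sections = {}
    current = None
    for line in text.splitlines():  # splitlines() also handles \r\n
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A line starting with '#' names a section; it is not a comment.
            current = line[1:].strip()
            sections[current] = {}
        elif ":" in line and current is not None:
            field, value = line.split(":", 1)
            sections[current][field] = value
    return sections


# Illustrative sample only; real output depends on the server and module.
sample = (
    "# bf_bloom_core_metrics\r\n"
    "bf_bloom_total_memory_bytes:0\r\n"
    "bf_bloom_num_items_across_objects:0\r\n"
)
print(parse_info(sample))
```

A client that parses output this way keys each group of fields by the text after `#`, which is why the section name matters to users and not only to the docs.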
You can fetch just a certain section, so `INFO bf_bloom_core_metrics` is valid.
I think at some point there were thoughts on expanding the bloom module so that it wasn't confined to just a bloom filter, so we wanted to specify that this was for bloom in particular.
topics/bloomfilters.md
Outdated

### Bloom filter core metrics

* bf_bloom_total_memory_bytes: Current total number of bytes used by all bloom filters.
Use backticks on the field names here (and below).
…creation and spelling Signed-off-by: zackcam <[email protected]>
topics/bloomfilters.md
Outdated

### Financial fraud detection

Bloom filters can help answer the question "Has this card been flagged as stolen?", use a bloom filter that has cards reported stolen added to it. Check a card on use that it is not present in the bloom filter. If it isn't then the card is not marked as stolen, if present then a check to the main database can happen or deny the purchase.
Minor rewording:
Bloom filters can be used to answer the question, "Has this card been flagged as stolen?". To do this, use a bloom filter that contains cards reported as stolen. When a card is used, check whether it is present in the bloom filter. If the card is not found, it means it is not marked as stolen. If the card is present in the filter, a check can be made against the main database, or the purchase can be denied.
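
A rough sketch of that flow, in case it helps the wording. Everything here is hypothetical: `filter_exists` stands in for a `BF.EXISTS` call against the stolen-cards filter, `database_is_stolen` stands in for the authoritative lookup, and the card numbers are invented.

```python
def check_card(card_number, filter_exists, database_is_stolen):
    """Decide whether to allow a purchase, using the filter as a cheap first check."""
    if not filter_exists(card_number):
        # No false negatives: a card missing from the filter was never added.
        return "allow"
    # A hit may be a false positive, so confirm against the main database.
    return "deny" if database_is_stolen(card_number) else "allow"


# Plain sets stand in for the bloom filter and the database in this sketch.
stolen_db = {"4111-0000"}
filter_hits = {"4111-0000", "4222-1111"}  # second entry simulates a false positive

for card in ("4333-2222", "4111-0000", "4222-1111"):
    print(card, check_card(card, filter_hits.__contains__, stolen_db.__contains__))
```

The third card shows why the database check is useful: the filter reports it as present, but the authoritative lookup clears it.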
topics/bloomfilters.md
Outdated
Bloom filters can help answer the following questions to advertisers:
* Has the user already seen this ad?
* Has the user already bought this product?

Use a Bloom filter for every user, storing all bought products. The recommendation engine can then suggest a new product and checks if the product is in the user's Bloom filter.

* If no, the ad is shown to the user and is added to the Bloom filter.
* If yes, the process restarts and repeats until it finds a product that is not present in the filter.
Bloom filters can help advertisers answer the following questions:
* Has the user already seen this ad?
* Has the user already purchased this product?

For each user, use a Bloom filter to store all the products they have purchased. The recommendation engine can then suggest a new product and check if it is present in the user's Bloom filter.
* If the product is not in the filter, the ad is shown to the user, and the product is added to the filter.
* If the product is already in the filter, it means the ad has already been shown to the user and the recommendation engine finds a different ad to show.
topics/bloomfilters.md
Outdated
* If no then we allow access to the site
* If yes then we can deny access or perform a full check of the URL
- * If no then we allow access to the site
- * If yes then we can deny access or perform a full check of the URL
+ * If no, then we allow access to the site
+ * If yes, then we can deny access or perform a full check of the URL
topics/bloomfilters.md
Outdated

Bloom filters can answer the question: Has this username/email/domain name/slug already been used?

For example for usernames. Use a Bloom filter for every username that has signed up. A new user types in the desired username. The app checks if the username exists in the Bloom filter.
- For example for usernames. Use a Bloom filter for every username that has signed up. A new user types in the desired username. The app checks if the username exists in the Bloom filter.
+ In the username example, we can use a Bloom filter to track every username that has signed up. When a new user attempts to sign up with their desired username, the app checks if the username exists in the Bloom filter.

The difference between scaling and non scaling bloom filters is that scaling bloom filters do not have a fixed capacity, but a capacity that can grow. While non-scaling bloom filters will have a fixed capacity which also means a fixed size.

When a scaling filter reaches its capacity, adding a new unique item will cause a new bloom filter to be created and added to the vector of bloom filters. This new bloom filter will have a larger capacity (previous bloom filter's capacity * expansion rate of the bloom object).
How can one create a scalable bloom filter?

When a scaling filter reaches its capacity, adding a new unique item will cause a new bloom filter to be created and added to the vector of bloom filters. This new bloom filter will have a larger capacity (previous bloom filter's capacity * expansion rate of the bloom object).

When a non scaling filter reaches its capacity, if a user tries to add a new unique item an error will be returned
How can one create a non-scalable bloom filter?
Users can create a non scaling bloom filter using BF.RESERVE and BF.INSERT commands or by changing the default X configuration.
Example: `BF.RESERVE <filter-name> <error-rate> <capacity> NONSCALING`.
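
To make the scale-out rule in the quoted text concrete (each new sub filter gets the previous sub filter's capacity times the expansion rate), here is a small arithmetic sketch. `sub_filter_capacities` is a hypothetical helper written for this comment, not module code; it uses the capacity of 100 from the BF.INFO example and the default expansion of 2 mentioned in the docs.

```python
def sub_filter_capacities(initial_capacity, expansion, num_filters):
    """Capacity of each sub filter after repeated scale outs."""
    capacities = [initial_capacity]
    for _ in range(num_filters - 1):
        # New sub filter capacity = previous sub filter capacity * expansion.
        capacities.append(capacities[-1] * expansion)
    return capacities


caps = sub_filter_capacities(initial_capacity=100, expansion=2, num_filters=4)
print(caps)       # [100, 200, 400, 800]
print(sum(caps))  # 1500
```

So after three scale outs a filter that started at capacity 100 can hold 1500 items in total under this rule.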
topics/bloomfilters.md
Outdated
If the data size is known and fixed then using a non-scaling bloom filter is preferred, for example a static dictionary could use a non scaling bloom filter as the amount of items should be fixed. Likewise the reverse case for dynamic data and unknown final sizes is when you should use a scaling bloom filters.

There are a few benefits for using non scaling filters, a non scaling filter will have better performance than a filter that has scaled out. A non scaling filter also will use less memory for the capacity that is available. However if you don't want to hit an error and want use-as-you-go capacity, scaling is better.
- If the data size is known and fixed then using a non-scaling bloom filter is preferred, for example a static dictionary could use a non scaling bloom filter as the amount of items should be fixed. Likewise the reverse case for dynamic data and unknown final sizes is when you should use a scaling bloom filters.
- There are a few benefits for using non scaling filters, a non scaling filter will have better performance than a filter that has scaled out. A non scaling filter also will use less memory for the capacity that is available. However if you don't want to hit an error and want use-as-you-go capacity, scaling is better.
+ If the capacity (number of items we want to add) is known and fixed, using a non-scaling bloom filter is preferred. Conversely, if the capacity is unknown or dynamically calculated, using a scaling bloom filter is ideal.
+ There are a few benefits to using non scaling filters. A non scaling filter will have better performance than a filter that has scaled out several times (e.g. > 100). Also, non scaling filters in general use less memory than a scaling filter that has scaled out several times to hold the same capacity.
+ However, if you want to ensure you do not hit any capacity related errors and want use-as-you-go capacity, scaling is better.
topics/bloomfilters.md
Outdated
</table>


As bloom filters have a default expansion of 2 this means all default bloom objects will be scaling. These options are used when not specified explicitly in the commands used to create a new bloom object. For example doing a BF.ADD for a new filter will create a filter with the exact above qualities. These default properties can be configured through configs on the bloom module.
- As bloom filters have a default expansion of 2 this means all default bloom objects will be scaling. These options are used when not specified explicitly in the commands used to create a new bloom object. For example doing a BF.ADD for a new filter will create a filter with the exact above qualities. These default properties can be configured through configs on the bloom module.
+ Since bloom filters have a default expansion of 2, all bloom filters created by default will be scaling. Additionally, the other default properties of a bloom filter creation can be seen in the table above and BF.INFO command response below. These default properties can be configured through configs on the bloom module.
topics/bloomfilters.md
Outdated


As bloom filters have a default expansion of 2 this means all default bloom objects will be scaling. These options are used when not specified explicitly in the commands used to create a new bloom object. For example doing a BF.ADD for a new filter will create a filter with the exact above qualities. These default properties can be configured through configs on the bloom module.
Example of default bloom objects information:
- Example of default bloom objects information:
+ Example of default bloom filter information:
topics/bloomfilters.md
Outdated
Most bloom commands are O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. This means that both BF.ADD and BF.EXISTS are both O(n) as they only work with one 1 item.

As performance can rely on the number of hash functions, choosing the correct capacity and expansion rate can be very important. When you scale out you will be adding more hash functions that will be used. For this reason it is recommended that you should choose a capacity after evaluating your use case as this can avoid several scale outs.

There are a few bloom commands that are O(1): BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (if no items are specified). These commands have constant time complexity since they don't work on items but instead work on the data about the bloom filter itself.
- Most bloom commands are O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. This means that both BF.ADD and BF.EXISTS are both O(n) as they only work with one 1 item.
- As performance can rely on the number of hash functions, choosing the correct capacity and expansion rate can be very important. When you scale out you will be adding more hash functions that will be used. For this reason it is recommended that you should choose a capacity after evaluating your use case as this can avoid several scale outs.
- There are a few bloom commands that are O(1): BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (if no items are specified). These commands have constant time complexity since they don't work on items but instead work on the data about the bloom filter itself.
+ The bloom commands which involve adding items or checking the existence of items have a time complexity of O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. This means that BF.ADD and BF.EXISTS are both O(n) as they only operate on one item.
+ Since performance relies on the number of hash functions, choosing the correct capacity and expansion rate can be important. In case of scalable bloom filters, with every scale out, we increase the number of checks (using hash functions of each sub filter) performed during any add / exists operation. For this reason, it is recommended that users choose a capacity after evaluating the use case / workload to help avoid several scale outs and reduce the number of checks.
+ The other bloom filter commands have O(1) time complexity: BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (when no items are provided).
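
To illustrate the point about scale outs increasing the number of checks, here is a tiny sketch. It assumes, purely for illustration, that each sub filter uses 7 hash functions; the real counts depend on the filter's settings, and `hash_checks_per_item` is a hypothetical helper, not module code.

```python
def hash_checks_per_item(hashes_per_sub_filter):
    """Upper bound on hash evaluations for one add / exists call.

    Takes one entry per sub filter: the number of hash functions it uses.
    """
    return sum(hashes_per_sub_filter)


print(hash_checks_per_item([7]))           # 7: a filter that has never scaled out
print(hash_checks_per_item([7, 7, 7, 7]))  # 28: the same filter after three scale outs
```

This is only an upper bound: an exists check can stop at the first sub filter that reports the item as present.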
topics/bloomfilters.md
Outdated

In Valkey, the bloom filter data type / commands are implemented in the [valkey-bloom module](https://github.com/valkey-io/valkey-bloom) which is an official valkey module compatible with versions 8.0 and above. Users will need to load this module onto their valkey server in order to use this feature.

Bloom filters are a space efficient probabilistic data structure that allows checking whether an element is member of a set. False positives are possible, but it guarantees no false negatives.
TODO: Explain false positive and false negative
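
One possible way to back that explanation is a toy filter like the one below. It is a minimal sketch, not how valkey-bloom is implemented: the class name `TinyBloom`, the bit count, and the SHA-256 based hashing are all assumptions made for illustration. An item that was added always has all of its bits set, so it can never be reported as missing (no false negatives). An item that was never added can still be reported as present if other items happened to set the same bits (a false positive).

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter for illustration; not the valkey-bloom implementation."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def exists(self, item):
        # Reports present only if every one of the item's positions is set.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


bloom = TinyBloom()
bloom.add("apple")
print(bloom.exists("apple"))   # always True: an added item is never reported missing
print(bloom.exists("banana"))  # usually False, but may be True (a false positive)
```

Shrinking `num_bits` or adding many items makes false positives more likely, which is the trade-off the false positive rate setting controls.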
```
bf_bloom_defrag_misses:0
```

### Bloom filter core metrics
Change object to filter
topics/bloomfilters.md
Outdated

* `bf_bloom_num_items_across_objects`: Current total number of items across all bloom objects.

* `bf_bloom_capacity_across_objects`: Current total number of filters across all bloom objects.
We can update this
Signed-off-by: zackcam <[email protected]>
This is one of three PRs that will be done for adding information about the bloom module to the Valkey website:
Bloom repo json command files: valkey-io/valkey-bloom#47
valkey-io.github.io: valkey-io/valkey-io.github.io#212
This PR has three main changes:
1. Adding the bloom command group
2. Adding bloom command metadata files (Example for bf.add below)
3. Adding bloom data type documents