Adding bloom command metadata, bloom group and bloom data type documentation #233

Open
wants to merge 6 commits into base: main


@zackcam zackcam commented Feb 20, 2025

This is one of three PRs that add information about the bloom module to the Valkey website:
Bloom repo json command files: valkey-io/valkey-bloom#47
valkey-io.github.io: valkey-io/valkey-io.github.io#212

This PR has three main changes:

  1. Adding the bloom command group (screenshot)
  2. Adding bloom command metadata files (screenshot of the `bf.add` example)
  3. Adding bloom data type documents (screenshots)

Contributor

@zuiderkwast zuiderkwast left a comment

Very interesting!

I skimmed through it very quickly. The documentation itself looks great AFAICT. I can do a more detailed review later.

The commands look very much like built-in commands. It's not mentioned anywhere that it's a separate module that users need to install. I think we should mention it on the bloom filters topic page with a link to the GitHub repo. The BF command pages should link to that topic page, so the pages are all linked together.

To build man pages, the scripts in this repo need to be able to take multiple command JSON files. This needs to be added to the Makefile, the README and maybe the python scripts too. Please try to build the man pages as described in the README of this repo.

@zuiderkwast
Contributor

Many of the spellcheck errors can be fixed simply by writing the command names in backticks. Anything in backticks is excluded from spellcheck IIRC.

@zackcam
Author

zackcam commented Feb 20, 2025

The commands look very much like built-in commands. It's not mentioned anywhere that it's a separate module that users need to install

I think we can also make it more explicit on the data type page by adding a modules section, as in this screenshot (image not preserved).
Does this seem like something that would be wanted?

@zuiderkwast
Contributor

I think we can make it more explicit on the data type page as well by making a modules section. i.e

Yes, something like that would be good. In your screenshot it looks like the "Extensions" sub-heading is part of "Module Data Types" though, because of the levels of the headings used. If we do this, then "Module Data Types" should be a level-2 heading and "Bloom Filter" a level-3 heading under it.

How about just mentioning the module within the description? Something like this?

 ## Bloom Filter
 
 [Bloom filters](bloomfilters.md) provides a space efficient probabilistic data structure that allows checking if an element is a member of a set. False positives are possible, but it guarantees no false negatives.
+Bloom filters are provided by the module `valkey-bloom`.
 For more information, see:

 * [Overview of Bloom Filters](bloomfilters.md)
 * [Bloom filter command reference](../commands/#bloom)
+* [The valkey-bloom module on GitHub](https://github.com/valkey-io/valkey-bloom/)

@madolson
Member

@zuiderkwast I also wanted to get your input about how we should structure the modules to make it clear they aren't part of the core. In the current structure they are intermingled. I don't really have an opinion yet, but one alternative would be to at least separate them in a separate folder structure and clarify which module they are a part of.

@zuiderkwast
Contributor

@zuiderkwast I also wanted to get your input about how we should structure the modules to make it clear they aren't part of the core. In the current structure they are intermingled. I don't really have an opinion yet, but one alternative would be to at least separate them in a separate folder structure and clarify which module they are a part of.

Are you talking about the URLs of the commands? I like that it's a flat structure, just like the commands are in a global flat namespace. The BF. prefix is enough.

But we should definitely show it in some way. A line somewhere on each command page would be good. I hope we can generate it in some way from an optional key in the command JSON file or something like that.

@madolson
Member

Are you talking about the URLs of the commands? I like that it's a flat structure, just like the commands are in a global flat namespace. The BF. prefix is enough.

I don't have a strong preference one way or the other about flat/nested, so sticking with flat is OK for me.

But we should definitely show it in some way. A line somewhere on each command page would be good. I hope we can generate it in some way from an optional key in the command JSON file or something like that.

Yeah, as an immediate step let's make sure there is something in the JSON file. Maybe `Module Required: <link to Bloom>`.
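
A minimal sketch of the optional-key idea discussed above, in Python since the doc scripts in this repo are Python. The `module` key, its `name` and `url` sub-keys, and the `summary` text are all hypothetical; no schema has been agreed on in this thread. The sketch only assumes that a command file has one top-level entry keyed by the command name.

```python
import json

# Hypothetical command JSON. The "module" key and its sub-keys are
# assumptions for illustration, not an agreed schema.
BF_ADD_JSON = """
{
  "BF.ADD": {
    "summary": "Illustrative summary text.",
    "group": "bloom",
    "module": {
      "name": "valkey-bloom",
      "url": "https://github.com/valkey-io/valkey-bloom"
    }
  }
}
"""


def module_line(command_json: str) -> str:
    """Return a 'Module required' line, or '' for commands without the key."""
    doc = json.loads(command_json)
    # Take the single top-level entry (keyed by command name).
    spec = next(iter(doc.values()))
    module = spec.get("module")
    if module is None:
        return ""  # core command: nothing to render
    return f"Module required: [{module['name']}]({module['url']})"
```

Because the key is optional, existing core command files would render unchanged, and a future module would only need to add the same key to its own JSON files.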

Member

@madolson madolson left a comment

Not a super deep review. I think we should indicate more clearly that the commands are from a module and not part of the core. That can maybe come from the JSON docs only, though.

Comment on lines 3 to 4
* key (required) - A Valkey key of Bloom data type
* item (required) - Item to add
Member

Suggested change
* key (required) - A Valkey key of Bloom data type
* item (required) - Item to add

We typically omit this, since the usage would be included at the top which will indicate if something is required.

Author

Yeah, makes sense. I removed all of these from the bloom commands and, where I thought the arguments needed explaining, updated the heading name.

@@ -0,0 +1,12 @@
Adds an item to a bloom filter, if the specified filter does not exist creates a default bloom filter with that name.
Member

Suggested change
Adds an item to a bloom filter, if the specified filter does not exist creates a default bloom filter with that name.
Adds an item to a bloom filter, if the specified bloom filter does not exist creates a bloom filter with default configurations with that name.
If you want to create a bloom filter with non-standard options, use the `BF.INSERT` command.

Author

Updated and made it less wordy as well by removing 'specified' from the description

@@ -0,0 +1,16 @@
Determines if a specified item has been added to the specified bloom filter.
Syntax
Member

Suggested change
Syntax

@@ -0,0 +1,35 @@
Returns information about a bloomfilter

## Arguments
Member

These need to be kept because they include the info data, but I would change this to be about info fields or something.

## Arguments
* key (required) - A valkey key of bloom data type
* CAPACITY (optional) - Returns the number of unique items that would need to be added before scaling would happen
* SIZE (optional) - Returns the memory size which is the number of bytes allocated
Member

Suggested change
* SIZE (optional) - Returns the memory size which is the number of bytes allocated
* SIZE (optional) - Returns the number of bytes allocated

Why waste time say lot word when few word do trick?

@@ -92,6 +92,14 @@ The [HyperLogLog](hyperloglogs.md) data structures provide probabilistic estimat
* [Overview of HyperLogLog](hyperloglogs.md)
* [HyperLogLog command reference](../commands/#hyperloglog)

## Bloom Filter

[Bloom filters](bloomfilters.md) provides a space efficient probabilistic data structure that allows checking if an element is a member of a set. False positives are possible, but it guarantees no false negatives.
Member

I would translate this to English with an example.

Author

I tried to make this more understandable. If the new version still isn't great, the wording I use in the exists and mexists commands could also work.


Bloom filters are a space efficient probabilistic data structure that allows checking whether an element is member of a set. False positives are possible, but it guarantees no false negatives.

## Bloom commands
Member

Our other examples include the "basic commands" up front, and then the more sophisticated commands later. I think we should do the same.


**Financial fraud detection**

Bloom filters can help answer the question "Has the user paid from this location before?", which can then give insights if there has been suspicious activity in shopping habits.
Member

Is this a real use case? The false positive here is not ideal, since it might make it seem like a transaction is legitimate when it is not.

Author

Updated this use case to be more about card fraud instead of location based checking


Bloom filters can help answer the question "Has the user paid from this location before?", which can then give insights if there has been suspicious activity in shopping habits.

For the above each user would have a Bloom filter which is then checked for every transaction.
Member

Might just merge this into the previous paragraph.


**Check if URL's are malicious**

Bloom filters can answer the question is a URL malicious. Any URL inputted would be checked against a malicious URL bloom filter.
Member

Suggested change
Bloom filters can answer the question is a URL malicious. Any URL inputted would be checked against a malicious URL bloom filter.
Bloom filters can answer the question "is a URL malicious?". Any URL inputted would be checked against a malicious URL bloom filter.

Contributor

@zuiderkwast zuiderkwast left a comment

Not a complete review.

We need to think about what we want regarding

  1. How to show which module a command belongs to and how to store this in the JSON file(s).
  2. What to show in the Since fields. If we'll release some valkey-with-modules bundle, then the version number should probably follow valkey's versioning(?).

Comment on lines 67 to 68
"* [Integer reply](../topics/protocol.md#integers): '1'. The item was successfully added",
"* [Integer reply](../topics/protocol.md#integers): '0'. The item already existed in the bloom filter",
Contributor

With the single quotes it looks a bit like string literals. Use backticks instead to mark it as code? This seems to be how some other commands' integer replies are documented.

Suggested change
"* [Integer reply](../topics/protocol.md#integers): '1'. The item was successfully added",
"* [Integer reply](../topics/protocol.md#integers): '0'. The item already existed in the bloom filter",
"* [Integer reply](../topics/protocol.md#integers): `1` if the item was successfully added",
"* [Integer reply](../topics/protocol.md#integers): `0` if the item already existed in the bloom filter",

Compare to for example this one:

  "CLIENT UNBLOCK": [
    "One of the following:",
    "* [Integer reply](../topics/protocol.md#integers): `0` if the client was unblocked successfully.",
    "* [Integer reply](../topics/protocol.md#integers): `1` if the client wasn't unblocked."
  ],

Author

Makes sense, updated both add and exists in both response files to follow this.


Example usage for a default bloom object:
```
127.0.0.1:6379> bf.insert validate_scale_fail VALIDATESCALETO 26214301
Contributor

Uppercasing all commands like `BF.INSERT` and fixed tokens makes it easier to see what is fixed and what is variable.


## Common use cases for bloom filters

**Financial fraud detection**
Contributor

These look like sub-headings, so I think we should mark them as such. It's semantically more correct. (The others too; not only this one.)

Suggested change
**Financial fraud detection**
### Financial fraud detection

@@ -0,0 +1,12 @@
Adds an item to a bloom filter, if the specified filter does not exist creates a default bloom filter with that name.
## Arguments
Contributor

An empty line before and after headings, before and after bullet lists, etc. makes it more likely to be rendered correctly on the website, man pages and GitHub. They all use different markdown implementations with some subtle differences.

Suggested change
## Arguments
## Arguments
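
The blank-line convention above could be checked mechanically. This is a rough sketch of such a check, not a script that exists in this repo; the function name is hypothetical, and it only handles ATX (`#`) headings and backtick fences.

```python
import re

HEADING = re.compile(r"^#{1,6} ")
FENCE = "`" * 3  # fenced code block marker


def headings_missing_blank_lines(markdown: str):
    """Return 1-based line numbers of headings not surrounded by blank lines."""
    lines = markdown.splitlines()
    problems = []
    in_fence = False
    for i, line in enumerate(lines):
        if line.startswith(FENCE):
            in_fence = not in_fence  # skip '#' lines inside code blocks
            continue
        if in_fence or not HEADING.match(line):
            continue
        before_ok = i == 0 or lines[i - 1].strip() == ""
        after_ok = i == len(lines) - 1 or lines[i + 1].strip() == ""
        if not (before_ok and after_ok):
            problems.append(i + 1)
    return problems
```

Run against the command markdown files, this would flag a heading such as `## Arguments` placed directly under the description line.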

"[Array reply](../topics/protocol.md#arrays): List of information about the bloom filter.",
"When an optional argument is provided:",
"* [Integer reply](../topics/protocol.md#integers): argument value",
"* [String reply??](../topics/protocol.md#simple-strings): argument value",
Contributor

Why "String reply??" with double question marks??

Author

That was accidentally left over. We only have a string reply when one of the optional arguments is provided, so I meant to come back to this and clarify which cases return a string and which return an integer.

@madolson
Member

What to show in the Since fields. If we'll release some valkey-with-modules bundle, then the version number should probably follow valkey's versioning(?).

I think for now we should show the independent module's version number, since we got alignment on that. Internally at AWS we are planning on reviving valkey-io/valkey#408 and posting some suggestions. Once that has alignment, we can maybe add more information about where it's available (i.e. Valkey core since 10.0, valkey-bloom since 1.0).

@zackcam
Author

zackcam commented Feb 21, 2025

List of changes other than word choice / document wording:

The version change isn't done in this repo, but it was discussed on this PR, so I'm adding a screenshot (image not preserved). I'm still looking at how best to determine whether a command is from a specific module, so that it is easy to extend for future modules as well. The io PR has not been updated yet to include this module version change; I will push that once I find out how to distinguish between modules.

Man page generation for modules, example for `bf.add` (screenshot, image not preserved).

For future modules, there are only a few places in the Makefile that need additions. The main change is called out below; the others should be clear:
Line 187: $(eval VALKEY_ROOTS := $(VALKEY_ROOT) $(VALKEY_BLOOM_ROOT) $(FUTURE_MODULE))

@zackcam
Author

zackcam commented Feb 21, 2025

New command page example with hyperlink to module repo:
(screenshot dated 2025-02-21, image not preserved)

@@ -0,0 +1,12 @@
Adds an item to a bloom filter, if the specified bloom filter does not exist creates a bloom filter with default configurations with that name.
Member

Suggested change
Adds an item to a bloom filter, if the specified bloom filter does not exist creates a bloom filter with default configurations with that name.
Adds a single item to a bloom filter. If the specified bloom filter does not exist, a bloom filter is created with the provided name with default properties.

@@ -0,0 +1,12 @@
Adds an item to a bloom filter, if the specified bloom filter does not exist creates a bloom filter with default configurations with that name.

If you want to create a bloom filter with non-standard options, use the `BF.INSERT` or `BF.RESERVE` command.
Member

By non-standard options, you mean the non default properties. Right?

Suggested change
If you want to create a bloom filter with non-standard options, use the `BF.INSERT` or `BF.RESERVE` command.
To add multiple items to a bloom filter, you can use the BF.MADD or BF.INSERT commands.
If you want to create a bloom filter with non-default properties, use the `BF.INSERT` or `BF.RESERVE` command.

Author

Yeah, non-standard meant non-default. I agree it makes more sense to say non-default, and that keeps it consistent.

@@ -0,0 +1,12 @@
Gets the cardinality of a Bloom filter - number of items that have been successfully added to a Bloom filter.
Member

Suggested change
Gets the cardinality of a Bloom filter - number of items that have been successfully added to a Bloom filter.
Returns the cardinality of a Bloom filter which is the number of items that have been successfully added to it.

1
127.0.0.1:6379> BF.CARD key
1
127.0.0.1:6379> BF.CARD missing
Member

Suggested change
127.0.0.1:6379> BF.CARD missing
127.0.0.1:6379> BF.CARD nonexistentkey

@@ -0,0 +1,19 @@
Determines if an item has been added to the bloom filter.
Member

Suggested change
Determines if an item has been added to the bloom filter.
Determines if an item has been added to the bloom filter previously.

* SEED seed - The seed the hash functions will use
* NONSCALING - Will make it so the filter can not scale
* VALIDATESCALETO `validatescaleto` - Checks if the filter could scale to this capacity and if not show an error and don’t create the bloom filter
* ITEMS item - One or more items we will add to the bloom filter
Member

Suggested change
* ITEMS item - One or more items we will add to the bloom filter
* ITEMS item - One or more items to be added to the bloom filter

* TIGHTENING `tightening_ratio` - The tightening ratio for the bloom filter
* SEED seed - The seed the hash functions will use
* NONSCALING - Will make it so the filter can not scale
* VALIDATESCALETO `validatescaleto` - Checks if the filter could scale to this capacity and if not show an error and don’t create the bloom filter
Member

Suggested change
* VALIDATESCALETO `validatescaleto` - Checks if the filter could scale to this capacity and if not show an error and don’t create the bloom filter
* VALIDATESCALETO `validatescaleto` - Validates if the filter can scale out and reach to this capacity based on limits and if not, return an error without creating the bloom filter

* NOCREATE - Will not create the bloom filter and add items if the filter does not exist already
* TIGHTENING `tightening_ratio` - The tightening ratio for the bloom filter
* SEED seed - The seed the hash functions will use
* NONSCALING - Will make it so the filter can not scale
Member

@KarthikSubbarao KarthikSubbarao Mar 5, 2025

@zackcam - you can follow wording from BF.RESERVE

Comment on lines 9 to 10
* TIGHTENING `tightening_ratio` - The tightening ratio for the bloom filter
* SEED seed - The seed the hash functions will use
Member

TODO: Add more wording

127.0.0.1:6379> BF.INSERT key ITEMS item1 item2
1) (integer) 1
2) (integer) 1
# This does not update the capcity but uses the origianl filters values
Member

Suggested change
# This does not update the capcity but uses the origianl filters values
# This does not update the capacity since the filter already exists. It only adds the provided items.

@KarthikSubbarao
Member

I only reviewed the Command Documentation.

I will need to review the remaining sections next

@@ -92,6 +92,17 @@ The [HyperLogLog](hyperloglogs.md) data structures provide probabilistic estimat
* [Overview of HyperLogLog](hyperloglogs.md)
* [HyperLogLog command reference](../commands/#hyperloglog)

## Bloom Filter

[Bloom filters](bloomfilters.md) are a space efficient data type that can tell you if something is definitely not in a set, or it might be in the set.
Member

Suggested change
[Bloom filters](bloomfilters.md) are a space efficient data type that can tell you if something is definitely not in a set, or it might be in the set.
[Bloom filters](bloomfilters.md) are a space efficient probabilistic data type that can be used to check if item/s are definitely not present in a set, or if they exist within the set (with the configured false positive rate).

@@ -0,0 +1,108 @@
---
title: "Bloom Filters"
Member

Could we also include the section/s below:

  1. Scaling / Non Scaling Filters and their implications

Author

Added this section. Instead of putting the implications under performance, I added a subsection to the scaling and non-scaling section.


Error rate - 0.01

Expansion - 2
Member

As we mention in command documentation, let us clarify the scaling and non scaling cases where expansion is nil.

12) (integer) 2
13) Max scaled capacity
14) (integer) 26214300
```
Member

We can include advanced / additional properties here as a sub section within the "Default Properties":

  1. Tightening Ratio - We do not recommend tuning this unless there is a specific use case for lower memory usage (with higher false positive) or vice versa.
  2. Seed - This is only useful if a user has a specific 32 byte seed they want their bloom filters to use.

Author

Added these in an advanced properties section.


Most bloom commands are O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. This means that both BF.ADD and BF.EXISTS are both O(n) as they only work with one 1 item.

There are a few bloom commands that are O(1) as they don't work on items but instead work on the data about the bloom filter itself.
Member

Do you mean BF.CARD and BF.INFO? Maybe you can list the ones you are referring to here
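
To make the complexity discussion concrete, here is a minimal, generic Bloom filter sketch in Python. It is not the valkey-bloom implementation: the class, the SHA-256 based double hashing, and the bit layout are assumptions chosen for brevity. It shows why adding or checking one item costs one pass over the hash functions, while a cardinality lookup only reads stored metadata.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter sketch (illustrative, not valkey-bloom's code)."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)
        self.cardinality = 0

    def _positions(self, item: str):
        # Double hashing: derive num_hashes bit positions from two base hashes.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> int:
        """Return 1 if any bit was newly set, 0 if the item already looked present."""
        newly_set = False
        for pos in self._positions(item):  # one step per hash function
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                self.bits[byte] |= 1 << bit
                newly_set = True
        if newly_set:
            self.cardinality += 1
        return int(newly_set)

    def exists(self, item: str) -> int:
        """Return 0 for 'definitely absent', 1 for 'possibly present'."""
        return int(all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item)))

    def card(self) -> int:
        # Constant time: reads a counter, touches no item data.
        return self.cardinality
```

In this sketch, `add` and `exists` do work proportional to the number of hash functions for a single item, and a multi-item command would repeat that per item, which matches the "hash functions times items" wording in the excerpt.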

Comment on lines 69 to 73
Capacity - 100

Error rate - 0.01

Expansion - 2
Member

In addition to mentioning the default value, can we follow the standard wording from "command documentation" to have a one liner to explain these properties?

Introduction to Bloom Filters
---

The bloom filter data type is taken from a [separate module](https://github.com/valkey-io/valkey-bloom) that users will need to install in order to use.
Member

We can move this sentence to be after the introduction statement.

Suggested change
The bloom filter data type is taken from a [separate module](https://github.com/valkey-io/valkey-bloom) that users will need to install in order to use.
In Valkey, the bloom filter data type / commands are implemented in the [valkey-bloom module](https://github.com/valkey-io/valkey-bloom) which is an official valkey module compatible with versions 8.0 and above. Users will need to load this module onto their valkey server in order to use this feature.


### When should you use scaling vs non-scaling filters

If the data size is known and fixed then using a non-scaling bloom filter is preferred, for example a static dictionary could use a non scaling bloom filter as the amount of items should be fixed. Likewise the reverse case for dynamic data and unknown final sizes is when you should use a scaling bloom filters.
Member

We can consider briefly explaining benefits of non scaling (better performance and less memory overhead) and its drawbacks - it will error out when it reaches capacity. If you don't want to hit an error and want use-as-you-go capacity, scaling is better, but it uses more memory for the additional capacity which is available. Also, more filters (e.g. >500-1000) means higher command latencies.
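
The trade-off described above can be sketched with a small capacity model in Python. This is only an illustration: the class name is hypothetical, no bit arrays are modeled, and the growth rule (each new sub-filter gets the previous capacity multiplied by the expansion) is an assumption based on the common scalable Bloom filter design rather than a statement about valkey-bloom internals. The defaults of 100 and 2 are the default capacity and expansion quoted elsewhere in this PR.

```python
class FilterCapacity:
    """Tracks sub-filter capacities only; no bit arrays are modeled."""

    def __init__(self, capacity=100, expansion=2, scaling=True):
        self.expansion = expansion
        self.scaling = scaling
        # Each entry is [capacity, items_added] for one sub-filter.
        self.filters = [[capacity, 0]]

    def add_unique_item(self):
        capacity, used = self.filters[-1]
        if used >= capacity:
            if not self.scaling:
                # Non-scaling: reject the item once capacity is reached.
                raise OverflowError("non scaling filter is at capacity")
            # Scaling: append a larger sub-filter (assumed growth rule).
            self.filters.append([capacity * self.expansion, 0])
        self.filters[-1][1] += 1

    def total_capacity(self):
        return sum(capacity for capacity, _ in self.filters)
```

A non-scaling filter keeps a single sub-filter and fails loudly at capacity, while a scaling filter keeps accepting items at the cost of more sub-filters to allocate and check.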


Seed - The seed used by the bloom filter can be specified by the user in the BF.INSERT command. This property is only useful if you have a specific 32 byte seed that you want your bloom filter to use. By defualt every bloom filter will use a random seed.

Tightening Ratio - We do not recommend fine tuning this unless there is a specific use case for lower memory usage with higher false positive or vice versa.
Member

Let's clarify that the BF.INSERT command can help specify this


If the data size is known and fixed then using a non-scaling bloom filter is preferred, for example a static dictionary could use a non scaling bloom filter as the amount of items should be fixed. Likewise the reverse case for dynamic data and unknown final sizes is when you should use a scaling bloom filters.

## Default bloom properties
Member

Optional : We could add a monitoring section and briefly go over the INFO BF command response

section to bloomfilter topic, cleaned up other bloomfilter topic
sections
Makefile Outdated
Comment on lines 187 to 189
$(eval VALKEY_ROOTS := $(VALKEY_ROOT) $(VALKEY_BLOOM_ROOT))
$(eval FINAL_ROOT := $(firstword $(foreach root,$(VALKEY_ROOTS),$(if $(wildcard $(root)/src/commands/$*.json),$(root)))))
$(if $(FINAL_ROOT),,$(eval FINAL_ROOT := $(lastword $(VALKEY_ROOTS))))
Contributor

This looks complicated. We should try to do something more readable.

I don't have a clear idea about how to make this simpler, but maybe we can define the rules differently or maybe it can get more clear if we use $(call ...) instead of $(eval ...).

Or define a separate rule for the bf. commands like this?

$(MAN_DIR)/man3/bf.%.3valkey.gz: commands/bf.%.md ...

Author

For future modules, I think making it separate could be detrimental. This way, any future module will only need to add its equivalent of `VALKEY_BLOOM_ROOT` to the first eval. Keeping it in this format also means that if changes are needed in the future, they will only be needed here for all commands. I am going to update the last if/eval, though, so it doesn't actually need to have found a match and will only create pages for found matches. If this still doesn't sound ideal, let me know and I'll look in more depth at how to make it clearer (either by changing how this is done or by separating the steps).

Contributor

@zuiderkwast zuiderkwast Mar 18, 2025

OK, makes sense. Maybe just the first line $(eval VALKEY_ROOTS := $(VALKEY_ROOT) $(VALKEY_BLOOM_ROOT)) can be on top level instead of inside the rule?

VALKEY_ROOTS := $(VALKEY_ROOT) $(VALKEY_BLOOM_ROOT)
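
For readers less familiar with Make functions, the lookup in the Makefile excerpt (`firstword` over a `foreach` with `wildcard`) can be read as the following Python sketch. The function name is hypothetical and this is not code from the repo. It returns `None` when no root has the file, which reflects the stated plan to only build pages for found matches rather than the `lastword` fallback in the excerpt.

```python
from pathlib import Path


def find_root(roots, command):
    """Return the first root containing src/commands/<command>.json, else None."""
    for root in roots:
        candidate = Path(root) / "src" / "commands" / f"{command}.json"
        if candidate.is_file():
            return Path(root)
    return None
```

With this reading, supporting another module means appending one more entry to the list of roots, which is the extensibility argument made for the single shared rule.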

Contributor

@zuiderkwast zuiderkwast left a comment

It starts to look good. I just put a few comments on formatting and such things.

I didn't review the actual docs of the module carefully, because I don't know it very well. To me, it's enough if that part is reviewed by you, the module authors.

README.md Outdated
sudo make install INSTALL_MAN_DIR=/usr/local/share/man

Prerequisites: GNU Make, Python 3, Python 3 YAML (pyyaml), Pandoc.
Additionally, the scripts need access to the valkey code repo,
Additionally, the scripts need access to the valkey and valkey-bloom code repos,
Contributor

Mention that valkey-bloom is optional and that those pages are excluded if the valkey-bloom path is not provided.

@@ -0,0 +1 @@
../valkey-doc/commands
Contributor

Don't add this file (or symlink).

Comment on lines +3 to +7
A Bloom filter has two possible responses when you check if an item exists:

* 0 - The item definitely does not exist since with bloom filters, false negatives are not possible.

* 1 - The item exists with a given false positive (`fp`) percentage. There is an `fp` rate % chance that the item does not exist. You can create bloom filters with a more strict false positive rate as needed.
Contributor

Don't include the reply docs here. They are added in resp2_replies.json and resp3_replies.json, so if we add them here too, they will appear twice.

Comment on lines +21 to +28
127.0.0.1:6379> BF.INFO key
1) Capacity
2) (integer) 100
3) Size
4) (integer) 384
5) Number of filters
6) (integer) 1
7) Number of items inserted
Contributor

This doesn't match the documentation of the field names above. I would expect the field names to be CAPACITY, SIZE, FILTERS etc. rather than "Capacity", "Size", "Number of filters", ...

Btw, why are these uppercase? In the INFO command, the field names are lowercase.


## Insert Fields

* CAPACITY `capacity` - The number of unique items that would need to be added before a scale out occurs or (non scaling) before it rejects addition of unique items.
Contributor

Here, CAPACITY is a keyword and capacity is a placeholder for a number?

I suggest we use this formatting instead, with backticks for the keyword and italics for the variable:

Suggested change
* CAPACITY `capacity` - The number of unique items that would need to be added before a scale out occurs or (non scaling) before it rejects addition of unique items.
* `CAPACITY` *capacity* - The number of unique items that would need to be added before a scale out occurs or (non scaling) before it rejects addition of unique items.

Comment on lines +3 to +7
A Bloom filter has two possible responses when you check if an item exists:

* 0 - The item definitely does not exist since with bloom filters, false negatives are not possible.

* 1 - The item exists with a given false positive (`fp`) percentage. There is an `fp` rate % chance that the item does not exist. You can create bloom filters with a more strict false positive rate as needed.
Contributor

Skip this. Responses are documented in the response JSON files.

(I know, I don't like it. It's unnecessarily complex. I want to move the reply docs into the markdown files some day. But for now, let's just follow the existing structure.)

Author

I think this was included to make explicit how the false positive rate affects the exists command when determining whether an item is present. I could try to reword it so that it explains false positives without referring to the response, but I think the idea is that showing the response makes it easier to understand.

Contributor

@zuiderkwast zuiderkwast Mar 20, 2025

The rendered page shows the response from resp2_replies.json etc., but unfortunately it gets added at the bottom of the web page.

(On the generated man pages, the reply section gets inserted before Examples, which I think is a better place.)

You can keep this text here if you think it's better, and keep it brief in resp{2,3}_replies.json so there is not too much duplicated text.

Author

I agree that there is some slight duplication, but I like having this explained, since the false positive rate is one of the main behaviours of bloom filters. I am happy to change it if others would rather not have the duplication.

Contributor

I don't mind, but feel free to formulate it in a way so that it doesn't look too much like duplication.

"* [Integer reply](../topics/protocol.md#integers): `1` if the item exists in the bloom filter",
"* [Integer reply](../topics/protocol.md#integers): `0` if the bloom filter does not exist or the item has not been added to the bloom filter",
"",
"The command will fail if the wrong number of arguments are provided"
Contributor

No need to mention error for wrong number of arguments. All commands will return syntax error in this case. This is implicit and we don't need to mention it for every command.


```
127.0.0.1:6379> info bf
# bf_bloom_core_metrics
Contributor

For the INFO command, these section headings match the argument in uppercase, so I would expect # BF here, with a blank line below it.

Suggested change
# bf_bloom_core_metrics
# BF

Are these fields matching redis bloom info fields or are they invented in valkey-bloom?

Author

That title is determined from the bloom module (https://github.com/valkey-io/valkey-bloom/blob/unstable/src/metrics.rs#L17) and that output is exactly what I get when running info bf. I'm pretty sure they were invented in valkey-bloom.

Contributor

> That title is determined from the bloom module

Then maybe valkey-bloom doesn't exactly behave as documented for the INFO command:

   Lines can contain a section name (starting with a # character) or a property.  All
   the properties are in the form of field:value terminated by \r\n.

These lines with # are the section names you can also use as argument for fetching a single section. They're not comments.

Or can you do INFO bf_bloom_core_metrics too?

> I'm pretty sure they were invented in valkey-bloom.

Then I'm wondering why the prefix of each field is bf_bloom and not just bf? BF stands for bloom filter already, right?
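
To illustrate the format quoted above, here is a minimal Python sketch that splits INFO-style text into sections, treating each line starting with `#` as a section name rather than a comment. The function name `parse_info`, the second section name `bf_bloom_defrag_metrics`, and the sample values are hypothetical; only `bf_bloom_core_metrics`, `bf_bloom_total_memory_bytes` and `bf_bloom_defrag_misses` appear in this thread.

```python
def parse_info(text):
    """Split INFO-style output into {section_name: {field: value}}."""
    sections = {}
    current = None
    for raw in text.splitlines():  # splitlines() also handles \r\n terminators
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A '#' line names a section; it is not a comment.
            current = line[1:].strip()
            sections[current] = {}
        elif ":" in line and current is not None:
            # Properties have the form field:value.
            field, _, value = line.partition(":")
            sections[current][field] = value
    return sections


# Hypothetical sample output; the values are made up for illustration.
sample = (
    "# bf_bloom_core_metrics\r\n"
    "bf_bloom_total_memory_bytes:384\r\n"
    "\r\n"
    "# bf_bloom_defrag_metrics\r\n"
    "bf_bloom_defrag_misses:0\r\n"
)
info = parse_info(sample)
print(sorted(info))
```

A client that parses output this way would treat `bf_bloom_core_metrics` as a section name, which is why the heading is expected to be usable as an INFO argument.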

Author

You can request a single section, so INFO bf_bloom_core_metrics is valid.
I think at some point there were thoughts of expanding the bloom module so that it wasn't confined to bloom filters, so the prefix was meant to specify that these fields are for bloom in particular.


### Bloom filter core metrics

* bf_bloom_total_memory_bytes: Current total number of bytes used by all bloom filters.
Contributor

Use backticks on the field names here (and below).


### Financial fraud detection

Bloom filters can help answer the question "Has this card been flagged as stolen?", use a bloom filter that has cards reported stolen added to it. Check a card on use that it is not present in the bloom filter. If it isn't then the card is not marked as stolen, if present then a check to the main database can happen or deny the purchase.
Member

Minor rewording:

Bloom filters can be used to answer the question, "Has this card been flagged as stolen?". To do this, use a bloom filter that contains cards reported as stolen. When a card is used, check whether it is present in the bloom filter. If the card is not found, it means it is not marked as stolen. If the card is present in the filter, a check can be made against the main database, or the purchase can be denied.

Comment on lines 28 to 35
Bloom filters can help answer the following questions to advertisers:
* Has the user already seen this ad?
* Has the user already bought this product?

Use a Bloom filter for every user, storing all bought products. The recommendation engine can then suggest a new product and checks if the product is in the user's Bloom filter.

* If no, the ad is shown to the user and is added to the Bloom filter.
* If yes, the process restarts and repeats until it finds a product that is not present in the filter.
Member

Bloom filters can help advertisers answer the following questions:

  • Has the user already seen this ad?

  • Has the user already purchased this product?

For each user, use a Bloom filter to store all the products they have purchased. The recommendation engine can then suggest a new product and check if it is present in the user's Bloom filter.

  • If the product is not in the filter, the ad is shown to the user, and the product is added to the filter.

  • If the product is already in the filter, it means the ad has already been shown to the user and the recommendation engine finds a different ad to show.

Comment on lines 41 to 42
* If no then we allow access to the site
* If yes then we can deny access or perform a full check of the URL
Member

Suggested change
* If no then we allow access to the site
* If yes then we can deny access or perform a full check of the URL
* If no, then we allow access to the site
* If yes, then we can deny access or perform a full check of the URL


Bloom filters can answer the question: Has this username/email/domain name/slug already been used?

For example for usernames. Use a Bloom filter for every username that has signed up. A new user types in the desired username. The app checks if the username exists in the Bloom filter.
Member

Suggested change
For example for usernames. Use a Bloom filter for every username that has signed up. A new user types in the desired username. The app checks if the username exists in the Bloom filter.
In the username example, we can use a Bloom filter to track every username that has signed up. When a new user attempts to sign up with their desired username, the app checks if the username exists in the Bloom filter.


The difference between scaling and non scaling bloom filters is that scaling bloom filters do not have a fixed capacity, but a capacity that can grow. While non-scaling bloom filters will have a fixed capacity which also means a fixed size.

When a scaling filter reaches its capacity, adding a new unique item will cause a new bloom filter to be created and added to the vector of bloom filters. This new bloom filter will have a larger capacity (previous bloom filter's capacity * expansion rate of the bloom object).
Member

@KarthikSubbarao KarthikSubbarao Mar 24, 2025

How can one create a scalable bloom filter?


When a scaling filter reaches its capacity, adding a new unique item will cause a new bloom filter to be created and added to the vector of bloom filters. This new bloom filter will have a larger capacity (previous bloom filter's capacity * expansion rate of the bloom object).

When a non scaling filter reaches its capacity, if a user tries to add a new unique item an error will be returned
Member

@KarthikSubbarao KarthikSubbarao Mar 24, 2025

How can one create a non-scalable bloom filter?

Users can create a non scaling bloom filter using BF.RESERVE and BF.INSERT commands or by changing the default X configuration.

Example:
BF.RESERVE <filter-name> <error-rate> <capacity> NONSCALING.
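
The scaling rule described in the quoted text (each new sub-filter has the previous sub-filter's capacity multiplied by the expansion rate) can be sketched in Python as below. This is only a model of the capacity arithmetic, not the module's implementation; the function name `filters_after_inserts` and the use of `OverflowError` to stand in for the error returned by a full non-scaling filter are assumptions for illustration.

```python
def filters_after_inserts(capacity, unique_items, expansion=2, scaling=True):
    """Return the per-sub-filter capacities needed to hold `unique_items`.

    A scaling filter adds a sub-filter with capacity
    (previous capacity * expansion) whenever the total capacity is exceeded.
    A non-scaling filter raises instead of growing.
    """
    caps = [capacity]
    while sum(caps) < unique_items:
        if not scaling:
            # Stand-in for the error a non-scaling filter returns when full.
            raise OverflowError("non-scaling filter is at capacity")
        caps.append(caps[-1] * expansion)
    return caps


# Capacity 100 with the default expansion of 2 mentioned in the docs.
print(filters_after_inserts(100, 100))  # exactly full, no scale-out yet
print(filters_after_inserts(100, 101))  # the next unique item triggers a scale-out
print(filters_after_inserts(100, 250))
```

With a fixed capacity (`scaling=False`), the same call for 101 items raises instead of returning a second sub-filter, which mirrors the behavioural difference the two questions above ask the docs to spell out.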

Comment on lines 65 to 67
If the data size is known and fixed then using a non-scaling bloom filter is preferred, for example a static dictionary could use a non scaling bloom filter as the amount of items should be fixed. Likewise the reverse case for dynamic data and unknown final sizes is when you should use a scaling bloom filters.

There are a few benefits for using non scaling filters, a non scaling filter will have better performance than a filter that has scaled out. A non scaling filter also will use less memory for the capacity that is available. However if you don't want to hit an error and want use-as-you-go capacity, scaling is better.
Member

Suggested change
If the data size is known and fixed then using a non-scaling bloom filter is preferred, for example a static dictionary could use a non scaling bloom filter as the amount of items should be fixed. Likewise the reverse case for dynamic data and unknown final sizes is when you should use a scaling bloom filters.
There are a few benefits for using non scaling filters, a non scaling filter will have better performance than a filter that has scaled out. A non scaling filter also will use less memory for the capacity that is available. However if you don't want to hit an error and want use-as-you-go capacity, scaling is better.
If the capacity (number of items we want to add) is known and fixed, using a non-scaling bloom filter is preferred. Likewise, in the reverse case, if the capacity is unknown or dynamically calculated, using a scaling bloom filter is ideal.
There are a few benefits to using non-scaling filters. A non-scaling filter will have better performance than a filter that has scaled out several times (e.g. > 100). Also, non-scaling filters in general use less memory than a scaling filter that has scaled out several times to hold the same capacity.
However, if you want to avoid capacity-related errors and want use-as-you-go capacity, scaling is better.

</table>


As bloom filters have a default expansion of 2 this means all default bloom objects will be scaling. These options are used when not specified explicitly in the commands used to create a new bloom object. For example doing a BF.ADD for a new filter will create a filter with the exact above qualities. These default properties can be configured through configs on the bloom module.
Member

Suggested change
As bloom filters have a default expansion of 2 this means all default bloom objects will be scaling. These options are used when not specified explicitly in the commands used to create a new bloom object. For example doing a BF.ADD for a new filter will create a filter with the exact above qualities. These default properties can be configured through configs on the bloom module.
Since bloom filters have a default expansion of 2, all bloom filters created with default settings will be scaling. Additionally, the other default properties of a bloom filter can be seen in the table above and in the BF.INFO command response below. These default properties can be configured through configs on the bloom module.



As bloom filters have a default expansion of 2 this means all default bloom objects will be scaling. These options are used when not specified explicitly in the commands used to create a new bloom object. For example doing a BF.ADD for a new filter will create a filter with the exact above qualities. These default properties can be configured through configs on the bloom module.
Example of default bloom objects information:
Member

@KarthikSubbarao KarthikSubbarao Mar 24, 2025

Suggested change
Example of default bloom objects information:
Example of default bloom filter information:

Comment on lines 153 to 157
Most bloom commands are O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. This means that both BF.ADD and BF.EXISTS are both O(n) as they only work with one 1 item.

As performance can rely on the number of hash functions, choosing the correct capacity and expansion rate can be very important. When you scale out you will be adding more hash functions that will be used. For this reason it is recommended that you should choose a capacity after evaluating your use case as this can avoid several scale outs.

There are a few bloom commands that are O(1): BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (if no items are specified). These commands have constant time complexity since they don't work on items but instead work on the data about the bloom filter itself.
Member

Suggested change
Most bloom commands are O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. This means that both BF.ADD and BF.EXISTS are both O(n) as they only work with one 1 item.
As performance can rely on the number of hash functions, choosing the correct capacity and expansion rate can be very important. When you scale out you will be adding more hash functions that will be used. For this reason it is recommended that you should choose a capacity after evaluating your use case as this can avoid several scale outs.
There are a few bloom commands that are O(1): BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (if no items are specified). These commands have constant time complexity since they don't work on items but instead work on the data about the bloom filter itself.
The bloom commands which involve adding items or checking the existence of items have a time complexity of O(n * k), where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. This means that BF.ADD and BF.EXISTS are both O(n), as they only operate on one item.
Since performance relies on the number of hash functions, choosing the correct capacity and expansion rate can be important. In the case of scalable bloom filters, every scale out increases the number of checks (using the hash functions of each sub filter) performed during any add / exists operation. For this reason, it is recommended that users choose a capacity after evaluating the use case / workload, to help avoid several scale outs and reduce the number of checks.
The other bloom filter commands have O(1) time complexity: BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (when no items are provided).
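
As a rough illustration of the point about scale outs, the sketch below counts hash evaluations for an add or exists call in the worst case where every sub filter is consulted. The per-sub-filter hash counts (7, 8, 9) and the function name are hypothetical, and whether valkey-bloom always consults every sub filter is an assumption here, not something stated in this thread.

```python
def worst_case_hash_evaluations(hashes_per_sub_filter, num_items):
    """Upper bound on hash evaluations: every sub filter is checked per item."""
    return sum(hashes_per_sub_filter) * num_items


# One sub filter (never scaled out), one item: O(n) with n hash functions.
print(worst_case_hash_evaluations([7], 1))

# After two scale outs, the same single-item call touches three sub filters.
print(worst_case_hash_evaluations([7, 8, 9], 1))

# k items multiply the cost again, giving the O(n * k) shape.
print(worst_case_hash_evaluations([7, 8, 9], 10))
```

The numbers only show the shape of the cost: choosing a capacity that avoids scale outs keeps the list of sub filters, and so the per-item work, short.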


In Valkey, the bloom filter data type / commands are implemented in the [valkey-bloom module](https://github.com/valkey-io/valkey-bloom) which is an official valkey module compatible with versions 8.0 and above. Users will need to load this module onto their valkey server in order to use this feature.

Bloom filters are a space efficient probabilistic data structure that allows checking whether an element is member of a set. False positives are possible, but it guarantees no false negatives.
Member

TODO: Explain false positive and false negative
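
One way to make the two terms concrete is a toy Bloom filter. The Python sketch below is a textbook-style illustration, not the valkey-bloom implementation; the class name, the SHA-256 based hashing, and the deliberately small sizes are all assumptions chosen so that false positives are easy to observe.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter: k hash positions per item over a fixed bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def exists(self, item):
        # Present only if every position is set.
        return all(self.bits[pos] for pos in self._positions(item))


bf = TinyBloomFilter(num_bits=64, num_hashes=3)
added = [f"user{i}" for i in range(20)]
for name in added:
    bf.add(name)

# No false negatives: every added item is always reported as present.
assert all(bf.exists(name) for name in added)

# False positives are possible: some never-added items may be reported present
# because other items happened to set all of their bit positions.
probes = [f"other{i}" for i in range(1000)]
false_positives = sum(bf.exists(p) for p in probes)
print(f"{false_positives} of {len(probes)} never-added items reported present")
```

A false negative would be `exists` returning false for an item that was added, which cannot happen because `add` sets every bit that `exists` later checks. A false positive is `exists` returning true for an item that was never added.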

bf_bloom_defrag_misses:0
```

### Bloom filter core metrics
Member

Change object to filter


* `bf_bloom_num_items_across_objects`: Current total number of items across all bloom objects.

* `bf_bloom_capacity_across_objects`: Current total number of filters across all bloom objects.
Member

We can update this
