Commit fa5dccf

Topic documentation updates for bloomfilter
Signed-off-by: zackcam <[email protected]>
1 parent 0249aec commit fa5dccf

1 file changed

+44
-32
lines changed

topics/bloomfilters.md

+44-32
Original file line number | Diff line number | Diff line change
@@ -6,7 +6,7 @@ description: >
66

77
In Valkey, the bloom filter data type / commands are implemented in the [valkey-bloom module](https://github.com/valkey-io/valkey-bloom) which is an official valkey module compatible with versions 8.0 and above. Users will need to load this module onto their valkey server in order to use this feature.
88

9-
Bloom filters are a space efficient probabilistic data structure that allows checking whether an element is member of a set. False positives are possible, but it guarantees no false negatives.
9+
Bloom filters are a space-efficient probabilistic data structure that allows checking whether an element is a member of a set. False positives are possible, but false negatives are not. A false positive occurs when the filter indicates that an element is in the set when it actually is not. A false negative would occur if the filter indicated that an element is not in the set when it actually is.
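
The guarantee described above can be illustrated with a minimal Python sketch. The class name `TinyBloom`, its bit-array size, and the salted SHA-256 hashing scheme are all assumptions made for illustration; this is not how the valkey-bloom module is implemented.

```python
import hashlib


class TinyBloom:
    """Illustrative bloom filter (hypothetical, not the valkey-bloom implementation)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 7) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch readable; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item: str):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def exists(self, item: str) -> bool:
        # True only if every position is set. Bits are never cleared, so an
        # added item is always found: no false negatives.
        return all(self.bits[pos] for pos in self._positions(item))


bloom = TinyBloom()
for name in ("alice", "bob", "carol"):
    bloom.add(name)
```

Every item that was added is always reported as present. An item that was never added can still be reported as present if other items happened to set all of its positions, which is the false positive case.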
1010

1111
## Basic Bloom commands
1212

@@ -21,38 +21,38 @@ See the [complete list of bloom filter commands](../commands/#bloom).
2121

2222
### Financial fraud detection
2323

24-
Bloom filters can help answer the question "Has this card been flagged as stolen?", use a bloom filter that has cards reported stolen added to it. Check a card on use that it is not present in the bloom filter. If it isn't then the card is not marked as stolen, if present then a check to the main database can happen or deny the purchase.
24+
Bloom filters can be used to answer the question, "Has this card been flagged as stolen?". To do this, use a bloom filter that contains cards reported as stolen. When a card is used, check whether it is present in the bloom filter. If the card is not found, it means it is not marked as stolen. If the card is present in the filter, a check can be made against the main database, or the purchase can be denied.
2525

2626
### Ad placement
2727

28-
Bloom filters can help answer the following questions to advertisers:
28+
Bloom filters can help advertisers answer the following questions:
2929
* Has the user already seen this ad?
30-
* Has the user already bought this product?
30+
* Has the user already purchased this product?
3131

32-
Use a Bloom filter for every user, storing all bought products. The recommendation engine can then suggest a new product and checks if the product is in the user's Bloom filter.
32+
For each user, use a Bloom filter to store all the products they have purchased. The recommendation engine can then suggest a new product and check if it is present in the user's Bloom filter.
3333

34-
* If no, the ad is shown to the user and is added to the Bloom filter.
35-
* If yes, the process restarts and repeats until it finds a product that is not present in the filter.
34+
* If the product is not in the filter, the ad is shown to the user, and the product is added to the filter.
35+
* If the product is already in the filter, the ad has most likely already been shown to the user, so the recommendation engine picks a different ad to show.
3636

3737
### Check if URLs are malicious
3838

3939
Bloom filters can answer the question "Is this URL malicious?". Any URL that is entered is checked against a bloom filter of known malicious URLs.
4040

41-
* If no then we allow access to the site
42-
* If yes then we can deny access or perform a full check of the URL
41+
* If no, then we allow access to the site
42+
* If yes, then we can deny access or perform a full check of the URL
4343

4444
### Check if a username is taken
4545

4646
Bloom filters can answer the question: Has this username/email/domain name/slug already been used?
4747

48-
For example for usernames. Use a Bloom filter for every username that has signed up. A new user types in the desired username. The app checks if the username exists in the Bloom filter.
48+
In this username example, we can use a Bloom filter to track every username that has signed up. When a new user attempts to sign up with their desired username, the app checks if the username exists in the Bloom filter.
4949

5050
* If no, the user is created and the username is added to the Bloom filter.
5151
* If yes, the app can decide to either check the main database or reject the username.
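
The sign-up flow above can be sketched in Python as follows. The helper `try_register`, the key name `usernames`, and the `InMemoryFilterClient` stand-in are hypothetical. The stand-in uses an exact set so the sketch runs without a server, which means it never produces false positives. A real client would send `BF.EXISTS` and `BF.ADD` to a Valkey server with the bloom module loaded.

```python
class InMemoryFilterClient:
    """Hypothetical stand-in for a Valkey client, backed by an exact set."""

    def __init__(self) -> None:
        self._filters: dict[str, set[str]] = {}

    def execute_command(self, command: str, key: str, item: str) -> int:
        members = self._filters.setdefault(key, set())
        if command == "BF.EXISTS":
            return 1 if item in members else 0
        if command == "BF.ADD":
            added = item not in members
            members.add(item)
            return 1 if added else 0
        raise ValueError(f"unsupported command: {command}")


def try_register(client, username: str, filter_key: str = "usernames") -> bool:
    """Return True if the username was registered, False if it may be taken."""
    if client.execute_command("BF.EXISTS", filter_key, username):
        # "Yes" may be a false positive: the app can reject the username here
        # or fall back to a check against the main database.
        return False
    # "No" is definitive, so the user can be created and the name recorded.
    client.execute_command("BF.ADD", filter_key, username)
    return True


client = InMemoryFilterClient()
first = try_register(client, "zack")   # True: name was free
second = try_register(client, "zack")  # False: name is now in the filter
```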
5252

5353
## Scaling and non scaling bloom filters
5454

55-
The difference between scaling and non scaling bloom filters is that scaling bloom filters do not have a fixed capacity, but a capacity that can grow. While non-scaling bloom filters will have a fixed capacity which also means a fixed size.
55+
The difference between scaling and non scaling bloom filters is that scaling bloom filters do not have a fixed capacity, but a capacity that can grow, while non scaling bloom filters have a fixed capacity, which also means a fixed size. A scaling bloom filter consists of a vector of "subfilters" with length >= 1, while a non scaling bloom filter contains exactly one subfilter.
5656

5757
When a scaling filter reaches its capacity, adding a new unique item will cause a new bloom filter to be created and added to the vector of bloom filters. This new bloom filter will have a larger capacity (previous bloom filter's capacity * expansion rate of the bloom object).
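
As a rough illustration of the growth rule above, the following Python sketch lists the capacity of each successive subfilter and their running total. The function name `subfilter_capacities` is hypothetical. The values 100 and 2 are the default capacity and expansion rate listed in the defaults table on this page.

```python
def subfilter_capacities(initial_capacity: int, expansion: int, num_subfilters: int) -> list[int]:
    """Capacity of each subfilter: the previous capacity times the expansion rate."""
    return [initial_capacity * expansion**i for i in range(num_subfilters)]


caps = subfilter_capacities(100, 2, 4)
total = sum(caps)
print(caps, total)  # [100, 200, 400, 800] 1500

# Total capacity after 18 subfilters with the default settings.
print(sum(subfilter_capacities(100, 2, 18)))  # 26214300
```

With the defaults, 18 subfilters add up to a total capacity of 26,214,300, which equals the maximum scaled capacity shown in the Limits example for a default bloom filter.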
5858

@@ -62,9 +62,11 @@ The expansion rate is the rate that a scaling bloom filter will have its capacit
6262

6363
### When should you use scaling vs non-scaling filters
6464

65-
If the data size is known and fixed then using a non-scaling bloom filter is preferred, for example a static dictionary could use a non scaling bloom filter as the amount of items should be fixed. Likewise the reverse case for dynamic data and unknown final sizes is when you should use a scaling bloom filters.
65+
If the capacity (the number of items we want to add) is known and fixed, a non scaling bloom filter is preferred. Conversely, if the capacity is unknown or dynamically calculated, a scaling bloom filter is ideal.
6666

67-
There are a few benefits for using non scaling filters, a non scaling filter will have better performance than a filter that has scaled out. A non scaling filter also will use less memory for the capacity that is available. However if you don't want to hit an error and want use-as-you-go capacity, scaling is better.
67+
There are a few benefits to using non scaling filters. A non scaling filter will have better performance than a filter that has scaled out several times (e.g. > 100). Also, a non scaling filter will in general use less memory than a scaling filter that has scaled out several times to hold the same capacity.
68+
69+
However, if you want to avoid capacity related errors and want use-as-you-go capacity, scaling is better.
6870

6971
## Bloom properties
7072

@@ -84,47 +86,57 @@ The following two properties can be specified in the `BF.INSERT` command:
8486

8587
### Default bloom properties
8688

89+
These are the default bloom properties, along with the commands and configs that allow customizing them.
90+
8791
<table width="100%" border="1" style="border-collapse: collapse; border: 1px solid black" cellpadding="8">
8892
<tr>
89-
<th width="30%">Property</th>
90-
<th width="30%">Default Value</th>
91-
<th width="40%">Engine Config (Global)</th>
93+
<th width="20%">Property</th>
94+
<th width="20%">Default Value</th>
95+
<th width="30%">Command Name</th>
96+
<th width="30%">Configuration Name</th>
9297
</tr>
9398
<tr>
9499
<td>Capacity</td>
95100
<td>100</td>
101+
<td>BF.INSERT, BF.RESERVE</td>
96102
<td>BF.BLOOM-CAPACITY</td>
97103
</tr>
98104
<tr>
99105
<td>False Positive Rate</td>
100106
<td>0.01</td>
107+
<td>BF.INSERT, BF.RESERVE</td>
101108
<td>BF.BLOOM-FP-RATE</td>
102109
</tr>
103110
<tr>
104111
<td>Scaling / Non Scaling</td>
105112
<td>Scaling</td>
113+
<td>BF.INSERT, BF.RESERVE</td>
106114
<td>BF.BLOOM-EXPANSION</td>
107115
</tr>
108116
<tr>
109117
<td>Expansion Rate</td>
110118
<td>2</td>
119+
<td>BF.INSERT, BF.RESERVE</td>
111120
<td>BF.BLOOM-EXPANSION</td>
112121
</tr>
113122
<tr>
114123
<td>Tightening Ratio</td>
115124
<td>0.5</td>
125+
<td>BF.INSERT</td>
116126
<td>BF.BLOOM-TIGHTENING-RATIO</td>
117127
</tr>
118128
<tr>
119129
<td>Seed</td>
120130
<td>Random Seed</td>
131+
<td>BF.INSERT</td>
121132
<td>BF.BLOOM-USE-RANDOM-SEED</td>
122133
</tr>
123134
</table>
124135

125136

126-
As bloom filters have a default expansion of 2 this means all default bloom objects will be scaling. These options are used when not specified explicitly in the commands used to create a new bloom object. For example doing a BF.ADD for a new filter will create a filter with the exact above qualities. These default properties can be configured through configs on the bloom module.
127-
Example of default bloom objects information:
137+
Since bloom filters have a default expansion of 2, any filter created by default through `BF.ADD`, `BF.MADD`, or `BF.INSERT` will be a scaling bloom filter. Users can create a non scaling bloom filter using `BF.RESERVE <filter-name> <error-rate> <capacity> NONSCALING` or by specifying `NONSCALING` in `BF.INSERT`. The other default properties of a newly created bloom filter can be seen in the table above and in the `BF.INFO` command response below. These default properties can be changed through configs on the bloom module.
138+
139+
Example of default bloom filter information:
128140

129141
```
130142
127.0.0.1:6379> BF.ADD default_filter item
@@ -150,11 +162,11 @@ Example of default bloom objects information:
150162

151163
## Performance
152164

153-
Most bloom commands are O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. This means that both BF.ADD and BF.EXISTS are both O(n) as they only work with one 1 item.
165+
The bloom commands that add items or check the existence of items have a time complexity of O(n * k), where n is the number of hash functions used by the bloom filter and k is the number of elements being added or checked. This means that BF.ADD and BF.EXISTS are both O(n), as they only operate on one item.
154166

155-
As performance can rely on the number of hash functions, choosing the correct capacity and expansion rate can be very important. When you scale out you will be adding more hash functions that will be used. For this reason it is recommended that you should choose a capacity after evaluating your use case as this can avoid several scale outs.
167+
Since performance relies on the number of hash functions, choosing the correct capacity and expansion rate can be important. For scaling bloom filters, every scale out increases the number of checks (using the hash functions of each subfilter) performed during any add / exists operation. For this reason, it is recommended that users choose a capacity after evaluating the use case / workload, to help avoid several scale outs and reduce the number of checks.
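
To see how the initial capacity affects the number of scale outs, the sketch below counts how many subfilters are needed to hold a given number of items. It assumes only the growth rule described earlier (each new subfilter's capacity is the previous one multiplied by the expansion rate). The function name `subfilters_needed` is hypothetical.

```python
def subfilters_needed(items: int, initial_capacity: int = 100, expansion: int = 2) -> int:
    """Count the subfilters required before total capacity covers `items`."""
    count = 0
    total = 0
    capacity = initial_capacity
    while total < items:
        total += capacity
        capacity *= expansion
        count += 1
    return count


# Holding one million items with different initial capacities:
for initial in (100, 10_000, 1_000_000):
    print(initial, subfilters_needed(1_000_000, initial_capacity=initial))
# 100 14
# 10000 7
# 1000000 1
```

Starting from the default capacity of 100, one million items require 14 subfilters, each of which may be consulted during an add / exists operation. Starting at the expected size requires only one.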
156168

157-
There are a few bloom commands that are O(1): BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (if no items are specified). These commands have constant time complexity since they don't work on items but instead work on the data about the bloom filter itself.
169+
The other bloom filter commands have O(1) time complexity: BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (when no items are provided).
158170

159171
## Monitoring
160172

@@ -193,29 +205,29 @@ bf_bloom_defrag_misses:0
193205

194206
* `bf_bloom_total_memory_bytes`: Current total number of bytes used by all bloom filters.
195207

196-
* `bf_bloom_num_objects`: Current total number of bloom objects.
208+
* `bf_bloom_num_objects`: Current total number of bloom filters.
197209

198-
* `bf_bloom_num_filters_across_objects`: Current total number of filters across all bloom objects.
210+
* `bf_bloom_num_filters_across_objects`: Current total number of subfilters across all bloom filters.
199211

200-
* `bf_bloom_num_items_across_objects`: Current total number of items across all bloom objects.
212+
* `bf_bloom_num_items_across_objects`: Current total number of items across all bloom filters.
201213

202-
* `bf_bloom_capacity_across_objects`: Current total number of filters across all bloom objects.
214+
* `bf_bloom_capacity_across_objects`: Current total capacity across all bloom filters.
203215

204216
### Bloom filter defrag metrics
205217

206-
* `bf_bloom_defrag_hits`: Total number of defrag hits that have occurred on bloom objects.
218+
* `bf_bloom_defrag_hits`: Total number of defrag hits that have occurred on bloom filters.
207219

208-
* `bf_bloom_defrag_misses`: Total number of defrag misses that have occurred on bloom objects.
220+
* `bf_bloom_defrag_misses`: Total number of defrag misses that have occurred on bloom filters.
209221

210222
## Limits
211223

212-
The consumption of memory by a single Bloom object is limited to a default of 128 MB (configurable in the bloom module), which is the size of the in-memory data structure not the capacity of the Bloom object. You can check the amount of memory consumed by a Bloom object by using the BF.INFO command. When a bloom filter scales out it will add another filter, there is a limit on the number of filters that can be added. This filter limit will change depending on the false positive rate, capacity, expansion and tightening ratio, where this filter limit is specified on the memory limit of the bloom objects.
224+
The memory consumed by a single bloom filter is limited to 128 MB by default (configurable in the bloom module). You can check the amount of memory consumed by a bloom filter by using the BF.INFO command. When a bloom filter scales out, it adds another subfilter, and there is a limit on the number of subfilters that can be added. This subfilter limit depends on the false positive rate, capacity, expansion, and tightening ratio: scaling stops either when the bloom filter would exceed the memory limit or when the tightening ratio causes the false positive rate of the newest subfilter to reach 0.
213225

214-
We have implemented an optional argument into BF.INSERT (VALIDATESCALETO) that can help you determine the max capacity of the objects on creation. The VALIDATESCALETO when specified would check a few things, the first is that when a bloom filter has scaled out to the desired capacity will the tightening ratio reach zero, and if so we will reject the creation. The second thing it will check is that once we reach the capacity that is desired will the bloom object be less than the max memory limit (by default 128 MB).
226+
BF.INSERT accepts an optional argument, VALIDATESCALETO, that can help you validate the max capacity of a filter on creation. When specified, VALIDATESCALETO checks two scenarios. First, it checks whether the tightening ratio would cause the false positive rate to reach zero by the time the bloom filter has scaled out to the desired capacity, and if so, the creation is rejected. Second, it checks whether the bloom filter would still be within the max memory limit (by default 128 MB) once it reaches the desired capacity.
215227

216-
There is also a way to check the max capacity that can be reached for Bloom objects. Using MAXSCALEDCAPACITY in BF.INFO will provide the exact capacity that the bloom object can reach.
228+
There is also a way to check the max capacity that a bloom filter can reach. Using MAXSCALEDCAPACITY in BF.INFO will provide the exact capacity that the bloom filter can reach.
217229

218-
Example usage for a default bloom object:
230+
Example usage for a default bloom filter:
219231
```
220232
127.0.0.1:6379> BF.INSERT validate_scale_fail VALIDATESCALETO 26214301
221233
(error) ERR provided VALIDATESCALETO causes bloom object to exceed memory limit
@@ -225,4 +237,4 @@ Example usage for a default bloom object:
225237
(integer) 26214300
226238
```
227239

228-
As you can see above when trying to create a bloom object that the user wants to achieve a capacity more than what is possible given the memory limits the command will output an error and not create the bloom object. However if the wanted capacity is within the limits then the creation of the bloom object will succeed.
240+
As shown above, when the requested capacity is more than what is possible given the memory limit, the command returns an error and does not create the bloom filter. If the requested capacity is within the limits, the creation of the bloom filter will succeed.
