topics/bloomfilters.md
+44-32
@@ -6,7 +6,7 @@ description: >
6
6
7
7
In Valkey, the bloom filter data type and commands are implemented in the [valkey-bloom module](https://github.com/valkey-io/valkey-bloom), an official Valkey module compatible with versions 8.0 and above. Users need to load this module onto their Valkey server to use this feature.
8
8
9
-
Bloom filters are a space efficient probabilistic data structure that allows checking whether an element is member of a set. False positives are possible, but it guarantees no false negatives.
9
+
Bloom filters are a space-efficient probabilistic data structure for checking whether an element is a member of a set. False positives are possible, but false negatives are not. A false positive occurs when the structure incorrectly indicates that an element is in the set when it actually is not. A false negative occurs when the structure incorrectly indicates that an element is not in the set when it actually is.
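The following is a minimal Python sketch of the idea, not the valkey-bloom implementation. The class name `TinyBloomFilter`, the bit-array size, and the SHA-256 based hashing are all assumptions chosen for illustration. It shows that every added item is always reported as present, while some items that were never added may also be reported as present.

```python
import hashlib


class TinyBloomFilter:
    """Illustrative bloom filter; not the valkey-bloom implementation."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item: str) -> list[int]:
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big")
            % self.num_bits
            for i in range(self.num_hashes)
        ]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True can be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


added = [f"user{i}" for i in range(100)]
bf = TinyBloomFilter()
for name in added:
    bf.add(name)

# No false negatives: every added item is reported as present.
no_false_negatives = all(bf.might_contain(name) for name in added)

# Some never-added items may still be reported as present (false positives).
false_positives = sum(bf.might_contain(f"other{i}") for i in range(1000))
print(no_false_negatives, false_positives)
```

The false positive count depends on the bit-array size, the number of hash functions, and how many items were added, so it is printed rather than stated here.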
10
10
11
11
## Basic Bloom commands
12
12
@@ -21,38 +21,38 @@ See the [complete list of bloom filter commands](../commands/#bloom).
21
21
22
22
### Financial fraud detection
23
23
24
-
Bloom filters can help answer the question "Has this card been flagged as stolen?", use a bloom filter that has cards reported stolen added to it. Check a card on use that it is not present in the bloom filter. If it isn't then the card is not marked as stolen, if present then a check to the main database can happen or deny the purchase.
24
+
Bloom filters can be used to answer the question, "Has this card been flagged as stolen?". To do this, use a bloom filter that contains cards reported as stolen. When a card is used, check whether it is present in the bloom filter. If the card is not found, it means it is not marked as stolen. If the card is present in the filter, a check can be made against the main database, or the purchase can be denied.
25
25
26
26
### Ad placement
27
27
28
-
Bloom filters can help answer the following questions to advertisers:
28
+
Bloom filters can help advertisers answer the following questions:
29
29
* Has the user already seen this ad?
30
-
* Has the user already bought this product?
30
+
* Has the user already purchased this product?
31
31
32
-
Use a Bloom filter for every user, storing all bought products. The recommendation engine can then suggest a new product and checks if the product is in the user's Bloom filter.
32
+
For each user, use a Bloom filter to store all the products they have purchased. The recommendation engine can then suggest a new product and check if it is present in the user's Bloom filter.
33
33
34
-
* If no, the ad is shown to the user and is added to the Bloom filter.
35
-
* If yes, the process restarts and repeats until it finds a product that is not present in the filter.
34
+
* If the product is not in the filter, the ad is shown to the user, and the product is added to the filter.
35
+
* If the product is already in the filter, the ad has most likely already been shown to the user, so the recommendation engine finds a different ad to show.
36
36
37
37
### Check if URLs are malicious
38
38
39
39
Bloom filters can answer the question "Is a URL malicious?". Any submitted URL is checked against a bloom filter of known malicious URLs.
40
40
41
-
* If no then we allow access to the site
42
-
* If yes then we can deny access or perform a full check of the URL
41
+
* If no, then we allow access to the site
42
+
* If yes, then we can deny access or perform a full check of the URL
43
43
44
44
### Check if a username is taken
45
45
46
46
Bloom filters can answer the question: Has this username/email/domain name/slug already been used?
47
47
48
-
For example for usernames. Use a Bloom filter for every username that has signed up. A new user types in the desired username. The app checks if the username exists in the Bloom filter.
48
+
In this username example, we can use a Bloom filter to track every username that has signed up. When a new user attempts to sign up with their desired username, the app checks if the username exists in the Bloom filter.
49
49
50
50
* If no, the user is created and the username is added to the Bloom filter.
51
51
* If yes, the app can decide to either check the main database or reject the username.
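The username flow above can be sketched in Python as follows. This is an illustration under stated assumptions: `register_username` is a hypothetical helper, the filter is passed in as two callables (which in a real deployment would wrap `BF.EXISTS` and `BF.ADD`), and a plain `set` stands in for the main database. The sketch takes the "check the main database" branch on a filter hit rather than rejecting outright.

```python
from typing import Callable


def register_username(
    username: str,
    filter_exists: Callable[[str], bool],
    filter_add: Callable[[str], None],
    database: set[str],
) -> str:
    """Decide whether a username can be created, consulting the filter first."""
    if not filter_exists(username):
        # A "no" from a bloom filter is definitive: the name was never added.
        database.add(username)
        filter_add(username)
        return "created"
    # A "yes" may be a false positive, so confirm against the main database.
    if username in database:
        return "rejected"
    database.add(username)
    filter_add(username)
    return "created after database check"


# Stand-in filter: an exact set plus one planted false positive ("zed").
seen: set[str] = set()
database: set[str] = set()


def exists(name: str) -> bool:
    return name in seen or name == "zed"


results = [
    register_username(name, exists, seen.add, database)
    for name in ["alice", "alice", "zed"]
]
print(results)
```

Only the filter hit for "zed" reaches the database, which is the point of the pattern: most lookups are answered by the filter alone.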
52
52
53
53
## Scaling and non scaling bloom filters
54
54
55
-
The difference between scaling and non scaling bloom filters is that scaling bloom filters do not have a fixed capacity, but a capacity that can grow. While non-scaling bloom filters will have a fixed capacity which also means a fixed size.
55
+
The difference between scaling and non-scaling bloom filters is that a scaling bloom filter does not have a fixed capacity; its capacity can grow. A non-scaling bloom filter has a fixed capacity, which also means a fixed size. A scaling bloom filter consists of a vector of "subfilters" with length >= 1, while a non-scaling bloom filter contains exactly one subfilter.
56
56
57
57
When a scaling filter reaches its capacity, adding a new unique item will cause a new bloom filter to be created and added to the vector of bloom filters. This new bloom filter will have a larger capacity (previous bloom filter's capacity * expansion rate of the bloom object).
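As a small worked example of this growth rule, the Python sketch below lists the capacity of each successive subfilter. The helper name `subfilter_capacities` and the starting capacity of 100 are assumptions for illustration; the expansion rate of 2 matches the default mentioned in this document.

```python
def subfilter_capacities(
    initial_capacity: int, expansion: int, num_subfilters: int
) -> list[int]:
    """Capacity of each subfilter: the previous capacity times the expansion rate."""
    return [initial_capacity * expansion**i for i in range(num_subfilters)]


caps = subfilter_capacities(initial_capacity=100, expansion=2, num_subfilters=4)
total = sum(caps)  # overall capacity once four subfilters exist
print(caps, total)  # [100, 200, 400, 800] 1500
```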
58
58
@@ -62,9 +62,11 @@ The expansion rate is the rate that a scaling bloom filter will have its capacit
62
62
63
63
### When should you use scaling vs non-scaling filters
64
64
65
-
If the data size is known and fixed then using a non-scaling bloom filter is preferred, for example a static dictionary could use a non scaling bloom filter as the amount of items should be fixed. Likewise the reverse case for dynamic data and unknown final sizes is when you should use a scaling bloom filters.
65
+
If the capacity (the number of items we want to add) is known and fixed, a non-scaling bloom filter is preferred. Conversely, if the capacity is unknown or determined dynamically, a scaling bloom filter is ideal.
66
66
67
-
There are a few benefits for using non scaling filters, a non scaling filter will have better performance than a filter that has scaled out. A non scaling filter also will use less memory for the capacity that is available. However if you don't want to hit an error and want use-as-you-go capacity, scaling is better.
67
+
There are a few benefits to using non-scaling filters. A non-scaling filter will have better performance than a filter that has scaled out several times (e.g. > 100). Also, a non-scaling filter generally uses less memory than a scaling filter that has scaled out several times to hold the same capacity.
68
+
69
+
However, if you want to avoid capacity-related errors and prefer use-as-you-go capacity, a scaling filter is better.
68
70
69
71
## Bloom properties
70
72
@@ -84,47 +86,57 @@ The following two properties can be specified in the `BF.INSERT` command:
84
86
85
87
### Default bloom properties
86
88
89
+
These are the default bloom properties, along with the commands and configs that allow customizing them.
As bloom filters have a default expansion of 2 this means all default bloom objects will be scaling. These options are used when not specified explicitly in the commands used to create a new bloom object. For example doing a BF.ADD for a new filter will create a filter with the exact above qualities. These default properties can be configured through configs on the bloom module.
127
-
Example of default bloom objects information:
137
+
Since bloom filters have a default expansion of 2, any filter created by default through `BF.ADD`, `BF.MADD`, or `BF.INSERT` will be a scaling bloom filter. Users can create a non-scaling bloom filter using `BF.RESERVE <filter-name> <error-rate> <capacity> NONSCALING` or by specifying `NONSCALING` in `BF.INSERT`. The other default properties used when creating a bloom filter can be seen in the table above and in the `BF.INFO` command response below. These default properties can be configured through configs on the bloom module.
138
+
139
+
Example of default bloom filter information:
128
140
129
141
```
130
142
127.0.0.1:6379> BF.ADD default_filter item
@@ -150,11 +162,11 @@ Example of default bloom objects information:
150
162
151
163
## Performance
152
164
153
-
Most bloom commands are O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. This means that both BF.ADD and BF.EXISTS are both O(n) as they only work with one 1 item.
165
+
The bloom commands that add items or check for the existence of items have a time complexity of O(n * k), where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted or checked. This means that BF.ADD and BF.EXISTS are both O(n), as they only operate on one item.
154
166
155
-
As performance can rely on the number of hash functions, choosing the correct capacity and expansion rate can be very important. When you scale out you will be adding more hash functions that will be used. For this reason it is recommended that you should choose a capacity after evaluating your use case as this can avoid several scale outs.
167
+
Since performance depends on the number of hash functions, choosing the right capacity and expansion rate can be important. For scalable bloom filters, every scale-out increases the number of checks (using the hash functions of each subfilter) performed during any add or exists operation. For this reason, it is recommended that users choose a capacity after evaluating the use case and workload, to avoid repeated scale-outs and reduce the number of checks.
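To make the effect of the initial capacity concrete, the Python sketch below counts how many subfilters a scaling filter would need before its combined capacity covers a given number of items. The helper `subfilters_needed` and the capacities used are hypothetical; the sketch only models the capacity growth rule and ignores memory limits and the tightening ratio.

```python
def subfilters_needed(initial_capacity: int, expansion: int, items: int) -> int:
    """Count subfilters created before the combined capacity covers `items`."""
    count, capacity, total = 0, initial_capacity, 0
    while total < items:
        total += capacity      # a new subfilter adds its capacity
        capacity *= expansion  # the next subfilter is `expansion` times larger
        count += 1
    return count


small_start = subfilters_needed(initial_capacity=100, expansion=2, items=1_000_000)
right_sized = subfilters_needed(initial_capacity=1_000_000, expansion=2, items=1_000_000)
print(small_start, right_sized)  # 14 1
```

Under these assumptions, a filter that starts at a capacity of 100 ends up with 14 subfilters to consult on each add or exists operation, while one sized for the workload up front has a single subfilter.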
156
168
157
-
There are a few bloom commands that are O(1): BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (if no items are specified). These commands have constant time complexity since they don't work on items but instead work on the data about the bloom filter itself.
169
+
The other bloom filter commands have O(1) time complexity: BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (when no items are provided).
158
170
159
171
## Monitoring
160
172
@@ -193,29 +205,29 @@ bf_bloom_defrag_misses:0
193
205
194
206
* `bf_bloom_total_memory_bytes`: Current total number of bytes used by all bloom filters.
195
207
196
-
*`bf_bloom_num_objects`: Current total number of bloom objects.
208
+
* `bf_bloom_num_objects`: Current total number of bloom filters.
197
209
198
-
*`bf_bloom_num_filters_across_objects`: Current total number of filters across all bloom objects.
210
+
* `bf_bloom_num_filters_across_objects`: Current total number of subfilters across all bloom filters.
199
211
200
-
*`bf_bloom_num_items_across_objects`: Current total number of items across all bloom objects.
212
+
* `bf_bloom_num_items_across_objects`: Current total number of items across all bloom filters.
201
213
202
-
*`bf_bloom_capacity_across_objects`: Current total number of filters across all bloom objects.
214
+
* `bf_bloom_capacity_across_objects`: Current total capacity across all bloom filters.
203
215
204
216
### Bloom filter defrag metrics
205
217
206
-
*`bf_bloom_defrag_hits`: Total number of defrag hits that have occurred on bloom objects.
218
+
* `bf_bloom_defrag_hits`: Total number of defrag hits that have occurred on bloom filters.
207
219
208
-
*`bf_bloom_defrag_misses`: Total number of defrag misses that have occurred on bloom objects.
220
+
* `bf_bloom_defrag_misses`: Total number of defrag misses that have occurred on bloom filters.
209
221
210
222
## Limits
211
223
212
-
The consumption of memory by a single Bloom object is limited to a default of 128 MB (configurable in the bloom module), which is the size of the in-memory data structure not the capacity of the Bloom object. You can check the amount of memory consumed by a Bloom object by using the BF.INFO command. When a bloom filter scales out it will add another filter, there is a limit on the number of filters that can be added. This filter limit will change depending on the false positive rate, capacity, expansion and tightening ratio, where this filter limit is specified on the memory limit of the bloom objects.
224
+
The memory consumed by a single bloom filter is limited to a default of 128 MB (configurable in the bloom module). You can check the amount of memory consumed by a bloom filter by using the BF.INFO command. When a bloom filter scales out, it adds another subfilter, and there is a limit on the number of subfilters that can be added. This limit depends on the false positive rate, capacity, expansion, and tightening ratio: it is reached either when the filter would exceed the memory limit or when the tightening ratio causes the false positive rate of the newest subfilter to reach 0.
213
225
214
-
We have implemented an optional argument into BF.INSERT (VALIDATESCALETO) that can help you determine the max capacity of the objects on creation. The VALIDATESCALETO when specified would check a few things, the first is that when a bloom filter has scaled out to the desired capacity will the tightening ratio reach zero, and if so we will reject the creation. The second thing it will check is that once we reach the capacity that is desired will the bloom object be less than the max memory limit (by default 128 MB).
226
+
BF.INSERT accepts an optional argument, VALIDATESCALETO, that can help you validate the maximum capacity of a filter at creation. When specified, VALIDATESCALETO checks two scenarios. First, it checks whether the tightening ratio would drive the false positive rate to zero by the time the bloom filter has scaled out to the desired capacity; if so, the creation is rejected. Second, it checks whether the bloom filter would stay below the maximum memory limit (128 MB by default) once it reaches the desired capacity.
215
227
216
-
There is also a way to check the max capacity that can be reached for Bloom objects. Using MAXSCALEDCAPACITY in BF.INFO will provide the exact capacity that the bloom object can reach.
228
+
There is also a way to check the maximum capacity that a bloom filter can reach. Using MAXSCALEDCAPACITY in BF.INFO returns the exact capacity that the bloom filter can scale to.
@@ -225,4 +237,4 @@ Example usage for a default bloom object:
225
237
(integer) 26214300
226
238
```
227
239
228
-
As you can see above when trying to create a bloom object that the user wants to achieve a capacity more than what is possible given the memory limits the command will output an error and not create the bloom object. However if the wanted capacity is within the limits then the creation of the bloom object will succeed.
240
+
As shown above, when the user tries to create a bloom filter with a target capacity greater than what is possible given the memory limits, the command returns an error and does not create the bloom filter. However, if the target capacity is within the limits, the bloom filter is created successfully.
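As a rough sanity check on the `(integer) 26214300` value shown above, the Python sketch below sums subfilter capacities until it reaches that number. It assumes a starting capacity of 100, which is an assumption and not something this excerpt states, together with the default expansion of 2. The helper `scaled_capacity` is hypothetical and models only the capacity growth rule, not the memory or false positive checks that determine the actual limit.

```python
def scaled_capacity(initial_capacity: int, expansion: int, num_subfilters: int) -> int:
    """Combined capacity of a filter that has grown to `num_subfilters` subfilters."""
    return sum(initial_capacity * expansion**i for i in range(num_subfilters))


reported = 26214300  # value returned by MAXSCALEDCAPACITY in the example above

# Smallest number of subfilters whose combined capacity reaches the reported value.
n = next(n for n in range(1, 64) if scaled_capacity(100, 2, n) >= reported)
print(n, scaled_capacity(100, 2, n))  # 18 26214300
```

Under these assumptions, the reported maximum corresponds exactly to 18 subfilters, since 100 * (2^18 - 1) = 26214300.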