The current conclusion is to use a different solution such as Elasticsearch or Solr, which already have sharding implemented (although the connectors needed to use them aren't implemented yet).
Implementing sharding on our own for bleve would be very expensive if we want to do it right. Details below.
Basic implementation idea
In order to improve write and read times, we'll need to split the data across different indexes.
Searching an index of 250k elements should always perform better than searching one of 1 million. Similarly, having N different indexes is convenient if only 1 or 2 are locked (because we're writing data into them) while the rest can be accessed freely. Even considering only write operations, up to N writes can happen at the same time as long as they target different indexes.
A natural way to split our data is by space: all the data belonging to a specific space would go to the same index.
This means that, as long as we know the space, we only need to search one index, which can considerably reduce the number of elements we have to check. In addition, indexing files can happen in parallel if those files are in different spaces (and the spaces are in different indexes). This should provide a predictable UX.
Note that it is important to have an even data distribution (or as close to even as possible). Having more shards won't help if 95% of the data is in the same shard.
In order to implement this, we need to map each space to an index.
Discarded space->index mapping strategies
hash
This is basically hashing the space: something like hash(spaceID) % N, where N is the number of shards.
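For illustration only, here's a minimal Go sketch of this approach; the helper name and the shard counts are made up:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a space ID to one of n shards using a plain modulo hash.
// Purely illustrative; n is the number of shards.
func shardFor(spaceID string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(spaceID))
	return h.Sum32() % n
}

func main() {
	// The same space always lands in the same shard, until n changes.
	fmt.Println(shardFor("spaceid1", 4))
	fmt.Println(shardFor("spaceid1", 5)) // very likely a different shard
}
```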
This has several important drawbacks:
In theory, increasing the number of shards should improve performance by reducing the number of elements in each shard (less data to go through). However, this method doesn't guarantee that, and it might distribute the data even worse: some shards might be underused or not used at all.
Leaving aside the other problems of adding or removing shards, it's very likely that a lot of data would need to be moved between them (X % N is unlikely to equal X % (N+1) or X % (N-1)). Note that it isn't only about moving data to a new shard, but also between the "old" shards.
Changing the number of shards would cause service downtime. A live change isn't possible with data moving around unpredictably.
consistent hashing
Similar to the previous one, although the "final" result is independent of the number of shards. This means that a space will end up in the same shard even if the number of shards increases. Of course, if a shard is removed, the data in that shard will need to be moved to a different one.
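For reference, a minimal ring sketch of what this involves, without virtual nodes and without a library (all names are illustrative, and a real implementation would need virtual nodes to distribute the data better):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// ring is a minimal consistent-hashing sketch: each shard owns a point on
// the ring, and a space belongs to the first shard clockwise from its hash.
type ring struct {
	points []uint32          // sorted hash points of the shards
	shards map[uint32]string // hash point -> shard name
}

func newRing(names ...string) *ring {
	r := &ring{shards: map[uint32]string{}}
	for _, name := range names {
		p := hash32(name)
		r.points = append(r.points, p)
		r.shards[p] = name
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

func (r *ring) shardFor(spaceID string) string {
	p := hash32(spaceID)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= p })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.shards[r.points[i]]
}

func main() {
	r := newRing("shard1", "shard2", "shard3")
	// Adding a shard only remaps the spaces that fall on the new segment.
	fmt.Println(r.shardFor("spaceid1"))
}
```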
Drawbacks:
It's difficult to implement. We'd likely need a library to handle this.
The data might still be unevenly distributed.
Requires a consistent state across all replicas.
static lookup via env variable
The configuration would be something like: BLEVE_SLOOKUP=spaceid1:shard4;spaceid2:shard3;_spaces:shard2;_default:shard1
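A minimal sketch of how such a variable could be parsed; the format is the one from the example above, the helper names are made up:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseLookup parses "space:shard" pairs separated by ";", as in the
// BLEVE_SLOOKUP example above. Purely illustrative.
func parseLookup(raw string) map[string]string {
	lookup := map[string]string{}
	for _, pair := range strings.Split(raw, ";") {
		parts := strings.SplitN(pair, ":", 2)
		if len(parts) != 2 {
			continue // skip malformed entries
		}
		lookup[parts[0]] = parts[1]
	}
	return lookup
}

// shardFor resolves a space, falling back to the "_default" entry.
func shardFor(lookup map[string]string, spaceID string) string {
	if shard, ok := lookup[spaceID]; ok {
		return shard
	}
	return lookup["_default"]
}

func main() {
	lookup := parseLookup(os.Getenv("BLEVE_SLOOKUP"))
	fmt.Println(shardFor(lookup, "spaceid1"))
}
```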
Drawbacks:
The configuration might change. This implies that we need to keep the old configuration somewhere in order to detect the change; otherwise we'll end up with data in the wrong shards.
It won't be usable in the long run:
New spaces added to the system will be mapped to the default shard, causing uneven load.
Using virtual groups (such as "_personal" or "_spaces") to group spaces is better handled internally without exposing it to the outside. Those virtual groups won't change and the shard name isn't important.
Proposal: dynamic lookup
The idea is to create a lookup table that can be changed dynamically.
There are a few goals:
Number of shards can be changed at any moment.
Lookup table can be adjusted manually at will to balance the load. If a shard has too many elements, some of them can be moved to a different shard or a new one.
Configuration changes and data transfer among indexes (due to adjustments in the lookup table) are expected to run on a live system with no downtime.
handling lookup data
There are 3 important things to take into account:
We won't rely on environment variables for the configuration. Static configuration won't be enough when we expect changes.
We need to consider coordination among all the replicas. The lookup data MUST be consistent regardless of the replica.
We need to persist data somewhere.
We'll provide a set of commands to deal with the lookup data.
The "main" lookup data will be kept in the NATS store (or a centralized location available to any replica). Every time a replica needs to search or index a file, it MUST retrieve the "main" lookup data to check where it needs to look. Note that this data should NOT be cached (or it should be extremely short-lived)
Changing the lookup data through a command should go through these steps (assuming everything is correct):
Update the "main" lookup data with the new information.
Notify other replicas that the "main" lookup data has been updated.
Persist the data in the local FS.
Step 2 (notify other replicas) lets the other replicas know there is a change so they can persist the same change in their own FS. Whether the notification arrives "on time" or not is irrelevant, because the replicas will still use whatever data is in the store (which has already been updated).
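Continuing the hypothetical sketch above, the update command could roughly follow those three steps like this (the notify and persistLocally callbacks are placeholders for the replica notification and the local FS persistence):

```go
// UpdateLookup applies a change to the "main" lookup data following the
// steps above. It reuses LookupStore and LookupTable from the earlier sketch.
func UpdateLookup(store LookupStore, notify, persistLocally func(LookupTable) error,
	change func(*LookupTable)) error {

	table, err := store.Get()
	if err != nil {
		return err
	}

	// 1. Update the "main" lookup data with the new information.
	change(&table)
	table.UpdatedAt = time.Now().UTC()
	if err := store.Put(table); err != nil {
		return err
	}

	// 2. Notify the other replicas so they persist the same change locally.
	if err := notify(table); err != nil {
		return err
	}

	// 3. Persist the data in the local FS of this replica.
	return persistLocally(table)
}
```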
In case oCIS is restarted, the expected flow is the following:
Replicas will load the persisted data from the FS.
They update the lookup data in the store only if their persisted data is newer.
They'll keep using the "main" lookup data as usual.
Under normal conditions, the persisted data in all replicas will be the same, so there won't be problems. If the persisted data differs for whatever reason, the "main" lookup data should be updated with the latest version fairly quickly (replicas are expected to start at the same time), so the possible damage of using outdated info should be minimal.
Note that the lookup data must be timestamped so we clearly know which version is the newest.
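Continuing the same sketch, the startup reconciliation boils down to a timestamp comparison (loadLocal is a placeholder for reading the persisted table from the local FS):

```go
// ReconcileOnStart loads the locally persisted lookup table and pushes it to
// the centralized store only if it is newer than what the store holds.
func ReconcileOnStart(store LookupStore, loadLocal func() (LookupTable, error)) error {
	local, err := loadLocal()
	if err != nil {
		return err
	}
	remote, err := store.Get()
	if err != nil {
		return err
	}
	// Only overwrite the "main" lookup data if the local copy is newer.
	if local.UpdatedAt.After(remote.UpdatedAt) {
		return store.Put(local)
	}
	return nil
}
```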
moving index data
We'll use the same idea implemented for OC10 (owncloud/search_elastic#319, the "migrating" section).
Basically, we'll write into 2 shards for the space we want to move until the migration is fully completed. Once the migration is finished, we can remove the data from the original shard.
For search operations, we'll still use the original shard until we've moved all the data.
This approach allows moving data on a live system without any downtime. Since we'll be writing into 2 shards at the same time (original and new), searching on the original will show the correct results. Once the migration is finished, switching to the new shard is immediate.
Note that the data we'll move will be duplicated, so there must be enough space on disk.
Migration data (which spaces are being migrated and the state of those spaces) must be stored as part of the "main" lookup data (and persisted under the same conditions), so each replica knows whether the data for those spaces needs to be written into multiple shards or not.
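A sketch of how the dual write could look during a migration, reusing the hypothetical LookupTable from above; the Index interface is a stand-in for a bleve index shard, and the migration bookkeeping shape is an assumption:

```go
// Index is a minimal stand-in for a bleve index shard.
type Index interface {
	Index(id string, doc interface{}) error
}

// MigrationState records which spaces are being moved and their target shard.
// In practice this would live inside the "main" lookup data, as described above.
type MigrationState struct {
	Targets map[string]string // spaceID -> target shard, only while migrating
}

// writeDocument writes into the original shard and, if the space is being
// migrated, also into the target shard. Searches keep using the original
// shard until the migration finishes.
func writeDocument(shards map[string]Index, table LookupTable, mig MigrationState,
	spaceID, docID string, doc interface{}) error {

	original := table.Spaces[spaceID]
	if original == "" {
		original = table.Default
	}
	if err := shards[original].Index(docID, doc); err != nil {
		return err
	}
	if target, ok := mig.Targets[spaceID]; ok && target != original {
		// Duplicate the write into the new shard while the migration runs.
		return shards[target].Index(docID, doc)
	}
	return nil
}
```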