feat: allow scaling of the search service #11029

Open · wants to merge 3 commits into master

Conversation

jvillafanez (Member)

Description

Previously, the search service opened a write connection to the bleve index and kept that connection open for as long as the service was running. This meant the search service locked out other processes from accessing the index, including the bleve cli and other replicas of the search service. Additional replicas couldn't be spawned because they wouldn't be able to access the index.

We've added a new flag to the search service to allow scaling it.

For environments where scaling isn't needed, it's recommended to leave SEARCH_ENGINE_BLEVE_SCALE set to false (the default value). This keeps the current behavior of locking the bleve index for exclusive access. It is expected to perform better because we don't create a new connection for every operation, and the load on the index should be reduced due to less locking overhead.

Setting SEARCH_ENGINE_BLEVE_SCALE to true causes a connection to be opened per "engine operation" (search, index, delete, prune...). Each connection is closed as soon as the operation finishes.
Concurrency control is delegated to the bleve index, which needs to handle operations arriving concurrently from multiple replicas.
Under heavy workloads, operations are expected to incur delays because they'll have to wait for other operations to finish and release the lock. Read-only operations (search and count) are expected to run against the index in parallel as long as no write operation is ongoing.
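
For illustration, here's a minimal Go sketch of the per-operation pattern described above, using the public bleve v2 API. The function name and error handling are illustrative assumptions, not the actual oCIS implementation:

```go
package search

import (
	"fmt"

	bleve "github.com/blevesearch/bleve/v2"
)

// searchOnce opens the index only for the duration of a single query and
// closes it right away, so other replicas (or the bleve cli) can acquire
// the lock afterwards. With the scale flag disabled, the index would
// instead be opened once at startup and held until shutdown.
func searchOnce(indexPath, term string) (*bleve.SearchResult, error) {
	idx, err := bleve.Open(indexPath) // may block while another process holds the lock
	if err != nil {
		return nil, fmt.Errorf("open index: %w", err)
	}
	defer idx.Close() // release the lock as soon as the operation finishes

	query := bleve.NewMatchQuery(term)
	return idx.Search(bleve.NewSearchRequest(query))
}
```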

Related Issue

#11008

Motivation and Context

Scaling the service should be possible regardless of the specific setup.

How Has This Been Tested?

  1. Exclude the search service from running in the "main" oCIS server
  2. Set up the search service replicas to use the debug log level, and use the new SEARCH_ENGINE_BLEVE_SCALE=true on all of them.
  3. Spawn 3 search service replicas (note that all the replicas must access the same index files; see the sketch below)
  4. Index some files and run some searches

You'll see logs coming from different replicas, and the UX isn't impacted in any way (all the replicas are being used and returning results).
In addition, it's possible to run the bleve cli against the index oCIS is using. Previously, the cli would wait indefinitely.
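
For illustration, a replica setup along those lines might look like the following sketch. The exact commands, ports, and paths are assumptions that depend on the deployment (each replica would additionally need its own debug/service ports):

```sh
# Step 1: keep the search service out of the main oCIS server
export OCIS_EXCLUDE_RUN_SERVICES=search
ocis server &

# Steps 2 and 3: spawn three search replicas with debug logging, the new
# scale flag, and a shared index path (per-replica ports omitted)
export SEARCH_LOG_LEVEL=debug
export SEARCH_ENGINE_BLEVE_SCALE=true
export SEARCH_ENGINE_BLEVE_DATA_PATH=/var/lib/ocis/search  # hypothetical shared path
ocis search server &
ocis search server &
ocis search server &
```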

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Technical debt
  • Tests only (no source changes)

Checklist:

  • Code changes
  • Unit tests added
  • Acceptance tests added
  • Documentation ticket raised:

@jvillafanez self-assigned this Feb 18, 2025
update-docs bot commented Feb 18, 2025

Thanks for opening this pull request! The maintainers of this repository would appreciate it if you would create a changelog item based on your changes.

```diff
@@ -9,4 +9,5 @@ type Engine struct {
 // EngineBleve configures the bleve engine
 type EngineBleve struct {
 	Datapath string `yaml:"data_path" env:"SEARCH_ENGINE_BLEVE_DATA_PATH" desc:"The directory where the filesystem will store search data. If not defined, the root directory derives from $OCIS_BASE_DATA_PATH/search." introductionVersion:"pre5.0"`
+	Scale bool `yaml:"scale" env:"SEARCH_ENGINE_BLEVE_SCALE" desc:"Enable scaling of the bleve index. If set to false, the service will have exclusive write access to the index as long as the service is running, locking out other processes. Defaults to false." introductionVersion:"%%NEXT%%"`
```
Contributor

Suggested change:

```diff
-Scale bool `yaml:"scale" env:"SEARCH_ENGINE_BLEVE_SCALE" desc:"Enable scaling of the bleve index. If set to false, the service will have exclusive write access to the index as long as the service is running, locking out other processes. Defaults to false." introductionVersion:"%%NEXT%%"`
+Scale bool `yaml:"scale" env:"SEARCH_ENGINE_BLEVE_SCALE" desc:"Enable scaling of the search index (bleve). If set to 'true', the instance of the search service will no longer have exclusive write access to the index. Note when scaling search, all instances of the search service must be set to true! For 'false', which is the default, the running search service will lock out other processes trying to access the index as long it is running." introductionVersion:"%%NEXT%%"`
```

Proposal: as we are enabling scaling, we should make the path to scaling clearer, order the cases, and align the description.

```diff
@@ -9,4 +9,5 @@ type Engine struct {
 // EngineBleve configures the bleve engine
 type EngineBleve struct {
 	Datapath string `yaml:"data_path" env:"SEARCH_ENGINE_BLEVE_DATA_PATH" desc:"The directory where the filesystem will store search data. If not defined, the root directory derives from $OCIS_BASE_DATA_PATH/search." introductionVersion:"pre5.0"`
+	Scale bool `yaml:"scale" env:"SEARCH_ENGINE_BLEVE_SCALE" desc:"Enable scaling of the search index (bleve). If set to 'true', the instance of the search service will no longer have exclusive write access to the index. Note when scaling search, all instances of the search service must be set to true! For 'false', which is the default, the running search service will lock out other processes trying to access the index as long it is running." introductionVersion:"%%NEXT%%"`
```
Contributor

Suggested change:

```diff
-Scale bool `yaml:"scale" env:"SEARCH_ENGINE_BLEVE_SCALE" desc:"Enable scaling of the search index (bleve). If set to 'true', the instance of the search service will no longer have exclusive write access to the index. Note when scaling search, all instances of the search service must be set to true! For 'false', which is the default, the running search service will lock out other processes trying to access the index as long it is running." introductionVersion:"%%NEXT%%"`
+Scale bool `yaml:"scale" env:"SEARCH_ENGINE_BLEVE_SCALE" desc:"Enable scaling of the search index (bleve). If set to 'true', the instance of the search service will no longer have exclusive write access to the index. Note when scaling search, all instances of the search service must be set to true! For 'false', which is the default, the running search service will lock out other processes trying to access the index exclusively as long it is running." introductionVersion:"%%NEXT%%"`
```

I missed one word 😅

Member Author

I'm not sure... with the new wording, it seems other processes might be able to access the index as long as they aren't requesting exclusive access.

To clarify, oCIS' search service will request exclusive access (write mode), which will cause other processes to be locked out regardless of the access type.
Also, with the Scale flag active, the difference is that oCIS will connect and disconnect with different access modes (including exclusive access) multiple times. Exclusive access will still be requested (depending on the operation) and granted if possible, but the connection won't last long, so other connections will be able to access the index afterwards.
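
To illustrate the distinction, here's a minimal sketch, assuming the index sits on bleve's default boltdb store (which accepts a "read_only" runtime option) and using hypothetical helper names:

```go
package search

import (
	bleve "github.com/blevesearch/bleve/v2"
)

// openForRead opens the index without requesting the exclusive write lock,
// so read-only operations (search, count) from several replicas can run in
// parallel.
func openForRead(path string) (bleve.Index, error) {
	return bleve.OpenUsing(path, map[string]interface{}{
		"read_only": true, // assumption: boltdb store option for shared access
	})
}

// openForWrite requests exclusive access; with the Scale flag active it is
// held only for the duration of a single index/delete/prune operation.
func openForWrite(path string) (bleve.Index, error) {
	return bleve.Open(path)
}
```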

@jvillafanez (Member Author) commented:

Based on data from owncloud/cdperf#73

With scaling active and 3 search replicas:

     ✓ authn -> logonResponse - status
     ✓ authn -> authorizeResponse - status
     ✓ authn -> accessTokenResponse - status
     ✓ client -> search.searchForResources - status
     ✓ search -> only one result found

     █ setup

       ✓ authn -> logonResponse - status
       ✓ authn -> authorizeResponse - status
       ✓ authn -> accessTokenResponse - status
       ✓ client -> role.getMyDrives - status
       ✓ client -> resource.createResource - status
       ✓ client -> user.createUser - status
       ✓ client -> user.enableUser - status -- (SKIPPED)
       ✓ client -> resource.uploadResource - status

     █ teardown

       ✓ authn -> logonResponse - status
       ✓ authn -> authorizeResponse - status
       ✓ authn -> accessTokenResponse - status
       ✓ client -> resource.deleteResource - status
       ✓ client -> user.deleteUser - status

     checks.........................: 100.00% 749 out of 749
     data_received..................: 727 kB  7.1 kB/s
     data_sent......................: 91 MB   897 kB/s
     http_req_blocked...............: avg=5.17ms  min=279ns    med=1.16µs   max=222.66ms p(90)=1.85µs   p(95)=82.86µs 
     http_req_connecting............: avg=2.24ms  min=0s       med=0s       max=126.06ms p(90)=0s       p(95)=0s      
     http_req_duration..............: avg=1.65s   min=27.3ms   med=347.04ms max=28.4s    p(90)=2.62s    p(95)=7.48s   
       { expected_response:true }...: avg=1.65s   min=27.3ms   med=347.04ms max=28.4s    p(90)=2.62s    p(95)=7.48s   
     http_req_failed................: 0.00%   0 out of 479
     http_req_receiving.............: avg=57.7ms  min=57.11µs  med=13.64ms  max=2.57s    p(90)=107.55ms p(95)=136.75ms
     http_req_sending...............: avg=17.19ms min=94.56µs  med=957.45µs max=340.29ms p(90)=60.99ms  p(95)=134.51ms
     http_req_tls_handshaking.......: avg=2.76ms  min=0s       med=0s       max=166.42ms p(90)=0s       p(95)=0s      
     http_req_waiting...............: avg=1.58s   min=26.17ms  med=259.23ms max=28.39s   p(90)=2.58s    p(95)=7.44s   
     http_reqs......................: 479     4.709452/s
     iteration_duration.............: avg=4.18s   min=144.59ms med=1.26s    max=35.96s   p(90)=4.06s    p(95)=34.7s   
     iterations.....................: 250     2.45796/s
     vus............................: 0       min=0          max=20
     vus_max........................: 20      min=0          max=20


running (01m41.7s), 00/20 VUs, 250 complete and 0 interrupted iterations
default ✓ [ 100% ] 20 VUs  00m52.6s/10m0s  250/250 shared iters

With no scaling (exclusive access):


     ✓ authn -> logonResponse - status
     ✓ authn -> authorizeResponse - status
     ✓ authn -> accessTokenResponse - status
     ✓ client -> search.searchForResources - status
     ✓ search -> only one result found

     █ setup

       ✓ authn -> logonResponse - status
       ✓ authn -> authorizeResponse - status
       ✓ authn -> accessTokenResponse - status
       ✓ client -> role.getMyDrives - status
       ✓ client -> resource.createResource - status
       ✓ client -> user.createUser - status
       ✓ client -> user.enableUser - status -- (SKIPPED)
       ✓ client -> resource.uploadResource - status

     █ teardown

       ✓ authn -> logonResponse - status
       ✓ authn -> authorizeResponse - status
       ✓ authn -> accessTokenResponse - status
       ✓ client -> resource.deleteResource - status
       ✓ client -> user.deleteUser - status

     checks.........................: 100.00% 749 out of 749
     data_received..................: 725 kB  9.2 kB/s
     data_sent......................: 91 MB   1.2 MB/s
     http_req_blocked...............: avg=5.97ms  min=331ns    med=1.09µs   max=320.69ms p(90)=1.53µs  p(95)=2.47µs  
     http_req_connecting............: avg=2.36ms  min=0s       med=0s       max=107.59ms p(90)=0s      p(95)=0s      
     http_req_duration..............: avg=1.05s   min=21.5ms   med=254.2ms  max=18.81s   p(90)=1.58s   p(95)=3.54s   
       { expected_response:true }...: avg=1.05s   min=21.5ms   med=254.2ms  max=18.81s   p(90)=1.58s   p(95)=3.54s   
     http_req_failed................: 0.00%   0 out of 479
     http_req_receiving.............: avg=35.86ms min=61.26µs  med=18.67ms  max=245.25ms p(90)=99.55ms p(95)=122.04ms
     http_req_sending...............: avg=8.76ms  min=99.59µs  med=450.05µs max=175.25ms p(90)=32.64ms p(95)=72.51ms 
     http_req_tls_handshaking.......: avg=3.24ms  min=0s       med=0s       max=165.84ms p(90)=0s      p(95)=0s      
     http_req_waiting...............: avg=1s      min=20.99ms  med=192.06ms max=18.8s    p(90)=1.58s   p(95)=3.46s   
     http_reqs......................: 479     6.082458/s
     iteration_duration.............: avg=2.43s   min=207.12ms med=679.91ms max=22.02s   p(90)=1.94s   p(95)=20.86s  
     iterations.....................: 250     3.17456/s
     vus............................: 0       min=0          max=20
     vus_max........................: 20      min=6          max=20


running (01m18.8s), 00/20 VUs, 250 complete and 0 interrupted iterations
default ✓ [ 100% ] 20 VUs  00m30.6s/10m0s  250/250 shared iters

Note that the tests were run on a single machine with limited memory, so the timings shown above might be worse than those in a real environment. In any case, exclusive access is preferred when possible.
