feat: allow scaling of the search service #11029

Open · wants to merge 3 commits into master

Conversation

jvillafanez (Member)

Description

Previously, the search service opened a write connection to the bleve index and kept that connection open for as long as the service was running. This meant the search service locked out other processes from accessing the index, including the bleve cli and other replicas of the search service. Additional replicas couldn't be spawned because they wouldn't be able to access the index.

We've added a new flag to the search service to allow scaling it.

For environments where scaling isn't needed, it's recommended to leave SEARCH_ENGINE_BLEVE_SCALE set to false (the default value). This keeps the current behavior of locking the bleve index for exclusive access. It is expected to perform better because we don't create a new connection for every operation, and the load on the index should be reduced due to less locking overhead.

Setting SEARCH_ENGINE_BLEVE_SCALE to true causes a connection to be opened per "engine operation" (search, index, delete, prune...). Each connection is closed as soon as the operation finishes.
Concurrency control is delegated to the bleve index, which needs to handle operations arriving concurrently from multiple replicas.
Under heavy workloads, operations are expected to incur delays because they'll have to wait for other operations to finish and release the lock. Read-only operations (search and count) are expected to run against the index in parallel as long as no write operation is ongoing.
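
For illustration, here's a minimal Go sketch of the per-operation pattern described above, using the public bleve v2 API. The function name and error handling are illustrative assumptions, not the actual oCIS implementation:

```go
package search

import (
	"fmt"

	bleve "github.com/blevesearch/bleve/v2"
)

// searchOnce opens the index only for the duration of a single query and
// closes it right away, so other replicas (or the bleve cli) can acquire
// the lock afterwards. With the scale flag disabled, the index would
// instead be opened once at startup and held until shutdown.
func searchOnce(indexPath, term string) (*bleve.SearchResult, error) {
	idx, err := bleve.Open(indexPath) // may block while another process holds the lock
	if err != nil {
		return nil, fmt.Errorf("open index: %w", err)
	}
	defer idx.Close() // release the lock as soon as the operation finishes

	query := bleve.NewMatchQuery(term)
	return idx.Search(bleve.NewSearchRequest(query))
}
```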

Related Issue

#11008

Motivation and Context

Scaling the service should be possible regardless of the specific setup.

How Has This Been Tested?

  1. Exclude the search service from running in the "main" oCIS server
  2. Set up the search service replicas to use the debug log level, and use the new SEARCH_ENGINE_BLEVE_SCALE=true on all of them.
  3. Spawn 3 search service replicas (note that all the replicas must access the same index files; see the sketch below)
  4. Index some files and run some searches

You'll see logs coming from different replicas, and the UX isn't impacted in any way (all the replicas are being used and returning results).
In addition, it's possible to run the bleve cli against the index oCIS is using. Previously, the cli would wait indefinitely.
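
For illustration, a replica setup along those lines might look like the following sketch. The exact commands, ports, and paths are assumptions that depend on the deployment (each replica would additionally need its own debug/service ports):

```sh
# Step 1: keep the search service out of the main oCIS server
export OCIS_EXCLUDE_RUN_SERVICES=search
ocis server &

# Steps 2 and 3: spawn three search replicas with debug logging, the new
# scale flag, and a shared index path (per-replica ports omitted)
export SEARCH_LOG_LEVEL=debug
export SEARCH_ENGINE_BLEVE_SCALE=true
export SEARCH_ENGINE_BLEVE_DATA_PATH=/var/lib/ocis/search  # hypothetical shared path
ocis search server &
ocis search server &
ocis search server &
```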

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Technical debt
  • Tests only (no source changes)

Checklist:

  • Code changes
  • Unit tests added
  • Acceptance tests added
  • Documentation ticket raised:

@jvillafanez self-assigned this Feb 18, 2025
update-docs bot commented Feb 18, 2025

Thanks for opening this pull request! The maintainers of this repository would appreciate it if you would create a changelog item based on your changes.

```diff
@@ -9,4 +9,5 @@ type Engine struct {
 // EngineBleve configures the bleve engine
 type EngineBleve struct {
 	Datapath string `yaml:"data_path" env:"SEARCH_ENGINE_BLEVE_DATA_PATH" desc:"The directory where the filesystem will store search data. If not defined, the root directory derives from $OCIS_BASE_DATA_PATH/search." introductionVersion:"pre5.0"`
+	Scale bool `yaml:"scale" env:"SEARCH_ENGINE_BLEVE_SCALE" desc:"Enable scaling of the bleve index. If set to false, the service will have exclusive write access to the index as long as the service is running, locking out other processes. Defaults to false." introductionVersion:"%%NEXT%%"`
```
Contributor

Suggested change:

```diff
-Scale bool `yaml:"scale" env:"SEARCH_ENGINE_BLEVE_SCALE" desc:"Enable scaling of the bleve index. If set to false, the service will have exclusive write access to the index as long as the service is running, locking out other processes. Defaults to false." introductionVersion:"%%NEXT%%"`
+Scale bool `yaml:"scale" env:"SEARCH_ENGINE_BLEVE_SCALE" desc:"Enable scaling of the search index (bleve). If set to 'true', the instance of the search service will no longer have exclusive write access to the index. Note when scaling search, all instances of the search service must be set to true! For 'false', which is the default, the running search service will lock out other processes trying to access the index as long it is running." introductionVersion:"%%NEXT%%"`
```

Proposal: as we are enabling scaling, we should make the path to scaling clearer, order the cases, and align the description.

```diff
@@ -9,4 +9,5 @@ type Engine struct {
 // EngineBleve configures the bleve engine
 type EngineBleve struct {
 	Datapath string `yaml:"data_path" env:"SEARCH_ENGINE_BLEVE_DATA_PATH" desc:"The directory where the filesystem will store search data. If not defined, the root directory derives from $OCIS_BASE_DATA_PATH/search." introductionVersion:"pre5.0"`
+	Scale bool `yaml:"scale" env:"SEARCH_ENGINE_BLEVE_SCALE" desc:"Enable scaling of the search index (bleve). If set to 'true', the instance of the search service will no longer have exclusive write access to the index. Note when scaling search, all instances of the search service must be set to true! For 'false', which is the default, the running search service will lock out other processes trying to access the index as long it is running." introductionVersion:"%%NEXT%%"`
```
Contributor

Suggested change:

```diff
-Scale bool `yaml:"scale" env:"SEARCH_ENGINE_BLEVE_SCALE" desc:"Enable scaling of the search index (bleve). If set to 'true', the instance of the search service will no longer have exclusive write access to the index. Note when scaling search, all instances of the search service must be set to true! For 'false', which is the default, the running search service will lock out other processes trying to access the index as long it is running." introductionVersion:"%%NEXT%%"`
+Scale bool `yaml:"scale" env:"SEARCH_ENGINE_BLEVE_SCALE" desc:"Enable scaling of the search index (bleve). If set to 'true', the instance of the search service will no longer have exclusive write access to the index. Note when scaling search, all instances of the search service must be set to true! For 'false', which is the default, the running search service will lock out other processes trying to access the index exclusively as long it is running." introductionVersion:"%%NEXT%%"`
```

I missed one word 😅

Member Author

I'm not sure... with the new wording, it seems other processes might be able to access the index as long as they aren't requesting exclusive access.

To clarify, oCIS' search service will request exclusive access (write mode), which will cause other processes to be locked out regardless of the access type.
Also, with the Scale flag active, the difference is that oCIS will connect and disconnect with different access modes (including exclusive access) multiple times. Exclusive access will still be requested (depending on the operation) and granted if possible, but the connection won't last long, so other connections will be able to access the index afterwards.
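
To illustrate the distinction, here's a minimal sketch, assuming the index sits on bleve's default boltdb store (which accepts a "read_only" runtime option) and using hypothetical helper names:

```go
package search

import (
	bleve "github.com/blevesearch/bleve/v2"
)

// openForRead opens the index without requesting the exclusive write lock,
// so read-only operations (search, count) from several replicas can run in
// parallel.
func openForRead(path string) (bleve.Index, error) {
	return bleve.OpenUsing(path, map[string]interface{}{
		"read_only": true, // assumption: boltdb store option for shared access
	})
}

// openForWrite requests exclusive access; with the Scale flag active it is
// held only for the duration of a single index/delete/prune operation.
func openForWrite(path string) (bleve.Index, error) {
	return bleve.Open(path)
}
```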

@jvillafanez (Member Author) commented:

Based on data from owncloud/cdperf#73

With scaling active and 3 search replicas:

     ✓ authn -> logonResponse - status
     ✓ authn -> authorizeResponse - status
     ✓ authn -> accessTokenResponse - status
     ✓ client -> search.searchForResources - status
     ✓ search -> only one result found

     █ setup

       ✓ authn -> logonResponse - status
       ✓ authn -> authorizeResponse - status
       ✓ authn -> accessTokenResponse - status
       ✓ client -> role.getMyDrives - status
       ✓ client -> resource.createResource - status
       ✓ client -> user.createUser - status
       ✓ client -> user.enableUser - status -- (SKIPPED)
       ✓ client -> resource.uploadResource - status

     █ teardown

       ✓ authn -> logonResponse - status
       ✓ authn -> authorizeResponse - status
       ✓ authn -> accessTokenResponse - status
       ✓ client -> resource.deleteResource - status
       ✓ client -> user.deleteUser - status

     checks.........................: 100.00% 749 out of 749
     data_received..................: 727 kB  7.1 kB/s
     data_sent......................: 91 MB   897 kB/s
     http_req_blocked...............: avg=5.17ms  min=279ns    med=1.16µs   max=222.66ms p(90)=1.85µs   p(95)=82.86µs 
     http_req_connecting............: avg=2.24ms  min=0s       med=0s       max=126.06ms p(90)=0s       p(95)=0s      
     http_req_duration..............: avg=1.65s   min=27.3ms   med=347.04ms max=28.4s    p(90)=2.62s    p(95)=7.48s   
       { expected_response:true }...: avg=1.65s   min=27.3ms   med=347.04ms max=28.4s    p(90)=2.62s    p(95)=7.48s   
     http_req_failed................: 0.00%   0 out of 479
     http_req_receiving.............: avg=57.7ms  min=57.11µs  med=13.64ms  max=2.57s    p(90)=107.55ms p(95)=136.75ms
     http_req_sending...............: avg=17.19ms min=94.56µs  med=957.45µs max=340.29ms p(90)=60.99ms  p(95)=134.51ms
     http_req_tls_handshaking.......: avg=2.76ms  min=0s       med=0s       max=166.42ms p(90)=0s       p(95)=0s      
     http_req_waiting...............: avg=1.58s   min=26.17ms  med=259.23ms max=28.39s   p(90)=2.58s    p(95)=7.44s   
     http_reqs......................: 479     4.709452/s
     iteration_duration.............: avg=4.18s   min=144.59ms med=1.26s    max=35.96s   p(90)=4.06s    p(95)=34.7s   
     iterations.....................: 250     2.45796/s
     vus............................: 0       min=0          max=20
     vus_max........................: 20      min=0          max=20


running (01m41.7s), 00/20 VUs, 250 complete and 0 interrupted iterations
default ✓ [ 100% ] 20 VUs  00m52.6s/10m0s  250/250 shared iters

With no scaling (exclusive access):


     ✓ authn -> logonResponse - status
     ✓ authn -> authorizeResponse - status
     ✓ authn -> accessTokenResponse - status
     ✓ client -> search.searchForResources - status
     ✓ search -> only one result found

     █ setup

       ✓ authn -> logonResponse - status
       ✓ authn -> authorizeResponse - status
       ✓ authn -> accessTokenResponse - status
       ✓ client -> role.getMyDrives - status
       ✓ client -> resource.createResource - status
       ✓ client -> user.createUser - status
       ✓ client -> user.enableUser - status -- (SKIPPED)
       ✓ client -> resource.uploadResource - status

     █ teardown

       ✓ authn -> logonResponse - status
       ✓ authn -> authorizeResponse - status
       ✓ authn -> accessTokenResponse - status
       ✓ client -> resource.deleteResource - status
       ✓ client -> user.deleteUser - status

     checks.........................: 100.00% 749 out of 749
     data_received..................: 725 kB  9.2 kB/s
     data_sent......................: 91 MB   1.2 MB/s
     http_req_blocked...............: avg=5.97ms  min=331ns    med=1.09µs   max=320.69ms p(90)=1.53µs  p(95)=2.47µs  
     http_req_connecting............: avg=2.36ms  min=0s       med=0s       max=107.59ms p(90)=0s      p(95)=0s      
     http_req_duration..............: avg=1.05s   min=21.5ms   med=254.2ms  max=18.81s   p(90)=1.58s   p(95)=3.54s   
       { expected_response:true }...: avg=1.05s   min=21.5ms   med=254.2ms  max=18.81s   p(90)=1.58s   p(95)=3.54s   
     http_req_failed................: 0.00%   0 out of 479
     http_req_receiving.............: avg=35.86ms min=61.26µs  med=18.67ms  max=245.25ms p(90)=99.55ms p(95)=122.04ms
     http_req_sending...............: avg=8.76ms  min=99.59µs  med=450.05µs max=175.25ms p(90)=32.64ms p(95)=72.51ms 
     http_req_tls_handshaking.......: avg=3.24ms  min=0s       med=0s       max=165.84ms p(90)=0s      p(95)=0s      
     http_req_waiting...............: avg=1s      min=20.99ms  med=192.06ms max=18.8s    p(90)=1.58s   p(95)=3.46s   
     http_reqs......................: 479     6.082458/s
     iteration_duration.............: avg=2.43s   min=207.12ms med=679.91ms max=22.02s   p(90)=1.94s   p(95)=20.86s  
     iterations.....................: 250     3.17456/s
     vus............................: 0       min=0          max=20
     vus_max........................: 20      min=6          max=20


running (01m18.8s), 00/20 VUs, 250 complete and 0 interrupted iterations
default ✓ [ 100% ] 20 VUs  00m30.6s/10m0s  250/250 shared iters

Note that the tests were run on a single machine with limited memory, so the timings shown above might be worse than those in a real environment. In any case, exclusive access is preferred when possible.
