Support for Diskless Replication #997

vazois · 2025-02-04T21:28:15Z

This PR adds support for diskless replication (strictly speaking, diskless full synchronization; I use the Redis term for consistency).

When replicas attach to a primary and require full synchronization because of AOF truncation, they may trigger an on-demand checkpoint and then need the latest checkpoint data streamed to them for recovery.
This method of full synchronization is extremely inefficient for the following reasons:

Write amplification at the primary when flushing the checkpoint.
Read amplification at the primary since replicas read and stream the checkpoint files in parallel.
Write and read amplification at the replica that has to first write and then read the checkpoint data in order to recover.

With diskless replication we aim to eliminate these inefficiencies. Diskless replication relies on the streaming snapshot feature of Tsavorite (#824) to stream a consistent snapshot of key-value pairs to the replica when full synchronization is necessary.
When a replica attempts to synchronize with an active primary, it performs the following steps:

It issues a CLUSTER ATTACH_SYNC request to the corresponding primary, which processes the request by creating a sync task.
The attaching sync task tries to create a ReplicaSyncSession object, sets its sync status to INITIALIZING and, under a lock, adds that object to the ReplicaSyncSessionTaskStore.
Once the sync session is added, the sync task waits for a few seconds (--repl-diskless-sync-delay) to allow other replicas to attach in the same way.
After the wait time is over, the sync task that attached first starts the StreamingSnapshotDriver (SSD) as a background task. Afterwards, all sync tasks wait for the SSD to complete.
The SSD acquires an exclusive lock that prevents any other sync tasks from being added, and orchestrates the full synchronization by streaming a consistent snapshot of the key-value pairs to all replicas that need it.
The SSD completes by notifying any waiting sync tasks that synchronization has completed, and releases the exclusive lock to allow more tasks to be added for the next diskless replication session.
The waiting sync tasks then, in parallel, notify their replicas to start recovery of the AOF and subsequently spawn a background AofSyncTask to start streaming the AOF records generated at the primary.
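The steps above can be modelled with a small, self-contained sketch. This is not Garnet's implementation (which is C#): the class names `SyncSession` and `SyncSessionStore`, the status strings other than INITIALIZING, and the `demo` helper are all hypothetical, and a plain dict stands in for the store. It only illustrates the coordination pattern: sessions attach under a lock, wait for the sync delay, the first attacher starts the driver, and the driver streams one snapshot to every attached session before releasing them.

```python
import threading
import time


class SyncSession:
    """Hypothetical stand-in for a ReplicaSyncSession."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.status = "INITIALIZING"
        self.received = []               # key-value pairs streamed to this replica
        self.done = threading.Event()    # set by the driver when streaming completes


class SyncSessionStore:
    """Hypothetical stand-in for the ReplicaSyncSessionTaskStore."""

    def __init__(self, store, sync_delay):
        self.store = store               # dict standing in for the key-value store
        self.sync_delay = sync_delay     # models --repl-diskless-sync-delay (seconds)
        self.lock = threading.Lock()     # guards `sessions`; held while streaming
        self.sessions = []
        self.snapshots_taken = 0

    def attach(self, replica_id):
        """Body of one sync task, created when a replica asks to attach."""
        session = SyncSession(replica_id)
        with self.lock:                  # blocks while a driver is streaming
            self.sessions.append(session)
            is_leader = len(self.sessions) == 1
        time.sleep(self.sync_delay)      # give other replicas time to attach
        if is_leader:                    # first attacher starts the driver
            threading.Thread(target=self._driver).start()
        session.done.wait()              # every sync task waits for the driver
        session.status = "AOF_STREAMING" # next step: recover and stream the AOF
        return session

    def _driver(self):
        """Streams one snapshot to all sessions attached so far."""
        with self.lock:                  # no new session can be added meanwhile
            snapshot = list(self.store.items())
            self.snapshots_taken += 1
            for pair in snapshot:        # one scan serves every replica
                for session in self.sessions:
                    session.received.append(pair)
            for session in self.sessions:
                session.status = "SYNCED"
                session.done.set()       # notify the waiting sync tasks
            self.sessions = []           # ready for the next replication session


def demo(store, replicas, sync_delay):
    """Attach `replicas` replicas concurrently and return the results."""
    primary = SyncSessionStore(store, sync_delay)
    results = [None] * replicas

    def worker(index):
        results[index] = primary.attach(f"replica-{index}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(replicas)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return primary, results


if __name__ == "__main__":
    primary, sessions = demo({"a": 1, "b": 2}, replicas=3, sync_delay=0.5)
    print(primary.snapshots_taken, [s.received for s in sessions])
```

In this model, a replica that attaches while the driver holds the lock simply blocks, then becomes the first attacher of the next session once the lock is released.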

By using the streaming checkpoint approach, we eliminate write and read amplification at the primary.
In addition, by allowing multiple replicas to synchronize in parallel, we reduce the overhead of scanning the TsavoriteStore multiple times.
Finally, we eliminate both read and write amplification at the replica because we don't require writing and reading the checkpoint to recover before starting to stream the AOF records.
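To make the amplification argument concrete, here is a back-of-the-envelope model of the disk I/O involved in one full synchronization. It is a simplification under stated assumptions, not a measurement: it assumes the disk-based path flushes the checkpoint once at the primary, reads the files back once per attaching replica, and that each replica writes and then reads the full checkpoint. The names `SyncIO`, `disk_based` and `diskless` are made up for this illustration, and caching effects are ignored.

```python
from dataclasses import dataclass


@dataclass
class SyncIO:
    """Disk I/O for one full synchronization, in the caller's size unit."""
    primary_written: int
    primary_read: int
    replica_written: int   # per replica
    replica_read: int      # per replica


def disk_based(checkpoint_size, replicas):
    """Assumed cost model for checkpoint-file based synchronization."""
    return SyncIO(
        primary_written=checkpoint_size,          # on-demand checkpoint flushed once
        primary_read=checkpoint_size * replicas,  # files read back once per replica
        replica_written=checkpoint_size,          # replica persists the files...
        replica_read=checkpoint_size,             # ...then reads them to recover
    )


def diskless(checkpoint_size, replicas):
    """Assumed cost model for diskless sync: the snapshot never touches disk."""
    return SyncIO(primary_written=0, primary_read=0, replica_written=0, replica_read=0)


if __name__ == "__main__":
    # Example: a 10 GiB checkpoint and 3 attaching replicas.
    print(disk_based(10, 3))
    print(diskless(10, 3))
```

Under these assumptions, a 10 GiB checkpoint with 3 replicas costs the primary 10 GiB of writes and 30 GiB of reads on the disk-based path, and none on the diskless path.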

NOTES:

The SSD will release early any sync task that does not require full synchronization.
Currently, at the completion of a streaming checkpoint the AOF gets safely truncated. This is not necessary and might conflict with persistence guarantees, but it was done to prevent the AOF from growing arbitrarily large. The assumption is that taking regular checkpoints at the primary is orthogonal to diskless replication.
Currently, the replica will not write any data to its local disk when receiving the streaming checkpoint. It is possible to eliminate this restriction, but I felt that this goes against the spirit of truly diskless replication.
For now, diskless replication will operate separately from the disk-based approach to allow for a preview period. It may be possible to merge the two features, or to eliminate disk-based replication entirely if it is no longer necessary.
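As an illustration of the first note (sync tasks that do not need full synchronization are released early), the check below sketches one plausible criterion based on the description's statement that full synchronization is required because of AOF truncation. The function name and both parameters are hypothetical, and the real decision may involve additional state that is not modelled here.

```python
def needs_full_sync(replica_aof_offset, primary_aof_begin):
    """Return True if the replica cannot catch up from the AOF alone.

    Assumption: the primary can only replay AOF records at or after
    `primary_aof_begin`; anything earlier has been truncated.
    """
    return replica_aof_offset < primary_aof_begin


if __name__ == "__main__":
    # Hypothetical offsets: the primary's AOF has been truncated up to 500.
    for offset in (100, 500, 900):
        action = "full sync" if needs_full_sync(offset, 500) else "AOF streaming only"
        print(offset, action)
```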


vazois force-pushed the vazois/diskless-repl branch 5 times, most recently from 81508ae to b802961 Compare February 7, 2025 01:27

vazois added 25 commits February 7, 2025 08:53

expose diskless replication parameters

484ff8f

refactor/cleanup legacy ReplicaSyncSession

4bd5426

add interface to support diskless replication session and aof tasks

762575b

core diskless replication implementation

6e128c6

expose diskless replication API

d47e7b8

adding test for diskless replication

fb8b747

update gcs extension to clearly mark logging progress

c799cea

fix gcs dispose on diskless attach, call dispose of replicationSyncManager, add more logging

c8ce9de

complete first diskless replication test

ecf731c

fix iterator check for null when empty store

af7938c

fix iterator for object store cluster sync

7539733

add simple diskless sync test

aa82944

cleanup code

fa64b34

replica fall behind test

616e32e

wip

4795af0

register cts at wait for sync completion

340c18b

add db version alignment test

12820cb

avoid using close lock for leader based syncing

6bb3153

truncate AOF after streaming checkpoint is taken

f396552

add tests for failover with diskless replication

3fab2f5

fix formatting and conversion to IPEndpoint

4b2a3e0

fix RepCommandsTests

eca259c

dispose aofSyncTask if failed to add to AofSyncTaskStore

50b966b

overload dispose ReplicaSyncSession

80fcb43

explicitly dispose gcs used for full sync at replicaSyncSession sync

5737809

vazois added 7 commits February 7, 2025 08:53

dispose gcs once on return

d77501c

code cleanup

87d3f96

update tests to provide more context logging

97f73f5

add more comprehensive logging of syncMetadata

92d3153

add timeout for streaming checkpoint

f29f8f2

add clusterTimeout for diskless repl tests

90cbe8c

some more logging

9a4ad8f

vazois force-pushed the vazois/diskless-repl branch from b802961 to 9a4ad8f Compare February 7, 2025 16:53

cleanup and refactor code

5769f37
