Support for Diskless Replication #997

Draft: wants to merge 33 commits into base: main

@vazois vazois commented Feb 4, 2025

This PR adds support for diskless replication. Strictly speaking it is diskless full synchronization, but I am using the Redis terminology for consistency.

When replicas attach to a primary and require full synchronization because the AOF has been truncated, they may trigger an on-demand checkpoint, and the latest checkpoint data must then be streamed to them for recovery.
This method of full synchronization is extremely inefficient for the following reasons:

  1. Write amplification at the primary when flushing the checkpoint.
  2. Read amplification at the primary, since the checkpoint files are read and streamed to the replicas in parallel.
  3. Write and read amplification at the replica, which has to first write and then read the checkpoint data in order to recover.

With diskless replication we aim to eliminate these inefficiencies. Diskless replication relies on the streaming snapshot feature of Tsavorite (#824) to stream a consistent snapshot of key-value pairs to the replica when full synchronization is necessary.
When a replica attempts to synchronize with an active primary, the following steps take place:

  1. The replica issues a CLUSTER ATTACH_SYNC request to the corresponding primary, which processes the request by creating a sync task.
  2. The attaching sync task tries to create a ReplicaSyncSession object, sets its sync status to INITIALIZING, and adds that object to the ReplicaSyncSessionTaskStore under lock.
  3. Once the sync session is added, the sync task waits for a few seconds (--repl-diskless-sync-delay) to allow other replicas to attach in the same way.
  4. After the wait time is over, the sync task that attached first starts the StreamingSnapshotDriver (SSD) as a background task. All sync tasks then wait for the SSD to complete.
  5. The SSD acquires an exclusive lock that prevents any other sync task from being added, and orchestrates the full synchronization by streaming a consistent snapshot of the key-value pairs to all replicas that need it.
  6. The SSD completes by notifying the waiting sync tasks that synchronization has completed, and releases the exclusive lock so that more tasks can be added for the next diskless replication session.
  7. The waiting sync tasks then notify their replicas, in parallel, to start recovery of the AOF, and subsequently spawn a background AofSyncTask to start streaming the AOF records generated at the primary.
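
To make the ordering of steps 2 to 7 easier to follow, here is a minimal Python asyncio sketch of the coordination pattern. It is not the code in this PR: `SyncSessionStore`, `attach`, `_run_driver`, `rounds`, and `demo` are hypothetical names made up for illustration, a list of replica ids stands in for the ReplicaSyncSession objects, and the snapshot streaming itself is replaced by a placeholder.

```python
import asyncio


class SyncSessionStore:
    """Toy stand-in for ReplicaSyncSessionTaskStore (all names are illustrative)."""

    def __init__(self, sync_delay: float):
        self.sync_delay = sync_delay  # plays the role of --repl-diskless-sync-delay
        self.lock = asyncio.Lock()    # guards the session list; held by the driver
        self.sessions = []            # replicas waiting for the next snapshot
        self.done = None              # completion event for the current round
        self.rounds = []              # replicas served in each completed round
        self._driver_task = None      # keeps a reference to the background driver

    async def attach(self, replica_id: str) -> str:
        # Step 2: register the session under lock.
        async with self.lock:
            first = not self.sessions
            if first:
                self.done = asyncio.Event()
            self.sessions.append(replica_id)
            done = self.done
        # Step 3: wait so that other replicas can join the same round.
        await asyncio.sleep(self.sync_delay)
        # Step 4: only the task that attached first starts the driver.
        if first:
            self._driver_task = asyncio.create_task(self._run_driver(done))
        await done.wait()
        # Step 7: every waiting task continues with AOF streaming.
        return f"{replica_id}: start AOF sync"

    async def _run_driver(self, done: asyncio.Event) -> None:
        # Step 5: holding the lock blocks new attachments while streaming.
        async with self.lock:
            batch, self.sessions = self.sessions, []
            await asyncio.sleep(0)  # placeholder for streaming the snapshot
            self.rounds.append(batch)
            # Step 6: notify the waiters; the lock is released on block exit.
            done.set()


async def demo():
    store = SyncSessionStore(sync_delay=0.05)
    results = await asyncio.gather(
        *(store.attach(f"replica-{i}") for i in range(3))
    )
    return store.rounds, results
```

Running `asyncio.run(demo())` attaches three replicas inside the delay window, so a single driver round serves all of them. A replica that tries to attach while the driver holds the lock blocks until the round finishes and then becomes the first task of the next round.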

By using the streaming checkpoint approach, we eliminate write and read amplification at the primary.
In addition, by allowing multiple replicas to synchronize in parallel, we avoid scanning the TsavoriteStore once per replica.
Finally, we eliminate both read and write amplification at the replica, because the replica no longer has to write and then read the checkpoint to recover before the AOF records start streaming.
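
As a rough illustration of the savings, the following Python sketch counts checkpoint-related disk I/O under a deliberately simplified model. The model is an assumption, not a measurement: it supposes one on-demand checkpoint of `snapshot_bytes` shared by all attaching replicas, ignores page-cache effects and AOF traffic, and uses the hypothetical names `CheckpointIO`, `disk_based_checkpoint_io`, `diskless_checkpoint_io`, and `snapshot_scans`.

```python
from dataclasses import dataclass


@dataclass
class CheckpointIO:
    """Checkpoint-related disk bytes, summed over all nodes of each kind."""
    primary_written: int
    primary_read: int
    replica_written: int
    replica_read: int


def disk_based_checkpoint_io(snapshot_bytes: int, replicas: int) -> CheckpointIO:
    # Assumed model: the primary flushes one checkpoint, reads it back once
    # per replica, and every replica writes the files and then reads them.
    return CheckpointIO(
        primary_written=snapshot_bytes,
        primary_read=snapshot_bytes * replicas,
        replica_written=snapshot_bytes * replicas,
        replica_read=snapshot_bytes * replicas,
    )


def diskless_checkpoint_io(snapshot_bytes: int, replicas: int) -> CheckpointIO:
    # The snapshot is streamed from memory and applied in memory, so no
    # checkpoint bytes touch disk on either side in this model.
    return CheckpointIO(0, 0, 0, 0)


def snapshot_scans(replicas: int, shared_session: bool) -> int:
    # One store scan serves every replica in a shared session; without
    # sharing, each replica would need its own scan.
    return min(replicas, 1) if shared_session else replicas
```

For example, with a hypothetical 10 GiB snapshot and 3 replicas, the disk-based model gives 10 GiB written and 30 GiB read at the primary, plus 30 GiB written and 30 GiB read across the replicas, while the diskless model gives zero checkpoint disk I/O and a single store scan.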

NOTES:

  • The SSD will release early any sync task that does not require full synchronization.
  • Currently, the AOF gets safely truncated at the completion of a streaming checkpoint. This is not necessary and might conflict with persistence guarantees, but it was done to keep the AOF from growing arbitrarily large. The assumption is that taking regular checkpoints at the primary is orthogonal to diskless replication.
  • Currently, the replica will not write any data to its local disk when receiving the streaming checkpoint. It is possible to lift this restriction, but I felt that doing so goes against the spirit of truly diskless replication.
  • For now, diskless replication will operate separately from the disk-based approach to allow for a preview period. It may be possible to merge the two features, or to remove disk-based replication entirely if it is no longer necessary.

@vazois vazois force-pushed the vazois/diskless-repl branch 5 times, most recently from 81508ae to b802961 Compare February 7, 2025 01:27
@vazois vazois force-pushed the vazois/diskless-repl branch from b802961 to 9a4ad8f Compare February 7, 2025 16:53