
Proactive Syncer #1911

Open — wants to merge 35 commits into main
Conversation

Elvis339 (Contributor) commented Feb 3, 2025

This patch closes #1879.
TL;DR: The syncer delays state sync completion until a full time validity window of blocks has been processed. Currently it relies on existing blocks or waits for new ones to arrive. The proposal is to add a P2P protocol to proactively fetch and backfill blocks, making the process faster.
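For orientation, here is a rough sketch of the request/response shapes such a protocol implies. BlockFetchRequest appears in the review below, but the exact fields shown here are assumptions, not the PR's actual definitions:

// Hypothetical shapes, inferred from the handler and client code quoted
// in the review conversation; the actual PR may differ.
type BlockFetchRequest struct {
	BlockHeight  uint64 // height to start fetching backwards from
	MinTimestamp int64  // stop once a block is strictly older than this
}

type BlockFetchResponse struct {
	Blocks [][]byte // raw serialized blocks, filled newest-to-oldest
}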

Difference from issue #1879

The block window fetcher design proposed in the issue was dropped in favor of a simpler, more intuitive design.

Previously vs. currently: (design comparison diagrams omitted)

Changes

Chain Index Package

The package needed modifications to support backward block fetching during state sync.

Note: these chain index changes were later reverted in favor of a simpler design where fetched blocks are accepted into the validity window instead of being written directly on-disk; this preserves the single-writer heuristic and keeps the design streamlined. The original approach is described below for context.

  1. Previous Limitation:
    • Block insertion was only possible through UpdateLastAccepted
    • That method's forward-only design made it unsuitable for backward fetching

  2. New Solution - WriteBlock:
    • Writes blocks directly to the database without enforcing pruning
    • Designed specifically for state sync operations
    • Relies on the blockchain's forward-moving nature
    • Leaves cleanup to UpdateLastAccepted's pruning mechanism

  3. Writing Strategy:
    • Two writers instead of one (see the sketch below)
    • WriteBlock: used only during state sync; writes without synchronization
    • UpdateLastAccepted: maintains pruning via DeleteRange
    • No explicit synchronization between the writers, avoiding overhead
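A minimal sketch of the two-writer split described above, with assumed types and key layout (illustrative, not the PR's actual code):

package chainindex

import "encoding/binary"

const blockPrefix byte = 0x0 // illustrative key prefix

// rangeDB is an assumed minimal KV interface; the PR adds DeleteRange
// support to the internal pebble adapter (see the next section).
type rangeDB interface {
	Put(key, value []byte) error
	DeleteRange(start, end []byte) error
}

type ChainIndex struct {
	db                  rangeDB
	acceptedBlockWindow uint64
}

func blockKey(height uint64) []byte {
	return binary.BigEndian.AppendUint64([]byte{blockPrefix}, height)
}

// WriteBlock is the state-sync-only writer: it stores a backfilled block
// without pruning and without synchronizing with the accept path, relying
// on the chain's forward-moving nature to keep the two writers' key
// ranges disjoint.
func (c *ChainIndex) WriteBlock(height uint64, blk []byte) error {
	return c.db.Put(blockKey(height), blk)
}

// UpdateLastAccepted writes the newly accepted block, then prunes all
// heights that fell out of the accepted-block window via DeleteRange.
func (c *ChainIndex) UpdateLastAccepted(height uint64, blk []byte) error {
	if err := c.db.Put(blockKey(height), blk); err != nil {
		return err
	}
	if height <= c.acceptedBlockWindow {
		return nil // nothing old enough to prune yet
	}
	expiry := height - c.acceptedBlockWindow
	// Delete [1, expiry): keep genesis and everything inside the window.
	return c.db.DeleteRange(blockKey(1), blockKey(expiry))
}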

Internal Pebble Package

Enhanced database capabilities through adapter improvements:

  1. Previous Limitation:
    • avalanchego/database provided minimal functionality
    • No native support for range operations

  2. New Features:
    • Added range deletion support
    • Implemented a fallback mechanism for databases without native range deletion
    • Maintains consistent behavior across different database implementations
    • Preserves native optimizations when available

  3. Implementation Details:
    • Provides a generic implementation for basic databases
    • Uses batching for efficient key deletion
    • Integrates seamlessly with both basic and advanced database implementations (see the sketch below)
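Where native range deletion is unavailable, the batching fallback described above can look roughly like this; a minimal sketch assuming avalanchego's database.Database interface, with an illustrative helper name:

package pebble

import (
	"bytes"

	"github.com/ava-labs/avalanchego/database"
)

// deleteRangeFallback (hypothetical name) emulates DeleteRange on
// databases without native support: iterate over [start, end) and
// delete each key through a single batch.
func deleteRangeFallback(db database.Database, start, end []byte) error {
	batch := db.NewBatch()
	it := db.NewIteratorWithStart(start)
	defer it.Release()

	for it.Next() {
		key := it.Key()
		if bytes.Compare(key, end) >= 0 {
			break // reached the exclusive upper bound
		}
		if err := batch.Delete(key); err != nil {
			return err
		}
	}
	if err := it.Error(); err != nil {
		return err
	}
	return batch.Write()
}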

Typed Client

Moved TypedClient out of the dsmr package. There was a TODO to merge TypedClient upstream into AvalancheGo; for the sake of testing and opening this PR, I moved it into internal/typedclient.

@Elvis339 Elvis339 self-assigned this Feb 3, 2025
elvis.sabanovic and others added 14 commits February 3, 2025 21:53
… off "use of weak random number generator (math/rand instead of crypto/rand) (gosec)"
…ase to support DeleteRange operation in generic & safe way
Adds range-based block deletion for more efficient pruning and implements in-memory block ID tracking. Key changes:
- Replace individual deletes with DeleteRange operations
- Add ID-to-height map for faster lookups
- Introduce pruningRange helper for deletion window
- Improve WriteBlock comment clarity

BREAKING CHANGE: Updates block prefix constants and storage layout
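For illustration, one possible shape for the pruningRange helper named in this commit message (hypothetical; the PR's actual helper may differ, and the earlier sketch inlines the same computation):

// pruningRange returns the [start, end) height window eligible for
// deletion after accepting a block at lastAccepted; ok is false while
// there is not yet enough history to prune.
func pruningRange(lastAccepted, acceptedBlockWindow uint64) (start, end uint64, ok bool) {
	if lastAccepted <= acceptedBlockWindow {
		return 0, 0, false
	}
	return 1, lastAccepted - acceptedBlockWindow, true // keep genesis (height 0)
}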
@Elvis339 Elvis339 marked this pull request as ready for review February 17, 2025 18:48
@Elvis339 Elvis339 changed the title from Block Window Fetcher to Proactive Syncer Feb 18, 2025
@@ -1,2 +1,2 @@
-Copyright (C) 2024, Ava Labs, Inc. All rights reserved.
+Copyright (C) 2025, Ava Labs, Inc. All rights reserved.
Collaborator: Can we remove this from the diff?

func newExecutionBlock(height uint64, timestamp int64, containers []int64) executionBlock {
	e := executionBlock{
		Prnt:   uint64ToID(height - 1), // Allow underflow for genesis
		Tmstmp: timestamp,
		Hght:   height,
		ID:     uint64ToID(height),
		Bytes:  []byte{},
Collaborator: Should we populate this with binary.BigEndian.AppendUint64(nil, height) so that we don't hit any unexpected cases where we read this incorrectly?

Elvis339 (Contributor Author) replied Feb 20, 2025: Yes, that makes sense.

Bytes: binary.BigEndian.AppendUint64(nil, height),

Comment on lines 102 to 103
case <-timeoutCtx.Done():
	return blocks, timeoutCtx.Err()
Collaborator: Rather than returning a non-nil error and ignoring it if we have a populated response, could we instead drop the timeout error here and return the partial response with a nil error?

We should consider a partial response expected and valid, IMO. In our code we typically check for a non-nil error first and assume the return value is not populated when the error is non-nil, so we should follow the same style here for consistency.

Elvis339 (Contributor Author): Definitely, thanks for pointing this out.

Elvis339 (Contributor Author):

func (b *BlockFetcherHandler[T]) fetchBlocks(ctx context.Context, request *BlockFetchRequest) ([][]byte, error) {

Comment on lines 47 to 51
response, parseErr := t.marshaler.UnmarshalResponse(responseBytes)
if parseErr != nil {
	// TODO how do we handle this?
	return
}
Collaborator: I know this was already in the codebase within DSMR, but if we hit this, onResponse is never called. This can be separate from this PR if preferred, but to guarantee onResponse is eventually called, I think we probably want to call onResponse with an empty value of type U and the parsing error.

Elvis339 (Contributor Author): Agree, that makes more sense.

Elvis339 (Contributor Author):

onResponse(ctx, nodeID, utils.Zero[U](), parseErr)

Comment on lines +80 to +82
common.SendConfig{
	Validators: 100,
},
Collaborator: Not needed for this PR, but we may want to make this configurable in the future, either by passing a parameter or by including a SendConfig in the typed client instance.

timestamp = minAccepted.GetTimestamp()
}

resultChan := s.blockFetcherClient.FetchBlocks(syncCtx, id, height, timestamp, s.minTimestamp)
Collaborator: What's the rationale for having FetchBlocks spawn a goroutine and return a channel, processing the results here, rather than processing them directly?

Elvis339 (Contributor Author): The rationale for having FetchBlocks return a channel is intuitiveness and simplicity: BlockFetcher fetches blocks, and Syncer adds blocks to the time validity window.

It exposes a read-only channel, which is thread-safe, meaning we could reuse the block-fetching logic elsewhere instead of coupling it to the time validity window.

IMO the name implies the action. Also, adding a time validity window dependency would increase the complexity of the block fetcher, i.e. the type would need to change from:

BlockFetcherClient[B Block]

to:

BlockFetcherClient[T emap.Item, B ExecutionBlock[T]]

because of this method:

Accept(blk ExecutionBlock[T])

defined on TimeValidityWindow[T emap.Item].

This design also positions us well for future changes, as new consumers can independently process the fetched blocks without modifying the fetcher implementation. (See the sketch below.)
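For illustration, the consumer side then reduces to a loop like the following sketch; FetchResult's Block field is an assumption here (only Err appears verbatim in this thread):

// The syncer owns the coupling to the validity window; the fetcher
// only produces results on its read-only channel.
resultChan := s.blockFetcherClient.FetchBlocks(syncCtx, id, height, timestamp, s.minTimestamp)
for result := range resultChan {
	if result.Err != nil {
		return result.Err
	}
	s.timeValidityWindow.Accept(result.Block) // feed the window here, not in the fetcher
}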

lastCheckpoint := c.checkpoint
c.checkpointLock.RUnlock()

if lastCheckpoint.timestamp <= minTimestamp.Load() {
Collaborator: It must be strictly less than the min timestamp, as opposed to <=. Multiple blocks can share the same timestamp, so we have not filled the validity window until we have retrieved the first block whose timestamp is strictly less than the min timestamp.

Elvis339 (Contributor Author):
if lastCheckpoint.timestamp < minTimestamp.Load() {

c.checkpointLock.RUnlock()

for _, raw := range respBlocks {
	block, parseErr := c.parser.ParseBlock(ctx, raw)
Collaborator: Can we rename this to err instead of parseErr? We typically reserve fine-grained error names for when disambiguation between multiple errors is required, but I don't think that's necessary when we immediately handle the error, as we do here.

Elvis339 (Contributor Author): I wrapped the error here:

resultChan <- FetchResult[B]{Err: fmt.Errorf("%w: %v", parseErr, errInvalidBlock)}

c.checkpointLock.Lock()
c.checkpoint.parentID = block.GetParent()
c.checkpoint.timestamp = block.GetTimestamp()
c.checkpoint.nextHeight = block.GetHeight() - 1
Collaborator: Can we handle an underflow here?
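One possible guard, purely illustrative rather than code from the PR: clamp the backward walk at genesis before decrementing, so nextHeight never wraps around.

// Illustrative only, reusing the identifiers from the snippet above.
height := block.GetHeight()
if height == 0 {
	return nil // genesis reached; the window cannot extend further back
}
c.checkpointLock.Lock()
c.checkpoint.parentID = block.GetParent()
c.checkpoint.timestamp = block.GetTimestamp()
c.checkpoint.nextHeight = height - 1
c.checkpointLock.Unlock()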

Comment on lines +78 to +82
c.checkpoint = checkpoint{
	parentID:   id,
	nextHeight: height,
	timestamp:  timestamp,
}
Collaborator: Can we replace checkpoint with the latest block?

Elvis339 (Contributor Author): Do you mean rename it, or replace the whole checkpoint struct?

elvis.sabanovic and others added 8 commits February 20, 2025 21:02
Refactored tests to simplify type definitions and adjusted boundary block conditions during fetch and validity window expansion. Updated block fetch handler logic to ensure stricter timestamp adherence and improved error handling. Removed unused imports, adjusted comments for clarity, and fixed copyright year discrepancies.
Enhanced error handling by addressing `context.DeadlineExceeded` and refining failure messages to include error details. Introduced `sync.Once` to ensure `resultChan` is closed only once, preventing potential race conditions. Simplified variable naming for readability and maintained checkpoint consistency.
Reordered the cancel call for proper context cleanup and clarified variable names in checkpoint processing for better readability.