feat: replace StreamReader/Writer with FileReader/Writer #1193

Open
wants to merge 40 commits into main

Conversation


@de-sh de-sh commented Feb 17, 2025

Fixes #XXXX.

Description

  1. Replaces StreamReader and StreamWriter with their File-based alternatives, simplifying the code that persists .arrows files in staging.
  2. Enables querying over records on disk (in .arrows files) that haven't been converted to parquet yet.
  3. Generates multiple arrow files when the row limit is breached (see the sketch below).
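
For context, the sketch below (not code from this PR; the file name and schema are invented) shows the file-format FileWriter/FileReader pair from the arrow crate that replaces the stream-based types. The file format writes a footer on finish(), which is what makes staged .arrows files seekable and readable before they are converted to parquet.

use std::{fs::File, sync::Arc};

use arrow::{
    array::{ArrayRef, Int64Array},
    datatypes::{DataType, Field, Schema},
    ipc::{reader::FileReader, writer::FileWriter},
    record_batch::RecordBatch,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("value", DataType::Int64, false)]));
    let column: ArrayRef = Arc::new(Int64Array::from(vec![1_i64, 2, 3]));
    let batch = RecordBatch::try_new(schema.clone(), vec![column])?;

    // The IPC file format (unlike the stream format) writes a footer on
    // `finish()`, so the resulting `.arrows` file supports random access.
    let mut writer = FileWriter::try_new(File::create("demo.arrows")?, &schema)?;
    writer.write(&batch)?;
    writer.finish()?;

    // The same file can be read back, and therefore queried, before it is
    // ever converted to parquet.
    let reader = FileReader::try_new(File::open("demo.arrows")?, None)?;
    for batch in reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}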

This PR has:

  • been tested to ensure log ingestion and log query work.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Summary by CodeRabbit

  • New Features

    • Added reverse-order record reading to enhance data processing.
    • Integrated time partitioning support for improved record organization.
    • Implemented a limit on in-memory record batches to prevent overflow.
  • Bug Fixes

    • Improved error handling with specific errors when the row limit is exceeded.
  • Refactor

    • Streamlined record handling by replacing legacy components.
    • Simplified logic in the buffer management process.
    • Enhanced clarity in method signatures and added documentation.

@de-sh de-sh marked this pull request as draft February 17, 2025 06:26

coderabbitai bot commented Feb 17, 2025

Walkthrough

This pull request introduces a new ReverseReader struct for reading record batches from Arrow IPC files in reverse order. The MergedRecordReader has been updated to utilize ReverseReader, with its constructor renamed and simplified. The DiskWriter replaces StreamWriter for file-based writing, and method signatures have been adjusted to include time partitioning. Additionally, the StagingError enum has been modified to include a new RowLimit variant for better error handling.
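
As a rough illustration of what reverse-order reading means here (this is not the PR's ReverseReader; it simply buffers every batch and reverses the order, which is fine for a sketch but not for large staging files):

use std::fs::File;

use arrow::{ipc::reader::FileReader, record_batch::RecordBatch};

/// Return the record batches of an `.arrows` (IPC file format) file newest-first.
fn read_reversed(path: &str) -> Result<Vec<RecordBatch>, Box<dyn std::error::Error>> {
    let reader = FileReader::try_new(File::open(path)?, None)?;
    let mut batches = reader.collect::<Result<Vec<_>, _>>()?;
    batches.reverse();
    Ok(batches)
}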

Changes

File(s) and change summary:

  • src/parseable/staging/reader.rs - Introduced ReverseReader for reverse reading; updated MergedRecordReader to use ReverseReader, renamed constructor from try_new to new, and removed MergedReverseRecordReader.
  • src/parseable/staging/writer.rs - Replaced StreamWriter with FileWriter in DiskWriter, updating initialization logic accordingly.
  • src/parseable/streams.rs - Modified recordbatches_cloned to accept a time_partition parameter; added constant MAX_RECORD_BATCH_SIZE for row limits.
  • src/query/stream_schema_provider.rs - Updated the recordbatches_cloned invocation to include the time_partition parameter in get_staging_execution_plan.
  • src/parseable/staging/mod.rs - Updated the StagingError enum by removing the Metadata variant and adding a RowLimit(usize) variant for row limit errors.

Possibly related PRs

  • feat: drop use of MergeReverseRecordReader #1213: The changes in the main PR are directly related to the removal of the MergedReverseRecordReader struct and its associated methods, which is also a key focus of the retrieved PR. Both PRs modify the MergedRecordReader to simplify its usage and eliminate reverse reading logic.

Suggested labels

for next release

Suggested reviewers

  • nikhilsinhaparseable

Poem

I'm a rabbit, hopping through code with glee,
New readers and writers set our data free.
Reverse paths chart a curious new way,
Disk writes and time partitions brighten the day.
CodeRabbit cheers these changes in playful display!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51eabca and db5c90d.

📒 Files selected for processing (2)
  • src/parseable/streams.rs (8 hunks)
  • src/query/stream_schema_provider.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/query/stream_schema_provider.rs
🔇 Additional comments (6)
src/parseable/streams.rs (6)

70-71: Good addition of a record batch size limit.

Defining a constant for the maximum record batch size helps prevent memory-related issues and ensures optimal I/O performance. The comment clearly explains the purpose of this limit.


105-106: Documentation clarification is helpful.

Adding explicit documentation about the row limit for the writer provides important context for developers working with this code.


139-142: Effective error handling for oversized record batches.

The check for record batch size is a good defensive programming practice. It prevents processing excessively large batches that could cause performance issues.
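
A minimal sketch of this kind of check, assuming the MAX_RECORD_BATCH_SIZE constant and the StagingError::RowLimit variant listed in the change summary; the actual comparison in src/parseable/streams.rs may differ:

use arrow::record_batch::RecordBatch;

/// Cap on rows accepted in a single batch (the value mirrors the PR, but is
/// illustrative here).
const MAX_RECORD_BATCH_SIZE: usize = 16384;

#[derive(Debug, thiserror::Error)]
enum StagingError {
    #[error("Too many rows: {0}")]
    RowLimit(usize),
}

/// Reject oversized batches up front instead of letting them reach the
/// in-memory buffers and disk writers.
fn check_batch(rb: &RecordBatch) -> Result<(), StagingError> {
    if rb.num_rows() > MAX_RECORD_BATCH_SIZE {
        return Err(StagingError::RowLimit(MAX_RECORD_BATCH_SIZE));
    }
    Ok(())
}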


324-324: Type annotation improves clarity.

The explicit type annotation for time_partition makes the code more readable and self-documenting.


370-388: Significant improvement to record batch retrieval.

This rewritten method properly handles both in-memory records and records from arrow files. The implementation now correctly returns in-memory records even when no disk readers exist, fixing a potential data loss issue that was previously flagged in earlier reviews.

The addition of the time_partition parameter allows for more granular filtering of records, which aligns with the PR objectives of improving querying over records stored on disk.
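
Conceptually, the retrieval path now looks something like the sketch below, which merges buffered batches with batches read back from staged .arrows files; the function and parameter names are placeholders, not the actual recordbatches_cloned signature:

use std::{fs::File, path::Path};

use arrow::{ipc::reader::FileReader, record_batch::RecordBatch};

/// Combine batches still buffered in memory with batches already flushed to
/// staged `.arrows` files, so queries see both.
fn staged_batches(
    in_memory: &[RecordBatch],
    arrow_files: &[&Path],
) -> Result<Vec<RecordBatch>, Box<dyn std::error::Error>> {
    let mut out: Vec<RecordBatch> = in_memory.to_vec();
    for path in arrow_files {
        let reader = FileReader::try_new(File::open(path)?, None)?;
        for batch in reader {
            out.push(batch?);
        }
    }
    Ok(out)
}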


496-496: Consistent use of the new reader implementation.

Using MergedRecordReader here aligns with the PR's objective of replacing StreamReader/Writer with their File alternatives, making the code more consistent.


@de-sh de-sh marked this pull request as ready for review February 21, 2025 06:39

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (8)
src/parseable/staging/writer.rs (3)

39-48: Solid introduction of DiskWriter struct.
The design nicely encapsulates file handling in a dedicated struct. However, consider documenting concurrency expectations (e.g., ensuring a single thread accesses DiskWriter at a time).


91-109: finish method is well-structured.
Renaming the part file to a completed file is clear. Might want to consider behavior on rename failures (e.g., partial cleanup or rollback).
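
A minimal sketch of the rename-on-finish idea, assuming a .part extension for in-progress files; the PR's actual DiskWriter::finish and extension constants may differ:

use std::{fs, io, path::{Path, PathBuf}};

/// Promote an in-progress `data.part` file to its final `data.arrows` name
/// once all batches are flushed.
fn promote_part_file(part_path: &Path) -> io::Result<PathBuf> {
    let final_path = part_path.with_extension("arrows");
    // On rename failure the `.part` file stays in place, so a later sync
    // pass (or manual cleanup) can still recover the data.
    fs::rename(part_path, &final_path)?;
    Ok(final_path)
}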


112-114: Generic Writer referencing DiskWriter.
This integrates memory and disk writers nicely. For larger deployments, consider making the default row limit configurable or reading it from a central config.

src/parseable/staging/reader.rs (2)

131-131: Test imports updated to reflect file-based I/O.
Looks good. Always ensure an accompanying test for potential edge cases (e.g. partial read failures).

Also applies to: 136-137


155-163: Helper write_file for tests.
Straightforward approach for writing test Arrow files. Possibly add error context or logs if creation fails.

src/parseable/streams.rs (2)

79-79: Hard-coded size for Writer<16384>.
Consider making 16384 a named constant or configuration option to make it clearer and easier to adjust in the future.
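
One way to address this, shown purely as a sketch (the Writer stub and the constant name are invented for illustration):

/// Stand-in for the PR's `Writer<const N: usize>`, included only so the
/// alias below compiles.
struct Writer<const N: usize> {
    rows_written: usize,
}

/// Naming the 16384 cap makes the const generic self-explanatory at call
/// sites and gives a single place to tune it later.
const MAX_ROWS_PER_STAGED_FILE: usize = 16384;

type StagedWriter = Writer<MAX_ROWS_PER_STAGED_FILE>;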


373-387: Flushing DiskWriters is appropriate.
Warn logs are correct for failures, and debug logs confirm success. Potentially add a final error if any disk writer fails, or continue as-is if partial successes are allowed.

src/query/stream_schema_provider.rs (1)

236-236: LGTM! Consider adding documentation for time partitioning.

The addition of time partition information to recordbatches_cloned is a good enhancement that aligns with the PR's objectives.

Consider adding a code comment explaining how time partitioning affects the record batch cloning process.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c87b93c and 0e29387.

📒 Files selected for processing (5)
  • src/parseable/mod.rs (1 hunks)
  • src/parseable/staging/reader.rs (7 hunks)
  • src/parseable/staging/writer.rs (1 hunks)
  • src/parseable/streams.rs (13 hunks)
  • src/query/stream_schema_provider.rs (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/parseable/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (11)
src/parseable/staging/writer.rs (2)

22-24: Imports look good!
No significant concerns. These imports properly bring in file operations, buffering, and Arrow’s FileWriter.

Also applies to: 29-29, 34-34, 37-38


50-68: Constructor handles part-file creation gracefully.
Error handling is straightforward; consider propagating additional context to aid future debugging.

src/parseable/staging/reader.rs (3)

22-22: Replaced StreamReader with FileReader.
No issues: switching to file-based reading appears straightforward.

Also applies to: 28-28


40-40: Simplifying MergedRecordReader::new can mask partial file read errors.
If one or more files fail to open, they are silently skipped. This might be intentional, but consider returning a Result or warning about partial successes, so upstream code can handle it.

Also applies to: 44-44, 54-54
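
If partial failures should at least be visible, a sketch along these lines (the names are illustrative, not the actual constructor) would log and skip unreadable files rather than dropping them silently:

use std::{fs::File, path::PathBuf};

use arrow::ipc::reader::FileReader;
use tracing::warn;

/// Open every staged `.arrows` file, logging and skipping the ones that fail.
fn open_readers(paths: &[PathBuf]) -> Vec<FileReader<File>> {
    let mut readers = Vec::new();
    for path in paths {
        let file = match File::open(path) {
            Ok(file) => file,
            Err(err) => {
                warn!("Skipping unreadable arrow file {path:?}: {err}");
                continue;
            }
        };
        match FileReader::try_new(file, None) {
            Ok(reader) => readers.push(reader),
            Err(err) => warn!("Skipping corrupt arrow file {path:?}: {err}"),
        }
    }
    readers
}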


167-172: Test coverage for various row scenarios.
Kudos for checking empty, single-row, and multi-row records. The logic ensures FileReader is working as expected. No pressing issues here.

Also applies to: 180-184, 193-224

src/parseable/streams.rs (5)

43-43: Importing MergedRecordReader, DiskWriter, and additional tracing levels.
These changes align with the new file-based approach. Tracing additions are helpful for debugging.

Also applies to: 60-64


120-129: Properly creating new DiskWriter.
Logical approach for on-demand creation of disk writers by schema_key. Check concurrency if multiple schema keys may be created simultaneously.
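
The on-demand pattern being described is roughly the HashMap::entry idiom sketched below; SketchWriter stands in for DiskWriter and the map key for schema_key:

use std::collections::HashMap;

/// Stand-in for the PR's `DiskWriter`, only here so the sketch compiles.
struct SketchWriter;

/// Create a writer the first time a schema key is seen, reuse it afterwards.
fn writer_for<'a>(
    writers: &'a mut HashMap<String, SketchWriter>,
    schema_key: &str,
) -> &'a mut SketchWriter {
    writers
        .entry(schema_key.to_owned())
        .or_insert_with(|| SketchWriter)
}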


140-140: Renamed path_by_current_time → path_prefix_by_current_time.
This clarifies that the returned path is only a prefix. Nicely done.


173-176: Filtering for .arrows extension is correct.
No immediate concerns. Please confirm that edge cases (e.g., hidden/system files) won't affect the logic.


536-537: Use of MergedRecordReader::new to build updated schema.
Similar partial-read caveats apply; consider returning any encountered errors.

src/query/stream_schema_provider.rs (1)

222-272: Well-structured integration of time partitioning.

The time partition parameter is consistently used throughout the execution plan generation process, with proper error handling and integration with both arrow and parquet file processing.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
src/parseable/staging/writer.rs (1)

72-90: ⚠️ Potential issue

Fix missing count incrementation in write() method.

When self.count + rb.num_rows() >= N, the method writes the left slice without updating self.count, causing the row limit logic to skip subsequent rotations.

Apply this fix:

            let left_slice = rb.slice(0, left);
            self.inner.write(&left_slice)?;
+           self.count += left_slice.num_rows();
            self.finish()?;
🧹 Nitpick comments (3)
src/parseable/staging/writer.rs (3)

39-48: Document the const parameter N.

The struct uses a generic const parameter N for row limits, but its purpose and constraints are not documented.

Add documentation for the const parameter:

 /// Context regarding `.arrows` file being persisted onto disk
+/// 
+/// # Type Parameters
+/// 
+/// * `N` - Maximum number of rows allowed in a single Arrow file before rotation
 pub struct DiskWriter<const N: usize> {

112-116: Document the Writer struct and its const parameter.

The struct lacks documentation explaining its purpose and the significance of the const parameter N.

Add documentation:

 #[derive(Default)]
+/// Manages both in-memory and on-disk storage of Arrow record batches
+/// 
+/// # Type Parameters
+/// 
+/// * `N` - Maximum number of rows allowed in a single batch before rotation,
+///         applies to both memory buffers and disk files
 pub struct Writer<const N: usize> {

72-90: Consider extracting common batch slicing logic.

The row limit and batch slicing logic is duplicated between DiskWriter::write() and MutableBuffer::push(). Consider extracting this into a shared trait or utility function.

Example implementation:

trait BatchSlicer {
    fn slice_at_limit<const N: usize>(current_count: usize, rb: &RecordBatch) -> Option<(RecordBatch, RecordBatch)> {
        if current_count + rb.num_rows() >= N {
            let left = N - current_count;
            let left_slice = rb.slice(0, left);
            let right_slice = if left < rb.num_rows() {
                Some(rb.slice(left, rb.num_rows() - left))
            } else {
                None
            };
            right_slice.map(|right| (left_slice, right))
        } else {
            None
        }
    }
}

impl<const N: usize> BatchSlicer for DiskWriter<N> {}
impl<const N: usize> BatchSlicer for MutableBuffer<N> {}

Also applies to: 188-212

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0e29387 and 6fc59d5.

📒 Files selected for processing (1)
  • src/parseable/staging/writer.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (1)
src/parseable/staging/writer.rs (1)

20-38: LGTM! Imports and constants usage align with file-based operations.

The changes appropriately introduce file system and buffered I/O imports, replacing stream-based operations with file-based alternatives. The use of constants for file extensions promotes consistency.

coderabbitai[bot]
coderabbitai bot previously approved these changes Feb 21, 2025
coderabbitai[bot]
coderabbitai bot previously approved these changes Feb 21, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
src/parseable/staging/writer.rs (1)

70-88: ⚠️ Potential issue

Fix missing row count increment in the if branch.

When self.count + rb.num_rows() >= N, the method writes the left slice without updating self.count, causing the row limit logic to skip subsequent rotations.

Apply this fix:

            let left = N - self.count;
            let left_slice = rb.slice(0, left);
            self.inner.write(&left_slice)?;
+           self.count += left;
            self.finish()?;
🧹 Nitpick comments (3)
src/parseable/streams.rs (3)

79-79: Document the significance of 16384.

The magic number 16384 used for the Writer's generic parameter should be documented or defined as a named constant to explain its significance.


141-164: Add documentation for path prefix generation.

The function would benefit from documentation explaining:

  • Why it was renamed from path_by_current_time
  • The format of the generated path prefix
  • The significance of each component in the path

383-389: Enhance error handling with more context.

While the error logging is good, it could be improved by:

  • Including the file path in the warning message
  • Adding error details to the debug message for better debugging

Apply this diff:

-                warn!("Couldn't finish `.arrows` file: {err}");
+                warn!("Couldn't finish `.arrows` file at {:?}: {err}", writer.path_prefix);
-                debug!("Finished `.arrows` file sync onto disk")
+                debug!("Finished `.arrows` file sync onto disk at {:?}", writer.path_prefix)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 240eb99 and acb5ede.

📒 Files selected for processing (2)
  • src/parseable/staging/writer.rs (1 hunks)
  • src/parseable/streams.rs (13 hunks)
🧰 Additional context used
🪛 GitHub Actions: Lint, Test and Coverage Report
src/parseable/staging/writer.rs

[error] 118-118: this expression creates a reference which is immediately dereferenced by the compiler. Change this to: self.inner.schema().

⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (4)
src/parseable/staging/writer.rs (3)

20-36: LGTM! Appropriate imports for file handling.

The imports are well-organized and include all necessary components for file I/O operations.


38-47: LGTM! Well-structured and documented struct.

The DiskWriter struct is well-designed with clear documentation for each field.


125-129: LGTM! Clean transition to generic Writer with DiskWriter.

The Writer struct is well-adapted to use the new DiskWriter with a configurable row limit.

src/parseable/streams.rs (1)

347-367: LGTM! Improved record batch handling.

The changes correctly:

  • Add time partition support
  • Fix the issue with discarding memory batches when no disk readers exist
  • Combine records from both memory and .arrows files

coderabbitai[bot]
coderabbitai bot previously approved these changes Feb 21, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
src/parseable/staging/writer.rs (1)

63-67: ⚠️ Potential issue

Add row count tracking for proper file rotation.

The write method should track the number of rows written to ensure proper file rotation.

     pub fn write(&mut self, rb: &RecordBatch) -> Result<(), StagingError> {
         self.inner.write(rb)?;
+        self.count += rb.num_rows();
 
         Ok(())
     }
🧹 Nitpick comments (1)
src/parseable/staging/mod.rs (1)

33-34: LGTM! Consider adding documentation.

The new RowLimit error variant aligns well with the PR's objective of handling multiple arrow files when row limits are breached. The error message is clear and includes the actual limit value.

Consider adding documentation to explain:

  • The purpose of this error variant
  • When it's triggered
  • How it relates to the multiple arrow files feature
  • What users should do when they encounter this error
 #[derive(Debug, thiserror::Error)]
 pub enum StagingError {
+    /// Thrown when the number of rows in an arrow file exceeds the configured limit.
+    /// This triggers the creation of a new arrow file to handle the overflow.
+    /// The usize parameter represents the configured row limit that was exceeded.
     #[error("Too many rows: {0}")]
     RowLimit(usize),
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c10041 and 913979a.

📒 Files selected for processing (3)
  • src/parseable/staging/mod.rs (1 hunks)
  • src/parseable/staging/writer.rs (3 hunks)
  • src/parseable/streams.rs (19 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/parseable/streams.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: coverage
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
🔇 Additional comments (3)
src/parseable/staging/writer.rs (3)

20-34: LGTM! The imports and constants align with the PR objectives.

The new imports and constants support the transition from stream-based to file-based operations for managing Arrow files.


163-166: LGTM! The record batch slicing logic is improved.

The changes correctly handle the remaining portion of the batch when it exceeds the buffer limit.


176-180: LGTM! The Writer struct changes align with the PR objectives.

The changes successfully:

  • Make the row limit configurable using const generics
  • Replace stream-based operations with file-based operations

@de-sh de-sh marked this pull request as draft February 24, 2025 05:51
@de-sh de-sh marked this pull request as ready for review March 14, 2025 19:44

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
src/parseable/staging/writer.rs (1)

62-66: ⚠️ Potential issue

Missing count incrementation causes improper file rotation.

When writing a record batch, the method doesn't update a count to track the number of rows. This will likely impact the row limit logic implemented elsewhere.

You need to track the number of rows written to properly implement file rotation when row limits are reached.

🧹 Nitpick comments (1)
src/parseable/staging/reader.rs (1)

297-297: Outdated error message in test code.

The error message still references StreamWriter even though the code now uses FileWriter.

-        let mut writer = FileWriter::try_new(file, schema).expect("Failed to create StreamWriter");
+        let mut writer = FileWriter::try_new(file, schema).expect("Failed to create FileWriter");
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 913979a and d0159f1.

📒 Files selected for processing (4)
  • src/parseable/staging/reader.rs (7 hunks)
  • src/parseable/staging/writer.rs (3 hunks)
  • src/parseable/streams.rs (17 hunks)
  • src/query/stream_schema_provider.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/query/stream_schema_provider.rs
  • src/parseable/streams.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (4)
src/parseable/staging/writer.rs (3)

39-81: Excellent implementation of DiskWriter for better file handling.

The new DiskWriter struct provides a cleaner, file-based approach for persisting Arrow files to disk. It follows good practices with the use of part files for in-progress writes and atomic rename operations when finishing. Your implementation also properly propagates errors using the ? operator, addressing a previous review comment.


175-179: Good design choice making Writer generic over row size.

Making the Writer struct generic over the constant N allows for compile-time configuration of row limits, which is more efficient than runtime checking. The switch from StreamWriter to DiskWriter in the disk field properly implements the PR's objective.


162-165: Correct handling of partial record batches.

This implementation correctly handles the case where only part of a record batch fits within the buffer size limit by slicing the batch and pushing the remainder to the inner buffer.

src/parseable/staging/reader.rs (1)

325-325: Tests properly updated to use the new MergedRecordReader API.

The test cases have been correctly updated to use the new new() constructor and include the None parameter for time_partition, which aligns with the changes in the method signature.

Also applies to: 385-385

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 14, 2025
coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 16, 2025
coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 16, 2025

de-sh commented Mar 16, 2025

The main benefit of this PR is maintainability; I have verified that its behavior is in line with our expectations.

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 16, 2025