feat: batch audit logs and ensure they are not lost #1080
base: main
Conversation
Pull Request Test Coverage Report for Build 13016443859

💛 - Coveralls
…ork is not accepting them, else send to network
LGTM! The implementation looks solid, but I have some questions about the architectural approach:
Audit Log Architecture:
- What drove the decision to use another Parseable instance for audit logs rather than local storage?
- Do we have plans for a dedicated audit logging solution in the future?
Deployment Considerations:
For multi-node setups (e.g., 3 ingest + 1 query nodes):
- What's the recommended log_endpoint configuration?
- Which node should own audit log collection?
- How do we prevent circular logging dependencies?
Performance & Scalability:
- In high-traffic scenarios, how do we handle the additional network load from audit logging?
- Have we considered the impact of inter-service dependencies on system reliability?
Would appreciate your insights on these points to better understand the architectural vision.
```rust
.open(log_file_path)
.await
.expect("Failed to open audit log file");
let buf = serde_json::to_vec(&logs_to_send).expect("Failed to serialize audit logs");
```
It may look like this can never happen, but `unwrap()` can cause a panic in production. It would be nice to have proper error handling with context.
This should not be an issue in most places, as serialization is not a problem for these types. But we still have `expect`s to document why it is safe to assume it won't fail.
```rust
}

// setup the audit log channel
let (audit_log_tx, mut audit_log_rx) = channel(0);
```
Why is the initial size 0?
The idea was to create a "Glienicke Bridge" so that we block at the request level every time the underlying audit logger is unable to keep up.
Thank you for the comments @hippalus. Let me note that audit logging is not to be considered beta yet; it is still in development and will need a lot of work before we can consider it ready for production. With regards to your questions:
- It is an industry practice to send audit logs to an external system; we felt dogfooding would be the least-friction option when it comes to doing this.
- Parseable already fulfills most of this, and we don't see any reason to build with anything different. Please let us know of any alternatives; we can document them and plan out how to build support. To be clear, logs should be sent to an instance outside of the cluster.
- The `log_endpoint` should be a separate cluster/standalone instance of Parseable.
- This should not matter, in my opinion: in a multi-node setup the expectation is that a load balancer makes the decision, and logs end up in the same stream no matter which node receives them.
- Circular logs are not expected, as pointed out above.
- We are only testing the feature, and as of our current learnings, batching and writing to file on network slow-down, as well as pushing to file when the network is down, seemed the right decision for handling high-frequency use cases.
- What do you mean by "inter-service dependencies"?
Thank you for the clarification @de-sh. I highlighted these questions because I couldn't find explicit documentation or configuration descriptions in the code addressing these architectural decisions. The implementation looks quite solid. I'm particularly looking forward to seeing the Kafka connector integration. Regarding inter-service dependencies, as you've well explained, we can scale Parseable instances behind a load balancer for audit logging purposes. It's great to see this pathway being established. IMHO, it would be nice to add more documentation around these architectural decisions directly in the code, making it easier for future contributors to understand the design choices. It would also be nice to add a docker-compose example.
Walkthrough

This pull request refactors the audit logging subsystem. It removes the legacy …

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Main
    participant AuditLogger
    participant RemoteLogSystem
    Main->>AuditLogger: spawn_batcher(shutdown signal)
    AuditLogger->>AuditLogger: batch incoming audit logs
    AuditLogger->>RemoteLogSystem: send_logs_to_remote(batch)
    RemoteLogSystem-->>AuditLogger: response (success/failure)
    AuditLogger->>AuditLogger: flush logs and handle errors
    Main->>AuditLogger: trigger shutdown for graceful termination
```
Actionable comments posted: 4
🧹 Nitpick comments (10)
src/option.rs (1)

138-144: Fix typo in error message and improve parameter validation

The error message has a typo ("pass" instead of "parse") and the function lacks additional validation for extremely large values.

```diff
 pub fn duration(secs: &str) -> Result<Duration, String> {
     let Ok(secs) = secs.parse() else {
-        return Err("Couldn't pass as a number".to_string());
+        return Err("Couldn't parse as a number".to_string());
     };
+    // Optionally validate the value is reasonable
+    if secs > 3600 * 24 * 7 {
+        return Err("Duration too large (exceeds one week in seconds)".to_string());
+    }
+
     Ok(Duration::from_secs(secs))
 }
```

src/audit/mod.rs (1)
19-32: Good structure for the audit module with shared communication channel

The module is well-organized with clear separation of concerns between builder, logger, and types. Using `OnceCell` for the global sender channel is a good practice for thread-safe initialization. However, consider providing a helper function to safely initialize and access the sender:

```rust
/// Initialize the global audit log sender
pub fn init_audit_log_tx(sender: Sender<AuditLog>) -> Result<(), Sender<AuditLog>> {
    AUDIT_LOG_TX.set(sender)
}

/// Get a reference to the global audit log sender if initialized
pub fn audit_log_tx() -> Option<&'static Sender<AuditLog>> {
    AUDIT_LOG_TX.get()
}
```

src/main.rs (2)
36-39: Consider removing or consolidating the old logger initialization code.

The new inline logger setup (lines 36–39) replaces the previously defined `init_logger()` (lines 95–117), which appears unused now. This can introduce confusion or lead to dead code over time.
66-73: Handle possible `oneshot::Sender::send` errors gracefully.

Using `.unwrap()` when sending shutdown signals to `server_shutdown_trigger` and `logger_shutdown_trigger` can panic if either receiver has been dropped or closed. For resilience, replace `.unwrap()` with logic that logs and safely continues if sending fails.

src/audit/types.rs (1)
1-84: Assess potential exposure of sensitive data.

The `ActorDetails` struct includes user and host information, potentially personally identifiable. Ensure that serializing and persisting these fields (e.g., `remote_host`, `username`) complies with privacy and data retention policies.

src/audit/logger.rs (4)
68-70: Avoid panics for directory creation failures.

Using `.expect("Failed to create audit log directory")` can crash the application in production. To handle I/O errors more gracefully, consider returning an error, logging it, or using a fallback path instead.
73-90: Improve error handling for directory entries.

Repetitive `.unwrap()` calls when reading directory entries can lead to panics if any read fails. Consider handling or logging these errors to avoid a crash due to a single malformed file.
104-117: Validate the retry logic for sending logs to remote.

The flush logic only writes logs to disk if there is a backlog or if sending fails. Confirm the approach covers scenarios where intermittent failures occur, and clarify how quickly logs on disk are retried.
200-203: Confirm buffer size for the audit log channel.

Using `channel(0)` enforces synchronous sending. This can block the sender if the consumer is slow. Verify that this unbuffered approach is intended to throttle requests, or whether a small buffer might improve performance.

src/audit/builder.rs (1)
42-42: Consider re-checking channel availability at build time vs. usage time.

Currently, `enabled` is determined once based on `AUDIT_LOG_TX.get().is_some()` at builder creation. If the channel's initialization state changes later, the builder may become outdated. If re-checking is desired for dynamic toggling of logging, consider checking the channel upon each method call, or re-initializing the builder.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- src/audit/builder.rs (3 hunks)
- src/audit/logger.rs (1 hunks)
- src/audit/mod.rs (1 hunks)
- src/audit/types.rs (1 hunks)
- src/cli.rs (2 hunks)
- src/lib.rs (1 hunks)
- src/main.rs (3 hunks)
- src/option.rs (2 hunks)
🧰 Additional context used
🧬 Code Definitions (2)
src/audit/logger.rs (2)
- src/main.rs (2): `oneshot` (42-42), `oneshot` (66-66)
- src/audit/builder.rs (1): `default` (40-61)

src/audit/types.rs (2)
- src/audit/logger.rs (1): `default` (48-98)
- src/audit/builder.rs (1): `default` (40-61)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: coverage
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Kafka aarch64-apple-darwin
🔇 Additional comments (4)
src/lib.rs (1)

55-55: LGTM - New public export for AuditLogger

The addition of `pub use audit::AuditLogger;` makes the audit logger implementation available to external modules, which is necessary for the batch audit logging feature.

src/main.rs (1)
41-43: Ensure the logger object's lifecycle meets your needs.

You are spawning the batcher and then discarding the `AuditLogger` instance without storing it in a variable. Although you set up a static channel, confirm you will not need direct references to the logger afterward.

src/audit/logger.rs (1)
226-230: Kudos on ensuring a graceful shutdown.

Flushing logs before exiting and logging the "shutting down" message helps ensure minimal log loss and provides clarity about shutdown behavior.

src/audit/builder.rs (1)
19-23: Imports appear consistent with usage and project conventions.

No issues found: each import (`Display`, `current`, `StorageMetadata`, `Utc`) is used appropriately within the file.
```rust
#[arg(
    long,
    env = "P_AUDIT_LOG_DIR",
    default_value = "./auditlogs",
    help = "Path for audit log persistence"
)]
pub audit_log_dir: PathBuf,
```
🛠️ Refactor suggestion
Ensure the audit log directory exists
The code specifies a default path for audit logs but doesn't ensure this directory exists when used, unlike the staging directory which has a dedicated method that ensures its creation.
Add a method similar to `staging_dir()` for the audit log directory:

```rust
/// Path to audit log directory, ensures that it exists or panics
pub fn audit_log_dir(&self) -> &PathBuf {
    fs::create_dir_all(&self.audit_log_dir)
        .expect("Should be able to create audit log directory if it doesn't exist");
    &self.audit_log_dir
}
```
```rust
#[arg(
    long,
    env = "P_AUDIT_BATCH_SIZE",
    default_value = "100",
    help = "Audit log batch size"
)]
pub audit_batch_size: usize,

#[arg(
    long,
    env = "P_AUDIT_FLUSH_INTERVAL",
    value_parser = validation::duration,
    help = "Interval to flush into persistence"
)]
pub audit_flush_interval: Duration,

#[arg(
    long,
    env = "P_AUDIT_LOG_DIR",
    default_value = "./auditlogs",
    help = "Path for audit log persistence"
)]
pub audit_log_dir: PathBuf,
```
Add default value for audit_flush_interval
Two of the three new audit configuration options have default values, but `audit_flush_interval` does not. This could lead to runtime errors if it is not explicitly provided by the user.
```diff
 #[arg(
     long,
     env = "P_AUDIT_FLUSH_INTERVAL",
+    default_value = "60",
     value_parser = validation::duration,
     help = "Interval to flush into persistence"
 )]
 pub audit_flush_interval: Duration,
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
#[arg(
    long,
    env = "P_AUDIT_BATCH_SIZE",
    default_value = "100",
    help = "Audit log batch size"
)]
pub audit_batch_size: usize,

#[arg(
    long,
    env = "P_AUDIT_FLUSH_INTERVAL",
    default_value = "60",
    value_parser = validation::duration,
    help = "Interval to flush into persistence"
)]
pub audit_flush_interval: Duration,

#[arg(
    long,
    env = "P_AUDIT_LOG_DIR",
    default_value = "./auditlogs",
    help = "Path for audit log persistence"
)]
pub audit_log_dir: PathBuf,
```
src/audit/builder.rs
Outdated
```rust
// NOTE: we are fine with blocking here as user expects audit logs to be sent at all costs
if let Err(e) = AUDIT_LOG_TX
    .get()
    .expect("Audit logger not initialized")
    .send(audit_log)
    .await
{
    error!("Couldn't send to logger: {e}")
}
```
🛠️ Refactor suggestion
Assess the blocking behavior under heavy load.
By design, this code blocks while awaiting the `send` operation. Under significant load, or if the channel is full, requests may stall. This aligns with a near-synchronous approach but may impact throughput. Consider:
- Using a bounded channel with a fallback strategy to avoid request stalls.
- Logging to disk or an internal buffer if the channel blocks, then retrying asynchronously.
Below is a suggested fallback approach, if you want to avoid fully blocking:
```diff
 if let Err(e) = AUDIT_LOG_TX
     .get()
     .expect("Audit logger not initialized")
-    .send(audit_log)
-    .await
+    .try_send(audit_log) // Non-blocking attempt; returns Err immediately if full
 {
     error!("Couldn't send to logger: {e}")
 }
```
Would you like assistance in implementing a buffered or retry-based logging mechanism?
Committable suggestion skipped: line range outside the PR's diff.
src/audit/builder.rs
Outdated
```rust
use super::{
    ActorDetails, AuditDetails, AuditLog, AuditLogVersion, RequestDetails, ResponseDetails,
    ServerDetails, AUDIT_LOG_TX,
};
```
💡 Verification agent
🧩 Analysis chain
Verify that all imported entities are still needed after refactoring.
The references (`ActorDetails`, `AuditDetails`, `AuditLog`, etc.) are presumably used within this file, but do ensure none of them are left unused after the structural changes and removals in related modules (e.g., the old `AuditLogger` removal).
🏁 Script executed:
```shell
#!/bin/bash
# Description: Search for references of removed or unused structs in the codebase.
fd -e rs --exec rg 'ActorDetails|AuditDetails|AuditLogVersion|...'
```
Length of output: 65605
Audit Import Cleanup: Remove Unused Entities
The grep search did not reveal any further usage of the imported items (`ActorDetails`, `AuditDetails`, `AuditLog`, `AuditLogVersion`, `RequestDetails`, `ResponseDetails`, `ServerDetails`, `AUDIT_LOG_TX`) beyond their inclusion in this file. Please double-check the implementation in `src/audit/builder.rs`; if none of these entities are actually referenced (including potential macro or conditional usages), remove the unused imports to keep the file clean and prevent confusion.

- Verify each imported entity in `src/audit/builder.rs` is actually referenced.
- If an import is no longer needed due to the recent refactoring (e.g., after removing the old `AuditLogger`), please delete it.
Fixes #XXXX.
Description
Collects audit logs into batches, pushes them as configured based on a time/size threshold, and ensures logs are persisted to a file when the network is down. NOTE: works in a near-synchronous manner where logs will not be accepted if the write to disk is blocked or the direct push to the network is blocked; this may lead to a situation where requests do not respond properly while they try to send audit logs, but this is a known cost.
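For reference, a hypothetical invocation wiring up the new configuration knobs. The environment variable names and defaults come from this PR's `cli.rs` diff; the binary name, subcommand, and the flush-interval value are assumptions, not documented usage:

```shell
# Illustrative only: env var names are from this PR's cli.rs diff;
# the `parseable local-store` invocation and values are assumptions.
P_AUDIT_BATCH_SIZE=100 \
P_AUDIT_FLUSH_INTERVAL=60 \
P_AUDIT_LOG_DIR=./auditlogs \
parseable local-store
```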
This PR has:
Summary by CodeRabbit

New Features
- `AuditLogger` for managing and sending logs, including batch processing and improved error handling.

Refactor