-
Notifications
You must be signed in to change notification settings - Fork 468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: modify LogExporter and TraceExporter interfaces to support returning failure #2381
base: main
Are you sure you want to change the base?
chore: modify LogExporter and TraceExporter interfaces to support returning failure #2381
Conversation
@lalitb 👋 Here's a variant where i've simply propgated the errors through, like |
I believe it's acceptable to introduce this as a BREAKING change at this stage, especially since we anticipate the SDK to undergo more breaking changes, at least until the next release. The CHANGELOG can include migration steps to guide users on handling the return status. Initially, I considered deprecating the existing |
33cb755
to
2df2353
Compare
@cijothomas & @lalitb I think this should be good to go |
@@ -10,6 +10,9 @@ use std::time::SystemTime; | |||
/// Describes the result of an export. | |||
pub type ExportResult = Result<(), TraceError>; | |||
|
|||
/// Describes the results of other operations on the trace API. | |||
pub type TraceResult<T> = Result<T, TraceError>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to ensure consistency here. LogExporter uses same LogResult for export and shutdown both, but TraceExporter uses different Results.
Is this something we can fix too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use a type alias so that export
is marked as returning TraceResult<()>
, but the existing callers (all 38 of them in our codebase) don't have to change.
I've pushed a change that works this way. I can also just pull of the band-aid and fix all our existing internal usages to use TraceResult
directly.
As we are looking at consistency here, another thing that jumps out is that we're using TraceError
(which i've introduced TraceResult
to match) , despite the trait itself being SpanExporter
. Better would be SpanError
and SpanResult
. To give an indication of scale, we have 72 refs in this codebase to TraceError
alone (plus external usages of course).
My suggestion is:
- Leave
TraceError
alone - Change our internal usages of
ExportResult
toTraceResult<()>
(not yet done, but easy enough) - Leave the
ExportResult
alias for any external usages but mark it as deprecated
Thoughts @cijothomas ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll need some time to think through this. I am unsure if the current state is what we finally want. I am thinking that we should have very specific Enums for each Result returned to user.
If shutdown() is called on an already shutdown provider, then the Result should an Error variant, but a specific one like Error::ShutdownAlreadyInvoked, so users know exactly what failed and why instead of relying on "Other("something went wrong")"
This will require more refactoring, but something we should be going before Stable.
The "Other" should be truly for "others" - i.e something went wrong that is not accounted for, and should be rarely used, if not possible to avoid entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another quick thought - validate if it is considered breaking to expand the enum with new variants. Do we need to provide Into/From for std Error and that is sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok - so we're happy to take some breaking changes now and make things clean!
I think you'd typically see error-type-per-function, with explicitly mapped errors where it makes sense and thiserror
mapped ones where the caller should handle the passed-through error (e.g., network issues).
I think this would be pretty idiomatic (note - errors and logic for illustration purposes only):
use thiserror::Error;
#[derive(Error, Debug)]
pub enum ExportError {
#[error("Socket IO Error")]
SocketError(#[from] io::Error),
#[error("Exporter already shut down")]
ExporterShutdown,
}
pub trait SpanExporter {
async fn export(&mut self, batch: Vec<SpanData>) -> Result<(), ExportError> {
// explicit error
if already_shutdown {
Err(ExportError::ExporterShutdown)
}
else {
// by way of example, imagine this may return an IO error
// thiserror has setup the error mapping for us, so we can just use ?
let ack = my_socket.send()?;
Ok(())
}
}
async fn shutdown(&mut self) -> Result<(), ShutdownError>;
}
I don't think the extra typedefs add much value.
Which is to say, I think we are on the right path? As @lalitb points out, we're already using thiserror
; we'd just want to make sure we generally have error-per-func, rather than error-per-module. This is important for the caller, as they will be able to write reasonable match
es over the top of the result to handle errors, and not be forced to handle errors not present in the particular function.
If we add another variant, exhaustive matches will break, so I think it'd be a breaking change.
What's harder to think about for me is - what about custom implementations of these interfaces? The error types are to a large extent a function of the concrete implementation .... maybe this is something that pushes more in the direction of higher level errors only (AlreadyShutdown
) rather than mapped-through errors (io::ReadFailed
or whatever) ?
That way we roll things up into a "logical language of failure" and each impl has to map back their arbitrary error domain to these.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I understood this fully, but the code already uses thiserror
nicely to get the error status from exporters, with below skeleton. Are you proposing to change this in better way?
#[derive(thiserror::Error, Debug)]
#[non_exhaustive]
pub enum TraceError {
#[error("Exporter {0} encountered the following error(s): {name}", name = .0.exporter_name())]
ExportFailed(Box<dyn ExportError>),
// Rest of variants
}
/// Trait for errors returned by exporters
pub trait ExportError: std::error::Error + Send + Sync + 'static {
/// The name of exporter that returned this error
fn exporter_name(&self) -> &'static str;
}
// and then in OTLP crate
/// Wrap type for errors from this crate.
#[derive(thiserror::Error, Debug)]
pub enum Error {
// OTLP specific error variants
}
impl opentelemetry::trace::ExportError for Error {
fn exporter_name(&self) -> &'static str {
"otlp"
}
}
``
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, so it does. i'll fix my suggestion so it doesn't include adding thiserror as its already there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// explicit error if already_shutdown { Err(ExportError::ExporterShutdown) }
^ I actually meant to suggest we use explicit ones like this, instead of "Other("Shutdown. already invoked"), consistently throughout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cijothomas @lalitb I think this makes sense.
Are we happy for me to go through and apply this? It will touch a fair bit so I want to make sure before doing it! Would look like:
shutdown(...) -> Result<(), ShutdownError>
export(...) -> Result<(), ExportError>
and an extension of the error types following the explicit-named-errors-style established already in LogError
, e.g.:
pub enum LogError {
/// Export failed with the error returned by the exporter.
#[error("Exporter {0} encountered the following errors: {name}", name = .0.exporter_name())]
ExportFailed(Box<dyn ExportError>),
/// Export failed to finish after certain period and processor stopped the export.
#[error("Exporter timed out after {} seconds", .0.as_secs())]
ExportTimedOut(Duration),
/// Processor is already shutdown
#[error("{0} already shutdown")]
AlreadyShutdown(String),
/// Mutex lock poisoning
#[error("mutex lock poisioning for {0}")]
MutexPoisoned(String),
opentelemetry-sdk/CHANGELOG.md
Outdated
@@ -10,6 +10,8 @@ | |||
[#2338](https://github.com/open-telemetry/opentelemetry-rust/pull/2338) | |||
- `ResourceDetector.detect()` no longer supports timeout option. | |||
- `opentelemetry::global::shutdown_tracer_provider()` Removed from the API, should now use `tracer_provider.shutdown()` see [#2369](https://github.com/open-telemetry/opentelemetry-rust/pull/2369) for a migration example. "Tracer provider" is cheaply cloneable, so users are encouraged to set a clone of it as the global (ex: `global::set_tracer_provider(provider.clone()))`, so that instrumentations and other components can obtain tracers from `global::tracer()`. The tracer_provider must be kept around to call shutdown on it at the end of application (ex: `tracer_provider.shutdown()`) | |||
- The trait functions `LogExporter.shutdown` and `TraceExporter.shutdown` now explicitly return a result. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could we add something for the custom authors as well?
If you are an exporter author, shutdown now requires returning a result. Also, implementing shutdown is optional, as trait already provides a default implementation that returns Ok().
^ something like this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on your comment further down ("The exporter() shutdown() is only called from the SDK, so end user wouldn't have to deal with the return type from it") I think we only need to address the exporter authors, right? Have changed comment accordingly.
Is this going to affect end users or only affecting the SDK itself (and custom exporter authors)? The exporter() shutdown() is only called from the SDK, so end user wouldn't have to deal with the return type from it, right? |
opentelemetry-otlp/CHANGELOG.md
Outdated
@@ -2,6 +2,9 @@ | |||
|
|||
## vNext | |||
|
|||
- `OtlpHttpClient.shutdown` `TonicLogsClient.shutdown`, and `TonicTracesClient.shutdown` now explicitly return a result. The | |||
semantics of the method have not changed, but you will have a new lint encouraging you to consume these results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the 2nd sentence maybe omitted as these are anyway expected to be called from upstream sdk.
…rning failure Address review comments
f1be822
to
5b243a6
Compare
@cijothomas i've separated out ShutdownResult for both trace and log as discussed. This was more fiddly than I expected! |
opentelemetry-sdk/src/logs/error.rs
Outdated
pub enum ShutdownError { | ||
/// Mutex lock poisoning | ||
#[error("mutex lock poisioning for {0}")] | ||
MutexPoisoned(String), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unsure if we should make these public.. There are strictly internal details, and a mutex poison is not user actionable at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These were public before in the old type; but I agree with you!
I can hide it behind the GenericError or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if I am missing some discussion - why do we need to have new error enum for shutdown. Can't we reuse the LogError
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we keep one uber-error and the caller wants to handle it, they are forced to handle variants that have nothing to do with the operation they are calling
#[error("Shutdown timed out after {0:?}")] | ||
Timeout(Duration), | ||
|
||
/// The export client failed while holding the client lock. It is not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cijothomas I reckon this is a reasonable "API user summary" of what's happened with the mutex poisoning, and advice on dealing with it
opentelemetry-sdk/src/logs/error.rs
Outdated
|
||
impl<T> From<PoisonError<T>> for ShutdownError { | ||
fn from(err: PoisonError<T>) -> Self { | ||
ShutdownError::ClientFailed(format!("Mutex poisoned during shutdown: {}", err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think wrapping up the formatted inner error is plenty of context here given the narrow failure mode. I also considered wrapping up Box<Err>
, but I think this is 1/ confusing in the interface and 2/ not at all actionable.
There's some good stuff on canonical's rust best practices doc on this stuff
@@ -149,14 +149,10 @@ impl SpanProcessor for SimpleSpanProcessor { | |||
} | |||
|
|||
fn shutdown(&self) -> TraceResult<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should (in another issue!) chase this distinction between TraceError
and ShutdownError
further down the rabbit hole. For the same reasons I don't think it makes sense to use an über-error in the exporter trait
#[derive(Error, Debug)] | ||
#[non_exhaustive] | ||
pub enum ShutdownError { | ||
/// The exporter has already been shut down. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still unsure of this....I can understand that a Provider's shutdown method, if invoked more than once, can return AlreadyShutdown error. How would an Exporter's shutdown be invoked more than once, since provider protects the processor/exporter from entering that situation?
I am not even sure if processor/exporter should even have a variable like is_shutdown
at all.. If its shutdown is invoked, it'll kill the transports etc. any further attempt to export will result in failure naturally as transports are killed.
(Sorry not really part of this PR, but this is something that I need some time to think through....)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not even sure if processor/exporter should even have a variable like is_shutdown at all.. If its shutdown is invoked, it'll kill the transports etc. any further attempt to export will result in failure naturally as transports are killed.
I feel processor should have this flag. It can't assume how the exporters are implemented. Maybe exporter's shutdown methods are just no-op, and in that case the export may continue happening even after shutdown is invoked, if the processor doesn't have is_shutdown
safeguard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the need for processor to have the flag? If shutdown was already signaled, then no further shutdown signal will ever arrive at the LogProcessor, as the Provider takes care of that, and provider own Processors.
How would log processor ever get a 2nd shutdown call? Or how does it ever get an emit() call after shutdown is done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would log processor ever get a 2nd shutdown call? Or how does it ever get an emit() call after shutdown is done?
Yes correct, is_shutdown at provider level will take care of that. I missed that. However, logger::emit() as of now doesn't check for the provider flag, which needs to be fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raised a PR to discuss this further - #2462
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like it'd be cool if shutdown was idempotent - you call it, it shuts down, you call it again, it no-ops.
you call it again, it no-ops.
This maybe okay. But if a user is calling shutdown more than once, it indicates there have an bug in the way they orchestrate telemetry pipelines, so isn't it best to let them know about it...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is logging it sufficient ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its debatable I guess....We can technically make almost every return type go away in favor of logging. So need to find a balance where do we resort to logging vs explicit-return-type.
Related: Otel .NET decided to specialy handle emit logs after shutdown, by printing these nice message only in StdOut exporter:
https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry.Exporter.Console/ConsoleLogRecordExporter.cs#L39-L42
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reckon there’s a clear line there between errors that actually reflect some failure of the system, and errors that have no impact and are just hinting that the caller has done something funny.
By way of example a flush that fails become the connections gone is very different to “probably you didn’t mean to call this twice”.
That stdout error one is interesting. Is there a story there ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also and regardless - it feels like something we should have a consistent view on - when to error, when not to - that we can point at apply uniformly. That consistency is maybe even more important than what the policy itself is 🤷♂️
@@ -110,10 +110,6 @@ pub struct SpanData { | |||
#[derive(Error, Debug)] | |||
#[non_exhaustive] | |||
pub enum ShutdownError { | |||
/// The exporter has already been shut down. | |||
#[error("Shutdown already performed")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this as discussed
@@ -148,8 +148,17 @@ impl InMemoryLogExporter { | |||
/// ``` | |||
/// | |||
pub fn get_emitted_logs(&self) -> LogResult<Vec<LogDataWithResource>> { | |||
let logs_guard = self.logs.lock().map_err(LogError::from)?; | |||
let resource_guard = self.resource.lock().map_err(LogError::from)?; | |||
let logs_guard = self.logs.lock().map_err(|_| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mutex poisoned errors here are an implementation detail of InMemoryLogExporter
- that is, they have nothing to do with the common shape of a LogExporter
.
So, I removed the automatic From
for poisoned on the LogError
type and explicitly map detailed errors through here.
@cijothomas @lalitb I've established what I think is an appropriate shape for the error handling in the I've tried to do this with as few flow on effects into the rest of the code, because I think agreeing on the shape of the interface is the most important, 2/ there's so much flux in this area merge conflicts are a pain and 3/ I'm still not sure we agree so I don't want to burn time. IMHO we should agree on the shape, merge it, and then smooth out the flow-on effects in subsequent PRs. I will rebase and deal with merge conflicts again once we are all happy! What do you both think? |
@scottgerring Could Maybe for now, move it to |
Good morning @lalitb 👋 ! I had a play with this just now. Good idea! I like it; it keeps it DRY, we've already got precedent with the common I extended Relevant bits in the last commit |
@scottgerring - And the key benefit of introducing a dedicated |
Ciao @lalitb welcome back! Chat soon, Scott |
Public API Impact
I think the next step after this PR is to do a proper RFC on how we want to expose aggregate errors on public APIs to the caller. Let me explain! As you suggested, I've moved the common opentelemetry-rust/opentelemetry-sdk/src/error.rs Lines 40 to 42 in de49068
... We could, if we so chose, simply use that This PR / Errors on 'inner' traitsI think we should treat the big question as orthogonal to the PR at hand - i've tried to make this as surgical as possible to keep this PR focused. For now, I simply log the errors accordingly in the callers, e.g.: opentelemetry-rust/opentelemetry-sdk/src/logs/log_processor_with_async_runtime.rs Lines 191 to 196 in de49068
The immediate impact of this is that we can say 1/ The shape of the |
Fixes #2262
Changes
Please provide a brief description of the changes here.
Merge requirement checklist
CHANGELOG.md
files updated for non-trivial, user-facing changes