Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure to create an index with ingest v2 returns 429 #5719

Closed
wants to merge 5 commits into from

Conversation

rdettai
Copy link
Collaborator

@rdettai rdettai commented Mar 20, 2025

Description of the issue

When using index templates, specifying and index name that matches the pattern but has illegal characters results in a 429 response code instead of a 400.

Description of the problem:

  • the index_id validity is not validated until the metastore is called
  • calling GetOrCreateOpenShards with an invalid index_id on the metastore fails with MetastoreError::JsonDeserializeError
  • a failure to call the metastore is logged but not repported to the ingest workbench
  • because the router is not populated, the workbench fails with "shard not found errors" for the failing index but also other targetted indexes batched with it

Proposed solution

This PR solves the problem in two places:

  • the index id is now validated in quickwit-serve, before calling the ingest router (ES bulk and native APIs)
  • control plane request errors are now recorded to the workbench so that they can properly be surfaced to the ingest requests. Transient control plane / metastore errors are surfaced as 503 (unavailable) and errors that can't be retried as 500 (internal)

How was this PR tested?

Added unit and integration (python) tests

@rdettai rdettai force-pushed the fix-failing-ingest-index-creation-code branch from 2a27562 to c8a00f0 Compare March 20, 2025 10:53
@rdettai rdettai self-assigned this Mar 20, 2025
@rdettai rdettai added the bug Something isn't working label Mar 20, 2025
@rdettai rdettai marked this pull request as draft March 20, 2025 10:55
@rdettai rdettai force-pushed the fix-failing-ingest-index-creation-code branch from ebb5518 to e98ccc0 Compare March 25, 2025 12:31
@rdettai rdettai requested a review from guilload March 25, 2025 12:31
@rdettai rdettai marked this pull request as ready for review March 25, 2025 12:31
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we add the capability to write errors back to the barrier so that control plane errors can be shared back with all ingest routing requests that are waiting for shards.

Comment on lines +179 to +189
let last_failure = match open_shard_error {
ControlPlaneError::Internal(_) => SubworkbenchFailure::Internal,
ControlPlaneError::Timeout(_) => SubworkbenchFailure::ControlPlaneUnavailable,
ControlPlaneError::TooManyRequests => SubworkbenchFailure::ControlPlaneUnavailable,
ControlPlaneError::Unavailable(_) => SubworkbenchFailure::ControlPlaneUnavailable,
ControlPlaneError::Metastore(metastore_error) => match metastore_error {
MetastoreError::Timeout(_) => SubworkbenchFailure::ControlPlaneUnavailable,
MetastoreError::TooManyRequests => SubworkbenchFailure::ControlPlaneUnavailable,
MetastoreError::Unavailable(_) => SubworkbenchFailure::ControlPlaneUnavailable,
// TODO: are there other metastore errors that can be considered temporary?
_ => SubworkbenchFailure::Internal,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mapping determines which requests are going to be 500 or 503. In either case they will be retried internally (is_pending() returns true for both)

@@ -68,6 +68,7 @@ enum IngestFailureReason {
INGEST_FAILURE_REASON_ROUTER_LOAD_SHEDDING = 8;
INGEST_FAILURE_REASON_LOAD_SHEDDING = 9;
INGEST_FAILURE_REASON_CIRCUIT_BREAKER = 10;
INGEST_FAILURE_REASON_UNAVAILABLE = 11;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

until all servers are upgraded, this will be converted to IngestServiceError::Internal

Comment on lines -172 to +174
for subrequest in pending_subrequests(&workbench.subworkbenches) {
for subrequest in
pending_subrequests_for_attempt(&workbench.subworkbenches, workbench.num_attempts)
Copy link
Collaborator Author

@rdettai rdettai Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some logic to not process further during this retry attempt subrequests for which we already observed an error when trying to create the shards. Otherwise all control plane errors are overriden as "no shard available".

@rdettai
Copy link
Collaborator Author

rdettai commented Mar 25, 2025

Splitting this PR into #5721 and #5722

@rdettai rdettai closed this Mar 25, 2025
@rdettai rdettai deleted the fix-failing-ingest-index-creation-code branch March 25, 2025 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant