Failure to create an index with ingest v2 returns 429 #5719

rdettai · 2025-03-20T10:41:10Z

Description of the issue

When using index templates, specifying an index name that matches the pattern but contains illegal characters results in a 429 response code instead of a 400.

Description of the problem:

the index_id is not validated until the metastore is called
calling GetOrCreateOpenShards with an invalid index_id on the metastore fails with MetastoreError::JsonDeserializeError
a failure to call the metastore is logged but not reported to the ingest workbench
because the router is not populated, the workbench fails with "shard not found" errors for the failing index, but also for the other targeted indexes batched with it

Proposed solution

This PR solves the problem in two places:

the index id is now validated in quickwit-serve, before calling the ingest router (ES bulk and native APIs)
control plane request errors are now recorded to the workbench so that they can properly be surfaced to the ingest requests. Transient control plane / metastore errors are surfaced as 503 (unavailable) and errors that can't be retried as 500 (internal)
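
To make the first point concrete, here is a minimal sketch of an early index ID check that rejects a request with a 400 before the ingest router is ever called. The names `validate_index_id`, `InvalidIndexId` and `early_rejection_status` are hypothetical, and the rule itself (a leading ASCII letter, then letters, digits, `-`, `_` or `.`, 3 to 255 characters in total) is an assumption for illustration, not necessarily the check quickwit-serve applies.

```rust
/// Hypothetical error type for this sketch; not Quickwit's actual type.
#[derive(Debug, PartialEq)]
pub struct InvalidIndexId(pub String);

/// Assumed rule: a leading ASCII letter, followed by ASCII letters, digits,
/// `-`, `_` or `.`, for a total length between 3 and 255 characters.
pub fn validate_index_id(index_id: &str) -> Result<(), InvalidIndexId> {
    let mut chars = index_id.chars();
    let starts_with_letter = chars.next().map_or(false, |c| c.is_ascii_alphabetic());
    // `chars` now points at the second character, so this checks the rest.
    let rest_is_valid =
        chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | '.'));
    let length_is_valid = (3..=255).contains(&index_id.len());
    if starts_with_letter && rest_is_valid && length_is_valid {
        Ok(())
    } else {
        Err(InvalidIndexId(index_id.to_string()))
    }
}

/// Returns `Some(400)` when the request should be rejected up front, and
/// `None` when it may proceed to the ingest router.
pub fn early_rejection_status(index_id: &str) -> Option<u16> {
    validate_index_id(index_id).err().map(|_| 400)
}
```

With a check of this kind in the HTTP layer, an ID such as `bad index!` never reaches the control plane or the metastore, so it can no longer end up reported as a 429.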

How was this PR tested?

Added unit and integration (python) tests

rdettai · 2025-03-25T14:07:28Z

quickwit/quickwit-ingest/src/ingest_v2/debouncing.rs

Here we add the capability to write errors back to the barrier so that control plane errors can be shared back with all ingest routing requests that are waiting for shards.
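
As a rough illustration of the idea (not the code in debouncing.rs, which is asynchronous), the sketch below uses only the standard library: the task that talks to the control plane releases the barrier with an outcome, and every waiter receives a copy of that outcome. `ResultBarrier` and `OpenShardsError` are hypothetical names introduced for this example.

```rust
use std::sync::{Condvar, Mutex};

/// Hypothetical error type standing in for a control plane error.
#[derive(Clone, Debug, PartialEq)]
pub enum OpenShardsError {
    Unavailable,
    Internal,
}

/// A barrier that carries a result. Waiters block until the owner of the
/// in-flight request releases the barrier, then all observe the same outcome.
#[derive(Default)]
pub struct ResultBarrier {
    outcome: Mutex<Option<Result<(), OpenShardsError>>>,
    released: Condvar,
}

impl ResultBarrier {
    /// Called once by the task that performed the control plane request.
    pub fn release(&self, outcome: Result<(), OpenShardsError>) {
        let mut slot = self.outcome.lock().unwrap();
        *slot = Some(outcome);
        self.released.notify_all();
    }

    /// Called by every request waiting for shards; returns the shared outcome.
    pub fn wait(&self) -> Result<(), OpenShardsError> {
        let mut slot = self.outcome.lock().unwrap();
        // Loop to guard against spurious wakeups.
        while slot.is_none() {
            slot = self.released.wait(slot).unwrap();
        }
        (*slot).clone().expect("outcome is set once the loop exits")
    }
}
```

Storing the outcome in the barrier, rather than only signalling completion, is what lets a waiter distinguish "shards were opened" from "the control plane call failed".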

rdettai · 2025-03-25T14:10:04Z

quickwit/quickwit-ingest/src/ingest_v2/workbench.rs

+        let last_failure = match open_shard_error {
+            ControlPlaneError::Internal(_) => SubworkbenchFailure::Internal,
+            ControlPlaneError::Timeout(_) => SubworkbenchFailure::ControlPlaneUnavailable,
+            ControlPlaneError::TooManyRequests => SubworkbenchFailure::ControlPlaneUnavailable,
+            ControlPlaneError::Unavailable(_) => SubworkbenchFailure::ControlPlaneUnavailable,
+            ControlPlaneError::Metastore(metastore_error) => match metastore_error {
+                MetastoreError::Timeout(_) => SubworkbenchFailure::ControlPlaneUnavailable,
+                MetastoreError::TooManyRequests => SubworkbenchFailure::ControlPlaneUnavailable,
+                MetastoreError::Unavailable(_) => SubworkbenchFailure::ControlPlaneUnavailable,
+                // TODO: are there other metastore errors that can be considered temporary?
+                _ => SubworkbenchFailure::Internal,


This mapping determines which requests surface as a 500 and which as a 503. In either case they are retried internally (is_pending() returns true for both).
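
For readers who want to try the mapping in isolation, here is a self-contained sketch with simplified stand-in enums. The variant names follow the hunk above, but the payload types, the `classify` and `http_status` helpers, and the reduced `SubworkbenchFailure` are assumptions made for this example.

```rust
// Simplified stand-ins for the real error types; payloads are plain strings.
#[derive(Debug)]
pub enum MetastoreError {
    Timeout(String),
    TooManyRequests,
    Unavailable(String),
    JsonDeserializeError(String),
}

#[derive(Debug)]
pub enum ControlPlaneError {
    Internal(String),
    Timeout(String),
    TooManyRequests,
    Unavailable(String),
    Metastore(MetastoreError),
}

/// Reduced to the two variants discussed here.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SubworkbenchFailure {
    ControlPlaneUnavailable,
    Internal,
}

/// Transient errors map to `ControlPlaneUnavailable`, everything else to `Internal`.
pub fn classify(error: &ControlPlaneError) -> SubworkbenchFailure {
    use SubworkbenchFailure::{ControlPlaneUnavailable, Internal};
    match error {
        ControlPlaneError::Internal(_) => Internal,
        ControlPlaneError::Timeout(_)
        | ControlPlaneError::TooManyRequests
        | ControlPlaneError::Unavailable(_) => ControlPlaneUnavailable,
        ControlPlaneError::Metastore(metastore_error) => match metastore_error {
            MetastoreError::Timeout(_)
            | MetastoreError::TooManyRequests
            | MetastoreError::Unavailable(_) => ControlPlaneUnavailable,
            _ => Internal,
        },
    }
}

impl SubworkbenchFailure {
    /// Status code eventually surfaced to the client (hypothetical helper).
    pub fn http_status(self) -> u16 {
        match self {
            SubworkbenchFailure::ControlPlaneUnavailable => 503,
            SubworkbenchFailure::Internal => 500,
        }
    }

    /// Both failures remain retryable within the same ingest request.
    pub fn is_pending(self) -> bool {
        true
    }
}
```

Under this sketch, the `JsonDeserializeError` from the issue description falls into the catch-all arm and surfaces as a 500, while a metastore timeout surfaces as a 503.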

rdettai · 2025-03-25T14:16:39Z

quickwit/quickwit-proto/protos/quickwit/router.proto

@@ -68,6 +68,7 @@ enum IngestFailureReason {
  INGEST_FAILURE_REASON_ROUTER_LOAD_SHEDDING = 8;
  INGEST_FAILURE_REASON_LOAD_SHEDDING = 9;
  INGEST_FAILURE_REASON_CIRCUIT_BREAKER = 10;
+  INGEST_FAILURE_REASON_UNAVAILABLE = 11;


Until all servers are upgraded, this will be converted to IngestServiceError::Internal.
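
A sketch of what this fallback could look like on a node built before the change, assuming such a node treats any reason code it does not recognize as internal. `KnownFailureReason`, `DecodedFailure` and `decode_failure_reason` are hypothetical names; the actual conversion code may differ.

```rust
/// Reasons known to a node built before this change (values taken from the
/// enum above). `INGEST_FAILURE_REASON_UNAVAILABLE = 11` is deliberately absent.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum KnownFailureReason {
    RouterLoadShedding = 8,
    LoadShedding = 9,
    CircuitBreaker = 10,
}

#[derive(Debug, PartialEq)]
pub enum DecodedFailure {
    Known(KnownFailureReason),
    /// Stands in for `IngestServiceError::Internal`.
    Internal,
}

/// Assumed behaviour: an unrecognized reason code degrades to `Internal`.
pub fn decode_failure_reason(raw: i32) -> DecodedFailure {
    match raw {
        8 => DecodedFailure::Known(KnownFailureReason::RouterLoadShedding),
        9 => DecodedFailure::Known(KnownFailureReason::LoadShedding),
        10 => DecodedFailure::Known(KnownFailureReason::CircuitBreaker),
        _ => DecodedFailure::Internal,
    }
}
```

The practical consequence during a rolling upgrade is that clients may see a 500 where a fully upgraded cluster would return a 503.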

rdettai · 2025-03-25T14:26:39Z

quickwit/quickwit-ingest/src/ingest_v2/router.rs

-        for subrequest in pending_subrequests(&workbench.subworkbenches) {
+        for subrequest in
+            pending_subrequests_for_attempt(&workbench.subworkbenches, workbench.num_attempts)


Add logic to skip, for the remainder of the current retry attempt, the subrequests for which we already observed an error while trying to create the shards. Otherwise all control plane errors are overridden as "no shard available".
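
A minimal sketch of such a filter, assuming each subworkbench remembers the attempt in which it last failed. The `Subworkbench` fields and the slice-based signature are hypothetical and only illustrate the filtering, not the actual data structures in router.rs.

```rust
/// Hypothetical, stripped-down subworkbench.
#[derive(Debug)]
pub struct Subworkbench {
    pub subrequest_id: u32,
    /// True while the subrequest has neither succeeded nor failed permanently.
    pub is_pending: bool,
    /// Attempt number in which the last failure was recorded, if any.
    pub last_failure_attempt: Option<usize>,
}

/// Yields the pending subworkbenches that have not already failed during
/// `attempt`, so a recorded control plane error is not overwritten by a later
/// "no shard available" failure within the same attempt.
pub fn pending_subrequests_for_attempt<'a>(
    subworkbenches: &'a [Subworkbench],
    attempt: usize,
) -> impl Iterator<Item = &'a Subworkbench> + 'a {
    subworkbenches.iter().filter(move |subworkbench| {
        subworkbench.is_pending && subworkbench.last_failure_attempt != Some(attempt)
    })
}
```

A subrequest that failed in an earlier attempt is still yielded, so it is retried normally on the next pass.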

rdettai · 2025-03-25T15:28:37Z

Splitting this PR into #5721 and #5722

Add failing test

c8a00f0

rdettai force-pushed the fix-failing-ingest-index-creation-code branch from 2a27562 to c8a00f0 Compare March 20, 2025 10:53

rdettai self-assigned this Mar 20, 2025

rdettai added the bug Something isn't working label Mar 20, 2025

rdettai marked this pull request as draft March 20, 2025 10:55

rdettai added 3 commits March 24, 2025 14:53

Propagate control plane errors and persist failures in barrier

e78a643

Fix wait barrier even if no metastore request

e1cbb8b

Validate index_id early

e98ccc0

rdettai force-pushed the fix-failing-ingest-index-creation-code branch from ebb5518 to e98ccc0 Compare March 25, 2025 12:31

rdettai requested a review from guilload March 25, 2025 12:31

rdettai marked this pull request as ready for review March 25, 2025 12:31

Clarify error message for unavailable

7330ab7

rdettai closed this Mar 25, 2025

rdettai deleted the fix-failing-ingest-index-creation-code branch March 25, 2025 15:29
