CNDB-12425: A few reproduction tests and a preliminary patch, WIP #1529

Draft
wants to merge 14 commits into
base: main

Conversation

ekaterinadimitrova2

@ekaterinadimitrova2 ekaterinadimitrova2 commented Jan 23, 2025

Queries that could use an index fail while the index is being built, even if ALLOW FILTERING is specified.
This is an availability issue for people who migrate from using ALLOW FILTERING to indexes.

...

What does this PR fix and why was it fixed
...
We enable CC to fall back to ALLOW FILTERING during the initial index build, which is considered safe. This is done by adding allowFiltering information (whether it is present in the query and whether the query is supported with ALLOW FILTERING) to the RowFilter, and by adding a new Index.Status, INITIAL_BUILD_STARTED.
In CNDB, later rebuilds are also safe, as the index remains queryable while the compactor is rebuilding it.
The CC patch accounts for this: we check not only that an index is building and that the query has ALLOW FILTERING, but also that the index is not queryable. The difference between Astra and CC is that in CC an index is never queryable while it is building.
I still have to add testing for index builds during bootstrapping and 2i testing of the patch in CC, but that does not matter for Astra, so I believe the patch can be reviewed in parallel. I added two tests as per my conversation with @jasonstack, and they pass. Please let me know if there are other cases the tests need to cover and whether they are what was intended.

For the addition of allowFiltering, I had to bump the messaging version.
The migration tests failed, but I believe this will be addressed by https://github.com/riptano/cndb/pull/13095, which fixes the issues caused by bumping the messaging version in CC. The failures are safe to ignore for now.
Everything else seems to have passed.

Still missing additional 2i testing and testing of builds during bootstrapping. Also, I want to add a feature flag.

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits

@ekaterinadimitrova2 ekaterinadimitrova2 marked this pull request as draft January 23, 2025 03:20
@ekaterinadimitrova2 ekaterinadimitrova2 self-assigned this Jan 23, 2025
@ekaterinadimitrova2 ekaterinadimitrova2 marked this pull request as ready for review February 9, 2025 17:53
@ekaterinadimitrova2 ekaterinadimitrova2 marked this pull request as draft February 9, 2025 17:53
@ekaterinadimitrova2 ekaterinadimitrova2 marked this pull request as ready for review February 9, 2025 17:55
@ekaterinadimitrova2 ekaterinadimitrova2 marked this pull request as draft February 9, 2025 17:57
@ekaterinadimitrova2 ekaterinadimitrova2 marked this pull request as ready for review February 9, 2025 18:01
@ekaterinadimitrova2 ekaterinadimitrova2 marked this pull request as draft February 9, 2025 21:10
@@ -77,7 +76,7 @@ public void tesConcurrencyFactor()
// verify that a low concurrency factor is not capped by the max concurrency factor
PartitionRangeReadCommand command = command(cfs, 50, 50);
try (RangeCommandIterator partitions = RangeCommands.rangeCommandIterator(command, ONE, System.nanoTime(), ReadTracker.NOOP);
- ReplicaPlanIterator ranges = new ReplicaPlanIterator(command.dataRange().keyRange(), command.indexQueryPlan(), keyspace, ONE))
+ ReplicaPlanIterator ranges = new ReplicaPlanIterator(command.dataRange().keyRange(), command.indexQueryPlan(), keyspace, ONE, false))

This should be command.rowFilter().allowFiltering

@@ -362,13 +365,24 @@ public synchronized Future<?> addIndex(IndexMetadata indexDef, boolean isNewCF)
* @param queryPlan a query plan
* @throws IndexNotAvailableException if the query plan has any index that is not queryable
*/
- public void checkQueryability(Index.QueryPlan queryPlan)
+ public boolean isQueryableThroughIndex(Index.QueryPlan queryPlan, boolean allowsFiltering)

This method can now return true or false, or throw an exception, and it also emits a client warning. That looks like too many side effects for a method starting with boolean is..., which might suggest a simpler behaviour. I would either:
a) Split it into two simpler separate boolean methods to know if all the indexes in the plan are building/queryable, and let ReadCommand#executeLocally do the AF check and throw exceptions and warnings.
b) Transform it into a SecondaryIndexManager#searcherFor(Index.QueryPlan, boolean) method keeping most of its responsibilities, returning the searcher if it's possible to build it, returning null if the index is building, and throwing an exception if it's not queryable.
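As a rough illustration of option (a), the sketch below separates two side-effect-free predicates from the caller-side decision. Every name in it (IndexPlanChecks, PlannedIndex, BuildStatus, Outcome, decide) is hypothetical and stands in for the real Cassandra types; it only models the shape of the split, not the actual patch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; these are not the real Cassandra types.
final class IndexPlanChecks
{
    enum BuildStatus { INITIAL_BUILD_STARTED, BUILD_SUCCEEDED, BUILD_FAILED }

    enum Outcome { USE_INDEX, FILTER_WITHOUT_INDEX, REJECT }

    static final class PlannedIndex
    {
        final String name;
        final BuildStatus status;
        final boolean queryable;

        PlannedIndex(String name, BuildStatus status, boolean queryable)
        {
            this.name = name;
            this.status = status;
            this.queryable = queryable;
        }
    }

    // Side-effect free: true only if every index in the plan can serve reads now.
    static boolean allQueryable(List<PlannedIndex> plan)
    {
        return plan.stream().allMatch(index -> index.queryable);
    }

    // Side-effect free: true if every index that cannot serve reads is in its initial build.
    static boolean onlyInitialBuildsPending(List<PlannedIndex> plan)
    {
        return plan.stream()
                   .filter(index -> !index.queryable)
                   .allMatch(index -> index.status == BuildStatus.INITIAL_BUILD_STARTED);
    }

    // The caller combines the predicates with the ALLOW FILTERING flag and owns
    // any exception or client warning that follows from the outcome.
    static Outcome decide(List<PlannedIndex> plan, boolean allowFiltering)
    {
        if (allQueryable(plan))
            return Outcome.USE_INDEX;
        if (allowFiltering && onlyInitialBuildsPending(plan))
            return Outcome.FILTER_WITHOUT_INDEX;
        return Outcome.REJECT;
    }

    public static void main(String[] args)
    {
        List<PlannedIndex> plan = Arrays.asList(new PlannedIndex("t_k_idx", BuildStatus.INITIAL_BUILD_STARTED, false));
        System.out.println(decide(plan, true));
        System.out.println(decide(plan, false));
    }
}
```

With this shape, the predicates stay trivially testable, and the code that throws IndexNotAvailableException or emits a warning lives in one place in the caller.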

* @return a new query plan for the specified {@link RowFilter} and {@link Index}, {@code null} otherwise
*/
@Nullable
QueryPlan queryPlanForIndices(RowFilter rowFilter, Set<Index> indexes);

I'm still thinking about how we could make this the only method and get rid of queryPlanFor(RowFilter)

Author

I haven't addressed this yet; I will come back to it soon.

}

public void testAllowFilteringDuringIndexBuildsOn3NodeCluster(boolean isCreateIndex, Index.Status buildStatus) throws Exception
{
Author

This is long and ugly, but covers all cases. I will refactor it soon. It wasn't a priority for now

@@ -1156,6 +1156,8 @@ public void testIndexQueriesWithIndexNotReady()
{
execute("DROP index " + KEYSPACE + ".testIndex");
}

execute("SELECT value FROM %s WHERE value = 2 ALLOW FILTERING");
Author

Technically this is the only non-SAI test... we need more.

final Injections.Barrier blockIndexBuild = Injections.newBarrier("block_index_build", 2, false)
.add(InvokePointBuilder.newInvokePoint().onClass(StorageAttachedIndex.class)
.onMethod("startInitialBuild"))
.build();
Author

@ekaterinadimitrova2 ekaterinadimitrova2 Mar 3, 2025

I was planning to also test 2i here, but in practice this tests only SAI... for now.

@ekaterinadimitrova2 ekaterinadimitrova2 marked this pull request as ready for review March 3, 2025 08:28
@ekaterinadimitrova2
Author

Still not ready for full review...
Missing tests and work around bootstrap. I would appreciate any help with ideas for testing in that area and 2i.
I need to add a feature flag and do a full pass to clean the code. I bumped the messaging version, but I believe Michael has some fixes in his branch. I need to pull them on rebase after he merges tomorrow.
I left some comments to mark things that come to my mind that need to be taken care of. Let's also see what CI has to say. Leaving the PR in draft, though, to show it is not the latest version yet.

@ekaterinadimitrova2 ekaterinadimitrova2 marked this pull request as draft March 3, 2025 08:32
@cassci-bot

✔️ Build ds-cassandra-pr-gate/PR-1529 approved by Butler


// if the status of the index is building and there is allow filtering - that is ok too
if (considerAllowFiltering && status == Index.Status.INITIALIZED && allowFiltering)
continue;

Have to think about it more thoroughly, but this looks like a good place to emit the client warnings that are currently raised on the replica side. We could have one warning message per index-building replica, so clients know which nodes are still initializing their indexes and are going to use filtering.

Author

I added:

// if the status of the index is building and there is allow filtering - that is ok too
if (considerAllowFiltering && status == Index.Status.INITIAL_BUILD_STARTED && !index.isQueryable(status) && allowFiltering)
{
    ClientWarn.instance.warn(String.format("Query fell back to ALLOW FILTERING because index %s is still building on endpoint %s",
                                           index.getIndexMetadata().name,
                                           replica.endpoint()));
    continue;
}

which led to multiple warnings for the same node in tests.

I decided to just bring up a single-node C* and try a single query during the index build:

cqlsh:k> CREATE CUSTOM INDEX ON t(k) USING 'StorageAttachedIndex';
cqlsh:k> SELECT * FROM t WHERE k=200 ALLOW FILTERING;

 pk | i | j | k | vec
----+---+---+---+-----

(0 rows)

Warnings :
Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoint localhost/127.0.0.1:7000

I have to dig into this tomorrow.... no more energy today

I suspect that's because of virtual nodes, with 16 tokens per node. Rather than throwing the client warning immediately, the endpoints can be collected in a set:

Set<InetAddressAndPort> filteringEndpoints = new HashSet<>();

and then throw a single warning after the loop with the unique addresses. For example:

Query fell back to ALLOW FILTERING because index t_k_idx is still building on endpoints 192.168.0.1:7000, 192.168.0.2:7000
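A minimal sketch of that idea, assuming endpoints are represented as plain strings rather than InetAddressAndPort, and using a hypothetical helper class FilteringWarnings. It uses a LinkedHashSet instead of the HashSet mentioned above only so that the order of addresses in the message is stable; the result would then be passed to the warning mechanism once, after the loop.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper for illustration; not part of the patch.
final class FilteringWarnings
{
    // Collapses the per-range endpoints (one entry per token range visited)
    // into a single warning that lists each endpoint once.
    static Optional<String> singleWarning(String indexName, List<String> endpointsPerRange)
    {
        Set<String> unique = new LinkedHashSet<>(endpointsPerRange);
        if (unique.isEmpty())
            return Optional.empty();

        return Optional.of(String.format("Query fell back to ALLOW FILTERING because index %s is still building on endpoints %s",
                                         indexName,
                                         String.join(", ", unique)));
    }

    public static void main(String[] args)
    {
        // The same endpoint appears once per token range, as in the cqlsh output above.
        List<String> seen = Arrays.asList("192.168.0.1:7000", "192.168.0.1:7000", "192.168.0.2:7000");
        singleWarning("t_k_idx", seen).ifPresent(System.out::println);
    }
}
```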

I was wondering whether "fell back to ALLOW FILTERING" will be clear enough for users, considering that they have just written ALLOW FILTERING in the query and, strictly, ALLOW FILTERING is a permission to filter and not the action of filtering. Perhaps the message would be a bit clearer this way:

The query won't use the indexes a, b and c on endpoints 192.168.0.1:7000, 192.168.0.2:7000 because the indexes are still building on those nodes. 

Feel free to ignore if you don't agree; I'm just giving ideas.
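For what it's worth, the suggested wording could be produced along these lines. This is only a sketch: SkippedIndexMessage and its methods are hypothetical names, index names and endpoints are assumed to be plain strings, and the text is hard-coded for the plural case shown in the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical formatter for illustration; not part of the patch.
final class SkippedIndexMessage
{
    // Joins names as "a", "a and b", or "a, b and c".
    static String joinNames(List<String> names)
    {
        int n = names.size();
        if (n == 0)
            return "";
        if (n == 1)
            return names.get(0);
        return String.join(", ", names.subList(0, n - 1)) + " and " + names.get(n - 1);
    }

    static String format(List<String> indexNames, List<String> endpoints)
    {
        return String.format("The query won't use the indexes %s on endpoints %s because the indexes are still building on those nodes.",
                             joinNames(indexNames),
                             String.join(", ", endpoints));
    }

    public static void main(String[] args)
    {
        System.out.println(format(Arrays.asList("a", "b", "c"),
                                  Arrays.asList("192.168.0.1:7000", "192.168.0.2:7000")));
    }
}
```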

…t include:

- feature flag
- checks we are on the new messaging version added for ANNOptions
- we fall back to allow filtering only on Index Creation. Currently we also fall back to ALLOW FILTERING if we use nodetool to rebuild indexes
…ull rebuilds

Added some ugly testing to IndexAvailabilityTest to confirm queries with the two build statuses
…ordinator and know the plan may change

when it reaches the replica and we rebuild it. This will change with CNDB-13129
Added new IndexBuildDuringBootstrapTest. Handle bootstrapping if we think we should?
Address other nits and fixes.
Rebased on top of Michael's messaging version bump and related fixes.
sbtourist and others added 3 commits March 19, 2025 16:27
Split to prevent timeouts. Also add cluster sharing to speed it up.
Fix AllIndexImplementationsTest and extend it to cover other index implementations.
if (requiresFiltering)
assertInvalidThrowMessage(error, InvalidRequestException.class, query);
else if (duringInitialBuild)
assertInvalidThrow(IndexNotAvailableException.class, query);

@adelapena, after I rebased on top of CNDB-12620, I realized that no matter what exception I put here, the tests with all indexes other than SAI always pass...

I think we don't really need to do the build injections for anything but SAI for two reasons:

  1. All indexes but SAI are always queryable according to Index.isQueryable. That said, it seems that during the build at least SASI does not return results - https://github.com/riptano/cndb/issues/12931
  2. I think there was a bug in the test: once I fix it, the first query actually fails with "An index involved in this query does not support disjunctive queries using the OR operator". We were never hitting the if (duringInitialBuild) branch with non-SAI indexes.
