[DO NOT REVIEW] HCD 1.1.0 hotfix #1611

szymon-miezal · 2025-02-26T16:53:54Z

A PR created for the hotfix branch solely to run tests.

…its indexes are fully written (#1435)

### What is the issue
riptano/cndb#11950

### What does this PR fix and why was it fixed
We saw many test failures in `UnifiedCompactionStrategyTest` after #1407. On investigation, the root cause of the unit test failures is likely the cost of the Mockito calls used to get different values. However, without changing anything in Mockito, I was able to optimize the `UCS::getLevels` method enough to make the test suite go from timing out to taking 3 minutes 11 seconds when running `ant test -Dtest.name=UnifiedCompactionStrategyTest` on the command line. Let's see if the test passes in Butler.

### Checklist before you submit for review
- [ ] Make sure there is a PR in the CNDB project updating the Converged Cassandra version
- [ ] Use `NoSpamLogger` for log lines that may appear frequently in the logs
- [ ] Verify test results on Butler
- [ ] Test coverage for new/modified code is > 80%
- [ ] Proper code formatting
- [ ] Proper title for each commit starting with the project-issue number, like CNDB-1234
- [ ] Each commit has a meaningful description
- [ ] Each commit is not very long and contains related changes
- [ ] Renames, moves and reformatting are in distinct commits

This splits compactions that are to produce more than one output sstable into tasks that can execute in parallel. Such tasks share a transaction and have combined progress and observer. Because we cannot mark parts of an sstable as unneeded, the transaction is only applied when all tasks have succeeded. This also means that early open is not supported for such tasks.

The parallelization also takes into account thread reservations, reducing the parallelism to the number of available threads for its level. The new functionality is turned on by default.

Major compactions will apply the same mechanism to parallelize the operation. They will only split on pre-existing boundary points if they are also boundary points for the current UCS configuration. This is done to ensure that major compactions can re-shard data when the configuration is changed. If pre-existing boundaries match the current state, a major compaction will still be broken into multiple operations to reduce the space overhead of the operation.

Also:
- Introduces a parallelism parameter to major compactions (`nodetool compact -j <threads>`, defaulting to half the compaction threads) to avoid stopping all other compaction for the duration.
- Changes SSTable expiration to be done in a separate `getNextBackgroundCompactions` round to improve the efficiency of expiration (separate task can run quickly and remove the relevant sstables without waiting for a compaction to end).
- Applies small-partition-count correction in `ShardManager.calculateCombinedDensity`.
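
To illustrate the "shared transaction, applied only when all tasks have succeeded" behaviour described above, here is a minimal sketch. It is not the actual implementation: `ParallelCompactionSketch`, `SharedTransaction`, `effectiveParallelism` and `runAll` are hypothetical names invented for this example, and the real code works with Cassandra's lifecycle transactions and compaction executors rather than a plain thread pool.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical illustration; none of these names come from the Cassandra code base. */
final class ParallelCompactionSketch {

    /** Stand-in for the transaction shared by all subtasks of one compaction. */
    static final class SharedTransaction {
        private boolean committed;
        private boolean aborted;

        synchronized void commit() { committed = true; }
        synchronized void abort() { aborted = true; }
        synchronized boolean isCommitted() { return committed; }
        synchronized boolean isAborted() { return aborted; }
    }

    /** Parallelism is capped by the threads still available (assumed to be given by the caller). */
    static int effectiveParallelism(int outputShards, int availableThreads) {
        return Math.max(1, Math.min(outputShards, availableThreads));
    }

    /** Runs every subtask and commits the shared transaction only if all of them succeed. */
    static boolean runAll(List<Callable<Void>> subtasks, int availableThreads, SharedTransaction txn) {
        ExecutorService pool =
            Executors.newFixedThreadPool(effectiveParallelism(subtasks.size(), availableThreads));
        try {
            // invokeAll blocks until every subtask has completed or failed.
            for (Future<Void> result : pool.invokeAll(subtasks)) {
                result.get(); // rethrows a subtask failure as ExecutionException
            }
            txn.commit();
            return true;
        } catch (ExecutionException e) {
            // A single failed subtask invalidates the whole operation.
            txn.abort();
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            txn.abort();
            return false;
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch a partial result is never applied: one failing subtask aborts the transaction for all of them, which mirrors the constraint that parts of an sstable cannot be marked as unneeded.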

#1559)

### What is the issue
[CNDB-12899](riptano/cndb#12899) `CompactionRealm.estimatedPartitionCount()` is very expensive

### What does this PR fix and why was it fixed
Adds a cached version of the metric and removes the memtable partitions from the calculation to make it more precise for the compaction use case. Also makes sure that the `estimatedPartitionCount` metric is not recalculated if the table's data view (i.e. sstable and memtable set) has not changed.

---------

Co-authored-by: Szymon Miężał <[email protected]>
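
The "do not recalculate if the data view has not changed" idea can be sketched as follows. This is only an illustration under stated assumptions: `CachedEstimate` is a hypothetical class, and it assumes the data view is an immutable snapshot that is replaced by a new object whenever the sstable or memtable set changes, so that a reference comparison is enough to detect a change.

```java
import java.util.function.ToLongFunction;

/** Hypothetical helper: caches an expensive estimate per data-view snapshot. */
final class CachedEstimate<V> {
    private final ToLongFunction<V> compute;
    private V lastView;
    private long lastValue;
    private boolean valid;

    CachedEstimate(ToLongFunction<V> compute) {
        this.compute = compute;
    }

    /** Recomputes only when the caller passes a different view object than last time. */
    synchronized long get(V currentView) {
        if (!valid || lastView != currentView) {
            lastValue = compute.applyAsLong(currentView);
            lastView = currentView;
            valid = true;
        }
        return lastValue;
    }
}
```

Repeated calls with the same snapshot return the stored value, so the expensive computation runs once per view change rather than once per metric read.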

### What is the issue
Long running repairs trigger auto failing prematurely

### What does this PR fix and why was it fixed
Capture status pings as liveness info to prevent early termination of repairs
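
As a rough illustration of treating status pings as liveness information, the sketch below keeps a single "last activity" timestamp that both progress messages and status pings refresh. The class and method names are hypothetical and the timeout handling is simplified; the actual repair code is not shown in this PR description.

```java
/** Hypothetical liveness tracker for a long-running repair session. */
final class RepairLivenessSketch {
    private final long timeoutMillis;
    private long lastActivityMillis;

    RepairLivenessSketch(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastActivityMillis = startMillis;
    }

    /** Real progress counts as activity. */
    synchronized void onProgress(long nowMillis) {
        lastActivityMillis = Math.max(lastActivityMillis, nowMillis);
    }

    /** A status ping also counts as activity, so a slow but alive repair is not failed. */
    synchronized void onStatusPing(long nowMillis) {
        lastActivityMillis = Math.max(lastActivityMillis, nowMillis);
    }

    /** The session is auto-failed only after a full timeout with no activity of any kind. */
    synchronized boolean shouldAutoFail(long nowMillis) {
        return nowMillis - lastActivityMillis > timeoutMillis;
    }
}
```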

Ports over single-size chunk cache buffers (DB-2904), caching memory addresses (parts of DB-2509) and file cache ids (DB-2489) from DSE.

### What is the issue
Memory-mapping is done in buffers of size less than 2GiB. When these buffers aren't aligned to 4KiB and the trie-index file spans many buffers then reading it results in going out of buffer bounds.

### What does this PR fix and why was it fixed
This patch fixes it by making sure that the buffers are correctly aligned.
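
One way to keep mapped regions aligned is to round the maximum region size down to a multiple of 4KiB before splitting the file, so that every region boundary except the end of the file falls on a 4KiB boundary. The sketch below shows that arithmetic only; `AlignedRegions` and its methods are hypothetical names, and the actual patch may achieve alignment differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits a file into 4KiB-aligned regions for memory mapping. */
final class AlignedRegions {
    static final long ALIGNMENT = 4096;

    /** Largest multiple of ALIGNMENT that does not exceed maxSize. */
    static long alignedSegmentSize(long maxSize) {
        return maxSize - (maxSize % ALIGNMENT);
    }

    /** Returns {start, end} pairs covering [0, fileLength), each at most maxSize bytes long. */
    static List<long[]> split(long fileLength, long maxSize) {
        long segment = alignedSegmentSize(maxSize);
        if (segment <= 0) {
            throw new IllegalArgumentException("maxSize must be at least " + ALIGNMENT);
        }
        List<long[]> regions = new ArrayList<>();
        for (long start = 0; start < fileLength; start += segment) {
            regions.add(new long[] { start, Math.min(start + segment, fileLength) });
        }
        return regions;
    }
}
```

With a limit of `Integer.MAX_VALUE` bytes per buffer, the aligned segment size in this sketch is 2147479552 bytes, which is 4095 bytes below the limit.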

This patch introduces two changes:
- Adds a reading group to guard against sweeping the memtable that the metric may iterate through (preventing crashes).
- Changes the metric calculation to use an estimate (already used by the SAI query planner) instead of iterating through the whole memtable, which is quite a heavy operation.

For compressed sstables, fix the partition ending position calculation to prevent going out of bounds when the partition end falls on a chunk boundary.

The Ford-Fulkerson optimization may take too long in some configurations. This PR adds a feature flag so that you can work around it.

### What is the issue
Node crashes during node replacements result in hibernated nodes that cannot join the cluster anymore due to a lack of SYN messages from seeds.

### What does this PR fix and why was it fixed
Port DB-1482, which allows the use of a JMX endpoint on a seed to bring the hibernated node back to the gossiping candidate list. Tested via: datastax/cassandra-dtest#75.

sonarqubecloud · 2025-03-12T19:47:34Z

Quality Gate passed

Issues
16 New issues
0 Accepted issues

Measures
1 Security Hotspot
84.6% Coverage on New Code
0.3% Duplication on New Code

See analysis details on SonarQube Cloud

cassci-bot · 2025-03-12T20:21:35Z

❌ Build ds-cassandra-pr-gate/PR-1611 rejected by Butler

6 new test failure(s) in 4 builds
See build details here

Found 6 new test failures

| Test | Explanation | Branch history | Upstream history |
| --- | --- | --- | --- |
| ...d.t.s.VectorDistributedTest.rangeRestrictedTest | regression | 🔴🔵🔴🔵 | 🔵🔵🔵🔵🔵🔵🔵 |
| ...,wide=false,scenario=COMPACTED_QUERY] | regression | 🔴🔵🔵🔵 | 🔵🔵🔵🔵🔵🔵🔵 |
| ...t.testKDTreePostingsQueryMetricsWithSingleIndex | regression | 🔴🔴🔴🔴 | 🔵🔵🔵🔵🔵🔵🔵 |
| ...Test.testFinalOpenRetainsCachedData[format=BIG] | regression | 🔴🔴🔴🔴 | 🔵🔵🔵🔵🔵🔵🔵 |
| ...Test.testFinalOpenRetainsCachedData[format=BTI] | regression | 🔴🔴🔴🔴 | 🔵🔵🔵🔵🔵🔵🔵 |
| o.a.c.u.b.BinLogTest.testTruncationReleasesLogS... | regression | 🔴🔴🔵🔵 | 🔵🔵🔵🔵🔵🔵🔵 |

Found 10 known test failures

jasonstack and others added 13 commits February 12, 2025 18:27

CNDB-11832: add LifecycleNewTracker#trackNewWritten when sstable and …

95f70d4

…its indexes are fully written (#1435)

HCD-31 Auto failing prematurely repair sessions (#1557)

49aff57

Improve typed reads in RandomAccessReader

e8183a3

CNDB-9104: Port over chunk cache improvements from DSE (#1495)

f2151e1

HCD-74: Fix CorruptSSTableException in UCS with compression (#1602)

53a28ae

HCD-84 Feature flag to skip Ford Fulkerson (#1612)

a8a00b2

HCD-73: Fix replacing node stuck in hibernation state [NOT MERGED]

340cdba
