[DO NOT REVIEW] HCD 1.1.0 hotfix #1611

Open
wants to merge 13 commits into
base: main

Conversation

szymon-miezal

A PR created for the hotfix branch solely to run tests.

jasonstack and others added 13 commits February 12, 2025 18:27
### What is the issue
riptano/cndb#11950

### What does this PR fix and why was it fixed
We saw many test failures in `UnifiedCompactionStrategyTest` after
#1407. After some investigation, the root cause of the unit test
failures appears to be the cost of the Mockito calls used to fetch
different values.

However, without changing anything in Mockito, I was able to optimize
the `UCS::getLevels` method enough to make the test suite go from timing
out to taking 3 minutes 11 seconds when running `ant test
-Dtest.name=UnifiedCompactionStrategyTest` on the command line.

Let's see if the test passes in Butler.

### Checklist before you submit for review
- [ ] Make sure there is a PR in the CNDB project updating the Converged
Cassandra version
- [ ] Use `NoSpamLogger` for log lines that may appear frequently in the
logs
- [ ] Verify test results on Butler
- [ ] Test coverage for new/modified code is > 80%
- [ ] Proper code formatting
- [ ] Proper title for each commit starting with the project-issue
number, like CNDB-1234
- [ ] Each commit has a meaningful description
- [ ] Each commit is not very long and contains related changes
- [ ] Renames, moves and reformatting are in distinct commits

This splits compactions that are to produce more than one
output sstable into tasks that can execute in parallel.
Such tasks share a transaction and have combined progress
and observer. Because we cannot mark parts of an sstable
as unneeded, the transaction is only applied when all
tasks have succeeded. This also means that early open
is not supported for such tasks.

The parallelization also takes into account thread reservations,
reducing the parallelism to the number of available threads
for its level. The new functionality is turned on by default.
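
As a rough illustration of the two properties described above (parallelism capped by the threads available, and the shared transaction applied only when every task has succeeded), here is a minimal sketch. The class and method names are hypothetical and do not come from the Cassandra code base; falling back to a single task when no threads are free is an assumption of this sketch, not something the commit states.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; none of these names exist in Cassandra.
final class ParallelCompactionSketch
{
    /** Caps the number of parallel tasks at the threads available for the level (at least 1, by assumption). */
    static int effectiveParallelism(int outputShards, int availableThreads)
    {
        return Math.max(1, Math.min(outputShards, availableThreads));
    }

    /**
     * Runs every subtask and reports whether the shared transaction may be applied.
     * Returns true only if all subtasks completed and each reported success.
     */
    static boolean runAll(List<Callable<Boolean>> subtasks, int availableThreads)
    {
        ExecutorService pool = Executors.newFixedThreadPool(effectiveParallelism(subtasks.size(), availableThreads));
        try
        {
            boolean allSucceeded = true;
            // invokeAll blocks until every subtask has either completed or failed
            for (Future<Boolean> result : pool.invokeAll(subtasks))
            {
                try
                {
                    allSucceeded = result.get() && allSucceeded;
                }
                catch (ExecutionException e)
                {
                    // One failed subtask means nothing is committed
                    allSucceeded = false;
                }
            }
            return allSucceeded;
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            return false;
        }
        finally
        {
            pool.shutdown();
        }
    }
}
```

A caller would commit the transaction when `runAll` returns true and abort it otherwise, which is also why partial results cannot be opened early in this model.
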

Major compactions will apply the same mechanism to
parallelize the operation. They will only split on pre-
existing boundary points if they are also boundary
points for the current UCS configuration. This is done
to ensure that major compactions can re-shard data when
the configuration is changed. If pre-existing boundaries
match the current state, a major compaction will still be
broken into multiple operations to reduce the space
overhead of the operation.

Also:
- Introduces a parallelism parameter to major compactions
  (`nodetool compact -j <threads>`, defaulting to half the
  compaction threads) to avoid stopping all other compaction
  for the duration.

- Changes SSTable expiration to be done in a separate
  `getNextBackgroundCompactions` round to improve the
  efficiency of expiration (separate task can run quickly
  and remove the relevant sstables without waiting for
  a compaction to end).

- Applies small-partition-count correction in
  `ShardManager.calculateCombinedDensity`.
#1559)

### What is the issue
[CNDB-12899](riptano/cndb#12899)
`CompactionRealm.estimatedPartitionCount()` is very expensive

### What does this PR fix and why was it fixed
Adds a cached version of the metric and removes the memtable partitions
from the calculation to make it more precise for the compaction use
case.

Also makes sure that the `estimatedPartitionCount` metric is not
recalculated if the table's data view (i.e. sstable and memtable set)
has not changed.
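
One way to picture the "do not recalculate if the data view has not changed" behaviour is a cache keyed on the identity of the view. This is only a sketch under assumptions: `ViewCachedEstimate` is a hypothetical name, and it assumes a new view object is published whenever the sstable or memtable set changes, so that an identity comparison is enough.

```java
import java.util.Objects;
import java.util.function.ToLongFunction;

// Hypothetical sketch; V stands in for an immutable snapshot of the table's data view.
final class ViewCachedEstimate<V>
{
    private final ToLongFunction<V> expensiveEstimate;
    private V cachedView;     // the view the cached value was computed for
    private long cachedValue;

    ViewCachedEstimate(ToLongFunction<V> expensiveEstimate)
    {
        this.expensiveEstimate = Objects.requireNonNull(expensiveEstimate);
    }

    /** Returns the cached estimate, recomputing only when a different view object is passed in. */
    synchronized long get(V currentView)
    {
        Objects.requireNonNull(currentView);
        // Identity comparison: assumes every change to the data set publishes a new view object
        if (cachedView != currentView)
        {
            cachedValue = expensiveEstimate.applyAsLong(currentView);
            cachedView = currentView;
        }
        return cachedValue;
    }
}
```
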

---------

Co-authored-by: Szymon Miężał <[email protected]>
### What is the issue
Long-running repairs are auto-failed prematurely.

### What does this PR fix and why was it fixed
Captures status pings as liveness information to prevent early
termination of repairs.
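
The idea can be sketched as a tracker in which any message from a participant, including a status ping, refreshes the last-activity timestamp that the auto-fail check compares against. The class below is purely hypothetical and is not the repair code from this patch; timestamps are passed in explicitly to keep the sketch deterministic.

```java
// Hypothetical sketch; not the actual repair implementation.
final class RepairLivenessSketch
{
    private final long timeoutMillis;
    private long lastActivityMillis;

    RepairLivenessSketch(long timeoutMillis, long startMillis)
    {
        this.timeoutMillis = timeoutMillis;
        this.lastActivityMillis = startMillis;
    }

    /** Any message from the participant, including a periodic status ping, counts as activity. */
    synchronized void onMessage(long nowMillis)
    {
        lastActivityMillis = Math.max(lastActivityMillis, nowMillis);
    }

    /** True only when nothing has been heard for longer than the timeout. */
    synchronized boolean shouldAutoFail(long nowMillis)
    {
        return nowMillis - lastActivityMillis > timeoutMillis;
    }
}
```
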
Ports over single-size chunk cache buffers (DB-2904), caching memory addresses (parts of DB-2509) and file cache ids (DB-2489) from DSE.
### What is the issue
Memory-mapping is done in buffers smaller than 2GiB.
When these buffers aren't aligned to 4KiB and the trie-index file
spans many buffers, reading it goes out of buffer bounds.

### What does this PR fix and why was it fixed
This patch fixes it by making sure that the buffers are correctly
aligned.
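
For illustration, one way to keep every mapped region on a 4KiB boundary is to use a region size that is itself a multiple of 4KiB. The sketch below assumes that approach; it is not taken from the patch, and `MmapRegionsSketch` is a hypothetical name.

```java
// Hypothetical sketch of 4 KiB-aligned region boundaries; not the code from this patch.
final class MmapRegionsSketch
{
    static final long ALIGNMENT = 4096;

    // Largest multiple of 4 KiB that still fits in an int-sized mapped buffer (below 2 GiB)
    static final long MAX_REGION = Integer.MAX_VALUE - (Integer.MAX_VALUE % ALIGNMENT);

    /** Start offsets of the regions covering [0, fileLength); each one is a multiple of 4 KiB. */
    static long[] regionStarts(long fileLength)
    {
        int count = (int) ((fileLength + MAX_REGION - 1) / MAX_REGION);
        long[] starts = new long[count];
        for (int i = 0; i < count; i++)
            starts[i] = i * MAX_REGION;
        return starts;
    }
}
```

Because every region starts on a 4KiB boundary, a 4KiB page read never straddles two buffers.
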
This patch introduces two changes:
- It adds a reading group to guard against sweeping the memtable that
the metric may iterate through (preventing crashes).
- It changes the metric calculation to use an estimate (already used by
the SAI query planner) instead of iterating through the whole memtable
(which is quite a heavy operation).
For compressed sstables, fix the partition ending position calculation to prevent
going out of bounds when the partition end falls on a chunk boundary.
The Ford-Fulkerson optimization may take too long in some configs

Some configs make the Ford-Fulkerson (FF) computation take too long.

This PR adds a feature flag so you can work around it.
### What is the issue
Node crashes during node replacements result in hibernated nodes that
cannot join the cluster anymore due to a lack of SYN messages from seeds.

### What does this PR fix and why was it fixed
Port DB-1482, which allows the use of a JMX endpoint on a seed to bring
the hibernated node back to the gossiping candidate list.

Tested via: datastax/cassandra-dtest#75.
@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-1611 rejected by Butler


6 new test failure(s) in 4 builds


Found 6 new test failures

| Test | Explanation | Branch history | Upstream history |
| --- | --- | --- | --- |
| ...d.t.s.VectorDistributedTest.rangeRestrictedTest | regression | 🔴🔵🔴🔵 | 🔵🔵🔵🔵🔵🔵🔵 |
| ...,wide=false,scenario=COMPACTED_QUERY] | regression | 🔴🔵🔵🔵 | 🔵🔵🔵🔵🔵🔵🔵 |
| ...t.testKDTreePostingsQueryMetricsWithSingleIndex | regression | 🔴🔴🔴🔴 | 🔵🔵🔵🔵🔵🔵🔵 |
| ...Test.testFinalOpenRetainsCachedData[format=BIG] | regression | 🔴🔴🔴🔴 | 🔵🔵🔵🔵🔵🔵🔵 |
| ...Test.testFinalOpenRetainsCachedData[format=BTI] | regression | 🔴🔴🔴🔴 | 🔵🔵🔵🔵🔵🔵🔵 |
| o.a.c.u.b.BinLogTest.testTruncationReleasesLogS... | regression | 🔴🔴🔵🔵 | 🔵🔵🔵🔵🔵🔵🔵 |

Found 10 known test failures

6 participants