[NPM-4243] Cache low-value network path contexts and discard them #34597
base: main
Conversation
Go Package Import Differences
Baseline: c99883d

Uncompressed package size comparison
Comparison with ancestor | Diff per package | Decision

Test changes on VM
Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv aws.create-vm --pipeline-id=57539394 --os-family=ubuntu

Note: This applies to commit 1f46d69

Static quality checks ✅
Please find below the results from static quality gates.
Successful checks

Regression Detector Results
Metrics dashboard
Baseline: c99883d
Optimization Goals: ✅ No significant changes detected
perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
---|---|---|---|---|---|---|
➖ | tcp_syslog_to_blackhole | ingress throughput | +1.18 | [+1.11, +1.24] | 1 | Logs |
➖ | quality_gate_idle | memory utilization | +0.78 | [+0.74, +0.82] | 1 | Logs bounds checks dashboard |
➖ | file_to_blackhole_500ms_latency | egress throughput | +0.20 | [-0.58, +0.97] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | +0.12 | [-0.34, +0.58] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | +0.03 | [-0.75, +0.81] | 1 | Logs |
➖ | quality_gate_idle_all_features | memory utilization | +0.02 | [-0.03, +0.06] | 1 | Logs bounds checks dashboard |
➖ | file_to_blackhole_100ms_latency | egress throughput | +0.01 | [-0.65, +0.67] | 1 | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.03, +0.02] | 1 | Logs |
➖ | file_to_blackhole_300ms_latency | egress throughput | -0.01 | [-0.63, +0.61] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency | egress throughput | -0.01 | [-0.78, +0.76] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | -0.02 | [-0.80, +0.77] | 1 | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | -0.02 | [-0.31, +0.27] | 1 | Logs |
➖ | file_tree | memory utilization | -0.08 | [-0.14, -0.01] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.26 | [-1.04, +0.53] | 1 | Logs |
➖ | quality_gate_logs | % cpu utilization | -1.04 | [-3.94, +1.86] | 1 | Logs |
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -2.33 | [-3.22, -1.43] | 1 | Logs |
Bounds Checks: ✅ Passed
perf | experiment | bounds_check_name | replicates_passed | links |
---|---|---|---|---|
✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | intake_connections | 10/10 | |
✅ | quality_gate_logs | lost_bytes | 10/10 | |
✅ | quality_gate_logs | memory_usage | 10/10 |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we consider a change in performance a "regression" -- a change worth investigating further -- only if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
LGTM for files owned by agent-configuration
```
@@ -0,0 +1,148 @@
// Unless explicitly stated otherwise all files in this repository are licensed
```
I think we should avoid re-implementing a cache ourselves if possible.
Is there a well-known cache library we can use instead of a custom implementation?
It seems that https://github.com/jellydator/ttlcache could work for this case.
It already seems to be used in multiple places in datadog-agent.
I think it would work. I was originally looking at pkg/util/cache.go, which also implements a TTL cache, but it requires string keys, which is not useful for us. I missed ttlcache; it seems suitable.
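For reference, here is a minimal sketch of how ttlcache (v3) could back the blacklist, assuming pathtest hashes are uint64; the wrapper type and its method names are illustrative, not the actual implementation:

```go
package blacklist

import (
	"time"

	"github.com/jellydator/ttlcache/v3"
)

// Cache remembers hashes of low-value pathtests so they can be skipped
// until the corresponding entry expires.
type Cache struct {
	inner *ttlcache.Cache[uint64, struct{}]
}

// NewCache builds a TTL cache with a capacity bound and a per-entry TTL.
func NewCache(capacity uint64, ttl time.Duration) *Cache {
	c := ttlcache.New[uint64, struct{}](
		ttlcache.WithTTL[uint64, struct{}](ttl),
		ttlcache.WithCapacity[uint64, struct{}](capacity),
	)
	go c.Start() // background goroutine that evicts expired entries
	return &Cache{inner: c}
}

// Add records a pathtest hash as blacklisted for the default TTL.
func (c *Cache) Add(hash uint64) {
	c.inner.Set(hash, struct{}{}, ttlcache.DefaultTTL)
}

// Contains reports whether the hash is currently blacklisted.
func (c *Cache) Contains(hash uint64) bool {
	return c.inner.Has(hash)
}
```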
```go
// we only blacklist short traceroutes
if len(path.Hops) > s.config.MaxTTL {
	return false
}
```
Should we start with:
```diff
  // we only blacklist short traceroutes
- if len(path.Hops) > s.config.MaxTTL {
+ if len(path.Hops) > 2 {
      return false
  }
```
to only target paths with 1 or 2 hops? I think it will be safer that way, and we still cover >80%.
Sounds good, will remove this config
```go
	return false
}
// none of the intermediate hops should be reachable, otherwise it is a "useful" path
for i := range len(path.Hops) - 1 {
```
What about this simpler version? :)
```diff
- for i := range len(path.Hops) - 1 {
+ for i := range path.Hops {
```
The intent of this was to loop through all but the last hop. Since we are removing config.MaxTTL, it won't be necessary anymore. Will update.
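For illustration, a slice-based version of the all-but-last-hop check might look like the sketch below; the `Reachable` field name is an assumption about the hop type, not the actual payload definition, and it assumes path.Hops is non-empty:

```go
// none of the intermediate hops should be reachable, otherwise the path
// is "useful" and must not be blacklisted; the last hop (the target) is
// deliberately excluded from the scan.
for _, hop := range path.Hops[:len(path.Hops)-1] {
	if hop.Reachable { // hypothetical field
		return false
	}
}
```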
pkg/config/setup/config.go
config.BindEnvAndSetDefault("network_path.collector.blacklist_cache.capacity", 20000) | ||
config.BindEnvAndSetDefault("network_path.collector.blacklist_cache.duration", "2h") | ||
config.BindEnvAndSetDefault("network_path.collector.blacklist_cache.clean_interval", "10m") | ||
config.BindEnvAndSetDefault("network_path.collector.blacklist_scanner.enabled", true) | ||
config.BindEnvAndSetDefault("network_path.collector.blacklist_scanner.max_ttl", 2) | ||
config.BindEnvAndSetDefault("network_path.collector.blacklist_scanner.only_private_subnets", true) |
I think we should avoid exposing configs if possible.
We probably only need to expose:
- Enable/Disable this blacklist/filter/exclusion feature
- Expiration TTL
The other ones seem unneeded and can be kept hard-coded (the user can't change them).
WDYT?
That makes sense, I will remove these configs. What do you think about retaining a cache capacity config to limit memory consumption?
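If the surface is trimmed as suggested, the exposed settings could shrink to something like the sketch below; the key names are hypothetical, and the capacity entry assumes the follow-up question above is accepted:

```go
// Enable/disable the low-value pathtest blacklist feature.
config.BindEnvAndSetDefault("network_path.collector.blacklist.enabled", true)
// How long a blacklisted pathtest is skipped before being retried.
config.BindEnvAndSetDefault("network_path.collector.blacklist.ttl", "2h")
// Upper bound on cache entries, to limit memory consumption.
config.BindEnvAndSetDefault("network_path.collector.blacklist.capacity", 20000)
```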
```
@@ -337,6 +342,16 @@ func (s *npCollectorImpl) runTracerouteForPath(ptest *pathteststore.PathtestCont
	path.Namespace = s.networkDevicesNamespace
	path.Origin = payload.PathOriginNetworkTraffic

	if s.blacklistScanner.ShouldBlacklist(&path) {
```
I think "runTracerouteForPath" should not be responsible for this new logic.
I would suggestion replace s.runTracerouteForPath
with s.processPathtest()
, then in processPathtest()
have:
- a/
s.runTracerouteForPath()
that only runs the traceroute and returns the result pathtrace without sending the data to event platform - b/ evaluate if the result pathtrace should be discarded
- c/
s.forwardToEventPlatform()
: send payload to event platform
Good call, I will separate these out
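A minimal sketch of that separation, assuming hypothetical signatures for runTracerouteForPath, ShouldDiscard, forwardToEventPlatform, and the logger field:

```go
// processPathtest runs the traceroute, applies the blacklist decision,
// and only then forwards the result to the event platform.
func (s *npCollectorImpl) processPathtest(ptest *pathteststore.PathtestContext) {
	// a/ run the traceroute only; no event-platform side effects here.
	path, err := s.runTracerouteForPath(ptest)
	if err != nil {
		s.logger.Errorf("traceroute failed: %s", err) // hypothetical logger field
		return
	}

	// b/ decide whether this low-value path should be discarded.
	if s.blacklist.ShouldDiscard(&path) {
		s.blacklist.Add(ptest.Pathtest.GetHash())
		return
	}

	// c/ forward the payload to the event platform.
	s.forwardToEventPlatform(&path)
}
```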
```go
hash := ptest.Pathtest.GetHash()
s.blacklistCache.Add(hash)
s.pathtestStore.RemoveHash(hash)
// no need to do further processing since it's blacklisted
```
I think we should avoid completely removing an entry from pathtestStore while it's being processed.
It can possibly lead to subtle bugs.
Maybe we can just set PathtestContext.runUntil to 0; that way, it will be discarded automatically at the next flush.
RemoveHash and Flush both hold the contexts mutex, so it should be safe, no?
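For concreteness, the runUntil approach could look like the sketch below; the setter is hypothetical (runUntil is unexported), and it assumes the flush loop drops entries whose deadline is in the past:

```go
hash := ptest.Pathtest.GetHash()
s.blacklistCache.Add(hash)
// Instead of removing the entry from the store mid-processing, expire it
// in place: a zero deadline means the next flush discards it naturally.
ptest.SetRunUntil(time.Time{}) // hypothetical setter for the unexported runUntil field
```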
```go
// Blacklist of low-value pathtests
blacklistCache   *blacklist.Cache
blacklistScanner *blacklist.Scanner
```
I think it would be best if we only exposed one API instead of two (Cache + Scanner).
The caller doesn't need to be aware of the low-level cache.
Good point, I will refactor this
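One way the unified surface could look, wrapping the cache and scanner behind a single type; all names here are hypothetical and build on the Cache sketch above:

```go
package blacklist

import "github.com/DataDog/datadog-agent/pkg/networkpath/payload"

// Blacklist hides the TTL cache and the scan heuristics behind one API,
// so callers never touch the low-level cache directly.
type Blacklist struct {
	cache   *Cache
	scanner *Scanner
}

// ShouldSkip reports whether a pathtest hash is already blacklisted.
func (b *Blacklist) ShouldSkip(hash uint64) bool {
	return b.cache.Contains(hash)
}

// Evaluate inspects a completed path, blacklists it if it is low-value,
// and reports whether the caller should discard the result.
func (b *Blacklist) Evaluate(hash uint64, path *payload.NetworkPath) bool {
	if b.scanner.ShouldBlacklist(path) {
		b.cache.Add(hash)
		return true
	}
	return false
}
```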
I think the approach looks fine overall, but let's sync before merging this :)
I think we might want to make this feature generic enough so that we can easily handle various cases in the future.
What does this PR do?
This PR adds a cache storing known low-value pathtests so they are not repeated (currently for two hours by default).
Motivation
We want to reduce the total number of pathtraces done by dynamic paths, in order to reduce server load and the number of independent connections in NAT tables.
Describe how you validated your changes
New discard scanner tests:
Existing NP collector tests:
Possible Drawbacks / Trade-offs
Additional Notes