- Description: The `Merge` function in the `quantile` package, located in `/code/quantile/stream.go`, is implemented incorrectly and is explicitly marked as broken in comments and test cases within the codebase. This function is intended to merge samples from multiple quantile streams. However, the current implementation uses a flawed approach that does not correctly merge compressed summaries, leading to inaccurate quantile calculations when querying the merged stream. Specifically, the `stream.merge` function uses an incorrect insertion loop with `goto`-based control flow and fails to properly recalculate metadata like `delta` and `width`, which are essential for maintaining quantile error bounds. An attacker could exploit this vulnerability indirectly by manipulating the input data streams that are intended to be merged. By providing crafted sets of samples to be merged, an attacker can cause significant deviations in the calculated quantiles compared to the true quantiles of the combined dataset. This is because the merge algorithm incorrectly handles pre-compressed summaries instead of processing raw data points, producing a corrupted representation of the combined data distribution.
- Impact: Incorrect quantile values are produced after merging streams. If an application relies on the accuracy of these merged quantiles for security-sensitive decisions, such as anomaly detection, rate limiting, access control based on percentile thresholds, or any other decision-making or alerting logic based on statistical summaries, this vulnerability can lead to security bypasses or misconfigurations. For example, in anomaly detection systems using percentile thresholds for request latency, an inaccurate merged quantile could lead to missed anomalies (false negatives if the merged quantile is lower than actual) or false alarms (false positives if it is higher than actual). Ultimately, the integrity of data analysis and any dependent security mechanisms is compromised when the broken `Merge` function is used.
- Vulnerability Rank: High
- Currently Implemented Mitigations: None. There are no functional safeguards or input validations implemented to mitigate this issue. The code itself is flawed, and the problem is acknowledged through comments within the code and broken tests in the test suite. Comments in `/code/quantile/stream.go` explicitly warn that the `Merge` function is broken and produces incorrect results. Correspondingly, tests related to merging, such as `BrokenTestTargetedMerge`, `BrokenTestLowBiasedMerge`, and `BrokenTestHighBiasedMerge` in `/code/quantile/stream_test.go`, are marked as "BrokenTest", indicating they are designed to fail and highlight the known issue.
- Missing Mitigations:
  - The `Merge` function needs to be reimplemented correctly, following a proper algorithm for merging either raw data points from all streams or compressed quantile summaries, as described in research papers like "Effective Computation of Biased Quantiles over Data Streams".
  - Alternatively, if a correct implementation is not feasible in the short term, the `Merge` function should be removed, or clearly deprecated and marked as not recommended for use, especially in scenarios requiring accurate quantile merging.
  - Consider adding a warning or error message when the `Merge` function is called to alert users about the potential for inaccurate results.
  - Comprehensive documentation should advise users against relying on the `Merge` function for accurate quantile calculations and suggest alternative approaches if stream merging is necessary.
  - Input validation or access control measures could be considered if the `Merge` function is exposed in a way that allows external entities to influence the input streams. However, the primary mitigation should focus on fixing the underlying algorithm.
- Preconditions:
  - The application must use the `Merge` function of the `quantile` package to combine quantile streams from different sources.
  - The security logic or application functionality must depend on the accuracy of the quantiles calculated after merging streams.
  - While direct attacker interaction with the `Merge` function might not be typical, an attacker needs to be able to influence the data streams that are being merged. This could be achieved by injecting malicious data points into one or more of the streams, or by manipulating the data distribution of the streams being merged through other means, depending on the application's architecture and data flow.
- Source Code Analysis:
  - File: `/code/quantile/stream.go`
  - Function: `Stream.Merge(samples Samples)`
    - The `Stream.Merge` function is the entry point for merging samples into a quantile stream. It first sorts the input `samples` and then calls the internal `stream.merge` function.
    - The function's documentation explicitly warns: "ATTENTION: This method is broken and does not yield correct results. The underlying algorithm is not capable of merging streams correctly."

    ```go
    // File: /code/quantile/stream.go

    // ATTENTION: This method is broken and does not yield correct results.
    // The underlying algorithm is not capable of merging streams correctly.
    func (s *Stream) Merge(samples Samples) {
    	sort.Sort(samples)
    	s.stream.merge(samples)
    }
    ```

  - Function: `stream.merge(samples Samples)`
    - This function contains the flawed merging logic. The code iterates through the input samples and inserts them into the stream's internal sample list (`s.l`).
    - A prominent `TODO` comment highlights the core issue: "This tries to merge not only individual samples, but whole summaries. The paper doesn't mention merging summaries at all. Unittests show that the merging is inaccurate. Find out how to do merges properly." This indicates that the current approach of merging pre-compressed summaries is incorrect and deviates from the intended algorithm described in the research paper.
    - Within the merging loop, the code attempts to calculate `delta` during insertion, but the calculation is also marked with a "How to calculate delta correctly?" `TODO` comment, further indicating uncertainty and potential errors in the implementation.
    - The use of `goto inserted` for insertion control flow is unusual and makes the logic harder to follow and verify for correctness.

    ```go
    // File: /code/quantile/stream.go
    func (s *stream) merge(samples Samples) {
    	// TODO(beorn7): This tries to merge not only individual samples, but
    	// whole summaries. The paper doesn't mention merging summaries at
    	// all. Unittests show that the merging is inaccurate. Find out how to
    	// do merges properly.
    	var r float64
    	i := 0
    	for _, sample := range samples {
    		for ; i < len(s.l); i++ {
    			c := s.l[i]
    			if c.Value > sample.Value {
    				// Insert at position i.
    				s.l = append(s.l, Sample{})
    				copy(s.l[i+1:], s.l[i:])
    				s.l[i] = Sample{
    					sample.Value,
    					sample.Width,
    					math.Max(sample.Delta, math.Floor(s.ƒ(s, r))-1),
    					// TODO(beorn7): How to calculate delta correctly?
    				}
    				i++
    				goto inserted
    			}
    			r += c.Width
    		}
    		s.l = append(s.l, Sample{sample.Value, sample.Width, 0})
    		i++
    	inserted:
    		s.n += sample.Width
    		r += sample.Width
    	}
    	s.compress()
    }
    ```

  - Test Files: `/code/quantile/stream_test.go`
    - The test suite includes several tests marked as "BrokenTest", such as `BrokenTestTargetedMerge`, `BrokenTestLowBiasedMerge`, and `BrokenTestHighBiasedMerge`. These tests are specifically designed to exercise the `Merge` function but are expected to fail due to the known issues, confirming the vulnerability.
- Security Test Case:
  - Setup: Create a Go test application that imports the `github.com/beorn7/perks/quantile` package.
  - Initialize Streams: Create two separate `quantile.Stream` objects, `stream1` and `stream2`, using `quantile.NewTargeted` with the same target quantiles (e.g., `map[float64]float64{0.90: 0.01}`). This ensures both streams are configured to approximate the same quantiles with the same error tolerance.
  - Populate Streams: Populate `stream1` with one set of numerical data points, for example the integers from 1 to 100, and `stream2` with a different set, for example the integers from 101 to 200. The datasets should be distinct to demonstrate the effect of merging.
  - Calculate True Quantile: Create a combined dataset by concatenating the data points used to populate `stream1` and `stream2`. Calculate the true 90th percentile of this combined dataset using a standard method (e.g., sorting and selecting the element at the 90th percentile rank). Call this `true_p90`; it represents the accurate 90th percentile of the merged data.
  - Merge Streams: Merge `stream2` into `stream1` using `stream1.Merge(stream2.Samples())`. This invokes the broken `Merge` function.
  - Query Merged Quantile: Query the 90th percentile from the merged stream using `stream1.Query(0.90)`. Call this `merged_p90`; it is the quantile calculated by the flawed merge function.
  - Compare Quantiles: Compare `merged_p90` with `true_p90` and calculate the absolute difference `|merged_p90 - true_p90|`.
  - Verification: If the absolute difference `|merged_p90 - true_p90|` is significantly larger than the expected error margin for quantile approximation (which is determined by the epsilon values provided to `NewTargeted`), it confirms the inaccuracy caused by the broken `Merge` function. The significance threshold depends on the epsilon values and dataset size; for demonstration, a difference exceeding 10% of `true_p90` can be considered significant.
  - Repeat with Variations: To further demonstrate the vulnerability's general nature, repeat the previous steps with different datasets (e.g., different distributions, sizes, or overlaps) and different target quantiles (e.g., the 50th and 99th percentiles). This shows the inconsistency and unreliability of the `Merge` function across various scenarios.
  - Output Observation: Print the values of `true_p90` and `merged_p90`. Observe and document the significant deviation, confirming that the `Merge` function produces inaccurate quantile estimations when merging streams.
- Vulnerability Name: Top‑K Data Poisoning via Flawed Minimum Element Tracking
- Description: The Top‑K algorithm implementation in `/code/topk/topk.go` is vulnerable to data poisoning due to a flaw in how it tracks the minimum element for replacement within its frequency map. The algorithm maintains a frequency count for string elements and aims to keep track of the top K most frequent elements. It uses a pointer `s.min` to an `Element` intended to represent the element with the minimum count in the monitored map. However, `s.min` is initialized incorrectly in the `New` constructor to a new, empty `Element` with a zero count. Crucially, this `s.min` pointer is not updated correctly during subsequent insertions. When the internal map reaches its capacity (`k+1` entries), the algorithm unconditionally selects the element pointed to by `s.min` for replacement. Because `s.min` remains the initially created empty element, it does not accurately reflect the actual minimum element in the map. An attacker who can inject a large number of distinct string values into the Top‑K stream, especially after the map is nearly full (size `k+1`), can force the algorithm to repeatedly execute this flawed replacement logic. This allows the attacker to pollute or entirely control the top‑K results by inserting attacker-controlled strings while legitimate, frequent items are incorrectly evicted due to the flawed minimum tracking.
- Impact: Exploiting this vulnerability enables an attacker to manipulate the outcome of the top‑K calculation. In applications that rely on top‑K results for critical functionality, such as ranking systems, trending analytics, content prioritization, recommendation engines, or any system displaying or using the most frequent items, this data poisoning can lead to a misrepresentation of the most popular or important items. This manipulation can influence downstream processes, user interfaces, and decision-making based on the inaccurate top‑K results. For instance, an attacker could promote malicious or irrelevant content to the top of a trending list, or influence recommendations to favor attacker-chosen items, undermining the intended purpose of the Top‑K algorithm.
- Vulnerability Rank: High
- Currently Implemented Mitigations: There are no implemented mitigations or safeguards within the Top‑K module to address this flaw. The incorrect initialization and update logic of `s.min` are inherent in the `insert` method within `/code/topk/topk.go`. No input sanitization, frequency count integrity checks, or alternative minimum tracking mechanisms are in place to prevent or correct this vulnerability. The flawed minimum element tracking is a fundamental design issue in the current implementation.
- Missing Mitigations:
  - Correct Minimum Tracking: The primary missing mitigation is a correct and robust mechanism for tracking the element with the minimum count. This could involve:
    - Proper Initialization of `s.min`: Initialize `s.min` to point to the actual minimum element in the map after the first few insertions, rather than to an empty element.
    - Dynamic Update of `s.min`: Update `s.min` whenever an element is inserted or its count is incremented, so that it always points to the current minimum element in the map.
    - Min-Heap Structure: Consider replacing the simple `s.min` pointer with a dedicated min-heap data structure to efficiently track and retrieve the minimum-count element. A min-heap provides logarithmic-time insertion and deletion of the minimum, and constant-time access to the minimum, which is more efficient and robust for tracking the minimum of a dynamic set.
  - Input Validation and Frequency Count Integrity Checks: While not directly addressing the `s.min` flaw, input validation and frequency count integrity checks could add a layer of defense against malicious inputs designed to exploit this vulnerability. However, the core issue lies in the flawed minimum tracking logic.
- Preconditions:
  - An external attacker must be able to submit a large volume of distinct string inputs into the Top‑K stream. Specifically, the flaw becomes exploitable once the internal map within the Top‑K stream reaches its capacity of `k+1` entries, as this triggers the flawed replacement branch in the `insert` method.
  - The Top‑K functionality must be exposed or used in a context where external inputs directly determine the elements being processed and counted by the Top‑K algorithm. This could be through an API endpoint, a data ingestion pipeline, or any system where attacker-controlled strings can be fed into the Top‑K stream.
- Source Code Analysis:
  - File: `/code/topk/topk.go`
  - Constructor: `New(k int)`
    - In the `New` constructor, the `min` field of the `topkStream` struct is initialized to a new `Element{}`. This creates an empty `Element` with default values: a zero `Count` and an empty-string `Key`. This is the root of the problem, as `s.min` starts out as an invalid minimum candidate.

    ```go
    // File: /code/topk/topk.go
    func New(k int) *topkStream {
    	return &topkStream{
    		k:   k,
    		m:   make(map[string]*Element, k+1),
    		min: &Element{}, // Incorrect initialization of s.min
    	}
    }
    ```

  - Method: `insert(key string)`
    - The `insert` method handles the logic for adding new keys and updating counts. When a new key is inserted and the map size exceeds `k` (i.e., reaches `k+1` entries), the algorithm enters the replacement branch.
    - In this branch, it deletes the element currently pointed to by `s.min` from the map (`delete(s.m, s.min.Key)`) and reuses `s.min` to store the new key and increment its count (`s.min.Key = key`, `s.min.Count++`, `s.m[key] = s.min`).
    - Flawed Minimum Update: The critical flaw is in how `s.min` is subsequently updated. The code checks `if e.Count < s.min.Count { s.min = e }`. However, `s.min` has just been reassigned to the newly inserted element, and its count has been incremented to 1 (or more, if the key existed before). The comparison `e.Count < s.min.Count` is almost always false because `s.min` now points to a recently inserted item, not the actual minimum in the map. Therefore, `s.min` is rarely updated correctly after the initial incorrect initialization.

    ```go
    // File: /code/topk/topk.go
    func (s *topkStream) insert(key string) {
    	e, ok := s.m[key]
    	if ok {
    		e.Count++
    	} else {
    		e = &Element{Key: key, Count: 1}
    		s.m[key] = e
    		if len(s.m) > s.k {
    			delete(s.m, s.min.Key) // Deletes element pointed to by s.min (incorrect minimum)
    			s.min.Key = key        // Reassigns s.min to the new element
    			s.min.Count++          // Increments count of s.min
    			s.m[key] = s.min       // Updates map with reassigned s.min
    		}
    	}
    	if s.min == nil || e.Count < s.min.Count { // Flawed condition: s.min rarely updated correctly
    		s.min = e // Incorrectly attempts to update s.min
    	}
    }
    ```

  - Result: Due to this flawed logic, `s.min` essentially becomes stuck pointing to an element inserted early in the process, or remains the initially empty element. Subsequent replacements always target this incorrect "minimum" element, allowing an attacker to inject new, distinct keys and evict legitimate, more frequent items from the top‑K results.
- Security Test Case:
  - Setup: Create a Go test application that imports the `github.com/beorn7/perks/topk` package.
  - Initialize Top‑K Stream: Create a `topk.Stream` instance with a specified `k` value (e.g., `k = 10`) using `topk.New(k)`. This creates a Top‑K stream that tracks the top 10 most frequent items.
  - Insert Legitimate Items: Insert a set of "legitimate" items multiple times into the Top‑K stream to simulate high-frequency entries that should ideally appear in the top‑K results. For example, insert strings like "item1", "item2", and "item3" each hundreds or thousands of times. This establishes a baseline of expected top items.
  - Inject Attacker-Controlled Strings: Inject a large number of distinct, attacker-controlled strings into the Top‑K stream. These strings should differ from the legitimate items and from each other. Continue injecting distinct strings until the monitored map within the Top‑K stream exceeds `k+1` entries, forcing the flawed replacement branch in the `insert` method to activate repeatedly.
  - Query Top‑K Results: Invoke the `Query()` method on the Top‑K stream to retrieve the current top‑K results. This returns a sorted list of `Element` structs representing the top K items and their counts as determined by the flawed algorithm.
  - Verify Data Poisoning: Examine the output of `Query()`. Verify whether the top‑K results now include attacker-supplied items that should not have been part of the expected top‑K list based on the initial legitimate items. Specifically, check whether legitimate items that were inserted many times have been replaced by the attacker-injected strings, even though the attacker strings were inserted fewer times in total.
  - Quantify Impact: Compare the retrieved top‑K list with the expected top‑K list (which should consist primarily of the legitimate items inserted earlier). Count how many attacker-supplied items are present in the returned top‑K list and how many legitimate items are missing or have been pushed out. A significant discrepancy, where attacker items appear in the top‑K despite being less frequent overall, confirms the exploitation of the flawed minimum tracking and demonstrates data poisoning.
  - Output Observation: Print the top‑K results obtained from `Query()`. Highlight the presence of attacker-injected strings in the top‑K list and the absence or reduced ranking of legitimate items, demonstrating successful data poisoning of the Top‑K results due to the flawed `s.min` tracking mechanism.