feat: Allow reading per-sample tags from fiber-local variable storage #203
Conversation
As a heads-up in case anyone reviews this before I push the fixes: I am testing this PR internally and finding a few minor bugs, which I will push fixes for sporadically as I find them. I have two pending fixes for edge cases I've discovered in testing with "real apps" and not just the test suite.
```diff
@@ -2,6 +2,7 @@
   require "stackprof/truffleruby"
 else
   require "stackprof/stackprof"
+  require "stackprof/tags"
```
When using this on a project on TruffleRuby, code using `StackProf::Tag` will fail (the const is not loaded). If I'm not mistaken we don't need this for TruffleRuby (as it has its own internal profilers); maybe we could stub it with a noop class and load that in that case.
Yeah, I think we'll probably need a stub. The usage of the `StackProf::Tag` helpers is intended to be throughout the codebase, so at least having the receivers available should prevent runtime errors, without needing to put every single call behind an `if truffleruby` guard. A no-op stub might look something like the sketch below.
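A minimal sketch of such a stub, assuming it only needs to expose no-op receivers; the method names (`with`, `clear`) are placeholders for illustration, not the PR's actual helper API:

```ruby
# Hypothetical no-op stand-in for StackProf::Tag on TruffleRuby.
# A real stub would mirror whatever helpers the actual module exposes.
module StackProf
  module Tag
    # Accept any tags, do nothing, and still yield so wrapped code runs unchanged.
    def self.with(*_args, **_tags)
      yield if block_given?
    end

    # Clearing tags is also a no-op under TruffleRuby.
    def self.clear; end
  end
end
```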
@tenderlove if you have the time. Sorry, it's a big one.
I really like this idea, but the patch is really large. 😅 Gecko format supports a "Marker chart", and I've been messing with converting Stackprof profiles to Gecko format. Gecko format has an idea of "markers", and the markers basically span an amount of time (and are specific to a thread). I've made a sample profile here that uses the markers to denote GC events (click "Marker chart"). The markers only need a time (in our case the monotonic clock) for the profile viewer to make sense of them, so could we have a Ruby API that just stores these in an array? Then when we dump the profile, include those markers? At least in Gecko format, the names of the markers can be arbitrary, and we can even include a bag of data with them.

Implementation is something like this:

```ruby
# Marker phase constants.
INSTANT = 0
INTERVAL = 1
START = 2
FINISH = 3

def start name, data
  # marker entry: [thread, timestamp, (nil placeholder), name, data, phase]
  Stackprof.markers << [Thread.current.object_id,
                        Process.clock_gettime(Process::CLOCK_MONOTONIC),
                        nil,
                        name,
                        data,
                        START]
end

def finish name, data
  Stackprof.markers << [Thread.current.object_id,
                        Process.clock_gettime(Process::CLOCK_MONOTONIC),
                        nil,
                        name,
                        data,
                        FINISH]
end
```

Usage is something like this:

```ruby
Stackprof.start("Request", { :controller => controller, :action => action })
# ...
Stackprof.finish("Request", { :controller => controller, :action => action })
```

I think this would be enough for Gecko. I'm not sure about the other formats (but to be honest, I don't know if I care that much? 😅)
@tenderlove thanks for taking a look, and yeah, sorry for the size. I wanted to show things working end to end, which is why I added a bunch of tests. The actual C code changes aren't as big, but I definitely understand not wanting to take on the baggage if not needed. The approach you've outlined sounds like a better way to address this. If the data is ultimately there, it should be possible to convert to formats other than gecko too. I'll close this and hopefully can build off your work with markers.
Great, and thanks for opening the PR. I hadn't thought of this, and it's definitely a great feature.
What
This adds support for applying tags to individual stackprof samples. Tags are read from fiber-local variable storage to add additional metadata to each sample collected. In brief, usage looks something like the hedged sketch below.
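This sketch is illustrative only: the `tags:` option is taken from the "How" section below, but the `StackProf::Tag.with` helper and the specific tag names are assumptions for illustration, not necessarily the final API.

```ruby
require "stackprof"

def slow_function; sleep 0.1; end
def fast_function; sleep 0.01; end

# Ask stackprof to collect these tag keys for each sample.
# :thread_id is one of the "built-in" tags described below; :function is a user tag.
profile = StackProf.run(mode: :wall, raw: true, tags: [:thread_id, :function]) do
  # Conceptually, tag values live in fiber-local variable storage and are read
  # by the signal handler whenever a sample is taken. The helper name is assumed.
  StackProf::Tag.with(function: "slow") { slow_function }
  StackProf::Tag.with(function: "fast") { fast_function }
end
```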
This will result in tags being added to the stackprof output, which can then be inspected per sample (via the `StackProf::Tags.from` helper described under Implementation below).
In practice, I hope the utility of this will be to tag samples with useful labels like `controller=HomeController`, `action=Welcome`, to analyze specific parts of application code by tagging the samples that contain them. Tags can be applied as wrappers around existing framework actions, and can potentially help correlate profiles to application domain areas.

In addition to user-specified context, some "built-in" context is also potentially useful. For now, there is only support for reading the current Ruby thread ID via `:thread_id` and the current fiber ID via `:fiber_id`. In future PRs, other builtins like "pid", "pthread id" (since it looks like we already store a handle to it?), "thread name", etc. could be added. These "built-in" tags do not count towards the user tag total, `StackProf::Tag::MAX_TAGS`.

Why
Although we can already apply tags at the profile level with stackprof's existing `metadata` field, we cannot tag at the granularity of individual samples. Other profiling formats, such as pprof, support sample-level tags.

With each tag applied to a sample, a new "index" to select on is available for aggregating samples sharing the same tag values. Open source tools like Grafana Phlare / Pyroscope can provide a means of querying the profiles based on these new dimensions.

For example, #194 is one use case served by this, since the gecko format makes threads a first-class citizen. I have included a proof of concept for this use case as well, using some existing golang converter code, but a future PR could generate a "gecko report output" using the tags to map to gecko thread IDs.
How
This is implemented as an optional extension to stackprof, following an approach parallel to how it collects frame samples. If the `tags` argument is not supplied, it should not change the behaviour of stackprof. If `tags` are supplied, then during stackprof's signal handler, it will check the thread that it buffered frames for, for matching fiber-local variables, and buffer whatever it finds.

Stackprof's design already involves periodically interrupting execution in order to collect stack traces. We take advantage of this interrupt to collect the tags at a point when VM state is already "paused" due to either the SIGPROF or SIGALRM interrupt firing, guaranteeing that they should be coherent with the stack trace, and thus there is no "data race" that could lead to tags being attributed to the wrong callchain.

Like how stackprof currently buffers frames, the tag buffering routine must be signal-safe. It only copies memory to a buffer in order to minimize the interrupt overhead; the existing delayed job for recording the frame buffer then ends by likewise recording the tags. This is where the buffer is recorded and reset, and heap space is allocated as necessary to maintain the array of tag recording entries. Just as stackprof optimizes for repeated identical callchains, the recording process compares the current tagset with the previously recorded one in order to count repeats and reduce redundancy.
Use cases
I've included converter.zip, which is extracted from another codebase and facilitates converting stackprof to pprof, and pprof to gecko, in order to demonstrate two use cases, as those formats already have some native support for the tags we can collect here.
Future PRs could add such output formats to stackprof natively, but to reduce the scope of this PR, I've included these PoC converters so that my results below can be externally replicated.
Querying on Tags
Here is a stackprof profile, generated from one of the tests here with tags, and here it is after converting it to pprof. We can verify the breakdown of tags with `go tool pprof -tags`.

Uploading this to Grafana Phlare (specifically via their brand-new /pyroscope/ingest endpoint, which appears to support sample-level tags in pprof), we can query on these tags.
The callchains match what we expect: chains with `slow_function` are tagged with `function: slow`, and chains with `fast_function` are tagged with `function: fast`. Note also that the displayed sample counts match the output from `go tool pprof -tags` earlier.
earlier.We can apply tags within framework like rails, to make this transparent. For instance, the following snippet (inspired by Pyroscope Ruby's tag auto-instrumentation):
This will result in being able to query on the tags "action" and "controller", to see where controllers are spending their time. Other frameworks that expose these sorts of hooks, or middlewares, can be used to apply and wrap with tags, potentially including other useful metadata such as request IDs.
Multi-threaded view
Here is a PoC for using these tags to power multi-threaded profile views. Viewing the resulting profile's flame graph in speedscope, we see both threads combined, but viewing the converted gecko format in profiler.firefox.com, they show up as separate threads in the timeline at the top, and, based on the sample order, we can see which thread was active.

Note that the gecko converter is heavily WIP, so many aspects of the visualization are off (e.g. it shows timestamps even though the profile is represented as just "samples", which is misleading). It is also naive about how much time each thread is active/inactive in the timeline; all time steps are assumed equal.
Implementation
Sample Tags
Sample tags are now stored on the stackprof struct as an array of mappings of string IDs, which are indices into the `:tag_strings` field, in the order that the strings were encountered when reading tag values. The `:sample_tags` field is an array of mixed types, effectively a flattened array of 2-tuples: even-numbered array elements indicate a tag set, and odd-numbered array elements indicate how many times the previous tag set repeats.
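Purely as an illustration of that encoding (the concrete values here are invented, and the exact shape is my reading of the description above rather than a verbatim dump), a profile with three identically tagged samples followed by one differently tagged sample might look roughly like:

```ruby
# String IDs index into :tag_strings; by convention ID 0 is the empty string "".
{
  tag_strings: ["", "thread_id", "1234", "function", "slow", "fast"],
  sample_tags: [
    { 1 => 2, 3 => 4 }, 3,  # {"thread_id" => "1234", "function" => "slow"}, repeated 3 times
    { 1 => 2, 3 => 5 }, 1   # {"thread_id" => "1234", "function" => "fast"}, repeated once
  ]
}
```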
Collection Overhead
The overhead added to each sample will be O(T), where T = cardinality of tags per sample. Currently this has a hardcoded maximum of 16 tags. For tag keys, the maximum string length is 128 bytes, and for tag values the maximum is 512 bytes. These are arbitrary limits, meant to prevent enormous strings from being used as tags while leaving some room for large unique IDs to be stored. Thus the maximum amount of string data read while buffering tags is up to 16 * 128 + 16 * 512 bytes, about 10 KB, copied to a buffer; in practice, far less data than this will usually be copied.

Values are read directly from existing variables, so most of the operations are type and nil checks, plus copying data around.
There are a few function calls added for each tag collected. The majority of the overhead is in the last step of recording each tag: this involves an `st_lookup`, but that lets us take advantage of its hash function to check whether we've seen the string before and quickly obtain its integer representation.

The following gist shows what the overhead looks like for tag vs frame buffer collection and recording, including a basic diff for collecting the overhead: https://gist.github.com/dalehamel/18a8f291864cee2bb3e34285f6f82c0c
It is a trivial example, but the key point is that the overhead will scale with the number of tags and the amount of tag data to be read, which the unit tests don't stress very hard.
Storage Overhead
We can assume that many string values will repeat, so rather than store the same string many times, we use an `st_table` with its string-hashing feature (a `strtable`) to detect whether we have seen a string before. The happy path is that the lookup succeeds and returns the unique integer ID for that string. If it is a new string never seen before, we assign it an ID and add it to the struct so future lookups will hit and get that ID. If a sample has lots of unique tag values (request ID, trace ID, span ID), this optimization is somewhat blunted, but it reduces the storage size of tags that are consistent across multiple samples.

Currently the data size of the `sample_tags` field scales with T = cardinality of tags per sample and S = cardinality of samples, so the maximum size of tag storage is T * S * sizeof(int) * 2.
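As an illustrative worst case (numbers invented for this example), 4 tags per sample across 1,000 samples with 4-byte ints would bound the tag index data at 4 * 1000 * 4 * 2 = 32,000 bytes, before the repeat optimization described next.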
However, we generally expect that tags will not change from sample to sample, and tags for contiguous samples should match. This allows us to optimize: if the tagset of the current sample is exactly equal to the tagset of the previous sample, the previous sample's `repeats` counter is incremented rather than inserting a new tagset.

Because the tags are not stored verbatim, a helper is needed to decode them: the data structure reported in stackprof's results requires further parsing, which the Ruby helper `StackProf::Tags.from(profile)` provides, mapping back to the tag values 1:1 per sample.
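Continuing the invented numbers from the earlier illustration, and with the caveat that the precise return shape of the helper is not shown on this page, decoding might look roughly like:

```ruby
# Expands the repeat-compressed :sample_tags back into one tag hash per sample.
tags_per_sample = StackProf::Tags.from(profile)
# => [
#   { "thread_id" => "1234", "function" => "slow" },
#   { "thread_id" => "1234", "function" => "slow" },
#   { "thread_id" => "1234", "function" => "slow" },
#   { "thread_id" => "1234", "function" => "fast" }
# ]
```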
Tag Strings
To optimize the storage usage of the tag collection, a single string array is maintained and appended to for each unique new string encountered. This is where the bulk of the added size will be for profiles featuring tags. It is a new field added to the stackprof struct which contains all unique tag strings. Note that by convention the first entry is the empty string, "", with an ID of 0, and so the numeric string IDs are indices into this 0-indexed array.
Known limitations
My current experience is that the new tests are generally not the source of CI failures, but I have seen them fail in unexpected ways in CI matrix testing, or after many repeated runs with a script used to try to "harden" them, along the lines of the sketch below.
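The original hardening script isn't shown here; this is a rough stand-in for the idea, and the rake invocation and test file name are assumptions.

```ruby
# Rerun the tag test suite many times, stopping at the first failure.
1_000.times do |i|
  ok = system("bundle exec rake test TEST=test/test_stackprof_tags.rb")
  abort "failed on iteration #{i}" unless ok
end
```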
Here are some of the rare failures that I see in CI which break assertions around the expected tag order in the tests:

- A sample missing `:thread_id`, even though it is configured to collect it (`1/143 samples did not contain the tag thread_id`)
- `StackProfTagsTest#test_sample_tag_persistence_from_parent`, where a sample is taken after the new thread has started, but before the thread-local variables are set within the child thread

I'm not sure what the root cause of each of these issues is yet, but I'll keep digging as best I can. The most concerning issue to me is the samples that lack any tags, which I cannot explain yet. However, I think this can be reasonably caveated with the disclaimer that tags are "best effort" and not guaranteed for a sample, but that if tags are collected, they are believed to be accurate.
One major issue with the new tests, however, is that they are substantially slower than the existing tests. The base stackprof suite runs in about 300ms, and the new tag suite runs in about 1500ms. This is due to a few slower tests for more complex cases involving nested threads, which ensure the correct tags are read under different concurrency situations. The majority of the single-threaded tests use `mode: :wall`, the default, which makes it easier to collect samples to make assertions on. However, for multi-threaded tests, we currently only signal the thread that called stackprof in `:wall` mode by default, so these tests aren't compatible with `mode: :wall`, and thus `mode: :cpu` is required. In order to make the desired frames appear in the tests, we need to do additional math to guarantee that stackprof will collect a snapshot of the work for us to make assertions on. In general, tests that use `mode: :wall` are fast and those with `mode: :cpu` are slower and more nuanced.
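As a rough illustration of the "additional math" needed in `:cpu` mode (the helper name and numbers here are illustrative, not lifted from the test suite):

```ruby
# In :cpu mode a frame only shows up if the thread burns enough CPU time inside
# it, so tests spin for a multiple of the sampling interval to guarantee samples.
def burn_cpu_for(microseconds)
  start = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID, :microsecond)
  until Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID, :microsecond) - start >= microseconds
    # busy loop; sleeping would not accumulate CPU time
  end
end

# With interval: 1000 (one sample per ~1ms of CPU time), burning ~10ms of CPU
# here should yield on the order of ten samples attributable to this frame.
burn_cpu_for(10_000)
```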