-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix timestamp math: 1 million microseconds per second #122
Conversation
@benbuckman First off, this isn't a proper "review", just some commentary from a "casual observer" (aka: me). That said, I think it is worth pointing out that effectively, the case you have pointed out in your second bullet point:
is impossible (well... almost) to reach as far as I can tell. Details below: Worth quickly pointing out that this code was added, and remains mostly unmodified since, here: #89: Include timestamp deltas to get more accurate sample timings Which I believe was eventually used to provide support for a external graphing tool, "speedscope", and shouldn't be used by any facilities native to Now, not saying we shouldn't fix it anyway, but to start off, I wrote a much needed test for this chunk of code to start: def test_raw_timestamp_deltas
idle_default = 0.2
assertion_tolerance = 0.01
# default interval (every 1000 microseconds)
profile_1 = StackProf.run(mode: :wall, raw: true) do
idle
end
delta_totals_in_seconds = profile_1[:raw_timestamp_deltas].inject(0, &:+) / 1000000.0
assert_in_delta delta_totals_in_seconds, idle_default, assertion_tolerance
# smaller interval (every 500 microseconds)
profile_2 = StackProf.run(mode: :wall, raw: true, interval: 100) do
idle
end
delta_totals_in_seconds = profile_2[:raw_timestamp_deltas].inject(0, &:+) / 1000000.0
assert_in_delta delta_totals_in_seconds, idle_default, assertion_tolerance
# larger interval (every 2 seconds, 5 second "sleep")
profile_3 = StackProf.run(mode: :wall, raw: true, interval: 1000000) do
idle 5.0 # FYI: added an argument to `.idle` to make this work
end
delta_totals_in_seconds = profile_3[:raw_timestamp_deltas].inject(0, &:+) / 1000000.0
assert_in_delta delta_totals_in_seconds, 5.0, assertion_tolerance
end Now, this failed on Anyway, the spec actually fails on the third assertion, which is when I attempt to replicate the scenario in your second bullet point. However, it isn't because of the mismatch between stackprof/ext/stackprof/stackprof.c Lines 116 to 117 in 348cf5f
Where we are using the same If I update def test_raw_timestamp_deltas
# ...
# repeated for each profile in the spec
puts
puts "[#{profile_3[:raw_timestamp_deltas][0..10].join ', '}, ...]"
puts profile_3[:raw_timestamp_deltas].count
puts profile_3[:samples]
puts profile_3[:frames].count
puts
delta_totals_in_seconds = profile_3[:raw_timestamp_deltas].inject(0, &:+) / 1000000.0
assert_in_delta delta_totals_in_seconds, 5.0, assertion_tolerance
end I get some odd results: $ rake test
Run options: --seed=1355
# Running:
.........
[1322, 903, 1082, 846, 977, 1022, 939, 1015, 1026, 746, 1038, ...]
200
200
22
[140, 63, 93, 91, 119, 94, 89, 89, 111, 88, 88, ...]
1918
1918
22
[5271, 995748, 2343, 3224, 997261, ...]
5
5
22
F...............
Finished in 5.681651s, 4.4001 runs/s, 13.7284 assertions/s.
1) Failure:
StackProfTest#test_raw_timestamp_deltas [/stackprof/test/test_stackprof.rb:125]:
Expected |2.003847 - 5.0| (2.996153) to be <= 0.01.
25 runs, 78 assertions, 1 failures, 0 errors, 0 skips
rake aborted! So with that said, I wonder if it makes more sense to instead drop the multiplication code all together, and instead just error when the However, I guess if the timer is on the fritz and has a really long break in between two samples, your code is a good thing to have in place. Just pretty hard to test the edge case you are fixing here unfortunately. Thoughts? |
Oh bah, I wrote the original code for this. I'm also not sure if this bug has any important implications, but I agree the existing code is definitely wrong. |
@jlfwong Clearly I am not the only person who has notifications turned on for this repo 😄 But yeah, my intent on linking to your PR was not to "shame you" (in fact, I left your name out on purpose specifically so you didn't have to get pinged unwillingly), but just for some historical context on where the code came from and what it's purpose was. My inability to write any legitimate C code myself should also take away from any criticism I would potentially make, so there is that. Also, I just simply wanted to have a test for this in place for the section of the code. 😉 |
I think this is correct after more analysis:
At worst, fixing this should be harmless. |
Interestingly, this dovetails with what I was actually trying to solve when I stumbled on this bug, which is detecting internally when the time to capture a single sample is I'll write up a separate issue for that and link here, later today. |
Wrote up: Program hangs if sample-collection time is > interval, #123 I noticed the timestamp math in this PR while debugging and exploring solutions to that issue. I appreciate the activity on this PR! Would y'all have some time to ponder the above issue as well? Thank you |
For what it's worth – in #124, as part of fixing #123, I replace the timestamp math that's fixed here with a helper function that returns a |
I'm going to close this in favor of #124, which also fixes this bug. |
Hum, in the end #124 was never finished. Could we get this PR merged in the meantime? cc @tenderlove |
Oh, it was somewhat fixed by #161, the bug is still there for platforms without monotonic clock though. |
Pretty much a rebase tmm1#122
Pretty much a rebase tmm1#122
Hello, hello! 🎉 First the disclaimer (I guess it may be relevant): I work at @DataDog on profiling for the [ddtrace](https://github.com/DataDog/dd-trace-rb) gem, although we don't use `rb_profile_frames` (but it is quite interesting). I noticed that the fix for tmm1#122 in tmm1#163 actually missed the new `_POSIX_MONOTONIC_CLOCK` branch, where the math was still incorrect (used 1000 instead of 1000000 when attempting to convert seconds to microseconds). cc @djudd @casperisfine
Pretty much a rebase tmm1/stackprof#122
There appears to be a bug in the way
timeval
s are handled inraw_timestamp_deltas
:tv_sec
(seconds) is multiplied by 1000 (milliseconds) and then added totv_usec
(microseconds).https://www.gnu.org/software/libc/manual/html_node/Elapsed-Time.html says,
and of course, https://en.wikipedia.org/wiki/Microsecond,
Playing with this in a C repl confirms this behavior.
I'm not entirely sure what the implication of this bug is.
interval
is 1000 (microseconds), i.e. 1 millisecond.timestamp_delta
instackprof_record_sample
should always have atv_sec
value of 0. So1000 * 0
is 0 and it doesn't matter.tv_sec=5, tv_usec=100
), thentimestamp_delta = (1000 * 5) + 100 = 5,100
, but it should betimestamp_delta = (1000000 * 5) + 100 = 5,000,100
. Whether the seconds part itself being 1/1000 of the correct value, though, might not matter if they're all proportionately downscaled. (I don't fully understand howraw_timestamp_deltas
is ultimately used.) But the microsecond amount skews the total 1000x more than it should.Thank you