
Fix timestamp math: 1 million microseconds per second #122

Closed

Conversation

benbuckman
Contributor

@benbuckman benbuckman commented Nov 5, 2019

There appears to be a bug in the way timevals are handled in raw_timestamp_deltas: tv_sec (seconds) is multiplied by 1000 (yielding milliseconds) and then added directly to tv_usec (microseconds).

https://www.gnu.org/software/libc/manual/html_node/Elapsed-Time.html says,

time_t tv_sec
This represents the number of whole seconds of elapsed time.

long int tv_usec
This is the rest of the elapsed time (a fraction of a second), represented as the number of microseconds. It is always less than one million.

and of course, https://en.wikipedia.org/wiki/Microsecond,

A microsecond is an SI unit of time equal to one millionth (0.000001, or 10^-6, or 1/1,000,000) of a second.

Playing with this in a C REPL confirms this behavior.

I'm not entirely sure what the implication of this bug is.

  • Let's say the sampling interval is 1000 (microseconds), i.e. 1 millisecond. timestamp_delta in stackprof_record_sample should always have a tv_sec value of 0, so 1000 * 0 is 0 and the bug has no effect.
  • But if the sampling interval were much larger, say 5 seconds and 100 microseconds (tv_sec=5, tv_usec=100), then timestamp_delta = (1000 * 5) + 100 = 5,100, when it should be timestamp_delta = (1000000 * 5) + 100 = 5,000,100. If all the deltas were downscaled proportionally, the seconds part being 1/1000 of its correct value might not matter on its own (I don't fully understand how raw_timestamp_deltas is ultimately used), but the microseconds are not downscaled, so they skew the total 1000x more than they should.

Thank you

@NickLaMuro
Contributor

NickLaMuro commented Nov 6, 2019

@benbuckman First off, this isn't a proper "review", just some commentary from a "casual observer" (aka: me).

That said, I think it is worth pointing out that the case you describe in your second bullet point:

  • But if the sampling interval were much larger, say 5 seconds and 100 microseconds...

is impossible (well... almost) to reach as far as I can tell. Details below:


Worth quickly pointing out that this code was added here, and has remained mostly unmodified since:

#89: Include timestamp deltas to get more accurate sample timings

Which I believe was eventually added to support an external graphing tool, "speedscope", and isn't yet used by any facilities native to stackprof (as far as I am aware).

Now, I'm not saying we shouldn't fix it anyway, but to start off I wrote a much-needed test for this chunk of code:

  def test_raw_timestamp_deltas
    idle_default        = 0.2
    assertion_tolerance = 0.01

    # default interval (every 1000 microseconds)
    profile_1 = StackProf.run(mode: :wall, raw: true) do
      idle
    end

    delta_totals_in_seconds = profile_1[:raw_timestamp_deltas].inject(0, &:+) / 1000000.0
    assert_in_delta delta_totals_in_seconds, idle_default, assertion_tolerance

    # smaller interval (every 100 microseconds)
    profile_2 = StackProf.run(mode: :wall, raw: true, interval: 100) do
      idle
    end

    delta_totals_in_seconds = profile_2[:raw_timestamp_deltas].inject(0, &:+) / 1000000.0
    assert_in_delta delta_totals_in_seconds, idle_default, assertion_tolerance

    # larger interval (every 1 second, 5 second "sleep")
    profile_3 = StackProf.run(mode: :wall, raw: true, interval: 1000000) do
      idle 5.0  # FYI:  added an argument to `.idle` to make this work
    end

    delta_totals_in_seconds = profile_3[:raw_timestamp_deltas].inject(0, &:+) / 1000000.0
    assert_in_delta delta_totals_in_seconds, 5.0, assertion_tolerance
  end

Now, this failed on master, but not in the way you might expect. As you have probably guessed from the length and "copy-pasta" nature of the test above, I originally only included the profile_1 bit, but that passed with flying colors both with and without your changes. Had I read and comprehended your entire PR description instead of skimming it, I probably would have noticed this sooner...

Anyway, the spec actually fails on the third assertion, which is where I attempt to replicate the scenario in your second bullet point. However, it isn't because of the mismatch between micro and milli, but rather because no frames were captured at all. The cause of that seems to be how we translate the interval value into an interrupt in the C extension:

timer.it_interval.tv_sec = 0;
timer.it_interval.tv_usec = NUM2LONG(interval);

Where we are using the same timeval struct as you documented above. As the docs state, the tv_usec value "...is always less than one million.", and we zero out the tv_sec value, so basically no frames are captured, since the timer seems to be invalid when tv_usec is at or over one million. I assume the C timer implementation rejects or corrects for this in some way, which produces the invalid state.

If I update to :interval => 999999 and add a few debug puts statements to the tests:

  def test_raw_timestamp_deltas
    # ...

    # repeated for each profile in the spec
    puts
    puts "[#{profile_3[:raw_timestamp_deltas][0..10].join ', '}, ...]"
    puts profile_3[:raw_timestamp_deltas].count
    puts profile_3[:samples]
    puts profile_3[:frames].count
    puts
    delta_totals_in_seconds = profile_3[:raw_timestamp_deltas].inject(0, &:+) / 1000000.0
    assert_in_delta delta_totals_in_seconds, 5.0, assertion_tolerance
  end

I get some odd results:

$ rake test
Run options: --seed=1355

# Running:

.........
[1322, 903, 1082, 846, 977, 1022, 939, 1015, 1026, 746, 1038, ...]
200
200
22

[140, 63, 93, 91, 119, 94, 89, 89, 111, 88, 88, ...]
1918
1918
22

[5271, 995748, 2343, 3224, 997261, ...]
5
5
22

F...............

Finished in 5.681651s, 4.4001 runs/s, 13.7284 assertions/s.

  1) Failure:
StackProfTest#test_raw_timestamp_deltas [/stackprof/test/test_stackprof.rb:125]:
Expected |2.003847 - 5.0| (2.996153) to be <= 0.01.

25 runs, 78 assertions, 1 failures, 0 errors, 0 skips
rake aborted!

So with that said, I wonder if it makes more sense to drop the multiplication code altogether and instead just raise an error when the interval is too large. Maybe cap it at 50 samples per second (so an interval of 20000) or lower? At that point, there is really no value in using a sampling profiler if your samples are that far apart.

However, I guess if the timer is on the fritz and has a really long break between two samples, your code is a good thing to have in place. It's just pretty hard to test the edge case you are fixing here, unfortunately.

Thoughts?

@jlfwong
Collaborator

jlfwong commented Nov 6, 2019

Oh bah, I wrote the original code for this. I'm also not sure if this bug has any important implications, but I agree the existing code is definitely wrong.

@NickLaMuro
Contributor

@jlfwong Clearly I am not the only person who has notifications turned on for this repo 😄

But yeah, my intent in linking to your PR was not to "shame you" (in fact, I left your name out on purpose so you didn't get pinged unwillingly), but just to give some historical context on where the code came from and what its purpose was. My inability to write any legitimate C code myself should also blunt any criticism I might make, so there is that.

Also, I simply wanted to have a test in place for this section of the code. 😉

@benbuckman
Contributor Author

After more analysis, I think this is correct:

I'm also not sure if this bug has any important implications, but I agree the existing code is definitely wrong.

At worst, fixing this should be harmless.

@benbuckman
Contributor Author

benbuckman commented Nov 6, 2019

So with that said, I wonder if it makes more sense to instead drop the multiplication code all together, and instead just error when the interval size is too large. Maybe at 50 times a second (so 20000) or lower? At that point, there is really no value in using a sampling profiler if your samples are that far apart.

Interestingly, this dovetails with what I was actually trying to solve when I stumbled on this bug: detecting internally when the time to capture a single sample is >= the interval, in which case, as far as I can tell, the "postponed job" queue for collecting samples piles up indefinitely and the program hangs with the CPU pegged. (It's sort of the opposite problem: you're suggesting enforcing a maximum interval, while that bug occurs with a too-low interval. But they're related in the sense that the current code accepts any interval you give it, and maybe shouldn't.)

I'll write up a separate issue for that and link here, later today.

@benbuckman
Contributor Author

benbuckman commented Nov 6, 2019

Wrote up: Program hangs if sample-collection time is > interval, #123

I noticed the timestamp math in this PR while debugging and exploring solutions to that issue.

I appreciate the activity on this PR! Would y'all have some time to ponder the above issue as well?

Thank you

@benbuckman
Contributor Author

benbuckman commented Nov 7, 2019

For what it's worth – in #124, as part of fixing #123, I replace the timestamp math that's fixed here with a helper function that returns a long int of microseconds (see https://github.com/tmm1/stackprof/pull/124/files#diff-afcaa6674a4a59851d474ff9db9cf662R510), which I find to be simpler. So, up to the maintainers if you want to review these independently, and land this first, or let the other PR supersede this one.

@benbuckman
Contributor Author

I'm going to close this in favor of #124, which also fixes this bug.

@benbuckman benbuckman closed this Nov 11, 2019
@casperisfine
Contributor

Hum, in the end #124 was never finished. Could we get this PR merged in the meantime?

cc @tenderlove

@casperisfine
Contributor

Oh, it was somewhat fixed by #161; the bug is still there for platforms without a monotonic clock, though.

casperisfine pushed a commit to Shopify/stackprof that referenced this pull request Aug 27, 2021
casperisfine pushed a commit to Shopify/stackprof that referenced this pull request Aug 27, 2021
ivoanjo added a commit to DataDog/stackprof that referenced this pull request Sep 7, 2021
Hello, hello! 🎉

First the disclaimer (I guess it may be relevant): I work at @DataDog
on profiling for the [ddtrace](https://github.com/DataDog/dd-trace-rb)
gem, although we don't use `rb_profile_frames` (but it is quite
interesting).

I noticed that the fix for tmm1#122 in tmm1#163 actually missed the new
`_POSIX_MONOTONIC_CLOCK` branch, where the math was still incorrect
(used 1000 instead of 1000000 when attempting to convert seconds to
microseconds).

cc @djudd @casperisfine
albertarcuri pushed a commit to albertarcuri/stackprof that referenced this pull request Apr 26, 2022