
Fix timestamp math: 1 million microseconds per second #122

Closed

Conversation

benbuckman
Contributor

@benbuckman benbuckman commented Nov 5, 2019

There appears to be a bug in the way timevals are handled in raw_timestamp_deltas: tv_sec (seconds) is multiplied by 1000 (yielding milliseconds) and then added directly to tv_usec (microseconds).

https://www.gnu.org/software/libc/manual/html_node/Elapsed-Time.html says,

time_t tv_sec
This represents the number of whole seconds of elapsed time.

long int tv_usec
This is the rest of the elapsed time (a fraction of a second), represented as the number of microseconds. It is always less than one million.

and of course, https://en.wikipedia.org/wiki/Microsecond,

A microsecond is an SI unit of time equal to one millionth (0.000001, or 10^-6, or 1/1,000,000) of a second.

Playing with this in a C REPL confirms this behavior.

I'm not entirely sure what the implication of this bug is.

  • Let's say the sampling interval is 1000 (microseconds), i.e. 1 millisecond. timestamp_delta in stackprof_record_sample should always have a tv_sec value of 0, so 1000 * 0 is 0 and the bug has no effect.
  • But if the sampling interval were much larger, say 5 seconds and 100 microseconds (tv_sec=5, tv_usec=100), then timestamp_delta = (1000 * 5) + 100 = 5,100, when it should be timestamp_delta = (1000000 * 5) + 100 = 5,000,100. If all the deltas were downscaled proportionally, the seconds part being 1/1000 of its correct value might not matter on its own (I don't fully understand how raw_timestamp_deltas is ultimately used), but the microseconds are not downscaled, so they skew the total 1000x more than they should.

Thank you

@NickLaMuro
Contributor

NickLaMuro commented Nov 6, 2019

@benbuckman First off, this isn't a proper "review", just some commentary from a "casual observer" (aka: me).

That said, I think it is worth pointing out that the case you describe in your second bullet point:

  • But if the sampling interval were much larger, say 5 seconds and 100 microseconds...

is impossible (well... almost) to reach as far as I can tell. Details below:


Worth quickly pointing out that this code was added here, and has remained mostly unmodified since:

#89: Include timestamp deltas to get more accurate sample timings

Which I believe was eventually added to support an external graphing tool, "speedscope", and isn't yet used by any facilities native to stackprof (as far as I am aware).

Now, I'm not saying we shouldn't fix it anyway, but to start off I wrote a much-needed test for this chunk of code:

  def test_raw_timestamp_deltas
    idle_default        = 0.2
    assertion_tolerance = 0.01

    # default interval (every 1000 microseconds)
    profile_1 = StackProf.run(mode: :wall, raw: true) do
      idle
    end

    delta_totals_in_seconds = profile_1[:raw_timestamp_deltas].inject(0, &:+) / 1000000.0
    assert_in_delta delta_totals_in_seconds, idle_default, assertion_tolerance

    # smaller interval (every 100 microseconds)
    profile_2 = StackProf.run(mode: :wall, raw: true, interval: 100) do
      idle
    end

    delta_totals_in_seconds = profile_2[:raw_timestamp_deltas].inject(0, &:+) / 1000000.0
    assert_in_delta delta_totals_in_seconds, idle_default, assertion_tolerance

    # larger interval (every 1 second, 5 second "sleep")
    profile_3 = StackProf.run(mode: :wall, raw: true, interval: 1000000) do
      idle 5.0  # FYI:  added an argument to `.idle` to make this work
    end

    delta_totals_in_seconds = profile_3[:raw_timestamp_deltas].inject(0, &:+) / 1000000.0
    assert_in_delta delta_totals_in_seconds, 5.0, assertion_tolerance
  end

Now, this failed on master, but not in the way you might expect. As you have probably guessed from the length and "copy-pasta" nature of the test above, I originally only included the profile_1 bit, but that passed with flying colors both with and without your changes. Had I read and comprehended your entire PR description instead of skimming it, I probably would have noticed this sooner...

Anyway, the spec actually fails on the third assertion, which is where I attempt to replicate the scenario in your second bullet point. However, it isn't because of the mismatch between micro and milli, but rather because no frames were captured at all. The cause of that seems to be how we translate the interval value into an interrupt in the C extension:

timer.it_interval.tv_sec = 0;
timer.it_interval.tv_usec = NUM2LONG(interval);

Where we are using the same timeval struct as you documented above. As the docs state, the tv_usec value "...is always less than one million.", and we zero out the tv_sec value, so basically no frames are captured, since the timer seems to be invalid when tv_usec is at or over one million. I assume the C timer implementation rejects or corrects for this in some way, which produces the invalid state.

If I update to :interval => 999999 and add a few debug puts statements to the tests:

  def test_raw_timestamp_deltas
    # ...

    # repeated for each profile in the spec
    puts
    puts "[#{profile_3[:raw_timestamp_deltas][0..10].join ', '}, ...]"
    puts profile_3[:raw_timestamp_deltas].count
    puts profile_3[:samples]
    puts profile_3[:frames].count
    puts
    delta_totals_in_seconds = profile_3[:raw_timestamp_deltas].inject(0, &:+) / 1000000.0
    assert_in_delta delta_totals_in_seconds, 5.0, assertion_tolerance
  end

I get some odd results:

$ rake test
Run options: --seed=1355

# Running:

.........
[1322, 903, 1082, 846, 977, 1022, 939, 1015, 1026, 746, 1038, ...]
200
200
22

[140, 63, 93, 91, 119, 94, 89, 89, 111, 88, 88, ...]
1918
1918
22

[5271, 995748, 2343, 3224, 997261, ...]
5
5
22

F...............

Finished in 5.681651s, 4.4001 runs/s, 13.7284 assertions/s.

  1) Failure:
StackProfTest#test_raw_timestamp_deltas [/stackprof/test/test_stackprof.rb:125]:
Expected |2.003847 - 5.0| (2.996153) to be <= 0.01.

25 runs, 78 assertions, 1 failures, 0 errors, 0 skips
rake aborted!

So with that said, I wonder if it makes more sense to drop the multiplication code altogether and instead just raise an error when the interval is too large. Maybe cap it at 50 samples per second (so an interval of 20000) or lower? At that point, there is really no value in using a sampling profiler if your samples are that far apart.

However, I guess if the timer is on the fritz and has a really long break between two samples, your code is a good thing to have in place. It's just pretty hard to test the edge case you are fixing here, unfortunately.

Thoughts?

@jlfwong
Collaborator

jlfwong commented Nov 6, 2019

Oh bah, I wrote the original code for this. I'm also not sure if this bug has any important implications, but I agree the existing code is definitely wrong.

@NickLaMuro
Contributor

@jlfwong Clearly I am not the only person who has notifications turned on for this repo 😄

But yeah, my intent in linking to your PR was not to "shame you" (in fact, I left your name out on purpose so you didn't get pinged unwillingly), but just to give some historical context on where the code came from and what its purpose was. My inability to write any legitimate C code myself should also blunt any criticism I might make, so there is that.

Also, I simply wanted to have a test in place for this section of the code. 😉

@benbuckman
Contributor Author

After more analysis, I think this is correct:

I'm also not sure if this bug has any important implications, but I agree the existing code is definitely wrong.

At worst, fixing this should be harmless.

@benbuckman
Contributor Author

benbuckman commented Nov 6, 2019

So with that said, I wonder if it makes more sense to instead drop the multiplication code all together, and instead just error when the interval size is too large. Maybe at 50 times a second (so 20000) or lower? At that point, there is really no value in using a sampling profiler if your samples are that far apart.

Interestingly, this dovetails with what I was actually trying to solve when I stumbled on this bug: detecting internally when the time to capture a single sample is >= the interval, in which case, as far as I can tell, the "postponed job" queue for collecting samples piles up indefinitely and the program hangs with the CPU pegged. (It's sort of the opposite problem: you're suggesting enforcing a maximum interval, while that bug occurs with a too-low interval. But they're related in the sense that the current code accepts any interval you give it, and maybe shouldn't.)

I'll write up a separate issue for that and link here, later today.

@benbuckman
Contributor Author

benbuckman commented Nov 6, 2019

Wrote up: Program hangs if sample-collection time is > interval, #123

I noticed the timestamp math in this PR while debugging and exploring solutions to that issue.

I appreciate the activity on this PR! Would y'all have some time to ponder the above issue as well?

Thank you

@benbuckman
Contributor Author

benbuckman commented Nov 7, 2019

For what it's worth – in #124, as part of fixing #123, I replace the timestamp math that's fixed here with a helper function that returns a long int of microseconds (see https://github.com/tmm1/stackprof/pull/124/files#diff-afcaa6674a4a59851d474ff9db9cf662R510), which I find to be simpler. So, up to the maintainers if you want to review these independently, and land this first, or let the other PR supersede this one.

@benbuckman
Contributor Author

I'm going to close this in favor of #124, which also fixes this bug.

@benbuckman benbuckman closed this Nov 11, 2019
@casperisfine
Contributor

Hum, in the end #124 was never finished. Could we get this PR merged in the meantime?

cc @tenderlove

@casperisfine
Contributor

Oh, it was somewhat fixed by #161; the bug is still there for platforms without a monotonic clock, though.

casperisfine pushed a commit to Shopify/stackprof that referenced this pull request Aug 27, 2021
casperisfine pushed a commit to Shopify/stackprof that referenced this pull request Aug 27, 2021
ivoanjo added a commit to DataDog/stackprof that referenced this pull request Sep 7, 2021
Hello, hello! 🎉

First the disclaimer (I guess it may be relevant): I work at @DataDog
on profiling for the [ddtrace](https://github.com/DataDog/dd-trace-rb)
gem, although we don't use `rb_profile_frames` (but it is quite
interesting).

I noticed that the fix for tmm1#122 in tmm1#163 actually missed the new
`_POSIX_MONOTONIC_CLOCK` branch, where the math was still incorrect
(used 1000 instead of 1000000 when attempting to convert seconds to
microseconds).

cc @djudd @casperisfine
albertarcuri pushed a commit to albertarcuri/stackprof that referenced this pull request Apr 26, 2022