Possible aberration with the new CPU metrics on RTC counter reset (usage % above 100%) #1299
Comments
Yes, it is possible for the utilization number to go over 100%. I swear I had documentation for this somewhere. Depending on how VMware exposes various MSRs and what happens when the VM can't be scheduled, your guess is as good as mine. All I can say is that when a system is under moderate all-core load, the old-style counters mask how far the system is from being completely utilized.
I would agree with you if the usage (with the classic counters) around the artefact time also showed a spike (capped at 100%), but:
Even if I cap the value at 100%, comparing it to the Task Manager on the computer, the value is too different during those events, and our engineers would not understand why there is a spike every few hours that does not register elsewhere. Do you think there is a way to prevent this?
Ah, I see what you are saying; I was referring to a separate issue. I do also see this issue with our own fleet, but I'm rarely zooming in at this kind of resolution to notice. I don't want to speak for the maintainers of the project, but I think the philosophy has been to expose any data warts-and-all, avoid storing state between scrapes, and handle any unpleasantness in PromQL and Grafana. Assuming I'm correct, I see two likely ways out of this:
I agree on the "expose the data as raw as possible" approach, and Option 1 would really be ideal. Reading Option 2 again: is the value supposed to be constant even with turbo spikes, etc.?
Yes, the value should be constant, so I don't think it would affect alert latency.
Ok, I'll try to PoC that. Thanks for the clarification.
@JDA88 in the end, how did you finalize those graphs?
I never managed to use those new metrics, as stated above, I suppose.
@breed808 Thanks for your response. You saved me a lot of time :) Will review with the existing counters.
Hi all, I'm currently wondering how windows_exporter could solve this issue. Reading https://www.cpuid.com/softwares/hwmonitor.html, it's expected that Windows can report more than 100% CPU.
True, but the Task Manager caps the usage at 100%, and the issue here is that the
The value should be on another metric and be a constant, not a counter; that way you can replace:
The only issue I can see is if the
Then, let's do the same in the Prometheus queries. Use the min function to cap it, like
|
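A sketch of such a capped query, using PromQL's clamp_max (assuming the utility/RTC ratio already yields a percentage, as in the queries elsewhere in this thread):

```promql
clamp_max(
  rate(windows_cpu_processor_utility_total{core="0,0"}[1m])
    / rate(windows_cpu_processor_rtc_total{core="0,0"}[1m]),
  100
)
```

This hides the >100% artefacts but, as noted below, it does not remove the spurious spikes caused by the counter reset itself.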
There is a misunderstanding on the issue. The graph below shows it perfectly: in yellow, the reality on the server (constant CPU usage around 80%); in green, those new metrics and the issue with rate. You see a spike to 100%+ which is NOT real, as the test process is limited to 80% usage on the server.
I feel that this is expected behavior, which Microsoft and other tools describe as a known issue. In the FAQ of HWMonitor, they describe that the issue appears when using the utility counters instead of the time counters. With the utility-based counters, it's expected to get a result of more than 100%; time-based counters don't have that issue. That's the reason why
I have the feeling that the windows_cpu_processor_utility_total counter reports some gaps in its increase, which leads to this situation. If windows_exporter used the calculated values offered by Windows, the 100% would be exceeded as well.
For us, having regular spikes (even capped at 100%) that do not exist in reality makes the counter useless, at least for computing % usage.
@JDA88 If I understand correctly, the counter reset is the issue? For the 0.30 release, I plan to offer an option to use the Performance Counter API (pdh.dll) to collect the values (#1350). Currently, windows_exporter scrapes the data from the Windows Registry directly, reading the binary data itself. I would like to come back to this issue once the new option is available, just to exclude an overflow issue somewhere in the old code.

What would really help to get deeper into this issue would be a perf counter log. As far as I know, the built-in Performance Monitor tool can monitor and track the counter values. Then we can identify whether Performance Monitor shows similar peaks too, and also compare whether the values reported by windows_exporter are equal to those from the Windows tools. My knowledge is limited here, but I found this: https://serverfault.com/questions/1012544/how-can-i-monitor-cpu-usage-and-processes-on-windows-server-2012-as-a-service-ov
Just one of those things in Windows that are useless.
To be clear: we don't even know what those arbitrary bytes in the registry are really supposed to represent; that it is the RTC was just an educated guess. Either way, the width of the integer being used is very narrow. If we can cleanly export the value as a constant (which it should be on any hardware I've tested), that would be ideal; I just felt the correct approach was to take what Windows gave us, garbage or otherwise! If I use the max_over_time trick with a 3h backward look, it is acceptable for my usage at least (the RTC counter rolled over at 12:08, 10:14 and 08:19).
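One plausible reading of the max_over_time trick mentioned above (the exact query was not shown, so this is a guess): use the largest RTC rate observed over the last 3h as the denominator, so the dip caused by a reset at a single scrape doesn't distort the ratio:

```promql
rate(windows_cpu_processor_utility_total{core="0,0"}[1m])
  / max_over_time(rate(windows_cpu_processor_rtc_total{core="0,0"}[1m])[3h:1m])
```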
Yes.
I'm in for anything that can help find a use for those metrics.
Agree :)
Yes. My naive guess is that Windows attempts to calculate 100% CPU to mean "would be 100% utilised if the CPU were fully throttled to its TSC frequency", so if you are running at boost and fully utilising the CPU, it will show > 100%. I have seen it be as high as 160% on a single hyperthread (where TSC is 2.0GHz and max boost is 3.2GHz). I could be wrong. Hyperthreads complicate this as well. We've been using these numbers for capacity planning and tuning since I wrote the collector 2ish years ago. @JDA88 has more precise alerting requirements, which is why the resets were noticed.
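For what it's worth, the 160% figure quoted above is exactly what the boost-over-TSC guess predicts:

```latex
\text{utilization}_{\max} \approx \frac{f_{\text{boost}}}{f_{\text{TSC}}} \times 100\%
  = \frac{3.2\,\text{GHz}}{2.0\,\text{GHz}} \times 100\% = 160\%
```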
Here is another screenshot from a local test.
At least they look similar, which I interpret to mean that the Prometheus function looks good-ish in general. I will take a look at the counter reset, but it's hard for me to reproduce the issue. I guess the only way to solve this kind of issue would be to switch from raw performance counters to formatted ones.
Some additional insight I would like to share. I figured out that the RTC metric is based on the "second value" of the ( Running
In that case, a deeper dive is needed to identify the root cause here.
My understanding is that you are concerned that "Processor Utility" and "Processor Time" show different values. This is expected and reasonably well understood. IIRC the former was added in Server 2016 / Windows 10; the latter dates back to the days of fixed clock speeds and no hyperthreading. See #787
Thanks for the clarification. My system language is German, and performance counters are localized in Windows. I grabbed the wrong one. 😅
No problem!
If a constant would be a workaround, what about:
I ran the following query

round(rate(windows_cpu_processor_utility_total{core="0,0"}[1m]) / 625000, 0.1) == round(rate(windows_cpu_processor_utility_total{core="0,0"}[1m]) / rate(windows_cpu_processor_rtc_total[2m]), 0.1)

to check the difference, rounding to 1 decimal place. The gaps (where the two queries are not equal) are minimal; I would say okay-ish.
Is it OK to have a metric that needs a rate and a division by a magic number to be used correctly? And are we 100% sure the value is the same across all computers? Unfortunately, we don't have much hardware variety here, so I can't really test that.
We are 100% sure that the value is not the same across all computers; there is a spread of roughly +/- 100 around 625000. In total, I have the feeling that
On base value reset, the PowerShell command
After endless efforts here, the Performance Counter seems unusable.
Damn. Sorry, but in this current state I don't see any use for those metrics for our purposes.
💯 agree. I'd like to solve this, but I can't solve design issues here.
Here are some random samples from across our fleet. The value seems to be invariant per machine, excluding sample errors:
So it's definitely not the same across all systems. Note that the old E5-2667 v2 (hello, 2014) is nearly 900K. If we really want to solve this by exporting it as a gauge, we should do a read at initialization until we get a positive number, and then export it as a constant thereafter.
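A rough sketch of the "read until positive, then export as a constant" idea. This is not the actual windows_exporter code: the registry read is stubbed out as a slice of raw samples, and the function name and the 625000 value are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// readBaseValue stands in for reading the raw perf counter base value
// from the registry. On real hardware the first reads can come back as
// zero (or garbage), so we keep reading until we see a positive value
// and treat that as the machine's constant.
func readBaseValue(reads []int64) (int64, error) {
	for _, v := range reads {
		if v > 0 {
			return v, nil
		}
	}
	return 0, errors.New("no positive base value observed")
}

func main() {
	// Hypothetical sequence of raw reads during collector start-up.
	samples := []int64{0, 0, 625000}
	base, err := readBaseValue(samples)
	if err != nil {
		panic(err)
	}
	// From here on, the collector would export this as a constant gauge.
	fmt.Printf("windows_cpu_processor_rtc_base %d\n", base)
}
```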
Thinking about it some more: if we're going to go that far, we may as well divide the utility metrics by that constant * 100, since I think that's the only purpose it serves. My "cpu idle" query would go from:
to just
but we'd presumably have to rename the metrics.
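The before/after queries were not preserved in the thread, but the simplification would look roughly like this (the base constant and the renamed metric are hypothetical):

```promql
# today: divide by the per-machine base constant (e.g. 625000)
100 - rate(windows_cpu_processor_utility_total{core="0,0"}[1m]) / 625000

# if the exporter pre-divided by the constant at scrape time
# (hypothetical renamed metric):
100 - rate(windows_cpu_processor_utility_ratio_total{core="0,0"}[1m])
```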
The metric query would look much more "intuitive" that way, that's for sure!
On my system,
I mean, we can do such shenanigans, but I have the feeling that in the end, doing it that way results in a less precise value than just using
I got some additional info on how the counter works and why it doesn't work in the Prometheus ecosystem: the counter is a 32-bit unsigned integer, and the rollover happens at 4294967295 (2^32 - 1). In languages without integer overflow protection (like C), integer subtraction is stable: it just works, including across the rollover. (Go ref: https://go.dev/play/p/57GXzfT-m5G) For example:
The problem here is that all Prometheus values are float64. By converting the values, the bit-based wraparound mechanic is lost.
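A minimal Go sketch of the wraparound behaviour the playground link above demonstrates (the specific numbers are illustrative):

```go
package main

import "fmt"

func main() {
	// A 32-bit unsigned counter that has just rolled over:
	var prev uint32 = 4294967290 // shortly before the 2^32 wrap
	var cur uint32 = 5           // a few ticks after the wrap

	// Unsigned subtraction is modular, so the delta survives the
	// rollover without any special handling:
	delta := cur - prev
	fmt.Println(delta) // 11

	// Once the values are converted to float64 (as Prometheus stores
	// them), the modular arithmetic is lost and the delta goes negative:
	fmt.Println(float64(cur) - float64(prev))
}
```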
I'll see if I can get some time to mock this up next week. |
windows_exporter version: 0.23.0
I am trying to migrate our CPU usage percent metric to the new ones introduced in v0.21.0.
Query 1 (Old) per core usage (limited to 1 core for the example):
((1-(avg without (mode) (rate(windows_cpu_time_total{mode="idle", core="0,0"}[1m])))) * 100)
Query 2 (New) per core usage (limited to 1 core for the example):
rate(windows_cpu_processor_utility_total{core="0,0"}[1m]) / rate(windows_cpu_processor_rtc_total{}[1m])
The issue is that sometimes I can get values way above 100% (120%+) with the new query. I fixed it with a clamp, but that's far from ideal. (I actually just reverted the change in production until I have a proper way to fix the issue.)
Example with rate: Query 1 in yellow and Query 2 in green
Looking at the detail of each new metric, I might have found the root cause:
windows_cpu_processor_utility_total is fine (it continues to increment until the computer reboots).
windows_cpu_processor_rtc_total on the other hand is reset every time it reaches 4300000000, and the calculation artefacts happen exactly at that time.
Example below with windows_cpu_processor_rtc_total graphed for 2 different servers.
@higels is this reset expected in between reboots? Am I doing something wrong?
From my understanding, given the lack of xrate, we should avoid frequent counter resets.
PS: Our Windows servers are mostly VMware VMs