This repository has been archived by the owner on Aug 2, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 20
Threads blocked trying to attach to the JVM #566
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
In [this code](https://github.com/opendistro-for-elasticsearch/performance-analyzer-rca/blob/b33c502d55b954782129b8f61b2c67b07e8477d5/src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/jvm/ThreadList.java#L127) all threads might be blocked behind one thread trying to fill the jTidMap from a threaddump that contains all the thread related details that performanceAnalyzer uses. The thread dump is requires a JVM attach which has a [default timeout of 10 seconds](https://github.com/openjdk/jdk/blob/353416ffcaa12d79013e6f9d03371bf86d0f671b/src/jdk.attach/share/classes/sun/tools/attach/HotSpotVirtualMachine.java#L368) which is quite long. While one thread is busy attaching, all threads running ES operations might be stuck waiting for the same lock. Therefore, this change replaces a mandatory lock with a try-lock, so that the threads which cannot acquire the lock, moves past this and carry on. We will loose the metrics for this run, but that is better than stalling all ES operations. Tests PA metrics is returning the metrics: ``` curl "localhost:9600/_opendistro/_performanceanalyzer/metrics?metrics=Latency,CPU_Utilization&agg=avg,max&dim=ShardID&nodes=all" {"iAHcZIXUQMaOBPzRNsC-JA": {"timestamp": 1614637105000, "data": {"fields":[{"name":"ShardID","type":"VARCHAR"},{"name":"Latency","type":"DOUBLE"},{"name":"CPU_Utilization","type":"DOUBLE"}],"records":[[null,null,0.005394726328405913]]}}, "jLtljYrfQAuTVwUBgGtHNQ" :{"timestamp": 1614637105000, "data": {"fields":[{"name":"ShardID","type":"VARCHAR"},{"name":"Latency","type":"DOUBLE"},{"name":"CPU_Utilization","type":"DOUBLE"}],"records":[[null,null,0.002798561151079137]]}}}``` The writer metrics are being emitted in the /dev/shm file ``` [root@7b746ade07bd performanceanalyzer]# cat 1614636875000 ^heap_metrics {"current_time":1614636875493} {"MemType":"totYoungGC","GC_Collection_Event":0,"GC_Collection_Time":0,"Heap_Committed":-2,"Heap_Init":-2,"Heap_Max":-2,"Heap_Used":-2} {"MemType":"totFullGC","GC_Collection_Event":0,"GC_Collection_Time":0,"Heap_Committed":-2,"Heap_Init":-2,"Heap_Max":-2,"Heap_Used":-2} {"MemType":"PermGen","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":85696512,"Heap_Init":0,"Heap_Max":-1,"Heap_Used":80593424} {"MemType":"Survivor","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":17891328,"Heap_Init":17891328,"Heap_Max":17891328,"Heap_Used":9578944} {"MemType":"OldGen","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":357957632,"Heap_Init":357957632,"Heap_Max":357957632,"Heap_Used":55396840} {"MemType":"Eden","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":143130624,"Heap_Init":143130624,"Heap_Max":143130624,"Heap_Used":37667024} {"MemType":"NonHeap","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":118042624,"Heap_Init":7667712,"Heap_Max":-1,"Heap_Used":109184376} {"MemType":"Heap","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":518979584,"Heap_Init":536870912,"Heap_Max":518979584,"Heap_Used":102642808} ```
rguo-aws
reviewed
Mar 2, 2021
src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/jvm/ThreadList.java
Outdated
Show resolved
Hide resolved
Bukhtawar
reviewed
Mar 2, 2021
src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/jvm/ThreadList.java
Outdated
Show resolved
Hide resolved
src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/jvm/ThreadList.java
Outdated
Show resolved
Hide resolved
Bukhtawar
reviewed
Mar 8, 2021
...endistro/elasticsearch/performanceanalyzer/collectors/ScheduledMetricCollectorsExecutor.java
Outdated
Show resolved
Hide resolved
...endistro/elasticsearch/performanceanalyzer/collectors/ScheduledMetricCollectorsExecutor.java
Show resolved
Hide resolved
src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/jvm/ThreadList.java
Outdated
Show resolved
Hide resolved
Bukhtawar
reviewed
Mar 8, 2021
src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/jvm/ThreadList.java
Show resolved
Hide resolved
nisheedh
reviewed
Mar 8, 2021
...endistro/elasticsearch/performanceanalyzer/collectors/ScheduledMetricCollectorsExecutor.java
Outdated
Show resolved
Hide resolved
nisheedh
reviewed
Mar 8, 2021
...endistro/elasticsearch/performanceanalyzer/collectors/ScheduledMetricCollectorsExecutor.java
Outdated
Show resolved
Hide resolved
nisheedh
reviewed
Mar 8, 2021
src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/jvm/ThreadList.java
Outdated
Show resolved
Hide resolved
nisheedh
reviewed
Mar 8, 2021
src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/jvm/ThreadList.java
Outdated
Show resolved
Hide resolved
nisheedh
reviewed
Mar 8, 2021
src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/jvm/ThreadList.java
Outdated
Show resolved
Hide resolved
nisheedh
reviewed
Mar 10, 2021
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you address my comments ?
Sorry for the delay in getting back to the comments. |
Bukhtawar
reviewed
Mar 11, 2021
Comment on lines
+275
to
+279
// oldNativeTidMap gets cleared if the attach Fails, so that the | ||
// metrics collection starts as it would after a restart. | ||
if (!oldNativeTidMap.isEmpty()) { | ||
runMXDump(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better way to indicate whether an attach failed. Map being empty can happen else where in the code maybe
Bukhtawar
reviewed
Mar 11, 2021
src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/jvm/ThreadList.java
Show resolved
Hide resolved
Bukhtawar
approved these changes
Mar 11, 2021
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes, LGTM
nisheedh
approved these changes
Mar 11, 2021
rguo-aws
approved these changes
Mar 11, 2021
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #: #565
Description of changes:
In this code all threads might be blocked behind one thread trying to
fill the jTidMap from a threaddump that contains all the thread related details that performanceAnalyzer uses. The thread dump is requires a JVM attach which has a default timeout of 10 seconds which is quite long. While one thread is busy attaching, all threads running ES operations might be stuck waiting for the same lock. Therefore, this change replaces a mandatory lock with a try-lock, so that the threads which cannot acquire the lock, moves past this and carry on. We will loose the metrics for this run, but that is better than stalling all ES operations.
Tests:
PA metrics is returning the metrics:
If new tests are added, how long do the new ones take to complete
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or
(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or
(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.
(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.