Skip to content
This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Threads blocked trying to attach to the JVM #566

Merged
merged 6 commits into from
Mar 11, 2021
Merged

Threads blocked trying to attach to the JVM #566

merged 6 commits into from
Mar 11, 2021

Conversation

yojs
Copy link
Contributor

@yojs yojs commented Mar 1, 2021

Fixes #: #565

Description of changes:

In this code all threads might be blocked behind one thread trying to
fill the jTidMap from a threaddump that contains all the thread related details that performanceAnalyzer uses. The thread dump is requires a JVM attach which has a default timeout of 10 seconds which is quite long. While one thread is busy attaching, all threads running ES operations might be stuck waiting for the same lock. Therefore, this change replaces a mandatory lock with a try-lock, so that the threads which cannot acquire the lock, moves past this and carry on. We will loose the metrics for this run, but that is better than stalling all ES operations.

Tests:

PA metrics is returning the metrics:

curl "localhost:9600/_opendistro/_performanceanalyzer/metrics?metrics=Latency,CPU_Utilization&agg=avg,max&dim=ShardID&nodes=all"
{"iAHcZIXUQMaOBPzRNsC-JA": {"timestamp": 1614637105000, "data": {"fields":[{"name":"ShardID","type":"VARCHAR"},{"name":"Latency","type":"DOUBLE"},{"name":"CPU_Utilization","type":"DOUBLE"}],"records":[[null,null,0.005394726328405913]]}}, "jLtljYrfQAuTVwUBgGtHNQ" :{"timestamp": 1614637105000, "data": {"fields":[{"name":"ShardID","type":"VARCHAR"},{"name":"Latency","type":"DOUBLE"},{"name":"CPU_Utilization","type":"DOUBLE"}],"records":[[null,null,0.002798561151079137]]}}}```

The writer metrics are being emitted in the /dev/shm file

```[root@7b746ade07bd performanceanalyzer]# cat 1614636875000
^heap_metrics
{"current_time":1614636875493}
{"MemType":"totYoungGC","GC_Collection_Event":0,"GC_Collection_Time":0,"Heap_Committed":-2,"Heap_Init":-2,"Heap_Max":-2,"Heap_Used":-2}
{"MemType":"totFullGC","GC_Collection_Event":0,"GC_Collection_Time":0,"Heap_Committed":-2,"Heap_Init":-2,"Heap_Max":-2,"Heap_Used":-2}
{"MemType":"PermGen","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":85696512,"Heap_Init":0,"Heap_Max":-1,"Heap_Used":80593424}
{"MemType":"Survivor","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":17891328,"Heap_Init":17891328,"Heap_Max":17891328,"Heap_Used":9578944}
{"MemType":"OldGen","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":357957632,"Heap_Init":357957632,"Heap_Max":357957632,"Heap_Used":55396840}
{"MemType":"Eden","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":143130624,"Heap_Init":143130624,"Heap_Max":143130624,"Heap_Used":37667024}
{"MemType":"NonHeap","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":118042624,"Heap_Init":7667712,"Heap_Max":-1,"Heap_Used":109184376}
{"MemType":"Heap","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":518979584,"Heap_Init":536870912,"Heap_Max":518979584,"Heap_Used":102642808}

If new tests are added, how long do the new ones take to complete

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or

(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or

(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.

(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.

In [this code](https://github.com/opendistro-for-elasticsearch/performance-analyzer-rca/blob/b33c502d55b954782129b8f61b2c67b07e8477d5/src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/jvm/ThreadList.java#L127) all threads might be blocked behind one thread trying to
fill the jTidMap from a threaddump that contains all the thread related details that performanceAnalyzer uses. The thread dump is requires a JVM attach which has a [default timeout of 10 seconds](https://github.com/openjdk/jdk/blob/353416ffcaa12d79013e6f9d03371bf86d0f671b/src/jdk.attach/share/classes/sun/tools/attach/HotSpotVirtualMachine.java#L368) which is quite long. While one thread is busy attaching, all threads running ES operations might be stuck waiting for the same lock. Therefore, this change replaces a mandatory lock with a try-lock, so that the threads which cannot acquire the lock, moves past this and carry on. We will loose the metrics for this run, but that is better than stalling all ES operations.

Tests

PA metrics is returning the metrics:
```
curl "localhost:9600/_opendistro/_performanceanalyzer/metrics?metrics=Latency,CPU_Utilization&agg=avg,max&dim=ShardID&nodes=all"
{"iAHcZIXUQMaOBPzRNsC-JA": {"timestamp": 1614637105000, "data": {"fields":[{"name":"ShardID","type":"VARCHAR"},{"name":"Latency","type":"DOUBLE"},{"name":"CPU_Utilization","type":"DOUBLE"}],"records":[[null,null,0.005394726328405913]]}}, "jLtljYrfQAuTVwUBgGtHNQ" :{"timestamp": 1614637105000, "data": {"fields":[{"name":"ShardID","type":"VARCHAR"},{"name":"Latency","type":"DOUBLE"},{"name":"CPU_Utilization","type":"DOUBLE"}],"records":[[null,null,0.002798561151079137]]}}}```

The writer metrics are being emitted in the /dev/shm file
```
[root@7b746ade07bd performanceanalyzer]# cat 1614636875000
^heap_metrics
{"current_time":1614636875493}
{"MemType":"totYoungGC","GC_Collection_Event":0,"GC_Collection_Time":0,"Heap_Committed":-2,"Heap_Init":-2,"Heap_Max":-2,"Heap_Used":-2}
{"MemType":"totFullGC","GC_Collection_Event":0,"GC_Collection_Time":0,"Heap_Committed":-2,"Heap_Init":-2,"Heap_Max":-2,"Heap_Used":-2}
{"MemType":"PermGen","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":85696512,"Heap_Init":0,"Heap_Max":-1,"Heap_Used":80593424}
{"MemType":"Survivor","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":17891328,"Heap_Init":17891328,"Heap_Max":17891328,"Heap_Used":9578944}
{"MemType":"OldGen","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":357957632,"Heap_Init":357957632,"Heap_Max":357957632,"Heap_Used":55396840}
{"MemType":"Eden","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":143130624,"Heap_Init":143130624,"Heap_Max":143130624,"Heap_Used":37667024}
{"MemType":"NonHeap","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":118042624,"Heap_Init":7667712,"Heap_Max":-1,"Heap_Used":109184376}
{"MemType":"Heap","GC_Collection_Event":-2,"GC_Collection_Time":-2,"Heap_Committed":518979584,"Heap_Init":536870912,"Heap_Max":518979584,"Heap_Used":102642808}
```
@yojs yojs added the bug Something isn't working label Mar 1, 2021
Copy link

@nisheedh nisheedh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you address my comments ?

@yojs
Copy link
Contributor Author

yojs commented Mar 11, 2021

Sorry for the delay in getting back to the comments.

Comment on lines +275 to +279
// oldNativeTidMap gets cleared if the attach Fails, so that the
// metrics collection starts as it would after a restart.
if (!oldNativeTidMap.isEmpty()) {
runMXDump();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way to indicate whether an attach failed. Map being empty can happen else where in the code maybe

Copy link
Member

@Bukhtawar Bukhtawar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, LGTM

@yojs yojs merged commit cdc11b0 into main Mar 11, 2021
@yojs yojs deleted the attach-hang branch March 11, 2021 21:50
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants