The Intel GPU Level-Zero sidecar is an extension for the Intel GPU plugin that queries additional GPU details from the oneAPI/Level-Zero API. Because Level-Zero is a C/C++ API, the original GPU plugin is kept as-is and the additional functionality is added via the Level-Zero sidecar. The GPU plugin can be configured to use the Level-Zero sidecar with an overlay, see [install](#install).

The Intel GPU plugin and the Level-Zero sidecar communicate via gRPC over a local socket visible only to the containers.

> **NOTE**: The Intel Device Plugin Operator doesn't yet support enabling the Level-Zero sidecar in the GPU CR object.

## Modes and Configuration Options

| Flag | Argument | Default | Meaning |
|:---- |:-------- |:------- |:------- |
| -socket | unix socket path | /var/lib/levelzero/server.sock | Unix socket path which the server registers itself into. |
| -wsl | - | disabled | Adapt sidecar to run in the WSL environment. |
| -v | verbosity | 1 | Log verbosity |

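As an illustration of the defaults in the table above, the following Go sketch parses the same three flags with the standard `flag` package. The `Config` struct and the `parseFlags` function are hypothetical names chosen for this example and are not taken from the sidecar's source.

```go
package main

import (
	"flag"
	"fmt"
)

// Config holds the three documented flags. The type name and fields are
// assumptions made for this sketch.
type Config struct {
	SocketPath string
	WSL        bool
	Verbosity  int
}

// parseFlags parses args (without the program name) using the flag names and
// defaults from the table. It is an illustrative stand-in, not the sidecar's code.
func parseFlags(args []string) (Config, error) {
	var cfg Config
	fs := flag.NewFlagSet("levelzero-sidecar", flag.ContinueOnError)
	fs.StringVar(&cfg.SocketPath, "socket", "/var/lib/levelzero/server.sock", "unix socket path for the server")
	fs.BoolVar(&cfg.WSL, "wsl", false, "adapt sidecar to run in the WSL environment")
	fs.IntVar(&cfg.Verbosity, "v", 1, "log verbosity")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	return cfg, nil
}

func main() {
	// Override two flags and keep the default socket path.
	cfg, err := parseFlags([]string{"-wsl", "-v", "3"})
	if err != nil {
		fmt.Println("flag error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Using a dedicated `flag.FlagSet` instead of the global flag set keeps the parsing function easy to call from tests.
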
## Install

The sidecar is installed along with the GPU plugin via one of two overlays: [health](../../deployments/gpu_plugin/overlays/health/) and [wsl](../../deployments/gpu_plugin/overlays/wsl/).

The health overlay adds the sidecar to the base GPU plugin deployment and configures the GPU plugin to retrieve device health indicators from the Level-Zero API:
cmd/gpu_plugin/README.md
@@ -18,6 +18,7 @@ Table of Contents
* [SR-IOV use with the plugin](#sr-iov-use-with-the-plugin)
* [CDI support](#cdi-support)
* [KMD and UMD](#kmd-and-umd)
* [Health management](#health-management)
* [Issues with media workloads on multi-GPU setups](#issues-with-media-workloads-on-multi-gpu-setups)
* [Workaround for QSV and VA-API](#workaround-for-qsv-and-va-api)

@@ -56,6 +57,8 @@ For workloads on different KMDs, see [KMD and UMD](#kmd-and-umd).
|:---- |:-------- |:------- |:------- |
| -enable-monitoring | - | disabled | Enable '*_monitoring' resource that provides access to all Intel GPU devices on the node, [see use](./monitoring.md) |
| -health-management | - | disabled | Enable health management by requesting data from oneAPI/Level-Zero interface. Requires [GPU Level-Zero](../gpu_levelzero/) sidecar. See [health management](#health-management) |
| -wsl | - | disabled | Adapt plugin to run in the WSL environment. Requires [GPU Level-Zero](../gpu_levelzero/) sidecar. |
| -shared-dev-num | int | 1 | Number of containers that can share the same GPU device |
| -allocation-policy | string | none | 3 possible values: balanced, packed, none. For shared-dev-num > 1: _balanced_ mode spreads workloads among GPU devices, _packed_ mode fills one GPU fully before moving to next, and _none_ selects first available device from kubelet. Default is _none_. Allocation policy does not have an effect when resource manager is enabled. |

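To make the difference between the three `-allocation-policy` values concrete, here is a minimal Go sketch. The `pickDevice` function and its parameters are hypothetical and only model the behavior described in the table; the plugin's real selection logic may track usage or break ties differently.

```go
package main

import "fmt"

// pickDevice returns the device a new container would get under the given
// policy, or false when every device already has sharedDevNum users.
// devices is the candidate order (for example, as offered by kubelet) and
// used maps a device name to the number of containers already sharing it.
// All names here are assumptions made for this sketch.
func pickDevice(policy string, devices []string, used map[string]int, sharedDevNum int) (string, bool) {
	best := ""
	for _, d := range devices {
		if used[d] >= sharedDevNum {
			continue // device is full, skip it
		}
		switch policy {
		case "balanced":
			// Spread load: prefer the device with the fewest users.
			if best == "" || used[d] < used[best] {
				best = d
			}
		case "packed":
			// Fill one device first: prefer the busiest device with room left.
			if best == "" || used[d] > used[best] {
				best = d
			}
		default:
			// "none": take the first available device in the given order.
			return d, true
		}
	}
	return best, best != ""
}

func main() {
	devices := []string{"card0", "card1"}
	used := map[string]int{"card0": 1, "card1": 0}
	for _, policy := range []string{"balanced", "packed", "none"} {
		d, ok := pickDevice(policy, devices, used, 2)
		fmt.Println(policy, d, ok)
	}
}
```

With `-shared-dev-num` left at 1, every device is either free or full, so the policies only start to differ once sharing is enabled.
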
@@ -257,6 +260,14 @@ Creating a workload that would support all the different KMDs is not currently p
| Media | Default | [ENABLE_PRODUCTION_KMD](https://github.com/intel/media-driver/blob/a66b076e83876fbfa9c9ab633ad9c5517f8d74fd/CMakeLists.txt#L58) | [ENABLE_XE_KMD](https://github.com/intel/media-driver/blob/a66b076e83876fbfa9c9ab633ad9c5517f8d74fd/media_driver/cmake/linux/media_feature_flags_linux.cmake#L187-L190) | Xe with upstream or backport i915, not all three. |
| Graphics | Default | Unknown | [intel-xe-kmd](https://gitlab.freedesktop.org/mesa/mesa/-/blob/e9169881dbd1f72eab65a68c2b8e7643f74489b7/meson_options.txt#L708) | i915 and xe KMDs can be supported at the same time. |

### Health management

The Kubernetes Device Plugin API allows passing a device's health status to the kubelet. By default, the GPU plugin reports all devices as `Healthy`. If health management is enabled, the GPU plugin retrieves health-related data from the oneAPI/Level-Zero interface via the [GPU levelzero](../gpu_levelzero/) sidecar. Depending on the data received, the GPU plugin reports a device as `Unhealthy` if:

1) Direct health indicators report issues: [memory](https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-mem-health-t) & [pci](https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-pci-link-status-t)
2) Device temperature is over the limit

The temperature limit can be provided via a command line argument; the default is 100 °C.

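The decision rule described above can be sketched in Go as follows. `DeviceHealth`, its fields, and `healthStatus` are hypothetical names for this example; the plugin's actual types and the data returned by the sidecar may differ. The sketch also assumes that "over the limit" means strictly greater than the limit.

```go
package main

import "fmt"

// DeviceHealth is a hypothetical summary of the per-device data the sidecar
// could return; the field names are assumptions made for this sketch.
type DeviceHealth struct {
	MemoryOK     bool    // false when the memory health indicator reports an issue
	PCILinkOK    bool    // false when the PCI link status reports an issue
	TemperatureC float64 // current device temperature in degrees Celsius
}

const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"

	// defaultTempLimitC is the default temperature limit stated in the text.
	defaultTempLimitC = 100.0
)

// healthStatus applies the two conditions from the list above: a direct
// health indicator reporting an issue, or a temperature above the limit.
func healthStatus(h DeviceHealth, tempLimitC float64) string {
	if !h.MemoryOK || !h.PCILinkOK {
		return Unhealthy
	}
	if h.TemperatureC > tempLimitC {
		return Unhealthy
	}
	return Healthy
}

func main() {
	hot := DeviceHealth{MemoryOK: true, PCILinkOK: true, TemperatureC: 104}
	fmt.Println(healthStatus(hot, defaultTempLimitC))
}
```
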
### Issues with media workloads on multi-GPU setups

OneVPL media API, 3D and compute APIs provide device discovery
sed -i -e 's/gpu_levelzero/gpulevelzero/' levelzero.pb.go levelzero_grpc.pb.go
```

> *Note*: Running `protoc` will erase the copyright header and change the package name from "gpulevelzero" to "gpu.levelzero". The header and the package name need to be restored after regeneration.