Releases: oneapi-src/oneDNN
v3.5.3
This is a patch release containing the following changes to v3.5.2:
- Fixed correctness issue in convolution weight gradient for small shapes on Intel GPUs (49eee6a, 281dd3b)
- Extended MLP patterns supported by experimental Graph Compiler to cover cases relevant to ChatGLM model (ff680fc)
- Fixed performance regression in bf16 depthwise convolution on Intel CPUs (d6c216a)
v3.5.2
This is a patch release containing the following changes to v3.5.1:
- Fixed performance regression for some Graph API subgraphs with LayerNorm operation (82f629c)
- Fixed runtime error for Graph API subgraphs including 6D LayerNorm operation (f704f09)
- Fixed an issue with host compiler version detection in SYCL configurations (730b976)
- Fixed an issue with missing `DNNL_TARGET_ARCH` define for builds not relying on CMake (87848b9)
- Fixed a test issue for matmul with low-precision scales and/or zero-points (91c35d8)
- Fixed segfault issue in bfloat16 shuffle on AArch64 processors (9116681)
- Fixed runtime issue in quantized layer normalization pattern with Graph API (0013e8c)
v3.4.4
v3.5.1
This is a patch release containing the following changes to v3.5:
- Fixed potential page fault in matmul on Intel Datacenter Max Series GPUs (a9c525d)
- Fixed potential stack overflow issue in convolution implementation for Intel GPUs (0fb7e6e)
- Added test cases for matmul with compressed weights (015ccb1)
- Extended Graph API `LayerNorm` operation with zero points support (dc2701a)
- Fixed primitive creation error for depthwise convolution backpropagation on Intel GPUs (4a045e4, b529d22)
v3.5
Performance Optimizations
- Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids).
- Improved performance of group normalization primitive.
- Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
- Improved performance of the following subgraphs with Graph API:
  - Multi-Query Attention (MQA).
  - Scaled Dot Product Attention (SDPA), including the variant with `select` operation.
  - `LayerNorm` + `Multiply` + `Quantize` produced by SmoothQuant algorithm.
  - `Convolution` + `Sigmoid` + `Multiply` with mixed precisions.
- Intel Graphics Products:
- Improved performance for Processor Graphics based on Xe2 architecture.
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
- Improved RNN primitive performance for LSTM cell case.
- Improved performance of f8_e4m3 data type emulation on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- AArch64-based Processors:
- Improved convolution forward propagation, matmul, and softmax performance for processors with SVE support.
- Improved bf16 matmul, convolution, and reorder primitives performance with Arm Compute Library (ACL).
- Improved eltwise primitive performance for the `gelu_erf` algorithm with ACL.
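The MQA and SDPA subgraphs optimized above both reduce to scaled dot-product attention, softmax(QK^T/sqrt(d))V. A minimal plain-Python reference of that computation (an illustration of the math only, not the oneDNN Graph API):

```python
import math

def sdpa(Q, K, V):
    """Reference scaled dot-product attention: softmax(Q @ K^T / sqrt(d)) @ V.

    Q, K, V are row-major nested lists; no batching or masking, to keep the
    sketch minimal.
    """
    d = len(Q[0])
    # scores[i][j] = dot(Q[i], K[j]) / sqrt(d)
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in K]
              for qi in Q]
    # numerically stable row-wise softmax
    attn = []
    for row in scores:
        m = max(row)
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        attn.append([x / z for x in e])
    # each output row is an attention-weighted sum of value rows
    return [[sum(a * V[j][c] for j, a in enumerate(arow))
             for c in range(len(V[0]))] for arow in attn]

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = sdpa(Q, K, V)
```

The fused Graph API implementation computes the same result without materializing the intermediate score and attention matrices in global memory.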
Functionality
- Introduced sum and binary post-ops support for layer normalization primitive. This functionality is currently implemented on CPUs only.
- Introduced support for int4 data type and extended quantization model with support for grouped scales and zero points.
- Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs with hardware acceleration for fp64 math only.
- Extended floating point math mode API to support weight decompression scenarios. See the matmul weights decompression example to get started. The new floating point mode is supported in the following configurations:
- bfloat16 matmul with int8 weights on Intel CPUs.
- float16 and bfloat16 matmul with int8 or int4 weights on Intel GPUs.
- [experimental] Introduced microkernel API for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementation to expert users.
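The new sum and binary post-ops attach extra elementwise steps to the layer normalization output. A plain-Python sketch of the fused semantics (an illustration only, not the oneDNN C++ API; the function and argument names here are made up):

```python
import math

def layernorm_fused(x, gamma, beta, dst, mul, eps=1e-5):
    """Layer normalization followed by a sum post-op and a binary multiply
    post-op, as a single fused computation over one row."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    y = [gamma[i] * (x[i] - mean) / math.sqrt(var + eps) + beta[i]
         for i in range(len(x))]
    # sum post-op: accumulate into the existing destination values
    y = [yi + di for yi, di in zip(y, dst)]
    # binary post-op (multiply): elementwise product with a second tensor
    return [yi * mi for yi, mi in zip(y, mul)]

# with neutral post-op operands this is plain layer normalization
r = layernorm_fused([1.0, 2.0, 3.0, 4.0], [1.0] * 4, [0.0] * 4,
                    [0.0] * 4, [1.0] * 4)
# with dst = 1 and mul = 2, the output is (layernorm(x) + 1) * 2
r2 = layernorm_fused([1.0, 2.0, 3.0, 4.0], [1.0] * 4, [0.0] * 4,
                     [1.0] * 4, [2.0] * 4)
```

In the primitive itself, the same chain is expressed through the post-ops attribute mechanism rather than explicit loops, avoiding extra passes over memory.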
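The int4 quantization model with grouped scales and zero points, combined with the extended floating point math mode, enables the weight decompression flow: low-precision weights are dequantized on the fly inside a floating point matmul. A self-contained sketch of the arithmetic (an illustration only, not the oneDNN API; `GROUP` and the function names are made up for the example):

```python
GROUP = 2  # one scale/zero point shared by each group of GROUP values along K

def dequantize(w_int4, scales, zps):
    """Grouped dequantization: int4 values (range [-8, 7]) are shifted by the
    group's zero point and scaled by the group's scale."""
    return [(w - zps[i // GROUP]) * scales[i // GROUP]
            for i, w in enumerate(w_int4)]

def matmul_decompressed(a, w_cols, scales, zps):
    """Float matmul with compressed weights: a is an M x K float activation
    matrix; w_cols holds K-long int4 weight columns with per-group metadata."""
    cols = [dequantize(c, s, z) for c, s, z in zip(w_cols, scales, zps)]
    return [[sum(ai * ci for ai, ci in zip(row, col)) for col in cols]
            for row in a]

w = dequantize([3, -2, 7, 0], [0.5, 0.25], [1, -1])
out = matmul_decompressed([[1.0, 1.0, 1.0, 1.0]],
                          [[3, -2, 7, 0]], [[0.5, 0.25]], [[1, -1]])
```

In oneDNN the activations stay in bfloat16/float16 and the dequantization happens inside the kernel, so the full-precision weight tensor is never materialized.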
Usability
- Extended error messages for engine and memory objects creation errors.
- Extended verbose mode diagnostics with information on dispatching decisions for all primitives.
- Introduced support for clang++ host compiler in SYCL builds.
- Introduced API for tensor serialization and deserialization.
- Extended verbose mode diagnostics for Graph API with information on pattern matcher decisions.
- Introduced OpenCL runtime support for Graph API.
- Added support for building oneDNN with installed Arm Compute Library (ACL).
Validation
- Extended benchdnn with support for tensor tags in RNN primitive validation.
Breaking Changes
- Updated minimal supported ACL version to 24.04 (was 23.11).
Thanks to these Contributors
This release contains contributions from the project core team as well as Abdel @quickwritereader, @AngryLoki, Crefeda Rodrigues @cfRod, Daniel Richard G. @iskunk, David Svantesson @davsva01, @deepeshfujitsu, Dylan Angus @dylan-angus-codeplay, Emanuele Rocca @ema, Fadi Arafeh @fadara01, Hernan Martinez @hmartinez82, John Osorio @kala855, Jonathan Deakin @jondea, @kasturedeeksha, Kentaro Kawakami @kawakami-k, Nikita Shulga @malfet, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Roman Zhukov @rozhukov, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Sunita Nadampalli @snadampal, Tadej Ciglarič @t4c1, Vineel Abhinav @vineelabhinav, @vishwascm. We would also like to thank everyone who asked questions and reported issues.
v3.4.3
v3.5-rc
This is a release candidate for oneDNN v3.5. Please provide feedback and submit defect reports via Github issues.
Performance Optimizations
- Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids).
- Improved performance of group normalization primitive.
- Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
- Improved performance of the following subgraphs with Graph API:
  - Multi-Query Attention (MQA).
  - Scaled Dot Product Attention (SDPA), including the variant with `select` operation.
  - `LayerNorm` + `Multiply` + `Quantize` produced by SmoothQuant algorithm.
  - `Convolution` + `Sigmoid` + `Multiply` with mixed precisions.
- Intel Graphics Products:
- Improved performance for Processor Graphics based on Xe2 architecture.
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
- Improved RNN primitive performance for LSTM cell case.
- Improved performance of f8_e4m3 data type emulation on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- AArch64-based Processors:
- Improved convolution forward propagation, matmul, and softmax performance for processors with SVE support.
- Improved bf16 matmul performance with Arm Compute Library (ACL).
- Improved eltwise primitive performance for the `gelu_erf` algorithm with ACL.
Functionality
- Introduced sum and binary post-ops support for layer normalization primitive. This functionality is currently implemented on CPUs only.
- Introduced support for int4 data type and extended quantization model with support for grouped scales and zero points.
- Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs only.
- Extended floating point math mode API to support weight decompression scenarios. See the matmul weights decompression example to get started. The new floating point mode is supported in the following configurations:
- bfloat16 matmul with int8 weights on Intel CPUs.
- float16 and bfloat16 matmul with int8 or int4 weights on Intel GPUs.
- [experimental] Introduced microkernel API for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementation to expert users.
Usability
- Extended error messages for engine and memory objects creation errors.
- Extended verbose mode diagnostics with information on dispatching decisions for all primitives.
- Introduced support for clang++ host compiler in SYCL builds.
- Introduced API for tensor serialization and deserialization.
- Extended verbose mode diagnostics for Graph API with information on pattern matcher decisions.
- Introduced OpenCL runtime support for Graph API.
- Added support for building oneDNN with installed Arm Compute Library (ACL).
Validation
- Extended benchdnn with support for tensor tags in RNN primitive validation.
Thanks to these Contributors
This release contains contributions from the project core team as well as @AngryLoki, Crefeda Rodrigues @cfRod, Daniel Richard G. @iskunk, @deepeshfujitsu, Dylan Angus @dylan-angus-codeplay, Emanuele Rocca @ema, Hernan Martinez @hmartinez82, John Osorio @kala855, Jonathan Deakin @jondea, @kasturedeeksha, Kentaro Kawakami @kawakami-k, Nikita Shulga @malfet, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Roman Zhukov @rozhukov, Shreyas-fuj @Shreyas-fuj, Sunita Nadampalli @snadampal, Tadej Ciglarič @t4c1, Vineel Abhinav @vineelabhinav, @vishwascm. We would also like to thank everyone who asked questions and reported issues.
v3.4.2
This is a patch release containing the following changes to v3.4.1:
- Fixed performance regression in deconvolution on processors with Intel AVX-512 instruction set (307b35b, f46fffb)
- Improved performance of batched matmul with binary post-op on processors with Intel AVX-512 instruction set (d39e1b7)
- Fixed performance regression in softmax with destination memory format set to `any` on processors with Intel AVX-512 instruction set (756d3cf)
- Fixed incorrect results in int8 deconvolution with source zero points on processors with Intel AMX instruction set (d5ddbc8)
- Fixed performance regression in convolution on processors with Intel AVX2 instruction set (2968c89)
- Improved f8_e4m3 matmul performance on Intel Data Center GPU Max Series (068f850, 668abae, c3972ef, ad94382)
- Fixed sporadic accuracy issues in bf16 depthwise convolution backpropagation on processors with Intel AVX-512 instruction set (0184044)
- Fixed primitive creation issue for fp16 pooling backpropagation on Intel GPUs (e4737d9)
- Fixed failure for subgraphs with int8 matmul operation with experimental Graph Compiler on processors with Intel AMX instruction set (5ebde2e)
- Fixed assert in experimental Graph Compiler on Windows (f53fbd1, fd903ae)
- Fixed incorrect results for subgraphs with shuffle operation with experimental Graph Compiler (aef5023)
- Improved performance of subgraphs involving int8 matmul with experimental Graph Compiler on processors with Intel AMX support (0ca5bc5)
- Fixed page fault in fp16 matmul primitive on Intel Data Center GPU Max Series (5587f08)
- Fixed incorrect results in fp32 deconvolution with Arm Compute Library on AArch64 processors (b7694a0)
- Fixed performance regression in deconvolution on processors with Intel AVX2 instruction set (6f452e2)
v3.4.1
This is a patch release containing the following changes to v3.4:
- Fixed an issue with caching and serialization of primitives in deterministic mode (7ed604a)
- Introduced memory descriptor serialization API (4cad420, 929a27a, 9b848c8)
- Fixed incorrect results in fp64 convolution and deconvolution on Intel GPUs based on Xe-LPG architecture (ebe77b5, 0b399ac, d748d64, 9f4f3d5, 21a8cae)
- Fixed incorrect results in reorder with large sizes on Intel CPUs and GPUs (69a111e, 4b72361, 74a343b)
- Reduced creation time for deconvolution primitive on Intel CPUs (bec487e, 1eab005)
- Fixed performance regression in deconvolution on Intel CPUs (fbe5b97, 1dd3c6a)
- Removed dangling symbols from static builds (e92c404, 6f5621a)
- Fixed crash during platform detection on some AArch64-based systems (406a079)
- Fixed performance regression in int8 deconvolution on Intel CPUs (7e50e15)
- Fixed handling of zero points for matmul in verbose logs converter (15c7916)
v3.3.6
This is a patch release containing the following changes to v3.3.5:
- Fixed crash during platform detection on some AArch64-based systems (3e0e69b)
- Improved inner product performance with Arm Compute Library (ACL) (e7abee2, 214fb9e, 8aacc8f)
- Fixed incorrect results in int8 depthwise convolution with post-ops on processors with Intel AVX2 instruction set support (0c922e0)
- Fixed performance regression in fp32 convolution on processors with Intel AVX2 instruction set support (4efc0ad)