This project aims to enhance the performance of the Password-Based Key Derivation Function (PBKDF2) in the .NET runtime environment. We achieve this by combining OpenSSL 1.x hash functions with a refined algorithm that minimizes memory and CPU usage. In this README, I provide a comprehensive guide to the problem-detection steps, explain how to optimize the performance of PBKDF2, and present benchmark results that showcase the improvements achieved.
Special thanks to @ctz and https://github.com/ctz/fastpbkdf2 for the idea and the customized implementation of the algorithm.
In our authentication setup, we facilitate client authentication using the client credentials grant, allowing clients to request authorization for various OAuth grant types. One essential aspect of this process is the verification of the client's secret. However, we encountered a challenge in this regard: client secret verification is a highly CPU-intensive task.
During a recent performance test, we observed a significant impact on our endpoints' response times due to the resource-intensive nature of client secret verification. The test was conducted in an environment consisting of 4 AWS Fargate instances, each with limited computational resources: 0.5 vCPU and 2 GB of memory.
The test simulated a load of 140 transactions per second (roughly 8,400 per minute). In this high-throughput scenario, the CPU-intensive nature of client secret verification became evident, resulting in performance bottlenecks and increased response times.
Notably, the secret-hashing algorithm used for verification is PBKDF2, which, while secure, adds to the computational workload and contributes to the observed performance challenges.
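For context, verifying a client secret with PBKDF2 means re-deriving a key from the presented secret and comparing it against the stored hash. Below is a minimal sketch of that step, assuming SHA-256; the class name and parameter values are illustrative, not our production code:

```csharp
using System.Security.Cryptography;

public static class ClientSecretVerifier
{
    // Illustrative parameters; real deployments tune these.
    private const int Iterations = 100_000;
    private const int KeyLengthBytes = 32;

    public static bool Verify(string presentedSecret, byte[] salt, byte[] storedHash)
    {
        // Re-derive a key from the presented secret: this is the CPU-heavy step.
        using var pbkdf2 = new Rfc2898DeriveBytes(
            presentedSecret, salt, Iterations, HashAlgorithmName.SHA256);
        byte[] candidate = pbkdf2.GetBytes(KeyLengthBytes);

        // Constant-time comparison avoids leaking timing information.
        return CryptographicOperations.FixedTimeEquals(candidate, storedHash);
    }
}
```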
As a result, we embarked on a journey to address these performance issues and optimize the client credential verification process in our authentication solution setup. This repository aims to document the steps we took, the optimizations implemented, and the results achieved in our pursuit of a high-performing solution for client credential verification. I intend to share my findings and improvements with the broader developer community to help others facing similar challenges.
The setup of 4 Fargate tasks with 0.5 vCPU and 2 GB RAM performed acceptably at 65 TPS, with a 300 ms response time. Increasing the load by 5 TPS resulted in a terrible response time of nearly 14 seconds. Digging in with the profiler, I found 100% CPU usage and signs of thread contention.
The image below shows the most expensive path, which initially led us to suspect thread starvation, but that turned out not to be the case.
The I/O thread pool was well configured, and no sync-over-async was detected in the app. However, the CPU was stuck on a heavy hashing function, leaving no room to take care of the continuation tasks.
Additionally, by checking the spans and traces, we could see the cost of each step on the way down to the response:
Since I suspected that high CPU usage was leaving no CPU for the continuations, I needed a way to verify my understanding.
An easy verification step let me make sure that the problem was definitely the CPU: I replaced `.Hash()` with `await Task.Delay(COST_OF_HASH_FUNCTION_IN_MILLISECONDS)` and reran the test.
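In code, the experiment boiled down to the swap below; the method shape and the delay value are illustrative, with the delay standing in for the hash duration the profiler reported:

```csharp
using System.Threading.Tasks;

public class SecretVerificationExperiment
{
    // Illustrative value; in practice, use the hash duration measured by the profiler.
    private const int COST_OF_HASH_FUNCTION_IN_MILLISECONDS = 90;

    public async Task<bool> VerifyAsync(string presentedSecret)
    {
        // Before: bool valid = _hasher.Hash(presentedSecret); // CPU-bound work
        // After: simulate only the latency of hashing, with zero CPU cost.
        // If throughput recovers, the bottleneck is the CPU, not I/O or thread starvation.
        await Task.Delay(COST_OF_HASH_FUNCTION_IN_MILLISECONDS);
        return true; // short-circuit the real check for the duration of the experiment
    }
}
```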
The results showed the API running wonderfully fast at 2x the capacity, with no thread starvation issue (it tolerated 120 TPS, i.e., 7,200 TPM).
So I decided to micro-optimize the hash function to lower the CPU usage, simply as an interesting exercise.
In any case, it's mandatory to start from a test which ensures that, when refactoring the logic for any reason (performance, clean-up, or re-design), nothing breaks.
The test should work like this: given `verifyPasswordV1`, whenever the code changes, the result of every `verifyPasswordV*` function must be identical for each specific test case (input password). For example, the result of `verifyPasswordV1` must exactly match the results of `verifyPasswordV2` and `verifyPasswordV3`.
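As an illustration, here is a hedged sketch of such a parity test in xUnit; `PasswordHasher`, the method signatures, and the fixed salt are assumptions made for this example:

```csharp
using Xunit;

public class PasswordVerificationParityTests
{
    [Theory]
    [InlineData("correct-horse-battery-staple")]
    [InlineData("hunter2")]
    [InlineData("")] // edge case: empty password
    public void AllVersionsProduceIdenticalResults(string password)
    {
        // Pin the same fixed inputs across all versions under test.
        byte[] salt = { 1, 2, 3, 4, 5, 6, 7, 8 };

        byte[] v1 = PasswordHasher.VerifyPasswordV1(password, salt);
        byte[] v2 = PasswordHasher.VerifyPasswordV2(password, salt);
        byte[] v3 = PasswordHasher.VerifyPasswordV3(password, salt);

        // Every refactored version must produce byte-for-byte identical output.
        Assert.Equal(v1, v2);
        Assert.Equal(v1, v3);
    }
}
```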
For implementation details, visit: Tests
At first sight, it was clear to me that checking the .NET runtime repo's previous issues would help. A bunch of searching led me to a closed issue, dotnet/runtime#24897, which had a good effect on the hashing benchmark: it improves on the legacy .NET implementation of `Rfc2898DeriveBytes`, with nearly 2-2.5x better performance.
- How does it work? Just by replacing `new Rfc2898DeriveBytes()` with the static method `Rfc2898DeriveBytes.Pbkdf2()`.
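The change is as small as this sketch suggests (salt, iteration count, and key length are illustrative):

```csharp
using System.Security.Cryptography;

byte[] salt = RandomNumberGenerator.GetBytes(16);
const int iterations = 100_000;
const int keyLength = 32;

// Legacy: allocates an instance (and internal buffers) per derivation.
using (var legacy = new Rfc2898DeriveBytes(
    "secret", salt, iterations, HashAlgorithmName.SHA256))
{
    byte[] keyOld = legacy.GetBytes(keyLength);
}

// .NET 6+: a single static call with far less allocation.
byte[] keyNew = Rfc2898DeriveBytes.Pbkdf2(
    "secret", salt, iterations, HashAlgorithmName.SHA256, keyLength);
```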
Method | Count | Mean | Error | StdDev | Allocated |
---|---|---|---|---|---|
NewDotnetHash | 1000 | 5.006 s | 0.0170 s | 0.0142 s | 91.89 KB |
LegacyDotNetHash | 1000 | 11.595 s | 0.0660 s | 0.0617 s | 540.25 KB |
NOTE: you can run the benchmarks on your machine by navigating to `./app` and running:
`sudo dotnet run --configuration Release`
You can definitely see the benchmark was much better, but nothing changed in the result after triggering the performance tests: the same CPU usage, with a slightly lower response time. So I tried another way to reduce the algorithm's time and memory complexity.
After doing some searching on GitHub, I found some native implementations focused on aggressive inlining, zero allocation, minimal copies, and parallelism.
So I started to port the one with the best benchmarks into C#, leveraging `ReadOnlySpan` and safe pointers. The benchmark shows super-fast results: nearly 4.6x faster than the legacy implementation (see the table below).
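For context on what these implementations optimize: PBKDF2 derives each output block via the F function of RFC 2898, a chain of HMAC calls whose outputs are xor-accumulated. Here is an unoptimized reference sketch of one block, assuming HMAC-SHA256; fastpbkdf2-style code speeds up the hot loop below mainly by caching the HMAC's keyed inner/outer state instead of re-keying on every iteration:

```csharp
using System;
using System.Security.Cryptography;

public static class Pbkdf2Reference
{
    // Computes T_i = U_1 xor U_2 xor ... xor U_c for one output block (RFC 2898, F function).
    public static byte[] DeriveBlock(byte[] password, ReadOnlySpan<byte> salt, int iterations, uint blockIndex)
    {
        using var hmac = new HMACSHA256(password);

        // U_1 = HMAC(password, salt || INT_32_BE(blockIndex))
        byte[] seed = new byte[salt.Length + 4];
        salt.CopyTo(seed);
        seed[salt.Length]     = (byte)(blockIndex >> 24);
        seed[salt.Length + 1] = (byte)(blockIndex >> 16);
        seed[salt.Length + 2] = (byte)(blockIndex >> 8);
        seed[salt.Length + 3] = (byte)blockIndex;

        byte[] u = hmac.ComputeHash(seed);
        byte[] t = (byte[])u.Clone();

        for (int j = 1; j < iterations; j++)
        {
            u = hmac.ComputeHash(u);   // U_j = HMAC(password, U_{j-1}); the hot loop
            for (int k = 0; k < t.Length; k++)
                t[k] ^= u[k];          // accumulate the xor in place
        }
        return t; // 32 bytes per block for HMAC-SHA256
    }
}
```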
```
BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, macOS Ventura 13.4 (22F66) [Darwin 22.5.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.100-rc.2.23502.2
  [Host]     : .NET 6.0.16 (6.0.1623.17311), Arm64 RyuJIT AdvSIMD
  DefaultJob : .NET 6.0.16 (6.0.1623.17311), Arm64 RyuJIT AdvSIMD
```
Method | Count | Mean | Error | StdDev | Allocated |
---|---|---|---|---|---|
StaticPbkdf2DotnetHash | 1000 | 4.998 s | 0.0044 s | 0.0042 s | 86.89 KB |
LegacyPbkdf2DotNetHash | 1000 | 11.462 s | 0.0073 s | 0.0069 s | 541.99 KB |
OpenSslDrivenHash | 1000 | 2.506 s | 0.0015 s | 0.0013 s | 91.93 KB |
OpenSslDrivenHashMultiThread | 1000 | 2.914 s | 0.0036 s | 0.0029 s | 91.93 KB |
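For completeness, a multi-threaded variant like the one benchmarked above presumably fans independent derivations out across cores, roughly as sketched here; the built-in static `Pbkdf2` stands in for the ported function:

```csharp
using System.Security.Cryptography;
using System.Threading.Tasks;

public static class BatchHasher
{
    // Hashes many independent secrets in parallel; each derivation is CPU-bound
    // and independent of the others, so they parallelize cleanly across cores.
    public static byte[][] HashAll(string[] secrets, byte[] salt, int iterations)
    {
        var results = new byte[secrets.Length][];
        Parallel.For(0, secrets.Length, i =>
        {
            results[i] = Rfc2898DeriveBytes.Pbkdf2(
                secrets[i], salt, iterations, HashAlgorithmName.SHA256, outputLength: 32);
        });
        return results;
    }
}
```

Note from the table that the multi-threaded run was actually slightly slower here (2.914 s vs 2.506 s), so fanning out is not an automatic win for this workload.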
So, by triggering the unit tests to verify the logic's behavior and making them green, I could test the performance against the near-PRD environment.
After deep benchmarking of the native implementation, and with the tests green, I decided to trigger the pipelines to deploy the changes to ECS.
For nearly 33 TPS (7k requests per minute):
Maybe OpenSSL 3 support :)