Users who require efficient transformer inference and flexibility. FasterTransformer provides flexible APIs and highly optimized kernels. Compared to the fastest solution, the TensorRT BERT demo, the FasterTransformer encoder is only slightly slower in some cases. Besides, FasterTransformer also supports translation and GPT-2.
Basically, FasterTransformer provides a highly optimized transformer block. Users who need such an efficient transformer implementation can benefit from FasterTransformer, for example for BERT inference or for encoder-decoder architectures built on transformer blocks. Besides, FasterTransformer also supports translation and GPT-2.
FasterTransformer provides a C++ API and TensorFlow/PyTorch OPs. Users can use FasterTransformer directly in these frameworks. For other frameworks, users can also wrap the C++ code to integrate FasterTransformer.
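For illustration, below is a minimal sketch of how a compiled custom-op library is typically loaded from Python in TensorFlow and PyTorch; the library paths are placeholders, not the exact names shipped by FasterTransformer.

```python
# Sketch of loading a custom-op library from Python.
# The .so paths below are hypothetical placeholders.
import tensorflow as tf
import torch

# TensorFlow: load the compiled library, then call the exposed OPs in the graph.
ft_tf_module = tf.load_op_library("./lib/libtf_fastertransformer.so")  # hypothetical path

# PyTorch: load the compiled extension, then use the registered op/class from torch.classes.
torch.classes.load_library("./lib/libth_fastertransformer.so")  # hypothetical path
```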
The simplest method is to use the CUDA Multi-Process Service (MPS), which is supported on Volta and newer GPUs.
Another method is to use multi-threading on the same TensorFlow graph and session. Users can load the model in Python and call the FasterTransformer OP from multiple threads that share the same model graph. Note that running many threads on the same FasterTransformer OP may lead to deadlock, especially when there are lots of threads.
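A minimal sketch of this pattern, assuming a TF1-style graph and session; `encoder_op` below is a stand-in (an identity op), not the real FasterTransformer OP name.

```python
# Sketch: several Python threads issue inference against one shared graph and session.
import threading
import numpy as np
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

inputs = tf.compat.v1.placeholder(tf.float32, shape=[None, 32, 768], name="inputs")
encoder_op = tf.identity(inputs)  # stand-in for the FasterTransformer encoder OP

sess = tf.compat.v1.Session()

def worker(batch):
    # All threads share the same graph and session; only the feed data differs.
    out = sess.run(encoder_op, feed_dict={inputs: batch})
    print(out.shape)

threads = [
    threading.Thread(target=worker,
                     args=(np.random.rand(4, 32, 768).astype(np.float32),))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```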
We have verified correctness and performance on GPUs with compute capability >= 7.0, such as V100, T4, and A100.
Not yet. It is a suggestion, not a requirement. We recommend using these Docker images to build the project the first time to avoid environment problems, but users can also build the project directly in their own environment.
FasterTransformer’s approach is to offload the computational workloads to the GPU, with memory operations overlapped with computation. Hence, FasterTransformer performance is mainly determined by the GPU and I/O devices used. However, when both the batch size and sequence length are small, kernel launching becomes the bottleneck, so a weaker CPU may lead to worse performance.
When using the C++ API, users need to load the model themselves and copy the weights into GPU memory.
In TensorFlow or PyTorch, users can load the checkpoint and pass the weight tensors into FasterTransformer directly. Users can also load the model from other formats, such as NumPy, and pass those weights into FasterTransformer in the same way.
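A minimal sketch of preparing weights from either source, assuming PyTorch; the file names, state-dict key, and `FTEncoder` wrapper are hypothetical, not the real FasterTransformer class or checkpoint layout.

```python
# Sketch: weights can come from a framework checkpoint or from NumPy files,
# and are handed to the FasterTransformer OP/module as ordinary GPU tensors.
import numpy as np
import torch

# From a framework checkpoint (hypothetical file name and key).
state_dict = torch.load("model.pt", map_location="cpu")
qkv_weight = state_dict["encoder.layer.0.attention.qkv.weight"].half().cuda()

# From another format such as NumPy (hypothetical file name).
np_weight = np.load("layer0_qkv_weight.npy")
qkv_weight_from_np = torch.from_numpy(np_weight).half().cuda()

# Either tensor can then be passed to FasterTransformer in the same way, e.g.:
# ft_encoder = FTEncoder(weights=[qkv_weight, ...])  # hypothetical wrapper API
```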
Multi-GPU inference of GPT is a special case. FasterTransformer provides tools to convert OpenAI and Megatron checkpoints, and the converted model can then be loaded by FasterTransformer directly.