Using Triton #367
4 comments · 10 replies
-
Guys, if any of you are using Triton to optimize the performance of faster-whisper, could you please share a checklist of how Triton should be installed and configured for faster-whisper? Thanks a lot.
-
Here is my solution: a config.pbtxt plus the model.py below.
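The config.pbtxt itself is not reproduced in the thread, so here is a minimal sketch consistent with the model.py and the curl request below; the tensor names (input__0, output__0), the BYTES data types, and the model name are assumptions taken from that code rather than the author's actual file.

config.pbtxt (sketch):

name: "whisper"
backend: "python"
max_batch_size: 0
input [
  {
    name: "input__0"
    data_type: TYPE_STRING   # Triton's BYTES type; the whole WAV arrives as one element
    dims: [ 1 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
instance_group [
  { kind: KIND_GPU }
]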
model.py:

from faster_whisper import WhisperModel
import triton_python_backend_utils as pb_utils
import numpy
import io


class TritonPythonModel:
    def initialize(self, _):
        # Load the CTranslate2-converted Whisper model once per model instance.
        self.model = WhisperModel(
            "/bucket/whisper/whisper-large-v2-ct2-ft16",
            compute_type="float16",
            device="cuda",
            local_files_only=True,
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # The whole audio file arrives as a single BYTES element.
            audio = request.inputs()[0].as_numpy()[0]
            segments, _ = self.model.transcribe(
                io.BytesIO(audio),
                without_timestamps=True,
            )
            segments = [segment.text for segment in segments]
            text = " ".join(segments)
            responses.append(
                pb_utils.InferenceResponse(
                    [
                        pb_utils.Tensor(
                            "output__0",
                            numpy.array(
                                [bytes(text, "utf-8")],
                                dtype=numpy.object_,
                            ),
                        ),
                    ],
                ),
            )
        return responses

Request example:

curl -v --location --request POST 'http://localhost:80/v2/models/whisper/infer' \
  --header 'Content-Type: application/octet-stream' \
  --data-binary @OSR_us_000_0010_8k.wav \
  -H "Inference-Header-Content-Length: 0" \
  -o resp.txt

Response:
The response is binary, with the transcription as a UTF-8 string. In the example response, the Inference-Header-Content-Length: 146 header is the most important part for parsing: it gives the length of the JSON header at the start of the body. After that JSON comes a 4-byte little-endian integer with the length of the text that follows; in this example it is "\x9a\x01\x00\x00". Here is how to decode it:

body_size = int.from_bytes(b"\x9a\x01\x00\x00", byteorder='little')
assert body_size == 410

Triton Server itself is installed without any additional steps; just add "pip install faster-whisper" to the Docker image. In my example, inference took ~1.8 seconds on an L4 GPU for a 33-second audio file, so roughly an 18:1 ratio of audio length to processing time, which looks fair.
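A sketch of parsing the saved response in Python, assuming the single BYTES output produced by the model.py above; the JSON-header length (146 in this example) should be read from the Inference-Header-Content-Length response header printed by curl -v, not hard-coded:

import json

HEADER_LEN = 146  # value of the Inference-Header-Content-Length response header

with open("resp.txt", "rb") as f:
    raw = f.read()

meta = json.loads(raw[:HEADER_LEN])        # JSON part describing output__0
payload = raw[HEADER_LEN:]                 # binary part: 4-byte length + UTF-8 text
text_len = int.from_bytes(payload[:4], byteorder="little")
print(payload[4:4 + text_len].decode("utf-8"))

Alternatively, the tritonclient package can handle that bookkeeping; a sketch using the input/output names from the model.py above (again an assumption, not the author's exact setup):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:80")

with open("OSR_us_000_0010_8k.wav", "rb") as f:
    wav_bytes = f.read()

# One BYTES element holding the whole file, matching what model.py expects.
audio_input = httpclient.InferInput("input__0", [1], "BYTES")
audio_input.set_data_from_numpy(np.array([wav_bytes], dtype=np.object_))

result = client.infer(model_name="whisper", inputs=[audio_input])
print(result.as_numpy("output__0")[0].decode("utf-8"))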
-
Hi @ClaytonJY @NikitaSemenovAiforia, and thank you for your contribution! Next, what should the model repository look like? Is it like the following, where ...? Thanks in advance!
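For reference (a general note, not necessarily the exact layout the posters above used), a Triton Python-backend model normally lives in its own directory, with config.pbtxt at the top level and model.py inside a numbered version directory:

model_repository/
└── whisper/
    ├── config.pbtxt
    └── 1/
        └── model.py

The server is then pointed at the top-level directory, e.g. tritonserver --model-repository=/path/to/model_repository.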
-
I see now that you've added a class for batched inference (faster-whisper/faster_whisper/transcribe.py, line 100 at 814472f). Would that work with Triton? Would it enable more concurrent streams to Whisper?
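If the class referred to above is faster-whisper's BatchedInferencePipeline, a minimal usage sketch based on the faster-whisper README (not verified inside Triton) looks like this; it batches the VAD-derived chunks of a single file, so it mainly speeds up one long transcription rather than adding concurrent request streams:

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)

# batch_size controls how many audio chunks are decoded in parallel.
segments, info = batched.transcribe("OSR_us_000_0010_8k.wav", batch_size=16)
print(" ".join(segment.text for segment in segments))

For more concurrent streams through Triton itself, increasing the instance_group count in config.pbtxt is the usual lever, since each model instance handles requests independently.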
-
Hello guys,
Do you think it is possible to optimize faster-whisper with Triton?
Thanks a lot,
AlexG.