Using Triton #367
4 comments · 10 replies
-
Guys, if any of you are using Triton to optimize the performance of faster-whisper, could you please share a checklist of how Triton should be installed and configured for faster-whisper? Thanks a lot.
-
Here is my solution: a config.pbtxt plus the model.py below.
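The config.pbtxt itself is not reproduced in the thread, so here is a minimal sketch consistent with the model.py and the curl request below; the tensor names (input__0, output__0), the BYTES data types, and the model name are assumptions taken from that code rather than the author's actual file.

config.pbtxt (sketch):

name: "whisper"
backend: "python"
max_batch_size: 0
input [
  {
    name: "input__0"
    data_type: TYPE_STRING   # Triton's BYTES type; the whole WAV arrives as one element
    dims: [ 1 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
instance_group [
  { kind: KIND_GPU }
]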
model.py:

from faster_whisper import WhisperModel
import triton_python_backend_utils as pb_utils
import numpy
import io


class TritonPythonModel:
    def initialize(self, _):
        # Load the CTranslate2-converted Whisper model once per model instance.
        self.model = WhisperModel(
            "/bucket/whisper/whisper-large-v2-ct2-ft16",
            compute_type="float16",
            device="cuda",
            local_files_only=True,
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # The whole audio file arrives as a single BYTES element.
            audio = request.inputs()[0].as_numpy()[0]
            segments, _ = self.model.transcribe(
                io.BytesIO(audio),
                without_timestamps=True,
            )
            segments = [segment.text for segment in segments]
            text = " ".join(segments)
            responses.append(
                pb_utils.InferenceResponse(
                    [
                        pb_utils.Tensor(
                            "output__0",
                            numpy.array(
                                [bytes(text, "utf-8")],
                                dtype=numpy.object_,
                            ),
                        ),
                    ],
                ),
            )
        return responses

Request example:

curl -v --location --request POST 'http://localhost:80/v2/models/whisper/infer' \
  --header 'Content-Type: application/octet-stream' \
  --data-binary @OSR_us_000_0010_8k.wav \
  -H "Inference-Header-Content-Length: 0" \
  -o resp.txt

Response:
The response is binary, with the transcription as a UTF-8 string. In the example response, the Inference-Header-Content-Length: 146 header is the most important part for parsing: it gives the length of the JSON header at the start of the body. After that JSON comes a 4-byte little-endian integer with the length of the text that follows; in this example it is "\x9a\x01\x00\x00". Here is how to decode it:

body_size = int.from_bytes(b"\x9a\x01\x00\x00", byteorder='little')
assert body_size == 410

Triton Server itself is installed without any additional steps; just add "pip install faster-whisper" to the Docker image. In my example, inference took ~1.8 seconds on an L4 GPU for a 33-second audio file, so roughly an 18:1 ratio of audio length to processing time, which looks fair.
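A sketch of parsing the saved response in Python, assuming the single BYTES output produced by the model.py above; the JSON-header length (146 in this example) should be read from the Inference-Header-Content-Length response header printed by curl -v, not hard-coded:

import json

HEADER_LEN = 146  # value of the Inference-Header-Content-Length response header

with open("resp.txt", "rb") as f:
    raw = f.read()

meta = json.loads(raw[:HEADER_LEN])        # JSON part describing output__0
payload = raw[HEADER_LEN:]                 # binary part: 4-byte length + UTF-8 text
text_len = int.from_bytes(payload[:4], byteorder="little")
print(payload[4:4 + text_len].decode("utf-8"))

Alternatively, the tritonclient package can handle that bookkeeping; a sketch using the input/output names from the model.py above (again an assumption, not the author's exact setup):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:80")

with open("OSR_us_000_0010_8k.wav", "rb") as f:
    wav_bytes = f.read()

# One BYTES element holding the whole file, matching what model.py expects.
audio_input = httpclient.InferInput("input__0", [1], "BYTES")
audio_input.set_data_from_numpy(np.array([wav_bytes], dtype=np.object_))

result = client.infer(model_name="whisper", inputs=[audio_input])
print(result.as_numpy("output__0")[0].decode("utf-8"))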
-
Hi @ClaytonJY @NikitaSemenovAiforia, and thank you for your contribution! Next, what should the model repository look like? Is it like the following, where ...? Thanks in advance!
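For reference (a general note, not necessarily the exact layout the posters above used), a Triton Python-backend model normally lives in its own directory, with config.pbtxt at the top level and model.py inside a numbered version directory:

model_repository/
└── whisper/
    ├── config.pbtxt
    └── 1/
        └── model.py

The server is then pointed at the top-level directory, e.g. tritonserver --model-repository=/path/to/model_repository.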
-
I see now that you've added a class for batched inference (faster-whisper/faster_whisper/transcribe.py, line 100 at 814472f). Would that work with Triton? Would it enable more concurrent streams to Whisper?
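If the class referred to above is faster-whisper's BatchedInferencePipeline, a minimal usage sketch based on the faster-whisper README (not verified inside Triton) looks like this; it batches the VAD-derived chunks of a single file, so it mainly speeds up one long transcription rather than adding concurrent request streams:

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)

# batch_size controls how many audio chunks are decoded in parallel.
segments, info = batched.transcribe("OSR_us_000_0010_8k.wav", batch_size=16)
print(" ".join(segment.text for segment in segments))

For more concurrent streams through Triton itself, increasing the instance_group count in config.pbtxt is the usual lever, since each model instance handles requests independently.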
-
Hello guys,
Do you think it is possible to optimize faster-whisper with Triton?
Thanks a lot,
AlexG.