feat: s2s client #34

Open · wants to merge 4 commits into main
Conversation

@junkin (Contributor) commented Jan 12, 2023

Adding a basic speech-to-speech CLI.

@rmittal-github rmittal-github changed the base branch from main to release/2.9.0 January 16, 2023 13:59
@rmittal-github (Contributor) commented:

@PeganovAnton could you please review this? We need this merged ASAP to enable QA to test S2S.

Comment on lines 38 to 71
Generates speech recognition responses for fragments of speech audio in :param:`audio_chunks`.
The purpose of the method is to perform speech recognition "online", i.e. as soon as
audio is acquired, in small chunks.

All available audio chunks will be sent to the server on the first ``next()`` call.

Args:
    audio_chunks (:obj:`Iterable[bytes]`): an iterable object which contains raw audio fragments
        of speech. For example, such raw audio can be obtained with

        .. code-block:: python

            import wave
            with wave.open(file_name, 'rb') as wav_f:
                raw_audio = wav_f.readframes(n_frames)

    streaming_config (:obj:`riva.client.proto.riva_asr_pb2.StreamingRecognitionConfig`): a config for streaming.
        You may find a description of the config fields in the ``StreamingRecognitionConfig`` message in the
        `common repo
        <https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/protos/protos.html#riva-proto-riva-asr-proto>`_.
        An example of creating a streaming config:

        .. code-block:: python

            from riva.client import RecognitionConfig, StreamingRecognitionConfig
            config = RecognitionConfig(enable_automatic_punctuation=True)
            streaming_config = StreamingRecognitionConfig(config=config, interim_results=True)

Yields:
    :obj:`riva.client.proto.riva_asr_pb2.StreamingRecognizeResponse`: responses for audio chunks in
    :param:`audio_chunks`. You may find a description of the response fields in the declaration of the
    ``StreamingRecognizeResponse`` message
    `here
    <https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/protos/protos.html#riva-proto-riva-asr-proto>`_.
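For context, a minimal sketch of how such a streaming response generator is typically consumed. The service class, connection parameters, file name, and chunking helper below are illustrative assumptions, not taken from this PR:

import wave

import riva.client

# Connection parameters are placeholders.
auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(enable_automatic_punctuation=True)
streaming_config = riva.client.StreamingRecognitionConfig(config=config, interim_results=True)

def wav_chunks(file_name, frames_per_chunk=4096):
    # Yield raw audio fragments from a WAV file, as in the wave example above.
    with wave.open(file_name, 'rb') as wav_f:
        while True:
            data = wav_f.readframes(frames_per_chunk)
            if not data:
                break
            yield data

for response in asr_service.streaming_response_generator(
    audio_chunks=wav_chunks("speech.wav"), streaming_config=streaming_config
):
    for result in response.results:
        if result.alternatives:
            # Print the top transcript hypothesis for each result.
            print(result.alternatives[0].transcript)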
Contributor commented:

The docstring needs to be updated.

Contributor commented:

resolved in #43

nchannels = 1
if args.list_input_devices:
    riva.client.audio_io.list_input_devices()
    return
Contributor commented:

Suggested change:

     return
+    if args.list_output_devices:
+        riva.client.audio_io.list_output_devices()
+        return

sound_stream = riva.client.audio_io.SoundCallBack(
    args.output_device, nchannels=nchannels, sampwidth=sampwidth, framerate=44100
)
print(sound_stream)
Contributor commented:

Why do we need this print?

if args.output_device is not None or args.play_audio:
    print("playing audio")
    sound_stream = riva.client.audio_io.SoundCallBack(
        args.output_device, nchannels=nchannels, sampwidth=sampwidth, framerate=44100
Contributor commented:

Maybe we should make framerate a parameter of the script, like --sample-rate-hz in the script tts/talk.py?
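For instance, a sketch of the suggestion (the flag name mirrors tts/talk.py; the 44100 default matches the hard-coded value above):

parser.add_argument(
    "--sample-rate-hz",
    type=int,
    default=44100,
    help="Number of audio frames per second in output audio.",
)

# ... then use it in place of the hard-coded framerate:
sound_stream = riva.client.audio_io.SoundCallBack(
    args.output_device, nchannels=nchannels, sampwidth=sampwidth, framerate=args.sample_rate_hz
)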

Comment on lines +68 to +70
sampwidth = 2
nchannels = 1
Contributor commented:

sampwidth and nchannels are set in two places: here and in the play_responses() function. Could you make them global variables?
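A minimal sketch of that refactor (the constant names are illustrative):

# Module-level audio format constants shared by main() and play_responses().
SAMPWIDTH = 2   # bytes per sample, i.e. 16-bit PCM
NCHANNELS = 1   # mono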

"then the default output audio device will be used.",
)

parser = add_asr_config_argparse_parameters(parser, profanity_filter=True)
Contributor commented:

You'll probably need to set max_alternatives=False and word_time_offsets=False because these parameters are pointless for the script. Do you think we also need to add a speaker_diarization=False flag?
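That is, something like the call below (assuming the helper accepts these keyword flags, as the existing profanity_filter=True argument suggests):

parser = add_asr_config_argparse_parameters(
    parser, profanity_filter=True, max_alternatives=False, word_time_offsets=False
)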

parser.add_argument("--output-device", type=int, help="Output device to use.")
parser.add_argument("--target-language-code", default="en-US", help="Language code of the output language.")
parser.add_argument(
    "--play-audio",
Contributor commented:

If --play-audio is not set, the script doesn't give any output. We should probably add an --output parameter as in tts/talk.py so that the script can save the output received from the server.
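A sketch of the idea, with the flag name borrowed from tts/talk.py (the wav-writing details and the response field name are assumptions):

import wave
from pathlib import Path

parser.add_argument("--output", type=Path, help="Output .wav file to write translated speech to.")

# ... after parsing args, mirror tts/talk.py: append audio from each response to a wav file.
out_f = wave.open(str(args.output), 'wb')
out_f.setnchannels(nchannels)
out_f.setsampwidth(sampwidth)
out_f.setframerate(44100)
for response in responses:  # the S2S response generator
    out_f.writeframesraw(response.speech.audio)  # field name assumed from the S2S proto
out_f.close()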

Comment on lines +107 to +116
play_responses(responses=nmt_service.streaming_s2s_response_generator(
    audio_chunks=audio_chunk_iterator,
    streaming_config=s2s_config), sound_stream=sound_stream)
Contributor commented:

Suggested change:

-    play_responses(responses=nmt_service.streaming_s2s_response_generator(
-        audio_chunks=audio_chunk_iterator,
-        streaming_config=s2s_config), sound_stream=sound_stream)
+    play_responses(
+        responses=nmt_service.streaming_s2s_response_generator(
+            audio_chunks=audio_chunk_iterator,
+            streaming_config=s2s_config,
+        ),
+        sound_stream=sound_stream
+    )

        interim_results=True,
    ),
    translation_config=riva.client.TranslationConfig(
        target_language_code=args.target_language_code,
Contributor commented:

There should be a source_language_code here and, probably, a model_name, as in the config message.
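That is, roughly (the source-language and model-name CLI flags are assumed additions, not yet in the PR):

translation_config=riva.client.TranslationConfig(
    source_language_code=args.source_language_code,  # assumed new CLI flag
    target_language_code=args.target_language_code,
    model_name=args.model_name,  # assumed new CLI flag, matching the config message
),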

first = True # first tts output chunk received
auth = riva.client.Auth(args.ssl_cert, args.use_ssl, args.server)
nmt_service = riva.client.NeuralMachineTranslationClient(auth)
s2s_config = riva.client.StreamingTranslateSpeechToSpeechConfig(
Contributor commented:

Do we need a tts_config as in the proto? If so, we could add an add_tts_config_argparse_parameters() function to argparse_utils.py and refactor tts/talk.py to use it.
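A rough sketch of the proposed helper (it does not exist in argparse_utils.py yet; the parameter set is illustrative):

import argparse

def add_tts_config_argparse_parameters(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    # Proposed helper: TTS parameters shared by tts/talk.py and this S2S client.
    parser.add_argument("--voice", help="A voice name to use for synthesis.")
    parser.add_argument(
        "--sample-rate-hz", type=int, default=44100, help="Sample rate of synthesized audio."
    )
    return parser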

@rmittal-github rmittal-github changed the base branch from release/2.9.0 to main January 30, 2023 04:56
@rmittal-github rmittal-github changed the base branch from main to release/2.11.0 April 19, 2023 12:15
@rmittal-github rmittal-github changed the base branch from release/2.11.0 to main May 19, 2023 11:05