NeMo transducer-based Models
Hint
See Installation to install sherpa-onnx before you read this section.
sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8 (25 European Languages)
This model is converted from nvidia/parakeet-tdt-0.6b-v3.
You can find the conversion script in the sherpa-onnx repository.
It supports 25 European languages:
Bulgarian (bg), Croatian (hr), Czech (cs), Danish (da), Dutch (nl)
English (en), Estonian (et), Finnish (fi), French (fr), German (de)
Greek (el), Hungarian (hu), Italian (it), Latvian (lv), Lithuanian (lt)
Maltese (mt), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk)
Slovenian (sl), Spanish (es), Swedish (sv), Russian (ru), Ukrainian (uk)
In the following, we describe how to download it and use it with sherpa-onnx.
Colab
We provide two Colab notebooks for this model.
Hugging Face Space
You can try it by visiting https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
Download the model
Please use the following commands to download it.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2
tar xvf sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2
rm sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2
You should see something like the following after downloading:
ls -lh sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/
total 640M
-rw-r--r-- 1 501 staff 12M Aug 16 09:00 decoder.int8.onnx
-rw-r--r-- 1 501 staff 622M Aug 16 09:00 encoder.int8.onnx
-rw-r--r-- 1 501 staff 6.1M Aug 16 09:00 joiner.int8.onnx
drwxr-xr-x 2 501 staff 4.0K Aug 16 09:00 test_wavs
-rw-r--r-- 1 501 staff 92K Aug 16 09:00 tokens.txt
Decode wave files
Hint
It supports decoding only wave files with a single channel and 16-bit encoded samples. The sampling rate, however, does not need to be 16 kHz; the input is resampled automatically when necessary.
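Before decoding, you can verify that a file meets these requirements with Python's standard `wave` module. This small helper is our own sketch, not part of sherpa-onnx:

```python
import wave


def check_wav(path: str) -> None:
    """Check that a wave file is single-channel with 16-bit samples.

    Any sampling rate is acceptable; sherpa-onnx resamples internally
    when the rate differs from the model's 16 kHz.
    """
    with wave.open(path, "rb") as f:
        channels = f.getnchannels()
        width = f.getsampwidth()
        assert channels == 1, f"expected 1 channel, got {channels}"
        assert width == 2, f"expected 16-bit samples, got {8 * width}-bit"
        duration = f.getnframes() / f.getframerate()
        print(f"{path}: {f.getframerate()} Hz, {duration:.2f} s, OK")
```

Run it on a test file (e.g. `check_wav("test_wavs/en.wav")`) before passing the file to the recognizer.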
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/test_wavs/en.wav
Note
Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.
You should see the following output:
/project/sherpa-onnx/csrc/parse-options.cc:Read:372 sherpa-onnx-offline --encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx --decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx --joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx --tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt --model-type=nemo_transducer ./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/test_wavs/en.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx", decoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx", joiner_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), canary=OfflineCanaryModelConfig(encoder="", decoder="", src_lang="", tgt_lang="", use_pnc=True), telespeech_ctc="", tokens="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="nemo_transducer", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5, lodr_scale=0.01, lodr_fst="", lodr_backoff_id=-1), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="", hr=HomophoneReplacerConfig(dict_dir="", lexicon="", rule_fsts=""))
Creating recognizer ...
Started
/project/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:160 Creating a resampler:
in_sample_rate: 24000
output_sample_rate: 16000
Done!
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/test_wavs/en.wav
{"lang": "", "emotion": "", "event": "", "text": " Ask not what your country can do for you, ask what you can do for your country.", "timestamps": [0.00, 0.08, 0.40, 0.64, 0.80, 0.96, 1.04, 1.04, 1.04, 1.28, 1.44, 1.60, 1.68, 1.84, 2.08, 2.16, 2.40, 2.56, 2.64, 2.80, 2.96, 3.12, 3.28, 3.36, 3.36, 3.36, 3.68], "tokens":[" A", "sk", " not", " what", " your", " co", "un", "tr", "y", " can", " do", " for", " you", ",", " a", "sk", " what", " you", " can", " do", " for", " your", " co", "un", "tr", "y", "."], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.249 s
Real time factor (RTF): 1.249 / 3.845 = 0.325
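The recognizer prints one JSON object per decoded file, pairing every token with its emission time, and ends with a real-time-factor (RTF) summary: RTF = elapsed processing time / audio duration, so values below 1 mean faster than real time. A short sketch of both computations (the JSON snippet is abridged from the output above):

```python
import json


def real_time_factor(elapsed_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; < 1 is faster than real time."""
    return elapsed_s / audio_s


# Abridged from the recognizer's JSON output above: each token is
# paired with the time (in seconds) at which it was emitted.
result = json.loads(
    '{"text": " Ask not what", '
    '"timestamps": [0.00, 0.08, 0.40, 0.64], '
    '"tokens": [" A", "sk", " not", " what"]}'
)
pairs = list(zip(result["tokens"], result["timestamps"]))
print(pairs[:2])                                # [(' A', 0.0), ('sk', 0.08)]
print(f"{real_time_factor(1.249, 3.845):.3f}")  # 0.325
```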
Real-time/streaming speech recognition from a microphone with VAD
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-simulated-streaming-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone with VAD
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt \
--model-type=nemo_transducer
Decode a long audio file with VAD
The following example shows how to decode a very long audio file with the help of VAD.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/Obama.wav
./build/bin/sherpa-onnx-vad-with-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--silero-vad-threshold=0.2 \
--silero-vad-min-speech-duration=0.2 \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt \
--model-type=nemo_transducer \
./Obama.wav
| Wave filename | Content |
|---|---|
| Obama.wav | A back-to-school speech by Barack Obama (see the transcript in the output below) |
You should see the following output:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:373 ./build/bin/sherpa-onnx-vad-with-offline-asr --silero-vad-model=./silero_vad.onnx --silero-vad-threshold=0.2 --silero-vad-min-speech-duration=0.2 --encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx --decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx --joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx --tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt --model-type=nemo_transducer ./Obama.wav
VadModelConfig(silero_vad=SileroVadModelConfig(model="./silero_vad.onnx", threshold=0.2, min_silence_duration=0.5, min_speech_duration=0.2, max_speech_duration=20, window_size=512, neg_threshold=-1), ten_vad=TenVadModelConfig(model="", threshold=0.5, min_silence_duration=0.5, min_speech_duration=0.25, max_speech_duration=20, window_size=256), sample_rate=16000, num_threads=1, provider="cpu", debug=False)
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx", decoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx", joiner_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1, enable_token_timestamps=False, enable_segment_timestamps=False), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder="", merged_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), canary=OfflineCanaryModelConfig(encoder="", decoder="", src_lang="", tgt_lang="", use_pnc=True), cohere_transcribe=OfflineCohereTranscribeModelConfig(encoder="", decoder="", language="", use_punct=True, use_itn=True), omnilingual=OfflineOmnilingualAsrCtcModelConfig(model=""), funasr_nano=OfflineFunASRNanoModelConfig(encoder_adaptor="", llm="", embedding="", tokenizer="", system_prompt="You are a helpful assistant.", user_prompt="语音转写:", max_new_tokens=512, temperature=1e-06, top_p=0.8, seed=42, language="", itn=True, hotwords=""), medasr=OfflineMedAsrCtcModelConfig(model=""), fire_red_asr_ctc=OfflineFireRedAsrCtcModelConfig(model=""), qwen3_asr=OfflineQwen3ASRModelConfig(conv_frontend="", encoder="", decoder="", tokenizer="", hotwords="", 
max_total_len=512, max_new_tokens=128, temperature=1e-06, top_p=0.8, seed=42), telespeech_ctc="", tokens="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="nemo_transducer", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5, lodr_scale=0.01, lodr_fst="", lodr_backoff_id=-1), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="", hr=HomophoneReplacerConfig(lexicon="", rule_fsts=""))
Creating recognizer ...
Recognizer created!
Started
Reading: ./Obama.wav
Started!
7.248 -- 8.204: Thank you.
8.976 -- 12.140: Thank you, everybody. All right, everybody go ahead and have a seat.
13.104 -- 14.540: How's everybody doing today?
18.704 -- 22.892: How about Tim Spicer?
25.936 -- 31.884: I am here with students at Wakefield High School in Arlington, Virginia.
32.720 -- 48.844: And we've got students tuning in from all across America, from kindergarten through 12th grade. And I am just so glad that all could join us today, and I want to thank Wakefield for being such an outstanding host. Give yourselves a big round of applause.
54.416 -- 55.436: I know that
56.240 -- 58.892: For many of you, today is the first day of school.
59.600 -- 69.452: And for those of you in kindergarten or starting middle or high school, it's your first day in a new school, so it's understandable if you're a little nervous.
70.640 -- 76.332: I imagine there's some seniors out there who are feeling pretty good right now. With just one more year to go.
78.800 -- 87.180: And no matter what grade you're in, some of you are probably wishing it were still summer, and you could have stayed just a little bit longer this morning.
87.984 -- 89.100: I know that feeling.
91.664 -- 111.708: When I was young, my family lived overseas. I lived in Indonesia for a few years. And my mother, she didn't have the money to send me where all the American kids went to school, but she thought it was important for me to keep up with an American education. So she decided to teach me extra lessons herself.
112.240 -- 118.700: Monday through Friday, but because she had to go to work, the only time she could do it was at 430 in the morning.
120.048 -- 127.244: Now, as you might imagine, I wasn't too happy about getting up that early. A lot of times I'd fall asleep right there at the kitchen table.
128.272 -- 135.340: But whenever I'd complain, my mother would just give me one of those looks and she'd say, this is no picnic for me either, Buster.
137.104 -- 145.132: So I know that some of you are still adjusting to being back at school, but I'm here today because I have something important to discuss with you.
145.808 -- 153.740: I'm here because I want to talk with you about your education and what's expected of all of you in this new school year.
154.448 -- 160.268: I've given a lot of speeches about education and I've talked about responsibility a lot.
160.816 -- 178.220: I've talked about teachers' responsibility for inspiring students and pushing you to learn. I've talked about your parents' responsibility for making sure you stay on track and you get your homework done and don't spend every waking hour in front of the TV or with the Xbox.
179.088 -- 180.716: I've talked a lot about
181.360 -- 193.452: Your government's responsibility for setting high standards and supporting teachers and principals and turning around schools that aren't working, where students aren't getting the opportunities that they deserve.
194.000 -- 195.276: But at the end of the day.
196.016 -- 206.156: We can have the most dedicated teachers, the most supportive parents, the best schools in the world, and none of it will make a difference. None of it will matter.
206.704 -- 210.604: unless all of you fulfill your responsibilities.
211.248 -- 223.404: unless you show up to those schools, unless you pay attention to those teachers, unless you listen to your parents and grandparents and other adults, and put in the hard work it takes to succeed.
224.656 -- 230.924: That's what I want to focus on today. The responsibility each of you has for your education.
231.728 -- 234.796: I want to start with the responsibility you have to yourself.
235.696 -- 238.988: Every single one of you has something that you're good at.
239.760 -- 242.412: Every single one of you has something to offer.
242.992 -- 247.404: And you have a responsibility to yourself to discover what that is.
248.336 -- 251.564: That's the opportunity an education can provide.
252.336 -- 265.900: Maybe you could be a great writer, maybe even good enough to write a book, or articles in a newspaper, but you might not know it until you write that English paper, that English class paper that's assigned to you.
266.704 -- 278.668: Maybe you could be an innovator or an inventor, maybe even good enough to come up with the next iPhone or the new medicine or vaccine, but you might not know it until you do your project for your science class.
279.824 -- 289.964: Maybe you could be a mayor, or a senator, or a Supreme Court Justice, but you might not know that until you join student government or the debate team.
291.568 -- 309.516: And no matter what you want to do with your life, I guarantee that you'll need an education to do it. You want to be a doctor or a teacher or a police officer, you want to be a nurse or an architect, a lawyer, or a member of our military, you're going to need a good education for every single one of those careers.
310.064 -- 314.348: You cannot drop out of school and just drop into a good job.
315.184 -- 319.852: You've got to train for it and work for it and learn for it.
320.528 -- 323.628: And this isn't just important for your own life and your own future.
324.688 -- 332.812: What you make of your education will decide nothing less than the future of this country. The future of America depends on you.
num threads: 2
decoding method: greedy_search
Elapsed seconds: 9.372 s
Real time factor (RTF): 9.372 / 334.234 = 0.028
Hint
If you want a GUI version that can export SRT files, please visit
https://k2-fsa.github.io/sherpa/onnx/tauri/app/vad-asr-file.html and search for
en-parakeet_tdt_v3. Please always use the latest version.
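If you prefer to script the export yourself, the `start -- end: text` segments printed above map directly onto SRT cues. A minimal sketch (the `to_srt` helper and its segment-tuple input format are our own, not part of sherpa-onnx):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Convert (start, end, text) tuples, as printed by
    sherpa-onnx-vad-with-offline-asr, into SRT cue blocks."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        )
    return "\n\n".join(blocks) + "\n"


print(to_srt([(7.248, 8.204, "Thank you.")]))
# 1
# 00:00:07,248 --> 00:00:08,204
# Thank you.
```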
sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8 (English, 英语)
This model is converted from nvidia/parakeet-tdt-0.6b-v2.
You can find the conversion script in the sherpa-onnx repository.
In the following, we describe how to download it and use it with sherpa-onnx.
Hint
This model outputs punctuation and casing.
Download the model
Please use the following commands to download it.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8.tar.bz2
tar xvf sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8.tar.bz2
rm sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8.tar.bz2
Hint
If you want to try the float16-quantized model, please use sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-fp16.tar.bz2.
If you want to try the non-quantized decoder and joiner models, please use sherpa-onnx-nemo-parakeet-tdt-0.6b-v2.tar.bz2.
You should see something like the following after downloading:
ls -lh sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/
total 1295752
-rw-r--r-- 1 fangjun staff 6.9M May 6 16:24 decoder.int8.onnx
-rw-r--r-- 1 fangjun staff 622M May 6 16:24 encoder.int8.onnx
-rw-r--r-- 1 fangjun staff 1.7M May 6 16:24 joiner.int8.onnx
drwxr-xr-x 3 fangjun staff 96B May 6 16:24 test_wavs
-rw-r--r-- 1 fangjun staff 9.2K May 6 16:24 tokens.txt
Decode wave files
Hint
It supports decoding only wave files with a single channel and 16-bit encoded samples. The sampling rate, however, does not need to be 16 kHz; the input is resampled automatically when necessary.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
Note
Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.
You should see the following output:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:372 ./build/bin/sherpa-onnx-offline --encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx --decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx --joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx --tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt --model-type=nemo_transducer ./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx", decoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx", joiner_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), telespeech_ctc="", tokens="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="nemo_transducer", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="", hr=HomophoneReplacerConfig(dict_dir="", lexicon="", rule_fsts=""))
Creating recognizer ...
Started
Done!
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
{"lang": "", "emotion": "", "event": "", "text": " Well, I don't wish to see it any more, observed Phebe, turning away her eyes. It is certainly very like the old portrait.", "timestamps": [0.32, 0.64, 0.72, 0.80, 0.88, 0.96, 1.04, 1.12, 1.28, 1.44, 1.60, 1.76, 1.92, 2.00, 2.24, 2.32, 2.40, 2.48, 2.64, 2.72, 2.88, 3.12, 3.36, 3.44, 3.52, 3.68, 3.76, 3.92, 4.16, 4.24, 4.32, 4.64, 4.96, 5.12, 5.36, 5.44, 5.52, 5.60, 5.76, 6.00, 6.24, 6.40, 6.48, 6.64, 6.72, 6.80, 6.88, 7.04], "tokens":[" Well", ",", " I", " don", "'", "t", " w", "ish", " to", " see", " it", " any", " more", ",", " ob", "s", "er", "ved", " P", "he", "be", ",", " t", "ur", "ning", " a", "way", " her", " e", "y", "es", ".", " It", " is", " c", "ert", "ain", "ly", " very", " like", " the", " o", "ld", " p", "ort", "ra", "it", "."], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.874 s
Real time factor (RTF): 0.874 / 7.435 = 0.118
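The `tokens` in the JSON output above are BPE subwords; a token beginning with a space starts a new word. Word-level timestamps can therefore be recovered by merging tokens, as in this sketch (our own helper, not part of sherpa-onnx; punctuation and continuation tokens attach to the preceding word):

```python
def tokens_to_words(tokens, timestamps):
    """Merge BPE subword tokens into words, keeping each word's
    start time (the timestamp of its first token). Tokens that do
    not begin with a space (continuations, punctuation) are glued
    onto the previous word."""
    words = []
    for tok, ts in zip(tokens, timestamps):
        if tok.startswith(" ") or not words:
            words.append([tok.strip(), ts])
        else:
            words[-1][0] += tok
    return [tuple(w) for w in words]


# Tokens/timestamps abridged from the JSON output above.
print(tokens_to_words(
    [" Well", ",", " I", " don", "'", "t"],
    [0.32, 0.64, 0.72, 0.80, 0.88, 0.96],
))  # [('Well,', 0.32), ('I', 0.72), ("don't", 0.8)]
```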
Real-time/streaming speech recognition from a microphone with VAD
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-simulated-streaming-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone with VAD
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer
Decode a long audio file with VAD
The following example shows how to decode a very long audio file with the help of VAD.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/Obama.wav
./build/bin/sherpa-onnx-vad-with-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--silero-vad-threshold=0.2 \
--silero-vad-min-speech-duration=0.2 \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer \
./Obama.wav
| Wave filename | Content |
|---|---|
| Obama.wav | A back-to-school speech by Barack Obama (see the transcript in the output below) |
You should see the following output:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:373 ./build/bin/sherpa-onnx-vad-with-offline-asr --silero-vad-model=./silero_vad.onnx --silero-vad-threshold=0.2 --silero-vad-min-speech-duration=0.2 --encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx --decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx --joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx --tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt --model-type=nemo_transducer ./Obama.wav
VadModelConfig(silero_vad=SileroVadModelConfig(model="./silero_vad.onnx", threshold=0.2, min_silence_duration=0.5, min_speech_duration=0.2, max_speech_duration=20, window_size=512, neg_threshold=-1), ten_vad=TenVadModelConfig(model="", threshold=0.5, min_silence_duration=0.5, min_speech_duration=0.25, max_speech_duration=20, window_size=256), sample_rate=16000, num_threads=1, provider="cpu", debug=False)
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx", decoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx", joiner_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1, enable_token_timestamps=False, enable_segment_timestamps=False), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder="", merged_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), canary=OfflineCanaryModelConfig(encoder="", decoder="", src_lang="", tgt_lang="", use_pnc=True), cohere_transcribe=OfflineCohereTranscribeModelConfig(encoder="", decoder="", language="", use_punct=True, use_itn=True), omnilingual=OfflineOmnilingualAsrCtcModelConfig(model=""), funasr_nano=OfflineFunASRNanoModelConfig(encoder_adaptor="", llm="", embedding="", tokenizer="", system_prompt="You are a helpful assistant.", user_prompt="语音转写:", max_new_tokens=512, temperature=1e-06, top_p=0.8, seed=42, language="", itn=True, hotwords=""), medasr=OfflineMedAsrCtcModelConfig(model=""), fire_red_asr_ctc=OfflineFireRedAsrCtcModelConfig(model=""), qwen3_asr=OfflineQwen3ASRModelConfig(conv_frontend="", encoder="", decoder="", tokenizer="", hotwords="", 
max_total_len=512, max_new_tokens=128, temperature=1e-06, top_p=0.8, seed=42), telespeech_ctc="", tokens="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="nemo_transducer", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5, lodr_scale=0.01, lodr_fst="", lodr_backoff_id=-1), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="", hr=HomophoneReplacerConfig(lexicon="", rule_fsts=""))
Creating recognizer ...
Recognizer created!
Started
Reading: ./Obama.wav
Started!
7.248 -- 8.204: Thank you.
8.976 -- 12.140: Thank you, everybody. All right, everybody, go ahead and have a seat.
13.104 -- 14.540: How's everybody doing today?
18.704 -- 22.892: How about Tim Spicer?
25.936 -- 31.884: I am here with students at Wakefield High School in Arlington, Virginia.
32.720 -- 48.844: And we've got students tuning in from all across America, from kindergarten through 12th grade. And I am just so glad that all could join us today. And I want to thank Wakefield for being such an outstanding host. Give yourselves a big round of applause.
54.416 -- 55.436: I know that
56.240 -- 58.892: For many of you, today is the first day of school.
59.600 -- 69.452: And for those of you in kindergarten or starting middle or high school, it's your first day in a new school, so it's understandable if you're a little nervous.
70.640 -- 76.332: I imagine there's some seniors out there who are feeling pretty good right now. With just one more year to go.
78.800 -- 87.180: And no matter what grade you're in, some of you are probably wishing it was still summer and you could have stayed in bed just a little bit longer this morning.
87.984 -- 89.100: I know that feeling.
91.664 -- 111.708: When I was young, my family lived overseas. I lived in Indonesia for a few years. And my mother, she didn't have the money to send me where all the American kids went to school. But she thought it was important for me to keep up with American education. So she decided to teach me extra lessons herself.
112.240 -- 118.700: Monday through Friday, but because she had to go to work, the only time she could do it was at 4:30 in the morning.
120.048 -- 127.244: Now, as you might imagine, I wasn't too happy about getting up that early. And a lot of times I'd fall asleep right there at the kitchen table.
128.272 -- 135.340: But whenever I'd complain, my mother would just give me one of those looks and she'd say, This is no picnic for me either, Buster.
137.104 -- 145.132: So I know that some of you are still adjusting to being back at school, but I'm here today because I have something important to discuss with you.
145.808 -- 153.740: I'm here because I want to talk with you about your education and what's expected of all of you in this new school year.
154.448 -- 160.268: I've given a lot of speeches about education, and I've talked about responsibility a lot.
160.816 -- 178.220: I've talked about teachers' responsibility for inspiring students and pushing you to learn. I've talked about your parents' responsibility for making sure you stay on track and you get your homework done and don't spend every waking hour in front of the T V or with the Xbox.
179.088 -- 180.716: I've talked a lot about
181.360 -- 193.452: Your government's responsibility for setting high standards and supporting teachers and principals and turning around schools that aren't working, where students aren't getting the opportunities that they deserve.
194.000 -- 195.276: But at the end of the day,
196.016 -- 206.156: We can have the most dedicated teachers, the most supportive parents, the best schools in the world, and none of it will make a difference. None of it will matter.
206.704 -- 210.604: unless all of you fulfill your responsibilities.
211.248 -- 223.404: Unless you show up to those schools, unless you pay attention to those teachers, unless you listen to your parents and grandparents and other adults and put in the hard work it takes to succeed.
224.656 -- 230.924: And that's what I want to focus on today: the responsibility each of you has for your education.
231.728 -- 234.796: I want to start with the responsibility you have to yourself.
235.696 -- 238.988: Every single one of you has something that you're good at.
239.760 -- 242.412: Every single one of you has something to offer.
242.992 -- 247.404: And you have a responsibility to yourself to discover what that is.
248.336 -- 251.564: That's the opportunity an education can provide.
252.336 -- 265.900: Maybe you could be a great writer, maybe even good enough to write a book or articles in a newspaper, but you might not know it until you write that English paper, that English class paper that's assigned to you.
266.704 -- 278.668: Maybe you could be an innovator or an inventor, maybe even good enough to come up with the next iPhone or the new medicine or vaccine. But you might not know it until you do your project for your science class.
279.824 -- 289.964: Maybe you could be a mayor or a senator or a Supreme Court justice. But you might not know that until you join student government or the debate team.
291.568 -- 309.516: And no matter what you want to do with your life, I guarantee that you'll need an education to do it. You want to be a doctor or a teacher or a police officer, you want to be a nurse or an architect, a lawyer or a member of our military, you're going to need a good education for every single one of those careers.
310.064 -- 314.348: You cannot drop out of school and just drop into a good job.
315.184 -- 319.852: You've got to train for it and work for it and learn for it.
320.528 -- 323.628: And this isn't just important for your own life and your own future.
324.688 -- 332.812: What you make of your education will decide nothing less than the future of this country. The future of America depends on you.
num threads: 2
decoding method: greedy_search
Elapsed seconds: 9.406 s
Real time factor (RTF): 9.406 / 334.234 = 0.028
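The RTF reported above is simply the elapsed decoding time divided by the duration of the input audio; values below 1 mean decoding runs faster than real time. A minimal sketch reproducing the figure, with the numbers taken from the run above:

```python
# RTF = processing time / audio duration.
elapsed = 9.406     # seconds spent decoding (from the run above)
duration = 334.234  # seconds of audio in the input file
rtf = elapsed / duration
print(f"RTF: {rtf:.3f}")  # RTF: 0.028
```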
Hint
If you want to use a GUI version and want to export SRT format, please visit
https://k2-fsa.github.io/sherpa/onnx/tauri/app/vad-asr-file.html and search for
en-parakeet_tdt. Please always use the latest version.
RTF on RK3588 with Cortex A76 CPU
In the following, we test this model on an RK3588 board using its Cortex-A76 CPU cores.
Information about the CPUs on the board is given below:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: ARM
Model name: Cortex-A55
Model: 0
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: r2p0
CPU max MHz: 1800.0000
CPU min MHz: 408.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Model name: Cortex-A76
Model: 0
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: r4p0
CPU max MHz: 2304.0000
CPU min MHz: 408.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
L1d cache: 384 KiB (8 instances)
L1i cache: 384 KiB (8 instances)
L2 cache: 2.5 MiB (8 instances)
L3 cache: 3 MiB (1 instance)
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Vulnerable: Unprivileged eBPF enabled
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
You can see that it has 8 CPUs: 4 Cortex A55 + 4 Cortex A76.
We use taskset below to pin the process to the Cortex-A76 cores when measuring the RTF.
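The hexadecimal masks passed to taskset select which CPUs the process may run on: bit N of the mask corresponds to CPU N. Per the lscpu output above, CPUs 0-3 are the Cortex-A55 cores and CPUs 4-7 are the Cortex-A76 cores on this board. A quick sanity check of the masks used below:

```shell
# Bit N selects CPU N; CPUs 4-7 are the Cortex-A76 cores on this board.
printf '0x%x\n' $((1 << 7))                                    # 0x80: CPU 7 (1 A76 core)
printf '0x%x\n' $(((1 << 7) | (1 << 6)))                       # 0xc0: CPUs 6-7 (2 A76 cores)
printf '0x%x\n' $(((1 << 7) | (1 << 6) | (1 << 5)))            # 0xe0: CPUs 5-7 (3 A76 cores)
printf '0x%x\n' $(((1 << 7) | (1 << 6) | (1 << 5) | (1 << 4))) # 0xf0: CPUs 4-7 (4 A76 cores)
```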
taskset 0x80 sherpa-onnx-offline \
--num-threads=1 \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
Its output is given below:
/project/sherpa-onnx/csrc/parse-options.cc:Read:372 sherpa-onnx-offline --num-threads=1 --encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx --decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx --joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx --tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt --model-type=nemo_transducer ./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx", decoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx", joiner_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), telespeech_ctc="", tokens="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt", num_threads=1, debug=False, provider="cpu", model_type="nemo_transducer", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="", hr=HomophoneReplacerConfig(dict_dir="", lexicon="", rule_fsts=""))
Creating recognizer ...
Started
Done!
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
{"lang": "", "emotion": "", "event": "", "text": " Well, I don't wish to see it any more, observed Phebe, turning away her eyes. It is certainly very like the old portrait.", "timestamps": [0.32, 0.64, 0.72, 0.80, 0.88, 0.96, 1.04, 1.12, 1.28, 1.44, 1.60, 1.76, 1.92, 2.00, 2.24, 2.32, 2.40, 2.48, 2.64, 2.72, 2.88, 3.12, 3.36, 3.44, 3.52, 3.68, 3.76, 3.92, 4.16, 4.24, 4.32, 4.64, 4.96, 5.12, 5.36, 5.44, 5.52, 5.60, 5.76, 6.00, 6.24, 6.40, 6.48, 6.64, 6.72, 6.80, 6.88, 7.04], "tokens":[" Well", ",", " I", " don", "'", "t", " w", "ish", " to", " see", " it", " any", " more", ",", " ob", "s", "er", "ved", " P", "he", "be", ",", " t", "ur", "ning", " a", "way", " her", " e", "y", "es", ".", " It", " is", " c", "ert", "ain", "ly", " very", " like", " the", " o", "ld", " p", "ort", "ra", "it", "."], "words": []}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 1.639 s
Real time factor (RTF): 1.639 / 7.435 = 0.220
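Each decoded file is reported as one JSON object in which the timestamps array aligns one-to-one with the tokens array: timestamps[i] is the start time, in seconds, of tokens[i]. A minimal sketch pairing the two, using an abbreviated result for illustration:

```python
import json

# Abbreviated result in the JSON format printed above; timestamps[i]
# is the start time (in seconds) of tokens[i].
raw = '{"text": " Well,", "timestamps": [0.32, 0.64], "tokens": [" Well", ","]}'
result = json.loads(raw)
aligned = list(zip(result["timestamps"], result["tokens"]))
print(aligned)  # [(0.32, ' Well'), (0.64, ',')]
```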
To test the RTF with different --num-threads, we use:
taskset 0xc0 sherpa-onnx-offline \
--num-threads=2 \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
taskset 0xe0 sherpa-onnx-offline \
--num-threads=3 \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
taskset 0xf0 sherpa-onnx-offline \
--num-threads=4 \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
The results are summarized below:
Number of threads     | 1     | 2     | 3     | 4
----------------------+-------+-------+-------+------
RTF on Cortex A76 CPU | 0.220 | 0.142 | 0.118 | 0.088
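The scaling is sub-linear: going from 1 to 4 threads cuts the RTF by a factor of 2.5, not 4. A quick sketch computing speedup and parallel efficiency from the numbers above:

```python
# RTF values measured above; speedup is relative to 1 thread.
rtf = {1: 0.220, 2: 0.142, 3: 0.118, 4: 0.088}
for n, r in rtf.items():
    speedup = rtf[1] / r
    print(f"{n} thread(s): speedup {speedup:.2f}x, efficiency {speedup / n:.0%}")
```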
sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19 (Russian, 俄语)
This model is converted from
You can find the conversion script at
In the following, we describe how to download it and use it with sherpa-onnx.
Download the model
Please use the following commands to download it.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19.tar.bz2
tar xvf sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19.tar.bz2
rm sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19.tar.bz2
You should see something like below after downloading:
ls -lh sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19
total 231M
-rw-r--r-- 1 501 staff 3.2M Apr 20 01:58 decoder.onnx
-rw-r--r-- 1 501 staff 226M Apr 20 01:59 encoder.int8.onnx
-rw-r--r-- 1 501 staff 1.4M Apr 20 01:58 joiner.onnx
-rw-r--r-- 1 501 staff 219K Apr 20 01:59 LICENSE
-rw-r--r-- 1 501 staff 302 Apr 20 01:59 README.md
-rwxr-xr-x 1 501 staff 868 Apr 20 01:51 run-rnnt-v2.sh
-rwxr-xr-x 1 501 staff 8.9K Apr 20 01:59 test-onnx-rnnt.py
drwxr-xr-x 2 501 staff 4.0K Apr 21 09:35 test_wavs
-rw-r--r-- 1 501 staff 196 Apr 20 01:58 tokens.txt
Decode wave files
Hint
It supports decoding only single-channel wave files with 16-bit encoded samples. The sampling rate, however, does not need to be 16 kHz.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/test_wavs/example.wav
Note
Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.
Caution
If you use Windows and encounter encoding issues, please run:
CHCP 65001
in your command line.
You should see the following output:
/project/sherpa-onnx/csrc/parse-options.cc:Read:375 sherpa-onnx-offline --encoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/encoder.int8.onnx --decoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/decoder.onnx --joiner=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/joiner.onnx --tokens=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/tokens.txt --model-type=nemo_transducer ./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/test_wavs/example.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/encoder.int8.onnx", decoder_filename="./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/decoder.onnx", joiner_filename="./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/joiner.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), telespeech_ctc="", tokens="./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="nemo_transducer", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!
./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/test_wavs/example.wav
{"lang": "", "emotion": "", "event": "", "text": "ничьих не требуя похвал счастлив уж я надеждой сладкой что дева с трепетом любви посмотрит может быть украдкой на песни грешные мои у лукоморья дуб зеленый", "timestamps": [0.04, 0.12, 0.16, 0.24, 0.32, 0.40, 0.44, 0.52, 0.56, 0.60, 0.64, 0.72, 0.76, 0.80, 0.88, 0.96, 1.04, 1.12, 1.16, 1.24, 1.32, 1.36, 1.44, 1.56, 1.76, 1.84, 1.88, 1.96, 2.00, 2.04, 2.08, 2.16, 2.24, 2.28, 2.36, 2.40, 2.48, 2.60, 2.68, 2.72, 2.76, 2.84, 2.92, 2.96, 3.04, 3.08, 3.16, 3.20, 3.24, 3.32, 3.36, 3.44, 3.52, 3.56, 3.64, 3.68, 3.72, 3.76, 3.80, 3.88, 3.92, 4.00, 4.08, 4.16, 4.20, 4.24, 4.28, 4.32, 4.36, 4.44, 4.52, 4.56, 4.64, 4.68, 4.76, 4.80, 4.88, 4.92, 5.00, 5.08, 5.16, 5.36, 5.44, 5.52, 5.60, 5.68, 5.72, 5.76, 5.84, 5.92, 6.00, 6.04, 6.12, 6.16, 6.20, 6.24, 6.28, 6.32, 6.40, 6.44, 6.48, 6.52, 6.56, 6.64, 6.72, 6.76, 6.84, 6.92, 7.00, 7.04, 7.12, 7.16, 7.20, 7.24, 7.32, 7.36, 7.40, 7.48, 7.60, 7.64, 7.72, 7.76, 7.84, 7.88, 8.00, 8.08, 8.16, 8.24, 8.28, 8.32, 8.44, 8.76, 9.24, 9.32, 9.40, 9.44, 9.52, 9.60, 9.68, 9.76, 9.84, 9.92, 10.00, 10.08, 10.12, 10.24, 10.32, 10.44, 10.52, 10.56, 10.60, 10.68, 10.72, 10.84, 10.92], "tokens":["н", "и", "ч", "ь", "и", "х", " ", "н", "е", " ", "т", "р", "е", "б", "у", "я", " ", "п", "о", "х", "в", "а", "л", " ", "с", "ч", "а", "с", "т", "л", "и", "в", " ", "у", "ж", " ", "я", " ", "н", "а", "д", "е", "ж", "д", "о", "й", " ", "с", "л", "а", "д", "к", "о", "й", " ", "ч", "т", "о", " ", "д", "е", "в", "а", " ", "с", " ", "т", "р", "е", "п", "е", "т", "о", "м", " ", "л", "ю", "б", "в", "и", " ", "п", "о", "с", "м", "о", "т", "р", "и", "т", " ", "м", "о", "ж", "е", "т", " ", "б", "ы", "т", "ь", " ", "у", "к", "р", "а", "д", "к", "о", "й", " ", "н", "а", " ", "п", "е", "с", "н", "и", " ", "г", "р", "е", "ш", "н", "ы", "е", " ", "м", "о", "и", " ", "у", " ", "л", "у", "к", "о", "м", "о", "р", "ь", "я", " ", "д", "у", "б", " ", "з", "е", "л", "е", "н", "ы", "й"], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 4.317 s
Real time factor (RTF): 4.317 / 11.290 = 0.382
Real-time/Streaming Speech recognition from a microphone with VAD
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-simulated-streaming-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone with VAD
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/tokens.txt \
--model-type=nemo_transducer
sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24 (Russian, 俄语)
This model is converted from
You can find the conversion script at
Warning
The license of the model can be found at https://github.com/salute-developers/GigaAM/blob/main/GigaAM%20License_NC.pdf.
It is for non-commercial use only.
In the following, we describe how to download it and use it with sherpa-onnx.
Download the model
Please use the following commands to download it.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24.tar.bz2
tar xvf sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24.tar.bz2
rm sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24.tar.bz2
You should see something like below after downloading:
ls -lh sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/
total 548472
-rw-r--r-- 1 fangjun staff 89K Oct 25 13:36 GigaAM%20License_NC.pdf
-rw-r--r-- 1 fangjun staff 318B Oct 25 13:37 README.md
-rw-r--r-- 1 fangjun staff 3.8M Oct 25 13:36 decoder.onnx
-rw-r--r-- 1 fangjun staff 262M Oct 25 13:37 encoder.int8.onnx
-rw-r--r-- 1 fangjun staff 3.8K Oct 25 13:32 export-onnx-rnnt.py
-rw-r--r-- 1 fangjun staff 2.0M Oct 25 13:36 joiner.onnx
-rwxr-xr-x 1 fangjun staff 2.0K Oct 25 13:32 run-rnnt.sh
-rwxr-xr-x 1 fangjun staff 8.7K Oct 25 13:32 test-onnx-rnnt.py
drwxr-xr-x 4 fangjun staff 128B Oct 25 13:37 test_wavs
-rw-r--r-- 1 fangjun staff 5.8K Oct 25 13:36 tokens.txt
Decode wave files
Hint
It supports decoding only single-channel wave files with 16-bit encoded samples. The sampling rate, however, does not need to be 16 kHz.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/test_wavs/example.wav
Note
Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.
Caution
If you use Windows and encounter encoding issues, please run:
CHCP 65001
in your command line.
You should see the following output:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline --encoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx --decoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx --joiner=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx --tokens=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt --model-type=nemo_transducer ./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/test_wavs/example.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx", decoder_filename="./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx", joiner_filename="./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), telespeech_ctc="", tokens="./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="nemo_transducer", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!
./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/test_wavs/example.wav
{"lang": "", "emotion": "", "event": "", "text": " ничьих не требуя похвал счастлив уж я надеждой сладкой что дева с трепетом любви посмотрит может быть украдкой на песни грешные мои у лукоморья дуб зеленый", "timestamps": [0.04, 0.16, 0.24, 0.28, 0.40, 0.48, 0.60, 0.68, 0.80, 0.92, 1.04, 1.20, 1.28, 1.44, 1.76, 1.88, 2.00, 2.08, 2.16, 2.28, 2.36, 2.44, 2.64, 2.76, 2.92, 3.00, 3.04, 3.16, 3.24, 3.36, 3.48, 3.56, 3.68, 3.88, 4.04, 4.16, 4.24, 4.32, 4.40, 4.56, 4.76, 4.88, 4.92, 5.36, 5.64, 5.84, 5.92, 6.04, 6.32, 6.52, 6.60, 6.72, 6.84, 6.92, 7.04, 7.16, 7.28, 7.36, 7.44, 7.56, 7.68, 7.72, 7.88, 8.00, 8.20, 8.36, 9.28, 9.40, 9.44, 9.52, 9.68, 9.84, 9.88, 9.92, 10.12, 10.32, 10.40, 10.52, 10.56, 10.76, 10.84], "tokens":[" ни", "ч", "ь", "и", "х", " не", " т", "ре", "бу", "я", " по", "х", "ва", "л", " с", "ча", "ст", "ли", "в", " у", "ж", " я", " на", "де", "ж", "до", "й", " с", "ла", "д", "ко", "й", " что", " де", "ва", " с", " т", "ре", "пе", "том", " лю", "б", "ви", " пос", "мот", "ри", "т", " может", " быть", " у", "к", "ра", "д", "ко", "й", " на", " п", "е", "с", "ни", " г", "ре", "ш", "ные", " мо", "и", " у", " ", "лу", "ко", "мо", "р", "ь", "я", " ду", "б", " з", "е", "лен", "ы", "й"], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.775 s
Real time factor (RTF): 1.775 / 11.290 = 0.157
Real-time/Streaming Speech recognition from a microphone with VAD
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-simulated-streaming-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone with VAD
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt \
--model-type=nemo_transducer