NeMo transducer-based Models
Hint
See Installation to install sherpa-onnx before you read this section.
sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8 (25 European Languages)
This model is converted from nvidia/parakeet-tdt-0.6b-v3.
You can find the conversion script in the sherpa-onnx repository.
It supports 25 European languages:
Bulgarian (bg), Croatian (hr), Czech (cs), Danish (da), Dutch (nl)
English (en), Estonian (et), Finnish (fi), French (fr), German (de)
Greek (el), Hungarian (hu), Italian (it), Latvian (lv), Lithuanian (lt)
Maltese (mt), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk)
Slovenian (sl), Spanish (es), Swedish (sv), Russian (ru), Ukrainian (uk)
In the following, we describe how to download it and use it with sherpa-onnx.
Colab
We provide two Colab notebooks for this model.
Hugging Face Space
You can try it by visiting https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
Download the model
Please use the following commands to download it.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2
tar xvf sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2
rm sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2
You should see something like the following after downloading:
ls -lh sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/
total 640M
-rw-r--r-- 1 501 staff 12M Aug 16 09:00 decoder.int8.onnx
-rw-r--r-- 1 501 staff 622M Aug 16 09:00 encoder.int8.onnx
-rw-r--r-- 1 501 staff 6.1M Aug 16 09:00 joiner.int8.onnx
drwxr-xr-x 2 501 staff 4.0K Aug 16 09:00 test_wavs
-rw-r--r-- 1 501 staff 92K Aug 16 09:00 tokens.txt
Decode wave files
Hint
It supports decoding only wave files with a single channel and 16-bit encoded samples. The sampling rate, however, does not need to be 16 kHz; the input is resampled automatically when necessary.
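Before decoding, you can verify that a file meets these requirements with Python's standard `wave` module. This small helper is our own sketch, not part of sherpa-onnx:

```python
import wave


def check_wav(path: str) -> None:
    """Check that a wave file is single-channel with 16-bit samples.

    Any sampling rate is acceptable; sherpa-onnx resamples internally
    when the rate differs from the model's 16 kHz.
    """
    with wave.open(path, "rb") as f:
        channels = f.getnchannels()
        width = f.getsampwidth()
        assert channels == 1, f"expected 1 channel, got {channels}"
        assert width == 2, f"expected 16-bit samples, got {8 * width}-bit"
        duration = f.getnframes() / f.getframerate()
        print(f"{path}: {f.getframerate()} Hz, {duration:.2f} s, OK")
```

Run it on a test file (e.g. `check_wav("test_wavs/en.wav")`) before passing the file to the recognizer.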
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/test_wavs/en.wav
Note
Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.
You should see the following output:
/project/sherpa-onnx/csrc/parse-options.cc:Read:372 sherpa-onnx-offline --encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx --decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx --joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx --tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt --model-type=nemo_transducer ./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/test_wavs/en.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx", decoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx", joiner_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), canary=OfflineCanaryModelConfig(encoder="", decoder="", src_lang="", tgt_lang="", use_pnc=True), telespeech_ctc="", tokens="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="nemo_transducer", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5, lodr_scale=0.01, lodr_fst="", lodr_backoff_id=-1), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="", hr=HomophoneReplacerConfig(dict_dir="", lexicon="", rule_fsts=""))
Creating recognizer ...
Started
/project/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:160 Creating a resampler:
in_sample_rate: 24000
output_sample_rate: 16000
Done!
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/test_wavs/en.wav
{"lang": "", "emotion": "", "event": "", "text": " Ask not what your country can do for you, ask what you can do for your country.", "timestamps": [0.00, 0.08, 0.40, 0.64, 0.80, 0.96, 1.04, 1.04, 1.04, 1.28, 1.44, 1.60, 1.68, 1.84, 2.08, 2.16, 2.40, 2.56, 2.64, 2.80, 2.96, 3.12, 3.28, 3.36, 3.36, 3.36, 3.68], "tokens":[" A", "sk", " not", " what", " your", " co", "un", "tr", "y", " can", " do", " for", " you", ",", " a", "sk", " what", " you", " can", " do", " for", " your", " co", "un", "tr", "y", "."], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.249 s
Real time factor (RTF): 1.249 / 3.845 = 0.325
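The recognizer prints one JSON object per decoded file, pairing every token with its emission time, and ends with a real-time-factor (RTF) summary: RTF = elapsed processing time / audio duration, so values below 1 mean faster than real time. A short sketch of both computations (the JSON snippet is abridged from the output above):

```python
import json


def real_time_factor(elapsed_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; < 1 is faster than real time."""
    return elapsed_s / audio_s


# Abridged from the recognizer's JSON output above: each token is
# paired with the time (in seconds) at which it was emitted.
result = json.loads(
    '{"text": " Ask not what", '
    '"timestamps": [0.00, 0.08, 0.40, 0.64], '
    '"tokens": [" A", "sk", " not", " what"]}'
)
pairs = list(zip(result["tokens"], result["timestamps"]))
print(pairs[:2])                                # [(' A', 0.0), ('sk', 0.08)]
print(f"{real_time_factor(1.249, 3.845):.3f}")  # 0.325
```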
Real-time/streaming speech recognition from a microphone with VAD
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-simulated-streaming-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone with VAD
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt \
--model-type=nemo_transducer
Decode a long audio file with VAD
The following example shows how to decode a very long audio file with the help of VAD.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/Obama.wav
./build/bin/sherpa-onnx-vad-with-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--silero-vad-threshold=0.2 \
--silero-vad-min-speech-duration=0.2 \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt \
--model-type=nemo_transducer \
./Obama.wav
| Wave filename | Content |
|---|---|
| Obama.wav | A back-to-school speech by Barack Obama (see the transcript in the output below) |
You should see the following output:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:373 ./build/bin/sherpa-onnx-vad-with-offline-asr --silero-vad-model=./silero_vad.onnx --silero-vad-threshold=0.2 --silero-vad-min-speech-duration=0.2 --encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx --decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx --joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx --tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt --model-type=nemo_transducer ./Obama.wav
VadModelConfig(silero_vad=SileroVadModelConfig(model="./silero_vad.onnx", threshold=0.2, min_silence_duration=0.5, min_speech_duration=0.2, max_speech_duration=20, window_size=512, neg_threshold=-1), ten_vad=TenVadModelConfig(model="", threshold=0.5, min_silence_duration=0.5, min_speech_duration=0.25, max_speech_duration=20, window_size=256), sample_rate=16000, num_threads=1, provider="cpu", debug=False)
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx", decoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/decoder.int8.onnx", joiner_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/joiner.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1, enable_token_timestamps=False, enable_segment_timestamps=False), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder="", merged_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), canary=OfflineCanaryModelConfig(encoder="", decoder="", src_lang="", tgt_lang="", use_pnc=True), cohere_transcribe=OfflineCohereTranscribeModelConfig(encoder="", decoder="", language="", use_punct=True, use_itn=True), omnilingual=OfflineOmnilingualAsrCtcModelConfig(model=""), funasr_nano=OfflineFunASRNanoModelConfig(encoder_adaptor="", llm="", embedding="", tokenizer="", system_prompt="You are a helpful assistant.", user_prompt="语音转写:", max_new_tokens=512, temperature=1e-06, top_p=0.8, seed=42, language="", itn=True, hotwords=""), medasr=OfflineMedAsrCtcModelConfig(model=""), fire_red_asr_ctc=OfflineFireRedAsrCtcModelConfig(model=""), qwen3_asr=OfflineQwen3ASRModelConfig(conv_frontend="", encoder="", decoder="", tokenizer="", hotwords="", 
max_total_len=512, max_new_tokens=128, temperature=1e-06, top_p=0.8, seed=42), telespeech_ctc="", tokens="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="nemo_transducer", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5, lodr_scale=0.01, lodr_fst="", lodr_backoff_id=-1), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="", hr=HomophoneReplacerConfig(lexicon="", rule_fsts=""))
Creating recognizer ...
Recognizer created!
Started
Reading: ./Obama.wav
Started!
7.248 -- 8.204: Thank you.
8.976 -- 12.140: Thank you, everybody. All right, everybody go ahead and have a seat.
13.104 -- 14.540: How's everybody doing today?
18.704 -- 22.892: How about Tim Spicer?
25.936 -- 31.884: I am here with students at Wakefield High School in Arlington, Virginia.
32.720 -- 48.844: And we've got students tuning in from all across America, from kindergarten through 12th grade. And I am just so glad that all could join us today, and I want to thank Wakefield for being such an outstanding host. Give yourselves a big round of applause.
54.416 -- 55.436: I know that
56.240 -- 58.892: For many of you, today is the first day of school.
59.600 -- 69.452: And for those of you in kindergarten or starting middle or high school, it's your first day in a new school, so it's understandable if you're a little nervous.
70.640 -- 76.332: I imagine there's some seniors out there who are feeling pretty good right now. With just one more year to go.
78.800 -- 87.180: And no matter what grade you're in, some of you are probably wishing it were still summer, and you could have stayed just a little bit longer this morning.
87.984 -- 89.100: I know that feeling.
91.664 -- 111.708: When I was young, my family lived overseas. I lived in Indonesia for a few years. And my mother, she didn't have the money to send me where all the American kids went to school, but she thought it was important for me to keep up with an American education. So she decided to teach me extra lessons herself.
112.240 -- 118.700: Monday through Friday, but because she had to go to work, the only time she could do it was at 430 in the morning.
120.048 -- 127.244: Now, as you might imagine, I wasn't too happy about getting up that early. A lot of times I'd fall asleep right there at the kitchen table.
128.272 -- 135.340: But whenever I'd complain, my mother would just give me one of those looks and she'd say, this is no picnic for me either, Buster.
137.104 -- 145.132: So I know that some of you are still adjusting to being back at school, but I'm here today because I have something important to discuss with you.
145.808 -- 153.740: I'm here because I want to talk with you about your education and what's expected of all of you in this new school year.
154.448 -- 160.268: I've given a lot of speeches about education and I've talked about responsibility a lot.
160.816 -- 178.220: I've talked about teachers' responsibility for inspiring students and pushing you to learn. I've talked about your parents' responsibility for making sure you stay on track and you get your homework done and don't spend every waking hour in front of the TV or with the Xbox.
179.088 -- 180.716: I've talked a lot about
181.360 -- 193.452: Your government's responsibility for setting high standards and supporting teachers and principals and turning around schools that aren't working, where students aren't getting the opportunities that they deserve.
194.000 -- 195.276: But at the end of the day.
196.016 -- 206.156: We can have the most dedicated teachers, the most supportive parents, the best schools in the world, and none of it will make a difference. None of it will matter.
206.704 -- 210.604: unless all of you fulfill your responsibilities.
211.248 -- 223.404: unless you show up to those schools, unless you pay attention to those teachers, unless you listen to your parents and grandparents and other adults, and put in the hard work it takes to succeed.
224.656 -- 230.924: That's what I want to focus on today. The responsibility each of you has for your education.
231.728 -- 234.796: I want to start with the responsibility you have to yourself.
235.696 -- 238.988: Every single one of you has something that you're good at.
239.760 -- 242.412: Every single one of you has something to offer.
242.992 -- 247.404: And you have a responsibility to yourself to discover what that is.
248.336 -- 251.564: That's the opportunity an education can provide.
252.336 -- 265.900: Maybe you could be a great writer, maybe even good enough to write a book, or articles in a newspaper, but you might not know it until you write that English paper, that English class paper that's assigned to you.
266.704 -- 278.668: Maybe you could be an innovator or an inventor, maybe even good enough to come up with the next iPhone or the new medicine or vaccine, but you might not know it until you do your project for your science class.
279.824 -- 289.964: Maybe you could be a mayor, or a senator, or a Supreme Court Justice, but you might not know that until you join student government or the debate team.
291.568 -- 309.516: And no matter what you want to do with your life, I guarantee that you'll need an education to do it. You want to be a doctor or a teacher or a police officer, you want to be a nurse or an architect, a lawyer, or a member of our military, you're going to need a good education for every single one of those careers.
310.064 -- 314.348: You cannot drop out of school and just drop into a good job.
315.184 -- 319.852: You've got to train for it and work for it and learn for it.
320.528 -- 323.628: And this isn't just important for your own life and your own future.
324.688 -- 332.812: What you make of your education will decide nothing less than the future of this country. The future of America depends on you.
num threads: 2
decoding method: greedy_search
Elapsed seconds: 9.372 s
Real time factor (RTF): 9.372 / 334.234 = 0.028
Hint
If you want a GUI version that can export SRT files, please visit
https://k2-fsa.github.io/sherpa/onnx/tauri/app/vad-asr-file.html and search for
en-parakeet_tdt_v3. Please always use the latest version.
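If you prefer to script the export yourself, the `start -- end: text` segments printed above map directly onto SRT cues. A minimal sketch (the `to_srt` helper and its segment-tuple input format are our own, not part of sherpa-onnx):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Convert (start, end, text) tuples, as printed by
    sherpa-onnx-vad-with-offline-asr, into SRT cue blocks."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        )
    return "\n\n".join(blocks) + "\n"


print(to_srt([(7.248, 8.204, "Thank you.")]))
# 1
# 00:00:07,248 --> 00:00:08,204
# Thank you.
```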
sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8 (English, 英语)
This model is converted from nvidia/parakeet-tdt-0.6b-v2.
You can find the conversion script in the sherpa-onnx repository.
In the following, we describe how to download it and use it with sherpa-onnx.
Hint
This model outputs punctuation and casing.
Download the model
Please use the following commands to download it.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8.tar.bz2
tar xvf sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8.tar.bz2
rm sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8.tar.bz2
Hint
If you want to try the float16-quantized model, please use sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-fp16.tar.bz2.
If you want to try the non-quantized decoder and joiner models, please use sherpa-onnx-nemo-parakeet-tdt-0.6b-v2.tar.bz2.
You should see something like the following after downloading:
ls -lh sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/
total 1295752
-rw-r--r-- 1 fangjun staff 6.9M May 6 16:24 decoder.int8.onnx
-rw-r--r-- 1 fangjun staff 622M May 6 16:24 encoder.int8.onnx
-rw-r--r-- 1 fangjun staff 1.7M May 6 16:24 joiner.int8.onnx
drwxr-xr-x 3 fangjun staff 96B May 6 16:24 test_wavs
-rw-r--r-- 1 fangjun staff 9.2K May 6 16:24 tokens.txt
Decode wave files
Hint
It supports decoding only wave files with a single channel and 16-bit encoded samples. The sampling rate, however, does not need to be 16 kHz; the input is resampled automatically when necessary.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
Note
Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.
You should see the following output:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:372 ./build/bin/sherpa-onnx-offline --encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx --decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx --joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx --tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt --model-type=nemo_transducer ./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx", decoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx", joiner_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), telespeech_ctc="", tokens="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="nemo_transducer", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="", hr=HomophoneReplacerConfig(dict_dir="", lexicon="", rule_fsts=""))
Creating recognizer ...
Started
Done!
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
{"lang": "", "emotion": "", "event": "", "text": " Well, I don't wish to see it any more, observed Phebe, turning away her eyes. It is certainly very like the old portrait.", "timestamps": [0.32, 0.64, 0.72, 0.80, 0.88, 0.96, 1.04, 1.12, 1.28, 1.44, 1.60, 1.76, 1.92, 2.00, 2.24, 2.32, 2.40, 2.48, 2.64, 2.72, 2.88, 3.12, 3.36, 3.44, 3.52, 3.68, 3.76, 3.92, 4.16, 4.24, 4.32, 4.64, 4.96, 5.12, 5.36, 5.44, 5.52, 5.60, 5.76, 6.00, 6.24, 6.40, 6.48, 6.64, 6.72, 6.80, 6.88, 7.04], "tokens":[" Well", ",", " I", " don", "'", "t", " w", "ish", " to", " see", " it", " any", " more", ",", " ob", "s", "er", "ved", " P", "he", "be", ",", " t", "ur", "ning", " a", "way", " her", " e", "y", "es", ".", " It", " is", " c", "ert", "ain", "ly", " very", " like", " the", " o", "ld", " p", "ort", "ra", "it", "."], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.874 s
Real time factor (RTF): 0.874 / 7.435 = 0.118
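The `tokens` in the JSON output above are BPE subwords; a token beginning with a space starts a new word. Word-level timestamps can therefore be recovered by merging tokens, as in this sketch (our own helper, not part of sherpa-onnx; punctuation and continuation tokens attach to the preceding word):

```python
def tokens_to_words(tokens, timestamps):
    """Merge BPE subword tokens into words, keeping each word's
    start time (the timestamp of its first token). Tokens that do
    not begin with a space (continuations, punctuation) are glued
    onto the previous word."""
    words = []
    for tok, ts in zip(tokens, timestamps):
        if tok.startswith(" ") or not words:
            words.append([tok.strip(), ts])
        else:
            words[-1][0] += tok
    return [tuple(w) for w in words]


# Tokens/timestamps abridged from the JSON output above.
print(tokens_to_words(
    [" Well", ",", " I", " don", "'", "t"],
    [0.32, 0.64, 0.72, 0.80, 0.88, 0.96],
))  # [('Well,', 0.32), ('I', 0.72), ("don't", 0.8)]
```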
Real-time/streaming speech recognition from a microphone with VAD
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-simulated-streaming-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone with VAD
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer
Decode a long audio file with VAD
The following example shows how to decode a very long audio file with the help of VAD.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/Obama.wav
./build/bin/sherpa-onnx-vad-with-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--silero-vad-threshold=0.2 \
--silero-vad-min-speech-duration=0.2 \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer \
./Obama.wav
| Wave filename | Content |
|---|---|
| Obama.wav | A back-to-school speech by Barack Obama (see the transcript in the output below) |
You should see the following output:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:373 ./build/bin/sherpa-onnx-vad-with-offline-asr --silero-vad-model=./silero_vad.onnx --silero-vad-threshold=0.2 --silero-vad-min-speech-duration=0.2 --encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx --decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx --joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx --tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt --model-type=nemo_transducer ./Obama.wav
VadModelConfig(silero_vad=SileroVadModelConfig(model="./silero_vad.onnx", threshold=0.2, min_silence_duration=0.5, min_speech_duration=0.2, max_speech_duration=20, window_size=512, neg_threshold=-1), ten_vad=TenVadModelConfig(model="", threshold=0.5, min_silence_duration=0.5, min_speech_duration=0.25, max_speech_duration=20, window_size=256), sample_rate=16000, num_threads=1, provider="cpu", debug=False)
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx", decoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx", joiner_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1, enable_token_timestamps=False, enable_segment_timestamps=False), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder="", merged_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), canary=OfflineCanaryModelConfig(encoder="", decoder="", src_lang="", tgt_lang="", use_pnc=True), cohere_transcribe=OfflineCohereTranscribeModelConfig(encoder="", decoder="", language="", use_punct=True, use_itn=True), omnilingual=OfflineOmnilingualAsrCtcModelConfig(model=""), funasr_nano=OfflineFunASRNanoModelConfig(encoder_adaptor="", llm="", embedding="", tokenizer="", system_prompt="You are a helpful assistant.", user_prompt="语音转写:", max_new_tokens=512, temperature=1e-06, top_p=0.8, seed=42, language="", itn=True, hotwords=""), medasr=OfflineMedAsrCtcModelConfig(model=""), fire_red_asr_ctc=OfflineFireRedAsrCtcModelConfig(model=""), qwen3_asr=OfflineQwen3ASRModelConfig(conv_frontend="", encoder="", decoder="", tokenizer="", hotwords="", 
max_total_len=512, max_new_tokens=128, temperature=1e-06, top_p=0.8, seed=42), telespeech_ctc="", tokens="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="nemo_transducer", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5, lodr_scale=0.01, lodr_fst="", lodr_backoff_id=-1), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="", hr=HomophoneReplacerConfig(lexicon="", rule_fsts=""))
Creating recognizer ...
Recognizer created!
Started
Reading: ./Obama.wav
Started!
7.248 -- 8.204: Thank you.
8.976 -- 12.140: Thank you, everybody. All right, everybody, go ahead and have a seat.
13.104 -- 14.540: How's everybody doing today?
18.704 -- 22.892: How about Tim Spicer?
25.936 -- 31.884: I am here with students at Wakefield High School in Arlington, Virginia.
32.720 -- 48.844: And we've got students tuning in from all across America, from kindergarten through 12th grade. And I am just so glad that all could join us today. And I want to thank Wakefield for being such an outstanding host. Give yourselves a big round of applause.
54.416 -- 55.436: I know that
56.240 -- 58.892: For many of you, today is the first day of school.
59.600 -- 69.452: And for those of you in kindergarten or starting middle or high school, it's your first day in a new school, so it's understandable if you're a little nervous.
70.640 -- 76.332: I imagine there's some seniors out there who are feeling pretty good right now. With just one more year to go.
78.800 -- 87.180: And no matter what grade you're in, some of you are probably wishing it was still summer and you could have stayed in bed just a little bit longer this morning.
87.984 -- 89.100: I know that feeling.
91.664 -- 111.708: When I was young, my family lived overseas. I lived in Indonesia for a few years. And my mother, she didn't have the money to send me where all the American kids went to school. But she thought it was important for me to keep up with American education. So she decided to teach me extra lessons herself.
112.240 -- 118.700: Monday through Friday, but because she had to go to work, the only time she could do it was at 4:30 in the morning.
120.048 -- 127.244: Now, as you might imagine, I wasn't too happy about getting up that early. And a lot of times I'd fall asleep right there at the kitchen table.
128.272 -- 135.340: But whenever I'd complain, my mother would just give me one of those looks and she'd say, This is no picnic for me either, Buster.
137.104 -- 145.132: So I know that some of you are still adjusting to being back at school, but I'm here today because I have something important to discuss with you.
145.808 -- 153.740: I'm here because I want to talk with you about your education and what's expected of all of you in this new school year.
154.448 -- 160.268: I've given a lot of speeches about education, and I've talked about responsibility a lot.
160.816 -- 178.220: I've talked about teachers' responsibility for inspiring students and pushing you to learn. I've talked about your parents' responsibility for making sure you stay on track and you get your homework done and don't spend every waking hour in front of the T V or with the Xbox.
179.088 -- 180.716: I've talked a lot about
181.360 -- 193.452: Your government's responsibility for setting high standards and supporting teachers and principals and turning around schools that aren't working, where students aren't getting the opportunities that they deserve.
194.000 -- 195.276: But at the end of the day,
196.016 -- 206.156: We can have the most dedicated teachers, the most supportive parents, the best schools in the world, and none of it will make a difference. None of it will matter.
206.704 -- 210.604: unless all of you fulfill your responsibilities.
211.248 -- 223.404: Unless you show up to those schools, unless you pay attention to those teachers, unless you listen to your parents and grandparents and other adults and put in the hard work it takes to succeed.
224.656 -- 230.924: And that's what I want to focus on today: the responsibility each of you has for your education.
231.728 -- 234.796: I want to start with the responsibility you have to yourself.
235.696 -- 238.988: Every single one of you has something that you're good at.
239.760 -- 242.412: Every single one of you has something to offer.
242.992 -- 247.404: And you have a responsibility to yourself to discover what that is.
248.336 -- 251.564: That's the opportunity an education can provide.
252.336 -- 265.900: Maybe you could be a great writer, maybe even good enough to write a book or articles in a newspaper, but you might not know it until you write that English paper, that English class paper that's assigned to you.
266.704 -- 278.668: Maybe you could be an innovator or an inventor, maybe even good enough to come up with the next iPhone or the new medicine or vaccine. But you might not know it until you do your project for your science class.
279.824 -- 289.964: Maybe you could be a mayor or a senator or a Supreme Court justice. But you might not know that until you join student government or the debate team.
291.568 -- 309.516: And no matter what you want to do with your life, I guarantee that you'll need an education to do it. You want to be a doctor or a teacher or a police officer, you want to be a nurse or an architect, a lawyer or a member of our military, you're going to need a good education for every single one of those careers.
310.064 -- 314.348: You cannot drop out of school and just drop into a good job.
315.184 -- 319.852: You've got to train for it and work for it and learn for it.
320.528 -- 323.628: And this isn't just important for your own life and your own future.
324.688 -- 332.812: What you make of your education will decide nothing less than the future of this country. The future of America depends on you.
num threads: 2
decoding method: greedy_search
Elapsed seconds: 9.406 s
Real time factor (RTF): 9.406 / 334.234 = 0.028
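The RTF reported above is simply the elapsed decoding time divided by the duration of the input audio; values below 1 mean decoding runs faster than real time. A minimal sketch reproducing the figure, with the numbers taken from the run above:

```python
# RTF = processing time / audio duration.
elapsed = 9.406     # seconds spent decoding (from the run above)
duration = 334.234  # seconds of audio in the input file
rtf = elapsed / duration
print(f"RTF: {rtf:.3f}")  # RTF: 0.028
```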
Hint
If you want to use a GUI version and want to export SRT format, please visit
https://k2-fsa.github.io/sherpa/onnx/tauri/app/vad-asr-file.html and search for
en-parakeet_tdt. Please always use the latest version.
RTF on RK3588 with Cortex A76 CPU
In the following, we test this model on an RK3588 board using its Cortex-A76 CPU cores.
Information about the CPUs on the board is given below:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: ARM
Model name: Cortex-A55
Model: 0
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: r2p0
CPU max MHz: 1800.0000
CPU min MHz: 408.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Model name: Cortex-A76
Model: 0
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: r4p0
CPU max MHz: 2304.0000
CPU min MHz: 408.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
L1d cache: 384 KiB (8 instances)
L1i cache: 384 KiB (8 instances)
L2 cache: 2.5 MiB (8 instances)
L3 cache: 3 MiB (1 instance)
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Vulnerable: Unprivileged eBPF enabled
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
You can see that it has 8 CPUs: 4 Cortex A55 + 4 Cortex A76.
We use taskset below to pin the process to the Cortex-A76 cores when measuring the RTF.
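The hexadecimal masks passed to taskset select which CPUs the process may run on: bit N of the mask corresponds to CPU N. Per the lscpu output above, CPUs 0-3 are the Cortex-A55 cores and CPUs 4-7 are the Cortex-A76 cores on this board. A quick sanity check of the masks used below:

```shell
# Bit N selects CPU N; CPUs 4-7 are the Cortex-A76 cores on this board.
printf '0x%x\n' $((1 << 7))                                    # 0x80: CPU 7 (1 A76 core)
printf '0x%x\n' $(((1 << 7) | (1 << 6)))                       # 0xc0: CPUs 6-7 (2 A76 cores)
printf '0x%x\n' $(((1 << 7) | (1 << 6) | (1 << 5)))            # 0xe0: CPUs 5-7 (3 A76 cores)
printf '0x%x\n' $(((1 << 7) | (1 << 6) | (1 << 5) | (1 << 4))) # 0xf0: CPUs 4-7 (4 A76 cores)
```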
taskset 0x80 sherpa-onnx-offline \
--num-threads=1 \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
Its output is given below:
/project/sherpa-onnx/csrc/parse-options.cc:Read:372 sherpa-onnx-offline --num-threads=1 --encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx --decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx --joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx --tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt --model-type=nemo_transducer ./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx", decoder_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx", joiner_filename="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), telespeech_ctc="", tokens="./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt", num_threads=1, debug=False, provider="cpu", model_type="nemo_transducer", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="", hr=HomophoneReplacerConfig(dict_dir="", lexicon="", rule_fsts=""))
Creating recognizer ...
Started
Done!
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
{"lang": "", "emotion": "", "event": "", "text": " Well, I don't wish to see it any more, observed Phebe, turning away her eyes. It is certainly very like the old portrait.", "timestamps": [0.32, 0.64, 0.72, 0.80, 0.88, 0.96, 1.04, 1.12, 1.28, 1.44, 1.60, 1.76, 1.92, 2.00, 2.24, 2.32, 2.40, 2.48, 2.64, 2.72, 2.88, 3.12, 3.36, 3.44, 3.52, 3.68, 3.76, 3.92, 4.16, 4.24, 4.32, 4.64, 4.96, 5.12, 5.36, 5.44, 5.52, 5.60, 5.76, 6.00, 6.24, 6.40, 6.48, 6.64, 6.72, 6.80, 6.88, 7.04], "tokens":[" Well", ",", " I", " don", "'", "t", " w", "ish", " to", " see", " it", " any", " more", ",", " ob", "s", "er", "ved", " P", "he", "be", ",", " t", "ur", "ning", " a", "way", " her", " e", "y", "es", ".", " It", " is", " c", "ert", "ain", "ly", " very", " like", " the", " o", "ld", " p", "ort", "ra", "it", "."], "words": []}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 1.639 s
Real time factor (RTF): 1.639 / 7.435 = 0.220
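Each decoded file is reported as one JSON object in which the timestamps array aligns one-to-one with the tokens array: timestamps[i] is the start time, in seconds, of tokens[i]. A minimal sketch pairing the two, using an abbreviated result for illustration:

```python
import json

# Abbreviated result in the JSON format printed above; timestamps[i]
# is the start time (in seconds) of tokens[i].
raw = '{"text": " Well,", "timestamps": [0.32, 0.64], "tokens": [" Well", ","]}'
result = json.loads(raw)
aligned = list(zip(result["timestamps"], result["tokens"]))
print(aligned)  # [(0.32, ' Well'), (0.64, ',')]
```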
To test the RTF with different --num-threads, we use:
taskset 0xc0 sherpa-onnx-offline \
--num-threads=2 \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
taskset 0xe0 sherpa-onnx-offline \
--num-threads=3 \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
taskset 0xf0 sherpa-onnx-offline \
--num-threads=4 \
--encoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/decoder.int8.onnx \
--joiner=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/joiner.int8.onnx \
--tokens=./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8/test_wavs/0.wav
The results are summarized below:
Number of threads     | 1     | 2     | 3     | 4
----------------------+-------+-------+-------+------
RTF on Cortex A76 CPU | 0.220 | 0.142 | 0.118 | 0.088
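The scaling is sub-linear: going from 1 to 4 threads cuts the RTF by a factor of 2.5, not 4. A quick sketch computing speedup and parallel efficiency from the numbers above:

```python
# RTF values measured above; speedup is relative to 1 thread.
rtf = {1: 0.220, 2: 0.142, 3: 0.118, 4: 0.088}
for n, r in rtf.items():
    speedup = rtf[1] / r
    print(f"{n} thread(s): speedup {speedup:.2f}x, efficiency {speedup / n:.0%}")
```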
sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19 (Russian, 俄语)
This model is converted from
You can find the conversion script at
In the following, we describe how to download it and use it with sherpa-onnx.
Download the model
Please use the following commands to download it.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19.tar.bz2
tar xvf sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19.tar.bz2
rm sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19.tar.bz2
You should see something like below after downloading:
ls -lh sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19
total 231M
-rw-r--r-- 1 501 staff 3.2M Apr 20 01:58 decoder.onnx
-rw-r--r-- 1 501 staff 226M Apr 20 01:59 encoder.int8.onnx
-rw-r--r-- 1 501 staff 1.4M Apr 20 01:58 joiner.onnx
-rw-r--r-- 1 501 staff 219K Apr 20 01:59 LICENSE
-rw-r--r-- 1 501 staff 302 Apr 20 01:59 README.md
-rwxr-xr-x 1 501 staff 868 Apr 20 01:51 run-rnnt-v2.sh
-rwxr-xr-x 1 501 staff 8.9K Apr 20 01:59 test-onnx-rnnt.py
drwxr-xr-x 2 501 staff 4.0K Apr 21 09:35 test_wavs
-rw-r--r-- 1 501 staff 196 Apr 20 01:58 tokens.txt
Decode wave files
Hint
It supports decoding only single-channel wave files with 16-bit encoded samples. The sampling rate, however, does not need to be 16 kHz.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/test_wavs/example.wav
Note
Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.
Caution
If you use Windows and encounter encoding issues, please run:
CHCP 65001
in your command line.
You should see the following output:
/project/sherpa-onnx/csrc/parse-options.cc:Read:375 sherpa-onnx-offline --encoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/encoder.int8.onnx --decoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/decoder.onnx --joiner=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/joiner.onnx --tokens=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/tokens.txt --model-type=nemo_transducer ./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/test_wavs/example.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/encoder.int8.onnx", decoder_filename="./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/decoder.onnx", joiner_filename="./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/joiner.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), telespeech_ctc="", tokens="./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="nemo_transducer", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!
./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/test_wavs/example.wav
{"lang": "", "emotion": "", "event": "", "text": "ничьих не требуя похвал счастлив уж я надеждой сладкой что дева с трепетом любви посмотрит может быть украдкой на песни грешные мои у лукоморья дуб зеленый", "timestamps": [0.04, 0.12, 0.16, 0.24, 0.32, 0.40, 0.44, 0.52, 0.56, 0.60, 0.64, 0.72, 0.76, 0.80, 0.88, 0.96, 1.04, 1.12, 1.16, 1.24, 1.32, 1.36, 1.44, 1.56, 1.76, 1.84, 1.88, 1.96, 2.00, 2.04, 2.08, 2.16, 2.24, 2.28, 2.36, 2.40, 2.48, 2.60, 2.68, 2.72, 2.76, 2.84, 2.92, 2.96, 3.04, 3.08, 3.16, 3.20, 3.24, 3.32, 3.36, 3.44, 3.52, 3.56, 3.64, 3.68, 3.72, 3.76, 3.80, 3.88, 3.92, 4.00, 4.08, 4.16, 4.20, 4.24, 4.28, 4.32, 4.36, 4.44, 4.52, 4.56, 4.64, 4.68, 4.76, 4.80, 4.88, 4.92, 5.00, 5.08, 5.16, 5.36, 5.44, 5.52, 5.60, 5.68, 5.72, 5.76, 5.84, 5.92, 6.00, 6.04, 6.12, 6.16, 6.20, 6.24, 6.28, 6.32, 6.40, 6.44, 6.48, 6.52, 6.56, 6.64, 6.72, 6.76, 6.84, 6.92, 7.00, 7.04, 7.12, 7.16, 7.20, 7.24, 7.32, 7.36, 7.40, 7.48, 7.60, 7.64, 7.72, 7.76, 7.84, 7.88, 8.00, 8.08, 8.16, 8.24, 8.28, 8.32, 8.44, 8.76, 9.24, 9.32, 9.40, 9.44, 9.52, 9.60, 9.68, 9.76, 9.84, 9.92, 10.00, 10.08, 10.12, 10.24, 10.32, 10.44, 10.52, 10.56, 10.60, 10.68, 10.72, 10.84, 10.92], "tokens":["н", "и", "ч", "ь", "и", "х", " ", "н", "е", " ", "т", "р", "е", "б", "у", "я", " ", "п", "о", "х", "в", "а", "л", " ", "с", "ч", "а", "с", "т", "л", "и", "в", " ", "у", "ж", " ", "я", " ", "н", "а", "д", "е", "ж", "д", "о", "й", " ", "с", "л", "а", "д", "к", "о", "й", " ", "ч", "т", "о", " ", "д", "е", "в", "а", " ", "с", " ", "т", "р", "е", "п", "е", "т", "о", "м", " ", "л", "ю", "б", "в", "и", " ", "п", "о", "с", "м", "о", "т", "р", "и", "т", " ", "м", "о", "ж", "е", "т", " ", "б", "ы", "т", "ь", " ", "у", "к", "р", "а", "д", "к", "о", "й", " ", "н", "а", " ", "п", "е", "с", "н", "и", " ", "г", "р", "е", "ш", "н", "ы", "е", " ", "м", "о", "и", " ", "у", " ", "л", "у", "к", "о", "м", "о", "р", "ь", "я", " ", "д", "у", "б", " ", "з", "е", "л", "е", "н", "ы", "й"], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 4.317 s
Real time factor (RTF): 4.317 / 11.290 = 0.382
Real-time/Streaming Speech recognition from a microphone with VAD
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-simulated-streaming-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone with VAD
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19/tokens.txt \
--model-type=nemo_transducer
sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24 (Russian, 俄语)
This model is converted from
You can find the conversion script at
Warning
The license of the model can be found at https://github.com/salute-developers/GigaAM/blob/main/GigaAM%20License_NC.pdf.
It is for non-commercial use only.
In the following, we describe how to download it and use it with sherpa-onnx.
Download the model
Please use the following commands to download it.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24.tar.bz2
tar xvf sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24.tar.bz2
rm sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24.tar.bz2
You should see something like below after downloading:
ls -lh sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/
total 548472
-rw-r--r-- 1 fangjun staff 89K Oct 25 13:36 GigaAM%20License_NC.pdf
-rw-r--r-- 1 fangjun staff 318B Oct 25 13:37 README.md
-rw-r--r-- 1 fangjun staff 3.8M Oct 25 13:36 decoder.onnx
-rw-r--r-- 1 fangjun staff 262M Oct 25 13:37 encoder.int8.onnx
-rw-r--r-- 1 fangjun staff 3.8K Oct 25 13:32 export-onnx-rnnt.py
-rw-r--r-- 1 fangjun staff 2.0M Oct 25 13:36 joiner.onnx
-rwxr-xr-x 1 fangjun staff 2.0K Oct 25 13:32 run-rnnt.sh
-rwxr-xr-x 1 fangjun staff 8.7K Oct 25 13:32 test-onnx-rnnt.py
drwxr-xr-x 4 fangjun staff 128B Oct 25 13:37 test_wavs
-rw-r--r-- 1 fangjun staff 5.8K Oct 25 13:36 tokens.txt
Decode wave files
Hint
It supports decoding only single-channel wave files with 16-bit encoded samples. The sampling rate, however, does not need to be 16 kHz.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/test_wavs/example.wav
Note
Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.
Caution
If you use Windows and encounter encoding issues, please run:
CHCP 65001
in your command line.
You should see the following output:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline --encoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx --decoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx --joiner=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx --tokens=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt --model-type=nemo_transducer ./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/test_wavs/example.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx", decoder_filename="./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx", joiner_filename="./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), telespeech_ctc="", tokens="./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="nemo_transducer", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!
./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/test_wavs/example.wav
{"lang": "", "emotion": "", "event": "", "text": " ничьих не требуя похвал счастлив уж я надеждой сладкой что дева с трепетом любви посмотрит может быть украдкой на песни грешные мои у лукоморья дуб зеленый", "timestamps": [0.04, 0.16, 0.24, 0.28, 0.40, 0.48, 0.60, 0.68, 0.80, 0.92, 1.04, 1.20, 1.28, 1.44, 1.76, 1.88, 2.00, 2.08, 2.16, 2.28, 2.36, 2.44, 2.64, 2.76, 2.92, 3.00, 3.04, 3.16, 3.24, 3.36, 3.48, 3.56, 3.68, 3.88, 4.04, 4.16, 4.24, 4.32, 4.40, 4.56, 4.76, 4.88, 4.92, 5.36, 5.64, 5.84, 5.92, 6.04, 6.32, 6.52, 6.60, 6.72, 6.84, 6.92, 7.04, 7.16, 7.28, 7.36, 7.44, 7.56, 7.68, 7.72, 7.88, 8.00, 8.20, 8.36, 9.28, 9.40, 9.44, 9.52, 9.68, 9.84, 9.88, 9.92, 10.12, 10.32, 10.40, 10.52, 10.56, 10.76, 10.84], "tokens":[" ни", "ч", "ь", "и", "х", " не", " т", "ре", "бу", "я", " по", "х", "ва", "л", " с", "ча", "ст", "ли", "в", " у", "ж", " я", " на", "де", "ж", "до", "й", " с", "ла", "д", "ко", "й", " что", " де", "ва", " с", " т", "ре", "пе", "том", " лю", "б", "ви", " пос", "мот", "ри", "т", " может", " быть", " у", "к", "ра", "д", "ко", "й", " на", " п", "е", "с", "ни", " г", "ре", "ш", "ные", " мо", "и", " у", " ", "лу", "ко", "мо", "р", "ь", "я", " ду", "б", " з", "е", "лен", "ы", "й"], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.775 s
Real time factor (RTF): 1.775 / 11.290 = 0.157
Real-time/Streaming Speech recognition from a microphone with VAD
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-simulated-streaming-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt \
--model-type=nemo_transducer
Speech recognition from a microphone with VAD
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt \
--model-type=nemo_transducer