
SenseVoiceSmall is a lightweight, multifunctional voice understanding model from FunAudioLLM (Alibaba/FunASR), designed for edge devices and real-time applications. Its non-autoregressive, end-to-end architecture achieves very low inference latency (~70 ms for 10 s of audio), roughly 5× faster than Whisper-Small and 15× faster than Whisper-Large.

- Multi-Task Learning: Supports multilingual ASR, language identification (LID), speech emotion recognition (SER), and acoustic event detection (AED), covering Mandarin, Cantonese, English, Japanese, and Korean.
- High Recognition Accuracy: Outperforms Whisper on Chinese and Cantonese ASR, and its emotion recognition is state-of-the-art among open-source models.
- Exceptional Inference Efficiency: The non-autoregressive design delivers low latency and high throughput.

Beyond speech-to-text, SenseVoiceSmall excels at emotion and event detection, making it well suited to intelligent assistants, smart surveillance, and real-time voice analysis.
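SenseVoice-family models return the LID, SER, and AED results as special tokens embedded in the transcript (for example `<|en|><|HAPPY|><|Speech|>` before the text). A minimal post-processing sketch that separates these tags from the plain text; the exact tag set shown is illustrative:

```python
import re

# SenseVoice-style rich transcripts embed special tokens such as
# <|en|> (language), <|HAPPY|> (emotion), <|Speech|> (acoustic event).
TAG = re.compile(r"<\|([^|]+)\|>")

def split_rich_transcript(raw: str):
    """Return (plain_text, tags) from a rich transcript string."""
    tags = TAG.findall(raw)          # collect every <|...|> token
    text = TAG.sub("", raw).strip()  # strip tokens, keep the transcript
    return text, tags

text, tags = split_rich_transcript("<|en|><|HAPPY|><|Speech|><|withitn|>Nice to meet you!")
# text == "Nice to meet you!"; tags == ["en", "HAPPY", "Speech", "withitn"]
```

In practice the FunASR toolkit ships its own post-processing for these tokens; this sketch only illustrates the output format.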
The source model can be found here.
Multi-language RTF Performance Details
Because the model uses a fixed input size of approximately 15 seconds, shorter audio clips are automatically padded to the 15 s limit and therefore exhibit a higher RTF.
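Since the inference cost is roughly fixed by the padded 15 s input, RTF scales inversely with the actual audio length. A small sketch of that relationship; the ~0.15 s per-inference cost is illustrative, loosely matching the FP16 tables below:

```python
def rtf(inference_time_s: float, audio_len_s: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    return inference_time_s / audio_len_s

# With a roughly fixed per-inference cost (illustrative ~0.15 s), shorter
# clips pay the full padded-input cost and therefore show a higher RTF:
for length in (5, 10, 14):
    print(f"{length:>2}s audio -> RTF {rtf(0.15, length):.3f}")
```

Lower RTF means faster than real time; the fixed cost is amortized better as the clip approaches the 15 s window.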
English
FP16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.04 | 0.15 |
| 10s | 0.02 | 0.15 |
| 14s | 0.02 | 0.15 |
W8A16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.03 | 0.11 |
| 10s | 0.02 | 0.11 |
| 14s | 0.01 | 0.10 |
Chinese
FP16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.04 | 0.15 |
| 10s | 0.02 | 0.15 |
| 14s | 0.02 | 0.15 |
W8A16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.03 | 0.11 |
| 10s | 0.02 | 0.10 |
| 14s | 0.01 | 0.10 |
Japanese
FP16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.04 | 0.15 |
| 10s | 0.02 | 0.15 |
| 14s | 0.02 | 0.15 |
W8A16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.03 | 0.11 |
| 10s | 0.02 | 0.11 |
| 14s | 0.01 | 0.11 |
Korean
FP16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.04 | 0.15 |
| 10s | 0.02 | 0.15 |
| 14s | 0.02 | 0.15 |
W8A16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.03 | 0.10 |
| 10s | 0.02 | 0.10 |
| 14s | 0.01 | 0.10 |
Model Farm provides optimized model resources and test code, which can be obtained in either of two ways:

- Via the Model Farm page: click Models & Test Code in the Performance Reference section on the right to download the model resources and code package.
- Via the command line (recommended): users with APLUX development boards can fetch model resources and code packages with the built-in MMS tool.
```shell
# Search Models
mms list [model name]
# Get Models
mms get -m [model name] -p [precision] -c [soc] -b [backend] -d [file path]
```
For MMS usage, please refer to: MMS Usage & Access to Preview Models
| Supported Languages |
|---|
| Chinese |
| Cantonese |
| English |
| Japanese |
| Korean |
Note: The RTF values shown for each language in the Performance Reference section on the right are based on the current audio input length. Because the model uses fixed (non-dynamic) input dimensions, the RTF may increase slightly when the audio is shorter than the reference length.
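Because the input dimensions are fixed, shorter clips must be padded (and longer ones truncated) before inference. A minimal sketch with a hypothetical helper, assuming 16 kHz mono audio and the ~15 s window described above:

```python
MAX_SECONDS = 15      # assumed fixed input window
SAMPLE_RATE = 16000   # assumed model sample rate

def pad_to_window(samples: list[float]) -> list[float]:
    """Zero-pad (or truncate) a waveform to the fixed input length."""
    target = MAX_SECONDS * SAMPLE_RATE
    if len(samples) >= target:
        return samples[:target]
    return samples + [0.0] * (target - len(samples))

five_seconds = [0.0] * (5 * SAMPLE_RATE)
padded = pad_to_window(five_seconds)
# len(padded) == 240000 regardless of the original clip length
```

This is why a 5 s clip and a 14 s clip incur nearly the same absolute inference time, and the shorter clip reports the higher RTF.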
- Inference using AidVoice: Please refer to the AidVoice SDK Documentation