
SenseVoiceSmall is a lightweight, multifunctional voice understanding model from FunAudioLLM (Alibaba/FunASR), designed for edge devices and real-time applications. Its non-autoregressive, end-to-end architecture achieves very low inference latency (~70 ms for 10 s of audio), roughly 5× faster than Whisper-Small and 15× faster than Whisper-Large.

- Multi-Task Learning: Supports multilingual ASR, language identification (LID), speech emotion recognition (SER), and acoustic event detection (AED), covering Mandarin, Cantonese, English, Japanese, and Korean.
- High Recognition Accuracy: Outperforms Whisper on Chinese and Cantonese ASR, and its emotion recognition is state-of-the-art among open-source models.
- Exceptional Inference Efficiency: The non-autoregressive design delivers low latency and high throughput.

Beyond speech-to-text, SenseVoiceSmall excels at emotion and event detection, making it well suited to intelligent assistants, smart surveillance, and real-time voice analysis.
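SenseVoice-family models return the LID, SER, and AED results as special tokens embedded in the transcript (for example `<|en|><|HAPPY|><|Speech|>` before the text). A minimal post-processing sketch that separates these tags from the plain text; the exact tag set shown is illustrative:

```python
import re

# SenseVoice-style rich transcripts embed special tokens such as
# <|en|> (language), <|HAPPY|> (emotion), <|Speech|> (acoustic event).
TAG = re.compile(r"<\|([^|]+)\|>")

def split_rich_transcript(raw: str):
    """Return (plain_text, tags) from a rich transcript string."""
    tags = TAG.findall(raw)          # collect every <|...|> token
    text = TAG.sub("", raw).strip()  # strip tokens, keep the transcript
    return text, tags

text, tags = split_rich_transcript("<|en|><|HAPPY|><|Speech|><|withitn|>Nice to meet you!")
# text == "Nice to meet you!"; tags == ["en", "HAPPY", "Speech", "withitn"]
```

In practice the FunASR toolkit ships its own post-processing for these tokens; this sketch only illustrates the output format.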
The source model can be found here.
Multi-language RTF Performance Details
Because the model uses a fixed input size of approximately 15 seconds, shorter audio clips are automatically padded to the 15 s limit and therefore exhibit a higher RTF.
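Since the inference cost is roughly fixed by the padded 15 s input, RTF scales inversely with the actual audio length. A small sketch of that relationship; the ~0.15 s per-inference cost is illustrative, loosely matching the FP16 tables below:

```python
def rtf(inference_time_s: float, audio_len_s: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    return inference_time_s / audio_len_s

# With a roughly fixed per-inference cost (illustrative ~0.15 s), shorter
# clips pay the full padded-input cost and therefore show a higher RTF:
for length in (5, 10, 14):
    print(f"{length:>2}s audio -> RTF {rtf(0.15, length):.3f}")
```

Lower RTF means faster than real time; the fixed cost is amortized better as the clip approaches the 15 s window.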
English
FP16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.04 | 0.15 |
| 10s | 0.02 | 0.15 |
| 14s | 0.02 | 0.15 |
W8A16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.03 | 0.11 |
| 10s | 0.02 | 0.11 |
| 14s | 0.01 | 0.10 |
Chinese
FP16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.04 | 0.15 |
| 10s | 0.02 | 0.15 |
| 14s | 0.02 | 0.15 |
W8A16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.03 | 0.11 |
| 10s | 0.02 | 0.10 |
| 14s | 0.01 | 0.10 |
Japanese
FP16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.04 | 0.15 |
| 10s | 0.02 | 0.15 |
| 14s | 0.02 | 0.15 |
W8A16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.03 | 0.11 |
| 10s | 0.02 | 0.11 |
| 14s | 0.01 | 0.11 |
Korean
FP16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.04 | 0.15 |
| 10s | 0.02 | 0.15 |
| 14s | 0.02 | 0.15 |
W8A16
| Audio Length | RTF | Encoder |
|---|---|---|
| 5s | 0.03 | 0.10 |
| 10s | 0.02 | 0.10 |
| 14s | 0.01 | 0.10 |
Model Farm provides optimized model resources and test code, which can be obtained in either of two ways:

- Via the Model Farm page: click Models & Test Code in the Performance Reference section on the right to download the model resources and code package.
- Via the command line (recommended): users with APLUX development boards can fetch model resources and code packages with the built-in MMS tool.
```shell
# Search Models
mms list [model name]
# Get Models
mms get -m [model name] -p [precision] -c [soc] -b [backend] -d [file path]
```
For MMS usage, please refer to: MMS Usage & Access to Preview Models
| Supported Languages |
|---|
| Chinese |
| Cantonese |
| English |
| Japanese |
| Korean |
Note: The RTF values shown for each language in the Performance Reference section on the right are based on the current audio input length. Because the model uses fixed (non-dynamic) input dimensions, the RTF may increase slightly when the audio is shorter than the reference length.
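Because the input dimensions are fixed, shorter clips must be padded (and longer ones truncated) before inference. A minimal sketch with a hypothetical helper, assuming 16 kHz mono audio and the ~15 s window described above:

```python
MAX_SECONDS = 15      # assumed fixed input window
SAMPLE_RATE = 16000   # assumed model sample rate

def pad_to_window(samples: list[float]) -> list[float]:
    """Zero-pad (or truncate) a waveform to the fixed input length."""
    target = MAX_SECONDS * SAMPLE_RATE
    if len(samples) >= target:
        return samples[:target]
    return samples + [0.0] * (target - len(samples))

five_seconds = [0.0] * (5 * SAMPLE_RATE)
padded = pad_to_window(five_seconds)
# len(padded) == 240000 regardless of the original clip length
```

This is why a 5 s clip and a 14 s clip incur nearly the same absolute inference time, and the shorter clip reports the higher RTF.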
- Inference using AidVoice: Please refer to the AidVoice SDK Documentation