SenseVoiceSmall: ASR
Tags: ASR, W8A16, FP16

SenseVoiceSmall is a lightweight, multi-task speech understanding model from FunAudioLLM (Alibaba/FunASR), designed for edge devices and real-time applications. Its non-autoregressive, end-to-end architecture gives very low inference latency (~70 ms for 10 s of audio), roughly 5× faster than Whisper-Small and 15× faster than Whisper-Large.

  • Multi-Task Learning: Supports multilingual ASR, language identification (LID), speech emotion recognition (SER), and acoustic event detection (AED), covering Mandarin, Cantonese, English, Japanese, and Korean.

  • High Recognition Accuracy: Outperforms Whisper on Chinese and Cantonese ASR, and its emotion recognition is state-of-the-art among open-source models.

  • Exceptional Inference Efficiency: Non-autoregressive design delivers low latency and high throughput.

SenseVoiceSmall excels not only in speech-to-text conversion but also in emotion and event detection, making it ideal for intelligent assistants, smart surveillance, and real-time voice analysis applications.

The source model can be found here

Performance Reference

Table columns: Device, Language, Precision, Audio Duration, RTF, File Size
Supported Languages: Chinese, Cantonese, English, Japanese, Korean

Note: The RTF (real-time factor) values in the Performance Reference section are reported per language for the listed audio duration. Because the model uses fixed input dimensions (non-dynamic input), inference time stays roughly constant regardless of how much of the input is real audio, so the RTF may increase slightly when the audio is shorter than the reference length.
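
The effect described in the note can be illustrated with a minimal sketch. It assumes the usual definition RTF = processing time / audio duration, and it treats the processing time as constant because the input size is fixed. The numbers are illustrative only: the 70 ms figure is the latency quoted above for 10 s of audio, not a measurement on any particular device, and the function name is hypothetical.

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """Return RTF = processing time / audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


# Illustrative assumption: a fixed-size input costs the same time to process
# no matter how much of it contains real audio.
INFERENCE_SECONDS = 0.07  # ~70 ms, the figure quoted for 10 s of audio

for audio_seconds in (10.0, 5.0, 2.0):
    rtf = real_time_factor(INFERENCE_SECONDS, audio_seconds)
    print(f"{audio_seconds:4.1f} s audio -> RTF {rtf:.4f}")
```

Under these assumptions, halving the audio length doubles the RTF, since the numerator stays the same while the denominator shrinks.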

Inference with AidASR SDK

To be released

License
Deployable Model: APLUX-MODEL-FARM-LICENSE