
SenseVoiceSmall is a lightweight, multifunctional voice understanding model from FunAudioLLM (Alibaba/FunASR), designed for edge devices and real-time applications. Its non-autoregressive, end-to-end architecture achieves very low inference latency (~70 ms for 10 s of audio), more than 5× faster than Whisper-Small and roughly 15× faster than Whisper-Large.
- Multi-Task Learning: Supports multilingual ASR, language identification (LID), speech emotion recognition (SER), and acoustic event detection (AED), covering Mandarin, Cantonese, English, Japanese, and Korean.
- High Recognition Accuracy: Outperforms Whisper on Chinese and Cantonese ASR, and its emotion recognition is state-of-the-art among open-source models.
- Exceptional Inference Efficiency: The non-autoregressive design delivers low latency and high throughput.
SenseVoiceSmall excels not only in speech-to-text conversion but also in emotion and event detection, making it ideal for intelligent assistants, smart surveillance, and real-time voice analysis applications.
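Below is a minimal inference sketch using the FunASR `AutoModel` interface, following the upstream SenseVoice examples; the file path `example.wav` is a placeholder, and the device setting should be adjusted to your hardware.

```python
# Minimal sketch: transcribe one audio file with SenseVoiceSmall via FunASR.
# Assumes `pip install funasr` and a local "example.wav" (placeholder path).
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(
    model="iic/SenseVoiceSmall",  # model ID as published on ModelScope
    device="cpu",                 # or "cuda:0" when a GPU is available
)

# language="auto" lets the built-in LID pick the language;
# explicit codes such as "zh", "yue", "en", "ja", "ko" are also accepted.
res = model.generate(
    input="example.wav",
    language="auto",
    use_itn=True,  # apply inverse text normalization (punctuation, numbers)
)

# The raw output embeds LID/SER/AED tags (e.g. <|en|>, <|HAPPY|>, <|Speech|>);
# this helper converts them into a readable transcript.
print(rich_transcription_postprocess(res[0]["text"]))
```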
The source model can be found here.
| Supported Languages |
|---------------------|
| Chinese |
| Cantonese |
| English |
| Japanese |
| Korean |
Note: The RTF values for each language in the performance reference section are measured at the reference audio input length. Because the model uses fixed input dimensions (non-dynamic input), the RTF may increase slightly when the input audio is shorter than the reference length.
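For reference, RTF (real-time factor) is inference time divided by audio duration, so lower is better. A rough measurement sketch, assuming the FunASR model object from the example above and the `soundfile` package, with a placeholder file path:

```python
import time
import soundfile as sf

def measure_rtf(model, wav_path: str) -> float:
    """Rough RTF estimate: inference time / audio duration (lower is better)."""
    audio, sr = sf.read(wav_path)
    audio_duration = len(audio) / sr  # audio length in seconds
    start = time.perf_counter()
    model.generate(input=wav_path, language="auto")
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration

# Example: an RTF of 0.007 means 10 s of audio is processed in ~70 ms.
```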
To be released