transformers: [New model] Add Fun-ASR-Nano (FunAudioLLM/Fun-ASR-Nano-2512) [1 pull request]

Model description

Fun-ASR-Nano is an 800M-parameter end-to-end speech recognition model from Alibaba DAMO Academy (FunAudioLLM team). On the industry benchmarks listed under Performance it averages 16.72% WER, compared with 33.39% for Whisper-large-v3 (1.6B), at half the parameter count.

Architecture:

  • Audio Encoder: SenseVoiceEncoderSmall (SANM - Self-Attention with FSMN Memory, 70 layers, 512-dim)
  • Audio Adaptor: 2-layer Transformer projector (512→1024)
  • Language Model: Qwen3-0.6B (28 layers, 1024-dim)
  • CTC Decoder: 5-layer Transformer for character-level timestamps

Key features:

  • Support for 31 languages (Chinese, English, Japanese + 7 Chinese dialects + 26 accents + 20 EU languages)
  • Character-level timestamps via CTC forced alignment
  • Hotword customization for domain adaptation
  • Native punctuation output (no separate punctuation model needed)
  • Trained on tens of millions of hours of real speech data
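
To make the character-level timestamp feature concrete, here is a minimal sketch of how per-frame CTC labels can be turned into time spans. It uses a plain greedy collapse (merge repeats, drop blanks), which is a simpler stand-in for the forced alignment named above, not the model's actual procedure. The function name `ctc_spans`, the blank id of 0, and the 0.06 s frame duration are all assumptions made for illustration.

```python
def ctc_spans(frame_ids, blank=0, frame_sec=0.06):
    """Collapse per-frame CTC labels into (token, start_sec, end_sec) spans.

    Greedy collapse only: consecutive repeats are merged and blanks are
    dropped. `blank` and `frame_sec` are illustrative assumptions.
    """
    spans = []          # each entry: [token, start_frame, end_frame_exclusive]
    prev = blank
    for t, tok in enumerate(frame_ids):
        if tok != blank:
            if tok == prev:
                spans[-1][2] = t + 1           # extend the running span
            else:
                spans.append([tok, t, t + 1])  # start a new span
        prev = tok
    return [(tok, s * frame_sec, e * frame_sec) for tok, s, e in spans]


# Toy label sequence: token 5 appears twice, separated by a blank frame.
example = ctc_spans([0, 5, 5, 0, 5, 7, 7, 7, 0], frame_sec=1.0)
```

A blank between two identical labels yields two separate spans, which is how CTC distinguishes a repeated character from a held one.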

Performance (average WER% on industry benchmarks):

  • Fun-ASR-Nano (800M): 16.72%
  • Whisper-large-v3 (1.6B): 33.39%
  • GLM-ASR-Nano (1.5B): 26.13%
  • FireRed-ASR (1.1B): 22.63%
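
The gap to each baseline is easier to compare as a relative WER reduction. The short calculation below uses only the four averages listed above; the variable names are illustrative.

```python
# Average WER (%) as reported in the list above.
wer = {
    "Fun-ASR-Nano": 16.72,
    "Whisper-large-v3": 33.39,
    "GLM-ASR-Nano": 26.13,
    "FireRed-ASR": 22.63,
}

ours = wer["Fun-ASR-Nano"]

# Relative reduction = (baseline - ours) / baseline, expressed in percent.
relative_reduction = {
    name: 100.0 * (baseline - ours) / baseline
    for name, baseline in wer.items()
    if name != "Fun-ASR-Nano"
}

for name, pct in relative_reduction.items():
    print(f"{name}: {pct:.1f}% relative WER reduction")
```

By these figures the relative reduction is roughly 50% against Whisper-large-v3, 36% against GLM-ASR-Nano, and 26% against FireRed-ASR.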

Open source status

Implementation status

I have a working implementation ready:

  • configuration_fun_asr_nano.py — Config classes (encoder, adaptor, CTC, main)
  • modeling_fun_asr_nano.py — Full model (SANM encoder, adaptor, conditional generation)
  • feature_extraction_fun_asr_nano.py — Mel + LFR feature extraction
  • convert_fun_asr_nano_to_hf.py — Weight conversion script
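
For readers unfamiliar with the LFR (low frame rate) step mentioned for the feature extractor, here is a simplified sketch of the general idea: stack `lfr_m` consecutive Mel frames into one vector and advance by `lfr_n` frames. The function name `apply_lfr`, the defaults `lfr_m=7` and `lfr_n=6`, and the padding scheme are assumptions for illustration and are not taken from the files listed above.

```python
import math


def apply_lfr(frames, lfr_m=7, lfr_n=6):
    """Stack lfr_m consecutive frames, stepping lfr_n frames at a time.

    `frames` is a list of per-frame feature lists. The start is padded by
    repeating the first frame and short tail windows by repeating the last
    frame; both choices are assumptions of this sketch.
    """
    if not frames:
        return []
    left_pad = (lfr_m - 1) // 2
    padded = [frames[0]] * left_pad + list(frames)
    n_out = math.ceil(len(frames) / lfr_n)
    stacked = []
    for i in range(n_out):
        start = i * lfr_n
        window = padded[start:start + lfr_m]
        window = window + [padded[-1]] * (lfr_m - len(window))  # pad the tail
        stacked.append([x for frame in window for x in frame])  # concatenate
    return stacked


# 12 one-dimensional toy frames holding the values 0..11.
toy = [[float(t)] for t in range(12)]
lfr_out = apply_lfr(toy)
```

With these defaults the output has one sixth as many frames as the input, and each output vector is seven times as wide as a single input frame.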

Weight loading has been verified against the original checkpoint for each component:

  • Encoder (221M params): 0 missing, 0 unexpected keys ✅
  • Adaptor (12.6M params): 0 missing, 0 unexpected keys ✅
  • LLM/Qwen3-0.6B (596M params): 0 missing, 0 unexpected keys ✅
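
A "0 missing, 0 unexpected keys" check of this kind can be sketched as a set comparison between the parameter names the model expects and those in the converted checkpoint. This mirrors what PyTorch's `load_state_dict(strict=False)` reports, but it is not the author's verification script. The helper `diff_keys` and the key names are hypothetical.

```python
def diff_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, both sorted.

    missing    = expected by the model but absent from the checkpoint
    unexpected = present in the checkpoint but unknown to the model
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Hypothetical parameter names, for illustration only.
model_keys = ["encoder.layer0.weight", "adaptor.proj.weight", "llm.embed.weight"]
converted_checkpoint = dict.fromkeys(model_keys)  # stands in for a state dict

missing, unexpected = diff_keys(model_keys, converted_checkpoint)

# Component sizes in millions of parameters, as listed above.
component_params_m = {"encoder": 221.0, "adaptor": 12.6, "llm": 596.0}
total_m = sum(component_params_m.values())
```

The three verified components sum to about 829.6M parameters, consistent with the 800M figure quoted in the model description.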

Provide useful links for the implementation

I am on the model author team and would like to contribute this directly. Happy to iterate on feedback.
