transformers - [New model] Add Fun-ASR-Nano (FunAudioLLM/Fun-ASR-Nano-2512) [1 pull request]

transformers, 2026-05-24 11:07:04

Fix Action

Fixed

Fixed by PR: Add Fun-ASR-Nano model (https://github.com/huggingface/transformers/pull/46180)

Model description

Fun-ASR-Nano is an 800M-parameter end-to-end speech recognition model from Alibaba DAMO Academy (FunAudioLLM team). It reports state-of-the-art ASR performance: on the industry benchmarks listed under Performance, its average WER is lower than that of Whisper-large-v3 (1.6B) at half the parameter count.

Architecture:

Audio Encoder: SenseVoiceEncoderSmall (SANM - Self-Attention with FSMN Memory, 70 layers, 512-dim)
Audio Adaptor: 2-layer Transformer projector (512→1024)
Language Model: Qwen3-0.6B (28 layers, 1024-dim)
CTC Decoder: 5-layer Transformer for character-level timestamps
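
The component sizes above can be written down as a small configuration sketch that also checks that adjacent components agree on width. The class and field names are hypothetical and are not taken from the actual configuration_fun_asr_nano.py; only the numbers come from the list above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EncoderSpec:
    # SenseVoiceEncoderSmall (SANM); numbers from the architecture list
    num_layers: int = 70
    hidden_size: int = 512


@dataclass(frozen=True)
class AdaptorSpec:
    # 2-layer Transformer projector, 512 -> 1024
    num_layers: int = 2
    input_size: int = 512
    output_size: int = 1024


@dataclass(frozen=True)
class LanguageModelSpec:
    # Qwen3-0.6B
    num_layers: int = 28
    hidden_size: int = 1024


@dataclass(frozen=True)
class FunAsrNanoSpec:
    # Hypothetical container; not the real config class.
    encoder: EncoderSpec = field(default_factory=EncoderSpec)
    adaptor: AdaptorSpec = field(default_factory=AdaptorSpec)
    language_model: LanguageModelSpec = field(default_factory=LanguageModelSpec)
    ctc_decoder_layers: int = 5

    def check_dims(self) -> None:
        """Raise ValueError if adjacent components disagree on width."""
        if self.encoder.hidden_size != self.adaptor.input_size:
            raise ValueError("encoder output width != adaptor input width")
        if self.adaptor.output_size != self.language_model.hidden_size:
            raise ValueError("adaptor output width != LLM hidden size")


spec = FunAsrNanoSpec()
spec.check_dims()  # passes: 512 -> (512 -> 1024) -> 1024
```

The check makes the role of the adaptor explicit: it is the only component that changes width, mapping 512-dim encoder frames into the 1024-dim embedding space of the language model.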

Key features:

Support for 31 languages (Chinese, English, Japanese, plus 7 Chinese dialects, 26 accents, and 20 EU languages)
Character-level timestamps via CTC forced alignment
Hotword customization for domain adaptation
Native punctuation output (no separate punctuation model needed)
Trained on tens of millions of hours of real speech data
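
To illustrate what "timestamps via CTC forced alignment" means, here is a minimal sketch of a generic Viterbi alignment over CTC frame posteriors. It is a textbook version written in pure Python, not the Fun-ASR CTC decoder; the function name, the toy posteriors, and the blank id of 0 are assumptions made for illustration.

```python
import math


def ctc_forced_align(log_probs, targets, blank=0):
    """Align `targets` to frames with the best CTC path (Viterbi).

    log_probs: list of T frames, each a list of per-class log-probabilities.
    targets:   non-empty list of class ids, none equal to `blank`.
    Returns a list of (target_index, first_frame, last_frame), inclusive.
    """
    if not targets:
        raise ValueError("targets must be non-empty")
    T = len(log_probs)
    # Interleave blanks: [blank, y1, blank, y2, ..., blank]
    ext = [blank]
    for y in targets:
        ext += [y, blank]
    S = len(ext)
    NEG = -math.inf
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][ext[0]]
    score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Predecessors: stay, advance by one, or skip the blank
            # between two different labels.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1][p])
            if score[t - 1][best] == NEG:
                continue  # state not reachable at this frame
            score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
            back[t][s] = best
    # A valid path ends in the last label or the trailing blank.
    s = S - 1 if score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
    if score[T - 1][s] == NEG:
        raise ValueError("not enough frames to emit all targets")
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = back[t][s]
    spans = {}
    for t, s in enumerate(path):
        if s % 2 == 1:  # odd positions in `ext` hold real labels
            k = s // 2
            first, _ = spans.get(k, (t, t))
            spans[k] = (first, t)
    return [(k, *spans[k]) for k in sorted(spans)]


# Toy posteriors over {0: blank, 1: "a", 2: "b"} for 4 frames.
probs = [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
frames = [[math.log(p) for p in row] for row in probs]
alignment = ctc_forced_align(frames, targets=[1, 2])
print(alignment)  # [(0, 0, 1), (1, 3, 3)]
```

The result is in frame indices. Converting to seconds requires the frame shift of the encoder output, which this issue does not state.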

Performance (average WER in % on industry benchmarks, lower is better):

Fun-ASR-Nano (800M): 16.72%
Whisper-large-v3 (1.6B): 33.39%
GLM-ASR-Nano (1.5B): 26.13%
FireRed-ASR (1.1B): 22.63%
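
For readers who prefer relative numbers, the short sketch below turns the averages above into relative WER reductions. It only does arithmetic on the four figures quoted in this issue; the helper name is made up for the example.

```python
# Average WER (%) as quoted above.
wer = {
    "Fun-ASR-Nano (800M)": 16.72,
    "Whisper-large-v3 (1.6B)": 33.39,
    "GLM-ASR-Nano (1.5B)": 26.13,
    "FireRed-ASR (1.1B)": 22.63,
}


def relative_reduction(baseline: float, candidate: float) -> float:
    """Relative WER reduction of `candidate` over `baseline`, in percent."""
    return 100.0 * (baseline - candidate) / baseline


ours = wer["Fun-ASR-Nano (800M)"]
reductions = {
    name: round(relative_reduction(value, ours), 1)
    for name, value in wer.items()
    if name != "Fun-ASR-Nano (800M)"
}
for name, pct in reductions.items():
    print(f"{name}: {pct}% relative WER reduction")
# Whisper-large-v3 (1.6B): 49.9% relative WER reduction
# GLM-ASR-Nano (1.5B): 36.0% relative WER reduction
# FireRed-ASR (1.1B): 26.1% relative WER reduction
```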

Open source status

Model weights available: https://huggingface.co/FunAudioLLM/Fun-ASR-Nano-2512
Original code available: https://github.com/FunAudioLLM/Fun-ASR
Paper: https://arxiv.org/abs/2509.12508

Implementation status

I have a working implementation ready:

configuration_fun_asr_nano.py — Config classes (encoder, adaptor, CTC, main)
modeling_fun_asr_nano.py — Full model (SANM encoder, adaptor, conditional generation)
feature_extraction_fun_asr_nano.py — Mel + LFR feature extraction
convert_fun_asr_nano_to_hf.py — Weight conversion script
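
As background on the "Mel + LFR" step: low frame rate (LFR) processing stacks several consecutive feature frames into one wider frame and then subsamples in time. The sketch below is a simplified pure-Python version, not the code in feature_extraction_fun_asr_nano.py. The stacking factor m=7 and stride n=6 are assumed for illustration, since the issue does not state them, and the tail is padded by repeating the last frame.

```python
def apply_lfr(frames, m=7, n=6):
    """Stack `m` consecutive frames and keep every `n`-th stacked frame.

    frames: list of T feature vectors (lists of floats), all the same length.
    Returns ceil(T / n) vectors, each of length m * dim.
    """
    if not frames:
        return []
    out = []
    for start in range(0, len(frames), n):
        window = frames[start:start + m]
        # Near the end the window is short: repeat the last frame (assumption).
        window = window + [frames[-1]] * (m - len(window))
        out.append([x for frame in window for x in frame])
    return out


# 13 toy frames with 2 features each; frame t holds the value t twice.
mel = [[float(t)] * 2 for t in range(13)]
lfr = apply_lfr(mel)
print(len(lfr), len(lfr[0]))  # 3 14
```

With these assumed settings the sequence gets about six times shorter while each frame gets seven times wider, which reduces the number of positions the encoder has to attend over.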

All weight loading has been verified against the original checkpoint:

Encoder (221M params): 0 missing, 0 unexpected keys ✅
Adaptor (12.6M params): 0 missing, 0 unexpected keys ✅
LLM/Qwen3-0.6B (596M params): 0 missing, 0 unexpected keys ✅
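
As a reminder of what "0 missing, 0 unexpected keys" checks: a missing key is a parameter the model expects but the checkpoint does not provide, and an unexpected key is the reverse. PyTorch's load_state_dict(strict=False) reports both lists; the sketch below reproduces the same comparison on plain key sets so it runs without PyTorch. The key names are invented for the example and are not taken from the real checkpoint.

```python
def compare_keys(expected_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, both sorted."""
    expected, found = set(expected_keys), set(checkpoint_keys)
    missing = sorted(expected - found)      # model wants it, checkpoint lacks it
    unexpected = sorted(found - expected)   # checkpoint has it, model does not
    return missing, unexpected


# Hypothetical key names, for illustration only.
model_keys = ["encoder.layer0.weight", "adaptor.proj.weight", "llm.embed.weight"]
clean_ckpt = list(model_keys)
stale_ckpt = ["encoder.layer0.weight", "llm.embed.weight", "old_head.bias"]

print(compare_keys(model_keys, clean_ckpt))  # ([], [])
print(compare_keys(model_keys, stale_ckpt))
# (['adaptor.proj.weight'], ['old_head.bias'])

# Sanity check on the sizes quoted above, in millions of parameters.
# The CTC decoder is not part of that list, so it is left out here.
total_m = round(221 + 12.6 + 596, 1)
print(total_m)  # 829.6
```

The three verified components sum to roughly 830M parameters, which is consistent with the rounded "800M" figure in the model description.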

Provide useful links for the implementation

HuggingFace model: https://huggingface.co/FunAudioLLM/Fun-ASR-Nano-2512
GitHub repo: https://github.com/FunAudioLLM/Fun-ASR
Paper: https://arxiv.org/abs/2509.12508
Similar model in transformers: Qwen2Audio (same audio-LLM pattern)

I am on the model author team and would like to contribute this directly. Happy to iterate on feedback.
