transformers - Faster preprocessing when using new Parakeet TDT please

Feature request

Hi! Thanks for adding native Parakeet support in #44171 — really nice not having to pull in all of NeMo just to run these models. I've been benchmarking and noticed a speed gap I think has a clean fix, but wanted to file an issue before attempting a PR.

What I noticed

Same weights (nvidia/parakeet-tdt-0.6b-v3), same audio, same GPU (RTX 4090, bf16):

  • NeMo: ~8s for my test set
  • transformers ParakeetForTDT: ~47–50s

About 6x slower, which surprised me since the model is identical.

Why I think this happens

The model forward pass looks comparable between the two. The big difference seems to be preprocessing.

src/transformers/models/parakeet/feature_extraction_parakeet.py uses librosa for the STFT and mel filterbank. Librosa is a pure-Python wrapper over numpy/scipy and runs CPU-only — so each chunk has to leave the GPU pipeline, get processed on CPU, then come back as a tensor.

NeMo's AudioToMelSpectrogramPreprocessor is a torch.nn.Module using torch.stft + torch.matmul. It runs on GPU via cuFFT, or on CPU via MKL FFT, and the audio never leaves the device. For chunked workloads where you call the feature extractor over and over, those roundtrips add up.
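
To make the torch.stft + torch.matmul idea concrete, here is a minimal sketch of a device-resident log-mel computation. It is not NeMo's or transformers' code: the helper names (hz_to_mel, mel_filterbank, log_mel), the HTK-style mel scale, the 1e-6 log guard, and the n_fft / win_length / hop_length values are all illustrative assumptions.

```python
import math

import torch


def hz_to_mel(hz: float) -> float:
    # HTK-style mel scale, chosen for brevity (an assumption, not NeMo's setting).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_filterbank(n_mels: int, n_fft: int, sample_rate: int) -> torch.Tensor:
    """Triangular mel filters with shape (n_mels, n_fft // 2 + 1)."""
    n_freqs = n_fft // 2 + 1
    fft_freqs = torch.linspace(0.0, sample_rate / 2, n_freqs)
    mel_pts = torch.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    lower, center, upper = hz_pts[:-2, None], hz_pts[1:-1, None], hz_pts[2:, None]
    rising = (fft_freqs[None, :] - lower) / (center - lower)
    falling = (upper - fft_freqs[None, :]) / (upper - center)
    return torch.clamp(torch.minimum(rising, falling), min=0.0)


def log_mel(waveform, filters, n_fft=512, win_length=400, hop_length=160):
    """waveform: (batch, samples) on any device -> (batch, n_mels, frames)."""
    window = torch.hann_window(win_length, device=waveform.device)
    stft = torch.stft(
        waveform,
        n_fft=n_fft,
        hop_length=hop_length,
        win_length=win_length,
        window=window,
        return_complex=True,
    )
    power = stft.abs() ** 2  # (batch, n_freqs, frames)
    # The filterbank is moved to the audio's device, so nothing returns to CPU.
    mel = torch.matmul(filters.to(waveform.device), power)
    return torch.log(mel + 1e-6)  # hypothetical guard value against log(0)


device = "cuda" if torch.cuda.is_available() else "cpu"
audio = torch.randn(2, 16000, device=device)  # two 1-second clips at 16 kHz
features = log_mel(audio, mel_filterbank(128, 512, 16000))
```

With the default center=True padding, one second of 16 kHz audio at a hop of 160 yields 101 frames, so features has shape (2, 128, 101) and stays on the same device as audio.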

The precedent

src/transformers/models/whisper/feature_extraction_whisper.py already solves this exact problem! It has two methods:

  • _torch_extract_fbank_features: uses torch.stft, runs on whatever device you pass
  • _np_extract_fbank_features: the NumPy fallback (built on transformers.audio_utils rather than librosa)

__call__ uses the torch path whenever PyTorch is installed and forwards its device= argument (default "cpu") to it, falling back to the NumPy path otherwise. The same pattern should map cleanly onto Parakeet: keep the librosa code as a fallback, add a _torch_extract_fbank_features method, and accept an optional device argument.
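
The dispatch proposed for Parakeet could be sketched roughly as follows. The class name and the tuple return values are hypothetical stand-ins that only show which path is selected; the real methods would return mel features.

```python
from typing import Optional, Sequence, Tuple


class SketchFeatureExtractor:
    """Hypothetical stand-in: shows the dispatch only, not real feature extraction."""

    def _np_extract_fbank_features(self, audio: Sequence[float]) -> Tuple[str, str, int]:
        # Stand-in for the existing CPU code path, left untouched.
        return ("cpu-fallback", "cpu", len(audio))

    def _torch_extract_fbank_features(
        self, audio: Sequence[float], device: str
    ) -> Tuple[str, str, int]:
        # Stand-in for a new torch.stft-based path that runs on `device`.
        return ("torch", device, len(audio))

    def __call__(self, audio: Sequence[float], device: Optional[str] = None):
        # No device= given: behave exactly as before (backwards compatible).
        if device is None:
            return self._np_extract_fbank_features(audio)
        return self._torch_extract_fbank_features(audio, device)


extractor = SketchFeatureExtractor()
default_result = extractor([0.0] * 4)
cuda_result = extractor([0.0] * 4, device="cuda")
```

Making device default to None rather than "cpu" is one way to keep existing callers on the current path; whether maintainers would prefer that default is an open design question.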

On PR #44171

I saw the maintainers noted the initial implementation prioritized correctness over speed, which totally makes sense for a first landing. Not saying this is a bug — just wondering if a follow-up speed PR would be welcome now that correctness is in.

Things I'd be careful about in a PR

  1. Numerical parity — add a test comparing both paths within tolerance, like Whisper does
  2. Mel filterbank — register as a non-persistent buffer so it lives on the right device
  3. Backwards compatibility — calls without device= should behave exactly as today
  4. Decoder loop — I noticed the TDT decoder is also pure Python (vs NeMo's CUDA kernels). That's a bigger lift and out of scope for this issue, so mel-spec alone won't fully close the 6x gap — but should be a meaningful chunk.
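
For item 1, a parity test might look something like the sketch below. It is a simplified illustration, not the actual Whisper test: it compares only the power spectrogram from a hand-rolled NumPy STFT against torch.stft, and the function names, FFT size, hop, and tolerances are assumptions. A real test would compare the full outputs of both feature-extractor paths.

```python
import numpy as np
import torch


def np_power_spectrogram(x: np.ndarray, n_fft: int, hop: int) -> np.ndarray:
    # Periodic Hann window, matching torch.hann_window's default (periodic=True).
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n_fft) / n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] for i in range(n_frames)])
    spectrum = np.fft.rfft(frames * window, axis=-1)
    return (np.abs(spectrum) ** 2).T  # (n_freqs, n_frames)


def torch_power_spectrogram(x: torch.Tensor, n_fft: int, hop: int) -> torch.Tensor:
    window = torch.hann_window(n_fft, dtype=x.dtype, device=x.device)
    # center=False so framing matches the NumPy reference (no padding).
    stft = torch.stft(
        x, n_fft=n_fft, hop_length=hop, window=window,
        center=False, return_complex=True,
    )
    return stft.abs() ** 2  # (n_freqs, n_frames)


rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)  # float64 noise, 1 second at 16 kHz
reference = np_power_spectrogram(audio, n_fft=512, hop=160)
candidate = torch_power_spectrogram(torch.from_numpy(audio), n_fft=512, hop=160).numpy()

# Both paths run in float64 here, so a tight tolerance is reasonable;
# a float32 or bf16 GPU path would need looser bounds.
np.testing.assert_allclose(candidate, reference, rtol=1e-6, atol=1e-8)
```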

Ask

Would the maintainers be open in principle to this kind of change before I try writing a PR? Happy to share more detailed benchmarks or isolate just the feature-extraction step if useful.
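
To isolate just the feature-extraction step, a small timing helper along these lines could be used. The helper name time_call and its sync parameter are hypothetical; the workload below is a placeholder for the real feature-extractor call.

```python
import time
from statistics import median
from typing import Callable, Optional


def time_call(
    fn: Callable[[], object],
    repeats: int = 5,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the median wall-clock seconds of fn() over `repeats` runs.

    `sync` is called before each clock read; for GPU work pass something like
    torch.cuda.synchronize so queued kernels are included in the measurement.
    """
    fn()  # warm-up run, excluded from the timings
    samples = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    return median(samples)


# Placeholder workload; swap in the feature-extractor call to be measured.
elapsed = time_call(lambda: sum(i * i for i in range(10_000)))
```

Timing the feature extractor and the model forward pass separately this way would show how much of the gap each one accounts for.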

Thanks! 🙏

Motivation

I really like this model and want the transformers version to be as fast as the nemo-toolkit one, without that heavy dependency.

Your contribution

I'd probably be willing to help out on a PR.
