Audio AI workloads cover speech-to-text (Whisper, NVIDIA Parakeet), text-to-speech (XTTS, Bark), voice cloning, music generation (Suno, Stable Audio, MusicLM), and audio enhancement (RNNoise, NVIDIA Maxine). VRAM requirements span a wide range: Whisper Tiny runs in 1 GB, Whisper Large-V3 needs 6-10 GB, and music generation models can demand 24 GB or more.
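A rough lower bound on these VRAM figures comes from the weights alone: parameter count times bytes per parameter. The sketch below is illustrative, not a measurement. The function name `weight_memory_gb` is hypothetical, and the parameter counts are the approximate published sizes of the two Whisper checkpoints (about 39 million and 1.55 billion), which are not stated in the text above.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes).

    Excludes activations, decoder caches, and framework overhead,
    so real VRAM use will be higher than this figure.
    """
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Assumption: approximate published parameter counts for Whisper checkpoints.
WHISPER_PARAMS = {"tiny": 39e6, "large-v3": 1550e6}

if __name__ == "__main__":
    for name, n in WHISPER_PARAMS.items():
        print(f"{name}: {weight_memory_gb(n):.2f} GB of FP16 weights")
```

Under these assumptions the Large-V3 weights come to roughly 3 GB in FP16. The rest of the 6-10 GB range quoted above would then be activations, caches, batch size, and runtime overhead, which this estimate deliberately leaves out.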
Real-time speech recognition for transcription services or voice assistants needs both low latency and good throughput. Whisper-based implementations built for fast or streaming inference (faster-whisper, WhisperLive) achieve a real-time factor (processing time divided by audio duration) below 0.1 on modest GPUs, meaning a minute of audio is transcribed in under six seconds. Music and audio generation, by contrast, are batch workloads where VRAM gates which models you can run at full quality.
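The real-time factor is simply processing time divided by the duration of the audio processed, so values below 1.0 mean faster than real time. Here is a minimal sketch of measuring it. The helper names `real_time_factor` and `measure_rtf` are hypothetical, and `transcribe` stands for whatever callable wraps your speech-to-text engine; it is not a specific library API.

```python
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(
    transcribe: Callable[[str], object],
    audio_path: str,
    audio_seconds: float,
) -> float:
    """Time one transcription call and return its real-time factor.

    `transcribe` is a placeholder for your own wrapper. It must finish
    the whole transcription before returning; if the underlying library
    yields results lazily, consume them inside the wrapper.
    """
    start = time.perf_counter()
    transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


if __name__ == "__main__":
    # 3 seconds of compute for 60 seconds of audio.
    print(real_time_factor(3.0, 60.0))
```

For a fair measurement, run one warm-up call first so that model loading and GPU kernel compilation are not counted against the first file.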
For voice cloning and TTS inference, consumer GPUs with 12-24 GB of VRAM are typical. For training or fine-tuning custom voice models, plan for 24 GB or more, because training holds gradients and optimizer state in memory alongside the weights. The audio domain benefits less from FP8 quantization than vision or language, so FP16 remains the standard precision.
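To see why fine-tuning pushes VRAM needs up, it helps to count the per-parameter state that training keeps resident. The sketch below assumes full fine-tuning with the Adam optimizer in mixed precision, a common setup but not one the text above specifies. It also ignores activations, which grow with batch size and clip length. The function name and the 500-million-parameter model are hypothetical.

```python
def mixed_precision_adam_gb(num_params: float) -> float:
    """Approximate training-state memory in GB, excluding activations.

    Assumed per-parameter cost for mixed-precision Adam:
      2 bytes  FP16 weights
      2 bytes  FP16 gradients
      4 bytes  FP32 master copy of the weights
      4 bytes  FP32 Adam first moment
      4 bytes  FP32 Adam second moment
    """
    bytes_per_param = 2 + 2 + 4 + 4 + 4  # 16 bytes in total
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Hypothetical 500M-parameter voice model.
    print(f"{mixed_precision_adam_gb(500e6):.1f} GB before activations")
```

Under these assumptions training state costs about eight times the FP16 weights, which is why a model that serves comfortably on a 12 GB card can need 24 GB or more to fine-tune. Parameter-efficient methods that train only a small set of added weights reduce this considerably.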