Speech-to-text (also called automatic speech recognition, or ASR) converts spoken audio into written text. The dominant production models in 2026 are OpenAI's Whisper family, the models shipped with NVIDIA's NeMo toolkit, and Whisper-derived variants optimized for streaming or other low-latency use cases.
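Whisper-large-v3 has roughly 1.5B parameters and runs comfortably on a single workstation or datacenter GPU. Compute requirements scale with audio length: a one-hour audio file typically transcribes in 30-60 seconds on an H100, which is roughly 60-120 times faster than real time. For low-latency use, faster-whisper (a CTranslate2 reimplementation of Whisper) and WhisperX (which builds on it and adds batched inference and word-level alignment) speed up inference; real-time streaming setups typically pair such a backend with a smaller model to reach sub-second latency.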
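The figures above translate into simple capacity arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 30-60 second timings are the ones quoted in the text rather than measurements, and the memory estimate counts model weights only (activations and decoding caches are ignored).

```python
def speedup_over_real_time(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio transcribed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def gpu_hours_needed(total_audio_hours: float, speedup: float) -> float:
    """GPU time needed for a backlog, assuming the speedup holds throughout."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return total_audio_hours / speedup


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    one_hour = 3600.0
    fast = speedup_over_real_time(one_hour, 30.0)  # 30 s per hour of audio
    slow = speedup_over_real_time(one_hour, 60.0)  # 60 s per hour of audio
    print(f"speedup range: {slow:.0f}x to {fast:.0f}x real time")

    # A hypothetical 1,000-hour backlog at the slower end of the range.
    print(f"GPU hours for 1,000 audio hours: {gpu_hours_needed(1000.0, slow):.1f}")

    # Roughly 1.5B parameters stored as 2-byte (FP16) values.
    print(f"FP16 weights: {weight_memory_gb(1.5e9, 2):.1f} GB")
```

Under these assumptions, a 1,000-hour archive needs on the order of 8 to 17 GPU-hours, and the FP16 weights occupy about 3 GB, which is consistent with the model fitting on a single GPU.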
Production deployments typically batch audio segments to maximize throughput. Memory bandwidth and FP16/INT8 inference throughput drive cost-effectiveness; the workload tolerates quantization to these precisions with little accuracy loss.
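As a rough illustration of segment batching, the sketch below cuts a recording into fixed-length windows and groups them into batches for a single forward pass each. It assumes 16 kHz audio and 30-second windows, which match Whisper's input format; the helper names and the batch size of 16 are hypothetical. Real pipelines usually cut at silences found by voice activity detection rather than at fixed offsets.

```python
from typing import Iterator, List, Tuple

SAMPLE_RATE = 16_000   # samples per second (Whisper expects 16 kHz audio)
WINDOW_SECONDS = 30    # Whisper processes audio in 30-second windows


def segment_bounds(
    num_samples: int,
    window_seconds: int = WINDOW_SECONDS,
    sample_rate: int = SAMPLE_RATE,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample offsets of consecutive fixed-length windows.

    The final window is shorter when the audio does not divide evenly.
    """
    window = window_seconds * sample_rate
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


def batches(
    items: List[Tuple[int, int]], batch_size: int
) -> Iterator[List[Tuple[int, int]]]:
    """Yield consecutive groups of at most batch_size segments."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]


if __name__ == "__main__":
    one_hour_samples = 3600 * SAMPLE_RATE
    bounds = segment_bounds(one_hour_samples)
    grouped = list(batches(bounds, batch_size=16))
    print(len(bounds), "segments in", len(grouped), "batches")
```

With these settings, one hour of audio yields 120 segments in 8 batches, the last of which holds only 8 segments. Larger batches generally improve GPU utilization until memory runs out, so in practice the batch size is tuned per GPU and per precision.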