Embedding generation converts text, images, or other content into dense numerical vectors that capture semantic meaning. These vectors power retrieval-augmented generation (RAG), semantic search, recommendation systems, and clustering applications.
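As a rough illustration of how such vectors support semantic search, the sketch below ranks a few documents by cosine similarity to a query vector. The three-dimensional vectors, the document names, and the helper functions are all hypothetical; a real pipeline would take its vectors from an embedding model and would normally use a vector index rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, corpus):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings; real models emit hundreds of dimensions or more.
corpus = {
    "gpu_pricing": [0.9, 0.1, 0.0],
    "cooking_tips": [0.0, 0.2, 0.9],
    "cloud_billing": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]

ranking = rank_by_similarity(query, corpus)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Documents whose vectors point in nearly the same direction as the query score close to 1, which is why the toy "gpu_pricing" entry ranks first here.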
Modern embedding models range from compact encoders with roughly 33M parameters (BGE-small, GTE-small), through 110M-parameter base variants, to larger 7B-parameter models, some of which double as text generators. Compute requirements are modest compared to LLM inference: with a compact encoder and short inputs, a single batch of 512 documents typically completes in tens of milliseconds on a mid-tier GPU.
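To put the model sizes in perspective, a back-of-the-envelope estimate of weight memory is parameter count times bytes per parameter. The sketch below assumes that only the weights are counted (activations, framework overhead, and batch buffers are ignored), and the function name and byte table are illustrative rather than taken from any library.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gib(num_params, dtype="fp16"):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

# Parameter counts are round illustrative figures.
for label, params in [("110M encoder", 110_000_000), ("7B model", 7_000_000_000)]:
    print(f"{label}: {weight_memory_gib(params, 'fp16'):.2f} GiB in FP16")
```

Under these assumptions a 110M-parameter encoder needs about 0.2 GiB for FP16 weights and a 7B model about 13 GiB, so the smaller encoders leave most of a typical GPU's memory free for large batches.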
Throughput, not VRAM capacity, is usually the binding constraint. Production pipelines often process millions of documents per hour during indexing, with batch sizes tuned to fit the available GPU memory. Memory bandwidth and FP16/INT8 throughput drive aggregate processing rates more than peak FP32 FLOPs do.
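The relationship between batch size, per-batch latency, and hourly throughput can be sketched with simple arithmetic. The example below assumes a steady state in which the GPU is always busy and ignores tokenization, data loading, and host-to-device transfer; the 50 ms latency is an illustrative figure, not a measurement, and both helper functions are hypothetical.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def docs_per_hour(batch_size, seconds_per_batch):
    """Steady-state throughput if every batch takes the same time."""
    return batch_size / seconds_per_batch * 3600

# Split a hypothetical corpus of 1,000 documents into batches of 512.
documents = [f"doc-{i}" for i in range(1000)]
batch_lengths = [len(batch) for batch in batched(documents, 512)]
print(batch_lengths)  # the final batch is smaller than the rest

# Assumed latency of 50 ms per 512-document batch (illustrative only).
rate = docs_per_hour(batch_size=512, seconds_per_batch=0.05)
print(f"{rate / 1e6:.1f} million documents per hour")
```

At that assumed latency, a single GPU would handle roughly 37 million documents per hour, which shows how per-batch times in the tens of milliseconds translate into the hourly volumes mentioned above.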