The NVIDIA T4 is a Turing architecture inference accelerator that became the most widely deployed GPU for AI inference in cloud computing. Announced in September 2018, the T4 established the template for compact, efficient inference GPUs that subsequent generations would follow.
The T4 uses the TU104 die with 16GB of GDDR6 memory providing 320 GB/s bandwidth. It includes 2,560 CUDA cores, 320 second-generation Tensor Cores, and 40 RT cores for hardware-accelerated ray tracing. The chip is manufactured on TSMC's 12nm FFN process.
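The unit counts and bandwidth figure above can be cross-checked with a little arithmetic. The sketch below is illustrative only. The 40 streaming multiprocessors, the per-SM unit counts, the 256-bit memory bus, and the 10 Gbps per-pin GDDR6 data rate are not stated in the text; they are assumptions taken from NVIDIA's published Turing specifications.

```python
# Assumed Turing / T4 parameters (not stated in the text above).
SM_COUNT = 40              # streaming multiprocessors enabled on the T4
CUDA_CORES_PER_SM = 64
TENSOR_CORES_PER_SM = 8
RT_CORES_PER_SM = 1
BUS_WIDTH_BITS = 256       # GDDR6 memory bus width
DATA_RATE_GBPS = 10        # effective data rate per pin


def unit_counts(sm_count=SM_COUNT):
    """Total execution units implied by the per-SM layout."""
    return {
        "cuda_cores": sm_count * CUDA_CORES_PER_SM,
        "tensor_cores": sm_count * TENSOR_CORES_PER_SM,
        "rt_cores": sm_count * RT_CORES_PER_SM,
    }


def memory_bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Peak bandwidth in GB/s: bits per transfer times rate, divided by 8."""
    return bus_width_bits * data_rate_gbps / 8


print(unit_counts())                                          # 2560 / 320 / 40
print(memory_bandwidth_gb_s(BUS_WIDTH_BITS, DATA_RATE_GBPS))  # 320.0
```

Under these assumptions the totals match the quoted 2,560 CUDA cores, 320 Tensor Cores, 40 RT cores, and 320 GB/s.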
The defining feature is the compact low-profile, single-slot PCIe Gen3 x16 form factor with a TDP of just 70W. The card is passively cooled, relying on server chassis airflow, and can be deployed densely, with up to 20 T4s in a single server chassis. No external power connector is required; the T4 draws all of its power from the PCIe slot.
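The power and density claims above reduce to simple arithmetic. The sketch below assumes the 75 W limit that the PCIe card electromechanical specification allows an x16 slot to supply; that limit is not stated in the text. The function names are hypothetical, and the totals cover the GPUs only, not CPUs, fans, or power-supply losses.

```python
PCIE_SLOT_LIMIT_W = 75   # assumption: max power an x16 slot supplies
T4_TDP_W = 70
T4_MEMORY_GB = 16


def slot_powered(tdp_w, slot_limit_w=PCIE_SLOT_LIMIT_W):
    """True if a card can run from slot power alone."""
    return tdp_w <= slot_limit_w


def chassis_totals(card_count, tdp_w=T4_TDP_W, memory_gb=T4_MEMORY_GB):
    """Aggregate GPU power draw and GPU memory for a multi-card chassis."""
    return {
        "gpu_power_w": card_count * tdp_w,
        "gpu_memory_gb": card_count * memory_gb,
    }


print(slot_powered(T4_TDP_W))   # True: 70 W fits within the 75 W slot budget
print(chassis_totals(20))       # 1400 W and 320 GB across 20 cards
```

A fully populated 20-card chassis therefore needs about 1.4 kW of power and cooling capacity for the GPUs alone.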
Second-generation Tensor Cores support INT8 and INT4 operations in addition to FP16, enabling efficient quantized inference. The T4 remains the most widely deployed inference GPU in cloud computing as of 2024, powering inference instances on AWS (G4dn), Azure (NCasT4_v3), Google Cloud (N1 with attached T4 GPUs), and most other cloud providers.
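To make "quantized inference" concrete, the following minimal sketch shows symmetric per-tensor INT8 quantization and an integer dot product in pure Python. It is a generic textbook scheme chosen for illustration, not NVIDIA's calibration method and not how Tensor Cores are programmed. It runs on the CPU, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in quantized]


def int8_dot(a, b):
    """Dot product with integer multiply-accumulate, rescaled once at the end."""
    qa, scale_a = quantize_int8(a)
    qb, scale_b = quantize_int8(b)
    acc = sum(x * y for x, y in zip(qa, qb))  # integer accumulation
    return acc * scale_a * scale_b


weights = [0.5, -1.0, 0.25]
activations = [1.0, 2.0, -0.5]
exact = sum(w * x for w, x in zip(weights, activations))
print(exact, int8_dot(weights, activations))  # the two values should be close
```

The multiply-accumulate loop operates on small integers, and the floating-point scales are applied only once at the end. This is the kind of work that INT8-capable hardware accelerates, at the cost of a small rounding error per element.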