Tensor Cores are dedicated hardware units in NVIDIA GPUs (since Volta in 2017) that execute small matrix multiply-accumulate operations in a single clock cycle. They operate on lower precisions (FP16, BF16, FP8, INT8) and are the primary engine behind deep learning throughput.
A modern H100 has 528 fourth-generation Tensor Cores delivering nearly 1,979 TFLOPS at FP16 with sparsity — orders of magnitude more than the same GPU's CUDA cores at FP32. AMD's equivalent is the Matrix Core unit (CDNA architecture, MI300 series).
Workloads that don't use tensor-friendly precisions (FP64 scientific compute, traditional graphics) cannot benefit from Tensor Cores and must run on the slower general-purpose CUDA cores. This is why scientific HPC GPUs sometimes prioritize FP64 throughput over FP16.