Accelerated Computing
FIELD GUIDE
§ 10
Glossary

The terms.

The vocabulary an engineer uses to talk about accelerated systems, kept short and operational.

CUDA
NVIDIA's parallel programming model and toolchain. The historical reason GPUs broke into general computing.
FLOPS
Floating-point operations per second. The base unit of throughput on numeric workloads.
HBM
High-Bandwidth Memory. Stacked DRAM dies mounted in the same package as the accelerator die and connected over a very wide interface.
Kernel
A unit of code launched on the accelerator. Thousands of threads run the same kernel concurrently.
MFU
Model FLOPs Utilisation. The fraction of peak compute a training job actually achieves; 30–60% is typical.
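A minimal sketch of the arithmetic behind an MFU figure. It assumes the common rule of thumb of about 6 FLOPs per parameter per token for a dense transformer's forward and backward pass. The function name, the model size, the token rate and the peak figure are all hypothetical and chosen only for illustration.

```python
def model_flops_utilisation(params, tokens_per_second, peak_flops_per_second):
    """Estimate MFU as achieved model FLOPS divided by hardware peak FLOPS."""
    # Assumption: ~6 FLOPs per parameter per token (forward + backward pass
    # of a dense transformer). Other architectures need a different count.
    achieved_flops_per_second = 6 * params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical numbers: a 7e9-parameter model processing 10,000 tokens/s
# on hardware with an assumed aggregate peak of 1e15 FLOPS.
mfu = model_flops_utilisation(
    params=7e9,
    tokens_per_second=10_000,
    peak_flops_per_second=1e15,
)
print(f"MFU = {mfu:.0%}")
```

With these made-up inputs the estimate lands inside the typical range quoted above.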
PCIe
The expansion bus that connects accelerators to the host CPU. Latency here often gates small workloads.
Quantisation
Reducing numerical precision (for example FP32 → INT8 or FP4) to fit larger models in memory and run them faster, usually at some cost in accuracy.
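One way to picture the idea is symmetric per-tensor INT8 quantisation: map the largest magnitude to 127 and round everything else onto that grid. This is a sketch of one simple scheme among many, not a description of any particular library. The function names are hypothetical.

```python
def quantise_int8(values):
    """Symmetric per-tensor quantisation of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantised = [max(-127, min(127, round(v / scale))) for v in values]
    return quantised, scale


def dequantise(quantised, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantised]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
# Each restored value differs from the original by at most about scale / 2.
print(q, restored)
```

The stored integers take a quarter of the space of FP32 values; the price is the rounding error visible in `restored`.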
SM
Streaming Multiprocessor. NVIDIA's name for the basic compute unit of a GPU; current data-centre GPUs carry over a hundred.
Sparsity
Skipping computation on zero-valued weights. A 2:4 pattern (two zeros in every group of four consecutive weights) can roughly double effective throughput on hardware that supports it.
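A small sketch of how a 2:4 pattern can be produced: in each group of four consecutive weights, keep the two with the largest magnitude and zero the rest. Magnitude-based selection is an assumption made here for illustration; the function name is hypothetical.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude weights in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


dense = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]
sparse = prune_2_of_4(dense)
print(sparse)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Because the position of the zeros is constrained to fixed-size groups, hardware can store only the surviving values plus a small index and skip the rest.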
Systolic array
A grid of processing elements that pumps data rhythmically through neighbours. The heart of TPUs.
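The rhythm can be sketched with a cycle-by-cycle timing model. The version below assumes an output-stationary layout, chosen for simplicity: each processing element (i, j) holds one output value, rows of A enter from the left and columns of B from the top, each skewed by one cycle per row or column. Real hardware may use a different dataflow; the function name is hypothetical.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    Returns the product and the number of cycles the array needs.
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    # The last element (n-1, m-1) receives its final operands at
    # cycle n + m + k_dim - 3, so the run takes n + m + k_dim - 2 cycles.
    cycles = n + m + k_dim - 2
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                # Skewed input: operand pair k reaches element (i, j) at cycle i + j + k.
                k = t - i - j
                if 0 <= k < k_dim:
                    c[i][j] += a[i][k] * b[k][j]
    return c, cycles


product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, cycles)  # [[19, 22], [43, 50]] 4
```

Every element does at most one multiply-accumulate per cycle and only ever exchanges data with its neighbours, which is what keeps the wiring short and the clock fast.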
Tensor core
A hardware unit that performs a small matrix multiply-accumulate as a single operation, typically at reduced precision. Backbone of modern training.
Throughput
Work completed per unit time. Accelerators are built to maximise this; CPUs are built to minimise latency.