Quantization


Absmax quantization

This is the most straightforward method of quantization. It maps each weight to an integer between $-2^{b-1}$ and $2^{b-1} - 1$, where $b$ is the number of bits used for quantization. The quantized weight is calculated as:

$$ w_{\text{quant}} = \text{round}(w\Delta) $$

where $\Delta$ is the quantization step size. The quantization step size is calculated as:

$$ \Delta = \frac{2^{b-1} - 1}{\text{max}(|w|)} $$

where $b$ is the number of bits used for quantization. $ \Delta $ is also called the scale. The quantized weight is then dequantized as:

$$ w_{\text{dequant}} = \frac{w_{\text{quant}}}{\Delta} $$
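
A minimal sketch of absmax quantization in PyTorch (helper names are my own, assuming int8, i.e. $b = 8$):

```python
import torch

def absmax_quantize(w: torch.Tensor, bits: int = 8):
    # The scale maps the largest-magnitude weight onto the top of the
    # symmetric integer range [-2^(b-1), 2^(b-1) - 1].
    qmax = 2 ** (bits - 1) - 1
    delta = qmax / w.abs().max()
    w_quant = torch.round(w * delta).clamp(-qmax - 1, qmax).to(torch.int8)
    return w_quant, delta

def absmax_dequantize(w_quant: torch.Tensor, delta: torch.Tensor):
    return w_quant.float() / delta

w = torch.randn(4, 4)
w_q, delta = absmax_quantize(w)
print((w - absmax_dequantize(w_q, delta)).abs().max())  # quantization error
```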

Zero-point quantization

Zero-point quantization shifts the weights so that the full integer range is used even when the weight distribution is not symmetric around zero. The quantized weight is calculated as:

$$ w_{\text{quant}} = \text{round}(w\Delta + z) $$

where $z$ is the zero-point. The quantization step size is calculated as:

$$ \Delta = \frac{2^b - 1}{\text{max}(w) - \text{min}(w)} $$

where $b$ is the number of bits used for quantization. The zero-point is calculated as:

$$ z = -\text{round}(\text{min}(w)\Delta) - 2^{b-1} $$

The quantized weight is then dequantized as:

$$ w_{\text{dequant}} = \frac{w_{\text{quant}} - z}{\Delta} $$
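
And a matching sketch of zero-point quantization (again just illustrative code, assuming int8):

```python
import torch

def zeropoint_quantize(w: torch.Tensor, bits: int = 8):
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # The scale stretches [min(w), max(w)] over all 2^b - 1 integer steps.
    delta = (2 ** bits - 1) / (w.max() - w.min())
    # The zero-point shifts the range so that min(w) maps to -2^(b-1).
    z = -torch.round(w.min() * delta) - 2 ** (bits - 1)
    w_quant = torch.round(w * delta + z).clamp(qmin, qmax).to(torch.int8)
    return w_quant, delta, z

def zeropoint_dequantize(w_quant, delta, z):
    return (w_quant.float() - z) / delta

w = torch.randn(4, 4)
w_q, delta, z = zeropoint_quantize(w)
print((w - zeropoint_dequantize(w_q, delta, z)).abs().max())  # quantization error
```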

Smooth Quant, llm.int8, AWQ

BNB

The activation matrix ($T \times d$), where each row is a token, can have outliers concentrated in a few channels, i.e. the same columns across all tokens. Ideally we would therefore quantize the activations per channel. But the inner dimension of the GEMM is the channel dimension, so the scales have to be constant along it in order to be factored out of the dot product. Therefore the activations are quantized per token: each row gets one scale, giving us $T$ scales. Likewise, the weights ($d \times o$) are quantized per output channel, giving us $o$ scales.
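
A toy illustration of these scale placements (my own code, not from any particular library): per-token scales for the activations and per-output-channel scales for the weights can be factored back in after the integer matmul, whereas per-input-channel activation scales could not.

```python
import torch

T, d, o = 4, 16, 8
A = torch.randn(T, d)   # activations: one row per token
W = torch.randn(d, o)   # weights

a_scale = 127.0 / A.abs().amax(dim=1, keepdim=True)   # T scales, one per token (row)
w_scale = 127.0 / W.abs().amax(dim=0, keepdim=True)   # o scales, one per output column

A_q = torch.round(A * a_scale)   # int8 values, kept in float here to simulate the GEMM
W_q = torch.round(W * w_scale)

# int8 x int8 GEMM with wide accumulation, then rescale by the outer product of scales.
Y = (A_q @ W_q) / (a_scale * w_scale)   # broadcasts to (T, o)

print((Y - A @ W).abs().max())          # small quantization error
```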

BNB (bitsandbytes) solves this by separating out the activation channels that contain outliers together with their corresponding weight rows. These outlier channels are computed in fp16, while the rest of the activations and weights are quantized to int8.
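
A simplified sketch of that decomposition (not the actual bitsandbytes implementation; the 6.0 outlier threshold follows the LLM.int8() paper's default):

```python
import torch

def mixed_precision_matmul(A: torch.Tensor, W: torch.Tensor, threshold: float = 6.0):
    # Columns of A whose absmax exceeds the threshold are treated as outlier channels.
    outliers = A.abs().amax(dim=0) > threshold

    # Outlier channels and their corresponding weight rows stay in higher precision.
    y_fp = A[:, outliers] @ W[outliers, :]

    # Everything else goes through absmax int8 (per-token / per-output-channel scales).
    A_r, W_r = A[:, ~outliers], W[~outliers, :]
    a_scale = 127.0 / A_r.abs().amax(dim=1, keepdim=True)
    w_scale = 127.0 / W_r.abs().amax(dim=0, keepdim=True)
    y_int8 = (torch.round(A_r * a_scale) @ torch.round(W_r * w_scale)) / (a_scale * w_scale)

    return y_fp + y_int8
```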

Smooth Quant solves this by dividing the activation channels by a per-channel scale and multiplying the corresponding weight rows by the same scale. This way the outliers in the activations are "transferred" to the weights. The scale for channel $j$ is $s_j = \frac{\text{absmax}(A_j)^\alpha}{\text{absmax}(W_j)^{1-\alpha}}$, where $\alpha$ (usually set to 0.5) controls how much of the quantization difficulty is migrated from the activations to the weights, $A_j$ is the $j$th column of the activations and $W_j$ is the $j$th row of the weights.
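
A sketch of the smoothing step (toy code, names my own). Because the activations are divided by $s$ and the weights are multiplied by it, the product $AW$ is mathematically unchanged:

```python
import torch

def smooth(A: torch.Tensor, W: torch.Tensor, alpha: float = 0.5):
    a_max = A.abs().amax(dim=0)            # absmax of each activation column (channel j)
    w_max = W.abs().amax(dim=1)            # absmax of each weight row (channel j)
    s = a_max.pow(alpha) / w_max.pow(1 - alpha)
    return A / s, W * s.unsqueeze(1)       # outlier mass migrates from A to W

A, W = torch.randn(4, 16), torch.randn(16, 8)
A_s, W_s = smooth(A, W)
print(torch.allclose(A @ W, A_s @ W_s, atol=1e-4))   # True: the output is preserved
```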

Quant channels

The $\text{absmax}(A_j)$ can be calculated dynamically during inference or statically from activations collected on a sample of the training data (a calibration set). Smooth Quant is faster than BNB since it doesn't require any fp16 computation, while the quality is almost the same.

Smooth Quant is good for compute-bound systems (high batch size), but edge inference (low batch size) is usually memory bound. Therefore, the MIT Han Lab introduced Activation-aware Weight Quantization (AWQ), which uses the distribution of the activations to quantize only the weights, in W4A16 format. During inference the weights are dequantized to fp16 and the computation is done in fp16.
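
The snippet below only sketches the W4A16 storage format (group-wise 4-bit weights dequantized before a higher-precision matmul); AWQ's actual activation-aware scale search is not shown, and the group size of 128 is just a common choice:

```python
import torch

def quantize_w4(W: torch.Tensor, group_size: int = 128):
    d, o = W.shape
    Wg = W.reshape(d // group_size, group_size, o)       # one scale per group of rows
    scale = Wg.abs().amax(dim=1, keepdim=True) / 7.0     # int4 range is [-8, 7]
    W_q = torch.round(Wg / scale).clamp(-8, 7).to(torch.int8)  # packed 2-per-byte in real kernels
    return W_q, scale

def dequantize_w4(W_q: torch.Tensor, scale: torch.Tensor):
    # Real kernels dequantize to fp16 on the fly inside the GPU kernel;
    # here we simply materialize the weights in float on CPU.
    return (W_q.float() * scale).reshape(-1, W_q.shape[-1])

W = torch.randn(256, 64)
W_q, scale = quantize_w4(W)
A = torch.randn(4, 256)
Y = A @ dequantize_w4(W_q, scale)   # weights are dequantized, the GEMM runs in high precision
```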


Recommended reading:

Other quantization methods

Other general optimizations

Quantize Diffusion

Libraries

CUDA references

Good discussions

Misc

FP8 vs INT8

A Qualcomm whitepaper shows that the hardware implementation of the FP8 format is somewhere between 50% and 180% less efficient than INT8 in terms of chip area and energy usage. This is because of the additional logic needed in the accumulation of FP formats versus integer formats. This seems like a broad range, but the actual efficiency depends on many hardware design choices that vary greatly. A similar conclusion was reached recently by Microsoft and Meta: floating-point arithmetic is just much less efficient than integer arithmetic.

This means that FP8 will have to be significantly more accurate than INT8 to be worthwhile from a hardware-efficiency perspective.

FP8 is only supported on H100 GPUs, but storing approximations in fp8 can be more accurate than vanilla int8 quantization. The recent QLoRA paper explores different data types, 4-bit Float and 4-bit NormalFloat, which again are only used for storage and not for computation.

Quantizing bias

Biases are not quantized because, to preserve the accuracy of a typical addmm operation, they would have to be quantized with a scale equal to the product of the input and weight scales. That scale is ridiculously small, so a very high bit-width would be needed to avoid clipping.
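
A tiny numeric illustration with made-up scales (here "scale" means the value of one integer step):

```python
# Made-up but typical-looking step sizes for an int8 linear layer.
s_in, s_w = 0.02, 0.001

# To add the bias inside the integer accumulator, it must use the product of the two scales.
s_bias = s_in * s_w                 # 2e-05: one integer step represents only 0.00002
bias_value = 1.5
print(round(bias_value / s_bias))   # 75000 -> far outside the int8, and even int16, range
```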

Quantization layer reference

https://pytorch.org/docs/stable/amp.html#torch.autocast