To reach high performance, AI accelerators often process the most compute-intensive parts of a neural network with 8-bit integer arithmetic instead of 32-bit floating point. To do so, the data must be quantized from 32-bit floating point down to 8-bit integers.
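The core idea can be sketched with a standard affine quantization scheme: a floating-point value is mapped to INT8 with a scale and zero-point, and approximately recovered on the way back. This is a minimal illustration of the general technique, not Axelera's specific implementation; the range and parameter choices here are assumptions for the example.

```python
import numpy as np

def quantize(x, scale, zero_point):
    """Affine quantization: q = round(x / scale) + zero_point, clipped to INT8."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Approximate reconstruction of the original floats."""
    return (q.astype(np.float32) - zero_point) * scale

# Example: quantize values in [-1, 1] with a symmetric scale (assumed range).
x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
scale, zero_point = 2.0 / 255, 0
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
```

The reconstruction error is bounded by roughly half a quantization step, which is why choosing a scale that tightly covers the real data range matters so much for accuracy.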
The Post-Training Quantization (PTQ) technique begins with the user providing on the order of one hundred images. These images are processed through the full-precision model while detailed statistics are collected. The gathered statistics are then used to compute quantization parameters, which are applied to quantize the weights and activations to INT8 (and other precisions) in both hardware and software.
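The calibration step described above can be sketched as follows. This is a deliberately simple min/max observer; real PTQ toolchains typically use richer statistics (histograms, percentiles, entropy), and the `model` callable here is a stand-in, not an Axelera API.

```python
import numpy as np

def calibrate(model, calibration_batches):
    """Run calibration data through the FP32 model and record the
    observed value range of its output tensor (a minimal min/max observer)."""
    lo, hi = np.inf, -np.inf
    for batch in calibration_batches:
        out = model(batch)  # full-precision forward pass
        lo = min(lo, float(out.min()))
        hi = max(hi, float(out.max()))
    return lo, hi

def quant_params(lo, hi, num_bits=8):
    """Derive an asymmetric scale and zero-point covering [lo, hi]."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point
```

For example, calibrating a ReLU-like output that spans [0, 2] yields a scale of 2/255 and a zero-point of -128, so the full INT8 range is spent on the values the model actually produces.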
Additionally, the quantization pipeline modifies the compute graph to improve quantization accuracy. This may involve folding and fusing operators, as well as reordering graph nodes.
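A classic instance of operator folding is merging a batch-normalization layer into the preceding linear or convolution layer, so that only one operator needs to be quantized. A sketch of the algebra, assuming a linear layer with weight matrix `w` of shape (out, in):

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm following a linear layer into its weights and bias:
    y = gamma * (w @ x + b - mean) / sqrt(var + eps) + beta
      = (gamma / std * w) @ x + (gamma / std * (b - mean) + beta)."""
    std = np.sqrt(var + eps)
    w_folded = w * (gamma / std)[:, None]          # scale each output channel
    b_folded = (b - mean) * gamma / std + beta
    return w_folded, b_folded
```

After folding, the graph contains a single linear operator whose output is numerically identical to the original pair, which removes one quantization point and its associated rounding error.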
Axelera AI is currently developing highly accurate quantization techniques for the most recent AI algorithms, and we continually refine them to push accuracy further. This is especially important because recent algorithms, such as large language models, require special handling when quantized. Our future products will therefore ship with enhanced quantization methods.