
Decoding Transformers on Edge Devices

Written by Admin | Jun 11, 2023 10:00:00 PM

Abstract – Recently, Transformer-based models have led to significant breakthroughs in several forms of generative AI. They are key both in increasingly powerful text-to-image models, such as DALL-E or Stable Diffusion, and in language and instruction-following models, such as ChatGPT or Stanford's Alpaca. Today, such networks are typically executed on GPU-based compute infrastructure in the cloud because of their massive model sizes and high memory and bandwidth requirements. Bringing real-time generative transformers to edge devices would greatly expand their applicability. To this end, this article discusses the bottlenecks in transformer inference for generative AI at the edge.

Conclusion

Transformers and their use in natural language processing and generative AI are becoming ever more important. However, their computational cost and their memory and bandwidth requirements make them hard to accelerate efficiently on specialized hardware at the edge. This blog introduces the transformer encoder and decoder architecture and discusses the challenges of running inference on an edge device: high computational cost, high peak memory requirements, and low arithmetic intensity. Several solutions to these problems have been proposed. First, efficient transformers modify the attention mechanism to reduce its complexity from O(s^2) to O(s), where s is the sequence length. Other works focus on quantization and on increasing the sparsity of the networks. None of these solutions are currently mainstream.

Second, we discuss the KV-caching mechanism, which reduces the computational cost of decoding by orders of magnitude in exchange for additional bandwidth and memory capacity. KV-caching makes transformer inference bandwidth-bound, limiting performance to on the order of 5 tokens per second for the smallest current state-of-the-art NLP models, such as the LLaMA-7B model discussed above. We propose several network topologies that can be accelerated efficiently on an edge platform. To bring transformer decoding to edge devices, network designers should focus on making these models significantly smaller without sacrificing application accuracy.
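To make the bandwidth argument concrete, the sketch below first works through the back-of-envelope arithmetic behind the roughly 5-tokens-per-second figure, then shows a toy single-head attention step that reuses a KV-cache instead of recomputing keys and values for the whole sequence. The model size, weight precision, and DRAM bandwidth used in the estimate are illustrative assumptions, not measurements of any particular device, and the NumPy code is a minimal sketch of the caching idea rather than a production implementation.

```python
import numpy as np

# Back-of-envelope estimate: why KV-cached decoding is bandwidth-bound.
# With a KV-cache, generating one token still requires streaming (at least)
# all model weights from memory once, so the decode rate is roughly
#   tokens/s ~= memory bandwidth / bytes read per token.
# The figures below (7B parameters, FP16 weights, 64 GB/s DRAM bandwidth)
# are illustrative assumptions, not measurements of a specific device.
params = 7e9            # a LLaMA-7B-class model
bytes_per_param = 2     # FP16 weights
bandwidth = 64e9        # assumed DRAM bandwidth, bytes/s

bytes_per_token = params * bytes_per_param
print(f"~{bandwidth / bytes_per_token:.1f} tokens/s")   # ~4.6 tokens/s

# Toy single-head attention step with a KV-cache: the new token's key and
# value are appended to the cache, so each step costs O(s*d) instead of the
# O(s^2*d) needed to recompute attention over the full sequence.
d = 64                                          # head dimension
k_cache = np.zeros((0, d), dtype=np.float32)    # cached keys,   shape (s, d)
v_cache = np.zeros((0, d), dtype=np.float32)    # cached values, shape (s, d)

def decode_step(q, k_new, v_new):
    """One autoregressive attention step over the running KV-cache."""
    global k_cache, v_cache
    k_cache = np.concatenate([k_cache, k_new[None, :]])
    v_cache = np.concatenate([v_cache, v_new[None, :]])
    scores = k_cache @ q / np.sqrt(d)           # (s,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over cached positions
    return weights @ v_cache                    # attention output, shape (d,)

for _ in range(5):                              # decode a few tokens
    q, k, v = (np.random.randn(d).astype(np.float32) for _ in range(3))
    out = decode_step(q, k, v)
```

With the cache, per-step attention cost grows only linearly with sequence length, but the cache itself must be read back on every step, which is exactly the bandwidth and capacity trade-off described above.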

References

[1] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017), https://arxiv.org/pdf/1706.03762.pdf

[2] Moons, Bert, "Transformers in Computer Vision", https://www.axelera.ai/blog/transformers-in-computer-vision

[3] OpenAI, ChatGPT, https://chat.openai.com/chat

[4] Falcon-40B, Hugging Face, https://huggingface.co/tiiuae/falcon-40b

[5] Dettmers, Tim, et al. “QLoRA: Efficient finetuning of quantized LLMs.” arXiv preprint arXiv:2305.14314 (2023), https://arxiv.org/abs/2305.14314

[6] Meta AI, LLaMA, https://github.com/facebookresearch/llama

[7] Taori, Rohan, et al. “Stanford Alpaca: An Instruction-following LLaMA Model”, https://github.com/tatsu-lab/stanford_alpaca

[8] Open LLM Leaderboard, Hugging Face, https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

[9] NVIDIA, “Accelerated Inference for Large Transformer Models Using NVIDIA FasterTransformer and NVIDIA Triton Inference Server”, https://developer.nvidia.com/blog/accelerated-inference-for-large-transformer-models-using-nvidia-fastertransformer-and-nvidia-triton-inference-server/

[10] Tay, Yi, et al. “Efficient Transformers: A Survey.” ACM Computing Surveys (2022). https://arxiv.org/pdf/2009.06732.pdf

[11] Zhou, Haoyi, et al. “Informer: Beyond efficient transformer for long sequence time-series forecasting.” Proceedings of the AAAI conference on artificial intelligence. Vol. 35. No. 12. 2021. https://arxiv.org/abs/2012.07436

[12] Choromanski, Krzysztof, et al. “Rethinking attention with performers.” arXiv preprint arXiv:2009.14794 (2020). https://arxiv.org/abs/2009.14794

[13] Du, Nan, et al. “GLaM: Efficient scaling of language models with mixture-of-experts.” International Conference on Machine Learning. PMLR, 2022. https://arxiv.org/abs/2112.06905

[14] Shen, Sheng, et al. “Q-BERT: Hessian based ultra low precision quantization of BERT.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 05. 2020. https://arxiv.org/abs/1909.05840

[15] Sheng, Ying, et al. “High-throughput generative inference of large language models with a single GPU.” arXiv preprint arXiv:2303.06865 (2023). https://arxiv.org/abs/2303.06865