
Speaker: Zhiyu Cheng
NVIDIA
Host: Assistant Professor Li Meng (李萌)
Time: March 27, 2025, 11:00–12:00
Tencent Meeting ID: 463-154-887
Title:
FP4 quantization and its real-world applications on LLMs and diffusion models
Abstract:
As large language models (LLMs) and diffusion models grow in complexity, efficient inference has become a pressing concern. In this talk, we introduce FP4 quantization—an emerging technique that substantially reduces memory usage and computational costs with minimal accuracy trade-offs. We begin by discussing the FP4 numerical format. Next, we delve into the quantization workflow, highlighting both post-training quantization (PTQ) and quantization-aware training (QAT) algorithms, along with practical recipes and best practices for successful implementation on LLMs and diffusion models. We then present quantitative and qualitative results to illustrate FP4 quantization’s impact on real-world applications, such as text, image and video generation. Finally, we introduce the NVIDIA TensorRT Model Optimizer, detailing its capabilities for FP4 quantization and streamlined deployment through TensorRT-LLM.
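To make the FP4 numerical format mentioned in the abstract concrete, the sketch below simulates a round-trip through the 4-bit E2M1 floating-point format (1 sign, 2 exponent, 1 mantissa bit), whose magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6}. This is only an illustrative "fake quantization" with a single per-tensor scale, not the production recipe used by NVIDIA's tools (which use finer-grained scaling); the function name is hypothetical.

```python
import numpy as np

# The 15 distinct values representable in FP4 E2M1 (negative zero folded in).
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_POS[:0:-1], FP4_POS])

def fake_quantize_fp4(x: np.ndarray) -> np.ndarray:
    """Simulated per-tensor FP4 quantization: scale x into the E2M1
    range, snap each element to the nearest representable value,
    then rescale back (the round-trip used when simulating PTQ error)."""
    scale = np.max(np.abs(x)) / 6.0  # 6.0 is the largest E2M1 magnitude
    if scale == 0:
        return x.copy()
    # For each element, pick the closest point on the FP4 grid.
    idx = np.abs(x[..., None] / scale - FP4_GRID).argmin(axis=-1)
    return FP4_GRID[idx] * scale

x = np.array([0.07, -1.3, 2.2, 5.9])
print(fake_quantize_fp4(x))  # every output lies on the scaled FP4 grid
```

Comparing the output against the input shows the quantization error FP4 introduces; PTQ calibrates scales to minimize this error on real activations, while QAT trains the model with this round-trip in the loop so the weights adapt to it.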
Speaker Bio:
Zhiyu Cheng is a manager at NVIDIA, where he focuses on driving algorithm and software development to optimize large-scale inference for generative AI workloads, including large language models (LLMs), vision language models (VLMs), and diffusion models on NVIDIA's latest platforms. He has over 10 years of industry experience in efficient deep learning, gained at NXP, Xilinx, Baidu, and OmniML (acquired by NVIDIA). Zhiyu has a record of over 30 published papers and patents. He holds a Ph.D. in electrical and computer engineering from the University of Illinois, with a thesis in the field of information theory.