The GLM-5.1 High-Speed API has launched for select enterprise clients, with a model output speed of 400 tokens per second, reportedly a new global record for end-to-end speed on an official large-model API. According to Odaily, the high-speed version retains the capabilities of the original flagship model and is powered by a high-performance inference engine jointly developed by Zhipu and the TileRT team. The engine restructures GPU operation scheduling into a persistent Engine Kernel that stays resident on the GPU, reducing the kernel launch overhead and memory read/write latency of traditional inference.
In multi-GPU deployments, TileRT further specializes the GPU nodes of an 8-GPU NVL topology into distinct functional Workers, improving the efficiency of attention-layer computation and cross-GPU communication.
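To see why kernel startup (launch) overhead matters at this speed, the following back-of-the-envelope sketch computes the time budget per token at 400 tokens per second and compares it with a hypothetical launch cost. Only the 400 tokens per second figure comes from the article; the number of launches per token and the cost per launch are illustrative assumptions, not measurements from Zhipu or TileRT, and the function names are made up for this example.

```python
def per_token_budget_ms(tokens_per_second: float) -> float:
    """Wall-clock time available to emit one token at the given output speed."""
    return 1000.0 / tokens_per_second


def launch_overhead_ms(launches_per_token: int, launch_cost_us: float) -> float:
    """Total kernel launch overhead per token, in milliseconds."""
    return launches_per_token * launch_cost_us / 1000.0


# Figure from the article: 400 tokens per second.
budget_ms = per_token_budget_ms(400.0)

# Hypothetical values, chosen only for illustration.
ASSUMED_LAUNCHES_PER_TOKEN = 200  # e.g. one launch per operator in a conventional pipeline
ASSUMED_LAUNCH_COST_US = 5.0      # assumed cost of a single kernel launch

conventional_ms = launch_overhead_ms(ASSUMED_LAUNCHES_PER_TOKEN, ASSUMED_LAUNCH_COST_US)

# A persistent kernel is launched once and stays resident on the GPU,
# so this simplified model treats its per-token launch count as zero.
persistent_ms = launch_overhead_ms(0, ASSUMED_LAUNCH_COST_US)

print(f"per-token budget: {budget_ms:.2f} ms")
print(
    f"conventional launch overhead: {conventional_ms:.2f} ms "
    f"({conventional_ms / budget_ms:.0%} of budget)"
)
print(f"persistent launch overhead: {persistent_ms:.2f} ms")
```

Under these assumed numbers, launch overhead alone would consume a large share of the 2.5 ms available per token, which is the kind of cost a GPU-resident kernel is meant to avoid. Real figures depend on the model, hardware, and runtime.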
The high-speed version is currently available to select enterprise clients on the Zhipu MaaS platform. Planned work includes optimizing FP8 inference and extended-context support, targeting low-latency scenarios such as AI programming, real-time interaction, and real-time voice applications.