With the continuous evolution of AI models and soaring parameter counts, improving the capacity of smart computing centers is urgently needed. Limited by network communication performance, the computing efficiency of large-scale distributed GPU clusters still cannot scale linearly, so the development of smart computing centers faces many challenges.
AI training requires large-scale GPU cluster networking and distributed parallel computing, which raises the challenge of balancing cluster scale against per-GPU efficiency. The network needs to support the construction of clusters with thousands or even tens of thousands of GPUs.
For large models, the proportion of inter-machine communication increases, so access bandwidth and bandwidth utilization become the key network indicators affecting training efficiency.
The project construction timeline is tight and services must be deployed rapidly, which places higher demands on the timeliness of network deployment.
If network instability occurs during training, the progress of the entire training task will be affected.
Automatic traffic orchestration and per-flow load balancing without relying on an O&M platform
Based on the typical traffic patterns between GPUs and the 1:1 oversubscription ratio between leaf and spine nodes, the switches are organized into leaf groups, which automatically generates globally balanced paths for all network cards connected to the leaf switches.
16-node P2P test: GDLB improves bandwidth utilization by 6%–25% compared with ECMP, slightly higher than InfiniBand (IB).
Allreduce test: GDLB improves bandwidth utilization by 14%–30% compared with ECMP.
All-to-all test: GDLB improves bandwidth utilization by 5%–14% compared with ECMP.
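To make the leaf-group path generation described above concrete, the following is a minimal Python sketch of deterministic, globally balanced per-flow uplink assignment contrasted with hash-based ECMP. All names and numbers are illustrative assumptions, not the vendor's implementation.

```python
# Minimal sketch (hypothetical): pin each NIC's flows to leaf uplinks so that
# traffic is balanced globally, instead of relying on per-flow ECMP hashing.
# Assumes a 1:1 leaf/spine ratio: each leaf has as many uplinks as downlinks.

def gdlb_style_paths(num_nics_per_leaf: int, num_uplinks: int) -> dict[int, int]:
    """Deterministically pin NIC i on a leaf to uplink i % num_uplinks.

    With a 1:1 oversubscription ratio (num_uplinks == num_nics_per_leaf),
    every uplink carries exactly one NIC's elephant flow, so no uplink is
    overloaded regardless of the flow's 5-tuple.
    """
    return {nic: nic % num_uplinks for nic in range(num_nics_per_leaf)}


def ecmp_style_paths(flows: list[tuple], num_uplinks: int) -> dict[tuple, int]:
    """Hash-based ECMP: collisions are possible, so two large flows may share
    one uplink while another uplink stays idle."""
    return {flow: hash(flow) % num_uplinks for flow in flows}


if __name__ == "__main__":
    # 8 GPUs/NICs per leaf, 8 uplinks (1:1): the deterministic plan uses every uplink once.
    print(gdlb_style_paths(8, 8))
```

With a 1:1 ratio between downlinks and uplinks, the deterministic plan uses each uplink exactly once, whereas hash collisions under ECMP can leave some uplinks idle while others carry two large flows.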
One-click deployment enables quick delivery and shortens the deployment cycle: a 1000-GPU cluster can be delivered within a week.
Automated, adaptive tuning based on expert experience simplifies RoCE parameter optimization (see the illustrative sketch after this list).
Standard northbound interfaces are provided for compatibility with mainstream third-party cloud-based O&M platforms.
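As an illustration of what such adaptive RoCE tuning can look like, the Python sketch below adjusts an ECN marking threshold from observed queue depth. The logic, names, and values are hypothetical assumptions and do not describe the product's actual algorithm.

```python
# Hypothetical illustration of adaptive RoCE tuning, not the product's algorithm:
# nudge the ECN marking threshold toward a target queue depth observed on a port.

def adapt_ecn_threshold(current_kb: int, avg_queue_kb: float,
                        target_queue_kb: float = 512.0,
                        step_kb: int = 64,
                        min_kb: int = 128, max_kb: int = 4096) -> int:
    """Lower the threshold when queues grow (favoring latency and lossless headroom),
    raise it when queues stay short (favoring throughput)."""
    if avg_queue_kb > target_queue_kb:
        current_kb = max(min_kb, current_kb - step_kb)
    elif avg_queue_kb < 0.5 * target_queue_kb:
        current_kb = min(max_kb, current_kb + step_kb)
    return current_kb

if __name__ == "__main__":
    threshold = 1024
    for sample in (300.0, 250.0, 900.0, 700.0):  # observed average queue depths (KB)
        threshold = adapt_ecn_threshold(threshold, sample)
        print(f"avg queue {sample:.0f} KB -> ECN threshold {threshold} KB")
```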
A multi-rail networking architecture is adopted to support flexible, on-demand deployment, and three-layer networking supports clusters of up to 100,000 GPUs (see the scale sketch at the end of this section).
The 400GbE RoCE lossless network delivers high-bandwidth, low-latency communication, meeting GenAI network requirements.
The Clos architecture built from fixed-configuration switches keeps the fault radius small, and device- and network-level redundancy design ensures uninterrupted training.
One-click deployment shortens the deployment cycle, full-service delivery provides a worry-free experience for customers, and terminal-to-network monitoring of RoCE indicators facilitates fault demarcation.
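As a back-of-the-envelope check of the three-layer scale claim, the sketch below computes the maximum number of GPU-facing ports in a non-blocking three-tier Clos (fat-tree) built from k-port switches. The switch radixes are assumptions for illustration, not figures from the text.

```python
# Back-of-the-envelope sketch: a non-blocking three-tier Clos (fat-tree) built
# from k-port switches can attach up to k^3 / 4 end ports. The radixes below
# are assumed for illustration, not specifications from the text.

def max_fat_tree_ports(k: int) -> int:
    """Maximum host-facing ports of a three-tier, 1:1 (non-oversubscribed) fat-tree."""
    return k ** 3 // 4

if __name__ == "__main__":
    for radix in (64, 128):
        print(f"{radix}-port switches: up to {max_fat_tree_ports(radix):,} GPU ports")
```

With 64-port switches this yields 65,536 ports, and with 128-port switches 524,288 ports, which is consistent with a three-layer fabric scaling beyond 100,000 GPUs.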