As AI models evolve and parameter counts soar, smart computing centers urgently need more capacity. Network communication performance remains the limiting factor: the computing efficiency of large-scale distributed GPU clusters still does not scale linearly with cluster size. Smart computing centers therefore face several challenges.
AI training requires large-scale GPU cluster networking and distributed parallel computing. The network must balance cluster scale against GPU efficiency and support clusters of thousands or even tens of thousands of GPUs.
As the share of inter-machine communication grows with model size, access bandwidth and bandwidth utilization become the key network indicators affecting training efficiency.
Project construction timelines are tight and services must go live quickly, so the network has to be deployed faster.
If network instability occurs during training, the progress of the entire training task will be affected.
Automatic traffic orchestration and per-flow load balancing without the O&M platform
Based on the typical traffic patterns between GPUs and the 1:1 oversubscription ratio between leaf and spine nodes, the switches are organized into leaf groups, and a globally balanced path is generated automatically for every network card connected to a leaf.
16-node P2P test: GDLB increases the bandwidth usage by 6%–25% compared with ECMP and achieves slightly higher bandwidth usage than InfiniBand (IB).
Allreduce test: GDLB increases the bandwidth usage by 14%–30% compared with ECMP.
All-to-all test: GDLB increases the bandwidth usage by 5%–14% compared with ECMP.
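The gap between hash-based ECMP and a planned, globally balanced path assignment can be illustrated with a small sketch. This is not the GDLB algorithm, whose internals the text does not describe. It only contrasts hashing each flow onto an uplink with a hypothetical one-to-one mapping of network cards to uplinks on a leaf with 1:1 oversubscription. The function names, addresses, and port counts are assumptions chosen for illustration.

```python
import zlib
from collections import Counter

NUM_NICS = 32      # hypothetical: NICs attached to one leaf
NUM_UPLINKS = 32   # hypothetical: leaf uplinks (1:1 oversubscription)

def ecmp_uplink(flow: tuple, num_uplinks: int) -> int:
    """Pick an uplink by hashing the flow tuple, as hash-based ECMP does."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % num_uplinks

def planned_uplink(nic_index: int, num_uplinks: int) -> int:
    """Hypothetical planned mapping: NIC i always uses uplink i mod N."""
    return nic_index % num_uplinks

def max_link_load(assignments: list) -> int:
    """Number of flows sharing the most heavily loaded uplink."""
    return max(Counter(assignments).values())

def per_flow_share(assignments: list) -> float:
    """Fraction of line rate left to a flow on the busiest uplink."""
    return 1.0 / max_link_load(assignments)

# One line-rate flow per NIC: (src IP, dst IP, src port, dst port).
# 4791 is the UDP destination port used by RoCEv2.
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791) for i in range(NUM_NICS)]

ecmp = [ecmp_uplink(f, NUM_UPLINKS) for f in flows]
planned = [planned_uplink(i, NUM_UPLINKS) for i in range(NUM_NICS)]

print("ECMP    busiest uplink carries", max_link_load(ecmp), "flows")
print("Planned busiest uplink carries", max_link_load(planned), "flows")
```

With few, long-lived, high-rate flows, as is typical for GPU collectives, hashing tends to place several flows on one uplink while leaving others idle. The planned mapping keeps exactly one flow per uplink in this setup.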
One-click deployment for quick delivery, reducing the deployment cycle. A 1000-GPU cluster can be delivered in a week.
Automated, adaptive tuning based on expert experience simplifies RoCE optimization.
Standard northbound interfaces provide compatibility with mainstream third-party cloud-based O&M platforms.
The multi-rail networking architecture is adopted to support on-demand flexible deployment. The three-layer networking supports clusters of up to 100,000 GPUs.
The lossless 400GbE RoCE network is designed for high-bandwidth, low-latency communication, meeting GenAI network requirements.
The fixed Clos architecture has a small fault radius. Device-network redundancy design ensures uninterrupted training.
One-click deployment shortens the deployment cycle. Full-service delivery provides a worry-free experience for customers. Terminal-network monitoring of RoCE indicators facilitates fault demarcation.
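To give a feel for how the number of network layers relates to cluster size, the sketch below applies the standard capacity formula for a non-blocking (1:1) fat-tree Clos fabric built from identical switches: with radix k and t tiers, the fabric connects 2·(k/2)^t endpoints. The radix values are assumptions for illustration. The text does not state which switch radix or rail layout lies behind the 100,000-GPU figure, so this does not reproduce that number.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Endpoints supported by a non-blocking fat tree of identical switches.

    Each non-top switch splits its ports evenly between downlinks and
    uplinks, which gives 2 * (radix / 2) ** tiers endpoint ports.
    """
    if radix < 2 or radix % 2 != 0:
        raise ValueError("radix must be an even number >= 2")
    if tiers < 1:
        raise ValueError("tiers must be >= 1")
    return 2 * (radix // 2) ** tiers

def min_tiers(target_endpoints: int, radix: int) -> int:
    """Smallest tier count whose non-blocking capacity covers the target."""
    tiers = 1
    while max_endpoints(radix, tiers) < target_endpoints:
        tiers += 1
    return tiers

# Hypothetical radices: 64 and 128 ports per switch.
for radix in (64, 128):
    for tiers in (2, 3):
        print(f"radix {radix:3d}, {tiers} tiers: {max_endpoints(radix, tiers):,} endpoints")

print("tiers needed for 100,000 endpoints at radix 128:", min_tiers(100_000, 128))
```

Under these assumptions, two tiers of 128-port switches stop at 8,192 endpoints, so a cluster on the order of 100,000 GPUs needs a third tier.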
Artificial intelligence and machine learning (AI/ML) workloads are fundamentally reshaping the modern data center. Large-scale models, including generative AI and advanced recommendation systems, require thousands to tens of thousands of GPU accelerators operating as a single distributed system. At this scale, the network fabric becomes just as critical as compute, demanding predictable low latency, extreme bandwidth, and high operational efficiency.
One of the primary causes of low GPU utilization in large-scale AI training environments is network instability or link flapping, which can interrupt distributed communication and force portions of the training process to be recalculated. These interruptions reduce cluster efficiency and prolong overall training time.
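The cost of such interruptions can be estimated with a back-of-the-envelope model. This is a minimal sketch, not a measurement from this brief. It assumes that training checkpoints at a fixed interval, that each interruption lands on average halfway through an interval (so half an interval of work is recomputed), and that each restart costs a fixed amount of time. All parameter names and numbers are hypothetical.

```python
def training_efficiency(
    useful_hours: float,
    interruptions: int,
    checkpoint_interval_hours: float,
    restart_hours: float,
) -> float:
    """Fraction of wall-clock time spent on work that is not thrown away.

    Assumed model: each interruption loses half a checkpoint interval of
    progress on average, plus a fixed restart cost.
    """
    lost_per_interruption = checkpoint_interval_hours / 2 + restart_hours
    wall_clock_hours = useful_hours + interruptions * lost_per_interruption
    return useful_hours / wall_clock_hours

# Hypothetical job: 100 hours of useful compute, hourly checkpoints,
# 30 minutes to detect the failure and restart.
flaky = training_efficiency(100.0, interruptions=10,
                            checkpoint_interval_hours=1.0, restart_hours=0.5)
stable = training_efficiency(100.0, interruptions=1,
                             checkpoint_interval_hours=1.0, restart_hours=0.5)

print(f"10 interruptions: {flaky:.1%} efficient")
print(f" 1 interruption:  {stable:.1%} efficient")
```

In this model, cutting the number of interruptions raises efficiency directly, because every avoided link flap saves both the recomputed work and the restart time.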
To address this challenge, Supermicro, Broadcom, AMD, and Micas jointly deliver an open, standards-based AI infrastructure designed for the most demanding AI/ML workloads. The infrastructure is built on a Co-Packaged Optics (CPO) architecture, which is particularly valuable for organizations seeking improved data center sustainability and longer-reach connectivity. By integrating optical engines directly with the switch ASIC, the CPO platform significantly reduces power consumption and thermal load while enabling high-bandwidth links of up to 2 kilometers, which are essential for building a super spine architecture.
Compared with traditional pluggable optics architectures, this solution delivers lower latency, improved link stability, reduced power consumption, and higher GPU utilization. It is built with Supermicro servers with AMD Instinct™ MI355X GPUs (for details, see https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/productbriefs/amd-instinct-mi355x-platform-brochure.pdf), the Micas Tomahawk 5 CPO platform, and AMD Pensando™ Pollara 400 AI NICs. These advantages translate into measurable improvements in AI training efficiency.
Shortens end-to-end training cycles, enabling faster time-to-model and higher infrastructure productivity
Eliminates re-compute overhead, maximizes GPU utilization, and ensures predictable large-scale training outcomes
Drives sustainable operations and significantly lowers total cost of ownership (TCO)
Enables efficient expansion to large clusters without compromising performance or cost efficiency
Expands ecosystem flexibility, accelerates innovation, and reduces vendor lock-in