GenAI Smart Computing Center Network Solution

Empowered Ethernet, Intelligent Future

Challenge

As AI models continue to evolve and parameter counts soar, smart computing centers urgently need greater capacity. Limited by network communication performance, the computing efficiency of large-scale distributed GPU clusters still cannot scale linearly, so the development of smart computing centers faces many challenges.

Large network scale

AI training requires large-scale GPU cluster networking and distributed parallel computing, which must balance cluster scale against per-GPU efficiency. The network needs to support the construction of clusters with thousands or even tens of thousands of GPUs.

High performance requirements

As models grow, the proportion of inter-machine communication increases, making bandwidth and its utilization the key network indicators affecting training efficiency.

Tight timeline

The project construction timeline is tight, requiring rapid service deployment and placing higher demands on how quickly the network can be brought up.

Difficult O&M

If network instability occurs during training, the progress of the entire training task will be affected.

Overview

Ultra-Large-Scale Networking
Network with Extremely High Throughput
Simplified Deployment, Intelligent O&M
  • The multi-rail networking scheme is adopted: the eight NICs in each GPU server, one per GPU card, are connected to eight separate top-of-rack (ToR) devices across the server PODs. As a result, NICs with the same ID across servers communicate through the same ToR.
  • Each layer is designed with 1:1 oversubscription to ensure high-speed forwarding on the network. The three-layer networking supports tens of thousands of GPUs.
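The rail-to-ToR mapping above can be sketched as follows. This is an illustrative toy, not the vendor's actual addressing scheme; the `tor_for_nic` function and its naming are assumptions.

```python
# Toy sketch of multi-rail port mapping (illustrative only; the
# function and ToR naming are assumptions, not the vendor's scheme).

RAILS = 8  # one NIC per GPU, eight GPUs per server


def tor_for_nic(server_id: int, nic_id: int) -> str:
    """Map a server's NIC to its rail ToR: NIC i on every server
    lands on ToR i, so same-ID NICs share a ToR (a 'rail')."""
    assert 0 <= nic_id < RAILS
    return f"tor-rail{nic_id}"


# Same-rail traffic (same NIC ID on different servers) shares one ToR:
assert tor_for_nic(0, 3) == tor_for_nic(41, 3)
# Cross-rail traffic (different NIC IDs) must cross the spine plane:
assert tor_for_nic(0, 3) != tor_for_nic(0, 4)
```

Keeping same-ID NICs on one ToR means the dominant same-rail collective traffic stays a single hop away, while only cross-rail traffic traverses the spine.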
[Diagram: spine planes providing cross-rail communication]

GenAI Architecture

Single-chip 25.6 Tbps, box-to-box architecture

  • SW ports: 64 x 400GbE
  • SW SerDes: 56 Gbps
  • Two-layer networking: Max 2K GPUs
  • Three-layer networking: Max 8K GPUs

Single-chip 51.2 Tbps, box-to-box architecture

  • SW ports: 64 x 800GbE
  • SW SerDes: 112 Gbps
  • Two-layer networking: Max 4K GPUs
  • Three-layer networking: Max 16K GPUs

Single-chip 51.2 Tbps, box-to-box architecture

  • SW ports: 128 x 400GbE
  • SW SerDes: 112 Gbps
  • Two-layer networking: Max 8K GPUs
  • Three-layer networking: Max 32K GPUs
  • The 400GbE/800GbE RDMA over Converged Ethernet (RoCE) solution is leveraged to achieve low-latency and lossless network communication.
  • Micas GDLB is designed to meet the high-bandwidth network performance requirements of GenAI networks.
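The quoted cluster sizes can be sanity-checked with a back-of-envelope model. The formula below is my reading of the figures, assuming 400G GPU NICs and a 1:1 oversubscribed leaf/spine fabric; it is not a vendor-published formula.

```python
# Back-of-envelope check of the two-layer cluster sizes quoted above,
# assuming 400G GPU NICs and a 1:1 oversubscribed leaf/spine fabric
# (an assumption inferred from the figures, not a vendor formula).

def two_layer_max_gpus(ports: int, port_gbps: int, nic_gbps: int = 400) -> int:
    """Half of each leaf's ports face down (1:1 oversubscription);
    an 800G down-port splits into two 400G NIC links; each spine
    port can attach one leaf, bounding the leaf count."""
    down_ports = ports // 2
    nics_per_leaf = down_ports * (port_gbps // nic_gbps)
    max_leaves = ports  # one spine port per leaf
    return nics_per_leaf * max_leaves


assert two_layer_max_gpus(64, 400) == 2048    # ~2K GPUs
assert two_layer_max_gpus(64, 800) == 4096    # ~4K GPUs
assert two_layer_max_gpus(128, 400) == 8192   # ~8K GPUs
```

Under this model, the quoted three-layer figures (8K, 16K, 32K) are exactly 4x the two-layer ones in each configuration.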

Automatic traffic orchestration and per-flow load balancing, with no separate O&M platform required

GDLB

Global Dynamic Load Balancing (GDLB)

Based on the typical transmission traffic patterns between GPUs and the 1:1 oversubscription ratio between the leaf and spine nodes, the switches are organized into leaf groups, and a globally balanced path is automatically generated for every network card connected to each leaf.
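The idea of globally computed, deterministic paths (as opposed to per-flow hashing) can be illustrated with a toy assignment. This is assumed behavior inferred from the description above, not Micas's actual algorithm.

```python
# Toy illustration of globally balanced path assignment (assumed
# behavior inferred from the description; not Micas's algorithm).
# With 1:1 oversubscription a leaf has as many uplinks as attached
# NICs, so a global view can pin each NIC to a distinct spine uplink
# instead of hashing flows and risking collisions.
from collections import Counter


def gdlb_paths(num_nics: int, num_uplinks: int) -> dict[int, int]:
    """Deterministically spread a leaf's NICs across its spine uplinks."""
    return {nic: nic % num_uplinks for nic in range(num_nics)}


paths = gdlb_paths(num_nics=32, num_uplinks=32)
# Perfectly balanced: every uplink carries exactly one NIC's traffic.
assert set(Counter(paths.values()).values()) == {1}
```

ECMP's per-flow hashing can map several elephant flows onto one uplink; a globally computed assignment like the one sketched above avoids such collisions by construction, which is consistent with the utilization gains reported below.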

P2P test

16-node P2P test: GDLB increases the bandwidth usage by 6%–25% compared with ECMP, slightly higher than IB.

Allreduce test

Allreduce test: GDLB increases the bandwidth usage by 14%–30% compared with ECMP.
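For context on how allreduce bandwidth figures like these are typically reported: a common convention (used, for example, by nccl-tests) distinguishes algorithm bandwidth (data size over time) from ring-allreduce bus bandwidth, which scales it by 2(n-1)/n for n ranks. The helper below is a generic illustration of that convention, not part of the benchmark above.

```python
# Generic helper for the standard allreduce bandwidth convention
# (e.g. nccl-tests): algbw = bytes / seconds, and ring-allreduce
# busbw = algbw * 2*(n-1)/n for n ranks. Illustrative only; not
# taken from the benchmark described above.

def allreduce_busbw(size_bytes: float, time_s: float, n_ranks: int) -> float:
    algbw = size_bytes / time_s                    # bytes/s per rank
    return algbw * 2 * (n_ranks - 1) / n_ranks     # wire bandwidth per rank


# e.g. 1 GiB reduced across 16 ranks in 25 ms:
bw = allreduce_busbw(2**30, 0.025, 16)
```

The 2(n-1)/n factor reflects that a ring allreduce moves each byte twice around the ring (reduce-scatter plus all-gather), minus the fraction each rank already holds.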

All-to-all test

All-to-all test: GDLB increases the bandwidth usage by 5%–14% compared with ECMP.

Rapid Deployment

One-click deployment for quick delivery, reducing the deployment cycle. A 1000-GPU cluster can be delivered in a week.

Expert experience-based automated and adaptive optimization simplifies RoCE optimization.

Standard northbound interfaces are provided to be compatible with mainstream third-party cloud-based O&M platforms.

Feature Highlight

Ultra-Large-Scale Networking

The multi-rail networking architecture is adopted to support on-demand flexible deployment. The three-layer networking supports clusters of up to 100,000 GPUs.

Network with Extremely High Throughput

The 400GbE RoCE lossless network is designed to achieve network communication with high bandwidth and low latency, meeting the GenAI network requirements.

High Availability

The fixed Clos architecture has a small fault radius. Device-network redundancy design ensures uninterrupted training.

Simple Deployment and Easy O&M

One-click deployment shortens the deployment cycle, and full-service delivery provides a worry-free experience for customers. Joint endpoint-network monitoring of RoCE indicators facilitates fault demarcation.
