TMTPost (钛媒体), 2026-03-20

Breakthrough in Domestic Computing: Scale of GPU Clusters Competes with Global Giants

Domestic Milestone Achieved

On March 12, Sugon (中科曙光) officially unveiled its self-developed scaleFabric high-speed networking product in Zhengzhou. The company also announced the successful deployment of a 10,000-GPU domestic computing cluster at the National Supercomputing Internet Core Node. The milestone marks a significant step in overcoming critical challenges in China's computing infrastructure. For years, the industry has been bottlenecked by its reliance on foreign technology, particularly the InfiniBand (IB) networking technology dominated by NVIDIA.

However, while domestic players like Sugon are only now entering the 10,000-GPU era, NVIDIA says it is already building clusters that reach 100,000 GPUs or more. The leap from 10,000 to 100,000 GPU cards is not merely numerical; it is a complex challenge spanning technology, ecosystem, and systems engineering.

Technical Hurdles Ahead

Sugon’s 10,000-GPU cluster, known as scaleX, stands out among Chinese computing clusters as a case of fully domestic innovation. NVIDIA, by contrast, has leveraged its CUDA ecosystem and IB networks to scale far beyond that. InfiniBand has long been dominated by NVIDIA, which cemented its control over this critical technology by acquiring Mellanox, and that dependence has hampered the development of large-scale domestic computing infrastructure.

Industry experts are vocal about the road ahead. According to Li Bin, Sugon’s Senior Vice President, the primary technical challenge in scaling to 100,000 GPU cards lies not in the compute nodes but in the interconnect. As cluster sizes grow by an order of magnitude, maintaining computational efficiency and ultra-high reliability becomes paramount. Experts at the China Academy of Information and Communications Technology emphasize that the race for ultra-large clusters is now a focal point of AI competition both in China and internationally.

Reliability and Ecosystem Integration

Building a 100,000-GPU cluster requires overcoming three main challenges: reliability at scale, deep collaboration with algorithms, and extensive system tuning. Reliability matters most because the stakes are high: a single computational failure can incur significant costs. Sugon’s scaleFabric is designed to ensure lossless transmission, while rapid fault recovery technologies aim to minimize downtime.
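
To see why reliability weighs so heavily at this scale, a back-of-the-envelope calculation helps. The sketch below is illustrative only. It assumes failures are independent and occur at a constant rate, and that any single GPU failure interrupts the whole job. The per-GPU figure of 50,000 hours is a hypothetical placeholder, not a number reported by Sugon or NVIDIA, and the function name is made up for this example.

```python
def cluster_mtbf_hours(per_gpu_mtbf_hours: float, num_gpus: int) -> float:
    """Mean time between job interruptions for a cluster.

    Assumes independent, constant-rate failures and that one GPU
    failure interrupts the whole job, so failure rates simply add up.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return per_gpu_mtbf_hours / num_gpus


# Hypothetical per-GPU mean time between failures (an assumption,
# not a figure from the article).
ASSUMED_PER_GPU_MTBF_HOURS = 50_000.0

for num_gpus in (10_000, 100_000):
    hours = cluster_mtbf_hours(ASSUMED_PER_GPU_MTBF_HOURS, num_gpus)
    print(f"{num_gpus:>7} GPUs: one interruption every {hours} hours on average")
```

Under these assumptions, the expected time between interruptions drops from about 5 hours at 10,000 GPUs to about half an hour at 100,000, which is why fast fault recovery matters as much as raw throughput.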

Efficient collaboration with algorithms is equally essential. The real-world performance of these clusters hinges on how well the hardware works with application algorithms and distributed training processes. As optimization work at Beijing University of Science and Technology shows, proper alignment can dramatically reduce communication overhead, underscoring the need for cross-disciplinary teams that bridge hardware and AI expertise.

A Fork in the Road: InfiniBand vs. Ethernet

As China advances toward larger computing scales, a fundamental choice looms over the industry: whether to adopt InfiniBand (IB) or RoCE (RDMA over Converged Ethernet). The choice reflects not only technical differences but also the existing user base and ecosystem strategies. Sugon’s scaleFabric follows an IB-compatible path, prioritizing lossless networking and performance, yet many existing data centers run on Ethernet.

This divergence presents challenges for compatibility and user migration. While new deployments may seamlessly integrate, legacy systems may face communication barriers due to proprietary protocols. Ultimately, the success of domestic solutions hinges on mastering core technologies while remaining flexible enough to accommodate a dual-technology ecosystem.

In summary, as Chinese companies push for innovation and independence in the tech landscape, the next few years will determine whether domestic brands can compete effectively on a global scale, particularly against established players like NVIDIA. The path to a robust domestic computing infrastructure is fraught with challenges but holds promise for a future of high-performance, self-sustaining technology.
