The Great Interconnect Showdown – InfiniBand versus Ultra Ethernet

By Justin W. Hobbs

In the modern cloud-based world, the terms AI, hyperscale, high-performance computing (HPC), latency, and even remote direct memory access (RDMA) are ever-rising buzzwords in the data center space. Why? These terms represent a much larger discussion about how data centers are built, how they operate, and how performance is both scrutinized and constantly improved. Not to mention that time equals money, and when seconds count, microseconds are akin to gold. In the world of hyperscale data centers and large AI models, high-octane compute performance demands an equally high-octane networking ecosystem where every microsecond of latency and every gigabit of bandwidth counts.

It is important to remember that networks rely on physical infrastructure. To understand why, one could ask, "Do most people use cloud-based applications in their everyday lives?" Of course we do. Simplistically stated, cloud-based applications are just software that people pay a license fee to use while someone else's computer hosts it. Microsoft 365, Netflix, Zelle®, and DoorDash are all examples of everyday applications that rely on cloud-based systems. Multiply that by billions of people and thousands of organizations worldwide, and the need for high-performance, low-latency computing becomes quite clear.

The rise of massive-scale distributed AI models and HPC has created growing demand for interconnect solutions that can handle enormous throughput with minimal latency. For more than two decades, InfiniBand (IB) has been lauded as the gold standard in this arena. However, an emerging standard, Ultra Ethernet Transport (UET), aims to challenge that dominance by bringing IB's performance capabilities into the vast, open ecosystem of Ethernet, much as open standards displaced the proprietary systems that reigned supreme decades ago. This article explores the high-level differences between these two systems and evaluates where each technology is best suited to drive the next generation of data-intensive workloads. In the words of famous ring announcer Michael Buffer, "Let's get ready to rumble!"

THE ARENA

Both IB and Ultra Ethernet aim to overcome two major hurdles in high-performance computing networks: delivering performance and controlling congestion. Performance is the goal, and congestion is the problem. Performance is generally measured using two metrics: latency and throughput. Latency, typically measured in microseconds, is critical for tasks where computers must synchronize constantly, such as in an AI training environment. Throughput describes how much data can be moved per second and is often used interchangeably with bandwidth.

Congestion is the problem that these two contenders attack with different strategies. When too many computers send immense amounts of data through the network at the same time, the network equipment can get overwhelmed and begin to drop packets, forcing endpoints to resend and retry transmissions. Those who remember hubs in the network will recall what a technological wonder packet switching was when it first came to market.

The "secret sauce" to it all revolves around RDMA. Both UET and IB use it, and each has developed its own way of delivering it. Essentially, RDMA is a shortcut: data is taken from computer A's application memory and transported directly into computer B's application memory without either CPU having to process it.
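To make that shortcut concrete, here is a minimal Python sketch contrasting the two paths. It is a conceptual toy, not real networking code: the Host class, its buffers, and the copy counts are invented for illustration, and real RDMA is performed by NIC hardware through interfaces such as libibverbs.

```python
# Toy model of why RDMA's "shortcut" matters. Illustrative only.

class Host:
    def __init__(self, name, mem_size=16):
        self.name = name
        self.app_memory = bytearray(mem_size)      # application buffer
        self.kernel_buffer = bytearray(mem_size)   # OS socket buffer

def tcp_style_send(src, dst, payload):
    """Traditional path: the CPU shepherds data through kernel buffers."""
    n = len(payload)
    src.kernel_buffer[:n] = payload                 # app -> kernel copy
    dst.kernel_buffer[:n] = src.kernel_buffer[:n]   # across the wire
    dst.app_memory[:n] = dst.kernel_buffer[:n]      # kernel -> app copy
    return 3   # three CPU-visible copies

def rdma_style_write(src, dst, payload):
    """RDMA path: data lands directly in the remote application's memory."""
    dst.app_memory[:len(payload)] = payload
    return 1   # one hardware-driven placement, no CPU involvement

a, b = Host("A"), Host("B")
print("TCP-style copies: ", tcp_style_send(a, b, b"gradients"))
print("RDMA-style copies:", rdma_style_write(a, b, b"gradients"))
```

Fewer copies means less CPU time spent on networking, which is exactly the resource both contenders are trying to hand back to computation.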
IB utilizes native RDMA along with equal-cost multi-pathing (ECMP) as its primary network routing strategy, which has both positive and negative implications, as discussed below. UET instead delivers RDMA over standard Ethernet. Today's common way of doing that, "RDMA over Converged Ethernet version 2" (RoCEv2), layers the RDMA protocol over existing Ethernet infrastructure; UET improves on that approach by adding features such as link-layer retry and packet spraying. Packet spraying splits a single data stream and sends its individual packets along multiple different paths at the same time.[1]
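The Python sketch below illustrates the packet-spraying idea under stated assumptions: four equal-cost paths, a twelve-packet message, and a crude delay model that scrambles arrival order. It does not reflect UET's actual spraying or reordering logic; it only shows packets fanning out across every path and being reassembled by sequence number at the receiver.

```python
import random

random.seed(7)
PATHS = 4
message = list(range(12))   # one data stream of 12 numbered packets

# Spray: each packet independently picks one of the equal-cost paths.
in_flight = [(pkt, random.randrange(PATHS)) for pkt in message]
for pkt, path in in_flight:
    print(f"packet {pkt:2d} -> path {path}")

# Paths have different delays, so packets arrive out of order...
arrived = sorted(in_flight, key=lambda p: (p[1], p[0]))  # crude delay model
# ...and the receiver restores the stream using the sequence numbers.
reassembled = sorted(pkt for pkt, _ in arrived)
assert reassembled == message
```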
THE CHAMPION: INFINIBAND

InfiniBand was built from the ground up as a high-performance interconnect fabric, entirely separate from the conventional Ethernet stack. Its focus is on providing a deterministic, lossless transport layer optimized for tightly coupled compute clusters. Its main advantage resides in its native RDMA solution: the RDMA protocol is built directly into the network fabric's core design.[2]

To accomplish this, IB employs "lossless" credit-based flow control.[3] Think of it like an aristocratic delivery system in which the sender never ships a package (or data, in this instance) until the receiver explicitly says it has exactly that much room for packages. In other words, the receiving computer's network card sends "credits" to the sender. If the sender's network card runs out of credits, transmission stops until more credits become available. This ensures a zero-packet-loss environment with deterministic latency.
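Below is a toy Python model of that credit exchange. The class names, the four-slot buffer, and the granting logic are assumptions made for illustration; real InfiniBand exchanges credits in hardware at the link layer. What it demonstrates is the essential behavior: when credits run out, the sender pauses rather than dropping anything.

```python
class Receiver:
    """Grants credits matching exactly how much buffer room it has."""
    def __init__(self, buffer_slots=4):
        self.free_slots = buffer_slots

    def grant_credits(self):
        granted, self.free_slots = self.free_slots, 0
        return granted

    def drain(self, n):
        """Application consumes packets, freeing buffer slots again."""
        self.free_slots += n

class Sender:
    def __init__(self):
        self.credits = 0

    def send(self, packets, rx):
        sent = 0
        for _ in range(packets):
            if self.credits == 0:
                self.credits = rx.grant_credits()   # ask for more room
                if self.credits == 0:
                    break   # receiver is full: pause, never drop
            self.credits -= 1
            sent += 1
        return sent

rx, tx = Receiver(buffer_slots=4), Sender()
print("round 1 sent:", tx.send(10, rx))   # 4: stopped when credits ran out
rx.drain(4)                               # receiver processes its backlog
print("round 2 sent:", tx.send(10, rx))   # 4 more once credits return
```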
The benefit is twofold. First, bypassing OS and CPU processing overhead creates efficiency, particularly in transport times. Second, it frees those resources to focus on computation, such as training AI models, instead of managing network traffic. Think of it like building a footbridge across a busy highway: people can cross quickly and safely without stopping traffic.

The problem is not so easily apparent. IB uses flow-based equal-cost multi-pathing (ECMP). Think of it like this: you are driving from point A to point B, and your GPS provides three routes that take the same amount of time. Without ECMP, all the cars using GPS pick the same route, and if an unexpected traffic jam occurs, everyone is stuck until it clears. With ECMP, some traffic goes down Route 1, some goes down Route 2, and the rest goes down Route 3. To visualize this, imagine that all Netflix traffic goes down Route 1, all Microsoft 365 traffic goes down Route 2, and all Hulu traffic goes down Route 3. The goal is to segment traffic effectively to prevent congestion and latency. The catch is that flow-based ECMP pins each flow to a single route for its entire lifetime: if two enormous flows happen to hash onto the same route, that route jams while the others sit idle.
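The sketch below shows how that can happen. It models ECMP's hashing step with a simplified five-tuple hash (CRC32 is a stand-in; real switches use vendor-specific hardware hashes), and destination port 4791 appears because that is the UDP port registered for RoCEv2. With six flows and only three paths, at least two flows must share a path, and nothing rebalances them afterward.

```python
import zlib

PATHS = 3

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    """Hash the flow's five-tuple; the flow is pinned to that path for life."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    return zlib.crc32(key) % PATHS

# Six hypothetical "elephant" flows between GPU servers:
flows = [(f"10.0.0.{i}", "10.0.1.1", 50000 + i, 4791) for i in range(6)]
for flow in flows:
    print(flow, "-> path", ecmp_path(*flow))  # collisions leave paths idle
```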
THE CHALLENGER: ULTRA ETHERNET

Ultra Ethernet Transport, currently at version 1.0, is a novel transport protocol defined by the Ultra Ethernet Consortium (UEC) and is designed to utilize the vast, open ecosystem of standard Ethernet.
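Since link-layer retry was called out earlier as one of the features UET layers on top of Ethernet, here is a toy Python model of the idea. The loss rate, retry cap, and per-hop behavior are invented parameters, not anything drawn from the UEC specification; the sketch only shows a lossy hop retransmitting locally until the packet gets through, instead of forcing an end-to-end resend.

```python
import random

random.seed(1)

def send_with_retry(packet_id, loss_rate=0.3, max_retries=5):
    """Retransmit at this hop until the neighbor acknowledges receipt."""
    for attempt in range(1, max_retries + 1):
        if random.random() > loss_rate:   # simulate a lossy link
            return attempt                # delivered without an e2e resend
    raise RuntimeError(f"packet {packet_id}: link declared down")

for pkt in range(5):
    print(f"packet {pkt} delivered after {send_with_retry(pkt)} attempt(s)")
```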