One More Time with Feeling - the Ultra Ethernet OCP Collaboration

This summer, the Ultra Ethernet Consortium (UEC) was formed by founding members AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta and Microsoft. The organization’s charter is to address gaps in Ethernet, primarily for AI cluster requirements. You may wonder why a technology founded decades ago requires this work, but if you look back at the history of high-speed fabrics you’ll remember we’ve been here before. InfiniBand entered the scene in the early 2000s, providing mainframe-inspired RDMA at high speeds. Soon, the TOP500 supercomputing rankings started featuring the interconnect. Fast forward to the AI era, the convergence of HPC architecture with AI clusters, and NVIDIA’s acquisition of InfiniBand provider Mellanox, and you can see that AI cluster operators today are limited in how far they can scale cluster deployments without turning to a pricey, single-vendor solution. Cloud providers want to scale more efficiently, and UEC is looking to solve the remaining limitations that previous efforts have not addressed.

As someone who worked on InfiniBand during the foundation of the specifications, I’ve followed this dramatic narrative more closely than the average data center geek. While the companies gathered to work on UEC specifications provide great leadership across the value chain, and others from the Ethernet industry have also signaled their support for this initiative, there has been some concern that the religious fervor that tends to surround communications protocols could once again limit progress.

This week, however, UEC took a further step forward with a strategic collaboration with the Open Compute Project (OCP) and the data center operators that roam OCP working groups and deploy open OCP spec’d hardware. The two organizations announced a sweeping collaboration across the OCP Switch Abstraction Interface (SAI), OCP Caliptra Workstream, OCP Networking Project, OCP NIC Workstream, OCP Time Appliance Project, and OCP Future Technologies Initiative. I’d also assume that OCP’s sustainability initiative will get involved to ensure that the next-generation, industry-standard fabrics that stem from UEC will offer sustainability as well as performance.

When I caught up with Cloudflare’s Rebecca Weekly earlier this week, she put this announcement in context. “We need massive clusters for AI models. I think one of the big themes here is around Ultra Ethernet, the consortium that has just been established in the last few months to really start to drive Ethernet as primary. That is a huge step forward, I would argue, for the ecosystem to be able to drive interconnected systems to get us from a couple hundred accelerated nodes operating together to actually millions of accelerated nodes operating together. That is a connectivity-first problem.”

The TechArena take: we’re excited to see this collaboration, as it will shorten the time from specification to solutions and hopefully prevent vendor solutions from forking away from the standards. We also see this as a widening of cloud operator support for UEC, signaling interest in using Ethernet for massive-scale fabrics. This puts UEC on our list of technologies to watch heading into 2024.
