Overview
Increasingly, computation and storage are migrating to dense data centers spread across the planet. Common web requests to services such as Facebook and Amazon run on hundreds or even thousands of machines randomly distributed across data centers that consist of tens or even hundreds of thousands of machines. With the advent of open-source data-parallel toolkits such as Hadoop, organizations can process petabytes of data spread across thousands of machines.
A common performance bottleneck in all of these scenarios is the underlying network interconnecting the machines in a data center. Modern parallel and distributed applications exhibit little data locality, so they are often blocked waiting to send data to or receive data from other nodes.
The goal of our work is to address some of the fundamental challenges in building ultra-large-scale network fabrics:
- Raw performance: At the highest end, data centers will require 1 Pb/s (petabit per second) of bisection bandwidth. Imagine a 100,000-node data center with a 10 Gb/s NIC at each end host.
- Unified layer 2 fabric: A layer 2 fabric such as Ethernet holds the promise of plug-and-play deployment with minimal human management. Unfortunately, the associated routing, forwarding, and configuration protocols present fundamental scaling and performance limitations.
- Single management domain: Today, large-scale networks must still be managed at the level of individual switches, potentially numbering in the thousands. We are designing software and hardware to make the data center network modular and incrementally scalable.
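
As a rough check on the 1 Pb/s figure in the raw-performance goal above, the sketch below multiplies the host count by the per-host NIC rate. It assumes the target is that every end host can drive its NIC at full rate at the same time. The function name and constants are illustrative only and are not taken from the project.

```python
GIGABIT = 10**9   # bits per second in 1 Gb/s
PETABIT = 10**15  # bits per second in 1 Pb/s


def aggregate_host_bandwidth_bps(num_hosts: int, nic_gbps: int) -> int:
    """Total NIC capacity across all end hosts, in bits per second.

    Hypothetical helper: assumes every host has one NIC of the same rate.
    """
    return num_hosts * nic_gbps * GIGABIT


# The scenario from the text: 100,000 hosts, each with a 10 Gb/s NIC.
total_bps = aggregate_host_bandwidth_bps(100_000, 10)
print(f"{total_bps / PETABIT:.1f} Pb/s")
```

Changing either argument shows how quickly the requirement grows: doubling the NIC rate or the host count doubles the capacity the fabric must carry.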