Data Centers and Cloud Computing Technologies

 
 
Lecture 8, Computer Networks (198:552)
Fall 2019
 
Data Centers
Part II
 
What are data centers?
 
Large facilities with 10s of thousands of networked servers
Compute, storage, and networking working in concert
“Warehouse-Scale Computers”
 
Cloud Computing: The View from Apps
 
On-demand
Use resources when you need them; pay-as-you-go
Elastic
Scale up & down based on demand
Multi-tenancy
Multiple independent users share infrastructure
Security and resource isolation
SLAs on performance & reliability
Dynamic Management
Resiliency: isolate failure of servers and storage
Workload movement: move work to other locations
What’s different about DCNs?
 
Single administrative domain
Change all endpoints and switches if you want
No need to be compatible with outside world
Unique network properties
Tiny round trip times (microseconds)
Massive multipath topologies
Shallow-buffered switches
Latency and tail-latency critical
Network is a backplane for large-scale parallel computation
Together, these have serious implications for the transport-, network-, and
link-layer designs you can (and should) use
 
Challenges in DCNs
Data center costs
Amortized Cost*   Component              Sub-Components
~45%              Servers                CPU, memory, disk
~25%              Power infrastructure   UPS, cooling, power distribution
~15%              Power draw             Electrical utility costs
~15%              Network                Switches, links, transit

*3-yr amortization for servers, 15-yr for infrastructure; 5% cost of money

Source: The Cost of a Cloud: Research Problems in Data Center Networks.
Greenberg, Hamilton, Maltz, Patel. SIGCOMM CCR 2009.
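The footnote explains the split: servers amortize over only 3 years while physical infrastructure amortizes over 15, so servers dominate the monthly bill. Below is a minimal sketch of that amortization arithmetic using the standard capital-recovery formula; the dollar inputs are invented purely for illustration.

    # Sketch of amortized monthly cost, per the footnote above:
    # 3-yr servers, 15-yr infrastructure, 5% annual cost of money.
    # The capital spend figures are made-up placeholders.

    def monthly_payment(principal, annual_rate, years):
        """Level monthly payment that repays `principal` over `years`
        at `annual_rate` (the 'cost of money')."""
        r = annual_rate / 12            # monthly rate
        n = years * 12                  # number of payments
        return principal * r * (1 + r) ** n / ((1 + r) ** n - 1)

    servers = monthly_payment(30_000_000, 0.05, years=3)    # hypothetical spend
    infra   = monthly_payment(50_000_000, 0.05, years=15)   # hypothetical spend

    total = servers + infra
    print(f"Servers:        ${servers:,.0f}/month ({servers / total:.0%})")
    print(f"Infrastructure: ${infra:,.0f}/month ({infra / total:.0%})")
    # Even with less capital spend, the short 3-yr lifetime makes servers
    # roughly 70% of the amortized monthly cost in this toy example.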
Server costs
 
30% server utilization considered “good” in data centers
Application demands uneven across the resources
Each server has CPU, memory, and disk: most applications exhaust one
resource, stranding the others (see the sketch after this list)
Long provisioning timescales
New servers purchased quarterly at best
Uncertainty in demand
Demand for a new service can spike quickly
Risk management
Not having spare servers to meet demand brings failure just when
success is at hand
Session state and storage constraints
If the world were stateless servers, life would be good
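To put numbers on stranding (the sketch referenced in the list above): a toy calculation with made-up resource figures, showing how a CPU-hungry app pins one resource at 100% while the others sit mostly idle.

    # Toy illustration of resource stranding; all numbers are made up.
    # Instances of one app are packed onto a server until the first
    # resource runs out, stranding the other two.

    server = {"cpu_cores": 32, "mem_gb": 256, "disk_gb": 8000}
    app    = {"cpu_cores": 4,  "mem_gb": 8,   "disk_gb": 100}   # CPU-hungry

    # The scarcest resource limits how many instances fit.
    instances = min(server[r] // app[r] for r in server)        # 8, CPU-bound

    for r in server:
        used = instances * app[r]
        print(f"{r}: {used}/{server[r]} used ({used / server[r]:.0%})")
    # cpu_cores: 32/32 (100%), mem_gb: 64/256 (25%), disk_gb: 800/8000 (10%)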
 
Goal: Agility -- any service, any server

Turn the servers into a single large fungible pool
Dynamically expand and contract service footprint as needed
Place workloads where server resources are available
Easier to maintain availability
If one rack goes down, machines from another are still available

Want to view the DCN as a pool of compute connected by one big
high-speed fabric
Steps to achieving Agility
 
Workload (compute) management
Means for rapidly installing a service’s code on a server
Virtual machines, disk images, containers
 
Storage management
Means for a server to access persistent data
Distributed filesystems  (e.g., HDFS, blob stores)
 
Network and Routing management
Means for communicating with other servers, regardless of where
they are in the data center
 
Agility means that the DCN needs…
 
Massive bisection bandwidth
Topologies
Routing (multiple paths, load balancing)
Ultra-low latency (<10 microseconds)
The right transport? Switch scheduling/buffer management?
Schedule packets or control transmission rates?
Centralized or distributed control?
Effective resource management (across servers & switches)
Multi-tenant performance isolation
App-aware network scheduling (e.g., for big data)
Support for next-generation hardware & apps
ML, RDMA, rack-scale computing, memory disaggregation
Conventional DC network
Source: “Data Center: Load balancing Data Center Services”, Cisco 2004
[Figure: the Internet feeds two core routers (CR); below them, access
routers (AR) form the layer-3 portion of the tree; beneath the ARs,
layer-2 Ethernet switches (S) aggregate racks of application servers (A).
Key: CR = Core Router (L3), AR = Access Router (L3),
S = Ethernet Switch (L2), A = Rack of app. servers]

~ 1,000 servers/pod == IP subnet
 
Layer 2 vs. Layer 3
 
Ethernet switching (layer 2)
Fixed IP addresses and auto-configuration (plug & play)
Seamless mobility, migration, and failover
✗ Broadcast limits scale (ARP)
✗ Spanning Tree Protocol: no multipath routing
 
IP routing (layer 3)
Scalability through hierarchical addressing
Multipath routing through equal-cost multipath (see the sketch below)
✗ More complex configuration
✗ Can’t migrate w/o changing IP address
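Equal-cost multipath (ECMP) deserves a concrete picture, since it is the layer-3 answer to spanning tree's single path: a router hashes each packet's flow identifier to pick among equal-cost next hops, so one flow stays on one path (no reordering) while different flows spread out. A minimal sketch; the field names and hash choice are illustrative, not any particular router's implementation.

    import hashlib

    def ecmp_next_hop(pkt, next_hops):
        """Pick an equal-cost next hop by hashing the 5-tuple.
        Packets of one flow hash identically, so the flow stays on
        one path; different flows spread across all paths."""
        five_tuple = (pkt["src_ip"], pkt["dst_ip"],
                      pkt["src_port"], pkt["dst_port"], pkt["proto"])
        digest = hashlib.sha256(repr(five_tuple).encode()).digest()
        return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]

    # Example: a switch with four equal-cost uplinks.
    uplinks = ["up0", "up1", "up2", "up3"]
    pkt = {"src_ip": "10.0.1.7", "dst_ip": "10.0.9.3",
           "src_port": 40123, "dst_port": 443, "proto": "tcp"}
    print(ecmp_next_hop(pkt, uplinks))   # same flow -> same uplink every time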
Conventional DC Network Problems

[Figure: the same CR/AR/S/rack tree, annotated with typical
oversubscription: ~5:1 at the top-of-rack switches, ~40:1 at the
aggregation switches, ~200:1 toward the core routers]

Dependence on high-cost proprietary routers
Extremely limited server-to-server capacity
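Those ratios compound into the "extremely limited" capacity just mentioned. A quick back-of-the-envelope; the 1 Gbps NIC speed is an assumption for illustration.

    # Worst-case server-to-server bandwidth across the core, using the
    # ~200:1 ratio from the figure above. NIC speed is assumed.

    nic_gbps = 1.0                  # assumed server NIC rate
    core_oversubscription = 200     # ~200:1 toward the core

    # Under full load, each server's share of core capacity is at most
    # 1/200 of its NIC rate:
    cross_pod_gbps = nic_gbps / core_oversubscription
    print(f"Cross-pod bandwidth per server: {cross_pod_gbps * 1000:.0f} Mbps")
    # -> 5 Mbps, versus 1000 Mbps within the rack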
Conventional DC Network Problems

[Figure: the same tree with ~200:1 oversubscription toward the core;
the racks under one pair of ARs form IP subnet (VLAN) #1, and those
under another form IP subnet (VLAN) #2]

Resource fragmentation, significantly lowering cloud
utilization (and cost-efficiency)
Conventional DC Network Problems

[Figure: as above, with IP subnet (VLAN) #1 and IP subnet (VLAN) #2;
shifting capacity between subnets requires complicated manual
L2/L3 re-configuration]

Resource fragmentation, significantly lowering cloud
utilization (and cost-efficiency)
Complicated manual L2/L3 re-configuration
 
Google’s DCN Challenges & Approaches
 
Source: Jupiter Rising: A Decade of Clos Topologies and Centralized
Control in Google’s Datacenter Network. SIGCOMM ’15
 
Building a high-speed switching fabric
 
A single (n X m)-port switching fabric
 
Different designs of switching fabric possible
Assume n ingress ports and m egress ports, half-duplex links
A single (n X m)-port switching fabric
 
We are OK with any design such that:

Any port can connect to any other directly if all other ports are free

Nonblocking: if input port x and output port y are both free, they
should be able to connect, regardless of other ports being connected.
If not satisfied, the switch is blocking.

[Figure: the fabric as a grid of electrical/mechanical/electronic
crossovers between ingress and egress ports]
 
High port density + nonblocking == hard!
 
Low-cost nonblocking crossbars are feasible for small # ports
 
However, it is costly to be nonblocking with a large number of ports

If each crossover is as fast as each input port,
the number of crossover points == n * m
Cost grows quadratically with the number of ports

Else, crossovers must transition faster than the ports
… so that you can keep the number of crossovers small
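A schematic sketch of a crossbar makes the cost argument concrete; the class below is a toy model invented for illustration, not a hardware design.

    # Toy n x m crossbar: one dedicated crosspoint per (input, output)
    # pair. Dedicated crosspoints make it nonblocking: a free input can
    # always reach a free output, whatever else is connected.

    class Crossbar:
        def __init__(self, n, m):
            self.n, self.m = n, m
            self.in_busy = [False] * n
            self.out_busy = [False] * m

        def crosspoints(self):
            return self.n * self.m          # cost grows as n * m

        def connect(self, i, j):
            """Succeeds whenever input i and output j are both free."""
            if self.in_busy[i] or self.out_busy[j]:
                return False
            self.in_busy[i] = self.out_busy[j] = True
            return True

    for ports in (8, 64, 512):
        print(f"{ports} ports -> {Crossbar(ports, ports).crosspoints():,} crosspoints")
    # 8 -> 64; 64 -> 4,096; 512 -> 262,144: quadratic growth in cost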
 
Q: How is this relevant to the data center network fabric?
Slide material heavily adapted courtesy of Albert Greenberg, Changhoon Kim, Mohammad Alizadeh.
