Overview of Datacenter Operations and Failures

 
CS 142 Lecture Notes: Datacenters
 
Slide 1
 
Google Datacenter
 
Slide 2
 
Datacenter Organization
 
Single server: 8-24 cores
  DRAM: 16-64 GB @ 100 ns
  Disk: 2 TB @ 10 ms

Rack: 50 machines
  DRAM: 800-3200 GB @ 300 µs
  Disk: 100 TB @ 10 ms

Row/cluster: 30+ racks
  DRAM: 24-96 TB @ 500 µs
  Disk: 3 PB @ 10 ms
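
The rack and row/cluster capacities on this slide are consistent with simply multiplying the per-server figures by the machine and rack counts. The following is a minimal sketch of that roll-up, assuming decimal units (1 TB = 1000 GB), the low end of "30+" racks, and that every server has the same 2 TB of disk; the variable names are illustrative and not from the lecture.

```python
# Roll per-server capacity up to rack and row/cluster level.
# Assumptions (not stated on the slide): decimal units, exactly 30 racks
# per cluster (the slide says "30+"), and identical servers throughout.

SERVERS_PER_RACK = 50
RACKS_PER_CLUSTER = 30

server_dram_gb = (16, 64)   # (low, high) DRAM per server
server_disk_tb = 2          # disk per server

# Rack level: multiply per-server figures by the number of machines.
rack_dram_gb = tuple(gb * SERVERS_PER_RACK for gb in server_dram_gb)
rack_disk_tb = server_disk_tb * SERVERS_PER_RACK

# Cluster level: multiply rack figures by the number of racks, then
# convert GB -> TB and TB -> PB.
cluster_dram_tb = tuple(gb * RACKS_PER_CLUSTER / 1000 for gb in rack_dram_gb)
cluster_disk_pb = rack_disk_tb * RACKS_PER_CLUSTER / 1000

# Latency penalty for remote DRAM relative to local DRAM, in nanoseconds.
LOCAL_DRAM_NS = 100
RACK_DRAM_NS = 300_000      # 300 µs
CLUSTER_DRAM_NS = 500_000   # 500 µs
rack_slowdown = RACK_DRAM_NS // LOCAL_DRAM_NS
cluster_slowdown = CLUSTER_DRAM_NS // LOCAL_DRAM_NS

print("Rack DRAM (GB):", rack_dram_gb)
print("Rack disk (TB):", rack_disk_tb)
print("Cluster DRAM (TB):", cluster_dram_tb)
print("Cluster disk (PB):", cluster_disk_pb)
print("Remote DRAM slowdown (rack, cluster):", rack_slowdown, cluster_slowdown)
```

Under these assumptions the sketch reproduces the slide's 800-3200 GB and 100 TB per rack and 24-96 TB and 3 PB per cluster. It also shows that remote DRAM is thousands of times slower than local DRAM, while disk latency stays at 10 ms at every level.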
 
Slide 3
 
Google Containers
 
Slide 4
 
Microsoft Containers
 
Slide 5
 
Microsoft Containers, cont'd
 
Slide 6
 
Failures are Frequent
 
Typical first year for a new datacenter (Jeff Dean, Google):

  • ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
  • ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
  • ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
  • ~1 network rewiring (rolling ~5% of machines down over 2-day span)
  • ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
  • ~5 racks go wonky (40-80 machines see 50% packet loss)
  • ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
  • ~12 router reloads (takes out DNS and external vips for a couple minutes)
  • ~3 router failures (have to immediately pull traffic for an hour)
  • ~dozens of minor 30-second blips for DNS
  • ~1000 individual machine failures
  • ~thousands of hard drive failures
  • Slow disks, bad memory, misconfigured machines, flaky machines, etc.

Long distance links: wild dogs, sharks, dead horses, drunken hunters, etc.
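
To get a rough feel for the scale of these numbers, the sketch below totals machine-hours of downtime for the three event types whose count, machine range, and outage duration are all stated on the slide (PDU failure, rack-move, rack failures). It is a back-of-the-envelope illustration only: the other event types are left out because the slide does not give enough figures to include them, and the table and function names are made up for this example.

```python
# Rough machine-hours of downtime per year from three event types on the slide.
# Each entry: (events per year, (min machines, max machines), (min hours, max hours)).
# Only events with all three figures stated on the slide are included.
EVENTS = {
    "PDU failure":   (1,  (500, 1000), (6, 6)),
    "rack-move":     (1,  (500, 1000), (6, 6)),
    "rack failures": (20, (40, 80),    (1, 6)),
}


def machine_hours(events):
    """Return (low, high) machine-hours lost per year across all events."""
    low = sum(n * machines[0] * hours[0] for n, machines, hours in events.values())
    high = sum(n * machines[1] * hours[1] for n, machines, hours in events.values())
    return low, high


low, high = machine_hours(EVENTS)
print(f"Machine-hours lost per year: {low} to {high}")

# ~1000 individual machine failures per year works out to a few every day.
machine_failures_per_day = 1000 / 365
print(f"Individual machine failures per day: {machine_failures_per_day:.1f}")
```

Even this partial tally comes to thousands of machine-hours per year, and individual machines fail at a rate of roughly three per day, which is the point of the slide: software has to treat failure as routine.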
 
How Many Datacenters?
 
1-10 datacenter servers/human?
100,000 servers/datacenter

               U.S.           World
Servers        0.3-3B         7-70B
Datacenters    3000-30,000    70,000-700,000

80-90% of general-purpose computing will soon be in datacenters?
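
The estimate on this slide follows from two stated inputs: 1-10 datacenter servers per human and 100,000 servers per datacenter. The sketch below reproduces the arithmetic. The population figures (about 300 million for the U.S. and about 7 billion for the world) are assumptions chosen as round numbers for 2010; they are not stated on the slide, and the function name is hypothetical.

```python
# Back-of-the-envelope estimate of server and datacenter counts.
SERVERS_PER_HUMAN = (1, 10)        # (low, high), from the slide
SERVERS_PER_DATACENTER = 100_000   # from the slide

# Assumed round-number populations (not given on the slide).
POPULATION = {
    "U.S.": 300_000_000,
    "World": 7_000_000_000,
}


def estimate(population):
    """Return ((low, high) servers, (low, high) datacenters) for a population."""
    servers = tuple(population * ratio for ratio in SERVERS_PER_HUMAN)
    datacenters = tuple(s // SERVERS_PER_DATACENTER for s in servers)
    return servers, datacenters


for region, people in POPULATION.items():
    servers, datacenters = estimate(people)
    print(f"{region}: servers {servers[0]:,} to {servers[1]:,}; "
          f"datacenters {datacenters[0]:,} to {datacenters[1]:,}")
```

With these assumed populations the result is 0.3-3 billion servers in 3,000-30,000 datacenters for the U.S., and 7-70 billion servers in 70,000-700,000 datacenters worldwide.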
 
August 25, 2010
 
RAMCloud
 
Slide 7
 
CS 142 Lecture Notes: Security Attacks: Phishing
 
Slide 8
 
Slide 9
 
Sun Containers
 
Slide 10
 
Sun Containers, cont'd

The content discusses datacenter organization, frequent failures, and the prevalence of datacenters in modern computing. It details the typical first-year failures in a new datacenter and highlights the number of servers per datacenter and the shift towards datacenter-centric computing.

  • Datacenter operations
  • Failures
  • Server capacity
  • Datacenter prevalence
  • Modern computing

Uploaded on Sep 29, 2024 | 0 Views



Presentation Transcript


  1. Google Datacenter CS 142 Lecture Notes: Datacenters Slide 1

  2. Datacenter Organization Single server: 8-24 cores DRAM: 16-64GB @ 100ns Disk: 2 TB @ 10ms Rack: 50 machines DRAM: 800-3200GB @ 300 µs Disk: 100TB @ 10ms Row/cluster: 30+ racks DRAM: 24-96TB @ 500 µs Disk: 3 PB @ 10ms CS 142 Lecture Notes: Datacenters Slide 2

  3. Google Containers CS 142 Lecture Notes: Datacenters Slide 3

  4. Microsoft Containers CS 142 Lecture Notes: Datacenters Slide 4

  5. Microsoft Containers, cont'd CS 142 Lecture Notes: Datacenters Slide 5

  6. Failures are Frequent Typical first year for a new datacenter (Jeff Dean, Google): ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover) ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back) ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours) ~1 network rewiring (rolling ~5% of machines down over 2-day span) ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back) ~5 racks go wonky (40-80 machines see 50% packet loss) ~8 network maintenances (4 might cause ~30-minute random connectivity losses) ~12 router reloads (takes out DNS and external vips for a couple minutes) ~3 router failures (have to immediately pull traffic for an hour) ~dozens of minor 30-second blips for DNS ~1000 individual machine failures ~thousands of hard drive failures Slow disks, bad memory, misconfigured machines, flaky machines, etc. Long distance links: wild dogs, sharks, dead horses, drunken hunters, etc. CS 142 Lecture Notes: Datacenters Slide 6

  7. How Many Datacenters? 1-10 datacenter servers/human? 100,000 servers/datacenter U.S. World Servers 0.3-3B 7-70B Datacenters 3000-30,000 70,000-700,000 80-90% of general-purpose computing will soon be in datacenters? August 25, 2010 RAMCloud Slide 7

  8. CS 142 Lecture Notes: Security Attacks: Phishing Slide 8

  9. Sun Containers CS 142 Lecture Notes: Datacenters Slide 9

  10. Sun Containers, cont'd CS 142 Lecture Notes: Datacenters Slide 10
