Efficient Gossip and Bimodal Multicast Protocols in Distributed Systems

 
CS5412:
BIMODAL MULTICAST
ASTROLABE
 
Ken Birman
 
CS5412 Spring 2016
 
1
 
Lecture XIX
 
Gossip 201
 
Recall from early in the semester that gossip
spreads in log(system size) time
But is this actually “fast”?
 
[Figure: fraction of nodes infected (0.0 to 1.0) versus time; the familiar S-shaped epidemic curve]
 
CS5412 Spring 2016
 
2
 
Gossip in distributed systems
 
Log(N) can be a very big number!
With N=100,000, log(N) would be about 12
So with one gossip round per five seconds, information needs about one minute to spread in a large system!
Some gossip protocols combine pure gossip with an accelerator
A good way to get the word out quickly
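A toy push-gossip simulation (not from the lecture; the single-random-peer push, the 99% coverage cutoff, and the 5-second round length are assumptions) makes the round count and the roughly-one-minute figure concrete. The exact constant depends on the gossip variant and on which logarithm base you use:

```python
import math, random

def rounds_to_spread(n, coverage=0.99):
    """Push gossip: every informed node contacts one random peer per round."""
    informed = {0}            # one node starts with the rumor
    rounds = 0
    while len(informed) < coverage * n:
        for _ in range(len(informed)):          # each currently informed node pushes once
            informed.add(random.randrange(n))
        rounds += 1
    return rounds

n = 100_000
r = rounds_to_spread(n)
print(f"~{r} rounds to reach 99% of {n} nodes (ln(N) = {math.log(n):.1f})")
print(f"at one gossip round per 5 seconds, that is roughly {r * 5} seconds")
```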
 
CS5412 Spring 2016
 
3
 
Bimodal Multicast
 
CS5412 Spring 2016
 
4
 
To send a message, this protocol uses IP multicast

We just transmit it without delay and we don’t expect any form of response
Not reliable, no acks
No flow control (this can be an issue)
In data centers that lack IP multicast, we can simulate it by sending UDP packets 1:1, without acks
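A minimal sketch of that “UDP 1:1, no acks” fallback (the member addresses, port, and payload framing are invented for illustration; a real implementation would take the member list from the gossip-maintained membership view):

```python
import socket

def unreliable_multicast(payload: bytes, members: list[tuple[str, int]]) -> None:
    """Fire-and-forget: one UDP datagram per member, no acks, no retransmission."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for addr in members:
            sock.sendto(payload, addr)   # best effort; losses are repaired later by gossip
    finally:
        sock.close()

# hypothetical member list; in practice it comes from the membership gossip
unreliable_multicast(b"seqno=42;data=...", [("127.0.0.1", 9001), ("127.0.0.1", 9002)])
```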
 
What’s the cost of an IP multicast?
 
CS5412 Spring 2016
 
5
 
In principle, each Bimodal Multicast message traverses the relevant data center links and routers just once

So this is extremely cheap... but how do we deal with systems that didn’t receive the multicast?
 
Making Bimodal Multicast reliable
 
CS5412 Spring 2016
 
6
 
We can use gossip!
 
Every node tracks the membership of the target
group (using gossip, just like with Kelips, the DHT we
studied early in the semester)
Bootstrap by learning “some node addresses” from some kind of a server or web page
Then gossip exchanges are used to improve the accuracy of the membership view
 
Making Bimodal Multicast reliable
 
CS5412 Spring 2016
 
7
 
Now, layer in a gossip mechanism that gossips about the multicasts each node knows about
Rather than sending the multicasts themselves, the gossip messages just carry “digests”, which are lists
Node A might send node B:
“I have messages 1-18 from sender X”
“I have message 11 from sender Y”
“I have messages 14, 16 and 22-71 from sender Z”
Compactly represented...
This is a form of “push” gossip
 
Making Bimodal Multicast reliable
 
CS5412 Spring 2016
 
8
 
On receiving such a gossip message, the recipient checks to see which messages it has that the gossip sender lacks, and vice versa

Then it responds:
“I have copies of messages M, M’ and M’’ that you seem to lack”
“I would like a copy of messages N, N’ and N’’ please”

An exchange of the actual messages follows
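A sketch of this digest comparison, with each node’s message store modeled as per-sender sets of sequence numbers (the data structures and function names are illustrative, not Bimodal Multicast’s actual wire format):

```python
def compare_digests(mine, theirs):
    """Return (messages I can offer them, messages I should request from them)."""
    offer, request = {}, {}
    for sender in set(mine) | set(theirs):
        i_have = mine.get(sender, set())
        they_have = theirs.get(sender, set())
        if i_have - they_have:
            offer[sender] = i_have - they_have
        if they_have - i_have:
            request[sender] = they_have - i_have
    return offer, request

# Node A's digest from the slide: 1-18 from X, 11 from Y, 14, 16 and 22-71 from Z
a_digest = {"X": set(range(1, 19)), "Y": {11}, "Z": {14, 16, *range(22, 72)}}
# A hypothetical digest for node B
b_digest = {"X": set(range(1, 16)), "Y": {11, 12}, "Z": set(range(14, 30))}

offer, request = compare_digests(b_digest, a_digest)
print("B can offer:", offer)      # messages B holds that A lacks
print("B requests:", request)     # messages A holds that B lacks
```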
 
Optimizations
 
CS5412 Spring 2016
 
9
 
Bimodal Multicast resends using IP multicast if there is “evidence” that a few nodes may be missing the same message
E.g. if two nodes ask for the same retransmission
Or if a retransmission request arrives from a very remote node (IP multicast doesn’t always work in WANs)
It also prioritizes recent messages over old ones
Reliability has a “bimodal” probability curve: either almost nobody gets a message or nearly everyone does
 
lpbcast variation
 
CS5412 Spring 2016
 
10
 
This variation modifies Bimodal Multicast so that a node no longer gossips with every other node in the system
It maintains a “peer overlay”: each member only
gossips with a smaller set of peers picked to be
reachable with low round-trip times, plus a second small
set of remote peers picked to ensure that the graph is
very highly connected and has a small diameter
Called a “small worlds” structure by Jon Kleinberg
Lpbcast is often faster, but equally reliable!
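A rough sketch of how such a peer set could be chosen: mostly nearby peers by round-trip time, plus a few random distant peers to keep the graph well connected with a small diameter (the counts and the RTT table are assumptions; lpbcast’s actual membership management is more elaborate):

```python
import random

def build_peer_set(me, rtts, n_near=8, n_far=3):
    """Pick low-RTT neighbors plus a few random remote peers (small-world style)."""
    others = [p for p in rtts if p != me]
    by_rtt = sorted(others, key=lambda p: rtts[p])
    near = by_rtt[:n_near]                          # fast, local gossip partners
    far_candidates = by_rtt[n_near:]
    far = random.sample(far_candidates, min(n_far, len(far_candidates)))
    return near + far                               # long links shrink the diameter

# hypothetical RTT table, in milliseconds
rtts = {f"node{i}": rtt
        for i, rtt in enumerate([0, 1, 2, 2, 3, 40, 45, 80, 90, 120, 150])}
print(build_peer_set("node0", rtts))
```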
 
Speculation... about speed
 
CS5412 Spring 2016
 
11
 
When we combine IP multicast with gossip we try to
match the tool we’re using with the need
 
Try to get the messages through fast...  but if loss
occurs, try to have a very predictable recovery cost
Gossip has a totally predictable worst-case load
This is appealing at large scales
 
How can we generalize this concept?
 
A thought question
 
What’s the best way to
Count the number of nodes in a system?
Compute the average load, or find the most loaded
nodes, or least loaded nodes?
 
Options to consider
Pure gossip solution
Construct an overlay tree (via “flooding”, like in our
consistent snapshot algorithm), then count nodes in the
tree, or pull the answer from the leaves to the root…
 
CS5412 Spring 2016
 
12
 
… and the answer is
 
Gossip isn’t very good for some of these tasks!
There are gossip solutions for counting nodes, but they
give approximate answers and run slowly
Tricky to compute something like an average because of the “re-counting” effect (best algorithm: Kempe et al)
On the other hand, gossip works well for finding the c most loaded or least loaded nodes (for constant c)

Gossip solutions will usually run in time O(log N) and generally give probabilistic solutions
 
CS5412 Spring 2016
 
13
 
Yet with flooding… easy!
 
Recall how flooding works
 
 
 
 
Basically: we construct a tree by pushing data towards
the leaves and linking a node to its parent when that
node first learns of the flood
Can do this with a fixed topology or in a gossip style
by picking random next hops
[Figure: a flood spreading from the root; each node is labeled with its distance from the root (1, 2, 2, 3, 3, 3)]
 
CS5412 Spring 2016
 
14
 
This is a “spanning tree”
 
Once we have a spanning tree
To count the nodes, just have leaves report 1 to their
parents and inner nodes count the values from their
children
To compute an average, have the leaves report their
value and the parent compute the sum, then divide by
the count of nodes
To find the least or most loaded node, inner nodes
compute a min or max…
Tree should have roughly log(N) depth, but once we
build it, we can reuse it for a while
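A sketch of this leaf-to-root aggregation, with the tree and loads held in local dictionaries for illustration (in the real protocol each step is a message from a child to its parent):

```python
def aggregate(tree, loads, node):
    """Return (node_count, load_sum, min_load, max_load) for the subtree rooted at node."""
    count, total = 1, loads[node]
    lo = hi = loads[node]
    for child in tree.get(node, []):        # leaves simply report their own values
        c, t, clo, chi = aggregate(tree, loads, child)
        count += c
        total += t
        lo, hi = min(lo, clo), max(hi, chi)
    return count, total, lo, hi

# a tiny example spanning tree rooted at "a"
tree = {"a": ["b", "c"], "b": ["d", "e"], "c": ["f"]}
loads = {"a": 0.9, "b": 2.1, "c": 0.4, "d": 3.3, "e": 1.0, "f": 2.7}
count, total, lo, hi = aggregate(tree, loads, "a")
print(f"nodes={count}  avg load={total / count:.2f}  min={lo}  max={hi}")
```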
 
CS5412 Spring 2016
 
15
 
Not all logs are identical!
 
When we say that a gossip protocol needs
time log(N) to run, we mean log(N) rounds
And a gossip protocol usually sends one message every
five seconds or so, hence with 100,000 nodes, 60 secs
But our spanning tree protocol is constructed using a
flooding algorithm that runs in a hurry
Log(N) depth, but each “hop” takes perhaps a
millisecond.
So with 100,000 nodes we have our tree in 12 ms and
answers in 24ms!
 
CS5412 Spring 2016
 
16
 
Insight?
 
Gossip has time complexity O(log N) but the
“constant” can be rather big (5000 times larger in
our example)
Spanning tree had same time complexity but a tiny
constant in front
 
But network load for spanning tree was much higher
In the last step, we may have reached roughly half the
nodes in the system
So 50,000 messages were sent all at the same time!
 
CS5412 Spring 2016
 
17
 
Gossip vs “Urgent”?
 
With gossip, we have a slow but steady story
We know the speed and the cost, and both are low
A constant, low-key, background cost
And gossip is also very robust
 
Urgent protocols (like our flooding protocol, or 2PC,
or reliable virtually synchronous multicast)
Are way faster
But produce load spikes
And may be fragile, prone to broadcast storms, etc
 
CS5412 Spring 2016
 
18
 
Introducing hierarchy
 
One issue with gossip is that the messages eventually fill up
With constant sized messages…
… and a constant rate of communication…
… we’ll inevitably reach the limit on how much information they can carry!

Can we introduce hierarchy into gossip systems?
 
CS5412 Spring 2016
 
19
 
Astrolabe
 
Intended as help for applications adrift in a sea of information
Structure emerges from a randomized gossip protocol
This approach is robust and scalable even under stress that cripples traditional systems
Developed at RNS, Cornell, by Robbert van Renesse, with many others helping…
Technology was adopted at Amazon.com (but they build their own solutions rather than using it in this form)
 
CS5412 Spring 2016
 
20
Astrolabe is a flexible monitoring overlay

Periodically, pull data from monitored systems. Each host holds a small table, one row per machine, for example:

swift.cs.cornell.edu's replica:
  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1971  1.5   1          0      4.1
  cardinal  2004  4.5   1          0      6.0

cardinal.cs.cornell.edu's replica:
  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2003  0.67  0          1      6.2
  falcon    1976  2.7   1          0      4.1
  cardinal  2201  3.5   1          1      6.0

CS5412 Spring 2016

21
 
Astrolabe in a single domain
 
Each node owns a single tuple, like the management
information base (MIB)
Nodes discover one-another through a simple
broadcast scheme (“anyone out there?”) and gossip
about membership
Nodes also keep replicas of one-another’s rows
Periodically (uniformly at random) merge your state with someone else’s…
 
CS5412 Spring 2016
 
22
 
State Merge: Core of Astrolabe epidemic

[Figure, slides 23-25: swift.cs.cornell.edu and cardinal.cs.cornell.edu gossip. swift’s fresh row (time 2011, load 2.0) flows to cardinal, and cardinal’s fresh row (time 2201, load 3.5) flows to swift; each node keeps, for every row, the copy with the larger timestamp, so the replicas converge on the freshest data.]

CS5412 Spring 2016

23-25
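A minimal sketch of the merge rule the animation illustrates, assuming each row carries the timestamp and load fields from the example tables above (the field names and Python representation are illustrative):

```python
def merge_tables(mine, theirs):
    """Epidemic state merge: per row (keyed by node name), keep the fresher copy."""
    merged = dict(mine)
    for name, row in theirs.items():
        if name not in merged or row["time"] > merged[name]["time"]:
            merged[name] = row
    return merged

swift_view = {
    "swift":    {"time": 2011, "load": 2.0},
    "falcon":   {"time": 1971, "load": 1.5},
    "cardinal": {"time": 2004, "load": 4.5},
}
cardinal_view = {
    "swift":    {"time": 2003, "load": 0.67},
    "falcon":   {"time": 1976, "load": 2.7},
    "cardinal": {"time": 2201, "load": 3.5},
}
# after a gossip exchange, both sides converge on the freshest copy of each row
print(merge_tables(swift_view, cardinal_view))
```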
 
Observations
 
Merge protocol has constant cost
One message sent, received (on avg) per unit time.
The data changes slowly, so no need to run it quickly –
we usually run it every five seconds or so
Information spreads in O(log N) time
But this assumes bounded region size
In Astrolabe, we limit them to 50-100 rows
 
CS5412 Spring 2016
 
26
 
Big systems…
 
 
A big system could have many regions
Looks like a pile of spreadsheets
A node only replicates data from its neighbors within its
own region
 
CS5412 Spring 2016
 
27
 
Scaling up… and up…
 
With a stack of domains, we don’t want every
system to “see” every domain
Cost would be huge
So instead, we’ll see a summary
 
[Figure: the stack of per-region tables as seen from cardinal.cs.cornell.edu, which keeps its own region in full and only a summary of the others]
 
CS5412 Spring 2016
 
28
Astrolabe builds a hierarchy using a P2P protocol that “assembles the puzzle” without any servers

[Figure: leaf tables for the San Francisco and New Jersey regions; an SQL query “summarizes” each region’s data into a single row of a parent table, and the dynamically changing query output is visible system-wide]

CS5412 Spring 2016

29
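A sketch of what such a summary computation might look like, collapsing a leaf region’s table into one parent-level row (the column names follow the example tables, the numbers are illustrative, and in Astrolabe the summary is actually expressed as an SQL aggregation query):

```python
def summarize_region(region_name, rows):
    """Collapse a region's leaf table into a single summary row for the level above."""
    avg_load = sum(r["load"] for r in rows.values()) / len(rows)
    # pick a contact for the service, e.g. the least-loaded node that runs SMTP
    smtp_nodes = [name for name, r in rows.items() if r["smtp"]]
    smtp_contact = min(smtp_nodes, key=lambda n: rows[n]["load"]) if smtp_nodes else None
    return {"name": region_name, "avg_load": round(avg_load, 2), "smtp_contact": smtp_contact}

san_francisco = {
    "gazelle": {"load": 1.7, "smtp": False},
    "zebra":   {"load": 3.2, "smtp": True},
    "gnu":     {"load": 0.5, "smtp": False},
}
print(summarize_region("SF", san_francisco))   # one row, gossiped at the next level up
```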
 
Large scale: “fake” regions
 
These are
Computed by queries that summarize a whole region as
a single row
Gossiped in a read-only manner within a leaf region
But who runs the gossip?
Each region elects “k” members to run gossip at the
next level up.
Can play with selection criteria and “k”
 
CS5412 Spring 2016
 
30
Hierarchy is virtual… data is replicated

[Figure, slide 31: a leaf node sees its own region’s table plus the inner-domain rows on the path to the root; falcon runs the level-2 epidemic for New Jersey and gnu runs it for San Francisco, because each has the lowest load in its region]

CS5412 Spring 2016

31

Hierarchy is virtual… data is replicated

[Figure, slide 32: a node in the other leaf domain sees a different leaf table but a consistent view of the inner domain]

CS5412 Spring 2016

32
 
Worst case load?
 
A small number of nodes end up participating in O(log_fanout N) epidemics
Here the fanout is something like 50
In each epidemic, a message is sent and received
roughly every 5 seconds
We limit message size so even during periods of
turbulence, no message can become huge.
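A quick worked calculation under assumed numbers (fanout 50 and one million nodes; neither figure is from the lecture) shows why this worst-case load stays small:

```python
import math

fanout, n = 50, 1_000_000          # illustrative numbers, not from the lecture
levels = math.log(n, fanout)       # depth of the region hierarchy
print(f"log_{fanout}({n:,}) = {levels:.1f} levels")
print(f"an elected node joins at most ~{math.ceil(levels)} epidemics, "
      f"so it sends and receives only a handful of messages every 5 seconds")
```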
 
CS5412 Spring 2016
 
33
 
Who uses Astrolabe?
 
Amazon doesn’t use Astrolabe in this identical form,
but they built gossip-based monitoring systems
based on the same ideas.
They deploy these in S3 and EC2: throughout their
big data centers!
For them, Astrolabe-like mechanisms track overall state
of their system to diagnose performance issues
They also automate reaction to temporary overloads
 
CS5412 Spring 2016
 
34
 
Example of overload handling
 
Some service S is getting slow…
Astrolabe triggers a “system wide warning”
Everyone sees the picture
“Oops, S is getting overloaded and slow!”
So everyone tries to reduce their frequency of requests
against service S
 
What about overload in Astrolabe itself?
Could everyone do a fair share of inner aggregation?
 
CS5412 Spring 2016
 
35
 
Idea that one company had
 
CS5412 Spring 2016
 
36
 
Start with Astrolabe approach
 
But instead of electing nodes to play inner roles, just
assign them roles, left to right
 
N-1 inner nodes, hence N-1 nodes play 2 aggregation roles and one lucky node just has one role
 
What impact will this have on Astrolabe?
CS5412 Spring 2016
37
World’s worst aggregation tree!

[Figure: leaves A through P; the left-to-right role assignment makes A, C, E, G, I, K, M, O the level-1 aggregators, B, F, J, N the level-2 aggregators, and D, L the level-3 aggregators. When an event e occurs at H, G gossips with H and learns e, but P only learns it O(N) time units later!]
 
CS5412 Spring 2016
 
38
 
What went wrong?
 
In this horrendous tree, each node has equal “work to do” but the information-space diameter is larger!
Astrolabe benefits from “instant” knowledge because the epidemic at each level is run by someone elected from the level below
 
CS5412 Spring 2016
 
39
 
Insight: Two kinds of shape
 
We’ve focused on the aggregation tree
But in fact we should also think about the information-flow tree
 
CS5412 Spring 2016
 
40
 
Information space perspective
 
Bad aggregation graph: diameter O(n)
Information flows along the chain H – G – E – F – B – A – C – D – L – K – I – J – N – M – O – P

Astrolabe version: diameter O(log(n))
 
Where will this approach go next?
 
CS5412 Spring 2016
 
41
 
Cornell research: Shared State Table (SST)
 
Idea is to use RDMA to implement a super-fast
shared state table for use in cloud data centers
 
Tradeoff: simplified model but WAY faster
 
SST details
 
CS5412 Spring 2016
 
42
 
Starts like Astrolabe with a table for a region
But no hierarchy: SST is like a “one level” Astrolabe
And the implementation doesn’t use gossip…

To understand SST, you need to know about RDMA
This is a new kind of hardware
It allows DMA transfers over the optical network from sender memory to receiver memory, at full network data rates, which can be 10Gbps or even 100Gbps
 
RDMA Concept
 
CS5412 Spring 2016
 
43
 
RDMA is a new standard for Reliable DMA

[Figure: a sender node and a receiver node, each with a registered memory region; bits are moved by the optical network directly from sender memory to receiver memory]
 
RDMA acronyms: Enliven a party…
 
CS5412 Spring 2016
 
44
 
RDMA: short for Remote Direct Memory Access. Offers two modes:
Block transfers (can be very large): binary from sender to destination; the sender specifies the source and the destination provides a receive buffer
One-sided operations: the receiver allows the sender to read from its memory without any software action on the receiver side; a one-sided write is the same idea, but allows writes

Three ways to obtain RDMA:
RDMA on Infiniband network hardware: popular in HPC
RDMA on fast Ethernet: also called RDMA/RoCE. Very new, and we need experience to know how well this will work; the issue centers on a new approach to flow control
RDMA in software: SoftRoCE, a fake RDMA for running RDMA code on old machines lacking RDMA hardware. Tunnels over TCP

IB Verbs: this is the name for the software library used to access RDMA
 
SST uses RDMA to replicate rows
 
CS5412 Spring 2016
 
45
 
With SST, when a node updates its row, RDMA is
used to copy the row to other nodes
 
Then the compute model offers predicates on the
table, aggregation, and upcalls to event handlers
As many as 1M or 2M, per node, per second
This is so much faster than Astrolabe that comparison is
kind of meaningless…  maybe 100,000x faster…
 
SST: Like Astrolabe, but one region, fast updates

[Figure: swift.cs.cornell.edu and cardinal.cs.cornell.edu each hold a replica of the same one-region table, as in the earlier Astrolabe example]
 
CS5412 Spring 2016
 
46
 
SST Predicate with Upcall
 
CS5412 Spring 2016
 
47
 
Basically, each program registers code like this:

  if (some-predicate-becomes-true)
  {
      do this “upcall” logic
  }

SST evaluates predicates and issues upcalls in parallel, on all machines, achieving incredible speed
After machine A changes some value, a predicate on machine B might fire with 1us or less delay…
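A toy, in-process sketch of this programming model (plain Python instead of RDMA; the class and method names are invented, and real SST evaluates predicates continuously on every machine rather than on each update):

```python
class MiniSST:
    """Toy version of the SST model: a replicated table plus (predicate, upcall) pairs."""
    def __init__(self, rows):
        self.rows = rows               # in real SST each row lives in RDMA-registered memory
        self.triggers = []

    def register(self, predicate, upcall):
        self.triggers.append((predicate, upcall))

    def update(self, node, column, value):
        self.rows[node][column] = value    # real SST: one-sided RDMA write to every replica
        self.evaluate()

    def evaluate(self):
        for predicate, upcall in self.triggers:
            if predicate(self.rows):
                upcall(self.rows)

sst = MiniSST({"swift": {"load": 2.0}, "falcon": {"load": 1.5}})
sst.register(lambda rows: any(r["load"] > 4.0 for r in rows.values()),
             lambda rows: print("upcall: some node is overloaded", rows))
sst.update("falcon", "load", 4.5)     # predicate becomes true -> upcall fires
```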
 
Use case comparison
 
Astrolabe:
Kind of slow
Used at large scale to monitor big data centers, for “data mining”
Uses a hierarchical structure to scale, and gossip as its communication protocol

SST:
Crazy fast
Used mostly at rack scale: maybe 64 or 128 nodes at a time
Has a flat structure, and uses RDMA to transfer rows directly at optical network speeds of 10 or even 100Gbps

48

CS5412 Spring 2016
 
Summary
 
First we saw a way of using Gossip in a reliable
multicast (although the reliability is probabilistic)
Astrolabe uses Gossip for aggregation
Pure gossip isn’t ideal for this… and competes poorly
with flooding and other urgent protocols
But Astrolabe introduces hierarchy and is an interesting
option that gets used in at least one real cloud platform
SST uses a similar model but
Implemented with one-sided RDMA reads, writes
Less scalable, but runs 100,000x faster…
 
CS5412 Spring 2016
 
49