Distributed Consensus and Coordination in Hardware: Birds of a Feather Session

 
Birds of a feather session at Middleware’18
 
Hosted by: Zsolt István and Marko Vukolić
 
Outline
 
Specialized hardware 101
Programmable Switches (P4)
Programmable NICs (ARMs)
Programmable NICs (FPGAs)
RDMA
Spectrum of accelerated solutions
Examples by scope
Examples by location
Discussion
 
 
2
 
P4
 
Language to express forwarding rules on switches (and more)
Flexibility: packet-forwarding policies as programs
Expressiveness: hardware-independent packet processing algorithms using general-purpose operations and table lookups
Resource mapping and management: compilers manage resource allocation and scheduling
Software engineering: type checking, information hiding, and software reuse
Decoupling hardware and software evolution: architecture independent, allowing separate hardware and software upgrade cycles
Debugging: software models of switch architectures
 
3
 
P4 Deployment
 
4
 
P4 Code Example
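To make the match-action idea from the previous slide concrete, here is a minimal Python sketch (illustrative only, not P4 and not tied to any particular switch): a longest-prefix-match forwarding table mapping destination prefixes to actions, which is the kind of table a P4 program declares and the control plane populates. The names (FORWARDING_TABLE, apply_table) are made up for this sketch.

    import ipaddress

    # Illustrative match-action table: longest-prefix match on the destination
    # address, similar to what a P4 table with an lpm match key expresses.
    FORWARDING_TABLE = {
        ipaddress.ip_network("10.0.0.0/8"):     ("forward", 1),  # (action, egress port)
        ipaddress.ip_network("10.1.0.0/16"):    ("forward", 2),
        ipaddress.ip_network("192.168.0.0/16"): ("forward", 3),
    }

    def apply_table(dst_ip: str):
        """Return the action for the longest matching prefix, or drop by default."""
        dst = ipaddress.ip_address(dst_ip)
        matches = [net for net in FORWARDING_TABLE if dst in net]
        if not matches:
            return ("drop", None)
        best = max(matches, key=lambda net: net.prefixlen)  # longest prefix wins
        return FORWARDING_TABLE[best]

    print(apply_table("10.1.2.3"))    # ('forward', 2)
    print(apply_table("172.16.0.1"))  # ('drop', None)

In an actual P4 program the table lookup runs in the switch pipeline at line rate, and only the table contents are installed at run time.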
 
5
 
SmartNIC (Arm)
 
Mellanox Bluefield
2x 25/100 Gbps NIC
Up to 16 Arm A72 cores
Up to 16GB Onboard DRAM
 
Arm cores can run commodity software
Best used to implement something like Open vSwitch

If compute-bound, it can’t keep up with the packets!
 
6
 
SmartNIC (FPGA)
 
Xilinx Alveo cards
2x 100 Gbps NIC
Up to 64GB Onboard DRAM
Up to 32MB Onchip BRAM
 
FPGA
Can guarantee line-rate performance by design
Breaks traditional software tradeoffs
 
7
 
Re-programmable Specialized Hardware

Field Programmable Gate Array (FPGA)
Free choice of architecture
Fine-grained pipelining, communication, distributed memory
Tradeoff: all “code” occupies chip space

[Figure: Op 1, Op 2, Op 3 laid out as circuits on the chip]

8
 
Programming FPGAs

Challenge: adapting algorithms to the parallelism of the FPGA

Coding: hardware definition languages, high-level languages
Synthesis: produce a logic-gate level representation (any FPGA)
Place & route: circuit that gets mapped onto a specific FPGA

[Figure: Code → Synthesized Circuit → Placed & Routed]

9
 
FPGA Benefits and Drawbacks

Massive parallelism – both pipeline and data-parallel execution
Arithmetic operations boosted by DSPs
Compute & data close together thanks to BRAM

Can’t “page” code in or out
Problem if the algorithm’s core state doesn’t fit in BRAM

10x less power efficient than ASICs, >10x more power efficient than CPUs

10
 
 
RDMA

11
 
Hardware summary
 
Programmable Switches (P4)
Use forwarding tables
Guarantee line-rate processing
Very high bandwidth
Limited state on device, limited complexity code (e.g. branches, loops)
Programmable NICs (ARMs)
Arbitrary processing
Can’t guarantee line-rate processing
Lower bandwidths
Programmable NICs / Switches (FPGAs)
Arbitrary processing*, supports complex state on device
Can guarantee line-rate processing
High bandwidths
RDMA NICs
No processing, only data manipulation with low latency
Low latency by removing OS overhead
 
12
 
Hardware landscape
 
P4 adoption by Chinese companies
https://www.sdxcentral.com/articles/news/barefoot-scores-tofino-deals-with-alibaba-baidu-and-tencent/2017/05/
Smart NICs
E.g., Mellanox
Microsoft Catapult
FPGAs in the cloud
Amazon, Baidu
RDMA support in Azure
https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-hpc?toc=%2Fazure%2Fvirtual-machines%2Fwindows%2Ftoc.json
 
 
13
14

Consensus in Hardware

Tight integration with network (latency)
Low latency decision making (latency)
Pipelining (throughput)
Sequencing/reliability in network
Ordered, reliable channels

[Figure: Zab-style message flow – a Write reaches the Leader, which sends Propose to the Followers, collects Ack. messages, and sends Commit]

Protocol described in: F. P. Junqueira, B. C. Reed, et al. Zab: High-performance broadcast for primary-backup systems. In DSN’11.
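To make the message flow on this slide concrete, here is a toy Python model of the Zab-style broadcast round (assuming the ordered, reliable channels the slide assumes): the leader logs a write, proposes it to the followers, and commits once a majority of the ensemble, counting itself, has acknowledged. This is plain software pseudologic, not any of the hardware designs that follow, and all class and method names are made up for illustration.

    from dataclasses import dataclass, field

    @dataclass
    class Follower:
        log: list = field(default_factory=list)

        def on_propose(self, seq, value):
            self.log.append((seq, value))   # append to the ordered log
            return ("ack", seq)             # acknowledge to the leader

        def on_commit(self, seq):
            pass                            # apply entry `seq` to local state

    @dataclass
    class Leader:
        followers: list
        next_seq: int = 0
        log: list = field(default_factory=list)

        def on_write(self, value):
            seq = self.next_seq
            self.next_seq += 1
            self.log.append((seq, value))
            acks = 1                        # the leader counts as one acknowledgement
            for f in self.followers:
                msg, _ = f.on_propose(seq, value)
                if msg == "ack":
                    acks += 1
            # Commit once a majority of the ensemble has the entry.
            if acks > (len(self.followers) + 1) // 2:
                for f in self.followers:
                    f.on_commit(seq)
                return ("committed", seq)
            return ("pending", seq)

    leader = Leader(followers=[Follower(), Follower()])
    print(leader.on_write("x=1"))   # ('committed', 0)

In the accelerated systems discussed next, the same propose/ack/commit decisions are taken in NICs, FPGAs, or switch pipelines instead of in host software.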
Scope of Acceleration (NICs)
Full Protocol
DARE [1]
FaRM [2]
Common-case
APUS [3]
Operations
Mellanox Fabric Collective Accelerator (FCA)
15
[1] Poke, Marius, and Torsten Hoefler. "DARE: High-performance state machine replication on RDMA networks." Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2015.
[2] Dragojević, Aleksandar, et al. "FaRM: Fast remote memory." 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14). 2014.
[3] Wang, Cheng, et al. "APUS: Fast and scalable Paxos on RDMA." Proceedings of the 2017 Symposium on Cloud Computing. ACM, 2017.
 
Scope of Acceleration
 
Full Protocol
Consensus in a Box [1] – FPGA
NetChain [2] – P4 Switch
 
Common-case
P4Paxos [3] (NetPaxos [4]) – P4 Switch
 
Operations
SpecPaxos [5] – OpenFlow Switch
 
16
 
[1] István, Zsolt, et al. "Consensus in a Box: Inexpensive Coordination in Hardware." NSDI. 2016.
[2] Jin, Xin, et al. "NetChain: Scale-Free Sub-RTT Coordination." NSDI. 2018.
[3] Dang, Huynh Tu, et al. "Paxos made switch-y." ACM SIGCOMM Computer Communication Review 46.2 (2016): 18-24.
[4] Dang, Huynh Tu, et al. "NetPaxos: Consensus at network speed." Proceedings of the 1st ACM SIGCOMM Symposium on SDN Research (SOSR). ACM, 2015.
[5] Ports, Dan R. K., et al. "Designing Distributed Systems Using Approximate Synchrony in Data Center Networks." NSDI. 2015.
 
Consensus in a Box (Caribou)

Software clients (>10 machines simulating 1000s of clients)
Binary protocol, but can be used as a drop-in replacement for SW key-value stores (e.g. Memcached)
Client-facing and inter-node traffic: 10Gbps TCP
<10μs consensus latency, >1M consensus rounds/s

[Figure: node architecture – FPGA with 10Gbps Ethernet, 8GB DDR3 memory, and extension to e.g. SATA/NVMe]

17
 
NetChain
 
Implements KVS in switches
Meta-data store, coordination
“HalfRTT” because no need to reach another end-host

μs replication (strong consistency)
>100Gbps bandwidth
 
Limitations on key/value sizes
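As background on the replication pattern NetChain builds on, here is a minimal Python sketch of classic chain replication: writes propagate from head to tail and are acknowledged at the tail, reads are served by the tail. It is a toy software model of the general technique, not NetChain's switch data plane, and the names (ChainNode, the example key) are made up.

    class ChainNode:
        """One replica in a chain-replicated key-value store (classic chain
        replication, not NetChain's in-switch variant)."""

        def __init__(self, successor=None):
            self.store = {}
            self.successor = successor

        def write(self, key, value):
            # Writes enter at the head and propagate down the chain in order.
            self.store[key] = value
            if self.successor is not None:
                return self.successor.write(key, value)
            return "ack"            # the tail acknowledges the fully replicated write

        def read(self, key):
            # Reads are served by the tail, which only holds committed writes.
            return self.store.get(key)

    # Build a 3-node chain: head -> middle -> tail.
    tail = ChainNode()
    middle = ChainNode(successor=tail)
    head = ChainNode(successor=middle)

    print(head.write("lock:users", "owner=42"))  # 'ack' (propagated to the tail)
    print(tail.read("lock:users"))               # 'owner=42'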
 
18
 
Paxos Made Switchy
 
Implements the Coordinator and Acceptor roles in a P4 switch

Reconfiguration and recovery, as well as management, are external

Reduces latency and cost on end-hosts
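The acceptor logic being offloaded is small, which is part of why it fits in a switch pipeline. As a point of reference, here is a textbook single-decree Paxos acceptor in Python; this is a generic sketch, not the paper's P4 code, and the class and method names are made up.

    class PaxosAcceptor:
        """Textbook single-decree Paxos acceptor: remembers the highest ballot it
        has promised and the last value it accepted."""

        def __init__(self):
            self.promised_ballot = -1
            self.accepted_ballot = -1
            self.accepted_value = None

        def on_prepare(self, ballot):
            # Phase 1b: promise to ignore lower ballots; report any prior accept.
            if ballot > self.promised_ballot:
                self.promised_ballot = ballot
                return ("promise", ballot, self.accepted_ballot, self.accepted_value)
            return ("nack", self.promised_ballot)

        def on_accept(self, ballot, value):
            # Phase 2b: accept if the ballot is at least as high as the promise.
            if ballot >= self.promised_ballot:
                self.promised_ballot = ballot
                self.accepted_ballot = ballot
                self.accepted_value = value
                return ("accepted", ballot)
            return ("nack", self.promised_ballot)

    acc = PaxosAcceptor()
    print(acc.on_prepare(1))        # ('promise', 1, -1, None)
    print(acc.on_accept(1, "x=1"))  # ('accepted', 1)
    print(acc.on_prepare(0))        # ('nack', 1)

Because the per-instance state is just a few counters and a value, it fits the limited on-device state that the hardware summary slide notes for programmable switches.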
 
19
 
Speculative Paxos
 
20
 
Scope of Acceleration – Gains
 
Full Protocol
Rely on tight integration of different layers to deliver high throughput/low latency
Specializing the processing to the protocol
 
Common-case
Benefit from cheaper processing in the best case, less egress on the end-host
Detect when we are not in the best case and fall back (see the sketch after this list)
Uses less state on the device than performing the entire protocol
 
 
Operations
Allow the end hosts to push simple “tasks” of some domain into the network
Generate packets, gain from reducing data movement on egress link
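A minimal sketch of the common-case pattern from the list above: the device handles an operation only while its best-case assumption holds and otherwise punts to the software slow path. The in-order sequence-number check stands in for whatever best-case condition a real design would test, and all names are made up for this sketch.

    class CommonCaseOffload:
        """Toy model of common-case acceleration: the device handles the fast path
        and punts to the software slow path when its assumption breaks."""

        def __init__(self, software_slow_path):
            self.software_slow_path = software_slow_path
            self.expected_seq = 0

        def on_message(self, seq, payload):
            if seq == self.expected_seq:          # best case: in-order message
                self.expected_seq += 1
                return ("fast-path", payload)     # handled entirely on the device
            # Not in the best case (gap or reordering): fall back to software,
            # which holds the full protocol state and recovery logic.
            return self.software_slow_path(seq, payload)

    def slow_path(seq, payload):
        return ("slow-path", seq, payload)

    dev = CommonCaseOffload(slow_path)
    print(dev.on_message(0, "a"))   # ('fast-path', 'a')
    print(dev.on_message(2, "c"))   # ('slow-path', 2, 'c')  out of order, punted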
 
21
 
Integration of Acceleration
 
End hosts (DARE, FaRM, Caribou)
Easiest integration
Most control
 
Split (Paxos Made Switchy, Eris [1])
Integration more complex
Less control
 
Switch/Middlebox (NetChain)
Packaged as “service”
Independently controlled
 
22
 
[1] Li, Jialin, Ellis Michael, and Dan R. K. Ports. "Eris: Coordination-free consistent transactions using in-network concurrency control." Proceedings of the 26th Symposium on Operating Systems Principles. ACM, 2017.
 
Coordinating control plane ops.
 
A special type of application – updating changes in the SDN controller, detecting errors, etc.
Low latency operation required
Strongly consistent view
 
 
 
Molero, Edgar Costa, Stefano Vissicchio, and Laurent Vanbever. "Hardware-Accelerated Network Control Planes." Proceedings of the 17th ACM Workshop on Hot Topics in Networks. ACM, 2018.
Schiff, Liron, Stefan Schmid, and Petr Kuznetsov. "In-band synchronization for distributed SDN control planes." ACM SIGCOMM Computer Communication Review 46.1 (2016): 37-43.
 
23
 
Application scenarios
 
Replicated KVS
Maintain consistent view across replicas
Cheaper consensus → switch to strong consistency instead of eventual
Both throughput and latency are important
Could offload at NIC or Switch
 
Part of a larger application
OLTP Database transactions – lock management
Not necessarily KV pairs, could be a tree
Many concurrent operations – not locking the actual data – throughput and latency both important
Could be done as offload or as independent service
 
Targeting Distributed Ledgers
Each node (many) takes part in consensus
Operations on top can be expensive (crypto) – unclear how much it is worth optimizing the consensus layer for throughput or latency…
In non-geo-replicated scenarios coordination should become the bottleneck
Could be done as offload or as independent service
 
 
 
 
 
24
 
Application spectrum
 
25
 
1ms app time / coordination op
Distributed ledgers (core ordering)
Machine learning frameworks (parameter server)

100μs app time / coordination op
Relational database engine (lock management)
Some HPC workloads (MPI barriers)

<10μs app time / coordination op
NoSQL database engines (distributed transactions, replication)
Metadata stores (replication)
SDN control plane management (update propagation)
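A rough back-of-the-envelope of why these bands matter, assuming a worker that alternates application work with one blocking coordination operation (no pipelining or batching): the less application time per operation, the more an accelerated coordination layer shows up in end-to-end throughput. The 100μs software baseline is an assumed figure for illustration; the 10μs accelerated figure mirrors the Caribou slide.

    def ops_per_second(app_time_s, coord_latency_s):
        """Throughput of one worker that alternates app work and a blocking
        coordination operation (a simplification: no pipelining or batching)."""
        return 1.0 / (app_time_s + coord_latency_s)

    for app_us in (1000, 100, 10):                  # the rows of the slide above
        sw = ops_per_second(app_us * 1e-6, 100e-6)  # ~100us software consensus (assumed)
        hw = ops_per_second(app_us * 1e-6, 10e-6)   # ~10us accelerated consensus
        print(f"{app_us:>5} us app time: {sw:8.0f} -> {hw:8.0f} ops/s ({hw / sw:.1f}x)")

For 1ms of app time the speedup is roughly 1.1x; at 10μs of app time it is about 5.5x, which is why the lower bands are the natural targets for acceleration.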
 
 
 
Question 1: What about Geo-distribution?
 
Intuitively hardware acceleration is not useful in this scenario
 
Or: Can hardware make a difference in keeping algorithms in the best case and reducing the cost of reconfig/recovery?
Or: …?
 
26
 
Question 1: Geo-Distribution
 
[Figure: deployment-scale spectrum – Datacenter, City-scale, Continent-scale, World-scale – with the papers discussed marked on it]

27
 
Question 2: What about BFT?

BFT involves more computation → less amenable to low-level HW optimization
Could we use hardware to keep the algorithm in the best case? Anything more?
Could we use some “certification” of hardware to relax assumptions? Etc.
 
28
 
Question 3: TPUT vs. Latency?
 
29
 
What ranges are of interest?
What combinations are of interest?
Is the gain a linear or step function?
 
Question 3: TPUT vs. Latency

+ Additional requirement?

[Figure: throughput (<1k, 10k, >1M rounds/s) vs. latency (ms to μs), with regions marked for Traditional Software Solutions and Most accelerated work]

30
 
Question 4: What about programmability?

If we had a Paxos ASIC, would that be useful?
Are algorithms still changing, or can we use common building blocks?
 
31
 
Temperature check
 
Do we feel that there is more to achieve in this space?
 
 
 
Which direction should we be looking at?
 
32
 
9th Workshop on Systems for Multi-core and Heterogeneous Architectures (SFMA 2019)
 
https://sites.google.com/site/sfma2019eurosys/
 
Researchers from operating systems, language runtime, virtual machine and architecture communities
Focuses on system building experiences with the new generations of parallel and heterogeneous hardware
No proceedings!
Important Dates:
Submission: January 17, 23:55 (GMT)
Acceptance: February 10th
 
33