Impact of Supporting Out-of-order Communication on In-order Performance

Analyzing the Impact of Supporting Out-of-order Communication on In-order Performance with iWARP
P. Balaji, W. Feng, S. Bhagvat, D. K. Panda, R. Thakur and W. Gropp
Mathematics and Computer Science, Argonne National Laboratory
Department of Computer Science, Virginia Tech
Scalable Systems Group, Dell Inc.
Computer Science and Engineering, Ohio State University
Computer Science, University of Illinois at Urbana-Champaign
Motivation
High-end computing systems growing rapidly in scale
128K processor system at LLNL (HPC CPU growth of 50%)
1M processor systems as soon as next year
Network subsystem has to scale accordingly
Fault-tolerance and hot-spot avoidance important
Possible Solution: Multi-pathing
Supported by many networks
InfiniBand uses subnet management to discover paths
10-Gigabit Ethernet uses VLAN based multi-pathing
Disadvantage: Out-of-order Communication!
Out-of-order Communication
Different packets taking different paths means that packets injected
later might arrive earlier
Physical networks only forward packets, possibly out of order
Protocols on top of networks (either in hardware or software)
have to deal with reordering packets
Networks such as IB handle this by dropping out-of-order packets
FECN, BECN and throttling on congestion
Network buffering (with FECN/BECN) helps, but not perfect
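
The reordering effect described on this slide can be sketched with a toy model. This is only an illustration under stated assumptions: packet i is injected at time i, paths are used round-robin, and each path has a fixed delay. The function names and delay values are hypothetical and do not come from the talk.

```python
def arrival_order(num_packets, path_delays):
    """Return sequence numbers in the order they reach the receiver.

    Assumed model: packet `seq` is injected at time `seq` and travels on
    path `seq % len(path_delays)`, which adds a fixed delay.
    """
    arrivals = []
    for seq in range(num_packets):
        delay = path_delays[seq % len(path_delays)]
        arrivals.append((seq + delay, seq))
    arrivals.sort()  # order by arrival time
    return [seq for _, seq in arrivals]


def count_late(order):
    """Count packets that arrive after a packet with a higher sequence number."""
    highest_seen = -1
    late = 0
    for seq in order:
        if seq < highest_seen:
            late += 1
        highest_seen = max(highest_seen, seq)
    return late


single_path = arrival_order(8, [5])      # one path: order is preserved
two_paths = arrival_order(8, [5, 1])     # slow path and fast path alternate
print(single_path, count_late(single_path))
print(two_paths, count_late(two_paths))
```

With a single path the packets arrive in injection order; with one slow and one fast path, packets injected later on the fast path overtake earlier ones, which is the case the protocol above the network has to handle.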
Overview of iWARP over Ethernet
Relatively new initiative by
IETF and RDMAC
Backward compatibility with
TCP/IP/Ethernet
Sender stuffs iWARP packets
within TCP/IP packets
When sent, one TCP packet
contains one iWARP packet
What about on receive?
[Figure: iWARP protocol stack. Application; Sockets, SDP, MPI etc.; RDMAP Verbs; RDDP; MPA; software TCP/IP or offloaded TCP/IP; 10-Gigabit Ethernet]
Ethernet Packet Segmentation
[Figure: a stream of packets, each with a packet header, an iWARP header and a data payload, before and after intermediate switch segmentation. After segmentation some packets carry only a partial payload without an iWARP header; when a packet is delayed, the iWARP header cannot be identified in the out-of-order packets.]
Intermediate switch segmentation
Packets split or coalesced
Current iWARP implementations do not handle out-of-order packets
Follow approaches used by IB
Problem Statement
How do we design a feature-complete iWARP stack?
Provide support for packets that arrive out-of-order
Maintain the performance of in-order communication
What are the tradeoffs in designing iWARP?
Host-based iWARP
Host-offloaded iWARP
Host-assisted iWARP
Presentation Layout
Introduction and Motivation
Details of the iWARP Standard
Design Choices for iWARP
Experimental Evaluation
Concluding Remarks and Future Work
Dealing with Out-of-order packets in iWARP
iWARP specifies intelligent approaches to deal with
out-of-order packets
Out-of-order data placement and In-order data
delivery
If packets arrive out-of-order, they are directly placed
in the appropriate location in memory
Application notified about the arrival of the message only
when:
All packets of the message have arrived
All previous messages have arrived
It is necessary that iWARP recognize all packets!
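
The placement and delivery rule above can be sketched as a small receiver. This is a minimal illustration, assuming every segment carries a message id and a byte offset and that no segment is duplicated; the `Receiver` class and its method names are hypothetical, not part of any iWARP API.

```python
class Receiver:
    """Out-of-order data placement with in-order delivery (illustrative only)."""

    def __init__(self):
        self.buffers = {}         # msg_id -> bytearray the payload is placed into
        self.missing = {}         # msg_id -> number of bytes not yet placed
        self.next_to_deliver = 0  # lowest msg_id not yet reported to the application
        self.delivered = []       # msg_ids in the order the application sees them

    def post_buffer(self, msg_id, length):
        self.buffers[msg_id] = bytearray(length)
        self.missing[msg_id] = length

    def on_segment(self, msg_id, offset, payload):
        # Placement: write straight to the final location, whatever the arrival order.
        self.buffers[msg_id][offset:offset + len(payload)] = payload
        self.missing[msg_id] -= len(payload)
        # Delivery: notify only when this message and all earlier ones are complete.
        while self.missing.get(self.next_to_deliver) == 0:
            self.delivered.append(self.next_to_deliver)
            self.next_to_deliver += 1


rx = Receiver()
rx.post_buffer(0, 8)
rx.post_buffer(1, 4)
rx.on_segment(1, 0, b"WXYZ")   # message 1 is complete, message 0 is not
rx.on_segment(0, 4, b"EFGH")   # second half of message 0 arrives first
print(rx.delivered)            # []
rx.on_segment(0, 0, b"ABCD")   # message 0 completes, both are delivered
print(rx.delivered)            # [0, 1]
```

Note that the sketch only works because each segment identifies itself (message id and offset), which is why the receiver has to be able to recognize every packet.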
MPA Protocol Frame
[Figure: MPA frame layout with the fields Segment Length, DDP Header, Payload (if any), Marker, Pad and CRC]
Deterministic approach to identify packet header
Can distinguish in-order packets from out-of-order
packets
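
How a marker can lead a receiver to a packet header may be sketched as follows. This is a simplified model under stated assumptions, not the frame format of the standard: a 4-byte marker is placed at every 512-byte position of the byte stream and stores the distance back to the start of the current frame (0 means a frame starts right after the marker); a frame is a 2-byte length, the payload, padding to a 4-byte boundary and a 4-byte CRC. `zlib.crc32` stands in for the CRC and may not use the polynomial the standard requires. All names are hypothetical.

```python
import struct
import zlib

MARKER_SPACING = 512  # assumed marker interval, in bytes of stream position
MARKER_SIZE = 4


def build_frame(payload):
    """Length (2 bytes) + payload + pad to a 4-byte boundary + CRC (4 bytes)."""
    body = struct.pack("!H", len(payload)) + payload
    body += b"\x00" * (-len(body) % 4)
    return body + struct.pack("!I", zlib.crc32(body))


def build_stream(payloads):
    """Concatenate frames, inserting a marker at every MARKER_SPACING position."""
    stream = bytearray()
    frame_starts = []
    for payload in payloads:
        frame_start = None
        for byte in build_frame(payload):
            if len(stream) % MARKER_SPACING == 0:
                back = 0 if frame_start is None else len(stream) - frame_start
                stream += struct.pack("!HH", 0, back)
            if frame_start is None:
                frame_start = len(stream)
                frame_starts.append(frame_start)
            stream.append(byte)
    return bytes(stream), frame_starts


def locate_header(stream, offset):
    """From any stream offset, use the next marker to find a frame header."""
    marker_pos = -(-offset // MARKER_SPACING) * MARKER_SPACING  # round up
    _, back = struct.unpack_from("!HH", stream, marker_pos)
    return marker_pos + MARKER_SIZE if back == 0 else marker_pos - back


def read_payload(stream, start):
    """Read the frame whose header is at `start`, skipping markers, and check the CRC."""
    raw = bytearray()
    pos = start

    def take(n):
        nonlocal pos
        for _ in range(n):
            if pos % MARKER_SPACING == 0:
                pos += MARKER_SIZE  # skip a marker embedded in the frame
            raw.append(stream[pos])
            pos += 1

    take(2)
    (length,) = struct.unpack("!H", bytes(raw))
    take(length + (-(2 + length) % 4))  # payload and pad
    body = bytes(raw)
    take(4)
    (crc,) = struct.unpack("!I", bytes(raw[-4:]))
    if crc != zlib.crc32(body):
        raise ValueError("CRC mismatch")
    return body[2:2 + length]


payloads = [bytes([i]) * 300 for i in range(1, 5)]
stream, frame_starts = build_stream(payloads)
header = locate_header(stream, 400)  # data picked up mid-stream
print(header in frame_starts, read_payload(stream, header) == payloads[1])
```

Because marker positions depend only on the stream position, a receiver that knows where in the stream a packet belongs can find a marker without having seen the preceding packets, and from the marker it can find a header. The byte-by-byte copy in `build_stream` also hints at why inserting markers in software costs an extra copy.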
iWARP components
iWARP consists of three layers
RDMAP: Thin layer that deals with interfacing upper
layers with iWARP
RDDP: Core of the iWARP stack
Component 1: Deals with connection management issues and
packet de-multiplexing between connections
MPA: Glue layer to deal with backward compatibility with
TCP/IP
Component 2: Performs CRC
Component 3: Adds marker strips of data to point to the
packet header
Component Onload vs. Offload
Connection Management and Packet Demultiplexing
Connection lookup and book-keeping --> CPU intensive
Can be done efficiently on hardware
Data Integrity: CRC-32
CPU intensive
Can be done efficiently on hardware
Marker Strips:
Tricky as they need to be inserted in between the data
Software implementation requires an extra copy
Hardware implementation might require multiple DMAs
Task distribution for different iWARP designs
[Figure: placement of RDMAP, RDDP, CRC, Markers and TCP/IP on the HOST or the NIC for the host-based, host-offloaded and host-assisted designs]
Host-based and -offloaded Designs
Host-based iWARP: Completely in software
Deals with overheads for all components
Host-offloaded iWARP: Completely in hardware
Good for packet demultiplexing and CRC
Is it good for inserting marker strips?
Ideal: True Scatter/Gather DMA engine. Not available.
Contiguous DMA and Decoupled Marker Insertion
Large chunks DMAed and moved on the NIC to insert markers
A lot of NIC memory transactions
Scatter/Gather DMA with Coupled Marker Insertion
Small chunks DMAed non-contiguously
A lot of DMA operations
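
The trade-off between the two NIC-side approaches can be put into rough numbers. This is a back-of-the-envelope cost model, not a measurement from the talk: the 512-byte marker spacing and 4-byte marker size are assumptions, the contiguous case is modeled as moving every payload byte once on the NIC, and `dma_plan` is a hypothetical helper.

```python
import math


def dma_plan(message_bytes, marker_spacing=512, marker_size=4):
    """Compare the two NIC-side marker insertion strategies for one message."""
    data_per_interval = marker_spacing - marker_size  # payload between two markers
    chunks = math.ceil(message_bytes / data_per_interval)
    return {
        # Contiguous DMA: one large transfer, then the NIC moves the data
        # in its own memory to open gaps for the markers.
        "contiguous": {"dma_ops": 1, "nic_bytes_moved": message_bytes},
        # Scatter/gather DMA: one small transfer per run of data between
        # markers, placed directly around the markers.
        "scatter_gather": {"dma_ops": chunks, "nic_bytes_moved": 0},
    }


for size in (4096, 65536):
    plan = dma_plan(size)
    print(size, plan["contiguous"], plan["scatter_gather"])
```

Under these assumptions a 4KB message needs 9 small DMA operations with scatter/gather and a 64KB message needs 130, while the contiguous approach issues a single DMA but pays with memory traffic on the NIC.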
Hybrid Host-assisted Implementation
Performs tasks such as:
packet demultiplexing and CRC in hardware
marker insertion in software (requires an extra copy)
Fully utilizes both the host and the NIC
Summary:
Host-based design suffers from software overheads for
all tasks
Host-offloaded design suffers from the overhead of
multiple DMA operations
Host-assisted design suffers from the extra memory copy
to add the markers but benefits from fewer DMAs
Experimental Test bed
4-node cluster
2 Intel Xeon 3.0GHz processors with 533MHz FSB, 2GB
266-MHz DDR SDRAM and 133 MHz PCI-X slots
Chelsio T110 10GE TCP Offload Engines
12-port Fujitsu XG800 switch
Red Hat operating system (kernel 2.4.22smp)
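iWARP Microbenchmarks
[Figures: iWARP latency (us) and iWARP bandwidth (Mbps) against message size (bytes), with CPU utilization (%), for host-offloaded, host-based and host-assisted iWARP]
Out-of-cache Communication
[Figure: iWARP bandwidth (Mbps) against message size (bytes) for host-offloaded, host-based and host-assisted iWARP]
Computation Communication Overlap
[Figures: bandwidth (Mbps) against computation time (us) for message sizes 4KB and 128KB, for host-offloaded, host-based and host-assisted iWARP]
Iso-surface Visual rendering application
[Figures: execution time (secs) against dataset dimensions (1024x1024 to 8192x8192) for data distribution sizes 8KB and 1MB, for host-offloaded, host-based and host-assisted iWARP]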
Concluding Remarks
With growing scales of high-end computing systems,
network infrastructure has to scale as well
Issues such as fault tolerance and hot-spot avoidance
play an important role
While multi-path communication can help with these
problems, it introduces out-of-order communication
We presented three designs of iWARP that deal with
out-of-order communication
Each design has its pros and cons
No single design could achieve the best performance in
all cases
Thank You
Email Contacts:
P. Balaji: balaji@mcs.anl.gov
W. Feng: feng@cs.vt.edu
S. Bhagvat: sitha_bhagvat@dell.com
D. K. Panda: panda@cse.ohio-state.edu
R. Thakur: thakur@mcs.anl.gov
W. Gropp: wgropp@uiuc.edu
Backup Slides
 
[Figures: state-machine diagrams with the states IDLE, READY, DMA BUSY, SDMA, COPY PARTIAL SEGMENT, INSERT MARKERS, Calculate CRC and SEND, and transitions labeled Send Request, Host DMA Free, Host DMA Busy, Host DMA In Use, SDMA Done, Segment Available, Marker Inserted, Segment Complete and Segment Not Complete]
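iWARP Out-of-Cache Communication Bandwidth
[Figures: ratio of cache to network traffic against message size (bytes) on the transmit side and on the receive side, for host-offloaded, host-based and host-assisted iWARP]
Impact of marker separation on iWARP performance
[Figures: host-offloaded iWARP latency (us) and NIC-offloaded iWARP bandwidth (Mbps) against message size (bytes) for iWARP (original), iWARP (1KB marker separation), iWARP (2KB marker separation) and iWARP (no markers)]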

This study delves into the impact of out-of-order communication on in-order performance in high-end computing systems, specifically investigating the potential solutions, challenges, and network handling strategies associated with this issue. The research focuses on the implications of supporting out-of-order communication using iWARP over Ethernet, highlighting the complexities of packet segmentation, reordering, and network buffering. The discussion also explores the relevance of multi-pathing in scaling network subsystems to meet the growing demands of large-scale computing environments.

  • Computing Systems
  • High Performance Computing
  • Network Subsystems
  • iWARP
  • Packet Segmentation

Uploaded on Sep 30, 2024 | 0 Views



