P4 Programmable Traffic Management
Stephen Ibanez & Gordon Brebner
9/12/2018
Goals
- Motivating discussions
- Traffic manager architectural model overview
- Putative P4 extensions
- Moving-forward discussions
What is Traffic Management?
- Packet scheduling: in what order?
- Traffic shaping: at what rate / pacing?
- Policing: is the packet compliant?
- Drop policy: how to avoid and deal with congestion?
- Packet buffering: how to store packets?
- Packet replication: how to replicate packets?
- Classification: how to map packets into queues?
Where are traffic managers used?
(Diagrams: in an integrated switch (PISA), the traffic manager sits between the ingress pipeline (parser, match/action, deparser) and the egress pipeline (parser, match/action, deparser). In a NIC or line card, a traffic manager sits in each direction between the Ethernet ports and the host PCI Express (NIC) or switch fabric (line card).)
Why should we care about Traffic Management?
- Lots of different types of traffic with different characteristics and requirements:
  - Characteristics: burstiness, packet sizes, flow sizes, flow rates
  - Requirements: throughput, latency, loss, jitter, reordering, flow completion time, tail latency
- Network operators have a wide range of objectives: meet all SLAs, maximize network utilization, achieve fairness
- More complicated ASICs, network devices picking up more functionality, more network programmability, and more types of traffic all mean more TM requirements. Embrace programmability!
- About 50% of a modern programmable switch chip is dedicated to buffering and traffic management logic, yet it is not programmable!
Why should we care about Traffic Management?
- WAN links are expensive: we want to make the best use of them by prioritizing traffic
- Modern congestion control algorithms rely on accurate packet pacing (e.g. BBR [1], TIMELY [2])
- Performance isolation for thousands of VMs (millions of flows) per server [3]

[1] Cardwell, Neal, et al. "BBR: Congestion-based congestion control." Queue 14.5 (2016): 50.
[2] Mittal, Radhika, et al. "TIMELY: RTT-based congestion control for the datacenter." ACM SIGCOMM 2015.
[3] Saeed, Ahmed, et al. "Carousel: Scalable traffic shaping at end hosts." ACM SIGCOMM 2017.
Benefits of Programmable Traffic Management
- Benefits of an unambiguous way to define TM algorithms:
  - Portability
  - Formal verification and static analysis
  - A precise way to express customer needs
- Benefits of having programmable TM:
  - Innovation, differentiation, reliability
  - Network operators can fine-tune for performance
  - Today there is only a small menu of algorithms to choose from; many more possible algorithms could be expressed
Existing TM algorithms are too limiting
TM algorithm parameters can be derived from:
- Network operator static configurations
- Per-packet information
- Per-flow or per-class information (i.e. state stored across packets)
- Current queue state
- Aggregated queue state (e.g. average sizes, average service rates)
- Other state that evolves over time
- Randomness
- Combinations of any of the above parameters:
  - Combined arithmetically
  - Combined temporally (i.e. alternate between various policies)
  - Combined hierarchically
Generic Traffic Manager Architecture
(Diagram: packets flow through classification, policing, and drop policy logic into a non-P4-programmable buffer and packet replication engine (PRE), then through scheduling and shaping logic to a crossbar.)
- Output queued
- Each egress port may have many associated queues
- A NIC may not need a crossbar or PRE
Proposed TM Architecture Model
(Diagram: ingress logic performs classification, policing, and drop policy on each input packet; the packet buffer is divided into queues 1..N; programmable scheduling/shaping logic selects the output packet; a state extern interface exposes packet buffer state.)
Ingress Logic
- Match/action processing
- Classification: decide which queue (e.g. per-flow queues)
- Policing: is the packet compliant? (e.g. use PSA meters to drop packets)
- Drop policy: how to drop packets during periods of perceived congestion (e.g. WRED)
- Extract scheduling metadata
- Extern interface to access packet buffer state
Packet Buffer
- Memory space to store packets
- Divided into queues to isolate traffic; queue boundaries may be flexible
- Packet order in a queue = scheduling order
- Exchanges packet descriptors and metadata with the scheduling/shaping logic
- State: the size of each queue, accessible to the ingress logic (read only)
The Push-In-First-Out (PIFO) Model [1]
What is a PIFO? A priority queue with a programmable rank computation in front of a fixed PIFO structure: each packet is pushed in at the position determined by its rank, and packets are always dequeued from the head.
Why is the PIFO a good model?
- The scheduling decision is made at enqueue time, which helps relax timing constraints and leads to an efficient implementation
- Clear separation between fixed and programmable logic
- Can implement existing algorithms: Start-Time Fair Queueing (STFQ), Least Slack-Time First (LSTF), Stop-and-Go Queueing, minimum rate guarantees, fine-grained priority scheduling, Service-Curve Earliest Deadline First (SC-EDF), Rate-Controlled Service Disciplines (RCSD), token bucket rate limiting
- Can implement new algorithms
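To make the PIFO semantics concrete, here is a minimal Python sketch (our illustration, not from the talk; the class name and the arrival-order tie-break are assumptions): packets are pushed in at a position determined by a computed rank and always popped from the head.

    import heapq
    import itertools

    class Pifo:
        """Push-In-First-Out queue: push at a rank-determined position, pop the head."""
        def __init__(self):
            self._heap = []
            self._seq = itertools.count()  # breaks rank ties in arrival order

        def enq(self, rank, pkt):
            # lower rank = dequeued earlier; seq keeps equal-rank packets FIFO
            heapq.heappush(self._heap, (rank, next(self._seq), pkt))

        def deq(self):
            rank, _, pkt = heapq.heappop(self._heap)
            return rank, pkt

Strict priority falls out by using the packet's priority class as the rank; FCFS falls out by using its arrival time.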
Scheduling / Shaping Tree
- The scheduling/shaping logic is a tree of PIFOs
- Each node is one PIFO with programmable enqueue and dequeue logic, and potentially shared state between the enqueue and dequeue logic
- Scheduling nodes implement scheduling algorithms; leaf nodes store packet descriptors, parent nodes store references to scheduling nodes
- Shaping nodes implement shaping algorithms and shape all nodes within their subtree
- Path computation decides which leaf node to enqueue into
(Diagram: a tree of enq/deq logic blocks over PIFOs, exchanging descriptors and metadata with the packet buffer; see the sketch below.)
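A simplified sketch of the tree model, under our own assumptions (a two-level tree, one root entry pushed per packet, and the Pifo class from above): leaves hold packet descriptors, the root holds leaf references, and a dequeue walks from the root down to a leaf.

    class PifoTree:
        """Two-level PIFO tree: the root schedules among leaves, leaves hold packets."""
        def __init__(self, num_leaves):
            self.root = Pifo()  # stores references to leaf nodes
            self.leaves = [Pifo() for _ in range(num_leaves)]

        def enqueue(self, leaf_id, leaf_rank, root_rank, pkt):
            self.leaves[leaf_id].enq(leaf_rank, pkt)  # leaf stores the descriptor
            self.root.enq(root_rank, leaf_id)         # root stores a node reference

        def dequeue(self):
            _, leaf_id = self.root.deq()         # pick the winning leaf
            _, pkt = self.leaves[leaf_id].deq()  # pop that leaf's head packet
            return pkt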
Scheduling & Shaping Demonstration
Scheduling and Shaping Goal
- Scheduling: enqueue 3 packets, then dequeue 3 packets
(Diagram: a demo tree with scheduling nodes for Class A and Class B and a shaping node.)
(Animated walk-through, one slide per step: Enqueue Packet 1; Enqueue Packet 2; Enqueue Packet 3; Enqueue Packet 2 (continued); Dequeue Packet 3; Dequeue Packet 1; Enqueue Packet 3 (continued); Dequeue Packet 2. Each slide shows a packet descriptor and its computed rank flowing through the path computation, the PIFO tree nodes, and the packet buffer queues.)
Programmable Drop Policy
- Extern interface to read buffer state
- Can implement WRED-like policies
(Diagram: the ingress classification / policing / drop policy logic reads queue state and drops the demo's packet p4.)
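As a flavor of what a "WRED-like" policy could look like, here is a minimal Python sketch of the classic (W)RED drop decision (our illustration, not the talk's; it omits RED's inter-drop count correction):

    import random

    def red_drop(avg_qsize, min_th, max_th, max_p):
        """RED drop decision based on the EWMA average queue size."""
        if avg_qsize < min_th:
            return False  # no congestion: never drop
        if avg_qsize >= max_th:
            return True   # severe congestion: always drop
        # in between, drop probability ramps linearly from 0 up to max_p
        p = max_p * (avg_qsize - min_th) / (max_th - min_th)
        return random.random() < p

WRED simply keeps a separate (min_th, max_th, max_p) profile per class or per queue.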
New Externs

    /* PIFO extern */
    extern pifo<T, M> {
        pifo(bit<32> size);          // constructor
        enq(in T rank, in M meta);
        deq(out T rank, out M meta); // meta is optional
    }
    // example instantiation
    pifo<rank_t, sched_meta_t>(2048) p;

    /* Packet buffer extern */
    extern buffer<T> {
        buffer(bit<32> num_queues);
        // get the current size of a queue
        get_q_size(in T q_id, out bit<32> size);
    }
    // example instantiation
    buffer<qid_t>(1024) buf;
Simple Architecture
- User-defined scheduling metadata D is produced by the ingress control and consumed by the scheduler
(Diagram: Parser -> Ingress M/A (classification & policing & drop policy) -> non-P4-programmable PRE / buffer with scheduling / shaping -> Egress M/A -> Deparser.)

    scheduler MyScheduler<D>(in D sched_meta);

    parser Parser<H, M>(packet_in b, out H hdr, out M user_meta,
                        inout std_meta_t std_meta);

    control Ingress<H, M, D>(inout H hdr, out D sched_meta, inout M user_meta,
                             inout std_meta_t std_meta);

    control Egress<H, M>(inout H hdr, inout M user_meta,
                         inout std_meta_t std_meta);

    control Deparser<H, M>(packet_out b, in H hdr, in M user_meta,
                           inout std_meta_t std_meta);
Scheduler Block

    scheduler MyScheduler(in sched_meta_t sched_meta) {
        /* Define PIFO tree nodes */

        /* root node */
        node strict_priority {
            type = scheduling;
            pifo<rank_t>(2048) p;
            enqueue = { }
            dequeue = { }
        }

        /* shaping node */
        node token_bucket {
            type = shaping;
            pifo<rank_t, sched_meta_t>(2048) p;
            enqueue = { }
            dequeue = { }
        }

        /* Define the shape of the tree */
        tree myTree {
            strict_priority(), {wfq(), {token_bucket(), {wfq()}}}
        }

        table find_path { }

        apply {
            find_path.apply();
            // apply the scheduling algorithm defined by the tree
            myTree.apply(leaf_node);
        }
    }

(Tree: a strict priority root over a WFQ node and a token bucket shaping node over a second WFQ node.)
Example: Strict Priority
(Tree: a strict node with a high-priority Class A input and a low-priority Class B input.)

    /* root node */
    node strict {
        type = scheduling;
        // PIFO extern instantiation format:
        // pifo<rank_type>(size_in_pkts) instance_name;
        pifo<rank_t>(2048) p;

        enqueue = {
            rank_t rank;
            rank = sched_meta.diffserv; // maybe use a table?
            p.enq(rank);
        }

        dequeue = {
            rank_t rank;
            p.deq(rank);
        }
    }
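Using the Pifo sketch from earlier (packet names are made up for illustration), strict priority is just "rank = priority class", with lower ranks dequeued first:

    pifo = Pifo()
    pifo.enq(rank=1, pkt="classB-pkt0")  # low priority
    pifo.enq(rank=0, pkt="classA-pkt0")  # high priority
    pifo.enq(rank=1, pkt="classB-pkt1")
    print([pifo.deq()[1] for _ in range(3)])
    # ['classA-pkt0', 'classB-pkt0', 'classB-pkt1']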
Example: Strict Priority with Starvation Prevention
- FCFS nodes prevent reordering of packets within a flow
- Also useful for traffic that is sensitive to jitter (e.g. voice)
(Tree: a strict root; the high branch is an FCFS node for Flow A, the low branch is a token bucket over an FCFS node for Flow B.)
Example: Strict Priority with Starvation Prevention (shaping node)

    node token_bucket {
        type = shaping;
        pifo<rank_t, sched_meta_t>(2048) p;

        enqueue = {
            // declared registers: tb, last_time; configuration: rate r, max_tokens B
            @atomic { // atomically update last_time and tb registers
                tb.read(0, old_tokens);
                last_time.read(0, lt);
                old_tokens = old_tokens + r * (now - lt);
                if (old_tokens > B) {
                    old_tokens = B;
                }
                // compute new tokens and send time
                if (sm.pkt_len <= old_tokens) {
                    new_tokens = old_tokens - sm.pkt_len;
                    send_time = now;
                } else {
                    new_tokens = 0;
                    send_time = now + (sm.pkt_len - old_tokens) / r;
                }
                tb.write(0, new_tokens);
                last_time.write(0, now);
            }
            p.enq(send_time, sm);
        }

        dequeue = {
            rank_t rank;
            p.deq(rank, sm);
        }
    }
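The same token bucket arithmetic as a runnable Python sketch (our paraphrase; the state dict stands in for the tb and last_time registers):

    def token_bucket_send_time(state, pkt_len, now, rate, max_tokens):
        """Return the time at which the packet becomes compliant."""
        # refill tokens for the elapsed time, capped at the bucket depth
        tokens = min(max_tokens,
                     state["tokens"] + rate * (now - state["last_time"]))
        if pkt_len <= tokens:
            state["tokens"] = tokens - pkt_len
            send_time = now  # compliant: eligible immediately
        else:
            state["tokens"] = 0.0
            send_time = now + (pkt_len - tokens) / rate  # wait for refill
        state["last_time"] = now
        return send_time

The returned send_time is used as the PIFO rank, so packets sort by the time they become eligible to depart.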
Example: Strict Priority with Starvation Prevention (scheduling nodes and tree)

    node FCFS {
        type = scheduling;
        pifo<rank_t>(2048) p;

        enqueue = {
            rank_t rank = get_time(); // extern function
            p.enq(rank);
        }

        dequeue = {
            rank_t rank;
            p.deq(rank);
        }
    }

    /* Define the shape of the tree */
    tree myTree {
        strict_priority(), {FCFS(), {token_bucket(), {FCFS()}}}
    }

    apply {
        // path computation logic
        pifo_id_t leaf_node;
        if (sched_meta.flow_id == 0) {
            leaf_node = RIGHT;
        } else {
            leaf_node = LEFT;
        }
        // apply the scheduling / shaping algorithm
        myTree.apply(leaf_node);
    }
Example: Low Latency Traffic
- WFQ across all queues until a particular queue exceeds a configured threshold, at which point that queue is given strict priority until it has been sufficiently drained
- Classification sets the desired queue size in sched_meta
(Tree: a low-latency root node over three FCFS nodes for Flows A, B, and C.)
Example: Low Latency Traffic

Ingress logic:

    buffer<qid_t>(NUM_QUEUES) buf; // instantiate buffer extern
    // get size of low latency queue
    sched_meta.ll_q_size = buf.get_q_size(LL_Q_ID);
    lookup_queue.apply(); // set egress queue

Root node:

    node low_latency {
        type = scheduling;
        pifo<rank_t>(2048) p;

        enqueue = {
            if (sched_meta.ll_q_size > THRESH) {
                // Strict priority logic
            } else {
                // WFQ logic
            }
            p.enq(rank);
        }

        dequeue = {
            rank_t rank;
            p.deq(rank);
        }
    }
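One way the elided rank computation could look, sketched in Python (our assumptions: tuple-valued ranks that compare lexicographically, and an STFQ-style virtual time approximated by the minimum per-queue finish time):

    def low_latency_rank(q_id, q_size, pkt_len, finish, weights, thresh):
        """Rank for the root PIFO: a strict band if the queue is over its
        threshold, otherwise a WFQ (start-time fair queueing style) band."""
        if q_size > thresh:
            return (0, 0)  # strict band: always beats the WFQ band
        # WFQ: advance this queue's virtual finish time by pkt_len / weight
        start = max(finish[q_id], min(finish.values()))
        finish[q_id] = start + pkt_len / weights[q_id]
        return (1, finish[q_id])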
Runtime Control
- The tree can be updated by the control plane, but not from within the P4 program (akin to adding table entries)
- Will need P4Runtime support for tree configuration
- Tables and externs can be handled normally
Open Source Implementations
- PIFO paper ASIC design [1]
- NetFPGA prototype (P4 Workshop demo)
(Diagram: PIFO block diagram: a rank computation feeds a rank store (SRAM) and a flow scheduler (flip-flops); descriptors and metadata flow between the classification logic, the queues, and the packet buffer.)
Challenges with TM Programmability
- Scaling programmable designs to high rates and large sizes:
  - Bottleneck #1: timing closure for the scheduling decision logic
  - Bottleneck #2: memory bandwidth / interface to a large packet buffer
- Abstractions to program the PIFO tree
- Limitations of the proposed model
Moving Forward
- Motivation from the community?
- Settle on an architecture model
- Define language semantics: is a tree too low-level? Are there better abstractions?
- Improve open source HW implementations
- Open source simulation/emulation environment: bmv2? NS-3? Other?
- Small P4 TM working group
References
[1] Anirudh Sivaraman, Suvinay Subramanian, Mohammad Alizadeh, Sharad Chole, Shang-Tse Chuang, Anurag Agrawal, Hari Balakrishnan, Tom Edsall, Sachin Katti, Nick McKeown. "Programmable packet scheduling at line rate." Proceedings of the 2016 ACM SIGCOMM Conference. https://cs.nyu.edu/~anirudh/pifo-sigcomm.pdf
NetFPGA Prototype (P4 Workshop Demo)
- Parallel deterministic skip lists and a register-based sorting cache
- BRAM-based packet buffer
(Diagram: top-level PIFO block diagram: the rank computation feeds a load balancer that inserts descriptors into several register caches backed by skip lists; a selector removes the winning descriptor.)
PIFO Paper ASIC Design [1]
- Flow scheduler (flip-flops): chooses amongst the head packets of each flow
- Rank store (SRAM): stores the computed ranks for each flow in FIFO order
- PIFO blocks connected in a full mesh
- Targets a 64-port 10 Gbps shared-memory switch at 1 GHz, with 1000 flows and 64K packets
Input vs Output Rate Limiting
- Goal of rate limiting: control the rate at which bytes are removed from the buffer and scheduled
- Output rate limiting: the direct approach; short- and long-term rate guarantees
- Input rate limiting: an indirect approach; long-term rate guarantees only (buffered packets are sent out at line rate)
(Diagram: a strict root with high and low branches over a rate limit node and FCFS nodes.)
Arbitrary Reordering of Buffered Packets
- pFabric scheduling order: SRPT (shortest remaining processing time) with FIFO order within flows
(Diagram: four buffered packets p0..p3 with their ranks, illustrating the desired reordering.)
Approximate pFabric SRPT
(Diagram: a tree with a root over per-flow FCFS leaf nodes; the final scheduling order achieved by the tree differs from the true pFabric scheduling order, because ranks are fixed at enqueue time and buffered packets cannot be arbitrarily reordered. See the sketch below.)
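A rough Python sketch of the approximation, reusing the PifoTree sketch from earlier (the per-flow remaining-size bookkeeping is our assumption):

    import time

    def enqueue_pfabric(tree, remaining, flow_id, pkt_len, pkt):
        """FCFS within a flow (leaf rank = arrival time); SRPT across flows
        (root rank = the flow's remaining bytes after this packet)."""
        remaining[flow_id] -= pkt_len
        tree.enqueue(leaf_id=flow_id,
                     leaf_rank=time.monotonic(),    # FIFO within the flow
                     root_rank=remaining[flow_id],  # fewer remaining bytes wins
                     pkt=pkt)

Because root ranks are frozen at enqueue time, later arrivals cannot displace references already in the root PIFO, which is exactly why this only approximates pFabric.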
Soft Rate Limiting
- Example: rate limit flow A to 2 Gbps only if the egress link is congested
- Hard vs. soft rate limiting:
  - Hard: hold packets until it is time to send (non-work-conserving)
  - Soft: packets can be sent early if bandwidth is available (work-conserving)
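A tiny sketch of the difference at dequeue time (our illustration, not the talk's mechanism):

    def may_dequeue(head_send_time, now, link_idle, soft):
        """Hard shaping waits for the head packet's send_time; soft shaping
        also releases it early when the egress link is idle."""
        if now >= head_send_time:
            return True            # compliant in either mode
        return soft and link_idle  # work-conserving escape hatch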
Soft Rate Limiting (continued)
(Diagram: the demo tree from earlier, with packet p2's reference taking a default path through the scheduling/shaping nodes.)
- Open question: what if there are multiple shaping nodes?
Ingress Logic Limitations
- Drop policy:
  - No drop-head algorithms (unless using speedup)
  - No algorithms that must programmatically update state in the packet buffer
Best Effort for Traffic Shaping
- Shaping operations can cause conflicts: scheduling and shaping transactions commit at the same time
- Conflicts are resolved in favor of the scheduling transactions
- Rationale: scheduling algorithms must run at line rate, while shaping algorithms can afford to be delayed a few cycles