Optimal VM Dimensioning for Data Plane VNFs in Telco Cloud Environments
Discusses the efficient dimensioning of Virtual Machines for data plane Virtual Network Functions in local, edge, and regional Telco Cloud setups. Covers performance requirements, VNF lifecycle considerations, typical VNF types, dimensioning challenges, and potential solutions in Telco Cloud deployments.
Presentation Transcript
Optimal VM Dimensioning for Data Plane VNFs in Local / Edge Telco Cloud
Shashi Kant Singh, DPDK Summit Bangalore, 2018
Agenda
- Telco Cloud performance requirements aligned with deployments
- Data plane VNF lifecycle requirements
- Typical data plane VNF types
- Issues with data plane VNF dimensioning
- Possible solutions
Telco DCs in the Cloud
Telco data centres are tiered into Local Cloud (LC), Edge Cloud (EC), Regional Cloud (RC), and National / Central Cloud. Typical Telco applications are distributed in nature, primarily driven by performance needs; they may need to run in different DCs yet still be chained to provide a service.
Key Telco requirements:
- Efficient service chaining by removing local performance bottlenecks
- Multi-tenancy support with network slicing
- Separation of Management, Control, and Data Planes
- High capacity and high bandwidth
- Ultra reliability and low latency
Focus per tier:
- National Cloud: multi-tenancy, high availability, high capacity, scalability, load balancing, energy saving / resource utilization; non real-time performance
- Regional Cloud: fault resilience, scalability, high throughput, high capacity, multi-tenancy; non real-time performance
- Local Cloud: low latency, high throughput, fault resilience; real-time performance
Typical Telco Cloud
(Diagram: BBU-RT in the Edge Cloud, BBU-NRT in the Local Cloud, SGW/PGW control-plane and data-plane VNFs plus IMS and MME in the Regional and National clouds, connected towards the PDN.)
National Cloud:
- Data plane VNFs are typically L2-L4: routing, firewall, security gateways, application gateways. Control plane VNFs are signalling GWs, e.g. IMS servers, MME, etc.
- Generic VNF flavours are good enough for control / data plane VNFs, with load balancers to even out the processing requirements.
- The majority of the data traffic processed is agnostic to subscriber context. Line-rate traffic handling is easily achieved even at 40G / 100G due to the nature of the applications (switching, routing); a DPDK/VPP-based vSwitch works well without any need for HW binding.
- With HW-independent virtualization, fault management is done effectively: Live Migration (LM) over an independent high-capacity infrastructure link supports hitless migrations, i.e. without service disruption; link aggregation; pooled VNFs with load balancers; ACT-ACT and ACT-SBY configurations are possible with hitless or minimally impacted checkpointing.
Regional Cloud:
- Data processing is based on subscriber context. Data traffic forwarding depends on next-leg conditions, e.g. radio conditions, front-haul / mid-haul link stability, etc.
- This leads to controlled processing in the data plane, which must still meet the maximum throughput requirements in good channel conditions. To handle the varying throughput, control code in the data plane is required.
Local / Edge Cloud:
- Manages the subscriber context data and updates the channel conditions dynamically; channel condition data is used to control the flow of traffic in the data plane.
- Due to the high bandwidth requirement combined with the control code, data plane VNFs perform sub-optimally.
- Data plane processing is further split into:
  - Non real-time data plane VNFs (e.g. BBU-NRT): IO-intensive; latency is not a critical performance parameter; typically do not need HW binding.
  - Real-time data plane VNFs (e.g. BBU-RT): IO-intensive; latency is the most critical performance parameter; mostly need HW binding to meet performance requirements.
- A split of control and data VNFs is typically done: control plane VNFs tend to be CPU-intensive, while data plane VNFs become IO-intensive.
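The line-rate claim above is easier to reason about with a quick packet-budget calculation. The sketch below (plain Python; the 2.0 GHz core clock and core counts are illustrative assumptions, not figures from the talk) estimates the packet rate at 40G / 100G for minimum-size frames and the resulting CPU cycle budget per packet.

```python
# Rough packet-rate and per-packet cycle budget at line rate.
# Assumptions (not from the slides): 64-byte frames, 2.0 GHz cores.

ETH_OVERHEAD = 20          # preamble (8) + inter-frame gap (12) bytes per frame
FRAME_BYTES = 64           # minimum Ethernet frame size
CORE_HZ = 2.0e9            # assumed core clock

def packet_rate(link_gbps: float) -> float:
    """Packets per second at line rate for minimum-size frames."""
    bits_per_frame = (FRAME_BYTES + ETH_OVERHEAD) * 8
    return link_gbps * 1e9 / bits_per_frame

for gbps, cores in ((40, 4), (100, 8)):
    pps = packet_rate(gbps)
    budget = CORE_HZ * cores / pps          # CPU cycles available per packet
    print(f"{gbps}G: {pps/1e6:.1f} Mpps, "
          f"~{budget:.0f} cycles/packet across {cores} cores")
```

The roughly 100-150 cycles available per 64-byte packet is what makes polling (PMD) and batching attractive for the data plane VNFs discussed later.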
Data Plane VNF Lifecycle Requirements
Fault resilience:
- Maintain or exceed stringent service availability and real-time performance requirements. Government and regulatory requirements demand at least 99.999 percent availability (five nines), with the ability to continue to operate and maintain services even through a full nodal failure.
- VNFs must be built to handle failures; fault tolerance must be considered at the top of the list during VNF design, alongside performance.
- Performance-critical: ACT-ACT / ACT-SBY HA. Active fault management: Live Migration, VM retirement for regular health check-ups. Passive fault management: cloning, snapshots, backup / restore.
Scalability (OPNFV directions):
- Static: manual request by the operator to increase network capacity.
- Dynamic (auto-scaling): based on CPU / network / memory utilization, or on application KPI monitoring, e.g. throughput, latency.
- ACT-SBY configuration under critical error conditions; preventive action by fault prediction (read warnings).
Energy saving:
- Static: based on time of day, or a management request to shut down nodes.
- Dynamic: condense scaled-down VNFs onto fewer compute nodes; re-arrange VNFs.
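As a quick sanity check on the five-nines figure, and to make the "dynamic (auto-scaling)" trigger concrete, here is a small Python sketch. The utilization and KPI thresholds are illustrative assumptions, not values from the talk.

```python
# Allowed downtime for a given availability target, plus a toy
# auto-scaling decision based on utilization / KPI thresholds.
# Threshold values below are assumptions for illustration only.

MIN_PER_YEAR = 365 * 24 * 60

def allowed_downtime_min(availability: float) -> float:
    return (1.0 - availability) * MIN_PER_YEAR

print(f"99.999% -> {allowed_downtime_min(0.99999):.2f} min/year of downtime")

def scale_decision(cpu_util: float, throughput_gbps: float,
                   latency_ms: float) -> str:
    """Static thresholds mimicking the 'dynamic (auto-scaling)' triggers."""
    if cpu_util > 0.80 or latency_ms > 5.0:      # assumed KPI limits
        return "scale-out"
    if cpu_util < 0.30 and throughput_gbps < 1.0:
        return "scale-in (candidate for condensing / energy saving)"
    return "no action"

print(scale_decision(cpu_util=0.85, throughput_gbps=8.0, latency_ms=2.0))
print(scale_decision(cpu_util=0.20, throughput_gbps=0.5, latency_ms=1.0))
```

Five nines works out to roughly 5.3 minutes of downtime per year, which is why hitless LM and ACT-ACT / ACT-SBY redundancy feature so heavily in the lifecycle requirements.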
Typical Data Plane VNF Types / Flavours
General purpose:
- Generic VMs with no specific asks for HW resources such as CPU, network ports, or accelerators. Designed to meet all the VNF lifecycle requirements.
- Performance requirements are well within reach with generic VNF flavours. Typically use shared CPU allocation (which enables efficient sharing of HW resources), OVS as the software switch on the host compute, and generic virtio as the network interface in the VM.
- Horizontal and vertical scaling take care of high-capacity needs. Live Migration is fully supported.
CPU intensive:
- Have a specific ask for CPU cores, with low or medium data traffic entering / leaving the VNF. May need dedicated CPU allocation.
- NUMA awareness for CPU and RAM alone (without NIC proximity) can also be maintained by the VNFM / orchestrator.
- Typically use OVS / OVS-DPDK at the host and virtio inside the VM.
- Live Migration is supported with some limitations: parallel live migrations, and how fat the VM is (CPU processing capacity / rate of dirtying pages).
IO intensive without HW binding:
- High rate of data traffic entering / leaving the VNF; latency is not a critical performance requirement. Examples: RAN data plane processors for non-real-time traffic, packet processors such as DPI and firewalls.
- Need dedicated CPU allocation for IO performance requirements. Typically use OVS-DPDK / VPP at the host and DPDK PMD / VPP inside the VM.
- Fault resiliency is provided by HA / redundancy. Live Migration is typically not supported.
IO intensive with HW binding:
- High rate of data traffic entering and leaving the VNF, and additionally latency sensitivity is a critical parameter.
- Specific ask for IO and CPU resources: SR-IOV, PCI-PT, HW accelerators (e.g. crypto). Use HW binding: CPU pinning, IRQ remapping, NUMA awareness for IO devices, cache coherence.
- Scalability is not a critical parameter; without load balancers, scalability options are limited. Live Migration is not supported.
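For orientation, the four flavour types map roughly onto Nova flavor extra specs. The sketch below uses commonly documented key names (hw:cpu_policy, hw:mem_page_size, hw:numa_nodes, pci_passthrough:alias); the vCPU/RAM sizes and the "sriov_nic" alias name are illustrative assumptions, not recommendations from the talk.

```python
# Sketch: Nova flavor extra_specs that roughly map to the four VNF types.
# Sizes and the "sriov_nic" PCI alias are illustrative assumptions.

FLAVORS = {
    "general_purpose": {
        "vcpus": 4, "ram_mb": 8192,
        "extra_specs": {"hw:cpu_policy": "shared"},          # OVS + virtio
    },
    "cpu_intensive": {
        "vcpus": 8, "ram_mb": 16384,
        "extra_specs": {
            "hw:cpu_policy": "dedicated",                    # pinned vCPUs
            "hw:numa_nodes": "1",                            # CPU/RAM locality
        },
    },
    "io_intensive_no_hw_binding": {
        "vcpus": 8, "ram_mb": 16384,
        "extra_specs": {
            "hw:cpu_policy": "dedicated",                    # for PMD threads
            "hw:mem_page_size": "1GB",                       # hugepages for DPDK/vhost
        },
    },
    "io_intensive_hw_binding": {
        "vcpus": 12, "ram_mb": 32768,
        "extra_specs": {
            "hw:cpu_policy": "dedicated",
            "hw:mem_page_size": "1GB",
            "hw:numa_nodes": "1",                            # NIC/CPU/RAM on one node
            "pci_passthrough:alias": "sriov_nic:2",          # assumed alias name
        },
    },
}

for name, spec in FLAVORS.items():
    print(name, spec["vcpus"], "vCPU,", spec["ram_mb"], "MB,",
          spec["extra_specs"])
```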
Issues with VNF Dimensioning
CPU- and IO-intensive VNFs may not be able to meet all the VNF lifecycle requirements. Common issues:
- CPU allocation, shared vs dedicated:
  - Shared allocation gives high multiplexing and overall CPU utilization; dedicated CPUs ensure CPU availability at all times.
  - With hyper-threaded physical CPUs, virtual CPUs may not give 2x performance for extensive data processing applications, e.g. AI.
  - OpenStack thread allocation supports Avoid, Separate, Isolate, and Prefer; Live Migration may not be possible with some of these allocation types.
- Trade-off against the minimum required RAM / disk:
  - Higher RAM / disk makes the VM fat, which impacts Live Migration performance.
  - Lower RAM / disk may hurt CPU performance through cache misses.
- Trade-off against the allocated page size:
  - A smaller page size leads to more TLB misses.
  - A larger page size increases the unused page sections per process (internal fragmentation) and affects page relocation options, hence page faults.
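The page-size trade-off can be made concrete with a TLB-reach estimate. The sketch below uses an assumed TLB entry count and buffer size (typical orders of magnitude, not figures from the talk) to show how much memory can be mapped without misses at 4 KB, 2 MB, and 1 GB pages, and the internal fragmentation cost of rounding the buffer up to the page size.

```python
# TLB reach and internal fragmentation vs page size.
# TLB_ENTRIES and the buffer size are assumed, illustrative numbers.

TLB_ENTRIES = 1536                      # order of magnitude for a modern core
PAGE_SIZES = {"4KB": 4 << 10, "2MB": 2 << 20, "1GB": 1 << 30}
BUFFER_BYTES = 3_537_000_000            # e.g. ~3.3 GB of packet buffers

for name, size in PAGE_SIZES.items():
    reach = TLB_ENTRIES * size                        # memory covered by TLB
    pages = -(-BUFFER_BYTES // size)                  # ceil division
    waste = pages * size - BUFFER_BYTES               # internal fragmentation
    print(f"{name:>4}: reach {reach / (1 << 20):10.0f} MB, "
          f"{pages:7d} pages, {waste / (1 << 20):7.1f} MB wasted")
```

With 4 KB pages the TLB only covers a few MB (hence the misses); with 1 GB pages the reach is enormous but the last page wastes most of a gigabyte, which is the fragmentation cost the slide refers to.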
Issues with VNF Dimensioning (continued)
CPU-intensive VMs:
- Dedicated CPUs allocated to VMs may not be used as efficiently as with shared CPU allocation. CPUs are typically allocated for the maximum load the VNF must handle; under low load they are neither optimally used nor shared with other VNFs. Limited multiplexing gain is the major concern.
- OpenStack provides overcommitting of CPU / RAM to increase effective CPU utilization by sharing across instances; this is not possible with dedicated CPU allocation.
- There may be issues if multiple Live Migrations are performed at the same time; a dedicated infrastructure link may not be available in the Regional / Local Cloud.
IO-intensive VMs:
- HW binding for high performance makes the VM static; it behaves more like a reconfigurable physical machine. The VM's dimensions cannot be changed easily, and due to the static configuration, network slicing also becomes difficult. HW binding also makes the VM non-portable across COTS hardware.
- The VM has to fit within a single NUMA node, so the placement of fat VMs is challenging. Local SR-IOV NICs on the NUMA node must be available for VM relocation, i.e. SR-IOV VFs must be free on the same NUMA node where CPU resources are available for the migrating VM.
- VMs cannot be condensed onto a reduced set of compute nodes for energy saving.
- Live Migration is mostly not supported, due to the high rate of page dirtying, SR-IOV not supporting Live Migration, and the HW binding itself.
- Fault resiliency is difficult without Live Migration. An HA ACT-SBY configuration can be used, but at the cost of duplicate HW, which is not cost-effective.
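The NUMA / SR-IOV placement constraint above can be expressed as a simple feasibility check. The sketch below is a hypothetical filter written for illustration: the host inventory format and the numbers are invented, and real placement is of course done by the VIM scheduler (e.g. Nova), not by code like this.

```python
# Toy placement check: a hardware-bound VM must find ONE NUMA node that
# has enough free pinned CPUs, free hugepage RAM, and free SR-IOV VFs.
# Host inventory values are invented for illustration.

from dataclasses import dataclass

@dataclass
class NumaNode:
    free_pcpus: int
    free_hugepage_gb: int
    free_sriov_vfs: int

def can_place(vm_vcpus: int, vm_ram_gb: int, vm_vfs: int,
              nodes: list[NumaNode]) -> bool:
    """True if some single NUMA node can host the whole VM."""
    return any(n.free_pcpus >= vm_vcpus and
               n.free_hugepage_gb >= vm_ram_gb and
               n.free_sriov_vfs >= vm_vfs
               for n in nodes)

host = [NumaNode(free_pcpus=10, free_hugepage_gb=48, free_sriov_vfs=0),
        NumaNode(free_pcpus=4,  free_hugepage_gb=64, free_sriov_vfs=8)]

# A fat, hardware-bound VM: plenty of CPUs on node 0 but no VFs there,
# and not enough CPUs on node 1 -> relocation fails.
print(can_place(vm_vcpus=8, vm_ram_gb=32, vm_vfs=2, nodes=host))   # False
# A smaller VM fits on node 1.
print(can_place(vm_vcpus=4, vm_ram_gb=32, vm_vfs=2, nodes=host))   # True
```

The second call illustrates the point made later in the deck: smaller VMs are far easier to place and relocate than one fat, hardware-bound VM.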
Solution: General
- Define the upper bound of the performance expectation of a VM, then find the resources / flavour definition that is just sufficient to meet it. Splitting the functionality of a VNF across multiple VMs is possible. Customize the Guest OS.
- Try to use shared resources before looking for dedicated resources: start with a general profile and add HW-independent resources. If performance is met, there is no need to look for a more sophisticated solution. If dedicated resources are the only option, select ones that support Live Migration.
- Identify a fault resilience procedure for the VM to the extent possible without compromising performance.
- Define the scalability options / procedures and the energy saving options.
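The "start general, escalate only when needed" approach can be written down as a small decision procedure. This is only a sketch of the flow implied by the list above; the profile names and the perf_ok() figures are placeholders for whatever benchmark and profiles a given VNF vendor actually uses.

```python
# Sketch of the escalation flow: try the cheapest profile first and move
# to a more restrictive one only if the performance target is not met.
# Profile order, names and measured figures are illustrative assumptions.

PROFILES = [
    "general-shared",            # shared CPUs, OVS + virtio
    "shared-plus-hugepages",     # still HW independent
    "dedicated-cpus-lm-capable", # dedicated resources that keep LM possible
    "hw-bound-sriov",            # last resort: SR-IOV / PCI-PT, no LM
]

def perf_ok(profile: str, target_gbps: float) -> bool:
    """Placeholder for a real benchmark of the VNF under this profile."""
    measured = {"general-shared": 2.0, "shared-plus-hugepages": 5.0,
                "dedicated-cpus-lm-capable": 9.0, "hw-bound-sriov": 20.0}
    return measured[profile] >= target_gbps

def choose_profile(target_gbps: float) -> str:
    for profile in PROFILES:
        if perf_ok(profile, target_gbps):
            return profile          # JUST sufficient: stop at the first match
    raise RuntimeError("split the VNF into multiple smaller VMs")

print(choose_profile(target_gbps=8.0))   # -> dedicated-cpus-lm-capable
```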
Solution: General (continued)
The size of the VM (CPU, RAM, network ports, disk) has a bearing on the following:
- Live Migration
- System backup / restore
- Instantiation / deletion
- VM operations, e.g. suspend, resume, pause, un-pause
The bigger the VM, the higher the response time of these operations. Smaller VMs tend to be more responsive, but below some range the performance requirements of the VM may not be met. The optimal VM dimensions for CPU, RAM, network IO, and disk need to be identified.
General solutions:
- Strip the Guest OS to keep only the desired services; Guest OS resource usage should be kept below 20%.
- Separate out the CPU resources used by the Guest OS within the VM: CPU isolation for applications, and IRQ remapping to specific cores.
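One concrete way to separate out the Guest OS CPU resources is to steer device IRQs onto housekeeping cores inside the guest. The sketch below writes /proc/irq/<n>/smp_affinity_list, a standard Linux interface; the core split (cores 0-1 for housekeeping, the rest for the application) is an assumption and has to match the kernel's isolcpus / CPU-partitioning boot configuration.

```python
# Pin all device IRQs to housekeeping cores so that application cores
# (isolated via isolcpus or a cpu-partitioning profile) are not interrupted.
# Must run as root inside the guest; the core split is an assumption.

import glob
import os

HOUSEKEEPING_CORES = "0-1"     # guest OS services + IRQ handling
# application cores (e.g. 2-7) are left for DPDK / data plane threads

def remap_irqs(cores: str = HOUSEKEEPING_CORES) -> None:
    for path in glob.glob("/proc/irq/*/smp_affinity_list"):
        try:
            with open(path, "w") as f:
                f.write(cores)
        except OSError:
            # some IRQs (e.g. per-CPU timers) cannot be remapped; skip them
            pass

if __name__ == "__main__" and os.geteuid() == 0:
    remap_irqs()
```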
Solution: CPU-Intensive VMs
- Allocate CPUs in mixed mode (shared + dedicated) and allow for vertical scaling of CPUs.
- Avoid PMD mode where possible, if the network IO is manageable in interrupt mode.
- Decide on the optimal RAM / disk size based on the VNF application. Disk operations can be reduced by using network IO to a storage node; volume-based VNFs (with dedicated storage nodes) help in recovering from faults, e.g. using LM.
- Define the optimal page size requirement based on the application.
- Let the virtualization platform reduce the CPU cycles of a VM during LM. This reduces the rate of page dirtying and allows multiple LMs to complete with a higher probability of success.
- Energy saving is possible with increased use of shared resources: with multiple VNFs scaling down, it should be possible to put them into a low CPU-cycle mode and even condense the VNFs onto fewer compute nodes.
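The benefit of throttling a VM's CPU during live migration can be seen from a simple pre-copy model: each copy round has to transfer the pages dirtied during the previous round, so migration only converges when the dirty rate is below the migration link bandwidth. The VM RAM size, link speed and dirty rates below are illustrative assumptions.

```python
# Toy pre-copy live-migration model: rounds converge only if the page
# dirty rate is below the migration bandwidth. Throttling the VM's CPU
# (a lower dirty rate) shortens, or simply enables, convergence.
# All numbers are illustrative assumptions.

def precopy_time(ram_gb: float, bw_gbps: float, dirty_gbps: float,
                 stop_copy_gb: float = 0.1, max_rounds: int = 30):
    """Return (total_seconds, rounds) or None if migration never converges."""
    remaining = ram_gb            # data left to copy in the next round (GB)
    total, rounds = 0.0, 0
    bw, dirty = bw_gbps / 8, dirty_gbps / 8     # convert to GB/s
    while rounds < max_rounds:
        t = remaining / bw
        total += t
        rounds += 1
        remaining = dirty * t     # pages dirtied while that round was copied
        if remaining <= stop_copy_gb:
            return total + remaining / bw, rounds
    return None                   # dirty rate too high: no convergence

print(precopy_time(ram_gb=32, bw_gbps=10, dirty_gbps=8))   # converges slowly
print(precopy_time(ram_gb=32, bw_gbps=10, dirty_gbps=2))   # throttled: fast
print(precopy_time(ram_gb=32, bw_gbps=10, dirty_gbps=12))  # never converges
```

The same model explains why several fat VMs migrating in parallel (sharing the migration link) may not converge at all, while throttled or smaller VMs do.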
Solution: IO-Intensive VMs
- Use a 1G huge page size over 4M.
- Use PMD mode over interrupt mode: it removes the latency of interrupt handling and gives higher packet IO performance.
- Use non-blocking ring buffers for a higher packet processing rate, and use vector packet processing techniques (the VPP concept).
- To support LM, use an XVIO-based network interface instead of SR-IOV. Netronome has come up with SmartNIC cards supporting XVIO; this allows full VM mobility by providing standard virtio interfaces in the VM. With the VM's CPUs freed up from PMD work (offloaded to XVIO), it is possible to handle traffic at sub-line rate: instead of one fat VM processing data at line rate with x vCPUs and y memory, it can be split into 3 smaller VMs, which allows Live Migration to go through.
- App-assisted LM: the virtualization infrastructure notifies the application that LM is being initiated, allowing the application to reduce its data traffic handling and thereby the rate of page dirtying. The virtualization infrastructure can also reduce the CPU cycle rate of the VM.
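To illustrate why ring buffers and vector (batch) processing raise the packet rate, the sketch below compares per-packet processing with burst processing over a simple ring. It is a conceptual Python model of the VPP/DPDK burst idea, not DPDK code; the fixed per-call "overhead" is an assumed stand-in for driver-call, cache and I/O costs that a burst amortizes across many packets.

```python
# Conceptual model of burst ("vector") processing: a fixed per-call
# overhead is paid once per burst instead of once per packet, so larger
# bursts cost fewer cycles per packet. Costs are assumed, unitless numbers.

from collections import deque

PER_CALL_OVERHEAD = 100    # e.g. driver call, cache warm-up (assumed cycles)
PER_PACKET_WORK = 50       # actual per-packet processing (assumed cycles)

def cost(num_packets: int, burst_size: int) -> float:
    ring = deque(range(num_packets))       # stand-in for an rx ring buffer
    cycles = 0
    while ring:
        burst = [ring.popleft() for _ in range(min(burst_size, len(ring)))]
        cycles += PER_CALL_OVERHEAD + PER_PACKET_WORK * len(burst)
    return cycles / num_packets            # average cycles per packet

for burst in (1, 32, 256):
    print(f"burst={burst:3d}: {cost(100_000, burst):6.1f} cycles/packet")
```

Going from per-packet calls to bursts of 32-256 brings the per-packet cost close to the pure processing cost, which is what makes the tight cycle budgets at 40G / 100G achievable.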
THANK YOU (singhshashik1@gmail.com)