Kubernetes Modifications for GPUs


A presentation by Sanjeev Mehrotra on modifying Kubernetes for GPUs: how resource scheduling and node allocation work today, and why modifications are needed to express complex GPU constraints such as per-GPU memory minimums and NVLink interconnectivity.




Presentation Transcript


  1. Kubernetes Modifications for GPUs Sanjeev Mehrotra

  2. Kubernetes resource scheduling. Terminology:
     • Allocatable: what is available at a node
     • Used: what is already being used from a node (called RequestedResource in the code)
     • Requests: what is requested by the container(s) for the pod
     Kubelets send the Allocatable resources for their nodes (Worker 1 ... Worker N) to the scheduler, which keeps track of Used. A scheduling request carries the pod (container) spec with its container Requests.

  3. Resources. All resources (allocatable, used, and requests) are represented as a ResourceList, which is simply a list of key-value pairs, e.g.
         memory: 64GiB
         cpu: 8
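
     As a minimal sketch, such a list can be modeled as a plain string map in Go. (The actual Kubernetes type maps resource names to parsed quantities; this simplification is only for illustration.)

         package main

         import "fmt"

         // ResourceList models the key-value pairs described above.
         type ResourceList map[string]string

         func main() {
             allocatable := ResourceList{
                 "memory": "64GiB",
                 "cpu":    "8",
             }
             fmt.Println(allocatable) // map[cpu:8 memory:64GiB]
         }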

  4. Simple scheduling
     1. Find the worker nodes that can fit the pod spec (plugin/pkg/scheduler/algorithm/predicates)
     2. Prioritize the list of candidate nodes (plugin/pkg/scheduler/algorithm/priorities)
     3. Try to schedule the pod on a node; the node may have an additional admission policy, so the pod may still fail
     4. If it fails, try the next node on the list
     A schematic sketch of this loop follows.
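
     The sketch below mirrors the four steps with placeholder fits/score/bind functions standing in for the real predicates, priorities, and binding logic; it is not the actual scheduler code.

         package main

         import (
             "errors"
             "fmt"
             "sort"
         )

         type Node struct{ Name string }
         type Pod struct{ Name string }

         // Placeholders for the predicates and priorities packages;
         // bind may still fail due to a node's admission policy.
         func fits(p Pod, n Node) bool  { return true }
         func score(p Pod, n Node) int  { return 0 }
         func bind(p Pod, n Node) error { return nil }

         // schedule filters nodes that fit, ranks them, then walks the
         // ranked list until a bind succeeds.
         func schedule(p Pod, nodes []Node) (Node, error) {
             var feasible []Node
             for _, n := range nodes {
                 if fits(p, n) {
                     feasible = append(feasible, n)
                 }
             }
             sort.Slice(feasible, func(i, j int) bool {
                 return score(p, feasible[i]) > score(p, feasible[j])
             })
             for _, n := range feasible {
                 if err := bind(p, n); err == nil {
                     return n, nil
                 }
             }
             return Node{}, errors.New("no node admitted the pod")
         }

         func main() {
             n, err := schedule(Pod{"train"}, []Node{{"worker1"}, {"worker2"}})
             fmt.Println(n.Name, err)
         }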

  5. Find nodes that fit. For simple scheduling, a node will NOT fit if Allocatable < Requests + Used. Example:

         if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
             predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceCPU,
                 podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
         }
         if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
             predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceMemory,
                 podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
         }
         if allocatable.NvidiaGPU < podRequest.NvidiaGPU+nodeInfo.RequestedResource().NvidiaGPU {
             predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceNvidiaGPU,
                 podRequest.NvidiaGPU, nodeInfo.RequestedResource().NvidiaGPU, allocatable.NvidiaGPU))
         }

  6. Why do we need modifications? The stock scheduler only allows constraints like the following in a pod spec:
     • Need 4 GPUs
     It does NOT allow constraints like the following:
     • Need 4 GPUs with minimum memory 12GiB, OR need 2 GPUs with minimum memory 4GiB and 2 GPUs with 12GiB
     • Need 2 GPUs interconnected via NVLink (peer-to-peer links for high-speed inter-GPU communication)

  7. Solution 1: label nodes and use a node selector (https://kubernetes.io/docs/concepts/configuration/assign-pod-node/). However, this is not optimal in cases with heterogeneous configurations. For example, one machine may have GPUs of several types, some with large amounts of memory and some with small. If a label is used, we don't know which GPUs will get assigned, so the node can only be labeled with its minimally performant GPU. Also, even in homogeneous configurations, the kubelet running on each worker node has to do the bookkeeping of which GPUs are in use. A sketch of how node-selector matching works follows.
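
     For illustration only (the gpu-type label is made up), node-selector matching reduces to checking that a node's labels contain every selector entry; note that it says nothing about which of the node's GPUs the container will actually get.

         package main

         import "fmt"

         // matches reports whether a node's labels satisfy a pod's
         // nodeSelector: every selector key must appear on the node
         // with the same value.
         func matches(nodeLabels, nodeSelector map[string]string) bool {
             for k, v := range nodeSelector {
                 if nodeLabels[k] != v {
                     return false
                 }
             }
             return true
         }

         func main() {
             nodeLabels := map[string]string{"gpu-type": "k80"} // hypothetical label
             nodeSelector := map[string]string{"gpu-type": "k80"}
             fmt.Println(matches(nodeLabels, nodeSelector)) // true
         }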

  8. Solution 2: group scheduler. Define a richer syntax on ResourceLists that allows such constraints to be scheduled. Example: instead of
         NvidiaGPU: 2
     use something like the following, where the memory of each GPU is clearly specified:
         Gpu/0/cards:  1
         Gpu/0/memory: 12GiB
         Gpu/1/cards:  1
         Gpu/1/memory: 6GiB
     The cards entry is present to prevent sharing of GPU cards.
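
     A small sketch of building these hierarchical keys; the key layout follows the slide, and the helper function is purely illustrative.

         package main

         import "fmt"

         // gpuKey builds per-GPU resource keys such as "Gpu/0/cards"
         // and "Gpu/0/memory".
         func gpuKey(index int, field string) string {
             return fmt.Sprintf("Gpu/%d/%s", index, field)
         }

         func main() {
             allocatable := map[string]string{
                 gpuKey(0, "cards"):  "1",
                 gpuKey(0, "memory"): "12GiB",
                 gpuKey(1, "cards"):  "1",
                 gpuKey(1, "memory"): "6GiB",
             }
             for k, v := range allocatable {
                 fmt.Printf("%s: %s\n", k, v)
             }
         }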

  9. Example: GPUs with NVLink. For 4 GPUs in two groups, each GPU connected via NVLink to the other in its group (GpuGrp0 holds Gpu0 and Gpu1; GpuGrp1 holds Gpu2 and Gpu3):
         GpuGrp/0/Gpu/0/cards:  1
         GpuGrp/0/Gpu/0/memory: 12GiB
         GpuGrp/0/Gpu/1/cards:  1
         GpuGrp/0/Gpu/1/memory: 12GiB
         GpuGrp/1/Gpu/2/cards:  1
         GpuGrp/1/Gpu/2/memory: 8GiB
         GpuGrp/1/Gpu/3/cards:  1
         GpuGrp/1/Gpu/3/memory: 8GiB

  10. Group scheduler. All resource lists (allocatable, used, and requests) are specified in this manner. Scheduling can no longer check fit by comparing values under the same key, e.g.
          allocatable["memory"] < used["memory"] + requested["memory"]
      Example. Allocatable:
          GpuGrp/0/Gpu/0/cards:  1
          GpuGrp/0/Gpu/0/memory: 12GiB
          GpuGrp/0/Gpu/1/cards:  1
          GpuGrp/0/Gpu/1/memory: 12GiB
          GpuGrp/1/Gpu/2/cards:  1
          GpuGrp/1/Gpu/2/memory: 8GiB
          GpuGrp/1/Gpu/3/cards:  1
          GpuGrp/1/Gpu/3/memory: 8GiB
      Requested (two GPUs with at least 10GiB of memory each, no NVLink requirement):
          GpuGrp/A/Gpu/0/cards:  1
          GpuGrp/A/Gpu/0/memory: 10GiB
          GpuGrp/B/Gpu/1/cards:  1
          GpuGrp/B/Gpu/1/memory: 10GiB
      The request keys (GpuGrp/A, GpuGrp/B) are placeholders that must be mapped onto allocatable keys, which is what the next slide describes.

  11. Group scheduler. The group scheduler uses hierarchical group allocation with arbitrary scorers to accomplish both checking for fit and allocation. An allocation is a string-to-string key-value mapping from Requests to Allocatable. Requested (two GPUs with at least 10GiB of memory each, no NVLink requirement):
          GpuGrp/A/Gpu/0/cards:  1
          GpuGrp/A/Gpu/0/memory: 10GiB
          GpuGrp/B/Gpu/1/cards:  1
          GpuGrp/B/Gpu/1/memory: 10GiB
      Allocatable:
          GpuGrp/0/Gpu/0/cards:  1
          GpuGrp/0/Gpu/0/memory: 12GiB
          GpuGrp/0/Gpu/1/cards:  1
          GpuGrp/0/Gpu/1/memory: 12GiB
          GpuGrp/1/Gpu/2/cards:  1
          GpuGrp/1/Gpu/2/memory: 8GiB
          GpuGrp/1/Gpu/3/cards:  1
          GpuGrp/1/Gpu/3/memory: 8GiB
      A sketch of such an allocator follows.
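
      The sketch below is a deliberately flat, greedy allocator over simplified per-GPU records; the real group scheduler walks the group hierarchy with pluggable scorers (e.g. to prefer keeping NVLink groups intact), so treat this as an illustration of the mapping it produces, not the algorithm itself.

          package main

          import "fmt"

          // gpu is a flattened view of one requested or allocatable GPU.
          type gpu struct {
              key    string // e.g. "GpuGrp/A/Gpu/0" (request) or "GpuGrp/0/Gpu/0" (allocatable)
              memGiB int
              used   bool
          }

          // allocate greedily maps each requested GPU onto the first free
          // allocatable GPU with enough memory and returns the resulting
          // string-to-string AllocateFrom mapping.
          func allocate(requests, allocatable []gpu) (map[string]string, bool) {
              out := map[string]string{}
              for _, r := range requests {
                  found := false
                  for i := range allocatable {
                      a := &allocatable[i]
                      if !a.used && a.memGiB >= r.memGiB {
                          out[r.key], a.used, found = a.key, true, true
                          break
                      }
                  }
                  if !found {
                      return nil, false // the request does not fit on this node
                  }
              }
              return out, true
          }

          func main() {
              requests := []gpu{
                  {key: "GpuGrp/A/Gpu/0", memGiB: 10},
                  {key: "GpuGrp/B/Gpu/1", memGiB: 10},
              }
              allocatable := []gpu{
                  {key: "GpuGrp/0/Gpu/0", memGiB: 12},
                  {key: "GpuGrp/0/Gpu/1", memGiB: 12},
                  {key: "GpuGrp/1/Gpu/2", memGiB: 8},
                  {key: "GpuGrp/1/Gpu/3", memGiB: 8},
              }
              m, ok := allocate(requests, allocatable)
              fmt.Println(ok, m) // both requests land on the 12GiB GPUs in GpuGrp/0
          }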

  12. Group Allocation
      Allocatable:
          Gpugrp1/0/Gpugrp0/0/gpu/dev0/cards: 1
          Gpugrp1/0/Gpugrp0/0/gpu/dev1/cards: 1
          Gpugrp1/0/Gpugrp0/1/gpu/dev2/cards: 1
          Gpugrp1/0/Gpugrp0/1/gpu/dev3/cards: 1
          Gpugrp1/1/Gpugrp0/2/gpu/dev4/cards: 1
          Gpugrp1/1/Gpugrp0/2/gpu/dev5/cards: 1
          Gpugrp1/1/Gpugrp0/3/gpu/dev6/cards: 1
          Gpugrp1/1/Gpugrp0/3/gpu/dev7/cards: 1
      Requests:
          Gpugrp1/R0/Gpugrp0/RA/gpu/gpu0/cards: 1
          Gpugrp1/R0/Gpugrp0/RA/gpu/gpu1/cards: 1
          Gpugrp1/R1/Gpugrp0/RA/gpu/gpu2/cards: 1
          Gpugrp1/R1/Gpugrp0/RA/gpu/gpu3/cards: 1
          Gpugrp1/R1/Gpugrp0/RB/gpu/gpu4/cards: 1
          Gpugrp1/R1/Gpugrp0/RB/gpu/gpu5/cards: 1

  13. Main Modifications, scheduler side
      1. Addition of an AllocateFrom field to the pod specification: a list of key-value pairs that specifies the mapping from Requests to Allocatable (pkg/api/types.go). A sketch of the field follows this list.
      2. Addition of the group scheduler code (plugin/pkg/scheduler/algorithm/predicates/grpallocate.go, plugin/pkg/scheduler/algorithm/scorer).
      3. Modification of the scheduler to write the pod update after scheduling and to call the group allocator (plugin/pkg/scheduler/generic_scheduler.go, plugin/pkg/scheduler/scheduler.go).
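
      A sketch of what the new field amounts to; the shape follows the slide's description, not the exact upstream diff in pkg/api/types.go.

          package main

          import "fmt"

          // Trimmed-down container spec: Requests is what the container
          // asks for; AllocateFrom, filled in by the scheduler, maps each
          // request key to the allocatable key it was satisfied from.
          type ContainerSpec struct {
              Requests     map[string]string
              AllocateFrom map[string]string
          }

          func main() {
              c := ContainerSpec{
                  Requests: map[string]string{"GpuGrp/A/Gpu/0/cards": "1"},
                  AllocateFrom: map[string]string{
                      "GpuGrp/A/Gpu/0/cards": "GpuGrp/0/Gpu/0/cards",
                  },
              }
              fmt.Println(c.AllocateFrom["GpuGrp/A/Gpu/0/cards"])
          }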

  14. Kubelet modifications. The existing multi-GPU code makes the kubelet do the work of keeping track of which GPUs are available, and it looks at /dev/nvidia* to count devices; both are hacks. With the addition of the AllocateFrom field, the scheduler decides which GPUs to use and keeps track of which ones are in use.

  15. Main Modifications, kubelet side
      1. Use of AllocateFrom to decide which GPUs to use.
      2. Use of nvidia-docker-plugin to find GPUs (instead of looking at /dev/nvidia*). This is also needed to get richer information such as GPU memory, GPU type, and topology information (e.g. NVLink). A sketch of querying the plugin follows this list.
      3. Use of nvidia-docker-plugin to find the correct location for the NVIDIA drivers inside the container (in conjunction with the nvidia-docker driver).
      4. Allowing specification of a driver when specifying a mount, which is needed to use the nvidia-docker driver.
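
      The sketch below queries the plugin for GPU information. It assumes nvidia-docker-plugin v1's default local REST endpoint on port 3476 and the /v1.0/gpu/info/json path; verify both against the plugin version actually in use.

          package main

          import (
              "fmt"
              "io"
              "net/http"
          )

          func main() {
              // The returned JSON includes GPU model, memory, and
              // topology details (assumed endpoint, see above).
              resp, err := http.Get("http://localhost:3476/v1.0/gpu/info/json")
              if err != nil {
                  panic(err)
              }
              defer resp.Body.Close()
              body, err := io.ReadAll(resp.Body)
              if err != nil {
                  panic(err)
              }
              fmt.Println(string(body))
          }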

  16. Integration with community. Eventual goal: kubelets know nothing about GPUs. Device plugins (e.g. for GPUs) tell the kubelet which resources to advertise and supply the resource usage / docker parameters; kubelets send the Allocatable resources for their nodes (Worker 1 ... Worker N) to the scheduler, which keeps track of Used. For each scheduling request, i.e. a pod (container) spec with container Requests, the scheduler asks a scheduler extender whether the pod fits; the extender performs the group allocation and writes the update to the pod spec with the allocation. A sketch of such an extender endpoint follows.
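
      A minimal sketch of the extender's filter endpoint. The trimmed request/response structs are assumptions standing in for the scheduler extender API types (ExtenderArgs / ExtenderFilterResult), and the handler just passes every node through.

          package main

          import (
              "encoding/json"
              "log"
              "net/http"
          )

          // Trimmed stand-ins for the extender API types.
          type extenderArgs struct {
              Pod       json.RawMessage `json:"pod"`
              NodeNames []string        `json:"nodenames"`
          }

          type extenderFilterResult struct {
              NodeNames   []string          `json:"nodenames"`
              FailedNodes map[string]string `json:"failedNodes"`
          }

          func filter(w http.ResponseWriter, r *http.Request) {
              var args extenderArgs
              if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
                  http.Error(w, err.Error(), http.StatusBadRequest)
                  return
              }
              // A real extender would run the group allocator against each
              // candidate node and, on success, write the AllocateFrom
              // mapping back to the pod spec.
              res := extenderFilterResult{
                  NodeNames:   args.NodeNames,
                  FailedNodes: map[string]string{},
              }
              json.NewEncoder(w).Encode(res)
          }

          func main() {
              http.HandleFunc("/filter", filter)
              log.Fatal(http.ListenAndServe(":8888", nil))
          }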

  17. Needed in Kubernetes core. We will need a few things in order to achieve separation from the core, which will allow us to use the latest Kubernetes binaries directly:
      • Resource Class, scheduled for v1.9, will allow non-identity mappings between requests and allocatable.
      • Device plugins and native NVIDIA GPU support are targeted at v1.13 for now: https://docs.google.com/a/google.com/spreadsheets/d/1NWarIgtSLsq3izc5wOzV7ItdhDNRd-6oBVawmvs-LGw

  18. Other future Kubernetes/Scheduler work
      • Pod placement using other constraints, such as pod-level constraints or higher (e.g. scheduling multiple pods together for distributed training), for example taking networking constraints into account when scheduling distributed training.
      • Container networking for faster cross-pod communication (e.g. using RDMA / InfiniBand).
