Critical Properties of Apollo Scheduling Framework
Managing task assignment and scheduling in Apollo involves critical properties such as distributed and coordinated scheduling, correction mechanisms, and opportunistic scheduling. Capacity management is done using a token-based mechanism, while the architectural overview highlights the roles of Job Managers, Process Nodes, and Resource Monitors. Task priority and stable matching ensure efficient decision-making, and the correction mechanism allows for task reassignment based on real-time feedback. Overall, Apollo's framework emphasizes efficiency and accuracy in task execution and resource allocation.
Presentation Transcript
Apollo. Weize Sun, Feb. 17th, 2017
Critical properties of Apollo: a distributed and coordinated scheduling framework; assigns each task to the server with the minimal estimated completion time; provides near-future states of servers; correction mechanism; opportunistic scheduling.
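The "minimal estimated completion time" rule can be illustrated with a minimal sketch. It assumes, purely for illustration, that the estimate for a server is the sum of a wait time, an initialization time, and a runtime; the class `ServerEstimate` and the functions `estimated_completion` and `pick_server` are hypothetical names, not part of Apollo.

```python
from dataclasses import dataclass


@dataclass
class ServerEstimate:
    """Hypothetical per-server inputs a scheduler might hold for one task."""
    name: str
    wait_time: float  # seconds until the server could start the task
    init_time: float  # seconds to fetch the files the task needs
    run_time: float   # estimated execution time on this server


def estimated_completion(s: ServerEstimate) -> float:
    # Assumed model: completion = wait + initialization + runtime.
    return s.wait_time + s.init_time + s.run_time


def pick_server(candidates):
    """Return the candidate with the smallest estimated completion time."""
    return min(candidates, key=estimated_completion)


servers = [
    ServerEstimate("a", wait_time=10, init_time=2, run_time=5),
    ServerEstimate("b", wait_time=0, init_time=8, run_time=5),
    ServerEstimate("c", wait_time=3, init_time=1, run_time=5),
]
best = pick_server(servers)
```

Here server "b" is idle but is not chosen, because fetching its inputs costs more than the short queue on "c".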
Capacity management and tokens: Apollo uses a token-based mechanism. Each token is the right to execute a task with a predefined amount of resources. The more tokens a job has, the more tasks it can run.
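A rough sketch of how tokens could cap the number of running tasks follows. The `TokenPool` class and `launch_ready_tasks` function are hypothetical and only model the counting; they do not model the resources behind each token.

```python
class TokenPool:
    """Hypothetical per-job token counter: one token = right to run one task."""

    def __init__(self, tokens: int):
        self.total = tokens
        self.in_use = 0

    def try_acquire(self) -> bool:
        # A task may start only while the job still holds a free token.
        if self.in_use < self.total:
            self.in_use += 1
            return True
        return False

    def release(self) -> None:
        # Called when a task finishes and returns its token to the job.
        if self.in_use == 0:
            raise ValueError("no token is currently in use")
        self.in_use -= 1


def launch_ready_tasks(pool: TokenPool, ready):
    """Start ready tasks in order until the job runs out of tokens."""
    started = []
    for task in ready:
        if not pool.try_acquire():
            break
        started.append(task)
    return started
```

With a pool of two tokens, only the first two of three ready tasks start; the third starts once a running task releases its token.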
Architectural Overview
Job Manager (Scheduler): Every Job Manager (JM) is responsible for one job. It receives global cluster information through the collaboration of the Resource Monitor (RM) and the Process Nodes (PNs).
Process Node (PN): Manages a queue of tasks assigned to the server. Cooperation between PNs and the JM: when the JM gives a task to a PN, it passes the task's resource requirements, estimated runtime, and required files. The PN provides feedback to the JM to help improve the accuracy of task runtime estimation.
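The handoff and feedback loop between a JM and a PN might look like the sketch below. All names (`TaskRequest`, `ProcessNode`, `refine_estimate`) are hypothetical, and the exponential moving average is an assumed way to fold feedback into an estimate; the slides do not say how Apollo refines its estimates.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TaskRequest:
    """What the JM passes along with a task (field names are assumptions)."""
    task_id: str
    cpu_cores: int
    memory_gb: float
    estimated_seconds: float
    required_files: list = field(default_factory=list)


class ProcessNode:
    """Hypothetical PN holding a FIFO queue of assigned tasks."""

    def __init__(self):
        self.queue = deque()

    def accept(self, request: TaskRequest) -> None:
        self.queue.append(request)

    def run_next(self, actual_seconds: float):
        # Pop the oldest task and report (task_id, estimated, actual) as feedback.
        request = self.queue.popleft()
        return request.task_id, request.estimated_seconds, actual_seconds


def refine_estimate(old_estimate: float, observed: float, alpha: float = 0.5) -> float:
    # Assumed update rule: blend the old estimate with the observed runtime.
    return (1 - alpha) * old_estimate + alpha * observed
```

For example, a task estimated at 10 seconds that actually took 20 seconds would move the estimate to 15 seconds with `alpha=0.5`.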
Resource Monitor (RM): Provides a global view of cluster status. Collects information from PNs. Builds the wait-time matrix. Note: the RM is not critical. If the RM goes down, Apollo is not hurt much; a JM can still make a locally optimal decision based on feedback from PNs.
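One way to picture the wait-time matrix and the RM fallback is sketched below. It assumes each PN reports a mapping from a resource class `(cpu_cores, memory_gb)` to an expected wait in seconds; that layout and the names `ResourceMonitor`, `expected_wait`, and `view_for_jm` are illustrative assumptions.

```python
class ResourceMonitor:
    """Hypothetical RM that aggregates per-server wait-time matrices."""

    def __init__(self):
        self.matrices = {}

    def report(self, server: str, matrix: dict) -> None:
        # A PN reports {(cpu_cores, memory_gb): expected_wait_seconds}.
        self.matrices[server] = dict(matrix)

    def global_view(self) -> dict:
        return {server: dict(m) for server, m in self.matrices.items()}


def expected_wait(matrix: dict, cpu: int, mem_gb: float):
    """Shortest wait among resource classes large enough for the request."""
    covering = [w for (c, m), w in matrix.items() if c >= cpu and m >= mem_gb]
    return min(covering) if covering else None


def view_for_jm(rm, local_feedback: dict) -> dict:
    # If the RM is unavailable (None), fall back to PN feedback the JM already holds.
    return rm.global_view() if rm is not None else local_feedback


rm = ResourceMonitor()
rm.report("s1", {(1, 2): 0.0, (4, 8): 30.0})
rm.report("s2", {(1, 2): 5.0})
view = view_for_jm(rm, local_feedback={})
```

A request for 2 cores and 4 GB fits only the `(4, 8)` class on "s1" and fits nothing on "s2", so the JM would see a 30-second wait on "s1" and no slot on "s2".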
Task priority and stable matching: The JM analyzes the DAG of tasks in its job. The JM produces independent candidate scheduling decisions and picks the best one. Note: the failure of some scheduling decisions does not hurt overall decision making. The JM uses stable matching to limit the search space. Open question: what if two JMs make decisions that collide?
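As an illustration of stable matching between tasks and servers, the sketch below runs a Gale-Shapley style deferred-acceptance loop. It is a textbook simplification, not Apollo's actual variant: each server holds at most one task, tasks prefer servers with lower estimated completion time, and a server prefers the task with the lower estimate on it. The function name `stable_match` and the `cost` layout are assumptions.

```python
def stable_match(cost):
    """cost[task][server] = estimated completion time. Returns {task: server}."""
    # Each task ranks servers from lowest to highest estimated completion time.
    prefs = {t: sorted(c, key=c.get) for t, c in cost.items()}
    next_choice = {t: 0 for t in cost}
    holder = {}          # server -> task currently held
    free = list(cost)    # tasks that still need a server

    while free:
        task = free.pop()
        if next_choice[task] >= len(prefs[task]):
            continue  # task has proposed to every server; it stays unmatched
        server = prefs[task][next_choice[task]]
        next_choice[task] += 1
        current = holder.get(server)
        if current is None:
            holder[server] = task
        elif cost[task][server] < cost[current][server]:
            # The server trades up; the displaced task proposes again later.
            holder[server] = task
            free.append(current)
        else:
            free.append(task)

    return {task: server for server, task in holder.items()}


cost = {
    "t1": {"a": 5, "b": 9},
    "t2": {"a": 3, "b": 4},
}
assignment = stable_match(cost)
```

Both tasks prefer server "a", but "t2" has the lower estimate there, so "t1" falls back to "b". Each task only ever walks its own ranked list, which is how the matching keeps the search space small.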
Correction mechanism: Unlike other systems, Apollo applies its correction process after a task has been dispatched to a server. The JM reassigns the task to another server if the wait time is much greater than estimated or if a much better placement is found. Apollo also uses randomization to reduce collisions. Apollo assigns weights to the different wait-time matrices to check their accuracy.
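The two reassignment triggers and the randomization step can be sketched as follows. The thresholds `wait_factor` and `gain_factor`, the `spread` of the jitter, and the function names are hypothetical; the slides only say "much greater" and "much better" without giving numbers.

```python
import random


def should_reassign(est_wait, observed_wait, current_completion, best_alternative,
                    wait_factor=2.0, gain_factor=0.5):
    """Decide whether a dispatched task should be moved (assumed thresholds)."""
    # Trigger 1: the task has waited far longer than the JM estimated.
    waited_too_long = observed_wait > wait_factor * est_wait
    # Trigger 2: another server now promises a much earlier completion.
    much_better = best_alternative < gain_factor * current_completion
    return waited_too_long or much_better


def jittered(estimate, rng, spread=0.1):
    """Perturb an estimate by a small random factor.

    With jitter, JMs that see the same cluster state are less likely
    to all pick the same server at once.
    """
    return estimate * (1 + rng.uniform(-spread, spread))


rng = random.Random(0)  # seeded only so the sketch is repeatable
noisy = jittered(100.0, rng)
```

A task estimated to wait 10 seconds that has already waited 25 would be reassigned under these assumed thresholds, while one that has waited 12 would stay put unless an alternative cuts its completion time by more than half.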
Opportunistic scheduling: There are two kinds of tasks in Apollo: regular tasks and opportunistic tasks. Apollo adopts randomized allocation. Apollo runs regular tasks first and uses the remaining resources to run opportunistic tasks. [Figure: resources divided between regular tasks and opportunistic tasks]
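The "regular first, opportunistic with what is left" rule can be sketched on a single server. The sketch assumes capacity is a single number of resource units and that each task is a `(name, units_needed)` pair; `fill_server` is a hypothetical helper, and the randomized allocation mentioned above is not modeled.

```python
def fill_server(capacity, regular, opportunistic):
    """Admit regular tasks first, then fit opportunistic tasks into the remainder.

    Each task is a (name, units_needed) pair. Returns (running_names, free_units).
    """
    running = []
    free = capacity
    # Regular tasks are considered before any opportunistic task.
    for queue in (regular, opportunistic):
        for name, need in queue:
            if need <= free:
                running.append(name)
                free -= need
    return running, free


running, free = fill_server(
    capacity=4,
    regular=[("r1", 2), ("r2", 1)],
    opportunistic=[("o1", 2), ("o2", 1)],
)
```

In this example the regular tasks take 3 of the 4 units, "o1" does not fit in the single unit left, and "o2" uses it, so no capacity sits idle.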