Roadmap for DNS Load Balancing Service at CERN - HEPiX Autumn 2020 Workshop
This roadmap presented by Kristian Kouros on behalf of the DNS Load Balancing Team at CERN outlines the introduction, implementation, and upgrades associated with the DNS Load Balancing Service. It covers topics such as system architecture, LBClient metrics, and the overall structure of the service. The roadmap provides insights into the nodes, aliases, Enquirer(s), and DNS services involved in the load balancing setup. Detailed explanations are given, along with accompanying visual representations to aid in understanding the concepts discussed.
Presentation Transcript
Roadmap for the DNS Load Balancing Service at CERN
Kristian Kouros, on behalf of the DNS Load Balancing Team
HEPiX Autumn 2020 Workshop
Topics
  1. Intro
  2. Implementation
  3. Upgrades
"There is nothing certain, but the uncertain." -Proverb-
DNS LB 101: 1 alias represented by 1 node
  [Diagram: Node, Enquirer(s), DNS Service; steps (1)-(3)]
DNS LB 101: 1 alias represented by N nodes
  Round-robin serving
  Agnostic of node's wellbeing
  DoS for 1 in N users
  [Diagram: N Nodes, Enquirer(s), DNS Service; steps (1)-(3)]
DNS LB 101: N nodes with an Arbiter
  [Diagram: N Nodes, Arbiter, Enquirer(s), DNS Service; steps (1)-(3)]
System architecture
  [Diagram components]
  NODES: Node1, Node2, ... NodeN (LBClient, snmpd)
  ERMIS: REST API, DB
  LBD: Primary, Backup
  DNS Service
  Computing Cloud
LBClient
  Runs on every node that is behind an alias
  Metrics:
    Checks -> tmpfull, nologin, daemon, etc.
    Load -> collectd, lemon, const
  Summarizes the evaluation into an int value
  Best value: the lowest positive
  [Diagram: NODES (Node1, Node2, ... NodeN), LBClient, snmpd]
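The evaluation described on this slide can be sketched in Go as follows. This is a minimal illustration and not the golbclient code: the `Check` type and the `evaluate` function are hypothetical names, and the convention that a failed check yields a negative value is an assumption (the slide only states that the best value is the lowest positive one).

```go
package main

import "fmt"

// Check is a hypothetical health check; it returns true when the node passes.
type Check func() bool

// evaluate folds the checks and a load figure into one int.
// Assumption: any failed check yields -1, and among positive
// values a lower number marks a better candidate.
func evaluate(checks []Check, load int) int {
	for _, ok := range checks {
		if !ok() {
			return -1
		}
	}
	if load < 1 {
		return 1 // keep healthy nodes strictly positive
	}
	return load
}

func main() {
	pass := func() bool { return true }
	fail := func() bool { return false }
	fmt.Println(evaluate([]Check{pass, pass}, 7)) // healthy node reports its load
	fmt.Println(evaluate([]Check{pass, fail}, 7)) // one failed check overrides the load
}
```

Returning on the first failed check keeps a node with a full /tmp or a nologin flag out of the running regardless of how lightly loaded it is.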
LB Daemon (LBD)
  Primary: selects the best nodes per alias
  Backup: performs the same tasks
  Backup checks the Primary's heartbeat: if it is dead, the Backup takes its place
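The heartbeat rule can be expressed as a small predicate. This is only a sketch under assumptions: the function name, the use of Unix seconds, and the timeout value are all hypothetical, since the slides do not say how the heartbeat is transported or timed.

```go
package main

import "fmt"

// backupShouldTakeOver reports whether the backup should start acting as
// primary. Times are Unix seconds; timeoutSec is a hypothetical grace period.
func backupShouldTakeOver(lastHeartbeat, now, timeoutSec int64) bool {
	return now-lastHeartbeat > timeoutSec
}

func main() {
	fmt.Println(backupShouldTakeOver(100, 110, 30)) // heartbeat is recent
	fmt.Println(backupShouldTakeOver(100, 200, 30)) // heartbeat is stale
}
```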
REST Service (ERMIS)
  Self-manage aliases
  UI & CLI paired with MySQL
  What users can CRUD: Alias, Hostgroup, Visibility, Number of best nodes, Cnames, Nodes, etc.
  [Diagram: REST API, DB]
Workflow
  (0) Define alias
  (1) Periodic SNMP GET @ all nodes
  (2) Nodes report load for Alias1:
        Node1: alias1 = -1
        Node2: alias1 = 15
        Node3: alias1 = 7
        NodeN: alias1 = 21
  (3) Calculate best nodes
  (4) Update nodes for Alias1
  (5) Resolution: DNS query for Alias1 returns the IP of Node3
  (6) Usage
  [Diagram: NODES, ERMIS (REST API, DB), Puppet, LBD (Primary, Backup), LBClient, DNS Service, Computing Cloud]
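Step (3) can be illustrated with the values shown on this slide. The sketch below is not the golbd implementation; `bestNodes` is a hypothetical helper, and it assumes that non-positive values (such as Node1's -1) exclude a node while the lowest positive value wins.

```go
package main

import (
	"fmt"
	"sort"
)

// bestNodes returns up to n node names with the lowest positive metric.
// Nodes reporting zero or a negative value are skipped (an assumption).
func bestNodes(metrics map[string]int, n int) []string {
	if n < 0 {
		n = 0
	}
	names := make([]string, 0, len(metrics))
	for name, m := range metrics {
		if m > 0 {
			names = append(names, name)
		}
	}
	// Sort by metric, breaking ties by name so the result is deterministic.
	sort.Slice(names, func(i, j int) bool {
		mi, mj := metrics[names[i]], metrics[names[j]]
		if mi != mj {
			return mi < mj
		}
		return names[i] < names[j]
	})
	if len(names) > n {
		names = names[:n]
	}
	return names
}

func main() {
	alias1 := map[string]int{"Node1": -1, "Node2": 15, "Node3": 7, "NodeN": 21}
	fmt.Println(bestNodes(alias1, 1)) // [Node3]
	fmt.Println(bestNodes(alias1, 2)) // [Node3 Node2]
}
```

With one best node requested, the result is Node3, which matches the IP returned in step (5).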
Let's rewind a bit
  1. Stable service for 10+ years
  2. Each component in a different language: LBD in Perl, LBClient in C, Ermis in Python
  3. Sequential processes
  4. Number of aliases increasing (example: ~50% increase from 2017 to 2019)
Upgrade #1: LBD & LBClient (status: production)
  The Issue: Sequential processes were limiting performance and alias capacity
  The Solution:
    1. Reimplement LBD and LBClient in Golang
         Aliases evaluated in parallel
         Nodes checked only once
    2. Partition aliases, one LBD per partition
  The Result: SNMP querying time reduced from 5 minutes to less than 1 minute
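The two ideas in the solution, parallel evaluation and checking each node only once, can be sketched with goroutines. This is an illustration under assumptions, not the golbd code: `queryAll` is a hypothetical helper, the `query` callback stands in for the SNMP GET, and it assumes a single query per node yields a value that can be reused by every alias the node belongs to.

```go
package main

import (
	"fmt"
	"sync"
)

// queryAll calls query once per distinct node, in parallel, and returns
// the collected values keyed by node name.
func queryAll(aliases map[string][]string, query func(node string) int) map[string]int {
	results := make(map[string]int)
	seen := make(map[string]bool)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for _, nodes := range aliases {
		for _, node := range nodes {
			if seen[node] {
				continue // node shared by several aliases: already scheduled
			}
			seen[node] = true
			wg.Add(1)
			go func(node string) {
				defer wg.Done()
				v := query(node)
				mu.Lock()
				results[node] = v
				mu.Unlock()
			}(node)
		}
	}
	wg.Wait()
	return results
}

func main() {
	aliases := map[string][]string{
		"alias1": {"node1", "node2"},
		"alias2": {"node2", "node3"},
	}
	// Stand-in for an SNMP GET; returns a fake metric.
	fake := func(node string) int { return len(node) }
	fmt.Println(len(queryAll(aliases, fake))) // 3 distinct nodes queried
}
```

The `seen` map is only touched by the scheduling loop, so it needs no lock; the mutex protects the shared `results` map that the goroutines write to.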
Upgrade #2: Node Control (status: production)
  The Issue: There was no way to override the nodes that LBD was receiving from PuppetDB.
  The Solution: Introduce a new configuration element in the Ermis UI & CLI for the user to define blacklisted and/or whitelisted nodes.
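One way such an override could work is shown below. The semantics are assumptions made for illustration only: whitelisted nodes are added to the PuppetDB list, blacklisted nodes are removed, and the blacklist wins on conflict. The function name `applyNodeControl` is hypothetical.

```go
package main

import "fmt"

// applyNodeControl merges the PuppetDB node list with a whitelist and then
// drops every blacklisted node. Order is preserved and duplicates removed.
// Assumption: the blacklist takes precedence over the whitelist.
func applyNodeControl(fromPuppetDB, whitelist, blacklist []string) []string {
	blocked := make(map[string]bool, len(blacklist))
	for _, n := range blacklist {
		blocked[n] = true
	}
	seen := make(map[string]bool)
	var out []string
	for _, list := range [][]string{fromPuppetDB, whitelist} {
		for _, n := range list {
			if blocked[n] || seen[n] {
				continue
			}
			seen[n] = true
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := applyNodeControl(
		[]string{"node1", "node2"}, // reported by PuppetDB
		[]string{"node9"},          // user whitelist
		[]string{"node2"},          // user blacklist
	)
	fmt.Println(nodes) // [node1 node9]
}
```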
Upgrade #3: Security (status: production)
  The Issue:
    Shibboleth SSO
    No 2FA
  The Solution:
    Replace Shibboleth with OpenIDC
    Protect SSH connections with 2FA
Upgrade #4: Ermis REST Back-end (status: development)
  The Issue:
    The Ermis REST API back-end is still in Django/Tastypie, unlike the rest of the components
    High abstraction and undefined data types
  The Solution: Reimplement in Golang + Echo framework + GORM
  The Result: The pros/cons of Golang over Python. Performance not evaluated yet.
Upgrade #5: Cloud Native (status: development)
  The Issue:
    Ermis REST Service, LBD Primary & Backup run on VMs
    Service availability not ideal
  The Solution: Deploy in a Kubernetes cluster
Upgrade #6: Node Alarms (status: development)
  The Issue: Users are not aware of whether the defined number of nodes per alias is being maintained
  The Solution: Introduce a new feature for setting an alarm if the number of nodes crosses the threshold
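A node alarm of this kind could look like the sketch below. It is hypothetical: the slide does not say in which direction the threshold is crossed, so this version assumes the alarm fires when an alias has fewer nodes than a user-defined minimum. The names `nodeAlarm` and `minNodes` are illustrative.

```go
package main

import "fmt"

// nodeAlarm returns a message and true when the alias has fewer nodes than
// minNodes. Assumption: "crossing the threshold" means dropping below it.
func nodeAlarm(alias string, nodes []string, minNodes int) (string, bool) {
	if len(nodes) >= minNodes {
		return "", false
	}
	msg := fmt.Sprintf("alias %s has %d node(s), expected at least %d",
		alias, len(nodes), minNodes)
	return msg, true
}

func main() {
	if msg, fired := nodeAlarm("alias1", []string{"node3"}, 2); fired {
		fmt.Println(msg)
	}
}
```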
Evaluation: Push vs Pull (status: to do)
  The Issue: Even though LBD now works in parallel, it still has to pull data from a lot of nodes.
  Alternative Scenario: Evaluate whether it would be better for the nodes to push instead.
  Feedback or suggestions are welcome
Summary
  Overview of CERN's open-source load balancer for DNS (but not only!)
  Presentation of the upgrades undertaken to meet the new demands of the service
  Project:
    https://github.com/cernops/golbclient
    https://github.com/cernops/golbd
Thank you!
  lb-experts-public@cern.ch
  kristian.kouros@cern.ch