JLab Operations and Planning Overview

This document summarizes the operational status and planning initiatives at Jefferson Lab (JLab), based on an April 2015 presentation by Chip Watson. It covers the lab's compute and storage resources, operations and utilization, the Lustre file system upgrade, computer room upgrades, and 2016 planning for the next USQCD LQCD machine. Key highlights include hardware upgrades, system transitions, and planned procurements intended to improve performance and efficiency.

  • JLab
  • Operations
  • Planning
  • Resource Overview
  • Lustre File System




Presentation Transcript


  1. JLab Status & 2016 Planning
     April 2015 All Hands Meeting
     Chip Watson, Jefferson Lab
     Outline:
       • Operations Status
       • FY15 File System Upgrade
       • 2016 Planning for Next USQCD Resource

  2. JLab Resources Overview
       • 3 IB clusters, 8,800 cores, shrinking to 6,200 July 1
       • 3 GPU clusters, 512 GPUs
           • 48 nodes, quad gaming GPU, going to 36 quads
           • 36 nodes, quad C2050, will shrink as cards fail
           • 42 nodes, quad K20
       • Xeon Phi KNC test cluster, 64 accelerators -> 48
           • Will convert 4 nodes into interactive and R&D nodes
       • 1.3 PB Lustre file system shared with Experimental Physics, 70% LQCD
           • 32 servers (soon to be 23)
           • 8.5 GB/s aggregate bandwidth
       • 10 PB tape library, shared, 10% LQCD
           • LQCD growing at about 40 TB / month
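As a rough illustration of these storage numbers, the back-of-the-envelope Python sketch below estimates the LQCD share of the Lustre system and how long the remaining space would last at the quoted 40 TB/month growth rate. The no-deletion, no-tape-migration assumption is purely illustrative, not a projection from the slides.

```python
# Back-of-the-envelope check using the numbers quoted on this slide.

LUSTRE_TOTAL_TB = 1300        # 1.3 PB shared Lustre file system
LQCD_SHARE = 0.70             # ~70% of Lustre used by LQCD
GROWTH_TB_PER_MONTH = 40      # quoted LQCD growth rate

lqcd_tb = LUSTRE_TOTAL_TB * LQCD_SHARE
print(f"LQCD share of Lustre: ~{lqcd_tb:.0f} TB")

# If LQCD kept growing at 40 TB/month with no deletions or migration to tape
# (an assumption for illustration only), the remaining 30% of the file system
# would be consumed in roughly:
free_tb = LUSTRE_TOTAL_TB * (1 - LQCD_SHARE)
print(f"Months until the remaining space fills: ~{free_tb / GROWTH_TB_PER_MONTH:.0f}")
```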

  3. Operations & Utilization
       • LQCD running well
       • Utilization chart (not included in this transcript): colors are different USQCD projects/users; note that the peak is above the 8,800 cores owned by USQCD
       • JLab load balances with Experimental Physics, which can consume nodes during our slow months. (No penalties this past year.)
       • LQCD is now consuming unused farm cycles (debt shown in a chart in the original slides)

  4. Lustre File System
       • 1.3 PB across 32 servers, shared with Experimental Physics
           • Aggregates bandwidth, helping both hit higher peaks
           • Allows more flexibility in adjusting allocations quickly
           • As the 12 GeV program ramps up, the split will move to 50% each
       • Now upgrading to version 2.5.3
           • OpenZFS RAID-z2, full RAID check on every read
           • Will rsync across IB, project by project, starting in May
           • Will drain and move servers; as 1.8 shrinks, 2.5 grows
       • 3 new servers (1 LQCD) will allow decommissioning 2009 hardware (12 oldest, smallest servers)
       • Soon to procure newer, higher-performance system(s) to replace 2010 hardware and increase total bandwidth to 10-12 GB/s
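The project-by-project rsync migration mentioned above could be driven by a small script; the Python sketch below is a minimal illustration only. The mount points /lustre18 and /lustre25 and the project names are hypothetical placeholders, not the actual JLab paths or procedure.

```python
# Minimal sketch of a project-by-project migration driver, assuming the old
# Lustre 1.8 file system is mounted at /lustre18 and the new 2.5/ZFS file
# system at /lustre25 (both paths are hypothetical).
import subprocess
from pathlib import Path

OLD_ROOT = Path("/lustre18/volatile")   # hypothetical source mount
NEW_ROOT = Path("/lustre25/volatile")   # hypothetical destination mount

def sync_project(project: str) -> None:
    """Copy one project tree with rsync, preserving attributes and sparse files."""
    src = OLD_ROOT / project
    dst = NEW_ROOT / project
    dst.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["rsync", "-aHS", "--delete", f"{src}/", f"{dst}/"],
        check=True,
    )

for proj in ["projA", "projB"]:   # placeholder project names
    sync_project(proj)
```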

  5. Computer Room Upgrades
       • To meet the DOE goal of a PUE of 1.4, power and cooling are being refurbished in 2015
           • New 800 kW UPS
           • 3 new 200 kW air handlers (plus a refurbished 180 kW unit)
           • All file servers, interactive nodes, etc. will move to dual-fed power, one side of which will be generator backed (99.99% uptime)
       • Transitions
           • Chilled water outage later this month (1-2 days)
           • Rolling cluster outages to relocate and re-rack to 18-20 kW/rack, as opposed to 10-12 kW today
           • Anticipate 2 days of outage per rack (3-4 racks at a time) plus 4 days of full-system outage over the next 7 months, so <2% for the year; JLab will augment x86 capacity by 2% to compensate
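The downtime and PUE figures above can be sanity-checked with a little arithmetic. The Python sketch below is an illustration only; the 800 kW IT load used for the PUE line is simply the UPS capacity taken as a stand-in, not a measured load.

```python
# Back-of-the-envelope check of the "<2% for the year" downtime estimate.

# Every rack is down ~2 days for re-racking. Summed over all racks, that is
# roughly equivalent to the whole machine being down ~2 days, since only
# 3-4 racks are offline at any one time.
rolling_equiv_days = 2.0      # full-system-equivalent days lost to rolling outages
full_outage_days = 4.0        # planned full-system outage

lost_fraction = (rolling_equiv_days + full_outage_days) / 365.0
print(f"Estimated capacity lost over the year: {lost_fraction:.1%}")   # ~1.6%

# PUE for reference: PUE = total facility power / IT equipment power.
it_power_kw = 800.0           # hypothetical IT load (the UPS capacity as a stand-in)
target_pue = 1.4
print(f"Facility power budget at PUE {target_pue}: {it_power_kw * target_pue:.0f} kW")
```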

  6. 2016 Planning

  7. 2016 LQCD Machine
       • The 5-year plan for 2015-2019 has leaner budgets (40% less hardware), with no hardware funds in 2015, so the project plans to combine funds into 2 procurements (current plan of record):
           • FY16 & FY17 into a 2-phase procurement of ~$1.96M
           • FY18 & FY19 into a 2-phase procurement of ~$2.65M
       • Process: the timeline & process are the same as in previous years.
       • The goal is also the same: optimize the portfolio of machines to get the most science out of the portfolio of applications.

  8. The Probable Contenders: x86? GPU? Xeon Phi? A combo?
       • Latest conventional x86, Pascal GPU, Xeon Phi / Knights Landing
       • Likely configurations for each:
           • x86: dual socket, 16-core Xeon (64 threads), 1:1 QDR or 2:1 FDR
           • GPU: quad GPU + dual socket (thousands of threads/GPU, on-package high bandwidth memory); quad GPU to amortize the cost of the host, OR dual GPU to minimize Amdahl's Law; either way this is a fatter node and therefore needs higher-speed InfiniBand per node, FDR or faster
           • Xeon Phi: single socket, 64+ core Xeon Phi (256+ threads, 512-bit SIMD, on-package high bandwidth memory)
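The dual-GPU vs. quad-GPU trade-off above is essentially an Amdahl's Law question. The Python sketch below works through it with illustrative numbers; the 5% host-side fraction and 10x per-GPU speedup are assumptions for the example, not measurements from any USQCD code.

```python
# Minimal Amdahl's Law sketch for the dual-GPU vs. quad-GPU trade-off.

def amdahl_speedup(serial_frac: float, parallel_speedup: float) -> float:
    """Overall node speedup when only (1 - serial_frac) of the work accelerates."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / parallel_speedup)

serial_frac = 0.05            # assumed host-side (non-GPU) fraction of runtime
per_gpu = 10.0                # assumed speedup contributed by each GPU

for n_gpus in (2, 4):
    node_speedup = amdahl_speedup(serial_frac, n_gpus * per_gpu)
    print(f"{n_gpus} GPUs/node: node speedup ~{node_speedup:.1f}x "
          f"(ideal would be {n_gpus * per_gpu:.0f}x)")
```

With these assumptions the quad-GPU node is faster in absolute terms (amortizing the host), but captures a smaller fraction of its ideal speedup than the dual-GPU node, which is the Amdahl's Law tension the slide describes.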

  9. KNL Many Core
       • Not an accelerator. Not a heterogeneous architecture. An x86 single-socket node.
       • Better core than KNC:
           • Out-of-order execution
           • Advanced branch prediction
           • Scatter/gather
       • 8 on-package MCDRAM devices, up to 16 GB
       • 6 DDR4 ports, up to 384 GB
       • 1 MB L2 cache per 2-core tile (figure shows up to 72 cores, if all are real & operational)
       • https://software.intel.com/en-us/articles/what-disclosures-has-intel-made-about-knights-landing
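For quick reference, the Python sketch below tallies the thread, tile, and cache counts implied by the numbers on this slide, taking the 72-core figure at face value; actual shipping parts may differ.

```python
# Quick tally of the KNL figures quoted on this slide (illustrative only).

cores = 72                    # "up to 72 cores" from the slide figure
threads_per_core = 4          # 4 hardware threads per KNL core
tiles = cores // 2            # 1 tile = 2 cores sharing 1 MB of L2
l2_mb = tiles * 1             # 1 MB of L2 per tile

print(f"Hardware threads: {cores * threads_per_core}")     # 288
print(f"Tiles: {tiles}, aggregate L2: {l2_mb} MB")          # 36 tiles, 36 MB
print("On-package MCDRAM: up to 16 GB; DDR4 (6 ports): up to 384 GB")
```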

  10. Time to Consider a New Architecture
       • Xeon Phi software maturity is growing
           • 2013 saw LQCD running at TACC / Stampede (KNC)
           • Optimized Dirac operator matched the performance of a contemporary GPU
       • Additional developments are under way on multiple codes, driven by large future resources:
           • Cori in 2016, with 9,300+ chips; followed by ANL's Theta (KNL) in 2016, with 2,500 chips; and ANL's Aurora (KNH, Knights Hill) in 2018, with 50,000 nodes

  11. Other Significant Changes
       • Both Pascal and Knights Landing will have on-package memory: high bandwidth, memory mapped (or used as cache, but probably better directly managed). What is happening now to enable use of this feature in 15 months?
       • Pascal will have the new NVLink. Details still under NDA.
       • Intel will have an on-chip network that can replace InfiniBand, but the timeline is still under NDA (certainly in time for Aurora).
       • GPU-POWER processor coupling with NVLink. Will this significantly reduce Amdahl's Law hits? What will we need to do to exploit this?

  12. Community Participation: Very Important!
       • This next machine will replace all of the ARRA hardware (which by that time will be gone), while also increasing total USQCD project resources. When it turns on, it might represent as much as 50% of the JLab + FNAL + BNL LQCD resources.
       • Questions:
           1) What are the best representative applications to characterize a large fraction of our activities on USQCD-owned resources? (Inverters are a part, but more is needed.)
           2) For what applications does CUDA code exist? Will it exist?
           3) Who is working to prepare Xeon Phi code? What is anticipated to exist by early 2016? What is the minimum we should have to fairly evaluate Pascal vs. Knights Landing?

  13. Additional Questions
       4) How much memory for each of the three architectures? For GPUs, how much host memory is needed compared to GPU memory? (We'll need to understand the % gain that comes from doubling memory, to compare to the cost of that upgrade.)
       5) What will it take to exploit on-package memory? (It can be similar to QCDOC: fast, memory mapped.)
       6) What applications are significantly disk I/O bound? Is anyone dependent upon the performance of random I/O? (i.e., is it time for SSD, or just better servers?)
       Please help the project in making the best selection by providing your input!
