Challenges and Solutions in High Energy Physics Computing
High Energy Physics computing faces significant challenges from the increasing complexity of event data and from growing storage needs. To address this, investments in resources, advances in technology, and possibly shifts in research focus are suggested. The HSF initiative, started in 2014, aims to tackle these issues through community collaboration and innovative strategies. The experiments have already made internal optimizations, but additional measures are needed to meet future demands.
- High Energy Physics
- Computing Challenges
- Resource Management
- Technological Innovations
- Collaborative Strategies
Presentation Transcript
HSF AND WLCG STRATEGY DOCUMENTS - Tommaso Boccali (INFN Pisa)
THE PROBLEM #1
We have collected only ~3% of the total LHC programme so far.
[Plot residue: projections of the total integrated storage needed and of the single-event complexity/size, both rising steeply towards the HL-LHC and HE-LHC.]
THE PROBLEM #2: Overall resource increase if we do nothing
Why? Some numbers and resource scaling:
- Trigger rates: 1 kHz to 5-10 kHz (factor 5-10x)
- More crowded events (pile-up from 35 to 200) and a more complex detector (more acquisition channels)
- RAW event size: 1 to 5 MB/event (factor 5x)
- Reconstruction times scale worse than linearly: at least a factor 10x slower
- Storage: (5-10) x 5, i.e. up to 50x
- CPU: (5-10) x 10, i.e. up to 100x
At the same time, if we assume technology ("Moore's law and friends") keeps improving at 20%/year, a factor 5-6 is gained back. We are still missing factors of 5-20, so "doing nothing" is not an option.
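The arithmetic above can be repeated as a back-of-envelope calculation. The sketch below uses only the factors quoted on this slide; the number of years of technology evolution (taken here as 9, roughly 2017 to 2026) is an assumption, and the result shifts with it.

```python
# Back-of-envelope sketch of the HL-LHC resource-scaling argument above.
# All scaling factors are taken from the slide; the time span is an assumption.

trigger_rate_factor = (5, 10)      # 1 kHz -> 5-10 kHz
raw_event_size_factor = 5          # 1 MB/event -> 5 MB/event
reco_time_factor = 10              # reconstruction at least 10x slower
tech_gain_per_year = 0.20          # assumed "Moore and friends" improvement
years = 9                          # roughly 2017 -> 2026 (assumption)

storage_factor = tuple(r * raw_event_size_factor for r in trigger_rate_factor)  # (25, 50)
cpu_factor = tuple(r * reco_time_factor for r in trigger_rate_factor)           # (50, 100)
tech_gain = (1 + tech_gain_per_year) ** years                                   # ~5.2

print(f"storage needs grow by {storage_factor[0]}x - {storage_factor[1]}x")
print(f"CPU needs grow by     {cpu_factor[0]}x - {cpu_factor[1]}x")
print(f"technology gives back ~{tech_gain:.1f}x")
print(f"missing factor (CPU)  ~{cpu_factor[0]/tech_gain:.0f}x - {cpu_factor[1]/tech_gain:.0f}x")
```

With these inputs the missing factor for CPU comes out around 10-20x, consistent with the "we miss factors 5-20" statement on the slide.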
SOLUTIONS?
- Please add money to HEP computing. Today INFN funds resources for O(2 MEur) at the Tier-2 level and O(2 MEur) at the Tier-1 level (personnel and maintenance/operations costs NOT included). A 5x increase means, as a rule of thumb, that the whole CSN1 budget would be needed.
- Hope for technological breakthroughs: a leap in Moore's law? Quantum computing? Photonic CPUs?
- Do less physics (for example, reconstruct only electrons with p_T > 5 GeV). This severely limits the discovery potential of the HL-LHC experiments.
- Invest in long-term R&D, which can be top-down or bottom-up.
Idea: make a first bottom-up attempt and gather ideas (even absurd ones), while making more people, funding agencies and experiments aware of the problem; later, collapse it into something with a little more top-down structure. HSF is the bottom-up part of the effort.
HSF
- Initiated in 2014; activity level raised in 2017 with two large workshops.
- Preparation of the Community White Paper: bottom-up ideas organised in a range of Work Packages, signed by individuals (no funding-agency commitments, no experiment commitments).
- A condensed roadmap document is available on arXiv.
STATUS FROM THE EXPERIMENTS
Of course the experiments (*) were not sitting and waiting: internal optimizations, reduction of the number of storage copies, and reduction of MC production lead to year-to-year savings. Still, numbers from CMS and ATLAS speak of a factor ~20x with respect to 2017 resources, which becomes an excess of a factor ~5x once technology evolution ("Moore and friends") is factored in. This is the size of the problem today, and what we need to earn back.
(*) Mostly ATLAS and CMS: ALICE and LHCb live on a different LHC timescale, with their major upgrades in 2021, and are basically already done with modelling and R&D.
THE WLCG STRATEGY DOCUMENT
If you want, the start of the top-down process: WLCG taking note of the CWP, and the experiments starting to plan around it. The WLCG strategy document is a specific view of the CWP, prioritizing the R&D lines relevant to the HL-LHC computing challenge. It is signed by the experiments, it will be reviewed by the LHCC and eventually approved, and it is the first step towards a Computing TDR in 2022 (new!).
WHAT IS INSIDE? SOME GENERAL POINTS / TRENDS
- Review the split between online and offline. ALICE is already trying a merge; even if not at that level, go to virtualized systems which can be moved to the online or offline world when needed.
- Can we simulate HL-LHC computing in such a way that we can understand the effect on total cost of adjusting parameters? Example: is it worth reducing the disk by 30% at the price of doubling the WAN traffic? (A toy sketch of such a parametric cost model follows below.)
- Tune the data model: find a small format which covers most of the analyses.
- Expand to (expensive) HPC systems for peak loads versus buying more CPUs. What is the cost of an inefficient use of HPC?
- Use virtual data: reproduce derived data on the fly instead of storing it (i.e. store the "recipes").
- Heterogeneous computing: on-the-fly discovery and utilization of local accelerators / GPUs / FPGAs. Will a 2026 standard machine _always_ have a GPU?
- Is ML going to be used in 1-10-50-90% of the algorithms? Without code duplication? OpenCL? Client-server mode?
- And the big one: how many centers? Where to put the storage? Where to put the CPUs?
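As referenced in the list above, a parametric cost model can start out very simple: total cost as a function of a few knobs. The sketch below is purely illustrative; the unit costs, baseline volumes, and the HS06-based CPU accounting are placeholder assumptions, not numbers from this document.

```python
# A minimal toy of the "parametric cost model" idea: total cost as a function of
# disk volume, CPU capacity and WAN traffic, so that trade-offs such as
# "30% less disk for 2x WAN traffic" can be compared. All numbers are made up.

def total_cost(disk_pb, cpu_khs06, wan_pb_per_year,
               eur_per_pb_disk=100_000, eur_per_khs06=10_000, eur_per_pb_wan=2_000):
    return (disk_pb * eur_per_pb_disk
            + cpu_khs06 * eur_per_khs06
            + wan_pb_per_year * eur_per_pb_wan)

baseline = total_cost(disk_pb=500, cpu_khs06=3_000, wan_pb_per_year=400)
scenario = total_cost(disk_pb=500 * 0.7,            # 30% less disk ...
                      cpu_khs06=3_000,
                      wan_pb_per_year=400 * 2)      # ... at the price of 2x WAN traffic

print(f"baseline: {baseline/1e6:.1f} MEUR, scenario: {scenario/1e6:.1f} MEUR")
```

A real cost model would of course need validated unit costs and many more parameters (operations effort, tape, network provisioning), but the point of the exercise is exactly to expose such trade-offs as adjustable knobs.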
DISTRIBUTED COMPUTING: SOME GENERAL IDEAS
- Data is the most precious asset from the LHC: keep our data safe; it stays in centers we own.
- CPU is almost stateless (at the level of a few hours) and can be anywhere, provided it can access the data (remote access) and we can efficiently ship data to it (static or on-demand network).
- Data handling should become an infrastructure problem, no longer handled by the experiments.
This is at the heart of the data lake model (see the talk by D. Cesini; the lake itself is covered in the next talk), but there is more. The implications for anything outside the lake:
1. Sites outside the lake can have storage, but it would not be centrally managed; most probably it is (a) a place to store your ntuples, (b) a place to store your private MC productions, or (c) simply a cache to make remote access faster (see the sketch after this list).
2. Sites can be storageless: an HPC center, or the online farm of another experiment.
3. Network is of capital importance here, both internal to the lake and outside it; there is space / need for SDNs or similar ("next month I get an Amazon grant, I need a good connection to my site").
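A cache in front of a storageless or loosely managed site can be pictured as a simple read-through cache over the WAN: the first access fetches the file from the lake, later accesses are local. The sketch below is a toy model under assumed file sizes, cache capacity and an LRU eviction policy; none of these parameters come from the document.

```python
# Toy read-through cache for a site outside the data lake. Paths, sizes and the
# eviction policy (simple LRU) are illustrative assumptions only.

from collections import OrderedDict

class ReadThroughCache:
    def __init__(self, capacity_gb, remote_read):
        self.capacity_gb = capacity_gb
        self.remote_read = remote_read          # callable fetching a file from the lake
        self.files = OrderedDict()              # path -> size_gb, ordered by last use
        self.wan_traffic_gb = 0.0

    def read(self, path, size_gb):
        if path in self.files:                  # cache hit: local access
            self.files.move_to_end(path)
            return
        self.remote_read(path)                  # cache miss: go to the data lake
        self.wan_traffic_gb += size_gb
        self.files[path] = size_gb
        while sum(self.files.values()) > self.capacity_gb:
            self.files.popitem(last=False)      # evict the least recently used file

# Example: a 10 TB cache in front of a site that re-reads the same ntuples.
cache = ReadThroughCache(capacity_gb=10_000, remote_read=lambda path: None)
for _ in range(5):                              # five analysis passes
    for i in range(100):                        # over 100 files of 20 GB each
        cache.read(f"/lake/ntuples/file_{i}.root", size_gb=20)
print(f"WAN traffic: {cache.wan_traffic_gb:.0f} GB for 10000 GB of logical reads")
```

In this toy the working set fits in the cache, so only the first pass crosses the WAN; the interesting production question is how the hit rate degrades when it does not.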
DATA AND COMPUTE INFRASTRUCTURES
[Architecture diagram: a data lake of distributed, regional and volatile storage exposed through high-level services, storage interoperability services and asynchronous data transfer; a content delivery and caching infrastructure; and compute provisioning spanning Grid, Cloud, HPC and @HOME resources at each site.]
POSSIBLE DISTRIBUTED COMPUTING CONFIGURATIONS
One: a single data lake roughly equivalent to today's T1s, connected at least at 100 Gbit/s. T1s can host CPUs, or not. T2s become compute nodes (outside the lake, with storage only as a cache); HPC centers and commercial clouds act as compute nodes. Experiments see a single storage entry point, compared to today's ~200 (life is good).
Two: we have two or even more lakes, which have to have peerings between them (probable, e.g. if the US does not want to join a EU/CERN-driven data lake). Experiments' life becomes more complex: they see more than one storage endpoint, but still far fewer than 200.
Three: no data lake node in Italy (?). All our centers are compute nodes, but we can afford a large cache in front of them, so that most accesses are local; there is, however, no custodiality commitment.
Four: the data lake is composed of logical, not physical, nodes: the total storage of the Italian computing centers is seen behind a single entry point. All T1s and T2s become part of the logical center which joins the lake; HPC centers and commercial clouds act as compute nodes.
THE IMPORTANCE OF SOFTWARE
The general feeling is that the HL-LHC problem cannot be solved by the computing infrastructure alone:
- Data lake(s) can reduce the number of copies of a given file thanks to remote access, but not below one copy (and we are not far from that even now).
- CPU cycles can become easier to obtain by allowing HPC, commercial and diskless sites, but the total number of cycles needed stays the same.
What can help is the ability to use new architectures. From 2000 until now, HEP computing has used the most performant hardware at a given price: commodity x86 + Linux machines, exactly what we use. Already now, and for the foreseeable future, the best cycles/EUR come from GPUs, FPGAs, even ASICs. Can we use them? How? At which (human) cost? Can we trust something (compilers, OpenCL, ...) to make them easily usable?
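As an illustration of what using new architectures can look like at the application level, here is a minimal sketch of on-the-fly accelerator discovery with a CPU fallback. It assumes the CuPy library as the GPU backend and uses a made-up toy kernel; it is not part of any experiment's software.

```python
# Hedged sketch of "on the fly discovery and utilization of local accelerators":
# use a GPU array library when one is available, fall back to plain NumPy otherwise.
# CuPy mirrors the NumPy API, so the same kernel code runs on both backends.

import numpy as np

try:
    import cupy as xp           # GPU backend, only if CuPy and a GPU are present
    xp.zeros(1)                 # force a device check
    BACKEND = "gpu"
except Exception:
    xp = np                     # CPU fallback
    BACKEND = "cpu"

def sum_pt(px, py):
    """Toy 'reconstruction' kernel: scalar sum of transverse momenta."""
    return float(xp.sqrt(px * px + py * py).sum())

px = xp.asarray(np.random.normal(0.0, 10.0, 1_000_000))
py = xp.asarray(np.random.normal(0.0, 10.0, 1_000_000))
print(f"backend={BACKEND}, sum pT = {sum_pt(px, py):.1f}")
```

The human cost question raised above is exactly about how much real reconstruction code can be written in this backend-agnostic style, and how much needs architecture-specific rewrites.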
A PROBABLE FUTURE
Technology is now driven by the commercial clouds (by far the biggest buyers in the market) and eventually by the HPC market (mainly for showcasing superior technology). In both cases, the easiest future to predict is an ubiquitous presence of GPUs. Why? In the commercial world, ML GPUs are seen via TensorFlow rather than via code we need to program ourselves; in HPC, GPUs are the best Linpack performers; and they will be the cheapest per dollar. This means most of the Flops in a chip will come from vector operations.
Three ways to use them:
1. Specialized code: can be done for a small, mission-critical part of the software (tracking, ...).
2. Use OpenCL or friends to get at least some of the benefits.
3. (Use your faith:) use Machine Learning whenever possible. A major experiment starting with "A" is already claiming that all of its reconstruction will be ML based.
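The third route rests on the observation that once an algorithm is expressed as a model in a framework such as TensorFlow, the framework decides whether it runs on CPU or GPU and the physics code does not change. The tiny "hit classifier" below is a made-up stand-in for a real reconstruction step, shown only to illustrate that mechanism.

```python
# Illustration of "ML GPUs are seen via TensorFlow": the same model code runs on
# whatever devices TensorFlow finds. The model and data are placeholder toys.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Random placeholder data: 8 features per "hit", binary signal/background label.
x = tf.random.normal((1024, 8))
y = tf.cast(tf.random.uniform((1024, 1)) > 0.5, tf.float32)

model.fit(x, y, epochs=1, verbose=0)             # runs on a GPU if one is visible
print("devices:", [d.device_type for d in tf.config.list_physical_devices()])
```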
CONCLUSIONS
- The process to understand / simulate / plan for 2026+ HL-LHC computing has started. On paper, we are short by a factor O(5x). On the positive side, a 5x shortfall is roughly what technology evolution alone would give back if everyone went on vacation for 10 years; not happening!
- Ideas on how to reduce costs are available: use architectures that are cheaper (per flop) than x86_64; reduce storage costs by federating storage areas and limiting duplication; allow the use of unconventional resources such as HPC, diskless sites, and commercial clouds.
- How? It is clear that these are all still at the level of ideas: R&D is of capital importance, test beds will need to follow, and new levels of competence are needed.