Overview of Virgo Computing Activities
Virgo computing has been a hot topic recently, with several discussions and meetings on computing issues, on future developments of astroparticle computing, and on funding for INFN experiments. This overview covers the recent presentations and committee meetings, the computing challenges faced by Virgo and Advanced LIGO, the storage capabilities at the EGO/Virgo site, the data-transfer rates towards the computing centres, and the need to adapt to evolving requirements for data analysis and storage in current and future runs.
Presentation Transcript
Virgo computing - Michele Punturo (Virgo Week, July 2018)
Computing is a hot topic? Virgo computing has been a hot topic in the last weeks:
- 15/06/2018: presentation of ET computing issues and activities to the INFN Commissione Calcolo e Reti
- 18/06/2018: computing issues discussed at the VSC
- 25/06/2018: discussion on future developments of astroparticle computing in INFN (Virgo invited to the INFN presidency together with CTA, KM3 and Euclid)
- 02/07/2018: External Computing Committee (ECC) meeting at EGO (the ECC is appointed by the EGO Council)
- 04/07/2018: discussion on the 2019 funds for all the INFN experiments at the Tier-1
- 08/07/2018: talk at the Virgo Week
- 19/07/2018: C3S meeting at the INFN presidency on computing challenges
Slides Recycling
Advanced Virgo data flow (diagram): at the EGO/Virgo site the DAQ (50 MB/s, 84 MB/s full raw data) feeds detector characterisation, low-latency analysis and detection, h(t) data transfer, and temporary storage and DBs. The Reduced Data Set (RDS) is transferred to LIGO at 0.87 MB/s via LDR, while the raw data are transferred at 45-50 MB/s via GridFTP to the CNAF and CCIN2P3 computing centres (Tier0-1, offline data analysis); Nikhef, SurfSARA and PolGRID provide further offline analysis through GRID and local access, and Advanced LIGO data reach Virgo via LDR.
Storage at the EGO/Virgo site:
- The storage capability at EGO is currently about 1 PB: 50% is devoted to home directories, the archiving of special data and the output of the local low-latency analysis; 50% is devoted to the circular buffer used to store the raw data locally.
- At 50 MB/s the data lifetime before overwriting is less than 4 months: too short for commissioning purposes, and not enough to keep O3 on disk.
- This situation is due to a rapid evolution of the requirements, which is making the previous computing-model specifications obsolete:
  - increase of the DAQ data-writing rate: nominal ~22 MB/s, O2 ~37 MB/s, current ~50 MB/s;
  - requests by commissioners to keep data from special periods stored locally for commissioning and noise hunting;
  - requests by low-latency analysts for disk space for the outputs of their analyses.
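To make the "less than 4 months" figure concrete, here is a minimal back-of-the-envelope check; the exact buffer size is an assumption (roughly half of the ~1 PB quoted above).

```python
# Back-of-the-envelope check of the circular-buffer lifetime quoted above.
BUFFER_TB = 500          # assumed buffer size: ~50% of the ~1 PB at EGO, in TB
WRITE_RATE_MB_S = 50     # current raw-data writing rate in MB/s

seconds = BUFFER_TB * 1e6 / WRITE_RATE_MB_S   # convert TB to MB, divide by rate
days = seconds / 86400
print(f"{days:.0f} days, i.e. about {days / 30:.1f} months before overwriting")
# -> ~116 days, just under 4 months, consistent with the slide
```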
Storage (continued):
- The shortage of disk space at the site also raises the risk of losing scientific data. Nota bene: if the incident at CNAF had occurred during O3, Virgo would have lost science data.
- We had to pass through a Council meeting, a STAC meeting and an ECC meeting; finally we got the green light to purchase the storage, and the order has been submitted by EGO.
O2 experience on computing and data transfer (DT): a sequence of internal and external issues affected the data transfer toward the CCs during O2. (Computing STAC, 23/05/2017.)
O2 issues highlighted by DT:
- Problem (1): iRODS hanging toward CCIN2P3
- Problem (2): unidentified midnight slow-down
- Problem (3): Grid certificate expiration
- Problem (4): saturation of the disk-to-tape transfer at CCIN2P3
- Problem (7): similar issue at CNAF
- Problem (5): freezing of the storage due to lack of disk space
- Problem (6): firewall freezing
Discussion with the CCs: we had two meetings with the computing centres.
- 16/01/2018, first meeting: presentation of the problems, discussion, hypotheses, and some solutions suggested on data management (keyword: DIRAC), workload management (keyword: DIRAC) and Virgo software replication (keyword: CernVM-FS). From the CCs to Virgo: a request for requirements; from Virgo to the CCs: a request for common solutions.
- 07/05/2018, second meeting: the solution for data transfer was discussed and a possible common solution was proposed by the CCs.
Strategy toward O3:
- Radically reduce the data-loss risk by purchasing a large storage at EGO: almost solved.
- Improve the reliability and availability of the computing architecture at EGO:
  - bandwidth increase from 1 Gb/s to 10 Gb/s: tested toward the GARR PoP, some issues toward the CCs (see later);
  - a high-availability firewall has been installed;
  - a new data-transfer engine? Virgo request: use the same protocol for CNAF and CCIN2P3; the CCs' answer: WebDAV.
Clean solutions to improve security in critical domains (new firewall):
- separate the ITF control and operation network from the others
- separate the on-line analysis network
- introduce a separate VTF (Virgo Test Facility) network (needed also for reliability)
- introduce a separate R&D network
- introduce 2-factor authentication in the most critical domains
- introduce internal intrusion/anomaly detection probes (reactive approach)
- reorganize the remote access procedures in the new topology (VPN, proxies, ...)
New firewall: fast path (for selected hosts), access control only, 9.32/9.55 Gbps; normal path, deep inspection. See Stefano's talk.
Federated login and identity management: latest news.
- Federated login: we recently requested to join the IDEM federation as the EGO IdP. Quick solution: working with GARR to provide the ego-gw.it IdP as a service in the GARR cloud, connected to our AD database, and enter IDEM/eduGAIN in a few days.
- Federate the Virgo web applications, starting from the most critical one for collaborative detection-event management: the VIM (Virgo Interferometer Monitor) web site. The application needs to be split into an internal instance and an external federated one (ongoing); federated authentication for ligo.org or ego-gw.it users, and direct Virgo user authentication only when defined in the LV Virtual Organization common database.
- For the next web applications: discussing with GARR the setup of an SP/IdP proxy pilot for a more flexible configuration.
- LSC plans for authorization and IdM: gradually provide federated services to the LV federated identities via COmanage (as in the gw-astronomy instance).
- Caveats: ligo.org identities (accounts) are still needed to access LDG computing resources; in addition, Virgo users still need to complement their ligo.org account with their personal certificate subject.
AAI scheme (diagram, recycled from VW April 2018): the EGO SP/IdP behind the firewall serves the interferometer networks (DAQ, controls, electronics, monitoring, ...) and the web applications (VIM, VIM replica, SIR, TDS), providing single sign-on and federated login through the IDEM federation and eduGAIN (Renater, SURFnet, eduID, Wigner, Polish federation), with COmanage as IdMS; the LIGO Lab and the LSC universities offer access and services through eduGAIN.
Bulk data transfer:
- O3 requirements: data writing at 50 MB/s; 100 MB/s sustained (parallel) data transfer per remote site; 150-200 MB/s peak data transfer per remote site; the same protocol/solution for the two sites; a reliable login procedure.
- O2: iRODS + username/password login at CCIN2P3; GridFTP + certificate at CNAF.
- Solution proposed (previously) by the CCs: WebDAV. Tests performed:
  - CNAF: login issues (certificate); throughput always > 100 MB/s with peaks of about 200 MB/s.
  - CCIN2P3: easy to log in, but performance issues; throughput of about 12 MB/s, up to 30 MB/s, using WebDAV (100 MB/s using iRODS).
- Long discussion at the ECC meeting: waiting for feedback from Lyon; a test with FTS has been proposed; 180 MB/s since Friday, thanks to the migration of the iRODS server serving Virgo at CCIN2P3 onto a 10 Gb/s link.
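As an illustration of the kind of throughput test described in the list above, here is a minimal sketch of a timed WebDAV upload; the endpoint URL, file name and credential paths are hypothetical placeholders, and the real CNAF/CCIN2P3 setups (authentication, storage paths, parallel streams) differ.

```python
import os
import time
import requests  # third-party HTTP library (pip install requests)

ENDPOINT = "https://webdav.example-cc.it/virgo/raw/"      # hypothetical endpoint
LOCAL_FILE = "V-raw-1234567890-100.gwf"                   # example frame file name
CERT = ("/path/to/usercert.pem", "/path/to/userkey.pem")  # X.509 credentials

size_mb = os.path.getsize(LOCAL_FILE) / 1e6
start = time.time()
with open(LOCAL_FILE, "rb") as f:
    # WebDAV uploads are plain HTTP PUT requests
    resp = requests.put(ENDPOINT + os.path.basename(LOCAL_FILE), data=f, cert=CERT)
resp.raise_for_status()
elapsed = time.time() - start
print(f"{size_mb:.0f} MB in {elapsed:.1f} s -> {size_mb / elapsed:.1f} MB/s")
```

A sustained-rate test would repeat this over many frame files, possibly with several parallel streams.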
Low-latency analysis machines:
- The number of machines devoted to MBTA has been doubled, from 160 to 320 cores.
- An additional ~180 cores have been devoted to detchar and cWB (Condor farm).
- The quick investment and installation were made possible by the fact that a virtual-machine architecture had already been tested and approved for low-latency analysis.
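For context, jobs on a Condor farm like the one mentioned above are driven by HTCondor submit descriptions; below is a minimal, hypothetical sketch using the HTCondor Python bindings (recent versions expose Schedd.submit directly), with placeholder executable and file names rather than the actual detchar/cWB jobs.

```python
import htcondor  # HTCondor Python bindings

# Hypothetical job description: not the actual detchar/cWB configuration.
sub = htcondor.Submit({
    "executable": "run_detchar.sh",        # placeholder wrapper script
    "arguments": "$(ProcId)",
    "output": "detchar.$(ProcId).out",
    "error": "detchar.$(ProcId).err",
    "log": "detchar.log",
    "request_cpus": "1",
})

schedd = htcondor.Schedd()                 # local scheduler daemon
result = schedd.submit(sub, count=10)      # queue 10 jobs on the farm
print("Submitted cluster", result.cluster())
```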
CentOS 7 (CL7) + Python (F. Carbognani's slides):
- The ctrlXX and farmnXX machines were upgraded to CentOS point release 7.5 (from 7.4) without known impacts.
- A possible minor problem with CMT-built executables was discovered and is under investigation, but the assumption that upgrades of the OS minor version can be done transparently for the Virgo software seems to hold.
- The transition from tcsh to bash seems stabilized (at least, no problems reported by users in the last weeks).
- The Python installation was upgraded (e.g. gracedb client 1.29.dev1) for the OPA challenge; the Anaconda distribution and pip-based installations seem to work fine.
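As a small usage example of the upgraded GraceDB client mentioned above, the sketch below queries candidate events, assuming suitable credentials are already configured; the query string is an arbitrary example, and real low-latency use goes through the dedicated OPA infrastructure rather than ad-hoc scripts.

```python
from ligo.gracedb.rest import GraceDb  # pip install ligo-gracedb

client = GraceDb()                      # defaults to the production GraceDB server
for event in client.events("MBTA"):     # example query string: events from the MBTA pipeline
    print(event["graceid"], event.get("far"))
```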
Software release plans (F. Carbognani's slides):
- The software release cycles toward O3 have started, following the guidelines defined in the dedicated "Virgo Software Release Management Process" document.
- The main driver for the release cycles is to stay synchronized, as far as possible, with the Virgo Commissioning Runs (CR), the OPA challenges, the common LIGO/Virgo Engineering Runs (ER) and the foreseen code freezes.
- Software release timeline:
  - VCS-10.0: 18 May (CL7 wrap-up / 1st CR)
  - VCS-10.1: 29 June (snapshot for the 1st OPA challenge)
  - VCS-10.2: 10 September (snapshot for the 2nd OPA challenge, TBC)
  - VCS-11.0: 1 October (code freeze in view of ER13, see G1800889)
Software release plans (continued):
- The minor software release VCS-10.1 took a snapshot of the code for the 1st OPA challenge; not much difference with respect to 10.0, since few packages were provided for the /virgoApp upload.
- VCS-11.0 will manage the foreseen 1st October milestone (code freeze): software features frozen, software under formal review. From there: fixes approved by the common LIGO/Virgo SCCB, new features approved by the run committees.
- Be ready for much more pressing requests for timely code freezes from your (hated?) Software Manager.
Virgo software distributions (F. Carbognani's slides):
- A CVMFS-based distribution mechanism for the Virgo software releases is being tested. A test setup is up and running in Cascina, from swtest6 (server machine) to swtest3 (client machine). A production machine equipped with sufficient disk area is being prepared, and interaction with the Virgo computing centres will start soon.
- Experimentation with container technology (Docker and Singularity) for the RefOS implementation, to be used as a software-distribution and software-testing environment, is ongoing.
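To illustrate how a CVMFS-distributed container image could be used once a production setup exists, here is a purely hypothetical sketch; the repository name and image path are placeholders, not the actual Virgo distribution.

```python
import subprocess

# Hypothetical CVMFS-published Singularity image for a RefOS-like environment.
IMAGE = "/cvmfs/virgo.example.org/containers/refos-cl7.sif"

# Run a command inside the container; 'singularity exec' is the standard CLI.
subprocess.run(["singularity", "exec", IMAGE, "python", "--version"], check=True)
```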
Offline analysis: there is an unresolved, long-standing issue about the under-use of the Virgo computing resources at the CCs. Only CW is substantially using CNAF and, less regularly, Nikhef; the other pipelines (mainly at CCIN2P3) have a negligible CPU impact.
CPU hours over 52 weeks (Sept 2016 - Sept 2017), network of computing centres (pie chart): ATLAS-AEI 51%, LIGO-CIT 23%, NEMO-UWM 5%, ARCCA-CDF 4%, VIRGO.CNAF 4%, LIGO-LHO 3%, LIGO.OSG 3%, IUCAA 2%, LIGO-LLO 2%, SUGAR-SU 1%, VIRGO.NL 1%, VIRGO.OSG 1%, VIRGO.CCIN2P3 0%, VIRGO.POLGRAV 0%, other 6%; Virgo accounts for ~6-8% of the total.
- LIGO Scientific Collaboration: 1263 collaborators (including GEO), 20 countries, 9 computing centres, ~1.5 G$ of total investment.
- Virgo Collaboration: 343 collaborators, 6 countries, 5 computing centres, ~0.42 G€ of total investment.
- KAGRA Collaboration: 260 collaborators, 12 countries, 5 computing centres, ~16.4 G¥ of construction costs.
Computing load distribution: the use of CNAF is almost mono-analysis; diversification is given only by OSG access. (Plot: monthly CNAF usage, with nearly every bin dominated by CW jobs.)
Future: increase of CPU power needs.
- The O3 run will start in February 2019 and will last for 1 year.
- We are signing a new agreement with LIGO that requires us to provide about 20-25% of the whole computing power.
- 3 detectors: non-linear increase of the load for the coherent pipelines (HL, HV, LV, HLV networks); see the sketch below.
- 4 detectors: at the end of O3 KAGRA will probably join the science run.
- Some of the pipelines will be tested on (or based on) GPUs.
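A toy illustration of why adding detectors increases the load non-linearly for coherent pipelines, as referenced in the list above: the number of detector networks of two or more instruments grows combinatorially (the actual CPU cost per network depends on the pipeline and is not captured here).

```python
from math import comb

def n_networks(n_detectors: int) -> int:
    """Count detector combinations of size >= 2 (toy proxy for coherent networks)."""
    return sum(comb(n_detectors, k) for k in range(2, n_detectors + 1))

print(n_networks(2))  # HL only -> 1
print(n_networks(3))  # HL, HV, LV, HLV -> 4
print(n_networks(4))  # with KAGRA -> 11
```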
How to fill up the resources:
- OK, let us suppose we find the money to provide the 25% Virgo quota: are we able to use it?
- Within a parallel investigation made for INFN, I asked the (Italian) DA people to project their needs/intentions over the next years.
- With a series of caveats, the projection is shown in the plot: estimated HS06 needs per year from 2019 to 2026 (up to ~200,000 HS06), broken down into pyCBC-OSG, cWB, CW and GRB, with the O3 and O4 data-taking periods marked.
HPC resources:
- Numerical relativity is now a key element for the production of BNS templates.
- In Virgo there are important activities thanks to the groups in Torino and Milano Bicocca; they make intensive use of the CINECA resources within the CSN4 framework and through some grants.
- Since they participate structurally in the LV data analysis, we need to provide computing resources. Requests:
  - 2018-2019: 2M GPU hours (e.g. on Galileo at CINECA) [i]; 6M CPU hours on Intel BDW/SKL (Marconi A1/A3 at CINECA) [ii]; 50 TB of disk space [iii].
  - 2020-2023: 6M GPU hours per year [i]; 6M CPU hours per year [ii]; 150 TB of disk space [iii].
Hence:
- We need to contribute new resources (MoU constraint), but we are currently unable to use them or to give access to LIGO-like pipelines. We need a solution to this situation.
- Replicate the LIGO environment: local installation of Condor as a wrapper of the local batch system.
- New workload manager + data manager: DIRAC. Positive WM tests; the data-management part is unclear to me; data transfer still to be tested. We are progressing too slowly. (A minimal submission sketch is shown after this list.)
- LIGO is going in the direction of using Rucio as DM while remaining on Condor as WM.
- Future development: LIGO is testing Singularity + CVMFS for virtualisation and distribution. That technology is pursued at the LHC and supported by our CCs; we need to invest in it. A post-doc is being recruited at INFN-Torino to be engaged in that activity.
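As an illustration of the DIRAC workload-management tests mentioned in the list above, here is a minimal submission sketch, assuming an installed and configured DIRAC client with a valid proxy; the job name, executable and CPU-time values are placeholders.

```python
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)   # initialise the DIRAC environment

from DIRAC.Interfaces.API.Job import Job
from DIRAC.Interfaces.API.Dirac import Dirac

job = Job()
job.setName("virgo-wm-test")                          # placeholder job name
job.setExecutable("/bin/echo", arguments="hello from DIRAC")
job.setCPUTime(3600)                                  # placeholder CPU-time limit (s)

result = Dirac().submitJob(job)
print(result)   # S_OK/S_ERROR dictionary; the job ID is in result["Value"] on success
```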
Cost model:
- In the current cost model EGO reimburses the costs at the French and Italian CCs, while Nikhef and PolGrav contribute in kind.
- This model balances the costs between Italy and France, but it puts the largest fraction of the costs on EGO's shoulders.
- Moving the bill back to the funding agencies does not work, because INFN would be by far the largest contributor without further balancing.
- We need to find a cost model that shares the computing costs within Virgo in a smarter way: it should take into account the number of authors and the global investment in computing (DAQ, low-latency analysis, storage, human resources), and it should force the Virgo members to account for their resources.
Cost model (continued): this is not a definitive proposal, but we need to open a discussion, also in front of the EGO Council and the institutions. We have to define our standards:
- needs and standard cost of the storage;
- needs and standard cost of the CPU;
- accessibility requirements (LIGO compliant, Virgo compliant, ...);
- accountability requirements (resources need to be accountable);
- human-resource requirements (a collaboration member MUST be the interface).
Then compute a Virgo standard cost per author (a toy example follows this list): each institution in Virgo has to provide, in kind and/or in money, resources proportional to its number of authors according to the standard figures we defined. Ghosts are expensive! Over-contributions can be considered a net contribution to the experiment by the institution; obviously we must also take into account the direct contributions to EGO.
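A toy numerical sketch of the per-author sharing idea described above; the standard cost and the author counts are invented numbers, not an agreed Virgo figure.

```python
# Hypothetical standard cost per author (arbitrary unit, e.g. kEUR/author/year).
STANDARD_COST_PER_AUTHOR = 2.0

institutions = {            # hypothetical institutions and author counts
    "Institution A": 60,
    "Institution B": 25,
    "Institution C": 10,
}

for name, n_authors in institutions.items():
    share = n_authors * STANDARD_COST_PER_AUTHOR
    print(f"{name}: {n_authors} authors -> {share:.0f} units/year (in kind and/or money)")
```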
2019: who pays?
- In these days INFN is defining the investments at the Tier-1 (CNAF); the decision will be taken in September.
- We currently have 30 kHS06; in the original plan it was expected to jump to 70-80 kHS06, which is difficult in terms of cost and efficiency.
- What do we ask for 2019?
  - Tape is defined (1 PB).
  - Disk: we have 656 TB (578 TB occupied); O3 adds 1 MB/s of Virgo RDS + 2 MB/s of LIGO data, about 90 TB, plus disk for DA. The suggestion is to request a pledge of 780 TB (+124 TB); a rough check is sketched below.
  - CPU: 30 kHS06, 40 kHS06, ...?
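A rough cross-check of the disk numbers in the list above, using only the rates and volumes quoted on the slide.

```python
SECONDS_PER_YEAR = 365 * 86400

# O3 data accumulating on disk at CNAF: 1 MB/s Virgo RDS + 2 MB/s LIGO data.
o3_tb = (1 + 2) * SECONDS_PER_YEAR / 1e6      # MB -> TB over a 1-year run
print(f"O3 RDS volume: ~{o3_tb:.0f} TB")      # ~95 TB, quoted as ~90 TB

current_tb, pledge_tb = 656, 780
print(f"Pledge increase: +{pledge_tb - current_tb} TB")   # +124 TB, as on the slide
```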
Organisation:
- I was appointed as VDAS coordinator in an emergency in September 2015.
- Since the beginning of my mandate, which ended in September 2017, I have highlighted the need to improve the organisation of VDAS (changing also its name!).
- I recently proposed a new structure for VDAS, dividing it into work packages and identifying the main projects.
- This organisation was shown at the Virgo Week in April and proposed to the VSC; the decision is pending, and a suggestion from this committee is more than welcome.
Proposed organisation chart (diagram): a Virgo Computing Coordination body interfaces with the Spokesperson / EGO Director, the DA coordinator, the LIGO interface, the commissioning coordinator and the computing centres. On the local-area side it covers online computing management (low-latency architecture, dedicated hardware management), the local computing and storage infrastructure (management and strategy, networking infrastructure, cyber-security, local services such as federated login and web servers) and local data allocation and management. On the wide-area side it covers offline computing (analysis pipelines, GRID compatibility, subsystem software management, evaluation of offline computation needs), the data management system and bulk data transfer.
Reference persons at the CCs: in addition, it is crucial to have a MEMBER OF THE VIRGO COLLABORATION, fully devoted to computing issues and physically or virtually located at each computing centre, acting as reference person (post-doc level). He/she participates in the collaboration life (computing meetings, DA meetings, ...), but has the duty to solve (or facilitate the solution of) all the issues related to the use of that CC by the collaboration.
Conclusions:
- Computing is a crucial part of the detector and a key element of the LIGO-Virgo agreement.
- It is time for the collaboration (and for all the funding agencies) to take it seriously.
- As stated at the last VSC and reported in the minutes, today I consider concluded the extra time I devoted to my appointment as VDAS coordinator (which officially ended in September 2017).
- I hope that a VDAS coordinator will continue to exist, and I wish all the best to my successor.