Enhancing Grid Site Computing Resources through BOINC Implementation
This presentation discusses the full utilization of grid site computing resources using BOINC, focusing on NGI_CZ and CESNET cluster resources, along with strategies for better resource utilization and the implementation of BOINC for LHC community support.
Presentation Transcript
FULL UTILIZATION OF GRID SITE COMPUTING RESOURCES USING BOINC
Jiří Chudoba, Alexandr Mikula, Aleš Prchal (CESNET and FZU)
EGI Conference 2021, 21.10.2021, virtual
NGI_CZ GRID RESOURCES
- CESNET represents NGI_CZ: the national research and education network, part of the eInfra.cz consortium
- 3 active sites registered:
  - praguelcg2: WLCG Tier-2 center, plus Fermilab VOs and astroparticle VOs; a distributed site, mostly at FZU; individual projects contribute to the resources
  - prague_cesnet_lcg2: CESNET contribution to EGI HTC Grid resources
  - CESNET-MCC: CESNET contribution to EGI cloud resources
- More resources for CZ users via MetaCentrum: clusters distributed across many sites, PBSPro batch system
NGI_CZ GRID RESOURCES: CESNET cluster resources
- 2068 job slots (HT on) provided by 3 subclusters (29 servers in total); 1024 cores added in Dec 2020; 30 kHS06
- KVM hypervisors for services
- HTCondor-CE, HTCondor
- SE: DPM, 900 TB in 6 servers (400 TB added in June 2021)
- Network: 2x10 Gbps
- 1 VOMS server (out of 2) hosted on VMware in another location
- WLCG T2: 10000 job slots, 7 PB disk space
TIER-2 PRAGUELCG2
- ATLAS and ALICE VOs: almost continuous production
- High priority for local users
CESNET SITE
- Some periods of unused cores
OPTIONS FOR BETTER UTILISATION
- Add an LHC VO: relatively small number of cores, so a bad ratio of effort to benefit
- Add other VOs not connected to CZ groups: would increase the load on support
- HTCondor job flocking: may be interesting, but has unknown side effects
- BOINC for LHC community support: should be easy to operate
BOINC IMPLEMENTATION
- A common account is used for many instances; if the account name matches the site name, the contribution is visible in the ATLAS accounting
- BOINC is also used for backfilling standard sites (D. Cameron: "Adapting ATLAS@Home to trusted and semi-trusted resources", CHEP 2019)
BOINC IMPLEMENTATION
- Standalone clients work well for desktops; VirtualBox is used, with issues around kernel modules
- Issues when running on worker nodes: long, never-ending jobs; BOINC usage did not always drop to 0 when another workload started (see the preferences sketch below); manual interventions were required (some scripts are available)
- Another attempt followed the HTCondor manual for backfilling, which does not support segmentation by cores
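For context on the "usage did not drop to 0" problem: a common way to ask a standalone client to yield to other load is BOINC's global preferences override file. A minimal sketch follows, assuming the standard preference elements suspend_cpu_usage, max_ncpus_pct, and cpu_usage_limit; the values are illustrative, not the site's actual settings. In practice this back-off was not always prompt, which is what led to the manual interventions above.

    <!-- global_prefs_override.xml: sketch only, illustrative values -->
    <global_preferences>
      <!-- suspend BOINC computation when non-BOINC CPU usage exceeds 25% -->
      <suspend_cpu_usage>25.0</suspend_cpu_usage>
      <!-- while allowed to run, BOINC may use all cores at full throttle -->
      <max_ncpus_pct>100.0</max_ncpus_pct>
      <cpu_usage_limit>100.0</cpu_usage_limit>
    </global_preferences>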
BOINC IMPLEMENTATION
The current implementation is based on the HTCondor wiki manual. BOINC jobs are allowed to run on every 4th job slot, and each BOINC job uses exactly 4 CPU cores. For each job slot in an Unclaimed state, the local startd periodically triggers a fetch_work_boinc script, which generates a ClassAd file for a BOINC job; the startd then executes this job ClassAd. If the job slot has been unclaimed for more than 10 minutes, the job requirements are met and the boinc-client binary is executed. Because of a RANK statement in a configuration file, BOINC jobs have lower priority and can therefore be evicted whenever a regular grid job is waiting for a free job slot (a configuration sketch follows below).
- Runs in Singularity containers
- Receives SIGTERM when a standard job starts
- 7-12 GB of disk space is used per 4-core BOINC job
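A minimal sketch of the startd-side configuration, built from HTCondor's generic fetch-work hook knobs (STARTD_JOB_HOOK_KEYWORD, <Keyword>_HOOK_FETCH_WORK, FetchWorkDelay) that the wiki manual relies on; the script path, slot layout, delay, and the IsBoincJob attribute are illustrative assumptions, not the site's actual file.

    # Sketch of a startd configuration for BOINC backfilling; paths, slot
    # counts, timings, and attribute names are illustrative assumptions.

    # 4-core slots; per-slot variants (SLOT<N>_JOB_HOOK_KEYWORD) allow
    # enabling the hook on only every 4th job slot.
    SLOT_TYPE_1      = cpus=4
    NUM_SLOTS_TYPE_1 = 8
    STARTD_JOB_HOOK_KEYWORD = BOINC

    # Hook invoked periodically on idle slots; it prints a job ClassAd on
    # stdout, and empty output means "no work".
    BOINC_HOOK_FETCH_WORK = /usr/local/libexec/condor/fetch_work_boinc
    FetchWorkDelay        = 300

    # Rank regular grid jobs above fetched BOINC work, so a waiting grid
    # job evicts the BOINC payload (which then receives SIGTERM).
    RANK = ifThenElse(IsBoincJob =?= True, 0, 10)

    # The fetch script might emit a job ClassAd along these lines:
    #   Cmd          = "/usr/bin/boinc_client"
    #   RequestCpus  = 4
    #   IsBoincJob   = True
    #   Requirements = (State == "Unclaimed") && ((time() - EnteredCurrentActivity) > 600)

The 10-minute condition from the slide maps naturally onto the fetched job's Requirements, so the BOINC payload only starts on slots that have stayed unclaimed long enough.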
GREAT EFFICIENCIES
[Plot: efficiency over the last 30 days for the praguelcg2 BOINC resource]
CONTRIBUTION TO ATLAS@HOME
- BOINC resources for ATLAS contribute ~5%
CONCLUSIONS
The ATLAS@Home application manages to fill unused resources effectively, with minimal effort and no observed negative influence on standard jobs.