Status of BESIII Distributed Computing
This report covers the status of BESIII Distributed Computing, including system and site status, private production, central storage solutions, the monitoring system, a VM performance study in cloud computing, cloud storage, and available resources and sites. It also discusses BOSS software deployment, support for BOSS 6.6.5, cloud status, user job status, and user job data transfer.
Status of BESIII Distributed Computing. Xianghu Zhao, on behalf of the BESIII Distributed Computing Group. BESIII Workshop, March 2015.
Outline
- System and site status
- Private production status
- Central storage solutions
- Monitoring system
- VM performance study in cloud computing
- Cloud storage
- Summary
Resources and Sites
 #  Site Name            Type     OS       CPU Cores    SE Type   SE Capacity  Status
 1  CLOUD.IHEP.cn        Cloud    SL6      264          dCache    214 TB       Active
 2  CLUSTER.UCAS.cn      Cluster  SL5      152          -         -            Active
 3  CLUSTER.USTC.cn      Cluster  SL6      200 ~ 1280   dCache    24 TB        Active
 4  CLUSTER.PKU.cn       Cluster  SL5      100          -         -            Active
 5  CLUSTER.WHU.cn       Cluster  SL6      100 ~ 300    StoRM     39 TB        Active
 6  CLUSTER.UMN.us       Cluster  SL5/SL6  768          BeStMan   50 TB        Active
 7  CLUSTER.SJTU.cn      Cluster  -        100          -         -            Active
 8  GRID.JINR.ru         Grid     SL6      100 ~ 200    dCache    30 TB        Active
 9  GRID.INFN-Torino.it  Grid     SL       200          StoRM     30 TB        Active
10  CLUSTER.SDU.cn       Cluster  -        -            -         -            Testing
11  CLUSTER.BUAA.cn      Cluster  -        -            -         -            Testing
    Total                                  1864 ~ 3504            387 TB
- CPU resources total about 2000 cores; storage capacity is about 387 TB
- Some CPU resources are shared with site-local users
BOSS Software Deployment
- The following BOSS versions are currently available for distributed computing: 6.6.2, 6.6.3, 6.6.3.p01, 6.6.4, 6.6.4.p01, 6.6.4.p02, 6.6.4.p03, 6.6.5
- Versions 6.6.2, 6.6.3, 6.6.3.p01, and 6.6.4 have been updated to accommodate distributed computing
- Verification results can be found under the directory /besfs/users/zhaoxh/verify_dist/boss (see the sketch below)
- The following random trigger files are deployed: round02, round03, round04, round05, round06, round07
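Not part of the original slides: a minimal sketch, assuming (hypothetically) that the verification output is organized as one subdirectory per BOSS release under the directory quoted above, of how one could check which releases have verification results.

```python
import os

# Hypothetical layout assumption: one subdirectory per BOSS release under the
# verification directory quoted on the slide.
VERIFY_DIR = "/besfs/users/zhaoxh/verify_dist/boss"
RELEASES = ["6.6.2", "6.6.3", "6.6.3.p01", "6.6.4",
            "6.6.4.p01", "6.6.4.p02", "6.6.4.p03", "6.6.5"]

for release in RELEASES:
    present = os.path.isdir(os.path.join(VERIFY_DIR, release))
    print(f"BOSS {release:<10} {'verification results found' if present else 'not found'}")
```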
BOSS 6.6.5 Support
- BOSS 6.6.5 is already supported by the distributed computing system, on sites running SL6
Cloud Status
- Cloud computing has been opened to private users
- More storage has been added to the cloud computing nodes, allowing more virtual machines to run stably
- The database backend of the OpenNebula cloud has been switched from SQLite to MySQL, which improves performance and avoids unresponsive states
User Job Status
- More users are using distributed computing
- In total, more than 93,000 user jobs have completed successfully since the last collaboration meeting
User Jobs Data Transfer
- 7 TB of data have been transferred to IHEP
- Reconstruction jobs require more data to be transferred than analysis jobs
Improvements for GangaBoss
- Job submission to distributed computing has been sped up: more jobs can be submitted at one time on the lxslc login nodes, and submission is much faster
- The way to use custom BOSS packages has been simplified
- SL6 and BOSS 6.6.5 are supported
- These improvements will be provided soon in the next version
New Functions in GangaBoss
- Users can specify more than one output file type; if no file type is specified, the output files come from the last step
- Output of .rec files is also supported in reconstruction jobs, with no changes needed in the user script
- All output files can be downloaded with the besdirac-dms-dataset-get command (a hypothetical Python wrapper is sketched below)
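Not from the slides: a hypothetical example of driving the besdirac-dms-dataset-get command mentioned above from Python. The dataset name and the assumption that the command takes it as a plain argument are illustrative only; consult the BESDIRAC documentation for the actual usage.

```python
import subprocess

# Placeholder dataset name; the real naming scheme and command options are
# assumptions here, not taken from the slides.
dataset = "User_someuser_somedataset"
subprocess.run(["besdirac-dms-dataset-get", dataset], check=True)
```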
Support
- The following job types are now supported: simulation; simulation + reconstruction; simulation + reconstruction + analysis
- User custom packages are supported: custom generators and user analysis packages
- A detailed user guide has been provided on the wiki (an illustrative submission sketch follows this list):
  - How to submit a BOSS job to distributed computing: http://boss.ihep.ac.cn/~offlinesoftware/index.php/BESDIRAC_User_Tutorial
  - How to submit different types of BOSS jobs: http://docbes3.ihep.ac.cn/~offlinesoftware/index.php/BESDIRAC_BOSS_Job_Guide
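Not from the slides: a purely illustrative sketch of what a GangaBoss submission script for a simulation + reconstruction job might look like. The class and attribute names below are assumptions, not the verified GangaBoss interface; the wiki pages linked above document the real one.

```python
# Illustrative only: intended to run inside a ganga session, where GPI objects
# are predefined. Every attribute name below is an assumption, not the
# verified GangaBoss API -- see the wiki tutorial linked above.
app = Boss()
app.version = "6.6.5"                    # one of the deployed BOSS releases
app.optsfile = "jobOptions_sim.txt"      # hypothetical simulation job options
app.recoptsfile = "jobOptions_rec.txt"   # hypothetical reconstruction job options

j = Job(application=app, backend=Dirac())  # submit through BESDIRAC
j.submit()
```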
Plans
- Support analysis jobs on existing DST files
- Upload the full user package, to reduce the difficulty of working out exactly which files need to be uploaded
Job Splitter to Choose
- There are two kinds of splitters: split by run and split by event
- Split by run is recommended for users:
  - More sites can be used (currently only UMN supports split-by-event jobs)
  - Job running time is shorter than for split-by-event jobs
  - Lower storage pressure on sites (UMN has encountered performance problems when there are too many split-by-event jobs)
- A generic sketch of by-event splitting follows this list
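Not BESIII code: a generic sketch of how a by-event splitter chops a requested number of events into fixed-size jobs, using the 10,000 events-per-job size quoted in the cloud storage test later in this report.

```python
# Generic sketch only, not the actual BESDIRAC splitter.
def split_by_event(total_events, events_per_job):
    """Return the number of events handled by each by-event job."""
    full, rest = divmod(total_events, events_per_job)
    return [events_per_job] * full + ([rest] if rest else [])

# Example: 10 million events at 10,000 events per job -> 1000 jobs.
print(len(split_by_event(10_000_000, 10_000)))  # 1000
```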
Data Transfer Using StoRM+Lustre
- On Dec. 10th, local users at UMN produced a DsHi dataset of 3.3 TB in 36,328 files; it is difficult to transfer that amount of data to IHEP by scp or rsync
- The dataset was transferred from UMN to IHEP by our SE transfer system; on the IHEP side, the destination SE is IHEP-STORM (a StoRM+Lustre testbed)
- The data is accessible on Lustre right after it is transferred; no upload or download is needed
- The transfer speed was 35 MB/s, with a one-time success rate above 99%
- This shows the feasibility of transferring data from Lustre at one site to Lustre at another (a rough time estimate follows)
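A rough back-of-the-envelope estimate (not from the slides) of what a sustained 35 MB/s means for this 3.3 TB dataset:

```python
# Rough estimate only; assumes a steady 35 MB/s and 1 TB = 10**6 MB.
dataset_tb = 3.3
rate_mb_per_s = 35.0

seconds = dataset_tb * 1e6 / rate_mb_per_s
print(f"~{seconds / 3600:.0f} hours (~{seconds / 86400:.1f} days)")  # ~26 hours, ~1.1 days
```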
Job Read/Write Using StoRM+Lustre
- From Jan. 19th to Mar. 4th, 103k CEPC MC production jobs used StoRM+Lustre as central storage
- In total, 11 TB of input data were read from /cefs and 41 TB of output data were written to /cefs, with only 4% failures
- From the user's point of view, jobs read input data from /cefs and write output data to /cefs; no data operations (upload or download) are needed
- 11 TB of input data read from Lustre and 41 TB of output data written to Lustre; 90.8% success rate, 4.01% SE read/write errors (a quick per-job breakdown follows)
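A small sanity check (not from the slides) of what those totals imply per job:

```python
# Per-job averages implied by the totals quoted above (assuming 103,000 jobs
# and 1 TB = 10**6 MB); a rough check, not measured numbers.
n_jobs = 103_000
input_tb, output_tb = 11.0, 41.0
failure_rate = 0.04

print(f"input per job:  ~{input_tb * 1e6 / n_jobs:.0f} MB")   # ~107 MB
print(f"output per job: ~{output_tb * 1e6 / n_jobs:.0f} MB")  # ~398 MB
print(f"failed jobs:    ~{n_jobs * failure_rate:,.0f}")       # ~4,120
```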
Site Summary
- A site summary page has been added to the monitoring system
- More detailed information will be added
Tests by Submitting Jobs
- New tests are now easier to add
- A history graph is also available for each test
Introduction to Cloud Storage
- Suitable for sites without an SE
- Could also support split-by-event jobs at sites that cannot mount all the random trigger files on every computing node
Test for Cloud Storage
- The MucuraFS client is deployed on 5 cloud computing testbeds
- Random trigger files of round06 are prepared on the cloud storage
- 1000 reconstruction jobs, split by event, with run range [30616, 31279] and 10,000 events per job
- Test results: a high success rate (96.6% on the IHEP cloud), but CPU efficiency is much lower and execution time is much longer for nodes outside IHEP
Future Plan
- Further strengthen user support: user tutorials will be provided regularly if needed, and more improvements will be made according to user feedback
- Support analysis jobs on existing DST files
- Upload the full user work area for simplicity and integrity
- Make cloud resources easier to manage centrally
- Improve the monitoring system
- Develop an accounting system
- Put more effort into making the system more robust: push usage of the mirror offline database and implement real-time synchronization; consider a redundant central server to avoid a single point of failure
Summary
- The distributed computing system is in good shape and handling user jobs well
- Private user production is well supported, with several improvements
- In the central storage tests, StoRM+Lustre performed well and could be used for real jobs
- The monitoring system has been upgraded and a new page has been developed
- Cloud storage has been tested and could be an alternative way of providing random trigger file access
Thanks for your attention! You are welcome to use the system and to send your feedback!