Status of BESIII Distributed Computing
BESIII Workshop, Mar 2015
Xianghu Zhao
On Behalf of the BESIII Distributed Computing Group
Outline
System and site status
Private production status
Central storage solutions
Monitoring system
VM performance study in cloud computing
Cloud storage
Summary
2
Resources and Sites
# | Site Name | Type | OS | CPU Cores | SE Type | SE Capacity | Status
1 | CLOUD.IHEP.cn | Cloud | SL6 | 264 | dCache | 214 TB | Active
2 | CLUSTER.UCAS.cn | Cluster | SL5 | 152 | | | Active
3 | CLUSTER.USTC.cn | Cluster | SL6 | 200 ~ 1280 | dCache | 24 TB | Active
4 | CLUSTER.PKU.cn | Cluster | SL5 | 100 | | | Active
5 | CLUSTER.WHU.cn | Cluster | SL6 | 100 ~ 300 | StoRM | 39 TB | Active
6 | CLUSTER.UMN.us | Cluster | SL5/SL6 | 768 | BeStMan | 50 TB | Active
7 | CLUSTER.SJTU.cn | Cluster | | 100 | | | Active
8 | GRID.JINR.ru | Grid | SL6 | 100 ~ 200 | dCache | 30 TB | Active
9 | GRID.INFN-Torino.it | Grid | SL | 200 | StoRM | 30 TB | Active
10 | CLUSTER.SDU.cn | Cluster | | | | | Testing
11 | CLUSTER.BUAA.cn | Cluster | | | | | Testing
  | Total | | | 1864 ~ 3504 | | 387 TB |
CPU resources are about 2,000 cores and storage about 387 TB
Some CPU resources are shared with site local users
3
BOSS Software Deployment
Currently the following BOSS versions are available for distributed computing:
6.6.2, 6.6.3, 6.6.3.p01, 6.6.4, 6.6.4.p01, 6.6.4.p02, 6.6.4.p03, 6.6.5
Versions 6.6.2, 6.6.3, 6.6.3.p01, and 6.6.4 have been updated to accommodate distributed computing
The verification results can be found under the directory /besfs/users/zhaoxh/verify_dist/boss
The following random trigger files are deployed:
round02, round03, round04, round05, round06, round07
4
BOSS 6.6.5 Support
BOSS 6.6.5 is already supported by the distributed computing system
Available on sites running SL6
5
Cloud Status
Cloud computing is now open to private users
Storage on the cloud computing nodes has been extended
This allows more virtual machines to run stably
The database backend of the OpenNebula cloud has been switched from SQLite to MySQL (see the configuration sketch below)
Improves performance
Avoids situations where the service becomes unresponsive
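For reference, this switch is made in the DB section of OpenNebula's oned.conf. The block below is a minimal sketch only: the server, port, user, password, and database name are placeholder values, not the settings actually used on the IHEP cloud.

```
# oned.conf -- database backend configuration
# Default SQLite backend:
# DB = [ BACKEND = "sqlite" ]
# MySQL backend (connection values below are placeholders):
DB = [ BACKEND = "mysql",
       SERVER  = "localhost",
       PORT    = 0,              # 0 = use the default MySQL port
       USER    = "oneadmin",
       PASSWD  = "oneadmin",
       DB_NAME = "opennebula" ]
```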
6
PRIVATE PRODUCTION STATUS
 
7
User Job Status
More users are using distributed computing
In total, more than 93,000 user jobs have been completed successfully since the last collaboration meeting
8
User Jobs Data Transfer
7 TB of data transferred to IHEP
Reconstruction jobs require more data transfer than analysis jobs
9
Improvement for GangaBoss
Job submission to distributed computing has been sped up
More jobs can be submitted at one time on the lxslc login nodes
Submission speed is much faster
The way of using custom BOSS packages has been simplified
Support for SL6 and BOSS 6.6.5
Will soon be provided in the next version
10
New Function in GangaBoss
Users can specify more than one output file type
If no file type is specified, the output file comes from the last step (see the sketch below)
Output of .rec files is also supported in reconstruction jobs
No changes to the job script are needed
All output files can be downloaded with the “besdirac-dms-dataset-get” command
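The default rule above can be summarized in a short sketch. The snippet below is purely illustrative Python, not actual GangaBoss code; the step names and the step-to-file-type mapping are assumptions made only for this example.

```python
# Illustrative sketch of the output-type selection rule described above.
# NOT GangaBoss code: step names and the step -> file-type mapping are assumed.
STEP_OUTPUT = {
    "sim": "rtraw",  # simulation output (assumed type name)
    "rec": "rec",    # reconstruction output (.rec files)
    "ana": "root",   # analysis output (assumed type name)
}

def select_output_types(steps, requested=None):
    """Return the output file types a job should keep.

    steps     -- ordered list of job steps, e.g. ["sim", "rec", "ana"]
    requested -- optional list of file types chosen by the user; if omitted,
                 only the output of the last step is kept (the default rule).
    """
    if requested:
        return list(requested)
    return [STEP_OUTPUT[steps[-1]]]

print(select_output_types(["sim", "rec"]))                    # ['rec']
print(select_output_types(["sim", "rec"], ["rtraw", "rec"]))  # ['rtraw', 'rec']
```

The files registered this way are then fetched after the job finishes with the besdirac-dms-dataset-get command mentioned above.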
11
Support
These job types are supported now
Simulation
Simulation + Reconstruction
Simulation + Reconstruction + Analysis
User custom packages are supported
Custom generators
User analysis packages
A detailed user guide is provided on the wiki
How to submit a BOSS job to distributed computing:
http://boss.ihep.ac.cn/~offlinesoftware/index.php/BESDIRAC_User_Tutorial
How to submit different types of BOSS jobs:
http://docbes3.ihep.ac.cn/~offlinesoftware/index.php/BESDIRAC_BOSS_Job_Guide
12
Plan to Do
Support analysis jobs on existing DST files
Full upload of the user package
This removes the difficulty of working out exactly which files need to be uploaded
13
Job Splitter to Choose
There are two kinds of splitters (see the sketch below)
Split by run
Split by event
Split by run is recommended for users
More sites can be used (currently only UMN supports split-by-event jobs)
Job running time is shorter than for split-by-event jobs
Lower storage pressure on sites (UMN has encountered performance problems when there are too many split-by-event jobs)
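To make the difference concrete, here is a minimal illustrative Python sketch of the two strategies. The run numbers, event counts, and function names are invented for the example; this is not the actual GangaBoss/BESDIRAC splitter implementation.

```python
# Illustrative comparison of the two splitting strategies described above.
# Run numbers and event counts are made up for the example.
runs = {30616: 40000, 30617: 25000, 30618: 15000}  # run number -> events to produce

def split_by_run(runs):
    """One job per run: each job only needs the random trigger file of its own run."""
    return [{"run": run, "events": n} for run, n in runs.items()]

def split_by_event(runs, events_per_job=10000):
    """Fixed-size jobs: worker nodes must be able to reach random trigger files
    across the whole run range, which puts more pressure on site storage."""
    jobs = []
    for run, n in runs.items():
        for start in range(0, n, events_per_job):
            jobs.append({"run": run, "events": min(events_per_job, n - start)})
    return jobs

print(len(split_by_run(runs)))    # 3 jobs, one per run
print(len(split_by_event(runs)))  # 9 jobs of up to 10,000 events each
```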
14
BESDIRAC Task Manager
 
15
CENTRAL STORAGE SOLUTIONS
 
16
Data Transfer Using StoRM+Lustre
On Dec. 10th, local users at UMN produced a DsHi dataset of 3.3 TB in 36,328 files.
It is difficult to transfer such a large amount of data to IHEP by scp or rsync.
This dataset was transferred from UMN to IHEP by our SE transfer system.
On the IHEP side, the destination SE is IHEP-STORM (a StoRM+Lustre testbed)
The data is accessible on Lustre right after it is transferred; no upload/download is needed
The transfer speed was 35 MB/s, and the one-time success rate was > 99% (see the estimate below)
This shows the feasibility of transferring data from Lustre at one site to Lustre at another site
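As a rough cross-check of those numbers (assuming decimal units, 1 TB = 10^6 MB), moving the whole dataset at the quoted rate takes roughly a day of continuous transfer:

```latex
\frac{3.3\ \mathrm{TB}}{35\ \mathrm{MB/s}}
  = \frac{3.3 \times 10^{6}\ \mathrm{MB}}{35\ \mathrm{MB/s}}
  \approx 9.4 \times 10^{4}\ \mathrm{s}
  \approx 26\ \mathrm{hours}
```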
17
Job Read/Write Using StoRM+Lustre
From Jan. 19th to Mar. 4th, 103k CEPC MC production jobs used StoRM+Lustre as central storage.
In total, 11 TB of input data was read from /cefs and 41 TB of output data was written to /cefs, with only about 4% failure.
From the user's point of view, jobs read input data from /cefs and write output data to /cefs; data operations (upload and download) are not needed.
(Figure: 90.8% success rate, 4.01% SE read/write error; 11 TB input data read from Lustre, 41 TB output data written to Lustre)
18
MONITORING SYSTEM
 
19
Site Summary
A site summary page is added to the monitoring system
More detailed information will be added
20
Tests by Submitting Job
Easier to add new tests
A history graph is also available for each test
21
CLOUD STORAGE
 
22
Introduction
Suitable for sites without an SE
Could support split-by-event jobs for sites which cannot mount all the random trigger files on each computing node
23
Test for Cloud Storage
The MucuraFS client is deployed on 5 cloud computing testbeds
Random trigger files of round06 are prepared on the cloud storage
1000 split-by-event reconstruction jobs with run range [30616, 31279], 10,000 events in each job
Test results
High success rate
CPU efficiency is much lower and execution time is much longer for nodes outside IHEP (see the note below)
(Figure: IHEP Cloud, 96.6% success)
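For reference, the CPU efficiency quoted above is meant in the usual sense (a standard definition, not something specific to these slides): the ratio of CPU time actually consumed to the wall-clock time of the job.

```latex
\varepsilon_{\mathrm{CPU}} = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{wall}}}
```

The longer execution times on nodes outside IHEP are consistent with jobs spending more wall-clock time waiting for random trigger data served from the cloud storage.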
24
 
Future Plan
Further strengthen user support
User tutorials will be provided regularly if needed
More improvements will be made according to user feedback
Support analysis jobs with existing DST files
Upload the full user workarea for simplicity and integrity
Make cloud resources easier to manage centrally
Improve the monitoring system
Develop an accounting system
More effort will be put into making the system more robust
Promote usage of the mirror offline database, implementing real-time synchronization
Consider a redundant central server to avoid a single point of failure
25
Summary
The distributed computing system is in good shape and handling user jobs well
Private user production is well supported, with several improvements
In central storage tests, StoRM+Lustre performed well and could be used for real jobs
The monitoring system has been upgraded and a new site summary page has been developed
Cloud storage has been tested and could be an alternative for providing random trigger file access
26
Thanks for your attention!
You are welcome to use the system and send your feedback!
27