Status of BESIII Distributed Computing
BESIII Workshop, Mar 2015
Xianghu Zhao
On Behalf of the BESIII Distributed Computing Group
Outline
System and site status
Private production status
Central storage solutions
Monitoring system
VM performance study in cloud computing
Cloud storage
Summary
2
Resources and Sites
# | Site Name | Type | OS | CPU Cores | SE Type | SE Capacity | Status
1 | CLOUD.IHEP.cn | Cloud | SL6 | 264 | dCache | 214 TB | Active
2 | CLUSTER.UCAS.cn | Cluster | SL5 | 152 | | | Active
3 | CLUSTER.USTC.cn | Cluster | SL6 | 200 ~ 1280 | dCache | 24 TB | Active
4 | CLUSTER.PKU.cn | Cluster | SL5 | 100 | | | Active
5 | CLUSTER.WHU.cn | Cluster | SL6 | 100 ~ 300 | StoRM | 39 TB | Active
6 | CLUSTER.UMN.us | Cluster | SL5/SL6 | 768 | BeStMan | 50 TB | Active
7 | CLUSTER.SJTU.cn | Cluster | | 100 | | | Active
8 | GRID.JINR.ru | Grid | SL6 | 100 ~ 200 | dCache | 30 TB | Active
9 | GRID.INFN-Torino.it | Grid | SL | 200 | StoRM | 30 TB | Active
10 | CLUSTER.SDU.cn | Cluster | | | | | Testing
11 | CLUSTER.BUAA.cn | Cluster | | | | | Testing
  | Total | | | 1864 ~ 3504 | | 387 TB |
CPU resources are about 2,000 cores and storage about 387 TB
Some CPU resources are shared with site local users
3
BOSS Software Deployment
Currently the following BOSS versions are available for distributed computing:
6.6.2, 6.6.3, 6.6.3.p01, 6.6.4, 6.6.4.p01, 6.6.4.p02, 6.6.4.p03, 6.6.5
Versions 6.6.2, 6.6.3, 6.6.3.p01, and 6.6.4 have been updated to accommodate distributed computing
The verification results can be found under the directory /besfs/users/zhaoxh/verify_dist/boss
The following random trigger files are deployed:
round02, round03, round04, round05, round06, round07
4
BOSS 6.6.5 Support
BOSS 6.6.5 is already supported by the distributed computing system
Available on sites running SL6
5
Cloud Status
Cloud computing is now open to private users
Storage on the cloud computing nodes has been extended
This allows more virtual machines to run stably
The database backend of the OpenNebula cloud has been switched from SQLite to MySQL (see the configuration sketch below)
Improves performance
Avoids situations where the service becomes unresponsive
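For reference, this switch is made in the DB section of OpenNebula's oned.conf. The block below is a minimal sketch only: the server, port, user, password, and database name are placeholder values, not the settings actually used on the IHEP cloud.

```
# oned.conf -- database backend configuration
# Default SQLite backend:
# DB = [ BACKEND = "sqlite" ]
# MySQL backend (connection values below are placeholders):
DB = [ BACKEND = "mysql",
       SERVER  = "localhost",
       PORT    = 0,              # 0 = use the default MySQL port
       USER    = "oneadmin",
       PASSWD  = "oneadmin",
       DB_NAME = "opennebula" ]
```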
6
PRIVATE PRODUCTION STATUS
 
7
User Job Status
More users are using distributed computing
In total, more than 93,000 user jobs have been completed successfully since the last collaboration meeting
8
User Jobs Data Transfer
7 TB of data transferred to IHEP
Reconstruction jobs require more data transfer than analysis jobs
9
Improvement for GangaBoss
Job submission to distributed computing has been sped up
More jobs can be submitted at one time on the lxslc login nodes
Submission speed is much faster
The way of using custom BOSS packages has been simplified
Support for SL6 and BOSS 6.6.5
Will soon be provided in the next version
10
New Function in GangaBoss
Users can specify more than one output file type
If no file type is specified, the output file comes from the last step (see the sketch below)
Output of .rec files is also supported in reconstruction jobs
No changes to the job script are needed
All output files can be downloaded with the “besdirac-dms-dataset-get” command
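The default rule above can be summarized in a short sketch. The snippet below is purely illustrative Python, not actual GangaBoss code; the step names and the step-to-file-type mapping are assumptions made only for this example.

```python
# Illustrative sketch of the output-type selection rule described above.
# NOT GangaBoss code: step names and the step -> file-type mapping are assumed.
STEP_OUTPUT = {
    "sim": "rtraw",  # simulation output (assumed type name)
    "rec": "rec",    # reconstruction output (.rec files)
    "ana": "root",   # analysis output (assumed type name)
}

def select_output_types(steps, requested=None):
    """Return the output file types a job should keep.

    steps     -- ordered list of job steps, e.g. ["sim", "rec", "ana"]
    requested -- optional list of file types chosen by the user; if omitted,
                 only the output of the last step is kept (the default rule).
    """
    if requested:
        return list(requested)
    return [STEP_OUTPUT[steps[-1]]]

print(select_output_types(["sim", "rec"]))                    # ['rec']
print(select_output_types(["sim", "rec"], ["rtraw", "rec"]))  # ['rtraw', 'rec']
```

The files registered this way are then fetched after the job finishes with the besdirac-dms-dataset-get command mentioned above.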
11
Support
These job types are supported now
Simulation
Simulation + Reconstruction
Simulation + Reconstruction + Analysis
User custom packages are supported
Custom generators
User analysis packages
A detailed user guide is provided on the wiki
How to submit a BOSS job to distributed computing:
http://boss.ihep.ac.cn/~offlinesoftware/index.php/BESDIRAC_User_Tutorial
How to submit different types of BOSS jobs:
http://docbes3.ihep.ac.cn/~offlinesoftware/index.php/BESDIRAC_BOSS_Job_Guide
12
Plan to Do
Support analysis jobs on existing DST files
Full upload of the user package
This removes the difficulty of working out exactly which files need to be uploaded
13
Job Splitter to Choose
There are two kinds of splitters (see the sketch below)
Split by run
Split by event
Split by run is recommended for users
More sites can be used (currently only UMN supports split-by-event jobs)
Job running time is shorter than for split-by-event jobs
Lower storage pressure on sites (UMN has encountered performance problems when there are too many split-by-event jobs)
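To make the difference concrete, here is a minimal illustrative Python sketch of the two strategies. The run numbers, event counts, and function names are invented for the example; this is not the actual GangaBoss/BESDIRAC splitter implementation.

```python
# Illustrative comparison of the two splitting strategies described above.
# Run numbers and event counts are made up for the example.
runs = {30616: 40000, 30617: 25000, 30618: 15000}  # run number -> events to produce

def split_by_run(runs):
    """One job per run: each job only needs the random trigger file of its own run."""
    return [{"run": run, "events": n} for run, n in runs.items()]

def split_by_event(runs, events_per_job=10000):
    """Fixed-size jobs: worker nodes must be able to reach random trigger files
    across the whole run range, which puts more pressure on site storage."""
    jobs = []
    for run, n in runs.items():
        for start in range(0, n, events_per_job):
            jobs.append({"run": run, "events": min(events_per_job, n - start)})
    return jobs

print(len(split_by_run(runs)))    # 3 jobs, one per run
print(len(split_by_event(runs)))  # 9 jobs of up to 10,000 events each
```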
14
BESDIRAC Task Manager
 
15
CENTRAL STORAGE SOLUTIONS
 
16
Data Transfer Using StoRM+Lustre
On Dec. 10th, local users at UMN produced a DsHi dataset of 3.3 TB in 36,328 files.
It is difficult to transfer such a large amount of data to IHEP by scp or rsync.
This dataset was transferred from UMN to IHEP by our SE transfer system.
On the IHEP side, the destination SE is IHEP-STORM (a StoRM+Lustre testbed)
The data is accessible on Lustre right after it is transferred; no upload/download is needed
The transfer speed was 35 MB/s, and the one-time success rate was > 99% (see the estimate below)
This shows the feasibility of transferring data from Lustre at one site to Lustre at another site
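As a rough cross-check of those numbers (assuming decimal units, 1 TB = 10^6 MB), moving the whole dataset at the quoted rate takes roughly a day of continuous transfer:

```latex
\frac{3.3\ \mathrm{TB}}{35\ \mathrm{MB/s}}
  = \frac{3.3 \times 10^{6}\ \mathrm{MB}}{35\ \mathrm{MB/s}}
  \approx 9.4 \times 10^{4}\ \mathrm{s}
  \approx 26\ \mathrm{hours}
```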
17
Job Read/Write Using StoRM+Lustre
From Jan. 19th to Mar. 4th, 103k CEPC MC production jobs used StoRM+Lustre as central storage.
In total, 11 TB of input data was read from /cefs and 41 TB of output data was written to /cefs, with only about 4% failure.
From the user's point of view, jobs read input data from /cefs and write output data to /cefs; data operations (upload and download) are not needed.
(Figure: 90.8% success rate, 4.01% SE read/write error; 11 TB input data read from Lustre, 41 TB output data written to Lustre)
18
MONITORING SYSTEM
 
19
Site Summary
A site summary page is added to the monitoring system
More detailed information will be added
20
Tests by Submitting Job
Easier to add new tests
A history graph is also available for each test
21
CLOUD STORAGE
 
22
Introduction
Suitable for sites without an SE
Could support split-by-event jobs for sites which cannot mount all the random trigger files on each computing node
23
Test for Cloud Storage
The MucuraFS client is deployed on 5 cloud computing testbeds
Random trigger files of round06 are prepared on the cloud storage
1000 split-by-event reconstruction jobs with run range [30616, 31279], 10,000 events in each job
Test results
High success rate
CPU efficiency is much lower and execution time is much longer for nodes outside IHEP (see the note below)
(Figure: IHEP Cloud, 96.6% success)
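For reference, the CPU efficiency quoted above is meant in the usual sense (a standard definition, not something specific to these slides): the ratio of CPU time actually consumed to the wall-clock time of the job.

```latex
\varepsilon_{\mathrm{CPU}} = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{wall}}}
```

The longer execution times on nodes outside IHEP are consistent with jobs spending more wall-clock time waiting for random trigger data served from the cloud storage.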
24
 
Future Plan
Further strengthen user support
User tutorials will be provided regularly if needed
More improvements will be made according to user feedback
Support analysis jobs with existing DST files
Upload the full user workarea for simplicity and integrity
Make cloud resources easier to manage centrally
Improve the monitoring system
Develop an accounting system
More effort will be put into making the system more robust
Promote usage of the mirror offline database, implementing real-time synchronization
Consider a redundant central server to avoid a single point of failure
25
Summary
The distributed computing system is in good shape and handling user jobs well
Private user production is well supported, with several improvements
In central storage tests, StoRM+Lustre performed well and could be used for real jobs
The monitoring system has been upgraded and a new site summary page has been developed
Cloud storage has been tested and could be an alternative for providing random trigger file access
26
Thanks for your attention!
You are welcome to use the system and send your feedback!
27