
Automating Data Certification Procedure in CMS DQM
"Explore the journey towards automating the Data Certification procedure using ML tools in the CMS DQM team. Learn about the challenges faced, previous attempts, and the new ML4DC schema. Discover the steps involved and current advancements in this innovative process." (291 characters)
Presentation Transcript
CMS DQM: toward an automated DC procedure. F. Fiori, on behalf of the CMS DQM team.
Data Certification and ML
Data Certification is the procedure that filters out data showing detector issues and provides the list of runs and lumisections that can safely be used for physics analysis.
- 17 different binary flags (GOOD or BAD) are set by human experts, and their AND is taken as the global flag (3-4 people are involved for each flag).
- Flags are set for detector bits (e.g. Strip, Pixel, ECAL ...) or for basic physics-object reconstruction (e.g. muons, tracks, jets).
- Flags are currently set per Run (i.e. a few hours of data taking); the goal for Run 3 is to have quality flags per single Lumisection (LS, ~23 seconds of data taking, the minimal data range processed by HLT and DAQ).
- A lot of person-power is involved (especially expert time); the plan is to automate the procedure using ML tools.
Main issues towards automation:
- Lack of a rigorous definition of BAD data: decisions are often taken after long discussions in meetings (there are no proper metrics to judge results).
- Class imbalance: BAD data are only a small fraction (~1%) of the GOOD ones.
Past attempts in CMS DQM
In 2016 two groups formed in CMS (and were closed in 2018):
- ML4DQM: goal to develop tools for ~real-time data monitoring, based on anomaly detection on 2D occupancy maps (i.e. image processing). Some interesting results were obtained and published [1].
- ML4DC: goal to develop automatic tools for Data Certification, mainly based on the study of sets of 1D distributions related to different detector data and physics objects (e.g. track pT, jet energy, etc.). No conclusive results were published; one of the main issues was the absence of a suitable dataset.
See https://indico.cern.ch/event/798721/contributions/3461344/attachments/1864432/3065140/ML_Workshop_Fiori_v3.pdf for more historical information.
Several algorithms were tested (supervised and unsupervised); no solution is in hand yet, but good ideas were developed: autoencoders (AE) for anomaly detection, factorization of datasets.
In February 2019 the ML4DC effort restarted with a quite different approach: DC done in two steps, with a different input dataset.
[1] Adrian Alan Pol et al., "Detector Monitoring with Artificial Neural Networks at the CMS Experiment at the CERN Large Hadron Collider", Computing and Software for Big Science, 3(1):3, Jan 2019.
The new ML4DC schema
[Schema diagram: a Run (e.g. RUN 123456) is split into Lumisections LS1 ... LSn; DCS bits first drop the LS not usable for physics. For each primary dataset used in DC (SingleMuon, EGamma, ZeroBias, JetHT), electron, tracking and jet features are extracted. Step 1 applies a supervised model per Run and combines the per-PD outputs with a fuzzy AND, sending GOOD Runs to the GOLDEN JSON and GREY Runs to Step 2. Step 2 applies a semi-supervised model per LS, again combines the per-PD outputs with a fuzzy AND, and the remaining BAD LS are checked by humans.]
- Step 1 provides quality flags per single Run, as is done in the current DC procedure, using a supervised classifier or even a simpler PCA.
- Runs which fail to be classified as GOOD in Step 1 are forwarded to Step 2 in order to identify and reject anomalous Lumisections; semi-supervised models (autoencoders?) are used here.
- In both steps, features are mapped to the corresponding PD used in DC.
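The slides do not define the "fuzzy AND" or the per-PD scores; as a rough illustration only, the sketch below takes the fuzzy AND as the minimum of the per-dataset scores (a common fuzzy-logic t-norm) and routes a Run through the two steps. The function names and thresholds are hypothetical assumptions, not the actual CMS implementation.

```python
# Illustrative sketch of the two-step ML4DC flow (not the official CMS code).
# The "fuzzy AND" is assumed here to be the minimum of the per-PD scores in [0, 1].

def fuzzy_and(scores):
    """Combine per-primary-dataset quality scores into a single score."""
    return min(scores)

def certify_run(run_scores, ls_scores, run_threshold=0.9, ls_threshold=0.9):
    """run_scores: Step-1 supervised scores, one per PD (SingleMuon, EGamma, ...).
    ls_scores: Step-2 semi-supervised scores, {ls_number: [score per PD]}."""
    if fuzzy_and(run_scores) >= run_threshold:
        return "GOOD", None  # the whole Run enters the GOLDEN JSON
    # GREY Run: inspect individual Lumisections in Step 2
    good_ls = [ls for ls, scores in ls_scores.items()
               if fuzzy_and(scores) >= ls_threshold]
    bad_ls = sorted(set(ls_scores) - set(good_ls))
    return "GREY", {"good_ls": sorted(good_ls), "bad_ls_for_human_check": bad_ls}
```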
Use of DQM data
The human-based DC procedure makes use of DQM data, a special stream containing mostly low-level detector data (e.g. single Strip or Pixel hits); in some sense it is the dataset closest to RAW data.
In the past, DQM data were only available at Run granularity: files of a few GB collecting ROOT plots (a single file containing a few 100k plots) were used to monitor and certify data. DQM data were never used to develop ML tools, since the time granularity was too coarse.
The DQM team has recently developed code to save the full content of DQM data per LS, and a first set of such data is now available to restart the studies. For the moment the full 2017 ZeroBias dataset is available; soon the 2018 samples will also be produced, for the full set of datasets used for DC.
A quick look at per-LS data
[Plots of per-LS distributions at LS #100, #500, #1000, #1500 and #1900; the blue and black lines correspond to lumi leveling.]
What to do with per-LS data?
Focus on Tracker data, since we only have the ZeroBias dataset and the Tracker is by far the most complex subsystem in CMS; to be extended to the other subsystems after the 2018 reprocessing.
Two main approaches are planned to be tested:
A. A semi-supervised AE to assess the quality of single histograms: feed the AE with the bin values, use the MSE as discriminator, then combine the results in a sensible way (possibly using different layers of AE).
B. PCA studies on per-LS data: the method proved effective for Run-based analysis without involving fancy ML tools, but will require a further reduction of the data.
Goals and timescale:
- Develop a generic tool that subsystems can use to build their own auto-certification for Run 3, a kind of single-histogram classifier.
- Reproduce the human-based certification as much as possible.
- Studies should be completed by Spring 2020.
Method A:
[Diagram: histograms 1 ... N, each with N bins, are fed into one autoencoder each (AE 1 ... AE N); each AE yields a reconstruction error (MSE 1 ... MSE N), and these are combined into a global MSE, i.e. the final result.]
The number of hidden layers is to be decided based on the number of bins. Here the AE represents a generic semi-supervised model; feel free to suggest a different approach!
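As a rough sketch of Method A (not the official CMS code), the snippet below builds a small dense autoencoder per histogram, trains it on GOOD lumisections and uses the per-LS reconstruction MSE as the anomaly score; the layer sizes and the normalisation choice are illustrative assumptions.

```python
# Minimal Method A sketch: one autoencoder per 1D histogram, MSE as discriminator.
import numpy as np
from tensorflow import keras

def build_histogram_ae(n_bins, latent_dim=8):
    """Dense autoencoder for one histogram with n_bins bins (sizes are assumptions)."""
    inp = keras.Input(shape=(n_bins,))
    h = keras.layers.Dense(64, activation="relu")(inp)
    z = keras.layers.Dense(latent_dim, activation="relu")(h)
    h = keras.layers.Dense(64, activation="relu")(z)
    out = keras.layers.Dense(n_bins, activation="linear")(h)
    ae = keras.Model(inp, out)
    ae.compile(optimizer="adam", loss="mse")
    return ae

def train_and_score(ae, x_good, x_test, epochs=20):
    """x_good, x_test: arrays of shape (n_ls, n_bins), e.g. normalised to unit area.
    Returns the per-LS reconstruction MSE, used as the anomaly score."""
    ae.fit(x_good, x_good, epochs=epochs, batch_size=256, verbose=0)
    recon = ae.predict(x_test, verbose=0)
    return np.mean((x_test - recon) ** 2, axis=1)

# The per-histogram scores (MSE 1 ... MSE N) can then be combined,
# e.g. by averaging or taking the maximum, into the global per-LS MSE.
```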
Method B: PCA analysis at LS level, extending the results already available per Run.
In a nutshell: compute the multi-dimensional correlation matrix and extract the 2-3 features which show the highest gradient. The 1D distributions have to be reduced to a few values: mean, RMS, skew.
This study can also be performed at Run level for Step 1 of DC, and is in any case useful as an initial check on the data.
[Plot: results relative to the Pixel detector and 2018 data. The red points inside the blue cluster have been traced back to timing scans very close to the optimal setting.]
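The snippet below is an illustrative sketch (not the official code) of how each 1D distribution could be reduced to mean, RMS and skew before running a PCA on the per-LS feature matrix; the function names and the use of scikit-learn are assumptions.

```python
# Method B sketch: reduce each histogram to (mean, RMS, skew), then run PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def summarize(hist, edges):
    """Reduce one histogram (bin contents + bin edges) to mean, RMS and skewness."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    total = hist.sum()
    w = hist / total if total > 0 else hist
    mean = np.sum(w * centers)
    rms = np.sqrt(np.sum(w * (centers - mean) ** 2))
    skew = np.sum(w * (centers - mean) ** 3) / rms**3 if rms > 0 else 0.0
    return mean, rms, skew

def run_pca(features, n_components=3):
    """features: shape (n_ls, 3 * n_histograms), one row per lumisection."""
    scaled = StandardScaler().fit_transform(features)
    pca = PCA(n_components=n_components)
    coords = pca.fit_transform(scaled)  # per-LS coordinates in the PC space
    return coords, pca.explained_variance_ratio_
```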
Available dataset
Only 2017 data and the ZeroBias primary dataset, used to certify Tracker (Strip + Pixel) and Tracking data:
- 584 Runs from 2017, corresponding to ~400k LS;
- 1 TB of sqlite tables containing DQM histograms and other metadata (e.g. axis limits, number of bins, entries);
- 106 1D distributions relative to Tracker data, now saved in csv; conversion to Pandas dataframes and data reduction for PCA are ongoing.
After Christmas the 2018 reprocessing is coming: almost the same total amount of data is expected, but with 4 more primary datasets.
Issue: serious lack of manpower!
[Table: DataFrame content.]
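For orientation, here is a minimal sketch of how per-LS histograms stored in sqlite could be pulled into a Pandas dataframe; the table and column names (monitor_elements, run, lumi, histo_name, bin_values) are hypothetical, since the actual schema of the CMS DQM sqlite files is not given in the slides.

```python
# Hypothetical example of reading per-LS DQM sqlite tables with Pandas.
import sqlite3
import pandas as pd

def load_per_ls_histograms(db_path, histo_name):
    """Return one row per (run, lumi) for the requested histogram."""
    query = """
        SELECT run, lumi, bin_values
        FROM monitor_elements      -- hypothetical table name
        WHERE histo_name = ?
        ORDER BY run, lumi
    """
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql_query(query, conn, params=(histo_name,))
```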
Summary and Conclusions
CMS plans to develop ML tools to automate DC for the Run 3 data taking.
A new approach is under development using a completely new dataset (DQM per-LS data), based on the reprocessing of 2017 ZeroBias.
Two ways are proposed to continue the study: a semi-supervised approach and a simple PCA analysis.
The data are ready and now accessible to the CMS collaboration.
We are searching desperately for some young and enthusiastic contributors to join this effort!