Efficient Recovery of Missing Events
Efficient recovery of missing events is crucial in information systems to ensure accurate business history recording. This study delves into algorithms for filling gaps in causal networks, dealing with loops, and addressing causes such as man-made errors and system failures. It explores the significance of event data and process specifications in recovering missing events, with a focus on ensuring completeness and accuracy in event sequences. The study highlights the causes of missing events and the need for efficient recovery strategies to maintain data integrity and system reliability.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Efficient Recovery of Missing Events 1/23 Jianmin Wang1, Shaoxu Song1, Xiaochen Zhu1, Xuemin Lin2 1Tsinghua University, China 2University of New South Wales, Australia VLDB 2013
Outline 2/23 Motivation Recovery Algorithms Filling gaps in causal net Branching framework Dealing with loops Experiments Conclusion Ongoing Work VLDB 2013
Event Data 3/23 Information systems record the business history in their event logs. E ID T ID Event Name Timestamp 1 1 Order 2013-04-22 13:33:34 2 1 Pay by Cash 2013-04-22 15:10:17 3 1 Check Inventory 2013-04-22 15:18:11 4 1 Validate 2013-04-22 15:31:50 5 1 Delivery 2013-04-23 08:14:26 Event Sequence: Order Pay by cash Check inventory Validate Delivery Huge Amount of Event Data: Corporation Products No. of Event Sequences Power Generatior Machinery Train 1,230,000 3,260,000 2,600,000 VLDB 2013
Process Specification 4/23 Business events often follow certain business rules or constraints Order Order Pay by cash Check inventory Validate Delivery Pay by cash Pay by cash Check inventory Pay by cash 1 Event sequences Order Validate Order Check inventory inventory Validate Check inventory Delivery Check inventory Pay by credit card Pay by credit card Check credit history 2 Pay by cash Delivery Re-check Check 3 Pay by cash Pay by cash follow Pay by cash Constraints by Petri net: Sequence Parallel Choice XOR split XOR join B C D Pay by credit card Check credit history Order Validate Delivery F G A Process specification Check inventory AND split AND join E H Re-check VLDB 2013
Missing Events 5/23 Pay by cash B B C C D D Sequence of events <A, B, E, F, G> complete <A, C, E, F, G> incomplete Pay by credit card Check credit history Order Validate Delivery G G G F F F A A A Check inventory E E E H Re-check The Causes of Missing Events: Man-made errors (typo); System failures (power down); Lost during system integration (mismatching). Survey in one division of CNR: 47% events are missed. VLDB 2013
Meaning of Recovery 6/23 Why Need Recovery? 1 Order inventory Validate Delivery Pay by cash Pay by cash Check inventory Re-check Check 2 Order Pay by cash Check inventory Validate Delivery Incomplete provenance answer What are the prerequisites of Validate in sequence 1? Answer: Order , Check Inventory and Re-check . Pay by cash is lost. Inaccurate query answer SEQ(Pay by cash, Validate), which means Pay by cash followed by Validate . Answer: sequence 2. Event sequence 1 is lost. VLDB 2013
Recovery of Missing Events 7/23 Event sequence <ACEFG> has missing event between ACE and F, called a GAP (ACE, F). Pay by cash B C C D D Pay by credit card Check credit history Order Validate Delivery F F G A A Check inventory E E H Re-check Recovery Insert the missing prerequisites into the gap; Based on the specification (follows constraints). A possible recovery is <ACEDFG> VLDB 2013
Possible Recoveries 8/23 Multiple Possible Recoveries Pay by cash B B B C C D D Pay by credit card Check credit history Order Validate Delivery F F F G A A A Check inventory E E E H H Re-check Choice structures: Loop structures: To recover (AE,F), we have AEBF, AECDF To recover (AB,F), we have ABEF, ABEHEF, ABEHEHEF... What is the most likely recovery? VLDB 2013
Minimum Recovery 9/23 Intuition: Generally, people try to make the minimum mistakes. Principle: Follow the minimum change law in improving data quality. Problem: The cost of a recovery, which is the number of inserted events should be minimized. Pay by cash B C D Pay by credit card Check credit history Order Validate Delivery F G A Check inventory E H Re-check To recover <AEFG>: <AEBFG> is a minimum recovery, while <AECDFG> is not. VLDB 2013
Hardness and Related Work 10/23 Hardness: Owing to choices and parallelization of flows, there exist vast alternatives to enumerate in the recovery; We prove the NP-hardness of this problem. Existing Approach: The Alignment algorithm* Enumerates a factorial number of redundant recoveries. B B To recover (A,E), it generates S: {ABCDE, ABDCE, ACBDE, ACDBE, ADBCE, ADCBE} C C A A E E D D Our Observation: All the sequences in S are equivalent w.r.t. the specification; Any of S is a minimum recovery; Not necessary to enumerate. *M. de Leoni, F. M. Maggi, and W. M. P. van der Aalst. Aligning event logs and declarative process models for conformance checking. In BPM, pages 82 97, 2012. VLDB 2013
Outline 11/23 Motivation Recovery Algorithms 1. Filling gaps upon causal net 2. Branching framework 3. Dealing with loops Experiments Conclusion Ongoing Work 2. Net with Choice Structures B 1. Causal Net C D F A G E H 3. Net with Loop Structures VLDB 2013
1. Filling Gaps upon Causal Net 12/23 Causal Net: net without XOR structures. It can be translated into a equivalent DAG. B B C C A A E E D D Based on parallel constraints & sequence constraints: Each complete sequence must be one of the topological sorts on the DAG. A recovery of incomplete sequence is a subsequence of a complete sequence. <ABCE> is a subsequence of <ABCDE> Hints: fills the missing prerequisites by backtracking. VLDB 2013
2. The Branching Framework 13/23 Problem: topological sort cannot be applied on choice structure. Hints: Enumerate all the possible execution of the choice structures Each execution is a causal net, topological sort can be applied. B D E A C To recover (A, E) (A, BE) B E (A, CDE) D E A C Recovery with minimum cost VLDB 2013
Pruning Methods 14/23 1. Branching Index: prune the branches not including the right-side event of gap. Path Reachability Pruning: prune the branches that the left-side of gap are not reachable to the right-side of gap. Branch and Bound: prune the branches that may lead to possible recoveries but cannot be the minimum one. Local Optimality: identify groups of transitions that always share the same branching, thus only one of them needs to be computed. 2. 3. 4. Gap (AC,E) LB = 3 B E E E E Gap (A,E) Gap (A,D) LB = 4 D D E E E A A A A C C VLDB 2013
3. Extension on Loops 15/23 Problem: when to terminate? Hints: considers loop at most once (minimum cost). To recover (ABC, B) E (ABC, AB) Not valid B B C C D A A B A E E B (ABC, EB) B B C C D A A E E E B (ABC, EBCEB) Not minimum B C B B C C D A A VLDB 2013
Outline 16/23 Motivation Recovery Algorithms Filling gaps in causal net Branching framework Dealing with loops Experiments Conclusion Ongoing Work VLDB 2013
Experiment Setting 17/23 Real Life Data Set: employed from CNR* Causal net Net with choices (no loop) Net with loops 25 No. of Process Specifications No. of Event Sequences Max Specification Size Max Sequence Size 63 transitions 79 places 149 57 4470 1000 events 67 Setting we randomly remove 10%-90% events from the complete sequences; apply the recovery methods to recover the removed events. Criteria: to evaluate the accuracy of recovery, F-measure of precision and recall. Baseline: the Alignment approach. *www.tangche.com VLDB 2013
Experiments on Causal Net 18/23 The accuracy is no worse than Alignment algorithm. Our algorithm shows great improvement in time cost. accuracy accuracy 0.9 1 0.8 0.9 0.8 0.7 F-Measure 0.7 0.6 F-Measure 0.6 0.5 0.5 0.4 Alignment Backtracking 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Missing Rate 0-2 3-5 6-8 9-11 12-14 21-23 Specification size time performance time performance 10000 100000 10000 1000 Time cost (ms) Time cost (ms) 1000 100 100 10 10 1 1 0.1 0.1 0.01 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Missing Rate 0-2 3-5 6-8 9-11 12-14 21-23 VLDB 2013 Specification size
Experiments on Net with Choices 19/23 The pruning methods are effective. accuracy accuracy 0.9 1 Alignment Branch Branch+Reach Branch+Reach+Bound Local Local+Reach+Bound 0.9 0.8 0.8 0.7 0.7 F-Measure F-Measure 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Missing Rate Specification size time performance time performance 10000 1000 1000 100 100 Time cost (ms) Time cost (ms) 10 10 1 1 0.1 0.01 0.1 0.001 0.01 0-4 5-9 10- 14 15- 19 Specification size 20- 24 25- 29 30- 34 35- 39 40- 44 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Missing Rate Best approach VLDB 2013
Experiments on Net with Loops 20/23 Local+Reach+Bound can always achieve the lowest time cost. Alignment Branch Branch+Reach Branch+Reach+Bound Local Local+Reach+Bound accuracy time performance 100000 10000 Time cost (ms) 1 0.9 0.8 0.7 F-Measure 1000 0.6 0.5 Best approach 0.4 0.3 0.2 0.1 0 100 100 200 300 400 500 600 700 800 900 1000 Sequence size Sequence size VLDB 2013
Conclusion 21/23 We define the specification-based minimum recovery problem for missing events. We prove the NP-hardnessof this problem. To efficiently find the optimal recovery: We propose a backtracking idea to reduce the redundant sequences with respect to parallel events. We construct a branching index, and develop reachability checking and lower bounds of recovery distances to further accelerate the computation. The local optimal method can reduce the number of intermediate results. VLDB 2013
Ongoing Work 22/23 In this paper of missing events, we assume that the logged event data are clean. However, the logged event data may also be dirty. Encoding example: UTF-8: Re-check GB2312: e ? C D Delivery G Validate F Order Check credit history Check inventory Pay by credit card Check inventory E A e ? E Without addressing erroneous events, Inaccurate provenance answer Inaccurate query answer By the insertion technique proposed in this paper, we cannot elimate the erroneous events. Instead, we need to modify the mistakely recorded events Non-trivial, also NP-hard Heuristic approach may apply VLDB 2013
Q & A Thanks 23/23 VLDB 2013