Efficient Recovery of Missing Events

undefined
Jianmin Wang
1
, Shaoxu Song
1
,
 Xiaochen Zhu
1
, Xuemin Lin
2
1
Tsinghua University, China
2
University of New South Wales, Australia
1/23
VLDB 2013
Motivation
Recovery Algorithms
Filling gaps in causal net
Branching framework
Dealing with loops
Experiments
Conclusion
Ongoing Work
2/23
VLDB 2013
 
Information systems record the business history in their event logs.
 
 
 
 
 
 
 
 
3/23
VLDB 2013
 
Huge Amount of Event Data:
 
Event Sequence
: 
Order 
 Pay by cash 
 Check inventory 
 Validate 
 Delivery
 
Business events often 
follow
 certain business rules or constraints
4/23
VLDB 2013
 
Order 
 
Pay by cash 
 
Check inventory 
 
Validate 
 
Delivery
 
Order 
 Check inventory 
 Pay by credit card 
 Check credit history
 Validate 
 Delivery
 
Order 
 Check inventory 
 Pay by cash 
 Re-check 
 Check
inventory 
 Validate 
 Delivery
 
Process
specification
 
Event
sequences
follow
 
Constraints by
Petri net:
Sequence
Parallel
Choice
 
Order 
 
Pay by cash
 
Pay by cash 
 
Check inventory
 
Check inventory 
 Pay by cash
 
Pay by credit card
5/23
VLDB 2013
 
 
 
 
 
 
 
 
The Causes of Missing Events:
Man-made errors (typo);
System failures (power down);
Lost during system integration (mismatching).
Survey in one division of CNR:
47% events are missed.
 
<
A, B, E, F, G
>  
complete
 
<
A, C, E, 
F,
 
G
>  
incomplete
B
F
G
A
E
A
C
E
F
G
 
Why Need Recovery?
 
 
 
 
Incomplete provenance answer
 
 
 
Inaccurate query answer
6/23
VLDB 2013
 
SEQ(Pay by cash, Validate), which means “Pay by cash” followed by
                                                      “Validate”.
 
What are the prerequisites of “Validate” in sequence 1?
 
Event sequence 1 is lost.
 
“Pay by cash” is lost.
 
Answer: “Order”, “Check Inventory” and “Re-check”.
 
Answer: sequence 2.
 
Pay by cash
 
 
2    Order 
 
Pay by cash 
 
Check inventory 
 
Validate 
 
Delivery
 
 
 
 
 
 
 
 
 
 
Recovery
Insert the missing prerequisites into the gap;
Based on the specification (follows constraints).
7/23
VLDB 2013
 
 
E
vent sequence <ACEFG> has missing event between ACE
 and F, called a
GAP (
ACE
, 
F
)
.
 
 A 
possible recovery 
is <
ACE
D
F
G>
F
D
 
Multiple Possible Recoveries
 
 
 
 
8/23
 
Loop structures:
To recover (
AB
,
F
), we
have 
AB
E
F
, 
AB
EHE
F
,
AB
EHEHE
F
...
VLDB 2013
 
Choice structures:
To recover (
AE
,
F
), we
have 
AE
B
F
, 
AE
CD
F
 
What is the 
most likely 
recovery?
 
B
 
Intuition:
 Generally, people 
try to make the minimum mistakes
.
Principle: 
Follow 
the minimum change law 
in improving data
quality.
Problem: 
The cost of a recovery, which is the 
number of inserted
events 
should be minimized.
9/23
VLDB 2013
 
To recover <
AE
F
G>: <
AE
B
F
G>
 
is a minimum recovery, 
while
<
AE
CD
F
G>
 
is not
.
 
Hardness:
Owing to 
choices and parallelization
 of flows, there exist 
vast
alternatives 
to enumerate in the recovery;
We prove the 
NP-hardness 
of this problem.
Existing Approach: 
The Alignment algorithm*
Enumerates a factorial number 
of redundant recoveries.
 
 
 
 
Our Observation:
All the sequences in 
S
 are equivalent w.r.t. the specification;
Any of 
S
 is a minimum recovery;
Not necessary to enumerate.
 
*M. de Leoni, F. M. Maggi, and W. M. P. van der Aalst. Aligning event logs and declarative process
models for conformance checking. In BPM, pages 82–97, 2012.
10/23
VLDB 2013
 
To recover (
A
,
E
), it generates
S:
 {
A
BCD
E
, 
A
BDC
E
, 
A
CBD
E
,
A
CDB
E
, 
A
DBC
E
, 
A
DCB
E
}
A
E
B
C
D
 
Motivation
Recovery Algorithms
1.
Filling gaps upon causal net
2.
Branching framework
3.
Dealing with loops
Experiments
Conclusion
Ongoing Work
11/23
VLDB 2013
 
Causal Net: net 
without XOR 
structures.
It can be translated into a 
equivalent DAG
.
 
 
 
 
 
Based on 
parallel
 constraints & 
sequence
 constraints:
Each complete sequence must be one of the 
topological sorts
 on the
DAG.
A recovery of incomplete sequence is a 
subsequence
 of a complete
sequence.
<
ABC
E
> is a subsequence of <
ABC
D
E
>
Hints: 
fills the missing prerequisites by 
backtracking
.
12/23
VLDB 2013
 
Problem: 
topological sort cannot be applied on choice structure.
Hints:
Enumerate all the possible execution
 of the choice structures
Each execution is a 
causal net
, topological sort can be applied.
13/23
VLDB 2013
 
To recover (A
, E
)
 
(A
, BE
)
 
(A
, CDE
)
 
1.
Branching Index:
 prune the branches not including the right-side
event of gap.
2.
Path Reachability Pruning: 
prune the branches that the left-side
of gap are 
not reachable
 to the right-side of gap.
3.
Branch and Bound: 
prune the branches that may lead to possible
recoveries but 
cannot be the minimum one
.
4.
Local Optimality: 
identify groups of transitions that always 
share the
same branching
, thus 
only one of them needs to be computed
.
 
14/23
VLDB 2013
 
Problem: 
when to terminate?
Hints: 
considers loop at most once (minimum cost).
15/23
 
To recover (
ABC
,
 B
)
VLDB 2013
B
B
 
(
ABC
,
 AB
)
Not valid
 
(
ABC
,
 
E
B
)
 
(
ABC
,
 
EBCE
B
)
Not minimum
B
Motivation
Recovery Algorithms
Filling gaps in causal net
Branching framework
Dealing with loops
Experiments
Conclusion
Ongoing Work
16/23
VLDB 2013
 
Real Life Data Set: 
employed from CNR*
 
 
 
 
Setting
we 
randomly remove 10%-90% events 
from the complete
sequences;
apply the recovery methods to recover the removed events.
Criteria: 
to evaluate the accuracy of recovery,
F-measure of precision and recall.
Baseline
: the Alignment approach.
 
*www.tangche.com
17/23
VLDB 2013
 
The accuracy is no worse than Alignment algorithm.
Our algorithm shows great improvement in time cost.
18/23
VLDB 2013
 
The pruning methods are effective.
19/23
VLDB 2013
 
Local+Reach+Bound can always achieve the lowest time cost.
20/23
VLDB 2013
 
We define the 
specification-based minimum recovery problem 
for
missing events.
We 
prove the NP-hardness
 of this problem.
To 
efficiently
 find the optimal recovery:
We propose a 
backtracking idea 
to reduce the redundant sequences
with respect to parallel events.
We construct a 
branching index
, and develop 
reachability checking
and 
lower bounds 
of recovery distances to further accelerate the
computation.
The 
local optimal method 
can reduce the number of intermediate
results.
21/23
VLDB 2013
 
In this paper of missing events, we assume that the logged event data
are clean.
However, the 
logged event data 
may also be 
dirty
.
Encoding example: UTF-8: Re-check 
 GB2312: 
e
?
 
 
 
Without addressing erroneous events,
Inaccurate provenance answer
Inaccurate query answer
By the 
insertion technique
 proposed in this paper, we 
cannot elimate
the erroneous
 events.
Instead, we need to 
modify
 the mistakely recorded events
Non-trivial, also NP-hard
Heuristic approach may apply
22/23
VLDB 2013
undefined
Thanks
23/23
VLDB 2013
Slide Note
Embed
Share

Efficient recovery of missing events is crucial in information systems to ensure accurate business history recording. This study delves into algorithms for filling gaps in causal networks, dealing with loops, and addressing causes such as man-made errors and system failures. It explores the significance of event data and process specifications in recovering missing events, with a focus on ensuring completeness and accuracy in event sequences. The study highlights the causes of missing events and the need for efficient recovery strategies to maintain data integrity and system reliability.

  • Recovery
  • Missing Events
  • Algorithms
  • Event Data
  • Process Specification

Uploaded on Feb 25, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Efficient Recovery of Missing Events 1/23 Jianmin Wang1, Shaoxu Song1, Xiaochen Zhu1, Xuemin Lin2 1Tsinghua University, China 2University of New South Wales, Australia VLDB 2013

  2. Outline 2/23 Motivation Recovery Algorithms Filling gaps in causal net Branching framework Dealing with loops Experiments Conclusion Ongoing Work VLDB 2013

  3. Event Data 3/23 Information systems record the business history in their event logs. E ID T ID Event Name Timestamp 1 1 Order 2013-04-22 13:33:34 2 1 Pay by Cash 2013-04-22 15:10:17 3 1 Check Inventory 2013-04-22 15:18:11 4 1 Validate 2013-04-22 15:31:50 5 1 Delivery 2013-04-23 08:14:26 Event Sequence: Order Pay by cash Check inventory Validate Delivery Huge Amount of Event Data: Corporation Products No. of Event Sequences Power Generatior Machinery Train 1,230,000 3,260,000 2,600,000 VLDB 2013

  4. Process Specification 4/23 Business events often follow certain business rules or constraints Order Order Pay by cash Check inventory Validate Delivery Pay by cash Pay by cash Check inventory Pay by cash 1 Event sequences Order Validate Order Check inventory inventory Validate Check inventory Delivery Check inventory Pay by credit card Pay by credit card Check credit history 2 Pay by cash Delivery Re-check Check 3 Pay by cash Pay by cash follow Pay by cash Constraints by Petri net: Sequence Parallel Choice XOR split XOR join B C D Pay by credit card Check credit history Order Validate Delivery F G A Process specification Check inventory AND split AND join E H Re-check VLDB 2013

  5. Missing Events 5/23 Pay by cash B B C C D D Sequence of events <A, B, E, F, G> complete <A, C, E, F, G> incomplete Pay by credit card Check credit history Order Validate Delivery G G G F F F A A A Check inventory E E E H Re-check The Causes of Missing Events: Man-made errors (typo); System failures (power down); Lost during system integration (mismatching). Survey in one division of CNR: 47% events are missed. VLDB 2013

  6. Meaning of Recovery 6/23 Why Need Recovery? 1 Order inventory Validate Delivery Pay by cash Pay by cash Check inventory Re-check Check 2 Order Pay by cash Check inventory Validate Delivery Incomplete provenance answer What are the prerequisites of Validate in sequence 1? Answer: Order , Check Inventory and Re-check . Pay by cash is lost. Inaccurate query answer SEQ(Pay by cash, Validate), which means Pay by cash followed by Validate . Answer: sequence 2. Event sequence 1 is lost. VLDB 2013

  7. Recovery of Missing Events 7/23 Event sequence <ACEFG> has missing event between ACE and F, called a GAP (ACE, F). Pay by cash B C C D D Pay by credit card Check credit history Order Validate Delivery F F G A A Check inventory E E H Re-check Recovery Insert the missing prerequisites into the gap; Based on the specification (follows constraints). A possible recovery is <ACEDFG> VLDB 2013

  8. Possible Recoveries 8/23 Multiple Possible Recoveries Pay by cash B B B C C D D Pay by credit card Check credit history Order Validate Delivery F F F G A A A Check inventory E E E H H Re-check Choice structures: Loop structures: To recover (AE,F), we have AEBF, AECDF To recover (AB,F), we have ABEF, ABEHEF, ABEHEHEF... What is the most likely recovery? VLDB 2013

  9. Minimum Recovery 9/23 Intuition: Generally, people try to make the minimum mistakes. Principle: Follow the minimum change law in improving data quality. Problem: The cost of a recovery, which is the number of inserted events should be minimized. Pay by cash B C D Pay by credit card Check credit history Order Validate Delivery F G A Check inventory E H Re-check To recover <AEFG>: <AEBFG> is a minimum recovery, while <AECDFG> is not. VLDB 2013

  10. Hardness and Related Work 10/23 Hardness: Owing to choices and parallelization of flows, there exist vast alternatives to enumerate in the recovery; We prove the NP-hardness of this problem. Existing Approach: The Alignment algorithm* Enumerates a factorial number of redundant recoveries. B B To recover (A,E), it generates S: {ABCDE, ABDCE, ACBDE, ACDBE, ADBCE, ADCBE} C C A A E E D D Our Observation: All the sequences in S are equivalent w.r.t. the specification; Any of S is a minimum recovery; Not necessary to enumerate. *M. de Leoni, F. M. Maggi, and W. M. P. van der Aalst. Aligning event logs and declarative process models for conformance checking. In BPM, pages 82 97, 2012. VLDB 2013

  11. Outline 11/23 Motivation Recovery Algorithms 1. Filling gaps upon causal net 2. Branching framework 3. Dealing with loops Experiments Conclusion Ongoing Work 2. Net with Choice Structures B 1. Causal Net C D F A G E H 3. Net with Loop Structures VLDB 2013

  12. 1. Filling Gaps upon Causal Net 12/23 Causal Net: net without XOR structures. It can be translated into a equivalent DAG. B B C C A A E E D D Based on parallel constraints & sequence constraints: Each complete sequence must be one of the topological sorts on the DAG. A recovery of incomplete sequence is a subsequence of a complete sequence. <ABCE> is a subsequence of <ABCDE> Hints: fills the missing prerequisites by backtracking. VLDB 2013

  13. 2. The Branching Framework 13/23 Problem: topological sort cannot be applied on choice structure. Hints: Enumerate all the possible execution of the choice structures Each execution is a causal net, topological sort can be applied. B D E A C To recover (A, E) (A, BE) B E (A, CDE) D E A C Recovery with minimum cost VLDB 2013

  14. Pruning Methods 14/23 1. Branching Index: prune the branches not including the right-side event of gap. Path Reachability Pruning: prune the branches that the left-side of gap are not reachable to the right-side of gap. Branch and Bound: prune the branches that may lead to possible recoveries but cannot be the minimum one. Local Optimality: identify groups of transitions that always share the same branching, thus only one of them needs to be computed. 2. 3. 4. Gap (AC,E) LB = 3 B E E E E Gap (A,E) Gap (A,D) LB = 4 D D E E E A A A A C C VLDB 2013

  15. 3. Extension on Loops 15/23 Problem: when to terminate? Hints: considers loop at most once (minimum cost). To recover (ABC, B) E (ABC, AB) Not valid B B C C D A A B A E E B (ABC, EB) B B C C D A A E E E B (ABC, EBCEB) Not minimum B C B B C C D A A VLDB 2013

  16. Outline 16/23 Motivation Recovery Algorithms Filling gaps in causal net Branching framework Dealing with loops Experiments Conclusion Ongoing Work VLDB 2013

  17. Experiment Setting 17/23 Real Life Data Set: employed from CNR* Causal net Net with choices (no loop) Net with loops 25 No. of Process Specifications No. of Event Sequences Max Specification Size Max Sequence Size 63 transitions 79 places 149 57 4470 1000 events 67 Setting we randomly remove 10%-90% events from the complete sequences; apply the recovery methods to recover the removed events. Criteria: to evaluate the accuracy of recovery, F-measure of precision and recall. Baseline: the Alignment approach. *www.tangche.com VLDB 2013

  18. Experiments on Causal Net 18/23 The accuracy is no worse than Alignment algorithm. Our algorithm shows great improvement in time cost. accuracy accuracy 0.9 1 0.8 0.9 0.8 0.7 F-Measure 0.7 0.6 F-Measure 0.6 0.5 0.5 0.4 Alignment Backtracking 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Missing Rate 0-2 3-5 6-8 9-11 12-14 21-23 Specification size time performance time performance 10000 100000 10000 1000 Time cost (ms) Time cost (ms) 1000 100 100 10 10 1 1 0.1 0.1 0.01 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Missing Rate 0-2 3-5 6-8 9-11 12-14 21-23 VLDB 2013 Specification size

  19. Experiments on Net with Choices 19/23 The pruning methods are effective. accuracy accuracy 0.9 1 Alignment Branch Branch+Reach Branch+Reach+Bound Local Local+Reach+Bound 0.9 0.8 0.8 0.7 0.7 F-Measure F-Measure 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Missing Rate Specification size time performance time performance 10000 1000 1000 100 100 Time cost (ms) Time cost (ms) 10 10 1 1 0.1 0.01 0.1 0.001 0.01 0-4 5-9 10- 14 15- 19 Specification size 20- 24 25- 29 30- 34 35- 39 40- 44 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Missing Rate Best approach VLDB 2013

  20. Experiments on Net with Loops 20/23 Local+Reach+Bound can always achieve the lowest time cost. Alignment Branch Branch+Reach Branch+Reach+Bound Local Local+Reach+Bound accuracy time performance 100000 10000 Time cost (ms) 1 0.9 0.8 0.7 F-Measure 1000 0.6 0.5 Best approach 0.4 0.3 0.2 0.1 0 100 100 200 300 400 500 600 700 800 900 1000 Sequence size Sequence size VLDB 2013

  21. Conclusion 21/23 We define the specification-based minimum recovery problem for missing events. We prove the NP-hardnessof this problem. To efficiently find the optimal recovery: We propose a backtracking idea to reduce the redundant sequences with respect to parallel events. We construct a branching index, and develop reachability checking and lower bounds of recovery distances to further accelerate the computation. The local optimal method can reduce the number of intermediate results. VLDB 2013

  22. Ongoing Work 22/23 In this paper of missing events, we assume that the logged event data are clean. However, the logged event data may also be dirty. Encoding example: UTF-8: Re-check GB2312: e ? C D Delivery G Validate F Order Check credit history Check inventory Pay by credit card Check inventory E A e ? E Without addressing erroneous events, Inaccurate provenance answer Inaccurate query answer By the insertion technique proposed in this paper, we cannot elimate the erroneous events. Instead, we need to modify the mistakely recorded events Non-trivial, also NP-hard Heuristic approach may apply VLDB 2013

  23. Q & A Thanks 23/23 VLDB 2013

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#