Efficient Specification Mining for Trustworthy Software Development

Specification Mining With Few False Positives
 
 
Claire Le Goues
Westley Weimer
University of Virginia
March 25, 2009
 
 
Slide 0.5: Hypothesis
 
 
We can use measurements of the "trustworthiness" of source code to mine specifications with few false positives.
 
 
Outline
 
Motivation: Specifications
Problem: Specification Mining
Solution: Trustworthiness
Evaluation: 3 Experiments
Conclusions
 
 
Specifications

Why Specifications?
 
Modifying code, correcting defects, and evolving code account for as much as 90% of the total cost of software projects.
Up to 60% of maintenance time is spent studying existing software.
Specifications are useful for debugging, testing, maintaining, refactoring, and documenting software.
 
Our Definition (Broadly)
 
 
A specification is a formal description of some aspect of legal program behavior.

What kind of specification?
 
We would like specifications that are simple and machine-readable.
We focus on partial-correctness specifications describing temporal properties.
These describe legal sequences of events, where an event is a function call; similar to an API.
We use two-state finite state machines.
 
Example Specification
 
Event A: Mutex.lock()
Event B: Mutex.unlock()

Example: Locks

[Two-state diagram: from state 1, Mutex.lock() leads to state 2; Mutex.unlock() leads back to state 1.]

Our Specifications
 
For the sake of this work, we are talking about this type of two-state temporal specification.
These specifications correspond to the regular expression (ab)*.
More complicated patterns are possible. (A minimal checker for this pattern is sketched below.)
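
To make the (ab)* pattern concrete, here is a minimal checker for such a two-state property. This is our own illustration, not the miner itself, and the event strings are just placeholders.

```java
// Minimal sketch (not the authors' tool): a checker for the two-state (ab)* property,
// instantiated here with a = Mutex.lock() and b = Mutex.unlock(). Names are illustrative.
import java.util.List;

public class PairSpecMonitor {
    private final String eventA;
    private final String eventB;

    public PairSpecMonitor(String eventA, String eventB) {
        this.eventA = eventA;
        this.eventB = eventB;
    }

    /** Returns true iff the trace, restricted to events A and B, is a word in (ab)*. */
    public boolean accepts(List<String> trace) {
        boolean open = false;                 // false = state 1 (accepting), true = state 2
        for (String event : trace) {
            if (event.equals(eventA)) {
                if (open) return false;       // a second 'a' before a 'b': violation
                open = true;
            } else if (event.equals(eventB)) {
                if (!open) return false;      // 'b' without a preceding 'a': violation
                open = false;
            }
            // unrelated events are ignored
        }
        return !open;                         // the trace must end in the accepting state
    }

    public static void main(String[] args) {
        PairSpecMonitor spec = new PairSpecMonitor("Mutex.lock()", "Mutex.unlock()");
        System.out.println(spec.accepts(List.of("Mutex.lock()", "f()", "Mutex.unlock()"))); // true
        System.out.println(spec.accepts(List.of("Mutex.lock()", "Mutex.lock()")));          // false
    }
}
```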
 
The Problem

Where do formal specifications come from?

Formal specifications are useful, but there aren’t as many as we would like.
We use specification mining to automatically derive the specifications from the program itself.

Mining 2-state Temporal Specifications
 
Input: program traces – sequences of events that can take place as the program runs.
Consider pairs of events that meet certain criteria.
Use statistics to figure out which ones are likely true specifications.
Output: a ranked set of candidate specifications, presented to a programmer for review and validation. (A baseline sketch of this recipe follows.)
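
As a rough sketch of that recipe (our own simplification; the statistic used by published miners differs in detail), the fragment below counts, for each candidate pair (a, b), how often an a is eventually followed by a b on the traces versus left dangling, and ranks candidates by the fraction followed.

```java
// Sketch only: counts followed/dangling occurrences for candidate pairs and ranks them.
// This is a simplification, not the exact statistic of any published miner.
import java.util.*;

public class PairMiner {
    static final class Candidate {
        final String a, b;
        double followed = 0, dangling = 0;
        Candidate(String a, String b) { this.a = a; this.b = b; }
        double score() {
            double total = followed + dangling;
            return total == 0 ? 0 : followed / total;    // fraction of a's eventually followed by b
        }
    }

    /** On one trace, credit each 'a' that is eventually matched by a later 'b'. */
    static void count(Candidate c, List<String> trace, double weight) {
        int open = 0;                                    // a's seen but not yet matched
        for (String e : trace) {
            if (e.equals(c.a)) open++;
            else if (e.equals(c.b) && open > 0) { open--; c.followed += weight; }
        }
        c.dangling += weight * open;                     // a's never followed by a b
    }

    /** Score every candidate pair over all traces; here every trace is weighted equally. */
    static List<Candidate> mine(List<List<String>> traces, List<String[]> pairs) {
        List<Candidate> out = new ArrayList<>();
        for (String[] p : pairs) {
            Candidate c = new Candidate(p[0], p[1]);
            for (List<String> t : traces) count(c, t, 1.0);
            out.add(c);
        }
        out.sort(Comparator.comparingDouble(Candidate::score).reversed());
        return out;                                      // ranked candidates, best first
    }

    public static void main(String[] args) {
        List<List<String>> traces = List.of(
            List.of("Mutex.lock()", "work()", "Mutex.unlock()"),
            List.of("Iterator.hasNext()"));              // hasNext() with no next(): dangling
        List<String[]> pairs = List.of(
            new String[]{"Mutex.lock()", "Mutex.unlock()"},
            new String[]{"Iterator.hasNext()", "Iterator.next()"});
        for (Candidate c : mine(traces, pairs))
            System.out.println(c.a + " -> " + c.b + " : " + c.score());
    }
}
```
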
Problem: False Positives Are Common
Event A: Iterator.hasNext()
Event B: Iterator.next()

This is very common behavior.
This is not required behavior.
Iterator.hasNext() does not have to be followed eventually by Iterator.next() in order for the code to be correct.
This candidate specification is a false positive. (A small fragment of correct code that breaks the pattern is shown below.)
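
For instance, this illustrative fragment (not from the slides) uses hasNext() only to test for emptiness and never calls next(), yet it is perfectly correct, so "hasNext() is eventually followed by next()" cannot be a required property:

```java
import java.util.Iterator;
import java.util.List;

public class HasNextWithoutNext {
    /** Correct code: hasNext() is used only to test for emptiness; next() is never called. */
    static boolean isEmpty(List<String> items) {
        Iterator<String> it = items.iterator();
        return !it.hasNext();   // no matching next(): the "hasNext then next" pair is not required
    }

    public static void main(String[] args) {
        System.out.println(isEmpty(List.of()));        // true
        System.out.println(isEmpty(List.of("x")));     // false
    }
}
```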
 
Previous Work
 
* Adapted from Weimer-Necula, TACAS 2005

Benchmark    LOC     Candidate Specs    False Positive Rate
Infinity     28K     10                 90%
Hibernate    57K     51                 82%
Axion        65K     25                 68%
Hsqldb       71K     62                 89%
Cayenne      86K     35                 86%
Sablecc      99K     4                  100%
Jboss        107K    114                90%
Mckoi-sql    118K    156                88%
Ptolemy2     362K    192                95%
 
 
Our Solution: Trustworthiness

The Problem (as we see it)
 
Let’s pretend we’d like to learn the rules of English grammar…
…but all we have is a stack of high school English papers.
Previous miners ignore the differences between A papers and F papers.
Previous miners treat all traces as though they were equally indicative of correct program behavior.

Solution: Code Trustworthiness
 
Trustworthy code is unlikely to exhibit API policy violations.
Candidate specifications derived from trustworthy code are more likely to be true specifications.

What is trustworthy code?
 
Informally…
Code that hasn’t been changed recently
Code that was written by trustworthy developers
Code that hasn’t been cut and pasted all over the place
Code that is readable
Code that is well-tested
And so on.

Can you firm that up a bit?
 
Multiple surface-level, textual, and semantic features can reveal the trustworthiness of code:
Churn, author rank, copy-paste development, readability, frequency, feasibility, density, and others.
Our miner should believe that lock() – unlock() is a specification if it is often followed on trustworthy traces and often violated on untrustworthy ones.

A New Miner
 
Statically estimate the trustworthiness of each code fragment.
Lift that judgment to program traces by considering the code visited along the trace.
Weight the contribution of each trace by its trustworthiness when counting event frequencies while mining. (See the sketch below.)
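
A sketch of how those three steps might fit together. This is our own illustration: the metric names, the averaging step, and the weights are placeholders, and the published miner's statistic differs in detail.

```java
// Sketch only: trustworthiness as a weighted sum of code metrics, lifted to traces by
// averaging, then used as the per-trace weight in the PairMiner.count() call above.
// Metric names and weights here are illustrative placeholders.
import java.util.List;
import java.util.Map;

public class TrustWeighting {
    /** Trustworthiness of one code fragment: a linear combination of its metric values. */
    static double fragmentTrust(Map<String, Double> metrics, Map<String, Double> weights) {
        double t = 0;
        for (Map.Entry<String, Double> w : weights.entrySet())
            t += w.getValue() * metrics.getOrDefault(w.getKey(), 0.0);
        return t;
    }

    /** Lift fragment trustworthiness to a trace, e.g. by averaging over the code it visits. */
    static double traceTrust(List<Map<String, Double>> visitedFragments, Map<String, Double> weights) {
        if (visitedFragments.isEmpty()) return 0;
        double sum = 0;
        for (Map<String, Double> frag : visitedFragments) sum += fragmentTrust(frag, weights);
        return sum / visitedFragments.size();
    }

    public static void main(String[] args) {
        Map<String, Double> weights = Map.of("readability", 0.5, "churn", -0.3, "copyPaste", -0.2);
        Map<String, Double> fragment = Map.of("readability", 0.9, "churn", 0.1, "copyPaste", 0.0);
        // In the PairMiner sketch above, the counting call would become something like:
        //   count(c, trace, traceTrust(fragmentsVisitedOn(trace), weights));
        // where fragmentsVisitedOn is a hypothetical helper, so trustworthy traces dominate the counts.
        System.out.println(traceTrust(List.of(fragment), weights));
    }
}
```
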
Incorporating Trustworthiness
 
We use linear regression on a set of previously published specifications to learn good weights for the different trustworthiness factors.
Different weights yield different miners. (One way to write this down is sketched below.)
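
In symbols (our notation, not the slides'): if m_i(τ) is the value of the i-th trustworthiness metric for trace τ, the trustworthiness score is linear in the metrics, and the weights are fit by a regression against labels y_j derived from the previously published, known-true specifications. Roughly:

```latex
T(\tau) \;=\; w_0 + \sum_{i=1}^{k} w_i \, m_i(\tau),
\qquad
\hat{w} \;=\; \arg\min_{w} \sum_{j} \bigl( y_j - T_w(\tau_j) \bigr)^2
```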
 
 
Evaluation

Experimental Questions
 
Can we use trustworthiness metrics to build a miner that finds useful specifications with few false positives?
Which trustworthiness metrics are the most useful in finding specifications?
Do our ideas about trustworthiness generalize?
 
Experimental Question 1: Can we use trustworthiness metrics to build a miner that finds useful specifications with few false positives?

Experimental Setup: Some Definitions
 
False positive: an event pair that appears in the candidate list, but a program trace may contain only event A and still be correct.
Our normal miner balances true positives and false positives (maximizes F-measure).
Our precise miner avoids false positives (maximizes precision). The standard definitions of precision and F-measure are recalled below.
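
For reference, the standard quantities being maximized, with TP, FP, and FN counting true-positive, false-positive, and missed candidate specifications:

```latex
\mathrm{precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{recall} = \frac{TP}{TP + FN},
\qquad
F\text{-measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```
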
Experiment 1: A New Miner
 
On this dataset:
Our normal miner produces 107 false positive specifications.
Our precise miner produces 1.
The previous work produces 567.

Per-benchmark false positive rates and violations reported:

Program     Normal Miner      Precise Miner     WN (previous)
Hibernate   53%   279         17%   153         82%   93
Axion       42%   71          0%    52          68%   45
Hsqldb      25%   36          0%    5           89%   35
jboss       84%   255         0%    12          90%   94
Cayenne     58%   45          0%    23          86%   18
Mckoi-sql   59%   20          0%    7           88%   69
ptolemy     14%   44          0%    13          95%   72
Total       69%   740         5%    265         89%   426

More Thoughts On Experiment 1
 
Our normal miner improves on the false positive rate of previous miners by 20%.
Our precise miner offers an order-of-magnitude improvement on the false positive rate of previous work.
We find specifications that are more useful in terms of bug finding: we find 15 bugs per mined specification, where previous work only found 7.
In other words: we find useful specifications with fewer false positives.
 
Experimental Question 2: Which trustworthiness metrics are the most useful in finding specifications?
 
Experiment 2: Metric Importance
 
Results of an analysis of variance (ANOVA).
Shows the importance of the trustworthiness metrics.
F is the predictive power (1.0 means no power).
p is the probability that it had no effect (smaller is better).

Metric          F      p
Frequency       32.3   0.0000
Copy-Paste      12.4   0.0004
Code Churn      10.2   0.0014
Density         10.4   0.0013
Readability     9.4    0.0021
Feasibility     4.1    0.0423
Author Rank     1.0    0.3284
Exceptional     10.8   0.0000
Dataflow        4.3    0.0000
Same Package    4.0    0.0001
One Error       2.2    0.0288
 
 
More Thoughts on Experiment 2
 
Statically predicted path frequency has the strongest predictive power.
Author rank has no effect on the model.
Previous work falls somewhere in the middle.
 
Experimental Question 3: Do our ideas about trustworthiness generalize?

Experiment 3: Does it generalize?
 
Previous work claimed that more input is necessarily better for specification mining.
We hypothesized that smaller, more trustworthy input sets would yield more accurate output from previously implemented tools.
 
Experiment 3: Generalizing
 
The top 25% "most trustworthy" traces make for a much more accurate miner; the opposite effect holds for the 25% "least trustworthy" traces.
We can throw out the least trustworthy 40-50% of traces and still find the exact same specifications with a slightly lower false positive rate.
More traces are not necessarily better: a smaller set of trustworthy traces can beat a larger, less trustworthy one.

Experimental Summary
 
We can use trustworthiness metrics to build a better miner: our normal miner improves on the false positive rate of previous work by 20%, and our precise miner by an order of magnitude, while still finding useful specifications.
Statistical techniques show that our notion of trustworthiness contributes significantly to our success.
We can increase the precision and accuracy of previous techniques by using a trustworthy subset of the input.
 
 
Conclusions

Summary
 
Formal specifications are very useful.
The previous work in specification mining yields too many false positives for industrial practice.
We developed a notion of trustworthiness to evaluate the likelihood that code adheres to two-state temporal specifications.
 
Conclusion
 
A specification miner that incorporates notions of code trustworthiness can mine useful specifications with a much lower false positive rate.
 
The End
(questions?)
 

