Modeling Scientific Software Architecture from Regression Tests using Data Mining Techniques

Is feature “X” ready for my intended use?

Matthew Mosby, with Jake Healy, Tony Nguyen, and Chris Siefert
ASC Sustainable Scientific Software Conference
NLIT Summit, Milwaukee, WI, June 27-30, 2023

SAND2023-05296C

This work was supported by the Laboratory Directed Research and Development program at Sandia National Laboratories, a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia LLC, a wholly owned subsidiary of Honeywell International Inc. for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.
IS FEATURE “X” READY FOR MY INTENDED USE?

Feature (n): user input to a scientific software program that activates a specific capability or behavior.

Examples:
- Material model formulation
- Active physics (e.g., contact)
- Solver selection
- Time integration scheme
- Discretization

As a user of SciSoft, how can I be confident that a given feature is ready for use? What evidence is available to me for deciding between two similar features?

Typical evidence:
- Overall software test coverage
- Identifying that the feature is tested
- An SME assertion that the feature is ready

Is this evidence sufficient?
A (VERY) SIMPLE MOTIVATING EXAMPLE

Feature: elastic/linearly plastic material

Credibility evidence:
- The overall code coverage is 90%
- The model is used in several tests
- The code SME isn’t available

Quiz time:
- How many conditions are in this model? Could those branches be in the 10% of missing code coverage? Absolutely.
- What can a user assert about the quality of the tests this model was used in? Nothing; this is why the SME is involved!

What would change if the user were presented with the following?

    Estimated coverage of <feature>: 30%

Goal: enable such feedback.
 
WHAT DOES ANY OF THAT HAVE TO DO WITH SOFTWARE ARCHITECTURE?

- SciSoft is complex, long-lived, and changing
- SciSoft is often written by scientists, not computer scientists
- Features are often difficult to test in isolation

Architecture (n): the relationship between a user-facing feature and its software implementation.

Understanding the architecture is a prerequisite to gathering feature-level readiness evidence.

[Figure: Library-level dependency graph of the SIERRA/SM application]
 
 
THE GENERAL APPROACH – MINE THE REGRESSION TESTS

- SciSoft typically has test suites
- Instrument the code/tests to provide:
  - the features used by an execution
  - the code coverage from an execution
- Run the instrumented test suite
- Per-test records form the training data (see the sketch below)
- Apply ML algorithms to construct a model of the architecture

Conceptual linear form: C = A·B, where rows index tests, the columns of A are features, the columns of C are coverage keys, and B (features × coverage) is the architecture model; the general form is to find the B that best reproduces the observed coverage C from the observed features A.

In other words, B is an ML multi-label classifier: “Given a feature (set), estimate the coverage set.”
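A minimal sketch of the per-test record gathering, assuming a hypothetical test-harness API (test.run(), result.feature_keys, result.covered_file_keys); the real instrumentation is application-specific:

```python
# Each instrumented test execution is reduced to the set of feature keys it
# used and the set of coverage keys it touched: one training sample per test.
def run_instrumented(test):
    result = test.run()  # hypothetical harness call
    return {
        "test": test.name,
        "features": set(result.feature_keys),       # from feature instrumentation
        "coverage": set(result.covered_file_keys),  # from coverage instrumentation
    }

records = [run_instrumented(t) for t in test_suite]  # test_suite is hypothetical
```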
 
WHAT CONSTITUTES A FEATURE? HOW TO IDENTIFY THEM?

What:
- Up to interpretation by the user and/or SME
- General, e.g., a model formulation
- Specific, e.g., a sub-option/setting

How:
- Feature annotation by an SME, or automatic
- Automatic annotation is strongly preferred:
  - Always up to date
  - Supports user annotation of input
  - Extends to library-level SciSoft APIs

Feature identification requirements (a key-generation sketch follows):
- Feature keys are generated by unique context
- Keys don’t encode parameter/option values
- Keys can be mapped back to the input command

Example input with generated keys:

    Input                                      Key (e.g.)
    begin material steel                       0d4f3dd
      density = 0.000756                       b16b186
      begin parameters for model ml_ep_fail    698c7e7
        youngs modulus = {youngs}              3a1dac5
        poissons ratio = {poissons}            <...>
        yield stress = {yield}                 <...>
        ...                                    <...>
      end                                      <none>
    end                                        <none>
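A minimal sketch of automatic key generation meeting these requirements, assuming the input deck is visited as a nested command context; the hashing scheme and helper names are illustrative, not the actual Feature Coverage Tool:

```python
import hashlib

key_to_command = {}  # side table so a key can be mapped back to its input command

def feature_key(context):
    """Hash a command context, e.g. ["material", "parameters for model ml_ep_fail",
    "youngs modulus"], into a short key. Parameter/option values are deliberately
    excluded, so "youngs modulus = 1e9" and "youngs modulus = 2e9" yield
    the same key."""
    text = " / ".join(context)
    key = hashlib.sha1(text.encode()).hexdigest()[:7]
    key_to_command[key] = text
    return key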
 
FORMING THE TRAINING DATA (A, C)

Constructing A:
- Run the tests, logging which features are used
- Label the columns of A with feature keys
- Each test produces a (sparse) row of A

Constructing C:
- Run the tests with coverage instrumentation
- Label the columns of C with coverage keys
- Each test produces a (sparse) row of C
- Optional level of detail: file, function, edge, or line; greater detail increases the dimension

A matrix-assembly sketch follows.
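A minimal sketch of assembling A and C as sparse 0/1 matrices, assuming the hypothetical per-test records from the earlier sketch:

```python
from scipy.sparse import lil_matrix

def build_matrix(records, field):
    """One row per test; one 0/1 column per observed key."""
    keys = sorted({k for rec in records for k in rec[field]})
    col = {k: j for j, k in enumerate(keys)}
    M = lil_matrix((len(records), len(keys)), dtype=int)
    for i, rec in enumerate(records):
        for k in rec[field]:
            M[i, col[k]] = 1
    return M.tocsr(), keys

A, feature_keys = build_matrix(records, "features")
C, coverage_keys = build_matrix(records, "coverage")
```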
 
 
MODELING THE ARCHITECTURE (B)

- Architecture modeling can be framed as an ML multi-label classification problem
- We are using readily available classifiers
- Decision-tree classifiers are a good fit for our type of data:
  - They use a series of splits based on parameter influence
  - They suffer from variance and bias, which can be reduced by ensembles

[Figure: Titanic passenger survival model as a decision tree classifier [1]]

[1] Lovinger, J., Valova, I. Infinite Lattice Learner: an ensemble for incremental learning. Soft Comput 24, 6957–6974 (2020).
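A minimal sketch of fitting the model, reusing the matrices from the sketch above; the estimator choice and hyperparameters are illustrative. scikit-learn's tree ensembles accept a 2-D 0/1 label array and treat the problem as multi-label, so (A, C) can be fit directly:

```python
from sklearn.ensemble import ExtraTreesClassifier

# The fitted classifier plays the role of B: feature vector in, coverage set out.
B = ExtraTreesClassifier(n_estimators=100, random_state=0)
B.fit(A.toarray(), C.toarray())

predicted = B.predict(A[:1].toarray())  # predicted coverage set for the first test
```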
 
ASIDE: SOFTWARE SUSTAINABILITY & CREDIBILITY BENEFITS FROM THE ABILITY TO AUTOMATICALLY GATHER THE TRAINING DATA

Benefits from A:
- A feature coverage database
- Statistics on how apps are used “in the wild”
- Identification of weak/untested features

Benefits from C:
- Supports optimal test-suite construction
- Faster CI for large projects
- Targeted change-based testing

Automatically gathering this data is foundational to a variety of potential user- and developer-facing credibility and productivity tools.
THE DATASET: THE SIERRA/SM (SOLID MECHANICS) APP

SIERRA Engineering Mechanics Code Suite:
- Source lines of code (C/C++): ~2M
- Regression and unit tests: ~20k
- Source line coverage: ~75%

Approach:
- Focus on the Solid Mechanics app, and on the “small” tests in the suite
- Use the Feature Coverage Tool to build A
- Use LLVM SanitizerCoverage, with a custom callback for file-level coverage, to build C

SIERRA/SM dataset details:
- Tests (samples): 6393
- Features: 7005
- Covered files (labels): 5347
 
INITIAL RESULTS

- Examine the performance of two classifiers
- Split the data into train/test sets: train on all data (0% test) or reserve 30% for testing
- Use the native “score”: the fraction of samples whose label sets are predicted 100% correctly

    Classifier          Test %   Train Score   Test Score
    ExtraTrees [1]      0%       0.64          -
    ExtraTrees [1]      30%      0.68          0.24
    RandomForest [2]    0%       0.64          -
    RandomForest [2]    30%      0.68          0.24

[Figure: Sparsity pattern of the training data (6393 samples; 7005 features at 1.4% non-zeros, 5347 coverage labels at 29% non-zeros). Note the high coverage density: “mainline” coverage contributes to over-fit.]
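A minimal sketch of this evaluation protocol, reusing names from the earlier sketches; the split seed and estimator settings are illustrative:

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

A_train, A_test, C_train, C_test = train_test_split(
    A.toarray(), C.toarray(), test_size=0.30, random_state=0)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(A_train, C_train)

# For multi-label data, score() is subset accuracy: a sample counts only if
# its entire predicted label set matches exactly.
print("train score:", clf.score(A_train, C_train))
print("test score: ", clf.score(A_test, C_test))
```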
 
[1] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
[2] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Can filtering the training data improve the fit/predictions?
 
FILTERING THE TRAINING DATASET

We want automatic filtering approaches (a sketch follows):
- Identify duplicate samples:
  - Reduce duplicate samples by taking the union of their coverage data
  - e.g., the 1-core vs. N-core cases of the same test
- Remove “mainline” features and coverage:
  - Filter out columns with >= 99% fill
  - Removed features: boilerplate, e.g., ‘begin sierra’
  - Removed coverage: libraries (e.g., ioss, stk) and parsing

[Figure: Sparsity pattern of the reduced training data (3468 samples; 6992 features at 0.8% non-zeros, 4284 coverage labels at 12% non-zeros). High feature diversity in the samples is maintained, while “mainline” coverage is greatly reduced.]
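A minimal sketch of the two filters, assuming A and C are the 0/1 CSR matrices built earlier; the thresholds match the text:

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

def drop_dense_columns(M, keys, max_fill=0.99):
    """Remove 'mainline' columns that are set in >= 99% of the samples."""
    fill = np.asarray(M.mean(axis=0)).ravel()
    keep = np.flatnonzero(fill < max_fill)
    return M[:, keep], [keys[j] for j in keep]

def merge_duplicate_samples(A, C):
    """Merge samples with identical feature rows (e.g., 1-core and N-core runs
    of the same test), taking the union of their coverage rows."""
    groups = {}
    for i in range(A.shape[0]):
        groups.setdefault(tuple(A[i].indices), []).append(i)
    reps = [rows[0] for rows in groups.values()]
    A_new = A[reps]
    C_new = vstack([csr_matrix(C[rows].max(axis=0)) for rows in groups.values()])
    return A_new, C_new
```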
 
FILTERING THE TRAINING DATASET

SIERRA/SM reduced dataset details:
- Tests (samples): 3468
- Features: 6992
- Covered files (labels): 4284

Classifier performance on the reduced dataset (the full sample score is evaluated on the full, unfiltered sample set):

    Classifier          Test %   Train Score   Test Score   Full Sample Score
    ExtraTrees [1]      0%       1.0           -            0.622
    ExtraTrees [1]      30%      1.0           0.183        -
    RandomForest [2]    0%       0.999         -            0.621
    RandomForest [2]    30%      0.998         0.169        -

[1] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
[2] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

The reduced training dataset provides a better fit and maintains accuracy on the full sample set.
EXPERIMENT: IDENTIFICATION OF FEATURE SUPPORT

SIERRA/SM has the typical structure of a 30-year-old SciSoft package.

<obligatory developer joke slide>
 
EXPERIMENT: IDENTIFICATION OF FEATURE SUPPORT

- SIERRA/SM has the typical structure of a 30-year-old SciSoft package
- Materials, however, do have a well-defined interface:
  - All models are implemented in the Lamé library
  - Source file names relate to model names
  - So we are able to construct and verify a known label set

Given a feature key for a material model: if the classifier is accurate, then it will predict the Lamé library files that support that material model. Bonus: if the classifier is precise, the prediction doesn’t contain other sources.

[Figure: bin/adagio links liblame.a, which contains per-model sources such as MLEP.C, Elastic.h, and johnson_cook_model.F]
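A minimal sketch of this query, reusing the trained classifier and key lists from the earlier sketches; the material key string is a hypothetical placeholder:

```python
import numpy as np

def predicted_files(clf, feature_keys, coverage_keys, active):
    """Predict the coverage set for a synthetic single-feature input."""
    x = np.zeros((1, len(feature_keys)))
    for key in active:
        x[0, feature_keys.index(key)] = 1.0
    y = clf.predict(x)[0]
    return [f for f, on in zip(coverage_keys, y) if on]

files = predicted_files(B, feature_keys, coverage_keys, ["<elastic-material-key>"])
lame_hits = [f for f in files if "lame/" in f]      # accuracy: the right library files
noise     = [f for f in files if "lame/" not in f]  # precision: everything else
```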
 
RESULTS: IDENTIFICATION OF FEATURE SUPPORT

    Material        # Samples   Correct Lamé labels   Wrong Lamé labels   Non-Lamé labels
    elastic         2305        7 of 7                0                   259
    johnson_cook    61          5 of 9                2                   291
    dsa             28          5 of 7                2                   245
    mlep            68          5 of 10               2                   245

“# Samples” is the number of times the specified material appears in the training data.

There is too much noise/bias in the model: it predicts that everything is elastic.
 
USE SME GUIDANCE TO FOCUS THE TRAINING DATASET FOR MATERIALS

- We know that all material models are implemented in the ‘lame/’ directory
- We know the feature correlation, i.e., model options are associated with a specific model
- Reduce the feature set to the possible materials
- Reduce the coverage set to files in ‘lame/’
- Train a sub-model with the reduced dataset (see the sketch below)

SIERRA/SM material-only dataset:
- Tests (samples): 3468
- Features (materials): 135
- Covered files (labels): 681

    Model           # Correct
    elastic         7 of 7
    mlep            10 of 10
    johnson_cook*   9 of 9
    dsa*            7 of 7
    jc + mlep       8 of 14

* Model never used alone in the sample set.

How might we automatically detect these reduced spaces to improve accuracy?
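A minimal sketch of the SME-guided sub-model, assuming a hypothetical predicate for recognizing material feature keys; the column masks mirror the reductions above:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def is_material_feature(key):  # hypothetical SME-provided predicate
    return key_to_command.get(key, "").startswith("material")

feat_cols = np.flatnonzero([is_material_feature(k) for k in feature_keys])
cov_cols = np.flatnonzero(["lame/" in f for f in coverage_keys])  # SME: materials live in lame/

A_mat = A[:, feat_cols].toarray()    # features reduced to possible materials
C_lame = C[:, cov_cols].toarray()    # labels reduced to files in lame/

sub_model = ExtraTreesClassifier(n_estimators=100, random_state=0)
sub_model.fit(A_mat, C_lame)
```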
 
SUMMARY

- We can predict which source files cover a given input deck with ~60% accuracy
- The current model is noisy:
  - It identifies a lot of library-type files
  - It overpredicts coverage for specific features
  - It carries bias from the sample feature distribution
- It can improve new-developer productivity by providing pointers to where a feature is implemented, even if they are not very specific
- It is a poor predictor of specific features without sub-modeling; segmented models can improve accuracy
 
FUTURE WORK

Open questions:
- Can we use unsupervised learning to automatically discover correlated features and construct piecewise models spanning the feature space?
- How could we sustainably incorporate SME knowledge?
- Is overprediction of file coverage acceptable? To what extent? Can we train to this metric?

Next steps to provide user feedback:
- Query the full coverage data given the files supporting a feature
- Develop a coverage metric that is meaningful to an end user
- Integrate the information into other user-facing credibility tools
 
ACKNOWLEDGEMENTS

Jake Healy
Tony Nguyen
Chris Siefert

CIS Project 229302

Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC (NTESS), a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration (DOE/NNSA) under contract DE-NA0003525. This written work is authored by an employee of NTESS. The employee, not NTESS, owns the right, title and interest in and to the written work and is responsible for its contents. Any subjective views or opinions that might be expressed in the written work do not necessarily represent the views of the U.S. Government. The publisher acknowledges that the U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this written work or allow others to do so, for U.S. Government purposes. The DOE will provide public access to results of federally sponsored research in accordance with the DOE Public Access Plan.
 
THANKS!