Modeling Scientific Software Architecture from Regression Tests using Data Mining Techniques

Is feature “X” ready for my intended use?

Matthew Mosby, with Jake Healy, Tony Nguyen, and Chris Siefert
ASC Sustainable Scientific Software Conference
NLIT Summit, Milwaukee, WI, June 27-30, 2023

SAND2023-05296C

This work was supported by the Laboratory Directed Research and Development program at Sandia National Laboratories, a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia LLC, a wholly owned subsidiary of Honeywell International Inc. for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.
IS FEATURE “X” READY FOR MY INTENDED USE?

Feature (n): user input to a scientific software program that activates a specific capability or behavior.

Examples:
- Material model formulation
- Active physics (e.g., contact)
- Solver selection
- Time integration scheme
- Discretization

As a user of SciSoft, how can I be confident that a given feature is ready for use? What evidence is available to me for deciding between two similar features?

Typical evidence:
- Overall software test coverage
- Identifying that the feature is tested
- An SME assertion that the feature is ready

Is this evidence sufficient?
A (VERY) SIMPLE MOTIVATING EXAMPLE

Feature: elastic/linearly plastic material

Credibility evidence:
- The overall code coverage is 90%
- The model is used in several tests
- The code SME isn’t available

Quiz time:
- How many conditions are in this model? Could those branches be in the 10% of missing code coverage? Absolutely.
- What can a user assert about the quality of the tests this model was used in? Nothing; this is why the SME is involved!

What would change if the user were presented with the following?

    Estimated coverage of <feature>: 30%

Goal: enable such feedback.
 
WHAT DOES ANY OF THAT HAVE TO DO WITH SOFTWARE ARCHITECTURE?

- SciSoft is complex, long-lived, and changing
- SciSoft is often written by scientists, not computer scientists
- Features are often difficult to test in isolation

Architecture (n): the relationship between a user-facing feature and its software implementation.

Understanding the architecture is a prerequisite to gathering feature-level readiness evidence.

[Figure: Library-level dependency graph of the SIERRA/SM application]
 
 
THE GENERAL APPROACH – MINE THE REGRESSION TESTS

- SciSoft typically has test suites
- Instrument the code/tests to provide:
  - the features used by an execution
  - the code coverage from an execution
- Run the instrumented test suite
- Per-test records form the training data (see the sketch below)
- Apply ML algorithms to construct a model of the architecture

Conceptual linear form: C = A·B, where rows index tests, the columns of A are features, the columns of C are coverage keys, and B (features × coverage) is the architecture model; the general form is to find the B that best reproduces the observed coverage C from the observed features A.

In other words, B is an ML multi-label classifier: “Given a feature (set), estimate the coverage set.”
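A minimal sketch of the per-test record gathering, assuming a hypothetical test-harness API (test.run(), result.feature_keys, result.covered_file_keys); the real instrumentation is application-specific:

```python
# Each instrumented test execution is reduced to the set of feature keys it
# used and the set of coverage keys it touched: one training sample per test.
def run_instrumented(test):
    result = test.run()  # hypothetical harness call
    return {
        "test": test.name,
        "features": set(result.feature_keys),       # from feature instrumentation
        "coverage": set(result.covered_file_keys),  # from coverage instrumentation
    }

records = [run_instrumented(t) for t in test_suite]  # test_suite is hypothetical
```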
 
WHAT CONSTITUTES A FEATURE? HOW TO IDENTIFY THEM?

What:
- Up to interpretation by the user and/or SME
- General, e.g., a model formulation
- Specific, e.g., a sub-option/setting

How:
- Feature annotation by an SME, or automatic
- Automatic annotation is strongly preferred:
  - Always up to date
  - Supports user annotation of input
  - Extends to library-level SciSoft APIs

Feature identification requirements (a key-generation sketch follows):
- Feature keys are generated by unique context
- Keys don’t encode parameter/option values
- Keys can be mapped back to the input command

Example input with generated keys:

    Input                                      Key (e.g.)
    begin material steel                       0d4f3dd
      density = 0.000756                       b16b186
      begin parameters for model ml_ep_fail    698c7e7
        youngs modulus = {youngs}              3a1dac5
        poissons ratio = {poissons}            <...>
        yield stress = {yield}                 <...>
        ...                                    <...>
      end                                      <none>
    end                                        <none>
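A minimal sketch of automatic key generation meeting these requirements, assuming the input deck is visited as a nested command context; the hashing scheme and helper names are illustrative, not the actual Feature Coverage Tool:

```python
import hashlib

key_to_command = {}  # side table so a key can be mapped back to its input command

def feature_key(context):
    """Hash a command context, e.g. ["material", "parameters for model ml_ep_fail",
    "youngs modulus"], into a short key. Parameter/option values are deliberately
    excluded, so "youngs modulus = 1e9" and "youngs modulus = 2e9" yield
    the same key."""
    text = " / ".join(context)
    key = hashlib.sha1(text.encode()).hexdigest()[:7]
    key_to_command[key] = text
    return key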
 
FORMING THE TRAINING DATA (A, C)

Constructing A:
- Run the tests, logging which features are used
- Label the columns of A with feature keys
- Each test produces a (sparse) row of A

Constructing C:
- Run the tests with coverage instrumentation
- Label the columns of C with coverage keys
- Each test produces a (sparse) row of C
- Optional level of detail: file, function, edge, or line; greater detail increases the dimension

A matrix-assembly sketch follows.
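A minimal sketch of assembling A and C as sparse 0/1 matrices, assuming the hypothetical per-test records from the earlier sketch:

```python
from scipy.sparse import lil_matrix

def build_matrix(records, field):
    """One row per test; one 0/1 column per observed key."""
    keys = sorted({k for rec in records for k in rec[field]})
    col = {k: j for j, k in enumerate(keys)}
    M = lil_matrix((len(records), len(keys)), dtype=int)
    for i, rec in enumerate(records):
        for k in rec[field]:
            M[i, col[k]] = 1
    return M.tocsr(), keys

A, feature_keys = build_matrix(records, "features")
C, coverage_keys = build_matrix(records, "coverage")
```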
 
 
MODELING THE ARCHITECTURE (B)

- Architecture modeling can be framed as an ML multi-label classification problem
- We are using readily available classifiers
- Decision-tree classifiers are a good fit for our type of data:
  - They use a series of splits based on parameter influence
  - They suffer from variance and bias, which can be reduced by ensembles

[Figure: Titanic passenger survival model as a decision tree classifier [1]]

[1] Lovinger, J., Valova, I. Infinite Lattice Learner: an ensemble for incremental learning. Soft Comput 24, 6957–6974 (2020).
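A minimal sketch of fitting the model, reusing the matrices from the sketch above; the estimator choice and hyperparameters are illustrative. scikit-learn's tree ensembles accept a 2-D 0/1 label array and treat the problem as multi-label, so (A, C) can be fit directly:

```python
from sklearn.ensemble import ExtraTreesClassifier

# The fitted classifier plays the role of B: feature vector in, coverage set out.
B = ExtraTreesClassifier(n_estimators=100, random_state=0)
B.fit(A.toarray(), C.toarray())

predicted = B.predict(A[:1].toarray())  # predicted coverage set for the first test
```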
 
ASIDE: SOFTWARE SUSTAINABILITY & CREDIBILITY BENEFITS FROM THE ABILITY TO AUTOMATICALLY GATHER THE TRAINING DATA

Benefits from A:
- A feature coverage database
- Statistics on how apps are used “in the wild”
- Identification of weak/untested features

Benefits from C:
- Supports optimal test-suite construction
- Faster CI for large projects
- Targeted change-based testing

Automatically gathering this data is foundational to a variety of potential user- and developer-facing credibility and productivity tools.
THE DATASET: THE SIERRA/SM (SOLID MECHANICS) APP

SIERRA Engineering Mechanics Code Suite:
- Source lines of code (C/C++): ~2M
- Regression and unit tests: ~20k
- Source line coverage: ~75%

Approach:
- Focus on the Solid Mechanics app, and on the “small” tests in the suite
- Use the Feature Coverage Tool to build A
- Use LLVM SanitizerCoverage, with a custom callback for file-level coverage, to build C

SIERRA/SM dataset details:
- Tests (samples): 6393
- Features: 7005
- Covered files (labels): 5347
 
INITIAL RESULTS

- Examine the performance of two classifiers
- Split the data into train/test sets: train on all data (0% test) or reserve 30% for testing
- Use the native “score”: the fraction of samples whose label sets are predicted 100% correctly

    Classifier          Test %   Train Score   Test Score
    ExtraTrees [1]      0%       0.64          -
    ExtraTrees [1]      30%      0.68          0.24
    RandomForest [2]    0%       0.64          -
    RandomForest [2]    30%      0.68          0.24

[Figure: Sparsity pattern of the training data (6393 samples; 7005 features at 1.4% non-zeros, 5347 coverage labels at 29% non-zeros). Note the high coverage density: “mainline” coverage contributes to over-fit.]
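A minimal sketch of this evaluation protocol, reusing names from the earlier sketches; the split seed and estimator settings are illustrative:

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

A_train, A_test, C_train, C_test = train_test_split(
    A.toarray(), C.toarray(), test_size=0.30, random_state=0)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(A_train, C_train)

# For multi-label data, score() is subset accuracy: a sample counts only if
# its entire predicted label set matches exactly.
print("train score:", clf.score(A_train, C_train))
print("test score: ", clf.score(A_test, C_test))
```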
 
[1] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
[2] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Can filtering the training data improve the fit/predictions?
 
FILTERING THE TRAINING DATASET

We want automatic filtering approaches (a sketch follows):
- Identify duplicate samples:
  - Reduce duplicate samples by taking the union of their coverage data
  - e.g., the 1-core vs. N-core cases of the same test
- Remove “mainline” features and coverage:
  - Filter out columns with >= 99% fill
  - Removed features: boilerplate, e.g., ‘begin sierra’
  - Removed coverage: libraries (e.g., ioss, stk) and parsing

[Figure: Sparsity pattern of the reduced training data (3468 samples; 6992 features at 0.8% non-zeros, 4284 coverage labels at 12% non-zeros). High feature diversity in the samples is maintained, while “mainline” coverage is greatly reduced.]
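A minimal sketch of the two filters, assuming A and C are the 0/1 CSR matrices built earlier; the thresholds match the text:

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

def drop_dense_columns(M, keys, max_fill=0.99):
    """Remove 'mainline' columns that are set in >= 99% of the samples."""
    fill = np.asarray(M.mean(axis=0)).ravel()
    keep = np.flatnonzero(fill < max_fill)
    return M[:, keep], [keys[j] for j in keep]

def merge_duplicate_samples(A, C):
    """Merge samples with identical feature rows (e.g., 1-core and N-core runs
    of the same test), taking the union of their coverage rows."""
    groups = {}
    for i in range(A.shape[0]):
        groups.setdefault(tuple(A[i].indices), []).append(i)
    reps = [rows[0] for rows in groups.values()]
    A_new = A[reps]
    C_new = vstack([csr_matrix(C[rows].max(axis=0)) for rows in groups.values()])
    return A_new, C_new
```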
 
FILTERING THE TRAINING DATASET

SIERRA/SM reduced dataset details:
- Tests (samples): 3468
- Features: 6992
- Covered files (labels): 4284

Classifier performance on the reduced dataset (the full sample score is evaluated on the full, unfiltered sample set):

    Classifier          Test %   Train Score   Test Score   Full Sample Score
    ExtraTrees [1]      0%       1.0           -            0.622
    ExtraTrees [1]      30%      1.0           0.183        -
    RandomForest [2]    0%       0.999         -            0.621
    RandomForest [2]    30%      0.998         0.169        -

[1] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
[2] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

The reduced training dataset provides a better fit and maintains accuracy on the full sample set.
EXPERIMENT: IDENTIFICATION OF FEATURE SUPPORT

SIERRA/SM has the typical structure of a 30-year-old SciSoft package.

<obligatory developer joke slide>
 
EXPERIMENT: IDENTIFICATION OF FEATURE SUPPORT

- SIERRA/SM has the typical structure of a 30-year-old SciSoft package
- Materials, however, do have a well-defined interface:
  - All models are implemented in the Lamé library
  - Source file names relate to model names
  - So we are able to construct and verify a known label set

Given a feature key for a material model: if the classifier is accurate, then it will predict the Lamé library files that support that material model. Bonus: if the classifier is precise, the prediction doesn’t contain other sources.

[Figure: bin/adagio links liblame.a, which contains per-model sources such as MLEP.C, Elastic.h, and johnson_cook_model.F]
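A minimal sketch of this query, reusing the trained classifier and key lists from the earlier sketches; the material key string is a hypothetical placeholder:

```python
import numpy as np

def predicted_files(clf, feature_keys, coverage_keys, active):
    """Predict the coverage set for a synthetic single-feature input."""
    x = np.zeros((1, len(feature_keys)))
    for key in active:
        x[0, feature_keys.index(key)] = 1.0
    y = clf.predict(x)[0]
    return [f for f, on in zip(coverage_keys, y) if on]

files = predicted_files(B, feature_keys, coverage_keys, ["<elastic-material-key>"])
lame_hits = [f for f in files if "lame/" in f]      # accuracy: the right library files
noise     = [f for f in files if "lame/" not in f]  # precision: everything else
```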
 
RESULTS: IDENTIFICATION OF FEATURE SUPPORT

    Material        # Samples   Correct Lamé labels   Wrong Lamé labels   Non-Lamé labels
    elastic         2305        7 of 7                0                   259
    johnson_cook    61          5 of 9                2                   291
    dsa             28          5 of 7                2                   245
    mlep            68          5 of 10               2                   245

“# Samples” is the number of times the specified material appears in the training data.

There is too much noise/bias in the model: it predicts that everything is elastic.
 
USE SME GUIDANCE TO FOCUS THE TRAINING DATASET FOR MATERIALS

- We know that all material models are implemented in the ‘lame/’ directory
- We know the feature correlation, i.e., model options are associated with a specific model
- Reduce the feature set to the possible materials
- Reduce the coverage set to files in ‘lame/’
- Train a sub-model with the reduced dataset (see the sketch below)

SIERRA/SM material-only dataset:
- Tests (samples): 3468
- Features (materials): 135
- Covered files (labels): 681

    Model           # Correct
    elastic         7 of 7
    mlep            10 of 10
    johnson_cook*   9 of 9
    dsa*            7 of 7
    jc + mlep       8 of 14

* Model never used alone in the sample set.

How might we automatically detect these reduced spaces to improve accuracy?
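A minimal sketch of the SME-guided sub-model, assuming a hypothetical predicate for recognizing material feature keys; the column masks mirror the reductions above:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def is_material_feature(key):  # hypothetical SME-provided predicate
    return key_to_command.get(key, "").startswith("material")

feat_cols = np.flatnonzero([is_material_feature(k) for k in feature_keys])
cov_cols = np.flatnonzero(["lame/" in f for f in coverage_keys])  # SME: materials live in lame/

A_mat = A[:, feat_cols].toarray()    # features reduced to possible materials
C_lame = C[:, cov_cols].toarray()    # labels reduced to files in lame/

sub_model = ExtraTreesClassifier(n_estimators=100, random_state=0)
sub_model.fit(A_mat, C_lame)
```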
 
SUMMARY

- We can predict which source files cover a given input deck with ~60% accuracy
- The current model is noisy:
  - It identifies a lot of library-type files
  - It overpredicts coverage for specific features
  - It carries bias from the sample feature distribution
- It can improve new-developer productivity by providing pointers to where a feature is implemented, even if they are not very specific
- It is a poor predictor of specific features without sub-modeling; segmented models can improve accuracy
 
FUTURE WORK

Open questions:
- Can we use unsupervised learning to automatically discover correlated features and construct piecewise models spanning the feature space?
- How could we sustainably incorporate SME knowledge?
- Is overprediction of file coverage acceptable? To what extent? Can we train to this metric?

Next steps to provide user feedback:
- Query the full coverage data given the files supporting a feature
- Develop a coverage metric that is meaningful to an end user
- Integrate the information into other user-facing credibility tools
 
ACKNOWLEDGEMENTS

Jake Healy
Tony Nguyen
Chris Siefert

CIS Project 229302

Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC (NTESS), a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration (DOE/NNSA) under contract DE-NA0003525. This written work is authored by an employee of NTESS. The employee, not NTESS, owns the right, title and interest in and to the written work and is responsible for its contents. Any subjective views or opinions that might be expressed in the written work do not necessarily represent the views of the U.S. Government. The publisher acknowledges that the U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this written work or allow others to do so, for U.S. Government purposes. The DOE will provide public access to results of federally sponsored research in accordance with the DOE Public Access Plan.
 
THANKS!