Comprehensive Overview of AI Technical Test Specification

This presentation provides a detailed look at the AI technical test specification authored by Auss Abbood from the Robert Koch Institute in Berlin. It covers best practices in AI testing, essential tests for assessment platforms, testing principles, test levels, test types, and more. The deliverable is currently under review, and feedback is welcomed. Key points include the principle that testing shows the presence of errors rather than their absence, the need to update tests regularly, and the understanding that error-free software does not guarantee user satisfaction.





Presentation Transcript


  1. FGAI4H-R-045 Cambridge, 21-24 March 2023 Source: Editor DEL7.2 Title: DEL7.2 Update: AI technical test specification Contact: Auss Abbood, Robert Koch Institute, Berlin, Germany E-mail: abbooda@rki.de Abstract: This PPT contains the current structure of the deliverable AI technical test specification.

  2. Deliverable: AI Technical Test Specification Auss Abbood Robert Koch Institute, Berlin, Germany Cambridge, MA 21-24 March 2023

  3. Motivation What are best practices in AI testing that TGs can adapt? Which tests are specifically important for an assessment platform? AI Technical Test Specification - FG-AI4R 3

  4. Outline The deliverable is mature and currently under review Feedback is still welcome and appreciated Summary of the deliverable AI Technical Test Specification - FG-AI4R 4

  5. Background Contains the state of the art in testing as described by books, the International Software Testing Qualifications Board, the National Institute of Standards and Technology, and ISO/IEC/IEEE standards A summary of commonly used terms and principles in software testing Filtered for our purpose: what we need to test in AI (and what not) AI Technical Test Specification - FG-AI4R 5

  6. Testing principles Known by engineers but often not common knowledge in science Testing shows the presence of errors, not their absence: long tail due to rare diseases Exhaustive testing is usually not possible: human in the loop, data synthesis Test early on: clarify and test expectations Errors cluster together: subject matter experts can help detect causes of clusters Tests and test data need to be updated regularly (pesticide paradox) Testing depends on the purpose and environment of the software Error-free does not equate to user satisfaction AI Technical Test Specification - FG-AI4R 6

  7. Test levels Unit/component testing: Fine Integration testing: Coarse System testing: Holistic AI Technical Test Specification - FG-AI4R 7
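The distinction between the fine-grained and coarse-grained levels above can be illustrated with a minimal sketch. The functions below (`normalize`, `predict`, `pipeline`) are hypothetical toy examples, not part of the deliverable:

```python
# Minimal sketch of unit vs. integration testing for a toy AI pipeline.
# All functions here are hypothetical illustrations.

def normalize(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def predict(features):
    """Toy classifier: positive if the mean feature exceeds 0.5."""
    return int(sum(features) / len(features) > 0.5)

def pipeline(raw):
    """Integration of both components: normalize, then predict."""
    return predict(normalize(raw))

# Unit/component test: check one component in isolation (fine).
assert normalize([2.0, 4.0]) == [0.0, 1.0]

# Integration test: check that the components work together (coarse).
assert pipeline([2.0, 4.0, 4.0]) == 1
```

System testing would then exercise the whole deployed application around such a pipeline, which cannot be shown in a few lines.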

  8. Test types
  Functional Testing: Tests what the system should do by specifying a precondition, running the code, and then comparing the result of this execution with a postcondition. It is applied at each level of testing, although in acceptance testing most implemented functions should already work. A measure of the thoroughness of functional testing is coverage.
  Non-functional Testing: Tests how well a system performs. This includes testing the usability, performance efficiency, or security of a system, and other characteristics found in ISO/IEC 25010. This test can be performed at all levels of testing. Coverage for non-functional testing means how many of these characteristics were tested for.
  White-box Testing: Tests the internal structure of a system or its implementation. It is mostly applied in component and system testing. Coverage in this test measures the proportion of code components that have been tested.
  Black-box Testing: As opposed to white-box testing, the software is treated as a black box with no knowledge of how it achieves its intended functionality. Only the output of this form of testing is compared with the expected output or behaviour. The advantage of black-box testing is that no programming knowledge is required; it is therefore well equipped to detect biases that arise if only programmers write and test software. This test can be applied at all levels of testing.
  Maintenance Testing: Tests changes to already delivered software for functional and non-functional quality characteristics.
  Static Testing: A form of testing that does not execute code but manually examines the system, e.g., through reviews, linters, or formal proofs of the program.
  Change-related Testing: Tests whether changes corrected errors (confirmation testing) or caused errors (regression testing). Change-related testing can be applied at all levels of testing.
  Destructive Testing: Aims to make the software fail by providing unintended inputs, which tests the robustness of the software. This can be applied at all levels of software testing. AI Technical Test Specification - FG-AI4R 8
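Destructive testing, the last type listed, can be sketched in a few lines: deliberately feed unintended inputs and require that the software fails safely. The `validate_input` helper below is a hypothetical illustration, not from the deliverable:

```python
# Sketch of destructive testing: provide unintended inputs and check
# that the software rejects them in a controlled way instead of
# misbehaving. validate_input is a hypothetical helper.

def validate_input(features):
    """Reject inputs a model could not handle safely."""
    if not isinstance(features, list) or not features:
        raise ValueError("features must be a non-empty list")
    if any(not isinstance(x, (int, float)) for x in features):
        raise ValueError("features must be numeric")
    return True

# Destructive tests: each unintended input must raise a clear error.
for bad in [None, [], ["a", "b"], "not a list"]:
    try:
        validate_input(bad)
        raised = False
    except ValueError:
        raised = True
    assert raised, f"input {bad!r} was accepted but should be rejected"

# Sanity check: a well-formed input passes.
assert validate_input([0.1, 0.2]) is True
```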

  9. AI testing No big difference: e.g., cryptographic or scientific software is also hard to test Base recipe Metrics Data/Benchmark Discriminatory (subject matter experts) Code (mostly dealt with through libraries) AI Technical Test Specification - FG-AI4R 9

  10. AI testing Metamorphic testing seems promising: Test coverage does not equate to unit (neuron) activity Use another model to maximize unit activity Create two sets of inputs (raw and modified) with an expected change (pseudo-oracle) Verify and validate AI Technical Test Specification - FG-AI4R 10
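The raw/modified pair with a pseudo-oracle described above can be sketched concretely. A minimal example, assuming a toy softmax scorer (not the deliverable's model): the metamorphic relation is that adding the same constant to every logit must not change the predicted class, because the softmax argmax is shift-invariant.

```python
import math

# Metamorphic testing sketch with a toy model. The pseudo-oracle is
# the relation between outputs, not a known correct output: shifting
# all logits by a constant must leave the predicted class unchanged.

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predicted_class(logits):
    """Index of the highest-probability class."""
    probs = softmax(logits)
    return probs.index(max(probs))

raw = [1.2, 0.3, 2.5]                # raw input (hypothetical logits)
modified = [x + 10.0 for x in raw]   # metamorphic transformation

# Metamorphic relation: both inputs must yield the same prediction.
assert predicted_class(raw) == predicted_class(modified) == 2
```

The same pattern scales to real models, e.g., checking that brightness changes to a medical image do not flip a diagnosis.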

  11. ML ops as an addition Testing should appreciate the connection between data, software, hardware, and AI over time: MLFlow, Argo, Docker, Sacred, DVC, etc. can help (or AWS, Google, with enough funding) device-specific properties of produced data BUT, not all forms of input can be tested (General Principles of Software Validation; Final Guidance for Industry and FDA Staff). When is our testing done? Third-party libraries are usually well tested Errors sneak into the best of libraries AI Technical Test Specification - FG-AI4R 11

  12. ML ops as an addition AI Technical Test Specification - FG-AI4R 12

  13. Leaderboard probing Leaderboard probing Data aggregation or missing data Vulnerable metrics or data formats (time series) Non-weighted performance (optimize on easy tasks instead of hard ones) Adversarial validation -> find the biggest similarities in test and training data Random rotation of the classification hyperplane to find the most useful variables AI Technical Test Specification - FG-AI4R 13
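Adversarial validation, mentioned above, can be sketched without any ML library: label training rows 0 and test rows 1, then measure how well a feature alone separates the two sets. A ranking quality (AUC) near 0.5 means train and test look alike; near 1.0 signals a distribution difference that leaderboard probers could exploit. The feature values below are hypothetical:

```python
# Minimal adversarial-validation sketch for a single feature.
# Positives = test rows, negatives = training rows.

def auc(scores_neg, scores_pos):
    """Probability that a positive outranks a negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

train_feature = [0.1, 0.2, 0.3, 0.4]   # hypothetical feature, training split
test_similar  = [0.15, 0.25, 0.35]     # drawn from a similar range
test_shifted  = [0.9, 1.0, 1.1]        # clearly shifted distribution

assert auc(train_feature, test_similar) == 0.5   # indistinguishable: fine
assert auc(train_feature, test_shifted) == 1.0   # fully separable: warning sign
```

In practice one would fit a real classifier on all features with the train/test label as target; the principle is the same.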

  14. Other deliverables Information from DEL 5.1 and 7.3 crucial 5.1: Test data pipeline, e.g., pre-processing, heterogeneity, precision of data, bias, leakage 7.3: Test compatibility with the audit setup AI Technical Test Specification - FG-AI4R 14
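The leakage item from DEL 5.1 has a simple concrete form: rows that appear in both the training and the test split leak information and inflate test scores. A minimal sketch with hypothetical rows:

```python
# Sketch of a basic data-leakage check: find exact duplicate rows
# shared between the training and test splits. Rows are hypothetical
# (feature1, feature2, label) tuples.

def leaked_rows(train_rows, test_rows):
    """Return the set of rows present in both splits."""
    return set(train_rows) & set(test_rows)

train = [(1.0, 0.2, "A"), (0.5, 0.9, "B"), (0.3, 0.3, "A")]
test  = [(0.5, 0.9, "B"), (0.7, 0.1, "C")]

overlap = leaked_rows(train, test)
assert overlap == {(0.5, 0.9, "B")}   # one duplicated row leaks
```

Real pipelines also need near-duplicate and patient-level checks, which exact matching cannot catch.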

  15. Outlook Receive more feedback (where to put more attention) Document remedies for leaderboard probing AI Technical Test Specification - FG-AI4R 15
