Accelerate AI Performance with DirectML on Intel Hardware by Szymon Marcinkowski
Learn how to use DirectML on Intel hardware to boost AI performance, with an overview of the Windows AI ecosystem, DirectML optimizations, scaling AI models, and tools such as Windows ML and ONNX Runtime.
Presentation Transcript
Accelerate AI performance with DirectML on Intel Hardware Szymon Marcinkowski
Agenda
1. Windows AI ecosystem overview
   1. DirectML
   2. Supported frameworks
   3. Tools
2. DirectML optimizations overview
   1. General optimizations
   2. Performance considerations on Intel platform
3. Guide on how to bring your AI model at scale with DirectML
   1. 3-step guide
   2. Example of Generative AI workflow: Stable Diffusion 1.5
4. Summary
DirectML: Low-level API for high-performance machine learning, supported by all DirectX 12 compatible hardware.
Breakdown of the DirectML definition:
   Low-level and high performance: native C++ API, developed on top of DirectX 12.
   Supported by all DirectX 12 compatible hardware: industry standard with cross-vendor support (Intel, Nvidia, AMD, Qualcomm, etc.), deployed with Windows OS, GPU and NPU acceleration.
Stack diagram labels: C++ applications, DirectML, DirectX 12 Runtime, DirectX 12 Driver provided by Vendor, NPU, GPU.
Frameworks: DirectML is the industry standard for deploying AI on Windows OS with GPU/NPU accelerators.
Windows ML:
   Ease of development: built into Windows OS
   Broad hardware support: GPU, NPU and CPU
   Flexibility: evaluate your models locally on PC
ONNX Runtime:
   Cross-platform, cross-hardware
   Integrated into many frameworks
TensorFlow:
   Public Preview available for version 1.15
   Inference and training
PyTorch:
   Public Preview available for version 1.13
   Inference and training
Diagram labels: Windows AI Platform ecosystem, hardware, Training scenario, WinML, TensorFlow, PyTorch, ONNX Runtime, DirectML.
Tooling: The Windows AI and DirectML ecosystem delivers a broad set of tools that help you improve and evaluate your AI workloads.
Olive: hardware-aware model optimization
WinML Dashboard: graphic view of the model's architecture; conversion from popular frameworks into ONNX format
ONNXMLTools: converts models from popular frameworks into ONNX format
WinMLRunner: WinML framework consuming ONNX format to evaluate performance and model completeness
PIX: instrumentation of application DirectX calls and work visualization; useful for performance analysis
Categories shown on the slide: model optimization, visualization.
DirectML: Performance is key to adopting AI models in day-to-day use cases.
DirectML allows for seamless integration into DirectX 12 apps (including games).
Low-latency inference thanks to the broad set of optimizations described in the next slides.
Super Resolution public demo: link
General optimizations pt. 1: DirectML provides a broad set of performance optimization opportunities for your AI model.
Metacommands: a mechanism by which vendors can implement their own versions of operations, making the best use of the underlying hardware.
Figure: comparison of execution with Metacommand and without Metacommand.
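To make the metacommand idea concrete, the toy sketch below models it as a lookup that prefers a vendor-supplied implementation and falls back to a generic one. It is not DirectML code: the registries, `register_vendor_op`, `resolve`, and both ReLU functions are hypothetical names invented purely for illustration.

```python
from typing import Callable, Dict, List

Vector = List[float]
OpImpl = Callable[[Vector], Vector]


def generic_relu(x: Vector) -> Vector:
    """Portable fallback used when no vendor override is registered."""
    return [max(0.0, v) for v in x]


def vendor_relu(x: Vector) -> Vector:
    """Stand-in for a hardware-tuned implementation (same math here)."""
    return [v if v > 0.0 else 0.0 for v in x]


# Hypothetical registries: operator name -> implementation.
GENERIC_OPS: Dict[str, OpImpl] = {"relu": generic_relu}
VENDOR_OVERRIDES: Dict[str, OpImpl] = {}


def register_vendor_op(name: str, impl: OpImpl) -> None:
    """Let a 'vendor' supply its own version of an operator."""
    VENDOR_OVERRIDES[name] = impl


def resolve(name: str) -> OpImpl:
    """Prefer the vendor override; otherwise use the generic version."""
    return VENDOR_OVERRIDES.get(name, GENERIC_OPS[name])


def dispatch(name: str, x: Vector) -> Vector:
    """Run the operator through whichever implementation resolves."""
    return resolve(name)(x)
```

Both paths return the same result; only the implementation that runs changes, which is why the slide suggests using PIX to confirm which path a workload actually takes.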
General optimizations pt. 2: DirectML provides a broad set of performance optimization opportunities for your AI model.
Graph optimizer:
   Execution improvements via removal of unneeded nodes
   Operator fusion
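The two graph-optimizer ideas named above, node removal and operator fusion, can be illustrated with a toy pass over a linear chain of operator names. This is only a sketch under simplifying assumptions: real graphs are not linear chains, the slide does not say which nodes or fusion patterns DirectML handles, and the choice of `Identity` removal and `Conv` + `Relu` fusion is an assumption made for the example.

```python
from typing import List


def optimize_chain(ops: List[str]) -> List[str]:
    """Toy optimizer for a linear chain of operator type names."""
    # Pass 1: remove nodes that do no work (assumed here: Identity).
    pruned = [op for op in ops if op != "Identity"]

    # Pass 2: fuse each Conv that is directly followed by a Relu
    # into a single node, so one kernel launch replaces two.
    fused: List[str] = []
    i = 0
    while i < len(pruned):
        if pruned[i] == "Conv" and i + 1 < len(pruned) and pruned[i + 1] == "Relu":
            fused.append("Conv+Relu")
            i += 2
        else:
            fused.append(pruned[i])
            i += 1
    return fused
```

Note that removing the `Identity` node first is what exposes the `Conv`/`Relu` pair to the fusion pass, which is one reason such passes are usually ordered deliberately.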
Performance considerations: Get the best performance out of your DirectML AI model on Intel GPU hardware.
Managed tensors: Set convolution weight tensors to be managed by DirectML. During model loading, the driver then has the option to reshuffle their data into a format optimized for efficient data loads.
Optimal data layout: Use NHWC (interleaved) data layout for your tensors, so kernels can make effective use of block/coalesced loads.
Optimal input dimensions: Design your AI model architecture to use input dimensions in multiples of 16 (FP16) to get the best XMX (Xe Matrix Extension) utilization.
Analysis: Use the PIX performance capture and analysis tool to investigate whether your tensors are leveraging MetaCommands.
Chart: performance ratio benchmarked on a system with an Arc770 GPU. Values shown: 8.44x, 1.69x, 1x; bar labels: 1, 2, 3.
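The layout and dimension advice above can be checked with a little index arithmetic. The sketch below is plain Python, not DirectML code; the helper names are hypothetical. It shows that in NHWC the channels of one pixel sit next to each other in memory, while in NCHW they are a full image plane apart, and it includes a helper that rounds a dimension up to the next multiple of 16. Padding up is only one possible way to follow the slide's advice and is an assumption here.

```python
def round_up(value: int, multiple: int = 16) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple


def nchw_offset(n: int, c: int, h: int, w: int, C: int, H: int, W: int) -> int:
    """Flat element index in a planar (NCHW) tensor."""
    return ((n * C + c) * H + h) * W + w


def nhwc_offset(n: int, c: int, h: int, w: int, C: int, H: int, W: int) -> int:
    """Flat element index in an interleaved (NHWC) tensor."""
    return ((n * H + h) * W + w) * C + c


def channel_stride(offset_fn, C: int, H: int, W: int) -> int:
    """Distance in elements between channel 0 and channel 1 of one pixel."""
    return offset_fn(0, 1, 0, 0, C, H, W) - offset_fn(0, 0, 0, 0, C, H, W)
```

For a tensor with C=16, H=4, W=8, `channel_stride(nhwc_offset, 16, 4, 8)` is 1, while `channel_stride(nchw_offset, 16, 4, 8)` is 32 (H times W), which is the contiguity that block or coalesced loads benefit from.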
Guide on how to bring your AI model at scale with DirectML.
3-step guide: These three steps are a powerful combination for deploying your AI model.
Starting point: train and evaluate your AI model with the PyTorch or TensorFlow framework.
1. Convert: convert your model into ONNX format.
2. Optimize: use the Olive tool, powered by DirectML, to optimize your model. You should see dramatic performance improvements.
3. Integrate: your model is ready to bring hardware-accelerated inferencing to your application, whether it is a low-level C++ application or a high-level Python framework.
Flow: convert to ONNX format, optimize with the Olive tool, deploy the optimized model in your target application with ONNX Runtime and the DirectML Execution Provider.
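Steps 1 and 3 of the guide above can be sketched in Python as follows. This is a minimal sketch, not the presenter's code: it assumes a PyTorch model and the `onnxruntime-directml` package, and the helper names and the tensor names `"input"` and `"output"` are hypothetical. The Olive step is left out because its configuration is model-specific and not described in the slides. The imports of `torch` and `onnxruntime` are deferred so that the provider-selection logic can be used without either installed.

```python
from typing import List, Sequence

DML = "DmlExecutionProvider"
CPU = "CPUExecutionProvider"


def choose_providers(available: Sequence[str]) -> List[str]:
    """Prefer DirectML when present and keep CPU as the fallback."""
    preferred = [p for p in (DML, CPU) if p in available]
    return preferred or [CPU]


def export_to_onnx(model, example_input, onnx_path: str) -> None:
    """Step 1 (Convert): export a PyTorch model to ONNX format."""
    import torch  # deferred: only needed when exporting

    model.eval()
    torch.onnx.export(
        model,
        (example_input,),
        onnx_path,
        input_names=["input"],    # hypothetical tensor name
        output_names=["output"],  # hypothetical tensor name
    )


def create_session(onnx_path: str):
    """Step 3 (Integrate): load the model with the DirectML provider."""
    import onnxruntime as ort  # deferred: needs onnxruntime-directml for DML

    providers = choose_providers(ort.get_available_providers())
    return ort.InferenceSession(onnx_path, providers=providers)
```

A session created this way is then run with `session.run(None, {"input": array})`, where the key must match the input name used at export time.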
Stable Diffusion 1.5: Shows how easy it is to get an optimized AI model working at scale with DirectML.
1. Install the latest drivers of your vendor.
2. Convert and optimize
   a) Clone the Olive repo.
   b) Run the SD1.5 example; the script will:
      download PyTorch models
      convert them to ONNX format
      run through a predefined list of optimization steps
3. Test
   a) Run the script again to validate the correctness of your optimized model.
4. Integrate and deploy
   a) Integrate the optimized models into your application.
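The slides do not say how the example script validates correctness in the Test step. One common approach, shown here purely as an assumption, is to compare the optimized model's outputs against the baseline model's outputs within a tolerance. The function names and the default tolerance of 1e-2 are hypothetical choices for this sketch; a suitable tolerance depends on the model and precision (for example FP16).

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two outputs."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def outputs_match(
    reference: Sequence[float],
    candidate: Sequence[float],
    atol: float = 1e-2,  # hypothetical tolerance, tune per model
) -> bool:
    """True when the optimized output stays within `atol` of the baseline."""
    return max_abs_diff(reference, candidate) <= atol
```

In practice the two sequences would be flattened outputs of the original and optimized models for the same input and the same random seed.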
Why use DirectML in the AI PC era?
1. Cross-vendor: works at scale
2. Performance: hardware accelerated (GPU/NPU)
3. Reliable: conformance tests with hardware certification
4. Works out of the box: deployed together with Windows OS