Enabling AVX-512 Assembly Instructions in Valgrind by Tatyana Volnina (Mineeva)

Slide Note

Explore the implementation of AVX-512 instructions in Valgrind through detailed agendas, instruction layouts, masking techniques, new features, subsets, and methodologies. Discover how Valgrind enables parsing of assembly instructions for AVX-512 and AVX-2, facilitating code separation and IR generation.

bard_il Follow

Uploaded on Oct 06, 2024 | 1 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Enabling AVX-512 assembly instructions in Valgrind Tatyana Volnina (Mineeva) 1

Agenda AVX-512 instruction summary AVX-512 implementation in Valgrind Q/A 2

AVX-512 instructions 3

Vector register layout AVX-512 AVX-2 SSE (x86-64) ZMM0 YMM0 XMM0 ZMM1 YMM1 XMM1 ZMM15 YMM15 XMM15 ZMM16 YMM16 XMM16 ZMM31 YMM31 XMM31 bits 512 256 255 128 127 0 4

Masking Masked vector sum: ZMM1 (8 x 64-bit ints) 1 2 3 4 5 6 7 8 + ZMM2 (8 x 64-bit ints) 4 4 3 3 2 2 1 1 k1 (8 bits) 0 1 0 0 1 1 0 1 = = Zero-mode sum 0 6 0 0 7 8 0 9 Merge-mode sum (ZMM1 as base) 1 6 3 4 7 8 7 8 5

AVX-512 features New instruction prefix (EVEX): Instruction-specific rounding mode Embedded broadcast New memory displacement encoding 8-bit and 16-bit vector elements 6

AVX-512 subsets AVX-512 subsets Total missing F CD VL DQ BW ER PF IFMA VBMI VNNI VPOPCNTDQ VBMI2 BITALG VPCLMULQDQ GFNI VAES VP2INTERSECT BF16 Knight s Landing 0 Skylake 0 2 4 4 2 8 3 1 3 4 Ice Lake 31 2 4 4 2 8 3 1 3 4 2 Tiger Lake Sapphire Rapids 33 4 2 3 9 - Instruction set is already enabled N - Number of instructions to add 7

AVX-512 Valgrind 8

General Approach Separate AVX-512 code Reuse existent IRs Do not affect non-AVX512 runs AVX-512 performance haven t been measured yet 9

AVX-2 Valgrind Parse assembly instruction User s binary Up to AVX-2 Translate into intermediate representation Tool instrumentation (up to 256 bit IR) Up to 256 bits Generate assembly Up to SSE 10

AVX-512 Valgrind Parse assembly instruction User s binary Up to AVX-512 Translate into intermediate representation Tool instrumentation (up to 512 bit IR) Up to 512 bits Generate assembly Up to SSE or AVX-512 intrinsics 11

AVX-512 instruction handling Describe assembly instructions in .csv file Generate regression tests Generate C data structures Refer to the data structures in Memcheck Did not work Refer to the data structures in Valgrind core 12

AVX-512 Valgrind tests Valgrind core: Nas Parallel Benchmarks - one failure on an existent Valgrind limitation HPC workloads Memcheck: ISPC and PMDK regression tests Intel Inspector memory tests Helgrind: Data race benchmarks (Lawrence Livermore) AVX-2 and AVX-512 reports matched 13

VG-to-PIN comparison Problem: If an instruction is emulated incorrectly under Valgrind, it can pass unnoticed for a long time Solution: dump register values under Valgind and non-Valgrind run and compare them Wrote logging tools for Valgrind and Intel PIN: - Heavyweight and not user-friendly - Invaluable to detect emulation errors 14