Variation-Tolerant OpenMP Tasking for Processor Clusters


"Explore the challenges of device variability and the need for variation-tolerant architectures in tightly-coupled processor clusters. Learn about OpenMP tasking, task-level vulnerability, and variation-aware reactive scheduling algorithms to combat process, voltage, and temperature variations."

Uploaded by jveron on Apr 03, 2025


Presentation Transcript

Variation-Tolerant OpenMP Tasking on Tightly-Coupled Processor Clusters
A. Rahimi, A. Marongiu, P. Burgio, R. K. Gupta, L. Benini
UC San Diego and Università di Bologna

Outline
Device Variability: process, voltage, and temperature variations
Why OpenMP and why tasking? Task-Level Vulnerability (TLV)
Variation-Tolerant Architecture: inter- and intra-corner TLV
Variation-Tolerant OpenMP Tasking: Variation-Aware Reactive Scheduling Algorithm
Experimental Results
3-Apr-25 Andrea Marongiu / Università di Bologna

Ever-increasing Process-Voltage-Temperature Variations
Variability in transistor characteristics is a major challenge in nanoscale CMOS.
Static process variation, e.g., 40% VTH.
Dynamic variations, e.g., 160 °C temperature fluctuations and 10% supply voltage droops.
To handle variations, designers use conservative guardbands, leading to a loss of operational efficiency.
[Figure: the clock period split into actual circuit delay and guardband; the guardband covers across-wafer frequency, temperature, VCC droop, and other uncertainty.]

Approaches to Variability-Tolerance
1. Design-time conservative guardbanding
2. Post-silicon binning
3. Runtime tolerance by various adaptiveness, e.g., replaying errant instructions
The runtime approach:
I. relies on online measurements of errors;
II. creates runtime overhead [Bowman 11] in both latency (up to 28 extra recovery cycles per error) and energy (26 nJ), which should be minimized.

Why a Variation-Aware OpenMP?
Variations are exacerbated in many-core systems:
There are multiple voltage-temperature islands.
Cores in different islands display different error rates.
The programming model and runtime environment of a MIMD system should be aware of variations.
[Figure: frequency variation of a 16-core cluster due to WID and D2D process variation. Per-core frequencies in MHz, C0 to C15: 847, 893, 847, 901, 847, 909, 877, 870, 909, 855, 826, 917, 901, 820, 826, 862.]
[Figure: number of errant instructions (x 10000) per core ID, C0 to C15. Core 0 at 1.1V faces 7.3K errant instructions; Core 1 at 0.81V faces 428K errant instructions.]

Why OpenMP Tasking?
The steps to build variability abstractions up to the SW layer: Instruction-level Vulnerability (ILV), Sequence-level Vulnerability (SLV), Procedure-level Vulnerability (PLV), Task-level Vulnerability (TLV).
TLV serves as metadata to characterize variations.
TLV is a vertical abstraction: it reflects the manifestation of circuit-level variability in a specific parallel software context.
Tasks are the right granularity for the OMP scheduler to observe and react.
Tasking is a convenient abstraction for programmers to express irregular and unstructured parallelism.
[ILV] A. Rahimi, L. Benini, R. K. Gupta, Analysis of Instruction-level Vulnerability to Dynamic Voltage and Temperature Variations, DATE, 2012.
[SLV] A. Rahimi, L. Benini, R. K. Gupta, Application-Adaptive Guardbanding to Mitigate Static and Dynamic Variability, IEEE Trans. on Computers, 2013 (to appear).
[PLV] A. Rahimi, L. Benini, R. K. Gupta, Procedure Hopping: a Low Overhead Solution to Mitigate Variability in Shared-L1 Processor Clusters, ISLPED, 2012.

Instruction-Level Vulnerability (ILV)*
The ILV of each instruction i at every operating condition is quantified as:
$$\mathrm{ILV}(i, V, T, \mathit{cycle\_time}) = \frac{1}{N_i} \sum_{j=1}^{N_i} \mathrm{Violation}_j$$
$$\mathrm{Violation}_j = \begin{cases} 1 & \text{if any stage violates at cycle } j \\ 0 & \text{otherwise} \end{cases}$$
where $N_i$ is the total number of clock cycles in the Monte Carlo simulation of instruction i with random operands, and $\mathrm{Violation}_j$ indicates whether there is a violated stage at clock cycle j.
$\mathrm{ILV}_i$ is defined as the total number of violated cycles over the total simulated cycles for instruction i. Therefore, the lower the ILV, the better.
*A. Rahimi, L. Benini, R. K. Gupta, Analysis of Instruction-level Vulnerability to Dynamic Voltage and Temperature Variations, DATE, 2012.
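As a rough illustration of the ILV definition above, the ratio of violated cycles to simulated cycles could be computed as in the C sketch below. The function name `ilv_from_violations` and the per-cycle flag array are assumptions made for this example; the slides do not show how the Monte Carlo results are stored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the slides): violation[j] is nonzero when
 * any pipeline stage violated timing at simulated cycle j. */
double ilv_from_violations(const unsigned char *violation, size_t n_cycles)
{
    assert(violation != NULL || n_cycles == 0);
    if (n_cycles == 0)
        return 0.0; /* assumption: no simulated cycles, no observed violations */

    size_t violated = 0;
    for (size_t j = 0; j < n_cycles; ++j)
        violated += (violation[j] != 0);

    /* ILV = violated cycles / total simulated cycles, always in [0, 1] */
    return (double)violated / (double)n_cycles;
}
```

With eight simulated cycles of which two violate timing, the function returns 0.25.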

Task-Level Vulnerability (TLV)
ILV is a useful variability metric that raises the level of abstraction from the circuit (critical paths) to the ISA level.
ILV is extended to a more coarse-grained task-level metric, TLV, towards building an integrated, vertical approach to control variability.
TLV is a per-core and per-task-type metric:
$$\mathrm{TLV}_{(\mathrm{core}_i,\,\mathrm{task}_j)} = \frac{\mathrm{EI}_{(i,j)}}{\mathrm{Length}_j}$$
$\mathrm{EI}_{(i,j)}$ is the number of errant instructions during task j on core i.
$\mathrm{Length}_j$ is the total number of executed instructions.
The lower the TLV, the better.
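The TLV ratio can be sketched in C as follows. This is only an illustration of the definition: the name `tlv_ratio`, the counter types, and the handling of an empty task are assumptions, not part of the presented runtime.

```c
#include <assert.h>

/* Hypothetical helper: TLV for one (core, task type) pair.
 * errant_instructions: errant instructions seen while the task ran on the core
 * length:              total instructions the task executed */
double tlv_ratio(unsigned long errant_instructions, unsigned long length)
{
    /* assumption: every errant instruction is also an executed instruction */
    assert(errant_instructions <= length);
    if (length == 0)
        return 0.0; /* assumption: an empty task has no vulnerability */
    return (double)errant_instructions / (double)length;
}
```

For example, a task that executes 4 instructions with 1 errant instruction has a TLV of 0.25 on that core.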

Variation-Tolerant MP Cluster (1/2)
Inspired by STM STHORM.
16x 32-bit RISC cores.
L1 SW-managed Tightly Coupled Data Memory (TCDM): multi-banked/multi-ported, fast concurrent read access.
Fast logarithmic interconnect.
One clock domain.
Bridge towards NoC.
[Figure: cluster block diagram. Cores 0 to M, each with an I$, a variation sensor, replay, and VDD-hopping, connect through master ports to the low-latency logarithmic interconnect; slave ports connect to the shared L1 TCDM banks 0 to N, the test-and-set semaphores, and the bridge to L2/L3.]

Variation-Tolerant Architecture (2/2)
Every core is equipped with:
Error sensing (EDS [Bowman 09]) to detect any timing error due to dynamic delay variation.
Error recovery (multiple-issue replay mechanism [Bowman 11]) to recover the errant instruction without changing the clock frequency.
VDD-hopping (semi-static) [Miermont 07] to compensate the impact of static process variation [Rahimi 12].
Thus, the cluster enables per-core characterization of TLV metadata through online variability measurement.
Fast access to the TLV metadata for each type of task is guaranteed by carefully placing these key data structures in the L1 TCDM.
[Figure: cluster block diagram annotated with online variability measurement and TLV metadata characterization at the cores, and the TLV metadata lookup table in the shared L1 TCDM.]

OpenMP Tasking
    #pragma omp parallel
    {
      #pragma omp single
      {
        for (i = 1; i <= N; i++) {
          #pragma omp task
          FUNC_1 (i);   /* first task type */
          #pragma omp task
          FUNC_2 (i);   /* second task type */
        }
      }
    } /* implicit barrier */
task directives identify given portions of code (tasks).
A task type is defined for every occurrence of the task directive in the program; the example has two task types.
Task descriptors are created upon encountering a task directive and pushed to the task queue in TCDM.
Tasks are fetched and executed (FIFO) by any core encountering a barrier.
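For readers who want to compile the tasking pattern from this slide, here is a minimal self-contained sketch. The bodies of `FUNC_1` and `FUNC_2`, the counters `sum_1` and `sum_2`, and the wrapper `run_tasks` are placeholders invented for this example; the slide only shows the directive structure.

```c
#include <assert.h>

/* Placeholder results so the effect of the tasks is observable. */
static int sum_1, sum_2;

static void FUNC_1(int i)
{
    #pragma omp atomic
    sum_1 += i;
}

static void FUNC_2(int i)
{
    #pragma omp atomic
    sum_2 += 2 * i;
}

/* One thread creates 2*n tasks (two task types); any thread may run them. */
void run_tasks(int n)
{
    assert(n >= 0);
    sum_1 = 0;
    sum_2 = 0;
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 1; i <= n; ++i) {
                #pragma omp task firstprivate(i)
                FUNC_1(i);
                #pragma omp task firstprivate(i)
                FUNC_2(i);
            }
        }
    } /* implicit barrier: all tasks have completed here */
}
```

Built with an OpenMP-enabled compiler (for example with `-fopenmp` on GCC or Clang) the tasks are distributed across threads; without OpenMP support the pragmas are ignored and the loop runs serially with the same result.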

Intra- and Inter-Corner TLV
Intra-corner TLV at a fixed corner (25 °C, 1.1V): the TLV of each type of task is different (up to 9×), even within a fixed operating condition on core i.
The six types of tasks are: logical instructions, add/sub instructions, arithmetic shift instructions, logical shift instructions, multiply instructions, and mixed instructions.
[Figure: intra-corner TLV (0.00 to 0.05) of the six types of tasks, for # of iterations = 10 and # of iterations = 100.]
Inter-corner TLV (across various operating conditions for 45nm): the average TLV of the six types of tasks is an increasing function of temperature. In contrast, decreasing the voltage from the nominal point of 1.1V increases TLV.
[Figure: inter-corner TLV versus temperature (0 to 140 °C) and versus voltage (0.88 to 1.1 V).]

Variation-tolerant OpenMP Tasking
Online TLV characterization.
TLV table: a LUT containing the TLV for every core and task type. It resides in TCDM, which allows parallel inspection from multiple cores.
Distributed scheduler: each core collects TLV information in parallel, and the LUT is updated at every task execution.
    // Task scheduling loop
    void handle_tasks () {
      while (HAVE_TASKS) {
        task_desc_t *t = EXTRACT_TASK ();
        if (t) {
          float Otlv = tlv_read_task_metadata (core_id);
          /* Reset counter for this core */
          tlv_reset_task_metadata (core_id);
          /* EXEC! */
          t->task_fn (t->task_data);
          /* We executed. Fetch TLV ... */
          float tlv = tlv_read_task_metadata (core_id);
          /* Update TLV. Average new and old value */
          tlv_table_write (t->task_type_id, core_id, (tlv-Otlv)/2);
        }
      }
    }
Example TLV table in TCDM (cores C0 to C2, task types T0 and T1):
T0: C0 0.0211, C1 -, C2 0.11
T1: C0 0.891, C1 -, C2 0.000005
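The sketch below shows one possible reading of the "average new and old value" update of the TLV table, written as plain C so it can be tried off-target. Everything here is an assumption for illustration: the table layout, the `TLV_UNKNOWN` marker for the "-" entries, the names `tlv_table_init` and `tlv_table_update`, and the choice to store the first measurement unaveraged. It does not reproduce the runtime's actual `tlv_table_write` or `tlv_read_task_metadata`.

```c
#include <assert.h>

#define N_CORES      16
#define N_TASK_TYPES 2
#define TLV_UNKNOWN  (-1.0f)  /* hypothetical marker for "not yet characterized" */

/* One entry per (task type, core); on the real cluster this lives in TCDM. */
static float tlv_table[N_TASK_TYPES][N_CORES];

void tlv_table_init(void)
{
    for (int t = 0; t < N_TASK_TYPES; ++t)
        for (int c = 0; c < N_CORES; ++c)
            tlv_table[t][c] = TLV_UNKNOWN;
}

/* Fold one task execution into the table and return the stored value.
 * errant / executed are the instruction counts observed for this run. */
float tlv_table_update(int task_type, int core, unsigned errant, unsigned executed)
{
    assert(task_type >= 0 && task_type < N_TASK_TYPES);
    assert(core >= 0 && core < N_CORES);
    assert(errant <= executed);

    float measured = executed ? (float)errant / (float)executed : 0.0f;
    float old = tlv_table[task_type][core];

    /* First sample is stored as-is; later samples are averaged with the old value. */
    float updated = (old < 0.0f) ? measured : 0.5f * (old + measured);
    tlv_table[task_type][core] = updated;
    return updated;
}
```

A two-sample average keeps the entry to a single float per (task type, core) pair while still letting recent executions shift the stored TLV.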

TLV-aware Extensions
    #pragma omp parallel
    {
      #pragma omp single
      {
        for (i = 1; i <= N; i++) {
          #pragma omp task
          FUNC_1 (i);
          #pragma omp task
          FUNC_2 (i);
        }
      }
    } /* implicit barrier */
The program is unchanged; task descriptors still go through the task queue in TCDM, but the FIFO fetch-and-execute step is replaced by a TLV-aware fetch.
Variation-tolerant OpenMP scheduler: reactive scheduling.
Idle processors trying to fetch a task check whether their TLV for the task is under a certain threshold, to minimize the number of errant instructions (and costly replay cycles).
The number of rejects for a given task is limited, to avoid starvation.

Variation-aware Scheduling Algorithm
    task_j = PEEK_QUEUE();
    TLV_ij = tlv_table_read(core_i, task_j);
    if (TLV_ij > TLV_THR && core_i_escape_cnt < ESCAPE_THR) {
      core_i_escape_cnt++;
      escape(task_j);
    } else {
      assign_to_core_i(task_j);
      core_i_escape_cnt = 0;
    }
Example state in TCDM:
TLV table, T0: C0 0.0211, C1 0.11, C2 -
TLV table, T1: C0 0.891, C1 -, C2 0.000005
core_escape_cnt: C0 1, C1 5, C2 0
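The decision rule on this slide can be exercised with the small C sketch below, seeded with the example table and escape counters shown on the slide. The threshold values `TLV_THR` and `ESCAPE_THR`, the function name `try_assign`, and the treatment of "-" entries as 0 (so an uncharacterized task is accepted) are assumptions for illustration; the slides do not give them.

```c
#include <assert.h>
#include <stdbool.h>

#define N_CORES      3
#define N_TASK_TYPES 2
#define TLV_THR      0.05f  /* assumed threshold */
#define ESCAPE_THR   5      /* assumed cap on consecutive escapes per core */

/* Slide example; "-" (no measurement yet) is stored as 0 here by assumption. */
static const float tlv_table[N_TASK_TYPES][N_CORES] = {
    { 0.0211f, 0.11f, 0.0f      },  /* T0 */
    { 0.891f,  0.0f,  0.000005f },  /* T1 */
};

static int core_escape_cnt[N_CORES] = { 1, 5, 0 };

/* Returns true if the core takes the task at the head of the queue,
 * false if it escapes (leaves the task for another core). */
bool try_assign(int core, int task_type)
{
    assert(core >= 0 && core < N_CORES);
    assert(task_type >= 0 && task_type < N_TASK_TYPES);

    float tlv = tlv_table[task_type][core];
    if (tlv > TLV_THR && core_escape_cnt[core] < ESCAPE_THR) {
        core_escape_cnt[core]++;   /* too vulnerable: skip, but count the escape */
        return false;
    }
    core_escape_cnt[core] = 0;     /* accepted: reset the starvation counter */
    return true;
}
```

With these values, C0 escapes task type T1 (TLV 0.891), while C1 must accept T0 despite its TLV of 0.11 because it has already reached the escape cap; this bounded escaping is what prevents starvation.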

Experimental Setup: Arch. + Benchmarks
Architecture: SystemC-based virtual platform* modeling the tightly-coupled cluster.
ARM v6 core:   16               TCDM banks:    16
I$ size:       16KB per core    TCDM latency:  2 cycles
I$ line:       4 words          TCDM size:     256 KB
Latency hit:   1 cycle          L3 latency:    60 cycles
Latency miss:  59 cycles        L3 size:       256MB
Benchmark: seven widely used computational kernels from the image processing domain are parallelized using OpenMP tasking, with 375 dynamic tasks on average.
The TLV lookup table only occupies 104 to 448 Bytes, depending upon the number of task types.
*D. Bortolotti et al., Exploring instruction caching strategies for tightly-coupled shared-memory clusters, Proc. Intern. Symposium on System on Chip (SoC), pp. 34-41, 2011

Experimental Setup: Variability Modeling
Each core is optimized during P&R with a target frequency of 850 MHz.
At sign-off, die-to-die and within-die process variations are injected using PrimeTime VX and variation-aware 45nm TSMC libs (derived from PCA).
Six cores (C0, C2, C4, C10, C13, C14) cannot meet the design-time target frequency of 850 MHz.
Per-core frequency (MHz) after process variation: C0 847, C1 893, C2 847, C3 901, C4 847, C5 909, C6 877, C7 870, C8 909, C9 855, C10 826, C11 917, C12 901, C13 820, C14 826, C15 862.
Vdd-hopping is applied to compensate the injected process variation, with VDD = {1.1V, 0.97V, 0.81V}.
All cores can then work at the design-time target frequency of 850 MHz, but at multiple voltage OpPs.
Per-core frequency (MHz) after Vdd-hopping: C0 >850, C1 893, C2 >850, C3 901, C4 >850, C5 909, C6 877, C7 870, C8 909, C9 855, C10 >850, C11 917, C12 901, C13 >850, C14 >850, C15 862.
To emulate variations, we have integrated variation models at the level of individual instructions using the ILV characterization methodology.
ILV models of a 16-core LEON-3 for the TSMC 45-nm general-purpose process with normal-VTH cells.
[Figure: cluster with per-core variation-aware VDD-hopping between high, typical, and low VDD rails through level shifters, connected by the logarithmic interconnect to the I$ and TCDM banks.]

Overhead of Variation-tolerant Scheduler
Normalized IPC = IPC of the variation-aware scheduler / IPC of the OMP baseline scheduler.
On a variation-immune cluster, the normalized IPC of the cluster is slightly decreased, to 0.998× on average.
This is due to reading the TLV lookup table and checking the conditions.
[Figure: normalized IPC (0.97 to 1.01) per benchmark, annotated with the # of dynamic tasks (between 225 and 750).]

IPC of Variability-affected Cluster
$$M = \frac{\sum m(i,j)}{\#\ \text{of dyn. tasks}}$$
M is the number of times that the scheduler postpones the execution of the task at the head of the queue, averaged over the dynamic tasks. On average, each task is escaped 2.1 times.
Our scheduler decreases the number of cycles per cluster for each type of task, because cores incur fewer errant instructions and spend fewer cycles on recovery.
The normalized IPC is increased by 1.17× (on average) for all benchmarks executing at 10 °C.
At a temperature of 100 °C (ΔT = 90 °C), the IPC is increased by 1.15×.
[Figure: normalized IPC and M per benchmark at 10 °C, 40 °C, 70 °C, and 100 °C.]
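The two metrics reported in the results, the average number of escapes per dynamic task (M) and the normalized IPC, are simple ratios. A minimal C sketch is given below; the function names and the per-task escape array are assumptions made for this example, and the numbers used to exercise it are made up rather than taken from the experiments.

```c
#include <assert.h>
#include <stddef.h>

/* M: total number of escapes divided by the number of dynamic tasks.
 * escapes_per_task[k] is how often dynamic task k was postponed. */
double average_escapes(const unsigned *escapes_per_task, size_t n_tasks)
{
    assert(escapes_per_task != NULL || n_tasks == 0);
    if (n_tasks == 0)
        return 0.0;

    unsigned long total = 0;
    for (size_t k = 0; k < n_tasks; ++k)
        total += escapes_per_task[k];
    return (double)total / (double)n_tasks;
}

/* Normalized IPC: variation-aware scheduler relative to the baseline scheduler. */
double normalized_ipc(double ipc_variation_aware, double ipc_baseline)
{
    assert(ipc_baseline > 0.0);
    return ipc_variation_aware / ipc_baseline;
}
```

A normalized IPC above 1 means the variation-aware scheduler outperforms the baseline; a value just below 1, as on the variation-immune cluster, reflects only the bookkeeping overhead.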

Conclusion
Vertical abstraction of circuit-level variations into high-level parallel software execution (OpenMP 3.0 tasking).
The vulnerability of tasks is characterized by TLV metadata during introspective execution.
The reactive variation-tolerant runtime scheduler uses TLV to match cores with tasks.
The normalized IPC of the 16-core variability-affected cluster increases by up to 1.51× (on average, 1.15×).
Future work: multiple clusters at multiple dynamic OpPs in Vdd and f.

Thank you for your attention!
ERC MultiTherman
NSF Variability Expedition

Classification of Instructions Based on ILV
ILV at 0.88V, while varying temperature, for 65nm.
[Table: ILV of 13 instructions (add, and, or, sll, sra, srl, sub, xnor, xor, load, store, mul, div) versus cycle time (1 ns to 1.18 ns) at three corners: (0.88V, -40 °C), (0.88V, 0 °C), and (0.88V, 125 °C).]
Instructions are partitioned into three main classes:
1st class: logical and arithmetic instructions
2nd class: memory instructions
3rd class: hardware multiply and divide instructions
For every operating condition: ILV (3rd class) ≥ ILV (2nd class) ≥ ILV (1st class)
