A Model for Application Slowdown Estimation
in On-Chip Networks and Its Use for Improving
System Fairness and Performance

Xiyue Xiang*, Saugata Ghose†, Onur Mutlu†§, Nian-Feng Tzeng*

*University of Louisiana at Lafayette   †Carnegie Mellon University   §ETH Zürich
Executive Summary

Problem: inter-application interference in on-chip networks (NoCs)
- In a multicore processor, interference can occur due to NoC contention
- Interference causes applications to slow down unfairly

Goal: estimate NoC-level slowdown at runtime, and use the slowdown information to improve system fairness and performance

Our Approach
- NoC Application Slowdown Model (NAS): the first online model to quantify inter-application interference in NoCs
- Fairness-Aware Source Throttling (FAST): throttle the network injection rates of processor cores based on slowdown estimates from NAS

Results
- NAS is very accurate and scalable: 4.2% error rate on average (8×8 mesh)
- FAST improves system fairness by 9.5% and performance by 5.2% (compared to a baseline without source throttling on an 8×8 mesh)
Motivation: Interference in NoCs

[Figure: slowdowns of lbm, leslie3d, mcf, and GemsFDTD when 16 copies of each application run concurrently on a 64-core processor; slowdowns range from 1.6× to 2.7×]

- 16 copies of each application run concurrently on a 64-core processor
- Root cause: NoC bandwidth is shared
- Interference slows down applications and increases system unfairness
NAS: NoC Application Slowdown Model

Slowdown = t_shared / t_alone
- t_shared: measured directly
- t_alone: unknown at runtime

[Figure: Node S sends a request to Node D and waits for the response; each request involves multiple packets, and a packet is formed by multiple flits]

Challenges:
- Flit-level delay ≠ slowdown
- Interference is random and distributed
- Overlapped delay

Basic idea: track interference delay online and calculate Δt_stall, the application stall time due to interference, so that t_alone = t_shared − Δt_stall
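To make the arithmetic concrete, here is a minimal sketch of the slowdown estimate computed from a measured t_shared and an accumulated Δt_stall; the function and variable names are ours, not the paper's implementation.

```cpp
#include <cstdint>
#include <cstdio>

// Minimal sketch of NAS's slowdown arithmetic (names are ours, not the paper's code).
// t_shared is measured directly; t_alone is estimated as t_shared minus the
// application stall time attributed to NoC interference (delta_t_stall).
double estimate_slowdown(uint64_t t_shared_cycles, uint64_t delta_t_stall_cycles) {
    // Guard against a degenerate estimate (stall attribution exceeding the epoch length).
    if (delta_t_stall_cycles >= t_shared_cycles) return 1.0;
    uint64_t t_alone_est = t_shared_cycles - delta_t_stall_cycles;
    return static_cast<double>(t_shared_cycles) / static_cast<double>(t_alone_est);
}

int main() {
    // Example: in a 1,000,000-cycle epoch, 200,000 stall cycles are attributed
    // to NoC interference, giving an estimated slowdown of 1.25x.
    printf("estimated slowdown: %.2f\n", estimate_slowdown(1000000, 200000));
    return 0;
}
```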
Flit-Level Interference

Three interference events:
- Injection
- Virtual channel arbitration
- Switch arbitration

Each flit carries an additional field, t_flit
- Whenever the flit loses arbitration due to interference, t_flit = t_flit + 1

Sum up arbitration delays due to interference
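A hedged sketch of the per-flit counter described above; the struct layout and event handling are ours and stand in for the router logic.

```cpp
#include <cstdint>

// Sketch of the per-flit interference counter NAS attaches to each flit
// (field and type names are ours). The counter grows by one for every cycle
// the flit loses injection, VC, or switch arbitration because of another
// application's traffic.
struct Flit {
    uint16_t t_flit = 0;   // accumulated flit-level interference delay, in cycles
    // ... payload, routing state, etc.
};

enum class ArbEvent { Injection, VcArbitration, SwitchArbitration };

// Called for each cycle in which a flit contends and loses arbitration.
void on_arbitration_loss(Flit& f, ArbEvent /*event*/, bool lost_to_other_app) {
    if (lost_to_other_app) {
        f.t_flit += 1;     // one more cycle of delay attributed to interference
    }
}
```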
Packet-Level Interference

[Figure: reassembly of a 5-flit packet. Alone run: flits f1–f5 arrive consecutively in cycles 1–5, so reassembly takes t_reassembly = M cycles (M = 5). Shared run: the flits arrive out of order between cycles 3 and 11 (T_first_arrival = 3, T_last_arrival = 11).]

- A packet's flits arrive consecutively when there is no interference
- Track the increase in packet reassembly time:
  Δt_packet = Δt_flit + Δt_reassembly, where Δt_reassembly = (T_last_arrival − T_first_arrival) − M
- In the example, the first flit is delayed by 2 cycles, so Δt_packet = 2 + (11 − 3 − 5) = 5 cycles
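The bookkeeping above can be sketched as follows; field names are ours, and this is an illustration of the formula rather than the authors' hardware.

```cpp
#include <cstdint>

// Sketch of NAS's packet-level delay bookkeeping (names are ours).
// Without interference, a packet's M flits arrive consecutively; any extra
// reassembly time is attributed to interference and added to the delay the
// flits have already accumulated at the flit level.
struct PacketTiming {
    uint64_t t_first_arrival;   // cycle the first flit of the packet arrived
    uint64_t t_last_arrival;    // cycle the last flit of the packet arrived
    uint32_t num_flits;         // M: packet length in flits
    uint32_t delta_t_flit;      // flit-level interference delay carried by the flits
};

uint64_t delta_t_packet(const PacketTiming& p) {
    uint64_t span = p.t_last_arrival - p.t_first_arrival;            // observed reassembly span
    uint64_t extra = span > p.num_flits ? span - p.num_flits : 0;    // Δt_reassembly
    return p.delta_t_flit + extra;
}

// Example from the slide: T_first_arrival = 3, T_last_arrival = 11, M = 5,
// flit-level delay of 2 cycles -> Δt_packet = 2 + (11 - 3 - 5) = 5 cycles.
```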
Request-Level Interference

Leverage the closed-loop packet behavior of a request to accumulate Δt_packet
Inheritance Table: holds the lump sum of Δt_packet for the associated packets of a request

[Figure: request/response round trip between Node S and Node D, whose network interface (NI) holds the inheritance table in front of the LLC slice]

1. The request packet is delayed by 5 cycles due to inter-application interference
2. The request packet's info is registered in the inheritance table (Δt_packet = 5)
3. Cache access at the LLC slice
4. The response packet is generated, inheriting Δt_packet from the table
5. The response packet is delayed by 3 more cycles due to inter-application interference; the final value of Δt_packet is 8 cycles

Sum up the delays of all associated packets: Δt_request = Δt_request_packet + Δt_response_packet
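A hedged sketch of the inheritance table's role in this round trip; the class, key widths, and method names are ours.

```cpp
#include <cstdint>
#include <unordered_map>

// Sketch of the inheritance table that carries a request packet's interference
// delay across the LLC access so the response packet can inherit it
// (structure and names are ours).
struct InheritanceEntry {
    uint16_t mshr_id;          // MSHR servicing this request at the LLC slice
    uint32_t delta_t_packet;   // interference delay accumulated by the request packet
};

class InheritanceTable {
    std::unordered_map<uint32_t, InheritanceEntry> table_;  // keyed by request ID
public:
    // Step 2: register the request packet's delay when it arrives at the LLC slice.
    void on_request_arrival(uint32_t req_id, uint16_t mshr_id, uint32_t delta_t_req_packet) {
        table_[req_id] = {mshr_id, delta_t_req_packet};
    }
    // Step 4: the response packet inherits the stored delay; the delay it then
    // accumulates on its way back to the core is added on top of this value.
    uint32_t on_response_generation(uint32_t req_id) {
        uint32_t inherited = table_[req_id].delta_t_packet;
        table_.erase(req_id);
        return inherited;
    }
};

// Example from the slide: request packet delayed 5 cycles, response packet
// delayed 3 cycles -> final Δt for the request is 5 + 3 = 8 cycles.
```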
Application Stall Time

A memory request becomes critical if:
1) It is the oldest instruction in the ROB and the ROB is full, and/or
2) It is the oldest instruction in the LSQ, the LSQ is full, and the next instruction is a memory instruction

[Figure: timeline of a critical request. Its latency before T_critical is hidden by ILP/MLP and ignored; the application stalls from T_critical until the request is serviced at T_service.]

For all critical requests:
  Δt_stall_per_request = min(T_service − T_critical, Δt_request)
  Δt_stall = Σ Δt_stall_per_request over all critical requests

Count only request delays on the critical path of execution time
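A hedged sketch of this accumulation, under the assumption that T_critical and T_service are the cycles at which the core starts stalling on the request and at which the request is serviced; names are ours.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hedged sketch of turning per-request interference delays into application
// stall time (names are ours). Only critical requests are counted, and for
// each one the interference delay cannot exceed the time the core actually
// spent stalled on that request.
struct CriticalRequest {
    uint64_t t_critical;       // cycle at which the core started stalling on this request
    uint64_t t_service;        // cycle at which the request was serviced
    uint64_t delta_t_request;  // interference delay accumulated by this request
};

uint64_t delta_t_stall(const std::vector<CriticalRequest>& critical_requests) {
    uint64_t stall = 0;
    for (const auto& r : critical_requests) {
        // Latency hidden by ILP/MLP (before t_critical) is ignored.
        stall += std::min(r.t_service - r.t_critical, r.delta_t_request);
    }
    return stall;
}
```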
Using NAS to Improve Fairness

NAS provides an online estimation of slowdown:
- Sum up flit-level arbitration delays due to interference
- Track the increase in packet reassembly time
- Sum up the delays of all associated packets
- Determine which request delays cause application stalls

Goal: use NAS to improve system fairness and performance
→ FAST: Fairness-Aware Source Throttling
A New Metric: NoC Stall-Time Criticality

[Figure: slowdown vs. network intensity (MPKI) for lbm, leslie3d, mcf, and GemsFDTD]

Interference in NoCs has an uneven impact across applications

NoC Stall-Time Criticality (STC_noc): how sensitive an application's slowdown is to the NoC
- Lower STC_noc  <==>  less sensitive to NoC-level interference → a good candidate to be throttled down
- FAST utilizes STC_noc to proactively estimate the expected impact of each L1 miss
Key Knobs of FAST

Rank applications based on slowdown
Classify applications based on network intensity:
- Latency-sensitive: spends more time in the core
- Throughput-sensitive: network-intensive

Throttle up:
- Latency-sensitive applications, to improve system performance
- Slower applications, to optimize system fairness

Throttle down:
- Throughput-sensitive applications with lower STC_noc: reduce interference with a lower negative impact on performance
- Avoid throttling down the slowest application
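These knobs can be read as a simple per-epoch policy; the sketch below is our interpretation, with thresholds, step sizes, and field names that are assumptions rather than the paper's parameters.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hedged sketch of FAST's per-epoch throttling decisions as summarized on this
// slide; thresholds, step sizes, and struct fields are ours, not the paper's.
struct AppState {
    double slowdown;        // NAS's runtime slowdown estimate
    double network_mpki;    // network intensity
    double stc_noc;         // NoC stall-time criticality
    double injection_rate;  // current source injection rate, in (0, 1]
};

void fast_epoch(std::vector<AppState>& apps, double mpki_threshold,
                double stc_threshold, double rate_step) {
    if (apps.empty()) return;
    // Rank by slowdown: the slowest application must never be throttled down.
    const AppState* slowest = &*std::max_element(apps.begin(), apps.end(),
        [](const AppState& a, const AppState& b) { return a.slowdown < b.slowdown; });
    double avg_slowdown = std::accumulate(apps.begin(), apps.end(), 0.0,
        [](double sum, const AppState& a) { return sum + a.slowdown; }) / apps.size();

    for (auto& app : apps) {
        bool latency_sensitive = app.network_mpki < mpki_threshold;
        if (latency_sensitive || app.slowdown > avg_slowdown) {
            // Throttle up latency-sensitive apps (performance) and slower apps (fairness).
            app.injection_rate = std::min(1.0, app.injection_rate + rate_step);
        } else if (app.stc_noc < stc_threshold && &app != slowest) {
            // Throttle down throughput-sensitive apps with low STC_noc: they are the
            // least sensitive to NoC interference, so the performance cost is small.
            app.injection_rate = std::max(rate_step, app.injection_rate - rate_step);
        }
    }
}
```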
 
Methodology

Processor: out-of-order, ROB / instruction window = 128
Caches:
- L1: 64 KB, 16 MSHRs
- L2: perfect, shared
NoC:
- Topology: 4×4 and 8×8 mesh
- Router: conventional VC router with 8 VCs, 4 flits/VC
Workloads: multiprogrammed SPEC CPU2006
- 90 randomly chosen workloads
- Categorized by network intensity (i.e., MPKI)
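For reference, the evaluated configuration can be restated as a plain struct; this is only a summary of the list above, not a simulator interface.

```cpp
// Restatement of the evaluated configuration as a plain struct (ours, for reference only).
struct EvalConfig {
    // Processor
    int rob_entries = 128;                // out-of-order core, ROB / instruction window
    // Caches
    int l1_size_kb = 64;                  // per-core L1
    int l1_mshrs = 16;
    bool l2_perfect_shared = true;        // perfect shared L2
    // NoC
    int mesh_width = 8, mesh_height = 8;  // also evaluated at 4x4
    int vcs_per_port = 8;                 // conventional VC router
    int flits_per_vc = 4;
    // Workloads: 90 randomly chosen multiprogrammed SPEC CPU2006 mixes,
    // categorized by network intensity (MPKI).
};
```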
NAS is Accurate

[Figure: average slowdown estimation error for the 4×4 and 8×8 meshes; the error rises to 31.7% only when the network saturates]

- Slowdown estimation error: 4.2% (2.6%) on average for the 8×8 (4×4) mesh
- Consistently low slowdown estimation error
- Good scalability

NAS is highly accurate and scalable
FAST Improves Performance

[Figure: normalized weighted speedup of NoST, HAT, and FAST on 4×4 and 8×8 meshes, for (a) mixed workloads and (b) heavy workloads; FAST gains 5.2% and 5.0% over the no-source-throttling baseline on the 8×8 mesh]

FAST performs better than both HAT and NoST:
- Inter-application interference is reduced
- FAST throttles down only applications with a low negative impact on performance (i.e., lower STC_noc)
FAST Reduces Unfairness

[Figure: normalized unfairness of NoST, HAT, and FAST on 4×4 and 8×8 meshes, for (a) mixed workloads and (b) heavy workloads; FAST reduces unfairness by 9.5% and 4.7% on the 8×8 mesh]

FAST improves fairness:
- Source throttling allows slower applications to catch up
- FAST uses runtime slowdown estimates to identify the slowest application and avoid throttling it
Conclusion

Problem: inter-application interference in on-chip networks (NoCs)
- In a multicore processor, interference can occur due to NoC contention
- Interference causes applications to slow down unfairly

Goal: estimate NoC-level slowdown at runtime, and use the slowdown information to improve system fairness and performance

Our Approach
- NoC Application Slowdown Model (NAS): the first online model to quantify inter-application interference in NoCs
- Fairness-Aware Source Throttling (FAST): throttle the network injection rates of processor cores based on slowdown estimates from NAS

Results
- NAS is very accurate and scalable: 4.2% error rate on average (8×8 mesh)
- FAST improves system fairness by 9.5% and performance by 5.2% (compared to a baseline without source throttling on an 8×8 mesh)
A Model for Application Slowdown Estimation
in On-Chip Networks and Its Use for Improving
System Fairness and Performance

Xiyue Xiang*, Saugata Ghose†, Onur Mutlu†§, Nian-Feng Tzeng*

*University of Louisiana at Lafayette   †Carnegie Mellon University   §ETH Zürich
Backup Slides

Xiyue Xiang*, Saugata Ghose†, Onur Mutlu†§, Nian-Feng Tzeng*

*University of Louisiana at Lafayette   †Carnegie Mellon University   §ETH Zürich
 
Related Work

Slowdown modeling
- Fine-grained: [Mutlu+ MICRO '07], [Ebrahimi+ ASPLOS '10], [Bois+ TACO '13]
- Coarse-grained: [Subramanian+ HPCA '13], [Subramanian+ MICRO '15]
Source throttling: [Chang+ SBAC-PAD '12], [Nychis+ SIGCOMM '12], [Nychis+ HotNets '10]
Application mapping: [Chou+ ICCD '08], [Das+ HPCA '13]
Prioritization: [Das+ MICRO '09], [Das+ ISCA '10]
Scheduling: [Kim+ MICRO '10]
QoS: [Grot+ MICRO '09], [Grot+ ISCA '11], [Lee+ ISCA '08]
 
Hardware Cost of NAS

Component                                                       Cost
Interference delay of each flit (router)                        5.3% wider data path
Timestamps of first and last arriving flits of a packet (NI)    (16+16) × 16 bits
Inheritance table (NI)                                          (6+4+8) × 20 bits
Interference delay of the request                               8 bits
Timestamp when the processor stalls (core)                      16 bits
Estimated application stall time (core)                         16 bits

Total cost of NAS per node: 114 bytes + 5.3% router area
 
NAS Error Distribution

[Figure: histogram of slowdown estimation error (binned) across 7,200 application instances]

- 66.0% of application instances have < 10% error
- 84.3% of application instances have < 20% error
- 5.6% of application instances have ≥ 40% error

NAS exhibits high accuracy most of the time
