Clock Clustering and IO Optimization for 3D Integration


This research focuses on clocking options, clustering, and synchronization schemes for 3D integration, addressing challenges related to area, power, and timing. Key choices for optimizing clocking and 3D interconnects are discussed, along with the advantages and disadvantages of different clocking schemes. Various synchronization schemes for 3DIO clocking, including synchronous, source-synchronous, and asynchronous approaches, are explored.

  • Clock Clustering
  • IO Optimization
  • 3D Integration
  • Synchronization Schemes
  • Technology

Uploaded on Mar 01, 2025 | 0 Views



Presentation Transcript


  1. Clock Clustering and IO Optimization for 3D Integration
     Samyoung Bang*, Kwangsoo Han, Andrew B. Kahng and Vaishnav Srinivas
     ECE and CSE Departments, UC San Diego, La Jolla, CA 92093
     *Samsung Electronics Co. Ltd, Hwaseong-si, South Korea
     eva.bang@samsung.com, {kwhan, abk, vaishnav}@ucsd.edu

  2. Outline
     • Motivation
     • Power, Area and Timing Model
     • P&R and Timing Flow
     • Experimental Results
     • Conclusion

  3. Motivation
     • For 3D integration with large bandwidth needs between dies, the choice of clocking options needs to be made upfront
     • The tradeoff between area and power is needed upfront
     • The choice affects floorplanning
     [Figure: block diagram with serializer, deserializer, 3DIO and PLL blocks]

  4. Key Choices for Clocking Options
     • Local clustering: partition a given region into sub-regions
     • Clock synchronization scheme
       • Synchronous
       • Source-synchronous
       • Asynchronous
     • 3DIO frequency and # of 3DIO
     • To enable design space pathfinding/exploration:
       • Power/area/timing model based on total bandwidth, clustering, synchronization scheme and 3DIO frequency
       • Combine clocking and 3DIO power/area/timing

  5. 3DIO Clustering
     • Localize the clock tree of the 3D interconnect
     • Advantages when the number of clusters increases:
       • Smaller cluster clock tree (smaller skew, jitter)
       • Shorter data paths to the 3DIO array at the center of each cluster
       • Enables efficient clocking schemes (forwarded clock, asynchronous)
     • Disadvantages when the number of clusters increases:
       • Overhead to synchronize between clusters on the top die
       • Overhead in cluster clock 3DIO per cluster
     [Figure: the layout of the bottom die, showing a cluster, data paths, the 3DIO array and the clock entry point]

  6. Synchronization Schemes for 3DIO Clocking
     • Synchronous
       • Cluster clock tree is balanced to all F/Fs on both the bottom and the top die
       • Simplest clocking scheme (similar to on-die)
       • Vulnerable to inter-die process/voltage variation (large skews)
     • Source-synchronous
       • Forwarded clock from one die to another
       • No skew balancing needed across the two dies
       • Requires balance delays (Tb) within each die on the data path to match the clock insertion delay
     • Asynchronous
       • Separate clocks on each die
       • FIFO to help clock domain crossing
       • Obtains a much smaller number of 3DIOs due to higher speeds achievable with embedded clock and CDR techniques
     [Figure: (a) synchronous clocking, (b) source-synchronous clocking, (c) asynchronous clocking. Labels: launch FFs, capture FFs, data path, top, bottom, Tdata, Thold_fix, Tclk0, Tclk1, DDR, balance delay (Tb), forwarded clock, serializer, deserializer, cluster clock, IO clock, recovered IO clock]

  7. Our Work
     • Given the choices of clock synchronization scheme, number of clusters and 3DIO frequency, find the maximum bandwidth for the 3D interconnect under maximum power and area constraints
     [Figure: four plots over the max power constraint and max area constraint axes: max achievable BW; optimal clocking scheme for max BW (synch., source-synch., asynch.); optimal number of clusters for max BW; optimal clocking frequency for max BW]
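
The search stated on this slide can be sketched as a brute-force sweep over the choices. This is only an illustration: `toy_model`, its coefficients and the candidate grids below are hypothetical placeholders, not the fitted models or the experiment grid used in the deck.

```python
from itertools import product

def best_config(model, schemes, cluster_counts, io_freqs_mhz, bandwidths_gbps,
                max_power, max_area):
    """Return (bandwidth, scheme, n_clusters, f_io) with the largest feasible
    bandwidth, or None if no candidate meets both constraints."""
    best = None
    for scheme, n_c, f_io, bw in product(schemes, cluster_counts,
                                         io_freqs_mhz, bandwidths_gbps):
        power, area = model(scheme, n_c, f_io, bw)
        if power <= max_power and area <= max_area:
            if best is None or bw > best[0]:
                best = (bw, scheme, n_c, f_io)
    return best

def toy_model(scheme, n_clusters, f_io_mhz, bw_gbps):
    """Hypothetical stand-in for a power/area model; all numbers are made up."""
    power_per_gbps = {"sync": 1.0, "source-sync": 1.5, "async": 2.5}[scheme]
    area_per_gbps = {"sync": 2.5, "source-sync": 1.5, "async": 1.0}[scheme]
    power = power_per_gbps * bw_gbps + 0.1 * n_clusters
    area = area_per_gbps * bw_gbps * 1000.0 / f_io_mhz + 0.1 * n_clusters
    return power, area

result = best_config(toy_model,
                     schemes=["sync", "source-sync", "async"],
                     cluster_counts=[1, 4, 9],
                     io_freqs_mhz=[1000, 2000],
                     bandwidths_gbps=[10, 50, 100, 200],
                     max_power=200.0, max_area=200.0)
print(result)
```

Swapping `toy_model` for a real power/area estimator and repeating the sweep over a grid of `max_power` and `max_area` values would produce the kind of constraint-plane plots the slide describes.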

  8. Outline (section: Power, Area and Timing Model)

  9. 3DIO/CTS Directed Graph
     • Primary inputs are indicated by circles
     • Rectangles are determined by the primary inputs
     • Solid and dotted arrows indicate positive and negative correlation, respectively
     • The rounded rectangles are estimated as analytic expressions
     [Figure: directed graph. Nodes: #Clusters, Freq., Clocking scheme, Region Area, BW, WNS, Skew outcome / clock ins. delay, Area, Power, # FFs, IO Freq., Per-IO power/area, # 3DIO, Max skew/transition, Jitter, Clock WL, Clock buf. area, Data WL, Data buf. area. Legend: Input, Deterministic, Est. outcome, Estimated, Increase, Decrease]

  10. Clock Wirelength
      • Hierarchical approach to estimate clock wirelength
      • Assume the clock tree is well balanced because FFs are uniformly distributed over the region area
      • The length of the Steiner minimal tree over N points uniformly distributed within a given region Areg is proportional to [expression not recoverable from the extraction]
      • Total clock wirelength is [equation not recoverable from the extraction]
      • Notation: i is the depth of the clock tree (i == 0 for the clock source); NC is the number of clusters; Nff is the total number of flip-flops; the remaining symbols are fitted coefficients
      [Figure: global clock tree and cluster clock tree driving FFs, with levels i = 0 (w0), i = 1 (w1) and i = 2 (w2)]
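
A rough numerical sketch of a hierarchical wirelength estimate, assuming the commonly used asymptotic scaling in which a Steiner minimal tree over N uniformly placed points in area A has length of about k * sqrt(N * A). The two-level split (one global tree to the cluster entry points plus one local tree per cluster) and the coefficient `k` are assumptions made for illustration; they are not the fitted expression from the deck.

```python
import math

def smt_length(n_points, area_mm2, k=1.0):
    """Approximate Steiner-minimal-tree length (mm) for uniformly placed
    points, using the assumed scaling k * sqrt(N * A)."""
    return k * math.sqrt(n_points * area_mm2)

def clock_wirelength(n_ff, region_area_mm2, n_clusters, k=1.0):
    """Two-level estimate: a global tree over the cluster entry points plus
    one local tree per cluster (FFs and area split evenly across clusters)."""
    global_wl = smt_length(n_clusters, region_area_mm2, k)
    per_cluster_wl = smt_length(n_ff / n_clusters,
                                region_area_mm2 / n_clusters, k)
    return global_wl + n_clusters * per_cluster_wl

print(clock_wirelength(n_ff=100, region_area_mm2=100, n_clusters=4))
```

Under this assumed scaling the summed local-tree length does not depend on the number of clusters, so any clustering effect in the sketch comes only from the global tree.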

  11. Clock Buffer Area
      • Tellez and Sarrafzadeh propose a method to insert the minimum number of buffers under a given transition time (Tmax_tran) constraint
        • Linearize the problem by using the concept of maximum capacitance (Cmax)
        • Any buffer stage i with stage cap Ci < Cmax will have Ti_tran < Tmax_tran
      • Using Cmax, we estimate the number of clock buffers (Ncbuf): [equation not recoverable from the extraction]
      • Kashyap et al. discuss transition time degradation, and Cmax can be expressed as follows: [equation not recoverable from the extraction]
      • Total clock buffer area is [equation not recoverable from the extraction]
      [Figure: a wire of maximum length Wmax, with transition time T0 at one end and Tmax_tran at the other]
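
One way to read the Cmax linearization is as a capacitance budget: if every buffer stage may drive at most Cmax, and the total load is made of wire capacitance, flip-flop clock-pin capacitance and buffer input capacitance, the buffer count follows directly. The form below is an assumed illustration, not the deck's fitted equation; all parameter names and values are hypothetical.

```python
import math

def estimate_clock_buffers(wirelength_um, n_ff, c_wire_per_um, c_ff, c_buf_in, c_max):
    """Smallest N such that N * c_max covers wire, FF and buffer-input load:
    N * c_max >= c_wire * W + n_ff * c_ff + N * c_buf_in."""
    if c_max <= c_buf_in:
        raise ValueError("c_max must exceed the buffer input capacitance")
    load = c_wire_per_um * wirelength_um + n_ff * c_ff
    return math.ceil(load / (c_max - c_buf_in))

def clock_buffer_area(n_buffers, area_per_buffer_um2):
    """Total buffer area, assuming a single buffer size."""
    return n_buffers * area_per_buffer_um2

# Hypothetical values (capacitances in fF, lengths in um).
n_cbuf = estimate_clock_buffers(wirelength_um=8000, n_ff=1000,
                                c_wire_per_um=0.25, c_ff=1.0,
                                c_buf_in=2.0, c_max=52.0)
print(n_cbuf, clock_buffer_area(n_cbuf, 2.0))
```

A tighter transition-time limit corresponds to a smaller `c_max`, which raises the estimated buffer count.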

  12. Data Wirelength and Data Buffer Area
      • Data path wirelength is proportional to the number of data wires and the cluster dimension: [equation not recoverable from the extraction]
        • A distribution exists based on sink placement with respect to the 3DIO cluster
      • For data buffer area, we use a similar concept to clock buffer area estimation
        • Need to consider each data path separately; cannot use total wirelength
        • Need the minimum number of data buffers to meet hold timing
      [Figure: a cluster with a sink and the 3DIO; labels p and d]

  13. 3DIO/Overall Power and Area
      • 3DIO power and area models are based on CACTI-IO: [equations not recoverable from the extraction; the IO power expression includes static and dynamic terms]
      • Overall (3DIO + clocking) power and area are: [equations not recoverable from the extraction; the power expression is annotated with internal and leakage power, switching power, and IO power]

  14. Outline (section: P&R and Timing Flow)

  15. 3D P&R Flow - Synchronous
      • Synthesize the cluster clock tree on the top die first to balance the clock tree on both dies
        • Extract the maximum clock insertion delay (Tclk1)
        • Propagate the data path delay (Tdata) for the routing on the top die
      [Flow chart boxes (synchronous scheme): gate-level netlist and SDC file generation; custom placement; CTS (top die); extract max Tclk1; CTS and route (bottom die); extract Tdata; route (top die); report power/area/timing]
      [Figure: the layout of the bottom die and the layout of the top die, showing the propagated clock, 3DIO, Tdata and Tclk1]

  16. 3D P&R Flow - Source-synchronous and Asynchronous
      • Source-synchronous
        • Synthesize the clock tree and route on the bottom die, and separately synthesize the clock tree only for the top die
        • Extract the balance delay Tb (i.e., Tclk1) for each capture FF and annotate the delays to the corresponding data 3DIOs
      • Asynchronous
        • Run the traditional 2D flow on both dies separately
      [Flow chart boxes (source-synchronous scheme): gate-level netlist and SDC file generation; custom placement; CTS and route (bottom die); CTS (top die); extract Tb; route (top die); report power/area/timing]
      [Figure: the layout of the bottom die and the layout of the top die, showing the propagated clock, 3DIO, Tclk1 and the annotated Tb (i.e., Tclk1)]

  17. Conventional 2D STA vs. our 3D STA
      • We focus on inter-die variation and do not consider intra-die variation, which can be comprehended by timing derate or OCV
      • Two process corners {BC, WC} for inter-die variation
      • Assign the same corner to the paths on the same die
      • Report the worst setup WNS out of the four combinations of corners (i.e., BC-BC, BC-WC, WC-BC, WC-WC)
      • Conventional 2D STA (without inter-die variation), setup:
          $\text{slack} = T_{per} - T_{su} - T_{c2q,WC} - T_{data1,WC} - T_{data2,WC} + (T_{capture,BC} - T_{launch,WC})$
      • Our 3D STA, setup:
          $\text{slack}_1 = T_{per} - T_{su} - T_{c2q,BC} - T_{data1,BC} - T_{data2,BC} + (T_{capture,BC} - T_{launch,BC})$
          $\text{slack}_2 = T_{per} - T_{su} - T_{c2q,BC} - T_{data1,BC} - T_{data2,WC} + (T_{capture,WC} - T_{launch,BC})$
          $\text{slack}_3 = T_{per} - T_{su} - T_{c2q,WC} - T_{data1,WC} - T_{data2,BC} + (T_{capture,BC} - T_{launch,WC})$
          $\text{slack}_4 = T_{per} - T_{su} - T_{c2q,WC} - T_{data1,WC} - T_{data2,WC} + (T_{capture,WC} - T_{launch,WC})$
          $\text{slack} = \min(\text{slack}_1, \text{slack}_2, \text{slack}_3, \text{slack}_4)$
      [Figure: a path from an FF on the bottom die to an FF on the top die through buffers on both dies, annotated with Tc2q, Tdata1, Tdata2, Tlaunch and Tcapture]
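
The four-combination setup check above can be sketched as a small function. The grouping follows the slack equations: Tc2q, Tdata1 and Tlaunch share one die's corner, while Tdata2 and Tcapture share the other die's corner. The names "launch side" and "capture side" and the delay values are assumptions for illustration, not data from the deck.

```python
from itertools import product

CORNERS = ("BC", "WC")

def setup_slack(t_per, t_su, launch_side, capture_side):
    """Setup slack for one corner assignment (all values in the same unit).

    launch_side:  delays sharing one die's corner: 'c2q', 'data1', 'launch'.
    capture_side: delays sharing the other die's corner: 'data2', 'capture'.
    """
    return (t_per - t_su
            - launch_side["c2q"] - launch_side["data1"] - capture_side["data2"]
            + (capture_side["capture"] - launch_side["launch"]))

def worst_setup_slack(t_per, t_su, launch_by_corner, capture_by_corner):
    """Evaluate all four die-corner pairs; return (worst_pair, worst_slack, all_slacks)."""
    slacks = {
        (lc, cc): setup_slack(t_per, t_su, launch_by_corner[lc], capture_by_corner[cc])
        for lc, cc in product(CORNERS, repeat=2)
    }
    worst_pair = min(slacks, key=slacks.get)
    return worst_pair, slacks[worst_pair], slacks

# Hypothetical delays in picoseconds.
launch_by_corner = {
    "BC": {"c2q": 50, "data1": 200, "launch": 300},
    "WC": {"c2q": 100, "data1": 400, "launch": 600},
}
capture_by_corner = {
    "BC": {"data2": 200, "capture": 300},
    "WC": {"data2": 400, "capture": 600},
}
worst_pair, worst_slack, all_slacks = worst_setup_slack(
    1000, 50, launch_by_corner, capture_by_corner)
print(worst_pair, worst_slack)
```

With these made-up numbers the worst case is the mixed pair (one die at WC, the other at BC), while both same-corner pairs show positive slack. That is the situation the four-combination check is meant to catch.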

  18. Outline (section: Experimental Results)

  19. Experimental Setup
      • P&R tool is Synopsys IC Compiler I-2013.12-SP1
      • Timing analysis tool is Synopsys PrimeTime H-2013.06-SP2
      • We use a 65nm TSMC library
      • Design of experiments
        • Bandwidth (10 to 200 GB/s)
        • Region area (25 to 100 mm^2)
        • 3DIO clock frequency
          • Synchronous (100 to 2000 MHz)
          • Source-synchronous (1500 to 4000 MHz)
          • Asynchronous (3500 to 8000 MHz)
        • Number of clusters (1 to 25)
      • We select four data points for each parameter, giving 256 design implementations for each clocking scheme
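
The count of 256 implementations per scheme follows from four parameters with four points each (4^4). A minimal sketch of such a full-factorial grid is below. The ranges come from the slide, but the deck does not list the four points chosen per parameter, so the evenly spaced values here are an assumption.

```python
from itertools import product

def evenly_spaced(lo, hi, n):
    """Return n evenly spaced values from lo to hi inclusive."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

# Parameter ranges from the slide.
COMMON_RANGES = {
    "bandwidth_gbps": (10, 200),
    "region_area_mm2": (25, 100),
    "n_clusters": (1, 25),
}
IO_FREQ_MHZ = {
    "synchronous": (100, 2000),
    "source-synchronous": (1500, 4000),
    "asynchronous": (3500, 8000),
}

def design_points(scheme, points_per_param=4):
    """Full-factorial grid (bandwidth, area, clusters, IO frequency) for one scheme."""
    axes = [evenly_spaced(lo, hi, points_per_param)
            for lo, hi in COMMON_RANGES.values()]
    axes.append(evenly_spaced(*IO_FREQ_MHZ[scheme], points_per_param))
    return list(product(*axes))

runs = {scheme: design_points(scheme) for scheme in IO_FREQ_MHZ}
print({scheme: len(points) for scheme, points in runs.items()})
```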

  20. Model Fitting Approach
      • We use an Artificial Neural Network (ANN) model for our fit, guided by the directed graph
      • Iteratively progress through the directed graph to fit each node
        • Clock wirelength
        • Data wirelength
        • Clock buffer area
        • Data buffer area
        • 3DIO power/area
        • Total Area/Power/WNS
      • We use Fmax for the timing model (instead of WNS)
      • Multiple runs with different training, validation and test data sets improve the generality and robustness of the resulting models
      [Figure: the 3DIO/CTS directed graph, repeated from the earlier slide]
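
The "multiple runs with different training, validation and test data sets" step can be sketched as repeated random partitioning of the design implementations. The split fractions, run count and seed below are hypothetical, and the model-fitting call itself is omitted because the deck does not describe the ANN configuration.

```python
import random

def repeated_splits(n_samples, n_runs, train_frac=0.7, val_frac=0.15, seed=0):
    """Return n_runs random (train, validation, test) index partitions."""
    rng = random.Random(seed)
    n_train = int(train_frac * n_samples)
    n_val = int(val_frac * n_samples)
    splits = []
    for _ in range(n_runs):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        splits.append((indices[:n_train],
                       indices[n_train:n_train + n_val],
                       indices[n_train + n_val:]))
    return splits

splits = repeated_splits(n_samples=256, n_runs=5)
for train, val, test in splits:
    print(len(train), len(val), len(test))
```

Each partition would be used for one fit, and the spread of errors across runs gives an indication of how robust the fitted model is.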

  21. Area, Power and Timing Model Results
      • Min-max error within +/-20%
        • For the synchronous scheme, the tool inserts a large number of hold buffers due to inter-die variation, which leads to larger error
      • Mean error within +/-0.5%
      [Figure: three plots, for Area, Power and Fmax]

  22. Design Space Results
      • Max BW:
        • Figure shows the iso-bandwidth curves
        • Vertical and horizontal walls show the minimum power/area required to hit a bandwidth requirement
      • Clocking scheme:
        • The asynchronous scheme is area-efficient
        • The synchronous scheme is power-efficient
        • The source-synchronous scheme provides a valuable tradeoff between power and area along the knee of the iso-bandwidth curve
        • The interesting tradeoffs between the schemes occur along these knee points as we change the power/area constraints
      [Figure: two plots, Max BW and Optimal clocking scheme for Max BW]

  23. Design Space Results
      • Cluster clock frequency:
        • As the power constraint gets tighter, frequency goes down
        • As the area constraint gets tighter, frequency goes up
        • Source-synchronous schemes provide benefits at higher cluster frequencies
        • The asynchronous scheme provides a way to keep the cluster frequency down but still have high 3DIO frequency, through serialization
      • Number of clusters:
        • Not monotonic along edges of hypercube and clocking scheme boundaries
        • Also sensitive to the total region area
      [Figure: two plots, Optimal Cluster Frequency for Max BW and Optimal # of Clusters for Max BW]

  24. Outline (section: Conclusion)

  25. Conclusion
      • We have developed a power, area and timing model for 3DIO and CTS that includes clustering and three different clock synchronization schemes (synchronous, source-synchronous, asynchronous)
      • Our model estimates power, area and timing within 20% error across a large range of bandwidths, region areas, numbers of clusters and 3DIO frequencies
      • Our modeling methodology will enable architects to study and optimize the design space upfront
      • Key takeaways:
        • Iso-bandwidth lines identify the minimum area/power required to hit a particular BW
        • Clocking scheme tradeoffs are interesting along the knee of iso-bandwidth lines
        • Cluster frequency for asynchronous schemes can be kept low while still reducing the number of 3DIOs due to serialization

  26. Future Work
      • Extend our model to be aware of
        • Placement uniformity
        • Technology dependence
        • Datapath logic
        • More comprehensive STA including intra-die variation
        • Blockages
        • Asymmetric clustering
        • Different 3DIO placement
        • Serial 3DIO circuit options for asynchronous scheme
      • 2.5D (interposer-based) design

  27. Thank you

  28. BACKUP

  29. Synchronous
      • All end points on both dies are synchronized
        • Colored FFs are uniformly distributed over the region
        • Non-colored FFs are placed right next to the 3DIO array
      • Clock tree is vulnerable to inter-die variation
      • Use DDR to minimize the number of 3DIOs
      • Two factors determine the max 3DIO clock frequency (FIO)
        • Clock skew due to inter-die variation
        • Jitter
      • Increasing #clusters increases max FIO because the clock tree becomes more robust to inter-die variation

      Region Area (mm^2)   BW (GB/s)   Ncluster   Fmax (MHz)   FIO (MHz)
      25                   12          1          640          1280
      25                   11.25       25         900          1800
      100                  12          1          300          600
      100                  11.25       25         600          1200
      25                   50.025      1          460          920
      25                   50.625      25         900          1800
      100                  200.25      1          300          600
      100                  202.5       25         600          1200

  30. Source-Synchronous
      • Forward the clock from one die to the other die
      • For any paths across dies, the launch and capture path delays from source to 3DIO at the bottom die are balanced, so there is no inter-die variation
      • Requires balance delay Tb to compensate for clock insertion delay Tclk1
      • Two factors determine the max 3DIO clock frequency (FIO)
        • Skew between Tb and Tclk1 due to intra-die variation
        • Jitter

      Region Area (mm^2)   BW (GB/s)   Ncluster   Fmax (MHz)   FIO (MHz)
      25                   12.095      1          820          1640
      25                   10.625      25         1700         3400
      25                   50.02       1          820          1640
      25                   46.875      25         1500         3000
      100                  12          1          500          1000
      100                  15          25         1200         2400
      100                  200         1          350          700
      100                  195         25         1200         2400

  31. Asynchronous
      • Use FIFO (1:8 serializer, 8:1 deserializer) to separate clock domains
      • No inter-die variation
      • Minimize the number of 3DIOs
      • Requires a PLL for the cluster clock on the top die and for the IO clock on both dies
        • Large power overhead
      • One factor determines the max 3DIO clock frequency (FIO)
        • Jitter

      Region Area (mm^2)   BW (GB/s)   Ncluster   Fmax (MHz)   FIO (MHz)
      25                   11.9        1          700          5600
      25                   25          25         1000         8000
      25                   49.7        1          700          5600
      25                   40          25         800          6400
      100                  12          1          400          3200
      100                  20          25         800          6400
      100                  200         1          125          1000
      100                  200         25         500          4000

      [Figure: (a) synchronous clocking, (b) source-synchronous clocking, (c) asynchronous clocking, repeated from the synchronization schemes slide]

  32. Flow of Synch. Clocking Schemes
      • Example: BW: 12 GB/s, Areg: 81 mm^2, nc: 4, fclus: 1000 MHz
      • Flow steps: custom placement on bottom/top dies; CTS, CTO and route on bottom die; CTS on top die; extract delay
      • Delay to balance the clock insertion delays across dies: 0.307 - 0.125 = 0.182 ns (bc); 0.618 - 0.247 = 0.371 ns (wc)
      • Input delay to prevent unnecessary hold buffer insertions: 0.307 + 0.089 - 0.182 = 0.214 ns
      [Figure: bottom and top dies with cluster buffers, annotated with delays 0.307 ns (bc) / 0.618 ns (wc), 0.089 ns (bc) / 0.200 ns (wc), 0.147 ns (bc) / 0.306 ns (wc) and 0.125 ns (bc) / 0.247 ns (wc)]
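
The delay arithmetic on this slide can be re-checked with integer picoseconds, which avoids floating-point rounding. The function and argument names are an interpretation of what each number represents (clock insertion delay on each die, and a data path delay on the bottom die); the deck itself only gives the numbers.

```python
def balance_delay_ps(t_clk_bottom_ps, t_clk_top_ps):
    """Delay that equalizes the clock insertion delays across the two dies."""
    return t_clk_bottom_ps - t_clk_top_ps

def hold_safe_input_delay_ps(t_clk_bottom_ps, t_data_ps, t_balance_ps):
    """Input delay annotated to avoid unnecessary hold buffer insertions."""
    return t_clk_bottom_ps + t_data_ps - t_balance_ps

balance_bc = balance_delay_ps(307, 125)      # best corner
balance_wc = balance_delay_ps(618, 247)      # worst corner
input_delay = hold_safe_input_delay_ps(307, 89, balance_bc)
print(balance_bc, balance_wc, input_delay)
```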

  33. Flow of Synch. Clocking Schemes
      • Steps: (1) CTS on top die, (2) extract balance delay, (3) CTO and route on top die, (4) STA
      • Step 3: run CTO and route at the worst corner, considering hold time and clock uncertainty
      • Step 4 (STA):
        • Setup: 0.5 (half cycle) + 0.802 (tclk) - 0.075 (tunc) - 0.008 (ts) - 1.195 (tdata) = 0.024 ns
        • Hold: 0.683 (tdata) - 0.576 (tclk) - 0.060 (tunc) - 0.030 (th) = 0.017 ns
      [Figure: bottom and top dies with cluster buffers, annotated with delays including 0.214 ns, 0.247 ns (wc), 0.089 ns (bc), 0.307 ns (bc), 0.140 ns (wc), 0.200 ns (wc), 0.618 ns (wc), 0.071 ns (bc), 0.182 ns (bc), 0.371 ns (wc), 0.147 ns (bc), 0.125 ns (bc) and 0.306 ns (wc)]

  34. Flow of Source-synch. Clocking Schemes
      • Example: BW: 12 GB/s, Areg: 81 mm^2, nc: 4, fclus: 1000 MHz
      • Flow steps: custom placement on bottom/top dies; CTS, CTO and route on bottom die; CTS on top die; extract delay
      • Balance the delay from clock source to data 3DIO and the delay from clock source to clock 3DIO: 0.618 + 0.200 = 0.818 ns
      • Annotate balancing delay: 0.247 ns
      [Figure: bottom and top dies with cluster buffers, annotated with delays 0.200 ns, 0.618 ns, 0.247 ns and 0.306 ns]

  35. Flow of Source-synch. Clocking Schemes
      • Steps: (1) CTS on top die, (2) extract balance delay, (3) CTO and route on top die, (4) STA
      • Step 3: run CTO and route at the worst corner, considering hold time and clock uncertainty
      • Step 4 (STA):
        • Setup: 0.5 (half cycle) + 1.371 (tclk) - 0.075 (tunc) - 0.008 (ts) - 1.471 (tdata) = 0.317 ns
        • Hold: 1.471 (tdata) - 1.371 (tclk) - 0.060 (tunc) - 0.030 (th) = 0.010 ns
      [Figure: bottom and top dies with cluster buffers and balancing delays, annotated with delays 0.247 ns, 0.200 ns, 0.618 ns, 0.100 ns, 0.818 ns and 0.306 ns]
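
The setup and hold sums in the STA steps of the two worked flows (synchronous and source-synchronous) can be re-checked as below. Values are the slide numbers converted to integer picoseconds; the function and argument names are illustrative, following the slide labels tclk, tunc, ts, th and tdata.

```python
def setup_slack_ps(half_cycle, t_clk, t_unc, t_setup, t_data):
    """Half-cycle setup check: capture edge plus clock delay, minus uncertainty,
    setup time and data arrival."""
    return half_cycle + t_clk - t_unc - t_setup - t_data

def hold_slack_ps(t_data, t_clk, t_unc, t_hold):
    """Hold check: data arrival minus clock delay, uncertainty and hold time."""
    return t_data - t_clk - t_unc - t_hold

sync = {
    "setup": setup_slack_ps(500, 802, 75, 8, 1195),
    "hold": hold_slack_ps(683, 576, 60, 30),
}
source_sync = {
    "balanced_delay": 618 + 200,  # clock source to data 3DIO
    "setup": setup_slack_ps(500, 1371, 75, 8, 1471),
    "hold": hold_slack_ps(1471, 1371, 60, 30),
}
print(sync, source_sync)
```

Both flows close setup and hold with positive slack in these examples, with the source-synchronous case showing the larger setup margin.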
