Clock Clustering and IO Optimization for 3D Integration
This research focuses on clocking options, clustering, and synchronization schemes for 3D integration, addressing challenges related to area, power, and timing. Key choices for optimizing clocking and 3D interconnects are discussed, along with the advantages and disadvantages of different clocking schemes. Various synchronization schemes for 3DIO clocking, including synchronous, source-synchronous, and asynchronous approaches, are explored.
Clock Clustering and IO Optimization for 3D Integration
Samyoung Bang*, Kwangsoo Han, Andrew B. Kahng and Vaishnav Srinivas
ECE and CSE Departments, UC San Diego, La Jolla, CA 92093
*Samsung Electronics Co. Ltd, Hwaseong-si, South Korea
eva.bang@samsung.com, {kwhan, abk, vaishnav}@ucsd.edu
Outline
- Motivation
- Power, Area and Timing Model
- P&R and Timing Flow
- Experimental Results
- Conclusion
Motivation
- For 3D integration with large bandwidth needs between dies, the choice of clocking options needs to be made upfront
  - The tradeoff between area and power must be evaluated upfront
  - The choice affects floorplanning
[Figure: 3DIOs with PLLs, serializers and deserializers]
Key Choices for Clocking Options
- Local clustering: partition a given region into sub-regions
- Clock synchronization scheme
  - Synchronous
  - Source-synchronous
  - Asynchronous
- 3DIO frequency
- Number of 3DIOs
- To enable design space pathfinding/exploration:
  - Power/area/timing model based on total bandwidth, clustering, synchronization scheme and 3DIO frequency
  - Combine clocking and 3DIO power/area/timing
3DIO Clustering
- Localize the clock tree of the 3D interconnect
- Advantages as the number of clusters increases:
  - Smaller cluster clock tree (smaller skew and jitter)
  - Shorter data paths to the 3DIO array at the center of each cluster
  - Enables efficient clocking schemes (forwarded clock, asynchronous)
- Disadvantages as the number of clusters increases:
  - Overhead to synchronize between clusters on the top die
  - Overhead in cluster clock 3DIO per cluster
[Figure: layout of the bottom die, showing clusters, the 3DIO array, the clock entry point and a data path]
Synchronization Schemes for 3DIO Clocking
- Synchronous
  - Cluster clock tree is balanced to all F/Fs on both the bottom and the top die
  - Simplest clocking scheme (similar to on-die)
  - Vulnerable to inter-die process/voltage variation (large skews)
- Source-synchronous
  - Forwarded clock from one die to another
  - No skew balancing needed across the two dies
  - Requires balance delays (Tb) within each die on the data path to match the clock insertion delay
- Asynchronous
  - Separate clocks on each die
  - FIFO to help clock domain crossing
  - Needs far fewer 3DIOs because embedded-clock and CDR techniques achieve higher speeds
[Figure: (a) synchronous clocking, (b) source-synchronous clocking, (c) asynchronous clocking. Panels show launch and capture FFs on the top and bottom dies, DDR muxes, Tclk0, Tclk1, Tdata, Thold_fix, the balance delay (Tb), the forwarded clock, serializers and deserializers, the cluster clock, the IO clock and the recovered IO clock]
Our Work
- Given the choices of clock synchronization scheme, number of clusters and 3DIO frequency, find the maximum bandwidth for the 3D interconnect under maximum power and area constraints
[Figure: four plots over the max power constraint and max area constraint axes: max achievable BW; optimal clocking scheme for max BW (synch., source-synch., asynch.); optimal number of clusters for max BW; optimal clocking frequency for max BW]
3DIO/CTS Directed Graph
- Primary inputs are indicated by circles
- Rectangles are determined by the primary inputs
- Solid and dotted arrows indicate positive and negative correlation, respectively
- Rounded rectangles are estimated as analytic expressions
[Figure: directed graph. Nodes: IO freq., clocking scheme, #clusters, freq., region area, BW, per-IO power/area, jitter, max skew/transition, skew outcome and clock insertion delay, # 3DIO, # FFs, clock WL, data WL, clock buf. area, data buf. area, area, power, WNS. Legend: increase, decrease, input, deterministic, estimated, est. outcome]
Clock Wirelength
- Hierarchical approach to estimate clock wirelength
- Assume the clock tree is well balanced because FFs are uniformly distributed over the region area
- The length of a Steiner minimal tree over N points uniformly distributed within a given region A_reg is proportional to $\sqrt{A_{reg} \cdot N}$
- Total clock wirelength is
  $W_{clk} = \sum_{i=0}^{depth_{max}} w_i = k_g \sqrt{N_C \cdot A_{reg}} + k_C \sqrt{N_{ff} \cdot A_{reg}}$
Notation:
  i          Depth of clock tree (i = 0 for the clock source)
  N_C        Number of clusters
  N_ff       Total number of flip-flops
  k_C, k_g   Fitted coefficients
[Figure: global clock tree and cluster clock trees, with wirelength w_i at depth i (i = 0, 1, 2), down to the FFs]
Clock Buffer Area
- Tellez and Sarrafzadeh propose a method to insert the minimum number of buffers under a given transition time constraint (T_max_tran)
- Linearize the problem by using the concept of maximum capacitance (C_max)
  - Any buffer stage i with stage capacitance C_i < C_max will have T_i_tran < T_max_tran
- Using C_max, we estimate the number of clock buffers (N_cbuf):
  $N_{cbuf} \cdot C_{max} = W_{clk} C_0 + N_{ff} C_{ff} + N_{cbuf} C^{g}_{buf}$
  $N_{cbuf} = \frac{W_{clk} C_0 + N_{ff} C_{ff}}{C_{max} - C^{g}_{buf}}$
- Kashyap et al. discuss transition time degradation, and C_max can be expressed as
  $C_{max} = k_2 \sqrt{(T_{max\_tran})^2 - (T_0)^2}$
- Total clock buffer area is then expressed with fitted coefficients k_1 and k_2 in terms of $W_{clk} C_0 + N_{ff} C_{ff}$, $C^{g}_{buf}$ and $\sqrt{(T_{max\_tran})^2 - (T_0)^2}$
[Figure: buffer stage driving a wire of max length W_max, with input transition T_0 and output transition T_max_tran]
Data Wirelength and Data Buffer Area
- Data path wirelength is proportional to the number of data wires and the cluster dimension:
  $W_{data} = k_0 \cdot N_{ff} \cdot \sqrt{A_{reg} / N_C}$
- A distribution exists based on sink placement with respect to the 3DIO
- For data buffer area, we use a similar concept to clock buffer area estimation
  - Need to consider each data path separately; cannot use total wirelength
  - Need the minimum number of data buffers to meet hold timing
[Figure: a cluster showing a sink and the 3DIO]
3DIO/Overall Power and Area
- 3DIO power and area models are based on CACTI-IO:
  $P_{IO} = N_{IO} (P_{dyn} + P_{term} + P_{static})$
  $A_{IO} = N_{IO} \left( A_0 + \frac{k_0}{\min(R_{ON}, 2 R_{TT1})} + \frac{1}{R_{ON}} (k_1 f + k_2 f^2 + k_3 f^3) \right)$
- Overall (3DIO + clocking) area is
  $A_{total} = k_1 W_{clk} + k_2 W_{data} + A_{clk} + A_{data} + A_{IO}$
- Overall (3DIO + clocking) power P_total is the sum of switching power, internal and leakage power, and IO power (P_IO), with fitted coefficients k_3 to k_7 applied to W_clk, W_data, A_clk, A_data and the clock frequency F_clk
3D P&R Flow - Synchronous
- Synthesize the cluster clock tree on the top die first, to balance the clock tree on both dies
  - Extract the maximum clock insertion delay (Tclk1)
  - Propagate the data path delay (Tdata) for the routing on the top die
[Flow diagram, synchronous scheme: gate-level netlist and SDC file generation; custom placement; CTS (top die); extract max Tclk1; CTS and Route (bottom die); extract Tdata; Route (top die); report power/area/timing]
[Figure: layouts of the bottom die and the top die, showing the propagated clock, 3DIO, Tdata and Tclk1]
3D P&R Flow - Source-synchronous and Asynchronous
- Source-synchronous
  - Synthesize the clock tree and route on the bottom die, and separately synthesize the clock tree only for the top die
  - Extract the balance delay Tb (i.e., Tclk1) for each capture FF and annotate the delays to the corresponding data 3DIOs
- Asynchronous
  - Run the traditional 2D flow on both dies separately
[Flow diagram, source-synchronous scheme: gate-level netlist and SDC file generation; custom placement; CTS and Route (bottom die); CTS (top die); extract Tb; Route (top die); report power/area/timing]
[Figure: layouts of the bottom die and the top die, showing the propagated clock, 3DIO, Tclk1 and the annotated Tb (i.e., Tclk1)]
Conventional 2D STA vs. Our 3D STA
- We focus on inter-die variation and do not consider intra-die variation, which can be comprehended by timing derates or OCV
- Two process corners {BC, WC} for inter-die variation
  - Assign the same corner to all paths on the same die
  - Report the worst setup WNS out of the four combinations of corners (BC-BC, BC-WC, WC-BC, WC-WC)
- Our 3D STA, setup:
  slack1 = Tper - Tsu - T{c2q, BC} - T{data1, BC} - T{data2, BC} + (T{capture, BC} - T{launch, BC})
  slack2 = Tper - Tsu - T{c2q, BC} - T{data1, BC} - T{data2, WC} + (T{capture, WC} - T{launch, BC})
  slack3 = Tper - Tsu - T{c2q, WC} - T{data1, WC} - T{data2, BC} + (T{capture, BC} - T{launch, WC})
  slack4 = Tper - Tsu - T{c2q, WC} - T{data1, WC} - T{data2, WC} + (T{capture, WC} - T{launch, WC})
  slack = min(slack1, slack2, slack3, slack4)
- Conventional 2D STA (without inter-die variation), setup:
  slack = Tper - Tsu - T{c2q, WC} - T{data1, WC} - T{data2, WC} + (T{capture, BC} - T{launch, WC})
[Figure: die-crossing timing path with an FF and buffer on the bottom die and an FF and buffer on the top die, annotated with Tlaunch, Tc2q, Tdata1, Tdata2 and Tcapture]
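The four-corner setup check above can be sketched in Python. This is a minimal illustration, not the authors' PrimeTime flow: the function names, the dictionary layout and every delay value are hypothetical. It assumes the launch clock, clock-to-Q and Tdata1 delays belong to the launch die, while Tdata2 and the capture clock belong to the capture die, as the corner assignments in slack1 to slack4 suggest.

```python
from itertools import product

CORNERS = ("BC", "WC")

def setup_slack(t_per, t_su, launch_die, capture_die):
    """Setup slack of one die-crossing path for one corner pair.

    launch_die holds c2q, data1 and launch delays at the launch die's corner;
    capture_die holds data2 and capture delays at the capture die's corner.
    """
    return (t_per - t_su
            - launch_die["c2q"] - launch_die["data1"] - capture_die["data2"]
            + (capture_die["capture"] - launch_die["launch"]))

def worst_setup_slack(t_per, t_su, launch_delays, capture_delays):
    """Evaluate all four corner pairs and return (worst slack, corner pair)."""
    slacks = {
        (lc, cc): setup_slack(t_per, t_su, launch_delays[lc], capture_delays[cc])
        for lc, cc in product(CORNERS, repeat=2)
    }
    worst_pair = min(slacks, key=slacks.get)
    return slacks[worst_pair], worst_pair

# Hypothetical delays in ns (not taken from the presentation).
launch_delays = {
    "BC": {"c2q": 0.10, "data1": 0.20, "launch": 0.30},
    "WC": {"c2q": 0.20, "data1": 0.40, "launch": 0.60},
}
capture_delays = {
    "BC": {"data2": 0.15, "capture": 0.30},
    "WC": {"data2": 0.30, "capture": 0.60},
}

slack, pair = worst_setup_slack(2.0, 0.05, launch_delays, capture_delays)
print(f"worst setup slack {slack:.3f} ns at launch/capture corners {pair}")
```

With these made-up numbers the limiting combination is a slow launch die paired with a fast capture die, a case that a single-corner analysis of either die alone would not enumerate.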
Experimental Setup
- P&R tool: Synopsys IC Compiler I-2013.12-SP1
- Timing analysis tool: Synopsys PrimeTime H-2013.06-SP2
- We use a 65nm TSMC library
- Design of experiments
  - Bandwidth (10-200 GB/s)
  - Region area (25-100 mm2)
  - 3DIO clock frequency
    - Synchronous (100-2000 MHz)
    - Source-synchronous (1500-4000 MHz)
    - Asynchronous (3500-8000 MHz)
  - Number of clusters (1-25)
- We select four data points for each parameter, giving 256 design implementations for each clocking scheme
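The count of 256 follows from four parameters with four levels each (4^4). A small sketch of such a full-factorial grid for one clocking scheme is below. The ranges are the ones quoted on the slide, with the synchronous frequency range chosen for illustration; the evenly spaced levels and the helper `four_levels` are assumptions, since the transcript does not list the actual data points.

```python
from itertools import product

def four_levels(lo, hi):
    """Four evenly spaced values from lo to hi (an assumed spacing)."""
    step = (hi - lo) / 3
    return [lo + i * step for i in range(4)]

# Ranges quoted on the slide; the synchronous frequency range is used here.
bandwidth_gbps = four_levels(10, 200)
region_area_mm2 = four_levels(25, 100)
io_freq_mhz = four_levels(100, 2000)
num_clusters = [round(n) for n in four_levels(1, 25)]

# Full-factorial design: every combination of the four parameters.
design_points = list(product(bandwidth_gbps, region_area_mm2,
                             io_freq_mhz, num_clusters))
print(len(design_points))
```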
Model Fitting Approach
- We use an Artificial Neural Network (ANN) model for our fit, guided by the directed graph
- Iteratively progress through the directed graph to fit each node
  - Clock wirelength
  - Data wirelength
  - Clock buffer area
  - Data buffer area
  - 3DIO power/area
  - Total area/power/WNS
- We use Fmax for the timing model (instead of WNS)
- Multiple runs with different training, validation and test data sets
  - Improved generality and robustness of the resulting models
[Figure: the 3DIO/CTS directed graph]
Area, Power and Timing Model Results
- Min-max error within +/-20%
  - For the synchronous scheme, the tool inserts a large number of hold buffers due to inter-die variation, which leads to larger error
- Mean error within +/-0.5%
[Figure: model error plots for area, power and Fmax]
Design Space Results
- Max BW:
  - Figure shows the iso-bandwidth curves
  - Vertical and horizontal walls show the min power/area required to hit a bandwidth requirement
- Clocking scheme:
  - The asynchronous scheme is area-efficient
  - The synchronous scheme is power-efficient
  - The source-synchronous scheme provides a valuable tradeoff between power and area along the knee of the iso-bandwidth curve
  - The interesting tradeoffs between the schemes occur along these knee points as we change the power/area constraints
[Figure: max BW; optimal clocking scheme for max BW]
Design Space Results
- Cluster clock frequency:
  - As the power constraint gets tighter, frequency goes down
  - As the area constraint gets tighter, frequency goes up
  - Source-synchronous schemes provide benefits at higher cluster frequencies
  - The asynchronous scheme provides a way to keep the cluster frequency down but still have a high 3DIO frequency, through serialization
- Number of clusters:
  - Not monotonic along edges of the hypercube and clocking scheme boundaries
  - Also sensitive to the total region area
[Figure: optimal cluster frequency for max BW; optimal # of clusters for max BW]
Conclusion
- We have developed a power, area and timing model for 3DIO and CTS that includes clustering and three different clock synchronization schemes (synchronous, source-synchronous, asynchronous)
- Our model estimates power, area and timing within 20% error across a large range of bandwidths, region areas, numbers of clusters and 3DIO frequencies
- Our modeling methodology will enable architects to study and optimize the design space upfront
- Key takeaways:
  - Iso-bandwidth lines identify the min area/power required to hit a particular BW
  - Clocking scheme tradeoffs are interesting along the knee of iso-bandwidth lines
  - Cluster frequency for asynchronous schemes can be kept low while still reducing the number of 3DIOs due to serialization
Future Work
- Extend our model to be aware of
  - Placement uniformity
  - Technology dependence
  - Datapath logic
- More comprehensive STA including intra-die variation
- Blockages
- Asymmetric clustering
- Different 3DIO placement
- Serial 3DIO circuit options for the asynchronous scheme
- 2.5D (interposer-based) design
Synchronous
- All end points on both dies are synchronized
  - Colored FFs are uniformly distributed over the region
  - Non-colored FFs are placed right next to the 3DIO array
- Clock tree is vulnerable to inter-die variation
- Use DDR to minimize the number of 3DIOs
- Two factors determine the max 3DIO clock frequency (FIO)
  - Clock skew due to inter-die variation
  - Jitter
- Increasing the number of clusters increases max FIO because the clock tree becomes more robust to inter-die variation

Region Area (mm2)  BW (GB/s)  Ncluster  Fmax (MHz)  FIO (MHz)
25                 12         1         640         1280
25                 11.25      25        900         1800
100                12         1         300         600
100                11.25      25        600         1200
25                 50.025     1         460         920
25                 50.625     25        900         1800
100                200.25     1         300         600
100                202.5      25        600         1200
Source-Synchronous
- Forward the clock from one die to the other
- For any path across dies, the launch and capture path delays from the source to the 3DIO at the bottom die are balanced, so there is no inter-die variation
- Requires a balance delay Tb to compensate for the clock insertion delay Tclk1
- Two factors determine the max 3DIO clock frequency (FIO)
  - Skew between Tb and Tclk1 due to intra-die variation
  - Jitter

Region Area (mm2)  BW (GB/s)  Ncluster  Fmax (MHz)  FIO (MHz)
25                 12.095     1         820         1640
25                 10.625     25        1700        3400
25                 50.02      1         820         1640
25                 46.875     25        1500        3000
100                12         1         500         1000
100                15         25        1200        2400
100                200        1         350         700
100                195        25        1200        2400
Asynchronous
- Use a FIFO (1:8 serializer, 8:1 deserializer) to separate the clock domains
  - No inter-die variation
  - Minimizes the number of 3DIOs
- Requires a PLL for the cluster clock on the top die and for the IO clock on both dies
  - Large power overhead
- One factor determines the max 3DIO clock frequency (FIO)
  - Jitter

Region Area (mm2)  BW (GB/s)  Ncluster  Fmax (MHz)  FIO (MHz)
25                 11.9       1         700         5600
25                 25         25        1000        8000
25                 49.7       1         700         5600
25                 40         25        800         6400
100                12         1         400         3200
100                20         25        800         6400
100                200        1         125         1000
100                200        25        500         4000

[Figure: (a) synchronous clocking, (b) source-synchronous clocking, (c) asynchronous clocking]
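In the three result tables, FIO is a fixed multiple of Fmax: 2x in every synchronous and source-synchronous row, and 8x in every asynchronous row, which matches the 1:8 serializer. The sketch below encodes that observation. The ratio table and the function name are illustrative, and reading the 2x factor as a consequence of DDR signaling is an assumption rather than something the slides state.

```python
# Ratio of 3DIO clock frequency (FIO) to cluster clock frequency (Fmax),
# as read off the result tables; treat these as observed, not specified.
IO_TO_CLUSTER_RATIO = {
    "synchronous": 2,
    "source-synchronous": 2,
    "asynchronous": 8,  # 1:8 serializer / 8:1 deserializer
}

def io_frequency_mhz(scheme, f_cluster_mhz):
    """Return the 3DIO frequency implied by a cluster frequency."""
    return IO_TO_CLUSTER_RATIO[scheme] * f_cluster_mhz

# First row of each table: (scheme, Fmax in MHz, FIO in MHz).
table_rows = [
    ("synchronous", 640, 1280),
    ("source-synchronous", 820, 1640),
    ("asynchronous", 700, 5600),
]
for scheme, fmax, fio in table_rows:
    print(scheme, io_frequency_mhz(scheme, fmax), fio)
```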
Flow of Synch. Clocking Schemes
Example: BW: 12GB/s, Areg: 81mm2, nc: 4, fclus: 1000MHz
- Custom placement on bottom/top dies
- CTS, CTO and Route on bottom die
- Step 1: CTS on top die
- Step 2: Extract delay
  - Delay to balance the clock insertion delays across dies: 0.307 - 0.125 = 0.182ns (bc), 0.618 - 0.247 = 0.371ns (wc)
  - Input delay to prevent unnecessary hold buffer insertions: 0.307 + 0.089 - 0.182 = 0.214ns
[Figure: bottom and top dies with the cluster buffer. Annotated delays: 0.307ns (bc) / 0.618ns (wc), 0.089ns (bc) / 0.200ns (wc), 0.147ns (bc) / 0.306ns (wc), 0.125ns (bc) / 0.247ns (wc)]
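The delay arithmetic on this slide can be reproduced with a few lines of Python. The function and argument names are hypothetical, and the mapping of numbers to roles is an interpretation of the slide: 0.307/0.618 ns is taken as the bottom-die clock insertion delay, 0.125/0.247 ns as the top-die clock insertion delay, and 0.089 ns as the bottom-die data delay.

```python
def balance_delay(t_clk_bottom, t_clk_top):
    """Delay that equalizes clock insertion delay across the two dies (ns)."""
    return t_clk_bottom - t_clk_top

def input_delay(t_clk_bottom, t_data_bottom, t_balance):
    """Input delay used to avoid unnecessary hold buffer insertion (ns)."""
    return t_clk_bottom + t_data_bottom - t_balance

# Values quoted on the slide, in ns.
bal_bc = balance_delay(0.307, 0.125)       # best corner
bal_wc = balance_delay(0.618, 0.247)       # worst corner
in_bc = input_delay(0.307, 0.089, bal_bc)  # best corner

print(f"balance delay: {bal_bc:.3f} ns (bc), {bal_wc:.3f} ns (wc)")
print(f"input delay:   {in_bc:.3f} ns")
```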
Flow of Synch. Clocking Schemes
- Step 1: CTS on top die
- Step 2: Extract balance delay
- Step 3: CTO and Route on top die
  - Run CTO and route at the worst corner, considering hold time and clock uncertainty
- Step 4: STA
  - Setup: 0.5 (half cycle) + 0.802 (tclk) - 0.075 (tunc) - 0.008 (ts) - 1.195 (tdata) = 0.024ns
  - Hold: 0.683 (tdata) - 0.576 (tclk) - 0.060 (tunc) - 0.030 (th) = 0.017ns
[Figure: bottom and top dies with the cluster buffer at steps 3 and 4. Annotated delays: 0.214ns, 0.247ns (wc), 0.089ns (bc), 0.307ns (bc), 0.140ns (wc), 0.200ns (wc), 0.618ns (wc), 0.071ns (bc), 0.182ns (bc), 0.147ns (bc), 0.371ns (wc), 0.125ns (bc), 0.306ns (wc)]
Flow of Source synch. Clocking Schemes
Example: BW: 12GB/s, Areg: 81mm2, nc: 4, fclus: 1000MHz
- Custom placement on bottom/top dies
- CTS, CTO and Route on bottom die
- Step 1: CTS on top die
- Step 2: Extract delay and annotate the balancing delay
  - Balance the delay from the clock source to the data 3DIO and the delay from the clock source to the clock 3DIO: 0.618 + 0.200 = 0.818ns
[Figure: bottom and top dies with the cluster buffer. Annotated delays: 0.200ns, 0.618ns, 0.247ns, 0.306ns]
Flow of Source synch. Clocking Schemes
- Step 1: CTS on top die
- Step 2: Extract balance delay
- Step 3: CTO and Route on top die
  - Run CTO and route at the worst corner, considering hold time and clock uncertainty
- Step 4: STA
  - Setup: 0.5 (half cycle) + 1.371 (tclk) - 0.075 (tunc) - 0.008 (ts) - 1.471 (tdata) = 0.317ns
  - Hold: 1.471 (tdata) - 1.371 (tclk) - 0.060 (tunc) - 0.030 (th) = 0.010ns
[Figure: bottom and top dies with the cluster buffer and balancing delay at steps 3 and 4. Annotated delays: 0.247ns, 0.200ns, 0.618ns, 0.100ns, 0.818ns, 0.306ns]
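The setup and hold checks quoted in the two flow examples follow one pattern, sketched below. The function names are hypothetical; the numbers are the ones quoted on the synchronous and source-synchronous STA slides (in ns), and the sketch only reproduces the arithmetic, not the STA tool's path analysis.

```python
def setup_slack(half_cycle, t_clk, t_unc, t_setup, t_data):
    """Half-cycle setup slack: capture edge plus clock delay minus data arrival."""
    return half_cycle + t_clk - t_unc - t_setup - t_data

def hold_slack(t_data, t_clk, t_unc, t_hold):
    """Hold slack: data arrival minus clock delay, uncertainty and hold time."""
    return t_data - t_clk - t_unc - t_hold

# Synchronous example.
sync_setup = setup_slack(0.5, 0.802, 0.075, 0.008, 1.195)
sync_hold = hold_slack(0.683, 0.576, 0.060, 0.030)

# Source-synchronous example.
ss_setup = setup_slack(0.5, 1.371, 0.075, 0.008, 1.471)
ss_hold = hold_slack(1.471, 1.371, 0.060, 0.030)

print(f"synchronous:        setup {sync_setup:.3f} ns, hold {sync_hold:.3f} ns")
print(f"source-synchronous: setup {ss_setup:.3f} ns, hold {ss_hold:.3f} ns")
```

All four slacks come out positive, consistent with the values on the slides (0.024/0.017 ns and 0.317/0.010 ns).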