

### Technion – Israel Institute of Technology



IBM HRL July 2009

# Outline

Interconnect as a basic system complexity problem

Physical problem:

The poor scalability of metal wires

- Ways to address the problem:
  - Physical design of circuits and layout
  - System architecture directions



### Technion's VLSI architecture research team:

- PIs
  - Israel Cidon (Networking)
  - Ran Ginosar (VLSI)
  - Idit Keidar (Dist. Systems)
  - Isaac Keslassy (Networking)
  - Avinoam Kolodny (VLSI)
  - Uri Weiser (Architecture)
- Collaborators
  - Ronny Ronen
  - Avi Mendelson
  - David Goren
  - Israel Wagner
  - Eby Friedman
  - Shmuel Wimer

- Students
  - Evgeny Bolotin
  - Reuven Dobkin
  - Zvika Guz
  - Ran Manevich
  - Kostya Moiseev
  - Arkadiy Morgenshtein
  - Zigi Walter
  - Anastasia Barger
  - Shay Michaely
  - Nir Magen
  - Michael Moreinis





## **Connectivity and Complexity**



### Interconnect: The old hidden problem



#### Intel 8088 processor: single metal layer

#### Perceived Solution: Systolic arrays



#### The impact of a second metal layer







### Adding more metal layers...



- Delay
- Power
- Noise
- Reliability
- Cost







# Principles for dealing with complexity

- Abstraction
- Hierarchy
- Regularity
- Design Methodology









## **Evolution of interconnect models**













The idea:

- Shrink lateral dimensions to save area
- Keep vertical dimensions to avoid very high resistance
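The trade-off can be made explicit with the resistance per unit length of a rectangular wire. The worked scaling below is a first-order sketch; the symbols ($\rho$ for resistivity, $W$ for width, $T$ for thickness, $s > 1$ for the shrink factor) are introduced here for illustration and do not appear on the slide.

$$R' = \frac{\rho}{W\,T} \quad\Longrightarrow\quad \begin{cases} W \to W/s,\ T \to T/s: & R' \to s^{2}\,R' \\ W \to W/s,\ T \text{ unchanged}: & R' \to s\,R' \end{cases}$$

Keeping $T$ fixed limits the resistance growth to a factor of $s$ per generation instead of $s^{2}$, at the price of a wire aspect ratio $T/W$ that grows by $s$.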











Figure 3 Calculated Gate and Interconnect Delay versus Technology Generation

Calculated gate and interconnect delay versus technology generation, illustrating the dominance of interconnect delay over gate delay for aluminum metallization and silicon dioxide dielectrics as feature sizes approach 100 nm. Also shown is the decrease in interconnect delay and improved overall performance expected for copper and low-$\kappa$ dielectric insulators.<sup>1</sup>

## Local wires and Global wires

#### Local wire:

- Shrinks in length just like everything else
- While transistors become <u>faster</u>, local wire delay remains <u>unchanged</u> (by simple scaling theory)

#### Global wire:

- Goes across the whole chip: does not scale!
- Reflects new complexity added to the system!
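The contrast between local and global wires can be checked with a few lines of arithmetic. The sketch below assumes ideal scaling by a factor `s` (all wire cross-section dimensions shrink together), a distributed-RC delay estimate of `0.4 * r * c * length**2`, and hypothetical starting values; none of the numbers come from the slides.

```python
def wire_delay(r_per_mm, c_per_mm, length_mm):
    """Distributed-RC delay estimate: 0.4 * r * c * l^2 (seconds)."""
    return 0.4 * r_per_mm * c_per_mm * length_mm ** 2


def scale_wire(r_per_mm, c_per_mm, s):
    """Ideal scaling by s > 1: the cross-section shrinks by s^2, so the
    resistance per unit length grows by s^2. Capacitance per unit length
    is unchanged to first order because all dimensions shrink together."""
    return r_per_mm * s ** 2, c_per_mm


# Hypothetical starting point (illustrative values only).
r0, c0 = 100.0, 0.2e-12              # ohm/mm, F/mm
local_len, global_len = 0.1, 10.0    # mm
s = 1.4                              # roughly one technology generation

r1, c1 = scale_wire(r0, c0, s)

local_before = wire_delay(r0, c0, local_len)
local_after = wire_delay(r1, c1, local_len / s)   # local wire shrinks with the logic
global_before = wire_delay(r0, c0, global_len)
global_after = wire_delay(r1, c1, global_len)     # global wire still spans the chip

print(f"local  wire delay ratio: {local_after / local_before:.2f}")
print(f"global wire delay ratio: {global_after / global_before:.2f}")
```

Under these assumptions the local ratio stays at 1 while the global ratio grows as `s**2`, and it grows faster still if the die itself gets larger.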



## **Bakoglu's solution: Repeaters**

Bakoglu's classical derivation (IEEE Transactions on Electron Devices, vol. ED-32, 1985)
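A minimal sketch of the repeater optimization, assuming the first-order model usually quoted for this derivation: the wire is cut into `k` segments, each driven by an inverter `h` times the minimum size, with a 0.7·RC term for lumped loads and a 0.4·RC term for the distributed wire. The function names and all component values are hypothetical and chosen only for illustration.

```python
import math


def repeated_wire_delay(k, h, r0, c0, r_int, c_int):
    """50% delay of a wire split into k segments, each driven by a repeater
    h times the minimum size (first-order model).

    r0, c0       : output resistance / input capacitance of a minimum inverter
    r_int, c_int : total wire resistance / total wire capacitance
    """
    stage = (0.7 * (r0 / h) * (c_int / k + h * c0)
             + (r_int / k) * (0.4 * c_int / k + 0.7 * h * c0))
    return k * stage


def optimal_repeaters(r0, c0, r_int, c_int):
    """Closed-form optimum from setting d(delay)/dk = d(delay)/dh = 0."""
    k_opt = math.sqrt(0.4 * r_int * c_int / (0.7 * r0 * c0))
    h_opt = math.sqrt(r0 * c_int / (r_int * c0))
    return k_opt, h_opt


# Hypothetical values for a long global wire (illustration only).
R0, C0 = 10e3, 1e-15         # ohm, F
R_INT, C_INT = 1e3, 2e-12    # ohm, F

k_opt, h_opt = optimal_repeaters(R0, C0, R_INT, C_INT)
t_opt = repeated_wire_delay(k_opt, h_opt, R0, C0, R_INT, C_INT)
# k = 1, h = 1 models a single minimum-size driver and no repeaters.
t_single = repeated_wire_delay(1, 1, R0, C0, R_INT, C_INT)

print(f"k = {k_opt:.1f}, h = {h_opt:.0f}, delay = {t_opt * 1e12:.0f} ps "
      f"(single minimum driver: {t_single * 1e12:.0f} ps)")
```

With optimal repeaters the delay grows linearly with wire length instead of quadratically. In a real design `k` must be rounded to an integer, and the area and power of the repeaters would also enter the trade-off.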





#### **The global wire scaling problem**

Gate delay gets better, wire delay gets worse



Figure 54 Delay for Metal 1 and Global Wiring versus Feature Size

#### Delay of global wire is longer than a clock cycle



## Inverse Scaling: "fat wires"

- Thick & wide wires at the top metal layers:
  - Large cross section → low R
  - Large spaces → low C












#### Speed optimization in RLC lines (Ismail & Friedman, ISCAS 99)





## **RLC delay model characteristics**

$$\text{delay} = \sqrt{LC}\left( e^{-2.9\left(\alpha_{asym}\, l\right)^{1.35}}\, l + 0.74\,\alpha_{asym}\, l^{2} \right), \qquad \alpha_{asym} = \frac{R}{2}\sqrt{\frac{C}{L}}$$

L, C and R are per unit length; l denotes wire length

#### Inductive effects:

- Longer delay
- Steeper slope
- Overshoot

\* Eby G. Friedman, Yehea E. Ismail, *On-Chip Inductance in High Speed Integrated Circuits*, 2001
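The two limits of the RLC delay model can be explored numerically. The sketch below assumes the closed form reads delay = sqrt(LC)·(exp(−2.9·(α·l)^1.35)·l + 0.74·α·l²) with α = (R/2)·sqrt(C/L), which is a reading of the slide's formula rather than a verified transcription. The per-millimetre values are hypothetical.

```python
import math


def rlc_delay(r, l_ind, c, length):
    """Delay of an RLC line under the assumed closed form.

    r, l_ind, c : resistance, inductance, capacitance per unit length
    length      : wire length (same length unit)
    """
    alpha = 0.5 * r * math.sqrt(c / l_ind)   # assumed asymptotic attenuation
    x = alpha * length
    return math.sqrt(l_ind * c) * (math.exp(-2.9 * x ** 1.35) * length
                                   + 0.74 * alpha * length ** 2)


# Hypothetical wide top-metal wire (illustration only).
R, L, C = 10.0, 0.5e-9, 0.2e-12    # ohm/mm, H/mm, F/mm

for length in (0.5, 2.0, 10.0, 50.0):            # mm
    delay = rlc_delay(R, L, C, length)
    time_of_flight = math.sqrt(L * C) * length   # lossless (LC) limit
    rc_limit = 0.37 * R * C * length ** 2        # diffusive (RC) limit
    print(f"l = {length:5.1f} mm  delay = {delay * 1e12:8.1f} ps  "
          f"LC limit = {time_of_flight * 1e12:7.1f} ps  "
          f"RC limit = {rc_limit * 1e12:8.1f} ps")
```

For short or low-resistance wires the delay approaches the time of flight and grows linearly with length. For long, lossy wires the exponential term vanishes and the delay approaches the quadratic RC estimate.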






#### Fast wires must use transmission line layout

[Figure: cross-sections of SIGNAL wires over GROUND planes or wires, with dimension labels W, S, d, t, h, tg, wg]

Ground plane and/or wires provide a *current return path*













# The future of interconnect power



#### **Interconnect power grows to 65%-80% within 5 years**

(using optimistic interconnect scaling assumptions for a uniprocessor)

Global interconnect causes significant power dissipation

#### Wires can be blamed for even more power...

Because designers tend to oversize gates!

[Figure: stage, gate and wire delay (ps) versus device size (× L<sub>min</sub>), $W_{n,p} = 1\ \mu m$]









## Data Rate Optimization in an Interconnect Channel



- Making the wires narrow (small W) and dense (small S) fits more of them into the channel
- What will happen to the delay?
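One way to explore the question is a toy model of a fixed-width channel. The sketch below is built entirely on assumptions that are not in the slides: equal width and spacing, parallel-plate capacitance to two neighbours and to planes above and below (no fringing), an unrepeated distributed-RC delay, and one bit per wire per wire delay. All constants are hypothetical.

```python
RHO = 2.2e-8            # ohm*m, assumed effective wire resistivity
EPS = 3.0 * 8.854e-12   # F/m, assumed dielectric permittivity
T = 0.5e-6              # m, wire thickness (assumed)
H = 0.5e-6              # m, dielectric height to planes above and below (assumed)
CHANNEL = 20e-6         # m, total width available to the channel (assumed)
LENGTH = 5e-3           # m, wire length (assumed)


def wire_delay(n_wires):
    """Unrepeated distributed-RC delay of one wire when n_wires share the channel."""
    pitch = CHANNEL / n_wires
    w = s = pitch / 2                     # equal width and spacing
    r = RHO / (w * T)                     # resistance per unit length
    c = EPS * (2 * T / s + 2 * w / H)     # two neighbours + two planes, parallel-plate
    return 0.37 * r * c * LENGTH ** 2


def data_rate(n_wires):
    """Aggregate bits per second, assuming one bit per wire per wire delay."""
    return n_wires / wire_delay(n_wires)


best_n = max(range(1, 101), key=data_rate)
print(f"best wire count: {best_n}, W = S = {CHANNEL / best_n / 2 * 1e6:.2f} um")
```

In this model, narrower and denser wires raise both resistance and coupling capacitance, so the per-wire delay grows roughly with the square of the wire count. The aggregate data rate therefore peaks at an intermediate width and spacing rather than at the minimum pitch.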




### Should all wires be the same? How about optimizing individual widths and spaces?







\* K. Moiseev, S. Wimer and A. Kolodny, "Timing Optimization of Interconnect by Simultaneous Net-Ordering, Wire Sizing and Spacing," *INTEGRATION, 2007*.





\* N. Magen, A. Kolodny, U. Weiser and N. Shamir, "Interconnect-power dissipation in a Microprocessor," SLIP 2004.







# **Breaking The Wall:** Logic Gates as Repeaters - LGR

"Where should the gates be located (along the wire)?"



\* M. Moreinis, A. Morgenshtein, I. Wagner, and A. Kolodny, "Logic Gates as Repeaters (LGR) for Area-Efficient Timing Optimization," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 14, no. 11, pp. 1276-1281, November 2006



# Breaking The Wall: Unified Logical Effort








# **Cross-section of a 3-D Integrated Circuit**

### Plane bonding

- Back to face
- Face to face
- Bonding materials
  - Adhesive polymers
  - Metal pads (e.g., copper)
- Bonding process involves
  - Compression at elevated temperatures
  - Wafer thinning



 \* R. J. Gutmann *et al.*, "Three-dimensional (3D) ICs: A Technology Platform for Integrated Systems and Opportunities for New Polymeric Adhesives," *Proceedings of the Conference on Polymers and Adhesives in Microelectronics and Photonics*, pp. 173-180, October 2001



## Multi-integration of 3-D Systems-on-Chip

- Integration of
  - Circuits from different fabrication processes
  - Non-silicon technologies
  - Non-electrical systems





# **Chip Multi-Processors**

- Uniprocessors cannot provide power-efficient performance growth
  - Interconnect dominates dynamic power
  - Global wire delay doesn't scale
  - Instruction-level parallelism is limited
- Power-efficiency requires many parallel <u>local</u> computations
  - Chip Multi Processors (CMP)

[Figure: "Pollack's rule" versus Thread-Level Parallelism (TLP), plotted against die area (or power); chip examples: Cell Broadband Engine, Teraflops Research Chip]

(F. Pollack, MICRO 32, 1999)
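Pollack's rule states that single-core performance grows roughly as the square root of the area (or complexity) spent on the core. The sketch below compares one large core with several small ones on the same die area. The perfectly parallel workload is an optimistic assumption made for illustration, and the function names are hypothetical.

```python
import math


def core_performance(area):
    """Pollack's rule: single-core performance grows roughly as sqrt(area)."""
    return math.sqrt(area)


def cmp_throughput(total_area, n_cores):
    """Aggregate throughput of n equal cores sharing total_area, assuming
    perfectly parallel work (an optimistic upper bound)."""
    return n_cores * core_performance(total_area / n_cores)


TOTAL_AREA = 16.0    # arbitrary area units
for n in (1, 4, 16):
    print(f"{n:2d} cores: throughput = {cmp_throughput(TOTAL_AREA, n):.1f}")
```

Under these assumptions throughput scales as the square root of the core count for a fixed area, which is the argument for many small cores whenever enough thread-level parallelism is available.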

# Future of VLSI architecture - CMP

(Dally 1999, Horowitz 2001)

- System requirement: <u>Power-efficient performance growth</u>
- Implications:
  - Chip Multi Processors (CMP)
  - Thread-Level Parallelism (TLP)
- Local memories
  - Memory is interconnection in time?
- Explicit communication









- Communication by packets of bits
- Routing of packets through several hops, via switches
- Parallelism
- Efficient sharing of wires
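Multi-hop packet routing can be illustrated with a small example. The sketch below assumes a 2-D mesh of switches and dimension-ordered (XY) routing. Both are common choices but are assumptions here, since the slide does not name a topology or a routing algorithm. The function name is hypothetical.

```python
def xy_route(src, dst):
    """Return the switch coordinates a packet visits on a 2-D mesh,
    moving along X first and then along Y (dimension-ordered routing)."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
print(route)             # [(0, 0), (1, 0), (2, 0), (2, 1)]
print(len(route) - 1)    # 3 hops
```

Each hop uses only a short link between neighbouring switches, and packets from different source and destination pairs can share the same link at different times.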



## Past Examples of Paradigm Shifts in VLSI

## The Microprocessor

- From: Hard-wired state machines
- To: Programmable chips
  - Created a new computer industry

## **Logic Synthesis**

- From: Schematic entry
- To: HDLs and Cell libraries
  - Logic designers became programmers
  - Enabled ASIC industry and Fab-less companies
  - "System-on-Chip"















# 1(b): Wire Design for NoC



- NoC links:
  - Regular
  - Point-to-point (no fanout tree)
  - Can use transmission-line layout
  - Well-defined current return path
- Can be optimized for <u>noise / speed / power</u>
  - Low swing, current mode, ...





## 2: NoC and the Engineering Productivity Problem

- NoC eliminates ad-hoc global wire engineering
- NoC separates computation from communication
  - NoC supports modularity and reuse of cores
- NoC is a platform for system integration, debugging and testing



## 3: NoC and CMP

Network is a natural choice for multiple cores!



- **Montecito** (Intel, 2004)
- **Terascale** (Intel Polaris, 2007)
- **Niagara** (Sun, 2004)
- **Barcelona** (AMD, 2007)



# **Combining 3-D and NoC?**




# **Hybrid Optical 3-D NoC?**





