# **Reconfigurable Computing**

#### **Trends and Exploration**

In wisdom gathered over time I have found that every experience is a form of exploration.

- Ansel Adams

Philip Leong (philip.leong@sydney.edu.au) School of Electrical and Information Engineering

http://phwl.org/talks

Permission to use figures have been gained where possible. Please contact me if you believe anything within infringes on copyright.





- > How do we measure performance?
- > What tools can we use to explore a design space?
- > What is the impact of VLSI technology on FPGA design?
- Technology trends influence architecture. Can we understand how they change with time?
- Case study
  - Matrix multiplication



- > Understand what needs to be optimised (and what doesn't)
- > Tradeoff between speed, area, latency, throughput, energy, cost, accuracy ...
  - Cannot optimise them all, e.g. usually can increase speed if cost unimportant
- > Good design is a tradeoff









Log<sup>2</sup> of the Number of Components

Gordon Moore in 1965
 predicted number of
 transistors in an IC will
 double ≈ two years

THE UNIVERSITY OF

- This has driven the semiconductor industry for many decades
- Made FPGAs practical (first commercial FPGA XC2064 which had 64 CLBs with 2x 3-LUTs per CLB)



#### Moore's Law [1]



- He made the bold claim that 65,000 components could fit on an IC by 1975 (at the time they had 50)!
- > Cartoon is from the same paper



| Year | Feature<br>Size | Xilinx FPGA | family | Device     | LUTs      | DSP/Mult<br>blocks | BRAM<br>Kbits | LUTs<br>/DSP | LUTs<br>/BRAM | Altera FPGA<br>I family |                    | Device      | ALMs<br>(LEs) | DSP<br>/Mult<br>blocks | BRAM<br>Kbits | LEs/<br>DSP | LEs/<br>BRAM |
|------|-----------------|-------------|--------|------------|-----------|--------------------|---------------|--------------|---------------|-------------------------|--------------------|-------------|---------------|------------------------|---------------|-------------|--------------|
| 2011 | 28 nm           |             | V      | XC7V2000T  | 1,221,600 | 2,160              | 46,512        | 566          | 26            |                         |                    |             |               |                        |               |             |              |
|      |                 | Virtex 7    | VX     | XC7VX1140T | 712,000   | 3,600              | 67,680        | 198          | 11            |                         |                    |             |               |                        |               |             |              |
|      |                 |             | VH     | XC/VH8/01  | 547,600   | 2,520              | 50,760        | 217          | 11            |                         | CT                 | FCCTCT      | 622.000       | 512                    | 50.000        | 1 215       | 12           |
|      |                 |             |        |            |           |                    |               |              |               |                         | GI                 | 55GTC7      | 622,000       | 512                    | 50,000        | 1,215       | 12           |
|      |                 |             |        |            |           |                    |               | Stratix V    | GX            | 55GXBB                  | 952,000<br>605,000 | 2 026       | 52,000        | 1,352                  | 18            |             |              |
|      |                 |             |        |            |           |                    |               |              |               |                         | 63<br>E            | 555508      | 952,000       | 3,920                  | 52,000        | 1 252       | 14           |
| 2009 | 40 nm           |             | IX     | XC6VI X760 | 474,240   | 864                | 25,920        | 549          | 18            |                         |                    | 55220       | 552,000       | 704                    | 52,000        | 1,552       | 10           |
|      |                 | Virtex 6    | SX     | XC6VSX475T | 297.600   | 2.016              | 38,304        | 148          | 8             |                         |                    |             |               |                        |               |             |              |
|      |                 |             | HX     | XC6VHX565T | 354,240   | 864                | 32,832        | 410          | 11            |                         |                    |             |               |                        |               |             |              |
| 2008 |                 |             |        |            |           |                    |               |              |               |                         | GT                 | EP4S100G5   | 531,200       | 1,024                  | 27,376        | 519         | 19           |
|      |                 |             |        |            |           |                    |               |              |               | Stratix IV              | GX                 | EP4SGX530   | 531,200       | 1,024                  | 27,376        | 519         | 19           |
|      |                 |             |        |            |           |                    |               |              |               |                         | E                  | EP4SE820    | 813,050       | 960                    | 33,294        | 847         | 24           |
|      | 65 nm           | Virtex 5    | LX     | XC5VLX330  | 207,360   | 192                | 10,368        | 1,080        | 20            |                         |                    |             |               |                        |               |             |              |
| 2006 |                 |             | SX     | XC5VSX240T | 149,760   | 1,056              | 18,576        | 142          | 8             |                         |                    |             |               |                        |               |             |              |
|      |                 |             | FX     | XC5VFX200T | 122,880   | 384                | 16,416        | 320          | 7             |                         |                    |             |               |                        |               |             |              |
|      |                 |             |        |            |           |                    |               |              |               | Stratix III             | L                  | EP3SL340    | 337,500       | 576                    | 16,272        | 586         | 21           |
|      | 00              |             |        |            |           |                    |               |              |               |                         | E                  | EP3SE260    | 255,000       | 768                    | 14,688        | 332         | 17           |
| 2005 | 90 nm           |             |        |            |           |                    |               |              |               | Stratix II              | GX                 | EP25GX130/G | 132,540       | 252                    | 6,747         | 526         | 20           |
|      | 90 nm           |             | 1.2    | XC4VI X200 | 178 176   | 96                 | 6 0/18        | 1 956        | 20            |                         |                    | EP25180     | 179,400       | 384                    | 9,383         | 467         | 19           |
| 2004 |                 | Virtex 4    | SX     | XC4V5X55   | 49 152    | 512                | 5 760         | 1,850        | - 25          |                         |                    |             |               |                        |               |             |              |
| 2004 |                 |             | FX     | XC4VFX140  | 126.336   | 192                | 9,936         | 658          | 13            |                         |                    |             |               |                        |               |             |              |
| 2002 | 130 nm          |             |        |            |           |                    |               |              |               | <i>c</i> , , , ,        | GX                 | EP1SGX40D   | 41,250        | 56                     | 3,423         | 737         | 12           |
|      |                 |             |        |            |           |                    |               |              |               | Stratix                 |                    | EP1S80      | 79,040        | 88                     | 7,428         | 898         | 11           |
|      | 130 nm          | Virtex II   | Pro    | XC2VP100   | 88,192    | 444                | 7,992         | 199          | 11            |                         |                    |             |               |                        |               |             |              |
| 2001 |                 |             | Pro X  | XC2VPX70   | 66,176    | 308                | 5,544         | 215          | 12            |                         |                    |             |               |                        |               |             |              |
|      | 0.15 um         |             | v      | XC2V8000   | 93,184    | 168                | 3,024         | 555          | 31            | Mercury                 |                    | EP1M350     | 14,400        | 0                      | 115           | -           | 125          |
| 2000 | 0.18 um         |             |        |            |           |                    |               |              |               | Excalibur               |                    | EPXA10      | 38,400        | 0                      | 3,146         | -           | 12           |
| 1999 | 0.18 um         | Virtex E    |        | XCV3200E   | 64,896    | 0                  | 851           | -            | 76            |                         |                    |             |               |                        |               |             | 100          |
| 1998 | 0.22 um         | Mintan      |        | XC1/1000   | 24.576    | 0                  | 121           |              | 100           | Flex 10KE               |                    | EPF10K200E  | 9,984         | 0                      | 98            | -           | 102          |
| 1007 | 0.25 um         | Virtex      |        | XCV1000    | 24,576    | 0                  | 131           | -            | 188           |                         |                    |             |               |                        |               |             |              |
| 1997 | 0.35 um         | 4000 E/ AL  |        | AC4065AL   | 12,544    | 0                  | 0             | -            | -             | Fley 10KA               |                    | EPE10K250A  | 12 160        | 0                      | 41            |             | 297          |
| 1995 | 0.42 um         |             |        |            |           |                    |               |              |               | Flex 10KA               | $\vdash$           | EPF10K100   | 4.992         | 0                      | 25            | -           | 200          |
| 1992 | 0.6 um          |             |        |            |           |                    |               |              |               | Flex 8000               |                    | EPF81500A   | 1.296         | 0                      | 0             | -           |              |
| 1991 | 0.8um           | 4000 series |        | XC4025     | 2,048     | 0                  | 0             | -            | -             |                         |                    | 2.1.010004  | 2,200         |                        |               |             |              |
| 1985 | 2 um            | 2000 series |        | XC2018     | 400       | 0                  | 0             | -            | -             |                         |                    |             |               |                        |               |             |              |





> If x is the year and y is the number of transistors on an integrated circuit, give an equation to model Moore's Law.

# $1/\lambda^2$ vs year [2]

 FPGA lambda from previous table plotted vs year

THE UNIVERSITY OF

- Transistor density doubling every two years, in agreement with Moore's Law
- Can use equation to estimate extrapolate





## Design Size (number of LUTs) [2]

- x's are the number of LUTs in the largest
   FPGA of that year
- > o's are FCCM designs
- Tech design size doubles every 2.5 years (slightly slower than Moore's Law)
- Inaccuracies because we don't count clock trees and hard blocks











- >  $T_{\text{execution}} = T_{\text{clk}} * N$ 
  - Where N is the number of clock cycles to complete the task
- Speed S = 1/T<sub>execution</sub>
- > The **speedup** of machine A with execution time  $T_A$  over machine B with execution time  $T_B$ 
  - Speedup = Speed<sub>A</sub>/Speed<sub>B</sub>

 $= T_B/T_A$ 

- > Real-time measures often reflect performance per unit time
  - GOPS (billion operations per second)
  - GFLOPS (billions of floating point operations per second)



- > Gene Amdahl in 1967 gave us a way to think about parallelism
- > If B is the fraction of algorithm which is serial (e.g. I/O), and  $T_p$  is the execution time for  $_p$  processors)

> Speedup = 
$$T_1 / T_p$$
  
=  $\frac{p}{pB + (1 - B)}$ 

> This equation gives us a way to estimate the speedup of a system

#### Amdahl's Law Example



#### Most important issue is I/O (and memory) overhead!



=27.2059



 Dennard in 1974: as transistor feature size (κ or commonly λ) decreases, power stays proportional to area

| Device or Circuit Parameter      | Scaling Factor |
|----------------------------------|----------------|
| Device dimension $t_{ox}$ , L, W | 1/к            |
| Doping concentration $N_a$       | κ              |
| Voltage $V$                      | $1/\kappa$     |
| Current I                        | $1/\kappa$     |
| Capacitance $\epsilon A/t$       | $1/\kappa$     |
| Delay time/circuit $VC/I$        | $1/\kappa$     |
| Power dissipation/circuit $VI$   | $1/\kappa^2$   |
| Power density $VI/A$             | 1              |



#### Clock Frequency (1999-2013) [2]

- Tech freq doubles every 8 years
- Research freq doubles every <u>6</u> years
- Tracking with what might be expected based on technology scaling

Freq (MHz)



#### Frequency of IP Cores [2]

 Device technology trend is black line

THE UNIVERSITY OF

 Designs have trend consistent with technology





#### Breakdown of Dennard's Law

- Clock speeds are not rising according to Dennard's Law as transistors have stopped getting faster
- Voltage essentially stopped shrinking 10 years ago
  - > Thermal noise (kT/q = 25 mV at room temperature)
  - > Subthreshold leakage current
- > Cannot reduce voltage and current so that power density is no longer constant
  - > In fact rising sharply
  - > Designs used to be speed constrained, now they are power constrained
- Cannot turn on all parts of the chip at the same time (% which must be off is called Dark Silicon)

## Power





- > P=CV<sup>2</sup>f and it fundamentally limits performance gains
- > Three main components in an FPGA
  - Static, Dynamic, I/O
- Dennard scaling says halving lambda decreases P by 4 (broken down due to > statis P)





#### Figure 5. Total Power Breakdown Across Various High-End FPGA Customer Designs

Altera White Paper https://www.altera.com/en US/pdfs/literature/wp/wp-01148-stxv-power-consumption.pdf







Table 1. Main Sources of Transistor Leakage

| Main Sources of Leakage                                  | Impact     | Mitigation Techniques                                                                                                                |
|----------------------------------------------------------|------------|--------------------------------------------------------------------------------------------------------------------------------------|
| Subthreshold leakage (I <sub>sub</sub> )                 | Dominant   | <ul> <li>Lower voltage</li> <li>Higher voltage threshold</li> <li>Longer gate length</li> <li>Dopant profile optimization</li> </ul> |
| Gate direct-tunneling leakage (I <sub>G</sub> )          | Dominant   | High-k metal gate (HKMG)                                                                                                             |
| Gate-induced gate leakage (IGIDL)                        | Small      | Dopant profile optimization                                                                                                          |
| Reverse-biased junction leakage current (I_{\text{REV}}) | Negligible | Dopant profile optimization                                                                                                          |

#### **Dynamic Power**







#### THE UNIVERSITY OF SYDNEY

#### Table 2. Main Factors Impacting General-Purpose I/O Power

| Main Factors Impacting I/O Power                                                                              | <b>Mitigation Techniques</b>          |  |  |  |
|---------------------------------------------------------------------------------------------------------------|---------------------------------------|--|--|--|
| Termination resistors (on-chip series termination ( $R_S$ OCT) and on-chip parallel termination ( $R_T$ OCT)) | Dynamic on-chip termination<br>(DOCT) |  |  |  |
| Output buffer drive strength                                                                                  | Programmable drive strength           |  |  |  |
| Output buffer slew rate                                                                                       | Programmable slew rate                |  |  |  |
| I/O standard (single ended, voltage referenced, or differential)                                              | Support for multiple I/O standards    |  |  |  |
| Voltage supply                                                                                                | Support for various voltage rails     |  |  |  |
| Capacitive load (charging/discharging)                                                                        | Interface dependent                   |  |  |  |



- > High threshold voltage low leakage but low speed
- > Not all LUTs are on the critical path so some can be slower
- CAD tools plus configurable substrate bias allow reduced power without sacrificing speed



#### Figure 18. Programmable Power Technology Enabled by Adjusting Back-Bias Voltage



## Reducing Power in FPGA Designs

- > Use minimum possible voltage
- Reduce switching activity
- > Use most advanced process technology with best hard blocks
- > Use device with appropriate hard blocks
- > Do not clock unused parts of circuit



Figure 14. Increase Bandwidth and Cut Power by Half Using 28-Gbps Transceivers



## ASICs vs FPGAs [6]

- > Kuon and Rose compared FPGAs and ASICs on a number of benchmarks and found that FPGAs are
  - 20x larger area
  - 3-4x slower
  - 10x higher power
- > Embedded blocks improve area and power significantly (if utilised)

# **Design Space Exploration**





- Classification of computer architectures made in 1966 by Michael Flynn (IBM)
- Based on whether instruction and data streams are parallel
- > SISD serial processor

THE UNIVERSITY OF

- > SIMD array or vector processor
- MISD for fault tolerance, systolic array
- MIMD multicore or distributed processor





## **Design Space Exploration**

- > Options include
  - Algorithm (most important)
  - Parallelism
  - Precision
  - Interface
  - Customisation



- > Within each are other options and so the actual design space is extremely large
- > Key to making good designs is to have good judgment regarding the tradeoffs
  - These may be different depending on what you need to optimise
  - Can be estimated using back-of-envelope techniques and reduced implementations
  - Finding suitable input data to characterise your application is also a big issue



#### Pareto Frontier

- > For competing factors such as speed and area efficiency (1/area)
- > Pareto Frontier separates infeasible from feasible designs
- > We want to be as close to the optimal as possible







- Introduced some important principles
  - Moore's Law (tells us how IC area scales)
  - Dennard's Law (tells us how IC technology scales)
  - Amdahl's Law (tells us how to estimate speedup for parallel processing)
- > FPGA designs have followed technology
- Design space is large (curse of dimensionality) so we need to be selective and tried to be close to Pareto Frontier
- > Exploration must be done right to avoid having to redesign system



[1] G. E. Moore, "Cramming more components onto integrated circuits," Electronics, vol. 38, no. 8, April 1965

[2] Lesley Shannon, Veronica Cojocaru, Cong Nguyen Dao, and Philip H.W. Leong. Trends in reconfigurable computing: Applications and architectures. In *Proc. FCCM*, pages 1–8, 2015

[3] Amdahl, Gene M. (1967). "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities". AFIPS Conference Proceedings (30): 483–485. doi:10.1145/1465482.1465560

[4] R. Dennard, F. Gaensslen, V. Rideout, E. Bassous, and A. LeBlanc, "Design of ionimplanted MOSFET's with very small physical dimensions," JSSC, vol. 9, no. 5, pp. 256–268, Oct 1974.

[5] Altera White Paper<u>https://www.altera.com/en\_US/pdfs/literature/wp/wp-01148-stxv-power-consumption.pdf</u>

[6] Ian Kuon; Rose, J., "Measuring the Gap Between FPGAs and ASICs," TCAD, vol.26, no.2, pp.203,215, Feb. 2007



#### **Review Exercises**

- > Explain in your own words:
  - Moore's Law (tells us how IC area scales)
  - Dennard's Law (tells us how IC technology scales)
  - Amdahl's Law
- A problem has a section of non-parallelisable code which takes 100 s to execute, and the rest of the code is parallelisable and takes 1 hour to process. If we are given the task of designing an FPGA accelerator to replace the CPU and wish to achieve a speedup of 100, what should the speedup of the FPGA accelerator core be? What if it takes a day to process?

# Case Study – Matrix Multiplication





- > Serve as an example of design exploration of matrix multiplication
- While examples are for a processor with cache, they are equally valid for an FPGA with external memory





- > Performance Modeling
- Matrix-Vector Multiply (Warmup)
- Matrix Multiply Cache Optimizations


#### Why Matrix Multiplication?

- > An important kernel in many problems
  - Appears in many linear algebra algorithms
    - Bottleneck for dense linear algebra
  - One of the 7 dwarfs / 13 motifs of parallel computing
  - Closely related to other algorithms, e.g., transitive closure on a graph using Floyd-Warshall
- > Optimization ideas can be used in other problems
- > The best case for optimization payoffs
- > The most-studied algorithm in high performance computing



## Motif/Dwarf: Common Computational Methods (Red Hot $\rightarrow$ Blue Cool)

|                      | Embed | SPEC | DB | Games | ML | HPC | Health | Image | Speech | Music | Browser |
|----------------------|-------|------|----|-------|----|-----|--------|-------|--------|-------|---------|
| 1 Finite State Mach. |       |      |    |       |    |     |        |       |        |       |         |
| 2 Combinational      |       |      |    |       |    |     |        |       |        |       |         |
| 3 Graph Traversal    |       |      | _  |       |    |     |        |       |        |       |         |
| 4 Structured Grid    |       |      |    | _     |    |     |        |       |        |       |         |
| 5 Dense Matrix       |       |      |    |       |    |     |        |       |        |       |         |
| 6 Sparse Matrix      |       | _    |    |       |    |     |        |       |        |       |         |
| 7 Spectral (FFT)     |       |      |    |       |    |     |        |       |        |       |         |
| 8 Dynamic Prog       |       |      |    |       |    |     |        |       |        |       |         |
| 9 N-Body             |       |      |    |       |    |     |        |       |        |       |         |
| 10 MapReduce         |       |      |    |       |    |     |        |       |        |       |         |
| 11 Backtrack/ B&B    |       |      |    |       |    |     |        |       |        |       |         |
| 12 Graphical Models  |       |      |    |       |    |     |        |       |        |       |         |
| 13 Unstructured Grid |       |      |    |       |    |     |        |       |        |       |         |



#### Matrix-multiply, optimized several ways



Speed of n-by-n matrix multiply on Sun Ultra-1/170, peak = 330 MFlops



- A matrix is a 2-D array of elements, but memory addresses are "1-D"
- > Conventions for matrix layout
  - by column, or "column major" (Fortran default); A(i,j) at A+i+j\*n
  - by row, or "row major" (C default) A(i,j) at A+i\*n+jColumn major matrix in memory
  - recursive (later)

>



Slide: James Demmel UCB Figure source: Larry Carter, UCSD



- > Assume just 2 levels in the hierarchy, fast and slow
- > All data initially in slow memory
  - m = number of memory elements (words) moved between fast and slow memory
- t<sub>m</sub> = time per slow memory operation
  f = number of arithmetic operations
  t<sub>f</sub> = time per arithmetic operation << t<sub>m</sub>
  q = f / m average number of flops per slow memory access
  Minimum possible time = f\* t<sub>f</sub> when all data in fast memory
  Actual time

- 
$$f * t_f + m * t_m = f * t_f * (1 + t_m/t_f * 1/q)$$
  
> Larger q means time closer to minimum  $f * t_f$   
Machine  
Balance:  
Key to  
machine  
efficiency

-  $q \geq t_m/t_f\,$  needed to get at least half of peak speed



#### Warm up: Matrix-vector multiplication

{implements y = y + A\*x} for i = 1:n for j = 1:n y(i) = y(i) + A(i,j)\*x(j)





```
{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
    {read row i of A into fast memory}
    for j = 1:n
        y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}
```

- m = number of slow memory refs =  $3n + n^2$
- f = number of arithmetic operations =  $2n^2$
- q = f / m  $\approx 2$
- Matrix-vector multiplication limited by slow memory speed



#### Modeling Matrix-Vector Multiplication

Compute time for nxn = 1000x1000 matrix

> Time

- $f * t_f + m * t_m = f * t_f * (1 + t_m/t_f * 1/q)$
- $= 2*n^2 * t_f * (1 + t_m/t_f * 1/2)$
- > For  $t_f$  and  $t_m$ , using data from R. Vuduc's PhD (pp 351-3)
  - http://bebop.cs.berkeley.edu/pubs/vuduc2003-dissertation.pdf
  - For t<sub>m</sub> use minimum-memory-latency / words-per-cache-line

|           | Clock | Peak    | 'eak Mem Lat (Min,Max) |       |       | t_m/t_f |
|-----------|-------|---------|------------------------|-------|-------|---------|
|           | MHz   | Mflop/s | сус                    | les   | Bytes |         |
| Ultra 2i  | 333   | 667     | 38                     | 66    | 16    | 24.8    |
| Ultra 3   | 900   | 1800    | 28                     | 200   | 32    | 14.0    |
| Pentium 3 | 500   | 500     | 25                     | 60    | 32    | 6.3     |
| Pentium3N | 800   | 800     | 40                     | 60    | 32    | 10.0    |
| Power3    | 375   | 1500    | 35                     | 139   | 128   | 8.8     |
| Power4    | 1300  | 5200    | 60                     | 10000 | 128   | 15.0    |
| ltanium1  | 800   | 3200    | 36                     | 85    | 32    | 36.0    |
| ltanium2  | 900   | 3600    | 11                     | 60    | 64    | 5.5     |

*machine balance* (q must be at least this for ½ peak speed)



- > What simplifying assumptions did we make in this analysis?
  - Ignored parallelism in processor between memory and arithmetic within the processor
    - Sometimes drop arithmetic term in this type of analysis
  - Assumed fast memory was large enough to hold three vectors
    - Reasonable if we are talking about any level of cache
    - Not if we are talking about registers (~32 words)
  - Assumed the cost of a fast memory access is 0
    - Reasonable if we are talking about registers
    - Not necessarily if we are talking about cache (1-2 cycles for L1)
  - Memory latency is constant
- Could simplify even further by ignoring memory operations in X and Y vectors
  - Mflop rate/element =  $2 / (2^* t_f + t_m)$



#### Validating the Model

- > How well does the model predict actual performance?
  - Actual DGEMV: Most highly optimized code for the platform
- Model sufficient to compare across machines
- > But under-predicting on most recent ones due to latency estimate





#### Naïve Matrix Multiply

```
{implements C = C + A^*B}
for i = 1 to n
for j = 1 to n
for k = 1 to n
C(i,j) = C(i,j) + A(i,k) * B(k,j)
```

Algorithm has  $2^n^3 = O(n^3)$  Flops and operates on  $3^n^2$  words of memory

q potentially as large as  $2^n^3 / 3^n^2 = O(n)$ 





#### Naïve Matrix Multiply

```
{implements C = C + A^*B}
for i = 1 to n
{read row i of A into fast memory}
for j = 1 to n
{read C(i,j) into fast memory}
{read column j of B into fast memory}
for k = 1 to n
C(i,j) = C(i,j) + A(i,k) * B(k,j)
{write C(i,j) back to slow memory}
```





#### Naïve Matrix Multiply

Number of slow memory references on unblocked matrix multiply

- $m = n^3$  to read each column of B n times
  - +  $n^2$  to read each row of A once
  - +  $2n^2$  to read and write each element of C once

$$= n^3 + 3n^2$$

- So q = f / m =  $2n^3$  / ( $n^3$  +  $3n^2$ )
  - $\approx 2$  for large n, no improvement over matrix-vector multiply

Inner two loops are just matrix-vector multiply, of row i of A times B Similar for any other order of 3 loops





#### Matrix-multiply, optimized several ways



Speed of n-by-n matrix multiply on Sun Ultra-1/170, peak = 330 MFlops



| Consider A,B,C to be N-by-N matrices of b-by-b subblocks where<br>called the block size | b=n / N is |
|-----------------------------------------------------------------------------------------|------------|
| for i = 1 to N                                                                          |            |
| for j = 1 to N                                                                          |            |
| {read block C(i,j) into fast memory}                                                    |            |
| for $k = 1$ to N                                                                        |            |
| <pre>{read block A(i,k) into fast memory}</pre>                                         |            |
| {read block B(k,j) into fast memory}                                                    |            |
| C(i,j) = C(i,j) + A(i,k) * B(k,j) {do a matrix multiply on blo                          | ocks}      |
| {write block C(i,j) back to slow memory}                                                |            |





Recall:

m is amount memory traffic between slow and fast memory matrix has nxn elements, and NxN blocks each of size bxb f is number of floating point operations,  $2n^3$  for this problem q = f / m is our measure of algorithm efficiency in the memory system So:

 $\begin{array}{ll} m = \ N^*n^2 & \text{read each block of B} & N^3 \text{ times } (N^3 * b^2 = N^3 * (n/N)^2 = N^*n^2) \\ & + \ N^*n^2 & \text{read each block of A} & N^3 \text{ times} \\ & + \ 2n^2 & \text{read and write each block of C} \text{ once} \\ & = \ (2N + 2) * n^2 \end{array}$ 

So computational intensity  $q = f / m = 2n^3 / ((2N + 2) * n^2)$ 

 $\approx$  n / N = b for large n

So we can improve performance by increasing the blocksize b Can be much faster than matrix-vector multiply (q=2)



The blocked algorithm has computational intensity  $q \approx b$ 

- > The larger the block size, the more efficient our algorithm will be
- Limit: All three blocks from A,B,C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large
- > Assume your fast memory has size M<sub>fast</sub>

 $3b^2 \le M_{fast}$ , so  $q \approx b \le (M_{fast}/3)^{1/2}$ 

 To build a machine to run matrix multiply at 1/2 peak arithmetic speed of the machine, we need a fast memory of size

 $M_{fast} \geq 3b^2 \approx 3q^2 = 3(t_m/t_f)^2$ 

- This size is reasonable for L1 cache, but not for register sets
- Note: analysis assumes it is possible to schedule the instructions perfectly

|           |         | required |  |
|-----------|---------|----------|--|
|           | t_m/t_f | KB       |  |
| Ultra 2i  | 24.8    | 14.8     |  |
| Ultra 3   | 14      | 4.7      |  |
| Pentium 3 | 6.25    | 0.9      |  |
| Pentium3M | 10      | 2.4      |  |
| Power3    | 8.75    | 1.8      |  |
| Power4    | 15      | 5.4      |  |
| Itanium1  | 36      | 31.1     |  |
| Itanium2  | 5.5     | 0.7      |  |



- The blocked algorithm changes the order in which values are accumulated into each C[i,j] by applying commutativity and associativity
  - Get slightly different answers from naïve code, because of roundoff OK
- The previous analysis showed that the blocked algorithm has computational intensity:

 $q \approx b \leq (M_{fast}/3)^{1/2}$ 

- There is a lower bound result that says we cannot do any better than this (using only associativity)
- > Theorem (Hong & Kung, 1981): Any reorganization of this algorithm (that uses only associativity) is limited to  $q = O((M_{fast})^{1/2})$ 
  - #words moved between fast and slow memory =  $\Omega (n^3 / (M_{fast})^{1/2})$



- > Need to minimize communication between all levels
  - Between L1 and L2 cache, cache and DRAM, DRAM and disk...
- > The tiled algorithm requires finding a good block size
  - Machine dependent
  - Need to "block" b x b matrix multiply in inner most loop
    - 1 level of memory  $\Rightarrow$  3 nested loops (naïve algorithm)
    - 2 levels of memory  $\Rightarrow$  6 nested loops
    - 3 levels of memory  $\Rightarrow$  9 nested loops ...
- > Cache Oblivious Algorithms offer an alternative
  - Treat nxn matrix multiply as a set of smaller problems
  - Eventually, these will fit in cache
  - Will minimize # words moved between every level of memory hierarchy at least asymptotically





- Described a way to think about computation and memory computational intensity
- > Introduced the concept of blocking to increase computational intensity



#### **Review Exercises**

- > Explain in your own words:
  - Computational intensity
- Do a similar analysis computational intensity analysis for a different algorithm e.g. FFT

### An FPGA Delay Model





This work based on a paper at FPT09 [1]

Eddie Hung<sup>1</sup>, Steven J. E. Wilton<sup>1</sup>,

Haile Yu<sup>2</sup>, Thomas C. P. Chau<sup>2</sup>, Philip H. W. Leong<sup>2\*</sup>

<sup>1</sup> Department of ECE, University of British Columbia

<sup>2</sup> Department of CSE, Chinese University of Hong Kong \* Now with School of Electrical and Information Engineering, University of Sydney

>Funded by NSERC of Canada and RGC of HKSAR







**Physical Delay Estimate** 

Compared to previous models:

• Simpler, closed-form, equally accurate



#### Motivation: FPGA Design

Important when *designing* FPGA architectures:

- Need methods to estimate their performance ahead of time
- Two different ways of investigating new architectures:
  - Analytical Models (our approach)
  - Experimental Techniques



#### Motivation: FPGA Design

Existing FPGA design approach:

 Iteratively change details and experimentally measure improvement using benchmarks



- Problems:
  - Slow and resource-hungry
  - Lack of intuition and insight into why



#### Motivation: Analytical Models

New paradigm emerging: <u>Analytical Modelling</u>

 Capturing the essence of programmable logic in a set of simple equations



#### Motivation: Analytical Models

- Why analytical modelling?
  - Faster
  - Allow early exploration of radical architectures

- What makes a good model?
  - Analytical not rely on curve fitting
  - Simple more insight into architectural trade-offs
  - Circuit Independent capturing average behaviour



#### Where this work fits in

Delay Model:

 Logical delay model presented by Das *et al.* at FPL 2009 [2]





#### Where this work fits in

Delay Model:

- Logical delay model presented by Das *et al.* at FPL 2009 [2]
- This paper:

A model which relates *logical* delay to *physical* delay





#### Motivation: Analytical Models

- Can the models from the experimental flow be re-used for the analytical flow?
  - Requires routing/timing graphs
  - Lack of delay model for logic cluster



#### What makes deriving this model hard?

- Would like our model to be:
  - Flexible, coping with range of modern architectures
  - Accurate
  - Closed-form
  - Fast
- But complex interactions exist between FPGA architecture and circuit implementation:
  - e.g. Buffer sizes change depending on loading



# Circuit Assumptions and Delay Model



#### **Circuit Assumptions**

- Island-Style FPGA
  - 2-D array of *Logic Clusters* surrounded by a *Global Interconnect* of routing tracks
- Delay model broken down into:
  - Local Routing Delay:
  - Logic Element Delay:
  - Global Interconnect Delay:



Collection of logic elements accessed through a local routing network with a shared set of inputs





Global Interconnect: T<sub>global</sub>

Composed of horizontal/vertical tracks and Connection/Switch Boxes




### Delay Model: How it fits together





## Circuit Assumptions: Logic Element

• Lookup table with D flip-flop and bypass mux



Architecture View



**Circuit View** 



## Circuit Assumptions: Logic Element





## **Circuit Assumptions: Local Routing**

>Fully connected crossbar implemented using multiplexers





## **Global Interconnect Delay**

>Further divided into:





## Circuit Assumptions: Cluster-Switch

• Single-Driver Routing



**Circuit View** 



## Circuit Assumptions: Cluster-Switch





## **Global Interconnect Delay**

- Similarly for and
- Combining them:





## **Optimal Buffer Sizing**

- RC values depends on buffer sizing
  - Using equations: differentiate to find optimal size





# **Model Validation**



## Model Validation: Methodology

Analytical Model



## Model Validation: T<sub>local</sub>











## Model Validation: T<sub>global</sub>







## Model Validation: Previous Work

- Consistent with previous methods
  - But orders of magnitude faster

|    |           | Experimentally Derived |            |
|----|-----------|------------------------|------------|
| Ν  | Our Model | Ahmed et al.           | Our HSPICE |
| 2  | 253       | 221                    | 267        |
| 4  | 286       | 301                    | 298        |
| 6  | 321       | 332                    | 326        |
| 8  | 352       | 331                    | 349        |
| 10 | 361       | 337                    | 362        |

T<sub>local</sub> (ps) for K=4 at 0.18um



## Model Applications

(1) Speeding up FPGA architecture design

(2) In conjunction with experimental techniques

• Generate delays for use in VPR architecture files

(3) Gain additional insight into FPGAs



## Application: Experimental Techniques

- VPR does not have a parameterised delay model for T<sub>local</sub> and T<sub>logic</sub>
  - Physical delays currently specified per-architecture
  - Can use our model to generate realistic delays





## Application: Gaining Insight

- Abstract away technology parameters
- Leaving behind a 'distilled' expression with architecture parameters only:

$$T_{local} \approx A_0 + A_1 \sqrt{2N + K + NK} + A_2 NK$$

- Not possible using experimental techniques
- Interesting insight:

N has about the same effect on delay as K

## Conclusion



- Circuit-level description of FPGA presented
- Simple yet accurate delay model derived
- Future directions:
  - Incorporate more recent architectural developments
  - Investigate effects of process technology scaling
  - Develop associated area model to explore tradeoffs



[1] Eddie Hung, Steven J. E. Wilton, Haile Yu, Thomas C. P. Chau, and Philip H.W. Leong. A detailed delay path model for FPGAs. In Proc. International Conference on Field Programmable Technology (FPT), pages 96–103, 2009.

[2] Joydip Das, Andrew Lam, Steven J.E. Wilton, Philip Leong, and Wayne Luk. An analytical model relating FPGA architecture to logic density and depth. IEEE Transactions on VLSI Systems, 9(12):2229–2242, 2011.