Carlos Bederián
Nicolás Wolovick
CCAD - UNC
Nicolás Wolovick 20190728
Haswell-EP (661mm², 5.75GTr, 22nm),
GP102 (471mm², 12GTr, 16nm),
KNL (683mm², 8GTr, 14nm)
But costs are what decide.
Intel: 10 years to reach 10nm.
For now it is a cost problem. Creativity and risk-taking: AMD Rome chiplets.
Putting it all together, in every technology generation transistor integration doubles, circuits are 40% faster, and system power consumption (with twice as many transistors) stays the same.
Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors, Communications of the ACM, May 2011, Vol. 54 No. 5, Pages 67-77
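The three numbers in the quote follow from classical (Dennard) scaling. A sketch, assuming a per-generation linear scaling factor written here as κ ≈ √2 ≈ 1.4 (the symbols κ and N are introduced only for this illustration), with dimensions, voltage and capacitance each shrinking by 1/κ:

```latex
\begin{align*}
N &\to \kappa^{2} N \approx 2N
  && \text{(transistor area shrinks by } 1/\kappa^{2}\text{)} \\
f &\to \kappa f \approx 1.4\,f
  && \text{(gate delay shrinks by } 1/\kappa\text{)} \\
P_{\text{transistor}} = C V^{2} f &\to
  \frac{C}{\kappa}\cdot\frac{V^{2}}{\kappa^{2}}\cdot\kappa f
  = \frac{C V^{2} f}{\kappa^{2}} \\
P_{\text{chip}} = N\,C V^{2} f &\to
  \kappa^{2} N \cdot \frac{C V^{2} f}{\kappa^{2}} = N\,C V^{2} f
\end{align*}
```

Twice the transistors, 40% higher frequency, same total power: this only holds while the supply voltage can keep dropping by 1/κ each generation.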
Tomasulo's algorithm, IBM System/360 Model 91, 1967.
Scoreboarding, CDC 6600, 1964 (Seymour Cray)
Law of diminishing returns.
Bill Dally, The Last Classical Computer, ISAT Study, 2001.
Leakage current got out of hand.
Hennessy, Patterson, A New Golden Age for Computer Architecture, Turing Lecture, 2018.
Pollack's rule argues that microprocessor performance scales roughly as the square root of its complexity, where the logic transistor count is often used as a proxy to quantify complexity.
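A quick worked instance of the square-root relation, taking transistor count T as the complexity proxy (the symbol T is an assumption made for this illustration):

```latex
\begin{align*}
\text{perf}(T) &\propto \sqrt{T} \\
\frac{\text{perf}(2T)}{\text{perf}(T)} &= \sqrt{2} \approx 1.41 \\
\frac{\text{perf}(4T)}{\text{perf}(T)} &= \sqrt{4} = 2
\end{align*}
```

Under this rule, spending twice the transistors on a single, more complex core buys about 41% more performance, and it takes four times the transistors to double it.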
Any modern architecture has:
Heterogeneity
(Bill Dally, Efficiency and Programmability: Enablers for ExaScale, SC13)
Hennessy, Patterson, A New Golden Age for Computer Architecture, Turing Lecture, 2018.
If I switch on all the transistors, the chip catches fire!
P ∝ C·V²·f (dynamic power)
Dynamic Voltage and Frequency Scaling (DVFS).
We no longer control the frequency at which a processor runs.
To keep power in check Intel introduced something called dynamic frequency scaling. It reduces the base frequency of the processor whenever AVX2 or AVX-512 instructions are used. This is not new, and has existed since Haswell introduced AVX2 three years ago.
If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.
(secret, shhh: we are going to talk about memory)
M. Själander, M. Martonosi, S. Kaxiras, Power-Efficient Computer Architectures: Recent Advances, 2014.
Hennessy, Patterson, Computer Architecture: A Quantitative Approach, 6th ed., 2017.
Erratum: "FB" should read "FP" (floating point).
The cache is a leaky abstraction that exploits statistical regularities in how running programs access memory.
HBM1, HBM2, MCDRAM
3 orthogonal levels of parallelism
4 levels in the memory hierarchy
How do I take advantage of all this? HA, it is a mess.
Focus on the low-hanging fruit.
ILP, DLP, L1i, L1d, L2
RIDL and Fallout: MDS attacks, Vrije Universiteit Amsterdam, 2019.
TLP, L3
Computers were invented to add up a sequence of numbers.
#define N (1<<28)       /* 2^28 floats = 1 GiB */
float a[N];             /* zero-initialized global array */

int main(int argc, char **argv)
{
    float sum = 0.0f;
    for (unsigned int i=0; i<N; ++i) {
        sum += a[i];
    }
    return (int)sum;    /* use the result so the loop is not optimized away */
}
$ gcc-9 suma.c && ./a.out
$ gcc-9 suma.c && perf stat ./a.out
$ gcc-9 -S suma.c && cat suma.s
$ gcc-9 -O1 -S suma.c && cat suma.s
$ gcc-9 -O1 suma.c && perf stat ./a.out
$ gcc-9 -O2 suma.c && perf stat ./a.out
$ gcc-9 -O2 suma.c && perf stat -r 16 ./a.out
$ gcc-9 -O3 suma.c && perf stat -r 16 ./a.out
$ gcc-9 -S -O2 suma.c && cat suma.s
$ gcc-9 -S -O3 suma.c && cat suma.s
$ gcc-9 -O3 -ffast-math suma.c && perf stat -r 16 ./a.out
$ gcc-9 -O1 -ftree-vectorize -ffast-math suma.c && perf stat -r 16 ./a.out
$ gcc-9 -S -O1 -ftree-vectorize -ffast-math suma.c && cat suma.s
$ echo "We add this just before the for loop"
$ echo "#pragma omp parallel for default(none) shared(a) reduction(+:sum)"
$ gcc-9 -O1 -ftree-vectorize -ffast-math suma.c && perf stat -r 16 ./a.out
$ gcc-9 -O1 -ftree-vectorize -ffast-math -fopenmp suma.c && perf stat -r 16 ./a.out
$ echo "#pragma omp parallel for simd default(none) shared(a) reduction(+:sum)"
$ gcc-9 -O1 -ffast-math -fopenmp suma.c && perf stat -r 16 ./a.out
$ gcc-9 -O1 -ffast-math -fopenmp suma.c && OMP_NUM_THREADS=1 perf stat -r 16 ./a.out
$ gcc-9 -O1 -ffast-math -fopenmp suma.c && OMP_NUM_THREADS=2 perf stat -r 16 ./a.out
$ gcc-9 -O1 -ffast-math -fopenmp suma.c && OMP_NUM_THREADS=3 perf stat -r 16 ./a.out
$ gcc-9 -O1 -ffast-math -fopenmp suma.c && OMP_NUM_THREADS=4 perf stat -r 16 ./a.out