Trash the cache

Action	Time (ns)	Time (converted)	Notes
L1 cache reference	0.5 ns
Branch mispredict	5 ns
L2 cache reference	7 ns		14x L1 cache
Mutex lock/unlock	25 ns
Main memory reference	100 ns		20x L2 cache, 200x L1 cache
Send 1K bytes over 1 Gbps network	10,000 ns	10 μs
Read 4K randomly from SSD	150,000 ns	150 μs	~1GB/sec SSD
Read 1 MB sequentially from memory	250,000 ns	250 μs
Read 1 MB sequentially from SSD	1,000,000 ns	1,000 μs = 1 ms	4X memory
HDD Disk seek	10,000,000 ns	10,000 μs = 10 ms
Read 1 MB sequentially from HDD	20,000,000 ns	20,000 μs = 20 ms	20X SSD

Persistent memory: memory that survives a reboot of your device. SSD: Solid State Drive. HDD: Hard Disk Drive, which has moving parts: slower, but more capacity (for now). Remember the old days when we had things like … DVDs? What is the difference between an SSD and an HDD, does anyone know?

HDDs have moving parts: an arm has to seek positions on a spinning disk, which makes them slow. SSDs don’t have that problem and are thus much faster.

rbp = frame pointer, eax = register

Source: Computer Architecture: A Quantitative Approach by John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau. How slow? Very slow. One fetch from main memory can take hundreds of CPU cycles, leaving the CPU waiting idle.

Say we have an instruction that needs to operate on a value in memory (blue).

We fetch the cache line containing the variable and store the value in a CPU register, so the CPU can now execute its instruction.

Say we need another value for our operation that is located somewhere else in memory. We need to fetch it too, and here we have a cache miss.

Say we want to perform the same operation again, but on two other values. Here we have a cache miss again.

If we place the data we need for our operation next to each other in memory, we have a cache hit for the second value.

Even better, the next operation can also continue without a memory access.
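To make the walkthrough above concrete, here is a minimal sketch contrasting the two layouts. The function names `SumContiguous` and `SumScattered` are made up for this note and do not come from the slides. The first sums values stored next to each other; the second sums values that each live in their own heap allocation, so consecutive reads may land on different cache lines.

```cpp
#include <cassert>
#include <memory>
#include <numeric>
#include <vector>

// Contiguous layout: neighbouring ints share a cache line, so after the
// first fetch the next few reads are likely to be cache hits.
int SumContiguous(const std::vector<int>& values)
{
    return std::accumulate(values.begin(), values.end(), 0);
}

// Scattered layout (hypothetical counter-example): every int is a separate
// heap allocation, so each read follows a pointer to wherever the allocator
// placed it and may cause a cache miss.
int SumScattered(const std::vector<std::unique_ptr<int>>& values)
{
    int sum = 0;
    for (const auto& value : values)
    {
        assert(value != nullptr); // every slot is expected to own an int
        sum += *value;
    }
    return sum;
}
```

Both functions return the same result; only the memory access pattern differs, so any speed difference has to be measured on the target machine.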

The one without the pointer :)

If your code accesses data that is scattered all over memory, you get cache misses on the data. Conversely, if you jump through virtual methods whose code is everywhere in memory, you introduce cache misses on instructions that would otherwise be nicely contiguous.

They’ll probably answer the second one, but why? Not because the algorithm might be too complex for this simple case, but because of the potential cache miss it introduces. Of course, compiler optimization will probably fix this. But: no premature optimization!

fd = no man’s land, used by the debug heap to detect buffer overruns; dd = freed memory; cd = newly allocated memory

32 bytes in x64, 20 bytes in x86

32 bytes in x64, 20 bytes in x86. Red: the canaries. Yellow: the vtable pointer. Green: m_parent. Blue: m_isActive + padding. Purple: m_health. Orange: m_name.
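The colour legend above can be mirrored in a small sketch. The member types below are assumptions chosen so that the sizes add up to the 32 and 20 bytes mentioned in the notes; the slides may use different types. The canaries are presumably the debug heap's guard bytes around the allocation, so they do not count towards `sizeof`.

```cpp
#include <cstddef>

// Hypothetical reconstruction of the object on the slide.
struct GameObject
{
    virtual ~GameObject() = default;  // any virtual function adds the hidden vtable pointer

    GameObject* m_parent = nullptr;   // pointer: 8 bytes on x64, 4 on x86
    bool        m_isActive = true;    // 1 byte, followed by 3 bytes of padding
    int         m_health = 100;       // 4 bytes, must start on a 4-byte boundary
    const char* m_name = nullptr;     // assumed to be a plain pointer, not a std::string
};

// Size quoted in the notes for each pointer width.
constexpr std::size_t ExpectedGameObjectSize()
{
    return sizeof(void*) == 8 ? 32 : 20;
}
```

Comparing `sizeof(GameObject)` with `ExpectedGameObjectSize()` on your own compiler is a quick way to check whether these assumptions hold there.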

Not all particles are active at the same time, so we check each one. That introduces a lot of cache fetches that are often not needed. How do we fix this? Open question, many solutions possible. One: keep the active particles together in memory by swapping their data around and keeping track of how many active particles there are. We can't do this by swapping pointers, so this potentially moves a lot of memory around. It is a tradeoff: what is cheaper for your use case? MEASURE before you optimize.
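The "keep active particles together" idea can be sketched as follows. This is one possible shape, not the solution from the slides: `Particle`, its fields and `ParticlePool` are hypothetical names. Active particles occupy the front of the array, and killing one swaps it with the last active particle.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Particle            // hypothetical particle data
{
    float x, y, vx, vy;
    float lifetime;
};

class ParticlePool
{
public:
    explicit ParticlePool(std::size_t capacity) : m_particles(capacity) {}

    // Places the new particle right after the last active one.
    bool Spawn(const Particle& particle)
    {
        if (m_activeCount == m_particles.size()) return false; // pool is full
        m_particles[m_activeCount++] = particle;
        return true;
    }

    // Swaps the dead particle with the last active one, so the
    // range [0, m_activeCount) stays densely packed.
    void Kill(std::size_t index)
    {
        assert(index < m_activeCount);
        --m_activeCount;
        std::swap(m_particles[index], m_particles[m_activeCount]);
    }

    // Touches only active particles: no isActive check, no wasted fetches.
    void Update(float dt)
    {
        for (std::size_t i = 0; i < m_activeCount; )
        {
            Particle& p = m_particles[i];
            p.x += p.vx * dt;
            p.y += p.vy * dt;
            p.lifetime -= dt;
            if (p.lifetime <= 0.f) Kill(i); // slot i now holds a not-yet-updated particle
            else ++i;
        }
    }

    std::size_t ActiveCount() const { return m_activeCount; }

private:
    std::vector<Particle> m_particles;
    std::size_t m_activeCount = 0;
};
```

The swap copies whole particles, which is the cost the notes warn about; whether it beats the per-particle check depends on particle size and how often particles die.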

Problem: AIComponent has a lot of data, so too few components (or not even one) fit in a cache line. Open question: how do we fix that? Solution: we split the data into hot and cold. Only data used on the hot path should reside in the AIComponent and be used in the update function. The cold data we move into a separate object that the component just points to.
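A minimal sketch of the hot/cold split described above. Only the name `AIComponent` comes from the notes; `AIColdData`, `UpdateAll` and every field are invented for illustration.

```cpp
#include <memory>
#include <string>
#include <vector>

// Cold data: rarely touched, so it lives in its own allocation.
struct AIColdData
{
    std::string debugName;
    float lootChances[16] = {};
    int dialogueId = 0;
};

// Hot data: everything the update loop reads or writes, plus one pointer.
struct AIComponent
{
    float posX = 0.f;
    float targetX = 0.f;
    float energy = 1.f;
    std::unique_ptr<AIColdData> cold; // only followed outside the hot path
};

// The hot path never dereferences `cold`, so several components
// fit in every cache line that gets fetched.
void UpdateAll(std::vector<AIComponent>& components, float dt)
{
    for (AIComponent& c : components)
    {
        c.posX += (c.targetX - c.posX) * dt;
        c.energy -= 0.1f * dt;
    }
}
```

The price is an extra pointer per component and one indirection whenever the cold data is needed, so the split pays off only if that access is genuinely rare.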

We access every cache line for every step, so the timing matches the expected result. What we don't want is the height of the graph. Let the students fix it following the AIComponent example.

Again, we flatline until k = 8 (because sizeof(GameObject3DAlt) = 8 instead of 4 as in the previous example). But compare it with GameObject3D and you see the massive gain, until k = 16; then we're in the same boat again.
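The exercise code itself is not part of these notes, so the following is only a guess at the kind of measurement behind the graphs: one timed pass over a buffer that touches every `step`-th element. The function name `TimeStridedPass` and the use of a plain `int` buffer are assumptions.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Doubles every `step`-th element and returns the elapsed time in microseconds.
long long TimeStridedPass(std::vector<int>& buffer, std::size_t step)
{
    assert(step > 0); // a step of 0 would never advance

    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (std::size_t i = 0; i < buffer.size(); i += step)
    {
        buffer[i] *= 2;
    }
    const auto elapsed = Clock::now() - start;
    return std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
}
```

Calling this for steps 1, 2, 4, ... on a buffer of several million elements gives one data point per step. Assuming 64-byte cache lines, 4-byte elements still touch every cache line up to a step of 16, and 8-byte elements up to a step of 8, which would explain a flat start of the graph.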

Memory

Persistent memory

Temporary memory

CPU registers

CPU speed vs memory speed

CPU cache

Cache lines

CPU cache

Question

Multiple levels

I-cache and D-cache

Multiple levels

Cache size

Object layout in memory

Memory alignment

Cache alignment

Exercise

What happened?

2 Types of cache

I-cache

D-cache

Examples

Exercise

User interface programming

Retained mode (RMGUI)

Immediate mode (IMGUI)

Dear ImGui

Exercise