If you are bored of the contemporary topic of AI and need a breather, I invite you to join me in exploring a mundane, fundamental, earthy topic.
The CPU.
Take the CPU, a beautiful instrument for executing instructions.
A simple yet very common operation is to add two numbers and store the sum in a variable in memory.
c = a + b
Both “a” and “b” live in main memory, and to the CPU that is a colossal distance, like the Sun and the Moon.
You see, the CPU cannot do anything with data sitting in memory. The data has to be close to it, in tiny local storage slots called registers, often 64 bits wide in modern CPUs.
So you load the variable “a” into a CPU register, then load “b” into another register on the same CPU. With multiple cores, you really need to know which core you pick: you don’t want “a” loaded into core 1’s register and “b” loaded into core 4’s.
I’m embarrassed to say I’m glossing over the extraordinary amount of work and complexity involved in loading data from the DIMMs into CPU registers, but maybe I’ll explore that in another post.
You now have both numbers close by in CPU registers. You then execute one instruction to add the two specified registers and store the sum in a third register. The third register is then written back to memory, where “c” is supposed to live. The process can then enjoy reading the value of “c” and working with it.
It is important to mention that the add itself is an instruction that lives in main memory (in the text segment of the process), and it too is fetched from memory, into a special register called the IR (instruction register).
But I must ask a question: what if I want to sum 100 pairs? You might say, well, that is 100 instructions similar to the one we just explored.
Here is the thing: it doesn’t have to be 100 instructions. What if I told you you can sum 100 pairs in 25 instructions?
25 executions is more efficient than 100.
Meet SIMD: single instruction, multiple data. It allows one instruction to operate on multiple data elements at once and essentially produce multiple outputs. As long as the CPU supports it, of course.
So in our example, you can store 4 variables in a special vector register and another 4 in a second one, then have the CPU execute one instruction to sum 4 pairs of integers at once. Why 4? Well, it’s just the vector size this CPU supports.
Think of it as a function that takes 8 parameters, a1, a2, a3, a4, b1, b2, b3, b4, sums them pairwise, and produces c1, c2, c3, c4, all in one shot, a single instruction. That’s with a 128-bit vector and 32-bit integers.
Brilliant.
This can add up, especially in CPU-bound, compute-heavy applications.
For example, there is research on combining SIMD with B+Trees, where we have a lot of keys and values (in pages) and want to search them with SIMD.
I just love this stuff.
The things that are so fundamental we think they can’t be improved actually can be.
Ok now back to why LLMs are being disobedient.