This blog post presents some of the speed measurements I've done with alternative implementations of memset.
I was filling a 1 GB memory area 20 times (equivalent to
memset(a, 0, 1 << 30) each) with various different implementations, and measuring the speed. I've compiled the program for i386 (
gcc -m32) and amd64 (
gcc -m64), I ran it on desktop PC running 64-bit Linux 3.13.0 on a Xeon CPU (Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz) with 32 GB of RAM. I was compiling the code with GCC 4.6.3, with optimization level
gcc -O2, because
gcc -O3 optimized the 1-byte-at-a-time loop to a 4-bytes-at-a-time loop.
It turned out that there was no measurable difference in the i386 and amd64 version of the programs. Here are relative speeds (higher numbers are proportionally faster) of user times:
- 1.480: memset, doing
rep stosd, 4 bytes at a time
- 1.000: 16 writes of 4 bytes each in loop the body (like Duff's device)
- 1.000: 1 write of 8 bytes in the loop body
- 1.000: 1 write of 4 bytes in the loop body
- 0.675: 1 write of 2 bytes in the loop body
- 0.329: 1 write of 1 byte in the loop body (
char *cp = a, *cpend = a + sizeof(a) / sizeof(a); for (; cp != cpend; ++cp) *cp = 0;)
I can interpret the numbers except for memset the following way: the cache doesn't help at this size; CPU does correct branch prediction (or taking the branch is fast enough), or taking a branch is faster than writing to memory; the data bus between the CPU and the memory can take 4 bytes at a time.
But why is the assembly instruction
rep stosd that much faster than any of the loops? What's the magic behind it? It looks like my CPU had an optimized
rep stosd and
rep stosb built in, called ERMSB. More details in the PDF available from here (search for memset within the downloaded PDF).