2019-02-10

Speed of in-memory algorithms in scripting languages

This blog post gives some examples of how much slower in-memory algorithms run in scripting languages than in C.

Before writing this blog post I had the general impression that the speed ratio between code in a scripting language and code in C for the same CPU-bound algorithm is between 5 and 20. I was quite surprised that for LZMA2 decompression I measured a much larger ratio between Perl and C: 285.

Then I looked at the C and Perl speeds on the Debian Computer Language Benchmarks Game, and found these ratios (in decreasing order) between Perl and C: 413, 79.7, 66.3, 62.2, 49.2, 20.8, 12.1, 10.2, 5.87, 1.91. So it turns out that there is a huge fluctuation in the speed ratio, depending on the algorithm.

Takeaways:

  • One doesn't need to use 64-bit registers or vector (SIMD) instructions (e.g. AVX, SSE, MMX) or other special instructions in C code to get a huge speed ratio: for LZMA2 decompression, the speed difference is huge even though all variables are 32-bit unsigned integers.
  • One doesn't need to use 64-bit code in C to get a huge speed ratio: for LZMA2 decompression, the benchmarked C code was running as 32-bit (i386, more specifically: i686) code.
  • One doesn't have to use C compiler optimization flags tuned for fast execution (e.g. -O3 or -O2) to get a huge speed ratio: for LZMA2 decompression, even the size-optimized output of gcc -Os was that fast.
  • Cache usage (e.g. L1 cache, L2 cache, L3 cache) can have a huge effect on the speed of C code. The muxzcat binary (https://github.com/pts/muxzcat/releases/download/v1/muxzcat) is 7376 bytes in total, thus the code fits into the fastest L1 cache of modern Intel processors (L1 cache size is at least 8 KiB, typically at least 32 KiB). The data itself doesn't fit into the cache though.
  • I/O buffering and the associated memory copies can also affect execution speed. The typical size of read(2) calls is >60 KiB, and the typical size of write(2) calls is even larger (2--3 times larger) for LZMA2 decompression; this is fast enough in both the C and the Perl code.
  • Memory allocation can also affect execution speed. The C code for LZMA2 decompression doesn't do any memory allocation. The algorithm of the Perl code doesn't do any either (but the Perl interpreter may do some as part of its overhead), except for the occasional exponential doubling of the string capacity. (Preallocating these string buffers didn't make it any faster.)
  • Even older C compilers (e.g. GCC 4.8 from 2014) can generate well-optimized low-level i386 machine code.
  • Some scripting languages are faster than others, e.g. Lua in LuaJIT and JavaScript in Node.js are typically faster than the Python, Perl and Ruby interpreters written in C, and PyPy is faster than the Python interpreter written in C.
  • Different integer sizes (e.g. 8-bit, 16-bit, 32-bit, 64-bit) can affect execution speed. Sometimes larger integers are faster (e.g. 32-bit is faster than 16-bit), because they are better aligned in memory, and fewer conversion instructions are necessary.
  • Integer fixups can contribute to the slowness of scripting languages. For example, the algorithm for LZMA2 decompression works with unsigned 32-bit integers, but Perl has only either signed 64-bit integers or signed 32-bit integers, so the inputs of some operators (e.g. >>, <, ==, %, /) need to be bit-masked to get correct results. Out of these, / and % would be the slowest to fix, but since LZMA2 decompression doesn't use these operators, < is the slowest in practice: in total, the 32-bit Perl runs the LZMA2 decompression 1.1017 times slower than the 64-bit Perl, mostly because operator < and its possibly negative inputs need more complicated masking if Perl is doing 32-bit arithmetic.
  • Function calls can be very slow in scripting languages, while the C compiler can inline some of the smaller functions, avoiding most of the overhead. For LZMA2 decompression, manual inlining of the fixup for operator < on 32-bit Perls made the entire program about 1.3 times faster.
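The exponential-doubling growth of string buffers mentioned above can be sketched as follows. This is a Python illustration with a hypothetical GrowBuf class, not the Perl code used in the benchmark; scripting-language string implementations typically do something similar internally:

```python
class GrowBuf:
    """Byte buffer that doubles its capacity when full, so n appended
    bytes cost O(n) amortized copying rather than O(n^2)."""

    def __init__(self):
        self.data = bytearray(16)  # initial capacity
        self.size = 0              # number of bytes actually used

    def append(self, chunk):
        need = self.size + len(chunk)
        cap = len(self.data)
        while cap < need:
            cap *= 2  # exponential doubling of the capacity
        if cap != len(self.data):
            new = bytearray(cap)
            new[:self.size] = self.data[:self.size]
            self.data = new
        self.data[self.size:need] = chunk
        self.size = need

    def value(self):
        return bytes(self.data[:self.size])
```

Because the capacity doubles, reallocations become exponentially rarer as the buffer grows, which is why preallocating the buffers up front gave no measurable speedup.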
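The kind of bit-masking fixup described above can be sketched in Python, whose arbitrary-precision integers make the masking explicit. This is an illustration of the technique only, not the benchmarked Perl code (in 32-bit Perl the comparison fix is more involved, because intermediate values can appear negative):

```python
MASK = 0xFFFFFFFF  # 2**32 - 1

def add32(a, b):
    # Unsigned 32-bit addition: mask the result back to 32 bits.
    return (a + b) & MASK

def shr32(a, n):
    # Logical (unsigned) right shift: mask the input first so high
    # bits beyond bit 31 cannot leak into the result.
    return (a & MASK) >> n

def lt32(a, b):
    # Unsigned 32-bit comparison: compare the masked values, so that
    # e.g. 0x80000000 counts as a large value, not a negative one.
    return (a & MASK) < (b & MASK)
```

Every such fixup is extra interpreted work per arithmetic operation, which is one reason the scripting-language version falls so far behind the C version, where the CPU does unsigned 32-bit arithmetic natively.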
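The effect of manually inlining such a fixup can also be illustrated in Python (again an illustrative sketch, not the benchmarked Perl code). Both functions below compute the same count, but the second avoids one interpreter-level function call per element:

```python
MASK = 0xFFFFFFFF  # 2**32 - 1

def lt32(a, b):
    # Unsigned 32-bit less-than fixup as a helper function.
    return (a & MASK) < (b & MASK)

def count_with_call(pairs):
    # Calls the helper for every pair: one function-call overhead each.
    n = 0
    for a, b in pairs:
        if lt32(a, b):
            n += 1
    return n

def count_inlined(pairs):
    # Same logic with the fixup inlined at the call site.
    n = 0
    for a, b in pairs:
        if (a & MASK) < (b & MASK):
            n += 1
    return n
```

Measuring both with the timeit module typically shows the inlined version ahead on CPython, since each interpreter-level call carries frame-setup overhead; the exact ratio varies by interpreter and workload.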
