2013-12-09

Compressing the Go cross-compiler

This blog post presents my findings about the compressibility of the Go 1.2 cross-compiler binaries, running on Linux i386 and generating code for various operating systems and architectures. It also adds data points for comparing data compression algorithms and tools.

TL;DR: 7-Zip creates both the smallest archive and the smallest self-extracting archive.

Input files (all uncompressed):

  • 5 binaries (go, gofmt, cgo, fix, yacc) written in Go, 15_599_640 bytes in total.
  • 18 binaries (5a, 5c, 5g, 5l, 6a, 6c, 6g, 6l, 8a, 8c, 8g, 8l, addr2line, dist, nm, objdump, pack, pprof) written in C, 15_655_780 bytes in total.
  • 623 libraries (uncompressed .a files containing object files), written mostly in Go with some parts in C, 249_375_024 bytes in total.
  • 147 very short placeholder .go source files, each containing only a package PACKAGENAME line, 2_034 bytes in total.
  • 712 library symlinks (an .a pointing to another .a file).
  • 476 directories, nested up to 7 levels deep.

All binaries are statically linked, compiled for Linux i386 (x86, 32-bit). Libraries are for various operating systems (Linux, FreeBSD, Mac OS X, Windows) and architectures (i386, amd64, ARM). All files were compiled from the go1.2 sources.

Archive sizes:

  • .tar: 281_886_720 bytes (uncompressed)
  • .zip: 117_874_697 bytes
  • .zip without symlink duplicates: 68_160_582 bytes
  • .tar.gz: 67_984_293 bytes
  • .tar.bz2: 52_344_038 bytes
  • .tar.xz: 36_832_488 bytes
  • .rar: 36_347_393 bytes
  • .7z: 25_776_183 bytes
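
To make the sizes above easier to compare, here is a minimal Python sketch that expresses each archive as a percentage of the uncompressed .tar. The byte counts are copied from the list; the names TAR_SIZE, ARCHIVE_SIZES, percent_of_tar and ratio_table are just illustrative.

```python
# Archive sizes in bytes, copied from the "Archive sizes" list.
TAR_SIZE = 281_886_720
ARCHIVE_SIZES = {
    ".zip": 117_874_697,
    ".zip without symlink duplicates": 68_160_582,
    ".tar.gz": 67_984_293,
    ".tar.bz2": 52_344_038,
    ".tar.xz": 36_832_488,
    ".rar": 36_347_393,
    ".7z": 25_776_183,
}


def percent_of_tar(size):
    """Return size as a percentage of the uncompressed .tar size."""
    return 100.0 * size / TAR_SIZE


def ratio_table():
    """Return (name, percentage) pairs, smallest archive first."""
    rows = [(name, percent_of_tar(size)) for name, size in ARCHIVE_SIZES.items()]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, pct in ratio_table():
        print(f"{name:32s} {pct:5.1f}% of .tar")
```

By this measure the .7z is roughly 9.1% of the .tar, while the .tar.xz is roughly 13.1%.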

Archive creation commands:

  • .zip: zip -9r multigo1.2_linux_386.zip multigo1.2_linux_386
  • .tar: tar cvf multigo1.2_linux_386.tar multigo1.2_linux_386
  • .tar.gz: tar czvf multigo1.2_linux_386.tar.gz multigo1.2_linux_386
  • .tar.bz2: tar cv multigo1.2_linux_386 | bzip2 -c9 >multigo1.2_linux_386.tar.bz2
  • .tar.xz: tar cv multigo1.2_linux_386 | xz -7 --memlimit=100M >multigo1.2_linux_386.tar.xz
  • .7z: 7z a -t7z -mx=7 -md=8m -ms=on multigo1.2_linux_386.sfx.7z multigo1.2_linux_386
  • .rar: rar a -r -s -m5 multigo1.2_linux_386.rar multigo1.2_linux_386

Feature matrix of the archivers:

  • rar: files, directories, permission bits, last modification times
  • zip: same as above
  • 7z (7-Zip): all above, plus symlinks
  • tar: all above, plus file owner users and groups

Self-extracting archive binary sizes for Linux i386:

  • uncompressed: 281_892_240 bytes (dynamically linked, just writes the .tar to stdout, 5_520 bytes larger than the .tar)
  • upx --best: 50_215_752 bytes (the uncompressed binary above was compressed)
  • upx --lzma: 37_810_484 bytes
  • .sfx.rar: 36_481_433 bytes (dynamically linked, 134_040 bytes longer than the .rar, self-extraction code not exactly prepended)
  • .sfx.7z: 25_897_327 bytes (statically linked, see how to create, 121_144 bytes of self-extraction code prepended)
  • upx --brute: out of memory (wanted to use more than 3 GB of RAM)
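
The overhead figures quoted in parentheses above can be rechecked by subtracting each plain archive size from its self-extracting counterpart. A small sketch, with all byte counts taken from this post and the names SFX_PAIRS and stub_overhead made up for illustration:

```python
# (self-extracting size, plain archive size) in bytes, both from this post.
SFX_PAIRS = {
    "uncompressed": (281_892_240, 281_886_720),  # binary vs .tar
    ".sfx.rar": (36_481_433, 36_347_393),        # vs .rar
    ".sfx.7z": (25_897_327, 25_776_183),         # vs .7z
}


def stub_overhead(name):
    """Return how many bytes the self-extracting variant adds to the archive."""
    sfx_size, archive_size = SFX_PAIRS[name]
    return sfx_size - archive_size


if __name__ == "__main__":
    for name in SFX_PAIRS:
        print(f"{name:14s} +{stub_overhead(name):_} bytes")
```

Even with its stub included, the .sfx.7z stays well below the size of the plain .rar.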

Note: rar doesn't support symlinks; it follows them and duplicates the files at compression time. This didn't cause a major size increase, because solid archiving (rar a -s) puts these files next to each other, so each duplicate is emitted as a single repetition of the previous file instance.
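
The effect of solid archiving on duplicated files can be illustrated with a small experiment. This is only a sketch: it uses LZMA from Python's lzma module as a stand-in (not rar's own compressor) and random bytes as a stand-in for a library file; separate_vs_solid and demo are illustrative names.

```python
import lzma
import os


def separate_vs_solid(blob, copies=2):
    """Return (total size when each copy is compressed on its own,
    size when all copies are compressed as one stream)."""
    separate = sum(len(lzma.compress(blob)) for _ in range(copies))
    solid = len(lzma.compress(blob * copies))
    return separate, solid


def demo(size=200_000):
    # Random bytes are incompressible, so any saving in the solid case
    # comes purely from the second copy repeating the first.
    return separate_vs_solid(os.urandom(size))


if __name__ == "__main__":
    separate, solid = demo()
    print(f"compressed separately: {separate} bytes")
    print(f"compressed as one stream: {solid} bytes")
```

In the single-stream case the second copy costs almost nothing, as long as both copies fit in the compressor's dictionary.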

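Note: zip doesn't support symlinks either; it follows each symlink and saves the file again. Because zip compresses every file independently, the duplicates share nothing, and the archive becomes considerably larger (117_874_697 bytes versus 68_160_582 bytes without the symlink duplicates).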
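
How much following symlinks inflates an archive is easy to reproduce on a toy tree. The sketch below assumes a POSIX system and uses Python's tarfile module rather than zip or rar, because its dereference option switches between storing a symlink as a link and storing the target's contents again. The file names real.a and alias.a are made up.

```python
import os
import tarfile
import tempfile


def tar_size(src_dir, follow_symlinks):
    """Write src_dir to an uncompressed .tar and return its size in bytes."""
    fd, tar_path = tempfile.mkstemp(suffix=".tar")
    os.close(fd)
    try:
        with tarfile.open(tar_path, "w", dereference=follow_symlinks) as tar:
            tar.add(src_dir, arcname="tree")
        return os.path.getsize(tar_path)
    finally:
        os.remove(tar_path)


def demo(size=1_000_000):
    """Return (size with symlink kept, size with symlink followed)."""
    with tempfile.TemporaryDirectory() as tmp:
        tree = os.path.join(tmp, "tree")
        os.mkdir(tree)
        with open(os.path.join(tree, "real.a"), "wb") as f:
            f.write(os.urandom(size))
        # alias.a is a symlink to real.a, like the library symlinks here.
        os.symlink("real.a", os.path.join(tree, "alias.a"))
        return tar_size(tree, False), tar_size(tree, True)


if __name__ == "__main__":
    kept, followed = demo()
    print(f"symlink stored as link: {kept} bytes")
    print(f"symlink followed:       {followed} bytes")
```

With the symlink followed, the toy archive roughly doubles, which is the same effect the 712 library symlinks have on the .zip.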

Note: The self-extraction code of 7-Zip (7zCon.sfx) is 121_144 bytes. It is included in the .sfx.7z size, but not in the .7z size above.

Note: rar and 7-Zip store file last modification times and permission bits, but not the file owner or group.

Note: The fundamental difference between .7z and .xz is that 7z has a special preprocessing step (BCJ and BCJ2) for executable binaries, and BCJ is enabled by default. This preprocessing makes the final size much smaller.
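
To give a feel for why such preprocessing helps, here is a toy Python version of the idea behind an x86 BCJ filter: the relative operand of each CALL instruction (opcode 0xE8) is rewritten as an absolute address, so calls to the same function from different places turn into identical byte sequences that the compressor can match. This is a simplified sketch, not 7-Zip's actual filter: the real one uses extra heuristics and handles more cases. The names bcj_x86_toy and call_to are made up for this example.

```python
import struct


def bcj_x86_toy(data, encode=True):
    """Rewrite CALL rel32 operands to absolute addresses (or back)."""
    out = bytearray(data)
    i = 0
    while i + 5 <= len(out):
        if out[i] == 0xE8:  # CALL with a 32-bit relative operand
            (operand,) = struct.unpack_from("<I", out, i + 1)
            next_ip = i + 5  # the operand is relative to the next instruction
            if encode:
                operand = (operand + next_ip) & 0xFFFFFFFF
            else:
                operand = (operand - next_ip) & 0xFFFFFFFF
            struct.pack_into("<I", out, i + 1, operand)
            i += 5  # skip the operand so it is never read as an opcode
        else:
            i += 1
    return bytes(out)


def call_to(target, position):
    """Encode a CALL located at position that jumps to target."""
    rel = (target - (position + 5)) & 0xFFFFFFFF
    return b"\xE8" + struct.pack("<I", rel)


if __name__ == "__main__":
    # Two calls to the same function at 0x1000, separated by NOPs.
    code = call_to(0x1000, 0) + b"\x90" * 11 + call_to(0x1000, 16)
    filtered = bcj_x86_toy(code)
    print(code[0:5].hex(), code[16:21].hex())          # operands differ
    print(filtered[0:5].hex(), filtered[16:21].hex())  # operands match
```

The transformation is reversible, so the extractor can restore the original bytes after decompression.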
