pts.blog: 2013-12

2013-12-31

How to implement in-place set intersection in C++

This blog post shows and explains C++ source code implementing in-place set intersection, i.e. removing each element from a set (or another sorted container) *bc which is not also a member of container bc.

The std::intersection function template in the <algorithm> header in C++ standard template library populates a new output set, adding all elements in the intersection into it. This can be too slow and a waste of memory if one of the inputs is not needed afterwards. In this case an in-place intersection is desired instead, but unfortunately such a function template is not part of the C++ standard template library.

Here is a simple in-place implementation which looks up each element of *bc in ac, and removes (erases) it from *bc if not found:

#include <set>

// Remove elements from bc which are missing from ac.
//
// The time required is proportional to log(ac.size()) * bc->size(), so it's
// faster than IntersectionUpdate if ac is large compared to bc.
template<typename Input, typename Output>
static void IntersectionUpdateLargeAc(
    const std::set<Input> &ac, std::set<Output> *bc) {
  const typename std::set<Input >::const_iterator a_end = ac.end();
  const typename std::set<Output>::const_iterator b_end = bc->end();
  for (typename std::set<Output>::iterator b = bc->begin(); b != b_end; ) {
    if (ac.find(*b) == a_end) {  // Not found.
      // Not a const_iterator, erase wouldn't accept it until C++11.
      const typename std::set<Output>::iterator b_old = b++;
      bc->erase(b_old);  // erase doesn't invalidate b.
    } else {
      ++b;
    }
  }
}

Removing from *bc above is a bit tricky, because we don't want to invalidate the iterator b. In C++11 erase returns a new iterator, which is just after the removed elements, but we don't use that just to be backwards-compatible. Instead of that we make use of the fact that iterators to the non-removed elements are kept intact for set, multiset and list, so we create the temporary iterator b_old, which will be invalidated, but b remains valid.

We need the typename keyword in local variable declarations, because they have a dependent type (i.e. a type whose identifier is within another type specified by a template parameter.)

The time complexity is O(log(as) · bs), so it is fast if ac is large when compared to *bc. For example, when as = 3^k and bs = k, then it's O(k²).

As an alternative, we could iterate over the two sets in increasing (ascending) order at the same time, similarly to the merge operation (as implemented by std::merge in mergesort, but dropping elements from *bc if there is no corresponding element in ac. One possible implementation:

#include <set>

// Remove elements from bc which are missing from ac.
//
// The time required is proportional to ac.size() + bc->size().
template<typename Input, typename Output>
static void IntersectionUpdate(
    const std::set<Input> &ac, std::set<Output> *bc) {
  typename std::set<Input>::const_iterator a = ac.begin();
  const typename std::set<Input>::const_iterator a_end = ac.begin();
  typename std::set<Output>::iterator b = bc->begin();
  const typename std::set<Output>::iterator b_end = bc->end();
  while (a != a_end && b != b_end) {
    if (*a < *b) {
      ++a;
    } else if (*a > *b) {
      const typename std::set<Output>::iterator b_old = b++;
      bc->erase(b_old);  // erase doesn't invalidate b.
    } else {  // Elements are equal, keep them in the intersection.
      ++a;
      ++b;
    }
  }
  bc->erase(b, b_end);  // Remove remaining elements in bc but not in ac.
}

The time complexity of this above (IntersectionUpdate) is O(as + bs), which is faster than IntersectionUpdateLargeAc if ac is not much smaller than *bc. For example, when as = 3^k and bs = k, then it's O(3^k + k), so IntersectionUpdateLargeAc is faster.

Example usage of both (just to see if they compile):

int main(int, char**) {
  std::set<int> a, b;
  IntersectionUpdateLargeAc(a, &b);
  IntersectionUpdate(a, &b);
  return 0;
}

It's natural to ask if these function templates can be generalized to C++ containers other than set. They take advantage of the input being sorted, so let's consider sorted std::vector, sorted std::list and std::multiset in addition to std::set. To avoid the complexity of having to distinguish keys from values, let's ignore std::map and std::multimap.

The generalization of IntersectionUpdateLargeAc from set to multiset is trivial: no code change is necessary. The std::multiset::find operation returns any matching element, which is good for us. However, with IntersectionUpdate, the last ++a; must be removed: without the removal subsequent occurrences of the same value in *bc would be removed if ac contains this value only once. No other code change is needed. It is tempting to introduce a loop in the previous (*a > *b) if branch:

for (;;) {
  const typename Output::iterator b_old = b++;
  const bool do_break = b == b_end || *b_old != *b;
  bc->erase(b_old);  // erase doesn't invalidate b.
  if (do_break) break;
}

However, this change is not necessary, because subsequent equal values in *bc would be removed in subsequent iterations of the outer loop.

Here are the full generalized implementations:

#if __cplusplus >= 201103 || __GXX_EXPERIMENTAL_CXX0X__
#include <type_traits>
#endif

// Remove elements from bc which are missing from ac. Supported containers for 
// bc: list (only if sorted), vector (only if sorted), set, multiset. Supported
// containers for ac: set, multiset.
//
// The time required is proportional to log(ac.size()) * bc->size(), so it's
// faster than IntersectionUpdate if ac is large compared to bc.
template<typename Input, typename Output>
static void IntersectionUpdateLargeAc(const Input &ac, Output *bc) {
#if __cplusplus >= 201103 || __GXX_EXPERIMENTAL_CXX0X__
  // We could use std::is_convertible (both ways) instead of std::is_same.
  static_assert(std::is_same<typename Input::value_type,
                             typename Output::value_type>::value,
                "the containers passed to IntersectionUpdateLargeAc() need to "
                "have the same value_type");
#endif
  const typename Input::const_iterator a_end = ac.end();
  const typename Output::const_iterator b_end = bc->end();
  for (typename Output::iterator b = bc->begin(); b != b_end; ) {
    if (ac.find(*b) == a_end) {  // Not found.
      // Not a const_iterator, erase wouldn't accept it until C++11.
      const typename Output::iterator b_old = b++;
      bc->erase(b_old);  // erase doesn't invalidate b.
    } else {
      ++b;
    }
  }
}

// Remove elements from bc which are missing from ac. Supported containers for 
// ac and bc: list (only if sorted), vector (only if sorted), set, multiset.
template<typename Input, typename Output>
static void IntersectionUpdate(const Input &ac, Output *bc) {
#if __cplusplus >= 201103 || __GXX_EXPERIMENTAL_CXX0X__
  static_assert(std::is_same<typename Input::value_type,
                             typename Output::value_type>::value,
                "the containers passed to IntersectionUpdate() need to "
                "have the same value_type");
#endif
  typename Input::const_iterator a = ac.begin();
  const typename Input::const_iterator a_end = ac.end();
  typename Output::iterator b = bc->begin();
  // Can't be a const interator, similarly to b_old.
  const typename Output::iterator b_end = bc->end();
  while (a != a_end && b != b_end) {
    if (*a < *b) {
      ++a;
    } else if (*a > *b) {
      const typename Output::iterator b_old = b++;
      bc->erase(b_old);  // erase doesn't invalidate b.
    } else {  // Elements are equal, keep it in the intersection.
      // Don't do ++a, in case ac is a multiset.
      ++b;
    }
  }
  bc->erase(b, b_end);  // Remove remaining elements in bc but not in ac.
}

These work as expected for set, multiset and sorted list. It also doesn't require that the two containers are of the same kind. For C++0x and C++11, an extra static_assert is present in the code to print a helpful compact error message if the base types are different.

However, when *bc is a vector, we get a compile error, because in C++ older than C++11, std::vector::erase doesn't return an iterator (but it returns void). Even if we could get an iterator, b_end would be invalidated by erase, because it's behind it. This is easy to fix, we should use bc->end() instead of b_end everywhere. However, if we didn't make any other changes, the algorithm would be slower than necessary, because std::vector::erase moves each element behind the erased one. So the time complexity would be O(as + bs²). To speed it up, let's swap the to-be-removed elements with the element with the last element of the vector, and to the actual removal at the end of the function:

#if __cplusplus >= 201103 || __GXX_EXPERIMENTAL_CXX0X__
#include <type_traits>
#include <utility>  // std::swap.
#else
#include <algorithm>  // std::swap.
#endif

// Template specialization for vector output.
template<typename Input, typename T>
static void IntersectionUpdate(const Input &ac, std::vector<T> *bc) {
#if __cplusplus >= 201103 || __GXX_EXPERIMENTAL_CXX0X__
  static_assert(std::is_same<typename Input::value_type, T>::value,
                "the containers passed to IntersectionUpdate() need to "
                "have the same value_type");
#endif
  typename Input::const_iterator a = ac.begin();
  const typename Input::const_iterator a_end = ac.end();
  typename std::vector<T>::iterator b = bc->begin();
  // Elements between b_high an bc->end() will be removed (erased) right before
  // the function returns. We defer their removal to save time.
  typename std::vector<T>::iterator b_high = bc->end();
  while (a != a_end && b != b_high) {
    if (*a < *b) {
      ++a;
    } else if (*a > *b) {
      std::iter_swap(b, --b_high);  // Works even if swapping with itself.
    } else {  // Elements are equal, keep them in the intersection.
      ++a;
      ++b;
    }
  }
  bc->erase(b, bc->end());  // Remove remaining elements in bc but not in ac.
}

Once we have the generic implementation and the special implementation for vector in the same file, the C++ compiler would take care of choosing the right (most specific) one depending on whether *bc is a vector or not. So all these work now:

#include <list>
#include <set>
#include <vector>

int main(int, char**) {
  std::set<int> s;
  std::multiset<int> ms;
  std::vector<int> v;
  // std::list<unsigned> l;  // Won't work in C++0x and C++11.
  std::list<int> l;
  IntersectionUpdate(s, &ms);
  IntersectionUpdate(ms, &v);
  IntersectionUpdate(v, &l);
  IntersectionUpdate(l, &s);
  IntersectionUpdateLargeAc(s, &ms);
  IntersectionUpdateLargeAc(ms, &v);
  // IntersectionUpdateLargeAc(v, &l);  // v is not good as ac.
  // IntersectionUpdateLargeAc(l, &s);  // l is not good as ac.
  IntersectionUpdateLargeAc(s, &l);
  IntersectionUpdateLargeAc(ms, &s);
  return 0;
}

The full source code is available on GitHub.

2013-12-26

Tiny 7z archive extractors and SFX for Linux and Windows

This blog post contains links to the tiniest programs that can extract 7z (7-Zip) archives on Linux and Windows i386. These programs can also be used as SFX (self-extract archives) by appending a .7z file to them.

Please see my previous post about self-extracting 7z archives. In that post the smallest Linux extractor is of 118 kB, and the smallest Windows console extractor (the official one) is of 149 kB. The Linux extractor is based on the p7zip C++ sources, statically linked against uClibc and compiled with some GCC flags optimized for small binary size.

The smallest known Windows SFX extractor, 7zS2Con.sfx can be found in 7z922_extra.7z. It's of 27648 bytes (27 kB). It's based on the official ANSI C 7-Zip extractor sources, i.e. the C subdirectory in 7z922.tar.bz2. The fundamental difference between this smallest SFX and the normal Windows SFX is the implementation language (C vs C++) and the feature set (e.g. larger memory requirements, not all algorithms supported, encryption not supported, see the Limitations section below). See also this original forum topic for a discussion about features and limitations.

I've compiled the ANSI C 7-Zip extractor using uClibc and the size-optimized GCC flags, fixed some bugs, added some features (e.g. mtime, permissions, symlink, SFX mode), and compressed the binary using UPX. The result is the smallest known extractor (and SFX extractor) for Linux: 7z9.22LinuxI386ConTiny.sfx, 24116 bytes (24 kB, smaller than the Windows extractor, linked using diet libc). The sources are available as a GIT repostitory. The sources are optimized for Linux (they work on both i386 and amd64), but they should compile and work other Unix systems as well.

Features of the tiny Linux extractor

Small (the Linux statically linked binary is less than 40 kB).
Can be used to create a SFX (self-extract) binary by prepending to a 7z archive. (Same as the `7z -sfx' flag.)
It supports file and directory attributes (i.e. it calls chmod(2)).
It sets the mtime (i.e. it calls utimes(2)).
It can extract symlinks.
Has a command-line syntax compatible with the regular console SFX binaries.

Limitations of the tiny extractors (Windows and Linux)

It supports only: LZMA, LZMA2, BCJ, BCJ2, COPY.
It keeps an uncompressed version of each file in RAM.
It decompresses solid 7z blocks (it can be whole 7z archive) to RAM. So user that calls SFX installer must have free RAM of size of largest solid 7z block (size of 7z archive at simplest case).
The Windows extractor overwrites files without asking.
It always extracts to the current directory.
It does not support (and may misbehave for) encryption in archives.

2013-12-25

How to make smaller C and C++ binaries

This blog post presents several techniques to make the binaries resulting from C or C++ compilation smaller with GCC (or Clang). Please note that almost all techniques are tradeoffs, i.e. a smaller binary can be slower and harder to debug. So don't use the techniques blindly before understanding the tradeoffs.

The recommended GCC (and Clang) flags:

Use -s to strip debug info from the binary (and don't use -g).
Use -Os to optimize for output file size. (This will make the code run slower than with -O2 or -O3.
Use -m32 to compile a 32-bit binary. 32-bit binaries are smaller than 64-bit binaries because pointers are shorter.
In C++, use -fno-exceptions if your code doesn't use exceptions.
In C++, use -fno-rtti if your code doesn't use RTTI (run-time type identification) or dynamic_cast.
In C++, use -fvtable-gc to let the linker know about and remove unused virtual method tables.
Use -fno-stack-protector .
Use -fomit-frame-pointer (this may make the code larger on amd64).
Use -ffunction-sections -fdata-sections -Wl,--gc-sections . Without this all code from each needed .o file will be included. With this only the needed code will be included.
For i386, use -mpreferred-stack-boundary=2 .
For i386, use -falign-functions=1 -falign-jumps=1 -falign-loops=1 .
In C, use -fno-unwind-tables -fno-asynchronous-unwind-tables . Out of these, -fno-asynchronous-unwind-tables makes the larger difference (can be several kilobytes).
Use -fno-math-errno, and don't check the errno after calling math functions.
Try -fno-unroll-loops, sometimes it makes the file smaller.
Use -fmerge-all-constants.
Use -fno-ident, this will prevent the generation of the .ident assembler directive, which adds an identification of the compiler to the binary.
Use -mfpmath=387 -mfancy-math-387 to make floating point computations shorter.
If you don't need double precision, but float preecision is enough, use -fshort-double -fsingle-precision-constant .
If you don't need IEEE-conformat floating point calculations, use -ffast-math .
Use -Wl,-z,norelro for linking, which is equivalent to ld -z norelro .
Use -Wl,--hash-style=gnu for linking, which is equivalent to ld --hash-style=gnu . You may also try =sysv instead of =gnu, sometimes it's smaller by a couple of bytes. The goal here is to avoid =both, which is the default on some systems.
Use -Wl,--build-id=none for linking, which is equivalent to ld --build-id=none .
Get more flags from the Os list in diet.c of diet libc, for about 15 architectures.
Don't use these flags: -pie, -fpie, -fPIE, -fpic, -fPIC. Some of these are useful in shared libraries, so enable them only when compiling shared libraries.

Other ways to reduce the binary size:

Run strip -S --strip-unneeded --remove-section=.note.gnu.gold-version --remove-section=.comment --remove-section=.note --remove-section=.note.gnu.build-id --remove-section=.note.ABI-tag on the resulting binary to strip even more unneeded parts. This replaces the gcc -s flag with even more aggressive stripping.
If you are using uClibc or diet libc, then additionally run strip --remove-section=.jcr --remove-section=.got.plt on the resulting binary.
If you are using uClibc or diet libc with C or C++ with -fno-exceptions, then additionally run strip --remove-section=.eh_frame --remove-section=.eh_frame_ptr on the resulting binary.
After running strip ... above, also run sstrip on the binary. Download sstrip from ELF Kickers, and compile it for yourself. Or get the 3.0a binary from here.
In C++, avoid STL. Use C library functions instead.
In C++, use as few template types as possible (i.e. code with vector<int> and vector<unsigned> is twice as long as the code with vector<int> only).
In C++, have each of your non-POD (plain old data) classes an explicit constructor, destructor, copy-constructor and assignment operator, and implement them outside the class, in the .c file.
In C++, move constructor, destructor and method bodies outside the class, in the .c file.
In C++, use fewer virtual methods.
Compress the binary using UPX. For small binaries, use upx --brute or upx --ultra-brute . For large binaries, use upx --lzma . If you have large initialized arrays in your code, make sure you declare them const, otherwise UPX won't compress them.
Compress the used libraries using UPX.
If you use static linking (e.g. gcc -static), use uClibc (most convenient way: pts-xstatic or diet libc (most convenient way: the included diet tool) or musl (most convenient way: the included musl-gcc tool) instead of glibc (GNU C library).
Make every function static, create a .c file which includes all other .c files, and compile that with gcc -W -Wall. Remove all code to which the compiler says is unused. Last time this saved about 9.2 bytes per function for me.
Don't use __attribute__((regparm(3))) on functions, it tends to make the code larger.
If you have several binaries and shared libraries, consider unifying the binaries into a single one (using symlinks and distinguishing in main with argv[0]), and moving the library code to the binary. This is useful, because the shared libraries use position-independent code (PIC), which is larger.
If it's feasible, rewrite your C++ code as C. Once it's C, it doesn't matter if you compile it with gcc or g++.
If your binary is already less than 10 kilobytes, consider rewriting it in assembly, and generating the ELF headers manually, see the tiny ELF page for inspiration.
If your binary is already less than 10 kilobytes, and you don't use any libc functions, use a linker script to generate tiny ELF headers. See the tarball with the linker script.
Drop the --hash-style=... flag passed to ld by gcc. To do so, pass the -Bmydir flag to gcc, and create the executable mydir/ld, which drops these flags and calls the real ld.
See more flags and ideas in this answer.

2013-12-12

Announcing pts-clang: A portable LLVM Clang C and C++ compiler release running on Linux i386 and Linux x86_64

This blog post announces pts-clang, a portable LLVM Clang C and C++ compiler release running on Linux i386 and Linux x86_64.

pts-clang is a portable Linux i386 version of the clang tool of the LLVM Clang compiler, version 3.3. C and C++ compilation is supported, other frontends (such as Objective C) were not tested. The tool runs on Linix i386 and Linux amd64 systems. It's libc-independent, all code is statically linked to the binary. It also contains statically linked linker (ld), so installing binutils is not necessary.

See all details in the most recent README.

Trying it out

If you don't have root access and you don't have the libc headers (e.g. in the libc6-dev package, sample file /usr/include/stdio.h) installed, you can still try pts-clang, with pts-xstatic. See the details in this blog post.

If you don't have /usr/include/stdio.h, install the libc development package:

$ sudo apt-get install libc6-dev

To be able to compile both i386 and amd64 binaries, and you have a recent Linux distribution (e.g. Ubuntu Precise) you can install the libc6-dev package for both architectures:

$ sudo apt-get intall libc6-dev:i386 libc6-dev:x86_64

Download and try pts-clang like this:

$ cd /tmp
$ rm -f pts-clang-latest.sfx.7z
$ wget http://pts.50.hu/files/pts-clang/pts-clang-latest.sfx.7z
$ chmod +x pts-clang-latest.sfx.7z
$ ./pts-clang-latest.sfx.7z -y  # Creates the pts-clang directory.
$ cat >>hw.c <<'END'
#include <stdio.h>
int main(void) {
  return !printf("Hello, %s!\n", "World");
}
$ pts-clang/bin/clang -s -O2 -W -Wall hw.c
$ ./a.out
Hello, World!
END

Use clang++ for compiling C++ code, but for that you have to install one of the libstdc++...-dev packages first.

Does pts-clang create portable executables?

By default (without the -xstatic or -xermine flags), the executables created by pts-clang are just as portable as those generated by gcc or clang. They are dynamically linked (unless -static is specified), thus they depend on the system libraries (e.g. /lib/libc.so.6).

If the -static flag is specified, then the executable becomes statically linked, but this doesn't provide real portability, because for calls such as getpwent(3) (getting info of Unix system users) and gethostbyname(3) (DNS resolution), glibc loads files such as libnss_compat.so, libnss_dns.so. On the target system those libraries may be incompatible with your binary, so you may get a segfault or unintended behavior. pts-xstatic solves this, because it uses uClibc.

If the -xstatic flag is specified, pts-xstatic is used to create a portable statically linked, Linux i386 executable, linked against uClibc.

If the -xermine flag is specified, Ermine is used to pack library and other dependencies to a single, portable executable. This can be even more portable than -xstatic, because Ermine can pack locale files, gconv libraries etc. Ermine is a Linux ELF portable executable creator: it takes a dynamically linked ELF executable, discovers its dependencies (e.g. dynamic libraries, NSS libaries), and builds a protable, statically linked ELF executable containing all the dependencies. See the features, licensing information and get Ermine from here. The result can be even more portable than -xstatic, because Ermine can pack locale files, gconv libraries etc. Not all the packing is automatic: use -xermine,... to specify packing flags to Ermine.

Portability improvements

pts-clang is portable (libc-independent): all shipped binaries are either statically linked. (The clang binary is packed with Ermine, the file is a statically linked executable, which contains a dynamically linked executables and its dependencies (e.g. libc etc.) with itself.) The system libraries are not used for running the compiler (but they are used for linking the output file, except when -xstatic is specified).
A statically linked linker (ld, GNU gold) binary is provided, so GNU binutils is not a requirement for compilation on the host system.
Some other optional, statically linked binutils tools (ar, ranlib and strip) are also provided for convenience in the pts-static-binu binary release, see more info in its README. These tools can be used for auxiliary tasks such as building static libraries.

Because of these portability improvemenets, it's easy to run pts-clang in a chroot environment.

C++11 and C++0x compatibility

Please note that even though Clang 3.3 supports C++11, much of that is implemented in the C++ standard library (GCC's libstdc++ or Clang's libc++) header files, and no attempt is made in pts-clang to provide the most up-to-date C++ standard library. With -xstatic, an old libstdc++ (the one from gcc-4.4.3) is provided, and without -xstatic the system's default libstdc++ will be used, which can be older than C++11.

Author, copyright and recompilation

The binaries here were created by Péter Szabó, using existing LLVM Clang and uClibc cross compiler and other binaries, and writing some custom trampoline code. See the details in the GitHub repository.

All software mentioned in this blog post is free software and open source, except for Ermine.

Thanks to Ermine

The author of pts-clang is grateful and says thank you to the author of Ermine, who has provided a free-of-charge Ermine license, using which the portable clang.bin binary was created from the official Clang binary release (which is libc-dependent).

If you want to create portable Linux executables (and you don't care too much about file size), give Ermine a try! It's the most comfortable, easy-to-use, and comprehensive tool available.

Installation instructions for the old version (v1)

Download the old version (v1) from here: http://pts.szit.bme.hu/files/pts-clang/pts-clang-xstatic-bin-3.3-linux-i386-v1.sfx.7z You can extract it with 7z (in the p7zip package), but you can also make it executable and run it, because it's a self-extracting archive.

Clang itself is a cross-compiler, so it can generate object files for many architectures and operating systems (see the -target flag), but you also need a system-specific linker (not included) to build binaries or libraries. In newer versions (v2 and later), a copy of the GNU gold linker is also incuded.

This release introduces a non-standard command-line flag -xstatic which enables the Linux i386 target with static linking using the bundled uClibc library. The required .h and .a files, as well as a portable GNU ld linker binary are also included for the use of this flag. In newer versions (v2 and later) the files required by -xstatic are available in a separate download, see the details in this blog post.

Installation instructions for the even old version (v0)

Download version v0 from here: pts-clang-xstatic-bin-3.3-linux-i386.sfx.7z. You can extract it with 7z (in the p7zip package), but you can also make it executable and run it, because it's a self-extracting archive.

More info

See the most recent README for full installation instructions, usage details, full feature list etc.

2013-12-10

How to install Ubuntu Precise to a KVM guest on a headless server

This blog post explains how to install Ubuntu Precise (or other Ubuntu versions) to a virtualized KVM guest on a headless Linux server, i.e. on a server without a display connected or graphics (X11) software installed. We assume that the server host is Ubuntu or Debian. With some modifications the instructions can be made work on other Linux systems as well. The step-by-step instructions below contain explanations and troubleshooting tips as well.

The installation of the Ubuntu system in the KVM guest will be carried out over a VNC remote desktop connection. So in addition to the headless server host, you'll need a client desktop computer with a Linux GUI (X11) already installed. The default installation of any recent Linux distribution is fine. You may also use a client desktop running Mac OS X or Windows, but specific instructions for that are not provided in this howto. Also it is possible to install a headless Ubuntu guest automatically, without a GUI installer, but this topic is not covered in this howto.

The $ sign in front of the commands below indicates that these should be run as a regular user (not root) on the server. Please omit the $ sign when copy-pasting the commands. The -$ sign indicates that the command should be run on the client desktop (without the sign itself).

Autodetecting and installing KVM

If you want to check if KVM is supported on the host server before installing anything, follow these instructions. In a nutshell,

$ grep -Ec '(vmx|svm)' /proc/cpuinfo

$ grep -c hvm /sys/hypervisor/properties/capabilities

should print a number larger than 0 on the server host.

Install KVM on the server host:

$ sudo apt-get install qemu-kvm cpu-checker virtinst

Add yourself to the kvm and libvirtd groups:

$ sudo adduser "$USER" kvm
$ sudo adduser "$USER" libvirtd

Check that you are indeed a group member (both numbers must be larger than 0):

$ id | grep -c '(kvm)'
1
$ id | grep -c (libvirtd)'
1

If you are not, then login again, and check again.

Check whether KVM is supported on the server host:

$ /usr/sbin/kvm-ok
INFO: /dev/kvm exists
KVM acceleration can be used

Preparations on the server host

Download the Ubuntu .iso installer image from the Ubuntu alternative downloads page. Use the 32-bit download if your virtual machine has at most 2 GB of RAM. Even with more RAM, you may want to use the 32-bit, because it uses less memory. For me download filename was ubuntu-12.04.3-desktop-i386.iso and the size was 707 MB. (You may consider using a non-desktop version of Ubuntu.)

Move the downloaded .iso to /var/tmp (or any other world-readable directory).

Make sure that the .iso installer image is world-readable:

$ sudo chmod a+r /var/tmp/ubuntu-12.04.3-desktop-i386.iso

Create a sparse disk image file for the guest: (10 GB in the example)

$ rm -f /var/tmp/precise0.img
$ dd if=/dev/null bs=1G seek=10 of=/var/tmp/precise0.img

Please note that the virt-install command below can also create the disk image (if its --disk flag value is modified), making the dd step above unnecessary. But I think it's more didactic and safer to keep these steps separated.

Come up with a password to be used for the incoming GUI VNC remote desktop connections. The example command uses IncomingVpcShinyApe below, but for security please invent your own password and use that instead.

Creating the guest and starting the installation

Start the installation on the server host with the command below. No need do run as root, and no need to run in screen, because it exits in less than half a minute. The 2048 stands for 2GB of RAM.

DISPLAY= virt-install --connect qemu:///system -n precise0 -r 2048 --os-type=linux --os-variant=ubuntuprecise --disk /var/tmp/precise0.img,device=disk,bus=virtio,format=raw --network user --cdrom /var/tmp/ubuntu-12.04.3-desktop-i386.iso --noautoconsole --graphics vnc,port=5965,listen=0.0.0.0,password=IncomingVncShinyApe

Starting install...
Creating domain...
Domain installation still in progress. You can reconnect to 
the console to complete the installation process.

If virt-install exits with an error message ERROR Guest name 'precise0' is already in use., then it may be a stale leftover from the previous runs. Remove it with the following command, and try virt-install again:

# Only if virt-install has exited with: already in use.
$ virsh -c qemu:///system undefine precise0

If virt-install exits with an error message such as ERROR Unable to read from monitor: Connection reset by peer, then take a look at the more specific error message here:

sudo less /var/log/libvirt/qemu/precise0.log

Specifying qemu:///system above and below is essential, because there is also qemu:///session, and various tools (such as virt-install and virsh) have different defaults.

Specifying --noautoconsole above makes sense (but it is not essential) on a headless server, because otherwise virt-install would attempt to start a VNC client, but that wouldn't work on the server without a GUI.

Don't worry if CPU usage on the server host jumps up to 100%. That's because the Unbuntu installer is booting from the installer CD .iso image in the KVM guest. It will go back to near 0% in a minute or so.

Each KVM guest has an XML configuration file. To see what virt-install has created, run this:

$ virsh -c qemu:///system dumpxml precise0

Preparations for the VNC connection

Check that the KVM guest is running and it's accepting incoming VNC connections:

$ ps $(sudo fuser -n tcp 5965 || echo -V) | cat

If you see procps version instead of a long (> 12 line) kvm command-line, then it's not running. Look at the log file above or the error messages printed by the previous commands to figure out what went wrong.

Please note that in the command-line printed by ps above, -vnc 0.0.0.0:65,password is normal. The actual password is not shown, and 65 is a correct port number, because VNC sometimes subtracts its base port number 5900. The server is actually listening on TCP port 5965.

On the server host, run the following command to check if the VNC connection can be made:

$ perl -we 'use IO::Socket::INET; alarm 3; my $s;
    die "error: $!\n" if !(
    $s=IO::Socket::INET->new(PeerAddr=>"127.0.0.1:5965"));
    my $buf; die "error: $!\n" if !sysread($s, $buf, 128);
    print $buf'
RFB 003.008

If it doesn't print a line starting with RFB, then it's a problem. Most probably there is a restrictive firewall active on the server host. Run this to fix it in a non-persistent way, and try the check again:

$ sudo iptables -I INPUT -p tcp --dport 5965 -i lo -j ACCEPT

The next big step is to use a VNC client to connect to the KVM guest, and complete the installation using the remote GUI. There are two was to connect:

An SSH tunnel (recommended), which is a little bit slower, but works over an SSH connection (port 22 by default).
A direct TCP connection, which is fast, but needs an end-to-end firewall configuration which lets the port 5965 go through. (Any port number at least 5900 is configurable.)

Start the SSH tunnel by running this from your client GUI desktop (substituting SERVERHOSTNAME with the host name of the server host running the KVM guest):

-$ ssh -L 5965:127.0.0.1:5965 SERVERHOSTNAME

Keep this SSH connection open while using the VNC GUI below.

If you want a direct TCP connection instead, then enable incoming TCP connections on port 5965 on the server host. A simple, nonpersistent way of doing so is running:

$ sudo iptables -I INPUT -p tcp --dport 5965 -j ACCEPT

Now it's time to test that your client desktop computer can connect to the KVM guest using VNC. Run the following command (not the same as the similar one above) on the client desktop, replacing XSERVERHOSTNAME with 127.0.0.1 if you are using the SSH tunnel, and with the server host name if you are using a direct TCP connection (works on Linux and Mac OS X):

$ perl -we 'use IO::Socket::INET; alarm 20; my $s;
    die "error: $!\n" if !(
    $s=IO::Socket::INET->new(PeerAddr=>"XSERVERHOSTNAME:5965"));
    my $buf; die "error: $!\n" if !sysread($s, $buf, 128);
    print $buf'
RFB 003.008

If it emits something starting with RFB, then it's OK, otherwise it's a network or firewall problem which you have to diagnose and fix.

Connecting using VNC and clicking through the graphical installer

On the client desktop Linux system you can use any of the following commands as a VNC remote desktop client: vinagre, gvncviewer, xtightvncviewer, xvnc4viewer. The name of the package containing them is the same. (Pick just any VNC client if you use Mac OS X or Wondows) The recommaned command is vinagre, if it's available. Install vinagre to the client desktop with:

-$ sudo apt-get install vinagre

Start your VNC client, and connect it to the host depicted by XSERVERHOSTNAME (see above), port number 65. You can connect from the GUI or start the VNC client by specifying the target host and port in the command line, for example (don't forget to replace XERVERHOSTNAME as above):

-$ vinagre XSERVERHOSTNAME:65

The VNC client will ask for a password. Type the same password you have specified in the virt-install command-line.

With a bit of luck, you'll now be seeing the Ubuntu installer welcome screen (language selection, Try Ubuntu button and Install Ubuntu button). Click through the installation as usual.

If your VNC or SSH connection gets broken, don't worry, this doesn't abort the installation on the KVM guest. Just restart the connections, and continue clicking in VNC.

Once the installation has finished and the system has rebooted to the installed Linux, you can remove the installer image:

-$ rm -f /var/tmp/ubuntu-12.04.3-desktop-i386.iso

The --network user is a zero-configuration setting which works out-of-the-box, but it's a slow for production use. One the system is set up, you may want to change it in your /etc/libvirt/qemu/precise0.xml (using virsh) to bridged networking for much higher performance.

When you don't need VNC anymore, you may want to remove the firewall rule on the server host:

$ sudo iptables -R INPUT -p tcp --dport 5965 -j ACCEPT

2013-12-09

How to add very large data blobs to C and C++ binaries

This blog post explains how to add large (>100 MB) data blobs to C and C++ binaries in an efficient way.

Short data

For short data, create a .c file containing this:

const char hi[14] = "Hello, World\077!\n";
const char *hi_end = hi + sizeof(hi) / sizeof(char);

And use it like this from another .c file:

#include 
extern const char hi[], *hi_end;
int main() {
  size_t hi_size = hi_end - hi;
  fwrite(hi, 1, hi_size, stdout);
  return 0;
}

When specifying the string literal, you have to escape not only \, " and nonprintable bytes (code less than 32 or larger than 126, this works for both no matter char is signed or unsigned), but also ?, because without escaping ? unintended trigraphs may get formed. Escape nonprintable bytes as exactly 3-digit octal literals, such as \012. Don't use shorter escapes or hex escapes, because they may get combined with the following bytes if they are in [0-9a-fA-F].

Another option is to specify the bytes as integer literals, but it can get quite long for 8-bit bytes, because char can be either signed or unsigned:

const char hi[] = {126, 127, (char)128, (char)129 };
char *hi_end = hi + sizeof(hi) / sizeof(char);

These solutions work for both C and C++.

Long data

Neither of the solutions above works well for for long blobs, because some C (or C++) compilers don't accept string literals longer than a few megabytes, and putting string literals next to each to get implicit implicit concatenation other also doesn't work if there are too many. Typical failure modes are: compile error, excessive memory usage, extremely slow (slower than O(n) or with a large constant) compilation.

If we don't care about the actual contents of most the array, there is an easy and quick solution:

const char hi[1234567] = "SIGNATURE";
const char *hi_end = hi + sizeof(hi) / sizeof(char);

This creates an .o file containing 1234567 bytes of data, starting with the bytes SIGNATURE and having 1234557 '\0' bytes afterwards. Surprisingly, gcc compiles this very quickly, with little memory use, generating a small (shorter than 400 bytes) .S (assembly) file, spending most of the time writing the .o file to disk. It's easy to prove the little memory usage by making the data size 400 MB, and limiting the virtual memory size to 30 MB (ulimit -v 30000).

So now we have an .o file with the correct size, with a signature and lots of '\0's instead of the real data. All we need to open the file, read the first few kilobytes, find the signature, and replace it and following the zeros with the real data. We can do this without having to understand the .o file format: it's enough to choose a long enough signature to make it unique within the .o file.

The cobjgen Perl script automates all this. Use it like this:

$ cobjgen input_blob.bin hi gcc -c hidatac.c
$ gcc -o progc main.c hidatac.o

$ cobjgen input_blob.bin hi g++ -c hidatacc.cc
$ g++ -o progcc main.cc hidatacc.o

Compressing the Go cross-compiler

This blog post presents my findings about the compressibility of the Go 1.2 cross-compiler binaries, running on Linux i386 and generating code for various operating systems and architectures. It also adds data points for comparing data compression algorithms and tools.

TL;DR 7-Zip creates the smallest archives and self-extracting archives.

Input files (all uncompressed):

5 binaries (go, gofmt, cgo, fix, yacc) written in Go, 15_599_640 bytes in total.
18 binaries (5a, 5c, 5g, 5l, 6a, 6c, 6g, 6l, 8a, 8c, 8g, 8l, addr2line, dist, nm, objdump, pack, pprof) written in C, 15_655_780 bytes in total.
623 libraries written mostly in Go, some parts written in C, (uncompressed) .a files containing object files, 249_375_024 bytes in total.
147 very short placeholder .go source files, each containing package PACKAGENAME only, 2_034 bytes in total.
712 library symlinks (.a pointing to another .a file)
476 directories nested up to the level 7

All binaries are statically linked, compiled for Linux i386 (x86, 32-bit). Libraries are for various operating systems (Linux, FreeBSD, Mac OS X, Windows) and architectures (i386, amd64, ARM). All files were compiled from the go1.2 sources.

Archive sizes:

.tar: 281_886_720 bytes (uncompressed)
.zip: 117_874_697 bytes
.zip without symlink duplicates: 68_160_582 bytes
.tar.gz: 67_984_293 bytes
.tar.bz2: 52_344_038 bytes
.tar.xz: 36_832_488 bytes
.rar: 36_347_393 bytes
.7z: 25_776_183 bytes

Archive creation commands:

.zip: zip -9r multigo1.2_linux_386.zip multigo1.2_linux_386
.tar: tar cvf multigo1.2_linux_386.tar.gz multigo1.2_linux_386.tar.gz
.tar.gz: tar czvf multigo1.2_linux_386.tar.gz multigo1.2_linux_386.tar.gz
.tar.bz2: tar cv multigo1.2_linux_386 | bzip2 -c9 >multigo1.2_linux_386.tar.bz2
.tar.xz: tar cv multigo1.2_linux_386 | xz -7 --memlimit=100M >multigo1.2_linux_386.tar.xz
.7z: 7z a -t7z -mx=7 -md=8m -ms=on multigo1.2_linux_386.sfx.7z multigo1.2_linux_386
.rar: rar a -r -s -m5 multigo1.2_linux_386.rar multigo1.2_linux_386

Feature matrix of the archivers:

rar: files, directories, permission bits, last modification times
zip: same as above
7z (7-Zip): all above, plus symlinks
tar: all above, plus file owner users and groups

Self-extracting archive binary sizes for Linux i386:

uncompressed: 281_892_240 (dynamically linked, just writes the .tar to stdout, 5520 bytes larger than the .tar)
upx --best: 50_215_752 bytes (the uncompressed binary above was compressed)
upx --lzma: 37_810_484 bytes
.sfx.rar: 36_481_433 bytes (dynamically linked, 134_040 bytes longer, not exactly prepended)
.sfx.7z: 25_897_327 bytes (statically linked, see how to create, 121_144 of self-extraction code prepended)
upx --brute: out of memory (wanted to use more than 3 GB of RAM)

Note: rar doesn't support symlinks, it follows the symlinks and duplicated the files at compression time. This wasn't a cause of a major size increase, because solid archiving (rar a -s) put these files next to each other, so the previous file instance was emitted as a single repetition.

Note: zip doesn't support symlinks either, and it makes the archive considerably larger, because it follows the symlinks and saves the file again.

Note: The self-extraction code of 7-Zip (7zCon.sfx) was 121144 bytes, not included in the sizes above.

Note: rar and 7-Zip support file last modification times, permission bits, but no owner or group.

Note: The fundamental difference between .7z and .xz is that 7z has special preprocessing (BCJ and BCJ2) step for executable binaries, and BCJ is enabled by default. This lets the final size become much smaller.

2013-12-08

How to make an existing 7-Zip archive self-extracting

This blog post explains how to make an existing 7-Zip (.7z) archive self-extracting, i.e. how to convert it to an executable which can extract its own contents.

A 7-Zip self-extracting archive is just a concatenation of the extractor code and a regular .7z archive. So to make an archive self-extracting, one needs to get the extractor code file, append the regular .7z archive file, and (for Unix) make it executable.

Similarly, it's easy to remove the extractor code, by looking for the 6-byte 7-Zip header ("7z\274\257\047\034") in the first 1 megabyte of the self-extracting file, and removing everything before it. The Perl script replace_7z_sfx can be used to automate this. It can add, remove or replace the extractor code.

If you want to create the self-extracting archive from scratch, use 7z a -sfxEXTR FILE.7z ..., replacing EXTR with the name of the file containing the extractor code.

How to get the extractor code for Windows

Please note that the extractor code is specific for the operating system. That is, the self-extracting archive will be able to run on its target system only. It doesn't matter on which system you create the archive though.

Get precompiled extractor code from http://pts.50.hu/files/sfx_for_7zip9.20/:

Linux, i386, console, statically linked, 121176 bytes, supports the -eFILE.7z flag to extract external archives
Mac OS X 10.6 or later, i386, console, 143360 bytes
Windows, i386 (Win32), console (text mode), 152064 bytes
Windows, i386 (Win32), GUI (windowed), 162816 bytes

As a comparison, the i386 Linux RAR 4.2.0 extractor code is dynamically liked (needs the libc), 137068 bytes uncompressed, and 61800 bytes when compressed with upx --ultra-brute. The rar executable is statically linked.

How were the extractor code files generated

The Windows extractors were taken from the Windows 32-bit .exe from the 7-Zip download page (direct link for version 9.20): files 7z.sfx (GUI) and 7zCon.sfx (console). The download is a self-extracting .7z archive, it could be extracted by the Linux 7z tool, in the p7zip-full Ubuntu package.

The Mac OS X console extractor was taken from the Rudix 7-Zip package for Mac OS X 10.6: file 7zCon.sfx. Extracting that file needed several steps:

$ wget -nv -O p7zip-9.20.1-1.pkg \
    http://rudix-snowleopard.googlecode.com/files/p7zip-9.20.1-1.pkg
$ file p7zip-9.20.1-1.pkg
p7zip-9.20.1-1.pkg: xar archive - version 1
$ 7z x p7zip-9.20.1-1.pkg p7zipinstall.pkg/Payload
$ file p7zipinstall.pkg/Payload
p7zipinstall.pkg/Payload: gzip compressed data, from Unix
$ 7z x -so p7zipinstall.pkg/Payload >payload.cpio
$ file payload.cpio 
payload.cpio: ASCII cpio archive (pre-SVR4 or odc)
$ 7z x payload.cpio
$ ls -l usr/local/lib/p7zip/7zCon.sfx
-rw-r--r-- 1 pts pts 521680 Aug  6 20:09 usr/local/lib/p7zip/7zCon.sfx
$ cp -a usr/local/lib/p7zip/7zCon.sfx 7zCon.sfx
$ upx --ultra-brute 7zCon.sfx
$ ls -l 7zCon.sfx
-rw-r--r-- 1 pts pts 143360 Aug  6 20:09 7zCon.sfx

It looks like 7-Zip is very useful: it can extract many different kinds of archives.

The Linux extractor was compiled from the p7zip sources. It was compiled with a gcc-4.1.2 cross-compiler targeting i386, generating statically linked Linux binary, using uClibc, and then compressed with upx --ultra-brute. It is interesting to enumerate which g++ compiler flags were used to further reduce the file size:

-Os: Optimize for generated code size.
-fno-rtti: Disable RTTI: run-time type identification.
(-fno-exceptions): Disable exception handling. We couldn't use it, because the p7zip code base uses exceptions.
-fno-stack-protector: Disable protection of the stack against buffer overflows. Saves a few bytes per function call location.
-ffunction-sections -fdata-sections: Create a separate section for each symbol. (By default a section is created per source file, and all symbols within a source file are put to the same section.)
-Wl,--gc-sections: Remove unused sections from the resulting binary. This is implemented by ld --gc-sections. This is a very important flag (in combination with the other section flags above), because it removes all unused functions and methods.

See also the shell script which does all the download, compilation and compression of the Linux 7zCon.sfx.

2013-12-01

Binary search in line-sorted text files

This blog post announces pts-line-bisect, a C program and an equivalent Python script and library for doing binary seach in line-sorted text files, and it also explains some of the design details of the Python implementation.

Let's suppose you have a sorted text file, in which each line is lexicographically larger than the previous one, and you want to find a specific range of lines, or all lines with a specific prefix.

I've written the program pts-line-bisect for that recently. See the beginning of the README on many examples how to use the C program. The Python program has fewer features, here is how to use it:

Closed ended interval search: print all lines at least bar and at most foo:
```
$ python pts_line_bisect.py -t file.sorted bar foo
```
Open ended interval search: print all lines at least bar and smaller than foo:
```
$ python pts_line_bisect.py -e file.sorted bar foo
```
Prepend position: print the smallest offset where foo can be inserted (please note that this does not report whether foo is present, always exit(0)).
```
$ python pts_line_bisect.py -oe file.sorted foo
```
Append position: print the largest offset where foo can be inserted (please note that this does not report whether foo is present, always exit(0)).
```
$ python pts_line_bisect.py -oae file.sorted foo
```
Exact range: print the range (start byte offset and after-the-last-byte end offset) of lines containing only foo:
```
$ python pts_line_bisect.py -ot file.sorted foo
```

Please note that these tools support duplicate lines correctly (i.e. if the same line appears multiple times, in a block).

Please note that these tools and libraries assume that the key is at the beginning of the line. If you have a sorted file whose lines contain the sort key somewhere in the middle, you need to make changes to the code. It's possible to do so in a way that the code will still support duplicate keys (i.e. records with the same key).

Analysis

I've written the article titled Evolution of a binary search implementation for line-sorted text files about the topic, containing the problem statement, the possible pitfalls, an analysis of some incorrect solutions available on the web as code example, a detailed specification and explanation of my solution (including a proof), disk seek and speed analysis, a set of speedup ideas and their implementation, and further notes about the speedups in the C implementation.

As a teaser, here is an incorrect solution in Python:

def bisect_left_incorrect(f, x):
  """... Warning: Incorrect implementation with corner case bugs!"""
  x = x.rstrip('\n')
  f.seek(0, 2)  # Seek to EOF.
  lo, hi = 0, f.tell()
  while lo < hi:
    mid = (lo + hi) >> 1
    f.seek(mid)
    f.readline()  # Ignore previous line, find our line.
    if x <= f.readline().rstrip('\n'):
      hi = mid
    else:
      lo = mid + 1
  return lo

Can you spot the all the 3 bugs?

Read the article for all the details and the solution.

About sorting: For the above implementation to work, files must be sorted lexicographically, using the unsigned value (0..255) for each byte in the line. If you don't sort the file, or you sort it using some language collation or some locale, then the linked implementation won't find the results you are looking for. On Linux, use LC_ALL=C sort <in.txt >out.txt to sort lexicographically.

As a reference, here is the correct implementation of the same algorithm for finding the start of the interval in a sorted list or other sequences (based on the bisect module):

def bisect_left(a, x, lo=0, hi=None):
  """Return the index where to insert item x in list a, assuming a is sorted.

  The return value i is such that all e in a[:i] have e < x, and all e in
  a[i:] have e >= x.  So if x already appears in the list, a.insert(x) will
  insert just before the leftmost x already there.

  Optional args lo (default 0) and hi (default len(a)) bound the
  slice of a to be searched.
  """
  if lo < 0:
    raise ValueError('lo must be non-negative')
  if hi is None:
    hi = len(a)
  while lo < hi:
    mid = (lo + hi) >> 1
    if x <= a[mid]:  # Change `<=' to `<', and you get bisect_right.
      hi = mid
    else:
      lo = mid + 1
  return lo

A typical real-word use case for such a binary search tool is retrieving lines corresponding to a particular time range in log files (or time based measurement records). These files text files with variable-length lines, with the log timestamp in the beginning of the line, and they are generated in increasing timestamp order. Unfortunately the lines are not lexicographically sorted, so the timestamp has to be decoded first for the comparison. The bsearch tool does that, it also supports parsing arbitrary, user-specifiable datetime formats, and it can binary search in gzip(1)ped files as well (by building an index). It's also of high performance and low overhead, partially because it is written in C++. So bsearch is practical tool with lots of useful features. If you need anything more complicated than a lexicographic binary search, use it instead.

Before getting too excited about binary search, please note that there are much faster alternatives for data on disk. In an external (file-based) binary search the slowest operation is the disk seek needed for each bisection. Most of the software and hardware components would be waiting for the hard disk to move the reading head. (Except, of course, that seek times are negligible when non-spinning storage hardware such as SSD or memory card is used.)

An out-of-the box solution would be adding the data to more disk-efficient key-value store. There are several programs providing such stores. Most of them are based on a B-tree, B*-tree or B+-tree data structure if sorted iteration and range searches have to be supported, or disk-based hashtables otherwise. Some of the high-performance single-machine key-value stores: cdb (read-only), Tokyo Cabinet, Kyoto Cabinet, LevelDB; see more in the NoSQL software list.

The fundamental speed difference between a B-tree search and a binary search in a sorted list stems from the fact that B-trees have a branching factor larger than 2 (possibly 100s or 1000s), thus each seeking step in a B-tree search reduces possible input size by a factor larger than 2, while in a binary search each step reduces the the input size by a factor 2 only (i.e. we keep either the bottom half or the top half). So both kinds of searches are logarithmic, but the base of the logarithm is different, and this causes a constant factor difference in the disk seek count. By careful tunig of the constant in B-trees it's usual to have only 2 or 3 disk seeks for each search even for 100GB of data, while a binary search in a such a large file with 50 bytes per record would need 31 disk seeks. By taking 10ms as the seek time (see more info about typical hard disk seek times), a typical B-tree search takes 0.03 second, and a typical binary search takes 0.31 second.

Have fun binary searching, but don't forget to sort your files first!

Flattr this blog

Blog Archive

Recommended