## 2013-08-18

### Non-recursive, fast, compact stable sort implementation in C++

This blog post discusses stable sort implementations in various languages, and it also presents a non-recursive, fast and compact mergesort implementation in C++. The implementation runs the original mergesort algorithm in an iterative way with some optimizations, so it is stable.

Please note that C++ STL has a stable sort built in: std::stable_sort in #include <algorithm>. If it's OK to use STL in your C++ code, use that, and you won't need any custom code from this blog post.

About compactness: when compiled statically with GCC 4.1.2 (g++ -fno-rtti -fno-exceptions -s -O2 -static) for i386, the mergesort in this blog post produces an executable 4084 bytes smaller than with std::stable_sort.

About speed: between 10% and 16% faster than std::stable_sort, see speed measurements below.

### The fast, non-recursive, stable sort implementation in C++

It's a non-recursive, stable mergesort implementation, with O(n*log(n)) worst-case speed.

#include <sys/types.h>
// Stable, non-recursive mergesort algorithm. Sort a[:n], using temporary
// space of size n starting at tmp. (Please note that mergesort Algorithm
// 5.2.4N and Algorithm 5.2.4S from Knuth's TAOCP are not stable. Please
// note that most implementations of mergesort found online are recursive,
// thus slower.)
//
// by pts@fazekas.hu at Sun Aug 18 19:23:36 CEST 2013
template <typename T, typename IsLess>
void mergesort(T *a, size_t n, T *tmp, IsLess is_less) {
for (size_t f = 1; f < n; f += 2) {  // Unfold the first pass for speedup.
if (is_less(a[f], a[f - 1])) {
T t = a[f];
a[f] = a[f - 1];
a[f - 1] = t;
}
}
bool s = false;
for (size_t p = 2; p != 0 && p < n; p <<= 1, s = !s) {
// Now all sublists of p are already sorted.
T *z = tmp;
for (size_t i = 0; i < n; i += p << 1) {
T *x = a + i;
T *y = x + p;
size_t xn = p < n - i ? p : n - i;
size_t yn = (p << 1) < n - i ? p : p < n - i ? n - p - i : 0;
if (xn > 0 && yn > 0 &&
is_less(*y, x[xn - 1])) {  // Optimization (S), Java 1.6 also has it.
for (;;) {
if (is_less(*y, *x)) {
*z++ = *y++;
if (--yn == 0) break;
} else {
*z++ = *x++;
if (--xn == 0) break;
}
}
}
while (xn > 0) {  // Copy from *x first because of (S).
*z++ = *x++;
--xn;
}
while (yn > 0) {
*z++ = *y++;
--yn;
}
}
z = a; a = tmp; tmp = z;
}
if (s) {  // Copy from tmp to result.
for (T *x = tmp, *y = a, * const x_end = tmp + n; x != x_end; ++x, ++y) {
*x = *y;
}
}
}

Some convenience functions to call it:

// Stable, non-recursive mergesort.
// To sort vector a', call mergesort(a.data(), a.data() + a.size(), is_less)'
// or use the convenience function below.
template <typename T, typename IsLess>
void mergesort(T *a, T *a_end, IsLess is_less) {
const size_t n = a_end - a;
if (n < 2) return;
// Creating ptr_deleter so tmp will be deleted even if is_less or
// mergesort throws an exception.
struct ptr_deleter {
ptr_deleter(T *p): p_(p) {}
~ptr_deleter() { delete p_; }
T *p_;
} tmp(new T[n]);
mergesort(a, n, tmp.p_, is_less);
}

// Convenience function to sort a range in a vector.
template <typename T, typename IsLess>
// Doesn't match:
//static void mergesort(const typename std::vector<T>::iterator &begin,
//                      const typename std::vector<T>::iterator &end,
//static void mergesort(const typename T::iterator &begin,
//                      const typename T::iterator &end,
void mergesort(const T &begin, const T &end, IsLess is_less) {
mergesort(&*begin, &*end, is_less);
}

// Stable, non-recursive mergesort.
// Resizes v to double size temporarily, and then changes it back. Memory may
// be wasted in b after the call because of that.
template <typename T, typename IsLess>
void mergesort_consecutive(std::vector<T> *v, IsLess is_less) {
const size_t n = v->size();
if (n < 2) return;
v->resize(n << 1);  // Allocate temporary space.
mergesort(v->data(), n, v->data() + n, is_less);
v->resize(n);
}

Example invocation:

inline bool int_mask_less(int a, int b) {
return (a & 15) < (b & 15);
}
...
std::vector<int> a;
...
mergesort(a.begin(), a.end(), int_mask_less);

### A slow, non-recursive stable sort implementation in C++

If you don't care about speed, an insertion sort would do: it's simple (short), stable and non-recursive. Unfortunately it's slow: O(n2) in worst case and average.

// Insertion sort: stable but slow (O(n^2)). Use mergesort instead if you need
// a stable sort.
template <typename T, typename IsLess>
static void insertion_sort(T *a, T *a_end, IsLess is_less) {
const size_t n = a_end - a;
for (size_t i = 1; i < n; ++i) {
if (is_less(a[i], a[i - 1])) {
T t = a[i];
size_t j = i - 1;  // TODO: Use pointers instead of indexes.
while (j > 0 && is_less(t, a[j - 1])) {
--j;
}
for (size_t k = i; k > j; --k) {
a[k] = a[k - 1];
}
a[j] = t;
}
}
}

// Convenience function to sort a range in a vector.
template <typename T, typename IsLess>
static void insertion_sort(const T &begin, const T &end, IsLess is_less) {
insertion_sort(&*begin, &*end, is_less);
}

### C++ stable sort speed measurements

10000 random int vectors (of random size smaller than 16384) were sorted in place in a C++ program running on Linux 2.6.35 running on an Intel Core 2 Duo CPU T6600 @ 2.20GHz. The total user time was measured, including the sort, the running of std::stable_sort, and the verification of the sort output (i.e. comparing it to the output of std::stable_sort. The insmax parameter below indicates the size of the sublists with were sorted using insertion sort before running the mergesort (the implementation above uses insmax=2, which is implemented in the Unfold... loop).

The raw time measurement results:

mask= 15  insmax= 1          18.805s

mask=255  insertion_sort    530.27s

It looks like that the mergesort algorithm proposed in this blog post (insmax=2 in the measurements above) is between 10% and 16% faster than std::stable_sort. It looks like that increasing insmax from 2 to 8 could yield an additional 1.18% speed increase.

### Stable sort in other programming languages

• Java java.util.Collections.sort and java.util.Arrays.sort are stable. In OpenJDK 1.6, for primitive types, it uses an optimized (tuned) quicksort, and for objects (with a comparator) it uses a recursive, stable mergesort. Even though quicksort is not stable, this doesn't matter for primitive types, because there is no comparator (so it's always in ascending order), and it's impossible to distinguish different instances of the same primitive type value.
• C qsort function is not stable, because it uses quicksort. FreeBSD, Mac OS X and other BSD systems have the mergesort function in their standard library (in libc, stdlib/merge.c), Linux glibc doesn't have it. As an alternative, it's straightforward to convert the mergesort implementation in this blog post from C++ to C.
• CPython's sorted and list.sort functions are implemented with a stable, recursive mergesort (and binary insertion sort for short inputs) with a sophisticated (complicated, long) code containing lots of optimizations: see the files Objects/listsort.txt and Objects/listobject.c in the Python source code for documentation and implementation. It looks like it's not recursive.
• Perl sort uses mergesort by default, so it's stable, but the default can change in future versions. Add use sort 'stable'; to the beginning of your code to get a guaranteed stable sort. See the sort pragma page for details.
• Ruby 2.0 Array#sort doesn't say if it's stable or not, but a quick search for the string qsort in array.c reveals that it uses quicksort, so it's not stable. (a bit slow) stable sorting can be added easily:
class Array
def stable_sort
n = 0
sort_by {|x| n+= 1; [x, n]}
end
end
... as explained here. The fundamental idea in this trick is to compare the indexes if the values are equal. This fundamental idea can be implemented in any programming language, although it may have considerable memory and/or time overhead.
• JavaScript's Array.prototype.sort tends to be stable in modern browsers (see the details), but unstable on old browsers.
• C# Array.Sort is not stable (because it uses quicksort), but Enumerable.OrderBy and Enumberable.ThenBy are stable (because they use mergesort. See more info here.

### Other stable sorting algorithms

Heapsort and quicksort are not stable. Some variations of mergesort (such as Algorithm 5.2.4N and Algorithm 5.2.4S in The Art of Computer programming, Volume 3 by Knuth) are not stable.

This page lists stable sorting algorithms known to mankind. Be careful with the implementations though: even if the algorithm itself is stable, some optimized, otherwise modified or buggy implementations might not be. The algorithms are:

• O(n*log(n)), stable, comparison-based: mergesort, cascade merge sort, oscillating merge sort.
• O(n2), stable, comparison-based: bubble sort, cocktail sort, insertion sort, binary insertion sort (where the insertion index is found using binary search), gnome sort, library sort, odd-even sort.
• O(n2), stable, not comparison-based: bucket sort, proxmap sort.
• faster than O(n2), stable, not comparison-based: counting sort, pigeonhole sort, radix sort.

So it looks like that we don't know of an alternative of mergesort for comparison-based sorting in O(n*log(n)) time, the others above are slower. Maybe Edsger Dijkstra[citation needed] can come up with an alternative.

## 2013-08-13

### How to send Unix file descriptors between processes?

This blog post explains how to send Unix file descriptors between processes. The example shown below sends from parent to child, but the general idea works the other way round as well, and it works even between unrelated processes.

As a dependency, the processes need to have the opposite ends of a Unix domain socket already. Once they have it, they can use sendfd to send any file descriptor over the existing Unix domain socket. Do it like this in Python:

#! /usr/bin/python

"""sendfd_demo.py: How to send Unix file descriptors between processes

for Python 2.x at Tue Aug 13 16:07:29 CEST 2013
"""

import _multiprocessing
import os
import socket
import sys
import traceback

def run_in_child(f, *args):
pid = os.fork()
if not pid:  # Child.
try:
child(*args)
except:
traceback.print_exc()
os._exit(1)
finally:
os._exit(0)
return pid

def child(sb):
assert sb.recv(1) == 'X'
# Be careful: there is no EOF handling in
# _multiprocessing.recvfd (because it's buggy). On EOF, it
# returns an arbitrary file descriptor number. We work it around by doing
# a regular sendall--recv pair first, so if there is an obvious reason for
# failure, they will fail properly.
f = os.fdopen(_multiprocessing.recvfd(sb.fileno()))
print f.tell()  # The file size.

def main(argv):
sa, sb = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
pid = run_in_child(child, sb)
# We open f after creating the child process, so it could not inherit f.
f = open('/etc/hosts')
sa.sendall('X')
_multiprocessing.sendfd(sa.fileno(), f.fileno())  # Send f to child.
assert (pid, 0) == os.waitpid(pid, 0)
# The file descriptor is shared between us and the child, so we get the file
# position where the child has left off (i.e. at the end).
print f.tell()
print 'Parent done.'

if __name__ == '__main__':
sys.exit(main(sys.argv))

It works even for unrelated processes. To try it, run the following program with a command-line argument in a terminal window (it will start the server), and in another terminal window run it several times without arguments (it will run the client). Each time the client is run, it connects to the server, and sends a file descriptor to it, which the server reads. The program:

#! /usr/bin/python2.7

"""sendfd_unrelated.py: Send Unix file descriptors between unrelated processes.

for Python 2.x at Tue Aug 13 16:26:27 CEST 2013
"""

import _multiprocessing
import errno
import os
import socket
import stat
import sys
import time

SOCKET_NAME = '/tmp/sendfd_socket'

def server():
print 'Server.'
ssock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
while 1:
try:
ssock.bind(SOCKET_NAME)
break
except socket.error, e:
raise
# TODO: Fail if the old server is still running.
print 'Removing old socket.'

ssock.listen(16)
while 1:
print 'Accepting.'
print 'Accepted.'
assert sock.recv(1) == 'X'
f = os.fdopen(_multiprocessing.recvfd(sock.fileno()))
print f.tell()  # The file size.
del sock, f

def client():
print 'Client.'
sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
while 1:
try:
sock.connect(SOCKET_NAME)
break
except socket.error, e:
if e not in (errno.ENOENT, errno.ECONNREFUSED):
raise
print 'Server not listening, trying again.'
time.sleep(1)
print 'Connected.'
f = open('/etc/hosts')
sock.sendall('X')
_multiprocessing.sendfd(sock.fileno(), f.fileno())  # Send f to server.
assert '' == sock.recv(1)  # Wait for the server to close the connection.
del sock
print f.tell()

def main(argv):
if len(argv) > 1:
server()
else:
client()
print 'Done.'

if __name__ == '__main__':
sys.exit(main(sys.argv))

There is no system call named sendfd or recvfd. They are implemented in terms of the sendmsg and recvmsg system calls. Passing the correct arguments is tricky and a bit awkward. Here is how to do it in C or C++:

#include <string.h>
#include <sys/socket.h>

int sendfd(int unix_domain_socket_fd, int fd) {
char dummy_char = 0;
char buf[CMSG_SPACE(sizeof(int))];
struct msghdr msg;
struct iovec dummy_iov;
struct cmsghdr *cmsg;
memset(&msg, 0, sizeof msg);
dummy_iov.iov_base = &dummy_char;
dummy_iov.iov_len = 1;
msg.msg_control = buf;
msg.msg_controllen = sizeof(buf);
msg.msg_iov = &dummy_iov;
msg.msg_iovlen = 1;
cmsg = CMSG_FIRSTHDR(&msg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
msg.msg_controllen = cmsg->cmsg_len;
memcpy(CMSG_DATA(cmsg), &fd, sizeof fd);  /* int. */
return sendmsg(unix_domain_socket_fd, &msg, 0);  /* I/O error unless 1. */
}

int recvfd(int unix_domain_socket_fd) {
int res;
char dummy_char;
char buf[CMSG_SPACE(sizeof(int))];
struct msghdr msg;
struct iovec dummy_iov;
struct cmsghdr *cmsg;
memset(&msg, 0, sizeof msg);
dummy_iov.iov_base = &dummy_char;
dummy_iov.iov_len = 1;
msg.msg_control = buf;
msg.msg_controllen = sizeof(buf);
msg.msg_iov = &dummy_iov;
msg.msg_iovlen = 1;
cmsg = CMSG_FIRSTHDR(&msg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
msg.msg_controllen = cmsg->cmsg_len;
res = recvmsg(unix_domain_socket_fd, &msg, 0);
if (res == 0) return -2;  /* EOF. */
if (res < 0) return -1;  /* I/O error. */
memcpy(&res, CMSG_DATA(cmsg), sizeof res);  /* int. */
return res;  /* OK, new file descriptor is returned. */
}

## 2013-08-08

### webfind: How many programming languages can you speak ... at the same time?

When I wrote the following code which compiles and/or executes in 10 different programming languages (C, C++, Perl, TeX, LaTeX, PostScript, sh, bash, zsh and Prolog) and prints a different message in each of them, I think it was cool.

%:/*:if 0;"true" +s ||true<</;#|+q|*/include<stdio.h>/*\_/
{\if(%)}newpath/Times-Roman findfont 20 scalefont setfont(
%%)pop 72 72 moveto(Just another PostScript hacker,)show((
t)}. t:-write('Just another Prolog hacker,'),nl,halt. :-t.
:-initialization(t). end_of_file. %)pop pop showpage(-: */
int main(){return 0&printf("Just another C%s hacker,\n",1%
sizeof'2'*2+"++");}/*\fi}\csname @gobble\endcsname{\egroup
\let\LaTeX\TeX\ifx}\if00\documentclass{article}\begin{doc%
ument}\fi Just another \LaTeX\ hacker,\end{document}|if 0;
/(J.*)\$sh(.*)"/,print"$1Perl$2$/"if$_.=q # hack the lang! / sh=sh;test$BASH_VERSION &&sh=bash;test $POSIXLY_CORRECT&& sh=sh;test$ZSH_VERSION && sh=zsh;awk 'BEGIN{x="%c[A%c[K"
printf(x,27,27)}';echo "Just another $sh hacker," #)pop%*/ But now I know that the source of this is the the coolest ever (poly.html). I could make it work in 22 languages without changing the original file. Now, that's a feat, congratulations to the author! Good luck for the autodetect feature of your smart text editor to figure out the highlighting. The name of some languages is not explicitly mentioned in the file. Can you find them? There is another impressive one polyglot with 8 languages: COBOL, Pascal, Fortran, C, PostScript, sh, DOS .com file, Perl. There is another multilingual program I wrote earlier with 3 languages of: Haskell, Prolog, Perl and Python. ### webfind: Bootstrapping a simple compiler from nothing This document is a case study of writing and bootstrapping a compiler for a simple (but structured) language from scratch. It targets i386 Linux. It starts from a single hex-to-binary converter, and repeats this many times: write a little compiler for a little bit smarter language in the old language, compile it, then rewrite it to the new language. After about a dozen iterations a compiler for a structured language is created. See also the source code. Done in 2001. ### webfind: The smallest compiler (it compiles Brainfuck) I've found the smallest compiler so far. It's a Brainfuck compiler written in x86 assembly, and it compiles to an i386 Linux binary. Downloaded it from here and play with it on Linux (i386 or amd64): $ sudo apt-get install nasm
$wget -O bf.asm http://www.muppetlabs.com/~breadbox/software/tiny/bf.asm.txt$ nasm -f bin -o bf bf.asm && chmod +x bf
$cat >hello_world.bf <<'END' >+++++++++[<++++++++>-]<.>+++++++[<++++>-]<+.+++++++..+++.[-] >++++++++[<++++>-] <.>+++++++++++[<++++++++>-]<-.--------.+++ .------.--------.[-]>++++++++[<++++>- ]<+.[-]++++++++++. END$ ./bf hello_world && chmod +x hello_world
\$ ./hello_world
Hello world!
`

A 16-bit DOS version, which is even shorter (135 bytes) can be found here. There is a challange here for implementing Brainfuck interpreters, and the best so far is 106 bytes in Perl, but unfortunately the source code is not available. There is also a 98-byte interpreter for 16-bit DOS here.

See a 160-byte (source code) interpreter in C here.

Here is a similar page listing short interpreters for a small, Turing-complete language.