Non-recursive, fast, compact stable sort implementation in C++

This blog post discusses stable sort implementations in various languages, and it also presents a non-recursive, fast and compact mergesort implementation in C++. The implementation runs the original mergesort algorithm in an iterative way with some optimizations, so it is stable.

Please note that C++ STL has a stable sort built in: std::stable_sort in #include <algorithm>. If it's OK to use STL in your C++ code, use that, and you won't need any custom code from this blog post.

About compactness: when compiled statically with GCC 4.1.2 (g++ -fno-rtti -fno-exceptions -s -O2 -static) for i386, the mergesort in this blog post produces an executable 4084 bytes smaller than with std::stable_sort.

About speed: between 10% and 16% faster than std::stable_sort, see speed measurements below.

The fast, non-recursive, stable sort implementation in C++

It's a non-recursive, stable mergesort implementation, with O(n*log(n)) worst-case speed.

#include <sys/types.h>
// Stable, non-recursive mergesort algorithm. Sort a[:n], using temporary
// space of size n starting at tmp. (Please note that mergesort Algorithm
// 5.2.4N and Algorithm 5.2.4S from Knuth's TAOCP are not stable. Please 
// note that most implementations of mergesort found online are recursive,
// thus slower.)
// by pts@fazekas.hu at Sun Aug 18 19:23:36 CEST 2013
template <typename T, typename IsLess>
void mergesort(T *a, size_t n, T *tmp, IsLess is_less) {
  for (size_t f = 1; f < n; f += 2) {  // Unfold the first pass for speedup.
    if (is_less(a[f], a[f - 1])) {
      T t = a[f];
      a[f] = a[f - 1];
      a[f - 1] = t;   
  bool s = false;
  for (size_t p = 2; p != 0 && p < n; p <<= 1, s = !s) {
    // Now all sublists of p are already sorted.
    T *z = tmp;
    for (size_t i = 0; i < n; i += p << 1) {
      T *x = a + i;
      T *y = x + p;
      size_t xn = p < n - i ? p : n - i;
      size_t yn = (p << 1) < n - i ? p : p < n - i ? n - p - i : 0;
      if (xn > 0 && yn > 0 &&
          is_less(*y, x[xn - 1])) {  // Optimization (S), Java 1.6 also has it.
        for (;;) {
          if (is_less(*y, *x)) {
            *z++ = *y++;
            if (--yn == 0) break;
          } else {
            *z++ = *x++;
            if (--xn == 0) break;
      while (xn > 0) {  // Copy from *x first because of (S).
        *z++ = *x++;
      while (yn > 0) {
        *z++ = *y++;  
    z = a; a = tmp; tmp = z;
  if (s) {  // Copy from tmp to result.
    for (T *x = tmp, *y = a, * const x_end = tmp + n; x != x_end; ++x, ++y) {
      *x = *y;

Some convenience functions to call it:

// Stable, non-recursive mergesort.
// To sort vector `a', call mergesort(a.data(), a.data() + a.size(), is_less)'
// or use the convenience function below.
template <typename T, typename IsLess>   
void mergesort(T *a, T *a_end, IsLess is_less) {
  const size_t n = a_end - a;
  if (n < 2) return;
  // Creating ptr_deleter so tmp will be deleted even if is_less or
  // mergesort throws an exception.
  struct ptr_deleter {
    ptr_deleter(T *p): p_(p) {}
    ~ptr_deleter() { delete p_; }
    T *p_;
  } tmp(new T[n]);
  mergesort(a, n, tmp.p_, is_less);
// Convenience function to sort a range in a vector.
template <typename T, typename IsLess>
// Doesn't match:
//static void mergesort(const typename std::vector<T>::iterator &begin,
//                      const typename std::vector<T>::iterator &end,  
//static void mergesort(const typename T::iterator &begin,
//                      const typename T::iterator &end,  
void mergesort(const T &begin, const T &end, IsLess is_less) {
  mergesort(&*begin, &*end, is_less);
// Stable, non-recursive mergesort.
// Resizes v to double size temporarily, and then changes it back. Memory may
// be wasted in b after the call because of that.
template <typename T, typename IsLess>
void mergesort_consecutive(std::vector<T> *v, IsLess is_less) {
  const size_t n = v->size();
  if (n < 2) return;
  v->resize(n << 1);  // Allocate temporary space.
  mergesort(v->data(), n, v->data() + n, is_less);

Example invocation:

inline bool int_mask_less(int a, int b) {
  return (a & 15) < (b & 15);
  std::vector<int> a;
  mergesort(a.begin(), a.end(), int_mask_less);

A slow, non-recursive stable sort implementation in C++

If you don't care about speed, an insertion sort would do: it's simple (short), stable and non-recursive. Unfortunately it's slow: O(n2) in worst case and average.

// Insertion sort: stable but slow (O(n^2)). Use mergesort instead if you need
// a stable sort.  
template <typename T, typename IsLess>  
static void insertion_sort(T *a, T *a_end, IsLess is_less) {
  const size_t n = a_end - a;
  for (size_t i = 1; i < n; ++i) {
    if (is_less(a[i], a[i - 1])) {
      T t = a[i];
      size_t j = i - 1;  // TODO: Use pointers instead of indexes.
      while (j > 0 && is_less(t, a[j - 1])) {
      for (size_t k = i; k > j; --k) {
        a[k] = a[k - 1];
      a[j] = t; 

// Convenience function to sort a range in a vector.
template <typename T, typename IsLess>
static void insertion_sort(const T &begin, const T &end, IsLess is_less) {
  insertion_sort(&*begin, &*end, is_less);

C++ stable sort speed measurements

10000 random int vectors (of random size smaller than 16384) were sorted in place in a C++ program running on Linux 2.6.35 running on an Intel Core 2 Duo CPU T6600 @ 2.20GHz. The total user time was measured, including the sort, the running of std::stable_sort, and the verification of the sort output (i.e. comparing it to the output of std::stable_sort. The insmax parameter below indicates the size of the sublists with were sorted using insertion sort before running the mergesort (the implementation above uses insmax=2, which is implemented in the Unfold... loop).

The raw time measurement results:

mask= 15  insmax= 1          18.805s
mask= 15  insmax= 2          17.337s
mask= 15  insmax= 4          17.109s
mask= 15  insmax= 8          17.133s
mask= 15  insmax=16          17.153s
mask= 15  insmax=32          17.325s
mask= 15  insmax=64          18.221s
mask= 15  std::stable_sort   19.401s
mask= 15  insertion_sort    500.83s

mask=255  insmax= 1          19.349s
mask=255  insmax= 2          18.357s
mask=255  insmax= 4          18.493s
mask=255  insmax= 8          18.585s
mask=255  insmax=16          18.741s
mask=255  insmax=32          19.081s
mask=255  insmax=64          20.061s
mask=255  std::stable_sort   21.965s
mask=255  insertion_sort    530.27s

It looks like that the mergesort algorithm proposed in this blog post (insmax=2 in the measurements above) is between 10% and 16% faster than std::stable_sort. It looks like that increasing insmax from 2 to 8 could yield an additional 1.18% speed increase.

Stable sort in other programming languages

  • Java java.util.Collections.sort and java.util.Arrays.sort are stable. In OpenJDK 1.6, for primitive types, it uses an optimized (tuned) quicksort, and for objects (with a comparator) it uses a recursive, stable mergesort. Even though quicksort is not stable, this doesn't matter for primitive types, because there is no comparator (so it's always in ascending order), and it's impossible to distinguish different instances of the same primitive type value.
  • C qsort function is not stable, because it uses quicksort. FreeBSD, Mac OS X and other BSD systems have the mergesort function in their standard library (in libc, stdlib/merge.c), Linux glibc doesn't have it. As an alternative, it's straightforward to convert the mergesort implementation in this blog post from C++ to C.
  • CPython's sorted and list.sort functions are implemented with a stable, recursive mergesort (and binary insertion sort for short inputs) with a sophisticated (complicated, long) code containing lots of optimizations: see the files Objects/listsort.txt and Objects/listobject.c in the Python source code for documentation and implementation. It looks like it's not recursive.
  • Perl sort uses mergesort by default, so it's stable, but the default can change in future versions. Add use sort 'stable'; to the beginning of your code to get a guaranteed stable sort. See the sort pragma page for details.
  • Ruby 2.0 Array#sort doesn't say if it's stable or not, but a quick search for the string qsort in array.c reveals that it uses quicksort, so it's not stable. (a bit slow) stable sorting can be added easily:
    class Array
      def stable_sort
        n = 0
        sort_by {|x| n+= 1; [x, n]}
    ... as explained here. The fundamental idea in this trick is to compare the indexes if the values are equal. This fundamental idea can be implemented in any programming language, although it may have considerable memory and/or time overhead.
  • JavaScript's Array.prototype.sort tends to be stable in modern browsers (see the details), but unstable on old browsers.
  • C# Array.Sort is not stable (because it uses quicksort), but Enumerable.OrderBy and Enumberable.ThenBy are stable (because they use mergesort. See more info here.

Other stable sorting algorithms

Heapsort and quicksort are not stable. Some variations of mergesort (such as Algorithm 5.2.4N and Algorithm 5.2.4S in The Art of Computer programming, Volume 3 by Knuth) are not stable.

This page lists stable sorting algorithms known to mankind. Be careful with the implementations though: even if the algorithm itself is stable, some optimized, otherwise modified or buggy implementations might not be. The algorithms are:

  • O(n*log(n)), stable, comparison-based: mergesort, cascade merge sort, oscillating merge sort.
  • O(n2), stable, comparison-based: bubble sort, cocktail sort, insertion sort, binary insertion sort (where the insertion index is found using binary search), gnome sort, library sort, odd-even sort.
  • O(n2), stable, not comparison-based: bucket sort, proxmap sort.
  • faster than O(n2), stable, not comparison-based: counting sort, pigeonhole sort, radix sort.

So it looks like that we don't know of an alternative of mergesort for comparison-based sorting in O(n*log(n)) time, the others above are slower. Maybe Edsger Dijkstra[citation needed] can come up with an alternative.

No comments: