2013-12-01

Binary search in line-sorted text files

This blog post announces pts-line-bisect, a C program and an equivalent Python script and library for doing binary seach in line-sorted text files, and it also explains some of the design details of the Python implementation.

Let's suppose you have a sorted text file, in which each line is lexicographically larger than the previous one, and you want to find a specific range of lines, or all lines with a specific prefix.

I've written the program pts-line-bisect for that recently. See the beginning of the README on many examples how to use the C program. The Python program has fewer features, here is how to use it:

  • Closed ended interval search: print all lines at least bar and at most foo:
    $ python pts_line_bisect.py -t file.sorted bar foo
  • Open ended interval search: print all lines at least bar and smaller than foo:
    $ python pts_line_bisect.py -e file.sorted bar foo
  • Prepend position: print the smallest offset where foo can be inserted (please note that this does not report whether foo is present, always exit(0)).
    $ python pts_line_bisect.py -oe file.sorted foo
  • Append position: print the largest offset where foo can be inserted (please note that this does not report whether foo is present, always exit(0)).
    $ python pts_line_bisect.py -oae file.sorted foo
  • Exact range: print the range (start byte offset and after-the-last-byte end offset) of lines containing only foo:
    $ python pts_line_bisect.py -ot file.sorted foo

Please note that these tools support duplicate lines correctly (i.e. if the same line appears multiple times, in a block).

Please note that these tools and libraries assume that the key is at the beginning of the line. If you have a sorted file whose lines contain the sort key somewhere in the middle, you need to make changes to the code. It's possible to do so in a way that the code will still support duplicate keys (i.e. records with the same key).

Analysis

I've written the article titled Evolution of a binary search implementation for line-sorted text files about the topic, containing the problem statement, the possible pitfalls, an analysis of some incorrect solutions available on the web as code example, a detailed specification and explanation of my solution (including a proof), disk seek and speed analysis, a set of speedup ideas and their implementation, and further notes about the speedups in the C implementation.

As a teaser, here is an incorrect solution in Python:

def bisect_left_incorrect(f, x):
  """... Warning: Incorrect implementation with corner case bugs!"""
  x = x.rstrip('\n')
  f.seek(0, 2)  # Seek to EOF.
  lo, hi = 0, f.tell()
  while lo < hi:
    mid = (lo + hi) >> 1
    f.seek(mid)
    f.readline()  # Ignore previous line, find our line.
    if x <= f.readline().rstrip('\n'):
      hi = mid
    else:
      lo = mid + 1
  return lo

Can you spot the all the 3 bugs?

Read the article for all the details and the solution.

About sorting: For the above implementation to work, files must be sorted lexicographically, using the unsigned value (0..255) for each byte in the line. If you don't sort the file, or you sort it using some language collation or some locale, then the linked implementation won't find the results you are looking for. On Linux, use LC_ALL=C sort <in.txt >out.txt to sort lexicographically.

As a reference, here is the correct implementation of the same algorithm for finding the start of the interval in a sorted list or other sequences (based on the bisect module):

def bisect_left(a, x, lo=0, hi=None):
  """Return the index where to insert item x in list a, assuming a is sorted.

  The return value i is such that all e in a[:i] have e < x, and all e in
  a[i:] have e >= x.  So if x already appears in the list, a.insert(x) will
  insert just before the leftmost x already there.

  Optional args lo (default 0) and hi (default len(a)) bound the
  slice of a to be searched.
  """
  if lo < 0:
    raise ValueError('lo must be non-negative')
  if hi is None:
    hi = len(a)
  while lo < hi:
    mid = (lo + hi) >> 1
    if x <= a[mid]:  # Change `<=' to `<', and you get bisect_right.
      hi = mid
    else:
      lo = mid + 1
  return lo

A typical real-word use case for such a binary search tool is retrieving lines corresponding to a particular time range in log files (or time based measurement records). These files text files with variable-length lines, with the log timestamp in the beginning of the line, and they are generated in increasing timestamp order. Unfortunately the lines are not lexicographically sorted, so the timestamp has to be decoded first for the comparison. The bsearch tool does that, it also supports parsing arbitrary, user-specifiable datetime formats, and it can binary search in gzip(1)ped files as well (by building an index). It's also of high performance and low overhead, partially because it is written in C++. So bsearch is practical tool with lots of useful features. If you need anything more complicated than a lexicographic binary search, use it instead.

Before getting too excited about binary search, please note that there are much faster alternatives for data on disk. In an external (file-based) binary search the slowest operation is the disk seek needed for each bisection. Most of the software and hardware components would be waiting for the hard disk to move the reading head. (Except, of course, that seek times are negligible when non-spinning storage hardware such as SSD or memory card is used.)

An out-of-the box solution would be adding the data to more disk-efficient key-value store. There are several programs providing such stores. Most of them are based on a B-tree, B*-tree or B+-tree data structure if sorted iteration and range searches have to be supported, or disk-based hashtables otherwise. Some of the high-performance single-machine key-value stores: cdb (read-only), Tokyo Cabinet, Kyoto Cabinet, LevelDB; see more in the NoSQL software list.

The fundamental speed difference between a B-tree search and a binary search in a sorted list stems from the fact that B-trees have a branching factor larger than 2 (possibly 100s or 1000s), thus each seeking step in a B-tree search reduces possible input size by a factor larger than 2, while in a binary search each step reduces the the input size by a factor 2 only (i.e. we keep either the bottom half or the top half). So both kinds of searches are logarithmic, but the base of the logarithm is different, and this causes a constant factor difference in the disk seek count. By careful tunig of the constant in B-trees it's usual to have only 2 or 3 disk seeks for each search even for 100GB of data, while a binary search in a such a large file with 50 bytes per record would need 31 disk seeks. By taking 10ms as the seek time (see more info about typical hard disk seek times), a typical B-tree search takes 0.03 second, and a typical binary search takes 0.31 second.

Have fun binary searching, but don't forget to sort your files first!

5 comments:

zsbana said...

I have implemented binary search in a sorted text file once, a long time ago. See http://article.gmane.org/gmane.comp.lang.perl.qotw.discuss/2118 , and have fun undoing the random substitutions gmane does on the text.

tester said...

I've found out that I had to sort my file using C locale first. It wasn't really obvious since I thought I only need to use locale to run pts_lbsearch.

I have another question - is there any way I could limit a number of resulting lines?

pts said...

@tester: Thank you for mentioning the C locale explicitly, I've extended the post accordingly.

To limit the number of resulting lines, you either have to extend the source code of pts_lbsearch (this will give you the fastest solution) or filter its output (e.g. ... | head -10).

Ed said...

I was curious that you did not mention duplicate keys. I adapted your python code by also passing a get_key functor since my lines have the key as an embedded part of each line.

But I was wondering if you had a versio that takes into account duplicate key runs, such that it guarantees to return the initial occurence for start, and the last occurence for end, for any range of keys requested.

pts said...

@Ed: I've added some clarification about the support for duplicate lines and keys. I don't have any other code to share, so if you want to have your key in the middle of the line, you need to make code changes.