Binary search in line-sorted text files

This blog post announces pts-line-bisect, a C program and an equivalent Python script and library for doing binary seach in line-sorted text files, and it also explains some of the design details of the Python implementation.

Let's suppose you have a sorted text file, in which each line is lexicographically larger than the previous one, and you want to find a specific range of lines, or all lines with a specific prefix.

I've written the program pts-line-bisect for that recently. I've also written the article titled Evolution of a binary search implementation for line-sorted text files about the topic, containing the problem statement, the possible pitfalls, an analysis of some incorrect solutions available on the web as code example, a detailed specification and explanation of my solution (including a proof), disk seek and speed analysis, a set of speedup ideas and their implementation, and further notes about the speedups in the C implementation.

As a teaser, here is an incorrect solution in Python:

def bisect_left_incorrect(f, x):
  """... Warning: Incorrect implementation with corner case bugs!"""
  x = x.rstrip('\n')
  f.seek(0, 2)  # Seek to EOF.
  lo, hi = 0, f.tell()
  while lo < hi:
    mid = (lo + hi) >> 1
    f.readline()  # Ignore previous line, find our line.
    if x <= f.readline().rstrip('\n'):
      hi = mid
      lo = mid + 1
  return lo

Can you spot the all the 3 bugs?

Read the article for all the details and the solution.

About sorting: For the above implementation to work, files must be sorted lexicographically, using the unsigned value (0..255) for each byte in the line. If you don't sort the file, or you sort it using some language collation or some locale, then the linked implementation won't find the results you are looking for. On Linux, use LC_ALL=C sort <in.txt >out.txt to sort lexicographically.

As a reference, here is the correct implementation of the same algorithm for finding the start of the interval in a sorted list or other sequences (based on the bisect module):

def bisect_left(a, x, lo=0, hi=None):
  """Return the index where to insert item x in list a, assuming a is sorted.

  The return value i is such that all e in a[:i] have e < x, and all e in
  a[i:] have e >= x.  So if x already appears in the list, a.insert(x) will
  insert just before the leftmost x already there.

  Optional args lo (default 0) and hi (default len(a)) bound the
  slice of a to be searched.
  if lo < 0:
    raise ValueError('lo must be non-negative')
  if hi is None:
    hi = len(a)
  while lo < hi:
    mid = (lo + hi) >> 1
    if x <= a[mid]:  # Change `<=' to `<', and you get bisect_right.
      hi = mid
      lo = mid + 1
  return lo

A typical real-word use case for such a binary search tool is retrieving lines corresponding to a particular time range in log files (or time based measurement records). These files text files with variable-length lines, with the log timestamp in the beginning of the line, and they are generated in increasing timestamp order. Unfortunately the lines are not lexicographically sorted, so the timestamp has to be decoded first for the comparison. The bsearch tool does that, it also supports parsing arbitrary, user-specifiable datetime formats, and it can binary search in gzip(1)ped files as well (by building an index). It's also of high performance and low overhead, partially because it is written in C++. So bsearch is practical tool with lots of useful features. If you need anything more complicated than a lexicographic binary search, use it instead.

Before getting too excited about binary search, please note that there are much faster alternatives for data on disk. In an external (file-based) binary search the slowest operation is the disk seek needed for each bisection. Most of the software and hardware components would be waiting for the hard disk to move the reading head. (Except, of course, that seek times are negligible when non-spinning storage hardware such as SSD or memory card is used.)

An out-of-the box solution would be adding the data to more disk-efficient key-value store. There are several programs providing such stores. Most of them are based on a B-tree, B*-tree or B+-tree data structure if sorted iteration and range searches have to be supported, or disk-based hashtables otherwise. Some of the high-performance single-machine key-value stores: cdb (read-only), Tokyo Cabinet, Kyoto Cabinet, LevelDB; see more in the NoSQL software list.

The fundamental speed difference between a B-tree search and a binary search in a sorted list stems from the fact that B-trees have a branching factor larger than 2 (possibly 100s or 1000s), thus each seeking step in a B-tree search reduces possible input size by a factor larger than 2, while in a binary search each step reduces the the input size by a factor 2 only (i.e. we keep either the bottom half or the top half). So both kinds of searches are logarithmic, but the base of the logarithm is different, and this causes a constant factor difference in the disk seek count. By careful tunig of the constant in B-trees it's usual to have only 2 or 3 disk seeks for each search even for 100GB of data, while a binary search in a such a large file with 50 bytes per record would need 31 disk seeks. By taking 10ms as the seek time (see more info about typical hard disk seek times), a typical B-tree search takes 0.03 second, and a typical binary search takes 0.31 second.

Have fun binary searching, but don't forget to sort your files first!


zsbana said...

I have implemented binary search in a sorted text file once, a long time ago. See http://article.gmane.org/gmane.comp.lang.perl.qotw.discuss/2118 , and have fun undoing the random substitutions gmane does on the text.

tester said...

I've found out that I had to sort my file using C locale first. It wasn't really obvious since I thought I only need to use locale to run pts_lbsearch.

I have another question - is there any way I could limit a number of resulting lines?

pts said...

@tester: Thank you for mentioning the C locale explicitly, I've extended the post accordingly.

To limit the number of resulting lines, you either have to extend the source code of pts_lbsearch (this will give you the fastest solution) or filter its output (e.g. ... | head -10).