pts.blog: 2014-04

This blog post explains how to write C++ code which parses integers, in a type-safe way for any integral type, paying attention to overflows and underflow. The code with small modifications can be used in C as well, C++ language features are only used to allow arbitrary integral types.

It sounds like we could use a standard library function. But apparently there isn't any that would work:

Please note that sscanf(3) etc. don't do overflow detection, so we can't use that.
Please note that atoi(3) etc. can't indicate a parse error, so we can't use that.
Please note that strtol(3) etc. doesn't exist so it can't do overflow detection for anything smaller than a long, so we can't use that.
Please note that sscanf, atoi and strtol support NUL-terminated strings only, they can't parse a string (and report the expected parse error) which contains an '\0'.

So let's implement our own parser. Someone might expect that this is a trivial and boring task, but it will turn out that there are many pitfalls, and many tricky ideas to avoid all of them. Let's see.

The first attempt:

template<class T>bool parse_dec(char const *p, T *out) {
  T n = 0;
  while (*p + 0U - '0' <= 9U) {  // A digit.
    n = 10 * n + *p++ - '0';
  }
  *out = n;
  return true;  // Success.
}

There is only 1 trick in the implementation: how to detect whether *p is a digit (i.e. between '0' and '9'. For characters with too large ASCII code, the result of *p + 0U - '0' would be larger than 9, so the condition is false. For characters with too small ASCII code, the result is negative, but because of 0U it would be interpreted as unsigned, so it would be larger than 9U, so the condition is false. We don't use the standard library function isdigit(3), because on some systems it returns true for bytes other than '0' ... '9' (see the link above).

Problems:

It doesn't check that the end of the string is reached at the end of the number.
It parses the empty string as 0.
It doesn't support negative numbers.
It doesn't check for overflow or underflow.

It's easy to fix problems 1 and 2 by restructuring the loop and checking for '\0':

template<class T>bool parse_dec(char const *p, T *out) {
  T n = 0;
  do {
    if (*p + 0U - '0' > 9U) return false;  // Not a digit.
    n = 10 * n + *p++ - '0';
  } while (*p != '\0');
  *out = n;
  return true;  // Success.
}

To fix problem 3 (negative numbers), we need a way to detect whether the type T is signed, and then parse a leading -:

template<class T>bool parse_dec(char const *p, T *out) {
  const bool is_signed = (T)~(T)0 < (T)0;
  bool is_negative = false;
  if (is_signed && *p == '-') {
    is_negative = true;
    ++p;
  }
  T n = 0;
  do {
    if (*p + 0U - '0' > 9U) return false;  // Not a digit.
    n = 10 * n + *p++ - '0';
  } while (*p != '\0');
  *out = is_negative ? -n : n;
  return true;  // Success.
}

To understand the initialization of is_signed above, consider that in C and C++, the expression ~0 generates an int with all bits 1, thus it's -1, thus it's negative. The expression (unsigned)~0 is positive, it's the largest value for unsigned int. So by checking the sign of (T)~(T)0 we can determine whether T is a signed or unsigned integral type. The double cast is necessary to make it work even if sizeof(T) is smaller or larger than sizeof(int).

Please note that the ~ operator works for integral types, but it doesn't compile for pointer or floating point types etc. So if parse_dec is used with a wrong T, then a compile error would be reported at template instantiation time. To convert it to a template substitution failure, type_traits and enable_if can be used.

To fix problem 4 (overflow detection), we can add an if just before we assign the larger value to n. But how can we check for an overflow without triggering the overflow (and already losing some high bits from the result)? We can work like this: if we are parsing a positive, 8-bit signed int, then the maximum value is 127, so there is no overflow if the previous n is less than 12, or the previous n is 12, and the new digit is at most 7. That's exactly how the code below checks for overflow. It also calculates the upper limit of n (e.g. 12) and the new digit (e.g. 7) based on properties of T: whether it is signed, its size, and whether there was a - in the input.

template<class T>bool parse_dec(char const *p, T *out) {
  // ~(T)~(T)0 compiles for integral types, but it doesn't compile for pointer
  // or floating point types etc.
  const bool is_signed = (T)~(T)0 < (T)1;
  const T nmax = (is_signed ? (T)~((T)1 << (sizeof(T)*8-1)) : (T)~(T)0) / 10;
  bool is_negative = false;
  if (is_signed && *p == '-') {
    is_negative = true;
    ++p;
  }
  const char cmax = is_signed ? (is_negative ? 8 : 7) : 5;
  T n = 0;
  do {
    if (*p + 0U - '0' > 9U) return false;  // Not a digit.
    const char c = *p++ - '0';
    const T nneg = -n;
    if (nneg > nmax || (nneg == nmax && c > cmax)) return false;  // Overflow.
    n = 10 * n - c;
  } while (*p != '\0');
  *out = is_negative ? n : -n;
  return true;  // Success.
}

It's interesting to note that the last allowed digit (cmax) doesn't depend on sizeof(T), because the sequence containing of the last digit of powers of 2 is periodic with a period divisible by 8.

Please note that the upper limit signed integers is not symmetric to 0. For example, for uint8_t, the smallest allowed value is -128, and the largest one is 127. This difference is reflected in the initialization of cmax: 7 or 8.

Please note that we're updating n with the opposite sign, and we flip the sign only at the end. This is necessary to avoid signed integer overflow when parsing the smallest possible integer (e.g. -128 for 8-bit integers). Without the sign flip we'd temporarily have 128 in n, which is an overflow.

Please note that in each comparison in the code above we cast both sides explicitly to the same type. (In the 9U comparison both sides are of type unsigned int.) This is to make the code easy to understand even if the reader is not familiar with the integer automatic conversion and sign extension rules in C and C++.

Please note that in C and C++ the result is undefined if there is a signed integer overflow in the arithmetic operations. That's OK for us, because we make sure that our arithmetic operations (10 * n + c and the boring ones) happen to never overflow: either we explicitly check for overflow before, or the input contains only small numbers.

The example code above can be changed easily if the input is not a NUL-terminated string, but any array of characters with a known size, or a FILE* (using getc), or an std::istream (using get). This is left as an exercise to the reader.

Similar code can be written for parsing hex and octal, and combining them to autodetect the base based on the prefix (i.e. 0x or 0X for decimal, and 0 for octal.

For completeness, here is what kind of numbers our last parser accepts: those non-overflowing integers which fully match the regexp -[0-9]+ for signed T, and [0-9]+ for unsigned T. So leading whitespace is not allowed, trailing junk is not allowed, multiple leading 0s are allowed, a leading + is not allowed, and a leading - is allowed iff T is signed.

See the full code with some usage examples on Github.

This blog post explains how to convert the protocol from http:// to https:// in YouTube URLs in the browsing history of a Mozilla Firefox profile. YouTube has recently started redirecting logged-in users from http:// to https:// . This conversion is useful in the following situation:

The user uses his Firefox browsing history to track which YouTube videos he has already seen.
The user disables web site colors (e.g. using the PrefBar Firefox extension) so by looking at the color of a YouTube video link he can immediately see if he has visited that video page or not.
The user uses some Greasemonkey scripts to normalize YouTube video links on web pages (e.g. to remove &feature=related, so the normalized URLs will be added to the Firefox browsing history.
Bonus: The user uses the LinkVisitor Firefox extension to mass-toggle the visited status of URLs without actually visiting them in Firefox.

To do the protocol conversion from http:// to https:// , close Firefox and run this on Linux (without the leading $ sign):

$ sqlite3 ~/.mozilla/firefox/*.default/places.sqlite '
    UPDATE OR IGNORE moz_places   SET url="https" || SUBSTR(url, 5) WHERE url LIKE "http://youtube.com/%" OR url LIKE "http://www.youtube.com/%";
    UPDATE OR IGNORE moz_favicons SET url="https" || SUBSTR(url, 5) WHERE url LIKE "http://youtube.com/%" OR url LIKE "http://www.youtube.com/%";
    ANALYZE;
    VACUUM;
'

The ANALYZE and VACUUM parts are optional, they just speed up Firefox accessing these tables in the future.

The SQLite UPDATE queries above can be run on the relevant places.sqlite file on Mac OS X or Windows as well. (This blog post doesn't provide explicit instructions how to run those queries. Use Google web search or ask on Stack Overflow to figure out how.)

pts.blog

Flattr this blog

Blog Archive

Recommended

2014-04-26

How to parse an integer in C++ (and C) with overflow detection

2014-04-10

Announcing ssweb: single-shot webserver in Python

2014-04-04

How to convert all YouTube links from http:// to https:// in Firefox history