2014-04-26

How to parse an integer in C++ (and C) with overflow detection

This blog post explains how to write C++ code which parses integers, in a type-safe way for any integral type, paying attention to overflows and underflow. The code with small modifications can be used in C as well, C++ language features are only used to allow arbitrary integral types.

It sounds like we could use a standard library function. But apparently there isn't any that would work:

  • Please note that sscanf(3) etc. don't do overflow detection, so we can't use that.
  • Please note that atoi(3) etc. can't indicate a parse error, so we can't use that.
  • Please note that strtol(3) etc. doesn't exist so it can't do overflow detection for anything smaller than a long, so we can't use that.
  • Please note that sscanf, atoi and strtol support NUL-terminated strings only, they can't parse a string (and report the expected parse error) which contains an '\0'.

So let's implement our own parser. Someone might expect that this is a trivial and boring task, but it will turn out that there are many pitfalls, and many tricky ideas to avoid all of them. Let's see.

The first attempt:

template<class T>bool parse_dec(char const *p, T *out) {
  T n = 0;
  while (*p + 0U - '0' <= 9U) {  // A digit.
    n = 10 * n + *p++ - '0';
  }
  *out = n;
  return true;  // Success.
}

There is only 1 trick in the implementation: how to detect whether *p is a digit (i.e. between '0' and '9'. For characters with too large ASCII code, the result of *p + 0U - '0' would be larger than 9, so the condition is false. For characters with too small ASCII code, the result is negative, but because of 0U it would be interpreted as unsigned, so it would be larger than 9U, so the condition is false. We don't use the standard library function isdigit(3), because on some systems it returns true for bytes other than '0' ... '9' (see the link above).

Problems:

  1. It doesn't check that the end of the string is reached at the end of the number.
  2. It parses the empty string as 0.
  3. It doesn't support negative numbers.
  4. It doesn't check for overflow or underflow.
It's easy to fix problems 1 and 2 by restructuring the loop and checking for '\0':
template<class T>bool parse_dec(char const *p, T *out) {
  T n = 0;
  do {
    if (*p + 0U - '0' > 9U) return false;  // Not a digit.
    n = 10 * n + *p++ - '0';
  } while (*p != '\0');
  *out = n;
  return true;  // Success.
}
To fix problem 3 (negative numbers), we need a way to detect whether the type T is signed, and then parse a leading -:
template<class T>bool parse_dec(char const *p, T *out) {
  const bool is_signed = (T)~(T)0 < (T)0;
  bool is_negative = false;
  if (is_signed && *p == '-') {
    is_negative = true;
    ++p;
  }
  T n = 0;
  do {
    if (*p + 0U - '0' > 9U) return false;  // Not a digit.
    n = 10 * n + *p++ - '0';
  } while (*p != '\0');
  *out = is_negative ? -n : n;
  return true;  // Success.
}

To understand the initialization of is_signed above, consider that in C and C++, the expression ~0 generates an int with all bits 1, thus it's -1, thus it's negative. The expression (unsigned)~0 is positive, it's the largest value for unsigned int. So by checking the sign of (T)~(T)0 we can determine whether T is a signed or unsigned integral type. The double cast is necessary to make it work even if sizeof(T) is smaller or larger than sizeof(int).

Please note that the ~ operator works for integral types, but it doesn't compile for pointer or floating point types etc. So if parse_dec is used with a wrong T, then a compile error would be reported at template instantiation time. To convert it to a template substitution failure, type_traits and enable_if can be used.

To fix problem 4 (overflow detection), we can add an if just before we assign the larger value to n. But how can we check for an overflow without triggering the overflow (and already losing some high bits from the result)? We can work like this: if we are parsing a positive, 8-bit signed int, then the maximum value is 127, so there is no overflow if the previous n is less than 12, or the previous n is 12, and the new digit is at most 7. That's exactly how the code below checks for overflow. It also calculates the upper limit of n (e.g. 12) and the new digit (e.g. 7) based on properties of T: whether it is signed, its size, and whether there was a - in the input.

template<class T>bool parse_dec(char const *p, T *out) {
  // ~(T)~(T)0 compiles for integral types, but it doesn't compile for pointer
  // or floating point types etc.
  const bool is_signed = (T)~(T)0 < (T)1;
  const T nmax = (is_signed ? (T)~((T)1 << (sizeof(T)*8-1)) : (T)~(T)0) / 10;
  bool is_negative = false;
  if (is_signed && *p == '-') {
    is_negative = true;
    ++p;
  }
  const char cmax = is_signed ? (is_negative ? 8 : 7) : 5;
  T n = 0;
  do {
    if (*p + 0U - '0' > 9U) return false;  // Not a digit.
    const char c = *p++ - '0';
    const T nneg = -n;
    if (nneg > nmax || (nneg == nmax && c > cmax)) return false;  // Overflow.
    n = 10 * n - c;
  } while (*p != '\0');
  *out = is_negative ? n : -n;
  return true;  // Success.
}

It's interesting to note that the last allowed digit (cmax) doesn't depend on sizeof(T), because the sequence containing of the last digit of powers of 2 is periodic with a period divisible by 8.

Please note that the upper limit signed integers is not symmetric to 0. For example, for uint8_t, the smallest allowed value is -128, and the largest one is 127. This difference is reflected in the initialization of cmax: 7 or 8.

Please note that we're updating n with the opposite sign, and we flip the sign only at the end. This is necessary to avoid signed integer overflow when parsing the smallest possible integer (e.g. -128 for 8-bit integers). Without the sign flip we'd temporarily have 128 in n, which is an overflow.

Please note that in each comparison in the code above we cast both sides explicitly to the same type. (In the 9U comparison both sides are of type unsigned int.) This is to make the code easy to understand even if the reader is not familiar with the integer automatic conversion and sign extension rules in C and C++.

Please note that in C and C++ the result is undefined if there is a signed integer overflow in the arithmetic operations. That's OK for us, because we make sure that our arithmetic operations (10 * n + c and the boring ones) happen to never overflow: either we explicitly check for overflow before, or the input contains only small numbers.

The example code above can be changed easily if the input is not a NUL-terminated string, but any array of characters with a known size, or a FILE* (using getc), or an std::istream (using get). This is left as an exercise to the reader.

Similar code can be written for parsing hex and octal, and combining them to autodetect the base based on the prefix (i.e. 0x or 0X for decimal, and 0 for octal.

For completeness, here is what kind of numbers our last parser accepts: those non-overflowing integers which fully match the regexp -[0-9]+ for signed T, and [0-9]+ for unsigned T. So leading whitespace is not allowed, trailing junk is not allowed, multiple leading 0s are allowed, a leading + is not allowed, and a leading - is allowed iff T is signed.

See the full code with some usage examples on Github.

9 comments:

Jens Gustedt said...

Looks a bit like overkill to me. C and probably C++ has strtoull and friends that should detect overflows and be sufficient for everyday's use. Also there is isdigit to classify characters as decimal digit or not.

pts said...

@Jens Gustedt: Thanks for your opinion.

strtoull can't read an uint8_t or uint16_t with overflow detection. If you need such functionality, you have to write some code (in which you can use strtoull if you wish).

The strto*l family of functions relies on the programmer to match the correct function name with the desired integral type (for the overflow boundaries). Programmers make more mistakes than computers, so it's a good idea to automate this. In my opinion this goal and its implementation in the blog post is not an overkill at all.

My solution also works for __uint128_t (with g++ -m64), for which strtoull doesn't work on most systems: sizeof(__uint128_t) == 16; sizeof(unsigned long long) == 8.

I'm not using isdigit on purpose, because it depends on the active locale settings, and I don't want that in integer parsing.

Jens Gustedt said...

Using strtoull and then check against min and max for the type, still looks much simpler to me, and has the additional benefit to handle octal and hex values at the same time. And, no, other than some of the other classification function, isdigit is not locale dependent. It just tests for the decimal-digit characters as defined in the C standard.

pts said...

@Jens Gustedt: I agree that strtoull provides additional functionality (such as octal, hex, base autodetection and ignoring whitespace in the beginning). My implementation can be extended to do so, I leave it as an exercise for the reader.

As I wrote earlier, strtoull doesn't work with __uint128_t. If the goal is to write a generic, type safe integer parser, strtoull can't possibly be a solution.

Also strtoull can't be used to parse strings with '\0' inside. My implementation can be easily extended to do so, while an implementation calling strtoull can't be without copying the input (and that won't be simple).

When I'm writing code, and I need to read an integer, then I want to call a single function which does it with error detection and overflow detection based whatever type of the result I'm giving to it. If there is no such function available, I'll write one, and then call it. I think this is much better practice than using an existing function, and then applying some fixups (such as overflow detection) manually. The keyword is *manually* here. I want to focus on using the integer immediately in my code, and not the details of the fixups every time I need a parsed integer.

http://en.cppreference.com/w/cpp/string/byte/isdigit says about isdigit: Some implementations (e.g. Microsoft in 1252 codepage) may classify additional single-byte characters as digits. So I'm not using isdigit.

pts said...

@Jens Gustedt: I understand that some alternatives look simpler to you, but the alternatives you have mentioned so far look more complex from the user's point of view to me.

Jens Gustedt said...

Your mileage seems to vary, but I'd write a wrapper function as you do but instead of doing the parsing all by myself, I'd use known and proven tools inside to do the parsing.

Then, that Microsoft, again, violates the C standard for isdigit is not a big surprise. Luckily I don't have to code for such a crude environment.

zsbana said...

You claim that the arithmetic in this function cannot overflow. If you attempt to parse the least signed integer, will that not overflow the addition in 10 * n + c ?

pts said...

@Jens Gustedt: Yes, using a known and proven tool sounds safer, but unfortunately there isn't such a tool in the standard library (see my analysis above why strtoull etc. don't qualify). Now we have two options: either we depend on a non-standard library or we write our own code.

Think of this blog post as the implementation of the non-standard library function, with some educational explanation of the interesting implementation details.

pts said...

@zsbana: You're right about the signed integer overflow. I've just fixed both the blog post and the code on Github.