Parsing strings at compile-time — Part I

In this post we will examine how constant expressions can be used to implement compile-time string parsing. I assume you are already familiar with my previous post on compile-time computations.

So, what is such compile-time string parsing good for? To substitute for missing user defined literals. Even though C++11 comes with user-defined literals, we are still missing functionality that would allow us to write something like the following:

 
Regex R1 = /^(.*A)$/;
Regex R2 = /^(.*A$/; // compile-time error: invalid regex

Date today = 21-MAR-2011;

Allowing arbitrary syntax for literals is obviously impossible, and would make compiling C++ a hell; therefore we are usually left with having to use strings:

 
Regex R1( "^(.*A)$" ) ;
Regex R2( "^(.*A$" ); // will throw: invalid regex

Date today( "21-JUM-2011" ); // will throw

Inside strings we can use whatever syntax we want. However they are just strings without any special meaning to the compiler, and therefore the compiler cannot tell us if we have made a mistake in such a ‘literal.’ We are left with having to signal errors by throwing exceptions, but this has drawbacks. First, if we throw while initializing a global object, we automatically terminate the program, because there is no way to put a try-catch block. Second, even if we can catch an exception, say, inside functions, there is no good way of handling such exception. It is a logic error.

In this post we will explore how strings can be parsed by the compiler. Let’s start with a very simple example. We will write a function that will return the n-th character in the string. Naturally, it will need two arguments: a string and an index. We will not use std::string to represent the string argument, because std::string requires memory allocation, which will not work at compile-time. Instead we will use string literals. A string literal is any text in double quotes, like "home". What is the type of "home"? It is const char[5] (four characters for the letters in the string and one for terminating zero). Unfortunately, we will not be able to pass the array by value, because it doesn’t work in C++: arrays are decayed to pointers when passed to functions, and we loose the information about array size, and we are definitely interested in the size of our string literal. Therefore we will need to pas the array by reference. So the first argument in our function will be const char (&arr)[N]. But what should N be if we do not know the size of the string we will be parsing? We will make it a template parameter and let the compiler deduce it on the fly. Long story short, the following is our first function:

 
template< unsigned N > constexpr
char nth_char( const char (&arr) [N], unsigned i )
{
    return arr[i];
}

Keyword constexpr indicates that we may want the results of the function to be obtained at compile-time. N is the length of the array that is going to be deduced when function is called. The size of the string we will parse is N - 1. arr is a reference to array. References are allowed for constexpr functions, and such functions can still be evaluated at compile-time, provided that the reference is bound to a literal. We can use our function in the following way:

 
static_assert( nth_char("home", 2) == 'm', "should be m" );

static_assert (also a new feature in C++11) requires its argument to be a compile-time constant. If this code compiles, it means our function has been evaluated at compile-time. Now, we will add some range checking to our function:

 
constexpr unsigned requires_inRange( unsigned i, unsigned len )
{
    return i >= len ? throw OutOfRange(i, len) : i;
}

template< unsigned N > constexpr
char nth_char( const char (&arr) [N], unsigned i )
{
    return requires_inRange(i, N - 1),
           arr[i];
}

Function requires_inRange checks for precondition. If i is in range it has no effect; otherwise it throws an exception at run-time and reports a compilation error at compile-time, because you cannot throw exceptions during the compilation. For compile-time computations it doesn’t matter what you throw, you only need to call a run-time-only operation. Second argument to requires_inRange is N - 1 rather than N, because we do not want to include the final zero in zero-terminated string in the range. Now, using the following case inside a switch statement:

 
case  nth_char("home", 4):

will result in compilation error (function requires_inRange is not a constant expression for i is 4 (and N is 5)), even though it would compile fine for smaller indexes.

So, by now we know how to compute at least one thing about the string at compile-time. Now we can proceed to more advanced computations. First, lets introduce a class that will wrap our literals:

 
class StrWrap
{
    char * const begin_;
    unsigned size_;

public:
    template< unsigned N >
    constexpr StrWrap( const char(&arr)[N] ) : begin_(arr), size_(N - 1) {
        static_assert( N >= 1, "not a string literal");
    }

    constexpr char operator[]( unsigned i ) { 
        return requires_inRange(i, size_), begin_[i]; 
    }

    constexpr operator const char *() { 
        return begin_; 
    }

    constexpr unsigned size() { 
        return size_; 
    }
};

This is useful for a couple of reasons. Once we have created an object of this type we store the size of the string in it: we no longer need to use templates. While templates are useful for many compile-time applications, using them incurs cost of instantiating more and more of them. Second, We can pass our objects by value. While passing literals by a reference to array worked in basic example, it will not work if we use it in conditional operator, because using conditional operator with a throw in it, requires an array-to-pointer decay. Now, our new type requires a bit of description.

The class has two constructors: a copy constructor (generated implicitly) and a constructor template with different sizes of the array. This is the only template that we need. In the constructor we change template parameter for class member, and henceforth we can use a normal, non-template class. We initialize the pointer with a reference to array. Pointers to objects are valid for constexpr functions and constructors, provided they point to constexpr objects, and this will be the case in our examples. The constructor is not explicit because we do not want to force the users to type it everywhere. They should not even be aware of our type’s existence. Size of string literal is always at least 1 due to the terminating zero; hence the assertion, and N - 1 in the initializer.

Index operator makes our type behave as it was an array, except that we check the index for being in range. Note that we could use static_assert in the constructor because we were testing the template parameter; we cannot use it in our indexing operator because here we are testing a data member, and although it will be a compile-time constant in some of the usages, the compiler cannot assume that, because we may also use our type in run-time contexts.

The implicit cast to zero-terminated C-string is necessary to use our type in contexts where we would normally use a C-string, like streaming the string out to std::cout. This is not typical of a class to provide an implicit conversions from and to a similar type, and it might be insecure for some types; however our usage seams to require this practice. Note that we did not declare our member functions as const although they do not mutate the state of the object. This is because const specifier is automatically added to constexpr function’s signature.

One other concern that needs to be addressed, if we have pointers to segments of memory, who will take care of allocating and deallocating the memory? The answer here is that we will only be using string literals. String literals are always located in static memory, allocated before the program starts and disposed of after the program terminates — all this done automatically.

Now, lets try to count the occurrences of a given character in a string. For that we will need to inspect every single character in the string. Normally (that is, at run-time) we would use a loop construct for that, however at compile-time, which only requires functional programming patterns we cannot use iteration, because iteration always requires altering some iteration variable; at compile-time we cannot alter anything. Therefore we will need to use recursion (very popular in FP). Here it is:

 
constexpr unsigned count( StrWrap str, char c, unsigned i = 0, unsigned ans = 0 )
{
    return i == str.size() ? ans :
               str[i] == c ? count(str, c, i + 1, ans + 1) :
                             count(str, c, i + 1, ans);
}

A word of comment. First condition is our terminating condition: if we reach the end of string we return the answer we have collected so far. ans is our ‘current’ answer. Second condition is checking if the character at given position is the one we are looking for. i indicates our current position. Based on the result we continue with either incremented or the same current result, and change the current position to the next one. Now we can check if the function works correctly:

 
static_assert( count("dude", 'd') == 2, "d != 2" ); 
static_assert( count("dude", 'u') == 1, "u != 1" );
static_assert( count("dude", 'g') == 0, "g != 0" );

For a slightly more useful example, let’s parse a string containing parentheses and check if they are balanced, i. e., if the number of left parentheses is the same as the number of right parentheses. A function, although somewhat longer, is similar:

 
constexpr int balance( StrWrap str, unsigned i = 0, int ans = 0 )
{
    return i == str.size() ? ans :
             str[i] == '(' ? balance(str, i + 1, ans + 1) :
             str[i] == ')' ? balance(str, i + 1, ans - 1) :
                             balance(str, i + 1, ans);
}

Now again, we can test our function at compile-time:

 
static_assert( balance("") == 0, "should be balanced" ); 
static_assert( balance("dude") == 0, "should be balanced" );
static_assert( balance("((x + y)*(x - y)) + (z)") == 0, "..." );
static_assert( balance("((x + y)") == 1, "should be 1" );
static_assert( balance(")(") == 1, "should be balanced" );

Note the last assertion. Should string ")(" be really called balanced? I guess not. It looks that the number of closing brackets must at no point (of recursion) exceed the number of opening brackets. We need to fix our function:

 
constexpr int balance( StrWrap str, unsigned i = 0, int ans = 0 )
{
    return requires_nonnegative(ans),
           i == str.size() ? ans : 
             str[i] == '(' ? balance(str, i + 1, ans + 1):
             str[i] == ')' ? balance(str, i + 1, ans - 1):
                             balance(str, i + 1, ans);
}

The additional call to function requires_nonnegative (described in the previous post) says that the value of balance must never go below zero.

While this starts to appear more useful, in typical applications we will be interested in creating balanced strings rather than measuring the balance of the string. The following technique will enable us to use only balanced string.

 
constexpr StrWrap Balanced( StrWrap str  )
{
	return balance(str) != 0 ? throw ImbalncedExpr() : str;
}

int main()
{
    constexpr auto myExpression = Balanced("(x + y) * (x - y)");
    std::cout << myExpression << std::endl;
}

We used auto in order not to specify the type manually. This type inference functionality is also part of C++11. Alternatively, you can provide one of the types StrWrap or const char * and the example will also work. Note that we cannot make the example a one-liner because we need a place where we can put constexpr initialization. That is, the compile-time check in the following example would not work, because we did not force constexpr initialization.

 
int main()
{
    std::cout << Balanced("(x + y * (x - y") << std::endl;
    // compiles fine, throw at runtime
}

Since we mentioned algebraic expressions, the next question that comes to mind is if it is possible to write a function that takes a string representing an algebraic expression, like "(2 + 3) * 5 * (90 - 22)" and returns the value of such expression as type int. But we will answer this question in the next post.

7 Responses to Parsing strings at compile-time — Part I

Faisal Vali says:

July 22, 2011 at 10:19 am

Hi enjoyed your post! – along similar lines, I wrote a rudimentary library that allows one to arbitrarily manipulate and process strings at compile time (with some implementation limits of course).
Check out:
1) http://tinyurl.com/45xkdlq
(http://groups.google.com/group/comp.lang.c++.moderated/browse_thread/thread/d9bddd4105f1441e?hl=en#)
Faisal Vali

sehe (@sehetw) says:

June 20, 2014 at 10:35 am

Are you aware of the splendid metaparse library by Abel Sinkovicz (mpllibs on github)? It’s a parser combinator for compile time string parsing AFAICT

Pingback: Get the name of a time type in chrono | DL-UAT
Pingback: The road for reflection: Tagging types - IT大道
Pingback: The Road For Reflection: Tagging Types | 神刀安全网

Oodini says:

September 1, 2017 at 10:57 pm

Hello.

I don’t understand the count function :

If I use it with any non empty string, str.size() will return a positive value, and soans will be assigned to i. Or I didn’t get an obvious stuff…

Andrzej Krzemieński says:

September 4, 2017 at 8:10 am

You can try it with a non-constexpr simpler function with your favorite compiler/debugger:

#include <string>
#include <cassert>

unsigned count( std::string str, char c, unsigned i = 0, unsigned ans = 0 )
{
    return i == str.size() ? ans :
               str[i] == c ? count(str, c, i + 1, ans + 1) :
                             count(str, c, i + 1, ans);
}

int main()
{
    assert (count("ABACAD", 'A') == 3);
    assert (count("ABACAD", 'B') == 1);
    assert (count("ABACAD", 'Z') == 0);
}

Note that the condition in conditional statement is not bool(str.size()) but i == str.size(), and i is greater by one in each recurrent step.