Parsing strings at compile-time — Part II

Welcome back to compile-time string computation study. In the previous post, I promised to see if we can compute the result of an algebraic expression at compile-time. I will disappoint you. We are not going to see the entire solution, because the codes would be too long, and the post too boring. Instead, we will pick a bit simpler task. We will parse algebraic expressions in Reverse Polish notation because they are easier to parse, and require no parentheses. To further simplify the parsing, we will only allow one-digit numbers in expressions, in order to skip the logic for collecting digits into numbers. The goal is to prove that evaluating expressions at compile-time is possible rather than building a fully-working library.

Evaluating expressions requires a stack, so the first problem to solve is to implement a ‘constexpr stack’. This is different than containers offered by Boost.MPL which work exclusively at compile-time. For our task we will need a stack that works both at compile-time and run-time, similar to constexpr functions. An interesting question to answer at this point would be how big do you want the compile-time stack to be. And, on a similar token, how much memory should the compiler offer for programmer’s compile-time computations. I haven’t found a scalable implementation of a stack. Although there doesn’t appear to be any obvious reason for this being impossible, I was not able to find a solution. Perhaps you will be more lucky. As a consequence, the solution I propose here has an inelegant feature of having to represent each stack element as a separate class member. The below code implements a stack of maximum five elements.

typedef int T; // in case you want to make it a template

class Stack
{
    static constexpr unsigned requires_LessThan5( unsigned i ) {
        return i < 5 ? i : throw BadInput(i);
    }
	static constexpr unsigned requires_GreaterEqual1( unsigned i ) {
		return i >= 1 ? i : throw BadInput(i);
	}
	static constexpr unsigned requires_GreaterEqual2( unsigned i ) {
		return i >= 2 ? i : throw BadInput(i);
	}

public:

	struct Pop{};  // tags for "naming" the constructors
	struct Pop2{}; //

	unsigned size_;
	T el0, el1, el2, el3, el4; // stack elements
	
	constexpr Stack() 
	: size_(0)
	, el0()
 	, el1()
	, el2()
	, el3()
	, el4()
	{}

	constexpr Stack( Stack const& st, T elem ) // push constructor
	: size_(( requires_LessThan5(st.size_), st.size_ + 1 ))
	, el0( st.size_ == 0 ? elem : st.el0 )
 	, el1( st.size_ == 1 ? elem : st.el1 )
	, el2( st.size_ == 2 ? elem : st.el2 )
	, el3( st.size_ == 3 ? elem : st.el3 )
	, el4( st.size_ == 4 ? elem : st.el4 )
	{}

	constexpr Stack( Stack const& st, Pop ) // pop constructor
	: size_(( requires_GreaterEqual1(st.size_), st.size_ - 1 ))
	, el0( st.el0 )
 	, el1( st.el1 )
	, el2( st.el2 )
	, el3( st.el3 )
	, el4( st.el4 )
	{}

	constexpr Stack( Stack const& st, Pop2 ) // 2-elem pop
	: size_(( requires_GreaterEqual2(st.size_), st.size_ - 2 ))
	, el0( st.el0 )
 	, el1( st.el1 )
	, el2( st.el2 )
	, el3( st.el3 )
	, el4( st.el4 )
	{}

    constexpr T top() { 
        return requires_GreaterEqual1(size_),
            size_ == 1 ? el0 :
            size_ == 2 ? el1 :
            size_ == 3 ? el2 :
            size_ == 4 ? el3 :
                         el4 ;
	}

	constexpr T top2() { 
		return requires_GreaterEqual2(size_),
			size_ == 2 ? el0 :
 			size_ == 3 ? el1 :
			size_ == 4 ? el2 :
			             el3 ;
	}

	constexpr unsigned size() { return size_; }
	constexpr bool empty() { return size_ == 0; }
};

constexpr Stack pop( Stack const& s ) {
	return Stack( s, Stack::Pop() ); 
}

constexpr Stack pop2( Stack const& s ) {
	return Stack( s, Stack::Pop2() ) ; 
}

constexpr Stack push( Stack const& s, T elem ) {
	return Stack( s, elem ); 
}

Double parentheses have been used when checking the precondition in initializer-list. This is to make the compiler interpret a comma as a sequence operator rather than a two-argument initialization. We have two functions,requires_GreaterEqual1 and requires_GreaterEqual2, rather than one function with additional parameter; this is because we want to make sure that value 1 and 2 are included in the error message that will be produced if the preconditions are violated. Function pop, as it happens in functional programming, rather than really altering the stack, simply returns another stack that looks as if we removed the element from the top.Two functions are not typical of a stack: top2 and pop2. The former gives us the second element from the top; the latter removes two elements off the top. We added them to the stack interface, because the two operations are so common while parsing expressions. They are only added for brevity because the same effect can be achieved by the combination of functions top and pop.

Thus equipped with a stack we can proceed to our first interesting task: evaluating an expression in Reverse Polish notation. The algorithm can be described as follows:

For each parsed element,

  • if it is a number, push it on the stack,
  • if it is an operator pop two elements off the stack, apply the operation and push the result back onto the stack.

Once this is done the stack is left with one element which is our answer.

And this is the implementation of the algorithm; first, without error checking.

constexpr bool isnum( char c ) { return c >= '0' && c <= '9'; }
constexpr int tonum( char c ) { return c - '0'; }

constexpr int eval_rpn( StrWrap str, unsigned i = 0, Stack s = Stack() )
{
	return 
	i == str.size() ? 
		s.top()
	: isnum(str[i]) ?
		eval_rpn( str, i + 1, push( s, tonum(str[i]) ) )
	: str[i] == '+' ? 
		eval_rpn( str, i + 1, push(pop2(s), s.top2() + s.top()) ) 
	: str[i] == '-' ?
		eval_rpn( str, i + 1, push(pop2(s), s.top2() - s.top()) )
	: str[i] == '*' ?
		eval_rpn( str, i + 1, push(pop2(s), s.top2() * s.top()) )
	: str[i] == '/' ?
		eval_rpn( str, i + 1, push(pop2(s), s.top2() / s.top()) )
	: 
		throw InvalidChar( str[i] );	
}

Well, almost without error checking. Note that in order to detect if a given character represents a digit, we used our own function isnum rather than using standard isdigit, because the latter is not guaranteed to be a constexpr function. We are using a perhaps too optimistic assumption that the characters representing digits are encoded in ASCII-like way: contiguously from 0 to 9. Instruction “take two elements off the stack, add them and push the result back” is expressed with the following sub-expression:

push( pop2(s), s.top2() + s.top() )

Now, to make the code a bit more realistic we add error checking and white space skipping:

constexpr Stack requires_enoughNumbers( Stack const& s ) { 
    return s.size() < 2 ? throw Exc() : s;
}
constexpr Stack requires_enoughOperators( Stack const& s ) {
    return s.size() != 1 ? throw Exc() : s;
}
int requires_onlyValidCharacters() { // not constexpr
    throw Exc();
}

constexpr int eval_rpn( StrWrap str, unsigned i = 0, Stack s = Stack() )
{
	return 
	i == str.size() ? 
		( requires_enoughOperators(s), s.top() ) // stack size must be 1
	: str[i] == ' ' ?
		eval_rpn( str, i + 1, s ) // skip
	: isnum(str[i]) ?
		eval_rpn( str, i + 1, push( s, tonum(str[i]) ) )
	: (requires_enoughNumbers(s), str[i] == '+') ? 
		eval_rpn( str, i + 1, push(pop2(s), s.top2() + s.top()) ) 
	: str[i] == '-' ?
		eval_rpn( str, i + 1, push(pop2(s), s.top2() - s.top()) )
	: str[i] == '*' ?
		eval_rpn( str, i + 1, push(pop2(s), s.top2() * s.top()) )
	: str[i] == '/' ?
		eval_rpn( str, i + 1, push(pop2(s), s.top2() / s.top()) )
	: 
		requires_onlyValidCharacters();
}

First, we check if when we reach the end of input the stack size is one (function requires_enoughOperators). If the stack size is greater, it means that someone forgot to put some operators, or put too many numbers: the correct balance is: operators + 1 = numbers. Similarly, if we encounter an operator there must be at least two numbers on the stack (function requires_enoughNumbers). Also, we replaced the final throw with a non-constexpr function requires_onlyValidCharacters. I said in one of my previous posts that you can only call constexpr functions inside constexpr functions. This is not quite so, you can also use normal functions as long as they do not need to be evaluated at compile-time. We use this rule to our advantage: if we skip all the cases for valid characters, it means that we have an invalid character and we want to stop the compilation (or throw an exception at run-time). Now we can test our function:

static_assert( eval_rpn("5 1 + 3 /") == 2, "should be 2" ); // compiles
static_assert( eval_rpn("5 1 + 3 /") == 3, "should be 2" ); // error
static_assert( eval_rpn("5 1 + 3") == 2, "should be 2" ); // error

For a second exercise, we will parse expressions that apart from numbers also include variables, like "a b + a b - *". Obviously, we cannot evaluate the result until some numbers are bound to variables; so, we will evaluate in two phases: in the first we will just check for syntactic correctness, in the second, when the values for variables are provided, we will evaluate the final value. Let’s start with the first phase. As we already said, numbersoperators = 1. Numbers in this case include variables. We will call value numbersoperators a balance of expression and compute it. The second thing to track is that the first two items in expression must not be operators, and that we need at leas one element. Third, for every encountered operator, the number of elements already in the stack must be at least two.

constexpr bool isvar( char c ) { 
    return c >= 'a' && c <= 'z';
}
constexpr bool isoper( char c ) { 
    return c == '+' || c == '-' || c == '*' | |c == '/';
}

constexpr int balance_of_rpn( StrWrap str, unsigned i, int ans, unsigned firstnums )
{
	return i == str.size() ? (
		firstnums == 0 ?
			throw "not a single item" // (i)
		:
			ans
	)
	: isvar(str[i]) || isnum(str[i]) ?
		balance_of_rpn( str, i + 1, ans + 1, firstnums + 1 )
	: isoper(str[i]) ? (
		firstnums < 2 ? 
			throw "2 nums up front" // (ii)
		: ans < 2 ? 
			throw "too many operators" // (iii)
		: 
			balance_of_rpn( str, i + 1, ans - 1, firstnums )
	)
	: str[i] == ' ' ?
		balance_of_rpn( str, i + 1, ans, firstnums )
	:
		requires_onlyValidCharacters();		
}

A word of comment. Variable ans collects the balance. It is incremented when a number or variable is encountered and decremented when we find an operator. Variable firstnums takes care of the first two items being numbers/variables: it is incremented on each encountered number until first operator is encountered; afterwards its value is meaningless. Error conditions in the function (apart from encountering an invalid char) are: (i) not having found a single item, (ii) finding the first operator before two numbers/variables, (iii) ‘running’ balance dropping below 1 at any stage. Now we use the balance to define a function that checks for expression validity:

constexpr bool requires_valid_rpn( StrWrap str ) 
{
	return balance_of_rpn(str, 0, 0, 0) != 1 ? throw BadExpr() : true;
}

Note that we will not be interested in the result returned from the function, because we will be using the comma operator trick. The only reason for that is that in C++11 a constexpr function must return a value. Once we are done with validation, we can proceed to the second phase: binding values to variables and evaluating the expression. For that we will use a new class Expr that will store the validated string, and provide a member function eval which given the values of variables can evaluate the expression stored in the string. Different expressions have different number of variables; how do we write a function taking a different number of parameters at different times? For now, lets just try to write the version with two variables. Our function eval will be similar to the previously defined eval_rpn; we need to add two cases for two variables: a and b. Here is our class with the two-argument function.

class Expr
{
	StrWrap str_;

	public:
	constexpr Expr( StrWrap str ) 
	: str_(( requires_valid_rpn(str), str )) {}

	constexpr int eval( int a, int b, Stack s = Stack(), unsigned i = 0 )
	{ 
		return 
		i == str_.size() ? 
			( requires_enoughOperators(s), s.top() )
		: str_[i] == ' ' ?
			eval( a, b, s, i + 1 )
		: str_[i] == 'a' ?
			eval( a, b, push(s, a), i + 1 )
		: str_[i] == 'b' ?
			eval( a, b, push(s, b), i + 1 )
		: isnum(str_[i]) ?
			eval( a, b, push( s, tonum(str_[i]) ), i + 1 )
		: (requires_enoughNumbers(s), str_[i] == '+') ? 
			eval( a, b, push(pop2(s), s.top2() + s.top()), i + 1 ) 
		: str_[i] == '-' ?
			eval( a, b, push(pop2(s), s.top2() - s.top()), i + 1 )
		: str_[i] == '*' ?
			eval( a, b, push(pop2(s), s.top2() * s.top()), i + 1 )
		: str_[i] == '/' ?
			eval( a, b, push(pop2(s), s.top2() / s.top()), i + 1 )
		: 
			requires_onlyValidCharacters();
	} 
};

Now, we can use our class in the following way.

constexpr Expr x("ab+ ab- *");
static_assert( x.eval(3, 1) == 8, "should be 8" );
static_assert( x.eval(3, 3) == 0, "should be 0" );

Now, how can we extend our class to evaluate expressions with arbitrary number of variables. we will need function eval that accepts variable number of arguments. In C++11 passing variable number of arguments is done by initializer lists. However they will not work in constexpr context because none of std::initializer_list members is constexpr. Therefore we will use a less convenient, but working, feature for our task: variadic templates. At this point I assume you are already familiar with variadic templates. We start with defining a function that picks n-th integer from the list:

int getN( unsigned ) // not constexpr
{
	throw "too big N";
}

template< typename ... Args  >
constexpr int getN( unsigned i, int first, Args... args )
{
	return i == 0 ? first : getN( i - 1, args... );
}

The first function overload is the stop condition for our recursion. If we reach it, that is, if it ever gets called it means that the index (i) was too big. The second overload does the job. First argument is the index in the list that we want to extract. Next arguments (both first and args) is our list. We require the first one directly to be int and we require the same of the other arguments indirectly by variadic template recursion. Upon each recursion step list args becomes shorter until it is finally empty, or until we find the element we are looking for. The look-up is simple: if index is 0, it is the first element, otherwise check the same but for a decremented index and a list shorter by the first element. Once our function is working, it can be tested as follows:

static_assert( getN(0, 5, 6) == 5, "need 5" );
static_assert( getN(1, 5, 6) == 6, "need 6" );

Now, we can finally define our expression-evaluating class:

constexpr bool isvar( char c ) {
	return c >= 'a' && c <= 'z';
}

constexpr int varindex( char c ) {
	return c - 'a';
}

template< typename ... Args  >
constexpr int eval_impl( StrWrap str_, Stack s, unsigned i, Args... args )
{
	return 
	i == str_.size() ? 
		( requires_enoughOperators(s), s.top() )
	: str_[i] == ' ' ?
		eval_impl( str_, s, i + 1, args... )
	: isvar(str_[i]) ?
		eval_impl( str_, push(s, getN(varindex(str_[i]), args...)), i + 1, args... )
	: isnum(str_[i]) ?
		eval_impl( str_, push( s, tonum(str_[i]) ), i + 1 )
	: (requires_enoughNumbers(s), str_[i] == '+') ? 
		eval_impl( str_, push(pop2(s), s.top2() + s.top()), i + 1, args... ) 
	: str_[i] == '-' ?
		eval_impl( str_, push(pop2(s), s.top2() - s.top()), i + 1, args... )
	: str_[i] == '*' ?
		eval_impl( str_, push(pop2(s), s.top2() * s.top()), i + 1, args... )
	: str_[i] == '/' ?
		eval_impl( str_, push(pop2(s), s.top2() / s.top()), i + 1, args... )
	: 
		requires_onlyValidCharacters();
}

class Expr
{
	StrWrap str_;

	public:
	constexpr Expr( StrWrap str ) 
	: str_(( requires_valid_rpn(str), str )) {}

	template< typename ... Args  >
	constexpr int eval( Args... args )
	{
		return eval_impl( str_, Stack(), 0, args... );
	};
};

Now the following tests work:

static_assert( Expr("abc++").eval(1,2,3) == 6, "bad" );
static_assert( Expr("abc++").eval(3,6,9) == 18, "bad" );
static_assert( Expr("12+").eval() == 3, "bad" );

Thus, we have seen how expressions encoded in strings can be parsed and evaluated at compile-time. Although I cannot prove it, the above examples convince me that similar techniques are sufficient to verify correctness of a regular expression at compile-time. These techniques can be used to emulate literals (compile-time checked) of very complex syntax. C++ standard even calls classes that define constexpr constructors (other than copy constructor) literal types.

If you find compile-time string parsing interesting you can also try Faisal Vali’s library at SourceForge.

About these ads
This entry was posted in programming and tagged , . Bookmark the permalink.

3 Responses to Parsing strings at compile-time — Part II

  1. Thiago Adams says:

    Hi,
    Do you know any compiler that can be used to test constexpr?
    I want to parse colors at compile time and convert names to integers/enums.
    Color color = “#FF00FF”; or
    color = “rgb(255,0,0) “; or
    color = “blue”;
    textAlign = “center”; //
    Congratulations for your blog, I will add it to my favorites.

  2. Hi Thiago. GCC implemented the support for constexpr in version 4.6. This is the one I used to test my examples. Version 4.7 is even nicer because it also implements user defined literals.

  3. Thiago says:

    Hi, I started some tests using GCC 4.7.
    To parse #FF00FF was very easy.After that I tried to parse “rgb( 255, 0 , 0 )”;

    I have a tokenizer generator online here
    http://www.thradams.com/webtkgen.aspx
    I put on it the regex expression :
    NUMBER
    rgb\([ ]*[0-9][0-9]?[0-9]?

    This will generate a ” get next state” that uses a switch
    like this:

    static int GetNext(int state, wchar_t ch) {
    switch (state)
    {
    case 0:
    if (ch == L’r’)
    return 1;
    break; //
    ….
    Then I changed each state by a function.

    constexpr int State6(const char*p, int val) { return (*p >= L’0′ && *p = L’0′ && *p = L’0′ && *p <= L'9') ? State5(p+1, *p – '0') : throw "error"; }
    constexpr int State3(const char*p) { return (*p == L'(') ? State4(p + 1) : throw "error"; }
    constexpr int State2(const char*p) { return (*p == L'b') ? State3(p + 1) : throw "error";}
    constexpr int State1(const char*p) { return (*p == L'g') ? State2(p + 1) : throw "error"; }
    constexpr int State0(const char*p) { return (*p == L'r') ? State1(p + 1) : throw "error";}

    static_assert( State0("rgb(123,0,0)" == 123, "");

    I think it is possible to modifify the gerator to create a regex machine that uses only recursive function calls. Then it can be used to create compile time parsers. Or, like now just changing the switch states by functions.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s