User-defined literals — Part II

In the previous post on user-defined literals, we have seen what user-defined literal are for and how you define a cooked literal operator, i.e., where compiler that sees literal 12_kg extracts value 12 of type long double and and calls your function operator"" _kg(12.L) to transform the result. In this post we will explore other aspects of user-defined literals: raw literal operators, which allow you to inspect every character in the literal.

Raw literal operators

Sometimes the cooked literal operator will not do, and we have to analyze literals character by character. We need this when the suffix changes the interpretation of digits in the literal. For instance, number 11 means something else when interpreted as base-10 number and something else when interpreted as base-2 number. If we used cooked literal operator, compiler would have interpreted it already as eleven and we would never know whether it was the result of parsing literal 11 or 0xB or 013.

Similarly, consider a user-defined type for storing decimal numbers. Your type is capable of storing number 10.2 without any loss of precision. However, if you use cooked literal operator, compiler has to convert your type to type long double and this type is not capable of storing the value 10.2 exactly, and will have to use something like 10.19999... instead. This value in turn cannot be converted to decimal number without further loss in precision. In these cases it is desirable to translate the string representation of the literal directly into the destination type and value without the intermediate stage.

So, let’s see how we can parse a literal char-by-char. Let’s say we want to be able to use binary literals, similar to the ones like 0b11010110. Of course, we already know that we are limited: we can only define literal suffixes and non-standard suffixes must start with underscore. So we will be really parsing literals like 11010110_b. First, we take the easiest, and least attractive way. We will define a raw literal operator with the following signature:

unsigned operator"" _b(const char* str);

It reads, “whenever you find an integral or floating point literal with suffix _b, pass it as C-style string to our function and we will return a corresponding value of type unsigned.”

Note the magic here not available for cooked literal operators. This usage of raw operator cannot be replaced with a normal function call. What it does is when it finds code like:

auto i = 101_b;

It treats it like:

auto i = operator"" _b("101");

This is somewhat similar to using the following macro-based solution:

unsigned str2int(const char* str);
# define BINARY(literal) str2int(#literal)

auto i = BINARY(101);

Note also that this form of declaration of a literal operator says that it only works for integral and floating-point literals: it will never work for string literals. But didn’t we use a similar literal operator declaration in the previous post for string literals? Similar, but not same. In order to define a cooked string literal operator, we would have to write:

unsigned operator"" _s(const char * str, size_t len);

Note the second argument len. It needs to be of integral type (not necessarily size_t). One of the primary purposes (although not the only one) for this second argument is to differentiate between a cooked string literal operator and raw numeric (integral or floating-point) literal operator. This ‘trick’ is similar to the declaration of postfix operator++:

Integral operator++(Integral& i, int);

Here, the second argument is only needed for syntactic disambiguation.

But let’s go back to our first attempt at binary integer literal operator. We can implement it like this:

unsigned operator"" _b(const char* str)
{
  unsigned ans = 0;
  
  for (size_t i = 0;  str[i] != '\0';  ++i) {
    // loop variant: strlen(str + i);
    char digit = str[i];

    if (digit != '1' && digit != '0') {     // (1)
      throw std::runtime_error("only 1s and 0s allowed");
    }
    
    ans = ans * 2 + (digit - '0');          // (2)
  }
  
  return ans;
}

Key points to observe: (1) we allow only 0s and 1s in each character of the literal; (2) in each iteration we ‘grow’ the to-be-returned value (ans) by the newly read binary digit. Here is how we can test our new literal operator:

int main() try
{
  unsigned i = 101_b; 
  assert (i == 5);
 
  unsigned j = 123_b;
  assert (false); // should never get here
}
catch (std::exception const& e) {
  std::cerr << e.what() << std::endl;
}

Our operator requires one more check: if the binary literal is not too long. If we assume that type unsigned has 32 bits, we should not allow binary literals longer than this number. We omitted it for the time being, but will add it in the next attempts at improving our literal operator. Why does it need the improvement? First of all, because it is not a constexpr function, we cannot use it to define compile-time constants:

constexpr unsigned i = 11011_b; // ERROR
static_assert(101_b == 5, "!"); // ERROR
int array_of_ints[ 11011_b ];   // ERROR

In order to make our operator a constexpr function we need to abandon loop iteration in favor of recursion, and resort to a more fancy way of invalid input reporting, as described here:

template <typename T>
constexpr size_t NumberOfBits()
{
  static_assert(std::numeric_limits<T>::is_integer, "only integers allowed");
  return std::numeric_limits<T>::digits;
}

constexpr size_t length( const char * str, size_t current_len = 0 ) 
{
  return *str == '\0' ? current_len           // end of recursion
       : length(str + 1, current_len + 1);    // compute recursively
}

constexpr bool is_binary( char c )
{
  return c == '0' || c == '1';
}

size_t TOO_LONG_BINARY_LITERAL()
{
  throw std::runtime_error("too long binary literal");
}

size_t ONLY_0_AND_1_IN_BINARY_LITERAL()
{
  throw std::runtime_error("only 0s and 1s allowed in binary literal");
}

constexpr unsigned build_binary_literal( const char * str, size_t val = 0 )
{
  return length(str) == 0 ? val                                // end of recursion
       : !is_binary(*str) ? ONLY_0_AND_1_IN_BINARY_LITERAL()   // check for non-binary digit
       : build_binary_literal(str + 1, 2 * val + *str - '0');  // inspect recursively
}

constexpr unsigned operator"" _b( const char * str )
{
  return length(str) > NumberOfBits<unsigned>()                // too long literal?
       ? TOO_LONG_BINARY_LITERAL()                             // report error
       : build_binary_literal(str);                            // build numeric value
}

static_assert(10001000100010001000100010001000_b == 0x88888888, "!!");

Most of the tricks used here are described in my other post “Compile-time computations.”

Now, our binary literals can be used to create compile-time constants; however, it does not guarantee that the literal will always be evaluated at compile-time:

int main()
{
  unsigned i = 1102_b; // compiles but throws at run-time
}

Literal operator templates

In order to ensure that our literal is always evaluated at compile-time, we will use another form of raw literal operator:

template <char... STR> constexpr unsigned operator"" _b();

Are you familiar with variadic templates? Notation char... indicates that this template can be instantiated with 0, 1, 2 or more parameters of type char. The above declaration means that each time the compiler encounters a literal like 11011_b it should treat it as the following function call:

operator"" _b<'1', '1', '0', '1', '1'>();

Note: no terminating '\0'. Now, the entire string representing the literal is passed (chopped) as template argument. This enables the possibility of inspecting every character at compile-time regardless of the context in which the literal is used. How can the operator be implemented? A number of people have already described how a binary literal can be implemented in C++11. For instance, Daniel Krügler (see here) and Johannes Schaub (see here). Below, we will analyze a simple implementation step-by-step.

Compile-time evaluation requires that we use recursion again. Recursion is typically needed when processing a variadic template. Template meta-programming recursion (that we are about to use) requires that terminating case of the recursion is implemented via template specialization.

template <unsigned VAL>                             // (D) terminate the recursion
constexpr unsigned build_binary_literal()
{
  return VAL;
}

template <unsigned VAL, char DIGIT, char... REST>   // (B) recursive value generation
constexpr unsigned build_binary_literal()
{
  static_assert(is_binary(DIGIT), "only 0s and 1s allowed");
  return build_binary_literal<(2 * VAL + DIGIT - '0'), REST...>(); // (C)
}

template <char... STR>
constexpr unsigned operator"" _b()
{
  static_assert(sizeof...(STR) <= NumberOfBits<unsigned>(), "too long binary literal");
  return build_binary_literal<0, STR...>();        // (A)
}

This short piece of code deserves a long explanation. First note that with template meta-programming version, error reporting is cleaner, as we can use static_assert in place of tricks with non-constant sub-expressions inside constant expressions. Note the entity STR in the literal operator template definition. It is not a single value. It is a template parameter pack. it represents all (0, 1, 2, or more) chars that our template may be instantiated with. Notation STR... at point (A) is pack expansion: it tells that the compiler should re-build the same list of arguments that our template has been instantiated with. For instance, if we are parsing literal 11011_b and our literal template has been instantiated like this:

operator"" _b<'1', '1', '0', '1', '1'>();

The pack expansion at point (A) should result in the following instantiation of template build_binary_literal:

build_binary_literal<0, '1', '1', '0', '1', '1'>(); 

Note also the expression sizeof...(STR). It tells us how many chars are really there in the parameter pack STR. Andrei Alexandrescu argues in this lecture that operator sizeof... is an unnecessary language feature because one can implement it as a library solution using template recursion (similar to ours above). I believe that with native support as a language feature sizeof... can be faster and not require a not-so-small number of unnecessary template instantiations. C++ compiles slowly already.

Note the second and third template argument at point (B): char DIGIT, char... REST. This is how the ‘decomposition’ of a parameter pack is implemented. We say that we will inspect the first character of the pack in this iteration (DIGIT) and inspect the reminder (REST) in subsequent recursive call. Such recursive call is performed at point (C). Note that the pack is shorter by one character in the next iteration. Note also that when we inspect the last character of the literal (in the primary template), DIGIT contains its value and REST is empty. When we instantiate template in this final iteration, it is equivalent to:

build_binary_literal<(2 * VAL + DIGIT - '0')>(); 

Note that the comma between (2 * VAL + DIGIT - '0') and now empty parameter pack REST magically disappears. This calls the template specialization from point (D), which ends the recursion.

With thus defined literal operator template we are guaranteed that our binary literals are always checked at compile-time:

int main()
{
  unsigned i = 102_b; // compile-time error
}

Thus defined literal operator template has still one limitation. In case the literal is longer than 32 bits (I assume sizeof(unsigned) == 4 and 8-bit char) we get a compile-time error. It would be better if in the case of longer, say 40-bit, binary literal it returned a longer unsigned type. This is somewhat similar to how built-in literals work in C++ (even C++03):

auto i = 0x12345678;   // decltype(i) == int  (on my platform)
auto j = 0x1234567890; // decltype(j) == long long int

This is doable in C++11; but we leave it for the next post.

This entry was posted in programming and tagged , . Bookmark the permalink.

8 Responses to User-defined literals — Part II

  1. kaballo says:

    If I recall correctly, names that start with an underscore followed by an uppercase letter are reserved for the implementation. That would render _B a bad choice for an identifier.

    • That’s a very interesting observation. Thanks. Indeed names like that are reserved for implementation, but I wonder if this also applies to literal suffixes. Typically it is just a suffix of a larger entity: 1101_B does not start with an underscore.

      When I try to read the standard thoroughly, I find the following relevant places:

      §17.6.4.3.2 Global names:

      Certain sets of names and function signatures are always reserved to the implementation:

      • Each name that contains a double underscore __ or begins with an underscore followed by an uppercase letter is reserved to the implementation for any use.
      • Each name that begins with an underscore is reserved to the implementation for use as a name in the global namespace.

      §3 ¶4:

      A name is a use of an identifier, operator-function-id, literal-operator-id, conversion-function-id, or template-id that denotes an entity or label.

      §13.5.8 User-defined literals:

      literal-operator-id:
         operator"" identifier

      First, it looks like the restriction only applies to ‘names’ in global namespace. While my literal is indeed in global namespace it is only because I need to introduce it in stages (in order not to put off the audience). In the third part we will out it into some other one. Second, the restriction only applies to names, and neither literal like 11_B, nor the literal suffix itself is a name. Only “operator"" _B” is a name but one could not say that it starts with an underscore.

      Having said all that, though, I still think I had better change the literal to lower-case: in order to avoid any confusion.

      Thanks.

      • Phil says:

        The most applicable section is 17.6.4.3.5, which states “Literal suffix identifiers that do not start with an underscore are reserved for future standardization”. So you should always name your literal suffix starting with an underscore, otherwise there may be one added to the standard library in the future that will cause naming conflicts.

  2. I was going to post almost exactly the same code as you did down to the same choice of examples (although perhaps a bit lest elegant) but now it feels like plagiarism even though I have already finished my code samples and was just about to start write it up :(.

  3. You don’t need fancy functions to indirectly check for octet-sized bytes, just use the CHAR_BIT C-level macro!

    #if CHAR_BIT != 8
    #error Octet-sized bytes required!
    #endif
    

    But you won’t need that because you can check the bit-length for a built-in integer type T with std::numeric_limits<T>::digits. (Your "sizeof(T) * CHAR_BIT" theoretically may not work due to padding bits.)

  4. Thanks Daryle! I fixed the examples. They are now shorter, cleaner and safer 🙂

  5. Pingback: User Defined Literals – Using C++

  6. Pingback: About the SCM Challenge at CppCon 2016 – Pirates – h-deb

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.