User-defined literals — Part III

In the previous post we saw how to define a raw literal operator template that converts almost any binary literal of the form 11011_b to a value of type unsigned int at compile-time, so that the value can still be used as a compile-time constant. However, the literal has to be short enough to fit into type unsigned int. In this post, as promised, we will make our literal render values of different types based on the length of the binary literal, so that 11011_b renders a value of type unsigned int and a literal too long for unsigned int, such as 100010001000100010001000100010001000_b, renders a value of a wider type: long unsigned int or long long unsigned int, depending on the platform.

Some meta-programming

A literal operator is a function (template). Is it possible to have a function return values of different types based on different input values? With function templates it is doable. Here is a fairly short example that shows how to do that. First, we show how to select a different type based on a compile-time value:

template <bool COND>
struct Bool2Type_
{
  using type = int;                                 // (1)
};

template <>
struct Bool2Type_<true>
{
  using type = std::string;
};

template <bool COND>
using Bool2Type = typename Bool2Type_<COND>::type;  // (2)

With this we have implemented a template meta-function, as described here. Given value false it returns type int; given value true it returns type std::string:

// not legal C++, but you know what it means
Bool2Type<false> == int;
Bool2Type<true>  == std::string;

How does the implementation work? The first template declaration declares a primary template. The line at point (1) is an alias declaration. It is almost the same as a typedef declaration, except that we specify the new type name first. Next, we have a template specialization for value true, with a nested alias type referring to std::string. These two declarations are enough to say that we have defined a template meta-function, but in order to save people from typing the verbose typename Bool2Type_<COND>::type, we introduce another alias at point (2); an alias template, in fact. This is why type aliases are superior to typedefs: you cannot have typedef templates.
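The pseudo-equalities above can also be written as legal C++ with std::is_same from the standard library. The following is only a small sketch that repeats the declarations so that it compiles on its own; the name StringByTypedef is made up for the illustration.

```cpp
#include <string>
#include <type_traits>

template <bool COND>
struct Bool2Type_
{
  using type = int;
};

template <>
struct Bool2Type_<true>
{
  using type = std::string;
};

template <bool COND>
using Bool2Type = typename Bool2Type_<COND>::type;

// The legal-C++ spelling of "Bool2Type<false> == int" and friends:
static_assert(std::is_same<Bool2Type<false>, int>::value, "false selects int");
static_assert(std::is_same<Bool2Type<true>, std::string>::value, "true selects std::string");

// With a plain typedef we can only name one concrete result at a time
// (StringByTypedef is a hypothetical name used for illustration):
typedef Bool2Type_<true>::type StringByTypedef;
static_assert(std::is_same<StringByTypedef, std::string>::value, "same type as the alias");
```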

This is how we can select a type. Now, the following shows how we can select both type and value with a function template:

template <bool COND>
Bool2Type<COND> bool2val()
{
  return 0;
}

template<>
Bool2Type<true> bool2val<true>()
{
  return std::string{"one"};
}

int main()
{
  assert (bool2val<false>() == 0);
  assert (bool2val<true>() == "one");
}

But a good programming language comes with a standard library that makes it easy to do common tasks. In C++11 meta-programming is considered common (like it or not). We already have a meta-function for selecting between two types:

using T0 = typename std::conditional<false, std::string, int>::type;
using T1 = typename std::conditional<true, std::string, int>::type;

// not C++:
T0 == int;
T1 == std::string;

You can see that std::conditional is like an if-statement for selecting one of the two types. You do not have to use literal true to select the type, you can use any compile-time expression convertible to bool:

using TX = typename std::conditional<sizeof(short) == sizeof(int), short, int>::type;

But again, to avoid typing, let’s introduce another alias template to make the usage of conditional shorter. We will use it later:

template <bool COND, typename T, typename F>
using IF = typename std::conditional<COND, T, F>::type;

// usage:
using TX2 = IF<sizeof(short) == sizeof(int), short, int>;
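Because IF yields a type, it can itself be used as one of the branches, which gives us an if / else-if chain at compile-time. Here is a minimal sketch; Sign2Type is a made-up meta-function that exists only for this illustration.

```cpp
#include <type_traits>

template <bool COND, typename T, typename F>
using IF = typename std::conditional<COND, T, F>::type;

// Hypothetical three-way selection: negative -> short, zero -> int, positive -> long.
template <int N>
using Sign2Type = IF<(N < 0), short,
                  IF<(N == 0), int,
                     long>>;

static_assert(std::is_same<Sign2Type<-5>, short>::value, "negative selects short");
static_assert(std::is_same<Sign2Type<0>, int>::value, "zero selects int");
static_assert(std::is_same<Sign2Type<7>, long>::value, "positive selects long");
```

Note the parentheses around the conditions: without them the < and > inside the comparison could be taken for template brackets.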

The ultimate binary literal

First, let’s create a meta-function that will select the most appropriate type for our binary literal, based on the literal’s length. As in the previous post, we will assume that we are dealing with a platform where type char is 8 bits wide. One option would be to select from the following sequence of standard types: uint8_t, uint16_t, uint32_t, uint64_t. However, it does not appear too practical. We choose another scheme. We start with unsigned, with native platform word width; if it is too small, we try long unsigned, and if that is also too small, we use long long unsigned. If the last one still doesn’t work, we give up. This can be summarized with the following pseudo code:

if (SIZE > sizeof(long long unsigned) * 8) {
  ERROR();
}
if (SIZE <= sizeof(unsigned) * 8) {
  return <unsigned>;
}
else {
  if (SIZE <= sizeof(long unsigned) * 8) {
    return <long unsigned>;
  }
  else {
    return <long long unsigned>;
  }
}

This is how we can implement it:

// Helper defined at namespace scope, so that it is complete
// before select_type uses it in constant expressions.
template <typename T>
constexpr size_t NumberOfBits()
{
  return std::numeric_limits<T>::digits;
}

template <size_t SIZE>
struct select_type
{
  static_assert(SIZE <= NumberOfBits<long long unsigned>(), "too long binary literal");

  using type = IF<(SIZE <= NumberOfBits<unsigned>()),
    unsigned,
    IF<(SIZE <= NumberOfBits<long unsigned>()),
      long unsigned,
      long long unsigned
    >
  >;
};

template <size_t SIZE>
using SelectType = typename select_type<SIZE>::type;

Let me make a small digression here. Typically, it is a good idea to accompany every piece of code one writes with unit tests that check the basic behavior offered by the component. For compile-time functions the situation is peculiar. We can test some aspects by using static assertions:

static_assert(std::is_same<SelectType<1>, unsigned>::value, "!");

But some semantics of meta-programs just cannot be unit-tested this way. One of the features of our meta-function is that it reports an error (via compilation failure) when the literal is so long that even type long long unsigned could not hold it. You can check it manually, but how do you write a unit-test for that within C++?
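One partial answer, which I only sketch here, is to change the failure mode. If the meta-function simply has no nested type for sizes that are too big, rather than triggering a static_assert, then "does it fail?" becomes a question we can ask with SFINAE and check with an ordinary static assertion. This is a different design from the select_type in this post: select_type_sfinae, VoidT and has_selected_type are all names invented for the illustration.

```cpp
#include <cstddef>
#include <limits>
#include <type_traits>

template <typename T>
constexpr std::size_t NumberOfBits()
{
  return std::numeric_limits<T>::digits;
}

// Hypothetical variant: no nested `type` at all when SIZE is too big.
template <std::size_t SIZE,
          bool FITS = (SIZE <= NumberOfBits<unsigned long long>())>
struct select_type_sfinae
{
};

template <std::size_t SIZE>
struct select_type_sfinae<SIZE, true>
{
  using type = typename std::conditional<(SIZE <= NumberOfBits<unsigned>()),
    unsigned,
    typename std::conditional<(SIZE <= NumberOfBits<unsigned long>()),
      unsigned long,
      unsigned long long
    >::type
  >::type;
};

// Detection helper: maps any well-formed list of types to void.
template <typename...>
struct MakeVoid { using type = void; };

template <typename... T>
using VoidT = typename MakeVoid<T...>::type;

// false unless select_type_sfinae<SIZE>::type names a type
template <std::size_t SIZE, typename = void>
struct has_selected_type : std::false_type {};

template <std::size_t SIZE>
struct has_selected_type<SIZE, VoidT<typename select_type_sfinae<SIZE>::type>>
  : std::true_type {};

// Now both the positive and the negative case are checkable at compile-time:
static_assert(has_selected_type<1>::value, "1 bit fits");
static_assert(!has_selected_type<(NumberOfBits<unsigned long long>() + 1)>::value,
              "one bit too many is rejected");
```

The price is a less readable diagnostic for the end user: a missing nested type instead of the message "too long binary literal".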

Back to the subject: with the SelectType meta-function in place, our final implementation of the literal operator template is quite similar to the one from the previous post. The only difference now is that some of our templates require an additional type parameter that specifies the unsigned type our literal will obtain:

constexpr bool is_binary( char c )
{
  return c == '0' || c == '1';
}

template <typename UINT, UINT VAL>
constexpr UINT build_binary_literal()
{
  static_assert(std::is_unsigned<UINT>::value, "requires unsigned type");
  return VAL;
}

template <typename UINT, UINT VAL, char DIGIT, char... REST>
constexpr UINT build_binary_literal()
{
  static_assert(is_binary(DIGIT), "only 0s and 1s allowed");
  static_assert(std::is_unsigned<UINT>::value, "requires unsigned type");
  return build_binary_literal<UINT, 2 * VAL + DIGIT - '0', REST...>();
}

template <char... STR>
constexpr SelectType<sizeof...(STR)> operator"" _b()
{
  return build_binary_literal<SelectType<sizeof...(STR)>, 0, STR...>();
}

Now we can test our literal:

static_assert(0_b == 0, "!!");
static_assert(1_b == 1, "!!");
static_assert(10_b == 2, "!!");
static_assert(1000100010001000100010001000100010001000_b == 0x8888888888, "!!");

int main()
{
  auto i = 10001000100010001000100010001000_b;
  auto j = 1000100010001000100010001000100010_b;
  
  static_assert( std::is_same<decltype(i), unsigned int>::value, "!unsigned" );
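  // 34 digits do not fit into a 32-bit unsigned int, so the next wider type is selected:
  // unsigned long where long has 64 bits, unsigned long long where long has only 32 bits
  static_assert( std::is_same<decltype(j), SelectType<34>>::value, "!SelectType" );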
}

A note on performance

Our binary literal looks cool, but using it comes with a price. The final value is available at compile-time, so it incurs no run-time overhead; however, you might find the additional compilation time surprisingly high. Using even a single literal requires a number of recursive template instantiations. I do not provide a benchmark here, but if you try using 32 literals of size, say, 46 you can observe the slow-down compared to built-in hexadecimal literals. If you ever used heavy meta-programs you know how much they can slow the compile times down. Sticking to hexadecimal literals may still be the most attractive solution even though the option to define a binary literal exists.
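If compilation time becomes a concern, one alternative worth measuring is a raw literal operator that is an ordinary constexpr function rather than a template: it recurses over the characters without instantiating a new function template for every digit. The sketch below is not a replacement for the operator above, though. It always returns unsigned long long, so the type selection is lost, and an invalid digit is reported at compile-time only when the literal is used in a constant expression. The names parse_binary and _b2 are made up, and I make no claim about how much faster (if at all) this compiles.

```cpp
// Hypothetical helper: consumes the digits one by one (C++11-style constexpr).
constexpr unsigned long long parse_binary(const char* s, unsigned long long acc)
{
  return *s == '\0' ? acc
       : (*s == '0' || *s == '1') ? parse_binary(s + 1, 2 * acc + (*s - '0'))
       : throw "only 0s and 1s allowed";
}

// Raw literal operator: receives the literal as a null-terminated string.
constexpr unsigned long long operator"" _b2(const char* digits)
{
  return parse_binary(digits, 0);
}

static_assert(0_b2 == 0, "!!");
static_assert(1011_b2 == 11, "!!");
static_assert(1000100010001000100010001000100010001000_b2 == 0x8888888888, "!!");
```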

Literal namespaces

Built-in types can be thought of as being defined in the global namespace. Similarly, literals representing them can be thought of as defined in the global namespace: you never think about prefixing them with a scope resolution operator. It wouldn’t even work. Likewise, the scope resolution operator will not work for user-defined literals, so the literal operators have to be accessible without any namespace qualifiers. No argument-dependent look-up is possible because literal operators take arguments only of built-in types.

You could consider defining your literal operators in the global namespace, but that comes with a handful of problems. First, defining anything in the global namespace is already risky. Implementations use the global namespace to define some private infrastructure library components. Their names typically start with an underscore, and in the case of user-defined literals we are forced to define suffixes starting with an underscore. Second, literal suffixes tend to be short, so literals from different libraries would likely clash. For instance, one library may use suffix _lb to denote long binary literal, and another may use _lb to denote mass in pounds.

So what options does a library author have? One is to define the literal operators directly in the library’s namespace and import them into the global namespace with using-declarations:

// units_library.hpp:

namespace Units
{
  class Mass;
  // raw literal operators: they accept 200_kg as well as 2.5_kg
  Mass operator"" _lb(const char *);
  Mass operator"" _kg(const char *);
}

using Units::operator"" _lb;
using Units::operator"" _kg;

But this causes a clash if another library, more carelessly, declares a literal operator with the same suffix and the same parameter list directly in the global namespace:

// utilities.hpp:

long operator"" _b(const char *);           // binary literal
unsigned long operator"" _lb(const char *); // long binary literal

// main.cpp:

# include "units_library.hpp"
# include "utilities.hpp" 
// ERROR: operator"" _lb already declared in global namespace

Note that we didn’t even try to use literal _lb. One way to mitigate the problem is to use a using-directive rather than a using-declaration. This way we postpone the compilation error until someone really tries to use the literal. However, since a using-directive ‘imports’ every name in the namespace, you should separate the literal operator declarations by putting them into an additional namespace:

// units_library2.hpp:

namespace Units
{
  class Mass;

  namespace operators
  {
    // raw literal operators: they accept 200_kg as well as 2.5_kg
    Mass operator"" _lb(const char *);
    Mass operator"" _kg(const char *);
  }
}

using namespace Units::operators;

This is somewhat better. Now the ambiguity error only occurs when we try to use the ambiguous literal:

// main.cpp:

# include "units_library2.hpp"
# include "utilities.hpp" 
// OK so far

Units::Mass m1 = 200_kg; // OK
Units::Mass m2 = 400_lb; // ERROR: ambiguity

This technique is proposed in N2750 (see section 3.5). However, it still does not save us from the ambiguity, in case two libraries define the same literal suffix. A different approach would be to always define a separate (nested) namespace for your library’s literals. Never have the library import the literals into the global namespace. Have the end user do the import if she wants to use literals:

// units_library3.hpp:

namespace Units
{
  class Mass;

  namespace operators
  {
    // raw literal operators: they accept 2100_kg as well as 2.5_kg
    Mass operator"" _lb(const char *);
    Mass operator"" _kg(const char *);
  }
}

// utilities_b.hpp:

namespace Utilities
{
  namespace operators 
  {
    long operator"" _b(const char *);           // binary literal
    unsigned long operator"" _lb(const char *); // long binary literal
  }
}

// main.cpp:

# include "units_library3.hpp"
# include "utilities_b.hpp" 

void fun1() 
{
  using namespace Units::operators;
  Units::Mass m1 = 2100_kg; // OK
  Units::Mass m2 = 1100_lb; // OK
}

void fun2()
{
  using namespace Utilities::operators;
  unsigned long v = 1100_lb; // OK: long binary literal
}

This technique is proposed in N3402.
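To see that the opt-in approach really keeps two identical suffixes apart, here is a self-contained sketch that can be compiled as is. It fills in the parts that the headers above only declare, and everything filled in is an assumption made for the illustration: the layout of Mass (a plain count of grams), the rounded conversion factor for pounds, and the helper parse_digits, which does no validation.

```cpp
// Hypothetical helper: converts a string of digits in the given base.
// No validation, to keep the sketch short.
constexpr unsigned long long parse_digits(const char* s, unsigned base,
                                          unsigned long long acc = 0)
{
  return *s == '\0' ? acc
                    : parse_digits(s + 1, base, acc * base + (*s - '0'));
}

namespace Units
{
  struct Mass { unsigned long long grams; };  // assumed representation

  namespace operators
  {
    constexpr Mass operator"" _kg(const char* s) { return Mass{parse_digits(s, 10) * 1000}; }
    constexpr Mass operator"" _lb(const char* s) { return Mass{parse_digits(s, 10) * 454}; }  // rounded
  }
}

namespace Utilities
{
  namespace operators
  {
    // long binary literal: same suffix and same signature as Units' pounds
    constexpr unsigned long long operator"" _lb(const char* s) { return parse_digits(s, 2); }
  }
}

constexpr unsigned long long grams_in_two_pounds()
{
  using namespace Units::operators;      // only the mass literals are visible here
  return (2_lb).grams;
}

constexpr unsigned long long long_binary_1100()
{
  using namespace Utilities::operators;  // only the binary literals are visible here
  return 1100_lb;
}

static_assert(grams_in_two_pounds() == 908, "2_lb is a mass");
static_assert(long_binary_1100() == 12, "1100_lb is a binary number");
```

Each function sees exactly one operator"" _lb, so no ambiguity arises even though the two operators have the same parameter list.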

User-defined string literals

Note that cooked string literals do not clash with raw numeric literals, because of the additional length parameter (which must be of type std::size_t):

std::string operator"" _s(const char * str, std::size_t len)
{ 
  return std::string{str, len};
}

constexpr Seconds operator"" _s(const char * str) 
{ 
  return Seconds{str2int(str)}; 
}

int main()
{
  std::string name = "dog"_s; // OK: string
  Seconds elapsed  = 360_s;   // OK: Seconds
}
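The snippet above leaves Seconds and str2int undefined. A compilable sketch of the same overload pair could look as follows; the layout of Seconds and the body of str2int (decimal digits only, no validation) are my assumptions, added only to make the example self-contained.

```cpp
#include <cstddef>
#include <string>

struct Seconds { unsigned long long count; };  // assumed minimal definition

// Assumed helper: decimal digits only, no validation.
constexpr unsigned long long str2int(const char* s, unsigned long long acc = 0)
{
  return *s == '\0' ? acc : str2int(s + 1, 10 * acc + (*s - '0'));
}

// cooked string literal operator: pointer plus length
inline std::string operator"" _s(const char* str, std::size_t len)
{
  return std::string(str, len);
}

// raw numeric literal operator: pointer only
constexpr Seconds operator"" _s(const char* str)
{
  return Seconds{str2int(str)};
}

// The numeric literal picks the raw operator...
static_assert((360_s).count == 360, "numeric literal selects the raw operator");

// ...and the string literal picks the cooked one.
inline std::string dog_name()
{
  return "dog"_s;
}
```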

The additional length parameter is there not only to avoid such ambiguities. It is also used to correctly convey the size of the string literal in cases where it contains character '\0'. C-style string’s size is determined by the first occurrence of character '\0' in the sequence. However std::string stores the size of the string separately and '\0' is treated as any other character:

std::string strange{ "hello\0""world", 11 };
assert (strange.length() == 11);

std::string strange2 = "hello\0""world"_s;
assert (strange2.length() == 11);

Note that if our string literal operator ignored the length parameter, it would give us an incorrect length:

std::string operator"" _badstr(const char * str, std::size_t len)
{ 
  return std::string{str};
}

std::string badstr = "hello\0""world"_badstr;
assert (badstr.length() == 5);

The above should already give us an idea of what ‘to cook’ means for cooked string literals. For another example consider the following:

std::string text = R"dog(cat)dog"_s;
assert (text == "cat");

If this “R” syntax looks too confusing, let me only explain that it is a raw string literal. The part «R"dog(» says that everything we parse next (including backslash, double-quote and parentheses) is considered an ordinary character of the string until we encounter the terminating sequence: «)dog"». For a less hard-core example consider this:

std::string text = "d""o""g"_s;
assert (text.length() == 3);

text = "\"\"\""_s;
assert (text.length() == 3);
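All of these lengths can also be observed at compile-time. The sketch below uses a made-up suffix _len whose operator does nothing but return the length it is given. It shows that by the time the cooked operator is called, the compiler has already concatenated adjacent literals and processed raw-string delimiters and escape sequences.

```cpp
#include <cstddef>

// Hypothetical cooked string literal operator that only reports the length.
constexpr std::size_t operator"" _len(const char*, std::size_t len)
{
  return len;
}

static_assert("hello\0""world"_len == 11, "embedded null character is counted");
static_assert(R"dog(cat)dog"_len == 3, "raw string delimiters are stripped");
static_assert("d""o""g"_len == 3, "adjacent literals are concatenated first");
static_assert("\"\"\""_len == 3, "each escape sequence is one character");
static_assert("\101"_len == 1, "octal escape for letter A is one character");
```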

This also takes us close to the answer to the question why it is not possible to define raw string literal operators. Yes: you cannot define them in C++11. They would be very useful to enable full compile-time computation of strings, and at some point they were even proposed for addition. But we do not have them in the end. One of the reasons is that it is not clear what “raw” is in case of strings. Note that in case of raw integral literal operators we also parse the prefixes:

0x11_b;
// this is equivalent to:
operator"" _b<'0', 'x', '1', '1'>();
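We can check this directly. In the following sketch, the suffixes _count and _first and the helper first_of are made up for the illustration: one operator returns the number of characters it received, the other returns the first of them.

```cpp
#include <cstddef>

// Reports how many characters the literal operator template receives.
template <char... STR>
constexpr std::size_t operator"" _count()
{
  return sizeof...(STR);
}

// Hypothetical helper that peels off the first character of the pack.
template <char FIRST, char... REST>
constexpr char first_of()
{
  return FIRST;
}

template <char... STR>
constexpr char operator"" _first()
{
  return first_of<STR...>();
}

static_assert(11_count == 2, "two digits");
static_assert(0x11_count == 4, "the 0x prefix is passed along as two more characters");
static_assert(0x11_first == '0', "the first character is the zero of the prefix");
```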

What would be the meaning of literal "g""o"_s:

operator"" _s<'g', 'o'>();                          // ??
operator"" _s<'\"', 'g', '\"', '\"', 'o', '\"'>();  // ??

What about other special cases:

"\?"_s;
operator"" _s<'\?'>();        // ??
operator"" _s<'\\', '\?'>();  // ??

R"(A)"_s;
operator"" _s<'R', '\"', '(', 'A', ')', '\"'>();  // ??
operator"" _s<'A'>();                             // ??

"\101"_s; // letter "A"
operator"" _s<'A'>();                                    // ??
operator"" _s<'\\', '1', '0', '1'>();                    // ??
operator"" _s<'\"', '\\', '1', '0', '1', '\"'>();        // ??
operator"" _s<'\"', '\\', '1', '0', '1', '\"', '\0'>();  // ??

Pushing the limits

In this and the previous post we have seen how we can parse binary integral literals. In a similar manner, we could parse base-3 integers: we just check if every digit is either 0, 1 or 2. Can we similarly parse base-12 integers? Since we can already parse base-16 integers, base-12 should not be a problem. Indeed, we can do this; however, since we have to comply with C++ syntax, we need to use the 0x prefix of hexadecimal literals:

unsigned int i = 0x1A0_b12; // = 264
unsigned int j = 0x1C0_b12; // ERROR: 'C' is not a base-12 digit
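A possible implementation, given only as a sketch: a literal operator template must have exactly one parameter, the pack of chars, so we forward the pack to a helper that peels off the 0x prefix. All helper names below are invented, and for brevity the result is always unsigned rather than selected by length.

```cpp
constexpr bool is_base12_digit(char c)
{
  return (c >= '0' && c <= '9') || c == 'A' || c == 'a' || c == 'B' || c == 'b';
}

constexpr unsigned base12_value(char c)
{
  return (c >= '0' && c <= '9') ? c - '0'
       : (c == 'A' || c == 'a') ? 10
       : 11;
}

template <unsigned VAL>
constexpr unsigned build_base12()
{
  return VAL;
}

template <unsigned VAL, char DIGIT, char... REST>
constexpr unsigned build_base12()
{
  static_assert(is_base12_digit(DIGIT), "not a base-12 digit");
  return build_base12<12 * VAL + base12_value(DIGIT), REST...>();
}

// Peels off the mandatory 0x prefix; a literal shorter than two
// characters fails to compile because no helper matches.
template <char ZERO, char X, char... DIGITS>
constexpr unsigned strip_hex_prefix()
{
  static_assert(ZERO == '0' && (X == 'x' || X == 'X'), "0x prefix required");
  return build_base12<0, DIGITS...>();
}

template <char... STR>
constexpr unsigned operator"" _b12()
{
  return strip_hex_prefix<STR...>();
}

static_assert(0x1A0_b12 == 264, "1*144 + 10*12 + 0");
static_assert(0xB_b12 == 11, "B is the highest base-12 digit");
```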

Can we parse base-32 integer literals? It looks like we cannot: we can only narrow down the built-in literal syntax; but we cannot expand it.

Apart from different bases, we can do some other clever — perhaps too clever — things. For instance, you can define a binary literal that uses digits ‘A’ and ‘B’:

unsigned int i = 0xAABBABBA_bb;

It is not particularly useful (and would likely only cause confusion), but it gives you an idea of what can be done with literal operators. You can also limit binary literals to only these having a certain fixed width:

unsigned int i = 0x11001001_bL8;  // OK: 201
unsigned int j = 0x11001010_bL8;  // OK: 202
unsigned int k = 0x1100100_bL8;   // ERROR: too short
unsigned int l = 0x110010010_bL8; // ERROR: too long
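The width check boils down to one more static_assert on the size of the parameter pack. Again, this is only a sketch with invented helper names (build_bits, parse_8_bits), returning plain unsigned:

```cpp
template <unsigned VAL>
constexpr unsigned build_bits()
{
  return VAL;
}

template <unsigned VAL, char DIGIT, char... REST>
constexpr unsigned build_bits()
{
  static_assert(DIGIT == '0' || DIGIT == '1', "only 0s and 1s allowed");
  return build_bits<2 * VAL + (DIGIT - '0'), REST...>();
}

// Peels off the 0x prefix and insists on exactly eight binary digits.
template <char ZERO, char X, char... DIGITS>
constexpr unsigned parse_8_bits()
{
  static_assert(ZERO == '0' && (X == 'x' || X == 'X'), "0x prefix required");
  static_assert(sizeof...(DIGITS) == 8, "exactly 8 binary digits required");
  return build_bits<0, DIGITS...>();
}

template <char... STR>
constexpr unsigned operator"" _bL8()
{
  return parse_8_bits<STR...>();
}

static_assert(0x11001001_bL8 == 201, "!!");
static_assert(0x11001010_bL8 == 202, "!!");
```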

You can also cleverly use the point in the floating-point literal to denote pairs of numbers:

Hour t1 = 14.45_hour;  // 2:45 p.m.
Hour t2 =  2.45_hour;  // 2:45 a.m.
Hour x1 = 14.450_hour; // ERROR: only two digits allowed after point
Hour x2 =  0.61_hour;  // ERROR: '6' not allowed immediately after the point 

Again, this could be a bad idea, because people are used to point denoting a floating-point number. But you see the possibilities.
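For the curious, here is one way the hour literal could be parsed, sketched with class template partial specializations: digits before the point accumulate into the hour, and a dedicated specialization handles the point followed by exactly two minute digits. The layout of Hour and the names is_digit and hour_parser are my assumptions; the error messages for malformed literals will not be pretty.

```cpp
struct Hour { unsigned hours; unsigned minutes; };  // assumed representation

constexpr bool is_digit(char c)
{
  return c >= '0' && c <= '9';
}

// Primary template, left undefined: reached only for malformed input,
// for instance a literal without a point.
template <unsigned H, char... CS>
struct hour_parser;

// One more digit of the hour part: accumulate it and continue.
template <unsigned H, char C, char... REST>
struct hour_parser<H, C, REST...> : hour_parser<10 * H + (C - '0'), REST...>
{
  static_assert(is_digit(C), "digit expected before the point");
};

// The point followed by exactly two digits: the minutes.
template <unsigned H, char M1, char M2>
struct hour_parser<H, '.', M1, M2>
{
  static_assert(H < 24, "hour must be less than 24");
  static_assert(is_digit(M1) && is_digit(M2), "two digits expected after the point");
  static_assert(M1 < '6', "minutes must be less than 60");

  static constexpr Hour value()
  {
    return Hour{H, 10u * (M1 - '0') + (M2 - '0')};
  }
};

template <char... CS>
constexpr Hour operator"" _hour()
{
  return hour_parser<0, CS...>::value();
}

static_assert((14.45_hour).hours == 14 && (14.45_hour).minutes == 45, "2:45 p.m.");
static_assert((2.45_hour).hours == 2 && (2.45_hour).minutes == 45, "2:45 a.m.");
```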

You can judge for yourself whether user-defined literals are useful and whether they deserve their place in the C++ Standard.


4 Responses to User-defined literals — Part III

  1. Very nice, thanks for the explanation on why un-cooked user defined string literals were not included in C++11, I was wondering about that.

  2. Krzysiek says:

    “But some semantics of meta programs just cannot be unit-tested. One of the features of our meta-function is that it reports an error (via compilation failure) when the size of the literal is so long that type long long unsigned could not fit it. You can check it manually, but how do you write a unit-test for that within C++?”

    Well, meta-programs require meta-unit-tests :-) You need to include in your test suite several source files. Some of them should compile cleanly (those source files test for positive cases) and some of them should fail (tests for negative cases). Those source files should only test compile-time meta-functions and do not need to be linked into unit-test executable. There should be a separate script that would try to compile those files and generate a report.

    It seems there should be two test suites: one for compile-time meta-functions (that depends entirely on compilation success/failure) and the other for run-time (meta-) functions that are actually executed. The second suite should be executed if the first one passes.

    Is it doable in C++? I guess not, but not every language can unit-test its every feature.

  3. rhalbersma says:

    I wonder if you could also write e.g. DNA-sequences as ACGT literals *and* do std::regex searches on them, or write Morse code, or music notes, or some other stuff with small-sized alphabets ….

    • I guess you could to some extent achieve this goal. DNA sequences, the way I understand them, can be treated as quaternary (base-4) integer numbers. You cannot get letter ‘T’ in an integral literal, but you could use numbers 0, 1, 2, 3 to represent nucleotides; or numbers 1, 2, 3, 4. The question is, do you need such literals in the program code?

      You could use string literals to express more fancy alphabets, but note two things. Using literal alone, you cannot guarantee that they are processed at compile-time: whoever types the literal must take additional means (use constexpr) to guarantee that. But in this case (and this is my second remark) how is using a string literal better than normal string parsing at compile time?
