User-defined literals — Part I

This post is about a new language feature: the ability to specify user-defined literals. You may already be familiar with it, for example from N2750. In this post we will talk about literals in general, the purpose and usefulness of user-defined literals, their limitations, and alternatives in C++ to achieve similar goals.

As of today, the only compiler I know of that supports user-defined literals is GCC 4.7. If you are a Linux user you probably know how to get it. If you are a Windows user, I recommend using the latest (9.2) MinGW Distro prepared by Stephan T. Lavavej. It is a compact version of MinGW containing GCC 4.7.1 and Boost 1.50.0: just what you need to play with modern C++.

Built-in literals

Literals are used to set objects of built-in types with values that we know at compile-time:

int i = 0;
i = 17;

There are other ways of setting values at compile time:

int i{}; // zero-initialization
i = {};  // reset to zero-initialized

In their simplest form (without prefixes or suffixes) literals determine the value and the ‘basic’ type:

// not legal C++, but the meaning is obvious
decltype(11) == int;
decltype('y') == char;
decltype("dog") == const char[3 + 1];
decltype(true) == bool;
decltype(nullptr) == nullptr_t;

For some literals C++ uses keywords (true, false, nullptr), because the range of values of these types is so small. The similar case is for enums: usually the number of enumerated values is so small that it is easy for the compiler to store every single literal. These literals are of little interest to us in this post. For the other literals (integral, floating-point, string, character) the range of possible values is too large for the compiler to store all of them in some map, therefore it recognizes them by matching patterns (a sequence of digits for integrals, a sequence of characters enclosed in double quotes for strings, etc.).

With prefixes and suffixes it is possible to alter the type of the literal:

// not legal C++, but the meaning is obvious
decltype(11) == int;
decltype(11UL) == unsigned long;
decltype(11LL) == signed long long;

decltype( 'y') == char;
decltype(u'y') == char16_t;
decltype(U'y') == char32_t;
decltype(L'y') == wchar_t;

decltype( "dog") == const char[3 + 1];
decltype(u"dog") == const char16_t[3 + 1];
decltype(U"dog") == const char32_t[3 + 1];
decltype(L"dog") == const wchar_t[3 + 1];

This is useful when we want to pick the correct function overload, or correctly deduce the type of a variable:

void fun(int i);
void fun(unsigned i);

fun(12);  // pick first overload
fun(12U); // pick second overload

auto c1 =  'c'; // deduce type char
auto c2 = u'c'; // deduce type char16_t
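The pseudo-code comparisons above can be turned into legal C++ with static_assert and std::is_same. The following is only a sketch: the return values of the two fun overloads are an assumption added here so that the chosen overload can be observed. Note one detail that the informal notation glosses over: a string literal is an lvalue, so decltype strictly yields a reference to the array.

```cpp
#include <cstddef>
#include <type_traits>

// Suffixes select the integer type.
static_assert(std::is_same<decltype(11),   int>::value,           "no suffix");
static_assert(std::is_same<decltype(11UL), unsigned long>::value, "UL suffix");
static_assert(std::is_same<decltype(11LL), long long>::value,     "LL suffix");

// Prefixes select the character type.
static_assert(std::is_same<decltype( 'y'), char>::value,     "no prefix");
static_assert(std::is_same<decltype(u'y'), char16_t>::value, "u prefix");
static_assert(std::is_same<decltype(U'y'), char32_t>::value, "U prefix");
static_assert(std::is_same<decltype(L'y'), wchar_t>::value,  "L prefix");

// Keyword literals have fixed types.
static_assert(std::is_same<decltype(true), bool>::value,              "bool");
static_assert(std::is_same<decltype(nullptr), std::nullptr_t>::value, "nullptr_t");

// A string literal is an lvalue, so decltype yields a reference to the array.
static_assert(std::is_same<decltype( "dog"), const char(&)[4]>::value,    "narrow string");
static_assert(std::is_same<decltype(L"dog"), const wchar_t(&)[4]>::value, "wide string");

// Overloads as in the text; the return values are an assumption made
// here so that the selected overload can be checked.
constexpr int fun(int)      { return 1; }
constexpr int fun(unsigned) { return 2; }
static_assert(fun(12)  == 1, "int literal picks fun(int)");
static_assert(fun(12U) == 2, "U suffix picks fun(unsigned)");

// auto deduces whatever type the prefix selected.
auto c1 =  'c';
auto c2 = u'c';
static_assert(std::is_same<decltype(c1), char>::value,     "deduced char");
static_assert(std::is_same<decltype(c2), char16_t>::value, "deduced char16_t");
```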

Prefixes and suffixes also determine how the letters (or digits) in the literal are interpreted:

auto i1 = 80;    // decimal value 80
auto i2 = 0x80;  // decimal value 128
decltype(i1) == decltype(i2);
assert (i1 != i2);

auto s1 = "(\\\\)";  // renders: (\\)
auto s2 = R"(\\\\)"; // renders: \\\\
decltype(s1) == decltype(s2);
assert (s1 != s2);
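A compilable variant of the comparison above might look as follows. The helper same_text is hypothetical and exists only so that the two strings can be compared by content at compile time (comparing s1 and s2 directly would compare pointers).

```cpp
#include <type_traits>

// Hypothetical helper: compares two null-terminated strings by content
// at compile time (C++11 style, hence the recursion).
constexpr bool same_text(const char* a, const char* b)
{
    return *a == *b && (*a == '\0' || same_text(a + 1, b + 1));
}

// The 0x prefix changes how the digits are read, not the type.
static_assert(std::is_same<decltype(80), decltype(0x80)>::value, "both are int");
static_assert(0x80 == 128, "hexadecimal 80 is decimal 128");
static_assert(80 != 0x80, "same digits, different values");

// Ordinary literal: escapes are processed, parentheses are plain characters.
// Raw literal: no escape processing, parentheses are delimiters.
static_assert(!same_text("(\\\\)", R"(\\\\)"), "the two spellings differ in content");
static_assert(same_text(R"(\\\\)", "\\\\\\\\"), "the raw literal holds four backslashes");
static_assert(same_text("(\\\\)", R"((\\))"),   "the ordinary literal holds (, two backslashes, )");
```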

When setting compile-time values to user-defined types, we use the fact that these types are composed of built-in types, and we simply compose literals:

std::complex<double> j{0.0, 1.0}; // two literals used
BigInt I{1};                      // int literal converted to BigInt
boost::tribool b{true};           // bool literal converted to tribool
std::string s{"dog"};             // const char [4] converted to std::string

This works for object initialization, but we cannot pick the correct overload easily: we have to create a temporary object or use an explicit cast:

void fun(int i);
void fun(BigInt i);

fun(1);          // pick first overload
fun(BigInt{1});  // pick second overload

Goals of user-defined literals

User-defined literals can be defined only for the “interesting” types of literals: integral numbers, floating-point numbers, characters and character strings. New literal types are defined by specifying new literal suffixes (it is not possible to specify new literal prefixes). The primary goal, as declared in the proposals, was for the Committee to have a tool for specifying new literal types by means of the Standard Library, rather than by extending the core language. Changing the Standard Library appears to be easier. The Committee wants to use this tool to add any literals that are added, or intended for future addition, to C or standard C extensions. One such example is the decimal floating-point literal, like 10.2df. This goal has been achieved only partially, because many new C literals require prefixes or syntax other than suffixes/prefixes:

  • binary integer literals (proposed to C but rejected): 0b11011,
  • hex floating-point literals: 0x102Ap12,
  • new char literals: u'A'.

The other goal for the literals was to prepare the way for the addition of the new C++ Standard Library components: arbitrary-precision integers, decimal floating-point numbers, fixed-point numbers, new string types, SI units.

While user-defined literals are primarily a toy for the Standard Library designers, normal users also have limited access to them.

Cooked literals

There are a number of ways to define a new literal suffix. In this part we will focus on the one called a cooked literal. Suppose our program processes weights (the physical quantity). We do not want to use type double because we are working in an international environment and for some developers “weight” naturally means “kilograms” and for others, it obviously means “pounds.” So we introduce a new type Kilograms. The name of the type states clearly what the unit is. We prevent the conversion from double to avoid any inadvertent weight unit confusion:

class Kilograms
{
  double rawWeight;

public:
  class DoubleIsKilos{}; // a tag
  explicit constexpr Kilograms(DoubleIsKilos, double wgt) : rawWeight{wgt} {}
};

Kilograms wgt = 100.2; // compile-time error

The ugly constructor prevents the inadvertent usage of the raw weight. Now, our goal is to make the following syntax work:

Kilograms wgt = 100.2_kg;

We cannot define the literal 100.2kg due to the limitation that C++ programmers suffer from: we can only define suffixes that start with an underscore. This is to prevent suffix name clashes with the potential future standard suffixes. First, let’s see the literal definition. The detailed explanation will follow.

constexpr Kilograms operator "" _kg( long double wgt )
{
  return Kilograms{Kilograms::DoubleIsKilos{}, static_cast<double>(wgt)};
}

We defined a special function, much as if we had defined operator+=. It takes long double as its argument and returns Kilograms. The function is declared constexpr. This is not essential to make our simple example work, but in general it is a good idea, because literals are often used to initialize compile-time constants. The part operator "" _kg is fixed syntax; it says “define literal suffix _kg.” Note, however, that in C++11 as published (and in GCC 4.7) the space between "" and _kg is essential: otherwise the token ""_kg is interpreted, in the earlier stages of parsing the source code, as one (empty) string with a suffix. Later revisions of the standard also accept the form without the space.

Another thing to note is that even though we will be storing doubles we still took long double as argument and made an effort to narrow it down to type double. This is characteristic of cooked literal definitions: you are forced to use the longest possible type of this “category”: for floating-point literals it is long double; for integer literals it is unsigned long long. Yes, unsigned, because the minus (or plus) sign is never part of the literal: it is a unary operator applied to the temporary initialized with the literal.
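To illustrate the two points above, here is a minimal sketch with a hypothetical type Meters and a hypothetical suffix _m (neither appears elsewhere in this post). An integer literal such as 3_m is matched only against the unsigned long long form and is not converted to long double, so supporting both 3_m and 2.5_m takes two overloads. The unary minus is defined separately, because the sign never reaches the literal operator.

```cpp
// Hypothetical type, used only for this illustration.
struct Meters { long double value; };

// Cooked floating-point form: the parameter has to be long double.
constexpr Meters operator "" _m(long double v)
{
    return Meters{v};
}

// Cooked integer form: the parameter has to be unsigned long long.
constexpr Meters operator "" _m(unsigned long long v)
{
    return Meters{static_cast<long double>(v)};
}

// The sign is not part of the literal: -3_m means operator-(3_m).
constexpr Meters operator-(Meters m)
{
    return Meters{-m.value};
}

// Parentheses are needed: without them "2.5_m.value" would be lexed
// as a single (ill-formed) numeric token.
static_assert((2.5_m).value == 2.5L,  "floating-point literal uses the long double form");
static_assert((3_m).value   == 3.0L,  "integer literal uses the unsigned long long form");
static_assert((-3_m).value  == -3.0L, "unary minus is applied to the Meters result");
```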

Note that we can have more than one literal return the same type. For instance, while we use type Kilograms we can still allow the programmers that think in pounds to use pound-based literals:

constexpr Kilograms operator "" _lb( long double wgt )
{
  return Kilograms{Kilograms::DoubleIsKilos{}, static_cast<double>(wgt * 0.45359237)};
}

Kilograms wgt = 200.5_lb;

You can still safely mix the two units:

Kilograms w = 200.5_lb + 100.1_kg; // ok (assuming we also define operator+)
Kilograms v = 21000.33 + 100.1_kg; // error: cannot add Kilograms and double
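Putting the pieces together, a self-contained version of the weight example could look like this. It is a sketch under two assumptions that the text only hints at: an accessor value() and an operator+ for Kilograms.

```cpp
class Kilograms
{
    double rawWeight;

public:
    class DoubleIsKilos {}; // tag: "this double is already in kilograms"
    explicit constexpr Kilograms(DoubleIsKilos, double wgt) : rawWeight{wgt} {}

    // Accessor added for this sketch (an assumption, not part of the text).
    constexpr double value() const { return rawWeight; }
};

// Assumed operator+; the text only says "assuming we also define operator+".
constexpr Kilograms operator+(Kilograms a, Kilograms b)
{
    return Kilograms{Kilograms::DoubleIsKilos{}, a.value() + b.value()};
}

// Cooked literal: the value is already in kilograms.
constexpr Kilograms operator "" _kg(long double wgt)
{
    return Kilograms{Kilograms::DoubleIsKilos{}, static_cast<double>(wgt)};
}

// Cooked literal: the value is in pounds and is converted to kilograms.
constexpr Kilograms operator "" _lb(long double wgt)
{
    return Kilograms{Kilograms::DoubleIsKilos{}, static_cast<double>(wgt * 0.45359237)};
}

// Both literals yield Kilograms, so the units mix safely.
constexpr Kilograms w = 200.5_lb + 100.1_kg; // roughly 191.05 kg

// Kilograms v = 21000.33 + 100.1_kg; // would not compile: no operator+(double, Kilograms)
// Kilograms u = 100.2;               // would not compile: no conversion from double
```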

So, how do cooked literals work? When the compiler encounters a token like 100.1_kg, it recognizes it as a floating-point literal with an unrecognized suffix. At this point a C++03 compiler would stop and report an error; a C++11 compiler now looks through all user-defined literal operators and tries to match the suffix. If it finds the right suffix and sees that you have chosen to use a cooked literal, it interprets the part 100.1 as the longest floating-point type (long double) and passes it to your function. This process of computing the long double input parameter for your literal operator is called “cooking”: you do not need to parse the digits in the literal one by one, the compiler does it for you. What you can then do is change the type of the input argument or transform its value.

Let’s see one more example that demonstrates a characteristic, or limitation, of the cooked literal operator. We define a type that represents a probability: its base type is long double but the valid range of its values is [0, 1]:

class Probability
{
    long double value;
    // invariant:  0.0 <= value && value <= 1.0
public:
    explicit constexpr Probability(long double v);
    // ...
};

In order to define our constexpr literal operator we will use the technique for both compile-time and run-time error reporting described in “Compile-time computations”:

constexpr Probability operator"" _prob( long double v )
{
    return v > 1.0 ? throw BadProbability{v} : Probability{v};
}

Note that we are not checking for the other condition (v < 0.0) because a literal never represents a negative value. Now we might expect that the following would cause a compile-time error:

Probability p = 1.2_prob;

Such an expectation makes sense: the ‘nominal’ value of the literal is always known at compile-time, so it should be easy to also transform this value at compile-time. However, as is the case for, say, operator+=, an operator invocation (even of a literal operator) is just a function call, and our line is equivalent to:

Probability p = operator"" _prob(1.2);

Thus, as with any other constexpr function, the call is not guaranteed to be evaluated at compile-time (even when called with compile-time constants) unless it initializes another compile-time constant:

constexpr Probability p = 1.2_prob; // compile-time error
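The difference between the two initializations can be observed in a small self-contained program. This is a sketch with a few assumptions: BadProbability is given a single member, the data member is renamed value_ so that an accessor value() can be added, and the helper demonstrate_run_time_check is hypothetical.

```cpp
#include <cassert>

// Exception type assumed for this sketch.
struct BadProbability { long double value; };

class Probability
{
    long double value_;
    // invariant: 0.0 <= value_ && value_ <= 1.0

public:
    explicit constexpr Probability(long double v) : value_{v} {}
    constexpr long double value() const { return value_; } // accessor, an assumption
};

// In a constant expression an out-of-range value stops compilation;
// in an ordinary call it throws at run time.
constexpr Probability operator "" _prob(long double v)
{
    return v > 1.0 ? throw BadProbability{v} : Probability{v};
}

constexpr Probability half = 0.5_prob;   // checked during compilation
// constexpr Probability bad = 1.2_prob; // would not compile: throw reached in a constant expression

// Hypothetical helper: shows that the non-constexpr initialization
// compiles and reports the error only when executed.
inline void demonstrate_run_time_check()
{
    bool thrown = false;
    try {
        Probability p = 1.2_prob; // just a function call, evaluated at run time
        (void)p;
    } catch (const BadProbability& e) {
        thrown = (e.value > 1.0L);
    }
    assert(thrown);
}
```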

For another example consider a string literal suffix that renders a value of type std::string rather than const char *:

std::string operator"" _s(const char * str, size_t len)
{
    return std::string{str, len};
}

void fun(const char*);
void fun(std::string);
fun("dog"_s); // picks second overload

This is another example of a cooked literal. The second argument (len) is required by the literal operator's signature in C++; we are not obliged to use it inside, although here it lets us construct the string without searching for the terminating null character. Being cooked in this case means that escape sequences in ordinary string literals, and the delimiters of raw string literals, are processed before our operator is called:

auto s1 = "\\dog\\"_s;  // renders: \dog\
auto s2 = R"(\dog\)"_s; // renders: \dog\
assert (s1 == s2);
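A runnable version of the string example might look like this. The return values of the fun overloads and the helper check_string_literals are assumptions added for the sketch. The last check shows one case where len does matter: a literal with an embedded null character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Cooked string literal operator: receives the processed characters and their count.
std::string operator "" _s(const char* str, std::size_t len)
{
    return std::string{str, len};
}

// Overloads as in the text; the return values are an assumption made
// here so that the selected overload can be checked.
inline int fun(const char*) { return 1; }
inline int fun(std::string) { return 2; }

// Hypothetical helper that exercises the literal operator.
inline void check_string_literals()
{
    assert(fun("dog")   == 1); // plain literal decays to const char*
    assert(fun("dog"_s) == 2); // suffixed literal is a std::string

    // Escapes and raw-string delimiters are processed before the operator runs.
    auto s1 = "\\dog\\"_s;
    auto s2 = R"(\dog\)"_s;
    assert(s1 == s2);
    assert(s1.size() == 5);

    // Because len is passed in, an embedded null character is preserved.
    auto s3 = "a\0b"_s;
    assert(s3.size() == 3);
}
```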

How useful is that?

Note that so far we have only talked about cooked literal forms, leaving the other kind (raw literal forms) for the next post. The question now is whether we really need cooked literals, and whether the value they add justifies adding a new feature to the language.

Consider the string example first. We can select the right overload by simply creating a temporary explicitly (rather than inside the literal operator):

void fun(const char*);
void fun(std::string);
fun(std::string{"dog"}); // picks second overload

It is a bit longer but sufficient, and many people may find it cleaner and less arcane: everyone knows how and why you create temporaries.

Now, for the units example, I can think of at least two alternatives without literals. First, we can use functions instead of literal operators:

Kilograms w = pounds(200.5) + kilograms(100.1);

This approach has been adopted by the time utilities in the <chrono> header of the current standard. With <chrono>, you can specify any time duration this way:

#include <chrono>
using namespace std::chrono;

auto duration1 = hours(8) + minutes(30) + seconds(5) + milliseconds(120);

This is a bit more verbose than literal suffixes, but is extremely readable: you just cannot interpret the meaning incorrectly.
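Because the duration constructors and arithmetic operators in <chrono> are constexpr, the sum can be checked at compile time. A small sketch (the static_asserts are added here for illustration):

```cpp
#include <chrono>

using namespace std::chrono;

// Mixed units are converted to a common unit automatically.
constexpr auto duration1 = hours(8) + minutes(30) + seconds(5) + milliseconds(120);

// 8 h 30 min 5 s 120 ms is 30605120 ms; comparison also works across units.
static_assert(duration1 == milliseconds(30605120), "total expressed in milliseconds");

// Converting to a coarser unit has to be requested explicitly and truncates.
static_assert(duration_cast<seconds>(duration1) == seconds(30605), "whole seconds");
```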

Another way of addressing the weight units safety would be to define constants that represent “units”, and then use multiplication to combine nominal values with units:

constexpr Kilograms kg{Kilograms::DoubleIsKilos{}, 1.0};
constexpr Kilograms lb{Kilograms::DoubleIsKilos{}, 0.45359237};

Kilograms w = 200.5 * lb + 100.1 * kg;
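For this line to compile, Kilograms needs a scaling operator in addition to operator+. A self-contained sketch follows; the accessor value() and both operators are assumptions, since the text does not show them.

```cpp
class Kilograms
{
    double rawWeight;

public:
    class DoubleIsKilos {}; // tag: "this double is already in kilograms"
    explicit constexpr Kilograms(DoubleIsKilos, double wgt) : rawWeight{wgt} {}

    // Accessor added for this sketch (an assumption).
    constexpr double value() const { return rawWeight; }
};

// Assumed operator: adds two weights.
constexpr Kilograms operator+(Kilograms a, Kilograms b)
{
    return Kilograms{Kilograms::DoubleIsKilos{}, a.value() + b.value()};
}

// Assumed operator: scales a unit constant by a plain number.
constexpr Kilograms operator*(double factor, Kilograms unit)
{
    return Kilograms{Kilograms::DoubleIsKilos{}, factor * unit.value()};
}

// Unit constants: one kilogram and one pound, both stored in kilograms.
constexpr Kilograms kg{Kilograms::DoubleIsKilos{}, 1.0};
constexpr Kilograms lb{Kilograms::DoubleIsKilos{}, 0.45359237};

constexpr Kilograms w = 200.5 * lb + 100.1 * kg; // roughly 191.05 kg

// Kilograms v = 200.5 + 100.1 * kg; // would not compile: no operator+(double, Kilograms)
```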

This approach has been adopted by the Boost.Units library, where units of physical quantities are represented by constants:

using namespace boost::units;
using namespace boost::units::si;

quantity<force>  F  = 2.0 * newton;
quantity<length> dx = 2.0 * meter;
quantity<energy> E1 = F * dx;
quantity<energy> E2 = 2.0 * newton  *  2.0 * meter;

The multiplication operator is fairly intuitive here, because in physics two symbols glued together are already interpreted as multiplication even if we scale a unit by a scalar: 3m means “the one-meter unit scaled by the factor of three,” or “3 times m.”

Admittedly, literals take us a bit closer to the notation from physics: E = 2.0N · 2.0m. But is this goal worth complicating the language? It is worth noting that operators, like + or *, are also not necessary, and we could do without them by using function calls:

Kilograms v = add(mul(a, b), c);

And in fact, this is what we would be forced to do in languages that do not provide operator overloading if we chose to use a user-defined type. But you can see that the lack of arithmetic operators really hurts readability. With SI units the situation is different. First, fewer programmers need them; second, there is a limit to how far we can represent units, because some of them require symbols outside of the basic source character set, e.g. μs.

There is one practical gain from using cooked literals though. Because numeric literals never represent negative numbers, you can easily eliminate all negative values at compile-time from your non-negative type. This is what we did for type Probability:

double d = -0.5;          // ok: operator-(0.5);
Probability p = -0.5_prob; // error: operator- not defined for Probability

And that’s it for now. In part II, we will look at raw literals.


5 Responses to User-defined literals — Part I

  1. Grout says:

    Why doesn’t C++11 have ""s for std::string, perchance? Seems incredibly obvious and useful.

    • ""s as language feature or Standard Library component? If you mean the latter, the usage would not be that simple. You would have to add some declaration in front:

      using namespace std::string_literals;

      And then the compilation times may be marginally affected because a new function call needs to be compiled.
      Honestly, I do not know the C++ Committee's intentions or process, so I can only guess that user-defined literals were so new that no one wanted to risk building Standard Library additions on top of them. Some bodies even requested that user-defined literals be removed.
      But I can see that the plan is still to add ""s in a future standard.

  2. Brutally Frank says:

    Seems like a lot of fancy crapola for little gain in functionality not to mention a maintenance nightmare. Simplicity leads to clear mind and top performance.
