String’s competing constructors

Let’s start with the problem. I want to check whether a program received a text message that consists of four consecutive zeroes. Not '0', but the numeric zero. I will create a constant std::string representing the special sequence and compare the messages (also stored as std::strings) I receive. I cannot just type:

const std::string Special = "\0\0\0\0";

As this would be interpreted, according to C rules, as a zero-length string. But there is another constructor exactly for our case, where I can say “character C repeated N times”. So, I type:

const std::string Special ('\0', size_t(4));

I could have used braces for the initialization, but since I am aware of the std::initializer_list gotchas (like the one described here), I am following my rule of thumb, “if you suspect the type might expose an initializer-list constructor, and you want to initialize through a different constructor, do not use braces”.

Now I can test my program. It still doesn’t work correctly for some reasons. It does not recognize the special messages as special, even though I would expect it to. I test and debug it for quite some time, until finally do I realize that my string Special has a different value than I think.

The constructor that I intended, “character C repeated N times,” has signature:

basic_string( size_type count,
              CharT ch,
              const Allocator& alloc = Allocator() );

I gave the arguments in the wrong order, and most likely I have inadvertently selected a different constructor. Type std::basic_string has quite a lot of constructors that differ in the combinations of types. Just see here.

Now, we have two problems here. The first is that the program has a bug, and I should provide the arguments in the inverse order. This is fixed quickly. The other — and this is the one we will focus on in this post — is that there is something in the interface of the library, or in my way of using the library or language that makes a bug like this one easy to occur. I have lost some time, which could have been avoided, and there was a risk that the program with this bug would be shipped to users. The fix here would be to identify the source of the problem and to provide some guidelines which when followed are likely to prevent similar bugs in the future.

First, let’s figure out which constructor actually got called. To make the things more interesting, it is a different constructor in different versions of C++. In C++11 it is:

basic_string( const CharT* s,
              size_type count,
              const Allocator& alloc = Allocator() );

That is, '\0' binds to type const CharT*, and size_t(4) — unsurprisingly — binds to type size_type. For this constructor, passing a null pointer where a pointer to an array is expected, is an undefined behavior.

That '\0' should be a good initializer for an object of type const CharT* is quite surprising. This is a defect in C++11. It is a consequence of the following definition (§ 4.10 ¶ 1):

A null pointer constant is an integral constant expression prvalue of integer type that evaluates to zero or a prvalue of type std::nullptr_t.

This means that any literal or temporary of integral type (char is one) that can be evaluated at compile-time to value zero can be used to initialize a pointer to a null pointer value. For instance:

constexpr int zero() { return 0; }

int * p = zero();            // ok in C++11
int * q = (2 + 2) - (2 * 2); // ok in C++11

Now, I am only telling you what the Standard says. What compilers actually do is a different story. Clang never implemented this requirement, and GCC implemented it with varying accuracy in different versions. I do not know about other compilers.

Anyway, C++14 fixes this bug. The offending definition is changed to:

A null pointer constant is an integer literal with value zero or a prvalue of type std::nullptr_t.

'\0' is not an integral literal. This narrows the creative ways of initializing a null pointer down to:

int * p1 = 0X0UL; // hex + unsigned long
int * p2 = 0B00L; // binary literal
int * p3 = 0'0'0; // oct + digit separators

With this change, the constructor of basic_string selected previously is no longer a match, so a different one is chosen. In fact, the compiler chooses the constructor we intended:

basic_string( size_type count,
              CharT ch,
              const Allocator& alloc = Allocator() );

But with the arguments bound inversely: '\0' binds to type count (type promotion), and size_t(4) binds to ch (type conversion). So, instead of “char '\0' repeated 4 times”, I get “char '\4' repeated 0 times”, which makes an empty string.

Now, we can see this works this way because expression size_t(4) is converted to a smaller type: char. And I told you I am not using brace initialization. Brace initialization prevents narrowing, so maybe using braces for initialization would have prevented this constructor from being selected, and I would get a compiler error. No. If I initialize my constant like this:

const std::string Special {'\0', size_t(4)};

a yet another, initializer-list constructor is selected and interprets the parameters as two characters: '\0' and '\4': I get a two-character string. Narrowing is not prevented, because there is no narrowing here (as explained in this post).

What can we do?

The fact that I was able to put my arguments in inverse order and still get the same constructor to be selected means that the design for this constructor ignores an important guideline (I.24): “do not have two adjacent arguments to the function of the same type”. Here the types are actually different, but because of the implicit conversions, the constructor still falls into this category. This implies that the problem is better fixed in the interface of the library. We cannot change the standard library, but we can speculate on how a robust interface would look like.

When we use factories, rather than constructors directly, we can use the function name to indicate how the arguments will be interpreted:

const String Special = 
  String::make_repeated_char(size_t(4), '\0');

This would protect us against inadvertently selecting a different constructor, but the arguments can still be passed in the inverse order. As mentioned in the other post, we can emulate named function parameters using tags:

constexpr struct with_size_t{} with_size{};
constexpr struct with_value_t{} with_value{};

These are empty classes, each with different type, and this is the distinct type that is the only relevant information in them. With tags in the library, we could initialize our string as follows:

const String Special {with_size,  4,
                      with_value, '\0');

A number of people suggested to me that another solution would be to pass only the values but wrapped into distinct types:

const String Special {with_size{4}, with_value{'\0'});

Indeed, this solves the problem most of the time. The problem with the latter solution is that with_value looks like a normal value-semantic type with a meaningful value, and someone might at some point want to use it as a regular type:

vector<with_size> v {with_size{10}}; // value 10?
                                     // size 10?

In the case of initializing a string, this problem does not occur, but if you consider a general solution for any type, the trick with with_size{10} does not come without its own problems. If I ware to go with it, I would make type with_size non-assignable and non-comparable, so that it is unattractive for any use other than passing arguments to functions.

Another thing we can do is to use clang-tidy. It is a tool for automated code analysis. In version 7, it has a group of checks bugprone-. If we run it against our example, it will immediately detect the problem (even though it is valid C++ code), and give us a message:

warning: string constructor parameters are probably swapped;
expecting string(count, character) [bugprone-string-constructor]
  const std::string Special ('\0', size_t(4));
                    ^        ~~~~~ ~~~~~~~~~~
                             size_t(4) '\0'

So the advice of the day is, if possible, use clang-tidy.

Advertisements
This entry was posted in programming and tagged , , , , . Bookmark the permalink.

17 Responses to String’s competing constructors

  1. chip0 says:

    This is an interesting discussion, but I can’t help thinking that if I needed to detect 4 consecutive zero values, I’d go a little lower-level than std::string. I used to write network protocol parsers for a living, so this is familiar territory.

    bool is_four_zeros(const std::string& zerostr) {
    if(zerostr.size() != 4) {
    std::cerr(“string is not 4 characters.\n”);
    return false;
    }
    const uint32_t value = *(static_cast(zerostr.data()));
    if(value == 0) {
    std::cout << "Found it!\n";
    return true;
    }
    std::cout << "nope.\n";
    return false;
    }
    (I haven't so much as tried to compile this, so it almost certainly won't work as is, but you get the idea.)

    • chip0 says:

      er, that should be a static_cast<uint32_t*>(zerostr.data). might have to be a reinterpret_cast, actually.

      • bool is_four_zeros(const std::string& s)
        {
          if (s.size() != 4) return false;
          return *(reinterpret_cast<const uint32_t*>(s.data())) == 0;
        }
        

        But I get the idea.

        • Sky says:

          I could be wrong but as far as I know, this is UB.

          From http://en.cppreference.com/w/cpp/language/reinterpret_cast :
          We are in the case 5 “Any pointer to object of type T1 can be converted to pointer to object of another type cv T2.” which adds “the resulting pointer may only be dereferenced safely if allowed by the type aliasing rules (see below)”.

          Under the aliasing rules, “DynamicType” is “const char” and “AliasedType” is “const uint32_t”. They are not similar (not identical after stripping the const), uint32_t is not the signed/unsigned variant of char (unless a char is 32bits…) and uint32_t is not std::byte, char or unsigned char.

          However in practice, on godbolt, gcc, clang and msvc are (currently) generating code that does what you would expect without warnings.

        • I think the point @chipo is making is that one need not create any string to inspect the state of another.

        • heto says:

          To avoid the undefined behavoiur, make it more easily generalizable to an arbitrary length, and to make the intent more clear:

          bool is_four_zeros(const std::string& s)
          {
            if (s.size() != 4) return false;
            return std::all_of(s.begin(), s.end(), [](char c) { return c == '\0'; });
          }
          
  2. Przemek says:

    How about overloading operator*:
    const std::string Special = std::string{‘\0’} * 4;

  3. Marco Arena says:

    Small typo:
    int * p = Zero();
    should be:
    int * p = zero();
    (lower ‘z’)

    Per info: on Visual Studio 2017 compiling with /std:c++17:

    int * p = zero(); // does not compile
    int * q = (2 + 2) – (2 * 2); // compiles

  4. Yankes says:

    One solution for `vector v {with_size{10}}` is ban vectors of `with_size_t` (or more exactly ban “tag” types in containers). They role is to tag arguments and not long term storage.

  5. thesombrerokid says:

    The constructor should be marked as explicit which would remove the implicit conversions that allowed the bug to manifest, the parameters should also be swapped imo, since data, size is a standard across the entire C++ community.

  6. Melmac says:

    How about const std::string Special = “\0\0\0\0″s;

  7. rrowniak says:

    Thanks Andrzej for another good post. I was wondering if this is a pure academical exercise or you had a real need for using this construction? Because if the latter is the case, probably std::vector would be a better choice. Of course I don’t know whole story that’s why I’m curious…

    • Thank you for this question. Regarding string and vector I am not changing these in std. I am just considering how they could be redesigned in std2. But I try to encourage good practices in my team like not using braces for initializing containers (unless you want a sequence constructor aong with its costs). But regarding other types I do try to design them with my guidelines in mind. I did it for std::optional, where you have this:

      optional<X> o {in_place, 1, 2};
      

      For my private types, I often have more possibilities, like provide only one constructor, use constrained types, or if possible: factories.

      I am not sure if it addresses your question. The scope of the problem is huge.

      • rrowniak says:

        Thanks for the explanation, it perfectly answers my question and now I see your point.
        Regarding the vector’s and string’s constructors (except default ctors), I always look at the documentation before writing code but event that practice may not help you because of non-trivial implicit conversion rules.
        In practice I’d prefer to avoid similar situations by packing arguments of the same types into separate structures, thereby creating new types. This is a very simple solution but can be considered as over-designed and if such class is not used extensively (e.g. is not part of general purpose library) the effort may never pay off.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s