String’s competing constructors

Posted on March 12, 2018 by Andrzej Krzemieński

Let’s start with the problem. I want to check whether a program received a text message that consists of four consecutive zeroes. Not '0', but the numeric zero. I will create a constant std::string representing the special sequence and compare the messages (also stored as std::strings) I receive. I cannot just type:

const std::string Special = "\0\0\0\0";

As this would be interpreted, according to C rules, as a zero-length string. But there is another constructor exactly for our case, where I can say “character C repeated N times”. So, I type:

const std::string Special ('\0', size_t(4));

I could have used braces for the initialization, but since I am aware of the std::initializer_list gotchas (like the one described here), I am following my rule of thumb, “if you suspect the type might expose an initializer-list constructor, and you want to initialize through a different constructor, do not use braces”.

Now I can test my program. It still doesn’t work correctly for some reasons. It does not recognize the special messages as special, even though I would expect it to. I test and debug it for quite some time, until finally do I realize that my string Special has a different value than I think.

The constructor that I intended, “character C repeated N times,” has signature:

basic_string( size_type count,
              CharT ch,
              const Allocator& alloc = Allocator() );

I gave the arguments in the wrong order, and most likely I have inadvertently selected a different constructor. Type std::basic_string has quite a lot of constructors that differ in the combinations of types. Just see here.

Now, we have two problems here. The first is that the program has a bug, and I should provide the arguments in the inverse order. This is fixed quickly. The other — and this is the one we will focus on in this post — is that there is something in the interface of the library, or in my way of using the library or language that makes a bug like this one easy to occur. I have lost some time, which could have been avoided, and there was a risk that the program with this bug would be shipped to users. The fix here would be to identify the source of the problem and to provide some guidelines which when followed are likely to prevent similar bugs in the future.

First, let’s figure out which constructor actually got called. To make the things more interesting, it is a different constructor in different versions of C++. In C++11 it is:

basic_string( const CharT* s,
              size_type count,
              const Allocator& alloc = Allocator() );

That is, '\0' binds to type const CharT*, and size_t(4) — unsurprisingly — binds to type size_type. For this constructor, passing a null pointer where a pointer to an array is expected, is an undefined behavior.

That '\0' should be a good initializer for an object of type const CharT* is quite surprising. This is a defect in C++11. It is a consequence of the following definition (§ 4.10 ¶ 1):

A null pointer constant is an integral constant expression prvalue of integer type that evaluates to zero or a prvalue of type std::nullptr_t.

This means that any literal or temporary of integral type (char is one) that can be evaluated at compile-time to value zero can be used to initialize a pointer to a null pointer value. For instance:

constexpr int zero() { return 0; }

int * p = zero();            // ok in C++11
int * q = (2 + 2) - (2 * 2); // ok in C++11

Now, I am only telling you what the Standard says. What compilers actually do is a different story. Clang never implemented this requirement, and GCC implemented it with varying accuracy in different versions. I do not know about other compilers.

Anyway, C++14 fixes this bug. The offending definition is changed to:

A null pointer constant is an integer literal with value zero or a prvalue of type std::nullptr_t.

'\0' is not an integral literal. This narrows the creative ways of initializing a null pointer down to:

int * p1 = 0X0UL; // hex + unsigned long
int * p2 = 0B00L; // binary literal
int * p3 = 0'0'0; // oct + digit separators

With this change, the constructor of basic_string selected previously is no longer a match, so a different one is chosen. In fact, the compiler chooses the constructor we intended:

basic_string( size_type count,
              CharT ch,
              const Allocator& alloc = Allocator() );

But with the arguments bound inversely: '\0' binds to type count (type promotion), and size_t(4) binds to ch (type conversion). So, instead of “char '\0' repeated 4 times”, I get “char '\4' repeated 0 times”, which makes an empty string.

Now, we can see this works this way because expression size_t(4) is converted to a smaller type: char. And I told you I am not using brace initialization. Brace initialization prevents narrowing, so maybe using braces for initialization would have prevented this constructor from being selected, and I would get a compiler error. No. If I initialize my constant like this:

const std::string Special {'\0', size_t(4)};

a yet another, initializer-list constructor is selected and interprets the parameters as two characters: '\0' and '\4': I get a two-character string. Narrowing is not prevented, because there is no narrowing here (as explained in this post).

What can we do?

The fact that I was able to put my arguments in inverse order and still get the same constructor to be selected means that the design for this constructor ignores an important guideline (I.24): “do not have two adjacent arguments to the function of the same type”. Here the types are actually different, but because of the implicit conversions, the constructor still falls into this category. This implies that the problem is better fixed in the interface of the library. We cannot change the standard library, but we can speculate on how a robust interface would look like.

When we use factories, rather than constructors directly, we can use the function name to indicate how the arguments will be interpreted:

const String Special = 
  String::make_repeated_char(size_t(4), '\0');

This would protect us against inadvertently selecting a different constructor, but the arguments can still be passed in the inverse order. As mentioned in the other post, we can emulate named function parameters using tags:

constexpr struct with_size_t{} with_size{};
constexpr struct with_value_t{} with_value{};

These are empty classes, each with different type, and this is the distinct type that is the only relevant information in them. With tags in the library, we could initialize our string as follows:

const String Special {with_size,  4,
                      with_value, '\0');

A number of people suggested to me that another solution would be to pass only the values but wrapped into distinct types:

const String Special {with_size{4}, with_value{'\0'});

Indeed, this solves the problem most of the time. The problem with the latter solution is that with_value looks like a normal value-semantic type with a meaningful value, and someone might at some point want to use it as a regular type:

vector<with_size> v {with_size{10}}; // value 10?
                                     // size 10?

In the case of initializing a string, this problem does not occur, but if you consider a general solution for any type, the trick with with_size{10} does not come without its own problems. If I ware to go with it, I would make type with_size non-assignable and non-comparable, so that it is unattractive for any use other than passing arguments to functions.

Another thing we can do is to use clang-tidy. It is a tool for automated code analysis. In version 7, it has a group of checks bugprone-. If we run it against our example, it will immediately detect the problem (even though it is valid C++ code), and give us a message:

warning: string constructor parameters are probably swapped;
expecting string(count, character) [bugprone-string-constructor]
  const std::string Special ('\0', size_t(4));
                    ^        ~~~~~ ~~~~~~~~~~
                             size_t(4) '\0'

So the advice of the day is, if possible, use clang-tidy.

This entry was posted in programming and tagged C++11, c++14, gotchas, stating intentions, type system. Bookmark the permalink.

20 Responses to String’s competing constructors

chip0 says:

March 13, 2018 at 4:51 am

This is an interesting discussion, but I can’t help thinking that if I needed to detect 4 consecutive zero values, I’d go a little lower-level than std::string. I used to write network protocol parsers for a living, so this is familiar territory.

bool is_four_zeros(const std::string& zerostr) {
if(zerostr.size() != 4) {
std::cerr(“string is not 4 characters.\n”);
return false;
}
const uint32_t value = *(static_cast(zerostr.data()));
if(value == 0) {
std::cout << "Found it!\n";
return true;
}
std::cout << "nope.\n";
return false;
}
(I haven't so much as tried to compile this, so it almost certainly won't work as is, but you get the idea.)

Reply
- chip0 says:
  
  March 13, 2018 at 2:52 pm
  
  er, that should be a static_cast<uint32_t*>(zerostr.data). might have to be a reinterpret_cast, actually.
  
  Reply
  - Andrzej Krzemieński says:
    
    March 13, 2018 at 3:34 pm
    bool is_four_zeros(const std::string& s) { if (s.size() != 4) return false; return *(reinterpret_cast<const uint32_t*>(s.data())) == 0; }
    
    But I get the idea.
    Reply
    - Sky says:
      
      March 15, 2018 at 9:08 am
      
      I could be wrong but as far as I know, this is UB.
      
      From http://en.cppreference.com/w/cpp/language/reinterpret_cast :
      We are in the case 5 “Any pointer to object of type T1 can be converted to pointer to object of another type cv T2.” which adds “the resulting pointer may only be dereferenced safely if allowed by the type aliasing rules (see below)”.
      
      Under the aliasing rules, “DynamicType” is “const char” and “AliasedType” is “const uint32_t”. They are not similar (not identical after stripping the const), uint32_t is not the signed/unsigned variant of char (unless a char is 32bits…) and uint32_t is not std::byte, char or unsigned char.
      
      However in practice, on godbolt, gcc, clang and msvc are (currently) generating code that does what you would expect without warnings.
    - Andrzej Krzemieński says:
      
      March 15, 2018 at 10:04 am
      
      I think the point @chipo is making is that one need not create any string to inspect the state of another.
    - heto says:
      
      April 3, 2018 at 6:15 pm
      
      To avoid the undefined behavoiur, make it more easily generalizable to an arbitrary length, and to make the intent more clear:
      
      bool is_four_zeros(const std::string& s) { if (s.size() != 4) return false; return std::all_of(s.begin(), s.end(), [](char c) { return c == '\0'; }); }
Przemek says:

March 13, 2018 at 9:39 am

How about overloading operator*:
const std::string Special = std::string{‘\0’} * 4;

Reply
Marco Arena says:

March 13, 2018 at 1:56 pm

Small typo:
int * p = Zero();
should be:
int * p = zero();
(lower ‘z’)

Per info: on Visual Studio 2017 compiling with /std:c++17:

int * p = zero(); // does not compile
int * q = (2 + 2) – (2 * 2); // compiles

Reply
- Andrzej Krzemieński says:
  
  March 13, 2018 at 2:43 pm
  
  Tanks!. Fixed.
  
  Reply
Yankes says:

March 14, 2018 at 6:47 pm

One solution for `vector v {with_size{10}}` is ban vectors of `with_size_t` (or more exactly ban “tag” types in containers). They role is to tag arguments and not long term storage.

Reply
thesombrerokid says:

March 15, 2018 at 10:26 am

The constructor should be marked as explicit which would remove the implicit conversions that allowed the bug to manifest, the parameters should also be swapped imo, since data, size is a standard across the entire C++ community.

Reply
- Sheen says:
  
  March 16, 2018 at 11:29 am
  
  explicit doesn’t help in this case.
  
  Reply
Melmac says:

March 15, 2018 at 4:17 pm

How about const std::string Special = “\0\0\0\0″s;

Reply
rrowniak says:

March 18, 2018 at 3:36 pm

Thanks Andrzej for another good post. I was wondering if this is a pure academical exercise or you had a real need for using this construction? Because if the latter is the case, probably std::vector would be a better choice. Of course I don’t know whole story that’s why I’m curious…

Reply
- Andrzej Krzemieński says:
  
  March 19, 2018 at 7:21 am
  Thank you for this question. Regarding string and vector I am not changing these in std. I am just considering how they could be redesigned in std2. But I try to encourage good practices in my team like not using braces for initializing containers (unless you want a sequence constructor aong with its costs). But regarding other types I do try to design them with my guidelines in mind. I did it for std::optional, where you have this:
```
optional<X> o {in_place, 1, 2};
```
  For my private types, I often have more possibilities, like provide only one constructor, use constrained types, or if possible: factories.
  
  I am not sure if it addresses your question. The scope of the problem is huge.
  Reply
  - rrowniak says:
    
    March 19, 2018 at 2:33 pm
    
    Thanks for the explanation, it perfectly answers my question and now I see your point.
    Regarding the vector’s and string’s constructors (except default ctors), I always look at the documentation before writing code but event that practice may not help you because of non-trivial implicit conversion rules.
    In practice I’d prefer to avoid similar situations by packing arguments of the same types into separate structures, thereby creating new types. This is a very simple solution but can be considered as over-designed and if such class is not used extensively (e.g. is not part of general purpose library) the effort may never pay off.
    
    Reply
    - Andrzej Krzemieński says:
      
      March 19, 2018 at 2:56 pm
      
      Agreed. Although the trade-off may be different if you have a template for quickly generating unique types. I am using this poor man’s emulation of opaque typedef for bool arguments: https://github.com/akrzemi1/explicit/blob/master/doc/tagged_bool.md.
    - Mariusz Jaskółka says:
      
      May 8, 2018 at 7:35 pm
      
      I also always suffer when I see a piece of someones (or my own) code that looks like this:
      
      do_smth(0, true, 0.0f, false, false, 0, 15);
      
      Lately, in such situation, I remembered that few months ago I was using the “por man’s emulation” you linked above.
      I am sure you had a reason you did it for bool only, not a generic one. Finally I wrote more generic emulation of opaque typedefs (with type-driven tuple additionally):
      https://github.com/jm4R/explicit_tuple
      
      I dare to ask if you think that such generic “emulation” is – like rrowniak called it – over-designed? Or did you consider writing tagged_any_type when you were developing tagged_bool?
    - Andrzej Krzemieński says:
      
      May 9, 2018 at 6:53 am
      
      (This is a reply to Mariusz.)
      
      My approach to this problem and similar ones is to use distinct types that clearly reflect their intent and semantics. Using tagged types is one way of doing it, but sometimes it is inferior to other alternatives. I found tags particularly useful for type bool, because bool has exactly the interface that I need. I also found it very useful for my own types like wrappers for small char arrays that represent short codes like “PL”, or “EUR”, as described here. In that latter case my reasoning was that since I am declaring a new type template with the interface tailored to my use case, I can afford the usage of another template parameter for tagging.
      
      In case of types like int, its interface is too rich: it allows you to do bit manipulation. It is not clear whether you want to use it for indexing an array or for doing mathematical operations. It is not clear if you considered the range of the numbers you will be representing, and if you are not risking an overflow. For these reasons, when I have the budget for replacing an int with a better type, I need to write my own type template with a thinner, more dedicated interface, that clearly reflects my intentions. I could also add tags to that type. But I would never use tagged<int, Tag> because it is only half way where I want to be: it is a different tagged int than another tagged int, but it is still ambiguous what I want (and don’t want) to do with this int.
      
      And also, I have recently found that when you type:
      
      do_smth(/*index=*/ 0, /*fast=*/true, /*threshold=*/ 0.0f, /*update=*/ false, /* start_at=*/ 15);
      
      clang-tidy in version 7 is able to match the names in the comments with the names of function arguments and warn you if they mismatch. This has reduced my drive to address the problem in the type system.
    - Mariusz Jaskółka says:
      
      May 9, 2018 at 8:24 am
      
      Thanks for replying Andrzej,
      that clears a lot. I also noticed that there is a problem with operators, for example in such code:
      
      tagged_type<float, class velocity_tag> v{15.f}; tagged_type<float, class time_tag> t{2.f}; auto a = v/t;
      
      The type of `a` should be even different, like `tagged_type`. That makes it hard to implement generic tagged_type with all operators support.
      
      Thanks also for the info about clang-tidy, I use stable 6.0 release with my fvrt IDE, so waiting for this helpful feature. Unfortunetely, if I understand correctly, it will not increase clarity of legacy, “dirty” code.
      
      Greetings!