String’s interface

What I have to say today is fairly obvious, and you probably know about it. But because I have observed the following pattern in my project’s code fairly frequently, I feel it needs to be recorded. Consider this code:

bool fun (const string& code)
  assert (code.length() >= 2);

  if (code.substr(0, 2) == string("XX")) {
    // ...

  // ...

Can you see what is wrong about it? Please, do not focus on the usage of assert. I only put in order for you to take for granted that the string has two or more characters.

Apparently, this condition checks if string code starts with sequence "XX". Given the assumption that code is at least 2-character long, it appears to be doing the right thing. Provided that the only thing we are concerned with is the correct result of the expression.

Quite often, however, we choose to use C++ in hope of achieving maximum performance of our program. If this is our goal, the above code looks wrong. In order to check if code starts with "XX" we are creating two temporary strings, each of which can potentially allocate heap memory. One could argue that std::string should be able to implement a short string optimization (SSO) for 2-letter sequences, but (1) even then, there is some cost involved that cannot be easily optimized away and (2) not every implementation uses SSO. For instance, I am using GCC 4.4.7 and it does not implement SSO on strings.

The interface of class template std::basic_string is very complicated, as nicely explained in GotW #84. It offers so many member functions that it feels ungrateful not to use any of them. At the same time, because there are so many of them, one feels reluctant to re-parse them time and again.

Because the programmer vaguely remembers that operator== applied to naked null-terminated byte strings (NTBS) (which tend to be converted to const char *) does the wrong thing, he avoids this by making sure that this is two std::strings that are compared. Also, he might be thinking, there is no harm done because literal "XX" would be implicitly converted to std::string anyway, before the operator== is invoked. But this is wrong. The Standard also provides the mixed versions of operator==:

bool operator==(const std::string& lhs, const char* rhs);
bool operator==(const char* lhs, const std::string& rhs);

Of course, in reality they are function templates with many parameters, but you get what I mean. std::string can be compared to a NTBS without the necessity of creating any temporary std::string. Our example can be easily optimized by removing the explicit creation of a temporary:

if (code.substr(0, 2) == "XX") // ...

Next, while admittedly using operator== somewhere in comparison looks elegant, it is wrong to make a brand new string managing its own resource only to inspect a portion of the original. The programmer’s primary goal is not to make the program elegant. Indeed, if we dig into the documentation (e.g., here), we will find that std::basic_string offers a way to compare its sub-string against an NTBS:

if (, 2, "XX") == 0) // ...

The comparison is a three-way one, with 0 indicating equality. It can be performed in-place, without creating any temporary string.

To be honest, I did not know about this member function overload prior to writing this post. I am not really found of learning more than 100 members of basic_string. Also, while the above is definitely a performance improvement, I am not satisfied with it. While it does the correct thing, it may be difficult, when you encounter it the first time, to immediately grasp what it tries to do. My ideal solution, if you can afford using Boost, is to use one of the algorithms from String Algorithms Library:

#include <boost/algorithm/string/predicate.hpp>

bool fun (const string& code)
  // ...
  if (boost::algorithm::starts_with(code, "XX")) // ...
  // ...

This just says what I mean, without adding any unnecessary overhead at the same time.

And that’s it for today. As a side note, in order to test memory allocations of my std::basic_string implementation, I used a custom allocator that apart from allocating, performs the necessary statistics. As I am not good at implementing custom allocators, I used this nice allocator ‘cheat sheet’ by Howard Hinnant from this location.

This entry was posted in programming and tagged , , . Bookmark the permalink.

22 Responses to String’s interface

  1. mooware says:

    For me the obvious solution would be strncmp(). Some people might not like it because it’s C, but I think it’s a perfect fit for the problem, and doesn’t require any additional libraries (like boost or Qt).

    • I think the solution of boost is more elegant than strncmp(if you can afford to boost).With the api of boost(boost::algorithm::starts_with), I do not need to check the document of strncmp at all, the api just show us “what do you want to do”.

  2. It’s worth noting that a string class designed as immutable (like Java and C# strings) can support constant time sub-strings safely, obviating the need for special operations like `starts_with`. Also, a string reference type like the proposed std::string_ref can help do that, but then only in unsafe ways. Regarding std::string::compare, even though it’s not that long since I implemented it for a demonstration, it didn’t register with me that it could be used for `starts_with`. But now that you mention that, yes, and for e.g. C strings one can use std::char_traits::compare. I’ve instead used compare as basis of a convenient mix-in that supplies the relational operators. It can also be expressed in terms of < and ==, but using only < as basis incurs some inefficient double checking.

    • uxcn says:

      Java at least no longer supports constant time substrings (as of 1.7’ish). Even with immutable strings, you would still generally encur the cost of the additional object instantiations (allocate, return, etc…). Although, I think starts_with can still express intent better. Both operator== functions, compare, and boost::algorithm::starts_with probably compile down to just a memcmp(c_str, c_str, length), although they are still technically linear unless you use them in a constexpr. GC’d language string APIs might be a bit more efficient/expressive (switch for strings, etc…), but even those optimizations can lead to problems.

    • slimshader says:

      BOOST has string_ref for some time now, with explicit example of starts_with():

    • Martin Ba says:

      ‘Course, an immutable string type (as the primary string type) doesn’t make a whole lot of sense in C++, given that we have const and also by-value as the default.

      The reason, afaik, in C# and Java for the immutability of string is not primarily performance, but program correctness, so that passing strings around does not mess up program state if two pieces of code were to reference the same string instance when it were mutated by only one of them.

      • An immutable string type is very much desirable in C++. shared_ptr<const wstring> gives you reference counting, i.e. efficient passing and reference storing, but at the cost of double indirection, double dynamic allocation for creation (which can be a real performance killer) and e.g. with linear time substrings (instead of constant time) and with quadratic time for repeated concatenations (instead of linear time). In contrast, a real immutable string type gives you mentioned advantages, as well as e.g. constant time very efficient no-dynamic-allocation creation for returning a string literal from a function, and so on. Presumably the apparent opinion of such advantages not making sense, was based on ignorance of those advantages. I.e. an opinion based on more or less total ignorance of the subject matter.

        • Aurélien says:


          Note that the double alloc of shared ptr can be avoided via make_shared. Could you explain why is repeated concatenations quadratic ?

          FYI, Qt strings are based on reference counting, and today they are thinking about changing it because they conducted a study of the usage of strings in a program and found that an enormous amount of them are (very) short ones. Therefore, small string optimization (SSO) is a much more efficient approach for a general purpose string (with VC++, there is no heap allocation if your string is less than 16 chars).

          Recently, I compared VC++ strings against C# ones. I was surprised to discover they were very similar results (I expected C# strings to be much slower than SSO). So indeed that’s impressive, but at the same time, I’m not sure an immutable string will be that much better than the current SSO ones. But I’m open to learn more about that !

        • Aurélien says:

          Oops, Qt strings are based on COW not reference counting, but that’s somehow the same (it’s even better) for what I meant in my comment.

        • @Aurélien:

          I found no “reply” button on your reply, perhaps it’s too deeply nested. So, I’m replying to myself (technically).

          Regarding the suggestion of using make_shared, thanks, but I was assuming use of make_shared. Consider auto s = make_shared<wstring const>( L"Blah!" );. Assuming that the small buffer optimization isn’t used there is (at least, and in practice) one dynamic allocation internally in the wstring instance, and one dynamic allocation for the make_shared, two dynamic allocations in total. With auto s2 = shared_ptr<wstring const>( new wstring( L"Blah!" ) ); there is one dynamic allocation internally in the wstring, for its buffer; one dynamic allocation of the wstring itself (the new expression); and one dynamic allocation for the shared_ptr control block, for a total of three dynamic allocations.

          Now, regarding the question about quadratic time for repeated concatenation, that’s because with a const wstring in there, this faux immutable string scheme, it’s really immutable. And so each concatenation produces a new string with a new buffer, which means if you start with an empty string and repeatedly add one character, n times, then the total of the string sizes, and hence the number of characters copied in total, is 0 + 1 + … + n – 1 = n*(n-1)/2. In contrast, with direct use of wstring and += the buffer reallocations and hence number of characters copied is guaranteed amortized linear in n (usually this is achieved by doubling the buffer capacity each time the buffer needs to extended).

          With a designed-as-immutable reference counted string one has the linear complexity, because after the first concatenation there is only a single owner and then the buffer contents can be modified under the hood.

          For example, this is done for CPython strings. Internally that code uses C library’s realloc for efficiency. And a modern immutable C++ string type would presumably do the same, rather than std::allocator.🙂

        • Aurelien says:

          @Alf thanks for the details.
          What distinction do you make between a “designed-as-immutable reference counted string” and a Copy On Write design ? What you describe looks quite similar to QString, and as I said, for a general purpose use it’s not that obvious which is best (in particular in a multithreaded context).
          The string_view approach (already mentioned here) looks quite interesting. It also seems to make it possible to safely expose it in an exported interface (dll).

        • @Aurélien:
          Regarding “What distinction do you make between a “designed-as-immutable reference counted string” and a Copy On Write design ?”,

          a copy-on-write implementation, COW, e.g. as the 32-bit g++ implementation of std::basic_string, is for a string whose value can be modified, that supports e.g. changing one character in the string. The mutability complicates things even when the string type is designed for this. In passing, I think it’s worth noting that std::basic_string is not designed for COW, but there was some wording that provided limited COW support in C++03, which was removed in C++11.

          I can’t say that it’s impossible to create a COW mutable string type with the same efficiency advantages as an immutable reference-counted string type (though lacking the advantage of knowing that that value ain’t gonna change). But I can say, from experience implementing strings (including a number of mutable string types, and including a demonstration of reference-counted immutable string type at SourceForge in 2007, and including a for-demonstration partial COW implementation of C++11 std::basic_string, just for a discussion on SO) that if it’s possible then it’s probably rather difficult, i.e. my expectation is much more work creating it, with no initial guarantee of success. If someone managed to create such a string type S, then const S would not be the same as an immutable reference counted string type, because const S would not be assignable. So, it would be a slightly different kind of beast. As a user I would prefer the immutable type. And also as creator.🙂

  3. Diego Giagio says:

    You can use boost::starts_with() as a shortcut to boost::algorithm::starts_with().

  4. simmo says:

    You could also have a look at the upcoming basic_string_view to avoid unnecessary copying.

  5. Vaughn Cato says:

    I take exception to this statement: “The programmer’s primary goal is not to make the program elegant.” While that may be true, the programmer’s primary goal is not to use premature optimization either. However, your solution is both efficient and elegant, so I can’t argue with the results.

  6. Oldcat says:

    why not just
    if (code[0]==’X’ && code[1]==’X’)

    • I think this solution is inflexible, what if you want to compare “XXYYGGCC”?
      if (code[0]==’X’ && code[1]==’X’ && code[2]==’Y’ && code[3]==’Y’ &&
      code[4]==’G’ && code[5]==’G’ && code[6]==’C’ && code[7]==’C’)

  7. Johann says:

    A dedicated string interface solution should take composed and pecomposed unicode characters into account. For instance, if we have a string instance containing L”é” (U+00E9, “LATIN SMALL LETTER E WITH ACUTE”), doing a string::starts_with(L”\x0065\x0301″) (U+0065, “LATIN SMALL LETTER E”, followed by U+0301, “COMBINING ACUTE ACCENT”) on it should return true. A generic boost or std algrorithm solution would need string specializations to do the same out of the box.

  8. peroulia says:

    Yep yep. What about std equal(begin(code), end(code), “xx”) though?

  9. The best solution is to have free functions which use string_view whenever possible. We can forget about the std::string member algorithms and instead use a free function string library.

    For example:

    string_view substr(string_view, size_t pos, size_t len);

    Now you can do the following with no performance penalty.

    bool beginsWithXX(const std::string& s) {
    return substr(s, 0, 2) == “XX”

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s