What I have to say today is fairly obvious, and you probably know about it. But because I have observed the following pattern in my project’s code fairly frequently, I feel it needs to be recorded. Consider this code:
bool fun (const string& code) { assert (code.length() >= 2); if (code.substr(0, 2) == string("XX")) { // ... } // ... }
Can you see what is wrong about it? Please, do not focus on the usage of assert
. I only put in order for you to take for granted that the string has two or more characters.
Apparently, this condition checks if string code
starts with sequence "XX"
. Given the assumption that code
is at least 2-character long, it appears to be doing the right thing. Provided that the only thing we are concerned with is the correct result of the expression.
Quite often, however, we choose to use C++ in hope of achieving maximum performance of our program. If this is our goal, the above code looks wrong. In order to check if code
starts with "XX"
we are creating two temporary string
s, each of which can potentially allocate heap memory. One could argue that std::string
should be able to implement a short string optimization (SSO) for 2-letter sequences, but (1) even then, there is some cost involved that cannot be easily optimized away and (2) not every implementation uses SSO. For instance, I am using GCC 4.4.7 and it does not implement SSO on string
s.
The interface of class template std::basic_string
is very complicated, as nicely explained in GotW #84. It offers so many member functions that it feels ungrateful not to use any of them. At the same time, because there are so many of them, one feels reluctant to re-parse them time and again.
Because the programmer vaguely remembers that operator==
applied to naked null-terminated byte strings (NTBS) (which tend to be converted to const char *
) does the wrong thing, he avoids this by making sure that this is two std::string
s that are compared. Also, he might be thinking, there is no harm done because literal "XX"
would be implicitly converted to std::string
anyway, before the operator==
is invoked. But this is wrong. The Standard also provides the mixed versions of operator==
:
bool operator==(const std::string& lhs, const char* rhs); bool operator==(const char* lhs, const std::string& rhs);
Of course, in reality they are function templates with many parameters, but you get what I mean. std::string
can be compared to a NTBS without the necessity of creating any temporary std::string
. Our example can be easily optimized by removing the explicit creation of a temporary:
if (code.substr(0, 2) == "XX") // ...
Next, while admittedly using operator==
somewhere in comparison looks elegant, it is wrong to make a brand new string
managing its own resource only to inspect a portion of the original. The programmer’s primary goal is not to make the program elegant. Indeed, if we dig into the documentation (e.g., here), we will find that std::basic_string
offers a way to compare its sub-string against an NTBS:
if (code.compare(0, 2, "XX") == 0) // ...
The comparison is a three-way one, with 0 indicating equality. It can be performed in-place, without creating any temporary string
.
To be honest, I did not know about this member function overload prior to writing this post. I am not really found of learning more than 100 members of basic_string
. Also, while the above is definitely a performance improvement, I am not satisfied with it. While it does the correct thing, it may be difficult, when you encounter it the first time, to immediately grasp what it tries to do. My ideal solution, if you can afford using Boost, is to use one of the algorithms from String Algorithms Library:
#include <boost/algorithm/string/predicate.hpp> bool fun (const string& code) { // ... if (boost::algorithm::starts_with(code, "XX")) // ... // ... }
This just says what I mean, without adding any unnecessary overhead at the same time.
And that’s it for today. As a side note, in order to test memory allocations of my std::basic_string
implementation, I used a custom allocator that apart from allocating, performs the necessary statistics. As I am not good at implementing custom allocators, I used this nice allocator ‘cheat sheet’ by Howard Hinnant from this location.
For me the obvious solution would be strncmp(). Some people might not like it because it’s C, but I think it’s a perfect fit for the problem, and doesn’t require any additional libraries (like boost or Qt).
I think the solution of boost is more elegant than strncmp(if you can afford to boost).With the api of boost(boost::algorithm::starts_with), I do not need to check the document of strncmp at all, the api just show us “what do you want to do”.
It’s worth noting that a string class designed as immutable (like Java and C# strings) can support constant time sub-strings safely, obviating the need for special operations like `starts_with`. Also, a string reference type like the proposed
std::string_ref
can help do that, but then only in unsafe ways. Regardingstd::string::compare
, even though it’s not that long since I implemented it for a demonstration, it didn’t register with me that it could be used for `starts_with`. But now that you mention that, yes, and for e.g. C strings one can usestd::char_traits::compare
. I’ve instead usedcompare
as basis of a convenient mix-in that supplies the relational operators. It can also be expressed in terms of<
and==
, but using only<
as basis incurs some inefficient double checking.Java at least no longer supports constant time substrings (as of 1.7’ish). Even with immutable strings, you would still generally encur the cost of the additional object instantiations (allocate, return, etc…). Although, I think
starts_with
can still express intent better. Bothoperator==
functions,compare
, andboost::algorithm::starts_with
probably compile down to just amemcmp(c_str, c_str, length)
, although they are still technically linear unless you use them in aconstexpr
. GC’d language string APIs might be a bit more efficient/expressive (switch
for strings, etc…), but even those optimizations can lead to problems.BOOST has string_ref for some time now, with explicit example of starts_with():
http://www.boost.org/doc/libs/1_57_0/libs/utility/doc/html/string_ref.html
‘Course, an immutable string type (as the primary string type) doesn’t make a whole lot of sense in C++, given that we have
const
and also by-value as the default.The reason, afaik, in C# and Java for the immutability of string is not primarily performance, but program correctness, so that passing strings around does not mess up program state if two pieces of code were to reference the same string instance when it were mutated by only one of them.
An immutable string type is very much desirable in C++.
shared_ptr<const wstring>
gives you reference counting, i.e. efficient passing and reference storing, but at the cost of double indirection, double dynamic allocation for creation (which can be a real performance killer) and e.g. with linear time substrings (instead of constant time) and with quadratic time for repeated concatenations (instead of linear time). In contrast, a real immutable string type gives you mentioned advantages, as well as e.g. constant time very efficient no-dynamic-allocation creation for returning a string literal from a function, and so on. Presumably the apparent opinion of such advantages not making sense, was based on ignorance of those advantages. I.e. an opinion based on more or less total ignorance of the subject matter.Hello!
Note that the double alloc of shared ptr can be avoided via make_shared. Could you explain why is repeated concatenations quadratic ?
FYI, Qt strings are based on reference counting, and today they are thinking about changing it because they conducted a study of the usage of strings in a program and found that an enormous amount of them are (very) short ones. Therefore, small string optimization (SSO) is a much more efficient approach for a general purpose string (with VC++, there is no heap allocation if your string is less than 16 chars).
Recently, I compared VC++ strings against C# ones. I was surprised to discover they were very similar results (I expected C# strings to be much slower than SSO). So indeed that’s impressive, but at the same time, I’m not sure an immutable string will be that much better than the current SSO ones. But I’m open to learn more about that !
Oops, Qt strings are based on COW not reference counting, but that’s somehow the same (it’s even better) for what I meant in my comment.
@Aurélien:
I found no “reply” button on your reply, perhaps it’s too deeply nested. So, I’m replying to myself (technically).
Regarding the suggestion of using
make_shared
, thanks, but I was assuming use ofmake_shared
. Considerauto s = make_shared<wstring const>( L"Blah!" );
. Assuming that the small buffer optimization isn’t used there is (at least, and in practice) one dynamic allocation internally in thewstring
instance, and one dynamic allocation for themake_shared
, two dynamic allocations in total. Withauto s2 = shared_ptr<wstring const>( new wstring( L"Blah!" ) );
there is one dynamic allocation internally in thewstring
, for its buffer; one dynamic allocation of thewstring
itself (thenew
expression); and one dynamic allocation for theshared_ptr
control block, for a total of three dynamic allocations.Now, regarding the question about quadratic time for repeated concatenation, that’s because with a
const wstring
in there, this faux immutable string scheme, it’s really immutable. And so each concatenation produces a new string with a new buffer, which means if you start with an empty string and repeatedly add one character, n times, then the total of the string sizes, and hence the number of characters copied in total, is 0 + 1 + … + n – 1 = n*(n-1)/2. In contrast, with direct use ofwstring
and+=
the buffer reallocations and hence number of characters copied is guaranteed amortized linear in n (usually this is achieved by doubling the buffer capacity each time the buffer needs to extended).With a designed-as-immutable reference counted string one has the linear complexity, because after the first concatenation there is only a single owner and then the buffer contents can be modified under the hood.
For example, this is done for CPython strings. Internally that code uses C library’s
realloc
for efficiency. And a modern immutable C++ string type would presumably do the same, rather thanstd::allocator
. 🙂@Alf thanks for the details.
What distinction do you make between a “designed-as-immutable reference counted string” and a Copy On Write design ? What you describe looks quite similar to QString, and as I said, for a general purpose use it’s not that obvious which is best (in particular in a multithreaded context).
The string_view approach (already mentioned here) looks quite interesting. It also seems to make it possible to safely expose it in an exported interface (dll).
@Aurélien:
Regarding “What distinction do you make between a “designed-as-immutable reference counted string” and a Copy On Write design ?”,
a copy-on-write implementation, COW, e.g. as the 32-bit g++ implementation of
std::basic_string
, is for a string whose value can be modified, that supports e.g. changing one character in the string. The mutability complicates things even when the string type is designed for this. In passing, I think it’s worth noting thatstd::basic_string
is not designed for COW, but there was some wording that provided limited COW support in C++03, which was removed in C++11.I can’t say that it’s impossible to create a COW mutable string type with the same efficiency advantages as an immutable reference-counted string type (though lacking the advantage of knowing that that value ain’t gonna change). But I can say, from experience implementing strings (including a number of mutable string types, and including a demonstration of reference-counted immutable string type at SourceForge in 2007, and including a for-demonstration partial COW implementation of C++11
std::basic_string
, just for a discussion on SO) that if it’s possible then it’s probably rather difficult, i.e. my expectation is much more work creating it, with no initial guarantee of success. If someone managed to create such a string typeS
, thenconst S
would not be the same as an immutable reference counted string type, becauseconst S
would not be assignable. So, it would be a slightly different kind of beast. As a user I would prefer the immutable type. And also as creator. 🙂You can use boost::starts_with() as a shortcut to boost::algorithm::starts_with().
You could also have a look at the upcoming basic_string_view to avoid unnecessary copying.
This is what the proposed string_view is for http://en.cppreference.com/w/cpp/experimental/basic_string_view
I take exception to this statement: “The programmer’s primary goal is not to make the program elegant.” While that may be true, the programmer’s primary goal is not to use premature optimization either. However, your solution is both efficient and elegant, so I can’t argue with the results.
why not just
if (code[0]==’X’ && code[1]==’X’)
I think this solution is inflexible, what if you want to compare “XXYYGGCC”?
if (code[0]==’X’ && code[1]==’X’ && code[2]==’Y’ && code[3]==’Y’ &&
code[4]==’G’ && code[5]==’G’ && code[6]==’C’ && code[7]==’C’)
A dedicated string interface solution should take composed and pecomposed unicode characters into account. For instance, if we have a string instance containing L”é” (U+00E9, “LATIN SMALL LETTER E WITH ACUTE”), doing a string::starts_with(L”\x0065\x0301″) (U+0065, “LATIN SMALL LETTER E”, followed by U+0301, “COMBINING ACUTE ACCENT”) on it should return true. A generic boost or std algrorithm solution would need string specializations to do the same out of the box.
Yep yep. What about std equal(begin(code), end(code), “xx”) though?
I’m sorry.. actually I mean
equal(begin(code), begin(code) + 2, “xx”);
The best solution is to have free functions which use string_view whenever possible. We can forget about the std::string member algorithms and instead use a free function string library.
For example:
string_view substr(string_view, size_t pos, size_t len);
Now you can do the following with no performance penalty.
bool beginsWithXX(const std::string& s) {
return substr(s, 0, 2) == “XX”
}
Pingback: c# - Come faccio a controllare se un C++ std::string inizia con una determinata stringa, e convertire una stringa in un int?
Pingback: c++ - ¿Cómo puedo comprobar si un C++ std::string comienza con una determinada cadena, y convertir una cadena en un entero?