Common optimizations

Language designers, compiler and library vendors make a great effort to make your programs run faster and faster. This post is a tour of some common performance optimizations in C++.

Consider the following code that deals with std::string:

string makeText()
  string s{"Hello World!"}; // C++11 init syntax
  return s;

int main()
  string t = makeText();
  cout << t << endl;

std::string is a specialization of template std::basic_string which takes an allocator as one of template arguments. This allocator is used — unsurprisingly — for allocating memory. Memory allocation is a fairly slow operation that is best avoided. The question is: how many times function allocate (on the allocator) will be invoked for the purpose of memory allocation during this program’s execution? 3, 2, 1, 0?

When I started learning C++ my answer would be “3”: one to initially allocate memory for the string s, next two in order to return by value: first, copy-initialize a temporary from s; next, copy-initialize t from the temporary.

Then I learned about named return value optimization and copy elision, and realized that when you return by value, a clever compiler can avoid creating three objects and instead use one: to assign the initial value in makeText and read it in main. So, from then on my answer would be “3 to 1” depending on how clever the compiler is; most of them are clever enough already.

C++11 introduced move semantics, and from now on compilers are required not to copy upon returning by value if our type provides a move constructor. std::string does come with a move constructor, which does not allocate memory, but intercepts the memory already allocated by the moved-from object. So, from then on my answer is “exactly 1”.

But note one thing: move construction is not a no-op. For std::string it will require putting the moved-from object to a state where it knows it need not, and in fact must not, release its resources. This is a certain (albeit small) cost compared with a no-op copy elision. Luckily, the requirement to use move construction instead of copy construction upon returning by value does not prohibit or disable the ability for a compiler to still apply copy elision. Thus, in our example, most probably there will be no move constructions.

But that’s not all. std::string is not required to allocate heap memory (or any other kind of memory its allocator is aware of). It is only required to be able to store anyhow a string of characters of arbitrary length. True, in general case it implies allocating memory at some point; but our case is not a general case: it is our case. Our text "Hello World!" is fairly short (compared, say, to the contents of a 4GB xml file). It is 12 characters; 13 if we count the terminating null character. If on a 64-bit machine the size of a pointer is that of 8 characters, our text is less than the size of two pointers. The typical efficient implementation of a std::string requires three pointers. The text is small enough to fit into the stack-allocated part. This is a perfect candidate for small buffer optimization.

Small buffer optimization

Below, I enclosed a short example of what an SBO would look like. Please do not treat it as a production-strength implementation.

A possible implementation of an STL-like string holds a pointer to the begin of the sequence, the size of the string and the size of the allocated buffer (the capacity):

class string
  char*  _begin;
  size_t _size;
  size_t _capacity;
  // ...

The pointer _begin refers in general case to the heap allocated memory. The three members form what we call a handle. This is the part of the object’s state allocated on the stack. Expression sizeof(string) evaluates to 24 (or something different, depending on the platform, alignment, etc.) which indicates the size of the handle. When an object is moved its handle is copied.

Now we want to say that the part of the handle occupied by address _begin should be reused as a buffer for holding a small string. We can implement it as a union:

class string
  union Buffer
    char*    _begin;
    char[16] _local;
  Buffer _buffer;
  size_t _size;
  size_t _capacity;
  // ...

I have ignored the issues with alignment in order to make the example simple. Now the size of our class string grew a bit (on my machine, to the equivalent of 32 characters). This does not visibly affect the run-time performance of copy operations. It is not the sizeof of the type that is the crucial factor of copying performance, but the time it gets to allocate all the required resources for the second time.

But how do we know if our union is currently used as a pointer or as a small buffer? We can use member _capacity. We can pick some special value, likely 0, that can never be used as the proper size of the allocated buffer, and have it indicate that a small buffer is used. Now, accessing the value is slower, because we have to evaluate an if-statement checking where the contents are stored, but what we get instead is that in a number of cases we avoid using memory allocation altogether.

Such an optimization is implemented in my version of STD on Visual C++. It is also available in Boost.Container’s implementation of string, without any sizeof overhead.

Going back to the original question in this post, in my environment the number of allocations is 0. (It may be 1 on yours, though.) Is that not incredible?

This entry was posted in programming and tagged , , , . Bookmark the permalink.

28 Responses to Common optimizations

  1. John says:

    When I started reading I was wondering if you were going to mention move optimization or SBO. Good on ya. By the way, SBO’s enabled in most major implementations, including libstdc++ that comes by default with gcc.

    I disagree with your choice of the array size, though. 16 is an int, so sizeof(16) on most of the architectures I deal with is 4. I would’ve made it sizeof(char*).

    I would also seriously consider including _capacity inside the union. It does have a downside: you either have to say anytime size < sizeof(_buffer.local) you're using local storage (which implies you have to deallocate a large buffer you already allocated if your string shrinks), or you have to play some silly tricks, like checking that the most-significant byte of _capacity is 0xff and that _size < sizeof(char*)+sizeof(size_t)-2 (and then making sure the condition holds when appropriate).

    Anyhow, nice article.

  2. ciechowoj says:

    I wonder can we reduce the amount of allocations to zero with the use of operator””s. I am not sure, but I think the only way to invoke operator””s is by passing some string literal as argument, which anyways resides somewhere in the global memory, so the std::string could store a pointer to it (and information not to deallocate it), saving an allocation.

    • I believe, this is what string_view is trying to address.

      • Adam Romanek says:

        It is worth to mention that Boost already provides string_ref which addresses the same problem domain. The class is an implementation of “N3442: string_ref: a non-owning reference to a string”.

      • ciechowoj says:

        No, it is not what I mean. Maybe it will be clear in C++:

        class string {
        char* _begin;
        size_t _size;
        size_t _capacity;
        bool _static; // shouldn’t we release memory in destructor

        // […]

        string(const char* str) {
        _begin = new char[strlen(str) + 1];
        // […]
        // do more stuff

        friend string operator”” _s(const char * str, size_t len) {
        string str;
        str._begin = str; // can store pointer because the str resides in .text section (or where the compiler put it) anyway
        str._size = len;
        str._capacity = 0;
        str._static = true;
        return str;

        I assume that there is no way to invoke operator”” _s directly, namely something like following code is prohibited:

        char x[10];
        memset(str, ‘a’, 9);
        x[9] = ”;

        string x = operator”” _s(x); // INVALID C++ CODE

        • I think I understand what you say. I believe one can use literals directly. I am not sure if you need literals for that. Perhaps a constructor taking an array reference would be enough:

          class string
            // normal ctors
            template<int N>
            string(const char (&arr)[N]) : _begin(&arr[0]), _static(true) {}

          But I am telling you: in C++17 you will be using string_view in many places instead of string. Note that you would not be able to modify your string, so it would not be a string anyway, but a string view.

        • ciechowoj says:

          Yes, this is exactly what I’ve just wanted to achieve. But, now I see the pitfalls of my approach (I may use something like COW to regain ability to modify the string, but it makes the solution even more silly). And of course string_view solves the problem of additional allocation anyway.

          // I’ve written a reply here, because there were no reply button below (maybe to many nesting levels)…

  3. JH says:

    “It is 12 characters; 13 if we count the terminating null character.”

    I thought C++ did not use null terminated strings?

    • John says:

      The existence of std::string::c_str requires it. It’s a const function, so it should not modify the underlying content, and it requires that the result be nul-terminated… so it had to have already been nul-terminated in the first place. That is unless the standard wanted to encourage string implementations to use mutable or const_cast.

      If I’m not mistaken string::data has this termination requirement now, too.

    • Well, C++ inherits from C and thereby it does use null-terminated strings. While std::string does not determine its length by the first encountered null character (as indicated in the previous post), in order for it to be able to convert to a null-terminated string it is still required to put the trailing zero.

    • JH says:

      Huh, fascinating, didn’t know that. Learn something new everyday.

  4. Mantosh Kumar says:

    Nice article and specially the way you have explained 3 to 0 allocation…
    There is typo in the line “It is not the siezeof of the type that is the crucial factor of copying performance”. It should be sizeof.

  5. Martin Ba says:

    While you mention MSVC – what I don’t get behind is why they think SBO/SSO is such a big deal that the sizeof(std::string) case for 32bit is also bloated to 24bytes. It all nice and good to have an implementation use the space the pointers’ll take up anyway, but wasting 24bytes on a simple string object (when 12 should be enough, so, right we “waste” only 12, – and it’s not even configurable) is really paying for what you may not use.

    • I understand your sentiment. I guess std::string is such a general purpose thing that there is no way of answering how to best resolve a tradeoff between the size and run-time performance (and of which operations). Some use strings for storing large documents, some (myself included) use it mainly for storing short codes. Vendors do it based on their user’s experience, and by guessing, I think.

      • Martin says:

        Ah, BTW: I was missing a bit in this article explaining that currently gcc doesn’t do the standards-conforming SSBO, and won’t in 4.9 either…(they do COW)

    • Martin says:

      Uhm… last time I checked (which is a year or two ago at least), sizeof(std::string) in a 32-bit app with MSVC was 28 bytes, which made sense since that lets a std::pair take up exactly 32 bytes, which is probably nice perf-wise on Intel hardware… Note that if you assume each small-string needs to contain strings of 16-bit wide chars the amount of wasted space seems less bad…

  6. Nice article. What do you think about performance lost in operator[]?
    So in following loops we are forced to do an extra if for each iteration.

    for (size_t i=0; i<str.size(); ++i)

    MSVC std::string implementation did not use this kind of optimization and store char* _begin member outside of union.

    On the other hand we are able to use entire string memory as a string buffer:

    class stack_string{
      char _buffer[31];
      char _magic;
       size_t size() const {return 31 - _magic;} // can hold up to 31 characters
       size_t capacity() const {return 31;} // const
       const char* c_str() const {return _buffer;} 
    class heap_string{
     char* _begin;
     size_t _size;
     size_t _capacity;
     char _unused[8];
       size_t size() const {return _size;}
       size_t capacity() const {return _capacity;}
       const char* c_str() const {return _begin;}
    class string
        stack_string _stack;
        heap_string _heap;
       bool is_heap() const {return _stack._magic == 32;} // a magic number
       size_t size() const {return is_heap() ? _heap.size() : _stack.size();}
       size_t capacity() const {return is_heap() ? _heap.capacity() : _stack.capacity();}
       const char* c_str() const {return is_heap() ? _heap.c_str() : _stack.c_str();}

    Note, we can not move small strings and always MUST copy it.
    But to copy we can copy entire object memory (32 bytes) but in this case we can not use std::char_traits::copy() function. Is it a violation of C++ standard?

    • What do you think about performance lost in operator[]?

      — This is a trade-off one have to make (on behalf of one’s users). This shows that in certain cases a happy optimization can make some code slower. But a trade-off needs to be made. This also shows why one should prefer STL range-based algorithms like for_each to manual indexing.

      but in this case we can not use std::char_traits::copy() function. Is it a violation of C++ standard?

      — Why can’t std::char_traits::copy() be used here? To me it looks like it can.

      Thank you for the demonstration of a solid SSO implementation.

      • We definitely can use std::char_traits::copy() in a loop to copy characters of small string but default stack_string::operator= can be MUCH faster. To use std::char_traits::copy() we need loop, we must calculate string size and copy values one-by-a-time. We pay with loop and conditional branching. Instead we can copy 32 bytes of memory and don’t care about unused characters in buffer (one or two unconditional CPU commands). It is the same price as for heap_string move operator.

        • Well, I have never measured the performance of std::char_traits<char>::copy(), so my reply is only based on my beliefs. But I do believe that std::char_traits<char>::copy() uses memcpy (or some such) inside and that the latter is implemented as something much faster than per-element assignment inside the loop. We would have to measure your solution to draw conclusions.

  7. Thanks a lot for great Article🙂

  8. Ajay says:

    I am new to c++ so could you explain : “The typical efficient implementation of a std::string requires three pointers”. Why?

    • std::string similarly like std::vector needs to track two sizes: (1) the number of elements it contains (function size()) and (2) the number of elements it can hold without reallocating the memory it owns (function capacity()). So, you need three pieces of data: the address of the memory, the size and the capacity (one pointer and two sizes), or alternatively a pointer to the begin of the memory buffer, end of the memory buffer and the pointer that indicates within the buffer where the elements end and the uninitialized part of the buffer begins.

      For some more, see for instance

  9. Pingback: [讀書筆記]Universal References in C++11 part 7: Small String Optimize(SSO) 與 move - 地平線的彼端--日出與日落之國地平線的彼端--日出與日落之國

  10. Pingback: C++中的string | 藏经阁

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s