Handling errors is canceling operations

I actually covered this topic before, in this post, but given my recent experience I feel it needs reiterating and a bit of restructuring. It boils down to the observation that any error handling I have encountered — be it error codes, errno, exceptions, error monad — is about canceling operations that depend, directly or indirectly, on the function that reported failure. This has some consequences on how we look at our program flow and what principles we should follow when responding to failures in our programs.

Let’s have a look at the online tutorial on how to write an HTTP client in C, here. In the sample program client.c we can see a pattern occurring everywhere. Conceptually, the program goes like this:

open_socket();
if (failed)
  die();

resolve_host();
if (failed)
  die();

connect();
if (failed)
  die();

send_data();
if (failed)
  die();

receive_data();
if (failed)
  die();

This code really says the following:

Unless opening a socket succeeded, do not even attempt resolving the host.
Unless host resolution succeeded, do not even attempt connecting to it.
Unless connection to host succeeds …

This reflects a dependency between operations:

resolve_host() depends on the successful execution of open_socket().
connect() depends on the successful execution of resolve_host().
send_data() depends on the successful execution of connect()…

This dependency propagates further to the higher levels of the function call chain. A toy example fits directly inside function main(), where handling errors by killing the application, and even leaking resources, is adequate; but a commercial server that is supposed to run for weeks or months cannot afford to die(). In serious programs, we will have to close all sockets we opened (and clean up all other resources) and report that the entire function failed; e.g.:

Status get_data_from_server(HostName host)
{
  open_socket();
  if (failed)
    return failure();

  resolve_host();
  if (failed)
    return failure();

  connect();
  if (failed)
    return failure();

  send_data();
  if (failed)
    return failure();

  receive_data();
  if (failed)
    return failure();

  close_socket(); // potential resource leak
  return success();
}

If any of the operations fails (ends in an error being reported) not only are the subsequent operations inside the function cancelled, but also the entire function reports failure and causes the cancellation of next functions at the higher level. Our function get_data_from_server() may be called in the following context:

Status do_the_task()
{
  get_server_name_from_user();
  if (failed)
    return failure();

  get_data_from_server(serverName);
  if (failed)
    return failure();  
  
  process_data();
  if (failed)
    return failure();

  return success();
}

And the cancellation cascade continues, because we do not want to call functions in case their dependencies failed: it would be a bug to do so.

This operation dependency is ubiquitous in practically every program, to the extent that C++ introduced syntax and features to reflect this dependency natively, without the programmer having to explicitly write the if-statements. I mean stack unwinding: owing to it, merely by putting instruction B below A we already express the relation, “unless A succeeds do not attempt to call B.” The program code is much shorter, it describes the positive flow (where no failures occur) more cleanly, and it is impossible to unintentionally let an operation be invoked if its dependency has failed.
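
As a minimal sketch of this, here is the earlier do_the_task() rewritten with exceptions. The function bodies are hypothetical stand-ins, the failure in get_data_from_server() is simulated, and the global trace exists only to show which operations actually ran.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Illustration only: records which operations actually ran.
std::vector<std::string> trace;

std::string get_server_name_from_user()
{
  trace.push_back("get_server_name_from_user");
  return "example.invalid";
}

std::string get_data_from_server(const std::string& host)
{
  trace.push_back("get_data_from_server");
  // Simulated failure: pretend the host could not be reached.
  throw std::runtime_error("cannot connect to " + host);
}

int process_data(const std::string& data)
{
  trace.push_back("process_data");
  return static_cast<int>(data.size());
}

// No if-statements: each line runs only if every line above it succeeded.
int do_the_task()
{
  std::string name = get_server_name_from_user();
  std::string data = get_data_from_server(name);
  return process_data(data); // skipped when get_data_from_server() throws
}
```

Calling do_the_task() here ends with std::runtime_error leaving the function, and process_data never appears in trace: the dependent operation was cancelled without a single explicit check.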

Canceling operations that depend on the operation that failed: this is the heart of handling errors. This is the case for return codes and manual if-statements. This is also the case for exceptions. In functional programming languages, error monads also express the concept of skipping the dependent operations upon failure (and propagating the failure). In fact, we could say that the C++ exception handling mechanism is an error monad built into the imperative language: it skips subsequent operations, and propagates error information from the throw-point to the exception handler (catch-point). If exceptions are not desired for some reason, there are alternatives that also follow the cancellation cascade concept. One is the Boost.Outcome library, where functions return a type representing either a successfully computed result or information about the failure, and the if-statements are hidden behind macros that emulate a cancellation control flow. With this library function do_the_task() would read:

result<int> do_the_task()
{
  BOOST_OUTCOME_TRY(name, get_server_name_from_user());
  BOOST_OUTCOME_TRY(data, get_data_from_server(name));
  BOOST_OUTCOME_TRY(value, process_data(data));
  return value;
}
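
To see roughly what such a return type and the hidden if-statements amount to, here is a hand-written sketch. It is not Boost.Outcome and does not reflect its API: the Result template, its and_then() member, and the simulated function bodies are all assumptions made for illustration.

```cpp
#include <string>
#include <utility>

// Hypothetical minimal result type: either a value or an error message.
template <typename T>
struct Result
{
  bool ok;
  T value;            // meaningful only when ok == true
  std::string error;  // meaningful only when ok == false

  static Result success(T v) { return Result{true, std::move(v), {}}; }
  static Result failure(std::string e) { return Result{false, T{}, std::move(e)}; }

  // Calls f on the value only if this result is a success. Otherwise the
  // failure is propagated and f is never called: it is cancelled.
  template <typename F>
  auto and_then(F f) const -> decltype(f(value))
  {
    using Next = decltype(f(value));
    if (!ok)
      return Next::failure(error);
    return f(value);
  }
};

int process_calls = 0; // illustration only: counts calls to process_data()

Result<std::string> get_server_name_from_user()
{
  return Result<std::string>::success("example.invalid");
}

Result<std::string> get_data_from_server(const std::string& host)
{
  // Simulated failure.
  return Result<std::string>::failure("cannot connect to " + host);
}

Result<int> process_data(const std::string& data)
{
  ++process_calls;
  return Result<int>::success(static_cast<int>(data.size()));
}

Result<int> do_the_task()
{
  return get_server_name_from_user()
    .and_then(get_data_from_server)
    .and_then(process_data); // never runs: its dependency failed
}
```

The failure reported by get_data_from_server() travels to the caller of do_the_task() unchanged, and process_calls stays at zero.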

But the core idea remains the same: do not execute operations if the operation they depend on failed. We could call it a cancellation cascade. There are a number of observations related to this view.

Some operations cannot be cancelled

You have probably spotted that in the second example, containing the implementation of function get_data_from_server(), I have planted a resource leak: if the function needs to exit prematurely, because any of its operations failed, the socket will not be closed. This pattern also appears frequently in code: if a resource is acquired it has to be released even if everything that follows is cancelled. In other words, releasing the resource depends on the successful acquisition of the resource, but does not depend on any other operation in between. In order to reflect this in the program we need a construct that acquires the resource and, if this acquisition succeeds, schedules the release of the resource at the point where it is no longer used; and this release must take place even in the face of the cancellation cascade. C++ offers a solution: it is called RAII. It has exactly these semantics. You need to design your resource-managing type in a certain way: the constructor acquires the resource (and signals failure if it doesn’t work), the destructor releases the resource. Now, when you create an automatic object of such a type in a scope, we get what we need:

Status get_data_from_server(HostName host)
{
  Socket socket {}; // opens the socket
  if (failed) return failure();

  resolve_host();
  if (failed) return failure();

  connect();
  if (failed) return failure();

  send_data();
  if (failed) return failure();

  receive_data();
  if (failed) return failure();

  return success();
} // closes the socket

Note one thing: we are not using exceptions in this example. The problem with unintentionally cancelling an operation that shouldn’t ever be cancelled is present in any error handling technique that is based on skipping operations. This is why C programmers are often given advice to have only one return statement in a function. But with RAII the motivation for having a single return statement is not that strong anymore. RAII is needed not only for stack unwinding: it is needed for any cancellation-based error handling.
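
To make this concrete, here is a minimal self-contained sketch that uses status codes and no exceptions. The Socket class only simulates a resource, the connect_ok parameter is a hypothetical switch for forcing a failure, and the events vector is there only to make the order of operations visible.

```cpp
#include <string>
#include <vector>

// Illustration only: records what happened, in order.
std::vector<std::string> events;

// Hypothetical resource wrapper; a real one would hold a socket descriptor.
class Socket
{
  bool open_ = false;
public:
  Socket() { open_ = true; events.push_back("open"); }
  ~Socket() { if (open_) events.push_back("close"); }
  Socket(const Socket&) = delete;
  Socket& operator=(const Socket&) = delete;
  bool is_open() const { return open_; }
};

enum class Status { success, failure };

Status get_data_from_server(bool connect_ok)
{
  Socket socket;                  // acquires the resource
  if (!socket.is_open())
    return Status::failure;

  if (!connect_ok)
    return Status::failure;       // early return: the destructor still runs

  events.push_back("transfer");
  return Status::success;
}                                 // the socket is closed here on every path
```

With connect_ok set to false, events ends up as "open", "close": the transfer was cancelled, but the release was not.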

Subsequent functions do not depend on successful resource release

Once you have a destructor, you can use it for anything, but if you adhere to the guideline “destructors only for releasing resources,” a number of other things become easy, clear, and natural. It is practically never a function’s goal to acquire or release a resource. A function’s goal is to produce some value or some side effects, and resources are only a means of achieving that goal. Consider our function get_data_from_server(): its goal is to get some data from the server. It uses a socket to obtain this data, but that is just an implementation detail. If the socket cannot be opened then the data cannot be obtained from the server: whoever needs this data now has to be cancelled. Similarly, if sending data to or receiving data from the server fails, the consumers of the data need to be cancelled. But if we have received the data from the server and are ready to return it to the consumer, and just before we do so the attempt to close the socket fails, there is no need to start the cancellation cascade: the consumers will get the data they need, and subsequent functions will be able to complete their tasks. We may have leaked the resource, but this does not prevent the subsequent functions from continuing. At some later point the leaked resource might cause some other operation X to fail, but that will be X’s problem and it will start its own cancellation cascade.

Mapped onto exception handling, this advice means that destructors (used for releasing resources) should not throw even if they fail, for some reason, to release resources. This advice might be uncomfortable: it looks like concealing the information about the failure. However, it has to be observed that exception handling is not a tool for broadcasting information about any failure in the system: it is a tool for declaring the success/failure dependency between operations and controlling how the cancellation cascade proceeds. If there is no need to cancel subsequent operations, you do not throw: use other means for broadcasting information about the failure, e.g. logging or some global state.
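
A sketch of such a destructor could look as follows. The close_handle() function, the rule that negative handles fail to close, and the failed_releases counter are all made up for the example; the counter stands in for “some global state.”

```cpp
#include <iostream>
#include <string>

// Illustration only: global state used to report release failures.
int failed_releases = 0;

// Hypothetical release function; returns false when the release fails.
// Negative handles are simulated as impossible to close.
bool close_handle(int handle) { return handle >= 0; }

class Handle
{
  int handle_;
public:
  explicit Handle(int h) : handle_(h) {}
  Handle(const Handle&) = delete;
  Handle& operator=(const Handle&) = delete;

  ~Handle() // destructors are implicitly noexcept
  {
    if (!close_handle(handle_)) {
      // Do not throw: nothing that follows depends on this release.
      ++failed_releases;
      std::clog << "warning: failed to release handle " << handle_ << '\n';
    }
  }
};

std::string get_data(int raw_handle)
{
  Handle h(raw_handle);
  return "data"; // the caller gets its data even if the release fails
}
```

A call such as get_data(-1) still returns "data"; the only traces of the failed release are the log line and the incremented counter.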

Do not stop the cancellation cascade prematurely

Most of the time, if operation B follows operation A in the source code it means that B depends on the results of A, and if A fails then B must not be executed: otherwise we would be calling B without satisfying its preconditions, and this would be a bug. This means that the cancellation cascade cannot be stopped between A and B. Stopping the cancellation cascade between A and B only makes sense when B does not depend on the success of A. And how often is this the case? The answer is: in a very small number of places in the program. For instance, consider a server that receives requests and processes them in its main loop: even if the processing of one request fails, the server can move on to processing the next request.

Another situation is when some function needs to return a set of records. It gets them from three servers: each server returns one portion of the records. Ideally all records from three servers should be returned but it is acceptable if only one server returns its records. So, if we call the operations that retrieve data from each server as A1, A2 and A3, and an operation that consumes the records as B, we can say that although there exists some dependency, B does not depend on A1 (in the sense of requiring a cancellation if A1 alone should fail), does not depend on A2 and does not depend on A3. Therefore we expect that in the program code the cancellation cascade started in A1 will be stopped (if only temporarily) before it reaches B.
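
The three-server case can be sketched with plain status codes as below. The fetch_records() function, its record values, and the choice to simulate server 2 as unavailable are assumptions for illustration only.

```cpp
#include <vector>

// Hypothetical fetch: appends the server's records to `out` and returns
// true, or returns false on failure. Server 2 is simulated as unavailable.
bool fetch_records(int server, std::vector<int>& out)
{
  if (server == 2)
    return false;
  out.push_back(server * 10);
  out.push_back(server * 10 + 1);
  return true;
}

enum class Status { success, failure };

Status collect_records(std::vector<int>& records)
{
  int succeeded = 0;
  for (int server = 1; server <= 3; ++server) {
    // A single failed fetch is absorbed here: the cascade stops.
    if (fetch_records(server, records))
      ++succeeded;
  }
  // Only when every fetch has failed is there nothing for the consumer:
  // then the failure is reported and the consumer gets cancelled.
  return succeeded > 0 ? Status::success : Status::failure;
}
```

The failure of one server is stopped inside collect_records(), but the cascade is resumed if all three fail, which is the “if only temporarily” part.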

When mapped onto the exception handling mechanism, we get the advice: do not catch exceptions unless you are sure you are in the place where subsequent operations do not depend on the success of the operations inside the try block. And I assure you, there are not many such places in the program.
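
The server main loop is one of those few places. A minimal sketch, with a hypothetical process() function that treats an empty request as a failure:

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical request handler; an empty request simulates a failure.
std::string process(const std::string& request)
{
  if (request.empty())
    throw std::invalid_argument("empty request");
  return "ok: " + request;
}

std::vector<std::string> serve(const std::vector<std::string>& requests)
{
  std::vector<std::string> responses;
  for (const std::string& request : requests) {
    try {
      // Nothing is caught inside process(): a failure there cancels
      // the rest of the processing of this one request.
      responses.push_back(process(request));
    }
    catch (const std::exception& e) {
      // The next iteration does not depend on this one, so the
      // cascade can safely stop here.
      responses.push_back(std::string("error: ") + e.what());
    }
  }
  return responses;
}
```

The try block sits at the boundary of one request, not around individual steps inside process(): that is where the dependency between operations ends.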

Basic failure safety

Handling errors through a cancellation cascade means that we are exiting the scopes one after another, abandoning operations and destroying automatic objects (calling their destructors). This sets a certain expectation. A mutating operation on some class object o can fail, and may leave the object in an undesired or otherwise unexpected state. This is not much of a problem, because the failure (in a program that correctly handles errors) will start the cancellation cascade and the subsequent operations involving object o will be cancelled; and if the rule from the previous section is observed and the cascade is not stopped prematurely, object o will go out of scope and will never be heard of again. However, before that happens, one operation still needs to be called on o: the destructor. Hence the important requirement on the design of every operation that mutates class objects: if it fails it should at minimum leave the objects in a state where they can be “safely” destroyed: without causing UB, bugs, or resource leaks. (Although as indicated earlier and as will be covered in subsequent posts, sometimes we may not be able to prevent a resource leak.) This is what we can call a basic failure-safety guarantee. In the context of C++ it is often called the “basic exception-safety guarantee”, but it should be noted that the same applies to every error handling technique. We expect that every mutating operation provides this guarantee: otherwise we cannot reason about program correctness in the face of failures. We have to assume that every mutating operation satisfies it.

In fact, although the above is the bare minimum that suffices to work with class types, the basic failure-safety guarantee requires more. It is possible that nonetheless someone will stop the cancellation cascade before object o goes out of scope. In that case it will survive the cancellation and it will be possible to still use it. In that case it should be possible to reset the object to a well understood state. The most generic and common way for resetting the object is to copy- or move-assign a new value to it. So if a type provides a copy- or move-assignment any mutating operation on o should guarantee that if it fails, then o can be assigned to without causing UB, bugs, resource leaks. If additionally the class provides a default constructor the following “reset idiom” will work:

o = {};

Other types may offer other means of resetting their state. For instance, STL containers offer member function clear().
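
Here is a small sketch of a type whose mutating operation can fail midway, together with the reset idiom. The Record type, its load() member, and the rule that negative values are rejected are hypothetical.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical type with a mutating operation that can fail midway.
struct Record
{
  std::string name;
  std::vector<int> values;

  // On failure the object is left in some valid but unspecified state:
  // `name` has already changed and `values` may be partially filled.
  void load(const std::string& new_name, const std::vector<int>& input)
  {
    name = new_name;
    values.clear();
    for (int v : input) {
      if (v < 0)
        throw std::invalid_argument("negative value");
      values.push_back(v);
    }
  }
};

// Stops the cascade, so the object survives and is reset before reuse.
bool try_load(Record& r, const std::string& name, const std::vector<int>& input)
{
  try {
    r.load(name, input);
    return true;
  }
  catch (const std::invalid_argument&) {
    r = {}; // the reset idiom: back to the default-constructed state
    return false;
  }
}
```

After a failed load() the Record holds a new name and only some of the values, which is not a state anyone asked for, but it can still be destroyed or assigned to, and that is all try_load() relies on.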

In fact, the basic failure-safety guarantee requires one more thing: that the state object o is left in is valid, although it is not specified which particular state it is: it can be any state as long as it is valid. But what does it mean to be in a valid state? The answer is: for each class type, its author decides what it means for it to be in a valid state. The bare minimum is that the object in this state can be destroyed or assigned to (provided that assignment operators are not deleted) without causing UB, bugs, or resource leaks. But a class type can, and usually does, guarantee more: for instance, it can guarantee that objects in such a state can be compared with one another or that they can be copied, or that every operation on them just works, or that they go into the default-constructed state. But in practice that last part of the guarantee does not buy you much. What is relevant in practice is that an object after a failure can be safely destroyed and reset. With this guarantee in place, the cancellation cascade can safely work without causing damage to the program.

And that’s it for today. To summarize the advice:

Use RAII wherever resource management is performed in the program.
Use destructors only for releasing resources.
Do not start a cancellation cascade if releasing resources fails.
Do not stop a cancellation cascade unless you are sure that subsequent operations do not depend on the cancelled ones.
Make sure that any mutating operation, when it fails, leaves the objects in a state where they can be safely destroyed and reset.
Apply the above advice regardless of what technique you use for handling errors.

7 Responses to Handling errors is canceling operations

Vladimir Krivopalov says:

April 26, 2019 at 1:15 am

Thanks Andrzej, great post.
I’ve got a question connected to your earlier post (“Destructors — 2 use cases”) that you refer to in the beginning: should your advice on using destructors only for releasing resources be read as that RAII should not be used for representing transactional behaviour like in an example in that post?
Do you have any thoughts about how should those cases – essentially, the logic that some other languages implement through a ‘finally’ clause of a try/catch block – be best addressed as far as C++ is concerned?

- Andrzej Krzemieński says:
  
  April 26, 2019 at 7:31 am
  
  Thank you for this question. I find it very important. I would rather write a separate post on it than give short advice here, as advice on such a popular topic without a strong rationale is of little value and could be considered arrogant.
  
  But to leave you with no answer, let me try to say something. I think the general use case of “doing something at the end of the scope” should be inspected case by case and a solution to each case should be proposed individually. For instance, commit or rollback at the end of DB transaction: maybe we want a different solution for either of them:
  
  “Commit” is more like “on_scope_success” from D language, and “rollback” is more like “on_scope_failure”. So one “finally” clause from languages like Java might not be sufficient. We could think of requiring the user to explicitly call commit() and in case she doesn’t do it, do an automatic rollback at the end of the scope.
  
  But this is just one use case. If you have any other use cases in mind, I would be interested in seeing them, as it would allow me to get a better picture of what people use it for. I personally try to avoid doing such automated logic at the end of the scope.
  
  - d.Candela (@dervish_candela) says:
    
    September 30, 2019 at 9:11 am
    
    «explicit commit, default rollback» seems obvious and popular (e.g. in c++ ORM’s )
    but RAII has peculiar 2× transactional semantics with hidden failure path: both object creation and object destruction are supposed to be atomic operations, both must be successful in language terms.
    
    That doesn’t map well to domain concept of «transaction», where we would like to have a single transaction with two explicit pathways for both success and failure (and in many cases, neither is an error).
    
    I wonder if transactions would be better off with an explicit generic idiom, concept, library or language construct, instead of «scope object lifetime». Python has context managers, functional langs have monadic dataflow…
    
    btw
    year after year I keep coming back to your blog
    just wanted to say thank you for your work
    
Craig Scott says:

April 27, 2019 at 12:58 am

RAII can indeed be conveniently exploited to robustly clean up resources regardless of early returns or exceptions being thrown. The `std::scope_exit` proposal that has been kicking around for quite some time now makes it really easy to do this and allows you to put the cleanup code right there immediately after the creation of the resource it cleans up. For a longer discussion of the implementation of such a helper and how to use it, see here:

* https://crascit.com/2015/05/27/on-leaving-scope-part-1/ (motivation, naive implementation)
* https://crascit.com/2015/06/03/on-leaving-scope-part-2/ (final implementation and related work)

It covers very similar material to the `std::scope_exit` proposal and to Andrei Alexandrescu’s work on ScopeGuard. These helpers make cancelling operations simple without sacrificing readability, robustness or maintainability.
