Nigel Tao

C++ Coroutines Part 1: `co_yield`, `co_return` and a Prime Sieve

This blog post is one of a two part series.

Update on 2023-03-04: This blog post series is discussed on Hacker News.

Introduction

C++ is late to the coroutine party, compared to other programming languages, but they are part of C++20. Prior to coroutines, a C++ programmer had two choices, roughly speaking:

Synchronous (straight line) code is easier to understand but less efficient.
Asynchronous code (e.g. callbacks) is more efficient (letting you do other work while waiting for things) but is also more complicated (manually saving and restoring state).

Update on 2023-03-04: “is… less efficient” and “is more efficient” should be “can be less / more efficient”.

Coroutines, “functions whose execution you can pause”, aim to get the best of both worlds: programs that look like sync code but performs like async code.

Generally speaking, C++ language design tends to favor efficiency, customizability and the zero-overhead principle instead of things like ease of use, safety or “batteries included”.

Update on 2023-03-04: “zero-overhead” isn’t an absolute requirement. As the WG21/P1063 Core Coroutines counter-proposal from January 2019 says, “programmers cannot reliably prevent [C++20] coroutine-based code from allocating memory”. Nonetheless, while not unanimously endorsed, C++20 coroutines are the design that shipped.

These are neither “good” or “bad” design principles, but as C++ isn’t garbage collected and it doesn’t come with a runtime system, it does mean that C++ coroutines have a steep learning curve. Lewis Baker has written some good blog posts, as has Raymond Chen, but Chen’s series is a 61 part epic and yet his table-of-contents page finishes with “I’m not done with coroutines”! Unsurprisingly (given its domain name), cppreference.com has good reference material but that doesn’t make a good tutorial.

My two blog posts don’t aim to be comprehensive but instead to give a quick tour of the three fundamental mechanisms (the new-in-C++20 coroutine-related operators): co_yield, co_return and co_await. Both blog posts walk through a complete, simple program, somewhat like literate programming.

Prime Sieve

The Sieve of Eratosthenes is one of the earliest recorded algorithms, over two thousand years old, generating the series of prime numbers: 2, 3, 5, 7, 11, etc.

Last millenium, Doug McIlroy and Ken Thompson invented Unix pipes as a way of connecting concurrent processes. McIlroy (Update on 2023-03-04: inspired by David Gries) wrote a page-long C version of the Sieve that uses Unix processes and pipes. Per that link, the essence of that program also shows up in Tony Hoare’s influential Communicating Sequential Processes (CSP) paper. More recently, there’s a 36-line Go version of the Sieve in the Go playground.

Update on 2023-03-04: McIlroy’s prime sieve (re-implemented in this blog post) is not exactly the same as Eratosthenes’ prime sieve. Eratosthenes’ algorithm is substantially more efficient.

That design can be ported to C++ coroutines. The “processes” in CSP are not the same as Unix processes. Our program (unlike McIlroy’s) is single-threaded and single-process (in the Unix process sense). Here’s the “business logic”:

Generator source(int end) {
  for (int x = 2; x < end; x++) {
    co_yield x;
  }
}

Generator filter(Generator g, int prime) {
  while (std::optional<int> optional_x = g.next()) {
    int x = optional_x.value();
    if ((x % prime) != 0) {
      co_yield x;
    }
  }
}

int main(int argc, char** argv) {
  Generator g = source(40);
  while (std::optional<int> optional_prime = g.next()) {
    int prime = optional_prime.value();
    std::cout << prime << std::endl;
    g = filter(std::move(g), prime);
  }
  return 0;
}

Update on 2023-03-04: Here’s how it works:

The source generator produces 2, 3, 4, 5, 6, 7, 8, 9, etc.
The first (filtered) element, 2, kicks off a filter(g, 2) generator that knocks out every multiple of 2, leaving 3, 5, 7, 9, 11, 13, 15, 17, etc.
The next (filtered) element, 3, kicks off a filter(g, 3) generator that knocks out every multiple of 3, leaving 5, 7, 11, 13, 17, 19, 23, 25, etc.
The next (filtered) element, 5, kicks off a filter(g, 5) generator that knocks out every multiple of 5, leaving 7, 11, 13, 17, 19, 23, 29, 31, etc.
And so on.

Each filter generator connects to one upstream producer and one downstream consumer, in a pipeline or chain. The number of filters grows over time. After the first three dot points above, the chain looks like this: main - filter(g, 3) - filter(g, 2) - source.

To be clear, coroutines aren’t necessarily the best (simplest, fastest, etc.) way to implement a prime sieve in C++. It’s that a prime sieve is a nice way to demonstrate C++ coroutines.

For those familiar with C++ range-based for loops, that can simplify the call site for simple loops, but our Generator doesn’t bother implementing it. One subtlety here is that we pass Generator values around (see the std::move(g)). We don’t just iterate over them.

Output

Build and run the complete C++ file like below or on godbolt.org:

$ g++ --version | head -n 1
g++ (Debian 10.2.1-6) 10.2.1 20210110

$ g++ -g -std=c++20 -fcoroutines -fno-exceptions cpp-coro-part-1-yield-return-prime-sieve.cc -o coro1 && ./coro1
2
3
5
7
11
13
17
19
23
29
31
37

The -fno-exceptions flag just simplifies away some C++ ceremony that’s important if your program uses exceptions but uninteresting noise otherwise.

`co_yield`

Here’s our source function again.

Generator source(int end) {
  for (int x = 2; x < end; x++) {
    co_yield x;
  }
}

This is a coroutine (instead of a regular function) because there’s at least one explicit co_yield or co_return in its body. An explicit co_await would also suffice, but we won’t discuss co_await until part 2.

While a regular function can only return (something of type RType, say), and only return at most once, a coroutine can do that too but also co_yield zero or more things (of type CYType) before co_returning something (of type CRType) at most once. Just as a regular function could loop forever without returning, a coroutine could loop forever, perhaps co_yielding things or perhaps not, without co_returning.

Here, source co_yields (generates) the sequence of integers 2, 3, 4, 5, etc. up to (but excluding) end. Because source is a coroutine, there’s an implicit co_return; statement at the end of its body. Its RType, CYType and CRType are Generator, int and void.

`return` and `co_return`

source returns a Generator (even though the function body never mentions return or Generator). The main function saves the result of calling source just as if it was calling a regular function. From the caller’s (not callee’s) point of view, and from a “function signature in a .h file” point of view, it is indeed just a regular function. Unlike other programming languages, C++ coroutines don’t need an async keyword.

Generator source(int end) { etc; }

Generator g = source(40);

Calling source(40) physically returns (physically means in the calling convention sense and in the x86 CALL and RET instructions sense) before it conceptually, logically finishes by reaching a co_return (the implicit one at the final ‘}’ curly bracket). main can continue running concurrently while source is also running. For a multi-threaded program, the two could run in parallel (and we’d have to use mutexes, atomics or similar) but our example program is single-threaded and concurrency is not parallelism.

Logically, source is running its for (int x = 2; x < end; x++) loop off on its own, occasionally co_yielding a thing. Physically, source is called once, suspending, returning, and then repeatedly resuming and co_yielding/suspending until the final resume finishes with a co_return/suspend.

As we’ll see further below, in our program, resuming is explicitly triggered inside the Generator::next method (and resume is just a method call). Our “pull-style” generator coroutines are scheduled “on demand”, which works well here as we’re never waiting on I/O.

Promise Type

With a regular function call, the caller and callee collaborate (according to the calling convention) to reserve some memory for a stack frame, holding e.g. the function arguments, local variables, return address and return value. After the callee returns, the stack frame is no longer needed.

With a coroutine call, such state (function arguments, local variables, etc.) is needed even after physically returning. That’s therefore held in a coroutine frame, usually heap-allocated. The coroutine frame also holds some notion of “where to resume, inside the coroutine body” plus a customized helper object to drive the coroutine. In C++, the pointer to that coroutine frame is represented as a std::coroutine_handle<CustomizedHelper>.

I’m not sure why, but that customized helper object is called a “promise” or “promise object” (but its type is not a std::promise) and the CustomizedHelper type is usually RType::promise_type, where RType is the coroutine’s return type.

Some documentation talks about “coroutine state” instead of “coroutine frame”, as in: the promise object lives alongside (instead of within) the “coroutine frame” (which holds arguments and locals), both of which are within the “coroutine state”. But I prefer “coroutine frame” to mean the whole thing. See also frame_ptr, further below, being a pointer to the (coroutine) frame.

`Generator::promise_type`

In our program, the compiler knows that source and filter are coroutines (because they have co_yield expressions). They are also declared to return a Generator, so the compiler looks for a Generator::promise_type and expects it to have certain methods.

For example, our coroutine body says co_yield x and the CYType (the type of x) is an int, so our promise type needs to have a yield_value method that takes an int. It also has an (implicit) co_return; statement (but not a co_return foo; statement) so it also needs a return_void method that takes no arguments. It also needs get_return_object, initial_suspend and final_suspend. Here’s the complete Generator::promise_type definition:

class Generator {
 public:
  class promise_type {
   public:
    Generator get_return_object() {
      return Generator(
          std::coroutine_handle<promise_type>::from_promise(*this));
    }
    std::suspend_always initial_suspend() { return {}; }
    std::suspend_always final_suspend() { return {}; }
    std::suspend_always yield_value(int value) {
      m_value = value;
      return {};
    }
    void return_void() {}

    int m_value;
  };

  // Not shown yet: Generator code, not Generator::promise_type code.
};

get_return_object produces the Generator object that source or filter returns, in the Generator g = source(40) sense. We’ll discuss std::coroutine_handle further below, but it’s essentially a glorified pointer to the coroutine frame. We’ll pass it to the Generator constructor so that Generator::next can resume the coroutine when necessary.

initial_suspend returns an awaitable (covered in part 2) that controls whether the coroutine is eager (also known as “hot start”) or lazy (“cold start”). Does the coroutine start running straight away or does it need a separate kick first? Our program returns a std::suspend_always which means lazy, as that will work better with “Generator::next always calls resume to pull the next value”, as we’ll see further below.

final_suspend likewise controls whether to suspend after the (possibly implicit) co_return. If it doesn’t suspend, the coroutine frame will be automatically destroyed, which is great from a “don’t forget to clean up” point of view, but destroying the coroutine frame also destroys the promise object.

In our program, Generator::next needs to inspect (call methods on) the promise object after the co_return (and calling a promise object’s methods is only valid if the coroutine is suspended), so we do suspend (by final_suspend returning a std::suspend_always). Our Generator will be responsible for explicitly destroying the coroutine frame (spoiler alert: it’ll do it in its destructor, via the std::coroutine_handle passed to its constructor).

The yield_value and return_void methods have already been mentioned, but note that yield_value saves its argument to a member variable (that Generator::next will load). This is how the generator coroutine passes what it yields (produces) back to the consumer. Our implementation only buffers one value at a time but other promise_type implementations could do something different. At the very least, it would have to do something thread-safe if the program was multi-threaded.

`Generator::next`

Here’s the Generator::next method (and the Generator constructor). It resumes the wrapped coroutine, running it up until its next suspension (at an explicit co_yield or at the final_suspend after the implicit co_return; the latter means the coroutine is done).

class Generator {
  // Etc.

 public:
  std::optional<int> next() {
    if (!m_cohandle || m_cohandle.done()) {
      return std::nullopt;
    }
    m_cohandle.resume();
    if (m_cohandle.done()) {
      return std::nullopt;
    }
    return m_cohandle.promise().m_value;
  }

 private:
  // Regular constructor.
  explicit Generator(const std::coroutine_handle<promise_type> cohandle)
      : m_cohandle{cohandle} {}

  std::coroutine_handle<promise_type> m_cohandle;

  // Etc.
};

Resource Acquisition Is Initialization

To clean up properly, we should destroy the std::coroutine_handle exactly once. We’ll do that in the Generator destructor (and the m_cohandle field is private). When we pass a Generator from main to filter, we have to std::move it, just as if it was a std::unique_ptr.

g = filter(std::move(g), prime);

Here’s the bureaucratic incantations to make Generator a move-only type.

class Generator {
  // Etc.

 public:
  // This class is move-only. See
  // https://google.github.io/styleguide/cppguide.html#Copyable_Movable_Types
  Generator(Generator&& other) : m_cohandle{other.release_handle()} {}
  Generator& operator=(Generator&& other) {
    if (this != &other) {
      if (m_cohandle) {
        m_cohandle.destroy();
      }
      m_cohandle = other.release_handle();
    }
    return *this;
  }

  // Regular destructor.
  ~Generator() {
    if (m_cohandle) {
      m_cohandle.destroy();
    }
  }

 private:
  std::coroutine_handle<promise_type> release_handle() {
    return std::exchange(m_cohandle, nullptr);
  }
};

That’s it! You can look back over the complete C++ file at your leisure.

Debugging

It may get better in the coming months and years, but debugging coroutines can be a little rough today, at least on Debian stable (Bullseye). Breakpoints work, but local variables are buggy.

For example, we can set a breakpoint on the co_yield x in the source coroutine function, but the x value doesn’t seem to change (printing x always says 2). Making the breakpoint conditional on x == 5 means that, in practice, the breakpoint no longer triggers. Curiously, info breakpoints also places the breakpoint in the _Z6sourcei.actor(_Z6sourcei.frame *) function, presumably a compiler-transformed version of the plain source(int) function.

$ gdb ./coro1
[Etc.]

(gdb) break cpp-coro-part-1-yield-return-prime-sieve.cc:96
Breakpoint 1 at 0x1342: file cpp-coro-part-1-yield-return-prime-sieve.cc, line 96.

(gdb) info breakpoints
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x0000000000001342 in _Z6sourcei.actor(_Z6sourcei.frame *) at cpp-coro-part-1-yield-return-prime-sieve.cc:96

(gdb) run
Starting program: /etc/etc/etc/coro1

Breakpoint 1, source (frame_ptr=0x55555556aeb0) at cpp-coro-part-1-yield-return-prime-sieve.cc:96
96          co_yield x;

(gdb) print x
$1 = 2

(gdb) continue
Continuing.
2

Breakpoint 1, source (frame_ptr=0x55555556aeb0) at cpp-coro-part-1-yield-return-prime-sieve.cc:96
96          co_yield x;

(gdb) print x
$2 = 2

(gdb) condition 1 x == 5

(gdb) continue
Continuing.
3
5
7
11
13
17
19
23
29
31
37
[Inferior 1 (process 12345) exited normally]

Manual Breakpoint

We can insert a manual breakpoint (even a conditional one) in the source code, instead of via gdb.

Generator source(int end) {
  for (int x = 2; x < end; x++) {
#if defined(__GNUC__) && defined(__x86_64__)
    if (x == 5) {
      __asm__ __volatile__("int $03");
    }
#endif
    co_yield x;
  }
}

In the x == 5 loop iteration (but before the co_yield), our processes (in the CSP sense) should be chained like this: main - filter(g, 3) - filter(g, 2) - source. Recompiling and running in a debugger confirms this: from the bottom up, the stack trace shows main, filter twice and then source.

$ g++ -g -std=c++20 -fcoroutines -fno-exceptions cpp-coro-part-1-yield-return-prime-sieve.cc -o coro1

$ gdb ./coro1
[Etc.]

(gdb) run
Starting program: /etc/etc/etc/coro1
2
3

Program received signal SIGTRAP, Trace/breakpoint trap.
source (frame_ptr=0x55555556aeb0) at cpp-coro-part-1-yield-return-prime-sieve.cc:101
101         co_yield x;

(gdb) backtrace
#0  source (frame_ptr=0x55555556aeb0) at cpp-coro-part-1-yield-return-prime-sieve.cc:101
#1  0x00005555555559bf in std::__n4861::coroutine_handle<void>::resume (this=0x55555556b320) at /usr/include/c++/10/coroutine:126
#2  0x0000555555555ae9 in Generator::next (this=0x55555556b320) at cpp-coro-part-1-yield-return-prime-sieve.cc:51
#3  0x0000555555555659 in filter (frame_ptr=0x55555556b300) at cpp-coro-part-1-yield-return-prime-sieve.cc:106
#4  0x00005555555559bf in std::__n4861::coroutine_handle<void>::resume (this=0x55555556b370) at /usr/include/c++/10/coroutine:126
#5  0x0000555555555ae9 in Generator::next (this=0x55555556b370) at cpp-coro-part-1-yield-return-prime-sieve.cc:51
#6  0x0000555555555659 in filter (frame_ptr=0x55555556b350) at cpp-coro-part-1-yield-return-prime-sieve.cc:106
#7  0x00005555555559bf in std::__n4861::coroutine_handle<void>::resume (this=0x7fffffffe9d0) at /usr/include/c++/10/coroutine:126
#8  0x0000555555555ae9 in Generator::next (this=0x7fffffffe9d0) at cpp-coro-part-1-yield-return-prime-sieve.cc:51
#9  0x0000555555555827 in main (argc=1, argv=0x7fffffffeaf8) at cpp-coro-part-1-yield-return-prime-sieve.cc:116

(gdb) # Again, gdb mistakenly says that x is 2.
(gdb) info locals
x = 2
Yd0 = {<No data fields>}
Is = {<No data fields>}
Fs = {<No data fields>}

Recall that logically (and in the source code), the filter function takes two arguments (a Generator and an int) but physically (in the stack trace), after the compiler transformed it, filter (or perhaps _Z6filter9Generatori.actor, which c++filt demangles as filter(Generator, int) [clone .actor]) takes only one (what g++ calls the frame_ptr). This pointer value turns out to be the same address as what the std::coroutine_handle<Generator::promise_type>::address() method would return. For g++, the frame_ptr address is also a small, constant offset from the promise’s address (what this is inside promise_type methods).

Conclusion

Coroutines are magic in some sense, in that it requires compiler support and isn’t something you could easily do in pure C++ (e.g. boost coroutines depend on boost contexts and that requires CPU-architecture-specific assembly code). But this blog post has hopefully demystified C++20 coroutines’ co_yield and co_return operators:

A function is a coroutine if its body contains at least one co_yield, co_return or co_await expression.
The compiler transforms the coroutine’s body into something that dynamically allocates a coroutine frame.
The pointer to the coroutine frame is called a std::coroutine_handle.
That coroutine frame contains the suspension/resumption point, copies of the arguments and local variables and also a customizable helper object (called the promise object) that bridges the caller and callee worlds.
co_yielding (or co_returning) in the coroutine callee saves state in the promise object (by calling yield_blah or return_blah methods). The caller (or other code) can load this state later.
co_yielding (or co_returning), which is part of the C++ language and standard library, typically also suspends the coroutine.
It is up to the program (or its non-standard libraries) to explicitly resume a suspended coroutine.

That last bullet point glosses over a lot of potential detail. Our example program is relatively simple but, in general, scheduling is a hard problem. C++20 doesn’t provide a one-size-fits-all solution. It merely provides mechanism, not policy. Update on 2023-03-04: Earlier versions of C++ coroutines integrated with std::async and std::future but that didn’t make it into C++20.

This is partly because of the customizability and “no runtime” design goals mentioned earlier but also because a high-performance coroutine scheduling implementation may be OS (Operating System) specific (and you may not even have an OS).

C++20 doesn’t give you an ergonomic, high-level coroutine API. It’s not “sprinkle some asyncs and awaits and you’re done.” It gives you a low-level coroutine API construction kit. Some further C++ (but not assembly) is required.

Baker puts it like this: “The facilities the C++ Coroutines TS [Technical Specification] provides in the language can be thought of as a low-level assembly-language [emphasis in the original] for coroutines. These facilities can be difficult to use directly in a safe way and are mainly intended to be used by library-writers to build higher-level abstractions that application developers can work with safely.”

It gives you the coroutine equivalent of a goto and it’s up to you (or the libraries you use) to build better abstractions like the equivalent of if-else, while loops and function calls. Indeed, some have argued for structured concurrency, even as far to say “Go statement considered harmful”, but that bigger discussion is out of scope of this blog post.

`co_await`

The last thing I’ll say about co_yield is that co_yield expr is basically syntactic sugar for co_await promise.yield_value(expr). Or, it would be, if you could otherwise access the coroutine’s implicit promise object. What’s co_await and how does it work? Find out more in part 2: co_await and Fizz Buzz.

Published: 2023-02-20