stackless coroutine backend

[PATCH experiment 00/35] stackless coroutine backend

Posted by Paolo Bonzini 2 years, 2 months ago

Here is an experiment with using stackless coroutines in QEMU.  It
only compiles enough code to run tests/unit/test-coroutine, but at
least it proves that it's possible to quickly test ideas in the
area of coroutine runtimes.  Another idea that could be toyed with
in a similar manner could be (whoa) C++ coroutines.

As expected, this also found some issues in existing code, so I
plan to submit patches 1-5 separately.

The new backend (which is the only one that works, due to the required
code changes) is in patch 7.  For the big description of what stackless
coroutines are, please refer to that patch.

Patches 8-11 do some initial conversions.  Patch 12 introduce some
preprocessor magic that greatly eases the rest of the work, and then
the tests are converted one at a time, until patch 27 where the only
ones missing are the CoRwlock tests.

Therefore, patches 28-33 convert CoRwlock and pathces 34-35 take care
of the corresponding tests, thus concluding the experiment.

Paolo

Paolo Bonzini (35):
  coroutine: add missing coroutine_fn annotations for CoRwlock functions
  coroutine: qemu_coroutine_get_aio_context is not a coroutine_fn
  coroutine: introduce QemuCoLockable
  coroutine: introduce coroutine_only_fn
  coroutine: small code cleanup in qemu_co_rwlock_wrlock
  disable some code
  coroutine: introduce the "stackless coroutine" backend
  /basic/lifecycle
  convert qemu-coroutine-sleep.c to stackless coroutines
  enable tail call optimization of qemu_co_mutex_lock
  convert CoMutex to stackless coroutines
  define magic macros for stackless coroutines
  /basic/yield
  /basic/nesting
  /basic/self
  /basic/entered
  /basic/in_coroutine
  /basic/order
  /perf/lifecycle
  /perf/nesting
  /perf/yield
  /perf/function-call
  /perf/cost
  /basic/no-dangling-access
  /locking/co-mutex
  convert qemu_co_mutex_lock_slowpath to magic macros
  /locking/co-mutex/lockable
  qemu_co_rwlock_maybe_wake_one
  qemu_co_rwlock_rdlock
  qemu_co_rwlock_unlock
  qemu_co_rwlock_downgrade
  qemu_co_rwlock_wrlock
  qemu_co_rwlock_upgrade
  /locking/co-rwlock/upgrade
  /locking/co-rwlock/downgrade

 configure                    |  44 +---
 include/qemu/co-lockable.h   | 110 +++++++++
 include/qemu/coroutine.h     |  99 ++++++--
 include/qemu/coroutine_int.h |   6 -
 include/qemu/lockable.h      |  13 +-
 include/qemu/typedefs.h      |   1 +
 tests/unit/meson.build       |   2 +-
 tests/unit/test-coroutine.c  | 425 +++++++++++++++++++++++++++++------
 util/coroutine-stackless.c   | 159 +++++++++++++
 util/meson.build             |  10 +-
 util/qemu-coroutine-lock.c   | 215 ++++++++++++++----
 util/qemu-coroutine-sleep.c  |  57 ++++-
 util/qemu-coroutine.c        |  18 +-
 13 files changed, 932 insertions(+), 227 deletions(-)
 create mode 100644 include/qemu/co-lockable.h
 create mode 100644 util/coroutine-stackless.c

-- 
2.35.1

Re: [PATCH experiment 00/35] stackless coroutine backend

Posted by Stefan Hajnoczi 2 years, 2 months ago

On Thu, Mar 10, 2022 at 01:43:38PM +0100, Paolo Bonzini wrote:
> Here is an experiment with using stackless coroutines in QEMU.  It
> only compiles enough code to run tests/unit/test-coroutine, but at
> least it proves that it's possible to quickly test ideas in the
> area of coroutine runtimes.  Another idea that could be toyed with
> in a similar manner could be (whoa) C++ coroutines.
> 
> As expected, this also found some issues in existing code, so I
> plan to submit patches 1-5 separately.
> 
> The new backend (which is the only one that works, due to the required
> code changes) is in patch 7.  For the big description of what stackless
> coroutines are, please refer to that patch.
> 
> Patches 8-11 do some initial conversions.  Patch 12 introduce some
> preprocessor magic that greatly eases the rest of the work, and then
> the tests are converted one at a time, until patch 27 where the only
> ones missing are the CoRwlock tests.
> 
> Therefore, patches 28-33 convert CoRwlock and pathces 34-35 take care
> of the corresponding tests, thus concluding the experiment.

Nice, the transformation is clear. It's simpler than Continuation
Passing Style transform because the loops and if statements remain
unmodified. This is a big advantage with the Duff's device-style
approach.

There are a lot of details to decide on in the translator tool and
runtime to optimize the code. I think the way the stack frames are
organized in this patch series is probably for convenience rather than
performance.

Out of curiousity, did you run the perf tests and compare against
ucontext?

Stefan

Re: [PATCH experiment 00/35] stackless coroutine backend

Posted by Paolo Bonzini 2 years, 2 months ago

On 3/10/22 18:42, Stefan Hajnoczi wrote:
> There are a lot of details to decide on in the translator tool and
> runtime to optimize the code. I think the way the stack frames are
> organized in this patch series is probably for convenience rather than
> performance.

Yes, sometimes the optimizations are there but mostly because they made 
my job easier.

> Out of curiousity, did you run the perf tests and compare against
> ucontext?

Not quite voluntarily, but I noticed I had to add one 0 to make them run 
for a decent amount of time.  So yeah, it's much faster than siglongjmp.

Paolo

Re: [PATCH experiment 00/35] stackless coroutine backend

Posted by Stefan Hajnoczi 2 years, 2 months ago

On Thu, Mar 10, 2022 at 09:14:07PM +0100, Paolo Bonzini wrote:
> On 3/10/22 18:42, Stefan Hajnoczi wrote:
> > There are a lot of details to decide on in the translator tool and
> > runtime to optimize the code. I think the way the stack frames are
> > organized in this patch series is probably for convenience rather than
> > performance.
> 
> Yes, sometimes the optimizations are there but mostly because they made my
> job easier.
> 
> > Out of curiousity, did you run the perf tests and compare against
> > ucontext?
> 
> Not quite voluntarily, but I noticed I had to add one 0 to make them run for
> a decent amount of time.  So yeah, it's much faster than siglongjmp.

That's a nice first indication that performance will be good. I guess
that deep coroutine_fn stacks could be less efficient with stackless
coroutines compared to ucontext, but the cost of switching between
coroutines (enter/yield) will be lower with stackless coroutines.

Stefan

Re: [PATCH experiment 00/35] stackless coroutine backend

Posted by Paolo Bonzini 2 years, 2 months ago

On 3/11/22 10:27, Stefan Hajnoczi wrote:
>> Not quite voluntarily, but I noticed I had to add one 0 to make them run for
>> a decent amount of time.  So yeah, it's much faster than siglongjmp.
> That's a nice first indication that performance will be good. I guess
> that deep coroutine_fn stacks could be less efficient with stackless
> coroutines compared to ucontext, but the cost of switching between
> coroutines (enter/yield) will be lower with stackless coroutines.

Note that right now I'm not placing the coroutine_fn stack on the heap, 
it's still allocated from a contiguous area in virtual address space. 
The contiguous allocation is wrapped by coroutine_stack_alloc and 
coroutine_stack_free, so it's really easy to change them to malloc and free.

I also do not have to walk up the whole call stack on coroutine_fn 
yields, because calls from one coroutine_fn to the next are tail calls; 
in exchange for that, I have more indirect calls than if the code did

     if (next_call() == COROUTINE_YIELD) {
         return COROUTINE_YIELD;
     }

For now the choice was again just the one that made the translation easiest.

Today I also managed to implement a QEMU-like API on top of C++ coroutines:

     CoroutineFn<int> return_int() {
         co_await qemu_coroutine_yield();
         co_return 30;
     }

     CoroutineFn<void> return_void() {
         co_await qemu_coroutine_yield();
     }

     CoroutineFn<void> co(void *) {
         co_await return_void();
         printf("%d\n", co_await return_int())
         co_await qemu_coroutine_yield();
     }

     int main() {
         Coroutine *f = qemu_coroutine_create(co, NULL);
         printf("--- 0\n");
         qemu_coroutine_enter(f);
         printf("--- 1\n");
         qemu_coroutine_enter(f);
         printf("--- 2\n");
         qemu_coroutine_enter(f);
         printf("--- 3\n");
         qemu_coroutine_enter(f);
         printf("--- 4\n");
     }

The runtime code is absurdly obscure; my favorite bit is

     Yield qemu_coroutine_yield()
     {
         return Yield();
     }

:) However, at 200 lines of code it's certainly smaller than a 
source-to-source translator.  It might be worth investigating a bit 
more.  Only files that define or use a coroutine_fn (which includes 
callers of qemu_coroutine_create) would have to be compiled as C++.

Paolo

Re: [PATCH experiment 00/35] stackless coroutine backend

Posted by Daniel P. Berrangé 2 years, 2 months ago

On Fri, Mar 11, 2022 at 01:04:33PM +0100, Paolo Bonzini wrote:
> On 3/11/22 10:27, Stefan Hajnoczi wrote:
> > > Not quite voluntarily, but I noticed I had to add one 0 to make them run for
> > > a decent amount of time.  So yeah, it's much faster than siglongjmp.
> > That's a nice first indication that performance will be good. I guess
> > that deep coroutine_fn stacks could be less efficient with stackless
> > coroutines compared to ucontext, but the cost of switching between
> > coroutines (enter/yield) will be lower with stackless coroutines.
> 
> Note that right now I'm not placing the coroutine_fn stack on the heap, it's
> still allocated from a contiguous area in virtual address space. The
> contiguous allocation is wrapped by coroutine_stack_alloc and
> coroutine_stack_free, so it's really easy to change them to malloc and free.
> 
> I also do not have to walk up the whole call stack on coroutine_fn yields,
> because calls from one coroutine_fn to the next are tail calls; in exchange
> for that, I have more indirect calls than if the code did
> 
>     if (next_call() == COROUTINE_YIELD) {
>         return COROUTINE_YIELD;
>     }
> 
> For now the choice was again just the one that made the translation easiest.
> 
> Today I also managed to implement a QEMU-like API on top of C++ coroutines:
> 
>     CoroutineFn<int> return_int() {
>         co_await qemu_coroutine_yield();
>         co_return 30;
>     }
> 
>     CoroutineFn<void> return_void() {
>         co_await qemu_coroutine_yield();
>     }
> 
>     CoroutineFn<void> co(void *) {
>         co_await return_void();
>         printf("%d\n", co_await return_int())
>         co_await qemu_coroutine_yield();
>     }
> 
>     int main() {
>         Coroutine *f = qemu_coroutine_create(co, NULL);
>         printf("--- 0\n");
>         qemu_coroutine_enter(f);
>         printf("--- 1\n");
>         qemu_coroutine_enter(f);
>         printf("--- 2\n");
>         qemu_coroutine_enter(f);
>         printf("--- 3\n");
>         qemu_coroutine_enter(f);
>         printf("--- 4\n");
>     }
> 
> The runtime code is absurdly obscure; my favorite bit is
> 
>     Yield qemu_coroutine_yield()
>     {
>         return Yield();
>     }
> 
> :) However, at 200 lines of code it's certainly smaller than a
> source-to-source translator.  It might be worth investigating a bit more.
> Only files that define or use a coroutine_fn (which includes callers of
> qemu_coroutine_create) would have to be compiled as C++.

Unless I'm misunderstanding what you mean, "define a coroutine_fn"
is a very large number of functions/files

  $ git grep coroutine_fn | wc -l
  806
  $ git grep -l coroutine_fn | wc -l
  132

Dominated by the block layer of course, but tentacles spreading
out into alot of other code.

Feels like identifying all callers would be tedious/unpleasant enough,
that practically speaking we would have to just compile all of QEMU
as C++.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH experiment 00/35] stackless coroutine backend

Posted by Paolo Bonzini 2 years, 2 months ago

On 3/11/22 13:17, Daniel P. Berrangé wrote:
>> Only files that define or use a coroutine_fn (which includes callers of
>> qemu_coroutine_create) would have to be compiled as C++.
> Unless I'm misunderstanding what you mean, "define a coroutine_fn"
> is a very large number of functions/files
> 
>    $ git grep coroutine_fn | wc -l
>    806
>    $ git grep -l coroutine_fn | wc -l
>    132
> 
> Dominated by the block layer of course, but tentacles spreading
> out into alot of other code.

The main other user is 9pfs, then there is:

hw/remote/message.c
io/channel.c
job.c
migration/savevm.c
monitor/hmp-cmds.c
monitor/monitor-internal.h
monitor/qmp.c
nbd/client-connection.c
nbd/client.c
nbd/server.c
net/colo-compare.c
net/filter-mirror.c
scsi/pr-manager.c
scsi/qemu-pr-helper.c
ui/console.c
util/vhost-user-server.c

> Feels like identifying all callers would be tedious/unpleasant enough,
> that practically speaking we would have to just compile all of QEMU
> as C++.

Yes, it's a large amount of code, but it's relatively self-contained. 
In io/ for example only three functions would have to become C++ 
(qio_channel_readv_full_all_eof, qio_channel_writev_full_all, 
qio_channel_yield), and it's easy to move them to a separate file 
io/channel-coroutine.cc.

Likewise for e.g. util/async.c or util/thread-pool.c (one function each).

The block layer would almost entirely move to C++, that's for sure.  The 
monitor would be a bit more in the middle, but hardware emulation can 
remain 100% C.

I haven't gotten the thing to compile or run yet, and I'm not sure how 
much time I'll have this week, but the change for test-coroutine.c to 
run should be in the ballpark of this:

  include/qemu/coroutine.h                                 |  26
  tests/unit/meson.build                                   |   6
  tests/unit/{test-coroutine.c => test-coroutine.cc}       | 106
  util/meson.build                                         |   4
  util/{qemu-coroutine-lock.c => qemu-coroutine-lock.cc}   |  65
  util/{qemu-coroutine-sleep.c => qemu-coroutine-sleep.cc} |  10

where the changes are for a good part mechanical: switching from "x 
coroutine_fn" to CoroutineFn<x> entirely so, while adding co_await in 
front of coroutine calls is half mechanical.  For non-void functions, 
the compiler can identify all callers (because the old type "int" is not 
compatible with the new type CoroutineFn<int>).  For void function one 
could use warn_unused_result.

The question is what is easier to maintain, stack switching code that is 
becoming less and less portable (status quo with SafeStack, CET and the 
TLS issues that Stefan has worked on), a mixed C/C++ codebase (C++ 
coroutines), a custom source-to-source translator (this series).  The 
third might be more fun, but it would be quite a large enterprise and 
the C++ compiler writers have already done the work.

A part of the changes is common in both cases, since you cannot have 
code that can run both inside or outside a coroutine.

Paolo

Re: [PATCH experiment 00/35] stackless coroutine backend

Posted by Stefan Hajnoczi 2 years, 2 months ago

On Sun, Mar 13, 2022 at 04:18:40PM +0100, Paolo Bonzini wrote:
> On 3/11/22 13:17, Daniel P. Berrangé wrote:
> The question is what is easier to maintain, stack switching code that is
> becoming less and less portable (status quo with SafeStack, CET and the TLS
> issues that Stefan has worked on), a mixed C/C++ codebase (C++ coroutines),
> a custom source-to-source translator (this series).  The third might be more
> fun, but it would be quite a large enterprise and the C++ compiler writers
> have already done the work.

Or a C-to-C++ translator to keep the code in C but still use C++
coroutines :). (I'm joking.)

Stefan