[v5] aio: add the aio_add_sqe() io_uring API

[RESEND PATCH v5 00/13] aio: add the aio_add_sqe() io_uring API

Posted by Stefan Hajnoczi 3 months, 1 week ago

v5:
- Explain how fdmon-io_uring.c differs from other fdmon implementations
  in commit message [Kevin]
- Move test-nested-aio-poll aio_get_g_source() removal into commit that touches test case [Kevin]
- Avoid g_source_add_poll() use-after-free in fdmon_poll_update() [Kevin]
- Avoid duplication in fdmon_epoll_gsource_dispatch(), use fdmon_epoll_wait() [Kevin]
- Drop unnecessary revents checks in fdmon_poll_gsource_dispatch() [Kevin]
- Mention in commit message that fdmon-io_uring.c is the new default [Kevin]
- Add comments explaining how to clean up resources in error paths [Kevin]
- Indicate error in return value from function with Error *errp arg [Kevin]
- Add patch to unindent fdmon_io_uring_destroy() [Kevin]
- Add patch to introduce FDMonOps->dispatch() callback [Kevin]
- Drop patch with hacky BH optimization for fdmon-io_uring.c [Kevin]
- Replace cqe_handler_bh with FDMonOps->dispatch() [Kevin]
- Rename AioHandler->cqe_handler field to ->internal_cqe_handler [Kevin]
- Consolidate fdmon-io_uring.c trace-events changes into this commit
- Reduce #ifdef HAVE_IO_URING_PREP_WRITEV2 code duplication [Kevin]

v4:
- Rebased and tested after the QEMU 10.1.0 release

v3:
- Add assertions documenting that ADD and REMOVE flags cannot be present
  together with DELETE_AIO_HANDLER [Kevin]

v2:
- Performance improvements
- Fix pre_sqe -> prep_sqe typo [Eric]
- Add #endif terminator comment [Eric]
- Fix spacing in aio_ctx_finalize() argument list [Eric]
- Add new "block/io_uring: use non-vectored read/write when possible" patch [Eric]
- Drop Patch 1 because multi-shot POLL_ADD has edge-triggered semantics instead
  of level-triggered semantics required by QEMU's AioContext APIs. The
  qemu-iotests 308 test case was hanging because block/export/fuse.c relies on
  level-triggered semantics. Luckily the performance reason for switching from
  one-shot to multi-shot has been solved by Patch 2 ("aio-posix: keep polling
  enabled with fdmon-io_uring.c"), so it's okay to use single-shot.
- Add a new Patch 1. It's a bug fix for a use-after-free in fdmon-io_uring.c
  triggered by qemu-iotests iothreads-nbd-export.

This patch series contains io_uring improvements:

1. Support the glib event loop in fdmon-io_uring.
   - aio-posix: fix race between io_uring CQE and AioHandler deletion
   - aio-posix: keep polling enabled with fdmon-io_uring.c
   - tests/unit: skip test-nested-aio-poll with io_uring
   - aio-posix: integrate fdmon into glib event loop

2. Enable fdmon-io_uring on hosts where io_uring is available at runtime.
   Otherwise continue using ppoll(2) or epoll(7).
   - aio: remove aio_context_use_g_source()

3. Add the new aio_add_sqe() API for submitting io_uring requests in the QEMU
   event loop.
   - aio: free AioContext when aio_context_new() fails
   - aio: add errp argument to aio_context_setup()
   - aio-posix: gracefully handle io_uring_queue_init() failure
   - aio-posix: unindent fdmon_io_uring_destroy()
   - aio-posix: add fdmon_ops->dispatch()
   - aio-posix: add aio_add_sqe() API for user-defined io_uring requests

4. Use aio_add_sqe() in block/io_uring.c instead of creating a dedicated
   io_uring context for --blockdev aio=io_uring. This simplifies the code,
   reduces the number of file descriptors, and demonstrates the aio_add_sqe()
   API.
   - block/io_uring: use aio_add_sqe()
   - block/io_uring: use non-vectored read/write when possible

The highlight is aio_add_sqe(), which is needed for the FUSE-over-io_uring
Google Summer of Code project and other future QEMU features that natively use
Linux io_uring functionality.
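
For readers unfamiliar with the pattern, here is a toy model of the idea behind
aio_add_sqe(): the caller passes a callback that fills in the SQE plus a
handler that is invoked when the matching CQE arrives. This is only a sketch.
Every toy_* and Toy* name is hypothetical and none of these are the QEMU or
liburing definitions; only the prep_sqe and cqe_handler naming follows the
changelog above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for struct io_uring_sqe / struct io_uring_cqe (not liburing). */
struct toy_sqe { int opcode; void *user_data; };
struct toy_cqe { int res; };

/* Hypothetical completion handler: cb runs once cqe has been filled in. */
typedef struct ToyCqeHandler ToyCqeHandler;
struct ToyCqeHandler {
    void (*cb)(ToyCqeHandler *handler);
    struct toy_cqe cqe;
};

/* Toy event loop that holds at most one in-flight request. */
struct toy_loop { struct toy_sqe sqe; int in_flight; };

/* Let the caller fill in the SQE, then remember whom to notify. */
static void toy_add_sqe(struct toy_loop *loop,
                        void (*prep_sqe)(struct toy_sqe *sqe, void *opaque),
                        void *opaque, ToyCqeHandler *handler)
{
    assert(!loop->in_flight);
    prep_sqe(&loop->sqe, opaque);
    loop->sqe.user_data = handler;
    loop->in_flight = 1;
}

/* Simulate the kernel completing the request with result res. */
static void toy_complete(struct toy_loop *loop, int res)
{
    ToyCqeHandler *handler = loop->sqe.user_data;

    assert(loop->in_flight);
    loop->in_flight = 0;
    handler->cqe.res = res;
    handler->cb(handler);
}

/* Example user: a request embeds its handler as the first member. */
struct toy_request { ToyCqeHandler handler; int result; };

static void toy_prep_nop(struct toy_sqe *sqe, void *opaque)
{
    (void)opaque;
    sqe->opcode = 0; /* pretend this is a no-op request */
}

static void toy_request_done(ToyCqeHandler *handler)
{
    /* Valid because handler is the first member of struct toy_request. */
    struct toy_request *req = (struct toy_request *)handler;
    req->result = handler->cqe.res;
}
```

A caller would invoke toy_add_sqe(&loop, toy_prep_nop, NULL, &req.handler) and
later find the completion result in req.result once toy_complete() has run.
Embedding the handler in the request avoids a separate allocation per request.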

rw        bs iodepth aio    iothread before after  diff
randread  4k       1 native        0  78353  84860 +8.3%
randread  4k      64 native        0 262370 269823 +2.8%
randwrite 4k       1 native        0 142703 144348 +1.2%
randwrite 4k      64 native        0 259947 263895 +1.5%
randread  4k       1 io_uring      0  76883  78270 +1.8%
randread  4k      64 io_uring      0 269712 250513 -7.1%
randwrite 4k       1 io_uring      0 143657 131481 -8.5%
randwrite 4k      64 io_uring      0 274461 264785 -3.5%
randread  4k       1 native        1  84080  84097 0.0%
randread  4k      64 native        1 314650 311193 -1.1%
randwrite 4k       1 native        1 172463 159993 -7.2%
randwrite 4k      64 native        1 303091 299726 -1.1%
randread  4k       1 io_uring      1  83415  84081 +0.8%
randread  4k      64 io_uring      1 324797 318429 -2.0%
randwrite 4k       1 io_uring      1 174421 172809 -0.9%
randwrite 4k      64 io_uring      1 323394 312286 -3.4%

Performance is in the same ballpark as without fdmon-io_uring. Results vary
from run to run due to the timing/batching of requests (even with iodepth=1 due
to 8 vCPUs using a single IOThread).

Here is the performance from v1 for reference:
rw        bs iodepth aio    iothread before after  diff
randread  4k       1 native        0  76281 79707  +4.5%
randread  4k      64 native        0 255078 247293 -3.1%
randwrite 4k       1 native        0 132706 123337 -7.1%
randwrite 4k      64 native        0 275589 245192 -11%
randread  4k       1 io_uring      0  75284 78023  +3.5%
randread  4k      64 io_uring      0 254637 248222 -2.5%
randwrite 4k       1 io_uring      0 126519 128641 +1.7%
randwrite 4k      64 io_uring      0 258967 249266 -3.7%
randread  4k       1 native        1  90557 88436  -2.3%
randread  4k      64 native        1 290673 280456 -3.5%
randwrite 4k       1 native        1 183015 169106 -7.6%
randwrite 4k      64 native        1 281316 280078 -0.4%
randread  4k       1 io_uring      1  92479 86983  -5.9%
randread  4k      64 io_uring      1 304229 257730 -15.3%
randwrite 4k       1 io_uring      1 183983 157425 -14.4%
randwrite 4k      64 io_uring      1 299979 264156 -11.9%

This series replaces the following older series that were held off from merging
until the QEMU 10.1 development window opened and the performance results were
collected:
- "[PATCH 0/3] [RESEND] block: unify block and fdmon io_uring"
- "[PATCH 0/4] aio-posix: integrate fdmon into glib event loop"

Stefan Hajnoczi (13):
  aio-posix: fix race between io_uring CQE and AioHandler deletion
  aio-posix: keep polling enabled with fdmon-io_uring.c
  tests/unit: skip test-nested-aio-poll with io_uring
  aio-posix: integrate fdmon into glib event loop
  aio: remove aio_context_use_g_source()
  aio: free AioContext when aio_context_new() fails
  aio: add errp argument to aio_context_setup()
  aio-posix: gracefully handle io_uring_queue_init() failure
  aio-posix: unindent fdmon_io_uring_destroy()
  aio-posix: add fdmon_ops->dispatch()
  aio-posix: add aio_add_sqe() API for user-defined io_uring requests
  block/io_uring: use aio_add_sqe()
  block/io_uring: use non-vectored read/write when possible

 include/block/aio.h               | 156 ++++++++-
 include/block/raw-aio.h           |   5 -
 util/aio-posix.h                  |  18 +-
 block/file-posix.c                |  40 +--
 block/io_uring.c                  | 507 ++++++++----------------------
 stubs/io_uring.c                  |  32 --
 tests/unit/test-aio.c             |   7 +-
 tests/unit/test-nested-aio-poll.c |  13 +-
 util/aio-posix.c                  | 141 ++++-----
 util/aio-win32.c                  |   7 +-
 util/async.c                      |  71 ++---
 util/fdmon-epoll.c                |  34 +-
 util/fdmon-io_uring.c             | 211 +++++++++++--
 util/fdmon-poll.c                 |  85 ++++-
 block/trace-events                |  12 +-
 stubs/meson.build                 |   3 -
 util/trace-events                 |   4 +
 17 files changed, 710 insertions(+), 636 deletions(-)
 delete mode 100644 stubs/io_uring.c

-- 
2.51.0

Re: [RESEND PATCH v5 00/13] aio: add the aio_add_sqe() io_uring API

Posted by Kevin Wolf 3 months, 1 week ago

On 30.10.2025 at 16:21, Stefan Hajnoczi wrote:
> v5:
> - Explain how fdmon-io_uring.c differs from other fdmon implementations
>   in commit message [Kevin]
> - Move test-nested-aio-poll aio_get_g_source() removal into commit that touches test case [Kevin]
> - Avoid g_source_add_poll() use-after-free in fdmon_poll_update() [Kevin]
> - Avoid duplication in fdmon_epoll_gsource_dispatch(), use fdmon_epoll_wait() [Kevin]
> - Drop unnecessary revents checks in fdmon_poll_gsource_dispatch() [Kevin]
> - Mention in commit message that fdmon-io_uring.c is the new default [Kevin]
> - Add comments explaining how to clean up resources in error paths [Kevin]
> - Indicate error in return value from function with Error *errp arg [Kevin]
> - Add patch to unindent fdmon_io_uring_destroy() [Kevin]
> - Add patch to introduce FDMonOps->dispatch() callback [Kevin]
> - Drop patch with hacky BH optimization for fdmon-io_uring.c [Kevin]
> - Replace cqe_handler_bh with FDMonOps->dispatch() [Kevin]
> - Rename AioHandler->cqe_handler field to ->internal_cqe_handler [Kevin]
> - Consolidate fdmon-io_uring.c trace-events changes into this commit
> - Reduce #ifdef HAVE_IO_URING_PREP_WRITEV2 code duplication [Kevin]

The changes look good to me.

However, the test cases are still failing. I just tried to see where
test-aio is stuck, and while I looked for a backtrace first, I noticed
that just attaching gdb to the process and immediately detaching again
makes the test unstuck. Very strange.

This is the backtrace, maybe a bit unsurprising:

(gdb) bt
#0  0x00007ffff7e6fec6 in __io_uring_submit () from /lib64/liburing.so.2
#1  0x00005555556f4394 in fdmon_io_uring_wait (ctx=0x555556409950, ready_list=0x7fffffffcda0, timeout=749993088) at ../util/fdmon-io_uring.c:410
#2  0x00005555556ed29f in aio_poll (ctx=0x555556409950, blocking=true) at ../util/aio-posix.c:699
#3  0x0000555555681547 in test_timer_schedule () at ../tests/unit/test-aio.c:413
#4  0x00007ffff6f30e7e in test_case_run (tc=0x55555640d340, test_run_name=0x55555640de10 "/aio/timer/schedule", path=<optimized out>) at ../glib/gtestutils.c:3115
#5  g_test_run_suite_internal (suite=suite@entry=0x5555558696d0, path=path@entry=0x0) at ../glib/gtestutils.c:3210
#6  0x00007ffff6f30df3 in g_test_run_suite_internal (suite=suite@entry=0x555555867480, path=path@entry=0x0) at ../glib/gtestutils.c:3229
#7  0x00007ffff6f30df3 in g_test_run_suite_internal (suite=suite@entry=0x555555867720, path=path@entry=0x0) at ../glib/gtestutils.c:3229
#8  0x00007ffff6f313aa in g_test_run_suite (suite=suite@entry=0x555555867720) at ../glib/gtestutils.c:3310
#9  0x00007ffff6f31440 in g_test_run () at ../glib/gtestutils.c:2379
#10 g_test_run () at ../glib/gtestutils.c:2366
#11 0x000055555567e204 in main (argc=1, argv=0x7fffffffd488) at ../tests/unit/test-aio.c:872

And running it under strace shows that we're indeed hanging in the
syscall:

write(1, "# Start of timer tests\n", 23) = 23
eventfd2(0, EFD_CLOEXEC|EFD_NONBLOCK)   = 9
io_uring_enter(7, 1, 0, 0, NULL, 8)     = 1
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffc239bec80) = 0
io_uring_enter(7, 1, 1, IORING_ENTER_GETEVENTS, NULL, 8

Of course, if I start the test without strace and then attach strace to
the running process, that gets it unstuck like attaching gdb (not very
surprising, I guess, it's both just ptrace).

Finally I tried Ctrl-C while having strace logging to a file, and now
the io_uring_enter() returns 1 (rather than EINTR or 0 or whatever):

io_uring_enter(7, 1, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 1
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+++ killed by SIGINT +++

Not sure what to make of this.

I think you already said you run the same kernel version, but just to be
sure, I'm running 6.17.5-200.fc42.x86_64.

Kevin

Re: [RESEND PATCH v5 00/13] aio: add the aio_add_sqe() io_uring API

Posted by Kevin Wolf 3 months, 1 week ago

On 30.10.2025 at 19:11, Kevin Wolf wrote:
> However, the test cases are still failing. I just tried to see where
> test-aio is stuck, and while I looked for a backtrace first, I noticed
> that just attaching gdb to the process and immediately detaching again
> makes the test unstuck. Very strange.

I'm at the point where I'm bisecting compiler flags...

I have seen three different outcomes from test-aio:

1. It hangs. This is what I saw in my normal clang build. This configure
   line seems to be enough to trigger it:
   ../configure '--target-list=x86_64-softmmu' '--cc=clang' '--cxx=clang++'

2. An assertion failure. I haven't seen this in the actual QEMU tree
   with clang. With gcc, it seems to happen if you use -O0:
   ../configure '--target-list=x86_64-softmmu' '--enable-debug'

   Outside of the QEMU tree with a manual Makefile, I saw this behaviour
   with clang and -fstack-protector-strong, but without
   -ftrivial-auto-var-init=zero. Adding the latter turns it into the hang.

3. It just passes. This is what I saw in my default gcc build without
   --enable-debug. The test also passes with --disable-stack-protector
   added to both configure lines in 1 and 2.

Not sure yet where the flags make the difference, but I guess it does
hint at something going wrong on the stack.

Kevin

Re: [RESEND PATCH v5 00/13] aio: add the aio_add_sqe() io_uring API

Posted by Kevin Wolf 3 months, 1 week ago

On 03.11.2025 at 11:40, Kevin Wolf wrote:
> Not sure yet where the flags make the difference, but I guess it does
> hint at something going wrong on the stack.

Ok, that was quite some debugging, but I think I have it. The problem is
add_timeout_sqe():

static void add_timeout_sqe(AioContext *ctx, int64_t ns)
{
    struct io_uring_sqe *sqe;
    struct __kernel_timespec ts = {
        .tv_sec = ns / NANOSECONDS_PER_SECOND,
        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
    };

    sqe = get_sqe(ctx);
    io_uring_prep_timeout(sqe, &ts, 1, 0);
    io_uring_sqe_set_data(sqe, NULL);
}

io_uring_prep_timeout() just stores the ts pointer in the SQE; the
timeout value itself is never copied anywhere. By the time we submit the
SQE, add_timeout_sqe() has long returned and ts is out of scope, so the
kernel reads random stack data as the timeout.

# bpftrace -e 'kfunc:io_timeout { printf("%s: io_timeout %lld s + %lld ns\n", comm, ((struct io_timeout_data *)args.req->async_data)->ts.tv_sec, ((struct io_timeout_data *)args.req->async_data)->ts.tv_nsec ) }'
Attaching 1 probe...
test-aio: io_timeout 0 s + 140736377549872 ns

>>> hex(140736377549872)
'0x7fffbdca7430'

That looked a bit suspicious for a timeout. :-)
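
To make the lifetime problem concrete, here is a minimal sketch of one possible
way to avoid it: keep the timespec in a structure that outlives the function,
so the address recorded in the SQE is still valid at submission time. The toy_*
types and functions are hypothetical stand-ins so that the example compiles
without liburing; this is not necessarily how the series ends up fixing it.

```c
#include <stdint.h>

#define NANOSECONDS_PER_SECOND 1000000000LL

/* Stand-in for struct __kernel_timespec. */
struct toy_timespec { int64_t tv_sec; long long tv_nsec; };

/* Stand-in for an SQE: like a timeout SQE, it records only an address. */
struct toy_sqe { uint64_t addr; };

/* Hypothetical context that owns the timespec for as long as the SQE lives. */
struct toy_ctx { struct toy_timespec timeout_ts; struct toy_sqe sqe; };

/* Mimics the prep step: stores the pointer, copies nothing. */
static void toy_prep_timeout(struct toy_sqe *sqe, struct toy_timespec *ts)
{
    sqe->addr = (uint64_t)(uintptr_t)ts;
}

/* ts lives in ctx, so it is still valid when the SQE is submitted later. */
static void toy_add_timeout_sqe(struct toy_ctx *ctx, int64_t ns)
{
    ctx->timeout_ts = (struct toy_timespec) {
        .tv_sec = ns / NANOSECONDS_PER_SECOND,
        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
    };
    toy_prep_timeout(&ctx->sqe, &ctx->timeout_ts);
}

/* Mimics submission: only now is the timespec actually read. */
static struct toy_timespec toy_submit(const struct toy_ctx *ctx)
{
    return *(const struct toy_timespec *)(uintptr_t)ctx->sqe.addr;
}
```

With a local variable in place of ctx->timeout_ts, the read in toy_submit()
would go through a dangling pointer. That is undefined behaviour, which fits
the outcome changing with compiler and hardening flags.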

After fixing this, we still have the problem that io_uring_enter() can
return early without failing with EINTR when something like a signal
arrives. This means that a blocking aio_poll(true) can actually return
without any progress.  Not sure if it matters in practice, but it can
make test cases fail.

Not completely sure when this happens, though. When running test-aio
under strace, kill -CONT makes it return early and fail the assertion,
but without strace, I can't seem to reproduce the problem at the moment.
Attaching strace or gdb to the running process that is waiting for the
timeout also makes it return early and fail the assertion.

Kevin

Re: [RESEND PATCH v5 00/13] aio: add the aio_add_sqe() io_uring API

Posted by Stefan Hajnoczi 3 months ago

On Mon, Nov 03, 2025 at 02:30:34PM +0100, Kevin Wolf wrote:
> After fixing this, we still have the problem that io_uring_enter() can
> return early without failing with EINTR when something like a signal
> arrives. This means that a blocking aio_poll(true) can actually return
> without any progress.  Not sure if it matters in practice, but it can
> make test cases fail.
> 
> Not completely sure when this happens, though. When running test-aio
> under strace, kill -CONT makes it return early and fail the assertion,
> but without strace, I can't seem to reproduce the problem at the moment.
> Attaching strace or gdb to the running process that is waiting for the
> timeout also makes it return early and fail the assertion.

Thanks again for debugging the test failures. I am sending a new
revision with the timeout variable lifetime fixed and a stronger loop
condition that retries io_uring_submit_and_wait() when the awaited CQE
is still missing.

Thoughts on what you found:

I've upgraded to Fedora 43 (kernel-6.17.5-300.fc43.x86_64) but I think
the behavior is still the same.

The difference in ptraced vs normal SIGCONT behavior is because the
signal only actually interrupts processes that are stopped or ptraced.
Otherwise SIGCONT disappears without interrupting the system call.

When ptraced, io_uring_enter()'s return value will be 1 and not -EINTR
because of how io_uring/io_uring.c:io_uring_enter() is implemented.
There are two variables: ret is the return value from submitting SQEs
and ret2 is the return value from waiting for CQEs. The first variable
takes precedence, so even though ret2 can be -EINTR, it is overridden:

                  if (!ret) {
                          ret = ret2;
                          ...
                  }
          }
  out:
          ...
          return ret;

In our case ret is >0, so ret2 = -EINTR is ignored and the
io_uring_enter(2) return value is 1. This explains the behavior when
SIGCONT is received while a ptraced io_uring_enter(2) is waiting.
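
The precedence rule can be modelled in a few lines of userspace C. This
is a simplified sketch of the merge step only, and model_enter_result()
is a hypothetical helper, not a kernel function:

```c
#include <errno.h>

/*
 * Simplified model of how the submit result (ret) and the wait result
 * (ret2) are merged: ret2 is only visible when nothing was submitted.
 */
static int model_enter_result(int ret, int ret2)
{
    if (!ret) {
        ret = ret2;
    }
    return ret;
}
```

In this model, model_enter_result(1, -EINTR) yields 1, while
model_enter_result(0, -EINTR) yields -EINTR.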

I confirmed in userspace that there are no CQEs ready when
io_uring_enter(2) returns 1 after SIGCONT. This can be fixed in
userspace by retrying when the expected CQE is missing.
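
A rough sketch of such a retry loop follows, not the code from the new
revision. The callbacks and the fake_* helpers are hypothetical: wait()
stands in for io_uring_submit_and_wait() and ready() for a check that
the awaited CQE has arrived. Retrying on -EINTR as well is an assumption
of this sketch:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical callbacks standing in for the real submit/wait calls. */
typedef int (*wait_fn)(void *opaque);
typedef bool (*cqe_ready_fn)(void *opaque);

/*
 * Keep waiting while the call was interrupted, or while it "succeeded"
 * but the awaited CQE is still missing. Other errors are returned.
 */
static int wait_until_cqe(wait_fn wait, cqe_ready_fn ready, void *opaque)
{
    int ret;

    assert(wait && ready);

    do {
        ret = wait(opaque);
    } while (ret == -EINTR || (ret >= 0 && !ready(opaque)));

    return ret;
}

/* Fake ring for demonstration: returns early a few times, then completes. */
struct fake_ring {
    int spurious_returns; /* early returns left before the CQE arrives */
    int waits;            /* number of wait calls made so far */
    bool cqe_ready;
};

static int fake_wait(void *opaque)
{
    struct fake_ring *r = opaque;

    r->waits++;
    if (r->spurious_returns > 0) {
        r->spurious_returns--;
        return 1; /* like io_uring_enter(2) returning 1 with no CQE */
    }
    r->cqe_ready = true;
    return 0;
}

static bool fake_ready(void *opaque)
{
    return ((struct fake_ring *)opaque)->cqe_ready;
}
```

Calling wait_until_cqe(fake_wait, fake_ready, &r) with
r.spurious_returns = 2 makes three wait calls before returning.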

Regarding aio_poll(ctx, true) returning false: it happens when
aio_notify() is called while aio_poll(ctx, true) is waiting. The tests
rely on the internal details of when exactly false can be returned, so
it will help if fdmon-io_uring.c avoids returning false in situations
where other fdmon implementations wait longer.

Stefan

Re: [RESEND PATCH v5 00/13] aio: add the aio_add_sqe() io_uring API

Posted by Stefan Hajnoczi 3 months, 1 week ago

On Mon, Nov 03, 2025 at 02:30:34PM +0100, Kevin Wolf wrote:
> Am 03.11.2025 um 11:40 hat Kevin Wolf geschrieben:
> > Am 30.10.2025 um 19:11 hat Kevin Wolf geschrieben:
> > > Am 30.10.2025 um 16:21 hat Stefan Hajnoczi geschrieben:
> > > > v5:
> > > > - Explain how fdmon-io_uring.c differs from other fdmon implementations
> > > >   in commit message [Kevin]
> > > > - Move test-nested-aio-poll aio_get_g_source() removal into commit that touches test case [Kevin]
> > > > - Avoid g_source_add_poll() use-after-free in fdmon_poll_update() [Kevin]
> > > > - Avoid duplication in fdmon_epoll_gsource_dispatch(), use fdmon_epoll_wait() [Kevin]
> > > > - Drop unnecessary revents checks in fdmon_poll_gsource_dispatch() [Kevin]
> > > > - Mention in commit message that fdmon-io_uring.c is the new default [Kevin]
> > > > - Add comments explaining how to clean up resources in error paths [Kevin]
> > > > - Indicate error in return value from function with Error *errp arg [Kevin]
> > > > - Add patch to unindent fdmon_io_uring_destroy() [Kevin]
> > > > - Add patch to introduce FDMonOps->dispatch() callback [Kevin]
> > > > - Drop patch with hacky BH optimization for fdmon-io_uring.c [Kevin]
> > > > - Replace cqe_handler_bh with FDMonOps->dispatch() [Kevin]
> > > > - Rename AioHandler->cqe_handler field to ->internal_cqe_handler [Kevin]
> > > > - Consolidate fdmon-io_uring.c trace-events changes into this commit
> > > > - Reduce #ifdef HAVE_IO_URING_PREP_WRITEV2 code duplication [Kevin]
> > > 
> > > The changes look good to me.
> > > 
> > > However, the test cases are still failing. I just tried to see where
> > > test-aio is stuck, and while I looked for a backtrace first, I noticed
> > > that just attaching gdb to the process and immediately detaching again
> > > makes the test unstuck. Very strange.
> > > 
> > > This is the backtrace, maybe a bit unsurprising:
> > > 
> > > (gdb) bt
> > > #0  0x00007ffff7e6fec6 in __io_uring_submit () from /lib64/liburing.so.2
> > > #1  0x00005555556f4394 in fdmon_io_uring_wait (ctx=0x555556409950, ready_list=0x7fffffffcda0, timeout=749993088) at ../util/fdmon-io_uring.c:410
> > > #2  0x00005555556ed29f in aio_poll (ctx=0x555556409950, blocking=true) at ../util/aio-posix.c:699
> > > #3  0x0000555555681547 in test_timer_schedule () at ../tests/unit/test-aio.c:413
> > > #4  0x00007ffff6f30e7e in test_case_run (tc=0x55555640d340, test_run_name=0x55555640de10 "/aio/timer/schedule", path=<optimized out>) at ../glib/gtestutils.c:3115
> > > #5  g_test_run_suite_internal (suite=suite@entry=0x5555558696d0, path=path@entry=0x0) at ../glib/gtestutils.c:3210
> > > #6  0x00007ffff6f30df3 in g_test_run_suite_internal (suite=suite@entry=0x555555867480, path=path@entry=0x0) at ../glib/gtestutils.c:3229
> > > #7  0x00007ffff6f30df3 in g_test_run_suite_internal (suite=suite@entry=0x555555867720, path=path@entry=0x0) at ../glib/gtestutils.c:3229
> > > #8  0x00007ffff6f313aa in g_test_run_suite (suite=suite@entry=0x555555867720) at ../glib/gtestutils.c:3310
> > > #9  0x00007ffff6f31440 in g_test_run () at ../glib/gtestutils.c:2379
> > > #10 g_test_run () at ../glib/gtestutils.c:2366
> > > #11 0x000055555567e204 in main (argc=1, argv=0x7fffffffd488) at ../tests/unit/test-aio.c:872
> > > 
> > > And running it under strace shows that we're indeed hanging in the
> > > syscall:
> > > 
> > > write(1, "# Start of timer tests\n", 23) = 23
> > > eventfd2(0, EFD_CLOEXEC|EFD_NONBLOCK)   = 9
> > > io_uring_enter(7, 1, 0, 0, NULL, 8)     = 1
> > > clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffc239bec80) = 0
> > > io_uring_enter(7, 1, 1, IORING_ENTER_GETEVENTS, NULL, 8
> > > 
> > > Of course, if I start the test without strace and then attach strace to
> > > the running process, that gets it unstuck like attaching gdb (not very
> > > surprising, I guess, it's both just ptrace).
> > > 
> > > Finally I tried Ctrl-C while having strace logging to a file, and now
> > > the io_uring_enter() returns 1 (rather than EINTR or 0 or whatever):
> > > 
> > > io_uring_enter(7, 1, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 1
> > > --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
> > > +++ killed by SIGINT +++
> > > 
> > > Not sure what to make of this.
> > > 
> > > I think you already said you run the same kernel version, but just to be
> > > sure, I'm running 6.17.5-200.fc42.x86_64.
> > 
> > I'm at the point where I'm bisecting compiler flags...
> > 
> > I have seen three different outcomes from test-aio:
> > 
> > 1. It hangs. This is what I saw in my normal clang build. This configure
> >    line seems to be enough to trigger it:
> >    ../configure '--target-list=x86_64-softmmu' '--cc=clang' '--cxx=clang++'
> > 
> > 2. An assertion failure. I haven't seen this in the actual QEMU tree
> >    with clang. With gcc, it seems to happen if you use -O0:
> >    ../configure '--target-list=x86_64-softmmu' '--enable-debug'
> > 
> >    Outside of the QEMU tree with a manual Makefile, I saw this behaviour
> >    with clang and -fstack-protector-strong, but without
> >    -ftrivial-auto-var-init=zero. Adding the latter turns it into the hang.
> > 
> > 3. It just passes. This is what I saw in my default gcc build without
> >    --enable-debug. The test also passes with --disable-stack-protector
> >    added to both configure lines in 1 and 2.
> > 
> > Not sure yet where the flags make the difference, but I guess it does
> > hint at something going wrong on the stack.
> 
> Ok, that was quite some debugging, but I think I have it. The problem is
> add_timeout_sqe():
> 
> static void add_timeout_sqe(AioContext *ctx, int64_t ns)
> {
>     struct io_uring_sqe *sqe;
>     ts = (struct __kernel_timespec) {
>         .tv_sec = ns / NANOSECONDS_PER_SECOND,
>         .tv_nsec = ns % NANOSECONDS_PER_SECOND,
>     };
> 
>     sqe = get_sqe(ctx);
>     io_uring_prep_timeout(sqe, &ts, 1, 0);
>     io_uring_sqe_set_data(sqe, NULL);
> }
> 
> What io_uring_prep_timeout() does is that it just stores the ts pointer
> in the SQE, the timeout is never copied anywhere. Obviously, by the time
> that we submit the SQE, ts has been out of scope for a long time, so the
> kernel reads random data as a timeout.
> 
> # bpftrace -e 'kfunc:io_timeout { printf("%s: io_timeout %lld s + %lld ns\n", comm, ((struct io_timeout_data *)args.req->async_data)->ts.tv_sec, ((struct io_timeout_data *)args.req->async_data)->ts.tv_nsec ) }'
> Attaching 1 probe...
> test-aio: io_timeout 0 s + 140736377549872 ns
> 
> >>> hex(140736377549872)
> '0x7fffbdca7430'
> 
> That looked a bit suspicious for a timeout. :-)
> 
> After fixing this, we still have the problem that io_uring_enter() can
> return early without failing with EINTR when something like a signal
> arrives. This means that a blocking aio_poll(true) can actually return
> without any progress.  Not sure if it matters in practice, but it can
> make test cases fail.
> 
> Not completely sure when this happens, though. When running the aio-test
> under strace, kill -CONT makes it return early and fail the assertion,
> but without strace, I can't seem to reproduce the problem at the moment.
> Attaching strace or gdb to the running process that is waiting for the
> timeout also makes it return early and fail the assertion.

Hi Kevin,
Thank you for going through the effort of debugging this!

I'll see if I can track down the issue with io_uring_enter() returning
early.

Stefan