Changes since v1
----------------
v1 was a two-patch series tracking two independent missed-wakeup
races on the qemu shutdown path.
  * Patch 1, "block/graph-lock: fix missed wakeup in
    bdrv_graph_co_rdunlock()", was applied as e3082ab3b3 by Kevin
    and is now in tree. This v2 carries only the remaining race.

  * Per Kevin's review of v1 patch 2 [1], the cache_clean_timer
    hang is no longer worked around inside block/qcow2.c. Instead,
    the underlying primitive -- qemu_co_sleep_wake() -- is fixed,
    closing the lost-wakeup window for every caller
    (cache_clean_timer, block_copy_kick, ...) rather than just
    qcow2. Cancellation latency through qemu_co_sleep_wake() drops
    from "next 1 s tick" (v1 workaround) to aio_co_wake().

[1] https://lore.kernel.org/qemu-devel/agcYkmTWud8euTds@redhat.com/
Problem
-------
The qemu shutdown / blockdev-close path can deadlock permanently on
upstream master. The main thread enters ppoll(timeout=-1) holding
BQL, no other thread has a wake source that points back at it, and
qemu has to be SIGKILLed. The hang has no timeout -- it is a hard
deadlock, not a slow operation. Everything queued behind BQL (RCU,
vCPUs and every iothread path that needs BQL) stalls with it.
Two independent missed-wakeup races in the block layer contributed
to the symptom on v1. Both shared the same shape: a waiter arms on
one side, the waker reads stale state on its fast path and silently
skips the kick, and nothing else on the AioContext fires to
recover. The first (block/graph-lock) was fixed by e3082ab3b3 and
is now in tree. This patch closes the second one, exposed in
qcow2's cache_clean_timer cancellation path:
ppoll -> aio_poll -> cache_clean_timer_del_and_wait -> qcow2_close
The race diagram and the exact stale-state read are in the patch's
commit message.
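
The shared shape described above can be replayed in a small, single-threaded C11 model. This is only an illustrative sketch, not the QEMU code and not the patch itself: the names Sleeper, waiter_arm, wake_buggy, wake_fixed and waiter_hangs are hypothetical, and the "fixed" waker shows one common way to close such a window (an atomic exchange on a shared state word), which may differ from what the patch does. In the real race the stale read comes from two threads running concurrently; here the ordering is simply fixed by the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a sleep/wake handshake; NOT the QEMU code. */
enum { ST_IDLE, ST_SLEEPING, ST_WOKEN };

typedef struct {
    atomic_int state;   /* one of ST_* */
    int kicks;          /* wakeups actually delivered to the waiter */
} Sleeper;

/* Waiter side: publish "I am going to sleep". Returns true if the
 * waiter really has to block, false if a wake was already recorded. */
static bool waiter_arm(Sleeper *s)
{
    int expected = ST_IDLE;
    return atomic_compare_exchange_strong(&s->state, &expected, ST_SLEEPING);
}

/* Racy waker: a plain read on the fast path. If the waiter has not
 * armed yet, the kick is skipped and the request leaves no trace. */
static void wake_buggy(Sleeper *s)
{
    if (atomic_load(&s->state) == ST_SLEEPING) {
        atomic_store(&s->state, ST_WOKEN);
        s->kicks++;
    }
}

/* Handshake waker: always records the wake with an atomic exchange,
 * so a waiter that arms later sees ST_WOKEN and does not block. */
static void wake_fixed(Sleeper *s)
{
    if (atomic_exchange(&s->state, ST_WOKEN) == ST_SLEEPING) {
        s->kicks++;
    }
}

/* Replay one interleaving. Returns true when the waiter ended up
 * blocked without ever being kicked, i.e. the modelled hang. */
static bool waiter_hangs(void (*wake)(Sleeper *), bool waker_first)
{
    Sleeper s = { .kicks = 0 };
    bool blocked;

    atomic_init(&s.state, ST_IDLE);
    if (waker_first) {
        wake(&s);                   /* waker runs before the waiter arms */
        blocked = waiter_arm(&s);
    } else {
        blocked = waiter_arm(&s);   /* waiter arms, then the waker runs */
        wake(&s);
    }
    assert(s.kicks <= 1);
    return blocked && s.kicks == 0;
}
```

In this model waiter_hangs(wake_buggy, true) reports the hang, while wake_fixed does not hang in either ordering, because both sides perform a read-modify-write on the same word and one of them always observes the other.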
Reproducer
----------
Environment: 4-vCPU VM guest, kernel 6.12.x, upstream master at
e3082ab3b3 (with the graph-lock fix already applied). On modern
bare-metal the window is narrow enough that the hang rarely
reproduces without a VM -- a VM guest under full CPU saturation is
what makes the timing reliable.
    # reproducer
    stress-ng --cpu "$(nproc)" --timeout 0 &
    for r in $(seq 20); do
        timeout 120 ./build/tests/qemu-iotests/check -qcow2 iothreads-create
    done
    kill %1
With `stress-ng --cpu $(nproc)` the race surfaces. With
`stress-ng --cpu $(($(nproc) - 1))` or without a stressor it does
not reproduce reliably across 20 iterations.
When the race fires, the Python QMP client times out on
vm.run_job() after 5 s, the qemu process keeps running but never
makes forward progress, and the outer `timeout 120` eventually
kills it. Attach gdb before the timeout kills qemu to capture the
stack.
Results
-------
Same guest, 20 iterations of the loop above, master at e3082ab3b3:
    without this patch:  reproduces reliably (qcow2_close in ppoll)
    with this patch:     20/20 PASS
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Hanna Reitz <hreitz@redhat.com>
Denis V. Lunev (1):
coroutine: fix lost wakeup in qemu_co_sleep_wake()
include/qemu/coroutine.h | 17 ++++++++---
util/qemu-coroutine-sleep.c | 60 +++++++++++++++++++++++++++----------
2 files changed, 58 insertions(+), 19 deletions(-)
--
2.51.0
On 5/20/26 21:38, Denis V. Lunev wrote:
> [...]

ping
On 20.05.2026 22:38, Denis V. Lunev via wrote:
> Changes since v1
> ----------------
>
> v1 was a two-patch series tracking two independent missed-wakeup
> races on the qemu shutdown path.

Um. The v1 has already been applied to the master branch, see commit
e3082ab3b3.

And I already picked it up for stable, but not for 10.0.x

So I guess the next step is to send a follow-up on top of v1, or
to revert v1 and have this v2 instead.

Fun stuff ;)

/mjt
On 5/20/26 21:57, Michael Tokarev wrote:
> On 20.05.2026 22:38, Denis V. Lunev via wrote:
>> Changes since v1
>> ----------------
>>
>> v1 was a two-patch series tracking two independent missed-wakeup
>> races on the qemu shutdown path.
> Um. The v1 has already been applied to the master branch, see commit
> e3082ab3b3.
>
> And I already picked it up for stable, but not for 10.0.x
>
> So I guess the next step is to send a follow-up on top of v1, or
> to revert v1 and have this v2 instead.
>
> Fun stuff ;)
>
> /mjt
>
It seems I worded the description badly.

v1 consisted of 2 patches which fixed very similar
but different patterns - 2 distinct hangs on cleanup:
* one in bdrv_graph_co_rdunlock() [1]
* and one in cache_clean_timer_del_and_wait() [2]

Patch [1] has been applied by Kevin to master.
Patch [2] has been reviewed with a proposal to revise
the approach.

This submission contains Patch [2] in the way proposed
by Kevin. Patch [1] is independent and stays merged.

Sorry if this causes any problems.

Thank you in advance,
    Den
On 20.05.2026 23:11, Denis V. Lunev wrote:
> On 5/20/26 21:57, Michael Tokarev wrote:
>> On 20.05.2026 22:38, Denis V. Lunev via wrote:
>>> Changes since v1
>>> ----------------
>>>
>>> v1 was a two-patch series tracking two independent missed-wakeup
>>> races on the qemu shutdown path.
>> Um. The v1 has already been applied to the master branch, see commit
>> e3082ab3b3.
>>
>> And I already picked it up for stable, but not for 10.0.x
>>
>> So I guess the next step is to send a follow-up on top of v1, or
>> to revert v1 and have this v2 instead.
..
> It seems I worded the description badly.
>
> v1 consisted of 2 patches which fixed very similar
> but different patterns - 2 distinct hangs on cleanup:
> * one in bdrv_graph_co_rdunlock() [1]
> * and one in cache_clean_timer_del_and_wait() [2]
>
> Patch [1] has been applied by Kevin to master.
> Patch [2] has been reviewed with a proposal to revise
> the approach.
>
> This submission contains Patch [2] in the way proposed
> by Kevin. Patch [1] is independent and stays merged.

Aha, this makes sense. Thank you for clarifying things!

And indeed, I missed this part of your description - which is
right in the part of your message which I quoted.

I'm sorry for the noise.

Best,

/mjt