block: fix two missed-wakeup hangs on shutdown path

[PATCH 0/2] block: fix two missed-wakeup hangs on shutdown path

Posted by Denis V. Lunev via qemu development 1 month ago

Problem
-------

The qemu shutdown / blockdev-close path can deadlock permanently on
upstream master.  The main thread enters ppoll(timeout=-1) holding
BQL, no other thread has a wake source that points back at it, and
qemu has to be SIGKILLed.  The hang has no timeout -- it is a hard
deadlock, not a slow operation.  Because the main thread holds BQL,
RCU, the vCPUs and every iothread path that needs BQL stall with it.

Two independent missed-wakeup races in the block layer contribute.
Both share the same shape: a waiter arms on one side, the waker
reads stale state on its fast path and silently skips the kick, and
nothing else on the AioContext will fire to recover.  They are
different bugs in different subsystems and each patch stands on its
own; they are posted together because they surface through the same
test and the same symptom and are easiest to diagnose side by side.

Depending on which race fires, the main thread backtrace at the
moment of hang is one of:

  ppoll -> aio_poll -> bdrv_graph_wrlock -> blk_remove_bs
      (patch 1 -- block/graph-lock)

  ppoll -> aio_poll -> cache_clean_timer_del_and_wait -> qcow2_close
      (patch 2 -- block/qcow2 cache_clean_timer)

Race diagrams and the exact stale-state read are in each patch's
commit message.

Reproducer
----------

Environment used for the numbers below: 4-vCPU VM guest,
kernel 6.12.x, upstream master at bb230769b4.  On modern bare-metal
hardware the window is narrow enough that the hangs rarely reproduce
outside a VM -- a guest under full CPU saturation is what makes the
timing reliable.  Downstream trees that still use plain
bdrv_graph_wrlock() in blk_remove_bs() hit the graph-lock race on
the first iteration without any stress at all.

    # reproducer
    stress-ng --cpu "$(nproc)" --timeout 0 &
    for r in $(seq 20); do
        timeout 120 ./build/tests/qemu-iotests/check -qcow2 iothreads-create
    done
    kill %1

With `stress-ng --cpu $(nproc)` both races surface.  With
`stress-ng --cpu $(($(nproc) - 1))` or without a stressor neither
reproduces reliably across 20 iterations.

When a race fires, the Python QMP client times out on vm.run_job()
after 5 s, the qemu process keeps running but never makes forward
progress, and the outer `timeout 120` eventually kills it.  Attach
gdb before the timeout kills qemu to capture the stack and
distinguish which of the two races fired.
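
As a rough sketch of that triage step (not part of the series): the
helper below reads a saved backtrace on stdin and reports which race
it matches.  The name classify_hang is hypothetical; the only things
it relies on are the two function names from the backtraces listed
under "Problem".

```shell
# Hypothetical helper, for illustration only.
# Reads backtrace text on stdin and prints which of the two races
# it matches, based on the frame names given in the cover letter.
classify_hang() (
    bt=$(cat)
    case "$bt" in
        *bdrv_graph_wrlock*)
            echo "patch 1: block/graph-lock" ;;
        *cache_clean_timer_del_and_wait*)
            echo "patch 2: block/qcow2 cache_clean_timer" ;;
        *)
            echo "unknown" ;;
    esac
)
```

One possible way to feed it, assuming $PID holds the pid of the hung
qemu process and that gdb selects the main thread on attach:
`gdb -p "$PID" -batch -ex bt 2>/dev/null | classify_hang`.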

Results
-------

Same guest, 20 iterations of the loop above:

  upstream master:            10/20 FAIL (first fail at iter #2)
  master + both patches:      20/20 PASS
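
A minimal sketch of how such a tally could be collected, assuming the
command being run exits non-zero when the test fails or when
`timeout` kills it.  run_iterations is a hypothetical wrapper, not
something in the tree, and the numbers above were not necessarily
gathered this way.

```shell
# Hypothetical wrapper, for illustration only.
# Usage: run_iterations N command [args...]
# Runs the command N times, discards its output, and prints a
# summary in the same shape as the results table.
run_iterations() (
    n=$1
    shift
    fails=0
    first=0
    i=1
    while [ "$i" -le "$n" ]; do
        if ! "$@" >/dev/null 2>&1; then
            fails=$((fails + 1))
            # Remember the first failing iteration only.
            if [ "$first" -eq 0 ]; then
                first=$i
            fi
        fi
        i=$((i + 1))
    done
    if [ "$fails" -eq 0 ]; then
        echo "$n/$n PASS"
    else
        echo "$fails/$n FAIL (first fail at iter #$first)"
    fi
)
```

With the stressor from the reproducer running in the background, it
would be invoked as `run_iterations 20 timeout 120
./build/tests/qemu-iotests/check -qcow2 iothreads-create`.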

Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Hanna Reitz <hreitz@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Fiona Ebner <f.ebner@proxmox.com>
Cc: Hanna Czenczek <hreitz@redhat.com>

Denis V. Lunev (2):
  block/graph-lock: fix missed wakeup in bdrv_graph_co_rdunlock()
  block/qcow2: fix hangup in cache_clean_timer cancellation

 block/graph-lock.c | 12 +++++-------
 block/qcow2.c      | 28 +++++++++++++++++-----------
 2 files changed, 22 insertions(+), 18 deletions(-)

--
2.51.0

Re: [PATCH 0/2] block: fix two missed-wakeup hangs on shutdown path

Posted by Denis V. Lunev 2 weeks, 4 days ago

On 4/24/26 12:39, Denis V. Lunev wrote:
> [...]
ping

Re: [PATCH 0/2] block: fix two missed-wakeup hangs on shutdown path

Posted by Stefan Hajnoczi 2 weeks, 4 days ago

On Mon, May 11, 2026 at 11:53:37PM +0200, Denis V. Lunev wrote:
> On 4/24/26 12:39, Denis V. Lunev wrote:
> > [...]
> ping

Hi Kevin,
This looks like a series for your block tree. If I can help in some way,
please let me know.

Stefan