Request queuing prevents new requests from being submitted to the
BlockDriverState: instead of merely letting already-submitted requests
complete before bdrv_drained_begin() returns, it keeps new requests from
starting at all.

The reason for this was to ensure progress and avoid a livelock in
blk_drain(), blk_drain_all_begin(), bdrv_drained_begin() or
bdrv_drain_all_begin() when there is an endless stream of requests to a
BlockBackend. However, this approach is prone to deadlocks.

In particular, IDE TRIM wants to elevate its BB's in-flight counter for a
"macro" operation that consists of several actual I/O operations. Each of
those operations is individually started and awaited. It does this so
that blk_drain() will drain the whole TRIM, and not just a single one
of the many discard operations it may encompass. When request queuing
is enabled, this leads to a deadlock: The currently ongoing discard is
drained, and the next one is queued, waiting for the drain to stop.
Meanwhile, TRIM still keeps the in-flight counter elevated, waiting
for all discards to stop -- which will never happen, because with the
in-flight counter elevated, the BB is never considered drained, so the
drained section does not begin and cannot end.
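For illustration, the pattern looks roughly like this (a simplified,
hypothetical sketch -- the real code is the callback-based TRIM machinery
in hw/ide/core.c, and the type and function names below are made up):

#include "qemu/osdep.h"
#include "sysemu/block-backend.h"

/* Hypothetical range type, only for this sketch. */
typedef struct TrimRange {
    int64_t offset;
    int64_t bytes;
} TrimRange;

static void coroutine_fn trim_macro_operation(BlockBackend *blk,
                                              const TrimRange *ranges,
                                              int nb_ranges)
{
    int i;

    /* Keep the BlockBackend busy for the whole TRIM command, so that
     * blk_drain() waits for all of it, not just for one discard. */
    blk_inc_in_flight(blk);

    for (i = 0; i < nb_ranges; i++) {
        /* Each discard is an individual request; with request queuing
         * enabled it can be parked in blk_wait_while_drained() while the
         * outer in-flight reference above keeps the BB from ever looking
         * drained -- hence the deadlock described above. */
        if (blk_co_pdiscard(blk, ranges[i].offset, ranges[i].bytes) < 0) {
            break;
        }
    }

    blk_dec_in_flight(blk);
}
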
Fixing the implementation of request queuing is hard to do in general,
and even harder to do without adding more hacks. As the name suggests,
deadlocks are worse than livelocks :) so let's avoid them: turn request
queuing on only after the BlockBackend has quiesced, and leave
bdrv_drained_begin()'s other job -- preventing new submissions before the
BlockBackend has quiesced -- to the BQL or to the individual BlockDevOps
implementations.

In fact, devices such as IDE that run in the vCPU thread do not suffer
from this livelock because they only submit requests while they are
allowed to hold the big QEMU lock (i.e., not during bdrv_drained_begin()
or bdrv_drain_all_begin()). Other devices can avoid it through an
external file descriptor (so that aio_disable_external() will prevent
submission of new requests) or with a .drained_begin callback in their
BlockDevOps.

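For instance, a device could hook the drained callbacks of its
BlockDevOps roughly as follows (a hypothetical sketch; MyDevice and the
callback names are invented and not taken from any existing device):

#include "qemu/osdep.h"
#include "sysemu/block-backend.h"

typedef struct MyDevice {
    BlockBackend *blk;
    bool stopped;       /* no new requests may be submitted while set */
} MyDevice;

static void my_device_drained_begin(void *opaque)
{
    MyDevice *d = opaque;

    /* The device's request submission paths must check d->stopped and
     * refrain from calling blk_aio_*()/blk_co_*() while it is set. */
    d->stopped = true;
}

static void my_device_drained_end(void *opaque)
{
    MyDevice *d = opaque;

    d->stopped = false; /* safe to submit requests again */
}

static const BlockDevOps my_device_block_ops = {
    .drained_begin = my_device_drained_begin,
    .drained_end   = my_device_drained_end,
};

static void my_device_realize(MyDevice *d)
{
    /* Register the callbacks with the device's BlockBackend. */
    blk_set_dev_ops(d->blk, &my_device_block_ops, d);
}
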
Note that this change does not affect the safety of bdrv_drained_begin(),
since the patch does not completely do away with request queuing.

Reported-by: Fiona Ebner <f.ebner@proxmox.com>
Fixes: 7e5cdb345f77 ("ide: Increment BB in-flight counter for TRIM BH")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block/block-backend.c | 40 ++++++++++++++++++++++++++++++----------
1 file changed, 30 insertions(+), 10 deletions(-)
diff --git a/block/block-backend.c b/block/block-backend.c
index 10419f8be91e..acb4cb91a5ee 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -42,6 +42,12 @@ typedef struct BlockBackendAioNotifier {
QLIST_ENTRY(BlockBackendAioNotifier) list;
} BlockBackendAioNotifier;
+typedef enum {
+ BLK_QUEUE_READY,
+ BLK_QUEUE_DISABLED,
+ BLK_QUEUE_QUIESCENT,
+} BlockBackendQueueState;
+
struct BlockBackend {
char *name;
int refcnt;
@@ -79,13 +85,14 @@ struct BlockBackend {
*/
bool allow_aio_context_change;
bool allow_write_beyond_eof;
- bool disable_request_queuing;
/* Protected by BQL */
NotifierList remove_bs_notifiers, insert_bs_notifiers;
QLIST_HEAD(, BlockBackendAioNotifier) aio_notifiers;
int quiesce_counter; /* atomic: written under BQL, read by other threads */
+ BlockBackendQueueState request_queuing;
+
QemuMutex queued_requests_lock; /* protects queued_requests */
CoQueue queued_requests;
@@ -368,6 +375,7 @@ BlockBackend *blk_new(AioContext *ctx, uint64_t perm, uint64_t shared_perm)
blk->shared_perm = shared_perm;
blk_set_enable_write_cache(blk, true);
+ blk->request_queuing = BLK_QUEUE_READY;
blk->on_read_error = BLOCKDEV_ON_ERROR_REPORT;
blk->on_write_error = BLOCKDEV_ON_ERROR_ENOSPC;
@@ -1240,7 +1248,7 @@ void blk_allow_aio_context_change(BlockBackend *blk)
void blk_disable_request_queuing(BlockBackend *blk)
{
GLOBAL_STATE_CODE();
- blk->disable_request_queuing = true;
+ blk->request_queuing = BLK_QUEUE_DISABLED;
}
static int coroutine_fn GRAPH_RDLOCK
@@ -1279,16 +1287,18 @@ static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
{
assert(blk->in_flight > 0);
- if (qatomic_read(&blk->quiesce_counter) && !blk->disable_request_queuing) {
+ if (qatomic_read(&blk->request_queuing) == BLK_QUEUE_QUIESCENT) {
/*
* Take lock before decrementing in flight counter so main loop thread
* waits for us to enqueue ourselves before it can leave the drained
* section.
*/
qemu_mutex_lock(&blk->queued_requests_lock);
- blk_dec_in_flight(blk);
- qemu_co_queue_wait(&blk->queued_requests, &blk->queued_requests_lock);
- blk_inc_in_flight(blk);
+ if (qatomic_read(&blk->request_queuing) == BLK_QUEUE_QUIESCENT) {
+ blk_dec_in_flight(blk);
+ qemu_co_queue_wait(&blk->queued_requests, &blk->queued_requests_lock);
+ blk_inc_in_flight(blk);
+ }
qemu_mutex_unlock(&blk->queued_requests_lock);
}
}
@@ -2600,7 +2610,14 @@ static bool blk_root_drained_poll(BdrvChild *child)
if (blk->dev_ops && blk->dev_ops->drained_poll) {
busy = blk->dev_ops->drained_poll(blk->dev_opaque);
}
- return busy || !!blk->in_flight;
+ if (busy || blk->in_flight) {
+ return true;
+ }
+
+ if (qatomic_read(&blk->request_queuing) == BLK_QUEUE_READY) {
+ qatomic_set(&blk->request_queuing, BLK_QUEUE_QUIESCENT);
+ }
+ return false;
}
static void blk_root_drained_end(BdrvChild *child)
@@ -2616,9 +2633,12 @@ static void blk_root_drained_end(BdrvChild *child)
blk->dev_ops->drained_end(blk->dev_opaque);
}
qemu_mutex_lock(&blk->queued_requests_lock);
- while (qemu_co_enter_next(&blk->queued_requests,
- &blk->queued_requests_lock)) {
- /* Resume all queued requests */
+ if (qatomic_read(&blk->request_queuing) != BLK_QUEUE_DISABLED) {
+ qatomic_set(&blk->request_queuing, BLK_QUEUE_READY);
+ while (qemu_co_enter_next(&blk->queued_requests,
+ &blk->queued_requests_lock)) {
+ /* Resume all queued requests */
+ }
}
qemu_mutex_unlock(&blk->queued_requests_lock);
}
--
2.39.2
On 05.04.23 18:31, Paolo Bonzini wrote:
> [...]
> diff --git a/block/block-backend.c b/block/block-backend.c
> index 10419f8be91e..acb4cb91a5ee 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
[...]
> @@ -2600,7 +2610,14 @@ static bool blk_root_drained_poll(BdrvChild *child)
> if (blk->dev_ops && blk->dev_ops->drained_poll) {
> busy = blk->dev_ops->drained_poll(blk->dev_opaque);
> }
> - return busy || !!blk->in_flight;
> + if (busy || blk->in_flight) {
> + return true;
> + }
> +
> + if (qatomic_read(&blk->request_queuing) == BLK_QUEUE_READY) {
> + qatomic_set(&blk->request_queuing, BLK_QUEUE_QUIESCENT);
> + }
> + return false;
> }
This implicitly relies on nobody increasing blk->in_flight (or
dev_ops->drained_poll() returning `true` again) while the BB is starting
to be drained, because if the function were to be called again after it
has returned `false` once per drained section (not sure if that’s
possible![1]), then we’d end up in the original situation, with
in_flight elevated and queuing enabled.
Is that really strictly guaranteed somehow or is it rather a complex
conglomerate of many cases that in the end happen to work out
individually? I mean, I could imagine that running
BlockDevOps.drained_begin() is supposed to guarantee that, but it can’t,
because only NBD seems to implement it. The commit message talks about
IDE being fine (by accident?) because it needs BQL availability to
submit new requests. But that’s very complex and I’d rather have a
strict requirement to guarantee correctness.
[1] If blk_root_drained_poll() isn’t called anymore after returning
`false`, all will be good, but I assume it will be, because we have a
quiesce_counter, not a quiesce_bool. We could kind of emulate this by
continuing to return `false` after blk_root_drained_poll() has returned
`false` once, until the quiesce_counter becomes 0.
We could also have blk_root_drained_poll(), if it sees in_flight > 0 &&
request_queuing == BLK_QUEUE_QUIESCENT, revert request_queuing to
BLK_QUEUE_READY and resume all queued requests.
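Roughly something like this, as a completely untested sketch on top of
your patch:

static bool blk_root_drained_poll(BdrvChild *child)
{
    BlockBackend *blk = child->opaque;
    bool busy = false;

    assert(qatomic_read(&blk->quiesce_counter));

    if (blk->dev_ops && blk->dev_ops->drained_poll) {
        busy = blk->dev_ops->drained_poll(blk->dev_opaque);
    }

    if (busy || blk->in_flight) {
        /* New activity after we already went quiescent: back off and
         * release anything queued in the meantime, so that an in-flight
         * request cannot deadlock on its own queued sub-requests. */
        if (qatomic_read(&blk->request_queuing) == BLK_QUEUE_QUIESCENT) {
            qatomic_set(&blk->request_queuing, BLK_QUEUE_READY);
            qemu_mutex_lock(&blk->queued_requests_lock);
            while (qemu_co_enter_next(&blk->queued_requests,
                                      &blk->queued_requests_lock)) {
                /* Resume all queued requests */
            }
            qemu_mutex_unlock(&blk->queued_requests_lock);
        }
        return true;
    }

    if (qatomic_read(&blk->request_queuing) == BLK_QUEUE_READY) {
        qatomic_set(&blk->request_queuing, BLK_QUEUE_QUIESCENT);
    }
    return false;
}
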
But, admittedly, I’m making a lot of assumptions and leaps by this
point. It all hinges on whether we can guarantee that in_flight won’t
be increased while a drained section starts.
Hanna
On Wed, Apr 12, 2023 at 1:54 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> On 05.04.23 18:31, Paolo Bonzini wrote:
> > + if (busy || blk->in_flight) {
> > + return true;
> > + }
> > +
> > + if (qatomic_read(&blk->request_queuing) == BLK_QUEUE_READY) {
> > + qatomic_set(&blk->request_queuing, BLK_QUEUE_QUIESCENT);
> > + }
> > + return false;
> > }
>
> This implicitly relies on nobody increasing blk->in_flight (or
> dev_ops->drained_poll() returning `true` again) while the BB is starting
> to be drained, because if the function were to be called again after it
> has returned `false` once per drained section (not sure if that’s
> possible![1]), then we’d end up in the original situation, with
> in_flight elevated and queuing enabled.
Yes, it does.
> Is that really strictly guaranteed somehow or is it rather a complex
> conglomerate of many cases that in the end happen to work out
> individually? I mean, I could imagine that running
> BlockDevOps.drained_begin() is supposed to guarantee that, but it can’t,
> because only NBD seems to implement it. The commit message talks about
> IDE being fine (by accident?) because it needs BQL availability to
> submit new requests. But that’s very complex and I’d rather have a
> strict requirement to guarantee correctness.
It's a conglomerate of three cases, each of which is sufficient on its
own (BQL, aio_disable_external, bdrv_drained_begin---plus just not using
blk_inc_in_flight could be a fourth, of course). Of these,
aio_disable_external() is going away in favor of the
.bdrv_drained_begin callback; and blk_inc_in_flight() is used rarely
in the first place, so I thought it wouldn't be too hard to have this
requirement.
> [1] If the blk_root_drained_poll() isn’t called anymore after returning
> `false`, all will be good, but I assume it will be, because we have a
> quiesce_counter, not a quiesce_bool. We could kind of emulate this by
> continuing to return `false` after blk_root_drained_poll() has returned
> `false` once, until the quiesce_counter becomes 0.
> We could also have blk_root_drained_poll(), if it sees in_flight > 0 &&
> request_queuing == BLK_QUEUE_QUIESCENT, revert request_queuing to
> BLK_QUEUE_READY and resume all queued requests.
The intended long term fix is to remove request queuing and, if a
request is submitted while BLK_QUEUE_QUIESCENT, give an assertion
failure.
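Roughly along these lines (just a sketch of the direction, not something
I have tested):

static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
{
    assert(blk->in_flight > 0);

    /*
     * No request queuing anymore: the drained section itself (BQL,
     * BlockDevOps.drained_begin, ...) has to guarantee that nothing new
     * is submitted, so getting here while quiescent is a bug.
     */
    assert(qatomic_read(&blk->request_queuing) != BLK_QUEUE_QUIESCENT);
}
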
But since the hang requires blk_inc_in_flight() in the device, perhaps
in the short term documenting it in blk_inc_in_flight() may be enough?
Paolo
On 12.04.23 14:03, Paolo Bonzini wrote:
> On Wed, Apr 12, 2023 at 1:54 PM Hanna Czenczek <hreitz@redhat.com> wrote:
>> On 05.04.23 18:31, Paolo Bonzini wrote:
>>> + if (busy || blk->in_flight) {
>>> + return true;
>>> + }
>>> +
>>> + if (qatomic_read(&blk->request_queuing) == BLK_QUEUE_READY) {
>>> + qatomic_set(&blk->request_queuing, BLK_QUEUE_QUIESCENT);
>>> + }
>>> + return false;
>>> }
>> This implicitly relies on nobody increasing blk->in_flight (or
>> dev_ops->drained_poll() returning `true` again) while the BB is starting
>> to be drained, because if the function were to be called again after it
>> has returned `false` once per drained section (not sure if that’s
>> possible![1]), then we’d end up in the original situation, with
>> in_flight elevated and queuing enabled.
> Yes, it does.
>
>> Is that really strictly guaranteed somehow or is it rather a complex
>> conglomerate of many cases that in the end happen to work out
>> individually? I mean, I could imagine that running
>> BlockDevOps.drained_begin() is supposed to guarantee that, but it can’t,
>> because only NBD seems to implement it. The commit message talks about
>> IDE being fine (by accident?) because it needs BQL availability to
>> submit new requests. But that’s very complex and I’d rather have a
>> strict requirement to guarantee correctness.
> It's a conglomerate of three cases each of which is sufficient (BQL,
> aio_disable_external, bdrv_drained_begin---plus just not using
> blk_inc_in_flight could be a fourth, of course). Of these,
> aio_disable_external() is going away in favor of the
> .bdrv_drained_begin callback; and blk_inc_in_flight() is used rarely
> in the first place so I thought it'd be not too hard to have this
> requirement.
Does IDE’s BQL requirement work for nested drains, though, i.e. when you
have a drained_begin, followed by another? The commit message doesn’t
say whether it’s impossible for IDE to create a new request in between
the two.
I’m a bit afraid that these cases are too complicated for me to fully
comprehend.
>> [1] If the blk_root_drained_poll() isn’t called anymore after returning
>> `false`, all will be good, but I assume it will be, because we have a
>> quiesce_counter, not a quiesce_bool. We could kind of emulate this by
>> continuing to return `false` after blk_root_drained_poll() has returned
>> `false` once, until the quiesce_counter becomes 0.
>> We could also have blk_root_drained_poll(), if it sees in_flight > 0 &&
>> request_queuing == BLK_QUEUE_QUIESCENT, revert request_queuing to
>> BLK_QUEUE_READY and resume all queued requests.
> The intended long term fix is to remove request queuing and, if a
> request is submitted while BLK_QUEUE_QUIESCENT, give an assertion
> failure.
Yep, that would be a nice obvious requirement.
> But since the hang requires blk_inc_in_flight() in the device, perhaps
> in the short term documenting it in blk_inc_in_flight() may be enough?
Technically it needs a blk_inc_in_flight() whose blk_dec_in_flight()
depends on a different request that can be queued (which is only the
case in IDE), so I suppose we could document exactly that in those
functions’ interfaces, i.e. that users must take care not to use
blk_inc_in_flight() while the BlockBackend is (being) drained, when the
associated blk_dec_in_flight() may depend on an I/O request to the BB.
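For example, something along these lines next to the declaration (the
wording is just a rough sketch):

/*
 * blk_inc_in_flight:
 *
 * Increment @blk's in-flight counter, e.g. to keep a "macro" operation
 * alive across several individual I/O requests.
 *
 * Callers must not use this while @blk is drained or being drained if
 * the matching blk_dec_in_flight() depends on the completion of an I/O
 * request to @blk: once the BlockBackend has quiesced, such a request
 * would be queued until the drained section ends, and the elevated
 * counter would then keep the drained section from ever beginning.
 */
void blk_inc_in_flight(BlockBackend *blk);
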
I think that should be enough, yes. Well, as long as you can guarantee
that IDE will indeed fulfill that requirement, because I find it
difficult to see/prove...
Hanna