[v1] Refactor bdrv_try_set_aio_context using transactions

[RFC PATCH 3/8] RFC: block: use transactions as a replacement of ->{can_}set_aio_context()

Posted by Emanuele Giuseppe Esposito 3 years, 7 months ago

---------
RFC because I am not sure about the AioContext locks.
- Do we need to take the new AioContext lock? what does it protect?
- Taking the old AioContext lock is required now, because of
  bdrv_drained_begin calling AIO_WAIT_WHILE that unlocks the
  aiocontext. If we replace it with AIO_WAIT_WHILE_UNLOCKED,
  could we get rid of taking every time the old AioContext?
  drain would be enough to protect the graph modification.
----------

Simplify the way the aiocontext can be changed in a BDS graph.
There are currently two problems in bdrv_try_set_aio_context:
- There is a confusion of AioContext locks taken and released, because
  we assume that old aiocontext is always taken and new one is
  taken inside.

- It doesn't look very safe to call bdrv_drained_begin while some
  nodes have already switched to the new aiocontext and others haven't.
  This could be especially dangerous because bdrv_drained_begin polls, so
  something else could be executed while graph is in an inconsistent
  state.

Additional minor nitpick: can_set and set_ callbacks both traverse the
graph, both using the ignored list of visited nodes in a different way.

Therefore, get rid of all of this and introduce a new callback,
change_aio_context, that uses transactions to efficiently, cleanly
and most importantly safely change the aiocontext of a graph.

This new callback is a "merge" of the two previous ones:
- Just like can_set_aio_context, recursively traverses the graph.
  Marks all nodes that are visited using a GList, and checks if
  they *could* change the aio_context.
- For each node that passes the above check, add a new transaction
  that implements a callback that effectively changes the aiocontext.
- If a node is a BDS, add two transactions: one taking care of draining
  the node at the beginning of the list (so that will be executed first)
  and one at the end taking care of changing the AioContext.
- Once done, the recursive function returns if *all* nodes can change
  the AioContext. If so, commit the above transactions. Otherwise don't
  do anything.
- The transaction list contains first all "drain" transactions, so
  we are sure we are draining all nodes in the same context, and then
  all the other switch the AioContext. In this way we make sure that
  bdrv_drained_begin() is always called under the old AioContext, and
  bdrv_drained_end() under the new one.
- Because of the above, we don't need to release and re-acquire the
  old AioContext every time, as everything is done once (and not
  per-node drain and aiocontext change).

Note that the "change" API is not yet invoked anywhere.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 block.c                            | 197 +++++++++++++++++++++++++++++
 include/block/block-global-state.h |   9 ++
 include/block/block_int-common.h   |   3 +
 3 files changed, 209 insertions(+)

diff --git a/block.c b/block.c
index 267a39c0de..bda4e1bcef 100644
--- a/block.c
+++ b/block.c
@@ -7437,6 +7437,51 @@ static bool bdrv_parent_can_set_aio_context(BdrvChild *c, AioContext *ctx,
     return true;
 }
 
+typedef struct BdrvStateSetAioContext {
+    AioContext *new_ctx;
+    BlockDriverState *bs;
+} BdrvStateSetAioContext;
+
+/*
+ * Changes the AioContext used for fd handlers, timers, and BHs by this
+ * BlockDriverState and all its children and parents.
+ *
+ * Must be called from the main AioContext.
+ *
+ * The caller must own the AioContext lock for the old AioContext of bs, but it
+ * must not own the AioContext lock for new_context (unless new_context is the
+ * same as the current context of bs).
+ *
+ * @visited will accumulate all visited BdrvChild object. The caller is
+ * responsible for freeing the list afterwards.
+ */
+static bool bdrv_parent_change_aio_context(BdrvChild *c, AioContext *ctx,
+                                           GSList **visited, Transaction *tran,
+                                           Error **errp)
+{
+    GLOBAL_STATE_CODE();
+    if (g_slist_find(*visited, c)) {
+        return true;
+    }
+    *visited = g_slist_prepend(*visited, c);
+
+    /*
+     * A BdrvChildClass that doesn't handle AioContext changes cannot
+     * tolerate any AioContext changes
+     */
+    if (!c->klass->change_aio_ctx) {
+        char *user = bdrv_child_user_desc(c);
+        error_setg(errp, "Changing iothreads is not supported by %s", user);
+        g_free(user);
+        return false;
+    }
+    if (!c->klass->change_aio_ctx(c, ctx, visited, tran, errp)) {
+        assert(!errp || *errp);
+        return false;
+    }
+    return true;
+}
+
 bool bdrv_child_can_set_aio_context(BdrvChild *c, AioContext *ctx,
                                     GSList **ignore, Error **errp)
 {
@@ -7448,6 +7493,18 @@ bool bdrv_child_can_set_aio_context(BdrvChild *c, AioContext *ctx,
     return bdrv_can_set_aio_context(c->bs, ctx, ignore, errp);
 }
 
+bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
+                                   GSList **visited, Transaction *tran,
+                                   Error **errp)
+{
+    GLOBAL_STATE_CODE();
+    if (g_slist_find(*visited, c)) {
+        return true;
+    }
+    *visited = g_slist_prepend(*visited, c);
+    return bdrv_change_aio_context(c->bs, ctx, visited, tran, errp);
+}
+
 /* @ignore will accumulate all visited BdrvChild object. The caller is
  * responsible for freeing the list afterwards. */
 bool bdrv_can_set_aio_context(BlockDriverState *bs, AioContext *ctx,
@@ -7475,6 +7532,99 @@ bool bdrv_can_set_aio_context(BlockDriverState *bs, AioContext *ctx,
     return true;
 }
 
+static void bdrv_drained_begin_commit(void *opaque)
+{
+    BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
+
+    /* Paired with bdrv_drained_end in bdrv_drained_end_clean() */
+    bdrv_drained_begin(state->bs);
+}
+
+static void bdrv_drained_end_clean(void *opaque)
+{
+    BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
+    BlockDriverState *bs = (BlockDriverState *) state->bs;
+    AioContext *new_context = state->new_ctx;
+
+    /*
+     * drain only if bdrv_drained_begin_commit() and
+     * bdrv_set_aio_context_commit() executed
+     */
+    if (bdrv_get_aio_context(bs) == new_context) {
+        /* Paired with bdrv_drained_begin in bdrv_drained_begin_commit() */
+        bdrv_drained_end(bs);
+    }
+
+    /* state is freed by set_aio_context->clean() */
+}
+
+static void bdrv_set_aio_context_commit(void *opaque)
+{
+    BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
+    BlockDriverState *bs = (BlockDriverState *) state->bs;
+    AioContext *new_context = state->new_ctx;
+    assert_bdrv_graph_writable(bs);
+
+    bdrv_detach_aio_context(bs);
+    bdrv_attach_aio_context(bs, new_context);
+}
+
+static TransactionActionDrv set_aio_context = {
+    .commit = bdrv_set_aio_context_commit,
+    .clean = g_free,
+};
+
+static TransactionActionDrv drained_begin_end = {
+    .commit = bdrv_drained_begin_commit,
+    .clean = bdrv_drained_end_clean,
+};
+
+/*
+ * @visited will accumulate all visited BdrvChild object. The caller is
+ * responsible for freeing the list afterwards.
+ */
+bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+                             GSList **visited, Transaction *tran,
+                             Error **errp)
+{
+    BdrvChild *c;
+    BdrvStateSetAioContext *state;
+
+    GLOBAL_STATE_CODE();
+
+    if (bdrv_get_aio_context(bs) == ctx) {
+        return true;
+    }
+
+    QLIST_FOREACH(c, &bs->parents, next_parent) {
+        if (!bdrv_parent_change_aio_context(c, ctx, visited, tran, errp)) {
+            return false;
+        }
+    }
+
+    QLIST_FOREACH(c, &bs->children, next) {
+        if (!bdrv_child_change_aio_context(c, ctx, visited, tran, errp)) {
+            return false;
+        }
+    }
+
+    state = g_new(BdrvStateSetAioContext, 1);
+    *state = (BdrvStateSetAioContext) {
+        .new_ctx = ctx,
+        .bs = bs,
+    };
+
+    /* First half of transactions are drains */
+    tran_add(tran, &drained_begin_end, state);
+    /*
+     * Second half of transactions takes care of setting the
+     * AioContext and free the state
+     */
+    tran_add_tail(tran, &set_aio_context, state);
+
+    return true;
+}
+
 int bdrv_child_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
                                    BdrvChild *ignore_child, Error **errp)
 {
@@ -7498,6 +7648,53 @@ int bdrv_child_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
     return 0;
 }
 
+/*
+ * First, go recursively through all nodes in the graph, and see if they
+ * can change their AioContext.
+ * If so, add for each node a new transaction with a callback to change the
+ * AioContext with the new one.
+ * Once recursion is completed, if all nodes are allowed to change their
+ * AioContext then commit the transaction, running all callbacks in order.
+ * Otherwise don't do anything.
+ *
+ * This function still requires the caller to take the bs current
+ * AioContext lock, otherwise draining could fail since AIO_WAIT_WHILE
+ * assumes the lock is always held if bs is in another AioContext.
+ */
+int bdrv_child_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+                                      BdrvChild *ignore_child, Error **errp)
+{
+    Transaction *tran;
+    GSList *visited;
+    int ret;
+    GLOBAL_STATE_CODE();
+
+    tran = tran_new();
+    visited = ignore_child ? g_slist_prepend(NULL, ignore_child) : NULL;
+    ret = bdrv_change_aio_context(bs, ctx, &visited, tran, errp);
+    g_slist_free(visited);
+
+    if (!ret) {
+        /* just run clean() callbacks */
+        tran_abort(tran);
+        return -EPERM;
+    }
+
+    /* Acquire the new context, if necessary */
+    /* TODO: why do we need to take this AioContext lock? */
+    if (qemu_get_aio_context() != ctx) {
+        aio_context_acquire(ctx);
+    }
+
+    tran_commit(tran);
+
+    if (qemu_get_aio_context() != ctx) {
+        aio_context_release(ctx);
+    }
+
+    return 0;
+}
+
 int bdrv_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
                              Error **errp)
 {
diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index 21265e3966..5847b64f95 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -229,6 +229,15 @@ bool bdrv_can_set_aio_context(BlockDriverState *bs, AioContext *ctx,
                               GSList **ignore, Error **errp);
 AioContext *bdrv_child_get_parent_aio_context(BdrvChild *c);
 
+bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
+                                   GSList **visited, Transaction *tran,
+                                   Error **errp);
+bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+                             GSList **visited, Transaction *tran,
+                             Error **errp);
+int bdrv_child_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+                                      BdrvChild *ignore_child, Error **errp);
+
 int bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz);
 int bdrv_probe_geometry(BlockDriverState *bs, HDGeometry *geo);
 
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 8947abab76..b35a0fe840 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -896,6 +896,9 @@ struct BdrvChildClass {
                         GSList **ignore, Error **errp);
     void (*set_aio_ctx)(BdrvChild *child, AioContext *ctx, GSList **ignore);
 
+    bool (*change_aio_ctx)(BdrvChild *child, AioContext *ctx,
+                           GSList **ignore, Transaction *tran, Error **errp);
+
     AioContext *(*get_parent_aio_context)(BdrvChild *child);
 
     /*
-- 
2.31.1

Re: [RFC PATCH 3/8] RFC: block: use transactions as a replacement of ->{can_}set_aio_context()

Posted by Vladimir Sementsov-Ogievskiy 3 years, 6 months ago

On 7/13/22 00:19, Emanuele Giuseppe Esposito wrote:
> +/*
> + * @visited will accumulate all visited BdrvChild object. The caller is
> + * responsible for freeing the list afterwards.
> + */
> +bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> +                             GSList **visited, Transaction *tran,
> +                             Error **errp)
> +{
> +    BdrvChild *c;
> +    BdrvStateSetAioContext *state;
> +
> +    GLOBAL_STATE_CODE();
> +
> +    if (bdrv_get_aio_context(bs) == ctx) {
> +        return true;
> +    }
> +
> +    QLIST_FOREACH(c, &bs->parents, next_parent) {
> +        if (!bdrv_parent_change_aio_context(c, ctx, visited, tran, errp)) {
> +            return false;
> +        }
> +    }
> +
> +    QLIST_FOREACH(c, &bs->children, next) {
> +        if (!bdrv_child_change_aio_context(c, ctx, visited, tran, errp)) {
> +            return false;
> +        }
> +    }
> +
> +    state = g_new(BdrvStateSetAioContext, 1);
> +    *state = (BdrvStateSetAioContext) {
> +        .new_ctx = ctx,
> +        .bs = bs,
> +    };
> +
> +    /* First half of transactions are drains */
> +    tran_add(tran, &drained_begin_end, state);
> +    /*
> +     * Second half of transactions takes care of setting the
> +     * AioContext and free the state
> +     */
> +    tran_add_tail(tran, &set_aio_context, state);
> +
> +    return true;
> +}

First, it looks like you want to use .commit() as .prepare(), .clean() as commit(), and .prepare() as something that just schedule other actions.

In block transaction any effect that is visible to other transaction actions should be done in .prepare(). (and .prepare is the main function of the action with *tran parameter, i.e. bdrv_change_aio_context() function is actually .prepare() phase).

So, ideally, the action:

  - does the complete thing in .prepare (i.e. in the main function of the action)
  - rollback it in .abort
  - nothing in .commit

Usually we still need a .commit phase for irreversible part of the action (like calling free() on the object). But that should be transparent for other actions.

So, generally we do:

tran = tran_new()

action_a(..., tran);
action_b(..., tran);
action_c(..., tran);

And we expect, that action_b() operates on top of action_a() already done. And action_c() operate on top of action_b() done. So we cannot postpone visible (to other actions) effect of the action to .commit phase.

So, for example, if you want a kind of smart drained section in transaction, you may add simple

bdrv_drain_tran(..., tran);

that just call drained_begin(), and have only .clean() handler that call drained_end(). This way you may do

bdrv_drain_tran(...., tran);
action_a(..., tran);
action_b(..., tran);

And be sure that both actions and all their .abort/.commit/.clean handlers are called inside drained seaction. Isn't it what you need?

Second, this all becomes extremely difficult when we mix transactions and recursion. When I moved permission system to transactions, I've introduced topological dfs search to just get a list of nodes to handle. I think, if we want to rework aio context change, we should first get rid of recursion.

-- 
Best regards,
Vladimir

Re: [RFC PATCH 3/8] RFC: block: use transactions as a replacement of ->{can_}set_aio_context()

Posted by Emanuele Giuseppe Esposito 3 years, 6 months ago


Am 20/07/2022 um 16:09 schrieb Vladimir Sementsov-Ogievskiy:
> On 7/13/22 00:19, Emanuele Giuseppe Esposito wrote:
>> +/*
>> + * @visited will accumulate all visited BdrvChild object. The caller is
>> + * responsible for freeing the list afterwards.
>> + */
>> +bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
>> +                             GSList **visited, Transaction *tran,
>> +                             Error **errp)
>> +{
>> +    BdrvChild *c;
>> +    BdrvStateSetAioContext *state;
>> +
>> +    GLOBAL_STATE_CODE();
>> +
>> +    if (bdrv_get_aio_context(bs) == ctx) {
>> +        return true;
>> +    }
>> +
>> +    QLIST_FOREACH(c, &bs->parents, next_parent) {
>> +        if (!bdrv_parent_change_aio_context(c, ctx, visited, tran,
>> errp)) {
>> +            return false;
>> +        }
>> +    }
>> +
>> +    QLIST_FOREACH(c, &bs->children, next) {
>> +        if (!bdrv_child_change_aio_context(c, ctx, visited, tran,
>> errp)) {
>> +            return false;
>> +        }
>> +    }
>> +
>> +    state = g_new(BdrvStateSetAioContext, 1);
>> +    *state = (BdrvStateSetAioContext) {
>> +        .new_ctx = ctx,
>> +        .bs = bs,
>> +    };
>> +
>> +    /* First half of transactions are drains */
>> +    tran_add(tran, &drained_begin_end, state);
>> +    /*
>> +     * Second half of transactions takes care of setting the
>> +     * AioContext and free the state
>> +     */
>> +    tran_add_tail(tran, &set_aio_context, state);
>> +
>> +    return true;
>> +}
> 
> 
> First, it looks like you want to use .commit() as .prepare(), .clean()
> as commit(), and .prepare() as something that just schedule other actions.
> 
> In block transaction any effect that is visible to other transaction
> actions should be done in .prepare(). (and .prepare is the main function
> of the action with *tran parameter, i.e. bdrv_change_aio_context()
> function is actually .prepare() phase).
> 
> So, ideally, the action:
> 
>  - does the complete thing in .prepare (i.e. in the main function of the
> action)
>  - rollback it in .abort
>  - nothing in .commit
> 
> Usually we still need a .commit phase for irreversible part of the
> action (like calling free() on the object). But that should be
> transparent for other actions.
> 
> So, generally we do:
> 
> tran = tran_new()
> 
> action_a(..., tran);
> action_b(..., tran);
> action_c(..., tran);
> 
> 
> And we expect, that action_b() operates on top of action_a() already
> done. And action_c() operate on top of action_b() done. So we cannot
> postpone visible (to other actions) effect of the action to .commit phase.
> 
> So, for example, if you want a kind of smart drained section in
> transaction, you may add simple
> 
> bdrv_drain_tran(..., tran);
> 
> that just call drained_begin(), and have only .clean() handler that call
> drained_end(). This way you may do
> 
> bdrv_drain_tran(...., tran);
> action_a(..., tran);
> action_b(..., tran);
> 
> And be sure that both actions and all their .abort/.commit/.clean
> handlers are called inside drained seaction. Isn't it what you need?

I understand how you see transactions, but I think we can also use them
in a different way than intended (with proper documentation).

I don't think it makes sense to use transactions as rollback in this
case, even though with the coming v2 it would be more similar to what
you propose:

.prepare takes care of drain and checking if the node is ok
.commit changes the aiocontext lock only if all nodes are drained and
passed the test
.clean undrains (drained end)

> 
> 
> Second, this all becomes extremely difficult when we mix transactions
> and recursion. When I moved permission system to transactions, I've
> introduced topological dfs search to just get a list of nodes to handle.
> I think, if we want to rework aio context change, we should first get
> rid of recursion.
> 

I honestly don't see how recursion complicates things. It's just graph
traversal, and building a list of things to do while doing that.

Isn't it much more complex to do it with a loop (ie not recursively?).
Your bdrv_topological_dfs is recursive too.

Think about it as 2 separate steps:
- Recursion takes care of checking the nodes and draining them
- Once done, if all nodes are ready to switch, switch linearly by
iterating in a list of callbacks

Anyways, I am probably not going to wait your feedback and send v2 just
because I think the way I did this double transaction list in v1
confuses people.

Feel free to continue the discussion here or in v2 directly, though.

Thank you,
Emanuele

Re: [RFC PATCH 3/8] RFC: block: use transactions as a replacement of ->{can_}set_aio_context()

Posted by Hanna Reitz 3 years, 6 months ago

On 12.07.22 23:19, Emanuele Giuseppe Esposito wrote:
> ---------
> RFC because I am not sure about the AioContext locks.
> - Do we need to take the new AioContext lock? what does it protect?
> - Taking the old AioContext lock is required now, because of
>    bdrv_drained_begin calling AIO_WAIT_WHILE that unlocks the
>    aiocontext. If we replace it with AIO_WAIT_WHILE_UNLOCKED,
>    could we get rid of taking every time the old AioContext?
>    drain would be enough to protect the graph modification.

It’s been a while, but as far as I remember (which may be wrong), the 
reason for how the locks are supposed to be taken was mostly that we 
need some defined state so that we know how to invoke 
bdrv_drained_begin() and bdrv_drained_end() (i.e. call the first one 
as-is, and switch the locks around for the latter one).

The idea of using _UNLOCKED sounds interesting, almost too obvious. I 
can’t see why that wouldn’t work, actually.

> ----------
>
> Simplify the way the aiocontext can be changed in a BDS graph.
> There are currently two problems in bdrv_try_set_aio_context:
> - There is a confusion of AioContext locks taken and released, because
>    we assume that old aiocontext is always taken and new one is
>    taken inside.

Yep, and that assumption is just broken in some cases, which is the main 
pain point I’m feeling with it right now.

For example, look at bdrv_attach_child_common(): Here, we attach a child 
to a parent, so we need to get them into a common AioContext. So first 
we try to put the child into the parent’s context, and if that fails, 
we’ll try the other way, putting the parent into the child’s context.

The problem is clear: The bdrv_try_set_aio_context() call requires us to 
hold the child’s current context but not the parent’s, and the 
child_class->set_aio_ctx() call requires the exact opposite.  But we 
never switch the context we have acquired, so this can’t both work.  
Better yet, nowhere is it defined what context a caller to 
bdrv_attach_child_common() will hold.

In practice, what happens here most of the time is that something will 
be moved from the main context to some other context, and since we’re in 
the main context already, that’ll just work.  But you can construct 
cases where something is attempted to be moved from an I/O thread into a 
different thread and then you’ll get a crash.

I’d be happy if we could do away with the requirement of having to hold 
any lock for changing a node’s AioContext.

> - It doesn't look very safe to call bdrv_drained_begin while some
>    nodes have already switched to the new aiocontext and others haven't.
>    This could be especially dangerous because bdrv_drained_begin polls, so
>    something else could be executed while graph is in an inconsistent
>    state.
>
> Additional minor nitpick: can_set and set_ callbacks both traverse the
> graph, both using the ignored list of visited nodes in a different way.
>
> Therefore, get rid of all of this and introduce a new callback,
> change_aio_context, that uses transactions to efficiently, cleanly
> and most importantly safely change the aiocontext of a graph.
>
> This new callback is a "merge" of the two previous ones:
> - Just like can_set_aio_context, recursively traverses the graph.
>    Marks all nodes that are visited using a GList, and checks if
>    they *could* change the aio_context.
> - For each node that passes the above check, add a new transaction
>    that implements a callback that effectively changes the aiocontext.
> - If a node is a BDS, add two transactions: one taking care of draining
>    the node at the beginning of the list (so that will be executed first)
>    and one at the end taking care of changing the AioContext.
> - Once done, the recursive function returns if *all* nodes can change
>    the AioContext. If so, commit the above transactions. Otherwise don't
>    do anything.
> - The transaction list contains first all "drain" transactions, so
>    we are sure we are draining all nodes in the same context, and then
>    all the other switch the AioContext. In this way we make sure that
>    bdrv_drained_begin() is always called under the old AioContext, and
>    bdrv_drained_end() under the new one.
> - Because of the above, we don't need to release and re-acquire the
>    old AioContext every time, as everything is done once (and not
>    per-node drain and aiocontext change).
>
> Note that the "change" API is not yet invoked anywhere.

So the idea is that we introduce a completely new transaction-based API 
to change BDSs’ AioContext, and then drop the old one, right?

> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   block.c                            | 197 +++++++++++++++++++++++++++++
>   include/block/block-global-state.h |   9 ++
>   include/block/block_int-common.h   |   3 +
>   3 files changed, 209 insertions(+)
>
> diff --git a/block.c b/block.c
> index 267a39c0de..bda4e1bcef 100644
> --- a/block.c
> +++ b/block.c
> @@ -7437,6 +7437,51 @@ static bool bdrv_parent_can_set_aio_context(BdrvChild *c, AioContext *ctx,
>       return true;
>   }
>   
> +typedef struct BdrvStateSetAioContext {
> +    AioContext *new_ctx;
> +    BlockDriverState *bs;
> +} BdrvStateSetAioContext;
> +
> +/*
> + * Changes the AioContext used for fd handlers, timers, and BHs by this
> + * BlockDriverState and all its children and parents.
> + *
> + * Must be called from the main AioContext.
> + *
> + * The caller must own the AioContext lock for the old AioContext of bs, but it
> + * must not own the AioContext lock for new_context (unless new_context is the
> + * same as the current context of bs).

:(  Too bad.

> + *
> + * @visited will accumulate all visited BdrvChild object. The caller is

*objects

> + * responsible for freeing the list afterwards.
> + */
> +static bool bdrv_parent_change_aio_context(BdrvChild *c, AioContext *ctx,

The comment says “this BlockDriverState”, but then this function is 
called “_parent_” and takes a BdrvChild object.  Not quite clear what is 
meant.

You could argue it doesn’t matter because the function acts on all 
parents and children recursively anyway, so it’s doesn’t matter whether 
this is about c->bs or c’s parent, which is true, but still, the wording 
doesn’t seem quite as clear to me as it could be.

I also wonder why this function is so syntactically focused on BdrvChild 
object when I would have thought that those objects aren’t really 
associated with any AioContext, only the BDS is.  A BdrvChild just gives 
you a way to change a BDS’s parents’ AioContext.  Is it because of 
bdrv_child_try_change_aio_context()’s ignore_child parameter?

> +                                           GSList **visited, Transaction *tran,
> +                                           Error **errp)
> +{
> +    GLOBAL_STATE_CODE();
> +    if (g_slist_find(*visited, c)) {
> +        return true;
> +    }
> +    *visited = g_slist_prepend(*visited, c);
> +
> +    /*
> +     * A BdrvChildClass that doesn't handle AioContext changes cannot
> +     * tolerate any AioContext changes
> +     */
> +    if (!c->klass->change_aio_ctx) {
> +        char *user = bdrv_child_user_desc(c);
> +        error_setg(errp, "Changing iothreads is not supported by %s", user);
> +        g_free(user);
> +        return false;
> +    }
> +    if (!c->klass->change_aio_ctx(c, ctx, visited, tran, errp)) {
> +        assert(!errp || *errp);
> +        return false;
> +    }

This makes me wonder how this is supposed to change the children’s 
AioContext, because traditionally, BdrvChildClass contains only 
callbacks that affect the parent.

> +    return true;
> +}
> +
>   bool bdrv_child_can_set_aio_context(BdrvChild *c, AioContext *ctx,
>                                       GSList **ignore, Error **errp)
>   {
> @@ -7448,6 +7493,18 @@ bool bdrv_child_can_set_aio_context(BdrvChild *c, AioContext *ctx,
>       return bdrv_can_set_aio_context(c->bs, ctx, ignore, errp);
>   }
>   
> +bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
> +                                   GSList **visited, Transaction *tran,
> +                                   Error **errp)
> +{
> +    GLOBAL_STATE_CODE();
> +    if (g_slist_find(*visited, c)) {

Not the first time that we’re using linear lookup for such a thing, but 
it really irks me badly.  I think in any language that would provide 
collections by default, we’d be using a hash map here.

> +        return true;
> +    }
> +    *visited = g_slist_prepend(*visited, c);
> +    return bdrv_change_aio_context(c->bs, ctx, visited, tran, errp);
> +}
> +

OK, so this looks like the function that will change the children’s 
AioContext, and now I guess we’ll have a function that invokes both 
bdrv_{child,parent}_change_aio_context()...  And yes, we do, that is 
bdrv_change_aio_context().  So now it looks to me like the comment 
that’s above bdrv_parent_change_aio_context() actually belongs to 
bdrv_change_aio_context().

>   /* @ignore will accumulate all visited BdrvChild object. The caller is
>    * responsible for freeing the list afterwards. */
>   bool bdrv_can_set_aio_context(BlockDriverState *bs, AioContext *ctx,
> @@ -7475,6 +7532,99 @@ bool bdrv_can_set_aio_context(BlockDriverState *bs, AioContext *ctx,
>       return true;
>   }
>   
> +static void bdrv_drained_begin_commit(void *opaque)
> +{
> +    BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
> +
> +    /* Paired with bdrv_drained_end in bdrv_drained_end_clean() */
> +    bdrv_drained_begin(state->bs);
> +}
> +
> +static void bdrv_drained_end_clean(void *opaque)
> +{
> +    BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
> +    BlockDriverState *bs = (BlockDriverState *) state->bs;
> +    AioContext *new_context = state->new_ctx;
> +
> +    /*
> +     * drain only if bdrv_drained_begin_commit() and
> +     * bdrv_set_aio_context_commit() executed
> +     */
> +    if (bdrv_get_aio_context(bs) == new_context) {
> +        /* Paired with bdrv_drained_begin in bdrv_drained_begin_commit() */
> +        bdrv_drained_end(bs);
> +    }

So as far as I understood, in bdrv_set_aio_context_ignore(), we switch 
around the currently held context because this will poll bs’s context, 
and so because bs’s context changed, we now need to hold the new 
context.  I don’t know off the top of my head whether there’s a problem 
with holding both contexts simultaneously.

> +
> +    /* state is freed by set_aio_context->clean() */
> +}
> +
> +static void bdrv_set_aio_context_commit(void *opaque)
> +{
> +    BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
> +    BlockDriverState *bs = (BlockDriverState *) state->bs;
> +    AioContext *new_context = state->new_ctx;
> +    assert_bdrv_graph_writable(bs);
> +
> +    bdrv_detach_aio_context(bs);
> +    bdrv_attach_aio_context(bs, new_context);
> +}
> +
> +static TransactionActionDrv set_aio_context = {
> +    .commit = bdrv_set_aio_context_commit,
> +    .clean = g_free,
> +};
> +
> +static TransactionActionDrv drained_begin_end = {
> +    .commit = bdrv_drained_begin_commit,
> +    .clean = bdrv_drained_end_clean,
> +};
> +
> +/*
> + * @visited will accumulate all visited BdrvChild object. The caller is
> + * responsible for freeing the list afterwards.
> + */
> +bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> +                             GSList **visited, Transaction *tran,
> +                             Error **errp)
> +{
> +    BdrvChild *c;
> +    BdrvStateSetAioContext *state;
> +
> +    GLOBAL_STATE_CODE();
> +
> +    if (bdrv_get_aio_context(bs) == ctx) {
> +        return true;
> +    }
> +
> +    QLIST_FOREACH(c, &bs->parents, next_parent) {
> +        if (!bdrv_parent_change_aio_context(c, ctx, visited, tran, errp)) {
> +            return false;
> +        }
> +    }
> +
> +    QLIST_FOREACH(c, &bs->children, next) {
> +        if (!bdrv_child_change_aio_context(c, ctx, visited, tran, errp)) {
> +            return false;
> +        }
> +    }
> +
> +    state = g_new(BdrvStateSetAioContext, 1);
> +    *state = (BdrvStateSetAioContext) {
> +        .new_ctx = ctx,
> +        .bs = bs,
> +    };
> +
> +    /* First half of transactions are drains */
> +    tran_add(tran, &drained_begin_end, state);
> +    /*
> +     * Second half of transactions takes care of setting the
> +     * AioContext and free the state
> +     */
> +    tran_add_tail(tran, &set_aio_context, state);

Completely subjective and I struggle to explain my feelings:

I don’t quite like the “first half, second half” explanation.  This 
makes it sound like these should be separate phases in the transaction, 
and if so, we should have an interface that allows you to put 
transaction handlers into an arbitrary number of phases.

We don’t have that interface, though, all we have are functions that 
carry the connotation of “execute as soon as possible” and “execute as 
late as possible”, so that if you use both in conjunction like here, you 
can be certain that drained_begin_end is executed before the respective 
set_aio_context.

The comments also struggle to convey how much black magic this is.

(Note that this is just my POV; also note that I’m not judging, black 
magic is cool, just difficult to grasp)

So what’s going on is (AFAIU) that we have this black magic 
drained_begin_end transaction which is added such that its commit 
handler will be run before the set_aio_context’s commit handler, but its 
clean handler will be run after set_aio_context’s commit handler.  So 
that’s why I call it black magic, because drained_begin_end can do both, 
it’ll first begin the drain, and then end it, all around some normal 
other transaction, such that this other one can operate on a completely 
drained graph.

I don’t hate this, it’s just that the comments don’t convey it.  I think 
there should be a big comment above drained_begin_end describing how 
it’s supposed to be used (i.e. you can wrap some other transaction in 
it, if you take care to add that other transaction with tran_add_tail(), 
and then its commit handlers will run on a drained graph); and then here 
we just describe that we’re doing exactly that, adding both transactions 
such that set_aio_context’s commit handler will run on a drained graph.

What I don’t like is that a BdrvChildClass.change_aio_ctx implementation 
must know this.  If it adds handlers to the transaction, it must be 
aware that if it does so with tran_add(), the commit handler won’t 
necessarily run on a fully drained graph; it must use tran_add_tail() 
for this, which isn’t obvious (and might be called a layering violation).

Perhaps that doesn’t matter because none of those .change_aio_ctx commit 
handlers will care, but, well.

The point is that the drained_begin_end transaction kind of abuses the 
system and adds new non-obvious semantics.

I can’t think of something better off the top of my head, though, and as 
long as this will remain the only place to call tran_add_tail(), I 
probably shouldn’t care.

...Ah, I can see patch 4 and 6 add exactly what I had feared.  Well, 
that isn’t great. :/

> +
> +    return true;
> +}
> +
>   int bdrv_child_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
>                                      BdrvChild *ignore_child, Error **errp)
>   {
> @@ -7498,6 +7648,53 @@ int bdrv_child_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
>       return 0;
>   }
>   
> +/*
> + * First, go recursively through all nodes in the graph, and see if they
> + * can change their AioContext.
> + * If so, add for each node a new transaction with a callback to change the
> + * AioContext with the new one.
> + * Once recursion is completed, if all nodes are allowed to change their
> + * AioContext then commit the transaction, running all callbacks in order.
> + * Otherwise don't do anything.

Nice explanation, but I’d start with something more simple to describe 
the external interface, like “Change bs's and recursively all of its 
parents' and children's AioContext to the given new context, returning 
an error if that isn't possible.  If ignore_child is not NULL, that 
child (and its subgraph) will not be touched.”

> + *
> + * This function still requires the caller to take the bs current
> + * AioContext lock, otherwise draining could fail since AIO_WAIT_WHILE

Well, let’s just say “will” instead of “could”.  Callers don’t need to 
know they can get lucky when they ignore the rules.

> + * assumes the lock is always held if bs is in another AioContext.
> + */
> +int bdrv_child_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> +                                      BdrvChild *ignore_child, Error **errp)

Do the other functions need to be public?  Outside of this file, this 
one seems sufficient to me, but I may well be overlooking something.

Also, why is this function called bdrv_child_*, and not just 
bdrv_try_change_aio_context()?  (Or maybe have a 
bdrv_try_change_aio_context() variant without the ignore_child 
parameter, just like bdrv_try_set_aio_context()?)

Actually, wouldn’t just bdrv_change_aio_context() be sufficient?  I 
mean, it isn’t called bdrv_try_co_pwritev().  Most functions try to do 
something and return errors if they can’t.  Not sure if we need to make 
that clear in the name.

(I may be wrong, but I suspect bdrv_try_set_aio_context() is only named 
such that the compiler will check that the bdrv_set_aio_context() 
interface is completely unused; 
42a65f02f9b380bd8074882d5844d4ea033389cc’s commit message does seem to 
confirm that.)

> +{
> +    Transaction *tran;
> +    GSList *visited;
> +    int ret;
> +    GLOBAL_STATE_CODE();
> +
> +    tran = tran_new();
> +    visited = ignore_child ? g_slist_prepend(NULL, ignore_child) : NULL;
> +    ret = bdrv_change_aio_context(bs, ctx, &visited, tran, errp);
> +    g_slist_free(visited);
> +
> +    if (!ret) {
> +        /* just run clean() callbacks */
> +        tran_abort(tran);
> +        return -EPERM;
> +    }
> +
> +    /* Acquire the new context, if necessary */
> +    /* TODO: why do we need to take this AioContext lock? */

As I’ve said somewhere above, I think it’s because we need it for 
bdrv_drained_end().  I don’t know if there’s a problem with holding both 
contexts simultaneously.  (I would’ve just assumed that there must be a 
reason for bdrv_set_aio_context_ignore() to release old_context, but 
maybe there is actually no reason to it?)

> +    if (qemu_get_aio_context() != ctx) {
> +        aio_context_acquire(ctx);
> +    }
> +
> +    tran_commit(tran);
> +
> +    if (qemu_get_aio_context() != ctx) {
> +        aio_context_release(ctx);
> +    }
> +
> +    return 0;
> +}

Might be nice to have a version that has @tran in its interface so 
callers can incorporate it; Vladimir wanted transactionable 
context-setting.  Then again, doesn’t seem to hard to introduce that 
when needed.

> +
>   int bdrv_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
>                                Error **errp)
>   {

I like the idea of reimplementing it all on top of transactions!  I 
think it needs some tuning, but it definitely looks doable.

Hanna

Re: [RFC PATCH 3/8] RFC: block: use transactions as a replacement of ->{can_}set_aio_context()

Posted by Emanuele Giuseppe Esposito 3 years, 6 months ago


Am 14/07/2022 um 18:45 schrieb Hanna Reitz:
>> + * First, go recursively through all nodes in the graph, and see if they
>> + * can change their AioContext.
>> + * If so, add for each node a new transaction with a callback to
>> change the
>> + * AioContext with the new one.
>> + * Once recursion is completed, if all nodes are allowed to change their
>> + * AioContext then commit the transaction, running all callbacks in
>> order.
>> + * Otherwise don't do anything.
> 
> Nice explanation, but I’d start with something more simple to describe
> the external interface, like “Change bs's and recursively all of its
> parents' and children's AioContext to the given new context, returning
> an error if that isn't possible.  If ignore_child is not NULL, that
> child (and its subgraph) will not be touched.”
You want to have your suggestion as a replacement or addition to mine?
> 
>> + *
>> + * This function still requires the caller to take the bs current
>> + * AioContext lock, otherwise draining could fail since AIO_WAIT_WHILE
> 
> Well, let’s just say “will” instead of “could”.  Callers don’t need to
> know they can get lucky when they ignore the rules.
> 
>> + * assumes the lock is always held if bs is in another AioContext.
>> + */
>> +int bdrv_child_try_change_aio_context(BlockDriverState *bs,
>> AioContext *ctx,
>> +                                      BdrvChild *ignore_child, Error
>> **errp)
> 
> Do the other functions need to be public?  Outside of this file, this
> one seems sufficient to me, but I may well be overlooking something.
> 
> Also, why is this function called bdrv_child_*, and not just
> bdrv_try_change_aio_context()?  (Or maybe have a
> bdrv_try_change_aio_context() variant without the ignore_child
> parameter, just like bdrv_try_set_aio_context()?)
> 
> Actually, wouldn’t just bdrv_change_aio_context() be sufficient?  I
> mean, it isn’t called bdrv_try_co_pwritev().  Most functions try to do
> something and return errors if they can’t.  Not sure if we need to make
> that clear in the name.
> 
> (I may be wrong, but I suspect bdrv_try_set_aio_context() is only named
> such that the compiler will check that the bdrv_set_aio_context()
> interface is completely unused;
> 42a65f02f9b380bd8074882d5844d4ea033389cc’s commit message does seem to
> confirm that.)

Ok, I will change the name in bdrv_change_aio_context. And I now
understand your suggestion on the last patch to remove
bdrv_try_set_aio_context.

Emanuele

Re: [RFC PATCH 3/8] RFC: block: use transactions as a replacement of ->{can_}set_aio_context()

Posted by Emanuele Giuseppe Esposito 3 years, 6 months ago


Am 14/07/2022 um 18:45 schrieb Hanna Reitz:
> On 12.07.22 23:19, Emanuele Giuseppe Esposito wrote:
>> ---------
>> RFC because I am not sure about the AioContext locks.
>> - Do we need to take the new AioContext lock? what does it protect?
>> - Taking the old AioContext lock is required now, because of
>>    bdrv_drained_begin calling AIO_WAIT_WHILE that unlocks the
>>    aiocontext. If we replace it with AIO_WAIT_WHILE_UNLOCKED,
>>    could we get rid of taking every time the old AioContext?
>>    drain would be enough to protect the graph modification.
> 
> It’s been a while, but as far as I remember (which may be wrong), the
> reason for how the locks are supposed to be taken was mostly that we
> need some defined state so that we know how to invoke
> bdrv_drained_begin() and bdrv_drained_end() (i.e. call the first one
> as-is, and switch the locks around for the latter one).

Right, so bdrv_drained_begin must be always under old aiocontext lock,
bdrv_drained_end always under the new one.

If so, then I think I can make the whole thing even cleaner:

1. bdrv_drained_begin is taken in bdrv_change_aio_context, and not in a
transaction. This was the initial idea, but somehow it didn't work,
something was failing. However, I must have changed something now
because all tests are passing :)
Then old aiocontext lock is assumed to be taken by the caller.

2. old aiocontext lock is released

3. set_aio_context (actual detach + attach operation) are kept in a
commit transaciton, as it is now. They don't need any aiocontext lock,
as far as I understand. bdrv_drained_end is implemented in a .clean
transaction

4. new aiocontext lock is taken

5. either tran_abort or tran_commit, depending on the result of the
recursion.

Disadvantage: the unlock/lock mess is still there, but at least we know
now why we need that.
Advantage: nothing should break (because it is similar as it was
before), no double lock held, and I can always add an additional patch
where I use _UNLOCKED functions and see that happens. In addition, no
tran_add_tail needed.

> 
> The idea of using _UNLOCKED sounds interesting, almost too obvious. I
> can’t see why that wouldn’t work, actually.
> 
>> ----------
>>
>> Simplify the way the aiocontext can be changed in a BDS graph.
>> There are currently two problems in bdrv_try_set_aio_context:
>> - There is a confusion of AioContext locks taken and released, because
>>    we assume that old aiocontext is always taken and new one is
>>    taken inside.
> 
> Yep, and that assumption is just broken in some cases, which is the main
> pain point I’m feeling with it right now.
> 
> For example, look at bdrv_attach_child_common(): Here, we attach a child
> to a parent, so we need to get them into a common AioContext. So first
> we try to put the child into the parent’s context, and if that fails,
> we’ll try the other way, putting the parent into the child’s context.
> 
> The problem is clear: The bdrv_try_set_aio_context() call requires us to
> hold the child’s current context but not the parent’s, and the
> child_class->set_aio_ctx() call requires the exact opposite.  But we
> never switch the context we have acquired, so this can’t both work. 
> Better yet, nowhere is it defined what context a caller to
> bdrv_attach_child_common() will hold.
> 
> In practice, what happens here most of the time is that something will
> be moved from the main context to some other context, and since we’re in
> the main context already, that’ll just work.  But you can construct
> cases where something is attempted to be moved from an I/O thread into a
> different thread and then you’ll get a crash.
> 
> I’d be happy if we could do away with the requirement of having to hold
> any lock for changing a node’s AioContext.
> 
>> - It doesn't look very safe to call bdrv_drained_begin while some
>>    nodes have already switched to the new aiocontext and others haven't.
>>    This could be especially dangerous because bdrv_drained_begin
>> polls, so
>>    something else could be executed while graph is in an inconsistent
>>    state.
>>
>> Additional minor nitpick: can_set and set_ callbacks both traverse the
>> graph, both using the ignored list of visited nodes in a different way.
>>
>> Therefore, get rid of all of this and introduce a new callback,
>> change_aio_context, that uses transactions to efficiently, cleanly
>> and most importantly safely change the aiocontext of a graph.
>>
>> This new callback is a "merge" of the two previous ones:
>> - Just like can_set_aio_context, recursively traverses the graph.
>>    Marks all nodes that are visited using a GList, and checks if
>>    they *could* change the aio_context.
>> - For each node that passes the above check, add a new transaction
>>    that implements a callback that effectively changes the aiocontext.
>> - If a node is a BDS, add two transactions: one taking care of draining
>>    the node at the beginning of the list (so that will be executed first)
>>    and one at the end taking care of changing the AioContext.
>> - Once done, the recursive function returns if *all* nodes can change
>>    the AioContext. If so, commit the above transactions. Otherwise don't
>>    do anything.
>> - The transaction list contains first all "drain" transactions, so
>>    we are sure we are draining all nodes in the same context, and then
>>    all the other switch the AioContext. In this way we make sure that
>>    bdrv_drained_begin() is always called under the old AioContext, and
>>    bdrv_drained_end() under the new one.
>> - Because of the above, we don't need to release and re-acquire the
>>    old AioContext every time, as everything is done once (and not
>>    per-node drain and aiocontext change).
>>
>> Note that the "change" API is not yet invoked anywhere.
> 
> So the idea is that we introduce a completely new transaction-based API
> to change BDSs’ AioContext, and then drop the old one, right?
> 
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> ---
>>   block.c                            | 197 +++++++++++++++++++++++++++++
>>   include/block/block-global-state.h |   9 ++
>>   include/block/block_int-common.h   |   3 +
>>   3 files changed, 209 insertions(+)
>>
>> diff --git a/block.c b/block.c
>> index 267a39c0de..bda4e1bcef 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -7437,6 +7437,51 @@ static bool
>> bdrv_parent_can_set_aio_context(BdrvChild *c, AioContext *ctx,
>>       return true;
>>   }
>>   +typedef struct BdrvStateSetAioContext {
>> +    AioContext *new_ctx;
>> +    BlockDriverState *bs;
>> +} BdrvStateSetAioContext;
>> +
>> +/*
>> + * Changes the AioContext used for fd handlers, timers, and BHs by this
>> + * BlockDriverState and all its children and parents.
>> + *
>> + * Must be called from the main AioContext.
>> + *
>> + * The caller must own the AioContext lock for the old AioContext of
>> bs, but it
>> + * must not own the AioContext lock for new_context (unless
>> new_context is the
>> + * same as the current context of bs).
> 
> :(  Too bad.
> 
>> + *
>> + * @visited will accumulate all visited BdrvChild object. The caller is
> 
> *objects
> 
>> + * responsible for freeing the list afterwards.
>> + */
>> +static bool bdrv_parent_change_aio_context(BdrvChild *c, AioContext
>> *ctx,
> 
> The comment says “this BlockDriverState”, but then this function is
> called “_parent_” and takes a BdrvChild object.  Not quite clear what is
> meant.
> 
> You could argue it doesn’t matter because the function acts on all
> parents and children recursively anyway, so it’s doesn’t matter whether
> this is about c->bs or c’s parent, which is true, but still, the wording
> doesn’t seem quite as clear to me as it could be.
> 
> I also wonder why this function is so syntactically focused on BdrvChild
> object when I would have thought that those objects aren’t really
> associated with any AioContext, only the BDS is.  A BdrvChild just gives
> you a way to change a BDS’s parents’ AioContext.  Is it because of
> bdrv_child_try_change_aio_context()’s ignore_child parameter?
> 
>> +                                           GSList **visited,
>> Transaction *tran,
>> +                                           Error **errp)
>> +{
>> +    GLOBAL_STATE_CODE();
>> +    if (g_slist_find(*visited, c)) {
>> +        return true;
>> +    }
>> +    *visited = g_slist_prepend(*visited, c);
>> +
>> +    /*
>> +     * A BdrvChildClass that doesn't handle AioContext changes cannot
>> +     * tolerate any AioContext changes
>> +     */
>> +    if (!c->klass->change_aio_ctx) {
>> +        char *user = bdrv_child_user_desc(c);
>> +        error_setg(errp, "Changing iothreads is not supported by %s",
>> user);
>> +        g_free(user);
>> +        return false;
>> +    }
>> +    if (!c->klass->change_aio_ctx(c, ctx, visited, tran, errp)) {
>> +        assert(!errp || *errp);
>> +        return false;
>> +    }
> 
> This makes me wonder how this is supposed to change the children’s
> AioContext, because traditionally, BdrvChildClass contains only
> callbacks that affect the parent.
> 
>> +    return true;
>> +}
>> +
>>   bool bdrv_child_can_set_aio_context(BdrvChild *c, AioContext *ctx,
>>                                       GSList **ignore, Error **errp)
>>   {
>> @@ -7448,6 +7493,18 @@ bool bdrv_child_can_set_aio_context(BdrvChild
>> *c, AioContext *ctx,
>>       return bdrv_can_set_aio_context(c->bs, ctx, ignore, errp);
>>   }
>>   +bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
>> +                                   GSList **visited, Transaction *tran,
>> +                                   Error **errp)
>> +{
>> +    GLOBAL_STATE_CODE();
>> +    if (g_slist_find(*visited, c)) {
> 
> Not the first time that we’re using linear lookup for such a thing, but
> it really irks me badly.  I think in any language that would provide
> collections by default, we’d be using a hash map here.
> 
>> +        return true;
>> +    }
>> +    *visited = g_slist_prepend(*visited, c);
>> +    return bdrv_change_aio_context(c->bs, ctx, visited, tran, errp);
>> +}
>> +
> 
> OK, so this looks like the function that will change the children’s
> AioContext, and now I guess we’ll have a function that invokes both
> bdrv_{child,parent}_change_aio_context()...  And yes, we do, that is
> bdrv_change_aio_context().  So now it looks to me like the comment
> that’s above bdrv_parent_change_aio_context() actually belongs to
> bdrv_change_aio_context().
> 
>>   /* @ignore will accumulate all visited BdrvChild object. The caller is
>>    * responsible for freeing the list afterwards. */
>>   bool bdrv_can_set_aio_context(BlockDriverState *bs, AioContext *ctx,
>> @@ -7475,6 +7532,99 @@ bool bdrv_can_set_aio_context(BlockDriverState
>> *bs, AioContext *ctx,
>>       return true;
>>   }
>>   +static void bdrv_drained_begin_commit(void *opaque)
>> +{
>> +    BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
>> +
>> +    /* Paired with bdrv_drained_end in bdrv_drained_end_clean() */
>> +    bdrv_drained_begin(state->bs);
>> +}
>> +
>> +static void bdrv_drained_end_clean(void *opaque)
>> +{
>> +    BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
>> +    BlockDriverState *bs = (BlockDriverState *) state->bs;
>> +    AioContext *new_context = state->new_ctx;
>> +
>> +    /*
>> +     * drain only if bdrv_drained_begin_commit() and
>> +     * bdrv_set_aio_context_commit() executed
>> +     */
>> +    if (bdrv_get_aio_context(bs) == new_context) {
>> +        /* Paired with bdrv_drained_begin in
>> bdrv_drained_begin_commit() */
>> +        bdrv_drained_end(bs);
>> +    }
> 
> So as far as I understood, in bdrv_set_aio_context_ignore(), we switch
> around the currently held context because this will poll bs’s context,
> and so because bs’s context changed, we now need to hold the new
> context.  I don’t know off the top of my head whether there’s a problem
> with holding both contexts simultaneously.
> 
>> +
>> +    /* state is freed by set_aio_context->clean() */
>> +}
>> +
>> +static void bdrv_set_aio_context_commit(void *opaque)
>> +{
>> +    BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
>> +    BlockDriverState *bs = (BlockDriverState *) state->bs;
>> +    AioContext *new_context = state->new_ctx;
>> +    assert_bdrv_graph_writable(bs);
>> +
>> +    bdrv_detach_aio_context(bs);
>> +    bdrv_attach_aio_context(bs, new_context);
>> +}
>> +
>> +static TransactionActionDrv set_aio_context = {
>> +    .commit = bdrv_set_aio_context_commit,
>> +    .clean = g_free,
>> +};
>> +
>> +static TransactionActionDrv drained_begin_end = {
>> +    .commit = bdrv_drained_begin_commit,
>> +    .clean = bdrv_drained_end_clean,
>> +};
>> +
>> +/*
>> + * @visited will accumulate all visited BdrvChild object. The caller is
>> + * responsible for freeing the list afterwards.
>> + */
>> +bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
>> +                             GSList **visited, Transaction *tran,
>> +                             Error **errp)
>> +{
>> +    BdrvChild *c;
>> +    BdrvStateSetAioContext *state;
>> +
>> +    GLOBAL_STATE_CODE();
>> +
>> +    if (bdrv_get_aio_context(bs) == ctx) {
>> +        return true;
>> +    }
>> +
>> +    QLIST_FOREACH(c, &bs->parents, next_parent) {
>> +        if (!bdrv_parent_change_aio_context(c, ctx, visited, tran,
>> errp)) {
>> +            return false;
>> +        }
>> +    }
>> +
>> +    QLIST_FOREACH(c, &bs->children, next) {
>> +        if (!bdrv_child_change_aio_context(c, ctx, visited, tran,
>> errp)) {
>> +            return false;
>> +        }
>> +    }
>> +
>> +    state = g_new(BdrvStateSetAioContext, 1);
>> +    *state = (BdrvStateSetAioContext) {
>> +        .new_ctx = ctx,
>> +        .bs = bs,
>> +    };
>> +
>> +    /* First half of transactions are drains */
>> +    tran_add(tran, &drained_begin_end, state);
>> +    /*
>> +     * Second half of transactions takes care of setting the
>> +     * AioContext and free the state
>> +     */
>> +    tran_add_tail(tran, &set_aio_context, state);
> 
> Completely subjective and I struggle to explain my feelings:
> 
> I don’t quite like the “first half, second half” explanation.  This
> makes it sound like these should be separate phases in the transaction,
> and if so, we should have an interface that allows you to put
> transaction handlers into an arbitrary number of phases.
> 
> We don’t have that interface, though, all we have are functions that
> carry the connotation of “execute as soon as possible” and “execute as
> late as possible”, so that if you use both in conjunction like here, you
> can be certain that drained_begin_end is executed before the respective
> set_aio_context.
> 
> The comments also struggle to convey how much black magic this is.
> 
> (Note that this is just my POV; also note that I’m not judging, black
> magic is cool, just difficult to grasp)
> 
> So what’s going on is (AFAIU) that we have this black magic
> drained_begin_end transaction which is added such that its commit
> handler will be run before the set_aio_context’s commit handler, but its
> clean handler will be run after set_aio_context’s commit handler.  So
> that’s why I call it black magic, because drained_begin_end can do both,
> it’ll first begin the drain, and then end it, all around some normal
> other transaction, such that this other one can operate on a completely
> drained graph.
> 
> I don’t hate this, it’s just that the comments don’t convey it.  I think
> there should be a big comment above drained_begin_end describing how
> it’s supposed to be used (i.e. you can wrap some other transaction in
> it, if you take care to add that other transaction with tran_add_tail(),
> and then its commit handlers will run on a drained graph); and then here
> we just describe that we’re doing exactly that, adding both transactions
> such that set_aio_context’s commit handler will run on a drained graph.
> 
> 
> What I don’t like is that a BdrvChildClass.change_aio_ctx implementation
> must know this.  If it adds handlers to the transaction, it must be
> aware that if it does so with tran_add(), the commit handler won’t
> necessarily run on a fully drained graph; it must use tran_add_tail()
> for this, which isn’t obvious (and might be called a layering violation).
> 
> Perhaps that doesn’t matter because none of those .change_aio_ctx commit
> handlers will care, but, well.
> 
> The point is that the drained_begin_end transaction kind of abuses the
> system and adds new non-obvious semantics.
> 
> I can’t think of something better off the top of my head, though, and as
> long as this will remain the only place to call tran_add_tail(), I
> probably shouldn’t care.
> 
> ...Ah, I can see patch 4 and 6 add exactly what I had feared.  Well,
> that isn’t great. :/
> 
>> +
>> +    return true;
>> +}
>> +
>>   int bdrv_child_try_set_aio_context(BlockDriverState *bs, AioContext
>> *ctx,
>>                                      BdrvChild *ignore_child, Error
>> **errp)
>>   {
>> @@ -7498,6 +7648,53 @@ int
>> bdrv_child_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
>>       return 0;
>>   }
>>   +/*
>> + * First, go recursively through all nodes in the graph, and see if they
>> + * can change their AioContext.
>> + * If so, add for each node a new transaction with a callback to
>> change the
>> + * AioContext with the new one.
>> + * Once recursion is completed, if all nodes are allowed to change their
>> + * AioContext then commit the transaction, running all callbacks in
>> order.
>> + * Otherwise don't do anything.
> 
> Nice explanation, but I’d start with something more simple to describe
> the external interface, like “Change bs's and recursively all of its
> parents' and children's AioContext to the given new context, returning
> an error if that isn't possible.  If ignore_child is not NULL, that
> child (and its subgraph) will not be touched.”
> 
>> + *
>> + * This function still requires the caller to take the bs current
>> + * AioContext lock, otherwise draining could fail since AIO_WAIT_WHILE
> 
> Well, let’s just say “will” instead of “could”.  Callers don’t need to
> know they can get lucky when they ignore the rules.
> 
>> + * assumes the lock is always held if bs is in another AioContext.
>> + */
>> +int bdrv_child_try_change_aio_context(BlockDriverState *bs,
>> AioContext *ctx,
>> +                                      BdrvChild *ignore_child, Error
>> **errp)
> 
> Do the other functions need to be public?  Outside of this file, this
> one seems sufficient to me, but I may well be overlooking something.
> 
> Also, why is this function called bdrv_child_*, and not just
> bdrv_try_change_aio_context()?  (Or maybe have a
> bdrv_try_change_aio_context() variant without the ignore_child
> parameter, just like bdrv_try_set_aio_context()?)
> 
> Actually, wouldn’t just bdrv_change_aio_context() be sufficient?  I
> mean, it isn’t called bdrv_try_co_pwritev().  Most functions try to do
> something and return errors if they can’t.  Not sure if we need to make
> that clear in the name.
> 
> (I may be wrong, but I suspect bdrv_try_set_aio_context() is only named
> such that the compiler will check that the bdrv_set_aio_context()
> interface is completely unused;
> 42a65f02f9b380bd8074882d5844d4ea033389cc’s commit message does seem to
> confirm that.)
> 
>> +{
>> +    Transaction *tran;
>> +    GSList *visited;
>> +    int ret;
>> +    GLOBAL_STATE_CODE();
>> +
>> +    tran = tran_new();
>> +    visited = ignore_child ? g_slist_prepend(NULL, ignore_child) : NULL;
>> +    ret = bdrv_change_aio_context(bs, ctx, &visited, tran, errp);
>> +    g_slist_free(visited);
>> +
>> +    if (!ret) {
>> +        /* just run clean() callbacks */
>> +        tran_abort(tran);
>> +        return -EPERM;
>> +    }
>> +
>> +    /* Acquire the new context, if necessary */
>> +    /* TODO: why do we need to take this AioContext lock? */
> 
> As I’ve said somewhere above, I think it’s because we need it for
> bdrv_drained_end().  I don’t know if there’s a problem with holding both
> contexts simultaneously.  (I would’ve just assumed that there must be a
> reason for bdrv_set_aio_context_ignore() to release old_context, but
> maybe there is actually no reason to it?)
> 
>> +    if (qemu_get_aio_context() != ctx) {
>> +        aio_context_acquire(ctx);
>> +    }
>> +
>> +    tran_commit(tran);
>> +
>> +    if (qemu_get_aio_context() != ctx) {
>> +        aio_context_release(ctx);
>> +    }
>> +
>> +    return 0;
>> +}
> 
> Might be nice to have a version that has @tran in its interface so
> callers can incorporate it; Vladimir wanted transactionable
> context-setting.  Then again, doesn’t seem to hard to introduce that
> when needed.
> 
>> +
>>   int bdrv_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
>>                                Error **errp)
>>   {
> 
> I like the idea of reimplementing it all on top of transactions!  I
> think it needs some tuning, but it definitely looks doable.
> 
> Hanna
>

[RFC PATCH 1/8] block.c: assert bs->aio_context is written under BQL and drains
[RFC PATCH 2/8] transactions: add tran_add_back
[RFC PATCH 3/8] RFC: block: use transactions as a replacement of ->{can_}set_aio_context()
[RFC PATCH 4/8] blockjob: implement .change_aio_ctx in child_job
[RFC PATCH 5/8] block: implement .change_aio_ctx in child_of_bds
[RFC PATCH 6/8] block-backend: implement .change_aio_ctx in child_root
[RFC PATCH 7/8] block: use the new _change_ API instead of _can_set_ and _set_
[RFC PATCH 8/8] block: remove all unused ->can_set_aio_ctx and ->set_aio_ctx callbacks