[RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept

Emanuele Giuseppe Esposito posted 5 patches 2 years, 2 months ago
Test checkpatch passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20220301142113.163174-1-eesposit@redhat.com
Maintainers: Stefan Hajnoczi <stefanha@redhat.com>, Fam Zheng <fam@euphon.net>, Kevin Wolf <kwolf@redhat.com>, Hanna Reitz <hreitz@redhat.com>, John Snow <jsnow@redhat.com>, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
block/io.c                   |  48 ++++++--
blockjob.c                   |   3 +-
include/block/aio-wait.h     |  15 ++-
include/block/block.h        |   7 ++
tests/unit/test-bdrv-drain.c | 218 +++++++++++++++++++++++++++++++++++
5 files changed, 274 insertions(+), 17 deletions(-)
[RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Emanuele Giuseppe Esposito 2 years, 2 months ago
This series tries to provide a proof of concept and a clear explanation
of why we need to use drains (and more precisely subtree_drains)
to replace the AioContext lock, especially to protect the BlockDriverState
->children and ->parents lists.

Just a small recap of the key concepts:
* We split the block layer APIs into "global state" (GS), "I/O", and
"global state or I/O".
  GS APIs run in the main loop, under the BQL, and are the only
  ones allowed to modify the BlockDriverState graph.

  I/O APIs are thread safe and can run in any thread.

  "Global state or I/O" APIs are essentially all APIs that use
  BDRV_POLL_WHILE. This is because only two threads can use
  BDRV_POLL_WHILE: the main loop and the iothread that runs
  the AioContext.

* Drains allow the caller (either the main loop or the iothread running
the context) to wait for all in-flight requests and operations
of a BDS: normal drains target a given node and its parents, while
subtree ones also include the subgraph of the node. Siblings are
not affected by either of these two kinds of drains.
After bdrv_drained_begin, no more requests are allowed to come
from the affected nodes. Therefore the only actor left working
on a drained part of the graph should be the main loop.
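
As a minimal illustration of the pattern recapped above (a sketch only,
not code from this series):

/*
 * Sketch of a drained section: _begin() waits for all in-flight
 * requests on bs and its parents and keeps new ones from starting
 * until the matching _end(). The subtree variant extends this to
 * the node's children as well.
 */
static void drained_section_example(BlockDriverState *bs)
{
    bdrv_drained_begin(bs);          /* node + parents quiesced        */
    /* only the drainer may act on this part of the graph here */
    bdrv_drained_end(bs);

    bdrv_subtree_drained_begin(bs);  /* node + parents + whole subtree */
    /* ... */
    bdrv_subtree_drained_end(bs);
}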

What do we intend to do
-----------------------
We want to remove the AioContext lock. It is not 100% clear how
many things we are protecting with it, and why.
As a start, we want to protect the BlockDriverState ->parents and
->children lists, since they are read by the main loop and by I/O code,
but written only by the main loop under the BQL. The function that
modifies these lists is bdrv_replace_child_common().
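
For orientation, the lists in question are declared roughly like this
(a simplified, approximate excerpt of QEMU's block_int.h, with the
comments added here):

struct BlockDriverState {
    /* ... */
    QLIST_HEAD(, BdrvChild) children;   /* edges to child nodes    */
    QLIST_HEAD(, BdrvChild) parents;    /* edges from parent nodes */
    /* ... */
};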

How do we want to do it
-----------------------
We identified the subtree_drain API as the ideal substitute for the
AioContext lock. The reason is simple: draining prevents the iothread
from reading or writing the nodes, so once the main loop finishes
executing bdrv_drained_begin() on the affected graph, we are sure that
the iothread is not going to look at or interfere with that part of
the graph. We are also sure that the only two actors that can look at
a specific BlockDriverState in any given context are the main loop and
the iothread running the AioContext (ensured by the "global state or
I/O" logic).
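
A hedged sketch of the intended usage (the _unlocked names come from
patch 3's title; the signatures and the surrounding code are
illustrative assumptions, not code from the series):

/*
 * Illustrative only: the main loop quiesces the affected part of the
 * graph before touching ->parents/->children, instead of relying on
 * the AioContext lock.
 */
static void replace_child_sketch(BdrvChild *child, BlockDriverState *new_bs)
{
    BlockDriverState *old_bs = child->bs;

    bdrv_subtree_drained_begin_unlocked(old_bs);
    bdrv_subtree_drained_begin_unlocked(new_bs);

    /*
     * Safe: no iothread can submit I/O or start a drain in here, so
     * rewriting the ->children / ->parents lists cannot race (this is
     * where bdrv_replace_child_common() and friends would run).
     */

    bdrv_subtree_drained_end_unlocked(new_bs);
    bdrv_subtree_drained_end_unlocked(old_bs);
}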

Why use _subtree_ instead of normal drain
-----------------------------------------
A simple drain "blocks" a given node and all its parents,
but it doesn't touch the children.
This means that if we use a simple drain, a child can
keep processing requests and eventually end up calling
bdrv_drained_begin itself, reading the parents lists while the main
loop is modifying them. Therefore a subtree drain is necessary.
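
To make the race concrete, a rough sketch (node and function names are
illustrative; the actual reproducer is the test added in patch 5):

/* Main loop (GS code): a plain drain covers overlay and its parents only. */
static void main_loop_side(BlockDriverState *overlay)
{
    bdrv_drained_begin(overlay);
    /* ... rewriting overlay's ->children / ->parents edges here ... */
    bdrv_drained_end(overlay);
}

/*
 * Iothread: a job attached to overlay's child "backing" was never
 * drained, so it can still start a drain of its own. That drain walks
 * backing->parents, i.e. the very BdrvChild edges that also sit in
 * overlay->children and that the main loop may be rewriting. A subtree
 * drain of overlay would have paused this job first.
 */
static void coroutine_fn job_side(BlockDriverState *backing)
{
    bdrv_drained_begin(backing);
    bdrv_drained_end(backing);
}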

Possible scenarios
-------------------
Keeping in mind that only an iothread and the main loop can be
draining a certain node, we could have:

main loop successfully drains and then the iothread tries to drain:
  impossible scenario, as the iothread is already stopped once the
  main loop successfully drains.

iothread successfully drains and then the main loop drains:
  should not be a problem, as:
  1) the iothread should already be "blocked" by its own drain
  2) the main loop would still wait for it to completely block
  There is the issue of mirror overriding this scenario to avoid
  deadlocks, but that is handled in the next section.

main loop and iothread try to drain together:
  As above, this case doesn't really matter. As long as the
  bdrv_drained_begin invariant is respected, the main loop will
  continue only once the iothread is "blocked" on that part of the graph.

A note on iothread draining
---------------------------
Theoretically, draining from an iothread should not be possible,
as the iothread would schedule a BH in the main loop and wait
for itself to stop, even though it is not yet stopped since it is
waiting for that very BH.

This is what would happen in the tests in patch 5 if .drained_poll
was not implemented.

Therefore, one solution is to use the .drained_poll callback in
BlockJobDriver. This callback overrides the default job poll()
behavior and allows the polling condition to stop waiting for the job.
It is actually used only in mirror.
This, however, breaks the bdrv_drained_begin invariant, because the
iothread is not really blocked on that node but continues running.
In order to fix this, patch 4 allows the polling condition to be
used only by the iothread, and not by the main loop too, preventing
the drain from returning before the iothread is effectively stopped.
This is also shown in the tests in patch 5. If the fix in patch
4 is removed, then the main loop drain will return earlier and
allow the iothread to run and drain together.
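
What such a check might look like, as an approximation using existing
helpers (this is not the code of patch 4, just a sketch of the idea;
the simplified conditions are assumptions):

static bool child_job_drained_poll_sketch(BdrvChild *c)
{
    BlockJob *bjob = c->opaque;
    Job *job = &bjob->job;

    /* An idle or completed job has no pending requests. */
    if (!job->busy || job_is_completed(job)) {
        return false;
    }

    /*
     * Only the thread running the job's AioContext may rely on the
     * driver's relaxed condition (mirror's .drained_poll); the main
     * loop keeps polling until the job has really stopped.
     */
    if (bjob->driver->drained_poll &&
        qemu_get_current_aio_context() == job->aio_context) {
        return bjob->driver->drained_poll(bjob);
    }

    return true;
}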

The other patches in this series are cherry-picked from the various
series I already sent, and are included here just to allow the
subtree_drained_begin/end_unlocked implementation.

Emanuele Giuseppe Esposito (5):
  aio-wait.h: introduce AIO_WAIT_WHILE_UNLOCKED
  introduce BDRV_POLL_WHILE_UNLOCKED
  block/io.c: introduce bdrv_subtree_drained_{begin/end}_unlocked
  child_job_drained_poll: override polling condition only when in home
    thread
  test-bdrv-drain: ensure draining from main loop stops iothreads

 block/io.c                   |  48 ++++++--
 blockjob.c                   |   3 +-
 include/block/aio-wait.h     |  15 ++-
 include/block/block.h        |   7 ++
 tests/unit/test-bdrv-drain.c | 218 +++++++++++++++++++++++++++++++++++
 5 files changed, 274 insertions(+), 17 deletions(-)

-- 
2.31.1
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Emanuele Giuseppe Esposito 2 years, 2 months ago
I would really love to hear opinions on this, since we already had some
discussions on other similar patches.

Thank you,
Emanuele

On 01/03/2022 15:21, Emanuele Giuseppe Esposito wrote:
> This serie tries to provide a proof of concept and a clear explanation
> on why we need to use drains (and more precisely subtree_drains)
> to replace the aiocontext lock, especially to protect BlockDriverState
> ->children and ->parent lists.
> 
> Just a small recap on the key concepts:
> * We split block layer APIs in "global state" (GS), "I/O", and
> "global state or I/O".
>   GS are running in the main loop, under BQL, and are the only
>   one allowed to modify the BlockDriverState graph.
> 
>   I/O APIs are thread safe and can run in any thread
> 
>   "global state or I/O" are essentially all APIs that use
>   BDRV_POLL_WHILE. This is because there can be only 2 threads
>   that can use BDRV_POLL_WHILE: main loop and the iothread that
>   runs the aiocontext.
> 
> * Drains allow the caller (either main loop or iothread running
> the context) to wait all in_flights requests and operations
> of a BDS: normal drains target a given node and is parents, while
> subtree ones also include the subgraph of the node. Siblings are
> not affected by any of these two kind of drains.
> After bdrv_drained_begin, no more request is allowed to come
> from the affected nodes. Therefore the only actor left working
> on a drained part of the graph should be the main loop.
> 
> What do we intend to do
> -----------------------
> We want to remove the AioContext lock. It is not 100% clear on how
> many things we are protecting with it, and why.
> As a starter, we want to protect BlockDriverState ->parents and
> ->children lists, since they are read by main loop and I/O but
> only written by main loop under BQL. The function that modifies
> these lists is bdrv_replace_child_common().
> 
> How do we want to do it
> -----------------------
> We individuated as ideal subtitute of AioContext lock
> the subtree_drain API. The reason is simple: draining prevents the iothread to read or write the nodes, so once the main loop finishes
> executing bdrv_drained_begin() on the interested graph, we are sure that
> the iothread is not going to look or even interfere with that part of the graph.
> We are also sure that the only two actors that can look at a specific
> BlockDriverState in any given context are the main loop and the
> iothread running the AioContext (ensured by "global state or IO" logic).
> 
> Why use _subtree_ instead of normal drain
> -----------------------------------------
> A simple drain "blocks" a given node and all its parents.
> But it doesn't touch the child.
> This means that if we use a simple drain, a child can always
> keep processing requests, and eventually end up calling itself
> bdrv_drained_begin, ending up reading the parents node while the main loop
> is modifying them. Therefore a subtree drain is necessary.
> 
> Possible scenarios
> -------------------
> Keeping in mind that we can only have an iothread and the main loop
> draining on a certain node, we could have:
> 
> main loop successfully drains and then iothread tries to drain:
>   impossible scenario, as iothread is already stopped once main
>   successfully drains.
> 
> iothread successfully drains and then main loop drains:
>   should not be a problem, as:
>   1) the iothread should be already "blocked" by its own drain
>   2) main loop would still wait for it to completely block
>   There is the issue of mirror overriding such scenario to avoid
>   having deadlocks, but that is handled in the next section.
> 
> main loop and iothread try to drain together:
>   As above, this case doens't really matter. As long as
>   bdrv_drained_begin invariant is respected, the main loop will
>   continue only once the iothread is "blocked" on that part of the graph.
> 
> A note on iothread draining
> ---------------------------
> Theoretically draining from an iothread should not be possible,
> as the iothread would be scheduling a bh in the main loop waiting
> for itself to stop, even though it is not yet stopped since it is waiting for the bh.
> 
> This is what would happen in the tests in patch 5 if .drained_poll
> was not implemented.
> 
> Therefore, one solution is to use .drained_poll callback in BlockJobDriver.
> This callback overrides the default job poll() behavior, and
> allows the polling condition to stop waiting for the job.
> It is actually used only in mirror.
> This however breaks bdrv_drained_begin invariant, because the
> iothread is not really blocked on that node but continues running.
> In order to fix this, patch 4 allows the polling condition to be
> used only by the iothread, and not the main loop too, preventing
> the drain to return before the iothread is effectively stopped.
> This is also shown in the tests in patch 5. If the fix in patch
> 4 is removed, then the main loop drain will return earlier and
> allow the iothread to run and drain together.
> 
> The other patches in this serie are cherry-picked from the various
> series I already sent, and are included here just to allow
> subtree_drained_begin/end_unlocked implementation.
> 
> Emanuele Giuseppe Esposito (5):
>   aio-wait.h: introduce AIO_WAIT_WHILE_UNLOCKED
>   introduce BDRV_POLL_WHILE_UNLOCKED
>   block/io.c: introduce bdrv_subtree_drained_{begin/end}_unlocked
>   child_job_drained_poll: override polling condition only when in home
>     thread
>   test-bdrv-drain: ensure draining from main loop stops iothreads
> 
>  block/io.c                   |  48 ++++++--
>  blockjob.c                   |   3 +-
>  include/block/aio-wait.h     |  15 ++-
>  include/block/block.h        |   7 ++
>  tests/unit/test-bdrv-drain.c | 218 +++++++++++++++++++++++++++++++++++
>  5 files changed, 274 insertions(+), 17 deletions(-)
>
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Vladimir Sementsov-Ogievskiy 2 years, 2 months ago
01.03.2022 17:21, Emanuele Giuseppe Esposito wrote:
> This serie tries to provide a proof of concept and a clear explanation
> on why we need to use drains (and more precisely subtree_drains)
> to replace the aiocontext lock, especially to protect BlockDriverState
> ->children and ->parent lists.
> 
> Just a small recap on the key concepts:
> * We split block layer APIs in "global state" (GS), "I/O", and
> "global state or I/O".
>    GS are running in the main loop, under BQL, and are the only
>    one allowed to modify the BlockDriverState graph.
> 
>    I/O APIs are thread safe and can run in any thread
> 
>    "global state or I/O" are essentially all APIs that use
>    BDRV_POLL_WHILE. This is because there can be only 2 threads
>    that can use BDRV_POLL_WHILE: main loop and the iothread that
>    runs the aiocontext.
> 
> * Drains allow the caller (either main loop or iothread running
> the context) to wait all in_flights requests and operations
> of a BDS: normal drains target a given node and is parents, while
> subtree ones also include the subgraph of the node. Siblings are
> not affected by any of these two kind of drains.
> After bdrv_drained_begin, no more request is allowed to come
> from the affected nodes. Therefore the only actor left working
> on a drained part of the graph should be the main loop.
> 
> What do we intend to do
> -----------------------
> We want to remove the AioContext lock. It is not 100% clear on how
> many things we are protecting with it, and why.
> As a starter, we want to protect BlockDriverState ->parents and
> ->children lists, since they are read by main loop and I/O but
> only written by main loop under BQL. The function that modifies
> these lists is bdrv_replace_child_common().
> 
> How do we want to do it
> -----------------------
> We individuated as ideal subtitute of AioContext lock
> the subtree_drain API. The reason is simple: draining prevents the iothread to read or write the nodes, so once the main loop finishes

I'm not sure it's ideal. Unfortunately I'm not really good in all that BQL, AioContext locks and drains. So, I can't give good advice. But here are my doubts:

Draining is a very restrictive measure: even if the drained section is
very short, at least in bdrv_drained_begin() we have to wait for all
current requests and must not start new ones. That slows down the guest.
At the same time there are operations that don't require stopping guest
I/O requests, for example manipulation of dirty bitmaps (the QMP commands
block-dirty-bitmap-add and block-dirty-bitmap-remove), or various query
requests.

I see only two real cases where we do need drain:

1. When we need a consistent "point-in-time". For example, when we start
a backup in a transaction together with some dirty-bitmap manipulation
commands.

2. When we need to modify the block graph: if we are going to break the
relation A -> B, there must not be any in-flight request that wants to
use this relation.

For all other operations for which we want some kind of lock (like the
AioContext lock or something similar), we actually don't want to stop
guest I/O.


Next, I have a problem in mind that in the past led to a lot of iotest 30
failures. There have been various fixes and improvements since then, but
the core problem (as far as I understand) is still here: nothing protects
us when, in the middle of some graph modification process (for example
block-job finalization), we yield, switch to another coroutine and enter
another graph modification process (for example, another block-job
finalization).
(For details see my old "[PATCH RFC 0/5] Fix accidental crash in iotest 30",
https://lists.nongnu.org/archive/html/qemu-devel/2020-11/msg05290.html ,
where I suggested adding a global graph_modify_mutex CoMutex, to be held
during any graph-modifying process that may yield.)
Does your proposal solve this problem?


> executing bdrv_drained_begin() on the interested graph, we are sure that
> the iothread is not going to look or even interfere with that part of the graph.
> We are also sure that the only two actors that can look at a specific
> BlockDriverState in any given context are the main loop and the
> iothread running the AioContext (ensured by "global state or IO" logic).
> 
> Why use _subtree_ instead of normal drain
> -----------------------------------------
> A simple drain "blocks" a given node and all its parents.
> But it doesn't touch the child.
> This means that if we use a simple drain, a child can always
> keep processing requests, and eventually end up calling itself
> bdrv_drained_begin, ending up reading the parents node while the main loop
> is modifying them. Therefore a subtree drain is necessary.

I'm not sure I understand. Could you give a more concrete example of
what may happen if we use a simple drain?
Who is the caller of bdrv_drained_begin() that you are talking about?
Some I/O path in a parallel thread? Do we really have an I/O path that
calls bdrv_drained_begin()?

> 
> Possible scenarios
> -------------------
> Keeping in mind that we can only have an iothread and the main loop
> draining on a certain node, we could have:
> 
> main loop successfully drains and then iothread tries to drain:
>    impossible scenario, as iothread is already stopped once main
>    successfully drains.
> 
> iothread successfully drains and then main loop drains:
>    should not be a problem, as:
>    1) the iothread should be already "blocked" by its own drain
>    2) main loop would still wait for it to completely block
>    There is the issue of mirror overriding such scenario to avoid
>    having deadlocks, but that is handled in the next section.
> 
> main loop and iothread try to drain together:
>    As above, this case doens't really matter. As long as
>    bdrv_drained_begin invariant is respected, the main loop will
>    continue only once the iothread is "blocked" on that part of the graph.
> 
> A note on iothread draining
> ---------------------------
> Theoretically draining from an iothread should not be possible,
> as the iothread would be scheduling a bh in the main loop waiting
> for itself to stop, even though it is not yet stopped since it is waiting for the bh.
> 
> This is what would happen in the tests in patch 5 if .drained_poll
> was not implemented.
> 
> Therefore, one solution is to use .drained_poll callback in BlockJobDriver.
> This callback overrides the default job poll() behavior, and
> allows the polling condition to stop waiting for the job.
> It is actually used only in mirror.
> This however breaks bdrv_drained_begin invariant, because the
> iothread is not really blocked on that node but continues running.
> In order to fix this, patch 4 allows the polling condition to be
> used only by the iothread, and not the main loop too, preventing
> the drain to return before the iothread is effectively stopped.
> This is also shown in the tests in patch 5. If the fix in patch
> 4 is removed, then the main loop drain will return earlier and
> allow the iothread to run and drain together.
> 
> The other patches in this serie are cherry-picked from the various
> series I already sent, and are included here just to allow
> subtree_drained_begin/end_unlocked implementation.
> 
> Emanuele Giuseppe Esposito (5):
>    aio-wait.h: introduce AIO_WAIT_WHILE_UNLOCKED
>    introduce BDRV_POLL_WHILE_UNLOCKED
>    block/io.c: introduce bdrv_subtree_drained_{begin/end}_unlocked
>    child_job_drained_poll: override polling condition only when in home
>      thread
>    test-bdrv-drain: ensure draining from main loop stops iothreads
> 
>   block/io.c                   |  48 ++++++--
>   blockjob.c                   |   3 +-
>   include/block/aio-wait.h     |  15 ++-
>   include/block/block.h        |   7 ++
>   tests/unit/test-bdrv-drain.c | 218 +++++++++++++++++++++++++++++++++++
>   5 files changed, 274 insertions(+), 17 deletions(-)
> 


-- 
Best regards,
Vladimir
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Emanuele Giuseppe Esposito 2 years, 1 month ago

Am 02/03/2022 um 12:07 schrieb Vladimir Sementsov-Ogievskiy:
> 01.03.2022 17:21, Emanuele Giuseppe Esposito wrote:
>> This serie tries to provide a proof of concept and a clear explanation
>> on why we need to use drains (and more precisely subtree_drains)
>> to replace the aiocontext lock, especially to protect BlockDriverState
>> ->children and ->parent lists.
>>
>> Just a small recap on the key concepts:
>> * We split block layer APIs in "global state" (GS), "I/O", and
>> "global state or I/O".
>>    GS are running in the main loop, under BQL, and are the only
>>    one allowed to modify the BlockDriverState graph.
>>
>>    I/O APIs are thread safe and can run in any thread
>>
>>    "global state or I/O" are essentially all APIs that use
>>    BDRV_POLL_WHILE. This is because there can be only 2 threads
>>    that can use BDRV_POLL_WHILE: main loop and the iothread that
>>    runs the aiocontext.
>>
>> * Drains allow the caller (either main loop or iothread running
>> the context) to wait all in_flights requests and operations
>> of a BDS: normal drains target a given node and is parents, while
>> subtree ones also include the subgraph of the node. Siblings are
>> not affected by any of these two kind of drains.
>> After bdrv_drained_begin, no more request is allowed to come
>> from the affected nodes. Therefore the only actor left working
>> on a drained part of the graph should be the main loop.
>>
>> What do we intend to do
>> -----------------------
>> We want to remove the AioContext lock. It is not 100% clear on how
>> many things we are protecting with it, and why.
>> As a starter, we want to protect BlockDriverState ->parents and
>> ->children lists, since they are read by main loop and I/O but
>> only written by main loop under BQL. The function that modifies
>> these lists is bdrv_replace_child_common().
>>
>> How do we want to do it
>> -----------------------
>> We individuated as ideal subtitute of AioContext lock
>> the subtree_drain API. The reason is simple: draining prevents the
>> iothread to read or write the nodes, so once the main loop finishes
> 
> I'm not sure it's ideal. Unfortunately I'm not really good in all that
> BQL, AioContext locks and drains. So, I can't give good advice. But here
> are my doubts:
> 
> Draining is very restrictive measure: even if drained section is very
> short, at least on bdrv_drained_begin() we have to wait for all current
> requests, and don't start new ones. That slows down the guest. 

I don't think we are in a critical path where performance is important here.

In the
> same time there are operations that don't require to stop guest IO
> requests. For example manipulation with dirty bitmaps - qmp commands
> block-dirty-bitmap-add block-dirty-bitmap-remove. Or different query
> requests..
>

Maybe you misunderstood or I was not 100% clear, but I am talking about
replacing the AioContext lock only for the ->parents and ->children
lists, not everywhere. This is the first step; then we will see whether
the additional things it protects can use drain or something else.

 
> I see only two real cases, where we do need drain:
> 
> 1. When we need a consistent "point-in-time". For example, when we start
> backup in transaction with some dirty-bitmap manipulation commands.
> 
> 2. When we need to modify block-graph: if we are going to break relation
> A -> B, there must not be any in-flight request that want to use this
> relation.

That's the use case I am considering.
> 
> All other operations, for which we want some kind of lock (like
> AioContext lock or something) we actually don't want to stop guest IO.

Yes, they have to be analyzed case by case.
> 
> 
> Next, I have a problem in mind, that in past lead to a lot of iotest 30
> failures. Next there were different fixes and improvements, but the core
> problem (as far as I understand) is still here: nothing protects us when
> we are in some graph modification process (for example block-job
> finalization) do yield, switch to other coroutine and enter another
> graph modification process (for example, another block-job finaliztion)..

That's another point to consider. I don't really have a solution for this.

> (for details look at my old "[PATCH RFC 0/5] Fix accidental crash in
> iotest 30" 
> https://lists.nongnu.org/archive/html/qemu-devel/2020-11/msg05290.html ,
> where I suggested to add a global graph_modify_mutex CoMutex, to be held
> during graph-modifying process that may yield)..
> Does your proposal solve this problem?
> 
> 
>> executing bdrv_drained_begin() on the interested graph, we are sure that
>> the iothread is not going to look or even interfere with that part of
>> the graph.
>> We are also sure that the only two actors that can look at a specific
>> BlockDriverState in any given context are the main loop and the
>> iothread running the AioContext (ensured by "global state or IO" logic).
>>
>> Why use _subtree_ instead of normal drain
>> -----------------------------------------
>> A simple drain "blocks" a given node and all its parents.
>> But it doesn't touch the child.
>> This means that if we use a simple drain, a child can always
>> keep processing requests, and eventually end up calling itself
>> bdrv_drained_begin, ending up reading the parents node while the main
>> loop
>> is modifying them. Therefore a subtree drain is necessary.
> 
> I'm not sure I understand.. Could you give a more concrete example of
> what may happen if we use simple drain?
> Who will call bdrv_drained_begin() you say about? Some IO path in
> parallel thread? Do we really have an IO path that calls
> bdrv_drained_begin()?

This is already done in mirror, which drains while running the .run()
Job callback.

I made a unit test for this:
https://gitlab.com/eesposit/qemu/-/commit/1ffd7193ed441020f51aeeb49c3b07edb6949760

assume that you have the following graph:

blk (blk) -> overlay (bs)
overlay (bs) --- backed by ---> backing (bs) 
job1 (blockjob) --> backing

Then the main loop calls bdrv_drained_begin(overlay).
This means overlay and its parent blk are under drain, but backing and,
most importantly, the job are not.
At this point, the job's run() function decides to drain. It should not
be allowed to read overlay's parents list, for example, but there is
nothing to prevent it from doing that, and thus we see drains_done > 0.

On the other hand, when we subtree_drain, the job is drained as well and
thus it pauses.

Makes sense?
> 
>>
>> Possible scenarios
>> -------------------
>> Keeping in mind that we can only have an iothread and the main loop
>> draining on a certain node, we could have:
>>
>> main loop successfully drains and then iothread tries to drain:
>>    impossible scenario, as iothread is already stopped once main
>>    successfully drains.
>>
>> iothread successfully drains and then main loop drains:
>>    should not be a problem, as:
>>    1) the iothread should be already "blocked" by its own drain
>>    2) main loop would still wait for it to completely block
>>    There is the issue of mirror overriding such scenario to avoid
>>    having deadlocks, but that is handled in the next section.
>>
>> main loop and iothread try to drain together:
>>    As above, this case doens't really matter. As long as
>>    bdrv_drained_begin invariant is respected, the main loop will
>>    continue only once the iothread is "blocked" on that part of the
>> graph.
>>
>> A note on iothread draining
>> ---------------------------
>> Theoretically draining from an iothread should not be possible,
>> as the iothread would be scheduling a bh in the main loop waiting
>> for itself to stop, even though it is not yet stopped since it is
>> waiting for the bh.
>>
>> This is what would happen in the tests in patch 5 if .drained_poll
>> was not implemented.
>>
>> Therefore, one solution is to use .drained_poll callback in
>> BlockJobDriver.
>> This callback overrides the default job poll() behavior, and
>> allows the polling condition to stop waiting for the job.
>> It is actually used only in mirror.
>> This however breaks bdrv_drained_begin invariant, because the
>> iothread is not really blocked on that node but continues running.
>> In order to fix this, patch 4 allows the polling condition to be
>> used only by the iothread, and not the main loop too, preventing
>> the drain to return before the iothread is effectively stopped.
>> This is also shown in the tests in patch 5. If the fix in patch
>> 4 is removed, then the main loop drain will return earlier and
>> allow the iothread to run and drain together.
>>
>> The other patches in this serie are cherry-picked from the various
>> series I already sent, and are included here just to allow
>> subtree_drained_begin/end_unlocked implementation.
>>
>> Emanuele Giuseppe Esposito (5):
>>    aio-wait.h: introduce AIO_WAIT_WHILE_UNLOCKED
>>    introduce BDRV_POLL_WHILE_UNLOCKED
>>    block/io.c: introduce bdrv_subtree_drained_{begin/end}_unlocked
>>    child_job_drained_poll: override polling condition only when in home
>>      thread
>>    test-bdrv-drain: ensure draining from main loop stops iothreads
>>
>>   block/io.c                   |  48 ++++++--
>>   blockjob.c                   |   3 +-
>>   include/block/aio-wait.h     |  15 ++-
>>   include/block/block.h        |   7 ++
>>   tests/unit/test-bdrv-drain.c | 218 +++++++++++++++++++++++++++++++++++
>>   5 files changed, 274 insertions(+), 17 deletions(-)
>>
> 
> 


Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Vladimir Sementsov-Ogievskiy 2 years, 1 month ago
09.03.2022 16:26, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 02/03/2022 um 12:07 schrieb Vladimir Sementsov-Ogievskiy:
>> 01.03.2022 17:21, Emanuele Giuseppe Esposito wrote:
>>> This serie tries to provide a proof of concept and a clear explanation
>>> on why we need to use drains (and more precisely subtree_drains)
>>> to replace the aiocontext lock, especially to protect BlockDriverState
>>> ->children and ->parent lists.
>>>
>>> Just a small recap on the key concepts:
>>> * We split block layer APIs in "global state" (GS), "I/O", and
>>> "global state or I/O".
>>>     GS are running in the main loop, under BQL, and are the only
>>>     one allowed to modify the BlockDriverState graph.
>>>
>>>     I/O APIs are thread safe and can run in any thread
>>>
>>>     "global state or I/O" are essentially all APIs that use
>>>     BDRV_POLL_WHILE. This is because there can be only 2 threads
>>>     that can use BDRV_POLL_WHILE: main loop and the iothread that
>>>     runs the aiocontext.
>>>
>>> * Drains allow the caller (either main loop or iothread running
>>> the context) to wait all in_flights requests and operations
>>> of a BDS: normal drains target a given node and is parents, while
>>> subtree ones also include the subgraph of the node. Siblings are
>>> not affected by any of these two kind of drains.
>>> After bdrv_drained_begin, no more request is allowed to come
>>> from the affected nodes. Therefore the only actor left working
>>> on a drained part of the graph should be the main loop.
>>>
>>> What do we intend to do
>>> -----------------------
>>> We want to remove the AioContext lock. It is not 100% clear on how
>>> many things we are protecting with it, and why.
>>> As a starter, we want to protect BlockDriverState ->parents and
>>> ->children lists, since they are read by main loop and I/O but
>>> only written by main loop under BQL. The function that modifies
>>> these lists is bdrv_replace_child_common().
>>>
>>> How do we want to do it
>>> -----------------------
>>> We individuated as ideal subtitute of AioContext lock
>>> the subtree_drain API. The reason is simple: draining prevents the
>>> iothread to read or write the nodes, so once the main loop finishes
>>
>> I'm not sure it's ideal. Unfortunately I'm not really good in all that
>> BQL, AioContext locks and drains. So, I can't give good advice. But here
>> are my doubts:
>>
>> Draining is very restrictive measure: even if drained section is very
>> short, at least on bdrv_drained_begin() we have to wait for all current
>> requests, and don't start new ones. That slows down the guest.
> 
> I don't think we are in a critical path where performance is important here.
> 
> In the
>> same time there are operations that don't require to stop guest IO
>> requests. For example manipulation with dirty bitmaps - qmp commands
>> block-dirty-bitmap-add block-dirty-bitmap-remove. Or different query
>> requests..
>>
> 
> Maybe you misunderstood or I was not 100% clear, but I am talking about replacing the AioContext lock for the ->parents and ->children instance. Not everywhere. This is the first step, and then we will see if the additional things that it protects can use drain or something else

Ok, if we are only about graph modification that's not a critical performance path.

> 
>   
>> I see only two real cases, where we do need drain:
>>
>> 1. When we need a consistent "point-in-time". For example, when we start
>> backup in transaction with some dirty-bitmap manipulation commands.
>>
>> 2. When we need to modify block-graph: if we are going to break relation
>> A -> B, there must not be any in-flight request that want to use this
>> relation.
> 
> That's the use case I am considering.
>>
>> All other operations, for which we want some kind of lock (like
>> AioContext lock or something) we actually don't want to stop guest IO.
> 
> Yes, they have to be analyzed case by case.
>>
>>
>> Next, I have a problem in mind, that in past lead to a lot of iotest 30
>> failures. Next there were different fixes and improvements, but the core
>> problem (as far as I understand) is still here: nothing protects us when
>> we are in some graph modification process (for example block-job
>> finalization) do yield, switch to other coroutine and enter another
>> graph modification process (for example, another block-job finaliztion)..
> 
> That's another point to consider. I don't really have a solution for this.
> 
>> (for details look at my old "[PATCH RFC 0/5] Fix accidental crash in
>> iotest 30"
>> https://lists.nongnu.org/archive/html/qemu-devel/2020-11/msg05290.html ,
>> where I suggested to add a global graph_modify_mutex CoMutex, to be held
>> during graph-modifying process that may yield)..
>> Does your proposal solve this problem?
>>
>>
>>> executing bdrv_drained_begin() on the interested graph, we are sure that
>>> the iothread is not going to look or even interfere with that part of
>>> the graph.
>>> We are also sure that the only two actors that can look at a specific
>>> BlockDriverState in any given context are the main loop and the
>>> iothread running the AioContext (ensured by "global state or IO" logic).
>>>
>>> Why use _subtree_ instead of normal drain
>>> -----------------------------------------
>>> A simple drain "blocks" a given node and all its parents.
>>> But it doesn't touch the child.
>>> This means that if we use a simple drain, a child can always
>>> keep processing requests, and eventually end up calling itself
>>> bdrv_drained_begin, ending up reading the parents node while the main
>>> loop
>>> is modifying them. Therefore a subtree drain is necessary.
>>
>> I'm not sure I understand.. Could you give a more concrete example of
>> what may happen if we use simple drain?
>> Who will call bdrv_drained_begin() you say about? Some IO path in
>> parallel thread? Do we really have an IO path that calls
>> bdrv_drained_begin()?
> 
> It is already being done in mirror, where it drains while running the .run() Job callback.
> 
> I made an unit test for this:
> https://gitlab.com/eesposit/qemu/-/commit/1ffd7193ed441020f51aeeb49c3b07edb6949760
> 
> assume that you have the following graph:
> 
> blk (blk) -> overlay (bs)
> overlay (bs) --- backed by ---> backing (bs)
> job1 (blockjob) --> backing
> 
> Then the main loop calls bdrv_drained_begin(overlay)
> This means overlay and bs are under drain, but backing and most importantly the job are not.
> At this point, the job run() function decides to drain. It should not be
> allowed to read overlay parents list for example, but instead there is nothing
> to prevent it from doing that, and thus we see drains_done >0.
> 
> On the other side, when we subtree_drain, also the job is drained and thus it goes in
> pause.
> 
> Makes sense?

Ah, it seems I understand what you mean.

One of my arguments is that "drain" is not a lock against other clients
who want to modify the graph, because a drained section allows nested
drained sections.

And you try to solve that by draining more things; this way we also
drain the job, which is a possible client that may want to modify the
graph in parallel.

So, in other words, when we want to modify the graph, we drain the whole
connected component of the graph, and we assume we are safe from other
graph modifications because all related jobs are drained.
Interesting: is it possible that some non-drained job from another
connected component will want to connect to some node from our drained
component?

I just still feel that draining is the wrong mechanism to avoid
interaction with other clients who want to modify the graph, because:

1. We stop all I/O on the whole subgraph, which is not necessary.
2. Draining is not a mutex; it allows nesting, and it's OK when two
different clients drain the same nodes. Draining is just a requirement
to do no I/O at these nodes.

And with your approach, it seems that to be absolutely safe we'll need
to drain everything.

My feeling is that it's better to keep draining what it is now: a
requirement to have no I/O requests. And to isolate graph modifications
from each other, create a new synchronization mechanism, something like
a global queue, where clients who want access to graph modifications
wait for their turn.

>>
>>>
>>> Possible scenarios
>>> -------------------
>>> Keeping in mind that we can only have an iothread and the main loop
>>> draining on a certain node, we could have:
>>>
>>> main loop successfully drains and then iothread tries to drain:
>>>     impossible scenario, as iothread is already stopped once main
>>>     successfully drains.
>>>
>>> iothread successfully drains and then main loop drains:
>>>     should not be a problem, as:
>>>     1) the iothread should be already "blocked" by its own drain
>>>     2) main loop would still wait for it to completely block
>>>     There is the issue of mirror overriding such scenario to avoid
>>>     having deadlocks, but that is handled in the next section.
>>>
>>> main loop and iothread try to drain together:
>>>     As above, this case doens't really matter. As long as
>>>     bdrv_drained_begin invariant is respected, the main loop will
>>>     continue only once the iothread is "blocked" on that part of the
>>> graph.
>>>
>>> A note on iothread draining
>>> ---------------------------
>>> Theoretically draining from an iothread should not be possible,
>>> as the iothread would be scheduling a bh in the main loop waiting
>>> for itself to stop, even though it is not yet stopped since it is
>>> waiting for the bh.
>>>
>>> This is what would happen in the tests in patch 5 if .drained_poll
>>> was not implemented.
>>>
>>> Therefore, one solution is to use .drained_poll callback in
>>> BlockJobDriver.
>>> This callback overrides the default job poll() behavior, and
>>> allows the polling condition to stop waiting for the job.
>>> It is actually used only in mirror.
>>> This however breaks bdrv_drained_begin invariant, because the
>>> iothread is not really blocked on that node but continues running.
>>> In order to fix this, patch 4 allows the polling condition to be
>>> used only by the iothread, and not the main loop too, preventing
>>> the drain to return before the iothread is effectively stopped.
>>> This is also shown in the tests in patch 5. If the fix in patch
>>> 4 is removed, then the main loop drain will return earlier and
>>> allow the iothread to run and drain together.
>>>
>>> The other patches in this serie are cherry-picked from the various
>>> series I already sent, and are included here just to allow
>>> subtree_drained_begin/end_unlocked implementation.
>>>
>>> Emanuele Giuseppe Esposito (5):
>>>     aio-wait.h: introduce AIO_WAIT_WHILE_UNLOCKED
>>>     introduce BDRV_POLL_WHILE_UNLOCKED
>>>     block/io.c: introduce bdrv_subtree_drained_{begin/end}_unlocked
>>>     child_job_drained_poll: override polling condition only when in home
>>>       thread
>>>     test-bdrv-drain: ensure draining from main loop stops iothreads
>>>
>>>    block/io.c                   |  48 ++++++--
>>>    blockjob.c                   |   3 +-
>>>    include/block/aio-wait.h     |  15 ++-
>>>    include/block/block.h        |   7 ++
>>>    tests/unit/test-bdrv-drain.c | 218 +++++++++++++++++++++++++++++++++++
>>>    5 files changed, 274 insertions(+), 17 deletions(-)
>>>
>>
>>
> 


-- 
Best regards,
Vladimir

Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Vladimir Sementsov-Ogievskiy 2 years, 1 month ago
09.03.2022 16:26, Emanuele Giuseppe Esposito wrote:
> 
> Am 02/03/2022 um 12:07 schrieb Vladimir Sementsov-Ogievskiy:
>> 01.03.2022 17:21, Emanuele Giuseppe Esposito wrote:
>>> This serie tries to provide a proof of concept and a clear explanation
>>> on why we need to use drains (and more precisely subtree_drains)
>>> to replace the aiocontext lock, especially to protect BlockDriverState
>>> ->children and ->parent lists.
>>>
>>> Just a small recap on the key concepts:
>>> * We split block layer APIs in "global state" (GS), "I/O", and
>>> "global state or I/O".
>>>     GS are running in the main loop, under BQL, and are the only
>>>     one allowed to modify the BlockDriverState graph.
>>>
>>>     I/O APIs are thread safe and can run in any thread
>>>
>>>     "global state or I/O" are essentially all APIs that use
>>>     BDRV_POLL_WHILE. This is because there can be only 2 threads
>>>     that can use BDRV_POLL_WHILE: main loop and the iothread that
>>>     runs the aiocontext.
>>>
>>> * Drains allow the caller (either main loop or iothread running
>>> the context) to wait all in_flights requests and operations
>>> of a BDS: normal drains target a given node and is parents, while
>>> subtree ones also include the subgraph of the node. Siblings are
>>> not affected by any of these two kind of drains.
>>> After bdrv_drained_begin, no more request is allowed to come
>>> from the affected nodes. Therefore the only actor left working
>>> on a drained part of the graph should be the main loop.
>>>
>>> What do we intend to do
>>> -----------------------
>>> We want to remove the AioContext lock. It is not 100% clear on how
>>> many things we are protecting with it, and why.
>>> As a starter, we want to protect BlockDriverState ->parents and
>>> ->children lists, since they are read by main loop and I/O but
>>> only written by main loop under BQL. The function that modifies
>>> these lists is bdrv_replace_child_common().
>>>
>>> How do we want to do it
>>> -----------------------
>>> We individuated as ideal subtitute of AioContext lock
>>> the subtree_drain API. The reason is simple: draining prevents the
>>> iothread to read or write the nodes, so once the main loop finishes
>> I'm not sure it's ideal. Unfortunately I'm not really good in all that
>> BQL, AioContext locks and drains. So, I can't give good advice. But here
>> are my doubts:
>>
>> Draining is very restrictive measure: even if drained section is very
>> short, at least on bdrv_drained_begin() we have to wait for all current
>> requests, and don't start new ones. That slows down the guest.
> I don't think we are in a critical path where performance is important here.
> 
> In the
>> same time there are operations that don't require to stop guest IO
>> requests. For example manipulation with dirty bitmaps - qmp commands
>> block-dirty-bitmap-add block-dirty-bitmap-remove. Or different query
>> requests..
>>
> Maybe you misunderstood or I was not 100% clear, but I am talking about replacing the AioContext lock for the ->parents and ->children instance. Not everywhere. This is the first step, and then we will see if the additional things that it protects can use drain or something else
> 
>   
>> I see only two real cases, where we do need drain:
>>
>> 1. When we need a consistent "point-in-time". For example, when we start
>> backup in transaction with some dirty-bitmap manipulation commands.
>>
>> 2. When we need to modify block-graph: if we are going to break relation
>> A -> B, there must not be any in-flight request that want to use this
>> relation.
> That's the use case I am considering.
>> All other operations, for which we want some kind of lock (like
>> AioContext lock or something) we actually don't want to stop guest IO.
> Yes, they have to be analyzed case by case.
>>
>> Next, I have a problem in mind, that in past lead to a lot of iotest 30
>> failures. Next there were different fixes and improvements, but the core
>> problem (as far as I understand) is still here: nothing protects us when
>> we are in some graph modification process (for example block-job
>> finalization) do yield, switch to other coroutine and enter another
>> graph modification process (for example, another block-job finaliztion)..
> That's another point to consider. I don't really have a solution for this.
> 
>> (for details look at my old "[PATCH RFC 0/5] Fix accidental crash in
>> iotest 30"
>> https://lists.nongnu.org/archive/html/qemu-devel/2020-11/msg05290.html  ,
>> where I suggested to add a global graph_modify_mutex CoMutex, to be held
>> during graph-modifying process that may yield)..
>> Does your proposal solve this problem?
>>
>>
>>> executing bdrv_drained_begin() on the interested graph, we are sure that
>>> the iothread is not going to look or even interfere with that part of
>>> the graph.
>>> We are also sure that the only two actors that can look at a specific
>>> BlockDriverState in any given context are the main loop and the
>>> iothread running the AioContext (ensured by "global state or IO" logic).
>>>
>>> Why use_subtree_  instead of normal drain
>>> -----------------------------------------
>>> A simple drain "blocks" a given node and all its parents.
>>> But it doesn't touch the child.
>>> This means that if we use a simple drain, a child can always
>>> keep processing requests, and eventually end up calling itself
>>> bdrv_drained_begin, ending up reading the parents node while the main
>>> loop
>>> is modifying them. Therefore a subtree drain is necessary.
>> I'm not sure I understand.. Could you give a more concrete example of
>> what may happen if we use simple drain?
>> Who will call bdrv_drained_begin() you say about? Some IO path in
>> parallel thread? Do we really have an IO path that calls
>> bdrv_drained_begin()?
> It is already being done in mirror, where it drains while running the .run() Job callback.
> 
> I made an unit test for this:
> https://gitlab.com/eesposit/qemu/-/commit/1ffd7193ed441020f51aeeb49c3b07edb6949760
> 
> assume that you have the following graph:
> 
> blk (blk) -> overlay (bs)
> overlay (bs) --- backed by ---> backing (bs)
> job1 (blockjob) --> backing
> 
> Then the main loop calls bdrv_drained_begin(overlay)
> This means overlay and bs are under drain, but backing and most importantly the job are not.
> At this point, the job run() function decides to drain. It should not be
> allowed to read overlay parents list for example, but instead there is nothing
> to prevent it from doing that, and thus we see drains_done >0.
> 
> On the other side, when we subtree_drain, also the job is drained and thus it goes in
> pause.
> 
> Makes sense?


As I understand:

You want to make the drained section a kind of lock, so that if we take
this lock, we can modify the graph and be sure that no other client will
modify it in parallel.

But drained sections can be nested. So to solve the problem you try to
drain more nodes: include the subtree for example, or maybe we need to
drain the whole connected component of the graph, or (to be safer) the
whole block layer (to be sure that during a drained section in one
connected component, some non-drained block job from another connected
component will not try to attach some node from our drained component).

I still feel that draining is the wrong tool for isolating graph-modifying
operations from each other:

1. Drained sections can be nested, and natively that's not a kind of
lock. It's just a requirement to have no I/O requests. There may be
several clients that want this condition on the same set of nodes.

2. Blocking I/O on the whole connected subgraph, or even on the whole
block layer graph, is not necessary, so that's extra blocking.


Could we instead do the following:

1. Keep draining as is - a mechanism to stop I/O on some nodes.

2. To isolate graph-modifying operations, implement another mechanism:
something like a global queue, where clients wait until they get access
to modify the block layer.


This way, any graph-modifying process would look like this (a rough
sketch in code follows the list):

1. drained_begin() (only where necessary, not the whole subgraph in
general)

2. wait in the global queue

3. OK, now we can do all the modifications

4. Kick the global queue, so that the next client will get access

5. drained_end()
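
A minimal sketch of that flow, assuming a hypothetical global CoMutex
named graph_modify_lock (none of these names come from any series; the
drain calls are the existing APIs):

static CoMutex graph_modify_lock;   /* qemu_co_mutex_init() at startup */

static void coroutine_fn modify_graph_example(BlockDriverState *bs)
{
    /* 1. drain only where strictly necessary */
    bdrv_drained_begin(bs);

    /* 2. wait for our turn to touch the graph */
    qemu_co_mutex_lock(&graph_modify_lock);

    /* 3. do all the graph modifications here */

    /* 4. let the next client in */
    qemu_co_mutex_unlock(&graph_modify_lock);

    /* 5. resume I/O */
    bdrv_drained_end(bs);
}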


-- 
Best regards,
Vladimir

Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Emanuele Giuseppe Esposito 2 years, 1 month ago
>
> Ah seems I understand what you mean.
>
> One of my arguments is that "drain" - is not a lock against other
> clients who want to modify the graph. Because, drained section allows
> nested drained sections.
>
> And you try to solve it, by draining more things, this way, we'll drain
> also the job, which is a possible client, who may want to modify the
> graph in parallel.
>
> So, in other words, when we want to modify the graph, we drain the whole
> connectivity component of the graph. And we think that we are safe from
> other graph modifications because all related jobs are drained.
> Interesting, is that possible that some not drained job from another
> connectivity component will want to connect some node from our drained
> component?

You mean another job or whatever calling bdrv_find_node() on a random
graph? Yes, that is not protected. But can this happen?

That's the question. What are the invariants here? Can anything happen?

>
> I just still feel that draining is a wrong mechanism to avoid
> interaction with other clients who want to modify the graph, because:
>
> 1. we stop the whole IO on all subgraph which is not necessary
> 2. draining is not a mutex, it allows nesting and it's ok when two
> different clients drain same nodes. Draining is just a requirement to do
> no IO at these nodes.
>
> And in your way, it seems that to be absolutely safe we'll need to drain
> everything..
>
> In my feeling it's better to keep draining what it is now: requirement
> to have no IO requests. And to isolate graph modifications from each
> other make a new synchronization mechanism, something like a global
> queue, where clients who want to get an access to graph modifications
> wait for their turn.

This is a matter of definitions. Subtree drains can theoretically work;
I managed to answer my own doubts in the last email I sent.

Yes, there is still some completely random case like the one I wrote
above, but I think it is more a matter of what we want to use and what
meaning we want to give to drains.

A global queue is what Kevin proposes; I will try to implement it.

> 
> 
> As I understand:
> 
> You want to make drained section to be a kind of lock, so that if we
> take this lock, we can modify the graph and we are sure that no other
> client will modify it in parallel.

Yes

> 
> But drained sections can be nested. So to solve the problem you try to
> drain more nodes: include subtree for example, or may be we need to
> drain the whole graph connectivity component, or (to be more safe) the
> whole block layer (to be sure that during drained section in one
> connectivity component some not-drained block-job from another
> connectivity component will not try to attach some node from our drained
> connectivity component)..
> 
> I still feel that draining is wrong tool for isolating graph modifying
> operations from each other:
> 
> 1. Drained sections can be nested, and natively that's not a kind of
> lock. That's just a requirement to have no IO requests. There may be
> several clients that want this condition on same set of nodes.
> 
> 2. Blocking IO on the whole connected subgraph or even on the whole
> block layer graph is not necessary, so that's an extra blocking.
> 
> 
> Could we instead do the following:
> 
> 1. Keep draining as is - a mechanism to stop IO on some nodes
> 
> 2. To isolate graph-modifying operations implement another mechanism:
> something like a global queue, where clients wait until they gen an
> access to modify block layer.
> 
> 
> This way, any graph modifying process would look like this:
> 
> 1. drained_begin(only where necessary, not the whole subgraph in general)
> 
> 2. wait in the global queue
> 
> 3. Ok, now we can do all the modifications
> 
> 4. Kick the global queue, so that next client will get an access
> 
> 5. drained_end()
> 
> 

Please have a look at what Kevin proposed (as described by me). I think
it's the same as what you are suggesting; I am pasting it below.
I will try to implement this and see whether it is doable.

I think the advantage of drains is that they aren't so complicated and
don't add any complication to the existing code.
But we'll see how it goes with this global queue.

> His idea is to replicate what blk_wait_while_drained() currently does
> but on a larger scale. It is something in between this subtree_drains
> logic and a rwlock.
> 
> Basically if I understood correctly, we could implement
> bdrv_wait_while_drained(), and put in all places where we would put a
> read lock: all the reads to ->parents and ->children.
> This function detects if the bdrv is under drain, and if so it will stop
> and wait that the drain finishes (ie the graph modification).
> On the other side, each write would just need to drain probably both
> nodes (simple drain), to signal that we are modifying the graph. Once
> bdrv_drained_begin() finishes, we are sure all coroutines are stopped.
> Once bdrv_drained_end() finishes, we automatically let all coroutine
> restart, and continue where they left off.
> 
> Seems a good compromise between drains and rwlock. What do you think?
> 
> I am not sure how painful it will be to implement though.
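
A rough sketch of what such a bdrv_wait_while_drained() could look like
(the helper and the wait-queue field are hypothetical assumptions, not
code from any series; only quiesce_counter and the coroutine queue API
exist today):

/*
 * Hypothetical: graph readers (->parents / ->children users) yield
 * while the node is drained, i.e. while the graph may be changing,
 * and are woken up again when the drained section ends.
 */
static void coroutine_fn bdrv_wait_while_drained_sketch(BlockDriverState *bs)
{
    while (bs->quiesce_counter > 0) {
        qemu_co_queue_wait(&bs->drain_wait_queue /* assumed field */, NULL);
    }
}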

Emanuele
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Vladimir Sementsov-Ogievskiy 2 years, 1 month ago
30.03.2022 12:09, Emanuele Giuseppe Esposito wrote:
>>
>> Ah seems I understand what you mean.
>>
>> One of my arguments is that "drain" - is not a lock against other
>> clients who want to modify the graph. Because, drained section allows
>> nested drained sections.
>>
>> And you try to solve it, by draining more things, this way, we'll drain
>> also the job, which is a possible client, who may want to modify the
>> graph in parallel.
>>
>> So, in other words, when we want to modify the graph, we drain the whole
>> connectivity component of the graph. And we think that we are safe from
>> other graph modifications because all related jobs are drained.
>> Interesting, is that possible that some not drained job from another
>> connectivity component will want to connect some node from our drained
>> component?
> 
> You mean another job or whathever calling bdrv_find_node() on a random
> graph? Yes that is not protected. But can this happen?
> 
> That's the question. What are the invariants here? Can anything happen?
> 
>>
>> I just still feel that draining is a wrong mechanism to avoid
>> interaction with other clients who want to modify the graph, because:
>>
>> 1. we stop the whole IO on all subgraph which is not necessary
>> 2. draining is not a mutex, it allows nesting and it's ok when two
>> different clients drain same nodes. Draining is just a requirement to do
>> no IO at these nodes.
>>
>> And in your way, it seems that to be absolutely safe we'll need to drain
>> everything..
>>
>> In my feeling it's better to keep draining what it is now: requirement
>> to have no IO requests. And to isolate graph modifications from each
>> other make a new synchronization mechanism, something like a global
>> queue, where clients who want to get an access to graph modifications
>> wait for their turn.
> 
> This is a matter of definitions. Subtree drains can theoretically work,
> I managed to answer to my own doubts in the last email I sent.
> 
> Yes, there is still some completely random case like the one I wrote
> above, but I think it is more a matter of what we want to use and what
> meaning we want to give to drains.
> 
> Global queue is what Kevin proposes, I will try to implement it.
> 
>>
>>
>> As I understand:
>>
>> You want to make drained section to be a kind of lock, so that if we
>> take this lock, we can modify the graph and we are sure that no other
>> client will modify it in parallel.
> 
> Yes
> 
>>
>> But drained sections can be nested. So to solve the problem you try to
>> drain more nodes: include subtree for example, or may be we need to
>> drain the whole graph connectivity component, or (to be more safe) the
>> whole block layer (to be sure that during drained section in one
>> connectivity component some not-drained block-job from another
>> connectivity component will not try to attach some node from our drained
>> connectivity component)..
>>
>> I still feel that draining is wrong tool for isolating graph modifying
>> operations from each other:
>>
>> 1. Drained sections can be nested, and natively that's not a kind of
>> lock. That's just a requirement to have no IO requests. There may be
>> several clients that want this condition on same set of nodes.
>>
>> 2. Blocking IO on the whole connected subgraph or even on the whole
>> block layer graph is not necessary, so that's an extra blocking.
>>
>>
>> Could we instead do the following:
>>
>> 1. Keep draining as is - a mechanism to stop IO on some nodes
>>
>> 2. To isolate graph-modifying operations implement another mechanism:
>> something like a global queue, where clients wait until they gen an
>> access to modify block layer.
>>
>>
>> This way, any graph modifying process would look like this:
>>
>> 1. drained_begin(only where necessary, not the whole subgraph in general)
>>
>> 2. wait in the global queue
>>
>> 3. Ok, now we can do all the modifications
>>
>> 4. Kick the global queue, so that next client will get an access
>>
>> 5. drained_end()
>>
>>
> 
> Please give a look at what Kevin (described by me) proposed. I think
> it's the same as you are suggesting. I am pasting it below.
> I will try to implement this and see if it is doable or not.
> 
> I think the advantage of drains is that it isn't so complicated and
> doesn't add any complication to the existing code.
> But we'll see how it goes with this global queue.
> 
>> His idea is to replicate what blk_wait_while_drained() currently does
>> but on a larger scale. It is something in between this subtree_drains
>> logic and a rwlock.
>>
>> Basically if I understood correctly, we could implement
>> bdrv_wait_while_drained(), and put in all places where we would put a
>> read lock: all the reads to ->parents and ->children.
>> This function detects if the bdrv is under drain, and if so it will stop
>> and wait that the drain finishes (ie the graph modification).
>> On the other side, each write would just need to drain probably both
>> nodes (simple drain), to signal that we are modifying the graph. Once
>> bdrv_drained_begin() finishes, we are sure all coroutines are stopped.
>> Once bdrv_drained_end() finishes, we automatically let all coroutine
>> restart, and continue where they left off.
>>
>> Seems a good compromise between drains and rwlock. What do you think?
>>
>> I am not sure how painful it will be to implement though.
> 

Hm, I don't see where the global queue is here. Or is bdrv_wait_while_drained() global, with no bs argument?




-- 
Best regards,
Vladimir
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Emanuele Giuseppe Esposito 2 years, 1 month ago

Am 30/03/2022 um 11:52 schrieb Vladimir Sementsov-Ogievskiy:
> 30.03.2022 12:09, Emanuele Giuseppe Esposito wrote:
>>>
>>> Ah seems I understand what you mean.
>>>
>>> One of my arguments is that "drain" - is not a lock against other
>>> clients who want to modify the graph. Because, drained section allows
>>> nested drained sections.
>>>
>>> And you try to solve it, by draining more things, this way, we'll drain
>>> also the job, which is a possible client, who may want to modify the
>>> graph in parallel.
>>>
>>> So, in other words, when we want to modify the graph, we drain the whole
>>> connectivity component of the graph. And we think that we are safe from
>>> other graph modifications because all related jobs are drained.
>>> Interesting, is that possible that some not drained job from another
>>> connectivity component will want to connect some node from our drained
>>> component?
>>
>> You mean another job or whathever calling bdrv_find_node() on a random
>> graph? Yes that is not protected. But can this happen?
>>
>> That's the question. What are the invariants here? Can anything happen?
>>
>>>
>>> I just still feel that draining is a wrong mechanism to avoid
>>> interaction with other clients who want to modify the graph, because:
>>>
>>> 1. we stop the whole IO on all subgraph which is not necessary
>>> 2. draining is not a mutex, it allows nesting and it's ok when two
>>> different clients drain same nodes. Draining is just a requirement to do
>>> no IO at these nodes.
>>>
>>> And in your way, it seems that to be absolutely safe we'll need to drain
>>> everything..
>>>
>>> In my feeling it's better to keep draining what it is now: requirement
>>> to have no IO requests. And to isolate graph modifications from each
>>> other make a new synchronization mechanism, something like a global
>>> queue, where clients who want to get an access to graph modifications
>>> wait for their turn.
>>
>> This is a matter of definitions. Subtree drains can theoretically work,
>> I managed to answer to my own doubts in the last email I sent.
>>
>> Yes, there is still some completely random case like the one I wrote
>> above, but I think it is more a matter of what we want to use and what
>> meaning we want to give to drains.
>>
>> Global queue is what Kevin proposes, I will try to implement it.
>>
>>>
>>>
>>> As I understand:
>>>
>>> You want to make drained section to be a kind of lock, so that if we
>>> take this lock, we can modify the graph and we are sure that no other
>>> client will modify it in parallel.
>>
>> Yes
>>
>>>
>>> But drained sections can be nested. So to solve the problem you try to
>>> drain more nodes: include subtree for example, or may be we need to
>>> drain the whole graph connectivity component, or (to be more safe) the
>>> whole block layer (to be sure that during drained section in one
>>> connectivity component some not-drained block-job from another
>>> connectivity component will not try to attach some node from our drained
>>> connectivity component)..
>>>
>>> I still feel that draining is wrong tool for isolating graph modifying
>>> operations from each other:
>>>
>>> 1. Drained sections can be nested, and natively that's not a kind of
>>> lock. That's just a requirement to have no IO requests. There may be
>>> several clients that want this condition on same set of nodes.
>>>
>>> 2. Blocking IO on the whole connected subgraph or even on the whole
>>> block layer graph is not necessary, so that's an extra blocking.
>>>
>>>
>>> Could we instead do the following:
>>>
>>> 1. Keep draining as is - a mechanism to stop IO on some nodes
>>>
>>> 2. To isolate graph-modifying operations implement another mechanism:
>>> something like a global queue, where clients wait until they gen an
>>> access to modify block layer.
>>>
>>>
>>> This way, any graph modifying process would look like this:
>>>
>>> 1. drained_begin(only where necessary, not the whole subgraph in
>>> general)
>>>
>>> 2. wait in the global queue
>>>
>>> 3. Ok, now we can do all the modifications
>>>
>>> 4. Kick the global queue, so that next client will get an access
>>>
>>> 5. drained_end()
>>>
>>>
>>
>> Please give a look at what Kevin (described by me) proposed. I think
>> it's the same as you are suggesting. I am pasting it below.
>> I will try to implement this and see if it is doable or not.
>>
>> I think the advantage of drains is that it isn't so complicated and
>> doesn't add any complication to the existing code.
>> But we'll see how it goes with this global queue.
>>
>>> His idea is to replicate what blk_wait_while_drained() currently does
>>> but on a larger scale. It is something in between this subtree_drains
>>> logic and a rwlock.
>>>
>>> Basically if I understood correctly, we could implement
>>> bdrv_wait_while_drained(), and put in all places where we would put a
>>> read lock: all the reads to ->parents and ->children.
>>> This function detects if the bdrv is under drain, and if so it will stop
>>> and wait that the drain finishes (ie the graph modification).
>>> On the other side, each write would just need to drain probably both
>>> nodes (simple drain), to signal that we are modifying the graph. Once
>>> bdrv_drained_begin() finishes, we are sure all coroutines are stopped.
>>> Once bdrv_drained_end() finishes, we automatically let all coroutine
>>> restart, and continue where they left off.
>>>
>>> Seems a good compromise between drains and rwlock. What do you think?
>>>
>>> I am not sure how painful it will be to implement though.
>>
> 
> Hm, I don't see, where is global queue here? Or
> bdrv_wait_while_drained() is global and has no bs arguement?
> 
> 

From what I understand, blk_wait_while_drained() has a queue internally.
Yes, the queue would be global, and all coroutines that want to perform
a read would have to wait until the modification has ended.

Whether to wake the queue up with a drain or with a write lock is
another point worth discussing.
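
To make this more concrete, below is a minimal sketch of the read side.
Everything in it is illustrative: the helper name, the global CoQueue and
the reuse of bs->quiesce_counter as the "under drain" test are assumptions,
not an existing API.

/* Sketch only, not current QEMU code. */
#include "qemu/osdep.h"
#include "qemu/atomic.h"
#include "qemu/coroutine.h"
#include "block/block_int.h"

static CoQueue graph_reader_queue;   /* global; set up with qemu_co_queue_init() */

/* Would be called before every ->parents / ->children read in I/O code. */
static void coroutine_fn bdrv_wait_while_drained(BlockDriverState *bs)
{
    while (qatomic_read(&bs->quiesce_counter) > 0) {
        /*
         * A drained section is active, so the main loop may be rewriting
         * the graph right now: park this coroutine on the global queue
         * until the drain (i.e. the graph modification) is over.
         */
        qemu_co_queue_wait(&graph_reader_queue, NULL);
    }
}

The writer side would then be a plain drained section around the graph
change, with bdrv_drained_end() kicking graph_reader_queue (for example
with qemu_co_enter_next() in a loop) so that the parked readers continue
where they left off.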
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Emanuele Giuseppe Esposito 2 years, 1 month ago

Am 09/03/2022 um 14:26 schrieb Emanuele Giuseppe Esposito:
>> Next, I have a problem in mind, that in past lead to a lot of iotest 30
>> failures. Next there were different fixes and improvements, but the core
>> problem (as far as I understand) is still here: nothing protects us when
>> we are in some graph modification process (for example block-job
>> finalization) do yield, switch to other coroutine and enter another
>> graph modification process (for example, another block-job finaliztion)..
> That's another point to consider. I don't really have a solution for this.
> 
On a side note, that might not be a real problem.
If I understand correctly, your fear is that we are doing something like
parent->children[x] = new_node // partial graph operation
/* yield to another coroutine */
coroutine reads/writes parent->children[x] and/or new_node->parents[y]
/* yield back */
new_node->parents[y] = parent // end of the initial graph operation

Is that what you are pointing out here?
If so, is there a concrete example of this? Because yields and drains
(which can eventually poll) seem to be placed either before or after the
whole graph modification section. In other words, even if a coroutine
enters, it will always be before or after the _whole_ graph modification
is performed.

Thank you,
Emanuele
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Vladimir Sementsov-Ogievskiy 2 years, 1 month ago
17.03.2022 00:55, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 09/03/2022 um 14:26 schrieb Emanuele Giuseppe Esposito:
>>> Next, I have a problem in mind, that in past lead to a lot of iotest 30
>>> failures. Next there were different fixes and improvements, but the core
>>> problem (as far as I understand) is still here: nothing protects us when
>>> we are in some graph modification process (for example block-job
>>> finalization) do yield, switch to other coroutine and enter another
>>> graph modification process (for example, another block-job finaliztion)..
>> That's another point to consider. I don't really have a solution for this.
>>
> On a side note, that might not be a real problem.
> If I understand correctly, your fear is that we are doing something like
> parent->children[x] = new_node // partial graph operation
> /* yield to another coroutine */
> coroutine reads/writes parent->children[x] and/or new_node->parents[y]
> /* yield back */
> new_node->parents[y] = parent // end of the initial graph operation
> 
> Is that what you are pointing out here?
> If so, is there a concrete example for this? Because yields and drains
> (that eventually can poll) seem to be put either before or after the
> whole graph modification section. In other words, even if a coroutine
> enters, it will be always before or after the _whole_ graph modification
> is performed.
> 

The old example was here: https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05212.html  - not sure how applicable it is now.

Another example - look at bdrv_drop_intermediate() in block.c and at the TODO comments in it.

In both cases the problem is that we want to update some metadata in qcow2 (the backing file name) as part of a block-graph modification. But this update writes to the qcow2 header, which may yield and switch to some other block-graph modification code.
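
To put the hazard in code form, the dangerous shape is roughly the
following. This is a simplified, hypothetical sketch, not the actual
bdrv_drop_intermediate() code; bdrv_co_flush() only stands in for the
header update I/O, and the graph steps are left as comments:

/* Hypothetical sketch of the hazard, not real QEMU code. */
static void coroutine_fn graph_change_with_metadata_update(BlockDriverState *parent)
{
    /* Step 1: first half of the graph modification (detach the old backing). */
    /* ... ->children / ->parents are now in a half-updated state ... */

    /*
     * Step 2: rewrite the backing file name in the qcow2 header.  This is
     * I/O on the image file, so the coroutine can yield here; while it is
     * suspended, another graph-modifying coroutine (e.g. another block
     * job's finalization) can run and observe the half-updated graph.
     */
    bdrv_co_flush(parent);   /* stand-in for the header write */

    /* Step 3: second half of the graph modification (attach the new backing). */
}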


-- 
Best regards,
Vladimir
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Stefan Hajnoczi 2 years, 2 months ago
On Tue, Mar 01, 2022 at 09:21:08AM -0500, Emanuele Giuseppe Esposito wrote:
> This serie tries to provide a proof of concept and a clear explanation
> on why we need to use drains (and more precisely subtree_drains)
> to replace the aiocontext lock, especially to protect BlockDriverState
> ->children and ->parent lists.
> 
> Just a small recap on the key concepts:
> * We split block layer APIs in "global state" (GS), "I/O", and
> "global state or I/O".
>   GS are running in the main loop, under BQL, and are the only
>   one allowed to modify the BlockDriverState graph.
> 
>   I/O APIs are thread safe and can run in any thread
> 
>   "global state or I/O" are essentially all APIs that use
>   BDRV_POLL_WHILE. This is because there can be only 2 threads
>   that can use BDRV_POLL_WHILE: main loop and the iothread that
>   runs the aiocontext.
> 
> * Drains allow the caller (either main loop or iothread running
> the context) to wait all in_flights requests and operations
> of a BDS: normal drains target a given node and is parents, while
> subtree ones also include the subgraph of the node. Siblings are
> not affected by any of these two kind of drains.

Siblings are drained to the extent required for their parent node to
reach in_flight == 0.

I haven't checked the code but I guess the case you're alluding to is
that siblings with multiple parents could have other I/O in flight that
will not be drained and further I/O can be submitted after the parent
has drained?

> After bdrv_drained_begin, no more request is allowed to come
> from the affected nodes. Therefore the only actor left working
> on a drained part of the graph should be the main loop.

It's worth remembering that this invariant is not enforced. Draining is
a cooperative scheme. Everything *should* be notified and stop
submitting I/O, but there is no assertion or return -EBUSY if a request
is submitted while the BDS is drained.

If the thread that drained the BDS wants, I think it can (legally)
submit I/O within the drained section.

> What do we intend to do
> -----------------------
> We want to remove the AioContext lock. It is not 100% clear on how
> many things we are protecting with it, and why.
> As a starter, we want to protect BlockDriverState ->parents and
> ->children lists, since they are read by main loop and I/O but
> only written by main loop under BQL. The function that modifies
> these lists is bdrv_replace_child_common().
> 
> How do we want to do it
> -----------------------
> We individuated as ideal subtitute of AioContext lock
> the subtree_drain API. The reason is simple: draining prevents the iothread to read or write the nodes, so once the main loop finishes
> executing bdrv_drained_begin() on the interested graph, we are sure that
> the iothread is not going to look or even interfere with that part of the graph.
> We are also sure that the only two actors that can look at a specific
> BlockDriverState in any given context are the main loop and the
> iothread running the AioContext (ensured by "global state or IO" logic).
> 
> Why use _subtree_ instead of normal drain
> -----------------------------------------
> A simple drain "blocks" a given node and all its parents.
> But it doesn't touch the child.
> This means that if we use a simple drain, a child can always
> keep processing requests, and eventually end up calling itself
> bdrv_drained_begin, ending up reading the parents node while the main loop
> is modifying them. Therefore a subtree drain is necessary.
> 
> Possible scenarios
> -------------------
> Keeping in mind that we can only have an iothread and the main loop
> draining on a certain node, we could have:
> 
> main loop successfully drains and then iothread tries to drain:
>   impossible scenario, as iothread is already stopped once main
>   successfully drains.
> 
> iothread successfully drains and then main loop drains:
>   should not be a problem, as:
>   1) the iothread should be already "blocked" by its own drain

Once drained_begin() returns in the IOThread, the IOThread can do
anything it wants, including submitting more I/O. I don't consider that
"blocked", so I'm not sure what this sentence means?

The way the main loop thread protects itself against the IOThread is via
the aio "external" handler concept and block job drain callbacks, which
are activated by drained_begin(). They ensure that the IOThread will not
perform further processing that submits I/O, but the IOThread code that
invoked drained_begin() can still do anything it wants.

>   2) main loop would still wait for it to completely block
>   There is the issue of mirror overriding such scenario to avoid
>   having deadlocks, but that is handled in the next section.
> 
> main loop and iothread try to drain together:
>   As above, this case doens't really matter. As long as
>   bdrv_drained_begin invariant is respected, the main loop will
>   continue only once the iothread is "blocked" on that part of the graph.

I'm not sure about this, see above.

> 
> A note on iothread draining
> ---------------------------
> Theoretically draining from an iothread should not be possible,
> as the iothread would be scheduling a bh in the main loop waiting
> for itself to stop, even though it is not yet stopped since it is waiting for the bh.

I'm not sure what you mean. Which function schedules this BH?

> 
> This is what would happen in the tests in patch 5 if .drained_poll
> was not implemented.
> 
> Therefore, one solution is to use .drained_poll callback in BlockJobDriver.
> This callback overrides the default job poll() behavior, and
> allows the polling condition to stop waiting for the job.
> It is actually used only in mirror.
> This however breaks bdrv_drained_begin invariant, because the
> iothread is not really blocked on that node but continues running.
> In order to fix this, patch 4 allows the polling condition to be
> used only by the iothread, and not the main loop too, preventing
> the drain to return before the iothread is effectively stopped.
> This is also shown in the tests in patch 5. If the fix in patch
> 4 is removed, then the main loop drain will return earlier and
> allow the iothread to run and drain together.
> 
> The other patches in this serie are cherry-picked from the various
> series I already sent, and are included here just to allow
> subtree_drained_begin/end_unlocked implementation.
> 
> Emanuele Giuseppe Esposito (5):
>   aio-wait.h: introduce AIO_WAIT_WHILE_UNLOCKED
>   introduce BDRV_POLL_WHILE_UNLOCKED
>   block/io.c: introduce bdrv_subtree_drained_{begin/end}_unlocked
>   child_job_drained_poll: override polling condition only when in home
>     thread
>   test-bdrv-drain: ensure draining from main loop stops iothreads
> 
>  block/io.c                   |  48 ++++++--
>  blockjob.c                   |   3 +-
>  include/block/aio-wait.h     |  15 ++-
>  include/block/block.h        |   7 ++
>  tests/unit/test-bdrv-drain.c | 218 +++++++++++++++++++++++++++++++++++
>  5 files changed, 274 insertions(+), 17 deletions(-)
> 
> -- 
> 2.31.1
> 
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Emanuele Giuseppe Esposito 2 years, 1 month ago

Am 02/03/2022 um 10:47 schrieb Stefan Hajnoczi:
> On Tue, Mar 01, 2022 at 09:21:08AM -0500, Emanuele Giuseppe Esposito wrote:
>> This serie tries to provide a proof of concept and a clear explanation
>> on why we need to use drains (and more precisely subtree_drains)
>> to replace the aiocontext lock, especially to protect BlockDriverState
>> ->children and ->parent lists.
>>
>> Just a small recap on the key concepts:
>> * We split block layer APIs in "global state" (GS), "I/O", and
>> "global state or I/O".
>>   GS are running in the main loop, under BQL, and are the only
>>   one allowed to modify the BlockDriverState graph.
>>
>>   I/O APIs are thread safe and can run in any thread
>>
>>   "global state or I/O" are essentially all APIs that use
>>   BDRV_POLL_WHILE. This is because there can be only 2 threads
>>   that can use BDRV_POLL_WHILE: main loop and the iothread that
>>   runs the aiocontext.
>>
>> * Drains allow the caller (either main loop or iothread running
>> the context) to wait all in_flights requests and operations
>> of a BDS: normal drains target a given node and is parents, while
>> subtree ones also include the subgraph of the node. Siblings are
>> not affected by any of these two kind of drains.
> 
> Siblings are drained to the extent required for their parent node to
> reach in_flight == 0.
> 
> I haven't checked the code but I guess the case you're alluding to is
> that siblings with multiple parents could have other I/O in flight that
> will not be drained and further I/O can be submitted after the parent
> has drained?

Yes, in theory this can happen. I don't really know whether it happens
in practice, or how likely it is to happen.

The alternative would be a drain that blocks the whole graph,
siblings included, but that would probably be overkill.

> 
>> After bdrv_drained_begin, no more request is allowed to come
>> from the affected nodes. Therefore the only actor left working
>> on a drained part of the graph should be the main loop.
> 
> It's worth remembering that this invariant is not enforced. Draining is
> a cooperative scheme. Everything *should* be notified and stop
> submitting I/O, but there is no assertion or return -EBUSY if a request
> is submitted while the BDS is drained.

Yes, it is a cooperative scheme, and everything except mirror (fixed in the
last patch) seems to follow the invariant.
> 
> If the thread that drained the BDS wants, I think it can (legally)
> submit I/O within the drained section.
> 

Yes, but at some point the main loop drains too, before changing the
graph. At that point the thread must be stopped, according to the invariant
above. So it doesn't matter whether the thread has already drained or not.

>> What do we intend to do
>> -----------------------
>> We want to remove the AioContext lock. It is not 100% clear on how
>> many things we are protecting with it, and why.
>> As a starter, we want to protect BlockDriverState ->parents and
>> ->children lists, since they are read by main loop and I/O but
>> only written by main loop under BQL. The function that modifies
>> these lists is bdrv_replace_child_common().
>>
>> How do we want to do it
>> -----------------------
>> We individuated as ideal subtitute of AioContext lock
>> the subtree_drain API. The reason is simple: draining prevents the iothread to read or write the nodes, so once the main loop finishes
>> executing bdrv_drained_begin() on the interested graph, we are sure that
>> the iothread is not going to look or even interfere with that part of the graph.
>> We are also sure that the only two actors that can look at a specific
>> BlockDriverState in any given context are the main loop and the
>> iothread running the AioContext (ensured by "global state or IO" logic).
>>
>> Why use _subtree_ instead of normal drain
>> -----------------------------------------
>> A simple drain "blocks" a given node and all its parents.
>> But it doesn't touch the child.
>> This means that if we use a simple drain, a child can always
>> keep processing requests, and eventually end up calling itself
>> bdrv_drained_begin, ending up reading the parents node while the main loop
>> is modifying them. Therefore a subtree drain is necessary.
>>
>> Possible scenarios
>> -------------------
>> Keeping in mind that we can only have an iothread and the main loop
>> draining on a certain node, we could have:
>>
>> main loop successfully drains and then iothread tries to drain:
>>   impossible scenario, as iothread is already stopped once main
>>   successfully drains.
>>
>> iothread successfully drains and then main loop drains:
>>   should not be a problem, as:
>>   1) the iothread should be already "blocked" by its own drain
> 
> Once drained_begin() returns in the IOThread, the IOThread can do
> anything it wants, including more submitting I/O. I don't consider that
> "blocked", so I'm not sure what this sentence means?
> 
> The way the main loop thread protects itself against the IOThread is via
> the aio "external" handler concept and block job drain callbacks, which
> are activated by drained_begin(). They ensure that the IOThread will not
> perform further processing that submits I/O, but the IOThread code that
> invoked drained_begin() can still do anything it wants.

As above, I think that regardless of what the iothread is doing, once the
main loop has finished executing bdrv_drained_begin() the iothread should
not be doing anything related to the nodes that have been drained.

> 
>>   2) main loop would still wait for it to completely block
>>   There is the issue of mirror overriding such scenario to avoid
>>   having deadlocks, but that is handled in the next section.
>>
>> main loop and iothread try to drain together:
>>   As above, this case doens't really matter. As long as
>>   bdrv_drained_begin invariant is respected, the main loop will
>>   continue only once the iothread is "blocked" on that part of the graph.
> 
> I'm not sure about this, see above.
> 
>>
>> A note on iothread draining
>> ---------------------------
>> Theoretically draining from an iothread should not be possible,
>> as the iothread would be scheduling a bh in the main loop waiting
>> for itself to stop, even though it is not yet stopped since it is waiting for the bh.
> 
> I'm not sure what you mean. Which function schedules this BH?
bdrv_drained_begin() and _end() schedule a BH to run the draining code in
the main loop if they are called in a coroutine. That's what I meant here :)
> 
>>
>> This is what would happen in the tests in patch 5 if .drained_poll
>> was not implemented.
>>
>> Therefore, one solution is to use .drained_poll callback in BlockJobDriver.
>> This callback overrides the default job poll() behavior, and
>> allows the polling condition to stop waiting for the job.
>> It is actually used only in mirror.
>> This however breaks bdrv_drained_begin invariant, because the
>> iothread is not really blocked on that node but continues running.
>> In order to fix this, patch 4 allows the polling condition to be
>> used only by the iothread, and not the main loop too, preventing
>> the drain to return before the iothread is effectively stopped.
>> This is also shown in the tests in patch 5. If the fix in patch
>> 4 is removed, then the main loop drain will return earlier and
>> allow the iothread to run and drain together.
>>
>> The other patches in this serie are cherry-picked from the various
>> series I already sent, and are included here just to allow
>> subtree_drained_begin/end_unlocked implementation.
>>
>> Emanuele Giuseppe Esposito (5):
>>   aio-wait.h: introduce AIO_WAIT_WHILE_UNLOCKED
>>   introduce BDRV_POLL_WHILE_UNLOCKED
>>   block/io.c: introduce bdrv_subtree_drained_{begin/end}_unlocked
>>   child_job_drained_poll: override polling condition only when in home
>>     thread
>>   test-bdrv-drain: ensure draining from main loop stops iothreads
>>
>>  block/io.c                   |  48 ++++++--
>>  blockjob.c                   |   3 +-
>>  include/block/aio-wait.h     |  15 ++-
>>  include/block/block.h        |   7 ++
>>  tests/unit/test-bdrv-drain.c | 218 +++++++++++++++++++++++++++++++++++
>>  5 files changed, 274 insertions(+), 17 deletions(-)
>>
>> -- 
>> 2.31.1
>>
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Emanuele Giuseppe Esposito 2 years, 1 month ago

Am 09/03/2022 um 14:26 schrieb Emanuele Giuseppe Esposito:
>>> * Drains allow the caller (either main loop or iothread running
>>> the context) to wait all in_flights requests and operations
>>> of a BDS: normal drains target a given node and is parents, while
>>> subtree ones also include the subgraph of the node. Siblings are
>>> not affected by any of these two kind of drains.
>> Siblings are drained to the extent required for their parent node to
>> reach in_flight == 0.
>>
>> I haven't checked the code but I guess the case you're alluding to is
>> that siblings with multiple parents could have other I/O in flight that
>> will not be drained and further I/O can be submitted after the parent
>> has drained?
> Yes, this in theory can happen. I don't really know if this happens
> practically, and how likely is to happen.
> 
> The alternative would be to make a drain that blocks the whole graph,
> siblings included, but that would probably be an overkill.
> 

So I have thought about this, and I think maybe this is not a concrete
problem.
Suppose we have a graph where "parent" has 2 children: "child" and
"sibling". "sibling" also has a blockjob.

Now, the main loop wants to modify the parent-child relation and maybe detach
the child from the parent.

1st wrong assumption: that the sibling is not drained. Actually, my strategy
takes draining both nodes into account, also because the parent could be in
another graph. Therefore the sibling is drained.

But let's assume "sibling" is the sibling of the parent.

Therefore we have
"child" -> "parent" -> "grandparent"
and
"blockjob" -> "sibling" -> "grandparent"

The issue is the following: the main loop can't drain "sibling", because
subtree_drained does not reach it. Therefore the blockjob can still run
while the main loop modifies "child" -> "parent". The blockjob can either:
1) drain, but this won't affect "child" -> "parent"
2) read the graph in ways other than draining; for example,
.set_aio_context recursively touches the whole graph.
3) write the graph.

3) can only be performed in the main loop, because it's a graph
operation. It means that the blockjob runs only when the graph-modifying
coroutine/BH is not running; they never run together.
The safety of this operation relies on where the drains are and will be
inserted. If you do it like in my patch "block.c:
bdrv_replace_child_noperm: first call ->attach(), and then add child",
then we would have a problem, because we would drain between two writes,
and the blockjob would find an inconsistent graph. If we do it as we seem
to do it so far, then we won't really have any problem.

2) is a read, and could theoretically be performed by another thread. But
is there a function that does that? .set_aio_context, for example, is a GS
function, so we fall back to case 3) and nothing bad would happen.

Is there a counterexample for this?

-----------

Talking about something else, I discussed with Kevin what *seems* to be
an alternative way to do this, instead of adding drains everywhere.
His idea is to replicate what blk_wait_while_drained() currently does,
but on a larger scale. It is something in between this subtree_drains
logic and an rwlock.

Basically, if I understood correctly, we could implement
bdrv_wait_while_drained() and put it in all places where we would put a
read lock: all the reads of ->parents and ->children.
This function detects whether the bdrv is under drain, and if so it stops
and waits until the drain (i.e. the graph modification) finishes.
On the other side, each write would just need to drain probably both
nodes (simple drain), to signal that we are modifying the graph. Once
bdrv_drained_begin() finishes, we are sure all coroutines are stopped.
Once bdrv_drained_end() finishes, we automatically let all coroutines
restart and continue where they left off.

Seems like a good compromise between drains and an rwlock. What do you think?

I am not sure how painful it will be to implement, though.

Thank you,
Emanuele
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Hanna Reitz 2 years, 1 month ago
On 17.03.22 17:23, Emanuele Giuseppe Esposito wrote:
>
> Am 09/03/2022 um 14:26 schrieb Emanuele Giuseppe Esposito:
>>>> * Drains allow the caller (either main loop or iothread running
>>>> the context) to wait all in_flights requests and operations
>>>> of a BDS: normal drains target a given node and is parents, while
>>>> subtree ones also include the subgraph of the node. Siblings are
>>>> not affected by any of these two kind of drains.
>>> Siblings are drained to the extent required for their parent node to
>>> reach in_flight == 0.
>>>
>>> I haven't checked the code but I guess the case you're alluding to is
>>> that siblings with multiple parents could have other I/O in flight that
>>> will not be drained and further I/O can be submitted after the parent
>>> has drained?
>> Yes, this in theory can happen. I don't really know if this happens
>> practically, and how likely is to happen.
>>
>> The alternative would be to make a drain that blocks the whole graph,
>> siblings included, but that would probably be an overkill.
>>
> So I have thought about this, and I think maybe this is not a concrete
> problem.
> Suppose we have a graph where "parent" has 2 children: "child" and
> "sibling". "sibling" also has a blockjob.
>
> Now, main loop wants to modify parent-child relation and maybe detach
> child from parent.
>
> 1st wrong assumption: the sibling is not drained. Actually my strategy
> takes into account draining both nodes, also because parent could be in
> another graph. Therefore sibling is drained.
>
> But let's assume "sibling" is the sibling of the parent.
>
> Therefore we have
> "child" -> "parent" -> "grandparent"
> and
> "blockjob" -> "sibling" -> "grandparent"
>
> The issue is the following: main loop can't drain "sibling", because
> subtree_drained does not reach it. Therefore blockjob can still run
> while main loop modifies "child" -> "parent". Blockjob can either:
> 1) drain, but this won't affect "child" -> "parent"
> 2) read the graph in other ways different from drain, for example
> .set_aio_context recursively touches the whole graph.
> 3) write the graph.

I don’t really understand the problem here.  If the block job only 
operates on the sibling subgraph, why would it care what’s going on in 
the other subgraph?

Block jobs should own all nodes that are associated with them (e.g. 
because they intend to drop or replace them when the job is done), so 
when part of the graph is drained, all jobs that could modify that part 
should be drained, too.

> 3) can be only performed in the main loop, because it's a graph
> operation. It means that the blockjob runs when the graph modifying
> coroutine/bh is not running. They never run together.
> The safety of this operation relies on where the drains are and will be
> inserted. If you do like in my patch "block.c:
> bdrv_replace_child_noperm: first call ->attach(), and then add child",
> then we would have problem, because we drain between two writes, and the
> blockjob will find an inconsistent graph. If we do it as we seem to do
> it so far, then we won't really have any problem.
>
> 2) is a read, and can theoretically be performed by another thread. But
> is there a function that does that? .set_aio_context for example is a GS
> function, so we will fall back to case 3) and nothing bad would happen.
>
> Is there a counter example for this?
>
> -----------
>
> Talking about something else, I discussed with Kevin what *seems* to be
> an alternative way to do this, instead of adding drains everywhere.
> His idea is to replicate what blk_wait_while_drained() currently does
> but on a larger scale. It is something in between this subtree_drains
> logic and a rwlock.
>
> Basically if I understood correctly, we could implement
> bdrv_wait_while_drained(), and put in all places where we would put a
> read lock: all the reads to ->parents and ->children.
> This function detects if the bdrv is under drain, and if so it will stop
> and wait that the drain finishes (ie the graph modification).
> On the other side, each write would just need to drain probably both
> nodes (simple drain), to signal that we are modifying the graph. Once
> bdrv_drained_begin() finishes, we are sure all coroutines are stopped.
> Once bdrv_drained_end() finishes, we automatically let all coroutine
> restart, and continue where they left off.
>
> Seems a good compromise between drains and rwlock. What do you think?

Well, sounds complicated.  So I’m asking myself whether this would be 
noticeably better than just an RwLock for graph modifications, like the 
global lock Vladimir has proposed.

Hanna


Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Emanuele Giuseppe Esposito 2 years, 1 month ago

Am 30/03/2022 um 12:53 schrieb Hanna Reitz:
> On 17.03.22 17:23, Emanuele Giuseppe Esposito wrote:
>>
>> Am 09/03/2022 um 14:26 schrieb Emanuele Giuseppe Esposito:
>>>>> * Drains allow the caller (either main loop or iothread running
>>>>> the context) to wait all in_flights requests and operations
>>>>> of a BDS: normal drains target a given node and is parents, while
>>>>> subtree ones also include the subgraph of the node. Siblings are
>>>>> not affected by any of these two kind of drains.
>>>> Siblings are drained to the extent required for their parent node to
>>>> reach in_flight == 0.
>>>>
>>>> I haven't checked the code but I guess the case you're alluding to is
>>>> that siblings with multiple parents could have other I/O in flight that
>>>> will not be drained and further I/O can be submitted after the parent
>>>> has drained?
>>> Yes, this in theory can happen. I don't really know if this happens
>>> practically, and how likely is to happen.
>>>
>>> The alternative would be to make a drain that blocks the whole graph,
>>> siblings included, but that would probably be an overkill.
>>>
>> So I have thought about this, and I think maybe this is not a concrete
>> problem.
>> Suppose we have a graph where "parent" has 2 children: "child" and
>> "sibling". "sibling" also has a blockjob.
>>
>> Now, main loop wants to modify parent-child relation and maybe detach
>> child from parent.
>>
>> 1st wrong assumption: the sibling is not drained. Actually my strategy
>> takes into account draining both nodes, also because parent could be in
>> another graph. Therefore sibling is drained.
>>
>> But let's assume "sibling" is the sibling of the parent.
>>
>> Therefore we have
>> "child" -> "parent" -> "grandparent"
>> and
>> "blockjob" -> "sibling" -> "grandparent"
>>
>> The issue is the following: main loop can't drain "sibling", because
>> subtree_drained does not reach it. Therefore blockjob can still run
>> while main loop modifies "child" -> "parent". Blockjob can either:
>> 1) drain, but this won't affect "child" -> "parent"
>> 2) read the graph in other ways different from drain, for example
>> .set_aio_context recursively touches the whole graph.
>> 3) write the graph.
> 
> I don’t really understand the problem here.  If the block job only
> operates on the sibling subgraph, why would it care what’s going on in
> the other subgraph?

We are talking about something that probably does not happen, but what
if it calls a callback similar to .set_aio_context that goes through the
whole graph?

Even so, the first question is: is there such a callback?

A second, even more unrealistic case is when a job randomly looks for a bs
in another connectivity component and, for example, drains it.
Again, probably impossible.

> Block jobs should own all nodes that are associated with them (e.g.
> because they intend to drop or replace them when the job is done), so
> when part of the graph is drained, all jobs that could modify that part
> should be drained, too.

What do you mean by "own"?

> 
>> 3) can be only performed in the main loop, because it's a graph
>> operation. It means that the blockjob runs when the graph modifying
>> coroutine/bh is not running. They never run together.
>> The safety of this operation relies on where the drains are and will be
>> inserted. If you do like in my patch "block.c:
>> bdrv_replace_child_noperm: first call ->attach(), and then add child",
>> then we would have problem, because we drain between two writes, and the
>> blockjob will find an inconsistent graph. If we do it as we seem to do
>> it so far, then we won't really have any problem.
>>
>> 2) is a read, and can theoretically be performed by another thread. But
>> is there a function that does that? .set_aio_context for example is a GS
>> function, so we will fall back to case 3) and nothing bad would happen.
>>
>> Is there a counter example for this?
>>
>> -----------
>>
>> Talking about something else, I discussed with Kevin what *seems* to be
>> an alternative way to do this, instead of adding drains everywhere.
>> His idea is to replicate what blk_wait_while_drained() currently does
>> but on a larger scale. It is something in between this subtree_drains
>> logic and a rwlock.
>>
>> Basically if I understood correctly, we could implement
>> bdrv_wait_while_drained(), and put in all places where we would put a
>> read lock: all the reads to ->parents and ->children.
>> This function detects if the bdrv is under drain, and if so it will stop
>> and wait that the drain finishes (ie the graph modification).
>> On the other side, each write would just need to drain probably both
>> nodes (simple drain), to signal that we are modifying the graph. Once
>> bdrv_drained_begin() finishes, we are sure all coroutines are stopped.
>> Once bdrv_drained_end() finishes, we automatically let all coroutine
>> restart, and continue where they left off.
>>
>> Seems a good compromise between drains and rwlock. What do you think?
> 
> Well, sounds complicated.  So I’m asking myself whether this would be
> noticeably better than just an RwLock for graph modifications, like the
> global lock Vladimir has proposed.

But the point then is: aren't we re-inventing an AioContext lock?
The lock would protect not only ->parents and ->children, but also other
bdrv fields that are concurrently read/written.

I don't know, it seems to me that there is a lot of uncertainty about which
way to take...

Emanuele


Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Hanna Reitz 2 years, 1 month ago
On 30.03.22 13:55, Emanuele Giuseppe Esposito wrote:
>
> Am 30/03/2022 um 12:53 schrieb Hanna Reitz:
>> On 17.03.22 17:23, Emanuele Giuseppe Esposito wrote:
>>> Am 09/03/2022 um 14:26 schrieb Emanuele Giuseppe Esposito:
>>>>>> * Drains allow the caller (either main loop or iothread running
>>>>>> the context) to wait all in_flights requests and operations
>>>>>> of a BDS: normal drains target a given node and is parents, while
>>>>>> subtree ones also include the subgraph of the node. Siblings are
>>>>>> not affected by any of these two kind of drains.
>>>>> Siblings are drained to the extent required for their parent node to
>>>>> reach in_flight == 0.
>>>>>
>>>>> I haven't checked the code but I guess the case you're alluding to is
>>>>> that siblings with multiple parents could have other I/O in flight that
>>>>> will not be drained and further I/O can be submitted after the parent
>>>>> has drained?
>>>> Yes, this in theory can happen. I don't really know if this happens
>>>> practically, and how likely is to happen.
>>>>
>>>> The alternative would be to make a drain that blocks the whole graph,
>>>> siblings included, but that would probably be an overkill.
>>>>
>>> So I have thought about this, and I think maybe this is not a concrete
>>> problem.
>>> Suppose we have a graph where "parent" has 2 children: "child" and
>>> "sibling". "sibling" also has a blockjob.
>>>
>>> Now, main loop wants to modify parent-child relation and maybe detach
>>> child from parent.
>>>
>>> 1st wrong assumption: the sibling is not drained. Actually my strategy
>>> takes into account draining both nodes, also because parent could be in
>>> another graph. Therefore sibling is drained.
>>>
>>> But let's assume "sibling" is the sibling of the parent.
>>>
>>> Therefore we have
>>> "child" -> "parent" -> "grandparent"
>>> and
>>> "blockjob" -> "sibling" -> "grandparent"
>>>
>>> The issue is the following: main loop can't drain "sibling", because
>>> subtree_drained does not reach it. Therefore blockjob can still run
>>> while main loop modifies "child" -> "parent". Blockjob can either:
>>> 1) drain, but this won't affect "child" -> "parent"
>>> 2) read the graph in other ways different from drain, for example
>>> .set_aio_context recursively touches the whole graph.
>>> 3) write the graph.
>> I don’t really understand the problem here.  If the block job only
>> operates on the sibling subgraph, why would it care what’s going on in
>> the other subgraph?
> We are talking about something that probably does not happen, but what
> if it calls a callback similar to .set_aio_context that goes through the
> whole graph?

Hm.  Quite unfortunate if such a callback can operate on drained nodes, 
I’d say.  Ideally callbacks wouldn’t do that, but probably they will. :/

> Even though the first question is: is there such callback?

I mean, you could say any callback qualifies.  Draining a node will only 
drain its recursive parents, so siblings are not affected.  If the 
sibling issues the callback on its parent...  (E.g. changes in the 
backing chain requiring a qcow2 parent node to change the backing file 
string in its image file)

> Second even more irrealistic case is when a job randomly looks for a bs
> in another connectivity component and for example drains it.
> Again probably impossible.

I hope so, but the block layer sure likes to surprise me.

>> Block jobs should own all nodes that are associated with them (e.g.
>> because they intend to drop or replace them when the job is done), so
>> when part of the graph is drained, all jobs that could modify that part
>> should be drained, too.
> What do you mean with "own"?

They’re added with block_job_add_bdrv(), and then are children of the 
BlockJob object.

>>> 3) can be only performed in the main loop, because it's a graph
>>> operation. It means that the blockjob runs when the graph modifying
>>> coroutine/bh is not running. They never run together.
>>> The safety of this operation relies on where the drains are and will be
>>> inserted. If you do like in my patch "block.c:
>>> bdrv_replace_child_noperm: first call ->attach(), and then add child",
>>> then we would have problem, because we drain between two writes, and the
>>> blockjob will find an inconsistent graph. If we do it as we seem to do
>>> it so far, then we won't really have any problem.
>>>
>>> 2) is a read, and can theoretically be performed by another thread. But
>>> is there a function that does that? .set_aio_context for example is a GS
>>> function, so we will fall back to case 3) and nothing bad would happen.
>>>
>>> Is there a counter example for this?
>>>
>>> -----------
>>>
>>> Talking about something else, I discussed with Kevin what *seems* to be
>>> an alternative way to do this, instead of adding drains everywhere.
>>> His idea is to replicate what blk_wait_while_drained() currently does
>>> but on a larger scale. It is something in between this subtree_drains
>>> logic and a rwlock.
>>>
>>> Basically if I understood correctly, we could implement
>>> bdrv_wait_while_drained(), and put in all places where we would put a
>>> read lock: all the reads to ->parents and ->children.
>>> This function detects if the bdrv is under drain, and if so it will stop
>>> and wait that the drain finishes (ie the graph modification).
>>> On the other side, each write would just need to drain probably both
>>> nodes (simple drain), to signal that we are modifying the graph. Once
>>> bdrv_drained_begin() finishes, we are sure all coroutines are stopped.
>>> Once bdrv_drained_end() finishes, we automatically let all coroutine
>>> restart, and continue where they left off.
>>>
>>> Seems a good compromise between drains and rwlock. What do you think?
>> Well, sounds complicated.  So I’m asking myself whether this would be
>> noticeably better than just an RwLock for graph modifications, like the
>> global lock Vladimir has proposed.
> But the point is then: aren't we re-inventing an AioContext lock?

I don’t know how AioContext locks would even help with graph changes.  
If I want to change a block subgraph that’s in a different I/O thread, 
locking that thread isn’t enough (I would’ve thought), because I have no 
idea what the thread is doing when I’m locking it.  Perhaps it’s 
iterating through ->children right now (with some yields in between), 
and by pausing it, changing the graph, and then resuming it, it’ll still 
cause problems.

> the lock will protect not only ->parents and ->child, but also other
> bdrv fields that are concurrently read/written.

I would’ve thought this lock should only protect ->parents and ->children.

> I don't know, it seems to me that there is a lot of uncertainty on which
> way to take...

Definitely. :)

I wouldn’t call that a bad thing, necessarily.  Let’s look at the 
positive side: There are many ideas!

Hanna


Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Paolo Bonzini 2 years, 1 month ago
On 3/30/22 12:53, Hanna Reitz wrote:
>>
>> Seems a good compromise between drains and rwlock. What do you think?
> 
> Well, sounds complicated.  So I’m asking myself whether this would be 
> noticeably better than just an RwLock for graph modifications, like the 
> global lock Vladimir has proposed.

A global lock would have to be taken by all iothreads on every I/O 
operation, even a read-write lock would scale because it has a global 
CoMutex inside and rdlock/unlock both take it.  Even if the critical 
section is small, the cacheline bumping would be pretty bad.

Perhaps we could reuse the code in cpus-common.c, which relies on a list 
of possible readers and is quite cheap (two memory barriers basically) 
for writers.  Here we would have a list of AioContexts as the possible 
readers.

The slow path uses a QemuMutex, a QemuCond for the readers and a 
QemuCond for the writers.  The reader QemuCond can be replaced by a 
CoQueue, I think.

Paolo

Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Paolo Bonzini 2 years, 1 month ago
On 3/30/22 18:02, Paolo Bonzini wrote:
> On 3/30/22 12:53, Hanna Reitz wrote:
>>>
>>> Seems a good compromise between drains and rwlock. What do you think?
>>
>> Well, sounds complicated.  So I’m asking myself whether this would be 
>> noticeably better than just an RwLock for graph modifications, like 
>> the global lock Vladimir has proposed.

[try again, this time with brain connected]

A global lock would have to be taken by all iothreads on every I/O
operation, even a CoRwLock would not scale because it has a global
CoMutex inside and rdlock/unlock both take it.  Even if the critical
section is small, the cacheline bumping would be pretty bad.

Perhaps we could reuse the code in cpus-common.c, which relies on a list
of possible readers and is quite cheap (two memory barriers basically)
for readers.  Here we would have a list of AioContexts (or perhaps 
BlockDriverStates?) as the possible readers.

The slow path uses a QemuMutex, a QemuCond for the readers and a
QemuCond for the writers.  The reader QemuCond can be replaced by a
CoQueue, I think.
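
A sketch of how that could translate to the block layer is below. Only
the fast path is spelled out; every name in it (bdrv_graph_co_rdlock(),
pending_writer, reader_queue, a reading_graph flag in BlockDriverState)
is hypothetical, and the slow-path bookkeeping that avoids missed wakeups
(the QemuMutex plus the writer-side QemuCond, as in cpus-common.c) is
elided:

/* Sketch, fast path only; assumes qemu/atomic.h and qemu/coroutine.h. */
static int pending_writer;        /* != 0 while the main loop changes the graph */
static CoQueue reader_queue;      /* readers park here while a writer is active */

/* I/O side, taken around every ->parents / ->children access. */
static void coroutine_fn bdrv_graph_co_rdlock(BlockDriverState *bs)
{
    qatomic_set(&bs->reading_graph, true);
    /*
     * Pairs with the barrier in bdrv_graph_wrlock(): either the writer
     * sees reading_graph == true and waits for us, or we see
     * pending_writer != 0 and back off.
     */
    smp_mb();
    while (qatomic_read(&pending_writer)) {
        qatomic_set(&bs->reading_graph, false);
        qemu_co_queue_wait(&reader_queue, NULL);    /* woken by the writer */
        qatomic_set(&bs->reading_graph, true);
        smp_mb();
    }
}

static void coroutine_fn bdrv_graph_co_rdunlock(BlockDriverState *bs)
{
    qatomic_set(&bs->reading_graph, false);
    /* Slow path (elided): signal a writer that is waiting for this BDS. */
}

/* Main-loop side, wrapped around the actual graph modification. */
static void bdrv_graph_wrlock(void)
{
    qatomic_inc(&pending_writer);
    smp_mb();
    /*
     * Wait until no BDS has reading_graph set any more, e.g. with
     * AIO_WAIT_WHILE over the list of BDSes/AioContexts (elided).
     */
}

static void bdrv_graph_wrunlock(void)
{
    qatomic_dec(&pending_writer);
    /* Wake every parked reader; they resume where they left off. */
    while (qemu_co_enter_next(&reader_queue, NULL)) {
        /* nothing to do here */
    }
}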

Paolo


Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Emanuele Giuseppe Esposito 2 years, 1 month ago

Am 31/03/2022 um 11:59 schrieb Paolo Bonzini:
> On 3/30/22 18:02, Paolo Bonzini wrote:
>> On 3/30/22 12:53, Hanna Reitz wrote:
>>>>
>>>> Seems a good compromise between drains and rwlock. What do you think?
>>>
>>> Well, sounds complicated.  So I’m asking myself whether this would be
>>> noticeably better than just an RwLock for graph modifications, like
>>> the global lock Vladimir has proposed.
> 
> [try again, this time with brain connected]
> 
> A global lock would have to be taken by all iothreads on every I/O
> operation, even a CoRwLock would not scale because it has a global
> CoMutex inside and rdlock/unlock both take it.  Even if the critical
> section is small, the cacheline bumping would be pretty bad.
> 
> Perhaps we could reuse the code in cpus-common.c, which relies on a list
> of possible readers and is quite cheap (two memory barriers basically)
> for readers.  Here we would have a list of AioContexts (or perhaps
> BlockDriverStates?) as the possible readers.
> 
> The slow path uses a QemuMutex, a QemuCond for the readers and a
> QemuCond for the writers.  The reader QemuCond can be replaced by a
> CoQueue, I think.
> 

It would be something like this:
Essentially, make graph_bdrv_states public, so that we can iterate
through all BDSes the way cpus-common.c does with CPUs, and then duplicate
the already existing API in cpus-common.c:

bdrv_graph_list_wrlock <-> start_exclusive
bdrv_graph_list_wrunlock <-> end_exclusive
bdrv_graph_list_rdlock <-> cpu_exec_start
bdrv_graph_list_rdunlock <-> cpu_exec_end

The only difference is that there is no qemu_cpu_kick(), but I don't
think it matters.

What do you think?

diff --git a/block.c b/block.c
index 718e4cae8b..af8dd37101 100644
--- a/block.c
+++ b/block.c
@@ -52,6 +52,7 @@
 #include "qemu/range.h"
 #include "qemu/rcu.h"
 #include "block/coroutines.h"
+#include "block/block-list.h"

 #ifdef CONFIG_BSD
 #include <sys/ioctl.h>
@@ -67,10 +68,6 @@

 #define NOT_DONE 0x7fffffff /* used while emulated sync operation in progress */

-/* Protected by BQL */
-static QTAILQ_HEAD(, BlockDriverState) graph_bdrv_states =
-    QTAILQ_HEAD_INITIALIZER(graph_bdrv_states);
-
 /* Protected by BQL */
 static QTAILQ_HEAD(, BlockDriverState) all_bdrv_states =
     QTAILQ_HEAD_INITIALIZER(all_bdrv_states);
diff --git a/include/block/block-list.h b/include/block/block-list.h
new file mode 100644
index 0000000000..d55a056938
--- /dev/null
+++ b/include/block/block-list.h
@@ -0,0 +1,54 @@
+#ifndef BLOCK_LIST_H
+#define BLOCK_LIST_H
+
+#include "qemu/osdep.h"
+
+
+/* Protected by BQL */
+QTAILQ_HEAD(,BlockDriverState) graph_bdrv_states =
+    QTAILQ_HEAD_INITIALIZER(graph_bdrv_states);
+
+void bdrv_init_graph_list_lock(void);
+
+/*
+ * bdrv_graph_list_wrlock:
+ * Modify the graph. Nobody else is allowed to access the graph.
+ * Set pending op >= 1, so that the next readers will wait in a coroutine queue.
+ * Then keep track of the running readers using bs->has_writer,
+ * and wait until they finish.
+ */
+void bdrv_graph_list_wrlock(void);
+
+/*
+ * bdrv_graph_list_wrunlock:
+ * Write finished, reset pending operations to 0 and restart
+ * all readers that are waiting.
+ */
+void bdrv_graph_list_wrunlock(void);
+
+/*
+ * bdrv_graph_list_rdlock:
+ * Read the bs graph. Set bs->reading_graph true, and if there are pending
+ * operations, it means that the main loop is modifying the graph,
+ * therefore wait in exclusive_resume coroutine queue.
+ * If this read is coming while the writer has already marked the
+ * running read requests and it's waiting for them to finish,
+ * wait that the write finishes first.
+ * Main loop will then wake this coroutine once it is done.
+ */
+void bdrv_graph_list_rdlock(BlockDriverState *bs);
+
+/*
+ * bdrv_graph_list_rdunlock:
+ * Read terminated, decrease the count of pending operations.
+ * If it's the last read that the writer is waiting for, signal
+ * the writer that we are done and let it continue.
+ */
+void bdrv_graph_list_rdunlock(BlockDriverState *bs);
+
+
+
+
+#define BS_GRAPH_READER(bs) /* in main loop OR bs->reading_graph */
+#define BS_GRAPH_WRITER(bs) /* in main loop AND bs->bs_graph_pending_op > 0 */
+#endif /* BLOCK_LIST_H */
+
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 5a04c778e4..0266876a59 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -985,6 +985,20 @@ struct BlockDriverState {
     bool force_share; /* if true, always allow all shared permissions */
     bool implicit;  /* if true, this filter node was automatically inserted */

+    /*
+     * If true, the bs is reading the graph.
+     * No write should happen in the meanwhile.
+     * Atomic.
+     */
+    bool reading_graph;
+
+    /*
+     * If true, the main loop is modifying the graph.
+     * bs cannot read the graph.
+     * Protected by bs_graph_list_lock.
+     */
+    bool has_writer;
+
     BlockDriver *drv; /* NULL means no media */
     void *opaque;
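
For concreteness, here is a rough, untested sketch of how the write side
could mirror start_exclusive() from cpus-common.c, using the flags added
above. The mutex, condition variable and pending_op counter are
assumptions of this sketch, not code from the patch:

#include "qemu/osdep.h"
#include "qemu/atomic.h"
#include "qemu/thread.h"
#include "block/block_int.h"

/* Assumed to live next to graph_bdrv_states */
static QemuMutex graph_list_lock;      /* protects pending_op               */
static QemuCond graph_exclusive_cond;  /* the writer waits here for readers */
static int pending_op;                 /* >= 1 while a writer is active     */

/* GS API: called by the main loop, under BQL, before touching the graph */
void bdrv_graph_list_wrlock(void)
{
    BlockDriverState *bs;

    qemu_mutex_lock(&graph_list_lock);
    pending_op = 1;  /* new readers now queue up in bdrv_graph_list_rdlock() */

    /* Mark every BDS that is currently reading the graph... */
    QTAILQ_FOREACH(bs, &graph_bdrv_states, node_list) {
        if (qatomic_read(&bs->reading_graph)) {
            bs->has_writer = true;
            pending_op++;
        }
    }

    /*
     * ...and wait until each of them calls bdrv_graph_list_rdunlock(),
     * which decrements pending_op and signals the condition variable,
     * just like cpu_exec_end() does with pending_cpus.
     */
    while (pending_op > 1) {
        qemu_cond_wait(&graph_exclusive_cond, &graph_list_lock);
    }
    qemu_mutex_unlock(&graph_list_lock);
}

The read side would then mirror cpu_exec_start()/cpu_exec_end(): set
bs->reading_graph, check pending_op, and either proceed or wait in the
coroutine queue until the writer is done.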


Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Paolo Bonzini 2 years, 1 month ago
On 3/31/22 15:51, Emanuele Giuseppe Esposito wrote:
> 
> bdrv_graph_list_wrlock <-> start_exclusive
> bdrv_graph_list_wrunlock <-> end_exclusive
> bdrv_graph_list_rdlock <-> cpu_exec_start
> bdrv_graph_list_rdunlock <-> cpu_exec_end

This wouldn't protect the list but the whole graph, i.e. the parents and 
children of all BDSes.  So the functions would be:

   bdrv_graph_wrlock <-> start_exclusive
   bdrv_graph_wrunlock <-> end_exclusive
   bdrv_graph_rdlock <-> cpu_exec_start
   bdrv_graph_rdunlock <-> cpu_exec_end


The list itself would be used internally to implement the write-side 
lock and unlock primitives, but it would not be protected by the above 
functions.  So there would be a couple additional functions:

   bdrv_graph_list_lock <-> cpu_list_lock
   bdrv_graph_list_unlock <-> cpu_list_unlock

> +void bdrv_graph_list_rdlock(BlockDriverState *bs);
> +void bdrv_graph_list_rdunlock(BlockDriverState *bs);

Apart from the naming change, these two would be coroutine_fn.

> +#define BS_GRAPH_READER(bs) /* in main loop OR bs->reading_graph */
> +#define BS_GRAPH_WRITER(bs) /* in main loop AND bs->bs_graph_pending_op

bs_graph_pending_op is not part of bs->, it is a global variable 
(corresponding to pending_cpus in cpus-common.c).  I would call it 
bs_graph_pending_reader since you have "has_writer" below.

Also, this second #define does not need an argument, and is really the 
same as assert_bdrv_graph_writable(bs).  So perhaps you can rename the 
first one to assert_bdrv_graph_readable(bs).
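
Concretely, the two assertions could be something like this (just a
sketch; qemu_in_main_thread() is assumed to be usable here, and
bs_graph_pending_op is the global counter discussed above, whatever its
final name):

/* a reader is either the main loop or a bs that set reading_graph */
#define assert_bdrv_graph_readable(bs) \
    assert(qemu_in_main_thread() || qatomic_read(&(bs)->reading_graph))

/* a writer is the main loop, after it has marked the pending operation */
#define assert_bdrv_graph_writable() \
    assert(qemu_in_main_thread() && qatomic_read(&bs_graph_pending_op) > 0)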

> 
> +    /*
> +     * If true, the main loop is modifying the graph.
> +     * bs cannot read the graph.
> +     * Protected by bs_graph_list_lock.
> +     */
> +    bool has_writer;

Note that it's "has_waiter" in cpus-common.c. :)  has_writer is fine too.

Paolo
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Emanuele Giuseppe Esposito 2 years, 1 month ago

Am 31/03/2022 um 18:40 schrieb Paolo Bonzini:
> On 3/31/22 15:51, Emanuele Giuseppe Esposito wrote:
>>
>> bdrv_graph_list_wrlock <-> start_exclusive
>> bdrv_graph_list_wrunlock <-> end_exclusive
>> bdrv_graph_list_rdlock <-> cpu_exec_start
>> bdrv_graph_list_rdunlock <-> cpu_exec_end
> 
> This wouldn't protect the list but the whole graph, i.e. the parents and
> children of all BDSes.  So the functions would be:
> 
>   bdrv_graph_wrlock <-> start_exclusive
>   bdrv_graph_wrunlock <-> end_exclusive
>   bdrv_graph_rdlock <-> cpu_exec_start
>   bdrv_graph_rdunlock <-> cpu_exec_end
> 
> 
> The list itself would be used internally to implement the write-side
> lock and unlock primitives, but it would not be protected by the above
> functions.  So there would be a couple additional functions:
> 
>   bdrv_graph_list_lock <-> cpu_list_lock
>   bdrv_graph_list_unlock <-> cpu_list_unlock

The list would be graph_bdrv_states, why do we need to protect it with a
lock? Currently it is protected by BQL, and theoretically only
bdrv_graph_wrlock iterates on it. And as we defined in the assertion
below, wrlock is always in the main loop too.

> 
>> +void bdrv_graph_list_rdlock(BlockDriverState *bs);
>> +void bdrv_graph_list_rdunlock(BlockDriverState *bs);
> 
> Apart from the naming change, these two would be coroutine_fn.
> 
>> +#define BS_GRAPH_READER(bs) /* in main loop OR bs->reading_graph */
>> +#define BS_GRAPH_WRITER(bs) /* in main loop AND bs->bs_graph_pending_op
> 
> bs_graph_pending_op is not part of bs->, it is a global variable
> (corresponding to pending_cpus in cpus-common.c).  I would call it
> bs_graph_pending_reader since you have "has_writer" below.

Hmm, maybe it is better to go with has_waiter and pending_op. This is
because the "pending" count always covers one write and zero or more reads.

Rest makes sense.

Emanuele
> 
> Also, this second #define does not need an argument, and is really the
> same as assert_bdrv_graph_writable(bs).  So perhaps you can rename the
> first one to assert_bdrv_graph_readable(bs).
> 
>>
>> +    /*
>> +     * If true, the main loop is modifying the graph.
>> +     * bs cannot read the graph.
>> +     * Protected by bs_graph_list_lock.
>> +     */
>> +    bool has_writer;
> 
> Note that it's "has_waiter" in cpus-common.c. :)  has_writer is fine too.
> 
> Paolo
> 


Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Paolo Bonzini 2 years, 1 month ago
On 4/1/22 10:05, Emanuele Giuseppe Esposito wrote:
>> The list itself would be used internally to implement the write-side
>> lock and unlock primitives, but it would not be protected by the above
>> functions.  So there would be a couple additional functions:
>>
>>    bdrv_graph_list_lock <-> cpu_list_lock
>>    bdrv_graph_list_unlock <-> cpu_list_unlock
> 
> The list would be graph_bdrv_states, why do we need to protect it with a
> lock? Currently it is protected by BQL, and theoretically only
> bdrv_graph_wrlock iterates on it. And as we defined in the assertion
> below, wrlock is always in the main loop too.

You're right, CPU_FOREACH only appears in start_exclusive; so likewise 
you only need to walk the list in bdrv_graph_wrlock, i.e. only under BQL.

My thought was that, within the implementation, you'll need a mutex to 
protect has_waiter, and protecting the list with the same mutex made 
sense to me.  But indeed it's not necessary.

Paolo

>>> +void bdrv_graph_list_rdlock(BlockDriverState *bs);
>>> +void bdrv_graph_list_rdunlock(BlockDriverState *bs);
>>
>> Apart from the naming change, these two would be coroutine_fn.
>>
>>> +#define BS_GRAPH_READER(bs) /* in main loop OR bs->reading_graph */
>>> +#define BS_GRAPH_WRITER(bs) /* in main loop AND bs->bs_graph_pending_op
>>
>> bs_graph_pending_op is not part of bs->, it is a global variable
>> (corresponding to pending_cpus in cpus-common.c).  I would call it
>> bs_graph_pending_reader since you have "has_writer" below.

Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Emanuele Giuseppe Esposito 2 years ago
So this is a more concrete and up-to-date header.

Few things to notice:
- we have a list of AioContext. They are registered once an aiocontext
is created, and deleted when it is destroyed.
This list is helpful because each aiocontext can only modify its own
number of readers, avoiding unnecessary cacheline bouncing

- if a coroutine changes aiocontext, it's ok with regards to the
per-aiocontext reader counter. As long as the sum is correct, there's no
issue. The problem comes only once the original aiocontext is deleted,
and at that point we need to move the count it held to a shared global
variable, otherwise we risk to lose track of readers.

- All synchronization between the flags explained in this header is of
course handled in the implementation. But for now it would be nice to
have a feedback on the idea/API.

So in short we need:
- per-aiocontext counter
- global list of aiocontext
- global additional reader counter (in case an aiocontext is deleted)
- global CoQueue
- global has_writer flag
- global QemuMutex to protect the list access

Emanuele

#ifndef BLOCK_LOCK_H
#define BLOCK_LOCK_H

#include "qemu/osdep.h"

/*
 * register_aiocontext:
 * Add AioContext @ctx to the list of AioContext.
 * This list is used to obtain the total number of readers
 * currently running the graph.
 */
void register_aiocontext(AioContext *ctx);

/*
 * unregister_aiocontext:
 * Removes AioContext @ctx from the list of AioContext.
 */
void unregister_aiocontext(AioContext *ctx);

/*
 * bdrv_graph_wrlock:
 * Modify the graph. Nobody else is allowed to access the graph.
 * Set global has_writer to 1, so that the next readers will wait
 * in a coroutine queue until the writer is done.
 * Then keep track of the running readers by counting what is the total
 * amount of readers (sum of all aiocontext readers), and wait until
 * they all finish with AIO_WAIT_WHILE.
 */
void bdrv_graph_wrlock(void);

/*
 * bdrv_graph_wrunlock:
 * Write finished, reset global has_writer to 0 and restart
 * all readers that are waiting.
 */
void bdrv_graph_wrunlock(void);

/*
 * bdrv_graph_co_rdlock:
 * Read the bs graph. Increases the reader counter of the current aiocontext,
 * and if has_writer is set, it means that the writer is modifying
 * the graph, therefore wait in a coroutine queue.
 * The writer will then wake this coroutine once it is done.
 *
 * This lock cannot be taken recursively.
 */
void coroutine_fn bdrv_graph_co_rdlock(void);

/*
 * bdrv_graph_co_rdunlock:
 * Read terminated, decrease the count of readers in the current aiocontext.
 * If the writer is waiting for reads to finish (has_writer == 1), signal
 * the writer that we are done via aio_wait_kick() to let it continue.
 */
void coroutine_fn bdrv_graph_co_rdunlock(void);

#endif /* BLOCK_LOCK_H */
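
For reference, an untested sketch of how the reader side could look with
the pieces listed above. The AioContextReaders wrapper, the
get_current_graph_readers() helper and every other name here are
assumptions of the sketch, not code from the series:

#include "qemu/osdep.h"
#include "qemu/atomic.h"
#include "qemu/coroutine.h"
#include "qemu/thread.h"
#include "qemu/queue.h"
#include "block/aio.h"
#include "block/aio-wait.h"

typedef struct AioContextReaders {
    AioContext *ctx;
    uint32_t reader_count;                /* only written by ctx's thread */
    QTAILQ_ENTRY(AioContextReaders) next;
} AioContextReaders;

/* the global pieces from the list above */
static QemuMutex aiocontext_list_lock;
static QTAILQ_HEAD(, AioContextReaders) aiocontext_list =
    QTAILQ_HEAD_INITIALIZER(aiocontext_list);
static uint32_t orphaned_readers; /* counts inherited from deleted contexts */
static bool has_writer;
static CoQueue reader_queue;      /* readers wait here while has_writer     */

/* assumed helper: returns the wrapper that register_aiocontext() created
 * for qemu_get_current_aio_context() */
AioContextReaders *get_current_graph_readers(void);

void coroutine_fn bdrv_graph_co_rdlock(void)
{
    AioContextReaders *r = get_current_graph_readers();

    while (true) {
        qatomic_set(&r->reader_count, r->reader_count + 1);
        /* make the new count visible before looking at has_writer */
        smp_mb();

        if (likely(!qatomic_read(&has_writer))) {
            return;
        }

        /*
         * A writer is running or about to run: undo the increment, kick
         * its AIO_WAIT_WHILE() so it sees the updated sum, then wait.
         */
        qatomic_set(&r->reader_count, r->reader_count - 1);
        aio_wait_kick();

        qemu_mutex_lock(&aiocontext_list_lock);
        if (qatomic_read(&has_writer)) {
            qemu_co_queue_wait(&reader_queue, &aiocontext_list_lock);
        }
        qemu_mutex_unlock(&aiocontext_list_lock);
    }
}

void coroutine_fn bdrv_graph_co_rdunlock(void)
{
    AioContextReaders *r = get_current_graph_readers();

    qatomic_store_release(&r->reader_count, r->reader_count - 1);
    /* if the writer is summing the counters in AIO_WAIT_WHILE(), kick it */
    if (qatomic_read(&has_writer)) {
        aio_wait_kick();
    }
}

bdrv_graph_wrlock() would then set has_writer, issue the matching
smp_mb(), and AIO_WAIT_WHILE() until the modulo-2^32 sum of all
reader_count fields (plus orphaned_readers) is zero; bdrv_graph_wrunlock()
clears has_writer and restarts reader_queue.
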
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Kevin Wolf 2 years ago
Am 13.04.2022 um 15:43 hat Emanuele Giuseppe Esposito geschrieben:
> So this is a more concrete and up-to-date header.
> 
> Few things to notice:
> - we have a list of AioContext. They are registered once an aiocontext
> is created, and deleted when it is destroyed.
> This list is helpful because each aiocontext can only modify its own
> number of readers, avoiding unnecessary cacheline bouncing
> 
> - if a coroutine changes aiocontext, it's ok with regards to the
> per-aiocontext reader counter. As long as the sum is correct, there's no
> issue. The problem comes only once the original aiocontext is deleted,
> and at that point we need to move the count it held to a shared global
> variable, otherwise we risk to lose track of readers.

So the idea is that we can do bdrv_graph_co_rdlock() in one thread and
the corresponding bdrv_graph_co_rdunlock() in a different thread?

Would the unlock somehow remember the original thread, or do you use the
"sum is correct" argument and allow negative counter values, so you can
end up having count +1 in A and -1 in B to represent "no active
readers"? If this happens, it's likely to happen many times, so do we
have to take integer overflows into account then?

> - All synchronization between the flags explained in this header is of
> course handled in the implementation. But for now it would be nice to
> have a feedback on the idea/API.
> 
> So in short we need:
> - per-aiocontext counter
> - global list of aiocontext
> - global additional reader counter (in case an aiocontext is deleted)
> - global CoQueue
> - global has_writer flag
> - global QemuMutex to protect the list access
> 
> Emanuele
> 
> #ifndef BLOCK_LOCK_H
> #define BLOCK_LOCK_H
> 
> #include "qemu/osdep.h"
> 
> /*
>  * register_aiocontext:
>  * Add AioContext @ctx to the list of AioContext.
>  * This list is used to obtain the total number of readers
>  * currently running the graph.
>  */
> void register_aiocontext(AioContext *ctx);
> 
> /*
>  * unregister_aiocontext:
>  * Removes AioContext @ctx to the list of AioContext.
>  */
> void unregister_aiocontext(AioContext *ctx);
> 
> /*
>  * bdrv_graph_wrlock:
>  * Modify the graph. Nobody else is allowed to access the graph.
>  * Set global has_writer to 1, so that the next readers will wait
>  * that writer is done in a coroutine queue.
>  * Then keep track of the running readers by counting what is the total
>  * amount of readers (sum of all aiocontext readers), and wait until
>  * they all finish with AIO_WAIT_WHILE.
>  */
> void bdrv_graph_wrlock(void);

Do we need a coroutine version that yields instead of using
AIO_WAIT_WHILE() or are we sure this will only ever be called from
non-coroutine contexts?

> /*
>  * bdrv_graph_wrunlock:
>  * Write finished, reset global has_writer to 0 and restart
>  * all readers that are waiting.
>  */
> void bdrv_graph_wrunlock(void);
> 
> /*
>  * bdrv_graph_co_rdlock:
>  * Read the bs graph. Increases the reader counter of the current
> aiocontext,
>  * and if has_writer is set, it means that the writer is modifying
>  * the graph, therefore wait in a coroutine queue.
>  * The writer will then wake this coroutine once it is done.
>  *
>  * This lock cannot be taken recursively.
>  */
> void coroutine_fn bdrv_graph_co_rdlock(void);

What prevents it from being taken recursively when it's just a counter?
(I do see however, that you can't take a reader lock while you have the
writer lock or vice versa because it would deadlock.)

Does this being a coroutine_fn mean that we would have to convert QMP
command handlers to coroutines so that they can take the rdlock while
they don't expect the graph to change? Or should we have a non-coroutine
version, too, that works with AIO_WAIT_WHILE()?

Or should this only be taken for very small pieces of code directly
accessing the BdrvChild objects, and high-level users like QMP commands
shouldn't even consider themselves readers?

> /*
>  * bdrv_graph_rdunlock:
>  * Read terminated, decrease the count of readers in the current aiocontext.
>  * If the writer is waiting for reads to finish (has_writer == 1), signal
>  * the writer that we are done via aio_wait_kick() to let it continue.
>  */
> void coroutine_fn bdrv_graph_co_rdunlock(void);
> 
> #endif /* BLOCK_LOCK_H */

I expect that in the final version, we might want to have some sugar
like a WITH_BDRV_GRAPH_RDLOCK_GUARD() macro, but obviously that doesn't
affect the fundamental design.
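
One possible shape for such a guard, as a sketch only (the name and the
cleanup-attribute trick are assumptions, and it is only usable from
coroutine_fn code since rdlock/rdunlock are coroutine_fn):

/* releases the rdlock when the guarded scope is left, also via
 * break/return, thanks to the cleanup attribute */
static inline void bdrv_graph_rdlock_guard_cleanup(int *guard)
{
    bdrv_graph_co_rdunlock();
}

#define WITH_BDRV_GRAPH_RDLOCK_GUARD()                                    \
    for (__attribute__((cleanup(bdrv_graph_rdlock_guard_cleanup)))        \
             int graph_rdlock_guard__ = (bdrv_graph_co_rdlock(), 1);      \
         graph_rdlock_guard__;                                            \
         graph_rdlock_guard__ = 0)

/* usage, inside a coroutine_fn:
 *
 *     WITH_BDRV_GRAPH_RDLOCK_GUARD() {
 *         QLIST_FOREACH(child, &bs->children, next) {
 *             ...
 *         }
 *     }
 */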

Kevin
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Paolo Bonzini 2 years ago
On 4/13/22 16:51, Kevin Wolf wrote:
> So the idea is that we can do bdrv_graph_co_rdlock() in one thread and
> the corresponding bdrv_graph_co_rdunlock() in a different thread?
> 
> Would the unlock somehow remember the original thread, or do you use the
> "sum is correct" argument and allow negative counter values, so you can
> end up having count +1 in A and -1 in B to represent "no active
> readers"? If this happens, it's likely to happen many times, so do we
> have to take integer overflows into account then?

The counter cannot be negative, so you can use uint32_t and sum modulo 
2^32.  You might have a thread with counter 2^31+1 which is negative in 
twos complement; and a thread with counter -2^31-1 which is positive in 
twos complement; but their sum cancels out correctly.
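
For example (standalone program, just to illustrate the wrapping sum):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* rdlock taken in AioContext A, the matching rdunlock ran in B */
    uint32_t readers_a = 1;             /* one more lock than unlock */
    uint32_t readers_b = (uint32_t)-1;  /* one more unlock than lock:
                                         * 0xffffffff, never "negative" */

    /* the writer only looks at the modulo-2^32 sum: 0 active readers */
    printf("%" PRIu32 "\n", (uint32_t)(readers_a + readers_b));  /* 0 */
    return 0;
}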

Paolo
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Emanuele Giuseppe Esposito 2 years ago

Am 13/04/2022 um 16:51 schrieb Kevin Wolf:
> Am 13.04.2022 um 15:43 hat Emanuele Giuseppe Esposito geschrieben:
>> So this is a more concrete and up-to-date header.
>>
>> Few things to notice:
>> - we have a list of AioContext. They are registered once an aiocontext
>> is created, and deleted when it is destroyed.
>> This list is helpful because each aiocontext can only modify its own
>> number of readers, avoiding unnecessary cacheline bouncing
>>
>> - if a coroutine changes aiocontext, it's ok with regards to the
>> per-aiocontext reader counter. As long as the sum is correct, there's no
>> issue. The problem comes only once the original aiocontext is deleted,
>> and at that point we need to move the count it held to a shared global
>> variable, otherwise we risk to lose track of readers.
> 
> So the idea is that we can do bdrv_graph_co_rdlock() in one thread and
> the corresponding bdrv_graph_co_rdunlock() in a different thread?
> 
> Would the unlock somehow remember the original thread, or do you use the
> "sum is correct" argument and allow negative counter values, so you can
> end up having count +1 in A and -1 in B to represent "no active
> readers"? If this happens, it's likely to happen many times, so do we
> have to take integer overflows into account then?
> 
>> - All synchronization between the flags explained in this header is of
>> course handled in the implementation. But for now it would be nice to
>> have a feedback on the idea/API.
>>
>> So in short we need:
>> - per-aiocontext counter
>> - global list of aiocontext
>> - global additional reader counter (in case an aiocontext is deleted)
>> - global CoQueue
>> - global has_writer flag
>> - global QemuMutex to protect the list access
>>
>> Emanuele
>>
>> #ifndef BLOCK_LOCK_H
>> #define BLOCK_LOCK_H
>>
>> #include "qemu/osdep.h"
>>
>> /*
>>  * register_aiocontext:
>>  * Add AioContext @ctx to the list of AioContext.
>>  * This list is used to obtain the total number of readers
>>  * currently running the graph.
>>  */
>> void register_aiocontext(AioContext *ctx);
>>
>> /*
>>  * unregister_aiocontext:
>>  * Removes AioContext @ctx to the list of AioContext.
>>  */
>> void unregister_aiocontext(AioContext *ctx);
>>
>> /*
>>  * bdrv_graph_wrlock:
>>  * Modify the graph. Nobody else is allowed to access the graph.
>>  * Set global has_writer to 1, so that the next readers will wait
>>  * that writer is done in a coroutine queue.
>>  * Then keep track of the running readers by counting what is the total
>>  * amount of readers (sum of all aiocontext readers), and wait until
>>  * they all finish with AIO_WAIT_WHILE.
>>  */
>> void bdrv_graph_wrlock(void);
> 
> Do we need a coroutine version that yields instead of using
> AIO_WAIT_WHILE() or are we sure this will only ever be called from
> non-coroutine contexts?

writes (graph modifications) are always done under BQL in the main loop.
Except for a unit test, I don't think a coroutine ever does that.

> 
>> /*
>>  * bdrv_graph_wrunlock:
>>  * Write finished, reset global has_writer to 0 and restart
>>  * all readers that are waiting.
>>  */
>> void bdrv_graph_wrunlock(void);
>>
>> /*
>>  * bdrv_graph_co_rdlock:
>>  * Read the bs graph. Increases the reader counter of the current
>> aiocontext,
>>  * and if has_writer is set, it means that the writer is modifying
>>  * the graph, therefore wait in a coroutine queue.
>>  * The writer will then wake this coroutine once it is done.
>>  *
>>  * This lock cannot be taken recursively.
>>  */
>> void coroutine_fn bdrv_graph_co_rdlock(void);
> 
> What prevents it from being taken recursively when it's just a counter?
> (I do see however, that you can't take a reader lock while you have the
> writer lock or vice versa because it would deadlock.)
> 
I actually didn't add the assertion to prevent it from being recursive
yet, but I think it simplifies everything if it's not recursive

> Does this being a coroutine_fn mean that we would have to convert QMP
> command handlers to coroutines so that they can take the rdlock while
> they don't expect the graph to change? Or should we have a non-coroutine
> version, too, that works with AIO_WAIT_WHILE()?

Why convert the QMP command handlers? coroutine_fn was just to signal
that it can also be called from coroutines, like the ones created by the
blk_* API.
A reader does not have to be a coroutine. AIO_WAIT_WHILE is not
mandatory to allow it to finish, it helps to ensure progress in case
some reader is waiting for something, but other than that is not
necessary IMO.

> Or should this only be taken for very small pieces of code directly
> accessing the BdrvChild objects, and high-level users like QMP commands
> shouldn't even consider themselves readers?
> 

No I think if we focus on small pieces of code we end up having a
million lock/unlock pairs.

>> /*
>>  * bdrv_graph_rdunlock:
>>  * Read terminated, decrease the count of readers in the current aiocontext.
>>  * If the writer is waiting for reads to finish (has_writer == 1), signal
>>  * the writer that we are done via aio_wait_kick() to let it continue.
>>  */
>> void coroutine_fn bdrv_graph_co_rdunlock(void);
>>
>> #endif /* BLOCK_LOCK_H */
> 
> I expect that in the final version, we might want to have some sugar
> like a WITH_BDRV_GRAPH_RDLOCK_GUARD() macro, but obviously that doesn't
> affect the fundamental design.

Yeah I will ping you once I get to that point ;)

Emanuele
> 
> Kevin
>
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Kevin Wolf 2 years ago
Am 13.04.2022 um 17:14 hat Emanuele Giuseppe Esposito geschrieben:
> Am 13/04/2022 um 16:51 schrieb Kevin Wolf:
> > Am 13.04.2022 um 15:43 hat Emanuele Giuseppe Esposito geschrieben:
> >> So this is a more concrete and up-to-date header.
> >>
> >> Few things to notice:
> >> - we have a list of AioContext. They are registered once an aiocontext
> >> is created, and deleted when it is destroyed.
> >> This list is helpful because each aiocontext can only modify its own
> >> number of readers, avoiding unnecessary cacheline bouncing
> >>
> >> - if a coroutine changes aiocontext, it's ok with regards to the
> >> per-aiocontext reader counter. As long as the sum is correct, there's no
> >> issue. The problem comes only once the original aiocontext is deleted,
> >> and at that point we need to move the count it held to a shared global
> >> variable, otherwise we risk to lose track of readers.
> > 
> > So the idea is that we can do bdrv_graph_co_rdlock() in one thread and
> > the corresponding bdrv_graph_co_rdunlock() in a different thread?
> > 
> > Would the unlock somehow remember the original thread, or do you use the
> > "sum is correct" argument and allow negative counter values, so you can
> > end up having count +1 in A and -1 in B to represent "no active
> > readers"? If this happens, it's likely to happen many times, so do we
> > have to take integer overflows into account then?
> > 
> >> - All synchronization between the flags explained in this header is of
> >> course handled in the implementation. But for now it would be nice to
> >> have a feedback on the idea/API.
> >>
> >> So in short we need:
> >> - per-aiocontext counter
> >> - global list of aiocontext
> >> - global additional reader counter (in case an aiocontext is deleted)
> >> - global CoQueue
> >> - global has_writer flag
> >> - global QemuMutex to protect the list access
> >>
> >> Emanuele
> >>
> >> #ifndef BLOCK_LOCK_H
> >> #define BLOCK_LOCK_H
> >>
> >> #include "qemu/osdep.h"
> >>
> >> /*
> >>  * register_aiocontext:
> >>  * Add AioContext @ctx to the list of AioContext.
> >>  * This list is used to obtain the total number of readers
> >>  * currently running the graph.
> >>  */
> >> void register_aiocontext(AioContext *ctx);
> >>
> >> /*
> >>  * unregister_aiocontext:
> >>  * Removes AioContext @ctx to the list of AioContext.
> >>  */
> >> void unregister_aiocontext(AioContext *ctx);
> >>
> >> /*
> >>  * bdrv_graph_wrlock:
> >>  * Modify the graph. Nobody else is allowed to access the graph.
> >>  * Set global has_writer to 1, so that the next readers will wait
> >>  * that writer is done in a coroutine queue.
> >>  * Then keep track of the running readers by counting what is the total
> >>  * amount of readers (sum of all aiocontext readers), and wait until
> >>  * they all finish with AIO_WAIT_WHILE.
> >>  */
> >> void bdrv_graph_wrlock(void);
> > 
> > Do we need a coroutine version that yields instead of using
> > AIO_WAIT_WHILE() or are we sure this will only ever be called from
> > non-coroutine contexts?
> 
> writes (graph modifications) are always done under BQL in the main loop.

Yes, I think we're fairly certain about this part.

> Except an unit test, I don't think a coroutine ever does that.

I'm not sure about this one, though. Didn't you have cases where
bdrv_replace_child_noperm() was called in coroutine context? Or maybe
I'm mixing up things here.

> >> /*
> >>  * bdrv_graph_wrunlock:
> >>  * Write finished, reset global has_writer to 0 and restart
> >>  * all readers that are waiting.
> >>  */
> >> void bdrv_graph_wrunlock(void);
> >>
> >> /*
> >>  * bdrv_graph_co_rdlock:
> >>  * Read the bs graph. Increases the reader counter of the current
> >> aiocontext,
> >>  * and if has_writer is set, it means that the writer is modifying
> >>  * the graph, therefore wait in a coroutine queue.
> >>  * The writer will then wake this coroutine once it is done.
> >>  *
> >>  * This lock cannot be taken recursively.
> >>  */
> >> void coroutine_fn bdrv_graph_co_rdlock(void);
> > 
> > What prevents it from being taken recursively when it's just a counter?
> > (I do see however, that you can't take a reader lock while you have the
> > writer lock or vice versa because it would deadlock.)
> > 
> I actually didn't add the assertion to prevent it from being recoursive
> yet, but I think it simplifies everything if it's not recoursive
> 
> > Does this being a coroutine_fn mean that we would have to convert QMP
> > command handlers to coroutines so that they can take the rdlock while
> > they don't expect the graph to change? Or should we have a non-coroutine
> > version, too, that works with AIO_WAIT_WHILE()?
> 
> Why convert the QMP command handlers? coroutine_fn was just to signal
> that it can also be called from coroutines, like the ones created by the
> blk_* API.

coroutine_fn means that it can _only_ be called from coroutines (because
it will yield, which doesn't work outside of a coroutine - not sure what
happens, probably just a crash).

> A reader does not have to be a coroutine. AIO_WAIT_WHILE is not
> mandatory to allow it to finish, it helps to ensure progress in case
> some reader is waiting for something, but other than that is not
> necessary IMO.

When it's outside of a coroutine, how would you implement waiting for a
writer to finish if not with AIO_WAIT_WHILE()?

> > Or should this only be taken for very small pieces of code directly
> > accessing the BdrvChild objects, and high-level users like QMP commands
> > shouldn't even consider themselves readers?
> > 
> 
> No I think if we focus on small pieces of code we end up having a
> million lock/unlock pairs.

Yes, I agree. On the other hand, if we're taking the locks in high-level
outer operations, avoiding to take the lock recursively might become
harder. I guess we'll see how it works out when we actually introduce
callers.

Kevin
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Paolo Bonzini 2 years ago
On 4/13/22 18:29, Kevin Wolf wrote:
>> A reader does not have to be a coroutine. AIO_WAIT_WHILE is not
>> mandatory to allow it to finish, it helps to ensure progress in case
>> some reader is waiting for something, but other than that is not
>> necessary IMO.
> When it's outside of a coroutine, how would you implement waiting for a
> writer to finish if not with AIO_WAIT_WHILE()?
> 

In the main thread a non-coroutine can always read the graph though, 
because the only writer can be the main thread.

If the critical sections are large enough, I don't think rdlock needs to 
be taken outside a coroutine in the iothread, e.g. in a bottom half.
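
In other words, the main-loop variant of the read lock could even be a
no-op plus assertions, roughly like this (a sketch, all names are
assumptions):

/* safe because graph writes only happen in the main loop, under BQL,
 * and the main loop cannot race with itself */
void bdrv_graph_rdlock_main_loop(void)
{
    GLOBAL_STATE_CODE();
    assert(!qemu_in_coroutine());
}

void bdrv_graph_rdunlock_main_loop(void)
{
    GLOBAL_STATE_CODE();
    assert(!qemu_in_coroutine());
}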

>> No I think if we focus on small pieces of code we end up having a
>> million lock/unlock pairs.
> 
> Yes, I agree. On the other hand, if we're taking the locks in high-level
> outer operations, avoiding to take the lock recursively might become
> harder. I guess we'll see how it works out when we actually introduce
> callers.

My "hope" is that taking the locks in blk_* functions covers most of the 
calls, and then only a few (dozens) direct uses of bdrv_* remain.

Unfortunately, adding assertions is not easy because "is there a reader" 
cannot be easily answered.  But I think Emanuele has a debug mode that 
can enable the assertions at a performance cost.
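
As an illustration only, reusing the hypothetical per-AioContext wrapper
sketched earlier in the thread (every name below is an assumption), the
debug-mode check could boil down to something like:

#ifdef BDRV_GRAPH_DEBUG
static uint32_t bdrv_graph_reader_count_sum(void)
{
    AioContextReaders *r;
    uint32_t sum;

    /* walking the list takes the mutex: that is the performance cost */
    QEMU_LOCK_GUARD(&aiocontext_list_lock);
    sum = orphaned_readers;
    QTAILQ_FOREACH(r, &aiocontext_list, next) {
        sum += qatomic_read(&r->reader_count);  /* modulo-2^32 sum */
    }
    return sum;
}

/* only proves that *some* reader exists, not that the caller is one */
#define assert_bdrv_graph_readable() \
    assert(qemu_in_main_thread() || bdrv_graph_reader_count_sum() > 0)
#else
#define assert_bdrv_graph_readable() ((void)0)
#endif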

Paolo
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Emanuele Giuseppe Esposito 2 years ago

Am 13/04/2022 um 17:14 schrieb Emanuele Giuseppe Esposito:
> 
> 
> Am 13/04/2022 um 16:51 schrieb Kevin Wolf:
>> Am 13.04.2022 um 15:43 hat Emanuele Giuseppe Esposito geschrieben:
>>> So this is a more concrete and up-to-date header.
>>>
>>> Few things to notice:
>>> - we have a list of AioContext. They are registered once an aiocontext
>>> is created, and deleted when it is destroyed.
>>> This list is helpful because each aiocontext can only modify its own
>>> number of readers, avoiding unnecessary cacheline bouncing
>>>
>>> - if a coroutine changes aiocontext, it's ok with regards to the
>>> per-aiocontext reader counter. As long as the sum is correct, there's no
>>> issue. The problem comes only once the original aiocontext is deleted,
>>> and at that point we need to move the count it held to a shared global
>>> variable, otherwise we risk to lose track of readers.
>>
>> So the idea is that we can do bdrv_graph_co_rdlock() in one thread and
>> the corresponding bdrv_graph_co_rdunlock() in a different thread?
>>
>> Would the unlock somehow remember the original thread, or do you use the
>> "sum is correct" argument and allow negative counter values, so you can
>> end up having count +1 in A and -1 in B to represent "no active
>> readers"? If this happens, it's likely to happen many times, so do we
>> have to take integer overflows into account then?
>>
>>> - All synchronization between the flags explained in this header is of
>>> course handled in the implementation. But for now it would be nice to
>>> have a feedback on the idea/API.
>>>
>>> So in short we need:
>>> - per-aiocontext counter
>>> - global list of aiocontext
>>> - global additional reader counter (in case an aiocontext is deleted)
>>> - global CoQueue
>>> - global has_writer flag
>>> - global QemuMutex to protect the list access
>>>
>>> Emanuele
>>>
>>> #ifndef BLOCK_LOCK_H
>>> #define BLOCK_LOCK_H
>>>
>>> #include "qemu/osdep.h"
>>>
>>> /*
>>>  * register_aiocontext:
>>>  * Add AioContext @ctx to the list of AioContext.
>>>  * This list is used to obtain the total number of readers
>>>  * currently running the graph.
>>>  */
>>> void register_aiocontext(AioContext *ctx);
>>>
>>> /*
>>>  * unregister_aiocontext:
>>>  * Removes AioContext @ctx to the list of AioContext.
>>>  */
>>> void unregister_aiocontext(AioContext *ctx);
>>>
>>> /*
>>>  * bdrv_graph_wrlock:
>>>  * Modify the graph. Nobody else is allowed to access the graph.
>>>  * Set global has_writer to 1, so that the next readers will wait
>>>  * that writer is done in a coroutine queue.
>>>  * Then keep track of the running readers by counting what is the total
>>>  * amount of readers (sum of all aiocontext readers), and wait until
>>>  * they all finish with AIO_WAIT_WHILE.
>>>  */
>>> void bdrv_graph_wrlock(void);
>>
>> Do we need a coroutine version that yields instead of using
>> AIO_WAIT_WHILE() or are we sure this will only ever be called from
>> non-coroutine contexts?
> 
> writes (graph modifications) are always done under BQL in the main loop.
> Except an unit test, I don't think a coroutine ever does that.

Additional point (1): I am also preparing a series with all the "helpful
fixes" I gathered from the other discarded/obsolete series, like
subtree_drain and similar.

So except for the job patches, you can discard all other series.

> 
>>
>>> /*
>>>  * bdrv_graph_wrunlock:
>>>  * Write finished, reset global has_writer to 0 and restart
>>>  * all readers that are waiting.
>>>  */
>>> void bdrv_graph_wrunlock(void);
>>>
>>> /*
>>>  * bdrv_graph_co_rdlock:
>>>  * Read the bs graph. Increases the reader counter of the current
>>> aiocontext,
>>>  * and if has_writer is set, it means that the writer is modifying
>>>  * the graph, therefore wait in a coroutine queue.
>>>  * The writer will then wake this coroutine once it is done.
>>>  *
>>>  * This lock cannot be taken recursively.
>>>  */
>>> void coroutine_fn bdrv_graph_co_rdlock(void);
>>
>> What prevents it from being taken recursively when it's just a counter?
>> (I do see however, that you can't take a reader lock while you have the
>> writer lock or vice versa because it would deadlock.)
>>
> I actually didn't add the assertion to prevent it from being recoursive
> yet, but I think it simplifies everything if it's not recoursive


Additional point (2): I forgot that with counters there's no easy way to
avoid recursion, so yeah, theoretically it can be recursive. Still, it's
better to avoid doing it intentionally.

> 
>> Does this being a coroutine_fn mean that we would have to convert QMP
>> command handlers to coroutines so that they can take the rdlock while
>> they don't expect the graph to change? Or should we have a non-coroutine
>> version, too, that works with AIO_WAIT_WHILE()?
> 
> Why convert the QMP command handlers? coroutine_fn was just to signal
> that it can also be called from coroutines, like the ones created by the
> blk_* API.
> A reader does not have to be a coroutine. AIO_WAIT_WHILE is not
> mandatory to allow it to finish, it helps to ensure progress in case
> some reader is waiting for something, but other than that is not
> necessary IMO.
> 
>> Or should this only be taken for very small pieces of code directly
>> accessing the BdrvChild objects, and high-level users like QMP commands
>> shouldn't even consider themselves readers?
>>
> 
> No I think if we focus on small pieces of code we end up having a
> million lock/unlock pairs.
> 
>>> /*
>>>  * bdrv_graph_rdunlock:
>>>  * Read terminated, decrease the count of readers in the current aiocontext.
>>>  * If the writer is waiting for reads to finish (has_writer == 1), signal
>>>  * the writer that we are done via aio_wait_kick() to let it continue.
>>>  */
>>> void coroutine_fn bdrv_graph_co_rdunlock(void);
>>>
>>> #endif /* BLOCK_LOCK_H */
>>
>> I expect that in the final version, we might want to have some sugar
>> like a WITH_BDRV_GRAPH_RDLOCK_GUARD() macro, but obviously that doesn't
>> affect the fundamental design.
> 
> Yeah I will ping you once I get to that point ;)
> 
> Emanuele
>>
>> Kevin
>>
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Stefan Hajnoczi 2 years ago
On Fri, Apr 01, 2022 at 01:01:53PM +0200, Paolo Bonzini wrote:
> On 4/1/22 10:05, Emanuele Giuseppe Esposito wrote:
> > > The list itself would be used internally to implement the write-side
> > > lock and unlock primitives, but it would not be protected by the above
> > > functions.  So there would be a couple additional functions:
> > > 
> > >    bdrv_graph_list_lock <-> cpu_list_lock
> > >    bdrv_graph_list_unlock <-> cpu_list_unlock
> > 
> > The list would be graph_bdrv_states, why do we need to protect it with a
> > lock? Currently it is protected by BQL, and theoretically only
> > bdrv_graph_wrlock iterates on it. And as we defined in the assertion
> > below, wrlock is always in the main loop too.
> 
> You're right, CPU_FOREACH only appears in start_exclusive; so likewise you
> only need to walk the list in bdrv_graph_wrlock, i.e. only under BQL.
> 
> My thought was that, within the implementation, you'll need a mutex to
> protect has_waiter, and protecting the list with the same mutex made sense
> to me.  But indeed it's not necessary.

What is the relationship between this new API and aio_set_fd_handler()'s
is_external?

A few thoughts:

- The new API doesn't stop more I/O requests from being submitted, it
  just blocks the current coroutine so request processing is deferred.

- In other words, is_external is a flow control API whereas the new API
  queues up request coroutines without notifying the caller.

- The new API still needs to be combined with bdrv_drained_begin/end()
  to ensure in-flight requests are done.

- It's not obvious to me whether the new API obsoletes is_external. I
  think it probably doesn't.

Stefan
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Paolo Bonzini 2 years ago
On Mon, Apr 4, 2022 at 11:25 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:

> - The new API doesn't stop more I/O requests from being submitted, it
>   just blocks the current coroutine so request processing is deferred.
>

New I/O requests would not complete until the write-side critical section
ends. However they would still be accepted: from the point of view of the
guest, the "consumed" index of the virtio ring would move forward, unlike
bdrv_drained_begin/end().

> - In other words, is_external is a flow control API whereas the new API
>   queues up request coroutines without notifying the caller.
>

Yes, I think this is the same I wrote above.

> - The new API still needs to be combined with bdrv_drained_begin/end()
>   to ensure in-flight requests are done.
>

I don't think so, because in-flight requests would take the lock for
reading. The write side would not start until those in-flight requests
release the lock.

> - It's not obvious to me whether the new API obsoletes is_external. I think
>   it probably doesn't.
>

I agree that it doesn't. This new lock is only protecting ->parents and
->children. bdrv_drained_begin()/end() remains necessary, for example, when
you need to send a request during the drained section. An example is
block_resize.

In addition, bdrv_drained_begin()/end() ensures that the callback of
blk_aio_*() functions has been invoked (see commit 46aaf2a566,
"block-backend: Decrease in_flight only after callback", 2018-09-25).  This
new lock would not ensure that.

As an aside, instead of is_external, QEMU could remove/add the ioeventfd
handler in the blk->dev_ops->drained_begin and blk->dev_ops->drained_end
callbacks respectively. But that's just a code cleanup.

Paolo
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Emanuele Giuseppe Esposito 2 years ago

Am 04/04/2022 um 11:41 schrieb Paolo Bonzini:
> 
> 
> On Mon, Apr 4, 2022 at 11:25 AM Stefan Hajnoczi <stefanha@redhat.com
> <mailto:stefanha@redhat.com>> wrote:
> 
>     - The new API doesn't stop more I/O requests from being submitted, it
>       just blocks the current coroutine so request processing is deferred.
> 
> 
> New I/O requests would not complete until the write-side critical
> section ends. However they would still be accepted: from the point of
> view of the guest, the "consumed" index of the virtio ring would move
> forward, unlike bdrv_drained_begin/end().
> 
>     - In other words, is_external is a flow control API whereas the new API
>       queues up request coroutines without notifying the caller.
> 
> 
> Yes, I think this is the same I wrote above.
> 
>     - The new API still needs to be combined with bdrv_drained_begin/end()
>       to ensure in-flight requests are done.
> 
> 
> I don't think so, because in-flight requests would take the lock for
> reading. The write side would not start until those in-flight requests
> release the lock.
> 
>     - It's not obvious to me whether the new API obsoletes is_external.
>     I think it probably doesn't.
> 
> 
> I agree that it doesn't. This new lock is only protecting ->parents and
> ->children. 

Side note: it will also be used to protect other fields, like
.aio_context I think. I haven't checked if there is something else we
might want to protect that is currently protected by AioContext lock.

At least, I think we are going to use the same lock, right?

Emanuele

bdrv_drained_begin()/end() remains necessary, for example,
> when you need to send a request during the drained section. An example
> is block_resize.
> 
> In addition, bdrv_drained_begin()/end() ensures that the callback of
> blk_aio_*() functions has been invoked (see commit 46aaf2a566,
> "block-backend: Decrease in_flight only after callback", 2018-09-25). 
> This new lock would not ensure that.
> 
> As an aside, instead of is_external, QEMU could remove/add the ioeventfd
> handler in the blk->dev_ops->drained_begin and blk->dev_ops->drained_end
> callbacks respectively. But that's just a code cleanup.
> 
> Paolo


Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Paolo Bonzini 2 years ago
On 4/4/22 11:51, Emanuele Giuseppe Esposito wrote:
>>
>> I agree that it doesn't. This new lock is only protecting ->parents and
>> ->children. 
> Side note: it will also be used to protect other fields, like
> .aio_context I think. I haven't checked if there is something else we
> might want to protect that is currently protected by AioContext lock.
> 
> At least, I think we are going to use the same lock, right?

I have no idea honestly.  It can make sense for anything that is changed
very rarely and read during requests.

.aio_context has the .detach/.attach callbacks and I wonder if there
should be any reason to access it outside the callbacks.  A lot of uses
of .aio_context (for example for aio_bh_new or
aio_bh_schedule_oneshot/replay_bh_schedule_oneshot_event) can, and
perhaps should, be changed to just qemu_get_current_aio_context().  For
multiqueue we probably want the same BlockDriverState to use the
AioContext corresponding to a virtio queue, rather than always the same one.
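
For instance, a call like the following (a made-up example, not a
specific caller in the tree) would no longer need the stored
.aio_context at all:

/* before: relies on knowing which AioContext "owns" the node */
aio_bh_schedule_oneshot(bdrv_get_aio_context(bs), cb, opaque);

/* after: schedule the BH in whatever context this code already runs in */
aio_bh_schedule_oneshot(qemu_get_current_aio_context(), cb, opaque);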

Paolo
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Kevin Wolf 2 years ago
Am 04.04.2022 um 11:41 hat Paolo Bonzini geschrieben:
> As an aside, instead of is_external, QEMU could remove/add the ioeventfd
> handler in the blk->dev_ops->drained_begin and blk->dev_ops->drained_end
> callbacks respectively. But that's just a code cleanup.

Yes, this is the proper way to do it that we intended to implement
"sometime later". aio_disable_external() is really an ugly hack.

I assume that with multiqueue, we'll have to rework this anyway because
there won't be a single AioContext to disable any more - and I'm not
sure if a node will even know from which AioContexts requests could
potentially be sent to it. So we may have to go through the proper
callback mechanisms at that point.

Kevin
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Stefan Hajnoczi 2 years ago
On Mon, Apr 04, 2022 at 11:41:04AM +0200, Paolo Bonzini wrote:
> On Mon, Apr 4, 2022 at 11:25 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > - The new API still needs to be combined with bdrv_drained_begin/end()
> >   to ensure in-flight requests are done.
> >
> 
> I don't think so, because in-flight requests would take the lock for
> reading. The write side would not start until those in-flight requests
> release the lock.

Good point!

> - It's not obvious to me whether the new API obsoletes is_external. I think
> > it probably doesn't.
> >
> 
> I agree that it doesn't. This new lock is only protecting ->parents and
> ->children. bdrv_drained_begin()/end() remains necessary, for example, when
> you need to send a request during the drained section. An example is
> block_resize.
> 
> In addition, bdrv_drained_begin()/end() ensures that the callback of
> blk_aio_*() functions has been invoked (see commit 46aaf2a566,
> "block-backend: Decrease in_flight only after callback", 2018-09-25).  This
> new lock would not ensure that.
> 
> As an aside, instead of is_external, QEMU could remove/add the ioeventfd
> handler in the blk->dev_ops->drained_begin and blk->dev_ops->drained_end
> callbacks respectively. But that's just a code cleanup.

Interesting idea. If is_external is a block layer-specific feature that
nothing else in QEMU uses then I like the idea because it's cleaner and
more obvious than is_external.

Stefan
Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Posted by Stefan Hajnoczi 2 years, 1 month ago
On Wed, Mar 09, 2022 at 02:26:28PM +0100, Emanuele Giuseppe Esposito wrote:
> Am 02/03/2022 um 10:47 schrieb Stefan Hajnoczi:
> > On Tue, Mar 01, 2022 at 09:21:08AM -0500, Emanuele Giuseppe Esposito wrote:
> >> Possible scenarios
> >> -------------------
> >> Keeping in mind that we can only have an iothread and the main loop
> >> draining on a certain node, we could have:
> >>
> >> main loop successfully drains and then iothread tries to drain:
> >>   impossible scenario, as iothread is already stopped once main
> >>   successfully drains.
> >>
> >> iothread successfully drains and then main loop drains:
> >>   should not be a problem, as:
> >>   1) the iothread should be already "blocked" by its own drain
> > 
> > Once drained_begin() returns in the IOThread, the IOThread can do
> > anything it wants, including more submitting I/O. I don't consider that
> > "blocked", so I'm not sure what this sentence means?
> > 
> > The way the main loop thread protects itself against the IOThread is via
> > the aio "external" handler concept and block job drain callbacks, which
> > are activated by drained_begin(). They ensure that the IOThread will not
> > perform further processing that submits I/O, but the IOThread code that
> > invoked drained_begin() can still do anything it wants.
> 
> As above I think that regardless on what the iothread is doing, once the
> main loop has finished executing bdrv_drained_begin the iothread should
> not be doing anything related to the nodes that have been drained.

I agree. What I wanted to highlight is that waiting for requests to
complete is not what stops the IOThread, it's the "external" AIO handler
mechanism.

Stefan