[PATCH RFC 0/5] Fix accidental crash in iotest 30

Vladimir Sementsov-Ogievskiy posted 5 patches 3 years, 5 months ago
Test checkpatch failed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20201120161622.1537-1-vsementsov@virtuozzo.com
Maintainers: John Snow <jsnow@redhat.com>, Max Reitz <mreitz@redhat.com>, Kevin Wolf <kwolf@redhat.com>, Cleber Rosa <crosa@redhat.com>, Eduardo Habkost <ehabkost@redhat.com>
block/coroutines.h                 | 11 +++++++
include/block/block.h              |  2 ++
block.c                            | 36 +++++++++++++++------
block/mirror.c                     |  9 ++++--
block/stream.c                     |  9 ++++--
scripts/block-coroutine-wrapper.py | 36 +++++++++++++--------
tests/qemu-iotests/030             | 52 +++++++++++++++---------------
tests/qemu-iotests/030.out         |  4 +--
8 files changed, 105 insertions(+), 54 deletions(-)
[PATCH RFC 0/5] Fix accidental crash in iotest 30
Posted by Vladimir Sementsov-Ogievskiy 3 years, 5 months ago
Hi all!

As Peter recently noted, iotest 30 accidentally fails.

I found that Qemu crashes due to interleaving of graph-update operations of parallel mirror and stream block-jobs.

So, here is a "workaround" to discuss.

It's of course not the full solution, as if we decide to go this way we should protect by the mutex all graph-modifying operations, not only here. And move everything into coroutine..

So, I send this mostly as a starting point for discussion, may be someone imagine better solution.

Main patches are 04-05. 01-02 only simplify debugging and 03 is
preparation for 04.

Original qemu crash looks like this:

#0  0x00007f7029b23e35 in raise () at /lib64/libc.so.6
#1  0x00007f7029b0e895 in abort () at /lib64/libc.so.6
#2  0x00007f7029b0e769 in _nl_load_domain.cold () at /lib64/libc.so.6
#3  0x00007f7029b1c566 in annobin_assert.c_end () at /lib64/libc.so.6
#4  0x0000558f3d92f15a in bdrv_replace_child (child=0x558f3fa7c400, new_bs=0x0) at ../block.c:2648
#5  0x0000558f3d92f6e1 in bdrv_detach_child (child=0x558f3fa7c400) at ../block.c:2777
#6  0x0000558f3d92f723 in bdrv_root_unref_child (child=0x558f3fa7c400) at ../block.c:2789
#7  0x0000558f3d897f4c in block_job_remove_all_bdrv (job=0x558f3f626940) at ../blockjob.c:191
#8  0x0000558f3d897c73 in block_job_free (job=0x558f3f626940) at ../blockjob.c:88
#9  0x0000558f3d891456 in job_unref (job=0x558f3f626940) at ../job.c:380
#10 0x0000558f3d892602 in job_exit (opaque=0x558f3f626940) at ../job.c:894
#11 0x0000558f3d9ce2fb in aio_bh_call (bh=0x558f3f5dc480) at ../util/async.c:136
#12 0x0000558f3d9ce405 in aio_bh_poll (ctx=0x558f3e80c5f0) at ../util/async.c:164
#13 0x0000558f3d9f75ea in aio_dispatch (ctx=0x558f3e80c5f0) at ../util/aio-posix.c:381
#14 0x0000558f3d9ce836 in aio_ctx_dispatch (source=0x558f3e80c5f0, callback=0x0, user_data=0x0)
    at ../util/async.c:306
#15 0x00007f702ae75ecd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#16 0x0000558f3da09e33 in glib_pollfds_poll () at ../util/main-loop.c:221
#17 0x0000558f3da09ead in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
#18 0x0000558f3da09fb5 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
#19 0x0000558f3d7836b7 in qemu_main_loop () at ../softmmu/vl.c:1678
#20 0x0000558f3d317316 in main (argc=20, argv=0x7fffa94d35a8, envp=0x7fffa94d3650)
    at ../softmmu/main.c:50
(gdb) fr 4
#4  0x0000558f3d92f15a in bdrv_replace_child (child=0x558f3fa7c400, new_bs=0x0) at ../block.c:2648
2648            assert(tighten_restrictions == false);
(gdb) list
2643            int ret;
2644
2645            bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm);
2646            ret = bdrv_check_perm(old_bs, NULL, perm, shared_perm, NULL,
2647                                  &tighten_restrictions, NULL);
2648            assert(tighten_restrictions == false);
2649            if (ret < 0) {
2650                /* We only tried to loosen restrictions, so errors are not fatal */
2651                bdrv_abort_perm_update(old_bs);
2652            } else {


And my exploration shows that this due to permission-graph already broken before this permission update. So we tighten restrictions not because removing the child but because we recalculate broken permissions graph and it becomes correct (and more strict unfortunately).


Also, please look through my explorations on this topic in threads:

"iotest 030 still occasionally intermittently failing"
https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg04018.html

"question about bdrv_replace_node"
https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg04478.html

Vladimir Sementsov-Ogievskiy (5):
  abort-on-set-to-true
  iotest-30-shorten: concentrate on failing test case
  scripts/block-coroutine-wrapper.py: allow more function types
  block: move some mirror and stream handlers to coroutine
  block: protect some graph-modifyng things by mutex

 block/coroutines.h                 | 11 +++++++
 include/block/block.h              |  2 ++
 block.c                            | 36 +++++++++++++++------
 block/mirror.c                     |  9 ++++--
 block/stream.c                     |  9 ++++--
 scripts/block-coroutine-wrapper.py | 36 +++++++++++++--------
 tests/qemu-iotests/030             | 52 +++++++++++++++---------------
 tests/qemu-iotests/030.out         |  4 +--
 8 files changed, 105 insertions(+), 54 deletions(-)

-- 
2.21.3


Re: [PATCH RFC 0/5] Fix accidental crash in iotest 30
Posted by no-reply@patchew.org 3 years, 5 months ago
Patchew URL: https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20201120161622.1537-1-vsementsov@virtuozzo.com
Subject: [PATCH RFC 0/5] Fix accidental crash in iotest 30

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]         patchew/20201120161622.1537-1-vsementsov@virtuozzo.com -> patchew/20201120161622.1537-1-vsementsov@virtuozzo.com
Switched to a new branch 'test'
9aed580 block: protect some graph-modifyng things by mutex
74c4557 block: move some mirror and stream handlers to coroutine
fae70bd scripts/block-coroutine-wrapper.py: allow more function types
7bccf2c iotest-30-shorten: concentrate on failing test case
1c79f08 abort-on-set-to-true

=== OUTPUT BEGIN ===
1/5 Checking commit 1c79f086eae8 (abort-on-set-to-true)
ERROR: do not initialise globals to 0 or NULL
#18: FILE: block.c:87:
+bool abort_on_set_to_true = false;

ERROR: line over 90 characters
#27: FILE: block.c:2007:
+        if ((added_perms || removed_shared_perms) && tighten_restrictions == &abort_on_set_to_true) {

WARNING: line over 80 characters
#44: FILE: block.c:2075:
+            ret = bdrv_child_check_perm(c, q, cur_perm, cur_shared, ignore_children,

WARNING: line over 80 characters
#47: FILE: block.c:2078:
+            ret = bdrv_child_check_perm(c, q, cur_perm, cur_shared, ignore_children,

WARNING: line over 80 characters
#48: FILE: block.c:2079:
+                                        tighten_restrictions ? &child_tighten_restr

total: 2 errors, 3 warnings, 74 lines checked

Patch 1/5 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

2/5 Checking commit 7bccf2cf8bff (iotest-30-shorten: concentrate on failing test case)
3/5 Checking commit fae70bd3851a (scripts/block-coroutine-wrapper.py: allow more function types)
WARNING: line over 80 characters
#35: FILE: scripts/block-coroutine-wrapper.py:82:
+func_decl_re = re.compile(r'^(?P<return_type>(void|int))\s*generated_co_wrapper\s*'

total: 0 errors, 1 warnings, 80 lines checked

Patch 3/5 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
4/5 Checking commit 74c4557e6dea (block: move some mirror and stream handlers to coroutine)
5/5 Checking commit 9aed58093018 (block: protect some graph-modifyng things by mutex)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20201120161622.1537-1-vsementsov@virtuozzo.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Re: [PATCH RFC 0/5] Fix accidental crash in iotest 30
Posted by Vladimir Sementsov-Ogievskiy 3 years, 5 months ago
20.11.2020 19:27, no-reply@patchew.org wrote:
> Patchew URL:https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/
> 
> 
> 
> Hi,
> 
> This series seems to have some coding style problems. See output below for
> more information:

This is an RFC, so I didn't care.

-- 
Best regards,
Vladimir

Re: [PATCH RFC 0/5] Fix accidental crash in iotest 30
Posted by Kevin Wolf 3 years, 5 months ago
Am 20.11.2020 um 17:16 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Hi all!
> 
> As Peter recently noted, iotest 30 accidentally fails.
> 
> I found that Qemu crashes due to interleaving of graph-update
> operations of parallel mirror and stream block-jobs.

I haven't found the time yet to properly look into this or your other
thread where you had a similar question, but there is one thing I'm
wondering: Why can the nested job even make progress and run its
completion handler?

When we modify the graph, we should have drained the subtree in
question, so in theory while one job finishes and modifies the graph,
there should be no way for the other job to make progress and get
interleaved - it shouldn't be able to start I/O requests and much less
to run its completion handler and modify the graph.

Are we missing drained sections somewhere or do they fail to achieve
what I think they should achieve?

Kevin


Re: [PATCH RFC 0/5] Fix accidental crash in iotest 30
Posted by Vladimir Sementsov-Ogievskiy 3 years, 5 months ago
20.11.2020 19:36, Kevin Wolf wrote:
> Am 20.11.2020 um 17:16 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> Hi all!
>>
>> As Peter recently noted, iotest 30 accidentally fails.
>>
>> I found that Qemu crashes due to interleaving of graph-update
>> operations of parallel mirror and stream block-jobs.
> 
> I haven't found the time yet to properly look into this or your other
> thread where you had a similar question, but there is one thing I'm
> wondering: Why can the nested job even make progress and run its
> completion handler?
> 
> When we modify the graph, we should have drained the subtree in
> question, so in theory while one job finishes and modifies the graph,
> there should be no way for the other job to make progress and get
> interleaved - it shouldn't be able to start I/O requests and much less
> to run its completion handler and modify the graph.
> 
> Are we missing drained sections somewhere or do they fail to achieve
> what I think they should achieve?
> 

It all looks like both jobs are reached their finish simultaneously. So, all progress is done in both jobs. And they go concurrently to completion procedures which interleaves. So, there no more io through blk, which is restricted by drained sections.


-- 
Best regards,
Vladimir

Re: [PATCH RFC 0/5] Fix accidental crash in iotest 30
Posted by Kevin Wolf 3 years, 5 months ago
Am 20.11.2020 um 17:43 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 20.11.2020 19:36, Kevin Wolf wrote:
> > Am 20.11.2020 um 17:16 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > Hi all!
> > > 
> > > As Peter recently noted, iotest 30 accidentally fails.
> > > 
> > > I found that Qemu crashes due to interleaving of graph-update
> > > operations of parallel mirror and stream block-jobs.
> > 
> > I haven't found the time yet to properly look into this or your other
> > thread where you had a similar question, but there is one thing I'm
> > wondering: Why can the nested job even make progress and run its
> > completion handler?
> > 
> > When we modify the graph, we should have drained the subtree in
> > question, so in theory while one job finishes and modifies the graph,
> > there should be no way for the other job to make progress and get
> > interleaved - it shouldn't be able to start I/O requests and much less
> > to run its completion handler and modify the graph.
> > 
> > Are we missing drained sections somewhere or do they fail to achieve
> > what I think they should achieve?
> > 
> 
> It all looks like both jobs are reached their finish simultaneously.
> So, all progress is done in both jobs. And they go concurrently to
> completion procedures which interleaves. So, there no more io through
> blk, which is restricted by drained sections.

They can't be truly simultaneous because they run in the same thread.
During job completion, this is the main thread.

However as soon as job_is_completed() returns true, it seems we're not
pausing the job any more when one of its nodes gets drained.

Possibly also relevant: The job->busy = false in job_exit(). The comment
there says it's a lie, but we might deadlock otherwise.

This problem will probably affect other callers, too, which drain a
subtree and then resonably expect that nobody will modify the graph
until they end the drained section. So I think the problem that we need
to address is that jobs run their completion handlers even though they
are supposed to be paused by a drain.

I'm not saying that your graph modification locks are a bad idea, but
they are probably not a complete solution.

Kevin


Re: [PATCH RFC 0/5] Fix accidental crash in iotest 30
Posted by Vladimir Sementsov-Ogievskiy 3 years, 5 months ago
20.11.2020 20:22, Kevin Wolf wrote:
> Am 20.11.2020 um 17:43 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 20.11.2020 19:36, Kevin Wolf wrote:
>>> Am 20.11.2020 um 17:16 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> Hi all!
>>>>
>>>> As Peter recently noted, iotest 30 accidentally fails.
>>>>
>>>> I found that Qemu crashes due to interleaving of graph-update
>>>> operations of parallel mirror and stream block-jobs.
>>>
>>> I haven't found the time yet to properly look into this or your other
>>> thread where you had a similar question, but there is one thing I'm
>>> wondering: Why can the nested job even make progress and run its
>>> completion handler?
>>>
>>> When we modify the graph, we should have drained the subtree in
>>> question, so in theory while one job finishes and modifies the graph,
>>> there should be no way for the other job to make progress and get
>>> interleaved - it shouldn't be able to start I/O requests and much less
>>> to run its completion handler and modify the graph.
>>>
>>> Are we missing drained sections somewhere or do they fail to achieve
>>> what I think they should achieve?
>>>
>>
>> It all looks like both jobs are reached their finish simultaneously.
>> So, all progress is done in both jobs. And they go concurrently to
>> completion procedures which interleaves. So, there no more io through
>> blk, which is restricted by drained sections.
> 
> They can't be truly simultaneous because they run in the same thread.
> During job completion, this is the main thread.

No, they not truly simultaneous, but completions may interleave through nested aio_poll loops.

> 
> However as soon as job_is_completed() returns true, it seems we're not
> pausing the job any more when one of its nodes gets drained.
> 
> Possibly also relevant: The job->busy = false in job_exit(). The comment
> there says it's a lie, but we might deadlock otherwise.
> 
> This problem will probably affect other callers, too, which drain a
> subtree and then resonably expect that nobody will modify the graph
> until they end the drained section. So I think the problem that we need
> to address is that jobs run their completion handlers even though they
> are supposed to be paused by a drain.

Hmm. I always thought about drained section as about thing that stops IO requests, not other operations.. And we do graph modifications in drained section to avoid in-flight IO requests during graph modification.

> 
> I'm not saying that your graph modification locks are a bad idea, but
> they are probably not a complete solution.
> 

Hmm. What do you mean? It's of course not complete, as I didn't protect every graph modification procedure.. But if we do protect all such things and do graph modifications always under this mutex, it should work I think.

-- 
Best regards,
Vladimir

Re: [PATCH RFC 0/5] Fix accidental crash in iotest 30
Posted by Kevin Wolf 3 years, 5 months ago
Am 20.11.2020 um 19:19 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 20.11.2020 20:22, Kevin Wolf wrote:
> > Am 20.11.2020 um 17:43 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > 20.11.2020 19:36, Kevin Wolf wrote:
> > > > Am 20.11.2020 um 17:16 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > > > Hi all!
> > > > > 
> > > > > As Peter recently noted, iotest 30 accidentally fails.
> > > > > 
> > > > > I found that Qemu crashes due to interleaving of graph-update
> > > > > operations of parallel mirror and stream block-jobs.
> > > > 
> > > > I haven't found the time yet to properly look into this or your other
> > > > thread where you had a similar question, but there is one thing I'm
> > > > wondering: Why can the nested job even make progress and run its
> > > > completion handler?
> > > > 
> > > > When we modify the graph, we should have drained the subtree in
> > > > question, so in theory while one job finishes and modifies the graph,
> > > > there should be no way for the other job to make progress and get
> > > > interleaved - it shouldn't be able to start I/O requests and much less
> > > > to run its completion handler and modify the graph.
> > > > 
> > > > Are we missing drained sections somewhere or do they fail to achieve
> > > > what I think they should achieve?
> > > > 
> > > 
> > > It all looks like both jobs are reached their finish simultaneously.
> > > So, all progress is done in both jobs. And they go concurrently to
> > > completion procedures which interleaves. So, there no more io through
> > > blk, which is restricted by drained sections.
> > 
> > They can't be truly simultaneous because they run in the same thread.
> > During job completion, this is the main thread.
> 
> No, they not truly simultaneous, but completions may interleave
> through nested aio_poll loops.
> 
> > 
> > However as soon as job_is_completed() returns true, it seems we're not
> > pausing the job any more when one of its nodes gets drained.
> > 
> > Possibly also relevant: The job->busy = false in job_exit(). The comment
> > there says it's a lie, but we might deadlock otherwise.
> > 
> > This problem will probably affect other callers, too, which drain a
> > subtree and then resonably expect that nobody will modify the graph
> > until they end the drained section. So I think the problem that we need
> > to address is that jobs run their completion handlers even though they
> > are supposed to be paused by a drain.
> 
> Hmm. I always thought about drained section as about thing that stops
> IO requests, not other operations.. And we do graph modifications in
> drained section to avoid in-flight IO requests during graph
> modification.

Is there any use for an operation that only stops I/O, but doesn't
prevent graph changes?

I always understood it as a request to have exclusive access to a
subtree, so that nobody else would touch it.

> > I'm not saying that your graph modification locks are a bad idea, but
> > they are probably not a complete solution.
> > 
> 
> Hmm. What do you mean? It's of course not complete, as I didn't
> protect every graph modification procedure.. But if we do protect all
> such things and do graph modifications always under this mutex, it
> should work I think.

What I mean is that not only graph modifications can conflict with each
other, but most callers of drain_begin/end will probably not be prepared
for the graph changing under their feet, even if they don't actively
change the graph themselves.

Kevin


Re: [PATCH RFC 0/5] Fix accidental crash in iotest 30
Posted by Vladimir Sementsov-Ogievskiy 3 years, 5 months ago
23.11.2020 13:10, Kevin Wolf wrote:
> Am 20.11.2020 um 19:19 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 20.11.2020 20:22, Kevin Wolf wrote:
>>> Am 20.11.2020 um 17:43 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> 20.11.2020 19:36, Kevin Wolf wrote:
>>>>> Am 20.11.2020 um 17:16 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>> Hi all!
>>>>>>
>>>>>> As Peter recently noted, iotest 30 accidentally fails.
>>>>>>
>>>>>> I found that Qemu crashes due to interleaving of graph-update
>>>>>> operations of parallel mirror and stream block-jobs.
>>>>>
>>>>> I haven't found the time yet to properly look into this or your other
>>>>> thread where you had a similar question, but there is one thing I'm
>>>>> wondering: Why can the nested job even make progress and run its
>>>>> completion handler?
>>>>>
>>>>> When we modify the graph, we should have drained the subtree in
>>>>> question, so in theory while one job finishes and modifies the graph,
>>>>> there should be no way for the other job to make progress and get
>>>>> interleaved - it shouldn't be able to start I/O requests and much less
>>>>> to run its completion handler and modify the graph.
>>>>>
>>>>> Are we missing drained sections somewhere or do they fail to achieve
>>>>> what I think they should achieve?
>>>>>
>>>>
>>>> It all looks like both jobs are reached their finish simultaneously.
>>>> So, all progress is done in both jobs. And they go concurrently to
>>>> completion procedures which interleaves. So, there no more io through
>>>> blk, which is restricted by drained sections.
>>>
>>> They can't be truly simultaneous because they run in the same thread.
>>> During job completion, this is the main thread.
>>
>> No, they not truly simultaneous, but completions may interleave
>> through nested aio_poll loops.
>>
>>>
>>> However as soon as job_is_completed() returns true, it seems we're not
>>> pausing the job any more when one of its nodes gets drained.
>>>
>>> Possibly also relevant: The job->busy = false in job_exit(). The comment
>>> there says it's a lie, but we might deadlock otherwise.
>>>
>>> This problem will probably affect other callers, too, which drain a
>>> subtree and then resonably expect that nobody will modify the graph
>>> until they end the drained section. So I think the problem that we need
>>> to address is that jobs run their completion handlers even though they
>>> are supposed to be paused by a drain.
>>
>> Hmm. I always thought about drained section as about thing that stops
>> IO requests, not other operations.. And we do graph modifications in
>> drained section to avoid in-flight IO requests during graph
>> modification.
> 
> Is there any use for an operation that only stops I/O, but doesn't
> prevent graph changes?
> 
> I always understood it as a request to have exclusive access to a
> subtree, so that nobody else would touch it.
> 
>>> I'm not saying that your graph modification locks are a bad idea, but
>>> they are probably not a complete solution.
>>>
>>
>> Hmm. What do you mean? It's of course not complete, as I didn't
>> protect every graph modification procedure.. But if we do protect all
>> such things and do graph modifications always under this mutex, it
>> should work I think.
> 
> What I mean is that not only graph modifications can conflict with each
> other, but most callers of drain_begin/end will probably not be prepared
> for the graph changing under their feet, even if they don't actively
> change the graph themselves.
> 

Understand now.. Right.. Anyway, it looks as we need some kind of mutex. As the user of drained section of course wants to do graph modifications and even IO (for example update backing-link in metadata). The first thing that comes to mind is to protect all outer-most drained sections by global CoMutex and assert in drain_begin/drain_end that the mutex is locked.

Hmm, it also looks like RW-lock, and simple IO is "read" and something under drained section is "write".


-- 
Best regards,
Vladimir

Re: [PATCH RFC 0/5] Fix accidental crash in iotest 30
Posted by Kevin Wolf 3 years, 5 months ago
Am 23.11.2020 um 11:29 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 23.11.2020 13:10, Kevin Wolf wrote:
> > Am 20.11.2020 um 19:19 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > 20.11.2020 20:22, Kevin Wolf wrote:
> > > > Am 20.11.2020 um 17:43 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > > > 20.11.2020 19:36, Kevin Wolf wrote:
> > > > > > Am 20.11.2020 um 17:16 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > > > > > Hi all!
> > > > > > > 
> > > > > > > As Peter recently noted, iotest 30 accidentally fails.
> > > > > > > 
> > > > > > > I found that Qemu crashes due to interleaving of graph-update
> > > > > > > operations of parallel mirror and stream block-jobs.
> > > > > > 
> > > > > > I haven't found the time yet to properly look into this or your other
> > > > > > thread where you had a similar question, but there is one thing I'm
> > > > > > wondering: Why can the nested job even make progress and run its
> > > > > > completion handler?
> > > > > > 
> > > > > > When we modify the graph, we should have drained the subtree in
> > > > > > question, so in theory while one job finishes and modifies the graph,
> > > > > > there should be no way for the other job to make progress and get
> > > > > > interleaved - it shouldn't be able to start I/O requests and much less
> > > > > > to run its completion handler and modify the graph.
> > > > > > 
> > > > > > Are we missing drained sections somewhere or do they fail to achieve
> > > > > > what I think they should achieve?
> > > > > > 
> > > > > 
> > > > > It all looks like both jobs are reached their finish simultaneously.
> > > > > So, all progress is done in both jobs. And they go concurrently to
> > > > > completion procedures which interleaves. So, there no more io through
> > > > > blk, which is restricted by drained sections.
> > > > 
> > > > They can't be truly simultaneous because they run in the same thread.
> > > > During job completion, this is the main thread.
> > > 
> > > No, they not truly simultaneous, but completions may interleave
> > > through nested aio_poll loops.
> > > 
> > > > 
> > > > However as soon as job_is_completed() returns true, it seems we're not
> > > > pausing the job any more when one of its nodes gets drained.
> > > > 
> > > > Possibly also relevant: The job->busy = false in job_exit(). The comment
> > > > there says it's a lie, but we might deadlock otherwise.
> > > > 
> > > > This problem will probably affect other callers, too, which drain a
> > > > subtree and then resonably expect that nobody will modify the graph
> > > > until they end the drained section. So I think the problem that we need
> > > > to address is that jobs run their completion handlers even though they
> > > > are supposed to be paused by a drain.
> > > 
> > > Hmm. I always thought about drained section as about thing that stops
> > > IO requests, not other operations.. And we do graph modifications in
> > > drained section to avoid in-flight IO requests during graph
> > > modification.
> > 
> > Is there any use for an operation that only stops I/O, but doesn't
> > prevent graph changes?
> > 
> > I always understood it as a request to have exclusive access to a
> > subtree, so that nobody else would touch it.
> > 
> > > > I'm not saying that your graph modification locks are a bad idea, but
> > > > they are probably not a complete solution.
> > > > 
> > > 
> > > Hmm. What do you mean? It's of course not complete, as I didn't
> > > protect every graph modification procedure.. But if we do protect all
> > > such things and do graph modifications always under this mutex, it
> > > should work I think.
> > 
> > What I mean is that not only graph modifications can conflict with each
> > other, but most callers of drain_begin/end will probably not be prepared
> > for the graph changing under their feet, even if they don't actively
> > change the graph themselves.
> > 
> 
> Understand now.. Right.. Anyway, it looks as we need some kind of
> mutex. As the user of drained section of course wants to do graph
> modifications and even IO (for example update backing-link in
> metadata). The first thing that comes to mind is to protect all
> outer-most drained sections by global CoMutex and assert in
> drain_begin/drain_end that the mutex is locked.
> 
> Hmm, it also looks like RW-lock, and simple IO is "read" and something
> under drained section is "write".

In a way, drain _is_ the implementation of a lock. But as you've shown,
it's a buggy implementation.

What I was looking at was basically fixing the one instance of a bug
while leaving the design as it is.

My impression is that you're looking at this more radically and want to
rearchitecture the whole drain mechanism so that such bugs would be much
less likely to start with. Maybe this is a good idea, but it's probably
also a lot more effort.

Basically, for making use of more traditional locks, the naive approach
would be changing blk/bdrv_inc/dec_in_flight() so that it takes/releases
an actual coroutine lock. As you suggest, probably a CoRwLock.

I see a few non-trivial questions already for this part:

* What about requests for which bs->in_flight is increased more than
  once? Do we need some sort of recursive lock for them?

* How do you know whether you should take a reader or a writer lock? For
  drains called from coroutine context, maybe you could store the caller
  that "owns" the drain section in the BDS, but what about non-coroutine
  drains?

  What do you do if coroutine A drains and then (directly or indirectly)
  spawns coroutine B to do some work?

* Multiple concurrent requests from the same origin (the drain section
  owner) shouldn't be serialised, so the CoRwLock needs to be taken once
  per origin, not once per request. Again, how do we identify origins
  and where do we store the common lock?

* Is it desirable that requests hang in bdrv_inc_in_flight() waiting for
  the lock to be released? This may be in the middle of another
  operation that needs to complete before drain_begin can return.

  I seem to remember that introducing queued_requests in BlockBackend
  was already non-trivial because of potential deadlocks. We would have
  to prepare for more of the same in BlockDriverState.

  The BlockBackend code involves temporarily dropping the in_flight
  counter change that the request made, but on the BDS level we don't
  even know which counters we increased how often before reaching
  bdrv_inc_in_flight().

Do you have a different approach for placing the locks or do you have
ideas for how we would find good answers for these problems?

Kevin


Re: [PATCH RFC 0/5] Fix accidental crash in iotest 30
Posted by Vladimir Sementsov-Ogievskiy 3 years, 5 months ago
23.11.2020 14:10, Kevin Wolf wrote:
> Am 23.11.2020 um 11:29 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 23.11.2020 13:10, Kevin Wolf wrote:
>>> Am 20.11.2020 um 19:19 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> 20.11.2020 20:22, Kevin Wolf wrote:
>>>>> Am 20.11.2020 um 17:43 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>> 20.11.2020 19:36, Kevin Wolf wrote:
>>>>>>> Am 20.11.2020 um 17:16 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>>>> Hi all!
>>>>>>>>
>>>>>>>> As Peter recently noted, iotest 30 accidentally fails.
>>>>>>>>
>>>>>>>> I found that Qemu crashes due to interleaving of graph-update
>>>>>>>> operations of parallel mirror and stream block-jobs.
>>>>>>>
>>>>>>> I haven't found the time yet to properly look into this or your other
>>>>>>> thread where you had a similar question, but there is one thing I'm
>>>>>>> wondering: Why can the nested job even make progress and run its
>>>>>>> completion handler?
>>>>>>>
>>>>>>> When we modify the graph, we should have drained the subtree in
>>>>>>> question, so in theory while one job finishes and modifies the graph,
>>>>>>> there should be no way for the other job to make progress and get
>>>>>>> interleaved - it shouldn't be able to start I/O requests and much less
>>>>>>> to run its completion handler and modify the graph.
>>>>>>>
>>>>>>> Are we missing drained sections somewhere or do they fail to achieve
>>>>>>> what I think they should achieve?
>>>>>>>
>>>>>>
>>>>>> It all looks like both jobs are reached their finish simultaneously.
>>>>>> So, all progress is done in both jobs. And they go concurrently to
>>>>>> completion procedures which interleaves. So, there no more io through
>>>>>> blk, which is restricted by drained sections.
>>>>>
>>>>> They can't be truly simultaneous because they run in the same thread.
>>>>> During job completion, this is the main thread.
>>>>
>>>> No, they not truly simultaneous, but completions may interleave
>>>> through nested aio_poll loops.
>>>>
>>>>>
>>>>> However as soon as job_is_completed() returns true, it seems we're not
>>>>> pausing the job any more when one of its nodes gets drained.
>>>>>
>>>>> Possibly also relevant: The job->busy = false in job_exit(). The comment
>>>>> there says it's a lie, but we might deadlock otherwise.
>>>>>
>>>>> This problem will probably affect other callers, too, which drain a
>>>>> subtree and then resonably expect that nobody will modify the graph
>>>>> until they end the drained section. So I think the problem that we need
>>>>> to address is that jobs run their completion handlers even though they
>>>>> are supposed to be paused by a drain.
>>>>
>>>> Hmm. I always thought about drained section as about thing that stops
>>>> IO requests, not other operations.. And we do graph modifications in
>>>> drained section to avoid in-flight IO requests during graph
>>>> modification.
>>>
>>> Is there any use for an operation that only stops I/O, but doesn't
>>> prevent graph changes?
>>>
>>> I always understood it as a request to have exclusive access to a
>>> subtree, so that nobody else would touch it.
>>>
>>>>> I'm not saying that your graph modification locks are a bad idea, but
>>>>> they are probably not a complete solution.
>>>>>
>>>>
>>>> Hmm. What do you mean? It's of course not complete, as I didn't
>>>> protect every graph modification procedure.. But if we do protect all
>>>> such things and do graph modifications always under this mutex, it
>>>> should work I think.
>>>
>>> What I mean is that not only graph modifications can conflict with each
>>> other, but most callers of drain_begin/end will probably not be prepared
>>> for the graph changing under their feet, even if they don't actively
>>> change the graph themselves.
>>>
>>
>> Understand now.. Right.. Anyway, it looks as we need some kind of
>> mutex. As the user of drained section of course wants to do graph
>> modifications and even IO (for example update backing-link in
>> metadata). The first thing that comes to mind is to protect all
>> outer-most drained sections by global CoMutex and assert in
>> drain_begin/drain_end that the mutex is locked.
>>
>> Hmm, it also looks like RW-lock, and simple IO is "read" and something
>> under drained section is "write".
> 
> In a way, drain _is_ the implementation of a lock. But as you've shown,
> it's a buggy implementation.
> 
> What I was looking at was basically fixing the one instance of a bug
> while leaving the design as it is.
> 
> My impression is that you're looking at this more radically and want to
> rearchitecture the whole drain mechanism so that such bugs would be much
> less likely to start with. Maybe this is a good idea, but it's probably
> also a lot more effort.
> 
> Basically, for making use of more traditional locks, the naive approach
> would be changing blk/bdrv_inc/dec_in_flight() so that it takes/releases
> an actual coroutine lock. As you suggest, probably a CoRwLock.
> 
> I see a few non-trivial questions already for this part:
> 
> * What about requests for which bs->in_flight is increased more than
>    once? Do we need some sort of recursive lock for them?

Is there reasonable example? I'd avoid recursive locks if possible, it doesn't make things simpler.

> 
> * How do you know whether you should take a reader or a writer lock? For
>    drains called from coroutine context, maybe you could store the caller
>    that "owns" the drain section in the BDS, but what about non-coroutine
>    drains?

Intuitively, readers are all IO requests and writers are drained sections.

Why should we store drained-section owner somewhere?

Ah, we need to do "read" when holding write lock.. So we need a possibility for readers to operate without taking lock, when write lock is taken by current[*] coroutine.

And I don't see the way to make synchronization without moving everything to coroutine.
In non coroutine we'll dead lock on nested aio_poll loop which will do nested drain of another job. In coroutine we can wait on mutex more efficiently.

> 
>    What do you do if coroutine A drains and then (directly or indirectly)
>    spawns coroutine B to do some work?

And that means that [*] is not current coroutine but may be some another coroutine started indirectly by lock owner..

So, just RW lock is not enough.. We need something like RW lock but with a possibility to call read operations while write lock is taken (by owner of write lock)..

> 
> * Multiple concurrent requests from the same origin (the drain section
>    owner) shouldn't be serialised, so the CoRwLock needs to be taken once
>    per origin, not once per request. Again, how do we identify origins
>    and where do we store the common lock?

This makes things complex. Why not to take lock per request? IO requests will take read lock and go in parallel.

> 
> * Is it desirable that requests hang in bdrv_inc_in_flight() waiting for
>    the lock to be released? This may be in the middle of another
>    operation that needs to complete before drain_begin can return.

drain_begin should take write lock. This automatically means than all read locks are released..

> 
>    I seem to remember that introducing queued_requests in BlockBackend
>    was already non-trivial because of potential deadlocks. We would have
>    to prepare for more of the same in BlockDriverState.
> 
>    The BlockBackend code involves temporarily dropping the in_flight
>    counter change that the request made, but on the BDS level we don't
>    even know which counters we increased how often before reaching
>    bdrv_inc_in_flight().
> 
> Do you have a different approach for placing the locks or do you have
> ideas for how we would find good answers for these problems?
> 

No, I don't have good complete architecture in-mind.. I was interested in this because I'm preparing a series to refactor permissions-update, and it's quite close. But still a lot simpler than all these problems. So, I'd start from my already prepared series anyway.

Do you think it worth protecting by global lock some job handlers, like I propose in patch 05? It's of course better to move generic job_*() functions to coroutine, to not make same thing for each job. It will not solve the whole problem but at least fix 030 iotest.. I don't remember real bugs around this and don't think that users really run parallel stream and commit jobs. On the other hand the fix should not hurt.

-- 
Best regards,
Vladimir