From: Greg Kurz <groug@kaod.org>
Removing a drive with drive_del while it is being used to run an I/O
intensive workload can cause QEMU to crash.
An AIO flush can yield at some point:

blk_aio_flush_entry()
 blk_co_flush(blk)
  bdrv_co_flush(blk->root->bs)
   ...
    qemu_coroutine_yield()
and let the HMP command run, free blk->root and give control back
to the AIO flush:
hmp_drive_del()
 blk_remove_bs()
  bdrv_root_unref_child(blk->root)
   child_bs = blk->root->bs
   bdrv_detach_child(blk->root)
    bdrv_replace_child(blk->root, NULL)
     blk->root->bs = NULL
    g_free(blk->root) <============== blk->root becomes stale
   bdrv_unref(child_bs)
    bdrv_delete(child_bs)
     bdrv_close()
      bdrv_drained_begin()
       bdrv_do_drained_begin()
        bdrv_drain_recurse()
         aio_poll()
          ...
           qemu_coroutine_switch()
and the AIO flush completion ends up dereferencing blk->root:
blk_aio_complete()
 scsi_aio_complete()
  blk_get_aio_context(blk)
   bs = blk_bs(blk)
   ie, bs = blk->root ? blk->root->bs : NULL
            ^^^^^
            stale
The problem is that we should avoid making block driver graph
changes while we have in-flight requests. Let's drain all I/O
for this BB before calling bdrv_root_unref_child().
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block/block-backend.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/block/block-backend.c b/block/block-backend.c
index 2d1a3463e8..6b75bca317 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -767,6 +767,11 @@ void blk_remove_bs(BlockBackend *blk)
 
     blk_update_root_state(blk);
 
+    /* bdrv_root_unref_child() will cause blk->root to become stale and may
+     * switch to a completion coroutine later on. Let's drain all I/O here
+     * to avoid that and a potential QEMU crash.
+     */
+    blk_drain(blk);
     bdrv_root_unref_child(blk->root);
     blk->root = NULL;
 }
--
2.13.6
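(For readers less familiar with the block layer, here is a minimal, self-contained
C sketch of the ordering problem and of the drain-before-detach fix. It is not
QEMU code; every name in it is a made-up stand-in for the structures and calls
named in the commit message.)

/*
 * Sketch only, NOT QEMU code: Backend, Child and friends are hypothetical
 * stand-ins for BlockBackend, BdrvChild and the AIO completion path, kept
 * just small enough to show the ordering issue.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct Child {
    const char *bs_name;        /* stands in for blk->root->bs */
} Child;

typedef struct Backend {
    Child *root;                /* stands in for blk->root */
    int in_flight;              /* requests submitted but not yet completed */
} Backend;

/* Completion path, like blk_aio_complete() -> blk_get_aio_context():
 * it dereferences backend->root, so root must still be valid here. */
static void complete_one_request(Backend *blk)
{
    const char *name = blk->root ? blk->root->bs_name : "<detached>";
    printf("request completed on %s\n", name);
    blk->in_flight--;
}

/* Like blk_drain(): make sure nothing is in flight any more. */
static void drain(Backend *blk)
{
    while (blk->in_flight > 0) {
        complete_one_request(blk);
    }
}

/* Like blk_remove_bs() after the patch: drain first, then detach. */
static void remove_bs(Backend *blk)
{
    drain(blk);                 /* the fix: completions run while root is valid */
    free(blk->root);            /* past this point the old pointer is stale ... */
    blk->root = NULL;           /* ... so clear it, as the real code does */
    /* Without the drain() above, a completion still pending here would read
     * freed memory through the stale root pointer - the reported crash. */
}

int main(void)
{
    Backend blk = { .root = malloc(sizeof(Child)), .in_flight = 1 };
    blk.root->bs_name = "disk0";

    remove_bs(&blk);            /* drains, so the pending request completes safely */
    return 0;
}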
On 06/18/2018 11:44 AM, Kevin Wolf wrote:
> From: Greg Kurz <groug@kaod.org>
>
> Removing a drive with drive_del while it is being used to run an I/O
> intensive workload can cause QEMU to crash.
> ...
> The problem is that we should avoid making block driver graph
> changes while we have in-flight requests. Let's drain all I/O
> for this BB before calling bdrv_root_unref_child().
>
> Signed-off-by: Greg Kurz <groug@kaod.org>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/block-backend.c | 5 +++++
>  1 file changed, 5 insertions(+)

Interestingly enough, git bisect is reliably pointing to this commit as
the culprit in the changed output of iotests './check -nbd 83':

083 8s ... - output mismatch (see 083.out.bad)
--- /home/eblake/qemu/tests/qemu-iotests/083.out   2018-02-26 11:40:31.605792220 -0600
+++ /home/eblake/qemu/tests/qemu-iotests/083.out.bad   2018-08-06 16:57:03.411861660 -0500
@@ -41,7 +41,6 @@
 
 === Check disconnect after neg2 ===
 
-Connection closed
 read failed: Input/output error
 
 === Check disconnect 8 neg2 ===
@@ -54,7 +53,6 @@
 
 === Check disconnect before request ===
 
-Connection closed
 read failed: Input/output error
 
 === Check disconnect after request ===
@@ -116,7 +114,6 @@
 
 === Check disconnect after neg-classic ===
 
-Connection closed
 read failed: Input/output error
 
 === Check disconnect before neg1 ===

Failures: 083
Failed 1 of 1 tests

>
> diff --git a/block/block-backend.c b/block/block-backend.c
> index 2d1a3463e8..6b75bca317 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -767,6 +767,11 @@ void blk_remove_bs(BlockBackend *blk)
>
>      blk_update_root_state(blk);
>
> +    /* bdrv_root_unref_child() will cause blk->root to become stale and may
> +     * switch to a completion coroutine later on. Let's drain all I/O here
> +     * to avoid that and a potential QEMU crash.
> +     */
> +    blk_drain(blk);
>      bdrv_root_unref_child(blk->root);
>      blk->root = NULL;
>  }
>

Test 83 sets up a client that intentionally disconnects at critical
points in the NBD protocol exchange, to ensure that the server reacts
sanely. I suspect that somewhere in the NBD code, the server detects
the disconnect and was somehow calling into blk_remove_bs() (although I
could not quickly find the backtrace); and that prior to this patch,
the 'Connection closed' message resulted from other NBD coroutines
getting a shot at the (now-closed) connection, while after this patch,
the additional blk_drain() somehow tweaks things in a way that prevents
the other NBD coroutines from printing a message. If so, then the
change in 83 reference output is probably intentional, and we should
update it.

But I'm having a hard time convincing myself that this is the case,
particularly since I'm not even sure how to easily debug the
assumptions I made above.

Since I'm very weak on the whole notion of what blk_drain() vs.
blk_remove_bs() is really supposed to be doing, and could easily be
persuaded that the change in output is a regression instead of a fix.

Thoughts?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
On 08/06/2018 05:04 PM, Eric Blake wrote: > On 06/18/2018 11:44 AM, Kevin Wolf wrote: >> From: Greg Kurz <groug@kaod.org> >> >> Removing a drive with drive_del while it is being used to run an I/O >> intensive workload can cause QEMU to crash. >> > ... > > Test 83 sets up a client that intentionally disconnects at critical > points in the NBD protocol exchange, to ensure that the server reacts > sanely. Rather, nbd-fault-injector.py is a server that disconnects at critical points, and the test is of client reaction. > I suspect that somewhere in the NBD code, the server detects > the disconnect and was somehow calling into blk_remove_bs() (although I > could not quickly find the backtrace); and that prior to this patch, the > 'Connection closed' message resulted from other NBD coroutines getting a > shot at the (now-closed) connection, while after this patch, the > additional blk_drain() somehow tweaks things in a way that prevents the > other NBD coroutines from printing a message. If so, then the change in > 83 reference output is probably intentional, and we should update it. It seems like this condition is racy, and that the race is more likely to be lost prior to this patch than after. It's a question of whether the client has time to start a request to the server prior to the server hanging up, as the message is generated during nbd_co_do_receive_one_chunk. Here's a demonstration of the fact that things are racy: $ git revert f45280cbf $ make $ cd tests/qemu-iotests $ cat fault.txt [inject-error "a"] event=neg2 when=after $ python nbd-fault-injector.py localhost:10809 ./fault.txt & Listening on 127.0.0.1:10809 $ ../../qemu-io -f raw nbd://localhost:10809 -c 'r 0 512' Closing connection on rule match inject-error "a" Connection closed read failed: Input/output error $ python nbd-fault-injector.py localhost:10809 ./fault.txt & Listening on 127.0.0.1:10809 $ ../../qemu-io -f raw nbd://localhost:10809 Closing connection on rule match inject-error "a" qemu-io> r 0 512 read failed: Input/output error qemu-io> q So, depending on whether the read command is kicked off quickly (via -c) or slowly (via typing into qemu-io) determines whether the message appears. What's more, in commit f140e300, we specifically called out in the commit message that maybe it was better to trace when we detect connection closed rather than log it to stdout, and in all cases in that commit, the additional 'Connection closed' messages do not add any information to the error message already displayed by the rest of the code. I don't know how much the proposed NBD reconnect code will change things in 3.1. Meanwhile, we've missed any chance for 3.0 to fix test 83. > > But I'm having a hard time convincing myself that this is the case, > particularly since I'm not even sure how to easily debug the assumptions > I made above. > > Since I'm very weak on the whole notion of what blk_drain() vs. > blk_remove_bs() is really supposed to be doing, and could easily be > persuaded that the change in output is a regression instead of a fix. At this point, I don't think we have a regression, just merely a bad iotests reference output. 
The extra blk_drain() merely adds more time before the NBD code can send out its first request, and thus makes it more likely that the fault injector has closed the connection before the read request is issued rather than after (the message only appears when read beats the race), but the NBD code shouldn't be printing the error message in the first place, and 083 needs to be tweaked to remove the noisy lines added in f140e300 (not just the three lines that are reliably different due to this patch, but all other such lines due to strategic server drops at other points in the NBD protocol). -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
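(To restate the timing argument above in code form, here is a toy,
self-contained C model; none of these names exist in QEMU or in
nbd-fault-injector.py. It only encodes why the extra message appears
when the request beats the hangup.)

/* Toy model (hypothetical, not QEMU/NBD code): the client only prints the
 * extra "Connection closed" style message if its request was already sent
 * before the hangup was noticed, i.e. the error is detected while waiting
 * for a reply rather than while sending the request. */
#include <stdio.h>
#include <stdbool.h>

static void run_client(int request_sent_at, int hangup_noticed_at)
{
    bool request_was_sent = request_sent_at < hangup_noticed_at;

    if (request_was_sent) {
        /* Error surfaces while receiving the reply: an extra message is
         * printed on top of the generic failure. */
        printf("Connection closed\n");
    }
    /* Either way the operation fails the same way for the user. */
    printf("read failed: Input/output error\n\n");
}

int main(void)
{
    /* 'qemu-io -c' issues the read immediately: the request usually wins. */
    run_client(/*request_sent_at=*/1, /*hangup_noticed_at=*/2);

    /* Extra delay (an additional drain, or typing interactively): the
     * hangup is noticed first and only the generic error appears. */
    run_client(/*request_sent_at=*/3, /*hangup_noticed_at=*/2);
    return 0;
}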
07.08.2018 22:57, Eric Blake wrote:
> On 08/06/2018 05:04 PM, Eric Blake wrote:
>> On 06/18/2018 11:44 AM, Kevin Wolf wrote:
>>> From: Greg Kurz <groug@kaod.org>
>>>
>>> Removing a drive with drive_del while it is being used to run an I/O
>>> intensive workload can cause QEMU to crash.
>>>
>> ...
>
> ...
>
> I don't know how much the proposed NBD reconnect code will change
> things in 3.1. Meanwhile, we've missed any chance for 3.0 to fix test
> 83.

hmm, isn't 3.0 release planned on August 14th?

> ...

-- 
Best regards,
Vladimir
On 08/08/2018 06:40 AM, Vladimir Sementsov-Ogievskiy wrote:
>> I don't know how much the proposed NBD reconnect code will change
>> things in 3.1. Meanwhile, we've missed any chance for 3.0 to fix test
>> 83.
>
> hmm, isn't 3.0 release planned on August 14th?

Yes, but -rc4 has already been cut, and there will not be an -rc5
unless there is a MAJOR regression identified in qemu proper (a failing
iotest is not major).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
07.08.2018 22:57, Eric Blake wrote:
> On 08/06/2018 05:04 PM, Eric Blake wrote:
>> ...
>
> ...
>
> What's more, in commit f140e300, we specifically called out in the
> commit message that maybe it was better to trace when we detect
> connection closed rather than log it to stdout, and in all cases in
> that commit, the additional 'Connection closed' messages do not add
> any information to the error message already displayed by the rest of
> the code.
>
> ...
>
> At this point, I don't think we have a regression, just merely a bad
> iotests reference output. The extra blk_drain() merely adds more time
> before the NBD code can send out its first request, and thus makes it
> more likely that the fault injector has closed the connection before
> the read request is issued rather than after (the message only appears
> when read beats the race), but the NBD code shouldn't be printing the
> error message in the first place, and 083 needs to be tweaked to
> remove the noisy lines added in f140e300 (not just the three lines
> that are reliably different due to this patch, but all other such
> lines due to strategic server drops at other points in the NBD
> protocol).

Ok, agree, I'll do it in reconnect series.

-- 
Best regards,
Vladimir
08.08.2018 12:33, Vladimir Sementsov-Ogievskiy wrote:
> 07.08.2018 22:57, Eric Blake wrote:
>> ...
>>
>> What's more, in commit f140e300, we specifically called out in the
>> commit message that maybe it was better to trace when we detect
>> connection closed rather than log it to stdout, and in all cases in
>> that commit, the additional 'Connection closed' messages do not add
>> any information to the error message already displayed by the rest of
>> the code.
>>
>> ...
>>
>> At this point, I don't think we have a regression, just merely a bad
>> iotests reference output. The extra blk_drain() merely adds more time
>> before the NBD code can send out its first request, and thus makes it
>> more likely that the fault injector has closed the connection before
>> the read request is issued rather than after (the message only appears
>> when read beats the race), but the NBD code shouldn't be printing the
>> error message in the first place, and 083 needs to be tweaked to
>> remove the noisy lines added in f140e300 (not just the three lines
>> that are reliably different due to this patch, but all other such
>> lines due to strategic server drops at other points in the NBD
>> protocol).
>>
>
> Ok, agree, I'll do it in reconnect series.
>

hmm, do what?

I was going to change these error messages to be traces, but now I'm
not sure that it's a good idea. We have generic errp returned from the
function, and why to drop it from logs? Fixing iotest is not a good
reason, better is to adjust iotest itself a bit (just commit changed
output) and forget about it. Is iotest racy itself, did you see
different output running 83 iotest, not testing by hand?

-- 
Best regards,
Vladimir
On 08/08/2018 09:32 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> What's more, in commit f140e300, we specifically called out in the
>>> commit message that maybe it was better to trace when we detect
>>> connection closed rather than log it to stdout, and in all cases in
>>> that commit, the additional 'Connection closed' messages do not add
>>> any information to the error message already displayed by the rest
>>> of the code.
>>>
>> Ok, agree, I'll do it in reconnect series.
>>
>
> hmm, do what?
>
> I was going to change these error messages to be traces, but now I'm
> not sure that it's a good idea.

Traces are fine. They won't show up in iotests, but will show up when
debugging a failed connection.

> We have generic errp returned from the
> function, and why to drop it from logs?

Because it is redundant with the very next line already in the log.
Any error encountered when trying to write to a disconnected server is
redundant with an already-reported error due to detecting EOF on
reading from the server.

> Fixing iotest is not a good
> reason, better is to adjust iotest itself a bit (just commit changed
> output) and forget about it. Is iotest racy itself, did you see
> different output running 83 iotest, not testing by hand?

The condition for the output of the 'Connection closed' message is racy
- it depends entirely on the timing of whether the client was able to
send() a read request to the server prior to the server disconnecting
immediately after negotiation ended. If the client loses the race and
detects the server hangup prior to writing anything, you get one path;
if the client wins the race and successfully writes the request and
only later learns that the server has disconnected when trying to read
the response to that request, you get the other path.

The window for the race changed (and the iotests did not seem to ever
expose it short of this particular change to the block layer to do an
extra drain), but I could still imagine scenarios where iotests will
trigger the opposite path of the race from what is expected, depending
on load, since I don't see any synchronization points between the two
processes where the server is hanging up after negotiation without
reading the client's request, but where the client may or may not have
had time to get its request sent to the server's queue.

So, just because I have not seen the iotest fail directly because of a
race, I think that this commit causing failures in the iotest is
evidence that the test is not robust with those extra 'Connection
closed' messages being output. Switching the output to be a trace
instead should be just fine; overall, the client's attempt to read when
the server hangs up will be an EIO failure whether or not the client
was able to send() its request and merely fails to get a reply (server
disconnect was slow), or whether the client was not even able to send()
its request (server disconnect was fast).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
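(A hedged sketch of the log-vs-trace distinction being discussed; the
names below are invented for illustration and are not QEMU's actual
trace machinery or block/nbd-client.c code. QEMU generates trace_*()
helpers from its trace-events files; the getenv() guard here is just a
stand-in for "trace events are only visible when explicitly enabled".)

/* Hypothetical sketch: contrast an unconditional message, which iotests
 * will diff, with a trace point that stays silent unless enabled, while
 * the caller still gets the same -EIO either way. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for a trace point (made-up name). */
static void trace_demo_server_hangup(void)
{
    if (getenv("DEMO_TRACE")) {
        fprintf(stderr, "trace: server closed connection while awaiting reply\n");
    }
}

/* Stand-in for the reply-receiving path (made-up name). */
static int demo_receive_reply(bool peer_hung_up)
{
    if (peer_hung_up) {
        /* Old style: always visible, so reference output depends on the race.
         *     fprintf(stderr, "Connection closed\n");
         * Proposed style: trace only; the user still sees the generic
         * "read failed: Input/output error" that the -EIO return produces. */
        trace_demo_server_hangup();
        return -EIO;
    }
    return 0;
}

int main(void)
{
    printf("demo_receive_reply -> %d\n", demo_receive_reply(true));
    return 0;
}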