hmp-commands.hx | 12 +++++++ hmp.c | 13 +++++++ hmp.h | 1 + migration/migration.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++-- migration/migration.h | 4 +++ qapi/migration.json | 30 ++++++++++++++-- 6 files changed, 152 insertions(+), 4 deletions(-)
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Hi, This set attempts to make a race condition between migration and drive-mirror (and other block users) soluble by allowing the migration to be paused after the source qemu releases the block devices but before the serialisation of the device state. The symptom of this failure, as reported by Wangjie, is a: _co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed and the source qemu dieing; so the problem is pretty nasty. This has only been seen on 2.9 onwards, but the theory is that prior to 2.9 it might have been happening anyway and we were perhaps getting unreported corruptions (lost writes); so this really needs fixing. This flow came from discussions between Kevin and me, and we can't see a way of fixing it without exposing a new state to the management layer. The flow is now: (qemu) migrate_set_capability pause-before-device on (qemu) migrate -d ... (qemu) info migrate ... Migration status: pause-before-device ... << issue commands to clean up any block jobs>> (qemu) migrate_continue pause-before-device (qemu) info migrate ... Migration status: completed This set has been _very_ lightly tested just at the normal migration code, without the addition of the drive mirror; so this is a first cut. I'd appreciate some feedback from libvirt whether the inteface is OK and ideally a hack to test it in a full libvirt setup to see if we hit any other issues. The precopy flow is: active->pause-before-device->completed The postcopy flow is: active->pause-before-device->postcopy-active->completed Although the behaviour with postcopy only gets interesting when we add something like Max's active-sync. Please argue about the command and state naming. Dave Dr. David Alan Gilbert (7): migration: Add 'pause-before-device' capability migration: Add 'pause-before-device' and 'device' statuses migration: Wait for semaphore before completing migration migration: migrate-continue migrate: HMP migate_continue migration: allow cancel to unpause migration: pause-before-device for postcopy hmp-commands.hx | 12 +++++++ hmp.c | 13 +++++++ hmp.h | 1 + migration/migration.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++-- migration/migration.h | 4 +++ qapi/migration.json | 30 ++++++++++++++-- 6 files changed, 152 insertions(+), 4 deletions(-) -- 2.13.6
Hi, This series seems to have some coding style problems. See output below for more information: Type: series Message-id: 20171011191317.24157-1-dgilbert@redhat.com Subject: [Qemu-devel] [PATCH 0/7] migration: pause-before-device === TEST SCRIPT BEGIN === #!/bin/bash BASE=base n=1 total=$(git log --oneline $BASE.. | wc -l) failed=0 git config --local diff.renamelimit 0 git config --local diff.renames True commits="$(git log --format=%H --reverse $BASE..)" for c in $commits; do echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..." if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then failed=1 echo fi n=$((n+1)) done exit $failed === TEST SCRIPT END === Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384 From https://github.com/patchew-project/qemu t [tag update] patchew/1507742676-9908-1-git-send-email-peter.maydell@linaro.org -> patchew/1507742676-9908-1-git-send-email-peter.maydell@linaro.org * [new tag] patchew/1507751249-3035-1-git-send-email-stefanb@linux.vnet.ibm.com -> patchew/1507751249-3035-1-git-send-email-stefanb@linux.vnet.ibm.com Switched to a new branch 'test' 88ddcde82f migration: pause-before-device for postcopy 341433d19b migration: allow cancel to unpause 4a76fbc2b2 migrate: HMP migate_continue 24305477e0 migration: migrate-continue fc48af642c migration: Wait for semaphore before completing migration 263e6dc5be migration: Add 'pause-before-device' and 'device' statuses 7827c25132 migration: Add 'pause-before-device' capability === OUTPUT BEGIN === Checking PATCH 1/7: migration: Add 'pause-before-device' capability... Checking PATCH 2/7: migration: Add 'pause-before-device' and 'device' statuses... Checking PATCH 3/7: migration: Wait for semaphore before completing migration... ERROR: trailing statements should be on next line #45: FILE: migration/migration.c:1995: + while (qemu_sem_timedwait(&s->pause_sem, 1) == 0); ERROR: braces {} are necessary even for single statement blocks #45: FILE: migration/migration.c:1995: + while (qemu_sem_timedwait(&s->pause_sem, 1) == 0); total: 2 errors, 0 warnings, 80 lines checked Your patch has style problems, please review. If any of these errors are false positives report them to the maintainer, see CHECKPATCH in MAINTAINERS. Checking PATCH 4/7: migration: migrate-continue... Checking PATCH 5/7: migrate: HMP migate_continue... Checking PATCH 6/7: migration: allow cancel to unpause... Checking PATCH 7/7: migration: pause-before-device for postcopy... === OUTPUT END === Test command exited with code: 1 --- Email generated automatically by Patchew [http://patchew.org/]. Please send your feedback to patchew-devel@freelists.org
On Wed, Oct 11, 2017 at 08:13:10PM +0100, Dr. David Alan Gilbert (git) wrote: > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > > Hi, > This set attempts to make a race condition between migration and > drive-mirror (and other block users) soluble by allowing the migration > to be paused after the source qemu releases the block devices but > before the serialisation of the device state. > > The symptom of this failure, as reported by Wangjie, is a: > _co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed > > and the source qemu dieing; so the problem is pretty nasty. > This has only been seen on 2.9 onwards, but the theory is that > prior to 2.9 it might have been happening anyway and we were > perhaps getting unreported corruptions (lost writes); so this > really needs fixing. > > This flow came from discussions between Kevin and me, and we can't > see a way of fixing it without exposing a new state to the management > layer. > > The flow is now: > > (qemu) migrate_set_capability pause-before-device on How about 'switchover-cleanup' > (qemu) migrate -d ... > (qemu) info migrate > ... > Migration status: pause-before-device and 'switchover' > ... > << issue commands to clean up any block jobs>> > > (qemu) migrate_continue pause-before-device > (qemu) info migrate > ... > Migration status: completed > > This set has been _very_ lightly tested just at the normal migration > code, without the addition of the drive mirror; so this is a first > cut. I'd appreciate some feedback from libvirt whether the inteface > is OK and ideally a hack to test it in a full libvirt setup to see > if we hit any other issues. > > The precopy flow is: > active->pause-before-device->completed > > The postcopy flow is: > active->pause-before-device->postcopy-active->completed > > Although the behaviour with postcopy only gets interesting when > we add something like Max's active-sync. > > Please argue about the command and state naming. Argued above :-) Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Wed, Oct 11, 2017 at 08:13:10PM +0100, Dr. David Alan Gilbert (git) wrote: > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > > Hi, > This set attempts to make a race condition between migration and > drive-mirror (and other block users) soluble by allowing the migration > to be paused after the source qemu releases the block devices but > before the serialisation of the device state. > > The symptom of this failure, as reported by Wangjie, is a: > _co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed > > and the source qemu dieing; so the problem is pretty nasty. > This has only been seen on 2.9 onwards, but the theory is that > prior to 2.9 it might have been happening anyway and we were > perhaps getting unreported corruptions (lost writes); so this > really needs fixing. > > This flow came from discussions between Kevin and me, and we can't > see a way of fixing it without exposing a new state to the management > layer. > > The flow is now: > > (qemu) migrate_set_capability pause-before-device on > (qemu) migrate -d ... > (qemu) info migrate > ... > Migration status: pause-before-device > ... > << issue commands to clean up any block jobs>> > > (qemu) migrate_continue pause-before-device > (qemu) info migrate > ... > Migration status: completed I'm curious why QEMU doesn't have enough info to clean up the block jobs automatically ? What is the key thing that libvirt knows about the block jobs, that QEMU is lacking ? If QEMU had the right info it could do it automatically & avoid this extra lock-step synchronization with libvirt. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
Am 12.10.2017 um 10:21 hat Daniel P. Berrange geschrieben: > On Wed, Oct 11, 2017 at 08:13:10PM +0100, Dr. David Alan Gilbert (git) wrote: > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > > > > Hi, > > This set attempts to make a race condition between migration and > > drive-mirror (and other block users) soluble by allowing the migration > > to be paused after the source qemu releases the block devices but > > before the serialisation of the device state. > > > > The symptom of this failure, as reported by Wangjie, is a: > > _co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed > > > > and the source qemu dieing; so the problem is pretty nasty. > > This has only been seen on 2.9 onwards, but the theory is that > > prior to 2.9 it might have been happening anyway and we were > > perhaps getting unreported corruptions (lost writes); so this > > really needs fixing. > > > > This flow came from discussions between Kevin and me, and we can't > > see a way of fixing it without exposing a new state to the management > > layer. > > > > The flow is now: > > > > (qemu) migrate_set_capability pause-before-device on > > (qemu) migrate -d ... > > (qemu) info migrate > > ... > > Migration status: pause-before-device > > ... > > << issue commands to clean up any block jobs>> > > > > (qemu) migrate_continue pause-before-device > > (qemu) info migrate > > ... > > Migration status: completed > > I'm curious why QEMU doesn't have enough info to clean up the block > jobs automatically ? What is the key thing that libvirt knows about > the block jobs, that QEMU is lacking ? If QEMU had the right info it > could do it automatically & avoid this extra lock-step synchronization > with libvirt. The key point is that the block job needs to be completed while the source VM is stopped, but the source qemu is still in control of the image files (e.g. still holds the file locks), so that it can do the remaining writes. Without the additional migration phase, the only state where both sides are stopped is when the destination is in control of the image files (migration has completed, but -S prevents it from automatically resuming), so the source can't write to the image any more. Kevin
On Thu, Oct 12, 2017 at 11:18:31AM +0200, Kevin Wolf wrote: > Am 12.10.2017 um 10:21 hat Daniel P. Berrange geschrieben: > > On Wed, Oct 11, 2017 at 08:13:10PM +0100, Dr. David Alan Gilbert (git) wrote: > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > > > > > > Hi, > > > This set attempts to make a race condition between migration and > > > drive-mirror (and other block users) soluble by allowing the migration > > > to be paused after the source qemu releases the block devices but > > > before the serialisation of the device state. > > > > > > The symptom of this failure, as reported by Wangjie, is a: > > > _co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed > > > > > > and the source qemu dieing; so the problem is pretty nasty. > > > This has only been seen on 2.9 onwards, but the theory is that > > > prior to 2.9 it might have been happening anyway and we were > > > perhaps getting unreported corruptions (lost writes); so this > > > really needs fixing. > > > > > > This flow came from discussions between Kevin and me, and we can't > > > see a way of fixing it without exposing a new state to the management > > > layer. > > > > > > The flow is now: > > > > > > (qemu) migrate_set_capability pause-before-device on > > > (qemu) migrate -d ... > > > (qemu) info migrate > > > ... > > > Migration status: pause-before-device > > > ... > > > << issue commands to clean up any block jobs>> > > > > > > (qemu) migrate_continue pause-before-device > > > (qemu) info migrate > > > ... > > > Migration status: completed > > > > I'm curious why QEMU doesn't have enough info to clean up the block > > jobs automatically ? What is the key thing that libvirt knows about > > the block jobs, that QEMU is lacking ? If QEMU had the right info it > > could do it automatically & avoid this extra lock-step synchronization > > with libvirt. > > The key point is that the block job needs to be completed while the > source VM is stopped, but the source qemu is still in control of the > image files (e.g. still holds the file locks), so that it can do the > remaining writes. > > Without the additional migration phase, the only state where both sides > are stopped is when the destination is in control of the image files > (migration has completed, but -S prevents it from automatically > resuming), so the source can't write to the image any more. Hmm, I always thought that the target QEMU did not start using the image files until you ran 'cont' on the target. eg once source QEMU has migrate=completed, both QEMUs are in paused state and source QEMU still owns the images, until we run 'cont'. What you're saying seems to imply this is not the case, but if so what is triggering the target QEMU to acquire the locks on images ? Is it done implicitly when it finishes reading device state off the wire ? If so, could we instead add a migrate feature flag to tell the target QEMU not to automatically acquire image locks, until it receives an explicit 'cont'. That would then not require this extra lock-step migration state. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
Am 12.10.2017 um 11:27 hat Daniel P. Berrange geschrieben: > On Thu, Oct 12, 2017 at 11:18:31AM +0200, Kevin Wolf wrote: > > Am 12.10.2017 um 10:21 hat Daniel P. Berrange geschrieben: > > > On Wed, Oct 11, 2017 at 08:13:10PM +0100, Dr. David Alan Gilbert (git) wrote: > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > > > > > > > > Hi, > > > > This set attempts to make a race condition between migration and > > > > drive-mirror (and other block users) soluble by allowing the migration > > > > to be paused after the source qemu releases the block devices but > > > > before the serialisation of the device state. > > > > > > > > The symptom of this failure, as reported by Wangjie, is a: > > > > _co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed > > > > > > > > and the source qemu dieing; so the problem is pretty nasty. > > > > This has only been seen on 2.9 onwards, but the theory is that > > > > prior to 2.9 it might have been happening anyway and we were > > > > perhaps getting unreported corruptions (lost writes); so this > > > > really needs fixing. > > > > > > > > This flow came from discussions between Kevin and me, and we can't > > > > see a way of fixing it without exposing a new state to the management > > > > layer. > > > > > > > > The flow is now: > > > > > > > > (qemu) migrate_set_capability pause-before-device on > > > > (qemu) migrate -d ... > > > > (qemu) info migrate > > > > ... > > > > Migration status: pause-before-device > > > > ... > > > > << issue commands to clean up any block jobs>> > > > > > > > > (qemu) migrate_continue pause-before-device > > > > (qemu) info migrate > > > > ... > > > > Migration status: completed > > > > > > I'm curious why QEMU doesn't have enough info to clean up the block > > > jobs automatically ? What is the key thing that libvirt knows about > > > the block jobs, that QEMU is lacking ? If QEMU had the right info it > > > could do it automatically & avoid this extra lock-step synchronization > > > with libvirt. > > > > The key point is that the block job needs to be completed while the > > source VM is stopped, but the source qemu is still in control of the > > image files (e.g. still holds the file locks), so that it can do the > > remaining writes. > > > > Without the additional migration phase, the only state where both sides > > are stopped is when the destination is in control of the image files > > (migration has completed, but -S prevents it from automatically > > resuming), so the source can't write to the image any more. > > Hmm, I always thought that the target QEMU did not start using the > image files until you ran 'cont' on the target. eg once source QEMU > has migrate=completed, both QEMUs are in paused state and source QEMU > still owns the images, until we run 'cont'. > > What you're saying seems to imply this is not the case, but if so what > is triggering the target QEMU to acquire the locks on images ? Is it > done implicitly when it finishes reading device state off the wire ? > > If so, could we instead add a migrate feature flag to tell the target > QEMU not to automatically acquire image locks, until it receives an > explicit 'cont'. That would then not require this extra lock-step > migration state. The handover consists of two parts: The destination acquires the locks, but first the source needs to release them. Without a new command, the source can't know when it is supposed to do that. The destination receives the 'cont' command, but source doesn't know about this. So you have to have something that tells the source "management has made sure to complete what needed to be completed, you can now give up control of the images". I also think that conceptually it is the cleanest to have a source controlled pre-handover phase with paused VM, which is only symmetrical to the existing post-handover phase that we have on the destination. This gives us a clean model for the handover of any resources that require some tearing down on the source before they can be used on the destination, so it appears to be the most future-proof option. Kevin
On Thu, Oct 12, 2017 at 11:52:40AM +0200, Kevin Wolf wrote: > Am 12.10.2017 um 11:27 hat Daniel P. Berrange geschrieben: > > On Thu, Oct 12, 2017 at 11:18:31AM +0200, Kevin Wolf wrote: > > > Am 12.10.2017 um 10:21 hat Daniel P. Berrange geschrieben: > > > > On Wed, Oct 11, 2017 at 08:13:10PM +0100, Dr. David Alan Gilbert (git) wrote: > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > > > > > > > > > > Hi, > > > > > This set attempts to make a race condition between migration and > > > > > drive-mirror (and other block users) soluble by allowing the migration > > > > > to be paused after the source qemu releases the block devices but > > > > > before the serialisation of the device state. > > > > > > > > > > The symptom of this failure, as reported by Wangjie, is a: > > > > > _co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed > > > > > > > > > > and the source qemu dieing; so the problem is pretty nasty. > > > > > This has only been seen on 2.9 onwards, but the theory is that > > > > > prior to 2.9 it might have been happening anyway and we were > > > > > perhaps getting unreported corruptions (lost writes); so this > > > > > really needs fixing. > > > > > > > > > > This flow came from discussions between Kevin and me, and we can't > > > > > see a way of fixing it without exposing a new state to the management > > > > > layer. > > > > > > > > > > The flow is now: > > > > > > > > > > (qemu) migrate_set_capability pause-before-device on > > > > > (qemu) migrate -d ... > > > > > (qemu) info migrate > > > > > ... > > > > > Migration status: pause-before-device > > > > > ... > > > > > << issue commands to clean up any block jobs>> > > > > > > > > > > (qemu) migrate_continue pause-before-device > > > > > (qemu) info migrate > > > > > ... > > > > > Migration status: completed > > > > > > > > I'm curious why QEMU doesn't have enough info to clean up the block > > > > jobs automatically ? What is the key thing that libvirt knows about > > > > the block jobs, that QEMU is lacking ? If QEMU had the right info it > > > > could do it automatically & avoid this extra lock-step synchronization > > > > with libvirt. > > > > > > The key point is that the block job needs to be completed while the > > > source VM is stopped, but the source qemu is still in control of the > > > image files (e.g. still holds the file locks), so that it can do the > > > remaining writes. > > > > > > Without the additional migration phase, the only state where both sides > > > are stopped is when the destination is in control of the image files > > > (migration has completed, but -S prevents it from automatically > > > resuming), so the source can't write to the image any more. > > > > Hmm, I always thought that the target QEMU did not start using the > > image files until you ran 'cont' on the target. eg once source QEMU > > has migrate=completed, both QEMUs are in paused state and source QEMU > > still owns the images, until we run 'cont'. > > > > What you're saying seems to imply this is not the case, but if so what > > is triggering the target QEMU to acquire the locks on images ? Is it > > done implicitly when it finishes reading device state off the wire ? > > > > If so, could we instead add a migrate feature flag to tell the target > > QEMU not to automatically acquire image locks, until it receives an > > explicit 'cont'. That would then not require this extra lock-step > > migration state. > > The handover consists of two parts: The destination acquires the locks, > but first the source needs to release them. Without a new command, the > source can't know when it is supposed to do that. The destination > receives the 'cont' command, but source doesn't know about this. So you > have to have something that tells the source "management has made sure > to complete what needed to be completed, you can now give up control of > the images". > > I also think that conceptually it is the cleanest to have a source > controlled pre-handover phase with paused VM, which is only symmetrical > to the existing post-handover phase that we have on the destination. > This gives us a clean model for the handover of any resources that > require some tearing down on the source before they can be used on the > destination, so it appears to be the most future-proof option. Ok, I see what you mean now. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
© 2016 - 2024 Red Hat, Inc.