migration: pause-before-device

[Qemu-devel] [PATCH 0/7] migration: pause-before-device

Posted by Dr. David Alan Gilbert (git) 8 years, 4 months ago

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Hi,
  This set attempts to make a race condition between migration and
drive-mirror (and other block users) soluble by allowing the migration
to be paused after the source qemu releases the block devices but
before the serialisation of the device state.

The symptom of this failure, as reported by Wangjie, is a:
   _co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed

and the source qemu dieing; so the problem is pretty nasty.
This has only been seen on 2.9 onwards, but the theory is that
prior to 2.9 it might have been happening anyway and we were
perhaps getting unreported corruptions (lost writes); so this
really needs fixing.

This flow came from discussions between Kevin and me, and we can't
see a way of fixing it without exposing a new state to the management
layer.

The flow is now:

(qemu) migrate_set_capability pause-before-device on
(qemu) migrate -d ...
(qemu) info migrate
...
Migration status: pause-before-device
...
<< issue commands to clean up any block jobs>>

(qemu) migrate_continue pause-before-device
(qemu) info migrate
...
Migration status: completed

This set has been _very_ lightly tested just at the normal migration
code, without the addition of the drive mirror; so this is a first
cut.  I'd appreciate some feedback from libvirt whether the inteface
is OK and ideally a hack to test it in a full libvirt setup to see
if we hit any other issues.

The precopy flow is:
active->pause-before-device->completed

The postcopy flow is:
active->pause-before-device->postcopy-active->completed

Although the behaviour with postcopy only gets interesting when
we add something like Max's active-sync.

Please argue about the command and state naming.

Dave

Dr. David Alan Gilbert (7):
  migration: Add 'pause-before-device' capability
  migration: Add 'pause-before-device' and 'device' statuses
  migration: Wait for semaphore before completing migration
  migration: migrate-continue
  migrate: HMP migate_continue
  migration: allow cancel to unpause
  migration: pause-before-device for postcopy

 hmp-commands.hx       | 12 +++++++
 hmp.c                 | 13 +++++++
 hmp.h                 |  1 +
 migration/migration.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++--
 migration/migration.h |  4 +++
 qapi/migration.json   | 30 ++++++++++++++--
 6 files changed, 152 insertions(+), 4 deletions(-)

-- 
2.13.6

Re: [Qemu-devel] [PATCH 0/7] migration: pause-before-device

Posted by no-reply@patchew.org 8 years, 4 months ago

Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20171011191317.24157-1-dgilbert@redhat.com
Subject: [Qemu-devel] [PATCH 0/7] migration: pause-before-device

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 t [tag update]            patchew/1507742676-9908-1-git-send-email-peter.maydell@linaro.org -> patchew/1507742676-9908-1-git-send-email-peter.maydell@linaro.org
 * [new tag]               patchew/1507751249-3035-1-git-send-email-stefanb@linux.vnet.ibm.com -> patchew/1507751249-3035-1-git-send-email-stefanb@linux.vnet.ibm.com
Switched to a new branch 'test'
88ddcde82f migration: pause-before-device for postcopy
341433d19b migration: allow cancel to unpause
4a76fbc2b2 migrate: HMP migate_continue
24305477e0 migration: migrate-continue
fc48af642c migration: Wait for semaphore before completing migration
263e6dc5be migration: Add 'pause-before-device' and 'device' statuses
7827c25132 migration: Add 'pause-before-device' capability

=== OUTPUT BEGIN ===
Checking PATCH 1/7: migration: Add 'pause-before-device' capability...
Checking PATCH 2/7: migration: Add 'pause-before-device' and 'device' statuses...
Checking PATCH 3/7: migration: Wait for semaphore before completing migration...
ERROR: trailing statements should be on next line
#45: FILE: migration/migration.c:1995:
+    while (qemu_sem_timedwait(&s->pause_sem, 1) == 0);

ERROR: braces {} are necessary even for single statement blocks
#45: FILE: migration/migration.c:1995:
+    while (qemu_sem_timedwait(&s->pause_sem, 1) == 0);

total: 2 errors, 0 warnings, 80 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 4/7: migration: migrate-continue...
Checking PATCH 5/7: migrate: HMP migate_continue...
Checking PATCH 6/7: migration: allow cancel to unpause...
Checking PATCH 7/7: migration: pause-before-device for postcopy...
=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org

Re: [Qemu-devel] [PATCH 0/7] migration: pause-before-device

Posted by Daniel P. Berrange 8 years, 4 months ago

On Wed, Oct 11, 2017 at 08:13:10PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Hi,
>   This set attempts to make a race condition between migration and
> drive-mirror (and other block users) soluble by allowing the migration
> to be paused after the source qemu releases the block devices but
> before the serialisation of the device state.
> 
> The symptom of this failure, as reported by Wangjie, is a:
>    _co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed
> 
> and the source qemu dieing; so the problem is pretty nasty.
> This has only been seen on 2.9 onwards, but the theory is that
> prior to 2.9 it might have been happening anyway and we were
> perhaps getting unreported corruptions (lost writes); so this
> really needs fixing.
> 
> This flow came from discussions between Kevin and me, and we can't
> see a way of fixing it without exposing a new state to the management
> layer.
> 
> The flow is now:
> 
> (qemu) migrate_set_capability pause-before-device on

How about 'switchover-cleanup'


> (qemu) migrate -d ...
> (qemu) info migrate
> ...
> Migration status: pause-before-device

and 'switchover'

> ...
> << issue commands to clean up any block jobs>>
> 
> (qemu) migrate_continue pause-before-device
> (qemu) info migrate
> ...
> Migration status: completed
> 
> This set has been _very_ lightly tested just at the normal migration
> code, without the addition of the drive mirror; so this is a first
> cut.  I'd appreciate some feedback from libvirt whether the inteface
> is OK and ideally a hack to test it in a full libvirt setup to see
> if we hit any other issues.
> 
> The precopy flow is:
> active->pause-before-device->completed
> 
> The postcopy flow is:
> active->pause-before-device->postcopy-active->completed
> 
> Although the behaviour with postcopy only gets interesting when
> we add something like Max's active-sync.
> 
> Please argue about the command and state naming.

Argued above :-)

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [Qemu-devel] [PATCH 0/7] migration: pause-before-device

Posted by Daniel P. Berrange 8 years, 4 months ago

On Wed, Oct 11, 2017 at 08:13:10PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Hi,
>   This set attempts to make a race condition between migration and
> drive-mirror (and other block users) soluble by allowing the migration
> to be paused after the source qemu releases the block devices but
> before the serialisation of the device state.
> 
> The symptom of this failure, as reported by Wangjie, is a:
>    _co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed
> 
> and the source qemu dieing; so the problem is pretty nasty.
> This has only been seen on 2.9 onwards, but the theory is that
> prior to 2.9 it might have been happening anyway and we were
> perhaps getting unreported corruptions (lost writes); so this
> really needs fixing.
> 
> This flow came from discussions between Kevin and me, and we can't
> see a way of fixing it without exposing a new state to the management
> layer.
> 
> The flow is now:
> 
> (qemu) migrate_set_capability pause-before-device on
> (qemu) migrate -d ...
> (qemu) info migrate
> ...
> Migration status: pause-before-device
> ...
> << issue commands to clean up any block jobs>>
> 
> (qemu) migrate_continue pause-before-device
> (qemu) info migrate
> ...
> Migration status: completed

I'm curious why QEMU doesn't have enough info to clean up the block
jobs automatically ? What is the key thing that libvirt knows about
the block jobs, that QEMU is lacking ? If QEMU had the right info it
could do it automatically & avoid this extra lock-step synchronization
with libvirt.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [Qemu-devel] [PATCH 0/7] migration: pause-before-device

Posted by Kevin Wolf 8 years, 4 months ago

Am 12.10.2017 um 10:21 hat Daniel P. Berrange geschrieben:
> On Wed, Oct 11, 2017 at 08:13:10PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Hi,
> >   This set attempts to make a race condition between migration and
> > drive-mirror (and other block users) soluble by allowing the migration
> > to be paused after the source qemu releases the block devices but
> > before the serialisation of the device state.
> > 
> > The symptom of this failure, as reported by Wangjie, is a:
> >    _co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed
> > 
> > and the source qemu dieing; so the problem is pretty nasty.
> > This has only been seen on 2.9 onwards, but the theory is that
> > prior to 2.9 it might have been happening anyway and we were
> > perhaps getting unreported corruptions (lost writes); so this
> > really needs fixing.
> > 
> > This flow came from discussions between Kevin and me, and we can't
> > see a way of fixing it without exposing a new state to the management
> > layer.
> > 
> > The flow is now:
> > 
> > (qemu) migrate_set_capability pause-before-device on
> > (qemu) migrate -d ...
> > (qemu) info migrate
> > ...
> > Migration status: pause-before-device
> > ...
> > << issue commands to clean up any block jobs>>
> > 
> > (qemu) migrate_continue pause-before-device
> > (qemu) info migrate
> > ...
> > Migration status: completed
> 
> I'm curious why QEMU doesn't have enough info to clean up the block
> jobs automatically ? What is the key thing that libvirt knows about
> the block jobs, that QEMU is lacking ? If QEMU had the right info it
> could do it automatically & avoid this extra lock-step synchronization
> with libvirt.

The key point is that the block job needs to be completed while the
source VM is stopped, but the source qemu is still in control of the
image files (e.g. still holds the file locks), so that it can do the
remaining writes.

Without the additional migration phase, the only state where both sides
are stopped is when the destination is in control of the image files
(migration has completed, but -S prevents it from automatically
resuming), so the source can't write to the image any more.

Kevin

Re: [Qemu-devel] [PATCH 0/7] migration: pause-before-device

Posted by Daniel P. Berrange 8 years, 4 months ago

On Thu, Oct 12, 2017 at 11:18:31AM +0200, Kevin Wolf wrote:
> Am 12.10.2017 um 10:21 hat Daniel P. Berrange geschrieben:
> > On Wed, Oct 11, 2017 at 08:13:10PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Hi,
> > >   This set attempts to make a race condition between migration and
> > > drive-mirror (and other block users) soluble by allowing the migration
> > > to be paused after the source qemu releases the block devices but
> > > before the serialisation of the device state.
> > > 
> > > The symptom of this failure, as reported by Wangjie, is a:
> > >    _co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed
> > > 
> > > and the source qemu dieing; so the problem is pretty nasty.
> > > This has only been seen on 2.9 onwards, but the theory is that
> > > prior to 2.9 it might have been happening anyway and we were
> > > perhaps getting unreported corruptions (lost writes); so this
> > > really needs fixing.
> > > 
> > > This flow came from discussions between Kevin and me, and we can't
> > > see a way of fixing it without exposing a new state to the management
> > > layer.
> > > 
> > > The flow is now:
> > > 
> > > (qemu) migrate_set_capability pause-before-device on
> > > (qemu) migrate -d ...
> > > (qemu) info migrate
> > > ...
> > > Migration status: pause-before-device
> > > ...
> > > << issue commands to clean up any block jobs>>
> > > 
> > > (qemu) migrate_continue pause-before-device
> > > (qemu) info migrate
> > > ...
> > > Migration status: completed
> > 
> > I'm curious why QEMU doesn't have enough info to clean up the block
> > jobs automatically ? What is the key thing that libvirt knows about
> > the block jobs, that QEMU is lacking ? If QEMU had the right info it
> > could do it automatically & avoid this extra lock-step synchronization
> > with libvirt.
> 
> The key point is that the block job needs to be completed while the
> source VM is stopped, but the source qemu is still in control of the
> image files (e.g. still holds the file locks), so that it can do the
> remaining writes.
> 
> Without the additional migration phase, the only state where both sides
> are stopped is when the destination is in control of the image files
> (migration has completed, but -S prevents it from automatically
> resuming), so the source can't write to the image any more.

Hmm, I always thought that the target QEMU did not start using the
image files until you ran 'cont' on the target. eg once source QEMU
has migrate=completed, both QEMUs are in paused state and source QEMU
still owns the images, until we run 'cont'.

What you're saying seems to imply this is not the case, but if so what
is triggering the target QEMU to acquire the locks on images ? Is it
done implicitly when it finishes reading device state off the wire ?

If so, could we instead add a migrate feature flag to tell the target
QEMU not to automatically acquire image locks, until it receives an
explicit 'cont'. That would then not require this extra lock-step
migration state.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [Qemu-devel] [PATCH 0/7] migration: pause-before-device

Posted by Kevin Wolf 8 years, 4 months ago

Am 12.10.2017 um 11:27 hat Daniel P. Berrange geschrieben:
> On Thu, Oct 12, 2017 at 11:18:31AM +0200, Kevin Wolf wrote:
> > Am 12.10.2017 um 10:21 hat Daniel P. Berrange geschrieben:
> > > On Wed, Oct 11, 2017 at 08:13:10PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > 
> > > > Hi,
> > > >   This set attempts to make a race condition between migration and
> > > > drive-mirror (and other block users) soluble by allowing the migration
> > > > to be paused after the source qemu releases the block devices but
> > > > before the serialisation of the device state.
> > > > 
> > > > The symptom of this failure, as reported by Wangjie, is a:
> > > >    _co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed
> > > > 
> > > > and the source qemu dieing; so the problem is pretty nasty.
> > > > This has only been seen on 2.9 onwards, but the theory is that
> > > > prior to 2.9 it might have been happening anyway and we were
> > > > perhaps getting unreported corruptions (lost writes); so this
> > > > really needs fixing.
> > > > 
> > > > This flow came from discussions between Kevin and me, and we can't
> > > > see a way of fixing it without exposing a new state to the management
> > > > layer.
> > > > 
> > > > The flow is now:
> > > > 
> > > > (qemu) migrate_set_capability pause-before-device on
> > > > (qemu) migrate -d ...
> > > > (qemu) info migrate
> > > > ...
> > > > Migration status: pause-before-device
> > > > ...
> > > > << issue commands to clean up any block jobs>>
> > > > 
> > > > (qemu) migrate_continue pause-before-device
> > > > (qemu) info migrate
> > > > ...
> > > > Migration status: completed
> > > 
> > > I'm curious why QEMU doesn't have enough info to clean up the block
> > > jobs automatically ? What is the key thing that libvirt knows about
> > > the block jobs, that QEMU is lacking ? If QEMU had the right info it
> > > could do it automatically & avoid this extra lock-step synchronization
> > > with libvirt.
> > 
> > The key point is that the block job needs to be completed while the
> > source VM is stopped, but the source qemu is still in control of the
> > image files (e.g. still holds the file locks), so that it can do the
> > remaining writes.
> > 
> > Without the additional migration phase, the only state where both sides
> > are stopped is when the destination is in control of the image files
> > (migration has completed, but -S prevents it from automatically
> > resuming), so the source can't write to the image any more.
> 
> Hmm, I always thought that the target QEMU did not start using the
> image files until you ran 'cont' on the target. eg once source QEMU
> has migrate=completed, both QEMUs are in paused state and source QEMU
> still owns the images, until we run 'cont'.
> 
> What you're saying seems to imply this is not the case, but if so what
> is triggering the target QEMU to acquire the locks on images ? Is it
> done implicitly when it finishes reading device state off the wire ?
> 
> If so, could we instead add a migrate feature flag to tell the target
> QEMU not to automatically acquire image locks, until it receives an
> explicit 'cont'. That would then not require this extra lock-step
> migration state.

The handover consists of two parts: The destination acquires the locks,
but first the source needs to release them. Without a new command, the
source can't know when it is supposed to do that. The destination
receives the 'cont' command, but source doesn't know about this. So you
have to have something that tells the source "management has made sure
to complete what needed to be completed, you can now give up control of
the images".

I also think that conceptually it is the cleanest to have a source
controlled pre-handover phase with paused VM, which is only symmetrical
to the existing post-handover phase that we have on the destination.
This gives us a clean model for the handover of any resources that
require some tearing down on the source before they can be used on the
destination, so it appears to be the most future-proof option.

Kevin

Re: [Qemu-devel] [PATCH 0/7] migration: pause-before-device

Posted by Daniel P. Berrange 8 years, 4 months ago

On Thu, Oct 12, 2017 at 11:52:40AM +0200, Kevin Wolf wrote:
> Am 12.10.2017 um 11:27 hat Daniel P. Berrange geschrieben:
> > On Thu, Oct 12, 2017 at 11:18:31AM +0200, Kevin Wolf wrote:
> > > Am 12.10.2017 um 10:21 hat Daniel P. Berrange geschrieben:
> > > > On Wed, Oct 11, 2017 at 08:13:10PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > 
> > > > > Hi,
> > > > >   This set attempts to make a race condition between migration and
> > > > > drive-mirror (and other block users) soluble by allowing the migration
> > > > > to be paused after the source qemu releases the block devices but
> > > > > before the serialisation of the device state.
> > > > > 
> > > > > The symptom of this failure, as reported by Wangjie, is a:
> > > > >    _co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed
> > > > > 
> > > > > and the source qemu dieing; so the problem is pretty nasty.
> > > > > This has only been seen on 2.9 onwards, but the theory is that
> > > > > prior to 2.9 it might have been happening anyway and we were
> > > > > perhaps getting unreported corruptions (lost writes); so this
> > > > > really needs fixing.
> > > > > 
> > > > > This flow came from discussions between Kevin and me, and we can't
> > > > > see a way of fixing it without exposing a new state to the management
> > > > > layer.
> > > > > 
> > > > > The flow is now:
> > > > > 
> > > > > (qemu) migrate_set_capability pause-before-device on
> > > > > (qemu) migrate -d ...
> > > > > (qemu) info migrate
> > > > > ...
> > > > > Migration status: pause-before-device
> > > > > ...
> > > > > << issue commands to clean up any block jobs>>
> > > > > 
> > > > > (qemu) migrate_continue pause-before-device
> > > > > (qemu) info migrate
> > > > > ...
> > > > > Migration status: completed
> > > > 
> > > > I'm curious why QEMU doesn't have enough info to clean up the block
> > > > jobs automatically ? What is the key thing that libvirt knows about
> > > > the block jobs, that QEMU is lacking ? If QEMU had the right info it
> > > > could do it automatically & avoid this extra lock-step synchronization
> > > > with libvirt.
> > > 
> > > The key point is that the block job needs to be completed while the
> > > source VM is stopped, but the source qemu is still in control of the
> > > image files (e.g. still holds the file locks), so that it can do the
> > > remaining writes.
> > > 
> > > Without the additional migration phase, the only state where both sides
> > > are stopped is when the destination is in control of the image files
> > > (migration has completed, but -S prevents it from automatically
> > > resuming), so the source can't write to the image any more.
> > 
> > Hmm, I always thought that the target QEMU did not start using the
> > image files until you ran 'cont' on the target. eg once source QEMU
> > has migrate=completed, both QEMUs are in paused state and source QEMU
> > still owns the images, until we run 'cont'.
> > 
> > What you're saying seems to imply this is not the case, but if so what
> > is triggering the target QEMU to acquire the locks on images ? Is it
> > done implicitly when it finishes reading device state off the wire ?
> > 
> > If so, could we instead add a migrate feature flag to tell the target
> > QEMU not to automatically acquire image locks, until it receives an
> > explicit 'cont'. That would then not require this extra lock-step
> > migration state.
> 
> The handover consists of two parts: The destination acquires the locks,
> but first the source needs to release them. Without a new command, the
> source can't know when it is supposed to do that. The destination
> receives the 'cont' command, but source doesn't know about this. So you
> have to have something that tells the source "management has made sure
> to complete what needed to be completed, you can now give up control of
> the images".
> 
> I also think that conceptually it is the cleanest to have a source
> controlled pre-handover phase with paused VM, which is only symmetrical
> to the existing post-handover phase that we have on the destination.
> This gives us a clean model for the handover of any resources that
> require some tearing down on the source before they can be used on the
> destination, so it appears to be the most future-proof option.

Ok, I see what you mean now.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|