[Qemu-devel] [PATCH v3 0/4] migration: unbreak postcopy recovery

Peter Xu posted 4 patches 5 years, 10 months ago
Patches applied successfully
git fetch https://github.com/patchew-project/qemu tags/patchew/20180627132246.5576-1-peterx@redhat.com
Test checkpatch passed
Test docker-mingw@fedora passed
Test docker-quick@centos7 passed
Test s390x passed
[Qemu-devel] [PATCH v3 0/4] migration: unbreak postcopy recovery
Posted by Peter Xu 5 years, 10 months ago
v3:
- keep the recovery logic even for RDMA by dropping the 3rd patch and
  touching up the original 4th patch (current 3rd patch) to suit that [Dave]

v2:
- break the first patch into several
- fix a QEMUFile leak

Please review.  Thanks,

Peter Xu (4):
  migration: delay postcopy paused state
  migration: move income process out of multifd
  migration: unbreak postcopy recovery
  migration: unify incoming processing

 migration/ram.h       |  2 +-
 migration/exec.c      |  3 ---
 migration/fd.c        |  3 ---
 migration/migration.c | 44 ++++++++++++++++++++++++++++++++++++-------
 migration/ram.c       | 11 +++++------
 migration/savevm.c    |  6 +++---
 migration/socket.c    |  5 -----
 7 files changed, 46 insertions(+), 28 deletions(-)

-- 
2.17.1


Re: [Qemu-devel] [PATCH v3 0/4] migration: unbreak postcopy recovery
Posted by Balamuruhan S 5 years, 10 months ago
On Wed, Jun 27, 2018 at 09:22:42PM +0800, Peter Xu wrote:
> v3:
> - keep the recovery logic even for RDMA by dropping the 3rd patch and
>   touching up the original 4th patch (current 3rd patch) to suit that [Dave]
> 
> v2:
> - break the first patch into several
> - fix a QEMUFile leak
> 
> Please review.  Thanks,
Hi Peter,

I have applied this patch set on upstream QEMU to test the postcopy
pause/recover feature on PowerPC.

I used an NFS-shared qcow2 image between the source and target hosts.

source:
# ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none \
-machine pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 \
-device virtio-blk-pci,drive=rootdisk -drive \
file=/home/bala/sharing/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
-monitor telnet:127.0.0.1:1234,server,nowait -net nic,model=virtio \
-net user -redir tcp:2000::22

To keep the VM under load, I ran stress-ng inside the guest:

# stress-ng --cpu 6 --vm 6 --io 6

target:
# ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none \
-machine pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 \
-device virtio-blk-pci,drive=rootdisk -drive \
file=/home/bala/sharing/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
-monitor telnet:127.0.0.1:1235,server,nowait -net nic,model=virtio \
-net user -redir tcp:2001::22 -incoming tcp:0:4445

Enabled postcopy on both the source and destination from the QEMU monitor:

(qemu) migrate_set_capability postcopy-ram on

From the source QEMU monitor:
(qemu) migrate -d tcp:10.45.70.203:4445
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
release-ram: off block: off return-path: off pause-before-switchover:
off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
late-block-activate: off 
Migration status: active
total time: 2331 milliseconds
expected downtime: 300 milliseconds
setup: 65 milliseconds
transferred ram: 38914 kbytes
throughput: 273.16 mbps
remaining ram: 67063784 kbytes
total ram: 67109120 kbytes
duplicate: 1627 pages
skipped: 0 pages
normal: 9706 pages
normal bytes: 38824 kbytes
dirty sync count: 1
page size: 4 kbytes
multifd bytes: 0 kbytes

Triggered postcopy from the source:
(qemu) migrate_start_postcopy

After triggering postcopy from the source, I tried to pause the postcopy
migration on the target:

(qemu) migrate_pause

On the target I see this error:
error while loading state section id 4(ram)
qemu-system-ppc64: Detected IO failure for postcopy. Migration paused.

On the source I see this error:
qemu-system-ppc64: Detected IO failure for postcopy. Migration paused.

Later I tried recovery from the target monitor:
(qemu) migrate_recover qemu+ssh://10.45.70.203/system
Migrate recovery is triggered already

but on the source it still remains in the postcopy-paused state:
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
release-ram: off block: off return-path: off pause-before-switchover:
off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
late-block-activate: off 
Migration status: postcopy-paused
total time: 222841 milliseconds
expected downtime: 382991 milliseconds
setup: 65 milliseconds
transferred ram: 385270 kbytes
throughput: 265.06 mbps
remaining ram: 8150528 kbytes
total ram: 67109120 kbytes
duplicate: 14679647 pages
skipped: 0 pages
normal: 63937 pages
normal bytes: 255748 kbytes
dirty sync count: 2
page size: 4 kbytes
multifd bytes: 0 kbytes
dirty pages rate: 854740 pages
postcopy request count: 374

Later I also tried to recover postcopy from the source monitor:
(qemu) migrate_recover qemu+ssh://10.45.193.21/system
Migrate recover can only be run when postcopy is paused.

It looks to be broken; please help me if I missed something
in this test.

Thank you,
Bala
> 
> Peter Xu (4):
>   migration: delay postcopy paused state
>   migration: move income process out of multifd
>   migration: unbreak postcopy recovery
>   migration: unify incoming processing
> 
>  migration/ram.h       |  2 +-
>  migration/exec.c      |  3 ---
>  migration/fd.c        |  3 ---
>  migration/migration.c | 44 ++++++++++++++++++++++++++++++++++++-------
>  migration/ram.c       | 11 +++++------
>  migration/savevm.c    |  6 +++---
>  migration/socket.c    |  5 -----
>  7 files changed, 46 insertions(+), 28 deletions(-)
> 
> -- 
> 2.17.1
> 
> 


Re: [Qemu-devel] [PATCH v3 0/4] migration: unbreak postcopy recovery
Posted by Peter Xu 5 years, 10 months ago
On Mon, Jul 02, 2018 at 01:34:45PM +0530, Balamuruhan S wrote:
> On Wed, Jun 27, 2018 at 09:22:42PM +0800, Peter Xu wrote:
> > v3:
> > - keep the recovery logic even for RDMA by dropping the 3rd patch and
> >   touching up the original 4th patch (current 3rd patch) to suit that [Dave]
> > 
> > v2:
> > - break the first patch into several
> > - fix a QEMUFile leak
> > 
> > Please review.  Thanks,
> Hi Peter,

Hi, Balamuruhan,

Glad to know that you are playing with this on ppc.  I think the
major steps are correct, though...

> 
> I have applied this patch set on upstream QEMU to test the postcopy
> pause/recover feature on PowerPC.
> 
> I used an NFS-shared qcow2 image between the source and target hosts.
> 
> source:
> # ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none \
> -machine pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 \
> -device virtio-blk-pci,drive=rootdisk -drive \
> file=/home/bala/sharing/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
> -monitor telnet:127.0.0.1:1234,server,nowait -net nic,model=virtio \
> -net user -redir tcp:2000::22
> 
> To keep the VM under load, I ran stress-ng inside the guest:
> 
> # stress-ng --cpu 6 --vm 6 --io 6
> 
> target:
> # ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none \
> -machine pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 \
> -device virtio-blk-pci,drive=rootdisk -drive \
> file=/home/bala/sharing/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
> -monitor telnet:127.0.0.1:1235,server,nowait -net nic,model=virtio \
> -net user -redir tcp:2001::22 -incoming tcp:0:4445
> 
> Enabled postcopy on both the source and destination from the QEMU monitor:
> 
> (qemu) migrate_set_capability postcopy-ram on
> 
> From the source QEMU monitor:
> (qemu) migrate -d tcp:10.45.70.203:4445

[1]

> (qemu) info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> decompress-error-check: on
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
> zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
> release-ram: off block: off return-path: off pause-before-switchover:
> off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
> late-block-activate: off 
> Migration status: active
> total time: 2331 milliseconds
> expected downtime: 300 milliseconds
> setup: 65 milliseconds
> transferred ram: 38914 kbytes
> throughput: 273.16 mbps
> remaining ram: 67063784 kbytes
> total ram: 67109120 kbytes
> duplicate: 1627 pages
> skipped: 0 pages
> normal: 9706 pages
> normal bytes: 38824 kbytes
> dirty sync count: 1
> page size: 4 kbytes
> multifd bytes: 0 kbytes
> 
> Triggered postcopy from the source:
> (qemu) migrate_start_postcopy
> 
> After triggering postcopy from the source, I tried to pause the postcopy
> migration on the target:
> 
> (qemu) migrate_pause
> 
> On the target I see this error:
> error while loading state section id 4(ram)
> qemu-system-ppc64: Detected IO failure for postcopy. Migration paused.
> 
> On the source I see this error:
> qemu-system-ppc64: Detected IO failure for postcopy. Migration paused.
> 
> Later I tried recovery from the target monitor:
> (qemu) migrate_recover qemu+ssh://10.45.70.203/system

... here, is that URI for libvirt only?

Normally I'll use something similar to [1] above.
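
For example, something like this on the destination monitor (the port
here is arbitrary, just for illustration):

  (qemu) migrate_recover tcp:0:4446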

> Migrate recovery is triggered already

And this means that you have already sent one recovery command
beforehand.  In the future we'd better allow the recovery command to be
run more than once (in case the first one was mistyped...).

> 
> but on the source it still remains in the postcopy-paused state:
> (qemu) info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> decompress-error-check: on
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
> zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
> release-ram: off block: off return-path: off pause-before-switchover:
> off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
> late-block-activate: off 
> Migration status: postcopy-paused
> total time: 222841 milliseconds
> expected downtime: 382991 milliseconds
> setup: 65 milliseconds
> transferred ram: 385270 kbytes
> throughput: 265.06 mbps
> remaining ram: 8150528 kbytes
> total ram: 67109120 kbytes
> duplicate: 14679647 pages
> skipped: 0 pages
> normal: 63937 pages
> normal bytes: 255748 kbytes
> dirty sync count: 2
> page size: 4 kbytes
> multifd bytes: 0 kbytes
> dirty pages rate: 854740 pages
> postcopy request count: 374
> 
> Later I also tried to recover postcopy from the source monitor:
> (qemu) migrate_recover qemu+ssh://10.45.193.21/system

This command should be run on the destination side only.  Here the
"migrate-recover" command on the destination will start a new listening
port there, waiting for the migration to be continued.  Then after that
command we need an extra command on the source to start the recovery:

  (HMP) migrate -r $URI

Here $URI should be the one you specified in the "migrate-recover"
command on the destination machine.
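
So with your setup, a complete recovery sequence would look something
like below (reusing the destination IP from your commands; port 4446 is
arbitrary, just for illustration):

  (dest HMP) migrate_recover tcp:0:4446
  (src HMP)  migrate -r tcp:10.45.70.203:4446

After that, "info migrate" on the source should eventually switch from
postcopy-paused back to postcopy-active.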

> Migrate recover can only be run when postcopy is paused.

I can try to fix up this error.  Basically we shouldn't allow this
command to be run on the source machine.

> 
> It looks to be broken; please help me if I missed something
> in this test.

Btw, I'm currently writing a unit test for postcopy recovery, which
could be a good reference for the new feature.  Meanwhile I think I
should write up some documentation too afterwards.

Regards,

> 
> Thank you,
> Bala
> > 
> > Peter Xu (4):
> >   migration: delay postcopy paused state
> >   migration: move income process out of multifd
> >   migration: unbreak postcopy recovery
> >   migration: unify incoming processing
> > 
> >  migration/ram.h       |  2 +-
> >  migration/exec.c      |  3 ---
> >  migration/fd.c        |  3 ---
> >  migration/migration.c | 44 ++++++++++++++++++++++++++++++++++++-------
> >  migration/ram.c       | 11 +++++------
> >  migration/savevm.c    |  6 +++---
> >  migration/socket.c    |  5 -----
> >  7 files changed, 46 insertions(+), 28 deletions(-)
> > 
> > -- 
> > 2.17.1
> > 
> > 
> 

-- 
Peter Xu

Re: [Qemu-devel] [PATCH v3 0/4] migration: unbreak postcopy recovery
Posted by Balamuruhan S 5 years, 10 months ago
On Mon, Jul 02, 2018 at 04:46:18PM +0800, Peter Xu wrote:
> On Mon, Jul 02, 2018 at 01:34:45PM +0530, Balamuruhan S wrote:
> > On Wed, Jun 27, 2018 at 09:22:42PM +0800, Peter Xu wrote:
> > > v3:
> > > - keep the recovery logic even for RDMA by dropping the 3rd patch and
> > >   touching up the original 4th patch (current 3rd patch) to suit that [Dave]
> > > 
> > > v2:
> > > - break the first patch into several
> > > - fix a QEMUFile leak
> > > 
> > > Please review.  Thanks,
> > Hi Peter,
> 
> Hi, Balamuruhan,
> 
> Glad to know that you are playing with this on ppc.  I think the
> major steps are correct, though...
> 

Thank you, Peter, for correcting my mistake; it works like a charm.
Nice feature!

Tested-by: Balamuruhan S <bala24@linux.vnet.ibm.com>

> > 
> > I have applied this patch set on upstream QEMU to test the postcopy
> > pause/recover feature on PowerPC.
> > 
> > I used an NFS-shared qcow2 image between the source and target hosts.
> > 
> > source:
> > # ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none \
> > -machine pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 \
> > -device virtio-blk-pci,drive=rootdisk -drive \
> > file=/home/bala/sharing/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
> > -monitor telnet:127.0.0.1:1234,server,nowait -net nic,model=virtio \
> > -net user -redir tcp:2000::22
> > 
> > To keep the VM under load, I ran stress-ng inside the guest:
> > 
> > # stress-ng --cpu 6 --vm 6 --io 6
> > 
> > target:
> > # ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none \
> > -machine pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 \
> > -device virtio-blk-pci,drive=rootdisk -drive \
> > file=/home/bala/sharing/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
> > -monitor telnet:127.0.0.1:1235,server,nowait -net nic,model=virtio \
> > -net user -redir tcp:2001::22 -incoming tcp:0:4445
> > 
> > Enabled postcopy on both the source and destination from the QEMU monitor:
> > 
> > (qemu) migrate_set_capability postcopy-ram on
> > 
> > From the source QEMU monitor:
> > (qemu) migrate -d tcp:10.45.70.203:4445
> 
> [1]
> 
> > (qemu) info migrate
> > globals:
> > store-global-state: on
> > only-migratable: off
> > send-configuration: on
> > send-section-footer: on
> > decompress-error-check: on
> > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
> > zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
> > release-ram: off block: off return-path: off pause-before-switchover:
> > off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
> > late-block-activate: off 
> > Migration status: active
> > total time: 2331 milliseconds
> > expected downtime: 300 milliseconds
> > setup: 65 milliseconds
> > transferred ram: 38914 kbytes
> > throughput: 273.16 mbps
> > remaining ram: 67063784 kbytes
> > total ram: 67109120 kbytes
> > duplicate: 1627 pages
> > skipped: 0 pages
> > normal: 9706 pages
> > normal bytes: 38824 kbytes
> > dirty sync count: 1
> > page size: 4 kbytes
> > multifd bytes: 0 kbytes
> > 
> > Triggered postcopy from the source:
> > (qemu) migrate_start_postcopy
> > 
> > After triggering postcopy from the source, I tried to pause the postcopy
> > migration on the target:
> > 
> > (qemu) migrate_pause
> > 
> > On the target I see this error:
> > error while loading state section id 4(ram)
> > qemu-system-ppc64: Detected IO failure for postcopy. Migration paused.
> > 
> > On the source I see this error:
> > qemu-system-ppc64: Detected IO failure for postcopy. Migration paused.
> > 
> > Later I tried recovery from the target monitor:
> > (qemu) migrate_recover qemu+ssh://10.45.70.203/system
> 
> ... here, is that URI for libvirt only?
> 
> Normally I'll use something similar to [1] above.
> 
> > Migrate recovery is triggered already
> 
> And this means that you have already sent one recovery command
> beforehand.  In the future we'd better allow the recovery command to be
> run more than once (in case the first one was mistyped...).
> 
> > 
> > but on the source it still remains in the postcopy-paused state:
> > (qemu) info migrate
> > globals:
> > store-global-state: on
> > only-migratable: off
> > send-configuration: on
> > send-section-footer: on
> > decompress-error-check: on
> > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
> > zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
> > release-ram: off block: off return-path: off pause-before-switchover:
> > off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
> > late-block-activate: off 
> > Migration status: postcopy-paused
> > total time: 222841 milliseconds
> > expected downtime: 382991 milliseconds
> > setup: 65 milliseconds
> > transferred ram: 385270 kbytes
> > throughput: 265.06 mbps
> > remaining ram: 8150528 kbytes
> > total ram: 67109120 kbytes
> > duplicate: 14679647 pages
> > skipped: 0 pages
> > normal: 63937 pages
> > normal bytes: 255748 kbytes
> > dirty sync count: 2
> > page size: 4 kbytes
> > multifd bytes: 0 kbytes
> > dirty pages rate: 854740 pages
> > postcopy request count: 374
> > 
> > Later I also tried to recover postcopy from the source monitor:
> > (qemu) migrate_recover qemu+ssh://10.45.193.21/system
> 
> This command should be run on the destination side only.  Here the
> "migrate-recover" command on the destination will start a new listening
> port there, waiting for the migration to be continued.  Then after that
> command we need an extra command on the source to start the recovery:
> 
>   (HMP) migrate -r $URI
> 
> Here $URI should be the one you specified in the "migrate-recover"
> command on the destination machine.
> 
> > Migrate recover can only be run when postcopy is paused.
> 
> I can try to fix up this error.  Basically we shouldn't allow this
> command to be run on the source machine.

Sure, :+1:

> 
> > 
> > It looks to be broken; please help me if I missed something
> > in this test.
> 
> Btw, I'm currently writing a unit test for postcopy recovery, which
> could be a good reference for the new feature.  Meanwhile I think I
> should write up some documentation too afterwards.

Fine, I am also working on writing test scenarios in tp-qemu using
Avocado-VT for the postcopy pause/recover and multifd features.

-- Bala
> 
> Regards,
> 
> > 
> > Thank you,
> > Bala
> > > 
> > > Peter Xu (4):
> > >   migration: delay postcopy paused state
> > >   migration: move income process out of multifd
> > >   migration: unbreak postcopy recovery
> > >   migration: unify incoming processing
> > > 
> > >  migration/ram.h       |  2 +-
> > >  migration/exec.c      |  3 ---
> > >  migration/fd.c        |  3 ---
> > >  migration/migration.c | 44 ++++++++++++++++++++++++++++++++++++-------
> > >  migration/ram.c       | 11 +++++------
> > >  migration/savevm.c    |  6 +++---
> > >  migration/socket.c    |  5 -----
> > >  7 files changed, 46 insertions(+), 28 deletions(-)
> > > 
> > > -- 
> > > 2.17.1
> > > 
> > > 
> > 
> 
> -- 
> Peter Xu
> 


Re: [Qemu-devel] [PATCH v3 0/4] migration: unbreak postcopy recovery
Posted by Peter Xu 5 years, 10 months ago
On Mon, Jul 02, 2018 at 03:12:41PM +0530, Balamuruhan S wrote:
> On Mon, Jul 02, 2018 at 04:46:18PM +0800, Peter Xu wrote:
> > On Mon, Jul 02, 2018 at 01:34:45PM +0530, Balamuruhan S wrote:
> > > On Wed, Jun 27, 2018 at 09:22:42PM +0800, Peter Xu wrote:
> > > > v3:
> > > > - keep the recovery logic even for RDMA by dropping the 3rd patch and
> > > >   touching up the original 4th patch (current 3rd patch) to suit that [Dave]
> > > > 
> > > > v2:
> > > > - break the first patch into several
> > > > - fix a QEMUFile leak
> > > > 
> > > > Please review.  Thanks,
> > > Hi Peter,
> > 
> > Hi, Balamuruhan,
> > 
> > Glad to know that you are playing with this on ppc.  I think the
> > major steps are correct, though...
> > 
> 
> Thank you, Peter, for correcting my mistake; it works like a charm.
> Nice feature!
> 
> Tested-by: Balamuruhan S <bala24@linux.vnet.ibm.com>

Thanks!  Good to know that it worked.

> 
> > > 
> > > I have applied this patch set on upstream QEMU to test the postcopy
> > > pause/recover feature on PowerPC.
> > > 
> > > I used an NFS-shared qcow2 image between the source and target hosts.
> > > 
> > > source:
> > > # ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none \
> > > -machine pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 \
> > > -device virtio-blk-pci,drive=rootdisk -drive \
> > > file=/home/bala/sharing/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
> > > -monitor telnet:127.0.0.1:1234,server,nowait -net nic,model=virtio \
> > > -net user -redir tcp:2000::22
> > > 
> > > To keep the VM under load, I ran stress-ng inside the guest:
> > > 
> > > # stress-ng --cpu 6 --vm 6 --io 6
> > > 
> > > target:
> > > # ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none \
> > > -machine pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 \
> > > -device virtio-blk-pci,drive=rootdisk -drive \
> > > file=/home/bala/sharing/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
> > > -monitor telnet:127.0.0.1:1235,server,nowait -net nic,model=virtio \
> > > -net user -redir tcp:2001::22 -incoming tcp:0:4445
> > > 
> > > Enabled postcopy on both the source and destination from the QEMU monitor:
> > > 
> > > (qemu) migrate_set_capability postcopy-ram on
> > > 
> > > From the source QEMU monitor:
> > > (qemu) migrate -d tcp:10.45.70.203:4445
> > 
> > [1]
> > 
> > > (qemu) info migrate
> > > globals:
> > > store-global-state: on
> > > only-migratable: off
> > > send-configuration: on
> > > send-section-footer: on
> > > decompress-error-check: on
> > > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
> > > zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
> > > release-ram: off block: off return-path: off pause-before-switchover:
> > > off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
> > > late-block-activate: off 
> > > Migration status: active
> > > total time: 2331 milliseconds
> > > expected downtime: 300 milliseconds
> > > setup: 65 milliseconds
> > > transferred ram: 38914 kbytes
> > > throughput: 273.16 mbps
> > > remaining ram: 67063784 kbytes
> > > total ram: 67109120 kbytes
> > > duplicate: 1627 pages
> > > skipped: 0 pages
> > > normal: 9706 pages
> > > normal bytes: 38824 kbytes
> > > dirty sync count: 1
> > > page size: 4 kbytes
> > > multifd bytes: 0 kbytes
> > > 
> > > Triggered postcopy from the source:
> > > (qemu) migrate_start_postcopy
> > > 
> > > After triggering postcopy from the source, I tried to pause the postcopy
> > > migration on the target:
> > > 
> > > (qemu) migrate_pause
> > > 
> > > On the target I see this error:
> > > error while loading state section id 4(ram)
> > > qemu-system-ppc64: Detected IO failure for postcopy. Migration paused.
> > > 
> > > On the source I see this error:
> > > qemu-system-ppc64: Detected IO failure for postcopy. Migration paused.
> > > 
> > > Later I tried recovery from the target monitor:
> > > (qemu) migrate_recover qemu+ssh://10.45.70.203/system
> > 
> > ... here, is that URI for libvirt only?
> > 
> > Normally I'll use something similar to [1] above.
> > 
> > > Migrate recovery is triggered already
> > 
> > And this means that you have already sent one recovery command
> > beforehand.  In the future we'd better allow the recovery command to be
> > run more than once (in case the first one was mistyped...).
> > 
> > > 
> > > but on the source it still remains in the postcopy-paused state:
> > > (qemu) info migrate
> > > globals:
> > > store-global-state: on
> > > only-migratable: off
> > > send-configuration: on
> > > send-section-footer: on
> > > decompress-error-check: on
> > > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
> > > zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
> > > release-ram: off block: off return-path: off pause-before-switchover:
> > > off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
> > > late-block-activate: off 
> > > Migration status: postcopy-paused
> > > total time: 222841 milliseconds
> > > expected downtime: 382991 milliseconds
> > > setup: 65 milliseconds
> > > transferred ram: 385270 kbytes
> > > throughput: 265.06 mbps
> > > remaining ram: 8150528 kbytes
> > > total ram: 67109120 kbytes
> > > duplicate: 14679647 pages
> > > skipped: 0 pages
> > > normal: 63937 pages
> > > normal bytes: 255748 kbytes
> > > dirty sync count: 2
> > > page size: 4 kbytes
> > > multifd bytes: 0 kbytes
> > > dirty pages rate: 854740 pages
> > > postcopy request count: 374
> > > 
> > > Later I also tried to recover postcopy from the source monitor:
> > > (qemu) migrate_recover qemu+ssh://10.45.193.21/system
> > 
> > This command should be run on the destination side only.  Here the
> > "migrate-recover" command on the destination will start a new listening
> > port there, waiting for the migration to be continued.  Then after that
> > command we need an extra command on the source to start the recovery:
> > 
> >   (HMP) migrate -r $URI
> > 
> > Here $URI should be the one you specified in the "migrate-recover"
> > command on the destination machine.
> > 
> > > Migrate recover can only be run when postcopy is paused.
> > 
> > I can try to fix up this error.  Basically we shouldn't allow this
> > command to be run on the source machine.
> 
> Sure, :+1:
> 
> > 
> > > 
> > > It looks to be broken; please help me if I missed something
> > > in this test.
> > 
> > Btw, I'm currently writing a unit test for postcopy recovery, which
> > could be a good reference for the new feature.  Meanwhile I think I
> > should write up some documentation too afterwards.
> 
> Fine, I am also working on writing test scenarios in tp-qemu using
> Avocado-VT for the postcopy pause/recover and multifd features.

Nice!  I don't know Avocado much internally, but it'll definitely be
good if we have more tests to cover this so we'll know about breakage
ASAP (and the same applies to multifd, for sure).

Regards,

-- 
Peter Xu

Re: [Qemu-devel] [PATCH v3 0/4] migration: unbreak postcopy recovery
Posted by Dr. David Alan Gilbert 5 years, 10 months ago
Queued

* Peter Xu (peterx@redhat.com) wrote:
> v3:
> - keep the recovery logic even for RDMA by dropping the 3rd patch and
>   touching up the original 4th patch (current 3rd patch) to suit that [Dave]
> 
> v2:
> - break the first patch into several
> - fix a QEMUFile leak
> 
> Please review.  Thanks,
> 
> Peter Xu (4):
>   migration: delay postcopy paused state
>   migration: move income process out of multifd
>   migration: unbreak postcopy recovery
>   migration: unify incoming processing
> 
>  migration/ram.h       |  2 +-
>  migration/exec.c      |  3 ---
>  migration/fd.c        |  3 ---
>  migration/migration.c | 44 ++++++++++++++++++++++++++++++++++++-------
>  migration/ram.c       | 11 +++++------
>  migration/savevm.c    |  6 +++---
>  migration/socket.c    |  5 -----
>  7 files changed, 46 insertions(+), 28 deletions(-)
> 
> -- 
> 2.17.1
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK