[Qemu-devel] [PATCH for-3.0 0/9] migration: postcopy recovery unit test, bug fixes

Peter Xu posted 9 patches 7 years, 4 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20180705031755.3254-1-peterx@redhat.com
Test checkpatch passed
Test docker-mingw@fedora passed
Test docker-quick@centos7 failed
Test s390x passed
There is a newer version of this series
[Qemu-devel] [PATCH for-3.0 0/9] migration: postcopy recovery unit test, bug fixes
Posted by Peter Xu 7 years, 4 months ago
Based-on: <20180627132246.5576-1-peterx@redhat.com>

Based on the series to unbreak postcopy:
  Subject: [PATCH v3 0/4] migation: unbreak postcopy recovery
  Message-Id: <20180627132246.5576-1-peterx@redhat.com>

This series introduces a new postcopy recovery test.  The new test
actually helped me identify two bugs there, so this series fixes them
as well before the 3.0 release.

Patch 1: a trivial cleanup of the existing postcopy RAM load code, which I
         found a bit confusing while debugging the problem.

Patch 2-3: two bug fixes that address different issues.  Please see
           the commit log for more information.

Patch 4-9: add the postcopy recovery unit test.

Please review.  Thanks,

Peter Xu (9):
  migration: simplify check to use qemu file buffer
  migration: loosen recovery check when load vm
  migration: fix incorrect bitmap size calculation
  tests: introduce migrate_postcopy_* helpers
  tests: allow migrate() to take extra flags
  tests: introduce migrate_query*() helpers
  tests: introduce wait_for_migration_status()
  tests: add postcopy recovery test
  tests: hide stderr for postcopy recovery test

 migration/ram.c        |  21 +++--
 migration/savevm.c     |  16 ++--
 tests/migration-test.c | 198 ++++++++++++++++++++++++++++++++---------
 3 files changed, 176 insertions(+), 59 deletions(-)

-- 
2.17.1


Re: [Qemu-devel] [PATCH for-3.0 0/9] migration: postcopy recovery unit test, bug fixes
Posted by Dr. David Alan Gilbert 7 years, 4 months ago
* Peter Xu (peterx@redhat.com) wrote:
> Based-on: <20180627132246.5576-1-peterx@redhat.com>
> 
> Based on the series to unbreak postcopy:
>   Subject: [PATCH v3 0/4] migation: unbreak postcopy recovery
>   Message-Id: <20180627132246.5576-1-peterx@redhat.com>
> 
> This series introduce a new postcopy recovery test.  The new test
> actually helped me to identify two bugs there so fix them as well
> before 3.0 release.
> 
> Patch 1: a trivial cleanup for existing postcopy ram load, which I
>          found a bit confusing during debugging the problem.
> 
> Patch 2-3: two bug fixes that address different issues.  Please see
>            the commit log for more information.
> 
> Patch 4-9: add the postcopy recovery unit test.
> 
> Please review.  Thanks,

Queued

> Peter Xu (9):
>   migration: simplify check to use qemu file buffer
>   migration: loosen recovery check when load vm
>   migration: fix incorrect bitmap size calculation
>   tests: introduce migrate_postcopy_* helpers
>   tests: allow migrate() to take extra flags
>   tests: introduce migrate_query*() helpers
>   tests: introduce wait_for_migration_status()
>   tests: add postcopy recovery test
>   tests: hide stderr for postcopy recovery test
> 
>  migration/ram.c        |  21 +++--
>  migration/savevm.c     |  16 ++--
>  tests/migration-test.c | 198 ++++++++++++++++++++++++++++++++---------
>  3 files changed, 176 insertions(+), 59 deletions(-)
> 
> -- 
> 2.17.1
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH for-3.0 0/9] migration: postcopy recovery unit test, bug fixes
Posted by Dr. David Alan Gilbert 7 years, 4 months ago
* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Based-on: <20180627132246.5576-1-peterx@redhat.com>
> > 
> > Based on the series to unbreak postcopy:
> >   Subject: [PATCH v3 0/4] migation: unbreak postcopy recovery
> >   Message-Id: <20180627132246.5576-1-peterx@redhat.com>
> > 
> > This series introduce a new postcopy recovery test.  The new test
> > actually helped me to identify two bugs there so fix them as well
> > before 3.0 release.
> > 
> > Patch 1: a trivial cleanup for existing postcopy ram load, which I
> >          found a bit confusing during debugging the problem.
> > 
> > Patch 2-3: two bug fixes that address different issues.  Please see
> >            the commit log for more information.
> > 
> > Patch 4-9: add the postcopy recovery unit test.
> > 
> > Please review.  Thanks,
> 
> Queued

Hi Peter,
  There's a problem in there somewhere; I'm getting an intermittent
failure of the test when I run 'make check -j 8' on my laptop.  Just
running two copies of tests/migration-test in parallel sometimes
triggers it (but not if I turn on QTEST_LOG!).  When it fails, it always
fails with:

  ERROR:/home/dgilbert/git/migpull/tests/migration-test.c:373:migrate_recover: assertion failed: (qdict_haskey(rsp, "return"))
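
For readers less familiar with QMP: a successful command reply carries a
top-level "return" member, while a rejected command carries a top-level
"error" member instead, so the assertion above fires when the command sent
by migrate_recover() is refused.  The following is only a rough sketch of
that distinction, not the qtest code; has_top_level_key() is a hypothetical
helper that does a naive scan rather than real JSON parsing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if the JSON object text in `json` has `key` as one of its
 * top-level member names.  Deliberately naive: it only tracks nesting
 * depth and string boundaries.  Hypothetical helper, not a QEMU function.
 */
static bool has_top_level_key(const char *json, const char *key)
{
    size_t klen = strlen(key);
    int depth = 0;

    for (const char *p = json; *p; p++) {
        if (*p == '{' || *p == '[') {
            depth++;
        } else if (*p == '}' || *p == ']') {
            depth--;
        } else if (*p == '"') {
            const char *start = p + 1;
            const char *end = start;

            /* Find the closing quote, skipping escaped characters. */
            while (*end && *end != '"') {
                if (*end == '\\' && end[1]) {
                    end++;
                }
                end++;
            }
            if (!*end) {
                return false;   /* unterminated string */
            }

            /* A member name sits at depth 1 and is followed by ':'. */
            const char *q = end + 1;
            while (*q == ' ' || *q == '\t') {
                q++;
            }
            if (depth == 1 && *q == ':' &&
                (size_t)(end - start) == klen &&
                strncmp(start, key, klen) == 0) {
                return true;
            }
            p = end;            /* continue after the closing quote */
        }
    }
    return false;
}
```

With this, has_top_level_key(reply, "return") is true for a reply such as
{"return": {}} and false for {"error": {...}}, which is the case the
assertion catches.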

Dave

> > Peter Xu (9):
> >   migration: simplify check to use qemu file buffer
> >   migration: loosen recovery check when load vm
> >   migration: fix incorrect bitmap size calculation
> >   tests: introduce migrate_postcopy_* helpers
> >   tests: allow migrate() to take extra flags
> >   tests: introduce migrate_query*() helpers
> >   tests: introduce wait_for_migration_status()
> >   tests: add postcopy recovery test
> >   tests: hide stderr for postcopy recovery test
> > 
> >  migration/ram.c        |  21 +++--
> >  migration/savevm.c     |  16 ++--
> >  tests/migration-test.c | 198 ++++++++++++++++++++++++++++++++---------
> >  3 files changed, 176 insertions(+), 59 deletions(-)
> > 
> > -- 
> > 2.17.1
> > 
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH for-3.0 0/9] migration: postcopy recovery unit test, bug fixes
Posted by Balamuruhan S 7 years, 4 months ago
On Fri, Jul 06, 2018 at 11:56:59AM +0100, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > Based-on: <20180627132246.5576-1-peterx@redhat.com>
> > > 
> > > Based on the series to unbreak postcopy:
> > >   Subject: [PATCH v3 0/4] migation: unbreak postcopy recovery
> > >   Message-Id: <20180627132246.5576-1-peterx@redhat.com>
> > > 
> > > This series introduce a new postcopy recovery test.  The new test
> > > actually helped me to identify two bugs there so fix them as well
> > > before 3.0 release.
> > > 
> > > Patch 1: a trivial cleanup for existing postcopy ram load, which I
> > >          found a bit confusing during debugging the problem.
> > > 
> > > Patch 2-3: two bug fixes that address different issues.  Please see
> > >            the commit log for more information.
> > > 
> > > Patch 4-9: add the postcopy recovery unit test.
> > > 
> > > Please review.  Thanks,
> > 
> > Queued
> 
> Hi Peter,
>   There's a problem in there somewhere;  I'm getting
> an intermittent failure of the test if I run a make check -j 8    on my
> laptop.  Just running two copies of tests/migration-test in parallel
> sometimes triggers it (but not if I turn on QTEST_LOG!).
> But it's always failing with:
> 
>   ERROR:/home/dgilbert/git/migpull/tests/migration-test.c:373:migrate_recover: assertion failed: (qdict_haskey(rsp, "return"))

Hi Peter and Dave,

I have tested postcopy migration pause/recover after applying this
patchset on upstream QEMU.

Observation 1:

We lose the target after triggering migrate_pause.

source

# ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none \
-machine pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 \
-device virtio-blk-pci,drive=rootdisk -drive \
file=/home/bala/sharing/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
-monitor telnet:127.0.0.1:1235,server,nowait -net nic,model=virtio \
-net user -redir tcp:2001::22 

qemu-system-ppc64: Detected IO failure for postcopy. Migration paused.

source Monitor
(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate_set_parameter max-postcopy-bandwidth 4096
(qemu) migrate -d tcp:127.0.0.1:4444
(qemu) migrate_start_postcopy
(qemu) migrate_pause

(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
release-ram: off block: off return-path: off pause-before-switchover:
off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
late-block-activate: off 
Migration status: postcopy-paused
total time: 371289 milliseconds
expected downtime: 656414 milliseconds
setup: 93 milliseconds
transferred ram: 690856 kbytes
throughput: 46.65 mbps
remaining ram: 3716864 kbytes
total ram: 67109120 kbytes
duplicate: 16631167 pages
skipped: 0 pages
normal: 135905 pages
normal bytes: 543620 kbytes
dirty sync count: 2
page size: 4 kbytes
multifd bytes: 0 kbytes
dirty pages rate: 626209 pages
postcopy request count: 395

The source remains in the postcopy-paused state because the target is lost.

target

# ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none \
-machine pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 \
-device virtio-blk-pci,drive=rootdisk -drive \
file=/home/bala/sharing/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
-monitor telnet:127.0.0.1:1235,server,nowait -net nic,model=virtio \
-net user -redir tcp:2001::22 -incoming tcp:127.0.0.1:4444

Error observed
check_section_footer: Read section footer failed: -5
qemu-system-ppc64: postcopy_ram_listen_thread: loadvm failed: -22
[  188.815436] Unable to handle kernel paging request for instruction fetch

Target Monitor

(qemu) migrate_set_capability postcopy-ram on


Observation 2:
Unlike the error Dave observed, in qtest the run hangs for me waiting for
the migration to complete, while the source remains in the
postcopy-paused state.

# time QTEST_QEMU_BINARY=./ppc64-softmmu/qemu-system-ppc64 ./tests/migration-test
/ppc64/migration/deprecated: OK
/ppc64/migration/bad_dest: OK
/ppc64/migration/postcopy/unix: OK
/ppc64/migration/postcopy/recovery: ^C

real    21m55.176s
user    2m28.800s
sys 4m55.980s

-- Bala

> 
> Dave
> 
> > > Peter Xu (9):
> > >   migration: simplify check to use qemu file buffer
> > >   migration: loosen recovery check when load vm
> > >   migration: fix incorrect bitmap size calculation
> > >   tests: introduce migrate_postcopy_* helpers
> > >   tests: allow migrate() to take extra flags
> > >   tests: introduce migrate_query*() helpers
> > >   tests: introduce wait_for_migration_status()
> > >   tests: add postcopy recovery test
> > >   tests: hide stderr for postcopy recovery test
> > > 
> > >  migration/ram.c        |  21 +++--
> > >  migration/savevm.c     |  16 ++--
> > >  tests/migration-test.c | 198 ++++++++++++++++++++++++++++++++---------
> > >  3 files changed, 176 insertions(+), 59 deletions(-)
> > > 
> > > -- 
> > > 2.17.1
> > > 
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 


Re: [Qemu-devel] [PATCH for-3.0 0/9] migration: postcopy recovery unit test, bug fixes
Posted by Balamuruhan S 7 years, 4 months ago
On Fri, Jul 06, 2018 at 11:56:59AM +0100, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > Based-on: <20180627132246.5576-1-peterx@redhat.com>
> > > 
> > > Based on the series to unbreak postcopy:
> > >   Subject: [PATCH v3 0/4] migation: unbreak postcopy recovery
> > >   Message-Id: <20180627132246.5576-1-peterx@redhat.com>
> > > 
> > > This series introduce a new postcopy recovery test.  The new test
> > > actually helped me to identify two bugs there so fix them as well
> > > before 3.0 release.
> > > 
> > > Patch 1: a trivial cleanup for existing postcopy ram load, which I
> > >          found a bit confusing during debugging the problem.
> > > 
> > > Patch 2-3: two bug fixes that address different issues.  Please see
> > >            the commit log for more information.
> > > 
> > > Patch 4-9: add the postcopy recovery unit test.
> > > 
> > > Please review.  Thanks,
> > 
> > Queued
> 
> Hi Peter,
>   There's a problem in there somewhere;  I'm getting
> an intermittent failure of the test if I run a make check -j 8    on my
> laptop.  Just running two copies of tests/migration-test in parallel
> sometimes triggers it (but not if I turn on QTEST_LOG!).
> But it's always failing with:
> 
>   ERROR:/home/dgilbert/git/migpull/tests/migration-test.c:373:migrate_recover: assertion failed: (qdict_haskey(rsp, "return"))
> 
> Dave

Hi Peter, Dave,

I have applied this patchset on upstream QEMU to test postcopy
pause/recovery.

I observed an error after triggering the recovery command from the source
monitor: the target is lost and the source remains in the `postcopy-paused`
state.

Please find my observations below.

Source:

# ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none -machine \
pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 -device virtio-blk-pci,drive=rootdisk \
-drive file=/home/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
-monitor telnet:127.0.0.1:1234,server,nowait -net nic,model=virtio -net user \
-redir tcp:2000::22

qemu-system-ppc64: Detected IO failure for postcopy. Migration paused.

Source Monitor:

(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate_set_parameter max-postcopy-bandwidth 4096
(qemu) migrate -d tcp:127.0.0.1:4444
(qemu) migrate_start_postcopy
(qemu) migrate_pause
(qemu) migrate -r tcp:127.0.0.1:4446

After triggering recovery, the target is lost with the error shown below
and the source remains in the `postcopy-paused` state:

(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
release-ram: off block: off return-path: off pause-before-switchover:
off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
late-block-activate: off 
Migration status: postcopy-recover
total time: 78818 milliseconds
expected downtime: 300 milliseconds
setup: 169 milliseconds
transferred ram: 177749 kbytes
throughput: 63.72 mbps
remaining ram: 28061376 kbytes
total ram: 67109120 kbytes
duplicate: 9742102 pages
skipped: 0 pages
normal: 22986 pages
normal bytes: 91944 kbytes
dirty sync count: 2
page size: 4 kbytes
multifd bytes: 0 kbytes
dirty pages rate: 1273187 pages
postcopy request count: 236


Target:

# ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none -machine \
pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 -device virtio-blk-pci,drive=rootdisk \
-drive file=/home/bala/sharing/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
-monitor telnet:127.0.0.1:1235,server,nowait -net nic,model=virtio -net user \
-redir tcp:2001::22 -incoming tcp:127.0.0.1:4444


qemu-system-ppc64: check_section_footer: Read section footer failed: -5
qemu-system-ppc64: Detected IO failure for postcopy. Migration paused.
qemu-system-ppc64: Not a migration stream
qemu-system-ppc64: load of migration failed: Invalid argument


Target Monitor:

(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate_recover tcp:127.0.0.1:4446
(qemu) Connection closed by foreign host.

QTest:

Also, with respect to qtest: I have run it, and the recovery test does not
complete.  It waits for the source to reach "completed", but because of
this issue the source remains in `postcopy-paused`, so the call

`migrate_postcopy_complete(from, to);`

never returns.

As it did not complete, I cancelled it forcefully:

# time QTEST_QEMU_BINARY=./ppc64-softmmu/qemu-system-ppc64 ./tests/migration-test
/ppc64/migration/deprecated: OK
/ppc64/migration/bad_dest: OK
/ppc64/migration/postcopy/unix: OK
/ppc64/migration/postcopy/recovery: ^C

real    21m55.176s
user    2m28.800s
sys 4m55.980s

-- Bala
> 
> > > Peter Xu (9):
> > >   migration: simplify check to use qemu file buffer
> > >   migration: loosen recovery check when load vm
> > >   migration: fix incorrect bitmap size calculation
> > >   tests: introduce migrate_postcopy_* helpers
> > >   tests: allow migrate() to take extra flags
> > >   tests: introduce migrate_query*() helpers
> > >   tests: introduce wait_for_migration_status()
> > >   tests: add postcopy recovery test
> > >   tests: hide stderr for postcopy recovery test
> > > 
> > >  migration/ram.c        |  21 +++--
> > >  migration/savevm.c     |  16 ++--
> > >  tests/migration-test.c | 198 ++++++++++++++++++++++++++++++++---------
> > >  3 files changed, 176 insertions(+), 59 deletions(-)
> > > 
> > > -- 
> > > 2.17.1
> > > 
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 


Re: [Qemu-devel] [PATCH for-3.0 0/9] migration: postcopy recovery unit test, bug fixes
Posted by Dr. David Alan Gilbert 7 years, 3 months ago
* Balamuruhan S (bala24@linux.vnet.ibm.com) wrote:
> On Fri, Jul 06, 2018 at 11:56:59AM +0100, Dr. David Alan Gilbert wrote:
> > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > > * Peter Xu (peterx@redhat.com) wrote:
> > > > Based-on: <20180627132246.5576-1-peterx@redhat.com>
> > > > 
> > > > Based on the series to unbreak postcopy:
> > > >   Subject: [PATCH v3 0/4] migation: unbreak postcopy recovery
> > > >   Message-Id: <20180627132246.5576-1-peterx@redhat.com>
> > > > 
> > > > This series introduce a new postcopy recovery test.  The new test
> > > > actually helped me to identify two bugs there so fix them as well
> > > > before 3.0 release.
> > > > 
> > > > Patch 1: a trivial cleanup for existing postcopy ram load, which I
> > > >          found a bit confusing during debugging the problem.
> > > > 
> > > > Patch 2-3: two bug fixes that address different issues.  Please see
> > > >            the commit log for more information.
> > > > 
> > > > Patch 4-9: add the postcopy recovery unit test.
> > > > 
> > > > Please review.  Thanks,
> > > 
> > > Queued
> > 
> > Hi Peter,
> >   There's a problem in there somewhere;  I'm getting
> > an intermittent failure of the test if I run a make check -j 8    on my
> > laptop.  Just running two copies of tests/migration-test in parallel
> > sometimes triggers it (but not if I turn on QTEST_LOG!).
> > But it's always failing with:
> > 
> >   ERROR:/home/dgilbert/git/migpull/tests/migration-test.c:373:migrate_recover: assertion failed: (qdict_haskey(rsp, "return"))
> > 
> > Dave
> 
> Hi Peter, Dave,

Hi Bala,

> I have applied this patchset in upstream Qemu to test postcopy
> pause/recovery.

Are you still seeing this with the set that got merged into 3.0-rc0?
The second of your errors looks similar to problems with the race
we had before Peter fixed it; but the set that I merged passed a 'make
check' on a Power box.

Dave

> I observed error after triggering recovery command from source monitor
> where the target is lost and the source remains to be in `postcopy-pause`
> state.
> 
> Please find my observation below,
> 
> Source:
> 
> # ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none -machine \
> pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 -device virtio-blk-pci,drive=rootdisk \
> -drive file=/home/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
> -monitor telnet:127.0.0.1:1234,server,nowait -net nic,model=virtio -net user \
> -redir tcp:2000::22
> 
> qemu-system-ppc64: Detected IO failure for postcopy. Migration paused.
> 
> Source Monitor:
> 
> (qemu) migrate_set_capability postcopy-ram on
> (qemu) migrate_set_parameter max-postcopy-bandwidth 4096
> (qemu) migrate -d tcp:127.0.0.1:4444
> (qemu) migrate_start_postcopy
> (qemu) migrate_pause
> (qemu) migrate -r tcp:127.0.0.1:4446
> 
> After triggering recovery, target is lost with the error mentioned below
> and source remains to be in `postcopy-paused` state
> 
> (qemu) info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> decompress-error-check: on
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
> zero-blocks: off \
> compress: off events: off postcopy-ram: on x-colo: off release-ram: off
> block: off return-path: off pause-before-switchover: off x-multifd: off \
> dirty-bitmaps: off
> postcopy-blocktime: off late-block-activate: off 
> Migration status: postcopy-recover
> total time: 78818 milliseconds
> expected downtime: 300 milliseconds
> setup: 169 milliseconds
> transferred ram: 177749 kbytes
> throughput: 63.72 mbps
> remaining ram: 28061376 kbytes
> total ram: 67109120 kbytes
> duplicate: 9742102 pages
> skipped: 0 pages
> normal: 22986 pages
> normal bytes: 91944 kbytes
> dirty sync count: 2
> page size: 4 kbytes
> multifd bytes: 0 kbytes
> dirty pages rate: 1273187 pages
> postcopy request count: 236
> 
> 
> Target:
> 
> # ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none -machine \
> pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 -device virtio-blk-pci,drive=rootdisk \
> -drive file=/home/bala/sharing/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
> -monitor telnet:127.0.0.1:1235,server,nowait -net nic,model=virtio -net user \
> -redir tcp:2001::22 -incoming tcp:127.0.0.1:4444
> 
> 
> qemu-system-ppc64: check_section_footer: Read section footer failed: -5
> qemu-system-ppc64: Detected IO failure for postcopy. Migration paused.
> qemu-system-ppc64: Not a migration stream
> qemu-system-ppc64: load of migration failed: Invalid argument
> 
> 
> Target Monitor:
> 
> (qemu) migrate_set_capability postcopy-ram on
> (qemu) migrate_recover tcp:127.0.0.1:4446
> (qemu) Connection closed by foreign host.
> 
> QTest:
> 
> Also with respect to Qtest, I have tested it and the recovery test
> doesn't complete as it waits on the source for "completed" but due to this
> issue source remains to be in `postcopy-paused`
> 
> `migrate_postcopy_complete(from, to);`
> 
> but it actually doesn't end.
> 
> As it did not complete, I cancelled it forcefully
> 
> # time QTEST_QEMU_BINARY=./ppc64-softmmu/qemu-system-ppc64 ./tests/migration-test
> /ppc64/migration/deprecated: OK
> /ppc64/migration/bad_dest: OK
> /ppc64/migration/postcopy/unix: OK
> /ppc64/migration/postcopy/recovery: ^C
> 
> real    21m55.176s
> user    2m28.800s
> sys 4m55.980s
> 
> -- Bala
> > 
> > > > Peter Xu (9):
> > > >   migration: simplify check to use qemu file buffer
> > > >   migration: loosen recovery check when load vm
> > > >   migration: fix incorrect bitmap size calculation
> > > >   tests: introduce migrate_postcopy_* helpers
> > > >   tests: allow migrate() to take extra flags
> > > >   tests: introduce migrate_query*() helpers
> > > >   tests: introduce wait_for_migration_status()
> > > >   tests: add postcopy recovery test
> > > >   tests: hide stderr for postcopy recovery test
> > > > 
> > > >  migration/ram.c        |  21 +++--
> > > >  migration/savevm.c     |  16 ++--
> > > >  tests/migration-test.c | 198 ++++++++++++++++++++++++++++++++---------
> > > >  3 files changed, 176 insertions(+), 59 deletions(-)
> > > > 
> > > > -- 
> > > > 2.17.1
> > > > 
> > > > 
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH for-3.0 0/9] migration: postcopy recovery unit test, bug fixes
Posted by Peter Xu 7 years, 3 months ago
On Fri, Jul 06, 2018 at 11:56:59AM +0100, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > Based-on: <20180627132246.5576-1-peterx@redhat.com>
> > > 
> > > Based on the series to unbreak postcopy:
> > >   Subject: [PATCH v3 0/4] migation: unbreak postcopy recovery
> > >   Message-Id: <20180627132246.5576-1-peterx@redhat.com>
> > > 
> > > This series introduce a new postcopy recovery test.  The new test
> > > actually helped me to identify two bugs there so fix them as well
> > > before 3.0 release.
> > > 
> > > Patch 1: a trivial cleanup for existing postcopy ram load, which I
> > >          found a bit confusing during debugging the problem.
> > > 
> > > Patch 2-3: two bug fixes that address different issues.  Please see
> > >            the commit log for more information.
> > > 
> > > Patch 4-9: add the postcopy recovery unit test.
> > > 
> > > Please review.  Thanks,
> > 
> > Queued
> 
> Hi Peter,
>   There's a problem in there somewhere;  I'm getting
> an intermittent failure of the test if I run a make check -j 8    on my
> laptop.  Just running two copies of tests/migration-test in parallel
> sometimes triggers it (but not if I turn on QTEST_LOG!).
> But it's always failing with:
> 
>   ERROR:/home/dgilbert/git/migpull/tests/migration-test.c:373:migrate_recover: assertion failed: (qdict_haskey(rsp, "return"))

Hmm, so this should be a race.  I suspect it's because the destination VM
hasn't reached the correct state when the recovery command is sent.

Could you try these two tiny patches to see whether they fix the
problem?

================

commit d875ea1a98932174e3fa202859b65df26def174d
Author: Peter Xu <peterx@redhat.com>
Date:   Tue Jul 10 11:17:24 2018 +0800

    migration: show pause/recover state on dst host

    These two states will be missing when doing "query-migrate" on
    destination VM.  Add these states so that we can get the query results
    as expected.

    Signed-off-by: Peter Xu <peterx@redhat.com>

diff --git a/migration/migration.c b/migration/migration.c
index 0404c53215..8d56d56930 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -911,6 +911,8 @@ static void fill_destination_migration_info(MigrationInfo *info)
     case MIGRATION_STATUS_CANCELLED:
     case MIGRATION_STATUS_ACTIVE:
     case MIGRATION_STATUS_POSTCOPY_ACTIVE:
+    case MIGRATION_STATUS_POSTCOPY_PAUSED:
+    case MIGRATION_STATUS_POSTCOPY_RECOVER:
     case MIGRATION_STATUS_FAILED:
     case MIGRATION_STATUS_COLO:
         info->has_status = true;

================
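
To illustrate why the hunk above matters, here is a small stand-alone model
of the switch, not the QEMU function itself.  The enum, struct and
fill_dst_info() names are made up for this sketch, and the `patched` flag
exists only to compare behaviour before and after the change.  The
assumption is that when has_status stays false, the "status" field is
simply absent from the query reply.

```c
#include <stdbool.h>

/* Illustrative subset of migration states; names mirror the hunk above. */
enum mig_status {
    MIG_STATUS_NONE,
    MIG_STATUS_ACTIVE,
    MIG_STATUS_POSTCOPY_ACTIVE,
    MIG_STATUS_POSTCOPY_PAUSED,
    MIG_STATUS_POSTCOPY_RECOVER,
    MIG_STATUS_FAILED,
};

struct mig_info {
    bool has_status;            /* false: no status is reported */
    enum mig_status status;
};

/* Model of the switch; `patched` selects whether the two new cases exist. */
static void fill_dst_info(struct mig_info *info, enum mig_status cur,
                          bool patched)
{
    info->has_status = false;

    switch (cur) {
    case MIG_STATUS_POSTCOPY_PAUSED:
    case MIG_STATUS_POSTCOPY_RECOVER:
        if (!patched) {
            break;              /* pre-patch: state not listed, not reported */
        }
        /* fall through */
    case MIG_STATUS_ACTIVE:
    case MIG_STATUS_POSTCOPY_ACTIVE:
    case MIG_STATUS_FAILED:
        info->has_status = true;
        break;
    case MIG_STATUS_NONE:
        break;
    }
    info->status = cur;
}
```

In this model, a caller polling the destination for "postcopy-paused" can
only ever see that state in the patched variant.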

commit 9fa7fc773961cd0ea0b5f70a166def0d8aebf464
Author: Peter Xu <peterx@redhat.com>
Date:   Tue Jul 10 11:18:48 2018 +0800

    tests: don't send recovery cmd until dst pauses

    Signed-off-by: Peter Xu <peterx@redhat.com>

diff --git a/tests/migration-test.c b/tests/migration-test.c
index 96e69dab99..45558446f1 100644
--- a/tests/migration-test.c
+++ b/tests/migration-test.c
@@ -646,6 +646,13 @@ static void test_postcopy_recovery(void)
      */
     migrate_pause(from);

+    /*
+     * Wait for destination side to reach postcopy-paused state.  The
+     * migrate-recover command can only succeed if destination machine
+     * is in the paused state
+     */
+    wait_for_migration_status(to, "postcopy-paused");
+
     /*
      * Create a new socket to emulate a new channel that is different
      * from the broken migration channel; tell the destination to

================
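
The idea behind the second patch (poll the destination until it reports
the expected state, and only then send the recovery command) can be
sketched roughly as below.  This is not the helper from
tests/migration-test.c: wait_for_status(), status_fn and fake_dst_status()
are hypothetical names, the callback stands in for a query-migrate round
trip, and the retry bound is an addition of this sketch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() */
#endif
#include <stdbool.h>
#include <string.h>
#include <time.h>

/* Hypothetical callback standing in for one status query. */
typedef const char *(*status_fn)(void *opaque);

/* Sleep for the given number of microseconds. */
static void sleep_us(unsigned us)
{
    struct timespec ts = {
        .tv_sec = us / 1000000,
        .tv_nsec = (long)(us % 1000000) * 1000,
    };
    nanosleep(&ts, NULL);
}

/*
 * Poll until query() reports `goal`, sleeping between attempts.
 * Returns true on success, false once max_polls attempts are used up.
 */
static bool wait_for_status(status_fn query, void *opaque, const char *goal,
                            unsigned max_polls, unsigned interval_us)
{
    for (unsigned i = 0; i < max_polls; i++) {
        const char *cur = query(opaque);

        if (cur && strcmp(cur, goal) == 0) {
            return true;
        }
        sleep_us(interval_us);
    }
    return false;
}

/* Toy status source: "postcopy-active" until *countdown reaches zero. */
static const char *fake_dst_status(void *opaque)
{
    int *countdown = opaque;

    if (*countdown > 0) {
        (*countdown)--;
        return "postcopy-active";
    }
    return "postcopy-paused";
}
```

A caller would run wait_for_status(fake_dst_status, &n, "postcopy-paused",
...) and issue the recovery step only when it returns true.  Bounding the
number of polls means a destination that never pauses makes the test fail
instead of hanging.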

Thanks!

-- 
Peter Xu

Re: [Qemu-devel] [PATCH for-3.0 0/9] migration: postcopy recovery unit test, bug fixes
Posted by Dr. David Alan Gilbert 7 years, 3 months ago
* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Jul 06, 2018 at 11:56:59AM +0100, Dr. David Alan Gilbert wrote:
> > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > > * Peter Xu (peterx@redhat.com) wrote:
> > > > Based-on: <20180627132246.5576-1-peterx@redhat.com>
> > > > 
> > > > Based on the series to unbreak postcopy:
> > > >   Subject: [PATCH v3 0/4] migation: unbreak postcopy recovery
> > > >   Message-Id: <20180627132246.5576-1-peterx@redhat.com>
> > > > 
> > > > This series introduce a new postcopy recovery test.  The new test
> > > > actually helped me to identify two bugs there so fix them as well
> > > > before 3.0 release.
> > > > 
> > > > Patch 1: a trivial cleanup for existing postcopy ram load, which I
> > > >          found a bit confusing during debugging the problem.
> > > > 
> > > > Patch 2-3: two bug fixes that address different issues.  Please see
> > > >            the commit log for more information.
> > > > 
> > > > Patch 4-9: add the postcopy recovery unit test.
> > > > 
> > > > Please review.  Thanks,
> > > 
> > > Queued
> > 
> > Hi Peter,
> >   There's a problem in there somewhere;  I'm getting
> > an intermittent failure of the test if I run a make check -j 8    on my
> > laptop.  Just running two copies of tests/migration-test in parallel
> > sometimes triggers it (but not if I turn on QTEST_LOG!).
> > But it's always failing with:
> > 
> >   ERROR:/home/dgilbert/git/migpull/tests/migration-test.c:373:migrate_recover: assertion failed: (qdict_haskey(rsp, "return"))
> 
> Hmm, so this should be a race.  I suspect it's because destination VM
> hasn't reached the correct state when sending the recovery command.
> 
> Could you help to try these two tiny patches to see whether it can fix
> the problem?

Yes, this seems to work; even running 6 in parallel.

Dave

> ================
> 
> commit d875ea1a98932174e3fa202859b65df26def174d
> Author: Peter Xu <peterx@redhat.com>
> Date:   Tue Jul 10 11:17:24 2018 +0800
> 
>     migration: show pause/recover state on dst host
> 
>     These two states will be missing when doing "query-migrate" on
>     destination VM.  Add these states so that we can get the query results
>     as expected.
> 
>     Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 0404c53215..8d56d56930 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -911,6 +911,8 @@ static void fill_destination_migration_info(MigrationInfo *info)
>      case MIGRATION_STATUS_CANCELLED:
>      case MIGRATION_STATUS_ACTIVE:
>      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
> +    case MIGRATION_STATUS_POSTCOPY_PAUSED:
> +    case MIGRATION_STATUS_POSTCOPY_RECOVER:
>      case MIGRATION_STATUS_FAILED:
>      case MIGRATION_STATUS_COLO:
>          info->has_status = true;
> 
> ================
> 
> commit 9fa7fc773961cd0ea0b5f70a166def0d8aebf464
> Author: Peter Xu <peterx@redhat.com>
> Date:   Tue Jul 10 11:18:48 2018 +0800
> 
>     tests: don't send recovery cmd until dst pauses
> 
>     Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> diff --git a/tests/migration-test.c b/tests/migration-test.c
> index 96e69dab99..45558446f1 100644
> --- a/tests/migration-test.c
> +++ b/tests/migration-test.c
> @@ -646,6 +646,13 @@ static void test_postcopy_recovery(void)
>       */
>      migrate_pause(from);
> 
> +    /*
> +     * Wait for destination side to reach postcopy-paused state.  The
> +     * migrate-recover command can only succeed if destination machine
> +     * is in the paused state
> +     */
> +    wait_for_migration_status(to, "postcopy-paused");
> +
>      /*
>       * Create a new socket to emulate a new channel that is different
>       * from the broken migration channel; tell the destination to
> 
> ================
> 
> Thanks!
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH for-3.0 0/9] migration: postcopy recovery unit test, bug fixes
Posted by Balamuruhan S 7 years, 3 months ago
On Thu, Jul 05, 2018 at 11:17:46AM +0800, Peter Xu wrote:
> Based-on: <20180627132246.5576-1-peterx@redhat.com>
> 
> Based on the series to unbreak postcopy:
>   Subject: [PATCH v3 0/4] migation: unbreak postcopy recovery
>   Message-Id: <20180627132246.5576-1-peterx@redhat.com>
> 
> This series introduce a new postcopy recovery test.  The new test
> actually helped me to identify two bugs there so fix them as well
> before 3.0 release.
> 
> Patch 1: a trivial cleanup for existing postcopy ram load, which I
>          found a bit confusing during debugging the problem.
> 
> Patch 2-3: two bug fixes that address different issues.  Please see
>            the commit log for more information.
> 
> Patch 4-9: add the postcopy recovery unit test.
> 
> Please review.  Thanks,

Hi Peter, Dave,

I am sorry, I missed including Peter's postcopy-recover fix patchset:

migration: delay postcopy paused state
migration: move income process out of multifd
migration: unbreak postcopy recovery
migration: unify incoming processing

Postcopy migration with pause and recover is working fine.

# QTEST_QEMU_BINARY=./ppc64-softmmu/qemu-system-ppc64
# ./tests/migration-test
/ppc64/migration/deprecated: OK
/ppc64/migration/bad_dest: OK
/ppc64/migration/postcopy/unix: OK
/ppc64/migration/postcopy/recovery: OK
/ppc64/migration/precopy/unix: OK

However, the qtest patches in this patchset have to be rebased, since commit
5fd4a9c97397bc0819a919de7a62ec972ec85260 (tests/migration: Skip tests
for ppc tcg) has gone in.

# git am ../postcopy_pause/4.patch 
Applying: tests: introduce migrate_postcopy_* helpers
error: patch failed: tests/migration-test.c:351
error: tests/migration-test.c: patch does not apply
Patch failed at 0001 tests: introduce migrate_postcopy_* helpers
The copy of the patch that failed is found in:
   /home/bala/qemu/.git/rebase-apply/patch
When you have resolved this problem, run "git am --resolved".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


I manually reverted that commit in order to apply and test your patchset.

This patchset works without any issues.

Tested-by: Balamuruhan S <bala24@linux.vnet.ibm.com>
> 
> Peter Xu (9):
>   migration: simplify check to use qemu file buffer
>   migration: loosen recovery check when load vm
>   migration: fix incorrect bitmap size calculation
>   tests: introduce migrate_postcopy_* helpers
>   tests: allow migrate() to take extra flags
>   tests: introduce migrate_query*() helpers
>   tests: introduce wait_for_migration_status()
>   tests: add postcopy recovery test
>   tests: hide stderr for postcopy recovery test
> 
>  migration/ram.c        |  21 +++--
>  migration/savevm.c     |  16 ++--
>  tests/migration-test.c | 198 ++++++++++++++++++++++++++++++++---------
>  3 files changed, 176 insertions(+), 59 deletions(-)
> 
> -- 
> 2.17.1
> 
> 


Re: [Qemu-devel] [PATCH for-3.0 0/9] migration: postcopy recovery unit test, bug fixes
Posted by Peter Xu 7 years, 3 months ago
On Tue, Jul 10, 2018 at 07:26:40AM +0530, Balamuruhan S wrote:
> On Thu, Jul 05, 2018 at 11:17:46AM +0800, Peter Xu wrote:
> > Based-on: <20180627132246.5576-1-peterx@redhat.com>
> > 
> > Based on the series to unbreak postcopy:
> >   Subject: [PATCH v3 0/4] migation: unbreak postcopy recovery
> >   Message-Id: <20180627132246.5576-1-peterx@redhat.com>
> > 
> > This series introduce a new postcopy recovery test.  The new test
> > actually helped me to identify two bugs there so fix them as well
> > before 3.0 release.
> > 
> > Patch 1: a trivial cleanup for existing postcopy ram load, which I
> >          found a bit confusing during debugging the problem.
> > 
> > Patch 2-3: two bug fixes that address different issues.  Please see
> >            the commit log for more information.
> > 
> > Patch 4-9: add the postcopy recovery unit test.
> > 
> > Please review.  Thanks,
> 
> Hi Peter, Dave,
> 
> I am sorry, I missed including Peter's postcopy-recover fix patchset:
> 
> migration: delay postcopy paused state
> migration: move income process out of multifd
> migration: unbreak postcopy recovery
> migration: unify incoming processing
> 
> Postcopy migration with pause and recover is working fine.
> 
> # QTEST_QEMU_BINARY=./ppc64-softmmu/qemu-system-ppc64
> # ./tests/migration-test
> /ppc64/migration/deprecated: OK
> /ppc64/migration/bad_dest: OK
> /ppc64/migration/postcopy/unix: OK
> /ppc64/migration/postcopy/recovery: OK
> /ppc64/migration/precopy/unix: OK
> 
> However, the qtest patches in this patchset have to be rebased, since commit
> 5fd4a9c97397bc0819a919de7a62ec972ec85260 (tests/migration: Skip tests
> for ppc tcg) has gone in.
> 
> # git am ../postcopy_pause/4.patch 
> Applying: tests: introduce migrate_postcopy_* helpers
> error: patch failed: tests/migration-test.c:351
> error: tests/migration-test.c: patch does not apply
> Patch failed at 0001 tests: introduce migrate_postcopy_* helpers
> The copy of the patch that failed is found in:
>    /home/bala/qemu/.git/rebase-apply/patch
> When you have resolved this problem, run "git am --resolved".
> If you prefer to skip this patch, run "git am --skip" instead.
> To restore the original branch and stop patching, run "git am --abort".
> 
> 
> I manually reverted that commit in order to apply and test your patchset.
> 
> This patchset works without any issues.
> 
> Tested-by: Balamuruhan S <bala24@linux.vnet.ibm.com>

Thanks for the quick follow-up!  I'll have a look today at the problem
that Dave reported.

Regards,

-- 
Peter Xu