[PATCH v4 0/4] migration/postcopy: Sync faulted addresses after network recovered
Posted by Peter Xu 3 years, 7 months ago
v4:
- use "void */ulong" instead of "uint64_t" where appropriate in patch 3/4 [Dave]

v3:
- fix build on 32-bit hosts & rebase
- drop Dave's r-b on the last 2 patches due to the changes

v2:
- add Dave's r-bs
- add patch "migration: Properly destroy variables on incoming side" as patch 1
- destroy page_request_mutex in migration_incoming_state_destroy() too [Dave]
- use WITH_QEMU_LOCK_GUARD in two places where we can [Dave] (see the sketch
  after this list)
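
For context: WITH_QEMU_LOCK_GUARD is QEMU's scoped-lock macro from
"qemu/lockable.h".  A minimal sketch of the pattern (placeholder names, not
the actual patch hunks) follows; the cleanup attribute behind the macro drops
the mutex on every path out of the block, including the early return:

#include "qemu/osdep.h"
#include "qemu/lockable.h"

static QemuMutex page_request_mutex;        /* placeholder name */
static unsigned long page_requested_count;  /* pages requested, not yet received */

static void mark_page_requested(void *host_addr)
{
    WITH_QEMU_LOCK_GUARD(&page_request_mutex) {
        if (!host_addr) {
            return;     /* still unlocks page_request_mutex */
        }
        page_requested_count++;
    }
}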

We've seen occasional guest hangs on the destination VM after postcopy
recovery; the hang resolves itself after a few minutes, though.

The problem is that after a postcopy recovery, the prioritized postcopy page
request queue on the source VM is lost.  So every thread that faulted before
the recovery stays halted until the page happens to be copied over by the
background precopy migration stream.

The solution is to also refresh this information after postcopy recovery.  To
achieve this, we need to maintain a list of faulted addresses on the
destination node, so that we can resend the list when necessary.  This work is
done in patches 2-4.
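
For illustration only, here is a rough sketch of such destination-side
bookkeeping.  The FaultedPages structure and the helper names are made up for
this sketch; the series itself keeps the equivalent state in
MigrationIncomingState, protected by the page_request_mutex mentioned in the
changelog, and patch 1 passes the incoming state into qemu_ufd_copy_ioctl(),
presumably so the entry can be dropped once the page has actually been
copied in:

#include "qemu/osdep.h"
#include "qemu/lockable.h"

typedef struct FaultedPages {
    QemuMutex mutex;    /* cf. page_request_mutex */
    GTree *addrs;       /* key = faulted host address, value unused */
} FaultedPages;

static gint addr_cmp(gconstpointer a, gconstpointer b, gpointer opaque)
{
    uintptr_t ua = (uintptr_t)a, ub = (uintptr_t)b;

    return ua == ub ? 0 : (ua < ub ? -1 : 1);
}

static void faulted_pages_init(FaultedPages *fp)
{
    qemu_mutex_init(&fp->mutex);
    fp->addrs = g_tree_new_full(addr_cmp, NULL, NULL, NULL);
}

/* Called when a userfault is forwarded to the source as a page request. */
static void faulted_pages_record(FaultedPages *fp, void *host_addr)
{
    WITH_QEMU_LOCK_GUARD(&fp->mutex) {
        g_tree_replace(fp->addrs, host_addr, host_addr);
    }
}

/* Called once the page has actually been copied in (UFFDIO_COPY done). */
static void faulted_pages_resolve(FaultedPages *fp, void *host_addr)
{
    WITH_QEMU_LOCK_GUARD(&fp->mutex) {
        g_tree_remove(fp->addrs, host_addr);
    }
}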

With that, the last thing we need to do is to send this extra information to
the source VM after recovery.  Luckily, this synchronization can be "emulated"
by sending a bunch of page requests (even though these pages have been sent
previously!) to the source VM, just like when we get a page fault.  Duplicated
page requests have been handled well since the very first version of the
postcopy code, so this fix does not even need a new capability bit and it
works smoothly when migrating from old QEMUs to the new ones.
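
As a sketch of what that "emulated" sync can look like, on top of the
bookkeeping sketch above (send_one_page_req() here is a hypothetical stand-in
for the return-path request, i.e. roughly the job of the
migrate_send_rp_message_req_pages() helper that patch 2 introduces):

/* Hypothetical: rebuild (ramblock, offset, len) from the recorded host
 * address and send it back to the source as a page request; the details
 * are elided in this sketch. */
static void send_one_page_req(FaultedPages *fp, void *host_addr)
{
}

static gboolean resend_one(gpointer key, gpointer value, gpointer opaque)
{
    FaultedPages *fp = opaque;

    /* Re-request the page as if the fault had just happened.  Duplicated
     * requests are harmless: the source just sends the page again. */
    send_one_page_req(fp, key);

    return FALSE;   /* FALSE means: keep walking the whole tree */
}

/* Called after the migration channels are re-established following a
 * network failure. */
static void postcopy_resync_requested_pages(FaultedPages *fp)
{
    WITH_QEMU_LOCK_GUARD(&fp->mutex) {
        g_tree_foreach(fp->addrs, resend_one, fp);
    }
}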

Please review, thanks.

Peter Xu (4):
  migration: Pass incoming state into qemu_ufd_copy_ioctl()
  migration: Introduce migrate_send_rp_message_req_pages()
  migration: Maintain postcopy faulted addresses
  migration: Sync requested pages after postcopy recovery

 migration/migration.c    | 49 ++++++++++++++++++++++++++++++++--
 migration/migration.h    | 21 ++++++++++++++-
 migration/postcopy-ram.c | 25 +++++++++++++-----
 migration/savevm.c       | 57 ++++++++++++++++++++++++++++++++++++++++
 migration/trace-events   |  3 +++
 5 files changed, 146 insertions(+), 9 deletions(-)

-- 
2.26.2



Re: [PATCH v4 0/4] migration/postcopy: Sync faulted addresses after network recovered
Posted by Dr. David Alan Gilbert 3 years, 7 months ago
Queued

* Peter Xu (peterx@redhat.com) wrote:
> [...]
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


Re: [PATCH v4 0/4] migration/postcopy: Sync faulted addresses after network recovered
Posted by Dr. David Alan Gilbert 3 years, 6 months ago
* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> Queued

Hi Peter,
  I've had to unqueue this again unfortunately.
There's something going on with big-endian hosts; on a PPC BE host,
it reliably hangs in the recovery test with this set.
(Although I can't see anything relevant to the eye.)

Dave
P.S. I can point you at an installed host

> [...]
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK