[PATCH v2 0/7] migration: Better error handling in return path thread

Peter Xu posted 7 patches 10 months, 2 weeks ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20230705163502.331007-1-peterx@redhat.com
Maintainers: Juan Quintela <quintela@redhat.com>, Peter Xu <peterx@redhat.com>, Leonardo Bras <leobras@redhat.com>, Eric Blake <eblake@redhat.com>, Markus Armbruster <armbru@redhat.com>
qapi/migration.json      |   5 +-
migration/migration.h    |   8 +-
migration/ram.h          |   5 +-
migration/channel.c      |   1 -
migration/migration.c    | 168 +++++++++++++++++++++++----------------
migration/multifd.c      |  10 +--
migration/postcopy-ram.c |   1 -
migration/qemu-file.c    |  20 ++++-
migration/ram.c          |  42 +++++-----
migration/trace-events   |   2 +-
10 files changed, 149 insertions(+), 113 deletions(-)
[PATCH v2 0/7] migration: Better error handling in return path thread
Posted by Peter Xu 10 months, 2 weeks ago
v2:
- Patch "migration: Provide explicit error message for file shutdowns"
  - Touched up qapi doc [Fabiano]
  - Added Bugzilla link to commit which I didn't even notice that I was
    fixing a bug.. but rightfully pointed out by Laszlo.
  - Moved it to the 1st patch because it fixes a bug, please consider
    review and merge it earlier.

This is a small series that reworks error handling of postcopy return path
threads.

We used to contain a bunch of error_report(), converting them into
error_setg() properly and deliver any of those errors to migration generic
error reports (via migrate_set_error()).  Then these errors can also be
observed in query-migrate after postcopy is paused.

Dropped the return-path specific error reporting: mark_source_rp_bad(),
because it's a duplication if we can always use migrate_set_error().

Please have a look, thanks.

Peter Xu (7):
  migration: Display error in query-migrate irrelevant of status
  migration: Let migrate_set_error() take ownership
  migration: Introduce migrate_has_error()
  migration: Refactor error handling in source return path
  migration: Deliver return path file error to migrate state too
  qemufile: Always return a verbose error
  migration: Provide explicit error message for file shutdowns

 qapi/migration.json      |   5 +-
 migration/migration.h    |   8 +-
 migration/ram.h          |   5 +-
 migration/channel.c      |   1 -
 migration/migration.c    | 168 +++++++++++++++++++++++----------------
 migration/multifd.c      |  10 +--
 migration/postcopy-ram.c |   1 -
 migration/qemu-file.c    |  20 ++++-
 migration/ram.c          |  42 +++++-----
 migration/trace-events   |   2 +-
 10 files changed, 149 insertions(+), 113 deletions(-)

-- 
2.41.0
Re: [PATCH v2 0/7] migration: Better error handling in return path thread
Posted by Fabiano Rosas 9 months, 4 weeks ago
Peter Xu <peterx@redhat.com> writes:

> v2:
> - Patch "migration: Provide explicit error message for file shutdowns"
>   - Touched up qapi doc [Fabiano]
>   - Added Bugzilla link to commit which I didn't even notice that I was
>     fixing a bug.. but rightfully pointed out by Laszlo.
>   - Moved it to the 1st patch because it fixes a bug, please consider
>     review and merge it earlier.
>
> This is a small series that reworks error handling of postcopy return path
> threads.
>
> We used to contain a bunch of error_report(), converting them into
> error_setg() properly and deliver any of those errors to migration generic
> error reports (via migrate_set_error()).  Then these errors can also be
> observed in query-migrate after postcopy is paused.
>
> Dropped the return-path specific error reporting: mark_source_rp_bad(),
> because it's a duplication if we can always use migrate_set_error().
>
> Please have a look, thanks.
>
> Peter Xu (7):
>   migration: Display error in query-migrate irrelevant of status
>   migration: Let migrate_set_error() take ownership
>   migration: Introduce migrate_has_error()
>   migration: Refactor error handling in source return path
>   migration: Deliver return path file error to migrate state too
>   qemufile: Always return a verbose error
>   migration: Provide explicit error message for file shutdowns
>
>  qapi/migration.json      |   5 +-
>  migration/migration.h    |   8 +-
>  migration/ram.h          |   5 +-
>  migration/channel.c      |   1 -
>  migration/migration.c    | 168 +++++++++++++++++++++++----------------
>  migration/multifd.c      |  10 +--
>  migration/postcopy-ram.c |   1 -
>  migration/qemu-file.c    |  20 ++++-
>  migration/ram.c          |  42 +++++-----
>  migration/trace-events   |   2 +-
>  10 files changed, 149 insertions(+), 113 deletions(-)

Hi Peter,

Were you aiming at solving any specific bug with this series? I'm seeing
a bug on master (361d5397355) with the
/x86_64/migration/postcopy/preempt/recovery/plain test around the areas
that this series touches.

It happens very rarely and I'm still investigating, but in case you have
any thoughts:

====
It seems there's a race condition between postcopy resume and the return
path cleanup.

It is possible for open_return_path_on_source() to setup the new
QEMUFile *before* the cleanup path at source_return_path_thread() has
had a chance to run, so we end up calling migration_release_dst_files()
on the new file and ms->rp_state.from_dst_file gets set to NULL again,
leading to a SIGSEGV at qemu_file_get_error(rp) due to rp being NULL.

Here's a trace from when it works (both src and dst):

open_return_path_on_source
open_return_path_on_source_continue
postcopy_pause_incoming
postcopy_pause_fast_load
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
postcopy_pause_fault_thread
postcopy_pause_return_path   <---
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
open_return_path_on_source   <--- OK
postcopy_pause_continued     <---
postcopy_pause_return_path_continued
postcopy_pause_incoming_continued
postcopy_pause_fault_thread_continued
postcopy_pause_fast_load_continued

versus when it fails:

open_return_path_on_source
open_return_path_on_source_continue
postcopy_pause_incoming
postcopy_pause_fast_load
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
postcopy_pause_fault_thread
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
postcopy_pause_incoming_continued
open_return_path_on_source   <--- NOK
postcopy_pause_continued
postcopy_pause_return_path   <---
postcopy_pause_return_path_continued <---
postcopy_pause_incoming
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
postcopy_pause_incoming_continued
Re: [PATCH v2 0/7] migration: Better error handling in return path thread
Posted by Fabiano Rosas 9 months, 3 weeks ago
Fabiano Rosas <farosas@suse.de> writes:

> Peter Xu <peterx@redhat.com> writes:
>
>> v2:
>> - Patch "migration: Provide explicit error message for file shutdowns"
>>   - Touched up qapi doc [Fabiano]
>>   - Added Bugzilla link to commit which I didn't even notice that I was
>>     fixing a bug.. but rightfully pointed out by Laszlo.
>>   - Moved it to the 1st patch because it fixes a bug, please consider
>>     review and merge it earlier.
>>
>> This is a small series that reworks error handling of postcopy return path
>> threads.
>>
>> We used to contain a bunch of error_report(), converting them into
>> error_setg() properly and deliver any of those errors to migration generic
>> error reports (via migrate_set_error()).  Then these errors can also be
>> observed in query-migrate after postcopy is paused.
>>
>> Dropped the return-path specific error reporting: mark_source_rp_bad(),
>> because it's a duplication if we can always use migrate_set_error().
>>
>> Please have a look, thanks.
>>
>> Peter Xu (7):
>>   migration: Display error in query-migrate irrelevant of status
>>   migration: Let migrate_set_error() take ownership
>>   migration: Introduce migrate_has_error()
>>   migration: Refactor error handling in source return path
>>   migration: Deliver return path file error to migrate state too
>>   qemufile: Always return a verbose error
>>   migration: Provide explicit error message for file shutdowns
>>
>>  qapi/migration.json      |   5 +-
>>  migration/migration.h    |   8 +-
>>  migration/ram.h          |   5 +-
>>  migration/channel.c      |   1 -
>>  migration/migration.c    | 168 +++++++++++++++++++++++----------------
>>  migration/multifd.c      |  10 +--
>>  migration/postcopy-ram.c |   1 -
>>  migration/qemu-file.c    |  20 ++++-
>>  migration/ram.c          |  42 +++++-----
>>  migration/trace-events   |   2 +-
>>  10 files changed, 149 insertions(+), 113 deletions(-)
>
> Hi Peter,
>
> Were you aiming at solving any specific bug with this series? I'm seeing
> a bug on master (361d5397355) with the
> /x86_64/migration/postcopy/preempt/recovery/plain test around the areas
> that this series touches.
>
> It happens very rarely and I'm still investigating, but in case you have
> any thoughts:
>
> ====
> It seems there's a race condition between postcopy resume and the return
> path cleanup.
>
> It is possible for open_return_path_on_source() to setup the new
> QEMUFile *before* the cleanup path at source_return_path_thread() has
> had a chance to run, so we end up calling migration_release_dst_files()
> on the new file and ms->rp_state.from_dst_file gets set to NULL again,
> leading to a SIGSEGV at qemu_file_get_error(rp) due to rp being NULL.

I did some more digging and this is indeed what happens. When we pause
on the incoming side, the to_src_file is closed and the source return
path sees an error (EBADFD) which leads to the cleanup (from_dst_file =
NULL). This happens independently and without any synchronization with a
potential concurrent resume operation.

Is there a reason for not closing the return path thread and starting a
new one for resume? The from_dst_file is the only thing being changed
anyway. It would allow us to remove the retry logic along with the
problematic cleanup path and not need another synchronization point
between qmp_migrate() and the return path.

Here's the race (important bit is open_return_path happening before
migration_release_dst_files):

migration                 | qmp                         | return path
--------------------------+-----------------------------+---------------------------------
                            qmp_migrate_pause()
                             shutdown(ms->to_dst_file)
                              f->last_error = -EIO
migrate_detect_error()
 postcopy_pause()
  set_state(PAUSED)
  wait(postcopy_pause_sem)
                            qmp_migrate(resume)
                            migrate_fd_connect()
                             resume = state == PAUSED
                             open_return_path <-- TOO SOON!
                             set_state(RECOVER)
                             post(postcopy_pause_sem)
                                                        (incoming closes to_src_file)
                                                        res = qemu_file_get_error(rp)
                                                        migration_release_dst_files()
                                                        ms->rp_state.from_dst_file = NULL
  post(postcopy_pause_rp_sem)
                                                        postcopy_pause_return_path_thread()
                                                          wait(postcopy_pause_rp_sem)
                                                        rp = ms->rp_state.from_dst_file
                                                        goto retry
                                                        qemu_file_get_error(rp)
                                                        SIGSEGV
-------------------------------------------------------------------------------------------
Re: [PATCH v2 0/7] migration: Better error handling in return path thread
Posted by Peter Xu 9 months, 3 weeks ago
Hi, Fabiano,

Sorry to be late on responding.

On Tue, Jul 25, 2023 at 03:24:26PM -0300, Fabiano Rosas wrote:
> Fabiano Rosas <farosas@suse.de> writes:
> 
> > Peter Xu <peterx@redhat.com> writes:
> >
> >> v2:
> >> - Patch "migration: Provide explicit error message for file shutdowns"
> >>   - Touched up qapi doc [Fabiano]
> >>   - Added Bugzilla link to commit which I didn't even notice that I was
> >>     fixing a bug.. but rightfully pointed out by Laszlo.
> >>   - Moved it to the 1st patch because it fixes a bug, please consider
> >>     review and merge it earlier.
> >>
> >> This is a small series that reworks error handling of postcopy return path
> >> threads.
> >>
> >> We used to contain a bunch of error_report(), converting them into
> >> error_setg() properly and deliver any of those errors to migration generic
> >> error reports (via migrate_set_error()).  Then these errors can also be
> >> observed in query-migrate after postcopy is paused.
> >>
> >> Dropped the return-path specific error reporting: mark_source_rp_bad(),
> >> because it's a duplication if we can always use migrate_set_error().
> >>
> >> Please have a look, thanks.
> >>
> >> Peter Xu (7):
> >>   migration: Display error in query-migrate irrelevant of status
> >>   migration: Let migrate_set_error() take ownership
> >>   migration: Introduce migrate_has_error()
> >>   migration: Refactor error handling in source return path
> >>   migration: Deliver return path file error to migrate state too
> >>   qemufile: Always return a verbose error
> >>   migration: Provide explicit error message for file shutdowns
> >>
> >>  qapi/migration.json      |   5 +-
> >>  migration/migration.h    |   8 +-
> >>  migration/ram.h          |   5 +-
> >>  migration/channel.c      |   1 -
> >>  migration/migration.c    | 168 +++++++++++++++++++++++----------------
> >>  migration/multifd.c      |  10 +--
> >>  migration/postcopy-ram.c |   1 -
> >>  migration/qemu-file.c    |  20 ++++-
> >>  migration/ram.c          |  42 +++++-----
> >>  migration/trace-events   |   2 +-
> >>  10 files changed, 149 insertions(+), 113 deletions(-)
> >
> > Hi Peter,
> >
> > Were you aiming at solving any specific bug with this series?

Nop.  I simply noticed that the error in return path cannot be proxied to
migration object and thought maybe we should do that.

Thanks for looing into this problem.

> > I'm seeing
> > a bug on master (361d5397355) with the
> > /x86_64/migration/postcopy/preempt/recovery/plain test around the areas
> > that this series touches.
> >
> > It happens very rarely and I'm still investigating, but in case you have
> > any thoughts:
> >
> > ====
> > It seems there's a race condition between postcopy resume and the return
> > path cleanup.
> >
> > It is possible for open_return_path_on_source() to setup the new
> > QEMUFile *before* the cleanup path at source_return_path_thread() has
> > had a chance to run, so we end up calling migration_release_dst_files()
> > on the new file and ms->rp_state.from_dst_file gets set to NULL again,
> > leading to a SIGSEGV at qemu_file_get_error(rp) due to rp being NULL.
> 
> I did some more digging and this is indeed what happens. When we pause
> on the incoming side, the to_src_file is closed and the source return
> path sees an error (EBADFD) which leads to the cleanup (from_dst_file =
> NULL). This happens independently and without any synchronization with a
> potential concurrent resume operation.
> 
> Is there a reason for not closing the return path thread and starting a
> new one for resume?

Not anything special.  When I initially worked on that (quite a few years
ago) I thought it would be the simplest we keep hold as much things as
possible, including the threads.  But maybe it's not too hard either to
just reinitiate the thread when resume indeed.

> The from_dst_file is the only thing being changed
> anyway. It would allow us to remove the retry logic along with the
> problematic cleanup path and not need another synchronization point
> between qmp_migrate() and the return path.

I'm fine if you want to remove the return path thread when a pause happens,
as long as we can cleanly do that; if you already drafted something and
looks all good from your side, happy to read it.  We may somewhere need
another call to await_return_path_close_on_source() when pause from
migration thread.

The other way is the main thread can stupidly wait for the two files to be
released, namely, from_dst_file and postcopy_qemufile_src.

> 
> Here's the race (important bit is open_return_path happening before
> migration_release_dst_files):
> 
> migration                 | qmp                         | return path
> --------------------------+-----------------------------+---------------------------------
>                             qmp_migrate_pause()
>                              shutdown(ms->to_dst_file)
>                               f->last_error = -EIO
> migrate_detect_error()
>  postcopy_pause()
>   set_state(PAUSED)
>   wait(postcopy_pause_sem)
>                             qmp_migrate(resume)
>                             migrate_fd_connect()
>                              resume = state == PAUSED
>                              open_return_path <-- TOO SOON!
>                              set_state(RECOVER)
>                              post(postcopy_pause_sem)
>                                                         (incoming closes to_src_file)
>                                                         res = qemu_file_get_error(rp)
>                                                         migration_release_dst_files()
>                                                         ms->rp_state.from_dst_file = NULL
>   post(postcopy_pause_rp_sem)
>                                                         postcopy_pause_return_path_thread()
>                                                           wait(postcopy_pause_rp_sem)
>                                                         rp = ms->rp_state.from_dst_file
>                                                         goto retry
>                                                         qemu_file_get_error(rp)
>                                                         SIGSEGV
> -------------------------------------------------------------------------------------------
> 

Thanks,

-- 
Peter Xu