[Qemu-devel] [PATCH RFC 0/2] Fix migration issues

Fei Li posted 2 patches 7 years ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20181022110854.10284-1-fli@suse.com
Test docker-clang@ubuntu passed
Test checkpatch passed
Test asan passed
Test docker-mingw@fedora failed
Test docker-quick@centos7 passed
Posted by Fei Li 7 years ago
Hi,
these two patches are to fix live migration issues. The first is
about multifd, and the second is to fix some error handling.

But I have a question about using multifd migration.
In our current code, when multifd is used during migration, if there
is an error before the destination receives all new channels (I mean
multifd_recv_new_channel(ioc)), the destination does not exit but
keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
the source exits.
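
To illustrate why a plain blocking read waits for as long as the peer
stays connected but silent, here is a minimal standalone sketch. It is
not QEMU code: read_with_timeout() and READ_TIMED_OUT are hypothetical
names, and guarding the read with poll() is only one possible approach,
shown here under the assumption of a POSIX socket.

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical sentinel: nothing arrived before the deadline. */
#define READ_TIMED_OUT (-2)

/*
 * Hypothetical helper, not part of QEMU: wait up to timeout_ms for fd to
 * become readable, then read once. Returns the byte count, 0 on EOF,
 * -1 on error, or READ_TIMED_OUT if the peer stayed silent.
 */
static ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);

    if (ready < 0) {
        return -1;
    }
    if (ready == 0) {
        return READ_TIMED_OUT;
    }
    return recv(fd, buf, len, 0);
}
```

With a connected but silent peer the helper returns READ_TIMED_OUT after
timeout_ms instead of blocking; once the peer closes its end, the read
returns 0 (EOF), which matches the observation that the destination only
wakes up when the source exits.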

My question is about the state of the destination host if it fails during
this period. I did a test: after applying patch [1/2], if
multifd_new_send_channel_async() fails, the destination host hangs for
a while and then pops up a window saying
    "'QEMU (...) [stopped]' is not responding.
    You may choose to wait a short while for it to continue or force
    the application to quit entirely."
But after closing the window by clicking, the qemu on the dest still
hangs there until I explicitly kill the qemu on the source.

The source host keeps running as expected, but I guess the hang
phenomenon in the dest is not right.
Would someone kindly give some suggestions on this? Thanks a lot.


Fei Li (2):
  migration: fix the multifd code
  migration: fix some error handling

 migration/migration.c    |  5 +----
 migration/postcopy-ram.c |  3 +++
 migration/ram.c          | 33 +++++++++++++++++++++++----------
 migration/ram.h          |  2 +-
 4 files changed, 28 insertions(+), 15 deletions(-)

-- 
2.13.7


Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
Posted by Peter Xu 7 years ago
On Mon, Oct 22, 2018 at 07:08:52PM +0800, Fei Li wrote:
> Hi,
> these two patches are to fix live migration issues. The first is
> about multifd, and the second is to fix some error handling.
> 
> But I have a question about using multifd migration.
> In our current code, when multifd is used during migration, if there
> is an error before the destination receives all new channels (I mean
> multifd_recv_new_channel(ioc)), the destination does not exit but
> keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
> the source exits.
> 
> My question is about the state of the destination host if fails during
> this period. I did a test, after applying [1/2] patch, if
> multifd_new_send_channel_async() fails, the destination host hangs for
> a while then later pops up a window saying
>     "'QEMU (...) [stopped]' is not responding.
>     You may choose to wait a short while for it to continue or force
>     the application to quit entirely."
> But after closing the window by clicking, the qemu on the dest still
> hangs there until I explicitly kill the qemu on the source.
> 
> The source host keeps running as expected, but I guess the hang
> phenomenon in the dest is not right.
> Would someone kindly give some suggestions on this? Thanks a lot.

Note that KVM Forum is taking place right now, so responses from anyone
might be slow (it ends this week).

I think what you described is normal, since we can't guarantee that
the network is always stable.  Normally I'd expect the migration to
fail, but that won't matter much: after all it's a precopy, so we lose
nothing.  So I'm curious: when the error you mentioned happens (e.g.,
the total channel number is N and only M channels got connected, with
M < N), could you simply kill the destination?  Then AFAIU the source
can just continue to run, right?
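
The M < N situation can be written down as a small state check. This is
only a sketch with hypothetical names (ChannelSetup, SetupStatus,
setup_status), not QEMU's actual bookkeeping; it assumes the destination
knows the configured channel count N and counts completed handshakes M.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping for multifd channel setup on the destination. */
typedef struct {
    int expected;   /* N: channels the migration was configured with */
    int connected;  /* M: channels that completed their handshake */
} ChannelSetup;

typedef enum {
    SETUP_WAITING,  /* M < N, more channels may still arrive */
    SETUP_READY,    /* M == N, migration can proceed */
    SETUP_FAILED    /* a channel reported an error during setup */
} SetupStatus;

/* Decide what the destination should do after each channel event. */
static SetupStatus setup_status(const ChannelSetup *s, bool channel_error)
{
    if (channel_error) {
        return SETUP_FAILED;
    }
    return s->connected == s->expected ? SETUP_READY : SETUP_WAITING;
}
```

In the SETUP_FAILED case the destination can be torn down, and since this
is precopy the source still holds the full guest state.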

> 
> 
> Fei Li (2):
>   migration: fix the multifd code
>   migration: fix some error handling
> 
>  migration/migration.c    |  5 +----
>  migration/postcopy-ram.c |  3 +++
>  migration/ram.c          | 33 +++++++++++++++++++++++----------
>  migration/ram.h          |  2 +-
>  4 files changed, 28 insertions(+), 15 deletions(-)
> 
> -- 
> 2.13.7
> 

Regards,

-- 
Peter Xu

Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
Posted by Fei Li 7 years ago

On 10/25/2018 05:27 AM, Peter Xu wrote:
> On Mon, Oct 22, 2018 at 07:08:52PM +0800, Fei Li wrote:
>> Hi,
>> these two patches are to fix live migration issues. The first is
>> about multifd, and the second is to fix some error handling.
>>
>> But I have a question about using multifd migration.
>> In our current code, when multifd is used during migration, if there
>> is an error before the destination receives all new channels (I mean
>> multifd_recv_new_channel(ioc)), the destination does not exit but
>> keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
>> the source exits.
>>
>> My question is about the state of the destination host if fails during
>> this period. I did a test, after applying [1/2] patch, if
>> multifd_new_send_channel_async() fails, the destination host hangs for
>> a while then later pops up a window saying
>>      "'QEMU (...) [stopped]' is not responding.
>>      You may choose to wait a short while for it to continue or force
>>      the application to quit entirely."
>> But after closing the window by clicking, the qemu on the dest still
>> hangs there until I explicitly kill the qemu on the source.
>>
>> The source host keeps running as expected, but I guess the hang
>> phenomenon in the dest is not right.
>> Would someone kindly give some suggestions on this? Thanks a lot.
> Note that it's during KVM forum so the response from anyone might be
> slow (it ends this week).
Thanks for the kind reminder. :)
> I think the thing you described seems normal since we can't guarantee
> the network is always stable, normally I'll expect that the migration
> will fail but it won't matter much since after all it's a precopy so
> we lose nothing.  So I'm curious about when the error you mentioned
> happens (e.g., total channel number is N, you only got M channels
> connected, with M < N) could you just simply kill the destination?
> Then AFAIU the source can just continue to run, right?
Yes, for the M < N situation, IMO the destination can simply be killed by
adding exit(EXIT_FAILURE) when it fails to receive a packet via some
channel. The code is below; it has been tested, and the result is that the
source continues to run and the destination exits.
I'd like to write a separate patch to fix the hang issue if the code/idea
below is acceptable.

@@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
  /* Return true if multifd is ready for the migration, otherwise false */
  bool multifd_recv_new_channel(QIOChannel *ioc)
  {
+    MigrationIncomingState *mis = migration_incoming_get_current();
      MultiFDRecvParams *p;
      Error *local_err = NULL;
      int id;

      id = multifd_recv_initial_packet(ioc, &local_err);
      if (id < 0) {
-        multifd_recv_terminate_threads(local_err);
-        return false;
+        error_reportf_err(local_err,
+                          "failed to receive packet via multifd channel %x: ",
+                          multifd_recv_state->count);
+        goto fail;
      }

      p = &multifd_recv_state->params[id];
      if (p->c != NULL) {
          error_setg(&local_err, "multifd: received id '%d' already setup'",
                     id);
-        multifd_recv_terminate_threads(local_err);
-        return false;
+        goto fail;
      }
      p->c = ioc;
      object_ref(OBJECT(ioc));
@@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
                         QEMU_THREAD_JOINABLE);
      atomic_inc(&multifd_recv_state->count);
      return multifd_recv_state->count == migrate_multifd_channels();
+fail:
+    multifd_recv_terminate_threads(local_err);
+    qemu_fclose(mis->from_src_file);
+    mis->from_src_file = NULL;
+    exit(EXIT_FAILURE);
  }

Have a nice day, thanks a lot
Fei
>>
>> Fei Li (2):
>>    migration: fix the multifd code
>>    migration: fix some error handling
>>
>>   migration/migration.c    |  5 +----
>>   migration/postcopy-ram.c |  3 +++
>>   migration/ram.c          | 33 +++++++++++++++++++++++----------
>>   migration/ram.h          |  2 +-
>>   4 files changed, 28 insertions(+), 15 deletions(-)
>>
>> -- 
>> 2.13.7
>>
> Regards,
>


Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
Posted by Peter Xu 7 years ago
On Thu, Oct 25, 2018 at 05:04:00PM +0800, Fei Li wrote:

[...]

> @@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
>  /* Return true if multifd is ready for the migration, otherwise false */
>  bool multifd_recv_new_channel(QIOChannel *ioc)
>  {
> +    MigrationIncomingState *mis = migration_incoming_get_current();
>      MultiFDRecvParams *p;
>      Error *local_err = NULL;
>      int id;
> 
>      id = multifd_recv_initial_packet(ioc, &local_err);
>      if (id < 0) {
> -        multifd_recv_terminate_threads(local_err);
> -        return false;
> +        error_reportf_err(local_err,
> +                          "failed to receive packet via multifd channel %x: ",
> +                          multifd_recv_state->count);
> +        goto fail;
>      }
> 
>      p = &multifd_recv_state->params[id];
>      if (p->c != NULL) {
>          error_setg(&local_err, "multifd: received id '%d' already setup'",
>                     id);
> -        multifd_recv_terminate_threads(local_err);
> -        return false;
> +        goto fail;
>      }
>      p->c = ioc;
>      object_ref(OBJECT(ioc));
> @@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
>                         QEMU_THREAD_JOINABLE);
>      atomic_inc(&multifd_recv_state->count);
>      return multifd_recv_state->count == migrate_multifd_channels();
> +fail:
> +    multifd_recv_terminate_threads(local_err);
> +    qemu_fclose(mis->from_src_file);
> +    mis->from_src_file = NULL;
> +    exit(EXIT_FAILURE);
>  }

Yeah, I think it makes sense to at least report some details when an
error happens, but I'm not sure whether it's good to explicitly exit()
here.  IMHO you can add an Error** to the multifd_recv_new_channel()
parameter list to do that, and even pass it up through
migration_ioc_process_incoming().  What do you think?
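
A minimal sketch of the Error** propagation idea, assuming stand-in
types rather than QEMU's real Error API from qapi/error.h. Every name
below (Error as defined here, set_error, recv_new_channel,
ioc_process_incoming, channel_process_incoming) is hypothetical and only
mirrors the call chain discussed in this thread. The bool return value
is also simplified to mean success or failure, whereas the real
multifd_recv_new_channel() returns whether all channels are ready.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for QEMU's Error object; the real one lives in qapi/error.h. */
typedef struct Error {
    char msg[128];
} Error;

/* Simplified analogue of error_setg(): allocate and fill *errp once. */
static void set_error(Error **errp, const char *text, int id)
{
    if (errp && !*errp) {
        *errp = malloc(sizeof(**errp));
        if (*errp) {
            snprintf((*errp)->msg, sizeof((*errp)->msg),
                     "%s (id %d)", text, id);
        }
    }
}

/* Hypothetical receive step: fails for negative ids, never calls exit(). */
static bool recv_new_channel(int id, Error **errp)
{
    if (id < 0) {
        set_error(errp, "failed to receive initial packet", id);
        return false;
    }
    return true;
}

/* Middle layer: forwards errp to its callee untouched. */
static bool ioc_process_incoming(int id, Error **errp)
{
    return recv_new_channel(id, errp);
}

/* Top-level caller: the single place that reports; returns 0 on success. */
static int channel_process_incoming(int id)
{
    Error *local_err = NULL;

    if (!ioc_process_incoming(id, &local_err)) {
        fprintf(stderr, "migration: %s\n",
                local_err ? local_err->msg : "unknown error");
        free(local_err);
        return -1;
    }
    return 0;
}
```

The design point is that the low-level function only describes the
failure, and the decision to report, clean up, or exit stays with the
outermost caller.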

Regards,

-- 
Peter Xu

Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
Posted by Fei Li 7 years ago

On 10/25/2018 08:58 PM, Peter Xu wrote:
> On Thu, Oct 25, 2018 at 05:04:00PM +0800, Fei Li wrote:
>
> [...]
>
>> @@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
>>   /* Return true if multifd is ready for the migration, otherwise false */
>>   bool multifd_recv_new_channel(QIOChannel *ioc)
>>   {
>> +    MigrationIncomingState *mis = migration_incoming_get_current();
>>       MultiFDRecvParams *p;
>>       Error *local_err = NULL;
>>       int id;
>>
>>       id = multifd_recv_initial_packet(ioc, &local_err);
>>       if (id < 0) {
>> -        multifd_recv_terminate_threads(local_err);
>> -        return false;
>> +        error_reportf_err(local_err,
>> +                          "failed to receive packet via multifd channel %x: ",
>> +                          multifd_recv_state->count);
>> +        goto fail;
>>       }
>>
>>       p = &multifd_recv_state->params[id];
>>       if (p->c != NULL) {
>>           error_setg(&local_err, "multifd: received id '%d' already setup'",
>>                      id);
>> -        multifd_recv_terminate_threads(local_err);
>> -        return false;
>> +        goto fail;
>>       }
>>       p->c = ioc;
>>       object_ref(OBJECT(ioc));
>> @@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
>>                          QEMU_THREAD_JOINABLE);
>>       atomic_inc(&multifd_recv_state->count);
>>       return multifd_recv_state->count == migrate_multifd_channels();
>> +fail:
>> +    multifd_recv_terminate_threads(local_err);
>> +    qemu_fclose(mis->from_src_file);
>> +    mis->from_src_file = NULL;
>> +    exit(EXIT_FAILURE);
>>   }
> Yeah I think it makes sense to at least report some details when error
> happens, but I'm not sure whether it's good to explicitly exit() here.
> IMHO you can add an Error** in multifd_recv_new_channel() parameter
> list to do that, and even through migration_ioc_process_incoming().
> What do you think?
>
> Regards,
>
You mean exit() in migration_ioc_process_incoming(), or in its
caller migration_channel_process_incoming()? Actually either is
OK for me. :) But today I found that when postcopy and multifd are
used together for live migration, the hang still occurs even with
the above code, so sad about that. I will keep debugging and see
how to fix this.

Have a nice day, thanks
Fei

Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
Posted by Peter Xu 7 years ago
On Fri, Oct 26, 2018 at 09:10:19PM +0800, Fei Li wrote:
> 
> 
> On 10/25/2018 08:58 PM, Peter Xu wrote:
> > On Thu, Oct 25, 2018 at 05:04:00PM +0800, Fei Li wrote:
> > 
> > [...]
> > 
> > > @@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
> > >   /* Return true if multifd is ready for the migration, otherwise false */
> > >   bool multifd_recv_new_channel(QIOChannel *ioc)
> > >   {
> > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > >       MultiFDRecvParams *p;
> > >       Error *local_err = NULL;
> > >       int id;
> > > 
> > >       id = multifd_recv_initial_packet(ioc, &local_err);
> > >       if (id < 0) {
> > > -        multifd_recv_terminate_threads(local_err);
> > > -        return false;
> > > > +        error_reportf_err(local_err,
> > > > +                          "failed to receive packet via multifd channel %x: ",
> > > > +                          multifd_recv_state->count);
> > > +        goto fail;
> > >       }
> > > 
> > >       p = &multifd_recv_state->params[id];
> > >       if (p->c != NULL) {
> > >           error_setg(&local_err, "multifd: received id '%d' already setup'",
> > >                      id);
> > > -        multifd_recv_terminate_threads(local_err);
> > > -        return false;
> > > +        goto fail;
> > >       }
> > >       p->c = ioc;
> > >       object_ref(OBJECT(ioc));
> > > @@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
> > >                          QEMU_THREAD_JOINABLE);
> > >       atomic_inc(&multifd_recv_state->count);
> > >       return multifd_recv_state->count == migrate_multifd_channels();
> > > +fail:
> > > +    multifd_recv_terminate_threads(local_err);
> > > +    qemu_fclose(mis->from_src_file);
> > > +    mis->from_src_file = NULL;
> > > +    exit(EXIT_FAILURE);
> > >   }
> > Yeah I think it makes sense to at least report some details when error
> > happens, but I'm not sure whether it's good to explicitly exit() here.
> > IMHO you can add an Error** in multifd_recv_new_channel() parameter
> > list to do that, and even through migration_ioc_process_incoming().
> > What do you think?
> > 
> > Regards,
> > 
> You mean exit() in migration_ioc_process_incoming(), or further
> caller migration_channel_process_incoming()? Actually either is
> ok for me. :) But today I find if using postcopy and multifd together
> to do live migration, it seems the hang still occurs even with the
> above codes, so sad about that. I will keep debugging and see
> how to fix this.

Maybe you can move the error_report_err() in
migration_channel_process_incoming() out of the TLS path so we can
report the error if either TLS or non-TLS case got something wrong.

And I don't even know whether multifd could work with postcopy...

Regards,

-- 
Peter Xu

Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
Posted by Dr. David Alan Gilbert 7 years ago
* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Oct 26, 2018 at 09:10:19PM +0800, Fei Li wrote:
> > 
> > 
> > On 10/25/2018 08:58 PM, Peter Xu wrote:
> > > On Thu, Oct 25, 2018 at 05:04:00PM +0800, Fei Li wrote:
> > > 
> > > [...]
> > > 
> > > > @@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
> > > >   /* Return true if multifd is ready for the migration, otherwise false */
> > > >   bool multifd_recv_new_channel(QIOChannel *ioc)
> > > >   {
> > > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > > >       MultiFDRecvParams *p;
> > > >       Error *local_err = NULL;
> > > >       int id;
> > > > 
> > > >       id = multifd_recv_initial_packet(ioc, &local_err);
> > > >       if (id < 0) {
> > > > -        multifd_recv_terminate_threads(local_err);
> > > > -        return false;
> > > > > +        error_reportf_err(local_err,
> > > > > +                          "failed to receive packet via multifd channel %x: ",
> > > > > +                          multifd_recv_state->count);
> > > > +        goto fail;
> > > >       }
> > > > 
> > > >       p = &multifd_recv_state->params[id];
> > > >       if (p->c != NULL) {
> > > >           error_setg(&local_err, "multifd: received id '%d' already setup'",
> > > >                      id);
> > > > -        multifd_recv_terminate_threads(local_err);
> > > > -        return false;
> > > > +        goto fail;
> > > >       }
> > > >       p->c = ioc;
> > > >       object_ref(OBJECT(ioc));
> > > > @@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
> > > >                          QEMU_THREAD_JOINABLE);
> > > >       atomic_inc(&multifd_recv_state->count);
> > > >       return multifd_recv_state->count == migrate_multifd_channels();
> > > > +fail:
> > > > +    multifd_recv_terminate_threads(local_err);
> > > > +    qemu_fclose(mis->from_src_file);
> > > > +    mis->from_src_file = NULL;
> > > > +    exit(EXIT_FAILURE);
> > > >   }
> > > Yeah I think it makes sense to at least report some details when error
> > > happens, but I'm not sure whether it's good to explicitly exit() here.
> > > IMHO you can add an Error** in multifd_recv_new_channel() parameter
> > > list to do that, and even through migration_ioc_process_incoming().
> > > What do you think?
> > > 
> > > Regards,
> > > 
> > You mean exit() in migration_ioc_process_incoming(), or further
> > caller migration_channel_process_incoming()? Actually either is
> > ok for me. :) But today I find if using postcopy and multifd together
> > to do live migration, it seems the hang still occurs even with the
> > above codes, so sad about that. I will keep debugging and see
> > how to fix this.
> 
> Maybe you can move the error_report_err() in
> migration_channel_process_incoming() out of the TLS path so we can
> report the error if either TLS or non-TLS case got something wrong.
> 
> And I don't even know whether multifd could work with postcopy...

Nope, it's not expected to work yet.

Dave

> Regards,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
Posted by Fei Li 7 years ago

On 10/26/2018 11:24 PM, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
>> On Fri, Oct 26, 2018 at 09:10:19PM +0800, Fei Li wrote:
>>>
>>> On 10/25/2018 08:58 PM, Peter Xu wrote:
>>>> On Thu, Oct 25, 2018 at 05:04:00PM +0800, Fei Li wrote:
>>>>
>>>> [...]
>>>>
>>>>> @@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
>>>>>    /* Return true if multifd is ready for the migration, otherwise false */
>>>>>    bool multifd_recv_new_channel(QIOChannel *ioc)
>>>>>    {
>>>>> +    MigrationIncomingState *mis = migration_incoming_get_current();
>>>>>        MultiFDRecvParams *p;
>>>>>        Error *local_err = NULL;
>>>>>        int id;
>>>>>
>>>>>        id = multifd_recv_initial_packet(ioc, &local_err);
>>>>>        if (id < 0) {
>>>>> -        multifd_recv_terminate_threads(local_err);
>>>>> -        return false;
>>>>> +        error_reportf_err(local_err,
>>>>> +                          "failed to receive packet via multifd channel %x: ",
>>>>> +                          multifd_recv_state->count);
>>>>> +        goto fail;
>>>>>        }
>>>>>
>>>>>        p = &multifd_recv_state->params[id];
>>>>>        if (p->c != NULL) {
>>>>>            error_setg(&local_err, "multifd: received id '%d' already setup'",
>>>>>                       id);
>>>>> -        multifd_recv_terminate_threads(local_err);
>>>>> -        return false;
>>>>> +        goto fail;
>>>>>        }
>>>>>        p->c = ioc;
>>>>>        object_ref(OBJECT(ioc));
>>>>> @@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
>>>>>                           QEMU_THREAD_JOINABLE);
>>>>>        atomic_inc(&multifd_recv_state->count);
>>>>>        return multifd_recv_state->count == migrate_multifd_channels();
>>>>> +fail:
>>>>> +    multifd_recv_terminate_threads(local_err);
>>>>> +    qemu_fclose(mis->from_src_file);
>>>>> +    mis->from_src_file = NULL;
>>>>> +    exit(EXIT_FAILURE);
>>>>>    }
>>>> Yeah I think it makes sense to at least report some details when error
>>>> happens, but I'm not sure whether it's good to explicitly exit() here.
>>>> IMHO you can add an Error** in multifd_recv_new_channel() parameter
>>>> list to do that, and even through migration_ioc_process_incoming().
>>>> What do you think?
>>>>
>>>> Regards,
>>>>
>>> You mean exit() in migration_ioc_process_incoming(), or further
>>> caller migration_channel_process_incoming()? Actually either is
>>> ok for me. :) But today I find if using postcopy and multifd together
>>> to do live migration, it seems the hang still occurs even with the
>>> above codes, so sad about that. I will keep debugging and see
>>> how to fix this.
>> Maybe you can move the error_report_err() in
>> migration_channel_process_incoming() out of the TLS path so we can
>> report the error if either TLS or non-TLS case got something wrong.
Thanks for the advice. I will do the update in the next version. :)
>>
>> And I don't even know whether multifd could work with postcopy...
> Nope, it's not expected to work yet.
>
> Dave
Thanks for the helpful information. :)

BTW, in the next version, I'd like to merge these three migration
patches into the "[PATCH RFC v6 ] qemu_thread_create: propagate the
error to callers to handle" series, and cc you on the patches. Please
help to review.

Have a nice day, thanks again
Fei
>
>> Regards,
>>
>> -- 
>> Peter Xu
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>


Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
Posted by Dr. David Alan Gilbert 7 years ago
* Fei Li (fli@suse.com) wrote:
> Hi,
> these two patches are to fix live migration issues. The first is
> about multifd, and the second is to fix some error handling.
> 
> But I have a question about using multifd migration.
> In our current code, when multifd is used during migration, if there
> is an error before the destination receives all new channels (I mean
> multifd_recv_new_channel(ioc)), the destination does not exit but
> keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
> the source exits.
> 
> My question is about the state of the destination host if fails during
> this period. I did a test, after applying [1/2] patch, if
> multifd_new_send_channel_async() fails, the destination host hangs for
> a while then later pops up a window saying
>     "'QEMU (...) [stopped]' is not responding.
>     You may choose to wait a short while for it to continue or force
>     the application to quit entirely."
> But after closing the window by clicking, the qemu on the dest still
> hangs there until I explicitly kill the qemu on the source.

That sounds like the main thread is blocked for some reason? But I don't
normally use the window setup;  if you try with -nographic and can see
the HMP (or a QMP) monitor, can you see if the monitor still responds?
If it doesn't then try and get a backtrace.

The monitor really shouldn't block, so it would be interesting to see.

Dave

> The source host keeps running as expected, but I guess the hang
> phenomenon in the dest is not right.
> Would someone kindly give some suggestions on this? Thanks a lot.
> 
> 
> Fei Li (2):
>   migration: fix the multifd code
>   migration: fix some error handling
> 
>  migration/migration.c    |  5 +----
>  migration/postcopy-ram.c |  3 +++
>  migration/ram.c          | 33 +++++++++++++++++++++++----------
>  migration/ram.h          |  2 +-
>  4 files changed, 28 insertions(+), 15 deletions(-)
> 
> -- 
> 2.13.7
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
Posted by Fei Li 7 years ago

On 10/25/2018 08:55 PM, Dr. David Alan Gilbert wrote:
> * Fei Li (fli@suse.com) wrote:
>> Hi,
>> these two patches are to fix live migration issues. The first is
>> about multifd, and the second is to fix some error handling.
>>
>> But I have a question about using multifd migration.
>> In our current code, when multifd is used during migration, if there
>> is an error before the destination receives all new channels (I mean
>> multifd_recv_new_channel(ioc)), the destination does not exit but
>> keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
>> the source exits.
>>
>> My question is about the state of the destination host if fails during
>> this period. I did a test, after applying [1/2] patch, if
>> multifd_new_send_channel_async() fails, the destination host hangs for
>> a while then later pops up a window saying
>>      "'QEMU (...) [stopped]' is not responding.
>>      You may choose to wait a short while for it to continue or force
>>      the application to quit entirely."
>> But after closing the window by clicking, the qemu on the dest still
>> hangs there until I explicitly kill the qemu on the source.
> That sounds like the main thread is blocked for some reason?
Yes, the main thread on the dst keeps looping.
> But I don't
> normally use the window setup;  if you try with -nographic and can see
> the HMP (or a QMP) monitor, can you see if the monitor still responds?

Thanks for the `-nographic` reminder. I observed an interesting
phenomenon:
If I do the `migrate -d tcp:ip_addr:port` before the guest's graphics appear
(the screen is still dark), there is no hang and the guest starts up
properly later.
But if I do the live migration after the guest has fully started up, I mean
when I can operate something using my mouse inside the guest, the hang
occurs.
This is true when using `-nographic` for both src and dst,
and when using `-nographic` for only src or only dst.


The hang phenomenon is that the dst never seems to respond (I
waited three minutes), and the cursor just keeps flashing. After I
explicitly kill the src, the dst quits, as follows:
(Same result if gdb is not used in src)
src:
(qemu) ...
(qemu) q
(gdb) q
dst:
(qemu) Up to now, dst has received the 0 channel
Up to now, dst has received the 1 channel

(qemu)
(qemu)

To check the migration state in the src:
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: on dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off
Migration status: setup /* I added some code to set the status to "failed", but it still does not work; see below for details */
total time: 0 milliseconds

I guess the source should proactively tell the dst about the failure and
disconnect from the source side, so I tried to set the above
"Migration status" to "failed" and to call qemu_fclose(s->to_dst_file)
when multifd_new_send_channel_async() fails.
(BTW: I even tried:
  if (s->vm_was_running) {   vm_start();   }   )
But the hang is still there.
> If it doesn't then try and get a backtrace.
>
> The monitor really shouldn't block, so it would be interesting to see.
>
> Dave
I set two breakpoints and got the following backtrace; I hope it helps. :)

Thread 1 "qemu-system-x86" hit Breakpoint 1, multifd_recv_new_channel (
     ioc=0x555557995af0) at /build/gitcode/qemu-build/migration/ram.c:1368
1368    {
(gdb) c
Continuing.

Thread 1 "qemu-system-x86" hit Breakpoint 2, qio_channel_socket_readv (
     ioc=0x555557995af0, iov=0x5555568777d0, niov=1, fds=0x0, nfds=0x0,
     errp=0x7fffffffdb38) at io/channel-socket.c:463
463    {
(gdb) n
464        QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
(gdb)
......
483     retry:
(gdb)
484        ret = recvmsg(sioc->fd, &msg, sflags);
(gdb) bt
#0  qio_channel_socket_readv (ioc=0x555557995af0, iov=0x5555568777d0, niov=1,
     fds=0x0, nfds=0x0, errp=0x7fffffffdb38) at io/channel-socket.c:484
#1  0x0000555555d156c5 in qio_channel_readv_full (ioc=0x555557995af0,
     iov=0x5555568777d0, niov=1, fds=0x0, nfds=0x0, errp=0x7fffffffdb38)
     at io/channel.c:65
#2  0x0000555555d15b26 in qio_channel_readv (ioc=0x555557995af0,
     iov=0x5555568777d0, niov=1, errp=0x7fffffffdb38) at io/channel.c:197
#3  0x0000555555d15853 in qio_channel_readv_all_eof (ioc=0x555557995af0,
     iov=0x7fffffffda70, niov=1, errp=0x7fffffffdb38) at io/channel.c:106
#4  0x0000555555d1595c in qio_channel_readv_all (ioc=0x555557995af0,
     iov=0x7fffffffda70, niov=1, errp=0x7fffffffdb38) at io/channel.c:142
#5  0x0000555555d15d0c in qio_channel_read_all (ioc=0x555557995af0,
     buf=0x7fffffffdad0 "\340\"zVUU", buflen=25, errp=0x7fffffffdb38)
     at io/channel.c:246
#6  0x000055555587695c in multifd_recv_initial_packet (c=0x555557995af0,
     errp=0x7fffffffdb38) at /build/gitcode/qemu-build/migration/ram.c:653
#7  0x00005555558788fb in multifd_recv_new_channel (ioc=0x555557995af0)
     at /build/gitcode/qemu-build/migration/ram.c:1374
#8  0x0000555555bc9978 in migration_ioc_process_incoming (ioc=0x555557995af0)
     at migration/migration.c:573
#9  0x0000555555bd0c69 in migration_channel_process_incoming (ioc=0x555557995af0)
     at migration/channel.c:47
#10 0x0000555555bcf7e8 in socket_accept_incoming_migration (
     listener=0x5555578dcae0, cioc=0x555557995af0, opaque=0x0)
     at migration/socket.c:166
#11 0x0000555555d2051f in qio_net_listener_channel_func (ioc=0x5555579c7180,
     condition=G_IO_IN, opaque=0x5555578dcae0) at io/net-listener.c:53
#12 0x0000555555d1c0a2 in qio_channel_fd_source_dispatch (source=0x5555568d5970,
     callback=0x555555d20473 <qio_net_listener_channel_func>,
     user_data=0x5555578dcae0) at io/channel-watch.c:84
#13 0x00007ffff6353dc5 in g_main_context_dispatch ()
    from /usr/lib64/libglib-2.0.so.0
#14 0x0000555555d7d1ad in glib_pollfds_poll () at util/main-loop.c:215
#15 0x0000555555d7d227 in os_host_main_loop_wait (timeout=0) at util/main-loop.c:238
#16 0x0000555555d7d2e0 in main_loop_wait (nonblocking=0) at util/main-loop.c:497
#17 0x00005555559cd679 in main_loop () at vl.c:1884
#18 0x00005555559d4f1e in main (argc=32, argv=0x7fffffffe0b8, envp=0x7fffffffe1c0)
     at vl.c:4618
(gdb) n

Thread 1 "qemu-system-x86" received signal SIGINT, Interrupt.
0x00007ffff5606f64 in recvmsg () from /lib64/libpthread.so.0
(gdb) c
Continuing.

After I enter the `n` above, the dst just hangs here, apparently waiting for
the result of recvmsg(sioc->fd, &msg, sflags). Later, even if I use ctrl+c
to interrupt it, the dst still hangs.
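
For comparison, a read that cannot block the calling thread can be
sketched as below. This is a standalone illustration, not QEMU code:
probe_channel() is a hypothetical name, and the sketch assumes a Linux
host where recv() accepts MSG_DONTWAIT.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical probe, not part of QEMU: peek at one byte without
 * consuming it and without blocking. Returns 1 if data is pending,
 * 0 if a read would block, and -1 on EOF or any other error.
 */
static int probe_channel(int fd)
{
    char byte;
    ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);

    if (n > 0) {
        return 1;
    }
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return 0;
    }
    return -1;
}
```

A caller running in the main loop could use such a probe to return to
the loop when nothing has arrived yet, rather than sitting in recvmsg()
as frame #0 of the backtrace shows.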

Have a nice day, thanks
Fei
>
>> The source host keeps running as expected, but I guess the hang
>> phenomenon in the dest is not right.
>> Would someone kindly give some suggestions on this? Thanks a lot.
>>
>>
>> Fei Li (2):
>>    migration: fix the multifd code
>>    migration: fix some error handling
>>
>>   migration/migration.c    |  5 +----
>>   migration/postcopy-ram.c |  3 +++
>>   migration/ram.c          | 33 +++++++++++++++++++++++----------
>>   migration/ram.h          |  2 +-
>>   4 files changed, 28 insertions(+), 15 deletions(-)
>>
>> -- 
>> 2.13.7
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>