migration/channel errors and cancelling

[Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling

Posted by Dr. David Alan Gilbert (git) 8 years, 1 month ago

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Hi,
  Where a channel fails asynchronously during connect, call
back through the migration code so it can clean up.
  In particular this causes the transition of a 'cancelling' state
to 'cancelled' in the case of:

    migrate -d tcp:deadhost:port
     <host tries to connect>
    migrate_cancel

previously the status would get stuck in cancelling because
the final cleanup didn't happen.

  This is the second part of the fix for:
https://bugzilla.redhat.com/show_bug.cgi?id=1525899

Dave

Dr. David Alan Gilbert (2):
  migration: Allow migrate_fd_connect to take an Error *
  migration: Route errors down  through migration_channel_connect

 migration/channel.c    | 32 ++++++++++++++++----------------
 migration/channel.h    |  3 ++-
 migration/exec.c       |  2 +-
 migration/fd.c         |  2 +-
 migration/migration.c  |  7 ++++++-
 migration/migration.h  |  2 +-
 migration/rdma.c       |  2 +-
 migration/socket.c     |  4 +---
 migration/tls.c        |  3 +--
 migration/trace-events |  2 +-
 10 files changed, 31 insertions(+), 28 deletions(-)

-- 
2.14.3

Re: [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling

Posted by Peter Xu 8 years, 1 month ago

On Fri, Dec 15, 2017 at 05:16:53PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Hi,
>   Where a channel fails asynchronously during connect, call
> back through the migration code so it can clean up.
>   In particular this causes the transition of a 'cancelling' state
> to 'cancelled' in the case of:
> 
>     migrate -d tcp:deadhost:port
>      <host tries to connect>
>     migrate_cancel
> 
> previously the status would get stuck in cancelling because
> the final cleanup didn't happen.
> 
>   This is the second part of the fix for:
> https://bugzilla.redhat.com/show_bug.cgi?id=1525899

IIUC this series tries to deliver the connection error a long way
until migrate_fd_connect() to handle it. But, haven't we already have
a function migrate_fd_error() to do that (which is faster, and
simpler)?

void migrate_fd_error(MigrationState *s, const Error *error)
{
    trace_migrate_fd_error(error_get_pretty(error));
    assert(s->to_dst_file == NULL);
    migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
                      MIGRATION_STATUS_FAILED);
    migrate_set_error(s, error);
    notifier_list_notify(&migration_state_notifiers, s);
    block_cleanup_parameters(s);
}

I think it's not handling the case when cancelling.  If we let it to
handle the cancelling case well, would it be a simpler fix?

Moreover, I think this is another good example that migration is not
handling the cleanup "cleanly" in general... I really hope we can do
this better in 2.12.  I'll see whether I can give it a shot, but in
all cases it'll be after the merging of existing patches since there
are already quite a lot of dangling patches.

Thanks,

-- 
Peter Xu

Re: [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling

Posted by Dr. David Alan Gilbert 8 years, 1 month ago

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Dec 15, 2017 at 05:16:53PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Hi,
> >   Where a channel fails asynchronously during connect, call
> > back through the migration code so it can clean up.
> >   In particular this causes the transition of a 'cancelling' state
> > to 'cancelled' in the case of:
> > 
> >     migrate -d tcp:deadhost:port
> >      <host tries to connect>
> >     migrate_cancel
> > 
> > previously the status would get stuck in cancelling because
> > the final cleanup didn't happen.
> > 
> >   This is the second part of the fix for:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1525899
> 
> IIUC this series tries to deliver the connection error a long way
> until migrate_fd_connect() to handle it. But, haven't we already have
> a function migrate_fd_error() to do that (which is faster, and
> simpler)?
> 
> void migrate_fd_error(MigrationState *s, const Error *error)
> {
>     trace_migrate_fd_error(error_get_pretty(error));
>     assert(s->to_dst_file == NULL);
>     migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>                       MIGRATION_STATUS_FAILED);
>     migrate_set_error(s, error);
>     notifier_list_notify(&migration_state_notifiers, s);
>     block_cleanup_parameters(s);
> }
> 
> I think it's not handling the case when cancelling.  If we let it to
> handle the cancelling case well, would it be a simpler fix?
>
> Moreover, I think this is another good example that migration is not
> handling the cleanup "cleanly" in general... I really hope we can do
> this better in 2.12.  I'll see whether I can give it a shot, but in
> all cases it'll be after the merging of existing patches since there
> are already quite a lot of dangling patches.

No, I think migrate_fd_error is the cause of the problem here, not the
answer.

If we stick to the simple rule that a migration must always call
migrate_fd_cleanup then the cancellation problems are fixed - I think
that's how we make migration 'clean' - a single cleanup routine
that always gets called.

Dave


> Thanks,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling

Posted by Peter Xu 8 years, 1 month ago

On Tue, Dec 19, 2017 at 10:14:08AM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Fri, Dec 15, 2017 at 05:16:53PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Hi,
> > >   Where a channel fails asynchronously during connect, call
> > > back through the migration code so it can clean up.
> > >   In particular this causes the transition of a 'cancelling' state
> > > to 'cancelled' in the case of:
> > > 
> > >     migrate -d tcp:deadhost:port
> > >      <host tries to connect>
> > >     migrate_cancel
> > > 
> > > previously the status would get stuck in cancelling because
> > > the final cleanup didn't happen.
> > > 
> > >   This is the second part of the fix for:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1525899
> > 
> > IIUC this series tries to deliver the connection error a long way
> > until migrate_fd_connect() to handle it. But, haven't we already have
> > a function migrate_fd_error() to do that (which is faster, and
> > simpler)?
> > 
> > void migrate_fd_error(MigrationState *s, const Error *error)
> > {
> >     trace_migrate_fd_error(error_get_pretty(error));
> >     assert(s->to_dst_file == NULL);
> >     migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> >                       MIGRATION_STATUS_FAILED);
> >     migrate_set_error(s, error);
> >     notifier_list_notify(&migration_state_notifiers, s);
> >     block_cleanup_parameters(s);
> > }
> > 
> > I think it's not handling the case when cancelling.  If we let it to
> > handle the cancelling case well, would it be a simpler fix?
> >
> > Moreover, I think this is another good example that migration is not
> > handling the cleanup "cleanly" in general... I really hope we can do
> > this better in 2.12.  I'll see whether I can give it a shot, but in
> > all cases it'll be after the merging of existing patches since there
> > are already quite a lot of dangling patches.
> 
> No, I think migrate_fd_error is the cause of the problem here, not the
> answer.

Could I ask why migrate_fd_error() is problematic?  Yeah I agree that
we should have a single point to clean things up, then can we call
migrate_fd_cleanup() somehow inside migrate_fd_error()?

The thing I don't really understand is: why we want the error be
delivered via those functions (migration_channel_connect,
migrate_fd_connect, etc.) to finally be cleaned up.

> 
> If we stick to the simple rule that a migration must always call
> migrate_fd_cleanup then the cancellation problems are fixed - I think
> that's how we make migration 'clean' - a single cleanup routine
> that always gets called.

-- 
Peter Xu

Re: [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling

Posted by Dr. David Alan Gilbert 8 years, 1 month ago

* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Dec 19, 2017 at 10:14:08AM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Fri, Dec 15, 2017 at 05:16:53PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > 
> > > > Hi,
> > > >   Where a channel fails asynchronously during connect, call
> > > > back through the migration code so it can clean up.
> > > >   In particular this causes the transition of a 'cancelling' state
> > > > to 'cancelled' in the case of:
> > > > 
> > > >     migrate -d tcp:deadhost:port
> > > >      <host tries to connect>
> > > >     migrate_cancel
> > > > 
> > > > previously the status would get stuck in cancelling because
> > > > the final cleanup didn't happen.
> > > > 
> > > >   This is the second part of the fix for:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1525899
> > > 
> > > IIUC this series tries to deliver the connection error a long way
> > > until migrate_fd_connect() to handle it. But, haven't we already have
> > > a function migrate_fd_error() to do that (which is faster, and
> > > simpler)?
> > > 
> > > void migrate_fd_error(MigrationState *s, const Error *error)
> > > {
> > >     trace_migrate_fd_error(error_get_pretty(error));
> > >     assert(s->to_dst_file == NULL);
> > >     migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> > >                       MIGRATION_STATUS_FAILED);
> > >     migrate_set_error(s, error);
> > >     notifier_list_notify(&migration_state_notifiers, s);
> > >     block_cleanup_parameters(s);
> > > }
> > > 
> > > I think it's not handling the case when cancelling.  If we let it to
> > > handle the cancelling case well, would it be a simpler fix?
> > >
> > > Moreover, I think this is another good example that migration is not
> > > handling the cleanup "cleanly" in general... I really hope we can do
> > > this better in 2.12.  I'll see whether I can give it a shot, but in
> > > all cases it'll be after the merging of existing patches since there
> > > are already quite a lot of dangling patches.
> > 
> > No, I think migrate_fd_error is the cause of the problem here, not the
> > answer.
> 
> Could I ask why migrate_fd_error() is problematic?  Yeah I agree that
> we should have a single point to clean things up, then can we call
> migrate_fd_cleanup() somehow inside migrate_fd_error()?
> 
> The thing I don't really understand is: why we want the error be
> delivered via those functions (migration_channel_connect,
> migrate_fd_connect, etc.) to finally be cleaned up.

migrate_fd_cleanup has a lot of code for dealing with different cases:
   a) Closing the to_dst_file
   b) joining the thread if its already running
   c) Cleaning up multifd (stub)
   d) finishing of cancel
   e) notification
   f) block cleanup

we seem to have copied some of those into migrate_fd_error - but not all
of them.  In this case the one we're missing is (d) got finishing
cancel;
when you issue a cancel command we move from whatever state we were in
to the 'cancelling' state, various things get cleaned up and eventually
we move to cancelled; that wasn't happening because we missed the
code in migrate_fd_cleanup out.  So we could copy that code (copied code
is bad) or we could just make sure migrate_fd_cleanup is called like
it normally is.

The other thinking is that at the moment the migration/socket.c and tls.c
etc code has to choose between callbacks into the main migration code,
either migration_channel_connet or migrate_fd_error - now it's simpler,
once you've asked for an outgoing migration you always get a callback
to migration_channel_connect.  Much more predictable.

Dave

> > 
> > If we stick to the simple rule that a migration must always call
> > migrate_fd_cleanup then the cancellation problems are fixed - I think
> > that's how we make migration 'clean' - a single cleanup routine
> > that always gets called.
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling

Posted by Peter Xu 8 years, 1 month ago

On Tue, Dec 19, 2017 at 11:33:21AM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Tue, Dec 19, 2017 at 10:14:08AM +0000, Dr. David Alan Gilbert wrote:
> > > * Peter Xu (peterx@redhat.com) wrote:
> > > > On Fri, Dec 15, 2017 at 05:16:53PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > 
> > > > > Hi,
> > > > >   Where a channel fails asynchronously during connect, call
> > > > > back through the migration code so it can clean up.
> > > > >   In particular this causes the transition of a 'cancelling' state
> > > > > to 'cancelled' in the case of:
> > > > > 
> > > > >     migrate -d tcp:deadhost:port
> > > > >      <host tries to connect>
> > > > >     migrate_cancel
> > > > > 
> > > > > previously the status would get stuck in cancelling because
> > > > > the final cleanup didn't happen.
> > > > > 
> > > > >   This is the second part of the fix for:
> > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1525899
> > > > 
> > > > IIUC this series tries to deliver the connection error a long way
> > > > until migrate_fd_connect() to handle it. But, haven't we already have
> > > > a function migrate_fd_error() to do that (which is faster, and
> > > > simpler)?
> > > > 
> > > > void migrate_fd_error(MigrationState *s, const Error *error)
> > > > {
> > > >     trace_migrate_fd_error(error_get_pretty(error));
> > > >     assert(s->to_dst_file == NULL);
> > > >     migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> > > >                       MIGRATION_STATUS_FAILED);
> > > >     migrate_set_error(s, error);
> > > >     notifier_list_notify(&migration_state_notifiers, s);
> > > >     block_cleanup_parameters(s);
> > > > }
> > > > 
> > > > I think it's not handling the case when cancelling.  If we let it to
> > > > handle the cancelling case well, would it be a simpler fix?
> > > >
> > > > Moreover, I think this is another good example that migration is not
> > > > handling the cleanup "cleanly" in general... I really hope we can do
> > > > this better in 2.12.  I'll see whether I can give it a shot, but in
> > > > all cases it'll be after the merging of existing patches since there
> > > > are already quite a lot of dangling patches.
> > > 
> > > No, I think migrate_fd_error is the cause of the problem here, not the
> > > answer.
> > 
> > Could I ask why migrate_fd_error() is problematic?  Yeah I agree that
> > we should have a single point to clean things up, then can we call
> > migrate_fd_cleanup() somehow inside migrate_fd_error()?
> > 
> > The thing I don't really understand is: why we want the error be
> > delivered via those functions (migration_channel_connect,
> > migrate_fd_connect, etc.) to finally be cleaned up.
> 
> migrate_fd_cleanup has a lot of code for dealing with different cases:
>    a) Closing the to_dst_file
>    b) joining the thread if its already running
>    c) Cleaning up multifd (stub)
>    d) finishing of cancel
>    e) notification
>    f) block cleanup
> 
> we seem to have copied some of those into migrate_fd_error - but not all
> of them.  In this case the one we're missing is (d) got finishing
> cancel;
> when you issue a cancel command we move from whatever state we were in
> to the 'cancelling' state, various things get cleaned up and eventually
> we move to cancelled; that wasn't happening because we missed the
> code in migrate_fd_cleanup out.  So we could copy that code (copied code
> is bad) or we could just make sure migrate_fd_cleanup is called like
> it normally is.

I fully agree on above.

> 
> The other thinking is that at the moment the migration/socket.c and tls.c
> etc code has to choose between callbacks into the main migration code,
> either migration_channel_connet or migrate_fd_error - now it's simpler,
> once you've asked for an outgoing migration you always get a callback
> to migration_channel_connect.  Much more predictable.

This is the point I still don't understand, on why we must go into
migrate_fd_connect(), even if error happens before that point.

What I meant is pasted at the end.  Again, I don't know whether it
works, just want to show what I meant.  I'm fine with current approach
too.  Thanks,

----------------------------------

diff --git a/migration/migration.c b/migration/migration.c                     
index 4de3b551fe..fd9b509ab1 100644    
--- a/migration/migration.c            
+++ b/migration/migration.c            
@@ -1074,8 +1074,10 @@ static void migrate_fd_cleanup(void *opaque)            
 {                                     
     MigrationState *s = opaque;       
                                       
-    qemu_bh_delete(s->cleanup_bh);    
-    s->cleanup_bh = NULL;             
+    if (s->cleanup_bh) {              
+        qemu_bh_delete(s->cleanup_bh);
+        s->cleanup_bh = NULL;         
+    }                                 
                                       
     if (s->to_dst_file) {             
         Error *local_err = NULL;      
@@ -1124,11 +1126,15 @@ void migrate_fd_error(MigrationState *s, const Error *error)                                                                          
 {                                     
     trace_migrate_fd_error(error_get_pretty(error));                          
     assert(s->to_dst_file == NULL);   
+    /*                                
+     * If we are still in setup, switch to failure.  It's also                
+     * possible that the migration has been cancelled, then we do             
+     * nothing here.                  
+     */                               
     migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,                      
                       MIGRATION_STATUS_FAILED);                               
     migrate_set_error(s, error);      
-    notifier_list_notify(&migration_state_notifiers, s);                      
-    block_cleanup_parameters(s);      
+    migrate_fd_cleanup(s);            
 }                                     
                                       
 static void migrate_fd_cancel(MigrationState *s)                              

-- 
Peter Xu

Re: [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling

Posted by Dr. David Alan Gilbert 8 years, 1 month ago

* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Dec 19, 2017 at 11:33:21AM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Tue, Dec 19, 2017 at 10:14:08AM +0000, Dr. David Alan Gilbert wrote:
> > > > * Peter Xu (peterx@redhat.com) wrote:
> > > > > On Fri, Dec 15, 2017 at 05:16:53PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > 
> > > > > > Hi,
> > > > > >   Where a channel fails asynchronously during connect, call
> > > > > > back through the migration code so it can clean up.
> > > > > >   In particular this causes the transition of a 'cancelling' state
> > > > > > to 'cancelled' in the case of:
> > > > > > 
> > > > > >     migrate -d tcp:deadhost:port
> > > > > >      <host tries to connect>
> > > > > >     migrate_cancel
> > > > > > 
> > > > > > previously the status would get stuck in cancelling because
> > > > > > the final cleanup didn't happen.
> > > > > > 
> > > > > >   This is the second part of the fix for:
> > > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1525899
> > > > > 
> > > > > IIUC this series tries to deliver the connection error a long way
> > > > > until migrate_fd_connect() to handle it. But, haven't we already have
> > > > > a function migrate_fd_error() to do that (which is faster, and
> > > > > simpler)?
> > > > > 
> > > > > void migrate_fd_error(MigrationState *s, const Error *error)
> > > > > {
> > > > >     trace_migrate_fd_error(error_get_pretty(error));
> > > > >     assert(s->to_dst_file == NULL);
> > > > >     migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> > > > >                       MIGRATION_STATUS_FAILED);
> > > > >     migrate_set_error(s, error);
> > > > >     notifier_list_notify(&migration_state_notifiers, s);
> > > > >     block_cleanup_parameters(s);
> > > > > }
> > > > > 
> > > > > I think it's not handling the case when cancelling.  If we let it to
> > > > > handle the cancelling case well, would it be a simpler fix?
> > > > >
> > > > > Moreover, I think this is another good example that migration is not
> > > > > handling the cleanup "cleanly" in general... I really hope we can do
> > > > > this better in 2.12.  I'll see whether I can give it a shot, but in
> > > > > all cases it'll be after the merging of existing patches since there
> > > > > are already quite a lot of dangling patches.
> > > > 
> > > > No, I think migrate_fd_error is the cause of the problem here, not the
> > > > answer.
> > > 
> > > Could I ask why migrate_fd_error() is problematic?  Yeah I agree that
> > > we should have a single point to clean things up, then can we call
> > > migrate_fd_cleanup() somehow inside migrate_fd_error()?
> > > 
> > > The thing I don't really understand is: why we want the error be
> > > delivered via those functions (migration_channel_connect,
> > > migrate_fd_connect, etc.) to finally be cleaned up.
> > 
> > migrate_fd_cleanup has a lot of code for dealing with different cases:
> >    a) Closing the to_dst_file
> >    b) joining the thread if its already running
> >    c) Cleaning up multifd (stub)
> >    d) finishing of cancel
> >    e) notification
> >    f) block cleanup
> > 
> > we seem to have copied some of those into migrate_fd_error - but not all
> > of them.  In this case the one we're missing is (d) got finishing
> > cancel;
> > when you issue a cancel command we move from whatever state we were in
> > to the 'cancelling' state, various things get cleaned up and eventually
> > we move to cancelled; that wasn't happening because we missed the
> > code in migrate_fd_cleanup out.  So we could copy that code (copied code
> > is bad) or we could just make sure migrate_fd_cleanup is called like
> > it normally is.
> 
> I fully agree on above.
> 
> > 
> > The other thinking is that at the moment the migration/socket.c and tls.c
> > etc code has to choose between callbacks into the main migration code,
> > either migration_channel_connet or migrate_fd_error - now it's simpler,
> > once you've asked for an outgoing migration you always get a callback
> > to migration_channel_connect.  Much more predictable.
> 
> This is the point I still don't understand, on why we must go into
> migrate_fd_connect(), even if error happens before that point.

To me it just doesn't feel clean, and is the reason this error happened
in the first place.

Dave

> What I meant is pasted at the end.  Again, I don't know whether it
> works, just want to show what I meant.  I'm fine with current approach
> too.  Thanks,
> 
> ----------------------------------
> 
> diff --git a/migration/migration.c b/migration/migration.c                     
> index 4de3b551fe..fd9b509ab1 100644    
> --- a/migration/migration.c            
> +++ b/migration/migration.c            
> @@ -1074,8 +1074,10 @@ static void migrate_fd_cleanup(void *opaque)            
>  {                                     
>      MigrationState *s = opaque;       
>                                        
> -    qemu_bh_delete(s->cleanup_bh);    
> -    s->cleanup_bh = NULL;             
> +    if (s->cleanup_bh) {              
> +        qemu_bh_delete(s->cleanup_bh);
> +        s->cleanup_bh = NULL;         
> +    }                                 
>                                        
>      if (s->to_dst_file) {             
>          Error *local_err = NULL;      
> @@ -1124,11 +1126,15 @@ void migrate_fd_error(MigrationState *s, const Error *error)                                                                          
>  {                                     
>      trace_migrate_fd_error(error_get_pretty(error));                          
>      assert(s->to_dst_file == NULL);   
> +    /*                                
> +     * If we are still in setup, switch to failure.  It's also                
> +     * possible that the migration has been cancelled, then we do             
> +     * nothing here.                  
> +     */                               
>      migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,                      
>                        MIGRATION_STATUS_FAILED);                               
>      migrate_set_error(s, error);      
> -    notifier_list_notify(&migration_state_notifiers, s);                      
> -    block_cleanup_parameters(s);      
> +    migrate_fd_cleanup(s);            
>  }                                     
>                                        
>  static void migrate_fd_cancel(MigrationState *s)                              
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK