QEMU will abort when the vhost-user process is restarted during migration
and vhost_log_global_start/stop is called. The reason is that
vhost_dev_set_log returns -1 because the network connection is lost.

To handle this situation, let's cancel the migration by setting the migrate
state to failure and reporting it to the user.
---
hw/virtio/vhost.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index ddc42f0..92725f7 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -26,6 +26,8 @@
#include "hw/virtio/virtio-bus.h"
#include "hw/virtio/virtio-access.h"
#include "migration/blocker.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
#include "sysemu/dma.h"
/* enabled until disconnected backend stabilizes */
@@ -885,7 +887,10 @@ static void vhost_log_global_start(MemoryListener *listener)
r = vhost_migration_log(listener, true);
if (r < 0) {
- abort();
+ error_report("Failed to start vhost dirty log");
+ if (migrate_get_current()->migration_thread_running) {
+ qemu_file_set_error(migrate_get_current()->to_dst_file, -ECHILD);
+ }
}
}
@@ -895,7 +900,10 @@ static void vhost_log_global_stop(MemoryListener *listener)
r = vhost_migration_log(listener, false);
if (r < 0) {
- abort();
+ error_report("Failed to stop vhost dirty log");
+ if (migrate_get_current()->migration_thread_running) {
+ qemu_file_set_error(migrate_get_current()->to_dst_file, -ECHILD);
+ }
}
}
--
1.8.3.1
On Fri, Dec 01, 2017 at 01:58:32PM +0800, fangying wrote:
> QEMU will abort when the vhost-user process is restarted during migration
> and vhost_log_global_start/stop is called. The reason is that
> vhost_dev_set_log returns -1 because the network connection is lost.
>
> To handle this situation, let's cancel the migration by setting the migrate
> state to failure and reporting it to the user.
In fact I don't see this as the right way to fix it. Backend is dead so why
not just proceed with migration? We just need to make sure we re-send
migration data on re-connect.
> ---
> hw/virtio/vhost.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index ddc42f0..92725f7 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -26,6 +26,8 @@
> #include "hw/virtio/virtio-bus.h"
> #include "hw/virtio/virtio-access.h"
> #include "migration/blocker.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> #include "sysemu/dma.h"
>
> /* enabled until disconnected backend stabilizes */
> @@ -885,7 +887,10 @@ static void vhost_log_global_start(MemoryListener *listener)
>
> r = vhost_migration_log(listener, true);
> if (r < 0) {
> - abort();
> + error_report("Failed to start vhost dirty log");
> + if (migrate_get_current()->migration_thread_running) {
> + qemu_file_set_error(migrate_get_current()->to_dst_file, -ECHILD);
> + }
> }
> }
>
> @@ -895,7 +900,10 @@ static void vhost_log_global_stop(MemoryListener *listener)
>
> r = vhost_migration_log(listener, false);
> if (r < 0) {
> - abort();
> + error_report("Failed to stop vhost dirty log");
> + if (migrate_get_current()->migration_thread_running) {
> + qemu_file_set_error(migrate_get_current()->to_dst_file, -ECHILD);
> + }
> }
> }
>
> --
> 1.8.3.1
>
On 2017/12/1 22:39, Michael S. Tsirkin wrote:
> On Fri, Dec 01, 2017 at 01:58:32PM +0800, fangying wrote:
>> QEMU will abort when the vhost-user process is restarted during migration
>> and vhost_log_global_start/stop is called. The reason is that
>> vhost_dev_set_log returns -1 because the network connection is lost.
>>
>> To handle this situation, let's cancel the migration by setting the migrate
>> state to failure and reporting it to the user.
>
> In fact I don't see this as the right way to fix it. Backend is dead so why
> not just proceed with migration? We just need to make sure we re-send
> migration data on re-connect.
This is where vhost starts/stops the migration dirty log. The original code
aborts qemu here because the vhost data stream may break down if we fail to
start/stop the vhost dirty log during migration. The backend may be active
again after vhost_log_global_start:

dirty log start ----------------- dirty log stop
      ^                       ^
      |                       |
----- backend dead ----- backend active

Currently we don't re-send migration data on re-connect in this situation.
Maybe we should work it out.
>> ---
>> hw/virtio/vhost.c | 12 ++++++++++--
>> 1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>> index ddc42f0..92725f7 100644
>> --- a/hw/virtio/vhost.c
>> +++ b/hw/virtio/vhost.c
>> @@ -26,6 +26,8 @@
>> #include "hw/virtio/virtio-bus.h"
>> #include "hw/virtio/virtio-access.h"
>> #include "migration/blocker.h"
>> +#include "migration/migration.h"
>> +#include "migration/qemu-file.h"
>> #include "sysemu/dma.h"
>>
>> /* enabled until disconnected backend stabilizes */
>> @@ -885,7 +887,10 @@ static void vhost_log_global_start(MemoryListener *listener)
>>
>> r = vhost_migration_log(listener, true);
>> if (r < 0) {
>> - abort();
>> + error_report("Failed to start vhost dirty log");
>> + if (migrate_get_current()->migration_thread_running) {
>> + qemu_file_set_error(migrate_get_current()->to_dst_file, -ECHILD);
>> + }
>> }
>> }
>>
>> @@ -895,7 +900,10 @@ static void vhost_log_global_stop(MemoryListener *listener)
>>
>> r = vhost_migration_log(listener, false);
>> if (r < 0) {
>> - abort();
>> + error_report("Failed to stop vhost dirty log");
>> + if (migrate_get_current()->migration_thread_running) {
>> + qemu_file_set_error(migrate_get_current()->to_dst_file, -ECHILD);
>> + }
>> }
>> }
>>
>> --
>> 1.8.3.1
>>
>
> .
>
On Wed, Dec 06, 2017 at 09:30:27PM +0800, Ying Fang wrote:
>
> On 2017/12/1 22:39, Michael S. Tsirkin wrote:
> > On Fri, Dec 01, 2017 at 01:58:32PM +0800, fangying wrote:
> >> QEMU will abort when the vhost-user process is restarted during migration
> >> and vhost_log_global_start/stop is called. The reason is that
> >> vhost_dev_set_log returns -1 because the network connection is lost.
> >>
> >> To handle this situation, let's cancel the migration by setting the migrate
> >> state to failure and reporting it to the user.
> >
> > In fact I don't see this as the right way to fix it. Backend is dead so why
> > not just proceed with migration? We just need to make sure we re-send
> > migration data on re-connect.
>
> This is where vhost starts/stops the migration dirty log. The original code
> aborts qemu here because the vhost data stream may break down if we fail to
> start/stop the vhost dirty log during migration. The backend may be active
> again after vhost_log_global_start:
>
> dirty log start ----------------- dirty log stop
>       ^                       ^
>       |                       |
> ----- backend dead ----- backend active

I'm sorry, I don't understand yet. Backend is active after logging started -
why is this a problem?

> Currently we don't re-send migration data on re-connect in this situation.
> Maybe we should work it out.

So basically backend connects after logging started, and we do not tell it
to start logging and where - is that the issue? I agree, that would be a
bug then.

-- 
MST
On 2017/12/7 0:34, Michael S. Tsirkin wrote:
> On Wed, Dec 06, 2017 at 09:30:27PM +0800, Ying Fang wrote:
>>
>> On 2017/12/1 22:39, Michael S. Tsirkin wrote:
>>> On Fri, Dec 01, 2017 at 01:58:32PM +0800, fangying wrote:
>>>> QEMU will abort when the vhost-user process is restarted during migration
>>>> and vhost_log_global_start/stop is called. The reason is that
>>>> vhost_dev_set_log returns -1 because the network connection is lost.
>>>>
>>>> To handle this situation, let's cancel the migration by setting the migrate
>>>> state to failure and reporting it to the user.
>>>
>>> In fact I don't see this as the right way to fix it. Backend is dead so why
>>> not just proceed with migration? We just need to make sure we re-send
>>> migration data on re-connect.
>>
>> This is where vhost starts/stops the migration dirty log. The original code
>> aborts qemu here because the vhost data stream may break down if we fail to
>> start/stop the vhost dirty log during migration. The backend may be active
>> again after vhost_log_global_start:
>>
>> dirty log start ----------------- dirty log stop
>>       ^                       ^
>>       |                       |
>> ----- backend dead ----- backend active
>
> I'm sorry, I don't understand yet. Backend is active after logging started -
> why is this a problem?

Sorry, I did not explain it well. If the backend is dead when dirty log start
is called, vhost_dev_set_log/vhost_dev_set_features may fail because the
connection is temporarily lost. So even if migration is in progress and the
vhost-user backend is active again later, vhost-user dirty memory is not
logged.

>
>> Currently we don't re-send migration data on re-connect in this situation.
>> Maybe we should work it out.
>
> So basically backend connects after logging started, and we do not tell it
> to start logging and where - is that the issue? I agree, that would be a
> bug then.
>

Yes, this is just the issue.