[v1] Fix vhost reconnect issues

[PATCH 1/2] vhost-user: fix lost reconnect

Posted by Li Feng 2 years, 6 months ago

When the vhost-user is reconnecting to the backend, and if the vhost-user fails
at the get_features in vhost_dev_init(), then the reconnect will fail
and it will not be retriggered forever.

The reason is:
When the vhost-user fail at get_features, the vhost_dev_cleanup will be called
immediately.

vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.

The reconnect path is:
vhost_user_blk_event
   vhost_user_async_close(.. vhost_user_blk_disconnect ..)
     qemu_chr_fe_set_handlers <----- clear the notifier callback
       schedule vhost_user_async_close_bh

The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
called, then the event fd callback will not be reinstalled.

With this patch, the vhost_user_blk_disconnect will call the
vhost_dev_cleanup() again, it's safe.

All vhost-user devices have this issue, including vhost-user-blk/scsi.

Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")

Signed-off-by: Li Feng <fengli@smartx.com>
---
 hw/virtio/vhost-user.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 8dcf049d42..697b403fe2 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -2648,16 +2648,8 @@ typedef struct {
 static void vhost_user_async_close_bh(void *opaque)
 {
     VhostAsyncCallback *data = opaque;
-    struct vhost_dev *vhost = data->vhost;
 
-    /*
-     * If the vhost_dev has been cleared in the meantime there is
-     * nothing left to do as some other path has completed the
-     * cleanup.
-     */
-    if (vhost->vdev) {
-        data->cb(data->dev);
-    }
+    data->cb(data->dev);
 
     g_free(data);
 }
-- 
2.41.0

Re: [PATCH 1/2] vhost-user: fix lost reconnect

Posted by Raphael Norwitz 2 years, 5 months ago

Why can’t we rather fix this by adding a “event_cb” param to vhost_user_async_close and then call qemu_chr_fe_set_handlers in vhost_user_async_close_bh()?

Even if calling vhost_dev_cleanup() twice is safe today I worry future changes may easily stumble over the reconnect case and introduce crashes or double frees.


> On Aug 4, 2023, at 1:29 AM, Li Feng <fengli@smartx.com> wrote:
> 
> When the vhost-user is reconnecting to the backend, and if the vhost-user fails
> at the get_features in vhost_dev_init(), then the reconnect will fail
> and it will not be retriggered forever.
> 
> The reason is:
> When the vhost-user fail at get_features, the vhost_dev_cleanup will be called
> immediately.
> 
> vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.
> 
> The reconnect path is:
> vhost_user_blk_event
>   vhost_user_async_close(.. vhost_user_blk_disconnect ..)
>     qemu_chr_fe_set_handlers <----- clear the notifier callback
>       schedule vhost_user_async_close_bh
> 
> The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
> called, then the event fd callback will not be reinstalled.
> 
> With this patch, the vhost_user_blk_disconnect will call the
> vhost_dev_cleanup() again, it's safe.
> 
> All vhost-user devices have this issue, including vhost-user-blk/scsi.
> 
> Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")
> 
> Signed-off-by: Li Feng <fengli@smartx.com>
> ---
> hw/virtio/vhost-user.c | 10 +---------
> 1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 8dcf049d42..697b403fe2 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -2648,16 +2648,8 @@ typedef struct {
> static void vhost_user_async_close_bh(void *opaque)
> {
>     VhostAsyncCallback *data = opaque;
> -    struct vhost_dev *vhost = data->vhost;
> 
> -    /*
> -     * If the vhost_dev has been cleared in the meantime there is
> -     * nothing left to do as some other path has completed the
> -     * cleanup.
> -     */
> -    if (vhost->vdev) {
> -        data->cb(data->dev);
> -    }
> +    data->cb(data->dev);
> 
>     g_free(data);
> }
> -- 
> 2.41.0
>

Re: [PATCH 1/2] vhost-user: fix lost reconnect

Posted by Li Feng 2 years, 5 months ago

> 2023年8月14日 下午8:11，Raphael Norwitz <raphael.norwitz@nutanix.com> 写道：
> 
> Why can’t we rather fix this by adding a “event_cb” param to vhost_user_async_close and then call qemu_chr_fe_set_handlers in vhost_user_async_close_bh()?
> 
> Even if calling vhost_dev_cleanup() twice is safe today I worry future changes may easily stumble over the reconnect case and introduce crashes or double frees.
> 
I think add a new event_cb is not good enough. ‘qemu_chr_fe_set_handlers’ has been called in vhost_user_async_close, and will be called in event->cb, so why need add a
new event_cb?
 
For avoiding to call the vhost_dev_cleanup() twice, add a ‘inited’ in struct vhost-dev to mark if it’s inited like this:

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e2f6ffb446..edc80c0231 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1502,6 +1502,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
         goto fail_busyloop;
     }

+    hdev->inited = true;
     return 0;

 fail_busyloop:
@@ -1520,6 +1521,10 @@ void vhost_dev_cleanup(struct vhost_dev *hdev)
 {
     int i;

+    if (!hdev->inited) {
+        return;
+    }
+    hdev->inited = false;
     trace_vhost_dev_cleanup(hdev);

     for (i = 0; i < hdev->nvqs; ++i) {
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index ca3131b1af..74b1aec960 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -123,6 +123,7 @@ struct vhost_dev {
     /* @started: is the vhost device started? */
     bool started;
     bool log_enabled;
+    bool inited;
     uint64_t log_size;
     Error *migration_blocker;
     const VhostOps *vhost_ops;

Thanks.

> 
>> On Aug 4, 2023, at 1:29 AM, Li Feng <fengli@smartx.com> wrote:
>> 
>> When the vhost-user is reconnecting to the backend, and if the vhost-user fails
>> at the get_features in vhost_dev_init(), then the reconnect will fail
>> and it will not be retriggered forever.
>> 
>> The reason is:
>> When the vhost-user fail at get_features, the vhost_dev_cleanup will be called
>> immediately.
>> 
>> vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.
>> 
>> The reconnect path is:
>> vhost_user_blk_event
>>  vhost_user_async_close(.. vhost_user_blk_disconnect ..)
>>    qemu_chr_fe_set_handlers <----- clear the notifier callback
>>      schedule vhost_user_async_close_bh
>> 
>> The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
>> called, then the event fd callback will not be reinstalled.
>> 
>> With this patch, the vhost_user_blk_disconnect will call the
>> vhost_dev_cleanup() again, it's safe.
>> 
>> All vhost-user devices have this issue, including vhost-user-blk/scsi.
>> 
>> Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")
>> 
>> Signed-off-by: Li Feng <fengli@smartx.com>
>> ---
>> hw/virtio/vhost-user.c | 10 +---------
>> 1 file changed, 1 insertion(+), 9 deletions(-)
>> 
>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>> index 8dcf049d42..697b403fe2 100644
>> --- a/hw/virtio/vhost-user.c
>> +++ b/hw/virtio/vhost-user.c
>> @@ -2648,16 +2648,8 @@ typedef struct {
>> static void vhost_user_async_close_bh(void *opaque)
>> {
>>    VhostAsyncCallback *data = opaque;
>> -    struct vhost_dev *vhost = data->vhost;
>> 
>> -    /*
>> -     * If the vhost_dev has been cleared in the meantime there is
>> -     * nothing left to do as some other path has completed the
>> -     * cleanup.
>> -     */
>> -    if (vhost->vdev) {
>> -        data->cb(data->dev);
>> -    }
>> +    data->cb(data->dev);
>> 
>>    g_free(data);
>> }
>> -- 
>> 2.41.0
>> 
>

Re: [PATCH 1/2] vhost-user: fix lost reconnect

Posted by Raphael Norwitz 2 years, 5 months ago

> On Aug 17, 2023, at 2:40 AM, Li Feng <fengli@smartx.com> wrote:
> 
> 
>> 2023年8月14日 下午8:11，Raphael Norwitz <raphael.norwitz@nutanix.com> 写道：
>> 
>> Why can’t we rather fix this by adding a “event_cb” param to vhost_user_async_close and then call qemu_chr_fe_set_handlers in vhost_user_async_close_bh()?
>> 
>> Even if calling vhost_dev_cleanup() twice is safe today I worry future changes may easily stumble over the reconnect case and introduce crashes or double frees.
>> 
> I think add a new event_cb is not good enough. ‘qemu_chr_fe_set_handlers’ has been called in vhost_user_async_close, and will be called in event->cb, so why need add a
> new event_cb?
> 

I’m suggesting calling the data->event_cb instead of the data->cb if we hit the error case where vhost->vdev is NULL. Something like:

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 8dcf049d42..edf1dccd44 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -2648,6 +2648,10 @@ typedef struct {
 static void vhost_user_async_close_bh(void *opaque)
 {
     VhostAsyncCallback *data = opaque;
+
+    VirtIODevice *vdev = VIRTIO_DEVICE(data->dev);
+    VHostUserBlk *s = VHOST_USER_BLK(vdev);
+
     struct vhost_dev *vhost = data->vhost;
 
     /*
@@ -2657,6 +2661,9 @@ static void vhost_user_async_close_bh(void *opaque)
      */
     if (vhost->vdev) {
         data->cb(data->dev);
+    } else if (data->event_cb) {
+        qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, data->event_cb,
+                                 NULL, data->dev, NULL, true);
     }
 
     g_free(data);

data->event_cb would be vhost_user_blk_event().

I think that makes the error path a lot easier to reason about and more future proof.

> For avoiding to call the vhost_dev_cleanup() twice, add a ‘inited’ in struct vhost-dev to mark if it’s inited like this:
> 

This is better than the original, but let me know what you think of my alternative.

> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index e2f6ffb446..edc80c0231 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -1502,6 +1502,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>         goto fail_busyloop;
>     }
> 
> +    hdev->inited = true;
>     return 0;
> 
> fail_busyloop:
> @@ -1520,6 +1521,10 @@ void vhost_dev_cleanup(struct vhost_dev *hdev)
> {
>     int i;
> 
> +    if (!hdev->inited) {
> +        return;
> +    }
> +    hdev->inited = false;
>     trace_vhost_dev_cleanup(hdev);
> 
>     for (i = 0; i < hdev->nvqs; ++i) {
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index ca3131b1af..74b1aec960 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -123,6 +123,7 @@ struct vhost_dev {
>     /* @started: is the vhost device started? */
>     bool started;
>     bool log_enabled;
> +    bool inited;
>     uint64_t log_size;
>     Error *migration_blocker;
>     const VhostOps *vhost_ops;
> 
> Thanks.
> 
>> 
>>> On Aug 4, 2023, at 1:29 AM, Li Feng <fengli@smartx.com> wrote:
>>> 
>>> When the vhost-user is reconnecting to the backend, and if the vhost-user fails
>>> at the get_features in vhost_dev_init(), then the reconnect will fail
>>> and it will not be retriggered forever.
>>> 
>>> The reason is:
>>> When the vhost-user fail at get_features, the vhost_dev_cleanup will be called
>>> immediately.
>>> 
>>> vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.
>>> 
>>> The reconnect path is:
>>> vhost_user_blk_event
>>> vhost_user_async_close(.. vhost_user_blk_disconnect ..)
>>>   qemu_chr_fe_set_handlers <----- clear the notifier callback
>>>     schedule vhost_user_async_close_bh
>>> 
>>> The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
>>> called, then the event fd callback will not be reinstalled.
>>> 
>>> With this patch, the vhost_user_blk_disconnect will call the
>>> vhost_dev_cleanup() again, it's safe.
>>> 
>>> All vhost-user devices have this issue, including vhost-user-blk/scsi.
>>> 
>>> Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")
>>> 
>>> Signed-off-by: Li Feng <fengli@smartx.com>
>>> ---
>>> hw/virtio/vhost-user.c | 10 +---------
>>> 1 file changed, 1 insertion(+), 9 deletions(-)
>>> 
>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>> index 8dcf049d42..697b403fe2 100644
>>> --- a/hw/virtio/vhost-user.c
>>> +++ b/hw/virtio/vhost-user.c
>>> @@ -2648,16 +2648,8 @@ typedef struct {
>>> static void vhost_user_async_close_bh(void *opaque)
>>> {
>>>   VhostAsyncCallback *data = opaque;
>>> -    struct vhost_dev *vhost = data->vhost;
>>> 
>>> -    /*
>>> -     * If the vhost_dev has been cleared in the meantime there is
>>> -     * nothing left to do as some other path has completed the
>>> -     * cleanup.
>>> -     */
>>> -    if (vhost->vdev) {
>>> -        data->cb(data->dev);
>>> -    }
>>> +    data->cb(data->dev);
>>> 
>>>   g_free(data);
>>> }
>>> -- 
>>> 2.41.0
>>> 
>> 
>

Re: [PATCH 1/2] vhost-user: fix lost reconnect

Posted by Li Feng 2 years, 5 months ago

On 22 Aug 2023, at 8:38 AM, Raphael Norwitz <raphael.norwitz@nutanix.com>
wrote:


On Aug 17, 2023, at 2:40 AM, Li Feng <fengli@smartx.com> wrote:


2023年8月14日 下午8:11，Raphael Norwitz <raphael.norwitz@nutanix.com> 写道：

Why can’t we rather fix this by adding a “event_cb” param to
vhost_user_async_close and then call qemu_chr_fe_set_handlers in
vhost_user_async_close_bh()?

Even if calling vhost_dev_cleanup() twice is safe today I worry future
changes may easily stumble over the reconnect case and introduce crashes or
double frees.

I think add a new event_cb is not good enough. ‘qemu_chr_fe_set_handlers’
has been called in vhost_user_async_close, and will be called in event->cb,
so why need add a
new event_cb?


I’m suggesting calling the data->event_cb instead of the data->cb if we hit
the error case where vhost->vdev is NULL. Something like:

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 8dcf049d42..edf1dccd44 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -2648,6 +2648,10 @@ typedef struct {
static void vhost_user_async_close_bh(void *opaque)
{
    VhostAsyncCallback *data = opaque;
+
+    VirtIODevice *vdev = VIRTIO_DEVICE(data->dev);
+    VHostUserBlk *s = VHOST_USER_BLK(vdev);
+
    struct vhost_dev *vhost = data->vhost;

    /*
@@ -2657,6 +2661,9 @@ static void vhost_user_async_close_bh(void *opaque)
     */
    if (vhost->vdev) {
        data->cb(data->dev);
+    } else if (data->event_cb) {
+        qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, data->event_cb,
+                                 NULL, data->dev, NULL, true);
    }

    g_free(data);

data->event_cb would be vhost_user_blk_event().

I think that makes the error path a lot easier to reason about and more
future proof.

For avoiding to call the vhost_dev_cleanup() twice, add a ‘inited’ in
struct vhost-dev to mark if it’s inited like this:


This is better than the original, but let me know what you think of my
alternative.


The vhost_user_async_close_bh() is a common function in vhost-user.c, and
vhost_user_async_close() is used by vhost-user-scsi/blk/gpio,
However, in your patch it’s limited to VhostUserBlk, so I think my fix is
more reasonable.

Thanks,
LI


diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e2f6ffb446..edc80c0231 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1502,6 +1502,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void
*opaque,
       goto fail_busyloop;
   }

+    hdev->inited = true;
   return 0;

fail_busyloop:
@@ -1520,6 +1521,10 @@ void vhost_dev_cleanup(struct vhost_dev *hdev)
{
   int i;

+    if (!hdev->inited) {
+        return;
+    }
+    hdev->inited = false;
   trace_vhost_dev_cleanup(hdev);

   for (i = 0; i < hdev->nvqs; ++i) {
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index ca3131b1af..74b1aec960 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -123,6 +123,7 @@ struct vhost_dev {
   /* @started: is the vhost device started? */
   bool started;
   bool log_enabled;
+    bool inited;
   uint64_t log_size;
   Error *migration_blocker;
   const VhostOps *vhost_ops;

Thanks.


On Aug 4, 2023, at 1:29 AM, Li Feng <fengli@smartx.com> wrote:

When the vhost-user is reconnecting to the backend, and if the vhost-user
fails
at the get_features in vhost_dev_init(), then the reconnect will fail
and it will not be retriggered forever.

The reason is:
When the vhost-user fail at get_features, the vhost_dev_cleanup will be
called
immediately.

vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.

The reconnect path is:
vhost_user_blk_event
vhost_user_async_close(.. vhost_user_blk_disconnect ..)
 qemu_chr_fe_set_handlers <----- clear the notifier callback
   schedule vhost_user_async_close_bh

The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
called, then the event fd callback will not be reinstalled.

With this patch, the vhost_user_blk_disconnect will call the
vhost_dev_cleanup() again, it's safe.

All vhost-user devices have this issue, including vhost-user-blk/scsi.

Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")

Signed-off-by: Li Feng <fengli@smartx.com>
---
hw/virtio/vhost-user.c | 10 +---------
1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 8dcf049d42..697b403fe2 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -2648,16 +2648,8 @@ typedef struct {
static void vhost_user_async_close_bh(void *opaque)
{
 VhostAsyncCallback *data = opaque;
-    struct vhost_dev *vhost = data->vhost;

-    /*
-     * If the vhost_dev has been cleared in the meantime there is
-     * nothing left to do as some other path has completed the
-     * cleanup.
-     */
-    if (vhost->vdev) {
-        data->cb(data->dev);
-    }
+    data->cb(data->dev);

 g_free(data);
}
-- 
2.41.0

Re: [PATCH 1/2] vhost-user: fix lost reconnect

Posted by Raphael Norwitz 2 years, 5 months ago


> On Aug 22, 2023, at 12:49 AM, Li Feng <fengli@smartx.com> wrote:
> 
> 
> 
>> On 22 Aug 2023, at 8:38 AM, Raphael Norwitz <raphael.norwitz@nutanix.com> wrote:
>> 
>>> 
>>> On Aug 17, 2023, at 2:40 AM, Li Feng <fengli@smartx.com> wrote:
>>> 
>>> 
>>>> 2023年8月14日 下午8:11，Raphael Norwitz <raphael.norwitz@nutanix.com> 写道：
>>>> 
>>>> Why can’t we rather fix this by adding a “event_cb” param to vhost_user_async_close and then call qemu_chr_fe_set_handlers in vhost_user_async_close_bh()?
>>>> 
>>>> Even if calling vhost_dev_cleanup() twice is safe today I worry future changes may easily stumble over the reconnect case and introduce crashes or double frees.
>>>> 
>>> I think add a new event_cb is not good enough. ‘qemu_chr_fe_set_handlers’ has been called in vhost_user_async_close, and will be called in event->cb, so why need add a
>>> new event_cb?
>>> 
>> 
>> I’m suggesting calling the data->event_cb instead of the data->cb if we hit the error case where vhost->vdev is NULL. Something like:
>> 
>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>> index 8dcf049d42..edf1dccd44 100644
>> --- a/hw/virtio/vhost-user.c
>> +++ b/hw/virtio/vhost-user.c
>> @@ -2648,6 +2648,10 @@ typedef struct {
>> static void vhost_user_async_close_bh(void *opaque)
>> {
>>     VhostAsyncCallback *data = opaque;
>> +
>> +    VirtIODevice *vdev = VIRTIO_DEVICE(data->dev);
>> +    VHostUserBlk *s = VHOST_USER_BLK(vdev);
>> +
>>     struct vhost_dev *vhost = data->vhost;
>> 
>>     /*
>> @@ -2657,6 +2661,9 @@ static void vhost_user_async_close_bh(void *opaque)
>>      */
>>     if (vhost->vdev) {
>>         data->cb(data->dev);
>> +    } else if (data->event_cb) {
>> +        qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, data->event_cb,
>> +                                 NULL, data->dev, NULL, true);
>>     }
>> 
>>     g_free(data);
>> 
>> data->event_cb would be vhost_user_blk_event().
>> 
>> I think that makes the error path a lot easier to reason about and more future proof.
>> 
>>> For avoiding to call the vhost_dev_cleanup() twice, add a ‘inited’ in struct vhost-dev to mark if it’s inited like this:
>>> 
>> 
>> This is better than the original, but let me know what you think of my alternative.
> 
> The vhost_user_async_close_bh() is a common function in vhost-user.c, and vhost_user_async_close() is used by vhost-user-scsi/blk/gpio, 
> However, in your patch it’s limited to VhostUserBlk, so I think my fix is more reasonable.

I did not write out the full patch. 

vhost-user-scsi/blk/gpio would each provide their own ->event_cb, just like they each provide a different ->cb today. Looking at it again, data has the chardev, so no need to use VIRTO_DEVICE() or VHOST_USER_BLK().

The fix generalizes to all device types.

> 
> Thanks,
> LI
>> 
>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>> index e2f6ffb446..edc80c0231 100644
>>> --- a/hw/virtio/vhost.c
>>> +++ b/hw/virtio/vhost.c
>>> @@ -1502,6 +1502,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>>>        goto fail_busyloop;
>>>    }
>>> 
>>> +    hdev->inited = true;
>>>    return 0;
>>> 
>>> fail_busyloop:
>>> @@ -1520,6 +1521,10 @@ void vhost_dev_cleanup(struct vhost_dev *hdev)
>>> {
>>>    int i;
>>> 
>>> +    if (!hdev->inited) {
>>> +        return;
>>> +    }
>>> +    hdev->inited = false;
>>>    trace_vhost_dev_cleanup(hdev);
>>> 
>>>    for (i = 0; i < hdev->nvqs; ++i) {
>>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>>> index ca3131b1af..74b1aec960 100644
>>> --- a/include/hw/virtio/vhost.h
>>> +++ b/include/hw/virtio/vhost.h
>>> @@ -123,6 +123,7 @@ struct vhost_dev {
>>>    /* @started: is the vhost device started? */
>>>    bool started;
>>>    bool log_enabled;
>>> +    bool inited;
>>>    uint64_t log_size;
>>>    Error *migration_blocker;
>>>    const VhostOps *vhost_ops;
>>> 
>>> Thanks.
>>> 
>>>> 
>>>>> On Aug 4, 2023, at 1:29 AM, Li Feng <fengli@smartx.com> wrote:
>>>>> 
>>>>> When the vhost-user is reconnecting to the backend, and if the vhost-user fails
>>>>> at the get_features in vhost_dev_init(), then the reconnect will fail
>>>>> and it will not be retriggered forever.
>>>>> 
>>>>> The reason is:
>>>>> When the vhost-user fail at get_features, the vhost_dev_cleanup will be called
>>>>> immediately.
>>>>> 
>>>>> vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.
>>>>> 
>>>>> The reconnect path is:
>>>>> vhost_user_blk_event
>>>>> vhost_user_async_close(.. vhost_user_blk_disconnect ..)
>>>>>  qemu_chr_fe_set_handlers <----- clear the notifier callback
>>>>>    schedule vhost_user_async_close_bh
>>>>> 
>>>>> The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
>>>>> called, then the event fd callback will not be reinstalled.
>>>>> 
>>>>> With this patch, the vhost_user_blk_disconnect will call the
>>>>> vhost_dev_cleanup() again, it's safe.
>>>>> 
>>>>> All vhost-user devices have this issue, including vhost-user-blk/scsi.
>>>>> 
>>>>> Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")
>>>>> 
>>>>> Signed-off-by: Li Feng <fengli@smartx.com>
>>>>> ---
>>>>> hw/virtio/vhost-user.c | 10 +---------
>>>>> 1 file changed, 1 insertion(+), 9 deletions(-)
>>>>> 
>>>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>>>> index 8dcf049d42..697b403fe2 100644
>>>>> --- a/hw/virtio/vhost-user.c
>>>>> +++ b/hw/virtio/vhost-user.c
>>>>> @@ -2648,16 +2648,8 @@ typedef struct {
>>>>> static void vhost_user_async_close_bh(void *opaque)
>>>>> {
>>>>>  VhostAsyncCallback *data = opaque;
>>>>> -    struct vhost_dev *vhost = data->vhost;
>>>>> 
>>>>> -    /*
>>>>> -     * If the vhost_dev has been cleared in the meantime there is
>>>>> -     * nothing left to do as some other path has completed the
>>>>> -     * cleanup.
>>>>> -     */
>>>>> -    if (vhost->vdev) {
>>>>> -        data->cb(data->dev);
>>>>> -    }
>>>>> +    data->cb(data->dev);
>>>>> 
>>>>>  g_free(data);
>>>>> }
>>>>> -- 
>>>>> 2.41.0

Re: [PATCH 1/2] vhost-user: fix lost reconnect

Posted by Li Feng 2 years, 5 months ago


> On 22 Aug 2023, at 6:17 PM, Raphael Norwitz <raphael.norwitz@nutanix.com> wrote:
> 
> 
> 
>> On Aug 22, 2023, at 12:49 AM, Li Feng <fengli@smartx.com> wrote:
>> 
>> 
>> 
>>> On 22 Aug 2023, at 8:38 AM, Raphael Norwitz <raphael.norwitz@nutanix.com> wrote:
>>> 
>>>> 
>>>> On Aug 17, 2023, at 2:40 AM, Li Feng <fengli@smartx.com> wrote:
>>>> 
>>>> 
>>>>> 2023年8月14日 下午8:11，Raphael Norwitz <raphael.norwitz@nutanix.com> 写道：
>>>>> 
>>>>> Why can’t we rather fix this by adding a “event_cb” param to vhost_user_async_close and then call qemu_chr_fe_set_handlers in vhost_user_async_close_bh()?
>>>>> 
>>>>> Even if calling vhost_dev_cleanup() twice is safe today I worry future changes may easily stumble over the reconnect case and introduce crashes or double frees.
>>>>> 
>>>> I think add a new event_cb is not good enough. ‘qemu_chr_fe_set_handlers’ has been called in vhost_user_async_close, and will be called in event->cb, so why need add a
>>>> new event_cb?
>>>> 
>>> 
>>> I’m suggesting calling the data->event_cb instead of the data->cb if we hit the error case where vhost->vdev is NULL. Something like:
>>> 
>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>> index 8dcf049d42..edf1dccd44 100644
>>> --- a/hw/virtio/vhost-user.c
>>> +++ b/hw/virtio/vhost-user.c
>>> @@ -2648,6 +2648,10 @@ typedef struct {
>>> static void vhost_user_async_close_bh(void *opaque)
>>> {
>>>    VhostAsyncCallback *data = opaque;
>>> +
>>> +    VirtIODevice *vdev = VIRTIO_DEVICE(data->dev);
>>> +    VHostUserBlk *s = VHOST_USER_BLK(vdev);
>>> +
>>>    struct vhost_dev *vhost = data->vhost;
>>> 
>>>    /*
>>> @@ -2657,6 +2661,9 @@ static void vhost_user_async_close_bh(void *opaque)
>>>     */
>>>    if (vhost->vdev) {
>>>        data->cb(data->dev);
>>> +    } else if (data->event_cb) {
>>> +        qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, data->event_cb,
>>> +                                 NULL, data->dev, NULL, true);
>>>    }
>>> 
>>>    g_free(data);
>>> 
>>> data->event_cb would be vhost_user_blk_event().
>>> 
>>> I think that makes the error path a lot easier to reason about and more future proof.
>>> 
>>>> For avoiding to call the vhost_dev_cleanup() twice, add a ‘inited’ in struct vhost-dev to mark if it’s inited like this:
>>>> 
>>> 
>>> This is better than the original, but let me know what you think of my alternative.
>> 
>> The vhost_user_async_close_bh() is a common function in vhost-user.c, and vhost_user_async_close() is used by vhost-user-scsi/blk/gpio, 
>> However, in your patch it’s limited to VhostUserBlk, so I think my fix is more reasonable.
> 
> I did not write out the full patch. 
> 
> vhost-user-scsi/blk/gpio would each provide their own ->event_cb, just like they each provide a different ->cb today. Looking at it again, data has the chardev, so no need to use VIRTO_DEVICE() or VHOST_USER_BLK().
> 
> The fix generalizes to all device types.
OK, I will change it in V2.
> 
>> 
>> Thanks,
>> LI
>>> 
>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>>> index e2f6ffb446..edc80c0231 100644
>>>> --- a/hw/virtio/vhost.c
>>>> +++ b/hw/virtio/vhost.c
>>>> @@ -1502,6 +1502,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>>>>       goto fail_busyloop;
>>>>   }
>>>> 
>>>> +    hdev->inited = true;
>>>>   return 0;
>>>> 
>>>> fail_busyloop:
>>>> @@ -1520,6 +1521,10 @@ void vhost_dev_cleanup(struct vhost_dev *hdev)
>>>> {
>>>>   int i;
>>>> 
>>>> +    if (!hdev->inited) {
>>>> +        return;
>>>> +    }
>>>> +    hdev->inited = false;
>>>>   trace_vhost_dev_cleanup(hdev);
>>>> 
>>>>   for (i = 0; i < hdev->nvqs; ++i) {
>>>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>>>> index ca3131b1af..74b1aec960 100644
>>>> --- a/include/hw/virtio/vhost.h
>>>> +++ b/include/hw/virtio/vhost.h
>>>> @@ -123,6 +123,7 @@ struct vhost_dev {
>>>>   /* @started: is the vhost device started? */
>>>>   bool started;
>>>>   bool log_enabled;
>>>> +    bool inited;
>>>>   uint64_t log_size;
>>>>   Error *migration_blocker;
>>>>   const VhostOps *vhost_ops;
>>>> 
>>>> Thanks.
>>>> 
>>>>> 
>>>>>> On Aug 4, 2023, at 1:29 AM, Li Feng <fengli@smartx.com> wrote:
>>>>>> 
>>>>>> When the vhost-user is reconnecting to the backend, and if the vhost-user fails
>>>>>> at the get_features in vhost_dev_init(), then the reconnect will fail
>>>>>> and it will not be retriggered forever.
>>>>>> 
>>>>>> The reason is:
>>>>>> When the vhost-user fail at get_features, the vhost_dev_cleanup will be called
>>>>>> immediately.
>>>>>> 
>>>>>> vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.
>>>>>> 
>>>>>> The reconnect path is:
>>>>>> vhost_user_blk_event
>>>>>> vhost_user_async_close(.. vhost_user_blk_disconnect ..)
>>>>>> qemu_chr_fe_set_handlers <----- clear the notifier callback
>>>>>>   schedule vhost_user_async_close_bh
>>>>>> 
>>>>>> The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
>>>>>> called, then the event fd callback will not be reinstalled.
>>>>>> 
>>>>>> With this patch, the vhost_user_blk_disconnect will call the
>>>>>> vhost_dev_cleanup() again, it's safe.
>>>>>> 
>>>>>> All vhost-user devices have this issue, including vhost-user-blk/scsi.
>>>>>> 
>>>>>> Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")
>>>>>> 
>>>>>> Signed-off-by: Li Feng <fengli@smartx.com>
>>>>>> ---
>>>>>> hw/virtio/vhost-user.c | 10 +---------
>>>>>> 1 file changed, 1 insertion(+), 9 deletions(-)
>>>>>> 
>>>>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>>>>> index 8dcf049d42..697b403fe2 100644
>>>>>> --- a/hw/virtio/vhost-user.c
>>>>>> +++ b/hw/virtio/vhost-user.c
>>>>>> @@ -2648,16 +2648,8 @@ typedef struct {
>>>>>> static void vhost_user_async_close_bh(void *opaque)
>>>>>> {
>>>>>> VhostAsyncCallback *data = opaque;
>>>>>> -    struct vhost_dev *vhost = data->vhost;
>>>>>> 
>>>>>> -    /*
>>>>>> -     * If the vhost_dev has been cleared in the meantime there is
>>>>>> -     * nothing left to do as some other path has completed the
>>>>>> -     * cleanup.
>>>>>> -     */
>>>>>> -    if (vhost->vdev) {
>>>>>> -        data->cb(data->dev);
>>>>>> -    }
>>>>>> +    data->cb(data->dev);
>>>>>> 
>>>>>> g_free(data);
>>>>>> }
>>>>>> -- 
>>>>>> 2.41.0

[PATCH 1/2] vhost-user: fix lost reconnect
[PATCH 2/2] vhost: Add Error parameter to vhost_scsi_common_start()