When the vhost-user is reconnecting to the backend, and if the vhost-user fails
at the get_features in vhost_dev_init(), then the reconnect will fail
and it will not be retriggered forever.
The reason is:
When the vhost-user fail at get_features, the vhost_dev_cleanup will be called
immediately.
vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.
The reconnect path is:
vhost_user_blk_event
vhost_user_async_close(.. vhost_user_blk_disconnect ..)
qemu_chr_fe_set_handlers <----- clear the notifier callback
schedule vhost_user_async_close_bh
The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
called, then the event fd callback will not be reinstalled.
With this patch, the vhost_user_blk_disconnect will call the
vhost_dev_cleanup() again, it's safe.
All vhost-user devices have this issue, including vhost-user-blk/scsi.
Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")
Signed-off-by: Li Feng <fengli@smartx.com>
---
hw/virtio/vhost-user.c | 10 +---------
1 file changed, 1 insertion(+), 9 deletions(-)
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 8dcf049d42..697b403fe2 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -2648,16 +2648,8 @@ typedef struct {
static void vhost_user_async_close_bh(void *opaque)
{
VhostAsyncCallback *data = opaque;
- struct vhost_dev *vhost = data->vhost;
- /*
- * If the vhost_dev has been cleared in the meantime there is
- * nothing left to do as some other path has completed the
- * cleanup.
- */
- if (vhost->vdev) {
- data->cb(data->dev);
- }
+ data->cb(data->dev);
g_free(data);
}
--
2.41.0
Why can’t we rather fix this by adding a “event_cb” param to vhost_user_async_close and then call qemu_chr_fe_set_handlers in vhost_user_async_close_bh()?
Even if calling vhost_dev_cleanup() twice is safe today I worry future changes may easily stumble over the reconnect case and introduce crashes or double frees.
> On Aug 4, 2023, at 1:29 AM, Li Feng <fengli@smartx.com> wrote:
>
> When the vhost-user is reconnecting to the backend, and if the vhost-user fails
> at the get_features in vhost_dev_init(), then the reconnect will fail
> and it will not be retriggered forever.
>
> The reason is:
> When the vhost-user fail at get_features, the vhost_dev_cleanup will be called
> immediately.
>
> vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.
>
> The reconnect path is:
> vhost_user_blk_event
> vhost_user_async_close(.. vhost_user_blk_disconnect ..)
> qemu_chr_fe_set_handlers <----- clear the notifier callback
> schedule vhost_user_async_close_bh
>
> The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
> called, then the event fd callback will not be reinstalled.
>
> With this patch, the vhost_user_blk_disconnect will call the
> vhost_dev_cleanup() again, it's safe.
>
> All vhost-user devices have this issue, including vhost-user-blk/scsi.
>
> Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")
>
> Signed-off-by: Li Feng <fengli@smartx.com>
> ---
> hw/virtio/vhost-user.c | 10 +---------
> 1 file changed, 1 insertion(+), 9 deletions(-)
>
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 8dcf049d42..697b403fe2 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -2648,16 +2648,8 @@ typedef struct {
> static void vhost_user_async_close_bh(void *opaque)
> {
> VhostAsyncCallback *data = opaque;
> - struct vhost_dev *vhost = data->vhost;
>
> - /*
> - * If the vhost_dev has been cleared in the meantime there is
> - * nothing left to do as some other path has completed the
> - * cleanup.
> - */
> - if (vhost->vdev) {
> - data->cb(data->dev);
> - }
> + data->cb(data->dev);
>
> g_free(data);
> }
> --
> 2.41.0
>
> 2023年8月14日 下午8:11,Raphael Norwitz <raphael.norwitz@nutanix.com> 写道:
>
> Why can’t we rather fix this by adding a “event_cb” param to vhost_user_async_close and then call qemu_chr_fe_set_handlers in vhost_user_async_close_bh()?
>
> Even if calling vhost_dev_cleanup() twice is safe today I worry future changes may easily stumble over the reconnect case and introduce crashes or double frees.
>
I think add a new event_cb is not good enough. ‘qemu_chr_fe_set_handlers’ has been called in vhost_user_async_close, and will be called in event->cb, so why need add a
new event_cb?
For avoiding to call the vhost_dev_cleanup() twice, add a ‘inited’ in struct vhost-dev to mark if it’s inited like this:
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e2f6ffb446..edc80c0231 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1502,6 +1502,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
goto fail_busyloop;
}
+ hdev->inited = true;
return 0;
fail_busyloop:
@@ -1520,6 +1521,10 @@ void vhost_dev_cleanup(struct vhost_dev *hdev)
{
int i;
+ if (!hdev->inited) {
+ return;
+ }
+ hdev->inited = false;
trace_vhost_dev_cleanup(hdev);
for (i = 0; i < hdev->nvqs; ++i) {
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index ca3131b1af..74b1aec960 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -123,6 +123,7 @@ struct vhost_dev {
/* @started: is the vhost device started? */
bool started;
bool log_enabled;
+ bool inited;
uint64_t log_size;
Error *migration_blocker;
const VhostOps *vhost_ops;
Thanks.
>
>> On Aug 4, 2023, at 1:29 AM, Li Feng <fengli@smartx.com> wrote:
>>
>> When the vhost-user is reconnecting to the backend, and if the vhost-user fails
>> at the get_features in vhost_dev_init(), then the reconnect will fail
>> and it will not be retriggered forever.
>>
>> The reason is:
>> When the vhost-user fail at get_features, the vhost_dev_cleanup will be called
>> immediately.
>>
>> vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.
>>
>> The reconnect path is:
>> vhost_user_blk_event
>> vhost_user_async_close(.. vhost_user_blk_disconnect ..)
>> qemu_chr_fe_set_handlers <----- clear the notifier callback
>> schedule vhost_user_async_close_bh
>>
>> The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
>> called, then the event fd callback will not be reinstalled.
>>
>> With this patch, the vhost_user_blk_disconnect will call the
>> vhost_dev_cleanup() again, it's safe.
>>
>> All vhost-user devices have this issue, including vhost-user-blk/scsi.
>>
>> Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")
>>
>> Signed-off-by: Li Feng <fengli@smartx.com>
>> ---
>> hw/virtio/vhost-user.c | 10 +---------
>> 1 file changed, 1 insertion(+), 9 deletions(-)
>>
>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>> index 8dcf049d42..697b403fe2 100644
>> --- a/hw/virtio/vhost-user.c
>> +++ b/hw/virtio/vhost-user.c
>> @@ -2648,16 +2648,8 @@ typedef struct {
>> static void vhost_user_async_close_bh(void *opaque)
>> {
>> VhostAsyncCallback *data = opaque;
>> - struct vhost_dev *vhost = data->vhost;
>>
>> - /*
>> - * If the vhost_dev has been cleared in the meantime there is
>> - * nothing left to do as some other path has completed the
>> - * cleanup.
>> - */
>> - if (vhost->vdev) {
>> - data->cb(data->dev);
>> - }
>> + data->cb(data->dev);
>>
>> g_free(data);
>> }
>> --
>> 2.41.0
>>
>
> On Aug 17, 2023, at 2:40 AM, Li Feng <fengli@smartx.com> wrote:
>
>
>> 2023年8月14日 下午8:11,Raphael Norwitz <raphael.norwitz@nutanix.com> 写道:
>>
>> Why can’t we rather fix this by adding a “event_cb” param to vhost_user_async_close and then call qemu_chr_fe_set_handlers in vhost_user_async_close_bh()?
>>
>> Even if calling vhost_dev_cleanup() twice is safe today I worry future changes may easily stumble over the reconnect case and introduce crashes or double frees.
>>
> I think add a new event_cb is not good enough. ‘qemu_chr_fe_set_handlers’ has been called in vhost_user_async_close, and will be called in event->cb, so why need add a
> new event_cb?
>
I’m suggesting calling the data->event_cb instead of the data->cb if we hit the error case where vhost->vdev is NULL. Something like:
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 8dcf049d42..edf1dccd44 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -2648,6 +2648,10 @@ typedef struct {
static void vhost_user_async_close_bh(void *opaque)
{
VhostAsyncCallback *data = opaque;
+
+ VirtIODevice *vdev = VIRTIO_DEVICE(data->dev);
+ VHostUserBlk *s = VHOST_USER_BLK(vdev);
+
struct vhost_dev *vhost = data->vhost;
/*
@@ -2657,6 +2661,9 @@ static void vhost_user_async_close_bh(void *opaque)
*/
if (vhost->vdev) {
data->cb(data->dev);
+ } else if (data->event_cb) {
+ qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, data->event_cb,
+ NULL, data->dev, NULL, true);
}
g_free(data);
data->event_cb would be vhost_user_blk_event().
I think that makes the error path a lot easier to reason about and more future proof.
> For avoiding to call the vhost_dev_cleanup() twice, add a ‘inited’ in struct vhost-dev to mark if it’s inited like this:
>
This is better than the original, but let me know what you think of my alternative.
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index e2f6ffb446..edc80c0231 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -1502,6 +1502,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
> goto fail_busyloop;
> }
>
> + hdev->inited = true;
> return 0;
>
> fail_busyloop:
> @@ -1520,6 +1521,10 @@ void vhost_dev_cleanup(struct vhost_dev *hdev)
> {
> int i;
>
> + if (!hdev->inited) {
> + return;
> + }
> + hdev->inited = false;
> trace_vhost_dev_cleanup(hdev);
>
> for (i = 0; i < hdev->nvqs; ++i) {
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index ca3131b1af..74b1aec960 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -123,6 +123,7 @@ struct vhost_dev {
> /* @started: is the vhost device started? */
> bool started;
> bool log_enabled;
> + bool inited;
> uint64_t log_size;
> Error *migration_blocker;
> const VhostOps *vhost_ops;
>
> Thanks.
>
>>
>>> On Aug 4, 2023, at 1:29 AM, Li Feng <fengli@smartx.com> wrote:
>>>
>>> When the vhost-user is reconnecting to the backend, and if the vhost-user fails
>>> at the get_features in vhost_dev_init(), then the reconnect will fail
>>> and it will not be retriggered forever.
>>>
>>> The reason is:
>>> When the vhost-user fail at get_features, the vhost_dev_cleanup will be called
>>> immediately.
>>>
>>> vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.
>>>
>>> The reconnect path is:
>>> vhost_user_blk_event
>>> vhost_user_async_close(.. vhost_user_blk_disconnect ..)
>>> qemu_chr_fe_set_handlers <----- clear the notifier callback
>>> schedule vhost_user_async_close_bh
>>>
>>> The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
>>> called, then the event fd callback will not be reinstalled.
>>>
>>> With this patch, the vhost_user_blk_disconnect will call the
>>> vhost_dev_cleanup() again, it's safe.
>>>
>>> All vhost-user devices have this issue, including vhost-user-blk/scsi.
>>>
>>> Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")
>>>
>>> Signed-off-by: Li Feng <fengli@smartx.com>
>>> ---
>>> hw/virtio/vhost-user.c | 10 +---------
>>> 1 file changed, 1 insertion(+), 9 deletions(-)
>>>
>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>> index 8dcf049d42..697b403fe2 100644
>>> --- a/hw/virtio/vhost-user.c
>>> +++ b/hw/virtio/vhost-user.c
>>> @@ -2648,16 +2648,8 @@ typedef struct {
>>> static void vhost_user_async_close_bh(void *opaque)
>>> {
>>> VhostAsyncCallback *data = opaque;
>>> - struct vhost_dev *vhost = data->vhost;
>>>
>>> - /*
>>> - * If the vhost_dev has been cleared in the meantime there is
>>> - * nothing left to do as some other path has completed the
>>> - * cleanup.
>>> - */
>>> - if (vhost->vdev) {
>>> - data->cb(data->dev);
>>> - }
>>> + data->cb(data->dev);
>>>
>>> g_free(data);
>>> }
>>> --
>>> 2.41.0
>>>
>>
>
On 22 Aug 2023, at 8:38 AM, Raphael Norwitz <raphael.norwitz@nutanix.com>
wrote:
On Aug 17, 2023, at 2:40 AM, Li Feng <fengli@smartx.com> wrote:
2023年8月14日 下午8:11,Raphael Norwitz <raphael.norwitz@nutanix.com> 写道:
Why can’t we rather fix this by adding a “event_cb” param to
vhost_user_async_close and then call qemu_chr_fe_set_handlers in
vhost_user_async_close_bh()?
Even if calling vhost_dev_cleanup() twice is safe today I worry future
changes may easily stumble over the reconnect case and introduce crashes or
double frees.
I think add a new event_cb is not good enough. ‘qemu_chr_fe_set_handlers’
has been called in vhost_user_async_close, and will be called in event->cb,
so why need add a
new event_cb?
I’m suggesting calling the data->event_cb instead of the data->cb if we hit
the error case where vhost->vdev is NULL. Something like:
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 8dcf049d42..edf1dccd44 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -2648,6 +2648,10 @@ typedef struct {
static void vhost_user_async_close_bh(void *opaque)
{
VhostAsyncCallback *data = opaque;
+
+ VirtIODevice *vdev = VIRTIO_DEVICE(data->dev);
+ VHostUserBlk *s = VHOST_USER_BLK(vdev);
+
struct vhost_dev *vhost = data->vhost;
/*
@@ -2657,6 +2661,9 @@ static void vhost_user_async_close_bh(void *opaque)
*/
if (vhost->vdev) {
data->cb(data->dev);
+ } else if (data->event_cb) {
+ qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, data->event_cb,
+ NULL, data->dev, NULL, true);
}
g_free(data);
data->event_cb would be vhost_user_blk_event().
I think that makes the error path a lot easier to reason about and more
future proof.
For avoiding to call the vhost_dev_cleanup() twice, add a ‘inited’ in
struct vhost-dev to mark if it’s inited like this:
This is better than the original, but let me know what you think of my
alternative.
The vhost_user_async_close_bh() is a common function in vhost-user.c, and
vhost_user_async_close() is used by vhost-user-scsi/blk/gpio,
However, in your patch it’s limited to VhostUserBlk, so I think my fix is
more reasonable.
Thanks,
LI
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e2f6ffb446..edc80c0231 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1502,6 +1502,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void
*opaque,
goto fail_busyloop;
}
+ hdev->inited = true;
return 0;
fail_busyloop:
@@ -1520,6 +1521,10 @@ void vhost_dev_cleanup(struct vhost_dev *hdev)
{
int i;
+ if (!hdev->inited) {
+ return;
+ }
+ hdev->inited = false;
trace_vhost_dev_cleanup(hdev);
for (i = 0; i < hdev->nvqs; ++i) {
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index ca3131b1af..74b1aec960 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -123,6 +123,7 @@ struct vhost_dev {
/* @started: is the vhost device started? */
bool started;
bool log_enabled;
+ bool inited;
uint64_t log_size;
Error *migration_blocker;
const VhostOps *vhost_ops;
Thanks.
On Aug 4, 2023, at 1:29 AM, Li Feng <fengli@smartx.com> wrote:
When the vhost-user is reconnecting to the backend, and if the vhost-user
fails
at the get_features in vhost_dev_init(), then the reconnect will fail
and it will not be retriggered forever.
The reason is:
When the vhost-user fail at get_features, the vhost_dev_cleanup will be
called
immediately.
vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.
The reconnect path is:
vhost_user_blk_event
vhost_user_async_close(.. vhost_user_blk_disconnect ..)
qemu_chr_fe_set_handlers <----- clear the notifier callback
schedule vhost_user_async_close_bh
The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
called, then the event fd callback will not be reinstalled.
With this patch, the vhost_user_blk_disconnect will call the
vhost_dev_cleanup() again, it's safe.
All vhost-user devices have this issue, including vhost-user-blk/scsi.
Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")
Signed-off-by: Li Feng <fengli@smartx.com>
---
hw/virtio/vhost-user.c | 10 +---------
1 file changed, 1 insertion(+), 9 deletions(-)
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 8dcf049d42..697b403fe2 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -2648,16 +2648,8 @@ typedef struct {
static void vhost_user_async_close_bh(void *opaque)
{
VhostAsyncCallback *data = opaque;
- struct vhost_dev *vhost = data->vhost;
- /*
- * If the vhost_dev has been cleared in the meantime there is
- * nothing left to do as some other path has completed the
- * cleanup.
- */
- if (vhost->vdev) {
- data->cb(data->dev);
- }
+ data->cb(data->dev);
g_free(data);
}
--
2.41.0
> On Aug 22, 2023, at 12:49 AM, Li Feng <fengli@smartx.com> wrote:
>
>
>
>> On 22 Aug 2023, at 8:38 AM, Raphael Norwitz <raphael.norwitz@nutanix.com> wrote:
>>
>>>
>>> On Aug 17, 2023, at 2:40 AM, Li Feng <fengli@smartx.com> wrote:
>>>
>>>
>>>> 2023年8月14日 下午8:11,Raphael Norwitz <raphael.norwitz@nutanix.com> 写道:
>>>>
>>>> Why can’t we rather fix this by adding a “event_cb” param to vhost_user_async_close and then call qemu_chr_fe_set_handlers in vhost_user_async_close_bh()?
>>>>
>>>> Even if calling vhost_dev_cleanup() twice is safe today I worry future changes may easily stumble over the reconnect case and introduce crashes or double frees.
>>>>
>>> I think add a new event_cb is not good enough. ‘qemu_chr_fe_set_handlers’ has been called in vhost_user_async_close, and will be called in event->cb, so why need add a
>>> new event_cb?
>>>
>>
>> I’m suggesting calling the data->event_cb instead of the data->cb if we hit the error case where vhost->vdev is NULL. Something like:
>>
>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>> index 8dcf049d42..edf1dccd44 100644
>> --- a/hw/virtio/vhost-user.c
>> +++ b/hw/virtio/vhost-user.c
>> @@ -2648,6 +2648,10 @@ typedef struct {
>> static void vhost_user_async_close_bh(void *opaque)
>> {
>> VhostAsyncCallback *data = opaque;
>> +
>> + VirtIODevice *vdev = VIRTIO_DEVICE(data->dev);
>> + VHostUserBlk *s = VHOST_USER_BLK(vdev);
>> +
>> struct vhost_dev *vhost = data->vhost;
>>
>> /*
>> @@ -2657,6 +2661,9 @@ static void vhost_user_async_close_bh(void *opaque)
>> */
>> if (vhost->vdev) {
>> data->cb(data->dev);
>> + } else if (data->event_cb) {
>> + qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, data->event_cb,
>> + NULL, data->dev, NULL, true);
>> }
>>
>> g_free(data);
>>
>> data->event_cb would be vhost_user_blk_event().
>>
>> I think that makes the error path a lot easier to reason about and more future proof.
>>
>>> For avoiding to call the vhost_dev_cleanup() twice, add a ‘inited’ in struct vhost-dev to mark if it’s inited like this:
>>>
>>
>> This is better than the original, but let me know what you think of my alternative.
>
> The vhost_user_async_close_bh() is a common function in vhost-user.c, and vhost_user_async_close() is used by vhost-user-scsi/blk/gpio,
> However, in your patch it’s limited to VhostUserBlk, so I think my fix is more reasonable.
I did not write out the full patch.
vhost-user-scsi/blk/gpio would each provide their own ->event_cb, just like they each provide a different ->cb today. Looking at it again, data has the chardev, so no need to use VIRTO_DEVICE() or VHOST_USER_BLK().
The fix generalizes to all device types.
>
> Thanks,
> LI
>>
>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>> index e2f6ffb446..edc80c0231 100644
>>> --- a/hw/virtio/vhost.c
>>> +++ b/hw/virtio/vhost.c
>>> @@ -1502,6 +1502,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>>> goto fail_busyloop;
>>> }
>>>
>>> + hdev->inited = true;
>>> return 0;
>>>
>>> fail_busyloop:
>>> @@ -1520,6 +1521,10 @@ void vhost_dev_cleanup(struct vhost_dev *hdev)
>>> {
>>> int i;
>>>
>>> + if (!hdev->inited) {
>>> + return;
>>> + }
>>> + hdev->inited = false;
>>> trace_vhost_dev_cleanup(hdev);
>>>
>>> for (i = 0; i < hdev->nvqs; ++i) {
>>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>>> index ca3131b1af..74b1aec960 100644
>>> --- a/include/hw/virtio/vhost.h
>>> +++ b/include/hw/virtio/vhost.h
>>> @@ -123,6 +123,7 @@ struct vhost_dev {
>>> /* @started: is the vhost device started? */
>>> bool started;
>>> bool log_enabled;
>>> + bool inited;
>>> uint64_t log_size;
>>> Error *migration_blocker;
>>> const VhostOps *vhost_ops;
>>>
>>> Thanks.
>>>
>>>>
>>>>> On Aug 4, 2023, at 1:29 AM, Li Feng <fengli@smartx.com> wrote:
>>>>>
>>>>> When the vhost-user is reconnecting to the backend, and if the vhost-user fails
>>>>> at the get_features in vhost_dev_init(), then the reconnect will fail
>>>>> and it will not be retriggered forever.
>>>>>
>>>>> The reason is:
>>>>> When the vhost-user fail at get_features, the vhost_dev_cleanup will be called
>>>>> immediately.
>>>>>
>>>>> vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.
>>>>>
>>>>> The reconnect path is:
>>>>> vhost_user_blk_event
>>>>> vhost_user_async_close(.. vhost_user_blk_disconnect ..)
>>>>> qemu_chr_fe_set_handlers <----- clear the notifier callback
>>>>> schedule vhost_user_async_close_bh
>>>>>
>>>>> The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
>>>>> called, then the event fd callback will not be reinstalled.
>>>>>
>>>>> With this patch, the vhost_user_blk_disconnect will call the
>>>>> vhost_dev_cleanup() again, it's safe.
>>>>>
>>>>> All vhost-user devices have this issue, including vhost-user-blk/scsi.
>>>>>
>>>>> Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")
>>>>>
>>>>> Signed-off-by: Li Feng <fengli@smartx.com>
>>>>> ---
>>>>> hw/virtio/vhost-user.c | 10 +---------
>>>>> 1 file changed, 1 insertion(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>>>> index 8dcf049d42..697b403fe2 100644
>>>>> --- a/hw/virtio/vhost-user.c
>>>>> +++ b/hw/virtio/vhost-user.c
>>>>> @@ -2648,16 +2648,8 @@ typedef struct {
>>>>> static void vhost_user_async_close_bh(void *opaque)
>>>>> {
>>>>> VhostAsyncCallback *data = opaque;
>>>>> - struct vhost_dev *vhost = data->vhost;
>>>>>
>>>>> - /*
>>>>> - * If the vhost_dev has been cleared in the meantime there is
>>>>> - * nothing left to do as some other path has completed the
>>>>> - * cleanup.
>>>>> - */
>>>>> - if (vhost->vdev) {
>>>>> - data->cb(data->dev);
>>>>> - }
>>>>> + data->cb(data->dev);
>>>>>
>>>>> g_free(data);
>>>>> }
>>>>> --
>>>>> 2.41.0
> On 22 Aug 2023, at 6:17 PM, Raphael Norwitz <raphael.norwitz@nutanix.com> wrote:
>
>
>
>> On Aug 22, 2023, at 12:49 AM, Li Feng <fengli@smartx.com> wrote:
>>
>>
>>
>>> On 22 Aug 2023, at 8:38 AM, Raphael Norwitz <raphael.norwitz@nutanix.com> wrote:
>>>
>>>>
>>>> On Aug 17, 2023, at 2:40 AM, Li Feng <fengli@smartx.com> wrote:
>>>>
>>>>
>>>>> 2023年8月14日 下午8:11,Raphael Norwitz <raphael.norwitz@nutanix.com> 写道:
>>>>>
>>>>> Why can’t we rather fix this by adding a “event_cb” param to vhost_user_async_close and then call qemu_chr_fe_set_handlers in vhost_user_async_close_bh()?
>>>>>
>>>>> Even if calling vhost_dev_cleanup() twice is safe today I worry future changes may easily stumble over the reconnect case and introduce crashes or double frees.
>>>>>
>>>> I think add a new event_cb is not good enough. ‘qemu_chr_fe_set_handlers’ has been called in vhost_user_async_close, and will be called in event->cb, so why need add a
>>>> new event_cb?
>>>>
>>>
>>> I’m suggesting calling the data->event_cb instead of the data->cb if we hit the error case where vhost->vdev is NULL. Something like:
>>>
>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>> index 8dcf049d42..edf1dccd44 100644
>>> --- a/hw/virtio/vhost-user.c
>>> +++ b/hw/virtio/vhost-user.c
>>> @@ -2648,6 +2648,10 @@ typedef struct {
>>> static void vhost_user_async_close_bh(void *opaque)
>>> {
>>> VhostAsyncCallback *data = opaque;
>>> +
>>> + VirtIODevice *vdev = VIRTIO_DEVICE(data->dev);
>>> + VHostUserBlk *s = VHOST_USER_BLK(vdev);
>>> +
>>> struct vhost_dev *vhost = data->vhost;
>>>
>>> /*
>>> @@ -2657,6 +2661,9 @@ static void vhost_user_async_close_bh(void *opaque)
>>> */
>>> if (vhost->vdev) {
>>> data->cb(data->dev);
>>> + } else if (data->event_cb) {
>>> + qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, data->event_cb,
>>> + NULL, data->dev, NULL, true);
>>> }
>>>
>>> g_free(data);
>>>
>>> data->event_cb would be vhost_user_blk_event().
>>>
>>> I think that makes the error path a lot easier to reason about and more future proof.
>>>
>>>> For avoiding to call the vhost_dev_cleanup() twice, add a ‘inited’ in struct vhost-dev to mark if it’s inited like this:
>>>>
>>>
>>> This is better than the original, but let me know what you think of my alternative.
>>
>> The vhost_user_async_close_bh() is a common function in vhost-user.c, and vhost_user_async_close() is used by vhost-user-scsi/blk/gpio,
>> However, in your patch it’s limited to VhostUserBlk, so I think my fix is more reasonable.
>
> I did not write out the full patch.
>
> vhost-user-scsi/blk/gpio would each provide their own ->event_cb, just like they each provide a different ->cb today. Looking at it again, data has the chardev, so no need to use VIRTO_DEVICE() or VHOST_USER_BLK().
>
> The fix generalizes to all device types.
OK, I will change it in V2.
>
>>
>> Thanks,
>> LI
>>>
>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>>> index e2f6ffb446..edc80c0231 100644
>>>> --- a/hw/virtio/vhost.c
>>>> +++ b/hw/virtio/vhost.c
>>>> @@ -1502,6 +1502,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>>>> goto fail_busyloop;
>>>> }
>>>>
>>>> + hdev->inited = true;
>>>> return 0;
>>>>
>>>> fail_busyloop:
>>>> @@ -1520,6 +1521,10 @@ void vhost_dev_cleanup(struct vhost_dev *hdev)
>>>> {
>>>> int i;
>>>>
>>>> + if (!hdev->inited) {
>>>> + return;
>>>> + }
>>>> + hdev->inited = false;
>>>> trace_vhost_dev_cleanup(hdev);
>>>>
>>>> for (i = 0; i < hdev->nvqs; ++i) {
>>>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>>>> index ca3131b1af..74b1aec960 100644
>>>> --- a/include/hw/virtio/vhost.h
>>>> +++ b/include/hw/virtio/vhost.h
>>>> @@ -123,6 +123,7 @@ struct vhost_dev {
>>>> /* @started: is the vhost device started? */
>>>> bool started;
>>>> bool log_enabled;
>>>> + bool inited;
>>>> uint64_t log_size;
>>>> Error *migration_blocker;
>>>> const VhostOps *vhost_ops;
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>>> On Aug 4, 2023, at 1:29 AM, Li Feng <fengli@smartx.com> wrote:
>>>>>>
>>>>>> When the vhost-user is reconnecting to the backend, and if the vhost-user fails
>>>>>> at the get_features in vhost_dev_init(), then the reconnect will fail
>>>>>> and it will not be retriggered forever.
>>>>>>
>>>>>> The reason is:
>>>>>> When the vhost-user fail at get_features, the vhost_dev_cleanup will be called
>>>>>> immediately.
>>>>>>
>>>>>> vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.
>>>>>>
>>>>>> The reconnect path is:
>>>>>> vhost_user_blk_event
>>>>>> vhost_user_async_close(.. vhost_user_blk_disconnect ..)
>>>>>> qemu_chr_fe_set_handlers <----- clear the notifier callback
>>>>>> schedule vhost_user_async_close_bh
>>>>>>
>>>>>> The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
>>>>>> called, then the event fd callback will not be reinstalled.
>>>>>>
>>>>>> With this patch, the vhost_user_blk_disconnect will call the
>>>>>> vhost_dev_cleanup() again, it's safe.
>>>>>>
>>>>>> All vhost-user devices have this issue, including vhost-user-blk/scsi.
>>>>>>
>>>>>> Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")
>>>>>>
>>>>>> Signed-off-by: Li Feng <fengli@smartx.com>
>>>>>> ---
>>>>>> hw/virtio/vhost-user.c | 10 +---------
>>>>>> 1 file changed, 1 insertion(+), 9 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>>>>> index 8dcf049d42..697b403fe2 100644
>>>>>> --- a/hw/virtio/vhost-user.c
>>>>>> +++ b/hw/virtio/vhost-user.c
>>>>>> @@ -2648,16 +2648,8 @@ typedef struct {
>>>>>> static void vhost_user_async_close_bh(void *opaque)
>>>>>> {
>>>>>> VhostAsyncCallback *data = opaque;
>>>>>> - struct vhost_dev *vhost = data->vhost;
>>>>>>
>>>>>> - /*
>>>>>> - * If the vhost_dev has been cleared in the meantime there is
>>>>>> - * nothing left to do as some other path has completed the
>>>>>> - * cleanup.
>>>>>> - */
>>>>>> - if (vhost->vdev) {
>>>>>> - data->cb(data->dev);
>>>>>> - }
>>>>>> + data->cb(data->dev);
>>>>>>
>>>>>> g_free(data);
>>>>>> }
>>>>>> --
>>>>>> 2.41.0
© 2016 - 2026 Red Hat, Inc.