For now we only log the vhost device error when the virtqueue is actually
stopped. Let's add a QAPI event, which makes it possible to:
- collect statistics of such errors
- take immediate action: capture core dumps or do some other debugging
- inform the user through a management API or UI, so that (s)he can
react somehow, e.g. reset the device driver in the guest or even
build up some automation to do so
Note that basically every inconsistency discovered during virtqueue
processing results in a silent virtqueue stop. The guest then just
sees the requests getting stuck somewhere in the device for no visible
reason. This event provides a means to inform the management layer of
this situation in a timely fashion.
The event could be reused for some other virtqueue problems (not only
for vhost devices) in future. For this it gets a generic name and
structure.
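For illustration, on the wire the event would look something like this
(device id, QOM path and timestamp are made-up example values):

  {"event": "VIRTQUEUE_ERROR",
   "data": {"device": "virtio-net0",
            "qom-path": "/machine/peripheral/virtio-net0/virtio-backend",
            "virtqueue": 2,
            "error": "vhost-vring-error",
            "description": "vhost reported failure through vring error fd"},
   "timestamp": {"seconds": 1744190000, "microseconds": 0}}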
We keep the original VHOST_OPS_DEBUG(), so the existing debug output stays
as is; it's not the only call to VHOST_OPS_DEBUG in the file.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
---
v6: rename path to qom-path, and improve throttling of the event
improve wording
hw/virtio/vhost.c | 12 +++++++++---
monitor/monitor.c | 14 ++++++++++++++
qapi/qdev.json | 32 ++++++++++++++++++++++++++++++++
3 files changed, 55 insertions(+), 3 deletions(-)
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 6aa72fd434..0b205cef73 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -15,6 +15,7 @@
#include "qemu/osdep.h"
#include "qapi/error.h"
+#include "qapi/qapi-events-qdev.h"
#include "hw/virtio/vhost.h"
#include "qemu/atomic.h"
#include "qemu/range.h"
@@ -1442,11 +1443,16 @@ static void vhost_virtqueue_error_notifier(EventNotifier *n)
struct vhost_virtqueue *vq = container_of(n, struct vhost_virtqueue,
error_notifier);
struct vhost_dev *dev = vq->dev;
- int index = vq - dev->vqs;
if (event_notifier_test_and_clear(n) && dev->vdev) {
- VHOST_OPS_DEBUG(-EINVAL, "vhost vring error in virtqueue %d",
- dev->vq_index + index);
+ int ind = vq - dev->vqs + dev->vq_index;
+ DeviceState *ds = &dev->vdev->parent_obj;
+
+ VHOST_OPS_DEBUG(-EINVAL, "vhost vring error in virtqueue %d", ind);
+ qapi_event_send_virtqueue_error(ds->id, ds->canonical_path, ind,
+ VIRTQUEUE_ERROR_VHOST_VRING_ERROR,
+ "vhost reported failure through vring "
+ "error fd");
}
}
diff --git a/monitor/monitor.c b/monitor/monitor.c
index c5a5d30877..11c8859703 100644
--- a/monitor/monitor.c
+++ b/monitor/monitor.c
@@ -313,6 +313,7 @@ static MonitorQAPIEventConf monitor_qapi_event_conf[QAPI_EVENT__MAX] = {
[QAPI_EVENT_BALLOON_CHANGE] = { 1000 * SCALE_MS },
[QAPI_EVENT_QUORUM_REPORT_BAD] = { 1000 * SCALE_MS },
[QAPI_EVENT_QUORUM_FAILURE] = { 1000 * SCALE_MS },
+ [QAPI_EVENT_VIRTQUEUE_ERROR] = { 1000 * SCALE_MS },
[QAPI_EVENT_VSERPORT_CHANGE] = { 1000 * SCALE_MS },
[QAPI_EVENT_MEMORY_DEVICE_SIZE_CHANGE] = { 1000 * SCALE_MS },
[QAPI_EVENT_HV_BALLOON_STATUS_REPORT] = { 1000 * SCALE_MS },
@@ -499,6 +500,12 @@ static unsigned int qapi_event_throttle_hash(const void *key)
hash += g_str_hash(qdict_get_str(evstate->data, "qom-path"));
}
+ if (evstate->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
+ uint64_t virtqueue = qdict_get_int(evstate->data, "virtqueue");
+ hash += g_str_hash(qdict_get_str(evstate->data, "qom-path")) ^
+ g_int64_hash(&virtqueue);
+ }
+
return hash;
}
@@ -527,6 +534,13 @@ static gboolean qapi_event_throttle_equal(const void *a, const void *b)
qdict_get_str(evb->data, "qom-path"));
}
+ if (eva->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
+ return !strcmp(qdict_get_str(eva->data, "qom-path"),
+ qdict_get_str(evb->data, "qom-path")) &&
+ (qdict_get_int(eva->data, "virtqueue") ==
+ qdict_get_int(evb->data, "virtqueue"));
+ }
+
return TRUE;
}
diff --git a/qapi/qdev.json b/qapi/qdev.json
index 25cbcf977b..ddfae18761 100644
--- a/qapi/qdev.json
+++ b/qapi/qdev.json
@@ -187,3 +187,35 @@
{ 'command': 'device-sync-config',
'features': [ 'unstable' ],
'data': {'id': 'str'} }
+
+##
+# @VirtqueueError:
+#
+# @vhost-vring-error: the vhost device has communicated failure via
+# the vring error file descriptor
+#
+# Since: 10.1
+##
+{ 'enum': 'VirtqueueError',
+ 'data': [ 'vhost-vring-error' ] }
+
+##
+# @VIRTQUEUE_ERROR:
+#
+# Emitted when a device virtqueue fails at runtime.
+#
+# @device: the device's ID if it has one
+#
+# @qom-path: the device's QOM path
+#
+# @virtqueue: the index of the virtqueue that failed
+#
+# @error: error identifier
+#
+# @description: human readable description
+#
+# Since: 10.1
+##
+{ 'event': 'VIRTQUEUE_ERROR',
+ 'data': { '*device': 'str', 'qom-path': 'str', 'virtqueue': 'int',
+ 'error': 'VirtqueueError', 'description': 'str'} }
--
2.48.1
Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> writes:
> For now we only log the vhost device error, when virtqueue is actually
> stopped. Let's add a QAPI event, which makes possible:
>
> - collect statistics of such errors
> - make immediate actions: take core dumps or do some other debugging
> - inform the user through a management API or UI, so that (s)he can
> react somehow, e.g. reset the device driver in the guest or even
> build up some automation to do so
>
> Note that basically every inconsistency discovered during virtqueue
> processing results in a silent virtqueue stop. The guest then just
> sees the requests getting stuck somewhere in the device for no visible
> reason. This event provides a means to inform the management layer of
> this situation in a timely fashion.
>
> The event could be reused for some other virtqueue problems (not only
> for vhost devices) in future. For this it gets a generic name and
> structure.
>
> We keep original VHOST_OPS_DEBUG(), to keep original debug output as is
> here, it's not the only call to VHOST_OPS_DEBUG in the file.
Likely should be tracepoints. Not this patch's problem, though.
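(Something like a hw/virtio/trace-events entry, name made up just to
illustrate:

  vhost_virtqueue_error(int index) "vhost vring error in virtqueue %d"
)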
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
> ---
>
> v6: rename path to qom-path, and improve throttling of the event
> improve wording
>
> hw/virtio/vhost.c | 12 +++++++++---
> monitor/monitor.c | 14 ++++++++++++++
> qapi/qdev.json | 32 ++++++++++++++++++++++++++++++++
> 3 files changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 6aa72fd434..0b205cef73 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -15,6 +15,7 @@
>
> #include "qemu/osdep.h"
> #include "qapi/error.h"
> +#include "qapi/qapi-events-qdev.h"
> #include "hw/virtio/vhost.h"
> #include "qemu/atomic.h"
> #include "qemu/range.h"
> @@ -1442,11 +1443,16 @@ static void vhost_virtqueue_error_notifier(EventNotifier *n)
> struct vhost_virtqueue *vq = container_of(n, struct vhost_virtqueue,
> error_notifier);
> struct vhost_dev *dev = vq->dev;
> - int index = vq - dev->vqs;
>
> if (event_notifier_test_and_clear(n) && dev->vdev) {
> - VHOST_OPS_DEBUG(-EINVAL, "vhost vring error in virtqueue %d",
> - dev->vq_index + index);
> + int ind = vq - dev->vqs + dev->vq_index;
> + DeviceState *ds = &dev->vdev->parent_obj;
> +
> + VHOST_OPS_DEBUG(-EINVAL, "vhost vring error in virtqueue %d", ind);
> + qapi_event_send_virtqueue_error(ds->id, ds->canonical_path, ind,
> + VIRTQUEUE_ERROR_VHOST_VRING_ERROR,
> + "vhost reported failure through vring "
> + "error fd");
> }
> }
>
> diff --git a/monitor/monitor.c b/monitor/monitor.c
> index c5a5d30877..11c8859703 100644
> --- a/monitor/monitor.c
> +++ b/monitor/monitor.c
> @@ -313,6 +313,7 @@ static MonitorQAPIEventConf monitor_qapi_event_conf[QAPI_EVENT__MAX] = {
> [QAPI_EVENT_BALLOON_CHANGE] = { 1000 * SCALE_MS },
> [QAPI_EVENT_QUORUM_REPORT_BAD] = { 1000 * SCALE_MS },
> [QAPI_EVENT_QUORUM_FAILURE] = { 1000 * SCALE_MS },
> + [QAPI_EVENT_VIRTQUEUE_ERROR] = { 1000 * SCALE_MS },
> [QAPI_EVENT_VSERPORT_CHANGE] = { 1000 * SCALE_MS },
> [QAPI_EVENT_MEMORY_DEVICE_SIZE_CHANGE] = { 1000 * SCALE_MS },
> [QAPI_EVENT_HV_BALLOON_STATUS_REPORT] = { 1000 * SCALE_MS },
> @@ -499,6 +500,12 @@ static unsigned int qapi_event_throttle_hash(const void *key)
> hash += g_str_hash(qdict_get_str(evstate->data, "qom-path"));
> }
>
> + if (evstate->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
> + uint64_t virtqueue = qdict_get_int(evstate->data, "virtqueue");
> + hash += g_str_hash(qdict_get_str(evstate->data, "qom-path")) ^
> + g_int64_hash(&virtqueue);
> + }
> +
> return hash;
> }
>
> @@ -527,6 +534,13 @@ static gboolean qapi_event_throttle_equal(const void *a, const void *b)
> qdict_get_str(evb->data, "qom-path"));
> }
>
> + if (eva->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
> + return !strcmp(qdict_get_str(eva->data, "qom-path"),
> + qdict_get_str(evb->data, "qom-path")) &&
> + (qdict_get_int(eva->data, "virtqueue") ==
> + qdict_get_int(evb->data, "virtqueue"));
> + }
> +
> return TRUE;
> }
>
Rate-limiting is now per virt queue. It was per device in previous
revisions. Worth it?
> diff --git a/qapi/qdev.json b/qapi/qdev.json
> index 25cbcf977b..ddfae18761 100644
> --- a/qapi/qdev.json
> +++ b/qapi/qdev.json
> @@ -187,3 +187,35 @@
> { 'command': 'device-sync-config',
> 'features': [ 'unstable' ],
> 'data': {'id': 'str'} }
> +
> +##
> +# @VirtqueueError:
> +#
> +# @vhost-vring-error: the vhost device has communicated failure via
> +# the vring error file descriptor
> +#
> +# Since: 10.1
> +##
> +{ 'enum': 'VirtqueueError',
> + 'data': [ 'vhost-vring-error' ] }
> +
> +##
> +# @VIRTQUEUE_ERROR:
> +#
> +# Emitted when a device virtqueue fails at runtime.
> +#
> +# @device: the device's ID if it has one
> +#
> +# @qom-path: the device's QOM path
> +#
> +# @virtqueue: the index of the virtqueue that failed
> +#
> +# @error: error identifier
> +#
> +# @description: human readable description
> +#
> +# Since: 10.1
> +##
> +{ 'event': 'VIRTQUEUE_ERROR',
> + 'data': { '*device': 'str', 'qom-path': 'str', 'virtqueue': 'int',
> + 'error': 'VirtqueueError', 'description': 'str'} }
Standard question for events: can a management application poll for the
information as well?
I might have asked this before, I don't remember. If you already
answered it, feel free to point me to your answer.
Why is this a standard question for events? Say, a management
application wants to track the state of X. Two ways: poll the state
with a query command that returns it, listen for events that report a
change of X.
Listening for an event is more efficient.
However, if the management application connects to a QEMU instance, X
could be anything, so it needs to poll once.
Special case: the management application restarts for some reason.
On 09.04.25 13:48, Markus Armbruster wrote:
> Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> writes:
>
>> For now we only log the vhost device error, when virtqueue is actually
>> stopped. Let's add a QAPI event, which makes possible:
>>
>> - collect statistics of such errors
>> - make immediate actions: take core dumps or do some other debugging
>> - inform the user through a management API or UI, so that (s)he can
>> react somehow, e.g. reset the device driver in the guest or even
>> build up some automation to do so
>>
>> Note that basically every inconsistency discovered during virtqueue
>> processing results in a silent virtqueue stop. The guest then just
>> sees the requests getting stuck somewhere in the device for no visible
>> reason. This event provides a means to inform the management layer of
>> this situation in a timely fashion.
>>
>> The event could be reused for some other virtqueue problems (not only
>> for vhost devices) in future. For this it gets a generic name and
>> structure.
>>
>> We keep original VHOST_OPS_DEBUG(), to keep original debug output as is
>> here, it's not the only call to VHOST_OPS_DEBUG in the file.
>
> Likely should be tracepoints. Not this patch's problem, though.
>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
>> ---
>>
>> v6: rename path to qom-path, and improve throttling of the event
>> improve wording
>>
[..]
>> @@ -527,6 +534,13 @@ static gboolean qapi_event_throttle_equal(const void *a, const void *b)
>> qdict_get_str(evb->data, "qom-path"));
>> }
>>
>> + if (eva->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
>> + return !strcmp(qdict_get_str(eva->data, "qom-path"),
>> + qdict_get_str(evb->data, "qom-path")) &&
>> + (qdict_get_int(eva->data, "virtqueue") ==
>> + qdict_get_int(evb->data, "virtqueue"));
>> + }
>> +
>> return TRUE;
>> }
>>
>
> Rate-limiting is now per virt queue. It was per device in previous
> revisions. Worth it?
>
Hmm. Probably not. If we have 2 virtqueues, it seems good to see both events
(or only one, if only one virtqueue failed).
If we have 256 virtqueues, 256 immediate events seem too much.
So it's better to drop virtqueue here and consider only qom-path for throttling.
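Something along these lines in qapi_event_throttle_hash() / _equal(), i.e.
the same as the existing qom-path-only handling (just a sketch of the idea):

    if (evstate->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
        /* throttle per device: hash only the qom-path */
        hash += g_str_hash(qdict_get_str(evstate->data, "qom-path"));
    }

    if (eva->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
        /* two events are equal if they refer to the same device */
        return !strcmp(qdict_get_str(eva->data, "qom-path"),
                       qdict_get_str(evb->data, "qom-path"));
    }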
>> diff --git a/qapi/qdev.json b/qapi/qdev.json
>> index 25cbcf977b..ddfae18761 100644
>> --- a/qapi/qdev.json
>> +++ b/qapi/qdev.json
>> @@ -187,3 +187,35 @@
>> { 'command': 'device-sync-config',
>> 'features': [ 'unstable' ],
>> 'data': {'id': 'str'} }
>> +
>> +##
>> +# @VirtqueueError:
>> +#
>> +# @vhost-vring-error: the vhost device has communicated failure via
>> +# the vring error file descriptor
>> +#
>> +# Since: 10.1
>> +##
>> +{ 'enum': 'VirtqueueError',
>> + 'data': [ 'vhost-vring-error' ] }
>> +
>> +##
>> +# @VIRTQUEUE_ERROR:
>> +#
>> +# Emitted when a device virtqueue fails at runtime.
>> +#
>> +# @device: the device's ID if it has one
>> +#
>> +# @qom-path: the device's QOM path
>> +#
>> +# @virtqueue: the index of the virtqueue that failed
>> +#
>> +# @error: error identifier
>> +#
>> +# @description: human readable description
>> +#
>> +# Since: 10.1
>> +##
>> +{ 'event': 'VIRTQUEUE_ERROR',
>> + 'data': { '*device': 'str', 'qom-path': 'str', 'virtqueue': 'int',
>> + 'error': 'VirtqueueError', 'description': 'str'} }
>
> Standard question for events: can a management application poll for the
> information as well?
Oh, that's a good point.

I'm afraid it can't. And this made me dig into the history of this patch
- no, we didn't discuss it before.

And before trying to implement something new here (a way to get a kind of
virtqueue status through a new QMP command), I checked that:
- our mgmt tool still doesn't use the VIRTQUEUE_ERROR event (which we've
merged into our downstream QEMU long ago, of course)
- the original problem that led us to introducing such an event hasn't
bothered us for a long time

It seems wiser to stop here for now. I should have considered these aspects
before reviving this series. Sorry for your time.

Still, if we (or someone else) need such an event in the future - good, we have
an up-to-date patch on the mailing list to start from.
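If someone picks this up later, the polling counterpart I had in mind would be
roughly the following (command name and shape are made up, not part of this
series):

  { 'struct': 'VirtqueueStatusInfo',
    'data': { 'virtqueue': 'int', '*error': 'VirtqueueError' } }

  { 'command': 'x-query-virtqueue-status',
    'data': { 'qom-path': 'str' },
    'returns': [ 'VirtqueueStatusInfo' ] }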
>
> I might have asked this before, I don't remember. If you already
> answered it, feel free to point me to your answer.
>
> Why is this a standard question for events? Say, a management
> application wants to track the state of X. Two ways: poll the state
> with a query command that returns it, listen for events that report a
> change of X.
>
> Listening for an event is more efficient.
>
> However, if the management application connects to a QEMU instance, X
> could be anything, so it needs to poll once.
>
> Special case: the management application restarts for some reason.
>
--
Best regards,
Vladimir
Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> writes:
> On 09.04.25 13:48, Markus Armbruster wrote:
>> Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> writes:
>>
>>> For now we only log the vhost device error, when virtqueue is actually
>>> stopped. Let's add a QAPI event, which makes possible:
>>>
>>> - collect statistics of such errors
>>> - make immediate actions: take core dumps or do some other debugging
>>> - inform the user through a management API or UI, so that (s)he can
>>> react somehow, e.g. reset the device driver in the guest or even
>>> build up some automation to do so
>>>
>>> Note that basically every inconsistency discovered during virtqueue
>>> processing results in a silent virtqueue stop. The guest then just
>>> sees the requests getting stuck somewhere in the device for no visible
>>> reason. This event provides a means to inform the management layer of
>>> this situation in a timely fashion.
>>>
>>> The event could be reused for some other virtqueue problems (not only
>>> for vhost devices) in future. For this it gets a generic name and
>>> structure.
>>>
>>> We keep original VHOST_OPS_DEBUG(), to keep original debug output as is
>>> here, it's not the only call to VHOST_OPS_DEBUG in the file.
>>
>> Likely should be tracepoints. Not this patch's problem, though.
>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
>>> ---
>>>
>>> v6: rename path to qom-path, and improve throttling of the event
>>> improve wording
>>>
>
> [..]
>
>>> @@ -527,6 +534,13 @@ static gboolean qapi_event_throttle_equal(const void *a, const void *b)
>>> qdict_get_str(evb->data, "qom-path"));
>>> }
>>>
>>> + if (eva->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
>>> + return !strcmp(qdict_get_str(eva->data, "qom-path"),
>>> + qdict_get_str(evb->data, "qom-path")) &&
>>> + (qdict_get_int(eva->data, "virtqueue") ==
>>> + qdict_get_int(evb->data, "virtqueue"));
>>> + }
>>> +
>>> return TRUE;
>>> }
>>>
>>
>> Rate-limiting is now per virt queue. It was per device in previous
>> revisions. Worth it?
>>
>
> Hmm. Probably not. If we have 2 virtqueue, seems good to see both event
> (or only one, if only one virtqueue failed).
> If we have 256 virtqueues, 256 immediate events seems too much.
> So, better is to drop virtqueue here and consider only qom-path for throttling.
>
>>> diff --git a/qapi/qdev.json b/qapi/qdev.json
>>> index 25cbcf977b..ddfae18761 100644
>>> --- a/qapi/qdev.json
>>> +++ b/qapi/qdev.json
>>> @@ -187,3 +187,35 @@
>>> { 'command': 'device-sync-config',
>>> 'features': [ 'unstable' ],
>>> 'data': {'id': 'str'} }
>>> +
>>> +##
>>> +# @VirtqueueError:
>>> +#
>>> +# @vhost-vring-error: the vhost device has communicated failure via
>>> +# the vring error file descriptor
>>> +#
>>> +# Since: 10.1
>>> +##
>>> +{ 'enum': 'VirtqueueError',
>>> + 'data': [ 'vhost-vring-error' ] }
>>> +
>>> +##
>>> +# @VIRTQUEUE_ERROR:
>>> +#
>>> +# Emitted when a device virtqueue fails at runtime.
>>> +#
>>> +# @device: the device's ID if it has one
>>> +#
>>> +# @qom-path: the device's QOM path
>>> +#
>>> +# @virtqueue: the index of the virtqueue that failed
>>> +#
>>> +# @error: error identifier
>>> +#
>>> +# @description: human readable description
>>> +#
>>> +# Since: 10.1
>>> +##
>>> +{ 'event': 'VIRTQUEUE_ERROR',
>>> + 'data': { '*device': 'str', 'qom-path': 'str', 'virtqueue': 'int',
>>> + 'error': 'VirtqueueError', 'description': 'str'} }
>>
>> Standard question for events: can a management application poll for the
>> information as well?
>
> Oh. that's a good shot.
>
> I'm afraid it can't. And this makes me to dig into history of this patch
> - no, we didn't discussed it before.
>
> And before trying to implement something new here (a way to get a kind of
> virtqueues status by a new QMP command), I check that:
> - our mgmt tool still doesn't use VIRTQUEUE_ERROR event (which we've
> merged to downstream QEMU long ago, of course)
> - the original problem that led us to introducing such event doesn't
> bother us for a long time
>
> It seems wiser to stop here for now. I should have considered these aspects
> before beginning the process of reviving this series. Sorry for your time.
Well, *I* could've remembered the standard question at the beginning!
So, sorry for your time, too :)
> Still, if we (or someone other) need such event in future - good, we have
> a modern patch in mailing list to start from.
Yes. Thank you!
>> I might have asked this before, I don't remember. If you already
>> answered it, feel free to point me to your answer.
>>
>> Why is this a standard question for events? Say, a management
>> application wants to track the state of X. Two ways: poll the state
>> with a query command that returns it, listen for events that report a
>> change of X.
>>
>> Listening for an event is more efficient.
>>
>> However, if the management application connects to a QEMU instance, X
>> could be anything, so it needs to poll once.
>>
>> Special case: the management application restarts for some reason.