[v2] QIOChannelSocket: Flush zerocopy socket error queue on ENOBUF failure for sendmsg

[PATCH v2] QIOChannelSocket: Flush zerocopy socket error queue on ENOBUF failure for sendmsg

Posted by Manish Mishra 3 weeks, 2 days ago

We allocate extra metadata SKBs in case of a zerocopy send. This metadata
memory is accounted for in the OPTMEM limit. If there is any error while
sending zerocopy packets or if zerocopy is skipped, these metadata SKBs are
queued in the socket error queue. This error queue is freed when userspace
reads it.

Usually, if there are continuous failures, we merge the metadata into a single
SKB and free another one. As a result, it never exceeds the OPTMEM limit.
However, if there is any out-of-order processing or intermittent zerocopy
failures, this error chain can grow significantly, exhausting the OPTMEM limit.
As a result, all new sendmsg requests fail to allocate any new SKB, leading to
an ENOBUF error. Depending on the amount of data queued before the flush
(i.e., large live migration iterations), even large OPTMEM limits are prone to
failure.

To work around this, if we encounter an ENOBUF error with a zerocopy sendmsg,
we flush the error queue and retry once more.

Additionally, this patch removes the dirty_sync_missed_zero_copy migration
stat. This stat is not used anywhere and does not seem useful. Removing it
simplifies the patch.

V2:
  1. Removed the dirty_sync_missed_zero_copy migration stat.
  2. Made the call to qio_channel_socket_flush_internal() from
     qio_channel_socket_writev() non-blocking.

Signed-off-by: Manish Mishra <manish.mishra@nutanix.com>
---
 include/io/channel.h           |  3 +-
 io/channel-socket.c            | 57 ++++++++++++++++++++++++----------
 migration/migration-hmp-cmds.c |  5 ---
 migration/migration-stats.h    |  5 ---
 migration/migration.c          |  2 --
 migration/multifd.c            |  3 --
 6 files changed, 41 insertions(+), 34 deletions(-)

diff --git a/include/io/channel.h b/include/io/channel.h
index 62b657109c..c393764ab7 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -980,8 +980,7 @@ int coroutine_mixed_fn qio_channel_writev_full_all(QIOChannel *ioc,
  * If not implemented, acts as a no-op, and returns 0.
  *
  * Returns -1 if any error is found,
- *          1 if every send failed to use zero copy.
- *          0 otherwise.
+ *          0 on success.
  */
 
 int qio_channel_flush(QIOChannel *ioc,
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 608bcf066e..0f2a5c2b5f 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -37,6 +37,12 @@
 
 #define SOCKET_MAX_FDS 16
 
+#ifdef QEMU_MSG_ZEROCOPY
+static int qio_channel_socket_flush_internal(QIOChannel *ioc,
+                                             bool block,
+                                             Error **errp);
+#endif
+
 SocketAddress *
 qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
                                      Error **errp)
@@ -566,6 +572,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
     size_t fdsize = sizeof(int) * nfds;
     struct cmsghdr *cmsg;
     int sflags = 0;
+    bool zero_copy_flush_pending = TRUE;
 
     memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
 
@@ -612,9 +619,21 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
             goto retry;
         case ENOBUFS:
             if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
-                error_setg_errno(errp, errno,
-                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
-                return -1;
+                if (zero_copy_flush_pending) {
+                    ret = qio_channel_socket_flush_internal(ioc, false, errp);
+                    if (ret < 0) {
+                        error_setg_errno(errp, errno,
+                                         "Zerocopy flush failed");
+                        return -1;
+                    }
+                    zero_copy_flush_pending = FALSE;
+                    goto retry;
+                } else {
+                    error_setg_errno(errp, errno,
+                                     "Process can't lock enough memory for "
+                                     "using MSG_ZEROCOPY");
+                    return -1;
+                }
             }
             break;
         }
@@ -725,8 +744,9 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
 
 
 #ifdef QEMU_MSG_ZEROCOPY
-static int qio_channel_socket_flush(QIOChannel *ioc,
-                                    Error **errp)
+static int qio_channel_socket_flush_internal(QIOChannel *ioc,
+                                             bool block,
+                                             Error **errp)
 {
     QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
     struct msghdr msg = {};
@@ -734,7 +754,6 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
     struct cmsghdr *cm;
     char control[CMSG_SPACE(sizeof(*serr))];
     int received;
-    int ret;
 
     if (sioc->zero_copy_queued == sioc->zero_copy_sent) {
         return 0;
@@ -744,16 +763,19 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
     msg.msg_controllen = sizeof(control);
     memset(control, 0, sizeof(control));
 
-    ret = 1;
-
     while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
         received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
         if (received < 0) {
             switch (errno) {
             case EAGAIN:
-                /* Nothing on errqueue, wait until something is available */
-                qio_channel_wait(ioc, G_IO_ERR);
-                continue;
+                if (block) {
+                    /* Nothing on errqueue, wait until something is
+                     * available.
+                     */
+                    qio_channel_wait(ioc, G_IO_ERR);
+                    continue;
+                }
+                return 0;
             case EINTR:
                 continue;
             default:
@@ -790,14 +812,15 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
 
         /* No errors, count successfully finished sendmsg()*/
         sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
-
-        /* If any sendmsg() succeeded using zero copy, return 0 at the end */
-        if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) {
-            ret = 0;
-        }
     }
 
-    return ret;
+    return 0;
+}
+
+static int qio_channel_socket_flush(QIOChannel *ioc,
+                                    Error **errp)
+{
+    return qio_channel_socket_flush_internal(ioc, true, errp);
 }
 
 #endif /* QEMU_MSG_ZEROCOPY */
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 49c26daed3..4533ce91ae 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -138,11 +138,6 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
             monitor_printf(mon, "postcopy ram: %" PRIu64 " kbytes\n",
                            info->ram->postcopy_bytes >> 10);
         }
-        if (info->ram->dirty_sync_missed_zero_copy) {
-            monitor_printf(mon,
-                           "Zero-copy-send fallbacks happened: %" PRIu64 " times\n",
-                           info->ram->dirty_sync_missed_zero_copy);
-        }
     }
 
     if (info->xbzrle_cache) {
diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index 05290ade76..80a3d4be93 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -50,11 +50,6 @@ typedef struct {
      * Number of times we have synchronized guest bitmaps.
      */
     Stat64 dirty_sync_count;
-    /*
-     * Number of times zero copy failed to send any page using zero
-     * copy.
-     */
-    Stat64 dirty_sync_missed_zero_copy;
     /*
      * Number of bytes sent at migration completion stage while the
      * guest is stopped.
diff --git a/migration/migration.c b/migration/migration.c
index 1833cfe358..2ab19ca858 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1231,8 +1231,6 @@ static void populate_ram_info(MigrationInfo *info, MigrationState *s)
     info->ram->mbps = s->mbps;
     info->ram->dirty_sync_count =
         stat64_get(&mig_stats.dirty_sync_count);
-    info->ram->dirty_sync_missed_zero_copy =
-        stat64_get(&mig_stats.dirty_sync_missed_zero_copy);
     info->ram->postcopy_requests =
         stat64_get(&mig_stats.postcopy_requests);
     info->ram->page_size = page_size;
diff --git a/migration/multifd.c b/migration/multifd.c
index dfb5189f0e..ee6b2d3cba 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -607,9 +607,6 @@ static int multifd_zero_copy_flush(QIOChannel *c)
         error_report_err(err);
         return -1;
     }
-    if (ret == 1) {
-        stat64_add(&mig_stats.dirty_sync_missed_zero_copy, 1);
-    }
 
     return ret;
 }
-- 
2.43.0

Re: [PATCH v2] QIOChannelSocket: Flush zerocopy socket error queue on ENOBUF failure for sendmsg

Posted by Peter Xu 3 weeks, 1 day ago

On Sun, Mar 09, 2025 at 09:15:00PM -0400, Manish Mishra wrote:
> We allocate extra metadata SKBs in case of a zerocopy send. This metadata
> memory is accounted for in the OPTMEM limit. If there is any error while
> sending zerocopy packets or if zerocopy is skipped, these metadata SKBs are
> queued in the socket error queue. This error queue is freed when userspace
> reads it.
> 
> Usually, if there are continuous failures, we merge the metadata into a single
> SKB and free another one. As a result, it never exceeds the OPTMEM limit.
> However, if there is any out-of-order processing or intermittent zerocopy
> failures, this error chain can grow significantly, exhausting the OPTMEM limit.
> As a result, all new sendmsg requests fail to allocate any new SKB, leading to
> an ENOBUF error. Depending on the amount of data queued before the flush
> (i.e., large live migration iterations), even large OPTMEM limits are prone to
> failure.
> 
> To work around this, if we encounter an ENOBUF error with a zerocopy sendmsg,
> we flush the error queue and retry once more.
> 
> Additionally, this patch removes the dirty_sync_missed_zero_copy migration
> stat. This stat is not used anywhere and does not seem useful. Removing it
> simplifies the patch.

IMHO it's still useful, it's just that if it's for debugging purpose, it's
optional to expose it via QAPI.  Then if without exposing it to upper
layer, it can simplify the change this patch wanted to introduce.  We can
still keep it a tracepoint.

Meanwhile, please consider spliting the patch into two:

  - Removal of dirty_sync_missed_zero_copy, we need to copy QAPI
    maintainers for this one.  We can add a tracepoint altogether.

  - The real work for the workaround on the flush

> 
> V2:
>   1. Removed the dirty_sync_missed_zero_copy migration stat.
>   2. Made the call to qio_channel_socket_flush_internal() from
>      qio_channel_socket_writev() non-blocking.
> 
> Signed-off-by: Manish Mishra <manish.mishra@nutanix.com>
> ---
>  include/io/channel.h           |  3 +-
>  io/channel-socket.c            | 57 ++++++++++++++++++++++++----------
>  migration/migration-hmp-cmds.c |  5 ---
>  migration/migration-stats.h    |  5 ---
>  migration/migration.c          |  2 --
>  migration/multifd.c            |  3 --
>  6 files changed, 41 insertions(+), 34 deletions(-)
> 
> diff --git a/include/io/channel.h b/include/io/channel.h
> index 62b657109c..c393764ab7 100644
> --- a/include/io/channel.h
> +++ b/include/io/channel.h
> @@ -980,8 +980,7 @@ int coroutine_mixed_fn qio_channel_writev_full_all(QIOChannel *ioc,
>   * If not implemented, acts as a no-op, and returns 0.
>   *
>   * Returns -1 if any error is found,
> - *          1 if every send failed to use zero copy.
> - *          0 otherwise.
> + *          0 on success.
>   */
>  
>  int qio_channel_flush(QIOChannel *ioc,
> diff --git a/io/channel-socket.c b/io/channel-socket.c
> index 608bcf066e..0f2a5c2b5f 100644
> --- a/io/channel-socket.c
> +++ b/io/channel-socket.c
> @@ -37,6 +37,12 @@
>  
>  #define SOCKET_MAX_FDS 16
>  
> +#ifdef QEMU_MSG_ZEROCOPY
> +static int qio_channel_socket_flush_internal(QIOChannel *ioc,
> +                                             bool block,
> +                                             Error **errp);
> +#endif
> +
>  SocketAddress *
>  qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
>                                       Error **errp)
> @@ -566,6 +572,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>      size_t fdsize = sizeof(int) * nfds;
>      struct cmsghdr *cmsg;
>      int sflags = 0;
> +    bool zero_copy_flush_pending = TRUE;

I'd call it zerocopy_flush_once = FALSE.  "pending" may imply it needs to
be done at some point, but it's not always needed, afaict.  It's a throttle
that "if try it we should try no more than once" - logically speaking there
must be no concurrent writters to the same iochannel, it means if it's
about fallback messages eating up the optmem limit, one flush should always
work, or it must be some other real errors that we need to report, hence
"once".

I think we should also add a rich comment above the variable or the code
below to explain the issue.  It might make the code more readable too
without looking into git history.

>  
>      memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
>  
> @@ -612,9 +619,21 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>              goto retry;
>          case ENOBUFS:
>              if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> -                error_setg_errno(errp, errno,
> -                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
> -                return -1;
> +                if (zero_copy_flush_pending) {
> +                    ret = qio_channel_socket_flush_internal(ioc, false, errp);
> +                    if (ret < 0) {
> +                        error_setg_errno(errp, errno,
> +                                         "Zerocopy flush failed");
> +                        return -1;
> +                    }
> +                    zero_copy_flush_pending = FALSE;
> +                    goto retry;
> +                } else {
> +                    error_setg_errno(errp, errno,
> +                                     "Process can't lock enough memory for "
> +                                     "using MSG_ZEROCOPY");
> +                    return -1;
> +                }
>              }
>              break;
>          }
> @@ -725,8 +744,9 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>  
>  
>  #ifdef QEMU_MSG_ZEROCOPY
> -static int qio_channel_socket_flush(QIOChannel *ioc,
> -                                    Error **errp)
> +static int qio_channel_socket_flush_internal(QIOChannel *ioc,
> +                                             bool block,
> +                                             Error **errp)
>  {
>      QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
>      struct msghdr msg = {};
> @@ -734,7 +754,6 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
>      struct cmsghdr *cm;
>      char control[CMSG_SPACE(sizeof(*serr))];
>      int received;
> -    int ret;
>  
>      if (sioc->zero_copy_queued == sioc->zero_copy_sent) {
>          return 0;
> @@ -744,16 +763,19 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
>      msg.msg_controllen = sizeof(control);
>      memset(control, 0, sizeof(control));
>  
> -    ret = 1;
> -
>      while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
>          received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
>          if (received < 0) {
>              switch (errno) {
>              case EAGAIN:
> -                /* Nothing on errqueue, wait until something is available */
> -                qio_channel_wait(ioc, G_IO_ERR);
> -                continue;
> +                if (block) {
> +                    /* Nothing on errqueue, wait until something is
> +                     * available.
> +                     */
> +                    qio_channel_wait(ioc, G_IO_ERR);
> +                    continue;
> +                }
> +                return 0;
>              case EINTR:
>                  continue;
>              default:
> @@ -790,14 +812,15 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
>  
>          /* No errors, count successfully finished sendmsg()*/
>          sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
> -
> -        /* If any sendmsg() succeeded using zero copy, return 0 at the end */
> -        if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) {
> -            ret = 0;
> -        }

We could attach a tracepoint to not lose the debug-ability showing whether
zerocopy is working or got falled back.

>      }
>  
> -    return ret;
> +    return 0;
> +}
> +
> +static int qio_channel_socket_flush(QIOChannel *ioc,
> +                                    Error **errp)
> +{
> +    return qio_channel_socket_flush_internal(ioc, true, errp);
>  }
>  
>  #endif /* QEMU_MSG_ZEROCOPY */
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index 49c26daed3..4533ce91ae 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -138,11 +138,6 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
>              monitor_printf(mon, "postcopy ram: %" PRIu64 " kbytes\n",
>                             info->ram->postcopy_bytes >> 10);
>          }
> -        if (info->ram->dirty_sync_missed_zero_copy) {
> -            monitor_printf(mon,
> -                           "Zero-copy-send fallbacks happened: %" PRIu64 " times\n",
> -                           info->ram->dirty_sync_missed_zero_copy);
> -        }
>      }
>  
>      if (info->xbzrle_cache) {
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index 05290ade76..80a3d4be93 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -50,11 +50,6 @@ typedef struct {
>       * Number of times we have synchronized guest bitmaps.
>       */
>      Stat64 dirty_sync_count;
> -    /*
> -     * Number of times zero copy failed to send any page using zero
> -     * copy.
> -     */
> -    Stat64 dirty_sync_missed_zero_copy;
>      /*
>       * Number of bytes sent at migration completion stage while the
>       * guest is stopped.
> diff --git a/migration/migration.c b/migration/migration.c
> index 1833cfe358..2ab19ca858 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1231,8 +1231,6 @@ static void populate_ram_info(MigrationInfo *info, MigrationState *s)
>      info->ram->mbps = s->mbps;
>      info->ram->dirty_sync_count =
>          stat64_get(&mig_stats.dirty_sync_count);
> -    info->ram->dirty_sync_missed_zero_copy =
> -        stat64_get(&mig_stats.dirty_sync_missed_zero_copy);
>      info->ram->postcopy_requests =
>          stat64_get(&mig_stats.postcopy_requests);
>      info->ram->page_size = page_size;
> diff --git a/migration/multifd.c b/migration/multifd.c
> index dfb5189f0e..ee6b2d3cba 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -607,9 +607,6 @@ static int multifd_zero_copy_flush(QIOChannel *c)
>          error_report_err(err);
>          return -1;
>      }
> -    if (ret == 1) {
> -        stat64_add(&mig_stats.dirty_sync_missed_zero_copy, 1);
> -    }

If we want to remove this, we need to remove the variable in QAPI too.

# @dirty-sync-missed-zero-copy: Number of times dirty RAM
#     synchronization could not avoid copying dirty pages.  This is
#     between 0 and @dirty-sync-count * @multifd-channels.  (since
#     7.1)

Personally I'd remove it directly, but others may not always agree... in
this case, the safe approach is to mark it deprecate and update this in
docs/about/deprecated.rst.  Then you can leave it to us to finally remove
this entry (or send another patch after two qemu releases).

Thanks,

>  
>      return ret;
>  }
> -- 
> 2.43.0
> 

-- 
Peter Xu

Re: [PATCH v2] QIOChannelSocket: Flush zerocopy socket error queue on ENOBUF failure for sendmsg

Posted by Daniel P. Berrangé 3 weeks, 1 day ago

On Mon, Mar 10, 2025 at 03:12:05PM -0400, Peter Xu wrote:
> On Sun, Mar 09, 2025 at 09:15:00PM -0400, Manish Mishra wrote:
> > We allocate extra metadata SKBs in case of a zerocopy send. This metadata
> > memory is accounted for in the OPTMEM limit. If there is any error while
> > sending zerocopy packets or if zerocopy is skipped, these metadata SKBs are
> > queued in the socket error queue. This error queue is freed when userspace
> > reads it.
> > 
> > Usually, if there are continuous failures, we merge the metadata into a single
> > SKB and free another one. As a result, it never exceeds the OPTMEM limit.
> > However, if there is any out-of-order processing or intermittent zerocopy
> > failures, this error chain can grow significantly, exhausting the OPTMEM limit.
> > As a result, all new sendmsg requests fail to allocate any new SKB, leading to
> > an ENOBUF error. Depending on the amount of data queued before the flush
> > (i.e., large live migration iterations), even large OPTMEM limits are prone to
> > failure.
> > 
> > To work around this, if we encounter an ENOBUF error with a zerocopy sendmsg,
> > we flush the error queue and retry once more.
> > 
> > Additionally, this patch removes the dirty_sync_missed_zero_copy migration
> > stat. This stat is not used anywhere and does not seem useful. Removing it
> > simplifies the patch.
> 
> IMHO it's still useful, it's just that if it's for debugging purpose, it's
> optional to expose it via QAPI.  Then if without exposing it to upper
> layer, it can simplify the change this patch wanted to introduce.  We can
> still keep it a tracepoint.


> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index dfb5189f0e..ee6b2d3cba 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -607,9 +607,6 @@ static int multifd_zero_copy_flush(QIOChannel *c)
> >          error_report_err(err);
> >          return -1;
> >      }
> > -    if (ret == 1) {
> > -        stat64_add(&mig_stats.dirty_sync_missed_zero_copy, 1);
> > -    }
> 
> If we want to remove this, we need to remove the variable in QAPI too.
> 
> # @dirty-sync-missed-zero-copy: Number of times dirty RAM
> #     synchronization could not avoid copying dirty pages.  This is
> #     between 0 and @dirty-sync-count * @multifd-channels.  (since
> #     7.1)
> 
> Personally I'd remove it directly, but others may not always agree... in
> this case, the safe approach is to mark it deprecate and update this in
> docs/about/deprecated.rst.  Then you can leave it to us to finally remove
> this entry (or send another patch after two qemu releases).

Given this is in public API, the data needs to remain reported accurately
for the whole deprecation period. IOW, the patch to qiochannel needs to
preserve this data too.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH v2] QIOChannelSocket: Flush zerocopy socket error queue on ENOBUF failure for sendmsg

Posted by Peter Xu 3 weeks, 1 day ago

On Mon, Mar 10, 2025 at 07:48:16PM +0000, Daniel P. Berrangé wrote:
> Given this is in public API, the data needs to remain reported accurately
> for the whole deprecation period. IOW, the patch to qiochannel needs to
> preserve this data too.

:-(

We could potentially mark MigrationStats to be experimental as a whole and
declare that in deprecate.rst too, then after two releases, we can randomly
add / remove fields as wish without always need to go through the
deprecation process, am I right?

Maybe we should do that..

Thanks,

-- 
Peter Xu

Re: [PATCH v2] QIOChannelSocket: Flush zerocopy socket error queue on ENOBUF failure for sendmsg

Posted by Daniel P. Berrangé 3 weeks, 1 day ago

On Mon, Mar 10, 2025 at 04:03:26PM -0400, Peter Xu wrote:
> On Mon, Mar 10, 2025 at 07:48:16PM +0000, Daniel P. Berrangé wrote:
> > Given this is in public API, the data needs to remain reported accurately
> > for the whole deprecation period. IOW, the patch to qiochannel needs to
> > preserve this data too.
> 
> :-(
> 
> We could potentially mark MigrationStats to be experimental as a whole and
> declare that in deprecate.rst too, then after two releases, we can randomly
> add / remove fields as wish without always need to go through the
> deprecation process, am I right?

IMHO that would be an abuse of the process and harmful to applications
and users consuming stats.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH v2] QIOChannelSocket: Flush zerocopy socket error queue on ENOBUF failure for sendmsg

Posted by Peter Xu 3 weeks ago

On Tue, Mar 11, 2025 at 08:13:16AM +0000, Daniel P. Berrangé wrote:
> On Mon, Mar 10, 2025 at 04:03:26PM -0400, Peter Xu wrote:
> > On Mon, Mar 10, 2025 at 07:48:16PM +0000, Daniel P. Berrangé wrote:
> > > Given this is in public API, the data needs to remain reported accurately
> > > for the whole deprecation period. IOW, the patch to qiochannel needs to
> > > preserve this data too.
> > 
> > :-(
> > 
> > We could potentially mark MigrationStats to be experimental as a whole and
> > declare that in deprecate.rst too, then after two releases, we can randomly
> > add / remove fields as wish without always need to go through the
> > deprecation process, am I right?
> 
> IMHO that would be an abuse of the process and harmful to applications
> and users consuming stats.

Ah I just noticed that's the exact same one we included in
query-migrate.. Then yes, the stable ABI is important here.

So for this specific case, maybe we shouldn't have exposed it in QMP from
the start.

To me, it's a question on whether we could have something experimental and
be exposed to QMP, where we don't need to guarantee a strict stable ABI, or
a very loose ABI (e.g. we can guarantee the command exists, and with
key-value string-integer pairs, nothing else).

Taking the example of downtime reports: there used to be attempts on the
list capturing details of downtime measurements for each migration and
report that via QMP, probably via the stats.  At that time, I was concerned
that whatever we change alone the lines then we risk breaking the ABI
(e.g. we move code within blackout path and it can move contribution of X
from bucket B1 to B2).  At that time the work was not merged.

However, such things (either downtime reports, or some of the stats that
are pretty much not designed for generic users, like this zerocopy success
counter) would still be nice if we can collect them in QMP queries.

Maybe what we need is a new MigrationInfoOptional, to be embeded into
MigrationInfo (or not), marked experimental.  Then in the future whenever
we want to add some new statistics, we could decide whether it should be
part of stable ABI or not.

PS: we have Juraj actively looking at selftests to measure downtime
contributions of live migrations.  Currently we need to leverage
tracepoints and parse the results.  If that sounds reasonable, we could
start with having MigrationInfoOptional and export downtime metrics.

Thanks,

-- 
Peter Xu

Re: [PATCH v2] QIOChannelSocket: Flush zerocopy socket error queue on ENOBUF failure for sendmsg

Posted by Daniel P. Berrangé 3 weeks ago

On Tue, Mar 11, 2025 at 11:20:50AM -0400, Peter Xu wrote:
> On Tue, Mar 11, 2025 at 08:13:16AM +0000, Daniel P. Berrangé wrote:
> > On Mon, Mar 10, 2025 at 04:03:26PM -0400, Peter Xu wrote:
> > > On Mon, Mar 10, 2025 at 07:48:16PM +0000, Daniel P. Berrangé wrote:
> > > > Given this is in public API, the data needs to remain reported accurately
> > > > for the whole deprecation period. IOW, the patch to qiochannel needs to
> > > > preserve this data too.
> > > 
> > > :-(
> > > 
> > > We could potentially mark MigrationStats to be experimental as a whole and
> > > declare that in deprecate.rst too, then after two releases, we can randomly
> > > add / remove fields as wish without always need to go through the
> > > deprecation process, am I right?
> > 
> > IMHO that would be an abuse of the process and harmful to applications
> > and users consuming stats.
> 
> Ah I just noticed that's the exact same one we included in
> query-migrate.. Then yes, the stable ABI is important here.
> 
> So for this specific case, maybe we shouldn't have exposed it in QMP from
> the start.
> 
> To me, it's a question on whether we could have something experimental and
> be exposed to QMP, where we don't need to guarantee a strict stable ABI, or
> a very loose ABI (e.g. we can guarantee the command exists, and with
> key-value string-integer pairs, nothing else).

QMP has the ability to tag commands/fields, etc as experimental.

libvirt will explicitly avoid consuming or exposing anything with
an experimental tag on it.

> Maybe what we need is a new MigrationInfoOptional, to be embeded into
> MigrationInfo (or not), marked experimental.  Then in the future whenever
> we want to add some new statistics, we could decide whether it should be
> part of stable ABI or not.

That is not required - individual struct fields can be marked
experimental.

The key question is what the intended usage of the fields/stats/etc
is to be. If you want it used by libvirt and mgmt apps it would need
to be formally supported. If it is just for adhoc QEMU developer
debugging and doesn't need libvirt / app support, then experimental
is fine.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH v2] QIOChannelSocket: Flush zerocopy socket error queue on ENOBUF failure for sendmsg

Posted by Peter Xu 3 weeks ago

On Tue, Mar 11, 2025 at 03:33:23PM +0000, Daniel P. Berrangé wrote:
> On Tue, Mar 11, 2025 at 11:20:50AM -0400, Peter Xu wrote:
> > On Tue, Mar 11, 2025 at 08:13:16AM +0000, Daniel P. Berrangé wrote:
> > > On Mon, Mar 10, 2025 at 04:03:26PM -0400, Peter Xu wrote:
> > > > On Mon, Mar 10, 2025 at 07:48:16PM +0000, Daniel P. Berrangé wrote:
> > > > > Given this is in public API, the data needs to remain reported accurately
> > > > > for the whole deprecation period. IOW, the patch to qiochannel needs to
> > > > > preserve this data too.
> > > > 
> > > > :-(
> > > > 
> > > > We could potentially mark MigrationStats to be experimental as a whole and
> > > > declare that in deprecate.rst too, then after two releases, we can randomly
> > > > add / remove fields as wish without always need to go through the
> > > > deprecation process, am I right?
> > > 
> > > IMHO that would be an abuse of the process and harmful to applications
> > > and users consuming stats.
> > 
> > Ah I just noticed that's the exact same one we included in
> > query-migrate.. Then yes, the stable ABI is important here.
> > 
> > So for this specific case, maybe we shouldn't have exposed it in QMP from
> > the start.
> > 
> > To me, it's a question on whether we could have something experimental and
> > be exposed to QMP, where we don't need to guarantee a strict stable ABI, or
> > a very loose ABI (e.g. we can guarantee the command exists, and with
> > key-value string-integer pairs, nothing else).
> 
> QMP has the ability to tag commands/fields, etc as experimental.
> 
> libvirt will explicitly avoid consuming or exposing anything with
> an experimental tag on it.
> 
> > Maybe what we need is a new MigrationInfoOptional, to be embeded into
> > MigrationInfo (or not), marked experimental.  Then in the future whenever
> > we want to add some new statistics, we could decide whether it should be
> > part of stable ABI or not.
> 
> That is not required - individual struct fields can be marked
> experimental.

Yes that'll work too.  The important bit here is I think we should start to
seriously evaluate which to expose to QAPI as stable API when we add stats
into it.  We used to not pay too much attention.

With MigrationInfoOptional, we should suggest any new field to be added
there by default, then whatever needs to be put out of experimental needs
explicit justifications.  Or we can also document any new migration field
at least in the stats to be marked as experimental unless justified.

> 
> The key question is what the intended usage of the fields/stats/etc
> is to be. If you want it used by libvirt and mgmt apps it would need
> to be formally supported. If it is just for adhoc QEMU developer
> debugging and doesn't need libvirt / app support, then experimental
> is fine.

To my initial thoughts, I want Libvirt to fetch it.  However I don't want
Libvirt to parse it.

For example, for things like "whether zerocopy send succeeded or not", or
"how much time we spent on sending non-iterable device states", they're
almost not consumable for users, but great for debuggings.  It would be
great if Libvirt could know their existance, fetch it (e.g. once after
migration completes) then dump it to the logfile to help debugging and
triaging QEMU issues.  In that case parsing is not needed, the whole result
can be attached to the log as a JSON blob.  That releases the burden from
the need to maintain compatibility that we don't really need and nobody
cared (I bet it's the case here for zerocopy stats, but we got restricted
by our promises even if it may ultimately benefit nobody..).

Thanks,

-- 
Peter Xu

Re: [PATCH v2] QIOChannelSocket: Flush zerocopy socket error queue on ENOBUF failure for sendmsg

Posted by Daniel P. Berrangé 3 weeks ago

On Tue, Mar 11, 2025 at 03:57:35PM -0400, Peter Xu wrote:
> On Tue, Mar 11, 2025 at 03:33:23PM +0000, Daniel P. Berrangé wrote:
> > On Tue, Mar 11, 2025 at 11:20:50AM -0400, Peter Xu wrote:
> > > On Tue, Mar 11, 2025 at 08:13:16AM +0000, Daniel P. Berrangé wrote:
> > > > On Mon, Mar 10, 2025 at 04:03:26PM -0400, Peter Xu wrote:
> > > > > On Mon, Mar 10, 2025 at 07:48:16PM +0000, Daniel P. Berrangé wrote:
> > > > > > Given this is in public API, the data needs to remain reported accurately
> > > > > > for the whole deprecation period. IOW, the patch to qiochannel needs to
> > > > > > preserve this data too.
> > > > > 
> > > > > :-(
> > > > > 
> > > > > We could potentially mark MigrationStats to be experimental as a whole and
> > > > > declare that in deprecate.rst too, then after two releases, we can randomly
> > > > > add / remove fields as wish without always need to go through the
> > > > > deprecation process, am I right?
> > > > 
> > > > IMHO that would be an abuse of the process and harmful to applications
> > > > and users consuming stats.
> > > 
> > > Ah I just noticed that's the exact same one we included in
> > > query-migrate.. Then yes, the stable ABI is important here.
> > > 
> > > So for this specific case, maybe we shouldn't have exposed it in QMP from
> > > the start.
> > > 
> > > To me, it's a question on whether we could have something experimental and
> > > be exposed to QMP, where we don't need to guarantee a strict stable ABI, or
> > > a very loose ABI (e.g. we can guarantee the command exists, and with
> > > key-value string-integer pairs, nothing else).
> > 
> > QMP has the ability to tag commands/fields, etc as experimental.
> > 
> > libvirt will explicitly avoid consuming or exposing anything with
> > an experimental tag on it.
> > 
> > > Maybe what we need is a new MigrationInfoOptional, to be embeded into
> > > MigrationInfo (or not), marked experimental.  Then in the future whenever
> > > we want to add some new statistics, we could decide whether it should be
> > > part of stable ABI or not.
> > 
> > That is not required - individual struct fields can be marked
> > experimental.
> 
> Yes that'll work too.  The important bit here is I think we should start to
> seriously evaluate which to expose to QAPI as stable API when we add stats
> into it.  We used to not pay too much attention.
> 
> With MigrationInfoOptional, we should suggest any new field to be added
> there by default, then whatever needs to be put out of experimental needs
> explicit justifications.  Or we can also document any new migration field
> at least in the stats to be marked as experimental unless justified.
> 
> > 
> > The key question is what the intended usage of the fields/stats/etc
> > is to be. If you want it used by libvirt and mgmt apps it would need
> > to be formally supported. If it is just for adhoc QEMU developer
> > debugging and doesn't need libvirt / app support, then experimental
> > is fine.
> 
> To my initial thoughts, I want Libvirt to fetch it.  However I don't want
> Libvirt to parse it.
> 
> For example, for things like "whether zerocopy send succeeded or not", or
> "how much time we spent on sending non-iterable device states", they're
> almost not consumable for users, but great for debuggings.  It would be
> great if Libvirt could know their existance, fetch it (e.g. once after
> migration completes) then dump it to the logfile to help debugging and
> triaging QEMU issues.  In that case parsing is not needed, the whole result
> can be attached to the log as a JSON blob.  That releases the burden from
> the need to maintain compatibility that we don't really need and nobody
> cared (I bet it's the case here for zerocopy stats, but we got restricted
> by our promises even if it may ultimately benefit nobody..).

We already log every single QMP command & response and event we deal
with, at INFO level, but by default our log files are only set to
capture WARN level, so this isn't visible without extra config steps
by the ademin

Possibly we could think about dumping all migration stats to
/var/log/libvirt/qemu/$GUEST.log at migration completion


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH v2] QIOChannelSocket: Flush zerocopy socket error queue on ENOBUF failure for sendmsg

Posted by Peter Xu 3 weeks ago

On Tue, Mar 11, 2025 at 08:08:02PM +0000, Daniel P. Berrangé wrote:
> On Tue, Mar 11, 2025 at 03:57:35PM -0400, Peter Xu wrote:
> > On Tue, Mar 11, 2025 at 03:33:23PM +0000, Daniel P. Berrangé wrote:
> > > On Tue, Mar 11, 2025 at 11:20:50AM -0400, Peter Xu wrote:
> > > > On Tue, Mar 11, 2025 at 08:13:16AM +0000, Daniel P. Berrangé wrote:
> > > > > On Mon, Mar 10, 2025 at 04:03:26PM -0400, Peter Xu wrote:
> > > > > > On Mon, Mar 10, 2025 at 07:48:16PM +0000, Daniel P. Berrangé wrote:
> > > > > > > Given this is in public API, the data needs to remain reported accurately
> > > > > > > for the whole deprecation period. IOW, the patch to qiochannel needs to
> > > > > > > preserve this data too.
> > > > > > 
> > > > > > :-(
> > > > > > 
> > > > > > We could potentially mark MigrationStats to be experimental as a whole and
> > > > > > declare that in deprecate.rst too, then after two releases, we can randomly
> > > > > > add / remove fields as wish without always need to go through the
> > > > > > deprecation process, am I right?
> > > > > 
> > > > > IMHO that would be an abuse of the process and harmful to applications
> > > > > and users consuming stats.
> > > > 
> > > > Ah I just noticed that's the exact same one we included in
> > > > query-migrate.. Then yes, the stable ABI is important here.
> > > > 
> > > > So for this specific case, maybe we shouldn't have exposed it in QMP from
> > > > the start.
> > > > 
> > > > To me, it's a question on whether we could have something experimental and
> > > > be exposed to QMP, where we don't need to guarantee a strict stable ABI, or
> > > > a very loose ABI (e.g. we can guarantee the command exists, and with
> > > > key-value string-integer pairs, nothing else).
> > > 
> > > QMP has the ability to tag commands/fields, etc as experimental.
> > > 
> > > libvirt will explicitly avoid consuming or exposing anything with
> > > an experimental tag on it.
> > > 
> > > > Maybe what we need is a new MigrationInfoOptional, to be embeded into
> > > > MigrationInfo (or not), marked experimental.  Then in the future whenever
> > > > we want to add some new statistics, we could decide whether it should be
> > > > part of stable ABI or not.
> > > 
> > > That is not required - individual struct fields can be marked
> > > experimental.
> > 
> > Yes that'll work too.  The important bit here is I think we should start to
> > seriously evaluate which to expose to QAPI as stable API when we add stats
> > into it.  We used to not pay too much attention.
> > 
> > With MigrationInfoOptional, we should suggest any new field to be added
> > there by default, then whatever needs to be put out of experimental needs
> > explicit justifications.  Or we can also document any new migration field
> > at least in the stats to be marked as experimental unless justified.
> > 
> > > 
> > > The key question is what the intended usage of the fields/stats/etc
> > > is to be. If you want it used by libvirt and mgmt apps it would need
> > > to be formally supported. If it is just for adhoc QEMU developer
> > > debugging and doesn't need libvirt / app support, then experimental
> > > is fine.
> > 
> > To my initial thoughts, I want Libvirt to fetch it.  However I don't want
> > Libvirt to parse it.
> > 
> > For example, for things like "whether zerocopy send succeeded or not", or
> > "how much time we spent on sending non-iterable device states", they're
> > almost not consumable for users, but great for debuggings.  It would be
> > great if Libvirt could know their existance, fetch it (e.g. once after
> > migration completes) then dump it to the logfile to help debugging and
> > triaging QEMU issues.  In that case parsing is not needed, the whole result
> > can be attached to the log as a JSON blob.  That releases the burden from
> > the need to maintain compatibility that we don't really need and nobody
> > cared (I bet it's the case here for zerocopy stats, but we got restricted
> > by our promises even if it may ultimately benefit nobody..).
> 
> We already log every single QMP command & response and event we deal
> with, at INFO level, but by default our log files are only set to
> capture WARN level, so this isn't visible without extra config steps
> by the ademin
> 
> Possibly we could think about dumping all migration stats to
> /var/log/libvirt/qemu/$GUEST.log at migration completion

Yes it would be great to have it if it's trivial to get.  It could be a
last round of 'query-migrate' dumped only on src after migration is
completed, right before src QEMU shuts down.

Thanks,

-- 
Peter Xu

Re: [PATCH v2] QIOChannelSocket: Flush zerocopy socket error queue on ENOBUF failure for sendmsg

Posted by Manish 3 weeks, 1 day ago

Thanks Peter, Daniel

Are there any code pointers where something similar is done? I can 
follow that.

Thanks

Manish Mishra


On 11/03/25 1:33 am, Peter Xu wrote:
> !-------------------------------------------------------------------|
>    CAUTION: External Email
>
> |-------------------------------------------------------------------!
>
> On Mon, Mar 10, 2025 at 07:48:16PM +0000, Daniel P. Berrangé wrote:
>> Given this is in public API, the data needs to remain reported accurately
>> for the whole deprecation period. IOW, the patch to qiochannel needs to
>> preserve this data too.
> :-(
>
> We could potentially mark MigrationStats to be experimental as a whole and
> declare that in deprecate.rst too, then after two releases, we can randomly
> add / remove fields as wish without always need to go through the
> deprecation process, am I right?
>
> Maybe we should do that..
>
> Thanks,
>

Re: [PATCH v2] QIOChannelSocket: Flush zerocopy socket error queue on ENOBUF failure for sendmsg

Posted by Peter Xu 3 weeks ago

On Tue, Mar 11, 2025 at 04:15:38AM +0530, Manish wrote:
> Thanks Peter, Daniel
> 
> Are there any code pointers where something similar is done? I can follow
> that.

Sorry to make you feel like we can simplify the code, it's my fault.  Dan
has a point that at least for now it's part of ABI, so we must need to
maintain the stats.

Then if we must do that, we don't need to consider deprecation in your
patch.  It'll be simpler for you.

Thanks,

-- 
Peter Xu

Re: [PATCH v2] QIOChannelSocket: Flush zerocopy socket error queue on ENOBUF failure for sendmsg

Posted by Manish 3 weeks ago

Sure

Thanks

Manish Mishra

On 11/03/25 8:52 pm, Peter Xu wrote:
> !-------------------------------------------------------------------|
>    CAUTION: External Email
>
> |-------------------------------------------------------------------!
>
> On Tue, Mar 11, 2025 at 04:15:38AM +0530, Manish wrote:
>> Thanks Peter, Daniel
>>
>> Are there any code pointers where something similar is done? I can follow
>> that.
> Sorry to make you feel like we can simplify the code, it's my fault.  Dan
> has a point that at least for now it's part of ABI, so we must need to
> maintain the stats.
>
> Then if we must do that, we don't need to consider deprecation in your
> patch.  It'll be simpler for you.
>
> Thanks,
>