[PATCH for-10.1 0/2] migration: actually make gnutls workaround functional

Daniel P. Berrangé posted 2 patches 3 months, 2 weeks ago
Patches applied successfully
git fetch https://github.com/patchew-project/qemu tags/patchew/20250801170212.54409-1-berrange@redhat.com
Maintainers: "Daniel P. Berrangé" <berrange@redhat.com>, Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>
[PATCH for-10.1 0/2] migration: actually make gnutls workaround functional
Posted by Daniel P. Berrangé 3 months, 2 weeks ago
This is a followup to previously merged patches that claimed to
workaround the gnutls bug impacting migration, but in fact were
essentially non-functional. Juraj Marcin pointed this out, and
this new patch tweaks the workaround to make it actually do
something useful.

Daniel P. Berrangé (2):
  migration: simplify error reporting after channel read
  migration: fix workaround for gnutls thread safety

 crypto/tlssession.c   | 16 ----------------
 migration/qemu-file.c | 22 +++++++++++++++++-----
 2 files changed, 17 insertions(+), 21 deletions(-)

-- 
2.50.1


Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional
Posted by Peter Xu 1 month, 2 weeks ago
On Fri, Aug 01, 2025 at 06:02:10PM +0100, Daniel P. Berrangé wrote:
> This is a followup to previously merged patches that claimed to
> workaround the gnutls bug impacting migration, but in fact were
> essentially non-functional. Juraj Marcin pointed this out, and
> this new patch tweaks the workaround to make it actually do
> something useful.
> 
> Daniel P. Berrangé (2):
>   migration: simplify error reporting after channel read
>   migration: fix workaround for gnutls thread safety
> 
>  crypto/tlssession.c   | 16 ----------------
>  migration/qemu-file.c | 22 +++++++++++++++++-----
>  2 files changed, 17 insertions(+), 21 deletions(-)

Dan, is there a planned repost on this one?

Thanks!

-- 
Peter Xu


Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional
Posted by Daniel P. Berrangé 1 month, 2 weeks ago
On Mon, Sep 29, 2025 at 11:58:21AM -0400, Peter Xu wrote:
> On Fri, Aug 01, 2025 at 06:02:10PM +0100, Daniel P. Berrangé wrote:
> > This is a followup to previously merged patches that claimed to
> > workaround the gnutls bug impacting migration, but in fact were
> > essentially non-functional. Juraj Marcin pointed this out, and
> > this new patch tweaks the workaround to make it actually do
> > something useful.
> > 
> > Daniel P. Berrangé (2):
> >   migration: simplify error reporting after channel read
> >   migration: fix workaround for gnutls thread safety
> > 
> >  crypto/tlssession.c   | 16 ----------------
> >  migration/qemu-file.c | 22 +++++++++++++++++-----
> >  2 files changed, 17 insertions(+), 21 deletions(-)
> 
> Dan, is there a planned repost on this one?

It is on my todo list, but I don't have a firm ETA yet.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional
Posted by Peter Xu 1 month, 2 weeks ago
On Mon, Sep 29, 2025 at 05:55:50PM +0100, Daniel P. Berrangé wrote:
> On Mon, Sep 29, 2025 at 11:58:21AM -0400, Peter Xu wrote:
> > On Fri, Aug 01, 2025 at 06:02:10PM +0100, Daniel P. Berrangé wrote:
> > > This is a followup to previously merged patches that claimed to
> > > workaround the gnutls bug impacting migration, but in fact were
> > > essentially non-functional. Juraj Marcin pointed this out, and
> > > this new patch tweaks the workaround to make it actually do
> > > something useful.
> > > 
> > > Daniel P. Berrangé (2):
> > >   migration: simplify error reporting after channel read
> > >   migration: fix workaround for gnutls thread safety
> > > 
> > >  crypto/tlssession.c   | 16 ----------------
> > >  migration/qemu-file.c | 22 +++++++++++++++++-----
> > >  2 files changed, 17 insertions(+), 21 deletions(-)
> > 
> > Dan, is there a planned repost on this one?
> 
> It is on my todo list, but I don't have a firm ETA yet.

Sure.  I assume in production people should really be relying on the gnutls
upgrade.

I queued patch 1 then for now, thanks.

-- 
Peter Xu


Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional
Posted by Juraj Marcin 3 months, 1 week ago
Hi Daniel,

On 2025-08-01 18:02, Daniel P. Berrangé wrote:
> This is a followup to previously merged patches that claimed to
> workaround the gnutls bug impacting migration, but in fact were
> essentially non-functional. Juraj Marcin pointed this out, and
> this new patch tweaks the workaround to make it actually do
> something useful.
> 
> Daniel P. Berrangé (2):
>   migration: simplify error reporting after channel read
>   migration: fix workaround for gnutls thread safety
> 
>  crypto/tlssession.c   | 16 ----------------
>  migration/qemu-file.c | 22 +++++++++++++++++-----
>  2 files changed, 17 insertions(+), 21 deletions(-)
> 

thanks for finding a fix for the workaround. I have tested it and it
resolves the issue.

However, it significantly slows down migration, even with the workaround
disabled (and thus no locking). When benchmarking, I used the fixed
version of GNUTLS, VM with 20GB of RAM which were fully written to
before starting a normal migration with no workload during the
migration.

Test cases:
[1]: before this patchset
[2]: with this patchset applied and GNUTLS workaround enabled
[3]: with this patchset applied and GNUTLS workaround disabled

  | Total time | Throughput | Transferred bytes |
--+------------+------------+-------------------+
1 |  31 192 ms |  5450 mbps |    21 230 973 763 |
2 |  74 147 ms |  2291 mbps |    21 232 849 066 |
3 |  72 426 ms |  2343 mbps |    21 215 009 392 |

(data reported by query-migrate QMP command after migration completion)


Therefore, this workaround can be used in environments where we cannot
use the fixed version of the GNUTLS library, but I think it would be
better to use polling only if locking is actually used, so that the
performance with the proper GNUTLS fix is not hindered. Maybe
yielding/waiting inside tlssession.c before actually locking?


Best regards,

Juraj Marcin
Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional
Posted by Fabiano Rosas 3 months, 1 week ago
Juraj Marcin <jmarcin@redhat.com> writes:

> Hi Daniel,
>
> On 2025-08-01 18:02, Daniel P. Berrangé wrote:
>> This is a followup to previously merged patches that claimed to
>> workaround the gnutls bug impacting migration, but in fact were
>> essentially non-functional. Juraj Marcin pointed this out, and
>> this new patch tweaks the workaround to make it actually do
>> something useful.
>> 
>> Daniel P. Berrangé (2):
>>   migration: simplify error reporting after channel read
>>   migration: fix workaround for gnutls thread safety
>> 
>>  crypto/tlssession.c   | 16 ----------------
>>  migration/qemu-file.c | 22 +++++++++++++++++-----
>>  2 files changed, 17 insertions(+), 21 deletions(-)
>> 
>
> thanks for finding a fix for the workaround. I have tested it and it
> resolves the issue.
>
> However, it significantly slows down migration, even with the workaround
> disabled (and thus no locking). When benchmarking, I used the fixed
> version of GNUTLS, VM with 20GB of RAM which were fully written to
> before starting a normal migration with no workload during the
> migration.
>
> Test cases:
> [1]: before this patchset
> [2]: with this patchset applied and GNUTLS workaround enabled
> [3]: with this patchset applied and GNUTLS workaround disabled
>
>   | Total time | Throughput | Transferred bytes |
> --+------------+------------+-------------------+
> 1 |  31 192 ms |  5450 mbps |    21 230 973 763 |
> 2 |  74 147 ms |  2291 mbps |    21 232 849 066 |
> 3 |  72 426 ms |  2343 mbps |    21 215 009 392 |

Thanks for testing this. I had just managed to convince myself that there
wouldn't be any performance issues.

The yield at every buffer fill on the incoming side is probably way more
impactful than the poll on the RP.
Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional
Posted by Daniel P. Berrangé 3 months, 1 week ago
On Mon, Aug 04, 2025 at 04:27:45PM -0300, Fabiano Rosas wrote:
> Juraj Marcin <jmarcin@redhat.com> writes:
> 
> > Hi Daniel,
> >
> > On 2025-08-01 18:02, Daniel P. Berrangé wrote:
> >> This is a followup to previously merged patches that claimed to
> >> workaround the gnutls bug impacting migration, but in fact were
> >> essentially non-functional. Juraj Marcin pointed this out, and
> >> this new patch tweaks the workaround to make it actually do
> >> something useful.
> >> 
> >> Daniel P. Berrangé (2):
> >>   migration: simplify error reporting after channel read
> >>   migration: fix workaround for gnutls thread safety
> >> 
> >>  crypto/tlssession.c   | 16 ----------------
> >>  migration/qemu-file.c | 22 +++++++++++++++++-----
> >>  2 files changed, 17 insertions(+), 21 deletions(-)
> >> 
> >
> > thanks for finding a fix for the workaround. I have tested it and it
> > resolves the issue.
> >
> > However, it significantly slows down migration, even with the workaround
> > disabled (and thus no locking). When benchmarking, I used the fixed
> > version of GNUTLS, VM with 20GB of RAM which were fully written to
> > before starting a normal migration with no workload during the
> > migration.
> >
> > Test cases:
> > [1]: before this patchset
> > [2]: with this patchset applied and GNUTLS workaround enabled
> > [3]: with this patchset applied and GNUTLS workaround disabled
> >
> >   | Total time | Throughput | Transferred bytes |
> > --+------------+------------+-------------------+
> > 1 |  31 192 ms |  5450 mbps |    21 230 973 763 |
> > 2 |  74 147 ms |  2291 mbps |    21 232 849 066 |
> > 3 |  72 426 ms |  2343 mbps |    21 215 009 392 |
> 
> Thanks for testing this. I had just managed to convince myself that there
> wouldn't be any performance issues.
> 
> The yield at every buffer fill on the incoming side is probably way more
> impactful than the poll on the RP.

Yeah, that's an unacceptable penalty on the incoming side for sure.

How about we simply change the outbound migration channel to be in
non-blocking mode?  I originally put it in blocking mode way back
in 9e4d2b98ee98f4cee50d671e500eceeefa751ee0, but if I look at the
QEMUFile impl of qemu_fill_buffer and qemu_fflush, both should be
coping with a non-blocking socket. qemu_fill_buffer has explicit
code to wait, and qemu_fflush uses the _all() variant which has
built-in support for waiting. So I'm not seeing an obvious need
to run the channel in blocking mode.

Using non-blocking will prevent the return path thread sitting
in a read() call, so we won't have mutual exclusion between read
and write, which this patch was trying to avoid.

ie, this delta on top of this patch:

diff --git a/migration/migration.c b/migration/migration.c
index 10c216d25d..1eaabc1f19 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -4002,7 +4002,7 @@ void migration_connect(MigrationState *s, Error *error_in)
     }
 
     migration_rate_set(rate_limit);
-    qemu_file_set_blocking(s->to_dst_file, true);
+    qemu_file_set_blocking(s->to_dst_file, false);
 
     /*
      * Open the return path. For postcopy, it is used exclusively. For
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index cf6115e699..8ee44c5ac9 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -338,22 +338,6 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
         return 0;
     }
 
-    /*
-     * This feature triggers acquisition of mutexes around every
-     * read and write. Thus we must not sit in a blocking read
-     * if this is set, but must instead poll proactively. This
-     * does not work with some channel types, however, so must
-     * only pre-poll when the featre is set.
-     */
-    if (qio_channel_has_feature(f->ioc,
-                                QIO_CHANNEL_FEATURE_CONCURRENT_IO)) {
-        if (qemu_in_coroutine()) {
-            qio_channel_yield(f->ioc, G_IO_IN);
-        } else {
-            qio_channel_wait(f->ioc, G_IO_IN);
-        }
-    }
-
     do {
         struct iovec iov = { f->buf + pending, IO_BUF_SIZE - pending };
         len = qio_channel_readv_full(f->ioc, &iov, 1, pfds, pnfd, 0,


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional
Posted by Fabiano Rosas 3 months, 1 week ago
Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Aug 04, 2025 at 04:27:45PM -0300, Fabiano Rosas wrote:
>> Juraj Marcin <jmarcin@redhat.com> writes:
>> 
>> > Hi Daniel,
>> >
>> > On 2025-08-01 18:02, Daniel P. Berrangé wrote:
>> >> This is a followup to previously merged patches that claimed to
>> >> workaround the gnutls bug impacting migration, but in fact were
>> >> essentially non-functional. Juraj Marcin pointed this out, and
>> >> this new patch tweaks the workaround to make it actually do
>> >> something useful.
>> >> 
>> >> Daniel P. Berrangé (2):
>> >>   migration: simplify error reporting after channel read
>> >>   migration: fix workaround for gnutls thread safety
>> >> 
>> >>  crypto/tlssession.c   | 16 ----------------
>> >>  migration/qemu-file.c | 22 +++++++++++++++++-----
>> >>  2 files changed, 17 insertions(+), 21 deletions(-)
>> >> 
>> >
>> > thanks for finding a fix for the workaround. I have tested it and it
>> > resolves the issue.
>> >
>> > However, it significantly slows down migration, even with the workaround
>> > disabled (and thus no locking). When benchmarking, I used the fixed
>> > version of GNUTLS, VM with 20GB of RAM which were fully written to
>> > before starting a normal migration with no workload during the
>> > migration.
>> >
>> > Test cases:
>> > [1]: before this patchset
>> > [2]: with this patchset applied and GNUTLS workaround enabled
>> > [3]: with this patchset applied and GNUTLS workaround disabled
>> >
>> >   | Total time | Throughput | Transferred bytes |
>> > --+------------+------------+-------------------+
>> > 1 |  31 192 ms |  5450 mbps |    21 230 973 763 |
>> > 2 |  74 147 ms |  2291 mbps |    21 232 849 066 |
>> > 3 |  72 426 ms |  2343 mbps |    21 215 009 392 |
>> 
>> Thanks for testing this. I had just managed to convince myself that there
>> wouldn't be any performance issues.
>> 
>> The yield at every buffer fill on the incoming side is probably way more
>> impactful than the poll on the RP.
>
> Yeah, that's an unacceptable penalty on the incoming side for sure.
>
> How about we simply change the outbound migration channel to be in
> non-blocking mode?  I originally put it in blocking mode way back
> in 9e4d2b98ee98f4cee50d671e500eceeefa751ee0, but if I look at the
> QEMUFile impl of qemu_fill_buffer and qemu_fflush, both should be
> coping with a non-blocking socket. qemu_fill_buffer has explicit
> code to wait, and qemu_fflush uses the _all() variant which has
> built-in support for waiting. So I'm not seeing an obvious need
> to run the channel in blocking mode.
>

It's definitely simpler and I think it works. It's uncomfortably late
though to add a bunch of glib event loop code to the migration
thread. Is the suggestion of moving the yield to tlssession.c even
feasible?

Juraj, are you able to get some numbers with this?

> Using non-blocking will prevent the return path thread sitting
> in a read() call, so we won't have mutual exclusion between read
> and write, which this patch was trying to avoid.
>
> ie, this delta on top of this patch:
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 10c216d25d..1eaabc1f19 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -4002,7 +4002,7 @@ void migration_connect(MigrationState *s, Error *error_in)
>      }
>  
>      migration_rate_set(rate_limit);
> -    qemu_file_set_blocking(s->to_dst_file, true);
> +    qemu_file_set_blocking(s->to_dst_file, false);
>  
>      /*
>       * Open the return path. For postcopy, it is used exclusively. For
> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> index cf6115e699..8ee44c5ac9 100644
> --- a/migration/qemu-file.c
> +++ b/migration/qemu-file.c
> @@ -338,22 +338,6 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
>          return 0;
>      }
>  
> -    /*
> -     * This feature triggers acquisition of mutexes around every
> -     * read and write. Thus we must not sit in a blocking read
> -     * if this is set, but must instead poll proactively. This
> -     * does not work with some channel types, however, so must
> -     * only pre-poll when the featre is set.
> -     */
> -    if (qio_channel_has_feature(f->ioc,
> -                                QIO_CHANNEL_FEATURE_CONCURRENT_IO)) {
> -        if (qemu_in_coroutine()) {
> -            qio_channel_yield(f->ioc, G_IO_IN);
> -        } else {
> -            qio_channel_wait(f->ioc, G_IO_IN);
> -        }
> -    }
> -
>      do {
>          struct iovec iov = { f->buf + pending, IO_BUF_SIZE - pending };
>          len = qio_channel_readv_full(f->ioc, &iov, 1, pfds, pnfd, 0,
>
>
> With regards,
> Daniel
Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional
Posted by Daniel P. Berrangé 3 months, 1 week ago
On Tue, Aug 05, 2025 at 10:44:41AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Mon, Aug 04, 2025 at 04:27:45PM -0300, Fabiano Rosas wrote:
> >> Juraj Marcin <jmarcin@redhat.com> writes:
> >> 
> >> > Hi Daniel,
> >> >
> >> > On 2025-08-01 18:02, Daniel P. Berrangé wrote:
> >> >> This is a followup to previously merged patches that claimed to
> >> >> workaround the gnutls bug impacting migration, but in fact were
> >> >> essentially non-functional. Juraj Marcin pointed this out, and
> >> >> this new patch tweaks the workaround to make it actually do
> >> >> something useful.
> >> >> 
> >> >> Daniel P. Berrangé (2):
> >> >>   migration: simplify error reporting after channel read
> >> >>   migration: fix workaround for gnutls thread safety
> >> >> 
> >> >>  crypto/tlssession.c   | 16 ----------------
> >> >>  migration/qemu-file.c | 22 +++++++++++++++++-----
> >> >>  2 files changed, 17 insertions(+), 21 deletions(-)
> >> >> 
> >> >
> >> > thanks for finding a fix for the workaround. I have tested it and it
> >> > resolves the issue.
> >> >
> >> > However, it significantly slows down migration, even with the workaround
> >> > disabled (and thus no locking). When benchmarking, I used the fixed
> >> > version of GNUTLS, VM with 20GB of RAM which were fully written to
> >> > before starting a normal migration with no workload during the
> >> > migration.
> >> >
> >> > Test cases:
> >> > [1]: before this patchset
> >> > [2]: with this patchset applied and GNUTLS workaround enabled
> >> > [3]: with this patchset applied and GNUTLS workaround disabled
> >> >
> >> >   | Total time | Throughput | Transferred bytes |
> >> > --+------------+------------+-------------------+
> >> > 1 |  31 192 ms |  5450 mbps |    21 230 973 763 |
> >> > 2 |  74 147 ms |  2291 mbps |    21 232 849 066 |
> >> > 3 |  72 426 ms |  2343 mbps |    21 215 009 392 |
> >> 
> >> Thanks for testing this. I had just managed to convince myself that there
> >> wouldn't be any performance issues.
> >> 
> >> The yield at every buffer fill on the incoming side is probably way more
> >> impactful than the poll on the RP.
> >
> > Yeah, that's an unacceptable penalty on the incoming side for sure.
> >
> > How about we simply change the outbound migration channel to be in
> > non-blocking mode?  I originally put it in blocking mode way back
> > in 9e4d2b98ee98f4cee50d671e500eceeefa751ee0, but if I look at the
> > QEMUFile impl of qemu_fill_buffer and qemu_fflush, both should be
> > coping with a non-blocking socket. qemu_fill_buffer has explicit
> > code to wait, and qemu_fflush uses the _all() variant which has
> > built-in support for waiting. So I'm not seeing an obvious need
> > to run the channel in blocking mode.
> >
> 
> It's definitely simpler and I think it works. It's uncomfortably late
> though to add a bunch of glib event loop code to the migration
> thread. Is the suggestion of moving the yield to tlssession.c even
> feasible?

Well that'll remove the burden for the non-TLS incoming migration,
but the incoming TLS migration will still have the redundant
yields and so still suffer a hit.

Given where we are in freeze, I'm thinking we should just hard
disable the workaround for this release, and re-attempt it in
next cycle and then we can bring it back to stable afterwards.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional
Posted by Fabiano Rosas 3 months, 1 week ago
Daniel P. Berrangé <berrange@redhat.com> writes:

> On Tue, Aug 05, 2025 at 10:44:41AM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> > On Mon, Aug 04, 2025 at 04:27:45PM -0300, Fabiano Rosas wrote:
>> >> Juraj Marcin <jmarcin@redhat.com> writes:
>> >> 
>> >> > Hi Daniel,
>> >> >
>> >> > On 2025-08-01 18:02, Daniel P. Berrangé wrote:
>> >> >> This is a followup to previously merged patches that claimed to
>> >> >> workaround the gnutls bug impacting migration, but in fact were
>> >> >> essentially non-functional. Juraj Marcin pointed this out, and
>> >> >> this new patch tweaks the workaround to make it actually do
>> >> >> something useful.
>> >> >> 
>> >> >> Daniel P. Berrangé (2):
>> >> >>   migration: simplify error reporting after channel read
>> >> >>   migration: fix workaround for gnutls thread safety
>> >> >> 
>> >> >>  crypto/tlssession.c   | 16 ----------------
>> >> >>  migration/qemu-file.c | 22 +++++++++++++++++-----
>> >> >>  2 files changed, 17 insertions(+), 21 deletions(-)
>> >> >> 
>> >> >
>> >> > thanks for finding a fix for the workaround. I have tested it and it
>> >> > resolves the issue.
>> >> >
>> >> > However, it significantly slows down migration, even with the workaround
>> >> > disabled (and thus no locking). When benchmarking, I used the fixed
>> >> > version of GNUTLS, VM with 20GB of RAM which were fully written to
>> >> > before starting a normal migration with no workload during the
>> >> > migration.
>> >> >
>> >> > Test cases:
>> >> > [1]: before this patchset
>> >> > [2]: with this patchset applied and GNUTLS workaround enabled
>> >> > [3]: with this patchset applied and GNUTLS workaround disabled
>> >> >
>> >> >   | Total time | Throughput | Transferred bytes |
>> >> > --+------------+------------+-------------------+
>> >> > 1 |  31 192 ms |  5450 mbps |    21 230 973 763 |
>> >> > 2 |  74 147 ms |  2291 mbps |    21 232 849 066 |
>> >> > 3 |  72 426 ms |  2343 mbps |    21 215 009 392 |
>> >> 
>> >> Thanks for testing this. I had just managed to convince myself that there
>> >> wouldn't be any performance issues.
>> >> 
>> >> The yield at every buffer fill on the incoming side is probably way more
>> >> impactful than the poll on the RP.
>> >
>> > Yeah, that's an unacceptable penalty on the incoming side for sure.
>> >
>> > How about we simply change the outbound migration channel to be in
>> > non-blocking mode?  I originally put it in blocking mode way back
>> > in 9e4d2b98ee98f4cee50d671e500eceeefa751ee0, but if I look at the
>> > QEMUFile impl of qemu_fill_buffer and qemu_fflush, both should be
>> > coping with a non-blocking socket. qemu_fill_buffer has explicit
>> > code to wait, and qemu_fflush uses the _all() variant which has
>> > built-in support for waiting. So I'm not seeing an obvious need
>> > to run the channel in blocking mode.
>> >
>> 
>> It's definitely simpler and I think it works. It's uncomfortably late
>> though to add a bunch of glib event loop code to the migration
>> thread. Is the suggestion of moving the yield to tlssession.c even
>> feasible?
>
> Well that'll remove the burden for the non-TLS incoming migration,
> but the incoming TLS migration will still have the redundant
> yields and so still suffer a hit.
>
> Given where we are in freeze, I'm thinking we should just hard
> disable the workaround for this release, and re-attempt it in
> next cycle and then we can bring it back to stable afterwards.
>

Yes, I agree. Will you send a patch?

> With regards,
> Daniel
Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional
Posted by Juraj Marcin 3 months, 1 week ago
Hi all!

On 2025-08-05 10:44, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Mon, Aug 04, 2025 at 04:27:45PM -0300, Fabiano Rosas wrote:
> >> Juraj Marcin <jmarcin@redhat.com> writes:
> >> 
> >> > Hi Daniel,
> >> >
> >> > On 2025-08-01 18:02, Daniel P. Berrangé wrote:
> >> >> This is a followup to previously merged patches that claimed to
> >> >> workaround the gnutls bug impacting migration, but in fact were
> >> >> essentially non-functional. Juraj Marcin pointed this out, and
> >> >> this new patch tweaks the workaround to make it actually do
> >> >> something useful.
> >> >> 
> >> >> Daniel P. Berrangé (2):
> >> >>   migration: simplify error reporting after channel read
> >> >>   migration: fix workaround for gnutls thread safety
> >> >> 
> >> >>  crypto/tlssession.c   | 16 ----------------
> >> >>  migration/qemu-file.c | 22 +++++++++++++++++-----
> >> >>  2 files changed, 17 insertions(+), 21 deletions(-)
> >> >> 
> >> >
> >> > thanks for finding a fix for the workaround. I have tested it and it
> >> > resolves the issue.
> >> >
> >> > However, it significantly slows down migration, even with the workaround
> >> > disabled (and thus no locking). When benchmarking, I used the fixed
> >> > version of GNUTLS, VM with 20GB of RAM which were fully written to
> >> > before starting a normal migration with no workload during the
> >> > migration.
> >> >
> >> > Test cases:
> >> > [1]: before this patchset
> >> > [2]: with this patchset applied and GNUTLS workaround enabled
> >> > [3]: with this patchset applied and GNUTLS workaround disabled
> >> >
> >> >   | Total time | Throughput | Transferred bytes |
> >> > --+------------+------------+-------------------+
> >> > 1 |  31 192 ms |  5450 mbps |    21 230 973 763 |
> >> > 2 |  74 147 ms |  2291 mbps |    21 232 849 066 |
> >> > 3 |  72 426 ms |  2343 mbps |    21 215 009 392 |
> >> 
> >> Thanks for testing this. I had just managed to convince myself that there
> >> wouldn't be any performance issues.
> >> 
> >> The yield at every buffer fill on the incoming side is probably way more
> >> impactful than the poll on the RP.
> >
> > Yeah, that's an unacceptable penalty on the incoming side for sure.
> >
> > How about we simply change the outbound migration channel to be in
> > non-blocking mode?  I originally put it in blocking mode way back
> > in 9e4d2b98ee98f4cee50d671e500eceeefa751ee0, but if I look at the
> > QEMUFile impl of qemu_fill_buffer and qemu_fflush, both should be
> > coping with a non-blocking socket. qemu_fill_buffer has explicit
> > code to wait, and qemu_fflush uses the _all() variant which has
> > built-in support for waiting. So I'm not seeing an obvious need
> > to run the channel in blocking mode.
> >
> 
> It's definitely simpler and I think it works. It's uncomfortably late
> though to add a bunch of glib event loop code to the migration
> thread. Is the suggestion of moving the yield to tlssession.c even
> feasible?
> 
> Juraj, are you able to get some numbers with this?

Yes, I have tested it with this patch and it still works, now without
any noticeable performance regressions.

Test cases:
[0]: before this patchset with GNUTLS workaround disabled
[1]: before this patchset with GNUTLS workaround enabled
[2]: with this patchset applied and GNUTLS workaround enabled
[3]: with this patchset applied and GNUTLS workaround disabled

  | Total time | Throughput  | Transferred bytes |
--+------------+-------------+-------------------+
0 |  32 464 ms |  5 228 mbps |    21 213 283 021 |
1 |  32 429 ms |  5 236 mbps |    21 224 062 348 |
2 |  32 934 ms |  5 150 mbps |    21 200 558 083 |
3 |  31 547 ms |  5 384 mbps |    21 229 225 792 |


Tested-by: Juraj Marcin <jmarcin@redhat.com>


Best regards

Juraj Marcin

> 
> > Using non-blocking will prevent the return path thread sitting
> > in a read() call, so we won't have mutual exclusion between read
> > and write, which this patch was trying to avoid.
> >
> > ie, this delta on top of this patch:
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 10c216d25d..1eaabc1f19 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -4002,7 +4002,7 @@ void migration_connect(MigrationState *s, Error *error_in)
> >      }
> >  
> >      migration_rate_set(rate_limit);
> > -    qemu_file_set_blocking(s->to_dst_file, true);
> > +    qemu_file_set_blocking(s->to_dst_file, false);
> >  
> >      /*
> >       * Open the return path. For postcopy, it is used exclusively. For
> > diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> > index cf6115e699..8ee44c5ac9 100644
> > --- a/migration/qemu-file.c
> > +++ b/migration/qemu-file.c
> > @@ -338,22 +338,6 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
> >          return 0;
> >      }
> >  
> > -    /*
> > -     * This feature triggers acquisition of mutexes around every
> > -     * read and write. Thus we must not sit in a blocking read
> > -     * if this is set, but must instead poll proactively. This
> > -     * does not work with some channel types, however, so must
> > -     * only pre-poll when the featre is set.
> > -     */
> > -    if (qio_channel_has_feature(f->ioc,
> > -                                QIO_CHANNEL_FEATURE_CONCURRENT_IO)) {
> > -        if (qemu_in_coroutine()) {
> > -            qio_channel_yield(f->ioc, G_IO_IN);
> > -        } else {
> > -            qio_channel_wait(f->ioc, G_IO_IN);
> > -        }
> > -    }
> > -
> >      do {
> >          struct iovec iov = { f->buf + pending, IO_BUF_SIZE - pending };
> >          len = qio_channel_readv_full(f->ioc, &iov, 1, pfds, pnfd, 0,
> >
> >
> > With regards,
> > Daniel
> 
Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional
Posted by Peter Xu 3 months, 1 week ago
On Tue, Aug 05, 2025 at 04:52:58PM +0200, Juraj Marcin wrote:
> Hi all!
> 
> On 2025-08-05 10:44, Fabiano Rosas wrote:
> > Daniel P. Berrangé <berrange@redhat.com> writes:
> > 
> > > On Mon, Aug 04, 2025 at 04:27:45PM -0300, Fabiano Rosas wrote:
> > >> Juraj Marcin <jmarcin@redhat.com> writes:
> > >> 
> > >> > Hi Daniel,
> > >> >
> > >> > On 2025-08-01 18:02, Daniel P. Berrangé wrote:
> > >> >> This is a followup to previously merged patches that claimed to
> > >> >> workaround the gnutls bug impacting migration, but in fact were
> > >> >> essentially non-functional. Juraj Marcin pointed this out, and
> > >> >> this new patch tweaks the workaround to make it actually do
> > >> >> something useful.
> > >> >> 
> > >> >> Daniel P. Berrangé (2):
> > >> >>   migration: simplify error reporting after channel read
> > >> >>   migration: fix workaround for gnutls thread safety
> > >> >> 
> > >> >>  crypto/tlssession.c   | 16 ----------------
> > >> >>  migration/qemu-file.c | 22 +++++++++++++++++-----
> > >> >>  2 files changed, 17 insertions(+), 21 deletions(-)
> > >> >> 
> > >> >
> > >> > thanks for finding a fix for the workaround. I have tested it and it
> > >> > resolves the issue.
> > >> >
> > >> > However, it significantly slows down migration, even with the workaround
> > >> > disabled (and thus no locking). When benchmarking, I used the fixed
> > >> > version of GNUTLS, VM with 20GB of RAM which were fully written to
> > >> > before starting a normal migration with no workload during the
> > >> > migration.
> > >> >
> > >> > Test cases:
> > >> > [1]: before this patchset
> > >> > [2]: with this patchset applied and GNUTLS workaround enabled
> > >> > [3]: with this patchset applied and GNUTLS workaround disabled
> > >> >
> > >> >   | Total time | Throughput | Transferred bytes |
> > >> > --+------------+------------+-------------------+
> > >> > 1 |  31 192 ms |  5450 mbps |    21 230 973 763 |
> > >> > 2 |  74 147 ms |  2291 mbps |    21 232 849 066 |
> > >> > 3 |  72 426 ms |  2343 mbps |    21 215 009 392 |
> > >> 
> > >> Thanks for testing this. I had just managed to convince myself that there
> > >> wouldn't be any performance issues.
> > >> 
> > >> The yield at every buffer fill on the incoming side is probably way more
> > >> impactful than the poll on the RP.
> > >
> > > Yeah, that's an unacceptable penalty on the incoming side for sure.
> > >
> > > How about we simply change the outbound migration channel to be in
> > > non-blocking mode?  I originally put it in blocking mode way back
> > > in 9e4d2b98ee98f4cee50d671e500eceeefa751ee0, but looking at the
> > > QEMUFile impls of qemu_fill_buffer and qemu_fflush, both should be
> > > coping with a non-blocking socket. qemu_fill_buffer has explicit
> > > code to wait, and qemu_fflush uses the _all() variant which has
> > > built-in support for waiting. So I'm not seeing an obvious need
> > > to run the channel in blocking mode.

Sync IOs actually made some sense to me here, since when there's no lock to
release, it sounds more efficient to wait in the syscall until all the
buffers are flushed on read/write.  That is compared to async, where we
need to exit to userspace, qio_channel_wait(), then retry the syscall.

Is there any concern that in some cases we could get frequent G_IO_*
events while each read/write syscall covers a large buffer, so that the
overhead of async becomes measurable (one syscall may trigger multiple
loops of wait() and retry)?  From Juraj's test results it looks like it is
at least not measurable in the use case in question.  However, I'm not sure
about the rest.  For example, we still have the option to only enable async
channels for TLS, but I'm not sure whether that's necessary, either.

> > 
> > It's definitely simpler and I think it works. It's uncomfortably late
> > though to add a bunch of glib event loop code to the migration
> > thread. Is the suggestion of moving the yield to tlssession.c even
> > feasible?
> > 
> > Juraj, are you able to get some numbers with this?
> 
> Yes, I have tested it with this patch and it still works, now without
> any noticeable performance regressions.
> 
> Test cases:
> [0]: before this patchset with GNUTLS workaround disabled
> [1]: before this patchset with GNUTLS workaround enabled
> [2]: with this patchset applied and GNUTLS workaround enabled
> [3]: with this patchset applied and GNUTLS workaround disabled
> 
>   | Total time | Throughput  | Transferred bytes |
> --+------------+-------------+-------------------+
> 0 |  32 464 ms |  5 228 mbps |    21 213 283 021 |
> 1 |  32 429 ms |  5 236 mbps |    21 224 062 348 |
> 2 |  32 934 ms |  5 150 mbps |    21 200 558 083 |
> 3 |  31 547 ms |  5 384 mbps |    21 229 225 792 |
> 
> 
> Tested-by: Juraj Marcin <jmarcin@redhat.com>
> 
> 
> Best regards
> 
> Juraj Marcin
> 
> > 
> > > Using non-blocking will prevent the return path thread sitting
> > > in a read() call, so we won't have the mutual exclusion between
> > > read and write that this patch was trying to avoid.
> > >
> > > ie, this delta on top of this patch:
> > >
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 10c216d25d..1eaabc1f19 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -4002,7 +4002,7 @@ void migration_connect(MigrationState *s, Error *error_in)
> > >      }
> > >  
> > >      migration_rate_set(rate_limit);
> > > -    qemu_file_set_blocking(s->to_dst_file, true);
> > > +    qemu_file_set_blocking(s->to_dst_file, false);
> > >  
> > >      /*
> > >       * Open the return path. For postcopy, it is used exclusively. For
> > > diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> > > index cf6115e699..8ee44c5ac9 100644
> > > --- a/migration/qemu-file.c
> > > +++ b/migration/qemu-file.c
> > > @@ -338,22 +338,6 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
> > >          return 0;
> > >      }
> > >  
> > > -    /*
> > > -     * This feature triggers acquisition of mutexes around every
> > > -     * read and write. Thus we must not sit in a blocking read
> > > -     * if this is set, but must instead poll proactively. This
> > > -     * does not work with some channel types, however, so must
> > > -     * only pre-poll when the feature is set.
> > > -     */
> > > -    if (qio_channel_has_feature(f->ioc,
> > > -                                QIO_CHANNEL_FEATURE_CONCURRENT_IO)) {
> > > -        if (qemu_in_coroutine()) {
> > > -            qio_channel_yield(f->ioc, G_IO_IN);
> > > -        } else {
> > > -            qio_channel_wait(f->ioc, G_IO_IN);
> > > -        }
> > > -    }
> > > -
> > >      do {
> > >          struct iovec iov = { f->buf + pending, IO_BUF_SIZE - pending };
> > >          len = qio_channel_readv_full(f->ioc, &iov, 1, pfds, pnfd, 0,
> > >
> > >
> > > With regards,
> > > Daniel
> > 
> 

-- 
Peter Xu