[PATCH] migration: fix SEEK_CUR offset calculation in qio_channel_block_seek

Marco Cavenati posted 1 patch 10 months, 2 weeks ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20250326162230.3323199-1-Marco.Cavenati@eurecom.fr
Maintainers: Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>
migration/channel-block.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
[PATCH] migration: fix SEEK_CUR offset calculation in qio_channel_block_seek
Posted by Marco Cavenati 10 months, 2 weeks ago
The SEEK_CUR case in qio_channel_block_seek was incorrectly using the
'whence' parameter instead of the 'offset' parameter when calculating the
new position.

Fixes: 65cf200a51ddc6d0a28ecceac30dc892233cddd7 ("migration: introduce a QIOChannel impl for BlockDriverState VMState")

Signed-off-by: Marco Cavenati <Marco.Cavenati@eurecom.fr>
---
 migration/channel-block.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/channel-block.c b/migration/channel-block.c
index fff8d87094..b0477f5b6d 100644
--- a/migration/channel-block.c
+++ b/migration/channel-block.c
@@ -123,7 +123,7 @@ qio_channel_block_seek(QIOChannel *ioc,
         bioc->offset = offset;
         break;
     case SEEK_CUR:
-        bioc->offset += whence;
+        bioc->offset += offset;
         break;
     case SEEK_END:
         error_setg(errp, "Size of VMstate region is unknown");
-- 
2.43.0
Re: [PATCH] migration: fix SEEK_CUR offset calculation in qio_channel_block_seek
Posted by Michael Tokarev 10 months, 2 weeks ago
26.03.2025 19:22, Marco Cavenati wrote:
> The SEEK_CUR case in qio_channel_block_seek was incorrectly using the
> 'whence' parameter instead of the 'offset' parameter when calculating the
> new position.
> 
> Fixes: 65cf200a51ddc6d0a28ecceac30dc892233cddd7 ("migration: introduce a QIOChannel impl for BlockDriverState VMState")
> 
> Signed-off-by: Marco Cavenati <Marco.Cavenati@eurecom.fr>
> ---
>   migration/channel-block.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/migration/channel-block.c b/migration/channel-block.c
> index fff8d87094..b0477f5b6d 100644
> --- a/migration/channel-block.c
> +++ b/migration/channel-block.c
> @@ -123,7 +123,7 @@ qio_channel_block_seek(QIOChannel *ioc,
>           bioc->offset = offset;
>           break;
>       case SEEK_CUR:
> -        bioc->offset += whence;
> +        bioc->offset += offset;
>           break;
>       case SEEK_END:
>           error_setg(errp, "Size of VMstate region is unknown");

Reviewed-by: Michael Tokarev <mjt@tls.msk.ru>

This is a (trivial) bugfix, I'd say it should be in 10.0.
Will you guys send a pullreq for the block layer, or should
I make a single-patch pullreq from the trivial tree?

Thanks,

/mjt
Re: [PATCH] migration: fix SEEK_CUR offset calculation in qio_channel_block_seek
Posted by Fabiano Rosas 10 months, 2 weeks ago
Michael Tokarev <mjt@tls.msk.ru> writes:

> 26.03.2025 19:22, Marco Cavenati wrote:
>> The SEEK_CUR case in qio_channel_block_seek was incorrectly using the
>> 'whence' parameter instead of the 'offset' parameter when calculating the
>> new position.
>> 
>> Fixes: 65cf200a51ddc6d0a28ecceac30dc892233cddd7 ("migration: introduce a QIOChannel impl for BlockDriverState VMState")
>> 
>> Signed-off-by: Marco Cavenati <Marco.Cavenati@eurecom.fr>
>> ---
>>   migration/channel-block.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/migration/channel-block.c b/migration/channel-block.c
>> index fff8d87094..b0477f5b6d 100644
>> --- a/migration/channel-block.c
>> +++ b/migration/channel-block.c
>> @@ -123,7 +123,7 @@ qio_channel_block_seek(QIOChannel *ioc,
>>           bioc->offset = offset;
>>           break;
>>       case SEEK_CUR:
>> -        bioc->offset += whence;
>> +        bioc->offset += offset;
>>           break;
>>       case SEEK_END:
>>           error_setg(errp, "Size of VMstate region is unknown");
>
> Reviewed-by: Michael Tokarev <mjt@tls.msk.ru>
>
> This is a (trivial) bugfix, I'd say it should be in 10.0.
> Will you guys send a pullreq for the block layer, or should
> I make a single-patch pullreq from the trivial tree?

I'll take it. It's not entirely trivial as it shifts a value by 1 in
mapped-ram migration. Fortunately, it's a value that doesn't need to be
the same between migration source and destination.

Thanks

>
> Thanks,
>
> /mjt
Re: [PATCH] migration: fix SEEK_CUR offset calculation in qio_channel_block_seek
Posted by Michael Tokarev 5 months, 2 weeks ago
Hi!

This is

commit c0b32426ce56182c1ce2a12904f3a702c2ecc460
Author: Marco Cavenati <Marco.Cavenati@eurecom.fr>
Date:   Wed Mar 26 17:22:30 2025 +0100

     migration: fix SEEK_CUR offset calculation in qio_channel_block_seek

which went to 10.0.0-rc2, and has been cherry-picked to
7.2 and 9.2 stable series.

Reportedly it breaks migration in 7.2.18 and up.  Which is
kinda strange, as it shouldn't do any harm?

https://bugs.debian.org/1112044

any guess what's going on?

Thanks,

/mjt
Re: [PATCH] migration: fix SEEK_CUR offset calculation in qio_channel_block_seek
Posted by Fabiano Rosas 5 months, 2 weeks ago
Michael Tokarev <mjt@tls.msk.ru> writes:

+CC Akihiko

> Hi!
>
> This is
>
> commit c0b32426ce56182c1ce2a12904f3a702c2ecc460
> Author: Marco Cavenati <Marco.Cavenati@eurecom.fr>
> Date:   Wed Mar 26 17:22:30 2025 +0100
>
>      migration: fix SEEK_CUR offset calculation in qio_channel_block_seek
>
> which went to 10.0.0-rc2, and has been cherry-picked to
> 7.2 and 9.2 stable series.
>
> Reportedly it breaks migration in 7.2.18 and up.  Which is
> kinda strange, as it shouldn't do any harm?
>

Yeah, this is not it. Unless you're using colo or mapped-ram.

> https://bugs.debian.org/1112044
>
> any guess what's going on?
>

The virtio changes are probably the issue. One of them touches
mhdr.num_buffers, under mergeable_rx_bufs, which is migrated state. The
flag in turn depends on VIRTIO_NET_F_MRG_RXBUF, which is set on the
cmdline with -device virtio-net-pci,mrg_rxbuf= but also reset by
virtio_set_features_nocheck, if I'm reading this right.

Let's ask Akihiko.

> Thanks,
>
> /mjt
Re: [PATCH] migration: fix SEEK_CUR offset calculation in qio_channel_block_seek
Posted by Akihiko Odaki 5 months, 2 weeks ago
On 2025/08/27 6:52, Fabiano Rosas wrote:
> Michael Tokarev <mjt@tls.msk.ru> writes:
> 
> +CC Akihiko
> 
>> Hi!
>>
>> This is
>>
>> commit c0b32426ce56182c1ce2a12904f3a702c2ecc460
>> Author: Marco Cavenati <Marco.Cavenati@eurecom.fr>
>> Date:   Wed Mar 26 17:22:30 2025 +0100
>>
>>       migration: fix SEEK_CUR offset calculation in qio_channel_block_seek
>>
>> which went to 10.0.0-rc2, and has been cherry-picked to
>> 7.2 and 9.2 stable series.
>>
>> Reportedly it breaks migration in 7.2.18 and up.  Which is
>> kinda strange, as it shouldn't do any harm?
>>
> 
> Yeah, this is not it. Unless you're using colo or mapped-ram.
> 
>> https://bugs.debian.org/1112044
>>
>> any guess what's going on?
>>
> 
> The virtio changes are probably the issue. One of them touches
> mhdr.num_buffers, under mergeable_rx_bufs, which is migrated state. The
> flag in turn depends on VIRTIO_NET_F_MRG_RXBUF, which is set on the
> cmdline with -device virtio-net-pci,mrg_rxbuf= but also reset by
> virtio_set_features_nocheck, if I'm reading this right.

I don't think commit cefd67f25430 ("virtio-net: Fix num_buffers for 
version 1") is related to the issue. Commit ce1431615292 ("virtio: Call 
set_features during reset") is more likely.

virtio_set_features_nocheck() shouldn't reset VIRTIO_NET_F_MRG_RXBUF. It 
calls virtio_net_set_features(), which does not clear features. 
virtio_net_get_features() clears features, but it is called before 
migration.

The posted call trace indicates a lockup happens in the control path, 
but commit cefd67f25430 ("virtio-net: Fix num_buffers for version 1") 
changes the data path.

On the other hand, I can come up with a possible failure scenario with 
commit ce1431615292 ("virtio: Call set_features during reset"). Perhaps 
it changed the machine state before loading the migrated state, and 
caused a mismatch between them.

I need more information to understand the issue. A command line to 
reproduce the issue is especially helpful because options like 
mrg_rxbuf=, which you mentioned, tell enabled features, which is 
valuable information.

Regards,
Akihiko Odaki
Re: [PATCH] virtio: Call set_features during reset
Posted by Michael Tokarev 5 months, 2 weeks ago
On 28.08.2025 03:57, Akihiko Odaki wrote:

> The posted call trace indicates a lockup happens in the control path, 
> but commit cefd67f25430 ("virtio-net: Fix num_buffers for version 1") 
> changes the data path.
> 
> On the other hand, I can come up with a possible failure scenario with 
> commit ce1431615292 ("virtio: Call set_features during reset"). Perhaps 
> it changed the machine state before loading the migrated state, and 
> caused a mismatch between them.

Yes, the problem commit is 0caed25cd171c6 "virtio: Call set_features
during reset", - the OP corrected himself in the next message (subject
line updated).

> I need more information to understand the issue. A command line to 
> reproduce the issue is especially helpful because options like 
> mrg_rxbuf=, which you mentioned, tell enabled features, which is 
> valuable information.

See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1112044#69 for
the command line.  The guest is started by libvirtd.

Please note: this is stable-7.2 series, it is *not* master (I picked
up this commit to 7.2.x).  It'd be interesting to check if master is
affected too.  Unfortunately I never tried migration, and now I only
have my notebook, so can only migrate between two qemu instances on
the same box - which is probably okay too.

Thanks,

/mjt
Re: [PATCH] virtio: Call set_features during reset
Posted by Akihiko Odaki 5 months, 1 week ago
On 2025/08/29 17:34, Michael Tokarev wrote:
> On 28.08.2025 03:57, Akihiko Odaki wrote:
> 
>> The posted call trace indicates a lockup happens in the control path, 
>> but commit cefd67f25430 ("virtio-net: Fix num_buffers for version 1") 
>> changes the data path.
>>
>> On the other hand, I can come up with a possible failure scenario with 
>> commit ce1431615292 ("virtio: Call set_features during reset"). 
>> Perhaps it changed the machine state before loading the migrated 
>> state, and caused a mismatch between them.
> 
> Yes, the problem commit is 0caed25cd171c6 "virtio: Call set_features
> during reset", - the OP corrected himself in the next message (subject
> line updated).
> 
>> I need more information to understand the issue. A command line to 
>> reproduce the issue is especially helpful because options like 
>> mrg_rxbuf=, which you mentioned, tell enabled features, which is 
>> valuable information.
> 
> See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1112044#69 for
> the command line.  The guest is started by libvirtd.

Thank you, now I think I understand the problem.

> 
> Please note: this is stable-7.2 series, it is *not* master (I picked
> up this commit to 7.2.x).  It'd be interesting to check if master is
> affected too.  Unfortunately I never tried migration, and now I only
> have my notebook, so can only migrate between two qemu instances on
> the same box - which is probably okay too.

I think you need to backport commit 9379ea9db3c0 ("virtio-net: Add 
queues before loading them") and adda0ad56bd2 ("virtio-net: Add queues 
for RSS during migration"). Here is an explanation:

First, let me define two variables for conciseness:
N: the number of queue pairs
M: the maximum number of queue pairs, which is determined with
    n->max_queue_pairs

The problem is that QEMU inconsistently chose N for virtio-net in the 
past. Before commit 8c49756825da ("virtio-net: Add only one queue pair 
when realizing"):
1) realize() chose M.
2) set_features() chose: 1 (when RSS and MQ are disabled)
                          M (otherwise)

This itself was a problem; both RSS and MQ were disabled when realize() 
but N was M, which is inconsistent with 2) and this inconsistency was 
guest-visible.

I wrote commit 8c49756825da ("virtio-net: Add only one queue pair when 
realizing") to make QEMU implement the behavior in 2) also during 
realization and fix the inconsistency, but it broke migration when the 
migrated VM had enabled VIRTIO_NET_F_RSS and VIRTIO_NET_F_MQ because it 
expected that N == M.

This is also why the backported commit also broke migration; it 
accidentally fixed the inconsistency between the first reset state and 
the state after set_features() and caused the same problem.

I wrote commit 9379ea9db3c0 ("virtio-net: Add queues before loading 
them") to fix the issue and later complemented it with commit 
adda0ad56bd2 ("virtio-net: Add queues for RSS during migration").

There are several relevant commits because I could not fix the 
underlying problem at once, but hopefully this email clarifies how the 
two commits fixed it in the end.

Regards,
Akihiko Odaki

Re: [PATCH] virtio: Call set_features during reset
Posted by Michael Tokarev 5 months, 1 week ago
On 29.08.2025 17:40, Akihiko Odaki wrote:
> On 2025/08/29 17:34, Michael Tokarev wrote:
>> On 28.08.2025 03:57, Akihiko Odaki wrote:
...
>> Please note: this is stable-7.2 series, it is *not* master (I picked
>> up this commit to 7.2.x).  It'd be interesting to check if master is
>> affected too.  Unfortunately I never tried migration, and now I only
>> have my notebook, so can only migrate between two qemu instances on
>> the same box - which is probably okay too.
> 
> I think you need to backport commit 9379ea9db3c0 ("virtio-net: Add 
> queues before loading them") and adda0ad56bd2 ("virtio-net: Add queues 
> for RSS during migration"). Here is an explanation:

"Add queues before loading them" is marked as fixing

Both the mentioned commits were tagged with Cc: qemu-stable, but both
has Fixes: the same commit 8c49756825da ("virtio-net: Add only one queue
pair when realizing"), which is in 9.1, but not in 7.2.  I guess, in
this case, I also need 8c49756825da - let's first break things, and
next fix them :)

> First, let me define two variables for conciseness:
> N: the number of queue pairs
> M: the maximum number of queue pairs, which is determined with
>     n->max_queue_pairs
> 
> The problem is that QEMU inconsistently chose N for virtio-net in the 
> past. Before commit 8c49756825da ("virtio-net: Add only one queue pair 
> when realizing"):
> 1) realize() chose M.
> 2) set_features() chose: 1 (when RSS and MQ are disabled)
>                           M (otherwise)
> 
> This itself was a problem; both RSS and MQ were disabled when realize() 
> but N was M, which is inconsistent with 2) and this inconsistency was 
> guest-visible.
> 
> I wrote commit 8c49756825da ("virtio-net: Add only one queue pair when 
> realizing") to make QEMU implement the behavior in 2) also during 
> realization and fix the inconsistency, but it broke migration when the 
> migrated VM had enabled VIRTIO_NET_F_RSS and VIRTIO_NET_F_MQ because it 
> expected that N == M.

Yup, 8c49756825da it is, which is missing in 7.2, which broke things :))

> This is also why the backported commit also broke migration; it 
> accidentally fixed the inconsistency between the first reset state and 
> the state after set_features() and caused the same problem.
> 
> I wrote commit 9379ea9db3c0 ("virtio-net: Add queues before loading 
> them") to fix the issue and later complemented it with commit 
> adda0ad56bd2 ("virtio-net: Add queues for RSS during migration").
> 
> There are several relevant commits because I could not fix the 
> underlying problem at once, but hopefully this email clarifies how the 
> two commits fixed it in the end.

Now, with 3 commits picked up to 7.2, things are at least compiles :))

Let's test the result..  hopefully I didn't break something else in the
process of breaking+fixing stuff :))

Thank you very much for the detailed explanation and attention to
details.  It makes sense.

An for the 7.2.x series which is getting old(ish) - I plan to end
support for it in the near future, so there will be less surprises
like this one.. if not only for 10.0.x :)

/mjt

Re: [PATCH] virtio: Call set_features during reset (was: migration: fix SEEK_CUR offset calculation in qio_channel_block_seek)
Posted by Michael Tokarev 5 months, 2 weeks ago
On 27.08.2025 00:52, Fabiano Rosas wrote:
> Michael Tokarev <mjt@tls.msk.ru> writes:
> 
> +CC Akihiko
> 
>> Hi!
>>
>> This is
>>
>> commit c0b32426ce56182c1ce2a12904f3a702c2ecc460
>> Author: Marco Cavenati <Marco.Cavenati@eurecom.fr>
>> Date:   Wed Mar 26 17:22:30 2025 +0100
>>
>>       migration: fix SEEK_CUR offset calculation in qio_channel_block_seek
>>
>> which went to 10.0.0-rc2, and has been cherry-picked to
>> 7.2 and 9.2 stable series.
>>
>> Reportedly it breaks migration in 7.2.18 and up.  Which is
>> kinda strange, as it shouldn't do any harm?
>>
> 
> Yeah, this is not it. Unless you're using colo or mapped-ram.
> 
>> https://bugs.debian.org/1112044
>>
>> any guess what's going on?
>>
> 
> The virtio changes are probably the issue. One of them touches
> mhdr.num_buffers, under mergeable_rx_bufs, which is migrated state. The
> flag in turn depends on VIRTIO_NET_F_MRG_RXBUF, which is set on the
> cmdline with -device virtio-net-pci,mrg_rxbuf= but also reset by
> virtio_set_features_nocheck, if I'm reading this right.
> 
> Let's ask Akihiko.

The reporter just reported that they used the wrong commit id, --
actually, the issue is caused by ce1431615292dc735597db4062834b:

commit ce1431615292dc735597db4062834bfb271381bc
Author: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Date:   Mon Apr 21 21:17:20 2025 +0900

     virtio: Call set_features during reset

     virtio-net expects set_features() will be called when the feature set
     used by the guest changes to update the number of virtqueues but it is
     not called during reset, which will clear all features, leaving the
     queues added for VIRTIO_NET_F_MQ or VIRTIO_NET_F_RSS. Not only these
     extra queues are visible to the guest, they will cause segmentation
     fault during migration.

     Call set_features() during reset to remove those queues for virtio-net
     as we call set_status(). It will also prevent similar bugs for
     virtio-net and other devices in the future.

     Fixes: f9d6dbf0bf6e ("virtio-net: remove virtio queues if the guest 
doesn't support multiqueue")
     Buglink: https://issues.redhat.com/browse/RHEL-73842
     Cc: qemu-stable@nongnu.org
     Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
     Message-Id: <20250421-reset-v2-1-e4c1ead88ea1@daynix.com>
     Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
     Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
     (cherry picked from commit 0caed25cd171c611781589b5402161d27d57229c)
     Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>

(so commit 0caed25cd171c611781589b5402161 in master).

So yes, this one looks much more real in this context.

Thanks,

/mjt
Re: [PATCH] migration: fix SEEK_CUR offset calculation in qio_channel_block_seek
Posted by Peter Xu 5 months, 2 weeks ago
On Tue, Aug 26, 2025 at 11:32:20PM +0300, Michael Tokarev wrote:
> Hi!
> 
> This is
> 
> commit c0b32426ce56182c1ce2a12904f3a702c2ecc460
> Author: Marco Cavenati <Marco.Cavenati@eurecom.fr>
> Date:   Wed Mar 26 17:22:30 2025 +0100
> 
>     migration: fix SEEK_CUR offset calculation in qio_channel_block_seek
> 
> which went to 10.0.0-rc2, and has been cherry-picked to
> 7.2 and 9.2 stable series.
> 
> Reportedly it breaks migration in 7.2.18 and up.  Which is
> kinda strange, as it shouldn't do any harm?
> 
> https://bugs.debian.org/1112044
> 
> any guess what's going on?

The only thing I can think of is, when it used to save to some file /
snapshot, then if the old image was stored with some wrong offsets (due to
wrong seek()s) then a new QEMU with correct offsets will instead read wrong
data even if they started to do the right things..

The reporter says:

        This occurs during live-migrating a guest onto a host with u15,
        migrating it back fixes the softlocks. A reset is required to fix
        it but is only required when the receiving host is on the latest
        version.

So it's a host-to-host live migration.  Is that using TCP as URI?  The
problem is I don't even think TCP layer should use io_seek at all.

qio_channel_io_seek() is only used in below (except VFIO when used with
multifd, that doesn't look like what the reporter was using..):

*** migration/file.c:
file_start_outgoing_migration[121] if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
file_start_incoming_migration[190] qio_channel_io_seek(QIO_CHANNEL(fioc), offset, SEEK_SET, errp) < 0) {

*** migration/qemu-file.c:
qemu_set_offset[611]           ret = qio_channel_io_seek(f->ioc, off, whence, &err);
qemu_get_offset[624]           ret = qio_channel_io_seek(f->ioc, 0, SEEK_CUR, &err);

All these references are about file migrations, not generic live
migrations..

-- 
Peter Xu
Re: [PATCH] migration: fix SEEK_CUR offset calculation in qio_channel_block_seek
Posted by Daniel P. Berrangé 10 months, 2 weeks ago
On Wed, Mar 26, 2025 at 05:22:30PM +0100, Marco Cavenati wrote:
> The SEEK_CUR case in qio_channel_block_seek was incorrectly using the
> 'whence' parameter instead of the 'offset' parameter when calculating the
> new position.
> 
> Fixes: 65cf200a51ddc6d0a28ecceac30dc892233cddd7 ("migration: introduce a QIOChannel impl for BlockDriverState VMState")
> 
> Signed-off-by: Marco Cavenati <Marco.Cavenati@eurecom.fr>
> ---
>  migration/channel-block.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|