[PATCH v3 05/10] colo: Fix crash during device vmstate load

Lukas Straub posted 10 patches 2 weeks ago
Maintainers: Pierrick Bouvier <pierrick.bouvier@linaro.org>, Lukas Straub <lukasstraub2@web.de>, Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>, Laurent Vivier <lvivier@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>
There is a newer version of this series
[PATCH v3 05/10] colo: Fix crash during device vmstate load
Posted by Lukas Straub 2 weeks ago
With colo we load device vmstate during each checkpoint, on top of
a vm that was already running. Some devices expect a reset before
loading vmstate on such a previously running vm.

This fixes a crash when using COLO with Q35 machine.

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
 migration/colo.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/migration/colo.c b/migration/colo.c
index db783f6fa77500386d923dd97e522883027e71d8..627b3706687036554eda3909b4194116a7640493 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -727,6 +727,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
 
     bql_lock();
     vmstate_loading = true;
+    qemu_system_reset(SHUTDOWN_CAUSE_SNAPSHOT_LOAD);
     colo_flush_ram_cache();
     ret = qemu_load_device_state(fb, errp);
     if (ret < 0) {

-- 
2.39.5
Re: [PATCH v3 05/10] colo: Fix crash during device vmstate load
Posted by Peter Xu 1 week, 5 days ago
On Sun, Jan 25, 2026 at 09:40:10PM +0100, Lukas Straub wrote:
> With colo we load device vmstate during each checkpoint, on top of
> a vm that was already running. Some devices expect a reset before
> loading vmstate on such a previously running vm.
> 
> This fixes a crash when using COLO with Q35 machine.
> 
> Signed-off-by: Lukas Straub <lukasstraub2@web.de>

Yes makes sense, maybe you can add some comments into the code too since
this was overlooked before,

Reviewed-by: Peter Xu <peterx@redhat.com>

Have you tried to measure how many overheads will this introduce to loading
each snapshot?

> ---
>  migration/colo.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/migration/colo.c b/migration/colo.c
> index db783f6fa77500386d923dd97e522883027e71d8..627b3706687036554eda3909b4194116a7640493 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -727,6 +727,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
>  
>      bql_lock();
>      vmstate_loading = true;
> +    qemu_system_reset(SHUTDOWN_CAUSE_SNAPSHOT_LOAD);
>      colo_flush_ram_cache();
>      ret = qemu_load_device_state(fb, errp);
>      if (ret < 0) {
> 
> -- 
> 2.39.5
> 

-- 
Peter Xu
Re: [PATCH v3 05/10] colo: Fix crash during device vmstate load
Posted by Lukas Straub 1 week, 3 days ago
On Tue, 27 Jan 2026 15:38:55 -0500
Peter Xu <peterx@redhat.com> wrote:

> On Sun, Jan 25, 2026 at 09:40:10PM +0100, Lukas Straub wrote:
> > With colo we load device vmstate during each checkpoint, on top of
> > a vm that was already running. Some devices expect a reset before
> > loading vmstate on such a previously running vm.
> > 
> > This fixes a crash when using COLO with Q35 machine.
> > 
> > Signed-off-by: Lukas Straub <lukasstraub2@web.de>  
> 
> Yes makes sense, maybe you can add some comments into the code too since
> this was overlooked before,
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
> Have you tried to measure how many overheads will this introduce to loading
> each snapshot?

It's a large overhead actually, between 10-20 milliseconds.

Regards,
Lukas Straub

> 
> > ---
> >  migration/colo.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/migration/colo.c b/migration/colo.c
> > index db783f6fa77500386d923dd97e522883027e71d8..627b3706687036554eda3909b4194116a7640493 100644
> > --- a/migration/colo.c
> > +++ b/migration/colo.c
> > @@ -727,6 +727,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
> >  
> >      bql_lock();
> >      vmstate_loading = true;
> > +    qemu_system_reset(SHUTDOWN_CAUSE_SNAPSHOT_LOAD);
> >      colo_flush_ram_cache();
> >      ret = qemu_load_device_state(fb, errp);
> >      if (ret < 0) {
> > 
> > -- 
> > 2.39.5
> >   
> 

Re: [PATCH v3 05/10] colo: Fix crash during device vmstate load
Posted by Peter Xu 1 week ago
On Fri, Jan 30, 2026 at 01:49:42PM +0100, Lukas Straub wrote:
> On Tue, 27 Jan 2026 15:38:55 -0500
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Sun, Jan 25, 2026 at 09:40:10PM +0100, Lukas Straub wrote:
> > > With colo we load device vmstate during each checkpoint, on top of
> > > a vm that was already running. Some devices expect a reset before
> > > loading vmstate on such a previously running vm.
> > > 
> > > This fixes a crash when using COLO with Q35 machine.
> > > 
> > > Signed-off-by: Lukas Straub <lukasstraub2@web.de>  
> > 
> > Yes makes sense, maybe you can add some comments into the code too since
> > this was overlooked before,
> > 
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > 
> > Have you tried to measure how many overheads will this introduce to loading
> > each snapshot?
> 
> It's a large overhead actually, between 10-20 milliseconds.

This can be mentioned in the commit message.

IIUC reset() may or may not be required while loading a snapshot.
Normally, a device reset() should reset all dev registers and internal
states, OTOH loadvm() will reload most of them once more.. so less
efficient.

Maybe there's chance to "fix" q35 instead reducing this overhead, but I'll
leave that to be your call; to me this fix is clean from maint POV.

Thanks,

-- 
Peter Xu
Re: [PATCH v3 05/10] colo: Fix crash during device vmstate load
Posted by Lukas Straub 6 days, 8 hours ago
On Mon, 2 Feb 2026 09:12:33 -0500
Peter Xu <peterx@redhat.com> wrote:

> On Fri, Jan 30, 2026 at 01:49:42PM +0100, Lukas Straub wrote:
> > On Tue, 27 Jan 2026 15:38:55 -0500
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > On Sun, Jan 25, 2026 at 09:40:10PM +0100, Lukas Straub wrote:  
> > > > With colo we load device vmstate during each checkpoint, on top of
> > > > a vm that was already running. Some devices expect a reset before
> > > > loading vmstate on such a previously running vm.
> > > > 
> > > > This fixes a crash when using COLO with Q35 machine.
> > > > 
> > > > Signed-off-by: Lukas Straub <lukasstraub2@web.de>    
> > > 
> > > Yes makes sense, maybe you can add some comments into the code too since
> > > this was overlooked before,
> > > 
> > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > 
> > > Have you tried to measure how many overheads will this introduce to loading
> > > each snapshot?  
> > 
> > It's a large overhead actually, between 10-20 milliseconds.  
> 
> This can be mentioned in the commit message.
> 
> IIUC reset() may or may not be required while loading a snapshot.
> Normally, a device reset() should reset all dev registers and internal
> states, OTOH loadvm() will reload most of them once more.. so less
> efficient.
> 
> Maybe there's chance to "fix" q35 instead reducing this overhead, but I'll
> leave that to be your call; to me this fix is clean from maint POV.
> 
> Thanks,
> 

Yes, I think this fix is fine for now. It more correct like this and we
can improve performance later while keeping it correct.

Regards,
Lukas Straub