migration: Fix a possible crash when halting a guest during migration

[PATCH] migration: Fix a possible crash when halting a guest during migration

Posted by Thomas Huth 2 months ago

From: Thomas Huth <thuth@redhat.com>

When shutting down a guest that is currently in progress of being
migrated, there is a chance that QEMU might crash during bdrv_delete().
The backtrace looks like this:

 Thread 74 "mig/src/main" received signal SIGSEGV, Segmentation fault.

 [Switching to Thread 0x3f7de7fc8c0 (LWP 2161436)]
 0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
 5560	        QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
 (gdb) bt
 #0  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
 #1  bdrv_unref (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:7170
 Backtrace stopped: Cannot access memory at address 0x3f7de7f83e0

The problem is apparently that the migration thread is still active
(migration_shutdown() only asks it to stop the current migration, but
does not wait for it to finish), while the main thread continues to
bdrv_close_all() that will destroy all block drivers. So the two threads
are racing here for the destruction of the migration-related block drivers.

I was able to bisect the problem and the race has apparently been introduced
by commit c2a189976e211c9ff782 ("migration/block-active: Remove global active
flag"), so reverting it might be an option as well, but waiting for the
migration thread to finish before continuing with the further clean-ups
during shutdown seems less intrusive.

Note: I used the Claude AI assistant for analyzing the crash, and it
came up with the idea of waiting for the migration thread to finish
in migration_shutdown() before proceeding with the further clean-up,
but the patch itself has been 100% written by myself.

Fixes: c2a189976e ("migration/block-active: Remove global active flag")
Signed-off-by: Thomas Huth <thuth@redhat.com>
---
 migration/migration.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index b316ee01ab2..6f4bb6d8438 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -380,6 +380,16 @@ void migration_bh_schedule(QEMUBHFunc *cb, void *opaque)
     qemu_bh_schedule(bh);
 }
 
+static void migration_thread_join(MigrationState *s)
+{
+    if (s && s->migration_thread_running) {
+        bql_unlock();
+        qemu_thread_join(&s->thread);
+        s->migration_thread_running = false;
+        bql_lock();
+    }
+}
+
 void migration_shutdown(void)
 {
     /*
@@ -393,6 +403,13 @@ void migration_shutdown(void)
      * stop the migration using this structure
      */
     migration_cancel();
+    /*
+     * Wait for migration thread to finish to prevent a possible race where
+     * the migration thread is still running and accessing host block drivers
+     * while the main cleanup proceeds to remove them in bdrv_close_all()
+     * later.
+     */
+    migration_thread_join(migrate_get_current());
     object_unref(OBJECT(current_migration));
 
     /*
@@ -1499,12 +1516,7 @@ static void migration_cleanup(MigrationState *s)
 
     close_return_path_on_source(s);
 
-    if (s->migration_thread_running) {
-        bql_unlock();
-        qemu_thread_join(&s->thread);
-        s->migration_thread_running = false;
-        bql_lock();
-    }
+    migration_thread_join(s);
 
     WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
         /*
-- 
2.52.0

Re: [PATCH] migration: Fix a possible crash when halting a guest during migration

Posted by Peter Xu 2 months ago

On Mon, Dec 08, 2025 at 02:51:01PM +0100, Thomas Huth wrote:
> From: Thomas Huth <thuth@redhat.com>
> 
> When shutting down a guest that is currently in progress of being
> migrated, there is a chance that QEMU might crash during bdrv_delete().
> The backtrace looks like this:
> 
>  Thread 74 "mig/src/main" received signal SIGSEGV, Segmentation fault.
> 
>  [Switching to Thread 0x3f7de7fc8c0 (LWP 2161436)]
>  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>  5560	        QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
>  (gdb) bt
>  #0  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>  #1  bdrv_unref (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:7170
>  Backtrace stopped: Cannot access memory at address 0x3f7de7f83e0
> 
> The problem is apparently that the migration thread is still active
> (migration_shutdown() only asks it to stop the current migration, but
> does not wait for it to finish), while the main thread continues to
> bdrv_close_all() that will destroy all block drivers. So the two threads
> are racing here for the destruction of the migration-related block drivers.
> 
> I was able to bisect the problem and the race has apparently been introduced
> by commit c2a189976e211c9ff782 ("migration/block-active: Remove global active
> flag"), so reverting it might be an option as well, but waiting for the
> migration thread to finish before continuing with the further clean-ups
> during shutdown seems less intrusive.
> 
> Note: I used the Claude AI assistant for analyzing the crash, and it
> came up with the idea of waiting for the migration thread to finish
> in migration_shutdown() before proceeding with the further clean-up,
> but the patch itself has been 100% written by myself.
> 
> Fixes: c2a189976e ("migration/block-active: Remove global active flag")
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  migration/migration.c | 24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index b316ee01ab2..6f4bb6d8438 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -380,6 +380,16 @@ void migration_bh_schedule(QEMUBHFunc *cb, void *opaque)
>      qemu_bh_schedule(bh);
>  }
>  
> +static void migration_thread_join(MigrationState *s)
> +{
> +    if (s && s->migration_thread_running) {
> +        bql_unlock();
> +        qemu_thread_join(&s->thread);
> +        s->migration_thread_running = false;
> +        bql_lock();
> +    }
> +}
> +
>  void migration_shutdown(void)
>  {
>      /*
> @@ -393,6 +403,13 @@ void migration_shutdown(void)
>       * stop the migration using this structure
>       */
>      migration_cancel();
> +    /*
> +     * Wait for migration thread to finish to prevent a possible race where
> +     * the migration thread is still running and accessing host block drivers
> +     * while the main cleanup proceeds to remove them in bdrv_close_all()
> +     * later.
> +     */
> +    migration_thread_join(migrate_get_current());

Not join() the thread was intentional, per commit 892ae715b6bc81, and then
I found I asked a question before; Dave answers here:

https://lore.kernel.org/all/20190228114019.GB4970@work-vm/

I wonder if we can still investigate what Stefan mentioned as the other
approach, as join() here may introduce other hang risks before we can
justify it's safe..

Thanks,

>      object_unref(OBJECT(current_migration));
>  
>      /*
> @@ -1499,12 +1516,7 @@ static void migration_cleanup(MigrationState *s)
>  
>      close_return_path_on_source(s);
>  
> -    if (s->migration_thread_running) {
> -        bql_unlock();
> -        qemu_thread_join(&s->thread);
> -        s->migration_thread_running = false;
> -        bql_lock();
> -    }
> +    migration_thread_join(s);
>  
>      WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
>          /*
> -- 
> 2.52.0
> 

-- 
Peter Xu

Re: [PATCH] migration: Fix a possible crash when halting a guest during migration

Posted by Stefan Hajnoczi 2 months ago

On Mon, Dec 08, 2025 at 02:51:01PM +0100, Thomas Huth wrote:
> From: Thomas Huth <thuth@redhat.com>
> 
> When shutting down a guest that is currently in progress of being
> migrated, there is a chance that QEMU might crash during bdrv_delete().
> The backtrace looks like this:
> 
>  Thread 74 "mig/src/main" received signal SIGSEGV, Segmentation fault.
> 
>  [Switching to Thread 0x3f7de7fc8c0 (LWP 2161436)]
>  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>  5560	        QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
>  (gdb) bt
>  #0  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>  #1  bdrv_unref (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:7170
>  Backtrace stopped: Cannot access memory at address 0x3f7de7f83e0
> 
> The problem is apparently that the migration thread is still active
> (migration_shutdown() only asks it to stop the current migration, but
> does not wait for it to finish), while the main thread continues to
> bdrv_close_all() that will destroy all block drivers. So the two threads
> are racing here for the destruction of the migration-related block drivers.
> 
> I was able to bisect the problem and the race has apparently been introduced
> by commit c2a189976e211c9ff782 ("migration/block-active: Remove global active
> flag"), so reverting it might be an option as well, but waiting for the
> migration thread to finish before continuing with the further clean-ups
> during shutdown seems less intrusive.
> 
> Note: I used the Claude AI assistant for analyzing the crash, and it
> came up with the idea of waiting for the migration thread to finish
> in migration_shutdown() before proceeding with the further clean-up,
> but the patch itself has been 100% written by myself.

It sounds like the migration thread does not hold block graph refcounts
and assumes the BlockDriverStates it uses have a long enough lifetime.

I don't know the migration code well enough to say whether joining in
migration_shutdown() is okay. Another option would be expicitly holding
the necessary refcounts in the migration thread.

> 
> Fixes: c2a189976e ("migration/block-active: Remove global active flag")
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  migration/migration.c | 24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index b316ee01ab2..6f4bb6d8438 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -380,6 +380,16 @@ void migration_bh_schedule(QEMUBHFunc *cb, void *opaque)
>      qemu_bh_schedule(bh);
>  }
>  
> +static void migration_thread_join(MigrationState *s)
> +{
> +    if (s && s->migration_thread_running) {
> +        bql_unlock();
> +        qemu_thread_join(&s->thread);
> +        s->migration_thread_running = false;
> +        bql_lock();
> +    }
> +}
> +
>  void migration_shutdown(void)
>  {
>      /*
> @@ -393,6 +403,13 @@ void migration_shutdown(void)
>       * stop the migration using this structure
>       */
>      migration_cancel();
> +    /*
> +     * Wait for migration thread to finish to prevent a possible race where
> +     * the migration thread is still running and accessing host block drivers
> +     * while the main cleanup proceeds to remove them in bdrv_close_all()
> +     * later.
> +     */
> +    migration_thread_join(migrate_get_current());
>      object_unref(OBJECT(current_migration));
>  
>      /*
> @@ -1499,12 +1516,7 @@ static void migration_cleanup(MigrationState *s)
>  
>      close_return_path_on_source(s);
>  
> -    if (s->migration_thread_running) {
> -        bql_unlock();
> -        qemu_thread_join(&s->thread);
> -        s->migration_thread_running = false;
> -        bql_lock();
> -    }
> +    migration_thread_join(s);
>  
>      WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
>          /*
> -- 
> 2.52.0
>

Re: [PATCH] migration: Fix a possible crash when halting a guest during migration

Posted by Fabiano Rosas 2 months ago

Stefan Hajnoczi <stefanha@redhat.com> writes:

> On Mon, Dec 08, 2025 at 02:51:01PM +0100, Thomas Huth wrote:
>> From: Thomas Huth <thuth@redhat.com>
>> 
>> When shutting down a guest that is currently in progress of being
>> migrated, there is a chance that QEMU might crash during bdrv_delete().
>> The backtrace looks like this:
>> 
>>  Thread 74 "mig/src/main" received signal SIGSEGV, Segmentation fault.
>> 
>>  [Switching to Thread 0x3f7de7fc8c0 (LWP 2161436)]
>>  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>  5560	        QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
>>  (gdb) bt
>>  #0  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>  #1  bdrv_unref (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:7170
>>  Backtrace stopped: Cannot access memory at address 0x3f7de7f83e0
>> 

How does the migration thread reaches here? Is this from
migration_block_inactivate()?

>> The problem is apparently that the migration thread is still active
>> (migration_shutdown() only asks it to stop the current migration, but
>> does not wait for it to finish)

"asks it to stop", more like pulls the plug abruptly. Note that setting
the CANCELLING state has technically nothing to do with this, the actual
cancelling lies on the not so gentle:

if (s->to_dst_file) {
    qemu_file_shutdown(s->to_dst_file);
}
 
>> , while the main thread continues to
>> bdrv_close_all() that will destroy all block drivers. So the two threads
>> are racing here for the destruction of the migration-related block drivers.
>> 
>> I was able to bisect the problem and the race has apparently been introduced
>> by commit c2a189976e211c9ff782 ("migration/block-active: Remove global active
>> flag"), so reverting it might be an option as well, but waiting for the
>> migration thread to finish before continuing with the further clean-ups
>> during shutdown seems less intrusive.
>> 
>> Note: I used the Claude AI assistant for analyzing the crash, and it
>> came up with the idea of waiting for the migration thread to finish
>> in migration_shutdown() before proceeding with the further clean-up,
>> but the patch itself has been 100% written by myself.
>
> It sounds like the migration thread does not hold block graph refcounts
> and assumes the BlockDriverStates it uses have a long enough lifetime.
>
> I don't know the migration code well enough to say whether joining in
> migration_shutdown() is okay. Another option would be expicitly holding
> the necessary refcounts in the migration thread.
>

I agree. In principle and also because shuffling the joining around
feels like something that's prone to introduce other bugs.

>> 
>> Fixes: c2a189976e ("migration/block-active: Remove global active flag")
>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>> ---
>>  migration/migration.c | 24 ++++++++++++++++++------
>>  1 file changed, 18 insertions(+), 6 deletions(-)
>> 
>> diff --git a/migration/migration.c b/migration/migration.c
>> index b316ee01ab2..6f4bb6d8438 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -380,6 +380,16 @@ void migration_bh_schedule(QEMUBHFunc *cb, void *opaque)
>>      qemu_bh_schedule(bh);
>>  }
>>  
>> +static void migration_thread_join(MigrationState *s)
>> +{
>> +    if (s && s->migration_thread_running) {
>> +        bql_unlock();
>> +        qemu_thread_join(&s->thread);
>> +        s->migration_thread_running = false;
>> +        bql_lock();
>> +    }
>> +}
>> +
>>  void migration_shutdown(void)
>>  {
>>      /*
>> @@ -393,6 +403,13 @@ void migration_shutdown(void)
>>       * stop the migration using this structure
>>       */
>>      migration_cancel();
>> +    /*
>> +     * Wait for migration thread to finish to prevent a possible race where
>> +     * the migration thread is still running and accessing host block drivers
>> +     * while the main cleanup proceeds to remove them in bdrv_close_all()
>> +     * later.
>> +     */
>> +    migration_thread_join(migrate_get_current());
>>      object_unref(OBJECT(current_migration));
>>  
>>      /*
>> @@ -1499,12 +1516,7 @@ static void migration_cleanup(MigrationState *s)
>>  
>>      close_return_path_on_source(s);
>>  
>> -    if (s->migration_thread_running) {
>> -        bql_unlock();
>> -        qemu_thread_join(&s->thread);
>> -        s->migration_thread_running = false;
>> -        bql_lock();
>> -    }
>> +    migration_thread_join(s);
>>  
>>      WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
>>          /*
>> -- 
>> 2.52.0
>>

Re: [PATCH] migration: Fix a possible crash when halting a guest during migration

Posted by Thomas Huth 1 month, 3 weeks ago

On 08/12/2025 16.26, Fabiano Rosas wrote:
> Stefan Hajnoczi <stefanha@redhat.com> writes:
> 
>> On Mon, Dec 08, 2025 at 02:51:01PM +0100, Thomas Huth wrote:
>>> From: Thomas Huth <thuth@redhat.com>
>>>
>>> When shutting down a guest that is currently in progress of being
>>> migrated, there is a chance that QEMU might crash during bdrv_delete().
>>> The backtrace looks like this:
>>>
>>>   Thread 74 "mig/src/main" received signal SIGSEGV, Segmentation fault.
>>>
>>>   [Switching to Thread 0x3f7de7fc8c0 (LWP 2161436)]
>>>   0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>>   5560	        QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
>>>   (gdb) bt
>>>   #0  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>>   #1  bdrv_unref (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:7170
>>>   Backtrace stopped: Cannot access memory at address 0x3f7de7f83e0
>>>
> 
> How does the migration thread reaches here? Is this from
> migration_block_inactivate()?

Unfortunately, gdb was not very helpful here (claiming that it cannot access 
the memory and stack anymore), so I had to do some printf debugging. This is 
what seems to happen:

Main thread: qemu_cleanup() calls  migration_shutdown() --> 
migration_cancel() which signals the migration thread to cancel the migration.

Migration thread: migration_thread() got kicked out the loop and calls 
migration_iteration_finish(), which tries to get the BQL via bql_lock() but 
that is currently held by another thread, so the migration thread is blocked 
here.

Main thread: qemu_cleanup() advances to bdrv_close_all() that uses 
blockdev_close_all_bdrv_states() to unref all BDS. The BDS with the name 
'libvirt-1-storage' gets deleted via bdrv_delete() that way.

Migration thread: Later, migration_iteration_finish() finally gets the BQL, 
and calls the migration_block_activate() function in the 
MIGRATION_STATUS_CANCELLING case statement. This calls bdrv_activate_all().
bdrv_activate_all() gets a pointer to that 'libvirt-1-storage' BDS again 
from bdrv_first(), and during the bdrv_next() that BDS gets unref'ed again 
which is causing the crash.

==> Why is bdrv_first() still providing a BDS that have been deleted by 
other threads earlier?

>> It sounds like the migration thread does not hold block graph refcounts
>> and assumes the BlockDriverStates it uses have a long enough lifetime.
>>
>> I don't know the migration code well enough to say whether joining in
>> migration_shutdown() is okay. Another option would be expicitly holding
>> the necessary refcounts in the migration thread.
> 
> I agree. In principle and also because shuffling the joining around
> feels like something that's prone to introduce other bugs.

I'm a little bit lost here right now ... Can you suggest a place where we 
would need to increase the refcounts in the migration thread?

  Thomas

Re: [PATCH] migration: Fix a possible crash when halting a guest during migration

Posted by Fabiano Rosas 1 month, 3 weeks ago

Thomas Huth <thuth@redhat.com> writes:

> On 08/12/2025 16.26, Fabiano Rosas wrote:
>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>> 
>>> On Mon, Dec 08, 2025 at 02:51:01PM +0100, Thomas Huth wrote:
>>>> From: Thomas Huth <thuth@redhat.com>
>>>>
>>>> When shutting down a guest that is currently in progress of being
>>>> migrated, there is a chance that QEMU might crash during bdrv_delete().
>>>> The backtrace looks like this:
>>>>
>>>>   Thread 74 "mig/src/main" received signal SIGSEGV, Segmentation fault.
>>>>
>>>>   [Switching to Thread 0x3f7de7fc8c0 (LWP 2161436)]
>>>>   0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>>>   5560	        QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
>>>>   (gdb) bt
>>>>   #0  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>>>   #1  bdrv_unref (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:7170
>>>>   Backtrace stopped: Cannot access memory at address 0x3f7de7f83e0
>>>>
>> 
>> How does the migration thread reaches here? Is this from
>> migration_block_inactivate()?
>
> Unfortunately, gdb was not very helpful here (claiming that it cannot access 
> the memory and stack anymore), so I had to do some printf debugging. This is 
> what seems to happen:
>
> Main thread: qemu_cleanup() calls  migration_shutdown() --> 
> migration_cancel() which signals the migration thread to cancel the migration.
>
> Migration thread: migration_thread() got kicked out the loop and calls 
> migration_iteration_finish(), which tries to get the BQL via bql_lock() but 
> that is currently held by another thread, so the migration thread is blocked 
> here.
>
> Main thread: qemu_cleanup() advances to bdrv_close_all() that uses 
> blockdev_close_all_bdrv_states() to unref all BDS. The BDS with the name 
> 'libvirt-1-storage' gets deleted via bdrv_delete() that way.
>

Has qmp_blockdev_del() ever been called to remove the BDS from the
monitor_bdrv_states list? Otherwise your debugging seems to indicate
blockdev_close_all_bdrv_states() is dropping the last reference to bs,
but it's still accessible from bdrv_next() via
bdrv_next_monitor_owned().

> Migration thread: Later, migration_iteration_finish() finally gets the BQL, 
> and calls the migration_block_activate() function in the 
> MIGRATION_STATUS_CANCELLING case statement. This calls bdrv_activate_all().
> bdrv_activate_all() gets a pointer to that 'libvirt-1-storage' BDS again 
> from bdrv_first(), and during the bdrv_next() that BDS gets unref'ed again 
> which is causing the crash.
>
> ==> Why is bdrv_first() still providing a BDS that have been deleted by 
> other threads earlier?
>

>>> It sounds like the migration thread does not hold block graph refcounts
>>> and assumes the BlockDriverStates it uses have a long enough lifetime.
>>>
>>> I don't know the migration code well enough to say whether joining in
>>> migration_shutdown() is okay. Another option would be expicitly holding
>>> the necessary refcounts in the migration thread.
>> 
>> I agree. In principle and also because shuffling the joining around
>> feels like something that's prone to introduce other bugs.
>
> I'm a little bit lost here right now ... Can you suggest a place where we 
> would need to increase the refcounts in the migration thread?
>
>   Thomas

Re: [PATCH] migration: Fix a possible crash when halting a guest during migration

Posted by Kevin Wolf 1 month, 3 weeks ago

Am 12.12.2025 um 22:26 hat Fabiano Rosas geschrieben:
> Thomas Huth <thuth@redhat.com> writes:
> 
> > On 08/12/2025 16.26, Fabiano Rosas wrote:
> >> Stefan Hajnoczi <stefanha@redhat.com> writes:
> >> 
> >>> On Mon, Dec 08, 2025 at 02:51:01PM +0100, Thomas Huth wrote:
> >>>> From: Thomas Huth <thuth@redhat.com>
> >>>>
> >>>> When shutting down a guest that is currently in progress of being
> >>>> migrated, there is a chance that QEMU might crash during bdrv_delete().
> >>>> The backtrace looks like this:
> >>>>
> >>>>   Thread 74 "mig/src/main" received signal SIGSEGV, Segmentation fault.
> >>>>
> >>>>   [Switching to Thread 0x3f7de7fc8c0 (LWP 2161436)]
> >>>>   0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
> >>>>   5560	        QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
> >>>>   (gdb) bt
> >>>>   #0  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
> >>>>   #1  bdrv_unref (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:7170
> >>>>   Backtrace stopped: Cannot access memory at address 0x3f7de7f83e0
> >>>>
> >> 
> >> How does the migration thread reaches here? Is this from
> >> migration_block_inactivate()?
> >
> > Unfortunately, gdb was not very helpful here (claiming that it cannot access 
> > the memory and stack anymore), so I had to do some printf debugging. This is 
> > what seems to happen:
> >
> > Main thread: qemu_cleanup() calls  migration_shutdown() --> 
> > migration_cancel() which signals the migration thread to cancel the migration.
> >
> > Migration thread: migration_thread() got kicked out the loop and calls 
> > migration_iteration_finish(), which tries to get the BQL via bql_lock() but 
> > that is currently held by another thread, so the migration thread is blocked 
> > here.
> >
> > Main thread: qemu_cleanup() advances to bdrv_close_all() that uses 
> > blockdev_close_all_bdrv_states() to unref all BDS. The BDS with the name 
> > 'libvirt-1-storage' gets deleted via bdrv_delete() that way.
> >
> 
> Has qmp_blockdev_del() ever been called to remove the BDS from the
> monitor_bdrv_states list? Otherwise your debugging seems to indicate
> blockdev_close_all_bdrv_states() is dropping the last reference to bs,
> but it's still accessible from bdrv_next() via
> bdrv_next_monitor_owned().

The reference that blockdev_close_all_bdrv_states() drops is the monitor
reference. So is this the right fix (completely untested, but matches
what qmp_blockdev_del() does)?

Kevin

diff --git a/blockdev.c b/blockdev.c
index dbd1d4d3e80..6e86c6262f9 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -686,6 +686,7 @@ void blockdev_close_all_bdrv_states(void)

     GLOBAL_STATE_CODE();
     QTAILQ_FOREACH_SAFE(bs, &monitor_bdrv_states, monitor_list, next_bs) {
+        QTAILQ_REMOVE(&monitor_bdrv_states, bs, monitor_list);
         bdrv_unref(bs);
     }
 }

Re: [PATCH] migration: Fix a possible crash when halting a guest during migration

Posted by Thomas Huth 1 month, 3 weeks ago

On 15/12/2025 14.42, Kevin Wolf wrote:
> Am 12.12.2025 um 22:26 hat Fabiano Rosas geschrieben:
>> Thomas Huth <thuth@redhat.com> writes:
>>
>>> On 08/12/2025 16.26, Fabiano Rosas wrote:
>>>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>>>>
>>>>> On Mon, Dec 08, 2025 at 02:51:01PM +0100, Thomas Huth wrote:
>>>>>> From: Thomas Huth <thuth@redhat.com>
>>>>>>
>>>>>> When shutting down a guest that is currently in progress of being
>>>>>> migrated, there is a chance that QEMU might crash during bdrv_delete().
>>>>>> The backtrace looks like this:
>>>>>>
>>>>>>    Thread 74 "mig/src/main" received signal SIGSEGV, Segmentation fault.
>>>>>>
>>>>>>    [Switching to Thread 0x3f7de7fc8c0 (LWP 2161436)]
>>>>>>    0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>>>>>    5560	        QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
>>>>>>    (gdb) bt
>>>>>>    #0  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>>>>>    #1  bdrv_unref (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:7170
>>>>>>    Backtrace stopped: Cannot access memory at address 0x3f7de7f83e0
>>>>>>
>>>>
>>>> How does the migration thread reaches here? Is this from
>>>> migration_block_inactivate()?
>>>
>>> Unfortunately, gdb was not very helpful here (claiming that it cannot access
>>> the memory and stack anymore), so I had to do some printf debugging. This is
>>> what seems to happen:
>>>
>>> Main thread: qemu_cleanup() calls  migration_shutdown() -->
>>> migration_cancel() which signals the migration thread to cancel the migration.
>>>
>>> Migration thread: migration_thread() got kicked out the loop and calls
>>> migration_iteration_finish(), which tries to get the BQL via bql_lock() but
>>> that is currently held by another thread, so the migration thread is blocked
>>> here.
>>>
>>> Main thread: qemu_cleanup() advances to bdrv_close_all() that uses
>>> blockdev_close_all_bdrv_states() to unref all BDS. The BDS with the name
>>> 'libvirt-1-storage' gets deleted via bdrv_delete() that way.
>>>
>>
>> Has qmp_blockdev_del() ever been called to remove the BDS from the
>> monitor_bdrv_states list? Otherwise your debugging seems to indicate
>> blockdev_close_all_bdrv_states() is dropping the last reference to bs,
>> but it's still accessible from bdrv_next() via
>> bdrv_next_monitor_owned().
> 
> The reference that blockdev_close_all_bdrv_states() drops is the monitor
> reference. So is this the right fix (completely untested, but matches
> what qmp_blockdev_del() does)?
> 
> Kevin
> 
> diff --git a/blockdev.c b/blockdev.c
> index dbd1d4d3e80..6e86c6262f9 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -686,6 +686,7 @@ void blockdev_close_all_bdrv_states(void)
> 
>       GLOBAL_STATE_CODE();
>       QTAILQ_FOREACH_SAFE(bs, &monitor_bdrv_states, monitor_list, next_bs) {
> +        QTAILQ_REMOVE(&monitor_bdrv_states, bs, monitor_list);
>           bdrv_unref(bs);
>       }
>   }

Thanks a lot, Kevin! This looks like the right fix for me - I gave it a try 
and it fixes the crash indeed!

I also threw it into the CI and there weren't any regressions:

  https://gitlab.com/thuth/qemu/-/pipelines/2215426210
  https://app.travis-ci.com/github/huth/qemu/builds/277063452

Could you please send it as a proper patch?

  Thanks,
   Thomas

Re: [PATCH] migration: Fix a possible crash when halting a guest during migration

Posted by Kevin Wolf 1 month, 3 weeks ago

Am 15.12.2025 um 16:11 hat Thomas Huth geschrieben:
> On 15/12/2025 14.42, Kevin Wolf wrote:
> > Am 12.12.2025 um 22:26 hat Fabiano Rosas geschrieben:
> > > Thomas Huth <thuth@redhat.com> writes:
> > > 
> > > > On 08/12/2025 16.26, Fabiano Rosas wrote:
> > > > > Stefan Hajnoczi <stefanha@redhat.com> writes:
> > > > > 
> > > > > > On Mon, Dec 08, 2025 at 02:51:01PM +0100, Thomas Huth wrote:
> > > > > > > From: Thomas Huth <thuth@redhat.com>
> > > > > > > 
> > > > > > > When shutting down a guest that is currently in progress of being
> > > > > > > migrated, there is a chance that QEMU might crash during bdrv_delete().
> > > > > > > The backtrace looks like this:
> > > > > > > 
> > > > > > >    Thread 74 "mig/src/main" received signal SIGSEGV, Segmentation fault.
> > > > > > > 
> > > > > > >    [Switching to Thread 0x3f7de7fc8c0 (LWP 2161436)]
> > > > > > >    0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
> > > > > > >    5560	        QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
> > > > > > >    (gdb) bt
> > > > > > >    #0  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
> > > > > > >    #1  bdrv_unref (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:7170
> > > > > > >    Backtrace stopped: Cannot access memory at address 0x3f7de7f83e0
> > > > > > > 
> > > > > 
> > > > > How does the migration thread reaches here? Is this from
> > > > > migration_block_inactivate()?
> > > > 
> > > > Unfortunately, gdb was not very helpful here (claiming that it cannot access
> > > > the memory and stack anymore), so I had to do some printf debugging. This is
> > > > what seems to happen:
> > > > 
> > > > Main thread: qemu_cleanup() calls  migration_shutdown() -->
> > > > migration_cancel() which signals the migration thread to cancel the migration.
> > > > 
> > > > Migration thread: migration_thread() got kicked out the loop and calls
> > > > migration_iteration_finish(), which tries to get the BQL via bql_lock() but
> > > > that is currently held by another thread, so the migration thread is blocked
> > > > here.
> > > > 
> > > > Main thread: qemu_cleanup() advances to bdrv_close_all() that uses
> > > > blockdev_close_all_bdrv_states() to unref all BDS. The BDS with the name
> > > > 'libvirt-1-storage' gets deleted via bdrv_delete() that way.
> > > > 
> > > 
> > > Has qmp_blockdev_del() ever been called to remove the BDS from the
> > > monitor_bdrv_states list? Otherwise your debugging seems to indicate
> > > blockdev_close_all_bdrv_states() is dropping the last reference to bs,
> > > but it's still accessible from bdrv_next() via
> > > bdrv_next_monitor_owned().
> > 
> > The reference that blockdev_close_all_bdrv_states() drops is the monitor
> > reference. So is this the right fix (completely untested, but matches
> > what qmp_blockdev_del() does)?
> > 
> > Kevin
> > 
> > diff --git a/blockdev.c b/blockdev.c
> > index dbd1d4d3e80..6e86c6262f9 100644
> > --- a/blockdev.c
> > +++ b/blockdev.c
> > @@ -686,6 +686,7 @@ void blockdev_close_all_bdrv_states(void)
> > 
> >       GLOBAL_STATE_CODE();
> >       QTAILQ_FOREACH_SAFE(bs, &monitor_bdrv_states, monitor_list, next_bs) {
> > +        QTAILQ_REMOVE(&monitor_bdrv_states, bs, monitor_list);
> >           bdrv_unref(bs);
> >       }
> >   }
> 
> Thanks a lot, Kevin! This looks like the right fix for me - I gave it
> a try and it fixes the crash indeed!

Good. I think something like your patch would still be good for 11.0.
Having undefined order in shutdown is just asking for trouble. So it
would be good if we could be sure that migration is out of the way when
migration_shutdown() returns.

I sent the above as a proper patch to fix the immediate problem for
10.2.

Kevin

Re: [PATCH] migration: Fix a possible crash when halting a guest during migration

Posted by Peter Xu 1 month, 3 weeks ago

On Mon, Dec 15, 2025 at 04:28:39PM +0100, Kevin Wolf wrote:
> Am 15.12.2025 um 16:11 hat Thomas Huth geschrieben:
> > On 15/12/2025 14.42, Kevin Wolf wrote:
> > > Am 12.12.2025 um 22:26 hat Fabiano Rosas geschrieben:
> > > > Thomas Huth <thuth@redhat.com> writes:
> > > > 
> > > > > On 08/12/2025 16.26, Fabiano Rosas wrote:
> > > > > > Stefan Hajnoczi <stefanha@redhat.com> writes:
> > > > > > 
> > > > > > > On Mon, Dec 08, 2025 at 02:51:01PM +0100, Thomas Huth wrote:
> > > > > > > > From: Thomas Huth <thuth@redhat.com>
> > > > > > > > 
> > > > > > > > When shutting down a guest that is currently in progress of being
> > > > > > > > migrated, there is a chance that QEMU might crash during bdrv_delete().
> > > > > > > > The backtrace looks like this:
> > > > > > > > 
> > > > > > > >    Thread 74 "mig/src/main" received signal SIGSEGV, Segmentation fault.
> > > > > > > > 
> > > > > > > >    [Switching to Thread 0x3f7de7fc8c0 (LWP 2161436)]
> > > > > > > >    0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
> > > > > > > >    5560	        QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
> > > > > > > >    (gdb) bt
> > > > > > > >    #0  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
> > > > > > > >    #1  bdrv_unref (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:7170
> > > > > > > >    Backtrace stopped: Cannot access memory at address 0x3f7de7f83e0
> > > > > > > > 
> > > > > > 
> > > > > > How does the migration thread reaches here? Is this from
> > > > > > migration_block_inactivate()?
> > > > > 
> > > > > Unfortunately, gdb was not very helpful here (claiming that it cannot access
> > > > > the memory and stack anymore), so I had to do some printf debugging. This is
> > > > > what seems to happen:
> > > > > 
> > > > > Main thread: qemu_cleanup() calls  migration_shutdown() -->
> > > > > migration_cancel() which signals the migration thread to cancel the migration.
> > > > > 
> > > > > Migration thread: migration_thread() got kicked out the loop and calls
> > > > > migration_iteration_finish(), which tries to get the BQL via bql_lock() but
> > > > > that is currently held by another thread, so the migration thread is blocked
> > > > > here.
> > > > > 
> > > > > Main thread: qemu_cleanup() advances to bdrv_close_all() that uses
> > > > > blockdev_close_all_bdrv_states() to unref all BDS. The BDS with the name
> > > > > 'libvirt-1-storage' gets deleted via bdrv_delete() that way.
> > > > > 
> > > > 
> > > > Has qmp_blockdev_del() ever been called to remove the BDS from the
> > > > monitor_bdrv_states list? Otherwise your debugging seems to indicate
> > > > blockdev_close_all_bdrv_states() is dropping the last reference to bs,
> > > > but it's still accessible from bdrv_next() via
> > > > bdrv_next_monitor_owned().
> > > 
> > > The reference that blockdev_close_all_bdrv_states() drops is the monitor
> > > reference. So is this the right fix (completely untested, but matches
> > > what qmp_blockdev_del() does)?
> > > 
> > > Kevin
> > > 
> > > diff --git a/blockdev.c b/blockdev.c
> > > index dbd1d4d3e80..6e86c6262f9 100644
> > > --- a/blockdev.c
> > > +++ b/blockdev.c
> > > @@ -686,6 +686,7 @@ void blockdev_close_all_bdrv_states(void)
> > > 
> > >       GLOBAL_STATE_CODE();
> > >       QTAILQ_FOREACH_SAFE(bs, &monitor_bdrv_states, monitor_list, next_bs) {
> > > +        QTAILQ_REMOVE(&monitor_bdrv_states, bs, monitor_list);
> > >           bdrv_unref(bs);
> > >       }
> > >   }
> > 
> > Thanks a lot, Kevin! This looks like the right fix for me - I gave it
> > a try and it fixes the crash indeed!
> 
> Good. I think something like your patch would still be good for 11.0.
> Having undefined order in shutdown is just asking for trouble. So it
> would be good if we could be sure that migration is out of the way when
> migration_shutdown() returns.
> 
> I sent the above as a proper patch to fix the immediate problem for
> 10.2.

Thanks all!

Does that completely fix the problem?  If so, IMHO we don't need the
migration change anymore until later.  As replied in the other email, it
was at least intentional a few years ago (when introducing
migration_shutdown()) to not join() the migration thread here.

If we need to persue that we need to justify it, and if the other fix can
completely fix this problem, we'll need some other justification (meanwhile
we'll need to re-evaluate the possible hang QEMU in other use cases).

Thanks,

-- 
Peter Xu

Re: [PATCH] migration: Fix a possible crash when halting a guest during migration

Posted by Thomas Huth 1 month, 3 weeks ago

On 15/12/2025 17.12, Peter Xu wrote:
> On Mon, Dec 15, 2025 at 04:28:39PM +0100, Kevin Wolf wrote:
>> Am 15.12.2025 um 16:11 hat Thomas Huth geschrieben:
>>> On 15/12/2025 14.42, Kevin Wolf wrote:
>>>> Am 12.12.2025 um 22:26 hat Fabiano Rosas geschrieben:
>>>>> Thomas Huth <thuth@redhat.com> writes:
>>>>>
>>>>>> On 08/12/2025 16.26, Fabiano Rosas wrote:
>>>>>>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>>>>>>>
>>>>>>>> On Mon, Dec 08, 2025 at 02:51:01PM +0100, Thomas Huth wrote:
>>>>>>>>> From: Thomas Huth <thuth@redhat.com>
>>>>>>>>>
>>>>>>>>> When shutting down a guest that is currently in progress of being
>>>>>>>>> migrated, there is a chance that QEMU might crash during bdrv_delete().
>>>>>>>>> The backtrace looks like this:
>>>>>>>>>
>>>>>>>>>     Thread 74 "mig/src/main" received signal SIGSEGV, Segmentation fault.
>>>>>>>>>
>>>>>>>>>     [Switching to Thread 0x3f7de7fc8c0 (LWP 2161436)]
>>>>>>>>>     0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>>>>>>>>     5560	        QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
>>>>>>>>>     (gdb) bt
>>>>>>>>>     #0  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>>>>>>>>     #1  bdrv_unref (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:7170
>>>>>>>>>     Backtrace stopped: Cannot access memory at address 0x3f7de7f83e0
>>>>>>>>>
>>>>>>>
>>>>>>> How does the migration thread reaches here? Is this from
>>>>>>> migration_block_inactivate()?
>>>>>>
>>>>>> Unfortunately, gdb was not very helpful here (claiming that it cannot access
>>>>>> the memory and stack anymore), so I had to do some printf debugging. This is
>>>>>> what seems to happen:
>>>>>>
>>>>>> Main thread: qemu_cleanup() calls  migration_shutdown() -->
>>>>>> migration_cancel() which signals the migration thread to cancel the migration.
>>>>>>
>>>>>> Migration thread: migration_thread() got kicked out the loop and calls
>>>>>> migration_iteration_finish(), which tries to get the BQL via bql_lock() but
>>>>>> that is currently held by another thread, so the migration thread is blocked
>>>>>> here.
>>>>>>
>>>>>> Main thread: qemu_cleanup() advances to bdrv_close_all() that uses
>>>>>> blockdev_close_all_bdrv_states() to unref all BDS. The BDS with the name
>>>>>> 'libvirt-1-storage' gets deleted via bdrv_delete() that way.
>>>>>>
>>>>>
>>>>> Has qmp_blockdev_del() ever been called to remove the BDS from the
>>>>> monitor_bdrv_states list? Otherwise your debugging seems to indicate
>>>>> blockdev_close_all_bdrv_states() is dropping the last reference to bs,
>>>>> but it's still accessible from bdrv_next() via
>>>>> bdrv_next_monitor_owned().
>>>>
>>>> The reference that blockdev_close_all_bdrv_states() drops is the monitor
>>>> reference. So is this the right fix (completely untested, but matches
>>>> what qmp_blockdev_del() does)?
>>>>
>>>> Kevin
>>>>
>>>> diff --git a/blockdev.c b/blockdev.c
>>>> index dbd1d4d3e80..6e86c6262f9 100644
>>>> --- a/blockdev.c
>>>> +++ b/blockdev.c
>>>> @@ -686,6 +686,7 @@ void blockdev_close_all_bdrv_states(void)
>>>>
>>>>        GLOBAL_STATE_CODE();
>>>>        QTAILQ_FOREACH_SAFE(bs, &monitor_bdrv_states, monitor_list, next_bs) {
>>>> +        QTAILQ_REMOVE(&monitor_bdrv_states, bs, monitor_list);
>>>>            bdrv_unref(bs);
>>>>        }
>>>>    }
>>>
>>> Thanks a lot, Kevin! This looks like the right fix for me - I gave it
>>> a try and it fixes the crash indeed!
>>
>> Good. I think something like your patch would still be good for 11.0.
>> Having undefined order in shutdown is just asking for trouble. So it
>> would be good if we could be sure that migration is out of the way when
>> migration_shutdown() returns.
>>
>> I sent the above as a proper patch to fix the immediate problem for
>> 10.2.
> 
> Thanks all!
> 
> Does that completely fix the problem?

Yes, Kevin's patch fixes the crash!

>  If so, IMHO we don't need the
> migration change anymore until later.  As replied in the other email, it
> was at least intentional a few years ago (when introducing
> migration_shutdown()) to not join() the migration thread here.

Fine for me.

  Thomas

Re: [PATCH] migration: Fix a possible crash when halting a guest during migration

Posted by Thomas Huth 1 month, 3 weeks ago

On 12/12/2025 22.26, Fabiano Rosas wrote:
> Thomas Huth <thuth@redhat.com> writes:
> 
>> On 08/12/2025 16.26, Fabiano Rosas wrote:
>>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>>>
>>>> On Mon, Dec 08, 2025 at 02:51:01PM +0100, Thomas Huth wrote:
>>>>> From: Thomas Huth <thuth@redhat.com>
>>>>>
>>>>> When shutting down a guest that is currently in progress of being
>>>>> migrated, there is a chance that QEMU might crash during bdrv_delete().
>>>>> The backtrace looks like this:
>>>>>
>>>>>    Thread 74 "mig/src/main" received signal SIGSEGV, Segmentation fault.
>>>>>
>>>>>    [Switching to Thread 0x3f7de7fc8c0 (LWP 2161436)]
>>>>>    0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>>>>    5560	        QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
>>>>>    (gdb) bt
>>>>>    #0  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>>>>    #1  bdrv_unref (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:7170
>>>>>    Backtrace stopped: Cannot access memory at address 0x3f7de7f83e0
>>>>>
>>>
>>> How does the migration thread reaches here? Is this from
>>> migration_block_inactivate()?
>>
>> Unfortunately, gdb was not very helpful here (claiming that it cannot access
>> the memory and stack anymore), so I had to do some printf debugging. This is
>> what seems to happen:
>>
>> Main thread: qemu_cleanup() calls  migration_shutdown() -->
>> migration_cancel() which signals the migration thread to cancel the migration.
>>
>> Migration thread: migration_thread() got kicked out the loop and calls
>> migration_iteration_finish(), which tries to get the BQL via bql_lock() but
>> that is currently held by another thread, so the migration thread is blocked
>> here.
>>
>> Main thread: qemu_cleanup() advances to bdrv_close_all() that uses
>> blockdev_close_all_bdrv_states() to unref all BDS. The BDS with the name
>> 'libvirt-1-storage' gets deleted via bdrv_delete() that way.
>>
> 
> Has qmp_blockdev_del() ever been called to remove the BDS from the
> monitor_bdrv_states list? Otherwise your debugging seems to indicate
> blockdev_close_all_bdrv_states() is dropping the last reference to bs,
> but it's still accessible from bdrv_next() via
> bdrv_next_monitor_owned().

As far as I can see, qmp_blockdev_del() is never called, so yes, looks like 
blockdev_close_all_bdrv_states() drops the last reference here while its 
still available from bdrv_next() ...

The weird thing is also that I can only reproduce the issue on s390x so far, 
not on x86 ... I wonder whether we're missing something there...

  Thomas