On 5.03.2025 17:53, Cédric Le Goater wrote:
> So that's PATCH 7 replacement.
Yes, that the replacement for "[PATCH v6 07/36] migration: postcopy_ram_listen_thread() should take BQL for some calls".
> Thanks,
>
> C.
Thanks,
Maciej
> On 3/5/25 17:49, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> All callers to migration_incoming_state_destroy() other than
>> postcopy_ram_listen_thread() do this call with BQL held.
>>
>> Since migration_incoming_state_destroy() ultimately calls "load_cleanup"
>> SaveVMHandlers and it will soon call BQL-sensitive code it makes sense
>> to always call that function under BQL rather than to have it deal with
>> both cases (with BQL and without BQL).
>> Add the necessary bql_lock() and bql_unlock() to
>> postcopy_ram_listen_thread().
>>
>> qemu_loadvm_state_main() in postcopy_ram_listen_thread() could call
>> "load_state" SaveVMHandlers that are expecting BQL to be held.
>>
>> In principle, the only devices that should be arriving on migration
>> channel serviced by postcopy_ram_listen_thread() are those that are
>> postcopiable and whose load handlers are safe to be called without BQL
>> being held.
>>
>> But nothing currently prevents the source from sending data for "unsafe"
>> devices which would cause trouble there.
>> Add a TODO comment there so it's clear that it would be good to improve
>> handling of such (erroneous) case in the future.
>>
>> Acked-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> migration/migration.c | 13 +++++++++++++
>> migration/savevm.c | 4 ++++
>> 2 files changed, 17 insertions(+)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 9e9db26667f1..0bf70ea9717d 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -402,10 +402,23 @@ void migration_incoming_state_destroy(void)
>> struct MigrationIncomingState *mis = migration_incoming_get_current();
>> multifd_recv_cleanup();
>> +
>> /*
>> * RAM state cleanup needs to happen after multifd cleanup, because
>> * multifd threads can use some of its states (receivedmap).
>> + * The VFIO load_cleanup() implementation is BQL-sensitive. It requires
>> + * BQL must NOT be taken when recycling load threads, so that it won't
>> + * block the load threads from making progress on address space
>> + * modification operations.
>> + *
>> + * To make it work, we could try to not take BQL for all load_cleanup(),
>> + * or conditionally unlock BQL only if bql_locked() in VFIO.
>> + *
>> + * Since most existing call sites take BQL for load_cleanup(), make
>> + * it simple by taking BQL always as the rule, so that VFIO can unlock
>> + * BQL and retake unconditionally.
>> */
>> + assert(bql_locked());
>> qemu_loadvm_state_cleanup();
>> if (mis->to_src_file) {
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 7c1aa8ad7b9d..3e86b572cfa8 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -1986,6 +1986,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
>> * in qemu_file, and thus we must be blocking now.
>> */
>> qemu_file_set_blocking(f, true);
>> +
>> + /* TODO: sanity check that only postcopiable data will be loaded here */
>> load_res = qemu_loadvm_state_main(f, mis);
>> /*
>> @@ -2046,7 +2048,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
>> * (If something broke then qemu will have to exit anyway since it's
>> * got a bad migration state).
>> */
>> + bql_lock();
>> migration_incoming_state_destroy();
>> + bql_unlock();
>> rcu_unregister_thread();
>> mis->have_listen_thread = false;
>>
>