From nobody Fri Nov 14 19:40:56 2025 Delivered-To: importer@patchew.org Authentication-Results: mx.zohomail.com; dkim=pass; spf=pass (zohomail.com: domain of gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=qemu-devel-bounces+importer=patchew.org@nongnu.org; dmarc=pass(p=quarantine dis=none) header.from=redhat.com ARC-Seal: i=1; a=rsa-sha256; t=1762194909; cv=none; d=zohomail.com; s=zohoarc; b=AF9hrsOeKSGM6k4Cis81v8AwaA7wENrat9odHhQES9XnJFRFi9iSQnSaAMBuHjlTunxdPL/Pzb4S6yNSE2onzYmGPMCM7qO7ZbD89H1PcHKufxreoRqUW9HAgIS1QThlNaFcxKsXvBxyW00uf3Ap7TI74frcMysFa8uQIP26Ibk= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=zohomail.com; s=zohoarc; t=1762194908; h=Content-Transfer-Encoding:Cc:Cc:Date:Date:From:From:In-Reply-To:List-Subscribe:List-Post:List-Id:List-Archive:List-Help:List-Unsubscribe:MIME-Version:Message-ID:References:Sender:Subject:Subject:To:To:Message-Id:Reply-To; bh=0Vw1B+WAklixK/OgjgJ07zfBa9fc2MEiBscojqAcb3k=; b=JWdrgm0CBfcGyMiEu13+58v2sEBTqontXqaZ2y5qS2oDwSQXO0vLE1oxztgp6Bt9aRgpltT+vtAEDrDwGWoseGosv/Aw8WskE+UHWm1Md7A+Df3xYypfO6Vm3BB/NUYSYsKnWfPLGZbi5lWH9FGy5XrCJAvxhiufNnLL15HLn3w= ARC-Authentication-Results: i=1; mx.zohomail.com; dkim=pass; spf=pass (zohomail.com: domain of gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=qemu-devel-bounces+importer=patchew.org@nongnu.org; dmarc=pass header.from= (p=quarantine dis=none) Return-Path: Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) by mx.zohomail.com with SMTPS id 1762194908991678.6340124151037; Mon, 3 Nov 2025 10:35:08 -0800 (PST) Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1vFzNS-00064i-RK; Mon, 03 Nov 2025 13:34:02 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1vFzNR-00064K-PY for qemu-devel@nongnu.org; Mon, 03 Nov 2025 13:34:01 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1vFzNF-00075y-Po for qemu-devel@nongnu.org; Mon, 03 Nov 2025 13:33:59 -0500 Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-108-tzFpCPrtNVm-xUTXnCt-Sw-1; Mon, 03 Nov 2025 13:33:44 -0500 Received: from mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.93]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 2DDC419373E6; Mon, 3 Nov 2025 18:33:27 +0000 (UTC) Received: from fedora.redhat.com (unknown [10.44.32.249]) by mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 3426D1800576; Mon, 3 Nov 2025 18:33:21 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1762194827; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=0Vw1B+WAklixK/OgjgJ07zfBa9fc2MEiBscojqAcb3k=; b=hQEiCGfyq0hGBl/EpxDc/lKkNAGTyUYevv+IN5xmVUYtoUZSxGbz3lv+Lv4WjTQmGZV8uo WUmdFuc8XIKq9/PrsF1ylYgBX9jg7JsG91KIf6tGjA40RoE5itDENpiSDi9yb4CH++kCH2 nvsnMtymK6ULt6KUNH01lKi4gjfdBIw= X-MC-Unique: tzFpCPrtNVm-xUTXnCt-Sw-1 X-Mimecast-MFC-AGG-ID: tzFpCPrtNVm-xUTXnCt-Sw_1762194823 From: Juraj Marcin To: qemu-devel@nongnu.org Cc: Juraj Marcin , "Dr. David Alan Gilbert" , Fabiano Rosas , Peter Xu , Jiri Denemark Subject: [PATCH v4 5/8] migration: Refactor all incoming cleanup info migration_incoming_destroy() Date: Mon, 3 Nov 2025 19:32:54 +0100 Message-ID: <20251103183301.3840862-6-jmarcin@redhat.com> In-Reply-To: <20251103183301.3840862-1-jmarcin@redhat.com> References: <20251103183301.3840862-1-jmarcin@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.93 Received-SPF: pass (zohomail.com: domain of gnu.org designates 209.51.188.17 as permitted sender) client-ip=209.51.188.17; envelope-from=qemu-devel-bounces+importer=patchew.org@nongnu.org; helo=lists.gnu.org; Received-SPF: pass client-ip=170.10.133.124; envelope-from=jmarcin@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=0.001, RCVD_IN_MSPIKE_WL=0.001, RCVD_IN_VALIDITY_CERTIFIED_BLOCKED=0.001, SPF_HELO_PASS=-0.001, T_SPF_TEMPERROR=0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+importer=patchew.org@nongnu.org Sender: qemu-devel-bounces+importer=patchew.org@nongnu.org X-ZohoMail-DKIM: pass (identity @redhat.com) X-ZM-MESSAGEID: 1762194911155154100 Content-Type: text/plain; charset="utf-8" From: Juraj Marcin Currently, there are two functions that are responsible for calling the cleanup of the incoming migration state. With successful precopy, it's the incoming migration coroutine, and with successful postcopy it's the postcopy listen thread. However, if postcopy fails during in the device load, both functions will try to do the cleanup. This patch refactors all cleanup that needs to be done on the incoming side into a common function and defines a clear boundary, who is responsible for the cleanup. The incoming migration coroutine is responsible for calling the cleanup function, unless the listen thread has been started, in which case the postcopy listen thread runs the incoming migration cleanup in its BH. Signed-off-by: Juraj Marcin Fixes: 9535435795 ("migration: push Error **errp into qemu_loadvm_state()") Reviewed-by: Peter Xu --- migration/migration.c | 44 +++++++++------------------- migration/migration.h | 1 + migration/postcopy-ram.c | 63 +++++++++++++++++++++------------------- migration/trace-events | 2 +- 4 files changed, 49 insertions(+), 61 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index 9a367f717e..637be71bfe 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -438,10 +438,15 @@ void migration_incoming_transport_cleanup(MigrationIn= comingState *mis) =20 void migration_incoming_state_destroy(void) { - struct MigrationIncomingState *mis =3D migration_incoming_get_current(= ); + MigrationIncomingState *mis =3D migration_incoming_get_current(); + PostcopyState ps =3D postcopy_state_get(); =20 multifd_recv_cleanup(); =20 + if (ps !=3D POSTCOPY_INCOMING_NONE) { + postcopy_incoming_cleanup(mis); + } + /* * RAM state cleanup needs to happen after multifd cleanup, because * multifd threads can use some of its states (receivedmap). @@ -866,7 +871,6 @@ process_incoming_migration_co(void *opaque) { MigrationState *s =3D migrate_get_current(); MigrationIncomingState *mis =3D migration_incoming_get_current(); - PostcopyState ps; int ret; Error *local_err =3D NULL; =20 @@ -883,25 +887,14 @@ process_incoming_migration_co(void *opaque) =20 trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed"); =20 - ps =3D postcopy_state_get(); - trace_process_incoming_migration_co_end(ret, ps); - if (ps !=3D POSTCOPY_INCOMING_NONE) { - if (ps =3D=3D POSTCOPY_INCOMING_ADVISE) { - /* - * Where a migration had postcopy enabled (and thus went to ad= vise) - * but managed to complete within the precopy period, we can u= se - * the normal exit. - */ - postcopy_incoming_cleanup(mis); - } else if (ret >=3D 0) { - /* - * Postcopy was started, cleanup should happen at the end of t= he - * postcopy thread. - */ - trace_process_incoming_migration_co_postcopy_end_main(); - goto out; - } - /* Else if something went wrong then just fall out of the normal e= xit */ + trace_process_incoming_migration_co_end(ret); + if (mis->have_listen_thread) { + /* + * Postcopy was started, cleanup should happen at the end of the + * postcopy listen thread. + */ + trace_process_incoming_migration_co_postcopy_end_main(); + goto out; } =20 if (ret < 0) { @@ -933,15 +926,6 @@ fail: } =20 exit(EXIT_FAILURE); - } else { - /* - * Report the error here in case that QEMU abruptly exits - * when postcopy is enabled. - */ - WITH_QEMU_LOCK_GUARD(&s->error_mutex) { - error_report_err(s->error); - s->error =3D NULL; - } } out: /* Pairs with the refcount taken in qmp_migrate_incoming() */ diff --git a/migration/migration.h b/migration/migration.h index 01329bf824..4a37f7202c 100644 --- a/migration/migration.h +++ b/migration/migration.h @@ -254,6 +254,7 @@ struct MigrationIncomingState { MigrationIncomingState *migration_incoming_get_current(void); void migration_incoming_state_destroy(void); void migration_incoming_transport_cleanup(MigrationIncomingState *mis); +void migration_incoming_qemu_exit(void); /* * Functions to work with blocktime context */ diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c index b47c955763..48cbb46c27 100644 --- a/migration/postcopy-ram.c +++ b/migration/postcopy-ram.c @@ -2078,6 +2078,24 @@ bool postcopy_is_paused(MigrationStatus status) status =3D=3D MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP; } =20 +static void postcopy_listen_thread_bh(void *opaque) +{ + MigrationIncomingState *mis =3D migration_incoming_get_current(); + + migration_incoming_state_destroy(); + + if (mis->state =3D=3D MIGRATION_STATUS_FAILED) { + /* + * If something went wrong then we have a bad state so exit; + * we only could have gotten here if something failed before + * POSTCOPY_INCOMING_RUNNING (for example device load), otherwise + * postcopy migration would pause inside qemu_loadvm_state_main(). + * Failing dirty-bitmaps won't fail the whole migration. + */ + exit(1); + } +} + /* * Triggered by a postcopy_listen command; this thread takes over reading * the input stream, leaving the main thread free to carry on loading the = rest @@ -2131,53 +2149,38 @@ static void *postcopy_listen_thread(void *opaque) "bitmaps are correctly migrated and valid.", __func__, load_res, error_get_pretty(local_err)); g_clear_pointer(&local_err, error_free); - load_res =3D 0; /* prevent further exit() */ } else { + /* + * Something went fatally wrong and we have a bad state, QEMU = will + * exit depending on if postcopy-exit-on-error is true, but the + * migration cannot be recovered. + */ error_prepend(&local_err, "loadvm failed during postcopy: %d: ", load_res); migrate_set_error(migr, local_err); g_clear_pointer(&local_err, error_report_err); migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIV= E, MIGRATION_STATUS_FAILED); + goto out; } } - if (load_res >=3D 0) { - /* - * This looks good, but it's possible that the device loading in t= he - * main thread hasn't finished yet, and so we might not be in 'RUN' - * state yet; wait for the end of the main thread. - */ - qemu_event_wait(&mis->main_thread_load_event); - } - postcopy_incoming_cleanup(mis); - - if (load_res < 0) { - /* - * If something went wrong then we have a bad state so exit; - * depending how far we got it might be possible at this point - * to leave the guest running and fire MCEs for pages that never - * arrived as a desperate recovery step. - */ - rcu_unregister_thread(); - exit(EXIT_FAILURE); - } + /* + * This looks good, but it's possible that the device loading in the + * main thread hasn't finished yet, and so we might not be in 'RUN' + * state yet; wait for the end of the main thread. + */ + qemu_event_wait(&mis->main_thread_load_event); =20 migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE, MIGRATION_STATUS_COMPLETED); - /* - * If everything has worked fine, then the main thread has waited - * for us to start, and we're the last use of the mis. - * (If something broke then qemu will have to exit anyway since it's - * got a bad migration state). - */ - bql_lock(); - migration_incoming_state_destroy(); - bql_unlock(); =20 +out: rcu_unregister_thread(); mis->have_listen_thread =3D false; postcopy_state_set(POSTCOPY_INCOMING_END); =20 + migration_bh_schedule(postcopy_listen_thread_bh, NULL); + object_unref(OBJECT(migr)); =20 return NULL; diff --git a/migration/trace-events b/migration/trace-events index e8edd1fbba..772636f3ac 100644 --- a/migration/trace-events +++ b/migration/trace-events @@ -193,7 +193,7 @@ source_return_path_thread_resume_ack(uint32_t v) "%"PRI= u32 source_return_path_thread_switchover_acked(void) "" migration_thread_low_pending(uint64_t pending) "%" PRIu64 migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t ba= ndwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_sp= ent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %"= PRId64 -process_incoming_migration_co_end(int ret, int ps) "ret=3D%d postcopy-stat= e=3D%d" +process_incoming_migration_co_end(int ret) "ret=3D%d" process_incoming_migration_co_postcopy_end_main(void) "" postcopy_preempt_enabled(bool value) "%d" migration_precopy_complete(void) "" --=20 2.51.0