From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To: qemu-devel@nongnu.org
Cc: vsementsov@virtuozzo.com, quintela@redhat.com, armbru@redhat.com, dgilbert@redhat.com, den@openvz.org
Subject: [RFC] migration: introduce failed-unrecoverable status
Date: Wed, 18 Dec 2019 15:55:12 +0300
Message-Id: <20191218125512.5446-1-vsementsov@virtuozzo.com>

We should not start the source VM automatically if the error occurred after
the target has accessed the disks, or if we failed to invalidate the block
nodes. Also fix the following: we need to invalidate even if
bdrv_inactivate_all() failed, since in that case it may still have
successfully inactivated some of the nodes.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
Hi all!

This is an investigation on top of the old thread
https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg02355.html

Either I'm missing something, or we need this patch. It's a draft and may
need to be split into 2-3 small patches; still, I'd like to get general
approval first, in case I'm doing something wrong. Also, there may be other
migration failure cases like this.
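For illustration only (not part of the patch), here is a rough sketch of how
a management tool could react to the proposed status over QMP. The flow is
my assumption; query-migrate and cont are the only commands used, and reply
fields other than "status" are omitted:

  On the source, after postcopy fails past the point of no return:

    -> { "execute": "query-migrate" }
    <- { "return": { "status": "failed-unrecoverable" } }

  The source VM is intentionally left stopped. Only after making sure the
  destination no longer accesses the disks (and the block nodes could be
  invalidated again) should it be resumed by hand:

    -> { "execute": "cont" }
    <- { "return": {} }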
 qapi/migration.json   |  7 +++++--
 migration/migration.c | 36 ++++++++++++++++++++++++------------
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index b7348d0c8b..90fa625cbb 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -125,6 +125,9 @@
 #
 # @failed: some error occurred during migration process.
 #
+# @failed-unrecoverable: postcopy failed after no return point, when disks may
+#                        already be accessed by target Qemu process. (since 5.0)
+#
 # @colo: VM is in the process of fault tolerance, VM can not get into this
 #        state unless colo capability is enabled for migration. (since 2.8)
 #
@@ -142,8 +145,8 @@
 { 'enum': 'MigrationStatus',
   'data': [ 'none', 'setup', 'cancelling', 'cancelled',
             'active', 'postcopy-active', 'postcopy-paused',
-            'postcopy-recover', 'completed', 'failed', 'colo',
-            'pre-switchover', 'device', 'wait-unplug' ] }
+            'postcopy-recover', 'completed', 'failed', 'failed-unrecoverable',
+            'colo', 'pre-switchover', 'device', 'wait-unplug' ] }

 ##
 # @MigrationInfo:
diff --git a/migration/migration.c b/migration/migration.c
index 354ad072fa..00684fdef8 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2576,7 +2576,14 @@ static int postcopy_start(MigrationState *ms)
     QEMUFile *fb;
     int64_t time_at_stop = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
     int64_t bandwidth = migrate_max_postcopy_bandwidth();
-    bool restart_block = false;
+
+    /*
+     * recoverable_failure
+     * A failure happened early enough that we know the destination hasn't
+     * accessed block devices, so we're safe to recover.
+     */
+    bool recoverable_failure = true;
+    bool inactivated = false;
     int cur_state = MIGRATION_STATUS_ACTIVE;
     if (!migrate_pause_before_switchover()) {
         migrate_set_state(&ms->state, MIGRATION_STATUS_ACTIVE,
@@ -2600,11 +2607,11 @@ static int postcopy_start(MigrationState *ms)
         goto fail;
     }

+    inactivated = true;
     ret = bdrv_inactivate_all();
     if (ret < 0) {
         goto fail;
     }
-    restart_block = true;

     /*
      * Cause any non-postcopiable, but iterative devices to
@@ -2682,7 +2689,7 @@ static int postcopy_start(MigrationState *ms)
         goto fail_closefb;
     }

-    restart_block = false;
+    recoverable_failure = false;

     /* Now send that blob */
     if (qemu_savevm_send_packaged(ms->to_dst_file, bioc->data, bioc->usage)) {
@@ -2716,26 +2723,28 @@ static int postcopy_start(MigrationState *ms)
     ret = qemu_file_get_error(ms->to_dst_file);
     if (ret) {
         error_report("postcopy_start: Migration stream errored");
-        migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
-                          MIGRATION_STATUS_FAILED);
+        goto fail;
     }

-    return ret;
+    return 0;

 fail_closefb:
     qemu_fclose(fb);
 fail:
     migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
-                      MIGRATION_STATUS_FAILED);
-    if (restart_block) {
-        /* A failure happened early enough that we know the destination hasn't
-         * accessed block devices, so we're safe to recover.
-         */
+                      recoverable_failure ? MIGRATION_STATUS_FAILED :
+                                            MIGRATION_STATUS_FAILED_UNRECOVERABLE);
+    if (recoverable_failure && inactivated) {
         Error *local_err = NULL;

         bdrv_invalidate_cache_all(&local_err);
         if (local_err) {
             error_report_err(local_err);
+            /*
+             * We failed to invalidate, so we must not start vm automatically.
+             * User may retry invalidation and start by cont qmp command.
+             */
+            ms->vm_was_running = false;
         }
     }
     qemu_mutex_unlock_iothread();
@@ -3194,9 +3203,12 @@ static void migration_iteration_finish(MigrationState *s)
         s->vm_was_running = true;
         /* Fallthrough */
     case MIGRATION_STATUS_FAILED:
+    case MIGRATION_STATUS_FAILED_UNRECOVERABLE:
     case MIGRATION_STATUS_CANCELLED:
     case MIGRATION_STATUS_CANCELLING:
-        if (s->vm_was_running) {
+        if (s->vm_was_running &&
+            s->state != MIGRATION_STATUS_FAILED_UNRECOVERABLE)
+        {
             vm_start();
         } else {
             if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
-- 
2.21.0