[v1] migration queue

[PULL 14/18] migration: Enlarge postcopy recovery to capture !-EIO too

Dr. David Alan Gilbert (git) posted 18 patches 3 years, 11 months ago

Diff against v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v2
Download mbox

Maintainers: Juan Quintela <quintela@redhat.com>, "Dr. David Alan Gilbert" <dgilbert@redhat.com>, Luc Michel <luc@lmichel.fr>, Damien Hedde <damien.hedde@greensocs.com>, Francisco Iglesias <francisco.iglesias@xilinx.com>, Alistair Francis <alistair@alistair23.me>, "Edgar E. Iglesias" <edgar.iglesias@gmail.com>, Peter Maydell <peter.maydell@linaro.org>, Markus Armbruster <armbru@redhat.com>, Eric Blake <eblake@redhat.com>, Gerd Hoffmann <kraxel@redhat.com>, Thomas Huth <thuth@redhat.com>, Laurent Vivier <lvivier@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Stefan Hajnoczi <stefanha@redhat.com>

There is a newer version of this series

Expand all Fold all

[PULL 14/18] migration: Enlarge postcopy recovery to capture !-EIO too

Posted by Dr. David Alan Gilbert (git) 3 years, 11 months ago

From: Peter Xu <peterx@redhat.com>

We used to have quite a few places making sure -EIO happened and that's the
only way to trigger postcopy recovery.  That's based on the assumption that
we'll only return -EIO for channel issues.

It'll work in 99.99% cases but logically that won't cover some corner cases.
One example is e.g. ram_block_from_stream() could fail with an interrupted
network, then -EINVAL will be returned instead of -EIO.

I remembered Dave Gilbert pointed that out before, but somehow this is
overlooked.  Neither did I encounter anything outside the -EIO error.

However we'd better touch that up before it triggers a rare VM data loss during
live migrating.

To cover as much those cases as possible, remove the -EIO restriction on
triggering the postcopy recovery, because even if it's not a channel failure,
we can't do anything better than halting QEMU anyway - the corpse of the
process may even be used by a good hand to dig out useful memory regions, or
the admin could simply kill the process later on.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220301083925.33483-11-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/migration.c    | 4 ++--
 migration/postcopy-ram.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index bcc385b94b..306e2ac60e 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2865,7 +2865,7 @@ retry:
 out:
     res = qemu_file_get_error(rp);
     if (res) {
-        if (res == -EIO && migration_in_postcopy()) {
+        if (res && migration_in_postcopy()) {
             /*
              * Maybe there is something we can do: it looks like a
              * network down issue, and we pause for a recovery.
@@ -3466,7 +3466,7 @@ static MigThrError migration_detect_error(MigrationState *s)
         error_free(local_error);
     }
 
-    if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret == -EIO) {
+    if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret) {
         /*
          * For postcopy, we allow the network to be down for a
          * while. After that, it can be continued by a
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index d08d396c63..b0d12d5053 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1039,7 +1039,7 @@ retry:
                                         msg.arg.pagefault.address);
             if (ret) {
                 /* May be network failure, try to wait for recovery */
-                if (ret == -EIO && postcopy_pause_fault_thread(mis)) {
+                if (postcopy_pause_fault_thread(mis)) {
                     /* We got reconnected somehow, try to continue */
                     goto retry;
                 } else {
-- 
2.35.1