[PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Juraj Marcin 2 weeks, 5 days ago
From: Juraj Marcin <jmarcin@redhat.com>

The usual system defaults for TCP keep-alive options are too long for
migration workloads. On Linux, a TCP connection sits idle for 2 hours
before it starts probing whether the connection is still alive.

Now that InetSocketAddress supports keep-alive options [1], this patch
applies migration-specific defaults when they are not supplied by the
user or the management software. With these defaults, a migration TCP
stream waits idle for 1 minute and then sends up to 5 TCP keep-alive
packets at 30-second intervals before considering the connection
broken.

System defaults can still be used by explicitly setting these
parameters to 0.

[1]: 1bd4237cb1 "util/qemu-sockets: Introduce inet socket options
     controlling TCP keep-alive"
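
For illustration, these defaults map roughly onto the standard Linux
socket options as in the sketch below (illustrative only; the actual
patch routes the values through InetSocketAddress, and the helper name
here is made up):

    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Sketch: what the migration defaults mean at the socket level. */
    static void apply_keep_alive_defaults_sketch(int fd)
    {
        int on = 1, cnt = 5, idle = 60, intvl = 30;

        setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
        /* Worst-case dead-peer detection: 60 + 5 * 30 = 210 seconds. */
    }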

Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
---
 migration/migration.c | 39 +++++++++++++++++++++++++++++++++++++++
 qapi/migration.json   |  6 ++++++
 2 files changed, 45 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 10c216d25d..a1f1223946 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -74,6 +74,11 @@
 
 #define INMIGRATE_DEFAULT_EXIT_ON_ERROR true
 
+#define INET_SOCKET_DEFAULT_KEEP_ALIVE true
+#define INET_SOCKET_DEFAULT_KEEP_ALIVE_COUNT 5
+#define INET_SOCKET_DEFAULT_KEEP_ALIVE_IDLE 60
+#define INET_SOCKET_DEFAULT_KEEP_ALIVE_INTERVAL 30
+
 static NotifierWithReturnList migration_state_notifiers[] = {
     NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_NORMAL),
     NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_REBOOT),
@@ -718,6 +723,36 @@ bool migrate_uri_parse(const char *uri, MigrationChannel **channel,
     return true;
 }
 
+static void migration_address_apply_defaults(MigrationAddress *addr)
+{
+    if (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET &&
+        addr->u.socket.type == SOCKET_ADDRESS_TYPE_INET) {
+        InetSocketAddress *inet = &addr->u.socket.u.inet;
+        if (!inet->has_keep_alive) {
+            inet->has_keep_alive = true;
+            inet->keep_alive = INET_SOCKET_DEFAULT_KEEP_ALIVE;
+        }
+#ifdef HAVE_TCP_KEEPCNT
+        if (!inet->has_keep_alive_count) {
+            inet->has_keep_alive_count = true;
+            inet->keep_alive_count = INET_SOCKET_DEFAULT_KEEP_ALIVE_COUNT;
+        }
+#endif
+#ifdef HAVE_TCP_KEEPIDLE
+        if (!inet->has_keep_alive_idle) {
+            inet->has_keep_alive_idle = true;
+            inet->keep_alive_idle = INET_SOCKET_DEFAULT_KEEP_ALIVE_IDLE;
+        }
+#endif
+#ifdef HAVE_TCP_KEEPINTVL
+        if (!inet->has_keep_alive_interval) {
+            inet->has_keep_alive_interval = true;
+            inet->keep_alive_interval = INET_SOCKET_DEFAULT_KEEP_ALIVE_INTERVAL;
+        }
+#endif
+    }
+}
+
 static bool
 migration_incoming_state_setup(MigrationIncomingState *mis, Error **errp)
 {
@@ -775,6 +810,8 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
         addr = channel->addr;
     }
 
+    migration_address_apply_defaults(addr);
+
     /* transport mechanism not suitable for migration? */
     if (!migration_transport_compatible(addr, errp)) {
         return;
@@ -2232,6 +2269,8 @@ void qmp_migrate(const char *uri, bool has_channels,
         addr = channel->addr;
     }
 
+    migration_address_apply_defaults(addr);
+
     /* transport mechanism not suitable for migration? */
     if (!migration_transport_compatible(addr, errp)) {
         return;
diff --git a/qapi/migration.json b/qapi/migration.json
index 2387c21e9c..68d4acb5db 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1639,6 +1639,12 @@
 #
 # Migration endpoint configuration.
 #
+# If transport is socket of type inet, it has the following defaults:
+# keep-alive: true, keep-alive-count: 5, keep-alive-idle: 60 seconds,
+# keep-alive-interval: 30 seconds.  System defaults can be used by
+# explicitly setting these parameters to 0.  See `InetSocketAddress` for
+# more details.
+#
 # @transport: The migration stream transport mechanism
 #
 # Since: 8.2
-- 
2.51.0
Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Daniel P. Berrangé 2 weeks, 5 days ago
On Tue, Sep 09, 2025 at 05:01:24PM +0200, Juraj Marcin wrote:
> From: Juraj Marcin <jmarcin@redhat.com>
> 
> The usual system defaults for TCP keep-alive options are too long for
> migration workloads. On Linux, a TCP connection sits idle for 2 hours
> before it starts probing whether the connection is still alive.
> 
> Now that InetSocketAddress supports keep-alive options [1], this patch
> applies migration-specific defaults when they are not supplied by the
> user or the management software. With these defaults, a migration TCP
> stream waits idle for 1 minute and then sends up to 5 TCP keep-alive
> packets at 30-second intervals before considering the connection
> broken.
> 
> System defaults can still be used by explicitly setting these
> parameters to 0.

IMHO this is not a good idea. This is a very short default, which
may be fine for the scenario where your network conn is permanently
dead, but it is going to cause undesirable failures when the network
conn is only temporarily dead.

Optimizing defaults for temporary outages is much more preferable
as that maximises reliability of migration. In the case of permanent
outages, it is already possible to tear down the connection without
waiting for a keep-alive timeout, and liveliness checks can also be
performed by the mgmt app at a higher level too. The TCP keepalives
are just an eventual failsafe, and having those work on a long
timeframe is OK.

> 
> [1]: 1bd4237cb1 "util/qemu-sockets: Introduce inet socket options
>      controlling TCP keep-alive"
> 
> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> ---
>  migration/migration.c | 39 +++++++++++++++++++++++++++++++++++++++
>  qapi/migration.json   |  6 ++++++
>  2 files changed, 45 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 10c216d25d..a1f1223946 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -74,6 +74,11 @@
>  
>  #define INMIGRATE_DEFAULT_EXIT_ON_ERROR true
>  
> +#define INET_SOCKET_DEFAULT_KEEP_ALIVE true
> +#define INET_SOCKET_DEFAULT_KEEP_ALIVE_COUNT 5
> +#define INET_SOCKET_DEFAULT_KEEP_ALIVE_IDLE 60
> +#define INET_SOCKET_DEFAULT_KEEP_ALIVE_INTERVAL 30
> +
>  static NotifierWithReturnList migration_state_notifiers[] = {
>      NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_NORMAL),
>      NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_REBOOT),
> @@ -718,6 +723,36 @@ bool migrate_uri_parse(const char *uri, MigrationChannel **channel,
>      return true;
>  }
>  
> +static void migration_address_apply_defaults(MigrationAddress *addr)
> +{
> +    if (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET &&
> +        addr->u.socket.type == SOCKET_ADDRESS_TYPE_INET) {
> +        InetSocketAddress *inet = &addr->u.socket.u.inet;
> +        if (!inet->has_keep_alive) {
> +            inet->has_keep_alive = true;
> +            inet->keep_alive = INET_SOCKET_DEFAULT_KEEP_ALIVE;
> +        }
> +#ifdef HAVE_TCP_KEEPCNT
> +        if (!inet->has_keep_alive_count) {
> +            inet->has_keep_alive_count = true;
> +            inet->keep_alive_count = INET_SOCKET_DEFAULT_KEEP_ALIVE_COUNT;
> +        }
> +#endif
> +#ifdef HAVE_TCP_KEEPIDLE
> +        if (!inet->has_keep_alive_idle) {
> +            inet->has_keep_alive_idle = true;
> +            inet->keep_alive_idle = INET_SOCKET_DEFAULT_KEEP_ALIVE_IDLE;
> +        }
> +#endif
> +#ifdef HAVE_TCP_KEEPINTVL
> +        if (!inet->has_keep_alive_interval) {
> +            inet->has_keep_alive_interval = true;
> +            inet->keep_alive_interval = INET_SOCKET_DEFAULT_KEEP_ALIVE_INTERVAL;
> +        }
> +#endif
> +    }
> +}
> +
>  static bool
>  migration_incoming_state_setup(MigrationIncomingState *mis, Error **errp)
>  {
> @@ -775,6 +810,8 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
>          addr = channel->addr;
>      }
>  
> +    migration_address_apply_defaults(addr);
> +
>      /* transport mechanism not suitable for migration? */
>      if (!migration_transport_compatible(addr, errp)) {
>          return;
> @@ -2232,6 +2269,8 @@ void qmp_migrate(const char *uri, bool has_channels,
>          addr = channel->addr;
>      }
>  
> +    migration_address_apply_defaults(addr);
> +
>      /* transport mechanism not suitable for migration? */
>      if (!migration_transport_compatible(addr, errp)) {
>          return;
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 2387c21e9c..68d4acb5db 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -1639,6 +1639,12 @@
>  #
>  # Migration endpoint configuration.
>  #
> +# If transport is socket of type inet, it has the following defaults:
> +# keep-alive: true, keep-alive-count: 5, keep-alive-idle: 60 seconds,
> +# keep-alive-interval: 30 seconds.  System defaults can be used by
> +# explicitly setting these parameters to 0.  See `InetSocketAddress` for
> +# more details.
> +#
>  # @transport: The migration stream transport mechanism
>  #
>  # Since: 8.2
> -- 
> 2.51.0
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Juraj Marcin 2 weeks, 4 days ago
Hi Daniel,

thank you for the feedback.

On 2025-09-09 16:09, Daniel P. Berrangé wrote:
> On Tue, Sep 09, 2025 at 05:01:24PM +0200, Juraj Marcin wrote:
> > [...]
> 
> IMHO this is not a good idea. This is a very short default, which
> may be fine for the scenario where your network conn is permanently
> dead, but it is going to cause undesirable failures when the network
> conn is only temporarily dead.
> 
> Optimizing defaults for temporary outages is much more preferable
> as that maximises reliability of migration. In the case of permanent
> outages, it is already possible to tear down the connection without
> waiting for a keep-alive timeout, and liveliness checks can also be
> performed by the mgmt app at a higher level too. The TCP keepalives
> are just an eventual failsafe, and having those work on a long
> timeframe is OK.

So, would extending the keep-alive timeout to, for example, 30 minutes
be an acceptable solution? Or are the system defaults preferred, and the
patch should just enable keep-alive without setting the other parameters?

Thank you!

Best regards,

Juraj Marcin

> [...]
Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Peter Xu 2 weeks, 4 days ago
On Tue, Sep 09, 2025 at 04:09:23PM +0100, Daniel P. Berrangé wrote:
> On Tue, Sep 09, 2025 at 05:01:24PM +0200, Juraj Marcin wrote:
> > [...]
> 
> IMHO this is not a good idea. This is a very short default, which
> may be fine for the scenario where your network conn is permanently
> dead, but it is going to cause undesirable failures when the network
> conn is only temporarily dead.
> 
> Optimizing defaults for temporary outages is much more preferable
> as that maximises reliability of migration. In the case of permanent
> outages, it is already possible to tear down the connection without
> waiting for a keep-alive timeout, and liveliness checks can also be
> performed by the mgmt app at a higher level too. The TCP keepalives
> are just an eventual failsafe, and having those work on a long
> timeframe is OK.

For precopy it looks fine indeed, because migrate_cancel should always work
on the src if the src socket hangs, and even if the dest QEMU socket hangs,
dest QEMU can simply be killed once src QEMU is gracefully cancelled and
rolled back to RUNNING, regardless of the socket status on dest QEMU.

For postcopy, we could still use migrate_pause to enforce a src shutdown().
Initially I thought we had no way of doing that for dest QEMU, but I just
noticed that two years ago I added that to dest QEMU for migrate_paused when
working on commit f8c543e808f20b.  So it looks like that part is covered
too: if the dest QEMU socket hangs, we can also kick it out.

I'm not 100% sure, though, whether shutdown() would always be able to
kick us out of the hung socket while the keepalive is ticking.  Is
that guaranteed?
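
For reference, at the plain sockets level a shutdown() from another
thread does wake up a peer blocked in recv(); a minimal standalone
sketch (using an AF_UNIX socketpair for brevity, though a TCP socket
behaves the same way), which shows the basic mechanism but is not a
guarantee for every TCP corner case:

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void *reader(void *arg)
    {
        char buf[64];
        /* Blocks here: the other end never sends anything. */
        ssize_t n = recv(*(int *)arg, buf, sizeof(buf), 0);
        printf("recv() returned %zd after shutdown()\n", n); /* 0 = EOF */
        return NULL;
    }

    int main(void)
    {
        int sv[2];
        pthread_t t;

        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
        pthread_create(&t, NULL, reader, &sv[0]);
        sleep(1);                   /* let the reader block in recv() */
        shutdown(sv[0], SHUT_RDWR); /* kicks the blocked recv() out */
        pthread_join(t, NULL);
        return 0;
    }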

I'm also not sure, if that happens, whether libvirt would automatically do
that, or provide some way for the user to trigger it.  The goal IIUC is
that we shouldn't put the user into a situation where the migration hangs
without any way to either cancel or recover.  The default values Juraj
provided here make sure the hang won't last more than a few minutes,
which sounds like a sane timeout value.

Thanks,

-- 
Peter Xu


Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Daniel P. Berrangé 2 weeks, 4 days ago
On Tue, Sep 09, 2025 at 05:58:49PM -0400, Peter Xu wrote:
> On Tue, Sep 09, 2025 at 04:09:23PM +0100, Daniel P. Berrangé wrote:
> > [...]
> 
> [...]
> 
> I'm not 100% sure, though, whether shutdown() would always be able to
> kick us out of the hung socket while the keepalive is ticking.  Is
> that guaranteed?

I don't know about shutdown(), but close() certainly works. If shutdown()
is not sufficient, then IMHO the migration code would need the ability to
use close() to deal with this situation.


> I'm also not sure, if that happens, whether libvirt would automatically do
> that, or provide some way for the user to trigger it.  The goal IIUC is
> that we shouldn't put the user into a situation where the migration hangs
> without any way to either cancel or recover.  The default values Juraj
> provided here make sure the hang won't last more than a few minutes,
> which sounds like a sane timeout value.

Sufficient migration QMP commands should exist to ensure migration can
always be cancelled. Short keepalive timeouts should not be considered
a solution to any gaps in that respect.

Also there is yank, but IMHO apps shouldn't have to rely on yank - I see
yank as a safety net for apps to work around limitations in QEMU.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Peter Xu 2 weeks, 3 days ago
On Wed, Sep 10, 2025 at 08:10:57AM +0100, Daniel P. Berrangé wrote:
> On Tue, Sep 09, 2025 at 05:58:49PM -0400, Peter Xu wrote:
> > [...]
> 
> I don't know about shutdown(), but close() certainly works. If shutdown()
> is not sufficient, then IMHO the migration code would need the ability to
> use close() to deal with this situation.
> 
> > [...]
> 
> Sufficient migration QMP commands should exist to ensure migration can
> always be cancelled. Short keepalive timeouts should not be considered
> a solution to any gaps in that respect.
> 
> Also there is yank, but IMHO apps shouldn't have to rely on yank - I see
> yank as a safety net for apps to work around limitations in QEMU.

The QMP facilities look to be all present, namely the migrate-cancel and
migrate-pause commands mentioned above.

For migrate_cancel (in precopy), is that Ctrl-C on "virsh migrate"?

Does libvirt expose migrate_pause via any virsh command?  IIUC that's the
only official way of pausing a postcopy VM on either side.  I also agree we
shouldn't make yank the official tool to use.

OTOH, the default timeouts work without changing libvirt, making sure
customers will not be stuck on a likely-failing network for hours with no
way to properly detach and recover when it's wanted.

Thanks,

-- 
Peter Xu


Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Daniel P. Berrangé 2 weeks, 2 days ago
On Wed, Sep 10, 2025 at 12:36:44PM -0400, Peter Xu wrote:
> On Wed, Sep 10, 2025 at 08:10:57AM +0100, Daniel P. Berrangé wrote:
> > [...]
> 
> The QMP facilities look to be all present, namely the migrate-cancel and
> migrate-pause commands mentioned above.
> 
> For migrate_cancel (in precopy), is that Ctrl-C on "virsh migrate"?
> 
> Does libvirt expose migrate_pause via any virsh command?  IIUC that's the
> only official way of pausing a postcopy VM on either side.  I also agree we
> shouldn't make yank the official tool to use.

virsh will call virDomainAbortJob when Ctrl-C is pressed during a
'migrate' command.

virDomainAbortJob will call migrate-cancel for pre-copy, or
'migrate-pause' for post-copy.


> OTOH, the default timeouts work without changing libvirt, making sure
> customers will not be stuck on a likely-failing network for hours with no
> way to properly detach and recover when it's wanted.

"timeouts work" has the implicit assumpton that the only reason a
timeout will fire is due to a unrecoverable situation. IMHO that
assumption is not valid.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Peter Xu 2 weeks, 2 days ago
On Fri, Sep 12, 2025 at 11:20:01AM +0100, Daniel P. Berrangé wrote:
> On Wed, Sep 10, 2025 at 12:36:44PM -0400, Peter Xu wrote:
> > [...]
> 
> virsh will call virDomainAbortJob when Ctrl-C is pressed during a
> 'migrate' command.
> 
> virDomainAbortJob will call migrate-cancel for pre-copy, or
> 'migrate-pause' for post-copy.

Would it call "migrate-pause" on both sides?

I believe the problem we hit was that, during postcopy with a
malfunctioning NIC, src fell into postcopy-paused successfully but dest
didn't, staying stuck in postcopy-active.

We'll want to make sure both sides are kicked into the paused state to
recover.  Otherwise dest can hang in that state for hours until the
watchdog timeout triggers.

> 
> 
> > OTOH, the default timeouts work without changing libvirt, making sure
> > customers will not be stuck on a likely-failing network for hours with no
> > way to properly detach and recover when it's wanted.
> 
> "timeouts work" has the implicit assumpton that the only reason a
> timeout will fire is due to a unrecoverable situation. IMHO that
> assumption is not valid.

I agree adjusting the timeout is not the best.

If we have a solid way to kick both sides out, then indeed we don't
need to change the timeout.

If not, we may still need to provide a way for the user to retry when
they find the network behaving abnormally.

Here adjusting the timeout is slightly better than any ad-hoc socket
timeout we might tune elsewhere: it's the migration timeout, and we only
have two cases: (1) precopy, which is OK to fail and retry, and (2)
postcopy, which is also OK to fail and recover.

The timeout makes some corner cases converge more aggressively to a
failure/retry condition, but since we know exactly how these sockets
are used, it's better than a generic default chosen with no knowledge
of what they are used for.

Thanks,

-- 
Peter Xu


Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Daniel P. Berrangé 1 week, 5 days ago
On Fri, Sep 12, 2025 at 11:02:12AM -0400, Peter Xu wrote:
> On Fri, Sep 12, 2025 at 11:20:01AM +0100, Daniel P. Berrangé wrote:
> > [...]
> 
> Would it call "migrate-pause" on both sides?

Not 100% sure, but with virDomainAbortJob I think libvirt only calls
migrate-pause on the source host.

> I believe the problem we hit was that, during postcopy with a
> malfunctioning NIC, src fell into postcopy-paused successfully but dest
> didn't, staying stuck in postcopy-active.

If something has interrupted src<->dst host comms for QEMU it may well
impact libvirt <-> libvirt comms too, unless migration was being done
over a separate NIC from the mgmt LAN.  IOW, it may be impossible for
libvirt to call migrate-pause on both sides, at least not until the
NIC problem has been resolved.

> We'll want to make sure both sides are kicked into the paused state to
> recover.  Otherwise dest can hang in that state for hours until the
> watchdog timeout triggers.

Once the network problem has been resolved, then it ought to be possible
to get libvirt to issue 'migrate-pause' on both hosts, and thus be able
to recover.

Possibly the act of starting migration recovery in libvirt should attempt
to issue 'migrate-pause' to clean up the previously running migration if
it is still in the stuck state.

> 
> > [...]
> 
> I agree adjusting the timeout is not the best.
> 
> If we have a solid way to kick both sides out, then indeed we don't
> need to change the timeout.
> 
> If not, we may still need to provide a way for the user to retry when
> they find the network behaving abnormally.
> 
> Here adjusting the timeout is slightly better than any ad-hoc socket
> timeout we might tune elsewhere: it's the migration timeout, and we only
> have two cases: (1) precopy, which is OK to fail and retry, and (2)
> postcopy, which is also OK to fail and recover.

Fail & retry/recover is not without cost / risk though. Users can have
successful migrations that are many hours long when dealing with big
VMs. IOW, returning to the start of pre-copy could be a non-trivial
time delay.

Consider if the reason for the migration is to evacuate workloads off
a host that is suffering technical problems. It could well be that
periodic unexpected network outages are what is triggering the need
to evacuate workloads. If we time out a migration with keepalives they
may never be able to get through a migration op quickly enough, or
they can be delayed such that the host has a fatal error, losing the
workload before the retried migration is complete.

IMHO, once a migration has been started we should not proactively
interrupt that with things like keepalives, unless the admin made a
conscious decision that they wanted that behaviour enabled.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Juraj Marcin 1 week, 3 days ago
Hi Daniel and all,

On 2025-09-15 19:23, Daniel P. Berrangé wrote:
> On Fri, Sep 12, 2025 at 11:02:12AM -0400, Peter Xu wrote:
> > [...]
> 
> Fail & retry/recover is not without cost / risk though. Users can have
> successful migrations that are many hours long when dealing with big
> VMs. IOW, returning to the start of pre-copy could be a non-trivial
> time delay.
> 
> Consider if the reason for the migration is to evacuate workloads off
> a host that is suffering technical problems. It could well be that
> periodic unexpected network outages are what is triggering the need
> to evacuate workloads. If we time out a migration with keepalives they
> may never be able to get through a migration op quickly enough, or
> they can be delayed such that the host has a fatal error, losing the
> workload before the retried migration is complete.
> 
> IMHO, once a migration has been started we should not proactively
> interrupt that with things like keepalives, unless the admin made a
> conscious decision that they wanted that behaviour enabled.
>

maybe I should reiterate the original problem this is trying to
solve. I have also talked to @jdenemar about how libvirt currently
handles such things (but if I still missed something, please correct me).

If there is no outgoing traffic from the destination side (this can be
caused, for example, by a workload with no page faults or by a paused
machine), QEMU has no way of knowing whether the connection is still
working. The TCP stack doesn't treat the absence of incoming traffic as
a sign of a broken connection. Therefore, QEMU would stay in
postcopy-active waiting for pages indefinitely.

Also, libvirt might not be aware of a connection dropout between the
QEMUs if libvirt's own connection is intact, especially if the libvirt
daemons are communicating through some central entity that manages the
migration rather than directly. And to do postcopy migration recovery,
libvirt needs both sides to be in the postcopy-paused state.

Alternatively, there might also be an issue with the connection between
the libvirt daemons, but not the migration connection. Even if the
libvirt connection fails, the migration is not paused; rather, libvirt
lets the migration finish normally. Similarly, if the libvirt connection
is broken due to, for example, a libvirt daemon restart, the ongoing
migration is not paused; after the libvirt daemon starts again, it
sees the ongoing migration and lets it finish.

Additionally, libvirt uses its own internal keep-alive packets with much
more aggressive timeouts, waiting 5 - 10 seconds idle before sending a
keep-alive packet and then killing the connection if there is no
response in 30 seconds.

I think that if we enable keep-alive in QEMU but make the default
timeouts longer, for example an idle time of 5 minutes and 15 retries at
1-minute intervals (which would mean the connection is considered broken
after 20 minutes of unsuccessful communication attempts), that would be
an acceptable solution.
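
Expressed with the constants this patch defines, that alternative would
look like this (hypothetical values, shown only to illustrate the
arithmetic):

    #define INET_SOCKET_DEFAULT_KEEP_ALIVE          true
    #define INET_SOCKET_DEFAULT_KEEP_ALIVE_IDLE     300 /* 5 min idle      */
    #define INET_SOCKET_DEFAULT_KEEP_ALIVE_INTERVAL  60 /* 1 min per probe */
    #define INET_SOCKET_DEFAULT_KEEP_ALIVE_COUNT     15 /* 15 probes       */
    /* Detection time: 300 + 15 * 60 = 1200 s = 20 minutes from the last
     * successful exchange until the connection is declared broken. */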

Finally, normal TCP packets already have a default system timeout and a
limited number of retransmission attempts, so if a TCP packet cannot be
delivered in 20 minutes (or even less)[1], the whole connection times
out and the migration is paused/cancelled. So, this patch doesn't really
introduce any new or uncommon behavior; it merely tries to make both
scenarios (with dst->src traffic and without dst->src traffic) behave
similarly - both would time out after some time.

[1] Linux man-pages, tcp(7) on TCP timeouts: "Otherwise, failure may
    take up to 20 minutes with the current system defaults in a normal
    WAN environment."
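
For reference, the system-wide defaults being compared against can be
inspected directly; a small sketch using the standard Linux procfs knobs
(typical values in the comments):

    /* Sketch: print the Linux system-wide TCP keep-alive and
     * retransmission defaults referenced in this thread. */
    #include <stdio.h>

    static void show(const char *path)
    {
        FILE *f = fopen(path, "r");
        int val;

        if (f) {
            if (fscanf(f, "%d", &val) == 1) {
                printf("%-42s %d\n", path, val);
            }
            fclose(f);
        }
    }

    int main(void)
    {
        show("/proc/sys/net/ipv4/tcp_keepalive_time");   /* typically 7200 s */
        show("/proc/sys/net/ipv4/tcp_keepalive_intvl");  /* typically 75 s */
        show("/proc/sys/net/ipv4/tcp_keepalive_probes"); /* typically 9 */
        /* the retransmission limit behind the ~20 min figure in tcp(7) */
        show("/proc/sys/net/ipv4/tcp_retries2");         /* typically 15 */
        return 0;
    }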


Best regards,

Juraj Marcin

> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Daniel P. Berrangé 1 week, 3 days ago
On Thu, Sep 18, 2025 at 04:16:56PM +0200, Juraj Marcin wrote:
> If there is no outgoing traffic from the destination side (this can be
> caused, for example, by a workload with no page faults or a paused
> machine), QEMU has no way of knowing whether the connection is still
> working. The TCP stack doesn't treat the absence of incoming traffic as
> a sign of a broken connection. Therefore, QEMU would stay in
> postcopy-active waiting for pages indefinitely.
> 
> Also, libvirt might not be aware of a connection dropout between the
> QEMUs if libvirt's own connection is intact, especially if the libvirt
> daemons communicate through some central entity that manages the
> migration rather than directly. And to do postcopy migration recovery,
> libvirt needs both sides to be in the postcopy-paused state.

Whether keepalive timeouts are at the QEMU level or the global kernel
level, there will always be situations where the timeouts are too
long. Apps/admins can have out-of-band liveness checks between
hosts that detect a problem before the keepalives trigger,
and shouldn't have to wait to recover migration once they have
resolved the underlying network issue.

There needs to be a way to initiate post-copy recovery regardless
of whether we've hit a keepalive timeout. Especially if we can
see one QEMU in postcopy-paused, but not the other side, it
doesn't appear to make sense to block the recovery process.

The virDomainJobCancel command can do a migrate-cancel on the
src, but it didn't look like we could do the same on the dst.
Unless I've overlooked something, Libvirt needs to gain a way
to explicitly force both sides into the postcopy-paused state,
and thus be able to immediately initiate recovery.

> Alternatively, there might also be an issue with the connection between
> the libvirt daemons, but not the migration connection. Even if the
> libvirt connection fails, the migration is not paused; rather, libvirt
> lets the migration finish normally. Similarly, if the libvirt connection
> is broken due to, for example, a libvirt daemon restart, the ongoing
> migration is not paused; after the libvirt daemon starts again, it sees
> an ongoing migration and lets it finish.

While this is a reliability issue for libvirt, it doesn't have
any bearing on migration keepalive timeouts, as we're only concerned
about QEMU connections.

> Additionally, libvirt uses its own internal keep-alive packets with much
> more aggressive timeouts, waiting 5 - 10 seconds idle before sending a
> keep-alive packet and then killing the connection if there is no
> response in 30 seconds.

Yep, this keepalive is very aggressive and has frequently caused
problems with libvirt connections being torn down inappropriately.
We get away with that because most libvirt APIs don't need to have
persistent state over the duration of a connection. The migration
APIs are the area where this isn't true, and the keepalives on
libvirt connections have resulted in us breaking otherwise still
functional migrations. IOW, I wouldn't point to libvirt as an
illustration of keepalives being free of significant downsides.

> I think that if we enable keep-alive in QEMU but make the default
> timeouts longer, for example an idle time of 5 minutes and 15 retries at
> 1-minute intervals (which would mean the connection is considered broken
> after 20 minutes of unsuccessful communication attempts), that would be
> an acceptable solution.

I'm fine with turning on keepalives on the socket, but IMHO the
out of the box behaviour should be to honour the kernel default
tunables unless the admin decides they want different behaviour.
I'm not seeing a rationale for why the kernel defaults should be
forcibly overridden in QEMU out of the box.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Jiří Denemark 1 week, 2 days ago
On Thu, Sep 18, 2025 at 15:45:21 +0100, Daniel P. Berrangé wrote:
> On Thu, Sep 18, 2025 at 04:16:56PM +0200, Juraj Marcin wrote:
> > If there is no outgoing traffic from the destination side (this can be
> > caused, for example, by a workload with no page faults or a paused
> > machine), QEMU has no way of knowing whether the connection is still
> > working. The TCP stack doesn't treat the absence of incoming traffic as
> > a sign of a broken connection. Therefore, QEMU would stay in
> > postcopy-active waiting for pages indefinitely.
> > 
> > Also, libvirt might not be aware of a connection dropout between the
> > QEMUs if libvirt's own connection is intact, especially if the libvirt
> > daemons communicate through some central entity that manages the
> > migration rather than directly. And to do postcopy migration recovery,
> > libvirt needs both sides to be in the postcopy-paused state.
> 
> Whether keepalive timeouts are at the QEMU level or the global kernel
> level, there will always be situations where the timeouts are too
> long. Apps/admins can have out-of-band liveness checks between
> hosts that detect a problem before the keepalives trigger,
> and shouldn't have to wait to recover migration once they have
> resolved the underlying network issue.
> 
> There needs to be a way to initiate post-copy recovery regardless
> of whether we've hit a keepalive timeout. Especially if we can
> see one QEMU in postcopy-paused, but not the other side, it
> doesn't appear to make sense to block the recovery process.
> 
> The virDomainJobCancel command can do a migrate-cancel on the
> src, but it didn't look like we could do the same on the dst.
> Unless I've overlooked something, Libvirt needs to gain a way
> to explicitly force both sides into the postcopy-paused state,
> and thus be able to immediately initiate recovery.
> 
> > Alternatively, there might also be an issue with the connection between
> > the libvirt daemons, but not the migration connection. Even if the
> > libvirt connection fails, the migration is not paused; rather, libvirt
> > lets the migration finish normally. Similarly, if the libvirt connection
> > is broken due to, for example, a libvirt daemon restart, the ongoing
> > migration is not paused; after the libvirt daemon starts again, it sees
> > an ongoing migration and lets it finish.
> 
> While this is a reliability issue for libvirt, it doesn't have
> any bearing on migration keepalive timeouts, as we're only concerned
> about QEMU connections.
> 
> > Additionally, libvirt uses its own internal keep-alive packets with much
> > more aggressive timeouts, waiting 5 - 10 seconds idle before sending a
> > keep-alive packet and then killing the connection if there is no
> > response in 30 seconds.
> 
> Yep, this keepalive is very aggressive and has frequently caused
> problems with libvirt connections being torn down inappropriately.

This was happening when a link was saturated with storage migrations and
the keepalive messages in a separate (and otherwise idle) connection
were not sent in time. We haven't seen any reports for quite some time.
I believe it was identified as a kernel bug a few years ago and then the
reports stopped.

Jirka
Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Peter Xu 1 week, 3 days ago
On Thu, Sep 18, 2025 at 03:45:21PM +0100, Daniel P. Berrangé wrote:
> There needs to be a way to initiate post-copy recovery regardless
> of whether we've hit a keepalive timeout. Especially if we can
> see one QEMU in postcopy-paused, but not the other side, it
> doesn't appear to make sense to block the recovery process.
> 
> The virDomainJobCancel command can do a migrate-cancel on the
> src, but it didn't look like we could do the same on the dst.
> Unless I've overlooked something, Libvirt needs to gain a way
> to explicitly force both sides into the postcopy-paused state,
> and thus be able to immediately initiate recovery.

Right, if libvirt can do that then the problem should be solved too.

> I'm fine with turning on keepalives on the socket, but IMHO the
> out of the box behaviour should be to honour the kernel default
> tunables unless the admin decides they want different behaviour.
> I'm not seeing a rationale for why the kernel defaults should be
> forcibly overridden in QEMU out of the box.

IMHO the rationale here is that this socket is in a special state and
serves a special purpose. So we're not trying to change anything globally
for qemu (without knowing what the socket is), but only this specific type
of socket that is used for either precopy or postcopy live migrations.

It's special because it's always safe to have a more aggressive
disconnection, which might be preferred over very lengthy hangs
(assuming libvirt doesn't yet have a way to stop the hang), especially
for a postcopy phase.

There's also the option of only setting up such keepalive timeouts if a
postcopy phase is expected (or even only once postcopy starts, but maybe
that's slight overkill).  For precopy, a hang isn't a huge issue because
migrate-cancel is always present and functional.
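
A sketch of that option, on top of the helper this patch adds (assuming
the existing migrate_postcopy() capability check; hypothetical, not part
of the posted patch):

    static void migration_address_apply_defaults(MigrationAddress *addr)
    {
        if (!migrate_postcopy()) {
            /* Precopy-only: migrate-cancel always works, so leave the
             * kernel keep-alive defaults untouched. */
            return;
        }
        /* ... apply the migration keep-alive defaults as posted ... */
    }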

Thanks,

-- 
Peter Xu


Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Jiří Denemark 1 week, 2 days ago
On Thu, Sep 18, 2025 at 11:10:49 -0400, Peter Xu wrote:
> On Thu, Sep 18, 2025 at 03:45:21PM +0100, Daniel P. Berrangé wrote:
> > There needs to be a way to initiate post-copy recovery regardless
> > of whether we've hit a keepalive timeout. Especially if we can
> > see one QEMU in postcopy-paused, but not the other side, it
> > doesn't appear to make sense to block the recovery process.
> > 
> > The virDomainJobCancel command can do a migrate-cancel on the
> > src, but it didn't look like we could do the same on the dst.
> > Unless I've overlooked something, Libvirt needs to gain a way
> > to explicitly force both sides into the postcopy-paused state,
> > and thus be able to immediately initiate recovery.
> 
> Right, if libvirt can do that then the problem should be solved too.

I think we should be able to use the yank command to tell QEMU to close
migration connections. I haven't tried it on the destination, but I
guess it should work similarly to the source where it causes the
migration to switch to postcopy-paused. It seems to be an equivalent of
migrate-pause. So can we safely use yank in such situations?
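
For reference, the two QMP invocations being compared would look roughly
like this (a sketch; the exact contract is in QEMU's QAPI schema):

    { "execute": "migrate-pause" }
    { "execute": "yank",
      "arguments": { "instances": [ { "type": "migration" } ] } }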

Jirka
Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Daniel P. Berrangé 1 week, 2 days ago
On Fri, Sep 19, 2025 at 01:59:11PM +0200, Jiří Denemark wrote:
> On Thu, Sep 18, 2025 at 11:10:49 -0400, Peter Xu wrote:
> > On Thu, Sep 18, 2025 at 03:45:21PM +0100, Daniel P. Berrangé wrote:
> > > There needs to be a way to initiate post-copy recovery regardless
> > > of whether we've hit a keepalive timeout. Especially if we can
> > > see one QEMU in postcopy-paused, but not the other side, it
> > > doesn't appear to make sense to block the recovery process.
> > > 
> > > The virDomainJobCancel command can do a migrate-cancel on the
> > > src, but it didn't look like we could do the same on the dst.
> > > Unless I've overlooked something, Libvirt needs to gain a way
> > > to explicitly force both sides into the postcopy-paused state,
> > > and thus be able to immediately initiate recovery.
> > 
> > Right, if libvirt can do that then the problem should be solved too.
> 
> I think we should be able to use the yank command to tell QEMU to close
> migration connections. I haven't tried it on the destination, but I
> guess it should work similarly to the source where it causes the
> migration to switch to postcopy-paused. It seems to be an equivalent of
> migrate-pause. So can we safely use yank in such situations?

Can't we use migrate-pause on the target too?  IIUC that was what Peter
was suggesting earlier in the thread, unless I misinterpreted?

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Jiří Denemark 1 week, 2 days ago
On Fri, Sep 19, 2025 at 13:00:34 +0100, Daniel P. Berrangé wrote:
> On Fri, Sep 19, 2025 at 01:59:11PM +0200, Jiří Denemark wrote:
> > On Thu, Sep 18, 2025 at 11:10:49 -0400, Peter Xu wrote:
> > > On Thu, Sep 18, 2025 at 03:45:21PM +0100, Daniel P. Berrangé wrote:
> > > > There needs to be a way to initiate post-copy recovery regardless
> > > > of whether we've hit a keepalive timeout. Especially if we can
> > > > see one QEMU in postcopy-paused, but not the other side, it
> > > > doesn't appear to make sense to block the recovery process.
> > > > 
> > > > The virDomainJobCancel command can do a migrate-cancel on the
> > > > src, but it didn't look like we could do the same on the dst.
> > > > Unless I've overlooked something, Libvirt needs to gain a way
> > > > to explicitly force both sides into the postcopy-paused state,
> > > > and thus be able to immediately initiate recovery.
> > > 
> > > Right, if libvirt can do that then the problem should be solved too.
> > 
> > I think we should be able to use the yank command to tell QEMU to close
> > migration connections. I haven't tried it on the destination, but I
> > guess it should work similarly to the source where it causes the
> > migration to switch to postcopy-paused. It seems to be an equivalent of
> > migrate-pause. So can we safely use yank in such situations?
> 
> Can't we use migrate-pause on the target too?  IIUC that was what Peter
> was suggesting earlier in the thread, unless I misinterpreted?

Ah ok, I missed that. Somehow I interpreted "Libvirt needs to gain a way
to explicitly force both sides into the postcopy-paused" as "QEMU needs
to allow us to do that" :-)

Jirka


Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Daniel P. Berrangé 1 week, 3 days ago
On Thu, Sep 18, 2025 at 11:10:49AM -0400, Peter Xu wrote:
> On Thu, Sep 18, 2025 at 03:45:21PM +0100, Daniel P. Berrangé wrote:
> > There needs to be a way to initiate post-copy recovery regardless
> > of whether we've hit a keepalive timeout. Especially if we can
> > see one QEMU in postcopy-paused, but not the other side, it
> > doesn't appear to make sense to block the recovery process.
> > 
> > The virDomainJobCancel command can do a migrate-cancel on the
> > src, but it didn't look like we could do the same on the dst.
> > Unless I've overlooked something, Libvirt needs to gain a way
> > to explicitly force both sides into the postcopy-paused state,
> > and thus be able to immediately initiate recovery.
> 
> Right, if libvirt can do that then the problem should be solved too.
> 
> > I'm fine with turning on keepalives on the socket, but IMHO the
> > out of the box behaviour should be to honour the kernel default
> > tunables unless the admin decides they want different behaviour.
> > I'm not seeing a rationale for why the kernel defaults should be
> > forcibly overridden in QEMU out of the box.
> 
> IMHO the rationale here is that this socket is in a special state and
> serves a special purpose. So we're not trying to change anything globally
> for qemu (without knowing what the socket is), but only this specific type
> of socket that is used for either precopy or postcopy live migrations.
> 
> It's special because it's always safe to have a more aggressive
> disconnection, which might be preferred over very lengthy hangs
> (assuming libvirt doesn't yet have a way to stop the hang), especially
> for a postcopy phase.

I've already described up-thread that it is not guaranteed safe to
disconnect aggressively. It is often safe, but there is definitely a
non-negligible risk in prematurely terminating a migration
connection.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH] migration: Apply migration specific keep-alive defaults to inet socket
Posted by Peter Xu 1 week, 5 days ago
On Mon, Sep 15, 2025 at 07:23:27PM +0100, Daniel P. Berrangé wrote:
> On Fri, Sep 12, 2025 at 11:02:12AM -0400, Peter Xu wrote:
> > On Fri, Sep 12, 2025 at 11:20:01AM +0100, Daniel P. Berrangé wrote:
> > > On Wed, Sep 10, 2025 at 12:36:44PM -0400, Peter Xu wrote:
> > > > On Wed, Sep 10, 2025 at 08:10:57AM +0100, Daniel P. Berrangé wrote:
> > > > > On Tue, Sep 09, 2025 at 05:58:49PM -0400, Peter Xu wrote:
> > > > > > On Tue, Sep 09, 2025 at 04:09:23PM +0100, Daniel P. Berrangé wrote:
> > > > > > > On Tue, Sep 09, 2025 at 05:01:24PM +0200, Juraj Marcin wrote:
> > > > > > > > From: Juraj Marcin <jmarcin@redhat.com>
> > > > > > > > 
> > > > > > > > Usual system defaults for TCP keep-alive options are too long for
> > > > > > > > migration workload. On Linux, a TCP connection waits idle for 2 hours
> > > > > > > > before it starts checking if the connection is not broken.
> > > > > > > > 
> > > > > > > > Now when InetSocketAddress supports keep-alive options [1], this patch
> > > > > > > > applies migration specific defaults if they are not supplied by the user
> > > > > > > > or the management software. With these defaults, a migration TCP stream
> > > > > > > > waits idle for 1 minute and then sends 5 TCP keep-alive packets in 30
> > > > > > > > second interval before considering the connection as broken.
> > > > > > > > 
> > > > > > > > System defaults can be still used by explicitly setting these parameters
> > > > > > > > to 0.
> > > > > > > 
> > > > > > > IMHO this is not a good idea. This is a very short default, which
> > > > > > > may be fine for the scenario where your network conn is permanently
> > > > > > > dead, but it is going to cause undesirable failures when the network
> > > > > > > conn is only temporarily dead.
> > > > > > > 
> > > > > > > Optimizing defaults for temporary outages is much more preferable
> > > > > > > as that maximises reliability of migration. In the case of permanent
> > > > > > > outages, it is already possible to tear down the connection without
> > > > > > > waiting for a keep-alive timeout, and liveness checks can also be
> > > > > > > performed by the mgmt app at a higher level too. The TCP keepalives
> > > > > > > are just an eventual failsafe, and having those work on a long
> > > > > > > timeframe is OK.
> > > > > > 
> > > > > > For precopy it looks fine indeed, because migrate_cancel should always work
> > > > > > on src if the src socket hung, and even if the dest QEMU socket hung, it can
> > > > > > simply be killed if the src QEMU can be gracefully cancelled and rolled back
> > > > > > to RUNNING, disregarding the socket status on the dest QEMU.
> > > > > > 
> > > > > > For postcopy, we could still use migrate_pause to enforce src shutdown().
> > > > > > Initially I thought we had no way of doing that for the dest QEMU, but I just
> > > > > > noticed that two years ago I added that to the dest QEMU for migrate_pause
> > > > > > when working on commit f8c543e808f20b.  So it looks like that part is covered
> > > > > > too, so that if the dest QEMU socket hangs we can also kick it out.
> > > > > > 
> > > > > > I'm not 100% sure though, on whether shutdown() would always be able to
> > > > > > successfully kick out the hung socket while the keepalive is ticking.  Is
> > > > > > it guaranteed?
> > > > > 
> > > > > I don't know about shutdown(), but close() certainly works. If shutdown()
> > > > > is not sufficient, then IMHO the migration code would need the ability to
> > > > > use close() to deal with this situation.
> > > > > 
> > > > > 
> > > > > > I also am not sure, if that happens, whether libvirt would automatically do
> > > > > > that, or provide some way so the user can trigger it.  The goal IIUC here
> > > > > > is that we shouldn't put the user into a situation where the migration hangs
> > > > > > without any way to either cancel or recover.  With the default values Juraj
> > > > > > provided here, it makes sure the hang won't last more than a few minutes,
> > > > > > which sounds like a sane timeout value.
> > > > > 
> > > > > Sufficient migration QMP commands should exist to ensure migration can
> > > > > always be cancelled. Short keepalive timeouts should not be considered
> > > > > a solution to any gaps in that respect.
> > > > > 
> > > > > Also there is yank, but IMHO apps shouldn't have to rely on yank - I see
> > > > > yank as a safety net for apps to work around limitations in QEMU.
> > > > 
> > > > The QMP facilities look to be all present, namely migrate-cancel and
> > > > migrate-pause mentioned above.
> > > > 
> > > > For migrate_cancel (of precopy), is that Ctrl-C of "virsh migrate"?
> > > > 
> > > > Does libvirt expose migrate_pause via any virsh command?  IIUC that's the
> > > > only official way of pausing a postcopy VM on either side.  I also agree we
> > > > shouldn't make yank the official tool to use.
> > > 
> > > virsh will call virDomainAbortJob when Ctrl-C is done to a 'migrate'
> > > command.
> > > 
> > > virDomainAbortJob will call migrate-cancel for pre-copy, or
> > > 'migrate-pause' for post-copy.
> > 
> > Would it call "migrate-pause" on both sides?
> 
> Not 100% sure, but with virDomainAbortJob I think libvirt only calls
> migrate-pause on the source host.
> 
> > I believe the problem we hit was that, during postcopy when the NIC was
> > malfunctioning, src fell into postcopy-paused successfully but dest didn't,
> > getting stuck in postcopy-active.
> 
> If something has interrupted src<->dst host comms for QEMU it may well
> impact libvirt <-> libvirt comms too, unless migration was being done
> over a separate NIC from the mgmt LAN.  IOW, it may be impossible for

Do you know what the common setup is?  Do they normally share the network
or not?

> libvirt to call migrate-pause on both sides, at least not until the
> NIC problem has been resolved.
> 
> > We'll want to make sure both sides are kicked into the paused state to
> > recover.  Otherwise dest can hang in that state for hours until the watchdog
> > timeout triggers.
> 
> Once the network problem has been resolved, then it ought to be possible
> to get libvirt to issue 'migrate-pause' on both hosts, and thus be able
> to recover.
> 
> Possibly the act of starting migration recovery in libvirt should attempt
> to issue 'migrate-pause' to clean up the previously running migration if
> it is still in the stuck state.

Yes, we can do that.  But normally libvirt will need to monitor the
migration process; migrate-pause won't be initiated until it thinks the
migration is interrupted.  We'll need to check whether it only relies on
the src QEMU status, or on both sides.

If we do that, libvirt will need to monitor the migration status on both
sides; no matter which side fell into postcopy-paused, libvirt should
kick out the other side.  That sounds OK.

That said, maybe we should still provide an explicit way to pause
migration on both sides.  For example, I'm not sure whether the timeout
would fire on both sides at the same time; both sides could still be
postcopy-active while in reality the network is already unplugged.  In
that case we'd want libvirt to provide a "virsh" command to pause both
sides, at least when the mgmt network is still working and the user
knows it's stuck (instead of waiting for hours)?

> 
> > 
> > > 
> > > 
> > > > OTOH, the default timeouts work without changing libvirt, making sure the
> > > > customers are not stuck on a likely-failing network for hours with no
> > > > way to properly detach and recover when they want to.
> > > 
> > > "timeouts work" has the implicit assumption that the only reason a
> > > timeout will fire is due to an unrecoverable situation. IMHO that
> > > assumption is not valid.
> > 
> > I agree adjusting the timeout is not the best approach.
> > 
> > If we can have a solid way to kick both sides out, then indeed we don't
> > need to change the timeout.
> > 
> > If not, we may still need to provide a way for the user to try to
> > continue when they find the network behaving abnormally.
> > 
> > Here, adjusting the migration timeout is slightly better than any ad-hoc
> > socket timeout we might tune: it's the migration timeout, and we only have
> > two cases: (1) precopy, which is OK to fail and retry, and (2) postcopy,
> > which is also OK to fail and recover.
> 
> Fail & retry/recover is not without cost / risk though. Users can have
> successful migrations that take many hours when dealing with big
> VMs. IOW, returning to the start of pre-copy could be a non-trivial
> time delay.
> 
> Consider if the reason for the migration is to evacuate workloads off
> a host that is suffering technical problems. It could well be that
> periodic unexpected network outages are what is triggering the need
> to evacuate workloads. If we time out a migration with keepalives, it
> may never get through a migration op quickly enough, or it can be
> delayed such that the host hits a fatal error, losing the workload
> before the retried migration is complete.
> 
> IMHO, once a migration has been started we should not proactively
> interrupt it with things like keepalives, unless the admin made a
> conscious decision they wanted that behaviour enabled.

For postcopy, we don't have such a cost.  Postcopy pause keeps the data
both QEMUs have sent / received, then on recover they handshake once
more to synchronize the states between the two sides.  They'll continue
the migration from the previous state, not retransferring everything but
only the diff.

In this case, it's normally the dest QEMU that has less data; IOW, it's
possible something "sent" from the src QEMU didn't get "received" on the
destination but was lost on the wire.  Hence what we do is synchronize the
received bitmap from dest to src so the src QEMU knows what to re-transmit.
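
As a conceptual sketch of that bitmap synchronization (illustrative
only, not QEMU's actual data structures):

    /* src side, after the recovery handshake: dest has reported which
     * pages it actually received; anything unconfirmed must be
     * (re)transmitted, since a page "sent" by src may have been lost on
     * the wire when the connection broke. */
    #include <stdbool.h>
    #include <stddef.h>

    static void sync_received_bitmap(const bool *dest_received,
                                     bool *src_to_send, size_t npages)
    {
        for (size_t i = 0; i < npages; i++) {
            src_to_send[i] = !dest_received[i];
        }
    }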

Thanks,

-- 
Peter Xu