[PATCH RFC 0/2] migration: introduce strict SLA

Elena Ufimtseva posted 2 patches 5 months ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20240621143221.198784-1-elena.ufimtseva@oracle.com
Maintainers: Eduardo Habkost <eduardo@habkost.net>, Marcel Apfelbaum <marcel.apfelbaum@gmail.com>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, Yanan Wang <wangyanan55@huawei.com>, Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>, Eric Blake <eblake@redhat.com>, Markus Armbruster <armbru@redhat.com>
Hello

This RFC patchset introduces a strict downtime SLA for live migration by
restricting how long the switchover phase can take, and aborts the live
migration if this limit is exceeded.

Various consumers of VFIO live migration are bound by checks on how long
the switchover process lasts. Some contributions to that time are not
accounted for and are unbounded, such as:
  - Time to quiesce/resume the VF
  - Time to save/resume all system state
  - How fast we can save/restore VF state

These cases lead to the final downtime being larger than what was
configured by setting a downtime limit.
In some applications it is important to observe the requested downtime
and re-try live migration some other time if the downtime requirements
cannot be satisfied.

This patchset introduces a capability to abort live migration if
the downtime exceeds a certain value specified by the switchover-limit
migration parameter.
When the guest stops at the source, we measure the downtime, and if
it exceeds the threshold we cancel the migration and resume the guest.
The destination is notified of the source downtime and its threshold
and starts measuring downtime as well. The destination will cancel the
live migration if the downtime exceeds the switchover limit.

For example, a migration with this capability would be set up this way:

migrate_set_capability return-path on
migrate_set_capability switchover-abort on
migrate_set_parameter downtime-limit 300
migrate_set_parameter switchover-limit 10

The migration will be aborted if the downtime exceeds the downtime-limit
by more than 10ms (switchover-limit), so the total downtime would not
be more than 310ms.
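
To make the arithmetic concrete, the check could be sketched in C roughly
as below. This is only an illustration and not code from the patches: the
SwitchoverBudget struct and both function names are hypothetical, and it
assumes the abort threshold is downtime-limit plus switchover-limit, which
is how the 310ms figure in the example is read here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two parameters, both in milliseconds. */
typedef struct {
    int64_t downtime_limit_ms;   /* value of downtime-limit */
    int64_t switchover_limit_ms; /* value of switchover-limit */
} SwitchoverBudget;

/*
 * Assumption: the abort threshold is the sum of both parameters,
 * e.g. 300 + 10 = 310ms for the HMP example.
 */
static int64_t switchover_threshold_ms(const SwitchoverBudget *b)
{
    return b->downtime_limit_ms + b->switchover_limit_ms;
}

/*
 * Returns true when the downtime measured since the guest was stopped
 * (stop_ms) is above the threshold at time now_ms, i.e. when the
 * migration would be cancelled and the guest resumed on the source.
 * Both timestamps are expected to come from the same monotonic clock.
 */
static bool switchover_exceeded(const SwitchoverBudget *b,
                                int64_t stop_ms, int64_t now_ms)
{
    assert(now_ms >= stop_ms);
    return (now_ms - stop_ms) > switchover_threshold_ms(b);
}
```

With downtime-limit 300 and switchover-limit 10, a guest stopped at
t=1000ms would still be within budget at t=1310ms and would trigger an
abort at t=1311ms under this reading.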

Please send your comments and recommendations.

The patchset idea originally comes from Joao Martins
<joao.m.martins@oracle.com>.


Elena Ufimtseva (2):
  migration: abort when switchover limit exceeded
  migration: abort on destination if switchover limit exceeded

 hw/core/machine.c                  |  1 +
 include/migration/client-options.h |  1 +
 migration/migration-hmp-cmds.c     | 10 ++++
 migration/migration.c              | 41 +++++++++++++++
 migration/migration.h              | 20 ++++++++
 migration/options.c                | 56 +++++++++++++++++++++
 migration/options.h                |  1 +
 migration/savevm.c                 | 81 ++++++++++++++++++++++++++++++
 migration/savevm.h                 |  2 +
 migration/trace-events             |  3 ++
 qapi/migration.json                | 27 ++++++++--
 11 files changed, 239 insertions(+), 4 deletions(-)

-- 
2.34.1