On 26/04/22 5:08 am, Peter Xu wrote:
> This is v5 of postcopy preempt series. It can also be found here:
>
> https://github.com/xzpeter/qemu/tree/postcopy-preempt
>
> RFC: https://lore.kernel.org/qemu-devel/20220119080929.39485-1-peterx@redhat.com
> V1: https://lore.kernel.org/qemu-devel/20220216062809.57179-1-peterx@redhat.com
> V2: https://lore.kernel.org/qemu-devel/20220301083925.33483-1-peterx@redhat.com
> V3: https://lore.kernel.org/qemu-devel/20220330213908.26608-1-peterx@redhat.com
> V4: https://lore.kernel.org/qemu-devel/20220331150857.74406-1-peterx@redhat.com
>
> v4->v5 changelog:
> - Fixed all checkpatch.pl warnings
> - Picked up leftover patches from Dan's tls test case series:
> https://lore.kernel.org/qemu-devel/20220310171821.3724080-1-berrange@redhat.com/
> - Rebased to v7.0.0 tag, collected more R-bs from Dave/Dan
> - In migrate_fd_cleanup(), use g_clear_pointer() for s->hostname [Dan]
> - Mark postcopy-preempt capability for 7.1 not 7.0 [Dan]
> - Moved migrate_channel_requires_tls() into tls.[ch] [Dan]
> - Mention the bug-fixing side effect of patch "migration: Export
> tls-[creds|hostname|authz] params to cmdline too" on tls_authz [Dan]
> - Use g_autoptr where proper [Dan]
> - Drop a few (probably over-cautious) asserts on local_err being set [Dan]
> - Quite a few renamings in the qtest in the last few test patches [Dan]
>
> Abstract
> ========
>
> This series contains two parts now:
>
> (1) Leftover patches from Dan's tls unit tests v2, which is first half
> (2) Leftover patches from my postcopy preempt v4, which is second half
>
> This series added a new migration capability called "postcopy-preempt". It can
> be enabled when postcopy is enabled, and it'll simply (but greatly) speed up
> postcopy page requests handling process.
>
> Below are some initial postcopy page request latency measurements after the
> new series applied.
>
> For each page size, I measured page request latency for three cases:
>
> (a) Vanilla: the old postcopy
> (b) Preempt no-break-huge: preempt enabled, x-postcopy-preempt-break-huge=off
> (c) Preempt full: preempt enabled, x-postcopy-preempt-break-huge=on
> (this is the default option when preempt enabled)
>
> Here x-postcopy-preempt-break-huge parameter is just added in v2 so as to
> conditionally disable the behavior to break sending a precopy huge page for
> debugging purpose. So when it's off, postcopy will not preempt precopy
> sending a huge page, but still postcopy will use its own channel.
>
> I tested it separately to give a rough idea on which part of the change
> helped how much of it. The overall benefit should be the comparison
> between case (a) and (c).
>
> |-----------+---------+-----------------------+--------------|
> | Page size | Vanilla | Preempt no-break-huge | Preempt full |
> |-----------+---------+-----------------------+--------------|
> | 4K        |   10.68 |               N/A [*] |         0.57 |
> | 2M        |   10.58 |                  5.49 |         5.02 |
> | 1G        | 2046.65 |               933.185 |      649.445 |
> |-----------+---------+-----------------------+--------------|
> [*]: This case is N/A because 4K page does not contain huge page at all
>
> [1] https://github.com/xzpeter/small-stuffs/blob/master/tools/huge_vm/uffd-latency.bpf

Hi Peter, I would like to understand the setup used for these experiments
(number of vCPUs, workload, network bandwidth) so that I can make sense of
these numbers. I also could not understand why there is such a large
difference between Preempt full and Preempt no-break-huge, especially in the
1G case. Could you please share a few more details on this?

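For reference, here are the relative improvements I get from the table above.
This is only a rough sketch in Python: the numbers are copied from the table
as-is (no unit is stated there, so I only compare ratios), and the dictionary
keys and helper names are my own, not anything from the series.

```python
# Latency figures copied from the table in the cover letter.
# The unit is whatever the table uses; only ratios are computed here.
# Key names ("vanilla", "no_break_huge", "full") are illustrative labels.
LATENCY = {
    "4K": {"vanilla": 10.68, "no_break_huge": None, "full": 0.57},
    "2M": {"vanilla": 10.58, "no_break_huge": 5.49, "full": 5.02},
    "1G": {"vanilla": 2046.65, "no_break_huge": 933.185, "full": 649.445},
}


def ratio(before, after):
    """Return before / after, or None when either value is N/A."""
    if before is None or after is None:
        return None
    return before / after


def speedups(latency=LATENCY):
    """Compute per-page-size latency ratios between the three cases."""
    result = {}
    for page_size, row in latency.items():
        result[page_size] = {
            "no_break_huge_vs_vanilla": ratio(row["vanilla"], row["no_break_huge"]),
            "full_vs_vanilla": ratio(row["vanilla"], row["full"]),
            "full_vs_no_break_huge": ratio(row["no_break_huge"], row["full"]),
        }
    return result


def fmt(value):
    """Format a ratio for printing, keeping N/A entries readable."""
    return "N/A" if value is None else f"{value:.2f}x"


for page_size, row in speedups().items():
    print(page_size, {name: fmt(value) for name, value in row.items()})
```

By this arithmetic, for 1G pages no-break-huge is already about 2.2x better
than vanilla, and breaking the huge page gives roughly another 1.4x on top,
while for 2M pages the extra gain is only about 1.1x.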
>
> TODO List
> =========
>
> Avoid precopy write() blocking postcopy
> ---------------------------------------
>
> I didn't prove this, but I always think the write() syscalls being blocked
> for precopy pages can affect postcopy services. If we can solve this
> problem then my wild guess is we can further reduce the average page
> latency.
>
> At least two solutions come to mind: (1) we could make the write side of
> the migration channel NON_BLOCK too, or (2) use multiple threads on the send
> side, just like multifd, but we may use a lock to protect which page to send
> too (e.g., the core idea is we should _never_ rely on the main thread for
> anything; multifd has that dependency on queuing pages only on main thread).
>
> That can definitely be done and thought about later.
>
> Multi-channel for preemption threads
> ------------------------------------
>
> Currently the postcopy preempt feature uses only one extra channel and one
> extra thread on dest (no new thread on src QEMU). It should be mostly good
> enough for major use cases, but when the postcopy queue is long enough
> (e.g. hundreds of vCPUs faulted on different pages) logically we could
> still observe more delays on average. Whether growing threads/channels can
> solve it is debatable, but it sounds worth a try. That's yet another
> thing we can think about after this patchset lands.
>
> Logically the design provides space for that - the receiving postcopy
> preempt thread can understand all ram-layer migration protocol, and for
> multi channel and multi threads we could simply grow that into multiple
> threads handling the same protocol (with multiple PostcopyTmpPage). The
> source needs more thoughts on synchronizations, though, but it shouldn't
> affect the whole protocol layer, so should be easy to keep compatible.
>
> Please review, thanks.
>
> Daniel P. Berrangé (9):
> tests: fix encoding of IP addresses in x509 certs
> tests: add more helper macros for creating TLS x509 certs
> tests: add migration tests of TLS with PSK credentials
> tests: add migration tests of TLS with x509 credentials
> tests: convert XBZRLE migration test to use common helper
> tests: convert multifd migration tests to use common helper
> tests: add multifd migration tests of TLS with PSK credentials
> tests: add multifd migration tests of TLS with x509 credentials
> tests: ensure migration status isn't reported as failed
>
> Peter Xu (12):
> migration: Add postcopy-preempt capability
> migration: Postcopy preemption preparation on channel creation
> migration: Postcopy preemption enablement
> migration: Postcopy recover with preempt enabled
> migration: Create the postcopy preempt channel asynchronously
> migration: Parameter x-postcopy-preempt-break-huge
> migration: Add helpers to detect TLS capability
> migration: Export tls-[creds|hostname|authz] params to cmdline too
> migration: Enable TLS for preempt channel
> tests: Add postcopy tls migration test
> tests: Add postcopy tls recovery migration test
> tests: Add postcopy preempt tests
>
> meson.build | 1 +
> migration/channel.c | 10 +-
> migration/migration.c | 146 +++-
> migration/migration.h | 46 +-
> migration/multifd.c | 7 +-
> migration/postcopy-ram.c | 186 ++++-
> migration/postcopy-ram.h | 11 +
> migration/qemu-file.c | 27 +
> migration/qemu-file.h | 1 +
> migration/ram.c | 283 +++++++-
> migration/ram.h | 4 +-
> migration/savevm.c | 46 +-
> migration/socket.c | 22 +-
> migration/socket.h | 1 +
> migration/tls.c | 9 +
> migration/tls.h | 4 +
> migration/trace-events | 15 +-
> qapi/migration.json | 8 +-
> tests/qtest/meson.build | 12 +-
> tests/qtest/migration-helpers.c | 13 +
> tests/qtest/migration-helpers.h | 1 +
> tests/qtest/migration-test.c | 997 ++++++++++++++++++++++++---
> tests/unit/crypto-tls-psk-helpers.c | 18 +-
> tests/unit/crypto-tls-psk-helpers.h | 1 +
> tests/unit/crypto-tls-x509-helpers.c | 16 +-
> tests/unit/crypto-tls-x509-helpers.h | 53 ++
> tests/unit/test-crypto-tlssession.c | 11 +-
> 27 files changed, 1779 insertions(+), 170 deletions(-)
>