[PATCH RFC 00/15] migration: Postcopy Preemption

Peter Xu posted 15 patches 2 years, 2 months ago
Test checkpatch passed
Patches applied successfully
git fetch https://github.com/patchew-project/qemu tags/patchew/20220119080929.39485-1-peterx@redhat.com
Maintainers: Juan Quintela <quintela@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, "Dr. David Alan Gilbert" <dgilbert@redhat.com>, Thomas Huth <thuth@redhat.com>, Markus Armbruster <armbru@redhat.com>, Eric Blake <eblake@redhat.com>, Laurent Vivier <lvivier@redhat.com>
There is a newer version of this series
[PATCH RFC 00/15] migration: Postcopy Preemption
Posted by Peter Xu 2 years, 2 months ago
Based-on: <20211224065000.97572-1-peterx@redhat.com>

Human version - This patchset is based on:
  https://lore.kernel.org/qemu-devel/20211224065000.97572-1-peterx@redhat.com/

This series can also be found here:
  https://github.com/xzpeter/qemu/tree/postcopy-preempt

Abstract
========

This series adds a new migration capability called "postcopy-preempt".  It can
be enabled when postcopy is enabled, and it greatly speeds up the handling of
postcopy page requests.

Some quick tests below measuring postcopy page request latency:

  - Guest config: 20G guest, 40 vcpus
  - Host config: 10Gbps host NIC attached between src/dst
  - Workload: one busy dirty thread, writing to 18G of memory (pre-faulted).
    (this corresponds to the "2M/4K huge page, 1 dirty thread" tests below)
  - Script: see [1]

  |----------------+--------------+-----------------------|
  | Host page size | Vanilla (ms) | Postcopy Preempt (ms) |
  |----------------+--------------+-----------------------|
  | 2M             |        10.58 |                  4.96 |
  | 4K             |        10.68 |                  0.57 |
  |----------------+--------------+-----------------------|

For 2M pages, latency drops by roughly 2x (10.58ms -> 4.96ms); for 4K pages,
by roughly 18x (10.68ms -> 0.57ms).

For more information on the testing, please refer to "Test Results" below.

Design
======

The postcopy-preempt feature contains two major reworks of postcopy page fault
handling:

    (1) Postcopy requests are now sent via a separate socket from the precopy
        background migration stream, so that they are isolated from the very
        high page request delays that sharing the stream can introduce.

    (2) On hosts with huge pages enabled, postcopy requests can now preempt
        the partial sending of a huge host page on the source QEMU (a rough
        sketch of the channel routing follows below).
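
To make rework (1) concrete, below is a minimal, hypothetical C sketch of the
routing decision on the source side.  It is not the actual QEMU code: the
names (Channel, PageState, pick_channel) are invented for illustration; the
real series tracks the equivalent state with a pss.postcopy_requested flag
(patch 11).

  #include <stdbool.h>
  #include <stdio.h>

  /* Hypothetical sketch only -- not the QEMU implementation.  It shows the
   * core routing idea of postcopy preemption: urgent, destination-requested
   * pages bypass the (possibly congested) precopy stream. */
  typedef enum {
      CH_PRECOPY,   /* original background migration stream */
      CH_POSTCOPY,  /* dedicated channel for urgent page requests */
  } Channel;

  typedef struct {
      bool postcopy_requested;  /* came from the destination's fault queue */
  } PageState;

  static Channel pick_channel(const PageState *p, bool postcopy_active,
                              bool preempt_enabled)
  {
      if (postcopy_active && preempt_enabled && p->postcopy_requested) {
          return CH_POSTCOPY;
      }
      return CH_PRECOPY;
  }

  int main(void)
  {
      PageState urgent = { .postcopy_requested = true };
      PageState bg = { .postcopy_requested = false };

      printf("urgent: ch %d\n", pick_channel(&urgent, true, true));
      printf("bg:     ch %d\n", pick_channel(&bg, true, true));
      return 0;
  }

Rework (2) extends the same idea to huge host pages: if a request arrives
while a huge host page is still being sent in smaller target-page chunks, the
remaining chunks can be set aside so that the requested page goes out first.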

The design is relatively straightforward; however, there are a number of
implementation details that the patchset needs to address.  Many of them are
handled as separate patches, and the rest mostly in the big patch that enables
the whole feature.

Postcopy recovery is not yet supported; it will be added after the approach
has received some initial review.

Patch layout
============

The first 10 (out of 15) patches are mostly suitable to be merged even without
the new feature, so they can be reviewed earlier.

Patches 11-14 implement the new feature: patches 11-13 are mostly small
preparation patches, and the major change is in patch 14.

Patch 15 is a unit test.

Test Results
============

I measured page request latency by trapping userfaultfd kernel faults with the
bpf script [1].  I ignored KVM fast page faults, because when one happens no
major/real page fault is needed, in other words no query to the source QEMU.

The numbers (and histograms) below were captured across whole postcopy
migration runs, sampled with different configurations, from which the average
page request latency was calculated.  I also captured the latency
distributions, which are interesting to look at as well.
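
As a reading aid: the @delay_us histograms below use bpftrace-style
power-of-two buckets.  A tiny illustrative C sketch of that bucketing (the
helper name is made up):

  #include <stdio.h>

  /* Map a latency in microseconds to a power-of-two bucket index: bucket k
   * covers [2^k, 2^(k+1)).  Purely illustrative. */
  static int delay_bucket(unsigned long us)
  {
      int k = 0;
      while (us >>= 1) {
          k++;
      }
      return k;
  }

  int main(void)
  {
      /* 10582us (the vanilla 2M average) lands in the [8K, 16K) bucket. */
      printf("bucket(10582) = %d\n", delay_bucket(10582));
      return 0;
  }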

One thing to mention is that I did not test 1G pages.  That does not mean the
series won't help 1G - I believe it will help at least as much as the cases
tested here - it's just that with 1G huge pages the latency will be >1 sec on a
10Gbps NIC, so it's not really a usable scenario for any sensible customer.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2M huge page, 1 dirty thread
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With vanilla postcopy:

Average: 10582 (us)

@delay_us:
[1K, 2K)               7 |                                                    |
[2K, 4K)               1 |                                                    |
[4K, 8K)               9 |                                                    |
[8K, 16K)           1983 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

With postcopy-preempt:

Average: 4960 (us)

@delay_us:
[1K, 2K)               5 |                                                    |
[2K, 4K)              44 |                                                    |
[4K, 8K)            3495 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)            154 |@@                                                  |
[16K, 32K)             1 |                                                    |

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4K small page, 1 dirty thread
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With vanilla postcopy:

Average: 10676 (us)

@delay_us:
[4, 8)                 1 |                                                    |
[8, 16)                3 |                                                    |
[16, 32)               5 |                                                    |
[32, 64)               3 |                                                    |
[64, 128)             12 |                                                    |
[128, 256)            10 |                                                    |
[256, 512)            27 |                                                    |
[512, 1K)              5 |                                                    |
[1K, 2K)              11 |                                                    |
[2K, 4K)              17 |                                                    |
[4K, 8K)              10 |                                                    |
[8K, 16K)           2681 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)             6 |                                                    |

With postcopy preempt:

Average: 570 (us)

@delay_us:
[16, 32)               5 |                                                    |
[32, 64)               6 |                                                    |
[64, 128)           8340 |@@@@@@@@@@@@@@@@@@                                  |
[128, 256)         23052 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)          8119 |@@@@@@@@@@@@@@@@@@                                  |
[512, 1K)            148 |                                                    |
[1K, 2K)             759 |@                                                   |
[2K, 4K)            6729 |@@@@@@@@@@@@@@@                                     |
[4K, 8K)              80 |                                                    |
[8K, 16K)            115 |                                                    |
[16K, 32K)            32 |                                                    |

One curious thing about 4K small pages is that with vanilla postcopy I did not
even get a speedup compared to 2M pages, probably because the major overhead is
not sending the page itself, but other things (e.g. waiting for precopy to
flush the existing pages).

The other thing is that in the postcopy preempt test I can still see a bunch
of 2ms-4ms latency page requests.  That is probably what we would like to dig
into next.  One possibility is that since the sending thread on the source QEMU
is shared, it may have yielded because the precopy socket was full.  But that
is TBD.
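
To illustrate that hypothesis, here is an entirely hypothetical C sketch
(made-up names, not the series' code) of a single thread servicing both
channels; the point is only that a blocking precopy write delays any urgent
request that arrives in the meantime:

  #include <stdbool.h>
  #include <stdio.h>

  enum { CH_PRECOPY, CH_POSTCOPY };

  static bool urgent_pending;  /* conceptually set by the fault handler */

  static void send_page(int channel, int page)
  {
      /* In reality this write can block when the socket buffer is full;
       * while it blocks, an urgent request has to wait for the next
       * iteration -- one plausible source of the 2ms-4ms outliers above. */
      printf("channel %d <- page %d\n", channel, page);
  }

  static void sender_iteration(int bg_page, int requested_page)
  {
      if (urgent_pending) {
          send_page(CH_POSTCOPY, requested_page);
          urgent_pending = false;
      } else {
          send_page(CH_PRECOPY, bg_page);
      }
  }

  int main(void)
  {
      sender_iteration(100, -1);  /* no request pending: background page */
      urgent_pending = true;
      sender_iteration(101, 7);   /* request pending: urgent page first */
      return 0;
  }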

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4K small page, 16 dirty threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As an extra test I used 16 concurrent faulting threads (16 workers of 1GiB
each), in which case the postcopy queue can get relatively long.  It was done
via:

  $ stress -m 16 --vm-bytes 1073741824 --vm-keep

With vanilla postcopy:

Average: 2244 (us)

@delay_us:
[0]                  556 |                                                    |
[1]                11251 |@@@@@@@@@@@@                                        |
[2, 4)             12094 |@@@@@@@@@@@@@                                       |
[4, 8)             12234 |@@@@@@@@@@@@@                                       |
[8, 16)            47144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)           42281 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[32, 64)           17676 |@@@@@@@@@@@@@@@@@@@                                 |
[64, 128)            952 |@                                                   |
[128, 256)           405 |                                                    |
[256, 512)           779 |                                                    |
[512, 1K)           1003 |@                                                   |
[1K, 2K)            1976 |@@                                                  |
[2K, 4K)            4865 |@@@@@                                               |
[4K, 8K)            5892 |@@@@@@                                              |
[8K, 16K)          26941 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[16K, 32K)           844 |                                                    |
[32K, 64K)            17 |                                                    |

With postcopy preempt:

Average: 1064 (us)

@delay_us:
[0]                 1341 |                                                    |
[1]                30211 |@@@@@@@@@@@@                                        |
[2, 4)             32934 |@@@@@@@@@@@@@                                       |
[4, 8)             21295 |@@@@@@@@                                            |
[8, 16)           130774 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)           95128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
[32, 64)           49591 |@@@@@@@@@@@@@@@@@@@                                 |
[64, 128)           3921 |@                                                   |
[128, 256)          1066 |                                                    |
[256, 512)          2730 |@                                                   |
[512, 1K)           1849 |                                                    |
[1K, 2K)             512 |                                                    |
[2K, 4K)            2355 |                                                    |
[4K, 8K)           48812 |@@@@@@@@@@@@@@@@@@@                                 |
[8K, 16K)          10026 |@@@                                                 |
[16K, 32K)           810 |                                                    |
[32K, 64K)            68 |                                                    |

In this specific case, a curious thing is that when there are tons of postcopy
requests, vanilla postcopy handles page requests even faster on average (2ms)
than when there is only 1 dirty thread.  That is probably because
unqueue_page() will always hit anyway, so precopy streaming has less effect on
postcopy.  However, that is still slower than having the standalone postcopy
stream of the preempt version (1ms).

Any comments are welcome.

[1] https://github.com/xzpeter/small-stuffs/blob/master/tools/huge_vm/uffd-latency.bpf

Peter Xu (15):
  migration: No off-by-one for pss->page update in host page size
  migration: Allow pss->page jump over clean pages
  migration: Enable UFFD_FEATURE_THREAD_ID even without blocktime feat
  migration: Add postcopy_has_request()
  migration: Simplify unqueue_page()
  migration: Move temp page setup and cleanup into separate functions
  migration: Introduce postcopy channels on dest node
  migration: Dump ramblock and offset too when non-same-page detected
  migration: Add postcopy_thread_create()
  migration: Move static var in ram_block_from_stream() into global
  migration: Add pss.postcopy_requested status
  migration: Move migrate_allow_multifd and helpers into migration.c
  migration: Add postcopy-preempt capability
  migration: Postcopy preemption on separate channel
  tests: Add postcopy preempt test

 migration/migration.c        | 107 +++++++--
 migration/migration.h        |  55 ++++-
 migration/multifd.c          |  19 +-
 migration/multifd.h          |   2 -
 migration/postcopy-ram.c     | 192 ++++++++++++----
 migration/postcopy-ram.h     |  14 ++
 migration/ram.c              | 417 ++++++++++++++++++++++++++++-------
 migration/ram.h              |   2 +
 migration/savevm.c           |  12 +-
 migration/socket.c           |  18 ++
 migration/socket.h           |   1 +
 migration/trace-events       |  12 +-
 qapi/migration.json          |   8 +-
 tests/qtest/migration-test.c |  21 ++
 14 files changed, 716 insertions(+), 164 deletions(-)

-- 
2.32.0


Re: [PATCH RFC 00/15] migration: Postcopy Preemption
Posted by Dr. David Alan Gilbert 2 years, 2 months ago
* Peter Xu (peterx@redhat.com) wrote:
> Some quick tests below measuring postcopy page request latency:
> 
>   |----------------+--------------+-----------------------|
>   | Host page size | Vanilla (ms) | Postcopy Preempt (ms) |
>   |----------------+--------------+-----------------------|
>   | 2M             |        10.58 |                  4.96 |
>   | 4K             |        10.68 |                  0.57 |
>   |----------------+--------------+-----------------------|

Nice speedups.

> One curious thing about 4K small pages is that with vanilla postcopy I did not
> even get a speedup compared to 2M pages, probably because the major overhead is
> not sending the page itself, but other things (e.g. waiting for precopy to
> flush the existing pages).
> 
> The other thing is that in the postcopy preempt test I can still see a bunch
> of 2ms-4ms latency page requests.  That is probably what we would like to dig
> into next.  One possibility is that since the sending thread on the source QEMU
> is shared, it may have yielded because the precopy socket was full.  But that
> is TBD.

I guess those could be pages queued behind others; or maybe something like a
page that starts getting sent on the main socket, is then interrupted by
another request, and then the original page is wanted?

> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 4K small page, 16 dirty threads
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> In this specific case, a curious thing is that when there are tons of postcopy
> requests, vanilla postcopy handles page requests even faster on average (2ms)
> than when there is only 1 dirty thread.  That is probably because
> unqueue_page() will always hit anyway, so precopy streaming has less effect on
> postcopy.  However, that is still slower than having the standalone postcopy
> stream of the preempt version (1ms).

Curious.

Dave

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK