[PATCH v4 00/10] migration: introduce dirtylimit capability

huangy81@chinatelecom.cn posted 10 patches 1 year, 2 months ago
git fetch https://github.com/patchew-project/qemu tags/patchew/cover.1676563222.git.huangy81@chinatelecom.cn
Maintainers: Paolo Bonzini <pbonzini@redhat.com>, Juan Quintela <quintela@redhat.com>, "Dr. David Alan Gilbert" <dgilbert@redhat.com>, Eric Blake <eblake@redhat.com>, Markus Armbruster <armbru@redhat.com>, Thomas Huth <thuth@redhat.com>, Laurent Vivier <lvivier@redhat.com>
There is a newer version of this series
[PATCH v4 00/10] migration: introduce dirtylimit capability
Posted by huangy81@chinatelecom.cn 1 year, 2 months ago
From: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>

v4:
1. Polish the docs and update the release version, as suggested by Markus.
2. Rename the info exported by query-migrate from
   "dirty-limit-throttle-time-per-round" to
   "dirty-limit-throttle-time-per-full" (a query-migrate sketch follows).

The following 5 commits haven't been acked or reviewed yet:

kvm: dirty-ring: Fix race with vcpu creation
qapi/migration: Introduce x-vcpu-dirty-limit-period parameter
migration: Implement dirty-limit convergence algo
migration: Extend query-migrate to provide dirty page limit info
tests: Add migration dirty-limit capability test

Ping David and Juan. 

Please review if you have time. Thanks. 

Yong

v3 (resend):
- Fix a syntax error in the subject line.

v3:
This version makes some modifications inspired by Peter and Markus,
as follows:
1. Do the code cleanup in [PATCH v2 02/11] suggested by Markus.
2. Replace [PATCH v2 03/11] with a much simpler patch posted by
   Peter to fix the following bug:
   https://bugzilla.redhat.com/show_bug.cgi?id=2124756
3. Fix the error path of migrate_params_check in [PATCH v2 04/11]
   pointed out by Markus. Enrich the commit message to explain why
   x-vcpu-dirty-limit-period is an unstable parameter.
4. Refactor the dirty-limit convergence algorithm in [PATCH v2 07/11]
   as suggested by Peter (a C sketch of the resulting check ordering
   follows this list):
   a. apply the blk_mig_bulk_active check before enabling dirty-limit
   b. drop the unhelpful check function before enabling dirty-limit
   c. change the migration_cancel logic to cancel dirty-limit only
      if the dirty-limit capability is turned on
   d. split out a code-cleanup commit [PATCH v3 07/10] to adjust
      the check order before enabling auto-converge
5. Rename the fields observed during dirty-limit live migration to
   make them easier to understand. Use the maximum throttle time
   across vCPUs as "dirty-limit-throttle-time-per-full".
6. Fix some grammatical and spelling errors pointed out by Markus
   and enrich the documentation of the dirty-limit live migration
   observability fields "dirty-limit-ring-full-time"
   and "dirty-limit-throttle-time-per-full".
7. Change the default value of x-vcpu-dirty-limit-period to 1000 ms,
   the optimal value found in the cover-letter tests for that
   environment.
8. Drop the 2 guestperf test commits [PATCH v2 10/11] and
   [PATCH v2 11/11]; they will be posted as a standalone series in
   the future.
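
A minimal, self-contained C model of the check ordering described in
items 4a-4c above; the helper names mirror the series, but the bodies
are stubs, so this is a sketch rather than the actual patch:

  #include <stdbool.h>
  #include <stdio.h>

  /* Stubs standing in for the real QEMU helpers of the same names. */
  static bool blk_mig_bulk_active(void)   { return false; }
  static bool migrate_dirty_limit(void)   { return true;  }
  static bool migrate_auto_converge(void) { return false; }

  static void migration_trigger_throttle(void)
  {
      /* 4a: no throttling while block migration is bulk-copying */
      if (blk_mig_bulk_active()) {
          return;
      }
      if (migrate_dirty_limit()) {
          puts("throttle vCPUs via dirty-limit");   /* convergence algo */
      } else if (migrate_auto_converge()) {
          puts("throttle vCPUs via auto-converge"); /* legacy path */
      }
  }

  int main(void)
  {
      migration_trigger_throttle();
      return 0;
  }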

Sincere thanks to Peter and Markus for the passionate, efficient,
and careful comments and suggestions.

Please review.  

Yong

v2:
This version makes a few modifications compared with version 1,
as follows (a QMP usage sketch follows the list):
1. Fix the overflow issue reported by Peter Maydell.
2. Add a parameter check for the hmp "set_vcpu_dirty_limit" command.
3. Fix the race between the dirty ring reaper thread and the
   QEMU main thread.
4. Add migrate parameter checks for x-vcpu-dirty-limit-period
   and vcpu-dirty-limit.
5. Forbid the hmp/qmp commands set_vcpu_dirty_limit and
   cancel_vcpu_dirty_limit during dirty-limit live migration, as
   part of implementing the dirty-limit convergence algorithm.
6. Add a capability check to ensure auto-converge and dirty-limit
   are mutually exclusive.
7. Pre-check that the KVM dirty ring size is configured before the
   dirty-limit migrate parameters can be set.
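
To make items 4-7 concrete, here is a hedged QMP sketch of turning
the feature on; the capability and parameter names are the ones this
series introduces, and the values are examples only (vcpu-dirty-limit
is in MB/s per the series' QAPI docs):

  -> {"execute": "migrate-set-capabilities",
      "arguments": {"capabilities": [
          {"capability": "dirty-limit", "state": true}]}}
  <- {"return": {}}
  -> {"execute": "migrate-set-parameters",
      "arguments": {"x-vcpu-dirty-limit-period": 1000,
                    "vcpu-dirty-limit": 128}}
  <- {"return": {}}

Per item 6 the capability is rejected when auto-converge is enabled,
and per item 7 both settings require the VM to run with a KVM dirty
ring configured.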

More comprehensive tests were done compared with version 1.

The test environment is as follows:
-------------------------------------------------------------
a. Host hardware info:

CPU:
Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz

CPU(s):                          64
On-line CPU(s) list:             0-63
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       2
NUMA node(s):                    2

NUMA node0 CPU(s):               0-15,32-47
NUMA node1 CPU(s):               16-31,48-63

Memory:
Hynix  503Gi

Interface:
Intel Corporation Ethernet Connection X722 for 1GbE (rev 09)
Speed: 1000Mb/s

b. Host software info:

OS: ctyunos release 2
Kernel: 4.19.90-2102.2.0.0066.ctl2.x86_64
Libvirt baseline version:  libvirt-6.9.0
Qemu baseline version: qemu-5.0

c. VM scale
CPU: 4
Memory: 4G
-------------------------------------------------------------

All the supplementary test data shown below are based on the above
test environment.

In version 1, we posted UnixBench test data as follows:

$ taskset -c 8-15 ./Run -i 2 -c 8 {unixbench test item}

host cpu: Intel(R) Xeon(R) Platinum 8378A
host interface speed: 1000Mb/s
  |---------------------+--------+------------+---------------|
  | UnixBench test item | Normal | Dirtylimit | Auto-converge |
  |---------------------+--------+------------+---------------|
  | dhry2reg            | 32800  | 32786      | 25292         |
  | whetstone-double    | 10326  | 10315      | 9847          |
  | pipe                | 15442  | 15271      | 14506         |
  | context1            | 7260   | 6235       | 4514          |
  | spawn               | 3663   | 3317       | 3249          |
  | syscall             | 4669   | 4667       | 3841          |
  |---------------------+--------+------------+---------------|

In version 2, we posted supplementary test data that does not use
taskset, making the scenario more general:

$ ./Run

per-vCPU data:
  |---------------------+--------+------------+---------------|
  | UnixBench test item | Normal | Dirtylimit | Auto-converge |
  |---------------------+--------+------------+---------------|
  | dhry2reg            | 2991   | 2902       | 1722          |
  | whetstone-double    | 1018   | 1006       | 627           |
  | Execl Throughput    | 955    | 320        | 660           |
  | File Copy - 1       | 2362   | 805        | 1325          |
  | File Copy - 2       | 1500   | 1406       | 643           |  
  | File Copy - 3       | 4778   | 2160       | 1047          | 
  | Pipe Throughput     | 1181   | 1170       | 842           |
  | Context Switching   | 192    | 224        | 198           |
  | Process Creation    | 490    | 145        | 95            |
  | Shell Scripts - 1   | 1284   | 565        | 610           |
  | Shell Scripts - 2   | 2368   | 900        | 1040          |
  | System Call Overhead| 983    | 948        | 698           |
  | Index Score         | 1263   | 815        | 600           |
  |---------------------+--------+------------+---------------|
Note:
  File Copy - 1: File Copy 1024 bufsize 2000 maxblocks
  File Copy - 2: File Copy 256 bufsize 500 maxblocks 
  File Copy - 3: File Copy 4096 bufsize 8000 maxblocks 
  Shell Scripts - 1: Shell Scripts (1 concurrent)
  Shell Scripts - 2: Shell Scripts (8 concurrent)

Based on the above data, we can conclude that dirty-limit improves
the system benchmarks in almost every respect; the "System Benchmarks
Index Score" (815 vs 600) shows about a 35% performance improvement
over auto-converge during live migration.

4-vCPU parallel data (we ran a test VM at 4-CPU/4G scale):
  |---------------------+--------+------------+---------------|
  | UnixBench test item | Normal | Dirtylimit | Auto-converge |
  |---------------------+--------+------------+---------------|
  | dhry2reg            | 7975   | 7146       | 5071          |
  | whetstone-double    | 3982   | 3561       | 2124          |
  | Execl Throughput    | 1882   | 1205       | 768           |
  | File Copy - 1       | 1061   | 865        | 498           |
  | File Copy - 2       | 676    | 491        | 519           |  
  | File Copy - 3       | 2260   | 923        | 1329          | 
  | Pipe Throughput     | 3026   | 3009       | 1616          |
  | Context Switching   | 1219   | 1093       | 695           |
  | Process Creation    | 947    | 307        | 446           |
  | Shell Scripts - 1   | 2469   | 977        | 989           |
  | Shell Scripts - 2   | 2667   | 1275       | 984           |
  | System Call Overhead| 1592   | 1459       | 692           |
  | Index Score         | 1976   | 1294       | 997           |
  |---------------------+--------+------------+---------------|

For the parallel data, the "System Benchmarks Index Score" (1294 vs
997) likewise shows about a 29% improvement.

In version 1, migration total time was shown as follows:

host cpu: Intel(R) Xeon(R) Platinum 8378A
host interface speed: 1000Mb/s
  |-----------------------+----------------+-------------------|
  | dirty memory size(MB) | Dirtylimit(ms) | Auto-converge(ms) |
  |-----------------------+----------------+-------------------|
  | 60                    | 2014           | 2131              |
  | 70                    | 5381           | 12590             |
  | 90                    | 6037           | 33545             |
  | 110                   | 7660           | [*]               |
  |-----------------------+----------------+-------------------|
  [*]: migration did not converge in this case.

In version 2, we posted more comprehensive migration total time test
data as follows:

We dirty N MB on 4 CPUs and sleep S us after every 1 MB of data
updated. Each condition was tested twice; the data are shown below
(a C sketch of this dirtying workload follows the table):

  |-----------+--------+--------+----------------+-------------------|
  | ring size | N (MB) | S (us) | Dirtylimit(ms) | Auto-converge(ms) |
  |-----------+--------+--------+----------------+-------------------|
  | 1024      | 1024   | 1000   | 44951          | 191780            |
  | 1024      | 1024   | 1000   | 44546          | 185341            |
  | 1024      | 1024   | 500    | 46505          | 203545            |
  | 1024      | 1024   | 500    | 45469          | 909945            |
  | 1024      | 1024   | 0      | 61858          | [*]               |
  | 1024      | 1024   | 0      | 57922          | [*]               |
  | 1024      | 2048   | 0      | 91982          | [*]               |
  | 1024      | 2048   | 0      | 90388          | [*]               |
  | 2048      | 128    | 10000  | 14511          | 25971             |
  | 2048      | 128    | 10000  | 13472          | 26294             |
  | 2048      | 1024   | 10000  | 44244          | 26294             |
  | 2048      | 1024   | 10000  | 45099          | 157701            |
  | 2048      | 1024   | 500    | 51105          | [*]               |
  | 2048      | 1024   | 500    | 49648          | [*]               |
  | 2048      | 1024   | 0      | 229031         | [*]               |
  | 2048      | 1024   | 0      | 154282         | [*]               |
  |-----------+--------+--------+----------------+-------------------|
  [*]: migration did not converge in this case.
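
A hypothetical sketch of the dirtying workload described above: four
threads each touch their share of N MB, sleeping S microseconds after
every 1 MB written. The real test program is not part of this series;
this only illustrates the N and S knobs in the table:

  #include <pthread.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  #define MB (1024 * 1024)
  #define NTHREADS 4

  static size_t n_mb = 1024;      /* N: total MB dirtied per pass */
  static useconds_t s_us = 1000;  /* S: sleep after each MB */

  static void *dirty_worker(void *arg)
  {
      char *buf = arg;
      for (;;) {  /* keep dirtying until the test is stopped */
          for (size_t i = 0; i < n_mb / NTHREADS; i++) {
              memset(buf + i * MB, i & 0xff, MB); /* dirty 1 MB */
              usleep(s_us);                       /* sleep S us */
          }
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t tid[NTHREADS];
      for (int i = 0; i < NTHREADS; i++) {
          pthread_create(&tid[i], NULL, dirty_worker,
                         malloc(n_mb / NTHREADS * MB));
      }
      pthread_join(tid[0], NULL); /* never returns; run until killed */
      return 0;
  }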

Note that the larger the ring size, the less sensitively dirty-limit
responds, so an optimal ring size should be chosen based on test data
for VMs of different scales; a command line sketch follows.
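
The ring size in these tables is the per-vCPU KVM dirty ring size set
at VM startup. The dirty-ring-size accelerator property exists in
upstream QEMU; the rest of this command line is illustrative only:

  $ qemu-system-x86_64 -accel kvm,dirty-ring-size=2048 -smp 4 -m 4G ...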

We also tested the effect of the "x-vcpu-dirty-limit-period" parameter
on migration total time. Each condition was tested twice; the data are
shown below:

  |-----------+--------+--------+-------------+----------------------|
  | ring size | N (MB) | S (us) | Period (ms) | Total time (ms)      | 
  |-----------+--------+--------+-------------+----------------------|
  | 2048      | 1024   | 10000  | 100         | [*]                  |
  | 2048      | 1024   | 10000  | 100         | [*]                  |
  | 2048      | 1024   | 10000  | 300         | 156795               |
  | 2048      | 1024   | 10000  | 300         | 118179               |
  | 2048      | 1024   | 10000  | 500         | 44244                |
  | 2048      | 1024   | 10000  | 500         | 45099                |
  | 2048      | 1024   | 10000  | 700         | 41871                |
  | 2048      | 1024   | 10000  | 700         | 42582                |
  | 2048      | 1024   | 10000  | 1000        | 41430                |
  | 2048      | 1024   | 10000  | 1000        | 40383                |
  | 2048      | 1024   | 10000  | 1500        | 42030                |
  | 2048      | 1024   | 10000  | 1500        | 42598                |
  | 2048      | 1024   | 10000  | 2000        | 41694                |
  | 2048      | 1024   | 10000  | 2000        | 42403                |
  | 2048      | 1024   | 10000  | 3000        | 43538                |
  | 2048      | 1024   | 10000  | 3000        | 43010                |
  |-----------+--------+--------+-------------+----------------------|

It shows that x-vcpu-dirty-limit-period should be configured to
1000 ms under the above conditions; a QMP sketch follows.
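
An illustrative QMP command for setting that period (the x- prefix
marks the parameter as unstable, per the series; 1000 ms is the
optimum from the table above):

  -> {"execute": "migrate-set-parameters",
      "arguments": {"x-vcpu-dirty-limit-period": 1000}}
  <- {"return": {}}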

Please review; any comments and suggestions are greatly appreciated.
Thanks.

Yong

Hyman Huang (9):
  dirtylimit: Fix overflow when computing MB
  softmmu/dirtylimit: Add parameter check for hmp "set_vcpu_dirty_limit"
  qapi/migration: Introduce x-vcpu-dirty-limit-period parameter
  qapi/migration: Introduce vcpu-dirty-limit parameters
  migration: Introduce dirty-limit capability
  migration: Refactor auto-converge capability logic
  migration: Implement dirty-limit convergence algo
  migration: Extend query-migrate to provide dirty page limit info
  tests: Add migration dirty-limit capability test

Peter Xu (1):
  kvm: dirty-ring: Fix race with vcpu creation

 accel/kvm/kvm-all.c            |   9 ++
 include/sysemu/dirtylimit.h    |   2 +
 migration/migration-hmp-cmds.c |  26 ++++++
 migration/migration.c          |  88 ++++++++++++++++++
 migration/migration.h          |   1 +
 migration/ram.c                |  63 ++++++++++---
 migration/trace-events         |   1 +
 qapi/migration.json            |  64 ++++++++++++--
 softmmu/dirtylimit.c           |  91 ++++++++++++++++---
 tests/qtest/migration-test.c   | 157 +++++++++++++++++++++++++++++++++
 10 files changed, 470 insertions(+), 32 deletions(-)

-- 
2.17.1


Re: [PATCH v4 00/10] migration: introduce dirtylimit capability
Posted by Hyman Huang 1 year, 2 months ago
Ping ?

On 2023/2/17 0:18, huangy81@chinatelecom.cn wrote:
> [snip]

-- 
Best regards

Hyman Huang(黄勇)

Re: [PATCH v4 00/10] migration: introduce dirtylimit capability
Posted by Hyman Huang 1 year, 1 month ago
Ping again, to make sure this series is not forgotten. :)

Please review the last three commits when you have time.

Thanks,

Yong


On 2023/3/1 23:53, Hyman Huang wrote:
> Ping ?
> 
> On 2023/2/17 0:18, huangy81@chinatelecom.cn wrote:
>> [snip]

-- 
Best regards

Hyman Huang(黄勇)