[PATCH v2 0/4] Dirty ring and auto converge optimization
Posted by wucy11@chinatelecom.cn 2 years, 1 month ago
From: Chongyun Wu <wucy11@chinatelecom.cn>

v2:
-patch 1: remove patch_1

v1:
-rebase to qemu/master

Overview
============
This patch series optimizes the performance of live migration when
using the dirty ring together with auto-converge.

The optimization covers the following aspects (rough sketches of the
ideas behind points 1-4 follow this list):
1. Dynamically adjust the rate of the dirty ring reaper thread to
reduce ring-full events, thereby reducing the impact on the guest,
improving the efficiency of dirty page collection, and thus
improving migration efficiency.

2. When collecting dirty pages from KVM, do not call
kvm_cpu_synchronize_kick_all while the vCPUs are being throttled;
call it only once before pausing the virtual machine.
kvm_cpu_synchronize_kick_all becomes very time-consuming when the
CPUs are throttled, and few new dirty pages are produced at that
point, so a single call before pausing the virtual machine is enough
to ensure that no dirty pages are lost while keeping the migration
efficient.

3. Based on how the dirty ring collects dirty pages, introduce a new
dirty page rate calculation method that yields a more accurate dirty
page rate.

4. Use the more accurate dirty page rate, the current system
bandwidth and the migration parameters to directly calculate the
throttle needed to complete the migration, instead of the current
time-consuming approach of probing for a workable speed limit,
greatly reducing migration time.
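
Below is a minimal sketch of the idea behind point 1, not the patch
code itself: the helper name reaper_adjust_interval, the thresholds
and the bounds are made up for illustration. The actual change drives
how often the reaper thread in accel/kvm/kvm-all.c collects the
per-vCPU dirty rings.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative bounds and thresholds, not taken from the patches. */
    #define REAP_INTERVAL_MIN_MS   5
    #define REAP_INTERVAL_MAX_MS   1000
    #define DIRTY_PAGES_HIGH       8192
    #define DIRTY_PAGES_LOW        512

    /*
     * Sketch: shorten the reaper thread's sleep interval when the last
     * round reaped many dirty pages or hit a ring-full event, and relax
     * it again when the guest is quiet, so the ring rarely fills up.
     */
    static uint64_t reaper_adjust_interval(uint64_t cur_ms,
                                           uint64_t pages_last_round,
                                           bool ring_full_seen)
    {
        if (ring_full_seen || pages_last_round > DIRTY_PAGES_HIGH) {
            cur_ms /= 2;                      /* reap more aggressively */
        } else if (pages_last_round < DIRTY_PAGES_LOW) {
            cur_ms *= 2;                      /* back off */
        }
        if (cur_ms < REAP_INTERVAL_MIN_MS) {
            cur_ms = REAP_INTERVAL_MIN_MS;
        } else if (cur_ms > REAP_INTERVAL_MAX_MS) {
            cur_ms = REAP_INTERVAL_MAX_MS;
        }
        return cur_ms;
    }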
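
Point 2 can be read as the rough sketch below. cpu_throttle_active()
is QEMU's existing helper for checking whether auto-converge
throttling is in effect, kvm_cpu_synchronize_kick_all() is the call
discussed above, and the wrapper itself is only illustrative.

    #include <stdbool.h>

    /* Declarations of the helpers assumed by this sketch. */
    bool cpu_throttle_active(void);
    void kvm_cpu_synchronize_kick_all(void);

    /*
     * Sketch: skip the expensive kick of every vCPU while auto-converge
     * is throttling them (few new dirty pages are produced then), and
     * issue it exactly once for the final sync before the VM is paused.
     */
    static void dirty_ring_sync_kick(bool final_sync_before_vm_stop)
    {
        if (final_sync_before_vm_stop || !cpu_throttle_active()) {
            kvm_cpu_synchronize_kick_all();
        }
        /* ...then reap the per-vCPU dirty rings as usual... */
    }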
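
For points 3 and 4 the arithmetic is roughly the following (a sketch
under assumptions; the names and the exact formula used by the
patches may differ): the dirty page rate is derived from the number
of pages reaped out of the dirty rings during a sampling window, and
the throttle is chosen so that the throttled dirty rate fits under
the measured migration bandwidth.

    #include <stdint.h>

    #define GUEST_PAGE_SIZE 4096   /* assumption: 4 KiB guest pages */

    /*
     * Sketch of point 3: dirty page rate in bytes/s, computed from the
     * pages reaped out of the dirty rings during a window of window_ms
     * milliseconds (window_ms must be non-zero).
     */
    static uint64_t dirty_ring_rate_bps(uint64_t pages_reaped,
                                        uint64_t window_ms)
    {
        return pages_reaped * GUEST_PAGE_SIZE * 1000 / window_ms;
    }

    /*
     * Sketch of point 4: pick a CPU throttle percentage so that the
     * throttled dirty rate drops below the available bandwidth, assuming
     * the dirty rate scales roughly linearly with the CPU time the vCPUs
     * are allowed to run: dirty_rate * (1 - pct/100) <= bandwidth.
     */
    static int required_throttle_pct(uint64_t dirty_rate_bps,
                                     uint64_t bandwidth_bps,
                                     int max_throttle_pct)
    {
        if (dirty_rate_bps <= bandwidth_bps) {
            return 0;                     /* already able to converge */
        }
        int pct = 100 - (int)(bandwidth_bps * 100 / dirty_rate_bps);
        return pct > max_throttle_pct ? max_throttle_pct : pct;
    }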

Testing
=======
    Test environment:
    Host: 64 CPUs (Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz),
          512G memory,
          10G NIC
    VM: 2 CPUs, 4G memory and 8 CPUs, 32G memory
    memory stress: run stress in the VM (qemu guest) to generate memory load

    Test1: Massive online migration (each test item was run 50 to 200 times)
    Test command: virsh -t migrate $vm --live --p2p --unsafe
    --undefinesource --persistent --auto-converge  --migrateuri
    tcp://${data_ip_remote}
    *********** Use optimized dirty ring ***********
    ring_size  mem_stress VM   average_migration_time(ms)
    4096      1G       2C4G     15888
    4096      3G       2C4G     13320
    65536     1G       2C4G     10036
    65536     3G       2C4G     12132
    4096      4G       8C32G    53629
    4096      8G       8C32G    62474
    4096      30G      8C32G    99025
    65536     4G       8C32G    45563
    65536     8G       8C32G    61114
    65536     30G      8C32G    102087
    *********** Use unoptimized dirty ring ***********
    ring_size  mem_stress VM   average_migration_time(ms)
    4096      1G       2C4G     23992
    4096      3G       2C4G     44234
    65536     1G       2C4G     24546
    65536     3G       2C4G     44939
    4096      4G       8C32G    88441
    4096      8G       8C32G    may not complete
    4096      30G      8C32G    602884
    65536     4G       8C32G    335535
    65536     8G       8C32G    1249232
    65536     30G      8C32G    616939
    *********** Use bitmap dirty tracking  ***********
    ring_size  mem_stress VM   average_migration_time(ms)
    0         1G       2C4G     24597
    0         3G       2C4G     45254
    0         4G       8C32G    103773
    0         8G       8C32G    129626
    0         30G      8C32G    588212

Test1 result:
    Compared with both the old bitmap method and the unoptimized dirty
    ring, the optimized dirty ring greatly reduces migration time. The
    effect is most pronounced when the virtual machine has a lot of
    memory and the memory pressure is high, where migration becomes five
    to six times faster.

    During the tests the unoptimized dirty ring sometimes could not
    complete migration for a long time once a certain amount of memory
    pressure was applied; the optimized dirty ring did not hit this
    problem.

    Test2: qemu guestperf test
    Test command parameters: --auto-converge --stress-mem XX --downtime 300
    --bandwidth 10000
    *********** Use optimized dirty ring ***********
    ring_size stress VM    significant_perf_ max_memory_update_ cost_time(s)
                           drop_duration(s)  speed(ms/GB)
    4096       3G    2C4G        5.5           2962             23.5
    65536      3G    2C4G        6             3160             25
    4096       3G    8C32G       13            7921             38
    4096       6G    8C32G       16            11.6K            46
    4096       10G   8C32G       12.1          11.2K            47.6
    4096       20G   8C32G       20            20.2K            71
    4096       30G   8C32G       29.5          29K              94.5
    65536      3G    8C32G       14            8700             40
    65536      6G    8C32G       15            12K              46
    65536      10G   8C32G       11.5          11.1k            47.5
    65536      20G   8C32G       21            20.9K            72
    65536      30G   8C32G       29.5          29.1K            94.5
    *********** Use unoptimized dirty ring ***********
    ring_size stress VM    significant_perf_ max_memory_update_ cost_time(s)
                           drop_duration(s)  speed(ms/GB)
    4096        3G    2C4G        23            2766            46
    65536       3G    2C4G        22.2          3283            46
    4096        3G    8C32G       62            48.8K           106
    4096        6G    8C32G       68            23.87K          124
    4096        10G   8C32G       91            16.87K          190
    4096        20G   8C32G       152.8         28.65K          336.8
    4096        30G   8C32G       187           41.19K          502
    65536       3G    8C32G       71            12.7K           67
    65536       6G    8C32G       63            12K             46
    65536       10G   8C32G       88            25.3k           120
    65536       20G   8C32G       157.3         25K             391
    65536       30G   8C32G       171           30.8K           487
    *********** Use bitmap dirty tracking  ***********
    ring_size stress VM    significant_perf_ max_memory_update_ cost_time(s)
                           drop_duration(s)  speed(ms/GB)
    0           3G    2C4G        18             3300            38
    0           3G    8C32G       38             7571            66
    0           6G    8C32G       61.5           10.5K           115.5
    0           10G   8C32G       110            13.68k          180
    0           20G   8C32G       161.6          24.4K           280
    0           30G   8C32G       221.5          28.4K           337.5

Test2 result:
    The data above shows that guestperf performance with the optimized
    dirty ring during migration is significantly better than with the
    unoptimized dirty ring, and slightly better than with the bitmap
    method.

    With the optimized dirty ring the migration time is greatly reduced,
    and the period of significant memory performance degradation is
    noticeably shorter than with either the bitmap mode or the
    unoptimized dirty ring mode. The optimized dirty ring therefore
    better reduces the impact on guests accessing memory resources
    during the migration process.

Please review, thanks.

Chongyun Wu (4):
  kvm: Dynamically adjust the rate of dirty ring reaper thread
  kvm: Dirty ring autoconverge optimization for
    kvm_cpu_synchronize_kick_all
  kvm: Introduce a dirty rate calculation method based on dirty ring
  migration: Calculate the appropriate throttle for autoconverge

 accel/kvm/kvm-all.c       | 241 +++++++++++++++++++++++++++++++++++++++++-----
 include/exec/cpu-common.h |   2 +
 include/sysemu/kvm.h      |   2 +
 migration/migration.c     |  12 +++
 migration/migration.h     |   2 +
 migration/ram.c           |  64 +++++++++++-
 softmmu/cpus.c            |  18 ++++
 7 files changed, 312 insertions(+), 29 deletions(-)

-- 
1.8.3.1
Re: [PATCH v2 0/4] Dirty ring and auto converge optimization
Posted by Peter Xu 2 years, 1 month ago
Chongyun,

On Mon, Mar 28, 2022 at 09:32:10AM +0800, wucy11@chinatelecom.cn wrote:
> From: Chongyun Wu <wucy11@chinatelecom.cn>
> 
> v2:
> -patch 1: remove patch_1
> 
> v1:
> -rebase to qemu/master
> 
> Overview
> ============
> This patch series optimizes the performance of live migration when
> using the dirty ring together with auto-converge.
>
> The optimization covers the following aspects:
> 1. Dynamically adjust the rate of the dirty ring reaper thread to
> reduce ring-full events, thereby reducing the impact on the guest,
> improving the efficiency of dirty page collection, and thus
> improving migration efficiency.
>
> 2. When collecting dirty pages from KVM, do not call
> kvm_cpu_synchronize_kick_all while the vCPUs are being throttled;
> call it only once before pausing the virtual machine.
> kvm_cpu_synchronize_kick_all becomes very time-consuming when the
> CPUs are throttled, and few new dirty pages are produced at that
> point, so a single call before pausing the virtual machine is enough
> to ensure that no dirty pages are lost while keeping the migration
> efficient.
>
> 3. Based on how the dirty ring collects dirty pages, introduce a new
> dirty page rate calculation method that yields a more accurate dirty
> page rate.
>
> 4. Use the more accurate dirty page rate, the current system
> bandwidth and the migration parameters to directly calculate the
> throttle needed to complete the migration, instead of the current
> time-consuming approach of probing for a workable speed limit,
> greatly reducing migration time.

Thanks for the patches.

I'm curious what's the relationship between this series and Yong's?

If talking about throttling, I do think the old auto-converge was kind of
inefficient compared to the new per-vcpu ways of throttling at least in
either granularity or on read tolerances (e.g., dirty ring based solution
will not block vcpu readers even if the thread is heavily throttled).

We've got quite a few techniques taking care of migration convergence
issues (didn't mention postcopy yet..), and I'm wondering whether at some
point we should be more focused and make a chosen one better, rather than
building different blocks servicing the same purpose.

Thanks,

-- 
Peter Xu
Re: [PATCH v2 0/4] Dirty ring and auto converge optimization
Posted by Chongyun Wu 2 years, 1 month ago
Thanks for the review.

On 4/1/2022 9:13 PM, Peter Xu wrote:
> Chongyun,
> 
> On Mon, Mar 28, 2022 at 09:32:10AM +0800, wucy11@chinatelecom.cn wrote:
>> From: Chongyun Wu <wucy11@chinatelecom.cn>
>>
>> v2:
>> -patch 1: remove patch_1
>>
>> v1:
>> -rebase to qemu/master
>>
>> Overview
>> ============
>> This patch series optimizes the performance of live migration when
>> using the dirty ring together with auto-converge.
>>
>> The optimization covers the following aspects:
>> 1. Dynamically adjust the rate of the dirty ring reaper thread to
>> reduce ring-full events, thereby reducing the impact on the guest,
>> improving the efficiency of dirty page collection, and thus
>> improving migration efficiency.
>>
>> 2. When collecting dirty pages from KVM, do not call
>> kvm_cpu_synchronize_kick_all while the vCPUs are being throttled;
>> call it only once before pausing the virtual machine.
>> kvm_cpu_synchronize_kick_all becomes very time-consuming when the
>> CPUs are throttled, and few new dirty pages are produced at that
>> point, so a single call before pausing the virtual machine is enough
>> to ensure that no dirty pages are lost while keeping the migration
>> efficient.
>>
>> 3. Based on how the dirty ring collects dirty pages, introduce a new
>> dirty page rate calculation method that yields a more accurate dirty
>> page rate.
>>
>> 4. Use the more accurate dirty page rate, the current system
>> bandwidth and the migration parameters to directly calculate the
>> throttle needed to complete the migration, instead of the current
>> time-consuming approach of probing for a workable speed limit,
>> greatly reducing migration time.
> 
> Thanks for the patches.
> 
> I'm curious what's the relationship between this series and Yong's?
I personally think they are complementary. Yong's series throttles per
vCPU: when the memory pressure comes from threads on only some vCPUs, the
other vCPUs are throttled very little, so the impact on the guest during
migration is smaller. The auto-converge optimization in the last two
patches of this series handles the case where the memory pressure is
balanced across all vCPUs. Each approach has its own advantages, and users
can choose the mode that fits their workload. The first two patches target
the dirty ring itself, and both auto-converge and Yong's mode can benefit
from them.

> 
> If talking about throttling, I do think the old auto-converge was kind of
> inefficient compared to the new per-vcpu ways of throttling at least in
> either granularity or on read tolerances (e.g., dirty ring based solution
> will not block vcpu readers even if the thread is heavily throttled).
Yes, I agree with that. Through studying the dirty ring and a lot of
testing, we found several points that limit the dirty ring's advantages,
made optimizations for them, and verified through testing that these
optimizations are effective.
In this patch series only the last two patches optimize auto-converge.
The first two patches apply to every case where the dirty ring is used,
including Yong's, and do not conflict with his work. Among them, "kvm:
Dynamically adjust the rate of dirty ring reaper thread" takes advantage
of the dirty ring: when memory pressure is high, speeding up the rate at
which the reaper thread collects dirty pages effectively avoids the
frequent ring-full events that force the guest to exit and degrade
guestperf performance. It also means that most dirty pages have already
been synchronized while the migration thread is sending data, so the
dirty page synchronization done by the migration thread takes less time,
which further speeds up the migration. These two patches should improve
Yong's test results as well; the two optimization points are different.

> We've got quite a few techniques taking care of migration convergence
> issues (didn't mention postcopy yet..), and I'm wondering whether at some
> point we should be more focused and make a chosen one better, rather than
> building different blocks servicing the same purpose.
I'm sorry, maybe I should have split these patch series to avoid
misunderstanding. These patches and Yong's should be complementary, and
two of them can also bring Yong's series some performance improvement.
> 
> Thanks,
> 

-- 
Best Regards,
Chongyun Wu