From: Chongyun Wu <wucy11@chinatelecom.cn>

v2:
 -patch 1: remove patch_1

v1:
 -rebase to qemu/master

Overview
============
This series of patches optimizes the performance of live migration
when using the dirty ring and auto-converge.

The optimization covers the following aspects:
1. Dynamically adjust the dirty ring collection (reaper) thread to
   reduce ring-full events, thereby reducing the impact on guests,
   improving the efficiency of dirty page collection, and thus
   improving migration efficiency.

2. When collecting dirty pages from KVM, do not call
   kvm_cpu_synchronize_kick_all while the vCPUs are rate limited;
   call it only once before suspending the virtual machine.
   kvm_cpu_synchronize_kick_all becomes very time-consuming when the
   CPUs are throttled, and there will not be many dirty pages at that
   point, so calling it once before suspending the virtual machine is
   enough to ensure that no dirty pages are lost while keeping
   migration efficient.

3. Based on how the dirty ring collects dirty pages, introduce a new
   dirty page rate calculation method that yields a more accurate
   dirty page rate.

4. Use the more accurate dirty page rate, together with the current
   system bandwidth and migration parameters, to directly calculate
   the throttle needed to complete the migration, instead of the
   current time-consuming method of iteratively probing for a speed
   limit. This greatly reduces migration time (a rough sketch of this
   calculation follows the Test1 results below).

Testing
=======
Test environment:
Host: 64 cpus (Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz), 512G memory, 10G NIC
VM: 2 cpus, 4G memory and 8 cpus, 32G memory
Memory stress: run stress (qemu) in the VM to generate memory stress

Test1: Massive online migration (each test item run 50 to 200 times)
Test command:
virsh -t migrate $vm --live --p2p --unsafe --undefinesource --persistent --auto-converge --migrateuri tcp://${data_ip_remote}

*********** Use optimized dirty ring ***********
ring_size  mem_stress  VM     average_migration_time(ms)
4096       1G          2C4G   15888
4096       3G          2C4G   13320
65536      1G          2C4G   10036
65536      3G          2C4G   12132
4096       4G          8C32G  53629
4096       8G          8C32G  62474
4096       30G         8C32G  99025
65536      4G          8C32G  45563
65536      8G          8C32G  61114
65536      30G         8C32G  102087

*********** Use unoptimized dirty ring ***********
ring_size  mem_stress  VM     average_migration_time(ms)
4096       1G          2C4G   23992
4096       3G          2C4G   44234
65536      1G          2C4G   24546
65536      3G          2C4G   44939
4096       4G          8C32G  88441
4096       8G          8C32G  may not complete
4096       30G         8C32G  602884
65536      4G          8C32G  335535
65536      8G          8C32G  1249232
65536      30G         8C32G  616939

*********** Use bitmap dirty tracking ***********
ring_size  mem_stress  VM     average_migration_time(ms)
0          1G          2C4G   24597
0          3G          2C4G   45254
0          4G          8C32G  103773
0          8G          8C32G  129626
0          30G         8C32G  588212

Test1 result:
Compared with the old bitmap method and the unoptimized dirty ring,
the migration time with the optimized dirty ring is greatly improved.
The effect is most obvious when the virtual machine memory is large
and the memory pressure is high, where migration is five to six times
faster. During testing, the unoptimized dirty ring sometimes could not
complete migration for a long time under certain memory pressure; the
optimized dirty ring did not hit this problem.
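
Roughly, the idea in points 3 and 4 can be sketched like this. This is
illustrative C only; the names, counters and the exact formula are
assumptions made for the sketch, not the code in these patches:

#include <stdint.h>

#define SAMPLE_PAGE_SIZE 4096ULL

/* Hypothetical counter: pages harvested from all vcpus' dirty rings
 * since the last sample, updated by the reaper thread. */
static uint64_t ring_pages_since_last_sample;
static int64_t last_sample_time_ms;

/* Dirty rate in bytes/second, measured over the sampling window. */
static uint64_t dirty_ring_rate(int64_t now_ms)
{
    int64_t window_ms = now_ms - last_sample_time_ms;
    uint64_t bytes = ring_pages_since_last_sample * SAMPLE_PAGE_SIZE;

    if (window_ms <= 0) {
        return 0;
    }
    ring_pages_since_last_sample = 0;
    last_sample_time_ms = now_ms;
    return bytes * 1000 / window_ms;
}

/*
 * Pick a throttle percentage that scales the dirty rate down below the
 * measured migration bandwidth: if the guest dirties D bytes/s and we
 * can send B bytes/s, slowing the vcpus by roughly (1 - B/D) should
 * let each iteration shrink the remaining dirty set.
 */
static int matched_throttle_pct(uint64_t dirty_rate, uint64_t bandwidth)
{
    int pct;

    if (dirty_rate == 0 || dirty_rate <= bandwidth) {
        return 0;                   /* already converging, no throttle */
    }
    pct = 100 - (int)(bandwidth * 100 / dirty_rate);
    return pct > 99 ? 99 : pct;     /* clamp to the usual 99% maximum */
}
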
Test2: qemu guestperf test
Test command parameters:
--auto-converge --stress-mem XX --downtime 300 --bandwidth 10000

*********** Use optimized dirty ring ***********
ring_size  stress  VM     significant_perf_  max_memory_update_  cost_time(s)
                          drop_duration(s)   speed(ms/GB)
4096       3G      2C4G   5.5                2962                23.5
65536      3G      2C4G   6                  3160                25
4096       3G      8C32G  13                 7921                38
4096       6G      8C32G  16                 11.6K               46
4096       10G     8C32G  12.1               11.2K               47.6
4096       20G     8C32G  20                 20.2K               71
4096       30G     8C32G  29.5               29K                 94.5
65536      3G      8C32G  14                 8700                40
65536      6G      8C32G  15                 12K                 46
65536      10G     8C32G  11.5               11.1K               47.5
65536      20G     8C32G  21                 20.9K               72
65536      30G     8C32G  29.5               29.1K               94.5

*********** Use unoptimized dirty ring ***********
ring_size  stress  VM     significant_perf_  max_memory_update_  cost_time(s)
                          drop_duration(s)   speed(ms/GB)
4096       3G      2C4G   23                 2766                46
65536      3G      2C4G   22.2               3283                46
4096       3G      8C32G  62                 48.8K               106
4096       6G      8C32G  68                 23.87K              124
4096       10G     8C32G  91                 16.87K              190
4096       20G     8C32G  152.8              28.65K              336.8
4096       30G     8C32G  187                41.19K              502
65536      3G      8C32G  71                 12.7K               67
65536      6G      8C32G  63                 12K                 46
65536      10G     8C32G  88                 25.3K               120
65536      20G     8C32G  157.3              25K                 391
65536      30G     8C32G  171                30.8K               487

*********** Use bitmap dirty tracking ***********
ring_size  stress  VM     significant_perf_  max_memory_update_  cost_time(s)
                          drop_duration(s)   speed(ms/GB)
0          3G      2C4G   18                 3300                38
0          3G      8C32G  38                 7571                66
0          6G      8C32G  61.5               10.5K               115.5
0          10G     8C32G  110                13.68K              180
0          20G     8C32G  161.6              24.4K               280
0          30G     8C32G  221.5              28.4K               337.5

Test2 result:
The above data show that guestperf performance during migration with
the optimized dirty ring is significantly better than with the
unoptimized dirty ring, and slightly better than with the bitmap
method. With the optimized dirty ring the migration time is greatly
reduced, and the period of significant memory performance degradation
is much shorter than with either the bitmap mode or the unoptimized
dirty ring. The optimized dirty ring therefore better limits the
impact on guest memory access during migration.

Please review, thanks.

Chongyun Wu (4):
  kvm: Dynamically adjust the rate of dirty ring reaper thread
  kvm: Dirty ring autoconverge optimization for
    kvm_cpu_synchronize_kick_all
  kvm: Introduce a dirty rate calculation method based on dirty ring
  migration: Calculate the appropriate throttle for autoconverge

 accel/kvm/kvm-all.c       | 241 +++++++++++++++++++++++++++++++++++++++++-----
 include/exec/cpu-common.h |   2 +
 include/sysemu/kvm.h      |   2 +
 migration/migration.c     |  12 +++
 migration/migration.h     |   2 +
 migration/ram.c           |  64 +++++++++++-
 softmmu/cpus.c            |  18 ++++
 7 files changed, 312 insertions(+), 29 deletions(-)

-- 
1.8.3.1

Chongyun,

On Mon, Mar 28, 2022 at 09:32:10AM +0800, wucy11@chinatelecom.cn wrote:
> From: Chongyun Wu <wucy11@chinatelecom.cn>
>
> v2:
> -patch 1: remove patch_1
>
> v1:
> -rebase to qemu/master
>
> Overview
> ============
> This series of patches optimizes the performance of live migration
> when using the dirty ring and auto-converge.
>
> The optimization covers the following aspects:
> 1. Dynamically adjust the dirty ring collection (reaper) thread to
>    reduce ring-full events, thereby reducing the impact on guests,
>    improving the efficiency of dirty page collection, and thus
>    improving migration efficiency.
>
> 2. When collecting dirty pages from KVM, do not call
>    kvm_cpu_synchronize_kick_all while the vCPUs are rate limited;
>    call it only once before suspending the virtual machine.
>    kvm_cpu_synchronize_kick_all becomes very time-consuming when the
>    CPUs are throttled, and there will not be many dirty pages at that
>    point, so calling it once before suspending the virtual machine is
>    enough to ensure that no dirty pages are lost while keeping
>    migration efficient.
>
> 3. Based on how the dirty ring collects dirty pages, introduce a new
>    dirty page rate calculation method that yields a more accurate
>    dirty page rate.
>
> 4. Use the more accurate dirty page rate, together with the current
>    system bandwidth and migration parameters, to directly calculate
>    the throttle needed to complete the migration, instead of the
>    current time-consuming method of iteratively probing for a speed
>    limit. This greatly reduces migration time.

Thanks for the patches.

I'm curious what's the relationship between this series and Yong's?

If talking about throttling, I do think the old auto-converge was kind of
inefficient compared to the new per-vcpu ways of throttling, at least in
either granularity or on read tolerances (e.g., a dirty ring based solution
will not block vcpu readers even if the thread is heavily throttled).

We've got quite a few techniques taking care of migration convergence
issues (didn't mention postcopy yet..), and I'm wondering whether at some
point we should be more focused and make a chosen one better, rather than
building different blocks servicing the same purpose.

Thanks,

-- 
Peter Xu

Thanks for review.

On 4/1/2022 9:13 PM, Peter Xu wrote:
> Chongyun,
>
> On Mon, Mar 28, 2022 at 09:32:10AM +0800, wucy11@chinatelecom.cn wrote:
>> From: Chongyun Wu <wucy11@chinatelecom.cn>
>>
>> v2:
>> -patch 1: remove patch_1
>>
>> v1:
>> -rebase to qemu/master
>>
>> Overview
>> ============
>> This series of patches optimizes the performance of live migration
>> when using the dirty ring and auto-converge.
>>
>> The optimization covers the following aspects:
>> 1. Dynamically adjust the dirty ring collection (reaper) thread to
>>    reduce ring-full events, thereby reducing the impact on guests,
>>    improving the efficiency of dirty page collection, and thus
>>    improving migration efficiency.
>>
>> 2. When collecting dirty pages from KVM, do not call
>>    kvm_cpu_synchronize_kick_all while the vCPUs are rate limited;
>>    call it only once before suspending the virtual machine.
>>    kvm_cpu_synchronize_kick_all becomes very time-consuming when the
>>    CPUs are throttled, and there will not be many dirty pages at that
>>    point, so calling it once before suspending the virtual machine is
>>    enough to ensure that no dirty pages are lost while keeping
>>    migration efficient.
>>
>> 3. Based on how the dirty ring collects dirty pages, introduce a new
>>    dirty page rate calculation method that yields a more accurate
>>    dirty page rate.
>>
>> 4. Use the more accurate dirty page rate, together with the current
>>    system bandwidth and migration parameters, to directly calculate
>>    the throttle needed to complete the migration, instead of the
>>    current time-consuming method of iteratively probing for a speed
>>    limit. This greatly reduces migration time.
>
> Thanks for the patches.
>
> I'm curious what's the relationship between this series and Yong's?

I personally think they are complementary. Yong's approach can throttle
per vCPU: when the memory pressure comes from threads on particular
vCPUs, the other vCPUs are barely restricted, so the impact on guests
during migration is smaller. The auto-converge optimization in the last
two patches of this series handles the case where memory pressure is
balanced across all vCPUs. Each has its own advantages, and users can
choose the appropriate mode for their workload. The first two patches
target the dirty ring itself, and both auto-converge and Yong's mode can
benefit from them.

> If talking about throttling, I do think the old auto-converge was kind of
> inefficient compared to the new per-vcpu ways of throttling, at least in
> either granularity or on read tolerances (e.g., a dirty ring based solution
> will not block vcpu readers even if the thread is heavily throttled).

Yes, I agree with that. While studying the dirty ring and running many
tests, we found some points that limit the dirty ring's advantages, made
optimizations for them, and verified through testing that these
optimizations are effective.

In this patch series, only the last two patches are specific to
auto-converge. The first two patches apply to all cases where the dirty
ring is used, including Yong's, and do not conflict with his work. Among
them, "kvm: Dynamically adjust the rate of dirty ring reaper thread" is
proposed to take better advantage of the dirty ring.

When memory pressure is high, speeding up the rate at which the reaper
thread collects dirty pages effectively avoids frequent ring-full
events, which would otherwise force frequent guest exits and degrade
guestperf performance. While the migration thread is transferring data,
most dirty pages have already been collected, so when the migration
thread synchronizes dirty pages from the dirty ring it takes less time,
which also speeds up the migration. These two patches should improve
Yong's test results as well, and the two optimization points are
different. A rough sketch of this kind of interval adjustment appears
below.

> We've got quite a few techniques taking care of migration convergence
> issues (didn't mention postcopy yet..), and I'm wondering whether at some
> point we should be more focused and make a chosen one better, rather than
> building different blocks servicing the same purpose.

I'm sorry, maybe I should have split these patch series to avoid
misunderstandings. These patches and Yong's should be complementary, and
two of them can also give Yong's approach some performance improvement.

>
> Thanks,
> -- 

Best Regards,
Chongyun Wu
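
For illustration only, here is a minimal sketch of the kind of adaptive
reaper polling described above. The names, thresholds, and back-off
policy are assumptions made for this sketch, not the code from the
patch:

#include <stdint.h>
#include <unistd.h>

#define REAP_INTERVAL_MAX_MS  1000   /* idle guest: poll slowly        */
#define REAP_INTERVAL_MIN_MS  10     /* rings filling fast: poll often */

static int64_t reap_interval_ms = REAP_INTERVAL_MAX_MS;

/* Called after each harvest with the fullest vcpu ring's utilisation
 * (0-100), a value the reaper can compute from the ring indices. */
static void reaper_adjust_interval(int max_ring_util_pct)
{
    if (max_ring_util_pct > 50) {
        /* Rings are filling quickly: halve the sleep time so they are
         * harvested before they become full and force vcpu exits. */
        reap_interval_ms /= 2;
        if (reap_interval_ms < REAP_INTERVAL_MIN_MS) {
            reap_interval_ms = REAP_INTERVAL_MIN_MS;
        }
    } else if (max_ring_util_pct < 10) {
        /* Little dirtying going on: back off gradually. */
        reap_interval_ms += REAP_INTERVAL_MIN_MS;
        if (reap_interval_ms > REAP_INTERVAL_MAX_MS) {
            reap_interval_ms = REAP_INTERVAL_MAX_MS;
        }
    }
}

/* One iteration of the reaper loop: adjust, sleep, then harvest. */
static void reaper_iteration(int max_ring_util_pct)
{
    reaper_adjust_interval(max_ring_util_pct);
    usleep(reap_interval_ms * 1000);
    /* ...harvest all vcpu dirty rings here... */
}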