Summary
=====================

Based-on: <20190426062705.4651-1-peterx@redhat.com>

(That is "[PATCH] checkpatch: allow SPDX-License-Identifier". Not a big deal, just to make sure there is no spurious warning, since one patch uses an SPDX license identifier and checkpatch doesn't like it...)

This series allows QEMU to start using the new KVM_CLEAR_DIRTY_LOG interface. For more on KVM_CLEAR_DIRTY_LOG itself, please refer to:

  https://github.com/torvalds/linux/blob/master/Documentation/virtual/kvm/api.txt#L3810

The QEMU work (which is this series) is pushed too; please find the tree here:

  https://github.com/xzpeter/qemu/tree/kvm-clear-dirty-log

Meanwhile, for anyone who really wants to try this out, please also upgrade the host kernel using the tree here:

  https://github.com/xzpeter/linux/tree/kvm-clear-dirty-fixes

The new kernel is required because some fixes are still needed to make the whole thing work; they have not yet been finalized and have not reached Linux master.

Design
===================

I started with a naive design that always passes all 1's to KVM for a memory range, clearing all the dirty bits within that range, but then I encountered guest oopses. The reason is simply that we can't clear any dirty bit from QEMU unless we are _sure_ that the bit is dirty in the kernel. Otherwise we might accidentally clear a bit that we don't even know about (e.g., a bit that was clear in migration's dirty bitmap in QEMU) while that page was just being written, so QEMU would never remember to migrate that new page again.

The new design is built around a dirty bitmap cache within the QEMU KVM layer (per KVM memory slot). With it we know what was dirty in the kernel previously (note: the kernel bitmap keeps growing all the time, so the cache will only ever be a subset of the real-time kernel bitmap, but that is good enough for us), and with that we can be sure not to accidentally clear unknown dirty pages.
With this method we can also avoid races when multiple users (e.g., DIRTY_MEMORY_VGA and DIRTY_MEMORY_MIGRATION) want to clear the same bit multiple times. Without the dirty bitmap cached in the KVM memory slot, we would not be able to know which bits have already been cleared, and if we sent the CLEAR operation for the same bit twice (or more) we could still hit the same issue of accidentally clearing something we shouldn't.

Summary: we really need to be careful about which bits we clear, otherwise anything can happen after the migration completes. I hope this series has covered all of that.

Besides the new KVM cache layer and the new ioctl support, this series introduces memory_region_clear_dirty_bitmap() in the memory API layer, to allow clearing the dirty bits of a specific memory range within a memory region.

Implementations
============================

Patch 1-3: nothing directly related to the series; these are things I found while working on it. They can be picked up earlier if reviewers are happy with them.

Patch 4: pre-work on bitmap operations; this patch also adds the first unit test for utils/bitmap.c.

Patch 5-6: the new memory API interface. Since nothing provides log_clear() yet, it is not functional yet. Note that this only splits the dirty clear operation out of the sync; the clear has not yet been split into smaller chunks, so it is not really helpful on its own.

Patch 7-10: KVM support for KVM_CLEAR_DIRTY_LOG.

Patch 11: do the log_clear() splitting for the migration case. A new parameter is also introduced to define the block size of the small chunks (the unit in which dirty bits are cleared).

Tests
===========================

- make check
- migrate idle/memory-heavy guests (not yet tested with huge guests, but it would be more than welcome if someone has the resources and wants to give it a shot)

Please have a look, thanks.
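To make the design above concrete, here is a minimal C sketch (all names are hypothetical, not the series' actual code) of the rule described: a user's clear request is first intersected with the per-memslot cached bitmap, so only bits we know were dirty in the kernel are ever forwarded to KVM_CLEAR_DIRTY_LOG, and a bit can be cleared at most once per sync:

```c
#include <stddef.h>

/* Hypothetical, simplified per-memslot state: QEMU keeps a cached copy
 * of the bits it has seen dirty in the kernel since the last clear. */
typedef struct SlotCache {
    unsigned long *cached;   /* subset of the kernel's dirty bitmap */
    size_t nlongs;
} SlotCache;

/* Compute the bits that are safe to clear: the intersection of what the
 * user wants cleared and what the cache proves was dirty in the kernel.
 * Those bits are dropped from the cache so a second user cannot clear
 * them again before the next sync.  Returns nonzero if anything is left
 * to send to KVM_CLEAR_DIRTY_LOG. */
static int slot_prepare_clear(SlotCache *s, const unsigned long *request,
                              unsigned long *to_clear)
{
    int any = 0;
    for (size_t i = 0; i < s->nlongs; i++) {
        to_clear[i] = request[i] & s->cached[i];
        s->cached[i] &= ~to_clear[i];
        any |= (to_clear[i] != 0);
    }
    return any;
}
```

With this guard, forwarding the same request twice is harmless: the second call finds the cache bits already consumed and sends nothing to the kernel.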
Peter Xu (11):
  migration: No need to take rcu during sync_dirty_bitmap
  memory: Remove memory_region_get_dirty()
  memory: Don't set migration bitmap when without migration
  bitmap: Add bitmap_copy_with_{src|dst}_offset()
  memory: Pass mr into snapshot_and_clear_dirty
  memory: Introduce memory listener hook log_clear()
  kvm: Update comments for sync_dirty_bitmap
  kvm: Persistent per kvmslot dirty bitmap
  kvm: Introduce slots lock for memory listener
  kvm: Support KVM_CLEAR_DIRTY_LOG
  migration: Split log_clear() into smaller chunks

 accel/kvm/kvm-all.c      | 287 ++++++++++++++++++++++++++++++++++-----
 accel/kvm/trace-events   |   1 +
 exec.c                   |  15 +-
 include/exec/memory.h    |  36 ++---
 include/exec/ram_addr.h  |  91 ++++++++++++-
 include/qemu/bitmap.h    |   9 ++
 include/sysemu/kvm_int.h |   4 +
 memory.c                 |  64 +++++++--
 migration/migration.c    |   4 +
 migration/migration.h    |  27 ++++
 migration/ram.c          |  45 ++++++
 migration/trace-events   |   1 +
 tests/Makefile.include   |   2 +
 tests/test-bitmap.c      |  81 +++++++++++
 util/bitmap.c            |  73 ++++++++++
 15 files changed, 671 insertions(+), 69 deletions(-)
 create mode 100644 tests/test-bitmap.c

--
2.17.1
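Patch 4's bitmap_copy_with_{src|dst}_offset() helpers exist because the dirty bitmaps at different layers do not start at the same bit alignment. As a rough illustration of the semantics only (names and shape are mine; the real util/bitmap.c code works word-at-a-time for speed, not bit-by-bit), a naive version could look like:

```c
#include <limits.h>

#define BITS_PER_LONG ((long)(sizeof(unsigned long) * CHAR_BIT))

/* Bit-granularity sketch of copying `nbits` bits from src, starting at
 * bit src_off, into dst, starting at bit dst_off.  This only shows the
 * contract; an optimized helper would shift whole words. */
static void bitmap_copy_offset_naive(unsigned long *dst, long dst_off,
                                     const unsigned long *src, long src_off,
                                     long nbits)
{
    for (long i = 0; i < nbits; i++) {
        long s = src_off + i, d = dst_off + i;
        unsigned long bit = (src[s / BITS_PER_LONG] >> (s % BITS_PER_LONG)) & 1;
        dst[d / BITS_PER_LONG] &= ~(1UL << (d % BITS_PER_LONG));
        dst[d / BITS_PER_LONG] |= bit << (d % BITS_PER_LONG);
    }
}
```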
On 08/05/19 01:15, Peter Xu wrote:
> Design
> ===================
>
> I started with a naive/stupid design that I always pass all 1's to the
> KVM for a memory range to clear all the dirty bits within that memory
> range, but then I encountered guest oops - it's simply because we
> can't clear any dirty bit from QEMU if we are not _sure_ that the bit
> is dirty in the kernel. Otherwise we might accidentally clear a bit
> that we don't even know of (e.g., the bit was clear in migration's
> dirty bitmap in QEMU) but actually that page was just being written so
> QEMU will never remember to migrate that new page again.
>
> The new design is focused on a dirty bitmap cache within the QEMU kvm
> layer (which is per kvm memory slot). With that we know what's dirty
> in the kernel previously (note! the kernel bitmap is still growing all
> the time so the cache will only be a subset of the realtime kernel
> bitmap but that's far enough for us) and with that we'll be sure to
> not accidentally clear unknown dirty pages.
>
> With this method, we can also avoid race when multiple users (e.g.,
> DIRTY_MEMORY_VGA and DIRTY_MEMORY_MIGRATION) want to clear the bit for
> multiple time. If without the kvm memory slot cached dirty bitmap we
> won't be able to know which bit has been cleared and then if we send
> the CLEAR operation upon the same bit twice (or more) we can still
> face the same issue to clear something accidentally while we
> shouldn't.
>
> Summary: we really need to be careful on what bit to clear otherwise
> we can face anything after the migration completes. And I hope this
> series has considered all about this.

The disadvantage of this is that you won't clear in the kernel those dirty bits that come from other sources (e.g. vhost or address_space_map). This can lead to double-copying of pages.

Migration already makes a local copy in rb->bmap, and memory_region_snapshot_and_clear_dirty can also do the clear.
Would it be possible to invoke the clear using rb->bmap instead of the KVMSlot's new bitmap?

Thanks,

Paolo

> Besides the new KVM cache layer and the new ioctl support, this series
> introduced the memory_region_clear_dirty_bitmap() in the memory API
> layer to allow clearing dirty bits of a specific memory range within
> the memory region.
On Wed, May 08, 2019 at 12:09:00PM +0200, Paolo Bonzini wrote:
> On 08/05/19 01:15, Peter Xu wrote:
> > Design
> > ===================
> >
> > I started with a naive/stupid design that I always pass all 1's to the
> > KVM for a memory range to clear all the dirty bits within that memory
> > range, but then I encountered guest oops - it's simply because we
> > can't clear any dirty bit from QEMU if we are not _sure_ that the bit
> > is dirty in the kernel. Otherwise we might accidentally clear a bit
> > that we don't even know of (e.g., the bit was clear in migration's
> > dirty bitmap in QEMU) but actually that page was just being written so
> > QEMU will never remember to migrate that new page again.
> >
> > The new design is focused on a dirty bitmap cache within the QEMU kvm
> > layer (which is per kvm memory slot). With that we know what's dirty
> > in the kernel previously (note! the kernel bitmap is still growing all
> > the time so the cache will only be a subset of the realtime kernel
> > bitmap but that's far enough for us) and with that we'll be sure to
> > not accidentally clear unknown dirty pages.
> >
> > With this method, we can also avoid race when multiple users (e.g.,
> > DIRTY_MEMORY_VGA and DIRTY_MEMORY_MIGRATION) want to clear the bit for
> > multiple time. If without the kvm memory slot cached dirty bitmap we
> > won't be able to know which bit has been cleared and then if we send
> > the CLEAR operation upon the same bit twice (or more) we can still
> > face the same issue to clear something accidentally while we
> > shouldn't.
> >
> > Summary: we really need to be careful on what bit to clear otherwise
> > we can face anything after the migration completes. And I hope this
> > series has considered all about this.
>
> The disadvantage of this is that you won't clear in the kernel those
> dirty bits that come from other sources (e.g. vhost or
> address_space_map). This can lead to double-copying of pages.
>
> Migration already makes a local copy in rb->bmap, and
> memory_region_snapshot_and_clear_dirty can also do the clear. Would it
> be possible to invoke the clear using rb->bmap instead of the KVMSlot's
> new bitmap?

Do you mean to invoke the clear notifications and deliver the bitmap from the very top of the stack (for VGA it would probably be the DirtyBitmapSnapshot; for MIGRATION it is rb->bmap)?

Actually that is what I did in the first version before I posted this work, but I noticed that there seems to be a race condition in that design. The problem is that we have multiple copies of the same dirty bitmap from KVM, and the race can happen between those multiple users (the users' bitmaps can be merged versions containing KVM and other sources like vhost, address_space_map, etc., but let's keep it simple and leave those out for now).

Here's one possible race condition (assuming migration has started):

(1) page P is written by the guest (let's version its data as P1); its dirty bit D is set in KVM.

(2) QEMU syncs the dirty log and fetches D for page P (which is set). D is not cleared in KVM due to the MANUAL_PROTECT cap.

(3) QEMU copies the D bit to all users of the dirty bitmap (MIGRATION and VGA, for example).

(4) MIGRATION code collects bit D in the migration bitmap, clears it, and sends CLEAR to KVM to clear the bit on the kernel side (so D in KVM is cleared too), then sends page P with content P1 to the destination. CAUTION: at this point the VGA bitmap still has the D bit set.

(5) page P is written by the guest again (let's version its data as P2); bit D is set again in KVM.

(6) VGA code collects bit D in the VGA bitmap (note: we haven't synced the log again, so this is the _old_ dirty bit from step 3), clears it, and sends CLEAR to KVM to clear the bit on the kernel side. Then the D bit in KVM is cleared again. At this point the D bit is clear in all bitmaps (MIGRATION, VGA, KVM).

(7) page P is never written again until migration completes (so no matter whether we sync again, the D bit stays clear).
(8) On the destination VM, page P contains content P1 rather than P2: this is data loss...

After I noticed this, I added the KVM bitmap layer with the lock to make sure we won't send the second clear at step (6) above: with the per-memslot bitmap we can know that a bit has already been cleared by one user, so we don't clear it again (as long as we don't SYNC again).

However, later on I noticed that maybe I don't even need the top-layer bitmap at all, because it doesn't seem to bring much benefit but does bring a lot more complexity: I would need to do bitmap conversions (the upper-layer bitmaps are per ramblock, while the KVM bitmaps are per memslot), not to mention I would need to AND the two bitmaps (the upper-layer bitmap and the QEMU KVM cached bitmap) to avoid the above race. So in the end I simply removed the top-layer bitmap and kept only the per-memslot bitmap, as this series does.

Thanks,

--
Peter Xu
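The scenario in steps (1)-(8) can be boiled down to a toy single-page model. The sketch below is a simulation with illustrative names, not QEMU code: the per-memslot cached bit gates the CLEAR ioctl, so the stale VGA clear at step (6) becomes a no-op and the redirtied page survives.

```c
#include <stdbool.h>

/* Toy model of one page's dirty bit D plus QEMU's per-memslot cache. */
static bool kernel_d;   /* bit D as KVM sees it */
static bool cache_d;    /* QEMU's cached copy for this memslot */

static void guest_write(void) { kernel_d = true; }

/* Sync under manual-protect: fetch the bit but do not clear it in KVM. */
static void sync_log(void) { cache_d |= kernel_d; }

/* A user's clear is forwarded to KVM_CLEAR_DIRTY_LOG only if the cache
 * proves the bit was dirty; the cache bit is consumed in the process. */
static void user_clear(void)
{
    if (cache_d) {
        cache_d = false;
        kernel_d = false;
    }
}

/* Replay steps (1)-(6); returns true iff D survives, i.e. the page
 * redirtied at step (5) would still be migrated. */
static bool race_scenario_survives(void)
{
    kernel_d = cache_d = false;
    guest_write();   /* (1) P1 written, D set in KVM */
    sync_log();      /* (2)+(3) both MIGRATION and VGA see D */
    user_clear();    /* (4) MIGRATION clears D, sends P1 */
    guest_write();   /* (5) P2 written, D set again */
    user_clear();    /* (6) VGA's stale clear: cache empty, no-op */
    return kernel_d;
}
```

Without the cache guard, the call at step (6) would reach KVM and knock out the bit set at step (5), reproducing the P1/P2 data loss described above.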
On 08/05/19 06:39, Peter Xu wrote:
>> The disadvantage of this is that you won't clear in the kernel those
>> dirty bits that come from other sources (e.g. vhost or
>> address_space_map). This can lead to double-copying of pages.
>>
>> Migration already makes a local copy in rb->bmap, and
>> memory_region_snapshot_and_clear_dirty can also do the clear. Would it
>> be possible to invoke the clear using rb->bmap instead of the KVMSlot's
>> new bitmap?
>
> Actually that's what I did in the first version before I post the work
> but I noticed that there seems to have a race condition with the
> design. The problem is we have multiple copies of the same dirty
> bitmap from KVM and the race can happen with those multiple users
> (bitmaps of the users can be a merged version containing KVM and other
> sources like vhost, address_space_map, etc. but let's just make it
> simpler to not have them yet).

I see now. And in fact the same double-copying inefficiency happens already without this series, so you are improving the situation anyway.

Have you done any kind of benchmarking already?

Thanks,

Paolo

> (1) page P is written by the guest (let's version its data as P1),
> its dirty bit D is set in KVM
>
> (2) QEMU sync dirty log, fetch D for page P (which is set). D is
> not cleared in KVM due to MANUAL_PROTECT cap.
>
> (3) QEMU copies the D bit to all users of dirty bmap (MIGRATION and
> VGA as example).
>
> (4) MIGRATION code collects bit D in migration bitmap, clear it,
> send CLEAR to KVM to clear the bit on remote (then D in KVM is
> cleared too), send the page P with content P1 to destination.
> CAUTION: when reach here VGA bitmap still has the D bit set.
>
> (5) page P is written by the guest again (let's version its data as
> P2), bit D set again in KVM.
>
> (6) VGA code collectes bit D (note! we haven't synced the log again
> so this is the _old_ dirty bit came from step 3 above) in VGA
> bitmap, clear it, send CLEAR to KVM to clear the bit on remote.
> Then D bit in KVM is cleared again. Until here, D bit in all
> bitmaps are cleared (MIGRATION, VGA, KVM).
>
> (7) page P is never written again until migration completes (no
> matter whether we sync again D bit will be cleared).
>
> (8) On destination VM page P will contain content P1 rather than P2,
> this is data loss...
On Wed, May 08, 2019 at 01:55:07PM +0200, Paolo Bonzini wrote:
> On 08/05/19 06:39, Peter Xu wrote:
> >> The disadvantage of this is that you won't clear in the kernel those
> >> dirty bits that come from other sources (e.g. vhost or
> >> address_space_map). This can lead to double-copying of pages.
> >>
> >> Migration already makes a local copy in rb->bmap, and
> >> memory_region_snapshot_and_clear_dirty can also do the clear. Would it
> >> be possible to invoke the clear using rb->bmap instead of the KVMSlot's
> >> new bitmap?
> >
> > Actually that's what I did in the first version before I post the work
> > but I noticed that there seems to have a race condition with the
> > design. The problem is we have multiple copies of the same dirty
> > bitmap from KVM and the race can happen with those multiple users
> > (bitmaps of the users can be a merged version containing KVM and other
> > sources like vhost, address_space_map, etc. but let's just make it
> > simpler to not have them yet).
>
> I see now. And in fact the same double-copying inefficiency happens
> already without this series, so you are improving the situation anyway.
>
> Have you done any kind of benchmarking already?

Not yet. I posted the series for some initial reviews first, before moving on to performance tests.

My plan for the test scenario would be:

- find a guest with relatively large memory (I would guess something like 64G or more is needed to show a big difference)

- run a random dirty memory workload over most of the memory, with dirty rate X Bps

- set the migration bandwidth to Y Bps (Y should be bigger than X, but not by much; one choice could be X=800M and Y=1G to emulate a 10G NIC with a workload that can still converge with precopy only) and start precopy migration

- measure the total migration time with CLEAR_LOG on and off
We should expect the following with CLEAR_LOG: (1) the guest does not hang during log_sync, and (2) migration completes faster.

Does the above test plan make sense? If the QEMU/KVM changes look OK in general, I can at least try this on some smaller guests (I can manage ~10G mem guests with my own hosts, but I can also try to find some bigger ones).

Thanks,

--
Peter Xu
On Thu, May 09, 2019 at 10:33:19AM +0800, Peter Xu wrote:
> On Wed, May 08, 2019 at 01:55:07PM +0200, Paolo Bonzini wrote:
> > On 08/05/19 06:39, Peter Xu wrote:
> > >> The disadvantage of this is that you won't clear in the kernel those
> > >> dirty bits that come from other sources (e.g. vhost or
> > >> address_space_map). This can lead to double-copying of pages.
> > >>
> > >> Migration already makes a local copy in rb->bmap, and
> > >> memory_region_snapshot_and_clear_dirty can also do the clear. Would it
> > >> be possible to invoke the clear using rb->bmap instead of the KVMSlot's
> > >> new bitmap?
> > >
> > > Actually that's what I did in the first version before I post the work
> > > but I noticed that there seems to have a race condition with the
> > > design. The problem is we have multiple copies of the same dirty
> > > bitmap from KVM and the race can happen with those multiple users
> > > (bitmaps of the users can be a merged version containing KVM and other
> > > sources like vhost, address_space_map, etc. but let's just make it
> > > simpler to not have them yet).
> >
> > I see now. And in fact the same double-copying inefficiency happens
> > already without this series, so you are improving the situation anyway.
> >
> > Have you done any kind of benchmarking already?
>
> Not yet. I posted the series for some initial reviews first before
> moving on with performance tests.
>
> My plan of the test scenario could be:
>
> - find a guest with relatively large memory (I would guess it is
> better to have memory like 64G or even more to make some big
> difference)
>
> - run random dirty memory workload upon most of the mem, with dirty
> rate X Bps.
>
> - setup the migration bandwidth to Y Bps (Y should be bigger than X
> but not that big. One could be X=800M and Y=1G to emulate 10G nic
> with a workload that we can still converge with precopy only) and
> start precopy migration.
>
> - measure total migration time with CLEAR_LOG on & off.
> We should
> expect the guest to have these with CLEAR_LOG: (1) not hang during
> log_sync, and (2) migration should complete faster.

Some updates on performance numbers.

Summary: the ideal case below shows ~40% or even more of the migration time cut for the same VM with the same workload. In other words, it can be seen as ~40% faster than before.

Test environment: 13G guest, 10G test mem (so I leave 3G untouched), dirty rate 900MB/s, BW 10Gbps to emulate ixgbe, downtime 100ms.

IO pattern: I pre-fault all of the 10G mem, then do random writes (with the command "mig_mon mm_dirty 10240 900 random" [1]) upon the test memory at a constant dirty rate (900MB/s, as mentioned). Then I migrate during the IOs.

Here's the total migration time of such a VM (for each scenario I run the migration 5 times, then take the average total migration time):

|--------------+---------------------+-------------|
| scenario     | migration times (s) | average (s) |
|--------------+---------------------+-------------|
| no CLEAR_LOG | 55, 54, 56, 74, 54  |          58 |
| 1G chunk     | 40, 39, 41, 39, 40  |          40 |
| 128M chunk   | 38, 40, 37, 40, 38  |          38 |
| 16M chunk    | 42, 40, 38, 41, 38  |          39 |
| 1M chunk     | 37, 40, 36, 40, 39  |          38 |
|--------------+---------------------+-------------|

The first scenario, "no CLEAR_LOG", is the master branch, which still uses GET_DIRTY only. The other four scenarios all use the new CLEAR_LOG interface, i.e., this series.

The test result shows that a 128M chunk size seems to be a better default value than 1G (which this series used). I'll adjust that accordingly when I post the next version.

[1] https://github.com/xzpeter/clibs/blob/master/bsd/mig_mon/mig_mon.c

Regards,

--
Peter Xu
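For reference, the chunked clearing that patch 11 introduces and these scenarios vary can be sketched as a simple loop (names are hypothetical; the chunk size corresponds to the new parameter discussed above, where 128M now looks like a better default than 1G):

```c
#include <stdint.h>

/* Stand-in for the per-chunk KVM_CLEAR_DIRTY_LOG call; the real code
 * would clear the dirty bits covering [start, start + len) here. */
static void clear_one_chunk(uint64_t start, uint64_t len)
{
    (void)start;
    (void)len;
}

/* Split one big log_clear over [start, start + size) into chunk-sized
 * pieces (chunk must be nonzero) so each kernel call covers a bounded
 * range; returns how many clear calls were issued. */
static unsigned log_clear_chunked(uint64_t start, uint64_t size,
                                  uint64_t chunk)
{
    unsigned calls = 0;
    while (size) {
        uint64_t len = size < chunk ? size : chunk;
        clear_one_chunk(start, len);
        start += len;
        size -= len;
        calls++;
    }
    return calls;
}
```

A smaller chunk means more ioctl calls but shorter write-protection bursts per call, which is the trade-off the table above is probing.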