This is v5 of the qemu dirty ring interface support.
v5:
- rebase
- dropped patch "update-linux-headers: Include const.h" after rebase
- dropped patch "KVM: Fixup kvm_log_clear_one_slot() ioctl return check" since
  a similar patch got merged recently (38e0b7904eca7cd32f8953c3)
========= v4 cover letter below =============
It is essentially the same as v3 content-wise, but there are a few things to
mention besides the rebase itself:
- I picked up two patches from Eric Farman for the linux-header updates (from
  Eric's v3 series) for convenience, just in case either series gets queued
  by a maintainer.
- One more patch is added: "KVM: Disable manual dirty log when dirty ring
  enabled". I found, when testing the branch after rebasing to the latest
  qemu, that not only is the manual dirty log capability unnecessary for kvm
  dirty ring, but more importantly INITIALLY_ALL_SET works totally against
  kvm dirty ring and could silently crash the guest after migration. For
  this new commit, I touched up "KVM: Add dirty-gfn-count property" a bit.
- A few more documentation lines in qemu-options.hx.
- I removed the RFC tag after the kernel series got merged.
Again, this is only the 1st step to support dirty ring. Ideally dirty ring
should grant QEMU the possibility to remove the whole layered dirty bitmap,
so that dirty ring will work similarly to auto-converge, but better: we would
just throttle vcpus with the dirty ring kvm exit rather than explicitly
adding a timer to stop the vcpu thread from entering the guest again (like
what we do with current migration auto-converge). More information can be
found in the kvm forum 2020 talk on kvm dirty ring (slides 21/22 [1]).
That next step (removing all the dirty bitmaps, as mentioned above) is still
up for discussion: firstly, I don't know whether there's anything I've
overlooked there. Meanwhile, it also only serves huge-VM cases, and may not
help much in the many common scenarios where VMs are not that huge.
There are probably other ways to fix huge VM migration issues, mainly
focusing on responsiveness and convergence. For example, Google has proposed
a new userfaultfd kernel capability called "minor modes" [2] to track minor
page faults, and that could eventually serve the same purpose too via
postcopy. That's another long story so I'll stop here, but I leave it as a
marker alongside the dirty ring series so there's still a record to reference.
That said, I still think this series is well worth merging even if we don't
pursue the next steps yet, since dirty ring is disabled by default and we can
always build upon this series.
Please review, thanks.
V3: https://lore.kernel.org/qemu-devel/20200523232035.1029349-1-peterx@redhat.com/
(V3 contains all the pre-v3 changelog)
QEMU branch for testing (requires kernel version 5.11-rc1+):
https://github.com/xzpeter/qemu/tree/kvm-dirty-ring
[1] https://static.sched.com/hosted_files/kvmforum2020/97/kvm_dirty_ring_peter.pdf
[2] https://lore.kernel.org/lkml/20210107190453.3051110-1-axelrasmussen@google.com/
---------------------------8<---------------------------------
Overview
========
KVM dirty ring is a new interface for passing dirty bits from the kernel to
userspace. Instead of using a bitmap for each memory region, the dirty ring
contains an array of dirtied GPAs to fetch, with one ring per vcpu.
There are a few major changes compared to how the old dirty logging
interface works:
- Granularity of dirty bits
  The KVM dirty ring interface does not offer memory-region-level
  granularity for collecting dirty bits (i.e., per KVM memory slot).
  Instead, dirty bits are collected globally for all the vcpus at once.
  The major effect is on the VGA part, because VGA dirty tracking is
  enabled as soon as the device is created, and it used to work at memory
  region granularity; now that operation is amplified into a whole-VM
  sync. Maybe there's a smarter way to do the same thing in VGA with the
  new interface, but so far I don't see it affecting much, at least on
  regular VMs.
- Collection of dirty bits
  The old dirty logging interface collects KVM dirty bits at the moment of
  synchronization. The KVM dirty ring interface instead uses a standalone
  thread to do that, so when another thread (e.g., the migration thread)
  wants to synchronize the dirty bits, it simply kicks that thread and
  waits until it has flushed all the dirty bits to the ramblock dirty
  bitmap.
A new parameter "dirty-ring-size" is added to "-accel kvm". By default,
dirty ring is still disabled (size==0). To enable it, use:
-accel kvm,dirty-ring-size=65536
This establishes a 64K dirty ring buffer per vcpu. If we then migrate,
it'll switch to dirty ring.
I gave it a shot with a 24G guest, 8 vcpus, using a 10g NIC as the
migration channel. When the guest is idle or the dirty workload is small, I
don't observe a major difference in total migration time. With a higher
random dirty workload (800MB/s dirty rate over 20G of memory), kvm dirty
ring fares worse. Total migration times (ping-pong migrating 6 times, in
seconds):
|-------------------------+---------------|
| dirty ring (4k entries) | dirty logging |
|-------------------------+---------------|
| 70 | 58 |
| 78 | 70 |
| 72 | 48 |
| 74 | 52 |
| 83 | 49 |
| 65 | 54 |
|-------------------------+---------------|
Summary:
dirty ring average: 73s
dirty logging average: 55s
KVM dirty ring is slower in the above case. The numbers may show that dirty
logging is still preferable as the default, because small/medium VMs are
still the major use case and high dirty workloads happen frequently too.
And that's what this series does.
TODO:
- Consider dropping the BQL dependency: then we can run the reaper thread in
  parallel with the main thread. Needs some thought around the race
  conditions.
- Consider dropping the kvmslot bitmap: logically this can be dropped with
  kvm dirty ring, not only to save space, but also because it's yet another
  layer linear in guest memory size, which goes against the whole idea of
  kvm dirty ring. This should make the above numbers (for kvm dirty ring)
  even smaller (though still perhaps not as good as dirty logging under
  such a high workload).
Please refer to the code and comments themselves for more information.
Thanks,
Peter Xu (10):
memory: Introduce log_sync_global() to memory listener
KVM: Use a big lock to replace per-kml slots_lock
KVM: Create the KVMSlot dirty bitmap on flag changes
KVM: Provide helper to get kvm dirty log
KVM: Provide helper to sync dirty bitmap from slot to ramblock
KVM: Simplify dirty log sync in kvm_set_phys_mem
KVM: Cache kvm slot dirty bitmap size
KVM: Add dirty-gfn-count property
KVM: Disable manual dirty log when dirty ring enabled
KVM: Dirty ring support
accel/kvm/kvm-all.c | 585 +++++++++++++++++++++++++++++++++------
accel/kvm/trace-events | 7 +
include/exec/memory.h | 12 +
include/hw/core/cpu.h | 8 +
include/sysemu/kvm_int.h | 7 +-
qemu-options.hx | 12 +
softmmu/memory.c | 33 ++-
7 files changed, 565 insertions(+), 99 deletions(-)
--
2.26.2
On Wed, Mar 10, 2021 at 03:32:51PM -0500, Peter Xu wrote:
> This is v5 of the qemu dirty ring interface support.
[...]

Ping - I think it missed 6.0 soft freeze, so it's a ping about review
comments, then hopefully it can catch the train for 6.1, still?

Thanks,

-- 
Peter Xu
Hi Peter,

On 2021/3/11 4:32, Peter Xu wrote:
> This is v5 of the qemu dirty ring interface support.
[...]
> Some more information could also be found in the kvm forum 2020 talk
> regarding kvm dirty ring (slides 21/22 [1]).

I have read this pdf and code, and I have some questions, hope you can
help me. :)

You emphasize that dirty ring is a "Thread-local buffers", but the dirty
bitmap is global, and I don't see that the ring has any optimization
around "locking" compared to the dirty bitmap. Thread-local means that a
vCPU can flush the hardware buffer into the dirty ring without locking,
but for the bitmap, a vCPU can also use an atomic set to mark pages dirty
without locking. Maybe I miss something?

The second question is that you observed longer migration time (55s->73s)
when the guest has 24G ram and the dirty rate is 800M/s. I am not clear
about the reason. With dirty ring enabled, Qemu can get dirty info faster,
which means it handles dirty pages more quickly, and the guest can be
throttled, which means dirty pages are generated more slowly. What's the
rationale for the longer migration time?

PS: As the dirty ring is still converted into the dirty_bitmap of the
kvm_slot, the "get dirty info faster" part may not be true. :-(

Thanks,
Keqian
On Mon, Mar 22, 2021 at 10:02:38PM +0800, Keqian Zhu wrote:
> Hi Peter,

Hi, Keqian,

[...]

> You emphasize that dirty ring is a "Thread-local buffers", but the dirty
> bitmap is global, and I don't see that the ring has any optimization
> around "locking" compared to the dirty bitmap.
>
> Thread-local means that a vCPU can flush the hardware buffer into the
> dirty ring without locking, but for the bitmap, a vCPU can also use an
> atomic set to mark pages dirty without locking. Maybe I miss something?

Yes, the atomic ops guarantee safety as you said, but afaiu atomics are
already expensive, since at least on x86 I think they need to lock the
memory bus. IIUC that'll become even slower as core counts grow, as long
as the cores share the memory bus.

The KVM dirty ring is per-vcpu, which means its metadata can be modified
locally without atomicity at all (but still, we'll need
READ_ONCE/WRITE_ONCE to guarantee ordering of memory accesses). It should
scale better, especially on hosts with lots of cores.

> The second question is that you observed longer migration time (55s->73s)
> when the guest has 24G ram and the dirty rate is 800M/s. I am not clear
> about the reason. [...] What's the rationale for the longer migration
> time?

Because dirty ring is more sensitive to dirty rate, while dirty bitmap is
more sensitive to memory footprint. In the above 24G mem + 800MB/s dirty
rate condition, dirty bitmap seems to be more efficient; say, collecting
the dirty bitmap of 24G mem (24G/4K/8=0.75MB) for each migration cycle is
fast enough.

Not to mention that the current implementation of dirty ring in QEMU is
not complete - we still have two more layers of dirty bitmap, so it's
actually a mixture of dirty bitmap and dirty ring. This series is more
like a POC of the dirty ring interface, so as to let QEMU be able to run
on KVM dirty ring. E.g., we won't have hang issues when getting dirty
pages since it's totally async; however we'll still have some legacy dirty
bitmap issues, e.g., memory consumption of userspace dirty bitmaps is
still linear in memory footprint.

Moreover, IMHO another important feature that dirty ring provides is
actually the full-exit, where we can pause a vcpu when it dirties too
fast, while other vcpus won't be affected. That's something I really
wanted to POC too, but I don't have enough time. I think it's a worthwhile
project in the future to really make the full-exit throttle vcpus; then
ideally we'll remove all the dirty bitmaps in QEMU as long as dirty ring
is on.

So I'd say the numbers I got at that time are not really helping a lot -
as you can see, for small VMs it won't make things faster. Maybe a bit
more efficient? I can't tell. Design-wise it still looks better. However
dirty logging still has the reasoning to be the default interface we use
for small VMs, imho.

> PS: As the dirty ring is still converted into the dirty_bitmap of the
> kvm_slot, the "get dirty info faster" part may not be true. :-(

We can get dirty info faster even now, I think, because previously we only
did KVM_GET_DIRTY_LOG once per migration iteration, which could be tens of
seconds for a VM like the one mentioned above with 24G and an 800MB/s
dirty rate. Dirty ring is fully async; we'll get that after the reaper
thread timeout. However I must also confess "get dirty info faster"
doesn't help us a lot on anything yet, afaict, compared to a full-featured
dirty logging with clear dirty log and so on.

Hope above helps.

Thanks,

-- 
Peter Xu
Hi Peter,

On 2021/3/23 3:45, Peter Xu wrote:
> Yes, the atomic ops guarantee safety as you said, but afaiu atomics are
> already expensive, since at least on x86 I think they need to lock the
> memory bus. [...]
>
> The KVM dirty ring is per-vcpu, which means its metadata can be modified
> locally without atomicity at all [...] It should scale better, especially
> on hosts with lots of cores.

That makes sense to me.

> Because dirty ring is more sensitive to dirty rate, while dirty bitmap is
> more sensitive to memory footprint.

Emm... Sorry, I'm not very clear about this... I think that a higher dirty
rate doesn't cause slower dirty_log_sync compared to that of the legacy
bitmap mode. Besides, a higher dirty rate means we may have more
full-exits, which can properly limit the dirty rate. So it seems that dirty
ring "prefers" a higher dirty rate.

> Not to mention that the current implementation of dirty ring in QEMU is
> not complete - we still have two more layers of dirty bitmap, so it's
> actually a mixture of dirty bitmap and dirty ring. [...]

The plan looks good and coordinated, but I have a concern. Our dirty ring
actually depends on the structure of the hardware logging buffer (the PML
buffer). We can't say it can be properly adapted to all kinds of hardware
designs in the future.

> Moreover, IMHO another important feature that dirty ring provides is
> actually the full-exit, where we can pause a vcpu when it dirties too
> fast, while other vcpus won't be affected.

I think a proper pause time is hard to decide. A short time may have little
throttling effect, but a long time may have a heavy effect on the guest. Do
you have a good algorithm?

> So I'd say the numbers I got at that time are not really helping a lot
> [...]

I see.

> We can get dirty info faster even now, I think [...]

OK.

> Hope above helps.

Sure, thanks. :)

Keqian
Keqian,

On Tue, Mar 23, 2021 at 02:40:43PM +0800, Keqian Zhu wrote:
> Emm... Sorry, I'm not very clear about this... I think that a higher
> dirty rate doesn't cause slower dirty_log_sync compared to that of the
> legacy bitmap mode. Besides, a higher dirty rate means we may have more
> full-exits, which can properly limit the dirty rate. So it seems that
> dirty ring "prefers" a higher dirty rate.

When I measured the 800MB/s it's in the guest, after throttling.

Imagine another example: a VM with 1G memory that keeps dirtying at 10GB/s.
Dirty logging will need to collect even less for each iteration because the
memory size shrank, and collect even less frequently due to the high dirty
rate; however dirty ring will use 100% cpu power to collect dirty pages,
because the ring keeps filling up.

> The plan looks good and coordinated, but I have a concern. Our dirty ring
> actually depends on the structure of the hardware logging buffer (the PML
> buffer). We can't say it can be properly adapted to all kinds of hardware
> designs in the future.

Sorry I don't get it - dirty ring can work with pure page wr-protect too?

> I think a proper pause time is hard to decide. A short time may have
> little throttling effect, but a long time may have a heavy effect on the
> guest. Do you have a good algorithm?

That's the next thing we can discuss. IMHO the dirty ring is nice already
because we can measure dirty rate per-vcpu, and also we can throttle at
vcpu granularity. That's something required for a good algorithm; say, we
shouldn't block a vcpu when it has a small dirty rate, and in many cases
that's the case for e.g. UI threads. Any algorithm should be based on these
facts.

Thanks,

-- 
Peter Xu
Hi Peter,

On 2021/3/23 22:34, Peter Xu wrote:
> When I measured the 800MB/s it's in the guest, after throttling.
>
> Imagine another example: a VM with 1G memory that keeps dirtying at
> 10GB/s. Dirty logging will need to collect even less for each iteration
> because the memory size shrank, and collect even less frequently due to
> the high dirty rate; however dirty ring will use 100% cpu power to
> collect dirty pages, because the ring keeps filling up.

Looks good.

We have many places that collect dirty pages: the background reaper, the
vCPU exit handler, and the migration thread. I think migration time is
closely related to the migration thread.

The migration thread calls kvm_dirty_ring_flush().
1. kvm_cpu_synchronize_kick_all() waits for vcpus to handle the full-exit.
2. kvm_dirty_ring_reap() collects and resets dirty pages.
The above two operations will spend more time with a higher dirty rate.

But I suddenly realize that the key problem may not be this. Though we have
a separate "reset" operation for the dirty ring, it is actually performed
right after we collect the dirty ring into the kvmslot. So dirty ring mode
is like legacy bitmap mode without manual_dirty_clear.

If we could "reset" the dirty ring just before we really handle the dirty
pages, we could have a shorter migration time. But the design of the dirty
ring doesn't allow this, because we must perform the reset to make free
space...

> Sorry I don't get it - dirty ring can work with pure page wr-protect too?

Sure, it can. I just want to discuss the many possible kinds of hardware
logging buffer. However, I'd like to stop at this; at least dirty ring
works well with PML. :)

> That's the next thing we can discuss. IMHO the dirty ring is nice already
> because we can measure dirty rate per-vcpu, and also we can throttle at
> vcpu granularity. [...]

OK.

Thanks,
Keqian
On Wed, Mar 24, 2021 at 10:56:22AM +0800, Keqian Zhu wrote: > Hi Peter, > > On 2021/3/23 22:34, Peter Xu wrote: > > Keqian, > > > > On Tue, Mar 23, 2021 at 02:40:43PM +0800, Keqian Zhu wrote: > >>>> The second question is that you observed longer migration time (55s->73s) when guest > >>>> has 24G ram and dirty rate is 800M/s. I am not clear about the reason. As with dirty > >>>> ring enabled, Qemu can get dirty info faster which means it handles dirty page more > >>>> quick, and guest can be throttled which means dirty page is generated slower. What's > >>>> the rationale for the longer migration time? > >>> > >>> Because dirty ring is more sensitive to dirty rate, while dirty bitmap is more > >> Emm... Sorry that I'm very clear about this... I think that higher dirty rate doesn't cause > >> slower dirty_log_sync compared to that of legacy bitmap mode. Besides, higher dirty rate > >> means we may have more full-exit, which can properly limit the dirty rate. So it seems that > >> dirty ring "prefers" higher dirty rate. > > > > When I measured the 800MB/s it's in the guest, after throttling. > > > > Imagine another example: a VM has 1G memory keep dirtying with 10GB/s. Dirty > > logging will need to collect even less for each iteration because memory size > > shrinked, collect even less frequent due to the high dirty rate, however dirty > > ring will use 100% cpu power to collect dirty pages because the ring keeps full. > Looks good. > > We have many places to collect dirty pages: the background reaper, vCPU exit handler, > and the migration thread. I think migration time is closely related to the migration thread. > > The migration thread calls kvm_dirty_ring_flush(). > 1. kvm_cpu_synchronize_kick_all() will wait vcpu handles full-exit. > 2. kvm_dirty_ring_reap() collects and resets dirty pages. > The above two operation will spend more time with higher dirty rate. > > But I suddenly realize that the key problem maybe not at this. 
> Though we have a separate "reset" operation for the dirty ring, it is
> actually performed right after we collect the dirty ring into the
> kvmslot. So dirty ring mode is like legacy bitmap mode without
> manual_dirty_clear.
>
> If we could "reset" the dirty ring just before we really handle the
> dirty pages, we could have a shorter migration time. But the design of
> the dirty ring doesn't allow this, because we must perform the reset to
> make free space...

This is a very good point.

Dirty ring should have been better in quite some ways already, but from
that pov, as you said, it goes a bit backwards on reprotection of pages
(not to mention that currently we can't even reset the ring per-vcpu;
that seems to not fully match the locality that the rings provide either;
but Paolo and I discussed that issue, it's about TLB flush expensiveness,
so we still need to think more about it..).

Ideally the ring could have been both per-vcpu and bi-directional (then
we'd have 2*N rings, N=vcpu number), so as to split the state transition
into a "dirty ring" and a "reprotect ring"; that reprotect ring would
then be the clear dirty log. That would look more like virtio's used
ring. However, we'd still need to think about the TLB flush issue too,
as Paolo used to mention, since that will exist with any per-vcpu flush
model (each reprotect of a page will need a tlb flush on all vcpus).

Or.. maybe we can make the flush ring a standalone one, so we'd need N
dirty rings and one global flush ring.

Anyway.. Before that, I'd still think the next step should be how to
integrate qemu so as to fully leverage the current ring model and be able
to throttle in a per-vcpu fashion.

The major issues (IMHO) with huge VM migration are:

1. Convergence
2. Responsiveness

Here we'll have a chance to solve (1) by heavily throttling the working
vcpu threads, meanwhile still keeping (2) by not throttling
user-interactive threads. I'm not sure whether this will really work as
expected, but that shows what I'm thinking about.
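The bi-directional "2*N rings" idea above can be sketched as a toy data structure (a purely hypothetical model of the proposal, not an existing KVM interface): per vcpu, one ring carries dirty GFNs from kernel to userspace, and a second "reprotect" ring lets userspace defer write-protection until the pages have actually been copied.

```python
from collections import deque

# Hypothetical 2*N ring model: per vcpu, a dirty ring (kernel -> userspace)
# plus a reprotect ring (userspace -> kernel). The point is that pages are
# write-protected again only after migration has copied them, instead of
# immediately at harvest time as the current reset semantics require.

class VcpuRings:
    def __init__(self):
        self.dirty = deque()      # filled by "KVM" on write faults
        self.reprotect = deque()  # filled by userspace after copying pages

    def harvest(self):
        """Userspace drains the dirty ring without reprotecting anything,
        so the vcpu immediately has free ring space again."""
        pages = list(self.dirty)
        self.dirty.clear()
        return pages

    def done_with(self, pages):
        """After copying, queue pages for reprotection. Consuming this
        ring in the kernel would still need a TLB flush on all vcpus,
        which is the cost the thread worries about."""
        self.reprotect.extend(pages)

r = VcpuRings()
r.dirty.extend([0x1000, 0x2000])
copied = r.harvest()       # collect without reprotecting
r.done_with(copied)        # reprotect only once the pages are handled
```

Splitting harvest from reprotect is what makes this analogous to virtio's avail/used ring pair, as the mail suggests.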
These may not matter a lot yet with further improvement of the ring reset
mechanism, which definitely sounds even better, but seems orthogonal.

That's also why I think we should still merge this series first as a
foundation for the rest.

> >>> sensitive to memory footprint. In the above 24G mem + 800MB/s dirty
> >>> rate condition, dirty bitmap seems to be more efficient; say,
> >>> collecting the dirty bitmap of 24G mem (24G/4K/8=0.75MB) for each
> >>> migration cycle is fast enough.
> >>>
> >>> Not to mention that the current implementation of dirty ring in QEMU
> >>> is not complete - we still have two more layers of dirty bitmap, so
> >>> it's actually a mixture of dirty bitmap and dirty ring. This series
> >>> is more like a POC on the dirty ring interface, so as to let QEMU be
> >>> able to run on KVM dirty ring. E.g., we won't have the hang issue
> >>> when getting dirty pages since it's totally async; however we'll
> >>> still have some legacy dirty bitmap issues, e.g., memory consumption
> >>> of userspace dirty bitmaps is still linear to memory footprint.
> >> The plan looks good and coordinated, but I have a concern. Our dirty
> >> ring actually depends on the structure of the hardware logging buffer
> >> (PML buffer). We can't say it can be properly adapted to all kinds of
> >> hardware design in the future.
> >
> > Sorry I don't get it - dirty ring can work with pure page wr-protect
> > too?
> Sure, it can. I just want to discuss many possible kinds of hardware
> logging buffer. However, I'd like to stop at this; at least dirty ring
> works well with PML. :)

I see your point. That'll be a good topic at least when we'd like to port
dirty ring to other archs for sure. However, as you see, I hoped we can
start to use dirty ring first, find issues, fix them, even redesign some
of it, and make it really beneficial on at least one arch; then it'll
make more sense to port it, or attract people to port it. :)

QEMU does not have a good solution for huge VM migration yet.
Maybe dirty ring is a good start for it, maybe not (e.g., with uffd minor
mode, postcopy has another chance). We'll see...

Thanks,

--
Peter Xu
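As an aside, the 24G/4K/8 = 0.75MB bitmap figure quoted in the thread is easy to verify:

```python
# Verify the quoted dirty-bitmap size: one bit per 4 KiB page of 24 GiB
# of guest RAM.
ram = 24 * 1024**3           # 24 GiB of guest memory
pages = ram // 4096          # number of 4 KiB pages
bitmap_bytes = pages // 8    # one bit per page
bitmap_mib = bitmap_bytes / (1024 * 1024)
# 0.75 MiB, matching the 24G/4K/8 = 0.75MB figure in the thread.
```

This constant, footprint-proportional cost is why the bitmap wins at high dirty rates on modest memory sizes, as argued above.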
Peter,

On 2021/3/24 23:09, Peter Xu wrote:
> On Wed, Mar 24, 2021 at 10:56:22AM +0800, Keqian Zhu wrote:
>> Hi Peter,
>>
>> On 2021/3/23 22:34, Peter Xu wrote:
>>> Keqian,
>>>
>>> On Tue, Mar 23, 2021 at 02:40:43PM +0800, Keqian Zhu wrote:
>>>>>> The second question is that you observed longer migration time
>>>>>> (55s->73s) when the guest has 24G ram and the dirty rate is 800M/s.
>>>>>> I am not clear about the reason. With dirty ring enabled, QEMU can
>>>>>> get dirty info faster, which means it handles dirty pages more
>>>>>> quickly, and the guest can be throttled, which means dirty pages
>>>>>> are generated more slowly. What's the rationale for the longer
>>>>>> migration time?
>>>>>
>>>>> Because dirty ring is more sensitive to dirty rate, while dirty
>>>>> bitmap is more
>>>> Emm... Sorry that I'm not very clear about this... I think that a
>>>> higher dirty rate doesn't cause slower dirty_log_sync compared to
>>>> that of legacy bitmap mode. Besides, a higher dirty rate means we may
>>>> have more full-exits, which can properly limit the dirty rate. So it
>>>> seems that dirty ring "prefers" a higher dirty rate.
>>>
>>> When I measured the 800MB/s it's in the guest, after throttling.
>>>
>>> Imagine another example: a VM with 1G memory kept dirtying at 10GB/s.
>>> Dirty logging will need to collect even less for each iteration
>>> because the memory size shrank, and collect even less frequently due
>>> to the high dirty rate; however dirty ring will use 100% cpu power to
>>> collect dirty pages because the ring keeps full.
>> Looks good.
>>
>> We have many places to collect dirty pages: the background reaper, the
>> vCPU exit handler, and the migration thread. I think migration time is
>> closely related to the migration thread.
>>
>> The migration thread calls kvm_dirty_ring_flush().
>> 1. kvm_cpu_synchronize_kick_all() waits for vcpus to handle the
>>    full-exit.
>> 2. kvm_dirty_ring_reap() collects and resets dirty pages.
>> The above two operations will spend more time with a higher dirty rate.
>>
>> But I suddenly realized that the key problem may not be here. Though we
>> have a separate "reset" operation for the dirty ring, it is actually
>> performed right after we collect the dirty ring into the kvmslot. So
>> dirty ring mode is like legacy bitmap mode without manual_dirty_clear.
>>
>> If we could "reset" the dirty ring just before we really handle the
>> dirty pages, we could have a shorter migration time. But the design of
>> the dirty ring doesn't allow this, because we must perform the reset to
>> make free space...
>
> This is a very good point.
>
> Dirty ring should have been better in quite some ways already, but from
> that pov, as you said, it goes a bit backwards on reprotection of pages
> (not to mention that currently we can't even reset the ring per-vcpu;
> that seems to not fully match the locality that the rings provide
> either; but Paolo and I discussed that issue, it's about TLB flush
> expensiveness, so we still need to think more about it..).
>
> Ideally the ring could have been both per-vcpu and bi-directional (then
> we'd have 2*N rings, N=vcpu number), so as to split the state transition
> into a "dirty ring" and a "reprotect ring"; that reprotect ring would
> then be the clear dirty log. That would look more like virtio's used
> ring. However, we'd still need to think about the TLB flush issue too,
> as Paolo used to mention, since that will exist with any per-vcpu flush
> model (each reprotect of a page will need a tlb flush on all vcpus).
>
> Or.. maybe we can make the flush ring a standalone one, so we'd need N
> dirty rings and one global flush ring.
Yep, having separate "reprotect" ring(s) is a good idea.

>
> Anyway.. Before that, I'd still think the next step should be how to
> integrate qemu so as to fully leverage the current ring model and be
> able to throttle in a per-vcpu fashion.
>
> The major issues (IMHO) with huge VM migration are:
>
> 1. Convergence
> 2. Responsiveness
>
> Here we'll have a chance to solve (1) by heavily throttling the working
> vcpu threads, meanwhile still keeping (2) by not throttling
> user-interactive threads. I'm not sure whether this will really work as
> expected, but that shows what I'm thinking about. These may not matter
> a lot yet with further improvement of the ring reset mechanism, which
> definitely sounds even better, but seems orthogonal.
>
> That's also why I think we should still merge this series first as a
> foundation for the rest.
I see.

>
>>>>> sensitive to memory footprint. In the above 24G mem + 800MB/s dirty
>>>>> rate condition, dirty bitmap seems to be more efficient; say,
>>>>> collecting the dirty bitmap of 24G mem (24G/4K/8=0.75MB) for each
>>>>> migration cycle is fast enough.
>>>>>
>>>>> Not to mention that the current implementation of dirty ring in QEMU
>>>>> is not complete - we still have two more layers of dirty bitmap, so
>>>>> it's actually a mixture of dirty bitmap and dirty ring. This series
>>>>> is more like a POC on the dirty ring interface, so as to let QEMU be
>>>>> able to run on KVM dirty ring. E.g., we won't have the hang issue
>>>>> when getting dirty pages since it's totally async; however we'll
>>>>> still have some legacy dirty bitmap issues, e.g., memory consumption
>>>>> of userspace dirty bitmaps is still linear to memory footprint.
>>>> The plan looks good and coordinated, but I have a concern. Our dirty
>>>> ring actually depends on the structure of the hardware logging buffer
>>>> (PML buffer). We can't say it can be properly adapted to all kinds of
>>>> hardware design in the future.
>>>
>>> Sorry I don't get it - dirty ring can work with pure page wr-protect
>>> too?
>> Sure, it can. I just want to discuss many possible kinds of hardware
>> logging buffer. However, I'd like to stop at this; at least dirty ring
>> works well with PML. :)
>
> I see your point. That'll be a good topic at least when we'd like to
> port dirty ring to other archs for sure. However, as you see, I hoped
> we can start to use dirty ring first, find issues, fix them, even
> redesign some of it, and make it really beneficial on at least one
> arch; then it'll make more sense to port it, or attract people to port
> it. :)
>
> QEMU does not have a good solution for huge VM migration yet. Maybe
> dirty ring is a good start for it, maybe not (e.g., with uffd minor
> mode, postcopy has another chance). We'll see...
OK.

Thanks,
Keqian