This RFC series starts from the fact that we sync the dirty bitmap when
removing a memslot for KVM.  IIUC that was mainly to maintain the dirty
bitmap even across a system reboot.

This series wants to move that sync from KVM memslot removal to system
reset.

(I still don't know why the system would still need the RAM status from
before the reset.  I thought things like kdump might use this to
retrieve info after a previous kernel panic, however IIUC that's not
what kdump is doing now.  Anyway, I'd be more than glad if anyone knows
the real scenario behind this...)

The current solution (sync at KVM memslot removal) works in most cases,
but:

  - it will be nearly impossible to make it work for the dirty ring,
    and,

  - it has an existing flaw: a race condition. [1]

So if system reset is the only thing we care about here, I'm thinking
whether we can move this sync explicitly to system reset, so that we do
one global sync there instead of syncing every time the memory layout
changes and causes memslot removals.  I think it is more explicit to
sync during system reset, and with that context it will also be far
easier for the KVM dirty ring to provide the same logic.

This is totally RFC because I'm still trying to find out whether there
are other cases besides system reset where we want to keep the dirty
bits of a to-be-removed memslot (real memory removals like unplugging
memory shouldn't matter, because we won't care about the dirty bits if
the memory is never coming back, not to mention we won't allow such
things during a migration).  So far I don't see any.

I've run some tests using either the old dirty log or the dirty ring,
with either some memory load or reboots on the source, and I see no
issues so far.

Comments greatly welcomed.  Thanks.

[1] https://lore.kernel.org/qemu-devel/20200327150425.GJ422390@xz-x1/

Peter Xu (4):
  migration: Export migration_bitmap_sync_precopy()
  migration: Introduce migrate_is_precopy()
  vl: Sync dirty bits for system resets during precopy
  kvm: No need to sync dirty bitmap before memslot removal any more

 accel/kvm/kvm-all.c      |  3 ---
 include/migration/misc.h |  2 ++
 migration/migration.c    |  7 +++++++
 migration/ram.c          | 10 +++++-----
 softmmu/vl.c             | 16 ++++++++++++++++
 5 files changed, 30 insertions(+), 8 deletions(-)

-- 
2.24.1
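For illustration, a minimal sketch of what the vl.c hook (patch 3) might
look like, reconstructed only from the patch titles above, not from the
series itself; the helper signatures are assumptions:

/*
 * Illustrative sketch, not the actual series: sync the dirty bitmap
 * once at system reset while a precopy migration is live, instead of
 * per memslot at removal time.  Helper signatures are assumed from the
 * patch titles (patches 1 and 2).
 */
#include "qemu/osdep.h"
#include "migration/misc.h"   /* assumed: migrate_is_precopy(),
                               * migration_bitmap_sync_precopy() */

static void qemu_system_reset_sync_dirty_bitmap(void)
{
    /* Nothing to do unless a precopy migration is tracking dirty pages. */
    if (!migrate_is_precopy()) {
        return;
    }
    /*
     * At reset time all vCPUs are already stopped and we hold the BQL,
     * so a single global sync cannot race against guest writes.
     */
    migration_bitmap_sync_precopy();
}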
* Peter Xu (peterx@redhat.com) wrote:
> This RFC series starts from the fact that we sync the dirty bitmap when
> removing a memslot for KVM.  IIUC that was mainly to maintain the dirty
> bitmap even across a system reboot.
>
> This series wants to move that sync from KVM memslot removal to system
> reset.
>
> (I still don't know why the system would still need the RAM status from
> before the reset.  I thought things like kdump might use this to
> retrieve info after a previous kernel panic, however IIUC that's not
> what kdump is doing now.  Anyway, I'd be more than glad if anyone knows
> the real scenario behind this...)

Aren't there pages written by the BIOS that are read by the system as it
comes up through reset - so you need those pages intact?
(But I don't think that slot gets removed?  Or does it - the BIOS has
some weird aliasing)

> The current solution (sync at KVM memslot removal) works in most cases,
> but:
>
>   - it will be nearly impossible to make it work for the dirty ring,
>     and,

Why doesn't that work with the dirty ring?

>   - it has an existing flaw: a race condition. [1]
>
> So if system reset is the only thing we care about here, I'm thinking
> whether we can move this sync explicitly to system reset, so that we do
> one global sync there instead of syncing every time the memory layout
> changes and causes memslot removals.  I think it is more explicit to
> sync during system reset, and with that context it will also be far
> easier for the KVM dirty ring to provide the same logic.
>
> This is totally RFC because I'm still trying to find out whether there
> are other cases besides system reset where we want to keep the dirty
> bits of a to-be-removed memslot (real memory removals like unplugging
> memory shouldn't matter, because we won't care about the dirty bits if
> the memory is never coming back, not to mention we won't allow such
> things during a migration).  So far I don't see any.

I'm still unsure when slot removal happens for real; but if it's
happening for RAM on PCI devices, then that would make sense as
something you might want to keep.

Dave

> I've run some tests using either the old dirty log or the dirty ring,
> with either some memory load or reboots on the source, and I see no
> issues so far.
>
> Comments greatly welcomed.  Thanks.
>
> [1] https://lore.kernel.org/qemu-devel/20200327150425.GJ422390@xz-x1/
>
> Peter Xu (4):
>   migration: Export migration_bitmap_sync_precopy()
>   migration: Introduce migrate_is_precopy()
>   vl: Sync dirty bits for system resets during precopy
>   kvm: No need to sync dirty bitmap before memslot removal any more
>
>  accel/kvm/kvm-all.c      |  3 ---
>  include/migration/misc.h |  2 ++
>  migration/migration.c    |  7 +++++++
>  migration/ram.c          | 10 +++++-----
>  softmmu/vl.c             | 16 ++++++++++++++++
>  5 files changed, 30 insertions(+), 8 deletions(-)
>
> -- 
> 2.24.1

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Wed, Apr 29, 2020 at 02:26:07PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > This RFC series starts from the fact that we sync the dirty bitmap
> > when removing a memslot for KVM.  IIUC that was mainly to maintain
> > the dirty bitmap even across a system reboot.
> >
> > This series wants to move that sync from KVM memslot removal to
> > system reset.
> >
> > (I still don't know why the system would still need the RAM status
> > from before the reset.  I thought things like kdump might use this to
> > retrieve info after a previous kernel panic, however IIUC that's not
> > what kdump is doing now.  Anyway, I'd be more than glad if anyone
> > knows the real scenario behind this...)
>
> Aren't there pages written by the BIOS that are read by the system as
> it comes up through reset - so you need those pages intact?
> (But I don't think that slot gets removed?  Or does it - the BIOS has
> some weird aliasing)

Do you know which section it writes to?  I can trace a bit to see...

My previous understanding is that once the RAM starts to be written, the
layout shouldn't change any more, but I could be wrong.

> > The current solution (sync at KVM memslot removal) works in most
> > cases, but:
> >
> >   - it will be nearly impossible to make it work for the dirty ring,
> >     and,
>
> Why doesn't that work with the dirty ring?

Because a strict sync is hard, imho...  I shouldn't say it's impossible,
but it would depend on the fixes for the existing KVM get-dirty-log
path, so it's even harder (please see below).

I'll start with the normal get-dirty-log.  Currently, QEMU's KVM code
syncs the dirty bitmap when removing a memslot like this:

  - KVM_GET_DIRTY_LOG on the memslot
  - KVM_SET_USER_MEMORY_REGION to remove it

But that's racy, because writes can happen between the two steps.  One
correct way to do it would be:

  - KVM_SET_USER_MEMORY_REGION to set the memslot READONLY [a]
  - kick all vCPUs out synchronously to make sure the dirty bits are
    flushed to KVM's per-memslot dirty bitmaps [b]
  - KVM_GET_DIRTY_LOG on the memslot
  - KVM_SET_USER_MEMORY_REGION to remove it

(Both flows are sketched in C below.)

The fix would need both a QEMU change at [a], to first mark the memslot
READONLY so no further writes can land before the dirty log is fetched,
and a kernel change at [b] to synchronously flush each vCPU and make
sure PML is drained (which could be very awkward; I proposed an idea [1]
but I didn't get any positive feedback yet).

[1] https://lore.kernel.org/qemu-devel/20200402223240.GG103677@xz-x1/

So I think it's not easy to get this fully right even for get-dirty-log.

For the dirty ring, the sync would look like:

  - KVM_SET_USER_MEMORY_REGION to set the memslot READONLY [a]
  - kick all vCPUs out synchronously to make sure the dirty bits are
    flushed to KVM's per-memslot dirty bitmaps [b]
  - collect the dirty rings [c]

So it would depend on fixing [a] and [b] first; then, because the dirty
ring is per-vCPU rather than per-slot, [c] has to collect the data of
all vCPUs (across all slots).  That means taking all the slots_lock even
when removing a single small memslot, with one slots_lock already taken.
I think we can still do it, but as you see, it's awkward and harder than
the legacy get-dirty-log process.

I was always wondering why this sync is useful and in what cases we rely
on it.  If system reset is the only place, then it would be very clean
to move it there, because that's a very high point in the stack where we
only hold the BQL, and at the same time all the vCPUs are paused anyway,
so most of the nasty things are gone.  We would just do a simple sync
like in this series, and it would nicely work for both the dirty log and
the dirty ring (then we don't need to fix [a] and [b] either).

> >   - it has an existing flaw: a race condition. [1]
> >
> > So if system reset is the only thing we care about here, I'm thinking
> > whether we can move this sync explicitly to system reset, so that we
> > do one global sync there instead of syncing every time the memory
> > layout changes and causes memslot removals.  I think it is more
> > explicit to sync during system reset, and with that context it will
> > also be far easier for the KVM dirty ring to provide the same logic.
> >
> > This is totally RFC because I'm still trying to find out whether
> > there are other cases besides system reset where we want to keep the
> > dirty bits of a to-be-removed memslot (real memory removals like
> > unplugging memory shouldn't matter, because we won't care about the
> > dirty bits if the memory is never coming back, not to mention we
> > won't allow such things during a migration).  So far I don't see any.
>
> I'm still unsure when slot removal happens for real; but if it's
> happening for RAM on PCI devices, then that would make sense as
> something you might want to keep.

Yeah, indeed.  Though I'm also curious whether there is a real user that
would frequently map/unmap an MMIO region backed by device RAM.  There
is also a workaround for this issue: mark the whole region dirty when
the MMIO region is unmapped.  I haven't checked whether that's easy to
do in QEMU, but I assume it's not that much work...

Thanks,

-- 
Peter Xu
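For reference, the two removal flows described in the message above,
written out against the raw KVM ioctl interface.  This is an
illustrative sketch only: error handling is omitted, and
kick_all_vcpus_sync() is a placeholder for the synchronous kernel-side
flush of step [b], which does not exist today.

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Placeholder for step [b]: would force every vCPU out of guest mode
 * so that hardware-buffered dirty info (e.g. PML) reaches the
 * per-memslot bitmap.  The synchronous kernel support for this is the
 * missing piece. */
static void kick_all_vcpus_sync(void)
{
}

/* Today's (racy) flow: a vCPU can dirty pages between [1] and [2],
 * and those bits are lost. */
static void remove_memslot_racy(int vm_fd,
                                struct kvm_userspace_memory_region *slot,
                                void *bitmap)
{
    struct kvm_dirty_log log = {
        .slot = slot->slot,
        .dirty_bitmap = bitmap,
    };

    ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);          /* [1] collect bits */
    slot->memory_size = 0;                          /* size 0 == delete */
    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, slot); /* [2] remove slot  */
}

/* The corrected flow per steps [a] and [b] above. */
static void remove_memslot_fixed(int vm_fd,
                                 struct kvm_userspace_memory_region *slot,
                                 void *bitmap)
{
    struct kvm_dirty_log log = {
        .slot = slot->slot,
        .dirty_bitmap = bitmap,
    };

    slot->flags |= KVM_MEM_READONLY;                /* [a] block writes */
    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, slot);
    kick_all_vcpus_sync();                          /* [b] flush vCPUs  */
    ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);          /* stable snapshot  */
    slot->memory_size = 0;
    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, slot); /* finally remove   */
}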
On Wed, Apr 29, 2020 at 10:32:27AM -0400, Peter Xu wrote:
> On Wed, Apr 29, 2020 at 02:26:07PM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > This RFC series starts from the fact that we sync the dirty bitmap
> > > when removing a memslot for KVM.  IIUC that was mainly to maintain
> > > the dirty bitmap even across a system reboot.
> > >
> > > This series wants to move that sync from KVM memslot removal to
> > > system reset.
> > >
> > > (I still don't know why the system would still need the RAM status
> > > from before the reset.  I thought things like kdump might use this
> > > to retrieve info after a previous kernel panic, however IIUC that's
> > > not what kdump is doing now.  Anyway, I'd be more than glad if
> > > anyone knows the real scenario behind this...)
> >
> > Aren't there pages written by the BIOS that are read by the system as
> > it comes up through reset - so you need those pages intact?
> > (But I don't think that slot gets removed?  Or does it - the BIOS has
> > some weird aliasing)
>
> Do you know which section it writes to?  I can trace a bit to see...
>
> My previous understanding is that once the RAM starts to be written,
> the layout shouldn't change any more, but I could be wrong.
>
> > > The current solution (sync at KVM memslot removal) works in most
> > > cases, but:
> > >
> > >   - it will be nearly impossible to make it work for the dirty
> > >     ring, and,
> >
> > Why doesn't that work with the dirty ring?
>
> Because a strict sync is hard, imho...  I shouldn't say it's
> impossible, but it would depend on the fixes for the existing KVM
> get-dirty-log path, so it's even harder (please see below).
>
> I'll start with the normal get-dirty-log.  Currently, QEMU's KVM code
> syncs the dirty bitmap when removing a memslot like this:
>
>   - KVM_GET_DIRTY_LOG on the memslot
>   - KVM_SET_USER_MEMORY_REGION to remove it
>
> But that's racy, because writes can happen between the two steps.  One
> correct way to do it would be:
>
>   - KVM_SET_USER_MEMORY_REGION to set the memslot READONLY [a]
>   - kick all vCPUs out synchronously to make sure the dirty bits are
>     flushed to KVM's per-memslot dirty bitmaps [b]
>   - KVM_GET_DIRTY_LOG on the memslot
>   - KVM_SET_USER_MEMORY_REGION to remove it
>
> The fix would need both a QEMU change at [a], to first mark the memslot
> READONLY so no further writes can land before the dirty log is fetched,
> and a kernel change at [b] to synchronously flush each vCPU and make
> sure PML is drained (which could be very awkward; I proposed an idea
> [1] but I didn't get any positive feedback yet).
>
> [1] https://lore.kernel.org/qemu-devel/20200402223240.GG103677@xz-x1/
>
> So I think it's not easy to get this fully right even for
> get-dirty-log.
>
> For the dirty ring, the sync would look like:
>
>   - KVM_SET_USER_MEMORY_REGION to set the memslot READONLY [a]
>   - kick all vCPUs out synchronously to make sure the dirty bits are
>     flushed to KVM's per-memslot dirty bitmaps [b]
>   - collect the dirty rings [c]
>
> So it would depend on fixing [a] and [b] first; then, because the dirty
> ring is per-vCPU rather than per-slot, [c] has to collect the data of
> all vCPUs (across all slots).  That means taking all the slots_lock
> even when removing a single small memslot, with one slots_lock already
> taken.  I think we can still do it, but as you see, it's awkward and
> harder than the legacy get-dirty-log process.
>
> I was always wondering why this sync is useful and in what cases we
> rely on it.  If system reset is the only place, then it would be very
> clean to move it there, because that's a very high point in the stack
> where we only hold the BQL, and at the same time all the vCPUs are
> paused anyway, so most of the nasty things are gone.  We would just do
> a simple sync like in this series, and it would nicely work for both
> the dirty log and the dirty ring (then we don't need to fix [a] and
> [b] either).

Paolo,

Do you have any opinion on this series?  More general comments on your
preferred way to fix the known dirty-sync issue with memslot removal
would be welcome too.  My own opinion is that I prefer the approach of
this series, because then we won't need to touch the kernel at all, and
it also paves the way for the dirty ring.  In any case, I do think
collecting the dirty bits of a memslot being removed is slightly
awkward already.

Dave has valid concerns about other cases besides reboot where we might
still want to keep the dirty bits: (1) pages written during BIOS/...
boots, and (2) guest unmaps of device MMIO regions that were backed by
RAM.  I can continue to look into these, but before that I'd appreciate
knowing whether the direction is correct and acceptable.

IMHO the worst case for a series like this would be a regression where
dirty bits are lost, but then we can simply revert the last patch to
verify or fix it (or fix the missing case instead).

Thanks,

> > >   - it has an existing flaw: a race condition. [1]
> > >
> > > So if system reset is the only thing we care about here, I'm
> > > thinking whether we can move this sync explicitly to system reset,
> > > so that we do one global sync there instead of syncing every time
> > > the memory layout changes and causes memslot removals.  I think it
> > > is more explicit to sync during system reset, and with that context
> > > it will also be far easier for the KVM dirty ring to provide the
> > > same logic.
> > >
> > > This is totally RFC because I'm still trying to find out whether
> > > there are other cases besides system reset where we want to keep
> > > the dirty bits of a to-be-removed memslot (real memory removals
> > > like unplugging memory shouldn't matter, because we won't care
> > > about the dirty bits if the memory is never coming back, not to
> > > mention we won't allow such things during a migration).  So far I
> > > don't see any.
> >
> > I'm still unsure when slot removal happens for real; but if it's
> > happening for RAM on PCI devices, then that would make sense as
> > something you might want to keep.
>
> Yeah, indeed.  Though I'm also curious whether there is a real user
> that would frequently map/unmap an MMIO region backed by device RAM.
> There is also a workaround for this issue: mark the whole region dirty
> when the MMIO region is unmapped.  I haven't checked whether that's
> easy to do in QEMU, but I assume it's not that much work...
>
> Thanks,
>
> -- 
> Peter Xu

-- 
Peter Xu
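As a side note on the workaround mentioned above (mark a RAM-backed MMIO
region fully dirty when it is unmapped): assuming the unmap path has the
MemoryRegion at hand, a sketch with QEMU's memory API could be as simple
as the call below; the open question is finding the right hook point,
not the call itself.  The hook function name here is hypothetical.

#include "qemu/osdep.h"
#include "exec/memory.h"

/* Hypothetical hook on the unmap path of a RAM-backed MMIO region:
 * conservatively mark the whole region dirty so no bits are lost when
 * its memslot goes away, at the cost of re-sending the region during
 * migration. */
static void mmio_region_unmap_mark_dirty(MemoryRegion *mr)
{
    memory_region_set_dirty(mr, 0, memory_region_size(mr));
}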
On Sat, May 02, 2020 at 05:04:23PM -0400, Peter Xu wrote:
> Paolo,
>
> Do you have any opinion on this series?  More general comments on your
> preferred way to fix the known dirty-sync issue with memslot removal
> would be welcome too.  My own opinion is that I prefer the approach of
> this series, because then we won't need to touch the kernel at all, and
> it also paves the way for the dirty ring.  In any case, I do think
> collecting the dirty bits of a memslot being removed is slightly
> awkward already.
>
> Dave has valid concerns about other cases besides reboot where we might
> still want to keep the dirty bits: (1) pages written during BIOS/...
> boots, and (2) guest unmaps of device MMIO regions that were backed by
> RAM.  I can continue to look into these, but before that I'd appreciate
> knowing whether the direction is correct and acceptable.
>
> IMHO the worst case for a series like this would be a regression where
> dirty bits are lost, but then we can simply revert the last patch to
> verify or fix it (or fix the missing case instead).

Hmm...  I didn't find a good and clean way to cover the cases Dave
mentioned.  Everything would be either ugly or even uglier.  Sad.

Since I haven't got any feedback on this for some time either (or on the
rest of the discussion threads and proposals I opened), I think it's
time to move on...  It starts to feel unwise to keep the two KVM dirty
ring series blocked by this extremely corner case of a race condition;
after all, it's not a problem specific to the dirty ring but common to
both methods.  I'll continue with the dirty ring for now, and we can fix
the race for both it and the dirty log together when we find a better
solution.

I'll soon post a new QEMU version that keeps the sync at memslot
removal, but, just like the dirty log, it will be a best-effort sync
(e.g. PML is ignored).  I'll add a comment to at least mention the
issues we've found, so we can get back to them later.

Thanks,

-- 
Peter Xu