When resetting a dirty ring, conditionally reschedule on each iteration
after the first. The recently introduced hard limit mitigates the issue
of an endless reset, but isn't sufficient to completely prevent RCU
stalls, soft lockups, etc., nor is the hard limit intended to guard
against such badness.

Note! Take care to check for reschedule even in the "continue" paths,
as a pathological scenario (or malicious userspace) could dirty the same
gfn over and over, i.e. always hit the continue path.

rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 4-....: (5249 ticks this GP) idle=51e4/1/0x4000000000000000 softirq=309/309 fqs=2563
rcu: (t=5250 jiffies g=-319 q=608 ncpus=24)
CPU: 4 UID: 1000 PID: 1067 Comm: dirty_log_test Tainted: G L 6.13.0-rc3-17fa7a24ea1e-HEAD-vm #814
Tainted: [L]=SOFTLOCKUP
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x26/0x200 [kvm]
Call Trace:
<TASK>
kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
kvm_dirty_ring_reset+0x58/0x220 [kvm]
kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
__x64_sys_ioctl+0x8b/0xb0
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
Tainted: [L]=SOFTLOCKUP
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x17/0x200 [kvm]
Call Trace:
<TASK>
kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
kvm_dirty_ring_reset+0x58/0x220 [kvm]
kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
__x64_sys_ioctl+0x8b/0xb0
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>

Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/dirty_ring.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index e844e869e8c7..97cca0c02fd1 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 
 		ring->reset_index++;
 		(*nr_entries_reset)++;
+
+		/*
+		 * While the size of each ring is fixed, it's possible for the
+		 * ring to be constantly re-dirtied/harvested while the reset
+		 * is in-progress (the hard limit exists only to guard against
+		 * wrapping the count into negative space).
+		 */
+		if (!first_round)
+			cond_resched();
+
 		/*
 		 * Try to coalesce the reset operations when the guest is
 		 * scanning pages in the same slot.
--
2.49.0.1015.ga840276032-goog
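
To make the "Note!" in the changelog concrete, the loop's shape after this
patch is roughly as follows. This is an illustrative sketch, not the actual
kernel code: more_harvested_entries() and can_coalesce_with_next() are
invented stand-ins for the real harvesting and coalescing logic. The key
detail is that the resched check runs before the coalescing "continue", so a
ring that is constantly re-dirtied with the same gfn still hits the check on
every iteration:

	/* Illustrative sketch only; see virt/kvm/dirty_ring.c for the real code. */
	while (more_harvested_entries(ring)) {		/* hypothetical helper */
		ring->reset_index++;
		(*nr_entries_reset)++;

		/* Check for a needed reschedule on every pass after the first. */
		if (!first_round)
			cond_resched();

		/* Coalescing path: contiguous gfn in the same slot, keep going. */
		if (can_coalesce_with_next(ring)) {	/* hypothetical helper */
			first_round = false;
			continue;
		}

		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
		first_round = false;
	}
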
On Thu, May 8, 2025 at 7:11 AM Sean Christopherson <seanjc@google.com> wrote:
>
> When resetting a dirty ring, conditionally reschedule on each iteration
> after the first. The recently introduced hard limit mitigates the issue
> of an endless reset, but isn't sufficient to completely prevent RCU
> stalls, soft lockups, etc., nor is the hard limit intended to guard
> against such badness.
>
> Note! Take care to check for reschedule even in the "continue" paths,
> as a pathological scenario (or malicious userspace) could dirty the same
> gfn over and over, i.e. always hit the continue path.
>
> rcu: INFO: rcu_sched self-detected stall on CPU
> rcu: 4-....: (5249 ticks this GP) idle=51e4/1/0x4000000000000000 softirq=309/309 fqs=2563
> rcu: (t=5250 jiffies g=-319 q=608 ncpus=24)
> CPU: 4 UID: 1000 PID: 1067 Comm: dirty_log_test Tainted: G L 6.13.0-rc3-17fa7a24ea1e-HEAD-vm #814
> Tainted: [L]=SOFTLOCKUP
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x26/0x200 [kvm]
> Call Trace:
> <TASK>
> kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
> kvm_dirty_ring_reset+0x58/0x220 [kvm]
> kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
> __x64_sys_ioctl+0x8b/0xb0
> do_syscall_64+0x5b/0x160
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
> </TASK>
> Tainted: [L]=SOFTLOCKUP
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x17/0x200 [kvm]
> Call Trace:
> <TASK>
> kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
> kvm_dirty_ring_reset+0x58/0x220 [kvm]
> kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
> __x64_sys_ioctl+0x8b/0xb0
> do_syscall_64+0x5b/0x160
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
> </TASK>
>
> Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> virt/kvm/dirty_ring.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index e844e869e8c7..97cca0c02fd1 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
>
> >  		ring->reset_index++;
> >  		(*nr_entries_reset)++;
> > +
> > +		/*
> > +		 * While the size of each ring is fixed, it's possible for the
> > +		 * ring to be constantly re-dirtied/harvested while the reset
> > +		 * is in-progress (the hard limit exists only to guard against
> > +		 * wrapping the count into negative space).
> > +		 */
> > +		if (!first_round)
> > +			cond_resched();

Should we be dropping slots_lock here?

It seems like we need to be holding slots_lock to call
kvm_reset_dirty_gfn(), but that's it. Userspace can already change the
memslots after enabling the dirty ring, so `entry->slot` can already
be stale, so dropping slots_lock for the cond_resched() seems harmless
(and better than not dropping it).
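
Spelled out, James's suggestion would look something like the following
hypothetical sketch (not a proposed patch); kvm->slots_lock is a mutex, so
the drop and re-acquire would have to be explicit:

	/* Hypothetical: drop slots_lock across the resched point. */
	if (!first_round && need_resched()) {
		mutex_unlock(&kvm->slots_lock);
		cond_resched();
		mutex_lock(&kvm->slots_lock);
	}
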
On Mon, May 12, 2025, James Houghton wrote:
> On Thu, May 8, 2025 at 7:11 AM Sean Christopherson <seanjc@google.com> wrote:
> > ---
> >  virt/kvm/dirty_ring.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > index e844e869e8c7..97cca0c02fd1 100644
> > --- a/virt/kvm/dirty_ring.c
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> >
> >  		ring->reset_index++;
> >  		(*nr_entries_reset)++;
> > +
> > +		/*
> > +		 * While the size of each ring is fixed, it's possible for the
> > +		 * ring to be constantly re-dirtied/harvested while the reset
> > +		 * is in-progress (the hard limit exists only to guard against
> > +		 * wrapping the count into negative space).
> > +		 */
> > +		if (!first_round)
> > +			cond_resched();
>
> Should we be dropping slots_lock here?

Could we? Yes. Should we? Eh. I don't see any value in doing so, because in
practice, it's extremely unlikely anything will be waiting on slots_lock.

kvm_vm_ioctl_reset_dirty_pages() operates on all vCPUs, i.e. there won't be
competing calls to reset other rings. A well-behaved userspace won't be
modifying memslots or dirty logs, and won't be toggling nx_huge_pages.

That leaves kvm_vm_ioctl_set_mem_attributes(), kvm_inhibit_apic_access_page(),
kvm_assign_ioeventfd(), snp_launch_update(), and coalesced IO/MMIO
(un)registration. Except for snp_launch_update(), those are all brutally slow
paths, e.g. require SRCU synchronization and/or zapping of SPTEs. And
snp_launch_update() is probably fairly slow too.

And dropping slots_lock only makes any sense for non-preemptible kernels,
because preemptible kernels include an equivalent check in KVM_MMU_UNLOCK().

It's also possible that dropping slots_lock in this case could be a net
negative. I don't think it's likely, but I don't think it's any more or less
likely that dropping slots_lock is a net positive. Without performance data
to guide us, it'd be little more than a guess, and I really, really don't
want to set a precedent of dropping a mutex on cond_resched() without a very
strong reason for doing so.

> It seems like we need to be holding slots_lock to call kvm_reset_dirty_gfn(),
> but that's it. Userspace can already change the memslots after enabling the
> dirty ring, so `entry->slot` can already be stale, so dropping slots_lock for
> the cond_resched() seems harmless (and better than not dropping it).

On Tue, May 13, 2025 at 7:13 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, May 12, 2025, James Houghton wrote:
> > On Thu, May 8, 2025 at 7:11 AM Sean Christopherson <seanjc@google.com> wrote:
> > > ---
> > >  virt/kvm/dirty_ring.c | 10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> > >
> > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > > index e844e869e8c7..97cca0c02fd1 100644
> > > --- a/virt/kvm/dirty_ring.c
> > > +++ b/virt/kvm/dirty_ring.c
> > > @@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> > >
> > >  		ring->reset_index++;
> > >  		(*nr_entries_reset)++;
> > > +
> > > +		/*
> > > +		 * While the size of each ring is fixed, it's possible for the
> > > +		 * ring to be constantly re-dirtied/harvested while the reset
> > > +		 * is in-progress (the hard limit exists only to guard against
> > > +		 * wrapping the count into negative space).
> > > +		 */
> > > +		if (!first_round)
> > > +			cond_resched();
> >
> > Should we be dropping slots_lock here?
>
> Could we? Yes. Should we? Eh. I don't see any value in doing so, because in
> practice, it's extremely unlikely anything will be waiting on slots_lock.
>
> kvm_vm_ioctl_reset_dirty_pages() operates on all vCPUs, i.e. there won't be
> competing calls to reset other rings. A well-behaved userspace won't be
> modifying memslots or dirty logs, and won't be toggling nx_huge_pages.
>
> That leaves kvm_vm_ioctl_set_mem_attributes(), kvm_inhibit_apic_access_page(),
> kvm_assign_ioeventfd(), snp_launch_update(), and coalesced IO/MMIO
> (un)registration. Except for snp_launch_update(), those are all brutally slow
> paths, e.g. require SRCU synchronization and/or zapping of SPTEs. And
> snp_launch_update() is probably fairly slow too.

Okay, that makes sense.

> And dropping slots_lock only makes any sense for non-preemptible kernels,
> because preemptible kernels include an equivalent check in KVM_MMU_UNLOCK().

I'm not really sure what "equivalent check" you mean, sorry. For preemptible
kernels, we could reschedule at any time, so dropping the slots_lock on a
cond_resched() wouldn't do much, yeah. Hopefully that's partially what you
mean.

> It's also possible that dropping slots_lock in this case could be a net
> negative. I don't think it's likely, but I don't think it's any more or less
> likely that dropping slots_lock is a net positive. Without performance data
> to guide us, it'd be little more than a guess, and I really, really don't
> want to set a precedent of dropping a mutex on cond_resched() without a very
> strong reason for doing so.

Fair enough.

Also, while we're at it, could you add a
`lockdep_assert_held(&kvm->slots_lock)` to this function? :) Not necessarily
in this patch.
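
For background on the "equivalent check" question: on preemptible kernels,
the final unlock that drops the preempt count to zero immediately reschedules
if a resched is pending, which is what Sean's reply below refers to.
Conceptually, something like this sketch, which is not the kernel's actual
implementation:

	/* Conceptual sketch only; the real logic lives in the preempt core. */
	static inline void unlock_and_maybe_resched(void)
	{
		if (preempt_count_dec_and_test())	/* count hit 0, NEED_RESCHED set */
			schedule();
	}
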
On Tue, May 13, 2025, James Houghton wrote:
> On Tue, May 13, 2025 at 7:13 AM Sean Christopherson <seanjc@google.com> wrote:
> > On Mon, May 12, 2025, James Houghton wrote:
> > > On Thu, May 8, 2025 at 7:11 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > ---
> > > >  virt/kvm/dirty_ring.c | 10 ++++++++++
> > > >  1 file changed, 10 insertions(+)
> > > >
> > > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > > > index e844e869e8c7..97cca0c02fd1 100644
> > > > --- a/virt/kvm/dirty_ring.c
> > > > +++ b/virt/kvm/dirty_ring.c
> > > > @@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> > > >
> > > >  		ring->reset_index++;
> > > >  		(*nr_entries_reset)++;
> > > > +
> > > > +		/*
> > > > +		 * While the size of each ring is fixed, it's possible for the
> > > > +		 * ring to be constantly re-dirtied/harvested while the reset
> > > > +		 * is in-progress (the hard limit exists only to guard against
> > > > +		 * wrapping the count into negative space).
> > > > +		 */
> > > > +		if (!first_round)
> > > > +			cond_resched();
> > >
> > > Should we be dropping slots_lock here?
> >
> > Could we? Yes. Should we? Eh. I don't see any value in doing so, because in
> > practice, it's extremely unlikely anything will be waiting on slots_lock.
> >
> > kvm_vm_ioctl_reset_dirty_pages() operates on all vCPUs, i.e. there won't be
> > competing calls to reset other rings. A well-behaved userspace won't be
> > modifying memslots or dirty logs, and won't be toggling nx_huge_pages.
> >
> > That leaves kvm_vm_ioctl_set_mem_attributes(), kvm_inhibit_apic_access_page(),
> > kvm_assign_ioeventfd(), snp_launch_update(), and coalesced IO/MMIO
> > (un)registration. Except for snp_launch_update(), those are all brutally slow
> > paths, e.g. require SRCU synchronization and/or zapping of SPTEs. And
> > snp_launch_update() is probably fairly slow too.
>
> Okay, that makes sense.

As discussed offlist, dropping slots_lock would also be functionally
problematic, as concurrent calls to KVM_RESET_DIRTY_RINGS could get
interwoven, which could result in one of the calls returning to userspace
without actually completing the reset, i.e. if a different task has reaped
entries but not yet called kvm_reset_dirty_gfn().

> > And dropping slots_lock only makes any sense for non-preemptible kernels,
> > because preemptible kernels include an equivalent check in KVM_MMU_UNLOCK().
>
> I'm not really sure what "equivalent check" you mean, sorry. For preemptible
> kernels, we could reschedule at any time, so dropping the slots_lock on a
> cond_resched() wouldn't do much, yeah. Hopefully that's partially what you
> mean.

Ya, that's essentially what I mean. What I was referencing with
KVM_MMU_UNLOCK() is the explicit check for NEED_RESCHED that happens when the
preempt count hits '0' on preemptible kernels.

> > It's also possible that dropping slots_lock in this case could be a net
> > negative. I don't think it's likely, but I don't think it's any more or less
> > likely that dropping slots_lock is a net positive. Without performance data
> > to guide us, it'd be little more than a guess, and I really, really don't
> > want to set a precedent of dropping a mutex on cond_resched() without a very
> > strong reason for doing so.
>
> Fair enough.
>
> Also, while we're at it, could you add a
> `lockdep_assert_held(&kvm->slots_lock)` to this function? :) Not necessarily
> in this patch.

Heh, yep, my mind jumped to that as well.
I'll tack on a patch to add a lockdep assertion, along with a comment explaining what all it protects.
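
A sketch of what that follow-up could look like (hypothetical, including the
comment wording; grounded only in the reasoning in the thread above):

	int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
				 int *nr_entries_reset)
	{
		/*
		 * slots_lock must be held: it serializes KVM_RESET_DIRTY_RINGS,
		 * i.e. prevents concurrent resets from interweaving and returning
		 * to userspace with harvested-but-not-yet-reset entries, and it
		 * protects the memslot accesses in kvm_reset_dirty_gfn().
		 */
		lockdep_assert_held(&kvm->slots_lock);

		/* ... existing reset loop ... */
	}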