[PATCH v2 3/5] KVM: Conditionally reschedule when resetting the dirty ring

Sean Christopherson posted 5 patches 7 months, 1 week ago
When resetting a dirty ring, conditionally reschedule on each iteration
after the first.  The recently introduced hard limit mitigates the issue
of an endless reset, but isn't sufficient to completely prevent RCU
stalls, soft lockups, etc., nor is the hard limit intended to guard
against such badness.

Note!  Take care to check for reschedule even in the "continue" paths,
as a pathological scenario (or malicious userspace) could dirty the same
gfn over and over, i.e. always hit the continue path.

  rcu: INFO: rcu_sched self-detected stall on CPU
  rcu: 	4-....: (5249 ticks this GP) idle=51e4/1/0x4000000000000000 softirq=309/309 fqs=2563
  rcu: 	(t=5250 jiffies g=-319 q=608 ncpus=24)
  CPU: 4 UID: 1000 PID: 1067 Comm: dirty_log_test Tainted: G             L     6.13.0-rc3-17fa7a24ea1e-HEAD-vm #814
  Tainted: [L]=SOFTLOCKUP
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x26/0x200 [kvm]
  Call Trace:
   <TASK>
   kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
   kvm_dirty_ring_reset+0x58/0x220 [kvm]
   kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
   __x64_sys_ioctl+0x8b/0xb0
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>
  Tainted: [L]=SOFTLOCKUP
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x17/0x200 [kvm]
  Call Trace:
   <TASK>
   kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
   kvm_dirty_ring_reset+0x58/0x220 [kvm]
   kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
   __x64_sys_ioctl+0x8b/0xb0
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>

Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/dirty_ring.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index e844e869e8c7..97cca0c02fd1 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 
 		ring->reset_index++;
 		(*nr_entries_reset)++;
+
+		/*
+		 * While the size of each ring is fixed, it's possible for the
+		 * ring to be constantly re-dirtied/harvested while the reset
+		 * is in-progress (the hard limit exists only to guard against
+		 * wrapping the count into negative space).
+		 */
+		if (!first_round)
+			cond_resched();
+
 		/*
 		 * Try to coalesce the reset operations when the guest is
 		 * scanning pages in the same slot.
-- 
2.49.0.1015.ga840276032-goog
Re: [PATCH v2 3/5] KVM: Conditionally reschedule when resetting the dirty ring
Posted by James Houghton 7 months, 1 week ago
On Thu, May 8, 2025 at 7:11 AM Sean Christopherson <seanjc@google.com> wrote:
>
> When resetting a dirty ring, conditionally reschedule on each iteration
> after the first.  The recently introduced hard limit mitigates the issue
> of an endless reset, but isn't sufficient to completely prevent RCU
> stalls, soft lockups, etc., nor is the hard limit intended to guard
> against such badness.
>
> Note!  Take care to check for reschedule even in the "continue" paths,
> as a pathological scenario (or malicious userspace) could dirty the same
> gfn over and over, i.e. always hit the continue path.
>
>   rcu: INFO: rcu_sched self-detected stall on CPU
>   rcu:  4-....: (5249 ticks this GP) idle=51e4/1/0x4000000000000000 softirq=309/309 fqs=2563
>   rcu:  (t=5250 jiffies g=-319 q=608 ncpus=24)
>   CPU: 4 UID: 1000 PID: 1067 Comm: dirty_log_test Tainted: G             L     6.13.0-rc3-17fa7a24ea1e-HEAD-vm #814
>   Tainted: [L]=SOFTLOCKUP
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x26/0x200 [kvm]
>   Call Trace:
>    <TASK>
>    kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
>    kvm_dirty_ring_reset+0x58/0x220 [kvm]
>    kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
>    __x64_sys_ioctl+0x8b/0xb0
>    do_syscall_64+0x5b/0x160
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    </TASK>
>   Tainted: [L]=SOFTLOCKUP
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x17/0x200 [kvm]
>   Call Trace:
>    <TASK>
>    kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
>    kvm_dirty_ring_reset+0x58/0x220 [kvm]
>    kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
>    __x64_sys_ioctl+0x8b/0xb0
>    do_syscall_64+0x5b/0x160
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    </TASK>
>
> Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  virt/kvm/dirty_ring.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index e844e869e8c7..97cca0c02fd1 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
>
>                 ring->reset_index++;
>                 (*nr_entries_reset)++;
> +
> +               /*
> +                * While the size of each ring is fixed, it's possible for the
> +                * ring to be constantly re-dirtied/harvested while the reset
> +                * is in-progress (the hard limit exists only to guard against
> +                * wrapping the count into negative space).
> +                */
> +               if (!first_round)
> +                       cond_resched();

Should we be dropping slots_lock here?

It seems like we need to be holding slots_lock to call
kvm_reset_dirty_gfn(), but that's it. Userspace can already change the
memslots after enabling the dirty ring, so `entry->slot` can already
be stale, so dropping slots_lock for the cond_resched() seems harmless
(and better than not dropping it).
Re: [PATCH v2 3/5] KVM: Conditionally reschedule when resetting the dirty ring
Posted by Sean Christopherson 7 months ago
On Mon, May 12, 2025, James Houghton wrote:
> On Thu, May 8, 2025 at 7:11 AM Sean Christopherson <seanjc@google.com> wrote:
> > ---
> >  virt/kvm/dirty_ring.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > index e844e869e8c7..97cca0c02fd1 100644
> > --- a/virt/kvm/dirty_ring.c
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> >
> >                 ring->reset_index++;
> >                 (*nr_entries_reset)++;
> > +
> > +               /*
> > +                * While the size of each ring is fixed, it's possible for the
> > +                * ring to be constantly re-dirtied/harvested while the reset
> > +                * is in-progress (the hard limit exists only to guard against
> > +                * wrapping the count into negative space).
> > +                */
> > +               if (!first_round)
> > +                       cond_resched();
> 
> Should we be dropping slots_lock here?

Could we?  Yes.  Should we?  Eh.  I don't see any value in doing so, because in
practice, it's extremely unlikely anything will be waiting on slots_lock.

kvm_vm_ioctl_reset_dirty_pages() operates on all vCPUs, i.e. there won't be
competing calls to reset other rings.  A well-behaved userspace won't be modifying
memslots or dirty logs, and won't be toggling nx_huge_pages.

That leaves kvm_vm_ioctl_set_mem_attributes(), kvm_inhibit_apic_access_page(),
kvm_assign_ioeventfd(), snp_launch_update(), and coalesced IO/MMIO (un)registration.
Except for snp_launch_update(), those are all brutally slow paths, e.g. require
SRCU synchronization and/or zapping of SPTEs.  And snp_launch_update() is probably
fairly slow too.

And dropping slots_lock only makes any sense for non-preemptible kernels, because
preemptible kernels include an equivalent check in KVM_MMU_UNLOCK().

It's also possible that dropping slots_lock in this case could be a net negative.
I don't think that's likely, but it's no more or less likely than dropping
slots_lock being a net positive.  Without performance data to guide us, it'd be
little more than a guess, and I really, really don't want to set a precedent of
dropping a mutex on cond_resched() without a very strong reason for doing so.

> It seems like we need to be holding slots_lock to call kvm_reset_dirty_gfn(),
> but that's it. Userspace can already change the memslots after enabling the
> dirty ring, so `entry->slot` can already be stale, so dropping slots_lock for
> the cond_resched() seems harmless (and better than not dropping it).
Re: [PATCH v2 3/5] KVM: Conditionally reschedule when resetting the dirty ring
Posted by James Houghton 7 months ago
On Tue, May 13, 2025 at 7:13 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, May 12, 2025, James Houghton wrote:
> > On Thu, May 8, 2025 at 7:11 AM Sean Christopherson <seanjc@google.com> wrote:
> > > ---
> > >  virt/kvm/dirty_ring.c | 10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> > >
> > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > > index e844e869e8c7..97cca0c02fd1 100644
> > > --- a/virt/kvm/dirty_ring.c
> > > +++ b/virt/kvm/dirty_ring.c
> > > @@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> > >
> > >                 ring->reset_index++;
> > >                 (*nr_entries_reset)++;
> > > +
> > > +               /*
> > > +                * While the size of each ring is fixed, it's possible for the
> > > +                * ring to be constantly re-dirtied/harvested while the reset
> > > +                * is in-progress (the hard limit exists only to guard against
> > > +                * wrapping the count into negative space).
> > > +                */
> > > +               if (!first_round)
> > > +                       cond_resched();
> >
> > Should we be dropping slots_lock here?
>
> Could we?  Yes.  Should we?  Eh.  I don't see any value in doing so, because in
> practice, it's extremely unlikely anything will be waiting on slots_lock.
>
> kvm_vm_ioctl_reset_dirty_pages() operates on all vCPUs, i.e. there won't be
> competing calls to reset other rings.  A well-behaved userspace won't be modifying
> memslots or dirty logs, and won't be toggling nx_huge_pages.
>
> That leaves kvm_vm_ioctl_set_mem_attributes(), kvm_inhibit_apic_access_page(),
> kvm_assign_ioeventfd(), snp_launch_update(), and coalesced IO/MMIO (un)registration.
> Except for snp_launch_update(), those are all brutally slow paths, e.g. require
> SRCU synchronization and/or zapping of SPTEs.  And snp_launch_update() is probably
> fairly slow too.

Okay, that makes sense.

> And dropping slots_lock only makes any sense for non-preemptible kernels, because
> preemptible kernels include an equivalent check in KVM_MMU_UNLOCK().

I'm not really sure what "equivalent check" you mean, sorry. For
preemptible kernels, we could reschedule at any time, so dropping the
slots_lock on a cond_resched() wouldn't do much, yeah. Hopefully
that's partially what you mean.

> It's also possible that dropping slots_lock in this case could be a net negative.
> I don't think that's likely, but it's no more or less likely than dropping
> slots_lock being a net positive.  Without performance data to guide us, it'd be
> little more than a guess, and I really, really don't want to set a precedent of
> dropping a mutex on cond_resched() without a very strong reason for doing so.

Fair enough.

Also, while we're at it, could you add a
`lockdep_assert_held(&kvm->slots_lock)` to this function? :) Not
necessarily in this patch.
Re: [PATCH v2 3/5] KVM: Conditionally reschedule when resetting the dirty ring
Posted by Sean Christopherson 7 months ago
On Tue, May 13, 2025, James Houghton wrote:
> On Tue, May 13, 2025 at 7:13 AM Sean Christopherson <seanjc@google.com> wrote:
> > On Mon, May 12, 2025, James Houghton wrote:
> > > On Thu, May 8, 2025 at 7:11 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > ---
> > > >  virt/kvm/dirty_ring.c | 10 ++++++++++
> > > >  1 file changed, 10 insertions(+)
> > > >
> > > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > > > index e844e869e8c7..97cca0c02fd1 100644
> > > > --- a/virt/kvm/dirty_ring.c
> > > > +++ b/virt/kvm/dirty_ring.c
> > > > @@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> > > >
> > > >                 ring->reset_index++;
> > > >                 (*nr_entries_reset)++;
> > > > +
> > > > +               /*
> > > > +                * While the size of each ring is fixed, it's possible for the
> > > > +                * ring to be constantly re-dirtied/harvested while the reset
> > > > +                * is in-progress (the hard limit exists only to guard against
> > > > +                * wrapping the count into negative space).
> > > > +                */
> > > > +               if (!first_round)
> > > > +                       cond_resched();
> > >
> > > Should we be dropping slots_lock here?
> >
> > Could we?  Yes.  Should we?  Eh.  I don't see any value in doing so, because in
> > practice, it's extremely unlikely anything will be waiting on slots_lock.
> >
> > kvm_vm_ioctl_reset_dirty_pages() operates on all vCPUs, i.e. there won't be
> > competing calls to reset other rings.  A well-behaved userspace won't be modifying
> > memslots or dirty logs, and won't be toggling nx_huge_pages.
> >
> > That leaves kvm_vm_ioctl_set_mem_attributes(), kvm_inhibit_apic_access_page(),
> > kvm_assign_ioeventfd(), snp_launch_update(), and coalesced IO/MMIO (un)registration.
> > Except for snp_launch_update(), those are all brutally slow paths, e.g. require
> > SRCU synchronization and/or zapping of SPTEs.  And snp_launch_update() is probably
> > fairly slow too.
> 
> Okay, that makes sense.

As discussed offlist, dropping slots_lock would also be functionally problematic,
as concurrent calls to KVM_RESET_DIRTY_RINGS could get interwoven, which could
result in one of the calls returning to userspace without actually completing the
reset, i.e. if a different task has reaped entries but not yet called
kvm_reset_dirty_gfn().

> > And dropping slots_lock only makes any sense for non-preemptible kernels, because
> > preemptible kernels include an equivalent check in KVM_MMU_UNLOCK().
> 
> I'm not really sure what "equivalent check" you mean, sorry. For preemptible
> kernels, we could reschedule at any time, so dropping the slots_lock on a
> cond_resched() wouldn't do much, yeah. Hopefully that's partially what you
> mean.

Ya, that's essentially what I mean.  What I was referencing with KVM_MMU_UNLOCK()
is the explicit check for NEED_RESCHED that happens when the preempt count hits
'0' on preemptible kernels.

> > It's also possible that dropping slots_lock in this case could be a net
> > negative.  I don't think that's likely, but it's no more or less likely than
> > dropping slots_lock being a net positive.  Without performance data to guide
> > us, it'd be little more than a guess, and I really, really don't want to set
> > a precedent of dropping a mutex on cond_resched() without a very strong
> > reason for doing so.
> 
> Fair enough.
> 
> Also, while we're at it, could you add a
> `lockdep_assert_held(&kvm->slots_lock)` to this function? :) Not necessarily
> in this patch.

Heh, yep, my mind jumped to that as well.  I'll tack on a patch to add a lockdep
assertion, along with a comment explaining what all it protects.