[PATCH 3/5] KVM: Conditionally reschedule when resetting the dirty ring

Posted by Sean Christopherson 1 year ago
When resetting a dirty ring, conditionally reschedule on each iteration
after the first.  The recently introduced hard limit mitigates the issue
of an endless reset, but isn't sufficient to completely prevent RCU
stalls, soft lockups, etc., nor is the hard limit intended to guard
against such badness.

Note!  Take care to check for reschedule even in the "continue" paths,
as a pathological scenario (or malicious userspace) could dirty the same
gfn over and over, i.e. always hit the continue path.

  rcu: INFO: rcu_sched self-detected stall on CPU
  rcu: 	4-....: (5249 ticks this GP) idle=51e4/1/0x4000000000000000 softirq=309/309 fqs=2563
  rcu: 	(t=5250 jiffies g=-319 q=608 ncpus=24)
  CPU: 4 UID: 1000 PID: 1067 Comm: dirty_log_test Tainted: G             L     6.13.0-rc3-17fa7a24ea1e-HEAD-vm #814
  Tainted: [L]=SOFTLOCKUP
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x26/0x200 [kvm]
  Call Trace:
   <TASK>
   kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
   kvm_dirty_ring_reset+0x58/0x220 [kvm]
   kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
   __x64_sys_ioctl+0x8b/0xb0
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>
  Tainted: [L]=SOFTLOCKUP
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x17/0x200 [kvm]
  Call Trace:
   <TASK>
   kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
   kvm_dirty_ring_reset+0x58/0x220 [kvm]
   kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
   __x64_sys_ioctl+0x8b/0xb0
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>

Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/dirty_ring.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index a81ad17d5eef..37eb2b7142bd 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -133,6 +133,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 
 		ring->reset_index++;
 		(*nr_entries_reset)++;
+
+		/*
+		 * While the size of each ring is fixed, it's possible for the
+		 * ring to be constantly re-dirtied/harvested while the reset
+		 * is in-progress (the hard limit exists only to guard against
+		 * wrapping the count into negative space).
+		 */
+		if (!first_round)
+			cond_resched();
+
 		/*
 		 * Try to coalesce the reset operations when the guest is
 		 * scanning pages in the same slot.
-- 
2.47.1.613.gc27f4b7a9f-goog
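
To make the "Note!" above concrete, the reset loop with this patch applied
has roughly the shape sketched below.  This is an illustration, not the
verbatim kernel code: next_harvested_entry() and same_slot_and_adjacent()
are made-up placeholders, and the real coalescing logic in
virt/kvm/dirty_ring.c is more involved.

	while ((*nr_entries_reset) < INT_MAX) {	/* the hard limit */
		entry = next_harvested_entry(ring);	/* hypothetical helper */
		if (!entry)
			break;

		ring->reset_index++;
		(*nr_entries_reset)++;

		/* Reschedule point on every iteration after the first... */
		if (!first_round)
			cond_resched();

		/*
		 * ...including the coalescing path, which "continue"s without
		 * resetting anything.  Without the check above, a guest that
		 * keeps re-dirtying adjacent gfns in the same slot could pin
		 * this loop without ever reaching a reschedule point.
		 */
		if (!first_round &&
		    same_slot_and_adjacent(entry, cur_slot, cur_offset)) {	/* hypothetical */
			mask |= 1ull << (entry->offset - cur_offset);
			continue;
		}

		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
		cur_slot = entry->slot;
		cur_offset = entry->offset;
		mask = 1;
		first_round = false;
	}
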
Re: [PATCH 3/5] KVM: Conditionally reschedule when resetting the dirty ring
Posted by Yan Zhao 1 year ago
On Fri, Jan 10, 2025 at 05:04:07PM -0800, Sean Christopherson wrote:
> When resetting a dirty ring, conditionally reschedule on each iteration
> after the first.  The recently introduced hard limit mitigates the issue
> of an endless reset, but isn't sufficient to completely prevent RCU
> stalls, soft lockups, etc., nor is the hard limit intended to guard
> against such badness.
> 
> [...]
> 
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index a81ad17d5eef..37eb2b7142bd 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -133,6 +133,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
>  
>  		ring->reset_index++;
>  		(*nr_entries_reset)++;
> +
> +		/*
> +		 * While the size of each ring is fixed, it's possible for the
> +		 * ring to be constantly re-dirtied/harvested while the reset
> +		 * is in-progress (the hard limit exists only to guard against
> +		 * wrapping the count into negative space).
> +		 */
> +		if (!first_round)
> +			cond_resched();
> +
Will cond_resched() per entry be too frequent?
Could we combine them into one cond_resched() per ring? e.g.

if (count >= ring->soft_limit)
	cond_resched();

or simply
while (count < ring->size) {
	...
}


>  		/*
>  		 * Try to coalesce the reset operations when the guest is
>  		 * scanning pages in the same slot.
> -- 
> 2.47.1.613.gc27f4b7a9f-goog
>
Re: [PATCH 3/5] KVM: Conditionally reschedule when resetting the dirty ring
Posted by Sean Christopherson 1 year ago
On Mon, Jan 13, 2025, Yan Zhao wrote:
> On Fri, Jan 10, 2025 at 05:04:07PM -0800, Sean Christopherson wrote:
> > [...]
> > +		if (!first_round)
> > +			cond_resched();
> > +
> Will cond_resched() per entry be too frequent?

No, if it is too frequent, KVM has other problems.  cond_resched() only takes a
handful of cycles when no work needs to be done, and on PREEMPTION=y kernels,
dropping mmu_lock in kvm_reset_dirty_gfn() already includes a NEED_RESCHED check.
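
For reference, the tail of kvm_reset_dirty_gfn() is roughly the sketch
below (simplified, not the verbatim code).  Releasing mmu_lock re-enables
preemption, and preempt_enable() reschedules if NEED_RESCHED is set:

	KVM_MMU_LOCK(kvm);	/* spinlock or rwlock, depending on the arch */
	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
	KVM_MMU_UNLOCK(kvm);	/* implicit reschedule point on PREEMPTION=y */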

> Could we combine them into one cond_resched() per ring? e.g.
> 
> if (count >= ring->soft_limit)
> 	cond_resched();
> 
> or simply
> while (count < ring->size) {
> 	...
> }

I don't think I have any objections to bounding the reset at ring->size?  I
assumed the unbounded walk was deliberate, e.g. to let userspace reset entries
in a separate thread, but looking at the QEMU code, that doesn't appear to be
the case.

However, IMO that's an orthogonal discussion.  I think KVM should still check for
NEED_RESCHED after processing each entry regardless of how the loop is bounded.
E.g. write-protecting 65536 GFNs is definitely going to have measurable latency.
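
A rough sketch of that combination, i.e. bounding the walk at ring->size
while keeping the per-entry check (reset_one_harvested_entry() is an
illustrative placeholder, not a real KVM helper):

	while ((*nr_entries_reset) < ring->size) {
		if (!reset_one_harvested_entry(kvm, ring, nr_entries_reset))
			break;

		if (!first_round)
			cond_resched();
		first_round = false;
	}
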
Re: [PATCH 3/5] KVM: Conditionally reschedule when resetting the dirty ring
Posted by Yan Zhao 1 year ago
On Mon, Jan 13, 2025 at 08:28:06AM -0800, Sean Christopherson wrote:
> On Mon, Jan 13, 2025, Yan Zhao wrote:
> > On Fri, Jan 10, 2025 at 05:04:07PM -0800, Sean Christopherson wrote:
> > > [...]
> > > +		if (!first_round)
> > > +			cond_resched();
> > > +
> > Will cond_resched() per entry be too frequent?
> 
> No, if it is too frequent, KVM has other problems.  cond_resched() only takes a
> handful of cycles when no work needs to be done, and on PREEMPTION=y kernels,
> dropping mmu_lock in kvm_reset_dirty_gfn() already includes a NEED_RESCHED check.
Ok. I was just worried about live migration performance.
But per-entry looks like it should be fine too.

> 
> > Could we combine them into one cond_resched() per ring? e.g.
> > 
> > if (count >= ring->soft_limit)
> > 	cond_resched();
> > 
> > or simply
> > while (count < ring->size) {
> > 	...
> > }
> 
> I don't think I have any objections to bounding the reset at ring->size?  I
> assumed the unbounded walk was deliberate, e.g. to let userspace reset entries
> in a separate thread, but looking at the QEMU code, that doesn't appear to be
> the case.
Ok.

> However, IMO that's an orthogonal discussion.  I think KVM should still check for
> NEED_RESCHED after processing each entry regardless of how the loop is bounded.
> E.g. write-protecting 65536 GFNs is definitely going to have measurable latency.
Yes.