[v1] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables

[PATCH 0/3] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables

Posted by Raghavendra Rao Ananta 2 months, 3 weeks ago

Hello,

When destroying a fully-mapped 128G VM abruptly, the following scheduler
warning is observed:

  sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule
  CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE
  Tainted: [O]=OOT_MODULE
  Call trace:
      show_stack+0x20/0x38 (C)
      dump_stack_lvl+0x3c/0xb8
      dump_stack+0x18/0x30
      resched_latency_warn+0x7c/0x88
      sched_tick+0x1c4/0x268
      update_process_times+0xa8/0xd8
      tick_nohz_handler+0xc8/0x168
      __hrtimer_run_queues+0x11c/0x338
      hrtimer_interrupt+0x104/0x308
      arch_timer_handler_phys+0x40/0x58
      handle_percpu_devid_irq+0x8c/0x1b0
      generic_handle_domain_irq+0x48/0x78
      gic_handle_irq+0x1b8/0x408
      call_on_irq_stack+0x24/0x30
      do_interrupt_handler+0x54/0x78
      el1_interrupt+0x44/0x88
      el1h_64_irq_handler+0x18/0x28
      el1h_64_irq+0x84/0x88
      stage2_free_walker+0x30/0xa0 (P)
      __kvm_pgtable_walk+0x11c/0x258
      __kvm_pgtable_walk+0x180/0x258
      __kvm_pgtable_walk+0x180/0x258
      __kvm_pgtable_walk+0x180/0x258
      kvm_pgtable_walk+0xc4/0x140
      kvm_pgtable_stage2_destroy+0x5c/0xf0
      kvm_free_stage2_pgd+0x6c/0xe8
      kvm_uninit_stage2_mmu+0x24/0x48
      kvm_arch_flush_shadow_all+0x80/0xa0
      kvm_mmu_notifier_release+0x38/0x78
      __mmu_notifier_release+0x15c/0x250
      exit_mmap+0x68/0x400
      __mmput+0x38/0x1c8
      mmput+0x30/0x68
      exit_mm+0xd4/0x198
      do_exit+0x1a4/0xb00
      do_group_exit+0x8c/0x120
      get_signal+0x6d4/0x778
      do_signal+0x90/0x718
      do_notify_resume+0x70/0x170
      el0_svc+0x74/0xd8
      el0t_64_sync_handler+0x60/0xc8
      el0t_64_sync+0x1b0/0x1b8

The host kernel was running with CONFIG_PREEMPT_NONE=y. The page-table
walk takes a considerable amount of time for a VM with this many PTEs
mapped and never yields the CPU, hence the warning.

To mitigate this, split the walk into smaller ranges and call
cond_resched() between them. Since this path runs during VM
destruction, after the page-table structure has been unlinked from the
KVM MMU, relying on cond_resched_rwlock_write() isn't necessary.

Patch-1 removes the assumption (in stage2_free_walker()) that the
page-table hierarchy under a table has already been freed. Instead,
references are dropped and cleared only on empty tables.
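
To illustrate the "only on empty tables" rule, here is a minimal
userspace sketch. It is a toy model, not the stage2_free_walker()
code: the two-level layout, the toy_* names and the refcount
convention (1 for the table itself plus 1 per valid entry) are all
assumptions made for the example. A child table is unlinked and freed
only once its refcount has dropped back to 1, so a walk that covers
part of a table leaves it linked for a later walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_ENTRIES 4	/* entries per toy table (hypothetical) */

/* Toy leaf table: refcount is 1 for the table plus 1 per valid entry. */
struct toy_leaf {
	int refcount;
	bool valid[TOY_ENTRIES];
};

/* Toy root table, using the same refcount convention. */
struct toy_root {
	int refcount;
	struct toy_leaf *child[TOY_ENTRIES];
};

/* Build a root with nr_leaves fully populated leaf tables. */
static struct toy_root *toy_build(int nr_leaves)
{
	struct toy_root *root = calloc(1, sizeof(*root));

	assert(root && nr_leaves <= TOY_ENTRIES);
	root->refcount = 1;
	for (int i = 0; i < nr_leaves; i++) {
		struct toy_leaf *leaf = calloc(1, sizeof(*leaf));

		assert(leaf);
		leaf->refcount = 1;
		for (int j = 0; j < TOY_ENTRIES; j++) {
			leaf->valid[j] = true;
			leaf->refcount++;
		}
		root->child[i] = leaf;
		root->refcount++;
	}
	return root;
}

/* Tear down the entries whose index falls in [start, end). */
static void toy_free_range(struct toy_root *root, int start, int end)
{
	for (int i = 0; i < TOY_ENTRIES; i++) {
		struct toy_leaf *leaf = root->child[i];

		if (!leaf)
			continue;

		for (int j = 0; j < TOY_ENTRIES; j++) {
			int idx = i * TOY_ENTRIES + j;

			if (idx >= start && idx < end && leaf->valid[j]) {
				leaf->valid[j] = false;
				leaf->refcount--;
			}
		}

		/* Still has live entries: leave it for a later walk. */
		if (leaf->refcount != 1)
			continue;

		/* Empty: clear the stale pointer, then drop the references. */
		root->child[i] = NULL;
		free(leaf);
		root->refcount--;
	}
}
```

With two populated leaves, toy_free_range(root, 0, 6) frees the first
leaf but keeps the second one linked, since two of its entries are
still valid. A follow-up toy_free_range(root, 6, 8) then releases it.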

Patch-2 splits the kvm_pgtable_stage2_destroy() function into separate
'walk' and 'free PGD' parts.

Patch-3 leverages the split and performs the walk over smaller ranges,
calling cond_resched() between them.
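
The shape of the split plus the chunked walk can be sketched in
userspace as follows. This is only an illustration under stated
assumptions: the toy_* names, the 1 GiB chunk size and the counters
are hypothetical, and toy_cond_resched() merely stands in for the
kernel's cond_resched(). The 'walk' part is called once per chunk with
a reschedule point between chunks, and the 'free PGD' part runs once
at the end.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical chunk size, chosen for the example only. */
#define TOY_CHUNK	(1ULL << 30)

struct toy_pgt {
	uint64_t size;			/* bytes of address space covered */
	uint64_t bytes_walked;		/* bytes torn down so far */
	unsigned int resched_points;	/* times we offered to reschedule */
	int pgd_freed;			/* set once the top level is freed */
};

/* 'walk' part: tear down the mappings in [addr, addr + len). */
static void toy_destroy_range(struct toy_pgt *pgt, uint64_t addr, uint64_t len)
{
	(void)addr;	/* a real walker would visit the PTEs in this range */
	pgt->bytes_walked += len;
}

/* 'free PGD' part: release the top-level table, once, at the end. */
static void toy_destroy_pgd(struct toy_pgt *pgt)
{
	pgt->pgd_freed = 1;
}

/* Stand-in for cond_resched(); here it only counts the calls. */
static void toy_cond_resched(struct toy_pgt *pgt)
{
	pgt->resched_points++;
}

/* Walk the whole range chunk by chunk, then free the PGD. */
static void toy_destroy(struct toy_pgt *pgt)
{
	uint64_t addr = 0;

	assert(TOY_CHUNK > 0);

	while (addr < pgt->size) {
		uint64_t left = pgt->size - addr;
		uint64_t len = left < TOY_CHUNK ? left : TOY_CHUNK;

		toy_destroy_range(pgt, addr, len);
		addr += len;

		/* Offer to reschedule between chunks, not after the last. */
		if (addr < pgt->size)
			toy_cond_resched(pgt);
	}

	toy_destroy_pgd(pgt);
}
```

For a 128 GiB range this model performs 128 chunked walks with 127
reschedule points in between, so no single stretch of work covers more
than one chunk.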

The series was originally posted and merged [1], but was later reverted
after syzkaller caught a UAF bug [2]. This series fixes that bug while
still addressing the original need_resched warning.

[1]: https://lore.kernel.org/all/175582091313.1266576.4329884314263043118.b4-ty@linux.dev/
[2]: https://lore.kernel.org/all/20250910180930.3679473-1-oliver.upton@linux.dev/ 

Oliver Upton (1):
  KVM: arm64: Only drop references on empty tables in stage2_free_walker

Raghavendra Rao Ananta (2):
  KVM: arm64: Split kvm_pgtable_stage2_destroy()
  KVM: arm64: Reschedule as needed when destroying the stage-2
    page-tables

 arch/arm64/include/asm/kvm_pgtable.h | 30 +++++++++++++
 arch/arm64/include/asm/kvm_pkvm.h    |  4 +-
 arch/arm64/kvm/hyp/pgtable.c         | 63 +++++++++++++++++++++++-----
 arch/arm64/kvm/mmu.c                 | 36 +++++++++++++++-
 arch/arm64/kvm/pkvm.c                | 11 ++++-
 5 files changed, 129 insertions(+), 15 deletions(-)


base-commit: dcb6fa37fd7bc9c3d2b066329b0d27dedf8becaa
-- 
2.51.2.1041.gc1ab5b90ca-goog

Re: [PATCH 0/3] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables

Posted by Marc Zyngier 1 week, 3 days ago

On Thu, 13 Nov 2025 05:24:49 +0000,
Raghavendra Rao Ananta <rananta@google.com> wrote:
> [...]

As a heads-up: I suspect this series is breaking my NV guests in a
pretty bad way. L2 and L3 guests are getting stuck, and L0 and L1 barf
on S2 PTs that are being destroyed. This stinks of TLB invalidation
going very wrong, which would result in S2 management going similarly
sideways. I still need to work out whether that is just triggering
something bad somewhere else.

For what it is worth, this reproduces on both M2 and QC machines.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

Re: [PATCH 0/3] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables

Posted by Oliver Upton 2 months, 2 weeks ago

On Thu, 13 Nov 2025 05:24:49 +0000, Raghavendra Rao Ananta wrote:
> When destroying a fully-mapped 128G VM abruptly, the following scheduler
> warning is observed:
> 
> [...]

Applied to next, thanks!

[1/3] KVM: arm64: Only drop references on empty tables in stage2_free_walker
      https://git.kernel.org/kvmarm/kvmarm/c/156f70afcfec
[2/3] KVM: arm64: Split kvm_pgtable_stage2_destroy()
      https://git.kernel.org/kvmarm/kvmarm/c/d68d66e57e2b
[3/3] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables
      https://git.kernel.org/kvmarm/kvmarm/c/4ddfab5436b6

--
Best,
Oliver