By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
waiting for aging. This contention reduction improves guest performance
and saves a significant amount of Google Cloud's CPU usage, and it provides
valuable improvements for ChromeOS, as Yu has mentioned previously[1].
Please see v8[8] for some performance results using
access_tracking_perf_test patched to use MGLRU.
Neither access_tracking_perf_test nor mmu_stress_test triggers any
splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.
=== Previous Versions ===
Since v8[8]:
- Re-added the kvm_handle_hva_range helpers and applied Sean's
kvm_{handle -> age}_hva_range rename.
- Renamed spte_has_volatile_bits() to spte_needs_atomic_write() and
removed its Accessed bit check. Undid change to
tdp_mmu_spte_need_atomic_write().
- Renamed KVM_MMU_NOTIFIER_{YOUNG -> AGING}_LOCKLESS.
- cpu_relax(), lockdep, preempt_disable(), and locking fixups for
per-rmap lock (thanks Lai and Sean).
- Renamed kvm_{has -> may_have}_shadow_mmu_sptes().
- Rebased onto latest kvm/next, including changing
for_each_tdp_mmu_root_rcu to use `types`.
- Dropped MGLRU changes from access_tracking_perf_test.
- Picked up Acked-bys from Yu. (thank you!)
Since v7[7]:
- Dropped MGLRU changes.
- Dropped DAMON cleanup.
- Dropped MMU notifier changes completely.
- Made shadow MMU aging *always* lockless, not just lockless when the
now-removed "fast_only" clear notifier was used.
- Given that the MGLRU changes no longer introduce a new MGLRU
capability, drop the new capability check from the selftest.
- Rebased on top of latest kvm-x86/next, including the x86 mmu changes
for marking pages as dirty.
Since v6[6]:
- Rebased on top of kvm-x86/next and Sean's lockless rmap walking
changes.
- Removed HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY (thanks DavidM).
- Split up kvm_age_gfn() / kvm_test_age_gfn() optimizations (thanks
DavidM and Sean).
- Improved new MMU notifier documentation (thanks DavidH).
- Dropped arm64 locking change.
- No longer retry for CAS failure in TDP MMU non-A/D case (thanks
Sean).
- Added some R-bys and A-bys.
Since v5[5]:
- Reworked test_clear_young_fast_only() into a new parameter for the
existing notifiers (thanks Sean).
- Added mmu_notifier.has_fast_aging to tell mm if calling fast-only
notifiers should be done.
- Added mm_has_fast_young_notifiers() to inform users if calling
fast-only notifier helpers is worthwhile (for look-around to use).
- Changed MGLRU to invoke a single notifier instead of two when
aging and doing look-around (thanks Yu).
- For KVM/x86, check indirect_shadow_pages > 0 instead of
kvm_memslots_have_rmaps() when collecting age information
(thanks Sean).
- For KVM/arm, some fixes from Oliver.
- Small fixes to access_tracking_perf_test.
- Added missing !MMU_NOTIFIER version of mmu_notifier_clear_young().
Since v4[4]:
- Removed Kconfig that controlled when aging was enabled. Aging will
be done whenever the architecture supports it (thanks Yu).
- Added a new MMU notifier, test_clear_young_fast_only(), specifically
for MGLRU to use.
- Add kvm_fast_{test_,}age_gfn, implemented by x86.
- Fix locking for clear_flush_young().
- Added KVM_MMU_NOTIFIER_YOUNG_LOCKLESS to clean up locking changes
(thanks Sean).
- Fix WARN_ON and other cleanup for the arm64 locking changes
(thanks Oliver).
Since v3[3]:
- Vastly simplified the series (thanks David). Removed mmu notifier
batching logic entirely.
- Cleaned up how locking is done for mmu_notifier_test/clear_young
(thanks David).
- Look-around is now only done when there are no secondary MMUs
subscribed to MMU notifiers.
- CONFIG_LRU_GEN_WALKS_SECONDARY_MMU has been added.
- Fixed the lockless implementation of kvm_{test,}age_gfn for x86
(thanks David).
- Added MGLRU functional and performance tests to
access_tracking_perf_test (thanks Axel).
- In v3, an mm would be completely ignored (for aging) if there was a
secondary MMU but support for secondary MMU walking was missing. Now,
missing secondary MMU walking support simply skips the notifier
calls (except for eviction).
- Added a sanity check that range->lockless and range->on_lock are
  never both provided for the memslot walk.
For the changes since v2[2], see v3.
Based on latest kvm/next.
[1]: https://lore.kernel.org/kvm/CAOUHufYS0XyLEf_V+q5SCW54Zy2aW5nL8CnSWreM8d1rX5NKYg@mail.gmail.com/
[2]: https://lore.kernel.org/kvmarm/20230526234435.662652-1-yuzhao@google.com/
[3]: https://lore.kernel.org/linux-mm/20240401232946.1837665-1-jthoughton@google.com/
[4]: https://lore.kernel.org/linux-mm/20240529180510.2295118-1-jthoughton@google.com/
[5]: https://lore.kernel.org/linux-mm/20240611002145.2078921-1-jthoughton@google.com/
[6]: https://lore.kernel.org/linux-mm/20240724011037.3671523-1-jthoughton@google.com/
[7]: https://lore.kernel.org/kvm/20240926013506.860253-1-jthoughton@google.com/
[8]: https://lore.kernel.org/kvm/20241105184333.2305744-1-jthoughton@google.com/
James Houghton (7):
KVM: Rename kvm_handle_hva_range()
KVM: Add lockless memslot walk to KVM
KVM: x86/mmu: Factor out spte atomic bit clearing routine
KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()
KVM: x86/mmu: Rename spte_has_volatile_bits() to
spte_needs_atomic_write()
KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as
young
KVM: x86/mmu: Only check gfn age in shadow MMU if
indirect_shadow_pages > 0
Sean Christopherson (4):
KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o
mmu_lock
KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of
mmu_lock
KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging
gfns
Documentation/virt/kvm/locking.rst | 4 +-
arch/x86/include/asm/kvm_host.h | 4 +-
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/mmu/mmu.c | 364 +++++++++++++++++++++--------
arch/x86/kvm/mmu/spte.c | 19 +-
arch/x86/kvm/mmu/spte.h | 2 +-
arch/x86/kvm/mmu/tdp_iter.h | 26 ++-
arch/x86/kvm/mmu/tdp_mmu.c | 36 ++-
include/linux/kvm_host.h | 1 +
virt/kvm/Kconfig | 2 +
virt/kvm/kvm_main.c | 56 +++--
11 files changed, 364 insertions(+), 151 deletions(-)
base-commit: f7bafceba76e9ab475b413578c1757ee18c3e44b
--
2.48.1.362.g079036d154-goog
On Tue, 2025-02-04 at 00:40 +0000, James Houghton wrote:
> By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> waiting for aging. This contention reduction improves guest performance
> and saves a significant amount of Google Cloud's CPU usage, and it provides
> valuable improvements for ChromeOS, as Yu has mentioned previously[1].
>
> Please see v8[8] for some performance results using
> access_tracking_perf_test patched to use MGLRU.
>
> Neither access_tracking_perf_test nor mmu_stress_test triggers any
> splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.
Hi, I have a question about this patch series and about the access_tracking_perf_test:
Some time ago, I investigated a failure in access_tracking_perf_test which shows up in our CI.
The root cause was that 'folio_clear_idle' doesn't clear the idle bit when MGLRU is enabled,
and overall I got the impression that MGLRU is not compatible with idle page tracking.
I thought that this patch series and the 'mm: multi-gen LRU: Have secondary MMUs participate in MM_WALK'
patch series could address this but the test still fails.
For reference, the exact problem is:
1. Idle bits for guest memory under test are set via /sys/kernel/mm/page_idle/bitmap
2. Guest dirties memory, which leads to A/D bits being set in the secondary mappings.
3. The NUMA autobalancing code write-protects the guest memory. KVM in response evicts the SPTE mappings with the A/D bits set,
and while doing so tells mm that the pages were accessed using 'folio_mark_accessed' (via kvm_set_page_accessed (*) ),
but due to MGLRU the call doesn't clear the idle bit, and thus all traces of the guest access disappear
and the kernel thinks that the page is still idle.
I can say that the root cause of this is that folio_mark_accessed doesn't do what it is supposed to do.
Calling 'folio_clear_idle(folio);' in the MGLRU case in folio_mark_accessed()
will probably fix this but I don't have enough confidence
to say if this is all that is needed to fix this.
If this is the case I can send a patch.
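For context, the page_idle interface from step 1 exposes one idle bit per PFN and is accessed in aligned 64-bit words. A minimal sketch of the accesses involved (assuming the PFN was already resolved via /proc/<pid>/pagemap; this mirrors, but is not, the selftest's mark_page_idle()/is_page_idle() helpers):

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Sketch only: one idle bit per PFN, accessed as aligned u64 words. */
static void mark_pfn_idle(int page_idle_fd, uint64_t pfn)
{
	uint64_t bits = 1ULL << (pfn % 64);

	/* Writing a set bit marks the page idle and clears its PTE accessed bits. */
	pwrite(page_idle_fd, &bits, sizeof(bits), (pfn / 64) * sizeof(bits));
}

static bool is_pfn_idle(int page_idle_fd, uint64_t pfn)
{
	uint64_t bits = 0;

	/* A bit still set on read means no access was observed since marking. */
	pread(page_idle_fd, &bits, sizeof(bits), (pfn / 64) * sizeof(bits));
	return bits & (1ULL << (pfn % 64));
}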
This patch makes the test pass (but only on kernel 6.12 and below, see below):
diff --git a/mm/swap.c b/mm/swap.c
index 59f30a981c6f..2013e1f4d572 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -460,7 +460,7 @@ void folio_mark_accessed(struct folio *folio)
{
if (lru_gen_enabled()) {
folio_inc_refs(folio);
- return;
+ goto clear_idle_bit;
}
if (!folio_test_referenced(folio)) {
@@ -485,6 +485,7 @@ void folio_mark_accessed(struct folio *folio)
folio_clear_referenced(folio);
workingset_activation(folio);
}
+clear_idle_bit:
if (folio_test_idle(folio))
folio_clear_idle(folio);
}
To always reproduce this, it is best to use a patch to make the test run in a loop,
like below (although the test fails without this as well).
diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..829774e325fa 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -131,6 +131,7 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
uint64_t pages = vcpu_args->pages;
uint64_t page;
uint64_t still_idle = 0;
+ uint64_t failed_to_mark_idle = 0;
uint64_t no_pfn = 0;
int page_idle_fd;
int pagemap_fd;
@@ -160,6 +161,14 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
}
mark_page_idle(page_idle_fd, pfn);
+
+
+ if (!is_page_idle(page_idle_fd, pfn)) {
+ failed_to_mark_idle++;
+ continue;
+ }
+
+
}
/*
@@ -183,16 +192,15 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
* explicitly flush the TLB when aging SPTEs. As a result, more pages
* are cached and the guest won't see the "idle" bit cleared.
*/
- if (still_idle >= pages / 10) {
+ //if (still_idle >= pages / 10) {
#ifdef __x86_64__
- TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR),
- "vCPU%d: Too many pages still idle (%lu out of %lu)",
- vcpu_idx, still_idle, pages);
+ // TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR),
+ // "vCPU%d: Too many pages still idle (%lu out of %lu)",
+ // vcpu_idx, still_idle, pages);
#endif
- printf("WARNING: vCPU%d: Too many pages still idle (%lu out of %lu), "
- "this will affect performance results.\n",
- vcpu_idx, still_idle, pages);
- }
+ printf("vCPU%d: idle pages: %lu out of %lu, failed to mark idle: %lu no pfn: %lu\n",
+ vcpu_idx, still_idle, pages, failed_to_mark_idle, no_pfn);
+ //}
close(page_idle_fd);
close(pagemap_fd);
@@ -315,14 +323,16 @@ static void run_test(enum vm_guest_mode mode, void *arg)
access_memory(vm, nr_vcpus, ACCESS_WRITE, "Populating memory");
/* As a control, read and write to the populated memory first. */
- access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to populated memory");
- access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");
+ //access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to populated memory");
+ //access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");
/* Repeat on memory that has been marked as idle. */
+again:
mark_memory_idle(vm, nr_vcpus);
access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to idle memory");
- mark_memory_idle(vm, nr_vcpus);
- access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");
+ //mark_memory_idle(vm, nr_vcpus);
+ //access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");
+ goto again;
memstress_join_vcpu_threads(nr_vcpus);
memstress_destroy_vm(vm);
With the above patch applied, you will notice after 4-6 iterations that the number of still idle
pages soars:
Populating memory : 0.798882357s
vCPU0: idle pages: 0 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle : 3.003939277s
Writing to idle memory : 0.503653562s
vCPU0: idle pages: 0 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle : 3.060128175s
Writing to idle memory : 0.502705587s
vCPU0: idle pages: 2048 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle : 3.039294079s
Writing to idle memory : 0.092227612s
vCPU0: idle pages: 0 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle : 3.046216234s
Writing to idle memory : 0.295077724s
vCPU0: idle pages: 132558 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle : 2.711946690s
Writing to idle memory : 0.302882502s
...
(*) It turns out that since kernel 6.13, the code that sets the accessed bit in the primary paging
structures when a secondary mapping is zapped was *removed*. I bisected this to commit:
66bc627e7fee KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs
So now access_tracking_perf_test is broken regardless of MGLRU.
Any ideas on how to fix all this mess?
Best regards,
Maxim Levitsky
On Tue, Feb 18, 2025, Maxim Levitsky wrote:
> On Tue, 2025-02-04 at 00:40 +0000, James Houghton wrote:
> > By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> > vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> > waiting for aging. This contention reduction improves guest performance
> > and saves a significant amount of Google Cloud's CPU usage, and it provides
> > valuable improvements for ChromeOS, as Yu has mentioned previously[1].
> >
> > Please see v8[8] for some performance results using
> > access_tracking_perf_test patched to use MGLRU.
> >
> > Neither access_tracking_perf_test nor mmu_stress_test triggers any
> > splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.
>
>
> Hi, I have a question about this patch series and about the
> access_tracking_perf_test:
>
> Some time ago, I investigated a failure in access_tracking_perf_test which
> shows up in our CI.
>
> The root cause was that 'folio_clear_idle' doesn't clear the idle bit when
> MGLRU is enabled, and overall I got the impression that MGLRU is not
> compatible with idle page tracking.
>
> I thought that this patch series and the 'mm: multi-gen LRU: Have secondary
> MMUs participate in MM_WALK' patch series could address this but the test
> still fails.
>
>
> For reference, the exact problem is:
>
> 1. Idle bits for guest memory under test are set via /sys/kernel/mm/page_idle/bitmap
>
> 2. Guest dirties memory, which leads to A/D bits being set in the secondary mappings.
>
> 3. The NUMA autobalancing code write-protects the guest memory. KVM in response
>    evicts the SPTE mappings with the A/D bits set, and while doing so tells mm
>    that the pages were accessed using 'folio_mark_accessed' (via kvm_set_page_accessed (*) ),
>    but due to MGLRU the call doesn't clear the idle bit and thus all traces
>    of the guest access disappear and the kernel thinks that the page is still idle.
>
> I can say that the root cause of this is that folio_mark_accessed doesn't do
> what it is supposed to do.
>
> Calling 'folio_clear_idle(folio);' in the MGLRU case in folio_mark_accessed()
> will probably fix this but I don't have enough confidence to say if this is
> all that is needed to fix this. If this is the case I can send a patch.
My understanding is that the behavior is deliberate. Per Yu[1], page_idle/bitmap
effectively isn't supported by MGLRU.
[1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com
> [...]
>
> (*) It turns out that since kernel 6.13, the code that sets the accessed bit in the
> primary paging structures when a secondary mapping is zapped was *removed*. I
> bisected this to commit:
>
> 66bc627e7fee KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs
>
> So now access_tracking_perf_test is broken regardless of MGLRU.
Just to confirm, do you see failures on 6.13 with MGLRU disabled?
> Any ideas on how to fix all this mess?
The easy answer is to skip the test if MGLRU is in use, or if NUMA balancing is
enabled. In a real-world scenario, if the guest is actually accessing the pages
that get PROT_NONE'd by NUMA balancing, they will be marked accessed when they're
faulted back in. There's a window where page_idle/bitmap could be read between
making the VMA PROT_NONE and re-accessing the page from the guest, but IMO that's
one of the many flaws of NUMA balancing.
That said, one thing is quite odd. In the failing case, *half* of the guest pages
are still idle. That's quite insane.
Aha! I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
different node, and that causes NUMA balancing to go crazy and zap pretty much
all of guest memory. If that's what's happening, then a better solution for the
NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
pin it to a single pCPU?).
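A minimal sketch of such hard pinning (hypothetical helper, not taken from the selftest), using sched_setaffinity() on the calling thread:

#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper: hard-pin the calling (e.g. vCPU) thread to one pCPU. */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);

	/* pid 0 selects the calling thread. */
	return sched_setaffinity(0, sizeof(mask), &mask);
}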
On Tue, 2025-02-18 at 17:13 -0800, Sean Christopherson wrote:
> On Tue, Feb 18, 2025, Maxim Levitsky wrote:
> > [...]
>
> My understanding is that the behavior is deliberate. Per Yu[1], page_idle/bitmap
> effectively isn't supported by MGLRU.
>
> [1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com
Hi,
Reading this mail makes me think that the page idle interface isn't really used anymore.
Maybe we should redo the access_tracking_perf_test to only use the MGLRU-specific interfaces/mode,
and remove its classical page_idle mode altogether?
The point I am trying to get across is that currently access_tracking_perf_test's main
purpose is to test that page_idle works with secondary paging, and the fact is that it doesn't work
well due to more than one reason:
The mere fact that we don't flush the TLB already necessitated hacks like the 90% check,
which for example doesn't work nested, so another hack was needed to skip the check
completely when a hypervisor is detected, etc.
And now as of 6.13, we don't propagate the accessed bit when KVM zaps the SPTE at all,
which can happen at least in theory due to other reasons than NUMA balancing.
Tomorrow there will be something else that will cause KVM to zap the SPTEs, and the test will fail again,
and again...
What do you think?
>
> [...]
>
> > (*) It turns out that since kernel 6.13, the code that sets the accessed bit in the
> > primary paging structures when a secondary mapping is zapped was *removed*. I
> > bisected this to commit:
> >
> > 66bc627e7fee KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs
> >
> > So now access_tracking_perf_test is broken regardless of MGLRU.
>
> Just to confirm, do you see failures on 6.13 with MGLRU disabled?
Yes. The test always fails.
>
> > Any ideas on how to fix all this mess?
>
> The easy answer is to skip the test if MGLRU is in use, or if NUMA balancing is
> enabled. In a real-world scenario, if the guest is actually accessing the pages
> that get PROT_NONE'd by NUMA balancing, they will be marked accessed when they're
> faulted back in. There's a window where page_idle/bitmap could be read between
> making the VMA PROT_NONE and re-accessing the page from the guest, but IMO that's
> one of the many flaws of NUMA balancing.
>
> That said, one thing is quite odd. In the failing case, *half* of the guest pages
> are still idle. That's quite insane.
>
> Aha! I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> different node, and that causes NUMA balancing to go crazy and zap pretty much
> all of guest memory. If that's what's happening, then a better solution for the
> NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> pin it to a single pCPU?).
Nope. I pinned the main thread to CPU 0 and the VM thread to CPU 1, and the problem persists.
On 6.13, the only way to make the test consistently work is to disable NUMA balancing.
Best regards,
Maxim Levitsky
On Tue, Feb 25, 2025, Maxim Levitsky wrote:
> On Tue, 2025-02-18 at 17:13 -0800, Sean Christopherson wrote:
> > My understanding is that the behavior is deliberate. Per Yu[1], page_idle/bitmap
> > effectively isn't supported by MGLRU.
> >
> > [1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com
>
> Hi,
>
> Reading this mail makes me think that the page idle interface isn't really
> used anymore.

I'm sure it's still used in production somewhere. And even if it's being phased
out in favor of MGLRU, it's still super useful for testing purposes, because it
gives userspace much more direct control over aging.

> Maybe we should redo the access_tracking_perf_test to only use the MGLRU-specific
> interfaces/mode, and remove its classical page_idle mode altogether?

I don't want to take a hard dependency on MGLRU (unless page_idle gets fully
deprecated/removed by the kernel), and I also don't think page_idle is the main
problem with the test.

> The point I am trying to get across is that currently access_tracking_perf_test's
> main purpose is to test that page_idle works with secondary paging, and the fact
> is that it doesn't work well due to more than one reason:

The primary purpose of the test is to measure performance. Asserting that 90%+
pages were dirtied is a sanity check, not an outright goal.

> The mere fact that we don't flush the TLB already necessitated hacks like the 90%
> check, which for example doesn't work nested, so another hack was needed to skip
> the check completely when a hypervisor is detected, etc.

100% agreed here.

> And now as of 6.13, we don't propagate the accessed bit when KVM zaps the SPTE at
> all, which can happen at least in theory due to other reasons than NUMA balancing.
>
> Tomorrow there will be something else that will cause KVM to zap the SPTEs, and
> the test will fail again, and again...
>
> What do you think?

What if we make the assertion user controllable? I.e. let the user opt-out (or
off-by-default and opt-in) via command line? We did something similar for the
rseq test, because the test would run far fewer iterations than expected if the
vCPU task was migrated to CPU(s) in deep sleep states.

	TEST_ASSERT(skip_sanity_check || i > (NR_TASK_MIGRATIONS / 2),
		    "Only performed %d KVM_RUNs, task stalled too much?\n\n"
		    "  Try disabling deep sleep states to reduce CPU wakeup latency,\n"
		    "  e.g. via cpuidle.off=1 or setting /dev/cpu_dma_latency to '0',\n"
		    "  or run with -u to disable this sanity check.", i);

This is quite similar, because as you say, it's impractical for the test to
account for every possible environmental quirk.

> > Aha! I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> > different node, and that causes NUMA balancing to go crazy and zap pretty much
> > all of guest memory. If that's what's happening, then a better solution for the
> > NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> > pin it to a single pCPU?).
>
> Nope. I pinned the main thread to CPU 0 and the VM thread to CPU 1, and the
> problem persists. On 6.13, the only way to make the test consistently work is
> to disable NUMA balancing.

Well that's odd. While I'm quite curious as to what's happening, my stance is
that enabling NUMA balancing with KVM is a terrible idea, so my vote is to sweep
it under the rug and let the user disable the sanity check.
On Tue, 2025-02-25 at 16:50 -0800, Sean Christopherson wrote:
> On Tue, Feb 25, 2025, Maxim Levitsky wrote:
> > Reading this mail makes me think that the page idle interface isn't really
> > used anymore.
>
> I'm sure it's still used in production somewhere. And even if it's being phased
> out in favor of MGLRU, it's still super useful for testing purposes, because it
> gives userspace much more direct control over aging.

I also think that page_idle is used somewhere in production, and it probably
works more or less correctly with regular non-VM processes, although I have no
idea how well it works when MGLRU is enabled.

My point was that using page_idle to track guest memory is something that is
probably not used because it doesn't work that well, and nobody seems to complain.
However I don't ask for it to be removed, although a note of deprecation might be
worth it if really nobody uses it.

> > Maybe we should redo the access_tracking_perf_test to only use the MGLRU-specific
> > interfaces/mode, and remove its classical page_idle mode altogether?
>
> I don't want to take a hard dependency on MGLRU (unless page_idle gets fully
> deprecated/removed by the kernel), and I also don't think page_idle is the main
> problem with the test.
>
> The primary purpose of the test is to measure performance. Asserting that 90%+
> pages were dirtied is a sanity check, not an outright goal.

From my POV, a performance test can't really be a selftest unless it actually
fails when it detects unusually low performance. Otherwise who is going to be
alarmed when a regression happens and things actually get slower?

> What if we make the assertion user controllable? I.e. let the user opt-out (or
> off-by-default and opt-in) via command line? We did something similar for the
> rseq test, because the test would run far fewer iterations than expected if the
> vCPU task was migrated to CPU(s) in deep sleep states.
>
> [...]
>
> This is quite similar, because as you say, it's impractical for the test to
> account for every possible environmental quirk.

No objections in principle, especially if the sanity check is skipped by default,
although this does sort of defeat the purpose of the check.
I guess that the check might still be used by developers.

> > > Aha! I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> > > different node, and that causes NUMA balancing to go crazy and zap pretty much
> > > all of guest memory.
> >
> > Nope. I pinned the main thread to CPU 0 and the VM thread to CPU 1, and the
> > problem persists. On 6.13, the only way to make the test consistently work is
> > to disable NUMA balancing.
>
> Well that's odd. While I'm quite curious as to what's happening, my stance is
> that enabling NUMA balancing with KVM is a terrible idea, so my vote is to sweep
> it under the rug and let the user disable the sanity check.

One thing for sure: with NUMA balancing off, the test passes well (it shows on
average around 100-200 idle pages) and I have run it for a long time.

Best regards,
	Maxim Levitsky
On Wed, Feb 26, 2025, Maxim Levitsky wrote:
> On Tue, 2025-02-25 at 16:50 -0800, Sean Christopherson wrote:
> > What if we make the assertion user controllable? I.e. let the user opt-out (or
> > off-by-default and opt-in) via command line?
> >
> > This is quite similar, because as you say, it's impractical for the test to
> > account for every possible environmental quirk.
>
> No objections in principle, especially if the sanity check is skipped by default,
> although this does sort of defeat the purpose of the check.
> I guess that the check might still be used by developers.

A middle ground would be to enable the check by default if NUMA balancing is off.
We can always revisit the default setting if it turns out there are other
problematic "features".

> > > > Aha! I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> > > > different node, and that causes NUMA balancing to go crazy and zap pretty much
> > > > all of guest memory. If that's what's happening, then a better solution for the
> > > > NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> > > > pin it to a single pCPU?).
> > >
> > > Nope. I pinned the main thread to CPU 0 and the VM thread to CPU 1, and the
> > > problem persists. On 6.13, the only way to make the test consistently work is
> > > to disable NUMA balancing.
> >
> > Well that's odd. While I'm quite curious as to what's happening,

Gah, chatting about this offline jogged my memory. NUMA balancing doesn't zap
(mark PROT_NONE/PROT_NUMA) PTEs for pages the kernel thinks are being accessed
remotely, it zaps PTEs to see if they are being accessed remotely. So yeah,
whenever NUMA balancing kicks in, the guest will see a large amount of its memory
get re-faulted.

Which is why it's such a terrible feature to pair with KVM, at least as-is. NUMA
balancing is predicated on inducing and resolving the #PF being relatively cheap,
but that doesn't hold true for secondary MMUs due to the coarse nature of
mmu_notifiers.
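A sketch of what that middle ground could look like (hypothetical helper, not the actual selftest change), keying the default off the kernel.numa_balancing sysctl:

#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: report whether NUMA balancing is active, so the
 * idle-page sanity check can be enforced by default only when it is off.
 */
static bool numa_balancing_enabled(void)
{
	FILE *f = fopen("/proc/sys/kernel/numa_balancing", "r");
	int enabled = 0;

	if (!f)
		return false; /* sysctl absent: feature not built in. */
	if (fscanf(f, "%d", &enabled) != 1)
		enabled = 0;
	fclose(f);
	return enabled != 0;
}

The command line option would still override whatever default is derived from this.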
On Wed, 2025-02-26 at 16:51 -0800, Sean Christopherson wrote:
> On Wed, Feb 26, 2025, Maxim Levitsky wrote:
> > No objections in principle, especially if the sanity check is skipped by default,
> > although this does sort of defeat the purpose of the check.
> > I guess that the check might still be used by developers.
>
> A middle ground would be to enable the check by default if NUMA balancing is off.
> We can always revisit the default setting if it turns out there are other
> problematic "features".

That works for me. I can send a patch for this then.

> Gah, chatting about this offline jogged my memory. NUMA balancing doesn't zap
> (mark PROT_NONE/PROT_NUMA) PTEs for pages the kernel thinks are being accessed
> remotely, it zaps PTEs to see if they are being accessed remotely. So yeah,
> whenever NUMA balancing kicks in, the guest will see a large amount of its memory
> get re-faulted.
>
> Which is why it's such a terrible feature to pair with KVM, at least as-is. NUMA
> balancing is predicated on inducing and resolving the #PF being relatively cheap,
> but that doesn't hold true for secondary MMUs due to the coarse nature of
> mmu_notifiers.

I also think so, not to mention that VM exits aren't that cheap either, and the
general direction is to avoid them as much as possible, and here this feature
pretty much yanks the memory out of the guest every two seconds or so, causing
lots of VM exits.

Best regards,
	Maxim Levitsky
On Tue, Feb 18, 2025 at 5:13 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Feb 18, 2025, Maxim Levitsky wrote:
> > [...]
> >
> > The root cause was that 'folio_clear_idle' doesn't clear the idle bit when
> > MGLRU is enabled, and overall I got the impression that MGLRU is not
> > compatible with idle page tracking.
> >
> > [...]
>
> My understanding is that the behavior is deliberate. Per Yu[1], page_idle/bitmap
> effectively isn't supported by MGLRU.
>
> [1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com

Yu's suggestion was to look at the generation numbers themselves, and that is
exactly what my patched access_tracking_perf_test does[2]. :)

So I think to make this work with MGLRU, I'll re-post my
access_tracking_perf_test patch, but if MGLRU is enabled, always use the MGLRU
debugfs instead of using page_idle/bitmap. It needs some cleanup first though.

[2]: https://lore.kernel.org/kvm/20241105184333.2305744-12-jthoughton@google.com/

> > Any ideas on how to fix all this mess?
>
> The easy answer is to skip the test if MGLRU is in use, or if NUMA balancing is
> enabled. In a real-world scenario, if the guest is actually accessing the pages
> that get PROT_NONE'd by NUMA balancing, they will be marked accessed when they're
> faulted back in. There's a window where page_idle/bitmap could be read between
> making the VMA PROT_NONE and re-accessing the page from the guest, but IMO that's
> one of the many flaws of NUMA balancing.
>
> That said, one thing is quite odd. In the failing case, *half* of the guest pages
> are still idle. That's quite insane.
>
> Aha! I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> different node, and that causes NUMA balancing to go crazy and zap pretty much
> all of guest memory. If that's what's happening, then a better solution for the
> NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> pin it to a single pCPU?).

+1 to this idea, if this is really what's going on. If NUMA balancing is only
migrating a few pages, the 90% threshold in the test should be low enough that
we tolerate the few pages that were moved.

Or we could just print a warning (instead of fail) if NUMA balancing is enabled.
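If the test needs to detect whether MGLRU is active in order to pick the interface, a minimal sketch (not the posted patch) reading the documented /sys/kernel/mm/lru_gen/enabled feature mask:

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch only: MGLRU exposes a hex feature mask; a nonzero value means it
 * is enabled and the test should prefer the MGLRU debugfs over page_idle.
 */
static bool mglru_enabled(void)
{
	FILE *f = fopen("/sys/kernel/mm/lru_gen/enabled", "r");
	unsigned int caps = 0;

	if (!f)
		return false; /* Kernel built without MGLRU. */
	if (fscanf(f, "%x", &caps) != 1)
		caps = 0;
	fclose(f);
	return caps != 0;
}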
On Tue, 04 Feb 2025 00:40:27 +0000, James Houghton wrote:
> By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> waiting for aging. This contention reduction improves guest performance
> and saves a significant amount of Google Cloud's CPU usage, and it provides
> valuable improvements for ChromeOS, as Yu has mentioned previously[1].
>
> Please see v8[8] for some performance results using
> access_tracking_perf_test patched to use MGLRU.
>
> [...]
Applied to kvm-x86 mmu, thanks!
[01/11] KVM: Rename kvm_handle_hva_range()
https://github.com/kvm-x86/linux/commit/374ccd63600b
[02/11] KVM: Add lockless memslot walk to KVM
https://github.com/kvm-x86/linux/commit/aa34b811650c
[03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine
https://github.com/kvm-x86/linux/commit/e29b74920e6f
[04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()
https://github.com/kvm-x86/linux/commit/b146a9b34aed
[05/11] KVM: x86/mmu: Rename spte_has_volatile_bits() to spte_needs_atomic_write()
https://github.com/kvm-x86/linux/commit/61d65f2dc766
[06/11] KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as young
https://github.com/kvm-x86/linux/commit/e25c2332346f
[07/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0
https://github.com/kvm-x86/linux/commit/8c403cf23119
[08/11] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock
https://github.com/kvm-x86/linux/commit/9fb13ba6b5ff
[09/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
https://github.com/kvm-x86/linux/commit/4834eaded91e
[10/11] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
https://github.com/kvm-x86/linux/commit/bb6c7749ccee
[11/11] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns
https://github.com/kvm-x86/linux/commit/af3b6a9eba48
--
https://github.com/kvm-x86/linux/tree/next