[v5] KSM: performance optimizations for rmap_walk_ksm

[PATCH v5 0/5] KSM: performance optimizations for rmap_walk_ksm

Posted by xu.xin16@zte.com.cn 5 days, 11 hours ago

From: xu xin <xu.xin16@zte.com.cn>

When available memory is extremely tight, causing KSM pages to be swapped
out, or when there is significant memory fragmentation and THP triggers
memory compaction, the system will invoke the rmap_walk_ksm function to
perform reverse mapping. However, we observed that this function becomes
particularly time-consuming when a large number of VMAs (e.g., 20,000)
share the same anon_vma. Through debug trace analysis, we found that most
of the latency occurs within anon_vma_interval_tree_foreach, leading to an
excessively long hold time on the anon_vma lock (even reaching 500ms or
more), which in turn causes upper-layer applications (waiting for the
anon_vma lock) to be blocked for extended periods.

This series fixes a severe KSM reverse-mapping performance problem
that can freeze applications for hundreds of milliseconds under
memory pressure, especially when many unrelated VMAs share a
single anon_vma.

Two key highlights:

1. Lock hold time drops from >500ms to <2ms
   - In our benchmark (20,000 VMAs sharing an anon_vma), worst-case
     anon_vma lock hold time during KSM rmap walk went from 705ms
     down to 1.67ms (max) and 1.44ms (avg).

2. Real user impact
   - The anon_vma lock is also acquired by page faults, reclaim,
     migration, compaction, mlock, exit_mmap, and cgroup accounting.

   - A long hold due to inefficient rmap walks stalls application
     threads, causing latency spikes, reduced throughput, or even
     container timeouts.

   - The problem occurs even without fork() – VMA splitting (e.g.,
     via mprotect or madvise over time) can create tens of thousands
     of VMAs all attached to the same anon_vma.

Patch summary:
==============
patch 1/5: mm/rmap: add tracepoint for rmap_walk
      - Zero overhead when disabled; offline latency analysis.

patch 2/5: tools/testing: add rmap benchmark
      - Measures KSM/anon/file rmap walks.

patch 3/5: ksm: add pgoff into ksm_rmap_item
      - Stores linear page offset (not vm_pgoff) using a union.
      - Cleared on failure paths including break_cow().

patch 4/5: ksm: optimize rmap_walk_ksm by passing a suitable range
      - Uses stored pgoff to narrow interval tree search.
      - Reduces iterations from >22k to ~3; lock hold 705ms -> 1.67ms.
      - Includes detailed user-impact description (suggested by Andrew).

patch 5/5: ksm: add mremap selftests for ksm_rmap_walk
      - Single-process, 32 pages; covers mremap + KSM + migration.
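
The effect described for patch 4/5 can be sketched in userspace. The
code below is not taken from the series: struct vma_interval,
fill_single_page_vmas() and count_overlaps() are hypothetical names, and
a linear scan stands in for the kernel's interval tree. An interval tree
query reports only the intervals that overlap the queried range, so the
overlap count is a reasonable model of how many VMAs the walk has to
visit.

```c
#include <assert.h>
#include <limits.h>	/* ULONG_MAX, for a "whole range" query */
#include <stddef.h>

/* Hypothetical model: one VMA covers page offsets [start, last], inclusive. */
struct vma_interval {
	unsigned long start;
	unsigned long last;
};

/* Fill v[0..n-1] with n adjacent single-page VMAs, as after heavy splitting. */
void fill_single_page_vmas(struct vma_interval *v, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		v[i].start = i;
		v[i].last = i;
	}
}

/*
 * Count the VMAs whose interval overlaps the query range [first, last].
 * This models the number of iterations of the rmap walk loop; it does
 * not model the tree descent itself.
 */
size_t count_overlaps(const struct vma_interval *v, size_t n,
		      unsigned long first, unsigned long last)
{
	size_t hits = 0;

	assert(first <= last);
	for (size_t i = 0; i < n; i++)
		if (v[i].start <= last && v[i].last >= first)
			hits++;
	return hits;
}
```

With 20,000 single-page VMAs, a query over [0, ULONG_MAX] matches every
VMA, while a query over [pgoff, pgoff] matches only the VMA that covers
that page offset.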

---

Changes in v5:
- Patch 1: replaced local_clock() with tracepoints – no overhead
           when tracepoints are disabled.
- Patch 3: switched from vm_pgoff (unstable after VMA split) to a
           linear page offset.
- Patch 4: adapted to the linear page offset; added user-impact
           description (real workloads, lock contention examples,
           VMA splitting scenario).
- Patch 5: simplified to a single process with 32 pages (instead
           of multi-process), as suggested by David.
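
The difference between vm_pgoff and a linear page offset can be shown
with a small model. This is a sketch under assumptions, not code from
patch 3/5: struct vma_model and both helpers are hypothetical, 4 KiB
pages are assumed, and the offset arithmetic mirrors what the kernel's
linear_page_index() does for small pages. Splitting a VMA gives the new
upper VMA a different vm_pgoff, but the linear offset of any given
address stays the same.

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the three vm_area_struct fields used here. */
struct vma_model {
	unsigned long vm_start;	/* first byte of the mapping */
	unsigned long vm_end;	/* one past the last byte */
	unsigned long vm_pgoff;	/* page offset that corresponds to vm_start */
};

/* Linear page offset of addr: vm_pgoff plus the page distance from vm_start. */
unsigned long linear_pgoff(const struct vma_model *vma, unsigned long addr)
{
	assert(addr >= vma->vm_start && addr < vma->vm_end);
	return vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
}

/*
 * Split vma at addr. vma keeps [vm_start, addr); the returned VMA covers
 * [addr, vm_end) and its vm_pgoff is advanced by the pages cut off below.
 */
struct vma_model split_vma_model(struct vma_model *vma, unsigned long addr)
{
	struct vma_model tail = *vma;

	assert(addr > vma->vm_start && addr < vma->vm_end);
	tail.vm_start = addr;
	tail.vm_pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
	vma->vm_end = addr;
	return tail;
}
```

For a 16-page VMA with vm_pgoff 0x100, the address ten pages in has
linear offset 0x10a. After a split at page eight, the VMA covering that
address has vm_pgoff 0x108, yet linear_pgoff() still returns 0x10a.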

Changes in v4:
- Add a tracepoint for rmap_walk
- Provide a testbench for rmap_walk
- Add vm_pgoff field in ksm_rmap_item
- Use vm_pgoff instead of address >> PAGE_SHIFT (Suggested by David and Lorenzo)

Changes in v3:
- Fix some typos in commit description
- Replace 'pgoff_start' and 'pgoff_end' with 'pgoff'.

Changes in v2:
- Use const variables to initialize 'addr', 'pgoff_start' and 'pgoff_end'
- Let pgoff_end = pgoff_start, since KSM folios are always order-0 (Suggested by David)

xu xin (5):
  mm/rmap: add tracepoint for rmap_walk
  tools/testing: add rmap walk latency benchmark for KSM, anonymous and
    file pages
  ksm: add pgoff into ksm_rmap_item
  ksm: Optimize rmap_walk_ksm by passing a suitable address range
  ksm: add mremap selftests for ksm_rmap_walk

 MAINTAINERS                          |   3 +
 include/trace/events/rmap.h          |  73 ++++
 mm/ksm.c                             |  48 ++-
 mm/rmap.c                            |   9 +
 tools/testing/rmap/Makefile          |  11 +
 tools/testing/rmap/rmap_benchmark.c  | 529 +++++++++++++++++++++++++++
 tools/testing/selftests/mm/rmap.c    |  76 ++++
 tools/testing/selftests/mm/vm_util.c |  38 ++
 tools/testing/selftests/mm/vm_util.h |   2 +
 9 files changed, 781 insertions(+), 8 deletions(-)
 create mode 100644 include/trace/events/rmap.h
 create mode 100644 tools/testing/rmap/Makefile
 create mode 100644 tools/testing/rmap/rmap_benchmark.c

-- 
2.25.1

Re: [PATCH v5 0/5] KSM: performance optimizations for rmap_walk_ksm

Posted by Andrew Morton 5 days, 8 hours ago

On Tue, 19 May 2026 22:05:36 +0800 (CST) <xu.xin16@zte.com.cn> wrote:

> From: xu xin <xu.xin16@zte.com.cn>
> 
> When available memory is extremely tight, causing KSM pages to be swapped
> out, or when there is significant memory fragmentation and THP triggers
> memory compaction, the system will invoke the rmap_walk_ksm function to
> perform reverse mapping. However, we observed that this function becomes
> particularly time-consuming when a large number of VMAs (e.g., 20,000)
> share the same anon_vma. Through debug trace analysis, we found that most
> of the latency occurs within anon_vma_interval_tree_foreach, leading to an
> excessively long hold time on the anon_vma lock (even reaching 500ms or
> more), which in turn causes upper-layer applications (waiting for the
> anon_vma lock) to be blocked for extended periods.
> 
> This series fixes a severe KSM reverse-mapping performance problem
> that can freeze applications for hundreds of milliseconds under
> memory pressure especially when a lot of unrelated VMAs sharing a
> single anon_vma.

That would be good to fix.

> Two key highlights:
> 
> 1. Lock hold time drops from >500ms to <2ms
>    - In our benchmark (20,000 VMAs sharing an anon_vma), worst-case
>      anon_vma lock hold time during KSM rmap walk went from 705ms
>      down to 1.67ms (max) and 1.44ms (avg).

How real-worldish is that benchmark?

How much effect are our users likely to see from this patchset in their
real-world workloads?

> 2. Real user impact
>    - The anon_vma lock is also acquired by page faults, reclaim,
>      migration, compaction, mlock, exit_mmap, and cgroup accounting.
> 
>    - A long hold due to inefficient rmap walks stalls application
>      threads, causing latency spikes, reduced throughput, or even
>      container timeouts.
> 
>    - The problem occurs even without fork() – VMA splitting (e.g.,
>      via mprotect or madvise over time) can create tens of thousands
>      of VMAs all attached to the same anon_vma.
> 
> ...
>
> Changes in v5:
> - Patch 1: replaced local_clock() with tracepoints – no overhead
>            when tracepoints are disabled.

Thanks for that change.

> - Patch 3: switched from vm_pgoff (unstable after VMA split) to a
>            linear page offset.
> - Patch 4: adapted to the linear page offset; added user-impact
>            description (real workloads, lock contention examples,
>            VMA splitting scenario).
> - Patch 5: simplified to a single process with 32 pages (instead
>            of multi-process), as suggested by David.
> 
>  MAINTAINERS                          |   3 +

I don't recall seeing any discussion about you becoming an rmap
M:aintainer, perhaps I missed it.  Thanks for the interest, but it
probably would be better to propose this as a standalone patch,
separated from this series.


>  include/trace/events/rmap.h          |  73 ++++
>  mm/ksm.c                             |  48 ++-
>  mm/rmap.c                            |   9 +
>  tools/testing/rmap/Makefile          |  11 +
>  tools/testing/rmap/rmap_benchmark.c  | 529 +++++++++++++++++++++++++++
>  tools/testing/selftests/mm/rmap.c    |  76 ++++
>  tools/testing/selftests/mm/vm_util.c |  38 ++
>  tools/testing/selftests/mm/vm_util.h |   2 +
>  9 files changed, 781 insertions(+), 8 deletions(-)
>  create mode 100644 include/trace/events/rmap.h
>  create mode 100644 tools/testing/rmap/Makefile
>  create mode 100644 tools/testing/rmap/rmap_benchmark.c

AI review was only partial, for unclear reasons:

	https://sashiko.dev/#/patchset/20260519220536792dMIKRMurt3vZ5lXC5pwh8@zte.com.cn

Please take a look, see if there's anything useful there?

Re: [PATCH v5 0/5] KSM: performance optimizations for rmap_walk_ksm

Posted by xu.xin16@zte.com.cn 4 days, 18 hours ago

> > Changes in v5:
> > - Patch 1: replaced local_clock() with tracepoints – no overhead
> >            when tracepoints are disabled.
> 
> Thanks for that change.
> 
> > - Patch 3: switched from vm_pgoff (unstable after VMA split) to a
> >            linear page offset.
> > - Patch 4: adapted to the linear page offset; added user-impact
> >            description (real workloads, lock contention examples,
> >            VMA splitting scenario).
> > - Patch 5: simplified to a single process with 32 pages (instead
> >            of multi-process), as suggested by David.
> > 
> >  MAINTAINERS                          |   3 +
> 
> I don't recall seeing any discussion about you becoming an rmap
> M:aintainer, perhaps I missed it.  Thanks for the interest, but it
> probably would be better to propose this as a standalone patch,
> separated from this series.
>

You are absolutely right: I did not discuss this change on the mailing
list beforehand, and I agree that a MAINTAINERS update should be a
separate, standalone patch.

My hope was only that, for future changes touching these specific rmap
test files or the rmap tracepoints, maintainers or other developers
could Cc me so that I can help review or test them. By the way, I was
adding myself as an R: (reviewer) entry rather than an M: entry. Thanks
for the reminder.

Re: [PATCH v5 0/5] KSM: performance optimizations for rmap_walk_ksm

Posted by xu.xin16@zte.com.cn 4 days, 18 hours ago

> > Two key highlights:
> > 
> > 1. Lock hold time drops from >500ms to <2ms
> >    - In our benchmark (20,000 VMAs sharing an anon_vma), worst-case
> >      anon_vma lock hold time during KSM rmap walk went from 705ms
> >      down to 1.67ms (max) and 1.44ms (avg).
> 
> How real-worldish is that benchmark?
> 
> How much effect are our users likely to see from this patchset in their
> real-world workloads?

Hi Andrew,

Thank you for your thoughtful question.

The benchmark intentionally simulates a scenario where many VMAs share the
same anon_vma without any fork() involved. This happens in real systems
when applications repeatedly split existing VMAs via mprotect(2) or
madvise(2) (e.g., MADV_DONTNEED, MADV_FREE) on sub‑ranges of a large
anonymous mapping.

Real-world examples:

 - JVM / Go runtime: These use mmap for heap regions and later call
   mprotect(PROT_NONE) for garbage collection barriers or guard pages,
   splitting the original VMA into thousands of small pieces over time.

 - Database engines (MySQL, PostgreSQL): Large shared memory buffers
   or anonymous mappings are managed with madvise(MADV_DONTNEED) to
   release specific pages, which also splits VMAs.

Why the benchmark numbers are realistic:

We observed ~20,000 VMAs sharing one anon_vma on a production system
running a Java application with KSM enabled. The lock hold time before
the patch was measured at 228 ms (max) during rmap walks triggered by
memory compaction and page migration. The benchmark reproduces that VMA
count and lock-hold behavior in a controlled environment.

For systems that do not have thousands of VMAs per anon_vma, the
patch adds negligible overhead (a single pgoff comparison). For systems
that do suffer from this issue, the improvement is dramatic:

 1) Worst-case anon_vma lock hold time drops from hundreds of
    milliseconds to under 2 ms.
 2) This directly reduces blocking of parallel operations that need
    the same lock: page faults, reclaim, migration, compaction, mlock,
    and exit_mmap.

End‑users will see lower tail latency (fewer application stalls),
higher throughput under memory pressure, and no more spurious
lockup warnings or container timeouts caused by excessive lock hold
times.

In short: workloads that do not hit this pathological pattern are
unaffected; those that do will see a 100x to 500x reduction in lock
hold times, which translates directly into a more responsive system.

I hope this clarifies the real‑world relevance. Thank you for pushing
us to make the changelog clearer.

Best regards,
xu xin