[PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm

xu.xin16@zte.com.cn posted 6 patches 2 days, 15 hours ago
Posted by xu.xin16@zte.com.cn 2 days, 15 hours ago
From: xu xin <xu.xin16@zte.com.cn>

This series fixes a severe KSM reverse-mapping performance problem
that can freeze applications for hundreds of milliseconds under
memory pressure, especially when many unrelated VMAs share a
single anon_vma.

Two key highlights:

1. Lock hold time drops from >500ms to <2ms
   - In our benchmark (20,000 VMAs sharing an anon_vma), worst-case
     anon_vma lock hold time during KSM rmap walk went from 705ms
     down to 1.67ms (max) and 1.44ms (avg).

2. Real user impact
   - The anon_vma lock is also acquired by page faults, reclaim,
     migration, compaction, mlock, exit_mmap, and cgroup accounting.

   - A long hold due to inefficient rmap walks stalls application
     threads, causing latency spikes, reduced throughput, or even
     container timeouts.

   - The problem occurs even without fork() – VMA splitting (e.g.,
     via mprotect or madvise over time) can create tens of thousands
     of VMAs all attached to the same anon_vma.

Real-world examples:

 - JVM / Go runtime: These use mmap for heap regions and later call
   mprotect(PROT_NONE) for garbage collection barriers or guard pages,
   splitting the original VMA into thousands of small pieces over time.

 - Database engines (MySQL, PostgreSQL): Large shared memory buffers
   or anonymous mappings are managed with madvise(MADV_DONTNEED) to release
   specific pages, which also splits VMAs.
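
The splitting pattern described above can be sketched with a toy model. This is not kernel code: the Vma class, the mprotect() helper and the integer anon_vma handle are hypothetical names invented for illustration, the page size is assumed to be 4 KiB, and merging of adjacent compatible VMAs is deliberately left out. It only shows that changing protection on alternating pages of one mapping leaves one VMA per page, all still attached to the same anon_vma.

```python
from dataclasses import dataclass

PAGE = 4096  # assumed page size for this toy model


@dataclass(frozen=True)
class Vma:
    start: int     # first byte covered (inclusive)
    end: int       # end of the range (exclusive)
    prot: str      # "rw" or "none" in this model
    anon_vma: int  # hypothetical handle for the shared anon_vma


def mprotect(vmas, start, end, prot):
    """Return a new VMA list after changing [start, end) to prot.

    A VMA that only partly overlaps the range is split, and every
    piece keeps the anon_vma of the VMA it was cut from.
    """
    out = []
    for v in vmas:
        if v.end <= start or v.start >= end:
            out.append(v)  # no overlap, left untouched
            continue
        lo, hi = max(v.start, start), min(v.end, end)
        if v.start < lo:
            out.append(Vma(v.start, lo, v.prot, v.anon_vma))
        out.append(Vma(lo, hi, prot, v.anon_vma))
        if hi < v.end:
            out.append(Vma(hi, v.end, v.prot, v.anon_vma))
    return out


def fragment(npages):
    """Protect every other page of a single npages-long mapping."""
    vmas = [Vma(0, npages * PAGE, "rw", anon_vma=1)]
    for page in range(1, npages, 2):
        vmas = mprotect(vmas, page * PAGE, (page + 1) * PAGE, "none")
    return vmas


if __name__ == "__main__":
    vmas = fragment(2000)
    print(len(vmas), "VMAs sharing",
          len({v.anon_vma for v in vmas}), "anon_vma")
```

The model uses 2,000 pages only to keep the quadratic list rebuild fast; the count of VMAs grows linearly with the number of protection boundaries.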

* Why the benchmark numbers are realistic: We observed ~20,000 VMAs sharing
one anon_vma on a production system running a Java application with KSM
enabled. The lock hold time before the patch was measured at 228 ms (max)
during rmap walks triggered by memory compaction and page migration.
The benchmark reproduces that VMA count and lock‑hold behavior in a
controlled environment.

For systems that do not have thousands of VMAs per anon_vma, the
patch adds negligible overhead (a single pgoff comparison). For systems
that do suffer from this issue, the improvement is dramatic:

1) Worst-case anon_vma lock hold time drops from hundreds of milliseconds
   to under 2 ms.
2) This directly reduces blocking of parallel operations that need the
   same lock – page faults, reclaim, migration, compaction, mlock, and
   exit_mmap.

End‑users will see lower tail latency (fewer application stalls),
higher throughput under memory pressure, and no more spurious
lockup warnings or container timeouts caused by excessive lock hold
times.

In short: workloads that do not hit this pathological pattern are
unaffected; those that do will see a 100x to 500x reduction in lock
hold times, which translates directly into a more responsive system.
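
Why a narrower range helps can be illustrated with a toy model. It is a sketch under assumptions, not the patch itself: it assumes the unoptimized walk queries the anon_vma interval tree over the entire offset range and filters afterwards, while the optimized walk queries only the page offset of the KSM page. The Vma class and the walk(), candidates_full() and candidates_pgoff() helpers are hypothetical, and the linear scan merely stands in for an interval-tree lookup; what matters is how many VMAs the per-VMA loop body has to process while the lock is held.

```python
from dataclasses import dataclass

ULONG_MAX = 2**64 - 1


@dataclass(frozen=True)
class Vma:
    pgoff_start: int  # first page offset covered (inclusive)
    pgoff_end: int    # last page offset covered (inclusive)


def walk(vmas, first, last):
    """Return the VMAs whose offset range overlaps [first, last].

    Stands in for an interval-tree query; a real tree locates the
    overlapping entries without scanning every node.
    """
    return [v for v in vmas
            if v.pgoff_start <= last and v.pgoff_end >= first]


def candidates_full(vmas):
    """Query the whole range: every VMA reaches the loop body."""
    return walk(vmas, 0, ULONG_MAX)


def candidates_pgoff(vmas, pgoff):
    """Query one page offset: only VMAs covering it are visited."""
    return walk(vmas, pgoff, pgoff)


if __name__ == "__main__":
    # 20,000 one-page VMAs attached to the same (implicit) anon_vma.
    vmas = [Vma(i, i) for i in range(20000)]
    print("full range:", len(candidates_full(vmas)))
    print("single pgoff:", len(candidates_pgoff(vmas, 1234)))
```

With one-page VMAs the full-range query hands all 20,000 entries to the loop, while the single-offset query hands over one.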

---
Changes in v6:
- Patch 1: Define a single event class once and instantiate the individual
	   tracepoints with DEFINE_EVENT, as suggested by the AI review:
	https://sashiko.dev/#/patchset/20260519220536792dMIKRMurt3vZ5lXC5pwh8@zte.com.cn

- Patch 2: Three changes suggested by the AI review:
	(1) Safe event pairing – Store the folio and rwc addresses at rmap_walk_start
	    and match them against the same addresses at rmap_walk_end, eliminating
	    cross-thread interference.

	(2) KSM configuration preservation – Save the original KSM settings and restore
	    them after the KSM test, avoiding persistent changes to system behaviour.

	(3) Early unlink to prevent a potential file leak – unlink(filename) is called
	    immediately after mkstemp, so the temporary file is automatically removed
	    even if the program crashes early.

- Patch 3: A separate, standalone patch to update the MAINTAINERS file.
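
The event pairing described for Patch 2 can be sketched as follows. This is a minimal illustration, not the benchmark's code: the tuple layout (kind, timestamp in ns, folio address, rwc address) and the pair_events() name are assumptions made for the example. Keying pending start events by the (folio, rwc) pair means an end event is only matched with the start of the same walk, even when walks from different threads interleave.

```python
def pair_events(events):
    """Pair start/end events by (folio, rwc) and return latencies.

    events: iterable of (kind, timestamp_ns, folio, rwc) tuples,
    where kind is "start" or "end". An end event with no matching
    pending start is ignored.
    """
    pending = {}    # (folio, rwc) -> start timestamp
    latencies = []  # in order of completion
    for kind, ts, folio, rwc in events:
        key = (folio, rwc)
        if kind == "start":
            pending[key] = ts
        elif kind == "end" and key in pending:
            latencies.append(ts - pending.pop(key))
    return latencies


if __name__ == "__main__":
    # Two walks from different threads, interleaved in the trace.
    trace = [
        ("start", 100, 0xA000, 0x1000),
        ("start", 150, 0xB000, 0x2000),
        ("end",   180, 0xB000, 0x2000),
        ("end",   400, 0xA000, 0x1000),
    ]
    print(pair_events(trace))
```

Pairing each end with the most recent start instead would attribute the second walk's start time to the first walk's end in this trace.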

Changes in v5:
- Patch 1: replaced local_clock() with tracepoints – no overhead
           when tracepoints are disabled.
- Patch 3: switched from vm_pgoff (unstable after VMA split) to a
           linear page offset.
- Patch 4: adapted to the linear page offset; added user-impact
           description (real workloads, lock contention examples,
           VMA splitting scenario).
- Patch 5: simplified to a single process with 32 pages (instead
           of multi-process), as suggested by David.
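
One reading of the vm_pgoff note for Patch 3 can be worked through numerically. The sketch below assumes 4 KiB pages and the usual linear relation pgoff = vm_pgoff + ((addr - vm_start) >> PAGE_SHIFT); the linear_pgoff() and split() helpers are hypothetical names for this example only. After a split, the VMA that contains a given address may have a different vm_pgoff, while the linear page offset of that address stays the same.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages


def linear_pgoff(vm_start, vm_pgoff, addr):
    """Page offset of addr within the object the VMA maps."""
    return vm_pgoff + ((addr - vm_start) >> PAGE_SHIFT)


def split(vm_start, vm_end, vm_pgoff, at):
    """Split [vm_start, vm_end) at address `at`.

    Returns (low, high), each as (vm_start, vm_end, vm_pgoff). The
    high half starts further into the object, so its vm_pgoff grows
    by the number of pages left in the low half.
    """
    low = (vm_start, at, vm_pgoff)
    high = (at, vm_end, vm_pgoff + ((at - vm_start) >> PAGE_SHIFT))
    return low, high


if __name__ == "__main__":
    start, pages, pgoff = 0x10000000, 16, 0x10000
    end = start + (pages << PAGE_SHIFT)
    addr = start + (10 << PAGE_SHIFT)  # page 10 of the mapping

    before = linear_pgoff(start, pgoff, addr)
    low, high = split(start, end, pgoff, start + (8 << PAGE_SHIFT))
    after = linear_pgoff(high[0], high[2], addr)

    print("vm_pgoff of containing VMA:", hex(pgoff), "->", hex(high[2]))
    print("linear page offset:", hex(before), "->", hex(after))
```

A value cached from vm_pgoff before the split would no longer match the VMA that contains the address afterwards, whereas the linear offset needs no update.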

Changes in v4:
- Add a tracepoint for rmap_walk
- Provide a testbench for rmap_walk
- Add vm_pgoff field in ksm_rmap_item
- Use vm_pgoff instead of address >> PAGE_SHIFT (Suggested by David and Lorenzo)

Changes in v3:
- Fix some typos in commit description
- Replace 'pgoff_start' and 'pgoff_end' with 'pgoff'.

Changes in v2:
- Use const variables to initialize 'addr', 'pgoff_start' and 'pgoff_end'
- Let pgoff_end = pgoff_start, since KSM folios are always order-0 (Suggested by David)


xu xin (6):
  mm/rmap: add tracepoint for rmap_walk
  tools/testing: add rmap walk latency benchmark for KSM, anonymous and
    file pages
  MAINTAINERS: add myself as reviewer for rmap section
  ksm: add pgoff into ksm_rmap_item
  ksm: Optimize rmap_walk_ksm by passing a suitable address range
  ksm: add mremap selftests for ksm_rmap_walk

 MAINTAINERS                          |   3 +
 include/trace/events/rmap.h          |  67 ++++
 mm/ksm.c                             |  48 ++-
 mm/rmap.c                            |   9 +
 tools/testing/rmap/Makefile          |  11 +
 tools/testing/rmap/rmap_benchmark.c  | 461 +++++++++++++++++++++++++++
 tools/testing/selftests/mm/rmap.c    |  76 +++++
 tools/testing/selftests/mm/vm_util.c |  38 +++
 tools/testing/selftests/mm/vm_util.h |   2 +
 9 files changed, 707 insertions(+), 8 deletions(-)
 create mode 100644 include/trace/events/rmap.h
 create mode 100644 tools/testing/rmap/Makefile
 create mode 100644 tools/testing/rmap/rmap_benchmark.c

-- 
2.25.1
Re: [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm
Posted by Andrew Morton 1 day, 15 hours ago
On Fri, 22 May 2026 10:52:34 +0800 (CST) <xu.xin16@zte.com.cn> wrote:

> This series fixes a severe KSM reverse-mapping performance problem
> that can freeze applications for hundreds of milliseconds under
> memory pressure especially when a lot of unrelated VMAs sharing a
> single anon_vma.

Thanks.  I agree that this behaviour is quite obnoxious and getting it
addressed is quite desirable.

So I'd normally merge this in its present unreviewed state in order to
push things along a bit, but the AI review gives me pause:

	https://sashiko.dev/#/patchset/20260522105234715fKI7KSsjC5XpEVMwoV6rI@zte.com.cn

Can you please take a look, decide what (if any) changes are needed?
Re: [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm
Posted by xu.xin16@zte.com.cn 1 day, 14 hours ago
>> This series fixes a severe KSM reverse-mapping performance problem
>> that can freeze applications for hundreds of milliseconds under
>> memory pressure especially when a lot of unrelated VMAs sharing a
>> single anon_vma.
>
>Thanks.  I agree that this behaviour is quite obnoxious and getting it
>addressed is quite desirable.
>
>So I'd normally merge this in its present unreviewed state in order to
>push things along a bit, but the AI review gives me pause:
>
>    https://sashiko.dev/#/patchset/20260522105234715fKI7KSsjC5XpEVMwoV6rI@zte.com.cn
>
>Can you please take a look, decide what (if any) changes are needed?

Yes