From: xu xin <xu.xin16@zte.com.cn>
When available memory is extremely tight, causing KSM pages to be swapped
out, or when there is significant memory fragmentation and THP triggers
memory compaction, the system will invoke the rmap_walk_ksm function to
perform reverse mapping. However, we observed that this function becomes
particularly time-consuming when a large number of VMAs (e.g., 20,000)
share the same anon_vma. Through debug trace analysis, we found that most
of the latency occurs within anon_vma_interval_tree_foreach, leading to an
excessively long hold time on the anon_vma lock (even reaching 500ms or
more), which in turn causes upper-layer applications (waiting for the
anon_vma lock) to be blocked for extended periods.
This series fixes a severe KSM reverse-mapping performance problem
that can freeze applications for hundreds of milliseconds under
memory pressure, especially when a lot of unrelated VMAs share a
single anon_vma.
Two key highlights:
1. Lock hold time drops from >500ms to <2ms
- In our benchmark (20,000 VMAs sharing an anon_vma), worst-case
anon_vma lock hold time during KSM rmap walk went from 705ms
down to 1.67ms (max) and 1.44ms (avg).
2. Real user impact
- The anon_vma lock is also acquired by page faults, reclaim,
migration, compaction, mlock, exit_mmap, and cgroup accounting.
- A long hold due to inefficient rmap walks stalls application
threads, causing latency spikes, reduced throughput, or even
container timeouts.
- The problem occurs even without fork() – VMA splitting (e.g.,
via mprotect or madvise over time) can create tens of thousands
of VMAs all attached to the same anon_vma.
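The VMA-splitting scenario can be reproduced from user space. The following is a minimal sketch, not code from this series: count_split_vmas() is a hypothetical helper that faults in one anonymous mapping, changes the protection of every other page with mprotect(2), and counts the resulting entries in /proc/self/maps. User space cannot observe the anon_vma directly, so the sketch only shows the VMA count growing without fork().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: number of VMAs overlapping the split range, -1 on error. */
long count_split_vmas(size_t npages)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len = npages * (size_t)page;
	unsigned long lo, hi, start, end;
	char line[8192];
	long count = 0;
	FILE *maps;
	char *base;

	assert(npages > 0 && page > 0);
	base = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	/* Fault every page in while the range is still a single VMA. */
	memset(base, 0x5a, len);

	/* Each protection change on an odd page splits the surrounding VMA. */
	for (size_t i = 1; i < npages; i += 2) {
		if (mprotect(base + i * (size_t)page, (size_t)page, PROT_READ)) {
			munmap(base, len);
			return -1;
		}
	}

	maps = fopen("/proc/self/maps", "r");
	if (!maps) {
		munmap(base, len);
		return -1;
	}

	/* Count every maps entry that overlaps [lo, hi). */
	lo = (unsigned long)base;
	hi = lo + len;
	while (fgets(line, sizeof(line), maps)) {
		if (sscanf(line, "%lx-%lx", &start, &end) == 2 &&
		    start < hi && end > lo)
			count++;
	}

	fclose(maps);
	munmap(base, len);
	return count;
}
```

Calling count_split_vmas(64) on Linux is expected to report one VMA per page, because neighbouring pages no longer share the same protection and cannot be merged.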
Patch summary:
==============
patch 1/5: mm/rmap: add tracepoint for rmap_walk
- Zero overhead when disabled; offline latency analysis.
patch 2/5: tools/testing: add rmap benchmark
- Measures KSM/anon/file rmap walks.
patch 3/5: ksm: add pgoff into ksm_rmap_item
- Stores linear page offset (not vm_pgoff) using a union.
- Cleared on failure paths including break_cow().
patch 4/5: ksm: optimize rmap_walk_ksm by passing a suitable range
- Uses stored pgoff to narrow interval tree search.
- Reduces iterations from >22k to ~3; lock hold 705ms -> 1.67ms.
- Includes detailed user-impact description (suggested by Andrew).
patch 5/5: ksm: add mremap selftests for ksm_rmap_walk
- Single-process, 32 pages; covers mremap + KSM + migration.
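To illustrate why a narrowed range helps patch 4/5, here is a rough user-space model, not the kernel implementation. struct pgoff_range, fill_single_page_vmas() and count_visited() are hypothetical names, and a linear scan stands in for the anon_vma interval tree, so only the number of entries handed to the walker is modelled, not the lookup cost.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for one VMA's page-offset interval [first, last]. */
struct pgoff_range {
	unsigned long first;
	unsigned long last;
};

/* Model n single-page VMAs laid out back to back: VMA i covers offset i. */
void fill_single_page_vmas(struct pgoff_range *r, size_t n)
{
	for (size_t i = 0; i < n; i++)
		r[i].first = r[i].last = i;
}

/*
 * Count the entries that a query over [start, last] would visit.
 * An entry is visited when its interval overlaps the queried range.
 */
size_t count_visited(const struct pgoff_range *r, size_t n,
		     unsigned long start, unsigned long last)
{
	size_t visited = 0;

	assert(start <= last);
	for (size_t i = 0; i < n; i++)
		if (r[i].first <= last && r[i].last >= start)
			visited++;
	return visited;
}
```

In this model, querying [0, ULONG_MAX] over 20,000 single-page entries visits all of them, while querying [pgoff, pgoff] for one stored offset visits only the entry that contains it.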
---
Changes in v5:
- Patch 1: replaced local_clock() with tracepoints – no overhead
when tracepoints are disabled.
- Patch 3: switched from vm_pgoff (unstable after VMA split) to a
linear page offset.
- Patch 4: adapted to the linear page offset; added user-impact
description (real workloads, lock contention examples,
VMA splitting scenario).
- Patch 5: simplified to a single process with 32 pages (instead
of multi-process), as suggested by David.
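The difference between vm_pgoff and a linear page offset can be shown with a small user-space model. This is a sketch under assumptions: struct vma_model, linear_index() and split_at() are hypothetical, 4 KiB pages are assumed, and the value the patch actually stores may be computed differently. It only shows that the vm_pgoff of the VMA covering an address changes when the VMA is split, while the page offset derived for that address does not.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12	/* assumed 4 KiB pages */

/* Hypothetical, stripped-down stand-in for struct vm_area_struct. */
struct vma_model {
	unsigned long vm_start;	/* first byte of the mapping */
	unsigned long vm_end;	/* one past the last byte */
	unsigned long vm_pgoff;	/* page offset of vm_start */
};

/* Page offset of addr, independent of where the covering VMA starts. */
unsigned long linear_index(const struct vma_model *vma, unsigned long addr)
{
	assert(addr >= vma->vm_start && addr < vma->vm_end);
	return ((addr - vma->vm_start) >> MODEL_PAGE_SHIFT) + vma->vm_pgoff;
}

/* Split vma at addr: trim the original and return the new upper half. */
struct vma_model split_at(struct vma_model *vma, unsigned long addr)
{
	struct vma_model tail;

	assert(addr > vma->vm_start && addr < vma->vm_end);
	tail.vm_start = addr;
	tail.vm_end = vma->vm_end;
	/* The upper half starts further in, so its vm_pgoff moves forward. */
	tail.vm_pgoff = vma->vm_pgoff +
			((addr - vma->vm_start) >> MODEL_PAGE_SHIFT);
	vma->vm_end = addr;
	return tail;
}
```

For a 16-page VMA at 0x100000 with vm_pgoff 0x100, the address 0x10a000 has linear index 0x10a both before and after a split at 0x108000, even though the VMA that covers it afterwards has vm_pgoff 0x108.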
Changes in v4:
- Add a tracepoint for rmap_walk
- Provide a testbench for rmap_walk
- Add vm_pgoff field in ksm_rmap_item
- Use vm_pgoff instead of address >> PAGE_SHIFT (Suggested by David and Lorenzo)
Changes in v3:
- Fix some typos in commit description
- Replace 'pgoff_start' and 'pgoff_end' by 'pgoff'.
Changes in v2:
- Use const variables to initialize 'addr', 'pgoff_start' and 'pgoff_end'
- Let pgoff_end = pgoff_start, since KSM folios are always order-0 (Suggested by David)
xu xin (5):
mm/rmap: add tracepoint for rmap_walk
tools/testing: add rmap walk latency benchmark for KSM, anonymous and
file pages
ksm: add pgoff into ksm_rmap_item
ksm: Optimize rmap_walk_ksm by passing a suitable address range
ksm: add mremap selftests for ksm_rmap_walk
MAINTAINERS | 3 +
include/trace/events/rmap.h | 73 ++++
mm/ksm.c | 48 ++-
mm/rmap.c | 9 +
tools/testing/rmap/Makefile | 11 +
tools/testing/rmap/rmap_benchmark.c | 529 +++++++++++++++++++++++++++
tools/testing/selftests/mm/rmap.c | 76 ++++
tools/testing/selftests/mm/vm_util.c | 38 ++
tools/testing/selftests/mm/vm_util.h | 2 +
9 files changed, 781 insertions(+), 8 deletions(-)
create mode 100644 include/trace/events/rmap.h
create mode 100644 tools/testing/rmap/Makefile
create mode 100644 tools/testing/rmap/rmap_benchmark.c
--
2.25.1
On Tue, 19 May 2026 22:05:36 +0800 (CST) <xu.xin16@zte.com.cn> wrote:

> From: xu xin <xu.xin16@zte.com.cn>
>
> When available memory is extremely tight, causing KSM pages to be swapped
> out, or when there is significant memory fragmentation and THP triggers
> memory compaction, the system will invoke the rmap_walk_ksm function to
> perform reverse mapping. However, we observed that this function becomes
> particularly time-consuming when a large number of VMAs (e.g., 20,000)
> share the same anon_vma. Through debug trace analysis, we found that most
> of the latency occurs within anon_vma_interval_tree_foreach, leading to an
> excessively long hold time on the anon_vma lock (even reaching 500ms or
> more), which in turn causes upper-layer applications (waiting for the
> anon_vma lock) to be blocked for extended periods.
>
> This series fixes a severe KSM reverse-mapping performance problem
> that can freeze applications for hundreds of milliseconds under
> memory pressure especially when a lot of unrelated VMAs sharing a
> single anon_vma.

That would be good to fix.

> Two key highlights:
>
> 1. Lock hold time drops from >500ms to <2ms
> - In our benchmark (20,000 VMAs sharing an anon_vma), worst-case
> anon_vma lock hold time during KSM rmap walk went from 705ms
> down to 1.67ms (max) and 1.44ms (avg).

How real-worldish is that benchmark?

How much effect are our users likely to see from this patchset in their
real-world workloads?

> 2. Real user impact
> - The anon_vma lock is also acquired by page faults, reclaim,
> migration, compaction, mlock, exit_mmap, and cgroup accounting.
>
> - A long hold due to inefficient rmap walks stalls application
> threads, causing latency spikes, reduced throughput, or even
> container timeouts.
>
> - The problem occurs even without fork() – VMA splitting (e.g.,
> via mprotect or madvise over time) can create tens of thousands
> of VMAs all attached to the same anon_vma.
>
> ...
>
> Changes in v5:
> - Patch 1: replaced local_clock() with tracepoints – no overhead
> when tracepoints are disabled.

Thanks for that change.

> - Patch 3: switched from vm_pgoff (unstable after VMA split) to a
> linear page offset.
> - Patch 4: adapted to the linear page offset; added user-impact
> description (real workloads, lock contention examples,
> VMA splitting scenario).
> - Patch 5: simplified to a single process with 32 pages (instead
> of multi-process), as suggested by David.
>
> MAINTAINERS | 3 +

I don't recall seeing any discussion about you becoming an rmap
M:aintainer, perhaps I missed it. Thanks for the interest, but it
probably would be better to propose this as a standalone patch,
separated from this series.

> include/trace/events/rmap.h | 73 ++++
> mm/ksm.c | 48 ++-
> mm/rmap.c | 9 +
> tools/testing/rmap/Makefile | 11 +
> tools/testing/rmap/rmap_benchmark.c | 529 +++++++++++++++++++++++++++
> tools/testing/selftests/mm/rmap.c | 76 ++++
> tools/testing/selftests/mm/vm_util.c | 38 ++
> tools/testing/selftests/mm/vm_util.h | 2 +
> 9 files changed, 781 insertions(+), 8 deletions(-)
> create mode 100644 include/trace/events/rmap.h
> create mode 100644 tools/testing/rmap/Makefile
> create mode 100644 tools/testing/rmap/rmap_benchmark.c

AI review was only partial, for unclear reasons:

https://sashiko.dev/#/patchset/20260519220536792dMIKRMurt3vZ5lXC5pwh8@zte.com.cn

Please take a look, see if there's anything useful there?
> > Changes in v5:
> > - Patch 1: replaced local_clock() with tracepoints – no overhead
> > when tracepoints are disabled.
>
> Thanks for that change.
>
> > - Patch 3: switched from vm_pgoff (unstable after VMA split) to a
> > linear page offset.
> > - Patch 4: adapted to the linear page offset; added user-impact
> > description (real workloads, lock contention examples,
> > VMA splitting scenario).
> > - Patch 5: simplified to a single process with 32 pages (instead
> > of multi-process), as suggested by David.
> >
> > MAINTAINERS | 3 +
>
> I don't recall seeing any discussion about you becoming an rmap
> M:aintainer, perhaps I missed it. Thanks for the interest, but it
> probably would be better to propose this as a standalone patch,
> separated from this series.
>

You are absolutely right – I did not discuss this change on the mailing
list beforehand and I agree that a MAINTAINERS update should be a
separate, standalone patch. My intention was to help review future
changes to the rmap benchmark and tracepoints, but you are right that
it needs its own discussion.

Thanks for the reminder. My hope was that for future changes touching
these specific rmap test files or the rmap tracepoints, maintainers or
other developers could cc me so that I can help review or test them.

By the way, I was adding myself as an R: (reviewer) rather than M:
> > Two key highlights:
> >
> > 1. Lock hold time drops from >500ms to <2ms
> > - In our benchmark (20,000 VMAs sharing an anon_vma), worst-case
> > anon_vma lock hold time during KSM rmap walk went from 705ms
> > down to 1.67ms (max) and 1.44ms (avg).
>
> How real-worldish is that benchmark?
>
> How much effect are our users likely to see from this patchset in their
> real-world workloads?

Hi Andrew,

Thank you for your thoughtful question.

The benchmark intentionally simulates a scenario where many VMAs share
the same anon_vma without any fork() involved. This happens in real
systems when applications repeatedly split existing VMAs via mprotect(2)
or madvise(2) (e.g., MADV_DONTNEED, MADV_FREE) on sub-ranges of a large
anonymous mapping.

Real-world examples:

- JVM / Go runtime: These use mmap for heap regions and later call
  mprotect(PROT_NONE) for garbage collection barriers or guard pages,
  splitting the original VMA into thousands of small pieces over time.
- Database engines (MySQL, PostgreSQL): Large shared memory buffers or
  anonymous mappings are managed with madvise(MADV_DONTNEED) to release
  specific pages, which also splits VMAs.

* Why the benchmark numbers are realistic:

We observed ~20,000 VMAs sharing one anon_vma on a production system
running a Java application with KSM enabled. The lock hold time before
the patch was measured at 228 ms (max) during rmap walks triggered by
memory compaction and page migration. The benchmark reproduces that VMA
count and lock-hold behavior in a controlled environment.

For systems that do not have thousands of VMAs per anon_vma, the patch
adds negligible overhead (a single pgoff comparison). For systems that
do suffer from this issue, the improvement is dramatic:

1) Worst-case anon_vma lock hold time drops from hundreds of
   milliseconds to under 2 ms.
2) This directly reduces blocking of parallel operations that need the
   same lock – page faults, reclaim, migration, compaction, mlock, and
   exit_mmap.

End-users will see lower tail latency (fewer application stalls), higher
throughput under memory pressure, and no more spurious lockup warnings
or container timeouts caused by excessive lock hold times.

In short: workloads that do not hit this pathological pattern are
unaffected; those that do will see a 100x to 500x reduction in lock hold
times, which translates directly into a more responsive system.

I hope this clarifies the real-world relevance. Thank you for pushing us
to make the changelog clearer.

Best regards,
xu xi