[PATCH v4 0/5] KSM: Optimizations for rmap_walk_ksm

xu.xin16@zte.com.cn posted 5 patches 1 month, 1 week ago
Posted by xu.xin16@zte.com.cn 1 month, 1 week ago
From: xu xin <xu.xin16@zte.com.cn>

Investigation showed that 99.9% of the iterations of the
anon_vma_interval_tree_foreach loop in rmap_walk_ksm() are skipped by the
first check, "if (addr < vma->vm_start || addr >= vma->vm_end)", so almost
all of the loop's work is wasted. The cause is that the pgoff_start and
pgoff_end arguments passed to anon_vma_interval_tree_foreach span the
entire range from 0 to ULONG_MAX, so the loop visits every VMA in the
anon_vma's interval tree.
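
The effect described above can be modelled in user space. The following is
only a sketch under stated assumptions: struct toy_vma, struct walk_stats and
toy_full_range_walk() are hypothetical names, not kernel code, and a plain
array stands in for the anon_vma interval tree. It shows that a query
covering 0 to ULONG_MAX makes every VMA a candidate, so the address check
rejects all but the one that maps addr.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct vm_area_struct; not the kernel type. */
struct toy_vma {
	unsigned long vm_start;	/* first byte of the mapping */
	unsigned long vm_end;	/* one past the last byte */
};

struct walk_stats {
	size_t visited;	/* VMAs the loop iterated over */
	size_t skipped;	/* VMAs rejected by the address check */
};

/*
 * Model a walk whose pgoff range is [0, ULONG_MAX]: every VMA is returned
 * by the (here: trivial) lookup, and only the address check filters them.
 */
struct walk_stats toy_full_range_walk(const struct toy_vma *vmas, size_t n,
				      unsigned long addr)
{
	struct walk_stats s = { 0, 0 };

	assert(vmas != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		s.visited++;
		if (addr < vmas[i].vm_start || addr >= vmas[i].vm_end) {
			s.skipped++;
			continue;
		}
		/* The real loop would call rwc->rmap_one() at this point. */
	}
	return s;
}
```

With 1000 disjoint one-page VMAs and an address inside one of them, this
model visits all 1000 entries and skips 999 of them.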

An earlier idea was to pass "rmap_item->address >> PAGE_SHIFT" as the
pgoff to anon_vma_interval_tree_foreach(). That is flawed: once a range
has been moved by mremap, its anon folio indexes and anon_vma pgoff
correspond to the original user address, not to the current user address,
as was pointed out at:

  https://lore.kernel.org/all/02e1b8df-d568-8cbb-b8f6-46d5476d9d75@google.com/
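
A small user-space model may help to see why the address-based pgoff breaks
after a move. This is a sketch, not kernel code: struct toy_mapping,
toy_page_index() and toy_move() are hypothetical names, 4 KiB pages are
assumed, and the index arithmetic is modelled on the kernel's
linear_page_index(). The move is assumed to keep vm_pgoff unchanged, as the
paragraph above describes.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Simplified stand-in for the VMA fields involved; not the kernel struct. */
struct toy_mapping {
	unsigned long vm_start;	/* current user address of the mapping */
	unsigned long vm_end;	/* one past the last byte */
	unsigned long vm_pgoff;	/* page offset that vm_start corresponds to */
};

/* Page index of addr in the mapping's offset space. */
unsigned long toy_page_index(const struct toy_mapping *m, unsigned long addr)
{
	assert(addr >= m->vm_start && addr < m->vm_end);
	return m->vm_pgoff + ((addr - m->vm_start) >> TOY_PAGE_SHIFT);
}

/* Model an mremap move: user addresses change, vm_pgoff is kept. */
struct toy_mapping toy_move(struct toy_mapping m, unsigned long new_start)
{
	unsigned long len = m.vm_end - m.vm_start;

	m.vm_start = new_start;
	m.vm_end = new_start + len;
	return m;
}
```

For a mapping that has never been moved and whose vm_pgoff equals
vm_start >> TOY_PAGE_SHIFT, toy_page_index() and addr >> TOY_PAGE_SHIFT
agree. After toy_move() the page index of the same page is unchanged, while
the new address shifted by TOY_PAGE_SHIFT gives a different value.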

anon_vma_interval_tree_foreach essentially iterates over the VMAs whose
page-offset range [vm_pgoff, vm_pgoff + vma_pages(v) - 1] contains the
provided pgoff.

So the solution is to add a vm_pgoff field to ksm_rmap_item and to use
vm_pgoff instead of address >> PAGE_SHIFT.
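
The range test described above can be sketched in user space as follows.
This is an illustration under assumptions, not the code in this series:
struct toy_node, toy_vma_pages(), toy_covers_pgoff() and toy_count_matches()
are hypothetical names, 4 KiB pages are assumed, and a linear scan stands in
for the interval tree lookup. It only shows that a single-pgoff query
selects the VMAs whose offset range contains that pgoff.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Simplified stand-in for an interval tree node; not the kernel struct. */
struct toy_node {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

/* Number of pages spanned by the mapping. */
unsigned long toy_vma_pages(const struct toy_node *v)
{
	assert(v->vm_end > v->vm_start);
	return (v->vm_end - v->vm_start) >> TOY_PAGE_SHIFT;
}

/* True if pgoff lies in [vm_pgoff, vm_pgoff + pages - 1]. */
int toy_covers_pgoff(const struct toy_node *v, unsigned long pgoff)
{
	unsigned long first = v->vm_pgoff;
	unsigned long last = v->vm_pgoff + toy_vma_pages(v) - 1;

	return pgoff >= first && pgoff <= last;
}

/* Count the nodes that a query for one pgoff would hand to the loop body. */
size_t toy_count_matches(const struct toy_node *nodes, size_t n,
			 unsigned long pgoff)
{
	size_t matches = 0;

	for (size_t i = 0; i < n; i++)
		if (toy_covers_pgoff(&nodes[i], pgoff))
			matches++;
	return matches;
}
```

With 1000 disjoint one-page nodes, a query for the pgoff of one of them
matches exactly one node, in contrast to a 0 to ULONG_MAX query, which
matches all of them.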


Changes in v4:
- Add a tracepoint for rmap_walk
- Provide a testbench for rmap_walk
- Add a vm_pgoff field to ksm_rmap_item
- Use vm_pgoff instead of address >> PAGE_SHIFT (Suggested by David and Lorenzo)

Changes in v3:
- Fix some typos in commit description
- Replace 'pgoff_start' and 'pgoff_end' with 'pgoff'.

Changes in v2:
- Use const variables to initialize 'addr', 'pgoff_start' and 'pgoff_end'
- Let pgoff_end = pgoff_start, since KSM folios are always order-0 (Suggested by David)

xu xin (5):
  mm/rmap: add tracepoint for rmap_walk
  tools/testing: add rmap walk latency benchmark for KSM, anonymous and
    file pages
  ksm: add vm_pgoff into ksm_rmap_item
  ksm: Optimize rmap_walk_ksm by passing a suitable address range
  ksm: add mremap selftests for ksm_rmap_walk

 MAINTAINERS                          |   3 +
 include/trace/events/rmap.h          |  49 +++
 mm/ksm.c                             |  48 ++-
 mm/rmap.c                            |  14 +
 tools/testing/rmap/Makefile          |  11 +
 tools/testing/rmap/rmap_benchmark.c  | 488 +++++++++++++++++++++++++++
 tools/testing/selftests/mm/rmap.c    |  79 +++++
 tools/testing/selftests/mm/vm_util.c |  38 +++
 tools/testing/selftests/mm/vm_util.h |   2 +
 9 files changed, 724 insertions(+), 8 deletions(-)
 create mode 100644 include/trace/events/rmap.h
 create mode 100644 tools/testing/rmap/Makefile
 create mode 100644 tools/testing/rmap/rmap_benchmark.c

-- 
2.25.1
Re: [PATCH v4 0/5] KSM: Optimizations for rmap_walk_ksm
Posted by Andrew Morton 1 month, 1 week ago
On Sun, 3 May 2026 20:35:38 +0800 (CST) <xu.xin16@zte.com.cn> wrote:

> Investigation showed that 99.9% of the iterations of the
> anon_vma_interval_tree_foreach loop in rmap_walk_ksm() are skipped by the
> first check, "if (addr < vma->vm_start || addr >= vma->vm_end)", so almost
> all of the loop's work is wasted. The cause is that the pgoff_start and
> pgoff_end arguments passed to anon_vma_interval_tree_foreach span the
> entire range from 0 to ULONG_MAX, so the loop visits every VMA in the
> anon_vma's interval tree.
> 
> An earlier idea was to pass "rmap_item->address >> PAGE_SHIFT" as the
> pgoff to anon_vma_interval_tree_foreach(). That is flawed: once a range
> has been moved by mremap, its anon folio indexes and anon_vma pgoff
> correspond to the original user address, not to the current user address,
> as was pointed out at:
> 
>   https://lore.kernel.org/all/02e1b8df-d568-8cbb-b8f6-46d5476d9d75@google.com/
> 
> anon_vma_interval_tree_foreach essentially iterates over the VMAs whose
> page-offset range [vm_pgoff, vm_pgoff + vma_pages(v) - 1] contains the
> provided pgoff.
> 
> So the solution is to add a vm_pgoff field to ksm_rmap_item and to use
> vm_pgoff instead of address >> PAGE_SHIFT.

Thanks for pushing ahead with this.

Regarding the [4/5] changelog: I don't think I understand how much
effect this change has upon real-world workloads.  Are you able to
clarify that?  "How useful is this change to Linux users".

AI review had a lot to say:
	https://sashiko.dev/#/patchset/20260503203538194jFwVGloy43M1F3sQGaFt7@zte.com.cn

Human review was wondering how much overhead [1/5] would add.  I do
note that it adds overhead even when CONFIG_TRACING=n - the rmap.o text
segment gets a few hundred bytes larger and there's additional runtime
overhead.