Override the common code version of folio_zero_user() so we can use
clear_pages() to do multi-page clearing instead of the standard
page-at-a-time clearing. This allows us to advertise the full region
size to the processor which, when using string instructions (REP; STOS),
can use knowledge of the extent to optimize the clearing.
Apart from this we have two other considerations: cache locality when
clearing 2MB pages, and preemption latency when clearing GB pages.

The first is handled by breaking the clearing into three parts: the
faulting page and its immediate locality, and the regions to its left
and right; with the local neighbourhood cleared last.
The second is only an issue for kernels running under cooperative
preemption. Limit the worst-case preemption latency by clearing in
PAGE_RESCHED_CHUNK (8MB) units: at a clearing bandwidth of ~10 GB/s
that bounds the time between cond_resched() calls to roughly 0.8ms.
The resultant performance depends on the kinds of optimizations that
the uarch can do for the clearing extent. Two classes of optimizations:
- amortize each clearing iteration over a large range instead of at
a page granularity.
- cacheline allocation elision (seen only on AMD Zen models)
A demand fault workload shows that the resultant performance falls in
two buckets depending on whether the extent being zeroed is large
enough to allow for cacheline allocation elision.
AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):
$ perf bench mem map -p $page-size -f demand -s 64GB -l 5
             mm/folio_zero_user       x86/folio_zero_user       change
              (GB/s +- %stdev)          (GB/s +- %stdev)

 pg-sz=2MB     11.82 +- 0.67%            16.48 +-  0.30%         + 39.4%
 pg-sz=1GB     17.51 +- 1.19%            40.03 +-  7.26% [#]     +129.9%

 [#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
     allocation, which is higher than PAGE_RESCHED_CHUNK, so
     preempt=none|voluntary sees no improvement for this test:

 pg-sz=1GB     17.14 +- 1.39%            17.42 +-  0.98%         +  1.6%
The dropoff in cacheline allocations for pg-sz=1GB can be seen with
perf-stat:
 -   44,513,459,667      cycles                 #    2.420 GHz                ( +- 0.44% )  (35.71%)
 -    1,378,032,592      instructions           #    0.03  insn per cycle
 -   11,224,288,082      L1-dcache-loads        #  610.187 M/sec              ( +- 0.08% )  (35.72%)
 -    5,373,473,118      L1-dcache-load-misses  #   47.87% of all L1-dcache accesses  ( +- 0.00% )  (35.71%)

 +   20,093,219,076      cycles                 #    2.421 GHz                ( +- 3.64% )  (35.69%)
 +    1,378,032,592      instructions           #    0.03  insn per cycle
 +      186,525,095      L1-dcache-loads        #   22.479 M/sec              ( +- 2.11% )  (35.74%)
 +       73,479,687      L1-dcache-load-misses  #   39.39% of all L1-dcache accesses  ( +- 3.03% )  (35.74%)
Also note that, as mentioned earlier, this improvement is not specific
to AMD Zen*: Intel Icelakex (pg-sz=2MB|1GB) sees an improvement similar
to the Milan pg-sz=2MB workload above (~35%).
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/x86/mm/Makefile | 1 +
arch/x86/mm/memory.c | 97 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 98 insertions(+)
create mode 100644 arch/x86/mm/memory.c
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5b9908f13dcf..9031faf21849 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_MMIOTRACE_TEST) += testmmiotrace.o
obj-$(CONFIG_NUMA) += numa.o
obj-$(CONFIG_AMD_NUMA) += amdtopology.o
obj-$(CONFIG_ACPI_NUMA) += srat.o
+obj-$(CONFIG_PREEMPTION) += memory.o
obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c
new file mode 100644
index 000000000000..a799c0cc3c5f
--- /dev/null
+++ b/arch/x86/mm/memory.c
@@ -0,0 +1,97 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/mm.h>
+#include <linux/range.h>
+#include <linux/minmax.h>
+
+/*
+ * Limit the optimized version of folio_zero_user() to !CONFIG_HIGHMEM.
+ * We do that because clear_pages() works on contiguous kernel pages
+ * which might not be true under HIGHMEM.
+ */
+#ifndef CONFIG_HIGHMEM
+/*
+ * For voluntary preemption models, operate with a max chunk-size of 8MB.
+ * (Worst case resched latency of ~1ms, with a clearing BW of ~10GBps.)
+ */
+#define PAGE_RESCHED_CHUNK (8 << (20 - PAGE_SHIFT))
+
+static void clear_pages_resched(void *addr, int npages)
+{
+        int i, remaining;
+
+        if (preempt_model_preemptible()) {
+                clear_pages(addr, npages);
+                goto out;
+        }
+
+        for (i = 0; i < npages / PAGE_RESCHED_CHUNK; i++) {
+                clear_pages(addr + i * PAGE_RESCHED_CHUNK * PAGE_SIZE, PAGE_RESCHED_CHUNK);
+                cond_resched();
+        }
+
+        remaining = npages % PAGE_RESCHED_CHUNK;
+
+        if (remaining)
+                clear_pages(addr + i * PAGE_RESCHED_CHUNK * PAGE_SIZE, remaining);
+out:
+        cond_resched();
+}
+
+/*
+ * folio_zero_user() - multi-page clearing.
+ *
+ * @folio: hugepage folio
+ * @addr_hint: faulting address (if any)
+ *
+ * Overrides common code folio_zero_user(). This version takes advantage of
+ * the fact that string instructions in clear_pages() are more performant
+ * on larger extents compared to the usual page-at-a-time clearing.
+ *
+ * Clearing of 2MB pages is split in three parts: pages in the immediate
+ * locality of the faulting page, and its left, right regions; with the local
+ * neighbourhood cleared last in order to keep cache lines of the target
+ * region hot.
+ *
+ * For GB pages, there is no expectation of cache locality so just do a
+ * straight zero.
+ *
+ * Note that the folio is fully allocated already so we don't do any exception
+ * handling.
+ */
+void folio_zero_user(struct folio *folio, unsigned long addr_hint)
+{
+        unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
+        const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
+        const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
+        const int width = 2; /* number of pages cleared last on either side */
+        struct range r[3];
+        int i;
+
+        if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
+                clear_pages_resched(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
+                return;
+        }
+
+        /*
+         * Faulting page and its immediate neighbourhood. Cleared at the end to
+         * ensure it sticks around in the cache.
+         */
+        r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
+                            clamp_t(s64, fault_idx + width, pg.start, pg.end));
+
+        /* Region to the left of the fault */
+        r[1] = DEFINE_RANGE(pg.start,
+                            clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
+
+        /* Region to the right of the fault: always valid for the common fault_idx=0 case. */
+        r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
+                            pg.end);
+
+        for (i = 0; i <= 2; i++) {
+                int npages = range_len(&r[i]);
+
+                if (npages > 0)
+                        clear_pages_resched(page_address(folio_page(folio, r[i].start)), npages);
+        }
+}
+#endif /* !CONFIG_HIGHMEM */
--
2.31.1
On 6/15/25 22:22, Ankur Arora wrote:
> Override the common code version of folio_zero_user() so we can use
> clear_pages() to do multi-page clearing instead of the standard
> page-at-a-time clearing.

I'm not a big fan of the naming in this series.

To me multi-page means "more than one 'struct page'". But this series is
clearly using multi-page clearing to mean clearing >PAGE_SIZE in one
clear. But oh well.

The second problem with where this ends up is that none of the code is
*actually* x86-specific. The only thing that x86 provides that's
interesting is a clear_pages() implementation that hands >PAGE_SIZE
units down to the CPUs.

The result is ~100 lines of code that will compile and run functionally
on any architecture.

To me, that's deserving of an ARCH_HAS_FOO bit that we can set on the
x86 side that then cajoles the core mm/ code to use the fancy new
clear_pages_resched() implementation.

Because what are the arm64 guys going to do when their CPUs start doing
this? They're either going to copy-and-paste the x86 implementation or
they're going to refactor the x86 implementation into common code.

My money is on the refactoring, because those arm64 guys do good work.
Could we save them the trouble, please?

Oh, and one other little thing:

> +/*
> + * Limit the optimized version of folio_zero_user() to !CONFIG_HIGHMEM.
> + * We do that because clear_pages() works on contiguous kernel pages
> + * which might not be true under HIGHMEM.
> + */

The tip trees are picky about imperative voice, so no "we's". But if you
stick this in mm/, folks are less picky. ;)
Dave Hansen <dave.hansen@intel.com> writes:

> On 6/15/25 22:22, Ankur Arora wrote:
>> Override the common code version of folio_zero_user() so we can use
>> clear_pages() to do multi-page clearing instead of the standard
>> page-at-a-time clearing.
>
> I'm not a big fan of the naming in this series.
>
> To me multi-page means "more than one 'struct page'". But this series is
> clearly using multi-page clearing to mean clearing >PAGE_SIZE in one
> clear. But oh well.

I'd say it's doing both of those. Seen from the folio side, it is
clearing more than one struct page. Once you descend to the clearing
primitive, that's just page aligned memory.

> The second problem with where this ends up is that none of the code is
> *actually* x86-specific. The only thing that x86 provides that's
> interesting is a clear_pages() implementation that hands >PAGE_SIZE
> units down to the CPUs.
>
> The result is ~100 lines of code that will compile and run functionally
> on any architecture.

True. The underlying assumption is that you can provide extent level
information to string instructions which AFAIK only exists on x86.

> To me, that's deserving of an ARCH_HAS_FOO bit that we can set on the
> x86 side that then cajoles the core mm/ code to use the fancy new
> clear_pages_resched() implementation.

This seems straight-forward enough.

> Because what are the arm64 guys going to do when their CPUs start doing
> this? They're either going to copy-and-paste the x86 implementation or
> they're going to refactor the x86 implementation into common code.

These instructions have been around for an awfully long time. Are other
architectures looking at adding similar instructions?

I think this is definitely worth it if there are performance advantages
on arm64 -- maybe just because of the reduced per-page overhead.

Let me try this out on arm64.

> My money is on the refactoring, because those arm64 guys do good work.
> Could we save them the trouble, please?
> Oh, and one other little thing:
>
>> +/*
>> + * Limit the optimized version of folio_zero_user() to !CONFIG_HIGHMEM.
>> + * We do that because clear_pages() works on contiguous kernel pages
>> + * which might not be true under HIGHMEM.
>> + */
>
> The tip trees are picky about imperative voice, so no "we's". But if you
> stick this in mm/, folks are less picky. ;)

Hah. That might come in handy ;).

--
ankur
Ankur Arora <ankur.a.arora@oracle.com> writes:

> Dave Hansen <dave.hansen@intel.com> writes:
>
>> On 6/15/25 22:22, Ankur Arora wrote:

[ ... ]

>> The second problem with where this ends up is that none of the code is
>> *actually* x86-specific. The only thing that x86 provides that's
>> interesting is a clear_pages() implementation that hands >PAGE_SIZE
>> units down to the CPUs.
>>
>> The result is ~100 lines of code that will compile and run functionally
>> on any architecture.
>
> True. The underlying assumption is that you can provide extent level
> information to string instructions which AFAIK only exists on x86.
>
>> To me, that's deserving of an ARCH_HAS_FOO bit that we can set on the
>> x86 side that then cajoles the core mm/ code to use the fancy new
>> clear_pages_resched() implementation.
>
> This seems straight-forward enough.
>
>> Because what are the arm64 guys going to do when their CPUs start doing
>> this? They're either going to copy-and-paste the x86 implementation or
>> they're going to refactor the x86 implementation into common code.
>
> These instructions have been around for an awfully long time. Are other
> architectures looking at adding similar instructions?

Just to answer my own question: arm64 with FEAT_MOPS (post v8.8) does
support operating on memory extents. (Both clearing and copying.)

> I think this is definitely worth it if there are performance advantages
> on arm64 -- maybe just because of the reduced per-page overhead.
>
> Let me try this out on arm64.
>
>> My money is on the refactoring, because those arm64 guys do good work.
>> Could we save them the trouble, please?

I thought about this and this definitely makes sense to do. But, it
really suggests a larger set of refactors:

1. hugepage clearing via clear_pages() (this series)
2. hugepage copying via copy_pages()

Both of these are faster than the current per page approach on x86.
And, from some preliminary tests, at least no slower on arm64. (My
arm64 test machine does not have FEAT_MOPS.)

With those two done we should be able to simplify the current
folio_zero_user(), copy_user_large_folio(), process_huge_page() which
is overcomplicated. Other archs that care about performance could
switch to the multiple page approach.

3. Simplify the logic around process_huge_page().

None of these pieces are overly complex. I think the only question is
how to stage it. Ideally I would like to stage them sequentially and
not send out a single unwieldy series that touches mm and has
performance implications for multiple architectures. Also would be
good to get wider testing for each part.

What do you think? I guess this is also a question for Andrew.

--
ankur
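If the clearing loop does move into common code as discussed above, the
arch hook could be as small as an overridable clear_pages() with a
page-at-a-time fallback. A rough sketch (not part of this series; the
hook shape is hypothetical):

#ifndef clear_pages
/*
 * Generic fallback: architectures without an extent-aware clearing
 * primitive clear one page at a time. x86 (and arm64 with FEAT_MOPS)
 * would override this with a single-extent implementation.
 */
static inline void clear_pages(void *addr, unsigned int npages)
{
        unsigned int i;

        for (i = 0; i < npages; i++)
                clear_page(addr + i * PAGE_SIZE);
}
#endif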
On Mon, Jun 16, 2025 at 07:44:13AM -0700, Dave Hansen wrote:
> To me multi-page means "more than one 'struct page'". But this series is
> clearly using multi-page clearing to mean clearing >PAGE_SIZE in one
> clear. But oh well.

I'm not sure I see the distinction you're trying to draw. struct page
refers to a PAGE_SIZE aligned, PAGE_SIZE sized chunk of memory. So if
you do something to more than PAGE_SIZE bytes, you're doing something
to multiple struct pages.
On Mon, Jun 16, 2025 at 07:44:13AM -0700, Dave Hansen wrote:
> To me, that's deserving of an ARCH_HAS_FOO bit that we can set on the
> x86 side that then cajoles the core mm/ code to use the fancy new
> clear_pages_resched() implementation.

Note that we should only set this bit with either full or lazy
preemption selected. Haven't checked the patch-set to see if that
constraint is already taken care of.
Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Jun 16, 2025 at 07:44:13AM -0700, Dave Hansen wrote:
>
>> To me, that's deserving of an ARCH_HAS_FOO bit that we can set on the
>> x86 side that then cajoles the core mm/ code to use the fancy new
>> clear_pages_resched() implementation.
>
> Note that we should only set this bit with either full or lazy
> preemption selected. Haven't checked the patch-set to see if that
> constraint is already taken care of.

It is. I check for preempt_model_preemptible() and limit voluntary
models to chunk sizes of 8MB.

--
ankur
On 6/16/25 07:50, Peter Zijlstra wrote:
> On Mon, Jun 16, 2025 at 07:44:13AM -0700, Dave Hansen wrote:
>> To me, that's deserving of an ARCH_HAS_FOO bit that we can set on the
>> x86 side that then cajoles the core mm/ code to use the fancy new
>> clear_pages_resched() implementation.
> Note that we should only set this bit with either full or lazy
> preemption selected. Haven't checked the patch-set to see if that
> constraint is already taken care of.

There is a check in the C code for preempt_model_preemptible(). So as
long as there was something like:

config SOMETHING_PAGE_CLEARING
        def_bool y
        depends on ARCH_HAS_WHATEVER_PAGE_CLEARING
        depends on !HIGHMEM

Then the check for HIGHMEM and the specific architecture support could
be along the lines of:

static void clear_pages_resched(void *addr, int npages)
{
        int i, remaining;

        if (!IS_ENABLED(SOMETHING_PAGE_CLEARING) || preempt_model_preemptible()) {
                clear_pages(addr, npages);
                goto out;
        }
        ...

which would also remove the #ifdef CONFIG_HIGHMEM in there now.
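The arch-side half of that sketch would presumably be little more than
a select in the architecture's Kconfig (symbol names as in the sketch
above, i.e. hypothetical):

# arch/x86/Kconfig (hypothetical wiring)
config X86
        ...
        select ARCH_HAS_WHATEVER_PAGE_CLEARING

with the override then built from mm/ based on SOMETHING_PAGE_CLEARING,
rather than from arch/x86/mm/Makefile under CONFIG_PREEMPTION as the
patch does now.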
Hi Ankur,

kernel test robot noticed the following build warnings:

[auto build test WARNING on perf-tools-next/perf-tools-next]
[also build test WARNING on tip/perf/core perf-tools/perf-tools linus/master v6.16-rc2 next-20250616]
[cannot apply to acme/perf/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ankur-Arora/perf-bench-mem-Remove-repetition-around-time-measurement/20250616-132651
base:   https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git perf-tools-next
patch link:    https://lore.kernel.org/r/20250616052223.723982-14-ankur.a.arora%40oracle.com
patch subject: [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing
config: x86_64-buildonly-randconfig-003-20250616 (https://download.01.org/0day-ci/archive/20250616/202506161939.YrEAfTPY-lkp@intel.com/config)
compiler: clang version 20.1.2 (https://github.com/llvm/llvm-project 58df0ef89dd64126512e4ee27b4ac3fd8ddf6247)
rustc: rustc 1.78.0 (9b00956e5 2024-04-29)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250616/202506161939.YrEAfTPY-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506161939.YrEAfTPY-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> arch/x86/mm/memory.c:61:6: warning: no previous prototype for function 'folio_zero_user' [-Wmissing-prototypes]
      61 | void folio_zero_user(struct folio *folio, unsigned long addr_hint)
         |      ^
   arch/x86/mm/memory.c:61:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
      61 | void folio_zero_user(struct folio *folio, unsigned long addr_hint)
         | ^
         | static
   1 warning generated.

vim +/folio_zero_user +61 arch/x86/mm/memory.c

    39
    40  /*
    41   * folio_zero_user() - multi-page clearing.
    42   *
    43   * @folio: hugepage folio
    44   * @addr_hint: faulting address (if any)
    45   *
    46   * Overrides common code folio_zero_user(). This version takes advantage of
    47   * the fact that string instructions in clear_pages() are more performant
    48   * on larger extents compared to the usual page-at-a-time clearing.
    49   *
    50   * Clearing of 2MB pages is split in three parts: pages in the immediate
    51   * locality of the faulting page, and its left, right regions; with the local
    52   * neighbourhood cleared last in order to keep cache lines of the target
    53   * region hot.
    54   *
    55   * For GB pages, there is no expectation of cache locality so just do a
    56   * straight zero.
    57   *
    58   * Note that the folio is fully allocated already so we don't do any exception
    59   * handling.
    60   */
  > 61  void folio_zero_user(struct folio *folio, unsigned long addr_hint)

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
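A plausible cause for this W=1 warning (an assumption, not verified
against this config): the folio_zero_user() declaration in <linux/mm.h>
sits behind the THP/hugetlbfs guard, and this randconfig enables
neither, so the definition in arch/x86/mm/memory.c has no preceding
prototype. One way this kind of warning is commonly addressed is to
mirror that guard around the override, for example:

/*
 * Sketch only: build the override only when a prototype (and a caller)
 * exists. The guard below assumes the declaration in <linux/mm.h> is
 * conditional on these existing config symbols.
 */
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
void folio_zero_user(struct folio *folio, unsigned long addr_hint)
{
        /* ... as in the patch ... */
}
#endif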