Performance when clearing memory with string instructions (x86-64-stosq
and similar) can vary significantly with the chunk-size used:
$ perf bench mem memset -k 4KB -s 4GB -f x86-64-stosq
# Running 'mem/memset' benchmark:
# function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
# Copying 4GB bytes ...
13.748208 GB/sec
$ perf bench mem memset -k 2MB -s 4GB -f x86-64-stosq
# Running 'mem/memset' benchmark:
# function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
# Copying 4GB bytes ...
15.067900 GB/sec
$ perf bench mem memset -k 1GB -s 4GB -f x86-64-stosq
# Running 'mem/memset' benchmark:
# function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
# Copying 4GB bytes ...
38.104311 GB/sec
(All measured on AMD Milan.)
Going from a chunk-size of 4KB to 1GB, performance improves from
13.7 GB/sec to 38.1 GB/sec. The improvement for a chunk-size of 2MB is
not as drastic, but it is still worth adding a clear_page() variant that
can handle contiguous page-extents.
Also define ARCH_PAGE_CONTIG_NR to specify the maximum contiguous page
range that can be zeroed when running under cooperative preemption
models. This limits the worst case preemption latency.
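Worked out for the common 4KB page size (PAGE_SHIFT == 12), and taking the
~10GB/sec clearing bandwidth assumed in the comment below:

  ARCH_PAGE_CONTIG_NR = 8 << (20 - 12) = 2048 pages = 8MB
  worst case latency ~= 8MB / 10GB/sec ~= 0.8ms

Machines with lower clearing bandwidth will see proportionally higher
worst case latency.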
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
---
arch/x86/include/asm/page_64.h | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index df528cff90ef..efab5dc26e3e 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -43,8 +43,9 @@ extern unsigned long __phys_addr_symbol(unsigned long);
void memzero_page_aligned_unrolled(void *addr, u64 len);
/**
- * clear_page() - clear a page using a kernel virtual address.
- * @addr: address of kernel page
+ * clear_pages() - clear a page range using a kernel virtual address.
+ * @addr: start address of kernel page range
+ * @npages: number of pages
*
* Switch between three implementations of page clearing based on CPU
* capabilities:
@@ -65,11 +66,11 @@ void memzero_page_aligned_unrolled(void *addr, u64 len);
*
* Does absolutely no exception handling.
*/
-static inline void clear_page(void *addr)
+static inline void clear_pages(void *addr, unsigned int npages)
{
- u64 len = PAGE_SIZE;
+ u64 len = npages * PAGE_SIZE;
/*
- * Clean up KMSAN metadata for the page being cleared. The assembly call
+ * Clean up KMSAN metadata for the pages being cleared. The assembly call
* below clobbers @addr, so we perform unpoisoning before it.
*/
kmsan_unpoison_memory(addr, len);
@@ -80,6 +81,21 @@ static inline void clear_page(void *addr)
: "a" (0)
: "cc", "memory");
}
+#define __HAVE_ARCH_CLEAR_PAGES
+
+/*
+ * When running under cooperatively scheduled preemption models limit the
+ * maximum contiguous extent that can be cleared to pages worth 8MB.
+ *
+ * With a clearing BW of ~10GBps, this should result in worst case scheduling
+ * latency of ~1ms.
+ */
+#define ARCH_PAGE_CONTIG_NR (8 << (20 - PAGE_SHIFT))
+
+static inline void clear_page(void *addr)
+{
+ clear_pages(addr, 1);
+}
void copy_page(void *to, void *from);
KCFI_REFERENCE(copy_page);
--
2.43.5
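For illustration, a hypothetical caller sketch (not part of this patch):
chunk a large clear by ARCH_PAGE_CONTIG_NR so that cooperative preemption
models get a chance to reschedule between chunks. clear_pages_resched()
below is made up for this example.

	/*
	 * Hypothetical helper: clear npages worth of memory at addr in
	 * ARCH_PAGE_CONTIG_NR sized chunks, rescheduling in between so
	 * the worst case preemption latency stays bounded.
	 */
	static void clear_pages_resched(void *addr, unsigned int npages)
	{
		while (npages) {
			unsigned int n = min(npages,
					     (unsigned int)ARCH_PAGE_CONTIG_NR);

			clear_pages(addr, n);
			addr += n * PAGE_SIZE;	/* void * arithmetic: GCC extension */
			npages -= n;

			cond_resched();		/* no-op under full preemption */
		}
	}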
On Mon, Oct 27, 2025 at 01:21:07PM -0700, Ankur Arora wrote:
> Also define ARCH_PAGE_CONTIG_NR to specify the maximum contiguous page
> range that can be zeroed when running under cooperative preemption
> models. This limits the worst case preemption latency.
Please do not explain what the patch does in the commit message - that should
be clear from the diff itself. Rather, concentrate on why this patch exists.
> +/*
> + * When running under cooperatively scheduled preemption models limit the
> + * maximum contiguous extent that can be cleared to pages worth 8MB.
Why?
> + *
> + * With a clearing BW of ~10GBps, this should result in worst case scheduling
This sounds like you have this bandwidth (please write it out - we have more
than enough silly abbreviations) on *everything* x86 the kernel runs on. Which
simply ain't true.
> + * latency of ~1ms.
> + */
> +#define ARCH_PAGE_CONTIG_NR (8 << (20 - PAGE_SHIFT))
And so this looks like some magic number which makes sense only on some
uarches but most likely it doesn't on others.
Why isn't this thing determined dynamically during boot or so, instead of
hardcoding it this way and then having to change it again later when bandwidth
increases?
Hmm, weird.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov <bp@alien8.de> writes:

> On Mon, Oct 27, 2025 at 01:21:07PM -0700, Ankur Arora wrote:
>> Also define ARCH_PAGE_CONTIG_NR to specify the maximum contiguous page
>> range that can be zeroed when running under cooperative preemption
>> models. This limits the worst case preemption latency.
>
> Please do not explain what the patch does in the commit message - that should
> be clear from the diff itself. Rather, concentrate on why this patch exists.

Ack.

>> +/*
>> + * When running under cooperatively scheduled preemption models limit the
>> + * maximum contiguous extent that can be cleared to pages worth 8MB.
>
> Why?

Will mention that the point is to minimize worst case preemption latency.

>> + *
>> + * With a clearing BW of ~10GBps, this should result in worst case scheduling
>
> This sounds like you have this bandwidth (please write it out - we have more
> than enough silly abbreviations) on *everything* x86 the kernel runs on. Which
> simply ain't true.
>
>> + * latency of ~1ms.
>> + */
>> +#define ARCH_PAGE_CONTIG_NR (8 << (20 - PAGE_SHIFT))
>
> And so this looks like some magic number which makes sense only on some
> uarches but most likely it doesn't on others.

The intent was to use a large enough value that enables uarchs which do
'REP; STOS' optimizations, but not too large so we end up with high
preemption latency.

> Why isn't this thing determined dynamically during boot or so, instead of
> hardcoding it this way and then having to change it again later when bandwidth
> increases?

I thought of doing that but given that the precise value doesn't matter
very much (and there's enough slack in it in either direction) it seemed
unnecessary to do at this point.

Also, I'm not sure that a boot determined value would really help given
that the 'REP; STOS' bandwidth could be high or low based on how
saturated the bus is.

Clearly some of this detail should have been in my commit message. Let
me add it there.

Thanks

--
ankur
On Tue, Oct 28, 2025 at 11:51:39AM -0700, Ankur Arora wrote:
> The intent was to use a large enough value that enables uarchs which do
> 'REP; STOS' optimizations, but not too large so we end up with high
> preemption latency.
How is selecting that number tied to uarches which can do REP; STOSB? I assume
you mean REP; STOSB where microcode magic glue aggregates larger moves than
just u64 chunks but only under certain conditions and so on..., and not
REP_GOOD where the microcode doesn't have problems with REP prefixes...
> > Why isn't this thing determined dynamically during boot or so, instead of
> > hardcoding it this way and then having to change it again later when bandwidth
> > increases?
>
> I thought of doing that but given that the precise value doesn't matter
> very much (and there's enough slack in it in either direction) it seemed
> unnecessary to do at this point.
>
> Also, I'm not sure that a boot determined value would really help given
> that the 'REP; STOS' bandwidth could be high or low based on how
> saturated the bus is.
>
> Clearly some of this detail should have been in my commit message.
So you want to have, say, 8MB of contiguous range - if possible - and let the
CPU do larger clears. And it depends on the scheduling model. And it depends
on what the CPU can do wrt length aggregation. Close?
Well, I would like, please, for this to be properly documented why it was
selected this way and what *all* the aspects were to select it this way so
that we can know why it is there and we can change it in the future if
needed.
It is very hard to do so if the reasoning behind it has disappeared in the
bowels of lkml...
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov <bp@alien8.de> writes:

> On Tue, Oct 28, 2025 at 11:51:39AM -0700, Ankur Arora wrote:
>> The intent was to use a large enough value that enables uarchs which do
>> 'REP; STOS' optimizations, but not too large so we end up with high
>> preemption latency.
>
> How is selecting that number tied to uarches which can do REP; STOSB? I assume
> you mean REP; STOSB where microcode magic glue aggregates larger moves than
> just u64 chunks but only under certain conditions and so on..., and not
> REP_GOOD where the microcode doesn't have problems with REP prefixes...

Yes, to what you say below.

>> > Why isn't this thing determined dynamically during boot or so, instead of
>> > hardcoding it this way and then having to change it again later when bandwidth
>> > increases?
>>
>> I thought of doing that but given that the precise value doesn't matter
>> very much (and there's enough slack in it in either direction) it seemed
>> unnecessary to do at this point.
>>
>> Also, I'm not sure that a boot determined value would really help given
>> that the 'REP; STOS' bandwidth could be high or low based on how
>> saturated the bus is.
>>
>> Clearly some of this detail should have been in my commit message.
>
> So you want to have, say, 8MB of contiguous range - if possible - and let the
> CPU do larger clears. And it depends on the scheduling model. And it depends
> on what the CPU can do wrt length aggregation. Close?

Yeah pretty much that. Just to restate:

 - be large enough so CPUs that can optimize, are able to optimize

 - even in the bad cases (CPUs that don't optimize and/or are generally
   slow at this optimization): should be fast enough that we have
   reasonable preemption latency (which is an issue only for voluntary
   preemption etc)

> Well, I would like, please, for this to be properly documented why it was
> selected this way and what *all* the aspects were to select it this way so
> that we can know why it is there and we can change it in the future if
> needed.
>
> It is very hard to do so if the reasoning behind it has disappeared in the
> bowels of lkml...

Ack. Yeah I should have documented this way better.

Thanks

--
ankur