Clear contiguous page ranges in folio_zero_user() instead of clearing
a single page at a time. Exposing larger ranges enables extent based
processor optimizations.
However, because the underlying clearing primitives do not, or might
not be able to, call cond_resched() to check if preemption
is required, limit the worst case preemption latency by doing the
clearing in no more than PROCESS_PAGES_NON_PREEMPT_BATCH units.
For architectures that define clear_pages(), we assume that the
clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
worth of pages. This should be large enough to allow the processor
to optimize the operation and yet small enough that we see reasonable
preemption latency for when this optimization is not possible
(ex. slow microarchitectures, memory bandwidth saturation.)
Architectures that don't define clear_pages() will continue to use
the base value (single page). And, preemptible models don't need
invocations of cond_resched() so don't care about the batch size.
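
As a rough bound on the worst case preemption latency this batch size
implies (the bandwidth figures are only illustrative):

    8MB / 10 GB/s ~= 0.8 ms
    8MB /  1 GB/s ~=   8 ms    (slow uarch or saturated memory bus)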
The resultant performance depends on the kinds of optimizations
available to the CPU for the region size being cleared. Two classes
of optimizations:
- clearing iteration costs are amortized over a range larger
than a single page.
- cacheline allocation elision (seen on AMD Zen models).
Testing a demand fault workload shows an improved baseline from the
first optimization and a larger improvement when the region being
cleared is large enough for the second optimization.
AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):
$ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
page-at-a-time contiguous clearing change
(GB/s +- %stdev) (GB/s +- %stdev)
pg-sz=2MB 12.92 +- 2.55% 17.03 +- 0.70% + 31.8% preempt=*
pg-sz=1GB 17.14 +- 2.27% 18.04 +- 1.05% + 5.2% preempt=none|voluntary
pg-sz=1GB 17.26 +- 1.24% 42.17 +- 4.21% [#] +144.3% preempt=full|lazy
[#] Notice that we perform much better with preempt=full|lazy. As
mentioned above, preemptible models not needing explicit invocations
of cond_resched() allow clearing of the full extent (1GB) as a
single unit.
In comparison the maximum extent used for preempt=none|voluntary is
PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
The larger extent allows the processor to elide cacheline
allocation (on Milan the threshold is LLC-size=32MB.)
Also as mentioned earlier, the baseline improvement is not specific to
AMD Zen platforms. Intel Icelakex (pg-sz=2MB|1GB) sees a similar
improvement as the Milan pg-sz=2MB workload above (~30%).
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
---
include/linux/mm.h | 38 +++++++++++++++++++++++++++++++++++++-
mm/memory.c | 46 +++++++++++++++++++++++++---------------------
2 files changed, 62 insertions(+), 22 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 12106ebf1a50..45e5e0ef620c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4194,7 +4194,6 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
unsigned int order) {}
#endif /* CONFIG_DEBUG_PAGEALLOC */
-#ifndef clear_pages
/**
* clear_pages() - clear a page range for kernel-internal use.
* @addr: start address
@@ -4204,7 +4203,18 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
* mapped to user space.
*
* Does absolutely no exception handling.
+ *
+ * Note that even though the clearing operation is preemptible, clear_pages()
+ * does not (and on architectures where it reduces to a few long-running
+ * instructions, might not be able to) call cond_resched() to check if
+ * rescheduling is required.
+ *
+ * When running under preemptible models this is fine, since clear_pages(),
+ * even when reduced to long-running instructions, is preemptible.
+ * Under cooperatively scheduled models, however, the caller is expected to
+ * limit @npages to no more than PROCESS_PAGES_NON_PREEMPT_BATCH.
*/
+#ifndef clear_pages
static inline void clear_pages(void *addr, unsigned int npages)
{
do {
@@ -4214,6 +4224,32 @@ static inline void clear_pages(void *addr, unsigned int npages)
}
#endif
+#ifndef PROCESS_PAGES_NON_PREEMPT_BATCH
+#ifdef clear_pages
+/*
+ * The architecture defines clear_pages(), and we assume that it is
+ * generally "fast". So choose a batch size large enough to allow the processor
+ * headroom for optimizing the operation and yet small enough that we see
+ * reasonable preemption latency for when this optimization is not possible
+ * (ex. slow microarchitectures, memory bandwidth saturation.)
+ *
+ * With a value of 8MB and assuming a memory bandwidth of ~10GBps, this should
+ * result in worst case preemption latency of around 1ms when clearing pages.
+ *
+ * (See comment above clear_pages() for why preemption latency is a concern
+ * here.)
+ */
+#define PROCESS_PAGES_NON_PREEMPT_BATCH (8 << (20 - PAGE_SHIFT))
+#else /* !clear_pages */
+/*
+ * The architecture does not provide a clear_pages() implementation. Assume
+ * that clear_page() -- which clear_pages() will fallback to -- is relatively
+ * slow and choose a small value for PROCESS_PAGES_NON_PREEMPT_BATCH.
+ */
+#define PROCESS_PAGES_NON_PREEMPT_BATCH 1
+#endif
+#endif
+
#ifdef __HAVE_ARCH_GATE_AREA
extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
extern int in_gate_area_no_mm(unsigned long addr);
diff --git a/mm/memory.c b/mm/memory.c
index 2a55edc48a65..974c48db6089 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7237,40 +7237,44 @@ static inline int process_huge_page(
return 0;
}
-static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
- unsigned int nr_pages)
+static void clear_contig_highpages(struct page *page, unsigned long addr,
+ unsigned int npages)
{
- unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
- int i;
+ unsigned int i, count, unit;
- might_sleep();
- for (i = 0; i < nr_pages; i++) {
+ /*
+ * When clearing we want to operate on the largest extent possible since
+ * that allows for extent based architecture specific optimizations.
+ *
+ * However, since the clearing interfaces (clear_user_highpages(),
+ * clear_user_pages(), clear_pages()), do not call cond_resched(), we
+ * limit the batch size when running under non-preemptible scheduling
+ * models.
+ */
+ unit = preempt_model_preemptible() ? npages : PROCESS_PAGES_NON_PREEMPT_BATCH;
+
+ for (i = 0; i < npages; i += count) {
cond_resched();
- clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
+
+ count = min(unit, npages - i);
+ clear_user_highpages(page + i,
+ addr + i * PAGE_SIZE, count);
}
}
-static int clear_subpage(unsigned long addr, int idx, void *arg)
-{
- struct folio *folio = arg;
-
- clear_user_highpage(folio_page(folio, idx), addr);
- return 0;
-}
-
/**
* folio_zero_user - Zero a folio which will be mapped to userspace.
* @folio: The folio to zero.
- * @addr_hint: The address will be accessed or the base address if uncelar.
+ * @addr_hint: The address accessed by the user or the base address.
+ *
+ * Uses architectural support to clear page ranges.
*/
void folio_zero_user(struct folio *folio, unsigned long addr_hint)
{
- unsigned int nr_pages = folio_nr_pages(folio);
+ unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
- if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
- clear_gigantic_page(folio, addr_hint, nr_pages);
- else
- process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+ clear_contig_highpages(folio_page(folio, 0),
+ base_addr, folio_nr_pages(folio));
}
static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
--
2.31.1
On 12/15/25 21:49, Ankur Arora wrote:
> Clear contiguous page ranges in folio_zero_user() instead of clearing
> a single page at a time. Exposing larger ranges enables extent based
> processor optimizations.
>
> However, because the underlying clearing primitives do not, or might
> not be able to, call cond_resched() to check if preemption
> is required, limit the worst case preemption latency by doing the
> clearing in no more than PROCESS_PAGES_NON_PREEMPT_BATCH units.
>
> For architectures that define clear_pages(), we assume that the
> clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
> worth of pages. This should be large enough to allow the processor
> to optimize the operation and yet small enough that we see reasonable
> preemption latency for when this optimization is not possible
> (ex. slow microarchitectures, memory bandwidth saturation.)
>
> Architectures that don't define clear_pages() will continue to use
> the base value (single page). And, preemptible models don't need
> invocations of cond_resched() so don't care about the batch size.
>
> The resultant performance depends on the kinds of optimizations
> available to the CPU for the region size being cleared. Two classes
> of optimizations:
>
> - clearing iteration costs are amortized over a range larger
> than a single page.
> - cacheline allocation elision (seen on AMD Zen models).
>
> Testing a demand fault workload shows an improved baseline from the
> first optimization and a larger improvement when the region being
> cleared is large enough for the second optimization.
>
> AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):
>
> $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
>
> page-at-a-time contiguous clearing change
>
> (GB/s +- %stdev) (GB/s +- %stdev)
>
> pg-sz=2MB 12.92 +- 2.55% 17.03 +- 0.70% + 31.8% preempt=*
>
> pg-sz=1GB 17.14 +- 2.27% 18.04 +- 1.05% + 5.2% preempt=none|voluntary
> pg-sz=1GB 17.26 +- 1.24% 42.17 +- 4.21% [#] +144.3% preempt=full|lazy
>
> [#] Notice that we perform much better with preempt=full|lazy. As
> mentioned above, preemptible models not needing explicit invocations
> of cond_resched() allow clearing of the full extent (1GB) as a
> single unit.
> In comparison the maximum extent used for preempt=none|voluntary is
> PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
>
> The larger extent allows the processor to elide cacheline
> allocation (on Milan the threshold is LLC-size=32MB.)
>
> Also as mentioned earlier, the baseline improvement is not specific to
> AMD Zen platforms. Intel Icelakex (pg-sz=2MB|1GB) sees a similar
> improvement as the Milan pg-sz=2MB workload above (~30%).
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
> include/linux/mm.h | 38 +++++++++++++++++++++++++++++++++++++-
> mm/memory.c | 46 +++++++++++++++++++++++++---------------------
> 2 files changed, 62 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 12106ebf1a50..45e5e0ef620c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4194,7 +4194,6 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
> unsigned int order) {}
> #endif /* CONFIG_DEBUG_PAGEALLOC */
>
> -#ifndef clear_pages
Why is that change part of this patch?
Looks like this should either go into the patch introducing
clear_pages() (#3 ?).
> /**
> * clear_pages() - clear a page range for kernel-internal use.
> * @addr: start address
> @@ -4204,7 +4203,18 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
> * mapped to user space.
> *
> * Does absolutely no exception handling.
> + *
> + * Note that even though the clearing operation is preemptible, clear_pages()
> + * does not (and on architectures where it reduces to a few long-running
> + * instructions, might not be able to) call cond_resched() to check if
> + * rescheduling is required.
> + *
> + * When running under preemptible models this is fine, since clear_pages(),
> + * even when reduced to long-running instructions, is preemptible.
> + * Under cooperatively scheduled models, however, the caller is expected to
> + * limit @npages to no more than PROCESS_PAGES_NON_PREEMPT_BATCH.
> */
> +#ifndef clear_pages
> static inline void clear_pages(void *addr, unsigned int npages)
> {
> do {
> @@ -4214,6 +4224,32 @@ static inline void clear_pages(void *addr, unsigned int npages)
> }
> #endif
>
> +#ifndef PROCESS_PAGES_NON_PREEMPT_BATCH
> +#ifdef clear_pages
> +/*
> + * The architecture defines clear_pages(), and we assume that it is
> + * generally "fast". So choose a batch size large enough to allow the processor
> + * headroom for optimizing the operation and yet small enough that we see
> + * reasonable preemption latency for when this optimization is not possible
> + * (ex. slow microarchitectures, memory bandwidth saturation.)
> + *
> + * With a value of 8MB and assuming a memory bandwidth of ~10GBps, this should
> + * result in worst case preemption latency of around 1ms when clearing pages.
> + *
> + * (See comment above clear_pages() for why preemption latency is a concern
> + * here.)
> + */
> +#define PROCESS_PAGES_NON_PREEMPT_BATCH (8 << (20 - PAGE_SHIFT))
> +#else /* !clear_pages */
> +/*
> + * The architecture does not provide a clear_pages() implementation. Assume
> + * that clear_page() -- which clear_pages() will fallback to -- is relatively
> + * slow and choose a small value for PROCESS_PAGES_NON_PREEMPT_BATCH.
> + */
> +#define PROCESS_PAGES_NON_PREEMPT_BATCH 1
> +#endif
> +#endif
> +
> #ifdef __HAVE_ARCH_GATE_AREA
> extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
> extern int in_gate_area_no_mm(unsigned long addr);
> diff --git a/mm/memory.c b/mm/memory.c
> index 2a55edc48a65..974c48db6089 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -7237,40 +7237,44 @@ static inline int process_huge_page(
> return 0;
> }
>
> -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
> - unsigned int nr_pages)
> +static void clear_contig_highpages(struct page *page, unsigned long addr,
> + unsigned int npages)
> {
> - unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
> - int i;
> + unsigned int i, count, unit;
>
> - might_sleep();
> - for (i = 0; i < nr_pages; i++) {
> + /*
> + * When clearing we want to operate on the largest extent possible since
> + * that allows for extent based architecture specific optimizations.
> + *
> + * However, since the clearing interfaces (clear_user_highpages(),
> + * clear_user_pages(), clear_pages()), do not call cond_resched(), we
> + * limit the batch size when running under non-preemptible scheduling
> + * models.
> + */
> + unit = preempt_model_preemptible() ? npages : PROCESS_PAGES_NON_PREEMPT_BATCH;
> +
> + for (i = 0; i < npages; i += count) {
> cond_resched();
> - clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
> +
> + count = min(unit, npages - i);
> + clear_user_highpages(page + i,
> + addr + i * PAGE_SIZE, count);
I guess that logic could be pushed down for the
clear_user_highpages()->clear_pages() implementation (arch or generic)
to take care of that, so not every user would have to care about that.
No strong opinion as we could do that later whenever we actually get
more clear_pages() users :)
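
Something like the below is roughly what I have in mind -- just a sketch,
with __clear_user_highpages() standing in for whatever the arch/generic
bulk-clear primitive ends up being called:

/*
 * Sketch: batching and cond_resched() handled inside the helper so
 * callers don't need to know about PROCESS_PAGES_NON_PREEMPT_BATCH.
 */
static inline void clear_user_highpages(struct page *page, unsigned long addr,
					unsigned int npages)
{
	unsigned int i, count;
	unsigned int unit = preempt_model_preemptible() ?
			    npages : PROCESS_PAGES_NON_PREEMPT_BATCH;

	for (i = 0; i < npages; i += count) {
		cond_resched();
		count = min(unit, npages - i);
		__clear_user_highpages(page + i, addr + i * PAGE_SIZE, count);
	}
}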
> -
> /**
> * folio_zero_user - Zero a folio which will be mapped to userspace.
> * @folio: The folio to zero.
> - * @addr_hint: The address will be accessed or the base address if uncelar.
> + * @addr_hint: The address accessed by the user or the base address.
> + *
> + * Uses architectural support to clear page ranges.
I think that comment can be dropped. Implementation detail :)
--
Cheers
David
David Hildenbrand (Red Hat) <david@kernel.org> writes:
> On 12/15/25 21:49, Ankur Arora wrote:
>> Clear contiguous page ranges in folio_zero_user() instead of clearing
>> a single page at a time. Exposing larger ranges enables extent based
>> processor optimizations.
>> However, because the underlying clearing primitives do not, or might
>> not be able to, call cond_resched() to check if preemption
>> is required, limit the worst case preemption latency by doing the
>> clearing in no more than PROCESS_PAGES_NON_PREEMPT_BATCH units.
>> For architectures that define clear_pages(), we assume that the
>> clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
>> worth of pages. This should be large enough to allow the processor
>> to optimize the operation and yet small enough that we see reasonable
>> preemption latency for when this optimization is not possible
>> (ex. slow microarchitectures, memory bandwidth saturation.)
>> Architectures that don't define clear_pages() will continue to use
>> the base value (single page). And, preemptible models don't need
>> invocations of cond_resched() so don't care about the batch size.
>> The resultant performance depends on the kinds of optimizations
>> available to the CPU for the region size being cleared. Two classes
>> of optimizations:
>> - clearing iteration costs are amortized over a range larger
>> than a single page.
>> - cacheline allocation elision (seen on AMD Zen models).
>> Testing a demand fault workload shows an improved baseline from the
>> first optimization and a larger improvement when the region being
>> cleared is large enough for the second optimization.
>> AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):
>> $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
>> page-at-a-time contiguous clearing change
>> (GB/s +- %stdev) (GB/s +- %stdev)
>> pg-sz=2MB 12.92 +- 2.55% 17.03 +- 0.70% + 31.8% preempt=*
>> pg-sz=1GB 17.14 +- 2.27% 18.04 +- 1.05% + 5.2% preempt=none|voluntary
>> pg-sz=1GB 17.26 +- 1.24% 42.17 +- 4.21% [#] +144.3% preempt=full|lazy
>> [#] Notice that we perform much better with preempt=full|lazy. As
>> mentioned above, preemptible models not needing explicit invocations
>> of cond_resched() allow clearing of the full extent (1GB) as a
>> single unit.
>> In comparison the maximum extent used for preempt=none|voluntary is
>> PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
>> The larger extent allows the processor to elide cacheline
>> allocation (on Milan the threshold is LLC-size=32MB.)
>> Also as mentioned earlier, the baseline improvement is not specific to
>> AMD Zen platforms. Intel Icelakex (pg-sz=2MB|1GB) sees a similar
>> improvement as the Milan pg-sz=2MB workload above (~30%).
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
>> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
>> ---
>> include/linux/mm.h | 38 +++++++++++++++++++++++++++++++++++++-
>> mm/memory.c | 46 +++++++++++++++++++++++++---------------------
>> 2 files changed, 62 insertions(+), 22 deletions(-)
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 12106ebf1a50..45e5e0ef620c 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -4194,7 +4194,6 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
>> unsigned int order) {}
>> #endif /* CONFIG_DEBUG_PAGEALLOC */
>> -#ifndef clear_pages
>
> Why is that change part of this patch?
>
> Looks like this should either go into the patch introducing clear_pages() (#3
> ?).
Yeah I think this was a mistake. There was no need to move the ifndef
below the comment as I do here. Will fix.
>> /**
>> * clear_pages() - clear a page range for kernel-internal use.
>> * @addr: start address
>> @@ -4204,7 +4203,18 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
>> * mapped to user space.
>> *
>> * Does absolutely no exception handling.
>> + *
>> + * Note that even though the clearing operation is preemptible, clear_pages()
>> + * does not (and on architectures where it reduces to a few long-running
>> + * instructions, might not be able to) call cond_resched() to check if
>> + * rescheduling is required.
>> + *
>> + * When running under preemptible models this is fine, since clear_pages(),
>> + * even when reduced to long-running instructions, is preemptible.
>> + * Under cooperatively scheduled models, however, the caller is expected to
>> + * limit @npages to no more than PROCESS_PAGES_NON_PREEMPT_BATCH.
>> */
>> +#ifndef clear_pages
>> static inline void clear_pages(void *addr, unsigned int npages)
>> {
>> do {
>> @@ -4214,6 +4224,32 @@ static inline void clear_pages(void *addr, unsigned int npages)
>> }
>> #endif
>> +#ifndef PROCESS_PAGES_NON_PREEMPT_BATCH
>> +#ifdef clear_pages
>> +/*
>> + * The architecture defines clear_pages(), and we assume that it is
>> + * generally "fast". So choose a batch size large enough to allow the processor
>> + * headroom for optimizing the operation and yet small enough that we see
>> + * reasonable preemption latency for when this optimization is not possible
>> + * (ex. slow microarchitectures, memory bandwidth saturation.)
>> + *
>> + * With a value of 8MB and assuming a memory bandwidth of ~10GBps, this should
>> + * result in worst case preemption latency of around 1ms when clearing pages.
>> + *
>> + * (See comment above clear_pages() for why preemption latency is a concern
>> + * here.)
>> + */
>> +#define PROCESS_PAGES_NON_PREEMPT_BATCH (8 << (20 - PAGE_SHIFT))
>> +#else /* !clear_pages */
>> +/*
>> + * The architecture does not provide a clear_pages() implementation. Assume
>> + * that clear_page() -- which clear_pages() will fallback to -- is relatively
>> + * slow and choose a small value for PROCESS_PAGES_NON_PREEMPT_BATCH.
>> + */
>> +#define PROCESS_PAGES_NON_PREEMPT_BATCH 1
>> +#endif
>> +#endif
>> +
>> #ifdef __HAVE_ARCH_GATE_AREA
>> extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
>> extern int in_gate_area_no_mm(unsigned long addr);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 2a55edc48a65..974c48db6089 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -7237,40 +7237,44 @@ static inline int process_huge_page(
>> return 0;
>> }
>> -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>> - unsigned int nr_pages)
>> +static void clear_contig_highpages(struct page *page, unsigned long addr,
>> + unsigned int npages)
>> {
>> - unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>> - int i;
>> + unsigned int i, count, unit;
>> - might_sleep();
>> - for (i = 0; i < nr_pages; i++) {
>> + /*
>> + * When clearing we want to operate on the largest extent possible since
>> + * that allows for extent based architecture specific optimizations.
>> + *
>> + * However, since the clearing interfaces (clear_user_highpages(),
>> + * clear_user_pages(), clear_pages()), do not call cond_resched(), we
>> + * limit the batch size when running under non-preemptible scheduling
>> + * models.
>> + */
>> + unit = preempt_model_preemptible() ? npages : PROCESS_PAGES_NON_PREEMPT_BATCH;
>> +
>> + for (i = 0; i < npages; i += count) {
>> cond_resched();
>> - clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
>> +
>> + count = min(unit, npages - i);
>> + clear_user_highpages(page + i,
>> + addr + i * PAGE_SIZE, count);
>
> I guess that logic could be pushed down for the
> clear_user_highpages()->clear_pages() implementation (arch or generic)
> to take care of that, so not every user would have to care about that.
You mean the preemptibility unit stuff?
> No strong opinion as we could do that later whenever we actually get more
> clear_pages() users :)
I remember thinking about it early on. And it seemed to me that clear_pages(),
clear_user_pages(), clear_user_highpages() are all structured
pretty similarly and all of them leave the responsibility of when to
preempt to the caller.
I guess clear_user_highpages() is special since that's the logical entry
point for user pages.
I'd like to keep it as is for now. And relook at changing this once
clear_pages() has a few more users.
>> -
>> /**
>> * folio_zero_user - Zero a folio which will be mapped to userspace.
>> * @folio: The folio to zero.
>> - * @addr_hint: The address will be accessed or the base address if uncelar.
>> + * @addr_hint: The address accessed by the user or the base address.
>> + *
>> + * Uses architectural support to clear page ranges.
>
> I think that comment can be dropped. Implementation detail :)
Ack.
Thanks for the reviews!
--
ankur
On Mon, 15 Dec 2025 12:49:21 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Clear contiguous page ranges in folio_zero_user() instead of clearing
> a single page at a time. Exposing larger ranges enables extent based
> processor optimizations.
>
> However, because the underlying clearing primitives do not, or might
> not be able to, call cond_resched() to check if preemption
> is required, limit the worst case preemption latency by doing the
> clearing in no more than PROCESS_PAGES_NON_PREEMPT_BATCH units.
>
> For architectures that define clear_pages(), we assume that the
> clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
> worth of pages. This should be large enough to allow the processor
> to optimize the operation and yet small enough that we see reasonable
> preemption latency for when this optimization is not possible
> (ex. slow microarchitectures, memory bandwidth saturation.)
>
> Architectures that don't define clear_pages() will continue to use
> the base value (single page). And, preemptible models don't need
> invocations of cond_resched() so don't care about the batch size.
>
> The resultant performance depends on the kinds of optimizations
> available to the CPU for the region size being cleared. Two classes
> of optimizations:
>
> - clearing iteration costs are amortized over a range larger
>   than a single page.
> - cacheline allocation elision (seen on AMD Zen models).

8MB is a big chunk of memory.

> Testing a demand fault workload shows an improved baseline from the
> first optimization and a larger improvement when the region being
> cleared is large enough for the second optimization.
>
> AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):

So we break out of the copy to run cond_resched() 8192 times?  This
sounds like a minor cost.

> $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
>
> page-at-a-time contiguous clearing change
>
> (GB/s +- %stdev) (GB/s +- %stdev)
>
> pg-sz=2MB 12.92 +- 2.55% 17.03 +- 0.70% + 31.8% preempt=*
>
> pg-sz=1GB 17.14 +- 2.27% 18.04 +- 1.05% + 5.2% preempt=none|voluntary
> pg-sz=1GB 17.26 +- 1.24% 42.17 +- 4.21% [#] +144.3% preempt=full|lazy

And yet those 8192 cond_resched()'s have a huge impact on the
performance!  I find this result very surprising.  Is it explainable?

> [#] Notice that we perform much better with preempt=full|lazy. As
> mentioned above, preemptible models not needing explicit invocations
> of cond_resched() allow clearing of the full extent (1GB) as a
> single unit.
> In comparison the maximum extent used for preempt=none|voluntary is
> PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
>
> The larger extent allows the processor to elide cacheline
> allocation (on Milan the threshold is LLC-size=32MB.)

It is this?

> Also as mentioned earlier, the baseline improvement is not specific to
> AMD Zen platforms. Intel Icelakex (pg-sz=2MB|1GB) sees a similar
> improvement as the Milan pg-sz=2MB workload above (~30%).
Andrew Morton <akpm@linux-foundation.org> writes:
> On Mon, 15 Dec 2025 12:49:21 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Clear contiguous page ranges in folio_zero_user() instead of clearing
>> a single page at a time. Exposing larger ranges enables extent based
>> processor optimizations.
>>
>> However, because the underlying clearing primitives do not, or might
>> not be able to, call cond_resched() to check if preemption
>> is required, limit the worst case preemption latency by doing the
>> clearing in no more than PROCESS_PAGES_NON_PREEMPT_BATCH units.
>>
>> For architectures that define clear_pages(), we assume that the
>> clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
>> worth of pages. This should be large enough to allow the processor
>> to optimize the operation and yet small enough that we see reasonable
>> preemption latency for when this optimization is not possible
>> (ex. slow microarchitectures, memory bandwidth saturation.)
>>
>> Architectures that don't define clear_pages() will continue to use
>> the base value (single page). And, preemptible models don't need
>> invocations of cond_resched() so don't care about the batch size.
>>
>> The resultant performance depends on the kinds of optimizations
>> available to the CPU for the region size being cleared. Two classes
>> of optimizations:
>>
>> - clearing iteration costs are amortized over a range larger
>> than a single page.
>> - cacheline allocation elision (seen on AMD Zen models).
>
> 8MB is a big chunk of memory.
>
>> Testing a demand fault workload shows an improved baseline from the
>> first optimization and a larger improvement when the region being
>> cleared is large enough for the second optimization.
>>
>> AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):
>
> So we break out of the copy to run cond_resched() 8192 times? This sounds
> like a minor cost.
>
>> $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
>>
>> page-at-a-time contiguous clearing change
>>
>> (GB/s +- %stdev) (GB/s +- %stdev)
>>
>> pg-sz=2MB 12.92 +- 2.55% 17.03 +- 0.70% + 31.8% preempt=*
>>
>> pg-sz=1GB 17.14 +- 2.27% 18.04 +- 1.05% + 5.2% preempt=none|voluntary
>> pg-sz=1GB 17.26 +- 1.24% 42.17 +- 4.21% [#] +144.3% preempt=full|lazy
>
> And yet those 8192 cond_resched()'s have a huge impact on the
> performance! I find this result very surprising. Is it explainable?
I agree about this being surprising. On the 2MB extent, I still find the
30% quite high but I think a decent portion of it is:
- on x86, the CPU is executing a single microcoded insn: REP; STOSB. And,
because it's doing it for a 2MB extent instead of a bunch of 4K extents,
it saves on the microcoding costs (and I suspect it allows it to do some
range operation which also helps.)
- the second reason (from Ingo) was again the per-iteration cost, which
given all of the mitigations on x86 is quite substantial.
On the AMD systems I had tested on, I think there's at least the cost
of RET misprediction in there.
(https://lore.kernel.org/lkml/Z_yzshvBmYiPrxU0@gmail.com/)
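
For reference, with clear_pages() the whole 2MB extent ends up being
cleared by a single REP; STOSB instead of 512 separate ones. Roughly along
these lines -- just a sketch, not the exact code in the x86 patch:

	static inline void clear_pages(void *addr, unsigned int npages)
	{
		unsigned long len = (unsigned long)npages << PAGE_SHIFT;

		/* Zero [addr, addr + len) with a single string op. */
		asm volatile("rep stosb"
			     : "+c" (len), "+D" (addr)
			     : "a" (0)
			     : "memory");
	}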
>> [#] Notice that we perform much better with preempt=full|lazy. As
>> mentioned above, preemptible models not needing explicit invocations
>> of cond_resched() allow clearing of the full extent (1GB) as a
>> single unit.
>> In comparison the maximum extent used for preempt=none|voluntary is
>> PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
>>
>> The larger extent allows the processor to elide cacheline
>> allocation (on Milan the threshold is LLC-size=32MB.)
>
> It is this?
Yeah I think so. For size >= 32MB, the microcoder can really just elide
cacheline allocation, and with the foreknowledge of the extent can perhaps
optimize on cache coherence traffic (this last one is my speculation).
On cacheline allocation elision, compare the L1-dcache-load in the two versions
below:
pg-sz=1GB:
- 9,250,034,512 cycles # 2.418 GHz ( +- 0.43% ) (46.16%)
- 544,878,976 instructions # 0.06 insn per cycle
- 2,331,332,516 L1-dcache-loads # 609.471 M/sec ( +- 0.03% ) (46.16%)
- 1,075,122,960 L1-dcache-load-misses # 46.12% of all L1-dcache accesses ( +- 0.01% ) (46.15%)
+ 3,688,681,006 cycles # 2.420 GHz ( +- 3.48% ) (46.01%)
+ 10,979,121 instructions # 0.00 insn per cycle
+ 31,829,258 L1-dcache-loads # 20.881 M/sec ( +- 4.92% ) (46.34%)
+ 13,677,295 L1-dcache-load-misses # 42.97% of all L1-dcache accesses ( +- 6.15% ) (46.32%)
(From an earlier version of this series: https://lore.kernel.org/lkml/20250414034607.762653-5-ankur.a.arora@oracle.com/)
Maybe I should have kept it in this commit :).
>> Also as mentioned earlier, the baseline improvement is not specific to
>> AMD Zen platforms. Intel Icelakex (pg-sz=2MB|1GB) sees a similar
>> improvement as the Milan pg-sz=2MB workload above (~30%).
>>
--
ankur
On Mon, 15 Dec 2025 22:49:25 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote: > >> [#] Notice that we perform much better with preempt=full|lazy. As > >> mentioned above, preemptible models not needing explicit invocations > >> of cond_resched() allow clearing of the full extent (1GB) as a > >> single unit. > >> In comparison the maximum extent used for preempt=none|voluntary is > >> PROCESS_PAGES_NON_PREEMPT_BATCH (8MB). > >> > >> The larger extent allows the processor to elide cacheline > >> allocation (on Milan the threshold is LLC-size=32MB.) > > > > It is this? > > Yeah I think so. For size >= 32MB, the microcoder can really just elide > cacheline allocation, and with the foreknowledge of the extent can perhaps > optimize on cache coherence traffic (this last one is my speculation). > > On cacheline allocation elision, compare the L1-dcache-load in the two versions > below: > > pg-sz=1GB: > - 9,250,034,512 cycles # 2.418 GHz ( +- 0.43% ) (46.16%) > - 544,878,976 instructions # 0.06 insn per cycle > - 2,331,332,516 L1-dcache-loads # 609.471 M/sec ( +- 0.03% ) (46.16%) > - 1,075,122,960 L1-dcache-load-misses # 46.12% of all L1-dcache accesses ( +- 0.01% ) (46.15%) > > + 3,688,681,006 cycles # 2.420 GHz ( +- 3.48% ) (46.01%) > + 10,979,121 instructions # 0.00 insn per cycle > + 31,829,258 L1-dcache-loads # 20.881 M/sec ( +- 4.92% ) (46.34%) > + 13,677,295 L1-dcache-load-misses # 42.97% of all L1-dcache accesses ( +- 6.15% ) (46.32%) > That says L1 d-cache loads went from 600 million/sec down to 20 million/sec when using 32MB chunks? Do you know what happens to preemption latency if you increase that chunk size from 8MB to 32MB? At 42GB/sec, 32MB will take less than a millisecond, yes? I'm not aware of us really having any latency targets in these preemption modes, but 1 millisecond sounds pretty good.
Andrew Morton <akpm@linux-foundation.org> writes:
> On Mon, 15 Dec 2025 22:49:25 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> >> [#] Notice that we perform much better with preempt=full|lazy. As
>> >> mentioned above, preemptible models not needing explicit invocations
>> >> of cond_resched() allow clearing of the full extent (1GB) as a
>> >> single unit.
>> >> In comparison the maximum extent used for preempt=none|voluntary is
>> >> PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
>> >>
>> >> The larger extent allows the processor to elide cacheline
>> >> allocation (on Milan the threshold is LLC-size=32MB.)
>> >
>> > It is this?
>>
>> Yeah I think so. For size >= 32MB, the microcoder can really just elide
>> cacheline allocation, and with the foreknowledge of the extent can perhaps
>> optimize on cache coherence traffic (this last one is my speculation).
>>
>> On cacheline allocation elision, compare the L1-dcache-load in the two versions
>> below:
>>
>> pg-sz=1GB:
>> - 9,250,034,512 cycles # 2.418 GHz ( +- 0.43% ) (46.16%)
>> - 544,878,976 instructions # 0.06 insn per cycle
>> - 2,331,332,516 L1-dcache-loads # 609.471 M/sec ( +- 0.03% ) (46.16%)
>> - 1,075,122,960 L1-dcache-load-misses # 46.12% of all L1-dcache accesses ( +- 0.01% ) (46.15%)
>>
>> + 3,688,681,006 cycles # 2.420 GHz ( +- 3.48% ) (46.01%)
>> + 10,979,121 instructions # 0.00 insn per cycle
>> + 31,829,258 L1-dcache-loads # 20.881 M/sec ( +- 4.92% ) (46.34%)
>> + 13,677,295 L1-dcache-load-misses # 42.97% of all L1-dcache accesses ( +- 6.15% ) (46.32%)
>>
>
> That says L1 d-cache loads went from 600 million/sec down to 20
> million/sec when using 32MB chunks?
Sorry, should have mentioned that that run was with preempt=full/lazy.
For those the chunk size is the whole page (GB page in that case).
The context for 32MB was that that's the LLC-size for these systems.
And, from observed behaviour the cacheline allocation elision
optimization only happens when the chunk size used is larger than that.
> Do you know what happens to preemption latency if you increase that
> chunk size from 8MB to 32MB?
So, I gathered some numbers on a Zen4/Genoa system. The ones above are
from Zen3/Milan.
region-sz=64GB, loop-count=3 (total region-size=3*64GB):
Bandwidth L1-dcache-loads
pg-sz=2MB, batch-sz= 8MB 25.10 GB/s 6,745,859,855 # 2.00 L1-dcache-loads/64B
# pg-sz=2MB for context
pg-sz=1GB, batch-sz= 8MB 26.88 GB/s 6,469,900,728 # 2.00 L1-dcache-loads/64B
pg-sz=1GB, batch-sz=32MB 38.69 GB/s 2,559,249,546 # 0.79 L1-dcache-loads/64B
pg-sz=1GB, batch-sz=64MB 46.91 GB/s 919,539,544 # 0.28 L1-dcache-loads/64B
pg-sz=1GB, batch-sz= 1GB 58.68 GB/s 79,458,439 # 0.024 L1-dcache-loads/64B
All of these are for preempt=none, and with boost=0. (With boost=1 the
BW increases by ~25%.)
So, I wasn't quite right about the LLC-size=32MB being the threshold for
this optimization. There is a change in behaviour at that point but it
does improve beyond that.
(Ideally this threshold would be a processor MSR. That way we could
use this for 2MB pages as well. Oh well.)
> At 42GB/sec, 32MB will take less than a
> millisecond, yes? I'm not aware of us really having any latency
> targets in these preemption modes, but 1 millisecond sounds pretty
> good.
Agreed. The only complaint threshold I see is 100ms (default value of
sysctl_resched_latency_warn_ms) which is pretty far from ~1ms.
And having a threshold of 32MB might benefit other applications since
we won't be discarding their cachelines in favour of filling up the
cache with zeroes.
I think the only problem cases might be slow uarchs and workloads where
the memory bus is saturated which might dilate the preemption latency.
And, even if the operation takes say ~20ms, that should still leave us
with a reasonably large margin.
(And, any latency sensitive users are probably not running with
preempt=none/voluntary.)
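
Rough worked numbers if the batch were raised to 32MB (bandwidths
illustrative, the 42 GB/s being roughly the contiguous-clearing rate
above):

    32MB / 42 GB/s ~= 0.8 ms
    32MB / 10 GB/s ~=   3 ms
    32MB /  1 GB/s ~=  30 ms    (saturated bus / slow uarch)

All comfortably below the 100ms warning threshold.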
--
ankur
On Wed, 17 Dec 2025 00:48:50 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:

> > At 42GB/sec, 32MB will take less than a
> > millisecond, yes?  I'm not aware of us really having any latency
> > targets in these preemption modes, but 1 millisecond sounds pretty
> > good.
>
> Agreed. The only complaint threshold I see is 100ms (default value of
> sysctl_resched_latency_warn_ms) which is pretty far from ~1ms.
>
> And having a threshold of 32MB might benefit other applications since
> we won't be discarding their cachelines in favour of filling up the
> cache with zeroes.
>
> I think the only problem cases might be slow uarchs and workloads where
> the memory bus is saturated which might dilate the preemption latency.
>
> And, even if the operation takes say ~20ms, that should still leave us
> with a reasonably large margin.
> (And, any latency sensitive users are probably not running with
> preempt=none/voluntary.)

So I think you're saying that yes, we should increase the chunk size?

If so, what's the timing on that?  It would be nice to do it in the
current -rc cycle for testing reasons and so the changelogs can be
updated to reflect the altered performance numbers.
Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed, 17 Dec 2025 00:48:50 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> > At 42GB/sec, 32MB will take less than a
>> > millisecond, yes?  I'm not aware of us really having any latency
>> > targets in these preemption modes, but 1 millisecond sounds pretty
>> > good.
>>
>> Agreed. The only complaint threshold I see is 100ms (default value of
>> sysctl_resched_latency_warn_ms) which is pretty far from ~1ms.
>>
>> And having a threshold of 32MB might benefit other applications since
>> we won't be discarding their cachelines in favour of filling up the
>> cache with zeroes.
>>
>> I think the only problem cases might be slow uarchs and workloads where
>> the memory bus is saturated which might dilate the preemption latency.
>>
>> And, even if the operation takes say ~20ms, that should still leave us
>> with a reasonably large margin.
>> (And, any latency sensitive users are probably not running with
>> preempt=none/voluntary.)
>
> So I think you're saying that yes, we should increase the chunk size?

Yeah.

> If so, what's the timing on that?  It would be nice to do it in the
> current -rc cycle for testing reasons and so the changelogs can be
> updated to reflect the altered performance numbers.

I can send out an updated version of this patch later today. I think the
only real change is updating the constant and perf stats motivating
the chunk size value of 32MB.

Anything else you also think needs doing for this?

--
ankur
On Wed, 17 Dec 2025 11:51:43 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:

> > If so, what's the timing on that?  It would be nice to do it in the
> > current -rc cycle for testing reasons and so the changelogs can be
> > updated to reflect the altered performance numbers.
>
> I can send out an updated version of this patch later today. I think the
> only real change is updating the constant and perf stats motivating
> the chunk size value of 32MB.

Yep.  A tiny change wouldn't normally require a full resend, but fairly
widespread changelog updates would best be handled with a v11, please.

> Anything else you also think needs doing for this?

Nope.  Just lots of review, as always ;)

What's the story with architectures other than x86, btw?
[ Added Kristina and Catalin for the FEAT_MOPS question. ]

Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed, 17 Dec 2025 11:51:43 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> > If so, what's the timing on that?  It would be nice to do it in the
>> > current -rc cycle for testing reasons and so the changelogs can be
>> > updated to reflect the altered performance numbers.
>>
>> I can send out an updated version of this patch later today. I think the
>> only real change is updating the constant and perf stats motivating
>> the chunk size value of 32MB.
>
> Yep.  A tiny change wouldn't normally require a full resend, but fairly
> widespread changelog updates would best be handled with a v11, please.

True, it will need updates to patches 7 and 8. Will send out a v11 after
rerunning tests for both of those. Might take a day or two but should be
able to send it out this week.

>> Anything else you also think needs doing for this?
>
> Nope.  Just lots of review, as always ;)
>
> What's the story with architectures other than x86, btw?

The only other architecture I know of which has a similar range primitive
is arm64 (with FEAT_MOPS). That should be extendable to larger page sizes.
Don't have any numbers on it though. It's only available after arm64 v8.7
which I should have access to next year.
(But maybe Kristina or Catalin have tried out clearing large ranges with
MOPS?)

Other than that, the only one I know of is powerpc which already uses a
primitive to zero a cacheline (DCBZ). Which seems quite similar to CLZERO
on AMD Zen systems (though CLZERO does uncached weakly ordered writes so
needs a store barrier at the end).

Just from googling, powerpc's implementation seems to be pretty optimal
already so probably wouldn't gain much from larger chunk sizes and removal
of the cond_resched().

But, CLZERO performs on par (or better) than this "REP; STOS"
implementation especially for smaller extents. So maybe in the future we
could use it to improve the 2MB performance for AMD Zen. IMO the fiddly
part might be in deciding when the cost of not-caching is higher than the
speedup from not-caching.

--
ankur