[PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Posted by Ankur Arora 8 months, 1 week ago
This series adds multi-page clearing for hugepages. It is a rework
of [1] which took a detour through PREEMPT_LAZY [2].

Why multi-page clearing? It improves upon the current page-at-a-time
approach by providing the processor with a hint about the real size of
the region being cleared. A processor could use this hint to, for
instance, elide cacheline allocation when clearing a large region.

AMD Zen, in particular, applies this optimization for REP; STOS: regions
larger than the L3 size are cleared with non-temporal stores.

This results in significantly better performance. 

We also see a performance improvement in cases where this optimization
is unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel): REP; STOS
is typically microcoded, so its overhead can now be amortized over
larger regions, and the size hint allows the hardware prefetcher to do a
better job.
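
To make this concrete, here is a rough sketch contrasting the two
approaches. This is illustrative only and not the code in this series;
clear_pages_hint() and clear_pages_loop() are made-up names:

	/*
	 * Sketch: a single REP; STOS over the whole extent exposes the real
	 * region size to the CPU, which can then (e.g. on AMD Zen) switch to
	 * non-temporal stores for regions larger than the L3.
	 */
	static inline void clear_pages_hint(void *addr, unsigned int npages)
	{
		unsigned long len = (unsigned long)npages << PAGE_SHIFT;

		asm volatile("rep stosb"
			     : "+D" (addr), "+c" (len)
			     : "a" (0)
			     : "memory");
	}

	/* Sketch: the current page-at-a-time approach. */
	static inline void clear_pages_loop(void *addr, unsigned int npages)
	{
		unsigned int i;

		for (i = 0; i < npages; i++)
			clear_page(addr + i * PAGE_SIZE);
	}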

Milan (EPYC 7J13, boost=0, preempt=full|lazy):

                 mm/folio_zero_user    x86/folio_zero_user     change
                  (GB/s  +- stddev)      (GB/s  +- stddev)

  pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
  pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%

Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):

                 mm/folio_zero_user    x86/folio_zero_user     change
                  (GB/s +- stddev)      (GB/s +- stddev)

  pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
  pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%

Interaction with preemption: as discussed in [3], zeroing large
regions with string instructions doesn't work well with cooperative
preemption models, which need regular invocations of cond_resched(). So
this optimization is limited to preemptible models (full, lazy).

This is done by overriding __folio_zero_user() -- which does the usual
page-at-a-time zeroing -- with an architecture-optimized version, but
only when running under a preemptible model.
As such, this ties an architecture-specific optimization rather closely
to preemption. That should be easy enough to change, but it seemed like
the simplest approach.
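
Roughly, the wiring looks like the sketch below. This is illustrative
only: the helper names, the clear_pages(addr, npages) signature, and the
exact preemption check are assumptions, not the code in the patches.

	/* Sketch of the generic page-at-a-time zeroing in mm/memory.c. */
	static void folio_zero_user_pages(struct folio *folio, unsigned long addr_hint)
	{
		unsigned long i, nr = folio_nr_pages(folio);

		for (i = 0; i < nr; i++) {
			cond_resched();
			/* simplified: the real code computes a per-page vaddr */
			clear_user_highpage(folio_page(folio, i), addr_hint);
		}
	}

	/*
	 * Sketch of the arch override: use the multi-page clear only under
	 * preemptible models, otherwise fall back to the per-page loop so
	 * that cond_resched() keeps getting called.
	 */
	void __folio_zero_user(struct folio *folio, unsigned long addr_hint)
	{
		if (!preempt_model_preemptible()) {
			folio_zero_user_pages(folio, addr_hint);
			return;
		}

		clear_pages(folio_address(folio), folio_nr_pages(folio));
	}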

Comments appreciated!

Also at:
  github.com/terminus/linux clear-pages-preempt.v1


[1] https://lore.kernel.org/lkml/20230830184958.2333078-1-ankur.a.arora@oracle.com/
[2] https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
[3] https://lore.kernel.org/lkml/CAHk-=wj9En-BC4t7J9xFZOws5ShwaR9yor7FxHZr8CTVyEP_+Q@mail.gmail.com/

Ankur Arora (4):
  x86/clear_page: extend clear_page*() for multi-page clearing
  x86/clear_page: add clear_pages()
  huge_page: allow arch override for folio_zero_user()
  x86/folio_zero_user: multi-page clearing

 arch/x86/include/asm/page_32.h |  6 ++++
 arch/x86/include/asm/page_64.h | 27 +++++++++------
 arch/x86/lib/clear_page_64.S   | 52 +++++++++++++++++++++--------
 arch/x86/mm/Makefile           |  1 +
 arch/x86/mm/memory.c           | 60 ++++++++++++++++++++++++++++++++++
 include/linux/mm.h             |  1 +
 mm/memory.c                    | 38 ++++++++++++++++++---
 7 files changed, 156 insertions(+), 29 deletions(-)
 create mode 100644 arch/x86/mm/memory.c

-- 
2.31.1
Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Posted by Raghavendra K T 8 months ago

On 4/14/2025 9:16 AM, Ankur Arora wrote:
> This series adds multi-page clearing for hugepages. It is a rework
> of [1] which took a detour through PREEMPT_LAZY [2].
> 
> Why multi-page clearing?: multi-page clearing improves upon the
> current page-at-a-time approach by providing the processor with a
> hint as to the real region size. A processor could use this hint to,
> for instance, elide cacheline allocation when clearing a large
> region.
> 
> This optimization in particular is done by REP; STOS on AMD Zen
> where regions larger than L3-size use non-temporal stores.
> 
> This results in significantly better performance.
> 
> We also see performance improvement for cases where this optimization is
> unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
> REP; STOS is typically microcoded which can now be amortized over
> larger regions and the hint allows the hardware prefetcher to do a
> better job.
> 
> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
> 
>                   mm/folio_zero_user    x86/folio_zero_user     change
>                    (GB/s  +- stddev)      (GB/s  +- stddev)
> 
>    pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>    pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
> 
> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
> 
>                   mm/folio_zero_user    x86/folio_zero_user     change
>                    (GB/s +- stddev)      (GB/s +- stddev)
> 
>    pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>    pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
> 
[...]

Hello Ankur,

Thank you for the patches. I was able to test briefly with lazy preempt
mode.

(I understand there could be a lot of churn based on Ingo's, Mateusz's,
and others' comments.)
But here it goes:

SUT: AMD EPYC 9B24 (Genoa) preempt=lazy

metric = time taken in sec (lower is better). total SIZE=64GB
                  mm/folio_zero_user    x86/folio_zero_user     change (%)
   pg-sz=1GB       2.47044  +-  0.38%    1.060877  +-  0.07%    57.06
   pg-sz=2MB       5.098403 +-  0.01%    2.52015   +-  0.36%    50.57

More details (1G example run):

base kernel    =   6.14 (preempt = lazy)

mm/folio_zero_user
Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

          2,476.47 msec task-clock                       #    1.002 CPUs utilized            ( +-  0.39% )
                 5      context-switches                 #    2.025 /sec                     ( +- 29.70% )
                 2      cpu-migrations                   #    0.810 /sec                     ( +- 21.15% )
               202      page-faults                      #   81.806 /sec                     ( +-  0.18% )
     7,348,664,233      cycles                           #    2.976 GHz                      ( +-  0.38% )  (38.39%)
       878,805,326      stalled-cycles-frontend          #   11.99% frontend cycles idle     ( +-  0.74% )  (38.43%)
       339,023,729      instructions                     #    0.05  insn per cycle
                                                         #    2.53  stalled cycles per insn  ( +-  0.08% )  (38.47%)
        88,579,915      branches                         #   35.873 M/sec                    ( +-  0.06% )  (38.51%)
        17,369,776      branch-misses                    #   19.55% of all branches          ( +-  0.04% )  (38.55%)
     2,261,339,695      L1-dcache-loads                  #  915.795 M/sec                    ( +-  0.06% )  (38.56%)
     1,073,880,164      L1-dcache-load-misses            #   47.48% of all L1-dcache accesses  ( +-  0.05% )  (38.56%)
       511,231,988      L1-icache-loads                  #  207.038 M/sec                    ( +-  0.25% )  (38.52%)
           128,533      L1-icache-load-misses            #    0.02% of all L1-icache accesses  ( +-  0.40% )  (38.48%)
            38,134      dTLB-loads                       #   15.443 K/sec                    ( +-  4.22% )  (38.44%)
            33,992      dTLB-load-misses                 #  114.39% of all dTLB cache accesses  ( +-  9.42% )  (38.40%)
               156      iTLB-loads                       #   63.177 /sec                     ( +- 13.34% )  (38.36%)
               156      iTLB-load-misses                 #  102.50% of all iTLB cache accesses  ( +- 25.98% )  (38.36%)

            2.47044 +- 0.00949 seconds time elapsed  ( +-  0.38% )

x86/folio_zero_user
          1,056.72 msec task-clock                       #    0.996 CPUs utilized            ( +-  0.07% )
                10      context-switches                 #    9.436 /sec                     ( +-  3.59% )
                 3      cpu-migrations                   #    2.831 /sec                     ( +- 11.33% )
               200      page-faults                      #  188.718 /sec                     ( +-  0.15% )
     3,146,571,264      cycles                           #    2.969 GHz                      ( +-  0.07% )  (38.35%)
        17,226,261      stalled-cycles-frontend          #    0.55% frontend cycles idle     ( +-  4.12% )  (38.44%)
        14,130,553      instructions                     #    0.00  insn per cycle
                                                         #    1.39  stalled cycles per insn  ( +-  1.59% )  (38.53%)
         3,578,614      branches                         #    3.377 M/sec                    ( +-  1.54% )  (38.62%)
           415,807      branch-misses                    #   12.45% of all branches          ( +-  1.17% )  (38.62%)
        22,208,699      L1-dcache-loads                  #   20.956 M/sec                    ( +-  5.27% )  (38.60%)
         7,312,684      L1-dcache-load-misses            #   27.79% of all L1-dcache accesses  ( +-  8.46% )  (38.51%)
         4,032,315      L1-icache-loads                  #    3.805 M/sec                    ( +-  1.29% )  (38.48%)
            15,094      L1-icache-load-misses            #    0.38% of all L1-icache accesses  ( +-  1.14% )  (38.39%)
            14,365      dTLB-loads                       #   13.555 K/sec                    ( +-  7.23% )  (38.38%)
             9,477      dTLB-load-misses                 #   65.36% of all dTLB cache accesses  ( +- 12.05% )  (38.38%)
                18      iTLB-loads                       #   16.985 /sec                     ( +- 34.84% )  (38.38%)
                67      iTLB-load-misses                 #  158.39% of all iTLB cache accesses  ( +- 48.32% )  (38.32%)

           1.060877 +- 0.000766 seconds time elapsed  ( +-  0.07% )

Thanks and Regards
- Raghu
Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Posted by Ankur Arora 8 months ago
Raghavendra K T <raghavendra.kt@amd.com> writes:

> On 4/14/2025 9:16 AM, Ankur Arora wrote:
>> This series adds multi-page clearing for hugepages. It is a rework
>> of [1] which took a detour through PREEMPT_LAZY [2].
>> Why multi-page clearing?: multi-page clearing improves upon the
>> current page-at-a-time approach by providing the processor with a
>> hint as to the real region size. A processor could use this hint to,
>> for instance, elide cacheline allocation when clearing a large
>> region.
>> This optimization in particular is done by REP; STOS on AMD Zen
>> where regions larger than L3-size use non-temporal stores.
>> This results in significantly better performance.
>> We also see performance improvement for cases where this optimization is
>> unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
>> REP; STOS is typically microcoded which can now be amortized over
>> larger regions and the hint allows the hardware prefetcher to do a
>> better job.
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>                   mm/folio_zero_user    x86/folio_zero_user     change
>>                    (GB/s  +- stddev)      (GB/s  +- stddev)
>>    pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>>    pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>                   mm/folio_zero_user    x86/folio_zero_user     change
>>                    (GB/s +- stddev)      (GB/s +- stddev)
>>    pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>>    pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
>>
> [...]
>
> Hello Ankur,
>
> Thank you for the patches. Was able to test briefly w/ lazy preempt
> mode.

Thanks for testing.

> (I do understand that, there could be lot of churn based on Ingo,
> Mateusz and others' comments)
> But here it goes:
>
> SUT: AMD EPYC 9B24 (Genoa) preempt=lazy
>
> metric = time taken in sec (lower is better). total SIZE=64GB
>                  mm/folio_zero_user    x86/folio_zero_user     change
>   pg-sz=1GB       2.47044  +-  0.38%    1.060877  +-  0.07%    57.06
>   pg-sz=2MB       5.098403 +-  0.01%    2.52015   +-  0.36%    50.57


Just to translate it into the same units I was using above:

                  mm/folio_zero_user        x86/folio_zero_user
   pg-sz=1GB       25.91 GBps +-  0.38%    60.37 GBps +-  0.07%
   pg-sz=2MB       12.57 GBps +-  0.01%    25.39 GBps +-  0.36%

That's a decent improvement over Milan. Btw, are you using boost=1?

Also, any idea why the huge delta in the mm/folio_zero_user 2MB, 1GB
cases? Both of these are doing 4k page at a time, so the huge delta
is a little head scratching.

There's a gap on Milan as well but it is much smaller.

Thanks
Ankur

[...]


--
ankur
Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Posted by Raghavendra K T 8 months ago

On 4/23/2025 12:52 AM, Ankur Arora wrote:
> 
> Raghavendra K T <raghavendra.kt@amd.com> writes:
[...]
>>
>> SUT: AMD EPYC 9B24 (Genoa) preempt=lazy
>>
>> metric = time taken in sec (lower is better). total SIZE=64GB
>>                   mm/folio_zero_user    x86/folio_zero_user     change
>>    pg-sz=1GB       2.47044  +-  0.38%    1.060877  +-  0.07%    57.06
>>    pg-sz=2MB       5.098403 +-  0.01%    2.52015   +-  0.36%    50.57
> 
> 
> Just to translate it into the same units I was using above:
> 
>                    mm/folio_zero_user        x86/folio_zero_user
>     pg-sz=1GB       25.91 GBps +-  0.38%    60.37 GBps +-  0.07%
>     pg-sz=2MB       12.57 GBps +-  0.01%    25.39 GBps +-  0.36%
> 
> That's a decent improvement over Milan. Btw, are you using boost=1?
> 

Yes, boost=1.

> Also, any idea why the huge delta in the mm/folio_zero_user 2MB, 1GB
> cases? Both of these are doing 4k page at a time, so the huge delta
> is a little head scratching.
> 
> There's a gap on Milan as well but it is much smaller.
> 

I need to think/analyze further, but from perf stat I see a glaring
difference in:
			2M		1G
pagefaults		32,906		202
iTLB-load-misses	44,490		156

- Raghu
Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Posted by Raghavendra K T 8 months ago
On 4/23/2025 1:42 PM, Raghavendra K T wrote:
[...]

>> Also, any idea why the huge delta in the mm/folio_zero_user 2MB, 1GB
>> cases? Both of these are doing 4k page at a time, so the huge delta
>> is a little head scratching.
>>
>> There's a gap on Milan as well but it is much smaller.
>>
> 

Sorry, I forgot to add a perhaps even more relevant/important data point:

> Need to think/analyze further, but from perf stat I see
> glaring difference in:
>              2M        1G
> pagefaults        32,906        202
> iTLB-load-misses    44,490        156

dTLB-loads		1,000,169	38,134
dTLB-load-misses	402,641		33,992

- Raghu

Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Posted by Zi Yan 8 months, 1 week ago
On 13 Apr 2025, at 23:46, Ankur Arora wrote:

> This series adds multi-page clearing for hugepages. It is a rework
> of [1] which took a detour through PREEMPT_LAZY [2].
>
> Why multi-page clearing?: multi-page clearing improves upon the
> current page-at-a-time approach by providing the processor with a
> hint as to the real region size. A processor could use this hint to,
> for instance, elide cacheline allocation when clearing a large
> region.
>
> This optimization in particular is done by REP; STOS on AMD Zen
> where regions larger than L3-size use non-temporal stores.
>
> This results in significantly better performance.

Do you have init_on_alloc=1 in your kernel?
With that, pages coming from the buddy allocator are zeroed
in post_alloc_hook() by kernel_init_pages(), which is a for loop
of clear_highpage_kasan_tagged(), a wrapper around clear_page().
And folio_zero_user() is not used.

At least Debian, Fedora, and Ubuntu by default have
CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y, which means init_on_alloc=1.

Maybe kernel_init_pages() should get your optimization as well,
unless you only target hugetlb pages.
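
For reference, that path has roughly the shape below (paraphrased, not
the exact code, which varies by kernel version), so it only ever sees
one 4K page at a time:

	/* Paraphrased sketch of the init_on_alloc zeroing path. */
	static void kernel_init_pages(struct page *page, int numpages)
	{
		int i;

		/* per-page clears: no hint about the full extent */
		for (i = 0; i < numpages; i++)
			clear_highpage_kasan_tagged(page + i);
	}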

Best Regards,
Yan, Zi
Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Posted by Ankur Arora 8 months ago
Zi Yan <ziy@nvidia.com> writes:

> On 13 Apr 2025, at 23:46, Ankur Arora wrote:
>
>> This series adds multi-page clearing for hugepages. It is a rework
>> of [1] which took a detour through PREEMPT_LAZY [2].
>>
>> Why multi-page clearing?: multi-page clearing improves upon the
>> current page-at-a-time approach by providing the processor with a
>> hint as to the real region size. A processor could use this hint to,
>> for instance, elide cacheline allocation when clearing a large
>> region.
>>
>> This optimization in particular is done by REP; STOS on AMD Zen
>> where regions larger than L3-size use non-temporal stores.
>>
>> This results in significantly better performance.
>
> Do you have init_on_alloc=1 in your kernel?
> With that, pages coming from buddy allocator are zeroed
> in post_alloc_hook() by kernel_init_pages(), which is a for loop
> of clear_highpage_kasan_tagged(), a wrap of clear_page().
> And folio_zero_user() is not used.
>
> At least Debian, Fedora, and Ubuntu by default have
> CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y, which means init_on_alloc=1.
>
> Maybe kernel_init_pages() should get your optimization as well,
> unless you only target hugetlb pages.

Thanks for the suggestion. I do plan to look for other places where
we could be zeroing contiguous regions.

Often the problem is that even if the underlying region is contiguous,
it can't be treated as such because of CONFIG_HIGHMEM. For instance,
clear_highpage_kasan_tagged() does a kmap_local_page()/kunmap_local()
around the clearing. This breaks the contiguous region into multiple 4K
pages even when CONFIG_HIGHMEM is not defined.

I wonder if we need a clear_highpages() kind of abstraction which lets
HIGHMEM and non-HIGHMEM go their separate ways.
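
Something like the sketch below, say. This is purely hypothetical:
clear_highpages() doesn't exist today, and the !HIGHMEM branch assumes a
clear_pages(addr, npages) style primitive:

	#ifdef CONFIG_HIGHMEM
	/* pages may not have a contiguous kernel mapping: map one at a time */
	static inline void clear_highpages(struct page *page, unsigned int npages)
	{
		unsigned int i;

		for (i = 0; i < npages; i++)
			clear_highpage(page + i);
	}
	#else
	/* directly mapped: expose the whole extent to the CPU */
	static inline void clear_highpages(struct page *page, unsigned int npages)
	{
		clear_pages(page_address(page), npages);
	}
	#endif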

--
ankur
Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Posted by Ingo Molnar 8 months, 1 week ago
* Ankur Arora <ankur.a.arora@oracle.com> wrote:

> We also see performance improvement for cases where this optimization is
> unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
> REP; STOS is typically microcoded which can now be amortized over
> larger regions and the hint allows the hardware prefetcher to do a
> better job.
> 
> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
> 
>                  mm/folio_zero_user    x86/folio_zero_user     change
>                   (GB/s  +- stddev)      (GB/s  +- stddev)
> 
>   pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>   pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
> 
> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
> 
>                  mm/folio_zero_user    x86/folio_zero_user     change
>                   (GB/s +- stddev)      (GB/s +- stddev)
> 
>   pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>   pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%

How was this measured? Could you integrate this measurement as a new 
tools/perf/bench/ subcommand so that people can try it on different 
systems, etc.? There's already a 'perf bench mem' subcommand space 
where this feature could be added.

Thanks,

	Ingo
Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Posted by Ankur Arora 8 months, 1 week ago
Ingo Molnar <mingo@kernel.org> writes:

> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> We also see performance improvement for cases where this optimization is
>> unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
>> REP; STOS is typically microcoded which can now be amortized over
>> larger regions and the hint allows the hardware prefetcher to do a
>> better job.
>>
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>
>>                  mm/folio_zero_user    x86/folio_zero_user     change
>>                   (GB/s  +- stddev)      (GB/s  +- stddev)
>>
>>   pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>>   pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
>>
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>
>>                  mm/folio_zero_user    x86/folio_zero_user     change
>>                   (GB/s +- stddev)      (GB/s +- stddev)
>>
>>   pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>>   pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
>
> How was this measured? Could you integrate this measurement as a new
> tools/perf/bench/ subcommand so that people can try it on different
> systems, etc.? There's already a 'perf bench mem' subcommand space
> where this feature could be added to.

This was a trivial standalone mmap workload, similar to what qemu does
when creating a VM; really, any hugetlb mmap() would do.

x86-64-stosq (lib/memset_64.S::__memset) should have the same performance
characteristics, but the existing benchmark allocates via malloc().

For this workload we want to control the allocation path as well. Let me
see if it makes sense to extend perf bench mem memset to optionally allocate
via mmap(MAP_HUGETLB) or add a new workload under perf bench mem which
does that.
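
For reference, the measurement was with something like the sketch below
(illustrative only: size, flags, and the hugepage setup are assumptions).
It just times the first-touch faults, which is where folio_zero_user()
runs:

	#include <stdio.h>
	#include <sys/mman.h>
	#include <time.h>

	#ifndef MAP_HUGE_SHIFT
	#define MAP_HUGE_SHIFT	26
	#endif
	#ifndef MAP_HUGE_1GB
	#define MAP_HUGE_1GB	(30 << MAP_HUGE_SHIFT)
	#endif

	#define SIZE	(64UL << 30)	/* assumes 64 x 1GB hugetlb pages reserved */

	int main(void)
	{
		struct timespec t0, t1;
		unsigned long off;
		double secs;
		char *p;

		p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
			 -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		clock_gettime(CLOCK_MONOTONIC, &t0);
		/* first touch faults in (and zeroes) each 1GB page */
		for (off = 0; off < SIZE; off += 1UL << 30)
			p[off] = 1;
		clock_gettime(CLOCK_MONOTONIC, &t1);

		secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
		printf("%.2f GB/s\n", (SIZE >> 30) / secs);
		return 0;
	}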

Thanks for the review!

--
ankur
Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Posted by Ingo Molnar 8 months, 1 week ago
* Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Ankur Arora (4):
>   x86/clear_page: extend clear_page*() for multi-page clearing
>   x86/clear_page: add clear_pages()
>   huge_page: allow arch override for folio_zero_user()
>   x86/folio_zero_user: multi-page clearing

These are not how x86 commit titles should look. Please take a
look at the titles of previous commits to the x86 files you are 
modifying and follow that style. (Capitalization, use of verbs, etc.)

Thanks,

	Ingo
Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Posted by Ankur Arora 8 months, 1 week ago
Ingo Molnar <mingo@kernel.org> writes:

> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Ankur Arora (4):
>>   x86/clear_page: extend clear_page*() for multi-page clearing
>>   x86/clear_page: add clear_pages()
>>   huge_page: allow arch override for folio_zero_user()
>>   x86/folio_zero_user: multi-page clearing
>
> These are not how x86 commit titles should look like. Please take a
> look at the titles of previous commits to the x86 files you are
> modifying and follow that style. (Capitalization, use of verbs, etc.)

Ack. Will fix.

--
ankur