When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
MADV_FREE with multiple address ranges, tlb flushes happen for each of
the given address ranges. Because such tlb flushes are for the same
process, doing those in a batch is more efficient while still being
safe. Modify the process_madvise() entry-level code path to do such
batched tlb flushes, while the internal unmap logic does only gathering
of the tlb entries to flush.
In more detail, modify the entry functions to initialize an mmu_gather
object and pass it to the internal logic. Then make the internal logic
do only gathering of the tlb entries to flush into the received
mmu_gather object. After all internal function calls are done, the
entry functions flush the gathered tlb entries at once.
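
For illustration, the intended pattern at the entry level looks roughly
like below. This is a simplified sketch rather than the exact patch
content: in the real patches the arguments are carried via a struct,
and error handling is omitted here.

	struct mmu_gather tlb;
	size_t i;

	tlb_gather_mmu(&tlb, mm);
	for (i = 0; i < vlen; i++)
		/*
		 * sketch: only gathers the tlb entries to flush into
		 * 'tlb'; no flush happens here
		 */
		madvise_do_behavior(mm, start[i], len[i], behavior,
				    &tlb);
	/* flush all the gathered tlb entries at once */
	tlb_finish_mmu(&tlb);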
Because process_madvise() and madvise() share the internal unmap logic,
make the same change to the madvise() entry code as well, to keep the
code consistent and cleaner. It is only for keeping the code clean, and
shouldn't degrade madvise(). It could rather provide a potential tlb
flush reduction benefit for the case where there are multiple vmas for
the given address range. It is only a side effect of an effort to keep
the code clean, so we don't measure it separately.
Similar optimizations might be applicable to other madvise behaviors
such as MADV_COLD and MADV_PAGEOUT. Those are simply out of the scope
of this patch series, though.
Patch Sequence
==============
The first patch defines a new data structure for managing the
information that is required for batched tlb flushes (mmu_gather and
behavior), and updates the internal logic code paths for
MADV_DONTNEED[_LOCKED] and MADV_FREE handling to receive it.
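
For reference, the structure might look roughly like below. This is a
sketch; the field set in the actual patch may differ.

	struct madvise_behavior {
		int behavior;		/* the madvise hint being applied */
		struct mmu_gather *tlb;	/* sketch; actual fields may differ */
	};

Passing a single struct lets later changes extend the batching context
without touching every internal function signature again.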
The second patch batches tlb flushes for MADV_FREE handling, for both
madvise() and process_madvise().
The remaining two patches are for MADV_DONTNEED[_LOCKED] tlb flush
batching. The third patch splits zap_page_range_single() for batching
of the MADV_DONTNEED[_LOCKED] handling. The final and fourth patch
batches tlb flushes for the hint using the sub-logic that the third
patch split out, and the helpers for batched tlb flushes that the
second patch introduced for the MADV_FREE case.
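
Conceptually, the split turns zap_page_range_single() into a thin
wrapper, so that callers managing their own mmu_gather can call the
split-out part directly. A rough sketch (not the exact patch content;
the exact signature of the split-out helper is assumed here, and
hugetlb locking details are omitted):

	void zap_page_range_single(struct vm_area_struct *vma,
			unsigned long address, unsigned long size,
			struct zap_details *details)
	{
		struct mmu_gather tlb;

		tlb_gather_mmu(&tlb, vma->vm_mm);
		/*
		 * split-out part (per the changelog, unmap_vma_single()):
		 * gathers tlb entries, does not flush
		 */
		unmap_vma_single(&tlb, vma, address, size, details);
		tlb_finish_mmu(&tlb);
	}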
Test Results
============
I measured the latency of applying MADV_DONTNEED advice to 256 MiB of
memory using multiple process_madvise() calls. I apply the advice in
4 KiB sized regions, but with varying batch sizes per process_madvise()
call (vlen) from 1 to 1024. The source code for the measurement is
available at GitHub[1]. To reduce measurement errors, I repeated the
measurement five times.
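
The per-call pattern of the measurement is roughly like the sketch
below (simplified from the benchmark source; pidfd setup and error
handling are omitted, and the helper name is made up for illustration):

	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <sys/uio.h>
	#include <unistd.h>

	/* apply MADV_DONTNEED to 'vlen' 4 KiB regions in one call */
	static void advise_batch(int pidfd, char *base, size_t vlen)
	{
		struct iovec iov[1024];
		size_t i;

		for (i = 0; i < vlen; i++) {
			iov[i].iov_base = base + i * 4096;
			iov[i].iov_len = 4096;
		}
		syscall(SYS_process_madvise, pidfd, iov, vlen,
			MADV_DONTNEED, 0);
	}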
The measurement results are as below. The 'sz_batch' column shows the
batch size of the process_madvise() calls. The 'Before' and 'After'
columns show the average of the latencies in nanoseconds, measured five
times, on kernels built without and with the tlb flush batching of this
series (patches 3 and 4), respectively. For the baseline, the mm-new
tree of 2025-04-04[2] has been used. The 'B-stdev' and 'A-stdev'
columns show the ratio of the standard deviation of the latency
measurements to their average, in percent, for 'Before' and 'After',
respectively. 'Latency_reduction' shows the reduction of the latency
that 'After' achieved compared to 'Before', in percent. Higher
'Latency_reduction' values mean larger efficiency improvements.
sz_batch Before B-stdev After A-stdev Latency_reduction
1 110948138.2 5.55 109476402.8 4.28 1.33
2 75678535.6 1.67 70470722.2 3.55 6.88
4 59530647.6 4.77 51735606.6 3.44 13.09
8 50013051.6 4.39 44377029.8 5.20 11.27
16 48657878.2 9.32 37291600.4 3.39 23.36
32 43614180.2 6.06 34127428 3.75 21.75
64 42466694.2 5.70 26737935.2 2.54 37.04
128 42977970 6.99 25050444.2 4.06 41.71
256 41549546 1.88 24644375.8 3.77 40.69
512 42162698.6 6.17 24068224.8 2.87 42.92
1024 40978574 5.44 23872024.2 3.65 41.75
As expected, tlb flush batching provides a latency reduction that is
roughly proportional to the batch size. The efficiency gain ranges
from about 6.88 percent with batch size 2 to about 40 percent with
batch size 128.
Please note that this is a very simple microbenchmark, so the real
efficiency gain on real workloads could be very different.
Changelog
=========
Changes from v1
(https://lore.kernel.org/20250310172318.653630-1-sj@kernel.org)
- Split code cleanup part out
- Keep the order between tlb flushes and hugetlb_zap_end()
- Put mm/memory change just before its real usage
- Add VM_WARN_ON_ONCE() for invalid tlb argument to unmap_vma_single()
- Cleanups following nice reviewers' suggestions
Changes from RFC
(https://lore.kernel.org/20250305181611.54484-1-sj@kernel.org)
- Clarify the motivation of the change in the cover letter
- Add average and stdev of evaluation results
- Show latency reduction on evaluation results
- Fix !CONFIG_MEMORY_FAILURE build error
- Rename is_memory_populate() to is_madvise_populate()
- Squash patches 5-8
- Add kerneldoc for unmap_vm_area_struct()
- Squash patches 10 and 11
- Squash patches 12-14
- Squash patches 15 and 16
References
==========
[1] https://github.com/sjp38/eval_proc_madvise
[2] commit edd67244fe67 ("mm/show_mem: optimize si_meminfo_node by reducing redundant code") # mm-new
SeongJae Park (4):
mm/madvise: define and use madvise_behavior struct for
madvise_do_behavior()
mm/madvise: batch tlb flushes for MADV_FREE
mm/memory: split non-tlb flushing part from zap_page_range_single()
mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED]
mm/internal.h | 3 ++
mm/madvise.c | 110 ++++++++++++++++++++++++++++++++++++--------------
mm/memory.c | 47 ++++++++++++++++-----
3 files changed, 121 insertions(+), 39 deletions(-)
base-commit: 85b87628fae973dedae95f2ea2782b7df4c11322
--
2.39.5
On Fri, Apr 04, 2025 at 02:06:56PM -0700, SeongJae Park wrote:
> When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
> MADV_FREE with multiple address ranges, tlb flushes happen for each of
> the given address ranges. Because such tlb flushes are for same
Nit: for _the_ same.
> process, doing those in a batch is more efficient while still being
> safe. Modify process_madvise() entry level code path to do such batched
> tlb flushes, while the internal unmap logics do only gathering of the
> tlb entries to flush.
>
> In more detail, modify the entry functions to initialize an mmu_gather
> ojbect and pass it to the internal logics. And make the internal logics
Typo: ojbect -> object, logics -> logic.
> do only gathering of the tlb entries to flush into the received
> mmu_gather object. After all internal function calls are done, the
> entry functions flush the gathered tlb entries at once.
>
> Because process_madvise() and madvise() share the internal unmap logic,
> make same change to madvise() entry code together, to make code
> consistent and cleaner. It is only for keeping the code clean, and
> shouldn't degrade madvise(). It could rather provide a potential tlb
> flushes reduction benefit for a case that there are multiple vmas for
> the given address range. It is only a side effect from an effort to
> keep code clean, so we don't measure it separately.
>
> Similar optimizations might be applicable to other madvise behaviros
Typo: behaviros -> behavior (or 'behaviors', but since behavior is already plural
probably unnecessary).
> such as MADV_COLD and MADV_PAGEOUT. Those are simply out of the scope
> of this patch series, though.
Well well, for now :)
>
> Patches Seuquence
Typo: Seuquence -> Sequence.
> =================
>
> First patch defines new data structure for managing information that
Nit: _The_ first patch. _a_ new data structure. that -> 'that is'.
> required for batched tlb flushes (mmu_gather and behavior), and update
> code paths for MADV_DONTNEED[_LOCKED] and MADV_FREE handling internal
> logics to receive it.
Typo: logics -> logic.
>
> Second patch batches tlb flushes for MADV_FREE handling for both
Nit: _The_ second patch.
> madvise() and process_madvise().
>
> Remaining two patches are for MADV_DONTNEED[_LOCKED] tlb flushes
> batching. Third patch splits zap_page_range_single() for batching of
Nit: Third patch -> _The_ third patch.
> MADV_DONTNEED[_LOCKED] handling. The final and fourth patch batches tlb
Nit: 'patch batches' -> 'patches batch'.
> flushes for the hint using the sub-logic that the third patch split out,
> and the helpers for batched tlb flushes that intorduced for MADV_FREE
Typo: intorduced -> introduced.
Nit: 'for MADV_FREE' -> 'for the MADV_FREE'.
> case, by the second patch.
>
> Test Results
> ============
>
> I measured the latency to apply MADV_DONTNEED advice to 256 MiB memory
> using multiple process_madvise() calls. I apply the advice in 4 KiB
> sized regions granularity, but with varying batch size per
> process_madvise() call (vlen) from 1 to 1024. The source code for the
> measurement is available at GitHub[1]. To reduce measurement errors, I
> did the measurement five times.
Be interesting to see how this behaves with mTHP sizing too! But probably a bit
out of scope perhaps.
>
> The measurement results are as below. 'sz_batch' column shows the batch
> size of process_madvise() calls. 'Before' and 'After' columns show the
> average of latencies in nanoseconds that measured five times on kernels
> that built without and with the tlb flushes batching of this series
> (patches 3 and 4), respectively. For the baseline, mm-new tree of
> 2025-04-04[2] has been used. 'B-stdev' and 'A-stdev' columns show
> ratios of latency measurements standard deviation to average in percent
> for 'Before' and 'After', respectively. 'Latency_reduction' shows the
> reduction of the latency that the 'After' has achieved compared to
> 'Before', in percent. Higher 'Latency_reduction' values mean more
> efficiency improvements.
>
> sz_batch Before B-stdev After A-stdev Latency_reduction
> 1 110948138.2 5.55 109476402.8 4.28 1.33
> 2 75678535.6 1.67 70470722.2 3.55 6.88
> 4 59530647.6 4.77 51735606.6 3.44 13.09
> 8 50013051.6 4.39 44377029.8 5.20 11.27
> 16 48657878.2 9.32 37291600.4 3.39 23.36
> 32 43614180.2 6.06 34127428 3.75 21.75
> 64 42466694.2 5.70 26737935.2 2.54 37.04
> 128 42977970 6.99 25050444.2 4.06 41.71
> 256 41549546 1.88 24644375.8 3.77 40.69
> 512 42162698.6 6.17 24068224.8 2.87 42.92
> 1024 40978574 5.44 23872024.2 3.65 41.75
Very nice! Great work.
>
> As expected, tlb flushes batching provides latency reduction that
> proportional to the batch size. The efficiency gain ranges from about
> 6.88 percent with batch size 2, to about 40 percent with batch size 128.
>
> Please note that this is a very simple microbenchmark, so real
> efficiency gain on real workload could be very different.
Indeed, accepted, but it makes a great deal of sense to batch these operations,
especially when we get to the point of actually increasing the process_madvise()
iov size.
>
> Chagelong
> =========
>
> Changes from v1
> (https://lore.kernel.org/20250310172318.653630-1-sj@kernel.org)
> - Split code cleanup part out
> - Keep the order between tlb flushes and hugetlb_zap_end()
> - Put mm/memory change just before its real usage
> - Add VM_WARN_ON_ONCE() for invlaid tlb argument to unmap_vma_single()
> - Cleanups following nice reviewers suggestions
>
> Changes from RFC
> (https://lore.kernel.org/20250305181611.54484-1-sj@kernel.org)
> - Clarify motivation of the change on the cover letter
> - Add average and stdev of evaluation results
> - Show latency reduction on evaluation results
> - Fix !CONFIG_MEMORY_FAILURE build error
> - Rename is_memory_populate() to is_madvise_populate()
> - Squash patches 5-8
> - Add kerneldoc for unmap_vm_area_struct()
> - Squash patches 10 and 11
> - Squash patches 12-14
> - Squash patches 15 and 16
>
> References
> ==========
>
> [1] https://github.com/sjp38/eval_proc_madvise
> [2] commit edd67244fe67 ("mm/show_mem: optimize si_meminfo_node by reducing redundant code") # mm-new
>
> SeongJae Park (4):
> mm/madvise: define and use madvise_behavior struct for
> madvise_do_behavior()
> mm/madvise: batch tlb flushes for MADV_FREE
> mm/memory: split non-tlb flushing part from zap_page_range_single()
> mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED]
>
> mm/internal.h | 3 ++
> mm/madvise.c | 110 ++++++++++++++++++++++++++++++++++++--------------
> mm/memory.c | 47 ++++++++++++++++-----
> 3 files changed, 121 insertions(+), 39 deletions(-)
>
>
> base-commit: 85b87628fae973dedae95f2ea2782b7df4c11322
> --
> 2.39.5
On Tue, 8 Apr 2025 14:44:40 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Fri, Apr 04, 2025 at 02:06:56PM -0700, SeongJae Park wrote:
> > When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
> > MADV_FREE with multiple address ranges, tlb flushes happen for each of
> > the given address ranges. Because such tlb flushes are for same
>
> Nit: for _the_ same.

Thank you for kindly finding and suggesting fixes for these mistakes. I will
update following your suggestions here and below.

[...]

> > Similar optimizations might be applicable to other madvise behaviros
>
> Typo: behaviros -> behavior (or 'behaviors', but since behavior is already plural
> probably unnecessary).
>
> > such as MADV_COLD and MADV_PAGEOUT. Those are simply out of the scope
> > of this patch series, though.
>
> Well well, for now :)

Yes. Hopefully we will have another chance to further improve the cases.

[...]

> > Test Results
> > ============
> >
> > I measured the latency to apply MADV_DONTNEED advice to 256 MiB memory
> > using multiple process_madvise() calls. I apply the advice in 4 KiB
> > sized regions granularity, but with varying batch size per
> > process_madvise() call (vlen) from 1 to 1024. The source code for the
> > measurement is available at GitHub[1]. To reduce measurement errors, I
> > did the measurement five times.
>
> Be interesting to see how this behaves with mTHP sizing too! But probably a bit
> out of scope perhaps.

Obviously we have many more rooms to explore and get fun :)

> >
> > The measurement results are as below. 'sz_batch' column shows the batch
> > size of process_madvise() calls. 'Before' and 'After' columns show the
> > average of latencies in nanoseconds that measured five times on kernels
> > that built without and with the tlb flushes batching of this series
> > (patches 3 and 4), respectively. For the baseline, mm-new tree of
> > 2025-04-04[2] has been used. 'B-stdev' and 'A-stdev' columns show
> > ratios of latency measurements standard deviation to average in percent
> > for 'Before' and 'After', respectively. 'Latency_reduction' shows the
> > reduction of the latency that the 'After' has achieved compared to
> > 'Before', in percent. Higher 'Latency_reduction' values mean more
> > efficiency improvements.
> >
> > sz_batch Before B-stdev After A-stdev Latency_reduction
> > 1 110948138.2 5.55 109476402.8 4.28 1.33
> > 2 75678535.6 1.67 70470722.2 3.55 6.88
> > 4 59530647.6 4.77 51735606.6 3.44 13.09
> > 8 50013051.6 4.39 44377029.8 5.20 11.27
> > 16 48657878.2 9.32 37291600.4 3.39 23.36
> > 32 43614180.2 6.06 34127428 3.75 21.75
> > 64 42466694.2 5.70 26737935.2 2.54 37.04
> > 128 42977970 6.99 25050444.2 4.06 41.71
> > 256 41549546 1.88 24644375.8 3.77 40.69
> > 512 42162698.6 6.17 24068224.8 2.87 42.92
> > 1024 40978574 5.44 23872024.2 3.65 41.75
>
> Very nice! Great work.
>
> > As expected, tlb flushes batching provides latency reduction that
> > proportional to the batch size. The efficiency gain ranges from about
> > 6.88 percent with batch size 2, to about 40 percent with batch size 128.
> >
> > Please note that this is a very simple microbenchmark, so real
> > efficiency gain on real workload could be very different.
>
> Indeed, accepted, but it makes a great deal of sense to batch these operations,
> especially when we get to the point of actually increasing the process_madvise()
> iov size.

Cannot agree more.

Thank you for your kind review with great suggestions for this patchset.
I will post the next spin with the suggested changes, soon.

Thanks,
SJ

[...]