[PATCH v4 00/11] Perf improvements for hugetlb and vmalloc on arm64
Posted by Ryan Roberts 8 months ago
Hi All,

This is v4 of a series to improve performance for hugetlb and vmalloc on arm64.
Although some of these patches are core-mm, advice from Andrew was to go via the
arm64 tree. All patches are now acked/reviewed by relevant maintainers so I
believe this should be good-to-go.

The 2 key performance improvements are 1) enabling the use of contpte-mapped
blocks in the vmalloc space when appropriate (which reduces TLB pressure). There
were already hooks for this (used by powerpc) but they required some tidying and
extending for arm64. And 2) batching up barriers when modifying the vmalloc
address space, for up to a 30% reduction in the time taken by vmalloc().
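
As an illustrative sketch of 2) (not an actual hunk from the series; the
function below is made up for illustration, but arch_[enter|leave]_lazy_mmu_mode()
are the existing generic hooks): the vmalloc PTE loops get wrapped in a lazy MMU
section, and arm64 can then defer its per-set_pte barriers until the section is
closed.

#include <linux/mm.h>
#include <linux/pgtable.h>

/*
 * Sketch only: a kernel PTE-mapping loop wrapped in a lazy MMU section.
 * While the section is open, arm64 can skip the dsb(ishst)/isb it would
 * otherwise issue for each set_pte_at() on a valid kernel mapping, and
 * emit them once in arch_leave_lazy_mmu_mode() instead.
 */
static void sketch_map_kernel_range(pte_t *ptep, unsigned long addr,
				    unsigned long end, unsigned long pfn,
				    pgprot_t prot)
{
	arch_enter_lazy_mmu_mode();
	do {
		set_pte_at(&init_mm, addr, ptep, pfn_pte(pfn, prot));
		pfn++;
	} while (ptep++, addr += PAGE_SIZE, addr != end);
	arch_leave_lazy_mmu_mode();	/* barriers batched to here */
}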

vmalloc() performance was measured using the test_vmalloc.ko module. Tested on
Apple M2 and Ampere Altra. Each test had loop count set to 500000 and the whole
test was repeated 10 times.
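
For reference, one iteration of e.g. fix_size_alloc_test boils down to roughly
the following (sketch only; the helper name is made up and the real loop lives
in lib/test_vmalloc.c):

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/types.h>
#include <linux/vmalloc.h>

/* Roughly one fix_size_alloc_test iteration for a given p/h combination. */
static int sketch_fix_size_alloc_once(unsigned int nr_pages, bool use_huge)
{
	void *ptr;

	if (use_huge)
		ptr = vmalloc_huge(nr_pages * PAGE_SIZE, GFP_KERNEL);
	else
		ptr = vmalloc(nr_pages * PAGE_SIZE);

	if (!ptr)
		return -ENOMEM;

	*((u8 *)ptr) = 0;	/* touch the mapping */
	vfree(ptr);

	return 0;
}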

legend:
  - p: nr_pages (pages to allocate)
  - h: use_huge (vmalloc() vs vmalloc_huge())
  - (I): statistically significant improvement (95% CI does not overlap)
  - (R): statistically significant regression (95% CI does not overlap)
  - measurements are times; smaller is better

+--------------------------------------------------+-------------+-------------+
| Benchmark                                        |             |             |
|   Result Class                                   |    Apple M2 |Ampere Altra |
+==================================================+=============+=============+
| micromm/vmalloc                                  |             |             |
|   fix_align_alloc_test: p:1, h:0 (usec)          | (I) -11.53% |      -2.57% |
|   fix_size_alloc_test: p:1, h:0 (usec)           |       2.14% |       1.79% |
|   fix_size_alloc_test: p:4, h:0 (usec)           |  (I) -9.93% |  (I) -4.80% |
|   fix_size_alloc_test: p:16, h:0 (usec)          | (I) -25.07% | (I) -14.24% |
|   fix_size_alloc_test: p:16, h:1 (usec)          | (I) -14.07% |   (R) 7.93% |
|   fix_size_alloc_test: p:64, h:0 (usec)          | (I) -29.43% | (I) -19.30% |
|   fix_size_alloc_test: p:64, h:1 (usec)          | (I) -16.39% |   (R) 6.71% |
|   fix_size_alloc_test: p:256, h:0 (usec)         | (I) -31.46% | (I) -20.60% |
|   fix_size_alloc_test: p:256, h:1 (usec)         | (I) -16.58% |   (R) 6.70% |
|   fix_size_alloc_test: p:512, h:0 (usec)         | (I) -31.96% | (I) -20.04% |
|   fix_size_alloc_test: p:512, h:1 (usec)         |       2.30% |       0.71% |
|   full_fit_alloc_test: p:1, h:0 (usec)           |      -2.94% |       1.77% |
|   kvfree_rcu_1_arg_vmalloc_test: p:1, h:0 (usec) |      -7.75% |       1.71% |
|   kvfree_rcu_2_arg_vmalloc_test: p:1, h:0 (usec) |      -9.07% |   (R) 2.34% |
|   long_busy_list_alloc_test: p:1, h:0 (usec)     | (I) -29.18% | (I) -17.91% |
|   pcpu_alloc_test: p:1, h:0 (usec)               |     -14.71% |      -3.14% |
|   random_size_align_alloc_test: p:1, h:0 (usec)  | (I) -11.08% |  (I) -4.62% |
|   random_size_alloc_test: p:1, h:0 (usec)        | (I) -30.25% | (I) -17.95% |
|   vm_map_ram_test: p:1, h:0 (usec)               |       5.06% |   (R) 6.63% |
+--------------------------------------------------+-------------+-------------+

So there are some nice improvements but also some regressions to explain:

fix_size_alloc_test with h:1 and p:16,64,256 regress by ~6% on Altra. The
regression is actually introduced by enabling contpte-mapped 64K blocks in these
tests, and that regression is reduced (from about 8% if memory serves) by doing
the barrier batching. I don't have a definite conclusion on the root cause, but
I've ruled out differences in the mapping paths in vmalloc. I believe it is most
likely down to the difference in the allocation path: 64K blocks are not cached
per-cpu, so we have to go all the way to the buddy allocator. I'm not sure why
this doesn't show up on M2 though. Regardless, I'm going to assert that a 16x
reduction in TLB pressure is worth a ~6% increase in vmalloc() allocation call
duration.
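
To illustrate that suspicion (and it is only a suspicion): with 4K base pages a
64K block is an order-4 request, which is above PAGE_ALLOC_COSTLY_ORDER and not
PMD-order, so it bypasses the per-CPU page lists. The sketch below is not code
from this series, just the two allocation shapes being compared:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/sizes.h>

/*
 * Illustration of the suspected allocation-path difference only; not
 * code from this series. Order-0 requests can be served from the
 * per-CPU page lists, whereas an order-4 (64K with 4K base pages)
 * request takes the buddy/zone-lock path.
 */
static void sketch_alloc_paths(void)
{
	struct page *small = alloc_pages(GFP_KERNEL, 0);
	struct page *big = alloc_pages(GFP_KERNEL, get_order(SZ_64K));

	if (small)
		__free_pages(small, 0);
	if (big)
		__free_pages(big, get_order(SZ_64K));
}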

Changes since v3 [3]
====================
- Applied R-bs (thanks all!)
- Renamed set_ptes_anysz() -> __set_ptes_anysz() (Catalin)
- Renamed ptep_get_and_clear_anysz() -> __ptep_get_and_clear_anysz() (Catalin)
- Only set TIF_LAZY_MMU_PENDING if not already set to avoid atomic ops (Catalin)
- Fix comment typos (Anshuman)
- Fix build warnings when PMD is folded (buildbot)
- Reverse xmas tree for variables in __page_table_check_p[mu]ds_set() (Pasha)

Changes since v2 [2]
====================
- Removed the new arch_update_kernel_mappings_[begin|end]() API
- Switched to arch_[enter|leave]_lazy_mmu_mode() instead for barrier batching
- Removed the clean-up that avoided barriers for invalid or user mappings

Changes since v1 [1]
====================
- Split out the fixes into their own series
- Added Rbs from Anshuman - Thanks!
- Added patch to clean up the methods by which huge_pte size is determined
- Added "#ifndef __PAGETABLE_PMD_FOLDED" around PUD_SIZE in
  flush_hugetlb_tlb_range()
- Renamed ___set_ptes() -> set_ptes_anysz()
- Renamed ___ptep_get_and_clear() -> ptep_get_and_clear_anysz()
- Fixed typos in commit logs
- Refactored pXd_valid_not_user() for better reuse
- Removed TIF_KMAP_UPDATE_PENDING after concluding that a single flag is sufficient
- Concluded the extra isb() in __switch_to() is not required
- Only call arch_update_kernel_mappings_[begin|end]() for kernel mappings

Applies on top of v6.15-rc3. All mm selftests run and no regressions observed.

[1] https://lore.kernel.org/all/20250205151003.88959-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/all/20250217140809.1702789-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/all/20250304150444.3788920-1-ryan.roberts@arm.com/

Thanks,
Ryan

Ryan Roberts (11):
  arm64: hugetlb: Cleanup huge_pte size discovery mechanisms
  arm64: hugetlb: Refine tlb maintenance scope
  mm/page_table_check: Batch-check pmds/puds just like ptes
  arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
  arm64: hugetlb: Use __set_ptes_anysz() and
    __ptep_get_and_clear_anysz()
  arm64/mm: Hoist barriers out of set_ptes_anysz() loop
  mm/vmalloc: Warn on improper use of vunmap_range()
  mm/vmalloc: Gracefully unmap huge ptes
  arm64/mm: Support huge pte-mapped pages in vmap
  mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes
  arm64/mm: Batch barriers when updating kernel mappings

 arch/arm64/include/asm/hugetlb.h     |  29 ++--
 arch/arm64/include/asm/pgtable.h     | 209 +++++++++++++++++++--------
 arch/arm64/include/asm/thread_info.h |   2 +
 arch/arm64/include/asm/vmalloc.h     |  45 ++++++
 arch/arm64/kernel/process.c          |   9 +-
 arch/arm64/mm/hugetlbpage.c          |  73 ++++------
 include/linux/page_table_check.h     |  30 ++--
 include/linux/vmalloc.h              |   8 +
 mm/page_table_check.c                |  34 +++--
 mm/vmalloc.c                         |  40 ++++-
 10 files changed, 329 insertions(+), 150 deletions(-)

--
2.43.0
Re: [PATCH v4 00/11] Perf improvements for hugetlb and vmalloc on arm64
Posted by Will Deacon 7 months, 1 week ago
On Tue, 22 Apr 2025 09:18:08 +0100, Ryan Roberts wrote:
> This is v4 of a series to improve performance for hugetlb and vmalloc on arm64.
> Although some of these patches are core-mm, advice from Andrew was to go via the
> arm64 tree. All patches are now acked/reviewed by relevant maintainers so I
> believe this should be good-to-go.
> 
> The 2 key performance improvements are 1) enabling the use of contpte-mapped
> blocks in the vmalloc space when appropriate (which reduces TLB pressure). There
> were already hooks for this (used by powerpc) but they required some tidying and
> extending for arm64. And 2) batching up barriers when modifying the vmalloc
> address space, for up to a 30% reduction in the time taken by vmalloc().
> 
> [...]

Sorry for the delay in getting to this series; it all looks good.

Applied to arm64 (for-next/mm), thanks!

[01/11] arm64: hugetlb: Cleanup huge_pte size discovery mechanisms
        https://git.kernel.org/arm64/c/29cb80519689
[02/11] arm64: hugetlb: Refine tlb maintenance scope
        https://git.kernel.org/arm64/c/5b3f8917644e
[03/11] mm/page_table_check: Batch-check pmds/puds just like ptes
        https://git.kernel.org/arm64/c/91e40668e70a
[04/11] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
        https://git.kernel.org/arm64/c/ef493d234362
[05/11] arm64: hugetlb: Use __set_ptes_anysz() and __ptep_get_and_clear_anysz()
        https://git.kernel.org/arm64/c/a899b7d0673c
[06/11] arm64/mm: Hoist barriers out of set_ptes_anysz() loop
        https://git.kernel.org/arm64/c/f89b399e8d6e
[07/11] mm/vmalloc: Warn on improper use of vunmap_range()
        https://git.kernel.org/arm64/c/61ef8ddaa35e
[08/11] mm/vmalloc: Gracefully unmap huge ptes
        https://git.kernel.org/arm64/c/2fba13371fe8
[09/11] arm64/mm: Support huge pte-mapped pages in vmap
        https://git.kernel.org/arm64/c/06fc959fcff7
[10/11] mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes
        https://git.kernel.org/arm64/c/44562c71e2cf
[11/11] arm64/mm: Batch barriers when updating kernel mappings
        https://git.kernel.org/arm64/c/5fdd05efa1cd

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev
Re: [PATCH v4 00/11] Perf improvements for hugetlb and vmalloc on arm64
Posted by Ryan Roberts 7 months, 2 weeks ago
Hi Will,

Just a bump on this; I believe it's had review from all the relevant folks (and
has the R-b tags). I was hoping to get this into v6.16, but I'm getting nervous
that time is running out to soak it in linux-next. Any chance you could consider
pulling?

Thanks,
Ryan


On 22/04/2025 09:18, Ryan Roberts wrote:
> Hi All,
> 
> This is v4 of a series to improve performance for hugetlb and vmalloc on arm64.
> Although some of these patches are core-mm, advice from Andrew was to go via the
> arm64 tree. All patches are now acked/reviewed by relevant maintainers so I
> believe this should be good-to-go.
> 
> The 2 key performance improvements are 1) enabling the use of contpte-mapped
> blocks in the vmalloc space when appropriate (which reduces TLB pressure). There
> were already hooks for this (used by powerpc) but they required some tidying and
> extending for arm64. And 2) batching up barriers when modifying the vmalloc
> address space, for up to a 30% reduction in the time taken by vmalloc().
> 
> [...]
Re: [PATCH v4 00/11] Perf improvements for hugetlb and vmalloc on arm64
Posted by Luiz Capitulino 7 months, 4 weeks ago
On 2025-04-22 04:18, Ryan Roberts wrote:
> Hi All,
> 
> This is v4 of a series to improve performance for hugetlb and vmalloc on arm64.
> Although some of these patches are core-mm, advice from Andrew was to go via the
> arm64 tree. All patches are now acked/reviewed by relevant maintainers so I
> believe this should be good-to-go.
> 
> The 2 key performance improvements are 1) enabling the use of contpte-mapped
> blocks in the vmalloc space when appropriate (which reduces TLB pressure). There
> were already hooks for this (used by powerpc) but they required some tidying and
> extending for arm64. And 2) batching up barriers when modifying the vmalloc
> address space, for up to a 30% reduction in the time taken by vmalloc().
> 
> [...]

I ran a couple of basic HugeTLB functional tests with 1G, 2M, 32M and 64K
pages on an Ampere Mt. Jade and a Jetson AGX Orin system and didn't hit any
issues, so:

Tested-by: Luiz Capitulino <luizcap@redhat.com>
