This change converts several critical page table zapping functions from
`inline` to `__always_inline`, resulting in measurable performance
improvements in process spawning workloads.

Performance Impact (Intel Xeon Gold 6430 2.1GHz):
- UnixBench 'context1' test shows ~6% improvement (single-core)
- UnixBench overall score shows ~0.6% improvement (single-core)
- mm/memory.o size reduced by 2.49% (70190 -> 68445 bytes)
- Net code reduction of 1745 bytes (add/remove: 211/166)

The modified functions form a hot path during process teardown:
1. zap_present_ptes()
2. do_zap_pte_range()
3. zap_pte_range()
4. zap_pmd_range()

Signed-off-by: Li Qiang <liqiang01@kylinos.cn>
---
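For reference, the annotation in question is, roughly, plain `inline` plus
GCC's always_inline function attribute, which overrides the compiler's
inlining cost heuristics. A minimal user-space sketch (not kernel code; the
macro name my_always_inline is made up here to mirror what the kernel's
__always_inline macro provides):

/*
 * Minimal user-space sketch, not kernel code: my_always_inline mirrors the
 * kernel's __always_inline (inline plus the GCC always_inline attribute).
 * A plain "inline" is only a hint the compiler's cost model may ignore;
 * the attribute forces the decision.
 */
#include <stdio.h>

#define my_always_inline inline __attribute__((__always_inline__))

static inline int add_hint(int a, int b)                /* may or may not be inlined */
{
        return a + b;
}

static my_always_inline int add_forced(int a, int b)    /* always inlined */
{
        return a + b;
}

int main(void)
{
        printf("%d %d\n", add_hint(1, 2), add_forced(3, 4));
        return 0;
}
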
mm/memory.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index b0cda5aab398..281a353fae7b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1543,7 +1543,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
  *
  * Returns the number of processed (skipped or zapped) PTEs (at least 1).
  */
-static inline int zap_present_ptes(struct mmu_gather *tlb,
+static __always_inline int zap_present_ptes(struct mmu_gather *tlb,
                 struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
                 unsigned int max_nr, unsigned long addr,
                 struct zap_details *details, int *rss, bool *force_flush,
@@ -1662,7 +1662,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
         return nr;
 }

-static inline int do_zap_pte_range(struct mmu_gather *tlb,
+static __always_inline int do_zap_pte_range(struct mmu_gather *tlb,
                 struct vm_area_struct *vma, pte_t *pte,
                 unsigned long addr, unsigned long end,
                 struct zap_details *details, int *rss,
@@ -1698,7 +1698,7 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
         return nr;
 }

-static unsigned long zap_pte_range(struct mmu_gather *tlb,
+static __always_inline unsigned long zap_pte_range(struct mmu_gather *tlb,
                 struct vm_area_struct *vma, pmd_t *pmd,
                 unsigned long addr, unsigned long end,
                 struct zap_details *details)
@@ -1790,7 +1790,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
         return addr;
 }

-static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
+static __always_inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
                 struct vm_area_struct *vma, pud_t *pud,
                 unsigned long addr, unsigned long end,
                 struct zap_details *details)
@@ -1832,7 +1832,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
         return addr;
 }

-static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
+static __always_inline unsigned long zap_pud_range(struct mmu_gather *tlb,
                 struct vm_area_struct *vma, p4d_t *p4d,
                 unsigned long addr, unsigned long end,
                 struct zap_details *details)
@@ -1861,7 +1861,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
         return addr;
 }

-static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
+static __always_inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
                 struct vm_area_struct *vma, pgd_t *pgd,
                 unsigned long addr, unsigned long end,
                 struct zap_details *details)
--
2.25.1
On 04.08.25 14:39, Li Qiang wrote:
> This change converts several critical page table zapping functions from
> `inline` to `__always_inline`, resulting in measurable performance
> improvements in process spawning workloads.
>
> Performance Impact (Intel Xeon Gold 6430 2.1GHz):
> - UnixBench 'context1' test shows ~6% improvement (single-core)
> - UnixBench overall score shows ~0.6% improvement (single-core)
> - mm/memory.o size reduced by 2.49% (70190 -> 68445 bytes)
> - Net code reduction of 1745 bytes (add/remove: 211/166)
>
> The modified functions form a hot path during process teardown:
> 1. zap_present_ptes()
> 2. do_zap_pte_range()
> 3. zap_pte_range()
> 4. zap_pmd_range()
>
> Signed-off-by: Li Qiang <liqiang01@kylinos.cn>
> ---

What's the object file size change?

--
Cheers,

David / dhildenb
> On 4 Aug 2025, at 15:51, David Hildenbrand <david@redhat.com> wrote:
>
> On 04.08.25 14:39, Li Qiang wrote:
>> This change converts several critical page table zapping functions from
>> `inline` to `__always_inline`, resulting in measurable performance
>> improvements in process spawning workloads.
>> Performance Impact (Intel Xeon Gold 6430 2.1GHz):
>> - UnixBench 'context1' test shows ~6% improvement (single-core)
>> - UnixBench overall score shows ~0.6% improvement (single-core)
>> - mm/memory.o size reduced by 2.49% (70190 -> 68445 bytes)
>> - Net code reduction of 1745 bytes (add/remove: 211/166)
>> The modified functions form a hot path during process teardown:
>> 1. zap_present_ptes()
>> 2. do_zap_pte_range()
>> 3. zap_pte_range()
>> 4. zap_pmd_range()
>> Signed-off-by: Li Qiang <liqiang01@kylinos.cn>
>> ---
>
> What's the object file size change?

I think that Li wrote that the size is reduced by 2.49%.

My 2 cents is that usually it may be better to understand why it is
not inlined and address that (e.g., likely() hints or something else)
instead of blindly putting __always_inline. The __always_inline might
stay there for no reason after some code changes and therefore become
a maintenance burden. Concretely, in this case, where there is a single
caller, one can expect the compiler to really prefer to inline the
callees.

Perhaps it is beneficial to compile with flags such as
"-fopt-info-inline-optimized-missed=inline.txt" to better understand
the reason.
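A sketch of how such a flag could be passed through kbuild for just this
file (KCFLAGS and the single-object make target are standard kbuild usage;
the output file name inline.txt is arbitrary):

# Rebuild only mm/memory.o with GCC's inline-decision report written out.
make KCFLAGS='-fopt-info-inline-optimized-missed=inline.txt' mm/memory.o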
On 04.08.25 15:01, Nadav Amit wrote:
>> On 4 Aug 2025, at 15:51, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 04.08.25 14:39, Li Qiang wrote:
>>> This change converts several critical page table zapping functions from
>>> `inline` to `__always_inline`, resulting in measurable performance
>>> improvements in process spawning workloads.
>>> Performance Impact (Intel Xeon Gold 6430 2.1GHz):
>>> - UnixBench 'context1' test shows ~6% improvement (single-core)
>>> - UnixBench overall score shows ~0.6% improvement (single-core)
>>> - mm/memory.o size reduced by 2.49% (70190 -> 68445 bytes)
>>> - Net code reduction of 1745 bytes (add/remove: 211/166)
>>> The modified functions form a hot path during process teardown:
>>> 1. zap_present_ptes()
>>> 2. do_zap_pte_range()
>>> 3. zap_pte_range()
>>> 4. zap_pmd_range()
>>> Signed-off-by: Li Qiang <liqiang01@kylinos.cn>
>>> ---
>>
>> What's the object file size change?
>
> I think that Li wrote that the size is reduced by 2.49%.

Ah, missed it after the performance numbers. As Vlastimil mentioned, I
would have expected a bloat-o-meter output.

> My 2 cents is that usually it may be better to understand why it is
> not inlined and address that (e.g., likely() hints or something else)
> instead of blindly putting __always_inline. The __always_inline might
> stay there for no reason after some code changes and therefore become
> a maintenance burden. Concretely, in this case, where there is a single
> caller, one can expect the compiler to really prefer to inline the
> callees.

Agreed, although the compiler is sometimes hard to convince to do the
right thing when dealing with rather large+complicated code in my
experience.

--
Cheers,

David / dhildenb
Ah, missed it after the performance numbers. As Vlastimil mentioned, I
would have expected a bloat-o-meter output.

> My 2 cents is that usually it may be better to understand why it is
> not inlined and address that (e.g., likely() hints or something else)
> instead of blindly putting __always_inline. The __always_inline might
> stay there for no reason after some code changes and therefore become
> a maintenance burden. Concretely, in this case, where there is a single
> caller, one can expect the compiler to really prefer to inline the
> callees.

> Agreed, although the compiler is sometimes hard to convince to do the
> right thing when dealing with rather large+complicated code in my
> experience.

Question 1: Will this patch increase the vmlinux size?
Reply:
Actually, the overall vmlinux size becomes smaller on x86_64:
[root@localhost linux_old1]# ./scripts/bloat-o-meter before.vmlinux after.vmlinux
add/remove: 6/0 grow/shrink: 0/1 up/down: 4569/-4747 (-178)
Function                           old     new   delta
zap_present_ptes.constprop           -    2696   +2696
zap_pte_range                        -    1236   +1236
zap_pmd_range.isra                   -     589    +589
__pfx_zap_pte_range                  -      16     +16
__pfx_zap_present_ptes.constprop     -      16     +16
__pfx_zap_pmd_range.isra             -      16     +16
unmap_page_range                  5765    1018   -4747
Total: Before=35379786, After=35379608, chg -0.00%

Question 2: Why doesn't GCC inline these functions by default? Are there
any side effects of forced inlining?
Reply:
1) GCC's default parameter max-inline-insns-single imposes restrictions.
However, since these are leaf functions, inlining them not only improves
performance but also reduces code size. May we consider relaxing the
max-inline-insns-single restriction in this case?

2) The functions being inlined in this patch follow a single call path
and are ultimately inlined into unmap_page_range. This only increases
the size of the unmap_page_range assembly function, but since
unmap_page_range itself won't be further inlined, the impact is
well-contained.

Question 3: Does this inlining modification affect code maintainability?
Reply: The modified inline functions are exclusively called by
unmap_page_range, forming a single call path. This doesn't introduce
additional maintenance complexity.

Question 4: Have you performed performance testing on other platforms?
Have you tested other scenarios?
Reply:
1) I tested the same GCC version on arm64 architecture. Even without
this patch, these functions get inlined into unmap_page_range
automatically. This appears to be due to architecture-specific
differences in GCC's max-inline-insns-single default values.

2) I believe UnixBench serves as a reasonably representative server
benchmark. Theoretically, this patch should improve performance by
reducing multi-layer function call overhead. However, I would sincerely
appreciate your guidance on what additional tests might better
demonstrate the performance improvements. Could you kindly suggest some
specific benchmarks or test scenarios I should consider?

--
Cheers,

Li Qiang
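A sketch of how the max-inline-insns-single experiment described above can
be run without patching the source (the --param option and KCFLAGS are
standard; the value here is purely illustrative):

# Build with a raised per-function inlining budget instead of annotating
# the functions, then compare with bloat-o-meter as shown above.
make KCFLAGS='--param max-inline-insns-single=400' vmlinux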
On 8/5/25 2:04 PM, Li Qiang wrote:
> Ah, missed it after the performance numbers. As Vlastimil mentioned, I
> would have expected a bloat-o-meter output.
>
>> My 2 cents is that usually it may be better to understand why it is
>> not inlined and address that (e.g., likely() hints or something else)
>> instead of blindly putting __always_inline. The __always_inline might
>> stay there for no reason after some code changes and therefore become
>> a maintenance burden. Concretely, in this case, where there is a single
>> caller, one can expect the compiler to really prefer to inline the
>> callees.
>
>> Agreed, although the compiler is sometimes hard to convince to do the
>> right thing when dealing with rather large+complicated code in my
>> experience.
>
> Question 1: Will this patch increase the vmlinux size?
> Reply:
> Actually, the overall vmlinux size becomes smaller on x86_64:
> [root@localhost linux_old1]# ./scripts/bloat-o-meter before.vmlinux after.vmlinux
> add/remove: 6/0 grow/shrink: 0/1 up/down: 4569/-4747 (-178)
> Function                           old     new   delta
> zap_present_ptes.constprop           -    2696   +2696
> zap_pte_range                        -    1236   +1236
> zap_pmd_range.isra                   -     589    +589
> __pfx_zap_pte_range                  -      16     +16
> __pfx_zap_present_ptes.constprop     -      16     +16
> __pfx_zap_pmd_range.isra             -      16     +16
> unmap_page_range                  5765    1018   -4747
> Total: Before=35379786, After=35379608, chg -0.00%

Is the before/after swapped here? This output suggests some functions
became NOT inlined.

If I'm right the output binary becomes slightly larger. But it doesn't
matter.
On 8/5/25 13:15, Vlastimil Babka wrote:
> Is the before/after swapped here? This output suggests some functions
> became NOT inlined.
>
> If I'm right the output binary becomes slightly larger. But it doesn't
> matter.

Thank you for your careful review. You're absolutely right - the
before/after columns in my previous data table were accidentally
reversed. I've corrected this in the updated measurements below:

[root@localhost linux_old1]# ./scripts/bloat-o-meter befor.vmlinux after.vmlinux
add/remove: 0/6 grow/shrink: 1/0 up/down: 4747/-4569 (178)
Function                           old     new   delta
unmap_page_range                  1018    5765   +4747
__pfx_zap_pte_range                 16       -     -16
__pfx_zap_present_ptes.constprop    16       -     -16
__pfx_zap_pmd_range.isra            16       -     -16
zap_pmd_range.isra                 589       -    -589
zap_pte_range                     1236       -   -1236
zap_present_ptes.constprop        2696       -   -2696
Total: Before=35379608, After=35379786, chg +0.00%
[root@localhost linux_old1]#
On Tue, Aug 05, 2025 at 08:04:35PM +0800, Li Qiang wrote:
> Ah, missed it after the performance numbers. As Vlastimil mentioned, I
> would have expected a bloat-o-meter output.
>

You're weirdly quoting David here unattributed as if it were your reply?

> > My 2 cents is that usually it may be better to understand why it is
> > not inlined and address that (e.g., likely() hints or something else)
> > instead of blindly putting __always_inline. The __always_inline might
> > stay there for no reason after some code changes and therefore become
> > a maintenance burden. Concretely, in this case, where there is a single
> > caller, one can expect the compiler to really prefer to inline the
> > callees.
>
> > Agreed, although the compiler is sometimes hard to convince to do the
> > right thing when dealing with rather large+complicated code in my
> > experience.
>

Some nits on reply:

- Please reply to mails individually, rather than reply to one arbitrary
  one with questions as to the other.
- Try to wrap to 75 characters per line in replies.
- Make sure it's clear who you're quoting.

This makes life easier, I've had to go and read through a bunch of mails
in thread to get context here.

> Question 1: Will this patch increase the vmlinux size?
> Reply:
> Actually, the overall vmlinux size becomes smaller on x86_64:
> [root@localhost linux_old1]# ./scripts/bloat-o-meter before.vmlinux after.vmlinux
> add/remove: 6/0 grow/shrink: 0/1 up/down: 4569/-4747 (-178)
> Function                           old     new   delta
> zap_present_ptes.constprop           -    2696   +2696
> zap_pte_range                        -    1236   +1236
> zap_pmd_range.isra                   -     589    +589
> __pfx_zap_pte_range                  -      16     +16
> __pfx_zap_present_ptes.constprop     -      16     +16
> __pfx_zap_pmd_range.isra             -      16     +16
> unmap_page_range                  5765    1018   -4747
> Total: Before=35379786, After=35379608, chg -0.00%
>
> Question 2: Why doesn't GCC inline these functions by default? Are there
> any side effects of forced inlining?
> Reply:
> 1) GCC's default parameter max-inline-insns-single imposes restrictions.
> However, since these are leaf functions, inlining them not only improves
> performance but also reduces code size. May we consider relaxing the
> max-inline-insns-single restriction in this case?

Yeah I wonder if we could just increase this... I noticed in my analysis
(presumably what you're replying to here?) that this is what was causing
inlining to stop.

We do a _lot_ of static functions that behave like this so I actually
wonder if we could get perf wins more broadly by doing this...

Could you experiment with this?...

>
> 2) The functions being inlined in this patch follow a single call path
> and are ultimately inlined into unmap_page_range. This only increases
> the size of the unmap_page_range assembly function, but since
> unmap_page_range itself won't be further inlined, the impact is
> well-contained.
>

Yup. This is something I already mentioned.

>
> Question 3: Does this inlining modification affect code maintainability?
> Reply: The modified inline functions are exclusively called by
> unmap_page_range, forming a single call path. This doesn't introduce
> additional maintenance complexity.

Not sure why maintenance would be an issue, code is virtually unchanged.

>
> Question 4: Have you performed performance testing on other platforms?
> Have you tested other scenarios?
> Reply:
> 1) I tested the same GCC version on arm64 architecture. Even without
> this patch, these functions get inlined into unmap_page_range
> automatically. This appears to be due to architecture-specific
> differences in GCC's max-inline-insns-single default values.

OK interesting. I suspect that's due to more registers right?

>
> 2) I believe UnixBench serves as a reasonably representative server
> benchmark. Theoretically, this patch should improve performance by
> reducing multi-layer function call overhead. However, I would sincerely
> appreciate your guidance on what additional tests might better
> demonstrate the performance improvements. Could you kindly suggest some
> specific benchmarks or test scenarios I should consider?

I'm not sure, actual workloads would be best but presumably you don't have
one where you've noticed a demonstrable difference otherwise you'd have
mentioned...

At any rate I've come around on this series, and think this is probably
reasonable, but I would like to see what increasing max-inline-insns-single
does first?

>
> --
> Cheers,
>
> Li Qiang

Cheers, Lorenzo
On Tue, 5 Aug 2025 14:35:22, Lorenzo Stoakes wrote:
> I'm not sure, actual workloads would be best but presumably you don't have
> one where you've noticed a demonstrable difference otherwise you'd have
> mentioned...
>
> At any rate I've come around on this series, and think this is probably
> reasonable, but I would like to see what increasing max-inline-insns-single
> does first?

Thank you for your suggestions. I'll pay closer attention
to email formatting in future communications.

Regarding the performance tests on x86_64 architecture:

Parameter Observation:
When setting max-inline-insns-single=400 (matching arm64's
default value) without applying my patch, the compiler
automatically inlines the critical functions.

Benchmark Results:

Configuration         Baseline     With Patch             max-inline-insns-single=400
UnixBench Score       1824         1835 (+0.6%)           1840 (+0.9%)
vmlinux Size (bytes)  35,379,608   35,379,786 (+0.005%)   35,529,641 (+0.4%)

Key Findings:
The patch provides a measurable performance gain (0.6%) with
minimal size impact (0.005% increase). While
max-inline-insns-single=400 yields slightly better
performance (0.9%), it incurs a larger size penalty (0.4% increase).

Conclusion:
The patch achieves a better performance/size trade-off
compared to globally adjusting the inline threshold. The targeted
approach (selective __always_inline) appears more efficient for
this specific optimization.
On 8/6/25 07:51, Li Qiang wrote:
> On Tue, 5 Aug 2025 14:35:22, Lorenzo Stoakes wrote:
>> I'm not sure, actual workloads would be best but presumably you don't have
>> one where you've noticed a demonstrable difference otherwise you'd have
>> mentioned...
>>
>> At any rate I've come around on this series, and think this is probably
>> reasonable, but I would like to see what increasing max-inline-insns-single
>> does first?
>
> Thank you for your suggestions. I'll pay closer attention
> to email formatting in future communications.
>
> Regarding the performance tests on x86_64 architecture:
>
> Parameter Observation:
> When setting max-inline-insns-single=400 (matching arm64's
> default value) without applying my patch, the compiler
> automatically inlines the critical functions.
>
> Benchmark Results:
>
> Configuration         Baseline     With Patch             max-inline-insns-single=400
> UnixBench Score       1824         1835 (+0.6%)           1840 (+0.9%)
> vmlinux Size (bytes)  35,379,608   35,379,786 (+0.005%)   35,529,641 (+0.4%)
>
> Key Findings:
> The patch provides a measurable performance gain (0.6%) with
> minimal size impact (0.005% increase). While
> max-inline-insns-single=400 yields slightly better
> performance (0.9%), it incurs a larger size penalty (0.4% increase).
>
> Conclusion:
> The patch achieves a better performance/size trade-off
> compared to globally adjusting the inline threshold. The targeted
> approach (selective __always_inline) appears more efficient for
> this specific optimization.

Another attempt at my opensuse tumbleweed system gcc 15.1.1:

add/remove: 1/0 grow/shrink: 4/7 up/down: 1069/-520 (549)
Function                     old     new   delta
unmap_page_range            6493    7424    +931
add_mm_counter                 -     112    +112
finish_fault                1101    1117     +16
do_swap_page                6523    6531      +8
remap_pfn_range_internal    1358    1360      +2
pte_to_swp_entry             123     122      -1
pte_move_swp_offset          219     218      -1
restore_exclusive_pte        356     325     -31
__handle_mm_fault           3988    3949     -39
do_wp_page                  3926    3810    -116
copy_page_range             8051    7930    -121
swap_pte_batch               817     606    -211
Total: Before=66483, After=67032, chg +0.83%

The functions changed by your patch were already inlined, and yet this
force inlining apparently changed some heuristics to change also
completely unrelated functions. So that just shows how fragile these
kinds of attempts to hand-hold gcc for a specific desired behavior are,
and I'd be wary of it.
On 8/4/25 14:51, David Hildenbrand wrote:
> On 04.08.25 14:39, Li Qiang wrote:
>> This change converts several critical page table zapping functions from
>> `inline` to `__always_inline`, resulting in measurable performance
>> improvements in process spawning workloads.
>>
>> Performance Impact (Intel Xeon Gold 6430 2.1GHz):
>> - UnixBench 'context1' test shows ~6% improvement (single-core)
>> - UnixBench overall score shows ~0.6% improvement (single-core)
>> - mm/memory.o size reduced by 2.49% (70190 -> 68445 bytes)
>> - Net code reduction of 1745 bytes (add/remove: 211/166)
>>
>> The modified functions form a hot path during process teardown:
>> 1. zap_present_ptes()
>> 2. do_zap_pte_range()
>> 3. zap_pte_range()
>> 4. zap_pmd_range()
>>
>> Signed-off-by: Li Qiang <liqiang01@kylinos.cn>
>> ---
>
> What's the object file size change?

The output of ./scripts/bloat-o-meter would be the most informative.
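A sketch of the kind of comparison being asked for, run directly on the
object file (the .before copy is a placeholder name):

# Keep a copy of the object built without the patch, then rebuild and compare.
cp mm/memory.o mm/memory.o.before
# ... apply the patch and rebuild mm/memory.o ...
./scripts/bloat-o-meter mm/memory.o.before mm/memory.o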
On Mon, Aug 04, 2025 at 08:39:23PM +0800, Li Qiang wrote:
> This change converts several critical page table zapping functions from
> `inline` to `__always_inline`, resulting in measurable performance
> improvements in process spawning workloads.
>
> Performance Impact (Intel Xeon Gold 6430 2.1GHz):
> - UnixBench 'context1' test shows ~6% improvement (single-core)
> - UnixBench overall score shows ~0.6% improvement (single-core)

These aren't exactly earth-shattering. Are we sure these are
representative of real workloads? Spawning a bazillion processes is not
really meaningful.

> - mm/memory.o size reduced by 2.49% (70190 -> 68445 bytes)
> - Net code reduction of 1745 bytes (add/remove: 211/166)
>
> The modified functions form a hot path during process teardown:
> 1. zap_present_ptes()
> 2. do_zap_pte_range()
> 3. zap_pte_range()
> 4. zap_pmd_range()
>
> Signed-off-by: Li Qiang <liqiang01@kylinos.cn>

I think others have covered this well, but we've had patches like this
before where, in essence, it's a case of 'improves things on my machine'.

The question really is _why_ your compiler is not making these inline in
the first place.

I'm no compiler expert, but the inline here I believe is redundant anyway
within a compilation unit so the compiler will make an inline decision
regardless.

These are pretty big functions though. You're essentially inlining
everything into a mega function in unmap_page_range(). Which seems iffy.

I wonder if we might see degradation in other workloads? And you're
talking about one architecture, not others...

I feel like you'd really need to justify with information on the compiler
(ideally with insights into why it's not inlining now), how it impacts
other architectures, _real workloads_ you've observed this matter for,
etc. for this to be justifiable.

Also are you sure it has to be _every_ level in the hierarchy? What
happens if you inline only e.g. zap_present_ptes(), as we do with
zap_present_folio_ptes() already? (Fact that's _also_ inlined makes this
a mega giant chonker inlined function also...).

I guess bloat is less of an issue as it's all going inside a non-inlined
function. But how this behaves in places other than 'not entirely
convincing benchmark on one architecture/uarch' is key here I think.

I don't think I'll really be convinced until there's quite a bit more
data to back this up with real-world usage.

> ---
> mm/memory.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b0cda5aab398..281a353fae7b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1543,7 +1543,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
>   *
>   * Returns the number of processed (skipped or zapped) PTEs (at least 1).
>   */
> -static inline int zap_present_ptes(struct mmu_gather *tlb,
> +static __always_inline int zap_present_ptes(struct mmu_gather *tlb,
>                  struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
>                  unsigned int max_nr, unsigned long addr,
>                  struct zap_details *details, int *rss, bool *force_flush,
> @@ -1662,7 +1662,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
>          return nr;
>  }
>
> -static inline int do_zap_pte_range(struct mmu_gather *tlb,
> +static __always_inline int do_zap_pte_range(struct mmu_gather *tlb,
>                  struct vm_area_struct *vma, pte_t *pte,
>                  unsigned long addr, unsigned long end,
>                  struct zap_details *details, int *rss,
> @@ -1698,7 +1698,7 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
>          return nr;
>  }
>
> -static unsigned long zap_pte_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_pte_range(struct mmu_gather *tlb,
>                  struct vm_area_struct *vma, pmd_t *pmd,
>                  unsigned long addr, unsigned long end,
>                  struct zap_details *details)
> @@ -1790,7 +1790,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>          return addr;
>  }
>
> -static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>                  struct vm_area_struct *vma, pud_t *pud,
>                  unsigned long addr, unsigned long end,
>                  struct zap_details *details)
> @@ -1832,7 +1832,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>          return addr;
>  }
>
> -static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_pud_range(struct mmu_gather *tlb,
>                  struct vm_area_struct *vma, p4d_t *p4d,
>                  unsigned long addr, unsigned long end,
>                  struct zap_details *details)
> @@ -1861,7 +1861,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
>          return addr;
>  }
>
> -static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
>                  struct vm_area_struct *vma, pgd_t *pgd,
>                  unsigned long addr, unsigned long end,
>                  struct zap_details *details)
> --
> 2.25.1
>
OK,

So I hacked -fopt-info-inline-all into the mm/ Makefile in a rather quick and
dirty way and it seems some stuff gets inlined locally, but we're mostly hitting
the '--param max-inline-insns-single limit reached' limit here.

Which maybe is just a point where the compiler possibly arbitrarily gives up?

Vlasta rightly pointed out off-list that given this appears to only be used in
one place you'd expect inlining as register spill isn't such a concern (we'll
spill saving the stack before function invocation anyway).

So there might actually be some validity here?

This is gcc 15.1.1 running on an x86-64 platform by the way.

mm/memory.c:1871:10: optimized: Inlined zap_p4d_range/6380 into unmap_page_range/6381 which now has time 1458.996712 and size 65, net change of -11.
mm/memory.c:1850:10: optimized: Inlined zap_pud_range.isra/8017 into zap_p4d_range/6380 which now has time 10725.428482 and size 29, net change of -12.
mm/memory.c:1829:10: missed: not inlinable: zap_pud_range.isra/8017 -> zap_pmd_range.isra/8018, --param max-inline-insns-single limit reached
mm/memory.c:1800:10: missed: not inlinable: zap_pmd_range.isra/8018 -> zap_pte_range/6377, --param max-inline-insns-auto limit reached
mm/memory.c:1708:8: optimized: Inlined do_zap_pte_range.constprop/7983 into zap_pte_range/6377 which now has time 4244.320854 and size 148, net change of -15.
mm/memory.c:1664:9: missed: not inlinable: do_zap_pte_range.constprop/7983 -> zap_present_ptes.constprop/7985, --param max-inline-insns-single limit reached

Cheers, Lorenzo
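The Makefile hack described above presumably amounts to something like the
following one-liner (a guess at the exact form; ccflags-y is the standard
kbuild variable for per-directory compiler flags):

# In mm/Makefile: dump GCC inlining decisions for every object in mm/.
ccflags-y += -fopt-info-inline-all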
On 8/4/25 15:59, Lorenzo Stoakes wrote:
> OK,
>
> So I hacked -fopt-info-inline-all into the mm/ Makefile in a rather quick and
> dirty way and it seems some stuff gets inlined locally, but we're mostly hitting
> the '--param max-inline-insns-single limit reached' limit here.
>
> Which maybe is just a point where the compiler possibly arbitrarily gives up?
>
> Vlasta rightly pointed out off-list that given this appears to only be used in
> one place you'd expect inlining as register spill isn't such a concern (we'll
> spill saving the stack before function invocation anyway).
>
> So there might actually be some validity here?
>
> This is gcc 15.1.1 running on an x86-64 platform by the way.
>
> mm/memory.c:1871:10: optimized: Inlined zap_p4d_range/6380 into unmap_page_range/6381 which now has time 1458.996712 and size 65, net change of -11.
> mm/memory.c:1850:10: optimized: Inlined zap_pud_range.isra/8017 into zap_p4d_range/6380 which now has time 10725.428482 and size 29, net change of -12.
> mm/memory.c:1829:10: missed: not inlinable: zap_pud_range.isra/8017 -> zap_pmd_range.isra/8018, --param max-inline-insns-single limit reached
> mm/memory.c:1800:10: missed: not inlinable: zap_pmd_range.isra/8018 -> zap_pte_range/6377, --param max-inline-insns-auto limit reached
> mm/memory.c:1708:8: optimized: Inlined do_zap_pte_range.constprop/7983 into zap_pte_range/6377 which now has time 4244.320854 and size 148, net change of -15.
> mm/memory.c:1664:9: missed: not inlinable: do_zap_pte_range.constprop/7983 -> zap_present_ptes.constprop/7985, --param max-inline-insns-single limit reached

I got some weird bloat-o-meter on this patch:

add/remove: 1/0 grow/shrink: 2/2 up/down: 693/-31 (662)
Function               old     new   delta
__handle_mm_fault     3817    4403    +586
do_swap_page          4497    4560     +63
mksaveddirty_shift       -      44     +44
unmap_page_range      4843    4828     -15
copy_page_range       6497    6481     -16

but even without this patch, "objdump -t mm/memory.o" shows no zap
functions, so they are already inlined? gcc also 15.1.1 but maybe
opensuse has some non-default tunings.
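The symbol-table check mentioned above can be reproduced with something
like this (a sketch; the grep pattern is just a convenience):

# List remaining zap_* symbols in the object file; an empty result means
# the helpers were already inlined into their callers.
objdump -t mm/memory.o | grep ' zap_'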
> On 4 Aug 2025, at 16:59, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> OK,
>
> So I hacked -fopt-info-inline-all into the mm/ Makefile in a rather quick and
> dirty way and it seems some stuff gets inlined locally, but we're mostly hitting
> the '--param max-inline-insns-single limit reached' limit here.

Yes, it does require further investigation.

My point is that sprinkling __always_inline is a slippery slope. You
start with putting __always_inline on zap_present_folio_ptes (as
currently done), and then the caller becomes expensive. Now you notice
that the caller to zap_present_folio_ptes is not getting inlined, which
is not surprising because it got the cost of the always-inlined callee,
so you put __always_inline there, and so on.