This change converts several critical page table zapping functions from
`inline` to `__always_inline`, resulting in measurable performance
improvements in process spawning workloads.

Performance Impact (Intel Xeon Gold 6430 2.1GHz):
- UnixBench 'context1' test shows ~6% improvement (single-core)
- UnixBench overall score shows ~0.6% improvement (single-core)
- mm/memory.o size reduced by 2.49% (70190 -> 68445 bytes)
- Net code reduction of 1745 bytes (add/remove: 211/166)

The modified functions form a hot path during process teardown:
1. zap_present_ptes()
2. do_zap_pte_range()
3. zap_pte_range()
4. zap_pmd_range()

Signed-off-by: Li Qiang <liqiang01@kylinos.cn>
---
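For reference, the annotation in question is, roughly, plain `inline` plus
GCC's always_inline function attribute, which overrides the compiler's
inlining cost heuristics. A minimal user-space sketch (not kernel code; the
macro name my_always_inline is made up here to mirror what the kernel's
__always_inline macro provides):

/*
 * Minimal user-space sketch, not kernel code: my_always_inline mirrors the
 * kernel's __always_inline (inline plus the GCC always_inline attribute).
 * A plain "inline" is only a hint the compiler's cost model may ignore;
 * the attribute forces the decision.
 */
#include <stdio.h>

#define my_always_inline inline __attribute__((__always_inline__))

static inline int add_hint(int a, int b)                /* may or may not be inlined */
{
        return a + b;
}

static my_always_inline int add_forced(int a, int b)    /* always inlined */
{
        return a + b;
}

int main(void)
{
        printf("%d %d\n", add_hint(1, 2), add_forced(3, 4));
        return 0;
}
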
mm/memory.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index b0cda5aab398..281a353fae7b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1543,7 +1543,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
  *
  * Returns the number of processed (skipped or zapped) PTEs (at least 1).
  */
-static inline int zap_present_ptes(struct mmu_gather *tlb,
+static __always_inline int zap_present_ptes(struct mmu_gather *tlb,
                 struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
                 unsigned int max_nr, unsigned long addr,
                 struct zap_details *details, int *rss, bool *force_flush,
@@ -1662,7 +1662,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
         return nr;
 }

-static inline int do_zap_pte_range(struct mmu_gather *tlb,
+static __always_inline int do_zap_pte_range(struct mmu_gather *tlb,
                 struct vm_area_struct *vma, pte_t *pte,
                 unsigned long addr, unsigned long end,
                 struct zap_details *details, int *rss,
@@ -1698,7 +1698,7 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
         return nr;
 }

-static unsigned long zap_pte_range(struct mmu_gather *tlb,
+static __always_inline unsigned long zap_pte_range(struct mmu_gather *tlb,
                 struct vm_area_struct *vma, pmd_t *pmd,
                 unsigned long addr, unsigned long end,
                 struct zap_details *details)
@@ -1790,7 +1790,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
         return addr;
 }

-static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
+static __always_inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
                 struct vm_area_struct *vma, pud_t *pud,
                 unsigned long addr, unsigned long end,
                 struct zap_details *details)
@@ -1832,7 +1832,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
         return addr;
 }

-static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
+static __always_inline unsigned long zap_pud_range(struct mmu_gather *tlb,
                 struct vm_area_struct *vma, p4d_t *p4d,
                 unsigned long addr, unsigned long end,
                 struct zap_details *details)
@@ -1861,7 +1861,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
         return addr;
 }

-static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
+static __always_inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
                 struct vm_area_struct *vma, pgd_t *pgd,
                 unsigned long addr, unsigned long end,
                 struct zap_details *details)
--
2.25.1
On 04.08.25 14:39, Li Qiang wrote:
> This change converts several critical page table zapping functions from
> `inline` to `__always_inline`, resulting in measurable performance
> improvements in process spawning workloads.
>
> Performance Impact (Intel Xeon Gold 6430 2.1GHz):
> - UnixBench 'context1' test shows ~6% improvement (single-core)
> - UnixBench overall score shows ~0.6% improvement (single-core)
> - mm/memory.o size reduced by 2.49% (70190 -> 68445 bytes)
> - Net code reduction of 1745 bytes (add/remove: 211/166)
>
> The modified functions form a hot path during process teardown:
> 1. zap_present_ptes()
> 2. do_zap_pte_range()
> 3. zap_pte_range()
> 4. zap_pmd_range()
>
> Signed-off-by: Li Qiang <liqiang01@kylinos.cn>
> ---

What's the object file size change?

--
Cheers,

David / dhildenb
> On 4 Aug 2025, at 15:51, David Hildenbrand <david@redhat.com> wrote:
>
> On 04.08.25 14:39, Li Qiang wrote:
>> This change converts several critical page table zapping functions from
>> `inline` to `__always_inline`, resulting in measurable performance
>> improvements in process spawning workloads.
>> Performance Impact (Intel Xeon Gold 6430 2.1GHz):
>> - UnixBench 'context1' test shows ~6% improvement (single-core)
>> - UnixBench overall score shows ~0.6% improvement (single-core)
>> - mm/memory.o size reduced by 2.49% (70190 -> 68445 bytes)
>> - Net code reduction of 1745 bytes (add/remove: 211/166)
>> The modified functions form a hot path during process teardown:
>> 1. zap_present_ptes()
>> 2. do_zap_pte_range()
>> 3. zap_pte_range()
>> 4. zap_pmd_range()
>> Signed-off-by: Li Qiang <liqiang01@kylinos.cn>
>> ---
>
> What's the object file size change?

I think that Li wrote that the size is reduced by 2.49%.

My 2 cents is that usually it may be better to understand why it is
not inlined and address that (e.g., likely() hints or something else)
instead of blindly putting __always_inline. The __always_inline might
stay there for no reason after some code changes and therefore become
a maintenance burden. Concretely, in this case, where there is a single
caller, one can expect the compiler to really prefer to inline the
callees.

Perhaps it is beneficial to compile with flags such as
"-fopt-info-inline-optimized-missed=inline.txt" to better understand
the reason.
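A sketch of how such a flag could be passed through kbuild for just this
file (KCFLAGS and the single-object make target are standard kbuild usage;
the output file name inline.txt is arbitrary):

# Rebuild only mm/memory.o with GCC's inline-decision report written out.
make KCFLAGS='-fopt-info-inline-optimized-missed=inline.txt' mm/memory.o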
On 04.08.25 15:01, Nadav Amit wrote:
>> On 4 Aug 2025, at 15:51, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 04.08.25 14:39, Li Qiang wrote:
>>> This change converts several critical page table zapping functions from
>>> `inline` to `__always_inline`, resulting in measurable performance
>>> improvements in process spawning workloads.
>>> Performance Impact (Intel Xeon Gold 6430 2.1GHz):
>>> - UnixBench 'context1' test shows ~6% improvement (single-core)
>>> - UnixBench overall score shows ~0.6% improvement (single-core)
>>> - mm/memory.o size reduced by 2.49% (70190 -> 68445 bytes)
>>> - Net code reduction of 1745 bytes (add/remove: 211/166)
>>> The modified functions form a hot path during process teardown:
>>> 1. zap_present_ptes()
>>> 2. do_zap_pte_range()
>>> 3. zap_pte_range()
>>> 4. zap_pmd_range()
>>> Signed-off-by: Li Qiang <liqiang01@kylinos.cn>
>>> ---
>>
>> What's the object file size change?
>
> I think that Li wrote that the size is reduced by 2.49%.

Ah, missed it after the performance numbers. As Vlastimil mentioned, I
would have expected a bloat-o-meter output.

> My 2 cents is that usually it may be better to understand why it is
> not inlined and address that (e.g., likely() hints or something else)
> instead of blindly putting __always_inline. The __always_inline might
> stay there for no reason after some code changes and therefore become
> a maintenance burden. Concretely, in this case, where there is a single
> caller, one can expect the compiler to really prefer to inline the
> callees.

Agreed, although the compiler is sometimes hard to convince to do the
right thing when dealing with rather large+complicated code in my
experience.

--
Cheers,

David / dhildenb
Ah, missed it after the performance numbers. As Vlastimil mentioned, I
would have expected a bloat-o-meter output.

> My 2 cents is that usually it may be better to understand why it is
> not inlined and address that (e.g., likely() hints or something else)
> instead of blindly putting __always_inline. The __always_inline might
> stay there for no reason after some code changes and therefore become
> a maintenance burden. Concretely, in this case, where there is a single
> caller, one can expect the compiler to really prefer to inline the
> callees.

> Agreed, although the compiler is sometimes hard to convince to do the
> right thing when dealing with rather large+complicated code in my
> experience.

Question 1: Will this patch increase the vmlinux size?
Reply:
Actually, the overall vmlinux size becomes smaller on x86_64:
[root@localhost linux_old1]# ./scripts/bloat-o-meter before.vmlinux after.vmlinux
add/remove: 6/0 grow/shrink: 0/1 up/down: 4569/-4747 (-178)
Function                           old     new   delta
zap_present_ptes.constprop           -    2696   +2696
zap_pte_range                        -    1236   +1236
zap_pmd_range.isra                   -     589    +589
__pfx_zap_pte_range                  -      16     +16
__pfx_zap_present_ptes.constprop     -      16     +16
__pfx_zap_pmd_range.isra             -      16     +16
unmap_page_range                  5765    1018   -4747
Total: Before=35379786, After=35379608, chg -0.00%

Question 2: Why doesn't GCC inline these functions by default? Are there
any side effects of forced inlining?
Reply:
1) GCC's default parameter max-inline-insns-single imposes restrictions.
However, since these are leaf functions, inlining them not only improves
performance but also reduces code size. May we consider relaxing the
max-inline-insns-single restriction in this case?

2) The functions being inlined in this patch follow a single call path
and are ultimately inlined into unmap_page_range. This only increases
the size of the unmap_page_range assembly function, but since
unmap_page_range itself won't be further inlined, the impact is
well-contained.

Question 3: Does this inlining modification affect code maintainability?
Reply: The modified inline functions are exclusively called by
unmap_page_range, forming a single call path. This doesn't introduce
additional maintenance complexity.

Question 4: Have you performed performance testing on other platforms?
Have you tested other scenarios?
Reply:
1) I tested the same GCC version on arm64 architecture. Even without
this patch, these functions get inlined into unmap_page_range
automatically. This appears to be due to architecture-specific
differences in GCC's max-inline-insns-single default values.

2) I believe UnixBench serves as a reasonably representative server
benchmark. Theoretically, this patch should improve performance by
reducing multi-layer function call overhead. However, I would sincerely
appreciate your guidance on what additional tests might better
demonstrate the performance improvements. Could you kindly suggest some
specific benchmarks or test scenarios I should consider?

--
Cheers,

Li Qiang
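A sketch of how the max-inline-insns-single experiment described above can
be run without patching the source (the --param option and KCFLAGS are
standard; the value here is purely illustrative):

# Build with a raised per-function inlining budget instead of annotating
# the functions, then compare with bloat-o-meter as shown above.
make KCFLAGS='--param max-inline-insns-single=400' vmlinux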
On 8/5/25 2:04 PM, Li Qiang wrote:
> Ah, missed it after the performance numbers. As Vlastimil mentioned, I
> would have expected a bloat-o-meter output.
>
>> My 2 cents is that usually it may be better to understand why it is
>> not inlined and address that (e.g., likely() hints or something else)
>> instead of blindly putting __always_inline. The __always_inline might
>> stay there for no reason after some code changes and therefore become
>> a maintenance burden. Concretely, in this case, where there is a single
>> caller, one can expect the compiler to really prefer to inline the
>> callees.
>
>> Agreed, although the compiler is sometimes hard to convince to do the
>> right thing when dealing with rather large+complicated code in my
>> experience.
>
> Question 1: Will this patch increase the vmlinux size?
> Reply:
> Actually, the overall vmlinux size becomes smaller on x86_64:
> [root@localhost linux_old1]# ./scripts/bloat-o-meter before.vmlinux after.vmlinux
> add/remove: 6/0 grow/shrink: 0/1 up/down: 4569/-4747 (-178)
> Function                           old     new   delta
> zap_present_ptes.constprop           -    2696   +2696
> zap_pte_range                        -    1236   +1236
> zap_pmd_range.isra                   -     589    +589
> __pfx_zap_pte_range                  -      16     +16
> __pfx_zap_present_ptes.constprop     -      16     +16
> __pfx_zap_pmd_range.isra             -      16     +16
> unmap_page_range                  5765    1018   -4747
> Total: Before=35379786, After=35379608, chg -0.00%

Is the before/after swapped here? This output suggests some functions
became NOT inlined.

If I'm right the output binary becomes slightly larger. But it doesn't
matter.
On 8/5/25 13:15, Vlastimil Babka wrote:
> Is the before/after swapped here? This output suggests some functions
> became NOT inlined.
>
> If I'm right the output binary becomes slightly larger. But it doesn't
> matter.

Thank you for your careful review. You're absolutely right - the
before/after columns in my previous data table were accidentally
reversed. I've corrected this in the updated measurements below:

[root@localhost linux_old1]# ./scripts/bloat-o-meter befor.vmlinux after.vmlinux
add/remove: 0/6 grow/shrink: 1/0 up/down: 4747/-4569 (178)
Function                           old     new   delta
unmap_page_range                  1018    5765   +4747
__pfx_zap_pte_range                 16       -     -16
__pfx_zap_present_ptes.constprop    16       -     -16
__pfx_zap_pmd_range.isra            16       -     -16
zap_pmd_range.isra                 589       -    -589
zap_pte_range                     1236       -   -1236
zap_present_ptes.constprop        2696       -   -2696
Total: Before=35379608, After=35379786, chg +0.00%
[root@localhost linux_old1]#
On Tue, Aug 05, 2025 at 08:04:35PM +0800, Li Qiang wrote:
> Ah, missed it after the performance numbers. As Vlastimil mentioned, I
> would have expected a bloat-o-meter output.
>

You're weirdly quoting David here unattributed as if it were your reply?

> > My 2 cents is that usually it may be better to understand why it is
> > not inlined and address that (e.g., likely() hints or something else)
> > instead of blindly putting __always_inline. The __always_inline might
> > stay there for no reason after some code changes and therefore become
> > a maintenance burden. Concretely, in this case, where there is a single
> > caller, one can expect the compiler to really prefer to inline the
> > callees.
>
> > Agreed, although the compiler is sometimes hard to convince to do the
> > right thing when dealing with rather large+complicated code in my
> > experience.
>

Some nits on reply:

- Please reply to mails individually, rather than reply to one arbitrary
  one with questions as to the other.
- Try to wrap to 75 characters per line in replies.
- Make sure it's clear who you're quoting.

This makes life easier, I've had to go and read through a bunch of mails
in thread to get context here.

> Question 1: Will this patch increase the vmlinux size?
> Reply:
> Actually, the overall vmlinux size becomes smaller on x86_64:
> [root@localhost linux_old1]# ./scripts/bloat-o-meter before.vmlinux after.vmlinux
> add/remove: 6/0 grow/shrink: 0/1 up/down: 4569/-4747 (-178)
> Function                           old     new   delta
> zap_present_ptes.constprop           -    2696   +2696
> zap_pte_range                        -    1236   +1236
> zap_pmd_range.isra                   -     589    +589
> __pfx_zap_pte_range                  -      16     +16
> __pfx_zap_present_ptes.constprop     -      16     +16
> __pfx_zap_pmd_range.isra             -      16     +16
> unmap_page_range                  5765    1018   -4747
> Total: Before=35379786, After=35379608, chg -0.00%
>
> Question 2: Why doesn't GCC inline these functions by default? Are there
> any side effects of forced inlining?
> Reply:
> 1) GCC's default parameter max-inline-insns-single imposes restrictions.
> However, since these are leaf functions, inlining them not only improves
> performance but also reduces code size. May we consider relaxing the
> max-inline-insns-single restriction in this case?

Yeah I wonder if we could just increase this... I noticed in my analysis
(presumably what you're replying to here?) that this is what was causing
inlining to stop.

We do a _lot_ of static functions that behave like this so I actually
wonder if we could get perf wins more broadly by doing this...

Could you experiment with this?...

>
> 2) The functions being inlined in this patch follow a single call path
> and are ultimately inlined into unmap_page_range. This only increases
> the size of the unmap_page_range assembly function, but since
> unmap_page_range itself won't be further inlined, the impact is
> well-contained.
>

Yup. This is something I already mentioned.

>
> Question 3: Does this inlining modification affect code maintainability?
> Reply: The modified inline functions are exclusively called by
> unmap_page_range, forming a single call path. This doesn't introduce
> additional maintenance complexity.

Not sure why maintenance would be an issue, code is virtually unchanged.

>
> Question 4: Have you performed performance testing on other platforms?
> Have you tested other scenarios?
> Reply:
> 1) I tested the same GCC version on arm64 architecture. Even without
> this patch, these functions get inlined into unmap_page_range
> automatically. This appears to be due to architecture-specific
> differences in GCC's max-inline-insns-single default values.

OK interesting. I suspect that's due to more registers right?

>
> 2) I believe UnixBench serves as a reasonably representative server
> benchmark. Theoretically, this patch should improve performance by
> reducing multi-layer function call overhead. However, I would sincerely
> appreciate your guidance on what additional tests might better
> demonstrate the performance improvements. Could you kindly suggest some
> specific benchmarks or test scenarios I should consider?

I'm not sure, actual workloads would be best but presumably you don't have
one where you've noticed a demonstrable difference otherwise you'd have
mentioned...

At any rate I've come around on this series, and think this is probably
reasonable, but I would like to see what increasing max-inline-insns-single
does first?

>
> --
> Cheers,
>
> Li Qiang

Cheers, Lorenzo
On Tue, 5 Aug 2025 14:35:22, Lorenzo Stoakes wrote:
> I'm not sure, actual workloads would be best but presumably you don't have
> one where you've noticed a demonstrable difference otherwise you'd have
> mentioned...
>
> At any rate I've come around on this series, and think this is probably
> reasonable, but I would like to see what increasing max-inline-insns-single
> does first?

Thank you for your suggestions. I'll pay closer attention
to email formatting in future communications.

Regarding the performance tests on x86_64 architecture:

Parameter Observation:
When setting max-inline-insns-single=400 (matching arm64's
default value) without applying my patch, the compiler
automatically inlines the critical functions.

Benchmark Results:

Configuration         Baseline     With Patch             max-inline-insns-single=400
UnixBench Score       1824         1835 (+0.6%)           1840 (+0.9%)
vmlinux Size (bytes)  35,379,608   35,379,786 (+0.005%)   35,529,641 (+0.4%)

Key Findings:
The patch provides a measurable performance gain (0.6%) with
minimal size impact (0.005% increase). While
max-inline-insns-single=400 yields slightly better
performance (0.9%), it incurs a larger size penalty (0.4% increase).

Conclusion:
The patch achieves a better performance/size trade-off
compared to globally adjusting the inline threshold. The targeted
approach (selective __always_inline) appears more efficient for
this specific optimization.
On 8/6/25 07:51, Li Qiang wrote:
> On Tue, 5 Aug 2025 14:35:22, Lorenzo Stoakes wrote:
>> I'm not sure, actual workloads would be best but presumably you don't have
>> one where you've noticed a demonstrable difference otherwise you'd have
>> mentioned...
>>
>> At any rate I've come around on this series, and think this is probably
>> reasonable, but I would like to see what increasing max-inline-insns-single
>> does first?
>
> Thank you for your suggestions. I'll pay closer attention
> to email formatting in future communications.
>
> Regarding the performance tests on x86_64 architecture:
>
> Parameter Observation:
> When setting max-inline-insns-single=400 (matching arm64's
> default value) without applying my patch, the compiler
> automatically inlines the critical functions.
>
> Benchmark Results:
>
> Configuration         Baseline     With Patch             max-inline-insns-single=400
> UnixBench Score       1824         1835 (+0.6%)           1840 (+0.9%)
> vmlinux Size (bytes)  35,379,608   35,379,786 (+0.005%)   35,529,641 (+0.4%)
>
> Key Findings:
> The patch provides a measurable performance gain (0.6%) with
> minimal size impact (0.005% increase). While
> max-inline-insns-single=400 yields slightly better
> performance (0.9%), it incurs a larger size penalty (0.4% increase).
>
> Conclusion:
> The patch achieves a better performance/size trade-off
> compared to globally adjusting the inline threshold. The targeted
> approach (selective __always_inline) appears more efficient for
> this specific optimization.

Another attempt at my opensuse tumbleweed system gcc 15.1.1:

add/remove: 1/0 grow/shrink: 4/7 up/down: 1069/-520 (549)
Function                     old     new   delta
unmap_page_range            6493    7424    +931
add_mm_counter                 -     112    +112
finish_fault                1101    1117     +16
do_swap_page                6523    6531      +8
remap_pfn_range_internal    1358    1360      +2
pte_to_swp_entry             123     122      -1
pte_move_swp_offset          219     218      -1
restore_exclusive_pte        356     325     -31
__handle_mm_fault           3988    3949     -39
do_wp_page                  3926    3810    -116
copy_page_range             8051    7930    -121
swap_pte_batch               817     606    -211
Total: Before=66483, After=67032, chg +0.83%

The functions changed by your patch were already inlined, and yet this
force inlining apparently changed some heuristics to change also
completely unrelated functions. So that just shows how fragile these
kinds of attempts to hand-hold gcc for a specific desired behavior are,
and I'd be wary of it.
On 8/4/25 14:51, David Hildenbrand wrote:
> On 04.08.25 14:39, Li Qiang wrote:
>> This change converts several critical page table zapping functions from
>> `inline` to `__always_inline`, resulting in measurable performance
>> improvements in process spawning workloads.
>>
>> Performance Impact (Intel Xeon Gold 6430 2.1GHz):
>> - UnixBench 'context1' test shows ~6% improvement (single-core)
>> - UnixBench overall score shows ~0.6% improvement (single-core)
>> - mm/memory.o size reduced by 2.49% (70190 -> 68445 bytes)
>> - Net code reduction of 1745 bytes (add/remove: 211/166)
>>
>> The modified functions form a hot path during process teardown:
>> 1. zap_present_ptes()
>> 2. do_zap_pte_range()
>> 3. zap_pte_range()
>> 4. zap_pmd_range()
>>
>> Signed-off-by: Li Qiang <liqiang01@kylinos.cn>
>> ---
>
> What's the object file size change?

The output of ./scripts/bloat-o-meter would be the most informative.
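A sketch of the kind of comparison being asked for, run directly on the
object file (the .before copy is a placeholder name):

# Keep a copy of the object built without the patch, then rebuild and compare.
cp mm/memory.o mm/memory.o.before
# ... apply the patch and rebuild mm/memory.o ...
./scripts/bloat-o-meter mm/memory.o.before mm/memory.o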
On Mon, Aug 04, 2025 at 08:39:23PM +0800, Li Qiang wrote:
> This change converts several critical page table zapping functions from
> `inline` to `__always_inline`, resulting in measurable performance
> improvements in process spawning workloads.
>
> Performance Impact (Intel Xeon Gold 6430 2.1GHz):
> - UnixBench 'context1' test shows ~6% improvement (single-core)
> - UnixBench overall score shows ~0.6% improvement (single-core)

These aren't exactly earth-shattering. Are we sure these are
representative of real workloads? Spawning a bazillion processes is not
really meaningful.

> - mm/memory.o size reduced by 2.49% (70190 -> 68445 bytes)
> - Net code reduction of 1745 bytes (add/remove: 211/166)
>
> The modified functions form a hot path during process teardown:
> 1. zap_present_ptes()
> 2. do_zap_pte_range()
> 3. zap_pte_range()
> 4. zap_pmd_range()
>
> Signed-off-by: Li Qiang <liqiang01@kylinos.cn>

I think others have covered this well, but we've had patches like this
before where, in essence, it's a case of 'improves things on my machine'.

The question really is _why_ your compiler is not making these inline in
the first place.

I'm no compiler expert, but the inline here I believe is redundant anyway
within a compilation unit so the compiler will make an inline decision
regardless.

These are pretty big functions though. You're essentially inlining
everything into a mega function in unmap_page_range(). Which seems iffy.

I wonder if we might see degradation in other workloads? And you're
talking about one architecture, not others...

I feel like you'd really need to justify with information on the compiler
(ideally with insights into why it's not inlining now), how it impacts
other architectures, _real workloads_ you've observed this matter for,
etc. for this to be justifiable.

Also are you sure it has to be _every_ level in the hierarchy? What
happens if you inline only e.g. zap_present_ptes(), as we do with
zap_present_folio_ptes() already? (Fact that's _also_ inlined makes this
a mega giant chonker inlined function also...).

I guess bloat is less of an issue as it's all going inside a non-inlined
function. But how this behaves in places other than 'not entirely
convincing benchmark on one architecture/uarch' is key here I think.

I don't think I'll really be convinced until there's quite a bit more
data to back this up with real-world usage.

> ---
> mm/memory.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b0cda5aab398..281a353fae7b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1543,7 +1543,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
>   *
>   * Returns the number of processed (skipped or zapped) PTEs (at least 1).
>   */
> -static inline int zap_present_ptes(struct mmu_gather *tlb,
> +static __always_inline int zap_present_ptes(struct mmu_gather *tlb,
>                  struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
>                  unsigned int max_nr, unsigned long addr,
>                  struct zap_details *details, int *rss, bool *force_flush,
> @@ -1662,7 +1662,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
>          return nr;
>  }
>
> -static inline int do_zap_pte_range(struct mmu_gather *tlb,
> +static __always_inline int do_zap_pte_range(struct mmu_gather *tlb,
>                  struct vm_area_struct *vma, pte_t *pte,
>                  unsigned long addr, unsigned long end,
>                  struct zap_details *details, int *rss,
> @@ -1698,7 +1698,7 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
>          return nr;
>  }
>
> -static unsigned long zap_pte_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_pte_range(struct mmu_gather *tlb,
>                  struct vm_area_struct *vma, pmd_t *pmd,
>                  unsigned long addr, unsigned long end,
>                  struct zap_details *details)
> @@ -1790,7 +1790,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>          return addr;
>  }
>
> -static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>                  struct vm_area_struct *vma, pud_t *pud,
>                  unsigned long addr, unsigned long end,
>                  struct zap_details *details)
> @@ -1832,7 +1832,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>          return addr;
>  }
>
> -static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_pud_range(struct mmu_gather *tlb,
>                  struct vm_area_struct *vma, p4d_t *p4d,
>                  unsigned long addr, unsigned long end,
>                  struct zap_details *details)
> @@ -1861,7 +1861,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
>          return addr;
>  }
>
> -static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
>                  struct vm_area_struct *vma, pgd_t *pgd,
>                  unsigned long addr, unsigned long end,
>                  struct zap_details *details)
> --
> 2.25.1
>
OK,

So I hacked -fopt-info-inline-all into the mm/ Makefile in a rather quick and
dirty way and it seems some stuff gets inlined locally, but we're mostly hitting
the '--param max-inline-insns-single limit reached' limit here.

Which maybe is just a point where the compiler possibly arbitrarily gives up?

Vlasta rightly pointed out off-list that given this appears to only be used in
one place you'd expect inlining as register spill isn't such a concern (we'll
spill saving the stack before function invocation anyway).

So there might actually be some validity here?

This is gcc 15.1.1 running on an x86-64 platform by the way.

mm/memory.c:1871:10: optimized: Inlined zap_p4d_range/6380 into unmap_page_range/6381 which now has time 1458.996712 and size 65, net change of -11.
mm/memory.c:1850:10: optimized: Inlined zap_pud_range.isra/8017 into zap_p4d_range/6380 which now has time 10725.428482 and size 29, net change of -12.
mm/memory.c:1829:10: missed: not inlinable: zap_pud_range.isra/8017 -> zap_pmd_range.isra/8018, --param max-inline-insns-single limit reached
mm/memory.c:1800:10: missed: not inlinable: zap_pmd_range.isra/8018 -> zap_pte_range/6377, --param max-inline-insns-auto limit reached
mm/memory.c:1708:8: optimized: Inlined do_zap_pte_range.constprop/7983 into zap_pte_range/6377 which now has time 4244.320854 and size 148, net change of -15.
mm/memory.c:1664:9: missed: not inlinable: do_zap_pte_range.constprop/7983 -> zap_present_ptes.constprop/7985, --param max-inline-insns-single limit reached

Cheers, Lorenzo
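The Makefile hack described above presumably amounts to something like the
following one-liner (a guess at the exact form; ccflags-y is the standard
kbuild variable for per-directory compiler flags):

# In mm/Makefile: dump GCC inlining decisions for every object in mm/.
ccflags-y += -fopt-info-inline-all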
On 8/4/25 15:59, Lorenzo Stoakes wrote:
> OK,
>
> So I hacked -fopt-info-inline-all into the mm/ Makefile in a rather quick and
> dirty way and it seems some stuff gets inlined locally, but we're mostly hitting
> the '--param max-inline-insns-single limit reached' limit here.
>
> Which maybe is just a point where the compiler possibly arbitrarily gives up?
>
> Vlasta rightly pointed out off-list that given this appears to only be used in
> one place you'd expect inlining as register spill isn't such a concern (we'll
> spill saving the stack before function invocation anyway).
>
> So there might actually be some validity here?
>
> This is gcc 15.1.1 running on an x86-64 platform by the way.
>
> mm/memory.c:1871:10: optimized: Inlined zap_p4d_range/6380 into unmap_page_range/6381 which now has time 1458.996712 and size 65, net change of -11.
> mm/memory.c:1850:10: optimized: Inlined zap_pud_range.isra/8017 into zap_p4d_range/6380 which now has time 10725.428482 and size 29, net change of -12.
> mm/memory.c:1829:10: missed: not inlinable: zap_pud_range.isra/8017 -> zap_pmd_range.isra/8018, --param max-inline-insns-single limit reached
> mm/memory.c:1800:10: missed: not inlinable: zap_pmd_range.isra/8018 -> zap_pte_range/6377, --param max-inline-insns-auto limit reached
> mm/memory.c:1708:8: optimized: Inlined do_zap_pte_range.constprop/7983 into zap_pte_range/6377 which now has time 4244.320854 and size 148, net change of -15.
> mm/memory.c:1664:9: missed: not inlinable: do_zap_pte_range.constprop/7983 -> zap_present_ptes.constprop/7985, --param max-inline-insns-single limit reached

I got some weird bloat-o-meter on this patch:

add/remove: 1/0 grow/shrink: 2/2 up/down: 693/-31 (662)
Function               old     new   delta
__handle_mm_fault     3817    4403    +586
do_swap_page          4497    4560     +63
mksaveddirty_shift       -      44     +44
unmap_page_range      4843    4828     -15
copy_page_range       6497    6481     -16

but even without this patch, "objdump -t mm/memory.o" shows no zap
functions, so they are already inlined? gcc also 15.1.1 but maybe
opensuse has some non-default tunings.
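The symbol-table check mentioned above can be reproduced with something
like this (a sketch; the grep pattern is just a convenience):

# List remaining zap_* symbols in the object file; an empty result means
# the helpers were already inlined into their callers.
objdump -t mm/memory.o | grep ' zap_'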
> On 4 Aug 2025, at 16:59, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> OK,
>
> So I hacked -fopt-info-inline-all into the mm/ Makefile in a rather quick and
> dirty way and it seems some stuff gets inlined locally, but we're mostly hitting
> the '--param max-inline-insns-single limit reached' limit here.

Yes, it does require further investigation.

My point is that sprinkling __always_inline is a slippery slope. You
start with putting __always_inline on zap_present_folio_ptes (as
currently done), and then the caller becomes expensive. Now you notice
that the caller to zap_present_folio_ptes is not getting inlined, which
is not surprising because it got the cost of the always-inlined callee,
so you put __always_inline there, and so on.