mm/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
From: jiangyunshui <jiangyunshui@kylinos.cn>
Since commit bcd51a3c679d ("hugetlb: lazy page table copies
in fork()"), the logic about judging whether to copy
page table inside func copy_page_range has been extracted
into a separate func vma_needs_copy. While this change
improves code readability, it also incurs more function call
overhead, especially where fork() were frequently called.
Inline func vma_needs_copy to optimize the copy_page_range
performance. Given that func vma_needs_copy is only called
by copy_page_range, inlining it would not cause unacceptable
code bloat.
Testing was done with the byte-unixbench spawn benchmark
(which frequently calls fork). I measured 1.7% improvement
on x86 and 1.8% improvement on arm64.
Signed-off-by: jiangyunshui <jiangyunshui@kylinos.cn>
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index 8eba595056fe..d15b07f96ab1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1337,7 +1337,7 @@ copy_p4d_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
* false when we can speed up fork() by allowing lazy page faults later until
* when the child accesses the memory range.
*/
-static bool
+static __always_inline bool
vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
{
/*
--
2.47.1
On Wed, Jun 18, 2025 at 09:42:09AM +0800, Yunshui Jiang wrote: > From: jiangyunshui <jiangyunshui@kylinos.cn> > > Since commit bcd51a3c679d ("hugetlb: lazy page table copies > in fork()"), the logic about judging whether to copy > page table inside func copy_page_range has been extracted > into a separate func vma_needs_copy. While this change > improves code readability, it also incurs more function call > overhead, especially where fork() were frequently called. > > Inline func vma_needs_copy to optimize the copy_page_range > performance. Given that func vma_needs_copy is only called > by copy_page_range, inlining it would not cause unacceptable > code bloat. > > Testing was done with the byte-unixbench spawn benchmark > (which frequently calls fork). I measured 1.7% improvement > on x86 and 1.8% improvement on arm64. As per others you are going to need to provide details of your compiler setup because modern compilers are inlining this already. if it's ye olde compiler I'm not sure this is justified... > > Signed-off-by: jiangyunshui <jiangyunshui@kylinos.cn> > --- > mm/memory.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 8eba595056fe..d15b07f96ab1 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1337,7 +1337,7 @@ copy_p4d_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, > * false when we can speed up fork() by allowing lazy page faults later until > * when the child accesses the memory range. > */ > -static bool > +static __always_inline bool > vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) > { This needs to live in vma.h probably... todo++... > /* > -- > 2.47.1 >
On 6/18/25 3:42 AM, Yunshui Jiang wrote: > From: jiangyunshui <jiangyunshui@kylinos.cn> > > Since commit bcd51a3c679d ("hugetlb: lazy page table copies > in fork()"), the logic about judging whether to copy > page table inside func copy_page_range has been extracted > into a separate func vma_needs_copy. While this change > improves code readability, it also incurs more function call > overhead, especially where fork() were frequently called. > > Inline func vma_needs_copy to optimize the copy_page_range > performance. Given that func vma_needs_copy is only called > by copy_page_range, inlining it would not cause unacceptable > code bloat. I'm surprised the compiler doesn't inline it already, if there's a single caller. In fact, mine (gcc-14.3 on x86) already does. So I wonder to which extent should we force override wrong compiler heuristics? Maybe just inline instead of __always_inline would be OK? Is that enough of a hint for your compiler? > Testing was done with the byte-unixbench spawn benchmark > (which frequently calls fork). I measured 1.7% improvement > on x86 and 1.8% improvement on arm64. > > Signed-off-by: jiangyunshui <jiangyunshui@kylinos.cn> > --- > mm/memory.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 8eba595056fe..d15b07f96ab1 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1337,7 +1337,7 @@ copy_p4d_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, > * false when we can speed up fork() by allowing lazy page faults later until > * when the child accesses the memory range. > */ > -static bool > +static __always_inline bool > vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) > { > /*
On 18.06.25 11:39, Vlastimil Babka wrote: > On 6/18/25 3:42 AM, Yunshui Jiang wrote: >> From: jiangyunshui <jiangyunshui@kylinos.cn> >> >> Since commit bcd51a3c679d ("hugetlb: lazy page table copies >> in fork()"), the logic about judging whether to copy >> page table inside func copy_page_range has been extracted >> into a separate func vma_needs_copy. While this change >> improves code readability, it also incurs more function call >> overhead, especially where fork() were frequently called. >> >> Inline func vma_needs_copy to optimize the copy_page_range >> performance. Given that func vma_needs_copy is only called >> by copy_page_range, inlining it would not cause unacceptable >> code bloat. > > I'm surprised the compiler doesn't inline it already, if there's a > single caller. In fact, mine (gcc-14.3 on x86) already does. Same here gcc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) > > So I wonder to which extent should we force override wrong compiler > heuristics? I think only where we (a) really know better (b) it's crucial for performance and (c) modern compilers don't do the right thing. > Maybe just inline instead of __always_inline would be OK? Is > that enough of a hint for your compiler? I would do the same thing (inline as a hint), but some people apparently recently argued that newer compiler mostly ignore it either way. At least that's what I recall. -- Cheers, David / dhildenb
* David Hildenbrand <david@redhat.com> [250618 06:03]: > On 18.06.25 11:39, Vlastimil Babka wrote: > > On 6/18/25 3:42 AM, Yunshui Jiang wrote: > > > From: jiangyunshui <jiangyunshui@kylinos.cn> > > > > > > Since commit bcd51a3c679d ("hugetlb: lazy page table copies > > > in fork()"), the logic about judging whether to copy > > > page table inside func copy_page_range has been extracted > > > into a separate func vma_needs_copy. While this change > > > improves code readability, it also incurs more function call > > > overhead, especially where fork() were frequently called. > > > > > > Inline func vma_needs_copy to optimize the copy_page_range > > > performance. Given that func vma_needs_copy is only called > > > by copy_page_range, inlining it would not cause unacceptable > > > code bloat. > > > > I'm surprised the compiler doesn't inline it already, if there's a > > single caller. In fact, mine (gcc-14.3 on x86) already does. > > Same here > > gcc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) > > > > So I wonder to which extent should we force override wrong compiler > > heuristics? > > I think only where we (a) really know better (b) it's crucial for > performance and (c) modern compilers don't do the right thing. > > > Maybe just inline instead of __always_inline would be OK? Is > > that enough of a hint for your compiler? > > I would do the same thing (inline as a hint), but some people apparently > recently argued that newer compiler mostly ignore it either way. At least > that's what I recall. I have seen inline ignored, but not with a single caller. Perhaps godbolt could find a reason this is necessary, otherwise we should just leave it, IMO. Cheers, Liam
Initially, I added explicit inline hints in the code precisely because some newer compilers might ignore such suggestions. Based on everyone's feedback, I rechecked my compiler configurations —ARM64 uses GCC 12.3, while x86 uses GCC 12.4. By analyzing the disassembly of vmlinux via objdump, I confirmed that vma_needs_copy is indeed inlined in both cases. In this case the score difference in spawn might just be due to fluctuations. Leave it unchanged might be better. If we later encounter issues where older compilers fail to inline single-caller cases or multiple calls are incurred due to code changes, we could re-discuss whether manual intervention is needed. Best regards, Yunshui Jiang
© 2016 - 2025 Red Hat, Inc.