arch/loongarch/include/asm/syscall.h | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
Loongarch use -fno-builtin-memcpy in Makefile,
so cause a extra memcpy in syscall hot path:
syscall_enter_audit
->syscall_get_arguments->memcpy(__memcpy_fast)
->audit_syscall_entry
Just avoid memcpy() altogether and write the copying of args from regs
manually, which shows 5% multi core score up in UnixBench syscall.
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
---
arch/loongarch/include/asm/syscall.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/loongarch/include/asm/syscall.h b/arch/loongarch/include/asm/syscall.h
index 81d2733f7b94..171af2edd569 100644
--- a/arch/loongarch/include/asm/syscall.h
+++ b/arch/loongarch/include/asm/syscall.h
@@ -65,7 +65,11 @@ static inline void syscall_get_arguments(struct task_struct *task,
unsigned long *args)
{
args[0] = regs->orig_a0;
- memcpy(&args[1], ®s->regs[5], 5 * sizeof(long));
+ args[1] = regs->regs[5];
+ args[2] = regs->regs[6];
+ args[3] = regs->regs[7];
+ args[4] = regs->regs[8];
+ args[5] = regs->regs[9];
}
static inline void syscall_set_arguments(struct task_struct *task,
--
2.20.1
On Tue, 2025-08-26 at 19:32 +0800, Wentao Guan wrote: > Loongarch use -fno-builtin-memcpy in Makefile, I still think we should just remove -fno-builtin-memcpy if GCC >= 14. There's some wide-spread misunderstanding like "-fno-builtin-memcpy can stop the compiler from using memcpy for copying small structs" but the fact is very much contrary to it. The compiler just uses memcpy for copying the structs of which the size is larger than a threshold, no matter -fbuiltin-memcpy or not, but with -fbuiltin-memcpy the compiler can expand some of them into ld/st sequences, reducing the actual numbers of memcpy calls. The reason we didn't use -fbuiltin-memcpy is GCC < 14 expands memcpy for various sizes in a very stupid way, see https://gcc.gnu.org/PR109465. -- Xi Ruoyao <xry111@xry111.site>
Hello Ruoyao, Thanks for your review. I will test and send test results later. (use different gcc verions and remove the three flags.) BRs Wentao Guan
Hello Ruoyao, The quick test results are very interesting...: UnixBench syscall on 3A6000 platform: ./Run syscall -i 10 -c 8 kernel source: v6.16 kernel config: https://github.com/deepin-community/kernel/blob/linux-6.12.y/arch/loongarch/configs/deepin_loongarch_desktop_defconfig Use this GCC 15[1]: result: 1. with no patch: System Call Overhead 15000.0 8893011.8 5928.7 2. with patch [2] to disable -fno-builtin-memcpy -fno-builtin-memmove -fno-builtin-memset: System Call Overhead 15000.0 8510415.2 5673.6 lower than patch??? 3. with my origin patch [3](+4.9%): System Call Overhead 15000.0 9336022.1 6224.0 BRs Wentao Guan [1]: https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/15.2.0 [2]: diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index 4b19f9337..53bbcf31c 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -746,6 +746,9 @@ config ARCH_SUSPEND_POSSIBLE config ARCH_HIBERNATION_POSSIBLE def_bool y +config ARCH_MEMORY_LIBCALL_DISABLE + def_bool KASAN || GCC_VERSION >= 140000 + source "kernel/power/Kconfig" source "drivers/acpi/Kconfig" source "drivers/cpufreq/Kconfig" diff --git a/arch/loongarch/Makefile b/arch/loongarch/Makefile index b0703a4e0..fba837b22 100644 --- a/arch/loongarch/Makefile +++ b/arch/loongarch/Makefile @@ -120,7 +120,7 @@ endif cflags-y += $(call cc-option, -mno-check-zero-division) -ifndef CONFIG_KASAN +ifeq ($(CONFIG_ARCH_MEMORY_LIBCALL_DISABLE),) cflags-y += -fno-builtin-memcpy -fno-builtin-memmove -fno-builtin-memset endif [3]: diff --git a/arch/loongarch/include/asm/syscall.h b/arch/loongarch/include/asm/syscall.h index 81d2733f7..171af2edd 100644 --- a/arch/loongarch/include/asm/syscall.h +++ b/arch/loongarch/include/asm/syscall.h @@ -65,7 +65,11 @@ static inline void syscall_get_arguments(struct task_struct *task, unsigned long *args) { args[0] = regs->orig_a0; - memcpy(&args[1], ®s->regs[5], 5 * sizeof(long)); + args[1] = regs->regs[5]; + args[2] = regs->regs[6]; + args[3] = regs->regs[7]; + args[4] = regs->regs[8]; + args[5] = regs->regs[9]; } static inline void syscall_set_arguments(struct task_struct *task, [4]: perf diff # Event 'cycles:P' # # Baseline Delta Abs Shared Object Symbol > # ........ ......... .................................. .....................................................> # 12.72% -7.07% [kernel.kallsyms] [k] syscall_trace_enter 0.00% +2.94% [kernel.kallsyms] [k] __memcpy_fast 33.69% +1.41% [kernel.kallsyms] [k] do_syscall 2.64% +0.32% [kernel.kallsyms] [k] __audit_syscall_entry 4.92% +0.32% libc.so.6 [.] __GI___umask 3.54% +0.30% [kernel.kallsyms] [k] alloc_fd 4.36% +0.24% libc.so.6 [.] __GI___getuid 4.63% +0.24% libc.so.6 [.] syscall 4.50% +0.22% libc.so.6 [.] __GI___dup 4.84% +0.22% [kernel.kallsyms] [k] syscall_exit_work 4.81% +0.22% libc.so.6 [.] __close +0.20% [kernel.kallsyms] [k] __memcpy 1.48% +0.15% [kernel.kallsyms] [k] audit_reset_context 0.69% +0.13% [kernel.kallsyms] [k] __se_sys_close 2.13% +0.10% [kernel.kallsyms] [k] file_close_fd_locked 0.86% +0.08% [kernel.kallsyms] [k] map_id_range_up 0.71% +0.06% [kernel.kallsyms] [k] __task_pid_nr_ns 0.41% +0.04% [kernel.kallsyms] [k] fd_install 0.56% +0.04% [kernel.kallsyms] [k] __rcu_read_unlock 0.57% +0.04% [kernel.kallsyms] [k] fput_close_sync 0.07% -0.04% [kernel.kallsyms] [k] idle_exit 1.37% +0.04% [kernel.kallsyms] [k] _raw_spin_lock 0.69% +0.03% [kernel.kallsyms] [k] filp_flush 0.76% +0.03% syscall [.] main 0.52% +0.03% [kernel.kallsyms] [k] from_kuid_munged 0.45% +0.03% [kernel.kallsyms] [k] _raw_spin_unlock 0.07% -0.03% [kernel.kallsyms] [k] finish_task_switch.isra.0 1.72% +0.03% [kernel.kallsyms] [k] ktime_get_coarse_real_ts64 0.67% +0.03% [kernel.kallsyms] [k] __rcu_read_lock 0.03% -0.03% deepin-deepinid-daemon [.] 0x0000000000003018 1.28% -0.02% [kernel.kallsyms] [k] __audit_syscall_exit 0.12% +0.02% syscall [.] getuid@plt 0.02% -0.02% dde-session-daemon [.] 0x0000000000299d10 0.02% -0.02% ld-linux-loongarch-lp64d.so.1 [.] do_lookup_x 0.03% -0.01% [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 0.01% -0.01% ld-linux-loongarch-lp64d.so.1 [.] _dl_lookup_symbol_x 0.30% +0.01% [kernel.kallsyms] [k] sys_getpid 0.14% -0.01% syscall [.] umask@plt 0.01% -0.01% libdbus-1.so.3.32.4 [.] 0x0000000000015ac0 0.02% -0.01% [kernel.kallsyms] [k] clear_page 0.01% -0.01% deepin-sync-helper [.] 0x0000000000003a84 0.14% +0.01% syscall [.] close@plt
On Thu, Aug 28, 2025 at 3:06 PM Wentao Guan <guanwentao@uniontech.com> wrote: > > Hello Ruoyao, > > Thanks for your review. > I will test and send test results later. > (use different gcc verions and remove the three flags.) I think kernel's memcpy is still needed for CPUs without UAL. Huacai > > BRs > Wentao Guan
On Thu, 2025-08-28 at 15:09 +0800, Huacai Chen wrote: > On Thu, Aug 28, 2025 at 3:06 PM Wentao Guan <guanwentao@uniontech.com> > wrote: > > > > Hello Ruoyao, > > > > Thanks for your review. > > I will test and send test results later. > > (use different gcc verions and remove the three flags.) > I think kernel's memcpy is still needed for CPUs without UAL. GCC memcpy expansion correctly understands no-UAL case. It won't generate unaligned access unless -mno-strict-align. -- Xi Ruoyao <xry111@xry111.site>
On Thu, 2025-08-28 at 15:11 +0800, Xi Ruoyao wrote: > On Thu, 2025-08-28 at 15:09 +0800, Huacai Chen wrote: > > On Thu, Aug 28, 2025 at 3:06 PM Wentao Guan > > <guanwentao@uniontech.com> > > wrote: > > > > > > Hello Ruoyao, > > > > > > Thanks for your review. > > > I will test and send test results later. > > > (use different gcc verions and remove the three flags.) > > I think kernel's memcpy is still needed for CPUs without UAL. > > GCC memcpy expansion correctly understands no-UAL case. It won't > generate unaligned access unless -mno-strict-align. P.S. kernel's memcpy is indeed still needed because when GCC deduces it's not beneficial to expand the __builtin_memcpy into ld/st sequeces, it'll just call memcpy. -- Xi Ruoyao <xry111@xry111.site>
在 8/26/25 7:32 PM, Wentao Guan 写道: > Loongarch use -fno-builtin-memcpy in Makefile, > so cause a extra memcpy in syscall hot path: > > syscall_enter_audit > ->syscall_get_arguments->memcpy(__memcpy_fast) > ->audit_syscall_entry > > Just avoid memcpy() altogether and write the copying of args from regs > manually, which shows 5% multi core score up in UnixBench syscall. 5%? Awesome! > > Signed-off-by: Wentao Guan <guanwentao@uniontech.com> Reviewed-by: Yanteng Si <siyanteng@cqsoftware.com.cn> Thanks, Yanteng > --- > arch/loongarch/include/asm/syscall.h | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/arch/loongarch/include/asm/syscall.h b/arch/loongarch/include/asm/syscall.h > index 81d2733f7b94..171af2edd569 100644 > --- a/arch/loongarch/include/asm/syscall.h > +++ b/arch/loongarch/include/asm/syscall.h > @@ -65,7 +65,11 @@ static inline void syscall_get_arguments(struct task_struct *task, > unsigned long *args) > { > args[0] = regs->orig_a0; > - memcpy(&args[1], ®s->regs[5], 5 * sizeof(long)); > + args[1] = regs->regs[5]; > + args[2] = regs->regs[6]; > + args[3] = regs->regs[7]; > + args[4] = regs->regs[8]; > + args[5] = regs->regs[9]; > } > > static inline void syscall_set_arguments(struct task_struct *task,
Hello, Thanks for your review, I test the patch in our v6.6 kernel, in 3a6000 platform UnixBench results syscall 1 thread from 966 to 996 8 threads from 6173 to 6501. Best Regards Wentao Guan
On Thu, Aug 28, 2025 at 1:47 PM Yanteng Si <si.yanteng@linux.dev> wrote: > > 在 8/26/25 7:32 PM, Wentao Guan 写道: > > Loongarch use -fno-builtin-memcpy in Makefile, > > so cause a extra memcpy in syscall hot path: > > > > syscall_enter_audit > > ->syscall_get_arguments->memcpy(__memcpy_fast) > > ->audit_syscall_entry > > > > Just avoid memcpy() altogether and write the copying of args from regs > > manually, which shows 5% multi core score up in UnixBench syscall. > 5%? Awesome! > > > > Signed-off-by: Wentao Guan <guanwentao@uniontech.com> > Reviewed-by: Yanteng Si <siyanteng@cqsoftware.com.cn> But why this is a "fix"? It is an optimization. Huacai > > Thanks, > Yanteng > > --- > > arch/loongarch/include/asm/syscall.h | 6 +++++- > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > diff --git a/arch/loongarch/include/asm/syscall.h b/arch/loongarch/include/asm/syscall.h > > index 81d2733f7b94..171af2edd569 100644 > > --- a/arch/loongarch/include/asm/syscall.h > > +++ b/arch/loongarch/include/asm/syscall.h > > @@ -65,7 +65,11 @@ static inline void syscall_get_arguments(struct task_struct *task, > > unsigned long *args) > > { > > args[0] = regs->orig_a0; > > - memcpy(&args[1], ®s->regs[5], 5 * sizeof(long)); > > + args[1] = regs->regs[5]; > > + args[2] = regs->regs[6]; > > + args[3] = regs->regs[7]; > > + args[4] = regs->regs[8]; > > + args[5] = regs->regs[9]; > > } > > > > static inline void syscall_set_arguments(struct task_struct *task, >
© 2016 - 2025 Red Hat, Inc.