[PATCH] Loongarch: entry: fix syscall_get_arguments() VS no-bultin-memcpy

Wentao Guan posted 1 patch 1 month, 1 week ago
arch/loongarch/include/asm/syscall.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
[PATCH] Loongarch: entry: fix syscall_get_arguments() VS no-bultin-memcpy
Posted by Wentao Guan 1 month, 1 week ago
Loongarch use -fno-builtin-memcpy in Makefile,
so cause a extra memcpy in syscall hot path:

syscall_enter_audit
->syscall_get_arguments->memcpy(__memcpy_fast)
->audit_syscall_entry

Just avoid memcpy() altogether and write the copying of args from regs
manually, which shows 5% multi core score up in UnixBench syscall.

Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
---
 arch/loongarch/include/asm/syscall.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/loongarch/include/asm/syscall.h b/arch/loongarch/include/asm/syscall.h
index 81d2733f7b94..171af2edd569 100644
--- a/arch/loongarch/include/asm/syscall.h
+++ b/arch/loongarch/include/asm/syscall.h
@@ -65,7 +65,11 @@ static inline void syscall_get_arguments(struct task_struct *task,
 					 unsigned long *args)
 {
 	args[0] = regs->orig_a0;
-	memcpy(&args[1], &regs->regs[5], 5 * sizeof(long));
+	args[1] = regs->regs[5];
+	args[2] = regs->regs[6];
+	args[3] = regs->regs[7];
+	args[4] = regs->regs[8];
+	args[5] = regs->regs[9];
 }
 
 static inline void syscall_set_arguments(struct task_struct *task,
-- 
2.20.1
Re: [PATCH] Loongarch: entry: fix syscall_get_arguments() VS no-bultin-memcpy
Posted by Xi Ruoyao 1 month ago
On Tue, 2025-08-26 at 19:32 +0800, Wentao Guan wrote:
> Loongarch use -fno-builtin-memcpy in Makefile,

I still think we should just remove -fno-builtin-memcpy if GCC >= 14. 
There's some wide-spread misunderstanding like "-fno-builtin-memcpy can
stop the compiler from using memcpy for copying small structs" but the
fact is very much contrary to it.

The compiler just uses memcpy for copying the structs of which the size
is larger than a threshold, no matter -fbuiltin-memcpy or not, but with
-fbuiltin-memcpy the compiler can expand some of them into ld/st
sequences, reducing the actual numbers of memcpy calls.

The reason we didn't use -fbuiltin-memcpy is GCC < 14 expands memcpy for
various sizes in a very stupid way, see https://gcc.gnu.org/PR109465.

-- 
Xi Ruoyao <xry111@xry111.site>
Re: [PATCH] Loongarch: entry: fix syscall_get_arguments() VS no-bultin-memcpy
Posted by Wentao Guan 1 month ago
Hello Ruoyao,

Thanks for your review.
I will test and send test results later.
(use different gcc verions and remove the three flags.)

BRs
Wentao Guan
Re: [PATCH] Loongarch: entry: fix syscall_get_arguments() VS no-bultin-memcpy
Posted by Wentao Guan 1 month ago
Hello Ruoyao,

The quick test results are very interesting...:
UnixBench syscall on 3A6000 platform:
./Run syscall -i 10 -c 8
kernel source: v6.16
kernel config:
https://github.com/deepin-community/kernel/blob/linux-6.12.y/arch/loongarch/configs/deepin_loongarch_desktop_defconfig
Use this GCC 15[1]:

result: 
1. with no patch:
System Call Overhead                          15000.0    8893011.8   5928.7
2. with patch [2] to disable -fno-builtin-memcpy -fno-builtin-memmove -fno-builtin-memset:
System Call Overhead                          15000.0    8510415.2   5673.6
lower than patch???
3. with my origin patch [3](+4.9%): 
System Call Overhead                          15000.0    9336022.1   6224.0

BRs
Wentao Guan

[1]: https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/15.2.0
[2]:
diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 4b19f9337..53bbcf31c 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -746,6 +746,9 @@ config ARCH_SUSPEND_POSSIBLE
 config ARCH_HIBERNATION_POSSIBLE
        def_bool y
 
+config ARCH_MEMORY_LIBCALL_DISABLE
+       def_bool KASAN || GCC_VERSION >= 140000
+
 source "kernel/power/Kconfig"
 source "drivers/acpi/Kconfig"
 source "drivers/cpufreq/Kconfig"
diff --git a/arch/loongarch/Makefile b/arch/loongarch/Makefile
index b0703a4e0..fba837b22 100644
--- a/arch/loongarch/Makefile
+++ b/arch/loongarch/Makefile
@@ -120,7 +120,7 @@ endif
 
 cflags-y += $(call cc-option, -mno-check-zero-division)
 
-ifndef CONFIG_KASAN
+ifeq ($(CONFIG_ARCH_MEMORY_LIBCALL_DISABLE),)
 cflags-y += -fno-builtin-memcpy -fno-builtin-memmove -fno-builtin-memset
 endif
[3]:
diff --git a/arch/loongarch/include/asm/syscall.h b/arch/loongarch/include/asm/syscall.h
index 81d2733f7..171af2edd 100644
--- a/arch/loongarch/include/asm/syscall.h
+++ b/arch/loongarch/include/asm/syscall.h
@@ -65,7 +65,11 @@ static inline void syscall_get_arguments(struct task_struct *task,
                                         unsigned long *args)
 {
        args[0] = regs->orig_a0;
-       memcpy(&args[1], &regs->regs[5], 5 * sizeof(long));
+       args[1] = regs->regs[5];
+       args[2] = regs->regs[6];
+       args[3] = regs->regs[7];
+       args[4] = regs->regs[8];
+       args[5] = regs->regs[9];
 }
 
 static inline void syscall_set_arguments(struct task_struct *task,

[4]: perf diff
# Event 'cycles:P'
#
# Baseline  Delta Abs  Shared Object                       Symbol                                               >
# ........  .........  ..................................  .....................................................>
#
    12.72%     -7.07%  [kernel.kallsyms]                   [k] syscall_trace_enter
     0.00%     +2.94%  [kernel.kallsyms]                   [k] __memcpy_fast
    33.69%     +1.41%  [kernel.kallsyms]                   [k] do_syscall
     2.64%     +0.32%  [kernel.kallsyms]                   [k] __audit_syscall_entry
     4.92%     +0.32%  libc.so.6                           [.] __GI___umask
     3.54%     +0.30%  [kernel.kallsyms]                   [k] alloc_fd
     4.36%     +0.24%  libc.so.6                           [.] __GI___getuid
     4.63%     +0.24%  libc.so.6                           [.] syscall
     4.50%     +0.22%  libc.so.6                           [.] __GI___dup
     4.84%     +0.22%  [kernel.kallsyms]                   [k] syscall_exit_work
     4.81%     +0.22%  libc.so.6                           [.] __close
               +0.20%  [kernel.kallsyms]                   [k] __memcpy
     1.48%     +0.15%  [kernel.kallsyms]                   [k] audit_reset_context
     0.69%     +0.13%  [kernel.kallsyms]                   [k] __se_sys_close
     2.13%     +0.10%  [kernel.kallsyms]                   [k] file_close_fd_locked
     0.86%     +0.08%  [kernel.kallsyms]                   [k] map_id_range_up
     0.71%     +0.06%  [kernel.kallsyms]                   [k] __task_pid_nr_ns
     0.41%     +0.04%  [kernel.kallsyms]                   [k] fd_install
     0.56%     +0.04%  [kernel.kallsyms]                   [k] __rcu_read_unlock
     0.57%     +0.04%  [kernel.kallsyms]                   [k] fput_close_sync
     0.07%     -0.04%  [kernel.kallsyms]                   [k] idle_exit
     1.37%     +0.04%  [kernel.kallsyms]                   [k] _raw_spin_lock
     0.69%     +0.03%  [kernel.kallsyms]                   [k] filp_flush
     0.76%     +0.03%  syscall                             [.] main
     0.52%     +0.03%  [kernel.kallsyms]                   [k] from_kuid_munged
     0.45%     +0.03%  [kernel.kallsyms]                   [k] _raw_spin_unlock
     0.07%     -0.03%  [kernel.kallsyms]                   [k] finish_task_switch.isra.0
     1.72%     +0.03%  [kernel.kallsyms]                   [k] ktime_get_coarse_real_ts64
     0.67%     +0.03%  [kernel.kallsyms]                   [k] __rcu_read_lock
     0.03%     -0.03%  deepin-deepinid-daemon              [.] 0x0000000000003018
     1.28%     -0.02%  [kernel.kallsyms]                   [k] __audit_syscall_exit
     0.12%     +0.02%  syscall                             [.] getuid@plt
     0.02%     -0.02%  dde-session-daemon                  [.] 0x0000000000299d10
     0.02%     -0.02%  ld-linux-loongarch-lp64d.so.1       [.] do_lookup_x
     0.03%     -0.01%  [kernel.kallsyms]                   [k] _raw_spin_unlock_irqrestore
     0.01%     -0.01%  ld-linux-loongarch-lp64d.so.1       [.] _dl_lookup_symbol_x
     0.30%     +0.01%  [kernel.kallsyms]                   [k] sys_getpid
     0.14%     -0.01%  syscall                             [.] umask@plt
     0.01%     -0.01%  libdbus-1.so.3.32.4                 [.] 0x0000000000015ac0
     0.02%     -0.01%  [kernel.kallsyms]                   [k] clear_page
     0.01%     -0.01%  deepin-sync-helper                  [.] 0x0000000000003a84
     0.14%     +0.01%  syscall                             [.] close@plt
Re: [PATCH] Loongarch: entry: fix syscall_get_arguments() VS no-bultin-memcpy
Posted by Huacai Chen 1 month ago
On Thu, Aug 28, 2025 at 3:06 PM Wentao Guan <guanwentao@uniontech.com> wrote:
>
> Hello Ruoyao,
>
> Thanks for your review.
> I will test and send test results later.
> (use different gcc verions and remove the three flags.)
I think kernel's memcpy is still needed for CPUs without UAL.

Huacai

>
> BRs
> Wentao Guan
Re: [PATCH] Loongarch: entry: fix syscall_get_arguments() VS no-bultin-memcpy
Posted by Xi Ruoyao 1 month ago
On Thu, 2025-08-28 at 15:09 +0800, Huacai Chen wrote:
> On Thu, Aug 28, 2025 at 3:06 PM Wentao Guan <guanwentao@uniontech.com>
> wrote:
> > 
> > Hello Ruoyao,
> > 
> > Thanks for your review.
> > I will test and send test results later.
> > (use different gcc verions and remove the three flags.)
> I think kernel's memcpy is still needed for CPUs without UAL.

GCC memcpy expansion correctly understands no-UAL case.  It won't
generate unaligned access unless -mno-strict-align.

-- 
Xi Ruoyao <xry111@xry111.site>
Re: [PATCH] Loongarch: entry: fix syscall_get_arguments() VS no-bultin-memcpy
Posted by Xi Ruoyao 1 month ago
On Thu, 2025-08-28 at 15:11 +0800, Xi Ruoyao wrote:
> On Thu, 2025-08-28 at 15:09 +0800, Huacai Chen wrote:
> > On Thu, Aug 28, 2025 at 3:06 PM Wentao Guan
> > <guanwentao@uniontech.com>
> > wrote:
> > > 
> > > Hello Ruoyao,
> > > 
> > > Thanks for your review.
> > > I will test and send test results later.
> > > (use different gcc verions and remove the three flags.)
> > I think kernel's memcpy is still needed for CPUs without UAL.
> 
> GCC memcpy expansion correctly understands no-UAL case.  It won't
> generate unaligned access unless -mno-strict-align.

P.S. kernel's memcpy is indeed still needed because when GCC deduces
it's not beneficial to expand the __builtin_memcpy into ld/st sequeces,
it'll just call memcpy.

-- 
Xi Ruoyao <xry111@xry111.site>
Re: [PATCH] Loongarch: entry: fix syscall_get_arguments() VS no-bultin-memcpy
Posted by Yanteng Si 1 month ago
在 8/26/25 7:32 PM, Wentao Guan 写道:
> Loongarch use -fno-builtin-memcpy in Makefile,
> so cause a extra memcpy in syscall hot path:
> 
> syscall_enter_audit
> ->syscall_get_arguments->memcpy(__memcpy_fast)
> ->audit_syscall_entry
> 
> Just avoid memcpy() altogether and write the copying of args from regs
> manually, which shows 5% multi core score up in UnixBench syscall.
5%? Awesome!
> 
> Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Reviewed-by: Yanteng Si <siyanteng@cqsoftware.com.cn>

Thanks,
Yanteng
> ---
>   arch/loongarch/include/asm/syscall.h | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/loongarch/include/asm/syscall.h b/arch/loongarch/include/asm/syscall.h
> index 81d2733f7b94..171af2edd569 100644
> --- a/arch/loongarch/include/asm/syscall.h
> +++ b/arch/loongarch/include/asm/syscall.h
> @@ -65,7 +65,11 @@ static inline void syscall_get_arguments(struct task_struct *task,
>   					 unsigned long *args)
>   {
>   	args[0] = regs->orig_a0;
> -	memcpy(&args[1], &regs->regs[5], 5 * sizeof(long));
> +	args[1] = regs->regs[5];
> +	args[2] = regs->regs[6];
> +	args[3] = regs->regs[7];
> +	args[4] = regs->regs[8];
> +	args[5] = regs->regs[9];
>   }
>   
>   static inline void syscall_set_arguments(struct task_struct *task,

Re: [PATCH] Loongarch: entry: fix syscall_get_arguments() VS no-bultin-memcpy
Posted by Wentao Guan 1 month ago
Hello,

Thanks for your review,
I test the patch in our v6.6 kernel,
in 3a6000 platform UnixBench results
syscall 1 thread from 966 to 996
8 threads from 6173 to 6501.

Best Regards
Wentao Guan
Re: [PATCH] Loongarch: entry: fix syscall_get_arguments() VS no-bultin-memcpy
Posted by Huacai Chen 1 month ago
On Thu, Aug 28, 2025 at 1:47 PM Yanteng Si <si.yanteng@linux.dev> wrote:
>
> 在 8/26/25 7:32 PM, Wentao Guan 写道:
> > Loongarch use -fno-builtin-memcpy in Makefile,
> > so cause a extra memcpy in syscall hot path:
> >
> > syscall_enter_audit
> > ->syscall_get_arguments->memcpy(__memcpy_fast)
> > ->audit_syscall_entry
> >
> > Just avoid memcpy() altogether and write the copying of args from regs
> > manually, which shows 5% multi core score up in UnixBench syscall.
> 5%? Awesome!
> >
> > Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
> Reviewed-by: Yanteng Si <siyanteng@cqsoftware.com.cn>
But why this is a "fix"? It is an optimization.

Huacai

>
> Thanks,
> Yanteng
> > ---
> >   arch/loongarch/include/asm/syscall.h | 6 +++++-
> >   1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/loongarch/include/asm/syscall.h b/arch/loongarch/include/asm/syscall.h
> > index 81d2733f7b94..171af2edd569 100644
> > --- a/arch/loongarch/include/asm/syscall.h
> > +++ b/arch/loongarch/include/asm/syscall.h
> > @@ -65,7 +65,11 @@ static inline void syscall_get_arguments(struct task_struct *task,
> >                                        unsigned long *args)
> >   {
> >       args[0] = regs->orig_a0;
> > -     memcpy(&args[1], &regs->regs[5], 5 * sizeof(long));
> > +     args[1] = regs->regs[5];
> > +     args[2] = regs->regs[6];
> > +     args[3] = regs->regs[7];
> > +     args[4] = regs->regs[8];
> > +     args[5] = regs->regs[9];
> >   }
> >
> >   static inline void syscall_set_arguments(struct task_struct *task,
>
Re: [PATCH] Loongarch: entry: fix syscall_get_arguments() VS no-bultin-memcpy
Posted by Wentao Guan 1 month ago
Dear Huacai,

Emmm, let change the title fix to optimize...

BRs
Wentao Guan