riscv: optimize Vector context restore on syscall

[PATCH v3 0/4] riscv: optimize Vector context restore on syscall

Posted by Andy Chiu 3 days, 3 hours ago

This patch series optimizes riscv vector state handling across syscall
boundaries and context switches. The kernel now keeps track of the
INITIAL state in sstatus.vs to optimize unnecessary context management
operations.

This version merges daichengrong's RFC patch [1] for the state tracking
code as it looks cleaner than my v2/v1.

[1]: https://lore.kernel.org/linux-riscv/7ba2f4b7-8475-4ec3-ab31-58b332bda47e@iscas.ac.cn/#r
Link to v2: https://lore.kernel.org/linux-riscv/20260402043414.2421916-1-andybnac@gmail.com/

Patch summary:
 - Updated patches: 2
 - New patches: 1, 3, 4

Changelog v3:
 - Refactor function names. (1, 2)
 - Merge daichengrong's patch, with a fix and optimization. (2)
 - Fix ptrace GETREGSET failure. (3)
 - Strengthen ptrace SETREGSET semantics and add a test to cover it. (3,
   4)
 - Fix a potential ABI break in signal and add a test to prevent future
   breaks. (3, 4)

Changelog v2: rebase on top of for-next

Andy Chiu (3):
  riscv: vector: refactor vector context operations
  riscv: vector: adjust ptrace and signal behavior for INITIAL state
  selftests: riscv: Extend vector tests for sigreturn and ptrace

daichengrong (1):
  riscv: clarify vector state semantics on syscall and context switch

 arch/riscv/include/asm/kvm_vcpu_vector.h      |   8 +-
 arch/riscv/include/asm/vector.h               |  45 ++++----
 arch/riscv/kernel/kernel_mode_vector.c        |  27 ++++-
 arch/riscv/kernel/ptrace.c                    |  13 +--
 arch/riscv/kernel/signal.c                    |  11 +-
 arch/riscv/kernel/vector.c                    |  38 ++++++-
 arch/riscv/kvm/vcpu_vector.c                  |   8 +-
 .../selftests/riscv/sigreturn/sigreturn.c     |  75 +++++++++++++
 .../selftests/riscv/vector/vstate_ptrace.c    | 100 +++++++++++++++++-
 9 files changed, 281 insertions(+), 44 deletions(-)

-- 
2.43.0

Re: [PATCH v3 0/4] riscv: optimize Vector context restore on syscall

Posted by Olof Johansson 3 days ago

Hi Andy,

On Thu, May 21, 2026 at 11:25:16AM -0500, Andy Chiu wrote:
> This patch series optimizes riscv vector state handling across syscall
> boundaries and context switches. The kernel now keeps track of the
> INITIAL state in sstatus.vs to optimize unnecessary context management
> operations.
> 
> This version merges daichengrong's RFC patch [1] for the state tracking
> code as it looks cleaner than my v2/v1.
> 
> [1]: https://lore.kernel.org/linux-riscv/7ba2f4b7-8475-4ec3-ab31-58b332bda47e@iscas.ac.cn/#r
> Link to v2: https://lore.kernel.org/linux-riscv/20260402043414.2421916-1-andybnac@gmail.com/

A patchset like this would be really helped by some kind of numbers in the
cover letter to indicate how much performance moved, given a claim of
optimization.

Just for kicks I tried a simple microbenchmark for syscalls from
a vector-enabled process:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <time.h>
#include <stdint.h>

static inline uint64_t ns_now(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * 1000000000ull + t.tv_nsec;
}

int main(int argc, char **argv) {
    int iters = argc > 1 ? atoi(argv[1]) : 10000000;
    int use_v = argc > 2 ? atoi(argv[2]) : 1;

    if (use_v) {
        asm volatile(
            ".option push\n\t.option arch, +v\n\t"
            "vsetivli x0, 1, e32, m1, ta, ma\n\t"
            "vmv.v.i v0, 1\n\t"
            ".option pop\n\t" ::: "memory");
    }

    for (int i = 0; i < 10000; i++) syscall(SYS_getppid);  // warmup

    uint64_t t0 = ns_now();
    for (int i = 0; i < iters; i++) syscall(SYS_getppid);
    uint64_t t1 = ns_now();

    printf("V=%d %.1f ns/call (%lu ns / %d iters)\n",
           use_v, (double)(t1 - t0) / iters, t1 - t0, iters);
    return 0;
}

I compiled with gcc -O3, default GCC 14.2 on Debian 13. Host is x280
(Blackhole). Base kernel source is 7.1.0-rc4-next-20260520 defconfig. Ran
with taskset to pin to one of the CPUs.

The testcase doesn't use vector in between syscalls, but will obviously
have initialized the state (if started with '1' as second argument).

Without this patchset:
V=1 242.9 ns/call (12144527848 ns / 50000000 iters)

With this patchset:
V=1 264.5 ns/call (13226852900 ns / 50000000 iters)

Interestingly enough, with V=0 test it sped up slightly (194.3 -> 189.5 ns).

I repeated the runs a few times, with similar results so I don't think it's
explainable as noise.

Given that more code will be vector enabled in the new shiny RVA23 world
we are entering, I'm uncertain whether this is the right trade-off. You won't
get the syscall perf cost returned unless you need the vector context swapped
in without the lazy fault between calls.

I suspect running userspace workloads on an RVA23 platform (SpaceMIT
K3) with Ubuntu 26.04 would be the most meaningful data to collect. My
ordered board is still in shipping, unfortunately.

PS: There's a new build warning due to an unused 'uvstate' variable in
riscv_v_start_kernel_context() that you might want to fix.

-Olof

Re: Re: [PATCH v3 0/4] riscv: optimize Vector context restore on syscall

Posted by Andy Chiu 2 days, 22 hours ago

On Thu, May 21, 2026 at 12:15:07PM -0700, Olof Johansson wrote:
> Hi Andy,
> 
> On Thu, May 21, 2026 at 11:25:16AM -0500, Andy Chiu wrote:
> > This patch series optimizes riscv vector state handling across syscall
> > boundaries and context switches. The kernel now keeps track of the
> > INITIAL state in sstatus.vs to optimize unnecessary context management
> > operations.
> > 
> > This version merges daichengrong's RFC patch [1] for the state tracking
> > code as it looks cleaner than my v2/v1.
> > 
> > [1]: https://lore.kernel.org/linux-riscv/7ba2f4b7-8475-4ec3-ab31-58b332bda47e@iscas.ac.cn/#r
> > Link to v2: https://lore.kernel.org/linux-riscv/20260402043414.2421916-1-andybnac@gmail.com/
> 
> A patchset like this would be really helped by some kind of numbers in the
> cover letter to indicate how much performance moved, given a claim of
> optimization.
> 

Thanks for pointing it out, I totally agree with you. I had included a
test result on SiFive's hardware in v2[1], but that was on an FPGA. I will
test it on real silicon as soon as I have access. Sorry for the
confusing claim here.

My test ran a vector-enabled version of lat_ctx. I modified
the main function to make sure the process touches vector, then ran it with
2 threads. Since lat_ctx uses the syscall interface to notify another
process, the kernel will trash their vector registers instead of wasting
cycles on saving/restoring them.

> 
> Just for kicks I tried a simple microbenchmark for syscalls from
> a vector-enabled process:
> 
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/syscall.h>
> #include <unistd.h>
> #include <time.h>
> #include <stdint.h>
> 
> static inline uint64_t ns_now(void) {
>     struct timespec t;
>     clock_gettime(CLOCK_MONOTONIC, &t);
>     return t.tv_sec * 1000000000ull + t.tv_nsec;
> }
> 
> int main(int argc, char **argv) {
>     int iters = argc > 1 ? atoi(argv[1]) : 10000000;
>     int use_v = argc > 2 ? atoi(argv[2]) : 1;
> 
>     if (use_v) {
>         asm volatile(
>             ".option push\n\t.option arch, +v\n\t"
>             "vsetivli x0, 1, e32, m1, ta, ma\n\t"
>             "vmv.v.i v0, 1\n\t"
>             ".option pop\n\t" ::: "memory");
>     }
> 
>     for (int i = 0; i < 10000; i++) syscall(SYS_getppid);  // warmup
> 
>     uint64_t t0 = ns_now();
>     for (int i = 0; i < iters; i++) syscall(SYS_getppid);
>     uint64_t t1 = ns_now();
> 
>     printf("V=%d %.1f ns/call (%lu ns / %d iters)\n",
>            use_v, (double)(t1 - t0) / iters, t1 - t0, iters);
>     return 0;
> }
> 
> 
> I compiled with gcc -O3, default GCC 14.2 on Debian 13. Host is x280
> (Blackhole). Base kernel source is 7.1.0-rc4-next-20260520 defconfig. Ran
> with taskset to pin to one of the CPUs.
> 
> The testcase doesn't use vector in between syscalls, but will obviously
> have initialized the state (if started with '1' as second argument).
> 
> Without this patchset:
> V=1 242.9 ns/call (12144527848 ns / 50000000 iters)
> 
> With this patchset:
> V=1 264.5 ns/call (13226852900 ns / 50000000 iters)

This 9% regression is surprising to me, as this patch set (without the fix
below) should be equivalent in performance on this code path (and better
if getppid takes one process switch).

Before the patch, nulling the v registers happens at the entry point. After
this patchset we set sstatus.vs to INITIAL at entry, and nulling
happens right before returning to user space.

> 
> Interestingly enough, with V=0 test it sped up slightly (194.3 -> 189.5 ns).
> 

This result is expected as with V=0 the kernel doesn't have to maintain
vstate at all. But we are also looking into ways to improve mode switch
latencies.

> I repeated the runs a few times, with similar results so I don't think it's
> explainable as noise.
> 

Thanks for carrying out the experiment, it's very sound! I actually
missed one thing on this patch for it to be optimized on this specific
case:

diff --git a/arch/riscv/include/asm/vector.h b/arch/riscv/include/asm/vector.h
index 8c1e64e0dd0b..5d1282870a20 100644
--- a/arch/riscv/include/asm/vector.h
+++ b/arch/riscv/include/asm/vector.h
@@ -40,6 +40,15 @@
 	_res;								\
 })
 
+#define __riscv_v_vstate_check_gt(_val, TYPE) ({			\
+	bool _res;							\
+	if (has_xtheadvector()) 					\
+		_res = ((_val) & SR_VS_THEAD) > SR_VS_##TYPE##_THEAD;	\
+	else								\
+		_res = ((_val) & SR_VS) > SR_VS_##TYPE;			\
+	_res;								\
+})
+
 extern unsigned long riscv_v_vsize;
 int riscv_v_setup_vsize(void);
 bool insn_is_vector(u32 insn_buf);
@@ -323,7 +332,7 @@ static inline void riscv_v_vstate_set_restore(struct task_struct *task,
 
 static inline void riscv_v_vstate_discard(struct pt_regs *regs)
 {
-	if (riscv_v_vstate_query(regs)) {
+	if (__riscv_v_vstate_check_gt(regs->status, INITIAL)) {
 		riscv_v_vstate_set_restore(current, regs);
 		riscv_v_vstate_init(regs);
 	}

This way, if the user never touches vregs again, we will not null
out the context at syscall exit.

Again, I have only tested it functionally at the moment. I would appreciate
it if you could get the numbers with the above diff. Meanwhile, I am going
to source actual hardware to run on and hopefully find the reason for
the above regression before rolling out v4.

> 
> Given that more code will be vector enabled in the new shiny RVA23 world
> we are entering, I'm uncertain whether this is the right trade-off. You won't
> get the syscall perf cost returned unless you need the vector context swapped
> in without the lazy fault between calls.
> 
> I suspect running userspace workloads on an RVA23 platform (SpaceMIT
> K3) with Ubuntu 26.04 would be the most meaningful data to collect. My
> ordered board is still in shipping, unfortunately.
> 
> 
> PS: There's a new build warning due to an unused 'uvstate' variable in
> riscv_v_start_kernel_context() that you might want to fix.
> 
> 
> -Olof

[1]: https://lore.kernel.org/linux-riscv/20260402043414.2421916-2-andybnac@gmail.com/

Thanks,
Andy

Re: Re: [PATCH v3 0/4] riscv: optimize Vector context restore on syscall

Posted by Olof Johansson 2 days, 16 hours ago

On Thu, May 21, 2026 at 04:09:40PM -0500, Andy Chiu wrote:
> On Thu, May 21, 2026 at 12:15:07PM -0700, Olof Johansson wrote:
> > Hi Andy,
> > 
> > On Thu, May 21, 2026 at 11:25:16AM -0500, Andy Chiu wrote:
> > > This patch series optimizes riscv vector state handling across syscall
> > > boundaries and context switches. The kernel now keeps track of the
> > > INITIAL state in sstatus.vs to optimize unnecessary context management
> > > operations.
> > > 
> > > This version merges daichengrong's RFC patch [1] for the state tracking
> > > code as it looks cleaner than my v2/v1.
> > > 
> > > [1]: https://lore.kernel.org/linux-riscv/7ba2f4b7-8475-4ec3-ab31-58b332bda47e@iscas.ac.cn/#r
> > > Link to v2: https://lore.kernel.org/linux-riscv/20260402043414.2421916-1-andybnac@gmail.com/
> > 
> > A patchset like this would be really helped by some kind of numbers in the
> > cover letter to indicate how much performance moved, given a claim of
> > optimization.
> > 
> 
> Thanks for pointing it out, I totally agree with you. I had included a
> test result on SiFive's hardware in v2[1], but that was on an FPGA. I will
> test it on real silicon as soon as I have access. Sorry for the
> confusing claim here.
> 
> My test ran a vector-enabled version of lat_ctx. I modified
> the main function to make sure the process touches vector, then ran it with
> 2 threads. Since lat_ctx uses the syscall interface to notify another
> process, the kernel will trash their vector registers instead of wasting
> cycles on saving/restoring them.
> 
> > 
> > Just for kicks I tried a simple microbenchmark for syscalls from
> > a vector-enabled process:
> > 
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <sys/syscall.h>
> > #include <unistd.h>
> > #include <time.h>
> > #include <stdint.h>
> > 
> > static inline uint64_t ns_now(void) {
> >     struct timespec t;
> >     clock_gettime(CLOCK_MONOTONIC, &t);
> >     return t.tv_sec * 1000000000ull + t.tv_nsec;
> > }
> > 
> > int main(int argc, char **argv) {
> >     int iters = argc > 1 ? atoi(argv[1]) : 10000000;
> >     int use_v = argc > 2 ? atoi(argv[2]) : 1;
> > 
> >     if (use_v) {
> >         asm volatile(
> >             ".option push\n\t.option arch, +v\n\t"
> >             "vsetivli x0, 1, e32, m1, ta, ma\n\t"
> >             "vmv.v.i v0, 1\n\t"
> >             ".option pop\n\t" ::: "memory");
> >     }
> > 
> >     for (int i = 0; i < 10000; i++) syscall(SYS_getppid);  // warmup
> > 
> >     uint64_t t0 = ns_now();
> >     for (int i = 0; i < iters; i++) syscall(SYS_getppid);
> >     uint64_t t1 = ns_now();
> > 
> >     printf("V=%d %.1f ns/call (%lu ns / %d iters)\n",
> >            use_v, (double)(t1 - t0) / iters, t1 - t0, iters);
> >     return 0;
> > }
> > 
> > 
> > I compiled with gcc -O3, default GCC 14.2 on Debian 13. Host is x280
> > (Blackhole). Base kernel source is 7.1.0-rc4-next-20260520 defconfig. Ran
> > with taskset to pin to one of the CPUs.
> > 
> > The testcase doesn't use vector in between syscalls, but will obviously
> > have initialized the state (if started with '1' as second argument).
> > 
> > Without this patchset:
> > V=1 242.9 ns/call (12144527848 ns / 50000000 iters)
> > 
> > With this patchset:
> > V=1 264.5 ns/call (13226852900 ns / 50000000 iters)
> 
> This 9% regression is surprising to me, as this patch set (without the fix
> below) should be equivalent in performance on this code path (and better
> if getppid takes one process switch).
> 
> Before the patch, nulling the v registers happens at the entry point. After
> this patchset we set sstatus.vs to INITIAL at entry, and nulling
> happens right before returning to user space.
> 
> > 
> > Interestingly enough, with V=0 test it sped up slightly (194.3 -> 189.5 ns).
> > 
> 
> This result is expected as with V=0 the kernel doesn't have to maintain
> vstate at all. But we are also looking into ways to improve mode switch
> latencies.
> 
> > I repeated the runs a few times, with similar results so I don't think it's
> > explainable as noise.
> > 
> 
> Thanks for carrying out the experiment, it's very sound! I actually
> missed one thing on this patch for it to be optimized on this specific
> case:
> 
> diff --git a/arch/riscv/include/asm/vector.h b/arch/riscv/include/asm/vector.h
> index 8c1e64e0dd0b..5d1282870a20 100644
> --- a/arch/riscv/include/asm/vector.h
> +++ b/arch/riscv/include/asm/vector.h
> @@ -40,6 +40,15 @@
>  	_res;								\
>  })
>  
> +#define __riscv_v_vstate_check_gt(_val, TYPE) ({			\
> +	bool _res;							\
> +	if (has_xtheadvector()) 					\
> +		_res = ((_val) & SR_VS_THEAD) > SR_VS_##TYPE##_THEAD;	\
> +	else								\
> +		_res = ((_val) & SR_VS) > SR_VS_##TYPE;			\
> +	_res;								\
> +})
> +
>  extern unsigned long riscv_v_vsize;
>  int riscv_v_setup_vsize(void);
>  bool insn_is_vector(u32 insn_buf);
> @@ -323,7 +332,7 @@ static inline void riscv_v_vstate_set_restore(struct task_struct *task,
>  
>  static inline void riscv_v_vstate_discard(struct pt_regs *regs)
>  {
> -	if (riscv_v_vstate_query(regs)) {
> +	if (__riscv_v_vstate_check_gt(regs->status, INITIAL)) {
>  		riscv_v_vstate_set_restore(current, regs);
>  		riscv_v_vstate_init(regs);
>  	}
> 
> This way, if the user never touches vregs again, we will not null
> out the context at syscall exit.

Significantly better indeed, now ~191ns without vector, ~192ns with -- so
a proper optimization.

> Again, I have only tested it functionally at the moment. I would appreciate
> it if you could get the numbers with the above diff. Meanwhile, I am going
> to source actual hardware to run on and hopefully find the reason for
> the above regression before rolling out v4.

Sounds good.



-Olof