[PATCH] riscv: lib: optimize strlen loop efficiency
Posted by Feng Jiang 1 month, 3 weeks ago
Optimize the generic strlen implementation by using a pre-decrement
pointer. This reduces the loop body from 4 instructions to 3 and
eliminates the unconditional jump ('j').

Old loop (4 instructions, 2 branches):
  1: lbu t0, 0(t1); beqz t0, 2f; addi t1, t1, 1; j 1b

New loop (3 instructions, 1 branch):
  1: addi t1, t1, 1; lbu t0, 0(t1); bnez t0, 1b

This change improves execution efficiency and reduces branch pressure
for systems without the Zbb extension.

Signed-off-by: Feng Jiang <jiangfeng@kylinos.cn>
---
 arch/riscv/lib/strlen.S | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/riscv/lib/strlen.S b/arch/riscv/lib/strlen.S
index eb4d2b7ed22b..e7736ccda514 100644
--- a/arch/riscv/lib/strlen.S
+++ b/arch/riscv/lib/strlen.S
@@ -21,13 +21,11 @@ SYM_FUNC_START(strlen)
 	 * Clobbers:
 	 *   t0, t1
 	 */
-	mv	t1, a0
+	addi	t1, a0, -1
 1:
-	lbu	t0, 0(t1)
-	beqz	t0, 2f
 	addi	t1, t1, 1
-	j	1b
-2:
+	lbu	t0, 0(t1)
+	bnez	t0, 1b
 	sub	a0, t1, a0
 	ret
 
-- 
2.25.1
Re: [PATCH] riscv: lib: optimize strlen loop efficiency
Posted by Paul Walmsley 3 weeks, 4 days ago
On Thu, 18 Dec 2025, Feng Jiang wrote:

> Optimize the generic strlen implementation by using a pre-decrement
> pointer. This reduces the loop body from 4 instructions to 3 and
> eliminates the unconditional jump ('j').
> 
> Old loop (4 instructions, 2 branches):
>   1: lbu t0, 0(t1); beqz t0, 2f; addi t1, t1, 1; j 1b
> 
> New loop (3 instructions, 1 branch):
>   1: addi t1, t1, 1; lbu t0, 0(t1); bnez t0, 1b
> 
> This change improves execution efficiency and reduces branch pressure
> for systems without the Zbb extension.

Looks reasonable; do you have any benchmarks on hardware that you can 
share?  Any reason why this patch stands alone and isn't rolled up as part 
of your "optimize string function" series?


- Paul
Re: [PATCH] riscv: lib: optimize strlen loop efficiency
Posted by David Laight 3 weeks, 4 days ago
On Wed, 14 Jan 2026 19:03:17 -0700 (MST)
Paul Walmsley <pjw@kernel.org> wrote:

> On Thu, 18 Dec 2025, Feng Jiang wrote:
> 
> > Optimize the generic strlen implementation by using a pre-decrement
> > pointer. This reduces the loop body from 4 instructions to 3 and
> > eliminates the unconditional jump ('j').
> > 
> > Old loop (4 instructions, 2 branches):
> >   1: lbu t0, 0(t1); beqz t0, 2f; addi t1, t1, 1; j 1b
> > 
> > New loop (3 instructions, 1 branch):
> >   1: addi t1, t1, 1; lbu t0, 0(t1); bnez t0, 1b

Is that a change to the generic C code?
Testing (++sc)[-1] might do the trick without requiring the extra read
of the first location.

> > 
> > This change improves execution efficiency and reduces branch pressure
> > for systems without the Zbb extension.
> 
> Looks reasonable; do you have any benchmarks on hardware that you can 
> share?  Any reason why this patch stands alone and isn't rolled up as part 
> of your "optimize string function" series?

For 64bit you can do a lot better (in C) by loading 64bit words and doing
the correct 'shift and mask' sequence to detect a zero byte.
It usually isn't worth it for 32bit.

Does need to handle a mis-aligned base - eg by masking the bits off
the base pointer and or'ing in non-zero values to the value read from
the base pointer.

	David
Re: [PATCH] riscv: lib: optimize strlen loop efficiency
Posted by David Laight 3 weeks, 3 days ago
On Thu, 15 Jan 2026 11:19:47 +0000
David Laight <david.laight.linux@gmail.com> wrote:

> For 64bit you can do a lot better (in C) by loading 64bit words and doing
> the correct 'shift and mask' sequence to detect a zero byte.
> It usually isn't worth it for 32bit.
> 
> Does need to handle a mis-aligned base - eg by masking the bits off
> the base pointer and or'ing in non-zero values to the value read from
> the base pointer.
> 
> 	David

The version below seems to work https://www.godbolt.org/z/sME3Ts6vW
It actually looks ok for x86-32, the loop is 8 instructions plus the branch
but the 'register dependency chain' is only 4 instructions.
So maybe better than byte compares for moderate to long strings.
(Especially if the cpu starts speculatively executing the next loop
iteration.)

The OPTIMIZER_HIDE_VAR() helps a lot on (eg) MIPS-64 and a bit elsewhere
since most 64bit cpus can't load 64bit immediates.

I can't get gcc and clang to reliably have a loop with a conditional
jump at the bottom, especially with an unconditional jump into the
loop (to remove the '| mask' from the loop body).

Also KASAN (or one of its friends) won't like the code reading entire
words that hold the string.

And it does need ffs/clz instructions - or a different loop bottom.
(For BE one with clzl() returning 0 will work.) 

While I suspect the per-byte cost is 'two bytes/clock' on x86-64
the fixed cost may move the break-even point above the length of the
average strlen() in the kernel.
Of course, x86 probably falls back to 'repne scasb' at (maybe)
(40 + 2n) clocks for 'n' bytes.
A carefully written slightly unrolled asm loop might manage one
byte per clock!
I could spend weeks benchmarking different versions.

	David

#define OPTIMIZER_HIDE_VAR(var)                                         \
        __asm__ ("" : "=r" (var) : "0" (var))

/* Set BE to test big-endian on little-endian.
 * For real BE either do a byteswapping read or use the BE code. */
#ifdef BE
#define SWP(x) __builtin_bswap64(x)
#define SHIFT <<
#else
#define SWP(x) (x)
#define SHIFT >>
#endif

unsigned long my_strlen(const char *s)
{
        unsigned int off = (unsigned long)s % sizeof (long);
        const unsigned long *p = (void *)(s - off);
        unsigned long val;
        unsigned long mask;
        unsigned long ones = 0x01010101;
        /* Force the compiler to generate the related constants sanely. */
        OPTIMIZER_HIDE_VAR(ones);
        ones |= ones << 16 << 16;

        mask = ((~0ul SHIFT 8) SHIFT 8 * (sizeof (long) - 1 - off));
        do {
                val = SWP(*p++) | mask;
                mask = (val - ones) & ~val & ones << 7;
        } while (!mask);

#ifdef BE
        off = __builtin_clzl(mask);
        /* Correct for "...\x01" */
        val <<= off;
        for (off /= 8; val > (~0ul >> 8); off++)
                val <<= 8;
#else
        off = (__builtin_ffsl(mask) - 1)/8;
#endif

        return (const char *)(p - 1) + off - s;
}
Re: [PATCH] riscv: lib: optimize strlen loop efficiency
Posted by David Laight 1 week, 4 days ago
On Thu, 15 Jan 2026 18:46:19 +0000
David Laight <david.laight.linux@gmail.com> wrote:

... 
> While I suspect the per-byte cost is 'two bytes/clock' on x86-64
> the fixed cost may move the break-even point above the length of the
> average strlen() in the kernel.
> Of course, x86 probably falls back to 'rep scasb' at (maybe)
> (40 + 2n) clocks for 'n' bytes.
> A carefully written slightly unrolled asm loop might manage one
> byte per clock!
> I could spend weeks benchmarking different versions.

I've spent a quick half-hour...

On my zen-5 in userspace:

glibc's strlen() is showing the same fixed cost (50 clocks including overhead)
for sizes below (about) 100 bytes; for big buffers add 1 clock per ~50 bytes.
It must be using some simd instructions.

A simple:
	len = 0; while (s[len]) len++; return len;
loop is about 1 byte/clock, overhead ~25 clocks (probably mostly the one 'rdpmc'
instruction).
(Needs a barrier() to stop gcc converting it to a libc call.)

Unrolling the loop once:
	for (len = 0; s[len]; len += 2)
		if (!s[len + 1]) return len + 1;
	return len;
actually runs twice as fast - so 2 bytes/clock.

Unrolling 4 times doesn't help, suddenly goes somewhat slower somewhere
between 128 and 256 bytes (to 1.5 bytes/clock).

The C 'longs' loop has an overhead of ~45 clocks and does 6 bytes/clock.
So it is better for buffers longer than 64 bytes.

The 'elephant in the room' is 'repne scasb'.
The fixed cost is some 150 clocks, plus 3 clocks/byte.

I don't think any of the Intel cpus I have will do a 'one clock loop'.
I certainly failed to get one in the past when there was a data-dependency
between the iterations.

But I don't have anything modern (newest is an i7-7xxx) and I don't have
any old amd ones.
I need to get a zen-1 (or 1a) and one of the Intel systems that should be
cheap because they won't run win-11.

	David
Re: [PATCH] riscv: lib: optimize strlen loop efficiency
Posted by Feng Jiang 1 week, 4 days ago
On 2026/1/29 02:59, David Laight wrote:
> On Thu, 15 Jan 2026 18:46:19 +0000
> David Laight <david.laight.linux@gmail.com> wrote:
> 
> ... 
>> While I suspect the per-byte cost is 'two bytes/clock' on x86-64
>> the fixed cost may move the break-even point above the length of the
>> average strlen() in the kernel.
>> Of course, x86 probably falls back to 'repne scasb' at (maybe)
>> (40 + 2n) clocks for 'n' bytes.
>> A carefully written slightly unrolled asm loop might manage one
>> byte per clock!
>> I could spend weeks benchmarking different versions.
> 
> I've spent a quick half-hour...
> 
> On my zen-5 in userspace:
> 
> glibc's strlen() is showing the same fixed cost (50 clocks including overhead)
> for sizes below (about) 100 bytes; for big buffers add 1 clock per ~50 bytes.
> It must be using some simd instructions.
> 
> A simple:
> 	len = 0; while (s[len]) len++; return len;
> loop is about 1 byte/clock, overhead ~25 clocks (probably mostly the one 'rdpmc'
> instruction).
> (Needs a barrier() to stop gcc converting it to a libc call.)
> 
> Unrolling the loop once:
> 	for (len = 0; s[len]; len += 2)
> 		if (!s[len + 1]) return len + 1;
> 	return len;
> actually runs twice as fast - so 2 bytes/clock.
> 
> Unrolling 4 times doesn't help, suddenly goes somewhat slower somewhere
> between 128 and 256 bytes (to 1.5 bytes/clock).
> 
> The C 'longs' loop has an overhead of ~45 clocks and does 6 bytes/clock.
> So it is better for buffers longer than 64 bytes.
> 
> The 'elephant in the room' is 'repne scasb'.
> The fixed cost is some 150 clocks, plus 3 clocks/byte.
> 
> I don't think any of the Intel cpus I have will do a 'one clock loop'.
> I certainly failed to get one in the past when there was a data-dependency
> between the iterations.
> 
> But I don't have anything modern (newest is an i7-7xxx) and I don't have
> any old amd ones.
> I need to get a zen-1 (or 1a) and one of the Intel systems that should be
> cheap because they won't run win-11.

Thank you very much for sharing these detailed test results and your in-depth
analysis. It is truly helpful and inspiring to see how different loop strategies
perform in practice.

My current priority is the RISC-V patch series. Once this is done, I'd love to
follow up and explore potential improvements for the generic C implementation.

While I don't have many x86 machines at hand either, I do have access to some
ARM64 and LoongArch hardware. I think I can also perform tests and observations
on these platforms later.

Thanks again for the great discussion!

-- 
With Best Regards,
Feng Jiang
Re: [PATCH] riscv: lib: optimize strlen loop efficiency
Posted by Feng Jiang 2 weeks ago
On 2026/1/16 02:46, David Laight wrote:
> On Thu, 15 Jan 2026 11:19:47 +0000
> David Laight <david.laight.linux@gmail.com> wrote:
> 
>> For 64bit you can do a lot better (in C) by loading 64bit words and doing
>> the correct 'shift and mask' sequence to detect a zero byte.
>> It usually isn't worth it for 32bit.
>>
>> Does need to handle a mis-aligned base - eg by masking the bits off
>> the base pointer and or'ing in non-zero values to the value read from
>> the base pointer.
>>
>> 	David
> 
> The version below seems to work https://www.godbolt.org/z/sME3Ts6vW
> It actually looks ok for x86-32, the loop is 8 instructions plus the branch
> but the 'register dependency chain' is only 4 instructions.
> So maybe better than byte compares for moderate to long strings.
> (Especially if the cpu starts speculatively executing the next loop
> iteration.)
> 
> The OPTIMIZER_HIDE_VAR() helps a lot on (eg) MIPS-64 and a bit elsewhere
> since most 64bit cpus can't load 64bit immediates.
> 
> I can't get gcc and clang to reliably have a loop with a conditional
> jump at the bottom, especially with an unconditional jump into the
> loop (to remove the '| mask' from the loop body).
> 
> Also KASAN (or one of its friends) won't like the code reading entire
> words that hold the string.
> 
> And it does need ffs/clz instructions - or a different loop bottom.
> (For BE one with clzl() returning 0 will work.) 
> 
> While I suspect the per-byte cost is 'two bytes/clock' on x86-64
> the fixed cost may move the break-even point above the length of the
> average strlen() in the kernel.
> Of course, x86 probably falls back to 'repne scasb' at (maybe)
> (40 + 2n) clocks for 'n' bytes.
> A carefully written slightly unrolled asm loop might manage one
> byte per clock!
> I could spend weeks benchmarking different versions.
> 
> 	David
> 
> #define OPTIMIZER_HIDE_VAR(var)                                         \
>         __asm__ ("" : "=r" (var) : "0" (var))
> 
> /* Set BE to test big-endian on little-endian.
>  * For real BE either do a byteswapping read or use the BE code. */
> #ifdef BE
> #define SWP(x) __builtin_bswap64(x)
> #define SHIFT <<
> #else
> #define SWP(x) (x)
> #define SHIFT >>
> #endif
> 
> unsigned long my_strlen(const char *s)
> {
>         unsigned int off = (unsigned long)s % sizeof (long);
>         const unsigned long *p = (void *)(s - off);
>         unsigned long val;
>         unsigned long mask;
>         unsigned long ones = 0x01010101;
>         /* Force the compiler to generate the related constants sanely. */
>         OPTIMIZER_HIDE_VAR(ones);
>         ones |= ones << 16 << 16;
> 
>         mask = ((~0ul SHIFT 8) SHIFT 8 * (sizeof (long) - 1 - off));
>         do {
>                 val = SWP(*p++) | mask;
>                 mask = (val - ones) & ~val & ones << 7;
>         } while (!mask);
> 
> #ifdef BE
>         off = __builtin_clzl(mask);
>         /* Correct for "...\x01" */
>         val <<= off;
>         for (off /= 8; val > (~0ul >> 8); off++)
>                 val <<= 8;
> #else
>         off = (__builtin_ffsl(mask) - 1)/8;
> #endif
> 
>         return (const char *)(p - 1) + off - s;
> }

Thank you very much for your detailed explanation and the insightful examples!

Your explanation of instruction cycles and performance trade-offs has been very
enlightening. As I mentioned in my other patch series, I really appreciate the
technical depth you've shared. I will definitely spend some time studying this
area further to better understand these performance trade-offs and see if I can
contribute more meaningful improvements in the future.

Thanks again for your time and for sharing your expertise!

-- 
With Best Regards,
Feng Jiang
Re: [PATCH] riscv: lib: optimize strlen loop efficiency
Posted by Feng Jiang 3 weeks, 4 days ago
On 2026/1/15 10:03, Paul Walmsley wrote:
> On Thu, 18 Dec 2025, Feng Jiang wrote:
> 
>> Optimize the generic strlen implementation by using a pre-decrement
>> pointer. This reduces the loop body from 4 instructions to 3 and
>> eliminates the unconditional jump ('j').
>>
>> Old loop (4 instructions, 2 branches):
>>   1: lbu t0, 0(t1); beqz t0, 2f; addi t1, t1, 1; j 1b
>>
>> New loop (3 instructions, 1 branch):
>>   1: addi t1, t1, 1; lbu t0, 0(t1); bnez t0, 1b
>>
>> This change improves execution efficiency and reduces branch pressure
>> for systems without the Zbb extension.
> 
> Looks reasonable; do you have any benchmarks on hardware that you can 
> share?  Any reason why this patch stands alone and isn't rolled up as part 
> of your "optimize string function" series?

Thanks for the feedback.

This patch predates the rest of the series, which is why it wasn't included
in the 'optimize string function' rollup. At the time, I focused on correctness
testing and observed the improvement through rdcycle instruction counts.

Since the series still needs further refinement and may take a longer time to
complete, I was hoping this standalone optimization could be considered independently.
However, I am also happy to roll it into the series if you prefer.

-- 
With Best Regards,
Feng Jiang
Re: [PATCH] riscv: lib: optimize strlen loop efficiency
Posted by Paul Walmsley 2 weeks, 2 days ago
On Thu, 15 Jan 2026, Feng Jiang wrote:

> On 2026/1/15 10:03, Paul Walmsley wrote:
> > On Thu, 18 Dec 2025, Feng Jiang wrote:
> > 
> >> Optimize the generic strlen implementation by using a pre-decrement
> >> pointer. This reduces the loop body from 4 instructions to 3 and
> >> eliminates the unconditional jump ('j').
> >>
> >> Old loop (4 instructions, 2 branches):
> >>   1: lbu t0, 0(t1); beqz t0, 2f; addi t1, t1, 1; j 1b
> >>
> >> New loop (3 instructions, 1 branch):
> >>   1: addi t1, t1, 1; lbu t0, 0(t1); bnez t0, 1b
> >>
> >> This change improves execution efficiency and reduces branch pressure
> >> for systems without the Zbb extension.
> > 
> > Looks reasonable; do you have any benchmarks on hardware that you can 
> > share?  Any reason why this patch stands alone and isn't rolled up as part 
> > of your "optimize string function" series?
> 
> Thanks for the feedback.
> 
> This patch predates the rest of the series, which is why it wasn't included
> in the 'optimize string function' rollup. At the time, I focused on correctness
> testing and observed the improvement through rdcycle instruction counts.
> 
> Since the series still needs further refinement and may take a longer time to
> complete, I was hoping this standalone optimization could be considered independently.

Ok.  Queued for v6.20.

Might be worth taking a look at David's suggestions for a followup patch?


- Paul
Re: [PATCH] riscv: lib: optimize strlen loop efficiency
Posted by Feng Jiang 2 weeks ago
On 2026/1/24 16:14, Paul Walmsley wrote:
> On Thu, 15 Jan 2026, Feng Jiang wrote:
> 
>> On 2026/1/15 10:03, Paul Walmsley wrote:
>>> On Thu, 18 Dec 2025, Feng Jiang wrote:
>>>
>>>> Optimize the generic strlen implementation by using a pre-decrement
>>>> pointer. This reduces the loop body from 4 instructions to 3 and
>>>> eliminates the unconditional jump ('j').
>>>>
>>>> Old loop (4 instructions, 2 branches):
>>>>   1: lbu t0, 0(t1); beqz t0, 2f; addi t1, t1, 1; j 1b
>>>>
>>>> New loop (3 instructions, 1 branch):
>>>>   1: addi t1, t1, 1; lbu t0, 0(t1); bnez t0, 1b
>>>>
>>>> This change improves execution efficiency and reduces branch pressure
>>>> for systems without the Zbb extension.
>>>
>>> Looks reasonable; do you have any benchmarks on hardware that you can 
>>> share?  Any reason why this patch stands alone and isn't rolled up as part 
>>> of your "optimize string function" series?
>>
>> Thanks for the feedback.
>>
>> This patch predates the rest of the series, which is why it wasn't included
>> in the 'optimize string function' rollup. At the time, I focused on correctness
>> testing and observed the improvement through rdcycle instruction counts.
>>
>> Since the series still needs further refinement and may take a longer time to
>> complete, I was hoping this standalone optimization could be considered independently.
> 
> Ok.  Queued for v6.20.
> 
> Might be worth taking a look at David's suggestions for a followup patch?
> 

Thanks for queuing this!

I am definitely planning to study David's suggestions. He has also provided a lot
of valuable feedback on my other patch series, and I will explore further improvements
for a follow-up patch.

-- 
With Best Regards,
Feng Jiang