This series provides optimized implementations of strnlen(), strchr(),
and strrchr() for the RISC-V architecture. The strnlen implementation
is derived from the existing optimized strlen. For strchr and strrchr,
the current versions use simple byte-by-byte assembly logic, which
will serve as a baseline for future Zbb-based optimizations.

The patch series is organized into three parts:
1. Correctness Testing: The first three patches add KUnit test cases
   for strlen, strnlen, and strrchr to ensure the baseline and
   optimized versions are functionally correct.
2. Benchmarking Tool: Patches 4 and 5 extend string_kunit to include
   performance measurement capabilities, allowing for comparative
   analysis within the KUnit environment.
3. Architectural Optimizations: The final three patches introduce the
   RISC-V specific assembly implementations.

Following suggestions from Andy Shevchenko, performance benchmarks have
been added to string_kunit.c to provide quantifiable evidence of the
improvements. Andy provided many specific comments on the implementation
of the benchmark logic, which is also inspired by Eric Biggers'
crc_benchmark(). Performance was measured in a QEMU TCG (rv64)
environment, comparing the generic C implementation with the new RISC-V
assembly versions.
Performance Summary (Improvement %):
---------------------------------------------------------------
Function  | 16 B (Short) | 512 B (Mid) | 4096 B (Long)
---------------------------------------------------------------
strnlen   | +64.0%       | +346.2%     | +410.7%
strchr    | +4.0%        | +6.4%       | +1.5%
strrchr   | +6.6%        | +2.8%       | +0.0%
---------------------------------------------------------------

The benchmarks can be reproduced by enabling CONFIG_STRING_KUNIT_BENCH
and running:

  ./tools/testing/kunit/kunit.py run --arch=riscv \
    --cross_compile=riscv64-linux-gnu- \
    --kunitconfig=my_string.kunitconfig --raw_output

The strnlen implementation leverages the Zbb 'orc.b' instruction and
word-at-a-time logic, showing significant gains as the string length
increases. For strchr and strrchr, the handwritten assembly reduces
fixed overhead by eliminating stack frame management. The gain is most
prominent on short strings (1-16B) where function call overhead
dominates, while the performance converges with the C implementation
for longer strings in the TCG environment.

I would like to thank Andy Shevchenko for the suggestion to add
benchmarks and for his detailed feedback on the test framework, and
Eric Biggers for the benchmarking approach. Thanks also to Joel Stanley
for testing support and feedback, and to David Laight for his
suggestions regarding performance measurement.

Changes:
v3:
- Re-implement benchmark logic inspired by crc_benchmark().
- Add 'len - 2' test case to strnlen correctness tests.
- Incorporate detailed benchmark data into individual commit messages.
v2:
- Refactored lib/string.c to export __generic_* functions and added
  corresponding functional/performance tests for strnlen, strchr, and
  strrchr (Andy Shevchenko).
- Replaced magic numbers with STRING_TEST_MAX_LEN etc. (Andy Shevchenko).
v1: Initial submission.
---
Feng Jiang (8):
  lib/string_kunit: add correctness test for strlen
  lib/string_kunit: add correctness test for strnlen
  lib/string_kunit: add correctness test for strrchr()
  lib/string_kunit: add performance benchmarks for strlen
  lib/string_kunit: extend benchmarks to strnlen and chr searches
  riscv: lib: add strnlen implementation
  riscv: lib: add strchr implementation
  riscv: lib: add strrchr implementation

 arch/riscv/include/asm/string.h |   9 ++
 arch/riscv/lib/Makefile         |   3 +
 arch/riscv/lib/strchr.S         |  35 +++++
 arch/riscv/lib/strnlen.S        | 164 ++++++++++++++++++++
 arch/riscv/lib/strrchr.S        |  37 +++++
 arch/riscv/purgatory/Makefile   |  11 +-
 lib/Kconfig.debug               |  11 ++
 lib/tests/string_kunit.c        | 258 ++++++++++++++++++++++++++++++++
 8 files changed, 527 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/lib/strchr.S
 create mode 100644 arch/riscv/lib/strnlen.S
 create mode 100644 arch/riscv/lib/strrchr.S

-- 
2.25.1
On Tue, Jan 20, 2026 at 02:58:44PM +0800, Feng Jiang wrote:
> This series provides optimized implementations of strnlen(), strchr(),
> and strrchr() for the RISC-V architecture. The strnlen implementation
> is derived from the existing optimized strlen. For strchr and strrchr,

strchr() and strrchr()

> the current versions use simple byte-by-byte assembly logic, which
> will serve as a baseline for future Zbb-based optimizations.
>
> The patch series is organized into three parts:
> 1. Correctness Testing: The first three patches add KUnit test cases
> for strlen, strnlen, and strrchr to ensure the baseline and optimized

strlen(), strnlen(), and strrchr()

> versions are functionally correct.
> 2. Benchmarking Tool: Patches 4 and 5 extend string_kunit to include
> performance measurement capabilities, allowing for comparative
> analysis within the KUnit environment.
> 3. Architectural Optimizations: The final three patches introduce the
> RISC-V specific assembly implementations.
>
> Following suggestions from Andy Shevchenko, performance benchmarks have
> been added to string_kunit.c to provide quantifiable evidence of the
> improvements. Andy provided many specific comments on the implementation
> of the benchmark logic, which is also inspired by Eric Biggers'
> crc_benchmark(). Performance was measured in a QEMU TCG (rv64) environment,
> comparing the generic C implementation with the new RISC-V assembly versions.
>
> Performance Summary (Improvement %):
> ---------------------------------------------------------------
> Function | 16 B (Short) | 512 B (Mid) | 4096 B (Long)
> ---------------------------------------------------------------
> strnlen | +64.0% | +346.2% | +410.7%

This is still suspicious.

> strchr | +4.0% | +6.4% | +1.5%
> strrchr | +6.6% | +2.8% | +0.0%
> ---------------------------------------------------------------
>
> The benchmarks can be reproduced by enabling CONFIG_STRING_KUNIT_BENCH
> and running: ./tools/testing/kunit/kunit.py run --arch=riscv \
> --cross_compile=riscv64-linux-gnu- --kunitconfig=my_string.kunitconfig \
> --raw_output
>
> The strnlen implementation leverages the Zbb 'orc.b' instruction and

strnlen()

> word-at-a-time logic, showing significant gains as the string length
> increases.

Hmm... Have you tried to optimise the generic implementation to use
word-at-a-time logic and compare?

> For strchr and strrchr, the handwritten assembly reduces

strchr() and strrchr()

> fixed overhead by eliminating stack frame management. The gain is most
> prominent on short strings (1-16B) where function call overhead dominates,
> while the performance converges with the C implementation for longer
> strings in the TCG environment.

-- 
With Best Regards,
Andy Shevchenko
On 2026/1/20 15:36, Andy Shevchenko wrote:
> On Tue, Jan 20, 2026 at 02:58:44PM +0800, Feng Jiang wrote:
>> This series provides optimized implementations of strnlen(), strchr(),
>> and strrchr() for the RISC-V architecture. The strnlen implementation
>> is derived from the existing optimized strlen. For strchr and strrchr,
>
> strchr() and strrchr()
>> the current versions use simple byte-by-byte assembly logic, which
>> will serve as a baseline for future Zbb-based optimizations.
>>
>> The patch series is organized into three parts:
>> 1. Correctness Testing: The first three patches add KUnit test cases
>> for strlen, strnlen, and strrchr to ensure the baseline and optimized
>
> strlen(), strnlen(), and strrchr()
Will fix these to include parentheses consistently in the next version.
>> Performance Summary (Improvement %):
>> ---------------------------------------------------------------
>> Function | 16 B (Short) | 512 B (Mid) | 4096 B (Long)
>> ---------------------------------------------------------------
>> strnlen | +64.0% | +346.2% | +410.7%
>
> This is still suspicious.
>
Regarding the +410% gain, it becomes clearer when looking at the inner loop of the
Zbb implementation (https://docs.riscv.org/reference/isa/unpriv/b-st-ext.html#zbb).
For a 64-bit system, the core loop uses only 5 instructions to process 8 bytes.
Note that t3 is pre-loaded with -1 (0xFFFFFFFFFFFFFFFF):
1:
	REG_L	t1, SZREG(t0)	/* Load 8 bytes */
	addi	t0, t0, SZREG	/* Advance pointer */
	orc.b	t1, t1		/* Bitwise OR-Combine: 0x00 becomes 0x00, others 0xFF */
	bgeu	t0, t4, 4f	/* Boundary check (max_len) */
	beq	t1, t3, 1b	/* If t1 == 0xFFFFFFFFFFFFFFFF (no NUL), loop */
In contrast, the generic C implementation performs byte-by-byte comparisons, which involves
significantly more loads and branches for every single byte processed. The Zbb approach is
much leaner: the orc.b instruction collapses the NUL-check for an entire 8-byte word into a
single step. By shifting from a byte-oriented loop to this hardware-accelerated word-at-a-time
logic, we drastically reduce the instruction count and branch overhead, which explains the
massive jump in TCG throughput for long strings.
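For illustration, the word loop above can be modeled in portable C. This is a hypothetical sketch: zbb_strnlen_model() and orc_b() are invented names, orc_b() emulates the instruction in software, and the real assembly additionally handles initial alignment, which this skips.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Software emulation of orc.b: 0xff per nonzero byte, 0x00 per NUL byte. */
static uint64_t orc_b(uint64_t w)
{
	uint64_t r = 0;
	int i;

	for (i = 0; i < 8; i++)
		if ((w >> (8 * i)) & 0xff)
			r |= (uint64_t)0xff << (8 * i);
	return r;
}

/* Model of the Zbb strnlen() main loop: one NUL check per 8-byte word. */
static size_t zbb_strnlen_model(const char *s, size_t maxlen)
{
	size_t i = 0;

	while (i + 8 <= maxlen) {
		uint64_t w;

		memcpy(&w, s + i, 8);		/* stands in for REG_L */
		if (orc_b(w) != ~(uint64_t)0)	/* word contains a NUL */
			break;
		i += 8;
	}
	while (i < maxlen && s[i])		/* byte tail / NUL position */
		i++;
	return i;
}
```

Note that the model does one compare per word instead of one per byte, which is where the long-string gains come from.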
Beyond the main loop, the Zbb implementation also utilizes ctz (Count Trailing Zeros)
to handle the tail and alignment. Once orc.b identifies a NUL byte within a register,
we can precisely locate its position in just three instructions:
	not	t1, t1		/* Flip bits: NUL byte (0x00) becomes 0xFF */
	ctz	t1, t1		/* Count bits before the first NUL byte */
	srli	t1, t1, 3	/* Divide by 8 to get byte offset */
In a generic C implementation, calculating this byte offset typically requires a series
of shifts, masks, or a sub-loop, which adds significant overhead. By combining orc.b and
ctz, we eliminate all branching and lookup tables for the tail-end calculation, further
contributing to the performance gains observed in the benchmarks.
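The tail lookup above can be mirrored in C roughly as follows. This is a hypothetical sketch: nul_byte_offset() is an invented name, it takes the orc.b result as its input, and it assumes a little-endian 64-bit word and a GCC-style ctz builtin.

```c
#include <stdint.h>

/*
 * C analogue of the not/ctz/srli sequence above. 'orc_mask' is the
 * orc.b result: 0xff for every nonzero byte, 0x00 for every NUL byte.
 * The caller must guarantee at least one NUL byte is present (ctz of 0
 * is undefined), just as the assembly only reaches this path after the
 * loop has seen one.
 */
static unsigned int nul_byte_offset(uint64_t orc_mask)
{
	uint64_t t = ~orc_mask;		/* NUL bytes become 0xff */

	return __builtin_ctzll(t) >> 3;	/* bit index / 8 = byte offset */
}
```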
>> strchr | +4.0% | +6.4% | +1.5%
>> strrchr | +6.6% | +2.8% | +0.0%
As for strchr() and strrchr(), the relatively modest improvements are because the current
versions in this series are implemented using simple byte-by-byte assembly. These primarily
gain performance by reducing function call overhead and eliminating stack frame management
compared to the generic C version.
Unlike strnlen(), they do not yet utilize Zbb extensions. I plan to introduce Zbb-optimized
versions for these functions in a future patch set, which I expect will bring performance
gains similar to what we now see with strnlen().
>> word-at-a-time logic, showing significant gains as the string length
>> increases.
>
> Hmm... Have you tried to optimise the generic implementation to use
> word-at-a-time logic and compare?
>
Regarding the generic implementation, even if we were to optimize the C code
to use word-at-a-time logic (the has_zero() style bit-manipulation), it still
wouldn't match the Zbb version's efficiency.
The traditional C-based word-level detection requires a sequence of arithmetic
operations to identify NUL bytes. In contrast, the RISC-V orc.b instruction
collapses this entire check into a single hardware cycle. I've focused on this
architectural approach to fully leverage these specific Zbb features, which
provides a level of instruction density that generic C math cannot achieve.
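For comparison, the kind of word-at-a-time C detection being referred to looks like the sketch below. It is a simplified standalone version with invented names; the kernel's real helpers live in include/linux/word-at-a-time.h.

```c
#include <stdint.h>

#define WORD_ONES  0x0101010101010101ULL
#define WORD_HIGHS 0x8080808080808080ULL

/*
 * Classic has_zero()-style detection: a byte is zero iff subtracting 1
 * borrows into bit 7 while the byte's own bit 7 was clear. This costs
 * several arithmetic ops (sub, and-not, and, compare) per word, where
 * orc.b performs the equivalent byte-collapse in one instruction.
 */
static int has_zero_byte(uint64_t word)
{
	return ((word - WORD_ONES) & ~word & WORD_HIGHS) != 0;
}
```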
>> For strchr and strrchr, the handwritten assembly reduces
>
> strchr() and strrchr()
>
>> fixed overhead by eliminating stack frame management. The gain is most
>> prominent on short strings (1-16B) where function call overhead dominates,
>> while the performance converges with the C implementation for longer
>> strings in the TCG environment.
>
Thanks for your detailed review!
--
With Best Regards,
Feng Jiang
On Wed, Jan 21, 2026 at 8:44 AM Feng Jiang <jiangfeng@kylinos.cn> wrote:
> On 2026/1/20 15:36, Andy Shevchenko wrote:
> > On Tue, Jan 20, 2026 at 02:58:44PM +0800, Feng Jiang wrote:

...

> >> Performance Summary (Improvement %):
> >> ---------------------------------------------------------------
> >> Function | 16 B (Short) | 512 B (Mid) | 4096 B (Long)
> >> ---------------------------------------------------------------
> >> strnlen | +64.0% | +346.2% | +410.7%
> >
> > This is still suspicious.
>
> Regarding the +410% gain, it becomes clearer when looking at the inner loop of the
> Zbb implementation (https://docs.riscv.org/reference/isa/unpriv/b-st-ext.html#zbb).
> For a 64-bit system, the core loop uses only 5 instructions to process 8 bytes.
> Note that t3 is pre-loaded with -1 (0xFFFFFFFFFFFFFFFF):
>
> 1:
> 	REG_L	t1, SZREG(t0)	/* Load 8 bytes */
> 	addi	t0, t0, SZREG	/* Advance pointer */
> 	orc.b	t1, t1		/* Bitwise OR-Combine: 0x00 becomes 0x00, others 0xFF */
> 	bgeu	t0, t4, 4f	/* Boundary check (max_len) */
> 	beq	t1, t3, 1b	/* If t1 == 0xFFFFFFFFFFFFFFFF (no NUL), loop */
>
> In contrast, the generic C implementation performs byte-by-byte comparisons, which involves
> significantly more loads and branches for every single byte processed. The Zbb approach is
> much leaner: the orc.b instruction collapses the NUL-check for an entire 8-byte word into a
> single step. By shifting from a byte-oriented loop to this hardware-accelerated word-at-a-time
> logic, we drastically reduce the instruction count and branch overhead, which explains the
> massive jump in TCG throughput for long strings.
>
> Beyond the main loop, the Zbb implementation also utilizes ctz (Count Trailing Zeros)
> to handle the tail and alignment. Once orc.b identifies a NUL byte within a register,
> we can precisely locate its position in just three instructions:
>
> 	not	t1, t1		/* Flip bits: NUL byte (0x00) becomes 0xFF */
> 	ctz	t1, t1		/* Count bits before the first NUL byte */
> 	srli	t1, t1, 3	/* Divide by 8 to get byte offset */
>
> In a generic C implementation, calculating this byte offset typically requires a series
> of shifts, masks, or a sub-loop, which adds significant overhead. By combining orc.b and
> ctz, we eliminate all branching and lookup tables for the tail-end calculation, further
> contributing to the performance gains observed in the benchmarks.
>
> >> strchr | +4.0% | +6.4% | +1.5%
> >> strrchr | +6.6% | +2.8% | +0.0%
>
> As for strchr() and strrchr(), the relatively modest improvements are because the current
> versions in this series are implemented using simple byte-by-byte assembly. These primarily
> gain performance by reducing function call overhead and eliminating stack frame management
> compared to the generic C version.
>
> Unlike strnlen(), they do not yet utilize Zbb extensions. I plan to introduce Zbb-optimized
> versions for these functions in a future patch set, which I expect will bring performance
> gains similar to what we now see with strnlen().

Thanks for the details regarding the assembly native implementation.

> >> word-at-a-time logic, showing significant gains as the string length
> >> increases.
> >
> > Hmm... Have you tried to optimise the generic implementation to use
> > word-at-a-time logic and compare?
>
> Regarding the generic implementation, even if we were to optimize the C code
> to use word-at-a-time logic (the has_zero() style bit-manipulation), it still
> wouldn't match the Zbb version's efficiency.
>
> The traditional C-based word-level detection requires a sequence of arithmetic
> operations to identify NUL bytes. In contrast, the RISC-V orc.b instruction
> collapses this entire check into a single hardware cycle. I've focused on this
> architectural approach to fully leverage these specific Zbb features, which
> provides a level of instruction density that generic C math cannot achieve.

I understand that. My point is if we move the generic implementation
to use word-at-a-time technique the difference should not go 4x, right?
Perhaps 1.5x or so. I believe this will be a very useful exercise.

-- 
With Best Regards,
Andy Shevchenko
On Wed, 21 Jan 2026 09:01:29 +0200
Andy Shevchenko <andy.shevchenko@gmail.com> wrote:
...
> I understand that. My point is if we move the generic implementation
> to use word-at-a-time technique the difference should not go 4x,
> right? Perhaps 1.5x or so. I believe this will be a very useful
> exercise.
I posted a version earlier.
After the initial setup (aligning the base address and loading
some constants), the loop on x86-64 is 7 instructions (should be similar
for other architectures).
I think it will execute in 4 clocks.
You then need to find the byte in the word, easy enough on LE with
a fast ffs() - but harder otherwise.
The real problem is the cost for short strings.
Like memcpy() you need a hint from the source of the 'expected' length
(as a compile-time constant) to compile-time select the algorithm.
OTOH:

	for (;;) {
		if (!ptr[0])
			return ptr - start;
		ptr += 2;
		if (!ptr[-1])
			return ptr - start - 1;
	}

has two 'load+compare+branch' and one add per loop.
On x86 that might all overlap and give you a two-clock loop
that checks one byte every clock - faster than 'rep scasb'.
(You can get a two clock loop, but not a 1 clock loop.)
I think unrolling further will make little/no difference.
The break-even for the word-at-a-time version is probably at least 64
characters.
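A runnable C rendering of the loop above (reconstructed from the fragment in this mail, so the exact shape is an assumption; strlen_2x() is an invented name):

```c
#include <stddef.h>

/*
 * Two-bytes-per-iteration byte loop: the first test checks the even
 * byte, the second checks the odd byte, so each iteration costs two
 * load+compare+branch pairs and one pointer add.
 */
static size_t strlen_2x(const char *start)
{
	const char *ptr = start;

	for (;;) {
		if (!ptr[0])
			return ptr - start;
		ptr += 2;
		if (!ptr[-1])
			return ptr - start - 1;
	}
}
```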
David
On 2026/1/21 18:57, David Laight wrote:
> On Wed, 21 Jan 2026 09:01:29 +0200
> Andy Shevchenko <andy.shevchenko@gmail.com> wrote:
>
> ...
>> I understand that. My point is if we move the generic implementation
>> to use word-at-a-time technique the difference should not go 4x,
>> right? Perhaps 1.5x or so. I believe this will be a very useful
>> exercise.
>
> I posted a version earlier.
>
> After the initial setup (aligning the base address and loading
> some constants), the loop on x86-64 is 7 instructions (should be similar
> for other architectures).
> I think it will execute in 4 clocks.
> You then need to find the byte in the word, easy enough on LE with
> a fast ffs() - but harder otherwise.
> The real problem is the cost for short strings.
> Like memcpy() you need a hint from the source of the 'expected' length
> (as a compile-time constant) to compile-time select the algorithm.
>
> OTOH:
>
> 	for (;;) {
> 		if (!ptr[0])
> 			return ptr - start;
> 		ptr += 2;
> 		if (!ptr[-1])
> 			return ptr - start - 1;
> 	}
>
> has two 'load+compare+branch' and one add per loop.
> On x86 that might all overlap and give you a two-clock loop
> that checks one byte every clock - faster than 'rep scasb'.
> (You can get a two clock loop, but not a 1 clock loop.)
> I think unrolling further will make little/no difference.
>
> The break-even for the word-at-a-time version is probably at least 64
> characters.
>
Thanks for the profound analysis and the detailed suggestions.
I am still exploring some of the finer points of these low-level performance
trade-offs, so your input is very helpful. I will definitely spend some time
studying this further and experiment with the approaches you mentioned.
Thanks again for your help!
--
With Best Regards,
Feng Jiang
On Fri, 23 Jan 2026 11:12:00 +0800
Feng Jiang <jiangfeng@kylinos.cn> wrote:

> ...
> I am still exploring some of the finer points of these low-level performance
> trade-offs, so your input is very helpful. I will definitely spend some time
> studying this further and experiment with the approaches you mentioned.

You can spend a long time micro-optimising small functions.
I know, I've done it....
Then you need to test on as many cpu variants as you can find.

	David
On 2026/1/21 15:01, Andy Shevchenko wrote:
> On Wed, Jan 21, 2026 at 8:44 AM Feng Jiang <jiangfeng@kylinos.cn> wrote:
>> On 2026/1/20 15:36, Andy Shevchenko wrote:
>>> On Tue, Jan 20, 2026 at 02:58:44PM +0800, Feng Jiang wrote:
>
> ...
>
>>>> word-at-a-time logic, showing significant gains as the string length
>>>> increases.
>>>
>>> Hmm... Have you tried to optimise the generic implementation to use
>>> word-at-a-time logic and compare?
>>
>> Regarding the generic implementation, even if we were to optimize the C code
>> to use word-at-a-time logic (the has_zero() style bit-manipulation), it still
>> wouldn't match the Zbb version's efficiency.
>>
>> The traditional C-based word-level detection requires a sequence of arithmetic
>> operations to identify NUL bytes. In contrast, the RISC-V orc.b instruction
>> collapses this entire check into a single hardware cycle. I've focused on this
>> architectural approach to fully leverage these specific Zbb features, which
>> provides a level of instruction density that generic C math cannot achieve.
>
> I understand that. My point is if we move the generic implementation
> to use word-at-a-time technique the difference should not go 4x,
> right? Perhaps 1.5x or so. I believe this will be a very useful
> exercise.

That is a very insightful point, thanks for the suggestion. I'll look into
optimizing the generic string library as a follow-up task to see if we can
bring some improvements there as well.

Thanks again for the guidance.

-- 
With Best Regards,
Feng Jiang