All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Laight <david.laight.linux@gmail.com>
To: Zongmin Zhou <min_halo@163.com>
Cc: pjw@kernel.org, palmer@dabbelt.com, aou@eecs.berkeley.edu,
	alex@ghiti.fr, linux-riscv@lists.infradead.org,
	linux-kernel@vger.kernel.org,
	Zongmin Zhou <zhouzongmin@kylinos.cn>
Subject: Re: [PATCH] riscv: lib: optimize strchr() with Zbb extension
Date: Thu, 7 May 2026 11:22:35 +0100	[thread overview]
Message-ID: <20260507112235.41e539fa@pumpkin> (raw)
In-Reply-To: <20260507020620.134225-1-min_halo@163.com>

On Thu,  7 May 2026 10:06:20 +0800
Zongmin Zhou <min_halo@163.com> wrote:

> From: Zongmin Zhou <zhouzongmin@kylinos.cn>
> 
> Add a Zbb-powered optimization to the existing strchr() implementation
> using the 'orc.b' instruction, following the same pattern established
> by strnlen().
> 
> The Zbb variant processes data in word-sized chunks using orc.b to
> detect both NUL terminators and target characters in parallel. On
> systems without Zbb support, the original byte-by-byte implementation
> is used as a fallback via the alternatives mechanism.
> 
> Benchmark results (QEMU TCG, rv64):
>   Length | zbb=off (MB/s) | zbb=on (MB/s) | Improvement
>   -------|----------------|---------------|------------
>   1 B    | 27             | 25            | -7.4%
>   7 B    | 147            | 128           | -12.9%
>   16 B   | 216            | 372           | +72.2%
>   64 B   | 378            | 958           | +153.4%
>   512 B  | 480            | 1990          | +314.6%
>   4096 B | 501            | 2269          | +352.9%
> 
> The regression on very short strings (1-7 bytes) is due to the fixed
> overhead of the word-level path: broadcasting the target character to
> all byte lanes via multiplication and checking pointer alignment before
> entering the main loop. For strings shorter than one machine word, this
> setup cost outweighs the benefit of parallel comparison. As string
> length increases beyond 16 bytes, the word-at-a-time processing shows
> significant gains.
> 
> Signed-off-by: Zongmin Zhou <zhouzongmin@kylinos.cn>
> ---
>  arch/riscv/lib/strchr.S | 115 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 115 insertions(+)
> 
> diff --git a/arch/riscv/lib/strchr.S b/arch/riscv/lib/strchr.S
> index 48c3a9da53e3..600b19452bc2 100644
> --- a/arch/riscv/lib/strchr.S
> +++ b/arch/riscv/lib/strchr.S
> @@ -6,9 +6,15 @@
>  
>  #include <linux/linkage.h>
>  #include <asm/asm.h>
> +#include <asm/alternative-macros.h>
> +#include <asm/hwcap.h>
>  
>  /* char *strchr(const char *s, int c) */
>  SYM_FUNC_START(strchr)
> +
> +	__ALTERNATIVE_CFG("nop", "j strchr_zbb", 0, RISCV_ISA_EXT_ZBB,
> +		IS_ENABLED(CONFIG_RISCV_ISA_ZBB) && IS_ENABLED(CONFIG_TOOLCHAIN_HAS_ZBB))
> +
>  	/*
>  	 * Parameters
>  	 *   a0 - The string to be searched
> @@ -29,6 +35,115 @@ SYM_FUNC_START(strchr)
>  	li	a0, 0
>  2:
>  	ret
> +
> +/*
> + * Variant of strchr using the ZBB extension if available
> + *
> + * This implementation uses orc.b to detect both NUL terminators and target
> + * characters in parallel, processing word-sized chunks for efficiency.
> + */
> +#if defined(CONFIG_RISCV_ISA_ZBB) && defined(CONFIG_TOOLCHAIN_HAS_ZBB)
> +strchr_zbb:
> +
> +#ifdef CONFIG_CPU_BIG_ENDIAN
> +# define CZ	clz
> +#else
> +# define CZ	ctz
> +#endif
> +
> +.option push
> +.option arch,+zbb
> +
> +	/*
> +	 * Returns
> +	 *   a0 - Address of first occurrence of 'c' or NULL
> +	 *
> +	 * Parameters
> +	 *   a0 - String to search
> +	 *   a1 - Character to find
> +	 *
> +	 * Clobbers
> +	 *   t0, t1, t2, t3, t4
> +	 */
> +
> +	/*
> +	 * Prepare target character mask.
> +	 * Broadcast target character to all bytes using multiply.
> +	 */
> +	andi	a1, a1, 0xff
> +	li	t1, 0x01010101
> +#if __riscv_xlen == 64
> +	slli	t2, t1, 32
> +	or	t1, t1, t2
> +#endif
> +	mul	t2, a1, t1
> +
> +	/* All-ones mask for orc.b comparisons. */
> +	li	t4, -1
> +
> +	/* Check alignment. */
> +	andi	t0, a0, SZREG-1
> +	beqz	t0, 2f

It is almost certainly faster to jump 'out of line' for misaligned
strings and fallthrough for aligned ones.

> +
> +	/* Handle misaligned portion byte-by-byte. */
> +1:
> +	lbu	t1, 0(a0)
> +	beq	t1, a1, 9f
> +	beqz	t1, 8f
> +	addi	a0, a0, 1
> +	andi	t0, a0, SZREG-1
> +	bnez	t0, 1b
> +
> +	/* Main loop: process word-sized chunks. */

Tweak to remove a branch from the loop:
	addi	a0, a0, -SZREG

> +2:
> +	REG_L	t0, 0(a0)

Do read first for better instruction scheduling.
	REG_L	t0, SZREG(a0)
	addi	a0, a0, SZREG

> +
> +	/* Check for NUL terminator. */
> +	orc.b	t1, t0
> +	bne	t1, t4, 3f
> +
> +	/* Check for target character. */
> +	xor	t1, t0, t2
> +	orc.b	t1, t1
> +	bne	t1, t4, 4f

	be	t1, t4, 2b
and move the code at '4:' here.

-- David

> +
> +	addi	a0, a0, SZREG
> +	j	2b
> +
> +3:
> +	/* NUL found in current chunk. Check if target appears before NUL. */
> +	not	t1, t1
> +
> +	xor	t3, t0, t2
> +	orc.b	t3, t3
> +	not	t3, t3
> +
> +	CZ	t3, t3
> +	CZ	t1, t1
> +
> +	/* If NUL appears before target, character not found. */
> +	bltu	t1, t3, 8f
> +
> +	srli	t3, t3, 3
> +	add	a0, a0, t3
> +	ret
> +
> +4:
> +	/* Target found in chunk without NUL. */
> +	not	t1, t1
> +	CZ	t1, t1
> +	srli	t1, t1, 3
> +	add	a0, a0, t1
> +	ret
> +
> +8:
> +	/* Character not found, return NULL. */
> +	li	a0, 0
> +9:
> +	ret
> +
> +.option pop
> +#endif
>  SYM_FUNC_END(strchr)
>  
>  SYM_FUNC_ALIAS_WEAK(__pi_strchr, strchr)


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

WARNING: multiple messages have this Message-ID (diff)
From: David Laight <david.laight.linux@gmail.com>
To: Zongmin Zhou <min_halo@163.com>
Cc: pjw@kernel.org, palmer@dabbelt.com, aou@eecs.berkeley.edu,
	alex@ghiti.fr, linux-riscv@lists.infradead.org,
	linux-kernel@vger.kernel.org,
	Zongmin Zhou <zhouzongmin@kylinos.cn>
Subject: Re: [PATCH] riscv: lib: optimize strchr() with Zbb extension
Date: Thu, 7 May 2026 11:22:35 +0100	[thread overview]
Message-ID: <20260507112235.41e539fa@pumpkin> (raw)
In-Reply-To: <20260507020620.134225-1-min_halo@163.com>

On Thu,  7 May 2026 10:06:20 +0800
Zongmin Zhou <min_halo@163.com> wrote:

> From: Zongmin Zhou <zhouzongmin@kylinos.cn>
> 
> Add a Zbb-powered optimization to the existing strchr() implementation
> using the 'orc.b' instruction, following the same pattern established
> by strnlen().
> 
> The Zbb variant processes data in word-sized chunks using orc.b to
> detect both NUL terminators and target characters in parallel. On
> systems without Zbb support, the original byte-by-byte implementation
> is used as a fallback via the alternatives mechanism.
> 
> Benchmark results (QEMU TCG, rv64):
>   Length | zbb=off (MB/s) | zbb=on (MB/s) | Improvement
>   -------|----------------|---------------|------------
>   1 B    | 27             | 25            | -7.4%
>   7 B    | 147            | 128           | -12.9%
>   16 B   | 216            | 372           | +72.2%
>   64 B   | 378            | 958           | +153.4%
>   512 B  | 480            | 1990          | +314.6%
>   4096 B | 501            | 2269          | +352.9%
> 
> The regression on very short strings (1-7 bytes) is due to the fixed
> overhead of the word-level path: broadcasting the target character to
> all byte lanes via multiplication and checking pointer alignment before
> entering the main loop. For strings shorter than one machine word, this
> setup cost outweighs the benefit of parallel comparison. As string
> length increases beyond 16 bytes, the word-at-a-time processing shows
> significant gains.
> 
> Signed-off-by: Zongmin Zhou <zhouzongmin@kylinos.cn>
> ---
>  arch/riscv/lib/strchr.S | 115 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 115 insertions(+)
> 
> diff --git a/arch/riscv/lib/strchr.S b/arch/riscv/lib/strchr.S
> index 48c3a9da53e3..600b19452bc2 100644
> --- a/arch/riscv/lib/strchr.S
> +++ b/arch/riscv/lib/strchr.S
> @@ -6,9 +6,15 @@
>  
>  #include <linux/linkage.h>
>  #include <asm/asm.h>
> +#include <asm/alternative-macros.h>
> +#include <asm/hwcap.h>
>  
>  /* char *strchr(const char *s, int c) */
>  SYM_FUNC_START(strchr)
> +
> +	__ALTERNATIVE_CFG("nop", "j strchr_zbb", 0, RISCV_ISA_EXT_ZBB,
> +		IS_ENABLED(CONFIG_RISCV_ISA_ZBB) && IS_ENABLED(CONFIG_TOOLCHAIN_HAS_ZBB))
> +
>  	/*
>  	 * Parameters
>  	 *   a0 - The string to be searched
> @@ -29,6 +35,115 @@ SYM_FUNC_START(strchr)
>  	li	a0, 0
>  2:
>  	ret
> +
> +/*
> + * Variant of strchr using the ZBB extension if available
> + *
> + * This implementation uses orc.b to detect both NUL terminators and target
> + * characters in parallel, processing word-sized chunks for efficiency.
> + */
> +#if defined(CONFIG_RISCV_ISA_ZBB) && defined(CONFIG_TOOLCHAIN_HAS_ZBB)
> +strchr_zbb:
> +
> +#ifdef CONFIG_CPU_BIG_ENDIAN
> +# define CZ	clz
> +#else
> +# define CZ	ctz
> +#endif
> +
> +.option push
> +.option arch,+zbb
> +
> +	/*
> +	 * Returns
> +	 *   a0 - Address of first occurrence of 'c' or NULL
> +	 *
> +	 * Parameters
> +	 *   a0 - String to search
> +	 *   a1 - Character to find
> +	 *
> +	 * Clobbers
> +	 *   t0, t1, t2, t3, t4
> +	 */
> +
> +	/*
> +	 * Prepare target character mask.
> +	 * Broadcast target character to all bytes using multiply.
> +	 */
> +	andi	a1, a1, 0xff
> +	li	t1, 0x01010101
> +#if __riscv_xlen == 64
> +	slli	t2, t1, 32
> +	or	t1, t1, t2
> +#endif
> +	mul	t2, a1, t1
> +
> +	/* All-ones mask for orc.b comparisons. */
> +	li	t4, -1
> +
> +	/* Check alignment. */
> +	andi	t0, a0, SZREG-1
> +	beqz	t0, 2f

It is almost certainly faster to jump 'out of line' for misaligned
strings and fallthrough for aligned ones.

> +
> +	/* Handle misaligned portion byte-by-byte. */
> +1:
> +	lbu	t1, 0(a0)
> +	beq	t1, a1, 9f
> +	beqz	t1, 8f
> +	addi	a0, a0, 1
> +	andi	t0, a0, SZREG-1
> +	bnez	t0, 1b
> +
> +	/* Main loop: process word-sized chunks. */

Tweak to remove a branch from the loop:
	addi	a0, a0, -SZREG

> +2:
> +	REG_L	t0, 0(a0)

Do read first for better instruction scheduling.
	REG_L	t0, SZREG(a0)
	addi	a0, a0, SZREG

> +
> +	/* Check for NUL terminator. */
> +	orc.b	t1, t0
> +	bne	t1, t4, 3f
> +
> +	/* Check for target character. */
> +	xor	t1, t0, t2
> +	orc.b	t1, t1
> +	bne	t1, t4, 4f

	be	t1, t4, 2b
and move the code at '4:' here.

-- David

> +
> +	addi	a0, a0, SZREG
> +	j	2b
> +
> +3:
> +	/* NUL found in current chunk. Check if target appears before NUL. */
> +	not	t1, t1
> +
> +	xor	t3, t0, t2
> +	orc.b	t3, t3
> +	not	t3, t3
> +
> +	CZ	t3, t3
> +	CZ	t1, t1
> +
> +	/* If NUL appears before target, character not found. */
> +	bltu	t1, t3, 8f
> +
> +	srli	t3, t3, 3
> +	add	a0, a0, t3
> +	ret
> +
> +4:
> +	/* Target found in chunk without NUL. */
> +	not	t1, t1
> +	CZ	t1, t1
> +	srli	t1, t1, 3
> +	add	a0, a0, t1
> +	ret
> +
> +8:
> +	/* Character not found, return NULL. */
> +	li	a0, 0
> +9:
> +	ret
> +
> +.option pop
> +#endif
>  SYM_FUNC_END(strchr)
>  
>  SYM_FUNC_ALIAS_WEAK(__pi_strchr, strchr)


  reply	other threads:[~2026-05-07 10:22 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-07  2:06 [PATCH] riscv: lib: optimize strchr() with Zbb extension Zongmin Zhou
2026-05-07  2:06 ` Zongmin Zhou
2026-05-07 10:22 ` David Laight [this message]
2026-05-07 10:22   ` David Laight
2026-05-12  7:30   ` [PATCH v2] " Zongmin Zhou
2026-05-12  7:30     ` Zongmin Zhou

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260507112235.41e539fa@pumpkin \
    --to=david.laight.linux@gmail.com \
    --cc=alex@ghiti.fr \
    --cc=aou@eecs.berkeley.edu \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-riscv@lists.infradead.org \
    --cc=min_halo@163.com \
    --cc=palmer@dabbelt.com \
    --cc=pjw@kernel.org \
    --cc=zhouzongmin@kylinos.cn \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.