* [PATCH 0/3] riscv: word-at-a-time: improve find_zero()
@ 2026-01-13 12:24 Jisheng Zhang
2026-01-13 12:24 ` [PATCH 1/3] riscv: word-at-a-time: improve find_zero() for !RISCV_ISA_ZBB Jisheng Zhang
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Jisheng Zhang @ 2026-01-13 12:24 UTC (permalink / raw)
To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: linux-riscv, linux-kernel
Currently, there are two problems with riscv find_zero():
1. When !RISCV_ISA_ZBB, the generic fls64() generates suboptimal code.
But in the word-at-a-time case, we don't have to go down the fls64()
code path; instead, we can fall back to the generic word-at-a-time
implementation. What's more, fls64() brings unnecessary zero-bit
counting on RV32. In fact, fls() is enough.
2. Similar to 1, the generic fls64() also generates suboptimal code
when RISCV_ISA_ZBB=y but the HW doesn't support Zbb.
So this series improves find_zero() by falling back to the generic
word-at-a-time implementation where necessary. We dramatically reduce
the instruction count of find_zero() from 33 to 8! Testing with the
micro-benchmark in patch 1 also shows that performance improves by
about 1150%!
After that, we improve find_zero() for Zbb further by applying the same
optimization Linus did in commit f915a3e5b018 ("arm64:
word-at-a-time: improve byte count calculations for LE"), so that
we get similar improvements:
"The difference between the old and the new implementation is that
"count_zero()" ends up scheduling better because it is being done on a
value that is available earlier (before the final mask).
But more importantly, it can be implemented without the insane semantics
of the standard bit finding helpers that have the off-by-one issue and
have to special-case the zero mask situation."
On RV64 w/ Zbb, the new "find_zero()" ends up as just "ctz" plus the
shift right that then ends up being subsumed by the "add to final
length". This reduces the total instruction count from 7 to 3!
But I have no HW platform which supports Zbb, so I can't get
performance improvement numbers for the last patch; I have only built
and tested it on QEMU.
Jisheng Zhang (3):
riscv: word-at-a-time: improve find_zero() for !RISCV_ISA_ZBB
riscv: word-at-a-time: improve find_zero() without Zbb
riscv: word-at-a-time: improve find_zero() for Zbb
arch/riscv/include/asm/word-at-a-time.h | 47 +++++++++++++++++++++++--
1 file changed, 44 insertions(+), 3 deletions(-)
--
2.51.0
* [PATCH 1/3] riscv: word-at-a-time: improve find_zero() for !RISCV_ISA_ZBB
2026-01-13 12:24 [PATCH 0/3] riscv: word-at-a-time: improve find_zero() Jisheng Zhang
@ 2026-01-13 12:24 ` Jisheng Zhang
2026-01-13 12:24 ` [PATCH 2/3] riscv: word-at-a-time: improve find_zero() without Zbb Jisheng Zhang
2026-01-13 12:24 ` [PATCH 3/3] riscv: word-at-a-time: improve find_zero() for Zbb Jisheng Zhang
2 siblings, 0 replies; 4+ messages in thread
From: Jisheng Zhang @ 2026-01-13 12:24 UTC (permalink / raw)
To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: linux-riscv, linux-kernel
Current find_zero() depends heavily on fls64() for the calculation.
This generates suboptimal code when !RISCV_ISA_ZBB.
But in the word-at-a-time case, we don't have to go down the fls64()
code path; instead, we can fall back to the generic word-at-a-time
implementation.
What's more, fls64() brings unnecessary zero-bit counting on RV32. In
fact, fls() is enough.
Before the patch:
0000000000000000 <find_zero>:
0: c529 beqz a0,4a <.L1>
2: 577d li a4,-1
4: 9301 srli a4,a4,0x20
6: 03f00793 li a5,63
a: 00a76463 bltu a4,a0,12 <.L3>
e: 1502 slli a0,a0,0x20
10: 47fd li a5,31
0000000000000012 <.L3>:
12: 577d li a4,-1
14: 8341 srli a4,a4,0x10
16: 00a76463 bltu a4,a0,1e <.L4>
1a: 37c1 addiw a5,a5,-16
1c: 0542 slli a0,a0,0x10
000000000000001e <.L4>:
1e: 577d li a4,-1
20: 8321 srli a4,a4,0x8
22: 00a76463 bltu a4,a0,2a <.L5>
26: 37e1 addiw a5,a5,-8
28: 0522 slli a0,a0,0x8
000000000000002a <.L5>:
2a: 577d li a4,-1
2c: 8311 srli a4,a4,0x4
2e: 00a76463 bltu a4,a0,36 <.L6>
32: 37f1 addiw a5,a5,-4
34: 0512 slli a0,a0,0x4
0000000000000036 <.L6>:
36: 577d li a4,-1
38: 8309 srli a4,a4,0x2
3a: 00a76463 bltu a4,a0,42 <.L7>
3e: 37f9 addiw a5,a5,-2
40: 050a slli a0,a0,0x2
0000000000000042 <.L7>:
42: 00054563 bltz a0,4c <.L12>
46: 4037d51b sraiw a0,a5,0x3
000000000000004a <.L1>:
4a: 8082 ret
000000000000004c <.L12>:
4c: 2785 addiw a5,a5,1
4e: 4037d51b sraiw a0,a5,0x3
52: 8082 ret
After the patch:
0000000000000000 <find_zero>:
0: 102037b7 lui a5,0x10203
4: 0792 slli a5,a5,0x4
6: 40578793 addi a5,a5,1029 # 10203405 <.L4+0x102033c5>
a: 07c2 slli a5,a5,0x10
c: 60878793 addi a5,a5,1544
10: 02f50533 mul a0,a0,a5
14: 9161 srli a0,a0,0x38
16: 8082 ret
33 instructions vs 8 instructions!
And this reduction in instruction count dramatically improves the
performance of the micro-benchmark below:
$ cat tt.c
#include <stdio.h>
#include "word-at-a-time.h" // copied and modified, e.g. other headers removed
int main()
{
int i;
unsigned long ret = 0;
for (i = 0; i < 100000000; i++)
ret |= find_zero(0xabcd123 + i);
printf("%ld\n", ret);
}
$ gcc -O tt.c
$ time ./a.out
In my testing, the above micro-benchmark improves by about 1150%!
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
---
arch/riscv/include/asm/word-at-a-time.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/riscv/include/asm/word-at-a-time.h b/arch/riscv/include/asm/word-at-a-time.h
index 3802cda71ab7..0c8a9b337f93 100644
--- a/arch/riscv/include/asm/word-at-a-time.h
+++ b/arch/riscv/include/asm/word-at-a-time.h
@@ -13,6 +13,9 @@
#include <linux/bitops.h>
#include <linux/wordpart.h>
+#if !(defined(CONFIG_RISCV_ISA_ZBB) && defined(CONFIG_TOOLCHAIN_HAS_ZBB))
+#include <asm-generic/word-at-a-time.h>
+#else
struct word_at_a_time {
const unsigned long one_bits, high_bits;
};
@@ -47,6 +50,8 @@ static inline unsigned long find_zero(unsigned long mask)
/* The mask we created is directly usable as a bytemask */
#define zero_bytemask(mask) (mask)
+#endif /* !(defined(CONFIG_RISCV_ISA_ZBB) && defined(CONFIG_TOOLCHAIN_HAS_ZBB)) */
+
#ifdef CONFIG_DCACHE_WORD_ACCESS
/*
--
2.51.0
* [PATCH 2/3] riscv: word-at-a-time: improve find_zero() without Zbb
2026-01-13 12:24 [PATCH 0/3] riscv: word-at-a-time: improve find_zero() Jisheng Zhang
2026-01-13 12:24 ` [PATCH 1/3] riscv: word-at-a-time: improve find_zero() for !RISCV_ISA_ZBB Jisheng Zhang
@ 2026-01-13 12:24 ` Jisheng Zhang
2026-01-13 12:24 ` [PATCH 3/3] riscv: word-at-a-time: improve find_zero() for Zbb Jisheng Zhang
2 siblings, 0 replies; 4+ messages in thread
From: Jisheng Zhang @ 2026-01-13 12:24 UTC (permalink / raw)
To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: linux-riscv, linux-kernel
The previous commit improved find_zero() performance for
!RISCV_ISA_ZBB. But what about when RISCV_ISA_ZBB=y and the HW doesn't
support Zbb? We have the same heavy generic fls64() issue.
Let's improve this situation by checking for the Zbb extension and
falling back to the generic count_masked_bytes() if Zbb isn't
supported.
To remove unnecessary zero-bit counting on RV32, we also replace
'fls64(mask) >> 3' with '!mask ? 0 : ((__fls(mask) + 1) >> 3)'.
We get a similar performance improvement as the previous commit for
the case where RISCV_ISA_ZBB=y but the HW doesn't support Zbb.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
---
arch/riscv/include/asm/word-at-a-time.h | 29 ++++++++++++++++++++++++-
1 file changed, 28 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/include/asm/word-at-a-time.h b/arch/riscv/include/asm/word-at-a-time.h
index 0c8a9b337f93..ca3d30741ed1 100644
--- a/arch/riscv/include/asm/word-at-a-time.h
+++ b/arch/riscv/include/asm/word-at-a-time.h
@@ -42,9 +42,36 @@ static inline unsigned long create_zero_mask(unsigned long bits)
return bits >> 7;
}
+#ifdef CONFIG_64BIT
+/*
+ * Jan Achrenius on G+: microoptimized version of
+ * the simpler "(mask & ONEBYTES) * ONEBYTES >> 56"
+ * that works for the bytemasks without having to
+ * mask them first.
+ */
+static inline long count_masked_bytes(unsigned long mask)
+{
+ return mask*0x0001020304050608ul >> 56;
+}
+
+#else /* 32-bit case */
+
+/* Carl Chatfield / Jan Achrenius G+ version for 32-bit */
+static inline long count_masked_bytes(long mask)
+{
+ /* (000000 0000ff 00ffff ffffff) -> ( 1 1 2 3 ) */
+ long a = (0x0ff0001+mask) >> 23;
+ /* Fix the 1 for 00 case */
+ return a & mask;
+}
+#endif
+
static inline unsigned long find_zero(unsigned long mask)
{
- return fls64(mask) >> 3;
+ if (riscv_has_extension_likely(RISCV_ISA_EXT_ZBB))
+ return !mask ? 0 : ((__fls(mask) + 1) >> 3);
+
+ return count_masked_bytes(mask);
}
/* The mask we created is directly usable as a bytemask */
--
2.51.0
* [PATCH 3/3] riscv: word-at-a-time: improve find_zero() for Zbb
2026-01-13 12:24 [PATCH 0/3] riscv: word-at-a-time: improve find_zero() Jisheng Zhang
2026-01-13 12:24 ` [PATCH 1/3] riscv: word-at-a-time: improve find_zero() for !RISCV_ISA_ZBB Jisheng Zhang
2026-01-13 12:24 ` [PATCH 2/3] riscv: word-at-a-time: improve find_zero() without Zbb Jisheng Zhang
@ 2026-01-13 12:24 ` Jisheng Zhang
2 siblings, 0 replies; 4+ messages in thread
From: Jisheng Zhang @ 2026-01-13 12:24 UTC (permalink / raw)
To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: linux-riscv, linux-kernel
In commit f915a3e5b018 ("arm64: word-at-a-time: improve byte count
calculations for LE"), Linus improved find_zero() for arm64 LE.
Apply the same optimization here: "do __ffs() on the intermediate
value that found whether there is a zero byte, before we've actually
computed the final byte mask", so that we get similar improvements:
"The difference between the old and the new implementation is that
"count_zero()" ends up scheduling better because it is being done on a
value that is available earlier (before the final mask).
But more importantly, it can be implemented without the insane semantics
of the standard bit finding helpers that have the off-by-one issue and
have to special-case the zero mask situation."
Before the patch:
0000000000000000 <find_zero>:
0: c909 beqz a0,12 <.L1>
2: 60051793 clz a5,a0
6: 03f00513 li a0,63
a: 8d1d sub a0,a0,a5
c: 2505 addiw a0,a0,1
e: 4035551b sraiw a0,a0,0x3
0000000000000012 <.L1>:
12: 8082 ret
After the patch:
0000000000000000 <find_zero>:
0: 60151513 ctz a0,a0
4: 810d srli a0,a0,0x3
6: 8082 ret
7 instructions vs 3 instructions!
As can be seen, on RV64 w/ Zbb, the new "find_zero()" ends up as just
"ctz" plus the shift right that then ends up being subsumed by the
"add to final length".
But I have no HW platform which supports Zbb, so I can't get
performance improvement numbers for this last patch; I have only built
and tested it on QEMU.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
---
arch/riscv/include/asm/word-at-a-time.h | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/arch/riscv/include/asm/word-at-a-time.h b/arch/riscv/include/asm/word-at-a-time.h
index ca3d30741ed1..8c5ac6a72f7f 100644
--- a/arch/riscv/include/asm/word-at-a-time.h
+++ b/arch/riscv/include/asm/word-at-a-time.h
@@ -38,6 +38,9 @@ static inline unsigned long prep_zero_mask(unsigned long val,
static inline unsigned long create_zero_mask(unsigned long bits)
{
+ if (riscv_has_extension_likely(RISCV_ISA_EXT_ZBB))
+ return bits;
+
bits = (bits - 1) & ~bits;
return bits >> 7;
}
@@ -69,13 +72,19 @@ static inline long count_masked_bytes(long mask)
static inline unsigned long find_zero(unsigned long mask)
{
if (riscv_has_extension_likely(RISCV_ISA_EXT_ZBB))
- return !mask ? 0 : ((__fls(mask) + 1) >> 3);
+ return __ffs(mask) >> 3;
return count_masked_bytes(mask);
}
-/* The mask we created is directly usable as a bytemask */
-#define zero_bytemask(mask) (mask)
+static inline unsigned long zero_bytemask(unsigned long bits)
+{
+ if (!riscv_has_extension_likely(RISCV_ISA_EXT_ZBB))
+ return bits;
+
+ bits = (bits - 1) & ~bits;
+ return bits >> 7;
+}
#endif /* !(defined(CONFIG_RISCV_ISA_ZBB) && defined(CONFIG_TOOLCHAIN_HAS_ZBB)) */
--
2.51.0