Re: [PATCH 2/2] bitops: rotate: Add riscv implementation using Zbb extension

From: Yury Norov <yury.norov@gmail.com>
To: cp0613@linux.alibaba.com
Cc: alex@ghiti.fr, aou@eecs.berkeley.edu, arnd@arndb.de,
	linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-riscv@lists.infradead.org, linux@rasmusvillemoes.dk,
	palmer@dabbelt.com, paul.walmsley@sifive.com
Subject: Re: [PATCH 2/2] bitops: rotate: Add riscv implementation using Zbb extension
Date: Tue, 1 Jul 2025 14:32:14 -0400	[thread overview]
Message-ID: <aGQprv3HTplw9r-q@yury> (raw)
In-Reply-To: <20250701124737.687-1-cp0613@linux.alibaba.com>

On Tue, Jul 01, 2025 at 08:47:01PM +0800, cp0613@linux.alibaba.com wrote:
> On Mon, 30 Jun 2025 12:53:03 -0400, yury.norov@gmail.com wrote:
> 
> > > > 1. The most trivial compile-case is actually evaluated at compile time; and
> > > > 2. Any arch-specific code is well explained; and
> > > > 3. legacy case optimized just as well as non-legacy.
> > > 
> > > 1. As in the above reply, use the generic implementation when compile-time evaluation
> > >    is possible。
> > > 2. I will improve the comments later.
> > 
> > I'm particularly interested in ror8/rol8 case:
> > 
> >         u32 word32 = ((u32)word << 24) | ((u32)word << 16) | ((u32)word << 8) | word;
> > 
> > When you expand it to 32-bit word, and want to rotate, you obviously
> > need to copy lower quarterword to the higher one:
> > 
> >         0xab >> 0xab0000ab
> > 
> > That way generic (u8)ror32(0xab, shift) would work. But I don't understand
> > why you copy the lower 8 bits to inner quarterwords. Is that a hardware
> > requirement? Can you point to any arch documentation 
> > 
> > > 3. As mentioned before, only 8-bit rotation should have no optimization effect, and
> > >    16-bit and above will have significant optimization.
> > 
> > I asked you about the asm goto ("legacy") thing: you calculate that
> > complex word32 _before_ evaluating the goto. So this word32 may get
> > unused, and you waste cycles. I want to make sure this isn't the case.
> 
> Thank you for your suggestion and reminder. This is not a hardware requirement, but it

Sure. Please add

Suggested-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>

> is somewhat related because the zbb instruction only supports word-sized rotate [1].
> Considering the case where the shift is greater than 8, if the modulus of the shift is
> not taken, it is required to continuously concatenate 8-bit data into 32 bits.
> 
> So, I considered the case of taking the remainder of shift and found that this
> implementation would have one less instruction, so the final implementation is as follows:
> ```
> static inline u8 variable_rol8(u8 word, unsigned int shift)

Now that it has assembler inlines, would it help to add the __pure
qualifier?

> {
> 	u32 word32;
> 
> 	asm goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
> 				      RISCV_ISA_EXT_ZBB, 1)
> 			  : : : : legacy);
> 
> 	word32 = ((u32)word << 24) | word;
> 	shift = shift % 8;

        shift %= 8;

> 
> 	asm volatile(
> 		".option push\n"
> 		".option arch,+zbb\n"
> 		"rolw %0, %1, %2\n"
> 		".option pop\n"
> 		: "=r" (word32) : "r" (word32), "r" (shift) :);
> 
> 	return (u8)word32;
> 
> legacy:
> 	return generic_rol8(word, shift);
> }
> 
> static inline u8 variable_ror8(u8 word, unsigned int shift)
> {
> 	u32 word32;
> 
> 	asm goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
> 				      RISCV_ISA_EXT_ZBB, 1)
> 			  : : : : legacy);
> 
> 	word32 = ((u32)word << 8) | word;
> 	shift = shift % 8;
> 
> 	asm volatile(
> 		".option push\n"
> 		".option arch,+zbb\n"
> 		"rorw %0, %1, %2\n"
> 		".option pop\n"
> 		: "=r" (word32) : "r" (word32), "r" (shift) :);
> 
> 	return (u8)word32;
> 
> legacy:
> 	return generic_ror8(word, shift);
> }
> ```

OK, this looks better.

> I compared the performance of ror8 (zbb optimized) and generic_ror8 on the XUANTIE C908
> by looping them. ror8 is better, and the advantage of ror8 becomes more obvious as the
> number of iterations increases. The test code is as follows:
> ```
> 	u8 word = 0x5a;
> 	u32 shift = 9;
> 	u32 i, loop = 100;
> 	u8 ret1, ret2;
> 
> 	u64 t1 = ktime_get_ns();
> 	for (i = 0; i < loop; i++) {
> 		ret2 = generic_ror8(word, shift);
> 	}
> 	u64 t2 = ktime_get_ns();
> 	for (i = 0; i < loop; i++) {
> 		ret1 = ror8(word, shift);
> 	}
> 	u64 t3 = ktime_get_ns();
> 
> 	pr_info("t2-t1=%lld t3-t2=%lld\n", t2 - t1, t3 - t2);
> ```

Please do the following:

1. Drop the generic_ror8() and keep only ror/l8()
2. Add ror/l16, 34 and 64 tests.
3. Adjust the 'loop' so that each subtest will take 1-10 ms on your hw.
4. Name the function like test_rorl(), and put it next to the test_fns()
   in lib/test_bitops.c. Refer the 0a2c6664e56f0 for implementation.
5. Prepend the series with the benchmark patch, just as with const-eval
   one, and report before/after for your series. 

> > Please find attached a test for compile-time ror/rol evaluation.
> > Please consider prepending your series with it.
> 
> Okay, I'll bring it to the next series.

Still waiting for bloat-o-meter results.

Thanks,
Yury

> 
> [1] https://github.com/riscv/riscv-bitmanip

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv