linux-crypto.vger.kernel.org archive mirror
* [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation
@ 2025-06-11  3:31 Zhihang Shao
  2025-06-24  3:50 ` Eric Biggers
  0 siblings, 1 reply; 15+ messages in thread
From: Zhihang Shao @ 2025-06-11  3:31 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-riscv, herbert, paul.walmsley, alex, appro, zhang.lyra,
	ebiggers

From: Zhihang Shao <zhihang.shao.iscas@gmail.com>

This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305
implementation for RISC-V, authored by Andy Polyakov.
The file 'poly1305-riscv.pl' is taken verbatim from the upstream
GitHub repository [0] at commit 33fe84bc21219a16825459b37c825bf4580a0a7b,
which fixed a bug in the RISC-V 64-bit implementation.

[0] https://github.com/dot-asm/cryptogams

Signed-off-by: Zhihang Shao <zhihang.shao.iscas@gmail.com>
---
v4:
- use 142c43db490d188ff2aea1faf31b8ad0afd5b33f, the newest commit of
  poly1305-riscv.pl in [0]. (Eric)
- Directly pass -Dpoly1305_blocks=poly1305_blocks_arch instead of
  implementing glue subroutine in C. (Andy, Eric)
---
v3:
- remove redundant functions. (Herbert)
---
v2:
- rebase onto mainline and port change to arch/riscv/lib/crypto. (Eric)
- fix problems in code according to Andy's guidance. (Andy)
---
 arch/riscv/lib/crypto/Kconfig           |   5 +
 arch/riscv/lib/crypto/Makefile          |  18 +
 arch/riscv/lib/crypto/poly1305-glue.c   |  37 ++
 arch/riscv/lib/crypto/poly1305-riscv.pl | 816 ++++++++++++++++++++++++
 lib/crypto/Kconfig                      |   2 +-
 5 files changed, 877 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/lib/crypto/poly1305-glue.c
 create mode 100644 arch/riscv/lib/crypto/poly1305-riscv.pl

diff --git a/arch/riscv/lib/crypto/Kconfig b/arch/riscv/lib/crypto/Kconfig
index 47c99ea97ce2..3d5a87fafb01 100644
--- a/arch/riscv/lib/crypto/Kconfig
+++ b/arch/riscv/lib/crypto/Kconfig
@@ -7,6 +7,11 @@ config CRYPTO_CHACHA_RISCV64
 	select CRYPTO_ARCH_HAVE_LIB_CHACHA
 	select CRYPTO_LIB_CHACHA_GENERIC
 
+config CRYPTO_POLY1305_RISCV
+	tristate
+	default CRYPTO_LIB_POLY1305
+	select CRYPTO_ARCH_HAVE_LIB_POLY1305
+
 config CRYPTO_SHA256_RISCV64
 	tristate
 	depends on 64BIT && RISCV_ISA_V && TOOLCHAIN_HAS_VECTOR_CRYPTO
diff --git a/arch/riscv/lib/crypto/Makefile b/arch/riscv/lib/crypto/Makefile
index b7cb877a2c07..93ddb62ef0f9 100644
--- a/arch/riscv/lib/crypto/Makefile
+++ b/arch/riscv/lib/crypto/Makefile
@@ -3,5 +3,23 @@
 obj-$(CONFIG_CRYPTO_CHACHA_RISCV64) += chacha-riscv64.o
 chacha-riscv64-y := chacha-riscv64-glue.o chacha-riscv64-zvkb.o
 
+obj-$(CONFIG_CRYPTO_POLY1305_RISCV) += poly1305-riscv.o
+poly1305-riscv-y := poly1305-core.o poly1305-glue.o
+AFLAGS_poly1305-core.o += -Dpoly1305_init=poly1305_block_init_arch
+AFLAGS_poly1305-core.o += -Dpoly1305_blocks=poly1305_blocks_arch
+AFLAGS_poly1305-core.o += -Dpoly1305_emit=poly1305_emit_arch
+
 obj-$(CONFIG_CRYPTO_SHA256_RISCV64) += sha256-riscv64.o
 sha256-riscv64-y := sha256.o sha256-riscv64-zvknha_or_zvknhb-zvkb.o
+
+ifeq ($(CONFIG_64BIT),y)
+PERLASM_ARCH := 64
+else
+PERLASM_ARCH := void
+endif
+
+quiet_cmd_perlasm = PERLASM $@
+      cmd_perlasm = $(PERL) $(<) $(PERLASM_ARCH) $(@)
+
+$(obj)/%-core.S: $(src)/%-riscv.pl
+	$(call cmd,perlasm)
diff --git a/arch/riscv/lib/crypto/poly1305-glue.c b/arch/riscv/lib/crypto/poly1305-glue.c
new file mode 100644
index 000000000000..ddc73741faf5
--- /dev/null
+++ b/arch/riscv/lib/crypto/poly1305-glue.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * OpenSSL/Cryptogams accelerated Poly1305 transform for riscv
+ *
+ * Copyright (C) 2025 Institute of Software, CAS.
+ */
+
+#include <asm/hwcap.h>
+#include <asm/simd.h>
+#include <crypto/internal/poly1305.h>
+#include <linux/cpufeature.h>
+#include <linux/jump_label.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/unaligned.h>
+
+asmlinkage void poly1305_block_init_arch(
+	struct poly1305_block_state *state,
+	const u8 raw_key[POLY1305_BLOCK_SIZE]);
+EXPORT_SYMBOL_GPL(poly1305_block_init_arch);
+asmlinkage void poly1305_blocks_arch(struct poly1305_block_state *state,
+				const u8 *src, u32 len, u32 hibit);
+EXPORT_SYMBOL_GPL(poly1305_blocks_arch);
+asmlinkage void poly1305_emit_arch(const struct poly1305_state *state,
+				   u8 digest[POLY1305_DIGEST_SIZE],
+				   const u32 nonce[4]);
+EXPORT_SYMBOL_GPL(poly1305_emit_arch);
+
+bool poly1305_is_arch_optimized(void)
+{
+	/* Always usable: only the integer multiplication extension is required. */
+	return true;
+}
+EXPORT_SYMBOL(poly1305_is_arch_optimized);
+
+MODULE_DESCRIPTION("Poly1305 authenticator (RISC-V accelerated)");
+MODULE_LICENSE("GPL");
\ No newline at end of file
diff --git a/arch/riscv/lib/crypto/poly1305-riscv.pl b/arch/riscv/lib/crypto/poly1305-riscv.pl
new file mode 100644
index 000000000000..e08766541be7
--- /dev/null
+++ b/arch/riscv/lib/crypto/poly1305-riscv.pl
@@ -0,0 +1,816 @@
+#!/usr/bin/env perl
+# SPDX-License-Identifier: GPL-1.0+ OR BSD-3-Clause
+#
+# ====================================================================
+# Written by Andy Polyakov, @dot-asm, initially for use with OpenSSL.
+# ====================================================================
+#
+# Poly1305 hash for RISC-V.
+#
+# February 2019
+#
+# In essence it's a pretty straightforward transliteration of the MIPS
+# module [without big-endian option].
+#
+# 3.9 cycles per byte on U74, ~60% faster than compiler-generated code.
+# 1.9 cpb on C910, ~75% improvement. 2.3 cpb on JH7110 (U74 with
+# apparently better multiplier), ~69% faster, 3.3 on Spacemit X60,
+# ~69% improvement.
+#
+# June 2024.
+#
+# Add CHERI support.
+#
+######################################################################
+#
+($zero,$ra,$sp,$gp,$tp)=map("x$_",(0..4));
+($t0,$t1,$t2,$t3,$t4,$t5,$t6)=map("x$_",(5..7,28..31));
+($a0,$a1,$a2,$a3,$a4,$a5,$a6,$a7)=map("x$_",(10..17));
+($s0,$s1,$s2,$s3,$s4,$s5,$s6,$s7,$s8,$s9,$s10,$s11)=map("x$_",(8,9,18..27));
+#
+######################################################################
+
+$flavour = shift || "64";
+
+for (@ARGV) {   $output=$_ if (/\w[\w\-]*\.\w+$/);   }
+open STDOUT,">$output";
+
+if ($flavour =~ /64/) {{{
+######################################################################
+# 64-bit code path...
+#
+my ($ctx,$inp,$len,$padbit) = ($a0,$a1,$a2,$a3);
+my ($in0,$in1,$tmp0,$tmp1,$tmp2,$tmp3,$tmp4) = ($a4,$a5,$a6,$a7,$t0,$t1,$t2);
+
+$code.=<<___;
+#if __riscv_xlen == 64
+# if __SIZEOF_POINTER__ == 16
+#  define PUSH	csc
+#  define POP	clc
+# else
+#  define PUSH	sd
+#  define POP	ld
+# endif
+#else
+# error "unsupported __riscv_xlen"
+#endif
+
+.option	pic
+.text
+
+.globl	poly1305_init
+.type	poly1305_init,\@function
+poly1305_init:
+#ifdef	__riscv_zicfilp
+	lpad	0
+#endif
+	sd	$zero,0($ctx)
+	sd	$zero,8($ctx)
+	sd	$zero,16($ctx)
+
+	beqz	$inp,.Lno_key
+
+#ifndef	__CHERI_PURE_CAPABILITY__
+	andi	$tmp0,$inp,7		# $inp % 8
+	andi	$inp,$inp,-8		# align $inp
+	slli	$tmp0,$tmp0,3		# byte to bit offset
+#endif
+	ld	$in0,0($inp)
+	ld	$in1,8($inp)
+#ifndef	__CHERI_PURE_CAPABILITY__
+	beqz	$tmp0,.Laligned_key
+
+	ld	$tmp2,16($inp)
+	neg	$tmp1,$tmp0		# implicit &63 in sll
+	srl	$in0,$in0,$tmp0
+	sll	$tmp3,$in1,$tmp1
+	srl	$in1,$in1,$tmp0
+	sll	$tmp2,$tmp2,$tmp1
+	or	$in0,$in0,$tmp3
+	or	$in1,$in1,$tmp2
+
+.Laligned_key:
+#endif
+	li	$tmp0,1
+	slli	$tmp0,$tmp0,32		# 0x0000000100000000
+	addi	$tmp0,$tmp0,-63		# 0x00000000ffffffc1
+	slli	$tmp0,$tmp0,28		# 0x0ffffffc10000000
+	addi	$tmp0,$tmp0,-1		# 0x0ffffffc0fffffff
+
+	and	$in0,$in0,$tmp0
+	addi	$tmp0,$tmp0,-3		# 0x0ffffffc0ffffffc
+	and	$in1,$in1,$tmp0
+
+	sd	$in0,24($ctx)
+	srli	$tmp0,$in1,2
+	sd	$in1,32($ctx)
+	add	$tmp0,$tmp0,$in1	# s1 = r1 + (r1 >> 2)
+	sd	$tmp0,40($ctx)
+
+.Lno_key:
+	li	$a0,0			# return 0
+	ret
+.size	poly1305_init,.-poly1305_init
+___
+{
+my ($h0,$h1,$h2,$r0,$r1,$rs1,$d0,$d1,$d2) =
+   ($s0,$s1,$s2,$s3,$t3,$t4,$in0,$in1,$t2);
+my ($shr,$shl) = ($t5,$t6);		# used on R6
+
+$code.=<<___;
+.globl	poly1305_blocks
+.type	poly1305_blocks,\@function
+poly1305_blocks:
+#ifdef	__riscv_zicfilp
+	lpad	0
+#endif
+	andi	$len,$len,-16		# complete blocks only
+	beqz	$len,.Lno_data
+
+	caddi	$sp,$sp,-4*__SIZEOF_POINTER__
+	PUSH	$s0,3*__SIZEOF_POINTER__($sp)
+	PUSH	$s1,2*__SIZEOF_POINTER__($sp)
+	PUSH	$s2,1*__SIZEOF_POINTER__($sp)
+	PUSH	$s3,0*__SIZEOF_POINTER__($sp)
+
+#ifndef	__CHERI_PURE_CAPABILITY__
+	andi	$shr,$inp,7
+	andi	$inp,$inp,-8		# align $inp
+	slli	$shr,$shr,3		# byte to bit offset
+	neg	$shl,$shr		# implicit &63 in sll
+#endif
+
+	ld	$h0,0($ctx)		# load hash value
+	ld	$h1,8($ctx)
+	ld	$h2,16($ctx)
+
+	ld	$r0,24($ctx)		# load key
+	ld	$r1,32($ctx)
+	ld	$rs1,40($ctx)
+
+	add	$len,$len,$inp		# end of buffer
+
+.Loop:
+	ld	$in0,0($inp)		# load input
+	ld	$in1,8($inp)
+#ifndef	__CHERI_PURE_CAPABILITY__
+	beqz	$shr,.Laligned_inp
+
+	ld	$tmp2,16($inp)
+	srl	$in0,$in0,$shr
+	sll	$tmp3,$in1,$shl
+	srl	$in1,$in1,$shr
+	sll	$tmp2,$tmp2,$shl
+	or	$in0,$in0,$tmp3
+	or	$in1,$in1,$tmp2
+
+.Laligned_inp:
+#endif
+	caddi	$inp,$inp,16
+
+	andi	$tmp0,$h2,-4		# modulo-scheduled reduction
+	srli	$tmp1,$h2,2
+	andi	$h2,$h2,3
+
+	add	$d0,$h0,$in0		# accumulate input
+	 add	$tmp1,$tmp1,$tmp0
+	sltu	$tmp0,$d0,$h0
+	add	$d0,$d0,$tmp1		# ... and residue
+	sltu	$tmp1,$d0,$tmp1
+	add	$d1,$h1,$in1
+	add	$tmp0,$tmp0,$tmp1
+	sltu	$tmp1,$d1,$h1
+	add	$d1,$d1,$tmp0
+
+	 add	$d2,$h2,$padbit
+	 sltu	$tmp0,$d1,$tmp0
+	mulhu	$h1,$r0,$d0		# h0*r0
+	mul	$h0,$r0,$d0
+
+	 add	$d2,$d2,$tmp1
+	 add	$d2,$d2,$tmp0
+	mulhu	$tmp1,$rs1,$d1		# h1*5*r1
+	mul	$tmp0,$rs1,$d1
+
+	mulhu	$h2,$r1,$d0		# h0*r1
+	mul	$tmp2,$r1,$d0
+	 add	$h0,$h0,$tmp0
+	 add	$h1,$h1,$tmp1
+	 sltu	$tmp0,$h0,$tmp0
+
+	 add	$h1,$h1,$tmp0
+	 add	$h1,$h1,$tmp2
+	mulhu	$tmp1,$r0,$d1		# h1*r0
+	mul	$tmp0,$r0,$d1
+
+	 sltu	$tmp2,$h1,$tmp2
+	 add	$h2,$h2,$tmp2
+	mul	$tmp2,$rs1,$d2		# h2*5*r1
+
+	 add	$h1,$h1,$tmp0
+	 add	$h2,$h2,$tmp1
+	mul	$tmp3,$r0,$d2		# h2*r0
+	 sltu	$tmp0,$h1,$tmp0
+	 add	$h2,$h2,$tmp0
+
+	add	$h1,$h1,$tmp2
+	sltu	$tmp2,$h1,$tmp2
+	add	$h2,$h2,$tmp2
+	add	$h2,$h2,$tmp3
+
+	bne	$inp,$len,.Loop
+
+	sd	$h0,0($ctx)		# store hash value
+	sd	$h1,8($ctx)
+	sd	$h2,16($ctx)
+
+	POP	$s0,3*__SIZEOF_POINTER__($sp)		# epilogue
+	POP	$s1,2*__SIZEOF_POINTER__($sp)
+	POP	$s2,1*__SIZEOF_POINTER__($sp)
+	POP	$s3,0*__SIZEOF_POINTER__($sp)
+	caddi	$sp,$sp,4*__SIZEOF_POINTER__
+
+.Lno_data:
+	ret
+.size	poly1305_blocks,.-poly1305_blocks
+___
+}
+{
+my ($ctx,$mac,$nonce) = ($a0,$a1,$a2);
+
+$code.=<<___;
+.globl	poly1305_emit
+.type	poly1305_emit,\@function
+poly1305_emit:
+#ifdef	__riscv_zicfilp
+	lpad	0
+#endif
+	ld	$tmp2,16($ctx)
+	ld	$tmp0,0($ctx)
+	ld	$tmp1,8($ctx)
+
+	andi	$in0,$tmp2,-4		# final reduction
+	srl	$in1,$tmp2,2
+	andi	$tmp2,$tmp2,3
+	add	$in0,$in0,$in1
+
+	add	$tmp0,$tmp0,$in0
+	sltu	$in1,$tmp0,$in0
+	 addi	$in0,$tmp0,5		# compare to modulus
+	add	$tmp1,$tmp1,$in1
+	 sltiu	$tmp3,$in0,5
+	sltu	$tmp4,$tmp1,$in1
+	 add	$in1,$tmp1,$tmp3
+	add	$tmp2,$tmp2,$tmp4
+	 sltu	$tmp3,$in1,$tmp3
+	 add	$tmp2,$tmp2,$tmp3
+
+	srli	$tmp2,$tmp2,2		# see if it carried/borrowed
+	neg	$tmp2,$tmp2
+
+	xor	$in0,$in0,$tmp0
+	xor	$in1,$in1,$tmp1
+	and	$in0,$in0,$tmp2
+	and	$in1,$in1,$tmp2
+	xor	$in0,$in0,$tmp0
+	xor	$in1,$in1,$tmp1
+
+	lwu	$tmp0,0($nonce)		# load nonce
+	lwu	$tmp1,4($nonce)
+	lwu	$tmp2,8($nonce)
+	lwu	$tmp3,12($nonce)
+	slli	$tmp1,$tmp1,32
+	slli	$tmp3,$tmp3,32
+	or	$tmp0,$tmp0,$tmp1
+	or	$tmp2,$tmp2,$tmp3
+
+	add	$in0,$in0,$tmp0		# accumulate nonce
+	add	$in1,$in1,$tmp2
+	sltu	$tmp0,$in0,$tmp0
+	add	$in1,$in1,$tmp0
+
+	srli	$tmp0,$in0,8		# write mac value
+	srli	$tmp1,$in0,16
+	srli	$tmp2,$in0,24
+	sb	$in0,0($mac)
+	srli	$tmp3,$in0,32
+	sb	$tmp0,1($mac)
+	srli	$tmp0,$in0,40
+	sb	$tmp1,2($mac)
+	srli	$tmp1,$in0,48
+	sb	$tmp2,3($mac)
+	srli	$tmp2,$in0,56
+	sb	$tmp3,4($mac)
+	srli	$tmp3,$in1,8
+	sb	$tmp0,5($mac)
+	srli	$tmp0,$in1,16
+	sb	$tmp1,6($mac)
+	srli	$tmp1,$in1,24
+	sb	$tmp2,7($mac)
+
+	sb	$in1,8($mac)
+	srli	$tmp2,$in1,32
+	sb	$tmp3,9($mac)
+	srli	$tmp3,$in1,40
+	sb	$tmp0,10($mac)
+	srli	$tmp0,$in1,48
+	sb	$tmp1,11($mac)
+	srli	$tmp1,$in1,56
+	sb	$tmp2,12($mac)
+	sb	$tmp3,13($mac)
+	sb	$tmp0,14($mac)
+	sb	$tmp1,15($mac)
+
+	ret
+.size	poly1305_emit,.-poly1305_emit
+.string	"Poly1305 for RISC-V, CRYPTOGAMS by \@dot-asm"
+___
+}
+}}} else {{{
+######################################################################
+# 32-bit code path
+#
+
+my ($ctx,$inp,$len,$padbit) = ($a0,$a1,$a2,$a3);
+my ($in0,$in1,$in2,$in3,$tmp0,$tmp1,$tmp2,$tmp3) =
+   ($a4,$a5,$a6,$a7,$t0,$t1,$t2,$t3);
+
+$code.=<<___;
+#if __riscv_xlen == 32
+# if __SIZEOF_POINTER__ == 8
+#  define PUSH	csc
+#  define POP	clc
+# else
+#  define PUSH	sw
+#  define POP	lw
+# endif
+# define MULX(hi,lo,a,b)	mulhu hi,a,b; mul lo,a,b
+# define srliw	srli
+# define srlw	srl
+# define sllw	sll
+# define addw	add
+# define addiw	addi
+# define mulw	mul
+#elif __riscv_xlen == 64
+# if __SIZEOF_POINTER__ == 16
+#  define PUSH	csc
+#  define POP	clc
+# else
+#  define PUSH	sd
+#  define POP	ld
+# endif
+# define MULX(hi,lo,a,b)	slli b,b,32; srli b,b,32; mul hi,a,b; addiw lo,hi,0; srai hi,hi,32
+#else
+# error "unsupported __riscv_xlen"
+#endif
+
+.option	pic
+.text
+
+.globl	poly1305_init
+.type	poly1305_init,\@function
+poly1305_init:
+#ifdef	__riscv_zicfilp
+	lpad	0
+#endif
+	sw	$zero,0($ctx)
+	sw	$zero,4($ctx)
+	sw	$zero,8($ctx)
+	sw	$zero,12($ctx)
+	sw	$zero,16($ctx)
+
+	beqz	$inp,.Lno_key
+
+#ifndef	__CHERI_PURE_CAPABILITY__
+	andi	$tmp0,$inp,3		# $inp % 4
+	sub	$inp,$inp,$tmp0		# align $inp
+	sll	$tmp0,$tmp0,3		# byte to bit offset
+#endif
+	lw	$in0,0($inp)
+	lw	$in1,4($inp)
+	lw	$in2,8($inp)
+	lw	$in3,12($inp)
+#ifndef	__CHERI_PURE_CAPABILITY__
+	beqz	$tmp0,.Laligned_key
+
+	lw	$tmp2,16($inp)
+	sub	$tmp1,$zero,$tmp0
+	srlw	$in0,$in0,$tmp0
+	sllw	$tmp3,$in1,$tmp1
+	srlw	$in1,$in1,$tmp0
+	or	$in0,$in0,$tmp3
+	sllw	$tmp3,$in2,$tmp1
+	srlw	$in2,$in2,$tmp0
+	or	$in1,$in1,$tmp3
+	sllw	$tmp3,$in3,$tmp1
+	srlw	$in3,$in3,$tmp0
+	or	$in2,$in2,$tmp3
+	sllw	$tmp2,$tmp2,$tmp1
+	or	$in3,$in3,$tmp2
+.Laligned_key:
+#endif
+
+	lui	$tmp0,0x10000
+	addi	$tmp0,$tmp0,-1		# 0x0fffffff
+	and	$in0,$in0,$tmp0
+	addi	$tmp0,$tmp0,-3		# 0x0ffffffc
+	and	$in1,$in1,$tmp0
+	and	$in2,$in2,$tmp0
+	and	$in3,$in3,$tmp0
+
+	sw	$in0,20($ctx)
+	sw	$in1,24($ctx)
+	sw	$in2,28($ctx)
+	sw	$in3,32($ctx)
+
+	srlw	$tmp1,$in1,2
+	srlw	$tmp2,$in2,2
+	srlw	$tmp3,$in3,2
+	addw	$in1,$in1,$tmp1		# s1 = r1 + (r1 >> 2)
+	addw	$in2,$in2,$tmp2
+	addw	$in3,$in3,$tmp3
+	sw	$in1,36($ctx)
+	sw	$in2,40($ctx)
+	sw	$in3,44($ctx)
+.Lno_key:
+	li	$a0,0
+	ret
+.size	poly1305_init,.-poly1305_init
+___
+{
+my ($h0,$h1,$h2,$h3,$h4, $r0,$r1,$r2,$r3, $rs1,$rs2,$rs3) =
+   ($s0,$s1,$s2,$s3,$s4, $s5,$s6,$s7,$s8, $t0,$t1,$t2);
+my ($d0,$d1,$d2,$d3) =
+   ($a4,$a5,$a6,$a7);
+my $shr = $ra;		# used on R6
+
+$code.=<<___;
+.globl	poly1305_blocks
+.type	poly1305_blocks,\@function
+poly1305_blocks:
+#ifdef	__riscv_zicfilp
+	lpad	0
+#endif
+	andi	$len,$len,-16		# complete blocks only
+	beqz	$len,.Labort
+
+	caddi	$sp,$sp,-__SIZEOF_POINTER__*12
+	PUSH	$ra, __SIZEOF_POINTER__*11($sp)
+	PUSH	$s0, __SIZEOF_POINTER__*10($sp)
+	PUSH	$s1, __SIZEOF_POINTER__*9($sp)
+	PUSH	$s2, __SIZEOF_POINTER__*8($sp)
+	PUSH	$s3, __SIZEOF_POINTER__*7($sp)
+	PUSH	$s4, __SIZEOF_POINTER__*6($sp)
+	PUSH	$s5, __SIZEOF_POINTER__*5($sp)
+	PUSH	$s6, __SIZEOF_POINTER__*4($sp)
+	PUSH	$s7, __SIZEOF_POINTER__*3($sp)
+	PUSH	$s8, __SIZEOF_POINTER__*2($sp)
+
+#ifndef	__CHERI_PURE_CAPABILITY__
+	andi	$shr,$inp,3
+	andi	$inp,$inp,-4		# align $inp
+	slli	$shr,$shr,3		# byte to bit offset
+#endif
+
+	lw	$h0,0($ctx)		# load hash value
+	lw	$h1,4($ctx)
+	lw	$h2,8($ctx)
+	lw	$h3,12($ctx)
+	lw	$h4,16($ctx)
+
+	lw	$r0,20($ctx)		# load key
+	lw	$r1,24($ctx)
+	lw	$r2,28($ctx)
+	lw	$r3,32($ctx)
+	lw	$rs1,36($ctx)
+	lw	$rs2,40($ctx)
+	lw	$rs3,44($ctx)
+
+	add	$len,$len,$inp		# end of buffer
+
+.Loop:
+	lw	$d0,0($inp)		# load input
+	lw	$d1,4($inp)
+	lw	$d2,8($inp)
+	lw	$d3,12($inp)
+#ifndef	__CHERI_PURE_CAPABILITY__
+	beqz	$shr,.Laligned_inp
+
+	lw	$t4,16($inp)
+	sub	$t5,$zero,$shr
+	srlw	$d0,$d0,$shr
+	sllw	$t3,$d1,$t5
+	srlw	$d1,$d1,$shr
+	or	$d0,$d0,$t3
+	sllw	$t3,$d2,$t5
+	srlw	$d2,$d2,$shr
+	or	$d1,$d1,$t3
+	sllw	$t3,$d3,$t5
+	srlw	$d3,$d3,$shr
+	or	$d2,$d2,$t3
+	sllw	$t4,$t4,$t5
+	or	$d3,$d3,$t4
+
+.Laligned_inp:
+#endif
+	srliw	$t3,$h4,2		# modulo-scheduled reduction
+	andi	$t4,$h4,-4
+	andi	$h4,$h4,3
+
+	addw	$d0,$d0,$h0		# accumulate input
+	 addw	$t4,$t4,$t3
+	sltu	$h0,$d0,$h0
+	addw	$d0,$d0,$t4		# ... and residue
+	sltu	$t4,$d0,$t4
+
+	addw	$d1,$d1,$h1
+	 addw	$h0,$h0,$t4		# carry
+	sltu	$h1,$d1,$h1
+	addw	$d1,$d1,$h0
+	sltu	$h0,$d1,$h0
+
+	addw	$d2,$d2,$h2
+	 addw	$h1,$h1,$h0		# carry
+	sltu	$h2,$d2,$h2
+	addw	$d2,$d2,$h1
+	sltu	$h1,$d2,$h1
+
+	addw	$d3,$d3,$h3
+	 addw	$h2,$h2,$h1		# carry
+	sltu	$h3,$d3,$h3
+	addw	$d3,$d3,$h2
+
+	MULX	($h1,$h0,$r0,$d0)	# d0*r0
+
+	 sltu	$h2,$d3,$h2
+	 addw	$h3,$h3,$h2		# carry
+
+	MULX	($t4,$t3,$rs3,$d1)	# d1*s3
+
+	 addw	$h4,$h4,$padbit
+	 caddi	$inp,$inp,16
+	 addw	$h4,$h4,$h3
+
+	MULX	($t6,$a3,$rs2,$d2)	# d2*s2
+	 addw	$h0,$h0,$t3
+	 addw	$h1,$h1,$t4
+	 sltu	$t3,$h0,$t3
+	 addw	$h1,$h1,$t3
+
+	MULX	($t4,$t3,$rs1,$d3)	# d3*s1
+	 addw	$h0,$h0,$a3
+	 addw	$h1,$h1,$t6
+	 sltu	$a3,$h0,$a3
+	 addw	$h1,$h1,$a3
+
+
+	MULX	($h2,$a3,$r1,$d0)	# d0*r1
+	 addw	$h0,$h0,$t3
+	 addw	$h1,$h1,$t4
+	 sltu	$t3,$h0,$t3
+	 addw	$h1,$h1,$t3
+
+	MULX	($t4,$t3,$r0,$d1)	# d1*r0
+	 addw	$h1,$h1,$a3
+	 sltu	$a3,$h1,$a3
+	 addw	$h2,$h2,$a3
+
+	MULX	($t6,$a3,$rs3,$d2)	# d2*s3
+	 addw	$h1,$h1,$t3
+	 addw	$h2,$h2,$t4
+	 sltu	$t3,$h1,$t3
+	 addw	$h2,$h2,$t3
+
+	MULX	($t4,$t3,$rs2,$d3)	# d3*s2
+	 addw	$h1,$h1,$a3
+	 addw	$h2,$h2,$t6
+	 sltu	$a3,$h1,$a3
+	 addw	$h2,$h2,$a3
+
+	mulw	$a3,$rs1,$h4		# h4*s1
+	 addw	$h1,$h1,$t3
+	 addw	$h2,$h2,$t4
+	 sltu	$t3,$h1,$t3
+	 addw	$h2,$h2,$t3
+
+
+	MULX	($h3,$t3,$r2,$d0)	# d0*r2
+	 addw	$h1,$h1,$a3
+	 sltu	$a3,$h1,$a3
+	 addw	$h2,$h2,$a3
+
+	MULX	($t6,$a3,$r1,$d1)	# d1*r1
+	 addw	$h2,$h2,$t3
+	 sltu	$t3,$h2,$t3
+	 addw	$h3,$h3,$t3
+
+	MULX	($t4,$t3,$r0,$d2)	# d2*r0
+	 addw	$h2,$h2,$a3
+	 addw	$h3,$h3,$t6
+	 sltu	$a3,$h2,$a3
+	 addw	$h3,$h3,$a3
+
+	MULX	($t6,$a3,$rs3,$d3)	# d3*s3
+	 addw	$h2,$h2,$t3
+	 addw	$h3,$h3,$t4
+	 sltu	$t3,$h2,$t3
+	 addw	$h3,$h3,$t3
+
+	mulw	$t3,$rs2,$h4		# h4*s2
+	 addw	$h2,$h2,$a3
+	 addw	$h3,$h3,$t6
+	 sltu	$a3,$h2,$a3
+	 addw	$h3,$h3,$a3
+
+
+	MULX	($t6,$a3,$r3,$d0)	# d0*r3
+	 addw	$h2,$h2,$t3
+	 sltu	$t3,$h2,$t3
+	 addw	$h3,$h3,$t3
+
+	MULX	($t4,$t3,$r2,$d1)	# d1*r2
+	 addw	$h3,$h3,$a3
+	 sltu	$a3,$h3,$a3
+	 addw	$t6,$t6,$a3
+
+	MULX	($a3,$d3,$r0,$d3)	# d3*r0
+	 addw	$h3,$h3,$t3
+	 addw	$t6,$t6,$t4
+	 sltu	$t3,$h3,$t3
+	 addw	$t6,$t6,$t3
+
+	MULX	($t4,$t3,$r1,$d2)	# d2*r1
+	 addw	$h3,$h3,$d3
+	 addw	$t6,$t6,$a3
+	 sltu	$d3,$h3,$d3
+	 addw	$t6,$t6,$d3
+
+	mulw	$a3,$rs3,$h4		# h4*s3
+	 addw	$h3,$h3,$t3
+	 addw	$t6,$t6,$t4
+	 sltu	$t3,$h3,$t3
+	 addw	$t6,$t6,$t3
+
+
+	mulw	$h4,$r0,$h4		# h4*r0
+	 addw	$h3,$h3,$a3
+	 sltu	$a3,$h3,$a3
+	 addw	$t6,$t6,$a3
+	addw	$h4,$t6,$h4
+
+	li	$padbit,1		# if we loop, padbit is 1
+
+	bne	$inp,$len,.Loop
+
+	sw	$h0,0($ctx)		# store hash value
+	sw	$h1,4($ctx)
+	sw	$h2,8($ctx)
+	sw	$h3,12($ctx)
+	sw	$h4,16($ctx)
+
+	POP	$ra, __SIZEOF_POINTER__*11($sp)
+	POP	$s0, __SIZEOF_POINTER__*10($sp)
+	POP	$s1, __SIZEOF_POINTER__*9($sp)
+	POP	$s2, __SIZEOF_POINTER__*8($sp)
+	POP	$s3, __SIZEOF_POINTER__*7($sp)
+	POP	$s4, __SIZEOF_POINTER__*6($sp)
+	POP	$s5, __SIZEOF_POINTER__*5($sp)
+	POP	$s6, __SIZEOF_POINTER__*4($sp)
+	POP	$s7, __SIZEOF_POINTER__*3($sp)
+	POP	$s8, __SIZEOF_POINTER__*2($sp)
+	caddi	$sp,$sp,__SIZEOF_POINTER__*12
+.Labort:
+	ret
+.size	poly1305_blocks,.-poly1305_blocks
+___
+}
+{
+my ($ctx,$mac,$nonce,$tmp4) = ($a0,$a1,$a2,$a3);
+
+$code.=<<___;
+.globl	poly1305_emit
+.type	poly1305_emit,\@function
+poly1305_emit:
+#ifdef	__riscv_zicfilp
+	lpad	0
+#endif
+	lw	$tmp4,16($ctx)
+	lw	$tmp0,0($ctx)
+	lw	$tmp1,4($ctx)
+	lw	$tmp2,8($ctx)
+	lw	$tmp3,12($ctx)
+
+	srliw	$ctx,$tmp4,2		# final reduction
+	andi	$in0,$tmp4,-4
+	andi	$tmp4,$tmp4,3
+	addw	$ctx,$ctx,$in0
+
+	addw	$tmp0,$tmp0,$ctx
+	sltu	$ctx,$tmp0,$ctx
+	 addiw	$in0,$tmp0,5		# compare to modulus
+	addw	$tmp1,$tmp1,$ctx
+	 sltiu	$in1,$in0,5
+	sltu	$ctx,$tmp1,$ctx
+	 addw	$in1,$in1,$tmp1
+	addw	$tmp2,$tmp2,$ctx
+	 sltu	$in2,$in1,$tmp1
+	sltu	$ctx,$tmp2,$ctx
+	 addw	$in2,$in2,$tmp2
+	addw	$tmp3,$tmp3,$ctx
+	 sltu	$in3,$in2,$tmp2
+	sltu	$ctx,$tmp3,$ctx
+	 addw	$in3,$in3,$tmp3
+	addw	$tmp4,$tmp4,$ctx
+	 sltu	$ctx,$in3,$tmp3
+	 addw	$ctx,$ctx,$tmp4
+
+	srl	$ctx,$ctx,2		# see if it carried/borrowed
+	sub	$ctx,$zero,$ctx
+
+	xor	$in0,$in0,$tmp0
+	xor	$in1,$in1,$tmp1
+	xor	$in2,$in2,$tmp2
+	xor	$in3,$in3,$tmp3
+	and	$in0,$in0,$ctx
+	and	$in1,$in1,$ctx
+	and	$in2,$in2,$ctx
+	and	$in3,$in3,$ctx
+	xor	$in0,$in0,$tmp0
+	xor	$in1,$in1,$tmp1
+	xor	$in2,$in2,$tmp2
+	xor	$in3,$in3,$tmp3
+
+	lw	$tmp0,0($nonce)		# load nonce
+	lw	$tmp1,4($nonce)
+	lw	$tmp2,8($nonce)
+	lw	$tmp3,12($nonce)
+
+	addw	$in0,$in0,$tmp0		# accumulate nonce
+	sltu	$ctx,$in0,$tmp0
+
+	addw	$in1,$in1,$tmp1
+	sltu	$tmp1,$in1,$tmp1
+	addw	$in1,$in1,$ctx
+	sltu	$ctx,$in1,$ctx
+	addw	$ctx,$ctx,$tmp1
+
+	addw	$in2,$in2,$tmp2
+	sltu	$tmp2,$in2,$tmp2
+	addw	$in2,$in2,$ctx
+	sltu	$ctx,$in2,$ctx
+	addw	$ctx,$ctx,$tmp2
+
+	addw	$in3,$in3,$tmp3
+	addw	$in3,$in3,$ctx
+
+	srl	$tmp0,$in0,8		# write mac value
+	srl	$tmp1,$in0,16
+	srl	$tmp2,$in0,24
+	sb	$in0, 0($mac)
+	sb	$tmp0,1($mac)
+	srl	$tmp0,$in1,8
+	sb	$tmp1,2($mac)
+	srl	$tmp1,$in1,16
+	sb	$tmp2,3($mac)
+	srl	$tmp2,$in1,24
+	sb	$in1, 4($mac)
+	sb	$tmp0,5($mac)
+	srl	$tmp0,$in2,8
+	sb	$tmp1,6($mac)
+	srl	$tmp1,$in2,16
+	sb	$tmp2,7($mac)
+	srl	$tmp2,$in2,24
+	sb	$in2, 8($mac)
+	sb	$tmp0,9($mac)
+	srl	$tmp0,$in3,8
+	sb	$tmp1,10($mac)
+	srl	$tmp1,$in3,16
+	sb	$tmp2,11($mac)
+	srl	$tmp2,$in3,24
+	sb	$in3, 12($mac)
+	sb	$tmp0,13($mac)
+	sb	$tmp1,14($mac)
+	sb	$tmp2,15($mac)
+
+	ret
+.size	poly1305_emit,.-poly1305_emit
+.string	"Poly1305 for RISC-V, CRYPTOGAMS by \@dot-asm"
+___
+}
+}}}
+
+foreach (split("\n", $code)) {
+    if ($flavour =~ /^cheri/) {
+	s/\(x([0-9]+)\)/(c$1)/ and s/\b([ls][bhwd]u?)\b/c$1/;
+	s/\b(PUSH|POP)(\s+)x([0-9]+)/$1$2c$3/ or
+	s/\b(ret|jal)\b/c$1/;
+	s/\bcaddi?\b/cincoffset/ and s/\bx([0-9]+,)/c$1/g or
+	m/\bcmove\b/ and s/\bx([0-9]+)/c$1/g;
+    } else {
+	s/\bcaddi?\b/add/ or
+	s/\bcmove\b/mv/;
+    }
+    print $_, "\n";
+}
+
+close STDOUT;
\ No newline at end of file
diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig
index 1ec1466108cc..0d425f8ce90f 100644
--- a/lib/crypto/Kconfig
+++ b/lib/crypto/Kconfig
@@ -100,7 +100,7 @@ config CRYPTO_LIB_DES
 
 config CRYPTO_LIB_POLY1305_RSIZE
 	int
-	default 2 if MIPS
+	default 2 if MIPS || RISCV
 	default 11 if X86_64
 	default 9 if ARM || ARM64
 	default 1
-- 
2.43.0



* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation
  2025-06-11  3:31 Zhihang Shao
@ 2025-06-24  3:50 ` Eric Biggers
  2025-06-24  9:13   ` Andy Polyakov
                     ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Eric Biggers @ 2025-06-24  3:50 UTC (permalink / raw)
  To: Zhihang Shao
  Cc: linux-crypto, linux-riscv, herbert, paul.walmsley, alex, appro,
	zhang.lyra

Hi Zhihang,

On Wed, Jun 11, 2025 at 11:31:51AM +0800, Zhihang Shao wrote:
> From: Zhihang Shao <zhihang.shao.iscas@gmail.com>
> 
> This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305
> implementation for RISC-V, authored by Andy Polyakov.
> The file 'poly1305-riscv.pl' is taken verbatim from the upstream
> GitHub repository [0] at commit 33fe84bc21219a16825459b37c825bf4580a0a7b,
> which fixed a bug in the RISC-V 64-bit implementation.
> 
> [0] https://github.com/dot-asm/cryptogams
> 
> Signed-off-by: Zhihang Shao <zhihang.shao.iscas@gmail.com>

I wanted to let you know that I haven't forgotten about this.  However, the
kernel currently lacks a proper test for Poly1305, and Poly1305 has some edge
cases that are prone to bugs.  So I'd like to find some time to add a proper
Poly1305 test and properly review this.  Also, I'm in the middle of fixing how
the kernel's architecture-optimized crypto code is integrated; for example, I've
just moved arch/*/lib/crypto/ to lib/crypto/.
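
For reference, the algorithm those edge cases live in can be written compactly
in Python, straight from the RFC 8439 description (a non-kernel sketch for
cross-checking test vectors; the function and variable names are illustrative,
and it is of course not constant-time):

```python
def poly1305_mac(msg: bytes, key: bytes) -> bytes:
    """Poly1305 per RFC 8439: tag = ((sum of padded blocks) * r mod 2^130-5) + s."""
    # Clamp r exactly as the perlasm init code does (0x0ffffffc0ffffffc0ffffffc0fffffff).
    r = int.from_bytes(key[:16], "little") & 0x0FFFFFFC0FFFFFFC0FFFFFFC0FFFFFFF
    s = int.from_bytes(key[16:32], "little")
    p = (1 << 130) - 5
    acc = 0
    for i in range(0, len(msg), 16):
        # Append the 0x01 pad byte (the "hibit"/"padbit") to every block,
        # including a short final block.
        block = int.from_bytes(msg[i:i + 16] + b"\x01", "little")
        acc = (acc + block) * r % p
    return ((acc + s) % (1 << 128)).to_bytes(16, "little")
```

The bug-prone cases are mostly around the final reduction, when the
accumulator lands just below or at 2^130 - 5.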

There will also need to be benchmark results that show that this code is
actually worthwhile, on both RV32 and RV64.  I am planning for the
poly1305_kunit test to have a benchmark built-in, after which it will be
straightforward to collect these.

(The "perlasm" file does have some benchmark results in a comment, but they do
not necessarily apply to the Poly1305 code in the kernel.)

So this isn't a perfect time to be adding a new Poly1305 implementation, as
we're not quite ready for it.  But I'd indeed like to take this eventually.

A few comments below:

> diff --git a/arch/riscv/lib/crypto/Makefile b/arch/riscv/lib/crypto/Makefile
> index b7cb877a2c07..93ddb62ef0f9 100644
> --- a/arch/riscv/lib/crypto/Makefile
> +++ b/arch/riscv/lib/crypto/Makefile
> @@ -3,5 +3,23 @@
>  obj-$(CONFIG_CRYPTO_CHACHA_RISCV64) += chacha-riscv64.o
>  chacha-riscv64-y := chacha-riscv64-glue.o chacha-riscv64-zvkb.o
>  
> +obj-$(CONFIG_CRYPTO_POLY1305_RISCV) += poly1305-riscv.o
> +poly1305-riscv-y := poly1305-core.o poly1305-glue.o
> +AFLAGS_poly1305-core.o += -Dpoly1305_init=poly1305_block_init_arch
> +AFLAGS_poly1305-core.o += -Dpoly1305_blocks=poly1305_blocks_arch
> +AFLAGS_poly1305-core.o += -Dpoly1305_emit=poly1305_emit_arch
> +
>  obj-$(CONFIG_CRYPTO_SHA256_RISCV64) += sha256-riscv64.o
>  sha256-riscv64-y := sha256.o sha256-riscv64-zvknha_or_zvknhb-zvkb.o
> +
> +ifeq ($(CONFIG_64BIT),y)
> +PERLASM_ARCH := 64
> +else
> +PERLASM_ARCH := void
> +endif
> +
> +quiet_cmd_perlasm = PERLASM $@
> +      cmd_perlasm = $(PERL) $(<) $(PERLASM_ARCH) $(@)
> +
> +$(obj)/%-core.S: $(src)/%-riscv.pl
> +	$(call cmd,perlasm)

Missing:

    clean-files += poly1305-core.S
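
To visualize where that line would go, the tail of the Makefile hunk might end
up looking like this (a sketch only, not tested against Kbuild):

```make
quiet_cmd_perlasm = PERLASM $@
      cmd_perlasm = $(PERL) $(<) $(PERLASM_ARCH) $(@)

$(obj)/%-core.S: $(src)/%-riscv.pl
	$(call cmd,perlasm)

# Remove the generated assembly on "make clean".
clean-files += poly1305-core.S
```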

> diff --git a/arch/riscv/lib/crypto/poly1305-glue.c b/arch/riscv/lib/crypto/poly1305-glue.c
> new file mode 100644
> index 000000000000..ddc73741faf5
> --- /dev/null
> +++ b/arch/riscv/lib/crypto/poly1305-glue.c
> @@ -0,0 +1,37 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * OpenSSL/Cryptogams accelerated Poly1305 transform for riscv
> + *
> + * Copyright (C) 2025 Institute of Software, CAS.
> + */
> +
> +#include <asm/hwcap.h>
> +#include <asm/simd.h>
> +#include <crypto/internal/poly1305.h>
> +#include <linux/cpufeature.h>
> +#include <linux/jump_label.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/unaligned.h>

Reduce the include list to:

#include <crypto/internal/poly1305.h>
#include <linux/export.h>
#include <linux/module.h>

> +.globl	poly1305_init
> +.type	poly1305_init,\@function
> +poly1305_init:
> +#ifdef	__riscv_zicfilp
> +	lpad	0
> +#endif

The 'lpad' instructions aren't present in the upstream CRYPTOGAMS source.

If they are necessary, this addition needs to be documented.

But they appear to be unnecessary.

> +#ifndef	__CHERI_PURE_CAPABILITY__
> +	andi	$tmp0,$inp,7		# $inp % 8
> +	andi	$inp,$inp,-8		# align $inp
> +	slli	$tmp0,$tmp0,3		# byte to bit offset
> +#endif
> +	ld	$in0,0($inp)
> +	ld	$in1,8($inp)
> +#ifndef	__CHERI_PURE_CAPABILITY__
> +	beqz	$tmp0,.Laligned_key
> +
> +	ld	$tmp2,16($inp)
> +	neg	$tmp1,$tmp0		# implicit &63 in sll
> +	srl	$in0,$in0,$tmp0
> +	sll	$tmp3,$in1,$tmp1
> +	srl	$in1,$in1,$tmp0
> +	sll	$tmp2,$tmp2,$tmp1
> +	or	$in0,$in0,$tmp3
> +	or	$in1,$in1,$tmp2
> +
> +.Laligned_key:

This code is going through a lot of trouble to work on RISC-V CPUs that don't
support efficient misaligned memory accesses.  That includes issuing loads of
memory outside the bounds of the given buffer, which is questionable (even if
it's guaranteed to not cross a page boundary).

Is there any chance we can just make the RISC-V Poly1305 code be conditional on
CONFIG_RISCV_EFFICIENT_UNALIGNED_ACCESS=y?  Or do people not actually use that?

The rest of the kernel's RISC-V crypto code, which is based on the vector
extension, just assumes that efficient misaligned memory accesses are supported.

On a related topic, if this patch is accepted, the result will be inconsistent
optimization of ChaCha vs. Poly1305, which are usually paired:

    (1) ChaCha optimized with the RISC-V vector extension
    (2) Poly1305 optimized with RISC-V scalar instructions

Surely a RISC-V vector extension optimized Poly1305 is going to be needed too?

But with that being the case, will a RISC-V scalar optimized Poly1305 actually
be worthwhile to add too?  Especially without optimized ChaCha alongside it?

- Eric


* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation
  2025-06-24  3:50 ` Eric Biggers
@ 2025-06-24  9:13   ` Andy Polyakov
  2025-06-25  3:54     ` Eric Biggers
  2025-06-24 11:08   ` Andy Polyakov
  2025-07-06 23:35   ` Eric Biggers
  2 siblings, 1 reply; 15+ messages in thread
From: Andy Polyakov @ 2025-06-24  9:13 UTC (permalink / raw)
  To: Eric Biggers, Zhihang Shao
  Cc: linux-crypto, linux-riscv, herbert, paul.walmsley, alex,
	zhang.lyra

>> +.globl	poly1305_init
>> +.type	poly1305_init,\@function
>> +poly1305_init:
>> +#ifdef	__riscv_zicfilp
>> +	lpad	0
>> +#endif
> 
> The 'lpad' instructions aren't present in the upstream CRYPTOGAMS source.

They are.

> If they are necessary, this addition needs to be documented.
> 
> But they appear to be unnecessary.

They had better be there if Control Flow Integrity is on. It's the same 
deal as with the endbranch instruction on Intel and hint #34 on Arm. It's 
possible that the kernel never engages CFI for itself, in which case all 
the mentioned instructions execute as nops. But note that here they are 
compiled conditionally, so if you don't compile the kernel with 
-march=..._zicfilp_..., they won't be there.

>> +#ifndef	__CHERI_PURE_CAPABILITY__
>> +	andi	$tmp0,$inp,7		# $inp % 8
>> +	andi	$inp,$inp,-8		# align $inp
>> +	slli	$tmp0,$tmp0,3		# byte to bit offset
>> +#endif
>> +	ld	$in0,0($inp)
>> +	ld	$in1,8($inp)
>> +#ifndef	__CHERI_PURE_CAPABILITY__
>> +	beqz	$tmp0,.Laligned_key
>> +
>> +	ld	$tmp2,16($inp)
>> +	neg	$tmp1,$tmp0		# implicit &63 in sll
>> +	srl	$in0,$in0,$tmp0
>> +	sll	$tmp3,$in1,$tmp1
>> +	srl	$in1,$in1,$tmp0
>> +	sll	$tmp2,$tmp2,$tmp1
>> +	or	$in0,$in0,$tmp3
>> +	or	$in1,$in1,$tmp2
>> +
>> +.Laligned_key:
> 
> This code is going through a lot of trouble to work on RISC-V CPUs that don't
> support efficient misaligned memory accesses.  That includes issuing loads of
> memory outside the bounds of the given buffer, which is questionable (even if
> it's guaranteed to not cross a page boundary).

It's indeed guaranteed to not cross a page *nor* even cache-line 
boundaries. Hence they can't trigger any externally observable side 
effects that the corresponding unaligned loads wouldn't. What is the 
concern otherwise? [Do note that the boundaries are not crossed on a 
boundary-enforcing CHERI platform ;-)]
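
For readers following along, the aligned-load-and-shift trick the assembly
uses can be sketched in C roughly as below (a hypothetical helper assuming a
little-endian host; unlike the assembly, portable C makes no guarantees about
reading the neighbouring bytes of the aligned words, so treat this purely as
an illustration of the srl/sll/or sequence):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read 8 little-endian bytes at p using only naturally aligned 64-bit
 * loads, mirroring poly1305-riscv.pl: load the two aligned words that
 * cover p and stitch the wanted bytes together.  Like the assembly, this
 * touches bytes outside [p, p+8) but never outside the aligned words
 * overlapping that range.  (Strict-aliasing caveats apply in real C.) */
static uint64_t load64_le_aligned(const unsigned char *p)
{
	uintptr_t a = (uintptr_t)p;
	unsigned shift = (unsigned)(a & 7) * 8;        /* byte -> bit offset */
	const uint64_t *q = (const uint64_t *)(a & ~(uintptr_t)7);
	uint64_t lo = q[0];

	if (shift == 0)
		return lo;                             /* already aligned */
	return (lo >> shift) | (q[1] << (64 - shift)); /* implicit &63 in sll */
}
```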

> The rest of the kernel's RISC-V crypto code, which is based on the vector
> extension, just assumes that efficient misaligned memory accesses are supported.

Was it tested on real hardware though? I wonder what hardware is out 
there that supports the vector crypto extensions?

> On a related topic, if this patch is accepted, the result will be inconsistent
> optimization of ChaCha vs. Poly1305, which are usually paired:

https://github.com/dot-asm/cryptogams/blob/master/riscv/chacha-riscv.pl

>      (1) ChaCha optimized with the RISC-V vector extension
>      (2) Poly1305 optimized with RISC-V scalar instructions
> 
> Surely a RISC-V vector extension optimized Poly1305 is going to be needed too?

I'm a "test-on-hardware" guy. I've got a Spacemit X60, which has a working 
256-bit base vector implementation. I have a "teaser" ChaCha vector 
implementation that currently performs *worse* than the scalar one, more 
than twice as slow. Working on improving it. For reference: one has to 
recognize that cryptographic algorithms customarily have short 
dependencies, which means that performance is dominated by instruction 
latencies. There might or might not be ways to match the scalar 
performance. Of course, even if it turns out to be impossible on this 
processor, it doesn't mean that it won't make sense to keep the vector 
implementation, because other processors might do better. In other 
words, it's coming...

> But with that being the case, will a RISC-V scalar optimized Poly1305 actually
> be worthwhile to add too?  Especially without optimized ChaCha alongside it?

Yes, because vector implementations are inefficient on short inputs, and 
having a compatible scalar fallback for short inputs is more than 
appropriate. In other words, starting with scalar implementations is a 
sensible and perfectly meaningful step.
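
That scalar-fallback argument can be sketched as a length-based dispatch. 
The threshold below is an assumed placeholder for illustration only; a 
real cutover would come from benchmarking on the target hardware.

```c
#include <stddef.h>

/* Hypothetical cutover below which vector setup overhead is assumed to
 * outweigh its throughput advantage; not a measured value. */
#define POLY1305_VEC_THRESHOLD 256

/* Pick an implementation by input length: scalar for short messages,
 * vector for bulk data. */
static const char *poly1305_impl_for(size_t len)
{
	return len < POLY1305_VEC_THRESHOLD ? "scalar" : "vector";
}
```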

Cheers.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation
  2025-06-24  3:50 ` Eric Biggers
  2025-06-24  9:13   ` Andy Polyakov
@ 2025-06-24 11:08   ` Andy Polyakov
  2025-07-06 23:35   ` Eric Biggers
  2 siblings, 0 replies; 15+ messages in thread
From: Andy Polyakov @ 2025-06-24 11:08 UTC (permalink / raw)
  To: Eric Biggers, Zhihang Shao
  Cc: linux-crypto, linux-riscv, herbert, paul.walmsley, alex,
	zhang.lyra

>> +#ifndef	__CHERI_PURE_CAPABILITY__
>> +	andi	$tmp0,$inp,7		# $inp % 8
>> +	andi	$inp,$inp,-8		# align $inp
>> +	slli	$tmp0,$tmp0,3		# byte to bit offset
>> +#endif
>> +	ld	$in0,0($inp)
>> +	ld	$in1,8($inp)
>> +#ifndef	__CHERI_PURE_CAPABILITY__
>> +	beqz	$tmp0,.Laligned_key
>> +
>> +	ld	$tmp2,16($inp)
>> +	neg	$tmp1,$tmp0		# implicit &63 in sll
>> +	srl	$in0,$in0,$tmp0
>> +	sll	$tmp3,$in1,$tmp1
>> +	srl	$in1,$in1,$tmp0
>> +	sll	$tmp2,$tmp2,$tmp1
>> +	or	$in0,$in0,$tmp3
>> +	or	$in1,$in1,$tmp2
>> +
>> +.Laligned_key:
> 
> This code is going through a lot of trouble to work on RISC-V CPUs that don't
> support efficient misaligned memory accesses.  That includes issuing loads of
> memory outside the bounds of the given buffer, which is questionable (even if
> it's guaranteed to not cross a page boundary).
> 
> Is there any chance we can just make the RISC-V Poly1305 code be conditional on
> CONFIG_RISCV_EFFICIENT_UNALIGNED_ACCESS=y?  Or do people not actually use that?

For reference: the penalties for handling unaligned data as above on a 
processor that can handle unaligned loads efficiently are arguably 
tolerable. For example, on a Spacemit X60 it's a meager ~7%. However, 
since poly1305 is always used with chacha20 and is faster than chacha20, 
the difference would be "masked" and rendered marginal. If anything, it 
makes more sense to utilize this option for chacha20, where the 
difference is way more significant, a tad less than 20%.
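
As a rough model of that masking effect (the 2:1 ChaCha20:Poly1305 time 
ratio in the example below is an assumption for illustration, not a 
measurement from this thread):

```c
/* Fractional slowdown of a ChaCha20+Poly1305 pairing when only the
 * Poly1305 half gets slower by 'poly_penalty' (e.g. 0.07 for ~7%). */
static double combined_slowdown(double chacha_time, double poly_time,
				double poly_penalty)
{
	double base = chacha_time + poly_time;
	double slowed = chacha_time + poly_time * (1.0 + poly_penalty);

	return slowed / base - 1.0;
}
```

With Poly1305 assumed twice as fast as ChaCha20, a 7% Poly1305-only 
penalty dilutes to roughly 2.3% of the combined time, which is the sense 
in which the difference is "masked".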

Cheers.



* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation
  2025-06-24  9:13   ` Andy Polyakov
@ 2025-06-25  3:54     ` Eric Biggers
  2025-06-25  9:38       ` Andy Polyakov
  2025-06-26 15:58       ` Sami Tolvanen
  0 siblings, 2 replies; 15+ messages in thread
From: Eric Biggers @ 2025-06-25  3:54 UTC (permalink / raw)
  To: Andy Polyakov
  Cc: Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley,
	alex, zhang.lyra

On Tue, Jun 24, 2025 at 11:13:49AM +0200, Andy Polyakov wrote:
> > > +.globl	poly1305_init
> > > +.type	poly1305_init,\@function
> > > +poly1305_init:
> > > +#ifdef	__riscv_zicfilp
> > > +	lpad	0
> > > +#endif
> > 
> > The 'lpad' instructions aren't present in the upstream CRYPTOGAMS source.
> 
> They are.

I now see the latest version does have them.  However, the description of this
patch explicitly states it was taken from CRYPTOGAMS commit
33fe84bc21219a16825459b37c825bf4580a0a7b.  Which is of course the one I looked
at, and it did not have them.  So the patch description is wrong.

> > If they are necessary, this addition needs to be documented.
> > 
> > But they appear to be unnecessary.
> 
> They had better be there if Control Flow Integrity is on. It's the same deal
> as with endbranch instruction on Intel and hint #34 on ARM. It's possible
> that the kernel never engages CFI for itself, in which case all the
> mentioned instructions are executed as nop-s. But note that here they are
> compiled conditionally, so that if you don't compile the kernel with
> -march=..._zicfilp_..., then they won't be there.

There appears to be no kernel-mode support for Zicfilp yet.  This would be the
very first occurrence of the lpad instruction in the kernel source.

Keeping this anyway is fine if it's in the upstream version.  It just appeared
to be an undocumented change from the upstream version...

> > > +#ifndef	__CHERI_PURE_CAPABILITY__
> > > +	andi	$tmp0,$inp,7		# $inp % 8
> > > +	andi	$inp,$inp,-8		# align $inp
> > > +	slli	$tmp0,$tmp0,3		# byte to bit offset
> > > +#endif
> > > +	ld	$in0,0($inp)
> > > +	ld	$in1,8($inp)
> > > +#ifndef	__CHERI_PURE_CAPABILITY__
> > > +	beqz	$tmp0,.Laligned_key
> > > +
> > > +	ld	$tmp2,16($inp)
> > > +	neg	$tmp1,$tmp0		# implicit &63 in sll
> > > +	srl	$in0,$in0,$tmp0
> > > +	sll	$tmp3,$in1,$tmp1
> > > +	srl	$in1,$in1,$tmp0
> > > +	sll	$tmp2,$tmp2,$tmp1
> > > +	or	$in0,$in0,$tmp3
> > > +	or	$in1,$in1,$tmp2
> > > +
> > > +.Laligned_key:
> > 
> > This code is going through a lot of trouble to work on RISC-V CPUs that don't
> > support efficient misaligned memory accesses.  That includes issuing loads of
> > memory outside the bounds of the given buffer, which is questionable (even if
> > it's guaranteed to not cross a page boundary).
> 
> It's indeed guaranteed not to cross a page *nor* even a cache-line
> boundary. Hence they can't trigger any externally observable side effects
> that the corresponding unaligned loads wouldn't. What is the concern
> otherwise? [Do note that the boundaries are not crossed on a
> boundary-enforceable CHERI platform ;-)]

With this, we get:

- More complex code.
- Slower on CPUs that do support efficient misaligned accesses.
- The buffer underflows and overflows could cause problems with future CPU
  behavior.  (Did you consider the palette memory extension, for example?)

That being said, if there will continue to be many RISC-V CPUs that don't
support efficient misaligned accesses, then we effectively need to do this
anyway.  I hoped that things might be developing along the lines of ARM, where
eventually misaligned accesses started being supported uniformly.  But perhaps
RISC-V is still in the process of learning that lesson.

> > The rest of the kernel's RISC-V crypto code, which is based on the vector
> > extension, just assumes that efficient misaligned memory accesses are supported.
> 
> Was it tested on real hardware though? I wonder what hardware is out there
> that supports the vector crypto extensions?

If I remember correctly, SiFive tested it on their hardware.

- Eric


* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation
  2025-06-25  3:54     ` Eric Biggers
@ 2025-06-25  9:38       ` Andy Polyakov
  2025-06-26 15:58       ` Sami Tolvanen
  1 sibling, 0 replies; 15+ messages in thread
From: Andy Polyakov @ 2025-06-25  9:38 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley,
	alex, zhang.lyra

>>>> +#ifndef	__CHERI_PURE_CAPABILITY__
>>>> +	andi	$tmp0,$inp,7		# $inp % 8
>>>> +	andi	$inp,$inp,-8		# align $inp
>>>> +	slli	$tmp0,$tmp0,3		# byte to bit offset
>>>> +#endif
>>>> +	ld	$in0,0($inp)
>>>> +	ld	$in1,8($inp)
>>>> +#ifndef	__CHERI_PURE_CAPABILITY__
>>>> +	beqz	$tmp0,.Laligned_key
>>>> +
>>>> +	ld	$tmp2,16($inp)
>>>> +	neg	$tmp1,$tmp0		# implicit &63 in sll
>>>> +	srl	$in0,$in0,$tmp0
>>>> +	sll	$tmp3,$in1,$tmp1
>>>> +	srl	$in1,$in1,$tmp0
>>>> +	sll	$tmp2,$tmp2,$tmp1
>>>> +	or	$in0,$in0,$tmp3
>>>> +	or	$in1,$in1,$tmp2
>>>> +
>>>> +.Laligned_key:
>>>
>>> This code is going through a lot of trouble to work on RISC-V CPUs that don't
>>> support efficient misaligned memory accesses.  That includes issuing loads of
>>> memory outside the bounds of the given buffer, which is questionable (even if
>>> it's guaranteed to not cross a page boundary).
>>
>> It's indeed guaranteed not to cross a page *nor* even a cache-line
>> boundary. Hence they can't trigger any externally observable side effects
>> that the corresponding unaligned loads wouldn't. What is the concern
>> otherwise? [Do note that the boundaries are not crossed on a
>> boundary-enforceable CHERI platform ;-)]
> 
> With this, we get:
> 
> - More complex code.

My rationale is as follows. It's beneficial to have this code cover the 
whole spectrum of processor implementations. I for one would even say 
it's important, because the penalties on processors that can't handle 
misaligned accesses efficiently are just too high to ignore. Now, it's 
possible to bypass it with #ifdef-s, but to make things less confusing, 
a.k.a. *less* complex, it's preferable to rely on compiler predefines 
(as done for CHERI). Later compiler versions introduced apparently 
suitable predefines for this, __riscv_misaligned_slow/fast/avoid. 
However, as of the moment of this writing the macros in question don't 
seem to depend on the -mcpu parameter. But it's probably reasonable to 
assume that they will at a later point. So the suggestion would be to 
use these. Does that sound reasonable? Or would you insist on a custom 
macro that would need to be set depending on 
CONFIG_RISCV_EFFICIENT_UNALIGNED_ACCESS?
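
A sketch of what such gating might look like. Only the 
__riscv_misaligned_fast/slow/avoid predefines and 
CONFIG_RISCV_EFFICIENT_UNALIGNED_ACCESS come from this discussion; the 
POLY1305_FAST_MISALIGNED name is made up for illustration.

```c
/* Gate the aligned-load-and-shift fixup path on what the toolchain (or
 * the kernel config) says about misaligned loads. */
#if defined(__riscv_misaligned_fast) || \
    defined(CONFIG_RISCV_EFFICIENT_UNALIGNED_ACCESS)
# define POLY1305_FAST_MISALIGNED 1	/* plain unaligned loads are fine */
#else
/* __riscv_misaligned_slow / __riscv_misaligned_avoid, or no hint at
 * all: keep the alignment-fixup code. */
# define POLY1305_FAST_MISALIGNED 0
#endif
```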

> - Slower on CPUs that do support efficient misaligned accesses.

With an arguably marginal penalty, as discussed in the previous message. 
In that context one can also view it as a trade-off between a small 
penalty and increased #ifdef spaghetti :-)

> - The buffer underflows and overflows could cause problems with future CPU
>    behavior.  (Did you consider the palette memory extension, for example?)

The palette memory extension colours fixed-size, hence accordingly 
aligned, blocks. Since the block size is larger than the word load size, 
any aligned load would be safe, because even a single "excess" or 
"short" byte would colour the whole block accordingly.

Just to be clear: the argument is about loads. Misaligned stores are 
naturally a different matter, and it would be inappropriate to handle 
them in a similar manner.

> That being said, if there will continue to be many RISC-V CPUs that don't
> support efficient misaligned accesses, then we effectively need to do this
> anyway.  I hoped that things might be developing along the lines of ARM, where
> eventually misaligned accesses started being supported uniformly.  But perhaps
> RISC-V is still in the process of learning that lesson.

One has to recognize that it can also be a matter of cost. I mean, 
imagine you want to license the least expensive IP from SiFive, or have 
very limited space for an MCU. Well, Linux, naturally having higher 
minimum requirements, doesn't have to care about these, but that doesn't 
mean that nobody would :-)

>>> The rest of the kernel's RISC-V crypto code, which is based on the vector
>>> extension, just assumes that efficient misaligned memory accesses are supported.
>>
>> Was it tested on real hardware though? I wonder what hardware is out there
>> that supports the vector crypto extensions?
> 
> If I remember correctly, SiFive tested it on their hardware.

Cool! The question was rather "how did it do performance-wise in the 
context of this discussion," but never mind. Thanks! In a way there is a 
contradiction: RISC-V as a concept is about openness to everybody, while 
SiFive is naturally about itself ;-)

Cheers.




* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation
  2025-06-25  3:54     ` Eric Biggers
  2025-06-25  9:38       ` Andy Polyakov
@ 2025-06-26 15:58       ` Sami Tolvanen
  2025-06-27  9:07         ` Andy Polyakov
  1 sibling, 1 reply; 15+ messages in thread
From: Sami Tolvanen @ 2025-06-26 15:58 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andy Polyakov, Zhihang Shao, linux-crypto, linux-riscv, herbert,
	paul.walmsley, alex, zhang.lyra

On Tue, Jun 24, 2025 at 8:56 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Tue, Jun 24, 2025 at 11:13:49AM +0200, Andy Polyakov wrote:
> > > > +.globl   poly1305_init
> > > > +.type    poly1305_init,\@function
> > > > +poly1305_init:
> > > > +#ifdef   __riscv_zicfilp
> > > > + lpad    0
> > > > +#endif
> > >
> > > The 'lpad' instructions aren't present in the upstream CRYPTOGAMS source.
> >
> > They are.
>
> I now see the latest version does have them.  However, the description of this
> patch explicitly states it was taken from CRYPTOGAMS commit
> 33fe84bc21219a16825459b37c825bf4580a0a7b.  Which is of course the one I looked
> at, and it did not have them.  So the patch description is wrong.
>
> > > If they are necessary, this addition needs to be documented.
> > >
> > > But they appear to be unnecessary.
> >
> > They had better be there if Control Flow Integrity is on. It's the same deal
> > as with endbranch instruction on Intel and hint #34 on ARM. It's possible
> > that the kernel never engages CFI for itself, in which case all the
> > mentioned instructions are executed as nop-s. But note that here they are
> > compiled conditionally, so that if you don't compile the kernel with
> > -march=..._zicfilp_..., then they won't be there.
>
> There appears to be no kernel-mode support for Zicfilp yet.  This would be the
> very first occurrence of the lpad instruction in the kernel source.

Of course, if the kernel actually ends up calling these functions
indirectly at some point, lpad alone isn't sufficient; we would need
to use SYM_TYPED_FUNC_START to emit CFI type information for them. I
assume if RISC-V gains kernel-mode Zicfilp support later, we would
have an arch-specific override for the SYM_TYPED_FUNC_START macro that
includes the lpad instruction, similarly to arm64 and BTI.

Also, if the kernel decides to use type-based landing pad labels for
finer-grained CFI, "lpad 0" isn't going to work anyway. Perhaps it
would make sense to just drop the lpad instruction in kernel builds
for now to avoid confusion?

Sami


* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation
  2025-06-26 15:58       ` Sami Tolvanen
@ 2025-06-27  9:07         ` Andy Polyakov
  2025-06-27 16:10           ` Sami Tolvanen
  0 siblings, 1 reply; 15+ messages in thread
From: Andy Polyakov @ 2025-06-27  9:07 UTC (permalink / raw)
  To: Sami Tolvanen, Eric Biggers
  Cc: Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley,
	alex, zhang.lyra

>>>>> +.globl   poly1305_init
>>>>> +.type    poly1305_init,\@function
>>>>> +poly1305_init:
>>>>> +#ifdef   __riscv_zicfilp
>>>>> + lpad    0
>>>>> +#endif
>>>>
>>>> But they appear to be unnecessary.
>>>
>>> They had better be there if Control Flow Integrity is on. It's the same deal
>>> as with endbranch instruction on Intel and hint #34 on ARM. It's possible
>>> that the kernel never engages CFI for itself, in which case all the
>>> mentioned instructions are executed as nop-s. But note that here they are
>>> compiled conditionally, so that if you don't compile the kernel with
>>> -march=..._zicfilp_..., then they won't be there.
>>
>> There appears to be no kernel-mode support for Zicfilp yet.  This would be the
>> very first occurrence of the lpad instruction in the kernel source.
> 
> Of course, if the kernel actually ends up calling these functions
> indirectly at some point, lpad alone isn't sufficient,

The exception condition is specified as "tag != x7 *and* tag != 0." In 
other words, lpad 0 provides coarse-grained protection, but protection 
nevertheless. As effective as the endbranch on Intel. Since a 
"fit-multiple-use-cases" assembly module, such as cryptogams' ones, 
can't make specific assumptions about the caller, lpad 0 is a working 
fall-back.
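
The stated rule can be modeled as below. This is an illustration of the 
condition as described in this thread, not the specification's exact 
semantics.

```c
#include <stdbool.h>
#include <stdint.h>

/* A landing pad traps only when it carries a nonzero label that does
 * not match the tag the caller placed in x7; "lpad 0" therefore
 * accepts any indirect caller, i.e. the coarse-grained case. */
static bool lpad_traps(uint32_t lpad_label, uint32_t x7_tag)
{
	return lpad_label != 0 && lpad_label != x7_tag;
}
```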

> we would need
> to use SYM_TYPED_FUNC_START to emit CFI type information for them.

I'm confused. My understanding is that SYM_TYPED_FUNC_START is about 
clang -fsanitize=kcfi, which is a different mechanism. Well, you might 
be referring to future developments...

> Also, if the kernel decides to use type-based landing pad labels for
> finer-grained CFI, "lpad 0" isn't going to work anyway.

It would, it simply won't be fine-grained.

> Perhaps it
> would make sense to just drop the lpad instruction in kernel builds
> for now to avoid confusion?

Let's say I go for

#ifdef __KERNEL__
SYM_TYPED_FUNC_START(poly1305_init, ...)
#else
.globl	poly1305_init
.type	poly1305_init,\@function
poly1305_init:
# ifdef	__riscv_zicfilp
	lpad	0
# endif
#endif

Would it be sufficient to #include <linux/cfi_types.h>?

Cheers.



* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation
  2025-06-27  9:07         ` Andy Polyakov
@ 2025-06-27 16:10           ` Sami Tolvanen
  2025-06-27 21:51             ` Eric Biggers
  0 siblings, 1 reply; 15+ messages in thread
From: Sami Tolvanen @ 2025-06-27 16:10 UTC (permalink / raw)
  To: Andy Polyakov
  Cc: Eric Biggers, Zhihang Shao, linux-crypto, linux-riscv, herbert,
	paul.walmsley, alex, zhang.lyra

On Fri, Jun 27, 2025 at 2:07 AM Andy Polyakov <appro@cryptogams.org> wrote:
>
> > we would need
> > to use SYM_TYPED_FUNC_START to emit CFI type information for them.
>
> I'm confused. My understanding is that SYM_TYPED_FUNC_START is about
> clang -fsanitize=kcfi, which is a different mechanism. Well, you might
> be referring to future developments...

CONFIG_CFI_CLANG is supported in the kernel right now, which is why we
need to make sure indirect call targets in assembly code are properly
annotated.

Like I said, if RISC-V gains kernel-mode Zicfilp support later, I
would expect the SYM_TYPED_FUNC_START macro to be overridden to
include the lpad instruction as well, similarly to other architectures
that already support landing pad instructions. In either case, instead
of including raw lpad instructions in the code, it should be up to the
kernel to decide if landing pads (or other CFI annotations) are needed
for the function.

> > Also, if the kernel decides to use type-based landing pad labels for
> > finer-grained CFI, "lpad 0" isn't going to work anyway.
>
> It would, it simply won't be fine-graned.

Sorry for not being clear. If the kernel implements a fine-grained
Zicfilp scheme, we wouldn't want universal indirect call targets (i.e.
lpad 0) anywhere in the code. And even if we implement coarse-grained
Zicfilp, we would only want landing pads in functions where they're
needed to minimize the number of call targets.

> > Perhaps it
> > would make sense to just drop the lpad instruction in kernel builds
> > for now to avoid confusion?
>
> Let's say I go for
>
> #ifdef __KERNEL__
> SYM_TYPED_FUNC_START(poly1305_init, ...)
> #else
> .globl  poly1305_init
> .type   poly1305_init,\@function
> poly1305_init:
> # ifdef __riscv_zicfilp
>         lpad    0
> # endif
> #endif
>
> Would it be sufficient to #include <linux/cfi_types.h>?

Yes, but this requires the function to be indirectly called, because
with CFI_CLANG the compiler only emits CFI type information for
functions that are address-taken in C. If, like Eric suggested, these
functions are not currently indirectly called, I would simply leave
out the lpad instructions for kernel builds and worry about
kernel-mode CFI annotations when they're actually needed:

# if defined(__riscv_zicfilp) && !defined(__KERNEL__)
        lpad    0
# endif

Sami


* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation
  2025-06-27 16:10           ` Sami Tolvanen
@ 2025-06-27 21:51             ` Eric Biggers
  2025-06-30  9:55               ` Andy Polyakov
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Biggers @ 2025-06-27 21:51 UTC (permalink / raw)
  To: Sami Tolvanen
  Cc: Andy Polyakov, Zhihang Shao, linux-crypto, linux-riscv, herbert,
	paul.walmsley, alex, zhang.lyra

On Fri, Jun 27, 2025 at 09:10:30AM -0700, Sami Tolvanen wrote:
> > Would it be sufficient to #include <linux/cfi_types.h>?
> 
> Yes, but this requires the function to be indirectly called, because
> with CFI_CLANG the compiler only emits CFI type information for
> functions that are address-taken in C. If, like Eric suggested, these
> functions are not currently indirectly called, I would simply leave
> out the lpad instructions for kernel builds and worry about
> kernel-mode CFI annotations when they're actually needed:
> 
> # if defined(__riscv_zicfilp) && !defined(__KERNEL__)
>         lpad    0
> # endif

These functions aren't indirectly called, and I'm intending to keep it that way.

- Eric


* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation
  2025-06-27 21:51             ` Eric Biggers
@ 2025-06-30  9:55               ` Andy Polyakov
  0 siblings, 0 replies; 15+ messages in thread
From: Andy Polyakov @ 2025-06-30  9:55 UTC (permalink / raw)
  To: Eric Biggers, Sami Tolvanen
  Cc: Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley,
	alex, zhang.lyra

>>> Would it be sufficient to #include <linux/cfi_types.h>?
>>
>> Yes, but this requires the function to be indirectly called, because
>> with CFI_CLANG the compiler only emits CFI type information for
>> functions that are address-taken in C. If, like Eric suggested, these
>> functions are not currently indirectly called, I would simply leave
>> out the lpad instructions for kernel builds and worry about
>> kernel-mode CFI annotations when they're actually needed:
>>
>> # if defined(__riscv_zicfilp) && !defined(__KERNEL__)
>>          lpad    0
>> # endif
> 
> These functions aren't indirectly called, and I'm intending to keep it that way.

In which case the lpad instructions will be executed as nops. Anyway, does 
https://github.com/dot-asm/cryptogams/commit/e6ae2202268d995e78fa5d137dde992bdff1b8e8 
look all right?

Cheers.



* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation
  2025-06-24  3:50 ` Eric Biggers
  2025-06-24  9:13   ` Andy Polyakov
  2025-06-24 11:08   ` Andy Polyakov
@ 2025-07-06 23:35   ` Eric Biggers
  2 siblings, 0 replies; 15+ messages in thread
From: Eric Biggers @ 2025-07-06 23:35 UTC (permalink / raw)
  To: Zhihang Shao
  Cc: linux-crypto, linux-riscv, herbert, paul.walmsley, alex, appro,
	zhang.lyra

On Mon, Jun 23, 2025 at 08:50:57PM -0700, Eric Biggers wrote:
> Hi Zhihang,
> 
> On Wed, Jun 11, 2025 at 11:31:51AM +0800, Zhihang Shao wrote:
> > From: Zhihang Shao <zhihang.shao.iscas@gmail.com>
> > 
> > This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305
> > implementation for riscv authored by Andy Polyakov.
> > The file 'poly1305-riscv.pl' is taken straight from this upstream
> > GitHub repository [0] at commit 33fe84bc21219a16825459b37c825bf4580a0a7b,
> > and this commit fixed a bug in riscv 64bit implementation.
> > 
> > [0] https://github.com/dot-asm/cryptogams
> > 
> > Signed-off-by: Zhihang Shao <zhihang.shao.iscas@gmail.com>
> 
> I wanted to let you know that I haven't forgotten about this.  However, the
> kernel currently lacks a proper test for Poly1305, and Poly1305 has some edge
> cases that are prone to bugs.  So I'd like to find some time to add a proper
> Poly1305 test and properly review this.  Also, I'm in the middle of fixing how
> the kernel's architecture-optimized crypto code is integrated; for example, I've
> just moved arch/*/lib/crypto/ to lib/crypto/.
> 
> There will also need to be benchmark results that show that this code is
> actually worthwhile, on both RV32 and RV64.  I am planning for the
> poly1305_kunit test to have a benchmark built-in, after which it will be
> straightforward to collect these.
> 
> (The "perlasm" file does have some benchmark results in a comment, but they do
> not necessarily apply to the Poly1305 code in the kernel.)
> 
> So this isn't a perfect time to be adding a new Poly1305 implementation, as
> we're not quite ready for it.  But I'd indeed like to take this eventually.

My series
https://lore.kernel.org/linux-crypto/20250706232817.179500-1-ebiggers@kernel.org/
adds a KUnit test suite for Poly1305, including a benchmark, as I had planned.

- Eric


* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation
@ 2025-07-20  9:10 Zhihang Shao
  2025-07-20 17:22 ` Eric Biggers
  2025-07-23  9:47 ` Andy Polyakov
  0 siblings, 2 replies; 15+ messages in thread
From: Zhihang Shao @ 2025-07-20  9:10 UTC (permalink / raw)
  To: Eric Biggers
  Cc: alex, appro, herbert, linux-crypto, linux-riscv, paul.walmsley,
	zhang.lyra, Zhihang Shao

Hi Eric,

I recently ran a test using the KUnit module you wrote for testing
poly1305, which I executed on QEMU RISC-V 64. The results show a
significant performance improvement of the optimized implementation
compared to the generic one. The test data are as follows:

--- base.log    2025-07-19 17:41:06.443392989 +0800
+++ optimized.log       2025-07-19 17:40:45.650048601 +0800
@@ -1,31 +1,31 @@
-[    0.668631]     # Subtest: poly1305
-[    0.668774]     # module: poly1305_kunit
-[    0.668857]     1..12
-[    0.670267]     ok 1 test_hash_test_vectors
-[    0.679479]     ok 2 test_hash_all_lens_up_to_4096
-[    0.696048]     ok 3 test_hash_incremental_updates
-[    0.697645]     ok 4 test_hash_buffer_overruns
-[    0.701060]     ok 5 test_hash_overlaps
-[    0.702858]     ok 6 test_hash_alignment_consistency
-[    0.703108]     ok 7 test_hash_ctx_zeroization
-[    0.846150]     ok 8 test_hash_interrupt_context_1
-[    1.235247]     ok 9 test_hash_interrupt_context_2
-[    1.250813]     ok 10 test_poly1305_allones_keys_and_message
-[    1.251138]     ok 11 test_poly1305_reduction_edge_cases
-[    1.287196]     # benchmark_hash: len=1: 2 MB/s
-[    1.305363]     # benchmark_hash: len=16: 61 MB/s
-[    1.321102]     # benchmark_hash: len=64: 212 MB/s
-[    1.340105]     # benchmark_hash: len=127: 263 MB/s
-[    1.353880]     # benchmark_hash: len=128: 364 MB/s
-[    1.370118]     # benchmark_hash: len=200: 377 MB/s
-[    1.381879]     # benchmark_hash: len=256: 570 MB/s
-[    1.394125]     # benchmark_hash: len=511: 657 MB/s
-[    1.404265]     # benchmark_hash: len=512: 794 MB/s
-[    1.413356]     # benchmark_hash: len=1024: 985 MB/s
-[    1.421925]     # benchmark_hash: len=3173: 1131 MB/s
-[    1.429956]     # benchmark_hash: len=4096: 1218 MB/s
-[    1.438184]     # benchmark_hash: len=16384: 1216 MB/s
-[    1.438462]     ok 12 benchmark_hash
-[    1.438686] # poly1305: pass:12 fail:0 skip:0 total:12
-[    1.438763] # Totals: pass:12 fail:0 skip:0 total:12
-[    1.438904] ok 1 poly1305
+[    0.666280]     # Subtest: poly1305
+[    0.666413]     # module: poly1305_kunit
+[    0.666490]     1..12
+[    0.667702]     ok 1 test_hash_test_vectors
+[    0.672896]     ok 2 test_hash_all_lens_up_to_4096
+[    0.686244]     ok 3 test_hash_incremental_updates
+[    0.687263]     ok 4 test_hash_buffer_overruns
+[    0.689957]     ok 5 test_hash_overlaps
+[    0.691393]     ok 6 test_hash_alignment_consistency
+[    0.691622]     ok 7 test_hash_ctx_zeroization
+[    0.769741]     ok 8 test_hash_interrupt_context_1
+[    0.930832]     ok 9 test_hash_interrupt_context_2
+[    0.940068]     ok 10 test_poly1305_allones_keys_and_message
+[    0.940478]     ok 11 test_poly1305_reduction_edge_cases
+[    0.964546]     # benchmark_hash: len=1: 3 MB/s
+[    0.978836]     # benchmark_hash: len=16: 78 MB/s
+[    0.990414]     # benchmark_hash: len=64: 289 MB/s
+[    1.003012]     # benchmark_hash: len=127: 397 MB/s
+[    1.012755]     # benchmark_hash: len=128: 517 MB/s
+[    1.022928]     # benchmark_hash: len=200: 603 MB/s
+[    1.030981]     # benchmark_hash: len=256: 835 MB/s
+[    1.038706]     # benchmark_hash: len=511: 1046 MB/s
+[    1.045233]     # benchmark_hash: len=512: 1240 MB/s
+[    1.050733]     # benchmark_hash: len=1024: 1638 MB/s
+[    1.055620]     # benchmark_hash: len=3173: 1998 MB/s
+[    1.060247]     # benchmark_hash: len=4096: 2132 MB/s
+[    1.064695]     # benchmark_hash: len=16384: 2267 MB/s
+[    1.065179]     ok 12 benchmark_hash
+[    1.065425] # poly1305: pass:12 fail:0 skip:0 total:12
+[    1.065498] # Totals: pass:12 fail:0 skip:0 total:12
+[    1.065612] ok 1 poly1305
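
Reading the two logs side by side, the implied speedups can be computed 
directly from the MB/s figures, e.g. roughly 1.86x at len=16384 (2267 
vs 1216 MB/s):

```c
/* Speedup of the optimized implementation over the generic one at a
 * given benchmark length, from the MB/s figures in the logs above. */
static double speedup(double optimized_mbps, double generic_mbps)
{
	return optimized_mbps / generic_mbps;
}
```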

Next, I plan to validate this performance gain on actual RISC-V
hardware. I will also submit a v5 patch to the mailing list.
Looking forward to your feedback and suggestions.

- Zhihang


* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation
  2025-07-20  9:10 [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation Zhihang Shao
@ 2025-07-20 17:22 ` Eric Biggers
  2025-07-23  9:47 ` Andy Polyakov
  1 sibling, 0 replies; 15+ messages in thread
From: Eric Biggers @ 2025-07-20 17:22 UTC (permalink / raw)
  To: Zhihang Shao
  Cc: alex, appro, herbert, linux-crypto, linux-riscv, paul.walmsley,
	zhang.lyra

On Sun, Jul 20, 2025 at 05:10:13PM +0800, Zhihang Shao wrote:
> Hi Eric,
> 
> I recently ran a test using the KUnit module you wrote for testing
> poly1305, which I executed on QEMU RISC-V 64. The results show a
> significant performance improvement of the optimized implementation
> compared to the generic one. The test data are as follows:
> 
> --- base.log    2025-07-19 17:41:06.443392989 +0800
> +++ optimized.log       2025-07-19 17:40:45.650048601 +0800
> @@ -1,31 +1,31 @@
> -[    0.668631]     # Subtest: poly1305
> -[    0.668774]     # module: poly1305_kunit
> -[    0.668857]     1..12
> -[    0.670267]     ok 1 test_hash_test_vectors
> -[    0.679479]     ok 2 test_hash_all_lens_up_to_4096
> -[    0.696048]     ok 3 test_hash_incremental_updates
> -[    0.697645]     ok 4 test_hash_buffer_overruns
> -[    0.701060]     ok 5 test_hash_overlaps
> -[    0.702858]     ok 6 test_hash_alignment_consistency
> -[    0.703108]     ok 7 test_hash_ctx_zeroization
> -[    0.846150]     ok 8 test_hash_interrupt_context_1
> -[    1.235247]     ok 9 test_hash_interrupt_context_2
> -[    1.250813]     ok 10 test_poly1305_allones_keys_and_message
> -[    1.251138]     ok 11 test_poly1305_reduction_edge_cases
> -[    1.287196]     # benchmark_hash: len=1: 2 MB/s
> -[    1.305363]     # benchmark_hash: len=16: 61 MB/s
> -[    1.321102]     # benchmark_hash: len=64: 212 MB/s
> -[    1.340105]     # benchmark_hash: len=127: 263 MB/s
> -[    1.353880]     # benchmark_hash: len=128: 364 MB/s
> -[    1.370118]     # benchmark_hash: len=200: 377 MB/s
> -[    1.381879]     # benchmark_hash: len=256: 570 MB/s
> -[    1.394125]     # benchmark_hash: len=511: 657 MB/s
> -[    1.404265]     # benchmark_hash: len=512: 794 MB/s
> -[    1.413356]     # benchmark_hash: len=1024: 985 MB/s
> -[    1.421925]     # benchmark_hash: len=3173: 1131 MB/s
> -[    1.429956]     # benchmark_hash: len=4096: 1218 MB/s
> -[    1.438184]     # benchmark_hash: len=16384: 1216 MB/s
> -[    1.438462]     ok 12 benchmark_hash
> -[    1.438686] # poly1305: pass:12 fail:0 skip:0 total:12
> -[    1.438763] # Totals: pass:12 fail:0 skip:0 total:12
> -[    1.438904] ok 1 poly1305
> +[    0.666280]     # Subtest: poly1305
> +[    0.666413]     # module: poly1305_kunit
> +[    0.666490]     1..12
> +[    0.667702]     ok 1 test_hash_test_vectors
> +[    0.672896]     ok 2 test_hash_all_lens_up_to_4096
> +[    0.686244]     ok 3 test_hash_incremental_updates
> +[    0.687263]     ok 4 test_hash_buffer_overruns
> +[    0.689957]     ok 5 test_hash_overlaps
> +[    0.691393]     ok 6 test_hash_alignment_consistency
> +[    0.691622]     ok 7 test_hash_ctx_zeroization
> +[    0.769741]     ok 8 test_hash_interrupt_context_1
> +[    0.930832]     ok 9 test_hash_interrupt_context_2
> +[    0.940068]     ok 10 test_poly1305_allones_keys_and_message
> +[    0.940478]     ok 11 test_poly1305_reduction_edge_cases
> +[    0.964546]     # benchmark_hash: len=1: 3 MB/s
> +[    0.978836]     # benchmark_hash: len=16: 78 MB/s
> +[    0.990414]     # benchmark_hash: len=64: 289 MB/s
> +[    1.003012]     # benchmark_hash: len=127: 397 MB/s
> +[    1.012755]     # benchmark_hash: len=128: 517 MB/s
> +[    1.022928]     # benchmark_hash: len=200: 603 MB/s
> +[    1.030981]     # benchmark_hash: len=256: 835 MB/s
> +[    1.038706]     # benchmark_hash: len=511: 1046 MB/s
> +[    1.045233]     # benchmark_hash: len=512: 1240 MB/s
> +[    1.050733]     # benchmark_hash: len=1024: 1638 MB/s
> +[    1.055620]     # benchmark_hash: len=3173: 1998 MB/s
> +[    1.060247]     # benchmark_hash: len=4096: 2132 MB/s
> +[    1.064695]     # benchmark_hash: len=16384: 2267 MB/s
> +[    1.065179]     ok 12 benchmark_hash
> +[    1.065425] # poly1305: pass:12 fail:0 skip:0 total:12
> +[    1.065498] # Totals: pass:12 fail:0 skip:0 total:12
> +[    1.065612] ok 1 poly1305
> 
> Next, I plan to validate this performance gain on actual RISC-V
> hardware. I will also submit a v5 patch to the mailing list.
> I look forward to your feedback and suggestions.
> 
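For a quick read of the gain, the two KUnit logs above can be compared
length by length. A minimal sketch in Python, with the MB/s figures
transcribed from the quoted before/after benchmark output:

```python
# Per-length speedup from the two quoted KUnit benchmark logs:
# "old" = the baseline run ("-" lines), "new" = the rerun ("+" lines).
old = {1: 2, 16: 61, 64: 212, 127: 263, 128: 364, 200: 377,
       256: 570, 511: 657, 512: 794, 1024: 985, 3173: 1131,
       4096: 1218, 16384: 1216}
new = {1: 3, 16: 78, 64: 289, 127: 397, 128: 517, 200: 603,
       256: 835, 511: 1046, 512: 1240, 1024: 1638, 3173: 1998,
       4096: 2132, 16384: 2267}

for length in sorted(old):
    print(f"len={length}: {new[length] / old[length]:.2f}x")
```

The ratio settles around 1.7x-1.9x for the larger buffer lengths.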

Sounds good, thank you!

- Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation
  2025-07-20  9:10 [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation Zhihang Shao
  2025-07-20 17:22 ` Eric Biggers
@ 2025-07-23  9:47 ` Andy Polyakov
  1 sibling, 0 replies; 15+ messages in thread
From: Andy Polyakov @ 2025-07-23  9:47 UTC (permalink / raw)
  To: Zhihang Shao, Eric Biggers
  Cc: alex, herbert, linux-crypto, linux-riscv, paul.walmsley,
	zhang.lyra

Hi,

> Next, I plan to validate this performance gain on actual RISC-V
> hardware.

I've rerun my benchmarks, the cycles-per-byte results quoted in the 
poly1305-riscv.pl commentary section, and it appears that my U74 results 
were off. I must have made wrong assumptions about the clock frequency, 
or failed to note that the [shared] system was busy. Either way, U74 
delivers 1.8 cpb, whether it's the initial processor version or one with 
additional ISA capabilities such as Zbb, i.e. JH7100 vs. JH7110. For 
reference, cpb is calculated by dividing the clock frequency (in MHz) by 
the measured throughput (in MB/s).
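As a sketch of that calculation (the 1.5 GHz clock and 833 MB/s figures
below are purely illustrative, not measurements from any particular
board):

```python
def cpb(clock_mhz: float, throughput_mbps: float) -> float:
    """Cycles per byte from clock frequency and throughput.

    MHz is 1e6 cycles/s and MB/s is 1e6 bytes/s, so the 1e6
    factors cancel and the ratio is cycles per byte directly.
    """
    return clock_mhz / throughput_mbps

# Hypothetical figures: a 1.5 GHz core sustaining ~833 MB/s
# works out to the quoted 1.8 cpb.
print(f"{cpb(1500, 833):.2f} cpb")
```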

I also have a vector implementation cooking. It's not ready to be 
released, because it doesn't yet scale with vlenb and works only on a 
256-bit vector unit. It achieves 1.3 cpb on the Spacemit X60, a 2.5x 
improvement over the scalar code. Just in case: one can't expect the 
same coefficient on other processors.
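Reading those two numbers together (a sketch, under the assumption that
the quoted 2.5x is a straight cycles-per-byte ratio, which the message
does not state explicitly):

```python
# Implied scalar-code cost on the Spacemit X60, assuming the 2.5x
# improvement is a direct cpb ratio (an assumption for illustration).
vector_cpb = 1.3
scalar_cpb = vector_cpb * 2.5

print(f"implied scalar cost: {scalar_cpb:.2f} cpb")
```

That puts the X60's scalar code around 3.25 cpb, notably slower than
the 1.8 cpb quoted for the U74 above, which is consistent with the
caveat that the coefficient varies across processors.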

Cheers.



end of thread, other threads:[~2025-07-23  9:47 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2025-07-20  9:10 [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation Zhihang Shao
2025-07-20 17:22 ` Eric Biggers
2025-07-23  9:47 ` Andy Polyakov
  -- strict thread matches above, loose matches on Subject: below --
2025-06-11  3:31 Zhihang Shao
2025-06-24  3:50 ` Eric Biggers
2025-06-24  9:13   ` Andy Polyakov
2025-06-25  3:54     ` Eric Biggers
2025-06-25  9:38       ` Andy Polyakov
2025-06-26 15:58       ` Sami Tolvanen
2025-06-27  9:07         ` Andy Polyakov
2025-06-27 16:10           ` Sami Tolvanen
2025-06-27 21:51             ` Eric Biggers
2025-06-30  9:55               ` Andy Polyakov
2025-06-24 11:08   ` Andy Polyakov
2025-07-06 23:35   ` Eric Biggers
