* [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-06-11 3:31 ` Zhihang Shao 0 siblings, 0 replies; 30+ messages in thread From: Zhihang Shao @ 2025-06-11 3:31 UTC (permalink / raw) To: linux-crypto Cc: linux-riscv, herbert, paul.walmsley, alex, appro, zhang.lyra, ebiggers From: Zhihang Shao <zhihang.shao.iscas@gmail.com> This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305 implementation for riscv authored by Andy Polyakov. The file 'poly1305-riscv.pl' is taken straight from this upstream GitHub repository [0] at commit 33fe84bc21219a16825459b37c825bf4580a0a7b, and this commit fixed a bug in riscv 64bit implementation. [0] https://github.com/dot-asm/cryptogams Signed-off-by: Zhihang Shao <zhihang.shao.iscas@gmail.com> --- v4: - use 142c43db490d188ff2aea1faf31b8ad0afd5b33f, the newest commit of poly1305-riscv.pl in [0]. (Eric) - Directly pass -Dpoly1305_blocks=poly1305_blocks_arch instead of implementing glue subroutine in C. (Andy, Eric) --- v3: - remove redundant functions. (Herbert) --- v2: - rebase onto mainline and port change to arch/riscv/lib/crypto. (Eric) - fix problems in code according to Andy's guidance. (Andy) --- arch/riscv/lib/crypto/Kconfig | 5 + arch/riscv/lib/crypto/Makefile | 18 + arch/riscv/lib/crypto/poly1305-glue.c | 37 ++ arch/riscv/lib/crypto/poly1305-riscv.pl | 816 ++++++++++++++++++++++++ lib/crypto/Kconfig | 2 +- 5 files changed, 877 insertions(+), 1 deletion(-) create mode 100644 arch/riscv/lib/crypto/poly1305-glue.c create mode 100644 arch/riscv/lib/crypto/poly1305-riscv.pl diff --git a/arch/riscv/lib/crypto/Kconfig b/arch/riscv/lib/crypto/Kconfig index 47c99ea97ce2..3d5a87fafb01 100644 --- a/arch/riscv/lib/crypto/Kconfig +++ b/arch/riscv/lib/crypto/Kconfig @@ -7,6 +7,11 @@ config CRYPTO_CHACHA_RISCV64 select CRYPTO_ARCH_HAVE_LIB_CHACHA select CRYPTO_LIB_CHACHA_GENERIC +config CRYPTO_POLY1305_RISCV + tristate + default CRYPTO_LIB_POLY1305 + select CRYPTO_ARCH_HAVE_LIB_POLY1305 + config CRYPTO_SHA256_RISCV64 tristate depends on 64BIT && RISCV_ISA_V && TOOLCHAIN_HAS_VECTOR_CRYPTO diff --git a/arch/riscv/lib/crypto/Makefile b/arch/riscv/lib/crypto/Makefile index b7cb877a2c07..93ddb62ef0f9 100644 --- a/arch/riscv/lib/crypto/Makefile +++ b/arch/riscv/lib/crypto/Makefile @@ -3,5 +3,23 @@ obj-$(CONFIG_CRYPTO_CHACHA_RISCV64) += chacha-riscv64.o chacha-riscv64-y := chacha-riscv64-glue.o chacha-riscv64-zvkb.o +obj-$(CONFIG_CRYPTO_POLY1305_RISCV) += poly1305-riscv.o +poly1305-riscv-y := poly1305-core.o poly1305-glue.o +AFLAGS_poly1305-core.o += -Dpoly1305_init=poly1305_block_init_arch +AFLAGS_poly1305-core.o += -Dpoly1305_blocks=poly1305_blocks_arch +AFLAGS_poly1305-core.o += -Dpoly1305_emit=poly1305_emit_arch + obj-$(CONFIG_CRYPTO_SHA256_RISCV64) += sha256-riscv64.o sha256-riscv64-y := sha256.o sha256-riscv64-zvknha_or_zvknhb-zvkb.o + +ifeq ($(CONFIG_64BIT),y) +PERLASM_ARCH := 64 +else +PERLASM_ARCH := void +endif + +quiet_cmd_perlasm = PERLASM $@ + cmd_perlasm = $(PERL) $(<) $(PERLASM_ARCH) $(@) + +$(obj)/%-core.S: $(src)/%-riscv.pl + $(call cmd,perlasm) diff --git a/arch/riscv/lib/crypto/poly1305-glue.c b/arch/riscv/lib/crypto/poly1305-glue.c new file mode 100644 index 000000000000..ddc73741faf5 --- /dev/null +++ b/arch/riscv/lib/crypto/poly1305-glue.c @@ -0,0 +1,37 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * OpenSSL/Cryptogams accelerated Poly1305 transform for riscv + * + * Copyright (C) 2025 Institute of Software, CAS. + */ + +#include <asm/hwcap.h> +#include <asm/simd.h> +#include <crypto/internal/poly1305.h> +#include <linux/cpufeature.h> +#include <linux/jump_label.h> +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/unaligned.h> + +asmlinkage void poly1305_block_init_arch( + struct poly1305_block_state *state, + const u8 raw_key[POLY1305_BLOCK_SIZE]); +EXPORT_SYMBOL_GPL(poly1305_block_init_arch); +asmlinkage void poly1305_blocks_arch(struct poly1305_block_state *state, + const u8 *src, u32 len, u32 hibit); +EXPORT_SYMBOL_GPL(poly1305_blocks_arch); +asmlinkage void poly1305_emit_arch(const struct poly1305_state *state, + u8 digest[POLY1305_DIGEST_SIZE], + const u32 nonce[4]); +EXPORT_SYMBOL_GPL(poly1305_emit_arch); + +bool poly1305_is_arch_optimized(void) +{ + /* We always can use since only Integer Multiplication extension is used. */ + return true; +} +EXPORT_SYMBOL(poly1305_is_arch_optimized); + +MODULE_DESCRIPTION("Poly1305 authenticator (RISC-V accelerated)"); +MODULE_LICENSE("GPL"); \ No newline at end of file diff --git a/arch/riscv/lib/crypto/poly1305-riscv.pl b/arch/riscv/lib/crypto/poly1305-riscv.pl new file mode 100644 index 000000000000..e08766541be7 --- /dev/null +++ b/arch/riscv/lib/crypto/poly1305-riscv.pl @@ -0,0 +1,816 @@ +#!/usr/bin/env perl +# SPDX-License-Identifier: GPL-1.0+ OR BSD-3-Clause +# +# ==================================================================== +# Written by Andy Polyakov, @dot-asm, initially for use with OpenSSL. +# ==================================================================== +# +# Poly1305 hash for RISC-V. +# +# February 2019 +# +# In the essence it's pretty straightforward transliteration of MIPS +# module [without big-endian option]. +# +# 3.9 cycles per byte on U74, ~60% faster than compiler-generated code. +# 1.9 cpb on C910, ~75% improvement. 2.3 cpb on JH7110 (U74 with +# apparently better multiplier), ~69% faster, 3.3 on Spacemit X60, +# ~69% improvement. +# +# June 2024. +# +# Add CHERI support. +# +###################################################################### +# +($zero,$ra,$sp,$gp,$tp)=map("x$_",(0..4)); +($t0,$t1,$t2,$t3,$t4,$t5,$t6)=map("x$_",(5..7,28..31)); +($a0,$a1,$a2,$a3,$a4,$a5,$a6,$a7)=map("x$_",(10..17)); +($s0,$s1,$s2,$s3,$s4,$s5,$s6,$s7,$s8,$s9,$s10,$s11)=map("x$_",(8,9,18..27)); +# +###################################################################### + +$flavour = shift || "64"; + +for (@ARGV) { $output=$_ if (/\w[\w\-]*\.\w+$/); } +open STDOUT,">$output"; + +if ($flavour =~ /64/) {{{ +###################################################################### +# 64-bit code path... +# +my ($ctx,$inp,$len,$padbit) = ($a0,$a1,$a2,$a3); +my ($in0,$in1,$tmp0,$tmp1,$tmp2,$tmp3,$tmp4) = ($a4,$a5,$a6,$a7,$t0,$t1,$t2); + +$code.=<<___; +#if __riscv_xlen == 64 +# if __SIZEOF_POINTER__ == 16 +# define PUSH csc +# define POP clc +# else +# define PUSH sd +# define POP ld +# endif +#else +# error "unsupported __riscv_xlen" +#endif + +.option pic +.text + +.globl poly1305_init +.type poly1305_init,\@function +poly1305_init: +#ifdef __riscv_zicfilp + lpad 0 +#endif + sd $zero,0($ctx) + sd $zero,8($ctx) + sd $zero,16($ctx) + + beqz $inp,.Lno_key + +#ifndef __CHERI_PURE_CAPABILITY__ + andi $tmp0,$inp,7 # $inp % 8 + andi $inp,$inp,-8 # align $inp + slli $tmp0,$tmp0,3 # byte to bit offset +#endif + ld $in0,0($inp) + ld $in1,8($inp) +#ifndef __CHERI_PURE_CAPABILITY__ + beqz $tmp0,.Laligned_key + + ld $tmp2,16($inp) + neg $tmp1,$tmp0 # implicit &63 in sll + srl $in0,$in0,$tmp0 + sll $tmp3,$in1,$tmp1 + srl $in1,$in1,$tmp0 + sll $tmp2,$tmp2,$tmp1 + or $in0,$in0,$tmp3 + or $in1,$in1,$tmp2 + +.Laligned_key: +#endif + li $tmp0,1 + slli $tmp0,$tmp0,32 # 0x0000000100000000 + addi $tmp0,$tmp0,-63 # 0x00000000ffffffc1 + slli $tmp0,$tmp0,28 # 0x0ffffffc10000000 + addi $tmp0,$tmp0,-1 # 0x0ffffffc0fffffff + + and $in0,$in0,$tmp0 + addi $tmp0,$tmp0,-3 # 0x0ffffffc0ffffffc + and $in1,$in1,$tmp0 + + sd $in0,24($ctx) + srli $tmp0,$in1,2 + sd $in1,32($ctx) + add $tmp0,$tmp0,$in1 # s1 = r1 + (r1 >> 2) + sd $tmp0,40($ctx) + +.Lno_key: + li $a0,0 # return 0 + ret +.size poly1305_init,.-poly1305_init +___ +{ +my ($h0,$h1,$h2,$r0,$r1,$rs1,$d0,$d1,$d2) = + ($s0,$s1,$s2,$s3,$t3,$t4,$in0,$in1,$t2); +my ($shr,$shl) = ($t5,$t6); # used on R6 + +$code.=<<___; +.globl poly1305_blocks +.type poly1305_blocks,\@function +poly1305_blocks: +#ifdef __riscv_zicfilp + lpad 0 +#endif + andi $len,$len,-16 # complete blocks only + beqz $len,.Lno_data + + caddi $sp,$sp,-4*__SIZEOF_POINTER__ + PUSH $s0,3*__SIZEOF_POINTER__($sp) + PUSH $s1,2*__SIZEOF_POINTER__($sp) + PUSH $s2,1*__SIZEOF_POINTER__($sp) + PUSH $s3,0*__SIZEOF_POINTER__($sp) + +#ifndef __CHERI_PURE_CAPABILITY__ + andi $shr,$inp,7 + andi $inp,$inp,-8 # align $inp + slli $shr,$shr,3 # byte to bit offset + neg $shl,$shr # implicit &63 in sll +#endif + + ld $h0,0($ctx) # load hash value + ld $h1,8($ctx) + ld $h2,16($ctx) + + ld $r0,24($ctx) # load key + ld $r1,32($ctx) + ld $rs1,40($ctx) + + add $len,$len,$inp # end of buffer + +.Loop: + ld $in0,0($inp) # load input + ld $in1,8($inp) +#ifndef __CHERI_PURE_CAPABILITY__ + beqz $shr,.Laligned_inp + + ld $tmp2,16($inp) + srl $in0,$in0,$shr + sll $tmp3,$in1,$shl + srl $in1,$in1,$shr + sll $tmp2,$tmp2,$shl + or $in0,$in0,$tmp3 + or $in1,$in1,$tmp2 + +.Laligned_inp: +#endif + caddi $inp,$inp,16 + + andi $tmp0,$h2,-4 # modulo-scheduled reduction + srli $tmp1,$h2,2 + andi $h2,$h2,3 + + add $d0,$h0,$in0 # accumulate input + add $tmp1,$tmp1,$tmp0 + sltu $tmp0,$d0,$h0 + add $d0,$d0,$tmp1 # ... and residue + sltu $tmp1,$d0,$tmp1 + add $d1,$h1,$in1 + add $tmp0,$tmp0,$tmp1 + sltu $tmp1,$d1,$h1 + add $d1,$d1,$tmp0 + + add $d2,$h2,$padbit + sltu $tmp0,$d1,$tmp0 + mulhu $h1,$r0,$d0 # h0*r0 + mul $h0,$r0,$d0 + + add $d2,$d2,$tmp1 + add $d2,$d2,$tmp0 + mulhu $tmp1,$rs1,$d1 # h1*5*r1 + mul $tmp0,$rs1,$d1 + + mulhu $h2,$r1,$d0 # h0*r1 + mul $tmp2,$r1,$d0 + add $h0,$h0,$tmp0 + add $h1,$h1,$tmp1 + sltu $tmp0,$h0,$tmp0 + + add $h1,$h1,$tmp0 + add $h1,$h1,$tmp2 + mulhu $tmp1,$r0,$d1 # h1*r0 + mul $tmp0,$r0,$d1 + + sltu $tmp2,$h1,$tmp2 + add $h2,$h2,$tmp2 + mul $tmp2,$rs1,$d2 # h2*5*r1 + + add $h1,$h1,$tmp0 + add $h2,$h2,$tmp1 + mul $tmp3,$r0,$d2 # h2*r0 + sltu $tmp0,$h1,$tmp0 + add $h2,$h2,$tmp0 + + add $h1,$h1,$tmp2 + sltu $tmp2,$h1,$tmp2 + add $h2,$h2,$tmp2 + add $h2,$h2,$tmp3 + + bne $inp,$len,.Loop + + sd $h0,0($ctx) # store hash value + sd $h1,8($ctx) + sd $h2,16($ctx) + + POP $s0,3*__SIZEOF_POINTER__($sp) # epilogue + POP $s1,2*__SIZEOF_POINTER__($sp) + POP $s2,1*__SIZEOF_POINTER__($sp) + POP $s3,0*__SIZEOF_POINTER__($sp) + caddi $sp,$sp,4*__SIZEOF_POINTER__ + +.Lno_data: + ret +.size poly1305_blocks,.-poly1305_blocks +___ +} +{ +my ($ctx,$mac,$nonce) = ($a0,$a1,$a2); + +$code.=<<___; +.globl poly1305_emit +.type poly1305_emit,\@function +poly1305_emit: +#ifdef __riscv_zicfilp + lpad 0 +#endif + ld $tmp2,16($ctx) + ld $tmp0,0($ctx) + ld $tmp1,8($ctx) + + andi $in0,$tmp2,-4 # final reduction + srl $in1,$tmp2,2 + andi $tmp2,$tmp2,3 + add $in0,$in0,$in1 + + add $tmp0,$tmp0,$in0 + sltu $in1,$tmp0,$in0 + addi $in0,$tmp0,5 # compare to modulus + add $tmp1,$tmp1,$in1 + sltiu $tmp3,$in0,5 + sltu $tmp4,$tmp1,$in1 + add $in1,$tmp1,$tmp3 + add $tmp2,$tmp2,$tmp4 + sltu $tmp3,$in1,$tmp3 + add $tmp2,$tmp2,$tmp3 + + srli $tmp2,$tmp2,2 # see if it carried/borrowed + neg $tmp2,$tmp2 + + xor $in0,$in0,$tmp0 + xor $in1,$in1,$tmp1 + and $in0,$in0,$tmp2 + and $in1,$in1,$tmp2 + xor $in0,$in0,$tmp0 + xor $in1,$in1,$tmp1 + + lwu $tmp0,0($nonce) # load nonce + lwu $tmp1,4($nonce) + lwu $tmp2,8($nonce) + lwu $tmp3,12($nonce) + slli $tmp1,$tmp1,32 + slli $tmp3,$tmp3,32 + or $tmp0,$tmp0,$tmp1 + or $tmp2,$tmp2,$tmp3 + + add $in0,$in0,$tmp0 # accumulate nonce + add $in1,$in1,$tmp2 + sltu $tmp0,$in0,$tmp0 + add $in1,$in1,$tmp0 + + srli $tmp0,$in0,8 # write mac value + srli $tmp1,$in0,16 + srli $tmp2,$in0,24 + sb $in0,0($mac) + srli $tmp3,$in0,32 + sb $tmp0,1($mac) + srli $tmp0,$in0,40 + sb $tmp1,2($mac) + srli $tmp1,$in0,48 + sb $tmp2,3($mac) + srli $tmp2,$in0,56 + sb $tmp3,4($mac) + srli $tmp3,$in1,8 + sb $tmp0,5($mac) + srli $tmp0,$in1,16 + sb $tmp1,6($mac) + srli $tmp1,$in1,24 + sb $tmp2,7($mac) + + sb $in1,8($mac) + srli $tmp2,$in1,32 + sb $tmp3,9($mac) + srli $tmp3,$in1,40 + sb $tmp0,10($mac) + srli $tmp0,$in1,48 + sb $tmp1,11($mac) + srli $tmp1,$in1,56 + sb $tmp2,12($mac) + sb $tmp3,13($mac) + sb $tmp0,14($mac) + sb $tmp1,15($mac) + + ret +.size poly1305_emit,.-poly1305_emit +.string "Poly1305 for RISC-V, CRYPTOGAMS by \@dot-asm" +___ +} +}}} else {{{ +###################################################################### +# 32-bit code path +# + +my ($ctx,$inp,$len,$padbit) = ($a0,$a1,$a2,$a3); +my ($in0,$in1,$in2,$in3,$tmp0,$tmp1,$tmp2,$tmp3) = + ($a4,$a5,$a6,$a7,$t0,$t1,$t2,$t3); + +$code.=<<___; +#if __riscv_xlen == 32 +# if __SIZEOF_POINTER__ == 8 +# define PUSH csc +# define POP clc +# else +# define PUSH sw +# define POP lw +# endif +# define MULX(hi,lo,a,b) mulhu hi,a,b; mul lo,a,b +# define srliw srli +# define srlw srl +# define sllw sll +# define addw add +# define addiw addi +# define mulw mul +#elif __riscv_xlen == 64 +# if __SIZEOF_POINTER__ == 16 +# define PUSH csc +# define POP clc +# else +# define PUSH sd +# define POP ld +# endif +# define MULX(hi,lo,a,b) slli b,b,32; srli b,b,32; mul hi,a,b; addiw lo,hi,0; srai hi,hi,32 +#else +# error "unsupported __riscv_xlen" +#endif + +.option pic +.text + +.globl poly1305_init +.type poly1305_init,\@function +poly1305_init: +#ifdef __riscv_zicfilp + lpad 0 +#endif + sw $zero,0($ctx) + sw $zero,4($ctx) + sw $zero,8($ctx) + sw $zero,12($ctx) + sw $zero,16($ctx) + + beqz $inp,.Lno_key + +#ifndef __CHERI_PURE_CAPABILITY__ + andi $tmp0,$inp,3 # $inp % 4 + sub $inp,$inp,$tmp0 # align $inp + sll $tmp0,$tmp0,3 # byte to bit offset +#endif + lw $in0,0($inp) + lw $in1,4($inp) + lw $in2,8($inp) + lw $in3,12($inp) +#ifndef __CHERI_PURE_CAPABILITY__ + beqz $tmp0,.Laligned_key + + lw $tmp2,16($inp) + sub $tmp1,$zero,$tmp0 + srlw $in0,$in0,$tmp0 + sllw $tmp3,$in1,$tmp1 + srlw $in1,$in1,$tmp0 + or $in0,$in0,$tmp3 + sllw $tmp3,$in2,$tmp1 + srlw $in2,$in2,$tmp0 + or $in1,$in1,$tmp3 + sllw $tmp3,$in3,$tmp1 + srlw $in3,$in3,$tmp0 + or $in2,$in2,$tmp3 + sllw $tmp2,$tmp2,$tmp1 + or $in3,$in3,$tmp2 +.Laligned_key: +#endif + + lui $tmp0,0x10000 + addi $tmp0,$tmp0,-1 # 0x0fffffff + and $in0,$in0,$tmp0 + addi $tmp0,$tmp0,-3 # 0x0ffffffc + and $in1,$in1,$tmp0 + and $in2,$in2,$tmp0 + and $in3,$in3,$tmp0 + + sw $in0,20($ctx) + sw $in1,24($ctx) + sw $in2,28($ctx) + sw $in3,32($ctx) + + srlw $tmp1,$in1,2 + srlw $tmp2,$in2,2 + srlw $tmp3,$in3,2 + addw $in1,$in1,$tmp1 # s1 = r1 + (r1 >> 2) + addw $in2,$in2,$tmp2 + addw $in3,$in3,$tmp3 + sw $in1,36($ctx) + sw $in2,40($ctx) + sw $in3,44($ctx) +.Lno_key: + li $a0,0 + ret +.size poly1305_init,.-poly1305_init +___ +{ +my ($h0,$h1,$h2,$h3,$h4, $r0,$r1,$r2,$r3, $rs1,$rs2,$rs3) = + ($s0,$s1,$s2,$s3,$s4, $s5,$s6,$s7,$s8, $t0,$t1,$t2); +my ($d0,$d1,$d2,$d3) = + ($a4,$a5,$a6,$a7); +my $shr = $ra; # used on R6 + +$code.=<<___; +.globl poly1305_blocks +.type poly1305_blocks,\@function +poly1305_blocks: +#ifdef __riscv_zicfilp + lpad 0 +#endif + andi $len,$len,-16 # complete blocks only + beqz $len,.Labort + + caddi $sp,$sp,-__SIZEOF_POINTER__*12 + PUSH $ra, __SIZEOF_POINTER__*11($sp) + PUSH $s0, __SIZEOF_POINTER__*10($sp) + PUSH $s1, __SIZEOF_POINTER__*9($sp) + PUSH $s2, __SIZEOF_POINTER__*8($sp) + PUSH $s3, __SIZEOF_POINTER__*7($sp) + PUSH $s4, __SIZEOF_POINTER__*6($sp) + PUSH $s5, __SIZEOF_POINTER__*5($sp) + PUSH $s6, __SIZEOF_POINTER__*4($sp) + PUSH $s7, __SIZEOF_POINTER__*3($sp) + PUSH $s8, __SIZEOF_POINTER__*2($sp) + +#ifndef __CHERI_PURE_CAPABILITY__ + andi $shr,$inp,3 + andi $inp,$inp,-4 # align $inp + slli $shr,$shr,3 # byte to bit offset +#endif + + lw $h0,0($ctx) # load hash value + lw $h1,4($ctx) + lw $h2,8($ctx) + lw $h3,12($ctx) + lw $h4,16($ctx) + + lw $r0,20($ctx) # load key + lw $r1,24($ctx) + lw $r2,28($ctx) + lw $r3,32($ctx) + lw $rs1,36($ctx) + lw $rs2,40($ctx) + lw $rs3,44($ctx) + + add $len,$len,$inp # end of buffer + +.Loop: + lw $d0,0($inp) # load input + lw $d1,4($inp) + lw $d2,8($inp) + lw $d3,12($inp) +#ifndef __CHERI_PURE_CAPABILITY__ + beqz $shr,.Laligned_inp + + lw $t4,16($inp) + sub $t5,$zero,$shr + srlw $d0,$d0,$shr + sllw $t3,$d1,$t5 + srlw $d1,$d1,$shr + or $d0,$d0,$t3 + sllw $t3,$d2,$t5 + srlw $d2,$d2,$shr + or $d1,$d1,$t3 + sllw $t3,$d3,$t5 + srlw $d3,$d3,$shr + or $d2,$d2,$t3 + sllw $t4,$t4,$t5 + or $d3,$d3,$t4 + +.Laligned_inp: +#endif + srliw $t3,$h4,2 # modulo-scheduled reduction + andi $t4,$h4,-4 + andi $h4,$h4,3 + + addw $d0,$d0,$h0 # accumulate input + addw $t4,$t4,$t3 + sltu $h0,$d0,$h0 + addw $d0,$d0,$t4 # ... and residue + sltu $t4,$d0,$t4 + + addw $d1,$d1,$h1 + addw $h0,$h0,$t4 # carry + sltu $h1,$d1,$h1 + addw $d1,$d1,$h0 + sltu $h0,$d1,$h0 + + addw $d2,$d2,$h2 + addw $h1,$h1,$h0 # carry + sltu $h2,$d2,$h2 + addw $d2,$d2,$h1 + sltu $h1,$d2,$h1 + + addw $d3,$d3,$h3 + addw $h2,$h2,$h1 # carry + sltu $h3,$d3,$h3 + addw $d3,$d3,$h2 + + MULX ($h1,$h0,$r0,$d0) # d0*r0 + + sltu $h2,$d3,$h2 + addw $h3,$h3,$h2 # carry + + MULX ($t4,$t3,$rs3,$d1) # d1*s3 + + addw $h4,$h4,$padbit + caddi $inp,$inp,16 + addw $h4,$h4,$h3 + + MULX ($t6,$a3,$rs2,$d2) # d2*s2 + addw $h0,$h0,$t3 + addw $h1,$h1,$t4 + sltu $t3,$h0,$t3 + addw $h1,$h1,$t3 + + MULX ($t4,$t3,$rs1,$d3) # d3*s1 + addw $h0,$h0,$a3 + addw $h1,$h1,$t6 + sltu $a3,$h0,$a3 + addw $h1,$h1,$a3 + + + MULX ($h2,$a3,$r1,$d0) # d0*r1 + addw $h0,$h0,$t3 + addw $h1,$h1,$t4 + sltu $t3,$h0,$t3 + addw $h1,$h1,$t3 + + MULX ($t4,$t3,$r0,$d1) # d1*r0 + addw $h1,$h1,$a3 + sltu $a3,$h1,$a3 + addw $h2,$h2,$a3 + + MULX ($t6,$a3,$rs3,$d2) # d2*s3 + addw $h1,$h1,$t3 + addw $h2,$h2,$t4 + sltu $t3,$h1,$t3 + addw $h2,$h2,$t3 + + MULX ($t4,$t3,$rs2,$d3) # d3*s2 + addw $h1,$h1,$a3 + addw $h2,$h2,$t6 + sltu $a3,$h1,$a3 + addw $h2,$h2,$a3 + + mulw $a3,$rs1,$h4 # h4*s1 + addw $h1,$h1,$t3 + addw $h2,$h2,$t4 + sltu $t3,$h1,$t3 + addw $h2,$h2,$t3 + + + MULX ($h3,$t3,$r2,$d0) # d0*r2 + addw $h1,$h1,$a3 + sltu $a3,$h1,$a3 + addw $h2,$h2,$a3 + + MULX ($t6,$a3,$r1,$d1) # d1*r1 + addw $h2,$h2,$t3 + sltu $t3,$h2,$t3 + addw $h3,$h3,$t3 + + MULX ($t4,$t3,$r0,$d2) # d2*r0 + addw $h2,$h2,$a3 + addw $h3,$h3,$t6 + sltu $a3,$h2,$a3 + addw $h3,$h3,$a3 + + MULX ($t6,$a3,$rs3,$d3) # d3*s3 + addw $h2,$h2,$t3 + addw $h3,$h3,$t4 + sltu $t3,$h2,$t3 + addw $h3,$h3,$t3 + + mulw $t3,$rs2,$h4 # h4*s2 + addw $h2,$h2,$a3 + addw $h3,$h3,$t6 + sltu $a3,$h2,$a3 + addw $h3,$h3,$a3 + + + MULX ($t6,$a3,$r3,$d0) # d0*r3 + addw $h2,$h2,$t3 + sltu $t3,$h2,$t3 + addw $h3,$h3,$t3 + + MULX ($t4,$t3,$r2,$d1) # d1*r2 + addw $h3,$h3,$a3 + sltu $a3,$h3,$a3 + addw $t6,$t6,$a3 + + MULX ($a3,$d3,$r0,$d3) # d3*r0 + addw $h3,$h3,$t3 + addw $t6,$t6,$t4 + sltu $t3,$h3,$t3 + addw $t6,$t6,$t3 + + MULX ($t4,$t3,$r1,$d2) # d2*r1 + addw $h3,$h3,$d3 + addw $t6,$t6,$a3 + sltu $d3,$h3,$d3 + addw $t6,$t6,$d3 + + mulw $a3,$rs3,$h4 # h4*s3 + addw $h3,$h3,$t3 + addw $t6,$t6,$t4 + sltu $t3,$h3,$t3 + addw $t6,$t6,$t3 + + + mulw $h4,$r0,$h4 # h4*r0 + addw $h3,$h3,$a3 + sltu $a3,$h3,$a3 + addw $t6,$t6,$a3 + addw $h4,$t6,$h4 + + li $padbit,1 # if we loop, padbit is 1 + + bne $inp,$len,.Loop + + sw $h0,0($ctx) # store hash value + sw $h1,4($ctx) + sw $h2,8($ctx) + sw $h3,12($ctx) + sw $h4,16($ctx) + + POP $ra, __SIZEOF_POINTER__*11($sp) + POP $s0, __SIZEOF_POINTER__*10($sp) + POP $s1, __SIZEOF_POINTER__*9($sp) + POP $s2, __SIZEOF_POINTER__*8($sp) + POP $s3, __SIZEOF_POINTER__*7($sp) + POP $s4, __SIZEOF_POINTER__*6($sp) + POP $s5, __SIZEOF_POINTER__*5($sp) + POP $s6, __SIZEOF_POINTER__*4($sp) + POP $s7, __SIZEOF_POINTER__*3($sp) + POP $s8, __SIZEOF_POINTER__*2($sp) + caddi $sp,$sp,__SIZEOF_POINTER__*12 +.Labort: + ret +.size poly1305_blocks,.-poly1305_blocks +___ +} +{ +my ($ctx,$mac,$nonce,$tmp4) = ($a0,$a1,$a2,$a3); + +$code.=<<___; +.globl poly1305_emit +.type poly1305_emit,\@function +poly1305_emit: +#ifdef __riscv_zicfilp + lpad 0 +#endif + lw $tmp4,16($ctx) + lw $tmp0,0($ctx) + lw $tmp1,4($ctx) + lw $tmp2,8($ctx) + lw $tmp3,12($ctx) + + srliw $ctx,$tmp4,2 # final reduction + andi $in0,$tmp4,-4 + andi $tmp4,$tmp4,3 + addw $ctx,$ctx,$in0 + + addw $tmp0,$tmp0,$ctx + sltu $ctx,$tmp0,$ctx + addiw $in0,$tmp0,5 # compare to modulus + addw $tmp1,$tmp1,$ctx + sltiu $in1,$in0,5 + sltu $ctx,$tmp1,$ctx + addw $in1,$in1,$tmp1 + addw $tmp2,$tmp2,$ctx + sltu $in2,$in1,$tmp1 + sltu $ctx,$tmp2,$ctx + addw $in2,$in2,$tmp2 + addw $tmp3,$tmp3,$ctx + sltu $in3,$in2,$tmp2 + sltu $ctx,$tmp3,$ctx + addw $in3,$in3,$tmp3 + addw $tmp4,$tmp4,$ctx + sltu $ctx,$in3,$tmp3 + addw $ctx,$ctx,$tmp4 + + srl $ctx,$ctx,2 # see if it carried/borrowed + sub $ctx,$zero,$ctx + + xor $in0,$in0,$tmp0 + xor $in1,$in1,$tmp1 + xor $in2,$in2,$tmp2 + xor $in3,$in3,$tmp3 + and $in0,$in0,$ctx + and $in1,$in1,$ctx + and $in2,$in2,$ctx + and $in3,$in3,$ctx + xor $in0,$in0,$tmp0 + xor $in1,$in1,$tmp1 + xor $in2,$in2,$tmp2 + xor $in3,$in3,$tmp3 + + lw $tmp0,0($nonce) # load nonce + lw $tmp1,4($nonce) + lw $tmp2,8($nonce) + lw $tmp3,12($nonce) + + addw $in0,$in0,$tmp0 # accumulate nonce + sltu $ctx,$in0,$tmp0 + + addw $in1,$in1,$tmp1 + sltu $tmp1,$in1,$tmp1 + addw $in1,$in1,$ctx + sltu $ctx,$in1,$ctx + addw $ctx,$ctx,$tmp1 + + addw $in2,$in2,$tmp2 + sltu $tmp2,$in2,$tmp2 + addw $in2,$in2,$ctx + sltu $ctx,$in2,$ctx + addw $ctx,$ctx,$tmp2 + + addw $in3,$in3,$tmp3 + addw $in3,$in3,$ctx + + srl $tmp0,$in0,8 # write mac value + srl $tmp1,$in0,16 + srl $tmp2,$in0,24 + sb $in0, 0($mac) + sb $tmp0,1($mac) + srl $tmp0,$in1,8 + sb $tmp1,2($mac) + srl $tmp1,$in1,16 + sb $tmp2,3($mac) + srl $tmp2,$in1,24 + sb $in1, 4($mac) + sb $tmp0,5($mac) + srl $tmp0,$in2,8 + sb $tmp1,6($mac) + srl $tmp1,$in2,16 + sb $tmp2,7($mac) + srl $tmp2,$in2,24 + sb $in2, 8($mac) + sb $tmp0,9($mac) + srl $tmp0,$in3,8 + sb $tmp1,10($mac) + srl $tmp1,$in3,16 + sb $tmp2,11($mac) + srl $tmp2,$in3,24 + sb $in3, 12($mac) + sb $tmp0,13($mac) + sb $tmp1,14($mac) + sb $tmp2,15($mac) + + ret +.size poly1305_emit,.-poly1305_emit +.string "Poly1305 for RISC-V, CRYPTOGAMS by \@dot-asm" +___ +} +}}} + +foreach (split("\n", $code)) { + if ($flavour =~ /^cheri/) { + s/\(x([0-9]+)\)/(c$1)/ and s/\b([ls][bhwd]u?)\b/c$1/; + s/\b(PUSH|POP)(\s+)x([0-9]+)/$1$2c$3/ or + s/\b(ret|jal)\b/c$1/; + s/\bcaddi?\b/cincoffset/ and s/\bx([0-9]+,)/c$1/g or + m/\bcmove\b/ and s/\bx([0-9]+)/c$1/g; + } else { + s/\bcaddi?\b/add/ or + s/\bcmove\b/mv/; + } + print $_, "\n"; +} + +close STDOUT; \ No newline at end of file diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig index 1ec1466108cc..0d425f8ce90f 100644 --- a/lib/crypto/Kconfig +++ b/lib/crypto/Kconfig @@ -100,7 +100,7 @@ config CRYPTO_LIB_DES config CRYPTO_LIB_POLY1305_RSIZE int - default 2 if MIPS + default 2 if MIPS || RISCV default 11 if X86_64 default 9 if ARM || ARM64 default 1 -- 2.43.0 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-06-11 3:31 ` Zhihang Shao 0 siblings, 0 replies; 30+ messages in thread From: Zhihang Shao @ 2025-06-11 3:31 UTC (permalink / raw) To: linux-crypto Cc: linux-riscv, herbert, paul.walmsley, alex, appro, zhang.lyra, ebiggers From: Zhihang Shao <zhihang.shao.iscas@gmail.com> This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305 implementation for riscv authored by Andy Polyakov. The file 'poly1305-riscv.pl' is taken straight from this upstream GitHub repository [0] at commit 33fe84bc21219a16825459b37c825bf4580a0a7b, and this commit fixed a bug in riscv 64bit implementation. [0] https://github.com/dot-asm/cryptogams Signed-off-by: Zhihang Shao <zhihang.shao.iscas@gmail.com> --- v4: - use 142c43db490d188ff2aea1faf31b8ad0afd5b33f, the newest commit of poly1305-riscv.pl in [0]. (Eric) - Directly pass -Dpoly1305_blocks=poly1305_blocks_arch instead of implementing glue subroutine in C. (Andy, Eric) --- v3: - remove redundant functions. (Herbert) --- v2: - rebase onto mainline and port change to arch/riscv/lib/crypto. (Eric) - fix problems in code according to Andy's guidance. (Andy) --- arch/riscv/lib/crypto/Kconfig | 5 + arch/riscv/lib/crypto/Makefile | 18 + arch/riscv/lib/crypto/poly1305-glue.c | 37 ++ arch/riscv/lib/crypto/poly1305-riscv.pl | 816 ++++++++++++++++++++++++ lib/crypto/Kconfig | 2 +- 5 files changed, 877 insertions(+), 1 deletion(-) create mode 100644 arch/riscv/lib/crypto/poly1305-glue.c create mode 100644 arch/riscv/lib/crypto/poly1305-riscv.pl diff --git a/arch/riscv/lib/crypto/Kconfig b/arch/riscv/lib/crypto/Kconfig index 47c99ea97ce2..3d5a87fafb01 100644 --- a/arch/riscv/lib/crypto/Kconfig +++ b/arch/riscv/lib/crypto/Kconfig @@ -7,6 +7,11 @@ config CRYPTO_CHACHA_RISCV64 select CRYPTO_ARCH_HAVE_LIB_CHACHA select CRYPTO_LIB_CHACHA_GENERIC +config CRYPTO_POLY1305_RISCV + tristate + default CRYPTO_LIB_POLY1305 + select CRYPTO_ARCH_HAVE_LIB_POLY1305 + config CRYPTO_SHA256_RISCV64 tristate depends on 64BIT && RISCV_ISA_V && TOOLCHAIN_HAS_VECTOR_CRYPTO diff --git a/arch/riscv/lib/crypto/Makefile b/arch/riscv/lib/crypto/Makefile index b7cb877a2c07..93ddb62ef0f9 100644 --- a/arch/riscv/lib/crypto/Makefile +++ b/arch/riscv/lib/crypto/Makefile @@ -3,5 +3,23 @@ obj-$(CONFIG_CRYPTO_CHACHA_RISCV64) += chacha-riscv64.o chacha-riscv64-y := chacha-riscv64-glue.o chacha-riscv64-zvkb.o +obj-$(CONFIG_CRYPTO_POLY1305_RISCV) += poly1305-riscv.o +poly1305-riscv-y := poly1305-core.o poly1305-glue.o +AFLAGS_poly1305-core.o += -Dpoly1305_init=poly1305_block_init_arch +AFLAGS_poly1305-core.o += -Dpoly1305_blocks=poly1305_blocks_arch +AFLAGS_poly1305-core.o += -Dpoly1305_emit=poly1305_emit_arch + obj-$(CONFIG_CRYPTO_SHA256_RISCV64) += sha256-riscv64.o sha256-riscv64-y := sha256.o sha256-riscv64-zvknha_or_zvknhb-zvkb.o + +ifeq ($(CONFIG_64BIT),y) +PERLASM_ARCH := 64 +else +PERLASM_ARCH := void +endif + +quiet_cmd_perlasm = PERLASM $@ + cmd_perlasm = $(PERL) $(<) $(PERLASM_ARCH) $(@) + +$(obj)/%-core.S: $(src)/%-riscv.pl + $(call cmd,perlasm) diff --git a/arch/riscv/lib/crypto/poly1305-glue.c b/arch/riscv/lib/crypto/poly1305-glue.c new file mode 100644 index 000000000000..ddc73741faf5 --- /dev/null +++ b/arch/riscv/lib/crypto/poly1305-glue.c @@ -0,0 +1,37 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * OpenSSL/Cryptogams accelerated Poly1305 transform for riscv + * + * Copyright (C) 2025 Institute of Software, CAS. + */ + +#include <asm/hwcap.h> +#include <asm/simd.h> +#include <crypto/internal/poly1305.h> +#include <linux/cpufeature.h> +#include <linux/jump_label.h> +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/unaligned.h> + +asmlinkage void poly1305_block_init_arch( + struct poly1305_block_state *state, + const u8 raw_key[POLY1305_BLOCK_SIZE]); +EXPORT_SYMBOL_GPL(poly1305_block_init_arch); +asmlinkage void poly1305_blocks_arch(struct poly1305_block_state *state, + const u8 *src, u32 len, u32 hibit); +EXPORT_SYMBOL_GPL(poly1305_blocks_arch); +asmlinkage void poly1305_emit_arch(const struct poly1305_state *state, + u8 digest[POLY1305_DIGEST_SIZE], + const u32 nonce[4]); +EXPORT_SYMBOL_GPL(poly1305_emit_arch); + +bool poly1305_is_arch_optimized(void) +{ + /* We always can use since only Integer Multiplication extension is used. */ + return true; +} +EXPORT_SYMBOL(poly1305_is_arch_optimized); + +MODULE_DESCRIPTION("Poly1305 authenticator (RISC-V accelerated)"); +MODULE_LICENSE("GPL"); \ No newline at end of file diff --git a/arch/riscv/lib/crypto/poly1305-riscv.pl b/arch/riscv/lib/crypto/poly1305-riscv.pl new file mode 100644 index 000000000000..e08766541be7 --- /dev/null +++ b/arch/riscv/lib/crypto/poly1305-riscv.pl @@ -0,0 +1,816 @@ +#!/usr/bin/env perl +# SPDX-License-Identifier: GPL-1.0+ OR BSD-3-Clause +# +# ==================================================================== +# Written by Andy Polyakov, @dot-asm, initially for use with OpenSSL. +# ==================================================================== +# +# Poly1305 hash for RISC-V. +# +# February 2019 +# +# In the essence it's pretty straightforward transliteration of MIPS +# module [without big-endian option]. +# +# 3.9 cycles per byte on U74, ~60% faster than compiler-generated code. +# 1.9 cpb on C910, ~75% improvement. 2.3 cpb on JH7110 (U74 with +# apparently better multiplier), ~69% faster, 3.3 on Spacemit X60, +# ~69% improvement. +# +# June 2024. +# +# Add CHERI support. +# +###################################################################### +# +($zero,$ra,$sp,$gp,$tp)=map("x$_",(0..4)); +($t0,$t1,$t2,$t3,$t4,$t5,$t6)=map("x$_",(5..7,28..31)); +($a0,$a1,$a2,$a3,$a4,$a5,$a6,$a7)=map("x$_",(10..17)); +($s0,$s1,$s2,$s3,$s4,$s5,$s6,$s7,$s8,$s9,$s10,$s11)=map("x$_",(8,9,18..27)); +# +###################################################################### + +$flavour = shift || "64"; + +for (@ARGV) { $output=$_ if (/\w[\w\-]*\.\w+$/); } +open STDOUT,">$output"; + +if ($flavour =~ /64/) {{{ +###################################################################### +# 64-bit code path... +# +my ($ctx,$inp,$len,$padbit) = ($a0,$a1,$a2,$a3); +my ($in0,$in1,$tmp0,$tmp1,$tmp2,$tmp3,$tmp4) = ($a4,$a5,$a6,$a7,$t0,$t1,$t2); + +$code.=<<___; +#if __riscv_xlen == 64 +# if __SIZEOF_POINTER__ == 16 +# define PUSH csc +# define POP clc +# else +# define PUSH sd +# define POP ld +# endif +#else +# error "unsupported __riscv_xlen" +#endif + +.option pic +.text + +.globl poly1305_init +.type poly1305_init,\@function +poly1305_init: +#ifdef __riscv_zicfilp + lpad 0 +#endif + sd $zero,0($ctx) + sd $zero,8($ctx) + sd $zero,16($ctx) + + beqz $inp,.Lno_key + +#ifndef __CHERI_PURE_CAPABILITY__ + andi $tmp0,$inp,7 # $inp % 8 + andi $inp,$inp,-8 # align $inp + slli $tmp0,$tmp0,3 # byte to bit offset +#endif + ld $in0,0($inp) + ld $in1,8($inp) +#ifndef __CHERI_PURE_CAPABILITY__ + beqz $tmp0,.Laligned_key + + ld $tmp2,16($inp) + neg $tmp1,$tmp0 # implicit &63 in sll + srl $in0,$in0,$tmp0 + sll $tmp3,$in1,$tmp1 + srl $in1,$in1,$tmp0 + sll $tmp2,$tmp2,$tmp1 + or $in0,$in0,$tmp3 + or $in1,$in1,$tmp2 + +.Laligned_key: +#endif + li $tmp0,1 + slli $tmp0,$tmp0,32 # 0x0000000100000000 + addi $tmp0,$tmp0,-63 # 0x00000000ffffffc1 + slli $tmp0,$tmp0,28 # 0x0ffffffc10000000 + addi $tmp0,$tmp0,-1 # 0x0ffffffc0fffffff + + and $in0,$in0,$tmp0 + addi $tmp0,$tmp0,-3 # 0x0ffffffc0ffffffc + and $in1,$in1,$tmp0 + + sd $in0,24($ctx) + srli $tmp0,$in1,2 + sd $in1,32($ctx) + add $tmp0,$tmp0,$in1 # s1 = r1 + (r1 >> 2) + sd $tmp0,40($ctx) + +.Lno_key: + li $a0,0 # return 0 + ret +.size poly1305_init,.-poly1305_init +___ +{ +my ($h0,$h1,$h2,$r0,$r1,$rs1,$d0,$d1,$d2) = + ($s0,$s1,$s2,$s3,$t3,$t4,$in0,$in1,$t2); +my ($shr,$shl) = ($t5,$t6); # used on R6 + +$code.=<<___; +.globl poly1305_blocks +.type poly1305_blocks,\@function +poly1305_blocks: +#ifdef __riscv_zicfilp + lpad 0 +#endif + andi $len,$len,-16 # complete blocks only + beqz $len,.Lno_data + + caddi $sp,$sp,-4*__SIZEOF_POINTER__ + PUSH $s0,3*__SIZEOF_POINTER__($sp) + PUSH $s1,2*__SIZEOF_POINTER__($sp) + PUSH $s2,1*__SIZEOF_POINTER__($sp) + PUSH $s3,0*__SIZEOF_POINTER__($sp) + +#ifndef __CHERI_PURE_CAPABILITY__ + andi $shr,$inp,7 + andi $inp,$inp,-8 # align $inp + slli $shr,$shr,3 # byte to bit offset + neg $shl,$shr # implicit &63 in sll +#endif + + ld $h0,0($ctx) # load hash value + ld $h1,8($ctx) + ld $h2,16($ctx) + + ld $r0,24($ctx) # load key + ld $r1,32($ctx) + ld $rs1,40($ctx) + + add $len,$len,$inp # end of buffer + +.Loop: + ld $in0,0($inp) # load input + ld $in1,8($inp) +#ifndef __CHERI_PURE_CAPABILITY__ + beqz $shr,.Laligned_inp + + ld $tmp2,16($inp) + srl $in0,$in0,$shr + sll $tmp3,$in1,$shl + srl $in1,$in1,$shr + sll $tmp2,$tmp2,$shl + or $in0,$in0,$tmp3 + or $in1,$in1,$tmp2 + +.Laligned_inp: +#endif + caddi $inp,$inp,16 + + andi $tmp0,$h2,-4 # modulo-scheduled reduction + srli $tmp1,$h2,2 + andi $h2,$h2,3 + + add $d0,$h0,$in0 # accumulate input + add $tmp1,$tmp1,$tmp0 + sltu $tmp0,$d0,$h0 + add $d0,$d0,$tmp1 # ... and residue + sltu $tmp1,$d0,$tmp1 + add $d1,$h1,$in1 + add $tmp0,$tmp0,$tmp1 + sltu $tmp1,$d1,$h1 + add $d1,$d1,$tmp0 + + add $d2,$h2,$padbit + sltu $tmp0,$d1,$tmp0 + mulhu $h1,$r0,$d0 # h0*r0 + mul $h0,$r0,$d0 + + add $d2,$d2,$tmp1 + add $d2,$d2,$tmp0 + mulhu $tmp1,$rs1,$d1 # h1*5*r1 + mul $tmp0,$rs1,$d1 + + mulhu $h2,$r1,$d0 # h0*r1 + mul $tmp2,$r1,$d0 + add $h0,$h0,$tmp0 + add $h1,$h1,$tmp1 + sltu $tmp0,$h0,$tmp0 + + add $h1,$h1,$tmp0 + add $h1,$h1,$tmp2 + mulhu $tmp1,$r0,$d1 # h1*r0 + mul $tmp0,$r0,$d1 + + sltu $tmp2,$h1,$tmp2 + add $h2,$h2,$tmp2 + mul $tmp2,$rs1,$d2 # h2*5*r1 + + add $h1,$h1,$tmp0 + add $h2,$h2,$tmp1 + mul $tmp3,$r0,$d2 # h2*r0 + sltu $tmp0,$h1,$tmp0 + add $h2,$h2,$tmp0 + + add $h1,$h1,$tmp2 + sltu $tmp2,$h1,$tmp2 + add $h2,$h2,$tmp2 + add $h2,$h2,$tmp3 + + bne $inp,$len,.Loop + + sd $h0,0($ctx) # store hash value + sd $h1,8($ctx) + sd $h2,16($ctx) + + POP $s0,3*__SIZEOF_POINTER__($sp) # epilogue + POP $s1,2*__SIZEOF_POINTER__($sp) + POP $s2,1*__SIZEOF_POINTER__($sp) + POP $s3,0*__SIZEOF_POINTER__($sp) + caddi $sp,$sp,4*__SIZEOF_POINTER__ + +.Lno_data: + ret +.size poly1305_blocks,.-poly1305_blocks +___ +} +{ +my ($ctx,$mac,$nonce) = ($a0,$a1,$a2); + +$code.=<<___; +.globl poly1305_emit +.type poly1305_emit,\@function +poly1305_emit: +#ifdef __riscv_zicfilp + lpad 0 +#endif + ld $tmp2,16($ctx) + ld $tmp0,0($ctx) + ld $tmp1,8($ctx) + + andi $in0,$tmp2,-4 # final reduction + srl $in1,$tmp2,2 + andi $tmp2,$tmp2,3 + add $in0,$in0,$in1 + + add $tmp0,$tmp0,$in0 + sltu $in1,$tmp0,$in0 + addi $in0,$tmp0,5 # compare to modulus + add $tmp1,$tmp1,$in1 + sltiu $tmp3,$in0,5 + sltu $tmp4,$tmp1,$in1 + add $in1,$tmp1,$tmp3 + add $tmp2,$tmp2,$tmp4 + sltu $tmp3,$in1,$tmp3 + add $tmp2,$tmp2,$tmp3 + + srli $tmp2,$tmp2,2 # see if it carried/borrowed + neg $tmp2,$tmp2 + + xor $in0,$in0,$tmp0 + xor $in1,$in1,$tmp1 + and $in0,$in0,$tmp2 + and $in1,$in1,$tmp2 + xor $in0,$in0,$tmp0 + xor $in1,$in1,$tmp1 + + lwu $tmp0,0($nonce) # load nonce + lwu $tmp1,4($nonce) + lwu $tmp2,8($nonce) + lwu $tmp3,12($nonce) + slli $tmp1,$tmp1,32 + slli $tmp3,$tmp3,32 + or $tmp0,$tmp0,$tmp1 + or $tmp2,$tmp2,$tmp3 + + add $in0,$in0,$tmp0 # accumulate nonce + add $in1,$in1,$tmp2 + sltu $tmp0,$in0,$tmp0 + add $in1,$in1,$tmp0 + + srli $tmp0,$in0,8 # write mac value + srli $tmp1,$in0,16 + srli $tmp2,$in0,24 + sb $in0,0($mac) + srli $tmp3,$in0,32 + sb $tmp0,1($mac) + srli $tmp0,$in0,40 + sb $tmp1,2($mac) + srli $tmp1,$in0,48 + sb $tmp2,3($mac) + srli $tmp2,$in0,56 + sb $tmp3,4($mac) + srli $tmp3,$in1,8 + sb $tmp0,5($mac) + srli $tmp0,$in1,16 + sb $tmp1,6($mac) + srli $tmp1,$in1,24 + sb $tmp2,7($mac) + + sb $in1,8($mac) + srli $tmp2,$in1,32 + sb $tmp3,9($mac) + srli $tmp3,$in1,40 + sb $tmp0,10($mac) + srli $tmp0,$in1,48 + sb $tmp1,11($mac) + srli $tmp1,$in1,56 + sb $tmp2,12($mac) + sb $tmp3,13($mac) + sb $tmp0,14($mac) + sb $tmp1,15($mac) + + ret +.size poly1305_emit,.-poly1305_emit +.string "Poly1305 for RISC-V, CRYPTOGAMS by \@dot-asm" +___ +} +}}} else {{{ +###################################################################### +# 32-bit code path +# + +my ($ctx,$inp,$len,$padbit) = ($a0,$a1,$a2,$a3); +my ($in0,$in1,$in2,$in3,$tmp0,$tmp1,$tmp2,$tmp3) = + ($a4,$a5,$a6,$a7,$t0,$t1,$t2,$t3); + +$code.=<<___; +#if __riscv_xlen == 32 +# if __SIZEOF_POINTER__ == 8 +# define PUSH csc +# define POP clc +# else +# define PUSH sw +# define POP lw +# endif +# define MULX(hi,lo,a,b) mulhu hi,a,b; mul lo,a,b +# define srliw srli +# define srlw srl +# define sllw sll +# define addw add +# define addiw addi +# define mulw mul +#elif __riscv_xlen == 64 +# if __SIZEOF_POINTER__ == 16 +# define PUSH csc +# define POP clc +# else +# define PUSH sd +# define POP ld +# endif +# define MULX(hi,lo,a,b) slli b,b,32; srli b,b,32; mul hi,a,b; addiw lo,hi,0; srai hi,hi,32 +#else +# error "unsupported __riscv_xlen" +#endif + +.option pic +.text + +.globl poly1305_init +.type poly1305_init,\@function +poly1305_init: +#ifdef __riscv_zicfilp + lpad 0 +#endif + sw $zero,0($ctx) + sw $zero,4($ctx) + sw $zero,8($ctx) + sw $zero,12($ctx) + sw $zero,16($ctx) + + beqz $inp,.Lno_key + +#ifndef __CHERI_PURE_CAPABILITY__ + andi $tmp0,$inp,3 # $inp % 4 + sub $inp,$inp,$tmp0 # align $inp + sll $tmp0,$tmp0,3 # byte to bit offset +#endif + lw $in0,0($inp) + lw $in1,4($inp) + lw $in2,8($inp) + lw $in3,12($inp) +#ifndef __CHERI_PURE_CAPABILITY__ + beqz $tmp0,.Laligned_key + + lw $tmp2,16($inp) + sub $tmp1,$zero,$tmp0 + srlw $in0,$in0,$tmp0 + sllw $tmp3,$in1,$tmp1 + srlw $in1,$in1,$tmp0 + or $in0,$in0,$tmp3 + sllw $tmp3,$in2,$tmp1 + srlw $in2,$in2,$tmp0 + or $in1,$in1,$tmp3 + sllw $tmp3,$in3,$tmp1 + srlw $in3,$in3,$tmp0 + or $in2,$in2,$tmp3 + sllw $tmp2,$tmp2,$tmp1 + or $in3,$in3,$tmp2 +.Laligned_key: +#endif + + lui $tmp0,0x10000 + addi $tmp0,$tmp0,-1 # 0x0fffffff + and $in0,$in0,$tmp0 + addi $tmp0,$tmp0,-3 # 0x0ffffffc + and $in1,$in1,$tmp0 + and $in2,$in2,$tmp0 + and $in3,$in3,$tmp0 + + sw $in0,20($ctx) + sw $in1,24($ctx) + sw $in2,28($ctx) + sw $in3,32($ctx) + + srlw $tmp1,$in1,2 + srlw $tmp2,$in2,2 + srlw $tmp3,$in3,2 + addw $in1,$in1,$tmp1 # s1 = r1 + (r1 >> 2) + addw $in2,$in2,$tmp2 + addw $in3,$in3,$tmp3 + sw $in1,36($ctx) + sw $in2,40($ctx) + sw $in3,44($ctx) +.Lno_key: + li $a0,0 + ret +.size poly1305_init,.-poly1305_init +___ +{ +my ($h0,$h1,$h2,$h3,$h4, $r0,$r1,$r2,$r3, $rs1,$rs2,$rs3) = + ($s0,$s1,$s2,$s3,$s4, $s5,$s6,$s7,$s8, $t0,$t1,$t2); +my ($d0,$d1,$d2,$d3) = + ($a4,$a5,$a6,$a7); +my $shr = $ra; # used on R6 + +$code.=<<___; +.globl poly1305_blocks +.type poly1305_blocks,\@function +poly1305_blocks: +#ifdef __riscv_zicfilp + lpad 0 +#endif + andi $len,$len,-16 # complete blocks only + beqz $len,.Labort + + caddi $sp,$sp,-__SIZEOF_POINTER__*12 + PUSH $ra, __SIZEOF_POINTER__*11($sp) + PUSH $s0, __SIZEOF_POINTER__*10($sp) + PUSH $s1, __SIZEOF_POINTER__*9($sp) + PUSH $s2, __SIZEOF_POINTER__*8($sp) + PUSH $s3, __SIZEOF_POINTER__*7($sp) + PUSH $s4, __SIZEOF_POINTER__*6($sp) + PUSH $s5, __SIZEOF_POINTER__*5($sp) + PUSH $s6, __SIZEOF_POINTER__*4($sp) + PUSH $s7, __SIZEOF_POINTER__*3($sp) + PUSH $s8, __SIZEOF_POINTER__*2($sp) + +#ifndef __CHERI_PURE_CAPABILITY__ + andi $shr,$inp,3 + andi $inp,$inp,-4 # align $inp + slli $shr,$shr,3 # byte to bit offset +#endif + + lw $h0,0($ctx) # load hash value + lw $h1,4($ctx) + lw $h2,8($ctx) + lw $h3,12($ctx) + lw $h4,16($ctx) + + lw $r0,20($ctx) # load key + lw $r1,24($ctx) + lw $r2,28($ctx) + lw $r3,32($ctx) + lw $rs1,36($ctx) + lw $rs2,40($ctx) + lw $rs3,44($ctx) + + add $len,$len,$inp # end of buffer + +.Loop: + lw $d0,0($inp) # load input + lw $d1,4($inp) + lw $d2,8($inp) + lw $d3,12($inp) +#ifndef __CHERI_PURE_CAPABILITY__ + beqz $shr,.Laligned_inp + + lw $t4,16($inp) + sub $t5,$zero,$shr + srlw $d0,$d0,$shr + sllw $t3,$d1,$t5 + srlw $d1,$d1,$shr + or $d0,$d0,$t3 + sllw $t3,$d2,$t5 + srlw $d2,$d2,$shr + or $d1,$d1,$t3 + sllw $t3,$d3,$t5 + srlw $d3,$d3,$shr + or $d2,$d2,$t3 + sllw $t4,$t4,$t5 + or $d3,$d3,$t4 + +.Laligned_inp: +#endif + srliw $t3,$h4,2 # modulo-scheduled reduction + andi $t4,$h4,-4 + andi $h4,$h4,3 + + addw $d0,$d0,$h0 # accumulate input + addw $t4,$t4,$t3 + sltu $h0,$d0,$h0 + addw $d0,$d0,$t4 # ... and residue + sltu $t4,$d0,$t4 + + addw $d1,$d1,$h1 + addw $h0,$h0,$t4 # carry + sltu $h1,$d1,$h1 + addw $d1,$d1,$h0 + sltu $h0,$d1,$h0 + + addw $d2,$d2,$h2 + addw $h1,$h1,$h0 # carry + sltu $h2,$d2,$h2 + addw $d2,$d2,$h1 + sltu $h1,$d2,$h1 + + addw $d3,$d3,$h3 + addw $h2,$h2,$h1 # carry + sltu $h3,$d3,$h3 + addw $d3,$d3,$h2 + + MULX ($h1,$h0,$r0,$d0) # d0*r0 + + sltu $h2,$d3,$h2 + addw $h3,$h3,$h2 # carry + + MULX ($t4,$t3,$rs3,$d1) # d1*s3 + + addw $h4,$h4,$padbit + caddi $inp,$inp,16 + addw $h4,$h4,$h3 + + MULX ($t6,$a3,$rs2,$d2) # d2*s2 + addw $h0,$h0,$t3 + addw $h1,$h1,$t4 + sltu $t3,$h0,$t3 + addw $h1,$h1,$t3 + + MULX ($t4,$t3,$rs1,$d3) # d3*s1 + addw $h0,$h0,$a3 + addw $h1,$h1,$t6 + sltu $a3,$h0,$a3 + addw $h1,$h1,$a3 + + + MULX ($h2,$a3,$r1,$d0) # d0*r1 + addw $h0,$h0,$t3 + addw $h1,$h1,$t4 + sltu $t3,$h0,$t3 + addw $h1,$h1,$t3 + + MULX ($t4,$t3,$r0,$d1) # d1*r0 + addw $h1,$h1,$a3 + sltu $a3,$h1,$a3 + addw $h2,$h2,$a3 + + MULX ($t6,$a3,$rs3,$d2) # d2*s3 + addw $h1,$h1,$t3 + addw $h2,$h2,$t4 + sltu $t3,$h1,$t3 + addw $h2,$h2,$t3 + + MULX ($t4,$t3,$rs2,$d3) # d3*s2 + addw $h1,$h1,$a3 + addw $h2,$h2,$t6 + sltu $a3,$h1,$a3 + addw $h2,$h2,$a3 + + mulw $a3,$rs1,$h4 # h4*s1 + addw $h1,$h1,$t3 + addw $h2,$h2,$t4 + sltu $t3,$h1,$t3 + addw $h2,$h2,$t3 + + + MULX ($h3,$t3,$r2,$d0) # d0*r2 + addw $h1,$h1,$a3 + sltu $a3,$h1,$a3 + addw $h2,$h2,$a3 + + MULX ($t6,$a3,$r1,$d1) # d1*r1 + addw $h2,$h2,$t3 + sltu $t3,$h2,$t3 + addw $h3,$h3,$t3 + + MULX ($t4,$t3,$r0,$d2) # d2*r0 + addw $h2,$h2,$a3 + addw $h3,$h3,$t6 + sltu $a3,$h2,$a3 + addw $h3,$h3,$a3 + + MULX ($t6,$a3,$rs3,$d3) # d3*s3 + addw $h2,$h2,$t3 + addw $h3,$h3,$t4 + sltu $t3,$h2,$t3 + addw $h3,$h3,$t3 + + mulw $t3,$rs2,$h4 # h4*s2 + addw $h2,$h2,$a3 + addw $h3,$h3,$t6 + sltu $a3,$h2,$a3 + addw $h3,$h3,$a3 + + + MULX ($t6,$a3,$r3,$d0) # d0*r3 + addw $h2,$h2,$t3 + sltu $t3,$h2,$t3 + addw $h3,$h3,$t3 + + MULX ($t4,$t3,$r2,$d1) # d1*r2 + addw $h3,$h3,$a3 + sltu $a3,$h3,$a3 + addw $t6,$t6,$a3 + + MULX ($a3,$d3,$r0,$d3) # d3*r0 + addw $h3,$h3,$t3 + addw $t6,$t6,$t4 + sltu $t3,$h3,$t3 + addw $t6,$t6,$t3 + + MULX ($t4,$t3,$r1,$d2) # d2*r1 + addw $h3,$h3,$d3 + addw $t6,$t6,$a3 + sltu $d3,$h3,$d3 + addw $t6,$t6,$d3 + + mulw $a3,$rs3,$h4 # h4*s3 + addw $h3,$h3,$t3 + addw $t6,$t6,$t4 + sltu $t3,$h3,$t3 + addw $t6,$t6,$t3 + + + mulw $h4,$r0,$h4 # h4*r0 + addw $h3,$h3,$a3 + sltu $a3,$h3,$a3 + addw $t6,$t6,$a3 + addw $h4,$t6,$h4 + + li $padbit,1 # if we loop, padbit is 1 + + bne $inp,$len,.Loop + + sw $h0,0($ctx) # store hash value + sw $h1,4($ctx) + sw $h2,8($ctx) + sw $h3,12($ctx) + sw $h4,16($ctx) + + POP $ra, __SIZEOF_POINTER__*11($sp) + POP $s0, __SIZEOF_POINTER__*10($sp) + POP $s1, __SIZEOF_POINTER__*9($sp) + POP $s2, __SIZEOF_POINTER__*8($sp) + POP $s3, __SIZEOF_POINTER__*7($sp) + POP $s4, __SIZEOF_POINTER__*6($sp) + POP $s5, __SIZEOF_POINTER__*5($sp) + POP $s6, __SIZEOF_POINTER__*4($sp) + POP $s7, __SIZEOF_POINTER__*3($sp) + POP $s8, __SIZEOF_POINTER__*2($sp) + caddi $sp,$sp,__SIZEOF_POINTER__*12 +.Labort: + ret +.size poly1305_blocks,.-poly1305_blocks +___ +} +{ +my ($ctx,$mac,$nonce,$tmp4) = ($a0,$a1,$a2,$a3); + +$code.=<<___; +.globl poly1305_emit +.type poly1305_emit,\@function +poly1305_emit: +#ifdef __riscv_zicfilp + lpad 0 +#endif + lw $tmp4,16($ctx) + lw $tmp0,0($ctx) + lw $tmp1,4($ctx) + lw $tmp2,8($ctx) + lw $tmp3,12($ctx) + + srliw $ctx,$tmp4,2 # final reduction + andi $in0,$tmp4,-4 + andi $tmp4,$tmp4,3 + addw $ctx,$ctx,$in0 + + addw $tmp0,$tmp0,$ctx + sltu $ctx,$tmp0,$ctx + addiw $in0,$tmp0,5 # compare to modulus + addw $tmp1,$tmp1,$ctx + sltiu $in1,$in0,5 + sltu $ctx,$tmp1,$ctx + addw $in1,$in1,$tmp1 + addw $tmp2,$tmp2,$ctx + sltu $in2,$in1,$tmp1 + sltu $ctx,$tmp2,$ctx + addw $in2,$in2,$tmp2 + addw $tmp3,$tmp3,$ctx + sltu $in3,$in2,$tmp2 + sltu $ctx,$tmp3,$ctx + addw $in3,$in3,$tmp3 + addw $tmp4,$tmp4,$ctx + sltu $ctx,$in3,$tmp3 + addw $ctx,$ctx,$tmp4 + + srl $ctx,$ctx,2 # see if it carried/borrowed + sub $ctx,$zero,$ctx + + xor $in0,$in0,$tmp0 + xor $in1,$in1,$tmp1 + xor $in2,$in2,$tmp2 + xor $in3,$in3,$tmp3 + and $in0,$in0,$ctx + and $in1,$in1,$ctx + and $in2,$in2,$ctx + and $in3,$in3,$ctx + xor $in0,$in0,$tmp0 + xor $in1,$in1,$tmp1 + xor $in2,$in2,$tmp2 + xor $in3,$in3,$tmp3 + + lw $tmp0,0($nonce) # load nonce + lw $tmp1,4($nonce) + lw $tmp2,8($nonce) + lw $tmp3,12($nonce) + + addw $in0,$in0,$tmp0 # accumulate nonce + sltu $ctx,$in0,$tmp0 + + addw $in1,$in1,$tmp1 + sltu $tmp1,$in1,$tmp1 + addw $in1,$in1,$ctx + sltu $ctx,$in1,$ctx + addw $ctx,$ctx,$tmp1 + + addw $in2,$in2,$tmp2 + sltu $tmp2,$in2,$tmp2 + addw $in2,$in2,$ctx + sltu $ctx,$in2,$ctx + addw $ctx,$ctx,$tmp2 + + addw $in3,$in3,$tmp3 + addw $in3,$in3,$ctx + + srl $tmp0,$in0,8 # write mac value + srl $tmp1,$in0,16 + srl $tmp2,$in0,24 + sb $in0, 0($mac) + sb $tmp0,1($mac) + srl $tmp0,$in1,8 + sb $tmp1,2($mac) + srl $tmp1,$in1,16 + sb $tmp2,3($mac) + srl $tmp2,$in1,24 + sb $in1, 4($mac) + sb $tmp0,5($mac) + srl $tmp0,$in2,8 + sb $tmp1,6($mac) + srl $tmp1,$in2,16 + sb $tmp2,7($mac) + srl $tmp2,$in2,24 + sb $in2, 8($mac) + sb $tmp0,9($mac) + srl $tmp0,$in3,8 + sb $tmp1,10($mac) + srl $tmp1,$in3,16 + sb $tmp2,11($mac) + srl $tmp2,$in3,24 + sb $in3, 12($mac) + sb $tmp0,13($mac) + sb $tmp1,14($mac) + sb $tmp2,15($mac) + + ret +.size poly1305_emit,.-poly1305_emit +.string "Poly1305 for RISC-V, CRYPTOGAMS by \@dot-asm" +___ +} +}}} + +foreach (split("\n", $code)) { + if ($flavour =~ /^cheri/) { + s/\(x([0-9]+)\)/(c$1)/ and s/\b([ls][bhwd]u?)\b/c$1/; + s/\b(PUSH|POP)(\s+)x([0-9]+)/$1$2c$3/ or + s/\b(ret|jal)\b/c$1/; + s/\bcaddi?\b/cincoffset/ and s/\bx([0-9]+,)/c$1/g or + m/\bcmove\b/ and s/\bx([0-9]+)/c$1/g; + } else { + s/\bcaddi?\b/add/ or + s/\bcmove\b/mv/; + } + print $_, "\n"; +} + +close STDOUT; \ No newline at end of file diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig index 1ec1466108cc..0d425f8ce90f 100644 --- a/lib/crypto/Kconfig +++ b/lib/crypto/Kconfig @@ -100,7 +100,7 @@ config CRYPTO_LIB_DES config CRYPTO_LIB_POLY1305_RSIZE int - default 2 if MIPS + default 2 if MIPS || RISCV default 11 if X86_64 default 9 if ARM || ARM64 default 1 -- 2.43.0 _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation 2025-06-11 3:31 ` Zhihang Shao @ 2025-06-24 3:50 ` Eric Biggers -1 siblings, 0 replies; 30+ messages in thread From: Eric Biggers @ 2025-06-24 3:50 UTC (permalink / raw) To: Zhihang Shao Cc: linux-crypto, linux-riscv, herbert, paul.walmsley, alex, appro, zhang.lyra Hi Zhihang, On Wed, Jun 11, 2025 at 11:31:51AM +0800, Zhihang Shao wrote: > From: Zhihang Shao <zhihang.shao.iscas@gmail.com> > > This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305 > implementation for riscv authored by Andy Polyakov. > The file 'poly1305-riscv.pl' is taken straight from this upstream > GitHub repository [0] at commit 33fe84bc21219a16825459b37c825bf4580a0a7b, > and this commit fixed a bug in riscv 64bit implementation. > > [0] https://github.com/dot-asm/cryptogams > > Signed-off-by: Zhihang Shao <zhihang.shao.iscas@gmail.com> I wanted to let you know that I haven't forgotten about this. However, the kernel currently lacks a proper test for Poly1305, and Poly1305 has some edge cases that are prone to bugs. So I'd like to find some time to add a proper Poly1305 test and properly review this. Also, I'm in the middle of fixing how the kernel's architecture-optimized crypto code is integrated; for example, I've just moved arch/*/lib/crypto/ to lib/crypto/. There will also need to be benchmark results that show that this code is actually worthwhile, on both RV32 and RV64. I am planning for the poly1305_kunit test to have a benchmark built-in, after which it will be straightforward to collect these. (The "perlasm" file does have some benchmark results in a comment, but they do not necessarily apply to the Poly1305 code in the kernel.) So this isn't a perfect time to be adding a new Poly1305 implementation, as we're not quite ready for it. But I'd indeed like to take this eventually. A few comments below: > diff --git a/arch/riscv/lib/crypto/Makefile b/arch/riscv/lib/crypto/Makefile > index b7cb877a2c07..93ddb62ef0f9 100644 > --- a/arch/riscv/lib/crypto/Makefile > +++ b/arch/riscv/lib/crypto/Makefile > @@ -3,5 +3,23 @@ > obj-$(CONFIG_CRYPTO_CHACHA_RISCV64) += chacha-riscv64.o > chacha-riscv64-y := chacha-riscv64-glue.o chacha-riscv64-zvkb.o > > +obj-$(CONFIG_CRYPTO_POLY1305_RISCV) += poly1305-riscv.o > +poly1305-riscv-y := poly1305-core.o poly1305-glue.o > +AFLAGS_poly1305-core.o += -Dpoly1305_init=poly1305_block_init_arch > +AFLAGS_poly1305-core.o += -Dpoly1305_blocks=poly1305_blocks_arch > +AFLAGS_poly1305-core.o += -Dpoly1305_emit=poly1305_emit_arch > + > obj-$(CONFIG_CRYPTO_SHA256_RISCV64) += sha256-riscv64.o > sha256-riscv64-y := sha256.o sha256-riscv64-zvknha_or_zvknhb-zvkb.o > + > +ifeq ($(CONFIG_64BIT),y) > +PERLASM_ARCH := 64 > +else > +PERLASM_ARCH := void > +endif > + > +quiet_cmd_perlasm = PERLASM $@ > + cmd_perlasm = $(PERL) $(<) $(PERLASM_ARCH) $(@) > + > +$(obj)/%-core.S: $(src)/%-riscv.pl > + $(call cmd,perlasm) Missing: clean-files += poly1305-core.S > diff --git a/arch/riscv/lib/crypto/poly1305-glue.c b/arch/riscv/lib/crypto/poly1305-glue.c > new file mode 100644 > index 000000000000..ddc73741faf5 > --- /dev/null > +++ b/arch/riscv/lib/crypto/poly1305-glue.c > @@ -0,0 +1,37 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * OpenSSL/Cryptogams accelerated Poly1305 transform for riscv > + * > + * Copyright (C) 2025 Institute of Software, CAS. > + */ > + > +#include <asm/hwcap.h> > +#include <asm/simd.h> > +#include <crypto/internal/poly1305.h> > +#include <linux/cpufeature.h> > +#include <linux/jump_label.h> > +#include <linux/kernel.h> > +#include <linux/module.h> > +#include <linux/unaligned.h> Reduce the include list to: #include <crypto/internal/poly1305.h> #include <linux/export.h> #include <linux/module.h> > +.globl poly1305_init > +.type poly1305_init,\@function > +poly1305_init: > +#ifdef __riscv_zicfilp > + lpad 0 > +#endif The 'lpad' instructions aren't present in the upstream CRYPTOGAMS source. If they are necessary, this addition needs to be documented. But they appear to be unnecessary. > +#ifndef __CHERI_PURE_CAPABILITY__ > + andi $tmp0,$inp,7 # $inp % 8 > + andi $inp,$inp,-8 # align $inp > + slli $tmp0,$tmp0,3 # byte to bit offset > +#endif > + ld $in0,0($inp) > + ld $in1,8($inp) > +#ifndef __CHERI_PURE_CAPABILITY__ > + beqz $tmp0,.Laligned_key > + > + ld $tmp2,16($inp) > + neg $tmp1,$tmp0 # implicit &63 in sll > + srl $in0,$in0,$tmp0 > + sll $tmp3,$in1,$tmp1 > + srl $in1,$in1,$tmp0 > + sll $tmp2,$tmp2,$tmp1 > + or $in0,$in0,$tmp3 > + or $in1,$in1,$tmp2 > + > +.Laligned_key: This code is going through a lot of trouble to work on RISC-V CPUs that don't support efficient misaligned memory accesses. That includes issuing loads of memory outside the bounds of the given buffer, which is questionable (even if it's guaranteed to not cross a page boundary). Is there any chance we can just make the RISC-V Poly1305 code be conditional on CONFIG_RISCV_EFFICIENT_UNALIGNED_ACCESS=y? Or do people not actually use that? The rest of the kernel's RISC-V crypto code, which is based on the vector extension, just assumes that efficient misaligned memory accesses are supported. On a related topic, if this patch is accepted, the result will be inconsistent optimization of ChaCha vs. Poly1305, which are usually paired: (1) ChaCha optimized with the RISC-V vector extension (2) Poly1305 optimized with RISC-V scalar instructions Surely a RISC-V vector extension optimized Poly1305 is going to be needed too? But with that being the case, will a RISC-V scalar optimized Poly1305 actually be worthwhile to add too? Especially without optimized ChaCha alongside it? - Eric ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-06-24 3:50 ` Eric Biggers 0 siblings, 0 replies; 30+ messages in thread From: Eric Biggers @ 2025-06-24 3:50 UTC (permalink / raw) To: Zhihang Shao Cc: linux-crypto, linux-riscv, herbert, paul.walmsley, alex, appro, zhang.lyra Hi Zhihang, On Wed, Jun 11, 2025 at 11:31:51AM +0800, Zhihang Shao wrote: > From: Zhihang Shao <zhihang.shao.iscas@gmail.com> > > This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305 > implementation for riscv authored by Andy Polyakov. > The file 'poly1305-riscv.pl' is taken straight from this upstream > GitHub repository [0] at commit 33fe84bc21219a16825459b37c825bf4580a0a7b, > and this commit fixed a bug in riscv 64bit implementation. > > [0] https://github.com/dot-asm/cryptogams > > Signed-off-by: Zhihang Shao <zhihang.shao.iscas@gmail.com> I wanted to let you know that I haven't forgotten about this. However, the kernel currently lacks a proper test for Poly1305, and Poly1305 has some edge cases that are prone to bugs. So I'd like to find some time to add a proper Poly1305 test and properly review this. Also, I'm in the middle of fixing how the kernel's architecture-optimized crypto code is integrated; for example, I've just moved arch/*/lib/crypto/ to lib/crypto/. There will also need to be benchmark results that show that this code is actually worthwhile, on both RV32 and RV64. I am planning for the poly1305_kunit test to have a benchmark built-in, after which it will be straightforward to collect these. (The "perlasm" file does have some benchmark results in a comment, but they do not necessarily apply to the Poly1305 code in the kernel.) So this isn't a perfect time to be adding a new Poly1305 implementation, as we're not quite ready for it. But I'd indeed like to take this eventually. A few comments below: > diff --git a/arch/riscv/lib/crypto/Makefile b/arch/riscv/lib/crypto/Makefile > index b7cb877a2c07..93ddb62ef0f9 100644 > --- a/arch/riscv/lib/crypto/Makefile > +++ b/arch/riscv/lib/crypto/Makefile > @@ -3,5 +3,23 @@ > obj-$(CONFIG_CRYPTO_CHACHA_RISCV64) += chacha-riscv64.o > chacha-riscv64-y := chacha-riscv64-glue.o chacha-riscv64-zvkb.o > > +obj-$(CONFIG_CRYPTO_POLY1305_RISCV) += poly1305-riscv.o > +poly1305-riscv-y := poly1305-core.o poly1305-glue.o > +AFLAGS_poly1305-core.o += -Dpoly1305_init=poly1305_block_init_arch > +AFLAGS_poly1305-core.o += -Dpoly1305_blocks=poly1305_blocks_arch > +AFLAGS_poly1305-core.o += -Dpoly1305_emit=poly1305_emit_arch > + > obj-$(CONFIG_CRYPTO_SHA256_RISCV64) += sha256-riscv64.o > sha256-riscv64-y := sha256.o sha256-riscv64-zvknha_or_zvknhb-zvkb.o > + > +ifeq ($(CONFIG_64BIT),y) > +PERLASM_ARCH := 64 > +else > +PERLASM_ARCH := void > +endif > + > +quiet_cmd_perlasm = PERLASM $@ > + cmd_perlasm = $(PERL) $(<) $(PERLASM_ARCH) $(@) > + > +$(obj)/%-core.S: $(src)/%-riscv.pl > + $(call cmd,perlasm) Missing: clean-files += poly1305-core.S > diff --git a/arch/riscv/lib/crypto/poly1305-glue.c b/arch/riscv/lib/crypto/poly1305-glue.c > new file mode 100644 > index 000000000000..ddc73741faf5 > --- /dev/null > +++ b/arch/riscv/lib/crypto/poly1305-glue.c > @@ -0,0 +1,37 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * OpenSSL/Cryptogams accelerated Poly1305 transform for riscv > + * > + * Copyright (C) 2025 Institute of Software, CAS. > + */ > + > +#include <asm/hwcap.h> > +#include <asm/simd.h> > +#include <crypto/internal/poly1305.h> > +#include <linux/cpufeature.h> > +#include <linux/jump_label.h> > +#include <linux/kernel.h> > +#include <linux/module.h> > +#include <linux/unaligned.h> Reduce the include list to: #include <crypto/internal/poly1305.h> #include <linux/export.h> #include <linux/module.h> > +.globl poly1305_init > +.type poly1305_init,\@function > +poly1305_init: > +#ifdef __riscv_zicfilp > + lpad 0 > +#endif The 'lpad' instructions aren't present in the upstream CRYPTOGAMS source. If they are necessary, this addition needs to be documented. But they appear to be unnecessary. > +#ifndef __CHERI_PURE_CAPABILITY__ > + andi $tmp0,$inp,7 # $inp % 8 > + andi $inp,$inp,-8 # align $inp > + slli $tmp0,$tmp0,3 # byte to bit offset > +#endif > + ld $in0,0($inp) > + ld $in1,8($inp) > +#ifndef __CHERI_PURE_CAPABILITY__ > + beqz $tmp0,.Laligned_key > + > + ld $tmp2,16($inp) > + neg $tmp1,$tmp0 # implicit &63 in sll > + srl $in0,$in0,$tmp0 > + sll $tmp3,$in1,$tmp1 > + srl $in1,$in1,$tmp0 > + sll $tmp2,$tmp2,$tmp1 > + or $in0,$in0,$tmp3 > + or $in1,$in1,$tmp2 > + > +.Laligned_key: This code is going through a lot of trouble to work on RISC-V CPUs that don't support efficient misaligned memory accesses. That includes issuing loads of memory outside the bounds of the given buffer, which is questionable (even if it's guaranteed to not cross a page boundary). Is there any chance we can just make the RISC-V Poly1305 code be conditional on CONFIG_RISCV_EFFICIENT_UNALIGNED_ACCESS=y? Or do people not actually use that? The rest of the kernel's RISC-V crypto code, which is based on the vector extension, just assumes that efficient misaligned memory accesses are supported. On a related topic, if this patch is accepted, the result will be inconsistent optimization of ChaCha vs. Poly1305, which are usually paired: (1) ChaCha optimized with the RISC-V vector extension (2) Poly1305 optimized with RISC-V scalar instructions Surely a RISC-V vector extension optimized Poly1305 is going to be needed too? But with that being the case, will a RISC-V scalar optimized Poly1305 actually be worthwhile to add too? Especially without optimized ChaCha alongside it? - Eric _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation 2025-06-24 3:50 ` Eric Biggers @ 2025-06-24 9:13 ` Andy Polyakov -1 siblings, 0 replies; 30+ messages in thread From: Andy Polyakov @ 2025-06-24 9:13 UTC (permalink / raw) To: Eric Biggers, Zhihang Shao Cc: linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra >> +.globl poly1305_init >> +.type poly1305_init,\@function >> +poly1305_init: >> +#ifdef __riscv_zicfilp >> + lpad 0 >> +#endif > > The 'lpad' instructions aren't present in the upstream CRYPTOGAMS source. They are. > If they are necessary, this addition needs to be documented. > > But they appear to be unnecessary. They are better be there if Control Flow Integrity is on. It's the same deal as with endbranch instruction on Intel and hint #34 on ARM. It's possible that the kernel never engages CFI for itself, in which case all the mentioned instructions are executed as nop-s. But note that here they are compiled conditionally, so that if you don't compile the kernel with -march=..._zicfilp_..., then they won't be there. >> +#ifndef __CHERI_PURE_CAPABILITY__ >> + andi $tmp0,$inp,7 # $inp % 8 >> + andi $inp,$inp,-8 # align $inp >> + slli $tmp0,$tmp0,3 # byte to bit offset >> +#endif >> + ld $in0,0($inp) >> + ld $in1,8($inp) >> +#ifndef __CHERI_PURE_CAPABILITY__ >> + beqz $tmp0,.Laligned_key >> + >> + ld $tmp2,16($inp) >> + neg $tmp1,$tmp0 # implicit &63 in sll >> + srl $in0,$in0,$tmp0 >> + sll $tmp3,$in1,$tmp1 >> + srl $in1,$in1,$tmp0 >> + sll $tmp2,$tmp2,$tmp1 >> + or $in0,$in0,$tmp3 >> + or $in1,$in1,$tmp2 >> + >> +.Laligned_key: > > This code is going through a lot of trouble to work on RISC-V CPUs that don't > support efficient misaligned memory accesses. That includes issuing loads of > memory outside the bounds of the given buffer, which is questionable (even if > it's guaranteed to not cross a page boundary). It's indeed guaranteed to not cross a page *nor* even cache-line boundaries. Hence they can't trigger any externally observed side effects the corresponding unaligned loads won't. What is the concern otherwise? [Do note that the boundaries are not crossed on a boundary-enforcable CHERI platform ;-)] > The rest of the kernel's RISC-V crypto code, which is based on the vector > extension, just assumes that efficient misaligned memory accesses are supported. Was it tested on real hardware though? I wonder what hardware is out there that supports the vector crypto extensions? > On a related topic, if this patch is accepted, the result will be inconsistent > optimization of ChaCha vs. Poly1305, which are usually paired: https://github.com/dot-asm/cryptogams/blob/master/riscv/chacha-riscv.pl > (1) ChaCha optimized with the RISC-V vector extension > (2) Poly1305 optimized with RISC-V scalar instructions > > Surely a RISC-V vector extension optimized Poly1305 is going to be needed too? I'm a "test-on-hardware" guy. I've got Spacemit X60, which has a working 256-bit base vector implementation. I have a "teaser" Chacha vector implementation that currently performs *worse* than scalar one, more than twice worse. Working on improving it. For reference. One has to recognize that cryptographic algorithms customarily have short dependencies, which means that performance is dominated by instruction latencies. There might or might not be ways to match the scalar performance. Or course, even if it turns out to be impossible on this processor, it doesn't mean that it won't make sense to keep the vector implementation, because other processors might do better. In other words, it's coming... > But with that being the case, will a RISC-V scalar optimized Poly1305 actually > be worthwhile to add too? Especially without optimized ChaCha alongside it? Yes. Because vector implementations are inefficient on short inputs and having a compatible scalar fall-back for short inputs is more than appropriate. In other words starting with scalar implementations is a sensible and perfectly meaningful step. Cheers. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-06-24 9:13 ` Andy Polyakov 0 siblings, 0 replies; 30+ messages in thread From: Andy Polyakov @ 2025-06-24 9:13 UTC (permalink / raw) To: Eric Biggers, Zhihang Shao Cc: linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra >> +.globl poly1305_init >> +.type poly1305_init,\@function >> +poly1305_init: >> +#ifdef __riscv_zicfilp >> + lpad 0 >> +#endif > > The 'lpad' instructions aren't present in the upstream CRYPTOGAMS source. They are. > If they are necessary, this addition needs to be documented. > > But they appear to be unnecessary. They are better be there if Control Flow Integrity is on. It's the same deal as with endbranch instruction on Intel and hint #34 on ARM. It's possible that the kernel never engages CFI for itself, in which case all the mentioned instructions are executed as nop-s. But note that here they are compiled conditionally, so that if you don't compile the kernel with -march=..._zicfilp_..., then they won't be there. >> +#ifndef __CHERI_PURE_CAPABILITY__ >> + andi $tmp0,$inp,7 # $inp % 8 >> + andi $inp,$inp,-8 # align $inp >> + slli $tmp0,$tmp0,3 # byte to bit offset >> +#endif >> + ld $in0,0($inp) >> + ld $in1,8($inp) >> +#ifndef __CHERI_PURE_CAPABILITY__ >> + beqz $tmp0,.Laligned_key >> + >> + ld $tmp2,16($inp) >> + neg $tmp1,$tmp0 # implicit &63 in sll >> + srl $in0,$in0,$tmp0 >> + sll $tmp3,$in1,$tmp1 >> + srl $in1,$in1,$tmp0 >> + sll $tmp2,$tmp2,$tmp1 >> + or $in0,$in0,$tmp3 >> + or $in1,$in1,$tmp2 >> + >> +.Laligned_key: > > This code is going through a lot of trouble to work on RISC-V CPUs that don't > support efficient misaligned memory accesses. That includes issuing loads of > memory outside the bounds of the given buffer, which is questionable (even if > it's guaranteed to not cross a page boundary). It's indeed guaranteed to not cross a page *nor* even cache-line boundaries. Hence they can't trigger any externally observed side effects the corresponding unaligned loads won't. What is the concern otherwise? [Do note that the boundaries are not crossed on a boundary-enforcable CHERI platform ;-)] > The rest of the kernel's RISC-V crypto code, which is based on the vector > extension, just assumes that efficient misaligned memory accesses are supported. Was it tested on real hardware though? I wonder what hardware is out there that supports the vector crypto extensions? > On a related topic, if this patch is accepted, the result will be inconsistent > optimization of ChaCha vs. Poly1305, which are usually paired: https://github.com/dot-asm/cryptogams/blob/master/riscv/chacha-riscv.pl > (1) ChaCha optimized with the RISC-V vector extension > (2) Poly1305 optimized with RISC-V scalar instructions > > Surely a RISC-V vector extension optimized Poly1305 is going to be needed too? I'm a "test-on-hardware" guy. I've got Spacemit X60, which has a working 256-bit base vector implementation. I have a "teaser" Chacha vector implementation that currently performs *worse* than scalar one, more than twice worse. Working on improving it. For reference. One has to recognize that cryptographic algorithms customarily have short dependencies, which means that performance is dominated by instruction latencies. There might or might not be ways to match the scalar performance. Or course, even if it turns out to be impossible on this processor, it doesn't mean that it won't make sense to keep the vector implementation, because other processors might do better. In other words, it's coming... > But with that being the case, will a RISC-V scalar optimized Poly1305 actually > be worthwhile to add too? Especially without optimized ChaCha alongside it? Yes. Because vector implementations are inefficient on short inputs and having a compatible scalar fall-back for short inputs is more than appropriate. In other words starting with scalar implementations is a sensible and perfectly meaningful step. Cheers. _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation 2025-06-24 9:13 ` Andy Polyakov @ 2025-06-25 3:54 ` Eric Biggers -1 siblings, 0 replies; 30+ messages in thread From: Eric Biggers @ 2025-06-25 3:54 UTC (permalink / raw) To: Andy Polyakov Cc: Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra On Tue, Jun 24, 2025 at 11:13:49AM +0200, Andy Polyakov wrote: > > > +.globl poly1305_init > > > +.type poly1305_init,\@function > > > +poly1305_init: > > > +#ifdef __riscv_zicfilp > > > + lpad 0 > > > +#endif > > > > The 'lpad' instructions aren't present in the upstream CRYPTOGAMS source. > > They are. I now see the latest version does have them. However, the description of this patch explicitly states it was taken from CRYPTOGAMS commit 33fe84bc21219a16825459b37c825bf4580a0a7b. Which is of course the one I looked at, and it did not have them. So the patch description is wrong. > > If they are necessary, this addition needs to be documented. > > > > But they appear to be unnecessary. > > They are better be there if Control Flow Integrity is on. It's the same deal > as with endbranch instruction on Intel and hint #34 on ARM. It's possible > that the kernel never engages CFI for itself, in which case all the > mentioned instructions are executed as nop-s. But note that here they are > compiled conditionally, so that if you don't compile the kernel with > -march=..._zicfilp_..., then they won't be there. There appears to be no kernel-mode support for Zicfilp yet. This would be the very first occurrence of the lpad instruction in the kernel source. Keeping this anyway is fine if it's in the upstream version. It just appeared to be an undocumented change from the upstream version... > > > +#ifndef __CHERI_PURE_CAPABILITY__ > > > + andi $tmp0,$inp,7 # $inp % 8 > > > + andi $inp,$inp,-8 # align $inp > > > + slli $tmp0,$tmp0,3 # byte to bit offset > > > +#endif > > > + ld $in0,0($inp) > > > + ld $in1,8($inp) > > > +#ifndef __CHERI_PURE_CAPABILITY__ > > > + beqz $tmp0,.Laligned_key > > > + > > > + ld $tmp2,16($inp) > > > + neg $tmp1,$tmp0 # implicit &63 in sll > > > + srl $in0,$in0,$tmp0 > > > + sll $tmp3,$in1,$tmp1 > > > + srl $in1,$in1,$tmp0 > > > + sll $tmp2,$tmp2,$tmp1 > > > + or $in0,$in0,$tmp3 > > > + or $in1,$in1,$tmp2 > > > + > > > +.Laligned_key: > > > > This code is going through a lot of trouble to work on RISC-V CPUs that don't > > support efficient misaligned memory accesses. That includes issuing loads of > > memory outside the bounds of the given buffer, which is questionable (even if > > it's guaranteed to not cross a page boundary). > > It's indeed guaranteed to not cross a page *nor* even cache-line boundaries. > Hence they can't trigger any externally observed side effects the > corresponding unaligned loads won't. What is the concern otherwise? [Do note > that the boundaries are not crossed on a boundary-enforcable CHERI platform > ;-)] With this, we get: - More complex code. - Slower on CPUs that do support efficient misaligned accesses. - The buffer underflows and overflows could cause problems with future CPU behavior. (Did you consider the palette memory extension, for example?) That being said, if there will continue to be many RISC-V CPUs that don't support efficient misaligned accesses, then we effectively need to do this anyway. I hoped that things might be developing along the lines of ARM, where eventually misaligned accesses started being supported uniformly. But perhaps RISC-V is still in the process of learning that lesson. > > The rest of the kernel's RISC-V crypto code, which is based on the vector > > extension, just assumes that efficient misaligned memory accesses are supported. > > Was it tested on real hardware though? I wonder what hardware is out there > that supports the vector crypto extensions? If I remember correctly, SiFive tested it on their hardware. - Eric ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-06-25 3:54 ` Eric Biggers 0 siblings, 0 replies; 30+ messages in thread From: Eric Biggers @ 2025-06-25 3:54 UTC (permalink / raw) To: Andy Polyakov Cc: Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra On Tue, Jun 24, 2025 at 11:13:49AM +0200, Andy Polyakov wrote: > > > +.globl poly1305_init > > > +.type poly1305_init,\@function > > > +poly1305_init: > > > +#ifdef __riscv_zicfilp > > > + lpad 0 > > > +#endif > > > > The 'lpad' instructions aren't present in the upstream CRYPTOGAMS source. > > They are. I now see the latest version does have them. However, the description of this patch explicitly states it was taken from CRYPTOGAMS commit 33fe84bc21219a16825459b37c825bf4580a0a7b. Which is of course the one I looked at, and it did not have them. So the patch description is wrong. > > If they are necessary, this addition needs to be documented. > > > > But they appear to be unnecessary. > > They are better be there if Control Flow Integrity is on. It's the same deal > as with endbranch instruction on Intel and hint #34 on ARM. It's possible > that the kernel never engages CFI for itself, in which case all the > mentioned instructions are executed as nop-s. But note that here they are > compiled conditionally, so that if you don't compile the kernel with > -march=..._zicfilp_..., then they won't be there. There appears to be no kernel-mode support for Zicfilp yet. This would be the very first occurrence of the lpad instruction in the kernel source. Keeping this anyway is fine if it's in the upstream version. It just appeared to be an undocumented change from the upstream version... > > > +#ifndef __CHERI_PURE_CAPABILITY__ > > > + andi $tmp0,$inp,7 # $inp % 8 > > > + andi $inp,$inp,-8 # align $inp > > > + slli $tmp0,$tmp0,3 # byte to bit offset > > > +#endif > > > + ld $in0,0($inp) > > > + ld $in1,8($inp) > > > +#ifndef __CHERI_PURE_CAPABILITY__ > > > + beqz $tmp0,.Laligned_key > > > + > > > + ld $tmp2,16($inp) > > > + neg $tmp1,$tmp0 # implicit &63 in sll > > > + srl $in0,$in0,$tmp0 > > > + sll $tmp3,$in1,$tmp1 > > > + srl $in1,$in1,$tmp0 > > > + sll $tmp2,$tmp2,$tmp1 > > > + or $in0,$in0,$tmp3 > > > + or $in1,$in1,$tmp2 > > > + > > > +.Laligned_key: > > > > This code is going through a lot of trouble to work on RISC-V CPUs that don't > > support efficient misaligned memory accesses. That includes issuing loads of > > memory outside the bounds of the given buffer, which is questionable (even if > > it's guaranteed to not cross a page boundary). > > It's indeed guaranteed to not cross a page *nor* even cache-line boundaries. > Hence they can't trigger any externally observed side effects the > corresponding unaligned loads won't. What is the concern otherwise? [Do note > that the boundaries are not crossed on a boundary-enforcable CHERI platform > ;-)] With this, we get: - More complex code. - Slower on CPUs that do support efficient misaligned accesses. - The buffer underflows and overflows could cause problems with future CPU behavior. (Did you consider the palette memory extension, for example?) That being said, if there will continue to be many RISC-V CPUs that don't support efficient misaligned accesses, then we effectively need to do this anyway. I hoped that things might be developing along the lines of ARM, where eventually misaligned accesses started being supported uniformly. But perhaps RISC-V is still in the process of learning that lesson. > > The rest of the kernel's RISC-V crypto code, which is based on the vector > > extension, just assumes that efficient misaligned memory accesses are supported. > > Was it tested on real hardware though? I wonder what hardware is out there > that supports the vector crypto extensions? If I remember correctly, SiFive tested it on their hardware. - Eric _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation 2025-06-25 3:54 ` Eric Biggers @ 2025-06-25 9:38 ` Andy Polyakov -1 siblings, 0 replies; 30+ messages in thread From: Andy Polyakov @ 2025-06-25 9:38 UTC (permalink / raw) To: Eric Biggers Cc: Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra >>>> +#ifndef __CHERI_PURE_CAPABILITY__ >>>> + andi $tmp0,$inp,7 # $inp % 8 >>>> + andi $inp,$inp,-8 # align $inp >>>> + slli $tmp0,$tmp0,3 # byte to bit offset >>>> +#endif >>>> + ld $in0,0($inp) >>>> + ld $in1,8($inp) >>>> +#ifndef __CHERI_PURE_CAPABILITY__ >>>> + beqz $tmp0,.Laligned_key >>>> + >>>> + ld $tmp2,16($inp) >>>> + neg $tmp1,$tmp0 # implicit &63 in sll >>>> + srl $in0,$in0,$tmp0 >>>> + sll $tmp3,$in1,$tmp1 >>>> + srl $in1,$in1,$tmp0 >>>> + sll $tmp2,$tmp2,$tmp1 >>>> + or $in0,$in0,$tmp3 >>>> + or $in1,$in1,$tmp2 >>>> + >>>> +.Laligned_key: >>> >>> This code is going through a lot of trouble to work on RISC-V CPUs that don't >>> support efficient misaligned memory accesses. That includes issuing loads of >>> memory outside the bounds of the given buffer, which is questionable (even if >>> it's guaranteed to not cross a page boundary). >> >> It's indeed guaranteed to not cross a page *nor* even cache-line boundaries. >> Hence they can't trigger any externally observed side effects the >> corresponding unaligned loads won't. What is the concern otherwise? [Do note >> that the boundaries are not crossed on a boundary-enforcable CHERI platform >> ;-)] > > With this, we get: > > - More complex code. My rationale is as follows. It's beneficial to have this code to cover the whole spectrum of processor implementations. I for one would even say it's important, because penalties on processors that can't handle misaligned access efficiently are just too high to ignore. Now, it's possible to bypass it with #ifdef-s (as done for CHERI), but to make things less confusing, a.k.a. *less* complex, it's preferred to rely on the compiler predefines (as done for CHERI). Later compiler versions introduced apparently suitable predefines for this, __riscv_misaligned_slow/fast/avoid. However, as of the moment of this writing the macros in question don't seem to depend on the -mcpu parameter. But it's probably reasonable to assume that they will at a later point. So the suggestion would be to use these. Does it sound reasonable? Or would you insist on a custom macro that would need to be set depending on CONFIG_RISCV_EFFICIENT_UNALIGNED_ACCESS? > - Slower on CPUs that do support efficient misaligned accesses. With arguably marginal penalty, as discussed in the previous message. In the context one can also view it as a trade-off between a small penalty and increased #ifdef spaghetti :-) > - The buffer underflows and overflows could cause problems with future CPU > behavior. (Did you consider the palette memory extension, for example?) Pallette memory extension colours fixed-size, hence accordingly aligned, blocks. Since the block size is larger than the word load size, any aligned load would be safe, because even a single "excess" or "short" byte would colour the whole block accordingly. Just in case to be clear. The argument is about loads. Misaligned stores is naturally different matter and it would be inappropriate to handle them in the similar manner. > That being said, if there will continue to be many RISC-V CPUs that don't > support efficient misaligned accesses, then we effectively need to do this > anyway. I hoped that things might be developing along the lines of ARM, where > eventually misaligned accesses started being supported uniformly. But perhaps > RISC-V is still in the process of learning that lesson. One has to recognize that it can also be a matter of cost. I mean imagine you want to license the least expensive IP from SiFive, or have very limited space for MCU. Well, Linux, naturally having higher minimum requirements, doesn't have to care about these, but it doesn't mean that nobody would :-) >>> The rest of the kernel's RISC-V crypto code, which is based on the vector >>> extension, just assumes that efficient misaligned memory accesses are supported. >> >> Was it tested on real hardware though? I wonder what hardware is out there >> that supports the vector crypto extensions? > > If I remember correctly, SiFive tested it on their hardware. Cool! The question was rather "how did it do performance-wise in the context of this discussion," but never mind. Thanks! In a way there is a contradiction. RISC-V as a concept is about openness to everybody, while SiFive is naturally about itself ;-) Cheers. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-06-25 9:38 ` Andy Polyakov 0 siblings, 0 replies; 30+ messages in thread From: Andy Polyakov @ 2025-06-25 9:38 UTC (permalink / raw) To: Eric Biggers Cc: Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra >>>> +#ifndef __CHERI_PURE_CAPABILITY__ >>>> + andi $tmp0,$inp,7 # $inp % 8 >>>> + andi $inp,$inp,-8 # align $inp >>>> + slli $tmp0,$tmp0,3 # byte to bit offset >>>> +#endif >>>> + ld $in0,0($inp) >>>> + ld $in1,8($inp) >>>> +#ifndef __CHERI_PURE_CAPABILITY__ >>>> + beqz $tmp0,.Laligned_key >>>> + >>>> + ld $tmp2,16($inp) >>>> + neg $tmp1,$tmp0 # implicit &63 in sll >>>> + srl $in0,$in0,$tmp0 >>>> + sll $tmp3,$in1,$tmp1 >>>> + srl $in1,$in1,$tmp0 >>>> + sll $tmp2,$tmp2,$tmp1 >>>> + or $in0,$in0,$tmp3 >>>> + or $in1,$in1,$tmp2 >>>> + >>>> +.Laligned_key: >>> >>> This code is going through a lot of trouble to work on RISC-V CPUs that don't >>> support efficient misaligned memory accesses. That includes issuing loads of >>> memory outside the bounds of the given buffer, which is questionable (even if >>> it's guaranteed to not cross a page boundary). >> >> It's indeed guaranteed to not cross a page *nor* even cache-line boundaries. >> Hence they can't trigger any externally observed side effects the >> corresponding unaligned loads won't. What is the concern otherwise? [Do note >> that the boundaries are not crossed on a boundary-enforcable CHERI platform >> ;-)] > > With this, we get: > > - More complex code. My rationale is as follows. It's beneficial to have this code to cover the whole spectrum of processor implementations. I for one would even say it's important, because penalties on processors that can't handle misaligned access efficiently are just too high to ignore. Now, it's possible to bypass it with #ifdef-s (as done for CHERI), but to make things less confusing, a.k.a. *less* complex, it's preferred to rely on the compiler predefines (as done for CHERI). Later compiler versions introduced apparently suitable predefines for this, __riscv_misaligned_slow/fast/avoid. However, as of the moment of this writing the macros in question don't seem to depend on the -mcpu parameter. But it's probably reasonable to assume that they will at a later point. So the suggestion would be to use these. Does it sound reasonable? Or would you insist on a custom macro that would need to be set depending on CONFIG_RISCV_EFFICIENT_UNALIGNED_ACCESS? > - Slower on CPUs that do support efficient misaligned accesses. With arguably marginal penalty, as discussed in the previous message. In the context one can also view it as a trade-off between a small penalty and increased #ifdef spaghetti :-) > - The buffer underflows and overflows could cause problems with future CPU > behavior. (Did you consider the palette memory extension, for example?) Pallette memory extension colours fixed-size, hence accordingly aligned, blocks. Since the block size is larger than the word load size, any aligned load would be safe, because even a single "excess" or "short" byte would colour the whole block accordingly. Just in case to be clear. The argument is about loads. Misaligned stores is naturally different matter and it would be inappropriate to handle them in the similar manner. > That being said, if there will continue to be many RISC-V CPUs that don't > support efficient misaligned accesses, then we effectively need to do this > anyway. I hoped that things might be developing along the lines of ARM, where > eventually misaligned accesses started being supported uniformly. But perhaps > RISC-V is still in the process of learning that lesson. One has to recognize that it can also be a matter of cost. I mean imagine you want to license the least expensive IP from SiFive, or have very limited space for MCU. Well, Linux, naturally having higher minimum requirements, doesn't have to care about these, but it doesn't mean that nobody would :-) >>> The rest of the kernel's RISC-V crypto code, which is based on the vector >>> extension, just assumes that efficient misaligned memory accesses are supported. >> >> Was it tested on real hardware though? I wonder what hardware is out there >> that supports the vector crypto extensions? > > If I remember correctly, SiFive tested it on their hardware. Cool! The question was rather "how did it do performance-wise in the context of this discussion," but never mind. Thanks! In a way there is a contradiction. RISC-V as a concept is about openness to everybody, while SiFive is naturally about itself ;-) Cheers. _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation 2025-06-25 3:54 ` Eric Biggers @ 2025-06-26 15:58 ` Sami Tolvanen -1 siblings, 0 replies; 30+ messages in thread From: Sami Tolvanen @ 2025-06-26 15:58 UTC (permalink / raw) To: Eric Biggers Cc: Andy Polyakov, Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra On Tue, Jun 24, 2025 at 8:56 PM Eric Biggers <ebiggers@kernel.org> wrote: > > On Tue, Jun 24, 2025 at 11:13:49AM +0200, Andy Polyakov wrote: > > > > +.globl poly1305_init > > > > +.type poly1305_init,\@function > > > > +poly1305_init: > > > > +#ifdef __riscv_zicfilp > > > > + lpad 0 > > > > +#endif > > > > > > The 'lpad' instructions aren't present in the upstream CRYPTOGAMS source. > > > > They are. > > I now see the latest version does have them. However, the description of this > patch explicitly states it was taken from CRYPTOGAMS commit > 33fe84bc21219a16825459b37c825bf4580a0a7b. Which is of course the one I looked > at, and it did not have them. So the patch description is wrong. > > > > If they are necessary, this addition needs to be documented. > > > > > > But they appear to be unnecessary. > > > > They are better be there if Control Flow Integrity is on. It's the same deal > > as with endbranch instruction on Intel and hint #34 on ARM. It's possible > > that the kernel never engages CFI for itself, in which case all the > > mentioned instructions are executed as nop-s. But note that here they are > > compiled conditionally, so that if you don't compile the kernel with > > -march=..._zicfilp_..., then they won't be there. > > There appears to be no kernel-mode support for Zicfilp yet. This would be the > very first occurrence of the lpad instruction in the kernel source. Of course, if the kernel actually ends up calling these functions indirectly at some point, lpad alone isn't sufficient, we would need to use SYM_TYPED_FUNC_START to emit CFI type information for them. I assume if RISC-V gains kernel-mode Zicfilp support later, we would have an arch-specific override for the SYM_TYPED_FUNC_START macro that includes the lpad instruction, similarly to arm64 and BTI. Also, if the kernel decides to use type-based landing pad labels for finer-grained CFI, "lpad 0" isn't going to work anyway. Perhaps it would make sense to just drop the lpad instruction in kernel builds for now to avoid confusion? Sami ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-06-26 15:58 ` Sami Tolvanen 0 siblings, 0 replies; 30+ messages in thread From: Sami Tolvanen @ 2025-06-26 15:58 UTC (permalink / raw) To: Eric Biggers Cc: Andy Polyakov, Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra On Tue, Jun 24, 2025 at 8:56 PM Eric Biggers <ebiggers@kernel.org> wrote: > > On Tue, Jun 24, 2025 at 11:13:49AM +0200, Andy Polyakov wrote: > > > > +.globl poly1305_init > > > > +.type poly1305_init,\@function > > > > +poly1305_init: > > > > +#ifdef __riscv_zicfilp > > > > + lpad 0 > > > > +#endif > > > > > > The 'lpad' instructions aren't present in the upstream CRYPTOGAMS source. > > > > They are. > > I now see the latest version does have them. However, the description of this > patch explicitly states it was taken from CRYPTOGAMS commit > 33fe84bc21219a16825459b37c825bf4580a0a7b. Which is of course the one I looked > at, and it did not have them. So the patch description is wrong. > > > > If they are necessary, this addition needs to be documented. > > > > > > But they appear to be unnecessary. > > > > They are better be there if Control Flow Integrity is on. It's the same deal > > as with endbranch instruction on Intel and hint #34 on ARM. It's possible > > that the kernel never engages CFI for itself, in which case all the > > mentioned instructions are executed as nop-s. But note that here they are > > compiled conditionally, so that if you don't compile the kernel with > > -march=..._zicfilp_..., then they won't be there. > > There appears to be no kernel-mode support for Zicfilp yet. This would be the > very first occurrence of the lpad instruction in the kernel source. Of course, if the kernel actually ends up calling these functions indirectly at some point, lpad alone isn't sufficient, we would need to use SYM_TYPED_FUNC_START to emit CFI type information for them. I assume if RISC-V gains kernel-mode Zicfilp support later, we would have an arch-specific override for the SYM_TYPED_FUNC_START macro that includes the lpad instruction, similarly to arm64 and BTI. Also, if the kernel decides to use type-based landing pad labels for finer-grained CFI, "lpad 0" isn't going to work anyway. Perhaps it would make sense to just drop the lpad instruction in kernel builds for now to avoid confusion? Sami _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation 2025-06-26 15:58 ` Sami Tolvanen @ 2025-06-27 9:07 ` Andy Polyakov -1 siblings, 0 replies; 30+ messages in thread From: Andy Polyakov @ 2025-06-27 9:07 UTC (permalink / raw) To: Sami Tolvanen, Eric Biggers Cc: Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra >>>>> +.globl poly1305_init >>>>> +.type poly1305_init,\@function >>>>> +poly1305_init: >>>>> +#ifdef __riscv_zicfilp >>>>> + lpad 0 >>>>> +#endif >>>> >>>> But they appear to be unnecessary. >>> >>> They are better be there if Control Flow Integrity is on. It's the same deal >>> as with endbranch instruction on Intel and hint #34 on ARM. It's possible >>> that the kernel never engages CFI for itself, in which case all the >>> mentioned instructions are executed as nop-s. But note that here they are >>> compiled conditionally, so that if you don't compile the kernel with >>> -march=..._zicfilp_..., then they won't be there. >> >> There appears to be no kernel-mode support for Zicfilp yet. This would be the >> very first occurrence of the lpad instruction in the kernel source. > > Of course, if the kernel actually ends up calling these functions > indirectly at some point, lpad alone isn't sufficient, The exception condition is specified as "tag != x7 *and* tag != 0." In other words lpad 0 provides a coarse-grained, but a protection nevertheless. As effective as the endbranch on Intel. Since a "fit-multiple-use-cases" assembly module, such as cryptogams' ones, can't make specific assumptions about the caller, lpad 0 is a working fall-back. > we would need > to use SYM_TYPED_FUNC_START to emit CFI type information for them. I'm confused. My understanding is that SYM_TYPED_FUNC_START is about clang -fsanitize=kcfi, which is a different mechanism. Well, you might be referring to future developments... > Also, if the kernel decides to use type-based landing pad labels for > finer-grained CFI, "lpad 0" isn't going to work anyway. It would, it simply won't be fine-graned. > Perhaps it > would make sense to just drop the lpad instruction in kernel builds > for now to avoid confusion? Let's say I go for #ifdef __KERNEL__ SYM_TYPED_FUNC_START(poly1305_init, ...) #else .globl poly1305_init .type poly1305_init,\@function poly1305_init: # ifdef __riscv_zicfilp lpad 0 # endif #endif Would it be sufficient to #include <linux/cfi_types.h>? Cheers. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-06-27 9:07 ` Andy Polyakov 0 siblings, 0 replies; 30+ messages in thread From: Andy Polyakov @ 2025-06-27 9:07 UTC (permalink / raw) To: Sami Tolvanen, Eric Biggers Cc: Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra >>>>> +.globl poly1305_init >>>>> +.type poly1305_init,\@function >>>>> +poly1305_init: >>>>> +#ifdef __riscv_zicfilp >>>>> + lpad 0 >>>>> +#endif >>>> >>>> But they appear to be unnecessary. >>> >>> They are better be there if Control Flow Integrity is on. It's the same deal >>> as with endbranch instruction on Intel and hint #34 on ARM. It's possible >>> that the kernel never engages CFI for itself, in which case all the >>> mentioned instructions are executed as nop-s. But note that here they are >>> compiled conditionally, so that if you don't compile the kernel with >>> -march=..._zicfilp_..., then they won't be there. >> >> There appears to be no kernel-mode support for Zicfilp yet. This would be the >> very first occurrence of the lpad instruction in the kernel source. > > Of course, if the kernel actually ends up calling these functions > indirectly at some point, lpad alone isn't sufficient, The exception condition is specified as "tag != x7 *and* tag != 0." In other words lpad 0 provides a coarse-grained, but a protection nevertheless. As effective as the endbranch on Intel. Since a "fit-multiple-use-cases" assembly module, such as cryptogams' ones, can't make specific assumptions about the caller, lpad 0 is a working fall-back. > we would need > to use SYM_TYPED_FUNC_START to emit CFI type information for them. I'm confused. My understanding is that SYM_TYPED_FUNC_START is about clang -fsanitize=kcfi, which is a different mechanism. Well, you might be referring to future developments... > Also, if the kernel decides to use type-based landing pad labels for > finer-grained CFI, "lpad 0" isn't going to work anyway. It would, it simply won't be fine-graned. > Perhaps it > would make sense to just drop the lpad instruction in kernel builds > for now to avoid confusion? Let's say I go for #ifdef __KERNEL__ SYM_TYPED_FUNC_START(poly1305_init, ...) #else .globl poly1305_init .type poly1305_init,\@function poly1305_init: # ifdef __riscv_zicfilp lpad 0 # endif #endif Would it be sufficient to #include <linux/cfi_types.h>? Cheers. _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation 2025-06-27 9:07 ` Andy Polyakov @ 2025-06-27 16:10 ` Sami Tolvanen -1 siblings, 0 replies; 30+ messages in thread From: Sami Tolvanen @ 2025-06-27 16:10 UTC (permalink / raw) To: Andy Polyakov Cc: Eric Biggers, Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra On Fri, Jun 27, 2025 at 2:07 AM Andy Polyakov <appro@cryptogams.org> wrote: > > > we would need > > to use SYM_TYPED_FUNC_START to emit CFI type information for them. > > I'm confused. My understanding is that SYM_TYPED_FUNC_START is about > clang -fsanitize=kcfi, which is a different mechanism. Well, you might > be referring to future developments... CONFIG_CFI_CLANG is supported in the kernel right now, which is why we need to make sure indirect call targets in assembly code are properly annotated. Like I said, if RISC-V gains kernel-mode Zicfilp support later, I would expect the SYM_TYPED_FUNC_START macro to be overridden to include the lpad instruction as well, similarly to other architectures that already support landing pad instructions. In either case, instead of including raw lpad instructions in the code, it should be up to the kernel to decide if landing pads (or other CFI annotations) are needed for the function. > > Also, if the kernel decides to use type-based landing pad labels for > > finer-grained CFI, "lpad 0" isn't going to work anyway. > > It would, it simply won't be fine-graned. Sorry for not being clear. If the kernel implements a fine-grained Zicfilp scheme, we wouldn't want universal indirect call targets (i.e. lpad 0) anywhere in the code. And even if we implement coarse-grained Zicfilp, we would only want landing pads in functions where they're needed to minimize the number of call targets. > > Perhaps it > > would make sense to just drop the lpad instruction in kernel builds > > for now to avoid confusion? > > Let's say I go for > > #ifdef __KERNEL__ > SYM_TYPED_FUNC_START(poly1305_init, ...) > #else > .globl poly1305_init > .type poly1305_init,\@function > poly1305_init: > # ifdef __riscv_zicfilp > lpad 0 > # endif > #endif > > Would it be sufficient to #include <linux/cfi_types.h>? Yes, but this requires the function to be indirectly called, because with CFI_CLANG the compiler only emits CFI type information for functions that are address-taken in C. If, like Eric suggested, these functions are not currently indirectly called, I would simply leave out the lpad instructions for kernel builds and worry about kernel-mode CFI annotations when they're actually needed: # if defined(__riscv_zicfilp) && !defined(__KERNEL__) lpad 0 # endif Sami ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-06-27 16:10 ` Sami Tolvanen 0 siblings, 0 replies; 30+ messages in thread From: Sami Tolvanen @ 2025-06-27 16:10 UTC (permalink / raw) To: Andy Polyakov Cc: Eric Biggers, Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra On Fri, Jun 27, 2025 at 2:07 AM Andy Polyakov <appro@cryptogams.org> wrote: > > > we would need > > to use SYM_TYPED_FUNC_START to emit CFI type information for them. > > I'm confused. My understanding is that SYM_TYPED_FUNC_START is about > clang -fsanitize=kcfi, which is a different mechanism. Well, you might > be referring to future developments... CONFIG_CFI_CLANG is supported in the kernel right now, which is why we need to make sure indirect call targets in assembly code are properly annotated. Like I said, if RISC-V gains kernel-mode Zicfilp support later, I would expect the SYM_TYPED_FUNC_START macro to be overridden to include the lpad instruction as well, similarly to other architectures that already support landing pad instructions. In either case, instead of including raw lpad instructions in the code, it should be up to the kernel to decide if landing pads (or other CFI annotations) are needed for the function. > > Also, if the kernel decides to use type-based landing pad labels for > > finer-grained CFI, "lpad 0" isn't going to work anyway. > > It would, it simply won't be fine-graned. Sorry for not being clear. If the kernel implements a fine-grained Zicfilp scheme, we wouldn't want universal indirect call targets (i.e. lpad 0) anywhere in the code. And even if we implement coarse-grained Zicfilp, we would only want landing pads in functions where they're needed to minimize the number of call targets. > > Perhaps it > > would make sense to just drop the lpad instruction in kernel builds > > for now to avoid confusion? > > Let's say I go for > > #ifdef __KERNEL__ > SYM_TYPED_FUNC_START(poly1305_init, ...) > #else > .globl poly1305_init > .type poly1305_init,\@function > poly1305_init: > # ifdef __riscv_zicfilp > lpad 0 > # endif > #endif > > Would it be sufficient to #include <linux/cfi_types.h>? Yes, but this requires the function to be indirectly called, because with CFI_CLANG the compiler only emits CFI type information for functions that are address-taken in C. If, like Eric suggested, these functions are not currently indirectly called, I would simply leave out the lpad instructions for kernel builds and worry about kernel-mode CFI annotations when they're actually needed: # if defined(__riscv_zicfilp) && !defined(__KERNEL__) lpad 0 # endif Sami _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation 2025-06-27 16:10 ` Sami Tolvanen @ 2025-06-27 21:51 ` Eric Biggers -1 siblings, 0 replies; 30+ messages in thread From: Eric Biggers @ 2025-06-27 21:51 UTC (permalink / raw) To: Sami Tolvanen Cc: Andy Polyakov, Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra On Fri, Jun 27, 2025 at 09:10:30AM -0700, Sami Tolvanen wrote: > > Would it be sufficient to #include <linux/cfi_types.h>? > > Yes, but this requires the function to be indirectly called, because > with CFI_CLANG the compiler only emits CFI type information for > functions that are address-taken in C. If, like Eric suggested, these > functions are not currently indirectly called, I would simply leave > out the lpad instructions for kernel builds and worry about > kernel-mode CFI annotations when they're actually needed: > > # if defined(__riscv_zicfilp) && !defined(__KERNEL__) > lpad 0 > # endif These functions aren't indirectly called, and I'm intending to keep it that way. - Eric ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-06-27 21:51 ` Eric Biggers 0 siblings, 0 replies; 30+ messages in thread From: Eric Biggers @ 2025-06-27 21:51 UTC (permalink / raw) To: Sami Tolvanen Cc: Andy Polyakov, Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra On Fri, Jun 27, 2025 at 09:10:30AM -0700, Sami Tolvanen wrote: > > Would it be sufficient to #include <linux/cfi_types.h>? > > Yes, but this requires the function to be indirectly called, because > with CFI_CLANG the compiler only emits CFI type information for > functions that are address-taken in C. If, like Eric suggested, these > functions are not currently indirectly called, I would simply leave > out the lpad instructions for kernel builds and worry about > kernel-mode CFI annotations when they're actually needed: > > # if defined(__riscv_zicfilp) && !defined(__KERNEL__) > lpad 0 > # endif These functions aren't indirectly called, and I'm intending to keep it that way. - Eric _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation 2025-06-27 21:51 ` Eric Biggers @ 2025-06-30 9:55 ` Andy Polyakov -1 siblings, 0 replies; 30+ messages in thread From: Andy Polyakov @ 2025-06-30 9:55 UTC (permalink / raw) To: Eric Biggers, Sami Tolvanen Cc: Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra >>> Would it be sufficient to #include <linux/cfi_types.h>? >> >> Yes, but this requires the function to be indirectly called, because >> with CFI_CLANG the compiler only emits CFI type information for >> functions that are address-taken in C. If, like Eric suggested, these >> functions are not currently indirectly called, I would simply leave >> out the lpad instructions for kernel builds and worry about >> kernel-mode CFI annotations when they're actually needed: >> >> # if defined(__riscv_zicfilp) && !defined(__KERNEL__) >> lpad 0 >> # endif > > These functions aren't indirectly called, and I'm intending to keep it that way. In which case lpad will be executed as nops. Anyway, does https://github.com/dot-asm/cryptogams/commit/e6ae2202268d995e78fa5d137dde992bdff1b8e8 look all right? Cheers. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-06-30 9:55 ` Andy Polyakov 0 siblings, 0 replies; 30+ messages in thread From: Andy Polyakov @ 2025-06-30 9:55 UTC (permalink / raw) To: Eric Biggers, Sami Tolvanen Cc: Zhihang Shao, linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra >>> Would it be sufficient to #include <linux/cfi_types.h>? >> >> Yes, but this requires the function to be indirectly called, because >> with CFI_CLANG the compiler only emits CFI type information for >> functions that are address-taken in C. If, like Eric suggested, these >> functions are not currently indirectly called, I would simply leave >> out the lpad instructions for kernel builds and worry about >> kernel-mode CFI annotations when they're actually needed: >> >> # if defined(__riscv_zicfilp) && !defined(__KERNEL__) >> lpad 0 >> # endif > > These functions aren't indirectly called, and I'm intending to keep it that way. In which case lpad will be executed as nops. Anyway, does https://github.com/dot-asm/cryptogams/commit/e6ae2202268d995e78fa5d137dde992bdff1b8e8 look all right? Cheers. _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation 2025-06-24 3:50 ` Eric Biggers @ 2025-06-24 11:08 ` Andy Polyakov -1 siblings, 0 replies; 30+ messages in thread From: Andy Polyakov @ 2025-06-24 11:08 UTC (permalink / raw) To: Eric Biggers, Zhihang Shao Cc: linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra >> +#ifndef __CHERI_PURE_CAPABILITY__ >> + andi $tmp0,$inp,7 # $inp % 8 >> + andi $inp,$inp,-8 # align $inp >> + slli $tmp0,$tmp0,3 # byte to bit offset >> +#endif >> + ld $in0,0($inp) >> + ld $in1,8($inp) >> +#ifndef __CHERI_PURE_CAPABILITY__ >> + beqz $tmp0,.Laligned_key >> + >> + ld $tmp2,16($inp) >> + neg $tmp1,$tmp0 # implicit &63 in sll >> + srl $in0,$in0,$tmp0 >> + sll $tmp3,$in1,$tmp1 >> + srl $in1,$in1,$tmp0 >> + sll $tmp2,$tmp2,$tmp1 >> + or $in0,$in0,$tmp3 >> + or $in1,$in1,$tmp2 >> + >> +.Laligned_key: > > This code is going through a lot of trouble to work on RISC-V CPUs that don't > support efficient misaligned memory accesses. That includes issuing loads of > memory outside the bounds of the given buffer, which is questionable (even if > it's guaranteed to not cross a page boundary). > > Is there any chance we can just make the RISC-V Poly1305 code be conditional on > CONFIG_RISCV_EFFICIENT_UNALIGNED_ACCESS=y? Or do people not actually use that? For reference. The penalties for handling unaligned data as above on a processor that can handle unaligned load efficiently are arguably tolerable. For example on Spacemit X60 it's meager ~7%. However, since poly1305 is always used with chacha20 and is faster than chacha20 the difference would be "masked" and rendered marginal. If anything, it makes more sense to utilize this option for chacha20 where the difference is way more significant, a tad less than 20%. Cheers. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-06-24 11:08 ` Andy Polyakov 0 siblings, 0 replies; 30+ messages in thread From: Andy Polyakov @ 2025-06-24 11:08 UTC (permalink / raw) To: Eric Biggers, Zhihang Shao Cc: linux-crypto, linux-riscv, herbert, paul.walmsley, alex, zhang.lyra >> +#ifndef __CHERI_PURE_CAPABILITY__ >> + andi $tmp0,$inp,7 # $inp % 8 >> + andi $inp,$inp,-8 # align $inp >> + slli $tmp0,$tmp0,3 # byte to bit offset >> +#endif >> + ld $in0,0($inp) >> + ld $in1,8($inp) >> +#ifndef __CHERI_PURE_CAPABILITY__ >> + beqz $tmp0,.Laligned_key >> + >> + ld $tmp2,16($inp) >> + neg $tmp1,$tmp0 # implicit &63 in sll >> + srl $in0,$in0,$tmp0 >> + sll $tmp3,$in1,$tmp1 >> + srl $in1,$in1,$tmp0 >> + sll $tmp2,$tmp2,$tmp1 >> + or $in0,$in0,$tmp3 >> + or $in1,$in1,$tmp2 >> + >> +.Laligned_key: > > This code is going through a lot of trouble to work on RISC-V CPUs that don't > support efficient misaligned memory accesses. That includes issuing loads of > memory outside the bounds of the given buffer, which is questionable (even if > it's guaranteed to not cross a page boundary). > > Is there any chance we can just make the RISC-V Poly1305 code be conditional on > CONFIG_RISCV_EFFICIENT_UNALIGNED_ACCESS=y? Or do people not actually use that? For reference. The penalties for handling unaligned data as above on a processor that can handle unaligned load efficiently are arguably tolerable. For example on Spacemit X60 it's meager ~7%. However, since poly1305 is always used with chacha20 and is faster than chacha20 the difference would be "masked" and rendered marginal. If anything, it makes more sense to utilize this option for chacha20 where the difference is way more significant, a tad less than 20%. Cheers. _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation 2025-06-24 3:50 ` Eric Biggers @ 2025-07-06 23:35 ` Eric Biggers -1 siblings, 0 replies; 30+ messages in thread From: Eric Biggers @ 2025-07-06 23:35 UTC (permalink / raw) To: Zhihang Shao Cc: linux-crypto, linux-riscv, herbert, paul.walmsley, alex, appro, zhang.lyra On Mon, Jun 23, 2025 at 08:50:57PM -0700, Eric Biggers wrote: > Hi Zhihang, > > On Wed, Jun 11, 2025 at 11:31:51AM +0800, Zhihang Shao wrote: > > From: Zhihang Shao <zhihang.shao.iscas@gmail.com> > > > > This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305 > > implementation for riscv authored by Andy Polyakov. > > The file 'poly1305-riscv.pl' is taken straight from this upstream > > GitHub repository [0] at commit 33fe84bc21219a16825459b37c825bf4580a0a7b, > > and this commit fixed a bug in riscv 64bit implementation. > > > > [0] https://github.com/dot-asm/cryptogams > > > > Signed-off-by: Zhihang Shao <zhihang.shao.iscas@gmail.com> > > I wanted to let you know that I haven't forgotten about this. However, the > kernel currently lacks a proper test for Poly1305, and Poly1305 has some edge > cases that are prone to bugs. So I'd like to find some time to add a proper > Poly1305 test and properly review this. Also, I'm in the middle of fixing how > the kernel's architecture-optimized crypto code is integrated; for example, I've > just moved arch/*/lib/crypto/ to lib/crypto/. > > There will also need to be benchmark results that show that this code is > actually worthwhile, on both RV32 and RV64. I am planning for the > poly1305_kunit test to have a benchmark built-in, after which it will be > straightforward to collect these. > > (The "perlasm" file does have some benchmark results in a comment, but they do > not necessarily apply to the Poly1305 code in the kernel.) > > So this isn't a perfect time to be adding a new Poly1305 implementation, as > we're not quite ready for it. But I'd indeed like to take this eventually. My series https://lore.kernel.org/linux-crypto/20250706232817.179500-1-ebiggers@kernel.org/ adds a KUnit test suite for Poly1305, including a benchmark, as I had planned. - Eric ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-07-06 23:35 ` Eric Biggers 0 siblings, 0 replies; 30+ messages in thread From: Eric Biggers @ 2025-07-06 23:35 UTC (permalink / raw) To: Zhihang Shao Cc: linux-crypto, linux-riscv, herbert, paul.walmsley, alex, appro, zhang.lyra On Mon, Jun 23, 2025 at 08:50:57PM -0700, Eric Biggers wrote: > Hi Zhihang, > > On Wed, Jun 11, 2025 at 11:31:51AM +0800, Zhihang Shao wrote: > > From: Zhihang Shao <zhihang.shao.iscas@gmail.com> > > > > This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305 > > implementation for riscv authored by Andy Polyakov. > > The file 'poly1305-riscv.pl' is taken straight from this upstream > > GitHub repository [0] at commit 33fe84bc21219a16825459b37c825bf4580a0a7b, > > and this commit fixed a bug in riscv 64bit implementation. > > > > [0] https://github.com/dot-asm/cryptogams > > > > Signed-off-by: Zhihang Shao <zhihang.shao.iscas@gmail.com> > > I wanted to let you know that I haven't forgotten about this. However, the > kernel currently lacks a proper test for Poly1305, and Poly1305 has some edge > cases that are prone to bugs. So I'd like to find some time to add a proper > Poly1305 test and properly review this. Also, I'm in the middle of fixing how > the kernel's architecture-optimized crypto code is integrated; for example, I've > just moved arch/*/lib/crypto/ to lib/crypto/. > > There will also need to be benchmark results that show that this code is > actually worthwhile, on both RV32 and RV64. I am planning for the > poly1305_kunit test to have a benchmark built-in, after which it will be > straightforward to collect these. > > (The "perlasm" file does have some benchmark results in a comment, but they do > not necessarily apply to the Poly1305 code in the kernel.) > > So this isn't a perfect time to be adding a new Poly1305 implementation, as > we're not quite ready for it. But I'd indeed like to take this eventually. My series https://lore.kernel.org/linux-crypto/20250706232817.179500-1-ebiggers@kernel.org/ adds a KUnit test suite for Poly1305, including a benchmark, as I had planned. - Eric _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-07-20 9:10 ` Zhihang Shao 0 siblings, 0 replies; 30+ messages in thread From: Zhihang Shao @ 2025-07-20 9:10 UTC (permalink / raw) To: Eric Biggers Cc: alex, appro, herbert, linux-crypto, linux-riscv, paul.walmsley, zhang.lyra, Zhihang Shao Hi Eric, I recently ran a test using the Kunit module you wrote for testing poly1305, which I executed on QEMU RISC-V 64, . The results show a significant performance improvement of the optimized implementation compared to the generic one. The test data are as follows: --- base.log 2025-07-19 17:41:06.443392989 +0800 +++ optimized.log 2025-07-19 17:40:45.650048601 +0800 @@ -1,31 +1,31 @@ -[ 0.668631] # Subtest: poly1305 -[ 0.668774] # module: poly1305_kunit -[ 0.668857] 1..12 -[ 0.670267] ok 1 test_hash_test_vectors -[ 0.679479] ok 2 test_hash_all_lens_up_to_4096 -[ 0.696048] ok 3 test_hash_incremental_updates -[ 0.697645] ok 4 test_hash_buffer_overruns -[ 0.701060] ok 5 test_hash_overlaps -[ 0.702858] ok 6 test_hash_alignment_consistency -[ 0.703108] ok 7 test_hash_ctx_zeroization -[ 0.846150] ok 8 test_hash_interrupt_context_1 -[ 1.235247] ok 9 test_hash_interrupt_context_2 -[ 1.250813] ok 10 test_poly1305_allones_keys_and_message -[ 1.251138] ok 11 test_poly1305_reduction_edge_cases -[ 1.287196] # benchmark_hash: len=1: 2 MB/s -[ 1.305363] # benchmark_hash: len=16: 61 MB/s -[ 1.321102] # benchmark_hash: len=64: 212 MB/s -[ 1.340105] # benchmark_hash: len=127: 263 MB/s -[ 1.353880] # benchmark_hash: len=128: 364 MB/s -[ 1.370118] # benchmark_hash: len=200: 377 MB/s -[ 1.381879] # benchmark_hash: len=256: 570 MB/s -[ 1.394125] # benchmark_hash: len=511: 657 MB/s -[ 1.404265] # benchmark_hash: len=512: 794 MB/s -[ 1.413356] # benchmark_hash: len=1024: 985 MB/s -[ 1.421925] # benchmark_hash: len=3173: 1131 MB/s -[ 1.429956] # benchmark_hash: len=4096: 1218 MB/s -[ 1.438184] # benchmark_hash: len=16384: 1216 MB/s -[ 1.438462] ok 12 benchmark_hash -[ 1.438686] # poly1305: pass:12 fail:0 skip:0 total:12 -[ 1.438763] # Totals: pass:12 fail:0 skip:0 total:12 -[ 1.438904] ok 1 poly1305 +[ 0.666280] # Subtest: poly1305 +[ 0.666413] # module: poly1305_kunit +[ 0.666490] 1..12 +[ 0.667702] ok 1 test_hash_test_vectors +[ 0.672896] ok 2 test_hash_all_lens_up_to_4096 +[ 0.686244] ok 3 test_hash_incremental_updates +[ 0.687263] ok 4 test_hash_buffer_overruns +[ 0.689957] ok 5 test_hash_overlaps +[ 0.691393] ok 6 test_hash_alignment_consistency +[ 0.691622] ok 7 test_hash_ctx_zeroization +[ 0.769741] ok 8 test_hash_interrupt_context_1 +[ 0.930832] ok 9 test_hash_interrupt_context_2 +[ 0.940068] ok 10 test_poly1305_allones_keys_and_message +[ 0.940478] ok 11 test_poly1305_reduction_edge_cases +[ 0.964546] # benchmark_hash: len=1: 3 MB/s +[ 0.978836] # benchmark_hash: len=16: 78 MB/s +[ 0.990414] # benchmark_hash: len=64: 289 MB/s +[ 1.003012] # benchmark_hash: len=127: 397 MB/s +[ 1.012755] # benchmark_hash: len=128: 517 MB/s +[ 1.022928] # benchmark_hash: len=200: 603 MB/s +[ 1.030981] # benchmark_hash: len=256: 835 MB/s +[ 1.038706] # benchmark_hash: len=511: 1046 MB/s +[ 1.045233] # benchmark_hash: len=512: 1240 MB/s +[ 1.050733] # benchmark_hash: len=1024: 1638 MB/s +[ 1.055620] # benchmark_hash: len=3173: 1998 MB/s +[ 1.060247] # benchmark_hash: len=4096: 2132 MB/s +[ 1.064695] # benchmark_hash: len=16384: 2267 MB/s +[ 1.065179] ok 12 benchmark_hash +[ 1.065425] # poly1305: pass:12 fail:0 skip:0 total:12 +[ 1.065498] # Totals: pass:12 fail:0 skip:0 total:12 +[ 1.065612] ok 1 poly1305 Next, I plan to validate this performance gain on actual RISC-V hardware. I will also submit a v5 patch to the mailing list. Look forward to your feedback and suggestions. - Zhihang ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-07-20 9:10 ` Zhihang Shao 0 siblings, 0 replies; 30+ messages in thread From: Zhihang Shao @ 2025-07-20 9:10 UTC (permalink / raw) To: Eric Biggers Cc: alex, appro, herbert, linux-crypto, linux-riscv, paul.walmsley, zhang.lyra, Zhihang Shao Hi Eric, I recently ran a test using the Kunit module you wrote for testing poly1305, which I executed on QEMU RISC-V 64, . The results show a significant performance improvement of the optimized implementation compared to the generic one. The test data are as follows: --- base.log 2025-07-19 17:41:06.443392989 +0800 +++ optimized.log 2025-07-19 17:40:45.650048601 +0800 @@ -1,31 +1,31 @@ -[ 0.668631] # Subtest: poly1305 -[ 0.668774] # module: poly1305_kunit -[ 0.668857] 1..12 -[ 0.670267] ok 1 test_hash_test_vectors -[ 0.679479] ok 2 test_hash_all_lens_up_to_4096 -[ 0.696048] ok 3 test_hash_incremental_updates -[ 0.697645] ok 4 test_hash_buffer_overruns -[ 0.701060] ok 5 test_hash_overlaps -[ 0.702858] ok 6 test_hash_alignment_consistency -[ 0.703108] ok 7 test_hash_ctx_zeroization -[ 0.846150] ok 8 test_hash_interrupt_context_1 -[ 1.235247] ok 9 test_hash_interrupt_context_2 -[ 1.250813] ok 10 test_poly1305_allones_keys_and_message -[ 1.251138] ok 11 test_poly1305_reduction_edge_cases -[ 1.287196] # benchmark_hash: len=1: 2 MB/s -[ 1.305363] # benchmark_hash: len=16: 61 MB/s -[ 1.321102] # benchmark_hash: len=64: 212 MB/s -[ 1.340105] # benchmark_hash: len=127: 263 MB/s -[ 1.353880] # benchmark_hash: len=128: 364 MB/s -[ 1.370118] # benchmark_hash: len=200: 377 MB/s -[ 1.381879] # benchmark_hash: len=256: 570 MB/s -[ 1.394125] # benchmark_hash: len=511: 657 MB/s -[ 1.404265] # benchmark_hash: len=512: 794 MB/s -[ 1.413356] # benchmark_hash: len=1024: 985 MB/s -[ 1.421925] # benchmark_hash: len=3173: 1131 MB/s -[ 1.429956] # benchmark_hash: len=4096: 1218 MB/s -[ 1.438184] # benchmark_hash: len=16384: 1216 MB/s -[ 1.438462] ok 12 benchmark_hash -[ 1.438686] # poly1305: pass:12 fail:0 skip:0 total:12 -[ 1.438763] # Totals: pass:12 fail:0 skip:0 total:12 -[ 1.438904] ok 1 poly1305 +[ 0.666280] # Subtest: poly1305 +[ 0.666413] # module: poly1305_kunit +[ 0.666490] 1..12 +[ 0.667702] ok 1 test_hash_test_vectors +[ 0.672896] ok 2 test_hash_all_lens_up_to_4096 +[ 0.686244] ok 3 test_hash_incremental_updates +[ 0.687263] ok 4 test_hash_buffer_overruns +[ 0.689957] ok 5 test_hash_overlaps +[ 0.691393] ok 6 test_hash_alignment_consistency +[ 0.691622] ok 7 test_hash_ctx_zeroization +[ 0.769741] ok 8 test_hash_interrupt_context_1 +[ 0.930832] ok 9 test_hash_interrupt_context_2 +[ 0.940068] ok 10 test_poly1305_allones_keys_and_message +[ 0.940478] ok 11 test_poly1305_reduction_edge_cases +[ 0.964546] # benchmark_hash: len=1: 3 MB/s +[ 0.978836] # benchmark_hash: len=16: 78 MB/s +[ 0.990414] # benchmark_hash: len=64: 289 MB/s +[ 1.003012] # benchmark_hash: len=127: 397 MB/s +[ 1.012755] # benchmark_hash: len=128: 517 MB/s +[ 1.022928] # benchmark_hash: len=200: 603 MB/s +[ 1.030981] # benchmark_hash: len=256: 835 MB/s +[ 1.038706] # benchmark_hash: len=511: 1046 MB/s +[ 1.045233] # benchmark_hash: len=512: 1240 MB/s +[ 1.050733] # benchmark_hash: len=1024: 1638 MB/s +[ 1.055620] # benchmark_hash: len=3173: 1998 MB/s +[ 1.060247] # benchmark_hash: len=4096: 2132 MB/s +[ 1.064695] # benchmark_hash: len=16384: 2267 MB/s +[ 1.065179] ok 12 benchmark_hash +[ 1.065425] # poly1305: pass:12 fail:0 skip:0 total:12 +[ 1.065498] # Totals: pass:12 fail:0 skip:0 total:12 +[ 1.065612] ok 1 poly1305 Next, I plan to validate this performance gain on actual RISC-V hardware. I will also submit a v5 patch to the mailing list. Look forward to your feedback and suggestions. - Zhihang _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation 2025-07-20 9:10 ` Zhihang Shao @ 2025-07-20 17:22 ` Eric Biggers -1 siblings, 0 replies; 30+ messages in thread From: Eric Biggers @ 2025-07-20 17:22 UTC (permalink / raw) To: Zhihang Shao Cc: alex, appro, herbert, linux-crypto, linux-riscv, paul.walmsley, zhang.lyra On Sun, Jul 20, 2025 at 05:10:13PM +0800, Zhihang Shao wrote: > Hi Eric, > > I recently ran a test using the Kunit module you wrote for testing > poly1305, which I executed on QEMU RISC-V 64, . The results show a > significant performance improvement of the optimized implementation > compared to the generic one. The test data are as follows: > > --- base.log 2025-07-19 17:41:06.443392989 +0800 > +++ optimized.log 2025-07-19 17:40:45.650048601 +0800 > @@ -1,31 +1,31 @@ > -[ 0.668631] # Subtest: poly1305 > -[ 0.668774] # module: poly1305_kunit > -[ 0.668857] 1..12 > -[ 0.670267] ok 1 test_hash_test_vectors > -[ 0.679479] ok 2 test_hash_all_lens_up_to_4096 > -[ 0.696048] ok 3 test_hash_incremental_updates > -[ 0.697645] ok 4 test_hash_buffer_overruns > -[ 0.701060] ok 5 test_hash_overlaps > -[ 0.702858] ok 6 test_hash_alignment_consistency > -[ 0.703108] ok 7 test_hash_ctx_zeroization > -[ 0.846150] ok 8 test_hash_interrupt_context_1 > -[ 1.235247] ok 9 test_hash_interrupt_context_2 > -[ 1.250813] ok 10 test_poly1305_allones_keys_and_message > -[ 1.251138] ok 11 test_poly1305_reduction_edge_cases > -[ 1.287196] # benchmark_hash: len=1: 2 MB/s > -[ 1.305363] # benchmark_hash: len=16: 61 MB/s > -[ 1.321102] # benchmark_hash: len=64: 212 MB/s > -[ 1.340105] # benchmark_hash: len=127: 263 MB/s > -[ 1.353880] # benchmark_hash: len=128: 364 MB/s > -[ 1.370118] # benchmark_hash: len=200: 377 MB/s > -[ 1.381879] # benchmark_hash: len=256: 570 MB/s > -[ 1.394125] # benchmark_hash: len=511: 657 MB/s > -[ 1.404265] # benchmark_hash: len=512: 794 MB/s > -[ 1.413356] # benchmark_hash: len=1024: 985 MB/s > -[ 1.421925] # benchmark_hash: len=3173: 1131 MB/s > -[ 1.429956] # benchmark_hash: len=4096: 1218 MB/s > -[ 1.438184] # benchmark_hash: len=16384: 1216 MB/s > -[ 1.438462] ok 12 benchmark_hash > -[ 1.438686] # poly1305: pass:12 fail:0 skip:0 total:12 > -[ 1.438763] # Totals: pass:12 fail:0 skip:0 total:12 > -[ 1.438904] ok 1 poly1305 > +[ 0.666280] # Subtest: poly1305 > +[ 0.666413] # module: poly1305_kunit > +[ 0.666490] 1..12 > +[ 0.667702] ok 1 test_hash_test_vectors > +[ 0.672896] ok 2 test_hash_all_lens_up_to_4096 > +[ 0.686244] ok 3 test_hash_incremental_updates > +[ 0.687263] ok 4 test_hash_buffer_overruns > +[ 0.689957] ok 5 test_hash_overlaps > +[ 0.691393] ok 6 test_hash_alignment_consistency > +[ 0.691622] ok 7 test_hash_ctx_zeroization > +[ 0.769741] ok 8 test_hash_interrupt_context_1 > +[ 0.930832] ok 9 test_hash_interrupt_context_2 > +[ 0.940068] ok 10 test_poly1305_allones_keys_and_message > +[ 0.940478] ok 11 test_poly1305_reduction_edge_cases > +[ 0.964546] # benchmark_hash: len=1: 3 MB/s > +[ 0.978836] # benchmark_hash: len=16: 78 MB/s > +[ 0.990414] # benchmark_hash: len=64: 289 MB/s > +[ 1.003012] # benchmark_hash: len=127: 397 MB/s > +[ 1.012755] # benchmark_hash: len=128: 517 MB/s > +[ 1.022928] # benchmark_hash: len=200: 603 MB/s > +[ 1.030981] # benchmark_hash: len=256: 835 MB/s > +[ 1.038706] # benchmark_hash: len=511: 1046 MB/s > +[ 1.045233] # benchmark_hash: len=512: 1240 MB/s > +[ 1.050733] # benchmark_hash: len=1024: 1638 MB/s > +[ 1.055620] # benchmark_hash: len=3173: 1998 MB/s > +[ 1.060247] # benchmark_hash: len=4096: 2132 MB/s > +[ 1.064695] # benchmark_hash: len=16384: 2267 MB/s > +[ 1.065179] ok 12 benchmark_hash > +[ 1.065425] # poly1305: pass:12 fail:0 skip:0 total:12 > +[ 1.065498] # Totals: pass:12 fail:0 skip:0 total:12 > +[ 1.065612] ok 1 poly1305 > > Next, I plan to validate this performance gain on actual RISC-V > hardware. I will also submit a v5 patch to the mailing list. > Look forward to your feedback and suggestions. > Sounds good, thank you! - Eric ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-07-20 17:22 ` Eric Biggers 0 siblings, 0 replies; 30+ messages in thread From: Eric Biggers @ 2025-07-20 17:22 UTC (permalink / raw) To: Zhihang Shao Cc: alex, appro, herbert, linux-crypto, linux-riscv, paul.walmsley, zhang.lyra On Sun, Jul 20, 2025 at 05:10:13PM +0800, Zhihang Shao wrote: > Hi Eric, > > I recently ran a test using the Kunit module you wrote for testing > poly1305, which I executed on QEMU RISC-V 64, . The results show a > significant performance improvement of the optimized implementation > compared to the generic one. The test data are as follows: > > --- base.log 2025-07-19 17:41:06.443392989 +0800 > +++ optimized.log 2025-07-19 17:40:45.650048601 +0800 > @@ -1,31 +1,31 @@ > -[ 0.668631] # Subtest: poly1305 > -[ 0.668774] # module: poly1305_kunit > -[ 0.668857] 1..12 > -[ 0.670267] ok 1 test_hash_test_vectors > -[ 0.679479] ok 2 test_hash_all_lens_up_to_4096 > -[ 0.696048] ok 3 test_hash_incremental_updates > -[ 0.697645] ok 4 test_hash_buffer_overruns > -[ 0.701060] ok 5 test_hash_overlaps > -[ 0.702858] ok 6 test_hash_alignment_consistency > -[ 0.703108] ok 7 test_hash_ctx_zeroization > -[ 0.846150] ok 8 test_hash_interrupt_context_1 > -[ 1.235247] ok 9 test_hash_interrupt_context_2 > -[ 1.250813] ok 10 test_poly1305_allones_keys_and_message > -[ 1.251138] ok 11 test_poly1305_reduction_edge_cases > -[ 1.287196] # benchmark_hash: len=1: 2 MB/s > -[ 1.305363] # benchmark_hash: len=16: 61 MB/s > -[ 1.321102] # benchmark_hash: len=64: 212 MB/s > -[ 1.340105] # benchmark_hash: len=127: 263 MB/s > -[ 1.353880] # benchmark_hash: len=128: 364 MB/s > -[ 1.370118] # benchmark_hash: len=200: 377 MB/s > -[ 1.381879] # benchmark_hash: len=256: 570 MB/s > -[ 1.394125] # benchmark_hash: len=511: 657 MB/s > -[ 1.404265] # benchmark_hash: len=512: 794 MB/s > -[ 1.413356] # benchmark_hash: len=1024: 985 MB/s > -[ 1.421925] # benchmark_hash: len=3173: 1131 MB/s > -[ 1.429956] # benchmark_hash: len=4096: 1218 MB/s > -[ 1.438184] # benchmark_hash: len=16384: 1216 MB/s > -[ 1.438462] ok 12 benchmark_hash > -[ 1.438686] # poly1305: pass:12 fail:0 skip:0 total:12 > -[ 1.438763] # Totals: pass:12 fail:0 skip:0 total:12 > -[ 1.438904] ok 1 poly1305 > +[ 0.666280] # Subtest: poly1305 > +[ 0.666413] # module: poly1305_kunit > +[ 0.666490] 1..12 > +[ 0.667702] ok 1 test_hash_test_vectors > +[ 0.672896] ok 2 test_hash_all_lens_up_to_4096 > +[ 0.686244] ok 3 test_hash_incremental_updates > +[ 0.687263] ok 4 test_hash_buffer_overruns > +[ 0.689957] ok 5 test_hash_overlaps > +[ 0.691393] ok 6 test_hash_alignment_consistency > +[ 0.691622] ok 7 test_hash_ctx_zeroization > +[ 0.769741] ok 8 test_hash_interrupt_context_1 > +[ 0.930832] ok 9 test_hash_interrupt_context_2 > +[ 0.940068] ok 10 test_poly1305_allones_keys_and_message > +[ 0.940478] ok 11 test_poly1305_reduction_edge_cases > +[ 0.964546] # benchmark_hash: len=1: 3 MB/s > +[ 0.978836] # benchmark_hash: len=16: 78 MB/s > +[ 0.990414] # benchmark_hash: len=64: 289 MB/s > +[ 1.003012] # benchmark_hash: len=127: 397 MB/s > +[ 1.012755] # benchmark_hash: len=128: 517 MB/s > +[ 1.022928] # benchmark_hash: len=200: 603 MB/s > +[ 1.030981] # benchmark_hash: len=256: 835 MB/s > +[ 1.038706] # benchmark_hash: len=511: 1046 MB/s > +[ 1.045233] # benchmark_hash: len=512: 1240 MB/s > +[ 1.050733] # benchmark_hash: len=1024: 1638 MB/s > +[ 1.055620] # benchmark_hash: len=3173: 1998 MB/s > +[ 1.060247] # benchmark_hash: len=4096: 2132 MB/s > +[ 1.064695] # benchmark_hash: len=16384: 2267 MB/s > +[ 1.065179] ok 12 benchmark_hash > +[ 1.065425] # poly1305: pass:12 fail:0 skip:0 total:12 > +[ 1.065498] # Totals: pass:12 fail:0 skip:0 total:12 > +[ 1.065612] ok 1 poly1305 > > Next, I plan to validate this performance gain on actual RISC-V > hardware. I will also submit a v5 patch to the mailing list. > Look forward to your feedback and suggestions. > Sounds good, thank you! - Eric _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation 2025-07-20 9:10 ` Zhihang Shao @ 2025-07-23 9:47 ` Andy Polyakov -1 siblings, 0 replies; 30+ messages in thread From: Andy Polyakov @ 2025-07-23 9:47 UTC (permalink / raw) To: Zhihang Shao, Eric Biggers Cc: alex, herbert, linux-crypto, linux-riscv, paul.walmsley, zhang.lyra Hi, > Next, I plan to validate this performance gain on actual RISC-V > hardware. I've rerun my benchmarks, the cycles-per-byte results quoted in the poly1305-riscv.pl commentary section, and it appears that my U74 results were off. I must have made wrong assumptions about clock frequency or I failed to note that the [shared] system was busy. Either way, U74 delivers 1.8 cpb, be it the initial processor version or one with additional ISA capabilities such as Zbb, JH7100 vs. JH7110. For reference, the cpb is calculated by dividing the clock frequency by the measured MBps rate. I also have vector implementation cooking. It's not ready to be released, because it doesn't yet scale with vlenb and works only on a 256-bit vector unit. It achieves 1.3 cpb on Spacemit X60, 2.5x improvement over scalar code. Just in case, one can't expect the coefficient to be the same on other processors. Cheers. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation @ 2025-07-23 9:47 ` Andy Polyakov 0 siblings, 0 replies; 30+ messages in thread From: Andy Polyakov @ 2025-07-23 9:47 UTC (permalink / raw) To: Zhihang Shao, Eric Biggers Cc: alex, herbert, linux-crypto, linux-riscv, paul.walmsley, zhang.lyra Hi, > Next, I plan to validate this performance gain on actual RISC-V > hardware. I've rerun my benchmarks, the cycles-per-byte results quoted in the poly1305-riscv.pl commentary section, and it appears that my U74 results were off. I must have made wrong assumptions about clock frequency or I failed to note that the [shared] system was busy. Either way, U74 delivers 1.8 cpb, be it the initial processor version or one with additional ISA capabilities such as Zbb, JH7100 vs. JH7110. For reference, the cpb is calculated by dividing the clock frequency by the measured MBps rate. I also have vector implementation cooking. It's not ready to be released, because it doesn't yet scale with vlenb and works only on a 256-bit vector unit. It achieves 1.3 cpb on Spacemit X60, 2.5x improvement over scalar code. Just in case, one can't expect the coefficient to be the same on other processors. Cheers. _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2025-08-12 22:00 UTC | newest] Thread overview: 30+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-06-11 3:31 [PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation Zhihang Shao 2025-06-11 3:31 ` Zhihang Shao 2025-06-24 3:50 ` Eric Biggers 2025-06-24 3:50 ` Eric Biggers 2025-06-24 9:13 ` Andy Polyakov 2025-06-24 9:13 ` Andy Polyakov 2025-06-25 3:54 ` Eric Biggers 2025-06-25 3:54 ` Eric Biggers 2025-06-25 9:38 ` Andy Polyakov 2025-06-25 9:38 ` Andy Polyakov 2025-06-26 15:58 ` Sami Tolvanen 2025-06-26 15:58 ` Sami Tolvanen 2025-06-27 9:07 ` Andy Polyakov 2025-06-27 9:07 ` Andy Polyakov 2025-06-27 16:10 ` Sami Tolvanen 2025-06-27 16:10 ` Sami Tolvanen 2025-06-27 21:51 ` Eric Biggers 2025-06-27 21:51 ` Eric Biggers 2025-06-30 9:55 ` Andy Polyakov 2025-06-30 9:55 ` Andy Polyakov 2025-06-24 11:08 ` Andy Polyakov 2025-06-24 11:08 ` Andy Polyakov 2025-07-06 23:35 ` Eric Biggers 2025-07-06 23:35 ` Eric Biggers -- strict thread matches above, loose matches on Subject: below -- 2025-07-20 9:10 Zhihang Shao 2025-07-20 9:10 ` Zhihang Shao 2025-07-20 17:22 ` Eric Biggers 2025-07-20 17:22 ` Eric Biggers 2025-07-23 9:47 ` Andy Polyakov 2025-07-23 9:47 ` Andy Polyakov
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.