linux-crypto.vger.kernel.org archive mirror
* (no subject)
@ 2024-08-16 11:07 Xi Ruoyao
  2024-08-16 11:07 ` [PATCH v3 1/3] LoongArch: vDSO: Wire up getrandom() vDSO implementation Xi Ruoyao
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Xi Ruoyao @ 2024-08-16 11:07 UTC (permalink / raw)
  To: Jason A . Donenfeld, Huacai Chen, WANG Xuerui
  Cc: Xi Ruoyao, linux-crypto, loongarch, Jinyang He, Tiezhu Yang,
	Arnd Bergmann

Subject: [PATCH v3 0/3] LoongArch: Implement getrandom() in vDSO

For the rationale for implementing getrandom() in the vDSO, see [1].

The vDSO getrandom() needs a stack-less ChaCha20 implementation, so we
need to add architecture-specific code and wire it up with the generic
code.  Both a generic LoongArch implementation and a Loongson SIMD
eXtension (LSX) based implementation are added.  To dispatch between
them at runtime without invoking cpucfg on each call, the alternative
runtime patching mechanism is extended to cover the vDSO.
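
As a rough illustration of the dispatch cost the patching avoids, here
is a minimal C sketch (every name below is invented for the sketch, not
kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical sketch of per-call dispatch between the two ChaCha20
 * routines.  All identifiers here are made up for illustration.
 */
static bool cpu_has_lsx;	/* on real hardware this comes from cpucfg */

static const char *chacha_generic(void) { return "generic"; }
static const char *chacha_lsx(void)     { return "lsx"; }

/*
 * Naive dispatch: an extra load and branch on every vDSO getrandom()
 * call.  Alternative runtime patching instead rewrites an instruction
 * once at boot so the chosen routine is reached directly, and no check
 * is paid at call time.
 */
static const char *chacha_blocks(void)
{
	return cpu_has_lsx ? chacha_lsx() : chacha_generic();
}
```

With runtime patching the branch above disappears entirely; the feature
check happens once, when the kernel patches the vDSO image.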

The implementation is tested with the kernel selftests added by the last
patch in [1].  I had to make some adjustments to make it work on
LoongArch (see [2]; I've not submitted the changes as of now because I'm
unsure about the KHDR_INCLUDES addition).  The vdso_test_getrandom
bench-single result:

       vdso: 25000000 times in 0.647855257 seconds (generic)
       vdso: 25000000 times in 0.601068605 seconds (LSX)
       libc: 25000000 times in 6.948168864 seconds
    syscall: 25000000 times in 6.990265548 seconds

The vdso_test_getrandom bench-multi result:

       vdso: 25000000 x 256 times in 35.322187834 seconds (generic)
       vdso: 25000000 x 256 times in 29.183885426 seconds (LSX)
       libc: 25000000 x 256 times in 356.628428409 seconds
    syscall: 25000000 x 256 times in 334.764602866 seconds

[1]:https://lore.kernel.org/all/20240712014009.281406-1-Jason@zx2c4.com/
[2]:https://github.com/xry111/linux/commits/xry111/la-vdso-v3/

v2->v3:
- Add a generic LoongArch implementation for which LSX isn't needed.

v1->v2:
- Properly send the series to the list.

[v2]:https://lore.kernel.org/all/20240815133357.35829-1-xry111@xry111.site/

Xi Ruoyao (3):
  LoongArch: vDSO: Wire up getrandom() vDSO implementation
  LoongArch: Perform alternative runtime patching on vDSO
  LoongArch: vDSO: Add LSX implementation of vDSO getrandom()

 arch/loongarch/Kconfig                      |   1 +
 arch/loongarch/include/asm/vdso/getrandom.h |  47 ++++
 arch/loongarch/include/asm/vdso/vdso.h      |   8 +
 arch/loongarch/kernel/asm-offsets.c         |  10 +
 arch/loongarch/kernel/vdso.c                |  14 +-
 arch/loongarch/vdso/Makefile                |   6 +
 arch/loongarch/vdso/memset.S                |  24 ++
 arch/loongarch/vdso/vdso.lds.S              |   7 +
 arch/loongarch/vdso/vgetrandom-chacha-lsx.S | 162 +++++++++++++
 arch/loongarch/vdso/vgetrandom-chacha.S     | 252 ++++++++++++++++++++
 arch/loongarch/vdso/vgetrandom.c            |  19 ++
 11 files changed, 549 insertions(+), 1 deletion(-)
 create mode 100644 arch/loongarch/include/asm/vdso/getrandom.h
 create mode 100644 arch/loongarch/vdso/memset.S
 create mode 100644 arch/loongarch/vdso/vgetrandom-chacha-lsx.S
 create mode 100644 arch/loongarch/vdso/vgetrandom-chacha.S
 create mode 100644 arch/loongarch/vdso/vgetrandom.c

-- 
2.46.0


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v3 1/3] LoongArch: vDSO: Wire up getrandom() vDSO implementation
  2024-08-16 11:07 Xi Ruoyao
@ 2024-08-16 11:07 ` Xi Ruoyao
  2024-08-19 12:41   ` Huacai Chen
  2024-08-16 11:07 ` [PATCH v3 2/3] LoongArch: Perform alternative runtime patching on vDSO Xi Ruoyao
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Xi Ruoyao @ 2024-08-16 11:07 UTC (permalink / raw)
  To: Jason A . Donenfeld, Huacai Chen, WANG Xuerui
  Cc: Xi Ruoyao, linux-crypto, loongarch, Jinyang He, Tiezhu Yang,
	Arnd Bergmann

Hook up the generic vDSO implementation to the LoongArch vDSO data page:
embed struct vdso_rng_data into struct loongarch_vdso_data, and use an
assembler hack to resolve the symbol name "_vdso_rng_data" (which the
generic vDSO implementation expects) to the rng_data field in
loongarch_vdso_data.

The compiler (GCC 14.2) emits a memset() call to initialize a "large"
struct in a cold path of the generic vDSO getrandom() code.  There seems
to be no way to prevent the call, and since this is a cold path the
performance does not matter, so just provide a naive memset()
implementation for the vDSO.
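
The memset.S added below is just a byte-at-a-time loop; its behavior can
be sketched in C as follows (illustrative equivalent only, not the
assembly itself):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative C equivalent of the naive vDSO memset(): store one byte
 * per iteration and return the original destination pointer, which is
 * all the cold path needs (mirrors the __memset_generic copy in
 * memset.S).
 */
static void *vdso_memset(void *s, int c, size_t n)
{
	unsigned char *p = s;

	while (n--)
		*p++ = (unsigned char)c;
	return s;
}
```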

Signed-off-by: Xi Ruoyao <xry111@xry111.site>
---
 arch/loongarch/Kconfig                      |   1 +
 arch/loongarch/include/asm/vdso/getrandom.h |  47 ++++
 arch/loongarch/include/asm/vdso/vdso.h      |   8 +
 arch/loongarch/kernel/asm-offsets.c         |  10 +
 arch/loongarch/kernel/vdso.c                |   6 +
 arch/loongarch/vdso/Makefile                |   2 +
 arch/loongarch/vdso/memset.S                |  24 ++
 arch/loongarch/vdso/vdso.lds.S              |   1 +
 arch/loongarch/vdso/vgetrandom-chacha.S     | 239 ++++++++++++++++++++
 arch/loongarch/vdso/vgetrandom.c            |  19 ++
 10 files changed, 357 insertions(+)
 create mode 100644 arch/loongarch/include/asm/vdso/getrandom.h
 create mode 100644 arch/loongarch/vdso/memset.S
 create mode 100644 arch/loongarch/vdso/vgetrandom-chacha.S
 create mode 100644 arch/loongarch/vdso/vgetrandom.c

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 70f169210b52..14821c2aba5b 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -190,6 +190,7 @@ config LOONGARCH
 	select TRACE_IRQFLAGS_SUPPORT
 	select USE_PERCPU_NUMA_NODE_ID
 	select USER_STACKTRACE_SUPPORT
+	select VDSO_GETRANDOM
 	select ZONE_DMA32
 
 config 32BIT
diff --git a/arch/loongarch/include/asm/vdso/getrandom.h b/arch/loongarch/include/asm/vdso/getrandom.h
new file mode 100644
index 000000000000..a369588a4ebf
--- /dev/null
+++ b/arch/loongarch/include/asm/vdso/getrandom.h
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024 Xi Ruoyao <xry111@xry111.site>. All Rights Reserved.
+ */
+#ifndef __ASM_VDSO_GETRANDOM_H
+#define __ASM_VDSO_GETRANDOM_H
+
+#ifndef __ASSEMBLY__
+
+#include <asm/unistd.h>
+#include <asm/vdso/vdso.h>
+
+static __always_inline ssize_t getrandom_syscall(void *_buffer,
+						 size_t _len,
+						 unsigned int _flags)
+{
+	register long ret asm("a0");
+	register long int nr asm("a7") = __NR_getrandom;
+	register void *buffer asm("a0") = _buffer;
+	register size_t len asm("a1") = _len;
+	register unsigned int flags asm("a2") = _flags;
+
+	asm volatile(
+	"      syscall 0\n"
+	: "+r" (ret)
+	: "r" (nr), "r" (buffer), "r" (len), "r" (flags)
+	: "$t0", "$t1", "$t2", "$t3", "$t4", "$t5", "$t6", "$t7", "$t8",
+	  "memory");
+
+	return ret;
+}
+
+static __always_inline const struct vdso_rng_data *__arch_get_vdso_rng_data(
+	void)
+{
+	return (const struct vdso_rng_data *)(
+		get_vdso_data() +
+		VVAR_LOONGARCH_PAGES_START * PAGE_SIZE +
+		offsetof(struct loongarch_vdso_data, rng_data));
+}
+
+extern void __arch_chacha20_blocks_nostack(u8 *dst_bytes, const u32 *key,
+					   u32 *counter, size_t nblocks);
+
+#endif /* !__ASSEMBLY__ */
+
+#endif /* __ASM_VDSO_GETRANDOM_H */
diff --git a/arch/loongarch/include/asm/vdso/vdso.h b/arch/loongarch/include/asm/vdso/vdso.h
index 5a12309d9fb5..a2e24c3007e2 100644
--- a/arch/loongarch/include/asm/vdso/vdso.h
+++ b/arch/loongarch/include/asm/vdso/vdso.h
@@ -4,6 +4,9 @@
  * Copyright (C) 2020-2022 Loongson Technology Corporation Limited
  */
 
+#ifndef _ASM_VDSO_VDSO_H
+#define _ASM_VDSO_VDSO_H
+
 #ifndef __ASSEMBLY__
 
 #include <asm/asm.h>
@@ -16,6 +19,9 @@ struct vdso_pcpu_data {
 
 struct loongarch_vdso_data {
 	struct vdso_pcpu_data pdata[NR_CPUS];
+#ifdef CONFIG_VDSO_GETRANDOM
+	struct vdso_rng_data rng_data;
+#endif
 };
 
 /*
@@ -63,3 +69,5 @@ static inline unsigned long get_vdso_data(void)
 }
 
 #endif /* __ASSEMBLY__ */
+
+#endif
diff --git a/arch/loongarch/kernel/asm-offsets.c b/arch/loongarch/kernel/asm-offsets.c
index bee9f7a3108f..86f6d8a6dc23 100644
--- a/arch/loongarch/kernel/asm-offsets.c
+++ b/arch/loongarch/kernel/asm-offsets.c
@@ -14,6 +14,7 @@
 #include <asm/ptrace.h>
 #include <asm/processor.h>
 #include <asm/ftrace.h>
+#include <asm/vdso/vdso.h>
 
 static void __used output_ptreg_defines(void)
 {
@@ -321,3 +322,12 @@ static void __used output_kvm_defines(void)
 	OFFSET(KVM_GPGD, kvm, arch.pgd);
 	BLANK();
 }
+
+#ifdef CONFIG_VDSO_GETRANDOM
+static void __used output_vdso_rng_defines(void)
+{
+	COMMENT("LoongArch VDSO getrandom offsets.");
+	OFFSET(VDSO_RNG_DATA, loongarch_vdso_data, rng_data);
+	BLANK();
+}
+#endif
diff --git a/arch/loongarch/kernel/vdso.c b/arch/loongarch/kernel/vdso.c
index 90dfccb41c14..15b65d8e2fdc 100644
--- a/arch/loongarch/kernel/vdso.c
+++ b/arch/loongarch/kernel/vdso.c
@@ -22,6 +22,7 @@
 #include <vdso/helpers.h>
 #include <vdso/vsyscall.h>
 #include <vdso/datapage.h>
+#include <generated/asm-offsets.h>
 #include <generated/vdso-offsets.h>
 
 extern char vdso_start[], vdso_end[];
@@ -34,6 +35,11 @@ static union {
 	struct loongarch_vdso_data vdata;
 } loongarch_vdso_data __page_aligned_data;
 
+#ifdef CONFIG_VDSO_GETRANDOM
+asm(".globl _vdso_rng_data\n"
+    ".set _vdso_rng_data, loongarch_vdso_data + " __stringify(VDSO_RNG_DATA));
+#endif
+
 static struct page *vdso_pages[] = { NULL };
 struct vdso_data *vdso_data = generic_vdso_data.data;
 struct vdso_pcpu_data *vdso_pdata = loongarch_vdso_data.vdata.pdata;
diff --git a/arch/loongarch/vdso/Makefile b/arch/loongarch/vdso/Makefile
index 2ddf0480e710..c8c5d9a7c80c 100644
--- a/arch/loongarch/vdso/Makefile
+++ b/arch/loongarch/vdso/Makefile
@@ -6,6 +6,8 @@ include $(srctree)/lib/vdso/Makefile
 
 obj-vdso-y := elf.o vgetcpu.o vgettimeofday.o sigreturn.o
 
+obj-vdso-$(CONFIG_VDSO_GETRANDOM) += vgetrandom.o vgetrandom-chacha.o memset.o
+
 # Common compiler flags between ABIs.
 ccflags-vdso := \
 	$(filter -I%,$(KBUILD_CFLAGS)) \
diff --git a/arch/loongarch/vdso/memset.S b/arch/loongarch/vdso/memset.S
new file mode 100644
index 000000000000..ec1531683936
--- /dev/null
+++ b/arch/loongarch/vdso/memset.S
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A copy of __memset_generic from arch/loongarch/lib/memset.S for vDSO.
+ *
+ * Copyright (C) 2020-2024 Loongson Technology Corporation Limited
+ */
+
+#include <asm/regdef.h>
+#include <linux/linkage.h>
+
+SYM_FUNC_START(memset)
+	move	a3, a0
+	beqz	a2, 2f
+
+1:	st.b	a1, a0, 0
+	addi.d	a0, a0, 1
+	addi.d	a2, a2, -1
+	bgt	a2, zero, 1b
+
+2:	move	a0, a3
+	jr	ra
+SYM_FUNC_END(memset)
+
+.hidden memset
diff --git a/arch/loongarch/vdso/vdso.lds.S b/arch/loongarch/vdso/vdso.lds.S
index 56ad855896de..2c965a597d9e 100644
--- a/arch/loongarch/vdso/vdso.lds.S
+++ b/arch/loongarch/vdso/vdso.lds.S
@@ -63,6 +63,7 @@ VERSION
 		__vdso_clock_gettime;
 		__vdso_gettimeofday;
 		__vdso_rt_sigreturn;
+		__vdso_getrandom;
 	local: *;
 	};
 }
diff --git a/arch/loongarch/vdso/vgetrandom-chacha.S b/arch/loongarch/vdso/vgetrandom-chacha.S
new file mode 100644
index 000000000000..2e42198f2faf
--- /dev/null
+++ b/arch/loongarch/vdso/vgetrandom-chacha.S
@@ -0,0 +1,239 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2024 Xi Ruoyao <xry111@xry111.site>. All Rights Reserved.
+ */
+
+#include <asm/asm.h>
+#include <asm/regdef.h>
+#include <linux/linkage.h>
+
+.text
+
+/* ChaCha20 quarter-round */
+.macro	QR	a b c d
+	add.w		\a, \a, \b
+	xor		\d, \d, \a
+	rotri.w		\d, \d, 16
+
+	add.w		\c, \c, \d
+	xor		\b, \b, \c
+	rotri.w		\b, \b, 20
+
+	add.w		\a, \a, \b
+	xor		\d, \d, \a
+	rotri.w		\d, \d, 24
+
+	add.w		\c, \c, \d
+	xor		\b, \b, \c
+	rotri.w		\b, \b, 25
+.endm
+
+/*
+ * Very basic LoongArch implementation of ChaCha20. Produces a given positive
+ * number of blocks of output with a nonce of 0, taking an input key and
+ * 8-byte counter. Importantly does not spill to the stack. Its arguments
+ * are:
+ *
+ *	a0: output bytes
+ *	a1: 32-byte key input
+ *	a2: 8-byte counter input/output
+ *	a3: number of 64-byte blocks to write to output
+ */
+SYM_FUNC_START(__arch_chacha20_blocks_nostack)
+
+/* We don't need a frame pointer */
+#define s9		fp
+
+#define output		a0
+#define key		a1
+#define counter		a2
+#define nblocks		a3
+#define i		a4
+#define state0		s0
+#define state1		s1
+#define state2		s2
+#define state3		s3
+#define state4		s4
+#define state5		s5
+#define state6		s6
+#define state7		s7
+#define state8		s8
+#define state9		s9
+#define state10		a5
+#define state11		a6
+#define state12		a7
+#define state13		t0
+#define state14		t1
+#define state15		t2
+#define cnt_lo		t3
+#define cnt_hi		t4
+#define copy0		t5
+#define copy1		t6
+#define copy2		t7
+
+/* Reuse i as copy3 */
+#define copy3		i
+
+	/*
+	 * The ABI requires s0-s9 to be saved and sp to be 16-byte aligned.
+	 * This does not violate the stack-less requirement: no sensitive data
+	 * is spilled onto the stack.
+	 */
+	PTR_ADDI	sp, sp, (-SZREG * 10) & STACK_ALIGN
+	REG_S		s0, sp, 0
+	REG_S		s1, sp, SZREG
+	REG_S		s2, sp, SZREG * 2
+	REG_S		s3, sp, SZREG * 3
+	REG_S		s4, sp, SZREG * 4
+	REG_S		s5, sp, SZREG * 5
+	REG_S		s6, sp, SZREG * 6
+	REG_S		s7, sp, SZREG * 7
+	REG_S		s8, sp, SZREG * 8
+	REG_S		s9, sp, SZREG * 9
+
+	li.w		copy0, 0x61707865
+	li.w		copy1, 0x3320646e
+	li.w		copy2, 0x79622d32
+
+	ld.w		cnt_lo, counter, 0
+	ld.w		cnt_hi, counter, 4
+
+.Lblock:
+	/* state[0,1,2,3] = "expand 32-byte k" */
+	move		state0, copy0
+	move		state1, copy1
+	move		state2, copy2
+	li.w		state3, 0x6b206574
+
+	/* state[4,5,..,11] = key */
+	ld.w		state4, key, 0
+	ld.w		state5, key, 4
+	ld.w		state6, key, 8
+	ld.w		state7, key, 12
+	ld.w		state8, key, 16
+	ld.w		state9, key, 20
+	ld.w		state10, key, 24
+	ld.w		state11, key, 28
+
+	/* state[12,13] = counter */
+	move		state12, cnt_lo
+	move		state13, cnt_hi
+
+	/* state[14,15] = 0 */
+	move		state14, zero
+	move		state15, zero
+
+	li.w		i, 10
+.Lpermute:
+	/* odd round */
+	QR		state0, state4, state8, state12
+	QR		state1, state5, state9, state13
+	QR		state2, state6, state10, state14
+	QR		state3, state7, state11, state15
+
+	/* even round */
+	QR		state0, state5, state10, state15
+	QR		state1, state6, state11, state12
+	QR		state2, state7, state8, state13
+	QR		state3, state4, state9, state14
+
+	addi.w		i, i, -1
+	bnez		i, .Lpermute
+
+	/* copy[3] = "te k" */
+	li.w		copy3, 0x6b206574
+
+	/* output[0,1,2,3] = copy[0,1,2,3] + state[0,1,2,3] */
+	add.w		state0, state0, copy0
+	add.w		state1, state1, copy1
+	add.w		state2, state2, copy2
+	add.w		state3, state3, copy3
+	st.w		state0, output, 0
+	st.w		state1, output, 4
+	st.w		state2, output, 8
+	st.w		state3, output, 12
+
+	/* from now on state[0,1,2,3] are scratch registers  */
+
+	/* state[0,1,2,3] = lo32(key) */
+	ld.w		state0, key, 0
+	ld.w		state1, key, 4
+	ld.w		state2, key, 8
+	ld.w		state3, key, 12
+
+	/* output[4,5,6,7] = state[0,1,2,3] + state[4,5,6,7] */
+	add.w		state4, state4, state0
+	add.w		state5, state5, state1
+	add.w		state6, state6, state2
+	add.w		state7, state7, state3
+	st.w		state4, output, 16
+	st.w		state5, output, 20
+	st.w		state6, output, 24
+	st.w		state7, output, 28
+
+	/* state[0,1,2,3] = hi32(key) */
+	ld.w		state0, key, 16
+	ld.w		state1, key, 20
+	ld.w		state2, key, 24
+	ld.w		state3, key, 28
+
+	/* output[8,9,10,11] = state[0,1,2,3] + state[8,9,10,11] */
+	add.w		state8, state8, state0
+	add.w		state9, state9, state1
+	add.w		state10, state10, state2
+	add.w		state11, state11, state3
+	st.w		state8, output, 32
+	st.w		state9, output, 36
+	st.w		state10, output, 40
+	st.w		state11, output, 44
+
+	/* output[12,13,14,15] = state[12,13,14,15] + [cnt_lo, cnt_hi, 0, 0] */
+	add.w		state12, state12, cnt_lo
+	add.w		state13, state13, cnt_hi
+	st.w		state12, output, 48
+	st.w		state13, output, 52
+	st.w		state14, output, 56
+	st.w		state15, output, 60
+
+	/* ++counter  */
+	addi.w		cnt_lo, cnt_lo, 1
+	sltui		state0, cnt_lo, 1
+	add.w		cnt_hi, cnt_hi, state0
+
+	/* output += 64 */
+	PTR_ADDI	output, output, 64
+	/* --nblocks */
+	PTR_ADDI	nblocks, nblocks, -1
+	bnez		nblocks, .Lblock
+
+	/* counter = [cnt_lo, cnt_hi] */
+	st.w		cnt_lo, counter, 0
+	st.w		cnt_hi, counter, 4
+
+	/*
+	 * Zero out the potentially sensitive regs, in case nothing uses these
+	 * again. As of now copy[0,1,2,3] just contain "expand 32-byte k" and
+	 * state[0,...,9] are in s0-s9, which we'll restore in the epilogue, so
+	 * we only need to zero state[10,...,15].
+	 */
+	move		state10, zero
+	move		state11, zero
+	move		state12, zero
+	move		state13, zero
+	move		state14, zero
+	move		state15, zero
+
+	REG_L		s0, sp, 0
+	REG_L		s1, sp, SZREG
+	REG_L		s2, sp, SZREG * 2
+	REG_L		s3, sp, SZREG * 3
+	REG_L		s4, sp, SZREG * 4
+	REG_L		s5, sp, SZREG * 5
+	REG_L		s6, sp, SZREG * 6
+	REG_L		s7, sp, SZREG * 7
+	REG_L		s8, sp, SZREG * 8
+	REG_L		s9, sp, SZREG * 9
+	PTR_ADDI	sp, sp, -((-SZREG * 10) & STACK_ALIGN)
+
+	jr		ra
+SYM_FUNC_END(__arch_chacha20_blocks_nostack)
diff --git a/arch/loongarch/vdso/vgetrandom.c b/arch/loongarch/vdso/vgetrandom.c
new file mode 100644
index 000000000000..0b3b30ecd68a
--- /dev/null
+++ b/arch/loongarch/vdso/vgetrandom.c
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024 Xi Ruoyao <xry111@xry111.site>. All Rights Reserved.
+ */
+#include <linux/types.h>
+
+#include "../../../../lib/vdso/getrandom.c"
+
+typeof(__cvdso_getrandom) __vdso_getrandom;
+
+ssize_t __vdso_getrandom(void *buffer, size_t len, unsigned int flags,
+			 void *opaque_state, size_t opaque_len)
+{
+	return __cvdso_getrandom(buffer, len, flags, opaque_state,
+				 opaque_len);
+}
+
+typeof(__cvdso_getrandom) getrandom
+	__attribute__((weak, alias("__vdso_getrandom")));
-- 
2.46.0



* [PATCH v3 2/3] LoongArch: Perform alternative runtime patching on vDSO
  2024-08-16 11:07 Xi Ruoyao
  2024-08-16 11:07 ` [PATCH v3 1/3] LoongArch: vDSO: Wire up getrandom() vDSO implementation Xi Ruoyao
@ 2024-08-16 11:07 ` Xi Ruoyao
  2024-08-16 11:07 ` [PATCH v3 3/3] LoongArch: vDSO: Add LSX implementation of vDSO getrandom() Xi Ruoyao
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Xi Ruoyao @ 2024-08-16 11:07 UTC (permalink / raw)
  To: Jason A . Donenfeld, Huacai Chen, WANG Xuerui
  Cc: Xi Ruoyao, linux-crypto, loongarch, Jinyang He, Tiezhu Yang,
	Arnd Bergmann

To implement getrandom() in the vDSO, we need a stack-less ChaCha20
implementation.  ChaCha20 is designed to be SIMD-friendly, but LSX is
not guaranteed to be available on all LoongArch CPU models.  Perform
alternative runtime patching on the vDSO so we'll be able to use LSX
there.
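
For reference, both the scalar and the LSX paths compute the standard
ChaCha20 quarter round (RFC 8439); a C sketch follows.  The assembly
expresses the rotate-lefts below as rotate-rights by 16/20/24/25, which
are equivalent:

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t x, int n)
{
	return (x << n) | (x >> (32 - n));
}

/* One ChaCha20 quarter round (RFC 8439, section 2.1). */
static void chacha_qr(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
	*a += *b; *d = rotl32(*d ^ *a, 16);
	*c += *d; *b = rotl32(*b ^ *c, 12);
	*a += *b; *d = rotl32(*d ^ *a, 8);
	*c += *d; *b = rotl32(*b ^ *c, 7);
}
```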

Signed-off-by: Xi Ruoyao <xry111@xry111.site>
---
 arch/loongarch/kernel/vdso.c   | 8 +++++++-
 arch/loongarch/vdso/vdso.lds.S | 6 ++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/loongarch/kernel/vdso.c b/arch/loongarch/kernel/vdso.c
index 15b65d8e2fdc..d500436f252b 100644
--- a/arch/loongarch/kernel/vdso.c
+++ b/arch/loongarch/kernel/vdso.c
@@ -17,6 +17,7 @@
 #include <linux/time_namespace.h>
 #include <linux/timekeeper_internal.h>
 
+#include <asm/alternative.h>
 #include <asm/page.h>
 #include <asm/vdso.h>
 #include <vdso/helpers.h>
@@ -105,7 +106,7 @@ struct loongarch_vdso_info vdso_info = {
 
 static int __init init_vdso(void)
 {
-	unsigned long i, cpu, pfn;
+	unsigned long i, cpu, pfn, vdso;
 
 	BUG_ON(!PAGE_ALIGNED(vdso_info.vdso));
 	BUG_ON(!PAGE_ALIGNED(vdso_info.size));
@@ -117,6 +118,11 @@ static int __init init_vdso(void)
 	for (i = 0; i < vdso_info.size / PAGE_SIZE; i++)
 		vdso_info.code_mapping.pages[i] = pfn_to_page(pfn + i);
 
+	vdso = (unsigned long)vdso_info.vdso;
+
+	apply_alternatives((struct alt_instr *)(vdso + vdso_offset_alt),
+			   (struct alt_instr *)(vdso + vdso_offset_alt_end));
+
 	return 0;
 }
 subsys_initcall(init_vdso);
diff --git a/arch/loongarch/vdso/vdso.lds.S b/arch/loongarch/vdso/vdso.lds.S
index 2c965a597d9e..ac63dc080bc9 100644
--- a/arch/loongarch/vdso/vdso.lds.S
+++ b/arch/loongarch/vdso/vdso.lds.S
@@ -35,6 +35,12 @@ SECTIONS
 
 	.rodata		: { *(.rodata*) }		:text
 
+	.altinstructions : ALIGN(4) {
+		VDSO_alt = .;
+		*(.altinstructions)
+		VDSO_alt_end = .;
+	} :text
+
 	_end = .;
 	PROVIDE(end = .);
 
-- 
2.46.0



* [PATCH v3 3/3] LoongArch: vDSO: Add LSX implementation of vDSO getrandom()
  2024-08-16 11:07 Xi Ruoyao
  2024-08-16 11:07 ` [PATCH v3 1/3] LoongArch: vDSO: Wire up getrandom() vDSO implementation Xi Ruoyao
  2024-08-16 11:07 ` [PATCH v3 2/3] LoongArch: Perform alternative runtime patching on vDSO Xi Ruoyao
@ 2024-08-16 11:07 ` Xi Ruoyao
  2024-08-19 12:40 ` Huacai Chen
  2024-08-27  9:45 ` Re: Jason A. Donenfeld
  4 siblings, 0 replies; 16+ messages in thread
From: Xi Ruoyao @ 2024-08-16 11:07 UTC (permalink / raw)
  To: Jason A . Donenfeld, Huacai Chen, WANG Xuerui
  Cc: Xi Ruoyao, linux-crypto, loongarch, Jinyang He, Tiezhu Yang,
	Arnd Bergmann

The LSX implementation is 7% faster than the generic LoongArch
implementation in the vdso_test_getrandom bench-single test and 21%
faster in the bench-multi test.
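
The LSX code keeps the 4x4 ChaCha state in four vectors and uses
vshuf4i.w to rotate the lanes of rows 1-3 between the column and
diagonal rounds.  A C model of that lane rotation (illustrative only):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Rotate the four 32-bit lanes of a state row left by n positions:
 * new[i] = old[(i + n) % 4].  This models what vshuf4i.w does with the
 * immediates 0b00111001 (n = 1), 0b01001110 (n = 2) and 0b10010011
 * (n = 3) when diagonalizing the state, and the inverse immediates
 * when undoing it.
 */
static void rotate_lanes(uint32_t row[4], int n)
{
	uint32_t tmp[4];
	int i;

	for (i = 0; i < 4; i++)
		tmp[i] = row[(i + n) % 4];
	memcpy(row, tmp, sizeof(tmp));
}
```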

Signed-off-by: Xi Ruoyao <xry111@xry111.site>
---
 arch/loongarch/vdso/Makefile                |   4 +
 arch/loongarch/vdso/vgetrandom-chacha-lsx.S | 162 ++++++++++++++++++++
 arch/loongarch/vdso/vgetrandom-chacha.S     |  13 ++
 3 files changed, 179 insertions(+)
 create mode 100644 arch/loongarch/vdso/vgetrandom-chacha-lsx.S

diff --git a/arch/loongarch/vdso/Makefile b/arch/loongarch/vdso/Makefile
index c8c5d9a7c80c..cab92c3a70a4 100644
--- a/arch/loongarch/vdso/Makefile
+++ b/arch/loongarch/vdso/Makefile
@@ -8,6 +8,10 @@ obj-vdso-y := elf.o vgetcpu.o vgettimeofday.o sigreturn.o
 
 obj-vdso-$(CONFIG_VDSO_GETRANDOM) += vgetrandom.o vgetrandom-chacha.o memset.o
 
+ifdef CONFIG_CPU_HAS_LSX
+obj-vdso-$(CONFIG_VDSO_GETRANDOM) += vgetrandom-chacha-lsx.o
+endif
+
 # Common compiler flags between ABIs.
 ccflags-vdso := \
 	$(filter -I%,$(KBUILD_CFLAGS)) \
diff --git a/arch/loongarch/vdso/vgetrandom-chacha-lsx.S b/arch/loongarch/vdso/vgetrandom-chacha-lsx.S
new file mode 100644
index 000000000000..6d8c886d78c8
--- /dev/null
+++ b/arch/loongarch/vdso/vgetrandom-chacha-lsx.S
@@ -0,0 +1,162 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2024 Xi Ruoyao <xry111@xry111.site>. All Rights Reserved.
+ *
+ * Based on arch/x86/entry/vdso/vgetrandom-chacha.S:
+ *
+ * Copyright (C) 2022-2024 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights
+ * Reserved.
+ */
+
+#include <asm/asm.h>
+#include <asm/regdef.h>
+#include <linux/linkage.h>
+
+.section	.rodata
+.align 4
+CONSTANTS:	.octa 0x6b20657479622d323320646e61707865
+
+.text
+
+/*
+ * Loongson SIMD eXtension implementation of ChaCha20. Produces a given
+ * positive number of blocks of output with a nonce of 0, taking an input
+ * key and 8-byte counter. Importantly does not spill to the stack. Its
+ * arguments are:
+ *
+ *	a0: output bytes
+ *	a1: 32-byte key input
+ *	a2: 8-byte counter input/output
+ *	a3: number of 64-byte blocks to write to output
+ */
+SYM_FUNC_START(__arch_chacha20_blocks_nostack_lsx)
+#define output		a0
+#define key		a1
+#define counter		a2
+#define nblocks		a3
+#define i		t0
+/* LSX registers vr0-vr23 are caller-save. */
+#define state0		$vr0
+#define state1		$vr1
+#define state2		$vr2
+#define state3		$vr3
+#define copy0		$vr4
+#define copy1		$vr5
+#define copy2		$vr6
+#define copy3		$vr7
+#define one		$vr8
+
+	/* copy0 = "expand 32-byte k" */
+	la.pcrel	t1, CONSTANTS
+	vld		copy0, t1, 0
+	/* copy1, copy2 = key */
+	vld		copy1, key, 0
+	vld		copy2, key, 0x10
+	/* copy3 = counter || zero nonce */
+	vldrepl.d	copy3, counter, 0
+	vinsgr2vr.d	copy3, zero, 1
+	/* one = 1 || 0 */
+	vldi		one, 0b0110000000001
+	vinsgr2vr.d	one, zero, 1
+
+.Lblock:
+	/* state = copy */
+	vori.b		state0, copy0, 0
+	vori.b		state1, copy1, 0
+	vori.b		state2, copy2, 0
+	vori.b		state3, copy3, 0
+
+	li.w		i, 10
+.Lpermute:
+	/* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
+	vadd.w		state0, state0, state1
+	vxor.v		state3, state3, state0
+	vrotri.w	state3, state3, 16
+
+	/* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
+	vadd.w		state2, state2, state3
+	vxor.v		state1, state1, state2
+	vrotri.w	state1, state1, 20
+
+	/* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
+	vadd.w		state0, state0, state1
+	vxor.v		state3, state3, state0
+	vrotri.w	state3, state3, 24
+
+	/* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
+	vadd.w		state2, state2, state3
+	vxor.v		state1, state1, state2
+	vrotri.w	state1, state1, 25
+
+	/* state1[0,1,2,3] = state1[1,2,3,0] */
+	vshuf4i.w	state1, state1, 0b00111001
+	/* state2[0,1,2,3] = state2[2,3,0,1] */
+	vshuf4i.w	state2, state2, 0b01001110
+	/* state3[0,1,2,3] = state3[3,0,1,2] */
+	vshuf4i.w	state3, state3, 0b10010011
+
+	/* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
+	vadd.w		state0, state0, state1
+	vxor.v		state3, state3, state0
+	vrotri.w	state3, state3, 16
+
+	/* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
+	vadd.w		state2, state2, state3
+	vxor.v		state1, state1, state2
+	vrotri.w	state1, state1, 20
+
+	/* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
+	vadd.w		state0, state0, state1
+	vxor.v		state3, state3, state0
+	vrotri.w	state3, state3, 24
+
+	/* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
+	vadd.w		state2, state2, state3
+	vxor.v		state1, state1, state2
+	vrotri.w	state1, state1, 25
+
+	/* state1[0,1,2,3] = state1[3,0,1,2] */
+	vshuf4i.w	state1, state1, 0b10010011
+	/* state2[0,1,2,3] = state2[2,3,0,1] */
+	vshuf4i.w	state2, state2, 0b01001110
+	/* state3[0,1,2,3] = state3[1,2,3,0] */
+	vshuf4i.w	state3, state3, 0b00111001
+
+	addi.w		i, i, -1
+	bnez		i, .Lpermute
+
+	/* output0 = state0 + copy0 */
+	vadd.w		state0, state0, copy0
+	vst		state0, output, 0
+	/* output1 = state1 + copy1 */
+	vadd.w		state1, state1, copy1
+	vst		state1, output, 0x10
+	/* output2 = state2 + copy2 */
+	vadd.w		state2, state2, copy2
+	vst		state2, output, 0x20
+	/* output3 = state3 + copy3 */
+	vadd.w		state3, state3, copy3
+	vst		state3, output, 0x30
+
+	/* ++copy3.counter */
+	vadd.d		copy3, copy3, one
+
+	/* output += 64 */
+	PTR_ADDI	output, output, 64
+	/* --nblocks */
+	PTR_ADDI	nblocks, nblocks, -1
+	bnez		nblocks, .Lblock
+
+	/* counter = copy3.counter */
+	vstelm.d	copy3, counter, 0, 0
+
+	/* Zero out the potentially sensitive regs, in case nothing uses these again. */
+	vldi		state0, 0
+	vldi		state1, 0
+	vldi		state2, 0
+	vldi		state3, 0
+	vldi		copy1, 0
+	vldi		copy2, 0
+
+	jr		ra
+SYM_FUNC_END(__arch_chacha20_blocks_nostack_lsx)
diff --git a/arch/loongarch/vdso/vgetrandom-chacha.S b/arch/loongarch/vdso/vgetrandom-chacha.S
index 2e42198f2faf..1931119e12a6 100644
--- a/arch/loongarch/vdso/vgetrandom-chacha.S
+++ b/arch/loongarch/vdso/vgetrandom-chacha.S
@@ -7,6 +7,11 @@
 #include <asm/regdef.h>
 #include <linux/linkage.h>
 
+#ifdef CONFIG_CPU_HAS_LSX
+# include <asm/alternative-asm.h>
+# include <asm/cpu.h>
+#endif
+
 .text
 
 /* ChaCha20 quarter-round */
@@ -78,8 +83,16 @@ SYM_FUNC_START(__arch_chacha20_blocks_nostack)
 	 * The ABI requires s0-s9 to be saved and sp to be 16-byte aligned.
 	 * This does not violate the stack-less requirement: no sensitive data
 	 * is spilled onto the stack.
+	 *
+	 * Rewrite the very first instruction to jump to the LSX implementation
+	 * if LSX is available.
 	 */
+#ifdef CONFIG_CPU_HAS_LSX
+	ALTERNATIVE __stringify(PTR_ADDI sp, sp, (-SZREG * 10) & STACK_ALIGN), \
+		    "b __arch_chacha20_blocks_nostack_lsx", CPU_FEATURE_LSX
+#else
 	PTR_ADDI	sp, sp, (-SZREG * 10) & STACK_ALIGN
+#endif
 	REG_S		s0, sp, 0
 	REG_S		s1, sp, SZREG
 	REG_S		s2, sp, SZREG * 2
-- 
2.46.0



* Re:
  2024-08-16 11:07 Xi Ruoyao
                   ` (2 preceding siblings ...)
  2024-08-16 11:07 ` [PATCH v3 3/3] LoongArch: vDSO: Add LSX implementation of vDSO getrandom() Xi Ruoyao
@ 2024-08-19 12:40 ` Huacai Chen
  2024-08-19 13:01   ` Re: Jason A. Donenfeld
  2024-08-19 15:22   ` Re: Xi Ruoyao
  2024-08-27  9:45 ` Re: Jason A. Donenfeld
  4 siblings, 2 replies; 16+ messages in thread
From: Huacai Chen @ 2024-08-19 12:40 UTC (permalink / raw)
  To: Xi Ruoyao
  Cc: Jason A . Donenfeld, WANG Xuerui, linux-crypto, loongarch,
	Jinyang He, Tiezhu Yang, Arnd Bergmann

Hi, Ruoyao,

Why no subject?

On Fri, Aug 16, 2024 at 7:07 PM Xi Ruoyao <xry111@xry111.site> wrote:
>
> Subject: [PATCH v3 0/2] LoongArch: Implement getrandom() in vDSO
>
> For the rationale to implement getrandom() in vDSO see [1].
>
> The vDSO getrandom() needs a stack-less ChaCha20 implementation, so we
> need to add architecture-specific code and wire it up with the generic
> code.  Both generic LoongArch implementation and Loongson SIMD eXtension
> based implementation are added.  To dispatch them at runtime without
> invoking cpucfg on each call, the alternative runtime patching mechanism
> is extended to cover the vDSO.
>
> The implementation is tested with the kernel selftests added by the last
> patch in [1].  I had to make some adjustments to make it work on
> LoongArch (see [2], I've not submitted the changes as at now because I'm
> unsure about the KHDR_INCLUDES addition).  The vdso_test_getrandom
> bench-single result:
>
>        vdso: 25000000 times in 0.647855257 seconds (generic)
>        vdso: 25000000 times in 0.601068605 seconds (LSX)
>        libc: 25000000 times in 6.948168864 seconds
>     syscall: 25000000 times in 6.990265548 seconds
>
> The vdso_test_getrandom bench-multi result:
>
>        vdso: 25000000 x 256 times in 35.322187834 seconds (generic)
>        vdso: 25000000 x 256 times in 29.183885426 seconds (LSX)
>        libc: 25000000 x 256 times in 356.628428409 seconds
>        syscall: 25000000 x 256 times in 334.764602866 seconds
I don't see a significant improvement from LSX here, so I prefer to
just use the generic version to avoid complexity (I remember Linus
said the whole of __vdso_getrandom is not very useful).


Huacai

>
> [1]:https://lore.kernel.org/all/20240712014009.281406-1-Jason@zx2c4.com/
> [2]:https://github.com/xry111/linux/commits/xry111/la-vdso-v3/
>
> [v2]->v3:
> - Add a generic LoongArch implementation for which LSX isn't needed.
>
> v1->v2:
> - Properly send the series to the list.
>
> [v2]:https://lore.kernel.org/all/20240815133357.35829-1-xry111@xry111.site/
>
> Xi Ruoyao (3):
>   LoongArch: vDSO: Wire up getrandom() vDSO implementation
>   LoongArch: Perform alternative runtime patching on vDSO
>   LoongArch: vDSO: Add LSX implementation of vDSO getrandom()
>
>  arch/loongarch/Kconfig                      |   1 +
>  arch/loongarch/include/asm/vdso/getrandom.h |  47 ++++
>  arch/loongarch/include/asm/vdso/vdso.h      |   8 +
>  arch/loongarch/kernel/asm-offsets.c         |  10 +
>  arch/loongarch/kernel/vdso.c                |  14 +-
>  arch/loongarch/vdso/Makefile                |   6 +
>  arch/loongarch/vdso/memset.S                |  24 ++
>  arch/loongarch/vdso/vdso.lds.S              |   7 +
>  arch/loongarch/vdso/vgetrandom-chacha-lsx.S | 162 +++++++++++++
>  arch/loongarch/vdso/vgetrandom-chacha.S     | 252 ++++++++++++++++++++
>  arch/loongarch/vdso/vgetrandom.c            |  19 ++
>  11 files changed, 549 insertions(+), 1 deletion(-)
>  create mode 100644 arch/loongarch/include/asm/vdso/getrandom.h
>  create mode 100644 arch/loongarch/vdso/memset.S
>  create mode 100644 arch/loongarch/vdso/vgetrandom-chacha-lsx.S
>  create mode 100644 arch/loongarch/vdso/vgetrandom-chacha.S
>  create mode 100644 arch/loongarch/vdso/vgetrandom.c
>
> --
> 2.46.0
>


* Re: [PATCH v3 1/3] LoongArch: vDSO: Wire up getrandom() vDSO implementation
  2024-08-16 11:07 ` [PATCH v3 1/3] LoongArch: vDSO: Wire up getrandom() vDSO implementation Xi Ruoyao
@ 2024-08-19 12:41   ` Huacai Chen
  2024-08-19 13:03     ` Jason A. Donenfeld
  0 siblings, 1 reply; 16+ messages in thread
From: Huacai Chen @ 2024-08-19 12:41 UTC (permalink / raw)
  To: Xi Ruoyao
  Cc: Jason A . Donenfeld, WANG Xuerui, linux-crypto, loongarch,
	Jinyang He, Tiezhu Yang, Arnd Bergmann

Hi, Ruoyao,

On Fri, Aug 16, 2024 at 7:07 PM Xi Ruoyao <xry111@xry111.site> wrote:
>
> Hook up the generic vDSO implementation to the LoongArch vDSO data page:
> embed struct vdso_rng_data into struct loongarch_vdso_data, and use
> assembler hack to resolve the symbol name "_vdso_rng_data" (which is
> expected by the generic vDSO implementation) to the rng_data field in
> loongarch_vdso_data.
>
> The compiler (GCC 14.2) calls memset() for initializing a "large" struct
> in a cold path of the generic vDSO getrandom() code.  There seems no way
> to prevent it from calling memset(), and it's a cold path so the
> performance does not matter, so just provide a naive memset()
> implementation for vDSO.
Why x86 doesn't need to provide a naive memset()?

>
> Signed-off-by: Xi Ruoyao <xry111@xry111.site>
> ---
>  arch/loongarch/Kconfig                      |   1 +
>  arch/loongarch/include/asm/vdso/getrandom.h |  47 ++++
>  arch/loongarch/include/asm/vdso/vdso.h      |   8 +
>  arch/loongarch/kernel/asm-offsets.c         |  10 +
>  arch/loongarch/kernel/vdso.c                |   6 +
>  arch/loongarch/vdso/Makefile                |   2 +
>  arch/loongarch/vdso/memset.S                |  24 ++
>  arch/loongarch/vdso/vdso.lds.S              |   1 +
>  arch/loongarch/vdso/vgetrandom-chacha.S     | 239 ++++++++++++++++++++
>  arch/loongarch/vdso/vgetrandom.c            |  19 ++
>  10 files changed, 357 insertions(+)
>  create mode 100644 arch/loongarch/include/asm/vdso/getrandom.h
>  create mode 100644 arch/loongarch/vdso/memset.S
>  create mode 100644 arch/loongarch/vdso/vgetrandom-chacha.S
>  create mode 100644 arch/loongarch/vdso/vgetrandom.c
>
> diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
> index 70f169210b52..14821c2aba5b 100644
> --- a/arch/loongarch/Kconfig
> +++ b/arch/loongarch/Kconfig
> @@ -190,6 +190,7 @@ config LOONGARCH
>         select TRACE_IRQFLAGS_SUPPORT
>         select USE_PERCPU_NUMA_NODE_ID
>         select USER_STACKTRACE_SUPPORT
> +       select VDSO_GETRANDOM
>         select ZONE_DMA32
>
>  config 32BIT
> diff --git a/arch/loongarch/include/asm/vdso/getrandom.h b/arch/loongarch/include/asm/vdso/getrandom.h
> new file mode 100644
> index 000000000000..a369588a4ebf
> --- /dev/null
> +++ b/arch/loongarch/include/asm/vdso/getrandom.h
> @@ -0,0 +1,47 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2024 Xi Ruoyao <xry111@xry111.site>. All Rights Reserved.
> + */
> +#ifndef __ASM_VDSO_GETRANDOM_H
> +#define __ASM_VDSO_GETRANDOM_H
> +
> +#ifndef __ASSEMBLY__
> +
> +#include <asm/unistd.h>
> +#include <asm/vdso/vdso.h>
> +
> +static __always_inline ssize_t getrandom_syscall(void *_buffer,
> +                                                size_t _len,
> +                                                unsigned int _flags)
> +{
> +       register long ret asm("a0");
> +       register long int nr asm("a7") = __NR_getrandom;
> +       register void *buffer asm("a0") = _buffer;
> +       register size_t len asm("a1") = _len;
> +       register unsigned int flags asm("a2") = _flags;
> +
> +       asm volatile(
> +       "      syscall 0\n"
> +       : "+r" (ret)
> +       : "r" (nr), "r" (buffer), "r" (len), "r" (flags)
> +       : "$t0", "$t1", "$t2", "$t3", "$t4", "$t5", "$t6", "$t7", "$t8",
> +         "memory");
> +
> +       return ret;
> +}
> +
> +static __always_inline const struct vdso_rng_data *__arch_get_vdso_rng_data(
> +       void)
Don't need a line break.

> +{
> +       return (const struct vdso_rng_data *)(
> +               get_vdso_data() +
> +               VVAR_LOONGARCH_PAGES_START * PAGE_SIZE +
> +               offsetof(struct loongarch_vdso_data, rng_data));
> +}
> +
> +extern void __arch_chacha20_blocks_nostack(u8 *dst_bytes, const u32 *key,
> +                                          u32 *counter, size_t nblocks);
> +
> +#endif /* !__ASSEMBLY__ */
> +
> +#endif /* __ASM_VDSO_GETRANDOM_H */
> diff --git a/arch/loongarch/include/asm/vdso/vdso.h b/arch/loongarch/include/asm/vdso/vdso.h
> index 5a12309d9fb5..a2e24c3007e2 100644
> --- a/arch/loongarch/include/asm/vdso/vdso.h
> +++ b/arch/loongarch/include/asm/vdso/vdso.h
> @@ -4,6 +4,9 @@
>   * Copyright (C) 2020-2022 Loongson Technology Corporation Limited
>   */
>
> +#ifndef _ASM_VDSO_VDSO_H
> +#define _ASM_VDSO_VDSO_H
> +
>  #ifndef __ASSEMBLY__
>
>  #include <asm/asm.h>
> @@ -16,6 +19,9 @@ struct vdso_pcpu_data {
>
>  struct loongarch_vdso_data {
>         struct vdso_pcpu_data pdata[NR_CPUS];
> +#ifdef CONFIG_VDSO_GETRANDOM
You select VDSO_GETRANDOM unconditionally, so #ifdef is useless.

> +       struct vdso_rng_data rng_data;
> +#endif
>  };
>
>  /*
> @@ -63,3 +69,5 @@ static inline unsigned long get_vdso_data(void)
>  }
>
>  #endif /* __ASSEMBLY__ */
> +
> +#endif
> diff --git a/arch/loongarch/kernel/asm-offsets.c b/arch/loongarch/kernel/asm-offsets.c
> index bee9f7a3108f..86f6d8a6dc23 100644
> --- a/arch/loongarch/kernel/asm-offsets.c
> +++ b/arch/loongarch/kernel/asm-offsets.c
> @@ -14,6 +14,7 @@
>  #include <asm/ptrace.h>
>  #include <asm/processor.h>
>  #include <asm/ftrace.h>
> +#include <asm/vdso/vdso.h>
>
>  static void __used output_ptreg_defines(void)
>  {
> @@ -321,3 +322,12 @@ static void __used output_kvm_defines(void)
>         OFFSET(KVM_GPGD, kvm, arch.pgd);
>         BLANK();
>  }
> +
> +#ifdef CONFIG_VDSO_GETRANDOM
The same.

> +static void __used output_vdso_rng_defines(void)
> +{
> +       COMMENT("LoongArch VDSO getrandom offsets.");
> +       OFFSET(VDSO_RNG_DATA, loongarch_vdso_data, rng_data);
> +       BLANK();
> +}
> +#endif
> diff --git a/arch/loongarch/kernel/vdso.c b/arch/loongarch/kernel/vdso.c
> index 90dfccb41c14..15b65d8e2fdc 100644
> --- a/arch/loongarch/kernel/vdso.c
> +++ b/arch/loongarch/kernel/vdso.c
> @@ -22,6 +22,7 @@
>  #include <vdso/helpers.h>
>  #include <vdso/vsyscall.h>
>  #include <vdso/datapage.h>
> +#include <generated/asm-offsets.h>
>  #include <generated/vdso-offsets.h>
>
>  extern char vdso_start[], vdso_end[];
> @@ -34,6 +35,11 @@ static union {
>         struct loongarch_vdso_data vdata;
>  } loongarch_vdso_data __page_aligned_data;
>
> +#ifdef CONFIG_VDSO_GETRANDOM
The same.

> +asm(".globl _vdso_rng_data\n"
> +    ".set _vdso_rng_data, loongarch_vdso_data + " __stringify(VDSO_RNG_DATA));
> +#endif
> +
>  static struct page *vdso_pages[] = { NULL };
>  struct vdso_data *vdso_data = generic_vdso_data.data;
>  struct vdso_pcpu_data *vdso_pdata = loongarch_vdso_data.vdata.pdata;
> diff --git a/arch/loongarch/vdso/Makefile b/arch/loongarch/vdso/Makefile
> index 2ddf0480e710..c8c5d9a7c80c 100644
> --- a/arch/loongarch/vdso/Makefile
> +++ b/arch/loongarch/vdso/Makefile
> @@ -6,6 +6,8 @@ include $(srctree)/lib/vdso/Makefile
>
>  obj-vdso-y := elf.o vgetcpu.o vgettimeofday.o sigreturn.o
>
> +obj-vdso-$(CONFIG_VDSO_GETRANDOM) += vgetrandom.o vgetrandom-chacha.o memset.o
> +
>  # Common compiler flags between ABIs.
>  ccflags-vdso := \
>         $(filter -I%,$(KBUILD_CFLAGS)) \
> diff --git a/arch/loongarch/vdso/memset.S b/arch/loongarch/vdso/memset.S
> new file mode 100644
> index 000000000000..ec1531683936
> --- /dev/null
> +++ b/arch/loongarch/vdso/memset.S
> @@ -0,0 +1,24 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * A copy of __memset_generic from arch/loongarch/lib/memset.S for vDSO.
> + *
> + * Copyright (C) 2020-2024 Loongson Technology Corporation Limited
> + */
> +
> +#include <asm/regdef.h>
> +#include <linux/linkage.h>
> +
> +SYM_FUNC_START(memset)
> +       move    a3, a0
> +       beqz    a2, 2f
> +
> +1:     st.b    a1, a0, 0
> +       addi.d  a0, a0, 1
> +       addi.d  a2, a2, -1
> +       bgt     a2, zero, 1b
> +
> +2:     move    a0, a3
> +       jr      ra
> +SYM_FUNC_END(memset)
> +
> +.hidden memset
> diff --git a/arch/loongarch/vdso/vdso.lds.S b/arch/loongarch/vdso/vdso.lds.S
> index 56ad855896de..2c965a597d9e 100644
> --- a/arch/loongarch/vdso/vdso.lds.S
> +++ b/arch/loongarch/vdso/vdso.lds.S
> @@ -63,6 +63,7 @@ VERSION
>                 __vdso_clock_gettime;
>                 __vdso_gettimeofday;
>                 __vdso_rt_sigreturn;
> +               __vdso_getrandom;
In my opinion, __vdso_rt_sigreturn is different from the others, so I
prefer to keep it last.


Huacai

>         local: *;
>         };
>  }
> diff --git a/arch/loongarch/vdso/vgetrandom-chacha.S b/arch/loongarch/vdso/vgetrandom-chacha.S
> new file mode 100644
> index 000000000000..2e42198f2faf
> --- /dev/null
> +++ b/arch/loongarch/vdso/vgetrandom-chacha.S
> @@ -0,0 +1,239 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2024 Xi Ruoyao <xry111@xry111.site>. All Rights Reserved.
> + */
> +
> +#include <asm/asm.h>
> +#include <asm/regdef.h>
> +#include <linux/linkage.h>
> +
> +.text
> +
> +/* ChaCha20 quarter-round */
> +.macro QR      a b c d
> +       add.w           \a, \a, \b
> +       xor             \d, \d, \a
> +       rotri.w         \d, \d, 16
> +
> +       add.w           \c, \c, \d
> +       xor             \b, \b, \c
> +       rotri.w         \b, \b, 20
> +
> +       add.w           \a, \a, \b
> +       xor             \d, \d, \a
> +       rotri.w         \d, \d, 24
> +
> +       add.w           \c, \c, \d
> +       xor             \b, \b, \c
> +       rotri.w         \b, \b, 25
> +.endm
> +
> +/*
> + * Very basic LoongArch implementation of ChaCha20. Produces a given positive
> + * number of blocks of output with a nonce of 0, taking an input key and
> + * 8-byte counter. Importantly does not spill to the stack. Its arguments
> + * are:
> + *
> + *     a0: output bytes
> + *     a1: 32-byte key input
> + *     a2: 8-byte counter input/output
> + *     a3: number of 64-byte blocks to write to output
> + */
> +SYM_FUNC_START(__arch_chacha20_blocks_nostack)
> +
> +/* We don't need a frame pointer */
> +#define s9             fp
> +
> +#define output         a0
> +#define key            a1
> +#define counter                a2
> +#define nblocks                a3
> +#define i              a4
> +#define state0         s0
> +#define state1         s1
> +#define state2         s2
> +#define state3         s3
> +#define state4         s4
> +#define state5         s5
> +#define state6         s6
> +#define state7         s7
> +#define state8         s8
> +#define state9         s9
> +#define state10                a5
> +#define state11                a6
> +#define state12                a7
> +#define state13                t0
> +#define state14                t1
> +#define state15                t2
> +#define cnt_lo         t3
> +#define cnt_hi         t4
> +#define copy0          t5
> +#define copy1          t6
> +#define copy2          t7
> +
> +/* Reuse i as copy3 */
> +#define copy3          i
> +
> +       /*
> +        * The ABI requires s0-s9 saved, and sp aligned to 16-byte.
> +        * This does not violate the stack-less requirement: no sensitive data
> +        * is spilled onto the stack.
> +        */
> +       PTR_ADDI        sp, sp, (-SZREG * 10) & STACK_ALIGN
> +       REG_S           s0, sp, 0
> +       REG_S           s1, sp, SZREG
> +       REG_S           s2, sp, SZREG * 2
> +       REG_S           s3, sp, SZREG * 3
> +       REG_S           s4, sp, SZREG * 4
> +       REG_S           s5, sp, SZREG * 5
> +       REG_S           s6, sp, SZREG * 6
> +       REG_S           s7, sp, SZREG * 7
> +       REG_S           s8, sp, SZREG * 8
> +       REG_S           s9, sp, SZREG * 9
> +
> +       li.w            copy0, 0x61707865
> +       li.w            copy1, 0x3320646e
> +       li.w            copy2, 0x79622d32
> +
> +       ld.w            cnt_lo, counter, 0
> +       ld.w            cnt_hi, counter, 4
> +
> +.Lblock:
> +       /* state[0,1,2,3] = "expand 32-byte k" */
> +       move            state0, copy0
> +       move            state1, copy1
> +       move            state2, copy2
> +       li.w            state3, 0x6b206574
> +
> +       /* state[4,5,..,11] = key */
> +       ld.w            state4, key, 0
> +       ld.w            state5, key, 4
> +       ld.w            state6, key, 8
> +       ld.w            state7, key, 12
> +       ld.w            state8, key, 16
> +       ld.w            state9, key, 20
> +       ld.w            state10, key, 24
> +       ld.w            state11, key, 28
> +
> +       /* state[12,13] = counter */
> +       move            state12, cnt_lo
> +       move            state13, cnt_hi
> +
> +       /* state[14,15] = 0 */
> +       move            state14, zero
> +       move            state15, zero
> +
> +       li.w            i, 10
> +.Lpermute:
> +       /* odd round */
> +       QR              state0, state4, state8, state12
> +       QR              state1, state5, state9, state13
> +       QR              state2, state6, state10, state14
> +       QR              state3, state7, state11, state15
> +
> +       /* even round */
> +       QR              state0, state5, state10, state15
> +       QR              state1, state6, state11, state12
> +       QR              state2, state7, state8, state13
> +       QR              state3, state4, state9, state14
> +
> +       addi.w          i, i, -1
> +       bnez            i, .Lpermute
> +
> +       /* copy[3] = "te k" */
> +       li.w            copy3, 0x6b206574
> +
> +       /* output[0,1,2,3] = copy[0,1,2,3] + state[0,1,2,3] */
> +       add.w           state0, state0, copy0
> +       add.w           state1, state1, copy1
> +       add.w           state2, state2, copy2
> +       add.w           state3, state3, copy3
> +       st.w            state0, output, 0
> +       st.w            state1, output, 4
> +       st.w            state2, output, 8
> +       st.w            state3, output, 12
> +
> +       /* from now on state[0,1,2,3] are scratch registers  */
> +
> +       /* state[0,1,2,3] = lo32(key) */
> +       ld.w            state0, key, 0
> +       ld.w            state1, key, 4
> +       ld.w            state2, key, 8
> +       ld.w            state3, key, 12
> +
> +       /* output[4,5,6,7] = state[0,1,2,3] + state[4,5,6,7] */
> +       add.w           state4, state4, state0
> +       add.w           state5, state5, state1
> +       add.w           state6, state6, state2
> +       add.w           state7, state7, state3
> +       st.w            state4, output, 16
> +       st.w            state5, output, 20
> +       st.w            state6, output, 24
> +       st.w            state7, output, 28
> +
> +       /* state[0,1,2,3] = hi32(key) */
> +       ld.w            state0, key, 16
> +       ld.w            state1, key, 20
> +       ld.w            state2, key, 24
> +       ld.w            state3, key, 28
> +
> +       /* output[8,9,10,11] = state[0,1,2,3] + state[8,9,10,11] */
> +       add.w           state8, state8, state0
> +       add.w           state9, state9, state1
> +       add.w           state10, state10, state2
> +       add.w           state11, state11, state3
> +       st.w            state8, output, 32
> +       st.w            state9, output, 36
> +       st.w            state10, output, 40
> +       st.w            state11, output, 44
> +
> +       /* output[12,13,14,15] = state[12,13,14,15] + [cnt_lo, cnt_hi, 0, 0] */
> +       add.w           state12, state12, cnt_lo
> +       add.w           state13, state13, cnt_hi
> +       st.w            state12, output, 48
> +       st.w            state13, output, 52
> +       st.w            state14, output, 56
> +       st.w            state15, output, 60
> +
> +       /* ++counter  */
> +       addi.w          cnt_lo, cnt_lo, 1
> +       sltui           state0, cnt_lo, 1
> +       add.w           cnt_hi, cnt_hi, state0
> +
> +       /* output += 64 */
> +       PTR_ADDI        output, output, 64
> +       /* --nblocks */
> +       PTR_ADDI        nblocks, nblocks, -1
> +       bnez            nblocks, .Lblock
> +
> +       /* counter = [cnt_lo, cnt_hi] */
> +       st.w            cnt_lo, counter, 0
> +       st.w            cnt_hi, counter, 4
> +
> +       /*
> +        * Zero out the potentially sensitive regs, in case anything uses
> +        * them later. At this point copy[0,1,2,3] just contain "expand
> +        * 32-byte k", and state[0,...,9] live in s0-s9 which the epilogue
> +        * restores, so we only need to zero state[10,...,15].
> +        */
> +       move            state10, zero
> +       move            state11, zero
> +       move            state12, zero
> +       move            state13, zero
> +       move            state14, zero
> +       move            state15, zero
> +
> +       REG_L           s0, sp, 0
> +       REG_L           s1, sp, SZREG
> +       REG_L           s2, sp, SZREG * 2
> +       REG_L           s3, sp, SZREG * 3
> +       REG_L           s4, sp, SZREG * 4
> +       REG_L           s5, sp, SZREG * 5
> +       REG_L           s6, sp, SZREG * 6
> +       REG_L           s7, sp, SZREG * 7
> +       REG_L           s8, sp, SZREG * 8
> +       REG_L           s9, sp, SZREG * 9
> +       PTR_ADDI        sp, sp, -((-SZREG * 10) & STACK_ALIGN)
> +
> +       jr              ra
> +SYM_FUNC_END(__arch_chacha20_blocks_nostack)
> diff --git a/arch/loongarch/vdso/vgetrandom.c b/arch/loongarch/vdso/vgetrandom.c
> new file mode 100644
> index 000000000000..0b3b30ecd68a
> --- /dev/null
> +++ b/arch/loongarch/vdso/vgetrandom.c
> @@ -0,0 +1,19 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2024 Xi Ruoyao <xry111@xry111.site>. All Rights Reserved.
> + */
> +#include <linux/types.h>
> +
> +#include "../../../../lib/vdso/getrandom.c"
> +
> +typeof(__cvdso_getrandom) __vdso_getrandom;
> +
> +ssize_t __vdso_getrandom(void *buffer, size_t len, unsigned int flags,
> +                        void *opaque_state, size_t opaque_len)
> +{
> +       return __cvdso_getrandom(buffer, len, flags, opaque_state,
> +                                opaque_len);
> +}
> +
> +typeof(__cvdso_getrandom) getrandom
> +       __attribute__((weak, alias("__vdso_getrandom")));
> --
> 2.46.0
>


* Re:
  2024-08-19 12:40 ` Huacai Chen
@ 2024-08-19 13:01   ` Jason A. Donenfeld
  2024-08-19 15:22     ` Re: Xi Ruoyao
  2024-08-19 15:22   ` Re: Xi Ruoyao
  1 sibling, 1 reply; 16+ messages in thread
From: Jason A. Donenfeld @ 2024-08-19 13:01 UTC (permalink / raw)
  To: Huacai Chen
  Cc: Xi Ruoyao, WANG Xuerui, linux-crypto, loongarch, Jinyang He,
	Tiezhu Yang, Arnd Bergmann

> I don't see significant improvements about LSX here, so I prefer to
> just use the generic version to avoid complexity (I remember Linus
> said the whole of __vdso_getrandom is not very useful).

I'm inclined to feel the same way, at least for now. Let's just go with
one implementation -- the generic one -- and then we can see if
optimization really makes sense later. I suspect the large speedup we're
already getting from being in the vDSO is already sufficient for
purposes.

Regards,
Jason


* Re: [PATCH v3 1/3] LoongArch: vDSO: Wire up getrandom() vDSO implementation
  2024-08-19 12:41   ` Huacai Chen
@ 2024-08-19 13:03     ` Jason A. Donenfeld
  2024-08-19 15:36       ` Xi Ruoyao
  0 siblings, 1 reply; 16+ messages in thread
From: Jason A. Donenfeld @ 2024-08-19 13:03 UTC (permalink / raw)
  To: Huacai Chen
  Cc: Xi Ruoyao, WANG Xuerui, linux-crypto, loongarch, Jinyang He,
	Tiezhu Yang, Arnd Bergmann

> > The compiler (GCC 14.2) calls memset() for initializing a "large" struct
> > in a cold path of the generic vDSO getrandom() code.  There seems no way
> > to prevent it from calling memset(), and it's a cold path so the
> > performance does not matter, so just provide a naive memset()
> > implementation for vDSO.
> Why x86 doesn't need to provide a naive memset()?

It looks like others are running into this when porting to ppc and
arm64, so I'll probably refactor the code to avoid needing it in the
first place. I'll chime in here when that's done.

Jason


* Re:
  2024-08-19 13:01   ` Re: Jason A. Donenfeld
@ 2024-08-19 15:22     ` Xi Ruoyao
  2024-08-19 15:54       ` Re: Xi Ruoyao
  0 siblings, 1 reply; 16+ messages in thread
From: Xi Ruoyao @ 2024-08-19 15:22 UTC (permalink / raw)
  To: Jason A. Donenfeld, Huacai Chen
  Cc: WANG Xuerui, linux-crypto, loongarch, Jinyang He, Tiezhu Yang,
	Arnd Bergmann

On Mon, 2024-08-19 at 13:01 +0000, Jason A. Donenfeld wrote:
> > I don't see significant improvements about LSX here, so I prefer to
> > just use the generic version to avoid complexity (I remember Linus
> > said the whole of __vdso_getrandom is not very useful).
> 
> I'm inclined to feel the same way, at least for now. Let's just go with
> one implementation -- the generic one -- and then we can see if
> optimization really makes sense later. I suspect the large speedup we're
> already getting from being in the vDSO is already sufficient for
> purposes.

Ok, I'll drop the 2nd and 3rd patches in the next version.  But I'm
puzzled why the LSX implementation isn't much faster; maybe I made some
mistake in it?

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University


* Re:
  2024-08-19 12:40 ` Huacai Chen
  2024-08-19 13:01   ` Re: Jason A. Donenfeld
@ 2024-08-19 15:22   ` Xi Ruoyao
  1 sibling, 0 replies; 16+ messages in thread
From: Xi Ruoyao @ 2024-08-19 15:22 UTC (permalink / raw)
  To: Huacai Chen
  Cc: Jason A . Donenfeld, WANG Xuerui, linux-crypto, loongarch,
	Jinyang He, Tiezhu Yang, Arnd Bergmann

On Mon, 2024-08-19 at 20:40 +0800, Huacai Chen wrote:
> Hi, Ruoyao,
> 
> Why no subject?

Because I misused git send-email (again) :(.


-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University


* Re: [PATCH v3 1/3] LoongArch: vDSO: Wire up getrandom() vDSO implementation
  2024-08-19 13:03     ` Jason A. Donenfeld
@ 2024-08-19 15:36       ` Xi Ruoyao
  2024-08-20  0:50         ` Jinyang He
  0 siblings, 1 reply; 16+ messages in thread
From: Xi Ruoyao @ 2024-08-19 15:36 UTC (permalink / raw)
  To: Jason A. Donenfeld, Huacai Chen
  Cc: WANG Xuerui, linux-crypto, loongarch, Jinyang He, Tiezhu Yang,
	Arnd Bergmann

On Mon, 2024-08-19 at 13:03 +0000, Jason A. Donenfeld wrote:
> > > The compiler (GCC 14.2) calls memset() for initializing a "large" struct
> > > in a cold path of the generic vDSO getrandom() code.  There seems no way
> > > to prevent it from calling memset(), and it's a cold path so the
> > > performance does not matter, so just provide a naive memset()
> > > implementation for vDSO.
> > Why x86 doesn't need to provide a naive memset()?

I'm not sure.  Maybe it's because x86_64 has SSE2 enabled, so by default
the maximum buffer length for inlining memset is larger.

> It looks like others are running into this when porting to ppc and
> arm64, so I'll probably refactor the code to avoid needing it in the
> first place. I'll chime in here when that's done.

Yes, I've seen the PPC guys hacking the code to avoid memset.

BTW I've also seen "vDSO getrandom isn't supported on 32-bit platforms"
in the PPC discussion.  Is there any plan to support it in the future?
If not, I'll only select VDSO_GETRANDOM under CONFIG_64BIT, and then the
assembly code can be slightly simplified.
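
To illustrate the compiler behavior being discussed (with a hypothetical
stand-in struct -- this is not the kernel's actual state layout, and the
field names and sizes here are made up): zero-initializing an object past
the inline-expansion threshold is the pattern GCC lowers to a memset()
call, which is why the freestanding vDSO has to supply one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the "large" struct that the generic vDSO
 * getrandom() zero-initializes on its cold path; fields are
 * illustrative only. */
struct opaque_state {
	unsigned char batch[128];
	unsigned int key[8];
	unsigned long long counter;
};

/* An aggregate zero-initializer like this is what GCC may lower to a
 * memset() call once the object is "large" enough, regardless of
 * -fno-builtin-memset. */
static void reset_state(struct opaque_state *s)
{
	*s = (struct opaque_state){ 0 };
}
```

On x86_64 a larger default inline-expansion limit (with SSE2 always
available) could keep this expansion inline, which would explain why x86
gets away without a vDSO memset().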

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University


* Re:
  2024-08-19 15:22     ` Re: Xi Ruoyao
@ 2024-08-19 15:54       ` Xi Ruoyao
  0 siblings, 0 replies; 16+ messages in thread
From: Xi Ruoyao @ 2024-08-19 15:54 UTC (permalink / raw)
  To: Jason A. Donenfeld, Huacai Chen
  Cc: WANG Xuerui, linux-crypto, loongarch, Jinyang He, Tiezhu Yang,
	Arnd Bergmann

On Mon, 2024-08-19 at 23:22 +0800, Xi Ruoyao wrote:
> On Mon, 2024-08-19 at 13:01 +0000, Jason A. Donenfeld wrote:
> > > I don't see significant improvements about LSX here, so I prefer to
> > > just use the generic version to avoid complexity (I remember Linus
> > > said the whole of __vdso_getrandom is not very useful).
> > 
> > I'm inclined to feel the same way, at least for now. Let's just go with
> > one implementation -- the generic one -- and then we can see if
> > optimization really makes sense later. I suspect the large speedup we're
> > already getting from being in the vDSO is already sufficient for
> > purposes.
> 
> Ok I'll drop the 2nd and 3rd patches in the next version.  But I'm
> puzzled why the LSX implementation isn't much faster, maybe I made some
> mistake in it?

After some thought this seems to make sense: the LoongArch desktop
processors have 4 ALUs able to perform the scalar add/rotate/xor
operations, and for ChaCha20 the throughput is already maximized by the
data dependencies.  The advantage of LSX seems to be just avoiding
reloads of the key from memory (because the vector register file is
large enough to hold a copy of it).

Perhaps LSX would do much better on embedded processors with 2 ALUs
and 1 SIMD unit (if they don't downclock under heavy SIMD load), but I
don't have one to test on.
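
The dependency chain in question is easy to see in a plain C sketch of
the ChaCha quarter-round (this mirrors the QR macro in the assembly; the
test vector is the one from RFC 8439, section 2.1.1): each quarter-round
is a serial chain of ALU operations, but the four quarter-rounds within
each round touch disjoint state words, so four scalar ALUs can run them
in parallel, leaving little for a 128-bit SIMD unit to add.

```c
#include <assert.h>
#include <stdint.h>

static inline uint32_t rotl32(uint32_t x, int n)
{
	return (x << n) | (x >> (32 - n));
}

/* ChaCha quarter-round: a serial dependency chain of adds, xors and
 * rotates.  A full round applies this to four disjoint column (or
 * diagonal) groups of the 16-word state, so four of these chains can
 * proceed in parallel on four scalar ALUs. */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
	*a += *b; *d ^= *a; *d = rotl32(*d, 16);
	*c += *d; *b ^= *c; *b = rotl32(*b, 12);
	*a += *b; *d ^= *a; *d = rotl32(*d, 8);
	*c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```

Note the assembly uses rotri (rotate right) by 16/20/24/25, which is the
same as rotate left by 16/12/8/7.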

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University


* Re: [PATCH v3 1/3] LoongArch: vDSO: Wire up getrandom() vDSO implementation
  2024-08-19 15:36       ` Xi Ruoyao
@ 2024-08-20  0:50         ` Jinyang He
  2024-08-20  1:09           ` Xi Ruoyao
  0 siblings, 1 reply; 16+ messages in thread
From: Jinyang He @ 2024-08-20  0:50 UTC (permalink / raw)
  To: Xi Ruoyao, Jason A. Donenfeld, Huacai Chen
  Cc: WANG Xuerui, linux-crypto, loongarch, Tiezhu Yang, Arnd Bergmann

On 2024-08-19 23:36, Xi Ruoyao wrote:

> On Mon, 2024-08-19 at 13:03 +0000, Jason A. Donenfeld wrote:
>>>> The compiler (GCC 14.2) calls memset() for initializing a "large" struct
>>>> in a cold path of the generic vDSO getrandom() code.  There seems no way
>>>> to prevent it from calling memset(), and it's a cold path so the
>>>> performance does not matter, so just provide a naive memset()
>>>> implementation for vDSO.
>>> Why x86 doesn't need to provide a naive memset()?
> I'm not sure.  Maybe it's because x86_64 has SSE2 enabled so by default
> the maximum buffer length to inline memset is larger.
>
I suspect the LoongArch GCC has an issue with -fno-builtin(-memset).



* Re: [PATCH v3 1/3] LoongArch: vDSO: Wire up getrandom() vDSO implementation
  2024-08-20  0:50         ` Jinyang He
@ 2024-08-20  1:09           ` Xi Ruoyao
  2024-08-20  2:03             ` Jinyang He
  0 siblings, 1 reply; 16+ messages in thread
From: Xi Ruoyao @ 2024-08-20  1:09 UTC (permalink / raw)
  To: Jinyang He, Jason A. Donenfeld, Huacai Chen
  Cc: WANG Xuerui, linux-crypto, loongarch, Tiezhu Yang, Arnd Bergmann

On Tue, 2024-08-20 at 08:50 +0800, Jinyang He wrote:
> On 2024-08-19 23:36, Xi Ruoyao wrote:
> 
> > On Mon, 2024-08-19 at 13:03 +0000, Jason A. Donenfeld wrote:
> > > > > The compiler (GCC 14.2) calls memset() for initializing a "large" struct
> > > > > in a cold path of the generic vDSO getrandom() code.  There seems no way
> > > > > to prevent it from calling memset(), and it's a cold path so the
> > > > > performance does not matter, so just provide a naive memset()
> > > > > implementation for vDSO.
> > > > Why x86 doesn't need to provide a naive memset()?
> > I'm not sure.  Maybe it's because x86_64 has SSE2 enabled so by default
> > the maximum buffer length to inline memset is larger.
> > 
> I suspect the loongarch gcc has issue with -fno-builtin(-memset).

No, -fno-builtin-memset only means "don't treat calls to memset() as the
built-in"; it does not mean "don't emit memset() calls," nor anything
more than that.

Even -ffreestanding is not guaranteed to turn off memset call
generation: GCC assumes that even a freestanding environment provides
memcpy(), memmove(), memset() and memcmp().

x86 has a -mmemset-strategy= option, but it's really x86-specific.  As
Jason pointed out, PowerPC and ARM64 have also hit the same issue.
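
For reference, the byte-at-a-time loop in memset.S corresponds to this C
sketch -- correctness over speed, since the call is only reached on a
cold path:

```c
#include <assert.h>
#include <stddef.h>

/* C equivalent of the naive memset in the vDSO's memset.S: store one
 * byte per iteration and return the original destination pointer. */
static void *vdso_memset(void *s, int c, size_t n)
{
	unsigned char *p = s;

	while (n--)
		*p++ = (unsigned char)c;
	return s;
}
```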

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University


* Re: [PATCH v3 1/3] LoongArch: vDSO: Wire up getrandom() vDSO implementation
  2024-08-20  1:09           ` Xi Ruoyao
@ 2024-08-20  2:03             ` Jinyang He
  0 siblings, 0 replies; 16+ messages in thread
From: Jinyang He @ 2024-08-20  2:03 UTC (permalink / raw)
  To: Xi Ruoyao, Jason A. Donenfeld, Huacai Chen
  Cc: WANG Xuerui, linux-crypto, loongarch, Tiezhu Yang, Arnd Bergmann

On 2024-08-20 09:09, Xi Ruoyao wrote:

> On Tue, 2024-08-20 at 08:50 +0800, Jinyang He wrote:
>> On 2024-08-19 23:36, Xi Ruoyao wrote:
>>
>>> On Mon, 2024-08-19 at 13:03 +0000, Jason A. Donenfeld wrote:
>>>>>> The compiler (GCC 14.2) calls memset() for initializing a "large" struct
>>>>>> in a cold path of the generic vDSO getrandom() code.  There seems no way
>>>>>> to prevent it from calling memset(), and it's a cold path so the
>>>>>> performance does not matter, so just provide a naive memset()
>>>>>> implementation for vDSO.
>>>>> Why x86 doesn't need to provide a naive memset()?
>>> I'm not sure.  Maybe it's because x86_64 has SSE2 enabled so by default
>>> the maximum buffer length to inline memset is larger.
>>>
>> I suspect the loongarch gcc has issue with -fno-builtin(-memset).
> No, -fno-builtin-memset just means don't convert memset to
> __builtin_memset, it does not mean "don't emit memset call," nor
> anything more than that.
>
> Even -ffreestanding is not guaranteed to turn off memset call generation
> because per the standard memset should be available even in a
> freestanding implementation.
>
> x86 has a -mmemset-strategy= option but it's really x86 specific.  As
> Jason pointed out, PowerPC and ARM64 have also hit the same issue.
>
I see, thanks!  GCC produces __builtin_memset in the expand pass.
x86 can increase the maximum buffer length for inlining memset via
-mmemset-strategy=, while other arches like LoongArch cannot.



* Re:
  2024-08-16 11:07 Xi Ruoyao
                   ` (3 preceding siblings ...)
  2024-08-19 12:40 ` Huacai Chen
@ 2024-08-27  9:45 ` Jason A. Donenfeld
  4 siblings, 0 replies; 16+ messages in thread
From: Jason A. Donenfeld @ 2024-08-27  9:45 UTC (permalink / raw)
  To: Xi Ruoyao
  Cc: Huacai Chen, WANG Xuerui, linux-crypto, loongarch, Jinyang He,
	Tiezhu Yang, Arnd Bergmann

Hey,

Per https://lore.kernel.org/all/Zs2c_9Z6sFMNJs1O@zx2c4.com/ , you may
want to rebase on random.git and send a v4 series. Hopefully now it's
just a single patch.

Jason


Thread overview: 16+ messages
2024-08-16 11:07 Xi Ruoyao
2024-08-16 11:07 ` [PATCH v3 1/3] LoongArch: vDSO: Wire up getrandom() vDSO implementation Xi Ruoyao
2024-08-19 12:41   ` Huacai Chen
2024-08-19 13:03     ` Jason A. Donenfeld
2024-08-19 15:36       ` Xi Ruoyao
2024-08-20  0:50         ` Jinyang He
2024-08-20  1:09           ` Xi Ruoyao
2024-08-20  2:03             ` Jinyang He
2024-08-16 11:07 ` [PATCH v3 2/3] LoongArch: Perform alternative runtime patching on vDSO Xi Ruoyao
2024-08-16 11:07 ` [PATCH v3 3/3] LoongArch: vDSO: Add LSX implementation of vDSO getrandom() Xi Ruoyao
2024-08-19 12:40 ` Huacai Chen
2024-08-19 13:01   ` Re: Jason A. Donenfeld
2024-08-19 15:22     ` Re: Xi Ruoyao
2024-08-19 15:54       ` Re: Xi Ruoyao
2024-08-19 15:22   ` Re: Xi Ruoyao
2024-08-27  9:45 ` Re: Jason A. Donenfeld
