* [PATCH v2 1/5] ARM: Add a neon-intrinsics.h header like on arm64
2026-03-31 7:49 [PATCH v2 0/5] xor/arm: Replace vectorized version with intrinsics Ard Biesheuvel
@ 2026-03-31 7:49 ` Ard Biesheuvel
2026-03-31 7:49 ` [PATCH v2 2/5] crypto: aegis128 - Use neon-intrinsics.h on ARM too Ard Biesheuvel
` (4 subsequent siblings)
5 siblings, 0 replies; 9+ messages in thread
From: Ard Biesheuvel @ 2026-03-31 7:49 UTC (permalink / raw)
To: linux-raid
Cc: linux-arm-kernel, linux-crypto, Ard Biesheuvel, Christoph Hellwig,
Russell King, Arnd Bergmann, Eric Biggers
From: Ard Biesheuvel <ardb@kernel.org>
Add a header asm/neon-intrinsics.h similar to the one that arm64 has.
This makes it possible for NEON intrinsics code to be shared seamlessly
between ARM and arm64.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
Documentation/arch/arm/kernel_mode_neon.rst | 4 +-
arch/arm/include/asm/neon-intrinsics.h | 64 ++++++++++++++++++++
2 files changed, 67 insertions(+), 1 deletion(-)
diff --git a/Documentation/arch/arm/kernel_mode_neon.rst b/Documentation/arch/arm/kernel_mode_neon.rst
index 9bfb71a2a9b9..1efb6d35b7bd 100644
--- a/Documentation/arch/arm/kernel_mode_neon.rst
+++ b/Documentation/arch/arm/kernel_mode_neon.rst
@@ -121,4 +121,6 @@ observe the following in addition to the rules above:
* Compile the unit containing the NEON intrinsics with '-ffreestanding' so GCC
uses its builtin version of <stdint.h> (this is a C99 header which the kernel
does not supply);
-* Include <arm_neon.h> last, or at least after <linux/types.h>
+* Do not include <arm_neon.h> directly: instead, include <asm/neon-intrinsics.h>,
+ which tweaks some macro definitions so that system headers can be included
+ safely.
diff --git a/arch/arm/include/asm/neon-intrinsics.h b/arch/arm/include/asm/neon-intrinsics.h
new file mode 100644
index 000000000000..3fe0b5ab9659
--- /dev/null
+++ b/arch/arm/include/asm/neon-intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef __ASM_NEON_INTRINSICS_H
+#define __ASM_NEON_INTRINSICS_H
+
+#ifndef __ARM_NEON__
+#error You should compile this file with '-march=armv7-a -mfloat-abi=softfp -mfpu=neon'
+#endif
+
+#include <asm-generic/int-ll64.h>
+
+/*
+ * The C99 types uintXX_t that are usually defined in 'stdint.h' are not as
+ * unambiguous on ARM as you would expect. For the types below, there is a
+ * difference on ARM between GCC built for bare metal ARM, GCC built for glibc
+ * and the kernel itself, which results in build errors if you try to build
+ * with -ffreestanding and include 'stdint.h' (such as when you include
+ * 'arm_neon.h' in order to use NEON intrinsics)
+ *
+ * As the typedefs for these types in 'stdint.h' are based on builtin defines
+ * supplied by GCC, we can tweak these to align with the kernel's idea of those
+ * types, so 'linux/types.h' and 'stdint.h' can be safely included from the
+ * same source file (provided that -ffreestanding is used).
+ *
+ * int32_t uint32_t intptr_t uintptr_t
+ * bare metal GCC long unsigned long int unsigned int
+ * glibc GCC int unsigned int int unsigned int
+ * kernel int unsigned int long unsigned long
+ */
+
+#ifdef __INT32_TYPE__
+#undef __INT32_TYPE__
+#define __INT32_TYPE__ int
+#endif
+
+#ifdef __UINT32_TYPE__
+#undef __UINT32_TYPE__
+#define __UINT32_TYPE__ unsigned int
+#endif
+
+#ifdef __INTPTR_TYPE__
+#undef __INTPTR_TYPE__
+#define __INTPTR_TYPE__ long
+#endif
+
+#ifdef __UINTPTR_TYPE__
+#undef __UINTPTR_TYPE__
+#define __UINTPTR_TYPE__ unsigned long
+#endif
+
+/*
+ * genksyms chokes on the ARM NEON intrinsics system header, but we
+ * don't export anything it defines anyway, so just disregard it when
+ * genksyms executes.
+ */
+#ifndef __GENKSYMS__
+#include <arm_neon.h>
+#endif
+
+#ifdef CONFIG_CC_IS_CLANG
+#pragma clang diagnostic ignored "-Wincompatible-pointer-types"
+#endif
+
+#endif /* __ASM_NEON_INTRINSICS_H */
--
2.53.0.1018.g2bb0e51243-goog
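The type ambiguity this header papers over can be probed with a small hosted-C sketch (the helper name is illustrative, not part of the patch): `_Generic` reports which plain C type backs `uint32_t` for the toolchain at hand. A bare-metal arm-none-eabi GCC would report `unsigned long` here, while glibc-targeted toolchains and the kernel expect `unsigned int` — which is exactly why the header redefines `__UINT32_TYPE__` before `<arm_neon.h>` pulls in `<stdint.h>`.

```c
#include <stdint.h>
#include <string.h>

/*
 * Report the underlying C type of uint32_t as seen by this toolchain.
 * On hosted glibc targets this is "unsigned int" (matching the
 * kernel's idea); on bare-metal ARM GCC it would be "unsigned long",
 * the mismatch described in the table above.
 */
static const char *uint32_underlying_type(void)
{
	return _Generic((uint32_t)0,
			unsigned int:  "unsigned int",
			unsigned long: "unsigned long",
			default:       "other");
}
```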
* [PATCH v2 2/5] crypto: aegis128 - Use neon-intrinsics.h on ARM too
2026-03-31 7:49 [PATCH v2 0/5] xor/arm: Replace vectorized version with intrinsics Ard Biesheuvel
2026-03-31 7:49 ` [PATCH v2 1/5] ARM: Add a neon-intrinsics.h header like on arm64 Ard Biesheuvel
@ 2026-03-31 7:49 ` Ard Biesheuvel
2026-03-31 7:49 ` [PATCH v2 3/5] xor/arm: Replace vectorized implementation with arm64's intrinsics Ard Biesheuvel
` (3 subsequent siblings)
5 siblings, 0 replies; 9+ messages in thread
From: Ard Biesheuvel @ 2026-03-31 7:49 UTC (permalink / raw)
To: linux-raid
Cc: linux-arm-kernel, linux-crypto, Ard Biesheuvel, Christoph Hellwig,
Russell King, Arnd Bergmann, Eric Biggers
From: Ard Biesheuvel <ardb@kernel.org>
Use the asm/neon-intrinsics.h header on ARM as well as arm64, so that
the calling code does not have to know the difference.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
crypto/aegis128-neon-inner.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/crypto/aegis128-neon-inner.c b/crypto/aegis128-neon-inner.c
index b6a52a386b22..56b534eeb680 100644
--- a/crypto/aegis128-neon-inner.c
+++ b/crypto/aegis128-neon-inner.c
@@ -3,13 +3,11 @@
* Copyright (C) 2019 Linaro, Ltd. <ard.biesheuvel@linaro.org>
*/
-#ifdef CONFIG_ARM64
#include <asm/neon-intrinsics.h>
+#ifdef CONFIG_ARM64
#define AES_ROUND "aese %0.16b, %1.16b \n\t aesmc %0.16b, %0.16b"
#else
-#include <arm_neon.h>
-
#define AES_ROUND "aese.8 %q0, %q1 \n\t aesmc.8 %q0, %q0"
#endif
--
2.53.0.1018.g2bb0e51243-goog
* [PATCH v2 3/5] xor/arm: Replace vectorized implementation with arm64's intrinsics
2026-03-31 7:49 [PATCH v2 0/5] xor/arm: Replace vectorized version with intrinsics Ard Biesheuvel
2026-03-31 7:49 ` [PATCH v2 1/5] ARM: Add a neon-intrinsics.h header like on arm64 Ard Biesheuvel
2026-03-31 7:49 ` [PATCH v2 2/5] crypto: aegis128 - Use neon-intrinsics.h on ARM too Ard Biesheuvel
@ 2026-03-31 7:49 ` Ard Biesheuvel
2026-03-31 7:49 ` [PATCH v2 4/5] xor/arm64: Use shared NEON intrinsics implementation from 32-bit ARM Ard Biesheuvel
` (2 subsequent siblings)
5 siblings, 0 replies; 9+ messages in thread
From: Ard Biesheuvel @ 2026-03-31 7:49 UTC (permalink / raw)
To: linux-raid
Cc: linux-arm-kernel, linux-crypto, Ard Biesheuvel, Christoph Hellwig,
Russell King, Arnd Bergmann, Eric Biggers
From: Ard Biesheuvel <ardb@kernel.org>
Drop the XOR implementation generated by the vectorizer: this has always
been a bit of a hack, and now that arm64 has an intrinsics version that
works on ARM too, let's use that instead.
Copy over the part of the arm64 code that can be shared (i.e., not the EOR3
version). The arm64 code will be updated in a subsequent patch to share
this implementation.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
lib/raid/xor/arm/xor-neon.c | 183 ++++++++++++++++++--
lib/raid/xor/arm/xor-neon.h | 7 +
lib/raid/xor/arm/xor_arch.h | 7 +-
lib/raid/xor/xor-8regs.c | 2 -
4 files changed, 174 insertions(+), 25 deletions(-)
diff --git a/lib/raid/xor/arm/xor-neon.c b/lib/raid/xor/arm/xor-neon.c
index 23147e3a7904..a3e2b4af8d36 100644
--- a/lib/raid/xor/arm/xor-neon.c
+++ b/lib/raid/xor/arm/xor-neon.c
@@ -1,26 +1,175 @@
// SPDX-License-Identifier: GPL-2.0-only
/*
- * Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
+ * Authors: Jackie Liu <liuyun01@kylinos.cn>
+ * Copyright (C) 2018,Tianjin KYLIN Information Technology Co., Ltd.
*/
#include "xor_impl.h"
-#include "xor_arch.h"
+#include "xor-neon.h"
-#ifndef __ARM_NEON__
-#error You should compile this file with '-march=armv7-a -mfloat-abi=softfp -mfpu=neon'
-#endif
+#include <asm/neon-intrinsics.h>
-/*
- * Pull in the reference implementations while instructing GCC (through
- * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
- * NEON instructions. Clang does this by default at O2 so no pragma is
- * needed.
- */
-#ifdef CONFIG_CC_IS_GCC
-#pragma GCC optimize "tree-vectorize"
-#endif
+static void __xor_neon_2(unsigned long bytes, unsigned long * __restrict p1,
+ const unsigned long * __restrict p2)
+{
+ uint64_t *dp1 = (uint64_t *)p1;
+ uint64_t *dp2 = (uint64_t *)p2;
+
+ register uint64x2_t v0, v1, v2, v3;
+ long lines = bytes / (sizeof(uint64x2_t) * 4);
+
+ do {
+ /* p1 ^= p2 */
+ v0 = veorq_u64(vld1q_u64(dp1 + 0), vld1q_u64(dp2 + 0));
+ v1 = veorq_u64(vld1q_u64(dp1 + 2), vld1q_u64(dp2 + 2));
+ v2 = veorq_u64(vld1q_u64(dp1 + 4), vld1q_u64(dp2 + 4));
+ v3 = veorq_u64(vld1q_u64(dp1 + 6), vld1q_u64(dp2 + 6));
+
+ /* store */
+ vst1q_u64(dp1 + 0, v0);
+ vst1q_u64(dp1 + 2, v1);
+ vst1q_u64(dp1 + 4, v2);
+ vst1q_u64(dp1 + 6, v3);
+
+ dp1 += 8;
+ dp2 += 8;
+ } while (--lines > 0);
+}
+
+static void __xor_neon_3(unsigned long bytes, unsigned long * __restrict p1,
+ const unsigned long * __restrict p2,
+ const unsigned long * __restrict p3)
+{
+ uint64_t *dp1 = (uint64_t *)p1;
+ uint64_t *dp2 = (uint64_t *)p2;
+ uint64_t *dp3 = (uint64_t *)p3;
+
+ register uint64x2_t v0, v1, v2, v3;
+ long lines = bytes / (sizeof(uint64x2_t) * 4);
+
+ do {
+ /* p1 ^= p2 */
+ v0 = veorq_u64(vld1q_u64(dp1 + 0), vld1q_u64(dp2 + 0));
+ v1 = veorq_u64(vld1q_u64(dp1 + 2), vld1q_u64(dp2 + 2));
+ v2 = veorq_u64(vld1q_u64(dp1 + 4), vld1q_u64(dp2 + 4));
+ v3 = veorq_u64(vld1q_u64(dp1 + 6), vld1q_u64(dp2 + 6));
+
+ /* p1 ^= p3 */
+ v0 = veorq_u64(v0, vld1q_u64(dp3 + 0));
+ v1 = veorq_u64(v1, vld1q_u64(dp3 + 2));
+ v2 = veorq_u64(v2, vld1q_u64(dp3 + 4));
+ v3 = veorq_u64(v3, vld1q_u64(dp3 + 6));
+
+ /* store */
+ vst1q_u64(dp1 + 0, v0);
+ vst1q_u64(dp1 + 2, v1);
+ vst1q_u64(dp1 + 4, v2);
+ vst1q_u64(dp1 + 6, v3);
+
+ dp1 += 8;
+ dp2 += 8;
+ dp3 += 8;
+ } while (--lines > 0);
+}
+
+static void __xor_neon_4(unsigned long bytes, unsigned long * __restrict p1,
+ const unsigned long * __restrict p2,
+ const unsigned long * __restrict p3,
+ const unsigned long * __restrict p4)
+{
+ uint64_t *dp1 = (uint64_t *)p1;
+ uint64_t *dp2 = (uint64_t *)p2;
+ uint64_t *dp3 = (uint64_t *)p3;
+ uint64_t *dp4 = (uint64_t *)p4;
+
+ register uint64x2_t v0, v1, v2, v3;
+ long lines = bytes / (sizeof(uint64x2_t) * 4);
+
+ do {
+ /* p1 ^= p2 */
+ v0 = veorq_u64(vld1q_u64(dp1 + 0), vld1q_u64(dp2 + 0));
+ v1 = veorq_u64(vld1q_u64(dp1 + 2), vld1q_u64(dp2 + 2));
+ v2 = veorq_u64(vld1q_u64(dp1 + 4), vld1q_u64(dp2 + 4));
+ v3 = veorq_u64(vld1q_u64(dp1 + 6), vld1q_u64(dp2 + 6));
+
+ /* p1 ^= p3 */
+ v0 = veorq_u64(v0, vld1q_u64(dp3 + 0));
+ v1 = veorq_u64(v1, vld1q_u64(dp3 + 2));
+ v2 = veorq_u64(v2, vld1q_u64(dp3 + 4));
+ v3 = veorq_u64(v3, vld1q_u64(dp3 + 6));
+
+ /* p1 ^= p4 */
+ v0 = veorq_u64(v0, vld1q_u64(dp4 + 0));
+ v1 = veorq_u64(v1, vld1q_u64(dp4 + 2));
+ v2 = veorq_u64(v2, vld1q_u64(dp4 + 4));
+ v3 = veorq_u64(v3, vld1q_u64(dp4 + 6));
+
+ /* store */
+ vst1q_u64(dp1 + 0, v0);
+ vst1q_u64(dp1 + 2, v1);
+ vst1q_u64(dp1 + 4, v2);
+ vst1q_u64(dp1 + 6, v3);
+
+ dp1 += 8;
+ dp2 += 8;
+ dp3 += 8;
+ dp4 += 8;
+ } while (--lines > 0);
+}
+
+static void __xor_neon_5(unsigned long bytes, unsigned long * __restrict p1,
+ const unsigned long * __restrict p2,
+ const unsigned long * __restrict p3,
+ const unsigned long * __restrict p4,
+ const unsigned long * __restrict p5)
+{
+ uint64_t *dp1 = (uint64_t *)p1;
+ uint64_t *dp2 = (uint64_t *)p2;
+ uint64_t *dp3 = (uint64_t *)p3;
+ uint64_t *dp4 = (uint64_t *)p4;
+ uint64_t *dp5 = (uint64_t *)p5;
+
+ register uint64x2_t v0, v1, v2, v3;
+ long lines = bytes / (sizeof(uint64x2_t) * 4);
+
+ do {
+ /* p1 ^= p2 */
+ v0 = veorq_u64(vld1q_u64(dp1 + 0), vld1q_u64(dp2 + 0));
+ v1 = veorq_u64(vld1q_u64(dp1 + 2), vld1q_u64(dp2 + 2));
+ v2 = veorq_u64(vld1q_u64(dp1 + 4), vld1q_u64(dp2 + 4));
+ v3 = veorq_u64(vld1q_u64(dp1 + 6), vld1q_u64(dp2 + 6));
+
+ /* p1 ^= p3 */
+ v0 = veorq_u64(v0, vld1q_u64(dp3 + 0));
+ v1 = veorq_u64(v1, vld1q_u64(dp3 + 2));
+ v2 = veorq_u64(v2, vld1q_u64(dp3 + 4));
+ v3 = veorq_u64(v3, vld1q_u64(dp3 + 6));
+
+ /* p1 ^= p4 */
+ v0 = veorq_u64(v0, vld1q_u64(dp4 + 0));
+ v1 = veorq_u64(v1, vld1q_u64(dp4 + 2));
+ v2 = veorq_u64(v2, vld1q_u64(dp4 + 4));
+ v3 = veorq_u64(v3, vld1q_u64(dp4 + 6));
+
+ /* p1 ^= p5 */
+ v0 = veorq_u64(v0, vld1q_u64(dp5 + 0));
+ v1 = veorq_u64(v1, vld1q_u64(dp5 + 2));
+ v2 = veorq_u64(v2, vld1q_u64(dp5 + 4));
+ v3 = veorq_u64(v3, vld1q_u64(dp5 + 6));
+
+ /* store */
+ vst1q_u64(dp1 + 0, v0);
+ vst1q_u64(dp1 + 2, v1);
+ vst1q_u64(dp1 + 4, v2);
+ vst1q_u64(dp1 + 6, v3);
-#define NO_TEMPLATE
-#include "../xor-8regs.c"
+ dp1 += 8;
+ dp2 += 8;
+ dp3 += 8;
+ dp4 += 8;
+ dp5 += 8;
+ } while (--lines > 0);
+}
-__DO_XOR_BLOCKS(neon_inner, xor_8regs_2, xor_8regs_3, xor_8regs_4, xor_8regs_5);
+__DO_XOR_BLOCKS(neon_inner, __xor_neon_2, __xor_neon_3, __xor_neon_4,
+ __xor_neon_5);
diff --git a/lib/raid/xor/arm/xor-neon.h b/lib/raid/xor/arm/xor-neon.h
new file mode 100644
index 000000000000..406e0356f05b
--- /dev/null
+++ b/lib/raid/xor/arm/xor-neon.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+extern struct xor_block_template xor_block_arm4regs;
+extern struct xor_block_template xor_block_neon;
+
+void xor_gen_neon_inner(void *dest, void **srcs, unsigned int src_cnt,
+ unsigned int bytes);
diff --git a/lib/raid/xor/arm/xor_arch.h b/lib/raid/xor/arm/xor_arch.h
index 775ff835df65..f1ddb64fe62a 100644
--- a/lib/raid/xor/arm/xor_arch.h
+++ b/lib/raid/xor/arm/xor_arch.h
@@ -3,12 +3,7 @@
* Copyright (C) 2001 Russell King
*/
#include <asm/neon.h>
-
-extern struct xor_block_template xor_block_arm4regs;
-extern struct xor_block_template xor_block_neon;
-
-void xor_gen_neon_inner(void *dest, void **srcs, unsigned int src_cnt,
- unsigned int bytes);
+#include "xor-neon.h"
static __always_inline void __init arch_xor_init(void)
{
diff --git a/lib/raid/xor/xor-8regs.c b/lib/raid/xor/xor-8regs.c
index 1edaed8acffe..46b3c8bdc27f 100644
--- a/lib/raid/xor/xor-8regs.c
+++ b/lib/raid/xor/xor-8regs.c
@@ -93,11 +93,9 @@ xor_8regs_5(unsigned long bytes, unsigned long * __restrict p1,
} while (--lines > 0);
}
-#ifndef NO_TEMPLATE
DO_XOR_BLOCKS(8regs, xor_8regs_2, xor_8regs_3, xor_8regs_4, xor_8regs_5);
struct xor_block_template xor_block_8regs = {
.name = "8regs",
.xor_gen = xor_gen_8regs,
};
-#endif /* NO_TEMPLATE */
--
2.53.0.1018.g2bb0e51243-goog
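For reference, the arithmetic the NEON routines above implement reduces to a plain XOR over 64-byte lines (four 16-byte vectors per iteration). A scalar sketch of what `__xor_neon_2` computes, using a hypothetical helper name not found in the patch:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scalar equivalent of __xor_neon_2: XOR `bytes` bytes of p2 into p1,
 * processed in 64-byte lines, i.e. eight uint64_t words per iteration
 * (matching the four uint64x2_t vectors the NEON version uses).
 * Assumes `bytes` is a multiple of 64, as the templates guarantee.
 */
static void xor_2_scalar(size_t bytes, uint64_t *restrict p1,
			 const uint64_t *restrict p2)
{
	size_t lines = bytes / 64;

	while (lines--) {
		for (int i = 0; i < 8; i++)
			p1[i] ^= p2[i];
		p1 += 8;
		p2 += 8;
	}
}
```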
* [PATCH v2 4/5] xor/arm64: Use shared NEON intrinsics implementation from 32-bit ARM
2026-03-31 7:49 [PATCH v2 0/5] xor/arm: Replace vectorized version with intrinsics Ard Biesheuvel
` (2 preceding siblings ...)
2026-03-31 7:49 ` [PATCH v2 3/5] xor/arm: Replace vectorized implementation with arm64's intrinsics Ard Biesheuvel
@ 2026-03-31 7:49 ` Ard Biesheuvel
2026-03-31 7:49 ` [PATCH v2 5/5] ARM: Remove hacked-up asm/types.h header Ard Biesheuvel
2026-03-31 15:16 ` [PATCH v2 0/5] xor/arm: Replace vectorized version with intrinsics Christoph Hellwig
5 siblings, 0 replies; 9+ messages in thread
From: Ard Biesheuvel @ 2026-03-31 7:49 UTC (permalink / raw)
To: linux-raid
Cc: linux-arm-kernel, linux-crypto, Ard Biesheuvel, Christoph Hellwig,
Russell King, Arnd Bergmann, Eric Biggers
From: Ard Biesheuvel <ardb@kernel.org>
Tweak the arm64 code so that the pure NEON intrinsics implementation of
XOR is shared between arm64 and ARM.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
lib/raid/xor/Makefile | 3 +-
lib/raid/xor/arm/xor-neon.c | 4 +
lib/raid/xor/arm64/xor-neon.c | 172 +-------------------
3 files changed, 9 insertions(+), 170 deletions(-)
diff --git a/lib/raid/xor/Makefile b/lib/raid/xor/Makefile
index 4d633dfd5b90..b27bf5156784 100644
--- a/lib/raid/xor/Makefile
+++ b/lib/raid/xor/Makefile
@@ -19,7 +19,8 @@ xor-$(CONFIG_ARM) += arm/xor.o
ifeq ($(CONFIG_ARM),y)
xor-$(CONFIG_KERNEL_MODE_NEON) += arm/xor-neon.o arm/xor-neon-glue.o
endif
-xor-$(CONFIG_ARM64) += arm64/xor-neon.o arm64/xor-neon-glue.o
+xor-$(CONFIG_ARM64) += arm/xor-neon.o arm64/xor-neon.o \
+ arm64/xor-neon-glue.o
xor-$(CONFIG_CPU_HAS_LSX) += loongarch/xor_simd.o
xor-$(CONFIG_CPU_HAS_LSX) += loongarch/xor_simd_glue.o
xor-$(CONFIG_ALTIVEC) += powerpc/xor_vmx.o powerpc/xor_vmx_glue.o
diff --git a/lib/raid/xor/arm/xor-neon.c b/lib/raid/xor/arm/xor-neon.c
index a3e2b4af8d36..c7c3cf634e23 100644
--- a/lib/raid/xor/arm/xor-neon.c
+++ b/lib/raid/xor/arm/xor-neon.c
@@ -173,3 +173,7 @@ static void __xor_neon_5(unsigned long bytes, unsigned long * __restrict p1,
__DO_XOR_BLOCKS(neon_inner, __xor_neon_2, __xor_neon_3, __xor_neon_4,
__xor_neon_5);
+
+#ifdef CONFIG_ARM64
+extern typeof(__xor_neon_2) __xor_eor3_2 __alias(__xor_neon_2);
+#endif
diff --git a/lib/raid/xor/arm64/xor-neon.c b/lib/raid/xor/arm64/xor-neon.c
index 97ef3cb92496..e44016c363f1 100644
--- a/lib/raid/xor/arm64/xor-neon.c
+++ b/lib/raid/xor/arm64/xor-neon.c
@@ -1,8 +1,4 @@
// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Authors: Jackie Liu <liuyun01@kylinos.cn>
- * Copyright (C) 2018,Tianjin KYLIN Information Technology Co., Ltd.
- */
#include <linux/cache.h>
#include <asm/neon-intrinsics.h>
@@ -10,170 +6,8 @@
#include "xor_arch.h"
#include "xor-neon.h"
-static void __xor_neon_2(unsigned long bytes, unsigned long * __restrict p1,
- const unsigned long * __restrict p2)
-{
- uint64_t *dp1 = (uint64_t *)p1;
- uint64_t *dp2 = (uint64_t *)p2;
-
- register uint64x2_t v0, v1, v2, v3;
- long lines = bytes / (sizeof(uint64x2_t) * 4);
-
- do {
- /* p1 ^= p2 */
- v0 = veorq_u64(vld1q_u64(dp1 + 0), vld1q_u64(dp2 + 0));
- v1 = veorq_u64(vld1q_u64(dp1 + 2), vld1q_u64(dp2 + 2));
- v2 = veorq_u64(vld1q_u64(dp1 + 4), vld1q_u64(dp2 + 4));
- v3 = veorq_u64(vld1q_u64(dp1 + 6), vld1q_u64(dp2 + 6));
-
- /* store */
- vst1q_u64(dp1 + 0, v0);
- vst1q_u64(dp1 + 2, v1);
- vst1q_u64(dp1 + 4, v2);
- vst1q_u64(dp1 + 6, v3);
-
- dp1 += 8;
- dp2 += 8;
- } while (--lines > 0);
-}
-
-static void __xor_neon_3(unsigned long bytes, unsigned long * __restrict p1,
- const unsigned long * __restrict p2,
- const unsigned long * __restrict p3)
-{
- uint64_t *dp1 = (uint64_t *)p1;
- uint64_t *dp2 = (uint64_t *)p2;
- uint64_t *dp3 = (uint64_t *)p3;
-
- register uint64x2_t v0, v1, v2, v3;
- long lines = bytes / (sizeof(uint64x2_t) * 4);
-
- do {
- /* p1 ^= p2 */
- v0 = veorq_u64(vld1q_u64(dp1 + 0), vld1q_u64(dp2 + 0));
- v1 = veorq_u64(vld1q_u64(dp1 + 2), vld1q_u64(dp2 + 2));
- v2 = veorq_u64(vld1q_u64(dp1 + 4), vld1q_u64(dp2 + 4));
- v3 = veorq_u64(vld1q_u64(dp1 + 6), vld1q_u64(dp2 + 6));
-
- /* p1 ^= p3 */
- v0 = veorq_u64(v0, vld1q_u64(dp3 + 0));
- v1 = veorq_u64(v1, vld1q_u64(dp3 + 2));
- v2 = veorq_u64(v2, vld1q_u64(dp3 + 4));
- v3 = veorq_u64(v3, vld1q_u64(dp3 + 6));
-
- /* store */
- vst1q_u64(dp1 + 0, v0);
- vst1q_u64(dp1 + 2, v1);
- vst1q_u64(dp1 + 4, v2);
- vst1q_u64(dp1 + 6, v3);
-
- dp1 += 8;
- dp2 += 8;
- dp3 += 8;
- } while (--lines > 0);
-}
-
-static void __xor_neon_4(unsigned long bytes, unsigned long * __restrict p1,
- const unsigned long * __restrict p2,
- const unsigned long * __restrict p3,
- const unsigned long * __restrict p4)
-{
- uint64_t *dp1 = (uint64_t *)p1;
- uint64_t *dp2 = (uint64_t *)p2;
- uint64_t *dp3 = (uint64_t *)p3;
- uint64_t *dp4 = (uint64_t *)p4;
-
- register uint64x2_t v0, v1, v2, v3;
- long lines = bytes / (sizeof(uint64x2_t) * 4);
-
- do {
- /* p1 ^= p2 */
- v0 = veorq_u64(vld1q_u64(dp1 + 0), vld1q_u64(dp2 + 0));
- v1 = veorq_u64(vld1q_u64(dp1 + 2), vld1q_u64(dp2 + 2));
- v2 = veorq_u64(vld1q_u64(dp1 + 4), vld1q_u64(dp2 + 4));
- v3 = veorq_u64(vld1q_u64(dp1 + 6), vld1q_u64(dp2 + 6));
-
- /* p1 ^= p3 */
- v0 = veorq_u64(v0, vld1q_u64(dp3 + 0));
- v1 = veorq_u64(v1, vld1q_u64(dp3 + 2));
- v2 = veorq_u64(v2, vld1q_u64(dp3 + 4));
- v3 = veorq_u64(v3, vld1q_u64(dp3 + 6));
-
- /* p1 ^= p4 */
- v0 = veorq_u64(v0, vld1q_u64(dp4 + 0));
- v1 = veorq_u64(v1, vld1q_u64(dp4 + 2));
- v2 = veorq_u64(v2, vld1q_u64(dp4 + 4));
- v3 = veorq_u64(v3, vld1q_u64(dp4 + 6));
-
- /* store */
- vst1q_u64(dp1 + 0, v0);
- vst1q_u64(dp1 + 2, v1);
- vst1q_u64(dp1 + 4, v2);
- vst1q_u64(dp1 + 6, v3);
-
- dp1 += 8;
- dp2 += 8;
- dp3 += 8;
- dp4 += 8;
- } while (--lines > 0);
-}
-
-static void __xor_neon_5(unsigned long bytes, unsigned long * __restrict p1,
- const unsigned long * __restrict p2,
- const unsigned long * __restrict p3,
- const unsigned long * __restrict p4,
- const unsigned long * __restrict p5)
-{
- uint64_t *dp1 = (uint64_t *)p1;
- uint64_t *dp2 = (uint64_t *)p2;
- uint64_t *dp3 = (uint64_t *)p3;
- uint64_t *dp4 = (uint64_t *)p4;
- uint64_t *dp5 = (uint64_t *)p5;
-
- register uint64x2_t v0, v1, v2, v3;
- long lines = bytes / (sizeof(uint64x2_t) * 4);
-
- do {
- /* p1 ^= p2 */
- v0 = veorq_u64(vld1q_u64(dp1 + 0), vld1q_u64(dp2 + 0));
- v1 = veorq_u64(vld1q_u64(dp1 + 2), vld1q_u64(dp2 + 2));
- v2 = veorq_u64(vld1q_u64(dp1 + 4), vld1q_u64(dp2 + 4));
- v3 = veorq_u64(vld1q_u64(dp1 + 6), vld1q_u64(dp2 + 6));
-
- /* p1 ^= p3 */
- v0 = veorq_u64(v0, vld1q_u64(dp3 + 0));
- v1 = veorq_u64(v1, vld1q_u64(dp3 + 2));
- v2 = veorq_u64(v2, vld1q_u64(dp3 + 4));
- v3 = veorq_u64(v3, vld1q_u64(dp3 + 6));
-
- /* p1 ^= p4 */
- v0 = veorq_u64(v0, vld1q_u64(dp4 + 0));
- v1 = veorq_u64(v1, vld1q_u64(dp4 + 2));
- v2 = veorq_u64(v2, vld1q_u64(dp4 + 4));
- v3 = veorq_u64(v3, vld1q_u64(dp4 + 6));
-
- /* p1 ^= p5 */
- v0 = veorq_u64(v0, vld1q_u64(dp5 + 0));
- v1 = veorq_u64(v1, vld1q_u64(dp5 + 2));
- v2 = veorq_u64(v2, vld1q_u64(dp5 + 4));
- v3 = veorq_u64(v3, vld1q_u64(dp5 + 6));
-
- /* store */
- vst1q_u64(dp1 + 0, v0);
- vst1q_u64(dp1 + 2, v1);
- vst1q_u64(dp1 + 4, v2);
- vst1q_u64(dp1 + 6, v3);
-
- dp1 += 8;
- dp2 += 8;
- dp3 += 8;
- dp4 += 8;
- dp5 += 8;
- } while (--lines > 0);
-}
-
-__DO_XOR_BLOCKS(neon_inner, __xor_neon_2, __xor_neon_3, __xor_neon_4,
- __xor_neon_5);
+extern void __xor_eor3_2(unsigned long bytes, unsigned long * __restrict p1,
+ const unsigned long * __restrict p2);
static inline uint64x2_t eor3(uint64x2_t p, uint64x2_t q, uint64x2_t r)
{
@@ -308,5 +142,5 @@ static void __xor_eor3_5(unsigned long bytes, unsigned long * __restrict p1,
} while (--lines > 0);
}
-__DO_XOR_BLOCKS(eor3_inner, __xor_neon_2, __xor_eor3_3, __xor_eor3_4,
+__DO_XOR_BLOCKS(eor3_inner, __xor_eor3_2, __xor_eor3_3, __xor_eor3_4,
__xor_eor3_5);
--
2.53.0.1018.g2bb0e51243-goog
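The `__alias()` trick above relies on the GNU `alias` function attribute: a second linker symbol is attached to an existing function body, so the eor3 template can reference `__xor_eor3_2` without duplicating `__xor_neon_2` or paying for a call thunk. A minimal sketch with hypothetical names (assumes a GCC/Clang ELF target, where the attribute is supported):

```c
/* The real implementation, analogous to __xor_neon_2. */
int xor_word(int a, int b)
{
	return a ^ b;
}

/*
 * Second symbol resolving to the same code, analogous to the
 * __xor_eor3_2 alias added by the patch. The alias target must be
 * defined in the same translation unit.
 */
int xor_word_alias(int a, int b) __attribute__((alias("xor_word")));
```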
* [PATCH v2 5/5] ARM: Remove hacked-up asm/types.h header
2026-03-31 7:49 [PATCH v2 0/5] xor/arm: Replace vectorized version with intrinsics Ard Biesheuvel
` (3 preceding siblings ...)
2026-03-31 7:49 ` [PATCH v2 4/5] xor/arm64: Use shared NEON intrinsics implementation from 32-bit ARM Ard Biesheuvel
@ 2026-03-31 7:49 ` Ard Biesheuvel
2026-03-31 15:16 ` [PATCH v2 0/5] xor/arm: Replace vectorized version with intrinsics Christoph Hellwig
5 siblings, 0 replies; 9+ messages in thread
From: Ard Biesheuvel @ 2026-03-31 7:49 UTC (permalink / raw)
To: linux-raid
Cc: linux-arm-kernel, linux-crypto, Ard Biesheuvel, Christoph Hellwig,
Russell King, Arnd Bergmann, Eric Biggers
From: Ard Biesheuvel <ardb@kernel.org>
ARM has a special version of asm/types.h which contains overrides for
certain #define's related to the C types used to back C99 types such as
uint32_t and uintptr_t.
This is only needed when pulling in system headers such as stdint.h
during the build, and this only happens when using NEON intrinsics,
for which there is now a dedicated header file.
So drop this header entirely, and revert to the asm-generic one.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/arm/include/uapi/asm/types.h | 41 --------------------
1 file changed, 41 deletions(-)
diff --git a/arch/arm/include/uapi/asm/types.h b/arch/arm/include/uapi/asm/types.h
deleted file mode 100644
index 1a667bc26510..000000000000
--- a/arch/arm/include/uapi/asm/types.h
+++ /dev/null
@@ -1,41 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-#ifndef _UAPI_ASM_TYPES_H
-#define _UAPI_ASM_TYPES_H
-
-#include <asm-generic/int-ll64.h>
-
-/*
- * The C99 types uintXX_t that are usually defined in 'stdint.h' are not as
- * unambiguous on ARM as you would expect. For the types below, there is a
- * difference on ARM between GCC built for bare metal ARM, GCC built for glibc
- * and the kernel itself, which results in build errors if you try to build with
- * -ffreestanding and include 'stdint.h' (such as when you include 'arm_neon.h'
- * in order to use NEON intrinsics)
- *
- * As the typedefs for these types in 'stdint.h' are based on builtin defines
- * supplied by GCC, we can tweak these to align with the kernel's idea of those
- * types, so 'linux/types.h' and 'stdint.h' can be safely included from the same
- * source file (provided that -ffreestanding is used).
- *
- * int32_t uint32_t uintptr_t
- * bare metal GCC long unsigned long unsigned int
- * glibc GCC int unsigned int unsigned int
- * kernel int unsigned int unsigned long
- */
-
-#ifdef __INT32_TYPE__
-#undef __INT32_TYPE__
-#define __INT32_TYPE__ int
-#endif
-
-#ifdef __UINT32_TYPE__
-#undef __UINT32_TYPE__
-#define __UINT32_TYPE__ unsigned int
-#endif
-
-#ifdef __UINTPTR_TYPE__
-#undef __UINTPTR_TYPE__
-#define __UINTPTR_TYPE__ unsigned long
-#endif
-
-#endif /* _UAPI_ASM_TYPES_H */
--
2.53.0.1018.g2bb0e51243-goog
* Re: [PATCH v2 0/5] xor/arm: Replace vectorized version with intrinsics
2026-03-31 7:49 [PATCH v2 0/5] xor/arm: Replace vectorized version with intrinsics Ard Biesheuvel
` (4 preceding siblings ...)
2026-03-31 7:49 ` [PATCH v2 5/5] ARM: Remove hacked-up asm/types.h header Ard Biesheuvel
@ 2026-03-31 15:16 ` Christoph Hellwig
2026-03-31 15:26 ` Ard Biesheuvel
5 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2026-03-31 15:16 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-raid, linux-arm-kernel, linux-crypto, Ard Biesheuvel,
Christoph Hellwig, Russell King, Arnd Bergmann, Eric Biggers
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
I think some of the intrinsics patches were also in your crc64 series,
so I'm not sure how best to merge this.
* Re: [PATCH v2 0/5] xor/arm: Replace vectorized version with intrinsics
2026-03-31 15:16 ` [PATCH v2 0/5] xor/arm: Replace vectorized version with intrinsics Christoph Hellwig
@ 2026-03-31 15:26 ` Ard Biesheuvel
2026-03-31 15:28 ` Christoph Hellwig
0 siblings, 1 reply; 9+ messages in thread
From: Ard Biesheuvel @ 2026-03-31 15:26 UTC (permalink / raw)
To: Christoph Hellwig, Ard Biesheuvel
Cc: linux-raid, linux-arm-kernel, linux-crypto, Russell King,
Arnd Bergmann, Eric Biggers
On Tue, 31 Mar 2026, at 17:16, Christoph Hellwig wrote:
> Looks good:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
Thanks.
> I think some of the intrinsics patches were also in your crc64 series,
> so I'm not sure how to be merge this.
The first patch is used by both series, yes. If this is good to go, we might as well just merge it, and defer the crc work (or at least the 32-bit ARM specific changes) to the next cycle. I am in no particular hurry with any of this, so whatever works for other people is fine with me.
The RAID pieces are going through akpm's tree, right?
Eric, any preferences? (assuming you are on board with the CRC64 changes in the first place)
* Re: [PATCH v2 0/5] xor/arm: Replace vectorized version with intrinsics
2026-03-31 15:26 ` Ard Biesheuvel
@ 2026-03-31 15:28 ` Christoph Hellwig
0 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2026-03-31 15:28 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Christoph Hellwig, Ard Biesheuvel, linux-raid, linux-arm-kernel,
linux-crypto, Russell King, Arnd Bergmann, Eric Biggers
On Tue, Mar 31, 2026 at 05:26:33PM +0200, Ard Biesheuvel wrote:
> The RAID pieces are going through akpm's tree, right?
Yes.