* [RFC v3 PATCH 0/7] ARM[64]: kernel mode NEON in atomic contexts
From: Ard Biesheuvel @ 2013-10-13 12:14 UTC
To: linux-arm-kernel
Take #3 of this RFC series.
Instead of having additional separate versions of kernel_neon_begin/end, the
existing ones have now been modified to always take a preallocated stack area
as an argument.
The stack area is allocated by DEFINE_NEON_REGSTACK[_PARTIAL](varname), where
the partial version takes an additional int num_regs indicating how many
registers need to be freed up.
In the !in_interrupt() case, these functions operate as before; the regstack
is then defined to minimal size, as it will remain unused anyway. In the
in_interrupt() case, 'num_regs' (or all) NEON registers are stacked/unstacked
using the allocated stack region.
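For example, a typical user of the new API looks roughly like this (sketch
only; crypto_op() and its NEON inner function are made-up names):

    static void crypto_op(u8 *dst, const u8 *src)
    {
            DEFINE_NEON_REGSTACK(s);     /* room for all NEON registers */

            kernel_neon_begin(s);
            crypto_op_neon(dst, src);    /* may now run in any context */
            kernel_neon_end(s);
    }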
Patches #1 and #4 implement the above for ARM and ARM64, respectively. Patch #3
implements the optimization for ARM64 suggested by Catalin: as ARM64 has no
lazy restore, we would otherwise potentially emit lots of unnecessary
stack/unstack sequences.
The remaining patches are existing or new users of this API, for reference.
Ard Biesheuvel (7):
ARM: add support for kernel mode NEON in atomic context
ARM: port NEON version of xor_blocks() to new kmode NEON api
ARM64: defer reloading a task's FPSIMD state to userland resume
ARM64: add support for kernel mode NEON in atomic context
ARM64: add Crypto Extensions based synchronous core AES cipher
ARM64: add Crypto Extensions based synchronous AES in CCM mode
lib/raid6: port NEON implementation to updated kmode NEON api
arch/arm/include/asm/fpstate.h | 12 +
arch/arm/include/asm/neon.h | 32 ++-
arch/arm/include/asm/xor.h | 48 ++--
arch/arm/vfp/vfphw.S | 45 ++++
arch/arm/vfp/vfpmodule.c | 55 +++--
arch/arm64/Makefile | 11 +-
arch/arm64/crypto/Makefile | 14 ++
arch/arm64/crypto/aes-sync.c | 453 ++++++++++++++++++++++++++++++++++
arch/arm64/crypto/aesce-ccm.S | 186 ++++++++++++++
arch/arm64/include/asm/fpsimd.h | 17 ++
arch/arm64/include/asm/fpsimdmacros.h | 35 +++
arch/arm64/include/asm/neon.h | 31 ++-
arch/arm64/include/asm/thread_info.h | 4 +-
arch/arm64/kernel/entry-fpsimd.S | 24 ++
arch/arm64/kernel/entry.S | 2 +-
arch/arm64/kernel/fpsimd.c | 34 +--
arch/arm64/kernel/signal.c | 2 +
lib/raid6/neon.c | 9 +-
18 files changed, 932 insertions(+), 82 deletions(-)
create mode 100644 arch/arm64/crypto/Makefile
create mode 100644 arch/arm64/crypto/aes-sync.c
create mode 100644 arch/arm64/crypto/aesce-ccm.S
--
1.8.1.2
* [RFC v3 PATCH 1/7] ARM: add support for kernel mode NEON in atomic context
From: Ard Biesheuvel @ 2013-10-13 12:14 UTC
To: linux-arm-kernel
Some applications, such as WPA CCMP encryption, do substantial
amounts of work in non-process context. In order to support
accelerated NEON implementations under these circumstances, we
need a way to preserve the NEON context that may
(a) belong to a completely unrelated userland process (if the
NEON unit is currently turned off);
(b) belong to current userland;
(c) belong to current kernel mode in process context.
The best way to deal with this is to just stack whatever registers
we are going to use, and unstack them when we are done.
This patch modifies kernel_neon_begin() and kernel_neon_end(), so
they may be called from any context. To address the in_interrupt()
case, they now both take a parameter defined by DEFINE_NEON_REGSTACK()
or DEFINE_NEON_REGSTACK_PARTIAL() [in case only a few NEON registers
are in fact used]. The !in_interrupt() case is unchanged from before.
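To illustrate the sizing (hypothetical caller, names of my own choosing):
a softirq path that clobbers only q0-q2 would declare

    DEFINE_NEON_REGSTACK_PARTIAL(s, 3);

which reserves 16 * 4 = 64 bytes of stack when in_interrupt() is true (the
count is rounded up to an even number of Q registers, as they are stacked
in pairs), and 0 bytes in process context, where the regular
preserve/restore path is taken and the area remains unused.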
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm/include/asm/fpstate.h | 12 +++++++++
arch/arm/include/asm/neon.h | 32 +++++++++++++++++++++---
arch/arm/vfp/vfphw.S | 45 ++++++++++++++++++++++++++++++++++
arch/arm/vfp/vfpmodule.c | 55 ++++++++++++++++++++++++------------------
4 files changed, 118 insertions(+), 26 deletions(-)
diff --git a/arch/arm/include/asm/fpstate.h b/arch/arm/include/asm/fpstate.h
index 3ad4c10..0471c36 100644
--- a/arch/arm/include/asm/fpstate.h
+++ b/arch/arm/include/asm/fpstate.h
@@ -52,6 +52,18 @@ union vfp_state {
extern void vfp_flush_thread(union vfp_state *);
extern void vfp_release_thread(union vfp_state *);
+/*
+ * Variable sized struct for stacking the bottom 'n' NEON registers.
+ */
+struct vfp_partial_state {
+ __u32 fpexc;
+ __u32 fpscr;
+ __u8 qregs[] __aligned(16);
+} __aligned(16);
+
+extern void vfp_load_partial_state(struct vfp_partial_state *, u32 num_regs);
+extern void vfp_save_partial_state(struct vfp_partial_state *, u32 num_regs);
+
#define FP_HARD_SIZE 35
struct fp_hard_struct {
diff --git a/arch/arm/include/asm/neon.h b/arch/arm/include/asm/neon.h
index 8f730fe..800d85c 100644
--- a/arch/arm/include/asm/neon.h
+++ b/arch/arm/include/asm/neon.h
@@ -8,10 +8,30 @@
* published by the Free Software Foundation.
*/
+#include <linux/types.h>
+#include <linux/hardirq.h>
+#include <asm/fpstate.h>
#include <asm/hwcap.h>
#define cpu_has_neon() (!!(elf_hwcap & HWCAP_NEON))
+/*
+ * Avoid wasting stack space by making the size of the allocated area depend on
+ * whether we are currently running in process context. (If this is the case, we
+ * will use the normal preserve/restore mechanism, leaving the allocated stack
+ * space unused.)
+ */
+#define __QREG_SIZE(num) \
+ ((!in_interrupt()) ? 0 : (num) > 16 ? 256 : 16 * (((num) + 1) & ~1U))
+
+#define DEFINE_NEON_REGSTACK_PARTIAL(v, num) \
+ struct { \
+ struct vfp_partial_state regs; \
+ u8 qregs[__QREG_SIZE(num)]; \
+ } v
+
+#define DEFINE_NEON_REGSTACK(name) DEFINE_NEON_REGSTACK_PARTIAL(name, 16)
+
#ifdef __ARM_NEON__
/*
@@ -27,10 +47,16 @@
* -mfpu=neon is set.
*/
-#define kernel_neon_begin() \
+#define kernel_neon_begin(p) \
BUILD_BUG_ON_MSG(1, "kernel_neon_begin() called from NEON code")
#else
-void kernel_neon_begin(void);
+#define kernel_neon_begin(p) \
+ __kernel_neon_begin(&(p).regs, sizeof((p).qregs)/16)
#endif
-void kernel_neon_end(void);
+
+#define kernel_neon_end(p) \
+ __kernel_neon_end(&(p).regs, sizeof((p).qregs)/16)
+
+void __kernel_neon_begin(struct vfp_partial_state *regs, u32 num_regs);
+void __kernel_neon_end(struct vfp_partial_state *regs, u32 num_regs);
diff --git a/arch/arm/vfp/vfphw.S b/arch/arm/vfp/vfphw.S
index 3e5d311..28384a5 100644
--- a/arch/arm/vfp/vfphw.S
+++ b/arch/arm/vfp/vfphw.S
@@ -322,3 +322,48 @@ ENTRY(vfp_put_double)
.endr
#endif
ENDPROC(vfp_put_double)
+
+
+#ifdef CONFIG_KERNEL_MODE_NEON
+
+ .fpu neon
+ENTRY(vfp_save_partial_state)
+ VFPFMRX r2, FPEXC @ load the control registers
+ tst r2, #FPEXC_EN
+ str r2, [r0] @ save to memory
+ bne 0f
+ orr r2, r2, #FPEXC_EN @ enable VFP if it was disabled
+ VFPFMXR FPEXC, r2
+0: VFPFMRX r3, FPSCR
+ str r3, [r0, #4] @ save to memory
+ rsbs r1, r1, #16
+ add r2, r0, #16
+ beq 1f
+ adr r3, 1f
+ add r3, r3, r1, lsl #1
+THUMB( orr r3, r3, #1)
+ bx r3
+1: .irp qq,q14-q15,q12-q13,q10-q11,q8-q9,q6-q7,q4-q5,q2-q3,q0-q1
+ vst1.8 {\qq}, [r2,:128]!
+ .endr
+ bx lr
+ENDPROC(vfp_save_partial_state)
+
+ENTRY(vfp_load_partial_state)
+ rsbs r1, r1, #16
+ add r2, r0, #16
+ beq 0f
+ adr r3, 0f
+ add r3, r3, r1, lsl #1
+THUMB( orr r3, r3, #1)
+ bx r3
+0: .irp qq,q14-q15,q12-q13,q10-q11,q8-q9,q6-q7,q4-q5,q2-q3,q0-q1
+ vld1.8 {\qq}, [r2,:128]!
+ .endr
+ ldrd r2, r3, [r0]
+ VFPFMXR FPSCR, r3
+ VFPFMXR FPEXC, r2
+ bx lr
+ENDPROC(vfp_load_partial_state)
+
+#endif
diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
index 52b8f40..b924a5b 100644
--- a/arch/arm/vfp/vfpmodule.c
+++ b/arch/arm/vfp/vfpmodule.c
@@ -674,44 +674,53 @@ void vfp_kmode_exception(void)
/*
* Kernel-side NEON support functions
*/
-void kernel_neon_begin(void)
+void __kernel_neon_begin(struct vfp_partial_state *regs, u32 num_regs)
{
struct thread_info *thread = current_thread_info();
unsigned int cpu;
u32 fpexc;
/*
- * Kernel mode NEON is only allowed outside of interrupt context
- * with preemption disabled. This will make sure that the kernel
- * mode NEON register contents never need to be preserved.
+ * If running in non-process context, we just stack whatever registers
- * the caller has indicated it needs. Otherwise, do a regular preserve
+ * of the userland context.
*/
- BUG_ON(in_interrupt());
- cpu = get_cpu();
+ if (in_interrupt()) {
+ BUG_ON(!num_regs);
+ vfp_save_partial_state(regs, num_regs);
+ } else {
+ cpu = get_cpu();
- fpexc = fmrx(FPEXC) | FPEXC_EN;
- fmxr(FPEXC, fpexc);
+ fpexc = fmrx(FPEXC) | FPEXC_EN;
+ fmxr(FPEXC, fpexc);
- /*
- * Save the userland NEON/VFP state. Under UP,
- * the owner could be a task other than 'current'
- */
- if (vfp_state_in_hw(cpu, thread))
- vfp_save_state(&thread->vfpstate, fpexc);
+ /*
+ * Save the userland NEON/VFP state. Under UP,
+ * the owner could be a task other than 'current'
+ */
+ if (vfp_state_in_hw(cpu, thread))
+ vfp_save_state(&thread->vfpstate, fpexc);
#ifndef CONFIG_SMP
- else if (vfp_current_hw_state[cpu] != NULL)
- vfp_save_state(vfp_current_hw_state[cpu], fpexc);
+ else if (vfp_current_hw_state[cpu] != NULL)
+ vfp_save_state(vfp_current_hw_state[cpu], fpexc);
#endif
- vfp_current_hw_state[cpu] = NULL;
+ vfp_current_hw_state[cpu] = NULL;
+ }
}
-EXPORT_SYMBOL(kernel_neon_begin);
+EXPORT_SYMBOL(__kernel_neon_begin);
-void kernel_neon_end(void)
+void __kernel_neon_end(struct vfp_partial_state *regs, u32 num_regs)
{
- /* Disable the NEON/VFP unit. */
- fmxr(FPEXC, fmrx(FPEXC) & ~FPEXC_EN);
- put_cpu();
+ if (in_interrupt()) {
+ BUG_ON(!num_regs);
+ vfp_load_partial_state(regs, num_regs);
+ } else {
+ /* Disable the NEON/VFP unit. */
+ fmxr(FPEXC, fmrx(FPEXC) & ~FPEXC_EN);
+ put_cpu();
+ }
}
-EXPORT_SYMBOL(kernel_neon_end);
+EXPORT_SYMBOL(__kernel_neon_end);
#endif /* CONFIG_KERNEL_MODE_NEON */
--
1.8.1.2
* [RFC v3 PATCH 2/7] ARM: port NEON version of xor_blocks() to new kmode NEON api
From: Ard Biesheuvel @ 2013-10-13 12:14 UTC
To: linux-arm-kernel
It is now permissible to use the NEON in non-process context, so
update the XOR code to use the NEON version in non-process context
as well.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm/include/asm/xor.h | 48 +++++++++++++++++++---------------------------
1 file changed, 20 insertions(+), 28 deletions(-)
diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
index 4ffb26d..1bda8b5 100644
--- a/arch/arm/include/asm/xor.h
+++ b/arch/arm/include/asm/xor.h
@@ -151,52 +151,44 @@ extern struct xor_block_template const xor_block_neon_inner;
static void
xor_neon_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
{
- if (in_interrupt()) {
- xor_arm4regs_2(bytes, p1, p2);
- } else {
- kernel_neon_begin();
- xor_block_neon_inner.do_2(bytes, p1, p2);
- kernel_neon_end();
- }
+ DEFINE_NEON_REGSTACK(s);
+
+ kernel_neon_begin(s);
+ xor_block_neon_inner.do_2(bytes, p1, p2);
+ kernel_neon_end(s);
}
static void
xor_neon_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
unsigned long *p3)
{
- if (in_interrupt()) {
- xor_arm4regs_3(bytes, p1, p2, p3);
- } else {
- kernel_neon_begin();
- xor_block_neon_inner.do_3(bytes, p1, p2, p3);
- kernel_neon_end();
- }
+ DEFINE_NEON_REGSTACK(s);
+
+ kernel_neon_begin(s);
+ xor_block_neon_inner.do_3(bytes, p1, p2, p3);
+ kernel_neon_end(s);
}
static void
xor_neon_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
unsigned long *p3, unsigned long *p4)
{
- if (in_interrupt()) {
- xor_arm4regs_4(bytes, p1, p2, p3, p4);
- } else {
- kernel_neon_begin();
- xor_block_neon_inner.do_4(bytes, p1, p2, p3, p4);
- kernel_neon_end();
- }
+ DEFINE_NEON_REGSTACK(s);
+
+ kernel_neon_begin(s);
+ xor_block_neon_inner.do_4(bytes, p1, p2, p3, p4);
+ kernel_neon_end(s);
}
static void
xor_neon_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
unsigned long *p3, unsigned long *p4, unsigned long *p5)
{
- if (in_interrupt()) {
- xor_arm4regs_5(bytes, p1, p2, p3, p4, p5);
- } else {
- kernel_neon_begin();
- xor_block_neon_inner.do_5(bytes, p1, p2, p3, p4, p5);
- kernel_neon_end();
- }
+ DEFINE_NEON_REGSTACK(s);
+
+ kernel_neon_begin(s);
+ xor_block_neon_inner.do_5(bytes, p1, p2, p3, p4, p5);
+ kernel_neon_end(s);
}
static struct xor_block_template xor_block_neon = {
--
1.8.1.2
* [RFC v3 PATCH 3/7] ARM64: defer reloading a task's FPSIMD state to userland resume
From: Ard Biesheuvel @ 2013-10-13 12:14 UTC
To: linux-arm-kernel
Modify kernel_neon_begin() and kernel_neon_end() so subsequent calls
don't need to preserve/restore the userland FPSIMD state if the task
has not entered userland in the meantime.
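The resulting flow is roughly the following (illustration, not code from
this patch):

    kernel_neon_begin();    /* saves the user FPSIMD state once and
                               sets TIF_RELOAD_FPSTATE */
    kernel_neon_end();
    kernel_neon_begin();    /* flag still set -> no second save */
    kernel_neon_end();
    ...
    /* on the way back to userland, do_notify_resume() observes
       TIF_RELOAD_FPSTATE and reloads the task's FPSIMD state once */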
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/include/asm/thread_info.h | 4 +++-
arch/arm64/kernel/entry.S | 2 +-
arch/arm64/kernel/fpsimd.c | 7 ++-----
arch/arm64/kernel/signal.c | 2 ++
4 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 23a3c47..3bdeab6 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -106,6 +106,7 @@ static inline struct thread_info *current_thread_info(void)
#define TIF_SIGPENDING 0
#define TIF_NEED_RESCHED 1
#define TIF_NOTIFY_RESUME 2 /* callback before returning to user */
+#define TIF_RELOAD_FPSTATE 3 /* user FPSIMD context saved to mem */
#define TIF_SYSCALL_TRACE 8
#define TIF_POLLING_NRFLAG 16
#define TIF_MEMDIE 18 /* is terminating due to OOM killer */
@@ -118,10 +119,11 @@ static inline struct thread_info *current_thread_info(void)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
+#define _TIF_RELOAD_FPSTATE (1 << TIF_RELOAD_FPSTATE)
#define _TIF_32BIT (1 << TIF_32BIT)
#define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
- _TIF_NOTIFY_RESUME)
+ _TIF_NOTIFY_RESUME | _TIF_RELOAD_FPSTATE)
#endif /* __KERNEL__ */
#endif /* __ASM_THREAD_INFO_H */
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 3881fd1..2c6c7fb 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -589,7 +589,7 @@ fast_work_pending:
str x0, [sp, #S_X0] // returned x0
work_pending:
tbnz x1, #TIF_NEED_RESCHED, work_resched
- /* TIF_SIGPENDING or TIF_NOTIFY_RESUME case */
+ /* TIF_SIGPENDING/TIF_NOTIFY_RESUME/TIF_RELOAD_FPSTATE case */
ldr x2, [sp, #S_PSTATE]
mov x0, sp // 'regs'
tst x2, #PSR_MODE_MASK // user mode regs?
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 1f2e4d5..a52affd 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -72,7 +72,7 @@ void do_fpsimd_exc(unsigned int esr, struct pt_regs *regs)
void fpsimd_thread_switch(struct task_struct *next)
{
/* check if not kernel threads */
- if (current->mm)
+ if (current->mm && !test_and_clear_thread_flag(TIF_RELOAD_FPSTATE))
fpsimd_save_state(&current->thread.fpsimd_state);
if (next->mm)
fpsimd_load_state(&next->thread.fpsimd_state);
@@ -95,16 +95,13 @@ void kernel_neon_begin(void)
BUG_ON(in_interrupt());
preempt_disable();
- if (current->mm)
+ if (current->mm && !test_and_set_thread_flag(TIF_RELOAD_FPSTATE))
fpsimd_save_state(&current->thread.fpsimd_state);
}
EXPORT_SYMBOL(kernel_neon_begin);
void kernel_neon_end(void)
{
- if (current->mm)
- fpsimd_load_state(&current->thread.fpsimd_state);
-
preempt_enable();
}
EXPORT_SYMBOL(kernel_neon_end);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 890a591..da3a433 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -416,4 +416,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
}
+ if (test_and_clear_thread_flag(TIF_RELOAD_FPSTATE))
+ fpsimd_load_state(&current->thread.fpsimd_state);
}
--
1.8.1.2
* [RFC v3 PATCH 4/7] ARM64: add support for kernel mode NEON in atomic context
From: Ard Biesheuvel @ 2013-10-13 12:15 UTC
To: linux-arm-kernel
This patch modifies kernel_neon_begin() and kernel_neon_end(), so
they may be called from any context. To address the in_interrupt()
case, they now both take a parameter defined by DEFINE_NEON_REGSTACK()
or DEFINE_NEON_REGSTACK_PARTIAL() [in case only a few NEON registers
are in fact used]. The !in_interrupt() case is unchanged from before.
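Patch #5 uses this as follows (taken from the aes-sync.c code later in
this series):

    DEFINE_NEON_REGSTACK_PARTIAL(regs, 2);

    kernel_neon_begin(regs);
    /* AES round instructions using v0/v1 only */
    kernel_neon_end(regs);

reserving 32 bytes of stack in interrupt context and none otherwise.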
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/include/asm/fpsimd.h | 17 +++++++++++++++++
arch/arm64/include/asm/fpsimdmacros.h | 35 +++++++++++++++++++++++++++++++++++
arch/arm64/include/asm/neon.h | 31 +++++++++++++++++++++++++++++--
arch/arm64/kernel/entry-fpsimd.S | 24 ++++++++++++++++++++++++
arch/arm64/kernel/fpsimd.c | 29 ++++++++++++++++++-----------
5 files changed, 123 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h
index c43b4ac..755bdf1 100644
--- a/arch/arm64/include/asm/fpsimd.h
+++ b/arch/arm64/include/asm/fpsimd.h
@@ -39,6 +39,18 @@ struct fpsimd_state {
};
};
+/*
+ * Variable sized struct for stacking the bottom 'n' FP/SIMD registers.
+ * Mainly intended for kernel use of v8 Crypto Extensions which only
+ * needs a few registers and may need to execute in atomic context.
+ */
+struct fpsimd_partial_state {
+ u32 fpsr;
+ u32 fpcr;
+ __uint128_t vregs[] __aligned(16);
+} __aligned(16);
+
+
#if defined(__KERNEL__) && defined(CONFIG_COMPAT)
/* Masks for extracting the FPSR and FPCR from the FPSCR */
#define VFP_FPSCR_STAT_MASK 0xf800009f
@@ -55,6 +67,11 @@ struct task_struct;
extern void fpsimd_save_state(struct fpsimd_state *state);
extern void fpsimd_load_state(struct fpsimd_state *state);
+extern void fpsimd_save_partial_state(struct fpsimd_partial_state *state,
+ u32 num_regs);
+extern void fpsimd_load_partial_state(struct fpsimd_partial_state *state,
+ u32 num_regs);
+
extern void fpsimd_thread_switch(struct task_struct *next);
extern void fpsimd_flush_thread(void);
diff --git a/arch/arm64/include/asm/fpsimdmacros.h b/arch/arm64/include/asm/fpsimdmacros.h
index bbec599..f771b69 100644
--- a/arch/arm64/include/asm/fpsimdmacros.h
+++ b/arch/arm64/include/asm/fpsimdmacros.h
@@ -62,3 +62,38 @@
ldr w\tmpnr, [\state, #16 * 2 + 4]
msr fpcr, x\tmpnr
.endm
+
+.altmacro
+.macro q2op, op, q1, q2, state
+ \op q\q1, q\q2, [\state, #-(16 * \q1) - 16]
+.endm
+
+.macro fpsimd_save_partial state, num, tmpnr1, tmpnr2
+ mrs x\tmpnr1, fpsr
+ mrs x\tmpnr2, fpcr
+ stp w\tmpnr1, w\tmpnr2, [\state]
+ adr x\tmpnr1, 0f
+ add \state, \state, \num, lsl #4
+ sub x\tmpnr1, x\tmpnr1, \num, lsl #1
+ br x\tmpnr1
+ .irp qa, 30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 0
+ qb = \qa + 1
+ q2op stp, \qa, %qb, \state
+ .endr
+0:
+.endm
+
+.macro fpsimd_restore_partial state, num, tmpnr1, tmpnr2
+ ldp w\tmpnr1, w\tmpnr2, [\state]
+ msr fpsr, x\tmpnr1
+ msr fpcr, x\tmpnr2
+ adr x\tmpnr1, 0f
+ add \state, \state, \num, lsl #4
+ sub x\tmpnr1, x\tmpnr1, \num, lsl #1
+ br x\tmpnr1
+ .irp qa, 30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 0
+ qb = \qa + 1
+ q2op ldp, \qa, %qb, \state
+ .endr
+0:
+.endm
diff --git a/arch/arm64/include/asm/neon.h b/arch/arm64/include/asm/neon.h
index b0cc58a9..e496dce 100644
--- a/arch/arm64/include/asm/neon.h
+++ b/arch/arm64/include/asm/neon.h
@@ -8,7 +8,34 @@
* published by the Free Software Foundation.
*/
+#include <linux/hardirq.h>
+#include <linux/types.h>
+#include <asm/fpsimd.h>
+
#define cpu_has_neon() (1)
-void kernel_neon_begin(void);
-void kernel_neon_end(void);
+/*
+ * Avoid wasting stack space by making the size of the allocated area depend on
+ * whether we are currently running in process context. (If this is the case, we
+ * will use the normal preserve/restore mechanism, leaving the allocated stack
+ * space unused.)
+ */
+#define __VREG_SIZE(num) \
+ ((!in_interrupt()) ? 0 : (num) > 32 ? 512 : 16 * (((num) + 1) & ~1U))
+
+#define DEFINE_NEON_REGSTACK_PARTIAL(v, num) \
+ struct { \
+ struct fpsimd_partial_state regs; \
+ u8 vregs[__VREG_SIZE(num)]; \
+ } v
+
+#define DEFINE_NEON_REGSTACK(name) DEFINE_NEON_REGSTACK_PARTIAL(name, 32)
+
+#define kernel_neon_begin(p) \
+ __kernel_neon_begin(&(p).regs, sizeof((p).vregs)/16)
+
+#define kernel_neon_end(p) \
+ __kernel_neon_end(&(p).regs, sizeof((p).vregs)/16)
+
+void __kernel_neon_begin(struct fpsimd_partial_state *regs, u32 num_regs);
+void __kernel_neon_end(struct fpsimd_partial_state *regs, u32 num_regs);
diff --git a/arch/arm64/kernel/entry-fpsimd.S b/arch/arm64/kernel/entry-fpsimd.S
index 6a27cd6..aa73ee9 100644
--- a/arch/arm64/kernel/entry-fpsimd.S
+++ b/arch/arm64/kernel/entry-fpsimd.S
@@ -41,3 +41,27 @@ ENTRY(fpsimd_load_state)
fpsimd_restore x0, 8
ret
ENDPROC(fpsimd_load_state)
+
+#ifdef CONFIG_KERNEL_MODE_NEON
+
+/*
+ * Save the bottom n FP registers.
+ *
+ * x0 - pointer to struct fpsimd_partial_state, x1 - number of registers
+ */
+ENTRY(fpsimd_save_partial_state)
+ fpsimd_save_partial x0, x1, 8, 9
+ ret
+ENDPROC(fpsimd_save_partial_state)
+
+/*
+ * Load the bottom n FP registers.
+ *
+ * x0 - pointer to struct fpsimd_partial_state, x1 - number of registers
+ */
+ENTRY(fpsimd_load_partial_state)
+ fpsimd_restore_partial x0, x1, 8, 9
+ ret
+ENDPROC(fpsimd_load_partial_state)
+
+#endif
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index a52affd..34fa94b 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -89,22 +89,29 @@ void fpsimd_flush_thread(void)
/*
* Kernel-side NEON support functions
*/
-void kernel_neon_begin(void)
+void __kernel_neon_begin(struct fpsimd_partial_state *regs, u32 num_regs)
{
- /* Avoid using the NEON in interrupt context */
- BUG_ON(in_interrupt());
- preempt_disable();
-
- if (current->mm && !test_and_set_thread_flag(TIF_RELOAD_FPSTATE))
- fpsimd_save_state(&current->thread.fpsimd_state);
+ if (in_interrupt()) {
+ BUG_ON(!num_regs);
+ fpsimd_save_partial_state(regs, num_regs);
+ } else {
+ preempt_disable();
+ if (current->mm &&
+ !test_and_set_thread_flag(TIF_RELOAD_FPSTATE))
+ fpsimd_save_state(&current->thread.fpsimd_state);
+ }
}
-EXPORT_SYMBOL(kernel_neon_begin);
+EXPORT_SYMBOL(__kernel_neon_begin);
-void kernel_neon_end(void)
+void __kernel_neon_end(struct fpsimd_partial_state *regs, u32 num_regs)
{
- preempt_enable();
+ if (in_interrupt()) {
+ BUG_ON(!num_regs);
+ fpsimd_load_partial_state(regs, num_regs);
+ } else
+ preempt_enable();
}
-EXPORT_SYMBOL(kernel_neon_end);
+EXPORT_SYMBOL(__kernel_neon_end);
#endif /* CONFIG_KERNEL_MODE_NEON */
--
1.8.1.2
* [RFC v3 PATCH 5/7] ARM64: add Crypto Extensions based synchronous core AES cipher
From: Ard Biesheuvel @ 2013-10-13 12:15 UTC
To: linux-arm-kernel
This implements the core AES cipher using the Crypto Extensions,
using only NEON registers q0 and q1.
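Once registered, the cipher is reached through the regular crypto API,
e.g. (minimal sketch, error handling elided):

    struct crypto_cipher *tfm = crypto_alloc_cipher("aes", 0, 0);

    crypto_cipher_setkey(tfm, key, 16);
    crypto_cipher_encrypt_one(tfm, dst, src);  /* safe in atomic context */
    crypto_free_cipher(tfm);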
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/Makefile | 11 +++--
arch/arm64/crypto/Makefile | 14 ++++++
arch/arm64/crypto/aes-sync.c | 106 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 126 insertions(+), 5 deletions(-)
create mode 100644 arch/arm64/crypto/Makefile
create mode 100644 arch/arm64/crypto/aes-sync.c
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index d90cf79..d1ca9d8 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -36,11 +36,12 @@ TEXT_OFFSET := 0x00080000
export TEXT_OFFSET GZFLAGS
-core-y += arch/arm64/kernel/ arch/arm64/mm/
-core-$(CONFIG_KVM) += arch/arm64/kvm/
-core-$(CONFIG_XEN) += arch/arm64/xen/
-libs-y := arch/arm64/lib/ $(libs-y)
-libs-y += $(LIBGCC)
+core-y += arch/arm64/kernel/ arch/arm64/mm/
+core-$(CONFIG_KVM) += arch/arm64/kvm/
+core-$(CONFIG_XEN) += arch/arm64/xen/
+core-$(CONFIG_CRYPTO) += arch/arm64/crypto/
+libs-y := arch/arm64/lib/ $(libs-y)
+libs-y += $(LIBGCC)
# Default target when executing plain make
KBUILD_IMAGE := Image.gz
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
new file mode 100644
index 0000000..269d9be
--- /dev/null
+++ b/arch/arm64/crypto/Makefile
@@ -0,0 +1,14 @@
+#
+# linux/arch/arm64/crypto/Makefile
+#
+# Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License version 2 as
+# published by the Free Software Foundation.
+#
+
+aesce-sync-y := aes-sync.o
+obj-m += aesce-sync.o
+
+CFLAGS_aes-sync.o += -march=armv8-a+crypto
diff --git a/arch/arm64/crypto/aes-sync.c b/arch/arm64/crypto/aes-sync.c
new file mode 100644
index 0000000..5d7ed4e
--- /dev/null
+++ b/arch/arm64/crypto/aes-sync.c
@@ -0,0 +1,106 @@
+/*
+ * linux/arch/arm64/crypto/aes-sync.c
+ *
+ * Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <asm/neon.h>
+#include <crypto/aes.h>
+#include <linux/crypto.h>
+#include <linux/module.h>
+
+static void aes_cipher_encrypt(struct crypto_tfm *tfm, u8 dst[], u8 const src[])
+{
+ struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
+ u32 rounds = 6 + ctx->key_length / 4;
+ DEFINE_NEON_REGSTACK_PARTIAL(regs, 2);
+
+ kernel_neon_begin(regs);
+
+ __asm__(" ld1 {v0.16b}, [%[in]] ;"
+ " ld1 {v1.16b}, [%[key]], #16 ;"
+ "0: aese v0.16b, v1.16b ;"
+ " subs %[rounds], %[rounds], #1 ;"
+ " ld1 {v1.16b}, [%[key]], #16 ;"
+ " beq 1f ;"
+ " aesmc v0.16b, v0.16b ;"
+ " b 0b ;"
+ "1: eor v0.16b, v0.16b, v1.16b ;"
+ " st1 {v0.16b}, [%[out]] ;"
+ : :
+ [out] "r"(dst),
+ [in] "r"(src),
+ [rounds] "r"(rounds),
+ [key] "r"(ctx->key_enc)
+ : "cc");
+
+ kernel_neon_end(regs);
+}
+
+static void aes_cipher_decrypt(struct crypto_tfm *tfm, u8 dst[], u8 const src[])
+{
+ struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
+ u32 rounds = 6 + ctx->key_length / 4;
+ DEFINE_NEON_REGSTACK_PARTIAL(regs, 2);
+
+ kernel_neon_begin(regs);
+
+ __asm__(" ld1 {v0.16b}, [%[in]] ;"
+ " ld1 {v1.16b}, [%[key]], #16 ;"
+ "0: aesd v0.16b, v1.16b ;"
+ " ld1 {v1.16b}, [%[key]], #16 ;"
+ " subs %[rounds], %[rounds], #1 ;"
+ " beq 1f ;"
+ " aesimc v0.16b, v0.16b ;"
+ " b 0b ;"
+ "1: eor v0.16b, v0.16b, v1.16b ;"
+ " st1 {v0.16b}, [%[out]] ;"
+ : :
+ [out] "r"(dst),
+ [in] "r"(src),
+ [rounds] "r"(rounds),
+ [key] "r"(ctx->key_dec)
+ : "cc");
+
+ kernel_neon_end(regs);
+}
+
+static struct crypto_alg aes_alg = {
+ .cra_name = "aes",
+ .cra_driver_name = "aes-ce",
+ .cra_priority = 300,
+ .cra_flags = CRYPTO_ALG_TYPE_CIPHER,
+ .cra_blocksize = AES_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct crypto_aes_ctx),
+ .cra_module = THIS_MODULE,
+ .cra_cipher = {
+ .cia_min_keysize = AES_MIN_KEY_SIZE,
+ .cia_max_keysize = AES_MAX_KEY_SIZE,
+ .cia_setkey = crypto_aes_set_key,
+ .cia_encrypt = aes_cipher_encrypt,
+ .cia_decrypt = aes_cipher_decrypt
+ }
+};
+
+static int __init aes_mod_init(void)
+{
+ if (0) // TODO check for crypto extensions
+ return -ENODEV;
+ return crypto_register_alg(&aes_alg);
+}
+
+static void __exit aes_mod_exit(void)
+{
+ crypto_unregister_alg(&aes_alg);
+}
+
+module_init(aes_mod_init);
+module_exit(aes_mod_exit);
+
+MODULE_DESCRIPTION("Synchronous AES using ARMv8 Crypto Extensions");
+MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
+MODULE_LICENSE("GPL");
--
1.8.1.2
* [RFC v3 PATCH 6/7] ARM64: add Crypto Extensions based synchronous AES in CCM mode
From: Ard Biesheuvel @ 2013-10-13 12:15 UTC
To: linux-arm-kernel
This implements the CCM AEAD chaining mode for AES using Crypto
Extensions instructions.
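For reference, the AEAD is used roughly like this (sketch against the
current AEAD interface; scatterlist setup and error handling elided):

    struct crypto_aead *tfm = crypto_alloc_aead("ccm(aes)", 0, 0);
    struct aead_request *req = aead_request_alloc(tfm, GFP_KERNEL);

    crypto_aead_setkey(tfm, key, 16);
    crypto_aead_setauthsize(tfm, 8);
    aead_request_set_crypt(req, src_sg, dst_sg, cryptlen, iv);
    aead_request_set_assoc(req, assoc_sg, assoclen);
    crypto_aead_encrypt(req);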
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/Makefile | 2 +-
arch/arm64/crypto/aes-sync.c | 355 +++++++++++++++++++++++++++++++++++++++++-
arch/arm64/crypto/aesce-ccm.S | 186 ++++++++++++++++++++++
3 files changed, 538 insertions(+), 5 deletions(-)
create mode 100644 arch/arm64/crypto/aesce-ccm.S
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 269d9be..f15940c 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -8,7 +8,7 @@
# published by the Free Software Foundation.
#
-aesce-sync-y := aes-sync.o
+aesce-sync-y := aes-sync.o aesce-ccm.o
obj-m += aesce-sync.o
CFLAGS_aes-sync.o += -march=armv8-a+crypto
diff --git a/arch/arm64/crypto/aes-sync.c b/arch/arm64/crypto/aes-sync.c
index 5d7ed4e..0c0d0bd 100644
--- a/arch/arm64/crypto/aes-sync.c
+++ b/arch/arm64/crypto/aes-sync.c
@@ -9,7 +9,10 @@
*/
#include <asm/neon.h>
+#include <asm/unaligned.h>
#include <crypto/aes.h>
+#include <crypto/algapi.h>
+#include <crypto/scatterwalk.h>
#include <linux/crypto.h>
#include <linux/module.h>
@@ -69,7 +72,313 @@ static void aes_cipher_decrypt(struct crypto_tfm *tfm, u8 dst[], u8 const src[])
kernel_neon_end(regs);
}
-static struct crypto_alg aes_alg = {
+struct crypto_ccm_aes_ctx {
+ struct crypto_aes_ctx *key;
+ struct crypto_blkcipher *blk_tfm;
+};
+
+asmlinkage void ce_aes_ccm_auth_data(u8 mac[], u8 const in[], u32 abytes,
+ u32 const rk[], u32 rounds);
+
+asmlinkage void ce_aes_ccm_encrypt(u8 out[], u8 const in[], u32 cbytes,
+ u32 const rk[], u32 rounds, u8 mac[],
+ u8 ctr[]);
+
+asmlinkage void ce_aes_ccm_decrypt(u8 out[], u8 const in[], u32 cbytes,
+ u32 const rk[], u32 rounds, u8 mac[],
+ u8 ctr[]);
+
+asmlinkage void ce_aes_ccm_final(u8 mac[], u8 const ctr[], u32 const rk[],
+ u32 rounds);
+
+static int ccm_setkey(struct crypto_aead *tfm, const u8 *in_key,
+ unsigned int key_len)
+{
+ struct crypto_ccm_aes_ctx *ctx = crypto_aead_ctx(tfm);
+ int ret;
+
+ ret = crypto_aes_expand_key(ctx->key, in_key, key_len);
+ if (!ret)
+ return 0;
+
+ tfm->base.crt_flags |= CRYPTO_TFM_RES_BAD_KEY_LEN;
+ return -EINVAL;
+}
+
+static int ccm_setauthsize(struct crypto_aead *tfm, unsigned int authsize)
+{
+ if ((authsize & 1) || authsize < 4)
+ return -EINVAL;
+ return 0;
+}
+
+static int ccm_init_mac(struct aead_request *req, u8 maciv[], u32 msglen)
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ __be32 *n = (__be32 *)&maciv[AES_BLOCK_SIZE - 8];
+ u32 l = req->iv[0] + 1;
+
+ /* verify that CCM dimension 'L' is set correctly in the IV */
+ if (l < 2 || l > 8)
+ return -EINVAL;
+
+ /* verify that msglen can in fact be represented in L bytes */
+ if (msglen >> (8 * l))
+ return -EOVERFLOW;
+
+ /*
+ * Even if the CCM spec allows L values of up to 8, the Linux cryptoapi
+ * uses a u32 type to represent msglen so the top 4 bytes are always 0.
+ */
+ n[0] = 0;
+ n[1] = cpu_to_be32(msglen);
+
+ memcpy(maciv, req->iv, AES_BLOCK_SIZE - l);
+
+ maciv[0] |= (crypto_aead_authsize(aead) - 2) << 2;
+ if (req->assoclen)
+ maciv[0] |= 0x40;
+
+ memset(&req->iv[AES_BLOCK_SIZE - l], 0, l);
+ return 0;
+}
+
+static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ struct crypto_ccm_aes_ctx *ctx = crypto_aead_ctx(aead);
+ struct __packed { __be16 l; __be32 h; } ltag;
+ u32 rounds = 6 + ctx->key->key_length / 4;
+ struct scatter_walk walk;
+ u32 len = req->assoclen;
+ u32 macp;
+
+ /* prepend the AAD with a length tag */
+ if (len < 0xff00) {
+ ltag.l = cpu_to_be16(len);
+ macp = 2;
+ } else {
+ ltag.l = cpu_to_be16(0xfffe);
+ put_unaligned_be32(len, &ltag.h);
+ macp = 6;
+ }
+
+ ce_aes_ccm_auth_data(mac, (u8 *)&ltag, macp, ctx->key->key_enc, rounds);
+ scatterwalk_start(&walk, req->assoc);
+
+ do {
+ u32 n = scatterwalk_clamp(&walk, len);
+ u32 m;
+ u8 *p;
+
+ if (!n) {
+ scatterwalk_start(&walk, sg_next(walk.sg));
+ n = scatterwalk_clamp(&walk, len);
+ }
+ p = scatterwalk_map(&walk);
+ m = min(n, AES_BLOCK_SIZE - macp);
+ crypto_xor(&mac[macp], p, m);
+
+ len -= n;
+ n -= m;
+ macp += m;
+ if (macp == AES_BLOCK_SIZE && (n || len)) {
+ ce_aes_ccm_auth_data(mac, &p[m], n, ctx->key->key_enc,
+ rounds);
+ macp = n % AES_BLOCK_SIZE;
+ }
+
+ scatterwalk_unmap(p);
+ scatterwalk_advance(&walk, n + m);
+ scatterwalk_done(&walk, 0, len);
+ } while (len);
+}
+
+struct ccm_inner_desc_info {
+ u8 ctriv[AES_BLOCK_SIZE];
+ u8 mac[AES_BLOCK_SIZE];
+} __aligned(8);
+
+static int ccm_inner_encrypt(struct blkcipher_desc *desc,
+ struct scatterlist *dst, struct scatterlist *src,
+ unsigned int nbytes)
+{
+ struct crypto_aes_ctx *ctx = crypto_blkcipher_ctx(desc->tfm);
+ struct ccm_inner_desc_info *descinfo = desc->info;
+ u32 rounds = 6 + ctx->key_length / 4;
+ struct blkcipher_walk walk;
+ int err;
+
+ blkcipher_walk_init(&walk, dst, src, nbytes);
+ err = blkcipher_walk_virt_block(desc, &walk, AES_BLOCK_SIZE);
+
+ while (walk.nbytes) {
+ u32 tail = walk.nbytes % AES_BLOCK_SIZE;
+
+ if (walk.nbytes == nbytes)
+ tail = 0;
+
+ ce_aes_ccm_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
+ walk.nbytes - tail, ctx->key_enc, rounds,
+ descinfo->mac, descinfo->ctriv);
+
+ nbytes -= walk.nbytes - tail;
+ err = blkcipher_walk_done(desc, &walk, tail);
+ }
+ return err;
+}
+
+static int ccm_inner_decrypt(struct blkcipher_desc *desc,
+ struct scatterlist *dst, struct scatterlist *src,
+ unsigned int nbytes)
+{
+ struct crypto_aes_ctx *ctx = crypto_blkcipher_ctx(desc->tfm);
+ struct ccm_inner_desc_info *descinfo = desc->info;
+ u32 rounds = 6 + ctx->key_length / 4;
+ struct blkcipher_walk walk;
+ int err;
+
+ blkcipher_walk_init(&walk, dst, src, nbytes);
+ err = blkcipher_walk_virt_block(desc, &walk, AES_BLOCK_SIZE);
+
+ while (walk.nbytes) {
+ u32 tail = walk.nbytes % AES_BLOCK_SIZE;
+
+ if (walk.nbytes == nbytes)
+ tail = 0;
+
+ ce_aes_ccm_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
+ walk.nbytes - tail, ctx->key_enc, rounds,
+ descinfo->mac, descinfo->ctriv);
+
+ nbytes -= walk.nbytes - tail;
+ err = blkcipher_walk_done(desc, &walk, tail);
+ }
+ return err;
+}
+
+static int ccm_encrypt(struct aead_request *req)
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ struct crypto_ccm_aes_ctx *ctx = crypto_aead_ctx(aead);
+ u32 rounds = 6 + ctx->key->key_length / 4;
+ struct ccm_inner_desc_info descinfo;
+ DEFINE_NEON_REGSTACK_PARTIAL(regs, 4);
+ int err;
+
+ struct blkcipher_desc desc = {
+ .tfm = ctx->blk_tfm,
+ .info = &descinfo,
+ .flags = 0,
+ };
+
+ err = ccm_init_mac(req, descinfo.mac, req->cryptlen);
+ if (err)
+ return err;
+
+ kernel_neon_begin(regs);
+
+ if (req->assoclen)
+ ccm_calculate_auth_mac(req, descinfo.mac);
+
+ memcpy(descinfo.ctriv, req->iv, AES_BLOCK_SIZE);
+
+ /* call inner blkcipher to process the payload */
+ err = ccm_inner_encrypt(&desc, req->dst, req->src, req->cryptlen);
+ if (!err)
+ ce_aes_ccm_final(descinfo.mac, req->iv, ctx->key->key_enc,
+ rounds);
+
+ kernel_neon_end(regs);
+
+ if (err)
+ return err;
+
+ /* copy authtag to end of dst */
+ scatterwalk_map_and_copy(descinfo.mac, req->dst, req->cryptlen,
+ crypto_aead_authsize(aead), 1);
+
+ return 0;
+}
+
+static int ccm_decrypt(struct aead_request *req)
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ struct crypto_ccm_aes_ctx *ctx = crypto_aead_ctx(aead);
+ u32 rounds = 6 + ctx->key->key_length / 4;
+ struct ccm_inner_desc_info descinfo;
+ DEFINE_NEON_REGSTACK_PARTIAL(regs, 4);
+ u8 atag[AES_BLOCK_SIZE];
+ u32 len;
+ int err;
+
+ struct blkcipher_desc desc = {
+ .tfm = ctx->blk_tfm,
+ .info = &descinfo,
+ .flags = 0,
+ };
+
+ len = req->cryptlen - crypto_aead_authsize(aead);
+ err = ccm_init_mac(req, descinfo.mac, len);
+ if (err)
+ return err;
+
+ kernel_neon_begin(regs);
+
+ if (req->assoclen)
+ ccm_calculate_auth_mac(req, descinfo.mac);
+
+ memcpy(descinfo.ctriv, req->iv, AES_BLOCK_SIZE);
+
+ /* call inner blkcipher to process the payload */
+ err = ccm_inner_decrypt(&desc, req->dst, req->src, len);
+ if (!err)
+ ce_aes_ccm_final(descinfo.mac, req->iv, ctx->key->key_enc,
+ rounds);
+
+ kernel_neon_end(regs);
+
+ if (err)
+ return err;
+
+ /* compare calculated auth tag with the stored one */
+ scatterwalk_map_and_copy(atag, req->src, len,
+ crypto_aead_authsize(aead), 0);
+
+ if (memcmp(descinfo.mac, atag, crypto_aead_authsize(aead)))
+ return -EBADMSG;
+ return 0;
+}
+
+static int ccm_init(struct crypto_tfm *tfm)
+{
+ struct crypto_ccm_aes_ctx *ctx = crypto_tfm_ctx(tfm);
+ struct crypto_blkcipher *blk_tfm;
+
+ blk_tfm = crypto_alloc_blkcipher("__driver-ccm-aesce-inner", 0, 0);
+ if (IS_ERR(blk_tfm))
+ return PTR_ERR(blk_tfm);
+
+ /* did we get the right one? (sanity check) */
+ if (crypto_blkcipher_crt(blk_tfm)->encrypt != ccm_inner_encrypt) {
+ crypto_free_blkcipher(blk_tfm);
+ return -EINVAL;
+ }
+
+ ctx->blk_tfm = blk_tfm;
+ ctx->key = crypto_blkcipher_ctx(blk_tfm);
+
+ return 0;
+}
+
+static void ccm_exit(struct crypto_tfm *tfm)
+{
+ struct crypto_ccm_aes_ctx *ctx = crypto_tfm_ctx(tfm);
+
+ crypto_free_blkcipher(ctx->blk_tfm);
+}
+
+static struct crypto_alg aes_algs[] = { {
.cra_name = "aes",
.cra_driver_name = "aes-ce",
.cra_priority = 300,
@@ -84,18 +393,56 @@ static struct crypto_alg aes_alg = {
.cia_encrypt = aes_cipher_encrypt,
.cia_decrypt = aes_cipher_decrypt
}
-};
+}, {
+ .cra_name = "__ccm-aesce-inner",
+ .cra_driver_name = "__driver-ccm-aesce-inner",
+ .cra_priority = 0,
+ .cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct crypto_aes_ctx),
+ .cra_alignmask = 7,
+ .cra_type = &crypto_blkcipher_type,
+ .cra_module = THIS_MODULE,
+ .cra_blkcipher = {
+ .min_keysize = AES_MIN_KEY_SIZE,
+ .max_keysize = AES_MAX_KEY_SIZE,
+ .ivsize = sizeof(struct ccm_inner_desc_info),
+ .setkey = crypto_aes_set_key,
+ .encrypt = ccm_inner_encrypt,
+ .decrypt = ccm_inner_decrypt,
+ },
+}, {
+ .cra_name = "ccm(aes)",
+ .cra_driver_name = "ccm-aes-ce",
+ .cra_priority = 300,
+ .cra_flags = CRYPTO_ALG_TYPE_AEAD,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct crypto_ccm_aes_ctx),
+ .cra_alignmask = 7,
+ .cra_type = &crypto_aead_type,
+ .cra_module = THIS_MODULE,
+ .cra_init = ccm_init,
+ .cra_exit = ccm_exit,
+ .cra_aead = {
+ .ivsize = AES_BLOCK_SIZE,
+ .maxauthsize = AES_BLOCK_SIZE,
+ .setkey = ccm_setkey,
+ .setauthsize = ccm_setauthsize,
+ .encrypt = ccm_encrypt,
+ .decrypt = ccm_decrypt,
+ }
+} };
static int __init aes_mod_init(void)
{
if (0) // TODO check for crypto extensions
return -ENODEV;
- return crypto_register_alg(&aes_alg);
+ return crypto_register_algs(aes_algs, ARRAY_SIZE(aes_algs));
}
static void __exit aes_mod_exit(void)
{
- crypto_unregister_alg(&aes_alg);
+ crypto_unregister_algs(aes_algs, ARRAY_SIZE(aes_algs));
}
module_init(aes_mod_init);
diff --git a/arch/arm64/crypto/aesce-ccm.S b/arch/arm64/crypto/aesce-ccm.S
new file mode 100644
index 0000000..df1248b
--- /dev/null
+++ b/arch/arm64/crypto/aesce-ccm.S
@@ -0,0 +1,186 @@
+/*
+ * linux/arch/arm64/crypto/aesce-ccm.S - AES-CCM transform for ARMv8 with
+ * Crypto Extensions
+ *
+ * Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/linkage.h>
+
+ .text
+ .arch armv8-a+crypto
+
+ /*
+ * void ce_aes_ccm_auth_data(u8 mac[], u8 const in[], u32 abytes,
+ * u32 const rk[], u32 rounds);
+ */
+ENTRY(ce_aes_ccm_auth_data)
+ ld1 {v0.16b}, [x0] /* load mac */
+0: ld1 {v3.16b}, [x3] /* load first round key */
+ mov w7, w4
+ add x6, x3, #16
+ b 2f
+1: aese v0.16b, v2.16b
+ subs w7, w7, #2
+ beq 3f
+ aesmc v0.16b, v0.16b
+2: aese v0.16b, v3.16b
+ ld1 {v2.16b-v3.16b}, [x6], #32 /* load next round keys */
+ aesmc v0.16b, v0.16b
+ b 1b
+3: eor v0.16b, v0.16b, v3.16b /* final round */
+ subs w2, w2, #16 /* last data? */
+ bmi 4f
+ ld1 {v1.16b}, [x1], #16 /* load next input block */
+ eor v0.16b, v0.16b, v1.16b /* xor with mac */
+ bne 0b
+4: st1 {v0.16b}, [x0] /* store mac */
+ beq 6f
+ adds w2, w2, #16
+ beq 6f
+5: ldrb w7, [x1], #1
+ umov w6, v0.b[0]
+ eor w6, w6, w7
+ strb w6, [x0], #1
+ subs w2, w2, #1
+ beq 6f
+ ext v0.16b, v0.16b, v0.16b, #1 /* rotate out the mac bytes */
+ b 5b
+6: ret
+ENDPROC(ce_aes_ccm_auth_data)
+
+ /*
+ * void ce_aes_ccm_final(u8 mac[], u8 const ctr[], u32 const rk[],
+ * u32 rounds);
+ */
+ENTRY(ce_aes_ccm_final)
+ ld1 {v0.16b}, [x0] /* load mac */
+ ld1 {v2.16b-v3.16b}, [x2], #32 /* load first 2 round keys */
+ ld1 {v1.16b}, [x1] /* load 1st ctriv */
+ cmp w3, #12
+ beq 1f
+0: aese v0.16b, v2.16b /* 4 rounds, 2x interleaved */
+ aese v1.16b, v2.16b
+ aesmc v0.16b, v0.16b
+ aesmc v1.16b, v1.16b
+ aese v0.16b, v3.16b
+ aese v1.16b, v3.16b
+ subs w3, w3, #4
+ ble 2f
+ ld1 {v2.16b-v3.16b}, [x2], #32 /* load next 2 round keys */
+ aesmc v0.16b, v0.16b
+ aesmc v1.16b, v1.16b
+1: aese v0.16b, v2.16b
+ aese v1.16b, v2.16b
+ aesmc v0.16b, v0.16b
+ aesmc v1.16b, v1.16b
+ aese v0.16b, v3.16b
+ aese v1.16b, v3.16b
+ ld1 {v2.16b-v3.16b}, [x2], #32 /* load next 2 round keys */
+ aesmc v0.16b, v0.16b
+ aesmc v1.16b, v1.16b
+ b 0b
+2: /* final round key cancels out */
+ eor v0.16b, v0.16b, v1.16b /* en-/decrypt the mac */
+ st1 {v0.16b}, [x0] /* store result */
+ ret
+ENDPROC(ce_aes_ccm_final)
+
+ .macro aes_ccm_do_crypt,enc
+ ldr x8, [x6, #8] /* load lower ctr */
+ ld1 {v0.16b}, [x5] /* load mac */
+ rev x8, x8 /* keep swabbed ctr in reg */
+ b 0f
+ .align 6
+0: ld1 {v1.8b}, [x6] /* load upper ctr */
+ ld1 {v3.16b}, [x3] /* load first round key */
+ add x8, x8, #1
+ mov w7, w4 /* get # of rounds */
+ rev x9, x8
+ cmp w4, #12 /* 10, 12 or 14 rounds? */
+ add x10, x3, #16
+ ins v1.d[1], x9 /* no carry in lower ctr */
+ beq 3f
+ b 2f
+1: aese v0.16b, v2.16b /* 4 rounds, 2x interleaved */
+ aese v1.16b, v2.16b
+ aesmc v0.16b, v0.16b
+ aesmc v1.16b, v1.16b
+2: aese v0.16b, v3.16b
+ aese v1.16b, v3.16b
+ ld1 {v2.16b-v3.16b}, [x10], #32 /* load next 2 round keys */
+ aesmc v0.16b, v0.16b
+ aesmc v1.16b, v1.16b
+ subs w7, w7, #4
+ aese v0.16b, v2.16b
+ aese v1.16b, v2.16b
+ ble 4f
+ aesmc v0.16b, v0.16b
+ aesmc v1.16b, v1.16b
+3: aese v0.16b, v3.16b
+ aese v1.16b, v3.16b
+ ld1 {v2.16b-v3.16b}, [x10], #32 /* load next 2 round keys */
+ aesmc v0.16b, v0.16b
+ aesmc v1.16b, v1.16b
+ b 1b
+4: subs w2, w2, #16
+ bmi 5f
+ ld1 {v2.16b}, [x1], #16 /* load next input block */
+ .if \enc == 1
+ eor v2.16b, v2.16b, v3.16b /* final round enc+mac */
+ eor v1.16b, v1.16b, v2.16b /* xor with crypted ctr */
+ .else
+ eor v2.16b, v2.16b, v1.16b /* xor with crypted ctr */
+ eor v1.16b, v2.16b, v3.16b /* final round enc */
+ .endif
+ eor v0.16b, v0.16b, v2.16b /* xor mac with pt ^ rk[last] */
+ st1 {v1.16b}, [x0], #16 /* write output block */
+ beq 5f
+ b 0b
+5: eor v0.16b, v0.16b, v3.16b /* final round mac */
+ eor v1.16b, v1.16b, v3.16b /* final round enc */
+ st1 {v0.16b}, [x5] /* store mac */
+ beq 7f
+ add w2, w2, #16 /* process partial tail block */
+6: ldrb w9, [x1], #1 /* get 1 byte of input */
+ umov w6, v1.b[0] /* get top crypted ctr byte */
+ umov w7, v0.b[0] /* get top mac byte */
+ .if \enc == 1
+ eor w7, w7, w9
+ eor w9, w9, w6
+ .else
+ eor w9, w9, w6
+ eor w7, w7, w9
+ .endif
+ strb w9, [x0], #1 /* store out byte */
+ strb w7, [x5], #1 /* store mac byte */
+ subs w2, w2, #1
+ beq 8f
+ ext v0.16b, v0.16b, v0.16b, #1 /* shift out mac byte */
+ ext v1.16b, v1.16b, v1.16b, #1 /* shift out ctr byte */
+ b 6b
+7: rev x8, x8
+ str x8, [x6, #8] /* store lsb end of ctr (BE) */
+8: ret
+ .endm
+
+ /*
+ * void ce_aes_ccm_encrypt(u8 out[], u8 const in[], u32 cbytes,
+ * u32 const rk[], u32 rounds, u8 mac[],
+ * u8 ctr[]);
+ * void ce_aes_ccm_decrypt(u8 out[], u8 const in[], u32 cbytes,
+ * u32 const rk[], u32 rounds, u8 mac[],
+ * u8 ctr[]);
+ */
+ENTRY(ce_aes_ccm_encrypt)
+ aes_ccm_do_crypt 1
+ENDPROC(ce_aes_ccm_encrypt)
+
+ENTRY(ce_aes_ccm_decrypt)
+ aes_ccm_do_crypt 0
+ENDPROC(ce_aes_ccm_decrypt)
+
--
1.8.1.2
* [RFC v3 PATCH 7/7] lib/raid6: port NEON implementation to updated kmode NEON api
From: Ard Biesheuvel @ 2013-10-13 12:15 UTC
To: linux-arm-kernel
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
lib/raid6/neon.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/lib/raid6/neon.c b/lib/raid6/neon.c
index 36ad470..172b53f 100644
--- a/lib/raid6/neon.c
+++ b/lib/raid6/neon.c
@@ -13,8 +13,8 @@
#ifdef __KERNEL__
#include <asm/neon.h>
#else
-#define kernel_neon_begin()
-#define kernel_neon_end()
+#define kernel_neon_begin(s)
+#define kernel_neon_end(s)
#define cpu_has_neon() (1)
#endif
@@ -33,12 +33,13 @@
static void raid6_neon ## _n ## _gen_syndrome(int disks, \
size_t bytes, void **ptrs) \
{ \
+ DEFINE_NEON_REGSTACK(s); \
void raid6_neon ## _n ## _gen_syndrome_real(int, \
unsigned long, void**); \
- kernel_neon_begin(); \
+ kernel_neon_begin(s); \
raid6_neon ## _n ## _gen_syndrome_real(disks, \
(unsigned long)bytes, ptrs); \
- kernel_neon_end(); \
+ kernel_neon_end(s); \
} \
struct raid6_calls const raid6_neonx ## _n = { \
raid6_neon ## _n ## _gen_syndrome, \
--
1.8.1.2
* [RFC v3 PATCH 0/7] ARM[64]: kernel mode NEON in atomic contexts
From: Nicolas Pitre @ 2013-10-15 4:01 UTC
To: linux-arm-kernel
On Sun, 13 Oct 2013, Ard Biesheuvel wrote:
> Take #3 of this RFC series.
>
> Instead of having additional separate versions of kernel_neon_begin/end, the
> existing ones now have been modified to always take a preallocated stack area
> as an argument.
The problem with this approach is that you break git bisect by making
the kernel unbuildable when this series is partially applied. Either
you make kernel_neon_begin/end into wrappers with no argument around the
new interface, or you change all users at the same time as the
interface. One big principle is not to break the kernel build in the
middle of a patch series when altering an existing interface.
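Something like this, perhaps (untested sketch, names of my own choosing;
the !in_interrupt() path never touches the stack area, so passing NULL is
safe there):

    static inline void kernel_neon_begin_compat(void)
    {
            BUG_ON(in_interrupt());      /* old contract */
            __kernel_neon_begin(NULL, 0);
    }

    static inline void kernel_neon_end_compat(void)
    {
            __kernel_neon_end(NULL, 0);
    }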
> The stack area is allocated by DEFINE_NEON_REGSTACK[_PARTIAL](varname), where
> the partial version takes an additional int num_regs indicating how many
> registers need to be freed up.
>
> In the !in_interrupt() case, these functions operate as before, and the regstack
> is defined to minimal size in this case as it will remain unused anyway. In the
> in_interrupt() case, 'num_regs' (or all) NEON registers are stacked/unstacked
> using the allocated stack region.
Would have been nice to have the stack simply be a NULL pointer when
!in_interrupt() or when the number of regs is 0. This would remove the
need for a runtime check on !num_regs. I don't see an obvious way to
accomplish that right now though.
Nicolas
* [RFC v3 PATCH 0/7] ARM[64]: kernel mode NEON in atomic contexts
From: Ard Biesheuvel @ 2013-10-15 13:13 UTC
To: linux-arm-kernel
On 15 October 2013 06:01, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Sun, 13 Oct 2013, Ard Biesheuvel wrote:
>
>> Instead of having additional separate versions of kernel_neon_begin/end, the
>> existing ones now have been modified to always take a preallocated stack area
>> as an argument.
>
> The problem with this approach is that you break git bisect by making
> the kernel unbuildable when this series is partially applied. Either
> you make kernel_neon_begin/end into wrappers with no argument around the
> new interface, or you change all users at the same time as the
> interface. One big principle is not to break the kernel build in the
> middle of a patch series when altering an existing interface.
>
I see.
>> The stack area is allocated by DEFINE_NEON_REGSTACK[_PARTIAL](varname), where
>> the partial version takes an additional int num_regs indicating how many
>> registers need to be freed up.
>>
>> In the !in_interrupt() case, these functions operate as before, and the regstack
>> is defined to minimal size in this case as it will remain unused anyway. In the
>> in_interrupt() case, 'num_regs' (or all) NEON registers are stacked/unstacked
>> using the allocated stack region.
>
> Would have been nice to have the stack simply be a NULL pointer when
> !in_interrupt() or when the number of regs is 0. This would remove the
> need for a runtime check on !num_regs. I don't see an obvious way to
> accomplish that right now though.
>
We could address both of these issues by implementing Catalin's
suggestion to reserve per-process vfp_states[] for both irq and
softirq context in addition to the ordinary one, but it would waste a
lot of space imo. What is your take on that?
--
Ard.
* [RFC v3 PATCH 0/7] ARM[64]: kernel mode NEON in atomic contexts
From: Ard Biesheuvel @ 2013-10-15 14:06 UTC
To: linux-arm-kernel
On 15 October 2013 15:13, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 15 October 2013 06:01, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> On Sun, 13 Oct 2013, Ard Biesheuvel wrote:
>>
>>> Instead of having additional separate versions of kernel_neon_begin/end, the
>>> existing ones now have been modified to always take a preallocated stack area
>>> as an argument.
>>
>> The problem with this approach is that you break git bisect by making
>> the kernel unbuildable when this series is partially applied. Either
>> you make kernel_neon_begin/end into wrappers with no argument around the
>> new interface, or you change all users at the same time as the
>> interface. One big principle is not to break the kernel build in the
>> middle of a patch series when altering an existing interface.
>>
>
> I see.
>
>>> The stack area is allocated by DEFINE_NEON_REGSTACK[_PARTIAL](varname), where
>>> the partial version takes an additional int num_regs indicating how many
>>> registers need to be freed up.
>>>
>>> In the !in_interrupt() case, these functions operate as before, and the regstack
>>> is defined to minimal size in this case as it will remain unused anyway. In the
>>> in_interrupt() case, 'num_regs' (or all) NEON registers are stacked/unstacked
>>> using the allocated stack region.
>>
>> Would have been nice to have the stack simply be a NULL pointer when
>> !in_interrupt() or when the number of regs is 0. This would remove the
>> need for a runtime check on !num_regs. I don't see an obvious way to
>> accomplish that right now though.
>>
>
> We could address both of these issues by implementing Catalin's
> suggestion to reserve per-process vfp_states[] for both irq and
> softirq context in addition to the ordinary one, but it would waste a
> lot of space imo. What is your take on that?
>
Replying to self: two per-cpu vfp_states, one for irq and one for
softirq, is probably the best approach here. I still need to add
kernel_neon_begin_partial() in this case, but the existing users can
remain unmodified.
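Roughly (sketch of the direction, not final code):

    static DEFINE_PER_CPU(struct fpsimd_state, hardirq_fpsimdstate);
    static DEFINE_PER_CPU(struct fpsimd_state, softirq_fpsimdstate);

i.e. one spare register file per CPU per interrupt level, so full saves in
interrupt context no longer need a caller-supplied stack area.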
I will do a v4 by end of next week.
Regards,
Ard.
* [RFC v3 PATCH 0/7] ARM[64]: kernel mode NEON in atomic contexts
From: Nicolas Pitre @ 2013-10-15 16:05 UTC
To: linux-arm-kernel
On Tue, 15 Oct 2013, Ard Biesheuvel wrote:
> On 15 October 2013 06:01, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > On Sun, 13 Oct 2013, Ard Biesheuvel wrote:
> >
> >> The stack area is allocated by DEFINE_NEON_REGSTACK[_PARTIAL](varname), where
> >> the partial version takes an additional int num_regs indicating how many
> >> registers need to be freed up.
> >>
> >> In the !in_interrupt() case, these functions operate as before, and the regstack
> >> is defined to minimal size in this case as it will remain unused anyway. In the
> >> in_interrupt() case, 'num_regs' (or all) NEON registers are stacked/unstacked
> >> using the allocated stack region.
> >
> > Would have been nice to have the stack simply be a NULL pointer when
> > !in_interrupt() or when the number of regs is 0. This would remove the
> > need for a runtime check on !num_regs. I don't see an obvious way to
> > accomplish that right now though.
> >
>
> We could address both of these issues by implementing Catalin's
> suggestion to reserve per-process vfp_states[] for both irq and
> softirq context in addition to the ordinary one, but it would waste a
> lot of space imo. What is your take on that?
I agree that this would be rather wasteful. I really like your current
approach of dynamically allocating just the right amount of space on the
stack. I'm not a big fan of statically allocated memory that is
seldom used.
What I meant by my suggestion was something like this:
#define kernel_neon_begin(p)                                          \
        __kernel_neon_begin(sizeof((p).qregs) ? &(p).regs : NULL,     \
                            sizeof((p).qregs)/16)
However, it seems gcc is not clever enough to optimize the stack usage
away at all in that case, which is worse than your current version. So
better forget about this suggestion.
Nicolas
^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC v3 PATCH 0/7] ARM[64]: kernel mode NEON in atomic contexts
2013-10-15 16:05 ` Nicolas Pitre
@ 2013-10-15 16:53 ` Catalin Marinas
0 siblings, 0 replies; 19+ messages in thread
From: Catalin Marinas @ 2013-10-15 16:53 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Oct 15, 2013 at 05:05:48PM +0100, Nicolas Pitre wrote:
> On Tue, 15 Oct 2013, Ard Biesheuvel wrote:
>
> > On 15 October 2013 06:01, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > > On Sun, 13 Oct 2013, Ard Biesheuvel wrote:
> > >
> > >> The stack area is allocated by DEFINE_NEON_REGSTACK[_PARTIAL](varname), where
> > >> the partial version takes an additional int num_regs indicating how many
> > >> registers need to be freed up.
> > >>
> > >> In the !in_interrupt() case, these functions operate as before, and the regstack
> > >> is defined to minimal size in this case as it will remain unused anyway. In the
> > >> in_interrupt() case, 'num_regs' (or all) NEON registers are stacked/unstacked
> > >> using the allocated stack region.
> > >
> > > Would have been nice to have the stack simply be a NULL pointer when
> > > !in_interrupt() or when the number of regs is 0. This would remove the
> > > need for a runtime check on !num_regs. I don't see an obvious way to
> > > accomplish that right now though.
> > >
> >
> > We could address both of these issues by implementing Catalin's
> > suggestion to reserve per-process vfp_states[] for both irq and
> > softirq context in addition to the ordinary one, but it would waste a
> > lot of space imo. What is your take on that?
>
> I agree that this would be rather wasteful. I really like your current
> approach of dynamically allocating just the right amount of space on the
> stack. I'm not a big fan of statically allocated memory that is seldom
> used.
I agree here, especially since we need to cover both soft and hard irqs.
Two extra copies of the full register file come to about 1KB per CPU,
which is not noticeable even on big systems, but it still looks like
memory that would only be used rarely.
--
Catalin
^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC v3 PATCH 1/7] ARM: add support for kernel mode NEON in atomic context
2013-10-13 12:14 ` [RFC v3 PATCH 1/7] ARM: add support for kernel mode NEON in atomic context Ard Biesheuvel
@ 2013-10-15 17:26 ` Catalin Marinas
2013-10-15 17:30 ` Ard Biesheuvel
0 siblings, 1 reply; 19+ messages in thread
From: Catalin Marinas @ 2013-10-15 17:26 UTC (permalink / raw)
To: linux-arm-kernel
On Sun, Oct 13, 2013 at 01:14:57PM +0100, Ard Biesheuvel wrote:
> diff --git a/arch/arm/include/asm/neon.h b/arch/arm/include/asm/neon.h
> index 8f730fe..800d85c 100644
> --- a/arch/arm/include/asm/neon.h
> +++ b/arch/arm/include/asm/neon.h
> @@ -8,10 +8,30 @@
> * published by the Free Software Foundation.
> */
>
> +#include <linux/types.h>
> +#include <linux/hardirq.h>
> +#include <asm/fpstate.h>
> #include <asm/hwcap.h>
>
> #define cpu_has_neon() (!!(elf_hwcap & HWCAP_NEON))
>
> +/*
> + * Avoid wasting stack space by making the size of the allocated area depend on
> + * whether we are currently running in process context. (If this is the case, we
> + * will use the normal preserve/restore mechanism, leaving the allocated stack
> + * space unused.)
> + */
> +#define __QREG_SIZE(num) \
> + ((!in_interrupt()) ? 0 : (num) > 16 ? 256 : 16 * (((num) + 1) & ~1U))
> +
> +#define DEFINE_NEON_REGSTACK_PARTIAL(v, num) \
> + struct { \
> + struct vfp_partial_state regs; \
> + u8 qregs[__QREG_SIZE(num)]; \
> + } v
Oh, interesting gcc feature. What does it generate?
--
Catalin
^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC v3 PATCH 1/7] ARM: add support for kernel mode NEON in atomic context
2013-10-15 17:26 ` Catalin Marinas
@ 2013-10-15 17:30 ` Ard Biesheuvel
2013-10-15 17:46 ` Catalin Marinas
0 siblings, 1 reply; 19+ messages in thread
From: Ard Biesheuvel @ 2013-10-15 17:30 UTC (permalink / raw)
To: linux-arm-kernel
On 15 October 2013 19:26, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Sun, Oct 13, 2013 at 01:14:57PM +0100, Ard Biesheuvel wrote:
>> diff --git a/arch/arm/include/asm/neon.h b/arch/arm/include/asm/neon.h
>> index 8f730fe..800d85c 100644
>> --- a/arch/arm/include/asm/neon.h
>> +++ b/arch/arm/include/asm/neon.h
>> @@ -8,10 +8,30 @@
>> * published by the Free Software Foundation.
>> */
>>
>> +#include <linux/types.h>
>> +#include <linux/hardirq.h>
>> +#include <asm/fpstate.h>
>> #include <asm/hwcap.h>
>>
>> #define cpu_has_neon() (!!(elf_hwcap & HWCAP_NEON))
>>
>> +/*
>> + * Avoid wasting stack space by making the size of the allocated area depend on
>> + * whether we are currently running in process context. (If this is the case, we
>> + * will use the normal preserve/restore mechanism, leaving the allocated stack
>> + * space unused.)
>> + */
>> +#define __QREG_SIZE(num) \
>> + ((!in_interrupt()) ? 0 : (num) > 16 ? 256 : 16 * (((num) + 1) & ~1U))
>> +
>> +#define DEFINE_NEON_REGSTACK_PARTIAL(v, num) \
>> + struct { \
>> + struct vfp_partial_state regs; \
>> + u8 qregs[__QREG_SIZE(num)]; \
>> + } v
>
> Oh, interesting gcc feature. What does it generate?
>
Well, it's not a feature particular to GCC, as far as I am aware. The
anonymous struct is just variably sized at run time, depending on
in_interrupt() and the requested number of registers.
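For reference, the same construct in a stand-alone program (this demo
is only for illustration; strictly speaking, a variably sized struct
member is a gcc extension built on C99 VLAs):

#include <stdio.h>

static void demo(int n)
{
	/* the struct's size depends on n, which is only known at run time */
	struct {
		int hdr;
		unsigned char buf[n * 16];	/* variably sized member */
	} v;

	/* sizeof on a variably modified type is evaluated at run time */
	printf("n=%d sizeof(v)=%zu\n", n, sizeof v);
}

int main(void)
{
	demo(0);	/* minimal size when no registers are requested */
	demo(8);
	return 0;
}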
--
Ard.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC v3 PATCH 1/7] ARM: add support for kernel mode NEON in atomic context
2013-10-15 17:30 ` Ard Biesheuvel
@ 2013-10-15 17:46 ` Catalin Marinas
0 siblings, 0 replies; 19+ messages in thread
From: Catalin Marinas @ 2013-10-15 17:46 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Oct 15, 2013 at 06:30:50PM +0100, Ard Biesheuvel wrote:
> On 15 October 2013 19:26, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Sun, Oct 13, 2013 at 01:14:57PM +0100, Ard Biesheuvel wrote:
> >> diff --git a/arch/arm/include/asm/neon.h b/arch/arm/include/asm/neon.h
> >> index 8f730fe..800d85c 100644
> >> --- a/arch/arm/include/asm/neon.h
> >> +++ b/arch/arm/include/asm/neon.h
> >> @@ -8,10 +8,30 @@
> >> * published by the Free Software Foundation.
> >> */
> >>
> >> +#include <linux/types.h>
> >> +#include <linux/hardirq.h>
> >> +#include <asm/fpstate.h>
> >> #include <asm/hwcap.h>
> >>
> >> #define cpu_has_neon() (!!(elf_hwcap & HWCAP_NEON))
> >>
> >> +/*
> >> + * Avoid wasting stack space by making the size of the allocated area depend on
> >> + * whether we are currently running in process context. (If this is the case, we
> >> + * will use the normal preserve/restore mechanism, leaving the allocated stack
> >> + * space unused.)
> >> + */
> >> +#define __QREG_SIZE(num) \
> >> + ((!in_interrupt()) ? 0 : (num) > 16 ? 256 : 16 * (((num) + 1) & ~1U))
> >> +
> >> +#define DEFINE_NEON_REGSTACK_PARTIAL(v, num) \
> >> + struct { \
> >> + struct vfp_partial_state regs; \
> >> + u8 qregs[__QREG_SIZE(num)]; \
> >> + } v
> >
> > Oh, interesting gcc feature. What does it generate?
> >
>
> Well, it's not a feature particular to GCC, as far as I am aware. The
> anonymous struct is just variably sized at run time, depending on
> in_interrupt() and the requested number of registers.
OK, it looks like it's valid C99. I was worried the compiler might
generate something like an alloca() library call.
--
Catalin
^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC v3 PATCH 3/7] ARM64: defer reloading a task's FPSIMD state to userland resume
2013-10-13 12:14 ` [RFC v3 PATCH 3/7] ARM64: defer reloading a task's FPSIMD state to userland resume Ard Biesheuvel
@ 2013-10-28 18:12 ` Catalin Marinas
2013-10-28 20:32 ` Ard Biesheuvel
0 siblings, 1 reply; 19+ messages in thread
From: Catalin Marinas @ 2013-10-28 18:12 UTC (permalink / raw)
To: linux-arm-kernel
On Sun, Oct 13, 2013 at 01:14:59PM +0100, Ard Biesheuvel wrote:
> --- a/arch/arm64/kernel/fpsimd.c
> +++ b/arch/arm64/kernel/fpsimd.c
> @@ -72,7 +72,7 @@ void do_fpsimd_exc(unsigned int esr, struct pt_regs *regs)
> void fpsimd_thread_switch(struct task_struct *next)
> {
> /* check if not kernel threads */
> - if (current->mm)
> + if (current->mm && !test_and_clear_thread_flag(TIF_RELOAD_FPSTATE))
> fpsimd_save_state(&current->thread.fpsimd_state);
Why does it need test_and_set_thread_flag() here? Some comments would be
useful as it looks strange to check a reload flag to decide whether to
save a state. Or change the name to something like 'dirty'.
> if (next->mm)
> fpsimd_load_state(&next->thread.fpsimd_state);
This function could be optimised a bit more to avoid saving/restoring if
the switch only happened between a user thread and a kernel one (and
back again) since the FP state may not have been dirtied. But what I had
in mind was per-CPU fpstate (possibly pointer or some flag) rather than
per-thread.
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -416,4 +416,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
> clear_thread_flag(TIF_NOTIFY_RESUME);
> tracehook_notify_resume(regs);
> }
> + if (test_and_clear_thread_flag(TIF_RELOAD_FPSTATE))
> + fpsimd_load_state(&current->thread.fpsimd_state);
I think this code can be preempted, as it is run with IRQs enabled. And
there is a small window where we cleared the flag but haven't loaded the
state.
--
Catalin
^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC v3 PATCH 3/7] ARM64: defer reloading a task's FPSIMD state to userland resume
2013-10-28 18:12 ` Catalin Marinas
@ 2013-10-28 20:32 ` Ard Biesheuvel
2013-10-28 22:29 ` Catalin Marinas
0 siblings, 1 reply; 19+ messages in thread
From: Ard Biesheuvel @ 2013-10-28 20:32 UTC (permalink / raw)
To: linux-arm-kernel
On 28 October 2013 11:12, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Sun, Oct 13, 2013 at 01:14:59PM +0100, Ard Biesheuvel wrote:
>> --- a/arch/arm64/kernel/fpsimd.c
>> +++ b/arch/arm64/kernel/fpsimd.c
>> @@ -72,7 +72,7 @@ void do_fpsimd_exc(unsigned int esr, struct pt_regs *regs)
>> void fpsimd_thread_switch(struct task_struct *next)
>> {
>> /* check if not kernel threads */
>> - if (current->mm)
>> + if (current->mm && !test_and_clear_thread_flag(TIF_RELOAD_FPSTATE))
>> fpsimd_save_state(&current->thread.fpsimd_state);
>
> Why does it need test_and_set_thread_flag() here? Some comments would be
> useful as it looks strange to check a reload flag to decide whether to
> save a state. Or change the name to something like 'dirty'.
>
Actually, it's test and clear. If the userland register file has
already been preserved for the purpose of performing kernel mode NEON,
it should not be saved again when the task gets scheduled out. The
clearing could also be deferred to the time when the task gets
scheduled in again. Or perhaps, it would be even better to always
defer loading the userland state for the next task when that task in
fact enters userland.
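In other words, the intended lifecycle is roughly the following (the
kernel_neon_begin() body is my paraphrase of the patch, not the literal
code):

void kernel_neon_begin(void)	/* task context */
{
	/* preserve the userland registers once, up front ... */
	fpsimd_save_state(&current->thread.fpsimd_state);
	/* ... and mark the saved copy as the authoritative one */
	set_thread_flag(TIF_RELOAD_FPSTATE);
}

void fpsimd_thread_switch(struct task_struct *next)
{
	/*
	 * If the flag is set, the saved copy is already up to date and
	 * the live registers only hold kernel mode NEON scratch data,
	 * so do not overwrite the saved state with them.
	 */
	if (current->mm && !test_and_clear_thread_flag(TIF_RELOAD_FPSTATE))
		fpsimd_save_state(&current->thread.fpsimd_state);
	if (next->mm)
		fpsimd_load_state(&next->thread.fpsimd_state);
}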
>> if (next->mm)
>> fpsimd_load_state(&next->thread.fpsimd_state);
>
> This function could be optimised a bit more to avoid saving/restoring if
> the switch only happened between a user thread and a kernel one (and
> back again) since the FP state may not have been dirtied. But what I had
> in mind was per-CPU fpstate (possibly pointer or some flag) rather than
> per-thread.
>
Well, then we are entering the realm of lazy restore, imo. There were
some patches proposed for that already, I think? But I do agree that
at this point, there is no need to restore the userland register
contents yet; it can be deferred to the point when the task reenters
userland (as mentioned above).
>> --- a/arch/arm64/kernel/signal.c
>> +++ b/arch/arm64/kernel/signal.c
>> @@ -416,4 +416,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
>> clear_thread_flag(TIF_NOTIFY_RESUME);
>> tracehook_notify_resume(regs);
>> }
>> + if (test_and_clear_thread_flag(TIF_RELOAD_FPSTATE))
>> + fpsimd_load_state(&current->thread.fpsimd_state);
>
> I think this code can be preempted, as it is run with IRQs enabled. And
> there is a small window where we cleared the flag but haven't loaded the
> state.
>
If we are preempted at this point, the fpstate will be loaded in the
normal way the next time this task runs, so I think this is harmless.
Although I guess we may be restoring the fp state twice in that case?
So in summary, what I need to do (sketched below) is:
- rework to use a per_cpu flag rather than a TIF;
- preserve the userland state (if it has one) when a task gets scheduled out;
- restore the userland state when a task enters userland;
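Roughly along these lines (names invented here, to be refined):

/* which task's FPSIMD state the registers of this CPU currently hold */
static DEFINE_PER_CPU(struct fpsimd_state *, fpsimd_owner);

/* scheduled out: preserve userland state only if the live regs are ours */
static void fpsimd_sched_out(struct task_struct *tsk)
{
	if (tsk->mm &&
	    this_cpu_read(fpsimd_owner) == &tsk->thread.fpsimd_state)
		fpsimd_save_state(&tsk->thread.fpsimd_state);
}

/* entering userland: reload only if this CPU holds someone else's state */
static void fpsimd_enter_user(struct task_struct *tsk)
{
	if (this_cpu_read(fpsimd_owner) != &tsk->thread.fpsimd_state) {
		fpsimd_load_state(&tsk->thread.fpsimd_state);
		this_cpu_write(fpsimd_owner, &tsk->thread.fpsimd_state);
	}
}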
I will propose an updated patch after I wrap up the work on the
in_interrupt() kernel mode NEON, as these topics are really orthogonal
and there is no reason to keep them combined in a single series.
--
Ard.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC v3 PATCH 3/7] ARM64: defer reloading a task's FPSIMD state to userland resume
2013-10-28 20:32 ` Ard Biesheuvel
@ 2013-10-28 22:29 ` Catalin Marinas
0 siblings, 0 replies; 19+ messages in thread
From: Catalin Marinas @ 2013-10-28 22:29 UTC (permalink / raw)
To: linux-arm-kernel
On 28 Oct 2013, at 20:32, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 28 October 2013 11:12, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> On Sun, Oct 13, 2013 at 01:14:59PM +0100, Ard Biesheuvel wrote:
>>> --- a/arch/arm64/kernel/fpsimd.c
>>> +++ b/arch/arm64/kernel/fpsimd.c
>>> @@ -72,7 +72,7 @@ void do_fpsimd_exc(unsigned int esr, struct pt_regs *regs)
>>> void fpsimd_thread_switch(struct task_struct *next)
>>> {
>>> /* check if not kernel threads */
>>> - if (current->mm)
>>> + if (current->mm && !test_and_clear_thread_flag(TIF_RELOAD_FPSTATE))
>>> fpsimd_save_state(&current->thread.fpsimd_state);
>>
>> Why does it need test_and_set_thread_flag() here? Some comments would be
>> useful as it looks strange to check a reload flag to decide whether to
>> save a state. Or change the name to something like 'dirty'.
>
> Actually, it's test and clear.
Yes, just a typo.
> If the userland register file has
> already been preserved for the purpose of performing kernel mode NEON,
> it should not be saved again when the task gets scheduled out. The
> clearing could also be deferred to the time when the task gets
> scheduled in again.
The above should be turned into a comment in the code.
> Or perhaps, it would be even better to always
> defer loading the userland state for the next task when that task in
> fact enters userland.
It needs some more thinking; it's usually cleaner in the context
switching code.
>>> if (next->mm)
>>> fpsimd_load_state(&next->thread.fpsimd_state);
>>
>> This function could be optimised a bit more to avoid saving/restoring if
>> the switch only happened between a user thread and a kernel one (and
>> back again) since the FP state may not have been dirtied. But what I had
>> in mind was per-CPU fpstate (possibly pointer or some flag) rather than
>> per-thread.
>
> Well, then we are entering the realm of lazy restore, imo. There were
> some patches proposed for that already, I think? But I do agree that
> at this point, there is no need to restore the userland register
> contents yet; it can be deferred to the point when the task reenters
> userland (as mentioned above).
Not entirely lazy. What I don't want to see (without proper benchmarks)
is disabling the FP at context switch and restoring the registers lazily
via the fault mechanism. What I'm proposing above is not lazy, just an
optimisation for a clear case where the FP is not used in a kernel
thread.
>>> --- a/arch/arm64/kernel/signal.c
>>> +++ b/arch/arm64/kernel/signal.c
>>> @@ -416,4 +416,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
>>> clear_thread_flag(TIF_NOTIFY_RESUME);
>>> tracehook_notify_resume(regs);
>>> }
>>> + if (test_and_clear_thread_flag(TIF_RELOAD_FPSTATE))
>>> + fpsimd_load_state(&current->thread.fpsimd_state);
>>
>> I think this code can be preempted, as it is run with IRQs enabled. And
>> there is a small window where we cleared the flag but haven't loaded the
>> state.
>
> If we are preempted at this point, the fpstate will be loaded in the
> normal way the next time this task runs, so I think this is harmless.
> Although I guess we may be restoring the fp state twice in that case?
Let's say task A does an svc and gets into kernel mode, followed by
kernel_neon_begin/end(). Before returning to user, the kernel runs the
above test_and_clear_thread_flag(). Immediately after, an interrupt
happens and task A is preempted. The fpsimd_thread_switch() function
finds that TIF_RELOAD_FPSTATE is cleared and saves the current FP state.
But the FP regs contain whatever the kernel neon code did, so you
corrupt the existing data.
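Spelled out as a timeline:

  do_notify_resume():
      test_and_clear_thread_flag(TIF_RELOAD_FPSTATE)  -> flag now clear
                  <-- IRQ fires here, task A is preempted
  fpsimd_thread_switch():
      current->mm != NULL and the flag is already clear, so
      fpsimd_save_state(&current->thread.fpsimd_state)
          -> overwrites the good saved copy with the kernel mode NEON
             scratch values still in the registers; user state is lost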
> So in summary, what I need to do is:
> - rework to use a per_cpu flag rather than a TIF;
> - preserve the userland state (if it has one) when a task gets scheduled out;
> - restore the userland state when a task enters userland;
Happy to discuss the algorithm before you code it (unless you prefer to
write the code quickly).
> I will propose an updated patch after I wrap up the work on the
> in_interrupt() kernel mode NEON, as these topics are really orthogonal
> and there is no reason to keep them combined in a single series.
Sounds fine.
Catalin
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2013-10-28 22:29 UTC | newest]
Thread overview: 19+ messages
2013-10-13 12:14 [RFC v3 PATCH 0/7] ARM[64]: kernel mode NEON in atomic contexts Ard Biesheuvel
2013-10-13 12:14 ` [RFC v3 PATCH 1/7] ARM: add support for kernel mode NEON in atomic context Ard Biesheuvel
2013-10-15 17:26 ` Catalin Marinas
2013-10-15 17:30 ` Ard Biesheuvel
2013-10-15 17:46 ` Catalin Marinas
2013-10-13 12:14 ` [RFC v3 PATCH 2/7] ARM: port NEON version of xor_blocks() to new kmode NEON api Ard Biesheuvel
2013-10-13 12:14 ` [RFC v3 PATCH 3/7] ARM64: defer reloading a task's FPSIMD state to userland resume Ard Biesheuvel
2013-10-28 18:12 ` Catalin Marinas
2013-10-28 20:32 ` Ard Biesheuvel
2013-10-28 22:29 ` Catalin Marinas
2013-10-13 12:15 ` [RFC v3 PATCH 4/7] ARM64: add support for kernel mode NEON in atomic context Ard Biesheuvel
2013-10-13 12:15 ` [RFC v3 PATCH 5/7] ARM64: add Crypto Extensions based synchronous core AES cipher Ard Biesheuvel
2013-10-13 12:15 ` [RFC v3 PATCH 6/7] ARM64: add Crypto Extensions based synchronous AES in CCM mode Ard Biesheuvel
2013-10-13 12:15 ` [RFC v3 PATCH 7/7] lib/raid6: port NEON implementation to updated kmode NEON api Ard Biesheuvel
2013-10-15 4:01 ` [RFC v3 PATCH 0/7] ARM[64]: kernel mode NEON in atomic contexts Nicolas Pitre
2013-10-15 13:13 ` Ard Biesheuvel
2013-10-15 14:06 ` Ard Biesheuvel
2013-10-15 16:05 ` Nicolas Pitre
2013-10-15 16:53 ` Catalin Marinas