* Re: [PATCH net-next 0/6] net/ncsi: Allow enabling multiple packages & channels
From: David Miller @ 2018-10-18 22:56 UTC (permalink / raw)
To: sam; +Cc: netdev, Justin.Lee1, linux-kernel, openbmc
In-Reply-To: <20181018035917.19413-1-sam@mendozajonas.com>
From: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Date: Thu, 18 Oct 2018 14:59:11 +1100
> This series extends the NCSI driver to configure multiple packages
> and/or channels simultaneously. Since the RFC series this includes a few
> extra changes to fix areas in the driver that either made this harder or
> were roadblocks due to deviations from the NCSI specification.
>
> Patches 1 & 2 fix two issues where the driver made assumptions about the
> capabilities of the NCSI topology.
> Patches 3 & 4 change some internal semantics slightly to make multi-mode
> easier.
> Patch 5 introduces a cleaner way of reconfiguring the NCSI configuration
> and keeping track of channel states.
> Patch 6 implements the main multi-package/multi-channel configuration,
> configured via the Netlink interface.
>
> Readers who have an interesting NCSI setup - especially multi-package
> with HWA - please test! I think I've covered all permutations but I
> don't have infinite hardware to test on.
This doesn't apply cleanly to net-next. Does it depend upon changes
applied elsewhere? You must always make that explicit.
Also, please explain this locking in ncsi_reset_dev():
+ NCSI_FOR_EACH_PACKAGE(ndp, np) {
+ NCSI_FOR_EACH_CHANNEL(np, nc) {
+ spin_lock_irqsave(&nc->lock, flags);
+ enabled = nc->monitor.enabled;
+ state = nc->state;
+ spin_unlock_irqrestore(&nc->lock, flags);
+
+ if (enabled)
+ ncsi_stop_channel_monitor(nc);
+ if (state == NCSI_CHANNEL_ACTIVE) {
+ active = nc;
+ break;
+ }
Is that really protecting anything?
Right after you drop np->lock those two values can change, the state
of the 'nc' can change such that it isn't NCSI_CHANNEL_ACTIVE anymore
etc.
At best this locking makes sure thatn enabled and state are consistent
with respect to eachother, only. It doesn't guarantee anything about
the stability of the state of the object at all, and it can change
right from under you.
^ permalink raw reply
* [PATCH net-next v8 02/28] asm: simd context helper API
From: Jason A. Donenfeld @ 2018-10-18 14:56 UTC (permalink / raw)
To: linux-kernel, netdev, linux-crypto, davem, gregkh
Cc: Jason A. Donenfeld, Samuel Neves, Andy Lutomirski,
Thomas Gleixner, linux-arch
In-Reply-To: <20181018145712.7538-1-Jason@zx2c4.com>
Sometimes it's useful to amortize calls to XSAVE/XRSTOR and the related
FPU/SIMD functions over a number of calls, because FPU restoration is
quite expensive. This adds a simple header for carrying out this pattern:
simd_context_t simd_context;
simd_get(&simd_context);
while ((item = get_item_from_queue()) != NULL) {
encrypt_item(item, &simd_context);
simd_relax(&simd_context);
}
simd_put(&simd_context);
The relaxation step ensures that we don't trample over preemption, and
the get/put API should be a familiar paradigm in the kernel.
On the other end, code that actually wants to use SIMD instructions can
accept this as a parameter and check it via:
void encrypt_item(struct item *item, simd_context_t *simd_context)
{
if (item->len > LARGE_FOR_SIMD && simd_use(simd_context))
wild_simd_code(item);
else
boring_scalar_code(item);
}
The actual XSAVE happens during simd_use (and only on the first time),
so that if the context is never actually used, no performance penalty is
hit.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: linux-arch@vger.kernel.org
---
arch/alpha/include/asm/Kbuild | 1 +
arch/arc/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 -
arch/arm/include/asm/simd.h | 63 ++++++++++++++++++++++++++++++
arch/arm64/include/asm/simd.h | 51 +++++++++++++++++++++---
arch/c6x/include/asm/Kbuild | 1 +
arch/h8300/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/ia64/include/asm/Kbuild | 1 +
arch/m68k/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/nds32/include/asm/Kbuild | 1 +
arch/nios2/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 1 +
arch/powerpc/include/asm/Kbuild | 1 +
arch/riscv/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 1 +
arch/unicore32/include/asm/Kbuild | 1 +
arch/x86/include/asm/simd.h | 44 ++++++++++++++++++++-
arch/xtensa/include/asm/Kbuild | 1 +
include/asm-generic/simd.h | 20 ++++++++++
include/linux/simd.h | 32 +++++++++++++++
27 files changed, 224 insertions(+), 8 deletions(-)
create mode 100644 arch/arm/include/asm/simd.h
create mode 100644 include/linux/simd.h
diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 0580cb8c84b2..220dfd170d45 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -13,3 +13,4 @@ generic-y += sections.h
generic-y += trace_clock.h
generic-y += current.h
generic-y += kprobes.h
+generic-y += simd.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index feed50ce89fa..a7f4255f1649 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -22,6 +22,7 @@ generic-y += parport.h
generic-y += pci.h
generic-y += percpu.h
generic-y += preempt.h
+generic-y += simd.h
generic-y += topology.h
generic-y += trace_clock.h
generic-y += user.h
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 1d66db9c9db5..ebdc9eeb8d39 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -16,7 +16,6 @@ generic-y += rwsem.h
generic-y += seccomp.h
generic-y += segment.h
generic-y += serial.h
-generic-y += simd.h
generic-y += sizes.h
generic-y += timex.h
generic-y += trace_clock.h
diff --git a/arch/arm/include/asm/simd.h b/arch/arm/include/asm/simd.h
new file mode 100644
index 000000000000..264ed84b41d8
--- /dev/null
+++ b/arch/arm/include/asm/simd.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <linux/simd.h>
+#ifndef _ASM_SIMD_H
+#define _ASM_SIMD_H
+
+#ifdef CONFIG_KERNEL_MODE_NEON
+#include <asm/neon.h>
+
+static __must_check inline bool may_use_simd(void)
+{
+ return !in_nmi() && !in_irq() && !in_serving_softirq();
+}
+
+static inline void simd_get(simd_context_t *ctx)
+{
+ *ctx = may_use_simd() ? HAVE_FULL_SIMD : HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t *ctx)
+{
+ if (*ctx & HAVE_SIMD_IN_USE)
+ kernel_neon_end();
+ *ctx = HAVE_NO_SIMD;
+}
+
+static __must_check inline bool simd_use(simd_context_t *ctx)
+{
+ if (!(*ctx & HAVE_FULL_SIMD))
+ return false;
+ if (*ctx & HAVE_SIMD_IN_USE)
+ return true;
+ kernel_neon_begin();
+ *ctx |= HAVE_SIMD_IN_USE;
+ return true;
+}
+
+#else
+
+static __must_check inline bool may_use_simd(void)
+{
+ return false;
+}
+
+static inline void simd_get(simd_context_t *ctx)
+{
+ *ctx = HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t *ctx)
+{
+}
+
+static __must_check inline bool simd_use(simd_context_t *ctx)
+{
+ return false;
+}
+#endif
+
+#endif /* _ASM_SIMD_H */
diff --git a/arch/arm64/include/asm/simd.h b/arch/arm64/include/asm/simd.h
index 6495cc51246f..a45ff1600040 100644
--- a/arch/arm64/include/asm/simd.h
+++ b/arch/arm64/include/asm/simd.h
@@ -1,11 +1,10 @@
-/*
- * Copyright (C) 2017 Linaro Ltd. <ard.biesheuvel@linaro.org>
+/* SPDX-License-Identifier: GPL-2.0
*
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License version 2 as published
- * by the Free Software Foundation.
+ * Copyright (C) 2017 Linaro Ltd. <ard.biesheuvel@linaro.org>
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
*/
+#include <linux/simd.h>
#ifndef __ASM_SIMD_H
#define __ASM_SIMD_H
@@ -16,6 +15,8 @@
#include <linux/types.h>
#ifdef CONFIG_KERNEL_MODE_NEON
+#include <asm/neon.h>
+#include <asm/simd.h>
DECLARE_PER_CPU(bool, kernel_neon_busy);
@@ -40,9 +41,47 @@ static __must_check inline bool may_use_simd(void)
!this_cpu_read(kernel_neon_busy);
}
+static inline void simd_get(simd_context_t *ctx)
+{
+ *ctx = may_use_simd() ? HAVE_FULL_SIMD : HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t *ctx)
+{
+ if (*ctx & HAVE_SIMD_IN_USE)
+ kernel_neon_end();
+ *ctx = HAVE_NO_SIMD;
+}
+
+static __must_check inline bool simd_use(simd_context_t *ctx)
+{
+ if (!(*ctx & HAVE_FULL_SIMD))
+ return false;
+ if (*ctx & HAVE_SIMD_IN_USE)
+ return true;
+ kernel_neon_begin();
+ *ctx |= HAVE_SIMD_IN_USE;
+ return true;
+}
+
#else /* ! CONFIG_KERNEL_MODE_NEON */
-static __must_check inline bool may_use_simd(void) {
+static __must_check inline bool may_use_simd(void)
+{
+ return false;
+}
+
+static inline void simd_get(simd_context_t *ctx)
+{
+ *ctx = HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t *ctx)
+{
+}
+
+static __must_check inline bool simd_use(simd_context_t *ctx)
+{
return false;
}
diff --git a/arch/c6x/include/asm/Kbuild b/arch/c6x/include/asm/Kbuild
index 33a2c94fed0d..7543c38f7ade 100644
--- a/arch/c6x/include/asm/Kbuild
+++ b/arch/c6x/include/asm/Kbuild
@@ -30,6 +30,7 @@ generic-y += pgalloc.h
generic-y += preempt.h
generic-y += segment.h
generic-y += serial.h
+generic-y += simd.h
generic-y += tlbflush.h
generic-y += topology.h
generic-y += trace_clock.h
diff --git a/arch/h8300/include/asm/Kbuild b/arch/h8300/include/asm/Kbuild
index a5d0b2991f47..1fcef25ee19d 100644
--- a/arch/h8300/include/asm/Kbuild
+++ b/arch/h8300/include/asm/Kbuild
@@ -39,6 +39,7 @@ generic-y += preempt.h
generic-y += scatterlist.h
generic-y += sections.h
generic-y += serial.h
+generic-y += simd.h
generic-y += sizes.h
generic-y += spinlock.h
generic-y += timex.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index dd2fd9c0d292..217d4695fd8a 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -29,6 +29,7 @@ generic-y += rwsem.h
generic-y += sections.h
generic-y += segment.h
generic-y += serial.h
+generic-y += simd.h
generic-y += sizes.h
generic-y += topology.h
generic-y += trace_clock.h
diff --git a/arch/ia64/include/asm/Kbuild b/arch/ia64/include/asm/Kbuild
index 557bbc8ba9f5..41c5ebdf79e5 100644
--- a/arch/ia64/include/asm/Kbuild
+++ b/arch/ia64/include/asm/Kbuild
@@ -4,6 +4,7 @@ generic-y += irq_work.h
generic-y += mcs_spinlock.h
generic-y += mm-arch-hooks.h
generic-y += preempt.h
+generic-y += simd.h
generic-y += trace_clock.h
generic-y += vtime.h
generic-y += word-at-a-time.h
diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
index a4b8d3331a9e..73898dd1a4d0 100644
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -19,6 +19,7 @@ generic-y += mm-arch-hooks.h
generic-y += percpu.h
generic-y += preempt.h
generic-y += sections.h
+generic-y += simd.h
generic-y += spinlock.h
generic-y += topology.h
generic-y += trace_clock.h
diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
index 569ba9e670c1..7a877eea99d3 100644
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -25,6 +25,7 @@ generic-y += parport.h
generic-y += percpu.h
generic-y += preempt.h
generic-y += serial.h
+generic-y += simd.h
generic-y += syscalls.h
generic-y += topology.h
generic-y += trace_clock.h
diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index 58351e48421e..e8868e0fb2c3 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -16,6 +16,7 @@ generic-y += qrwlock.h
generic-y += qspinlock.h
generic-y += sections.h
generic-y += segment.h
+generic-y += simd.h
generic-y += trace_clock.h
generic-y += unaligned.h
generic-y += user.h
diff --git a/arch/nds32/include/asm/Kbuild b/arch/nds32/include/asm/Kbuild
index dbc4e5422550..fb2f113716ce 100644
--- a/arch/nds32/include/asm/Kbuild
+++ b/arch/nds32/include/asm/Kbuild
@@ -46,6 +46,7 @@ generic-y += sections.h
generic-y += segment.h
generic-y += serial.h
generic-y += shmbuf.h
+generic-y += simd.h
generic-y += sizes.h
generic-y += stat.h
generic-y += switch_to.h
diff --git a/arch/nios2/include/asm/Kbuild b/arch/nios2/include/asm/Kbuild
index 8fde4fa2c34f..571a9d9ad107 100644
--- a/arch/nios2/include/asm/Kbuild
+++ b/arch/nios2/include/asm/Kbuild
@@ -33,6 +33,7 @@ generic-y += preempt.h
generic-y += sections.h
generic-y += segment.h
generic-y += serial.h
+generic-y += simd.h
generic-y += spinlock.h
generic-y += topology.h
generic-y += trace_clock.h
diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
index eb87cd8327c8..b6231211bbad 100644
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -34,6 +34,7 @@ generic-y += qrwlock_types.h
generic-y += qrwlock.h
generic-y += sections.h
generic-y += segment.h
+generic-y += simd.h
generic-y += string.h
generic-y += switch_to.h
generic-y += topology.h
diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
index 2013d639e735..97970b4d05ab 100644
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -17,6 +17,7 @@ generic-y += percpu.h
generic-y += preempt.h
generic-y += seccomp.h
generic-y += segment.h
+generic-y += simd.h
generic-y += topology.h
generic-y += trace_clock.h
generic-y += user.h
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 3196d227e351..2337190aaf69 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -8,3 +8,4 @@ generic-y += preempt.h
generic-y += rwsem.h
generic-y += vtime.h
generic-y += msi.h
+generic-y += simd.h
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index efdbe311e936..438a11d9c47a 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -46,6 +46,7 @@ generic-y += setup.h
generic-y += shmbuf.h
generic-y += shmparam.h
generic-y += signal.h
+generic-y += simd.h
generic-y += socket.h
generic-y += sockios.h
generic-y += stat.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index e3239772887a..3744c4c61fb5 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -22,6 +22,7 @@ generic-y += mcs_spinlock.h
generic-y += mm-arch-hooks.h
generic-y += preempt.h
generic-y += rwsem.h
+generic-y += simd.h
generic-y += trace_clock.h
generic-y += unaligned.h
generic-y += word-at-a-time.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index 6a5609a55965..8e64ff35a933 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -16,6 +16,7 @@ generic-y += percpu.h
generic-y += preempt.h
generic-y += rwsem.h
generic-y += serial.h
+generic-y += simd.h
generic-y += sizes.h
generic-y += trace_clock.h
generic-y += xor.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index 410b263ef5c8..72b9e08fb350 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -17,5 +17,6 @@ generic-y += msi.h
generic-y += preempt.h
generic-y += rwsem.h
generic-y += serial.h
+generic-y += simd.h
generic-y += trace_clock.h
generic-y += word-at-a-time.h
diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index b10dde6cb793..8c2bfa6e0494 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -22,6 +22,7 @@ generic-y += param.h
generic-y += pci.h
generic-y += percpu.h
generic-y += preempt.h
+generic-y += simd.h
generic-y += switch_to.h
generic-y += topology.h
generic-y += trace_clock.h
diff --git a/arch/unicore32/include/asm/Kbuild b/arch/unicore32/include/asm/Kbuild
index bfc7abe77905..98a908720bbd 100644
--- a/arch/unicore32/include/asm/Kbuild
+++ b/arch/unicore32/include/asm/Kbuild
@@ -27,6 +27,7 @@ generic-y += preempt.h
generic-y += sections.h
generic-y += segment.h
generic-y += serial.h
+generic-y += simd.h
generic-y += sizes.h
generic-y += syscalls.h
generic-y += topology.h
diff --git a/arch/x86/include/asm/simd.h b/arch/x86/include/asm/simd.h
index a341c878e977..4aad7f158dcb 100644
--- a/arch/x86/include/asm/simd.h
+++ b/arch/x86/include/asm/simd.h
@@ -1,4 +1,11 @@
-/* SPDX-License-Identifier: GPL-2.0 */
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <linux/simd.h>
+#ifndef _ASM_SIMD_H
+#define _ASM_SIMD_H
#include <asm/fpu/api.h>
@@ -10,3 +17,38 @@ static __must_check inline bool may_use_simd(void)
{
return irq_fpu_usable();
}
+
+static inline void simd_get(simd_context_t *ctx)
+{
+#if !defined(CONFIG_UML)
+ *ctx = may_use_simd() ? HAVE_FULL_SIMD : HAVE_NO_SIMD;
+#else
+ *ctx = HAVE_NO_SIMD;
+#endif
+}
+
+static inline void simd_put(simd_context_t *ctx)
+{
+#if !defined(CONFIG_UML)
+ if (*ctx & HAVE_SIMD_IN_USE)
+ kernel_fpu_end();
+#endif
+ *ctx = HAVE_NO_SIMD;
+}
+
+static __must_check inline bool simd_use(simd_context_t *ctx)
+{
+#if !defined(CONFIG_UML)
+ if (!(*ctx & HAVE_FULL_SIMD))
+ return false;
+ if (*ctx & HAVE_SIMD_IN_USE)
+ return true;
+ kernel_fpu_begin();
+ *ctx |= HAVE_SIMD_IN_USE;
+ return true;
+#else
+ return false;
+#endif
+}
+
+#endif /* _ASM_SIMD_H */
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index 82c756431b49..7950f359649d 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -24,6 +24,7 @@ generic-y += percpu.h
generic-y += preempt.h
generic-y += rwsem.h
generic-y += sections.h
+generic-y += simd.h
generic-y += topology.h
generic-y += trace_clock.h
generic-y += word-at-a-time.h
diff --git a/include/asm-generic/simd.h b/include/asm-generic/simd.h
index d0343d58a74a..b3dd61ac010e 100644
--- a/include/asm-generic/simd.h
+++ b/include/asm-generic/simd.h
@@ -1,5 +1,9 @@
/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/simd.h>
+#ifndef _ASM_SIMD_H
+#define _ASM_SIMD_H
+
#include <linux/hardirq.h>
/*
@@ -13,3 +17,19 @@ static __must_check inline bool may_use_simd(void)
{
return !in_interrupt();
}
+
+static inline void simd_get(simd_context_t *ctx)
+{
+ *ctx = HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t *ctx)
+{
+}
+
+static __must_check inline bool simd_use(simd_context_t *ctx)
+{
+ return false;
+}
+
+#endif /* _ASM_SIMD_H */
diff --git a/include/linux/simd.h b/include/linux/simd.h
new file mode 100644
index 000000000000..4e0b8a9bdc14
--- /dev/null
+++ b/include/linux/simd.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef _SIMD_H
+#define _SIMD_H
+
+typedef enum {
+ HAVE_NO_SIMD = 1 << 0,
+ HAVE_FULL_SIMD = 1 << 1,
+ HAVE_SIMD_IN_USE = 1 << 31
+} simd_context_t;
+
+#define DONT_USE_SIMD ((simd_context_t []){ HAVE_NO_SIMD })
+
+#include <linux/sched.h>
+#include <asm/simd.h>
+
+static inline bool simd_relax(simd_context_t *ctx)
+{
+#ifdef CONFIG_PREEMPT
+ if ((*ctx & HAVE_SIMD_IN_USE) && need_resched()) {
+ simd_put(ctx);
+ simd_get(ctx);
+ return true;
+ }
+#endif
+ return false;
+}
+
+#endif /* _SIMD_H */
--
2.19.1
^ permalink raw reply related
* [PATCH net-next v8 03/28] zinc: introduce minimal cryptography library
From: Jason A. Donenfeld @ 2018-10-18 14:56 UTC (permalink / raw)
To: linux-kernel, netdev, linux-crypto, davem, gregkh
Cc: Jason A. Donenfeld, Samuel Neves, Jean-Philippe Aumasson,
Andy Lutomirski, Andrew Morton, Linus Torvalds, kernel-hardening
In-Reply-To: <20181018145712.7538-1-Jason@zx2c4.com>
Zinc stands for "Zinc Is Neat Crypto" or "Zinc as IN Crypto". It's also
short, easy to type, and plays nicely with the recent trend of naming
crypto libraries after elements. The guiding principle is "don't overdo
it". It's less of a library and more of a directory tree for organizing
well-curated direct implementations of cryptography primitives.
Zinc is a new cryptography API that is much more minimal and lower-level
than the current one. It intends to complement it and provide a basis
upon which the current crypto API might build, as the provider of
software implementations of cryptographic primitives. It is motivated by
three primary observations in crypto API design:
* Highly composable "cipher modes" and related abstractions from the
90s did not turn out to be as terrific an idea as hoped, leading to
a host of API misuse problems.
* Most programmers are afraid of crypto code, and so prefer to
integrate it into libraries in a highly abstracted manner, so as to
shield themselves from implementation details. Cryptographers, on
the other hand, prefer simple direct implementations, which they're
able to verify for high assurance and optimize in accordance with
their expertise.
* Overly abstracted and flexible cryptography APIs lead to a host of
dangerous problems and performance issues. The kernel is in the
business usually not of coming up with new uses of crypto, but
rather implementing various constructions, which means it essentially
needs a library of primitives, not a highly abstracted enterprise-ready
pluggable system, with a few particular exceptions.
This last observation has seen itself play out several times over and
over again within the kernel:
* The perennial move of actual primitives away from crypto/ and into
lib/, so that users can actually call these functions directly with
no overhead and without lots of allocations, function pointers,
string specifier parsing, and general clunkiness. For example:
sha256, chacha20, siphash, sha1, and so forth live in lib/ rather
than in crypto/. Zinc intends to stop the cluttering of lib/ and
introduce these direct primitives into their proper place, lib/zinc/.
* An abundance of misuse bugs with the present crypto API that have
been very unpleasant to clean up.
* A hesitance to even use cryptography, because of the overhead and
headaches involved in accessing the routines.
Zinc goes in a rather different direction. Rather than providing a
thoroughly designed and abstracted API, Zinc gives you simple functions,
which implement some primitive, or some particular and specific
construction of primitives. It is not dynamic in the least, though one
could imagine implementing a complex dynamic dispatch mechanism (such as
the current crypto API) on top of these basic functions. After all,
dynamic dispatch is usually needed for applications with cipher agility,
such as IPsec, dm-crypt, AF_ALG, and so forth, and the existing crypto
API will continue to play that role. However, Zinc will provide a non-
haphazard way of directly utilizing crypto routines in applications
that do have neither the need nor desire for abstraction and dynamic
dispatch.
It also organizes the implementations in a simple, straight-forward,
and direct manner, making it enjoyable and intuitive to work on.
Rather than moving optimized assembly implementations into arch/, it
keeps them all together in lib/zinc/, making it simple and obvious to
compare and contrast what's happening. This is, notably, exactly what
the lib/raid6/ tree does, and that seems to work out rather well. It's
also the pattern of most successful crypto libraries. The architecture-
specific glue-code is made a part of each translation unit, rather than
being in a separate one, so that generic and architecture-optimized code
are combined at compile-time, and incompatibility branches compiled out by
the optimizer.
All implementations have been extensively tested and fuzzed, and are
selected for their quality, trustworthiness, and performance. Wherever
possible and performant, formally verified implementations are used,
such as those from HACL* [1] and Fiat-Crypto [2]. The routines also take
special care to zero out secrets using memzero_explicit (and future work
is planned to have gcc do this more reliably and performantly with
compiler plugins). The performance of the selected implementations is
state-of-the-art and unrivaled on a broad array of hardware, though of
course we will continue to fine tune these to the hardware demands
needed by kernel contributors. Each implementation also comes with
extensive self-tests and crafted test vectors, pulled from various
places such as Wycheproof [9].
Regularity of function signatures is important, so that users can easily
"guess" the name of the function they want. Though, individual
primitives are oftentimes not trivially interchangeable, having been
designed for different things and requiring different parameters and
semantics, and so the function signatures they provide will directly
reflect the realities of the primitives' usages, rather than hiding it
behind (inevitably leaky) abstractions. Also, in contrast to the current
crypto API, Zinc functions can work on stack buffers, and can be called
with different keys, without requiring allocations or locking.
SIMD is used automatically when available, though some routines may
benefit from either having their SIMD disabled for particular
invocations, or to have the SIMD initialization calls amortized over
several invocations of the function, and so Zinc utilizes function
signatures enabling that in conjunction with the recently introduced
simd_context_t.
More generally, Zinc provides function signatures that allow just what
is required by the various callers. This isn't to say that users of the
functions will be permitted to pollute the function semantics with weird
particular needs, but we are trying very hard not to overdo it, and that
means looking carefully at what's actually necessary, and doing just that,
and not much more than that. Remember: practicality and cleanliness rather
than over-zealous infrastructure.
Zinc provides also an opening for the best implementers in academia to
contribute their time and effort to the kernel, by being sufficiently
simple and inviting. In discussing this commit with some of the best and
brightest over the last few years, there are many who are eager to
devote rare talent and energy to this effort.
Following the merging of this, I expect for the primitives that
currently exist in lib/ to work their way into lib/zinc/, after intense
scrutiny of each implementation, potentially replacing them with either
formally-verified implementations, or better studied and faster
state-of-the-art implementations.
Also following the merging of this, I expect for the old crypto API
implementations to be ported over to use Zinc for their software-based
implementations.
As Zinc is simply library code, its config options are un-menued, with
the exception of CONFIG_ZINC_SELFTEST and CONFIG_ZINC_DEBUG, which enables
various selftests and debugging conditions.
[1] https://github.com/project-everest/hacl-star
[2] https://github.com/mit-plv/fiat-crypto
[3] https://cr.yp.to/ecdh.html
[4] https://cr.yp.to/chacha.html
[5] https://cr.yp.to/snuffle/xsalsa-20081128.pdf
[6] https://cr.yp.to/mac.html
[7] https://blake2.net/
[8] https://tools.ietf.org/html/rfc8439
[9] https://github.com/google/wycheproof
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
---
MAINTAINERS | 8 ++++++++
lib/Kconfig | 2 ++
lib/Makefile | 2 ++
lib/zinc/Kconfig | 41 +++++++++++++++++++++++++++++++++++++++++
lib/zinc/Makefile | 3 +++
5 files changed, 56 insertions(+)
create mode 100644 lib/zinc/Kconfig
create mode 100644 lib/zinc/Makefile
diff --git a/MAINTAINERS b/MAINTAINERS
index 7f1399ac028e..c0d078d4dbd8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16230,6 +16230,14 @@ Q: https://patchwork.linuxtv.org/project/linux-media/list/
S: Maintained
F: drivers/media/dvb-frontends/zd1301_demod*
+ZINC CRYPTOGRAPHY LIBRARY
+M: Jason A. Donenfeld <Jason@zx2c4.com>
+M: Samuel Neves <sneves@dei.uc.pt>
+S: Maintained
+F: lib/zinc/
+F: include/zinc/
+L: linux-crypto@vger.kernel.org
+
ZPOOL COMPRESSED PAGE STORAGE API
M: Dan Streetman <ddstreet@ieee.org>
L: linux-mm@kvack.org
diff --git a/lib/Kconfig b/lib/Kconfig
index a3928d4438b5..3e6848269c66 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -485,6 +485,8 @@ config GLOB_SELFTEST
module load) by a small amount, so you're welcome to play with
it, but you probably don't need it.
+source "lib/zinc/Kconfig"
+
#
# Netlink attribute parsing support is select'ed if needed
#
diff --git a/lib/Makefile b/lib/Makefile
index 423876446810..7a04d7ce95a6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -213,6 +213,8 @@ obj-$(CONFIG_PERCPU_TEST) += percpu_test.o
obj-$(CONFIG_ASN1) += asn1_decoder.o
+obj-y += zinc/
+
obj-$(CONFIG_FONT_SUPPORT) += fonts/
obj-$(CONFIG_PRIME_NUMBERS) += prime_numbers.o
diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig
new file mode 100644
index 000000000000..90e066ea93a0
--- /dev/null
+++ b/lib/zinc/Kconfig
@@ -0,0 +1,41 @@
+config ZINC_SELFTEST
+ bool "Zinc cryptography library self-tests"
+ help
+ This builds a series of self-tests for the Zinc crypto library, which
+ help diagnose any cryptographic algorithm implementation issues that
+ might be at the root cause of potential bugs. It also adds various
+ traps for incorrect usage.
+
+ Unless you are optimizing for machines without much disk space or for
+ very slow machines, it is probably a good idea to say Y here, so that
+ any potential cryptographic bugs translate into easy bug reports
+ rather than long-lasting security issues.
+
+config ZINC_DEBUG
+ bool "Zinc cryptography library debugging"
+ help
+ This turns on a series of additional checks and debugging options
+ that are useful for developers but probably will not provide much
+ benefit to end users.
+
+ Most people should say N here.
+
+config ZINC_ARCH_ARM
+ def_bool y
+ depends on ARM
+
+config ZINC_ARCH_ARM64
+ def_bool y
+ depends on ARM64
+
+config ZINC_ARCH_X86_64
+ def_bool y
+ depends on X86_64 && !UML
+
+config ZINC_ARCH_MIPS
+ def_bool y
+ depends on MIPS && CPU_MIPS32_R2 && !64BIT
+
+config ZINC_ARCH_MIPS64
+ def_bool y
+ depends on MIPS && 64BIT
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
new file mode 100644
index 000000000000..a61c80d676cb
--- /dev/null
+++ b/lib/zinc/Makefile
@@ -0,0 +1,3 @@
+ccflags-y := -O2
+ccflags-y += -D'pr_fmt(fmt)="zinc: " fmt'
+ccflags-$(CONFIG_ZINC_DEBUG) += -DDEBUG
--
2.19.1
^ permalink raw reply related
* [PATCH net-next v8 06/28] zinc: ChaCha20 x86_64 implementation
From: Jason A. Donenfeld @ 2018-10-18 14:56 UTC (permalink / raw)
To: linux-kernel, netdev, linux-crypto, davem, gregkh
Cc: Jason A. Donenfeld, Samuel Neves, Thomas Gleixner, Ingo Molnar,
x86, Jean-Philippe Aumasson, Andy Lutomirski, Andrew Morton,
Linus Torvalds, kernel-hardening
In-Reply-To: <20181018145712.7538-1-Jason@zx2c4.com>
This ports SSSE3, AVX-2, AVX-512F, and AVX-512VL implementations for
ChaCha20. The AVX-512F implementation is disabled on Skylake, due to
throttling, and the VL ymm implementation is used instead. These come
from Andy Polyakov's implementation, with the following modifications
from Samuel Neves:
- Some cosmetic changes, like renaming labels to .Lname, constants,
and other Linux conventions.
- CPU feature checking is done in C by the glue code, so that has been
removed from the assembly.
- Eliminate translating certain instructions, such as pshufb, palignr,
vprotd, etc, to .byte directives. This is meant for compatibility
with ancient toolchains, but presumably it is unnecessary here,
since the build system already does checks on what GNU as can
assemble.
- When aligning the stack, the original code was saving %rsp to %r9.
To keep objtool happy, we use instead the DRAP idiom to save %rsp
to %r10:
leaq 8(%rsp),%r10
... code here ...
leaq -8(%r10),%rsp
- The original code assumes the stack comes aligned to 16 bytes. This
is not necessarily the case, and to avoid crashes,
`andq $-alignment, %rsp` was added in the prolog of a few functions.
- The original hardcodes returns as .byte 0xf3,0xc3, aka "rep ret".
We replace this by "ret". "rep ret" was meant to help with AMD K8
chips, cf. http://repzret.org/p/repzret. It makes no sense to
continue to use this kludge for code that won't even run on ancient
AMD chips.
Cycle counts on a Core i7 6700HQ using the AVX-2 codepath, comparing
this implementation ("new") to the implementation in the current crypto
api ("old"):
size old new
---- ---- ----
0 62 52
16 414 376
32 410 400
48 414 422
64 362 356
80 714 666
96 714 700
112 712 718
128 692 646
144 1042 674
160 1042 694
176 1042 726
192 1018 650
208 1366 686
224 1366 696
240 1366 722
256 640 656
272 988 1246
288 988 1276
304 992 1296
320 972 1222
336 1318 1256
352 1318 1276
368 1316 1294
384 1294 1218
400 1642 1258
416 1642 1282
432 1642 1302
448 1628 1224
464 1970 1258
480 1970 1280
496 1970 1300
512 656 676
528 1010 1290
544 1010 1306
560 1010 1332
576 986 1254
592 1340 1284
608 1334 1310
624 1340 1334
640 1314 1254
656 1664 1282
672 1674 1306
688 1662 1336
704 1638 1250
720 1992 1292
736 1994 1308
752 1988 1334
768 1252 1254
784 1596 1290
800 1596 1314
816 1596 1330
832 1576 1256
848 1922 1286
864 1922 1314
880 1926 1338
896 1898 1258
912 2248 1288
928 2248 1320
944 2248 1338
960 2226 1268
976 2574 1288
992 2576 1312
1008 2574 1340
Cycle counts on a Xeon Gold 5120 using the AVX-512 codepath:
size old new
---- ---- ----
0 64 54
16 386 372
32 388 396
48 388 420
64 366 350
80 708 666
96 708 692
112 706 736
128 692 648
144 1036 682
160 1036 708
176 1036 730
192 1016 658
208 1360 684
224 1362 708
240 1360 732
256 644 500
272 990 526
288 988 556
304 988 576
320 972 500
336 1314 532
352 1316 558
368 1318 578
384 1308 506
400 1644 532
416 1644 556
432 1644 594
448 1624 508
464 1970 534
480 1970 556
496 1968 582
512 660 624
528 1016 682
544 1016 702
560 1018 728
576 998 654
592 1344 680
608 1344 708
624 1344 730
640 1326 654
656 1670 686
672 1670 708
688 1670 732
704 1652 658
720 1998 682
736 1998 710
752 1996 734
768 1256 662
784 1606 688
800 1606 714
816 1606 736
832 1584 660
848 1948 688
864 1950 714
880 1948 736
896 1912 688
912 2258 718
928 2258 744
944 2256 768
960 2238 692
976 2584 718
992 2584 744
1008 2584 770
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: x86@kernel.org
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
---
lib/zinc/Makefile | 1 +
lib/zinc/chacha20/chacha20-x86_64-glue.c | 103 ++
...-x86_64-cryptogams.S => chacha20-x86_64.S} | 1557 ++++-------------
lib/zinc/chacha20/chacha20.c | 4 +
4 files changed, 486 insertions(+), 1179 deletions(-)
create mode 100644 lib/zinc/chacha20/chacha20-x86_64-glue.c
rename lib/zinc/chacha20/{chacha20-x86_64-cryptogams.S => chacha20-x86_64.S} (71%)
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 3d80144d55a6..223a0816c918 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -3,4 +3,5 @@ ccflags-y += -D'pr_fmt(fmt)="zinc: " fmt'
ccflags-$(CONFIG_ZINC_DEBUG) += -DDEBUG
zinc_chacha20-y := chacha20/chacha20.o
+zinc_chacha20-$(CONFIG_ZINC_ARCH_X86_64) += chacha20/chacha20-x86_64.o
obj-$(CONFIG_ZINC_CHACHA20) += zinc_chacha20.o
diff --git a/lib/zinc/chacha20/chacha20-x86_64-glue.c b/lib/zinc/chacha20/chacha20-x86_64-glue.c
new file mode 100644
index 000000000000..8629d5d420e6
--- /dev/null
+++ b/lib/zinc/chacha20/chacha20-x86_64-glue.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <asm/fpu/api.h>
+#include <asm/cpufeature.h>
+#include <asm/processor.h>
+#include <asm/intel-family.h>
+
+asmlinkage void hchacha20_ssse3(u32 *derived_key, const u8 *nonce,
+ const u8 *key);
+asmlinkage void chacha20_ssse3(u8 *out, const u8 *in, const size_t len,
+ const u32 key[8], const u32 counter[4]);
+asmlinkage void chacha20_avx2(u8 *out, const u8 *in, const size_t len,
+ const u32 key[8], const u32 counter[4]);
+asmlinkage void chacha20_avx512(u8 *out, const u8 *in, const size_t len,
+ const u32 key[8], const u32 counter[4]);
+asmlinkage void chacha20_avx512vl(u8 *out, const u8 *in, const size_t len,
+ const u32 key[8], const u32 counter[4]);
+
+static bool chacha20_use_ssse3 __ro_after_init;
+static bool chacha20_use_avx2 __ro_after_init;
+static bool chacha20_use_avx512 __ro_after_init;
+static bool chacha20_use_avx512vl __ro_after_init;
+static bool *const chacha20_nobs[] __initconst = {
+ &chacha20_use_ssse3, &chacha20_use_avx2, &chacha20_use_avx512,
+ &chacha20_use_avx512vl };
+
+static void __init chacha20_fpu_init(void)
+{
+ chacha20_use_ssse3 = boot_cpu_has(X86_FEATURE_SSSE3);
+ chacha20_use_avx2 =
+ boot_cpu_has(X86_FEATURE_AVX) &&
+ boot_cpu_has(X86_FEATURE_AVX2) &&
+ cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
+ chacha20_use_avx512 =
+ boot_cpu_has(X86_FEATURE_AVX) &&
+ boot_cpu_has(X86_FEATURE_AVX2) &&
+ boot_cpu_has(X86_FEATURE_AVX512F) &&
+ cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
+ XFEATURE_MASK_AVX512, NULL) &&
+ /* Skylake downclocks unacceptably much when using zmm. */
+ boot_cpu_data.x86_model != INTEL_FAM6_SKYLAKE_X;
+ chacha20_use_avx512vl =
+ boot_cpu_has(X86_FEATURE_AVX) &&
+ boot_cpu_has(X86_FEATURE_AVX2) &&
+ boot_cpu_has(X86_FEATURE_AVX512F) &&
+ boot_cpu_has(X86_FEATURE_AVX512VL) &&
+ cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
+ XFEATURE_MASK_AVX512, NULL);
+}
+
+static inline bool chacha20_arch(struct chacha20_ctx *ctx, u8 *dst,
+ const u8 *src, size_t len,
+ simd_context_t *simd_context)
+{
+ /* SIMD disables preemption, so relax after processing each page. */
+ BUILD_BUG_ON(PAGE_SIZE < CHACHA20_BLOCK_SIZE ||
+ PAGE_SIZE % CHACHA20_BLOCK_SIZE);
+
+ if (!IS_ENABLED(CONFIG_AS_SSSE3) || !chacha20_use_ssse3 ||
+ len <= CHACHA20_BLOCK_SIZE || !simd_use(simd_context))
+ return false;
+
+ for (;;) {
+ const size_t bytes = min_t(size_t, len, PAGE_SIZE);
+
+ if (IS_ENABLED(CONFIG_AS_AVX512) && chacha20_use_avx512 &&
+ len >= CHACHA20_BLOCK_SIZE * 8)
+ chacha20_avx512(dst, src, bytes, ctx->key, ctx->counter);
+ else if (IS_ENABLED(CONFIG_AS_AVX512) && chacha20_use_avx512vl &&
+ len >= CHACHA20_BLOCK_SIZE * 4)
+ chacha20_avx512vl(dst, src, bytes, ctx->key, ctx->counter);
+ else if (IS_ENABLED(CONFIG_AS_AVX2) && chacha20_use_avx2 &&
+ len >= CHACHA20_BLOCK_SIZE * 4)
+ chacha20_avx2(dst, src, bytes, ctx->key, ctx->counter);
+ else
+ chacha20_ssse3(dst, src, bytes, ctx->key, ctx->counter);
+ ctx->counter[0] += (bytes + 63) / 64;
+ len -= bytes;
+ if (!len)
+ break;
+ dst += bytes;
+ src += bytes;
+ simd_relax(simd_context);
+ }
+
+ return true;
+}
+
+static inline bool hchacha20_arch(u32 derived_key[CHACHA20_KEY_WORDS],
+ const u8 nonce[HCHACHA20_NONCE_SIZE],
+ const u8 key[HCHACHA20_KEY_SIZE],
+ simd_context_t *simd_context)
+{
+ if (IS_ENABLED(CONFIG_AS_SSSE3) && chacha20_use_ssse3 &&
+ simd_use(simd_context)) {
+ hchacha20_ssse3(derived_key, nonce, key);
+ return true;
+ }
+ return false;
+}
diff --git a/lib/zinc/chacha20/chacha20-x86_64-cryptogams.S b/lib/zinc/chacha20/chacha20-x86_64.S
similarity index 71%
rename from lib/zinc/chacha20/chacha20-x86_64-cryptogams.S
rename to lib/zinc/chacha20/chacha20-x86_64.S
index 2bfc76f7e01f..3d10c7f21642 100644
--- a/lib/zinc/chacha20/chacha20-x86_64-cryptogams.S
+++ b/lib/zinc/chacha20/chacha20-x86_64.S
@@ -1,351 +1,148 @@
/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
/*
+ * Copyright (C) 2017 Samuel Neves <sneves@dei.uc.pt>. All Rights Reserved.
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
* Copyright (C) 2006-2017 CRYPTOGAMS by <appro@openssl.org>. All Rights Reserved.
+ *
+ * This is based in part on Andy Polyakov's implementation from CRYPTOGAMS.
*/
-.text
+#include <linux/linkage.h>
-
-
-.align 64
+.section .rodata.cst16.Lzero, "aM", @progbits, 16
+.align 16
.Lzero:
.long 0,0,0,0
+.section .rodata.cst16.Lone, "aM", @progbits, 16
+.align 16
.Lone:
.long 1,0,0,0
+.section .rodata.cst16.Linc, "aM", @progbits, 16
+.align 16
.Linc:
.long 0,1,2,3
+.section .rodata.cst16.Lfour, "aM", @progbits, 16
+.align 16
.Lfour:
.long 4,4,4,4
+.section .rodata.cst32.Lincy, "aM", @progbits, 32
+.align 32
.Lincy:
.long 0,2,4,6,1,3,5,7
+.section .rodata.cst32.Leight, "aM", @progbits, 32
+.align 32
.Leight:
.long 8,8,8,8,8,8,8,8
+.section .rodata.cst16.Lrot16, "aM", @progbits, 16
+.align 16
.Lrot16:
.byte 0x2,0x3,0x0,0x1, 0x6,0x7,0x4,0x5, 0xa,0xb,0x8,0x9, 0xe,0xf,0xc,0xd
+.section .rodata.cst16.Lrot24, "aM", @progbits, 16
+.align 16
.Lrot24:
.byte 0x3,0x0,0x1,0x2, 0x7,0x4,0x5,0x6, 0xb,0x8,0x9,0xa, 0xf,0xc,0xd,0xe
-.Ltwoy:
-.long 2,0,0,0, 2,0,0,0
+.section .rodata.cst16.Lsigma, "aM", @progbits, 16
+.align 16
+.Lsigma:
+.byte 101,120,112,97,110,100,32,51,50,45,98,121,116,101,32,107,0
+.section .rodata.cst64.Lzeroz, "aM", @progbits, 64
.align 64
.Lzeroz:
.long 0,0,0,0, 1,0,0,0, 2,0,0,0, 3,0,0,0
+.section .rodata.cst64.Lfourz, "aM", @progbits, 64
+.align 64
.Lfourz:
.long 4,0,0,0, 4,0,0,0, 4,0,0,0, 4,0,0,0
+.section .rodata.cst64.Lincz, "aM", @progbits, 64
+.align 64
.Lincz:
.long 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
+.section .rodata.cst64.Lsixteen, "aM", @progbits, 64
+.align 64
.Lsixteen:
.long 16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16
-.Lsigma:
-.byte 101,120,112,97,110,100,32,51,50,45,98,121,116,101,32,107,0
-.byte 67,104,97,67,104,97,50,48,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
-.globl ChaCha20_ctr32
-.type ChaCha20_ctr32,@function
+.section .rodata.cst32.Ltwoy, "aM", @progbits, 32
.align 64
-ChaCha20_ctr32:
-.cfi_startproc
- cmpq $0,%rdx
- je .Lno_data
- movq OPENSSL_ia32cap_P+4(%rip),%r10
- btq $48,%r10
- jc .LChaCha20_avx512
- testq %r10,%r10
- js .LChaCha20_avx512vl
- testl $512,%r10d
- jnz .LChaCha20_ssse3
-
- pushq %rbx
-.cfi_adjust_cfa_offset 8
-.cfi_offset %rbx,-16
- pushq %rbp
-.cfi_adjust_cfa_offset 8
-.cfi_offset %rbp,-24
- pushq %r12
-.cfi_adjust_cfa_offset 8
-.cfi_offset %r12,-32
- pushq %r13
-.cfi_adjust_cfa_offset 8
-.cfi_offset %r13,-40
- pushq %r14
-.cfi_adjust_cfa_offset 8
-.cfi_offset %r14,-48
- pushq %r15
-.cfi_adjust_cfa_offset 8
-.cfi_offset %r15,-56
- subq $64+24,%rsp
-.cfi_adjust_cfa_offset 64+24
-.Lctr32_body:
-
-
- movdqu (%rcx),%xmm1
- movdqu 16(%rcx),%xmm2
- movdqu (%r8),%xmm3
- movdqa .Lone(%rip),%xmm4
-
+.Ltwoy:
+.long 2,0,0,0, 2,0,0,0
- movdqa %xmm1,16(%rsp)
- movdqa %xmm2,32(%rsp)
- movdqa %xmm3,48(%rsp)
- movq %rdx,%rbp
- jmp .Loop_outer
+.text
+#ifdef CONFIG_AS_SSSE3
.align 32
-.Loop_outer:
- movl $0x61707865,%eax
- movl $0x3320646e,%ebx
- movl $0x79622d32,%ecx
- movl $0x6b206574,%edx
- movl 16(%rsp),%r8d
- movl 20(%rsp),%r9d
- movl 24(%rsp),%r10d
- movl 28(%rsp),%r11d
- movd %xmm3,%r12d
- movl 52(%rsp),%r13d
- movl 56(%rsp),%r14d
- movl 60(%rsp),%r15d
-
- movq %rbp,64+0(%rsp)
- movl $10,%ebp
- movq %rsi,64+8(%rsp)
-.byte 102,72,15,126,214
- movq %rdi,64+16(%rsp)
- movq %rsi,%rdi
- shrq $32,%rdi
- jmp .Loop
+ENTRY(hchacha20_ssse3)
+ movdqa .Lsigma(%rip),%xmm0
+ movdqu (%rdx),%xmm1
+ movdqu 16(%rdx),%xmm2
+ movdqu (%rsi),%xmm3
+ movdqa .Lrot16(%rip),%xmm6
+ movdqa .Lrot24(%rip),%xmm7
+ movq $10,%r8
+ .align 32
+.Loop_hssse3:
+ paddd %xmm1,%xmm0
+ pxor %xmm0,%xmm3
+ pshufb %xmm6,%xmm3
+ paddd %xmm3,%xmm2
+ pxor %xmm2,%xmm1
+ movdqa %xmm1,%xmm4
+ psrld $20,%xmm1
+ pslld $12,%xmm4
+ por %xmm4,%xmm1
+ paddd %xmm1,%xmm0
+ pxor %xmm0,%xmm3
+ pshufb %xmm7,%xmm3
+ paddd %xmm3,%xmm2
+ pxor %xmm2,%xmm1
+ movdqa %xmm1,%xmm4
+ psrld $25,%xmm1
+ pslld $7,%xmm4
+ por %xmm4,%xmm1
+ pshufd $78,%xmm2,%xmm2
+ pshufd $57,%xmm1,%xmm1
+ pshufd $147,%xmm3,%xmm3
+ nop
+ paddd %xmm1,%xmm0
+ pxor %xmm0,%xmm3
+ pshufb %xmm6,%xmm3
+ paddd %xmm3,%xmm2
+ pxor %xmm2,%xmm1
+ movdqa %xmm1,%xmm4
+ psrld $20,%xmm1
+ pslld $12,%xmm4
+ por %xmm4,%xmm1
+ paddd %xmm1,%xmm0
+ pxor %xmm0,%xmm3
+ pshufb %xmm7,%xmm3
+ paddd %xmm3,%xmm2
+ pxor %xmm2,%xmm1
+ movdqa %xmm1,%xmm4
+ psrld $25,%xmm1
+ pslld $7,%xmm4
+ por %xmm4,%xmm1
+ pshufd $78,%xmm2,%xmm2
+ pshufd $147,%xmm1,%xmm1
+ pshufd $57,%xmm3,%xmm3
+ decq %r8
+ jnz .Loop_hssse3
+ movdqu %xmm0,0(%rdi)
+ movdqu %xmm3,16(%rdi)
+ ret
+ENDPROC(hchacha20_ssse3)
.align 32
-.Loop:
- addl %r8d,%eax
- xorl %eax,%r12d
- roll $16,%r12d
- addl %r9d,%ebx
- xorl %ebx,%r13d
- roll $16,%r13d
- addl %r12d,%esi
- xorl %esi,%r8d
- roll $12,%r8d
- addl %r13d,%edi
- xorl %edi,%r9d
- roll $12,%r9d
- addl %r8d,%eax
- xorl %eax,%r12d
- roll $8,%r12d
- addl %r9d,%ebx
- xorl %ebx,%r13d
- roll $8,%r13d
- addl %r12d,%esi
- xorl %esi,%r8d
- roll $7,%r8d
- addl %r13d,%edi
- xorl %edi,%r9d
- roll $7,%r9d
- movl %esi,32(%rsp)
- movl %edi,36(%rsp)
- movl 40(%rsp),%esi
- movl 44(%rsp),%edi
- addl %r10d,%ecx
- xorl %ecx,%r14d
- roll $16,%r14d
- addl %r11d,%edx
- xorl %edx,%r15d
- roll $16,%r15d
- addl %r14d,%esi
- xorl %esi,%r10d
- roll $12,%r10d
- addl %r15d,%edi
- xorl %edi,%r11d
- roll $12,%r11d
- addl %r10d,%ecx
- xorl %ecx,%r14d
- roll $8,%r14d
- addl %r11d,%edx
- xorl %edx,%r15d
- roll $8,%r15d
- addl %r14d,%esi
- xorl %esi,%r10d
- roll $7,%r10d
- addl %r15d,%edi
- xorl %edi,%r11d
- roll $7,%r11d
- addl %r9d,%eax
- xorl %eax,%r15d
- roll $16,%r15d
- addl %r10d,%ebx
- xorl %ebx,%r12d
- roll $16,%r12d
- addl %r15d,%esi
- xorl %esi,%r9d
- roll $12,%r9d
- addl %r12d,%edi
- xorl %edi,%r10d
- roll $12,%r10d
- addl %r9d,%eax
- xorl %eax,%r15d
- roll $8,%r15d
- addl %r10d,%ebx
- xorl %ebx,%r12d
- roll $8,%r12d
- addl %r15d,%esi
- xorl %esi,%r9d
- roll $7,%r9d
- addl %r12d,%edi
- xorl %edi,%r10d
- roll $7,%r10d
- movl %esi,40(%rsp)
- movl %edi,44(%rsp)
- movl 32(%rsp),%esi
- movl 36(%rsp),%edi
- addl %r11d,%ecx
- xorl %ecx,%r13d
- roll $16,%r13d
- addl %r8d,%edx
- xorl %edx,%r14d
- roll $16,%r14d
- addl %r13d,%esi
- xorl %esi,%r11d
- roll $12,%r11d
- addl %r14d,%edi
- xorl %edi,%r8d
- roll $12,%r8d
- addl %r11d,%ecx
- xorl %ecx,%r13d
- roll $8,%r13d
- addl %r8d,%edx
- xorl %edx,%r14d
- roll $8,%r14d
- addl %r13d,%esi
- xorl %esi,%r11d
- roll $7,%r11d
- addl %r14d,%edi
- xorl %edi,%r8d
- roll $7,%r8d
- decl %ebp
- jnz .Loop
- movl %edi,36(%rsp)
- movl %esi,32(%rsp)
- movq 64(%rsp),%rbp
- movdqa %xmm2,%xmm1
- movq 64+8(%rsp),%rsi
- paddd %xmm4,%xmm3
- movq 64+16(%rsp),%rdi
-
- addl $0x61707865,%eax
- addl $0x3320646e,%ebx
- addl $0x79622d32,%ecx
- addl $0x6b206574,%edx
- addl 16(%rsp),%r8d
- addl 20(%rsp),%r9d
- addl 24(%rsp),%r10d
- addl 28(%rsp),%r11d
- addl 48(%rsp),%r12d
- addl 52(%rsp),%r13d
- addl 56(%rsp),%r14d
- addl 60(%rsp),%r15d
- paddd 32(%rsp),%xmm1
-
- cmpq $64,%rbp
- jb .Ltail
-
- xorl 0(%rsi),%eax
- xorl 4(%rsi),%ebx
- xorl 8(%rsi),%ecx
- xorl 12(%rsi),%edx
- xorl 16(%rsi),%r8d
- xorl 20(%rsi),%r9d
- xorl 24(%rsi),%r10d
- xorl 28(%rsi),%r11d
- movdqu 32(%rsi),%xmm0
- xorl 48(%rsi),%r12d
- xorl 52(%rsi),%r13d
- xorl 56(%rsi),%r14d
- xorl 60(%rsi),%r15d
- leaq 64(%rsi),%rsi
- pxor %xmm1,%xmm0
-
- movdqa %xmm2,32(%rsp)
- movd %xmm3,48(%rsp)
-
- movl %eax,0(%rdi)
- movl %ebx,4(%rdi)
- movl %ecx,8(%rdi)
- movl %edx,12(%rdi)
- movl %r8d,16(%rdi)
- movl %r9d,20(%rdi)
- movl %r10d,24(%rdi)
- movl %r11d,28(%rdi)
- movdqu %xmm0,32(%rdi)
- movl %r12d,48(%rdi)
- movl %r13d,52(%rdi)
- movl %r14d,56(%rdi)
- movl %r15d,60(%rdi)
- leaq 64(%rdi),%rdi
-
- subq $64,%rbp
- jnz .Loop_outer
-
- jmp .Ldone
+ENTRY(chacha20_ssse3)
+.Lchacha20_ssse3:
+ cmpq $0,%rdx
+ je .Lssse3_epilogue
+ leaq 8(%rsp),%r10
-.align 16
-.Ltail:
- movl %eax,0(%rsp)
- movl %ebx,4(%rsp)
- xorq %rbx,%rbx
- movl %ecx,8(%rsp)
- movl %edx,12(%rsp)
- movl %r8d,16(%rsp)
- movl %r9d,20(%rsp)
- movl %r10d,24(%rsp)
- movl %r11d,28(%rsp)
- movdqa %xmm1,32(%rsp)
- movl %r12d,48(%rsp)
- movl %r13d,52(%rsp)
- movl %r14d,56(%rsp)
- movl %r15d,60(%rsp)
-
-.Loop_tail:
- movzbl (%rsi,%rbx,1),%eax
- movzbl (%rsp,%rbx,1),%edx
- leaq 1(%rbx),%rbx
- xorl %edx,%eax
- movb %al,-1(%rdi,%rbx,1)
- decq %rbp
- jnz .Loop_tail
-
-.Ldone:
- leaq 64+24+48(%rsp),%rsi
-.cfi_def_cfa %rsi,8
- movq -48(%rsi),%r15
-.cfi_restore %r15
- movq -40(%rsi),%r14
-.cfi_restore %r14
- movq -32(%rsi),%r13
-.cfi_restore %r13
- movq -24(%rsi),%r12
-.cfi_restore %r12
- movq -16(%rsi),%rbp
-.cfi_restore %rbp
- movq -8(%rsi),%rbx
-.cfi_restore %rbx
- leaq (%rsi),%rsp
-.cfi_def_cfa_register %rsp
-.Lno_data:
- .byte 0xf3,0xc3
-.cfi_endproc
-.size ChaCha20_ctr32,.-ChaCha20_ctr32
-.type ChaCha20_ssse3,@function
-.align 32
-ChaCha20_ssse3:
-.cfi_startproc
-.LChaCha20_ssse3:
- movq %rsp,%r9
-.cfi_def_cfa_register %r9
- testl $2048,%r10d
- jnz .LChaCha20_4xop
cmpq $128,%rdx
- je .LChaCha20_128
- ja .LChaCha20_4x
+ ja .Lchacha20_4x
.Ldo_sse3_after_all:
subq $64+8,%rsp
+ andq $-32,%rsp
movdqa .Lsigma(%rip),%xmm0
movdqu (%rcx),%xmm1
movdqu 16(%rcx),%xmm2
@@ -375,7 +172,7 @@ ChaCha20_ssse3:
.Loop_ssse3:
paddd %xmm1,%xmm0
pxor %xmm0,%xmm3
-.byte 102,15,56,0,222
+ pshufb %xmm6,%xmm3
paddd %xmm3,%xmm2
pxor %xmm2,%xmm1
movdqa %xmm1,%xmm4
@@ -384,7 +181,7 @@ ChaCha20_ssse3:
por %xmm4,%xmm1
paddd %xmm1,%xmm0
pxor %xmm0,%xmm3
-.byte 102,15,56,0,223
+ pshufb %xmm7,%xmm3
paddd %xmm3,%xmm2
pxor %xmm2,%xmm1
movdqa %xmm1,%xmm4
@@ -397,7 +194,7 @@ ChaCha20_ssse3:
nop
paddd %xmm1,%xmm0
pxor %xmm0,%xmm3
-.byte 102,15,56,0,222
+ pshufb %xmm6,%xmm3
paddd %xmm3,%xmm2
pxor %xmm2,%xmm1
movdqa %xmm1,%xmm4
@@ -406,7 +203,7 @@ ChaCha20_ssse3:
por %xmm4,%xmm1
paddd %xmm1,%xmm0
pxor %xmm0,%xmm3
-.byte 102,15,56,0,223
+ pshufb %xmm7,%xmm3
paddd %xmm3,%xmm2
pxor %xmm2,%xmm1
movdqa %xmm1,%xmm4
@@ -465,194 +262,24 @@ ChaCha20_ssse3:
jnz .Loop_tail_ssse3
.Ldone_ssse3:
- leaq (%r9),%rsp
-.cfi_def_cfa_register %rsp
-.Lssse3_epilogue:
- .byte 0xf3,0xc3
-.cfi_endproc
-.size ChaCha20_ssse3,.-ChaCha20_ssse3
-.type ChaCha20_128,@function
-.align 32
-ChaCha20_128:
-.cfi_startproc
-.LChaCha20_128:
- movq %rsp,%r9
-.cfi_def_cfa_register %r9
- subq $64+8,%rsp
- movdqa .Lsigma(%rip),%xmm8
- movdqu (%rcx),%xmm9
- movdqu 16(%rcx),%xmm2
- movdqu (%r8),%xmm3
- movdqa .Lone(%rip),%xmm1
- movdqa .Lrot16(%rip),%xmm6
- movdqa .Lrot24(%rip),%xmm7
+ leaq -8(%r10),%rsp
- movdqa %xmm8,%xmm10
- movdqa %xmm8,0(%rsp)
- movdqa %xmm9,%xmm11
- movdqa %xmm9,16(%rsp)
- movdqa %xmm2,%xmm0
- movdqa %xmm2,32(%rsp)
- paddd %xmm3,%xmm1
- movdqa %xmm3,48(%rsp)
- movq $10,%r8
- jmp .Loop_128
-
-.align 32
-.Loop_128:
- paddd %xmm9,%xmm8
- pxor %xmm8,%xmm3
- paddd %xmm11,%xmm10
- pxor %xmm10,%xmm1
-.byte 102,15,56,0,222
-.byte 102,15,56,0,206
- paddd %xmm3,%xmm2
- paddd %xmm1,%xmm0
- pxor %xmm2,%xmm9
- pxor %xmm0,%xmm11
- movdqa %xmm9,%xmm4
- psrld $20,%xmm9
- movdqa %xmm11,%xmm5
- pslld $12,%xmm4
- psrld $20,%xmm11
- por %xmm4,%xmm9
- pslld $12,%xmm5
- por %xmm5,%xmm11
- paddd %xmm9,%xmm8
- pxor %xmm8,%xmm3
- paddd %xmm11,%xmm10
- pxor %xmm10,%xmm1
-.byte 102,15,56,0,223
-.byte 102,15,56,0,207
- paddd %xmm3,%xmm2
- paddd %xmm1,%xmm0
- pxor %xmm2,%xmm9
- pxor %xmm0,%xmm11
- movdqa %xmm9,%xmm4
- psrld $25,%xmm9
- movdqa %xmm11,%xmm5
- pslld $7,%xmm4
- psrld $25,%xmm11
- por %xmm4,%xmm9
- pslld $7,%xmm5
- por %xmm5,%xmm11
- pshufd $78,%xmm2,%xmm2
- pshufd $57,%xmm9,%xmm9
- pshufd $147,%xmm3,%xmm3
- pshufd $78,%xmm0,%xmm0
- pshufd $57,%xmm11,%xmm11
- pshufd $147,%xmm1,%xmm1
- paddd %xmm9,%xmm8
- pxor %xmm8,%xmm3
- paddd %xmm11,%xmm10
- pxor %xmm10,%xmm1
-.byte 102,15,56,0,222
-.byte 102,15,56,0,206
- paddd %xmm3,%xmm2
- paddd %xmm1,%xmm0
- pxor %xmm2,%xmm9
- pxor %xmm0,%xmm11
- movdqa %xmm9,%xmm4
- psrld $20,%xmm9
- movdqa %xmm11,%xmm5
- pslld $12,%xmm4
- psrld $20,%xmm11
- por %xmm4,%xmm9
- pslld $12,%xmm5
- por %xmm5,%xmm11
- paddd %xmm9,%xmm8
- pxor %xmm8,%xmm3
- paddd %xmm11,%xmm10
- pxor %xmm10,%xmm1
-.byte 102,15,56,0,223
-.byte 102,15,56,0,207
- paddd %xmm3,%xmm2
- paddd %xmm1,%xmm0
- pxor %xmm2,%xmm9
- pxor %xmm0,%xmm11
- movdqa %xmm9,%xmm4
- psrld $25,%xmm9
- movdqa %xmm11,%xmm5
- pslld $7,%xmm4
- psrld $25,%xmm11
- por %xmm4,%xmm9
- pslld $7,%xmm5
- por %xmm5,%xmm11
- pshufd $78,%xmm2,%xmm2
- pshufd $147,%xmm9,%xmm9
- pshufd $57,%xmm3,%xmm3
- pshufd $78,%xmm0,%xmm0
- pshufd $147,%xmm11,%xmm11
- pshufd $57,%xmm1,%xmm1
- decq %r8
- jnz .Loop_128
- paddd 0(%rsp),%xmm8
- paddd 16(%rsp),%xmm9
- paddd 32(%rsp),%xmm2
- paddd 48(%rsp),%xmm3
- paddd .Lone(%rip),%xmm1
- paddd 0(%rsp),%xmm10
- paddd 16(%rsp),%xmm11
- paddd 32(%rsp),%xmm0
- paddd 48(%rsp),%xmm1
-
- movdqu 0(%rsi),%xmm4
- movdqu 16(%rsi),%xmm5
- pxor %xmm4,%xmm8
- movdqu 32(%rsi),%xmm4
- pxor %xmm5,%xmm9
- movdqu 48(%rsi),%xmm5
- pxor %xmm4,%xmm2
- movdqu 64(%rsi),%xmm4
- pxor %xmm5,%xmm3
- movdqu 80(%rsi),%xmm5
- pxor %xmm4,%xmm10
- movdqu 96(%rsi),%xmm4
- pxor %xmm5,%xmm11
- movdqu 112(%rsi),%xmm5
- pxor %xmm4,%xmm0
- pxor %xmm5,%xmm1
+.Lssse3_epilogue:
+ ret
- movdqu %xmm8,0(%rdi)
- movdqu %xmm9,16(%rdi)
- movdqu %xmm2,32(%rdi)
- movdqu %xmm3,48(%rdi)
- movdqu %xmm10,64(%rdi)
- movdqu %xmm11,80(%rdi)
- movdqu %xmm0,96(%rdi)
- movdqu %xmm1,112(%rdi)
- leaq (%r9),%rsp
-.cfi_def_cfa_register %rsp
-.L128_epilogue:
- .byte 0xf3,0xc3
-.cfi_endproc
-.size ChaCha20_128,.-ChaCha20_128
-.type ChaCha20_4x,@function
.align 32
-ChaCha20_4x:
-.cfi_startproc
-.LChaCha20_4x:
- movq %rsp,%r9
-.cfi_def_cfa_register %r9
- movq %r10,%r11
- shrq $32,%r10
- testq $32,%r10
- jnz .LChaCha20_8x
- cmpq $192,%rdx
- ja .Lproceed4x
-
- andq $71303168,%r11
- cmpq $4194304,%r11
- je .Ldo_sse3_after_all
+.Lchacha20_4x:
+ leaq 8(%rsp),%r10
.Lproceed4x:
subq $0x140+8,%rsp
+ andq $-32,%rsp
movdqa .Lsigma(%rip),%xmm11
movdqu (%rcx),%xmm15
movdqu 16(%rcx),%xmm7
movdqu (%r8),%xmm3
leaq 256(%rsp),%rcx
- leaq .Lrot16(%rip),%r10
+ leaq .Lrot16(%rip),%r9
leaq .Lrot24(%rip),%r11
pshufd $0x00,%xmm11,%xmm8
@@ -716,7 +343,7 @@ ChaCha20_4x:
.Loop_enter4x:
movdqa %xmm6,32(%rsp)
movdqa %xmm7,48(%rsp)
- movdqa (%r10),%xmm7
+ movdqa (%r9),%xmm7
movl $10,%eax
movdqa %xmm0,256-256(%rcx)
jmp .Loop4x
@@ -727,8 +354,8 @@ ChaCha20_4x:
paddd %xmm13,%xmm9
pxor %xmm8,%xmm0
pxor %xmm9,%xmm1
-.byte 102,15,56,0,199
-.byte 102,15,56,0,207
+ pshufb %xmm7,%xmm0
+ pshufb %xmm7,%xmm1
paddd %xmm0,%xmm4
paddd %xmm1,%xmm5
pxor %xmm4,%xmm12
@@ -746,8 +373,8 @@ ChaCha20_4x:
paddd %xmm13,%xmm9
pxor %xmm8,%xmm0
pxor %xmm9,%xmm1
-.byte 102,15,56,0,198
-.byte 102,15,56,0,206
+ pshufb %xmm6,%xmm0
+ pshufb %xmm6,%xmm1
paddd %xmm0,%xmm4
paddd %xmm1,%xmm5
pxor %xmm4,%xmm12
@@ -759,7 +386,7 @@ ChaCha20_4x:
pslld $7,%xmm13
por %xmm7,%xmm12
psrld $25,%xmm6
- movdqa (%r10),%xmm7
+ movdqa (%r9),%xmm7
por %xmm6,%xmm13
movdqa %xmm4,0(%rsp)
movdqa %xmm5,16(%rsp)
@@ -769,8 +396,8 @@ ChaCha20_4x:
paddd %xmm15,%xmm11
pxor %xmm10,%xmm2
pxor %xmm11,%xmm3
-.byte 102,15,56,0,215
-.byte 102,15,56,0,223
+ pshufb %xmm7,%xmm2
+ pshufb %xmm7,%xmm3
paddd %xmm2,%xmm4
paddd %xmm3,%xmm5
pxor %xmm4,%xmm14
@@ -788,8 +415,8 @@ ChaCha20_4x:
paddd %xmm15,%xmm11
pxor %xmm10,%xmm2
pxor %xmm11,%xmm3
-.byte 102,15,56,0,214
-.byte 102,15,56,0,222
+ pshufb %xmm6,%xmm2
+ pshufb %xmm6,%xmm3
paddd %xmm2,%xmm4
paddd %xmm3,%xmm5
pxor %xmm4,%xmm14
@@ -801,14 +428,14 @@ ChaCha20_4x:
pslld $7,%xmm15
por %xmm7,%xmm14
psrld $25,%xmm6
- movdqa (%r10),%xmm7
+ movdqa (%r9),%xmm7
por %xmm6,%xmm15
paddd %xmm13,%xmm8
paddd %xmm14,%xmm9
pxor %xmm8,%xmm3
pxor %xmm9,%xmm0
-.byte 102,15,56,0,223
-.byte 102,15,56,0,199
+ pshufb %xmm7,%xmm3
+ pshufb %xmm7,%xmm0
paddd %xmm3,%xmm4
paddd %xmm0,%xmm5
pxor %xmm4,%xmm13
@@ -826,8 +453,8 @@ ChaCha20_4x:
paddd %xmm14,%xmm9
pxor %xmm8,%xmm3
pxor %xmm9,%xmm0
-.byte 102,15,56,0,222
-.byte 102,15,56,0,198
+ pshufb %xmm6,%xmm3
+ pshufb %xmm6,%xmm0
paddd %xmm3,%xmm4
paddd %xmm0,%xmm5
pxor %xmm4,%xmm13
@@ -839,7 +466,7 @@ ChaCha20_4x:
pslld $7,%xmm14
por %xmm7,%xmm13
psrld $25,%xmm6
- movdqa (%r10),%xmm7
+ movdqa (%r9),%xmm7
por %xmm6,%xmm14
movdqa %xmm4,32(%rsp)
movdqa %xmm5,48(%rsp)
@@ -849,8 +476,8 @@ ChaCha20_4x:
paddd %xmm12,%xmm11
pxor %xmm10,%xmm1
pxor %xmm11,%xmm2
-.byte 102,15,56,0,207
-.byte 102,15,56,0,215
+ pshufb %xmm7,%xmm1
+ pshufb %xmm7,%xmm2
paddd %xmm1,%xmm4
paddd %xmm2,%xmm5
pxor %xmm4,%xmm15
@@ -868,8 +495,8 @@ ChaCha20_4x:
paddd %xmm12,%xmm11
pxor %xmm10,%xmm1
pxor %xmm11,%xmm2
-.byte 102,15,56,0,206
-.byte 102,15,56,0,214
+ pshufb %xmm6,%xmm1
+ pshufb %xmm6,%xmm2
paddd %xmm1,%xmm4
paddd %xmm2,%xmm5
pxor %xmm4,%xmm15
@@ -881,7 +508,7 @@ ChaCha20_4x:
pslld $7,%xmm12
por %xmm7,%xmm15
psrld $25,%xmm6
- movdqa (%r10),%xmm7
+ movdqa (%r9),%xmm7
por %xmm6,%xmm12
decl %eax
jnz .Loop4x
@@ -1035,7 +662,7 @@ ChaCha20_4x:
jae .L64_or_more4x
- xorq %r10,%r10
+ xorq %r9,%r9
movdqa %xmm12,16(%rsp)
movdqa %xmm4,32(%rsp)
@@ -1060,7 +687,7 @@ ChaCha20_4x:
movdqa 16(%rsp),%xmm6
leaq 64(%rsi),%rsi
- xorq %r10,%r10
+ xorq %r9,%r9
movdqa %xmm6,0(%rsp)
movdqa %xmm13,16(%rsp)
leaq 64(%rdi),%rdi
@@ -1100,7 +727,7 @@ ChaCha20_4x:
movdqa 32(%rsp),%xmm6
leaq 128(%rsi),%rsi
- xorq %r10,%r10
+ xorq %r9,%r9
movdqa %xmm6,0(%rsp)
movdqa %xmm10,16(%rsp)
leaq 128(%rdi),%rdi
@@ -1155,7 +782,7 @@ ChaCha20_4x:
movdqa 48(%rsp),%xmm6
leaq 64(%rsi),%rsi
- xorq %r10,%r10
+ xorq %r9,%r9
movdqa %xmm6,0(%rsp)
movdqa %xmm15,16(%rsp)
leaq 64(%rdi),%rdi
@@ -1164,463 +791,41 @@ ChaCha20_4x:
movdqa %xmm3,48(%rsp)
.Loop_tail4x:
- movzbl (%rsi,%r10,1),%eax
- movzbl (%rsp,%r10,1),%ecx
- leaq 1(%r10),%r10
+ movzbl (%rsi,%r9,1),%eax
+ movzbl (%rsp,%r9,1),%ecx
+ leaq 1(%r9),%r9
xorl %ecx,%eax
- movb %al,-1(%rdi,%r10,1)
+ movb %al,-1(%rdi,%r9,1)
decq %rdx
jnz .Loop_tail4x
.Ldone4x:
- leaq (%r9),%rsp
-.cfi_def_cfa_register %rsp
-.L4x_epilogue:
- .byte 0xf3,0xc3
-.cfi_endproc
-.size ChaCha20_4x,.-ChaCha20_4x
-.type ChaCha20_4xop,@function
-.align 32
-ChaCha20_4xop:
-.cfi_startproc
-.LChaCha20_4xop:
- movq %rsp,%r9
-.cfi_def_cfa_register %r9
- subq $0x140+8,%rsp
- vzeroupper
+ leaq -8(%r10),%rsp
- vmovdqa .Lsigma(%rip),%xmm11
- vmovdqu (%rcx),%xmm3
- vmovdqu 16(%rcx),%xmm15
- vmovdqu (%r8),%xmm7
- leaq 256(%rsp),%rcx
-
- vpshufd $0x00,%xmm11,%xmm8
- vpshufd $0x55,%xmm11,%xmm9
- vmovdqa %xmm8,64(%rsp)
- vpshufd $0xaa,%xmm11,%xmm10
- vmovdqa %xmm9,80(%rsp)
- vpshufd $0xff,%xmm11,%xmm11
- vmovdqa %xmm10,96(%rsp)
- vmovdqa %xmm11,112(%rsp)
-
- vpshufd $0x00,%xmm3,%xmm0
- vpshufd $0x55,%xmm3,%xmm1
- vmovdqa %xmm0,128-256(%rcx)
- vpshufd $0xaa,%xmm3,%xmm2
- vmovdqa %xmm1,144-256(%rcx)
- vpshufd $0xff,%xmm3,%xmm3
- vmovdqa %xmm2,160-256(%rcx)
- vmovdqa %xmm3,176-256(%rcx)
-
- vpshufd $0x00,%xmm15,%xmm12
- vpshufd $0x55,%xmm15,%xmm13
- vmovdqa %xmm12,192-256(%rcx)
- vpshufd $0xaa,%xmm15,%xmm14
- vmovdqa %xmm13,208-256(%rcx)
- vpshufd $0xff,%xmm15,%xmm15
- vmovdqa %xmm14,224-256(%rcx)
- vmovdqa %xmm15,240-256(%rcx)
-
- vpshufd $0x00,%xmm7,%xmm4
- vpshufd $0x55,%xmm7,%xmm5
- vpaddd .Linc(%rip),%xmm4,%xmm4
- vpshufd $0xaa,%xmm7,%xmm6
- vmovdqa %xmm5,272-256(%rcx)
- vpshufd $0xff,%xmm7,%xmm7
- vmovdqa %xmm6,288-256(%rcx)
- vmovdqa %xmm7,304-256(%rcx)
-
- jmp .Loop_enter4xop
-
-.align 32
-.Loop_outer4xop:
- vmovdqa 64(%rsp),%xmm8
- vmovdqa 80(%rsp),%xmm9
- vmovdqa 96(%rsp),%xmm10
- vmovdqa 112(%rsp),%xmm11
- vmovdqa 128-256(%rcx),%xmm0
- vmovdqa 144-256(%rcx),%xmm1
- vmovdqa 160-256(%rcx),%xmm2
- vmovdqa 176-256(%rcx),%xmm3
- vmovdqa 192-256(%rcx),%xmm12
- vmovdqa 208-256(%rcx),%xmm13
- vmovdqa 224-256(%rcx),%xmm14
- vmovdqa 240-256(%rcx),%xmm15
- vmovdqa 256-256(%rcx),%xmm4
- vmovdqa 272-256(%rcx),%xmm5
- vmovdqa 288-256(%rcx),%xmm6
- vmovdqa 304-256(%rcx),%xmm7
- vpaddd .Lfour(%rip),%xmm4,%xmm4
-
-.Loop_enter4xop:
- movl $10,%eax
- vmovdqa %xmm4,256-256(%rcx)
- jmp .Loop4xop
-
-.align 32
-.Loop4xop:
- vpaddd %xmm0,%xmm8,%xmm8
- vpaddd %xmm1,%xmm9,%xmm9
- vpaddd %xmm2,%xmm10,%xmm10
- vpaddd %xmm3,%xmm11,%xmm11
- vpxor %xmm4,%xmm8,%xmm4
- vpxor %xmm5,%xmm9,%xmm5
- vpxor %xmm6,%xmm10,%xmm6
- vpxor %xmm7,%xmm11,%xmm7
-.byte 143,232,120,194,228,16
-.byte 143,232,120,194,237,16
-.byte 143,232,120,194,246,16
-.byte 143,232,120,194,255,16
- vpaddd %xmm4,%xmm12,%xmm12
- vpaddd %xmm5,%xmm13,%xmm13
- vpaddd %xmm6,%xmm14,%xmm14
- vpaddd %xmm7,%xmm15,%xmm15
- vpxor %xmm0,%xmm12,%xmm0
- vpxor %xmm1,%xmm13,%xmm1
- vpxor %xmm14,%xmm2,%xmm2
- vpxor %xmm15,%xmm3,%xmm3
-.byte 143,232,120,194,192,12
-.byte 143,232,120,194,201,12
-.byte 143,232,120,194,210,12
-.byte 143,232,120,194,219,12
- vpaddd %xmm8,%xmm0,%xmm8
- vpaddd %xmm9,%xmm1,%xmm9
- vpaddd %xmm2,%xmm10,%xmm10
- vpaddd %xmm3,%xmm11,%xmm11
- vpxor %xmm4,%xmm8,%xmm4
- vpxor %xmm5,%xmm9,%xmm5
- vpxor %xmm6,%xmm10,%xmm6
- vpxor %xmm7,%xmm11,%xmm7
-.byte 143,232,120,194,228,8
-.byte 143,232,120,194,237,8
-.byte 143,232,120,194,246,8
-.byte 143,232,120,194,255,8
- vpaddd %xmm4,%xmm12,%xmm12
- vpaddd %xmm5,%xmm13,%xmm13
- vpaddd %xmm6,%xmm14,%xmm14
- vpaddd %xmm7,%xmm15,%xmm15
- vpxor %xmm0,%xmm12,%xmm0
- vpxor %xmm1,%xmm13,%xmm1
- vpxor %xmm14,%xmm2,%xmm2
- vpxor %xmm15,%xmm3,%xmm3
-.byte 143,232,120,194,192,7
-.byte 143,232,120,194,201,7
-.byte 143,232,120,194,210,7
-.byte 143,232,120,194,219,7
- vpaddd %xmm1,%xmm8,%xmm8
- vpaddd %xmm2,%xmm9,%xmm9
- vpaddd %xmm3,%xmm10,%xmm10
- vpaddd %xmm0,%xmm11,%xmm11
- vpxor %xmm7,%xmm8,%xmm7
- vpxor %xmm4,%xmm9,%xmm4
- vpxor %xmm5,%xmm10,%xmm5
- vpxor %xmm6,%xmm11,%xmm6
-.byte 143,232,120,194,255,16
-.byte 143,232,120,194,228,16
-.byte 143,232,120,194,237,16
-.byte 143,232,120,194,246,16
- vpaddd %xmm7,%xmm14,%xmm14
- vpaddd %xmm4,%xmm15,%xmm15
- vpaddd %xmm5,%xmm12,%xmm12
- vpaddd %xmm6,%xmm13,%xmm13
- vpxor %xmm1,%xmm14,%xmm1
- vpxor %xmm2,%xmm15,%xmm2
- vpxor %xmm12,%xmm3,%xmm3
- vpxor %xmm13,%xmm0,%xmm0
-.byte 143,232,120,194,201,12
-.byte 143,232,120,194,210,12
-.byte 143,232,120,194,219,12
-.byte 143,232,120,194,192,12
- vpaddd %xmm8,%xmm1,%xmm8
- vpaddd %xmm9,%xmm2,%xmm9
- vpaddd %xmm3,%xmm10,%xmm10
- vpaddd %xmm0,%xmm11,%xmm11
- vpxor %xmm7,%xmm8,%xmm7
- vpxor %xmm4,%xmm9,%xmm4
- vpxor %xmm5,%xmm10,%xmm5
- vpxor %xmm6,%xmm11,%xmm6
-.byte 143,232,120,194,255,8
-.byte 143,232,120,194,228,8
-.byte 143,232,120,194,237,8
-.byte 143,232,120,194,246,8
- vpaddd %xmm7,%xmm14,%xmm14
- vpaddd %xmm4,%xmm15,%xmm15
- vpaddd %xmm5,%xmm12,%xmm12
- vpaddd %xmm6,%xmm13,%xmm13
- vpxor %xmm1,%xmm14,%xmm1
- vpxor %xmm2,%xmm15,%xmm2
- vpxor %xmm12,%xmm3,%xmm3
- vpxor %xmm13,%xmm0,%xmm0
-.byte 143,232,120,194,201,7
-.byte 143,232,120,194,210,7
-.byte 143,232,120,194,219,7
-.byte 143,232,120,194,192,7
- decl %eax
- jnz .Loop4xop
-
- vpaddd 64(%rsp),%xmm8,%xmm8
- vpaddd 80(%rsp),%xmm9,%xmm9
- vpaddd 96(%rsp),%xmm10,%xmm10
- vpaddd 112(%rsp),%xmm11,%xmm11
-
- vmovdqa %xmm14,32(%rsp)
- vmovdqa %xmm15,48(%rsp)
-
- vpunpckldq %xmm9,%xmm8,%xmm14
- vpunpckldq %xmm11,%xmm10,%xmm15
- vpunpckhdq %xmm9,%xmm8,%xmm8
- vpunpckhdq %xmm11,%xmm10,%xmm10
- vpunpcklqdq %xmm15,%xmm14,%xmm9
- vpunpckhqdq %xmm15,%xmm14,%xmm14
- vpunpcklqdq %xmm10,%xmm8,%xmm11
- vpunpckhqdq %xmm10,%xmm8,%xmm8
- vpaddd 128-256(%rcx),%xmm0,%xmm0
- vpaddd 144-256(%rcx),%xmm1,%xmm1
- vpaddd 160-256(%rcx),%xmm2,%xmm2
- vpaddd 176-256(%rcx),%xmm3,%xmm3
-
- vmovdqa %xmm9,0(%rsp)
- vmovdqa %xmm14,16(%rsp)
- vmovdqa 32(%rsp),%xmm9
- vmovdqa 48(%rsp),%xmm14
-
- vpunpckldq %xmm1,%xmm0,%xmm10
- vpunpckldq %xmm3,%xmm2,%xmm15
- vpunpckhdq %xmm1,%xmm0,%xmm0
- vpunpckhdq %xmm3,%xmm2,%xmm2
- vpunpcklqdq %xmm15,%xmm10,%xmm1
- vpunpckhqdq %xmm15,%xmm10,%xmm10
- vpunpcklqdq %xmm2,%xmm0,%xmm3
- vpunpckhqdq %xmm2,%xmm0,%xmm0
- vpaddd 192-256(%rcx),%xmm12,%xmm12
- vpaddd 208-256(%rcx),%xmm13,%xmm13
- vpaddd 224-256(%rcx),%xmm9,%xmm9
- vpaddd 240-256(%rcx),%xmm14,%xmm14
-
- vpunpckldq %xmm13,%xmm12,%xmm2
- vpunpckldq %xmm14,%xmm9,%xmm15
- vpunpckhdq %xmm13,%xmm12,%xmm12
- vpunpckhdq %xmm14,%xmm9,%xmm9
- vpunpcklqdq %xmm15,%xmm2,%xmm13
- vpunpckhqdq %xmm15,%xmm2,%xmm2
- vpunpcklqdq %xmm9,%xmm12,%xmm14
- vpunpckhqdq %xmm9,%xmm12,%xmm12
- vpaddd 256-256(%rcx),%xmm4,%xmm4
- vpaddd 272-256(%rcx),%xmm5,%xmm5
- vpaddd 288-256(%rcx),%xmm6,%xmm6
- vpaddd 304-256(%rcx),%xmm7,%xmm7
-
- vpunpckldq %xmm5,%xmm4,%xmm9
- vpunpckldq %xmm7,%xmm6,%xmm15
- vpunpckhdq %xmm5,%xmm4,%xmm4
- vpunpckhdq %xmm7,%xmm6,%xmm6
- vpunpcklqdq %xmm15,%xmm9,%xmm5
- vpunpckhqdq %xmm15,%xmm9,%xmm9
- vpunpcklqdq %xmm6,%xmm4,%xmm7
- vpunpckhqdq %xmm6,%xmm4,%xmm4
- vmovdqa 0(%rsp),%xmm6
- vmovdqa 16(%rsp),%xmm15
-
- cmpq $256,%rdx
- jb .Ltail4xop
-
- vpxor 0(%rsi),%xmm6,%xmm6
- vpxor 16(%rsi),%xmm1,%xmm1
- vpxor 32(%rsi),%xmm13,%xmm13
- vpxor 48(%rsi),%xmm5,%xmm5
- vpxor 64(%rsi),%xmm15,%xmm15
- vpxor 80(%rsi),%xmm10,%xmm10
- vpxor 96(%rsi),%xmm2,%xmm2
- vpxor 112(%rsi),%xmm9,%xmm9
- leaq 128(%rsi),%rsi
- vpxor 0(%rsi),%xmm11,%xmm11
- vpxor 16(%rsi),%xmm3,%xmm3
- vpxor 32(%rsi),%xmm14,%xmm14
- vpxor 48(%rsi),%xmm7,%xmm7
- vpxor 64(%rsi),%xmm8,%xmm8
- vpxor 80(%rsi),%xmm0,%xmm0
- vpxor 96(%rsi),%xmm12,%xmm12
- vpxor 112(%rsi),%xmm4,%xmm4
- leaq 128(%rsi),%rsi
-
- vmovdqu %xmm6,0(%rdi)
- vmovdqu %xmm1,16(%rdi)
- vmovdqu %xmm13,32(%rdi)
- vmovdqu %xmm5,48(%rdi)
- vmovdqu %xmm15,64(%rdi)
- vmovdqu %xmm10,80(%rdi)
- vmovdqu %xmm2,96(%rdi)
- vmovdqu %xmm9,112(%rdi)
- leaq 128(%rdi),%rdi
- vmovdqu %xmm11,0(%rdi)
- vmovdqu %xmm3,16(%rdi)
- vmovdqu %xmm14,32(%rdi)
- vmovdqu %xmm7,48(%rdi)
- vmovdqu %xmm8,64(%rdi)
- vmovdqu %xmm0,80(%rdi)
- vmovdqu %xmm12,96(%rdi)
- vmovdqu %xmm4,112(%rdi)
- leaq 128(%rdi),%rdi
-
- subq $256,%rdx
- jnz .Loop_outer4xop
-
- jmp .Ldone4xop
-
-.align 32
-.Ltail4xop:
- cmpq $192,%rdx
- jae .L192_or_more4xop
- cmpq $128,%rdx
- jae .L128_or_more4xop
- cmpq $64,%rdx
- jae .L64_or_more4xop
-
- xorq %r10,%r10
- vmovdqa %xmm6,0(%rsp)
- vmovdqa %xmm1,16(%rsp)
- vmovdqa %xmm13,32(%rsp)
- vmovdqa %xmm5,48(%rsp)
- jmp .Loop_tail4xop
-
-.align 32
-.L64_or_more4xop:
- vpxor 0(%rsi),%xmm6,%xmm6
- vpxor 16(%rsi),%xmm1,%xmm1
- vpxor 32(%rsi),%xmm13,%xmm13
- vpxor 48(%rsi),%xmm5,%xmm5
- vmovdqu %xmm6,0(%rdi)
- vmovdqu %xmm1,16(%rdi)
- vmovdqu %xmm13,32(%rdi)
- vmovdqu %xmm5,48(%rdi)
- je .Ldone4xop
-
- leaq 64(%rsi),%rsi
- vmovdqa %xmm15,0(%rsp)
- xorq %r10,%r10
- vmovdqa %xmm10,16(%rsp)
- leaq 64(%rdi),%rdi
- vmovdqa %xmm2,32(%rsp)
- subq $64,%rdx
- vmovdqa %xmm9,48(%rsp)
- jmp .Loop_tail4xop
-
-.align 32
-.L128_or_more4xop:
- vpxor 0(%rsi),%xmm6,%xmm6
- vpxor 16(%rsi),%xmm1,%xmm1
- vpxor 32(%rsi),%xmm13,%xmm13
- vpxor 48(%rsi),%xmm5,%xmm5
- vpxor 64(%rsi),%xmm15,%xmm15
- vpxor 80(%rsi),%xmm10,%xmm10
- vpxor 96(%rsi),%xmm2,%xmm2
- vpxor 112(%rsi),%xmm9,%xmm9
-
- vmovdqu %xmm6,0(%rdi)
- vmovdqu %xmm1,16(%rdi)
- vmovdqu %xmm13,32(%rdi)
- vmovdqu %xmm5,48(%rdi)
- vmovdqu %xmm15,64(%rdi)
- vmovdqu %xmm10,80(%rdi)
- vmovdqu %xmm2,96(%rdi)
- vmovdqu %xmm9,112(%rdi)
- je .Ldone4xop
-
- leaq 128(%rsi),%rsi
- vmovdqa %xmm11,0(%rsp)
- xorq %r10,%r10
- vmovdqa %xmm3,16(%rsp)
- leaq 128(%rdi),%rdi
- vmovdqa %xmm14,32(%rsp)
- subq $128,%rdx
- vmovdqa %xmm7,48(%rsp)
- jmp .Loop_tail4xop
+.L4x_epilogue:
+ ret
+ENDPROC(chacha20_ssse3)
+#endif /* CONFIG_AS_SSSE3 */
+#ifdef CONFIG_AS_AVX2
.align 32
-.L192_or_more4xop:
- vpxor 0(%rsi),%xmm6,%xmm6
- vpxor 16(%rsi),%xmm1,%xmm1
- vpxor 32(%rsi),%xmm13,%xmm13
- vpxor 48(%rsi),%xmm5,%xmm5
- vpxor 64(%rsi),%xmm15,%xmm15
- vpxor 80(%rsi),%xmm10,%xmm10
- vpxor 96(%rsi),%xmm2,%xmm2
- vpxor 112(%rsi),%xmm9,%xmm9
- leaq 128(%rsi),%rsi
- vpxor 0(%rsi),%xmm11,%xmm11
- vpxor 16(%rsi),%xmm3,%xmm3
- vpxor 32(%rsi),%xmm14,%xmm14
- vpxor 48(%rsi),%xmm7,%xmm7
-
- vmovdqu %xmm6,0(%rdi)
- vmovdqu %xmm1,16(%rdi)
- vmovdqu %xmm13,32(%rdi)
- vmovdqu %xmm5,48(%rdi)
- vmovdqu %xmm15,64(%rdi)
- vmovdqu %xmm10,80(%rdi)
- vmovdqu %xmm2,96(%rdi)
- vmovdqu %xmm9,112(%rdi)
- leaq 128(%rdi),%rdi
- vmovdqu %xmm11,0(%rdi)
- vmovdqu %xmm3,16(%rdi)
- vmovdqu %xmm14,32(%rdi)
- vmovdqu %xmm7,48(%rdi)
- je .Ldone4xop
-
- leaq 64(%rsi),%rsi
- vmovdqa %xmm8,0(%rsp)
- xorq %r10,%r10
- vmovdqa %xmm0,16(%rsp)
- leaq 64(%rdi),%rdi
- vmovdqa %xmm12,32(%rsp)
- subq $192,%rdx
- vmovdqa %xmm4,48(%rsp)
-
-.Loop_tail4xop:
- movzbl (%rsi,%r10,1),%eax
- movzbl (%rsp,%r10,1),%ecx
- leaq 1(%r10),%r10
- xorl %ecx,%eax
- movb %al,-1(%rdi,%r10,1)
- decq %rdx
- jnz .Loop_tail4xop
+ENTRY(chacha20_avx2)
+.Lchacha20_avx2:
+ cmpq $0,%rdx
+ je .L8x_epilogue
+ leaq 8(%rsp),%r10
-.Ldone4xop:
- vzeroupper
- leaq (%r9),%rsp
-.cfi_def_cfa_register %rsp
-.L4xop_epilogue:
- .byte 0xf3,0xc3
-.cfi_endproc
-.size ChaCha20_4xop,.-ChaCha20_4xop
-.type ChaCha20_8x,@function
-.align 32
-ChaCha20_8x:
-.cfi_startproc
-.LChaCha20_8x:
- movq %rsp,%r9
-.cfi_def_cfa_register %r9
subq $0x280+8,%rsp
andq $-32,%rsp
vzeroupper
-
-
-
-
-
-
-
-
-
vbroadcasti128 .Lsigma(%rip),%ymm11
vbroadcasti128 (%rcx),%ymm3
vbroadcasti128 16(%rcx),%ymm15
vbroadcasti128 (%r8),%ymm7
leaq 256(%rsp),%rcx
leaq 512(%rsp),%rax
- leaq .Lrot16(%rip),%r10
+ leaq .Lrot16(%rip),%r9
leaq .Lrot24(%rip),%r11
vpshufd $0x00,%ymm11,%ymm8
@@ -1684,7 +889,7 @@ ChaCha20_8x:
.Loop_enter8x:
vmovdqa %ymm14,64(%rsp)
vmovdqa %ymm15,96(%rsp)
- vbroadcasti128 (%r10),%ymm15
+ vbroadcasti128 (%r9),%ymm15
vmovdqa %ymm4,512-512(%rax)
movl $10,%eax
jmp .Loop8x
@@ -1719,7 +924,7 @@ ChaCha20_8x:
vpslld $7,%ymm0,%ymm15
vpsrld $25,%ymm0,%ymm0
vpor %ymm0,%ymm15,%ymm0
- vbroadcasti128 (%r10),%ymm15
+ vbroadcasti128 (%r9),%ymm15
vpaddd %ymm5,%ymm13,%ymm13
vpxor %ymm1,%ymm13,%ymm1
vpslld $7,%ymm1,%ymm14
@@ -1757,7 +962,7 @@ ChaCha20_8x:
vpslld $7,%ymm2,%ymm15
vpsrld $25,%ymm2,%ymm2
vpor %ymm2,%ymm15,%ymm2
- vbroadcasti128 (%r10),%ymm15
+ vbroadcasti128 (%r9),%ymm15
vpaddd %ymm7,%ymm13,%ymm13
vpxor %ymm3,%ymm13,%ymm3
vpslld $7,%ymm3,%ymm14
@@ -1791,7 +996,7 @@ ChaCha20_8x:
vpslld $7,%ymm1,%ymm15
vpsrld $25,%ymm1,%ymm1
vpor %ymm1,%ymm15,%ymm1
- vbroadcasti128 (%r10),%ymm15
+ vbroadcasti128 (%r9),%ymm15
vpaddd %ymm4,%ymm13,%ymm13
vpxor %ymm2,%ymm13,%ymm2
vpslld $7,%ymm2,%ymm14
@@ -1829,7 +1034,7 @@ ChaCha20_8x:
vpslld $7,%ymm3,%ymm15
vpsrld $25,%ymm3,%ymm3
vpor %ymm3,%ymm15,%ymm3
- vbroadcasti128 (%r10),%ymm15
+ vbroadcasti128 (%r9),%ymm15
vpaddd %ymm6,%ymm13,%ymm13
vpxor %ymm0,%ymm13,%ymm0
vpslld $7,%ymm0,%ymm14
@@ -1983,7 +1188,7 @@ ChaCha20_8x:
cmpq $64,%rdx
jae .L64_or_more8x
- xorq %r10,%r10
+ xorq %r9,%r9
vmovdqa %ymm6,0(%rsp)
vmovdqa %ymm8,32(%rsp)
jmp .Loop_tail8x
@@ -1997,7 +1202,7 @@ ChaCha20_8x:
je .Ldone8x
leaq 64(%rsi),%rsi
- xorq %r10,%r10
+ xorq %r9,%r9
vmovdqa %ymm1,0(%rsp)
leaq 64(%rdi),%rdi
subq $64,%rdx
@@ -2017,7 +1222,7 @@ ChaCha20_8x:
je .Ldone8x
leaq 128(%rsi),%rsi
- xorq %r10,%r10
+ xorq %r9,%r9
vmovdqa %ymm12,0(%rsp)
leaq 128(%rdi),%rdi
subq $128,%rdx
@@ -2041,7 +1246,7 @@ ChaCha20_8x:
je .Ldone8x
leaq 192(%rsi),%rsi
- xorq %r10,%r10
+ xorq %r9,%r9
vmovdqa %ymm10,0(%rsp)
leaq 192(%rdi),%rdi
subq $192,%rdx
@@ -2069,7 +1274,7 @@ ChaCha20_8x:
je .Ldone8x
leaq 256(%rsi),%rsi
- xorq %r10,%r10
+ xorq %r9,%r9
vmovdqa %ymm14,0(%rsp)
leaq 256(%rdi),%rdi
subq $256,%rdx
@@ -2101,7 +1306,7 @@ ChaCha20_8x:
je .Ldone8x
leaq 320(%rsi),%rsi
- xorq %r10,%r10
+ xorq %r9,%r9
vmovdqa %ymm3,0(%rsp)
leaq 320(%rdi),%rdi
subq $320,%rdx
@@ -2137,7 +1342,7 @@ ChaCha20_8x:
je .Ldone8x
leaq 384(%rsi),%rsi
- xorq %r10,%r10
+ xorq %r9,%r9
vmovdqa %ymm11,0(%rsp)
leaq 384(%rdi),%rdi
subq $384,%rdx
@@ -2177,40 +1382,43 @@ ChaCha20_8x:
je .Ldone8x
leaq 448(%rsi),%rsi
- xorq %r10,%r10
+ xorq %r9,%r9
vmovdqa %ymm0,0(%rsp)
leaq 448(%rdi),%rdi
subq $448,%rdx
vmovdqa %ymm4,32(%rsp)
.Loop_tail8x:
- movzbl (%rsi,%r10,1),%eax
- movzbl (%rsp,%r10,1),%ecx
- leaq 1(%r10),%r10
+ movzbl (%rsi,%r9,1),%eax
+ movzbl (%rsp,%r9,1),%ecx
+ leaq 1(%r9),%r9
xorl %ecx,%eax
- movb %al,-1(%rdi,%r10,1)
+ movb %al,-1(%rdi,%r9,1)
decq %rdx
jnz .Loop_tail8x
.Ldone8x:
vzeroall
- leaq (%r9),%rsp
-.cfi_def_cfa_register %rsp
+ leaq -8(%r10),%rsp
+
.L8x_epilogue:
- .byte 0xf3,0xc3
-.cfi_endproc
-.size ChaCha20_8x,.-ChaCha20_8x
-.type ChaCha20_avx512,@function
+ ret
+ENDPROC(chacha20_avx2)
+#endif /* CONFIG_AS_AVX2 */
+
+#ifdef CONFIG_AS_AVX512
.align 32
-ChaCha20_avx512:
-.cfi_startproc
-.LChaCha20_avx512:
- movq %rsp,%r9
-.cfi_def_cfa_register %r9
+ENTRY(chacha20_avx512)
+.Lchacha20_avx512:
+ cmpq $0,%rdx
+ je .Lavx512_epilogue
+ leaq 8(%rsp),%r10
+
cmpq $512,%rdx
- ja .LChaCha20_16x
+ ja .Lchacha20_16x
subq $64+8,%rsp
+ andq $-64,%rsp
vbroadcasti32x4 .Lsigma(%rip),%zmm0
vbroadcasti32x4 (%rcx),%zmm1
vbroadcasti32x4 16(%rcx),%zmm2
@@ -2385,181 +1593,25 @@ ChaCha20_avx512:
decq %rdx
jnz .Loop_tail_avx512
- vmovdqu32 %zmm16,0(%rsp)
+ vmovdqa32 %zmm16,0(%rsp)
.Ldone_avx512:
vzeroall
- leaq (%r9),%rsp
-.cfi_def_cfa_register %rsp
-.Lavx512_epilogue:
- .byte 0xf3,0xc3
-.cfi_endproc
-.size ChaCha20_avx512,.-ChaCha20_avx512
-.type ChaCha20_avx512vl,@function
-.align 32
-ChaCha20_avx512vl:
-.cfi_startproc
-.LChaCha20_avx512vl:
- movq %rsp,%r9
-.cfi_def_cfa_register %r9
- cmpq $128,%rdx
- ja .LChaCha20_8xvl
-
- subq $64+8,%rsp
- vbroadcasti128 .Lsigma(%rip),%ymm0
- vbroadcasti128 (%rcx),%ymm1
- vbroadcasti128 16(%rcx),%ymm2
- vbroadcasti128 (%r8),%ymm3
+ leaq -8(%r10),%rsp
- vmovdqa32 %ymm0,%ymm16
- vmovdqa32 %ymm1,%ymm17
- vmovdqa32 %ymm2,%ymm18
- vpaddd .Lzeroz(%rip),%ymm3,%ymm3
- vmovdqa32 .Ltwoy(%rip),%ymm20
- movq $10,%r8
- vmovdqa32 %ymm3,%ymm19
- jmp .Loop_avx512vl
-
-.align 16
-.Loop_outer_avx512vl:
- vmovdqa32 %ymm18,%ymm2
- vpaddd %ymm20,%ymm19,%ymm3
- movq $10,%r8
- vmovdqa32 %ymm3,%ymm19
- jmp .Loop_avx512vl
+.Lavx512_epilogue:
+ ret
.align 32
-.Loop_avx512vl:
- vpaddd %ymm1,%ymm0,%ymm0
- vpxor %ymm0,%ymm3,%ymm3
- vprold $16,%ymm3,%ymm3
- vpaddd %ymm3,%ymm2,%ymm2
- vpxor %ymm2,%ymm1,%ymm1
- vprold $12,%ymm1,%ymm1
- vpaddd %ymm1,%ymm0,%ymm0
- vpxor %ymm0,%ymm3,%ymm3
- vprold $8,%ymm3,%ymm3
- vpaddd %ymm3,%ymm2,%ymm2
- vpxor %ymm2,%ymm1,%ymm1
- vprold $7,%ymm1,%ymm1
- vpshufd $78,%ymm2,%ymm2
- vpshufd $57,%ymm1,%ymm1
- vpshufd $147,%ymm3,%ymm3
- vpaddd %ymm1,%ymm0,%ymm0
- vpxor %ymm0,%ymm3,%ymm3
- vprold $16,%ymm3,%ymm3
- vpaddd %ymm3,%ymm2,%ymm2
- vpxor %ymm2,%ymm1,%ymm1
- vprold $12,%ymm1,%ymm1
- vpaddd %ymm1,%ymm0,%ymm0
- vpxor %ymm0,%ymm3,%ymm3
- vprold $8,%ymm3,%ymm3
- vpaddd %ymm3,%ymm2,%ymm2
- vpxor %ymm2,%ymm1,%ymm1
- vprold $7,%ymm1,%ymm1
- vpshufd $78,%ymm2,%ymm2
- vpshufd $147,%ymm1,%ymm1
- vpshufd $57,%ymm3,%ymm3
- decq %r8
- jnz .Loop_avx512vl
- vpaddd %ymm16,%ymm0,%ymm0
- vpaddd %ymm17,%ymm1,%ymm1
- vpaddd %ymm18,%ymm2,%ymm2
- vpaddd %ymm19,%ymm3,%ymm3
-
- subq $64,%rdx
- jb .Ltail64_avx512vl
-
- vpxor 0(%rsi),%xmm0,%xmm4
- vpxor 16(%rsi),%xmm1,%xmm5
- vpxor 32(%rsi),%xmm2,%xmm6
- vpxor 48(%rsi),%xmm3,%xmm7
- leaq 64(%rsi),%rsi
-
- vmovdqu %xmm4,0(%rdi)
- vmovdqu %xmm5,16(%rdi)
- vmovdqu %xmm6,32(%rdi)
- vmovdqu %xmm7,48(%rdi)
- leaq 64(%rdi),%rdi
-
- jz .Ldone_avx512vl
-
- vextracti128 $1,%ymm0,%xmm4
- vextracti128 $1,%ymm1,%xmm5
- vextracti128 $1,%ymm2,%xmm6
- vextracti128 $1,%ymm3,%xmm7
-
- subq $64,%rdx
- jb .Ltail_avx512vl
-
- vpxor 0(%rsi),%xmm4,%xmm4
- vpxor 16(%rsi),%xmm5,%xmm5
- vpxor 32(%rsi),%xmm6,%xmm6
- vpxor 48(%rsi),%xmm7,%xmm7
- leaq 64(%rsi),%rsi
+.Lchacha20_16x:
+ leaq 8(%rsp),%r10
- vmovdqu %xmm4,0(%rdi)
- vmovdqu %xmm5,16(%rdi)
- vmovdqu %xmm6,32(%rdi)
- vmovdqu %xmm7,48(%rdi)
- leaq 64(%rdi),%rdi
-
- vmovdqa32 %ymm16,%ymm0
- vmovdqa32 %ymm17,%ymm1
- jnz .Loop_outer_avx512vl
-
- jmp .Ldone_avx512vl
-
-.align 16
-.Ltail64_avx512vl:
- vmovdqa %xmm0,0(%rsp)
- vmovdqa %xmm1,16(%rsp)
- vmovdqa %xmm2,32(%rsp)
- vmovdqa %xmm3,48(%rsp)
- addq $64,%rdx
- jmp .Loop_tail_avx512vl
-
-.align 16
-.Ltail_avx512vl:
- vmovdqa %xmm4,0(%rsp)
- vmovdqa %xmm5,16(%rsp)
- vmovdqa %xmm6,32(%rsp)
- vmovdqa %xmm7,48(%rsp)
- addq $64,%rdx
-
-.Loop_tail_avx512vl:
- movzbl (%rsi,%r8,1),%eax
- movzbl (%rsp,%r8,1),%ecx
- leaq 1(%r8),%r8
- xorl %ecx,%eax
- movb %al,-1(%rdi,%r8,1)
- decq %rdx
- jnz .Loop_tail_avx512vl
-
- vmovdqu32 %ymm16,0(%rsp)
- vmovdqu32 %ymm16,32(%rsp)
-
-.Ldone_avx512vl:
- vzeroall
- leaq (%r9),%rsp
-.cfi_def_cfa_register %rsp
-.Lavx512vl_epilogue:
- .byte 0xf3,0xc3
-.cfi_endproc
-.size ChaCha20_avx512vl,.-ChaCha20_avx512vl
-.type ChaCha20_16x,@function
-.align 32
-ChaCha20_16x:
-.cfi_startproc
-.LChaCha20_16x:
- movq %rsp,%r9
-.cfi_def_cfa_register %r9
subq $64+8,%rsp
andq $-64,%rsp
vzeroupper
- leaq .Lsigma(%rip),%r10
- vbroadcasti32x4 (%r10),%zmm3
+ leaq .Lsigma(%rip),%r9
+ vbroadcasti32x4 (%r9),%zmm3
vbroadcasti32x4 (%rcx),%zmm7
vbroadcasti32x4 16(%rcx),%zmm11
vbroadcasti32x4 (%r8),%zmm15
@@ -2606,10 +1658,10 @@ ChaCha20_16x:
.align 32
.Loop_outer16x:
- vpbroadcastd 0(%r10),%zmm0
- vpbroadcastd 4(%r10),%zmm1
- vpbroadcastd 8(%r10),%zmm2
- vpbroadcastd 12(%r10),%zmm3
+ vpbroadcastd 0(%r9),%zmm0
+ vpbroadcastd 4(%r9),%zmm1
+ vpbroadcastd 8(%r9),%zmm2
+ vpbroadcastd 12(%r9),%zmm3
vpaddd .Lsixteen(%rip),%zmm28,%zmm28
vmovdqa64 %zmm20,%zmm4
vmovdqa64 %zmm21,%zmm5
@@ -2865,7 +1917,7 @@ ChaCha20_16x:
.align 32
.Ltail16x:
- xorq %r10,%r10
+ xorq %r9,%r9
subq %rsi,%rdi
cmpq $64,%rdx
jb .Less_than_64_16x
@@ -2993,11 +2045,11 @@ ChaCha20_16x:
andq $63,%rdx
.Loop_tail16x:
- movzbl (%rsi,%r10,1),%eax
- movzbl (%rsp,%r10,1),%ecx
- leaq 1(%r10),%r10
+ movzbl (%rsi,%r9,1),%eax
+ movzbl (%rsp,%r9,1),%ecx
+ leaq 1(%r9),%r9
xorl %ecx,%eax
- movb %al,-1(%rdi,%r10,1)
+ movb %al,-1(%rdi,%r9,1)
decq %rdx
jnz .Loop_tail16x
@@ -3006,25 +2058,172 @@ ChaCha20_16x:
.Ldone16x:
vzeroall
- leaq (%r9),%rsp
-.cfi_def_cfa_register %rsp
+ leaq -8(%r10),%rsp
+
.L16x_epilogue:
- .byte 0xf3,0xc3
-.cfi_endproc
-.size ChaCha20_16x,.-ChaCha20_16x
-.type ChaCha20_8xvl,@function
+ ret
+ENDPROC(chacha20_avx512)
+
.align 32
-ChaCha20_8xvl:
-.cfi_startproc
-.LChaCha20_8xvl:
- movq %rsp,%r9
-.cfi_def_cfa_register %r9
+ENTRY(chacha20_avx512vl)
+ cmpq $0,%rdx
+ je .Lavx512vl_epilogue
+
+ leaq 8(%rsp),%r10
+
+ cmpq $128,%rdx
+ ja .Lchacha20_8xvl
+
+ subq $64+8,%rsp
+ andq $-64,%rsp
+ vbroadcasti128 .Lsigma(%rip),%ymm0
+ vbroadcasti128 (%rcx),%ymm1
+ vbroadcasti128 16(%rcx),%ymm2
+ vbroadcasti128 (%r8),%ymm3
+
+ vmovdqa32 %ymm0,%ymm16
+ vmovdqa32 %ymm1,%ymm17
+ vmovdqa32 %ymm2,%ymm18
+ vpaddd .Lzeroz(%rip),%ymm3,%ymm3
+ vmovdqa32 .Ltwoy(%rip),%ymm20
+ movq $10,%r8
+ vmovdqa32 %ymm3,%ymm19
+ jmp .Loop_avx512vl
+
+.align 16
+.Loop_outer_avx512vl:
+ vmovdqa32 %ymm18,%ymm2
+ vpaddd %ymm20,%ymm19,%ymm3
+ movq $10,%r8
+ vmovdqa32 %ymm3,%ymm19
+ jmp .Loop_avx512vl
+
+.align 32
+.Loop_avx512vl:
+ vpaddd %ymm1,%ymm0,%ymm0
+ vpxor %ymm0,%ymm3,%ymm3
+ vprold $16,%ymm3,%ymm3
+ vpaddd %ymm3,%ymm2,%ymm2
+ vpxor %ymm2,%ymm1,%ymm1
+ vprold $12,%ymm1,%ymm1
+ vpaddd %ymm1,%ymm0,%ymm0
+ vpxor %ymm0,%ymm3,%ymm3
+ vprold $8,%ymm3,%ymm3
+ vpaddd %ymm3,%ymm2,%ymm2
+ vpxor %ymm2,%ymm1,%ymm1
+ vprold $7,%ymm1,%ymm1
+ vpshufd $78,%ymm2,%ymm2
+ vpshufd $57,%ymm1,%ymm1
+ vpshufd $147,%ymm3,%ymm3
+ vpaddd %ymm1,%ymm0,%ymm0
+ vpxor %ymm0,%ymm3,%ymm3
+ vprold $16,%ymm3,%ymm3
+ vpaddd %ymm3,%ymm2,%ymm2
+ vpxor %ymm2,%ymm1,%ymm1
+ vprold $12,%ymm1,%ymm1
+ vpaddd %ymm1,%ymm0,%ymm0
+ vpxor %ymm0,%ymm3,%ymm3
+ vprold $8,%ymm3,%ymm3
+ vpaddd %ymm3,%ymm2,%ymm2
+ vpxor %ymm2,%ymm1,%ymm1
+ vprold $7,%ymm1,%ymm1
+ vpshufd $78,%ymm2,%ymm2
+ vpshufd $147,%ymm1,%ymm1
+ vpshufd $57,%ymm3,%ymm3
+ decq %r8
+ jnz .Loop_avx512vl
+ vpaddd %ymm16,%ymm0,%ymm0
+ vpaddd %ymm17,%ymm1,%ymm1
+ vpaddd %ymm18,%ymm2,%ymm2
+ vpaddd %ymm19,%ymm3,%ymm3
+
+ subq $64,%rdx
+ jb .Ltail64_avx512vl
+
+ vpxor 0(%rsi),%xmm0,%xmm4
+ vpxor 16(%rsi),%xmm1,%xmm5
+ vpxor 32(%rsi),%xmm2,%xmm6
+ vpxor 48(%rsi),%xmm3,%xmm7
+ leaq 64(%rsi),%rsi
+
+ vmovdqu %xmm4,0(%rdi)
+ vmovdqu %xmm5,16(%rdi)
+ vmovdqu %xmm6,32(%rdi)
+ vmovdqu %xmm7,48(%rdi)
+ leaq 64(%rdi),%rdi
+
+ jz .Ldone_avx512vl
+
+ vextracti128 $1,%ymm0,%xmm4
+ vextracti128 $1,%ymm1,%xmm5
+ vextracti128 $1,%ymm2,%xmm6
+ vextracti128 $1,%ymm3,%xmm7
+
+ subq $64,%rdx
+ jb .Ltail_avx512vl
+
+ vpxor 0(%rsi),%xmm4,%xmm4
+ vpxor 16(%rsi),%xmm5,%xmm5
+ vpxor 32(%rsi),%xmm6,%xmm6
+ vpxor 48(%rsi),%xmm7,%xmm7
+ leaq 64(%rsi),%rsi
+
+ vmovdqu %xmm4,0(%rdi)
+ vmovdqu %xmm5,16(%rdi)
+ vmovdqu %xmm6,32(%rdi)
+ vmovdqu %xmm7,48(%rdi)
+ leaq 64(%rdi),%rdi
+
+ vmovdqa32 %ymm16,%ymm0
+ vmovdqa32 %ymm17,%ymm1
+ jnz .Loop_outer_avx512vl
+
+ jmp .Ldone_avx512vl
+
+.align 16
+.Ltail64_avx512vl:
+ vmovdqa %xmm0,0(%rsp)
+ vmovdqa %xmm1,16(%rsp)
+ vmovdqa %xmm2,32(%rsp)
+ vmovdqa %xmm3,48(%rsp)
+ addq $64,%rdx
+ jmp .Loop_tail_avx512vl
+
+.align 16
+.Ltail_avx512vl:
+ vmovdqa %xmm4,0(%rsp)
+ vmovdqa %xmm5,16(%rsp)
+ vmovdqa %xmm6,32(%rsp)
+ vmovdqa %xmm7,48(%rsp)
+ addq $64,%rdx
+
+.Loop_tail_avx512vl:
+ movzbl (%rsi,%r8,1),%eax
+ movzbl (%rsp,%r8,1),%ecx
+ leaq 1(%r8),%r8
+ xorl %ecx,%eax
+ movb %al,-1(%rdi,%r8,1)
+ decq %rdx
+ jnz .Loop_tail_avx512vl
+
+ vmovdqa32 %ymm16,0(%rsp)
+ vmovdqa32 %ymm16,32(%rsp)
+
+.Ldone_avx512vl:
+ vzeroall
+ leaq -8(%r10),%rsp
+.Lavx512vl_epilogue:
+ ret
+
+.align 32
+.Lchacha20_8xvl:
+ leaq 8(%rsp),%r10
subq $64+8,%rsp
andq $-64,%rsp
vzeroupper
- leaq .Lsigma(%rip),%r10
- vbroadcasti128 (%r10),%ymm3
+ leaq .Lsigma(%rip),%r9
+ vbroadcasti128 (%r9),%ymm3
vbroadcasti128 (%rcx),%ymm7
vbroadcasti128 16(%rcx),%ymm11
vbroadcasti128 (%r8),%ymm15
@@ -3073,8 +2272,8 @@ ChaCha20_8xvl:
.Loop_outer8xvl:
- vpbroadcastd 8(%r10),%ymm2
- vpbroadcastd 12(%r10),%ymm3
+ vpbroadcastd 8(%r9),%ymm2
+ vpbroadcastd 12(%r9),%ymm3
vpaddd .Leight(%rip),%ymm28,%ymm28
vmovdqa64 %ymm20,%ymm4
vmovdqa64 %ymm21,%ymm5
@@ -3314,8 +2513,8 @@ ChaCha20_8xvl:
vmovdqu %ymm12,96(%rdi)
leaq (%rdi,%rax,1),%rdi
- vpbroadcastd 0(%r10),%ymm0
- vpbroadcastd 4(%r10),%ymm1
+ vpbroadcastd 0(%r9),%ymm0
+ vpbroadcastd 4(%r9),%ymm1
subq $512,%rdx
jnz .Loop_outer8xvl
@@ -3325,7 +2524,7 @@ ChaCha20_8xvl:
.align 32
.Ltail8xvl:
vmovdqa64 %ymm19,%ymm8
- xorq %r10,%r10
+ xorq %r9,%r9
subq %rsi,%rdi
cmpq $64,%rdx
jb .Less_than_64_8xvl
@@ -3411,11 +2610,11 @@ ChaCha20_8xvl:
andq $63,%rdx
.Loop_tail8xvl:
- movzbl (%rsi,%r10,1),%eax
- movzbl (%rsp,%r10,1),%ecx
- leaq 1(%r10),%r10
+ movzbl (%rsi,%r9,1),%eax
+ movzbl (%rsp,%r9,1),%ecx
+ leaq 1(%r9),%r9
xorl %ecx,%eax
- movb %al,-1(%rdi,%r10,1)
+ movb %al,-1(%rdi,%r9,1)
decq %rdx
jnz .Loop_tail8xvl
@@ -3425,9 +2624,9 @@ ChaCha20_8xvl:
.Ldone8xvl:
vzeroall
- leaq (%r9),%rsp
-.cfi_def_cfa_register %rsp
+ leaq -8(%r10),%rsp
.L8xvl_epilogue:
- .byte 0xf3,0xc3
-.cfi_endproc
-.size ChaCha20_8xvl,.-ChaCha20_8xvl
+ ret
+ENDPROC(chacha20_avx512vl)
+
+#endif /* CONFIG_AS_AVX512 */
diff --git a/lib/zinc/chacha20/chacha20.c b/lib/zinc/chacha20/chacha20.c
index 03209c15d1ca..22a21431c221 100644
--- a/lib/zinc/chacha20/chacha20.c
+++ b/lib/zinc/chacha20/chacha20.c
@@ -16,6 +16,9 @@
#include <linux/vmalloc.h>
#include <crypto/algapi.h> // For crypto_xor_cpy.
+#if defined(CONFIG_ZINC_ARCH_X86_64)
+#include "chacha20-x86_64-glue.c"
+#else
static bool *const chacha20_nobs[] __initconst = { };
static void __init chacha20_fpu_init(void)
{
@@ -33,6 +36,7 @@ static inline bool hchacha20_arch(u32 derived_key[CHACHA20_KEY_WORDS],
{
return false;
}
+#endif
#define QUARTER_ROUND(x, a, b, c, d) ( \
x[a] += x[b], \
--
2.19.1
^ permalink raw reply related
* [PATCH net-next v8 11/28] zinc: Poly1305 generic C implementations and selftest
From: Jason A. Donenfeld @ 2018-10-18 14:56 UTC (permalink / raw)
To: linux-kernel, netdev, linux-crypto, davem, gregkh
Cc: Jason A. Donenfeld, Samuel Neves, Jean-Philippe Aumasson,
Andy Lutomirski, Andrew Morton, Linus Torvalds, kernel-hardening
In-Reply-To: <20181018145712.7538-1-Jason@zx2c4.com>
These two C implementations -- a 32x32 one and a 64x64 one, depending on
the platform -- come from Andrew Moon's public domain poly1305-donna
portable code, modified for usage in the kernel and for usage with
accelerated primitives.
Information: https://cr.yp.to/mac.html
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
---
include/zinc/poly1305.h | 31 +
lib/zinc/Kconfig | 3 +
lib/zinc/Makefile | 3 +
lib/zinc/poly1305/poly1305-donna32.c | 205 +++++
lib/zinc/poly1305/poly1305-donna64.c | 182 +++++
lib/zinc/poly1305/poly1305.c | 154 ++++
lib/zinc/selftest/poly1305.c | 1107 ++++++++++++++++++++++++++
7 files changed, 1685 insertions(+)
create mode 100644 include/zinc/poly1305.h
create mode 100644 lib/zinc/poly1305/poly1305-donna32.c
create mode 100644 lib/zinc/poly1305/poly1305-donna64.c
create mode 100644 lib/zinc/poly1305/poly1305.c
create mode 100644 lib/zinc/selftest/poly1305.c
diff --git a/include/zinc/poly1305.h b/include/zinc/poly1305.h
new file mode 100644
index 000000000000..13fe0e50fc3c
--- /dev/null
+++ b/include/zinc/poly1305.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef _ZINC_POLY1305_H
+#define _ZINC_POLY1305_H
+
+#include <linux/simd.h>
+#include <linux/types.h>
+
+enum poly1305_lengths {
+ POLY1305_BLOCK_SIZE = 16,
+ POLY1305_KEY_SIZE = 32,
+ POLY1305_MAC_SIZE = 16
+};
+
+struct poly1305_ctx {
+ u8 opaque[24 * sizeof(u64)];
+ u32 nonce[4];
+ u8 data[POLY1305_BLOCK_SIZE];
+ size_t num;
+} __aligned(8);
+
+void poly1305_init(struct poly1305_ctx *ctx, const u8 key[POLY1305_KEY_SIZE]);
+void poly1305_update(struct poly1305_ctx *ctx, const u8 *input, size_t len,
+ simd_context_t *simd_context);
+void poly1305_final(struct poly1305_ctx *ctx, u8 mac[POLY1305_MAC_SIZE],
+ simd_context_t *simd_context);
+
+#endif /* _ZINC_POLY1305_H */
diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig
index d271be37cecb..f08bf1eaa2a0 100644
--- a/lib/zinc/Kconfig
+++ b/lib/zinc/Kconfig
@@ -2,6 +2,9 @@ config ZINC_CHACHA20
tristate
select CRYPTO_ALGAPI
+config ZINC_POLY1305
+ tristate
+
config ZINC_SELFTEST
bool "Zinc cryptography library self-tests"
help
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 60d568cf5206..6fc9626c55fa 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -9,3 +9,6 @@ zinc_chacha20-$(CONFIG_ZINC_ARCH_ARM64) += chacha20/chacha20-arm64.o
zinc_chacha20-$(CONFIG_ZINC_ARCH_MIPS) += chacha20/chacha20-mips.o
AFLAGS_chacha20-mips.o += -O2 # This is required to fill the branch delay slots
obj-$(CONFIG_ZINC_CHACHA20) += zinc_chacha20.o
+
+zinc_poly1305-y := poly1305/poly1305.o
+obj-$(CONFIG_ZINC_POLY1305) += zinc_poly1305.o
diff --git a/lib/zinc/poly1305/poly1305-donna32.c b/lib/zinc/poly1305/poly1305-donna32.c
new file mode 100644
index 000000000000..e342f03c21f6
--- /dev/null
+++ b/lib/zinc/poly1305/poly1305-donna32.c
@@ -0,0 +1,205 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * This is based in part on Andrew Moon's poly1305-donna, which is in the
+ * public domain.
+ */
+
+struct poly1305_internal {
+ u32 h[5];
+ u32 r[5];
+ u32 s[4];
+};
+
+static void poly1305_init_generic(void *ctx, const u8 key[16])
+{
+ struct poly1305_internal *st = (struct poly1305_internal *)ctx;
+
+ /* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
+ st->r[0] = (get_unaligned_le32(&key[0])) & 0x3ffffff;
+ st->r[1] = (get_unaligned_le32(&key[3]) >> 2) & 0x3ffff03;
+ st->r[2] = (get_unaligned_le32(&key[6]) >> 4) & 0x3ffc0ff;
+ st->r[3] = (get_unaligned_le32(&key[9]) >> 6) & 0x3f03fff;
+ st->r[4] = (get_unaligned_le32(&key[12]) >> 8) & 0x00fffff;
+
+ /* s = 5*r */
+ st->s[0] = st->r[1] * 5;
+ st->s[1] = st->r[2] * 5;
+ st->s[2] = st->r[3] * 5;
+ st->s[3] = st->r[4] * 5;
+
+ /* h = 0 */
+ st->h[0] = 0;
+ st->h[1] = 0;
+ st->h[2] = 0;
+ st->h[3] = 0;
+ st->h[4] = 0;
+}
+
+static void poly1305_blocks_generic(void *ctx, const u8 *input, size_t len,
+ const u32 padbit)
+{
+ struct poly1305_internal *st = (struct poly1305_internal *)ctx;
+ const u32 hibit = padbit << 24;
+ u32 r0, r1, r2, r3, r4;
+ u32 s1, s2, s3, s4;
+ u32 h0, h1, h2, h3, h4;
+ u64 d0, d1, d2, d3, d4;
+ u32 c;
+
+ r0 = st->r[0];
+ r1 = st->r[1];
+ r2 = st->r[2];
+ r3 = st->r[3];
+ r4 = st->r[4];
+
+ s1 = st->s[0];
+ s2 = st->s[1];
+ s3 = st->s[2];
+ s4 = st->s[3];
+
+ h0 = st->h[0];
+ h1 = st->h[1];
+ h2 = st->h[2];
+ h3 = st->h[3];
+ h4 = st->h[4];
+
+ while (len >= POLY1305_BLOCK_SIZE) {
+ /* h += m[i] */
+ h0 += (get_unaligned_le32(&input[0])) & 0x3ffffff;
+ h1 += (get_unaligned_le32(&input[3]) >> 2) & 0x3ffffff;
+ h2 += (get_unaligned_le32(&input[6]) >> 4) & 0x3ffffff;
+ h3 += (get_unaligned_le32(&input[9]) >> 6) & 0x3ffffff;
+ h4 += (get_unaligned_le32(&input[12]) >> 8) | hibit;
+
+ /* h *= r */
+ d0 = ((u64)h0 * r0) + ((u64)h1 * s4) +
+ ((u64)h2 * s3) + ((u64)h3 * s2) +
+ ((u64)h4 * s1);
+ d1 = ((u64)h0 * r1) + ((u64)h1 * r0) +
+ ((u64)h2 * s4) + ((u64)h3 * s3) +
+ ((u64)h4 * s2);
+ d2 = ((u64)h0 * r2) + ((u64)h1 * r1) +
+ ((u64)h2 * r0) + ((u64)h3 * s4) +
+ ((u64)h4 * s3);
+ d3 = ((u64)h0 * r3) + ((u64)h1 * r2) +
+ ((u64)h2 * r1) + ((u64)h3 * r0) +
+ ((u64)h4 * s4);
+ d4 = ((u64)h0 * r4) + ((u64)h1 * r3) +
+ ((u64)h2 * r2) + ((u64)h3 * r1) +
+ ((u64)h4 * r0);
+
+ /* (partial) h %= p */
+ c = (u32)(d0 >> 26);
+ h0 = (u32)d0 & 0x3ffffff;
+ d1 += c;
+ c = (u32)(d1 >> 26);
+ h1 = (u32)d1 & 0x3ffffff;
+ d2 += c;
+ c = (u32)(d2 >> 26);
+ h2 = (u32)d2 & 0x3ffffff;
+ d3 += c;
+ c = (u32)(d3 >> 26);
+ h3 = (u32)d3 & 0x3ffffff;
+ d4 += c;
+ c = (u32)(d4 >> 26);
+ h4 = (u32)d4 & 0x3ffffff;
+ h0 += c * 5;
+ c = (h0 >> 26);
+ h0 = h0 & 0x3ffffff;
+ h1 += c;
+
+ input += POLY1305_BLOCK_SIZE;
+ len -= POLY1305_BLOCK_SIZE;
+ }
+
+ st->h[0] = h0;
+ st->h[1] = h1;
+ st->h[2] = h2;
+ st->h[3] = h3;
+ st->h[4] = h4;
+}
+
+static void poly1305_emit_generic(void *ctx, u8 mac[16], const u32 nonce[4])
+{
+ struct poly1305_internal *st = (struct poly1305_internal *)ctx;
+ u32 h0, h1, h2, h3, h4, c;
+ u32 g0, g1, g2, g3, g4;
+ u64 f;
+ u32 mask;
+
+ /* fully carry h */
+ h0 = st->h[0];
+ h1 = st->h[1];
+ h2 = st->h[2];
+ h3 = st->h[3];
+ h4 = st->h[4];
+
+ c = h1 >> 26;
+ h1 = h1 & 0x3ffffff;
+ h2 += c;
+ c = h2 >> 26;
+ h2 = h2 & 0x3ffffff;
+ h3 += c;
+ c = h3 >> 26;
+ h3 = h3 & 0x3ffffff;
+ h4 += c;
+ c = h4 >> 26;
+ h4 = h4 & 0x3ffffff;
+ h0 += c * 5;
+ c = h0 >> 26;
+ h0 = h0 & 0x3ffffff;
+ h1 += c;
+
+ /* compute h + -p */
+ g0 = h0 + 5;
+ c = g0 >> 26;
+ g0 &= 0x3ffffff;
+ g1 = h1 + c;
+ c = g1 >> 26;
+ g1 &= 0x3ffffff;
+ g2 = h2 + c;
+ c = g2 >> 26;
+ g2 &= 0x3ffffff;
+ g3 = h3 + c;
+ c = g3 >> 26;
+ g3 &= 0x3ffffff;
+ g4 = h4 + c - (1UL << 26);
+
+ /* select h if h < p, or h + -p if h >= p */
+ mask = (g4 >> ((sizeof(u32) * 8) - 1)) - 1;
+ g0 &= mask;
+ g1 &= mask;
+ g2 &= mask;
+ g3 &= mask;
+ g4 &= mask;
+ mask = ~mask;
+
+ h0 = (h0 & mask) | g0;
+ h1 = (h1 & mask) | g1;
+ h2 = (h2 & mask) | g2;
+ h3 = (h3 & mask) | g3;
+ h4 = (h4 & mask) | g4;
+
+ /* h = h % (2^128) */
+ h0 = ((h0) | (h1 << 26)) & 0xffffffff;
+ h1 = ((h1 >> 6) | (h2 << 20)) & 0xffffffff;
+ h2 = ((h2 >> 12) | (h3 << 14)) & 0xffffffff;
+ h3 = ((h3 >> 18) | (h4 << 8)) & 0xffffffff;
+
+ /* mac = (h + nonce) % (2^128) */
+ f = (u64)h0 + nonce[0];
+ h0 = (u32)f;
+ f = (u64)h1 + nonce[1] + (f >> 32);
+ h1 = (u32)f;
+ f = (u64)h2 + nonce[2] + (f >> 32);
+ h2 = (u32)f;
+ f = (u64)h3 + nonce[3] + (f >> 32);
+ h3 = (u32)f;
+
+ put_unaligned_le32(h0, &mac[0]);
+ put_unaligned_le32(h1, &mac[4]);
+ put_unaligned_le32(h2, &mac[8]);
+ put_unaligned_le32(h3, &mac[12]);
+}
diff --git a/lib/zinc/poly1305/poly1305-donna64.c b/lib/zinc/poly1305/poly1305-donna64.c
new file mode 100644
index 000000000000..aeb75ab8da6b
--- /dev/null
+++ b/lib/zinc/poly1305/poly1305-donna64.c
@@ -0,0 +1,182 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * This is based in part on Andrew Moon's poly1305-donna, which is in the
+ * public domain.
+ */
+
+typedef __uint128_t u128;
+
+struct poly1305_internal {
+ u64 r[3];
+ u64 h[3];
+ u64 s[2];
+};
+
+static void poly1305_init_generic(void *ctx, const u8 key[16])
+{
+ struct poly1305_internal *st = (struct poly1305_internal *)ctx;
+ u64 t0, t1;
+
+ /* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
+ t0 = get_unaligned_le64(&key[0]);
+ t1 = get_unaligned_le64(&key[8]);
+
+ st->r[0] = t0 & 0xffc0fffffff;
+ st->r[1] = ((t0 >> 44) | (t1 << 20)) & 0xfffffc0ffff;
+ st->r[2] = ((t1 >> 24)) & 0x00ffffffc0f;
+
+ /* s = 20*r */
+ st->s[0] = st->r[1] * 20;
+ st->s[1] = st->r[2] * 20;
+
+ /* h = 0 */
+ st->h[0] = 0;
+ st->h[1] = 0;
+ st->h[2] = 0;
+}
+
+static void poly1305_blocks_generic(void *ctx, const u8 *input, size_t len,
+ const u32 padbit)
+{
+ struct poly1305_internal *st = (struct poly1305_internal *)ctx;
+ const u64 hibit = ((u64)padbit) << 40;
+ u64 r0, r1, r2;
+ u64 s1, s2;
+ u64 h0, h1, h2;
+ u64 c;
+ u128 d0, d1, d2, d;
+
+ r0 = st->r[0];
+ r1 = st->r[1];
+ r2 = st->r[2];
+
+ h0 = st->h[0];
+ h1 = st->h[1];
+ h2 = st->h[2];
+
+ s1 = st->s[0];
+ s2 = st->s[1];
+
+ while (len >= POLY1305_BLOCK_SIZE) {
+ u64 t0, t1;
+
+ /* h += m[i] */
+ t0 = get_unaligned_le64(&input[0]);
+ t1 = get_unaligned_le64(&input[8]);
+
+ h0 += t0 & 0xfffffffffff;
+ h1 += ((t0 >> 44) | (t1 << 20)) & 0xfffffffffff;
+ h2 += (((t1 >> 24)) & 0x3ffffffffff) | hibit;
+
+ /* h *= r */
+ d0 = (u128)h0 * r0;
+ d = (u128)h1 * s2;
+ d0 += d;
+ d = (u128)h2 * s1;
+ d0 += d;
+ d1 = (u128)h0 * r1;
+ d = (u128)h1 * r0;
+ d1 += d;
+ d = (u128)h2 * s2;
+ d1 += d;
+ d2 = (u128)h0 * r2;
+ d = (u128)h1 * r1;
+ d2 += d;
+ d = (u128)h2 * r0;
+ d2 += d;
+
+ /* (partial) h %= p */
+ c = (u64)(d0 >> 44);
+ h0 = (u64)d0 & 0xfffffffffff;
+ d1 += c;
+ c = (u64)(d1 >> 44);
+ h1 = (u64)d1 & 0xfffffffffff;
+ d2 += c;
+ c = (u64)(d2 >> 42);
+ h2 = (u64)d2 & 0x3ffffffffff;
+ h0 += c * 5;
+ c = h0 >> 44;
+ h0 = h0 & 0xfffffffffff;
+ h1 += c;
+
+ input += POLY1305_BLOCK_SIZE;
+ len -= POLY1305_BLOCK_SIZE;
+ }
+
+ st->h[0] = h0;
+ st->h[1] = h1;
+ st->h[2] = h2;
+}
+
+static void poly1305_emit_generic(void *ctx, u8 mac[16], const u32 nonce[4])
+{
+ struct poly1305_internal *st = (struct poly1305_internal *)ctx;
+ u64 h0, h1, h2, c;
+ u64 g0, g1, g2;
+ u64 t0, t1;
+
+ /* fully carry h */
+ h0 = st->h[0];
+ h1 = st->h[1];
+ h2 = st->h[2];
+
+ c = h1 >> 44;
+ h1 &= 0xfffffffffff;
+ h2 += c;
+ c = h2 >> 42;
+ h2 &= 0x3ffffffffff;
+ h0 += c * 5;
+ c = h0 >> 44;
+ h0 &= 0xfffffffffff;
+ h1 += c;
+ c = h1 >> 44;
+ h1 &= 0xfffffffffff;
+ h2 += c;
+ c = h2 >> 42;
+ h2 &= 0x3ffffffffff;
+ h0 += c * 5;
+ c = h0 >> 44;
+ h0 &= 0xfffffffffff;
+ h1 += c;
+
+ /* compute h + -p */
+ g0 = h0 + 5;
+ c = g0 >> 44;
+ g0 &= 0xfffffffffff;
+ g1 = h1 + c;
+ c = g1 >> 44;
+ g1 &= 0xfffffffffff;
+ g2 = h2 + c - (1ULL << 42);
+
+ /* select h if h < p, or h + -p if h >= p */
+ c = (g2 >> ((sizeof(u64) * 8) - 1)) - 1;
+ g0 &= c;
+ g1 &= c;
+ g2 &= c;
+ c = ~c;
+ h0 = (h0 & c) | g0;
+ h1 = (h1 & c) | g1;
+ h2 = (h2 & c) | g2;
+
+ /* h = (h + nonce) */
+ t0 = ((u64)nonce[1] << 32) | nonce[0];
+ t1 = ((u64)nonce[3] << 32) | nonce[2];
+
+ h0 += t0 & 0xfffffffffff;
+ c = h0 >> 44;
+ h0 &= 0xfffffffffff;
+ h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff) + c;
+ c = h1 >> 44;
+ h1 &= 0xfffffffffff;
+ h2 += (((t1 >> 24)) & 0x3ffffffffff) + c;
+ h2 &= 0x3ffffffffff;
+
+ /* mac = h % (2^128) */
+ h0 = h0 | (h1 << 44);
+ h1 = (h1 >> 20) | (h2 << 24);
+
+ put_unaligned_le64(h0, &mac[0]);
+ put_unaligned_le64(h1, &mac[8]);
+}
diff --git a/lib/zinc/poly1305/poly1305.c b/lib/zinc/poly1305/poly1305.c
new file mode 100644
index 000000000000..d4922d230c4f
--- /dev/null
+++ b/lib/zinc/poly1305/poly1305.c
@@ -0,0 +1,154 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * Implementation of the Poly1305 message authenticator.
+ *
+ * Information: https://cr.yp.to/mac.html
+ */
+
+#include <zinc/poly1305.h>
+#include "../selftest/run.h"
+
+#include <asm/unaligned.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/module.h>
+#include <linux/init.h>
+
+static inline bool poly1305_init_arch(void *ctx,
+ const u8 key[POLY1305_KEY_SIZE])
+{
+ return false;
+}
+static inline bool poly1305_blocks_arch(void *ctx, const u8 *input,
+ size_t len, const u32 padbit,
+ simd_context_t *simd_context)
+{
+ return false;
+}
+static inline bool poly1305_emit_arch(void *ctx, u8 mac[POLY1305_MAC_SIZE],
+ const u32 nonce[4],
+ simd_context_t *simd_context)
+{
+ return false;
+}
+static bool *const poly1305_nobs[] __initconst = { };
+static void __init poly1305_fpu_init(void)
+{
+}
+
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+#include "poly1305-donna64.c"
+#else
+#include "poly1305-donna32.c"
+#endif
+
+void poly1305_init(struct poly1305_ctx *ctx, const u8 key[POLY1305_KEY_SIZE])
+{
+ ctx->nonce[0] = get_unaligned_le32(&key[16]);
+ ctx->nonce[1] = get_unaligned_le32(&key[20]);
+ ctx->nonce[2] = get_unaligned_le32(&key[24]);
+ ctx->nonce[3] = get_unaligned_le32(&key[28]);
+
+ if (!poly1305_init_arch(ctx->opaque, key))
+ poly1305_init_generic(ctx->opaque, key);
+
+ ctx->num = 0;
+}
+EXPORT_SYMBOL(poly1305_init);
+
+static inline void poly1305_blocks(void *ctx, const u8 *input, const size_t len,
+ const u32 padbit,
+ simd_context_t *simd_context)
+{
+ if (!poly1305_blocks_arch(ctx, input, len, padbit, simd_context))
+ poly1305_blocks_generic(ctx, input, len, padbit);
+}
+
+static inline void poly1305_emit(void *ctx, u8 mac[POLY1305_KEY_SIZE],
+ const u32 nonce[4],
+ simd_context_t *simd_context)
+{
+ if (!poly1305_emit_arch(ctx, mac, nonce, simd_context))
+ poly1305_emit_generic(ctx, mac, nonce);
+}
+
+void poly1305_update(struct poly1305_ctx *ctx, const u8 *input, size_t len,
+ simd_context_t *simd_context)
+{
+ const size_t num = ctx->num;
+ size_t rem;
+
+ if (num) {
+ rem = POLY1305_BLOCK_SIZE - num;
+ if (len < rem) {
+ memcpy(ctx->data + num, input, len);
+ ctx->num = num + len;
+ return;
+ }
+ memcpy(ctx->data + num, input, rem);
+ poly1305_blocks(ctx->opaque, ctx->data, POLY1305_BLOCK_SIZE, 1,
+ simd_context);
+ input += rem;
+ len -= rem;
+ }
+
+ rem = len % POLY1305_BLOCK_SIZE;
+ len -= rem;
+
+ if (len >= POLY1305_BLOCK_SIZE) {
+ poly1305_blocks(ctx->opaque, input, len, 1, simd_context);
+ input += len;
+ }
+
+ if (rem)
+ memcpy(ctx->data, input, rem);
+
+ ctx->num = rem;
+}
+EXPORT_SYMBOL(poly1305_update);
+
+void poly1305_final(struct poly1305_ctx *ctx, u8 mac[POLY1305_MAC_SIZE],
+ simd_context_t *simd_context)
+{
+ size_t num = ctx->num;
+
+ if (num) {
+ ctx->data[num++] = 1;
+ while (num < POLY1305_BLOCK_SIZE)
+ ctx->data[num++] = 0;
+ poly1305_blocks(ctx->opaque, ctx->data, POLY1305_BLOCK_SIZE, 0,
+ simd_context);
+ }
+
+ poly1305_emit(ctx->opaque, mac, ctx->nonce, simd_context);
+
+ memzero_explicit(ctx, sizeof(*ctx));
+}
+EXPORT_SYMBOL(poly1305_final);
+
+#include "../selftest/poly1305.c"
+
+static bool nosimd __initdata = false;
+
+static int __init mod_init(void)
+{
+ if (!nosimd)
+ poly1305_fpu_init();
+ if (!selftest_run("poly1305", poly1305_selftest, poly1305_nobs,
+ ARRAY_SIZE(poly1305_nobs)))
+ return -ENOTRECOVERABLE;
+ return 0;
+}
+
+static void __exit mod_exit(void)
+{
+}
+
+module_param(nosimd, bool, 0);
+module_init(mod_init);
+module_exit(mod_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("Poly1305 one-time authenticator");
+MODULE_AUTHOR("Jason A. Donenfeld <Jason@zx2c4.com>");
diff --git a/lib/zinc/selftest/poly1305.c b/lib/zinc/selftest/poly1305.c
new file mode 100644
index 000000000000..6b1f87288490
--- /dev/null
+++ b/lib/zinc/selftest/poly1305.c
@@ -0,0 +1,1107 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+struct poly1305_testvec {
+ const u8 *input, *output, *key;
+ size_t ilen;
+};
+
+/* RFC7539 */
+static const u8 input01[] __initconst = {
+ 0x43, 0x72, 0x79, 0x70, 0x74, 0x6f, 0x67, 0x72,
+ 0x61, 0x70, 0x68, 0x69, 0x63, 0x20, 0x46, 0x6f,
+ 0x72, 0x75, 0x6d, 0x20, 0x52, 0x65, 0x73, 0x65,
+ 0x61, 0x72, 0x63, 0x68, 0x20, 0x47, 0x72, 0x6f,
+ 0x75, 0x70
+};
+static const u8 output01[] __initconst = {
+ 0xa8, 0x06, 0x1d, 0xc1, 0x30, 0x51, 0x36, 0xc6,
+ 0xc2, 0x2b, 0x8b, 0xaf, 0x0c, 0x01, 0x27, 0xa9
+};
+static const u8 key01[] __initconst = {
+ 0x85, 0xd6, 0xbe, 0x78, 0x57, 0x55, 0x6d, 0x33,
+ 0x7f, 0x44, 0x52, 0xfe, 0x42, 0xd5, 0x06, 0xa8,
+ 0x01, 0x03, 0x80, 0x8a, 0xfb, 0x0d, 0xb2, 0xfd,
+ 0x4a, 0xbf, 0xf6, 0xaf, 0x41, 0x49, 0xf5, 0x1b
+};
+
+/* "The Poly1305-AES message-authentication code" */
+static const u8 input02[] __initconst = {
+ 0xf3, 0xf6
+};
+static const u8 output02[] __initconst = {
+ 0xf4, 0xc6, 0x33, 0xc3, 0x04, 0x4f, 0xc1, 0x45,
+ 0xf8, 0x4f, 0x33, 0x5c, 0xb8, 0x19, 0x53, 0xde
+};
+static const u8 key02[] __initconst = {
+ 0x85, 0x1f, 0xc4, 0x0c, 0x34, 0x67, 0xac, 0x0b,
+ 0xe0, 0x5c, 0xc2, 0x04, 0x04, 0xf3, 0xf7, 0x00,
+ 0x58, 0x0b, 0x3b, 0x0f, 0x94, 0x47, 0xbb, 0x1e,
+ 0x69, 0xd0, 0x95, 0xb5, 0x92, 0x8b, 0x6d, 0xbc
+};
+
+static const u8 input03[] __initconst = { };
+static const u8 output03[] __initconst = {
+ 0xdd, 0x3f, 0xab, 0x22, 0x51, 0xf1, 0x1a, 0xc7,
+ 0x59, 0xf0, 0x88, 0x71, 0x29, 0xcc, 0x2e, 0xe7
+};
+static const u8 key03[] __initconst = {
+ 0xa0, 0xf3, 0x08, 0x00, 0x00, 0xf4, 0x64, 0x00,
+ 0xd0, 0xc7, 0xe9, 0x07, 0x6c, 0x83, 0x44, 0x03,
+ 0xdd, 0x3f, 0xab, 0x22, 0x51, 0xf1, 0x1a, 0xc7,
+ 0x59, 0xf0, 0x88, 0x71, 0x29, 0xcc, 0x2e, 0xe7
+};
+
+static const u8 input04[] __initconst = {
+ 0x66, 0x3c, 0xea, 0x19, 0x0f, 0xfb, 0x83, 0xd8,
+ 0x95, 0x93, 0xf3, 0xf4, 0x76, 0xb6, 0xbc, 0x24,
+ 0xd7, 0xe6, 0x79, 0x10, 0x7e, 0xa2, 0x6a, 0xdb,
+ 0x8c, 0xaf, 0x66, 0x52, 0xd0, 0x65, 0x61, 0x36
+};
+static const u8 output04[] __initconst = {
+ 0x0e, 0xe1, 0xc1, 0x6b, 0xb7, 0x3f, 0x0f, 0x4f,
+ 0xd1, 0x98, 0x81, 0x75, 0x3c, 0x01, 0xcd, 0xbe
+};
+static const u8 key04[] __initconst = {
+ 0x48, 0x44, 0x3d, 0x0b, 0xb0, 0xd2, 0x11, 0x09,
+ 0xc8, 0x9a, 0x10, 0x0b, 0x5c, 0xe2, 0xc2, 0x08,
+ 0x83, 0x14, 0x9c, 0x69, 0xb5, 0x61, 0xdd, 0x88,
+ 0x29, 0x8a, 0x17, 0x98, 0xb1, 0x07, 0x16, 0xef
+};
+
+static const u8 input05[] __initconst = {
+ 0xab, 0x08, 0x12, 0x72, 0x4a, 0x7f, 0x1e, 0x34,
+ 0x27, 0x42, 0xcb, 0xed, 0x37, 0x4d, 0x94, 0xd1,
+ 0x36, 0xc6, 0xb8, 0x79, 0x5d, 0x45, 0xb3, 0x81,
+ 0x98, 0x30, 0xf2, 0xc0, 0x44, 0x91, 0xfa, 0xf0,
+ 0x99, 0x0c, 0x62, 0xe4, 0x8b, 0x80, 0x18, 0xb2,
+ 0xc3, 0xe4, 0xa0, 0xfa, 0x31, 0x34, 0xcb, 0x67,
+ 0xfa, 0x83, 0xe1, 0x58, 0xc9, 0x94, 0xd9, 0x61,
+ 0xc4, 0xcb, 0x21, 0x09, 0x5c, 0x1b, 0xf9
+};
+static const u8 output05[] __initconst = {
+ 0x51, 0x54, 0xad, 0x0d, 0x2c, 0xb2, 0x6e, 0x01,
+ 0x27, 0x4f, 0xc5, 0x11, 0x48, 0x49, 0x1f, 0x1b
+};
+static const u8 key05[] __initconst = {
+ 0x12, 0x97, 0x6a, 0x08, 0xc4, 0x42, 0x6d, 0x0c,
+ 0xe8, 0xa8, 0x24, 0x07, 0xc4, 0xf4, 0x82, 0x07,
+ 0x80, 0xf8, 0xc2, 0x0a, 0xa7, 0x12, 0x02, 0xd1,
+ 0xe2, 0x91, 0x79, 0xcb, 0xcb, 0x55, 0x5a, 0x57
+};
+
+/* self-generated vectors exercise "significant" lengths, such that they
+ * are handled by different code paths */
+static const u8 input06[] __initconst = {
+ 0xab, 0x08, 0x12, 0x72, 0x4a, 0x7f, 0x1e, 0x34,
+ 0x27, 0x42, 0xcb, 0xed, 0x37, 0x4d, 0x94, 0xd1,
+ 0x36, 0xc6, 0xb8, 0x79, 0x5d, 0x45, 0xb3, 0x81,
+ 0x98, 0x30, 0xf2, 0xc0, 0x44, 0x91, 0xfa, 0xf0,
+ 0x99, 0x0c, 0x62, 0xe4, 0x8b, 0x80, 0x18, 0xb2,
+ 0xc3, 0xe4, 0xa0, 0xfa, 0x31, 0x34, 0xcb, 0x67,
+ 0xfa, 0x83, 0xe1, 0x58, 0xc9, 0x94, 0xd9, 0x61,
+ 0xc4, 0xcb, 0x21, 0x09, 0x5c, 0x1b, 0xf9, 0xaf
+};
+static const u8 output06[] __initconst = {
+ 0x81, 0x20, 0x59, 0xa5, 0xda, 0x19, 0x86, 0x37,
+ 0xca, 0xc7, 0xc4, 0xa6, 0x31, 0xbe, 0xe4, 0x66
+};
+static const u8 key06[] __initconst = {
+ 0x12, 0x97, 0x6a, 0x08, 0xc4, 0x42, 0x6d, 0x0c,
+ 0xe8, 0xa8, 0x24, 0x07, 0xc4, 0xf4, 0x82, 0x07,
+ 0x80, 0xf8, 0xc2, 0x0a, 0xa7, 0x12, 0x02, 0xd1,
+ 0xe2, 0x91, 0x79, 0xcb, 0xcb, 0x55, 0x5a, 0x57
+};
+
+static const u8 input07[] __initconst = {
+ 0xab, 0x08, 0x12, 0x72, 0x4a, 0x7f, 0x1e, 0x34,
+ 0x27, 0x42, 0xcb, 0xed, 0x37, 0x4d, 0x94, 0xd1,
+ 0x36, 0xc6, 0xb8, 0x79, 0x5d, 0x45, 0xb3, 0x81,
+ 0x98, 0x30, 0xf2, 0xc0, 0x44, 0x91, 0xfa, 0xf0,
+ 0x99, 0x0c, 0x62, 0xe4, 0x8b, 0x80, 0x18, 0xb2,
+ 0xc3, 0xe4, 0xa0, 0xfa, 0x31, 0x34, 0xcb, 0x67
+};
+static const u8 output07[] __initconst = {
+ 0x5b, 0x88, 0xd7, 0xf6, 0x22, 0x8b, 0x11, 0xe2,
+ 0xe2, 0x85, 0x79, 0xa5, 0xc0, 0xc1, 0xf7, 0x61
+};
+static const u8 key07[] __initconst = {
+ 0x12, 0x97, 0x6a, 0x08, 0xc4, 0x42, 0x6d, 0x0c,
+ 0xe8, 0xa8, 0x24, 0x07, 0xc4, 0xf4, 0x82, 0x07,
+ 0x80, 0xf8, 0xc2, 0x0a, 0xa7, 0x12, 0x02, 0xd1,
+ 0xe2, 0x91, 0x79, 0xcb, 0xcb, 0x55, 0x5a, 0x57
+};
+
+static const u8 input08[] __initconst = {
+ 0xab, 0x08, 0x12, 0x72, 0x4a, 0x7f, 0x1e, 0x34,
+ 0x27, 0x42, 0xcb, 0xed, 0x37, 0x4d, 0x94, 0xd1,
+ 0x36, 0xc6, 0xb8, 0x79, 0x5d, 0x45, 0xb3, 0x81,
+ 0x98, 0x30, 0xf2, 0xc0, 0x44, 0x91, 0xfa, 0xf0,
+ 0x99, 0x0c, 0x62, 0xe4, 0x8b, 0x80, 0x18, 0xb2,
+ 0xc3, 0xe4, 0xa0, 0xfa, 0x31, 0x34, 0xcb, 0x67,
+ 0xfa, 0x83, 0xe1, 0x58, 0xc9, 0x94, 0xd9, 0x61,
+ 0xc4, 0xcb, 0x21, 0x09, 0x5c, 0x1b, 0xf9, 0xaf,
+ 0x66, 0x3c, 0xea, 0x19, 0x0f, 0xfb, 0x83, 0xd8,
+ 0x95, 0x93, 0xf3, 0xf4, 0x76, 0xb6, 0xbc, 0x24,
+ 0xd7, 0xe6, 0x79, 0x10, 0x7e, 0xa2, 0x6a, 0xdb,
+ 0x8c, 0xaf, 0x66, 0x52, 0xd0, 0x65, 0x61, 0x36
+};
+static const u8 output08[] __initconst = {
+ 0xbb, 0xb6, 0x13, 0xb2, 0xb6, 0xd7, 0x53, 0xba,
+ 0x07, 0x39, 0x5b, 0x91, 0x6a, 0xae, 0xce, 0x15
+};
+static const u8 key08[] __initconst = {
+ 0x12, 0x97, 0x6a, 0x08, 0xc4, 0x42, 0x6d, 0x0c,
+ 0xe8, 0xa8, 0x24, 0x07, 0xc4, 0xf4, 0x82, 0x07,
+ 0x80, 0xf8, 0xc2, 0x0a, 0xa7, 0x12, 0x02, 0xd1,
+ 0xe2, 0x91, 0x79, 0xcb, 0xcb, 0x55, 0x5a, 0x57
+};
+
+static const u8 input09[] __initconst = {
+ 0xab, 0x08, 0x12, 0x72, 0x4a, 0x7f, 0x1e, 0x34,
+ 0x27, 0x42, 0xcb, 0xed, 0x37, 0x4d, 0x94, 0xd1,
+ 0x36, 0xc6, 0xb8, 0x79, 0x5d, 0x45, 0xb3, 0x81,
+ 0x98, 0x30, 0xf2, 0xc0, 0x44, 0x91, 0xfa, 0xf0,
+ 0x99, 0x0c, 0x62, 0xe4, 0x8b, 0x80, 0x18, 0xb2,
+ 0xc3, 0xe4, 0xa0, 0xfa, 0x31, 0x34, 0xcb, 0x67,
+ 0xfa, 0x83, 0xe1, 0x58, 0xc9, 0x94, 0xd9, 0x61,
+ 0xc4, 0xcb, 0x21, 0x09, 0x5c, 0x1b, 0xf9, 0xaf,
+ 0x48, 0x44, 0x3d, 0x0b, 0xb0, 0xd2, 0x11, 0x09,
+ 0xc8, 0x9a, 0x10, 0x0b, 0x5c, 0xe2, 0xc2, 0x08,
+ 0x83, 0x14, 0x9c, 0x69, 0xb5, 0x61, 0xdd, 0x88,
+ 0x29, 0x8a, 0x17, 0x98, 0xb1, 0x07, 0x16, 0xef,
+ 0x66, 0x3c, 0xea, 0x19, 0x0f, 0xfb, 0x83, 0xd8,
+ 0x95, 0x93, 0xf3, 0xf4, 0x76, 0xb6, 0xbc, 0x24
+};
+static const u8 output09[] __initconst = {
+ 0xc7, 0x94, 0xd7, 0x05, 0x7d, 0x17, 0x78, 0xc4,
+ 0xbb, 0xee, 0x0a, 0x39, 0xb3, 0xd9, 0x73, 0x42
+};
+static const u8 key09[] __initconst = {
+ 0x12, 0x97, 0x6a, 0x08, 0xc4, 0x42, 0x6d, 0x0c,
+ 0xe8, 0xa8, 0x24, 0x07, 0xc4, 0xf4, 0x82, 0x07,
+ 0x80, 0xf8, 0xc2, 0x0a, 0xa7, 0x12, 0x02, 0xd1,
+ 0xe2, 0x91, 0x79, 0xcb, 0xcb, 0x55, 0x5a, 0x57
+};
+
+static const u8 input10[] __initconst = {
+ 0xab, 0x08, 0x12, 0x72, 0x4a, 0x7f, 0x1e, 0x34,
+ 0x27, 0x42, 0xcb, 0xed, 0x37, 0x4d, 0x94, 0xd1,
+ 0x36, 0xc6, 0xb8, 0x79, 0x5d, 0x45, 0xb3, 0x81,
+ 0x98, 0x30, 0xf2, 0xc0, 0x44, 0x91, 0xfa, 0xf0,
+ 0x99, 0x0c, 0x62, 0xe4, 0x8b, 0x80, 0x18, 0xb2,
+ 0xc3, 0xe4, 0xa0, 0xfa, 0x31, 0x34, 0xcb, 0x67,
+ 0xfa, 0x83, 0xe1, 0x58, 0xc9, 0x94, 0xd9, 0x61,
+ 0xc4, 0xcb, 0x21, 0x09, 0x5c, 0x1b, 0xf9, 0xaf,
+ 0x48, 0x44, 0x3d, 0x0b, 0xb0, 0xd2, 0x11, 0x09,
+ 0xc8, 0x9a, 0x10, 0x0b, 0x5c, 0xe2, 0xc2, 0x08,
+ 0x83, 0x14, 0x9c, 0x69, 0xb5, 0x61, 0xdd, 0x88,
+ 0x29, 0x8a, 0x17, 0x98, 0xb1, 0x07, 0x16, 0xef,
+ 0x66, 0x3c, 0xea, 0x19, 0x0f, 0xfb, 0x83, 0xd8,
+ 0x95, 0x93, 0xf3, 0xf4, 0x76, 0xb6, 0xbc, 0x24,
+ 0xd7, 0xe6, 0x79, 0x10, 0x7e, 0xa2, 0x6a, 0xdb,
+ 0x8c, 0xaf, 0x66, 0x52, 0xd0, 0x65, 0x61, 0x36
+};
+static const u8 output10[] __initconst = {
+ 0xff, 0xbc, 0xb9, 0xb3, 0x71, 0x42, 0x31, 0x52,
+ 0xd7, 0xfc, 0xa5, 0xad, 0x04, 0x2f, 0xba, 0xa9
+};
+static const u8 key10[] __initconst = {
+ 0x12, 0x97, 0x6a, 0x08, 0xc4, 0x42, 0x6d, 0x0c,
+ 0xe8, 0xa8, 0x24, 0x07, 0xc4, 0xf4, 0x82, 0x07,
+ 0x80, 0xf8, 0xc2, 0x0a, 0xa7, 0x12, 0x02, 0xd1,
+ 0xe2, 0x91, 0x79, 0xcb, 0xcb, 0x55, 0x5a, 0x57
+};
+
+static const u8 input11[] __initconst = {
+ 0xab, 0x08, 0x12, 0x72, 0x4a, 0x7f, 0x1e, 0x34,
+ 0x27, 0x42, 0xcb, 0xed, 0x37, 0x4d, 0x94, 0xd1,
+ 0x36, 0xc6, 0xb8, 0x79, 0x5d, 0x45, 0xb3, 0x81,
+ 0x98, 0x30, 0xf2, 0xc0, 0x44, 0x91, 0xfa, 0xf0,
+ 0x99, 0x0c, 0x62, 0xe4, 0x8b, 0x80, 0x18, 0xb2,
+ 0xc3, 0xe4, 0xa0, 0xfa, 0x31, 0x34, 0xcb, 0x67,
+ 0xfa, 0x83, 0xe1, 0x58, 0xc9, 0x94, 0xd9, 0x61,
+ 0xc4, 0xcb, 0x21, 0x09, 0x5c, 0x1b, 0xf9, 0xaf,
+ 0x48, 0x44, 0x3d, 0x0b, 0xb0, 0xd2, 0x11, 0x09,
+ 0xc8, 0x9a, 0x10, 0x0b, 0x5c, 0xe2, 0xc2, 0x08,
+ 0x83, 0x14, 0x9c, 0x69, 0xb5, 0x61, 0xdd, 0x88,
+ 0x29, 0x8a, 0x17, 0x98, 0xb1, 0x07, 0x16, 0xef,
+ 0x66, 0x3c, 0xea, 0x19, 0x0f, 0xfb, 0x83, 0xd8,
+ 0x95, 0x93, 0xf3, 0xf4, 0x76, 0xb6, 0xbc, 0x24,
+ 0xd7, 0xe6, 0x79, 0x10, 0x7e, 0xa2, 0x6a, 0xdb,
+ 0x8c, 0xaf, 0x66, 0x52, 0xd0, 0x65, 0x61, 0x36,
+ 0x81, 0x20, 0x59, 0xa5, 0xda, 0x19, 0x86, 0x37,
+ 0xca, 0xc7, 0xc4, 0xa6, 0x31, 0xbe, 0xe4, 0x66
+};
+static const u8 output11[] __initconst = {
+ 0x06, 0x9e, 0xd6, 0xb8, 0xef, 0x0f, 0x20, 0x7b,
+ 0x3e, 0x24, 0x3b, 0xb1, 0x01, 0x9f, 0xe6, 0x32
+};
+static const u8 key11[] __initconst = {
+ 0x12, 0x97, 0x6a, 0x08, 0xc4, 0x42, 0x6d, 0x0c,
+ 0xe8, 0xa8, 0x24, 0x07, 0xc4, 0xf4, 0x82, 0x07,
+ 0x80, 0xf8, 0xc2, 0x0a, 0xa7, 0x12, 0x02, 0xd1,
+ 0xe2, 0x91, 0x79, 0xcb, 0xcb, 0x55, 0x5a, 0x57
+};
+
+static const u8 input12[] __initconst = {
+ 0xab, 0x08, 0x12, 0x72, 0x4a, 0x7f, 0x1e, 0x34,
+ 0x27, 0x42, 0xcb, 0xed, 0x37, 0x4d, 0x94, 0xd1,
+ 0x36, 0xc6, 0xb8, 0x79, 0x5d, 0x45, 0xb3, 0x81,
+ 0x98, 0x30, 0xf2, 0xc0, 0x44, 0x91, 0xfa, 0xf0,
+ 0x99, 0x0c, 0x62, 0xe4, 0x8b, 0x80, 0x18, 0xb2,
+ 0xc3, 0xe4, 0xa0, 0xfa, 0x31, 0x34, 0xcb, 0x67,
+ 0xfa, 0x83, 0xe1, 0x58, 0xc9, 0x94, 0xd9, 0x61,
+ 0xc4, 0xcb, 0x21, 0x09, 0x5c, 0x1b, 0xf9, 0xaf,
+ 0x48, 0x44, 0x3d, 0x0b, 0xb0, 0xd2, 0x11, 0x09,
+ 0xc8, 0x9a, 0x10, 0x0b, 0x5c, 0xe2, 0xc2, 0x08,
+ 0x83, 0x14, 0x9c, 0x69, 0xb5, 0x61, 0xdd, 0x88,
+ 0x29, 0x8a, 0x17, 0x98, 0xb1, 0x07, 0x16, 0xef,
+ 0x66, 0x3c, 0xea, 0x19, 0x0f, 0xfb, 0x83, 0xd8,
+ 0x95, 0x93, 0xf3, 0xf4, 0x76, 0xb6, 0xbc, 0x24,
+ 0xd7, 0xe6, 0x79, 0x10, 0x7e, 0xa2, 0x6a, 0xdb,
+ 0x8c, 0xaf, 0x66, 0x52, 0xd0, 0x65, 0x61, 0x36,
+ 0x81, 0x20, 0x59, 0xa5, 0xda, 0x19, 0x86, 0x37,
+ 0xca, 0xc7, 0xc4, 0xa6, 0x31, 0xbe, 0xe4, 0x66,
+ 0x5b, 0x88, 0xd7, 0xf6, 0x22, 0x8b, 0x11, 0xe2,
+ 0xe2, 0x85, 0x79, 0xa5, 0xc0, 0xc1, 0xf7, 0x61
+};
+static const u8 output12[] __initconst = {
+ 0xcc, 0xa3, 0x39, 0xd9, 0xa4, 0x5f, 0xa2, 0x36,
+ 0x8c, 0x2c, 0x68, 0xb3, 0xa4, 0x17, 0x91, 0x33
+};
+static const u8 key12[] __initconst = {
+ 0x12, 0x97, 0x6a, 0x08, 0xc4, 0x42, 0x6d, 0x0c,
+ 0xe8, 0xa8, 0x24, 0x07, 0xc4, 0xf4, 0x82, 0x07,
+ 0x80, 0xf8, 0xc2, 0x0a, 0xa7, 0x12, 0x02, 0xd1,
+ 0xe2, 0x91, 0x79, 0xcb, 0xcb, 0x55, 0x5a, 0x57
+};
+
+static const u8 input13[] __initconst = {
+ 0xab, 0x08, 0x12, 0x72, 0x4a, 0x7f, 0x1e, 0x34,
+ 0x27, 0x42, 0xcb, 0xed, 0x37, 0x4d, 0x94, 0xd1,
+ 0x36, 0xc6, 0xb8, 0x79, 0x5d, 0x45, 0xb3, 0x81,
+ 0x98, 0x30, 0xf2, 0xc0, 0x44, 0x91, 0xfa, 0xf0,
+ 0x99, 0x0c, 0x62, 0xe4, 0x8b, 0x80, 0x18, 0xb2,
+ 0xc3, 0xe4, 0xa0, 0xfa, 0x31, 0x34, 0xcb, 0x67,
+ 0xfa, 0x83, 0xe1, 0x58, 0xc9, 0x94, 0xd9, 0x61,
+ 0xc4, 0xcb, 0x21, 0x09, 0x5c, 0x1b, 0xf9, 0xaf,
+ 0x48, 0x44, 0x3d, 0x0b, 0xb0, 0xd2, 0x11, 0x09,
+ 0xc8, 0x9a, 0x10, 0x0b, 0x5c, 0xe2, 0xc2, 0x08,
+ 0x83, 0x14, 0x9c, 0x69, 0xb5, 0x61, 0xdd, 0x88,
+ 0x29, 0x8a, 0x17, 0x98, 0xb1, 0x07, 0x16, 0xef,
+ 0x66, 0x3c, 0xea, 0x19, 0x0f, 0xfb, 0x83, 0xd8,
+ 0x95, 0x93, 0xf3, 0xf4, 0x76, 0xb6, 0xbc, 0x24,
+ 0xd7, 0xe6, 0x79, 0x10, 0x7e, 0xa2, 0x6a, 0xdb,
+ 0x8c, 0xaf, 0x66, 0x52, 0xd0, 0x65, 0x61, 0x36,
+ 0x81, 0x20, 0x59, 0xa5, 0xda, 0x19, 0x86, 0x37,
+ 0xca, 0xc7, 0xc4, 0xa6, 0x31, 0xbe, 0xe4, 0x66,
+ 0x5b, 0x88, 0xd7, 0xf6, 0x22, 0x8b, 0x11, 0xe2,
+ 0xe2, 0x85, 0x79, 0xa5, 0xc0, 0xc1, 0xf7, 0x61,
+ 0xab, 0x08, 0x12, 0x72, 0x4a, 0x7f, 0x1e, 0x34,
+ 0x27, 0x42, 0xcb, 0xed, 0x37, 0x4d, 0x94, 0xd1,
+ 0x36, 0xc6, 0xb8, 0x79, 0x5d, 0x45, 0xb3, 0x81,
+ 0x98, 0x30, 0xf2, 0xc0, 0x44, 0x91, 0xfa, 0xf0,
+ 0x99, 0x0c, 0x62, 0xe4, 0x8b, 0x80, 0x18, 0xb2,
+ 0xc3, 0xe4, 0xa0, 0xfa, 0x31, 0x34, 0xcb, 0x67,
+ 0xfa, 0x83, 0xe1, 0x58, 0xc9, 0x94, 0xd9, 0x61,
+ 0xc4, 0xcb, 0x21, 0x09, 0x5c, 0x1b, 0xf9, 0xaf,
+ 0x48, 0x44, 0x3d, 0x0b, 0xb0, 0xd2, 0x11, 0x09,
+ 0xc8, 0x9a, 0x10, 0x0b, 0x5c, 0xe2, 0xc2, 0x08,
+ 0x83, 0x14, 0x9c, 0x69, 0xb5, 0x61, 0xdd, 0x88,
+ 0x29, 0x8a, 0x17, 0x98, 0xb1, 0x07, 0x16, 0xef,
+ 0x66, 0x3c, 0xea, 0x19, 0x0f, 0xfb, 0x83, 0xd8,
+ 0x95, 0x93, 0xf3, 0xf4, 0x76, 0xb6, 0xbc, 0x24,
+ 0xd7, 0xe6, 0x79, 0x10, 0x7e, 0xa2, 0x6a, 0xdb,
+ 0x8c, 0xaf, 0x66, 0x52, 0xd0, 0x65, 0x61, 0x36
+};
+static const u8 output13[] __initconst = {
+ 0x53, 0xf6, 0xe8, 0x28, 0xa2, 0xf0, 0xfe, 0x0e,
+ 0xe8, 0x15, 0xbf, 0x0b, 0xd5, 0x84, 0x1a, 0x34
+};
+static const u8 key13[] __initconst = {
+ 0x12, 0x97, 0x6a, 0x08, 0xc4, 0x42, 0x6d, 0x0c,
+ 0xe8, 0xa8, 0x24, 0x07, 0xc4, 0xf4, 0x82, 0x07,
+ 0x80, 0xf8, 0xc2, 0x0a, 0xa7, 0x12, 0x02, 0xd1,
+ 0xe2, 0x91, 0x79, 0xcb, 0xcb, 0x55, 0x5a, 0x57
+};
+
+static const u8 input14[] __initconst = {
+ 0xab, 0x08, 0x12, 0x72, 0x4a, 0x7f, 0x1e, 0x34,
+ 0x27, 0x42, 0xcb, 0xed, 0x37, 0x4d, 0x94, 0xd1,
+ 0x36, 0xc6, 0xb8, 0x79, 0x5d, 0x45, 0xb3, 0x81,
+ 0x98, 0x30, 0xf2, 0xc0, 0x44, 0x91, 0xfa, 0xf0,
+ 0x99, 0x0c, 0x62, 0xe4, 0x8b, 0x80, 0x18, 0xb2,
+ 0xc3, 0xe4, 0xa0, 0xfa, 0x31, 0x34, 0xcb, 0x67,
+ 0xfa, 0x83, 0xe1, 0x58, 0xc9, 0x94, 0xd9, 0x61,
+ 0xc4, 0xcb, 0x21, 0x09, 0x5c, 0x1b, 0xf9, 0xaf,
+ 0x48, 0x44, 0x3d, 0x0b, 0xb0, 0xd2, 0x11, 0x09,
+ 0xc8, 0x9a, 0x10, 0x0b, 0x5c, 0xe2, 0xc2, 0x08,
+ 0x83, 0x14, 0x9c, 0x69, 0xb5, 0x61, 0xdd, 0x88,
+ 0x29, 0x8a, 0x17, 0x98, 0xb1, 0x07, 0x16, 0xef,
+ 0x66, 0x3c, 0xea, 0x19, 0x0f, 0xfb, 0x83, 0xd8,
+ 0x95, 0x93, 0xf3, 0xf4, 0x76, 0xb6, 0xbc, 0x24,
+ 0xd7, 0xe6, 0x79, 0x10, 0x7e, 0xa2, 0x6a, 0xdb,
+ 0x8c, 0xaf, 0x66, 0x52, 0xd0, 0x65, 0x61, 0x36,
+ 0x81, 0x20, 0x59, 0xa5, 0xda, 0x19, 0x86, 0x37,
+ 0xca, 0xc7, 0xc4, 0xa6, 0x31, 0xbe, 0xe4, 0x66,
+ 0x5b, 0x88, 0xd7, 0xf6, 0x22, 0x8b, 0x11, 0xe2,
+ 0xe2, 0x85, 0x79, 0xa5, 0xc0, 0xc1, 0xf7, 0x61,
+ 0xab, 0x08, 0x12, 0x72, 0x4a, 0x7f, 0x1e, 0x34,
+ 0x27, 0x42, 0xcb, 0xed, 0x37, 0x4d, 0x94, 0xd1,
+ 0x36, 0xc6, 0xb8, 0x79, 0x5d, 0x45, 0xb3, 0x81,
+ 0x98, 0x30, 0xf2, 0xc0, 0x44, 0x91, 0xfa, 0xf0,
+ 0x99, 0x0c, 0x62, 0xe4, 0x8b, 0x80, 0x18, 0xb2,
+ 0xc3, 0xe4, 0xa0, 0xfa, 0x31, 0x34, 0xcb, 0x67,
+ 0xfa, 0x83, 0xe1, 0x58, 0xc9, 0x94, 0xd9, 0x61,
+ 0xc4, 0xcb, 0x21, 0x09, 0x5c, 0x1b, 0xf9, 0xaf,
+ 0x48, 0x44, 0x3d, 0x0b, 0xb0, 0xd2, 0x11, 0x09,
+ 0xc8, 0x9a, 0x10, 0x0b, 0x5c, 0xe2, 0xc2, 0x08,
+ 0x83, 0x14, 0x9c, 0x69, 0xb5, 0x61, 0xdd, 0x88,
+ 0x29, 0x8a, 0x17, 0x98, 0xb1, 0x07, 0x16, 0xef,
+ 0x66, 0x3c, 0xea, 0x19, 0x0f, 0xfb, 0x83, 0xd8,
+ 0x95, 0x93, 0xf3, 0xf4, 0x76, 0xb6, 0xbc, 0x24,
+ 0xd7, 0xe6, 0x79, 0x10, 0x7e, 0xa2, 0x6a, 0xdb,
+ 0x8c, 0xaf, 0x66, 0x52, 0xd0, 0x65, 0x61, 0x36,
+ 0x81, 0x20, 0x59, 0xa5, 0xda, 0x19, 0x86, 0x37,
+ 0xca, 0xc7, 0xc4, 0xa6, 0x31, 0xbe, 0xe4, 0x66,
+ 0x5b, 0x88, 0xd7, 0xf6, 0x22, 0x8b, 0x11, 0xe2,
+ 0xe2, 0x85, 0x79, 0xa5, 0xc0, 0xc1, 0xf7, 0x61
+};
+static const u8 output14[] __initconst = {
+ 0xb8, 0x46, 0xd4, 0x4e, 0x9b, 0xbd, 0x53, 0xce,
+ 0xdf, 0xfb, 0xfb, 0xb6, 0xb7, 0xfa, 0x49, 0x33
+};
+static const u8 key14[] __initconst = {
+ 0x12, 0x97, 0x6a, 0x08, 0xc4, 0x42, 0x6d, 0x0c,
+ 0xe8, 0xa8, 0x24, 0x07, 0xc4, 0xf4, 0x82, 0x07,
+ 0x80, 0xf8, 0xc2, 0x0a, 0xa7, 0x12, 0x02, 0xd1,
+ 0xe2, 0x91, 0x79, 0xcb, 0xcb, 0x55, 0x5a, 0x57
+};
+
+/* 4th power of the key spills to 131th bit in SIMD key setup */
+static const u8 input15[] __initconst = {
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+};
+static const u8 output15[] __initconst = {
+ 0x07, 0x14, 0x5a, 0x4c, 0x02, 0xfe, 0x5f, 0xa3,
+ 0x20, 0x36, 0xde, 0x68, 0xfa, 0xbe, 0x90, 0x66
+};
+static const u8 key15[] __initconst = {
+ 0xad, 0x62, 0x81, 0x07, 0xe8, 0x35, 0x1d, 0x0f,
+ 0x2c, 0x23, 0x1a, 0x05, 0xdc, 0x4a, 0x41, 0x06,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+
+/* OpenSSL's poly1305_ieee754.c failed this in final stage */
+static const u8 input16[] __initconst = {
+ 0x84, 0x23, 0x64, 0xe1, 0x56, 0x33, 0x6c, 0x09,
+ 0x98, 0xb9, 0x33, 0xa6, 0x23, 0x77, 0x26, 0x18,
+ 0x0d, 0x9e, 0x3f, 0xdc, 0xbd, 0xe4, 0xcd, 0x5d,
+ 0x17, 0x08, 0x0f, 0xc3, 0xbe, 0xb4, 0x96, 0x14,
+ 0xd7, 0x12, 0x2c, 0x03, 0x74, 0x63, 0xff, 0x10,
+ 0x4d, 0x73, 0xf1, 0x9c, 0x12, 0x70, 0x46, 0x28,
+ 0xd4, 0x17, 0xc4, 0xc5, 0x4a, 0x3f, 0xe3, 0x0d,
+ 0x3c, 0x3d, 0x77, 0x14, 0x38, 0x2d, 0x43, 0xb0,
+ 0x38, 0x2a, 0x50, 0xa5, 0xde, 0xe5, 0x4b, 0xe8,
+ 0x44, 0xb0, 0x76, 0xe8, 0xdf, 0x88, 0x20, 0x1a,
+ 0x1c, 0xd4, 0x3b, 0x90, 0xeb, 0x21, 0x64, 0x3f,
+ 0xa9, 0x6f, 0x39, 0xb5, 0x18, 0xaa, 0x83, 0x40,
+ 0xc9, 0x42, 0xff, 0x3c, 0x31, 0xba, 0xf7, 0xc9,
+ 0xbd, 0xbf, 0x0f, 0x31, 0xae, 0x3f, 0xa0, 0x96,
+ 0xbf, 0x8c, 0x63, 0x03, 0x06, 0x09, 0x82, 0x9f,
+ 0xe7, 0x2e, 0x17, 0x98, 0x24, 0x89, 0x0b, 0xc8,
+ 0xe0, 0x8c, 0x31, 0x5c, 0x1c, 0xce, 0x2a, 0x83,
+ 0x14, 0x4d, 0xbb, 0xff, 0x09, 0xf7, 0x4e, 0x3e,
+ 0xfc, 0x77, 0x0b, 0x54, 0xd0, 0x98, 0x4a, 0x8f,
+ 0x19, 0xb1, 0x47, 0x19, 0xe6, 0x36, 0x35, 0x64,
+ 0x1d, 0x6b, 0x1e, 0xed, 0xf6, 0x3e, 0xfb, 0xf0,
+ 0x80, 0xe1, 0x78, 0x3d, 0x32, 0x44, 0x54, 0x12,
+ 0x11, 0x4c, 0x20, 0xde, 0x0b, 0x83, 0x7a, 0x0d,
+ 0xfa, 0x33, 0xd6, 0xb8, 0x28, 0x25, 0xff, 0xf4,
+ 0x4c, 0x9a, 0x70, 0xea, 0x54, 0xce, 0x47, 0xf0,
+ 0x7d, 0xf6, 0x98, 0xe6, 0xb0, 0x33, 0x23, 0xb5,
+ 0x30, 0x79, 0x36, 0x4a, 0x5f, 0xc3, 0xe9, 0xdd,
+ 0x03, 0x43, 0x92, 0xbd, 0xde, 0x86, 0xdc, 0xcd,
+ 0xda, 0x94, 0x32, 0x1c, 0x5e, 0x44, 0x06, 0x04,
+ 0x89, 0x33, 0x6c, 0xb6, 0x5b, 0xf3, 0x98, 0x9c,
+ 0x36, 0xf7, 0x28, 0x2c, 0x2f, 0x5d, 0x2b, 0x88,
+ 0x2c, 0x17, 0x1e, 0x74
+};
+static const u8 output16[] __initconst = {
+ 0xf2, 0x48, 0x31, 0x2e, 0x57, 0x8d, 0x9d, 0x58,
+ 0xf8, 0xb7, 0xbb, 0x4d, 0x19, 0x10, 0x54, 0x31
+};
+static const u8 key16[] __initconst = {
+ 0x95, 0xd5, 0xc0, 0x05, 0x50, 0x3e, 0x51, 0x0d,
+ 0x8c, 0xd0, 0xaa, 0x07, 0x2c, 0x4a, 0x4d, 0x06,
+ 0x6e, 0xab, 0xc5, 0x2d, 0x11, 0x65, 0x3d, 0xf4,
+ 0x7f, 0xbf, 0x63, 0xab, 0x19, 0x8b, 0xcc, 0x26
+};
+
+/* AVX2 in OpenSSL's poly1305-x86.pl failed this with 176+32 split */
+static const u8 input17[] __initconst = {
+ 0x24, 0x8a, 0xc3, 0x10, 0x85, 0xb6, 0xc2, 0xad,
+ 0xaa, 0xa3, 0x82, 0x59, 0xa0, 0xd7, 0x19, 0x2c,
+ 0x5c, 0x35, 0xd1, 0xbb, 0x4e, 0xf3, 0x9a, 0xd9,
+ 0x4c, 0x38, 0xd1, 0xc8, 0x24, 0x79, 0xe2, 0xdd,
+ 0x21, 0x59, 0xa0, 0x77, 0x02, 0x4b, 0x05, 0x89,
+ 0xbc, 0x8a, 0x20, 0x10, 0x1b, 0x50, 0x6f, 0x0a,
+ 0x1a, 0xd0, 0xbb, 0xab, 0x76, 0xe8, 0x3a, 0x83,
+ 0xf1, 0xb9, 0x4b, 0xe6, 0xbe, 0xae, 0x74, 0xe8,
+ 0x74, 0xca, 0xb6, 0x92, 0xc5, 0x96, 0x3a, 0x75,
+ 0x43, 0x6b, 0x77, 0x61, 0x21, 0xec, 0x9f, 0x62,
+ 0x39, 0x9a, 0x3e, 0x66, 0xb2, 0xd2, 0x27, 0x07,
+ 0xda, 0xe8, 0x19, 0x33, 0xb6, 0x27, 0x7f, 0x3c,
+ 0x85, 0x16, 0xbc, 0xbe, 0x26, 0xdb, 0xbd, 0x86,
+ 0xf3, 0x73, 0x10, 0x3d, 0x7c, 0xf4, 0xca, 0xd1,
+ 0x88, 0x8c, 0x95, 0x21, 0x18, 0xfb, 0xfb, 0xd0,
+ 0xd7, 0xb4, 0xbe, 0xdc, 0x4a, 0xe4, 0x93, 0x6a,
+ 0xff, 0x91, 0x15, 0x7e, 0x7a, 0xa4, 0x7c, 0x54,
+ 0x44, 0x2e, 0xa7, 0x8d, 0x6a, 0xc2, 0x51, 0xd3,
+ 0x24, 0xa0, 0xfb, 0xe4, 0x9d, 0x89, 0xcc, 0x35,
+ 0x21, 0xb6, 0x6d, 0x16, 0xe9, 0xc6, 0x6a, 0x37,
+ 0x09, 0x89, 0x4e, 0x4e, 0xb0, 0xa4, 0xee, 0xdc,
+ 0x4a, 0xe1, 0x94, 0x68, 0xe6, 0x6b, 0x81, 0xf2,
+ 0x71, 0x35, 0x1b, 0x1d, 0x92, 0x1e, 0xa5, 0x51,
+ 0x04, 0x7a, 0xbc, 0xc6, 0xb8, 0x7a, 0x90, 0x1f,
+ 0xde, 0x7d, 0xb7, 0x9f, 0xa1, 0x81, 0x8c, 0x11,
+ 0x33, 0x6d, 0xbc, 0x07, 0x24, 0x4a, 0x40, 0xeb
+};
+static const u8 output17[] __initconst = {
+ 0xbc, 0x93, 0x9b, 0xc5, 0x28, 0x14, 0x80, 0xfa,
+ 0x99, 0xc6, 0xd6, 0x8c, 0x25, 0x8e, 0xc4, 0x2f
+};
+static const u8 key17[] __initconst = {
+ 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+ 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+
+/* test vectors from Google */
+static const u8 input18[] __initconst = { };
+static const u8 output18[] __initconst = {
+ 0x47, 0x10, 0x13, 0x0e, 0x9f, 0x6f, 0xea, 0x8d,
+ 0x72, 0x29, 0x38, 0x50, 0xa6, 0x67, 0xd8, 0x6c
+};
+static const u8 key18[] __initconst = {
+ 0xc8, 0xaf, 0xaa, 0xc3, 0x31, 0xee, 0x37, 0x2c,
+ 0xd6, 0x08, 0x2d, 0xe1, 0x34, 0x94, 0x3b, 0x17,
+ 0x47, 0x10, 0x13, 0x0e, 0x9f, 0x6f, 0xea, 0x8d,
+ 0x72, 0x29, 0x38, 0x50, 0xa6, 0x67, 0xd8, 0x6c
+};
+
+static const u8 input19[] __initconst = {
+ 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f,
+ 0x72, 0x6c, 0x64, 0x21
+};
+static const u8 output19[] __initconst = {
+ 0xa6, 0xf7, 0x45, 0x00, 0x8f, 0x81, 0xc9, 0x16,
+ 0xa2, 0x0d, 0xcc, 0x74, 0xee, 0xf2, 0xb2, 0xf0
+};
+static const u8 key19[] __initconst = {
+ 0x74, 0x68, 0x69, 0x73, 0x20, 0x69, 0x73, 0x20,
+ 0x33, 0x32, 0x2d, 0x62, 0x79, 0x74, 0x65, 0x20,
+ 0x6b, 0x65, 0x79, 0x20, 0x66, 0x6f, 0x72, 0x20,
+ 0x50, 0x6f, 0x6c, 0x79, 0x31, 0x33, 0x30, 0x35
+};
+
+static const u8 input20[] __initconst = {
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+static const u8 output20[] __initconst = {
+ 0x49, 0xec, 0x78, 0x09, 0x0e, 0x48, 0x1e, 0xc6,
+ 0xc2, 0x6b, 0x33, 0xb9, 0x1c, 0xcc, 0x03, 0x07
+};
+static const u8 key20[] __initconst = {
+ 0x74, 0x68, 0x69, 0x73, 0x20, 0x69, 0x73, 0x20,
+ 0x33, 0x32, 0x2d, 0x62, 0x79, 0x74, 0x65, 0x20,
+ 0x6b, 0x65, 0x79, 0x20, 0x66, 0x6f, 0x72, 0x20,
+ 0x50, 0x6f, 0x6c, 0x79, 0x31, 0x33, 0x30, 0x35
+};
+
+static const u8 input21[] __initconst = {
+ 0x89, 0xda, 0xb8, 0x0b, 0x77, 0x17, 0xc1, 0xdb,
+ 0x5d, 0xb4, 0x37, 0x86, 0x0a, 0x3f, 0x70, 0x21,
+ 0x8e, 0x93, 0xe1, 0xb8, 0xf4, 0x61, 0xfb, 0x67,
+ 0x7f, 0x16, 0xf3, 0x5f, 0x6f, 0x87, 0xe2, 0xa9,
+ 0x1c, 0x99, 0xbc, 0x3a, 0x47, 0xac, 0xe4, 0x76,
+ 0x40, 0xcc, 0x95, 0xc3, 0x45, 0xbe, 0x5e, 0xcc,
+ 0xa5, 0xa3, 0x52, 0x3c, 0x35, 0xcc, 0x01, 0x89,
+ 0x3a, 0xf0, 0xb6, 0x4a, 0x62, 0x03, 0x34, 0x27,
+ 0x03, 0x72, 0xec, 0x12, 0x48, 0x2d, 0x1b, 0x1e,
+ 0x36, 0x35, 0x61, 0x69, 0x8a, 0x57, 0x8b, 0x35,
+ 0x98, 0x03, 0x49, 0x5b, 0xb4, 0xe2, 0xef, 0x19,
+ 0x30, 0xb1, 0x7a, 0x51, 0x90, 0xb5, 0x80, 0xf1,
+ 0x41, 0x30, 0x0d, 0xf3, 0x0a, 0xdb, 0xec, 0xa2,
+ 0x8f, 0x64, 0x27, 0xa8, 0xbc, 0x1a, 0x99, 0x9f,
+ 0xd5, 0x1c, 0x55, 0x4a, 0x01, 0x7d, 0x09, 0x5d,
+ 0x8c, 0x3e, 0x31, 0x27, 0xda, 0xf9, 0xf5, 0x95
+};
+static const u8 output21[] __initconst = {
+ 0xc8, 0x5d, 0x15, 0xed, 0x44, 0xc3, 0x78, 0xd6,
+ 0xb0, 0x0e, 0x23, 0x06, 0x4c, 0x7b, 0xcd, 0x51
+};
+static const u8 key21[] __initconst = {
+ 0x2d, 0x77, 0x3b, 0xe3, 0x7a, 0xdb, 0x1e, 0x4d,
+ 0x68, 0x3b, 0xf0, 0x07, 0x5e, 0x79, 0xc4, 0xee,
+ 0x03, 0x79, 0x18, 0x53, 0x5a, 0x7f, 0x99, 0xcc,
+ 0xb7, 0x04, 0x0f, 0xb5, 0xf5, 0xf4, 0x3a, 0xea
+};
+
+static const u8 input22[] __initconst = {
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x0b,
+ 0x17, 0x03, 0x03, 0x02, 0x00, 0x00, 0x00, 0x00,
+ 0x06, 0xdb, 0x1f, 0x1f, 0x36, 0x8d, 0x69, 0x6a,
+ 0x81, 0x0a, 0x34, 0x9c, 0x0c, 0x71, 0x4c, 0x9a,
+ 0x5e, 0x78, 0x50, 0xc2, 0x40, 0x7d, 0x72, 0x1a,
+ 0xcd, 0xed, 0x95, 0xe0, 0x18, 0xd7, 0xa8, 0x52,
+ 0x66, 0xa6, 0xe1, 0x28, 0x9c, 0xdb, 0x4a, 0xeb,
+ 0x18, 0xda, 0x5a, 0xc8, 0xa2, 0xb0, 0x02, 0x6d,
+ 0x24, 0xa5, 0x9a, 0xd4, 0x85, 0x22, 0x7f, 0x3e,
+ 0xae, 0xdb, 0xb2, 0xe7, 0xe3, 0x5e, 0x1c, 0x66,
+ 0xcd, 0x60, 0xf9, 0xab, 0xf7, 0x16, 0xdc, 0xc9,
+ 0xac, 0x42, 0x68, 0x2d, 0xd7, 0xda, 0xb2, 0x87,
+ 0xa7, 0x02, 0x4c, 0x4e, 0xef, 0xc3, 0x21, 0xcc,
+ 0x05, 0x74, 0xe1, 0x67, 0x93, 0xe3, 0x7c, 0xec,
+ 0x03, 0xc5, 0xbd, 0xa4, 0x2b, 0x54, 0xc1, 0x14,
+ 0xa8, 0x0b, 0x57, 0xaf, 0x26, 0x41, 0x6c, 0x7b,
+ 0xe7, 0x42, 0x00, 0x5e, 0x20, 0x85, 0x5c, 0x73,
+ 0xe2, 0x1d, 0xc8, 0xe2, 0xed, 0xc9, 0xd4, 0x35,
+ 0xcb, 0x6f, 0x60, 0x59, 0x28, 0x00, 0x11, 0xc2,
+ 0x70, 0xb7, 0x15, 0x70, 0x05, 0x1c, 0x1c, 0x9b,
+ 0x30, 0x52, 0x12, 0x66, 0x20, 0xbc, 0x1e, 0x27,
+ 0x30, 0xfa, 0x06, 0x6c, 0x7a, 0x50, 0x9d, 0x53,
+ 0xc6, 0x0e, 0x5a, 0xe1, 0xb4, 0x0a, 0xa6, 0xe3,
+ 0x9e, 0x49, 0x66, 0x92, 0x28, 0xc9, 0x0e, 0xec,
+ 0xb4, 0xa5, 0x0d, 0xb3, 0x2a, 0x50, 0xbc, 0x49,
+ 0xe9, 0x0b, 0x4f, 0x4b, 0x35, 0x9a, 0x1d, 0xfd,
+ 0x11, 0x74, 0x9c, 0xd3, 0x86, 0x7f, 0xcf, 0x2f,
+ 0xb7, 0xbb, 0x6c, 0xd4, 0x73, 0x8f, 0x6a, 0x4a,
+ 0xd6, 0xf7, 0xca, 0x50, 0x58, 0xf7, 0x61, 0x88,
+ 0x45, 0xaf, 0x9f, 0x02, 0x0f, 0x6c, 0x3b, 0x96,
+ 0x7b, 0x8f, 0x4c, 0xd4, 0xa9, 0x1e, 0x28, 0x13,
+ 0xb5, 0x07, 0xae, 0x66, 0xf2, 0xd3, 0x5c, 0x18,
+ 0x28, 0x4f, 0x72, 0x92, 0x18, 0x60, 0x62, 0xe1,
+ 0x0f, 0xd5, 0x51, 0x0d, 0x18, 0x77, 0x53, 0x51,
+ 0xef, 0x33, 0x4e, 0x76, 0x34, 0xab, 0x47, 0x43,
+ 0xf5, 0xb6, 0x8f, 0x49, 0xad, 0xca, 0xb3, 0x84,
+ 0xd3, 0xfd, 0x75, 0xf7, 0x39, 0x0f, 0x40, 0x06,
+ 0xef, 0x2a, 0x29, 0x5c, 0x8c, 0x7a, 0x07, 0x6a,
+ 0xd5, 0x45, 0x46, 0xcd, 0x25, 0xd2, 0x10, 0x7f,
+ 0xbe, 0x14, 0x36, 0xc8, 0x40, 0x92, 0x4a, 0xae,
+ 0xbe, 0x5b, 0x37, 0x08, 0x93, 0xcd, 0x63, 0xd1,
+ 0x32, 0x5b, 0x86, 0x16, 0xfc, 0x48, 0x10, 0x88,
+ 0x6b, 0xc1, 0x52, 0xc5, 0x32, 0x21, 0xb6, 0xdf,
+ 0x37, 0x31, 0x19, 0x39, 0x32, 0x55, 0xee, 0x72,
+ 0xbc, 0xaa, 0x88, 0x01, 0x74, 0xf1, 0x71, 0x7f,
+ 0x91, 0x84, 0xfa, 0x91, 0x64, 0x6f, 0x17, 0xa2,
+ 0x4a, 0xc5, 0x5d, 0x16, 0xbf, 0xdd, 0xca, 0x95,
+ 0x81, 0xa9, 0x2e, 0xda, 0x47, 0x92, 0x01, 0xf0,
+ 0xed, 0xbf, 0x63, 0x36, 0x00, 0xd6, 0x06, 0x6d,
+ 0x1a, 0xb3, 0x6d, 0x5d, 0x24, 0x15, 0xd7, 0x13,
+ 0x51, 0xbb, 0xcd, 0x60, 0x8a, 0x25, 0x10, 0x8d,
+ 0x25, 0x64, 0x19, 0x92, 0xc1, 0xf2, 0x6c, 0x53,
+ 0x1c, 0xf9, 0xf9, 0x02, 0x03, 0xbc, 0x4c, 0xc1,
+ 0x9f, 0x59, 0x27, 0xd8, 0x34, 0xb0, 0xa4, 0x71,
+ 0x16, 0xd3, 0x88, 0x4b, 0xbb, 0x16, 0x4b, 0x8e,
+ 0xc8, 0x83, 0xd1, 0xac, 0x83, 0x2e, 0x56, 0xb3,
+ 0x91, 0x8a, 0x98, 0x60, 0x1a, 0x08, 0xd1, 0x71,
+ 0x88, 0x15, 0x41, 0xd5, 0x94, 0xdb, 0x39, 0x9c,
+ 0x6a, 0xe6, 0x15, 0x12, 0x21, 0x74, 0x5a, 0xec,
+ 0x81, 0x4c, 0x45, 0xb0, 0xb0, 0x5b, 0x56, 0x54,
+ 0x36, 0xfd, 0x6f, 0x13, 0x7a, 0xa1, 0x0a, 0x0c,
+ 0x0b, 0x64, 0x37, 0x61, 0xdb, 0xd6, 0xf9, 0xa9,
+ 0xdc, 0xb9, 0x9b, 0x1a, 0x6e, 0x69, 0x08, 0x54,
+ 0xce, 0x07, 0x69, 0xcd, 0xe3, 0x97, 0x61, 0xd8,
+ 0x2f, 0xcd, 0xec, 0x15, 0xf0, 0xd9, 0x2d, 0x7d,
+ 0x8e, 0x94, 0xad, 0xe8, 0xeb, 0x83, 0xfb, 0xe0
+};
+static const u8 output22[] __initconst = {
+ 0x26, 0x37, 0x40, 0x8f, 0xe1, 0x30, 0x86, 0xea,
+ 0x73, 0xf9, 0x71, 0xe3, 0x42, 0x5e, 0x28, 0x20
+};
+static const u8 key22[] __initconst = {
+ 0x99, 0xe5, 0x82, 0x2d, 0xd4, 0x17, 0x3c, 0x99,
+ 0x5e, 0x3d, 0xae, 0x0d, 0xde, 0xfb, 0x97, 0x74,
+ 0x3f, 0xde, 0x3b, 0x08, 0x01, 0x34, 0xb3, 0x9f,
+ 0x76, 0xe9, 0xbf, 0x8d, 0x0e, 0x88, 0xd5, 0x46
+};
+
+/* test vectors from Hanno Böck */
+static const u8 input23[] __initconst = {
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0x80, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xce, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xc5,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xe3, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xac, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xe6,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0x00, 0x00, 0x00,
+ 0xaf, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc,
+ 0xcc, 0xcc, 0xff, 0xff, 0xff, 0xf5, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0xff, 0xff, 0xff, 0xe7, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x71, 0x92, 0x05, 0xa8, 0x52, 0x1d,
+ 0xfc
+};
+static const u8 output23[] __initconst = {
+ 0x85, 0x59, 0xb8, 0x76, 0xec, 0xee, 0xd6, 0x6e,
+ 0xb3, 0x77, 0x98, 0xc0, 0x45, 0x7b, 0xaf, 0xf9
+};
+static const u8 key23[] __initconst = {
+ 0x7f, 0x1b, 0x02, 0x64, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc
+};
+
+static const u8 input24[] __initconst = {
+ 0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa,
+ 0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa,
+ 0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa,
+ 0xaa, 0xaa, 0xaa, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x80, 0x02, 0x64
+};
+static const u8 output24[] __initconst = {
+ 0x00, 0xbd, 0x12, 0x58, 0x97, 0x8e, 0x20, 0x54,
+ 0x44, 0xc9, 0xaa, 0xaa, 0x82, 0x00, 0x6f, 0xed
+};
+static const u8 key24[] __initconst = {
+ 0xe0, 0x00, 0x16, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa,
+ 0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa
+};
+
+static const u8 input25[] __initconst = {
+ 0x02, 0xfc
+};
+static const u8 output25[] __initconst = {
+ 0x06, 0x12, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c,
+ 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c
+};
+static const u8 key25[] __initconst = {
+ 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c,
+ 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c,
+ 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c,
+ 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c
+};
+
+static const u8 input26[] __initconst = {
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7a, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x5c, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x6e, 0x7b, 0x00, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7a, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x5c,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b, 0x7b,
+ 0x7b, 0x6e, 0x7b, 0x00, 0x13, 0x00, 0x00, 0x00,
+ 0x00, 0xb3, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0xf2, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x20, 0x00, 0xef, 0xff, 0x00,
+ 0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00,
+ 0x00, 0x00, 0x09, 0x00, 0x00, 0x00, 0x64, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x13, 0x00, 0x00, 0x00, 0x00,
+ 0xb3, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xf2,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x20, 0x00, 0xef, 0xff, 0x00, 0x09,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x7a, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00,
+ 0x00, 0x09, 0x00, 0x00, 0x00, 0x64, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xfc
+};
+static const u8 output26[] __initconst = {
+ 0x33, 0x20, 0x5b, 0xbf, 0x9e, 0x9f, 0x8f, 0x72,
+ 0x12, 0xab, 0x9e, 0x2a, 0xb9, 0xb7, 0xe4, 0xa5
+};
+static const u8 key26[] __initconst = {
+ 0x00, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x1e, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x7b, 0x7b
+};
+
+static const u8 input27[] __initconst = {
+ 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77,
+ 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77,
+ 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77,
+ 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77,
+ 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77,
+ 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77,
+ 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77,
+ 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77,
+ 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77,
+ 0x77, 0x77, 0x77, 0x77, 0xff, 0xff, 0xff, 0xe9,
+ 0xe9, 0xac, 0xac, 0xac, 0xac, 0xac, 0xac, 0xac,
+ 0xac, 0xac, 0xac, 0xac, 0x00, 0x00, 0xac, 0xac,
+ 0xec, 0x01, 0x00, 0xac, 0xac, 0xac, 0x2c, 0xac,
+ 0xa2, 0xac, 0xac, 0xac, 0xac, 0xac, 0xac, 0xac,
+ 0xac, 0xac, 0xac, 0xac, 0x64, 0xf2
+};
+static const u8 output27[] __initconst = {
+ 0x02, 0xee, 0x7c, 0x8c, 0x54, 0x6d, 0xde, 0xb1,
+ 0xa4, 0x67, 0xe4, 0xc3, 0x98, 0x11, 0x58, 0xb9
+};
+static const u8 key27[] __initconst = {
+ 0x00, 0x00, 0x00, 0x7f, 0x00, 0x00, 0x00, 0x7f,
+ 0x01, 0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0xcf, 0x77, 0x77, 0x77, 0x77, 0x77,
+ 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77
+};
+
+/* nacl */
+static const u8 input28[] __initconst = {
+ 0x8e, 0x99, 0x3b, 0x9f, 0x48, 0x68, 0x12, 0x73,
+ 0xc2, 0x96, 0x50, 0xba, 0x32, 0xfc, 0x76, 0xce,
+ 0x48, 0x33, 0x2e, 0xa7, 0x16, 0x4d, 0x96, 0xa4,
+ 0x47, 0x6f, 0xb8, 0xc5, 0x31, 0xa1, 0x18, 0x6a,
+ 0xc0, 0xdf, 0xc1, 0x7c, 0x98, 0xdc, 0xe8, 0x7b,
+ 0x4d, 0xa7, 0xf0, 0x11, 0xec, 0x48, 0xc9, 0x72,
+ 0x71, 0xd2, 0xc2, 0x0f, 0x9b, 0x92, 0x8f, 0xe2,
+ 0x27, 0x0d, 0x6f, 0xb8, 0x63, 0xd5, 0x17, 0x38,
+ 0xb4, 0x8e, 0xee, 0xe3, 0x14, 0xa7, 0xcc, 0x8a,
+ 0xb9, 0x32, 0x16, 0x45, 0x48, 0xe5, 0x26, 0xae,
+ 0x90, 0x22, 0x43, 0x68, 0x51, 0x7a, 0xcf, 0xea,
+ 0xbd, 0x6b, 0xb3, 0x73, 0x2b, 0xc0, 0xe9, 0xda,
+ 0x99, 0x83, 0x2b, 0x61, 0xca, 0x01, 0xb6, 0xde,
+ 0x56, 0x24, 0x4a, 0x9e, 0x88, 0xd5, 0xf9, 0xb3,
+ 0x79, 0x73, 0xf6, 0x22, 0xa4, 0x3d, 0x14, 0xa6,
+ 0x59, 0x9b, 0x1f, 0x65, 0x4c, 0xb4, 0x5a, 0x74,
+ 0xe3, 0x55, 0xa5
+};
+static const u8 output28[] __initconst = {
+ 0xf3, 0xff, 0xc7, 0x70, 0x3f, 0x94, 0x00, 0xe5,
+ 0x2a, 0x7d, 0xfb, 0x4b, 0x3d, 0x33, 0x05, 0xd9
+};
+static const u8 key28[] __initconst = {
+ 0xee, 0xa6, 0xa7, 0x25, 0x1c, 0x1e, 0x72, 0x91,
+ 0x6d, 0x11, 0xc2, 0xcb, 0x21, 0x4d, 0x3c, 0x25,
+ 0x25, 0x39, 0x12, 0x1d, 0x8e, 0x23, 0x4e, 0x65,
+ 0x2d, 0x65, 0x1f, 0xa4, 0xc8, 0xcf, 0xf8, 0x80
+};
+
+/* wrap 2^130-5 */
+static const u8 input29[] __initconst = {
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+};
+static const u8 output29[] __initconst = {
+ 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+static const u8 key29[] __initconst = {
+ 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+
+/* wrap 2^128 */
+static const u8 input30[] __initconst = {
+ 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+static const u8 output30[] __initconst = {
+ 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+static const u8 key30[] __initconst = {
+ 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+};
+
+/* limb carry */
+static const u8 input31[] __initconst = {
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xf0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0x11, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+static const u8 output31[] __initconst = {
+ 0x05, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+static const u8 key31[] __initconst = {
+ 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+
+/* 2^130-5 */
+static const u8 input32[] __initconst = {
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xfb, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe,
+ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe,
+ 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01,
+ 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
+};
+static const u8 output32[] __initconst = {
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+static const u8 key32[] __initconst = {
+ 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+
+/* 2^130-6 */
+static const u8 input33[] __initconst = {
+ 0xfd, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+};
+static const u8 output33[] __initconst = {
+ 0xfa, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+};
+static const u8 key33[] __initconst = {
+ 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+
+/* 5*H+L reduction intermediate */
+static const u8 input34[] __initconst = {
+ 0xe3, 0x35, 0x94, 0xd7, 0x50, 0x5e, 0x43, 0xb9,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x33, 0x94, 0xd7, 0x50, 0x5e, 0x43, 0x79, 0xcd,
+ 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+static const u8 output34[] __initconst = {
+ 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x55, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+static const u8 key34[] __initconst = {
+ 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+
+/* 5*H+L reduction final */
+static const u8 input35[] __initconst = {
+ 0xe3, 0x35, 0x94, 0xd7, 0x50, 0x5e, 0x43, 0xb9,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x33, 0x94, 0xd7, 0x50, 0x5e, 0x43, 0x79, 0xcd,
+ 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+static const u8 output35[] __initconst = {
+ 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+static const u8 key35[] __initconst = {
+ 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+};
+
+static const struct poly1305_testvec poly1305_testvecs[] __initconst = {
+ { input01, output01, key01, sizeof(input01) },
+ { input02, output02, key02, sizeof(input02) },
+ { input03, output03, key03, sizeof(input03) },
+ { input04, output04, key04, sizeof(input04) },
+ { input05, output05, key05, sizeof(input05) },
+ { input06, output06, key06, sizeof(input06) },
+ { input07, output07, key07, sizeof(input07) },
+ { input08, output08, key08, sizeof(input08) },
+ { input09, output09, key09, sizeof(input09) },
+ { input10, output10, key10, sizeof(input10) },
+ { input11, output11, key11, sizeof(input11) },
+ { input12, output12, key12, sizeof(input12) },
+ { input13, output13, key13, sizeof(input13) },
+ { input14, output14, key14, sizeof(input14) },
+ { input15, output15, key15, sizeof(input15) },
+ { input16, output16, key16, sizeof(input16) },
+ { input17, output17, key17, sizeof(input17) },
+ { input18, output18, key18, sizeof(input18) },
+ { input19, output19, key19, sizeof(input19) },
+ { input20, output20, key20, sizeof(input20) },
+ { input21, output21, key21, sizeof(input21) },
+ { input22, output22, key22, sizeof(input22) },
+ { input23, output23, key23, sizeof(input23) },
+ { input24, output24, key24, sizeof(input24) },
+ { input25, output25, key25, sizeof(input25) },
+ { input26, output26, key26, sizeof(input26) },
+ { input27, output27, key27, sizeof(input27) },
+ { input28, output28, key28, sizeof(input28) },
+ { input29, output29, key29, sizeof(input29) },
+ { input30, output30, key30, sizeof(input30) },
+ { input31, output31, key31, sizeof(input31) },
+ { input32, output32, key32, sizeof(input32) },
+ { input33, output33, key33, sizeof(input33) },
+ { input34, output34, key34, sizeof(input34) },
+ { input35, output35, key35, sizeof(input35) }
+};
+
+static bool __init poly1305_selftest(void)
+{
+ simd_context_t simd_context;
+ bool success = true;
+ size_t i, j;
+
+ simd_get(&simd_context);
+ for (i = 0; i < ARRAY_SIZE(poly1305_testvecs); ++i) {
+ struct poly1305_ctx poly1305;
+ u8 out[POLY1305_MAC_SIZE];
+
+ memset(out, 0, sizeof(out));
+ memset(&poly1305, 0, sizeof(poly1305));
+ poly1305_init(&poly1305, poly1305_testvecs[i].key);
+ poly1305_update(&poly1305, poly1305_testvecs[i].input,
+ poly1305_testvecs[i].ilen, &simd_context);
+ poly1305_final(&poly1305, out, &simd_context);
+ if (memcmp(out, poly1305_testvecs[i].output,
+ POLY1305_MAC_SIZE)) {
+ pr_err("poly1305 self-test %zu: FAIL\n", i + 1);
+ success = false;
+ }
+ simd_relax(&simd_context);
+
+ if (poly1305_testvecs[i].ilen <= 1)
+ continue;
+
+ for (j = 1; j < poly1305_testvecs[i].ilen - 1; ++j) {
+ memset(out, 0, sizeof(out));
+ memset(&poly1305, 0, sizeof(poly1305));
+ poly1305_init(&poly1305, poly1305_testvecs[i].key);
+ poly1305_update(&poly1305, poly1305_testvecs[i].input,
+ j, &simd_context);
+ poly1305_update(&poly1305,
+ poly1305_testvecs[i].input + j,
+ poly1305_testvecs[i].ilen - j,
+ &simd_context);
+ poly1305_final(&poly1305, out, &simd_context);
+ if (memcmp(out, poly1305_testvecs[i].output,
+ POLY1305_MAC_SIZE)) {
+ pr_err("poly1305 self-test %zu (split %zu): FAIL\n",
+ i + 1, j);
+ success = false;
+ }
+
+ memset(out, 0, sizeof(out));
+ memset(&poly1305, 0, sizeof(poly1305));
+ poly1305_init(&poly1305, poly1305_testvecs[i].key);
+ poly1305_update(&poly1305, poly1305_testvecs[i].input,
+ j, &simd_context);
+ poly1305_update(&poly1305,
+ poly1305_testvecs[i].input + j,
+ poly1305_testvecs[i].ilen - j,
+ DONT_USE_SIMD);
+ poly1305_final(&poly1305, out, &simd_context);
+ if (memcmp(out, poly1305_testvecs[i].output,
+ POLY1305_MAC_SIZE)) {
+ pr_err("poly1305 self-test %zu (split %zu, mixed simd): FAIL\n",
+ i + 1, j);
+ success = false;
+ }
+ simd_relax(&simd_context);
+ }
+ }
+ simd_put(&simd_context);
+
+ return success;
+}
--
2.19.1
^ permalink raw reply related
* [PATCH net-next v8 14/28] zinc: import Andy Polyakov's Poly1305 ARM and ARM64 implementations
From: Jason A. Donenfeld @ 2018-10-18 14:56 UTC (permalink / raw)
To: linux-kernel, netdev, linux-crypto, davem, gregkh
Cc: Jason A. Donenfeld, Andy Polyakov, Russell King, linux-arm-kernel,
Samuel Neves, Jean-Philippe Aumasson, Andy Lutomirski,
Andrew Morton, Linus Torvalds, kernel-hardening
In-Reply-To: <20181018145712.7538-1-Jason@zx2c4.com>
These NEON and non-NEON implementations come from Andy Polyakov's
implementation, and are included here in raw form without modification,
so that subsequent commits that fix these up for the kernel can see how
it has changed.
While this is CRYPTOGAMS code, the originating code for this happens to
be the same as OpenSSL's commit 5bb1cd2292b388263a0cc05392bb99141212aa53
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Based-on-code-from: Andy Polyakov <appro@openssl.org>
Cc: Andy Polyakov <appro@openssl.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
---
lib/zinc/poly1305/poly1305-arm-cryptogams.S | 1172 +++++++++++++++++
lib/zinc/poly1305/poly1305-arm64-cryptogams.S | 869 ++++++++++++
2 files changed, 2041 insertions(+)
create mode 100644 lib/zinc/poly1305/poly1305-arm-cryptogams.S
create mode 100644 lib/zinc/poly1305/poly1305-arm64-cryptogams.S
diff --git a/lib/zinc/poly1305/poly1305-arm-cryptogams.S b/lib/zinc/poly1305/poly1305-arm-cryptogams.S
new file mode 100644
index 000000000000..884b465030e4
--- /dev/null
+++ b/lib/zinc/poly1305/poly1305-arm-cryptogams.S
@@ -0,0 +1,1172 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+/*
+ * Copyright (C) 2006-2017 CRYPTOGAMS by <appro@openssl.org>. All Rights Reserved.
+ */
+
+#include "arm_arch.h"
+
+.text
+#if defined(__thumb2__)
+.syntax unified
+.thumb
+#else
+.code 32
+#endif
+
+.globl poly1305_emit
+.globl poly1305_blocks
+.globl poly1305_init
+.type poly1305_init,%function
+.align 5
+poly1305_init:
+.Lpoly1305_init:
+ stmdb sp!,{r4-r11}
+
+ eor r3,r3,r3
+ cmp r1,#0
+ str r3,[r0,#0] @ zero hash value
+ str r3,[r0,#4]
+ str r3,[r0,#8]
+ str r3,[r0,#12]
+ str r3,[r0,#16]
+ str r3,[r0,#36] @ is_base2_26
+ add r0,r0,#20
+
+#ifdef __thumb2__
+ it eq
+#endif
+ moveq r0,#0
+ beq .Lno_key
+
+#if __ARM_MAX_ARCH__>=7
+ adr r11,.Lpoly1305_init
+ ldr r12,.LOPENSSL_armcap
+#endif
+ ldrb r4,[r1,#0]
+ mov r10,#0x0fffffff
+ ldrb r5,[r1,#1]
+ and r3,r10,#-4 @ 0x0ffffffc
+ ldrb r6,[r1,#2]
+ ldrb r7,[r1,#3]
+ orr r4,r4,r5,lsl#8
+ ldrb r5,[r1,#4]
+ orr r4,r4,r6,lsl#16
+ ldrb r6,[r1,#5]
+ orr r4,r4,r7,lsl#24
+ ldrb r7,[r1,#6]
+ and r4,r4,r10
+
+#if __ARM_MAX_ARCH__>=7
+ ldr r12,[r11,r12] @ OPENSSL_armcap_P
+# ifdef __APPLE__
+ ldr r12,[r12]
+# endif
+#endif
+ ldrb r8,[r1,#7]
+ orr r5,r5,r6,lsl#8
+ ldrb r6,[r1,#8]
+ orr r5,r5,r7,lsl#16
+ ldrb r7,[r1,#9]
+ orr r5,r5,r8,lsl#24
+ ldrb r8,[r1,#10]
+ and r5,r5,r3
+
+#if __ARM_MAX_ARCH__>=7
+ tst r12,#ARMV7_NEON @ check for NEON
+# ifdef __APPLE__
+ adr r9,poly1305_blocks_neon
+ adr r11,poly1305_blocks
+# ifdef __thumb2__
+ it ne
+# endif
+ movne r11,r9
+ adr r12,poly1305_emit
+ adr r10,poly1305_emit_neon
+# ifdef __thumb2__
+ it ne
+# endif
+ movne r12,r10
+# else
+# ifdef __thumb2__
+ itete eq
+# endif
+ addeq r12,r11,#(poly1305_emit-.Lpoly1305_init)
+ addne r12,r11,#(poly1305_emit_neon-.Lpoly1305_init)
+ addeq r11,r11,#(poly1305_blocks-.Lpoly1305_init)
+ addne r11,r11,#(poly1305_blocks_neon-.Lpoly1305_init)
+# endif
+# ifdef __thumb2__
+ orr r12,r12,#1 @ thumb-ify address
+ orr r11,r11,#1
+# endif
+#endif
+ ldrb r9,[r1,#11]
+ orr r6,r6,r7,lsl#8
+ ldrb r7,[r1,#12]
+ orr r6,r6,r8,lsl#16
+ ldrb r8,[r1,#13]
+ orr r6,r6,r9,lsl#24
+ ldrb r9,[r1,#14]
+ and r6,r6,r3
+
+ ldrb r10,[r1,#15]
+ orr r7,r7,r8,lsl#8
+ str r4,[r0,#0]
+ orr r7,r7,r9,lsl#16
+ str r5,[r0,#4]
+ orr r7,r7,r10,lsl#24
+ str r6,[r0,#8]
+ and r7,r7,r3
+ str r7,[r0,#12]
+#if __ARM_MAX_ARCH__>=7
+ stmia r2,{r11,r12} @ fill functions table
+ mov r0,#1
+#else
+ mov r0,#0
+#endif
+.Lno_key:
+ ldmia sp!,{r4-r11}
+#if __ARM_ARCH__>=5
+ bx lr @ bx lr
+#else
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ .word 0xe12fff1e @ interoperable with Thumb ISA:-)
+#endif
+.size poly1305_init,.-poly1305_init
+.type poly1305_blocks,%function
+.align 5
+poly1305_blocks:
+.Lpoly1305_blocks:
+ stmdb sp!,{r3-r11,lr}
+
+ ands r2,r2,#-16
+ beq .Lno_data
+
+ cmp r3,#0
+ add r2,r2,r1 @ end pointer
+ sub sp,sp,#32
+
+ ldmia r0,{r4-r12} @ load context
+
+ str r0,[sp,#12] @ offload stuff
+ mov lr,r1
+ str r2,[sp,#16]
+ str r10,[sp,#20]
+ str r11,[sp,#24]
+ str r12,[sp,#28]
+ b .Loop
+
+.Loop:
+#if __ARM_ARCH__<7
+ ldrb r0,[lr],#16 @ load input
+# ifdef __thumb2__
+ it hi
+# endif
+ addhi r8,r8,#1 @ 1<<128
+ ldrb r1,[lr,#-15]
+ ldrb r2,[lr,#-14]
+ ldrb r3,[lr,#-13]
+ orr r1,r0,r1,lsl#8
+ ldrb r0,[lr,#-12]
+ orr r2,r1,r2,lsl#16
+ ldrb r1,[lr,#-11]
+ orr r3,r2,r3,lsl#24
+ ldrb r2,[lr,#-10]
+ adds r4,r4,r3 @ accumulate input
+
+ ldrb r3,[lr,#-9]
+ orr r1,r0,r1,lsl#8
+ ldrb r0,[lr,#-8]
+ orr r2,r1,r2,lsl#16
+ ldrb r1,[lr,#-7]
+ orr r3,r2,r3,lsl#24
+ ldrb r2,[lr,#-6]
+ adcs r5,r5,r3
+
+ ldrb r3,[lr,#-5]
+ orr r1,r0,r1,lsl#8
+ ldrb r0,[lr,#-4]
+ orr r2,r1,r2,lsl#16
+ ldrb r1,[lr,#-3]
+ orr r3,r2,r3,lsl#24
+ ldrb r2,[lr,#-2]
+ adcs r6,r6,r3
+
+ ldrb r3,[lr,#-1]
+ orr r1,r0,r1,lsl#8
+ str lr,[sp,#8] @ offload input pointer
+ orr r2,r1,r2,lsl#16
+ add r10,r10,r10,lsr#2
+ orr r3,r2,r3,lsl#24
+#else
+ ldr r0,[lr],#16 @ load input
+# ifdef __thumb2__
+ it hi
+# endif
+ addhi r8,r8,#1 @ padbit
+ ldr r1,[lr,#-12]
+ ldr r2,[lr,#-8]
+ ldr r3,[lr,#-4]
+# ifdef __ARMEB__
+ rev r0,r0
+ rev r1,r1
+ rev r2,r2
+ rev r3,r3
+# endif
+ adds r4,r4,r0 @ accumulate input
+ str lr,[sp,#8] @ offload input pointer
+ adcs r5,r5,r1
+ add r10,r10,r10,lsr#2
+ adcs r6,r6,r2
+#endif
+ add r11,r11,r11,lsr#2
+ adcs r7,r7,r3
+ add r12,r12,r12,lsr#2
+
+ umull r2,r3,r5,r9
+ adc r8,r8,#0
+ umull r0,r1,r4,r9
+ umlal r2,r3,r8,r10
+ umlal r0,r1,r7,r10
+ ldr r10,[sp,#20] @ reload r10
+ umlal r2,r3,r6,r12
+ umlal r0,r1,r5,r12
+ umlal r2,r3,r7,r11
+ umlal r0,r1,r6,r11
+ umlal r2,r3,r4,r10
+ str r0,[sp,#0] @ future r4
+ mul r0,r11,r8
+ ldr r11,[sp,#24] @ reload r11
+ adds r2,r2,r1 @ d1+=d0>>32
+ eor r1,r1,r1
+ adc lr,r3,#0 @ future r6
+ str r2,[sp,#4] @ future r5
+
+ mul r2,r12,r8
+ eor r3,r3,r3
+ umlal r0,r1,r7,r12
+ ldr r12,[sp,#28] @ reload r12
+ umlal r2,r3,r7,r9
+ umlal r0,r1,r6,r9
+ umlal r2,r3,r6,r10
+ umlal r0,r1,r5,r10
+ umlal r2,r3,r5,r11
+ umlal r0,r1,r4,r11
+ umlal r2,r3,r4,r12
+ ldr r4,[sp,#0]
+ mul r8,r9,r8
+ ldr r5,[sp,#4]
+
+ adds r6,lr,r0 @ d2+=d1>>32
+ ldr lr,[sp,#8] @ reload input pointer
+ adc r1,r1,#0
+ adds r7,r2,r1 @ d3+=d2>>32
+ ldr r0,[sp,#16] @ reload end pointer
+ adc r3,r3,#0
+ add r8,r8,r3 @ h4+=d3>>32
+
+ and r1,r8,#-4
+ and r8,r8,#3
+ add r1,r1,r1,lsr#2 @ *=5
+ adds r4,r4,r1
+ adcs r5,r5,#0
+ adcs r6,r6,#0
+ adcs r7,r7,#0
+ adc r8,r8,#0
+
+ cmp r0,lr @ done yet?
+ bhi .Loop
+
+ ldr r0,[sp,#12]
+ add sp,sp,#32
+ stmia r0,{r4-r8} @ store the result
+
+.Lno_data:
+#if __ARM_ARCH__>=5
+ ldmia sp!,{r3-r11,pc}
+#else
+ ldmia sp!,{r3-r11,lr}
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ .word 0xe12fff1e @ interoperable with Thumb ISA:-)
+#endif
+.size poly1305_blocks,.-poly1305_blocks
+.type poly1305_emit,%function
+.align 5
+poly1305_emit:
+ stmdb sp!,{r4-r11}
+.Lpoly1305_emit_enter:
+
+ ldmia r0,{r3-r7}
+ adds r8,r3,#5 @ compare to modulus
+ adcs r9,r4,#0
+ adcs r10,r5,#0
+ adcs r11,r6,#0
+ adc r7,r7,#0
+ tst r7,#4 @ did it carry/borrow?
+
+#ifdef __thumb2__
+ it ne
+#endif
+ movne r3,r8
+ ldr r8,[r2,#0]
+#ifdef __thumb2__
+ it ne
+#endif
+ movne r4,r9
+ ldr r9,[r2,#4]
+#ifdef __thumb2__
+ it ne
+#endif
+ movne r5,r10
+ ldr r10,[r2,#8]
+#ifdef __thumb2__
+ it ne
+#endif
+ movne r6,r11
+ ldr r11,[r2,#12]
+
+ adds r3,r3,r8
+ adcs r4,r4,r9
+ adcs r5,r5,r10
+ adc r6,r6,r11
+
+#if __ARM_ARCH__>=7
+# ifdef __ARMEB__
+ rev r3,r3
+ rev r4,r4
+ rev r5,r5
+ rev r6,r6
+# endif
+ str r3,[r1,#0]
+ str r4,[r1,#4]
+ str r5,[r1,#8]
+ str r6,[r1,#12]
+#else
+ strb r3,[r1,#0]
+ mov r3,r3,lsr#8
+ strb r4,[r1,#4]
+ mov r4,r4,lsr#8
+ strb r5,[r1,#8]
+ mov r5,r5,lsr#8
+ strb r6,[r1,#12]
+ mov r6,r6,lsr#8
+
+ strb r3,[r1,#1]
+ mov r3,r3,lsr#8
+ strb r4,[r1,#5]
+ mov r4,r4,lsr#8
+ strb r5,[r1,#9]
+ mov r5,r5,lsr#8
+ strb r6,[r1,#13]
+ mov r6,r6,lsr#8
+
+ strb r3,[r1,#2]
+ mov r3,r3,lsr#8
+ strb r4,[r1,#6]
+ mov r4,r4,lsr#8
+ strb r5,[r1,#10]
+ mov r5,r5,lsr#8
+ strb r6,[r1,#14]
+ mov r6,r6,lsr#8
+
+ strb r3,[r1,#3]
+ strb r4,[r1,#7]
+ strb r5,[r1,#11]
+ strb r6,[r1,#15]
+#endif
+ ldmia sp!,{r4-r11}
+#if __ARM_ARCH__>=5
+ bx lr @ bx lr
+#else
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ .word 0xe12fff1e @ interoperable with Thumb ISA:-)
+#endif
+.size poly1305_emit,.-poly1305_emit
+#if __ARM_MAX_ARCH__>=7
+.fpu neon
+
+.type poly1305_init_neon,%function
+.align 5
+poly1305_init_neon:
+ ldr r4,[r0,#20] @ load key base 2^32
+ ldr r5,[r0,#24]
+ ldr r6,[r0,#28]
+ ldr r7,[r0,#32]
+
+ and r2,r4,#0x03ffffff @ base 2^32 -> base 2^26
+ mov r3,r4,lsr#26
+ mov r4,r5,lsr#20
+ orr r3,r3,r5,lsl#6
+ mov r5,r6,lsr#14
+ orr r4,r4,r6,lsl#12
+ mov r6,r7,lsr#8
+ orr r5,r5,r7,lsl#18
+ and r3,r3,#0x03ffffff
+ and r4,r4,#0x03ffffff
+ and r5,r5,#0x03ffffff
+
+ vdup.32 d0,r2 @ r^1 in both lanes
+ add r2,r3,r3,lsl#2 @ *5
+ vdup.32 d1,r3
+ add r3,r4,r4,lsl#2
+ vdup.32 d2,r2
+ vdup.32 d3,r4
+ add r4,r5,r5,lsl#2
+ vdup.32 d4,r3
+ vdup.32 d5,r5
+ add r5,r6,r6,lsl#2
+ vdup.32 d6,r4
+ vdup.32 d7,r6
+ vdup.32 d8,r5
+
+ mov r5,#2 @ counter
+
+.Lsquare_neon:
+ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+ @ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
+ @ d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4
+ @ d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4
+ @ d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4
+ @ d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4
+
+ vmull.u32 q5,d0,d0[1]
+ vmull.u32 q6,d1,d0[1]
+ vmull.u32 q7,d3,d0[1]
+ vmull.u32 q8,d5,d0[1]
+ vmull.u32 q9,d7,d0[1]
+
+ vmlal.u32 q5,d7,d2[1]
+ vmlal.u32 q6,d0,d1[1]
+ vmlal.u32 q7,d1,d1[1]
+ vmlal.u32 q8,d3,d1[1]
+ vmlal.u32 q9,d5,d1[1]
+
+ vmlal.u32 q5,d5,d4[1]
+ vmlal.u32 q6,d7,d4[1]
+ vmlal.u32 q8,d1,d3[1]
+ vmlal.u32 q7,d0,d3[1]
+ vmlal.u32 q9,d3,d3[1]
+
+ vmlal.u32 q5,d3,d6[1]
+ vmlal.u32 q8,d0,d5[1]
+ vmlal.u32 q6,d5,d6[1]
+ vmlal.u32 q7,d7,d6[1]
+ vmlal.u32 q9,d1,d5[1]
+
+ vmlal.u32 q8,d7,d8[1]
+ vmlal.u32 q5,d1,d8[1]
+ vmlal.u32 q6,d3,d8[1]
+ vmlal.u32 q7,d5,d8[1]
+ vmlal.u32 q9,d0,d7[1]
+
+ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+ @ lazy reduction as discussed in "NEON crypto" by D.J. Bernstein
+ @ and P. Schwabe
+ @
+ @ H0>>+H1>>+H2>>+H3>>+H4
+ @ H3>>+H4>>*5+H0>>+H1
+ @
+ @ Trivia.
+ @
+ @ Result of multiplication of n-bit number by m-bit number is
+ @ n+m bits wide. However! Even though 2^n is a n+1-bit number,
+ @ m-bit number multiplied by 2^n is still n+m bits wide.
+ @
+ @ Sum of two n-bit numbers is n+1 bits wide, sum of three - n+2,
+ @ and so is sum of four. Sum of 2^m n-m-bit numbers and n-bit
+ @ one is n+1 bits wide.
+ @
+ @ >>+ denotes Hnext += Hn>>26, Hn &= 0x3ffffff. This means that
+ @ H0, H2, H3 are guaranteed to be 26 bits wide, while H1 and H4
+ @ can be 27. However! In cases when their width exceeds 26 bits
+ @ they are limited by 2^26+2^6. This in turn means that *sum*
+ @ of the products with these values can still be viewed as sum
+ @ of 52-bit numbers as long as the amount of addends is not a
+ @ power of 2. For example,
+ @
+ @ H4 = H4*R0 + H3*R1 + H2*R2 + H1*R3 + H0 * R4,
+ @
+ @ which can't be larger than 5 * (2^26 + 2^6) * (2^26 + 2^6), or
+ @ 5 * (2^52 + 2*2^32 + 2^12), which in turn is smaller than
+ @ 8 * (2^52) or 2^55. However, the value is then multiplied by
+ @ by 5, so we should be looking at 5 * 5 * (2^52 + 2^33 + 2^12),
+ @ which is less than 32 * (2^52) or 2^57. And when processing
+ @ data we are looking at triple as many addends...
+ @
+ @ In key setup procedure pre-reduced H0 is limited by 5*4+1 and
+ @ 5*H4 - by 5*5 52-bit addends, or 57 bits. But when hashing the
+ @ input H0 is limited by (5*4+1)*3 addends, or 58 bits, while
+ @ 5*H4 by 5*5*3, or 59[!] bits. How is this relevant? vmlal.u32
+ @ instruction accepts 2x32-bit input and writes 2x64-bit result.
+ @ This means that result of reduction have to be compressed upon
+ @ loop wrap-around. This can be done in the process of reduction
+ @ to minimize amount of instructions [as well as amount of
+ @ 128-bit instructions, which benefits low-end processors], but
+ @ one has to watch for H2 (which is narrower than H0) and 5*H4
+ @ not being wider than 58 bits, so that result of right shift
+ @ by 26 bits fits in 32 bits. This is also useful on x86,
+ @ because it allows to use paddd in place for paddq, which
+ @ benefits Atom, where paddq is ridiculously slow.
+
+ vshr.u64 q15,q8,#26
+ vmovn.i64 d16,q8
+ vshr.u64 q4,q5,#26
+ vmovn.i64 d10,q5
+ vadd.i64 q9,q9,q15 @ h3 -> h4
+ vbic.i32 d16,#0xfc000000 @ &=0x03ffffff
+ vadd.i64 q6,q6,q4 @ h0 -> h1
+ vbic.i32 d10,#0xfc000000
+
+ vshrn.u64 d30,q9,#26
+ vmovn.i64 d18,q9
+ vshr.u64 q4,q6,#26
+ vmovn.i64 d12,q6
+ vadd.i64 q7,q7,q4 @ h1 -> h2
+ vbic.i32 d18,#0xfc000000
+ vbic.i32 d12,#0xfc000000
+
+ vadd.i32 d10,d10,d30
+ vshl.u32 d30,d30,#2
+ vshrn.u64 d8,q7,#26
+ vmovn.i64 d14,q7
+ vadd.i32 d10,d10,d30 @ h4 -> h0
+ vadd.i32 d16,d16,d8 @ h2 -> h3
+ vbic.i32 d14,#0xfc000000
+
+ vshr.u32 d30,d10,#26
+ vbic.i32 d10,#0xfc000000
+ vshr.u32 d8,d16,#26
+ vbic.i32 d16,#0xfc000000
+ vadd.i32 d12,d12,d30 @ h0 -> h1
+ vadd.i32 d18,d18,d8 @ h3 -> h4
+
+ subs r5,r5,#1
+ beq .Lsquare_break_neon
+
+ add r6,r0,#(48+0*9*4)
+ add r7,r0,#(48+1*9*4)
+
+ vtrn.32 d0,d10 @ r^2:r^1
+ vtrn.32 d3,d14
+ vtrn.32 d5,d16
+ vtrn.32 d1,d12
+ vtrn.32 d7,d18
+
+ vshl.u32 d4,d3,#2 @ *5
+ vshl.u32 d6,d5,#2
+ vshl.u32 d2,d1,#2
+ vshl.u32 d8,d7,#2
+ vadd.i32 d4,d4,d3
+ vadd.i32 d2,d2,d1
+ vadd.i32 d6,d6,d5
+ vadd.i32 d8,d8,d7
+
+ vst4.32 {d0[0],d1[0],d2[0],d3[0]},[r6]!
+ vst4.32 {d0[1],d1[1],d2[1],d3[1]},[r7]!
+ vst4.32 {d4[0],d5[0],d6[0],d7[0]},[r6]!
+ vst4.32 {d4[1],d5[1],d6[1],d7[1]},[r7]!
+ vst1.32 {d8[0]},[r6,:32]
+ vst1.32 {d8[1]},[r7,:32]
+
+ b .Lsquare_neon
+
+.align 4
+.Lsquare_break_neon:
+ add r6,r0,#(48+2*4*9)
+ add r7,r0,#(48+3*4*9)
+
+ vmov d0,d10 @ r^4:r^3
+ vshl.u32 d2,d12,#2 @ *5
+ vmov d1,d12
+ vshl.u32 d4,d14,#2
+ vmov d3,d14
+ vshl.u32 d6,d16,#2
+ vmov d5,d16
+ vshl.u32 d8,d18,#2
+ vmov d7,d18
+ vadd.i32 d2,d2,d12
+ vadd.i32 d4,d4,d14
+ vadd.i32 d6,d6,d16
+ vadd.i32 d8,d8,d18
+
+ vst4.32 {d0[0],d1[0],d2[0],d3[0]},[r6]!
+ vst4.32 {d0[1],d1[1],d2[1],d3[1]},[r7]!
+ vst4.32 {d4[0],d5[0],d6[0],d7[0]},[r6]!
+ vst4.32 {d4[1],d5[1],d6[1],d7[1]},[r7]!
+ vst1.32 {d8[0]},[r6]
+ vst1.32 {d8[1]},[r7]
+
+ bx lr @ bx lr
+.size poly1305_init_neon,.-poly1305_init_neon
+
+.type poly1305_blocks_neon,%function
+.align 5
+poly1305_blocks_neon:
+ ldr ip,[r0,#36] @ is_base2_26
+ ands r2,r2,#-16
+ beq .Lno_data_neon
+
+ cmp r2,#64
+ bhs .Lenter_neon
+ tst ip,ip @ is_base2_26?
+ beq .Lpoly1305_blocks
+
+.Lenter_neon:
+ stmdb sp!,{r4-r7}
+ vstmdb sp!,{d8-d15} @ ABI specification says so
+
+ tst ip,ip @ is_base2_26?
+ bne .Lbase2_26_neon
+
+ stmdb sp!,{r1-r3,lr}
+ bl poly1305_init_neon
+
+ ldr r4,[r0,#0] @ load hash value base 2^32
+ ldr r5,[r0,#4]
+ ldr r6,[r0,#8]
+ ldr r7,[r0,#12]
+ ldr ip,[r0,#16]
+
+ and r2,r4,#0x03ffffff @ base 2^32 -> base 2^26
+ mov r3,r4,lsr#26
+ veor d10,d10,d10
+ mov r4,r5,lsr#20
+ orr r3,r3,r5,lsl#6
+ veor d12,d12,d12
+ mov r5,r6,lsr#14
+ orr r4,r4,r6,lsl#12
+ veor d14,d14,d14
+ mov r6,r7,lsr#8
+ orr r5,r5,r7,lsl#18
+ veor d16,d16,d16
+ and r3,r3,#0x03ffffff
+ orr r6,r6,ip,lsl#24
+ veor d18,d18,d18
+ and r4,r4,#0x03ffffff
+ mov r1,#1
+ and r5,r5,#0x03ffffff
+ str r1,[r0,#36] @ is_base2_26
+
+ vmov.32 d10[0],r2
+ vmov.32 d12[0],r3
+ vmov.32 d14[0],r4
+ vmov.32 d16[0],r5
+ vmov.32 d18[0],r6
+ adr r5,.Lzeros
+
+ ldmia sp!,{r1-r3,lr}
+ b .Lbase2_32_neon
+
+.align 4
+.Lbase2_26_neon:
+ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+ @ load hash value
+
+ veor d10,d10,d10
+ veor d12,d12,d12
+ veor d14,d14,d14
+ veor d16,d16,d16
+ veor d18,d18,d18
+ vld4.32 {d10[0],d12[0],d14[0],d16[0]},[r0]!
+ adr r5,.Lzeros
+ vld1.32 {d18[0]},[r0]
+ sub r0,r0,#16 @ rewind
+
+.Lbase2_32_neon:
+ add r4,r1,#32
+ mov r3,r3,lsl#24
+ tst r2,#31
+ beq .Leven
+
+ vld4.32 {d20[0],d22[0],d24[0],d26[0]},[r1]!
+ vmov.32 d28[0],r3
+ sub r2,r2,#16
+ add r4,r1,#32
+
+# ifdef __ARMEB__
+ vrev32.8 q10,q10
+ vrev32.8 q13,q13
+ vrev32.8 q11,q11
+ vrev32.8 q12,q12
+# endif
+ vsri.u32 d28,d26,#8 @ base 2^32 -> base 2^26
+ vshl.u32 d26,d26,#18
+
+ vsri.u32 d26,d24,#14
+ vshl.u32 d24,d24,#12
+ vadd.i32 d29,d28,d18 @ add hash value and move to #hi
+
+ vbic.i32 d26,#0xfc000000
+ vsri.u32 d24,d22,#20
+ vshl.u32 d22,d22,#6
+
+ vbic.i32 d24,#0xfc000000
+ vsri.u32 d22,d20,#26
+ vadd.i32 d27,d26,d16
+
+ vbic.i32 d20,#0xfc000000
+ vbic.i32 d22,#0xfc000000
+ vadd.i32 d25,d24,d14
+
+ vadd.i32 d21,d20,d10
+ vadd.i32 d23,d22,d12
+
+ mov r7,r5
+ add r6,r0,#48
+
+ cmp r2,r2
+ b .Long_tail
+
+.align 4
+.Leven:
+ subs r2,r2,#64
+ it lo
+ movlo r4,r5
+
+ vmov.i32 q14,#1<<24 @ padbit, yes, always
+ vld4.32 {d20,d22,d24,d26},[r1] @ inp[0:1]
+ add r1,r1,#64
+ vld4.32 {d21,d23,d25,d27},[r4] @ inp[2:3] (or 0)
+ add r4,r4,#64
+ itt hi
+ addhi r7,r0,#(48+1*9*4)
+ addhi r6,r0,#(48+3*9*4)
+
+# ifdef __ARMEB__
+ vrev32.8 q10,q10
+ vrev32.8 q13,q13
+ vrev32.8 q11,q11
+ vrev32.8 q12,q12
+# endif
+ vsri.u32 q14,q13,#8 @ base 2^32 -> base 2^26
+ vshl.u32 q13,q13,#18
+
+ vsri.u32 q13,q12,#14
+ vshl.u32 q12,q12,#12
+
+ vbic.i32 q13,#0xfc000000
+ vsri.u32 q12,q11,#20
+ vshl.u32 q11,q11,#6
+
+ vbic.i32 q12,#0xfc000000
+ vsri.u32 q11,q10,#26
+
+ vbic.i32 q10,#0xfc000000
+ vbic.i32 q11,#0xfc000000
+
+ bls .Lskip_loop
+
+ vld4.32 {d0[1],d1[1],d2[1],d3[1]},[r7]! @ load r^2
+ vld4.32 {d0[0],d1[0],d2[0],d3[0]},[r6]! @ load r^4
+ vld4.32 {d4[1],d5[1],d6[1],d7[1]},[r7]!
+ vld4.32 {d4[0],d5[0],d6[0],d7[0]},[r6]!
+ b .Loop_neon
+
+.align 5
+.Loop_neon:
+ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+ @ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+ @ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
+ @ ___________________/
+ @ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2
+ @ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r
+ @ ___________________/ ____________________/
+ @
+ @ Note that we start with inp[2:3]*r^2. This is because it
+ @ doesn't depend on reduction in previous iteration.
+ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+ @ d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4
+ @ d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4
+ @ d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4
+ @ d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4
+ @ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
+
+ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+ @ inp[2:3]*r^2
+
+ vadd.i32 d24,d24,d14 @ accumulate inp[0:1]
+ vmull.u32 q7,d25,d0[1]
+ vadd.i32 d20,d20,d10
+ vmull.u32 q5,d21,d0[1]
+ vadd.i32 d26,d26,d16
+ vmull.u32 q8,d27,d0[1]
+ vmlal.u32 q7,d23,d1[1]
+ vadd.i32 d22,d22,d12
+ vmull.u32 q6,d23,d0[1]
+
+ vadd.i32 d28,d28,d18
+ vmull.u32 q9,d29,d0[1]
+ subs r2,r2,#64
+ vmlal.u32 q5,d29,d2[1]
+ it lo
+ movlo r4,r5
+ vmlal.u32 q8,d25,d1[1]
+ vld1.32 d8[1],[r7,:32]
+ vmlal.u32 q6,d21,d1[1]
+ vmlal.u32 q9,d27,d1[1]
+
+ vmlal.u32 q5,d27,d4[1]
+ vmlal.u32 q8,d23,d3[1]
+ vmlal.u32 q9,d25,d3[1]
+ vmlal.u32 q6,d29,d4[1]
+ vmlal.u32 q7,d21,d3[1]
+
+ vmlal.u32 q8,d21,d5[1]
+ vmlal.u32 q5,d25,d6[1]
+ vmlal.u32 q9,d23,d5[1]
+ vmlal.u32 q6,d27,d6[1]
+ vmlal.u32 q7,d29,d6[1]
+
+ vmlal.u32 q8,d29,d8[1]
+ vmlal.u32 q5,d23,d8[1]
+ vmlal.u32 q9,d21,d7[1]
+ vmlal.u32 q6,d25,d8[1]
+ vmlal.u32 q7,d27,d8[1]
+
+ vld4.32 {d21,d23,d25,d27},[r4] @ inp[2:3] (or 0)
+ add r4,r4,#64
+
+ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+ @ (hash+inp[0:1])*r^4 and accumulate
+
+ vmlal.u32 q8,d26,d0[0]
+ vmlal.u32 q5,d20,d0[0]
+ vmlal.u32 q9,d28,d0[0]
+ vmlal.u32 q6,d22,d0[0]
+ vmlal.u32 q7,d24,d0[0]
+ vld1.32 d8[0],[r6,:32]
+
+ vmlal.u32 q8,d24,d1[0]
+ vmlal.u32 q5,d28,d2[0]
+ vmlal.u32 q9,d26,d1[0]
+ vmlal.u32 q6,d20,d1[0]
+ vmlal.u32 q7,d22,d1[0]
+
+ vmlal.u32 q8,d22,d3[0]
+ vmlal.u32 q5,d26,d4[0]
+ vmlal.u32 q9,d24,d3[0]
+ vmlal.u32 q6,d28,d4[0]
+ vmlal.u32 q7,d20,d3[0]
+
+ vmlal.u32 q8,d20,d5[0]
+ vmlal.u32 q5,d24,d6[0]
+ vmlal.u32 q9,d22,d5[0]
+ vmlal.u32 q6,d26,d6[0]
+ vmlal.u32 q8,d28,d8[0]
+
+ vmlal.u32 q7,d28,d6[0]
+ vmlal.u32 q5,d22,d8[0]
+ vmlal.u32 q9,d20,d7[0]
+ vmov.i32 q14,#1<<24 @ padbit, yes, always
+ vmlal.u32 q6,d24,d8[0]
+ vmlal.u32 q7,d26,d8[0]
+
+ vld4.32 {d20,d22,d24,d26},[r1] @ inp[0:1]
+ add r1,r1,#64
+# ifdef __ARMEB__
+ vrev32.8 q10,q10
+ vrev32.8 q11,q11
+ vrev32.8 q12,q12
+ vrev32.8 q13,q13
+# endif
+
+ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+ @ lazy reduction interleaved with base 2^32 -> base 2^26 of
+ @ inp[0:3] previously loaded to q10-q13 and smashed to q10-q14.
+
+ vshr.u64 q15,q8,#26
+ vmovn.i64 d16,q8
+ vshr.u64 q4,q5,#26
+ vmovn.i64 d10,q5
+ vadd.i64 q9,q9,q15 @ h3 -> h4
+ vbic.i32 d16,#0xfc000000
+ vsri.u32 q14,q13,#8 @ base 2^32 -> base 2^26
+ vadd.i64 q6,q6,q4 @ h0 -> h1
+ vshl.u32 q13,q13,#18
+ vbic.i32 d10,#0xfc000000
+
+ vshrn.u64 d30,q9,#26
+ vmovn.i64 d18,q9
+ vshr.u64 q4,q6,#26
+ vmovn.i64 d12,q6
+ vadd.i64 q7,q7,q4 @ h1 -> h2
+ vsri.u32 q13,q12,#14
+ vbic.i32 d18,#0xfc000000
+ vshl.u32 q12,q12,#12
+ vbic.i32 d12,#0xfc000000
+
+ vadd.i32 d10,d10,d30
+ vshl.u32 d30,d30,#2
+ vbic.i32 q13,#0xfc000000
+ vshrn.u64 d8,q7,#26
+ vmovn.i64 d14,q7
+ vaddl.u32 q5,d10,d30 @ h4 -> h0 [widen for a sec]
+ vsri.u32 q12,q11,#20
+ vadd.i32 d16,d16,d8 @ h2 -> h3
+ vshl.u32 q11,q11,#6
+ vbic.i32 d14,#0xfc000000
+ vbic.i32 q12,#0xfc000000
+
+ vshrn.u64 d30,q5,#26 @ re-narrow
+ vmovn.i64 d10,q5
+ vsri.u32 q11,q10,#26
+ vbic.i32 q10,#0xfc000000
+ vshr.u32 d8,d16,#26
+ vbic.i32 d16,#0xfc000000
+ vbic.i32 d10,#0xfc000000
+ vadd.i32 d12,d12,d30 @ h0 -> h1
+ vadd.i32 d18,d18,d8 @ h3 -> h4
+ vbic.i32 q11,#0xfc000000
+
+ bhi .Loop_neon
+
+.Lskip_loop:
+ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+ @ multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1
+
+ add r7,r0,#(48+0*9*4)
+ add r6,r0,#(48+1*9*4)
+ adds r2,r2,#32
+ it ne
+ movne r2,#0
+ bne .Long_tail
+
+ vadd.i32 d25,d24,d14 @ add hash value and move to #hi
+ vadd.i32 d21,d20,d10
+ vadd.i32 d27,d26,d16
+ vadd.i32 d23,d22,d12
+ vadd.i32 d29,d28,d18
+
+.Long_tail:
+ vld4.32 {d0[1],d1[1],d2[1],d3[1]},[r7]! @ load r^1
+ vld4.32 {d0[0],d1[0],d2[0],d3[0]},[r6]! @ load r^2
+
+ vadd.i32 d24,d24,d14 @ can be redundant
+ vmull.u32 q7,d25,d0
+ vadd.i32 d20,d20,d10
+ vmull.u32 q5,d21,d0
+ vadd.i32 d26,d26,d16
+ vmull.u32 q8,d27,d0
+ vadd.i32 d22,d22,d12
+ vmull.u32 q6,d23,d0
+ vadd.i32 d28,d28,d18
+ vmull.u32 q9,d29,d0
+
+ vmlal.u32 q5,d29,d2
+ vld4.32 {d4[1],d5[1],d6[1],d7[1]},[r7]!
+ vmlal.u32 q8,d25,d1
+ vld4.32 {d4[0],d5[0],d6[0],d7[0]},[r6]!
+ vmlal.u32 q6,d21,d1
+ vmlal.u32 q9,d27,d1
+ vmlal.u32 q7,d23,d1
+
+ vmlal.u32 q8,d23,d3
+ vld1.32 d8[1],[r7,:32]
+ vmlal.u32 q5,d27,d4
+ vld1.32 d8[0],[r6,:32]
+ vmlal.u32 q9,d25,d3
+ vmlal.u32 q6,d29,d4
+ vmlal.u32 q7,d21,d3
+
+ vmlal.u32 q8,d21,d5
+ it ne
+ addne r7,r0,#(48+2*9*4)
+ vmlal.u32 q5,d25,d6
+ it ne
+ addne r6,r0,#(48+3*9*4)
+ vmlal.u32 q9,d23,d5
+ vmlal.u32 q6,d27,d6
+ vmlal.u32 q7,d29,d6
+
+ vmlal.u32 q8,d29,d8
+ vorn q0,q0,q0 @ all-ones, can be redundant
+ vmlal.u32 q5,d23,d8
+ vshr.u64 q0,q0,#38
+ vmlal.u32 q9,d21,d7
+ vmlal.u32 q6,d25,d8
+ vmlal.u32 q7,d27,d8
+
+ beq .Lshort_tail
+
+ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+ @ (hash+inp[0:1])*r^4:r^3 and accumulate
+
+ vld4.32 {d0[1],d1[1],d2[1],d3[1]},[r7]! @ load r^3
+ vld4.32 {d0[0],d1[0],d2[0],d3[0]},[r6]! @ load r^4
+
+ vmlal.u32 q7,d24,d0
+ vmlal.u32 q5,d20,d0
+ vmlal.u32 q8,d26,d0
+ vmlal.u32 q6,d22,d0
+ vmlal.u32 q9,d28,d0
+
+ vmlal.u32 q5,d28,d2
+ vld4.32 {d4[1],d5[1],d6[1],d7[1]},[r7]!
+ vmlal.u32 q8,d24,d1
+ vld4.32 {d4[0],d5[0],d6[0],d7[0]},[r6]!
+ vmlal.u32 q6,d20,d1
+ vmlal.u32 q9,d26,d1
+ vmlal.u32 q7,d22,d1
+
+ vmlal.u32 q8,d22,d3
+ vld1.32 d8[1],[r7,:32]
+ vmlal.u32 q5,d26,d4
+ vld1.32 d8[0],[r6,:32]
+ vmlal.u32 q9,d24,d3
+ vmlal.u32 q6,d28,d4
+ vmlal.u32 q7,d20,d3
+
+ vmlal.u32 q8,d20,d5
+ vmlal.u32 q5,d24,d6
+ vmlal.u32 q9,d22,d5
+ vmlal.u32 q6,d26,d6
+ vmlal.u32 q7,d28,d6
+
+ vmlal.u32 q8,d28,d8
+ vorn q0,q0,q0 @ all-ones
+ vmlal.u32 q5,d22,d8
+ vshr.u64 q0,q0,#38
+ vmlal.u32 q9,d20,d7
+ vmlal.u32 q6,d24,d8
+ vmlal.u32 q7,d26,d8
+
+.Lshort_tail:
+ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+ @ horizontal addition
+
+ vadd.i64 d16,d16,d17
+ vadd.i64 d10,d10,d11
+ vadd.i64 d18,d18,d19
+ vadd.i64 d12,d12,d13
+ vadd.i64 d14,d14,d15
+
+ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+ @ lazy reduction, but without narrowing
+
+ vshr.u64 q15,q8,#26
+ vand.i64 q8,q8,q0
+ vshr.u64 q4,q5,#26
+ vand.i64 q5,q5,q0
+ vadd.i64 q9,q9,q15 @ h3 -> h4
+ vadd.i64 q6,q6,q4 @ h0 -> h1
+
+ vshr.u64 q15,q9,#26
+ vand.i64 q9,q9,q0
+ vshr.u64 q4,q6,#26
+ vand.i64 q6,q6,q0
+ vadd.i64 q7,q7,q4 @ h1 -> h2
+
+ vadd.i64 q5,q5,q15
+ vshl.u64 q15,q15,#2
+ vshr.u64 q4,q7,#26
+ vand.i64 q7,q7,q0
+ vadd.i64 q5,q5,q15 @ h4 -> h0
+ vadd.i64 q8,q8,q4 @ h2 -> h3
+
+ vshr.u64 q15,q5,#26
+ vand.i64 q5,q5,q0
+ vshr.u64 q4,q8,#26
+ vand.i64 q8,q8,q0
+ vadd.i64 q6,q6,q15 @ h0 -> h1
+ vadd.i64 q9,q9,q4 @ h3 -> h4
+
+ cmp r2,#0
+ bne .Leven
+
+ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+ @ store hash value
+
+ vst4.32 {d10[0],d12[0],d14[0],d16[0]},[r0]!
+ vst1.32 {d18[0]},[r0]
+
+ vldmia sp!,{d8-d15} @ epilogue
+ ldmia sp!,{r4-r7}
+.Lno_data_neon:
+ bx lr @ bx lr
+.size poly1305_blocks_neon,.-poly1305_blocks_neon
+
+.type poly1305_emit_neon,%function
+.align 5
+poly1305_emit_neon:
+ ldr ip,[r0,#36] @ is_base2_26
+
+ stmdb sp!,{r4-r11}
+
+ tst ip,ip
+ beq .Lpoly1305_emit_enter
+
+ ldmia r0,{r3-r7}
+ eor r8,r8,r8
+
+ adds r3,r3,r4,lsl#26 @ base 2^26 -> base 2^32
+ mov r4,r4,lsr#6
+ adcs r4,r4,r5,lsl#20
+ mov r5,r5,lsr#12
+ adcs r5,r5,r6,lsl#14
+ mov r6,r6,lsr#18
+ adcs r6,r6,r7,lsl#8
+ adc r7,r8,r7,lsr#24 @ can be partially reduced ...
+
+ and r8,r7,#-4 @ ... so reduce
+ and r7,r6,#3
+ add r8,r8,r8,lsr#2 @ *= 5
+ adds r3,r3,r8
+ adcs r4,r4,#0
+ adcs r5,r5,#0
+ adcs r6,r6,#0
+ adc r7,r7,#0
+
+ adds r8,r3,#5 @ compare to modulus
+ adcs r9,r4,#0
+ adcs r10,r5,#0
+ adcs r11,r6,#0
+ adc r7,r7,#0
+ tst r7,#4 @ did it carry/borrow?
+
+ it ne
+ movne r3,r8
+ ldr r8,[r2,#0]
+ it ne
+ movne r4,r9
+ ldr r9,[r2,#4]
+ it ne
+ movne r5,r10
+ ldr r10,[r2,#8]
+ it ne
+ movne r6,r11
+ ldr r11,[r2,#12]
+
+ adds r3,r3,r8 @ accumulate nonce
+ adcs r4,r4,r9
+ adcs r5,r5,r10
+ adc r6,r6,r11
+
+# ifdef __ARMEB__
+ rev r3,r3
+ rev r4,r4
+ rev r5,r5
+ rev r6,r6
+# endif
+ str r3,[r1,#0] @ store the result
+ str r4,[r1,#4]
+ str r5,[r1,#8]
+ str r6,[r1,#12]
+
+ ldmia sp!,{r4-r11}
+ bx lr @ bx lr
+.size poly1305_emit_neon,.-poly1305_emit_neon
+
+.align 5
+.Lzeros:
+.long 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
+.LOPENSSL_armcap:
+.word OPENSSL_armcap_P-.Lpoly1305_init
+#endif
+.asciz "Poly1305 for ARMv4/NEON, CRYPTOGAMS by <appro@openssl.org>"
+.align 2
+#if __ARM_MAX_ARCH__>=7
+.comm OPENSSL_armcap_P,4,4
+#endif
diff --git a/lib/zinc/poly1305/poly1305-arm64-cryptogams.S b/lib/zinc/poly1305/poly1305-arm64-cryptogams.S
new file mode 100644
index 000000000000..0ecb50a83ec0
--- /dev/null
+++ b/lib/zinc/poly1305/poly1305-arm64-cryptogams.S
@@ -0,0 +1,869 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+/*
+ * Copyright (C) 2006-2017 CRYPTOGAMS by <appro@openssl.org>. All Rights Reserved.
+ */
+
+#include "arm_arch.h"
+
+.text
+
+// forward "declarations" are required for Apple
+
+.globl poly1305_blocks
+.globl poly1305_emit
+
+.globl poly1305_init
+.type poly1305_init,%function
+.align 5
+poly1305_init:
+ cmp x1,xzr
+ stp xzr,xzr,[x0] // zero hash value
+ stp xzr,xzr,[x0,#16] // [along with is_base2_26]
+
+ csel x0,xzr,x0,eq
+ b.eq .Lno_key
+
+#ifdef __ILP32__
+ ldrsw x11,.LOPENSSL_armcap_P
+#else
+ ldr x11,.LOPENSSL_armcap_P
+#endif
+ adr x10,.LOPENSSL_armcap_P
+
+ ldp x7,x8,[x1] // load key
+ mov x9,#0xfffffffc0fffffff
+ movk x9,#0x0fff,lsl#48
+ ldr w17,[x10,x11]
+#ifdef __ARMEB__
+ rev x7,x7 // flip bytes
+ rev x8,x8
+#endif
+ and x7,x7,x9 // &=0ffffffc0fffffff
+ and x9,x9,#-4
+ and x8,x8,x9 // &=0ffffffc0ffffffc
+ stp x7,x8,[x0,#32] // save key value
+
+ tst w17,#ARMV7_NEON
+
+ adr x12,poly1305_blocks
+ adr x7,poly1305_blocks_neon
+ adr x13,poly1305_emit
+ adr x8,poly1305_emit_neon
+
+ csel x12,x12,x7,eq
+ csel x13,x13,x8,eq
+
+#ifdef __ILP32__
+ stp w12,w13,[x2]
+#else
+ stp x12,x13,[x2]
+#endif
+
+ mov x0,#1
+.Lno_key:
+ ret
+.size poly1305_init,.-poly1305_init
+
+.type poly1305_blocks,%function
+.align 5
+poly1305_blocks:
+ ands x2,x2,#-16
+ b.eq .Lno_data
+
+ ldp x4,x5,[x0] // load hash value
+ ldp x7,x8,[x0,#32] // load key value
+ ldr x6,[x0,#16]
+ add x9,x8,x8,lsr#2 // s1 = r1 + (r1 >> 2)
+ b .Loop
+
+.align 5
+.Loop:
+ ldp x10,x11,[x1],#16 // load input
+ sub x2,x2,#16
+#ifdef __ARMEB__
+ rev x10,x10
+ rev x11,x11
+#endif
+ adds x4,x4,x10 // accumulate input
+ adcs x5,x5,x11
+
+ mul x12,x4,x7 // h0*r0
+ adc x6,x6,x3
+ umulh x13,x4,x7
+
+ mul x10,x5,x9 // h1*5*r1
+ umulh x11,x5,x9
+
+ adds x12,x12,x10
+ mul x10,x4,x8 // h0*r1
+ adc x13,x13,x11
+ umulh x14,x4,x8
+
+ adds x13,x13,x10
+ mul x10,x5,x7 // h1*r0
+ adc x14,x14,xzr
+ umulh x11,x5,x7
+
+ adds x13,x13,x10
+ mul x10,x6,x9 // h2*5*r1
+ adc x14,x14,x11
+ mul x11,x6,x7 // h2*r0
+
+ adds x13,x13,x10
+ adc x14,x14,x11
+
+ and x10,x14,#-4 // final reduction
+ and x6,x14,#3
+ add x10,x10,x14,lsr#2
+ adds x4,x12,x10
+ adcs x5,x13,xzr
+ adc x6,x6,xzr
+
+ cbnz x2,.Loop
+
+ stp x4,x5,[x0] // store hash value
+ str x6,[x0,#16]
+
+.Lno_data:
+ ret
+.size poly1305_blocks,.-poly1305_blocks
+
+.type poly1305_emit,%function
+.align 5
+poly1305_emit:
+ ldp x4,x5,[x0] // load hash base 2^64
+ ldr x6,[x0,#16]
+ ldp x10,x11,[x2] // load nonce
+
+ adds x12,x4,#5 // compare to modulus
+ adcs x13,x5,xzr
+ adc x14,x6,xzr
+
+ tst x14,#-4 // see if it's carried/borrowed
+
+ csel x4,x4,x12,eq
+ csel x5,x5,x13,eq
+
+#ifdef __ARMEB__
+ ror x10,x10,#32 // flip nonce words
+ ror x11,x11,#32
+#endif
+ adds x4,x4,x10 // accumulate nonce
+ adc x5,x5,x11
+#ifdef __ARMEB__
+ rev x4,x4 // flip output bytes
+ rev x5,x5
+#endif
+ stp x4,x5,[x1] // write result
+
+ ret
+.size poly1305_emit,.-poly1305_emit
+.type poly1305_mult,%function
+.align 5
+poly1305_mult:
+ mul x12,x4,x7 // h0*r0
+ umulh x13,x4,x7
+
+ mul x10,x5,x9 // h1*5*r1
+ umulh x11,x5,x9
+
+ adds x12,x12,x10
+ mul x10,x4,x8 // h0*r1
+ adc x13,x13,x11
+ umulh x14,x4,x8
+
+ adds x13,x13,x10
+ mul x10,x5,x7 // h1*r0
+ adc x14,x14,xzr
+ umulh x11,x5,x7
+
+ adds x13,x13,x10
+ mul x10,x6,x9 // h2*5*r1
+ adc x14,x14,x11
+ mul x11,x6,x7 // h2*r0
+
+ adds x13,x13,x10
+ adc x14,x14,x11
+
+ and x10,x14,#-4 // final reduction
+ and x6,x14,#3
+ add x10,x10,x14,lsr#2
+ adds x4,x12,x10
+ adcs x5,x13,xzr
+ adc x6,x6,xzr
+
+ ret
+.size poly1305_mult,.-poly1305_mult
+
+.type poly1305_splat,%function
+.align 5
+poly1305_splat:
+ and x12,x4,#0x03ffffff // base 2^64 -> base 2^26
+ ubfx x13,x4,#26,#26
+ extr x14,x5,x4,#52
+ and x14,x14,#0x03ffffff
+ ubfx x15,x5,#14,#26
+ extr x16,x6,x5,#40
+
+ str w12,[x0,#16*0] // r0
+ add w12,w13,w13,lsl#2 // r1*5
+ str w13,[x0,#16*1] // r1
+ add w13,w14,w14,lsl#2 // r2*5
+ str w12,[x0,#16*2] // s1
+ str w14,[x0,#16*3] // r2
+ add w14,w15,w15,lsl#2 // r3*5
+ str w13,[x0,#16*4] // s2
+ str w15,[x0,#16*5] // r3
+ add w15,w16,w16,lsl#2 // r4*5
+ str w14,[x0,#16*6] // s3
+ str w16,[x0,#16*7] // r4
+ str w15,[x0,#16*8] // s4
+
+ ret
+.size poly1305_splat,.-poly1305_splat
+
+.type poly1305_blocks_neon,%function
+.align 5
+poly1305_blocks_neon:
+ ldr x17,[x0,#24]
+ cmp x2,#128
+ b.hs .Lblocks_neon
+ cbz x17,poly1305_blocks
+
+.Lblocks_neon:
+ stp x29,x30,[sp,#-80]!
+ add x29,sp,#0
+
+ ands x2,x2,#-16
+ b.eq .Lno_data_neon
+
+ cbz x17,.Lbase2_64_neon
+
+ ldp w10,w11,[x0] // load hash value base 2^26
+ ldp w12,w13,[x0,#8]
+ ldr w14,[x0,#16]
+
+ tst x2,#31
+ b.eq .Leven_neon
+
+ ldp x7,x8,[x0,#32] // load key value
+
+ add x4,x10,x11,lsl#26 // base 2^26 -> base 2^64
+ lsr x5,x12,#12
+ adds x4,x4,x12,lsl#52
+ add x5,x5,x13,lsl#14
+ adc x5,x5,xzr
+ lsr x6,x14,#24
+ adds x5,x5,x14,lsl#40
+ adc x14,x6,xzr // can be partially reduced...
+
+ ldp x12,x13,[x1],#16 // load input
+ sub x2,x2,#16
+ add x9,x8,x8,lsr#2 // s1 = r1 + (r1 >> 2)
+
+ and x10,x14,#-4 // ... so reduce
+ and x6,x14,#3
+ add x10,x10,x14,lsr#2
+ adds x4,x4,x10
+ adcs x5,x5,xzr
+ adc x6,x6,xzr
+
+#ifdef __ARMEB__
+ rev x12,x12
+ rev x13,x13
+#endif
+ adds x4,x4,x12 // accumulate input
+ adcs x5,x5,x13
+ adc x6,x6,x3
+
+ bl poly1305_mult
+ ldr x30,[sp,#8]
+
+ cbz x3,.Lstore_base2_64_neon
+
+ and x10,x4,#0x03ffffff // base 2^64 -> base 2^26
+ ubfx x11,x4,#26,#26
+ extr x12,x5,x4,#52
+ and x12,x12,#0x03ffffff
+ ubfx x13,x5,#14,#26
+ extr x14,x6,x5,#40
+
+ cbnz x2,.Leven_neon
+
+ stp w10,w11,[x0] // store hash value base 2^26
+ stp w12,w13,[x0,#8]
+ str w14,[x0,#16]
+ b .Lno_data_neon
+
+.align 4
+.Lstore_base2_64_neon:
+ stp x4,x5,[x0] // store hash value base 2^64
+ stp x6,xzr,[x0,#16] // note that is_base2_26 is zeroed
+ b .Lno_data_neon
+
+.align 4
+.Lbase2_64_neon:
+ ldp x7,x8,[x0,#32] // load key value
+
+ ldp x4,x5,[x0] // load hash value base 2^64
+ ldr x6,[x0,#16]
+
+ tst x2,#31
+ b.eq .Linit_neon
+
+ ldp x12,x13,[x1],#16 // load input
+ sub x2,x2,#16
+ add x9,x8,x8,lsr#2 // s1 = r1 + (r1 >> 2)
+#ifdef __ARMEB__
+ rev x12,x12
+ rev x13,x13
+#endif
+ adds x4,x4,x12 // accumulate input
+ adcs x5,x5,x13
+ adc x6,x6,x3
+
+ bl poly1305_mult
+
+.Linit_neon:
+ and x10,x4,#0x03ffffff // base 2^64 -> base 2^26
+ ubfx x11,x4,#26,#26
+ extr x12,x5,x4,#52
+ and x12,x12,#0x03ffffff
+ ubfx x13,x5,#14,#26
+ extr x14,x6,x5,#40
+
+ stp d8,d9,[sp,#16] // meet ABI requirements
+ stp d10,d11,[sp,#32]
+ stp d12,d13,[sp,#48]
+ stp d14,d15,[sp,#64]
+
+ fmov d24,x10
+ fmov d25,x11
+ fmov d26,x12
+ fmov d27,x13
+ fmov d28,x14
+
+ ////////////////////////////////// initialize r^n table
+ mov x4,x7 // r^1
+ add x9,x8,x8,lsr#2 // s1 = r1 + (r1 >> 2)
+ mov x5,x8
+ mov x6,xzr
+ add x0,x0,#48+12
+ bl poly1305_splat
+
+ bl poly1305_mult // r^2
+ sub x0,x0,#4
+ bl poly1305_splat
+
+ bl poly1305_mult // r^3
+ sub x0,x0,#4
+ bl poly1305_splat
+
+ bl poly1305_mult // r^4
+ sub x0,x0,#4
+ bl poly1305_splat
+ ldr x30,[sp,#8]
+
+ add x16,x1,#32
+ adr x17,.Lzeros
+ subs x2,x2,#64
+ csel x16,x17,x16,lo
+
+ mov x4,#1
+ str x4,[x0,#-24] // set is_base2_26
+ sub x0,x0,#48 // restore original x0
+ b .Ldo_neon
+
+.align 4
+.Leven_neon:
+ add x16,x1,#32
+ adr x17,.Lzeros
+ subs x2,x2,#64
+ csel x16,x17,x16,lo
+
+ stp d8,d9,[sp,#16] // meet ABI requirements
+ stp d10,d11,[sp,#32]
+ stp d12,d13,[sp,#48]
+ stp d14,d15,[sp,#64]
+
+ fmov d24,x10
+ fmov d25,x11
+ fmov d26,x12
+ fmov d27,x13
+ fmov d28,x14
+
+.Ldo_neon:
+ ldp x8,x12,[x16],#16 // inp[2:3] (or zero)
+ ldp x9,x13,[x16],#48
+
+ lsl x3,x3,#24
+ add x15,x0,#48
+
+#ifdef __ARMEB__
+ rev x8,x8
+ rev x12,x12
+ rev x9,x9
+ rev x13,x13
+#endif
+ and x4,x8,#0x03ffffff // base 2^64 -> base 2^26
+ and x5,x9,#0x03ffffff
+ ubfx x6,x8,#26,#26
+ ubfx x7,x9,#26,#26
+ add x4,x4,x5,lsl#32 // bfi x4,x5,#32,#32
+ extr x8,x12,x8,#52
+ extr x9,x13,x9,#52
+ add x6,x6,x7,lsl#32 // bfi x6,x7,#32,#32
+ fmov d14,x4
+ and x8,x8,#0x03ffffff
+ and x9,x9,#0x03ffffff
+ ubfx x10,x12,#14,#26
+ ubfx x11,x13,#14,#26
+ add x12,x3,x12,lsr#40
+ add x13,x3,x13,lsr#40
+ add x8,x8,x9,lsl#32 // bfi x8,x9,#32,#32
+ fmov d15,x6
+ add x10,x10,x11,lsl#32 // bfi x10,x11,#32,#32
+ add x12,x12,x13,lsl#32 // bfi x12,x13,#32,#32
+ fmov d16,x8
+ fmov d17,x10
+ fmov d18,x12
+
+ ldp x8,x12,[x1],#16 // inp[0:1]
+ ldp x9,x13,[x1],#48
+
+ ld1 {v0.4s,v1.4s,v2.4s,v3.4s},[x15],#64
+ ld1 {v4.4s,v5.4s,v6.4s,v7.4s},[x15],#64
+ ld1 {v8.4s},[x15]
+
+#ifdef __ARMEB__
+ rev x8,x8
+ rev x12,x12
+ rev x9,x9
+ rev x13,x13
+#endif
+ and x4,x8,#0x03ffffff // base 2^64 -> base 2^26
+ and x5,x9,#0x03ffffff
+ ubfx x6,x8,#26,#26
+ ubfx x7,x9,#26,#26
+ add x4,x4,x5,lsl#32 // bfi x4,x5,#32,#32
+ extr x8,x12,x8,#52
+ extr x9,x13,x9,#52
+ add x6,x6,x7,lsl#32 // bfi x6,x7,#32,#32
+ fmov d9,x4
+ and x8,x8,#0x03ffffff
+ and x9,x9,#0x03ffffff
+ ubfx x10,x12,#14,#26
+ ubfx x11,x13,#14,#26
+ add x12,x3,x12,lsr#40
+ add x13,x3,x13,lsr#40
+ add x8,x8,x9,lsl#32 // bfi x8,x9,#32,#32
+ fmov d10,x6
+ add x10,x10,x11,lsl#32 // bfi x10,x11,#32,#32
+ add x12,x12,x13,lsl#32 // bfi x12,x13,#32,#32
+ movi v31.2d,#-1
+ fmov d11,x8
+ fmov d12,x10
+ fmov d13,x12
+ ushr v31.2d,v31.2d,#38
+
+ b.ls .Lskip_loop
+
+.align 4
+.Loop_neon:
+ ////////////////////////////////////////////////////////////////
+ // ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+ // ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
+ // ___________________/
+ // ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2
+ // ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r
+ // ___________________/ ____________________/
+ //
+ // Note that we start with inp[2:3]*r^2. This is because it
+ // doesn't depend on reduction in previous iteration.
+ ////////////////////////////////////////////////////////////////
+ // d4 = h0*r4 + h1*r3 + h2*r2 + h3*r1 + h4*r0
+ // d3 = h0*r3 + h1*r2 + h2*r1 + h3*r0 + h4*5*r4
+ // d2 = h0*r2 + h1*r1 + h2*r0 + h3*5*r4 + h4*5*r3
+ // d1 = h0*r1 + h1*r0 + h2*5*r4 + h3*5*r3 + h4*5*r2
+ // d0 = h0*r0 + h1*5*r4 + h2*5*r3 + h3*5*r2 + h4*5*r1
+
+ subs x2,x2,#64
+ umull v23.2d,v14.2s,v7.s[2]
+ csel x16,x17,x16,lo
+ umull v22.2d,v14.2s,v5.s[2]
+ umull v21.2d,v14.2s,v3.s[2]
+ ldp x8,x12,[x16],#16 // inp[2:3] (or zero)
+ umull v20.2d,v14.2s,v1.s[2]
+ ldp x9,x13,[x16],#48
+ umull v19.2d,v14.2s,v0.s[2]
+#ifdef __ARMEB__
+ rev x8,x8
+ rev x12,x12
+ rev x9,x9
+ rev x13,x13
+#endif
+
+ umlal v23.2d,v15.2s,v5.s[2]
+ and x4,x8,#0x03ffffff // base 2^64 -> base 2^26
+ umlal v22.2d,v15.2s,v3.s[2]
+ and x5,x9,#0x03ffffff
+ umlal v21.2d,v15.2s,v1.s[2]
+ ubfx x6,x8,#26,#26
+ umlal v20.2d,v15.2s,v0.s[2]
+ ubfx x7,x9,#26,#26
+ umlal v19.2d,v15.2s,v8.s[2]
+ add x4,x4,x5,lsl#32 // bfi x4,x5,#32,#32
+
+ umlal v23.2d,v16.2s,v3.s[2]
+ extr x8,x12,x8,#52
+ umlal v22.2d,v16.2s,v1.s[2]
+ extr x9,x13,x9,#52
+ umlal v21.2d,v16.2s,v0.s[2]
+ add x6,x6,x7,lsl#32 // bfi x6,x7,#32,#32
+ umlal v20.2d,v16.2s,v8.s[2]
+ fmov d14,x4
+ umlal v19.2d,v16.2s,v6.s[2]
+ and x8,x8,#0x03ffffff
+
+ umlal v23.2d,v17.2s,v1.s[2]
+ and x9,x9,#0x03ffffff
+ umlal v22.2d,v17.2s,v0.s[2]
+ ubfx x10,x12,#14,#26
+ umlal v21.2d,v17.2s,v8.s[2]
+ ubfx x11,x13,#14,#26
+ umlal v20.2d,v17.2s,v6.s[2]
+ add x8,x8,x9,lsl#32 // bfi x8,x9,#32,#32
+ umlal v19.2d,v17.2s,v4.s[2]
+ fmov d15,x6
+
+ add v11.2s,v11.2s,v26.2s
+ add x12,x3,x12,lsr#40
+ umlal v23.2d,v18.2s,v0.s[2]
+ add x13,x3,x13,lsr#40
+ umlal v22.2d,v18.2s,v8.s[2]
+ add x10,x10,x11,lsl#32 // bfi x10,x11,#32,#32
+ umlal v21.2d,v18.2s,v6.s[2]
+ add x12,x12,x13,lsl#32 // bfi x12,x13,#32,#32
+ umlal v20.2d,v18.2s,v4.s[2]
+ fmov d16,x8
+ umlal v19.2d,v18.2s,v2.s[2]
+ fmov d17,x10
+
+ ////////////////////////////////////////////////////////////////
+ // (hash+inp[0:1])*r^4 and accumulate
+
+ add v9.2s,v9.2s,v24.2s
+ fmov d18,x12
+ umlal v22.2d,v11.2s,v1.s[0]
+ ldp x8,x12,[x1],#16 // inp[0:1]
+ umlal v19.2d,v11.2s,v6.s[0]
+ ldp x9,x13,[x1],#48
+ umlal v23.2d,v11.2s,v3.s[0]
+ umlal v20.2d,v11.2s,v8.s[0]
+ umlal v21.2d,v11.2s,v0.s[0]
+#ifdef __ARMEB__
+ rev x8,x8
+ rev x12,x12
+ rev x9,x9
+ rev x13,x13
+#endif
+
+ add v10.2s,v10.2s,v25.2s
+ umlal v22.2d,v9.2s,v5.s[0]
+ umlal v23.2d,v9.2s,v7.s[0]
+ and x4,x8,#0x03ffffff // base 2^64 -> base 2^26
+ umlal v21.2d,v9.2s,v3.s[0]
+ and x5,x9,#0x03ffffff
+ umlal v19.2d,v9.2s,v0.s[0]
+ ubfx x6,x8,#26,#26
+ umlal v20.2d,v9.2s,v1.s[0]
+ ubfx x7,x9,#26,#26
+
+ add v12.2s,v12.2s,v27.2s
+ add x4,x4,x5,lsl#32 // bfi x4,x5,#32,#32
+ umlal v22.2d,v10.2s,v3.s[0]
+ extr x8,x12,x8,#52
+ umlal v23.2d,v10.2s,v5.s[0]
+ extr x9,x13,x9,#52
+ umlal v19.2d,v10.2s,v8.s[0]
+ add x6,x6,x7,lsl#32 // bfi x6,x7,#32,#32
+ umlal v21.2d,v10.2s,v1.s[0]
+ fmov d9,x4
+ umlal v20.2d,v10.2s,v0.s[0]
+ and x8,x8,#0x03ffffff
+
+ add v13.2s,v13.2s,v28.2s
+ and x9,x9,#0x03ffffff
+ umlal v22.2d,v12.2s,v0.s[0]
+ ubfx x10,x12,#14,#26
+ umlal v19.2d,v12.2s,v4.s[0]
+ ubfx x11,x13,#14,#26
+ umlal v23.2d,v12.2s,v1.s[0]
+ add x8,x8,x9,lsl#32 // bfi x8,x9,#32,#32
+ umlal v20.2d,v12.2s,v6.s[0]
+ fmov d10,x6
+ umlal v21.2d,v12.2s,v8.s[0]
+ add x12,x3,x12,lsr#40
+
+ umlal v22.2d,v13.2s,v8.s[0]
+ add x13,x3,x13,lsr#40
+ umlal v19.2d,v13.2s,v2.s[0]
+ add x10,x10,x11,lsl#32 // bfi x10,x11,#32,#32
+ umlal v23.2d,v13.2s,v0.s[0]
+ add x12,x12,x13,lsl#32 // bfi x12,x13,#32,#32
+ umlal v20.2d,v13.2s,v4.s[0]
+ fmov d11,x8
+ umlal v21.2d,v13.2s,v6.s[0]
+ fmov d12,x10
+ fmov d13,x12
+
+ /////////////////////////////////////////////////////////////////
+ // lazy reduction as discussed in "NEON crypto" by D.J. Bernstein
+ // and P. Schwabe
+ //
+ // [see discussion in poly1305-armv4 module]
+
+ ushr v29.2d,v22.2d,#26
+ xtn v27.2s,v22.2d
+ ushr v30.2d,v19.2d,#26
+ and v19.16b,v19.16b,v31.16b
+ add v23.2d,v23.2d,v29.2d // h3 -> h4
+ bic v27.2s,#0xfc,lsl#24 // &=0x03ffffff
+ add v20.2d,v20.2d,v30.2d // h0 -> h1
+
+ ushr v29.2d,v23.2d,#26
+ xtn v28.2s,v23.2d
+ ushr v30.2d,v20.2d,#26
+ xtn v25.2s,v20.2d
+ bic v28.2s,#0xfc,lsl#24
+ add v21.2d,v21.2d,v30.2d // h1 -> h2
+
+ add v19.2d,v19.2d,v29.2d
+ shl v29.2d,v29.2d,#2
+ shrn v30.2s,v21.2d,#26
+ xtn v26.2s,v21.2d
+ add v19.2d,v19.2d,v29.2d // h4 -> h0
+ bic v25.2s,#0xfc,lsl#24
+ add v27.2s,v27.2s,v30.2s // h2 -> h3
+ bic v26.2s,#0xfc,lsl#24
+
+ shrn v29.2s,v19.2d,#26
+ xtn v24.2s,v19.2d
+ ushr v30.2s,v27.2s,#26
+ bic v27.2s,#0xfc,lsl#24
+ bic v24.2s,#0xfc,lsl#24
+ add v25.2s,v25.2s,v29.2s // h0 -> h1
+ add v28.2s,v28.2s,v30.2s // h3 -> h4
+
+ b.hi .Loop_neon
+
+.Lskip_loop:
+ dup v16.2d,v16.d[0]
+ add v11.2s,v11.2s,v26.2s
+
+ ////////////////////////////////////////////////////////////////
+ // multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1
+
+ adds x2,x2,#32
+ b.ne .Long_tail
+
+ dup v16.2d,v11.d[0]
+ add v14.2s,v9.2s,v24.2s
+ add v17.2s,v12.2s,v27.2s
+ add v15.2s,v10.2s,v25.2s
+ add v18.2s,v13.2s,v28.2s
+
+.Long_tail:
+ dup v14.2d,v14.d[0]
+ umull2 v19.2d,v16.4s,v6.4s
+ umull2 v22.2d,v16.4s,v1.4s
+ umull2 v23.2d,v16.4s,v3.4s
+ umull2 v21.2d,v16.4s,v0.4s
+ umull2 v20.2d,v16.4s,v8.4s
+
+ dup v15.2d,v15.d[0]
+ umlal2 v19.2d,v14.4s,v0.4s
+ umlal2 v21.2d,v14.4s,v3.4s
+ umlal2 v22.2d,v14.4s,v5.4s
+ umlal2 v23.2d,v14.4s,v7.4s
+ umlal2 v20.2d,v14.4s,v1.4s
+
+ dup v17.2d,v17.d[0]
+ umlal2 v19.2d,v15.4s,v8.4s
+ umlal2 v22.2d,v15.4s,v3.4s
+ umlal2 v21.2d,v15.4s,v1.4s
+ umlal2 v23.2d,v15.4s,v5.4s
+ umlal2 v20.2d,v15.4s,v0.4s
+
+ dup v18.2d,v18.d[0]
+ umlal2 v22.2d,v17.4s,v0.4s
+ umlal2 v23.2d,v17.4s,v1.4s
+ umlal2 v19.2d,v17.4s,v4.4s
+ umlal2 v20.2d,v17.4s,v6.4s
+ umlal2 v21.2d,v17.4s,v8.4s
+
+ umlal2 v22.2d,v18.4s,v8.4s
+ umlal2 v19.2d,v18.4s,v2.4s
+ umlal2 v23.2d,v18.4s,v0.4s
+ umlal2 v20.2d,v18.4s,v4.4s
+ umlal2 v21.2d,v18.4s,v6.4s
+
+ b.eq .Lshort_tail
+
+ ////////////////////////////////////////////////////////////////
+ // (hash+inp[0:1])*r^4:r^3 and accumulate
+
+ add v9.2s,v9.2s,v24.2s
+ umlal v22.2d,v11.2s,v1.2s
+ umlal v19.2d,v11.2s,v6.2s
+ umlal v23.2d,v11.2s,v3.2s
+ umlal v20.2d,v11.2s,v8.2s
+ umlal v21.2d,v11.2s,v0.2s
+
+ add v10.2s,v10.2s,v25.2s
+ umlal v22.2d,v9.2s,v5.2s
+ umlal v19.2d,v9.2s,v0.2s
+ umlal v23.2d,v9.2s,v7.2s
+ umlal v20.2d,v9.2s,v1.2s
+ umlal v21.2d,v9.2s,v3.2s
+
+ add v12.2s,v12.2s,v27.2s
+ umlal v22.2d,v10.2s,v3.2s
+ umlal v19.2d,v10.2s,v8.2s
+ umlal v23.2d,v10.2s,v5.2s
+ umlal v20.2d,v10.2s,v0.2s
+ umlal v21.2d,v10.2s,v1.2s
+
+ add v13.2s,v13.2s,v28.2s
+ umlal v22.2d,v12.2s,v0.2s
+ umlal v19.2d,v12.2s,v4.2s
+ umlal v23.2d,v12.2s,v1.2s
+ umlal v20.2d,v12.2s,v6.2s
+ umlal v21.2d,v12.2s,v8.2s
+
+ umlal v22.2d,v13.2s,v8.2s
+ umlal v19.2d,v13.2s,v2.2s
+ umlal v23.2d,v13.2s,v0.2s
+ umlal v20.2d,v13.2s,v4.2s
+ umlal v21.2d,v13.2s,v6.2s
+
+.Lshort_tail:
+ ////////////////////////////////////////////////////////////////
+ // horizontal add
+
+ addp v22.2d,v22.2d,v22.2d
+ ldp d8,d9,[sp,#16] // meet ABI requirements
+ addp v19.2d,v19.2d,v19.2d
+ ldp d10,d11,[sp,#32]
+ addp v23.2d,v23.2d,v23.2d
+ ldp d12,d13,[sp,#48]
+ addp v20.2d,v20.2d,v20.2d
+ ldp d14,d15,[sp,#64]
+ addp v21.2d,v21.2d,v21.2d
+
+ ////////////////////////////////////////////////////////////////
+ // lazy reduction, but without narrowing
+
+ ushr v29.2d,v22.2d,#26
+ and v22.16b,v22.16b,v31.16b
+ ushr v30.2d,v19.2d,#26
+ and v19.16b,v19.16b,v31.16b
+
+ add v23.2d,v23.2d,v29.2d // h3 -> h4
+ add v20.2d,v20.2d,v30.2d // h0 -> h1
+
+ ushr v29.2d,v23.2d,#26
+ and v23.16b,v23.16b,v31.16b
+ ushr v30.2d,v20.2d,#26
+ and v20.16b,v20.16b,v31.16b
+ add v21.2d,v21.2d,v30.2d // h1 -> h2
+
+ add v19.2d,v19.2d,v29.2d
+ shl v29.2d,v29.2d,#2
+ ushr v30.2d,v21.2d,#26
+ and v21.16b,v21.16b,v31.16b
+ add v19.2d,v19.2d,v29.2d // h4 -> h0
+ add v22.2d,v22.2d,v30.2d // h2 -> h3
+
+ ushr v29.2d,v19.2d,#26
+ and v19.16b,v19.16b,v31.16b
+ ushr v30.2d,v22.2d,#26
+ and v22.16b,v22.16b,v31.16b
+ add v20.2d,v20.2d,v29.2d // h0 -> h1
+ add v23.2d,v23.2d,v30.2d // h3 -> h4
+
+ ////////////////////////////////////////////////////////////////
+ // write the result, can be partially reduced
+
+ st4 {v19.s,v20.s,v21.s,v22.s}[0],[x0],#16
+ st1 {v23.s}[0],[x0]
+
+.Lno_data_neon:
+ ldr x29,[sp],#80
+ ret
+.size poly1305_blocks_neon,.-poly1305_blocks_neon
+
+.type poly1305_emit_neon,%function
+.align 5
+poly1305_emit_neon:
+ ldr x17,[x0,#24]
+ cbz x17,poly1305_emit
+
+ ldp w10,w11,[x0] // load hash value base 2^26
+ ldp w12,w13,[x0,#8]
+ ldr w14,[x0,#16]
+
+ add x4,x10,x11,lsl#26 // base 2^26 -> base 2^64
+ lsr x5,x12,#12
+ adds x4,x4,x12,lsl#52
+ add x5,x5,x13,lsl#14
+ adc x5,x5,xzr
+ lsr x6,x14,#24
+ adds x5,x5,x14,lsl#40
+ adc x6,x6,xzr // can be partially reduced...
+
+ ldp x10,x11,[x2] // load nonce
+
+ and x12,x6,#-4 // ... so reduce
+ add x12,x12,x6,lsr#2
+ and x6,x6,#3
+ adds x4,x4,x12
+ adcs x5,x5,xzr
+ adc x6,x6,xzr
+
+ adds x12,x4,#5 // compare to modulus
+ adcs x13,x5,xzr
+ adc x14,x6,xzr
+
+ tst x14,#-4 // see if it's carried/borrowed
+
+ csel x4,x4,x12,eq
+ csel x5,x5,x13,eq
+
+#ifdef __ARMEB__
+ ror x10,x10,#32 // flip nonce words
+ ror x11,x11,#32
+#endif
+ adds x4,x4,x10 // accumulate nonce
+ adc x5,x5,x11
+#ifdef __ARMEB__
+ rev x4,x4 // flip output bytes
+ rev x5,x5
+#endif
+ stp x4,x5,[x1] // write result
+
+ ret
+.size poly1305_emit_neon,.-poly1305_emit_neon
+
+.align 5
+.Lzeros:
+.long 0,0,0,0,0,0,0,0
+.LOPENSSL_armcap_P:
+#ifdef __ILP32__
+.long OPENSSL_armcap_P-.
+#else
+.quad OPENSSL_armcap_P-.
+#endif
+.byte 80,111,108,121,49,51,48,53,32,102,111,114,32,65,82,77,118,56,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
+.align 2
+.align 2
--
2.19.1
^ permalink raw reply related
* [PATCH net-next v8 20/28] zinc: BLAKE2s x86_64 implementation
From: Jason A. Donenfeld @ 2018-10-18 14:57 UTC (permalink / raw)
To: linux-kernel, netdev, linux-crypto, davem, gregkh
Cc: Jason A. Donenfeld, Samuel Neves, Thomas Gleixner, Ingo Molnar,
x86, Jean-Philippe Aumasson, Andy Lutomirski, Andrew Morton,
Linus Torvalds, kernel-hardening
In-Reply-To: <20181018145712.7538-1-Jason@zx2c4.com>
These implementations from Samuel Neves support AVX and AVX-512VL.
Originally this used AVX-512F, but Skylake thermal throttling made
AVX-512VL more attractive and possible to do with negligable difference.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: x86@kernel.org
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
---
lib/zinc/Makefile | 1 +
lib/zinc/blake2s/blake2s-x86_64-glue.c | 71 +++
lib/zinc/blake2s/blake2s-x86_64.S | 685 +++++++++++++++++++++++++
lib/zinc/blake2s/blake2s.c | 4 +
4 files changed, 761 insertions(+)
create mode 100644 lib/zinc/blake2s/blake2s-x86_64-glue.c
create mode 100644 lib/zinc/blake2s/blake2s-x86_64.S
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index d2ec55c33ef0..67ad837c822c 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -23,4 +23,5 @@ zinc_chacha20poly1305-y := chacha20poly1305.o
obj-$(CONFIG_ZINC_CHACHA20POLY1305) += zinc_chacha20poly1305.o
zinc_blake2s-y := blake2s/blake2s.o
+zinc_blake2s-$(CONFIG_ZINC_ARCH_X86_64) += blake2s/blake2s-x86_64.o
obj-$(CONFIG_ZINC_BLAKE2S) += zinc_blake2s.o
diff --git a/lib/zinc/blake2s/blake2s-x86_64-glue.c b/lib/zinc/blake2s/blake2s-x86_64-glue.c
new file mode 100644
index 000000000000..615c38861579
--- /dev/null
+++ b/lib/zinc/blake2s/blake2s-x86_64-glue.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <linux/simd.h>
+#include <asm/cpufeature.h>
+#include <asm/processor.h>
+#include <asm/fpu/api.h>
+
+asmlinkage void blake2s_compress_avx(struct blake2s_state *state,
+ const u8 *block, const size_t nblocks,
+ const u32 inc);
+asmlinkage void blake2s_compress_avx512(struct blake2s_state *state,
+ const u8 *block, const size_t nblocks,
+ const u32 inc);
+
+static bool blake2s_use_avx __ro_after_init;
+static bool blake2s_use_avx512 __ro_after_init;
+static bool *const blake2s_nobs[] __initconst = { &blake2s_use_avx512 };
+
+static void __init blake2s_fpu_init(void)
+{
+ blake2s_use_avx =
+ boot_cpu_has(X86_FEATURE_AVX) &&
+ cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
+ blake2s_use_avx512 =
+ boot_cpu_has(X86_FEATURE_AVX) &&
+ boot_cpu_has(X86_FEATURE_AVX2) &&
+ boot_cpu_has(X86_FEATURE_AVX512F) &&
+ boot_cpu_has(X86_FEATURE_AVX512VL) &&
+ cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
+ XFEATURE_MASK_AVX512, NULL);
+}
+
+static inline bool blake2s_compress_arch(struct blake2s_state *state,
+ const u8 *block, size_t nblocks,
+ const u32 inc)
+{
+ simd_context_t simd_context;
+ bool used_arch = false;
+
+ /* SIMD disables preemption, so relax after processing each page. */
+ BUILD_BUG_ON(PAGE_SIZE / BLAKE2S_BLOCK_SIZE < 8);
+
+ simd_get(&simd_context);
+
+ if (!IS_ENABLED(CONFIG_AS_AVX) || !blake2s_use_avx ||
+ !simd_use(&simd_context))
+ goto out;
+ used_arch = true;
+
+ for (;;) {
+ const size_t blocks = min_t(size_t, nblocks,
+ PAGE_SIZE / BLAKE2S_BLOCK_SIZE);
+
+ if (IS_ENABLED(CONFIG_AS_AVX512) && blake2s_use_avx512)
+ blake2s_compress_avx512(state, block, blocks, inc);
+ else
+ blake2s_compress_avx(state, block, blocks, inc);
+
+ nblocks -= blocks;
+ if (!nblocks)
+ break;
+ block += blocks * BLAKE2S_BLOCK_SIZE;
+ simd_relax(&simd_context);
+ }
+out:
+ simd_put(&simd_context);
+ return used_arch;
+}
diff --git a/lib/zinc/blake2s/blake2s-x86_64.S b/lib/zinc/blake2s/blake2s-x86_64.S
new file mode 100644
index 000000000000..1407a5ffb5c2
--- /dev/null
+++ b/lib/zinc/blake2s/blake2s-x86_64.S
@@ -0,0 +1,685 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ * Copyright (C) 2017 Samuel Neves <sneves@dei.uc.pt>. All Rights Reserved.
+ */
+
+#include <linux/linkage.h>
+
+.section .rodata.cst32.BLAKE2S_IV, "aM", @progbits, 32
+.align 32
+IV: .octa 0xA54FF53A3C6EF372BB67AE856A09E667
+ .octa 0x5BE0CD191F83D9AB9B05688C510E527F
+.section .rodata.cst16.ROT16, "aM", @progbits, 16
+.align 16
+ROT16: .octa 0x0D0C0F0E09080B0A0504070601000302
+.section .rodata.cst16.ROR328, "aM", @progbits, 16
+.align 16
+ROR328: .octa 0x0C0F0E0D080B0A090407060500030201
+#ifdef CONFIG_AS_AVX512
+.section .rodata.cst64.BLAKE2S_SIGMA, "aM", @progbits, 640
+.align 64
+SIGMA:
+.long 0, 2, 4, 6, 1, 3, 5, 7, 8, 10, 12, 14, 9, 11, 13, 15
+.long 11, 2, 12, 14, 9, 8, 15, 3, 4, 0, 13, 6, 10, 1, 7, 5
+.long 10, 12, 11, 6, 5, 9, 13, 3, 4, 15, 14, 2, 0, 7, 8, 1
+.long 10, 9, 7, 0, 11, 14, 1, 12, 6, 2, 15, 3, 13, 8, 5, 4
+.long 4, 9, 8, 13, 14, 0, 10, 11, 7, 3, 12, 1, 5, 6, 15, 2
+.long 2, 10, 4, 14, 13, 3, 9, 11, 6, 5, 7, 12, 15, 1, 8, 0
+.long 4, 11, 14, 8, 13, 10, 12, 5, 2, 1, 15, 3, 9, 7, 0, 6
+.long 6, 12, 0, 13, 15, 2, 1, 10, 4, 5, 11, 14, 8, 3, 9, 7
+.long 14, 5, 4, 12, 9, 7, 3, 10, 2, 0, 6, 15, 11, 1, 13, 8
+.long 11, 7, 13, 10, 12, 14, 0, 15, 4, 5, 6, 9, 2, 1, 8, 3
+#endif /* CONFIG_AS_AVX512 */
+
+.text
+#ifdef CONFIG_AS_AVX
+ENTRY(blake2s_compress_avx)
+ movl %ecx, %ecx
+ testq %rdx, %rdx
+ je .Lendofloop
+ .align 32
+.Lbeginofloop:
+ addq %rcx, 32(%rdi)
+ vmovdqu IV+16(%rip), %xmm1
+ vmovdqu (%rsi), %xmm4
+ vpxor 32(%rdi), %xmm1, %xmm1
+ vmovdqu 16(%rsi), %xmm3
+ vshufps $136, %xmm3, %xmm4, %xmm6
+ vmovdqa ROT16(%rip), %xmm7
+ vpaddd (%rdi), %xmm6, %xmm6
+ vpaddd 16(%rdi), %xmm6, %xmm6
+ vpxor %xmm6, %xmm1, %xmm1
+ vmovdqu IV(%rip), %xmm8
+ vpshufb %xmm7, %xmm1, %xmm1
+ vmovdqu 48(%rsi), %xmm5
+ vpaddd %xmm1, %xmm8, %xmm8
+ vpxor 16(%rdi), %xmm8, %xmm9
+ vmovdqu 32(%rsi), %xmm2
+ vpblendw $12, %xmm3, %xmm5, %xmm13
+ vshufps $221, %xmm5, %xmm2, %xmm12
+ vpunpckhqdq %xmm2, %xmm4, %xmm14
+ vpslld $20, %xmm9, %xmm0
+ vpsrld $12, %xmm9, %xmm9
+ vpxor %xmm0, %xmm9, %xmm0
+ vshufps $221, %xmm3, %xmm4, %xmm9
+ vpaddd %xmm9, %xmm6, %xmm9
+ vpaddd %xmm0, %xmm9, %xmm9
+ vpxor %xmm9, %xmm1, %xmm1
+ vmovdqa ROR328(%rip), %xmm6
+ vpshufb %xmm6, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm8, %xmm8
+ vpxor %xmm8, %xmm0, %xmm0
+ vpshufd $147, %xmm1, %xmm1
+ vpshufd $78, %xmm8, %xmm8
+ vpslld $25, %xmm0, %xmm10
+ vpsrld $7, %xmm0, %xmm0
+ vpxor %xmm10, %xmm0, %xmm0
+ vshufps $136, %xmm5, %xmm2, %xmm10
+ vpshufd $57, %xmm0, %xmm0
+ vpaddd %xmm10, %xmm9, %xmm9
+ vpaddd %xmm0, %xmm9, %xmm9
+ vpxor %xmm9, %xmm1, %xmm1
+ vpaddd %xmm12, %xmm9, %xmm9
+ vpblendw $12, %xmm2, %xmm3, %xmm12
+ vpshufb %xmm7, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm8, %xmm8
+ vpxor %xmm8, %xmm0, %xmm10
+ vpslld $20, %xmm10, %xmm0
+ vpsrld $12, %xmm10, %xmm10
+ vpxor %xmm0, %xmm10, %xmm0
+ vpaddd %xmm0, %xmm9, %xmm9
+ vpxor %xmm9, %xmm1, %xmm1
+ vpshufb %xmm6, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm8, %xmm8
+ vpxor %xmm8, %xmm0, %xmm0
+ vpshufd $57, %xmm1, %xmm1
+ vpshufd $78, %xmm8, %xmm8
+ vpslld $25, %xmm0, %xmm10
+ vpsrld $7, %xmm0, %xmm0
+ vpxor %xmm10, %xmm0, %xmm0
+ vpslldq $4, %xmm5, %xmm10
+ vpblendw $240, %xmm10, %xmm12, %xmm12
+ vpshufd $147, %xmm0, %xmm0
+ vpshufd $147, %xmm12, %xmm12
+ vpaddd %xmm9, %xmm12, %xmm12
+ vpaddd %xmm0, %xmm12, %xmm12
+ vpxor %xmm12, %xmm1, %xmm1
+ vpshufb %xmm7, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm8, %xmm8
+ vpxor %xmm8, %xmm0, %xmm11
+ vpslld $20, %xmm11, %xmm9
+ vpsrld $12, %xmm11, %xmm11
+ vpxor %xmm9, %xmm11, %xmm0
+ vpshufd $8, %xmm2, %xmm9
+ vpblendw $192, %xmm5, %xmm3, %xmm11
+ vpblendw $240, %xmm11, %xmm9, %xmm9
+ vpshufd $177, %xmm9, %xmm9
+ vpaddd %xmm12, %xmm9, %xmm9
+ vpaddd %xmm0, %xmm9, %xmm11
+ vpxor %xmm11, %xmm1, %xmm1
+ vpshufb %xmm6, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm8, %xmm8
+ vpxor %xmm8, %xmm0, %xmm9
+ vpshufd $147, %xmm1, %xmm1
+ vpshufd $78, %xmm8, %xmm8
+ vpslld $25, %xmm9, %xmm0
+ vpsrld $7, %xmm9, %xmm9
+ vpxor %xmm0, %xmm9, %xmm0
+ vpslldq $4, %xmm3, %xmm9
+ vpblendw $48, %xmm9, %xmm2, %xmm9
+ vpblendw $240, %xmm9, %xmm4, %xmm9
+ vpshufd $57, %xmm0, %xmm0
+ vpshufd $177, %xmm9, %xmm9
+ vpaddd %xmm11, %xmm9, %xmm9
+ vpaddd %xmm0, %xmm9, %xmm9
+ vpxor %xmm9, %xmm1, %xmm1
+ vpshufb %xmm7, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm8, %xmm11
+ vpxor %xmm11, %xmm0, %xmm0
+ vpslld $20, %xmm0, %xmm8
+ vpsrld $12, %xmm0, %xmm0
+ vpxor %xmm8, %xmm0, %xmm0
+ vpunpckhdq %xmm3, %xmm4, %xmm8
+ vpblendw $12, %xmm10, %xmm8, %xmm12
+ vpshufd $177, %xmm12, %xmm12
+ vpaddd %xmm9, %xmm12, %xmm9
+ vpaddd %xmm0, %xmm9, %xmm9
+ vpxor %xmm9, %xmm1, %xmm1
+ vpshufb %xmm6, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm11, %xmm11
+ vpxor %xmm11, %xmm0, %xmm0
+ vpshufd $57, %xmm1, %xmm1
+ vpshufd $78, %xmm11, %xmm11
+ vpslld $25, %xmm0, %xmm12
+ vpsrld $7, %xmm0, %xmm0
+ vpxor %xmm12, %xmm0, %xmm0
+ vpunpckhdq %xmm5, %xmm2, %xmm12
+ vpshufd $147, %xmm0, %xmm0
+ vpblendw $15, %xmm13, %xmm12, %xmm12
+ vpslldq $8, %xmm5, %xmm13
+ vpshufd $210, %xmm12, %xmm12
+ vpaddd %xmm9, %xmm12, %xmm9
+ vpaddd %xmm0, %xmm9, %xmm9
+ vpxor %xmm9, %xmm1, %xmm1
+ vpshufb %xmm7, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm11, %xmm11
+ vpxor %xmm11, %xmm0, %xmm0
+ vpslld $20, %xmm0, %xmm12
+ vpsrld $12, %xmm0, %xmm0
+ vpxor %xmm12, %xmm0, %xmm0
+ vpunpckldq %xmm4, %xmm2, %xmm12
+ vpblendw $240, %xmm4, %xmm12, %xmm12
+ vpblendw $192, %xmm13, %xmm12, %xmm12
+ vpsrldq $12, %xmm3, %xmm13
+ vpaddd %xmm12, %xmm9, %xmm9
+ vpaddd %xmm0, %xmm9, %xmm9
+ vpxor %xmm9, %xmm1, %xmm1
+ vpshufb %xmm6, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm11, %xmm11
+ vpxor %xmm11, %xmm0, %xmm0
+ vpshufd $147, %xmm1, %xmm1
+ vpshufd $78, %xmm11, %xmm11
+ vpslld $25, %xmm0, %xmm12
+ vpsrld $7, %xmm0, %xmm0
+ vpxor %xmm12, %xmm0, %xmm0
+ vpblendw $60, %xmm2, %xmm4, %xmm12
+ vpblendw $3, %xmm13, %xmm12, %xmm12
+ vpshufd $57, %xmm0, %xmm0
+ vpshufd $78, %xmm12, %xmm12
+ vpaddd %xmm9, %xmm12, %xmm9
+ vpaddd %xmm0, %xmm9, %xmm9
+ vpxor %xmm9, %xmm1, %xmm1
+ vpshufb %xmm7, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm11, %xmm11
+ vpxor %xmm11, %xmm0, %xmm12
+ vpslld $20, %xmm12, %xmm13
+ vpsrld $12, %xmm12, %xmm0
+ vpblendw $51, %xmm3, %xmm4, %xmm12
+ vpxor %xmm13, %xmm0, %xmm0
+ vpblendw $192, %xmm10, %xmm12, %xmm10
+ vpslldq $8, %xmm2, %xmm12
+ vpshufd $27, %xmm10, %xmm10
+ vpaddd %xmm9, %xmm10, %xmm9
+ vpaddd %xmm0, %xmm9, %xmm9
+ vpxor %xmm9, %xmm1, %xmm1
+ vpshufb %xmm6, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm11, %xmm11
+ vpxor %xmm11, %xmm0, %xmm0
+ vpshufd $57, %xmm1, %xmm1
+ vpshufd $78, %xmm11, %xmm11
+ vpslld $25, %xmm0, %xmm10
+ vpsrld $7, %xmm0, %xmm0
+ vpxor %xmm10, %xmm0, %xmm0
+ vpunpckhdq %xmm2, %xmm8, %xmm10
+ vpshufd $147, %xmm0, %xmm0
+ vpblendw $12, %xmm5, %xmm10, %xmm10
+ vpshufd $210, %xmm10, %xmm10
+ vpaddd %xmm9, %xmm10, %xmm9
+ vpaddd %xmm0, %xmm9, %xmm9
+ vpxor %xmm9, %xmm1, %xmm1
+ vpshufb %xmm7, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm11, %xmm11
+ vpxor %xmm11, %xmm0, %xmm10
+ vpslld $20, %xmm10, %xmm0
+ vpsrld $12, %xmm10, %xmm10
+ vpxor %xmm0, %xmm10, %xmm0
+ vpblendw $12, %xmm4, %xmm5, %xmm10
+ vpblendw $192, %xmm12, %xmm10, %xmm10
+ vpunpckldq %xmm2, %xmm4, %xmm12
+ vpshufd $135, %xmm10, %xmm10
+ vpaddd %xmm9, %xmm10, %xmm9
+ vpaddd %xmm0, %xmm9, %xmm9
+ vpxor %xmm9, %xmm1, %xmm1
+ vpshufb %xmm6, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm11, %xmm13
+ vpxor %xmm13, %xmm0, %xmm0
+ vpshufd $147, %xmm1, %xmm1
+ vpshufd $78, %xmm13, %xmm13
+ vpslld $25, %xmm0, %xmm10
+ vpsrld $7, %xmm0, %xmm0
+ vpxor %xmm10, %xmm0, %xmm0
+ vpblendw $15, %xmm3, %xmm4, %xmm10
+ vpblendw $192, %xmm5, %xmm10, %xmm10
+ vpshufd $57, %xmm0, %xmm0
+ vpshufd $198, %xmm10, %xmm10
+ vpaddd %xmm9, %xmm10, %xmm10
+ vpaddd %xmm0, %xmm10, %xmm10
+ vpxor %xmm10, %xmm1, %xmm1
+ vpshufb %xmm7, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm13, %xmm13
+ vpxor %xmm13, %xmm0, %xmm9
+ vpslld $20, %xmm9, %xmm0
+ vpsrld $12, %xmm9, %xmm9
+ vpxor %xmm0, %xmm9, %xmm0
+ vpunpckhdq %xmm2, %xmm3, %xmm9
+ vpunpcklqdq %xmm12, %xmm9, %xmm15
+ vpunpcklqdq %xmm12, %xmm8, %xmm12
+ vpblendw $15, %xmm5, %xmm8, %xmm8
+ vpaddd %xmm15, %xmm10, %xmm15
+ vpaddd %xmm0, %xmm15, %xmm15
+ vpxor %xmm15, %xmm1, %xmm1
+ vpshufd $141, %xmm8, %xmm8
+ vpshufb %xmm6, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm13, %xmm13
+ vpxor %xmm13, %xmm0, %xmm0
+ vpshufd $57, %xmm1, %xmm1
+ vpshufd $78, %xmm13, %xmm13
+ vpslld $25, %xmm0, %xmm10
+ vpsrld $7, %xmm0, %xmm0
+ vpxor %xmm10, %xmm0, %xmm0
+ vpunpcklqdq %xmm2, %xmm3, %xmm10
+ vpshufd $147, %xmm0, %xmm0
+ vpblendw $51, %xmm14, %xmm10, %xmm14
+ vpshufd $135, %xmm14, %xmm14
+ vpaddd %xmm15, %xmm14, %xmm14
+ vpaddd %xmm0, %xmm14, %xmm14
+ vpxor %xmm14, %xmm1, %xmm1
+ vpunpcklqdq %xmm3, %xmm4, %xmm15
+ vpshufb %xmm7, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm13, %xmm13
+ vpxor %xmm13, %xmm0, %xmm0
+ vpslld $20, %xmm0, %xmm11
+ vpsrld $12, %xmm0, %xmm0
+ vpxor %xmm11, %xmm0, %xmm0
+ vpunpckhqdq %xmm5, %xmm3, %xmm11
+ vpblendw $51, %xmm15, %xmm11, %xmm11
+ vpunpckhqdq %xmm3, %xmm5, %xmm15
+ vpaddd %xmm11, %xmm14, %xmm11
+ vpaddd %xmm0, %xmm11, %xmm11
+ vpxor %xmm11, %xmm1, %xmm1
+ vpshufb %xmm6, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm13, %xmm13
+ vpxor %xmm13, %xmm0, %xmm0
+ vpshufd $147, %xmm1, %xmm1
+ vpshufd $78, %xmm13, %xmm13
+ vpslld $25, %xmm0, %xmm14
+ vpsrld $7, %xmm0, %xmm0
+ vpxor %xmm14, %xmm0, %xmm14
+ vpunpckhqdq %xmm4, %xmm2, %xmm0
+ vpshufd $57, %xmm14, %xmm14
+ vpblendw $51, %xmm15, %xmm0, %xmm15
+ vpaddd %xmm15, %xmm11, %xmm15
+ vpaddd %xmm14, %xmm15, %xmm15
+ vpxor %xmm15, %xmm1, %xmm1
+ vpshufb %xmm7, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm13, %xmm13
+ vpxor %xmm13, %xmm14, %xmm14
+ vpslld $20, %xmm14, %xmm11
+ vpsrld $12, %xmm14, %xmm14
+ vpxor %xmm11, %xmm14, %xmm14
+ vpblendw $3, %xmm2, %xmm4, %xmm11
+ vpslldq $8, %xmm11, %xmm0
+ vpblendw $15, %xmm5, %xmm0, %xmm0
+ vpshufd $99, %xmm0, %xmm0
+ vpaddd %xmm15, %xmm0, %xmm15
+ vpaddd %xmm14, %xmm15, %xmm15
+ vpxor %xmm15, %xmm1, %xmm0
+ vpaddd %xmm12, %xmm15, %xmm15
+ vpshufb %xmm6, %xmm0, %xmm0
+ vpaddd %xmm0, %xmm13, %xmm13
+ vpxor %xmm13, %xmm14, %xmm14
+ vpshufd $57, %xmm0, %xmm0
+ vpshufd $78, %xmm13, %xmm13
+ vpslld $25, %xmm14, %xmm1
+ vpsrld $7, %xmm14, %xmm14
+ vpxor %xmm1, %xmm14, %xmm14
+ vpblendw $3, %xmm5, %xmm4, %xmm1
+ vpshufd $147, %xmm14, %xmm14
+ vpaddd %xmm14, %xmm15, %xmm15
+ vpxor %xmm15, %xmm0, %xmm0
+ vpshufb %xmm7, %xmm0, %xmm0
+ vpaddd %xmm0, %xmm13, %xmm13
+ vpxor %xmm13, %xmm14, %xmm14
+ vpslld $20, %xmm14, %xmm12
+ vpsrld $12, %xmm14, %xmm14
+ vpxor %xmm12, %xmm14, %xmm14
+ vpsrldq $4, %xmm2, %xmm12
+ vpblendw $60, %xmm12, %xmm1, %xmm1
+ vpaddd %xmm1, %xmm15, %xmm15
+ vpaddd %xmm14, %xmm15, %xmm15
+ vpxor %xmm15, %xmm0, %xmm0
+ vpblendw $12, %xmm4, %xmm3, %xmm1
+ vpshufb %xmm6, %xmm0, %xmm0
+ vpaddd %xmm0, %xmm13, %xmm13
+ vpxor %xmm13, %xmm14, %xmm14
+ vpshufd $147, %xmm0, %xmm0
+ vpshufd $78, %xmm13, %xmm13
+ vpslld $25, %xmm14, %xmm12
+ vpsrld $7, %xmm14, %xmm14
+ vpxor %xmm12, %xmm14, %xmm14
+ vpsrldq $4, %xmm5, %xmm12
+ vpblendw $48, %xmm12, %xmm1, %xmm1
+ vpshufd $33, %xmm5, %xmm12
+ vpshufd $57, %xmm14, %xmm14
+ vpshufd $108, %xmm1, %xmm1
+ vpblendw $51, %xmm12, %xmm10, %xmm12
+ vpaddd %xmm15, %xmm1, %xmm15
+ vpaddd %xmm14, %xmm15, %xmm15
+ vpxor %xmm15, %xmm0, %xmm0
+ vpaddd %xmm12, %xmm15, %xmm15
+ vpshufb %xmm7, %xmm0, %xmm0
+ vpaddd %xmm0, %xmm13, %xmm1
+ vpxor %xmm1, %xmm14, %xmm14
+ vpslld $20, %xmm14, %xmm13
+ vpsrld $12, %xmm14, %xmm14
+ vpxor %xmm13, %xmm14, %xmm14
+ vpslldq $12, %xmm3, %xmm13
+ vpaddd %xmm14, %xmm15, %xmm15
+ vpxor %xmm15, %xmm0, %xmm0
+ vpshufb %xmm6, %xmm0, %xmm0
+ vpaddd %xmm0, %xmm1, %xmm1
+ vpxor %xmm1, %xmm14, %xmm14
+ vpshufd $57, %xmm0, %xmm0
+ vpshufd $78, %xmm1, %xmm1
+ vpslld $25, %xmm14, %xmm12
+ vpsrld $7, %xmm14, %xmm14
+ vpxor %xmm12, %xmm14, %xmm14
+ vpblendw $51, %xmm5, %xmm4, %xmm12
+ vpshufd $147, %xmm14, %xmm14
+ vpblendw $192, %xmm13, %xmm12, %xmm12
+ vpaddd %xmm12, %xmm15, %xmm15
+ vpaddd %xmm14, %xmm15, %xmm15
+ vpxor %xmm15, %xmm0, %xmm0
+ vpsrldq $4, %xmm3, %xmm12
+ vpshufb %xmm7, %xmm0, %xmm0
+ vpaddd %xmm0, %xmm1, %xmm1
+ vpxor %xmm1, %xmm14, %xmm14
+ vpslld $20, %xmm14, %xmm13
+ vpsrld $12, %xmm14, %xmm14
+ vpxor %xmm13, %xmm14, %xmm14
+ vpblendw $48, %xmm2, %xmm5, %xmm13
+ vpblendw $3, %xmm12, %xmm13, %xmm13
+ vpshufd $156, %xmm13, %xmm13
+ vpaddd %xmm15, %xmm13, %xmm15
+ vpaddd %xmm14, %xmm15, %xmm15
+ vpxor %xmm15, %xmm0, %xmm0
+ vpshufb %xmm6, %xmm0, %xmm0
+ vpaddd %xmm0, %xmm1, %xmm1
+ vpxor %xmm1, %xmm14, %xmm14
+ vpshufd $147, %xmm0, %xmm0
+ vpshufd $78, %xmm1, %xmm1
+ vpslld $25, %xmm14, %xmm13
+ vpsrld $7, %xmm14, %xmm14
+ vpxor %xmm13, %xmm14, %xmm14
+ vpunpcklqdq %xmm2, %xmm4, %xmm13
+ vpshufd $57, %xmm14, %xmm14
+ vpblendw $12, %xmm12, %xmm13, %xmm12
+ vpshufd $180, %xmm12, %xmm12
+ vpaddd %xmm15, %xmm12, %xmm15
+ vpaddd %xmm14, %xmm15, %xmm15
+ vpxor %xmm15, %xmm0, %xmm0
+ vpshufb %xmm7, %xmm0, %xmm0
+ vpaddd %xmm0, %xmm1, %xmm1
+ vpxor %xmm1, %xmm14, %xmm14
+ vpslld $20, %xmm14, %xmm12
+ vpsrld $12, %xmm14, %xmm14
+ vpxor %xmm12, %xmm14, %xmm14
+ vpunpckhqdq %xmm9, %xmm4, %xmm12
+ vpshufd $198, %xmm12, %xmm12
+ vpaddd %xmm15, %xmm12, %xmm15
+ vpaddd %xmm14, %xmm15, %xmm15
+ vpxor %xmm15, %xmm0, %xmm0
+ vpaddd %xmm15, %xmm8, %xmm15
+ vpshufb %xmm6, %xmm0, %xmm0
+ vpaddd %xmm0, %xmm1, %xmm1
+ vpxor %xmm1, %xmm14, %xmm14
+ vpshufd $57, %xmm0, %xmm0
+ vpshufd $78, %xmm1, %xmm1
+ vpslld $25, %xmm14, %xmm12
+ vpsrld $7, %xmm14, %xmm14
+ vpxor %xmm12, %xmm14, %xmm14
+ vpsrldq $4, %xmm4, %xmm12
+ vpshufd $147, %xmm14, %xmm14
+ vpaddd %xmm14, %xmm15, %xmm15
+ vpxor %xmm15, %xmm0, %xmm0
+ vpshufb %xmm7, %xmm0, %xmm0
+ vpaddd %xmm0, %xmm1, %xmm1
+ vpxor %xmm1, %xmm14, %xmm14
+ vpslld $20, %xmm14, %xmm8
+ vpsrld $12, %xmm14, %xmm14
+ vpxor %xmm14, %xmm8, %xmm14
+ vpblendw $48, %xmm5, %xmm2, %xmm8
+ vpblendw $3, %xmm12, %xmm8, %xmm8
+ vpunpckhqdq %xmm5, %xmm4, %xmm12
+ vpshufd $75, %xmm8, %xmm8
+ vpblendw $60, %xmm10, %xmm12, %xmm10
+ vpaddd %xmm15, %xmm8, %xmm15
+ vpaddd %xmm14, %xmm15, %xmm15
+ vpxor %xmm0, %xmm15, %xmm0
+ vpshufd $45, %xmm10, %xmm10
+ vpshufb %xmm6, %xmm0, %xmm0
+ vpaddd %xmm15, %xmm10, %xmm15
+ vpaddd %xmm0, %xmm1, %xmm1
+ vpxor %xmm1, %xmm14, %xmm14
+ vpshufd $147, %xmm0, %xmm0
+ vpshufd $78, %xmm1, %xmm1
+ vpslld $25, %xmm14, %xmm8
+ vpsrld $7, %xmm14, %xmm14
+ vpxor %xmm14, %xmm8, %xmm8
+ vpshufd $57, %xmm8, %xmm8
+ vpaddd %xmm8, %xmm15, %xmm15
+ vpxor %xmm0, %xmm15, %xmm0
+ vpshufb %xmm7, %xmm0, %xmm0
+ vpaddd %xmm0, %xmm1, %xmm1
+ vpxor %xmm8, %xmm1, %xmm8
+ vpslld $20, %xmm8, %xmm10
+ vpsrld $12, %xmm8, %xmm8
+ vpxor %xmm8, %xmm10, %xmm10
+ vpunpckldq %xmm3, %xmm4, %xmm8
+ vpunpcklqdq %xmm9, %xmm8, %xmm9
+ vpaddd %xmm9, %xmm15, %xmm9
+ vpaddd %xmm10, %xmm9, %xmm9
+ vpxor %xmm0, %xmm9, %xmm8
+ vpshufb %xmm6, %xmm8, %xmm8
+ vpaddd %xmm8, %xmm1, %xmm1
+ vpxor %xmm1, %xmm10, %xmm10
+ vpshufd $57, %xmm8, %xmm8
+ vpshufd $78, %xmm1, %xmm1
+ vpslld $25, %xmm10, %xmm12
+ vpsrld $7, %xmm10, %xmm10
+ vpxor %xmm10, %xmm12, %xmm10
+ vpblendw $48, %xmm4, %xmm3, %xmm12
+ vpshufd $147, %xmm10, %xmm0
+ vpunpckhdq %xmm5, %xmm3, %xmm10
+ vpshufd $78, %xmm12, %xmm12
+ vpunpcklqdq %xmm4, %xmm10, %xmm10
+ vpblendw $192, %xmm2, %xmm10, %xmm10
+ vpshufhw $78, %xmm10, %xmm10
+ vpaddd %xmm10, %xmm9, %xmm10
+ vpaddd %xmm0, %xmm10, %xmm10
+ vpxor %xmm8, %xmm10, %xmm8
+ vpshufb %xmm7, %xmm8, %xmm8
+ vpaddd %xmm8, %xmm1, %xmm1
+ vpxor %xmm0, %xmm1, %xmm9
+ vpslld $20, %xmm9, %xmm0
+ vpsrld $12, %xmm9, %xmm9
+ vpxor %xmm9, %xmm0, %xmm0
+ vpunpckhdq %xmm5, %xmm4, %xmm9
+ vpblendw $240, %xmm9, %xmm2, %xmm13
+ vpshufd $39, %xmm13, %xmm13
+ vpaddd %xmm10, %xmm13, %xmm10
+ vpaddd %xmm0, %xmm10, %xmm10
+ vpxor %xmm8, %xmm10, %xmm8
+ vpblendw $12, %xmm4, %xmm2, %xmm13
+ vpshufb %xmm6, %xmm8, %xmm8
+ vpslldq $4, %xmm13, %xmm13
+ vpblendw $15, %xmm5, %xmm13, %xmm13
+ vpaddd %xmm8, %xmm1, %xmm1
+ vpxor %xmm1, %xmm0, %xmm0
+ vpaddd %xmm13, %xmm10, %xmm13
+ vpshufd $147, %xmm8, %xmm8
+ vpshufd $78, %xmm1, %xmm1
+ vpslld $25, %xmm0, %xmm14
+ vpsrld $7, %xmm0, %xmm0
+ vpxor %xmm0, %xmm14, %xmm14
+ vpshufd $57, %xmm14, %xmm14
+ vpaddd %xmm14, %xmm13, %xmm13
+ vpxor %xmm8, %xmm13, %xmm8
+ vpaddd %xmm13, %xmm12, %xmm12
+ vpshufb %xmm7, %xmm8, %xmm8
+ vpaddd %xmm8, %xmm1, %xmm1
+ vpxor %xmm14, %xmm1, %xmm14
+ vpslld $20, %xmm14, %xmm10
+ vpsrld $12, %xmm14, %xmm14
+ vpxor %xmm14, %xmm10, %xmm10
+ vpaddd %xmm10, %xmm12, %xmm12
+ vpxor %xmm8, %xmm12, %xmm8
+ vpshufb %xmm6, %xmm8, %xmm8
+ vpaddd %xmm8, %xmm1, %xmm1
+ vpxor %xmm1, %xmm10, %xmm0
+ vpshufd $57, %xmm8, %xmm8
+ vpshufd $78, %xmm1, %xmm1
+ vpslld $25, %xmm0, %xmm10
+ vpsrld $7, %xmm0, %xmm0
+ vpxor %xmm0, %xmm10, %xmm10
+ vpblendw $48, %xmm2, %xmm3, %xmm0
+ vpblendw $15, %xmm11, %xmm0, %xmm0
+ vpshufd $147, %xmm10, %xmm10
+ vpshufd $114, %xmm0, %xmm0
+ vpaddd %xmm12, %xmm0, %xmm0
+ vpaddd %xmm10, %xmm0, %xmm0
+ vpxor %xmm8, %xmm0, %xmm8
+ vpshufb %xmm7, %xmm8, %xmm8
+ vpaddd %xmm8, %xmm1, %xmm1
+ vpxor %xmm10, %xmm1, %xmm10
+ vpslld $20, %xmm10, %xmm11
+ vpsrld $12, %xmm10, %xmm10
+ vpxor %xmm10, %xmm11, %xmm10
+ vpslldq $4, %xmm4, %xmm11
+ vpblendw $192, %xmm11, %xmm3, %xmm3
+ vpunpckldq %xmm5, %xmm4, %xmm4
+ vpshufd $99, %xmm3, %xmm3
+ vpaddd %xmm0, %xmm3, %xmm3
+ vpaddd %xmm10, %xmm3, %xmm3
+ vpxor %xmm8, %xmm3, %xmm11
+ vpunpckldq %xmm5, %xmm2, %xmm0
+ vpblendw $192, %xmm2, %xmm5, %xmm2
+ vpshufb %xmm6, %xmm11, %xmm11
+ vpunpckhqdq %xmm0, %xmm9, %xmm0
+ vpblendw $15, %xmm4, %xmm2, %xmm4
+ vpaddd %xmm11, %xmm1, %xmm1
+ vpxor %xmm1, %xmm10, %xmm10
+ vpshufd $147, %xmm11, %xmm11
+ vpshufd $201, %xmm0, %xmm0
+ vpslld $25, %xmm10, %xmm8
+ vpsrld $7, %xmm10, %xmm10
+ vpxor %xmm10, %xmm8, %xmm10
+ vpshufd $78, %xmm1, %xmm1
+ vpaddd %xmm3, %xmm0, %xmm0
+ vpshufd $27, %xmm4, %xmm4
+ vpshufd $57, %xmm10, %xmm10
+ vpaddd %xmm10, %xmm0, %xmm0
+ vpxor %xmm11, %xmm0, %xmm11
+ vpaddd %xmm0, %xmm4, %xmm0
+ vpshufb %xmm7, %xmm11, %xmm7
+ vpaddd %xmm7, %xmm1, %xmm1
+ vpxor %xmm10, %xmm1, %xmm10
+ vpslld $20, %xmm10, %xmm8
+ vpsrld $12, %xmm10, %xmm10
+ vpxor %xmm10, %xmm8, %xmm8
+ vpaddd %xmm8, %xmm0, %xmm0
+ vpxor %xmm7, %xmm0, %xmm7
+ vpshufb %xmm6, %xmm7, %xmm6
+ vpaddd %xmm6, %xmm1, %xmm1
+ vpxor %xmm1, %xmm8, %xmm8
+ vpshufd $78, %xmm1, %xmm1
+ vpshufd $57, %xmm6, %xmm6
+ vpslld $25, %xmm8, %xmm2
+ vpsrld $7, %xmm8, %xmm8
+ vpxor %xmm8, %xmm2, %xmm8
+ vpxor (%rdi), %xmm1, %xmm1
+ vpshufd $147, %xmm8, %xmm8
+ vpxor %xmm0, %xmm1, %xmm0
+ vmovups %xmm0, (%rdi)
+ vpxor 16(%rdi), %xmm8, %xmm0
+ vpxor %xmm6, %xmm0, %xmm6
+ vmovups %xmm6, 16(%rdi)
+ addq $64, %rsi
+ decq %rdx
+ jnz .Lbeginofloop
+.Lendofloop:
+ ret
+ENDPROC(blake2s_compress_avx)
+#endif /* CONFIG_AS_AVX */
+
+#ifdef CONFIG_AS_AVX512
+ENTRY(blake2s_compress_avx512)
+ vmovdqu (%rdi),%xmm0
+ vmovdqu 0x10(%rdi),%xmm1
+ vmovdqu 0x20(%rdi),%xmm4
+ vmovq %rcx,%xmm5
+ vmovdqa IV(%rip),%xmm14
+ vmovdqa IV+16(%rip),%xmm15
+ jmp .Lblake2s_compress_avx512_mainloop
+.align 32
+.Lblake2s_compress_avx512_mainloop:
+ vmovdqa %xmm0,%xmm10
+ vmovdqa %xmm1,%xmm11
+ vpaddq %xmm5,%xmm4,%xmm4
+ vmovdqa %xmm14,%xmm2
+ vpxor %xmm15,%xmm4,%xmm3
+ vmovdqu (%rsi),%ymm6
+ vmovdqu 0x20(%rsi),%ymm7
+ addq $0x40,%rsi
+ leaq SIGMA(%rip),%rax
+ movb $0xa,%cl
+.Lblake2s_compress_avx512_roundloop:
+ addq $0x40,%rax
+ vmovdqa -0x40(%rax),%ymm8
+ vmovdqa -0x20(%rax),%ymm9
+ vpermi2d %ymm7,%ymm6,%ymm8
+ vpermi2d %ymm7,%ymm6,%ymm9
+ vmovdqa %ymm8,%ymm6
+ vmovdqa %ymm9,%ymm7
+ vpaddd %xmm8,%xmm0,%xmm0
+ vpaddd %xmm1,%xmm0,%xmm0
+ vpxor %xmm0,%xmm3,%xmm3
+ vprord $0x10,%xmm3,%xmm3
+ vpaddd %xmm3,%xmm2,%xmm2
+ vpxor %xmm2,%xmm1,%xmm1
+ vprord $0xc,%xmm1,%xmm1
+ vextracti128 $0x1,%ymm8,%xmm8
+ vpaddd %xmm8,%xmm0,%xmm0
+ vpaddd %xmm1,%xmm0,%xmm0
+ vpxor %xmm0,%xmm3,%xmm3
+ vprord $0x8,%xmm3,%xmm3
+ vpaddd %xmm3,%xmm2,%xmm2
+ vpxor %xmm2,%xmm1,%xmm1
+ vprord $0x7,%xmm1,%xmm1
+ vpshufd $0x39,%xmm1,%xmm1
+ vpshufd $0x4e,%xmm2,%xmm2
+ vpshufd $0x93,%xmm3,%xmm3
+ vpaddd %xmm9,%xmm0,%xmm0
+ vpaddd %xmm1,%xmm0,%xmm0
+ vpxor %xmm0,%xmm3,%xmm3
+ vprord $0x10,%xmm3,%xmm3
+ vpaddd %xmm3,%xmm2,%xmm2
+ vpxor %xmm2,%xmm1,%xmm1
+ vprord $0xc,%xmm1,%xmm1
+ vextracti128 $0x1,%ymm9,%xmm9
+ vpaddd %xmm9,%xmm0,%xmm0
+ vpaddd %xmm1,%xmm0,%xmm0
+ vpxor %xmm0,%xmm3,%xmm3
+ vprord $0x8,%xmm3,%xmm3
+ vpaddd %xmm3,%xmm2,%xmm2
+ vpxor %xmm2,%xmm1,%xmm1
+ vprord $0x7,%xmm1,%xmm1
+ vpshufd $0x93,%xmm1,%xmm1
+ vpshufd $0x4e,%xmm2,%xmm2
+ vpshufd $0x39,%xmm3,%xmm3
+ decb %cl
+ jne .Lblake2s_compress_avx512_roundloop
+ vpxor %xmm10,%xmm0,%xmm0
+ vpxor %xmm11,%xmm1,%xmm1
+ vpxor %xmm2,%xmm0,%xmm0
+ vpxor %xmm3,%xmm1,%xmm1
+ decq %rdx
+ jne .Lblake2s_compress_avx512_mainloop
+ vmovdqu %xmm0,(%rdi)
+ vmovdqu %xmm1,0x10(%rdi)
+ vmovdqu %xmm4,0x20(%rdi)
+ vzeroupper
+ retq
+ENDPROC(blake2s_compress_avx512)
+#endif /* CONFIG_AS_AVX512 */
diff --git a/lib/zinc/blake2s/blake2s.c b/lib/zinc/blake2s/blake2s.c
index 58d7e9378bd4..59db1ce2f7ef 100644
--- a/lib/zinc/blake2s/blake2s.c
+++ b/lib/zinc/blake2s/blake2s.c
@@ -110,6 +110,9 @@ void blake2s_init_key(struct blake2s_state *state, const size_t outlen,
}
EXPORT_SYMBOL(blake2s_init_key);
+#if defined(CONFIG_ZINC_ARCH_X86_64)
+#include "blake2s-x86_64-glue.c"
+#else
static bool *const blake2s_nobs[] __initconst = { };
static void __init blake2s_fpu_init(void)
{
@@ -120,6 +123,7 @@ static inline bool blake2s_compress_arch(struct blake2s_state *state,
{
return false;
}
+#endif
static inline void blake2s_compress(struct blake2s_state *state,
const u8 *block, size_t nblocks,
--
2.19.1
^ permalink raw reply related
* [PATCH net-next v8 21/28] zinc: Curve25519 generic C implementations and selftest
From: Jason A. Donenfeld @ 2018-10-18 14:57 UTC (permalink / raw)
To: linux-kernel, netdev, linux-crypto, davem, gregkh
Cc: Jason A. Donenfeld, Karthikeyan Bhargavan, Samuel Neves,
Jean-Philippe Aumasson, Andy Lutomirski, Andrew Morton,
Linus Torvalds, kernel-hardening
In-Reply-To: <20181018145712.7538-1-Jason@zx2c4.com>
This contains two formally verified C implementations of the Curve25519
scalar multiplication function, one for 32-bit systems, and one for
64-bit systems whose compiler supports efficient 128-bit integer types.
Not only are these implementations formally verified, but they are also
the fastest available C implementations. They have been modified to be
friendly to kernel space and to be generally less horrendous looking,
but still an effort has been made to retain their formally verified
characteristic, and so the C might look slightly unidiomatic.
The 64-bit version comes from HACL*: https://github.com/project-everest/hacl-star
The 32-bit version comes from Fiat: https://github.com/mit-plv/fiat-crypto
Information: https://cr.yp.to/ecdh.html
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Karthikeyan Bhargavan <karthik.bhargavan@gmail.com>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
---
include/zinc/curve25519.h | 22 +
lib/zinc/Kconfig | 4 +
lib/zinc/Makefile | 3 +
lib/zinc/curve25519/curve25519-fiat32.c | 860 +++++++++++++++
lib/zinc/curve25519/curve25519-hacl64.c | 784 ++++++++++++++
lib/zinc/curve25519/curve25519.c | 108 ++
lib/zinc/selftest/curve25519.c | 1315 +++++++++++++++++++++++
7 files changed, 3096 insertions(+)
create mode 100644 include/zinc/curve25519.h
create mode 100644 lib/zinc/curve25519/curve25519-fiat32.c
create mode 100644 lib/zinc/curve25519/curve25519-hacl64.c
create mode 100644 lib/zinc/curve25519/curve25519.c
create mode 100644 lib/zinc/selftest/curve25519.c
diff --git a/include/zinc/curve25519.h b/include/zinc/curve25519.h
new file mode 100644
index 000000000000..def173e736fc
--- /dev/null
+++ b/include/zinc/curve25519.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef _ZINC_CURVE25519_H
+#define _ZINC_CURVE25519_H
+
+#include <linux/types.h>
+
+enum curve25519_lengths {
+ CURVE25519_KEY_SIZE = 32
+};
+
+bool __must_check curve25519(u8 mypublic[CURVE25519_KEY_SIZE],
+ const u8 secret[CURVE25519_KEY_SIZE],
+ const u8 basepoint[CURVE25519_KEY_SIZE]);
+void curve25519_generate_secret(u8 secret[CURVE25519_KEY_SIZE]);
+bool __must_check curve25519_generate_public(
+ u8 pub[CURVE25519_KEY_SIZE], const u8 secret[CURVE25519_KEY_SIZE]);
+
+#endif /* _ZINC_CURVE25519_H */
diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig
index 9fc21f93ee9f..f1840c4e9fde 100644
--- a/lib/zinc/Kconfig
+++ b/lib/zinc/Kconfig
@@ -14,6 +14,10 @@ config ZINC_CHACHA20POLY1305
config ZINC_BLAKE2S
tristate
+config ZINC_CURVE25519
+ tristate
+ select CONFIG_CRYPTO
+
config ZINC_SELFTEST
bool "Zinc cryptography library self-tests"
help
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 67ad837c822c..65440438c6e5 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -25,3 +25,6 @@ obj-$(CONFIG_ZINC_CHACHA20POLY1305) += zinc_chacha20poly1305.o
zinc_blake2s-y := blake2s/blake2s.o
zinc_blake2s-$(CONFIG_ZINC_ARCH_X86_64) += blake2s/blake2s-x86_64.o
obj-$(CONFIG_ZINC_BLAKE2S) += zinc_blake2s.o
+
+zinc_curve25519-y := curve25519/curve25519.o
+obj-$(CONFIG_ZINC_CURVE25519) += zinc_curve25519.o
diff --git a/lib/zinc/curve25519/curve25519-fiat32.c b/lib/zinc/curve25519/curve25519-fiat32.c
new file mode 100644
index 000000000000..6076c06f28ed
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-fiat32.c
@@ -0,0 +1,860 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2016 The fiat-crypto Authors.
+ * Copyright (C) 2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * This is a machine-generated formally verified implementation of Curve25519
+ * ECDH from: <https://github.com/mit-plv/fiat-crypto>. Though originally
+ * machine generated, it has been tweaked to be suitable for use in the kernel.
+ * It is optimized for 32-bit machines and machines that cannot work efficiently
+ * with 128-bit integer types.
+ */
+
+/* fe means field element. Here the field is \Z/(2^255-19). An element t,
+ * entries t[0]...t[9], represents the integer t[0]+2^26 t[1]+2^51 t[2]+2^77
+ * t[3]+2^102 t[4]+...+2^230 t[9].
+ * fe limbs are bounded by 1.125*2^26,1.125*2^25,1.125*2^26,1.125*2^25,etc.
+ * Multiplication and carrying produce fe from fe_loose.
+ */
+typedef struct fe { u32 v[10]; } fe;
+
+/* fe_loose limbs are bounded by 3.375*2^26,3.375*2^25,3.375*2^26,3.375*2^25,etc
+ * Addition and subtraction produce fe_loose from (fe, fe).
+ */
+typedef struct fe_loose { u32 v[10]; } fe_loose;
+
+static __always_inline void fe_frombytes_impl(u32 h[10], const u8 *s)
+{
+ /* Ignores top bit of s. */
+ u32 a0 = get_unaligned_le32(s);
+ u32 a1 = get_unaligned_le32(s+4);
+ u32 a2 = get_unaligned_le32(s+8);
+ u32 a3 = get_unaligned_le32(s+12);
+ u32 a4 = get_unaligned_le32(s+16);
+ u32 a5 = get_unaligned_le32(s+20);
+ u32 a6 = get_unaligned_le32(s+24);
+ u32 a7 = get_unaligned_le32(s+28);
+ h[0] = a0&((1<<26)-1); /* 26 used, 32-26 left. 26 */
+ h[1] = (a0>>26) | ((a1&((1<<19)-1))<< 6); /* (32-26) + 19 = 6+19 = 25 */
+ h[2] = (a1>>19) | ((a2&((1<<13)-1))<<13); /* (32-19) + 13 = 13+13 = 26 */
+ h[3] = (a2>>13) | ((a3&((1<< 6)-1))<<19); /* (32-13) + 6 = 19+ 6 = 25 */
+ h[4] = (a3>> 6); /* (32- 6) = 26 */
+ h[5] = a4&((1<<25)-1); /* 25 */
+ h[6] = (a4>>25) | ((a5&((1<<19)-1))<< 7); /* (32-25) + 19 = 7+19 = 26 */
+ h[7] = (a5>>19) | ((a6&((1<<12)-1))<<13); /* (32-19) + 12 = 13+12 = 25 */
+ h[8] = (a6>>12) | ((a7&((1<< 6)-1))<<20); /* (32-12) + 6 = 20+ 6 = 26 */
+ h[9] = (a7>> 6)&((1<<25)-1); /* 25 */
+}
+
+static __always_inline void fe_frombytes(fe *h, const u8 *s)
+{
+ fe_frombytes_impl(h->v, s);
+}
+
+static __always_inline u8 /*bool*/
+addcarryx_u25(u8 /*bool*/ c, u32 a, u32 b, u32 *low)
+{
+ /* This function extracts 25 bits of result and 1 bit of carry
+ * (26 total), so a 32-bit intermediate is sufficient.
+ */
+ u32 x = a + b + c;
+ *low = x & ((1 << 25) - 1);
+ return (x >> 25) & 1;
+}
+
+static __always_inline u8 /*bool*/
+addcarryx_u26(u8 /*bool*/ c, u32 a, u32 b, u32 *low)
+{
+ /* This function extracts 26 bits of result and 1 bit of carry
+ * (27 total), so a 32-bit intermediate is sufficient.
+ */
+ u32 x = a + b + c;
+ *low = x & ((1 << 26) - 1);
+ return (x >> 26) & 1;
+}
+
+static __always_inline u8 /*bool*/
+subborrow_u25(u8 /*bool*/ c, u32 a, u32 b, u32 *low)
+{
+ /* This function extracts 25 bits of result and 1 bit of borrow
+ * (26 total), so a 32-bit intermediate is sufficient.
+ */
+ u32 x = a - b - c;
+ *low = x & ((1 << 25) - 1);
+ return x >> 31;
+}
+
+static __always_inline u8 /*bool*/
+subborrow_u26(u8 /*bool*/ c, u32 a, u32 b, u32 *low)
+{
+ /* This function extracts 26 bits of result and 1 bit of borrow
+ *(27 total), so a 32-bit intermediate is sufficient.
+ */
+ u32 x = a - b - c;
+ *low = x & ((1 << 26) - 1);
+ return x >> 31;
+}
+
+static __always_inline u32 cmovznz32(u32 t, u32 z, u32 nz)
+{
+ t = -!!t; /* all set if nonzero, 0 if 0 */
+ return (t&nz) | ((~t)&z);
+}
+
+static __always_inline void fe_freeze(u32 out[10], const u32 in1[10])
+{
+ { const u32 x17 = in1[9];
+ { const u32 x18 = in1[8];
+ { const u32 x16 = in1[7];
+ { const u32 x14 = in1[6];
+ { const u32 x12 = in1[5];
+ { const u32 x10 = in1[4];
+ { const u32 x8 = in1[3];
+ { const u32 x6 = in1[2];
+ { const u32 x4 = in1[1];
+ { const u32 x2 = in1[0];
+ { u32 x20; u8/*bool*/ x21 = subborrow_u26(0x0, x2, 0x3ffffed, &x20);
+ { u32 x23; u8/*bool*/ x24 = subborrow_u25(x21, x4, 0x1ffffff, &x23);
+ { u32 x26; u8/*bool*/ x27 = subborrow_u26(x24, x6, 0x3ffffff, &x26);
+ { u32 x29; u8/*bool*/ x30 = subborrow_u25(x27, x8, 0x1ffffff, &x29);
+ { u32 x32; u8/*bool*/ x33 = subborrow_u26(x30, x10, 0x3ffffff, &x32);
+ { u32 x35; u8/*bool*/ x36 = subborrow_u25(x33, x12, 0x1ffffff, &x35);
+ { u32 x38; u8/*bool*/ x39 = subborrow_u26(x36, x14, 0x3ffffff, &x38);
+ { u32 x41; u8/*bool*/ x42 = subborrow_u25(x39, x16, 0x1ffffff, &x41);
+ { u32 x44; u8/*bool*/ x45 = subborrow_u26(x42, x18, 0x3ffffff, &x44);
+ { u32 x47; u8/*bool*/ x48 = subborrow_u25(x45, x17, 0x1ffffff, &x47);
+ { u32 x49 = cmovznz32(x48, 0x0, 0xffffffff);
+ { u32 x50 = (x49 & 0x3ffffed);
+ { u32 x52; u8/*bool*/ x53 = addcarryx_u26(0x0, x20, x50, &x52);
+ { u32 x54 = (x49 & 0x1ffffff);
+ { u32 x56; u8/*bool*/ x57 = addcarryx_u25(x53, x23, x54, &x56);
+ { u32 x58 = (x49 & 0x3ffffff);
+ { u32 x60; u8/*bool*/ x61 = addcarryx_u26(x57, x26, x58, &x60);
+ { u32 x62 = (x49 & 0x1ffffff);
+ { u32 x64; u8/*bool*/ x65 = addcarryx_u25(x61, x29, x62, &x64);
+ { u32 x66 = (x49 & 0x3ffffff);
+ { u32 x68; u8/*bool*/ x69 = addcarryx_u26(x65, x32, x66, &x68);
+ { u32 x70 = (x49 & 0x1ffffff);
+ { u32 x72; u8/*bool*/ x73 = addcarryx_u25(x69, x35, x70, &x72);
+ { u32 x74 = (x49 & 0x3ffffff);
+ { u32 x76; u8/*bool*/ x77 = addcarryx_u26(x73, x38, x74, &x76);
+ { u32 x78 = (x49 & 0x1ffffff);
+ { u32 x80; u8/*bool*/ x81 = addcarryx_u25(x77, x41, x78, &x80);
+ { u32 x82 = (x49 & 0x3ffffff);
+ { u32 x84; u8/*bool*/ x85 = addcarryx_u26(x81, x44, x82, &x84);
+ { u32 x86 = (x49 & 0x1ffffff);
+ { u32 x88; addcarryx_u25(x85, x47, x86, &x88);
+ out[0] = x52;
+ out[1] = x56;
+ out[2] = x60;
+ out[3] = x64;
+ out[4] = x68;
+ out[5] = x72;
+ out[6] = x76;
+ out[7] = x80;
+ out[8] = x84;
+ out[9] = x88;
+ }}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}
+}
+
+static __always_inline void fe_tobytes(u8 s[32], const fe *f)
+{
+ u32 h[10];
+ fe_freeze(h, f->v);
+ s[0] = h[0] >> 0;
+ s[1] = h[0] >> 8;
+ s[2] = h[0] >> 16;
+ s[3] = (h[0] >> 24) | (h[1] << 2);
+ s[4] = h[1] >> 6;
+ s[5] = h[1] >> 14;
+ s[6] = (h[1] >> 22) | (h[2] << 3);
+ s[7] = h[2] >> 5;
+ s[8] = h[2] >> 13;
+ s[9] = (h[2] >> 21) | (h[3] << 5);
+ s[10] = h[3] >> 3;
+ s[11] = h[3] >> 11;
+ s[12] = (h[3] >> 19) | (h[4] << 6);
+ s[13] = h[4] >> 2;
+ s[14] = h[4] >> 10;
+ s[15] = h[4] >> 18;
+ s[16] = h[5] >> 0;
+ s[17] = h[5] >> 8;
+ s[18] = h[5] >> 16;
+ s[19] = (h[5] >> 24) | (h[6] << 1);
+ s[20] = h[6] >> 7;
+ s[21] = h[6] >> 15;
+ s[22] = (h[6] >> 23) | (h[7] << 3);
+ s[23] = h[7] >> 5;
+ s[24] = h[7] >> 13;
+ s[25] = (h[7] >> 21) | (h[8] << 4);
+ s[26] = h[8] >> 4;
+ s[27] = h[8] >> 12;
+ s[28] = (h[8] >> 20) | (h[9] << 6);
+ s[29] = h[9] >> 2;
+ s[30] = h[9] >> 10;
+ s[31] = h[9] >> 18;
+}
+
+/* h = f */
+static __always_inline void fe_copy(fe *h, const fe *f)
+{
+ memmove(h, f, sizeof(u32) * 10);
+}
+
+static __always_inline void fe_copy_lt(fe_loose *h, const fe *f)
+{
+ memmove(h, f, sizeof(u32) * 10);
+}
+
+/* h = 0 */
+static __always_inline void fe_0(fe *h)
+{
+ memset(h, 0, sizeof(u32) * 10);
+}
+
+/* h = 1 */
+static __always_inline void fe_1(fe *h)
+{
+ memset(h, 0, sizeof(u32) * 10);
+ h->v[0] = 1;
+}
+
+static void fe_add_impl(u32 out[10], const u32 in1[10], const u32 in2[10])
+{
+ { const u32 x20 = in1[9];
+ { const u32 x21 = in1[8];
+ { const u32 x19 = in1[7];
+ { const u32 x17 = in1[6];
+ { const u32 x15 = in1[5];
+ { const u32 x13 = in1[4];
+ { const u32 x11 = in1[3];
+ { const u32 x9 = in1[2];
+ { const u32 x7 = in1[1];
+ { const u32 x5 = in1[0];
+ { const u32 x38 = in2[9];
+ { const u32 x39 = in2[8];
+ { const u32 x37 = in2[7];
+ { const u32 x35 = in2[6];
+ { const u32 x33 = in2[5];
+ { const u32 x31 = in2[4];
+ { const u32 x29 = in2[3];
+ { const u32 x27 = in2[2];
+ { const u32 x25 = in2[1];
+ { const u32 x23 = in2[0];
+ out[0] = (x5 + x23);
+ out[1] = (x7 + x25);
+ out[2] = (x9 + x27);
+ out[3] = (x11 + x29);
+ out[4] = (x13 + x31);
+ out[5] = (x15 + x33);
+ out[6] = (x17 + x35);
+ out[7] = (x19 + x37);
+ out[8] = (x21 + x39);
+ out[9] = (x20 + x38);
+ }}}}}}}}}}}}}}}}}}}}
+}
+
+/* h = f + g
+ * Can overlap h with f or g.
+ */
+static __always_inline void fe_add(fe_loose *h, const fe *f, const fe *g)
+{
+ fe_add_impl(h->v, f->v, g->v);
+}
+
+static void fe_sub_impl(u32 out[10], const u32 in1[10], const u32 in2[10])
+{
+ { const u32 x20 = in1[9];
+ { const u32 x21 = in1[8];
+ { const u32 x19 = in1[7];
+ { const u32 x17 = in1[6];
+ { const u32 x15 = in1[5];
+ { const u32 x13 = in1[4];
+ { const u32 x11 = in1[3];
+ { const u32 x9 = in1[2];
+ { const u32 x7 = in1[1];
+ { const u32 x5 = in1[0];
+ { const u32 x38 = in2[9];
+ { const u32 x39 = in2[8];
+ { const u32 x37 = in2[7];
+ { const u32 x35 = in2[6];
+ { const u32 x33 = in2[5];
+ { const u32 x31 = in2[4];
+ { const u32 x29 = in2[3];
+ { const u32 x27 = in2[2];
+ { const u32 x25 = in2[1];
+ { const u32 x23 = in2[0];
+ out[0] = ((0x7ffffda + x5) - x23);
+ out[1] = ((0x3fffffe + x7) - x25);
+ out[2] = ((0x7fffffe + x9) - x27);
+ out[3] = ((0x3fffffe + x11) - x29);
+ out[4] = ((0x7fffffe + x13) - x31);
+ out[5] = ((0x3fffffe + x15) - x33);
+ out[6] = ((0x7fffffe + x17) - x35);
+ out[7] = ((0x3fffffe + x19) - x37);
+ out[8] = ((0x7fffffe + x21) - x39);
+ out[9] = ((0x3fffffe + x20) - x38);
+ }}}}}}}}}}}}}}}}}}}}
+}
+
+/* h = f - g
+ * Can overlap h with f or g.
+ */
+static __always_inline void fe_sub(fe_loose *h, const fe *f, const fe *g)
+{
+ fe_sub_impl(h->v, f->v, g->v);
+}
+
+static void fe_mul_impl(u32 out[10], const u32 in1[10], const u32 in2[10])
+{
+ { const u32 x20 = in1[9];
+ { const u32 x21 = in1[8];
+ { const u32 x19 = in1[7];
+ { const u32 x17 = in1[6];
+ { const u32 x15 = in1[5];
+ { const u32 x13 = in1[4];
+ { const u32 x11 = in1[3];
+ { const u32 x9 = in1[2];
+ { const u32 x7 = in1[1];
+ { const u32 x5 = in1[0];
+ { const u32 x38 = in2[9];
+ { const u32 x39 = in2[8];
+ { const u32 x37 = in2[7];
+ { const u32 x35 = in2[6];
+ { const u32 x33 = in2[5];
+ { const u32 x31 = in2[4];
+ { const u32 x29 = in2[3];
+ { const u32 x27 = in2[2];
+ { const u32 x25 = in2[1];
+ { const u32 x23 = in2[0];
+ { u64 x40 = ((u64)x23 * x5);
+ { u64 x41 = (((u64)x23 * x7) + ((u64)x25 * x5));
+ { u64 x42 = ((((u64)(0x2 * x25) * x7) + ((u64)x23 * x9)) + ((u64)x27 * x5));
+ { u64 x43 = (((((u64)x25 * x9) + ((u64)x27 * x7)) + ((u64)x23 * x11)) + ((u64)x29 * x5));
+ { u64 x44 = (((((u64)x27 * x9) + (0x2 * (((u64)x25 * x11) + ((u64)x29 * x7)))) + ((u64)x23 * x13)) + ((u64)x31 * x5));
+ { u64 x45 = (((((((u64)x27 * x11) + ((u64)x29 * x9)) + ((u64)x25 * x13)) + ((u64)x31 * x7)) + ((u64)x23 * x15)) + ((u64)x33 * x5));
+ { u64 x46 = (((((0x2 * ((((u64)x29 * x11) + ((u64)x25 * x15)) + ((u64)x33 * x7))) + ((u64)x27 * x13)) + ((u64)x31 * x9)) + ((u64)x23 * x17)) + ((u64)x35 * x5));
+ { u64 x47 = (((((((((u64)x29 * x13) + ((u64)x31 * x11)) + ((u64)x27 * x15)) + ((u64)x33 * x9)) + ((u64)x25 * x17)) + ((u64)x35 * x7)) + ((u64)x23 * x19)) + ((u64)x37 * x5));
+ { u64 x48 = (((((((u64)x31 * x13) + (0x2 * (((((u64)x29 * x15) + ((u64)x33 * x11)) + ((u64)x25 * x19)) + ((u64)x37 * x7)))) + ((u64)x27 * x17)) + ((u64)x35 * x9)) + ((u64)x23 * x21)) + ((u64)x39 * x5));
+ { u64 x49 = (((((((((((u64)x31 * x15) + ((u64)x33 * x13)) + ((u64)x29 * x17)) + ((u64)x35 * x11)) + ((u64)x27 * x19)) + ((u64)x37 * x9)) + ((u64)x25 * x21)) + ((u64)x39 * x7)) + ((u64)x23 * x20)) + ((u64)x38 * x5));
+ { u64 x50 = (((((0x2 * ((((((u64)x33 * x15) + ((u64)x29 * x19)) + ((u64)x37 * x11)) + ((u64)x25 * x20)) + ((u64)x38 * x7))) + ((u64)x31 * x17)) + ((u64)x35 * x13)) + ((u64)x27 * x21)) + ((u64)x39 * x9));
+ { u64 x51 = (((((((((u64)x33 * x17) + ((u64)x35 * x15)) + ((u64)x31 * x19)) + ((u64)x37 * x13)) + ((u64)x29 * x21)) + ((u64)x39 * x11)) + ((u64)x27 * x20)) + ((u64)x38 * x9));
+ { u64 x52 = (((((u64)x35 * x17) + (0x2 * (((((u64)x33 * x19) + ((u64)x37 * x15)) + ((u64)x29 * x20)) + ((u64)x38 * x11)))) + ((u64)x31 * x21)) + ((u64)x39 * x13));
+ { u64 x53 = (((((((u64)x35 * x19) + ((u64)x37 * x17)) + ((u64)x33 * x21)) + ((u64)x39 * x15)) + ((u64)x31 * x20)) + ((u64)x38 * x13));
+ { u64 x54 = (((0x2 * ((((u64)x37 * x19) + ((u64)x33 * x20)) + ((u64)x38 * x15))) + ((u64)x35 * x21)) + ((u64)x39 * x17));
+ { u64 x55 = (((((u64)x37 * x21) + ((u64)x39 * x19)) + ((u64)x35 * x20)) + ((u64)x38 * x17));
+ { u64 x56 = (((u64)x39 * x21) + (0x2 * (((u64)x37 * x20) + ((u64)x38 * x19))));
+ { u64 x57 = (((u64)x39 * x20) + ((u64)x38 * x21));
+ { u64 x58 = ((u64)(0x2 * x38) * x20);
+ { u64 x59 = (x48 + (x58 << 0x4));
+ { u64 x60 = (x59 + (x58 << 0x1));
+ { u64 x61 = (x60 + x58);
+ { u64 x62 = (x47 + (x57 << 0x4));
+ { u64 x63 = (x62 + (x57 << 0x1));
+ { u64 x64 = (x63 + x57);
+ { u64 x65 = (x46 + (x56 << 0x4));
+ { u64 x66 = (x65 + (x56 << 0x1));
+ { u64 x67 = (x66 + x56);
+ { u64 x68 = (x45 + (x55 << 0x4));
+ { u64 x69 = (x68 + (x55 << 0x1));
+ { u64 x70 = (x69 + x55);
+ { u64 x71 = (x44 + (x54 << 0x4));
+ { u64 x72 = (x71 + (x54 << 0x1));
+ { u64 x73 = (x72 + x54);
+ { u64 x74 = (x43 + (x53 << 0x4));
+ { u64 x75 = (x74 + (x53 << 0x1));
+ { u64 x76 = (x75 + x53);
+ { u64 x77 = (x42 + (x52 << 0x4));
+ { u64 x78 = (x77 + (x52 << 0x1));
+ { u64 x79 = (x78 + x52);
+ { u64 x80 = (x41 + (x51 << 0x4));
+ { u64 x81 = (x80 + (x51 << 0x1));
+ { u64 x82 = (x81 + x51);
+ { u64 x83 = (x40 + (x50 << 0x4));
+ { u64 x84 = (x83 + (x50 << 0x1));
+ { u64 x85 = (x84 + x50);
+ { u64 x86 = (x85 >> 0x1a);
+ { u32 x87 = ((u32)x85 & 0x3ffffff);
+ { u64 x88 = (x86 + x82);
+ { u64 x89 = (x88 >> 0x19);
+ { u32 x90 = ((u32)x88 & 0x1ffffff);
+ { u64 x91 = (x89 + x79);
+ { u64 x92 = (x91 >> 0x1a);
+ { u32 x93 = ((u32)x91 & 0x3ffffff);
+ { u64 x94 = (x92 + x76);
+ { u64 x95 = (x94 >> 0x19);
+ { u32 x96 = ((u32)x94 & 0x1ffffff);
+ { u64 x97 = (x95 + x73);
+ { u64 x98 = (x97 >> 0x1a);
+ { u32 x99 = ((u32)x97 & 0x3ffffff);
+ { u64 x100 = (x98 + x70);
+ { u64 x101 = (x100 >> 0x19);
+ { u32 x102 = ((u32)x100 & 0x1ffffff);
+ { u64 x103 = (x101 + x67);
+ { u64 x104 = (x103 >> 0x1a);
+ { u32 x105 = ((u32)x103 & 0x3ffffff);
+ { u64 x106 = (x104 + x64);
+ { u64 x107 = (x106 >> 0x19);
+ { u32 x108 = ((u32)x106 & 0x1ffffff);
+ { u64 x109 = (x107 + x61);
+ { u64 x110 = (x109 >> 0x1a);
+ { u32 x111 = ((u32)x109 & 0x3ffffff);
+ { u64 x112 = (x110 + x49);
+ { u64 x113 = (x112 >> 0x19);
+ { u32 x114 = ((u32)x112 & 0x1ffffff);
+ { u64 x115 = (x87 + (0x13 * x113));
+ { u32 x116 = (u32) (x115 >> 0x1a);
+ { u32 x117 = ((u32)x115 & 0x3ffffff);
+ { u32 x118 = (x116 + x90);
+ { u32 x119 = (x118 >> 0x19);
+ { u32 x120 = (x118 & 0x1ffffff);
+ out[0] = x117;
+ out[1] = x120;
+ out[2] = (x119 + x93);
+ out[3] = x96;
+ out[4] = x99;
+ out[5] = x102;
+ out[6] = x105;
+ out[7] = x108;
+ out[8] = x111;
+ out[9] = x114;
+ }}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}
+}
+
+static __always_inline void fe_mul_ttt(fe *h, const fe *f, const fe *g)
+{
+ fe_mul_impl(h->v, f->v, g->v);
+}
+
+static __always_inline void fe_mul_tlt(fe *h, const fe_loose *f, const fe *g)
+{
+ fe_mul_impl(h->v, f->v, g->v);
+}
+
+static __always_inline void
+fe_mul_tll(fe *h, const fe_loose *f, const fe_loose *g)
+{
+ fe_mul_impl(h->v, f->v, g->v);
+}
+
+static void fe_sqr_impl(u32 out[10], const u32 in1[10])
+{
+ { const u32 x17 = in1[9];
+ { const u32 x18 = in1[8];
+ { const u32 x16 = in1[7];
+ { const u32 x14 = in1[6];
+ { const u32 x12 = in1[5];
+ { const u32 x10 = in1[4];
+ { const u32 x8 = in1[3];
+ { const u32 x6 = in1[2];
+ { const u32 x4 = in1[1];
+ { const u32 x2 = in1[0];
+ { u64 x19 = ((u64)x2 * x2);
+ { u64 x20 = ((u64)(0x2 * x2) * x4);
+ { u64 x21 = (0x2 * (((u64)x4 * x4) + ((u64)x2 * x6)));
+ { u64 x22 = (0x2 * (((u64)x4 * x6) + ((u64)x2 * x8)));
+ { u64 x23 = ((((u64)x6 * x6) + ((u64)(0x4 * x4) * x8)) + ((u64)(0x2 * x2) * x10));
+ { u64 x24 = (0x2 * ((((u64)x6 * x8) + ((u64)x4 * x10)) + ((u64)x2 * x12)));
+ { u64 x25 = (0x2 * (((((u64)x8 * x8) + ((u64)x6 * x10)) + ((u64)x2 * x14)) + ((u64)(0x2 * x4) * x12)));
+ { u64 x26 = (0x2 * (((((u64)x8 * x10) + ((u64)x6 * x12)) + ((u64)x4 * x14)) + ((u64)x2 * x16)));
+ { u64 x27 = (((u64)x10 * x10) + (0x2 * ((((u64)x6 * x14) + ((u64)x2 * x18)) + (0x2 * (((u64)x4 * x16) + ((u64)x8 * x12))))));
+ { u64 x28 = (0x2 * ((((((u64)x10 * x12) + ((u64)x8 * x14)) + ((u64)x6 * x16)) + ((u64)x4 * x18)) + ((u64)x2 * x17)));
+ { u64 x29 = (0x2 * (((((u64)x12 * x12) + ((u64)x10 * x14)) + ((u64)x6 * x18)) + (0x2 * (((u64)x8 * x16) + ((u64)x4 * x17)))));
+ { u64 x30 = (0x2 * (((((u64)x12 * x14) + ((u64)x10 * x16)) + ((u64)x8 * x18)) + ((u64)x6 * x17)));
+ { u64 x31 = (((u64)x14 * x14) + (0x2 * (((u64)x10 * x18) + (0x2 * (((u64)x12 * x16) + ((u64)x8 * x17))))));
+ { u64 x32 = (0x2 * ((((u64)x14 * x16) + ((u64)x12 * x18)) + ((u64)x10 * x17)));
+ { u64 x33 = (0x2 * ((((u64)x16 * x16) + ((u64)x14 * x18)) + ((u64)(0x2 * x12) * x17)));
+ { u64 x34 = (0x2 * (((u64)x16 * x18) + ((u64)x14 * x17)));
+ { u64 x35 = (((u64)x18 * x18) + ((u64)(0x4 * x16) * x17));
+ { u64 x36 = ((u64)(0x2 * x18) * x17);
+ { u64 x37 = ((u64)(0x2 * x17) * x17);
+ { u64 x38 = (x27 + (x37 << 0x4));
+ { u64 x39 = (x38 + (x37 << 0x1));
+ { u64 x40 = (x39 + x37);
+ { u64 x41 = (x26 + (x36 << 0x4));
+ { u64 x42 = (x41 + (x36 << 0x1));
+ { u64 x43 = (x42 + x36);
+ { u64 x44 = (x25 + (x35 << 0x4));
+ { u64 x45 = (x44 + (x35 << 0x1));
+ { u64 x46 = (x45 + x35);
+ { u64 x47 = (x24 + (x34 << 0x4));
+ { u64 x48 = (x47 + (x34 << 0x1));
+ { u64 x49 = (x48 + x34);
+ { u64 x50 = (x23 + (x33 << 0x4));
+ { u64 x51 = (x50 + (x33 << 0x1));
+ { u64 x52 = (x51 + x33);
+ { u64 x53 = (x22 + (x32 << 0x4));
+ { u64 x54 = (x53 + (x32 << 0x1));
+ { u64 x55 = (x54 + x32);
+ { u64 x56 = (x21 + (x31 << 0x4));
+ { u64 x57 = (x56 + (x31 << 0x1));
+ { u64 x58 = (x57 + x31);
+ { u64 x59 = (x20 + (x30 << 0x4));
+ { u64 x60 = (x59 + (x30 << 0x1));
+ { u64 x61 = (x60 + x30);
+ { u64 x62 = (x19 + (x29 << 0x4));
+ { u64 x63 = (x62 + (x29 << 0x1));
+ { u64 x64 = (x63 + x29);
+ { u64 x65 = (x64 >> 0x1a);
+ { u32 x66 = ((u32)x64 & 0x3ffffff);
+ { u64 x67 = (x65 + x61);
+ { u64 x68 = (x67 >> 0x19);
+ { u32 x69 = ((u32)x67 & 0x1ffffff);
+ { u64 x70 = (x68 + x58);
+ { u64 x71 = (x70 >> 0x1a);
+ { u32 x72 = ((u32)x70 & 0x3ffffff);
+ { u64 x73 = (x71 + x55);
+ { u64 x74 = (x73 >> 0x19);
+ { u32 x75 = ((u32)x73 & 0x1ffffff);
+ { u64 x76 = (x74 + x52);
+ { u64 x77 = (x76 >> 0x1a);
+ { u32 x78 = ((u32)x76 & 0x3ffffff);
+ { u64 x79 = (x77 + x49);
+ { u64 x80 = (x79 >> 0x19);
+ { u32 x81 = ((u32)x79 & 0x1ffffff);
+ { u64 x82 = (x80 + x46);
+ { u64 x83 = (x82 >> 0x1a);
+ { u32 x84 = ((u32)x82 & 0x3ffffff);
+ { u64 x85 = (x83 + x43);
+ { u64 x86 = (x85 >> 0x19);
+ { u32 x87 = ((u32)x85 & 0x1ffffff);
+ { u64 x88 = (x86 + x40);
+ { u64 x89 = (x88 >> 0x1a);
+ { u32 x90 = ((u32)x88 & 0x3ffffff);
+ { u64 x91 = (x89 + x28);
+ { u64 x92 = (x91 >> 0x19);
+ { u32 x93 = ((u32)x91 & 0x1ffffff);
+ { u64 x94 = (x66 + (0x13 * x92));
+ { u32 x95 = (u32) (x94 >> 0x1a);
+ { u32 x96 = ((u32)x94 & 0x3ffffff);
+ { u32 x97 = (x95 + x69);
+ { u32 x98 = (x97 >> 0x19);
+ { u32 x99 = (x97 & 0x1ffffff);
+ out[0] = x96;
+ out[1] = x99;
+ out[2] = (x98 + x72);
+ out[3] = x75;
+ out[4] = x78;
+ out[5] = x81;
+ out[6] = x84;
+ out[7] = x87;
+ out[8] = x90;
+ out[9] = x93;
+ }}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}
+}
+
+static __always_inline void fe_sq_tl(fe *h, const fe_loose *f)
+{
+ fe_sqr_impl(h->v, f->v);
+}
+
+static __always_inline void fe_sq_tt(fe *h, const fe *f)
+{
+ fe_sqr_impl(h->v, f->v);
+}
+
+static __always_inline void fe_loose_invert(fe *out, const fe_loose *z)
+{
+ fe t0;
+ fe t1;
+ fe t2;
+ fe t3;
+ int i;
+
+ fe_sq_tl(&t0, z);
+ fe_sq_tt(&t1, &t0);
+ for (i = 1; i < 2; ++i)
+ fe_sq_tt(&t1, &t1);
+ fe_mul_tlt(&t1, z, &t1);
+ fe_mul_ttt(&t0, &t0, &t1);
+ fe_sq_tt(&t2, &t0);
+ fe_mul_ttt(&t1, &t1, &t2);
+ fe_sq_tt(&t2, &t1);
+ for (i = 1; i < 5; ++i)
+ fe_sq_tt(&t2, &t2);
+ fe_mul_ttt(&t1, &t2, &t1);
+ fe_sq_tt(&t2, &t1);
+ for (i = 1; i < 10; ++i)
+ fe_sq_tt(&t2, &t2);
+ fe_mul_ttt(&t2, &t2, &t1);
+ fe_sq_tt(&t3, &t2);
+ for (i = 1; i < 20; ++i)
+ fe_sq_tt(&t3, &t3);
+ fe_mul_ttt(&t2, &t3, &t2);
+ fe_sq_tt(&t2, &t2);
+ for (i = 1; i < 10; ++i)
+ fe_sq_tt(&t2, &t2);
+ fe_mul_ttt(&t1, &t2, &t1);
+ fe_sq_tt(&t2, &t1);
+ for (i = 1; i < 50; ++i)
+ fe_sq_tt(&t2, &t2);
+ fe_mul_ttt(&t2, &t2, &t1);
+ fe_sq_tt(&t3, &t2);
+ for (i = 1; i < 100; ++i)
+ fe_sq_tt(&t3, &t3);
+ fe_mul_ttt(&t2, &t3, &t2);
+ fe_sq_tt(&t2, &t2);
+ for (i = 1; i < 50; ++i)
+ fe_sq_tt(&t2, &t2);
+ fe_mul_ttt(&t1, &t2, &t1);
+ fe_sq_tt(&t1, &t1);
+ for (i = 1; i < 5; ++i)
+ fe_sq_tt(&t1, &t1);
+ fe_mul_ttt(out, &t1, &t0);
+}
+
+static __always_inline void fe_invert(fe *out, const fe *z)
+{
+ fe_loose l;
+ fe_copy_lt(&l, z);
+ fe_loose_invert(out, &l);
+}
+
+/* Replace (f,g) with (g,f) if b == 1;
+ * replace (f,g) with (f,g) if b == 0.
+ *
+ * Preconditions: b in {0,1}
+ */
+static __always_inline void fe_cswap(fe *f, fe *g, unsigned int b)
+{
+ unsigned i;
+ b = 0-b;
+ for (i = 0; i < 10; i++) {
+ u32 x = f->v[i] ^ g->v[i];
+ x &= b;
+ f->v[i] ^= x;
+ g->v[i] ^= x;
+ }
+}
+
+/* NOTE: based on fiat-crypto fe_mul, edited for in2=121666, 0, 0.*/
+static __always_inline void fe_mul_121666_impl(u32 out[10], const u32 in1[10])
+{
+ { const u32 x20 = in1[9];
+ { const u32 x21 = in1[8];
+ { const u32 x19 = in1[7];
+ { const u32 x17 = in1[6];
+ { const u32 x15 = in1[5];
+ { const u32 x13 = in1[4];
+ { const u32 x11 = in1[3];
+ { const u32 x9 = in1[2];
+ { const u32 x7 = in1[1];
+ { const u32 x5 = in1[0];
+ { const u32 x38 = 0;
+ { const u32 x39 = 0;
+ { const u32 x37 = 0;
+ { const u32 x35 = 0;
+ { const u32 x33 = 0;
+ { const u32 x31 = 0;
+ { const u32 x29 = 0;
+ { const u32 x27 = 0;
+ { const u32 x25 = 0;
+ { const u32 x23 = 121666;
+ { u64 x40 = ((u64)x23 * x5);
+ { u64 x41 = (((u64)x23 * x7) + ((u64)x25 * x5));
+ { u64 x42 = ((((u64)(0x2 * x25) * x7) + ((u64)x23 * x9)) + ((u64)x27 * x5));
+ { u64 x43 = (((((u64)x25 * x9) + ((u64)x27 * x7)) + ((u64)x23 * x11)) + ((u64)x29 * x5));
+ { u64 x44 = (((((u64)x27 * x9) + (0x2 * (((u64)x25 * x11) + ((u64)x29 * x7)))) + ((u64)x23 * x13)) + ((u64)x31 * x5));
+ { u64 x45 = (((((((u64)x27 * x11) + ((u64)x29 * x9)) + ((u64)x25 * x13)) + ((u64)x31 * x7)) + ((u64)x23 * x15)) + ((u64)x33 * x5));
+ { u64 x46 = (((((0x2 * ((((u64)x29 * x11) + ((u64)x25 * x15)) + ((u64)x33 * x7))) + ((u64)x27 * x13)) + ((u64)x31 * x9)) + ((u64)x23 * x17)) + ((u64)x35 * x5));
+ { u64 x47 = (((((((((u64)x29 * x13) + ((u64)x31 * x11)) + ((u64)x27 * x15)) + ((u64)x33 * x9)) + ((u64)x25 * x17)) + ((u64)x35 * x7)) + ((u64)x23 * x19)) + ((u64)x37 * x5));
+ { u64 x48 = (((((((u64)x31 * x13) + (0x2 * (((((u64)x29 * x15) + ((u64)x33 * x11)) + ((u64)x25 * x19)) + ((u64)x37 * x7)))) + ((u64)x27 * x17)) + ((u64)x35 * x9)) + ((u64)x23 * x21)) + ((u64)x39 * x5));
+ { u64 x49 = (((((((((((u64)x31 * x15) + ((u64)x33 * x13)) + ((u64)x29 * x17)) + ((u64)x35 * x11)) + ((u64)x27 * x19)) + ((u64)x37 * x9)) + ((u64)x25 * x21)) + ((u64)x39 * x7)) + ((u64)x23 * x20)) + ((u64)x38 * x5));
+ { u64 x50 = (((((0x2 * ((((((u64)x33 * x15) + ((u64)x29 * x19)) + ((u64)x37 * x11)) + ((u64)x25 * x20)) + ((u64)x38 * x7))) + ((u64)x31 * x17)) + ((u64)x35 * x13)) + ((u64)x27 * x21)) + ((u64)x39 * x9));
+ { u64 x51 = (((((((((u64)x33 * x17) + ((u64)x35 * x15)) + ((u64)x31 * x19)) + ((u64)x37 * x13)) + ((u64)x29 * x21)) + ((u64)x39 * x11)) + ((u64)x27 * x20)) + ((u64)x38 * x9));
+ { u64 x52 = (((((u64)x35 * x17) + (0x2 * (((((u64)x33 * x19) + ((u64)x37 * x15)) + ((u64)x29 * x20)) + ((u64)x38 * x11)))) + ((u64)x31 * x21)) + ((u64)x39 * x13));
+ { u64 x53 = (((((((u64)x35 * x19) + ((u64)x37 * x17)) + ((u64)x33 * x21)) + ((u64)x39 * x15)) + ((u64)x31 * x20)) + ((u64)x38 * x13));
+ { u64 x54 = (((0x2 * ((((u64)x37 * x19) + ((u64)x33 * x20)) + ((u64)x38 * x15))) + ((u64)x35 * x21)) + ((u64)x39 * x17));
+ { u64 x55 = (((((u64)x37 * x21) + ((u64)x39 * x19)) + ((u64)x35 * x20)) + ((u64)x38 * x17));
+ { u64 x56 = (((u64)x39 * x21) + (0x2 * (((u64)x37 * x20) + ((u64)x38 * x19))));
+ { u64 x57 = (((u64)x39 * x20) + ((u64)x38 * x21));
+ { u64 x58 = ((u64)(0x2 * x38) * x20);
+ { u64 x59 = (x48 + (x58 << 0x4));
+ { u64 x60 = (x59 + (x58 << 0x1));
+ { u64 x61 = (x60 + x58);
+ { u64 x62 = (x47 + (x57 << 0x4));
+ { u64 x63 = (x62 + (x57 << 0x1));
+ { u64 x64 = (x63 + x57);
+ { u64 x65 = (x46 + (x56 << 0x4));
+ { u64 x66 = (x65 + (x56 << 0x1));
+ { u64 x67 = (x66 + x56);
+ { u64 x68 = (x45 + (x55 << 0x4));
+ { u64 x69 = (x68 + (x55 << 0x1));
+ { u64 x70 = (x69 + x55);
+ { u64 x71 = (x44 + (x54 << 0x4));
+ { u64 x72 = (x71 + (x54 << 0x1));
+ { u64 x73 = (x72 + x54);
+ { u64 x74 = (x43 + (x53 << 0x4));
+ { u64 x75 = (x74 + (x53 << 0x1));
+ { u64 x76 = (x75 + x53);
+ { u64 x77 = (x42 + (x52 << 0x4));
+ { u64 x78 = (x77 + (x52 << 0x1));
+ { u64 x79 = (x78 + x52);
+ { u64 x80 = (x41 + (x51 << 0x4));
+ { u64 x81 = (x80 + (x51 << 0x1));
+ { u64 x82 = (x81 + x51);
+ { u64 x83 = (x40 + (x50 << 0x4));
+ { u64 x84 = (x83 + (x50 << 0x1));
+ { u64 x85 = (x84 + x50);
+ { u64 x86 = (x85 >> 0x1a);
+ { u32 x87 = ((u32)x85 & 0x3ffffff);
+ { u64 x88 = (x86 + x82);
+ { u64 x89 = (x88 >> 0x19);
+ { u32 x90 = ((u32)x88 & 0x1ffffff);
+ { u64 x91 = (x89 + x79);
+ { u64 x92 = (x91 >> 0x1a);
+ { u32 x93 = ((u32)x91 & 0x3ffffff);
+ { u64 x94 = (x92 + x76);
+ { u64 x95 = (x94 >> 0x19);
+ { u32 x96 = ((u32)x94 & 0x1ffffff);
+ { u64 x97 = (x95 + x73);
+ { u64 x98 = (x97 >> 0x1a);
+ { u32 x99 = ((u32)x97 & 0x3ffffff);
+ { u64 x100 = (x98 + x70);
+ { u64 x101 = (x100 >> 0x19);
+ { u32 x102 = ((u32)x100 & 0x1ffffff);
+ { u64 x103 = (x101 + x67);
+ { u64 x104 = (x103 >> 0x1a);
+ { u32 x105 = ((u32)x103 & 0x3ffffff);
+ { u64 x106 = (x104 + x64);
+ { u64 x107 = (x106 >> 0x19);
+ { u32 x108 = ((u32)x106 & 0x1ffffff);
+ { u64 x109 = (x107 + x61);
+ { u64 x110 = (x109 >> 0x1a);
+ { u32 x111 = ((u32)x109 & 0x3ffffff);
+ { u64 x112 = (x110 + x49);
+ { u64 x113 = (x112 >> 0x19);
+ { u32 x114 = ((u32)x112 & 0x1ffffff);
+ { u64 x115 = (x87 + (0x13 * x113));
+ { u32 x116 = (u32) (x115 >> 0x1a);
+ { u32 x117 = ((u32)x115 & 0x3ffffff);
+ { u32 x118 = (x116 + x90);
+ { u32 x119 = (x118 >> 0x19);
+ { u32 x120 = (x118 & 0x1ffffff);
+ out[0] = x117;
+ out[1] = x120;
+ out[2] = (x119 + x93);
+ out[3] = x96;
+ out[4] = x99;
+ out[5] = x102;
+ out[6] = x105;
+ out[7] = x108;
+ out[8] = x111;
+ out[9] = x114;
+ }}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}
+}
+
+static __always_inline void fe_mul121666(fe *h, const fe_loose *f)
+{
+ fe_mul_121666_impl(h->v, f->v);
+}
+
+static void curve25519_generic(u8 out[CURVE25519_KEY_SIZE],
+ const u8 scalar[CURVE25519_KEY_SIZE],
+ const u8 point[CURVE25519_KEY_SIZE])
+{
+ fe x1, x2, z2, x3, z3;
+ fe_loose x2l, z2l, x3l;
+ unsigned swap = 0;
+ int pos;
+ u8 e[32];
+
+ memcpy(e, scalar, 32);
+ normalize_secret(e);
+
+ /* The following implementation was transcribed to Coq and proven to
+ * correspond to unary scalar multiplication in affine coordinates given
+ * that x1 != 0 is the x coordinate of some point on the curve. It was
+ * also checked in Coq that doing a ladderstep with x1 = x3 = 0 gives
+ * z2' = z3' = 0, and z2 = z3 = 0 gives z2' = z3' = 0. The statement was
+ * quantified over the underlying field, so it applies to Curve25519
+ * itself and the quadratic twist of Curve25519. It was not proven in
+ * Coq that prime-field arithmetic correctly simulates extension-field
+ * arithmetic on prime-field values. The decoding of the byte array
+ * representation of e was not considered.
+ *
+ * Specification of Montgomery curves in affine coordinates:
+ * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Spec/MontgomeryCurve.v#L27>
+ *
+ * Proof that these form a group that is isomorphic to a Weierstrass
+ * curve:
+ * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/AffineProofs.v#L35>
+ *
+ * Coq transcription and correctness proof of the loop
+ * (where scalarbits=255):
+ * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZ.v#L118>
+ * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZProofs.v#L278>
+ * preconditions: 0 <= e < 2^255 (not necessarily e < order),
+ * fe_invert(0) = 0
+ */
+ fe_frombytes(&x1, point);
+ fe_1(&x2);
+ fe_0(&z2);
+ fe_copy(&x3, &x1);
+ fe_1(&z3);
+
+ for (pos = 254; pos >= 0; --pos) {
+ fe tmp0, tmp1;
+ fe_loose tmp0l, tmp1l;
+ /* loop invariant as of right before the test, for the case
+ * where x1 != 0:
+ * pos >= -1; if z2 = 0 then x2 is nonzero; if z3 = 0 then x3
+ * is nonzero
+ * let r := e >> (pos+1) in the following equalities of
+ * projective points:
+ * to_xz (r*P) === if swap then (x3, z3) else (x2, z2)
+ * to_xz ((r+1)*P) === if swap then (x2, z2) else (x3, z3)
+ * x1 is the nonzero x coordinate of the nonzero
+ * point (r*P-(r+1)*P)
+ */
+ unsigned b = 1 & (e[pos / 8] >> (pos & 7));
+ swap ^= b;
+ fe_cswap(&x2, &x3, swap);
+ fe_cswap(&z2, &z3, swap);
+ swap = b;
+ /* Coq transcription of ladderstep formula (called from
+ * transcribed loop):
+ * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZ.v#L89>
+ * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZProofs.v#L131>
+ * x1 != 0 <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZProofs.v#L217>
+ * x1 = 0 <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZProofs.v#L147>
+ */
+ fe_sub(&tmp0l, &x3, &z3);
+ fe_sub(&tmp1l, &x2, &z2);
+ fe_add(&x2l, &x2, &z2);
+ fe_add(&z2l, &x3, &z3);
+ fe_mul_tll(&z3, &tmp0l, &x2l);
+ fe_mul_tll(&z2, &z2l, &tmp1l);
+ fe_sq_tl(&tmp0, &tmp1l);
+ fe_sq_tl(&tmp1, &x2l);
+ fe_add(&x3l, &z3, &z2);
+ fe_sub(&z2l, &z3, &z2);
+ fe_mul_ttt(&x2, &tmp1, &tmp0);
+ fe_sub(&tmp1l, &tmp1, &tmp0);
+ fe_sq_tl(&z2, &z2l);
+ fe_mul121666(&z3, &tmp1l);
+ fe_sq_tl(&x3, &x3l);
+ fe_add(&tmp0l, &tmp0, &z3);
+ fe_mul_ttt(&z3, &x1, &z2);
+ fe_mul_tll(&z2, &tmp1l, &tmp0l);
+ }
+ /* here pos=-1, so r=e, so to_xz (e*P) === if swap then (x3, z3)
+ * else (x2, z2)
+ */
+ fe_cswap(&x2, &x3, swap);
+ fe_cswap(&z2, &z3, swap);
+
+ fe_invert(&z2, &z2);
+ fe_mul_ttt(&x2, &x2, &z2);
+ fe_tobytes(out, &x2);
+
+ memzero_explicit(&x1, sizeof(x1));
+ memzero_explicit(&x2, sizeof(x2));
+ memzero_explicit(&z2, sizeof(z2));
+ memzero_explicit(&x3, sizeof(x3));
+ memzero_explicit(&z3, sizeof(z3));
+ memzero_explicit(&x2l, sizeof(x2l));
+ memzero_explicit(&z2l, sizeof(z2l));
+ memzero_explicit(&x3l, sizeof(x3l));
+ memzero_explicit(&e, sizeof(e));
+}
diff --git a/lib/zinc/curve25519/curve25519-hacl64.c b/lib/zinc/curve25519/curve25519-hacl64.c
new file mode 100644
index 000000000000..ce7dfcf064b1
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-hacl64.c
@@ -0,0 +1,784 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2016-2017 INRIA and Microsoft Corporation.
+ * Copyright (C) 2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * This is a machine-generated formally verified implementation of Curve25519
+ * ECDH from: <https://github.com/mitls/hacl-star>. Though originally machine
+ * generated, it has been tweaked to be suitable for use in the kernel. It is
+ * optimized for 64-bit machines that can efficiently work with 128-bit
+ * integer types.
+ */
+
+typedef __uint128_t u128;
+
+static __always_inline u64 u64_eq_mask(u64 a, u64 b)
+{
+ u64 x = a ^ b;
+ u64 minus_x = ~x + (u64)1U;
+ u64 x_or_minus_x = x | minus_x;
+ u64 xnx = x_or_minus_x >> (u32)63U;
+ u64 c = xnx - (u64)1U;
+ return c;
+}
+
+static __always_inline u64 u64_gte_mask(u64 a, u64 b)
+{
+ u64 x = a;
+ u64 y = b;
+ u64 x_xor_y = x ^ y;
+ u64 x_sub_y = x - y;
+ u64 x_sub_y_xor_y = x_sub_y ^ y;
+ u64 q = x_xor_y | x_sub_y_xor_y;
+ u64 x_xor_q = x ^ q;
+ u64 x_xor_q_ = x_xor_q >> (u32)63U;
+ u64 c = x_xor_q_ - (u64)1U;
+ return c;
+}
+
+static __always_inline void modulo_carry_top(u64 *b)
+{
+ u64 b4 = b[4];
+ u64 b0 = b[0];
+ u64 b4_ = b4 & 0x7ffffffffffffLLU;
+ u64 b0_ = b0 + 19 * (b4 >> 51);
+ b[4] = b4_;
+ b[0] = b0_;
+}
+
+static __always_inline void fproduct_copy_from_wide_(u64 *output, u128 *input)
+{
+ {
+ u128 xi = input[0];
+ output[0] = ((u64)(xi));
+ }
+ {
+ u128 xi = input[1];
+ output[1] = ((u64)(xi));
+ }
+ {
+ u128 xi = input[2];
+ output[2] = ((u64)(xi));
+ }
+ {
+ u128 xi = input[3];
+ output[3] = ((u64)(xi));
+ }
+ {
+ u128 xi = input[4];
+ output[4] = ((u64)(xi));
+ }
+}
+
+static __always_inline void
+fproduct_sum_scalar_multiplication_(u128 *output, u64 *input, u64 s)
+{
+ output[0] += (u128)input[0] * s;
+ output[1] += (u128)input[1] * s;
+ output[2] += (u128)input[2] * s;
+ output[3] += (u128)input[3] * s;
+ output[4] += (u128)input[4] * s;
+}
+
+static __always_inline void fproduct_carry_wide_(u128 *tmp)
+{
+ {
+ u32 ctr = 0;
+ u128 tctr = tmp[ctr];
+ u128 tctrp1 = tmp[ctr + 1];
+ u64 r0 = ((u64)(tctr)) & 0x7ffffffffffffLLU;
+ u128 c = ((tctr) >> (51));
+ tmp[ctr] = ((u128)(r0));
+ tmp[ctr + 1] = ((tctrp1) + (c));
+ }
+ {
+ u32 ctr = 1;
+ u128 tctr = tmp[ctr];
+ u128 tctrp1 = tmp[ctr + 1];
+ u64 r0 = ((u64)(tctr)) & 0x7ffffffffffffLLU;
+ u128 c = ((tctr) >> (51));
+ tmp[ctr] = ((u128)(r0));
+ tmp[ctr + 1] = ((tctrp1) + (c));
+ }
+
+ {
+ u32 ctr = 2;
+ u128 tctr = tmp[ctr];
+ u128 tctrp1 = tmp[ctr + 1];
+ u64 r0 = ((u64)(tctr)) & 0x7ffffffffffffLLU;
+ u128 c = ((tctr) >> (51));
+ tmp[ctr] = ((u128)(r0));
+ tmp[ctr + 1] = ((tctrp1) + (c));
+ }
+ {
+ u32 ctr = 3;
+ u128 tctr = tmp[ctr];
+ u128 tctrp1 = tmp[ctr + 1];
+ u64 r0 = ((u64)(tctr)) & 0x7ffffffffffffLLU;
+ u128 c = ((tctr) >> (51));
+ tmp[ctr] = ((u128)(r0));
+ tmp[ctr + 1] = ((tctrp1) + (c));
+ }
+}
+
+static __always_inline void fmul_shift_reduce(u64 *output)
+{
+ u64 tmp = output[4];
+ u64 b0;
+ {
+ u32 ctr = 5 - 0 - 1;
+ u64 z = output[ctr - 1];
+ output[ctr] = z;
+ }
+ {
+ u32 ctr = 5 - 1 - 1;
+ u64 z = output[ctr - 1];
+ output[ctr] = z;
+ }
+ {
+ u32 ctr = 5 - 2 - 1;
+ u64 z = output[ctr - 1];
+ output[ctr] = z;
+ }
+ {
+ u32 ctr = 5 - 3 - 1;
+ u64 z = output[ctr - 1];
+ output[ctr] = z;
+ }
+ output[0] = tmp;
+ b0 = output[0];
+ output[0] = 19 * b0;
+}
+
+static __always_inline void fmul_mul_shift_reduce_(u128 *output, u64 *input,
+ u64 *input21)
+{
+ u32 i;
+ u64 input2i;
+ {
+ u64 input2i = input21[0];
+ fproduct_sum_scalar_multiplication_(output, input, input2i);
+ fmul_shift_reduce(input);
+ }
+ {
+ u64 input2i = input21[1];
+ fproduct_sum_scalar_multiplication_(output, input, input2i);
+ fmul_shift_reduce(input);
+ }
+ {
+ u64 input2i = input21[2];
+ fproduct_sum_scalar_multiplication_(output, input, input2i);
+ fmul_shift_reduce(input);
+ }
+ {
+ u64 input2i = input21[3];
+ fproduct_sum_scalar_multiplication_(output, input, input2i);
+ fmul_shift_reduce(input);
+ }
+ i = 4;
+ input2i = input21[i];
+ fproduct_sum_scalar_multiplication_(output, input, input2i);
+}
+
+static __always_inline void fmul_fmul(u64 *output, u64 *input, u64 *input21)
+{
+ u64 tmp[5] = { input[0], input[1], input[2], input[3], input[4] };
+ {
+ u128 b4;
+ u128 b0;
+ u128 b4_;
+ u128 b0_;
+ u64 i0;
+ u64 i1;
+ u64 i0_;
+ u64 i1_;
+ u128 t[5] = { 0 };
+ fmul_mul_shift_reduce_(t, tmp, input21);
+ fproduct_carry_wide_(t);
+ b4 = t[4];
+ b0 = t[0];
+ b4_ = ((b4) & (((u128)(0x7ffffffffffffLLU))));
+ b0_ = ((b0) + (((u128)(19) * (((u64)(((b4) >> (51))))))));
+ t[4] = b4_;
+ t[0] = b0_;
+ fproduct_copy_from_wide_(output, t);
+ i0 = output[0];
+ i1 = output[1];
+ i0_ = i0 & 0x7ffffffffffffLLU;
+ i1_ = i1 + (i0 >> 51);
+ output[0] = i0_;
+ output[1] = i1_;
+ }
+}
+
+static __always_inline void fsquare_fsquare__(u128 *tmp, u64 *output)
+{
+ u64 r0 = output[0];
+ u64 r1 = output[1];
+ u64 r2 = output[2];
+ u64 r3 = output[3];
+ u64 r4 = output[4];
+ u64 d0 = r0 * 2;
+ u64 d1 = r1 * 2;
+ u64 d2 = r2 * 2 * 19;
+ u64 d419 = r4 * 19;
+ u64 d4 = d419 * 2;
+ u128 s0 = ((((((u128)(r0) * (r0))) + (((u128)(d4) * (r1))))) +
+ (((u128)(d2) * (r3))));
+ u128 s1 = ((((((u128)(d0) * (r1))) + (((u128)(d4) * (r2))))) +
+ (((u128)(r3 * 19) * (r3))));
+ u128 s2 = ((((((u128)(d0) * (r2))) + (((u128)(r1) * (r1))))) +
+ (((u128)(d4) * (r3))));
+ u128 s3 = ((((((u128)(d0) * (r3))) + (((u128)(d1) * (r2))))) +
+ (((u128)(r4) * (d419))));
+ u128 s4 = ((((((u128)(d0) * (r4))) + (((u128)(d1) * (r3))))) +
+ (((u128)(r2) * (r2))));
+ tmp[0] = s0;
+ tmp[1] = s1;
+ tmp[2] = s2;
+ tmp[3] = s3;
+ tmp[4] = s4;
+}
+
+static __always_inline void fsquare_fsquare_(u128 *tmp, u64 *output)
+{
+ u128 b4;
+ u128 b0;
+ u128 b4_;
+ u128 b0_;
+ u64 i0;
+ u64 i1;
+ u64 i0_;
+ u64 i1_;
+ fsquare_fsquare__(tmp, output);
+ fproduct_carry_wide_(tmp);
+ b4 = tmp[4];
+ b0 = tmp[0];
+ b4_ = ((b4) & (((u128)(0x7ffffffffffffLLU))));
+ b0_ = ((b0) + (((u128)(19) * (((u64)(((b4) >> (51))))))));
+ tmp[4] = b4_;
+ tmp[0] = b0_;
+ fproduct_copy_from_wide_(output, tmp);
+ i0 = output[0];
+ i1 = output[1];
+ i0_ = i0 & 0x7ffffffffffffLLU;
+ i1_ = i1 + (i0 >> 51);
+ output[0] = i0_;
+ output[1] = i1_;
+}
+
+static __always_inline void fsquare_fsquare_times_(u64 *output, u128 *tmp,
+ u32 count1)
+{
+ u32 i;
+ fsquare_fsquare_(tmp, output);
+ for (i = 1; i < count1; ++i)
+ fsquare_fsquare_(tmp, output);
+}
+
+static __always_inline void fsquare_fsquare_times(u64 *output, u64 *input,
+ u32 count1)
+{
+ u128 t[5];
+ memcpy(output, input, 5 * sizeof(*input));
+ fsquare_fsquare_times_(output, t, count1);
+}
+
+static __always_inline void fsquare_fsquare_times_inplace(u64 *output,
+ u32 count1)
+{
+ u128 t[5];
+ fsquare_fsquare_times_(output, t, count1);
+}
+
+static __always_inline void crecip_crecip(u64 *out, u64 *z)
+{
+ u64 buf[20] = { 0 };
+ u64 *a0 = buf;
+ u64 *t00 = buf + 5;
+ u64 *b0 = buf + 10;
+ u64 *t01;
+ u64 *b1;
+ u64 *c0;
+ u64 *a;
+ u64 *t0;
+ u64 *b;
+ u64 *c;
+ fsquare_fsquare_times(a0, z, 1);
+ fsquare_fsquare_times(t00, a0, 2);
+ fmul_fmul(b0, t00, z);
+ fmul_fmul(a0, b0, a0);
+ fsquare_fsquare_times(t00, a0, 1);
+ fmul_fmul(b0, t00, b0);
+ fsquare_fsquare_times(t00, b0, 5);
+ t01 = buf + 5;
+ b1 = buf + 10;
+ c0 = buf + 15;
+ fmul_fmul(b1, t01, b1);
+ fsquare_fsquare_times(t01, b1, 10);
+ fmul_fmul(c0, t01, b1);
+ fsquare_fsquare_times(t01, c0, 20);
+ fmul_fmul(t01, t01, c0);
+ fsquare_fsquare_times_inplace(t01, 10);
+ fmul_fmul(b1, t01, b1);
+ fsquare_fsquare_times(t01, b1, 50);
+ a = buf;
+ t0 = buf + 5;
+ b = buf + 10;
+ c = buf + 15;
+ fmul_fmul(c, t0, b);
+ fsquare_fsquare_times(t0, c, 100);
+ fmul_fmul(t0, t0, c);
+ fsquare_fsquare_times_inplace(t0, 50);
+ fmul_fmul(t0, t0, b);
+ fsquare_fsquare_times_inplace(t0, 5);
+ fmul_fmul(out, t0, a);
+}
+
+static __always_inline void fsum(u64 *a, u64 *b)
+{
+ a[0] += b[0];
+ a[1] += b[1];
+ a[2] += b[2];
+ a[3] += b[3];
+ a[4] += b[4];
+}
+
+static __always_inline void fdifference(u64 *a, u64 *b)
+{
+ u64 tmp[5] = { 0 };
+ u64 b0;
+ u64 b1;
+ u64 b2;
+ u64 b3;
+ u64 b4;
+ memcpy(tmp, b, 5 * sizeof(*b));
+ b0 = tmp[0];
+ b1 = tmp[1];
+ b2 = tmp[2];
+ b3 = tmp[3];
+ b4 = tmp[4];
+ tmp[0] = b0 + 0x3fffffffffff68LLU;
+ tmp[1] = b1 + 0x3ffffffffffff8LLU;
+ tmp[2] = b2 + 0x3ffffffffffff8LLU;
+ tmp[3] = b3 + 0x3ffffffffffff8LLU;
+ tmp[4] = b4 + 0x3ffffffffffff8LLU;
+ {
+ u64 xi = a[0];
+ u64 yi = tmp[0];
+ a[0] = yi - xi;
+ }
+ {
+ u64 xi = a[1];
+ u64 yi = tmp[1];
+ a[1] = yi - xi;
+ }
+ {
+ u64 xi = a[2];
+ u64 yi = tmp[2];
+ a[2] = yi - xi;
+ }
+ {
+ u64 xi = a[3];
+ u64 yi = tmp[3];
+ a[3] = yi - xi;
+ }
+ {
+ u64 xi = a[4];
+ u64 yi = tmp[4];
+ a[4] = yi - xi;
+ }
+}
+
+static __always_inline void fscalar(u64 *output, u64 *b, u64 s)
+{
+ u128 tmp[5];
+ u128 b4;
+ u128 b0;
+ u128 b4_;
+ u128 b0_;
+ {
+ u64 xi = b[0];
+ tmp[0] = ((u128)(xi) * (s));
+ }
+ {
+ u64 xi = b[1];
+ tmp[1] = ((u128)(xi) * (s));
+ }
+ {
+ u64 xi = b[2];
+ tmp[2] = ((u128)(xi) * (s));
+ }
+ {
+ u64 xi = b[3];
+ tmp[3] = ((u128)(xi) * (s));
+ }
+ {
+ u64 xi = b[4];
+ tmp[4] = ((u128)(xi) * (s));
+ }
+ fproduct_carry_wide_(tmp);
+ b4 = tmp[4];
+ b0 = tmp[0];
+ b4_ = ((b4) & (((u128)(0x7ffffffffffffLLU))));
+ b0_ = ((b0) + (((u128)(19) * (((u64)(((b4) >> (51))))))));
+ tmp[4] = b4_;
+ tmp[0] = b0_;
+ fproduct_copy_from_wide_(output, tmp);
+}
+
+static __always_inline void fmul(u64 *output, u64 *a, u64 *b)
+{
+ fmul_fmul(output, a, b);
+}
+
+static __always_inline void crecip(u64 *output, u64 *input)
+{
+ crecip_crecip(output, input);
+}
+
+static __always_inline void point_swap_conditional_step(u64 *a, u64 *b,
+ u64 swap1, u32 ctr)
+{
+ u32 i = ctr - 1;
+ u64 ai = a[i];
+ u64 bi = b[i];
+ u64 x = swap1 & (ai ^ bi);
+ u64 ai1 = ai ^ x;
+ u64 bi1 = bi ^ x;
+ a[i] = ai1;
+ b[i] = bi1;
+}
+
+static __always_inline void point_swap_conditional5(u64 *a, u64 *b, u64 swap1)
+{
+ point_swap_conditional_step(a, b, swap1, 5);
+ point_swap_conditional_step(a, b, swap1, 4);
+ point_swap_conditional_step(a, b, swap1, 3);
+ point_swap_conditional_step(a, b, swap1, 2);
+ point_swap_conditional_step(a, b, swap1, 1);
+}
+
+static __always_inline void point_swap_conditional(u64 *a, u64 *b, u64 iswap)
+{
+ u64 swap1 = 0 - iswap;
+ point_swap_conditional5(a, b, swap1);
+ point_swap_conditional5(a + 5, b + 5, swap1);
+}
+
+static __always_inline void point_copy(u64 *output, u64 *input)
+{
+ memcpy(output, input, 5 * sizeof(*input));
+ memcpy(output + 5, input + 5, 5 * sizeof(*input));
+}
+
+static __always_inline void addanddouble_fmonty(u64 *pp, u64 *ppq, u64 *p,
+ u64 *pq, u64 *qmqp)
+{
+ u64 *qx = qmqp;
+ u64 *x2 = pp;
+ u64 *z2 = pp + 5;
+ u64 *x3 = ppq;
+ u64 *z3 = ppq + 5;
+ u64 *x = p;
+ u64 *z = p + 5;
+ u64 *xprime = pq;
+ u64 *zprime = pq + 5;
+ u64 buf[40] = { 0 };
+ u64 *origx = buf;
+ u64 *origxprime0 = buf + 5;
+ u64 *xxprime0;
+ u64 *zzprime0;
+ u64 *origxprime;
+ xxprime0 = buf + 25;
+ zzprime0 = buf + 30;
+ memcpy(origx, x, 5 * sizeof(*x));
+ fsum(x, z);
+ fdifference(z, origx);
+ memcpy(origxprime0, xprime, 5 * sizeof(*xprime));
+ fsum(xprime, zprime);
+ fdifference(zprime, origxprime0);
+ fmul(xxprime0, xprime, z);
+ fmul(zzprime0, x, zprime);
+ origxprime = buf + 5;
+ {
+ u64 *xx0;
+ u64 *zz0;
+ u64 *xxprime;
+ u64 *zzprime;
+ u64 *zzzprime;
+ xx0 = buf + 15;
+ zz0 = buf + 20;
+ xxprime = buf + 25;
+ zzprime = buf + 30;
+ zzzprime = buf + 35;
+ memcpy(origxprime, xxprime, 5 * sizeof(*xxprime));
+ fsum(xxprime, zzprime);
+ fdifference(zzprime, origxprime);
+ fsquare_fsquare_times(x3, xxprime, 1);
+ fsquare_fsquare_times(zzzprime, zzprime, 1);
+ fmul(z3, zzzprime, qx);
+ fsquare_fsquare_times(xx0, x, 1);
+ fsquare_fsquare_times(zz0, z, 1);
+ {
+ u64 *zzz;
+ u64 *xx;
+ u64 *zz;
+ u64 scalar;
+ zzz = buf + 10;
+ xx = buf + 15;
+ zz = buf + 20;
+ fmul(x2, xx, zz);
+ fdifference(zz, xx);
+ scalar = 121665;
+ fscalar(zzz, zz, scalar);
+ fsum(zzz, xx);
+ fmul(z2, zzz, zz);
+ }
+ }
+}
+
+static __always_inline void
+ladder_smallloop_cmult_small_loop_step(u64 *nq, u64 *nqpq, u64 *nq2, u64 *nqpq2,
+ u64 *q, u8 byt)
+{
+ u64 bit0 = (u64)(byt >> 7);
+ u64 bit;
+ point_swap_conditional(nq, nqpq, bit0);
+ addanddouble_fmonty(nq2, nqpq2, nq, nqpq, q);
+ bit = (u64)(byt >> 7);
+ point_swap_conditional(nq2, nqpq2, bit);
+}
+
+static __always_inline void
+ladder_smallloop_cmult_small_loop_double_step(u64 *nq, u64 *nqpq, u64 *nq2,
+ u64 *nqpq2, u64 *q, u8 byt)
+{
+ u8 byt1;
+ ladder_smallloop_cmult_small_loop_step(nq, nqpq, nq2, nqpq2, q, byt);
+ byt1 = byt << 1;
+ ladder_smallloop_cmult_small_loop_step(nq2, nqpq2, nq, nqpq, q, byt1);
+}
+
+static __always_inline void
+ladder_smallloop_cmult_small_loop(u64 *nq, u64 *nqpq, u64 *nq2, u64 *nqpq2,
+ u64 *q, u8 byt, u32 i)
+{
+ while (i--) {
+ ladder_smallloop_cmult_small_loop_double_step(nq, nqpq, nq2,
+ nqpq2, q, byt);
+ byt <<= 2;
+ }
+}
+
+static __always_inline void ladder_bigloop_cmult_big_loop(u8 *n1, u64 *nq,
+ u64 *nqpq, u64 *nq2,
+ u64 *nqpq2, u64 *q,
+ u32 i)
+{
+ while (i--) {
+ u8 byte = n1[i];
+ ladder_smallloop_cmult_small_loop(nq, nqpq, nq2, nqpq2, q,
+ byte, 4);
+ }
+}
+
+static void ladder_cmult(u64 *result, u8 *n1, u64 *q)
+{
+ u64 point_buf[40] = { 0 };
+ u64 *nq = point_buf;
+ u64 *nqpq = point_buf + 10;
+ u64 *nq2 = point_buf + 20;
+ u64 *nqpq2 = point_buf + 30;
+ point_copy(nqpq, q);
+ nq[0] = 1;
+ ladder_bigloop_cmult_big_loop(n1, nq, nqpq, nq2, nqpq2, q, 32);
+ point_copy(result, nq);
+}
+
+static __always_inline void format_fexpand(u64 *output, const u8 *input)
+{
+ const u8 *x00 = input + 6;
+ const u8 *x01 = input + 12;
+ const u8 *x02 = input + 19;
+ const u8 *x0 = input + 24;
+ u64 i0, i1, i2, i3, i4, output0, output1, output2, output3, output4;
+ i0 = get_unaligned_le64(input);
+ i1 = get_unaligned_le64(x00);
+ i2 = get_unaligned_le64(x01);
+ i3 = get_unaligned_le64(x02);
+ i4 = get_unaligned_le64(x0);
+ output0 = i0 & 0x7ffffffffffffLLU;
+ output1 = i1 >> 3 & 0x7ffffffffffffLLU;
+ output2 = i2 >> 6 & 0x7ffffffffffffLLU;
+ output3 = i3 >> 1 & 0x7ffffffffffffLLU;
+ output4 = i4 >> 12 & 0x7ffffffffffffLLU;
+ output[0] = output0;
+ output[1] = output1;
+ output[2] = output2;
+ output[3] = output3;
+ output[4] = output4;
+}
+
+static __always_inline void format_fcontract_first_carry_pass(u64 *input)
+{
+ u64 t0 = input[0];
+ u64 t1 = input[1];
+ u64 t2 = input[2];
+ u64 t3 = input[3];
+ u64 t4 = input[4];
+ u64 t1_ = t1 + (t0 >> 51);
+ u64 t0_ = t0 & 0x7ffffffffffffLLU;
+ u64 t2_ = t2 + (t1_ >> 51);
+ u64 t1__ = t1_ & 0x7ffffffffffffLLU;
+ u64 t3_ = t3 + (t2_ >> 51);
+ u64 t2__ = t2_ & 0x7ffffffffffffLLU;
+ u64 t4_ = t4 + (t3_ >> 51);
+ u64 t3__ = t3_ & 0x7ffffffffffffLLU;
+ input[0] = t0_;
+ input[1] = t1__;
+ input[2] = t2__;
+ input[3] = t3__;
+ input[4] = t4_;
+}
+
+static __always_inline void format_fcontract_first_carry_full(u64 *input)
+{
+ format_fcontract_first_carry_pass(input);
+ modulo_carry_top(input);
+}
+
+static __always_inline void format_fcontract_second_carry_pass(u64 *input)
+{
+ u64 t0 = input[0];
+ u64 t1 = input[1];
+ u64 t2 = input[2];
+ u64 t3 = input[3];
+ u64 t4 = input[4];
+ u64 t1_ = t1 + (t0 >> 51);
+ u64 t0_ = t0 & 0x7ffffffffffffLLU;
+ u64 t2_ = t2 + (t1_ >> 51);
+ u64 t1__ = t1_ & 0x7ffffffffffffLLU;
+ u64 t3_ = t3 + (t2_ >> 51);
+ u64 t2__ = t2_ & 0x7ffffffffffffLLU;
+ u64 t4_ = t4 + (t3_ >> 51);
+ u64 t3__ = t3_ & 0x7ffffffffffffLLU;
+ input[0] = t0_;
+ input[1] = t1__;
+ input[2] = t2__;
+ input[3] = t3__;
+ input[4] = t4_;
+}
+
+static __always_inline void format_fcontract_second_carry_full(u64 *input)
+{
+ u64 i0;
+ u64 i1;
+ u64 i0_;
+ u64 i1_;
+ format_fcontract_second_carry_pass(input);
+ modulo_carry_top(input);
+ i0 = input[0];
+ i1 = input[1];
+ i0_ = i0 & 0x7ffffffffffffLLU;
+ i1_ = i1 + (i0 >> 51);
+ input[0] = i0_;
+ input[1] = i1_;
+}
+
+static __always_inline void format_fcontract_trim(u64 *input)
+{
+ u64 a0 = input[0];
+ u64 a1 = input[1];
+ u64 a2 = input[2];
+ u64 a3 = input[3];
+ u64 a4 = input[4];
+ u64 mask0 = u64_gte_mask(a0, 0x7ffffffffffedLLU);
+ u64 mask1 = u64_eq_mask(a1, 0x7ffffffffffffLLU);
+ u64 mask2 = u64_eq_mask(a2, 0x7ffffffffffffLLU);
+ u64 mask3 = u64_eq_mask(a3, 0x7ffffffffffffLLU);
+ u64 mask4 = u64_eq_mask(a4, 0x7ffffffffffffLLU);
+ u64 mask = (((mask0 & mask1) & mask2) & mask3) & mask4;
+ u64 a0_ = a0 - (0x7ffffffffffedLLU & mask);
+ u64 a1_ = a1 - (0x7ffffffffffffLLU & mask);
+ u64 a2_ = a2 - (0x7ffffffffffffLLU & mask);
+ u64 a3_ = a3 - (0x7ffffffffffffLLU & mask);
+ u64 a4_ = a4 - (0x7ffffffffffffLLU & mask);
+ input[0] = a0_;
+ input[1] = a1_;
+ input[2] = a2_;
+ input[3] = a3_;
+ input[4] = a4_;
+}
+
+static __always_inline void format_fcontract_store(u8 *output, u64 *input)
+{
+ u64 t0 = input[0];
+ u64 t1 = input[1];
+ u64 t2 = input[2];
+ u64 t3 = input[3];
+ u64 t4 = input[4];
+ u64 o0 = t1 << 51 | t0;
+ u64 o1 = t2 << 38 | t1 >> 13;
+ u64 o2 = t3 << 25 | t2 >> 26;
+ u64 o3 = t4 << 12 | t3 >> 39;
+ u8 *b0 = output;
+ u8 *b1 = output + 8;
+ u8 *b2 = output + 16;
+ u8 *b3 = output + 24;
+ put_unaligned_le64(o0, b0);
+ put_unaligned_le64(o1, b1);
+ put_unaligned_le64(o2, b2);
+ put_unaligned_le64(o3, b3);
+}
+
+static __always_inline void format_fcontract(u8 *output, u64 *input)
+{
+ format_fcontract_first_carry_full(input);
+ format_fcontract_second_carry_full(input);
+ format_fcontract_trim(input);
+ format_fcontract_store(output, input);
+}
+
+static __always_inline void format_scalar_of_point(u8 *scalar, u64 *point)
+{
+ u64 *x = point;
+ u64 *z = point + 5;
+ u64 buf[10] __aligned(32) = { 0 };
+ u64 *zmone = buf;
+ u64 *sc = buf + 5;
+ crecip(zmone, z);
+ fmul(sc, x, zmone);
+ format_fcontract(scalar, sc);
+}
+
+static void curve25519_generic(u8 mypublic[CURVE25519_KEY_SIZE],
+ const u8 secret[CURVE25519_KEY_SIZE],
+ const u8 basepoint[CURVE25519_KEY_SIZE])
+{
+ u64 buf0[10] __aligned(32) = { 0 };
+ u64 *x0 = buf0;
+ u64 *z = buf0 + 5;
+ u64 *q;
+ format_fexpand(x0, basepoint);
+ z[0] = 1;
+ q = buf0;
+ {
+ u8 e[32] __aligned(32) = { 0 };
+ u8 *scalar;
+ memcpy(e, secret, 32);
+ normalize_secret(e);
+ scalar = e;
+ {
+ u64 buf[15] = { 0 };
+ u64 *nq = buf;
+ u64 *x = nq;
+ x[0] = 1;
+ ladder_cmult(nq, scalar, q);
+ format_scalar_of_point(mypublic, nq);
+ memzero_explicit(buf, sizeof(buf));
+ }
+ memzero_explicit(e, sizeof(e));
+ }
+ memzero_explicit(buf0, sizeof(buf0));
+}
diff --git a/lib/zinc/curve25519/curve25519.c b/lib/zinc/curve25519/curve25519.c
new file mode 100644
index 000000000000..8d9d8fc0ca38
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519.c
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * This is an implementation of the Curve25519 ECDH algorithm, using either
+ * a 32-bit implementation or a 64-bit implementation with 128-bit integers,
+ * depending on what is supported by the target compiler.
+ *
+ * Information: https://cr.yp.to/ecdh.html
+ */
+
+#include <zinc/curve25519.h>
+#include "../selftest/run.h"
+
+#include <asm/unaligned.h>
+#include <linux/version.h>
+#include <linux/string.h>
+#include <linux/random.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <crypto/algapi.h> // For crypto_memneq.
+
+static bool *const curve25519_nobs[] __initconst = { };
+static void __init curve25519_fpu_init(void)
+{
+}
+static inline bool curve25519_arch(u8 mypublic[CURVE25519_KEY_SIZE],
+ const u8 secret[CURVE25519_KEY_SIZE],
+ const u8 basepoint[CURVE25519_KEY_SIZE])
+{
+ return false;
+}
+static inline bool curve25519_base_arch(u8 pub[CURVE25519_KEY_SIZE],
+ const u8 secret[CURVE25519_KEY_SIZE])
+{
+ return false;
+}
+
+static __always_inline void normalize_secret(u8 secret[CURVE25519_KEY_SIZE])
+{
+ secret[0] &= 248;
+ secret[31] &= 127;
+ secret[31] |= 64;
+}
+
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+#include "curve25519-hacl64.c"
+#else
+#include "curve25519-fiat32.c"
+#endif
+
+static const u8 null_point[CURVE25519_KEY_SIZE] = { 0 };
+
+bool curve25519(u8 mypublic[CURVE25519_KEY_SIZE],
+ const u8 secret[CURVE25519_KEY_SIZE],
+ const u8 basepoint[CURVE25519_KEY_SIZE])
+{
+ if (!curve25519_arch(mypublic, secret, basepoint))
+ curve25519_generic(mypublic, secret, basepoint);
+ return crypto_memneq(mypublic, null_point, CURVE25519_KEY_SIZE);
+}
+EXPORT_SYMBOL(curve25519);
+
+bool curve25519_generate_public(u8 pub[CURVE25519_KEY_SIZE],
+ const u8 secret[CURVE25519_KEY_SIZE])
+{
+ static const u8 basepoint[CURVE25519_KEY_SIZE] __aligned(32) = { 9 };
+
+ if (unlikely(!crypto_memneq(secret, null_point, CURVE25519_KEY_SIZE)))
+ return false;
+
+ if (curve25519_base_arch(pub, secret))
+ return crypto_memneq(pub, null_point, CURVE25519_KEY_SIZE);
+ return curve25519(pub, secret, basepoint);
+}
+EXPORT_SYMBOL(curve25519_generate_public);
+
+void curve25519_generate_secret(u8 secret[CURVE25519_KEY_SIZE])
+{
+ get_random_bytes_wait(secret, CURVE25519_KEY_SIZE);
+ normalize_secret(secret);
+}
+EXPORT_SYMBOL(curve25519_generate_secret);
+
+#include "../selftest/curve25519.c"
+
+static bool nosimd __initdata = false;
+
+static int __init mod_init(void)
+{
+ if (!nosimd)
+ curve25519_fpu_init();
+ if (!selftest_run("curve25519", curve25519_selftest, curve25519_nobs,
+ ARRAY_SIZE(curve25519_nobs)))
+ return -ENOTRECOVERABLE;
+ return 0;
+}
+
+static void __exit mod_exit(void)
+{
+}
+
+module_param(nosimd, bool, 0);
+module_init(mod_init);
+module_exit(mod_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("Curve25519 scalar multiplication");
+MODULE_AUTHOR("Jason A. Donenfeld <Jason@zx2c4.com>");
diff --git a/lib/zinc/selftest/curve25519.c b/lib/zinc/selftest/curve25519.c
new file mode 100644
index 000000000000..fa653d47df5a
--- /dev/null
+++ b/lib/zinc/selftest/curve25519.c
@@ -0,0 +1,1315 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+struct curve25519_test_vector {
+ u8 private[CURVE25519_KEY_SIZE];
+ u8 public[CURVE25519_KEY_SIZE];
+ u8 result[CURVE25519_KEY_SIZE];
+ bool valid;
+};
+static const struct curve25519_test_vector curve25519_test_vectors[] __initconst = {
+ {
+ .private = { 0x77, 0x07, 0x6d, 0x0a, 0x73, 0x18, 0xa5, 0x7d,
+ 0x3c, 0x16, 0xc1, 0x72, 0x51, 0xb2, 0x66, 0x45,
+ 0xdf, 0x4c, 0x2f, 0x87, 0xeb, 0xc0, 0x99, 0x2a,
+ 0xb1, 0x77, 0xfb, 0xa5, 0x1d, 0xb9, 0x2c, 0x2a },
+ .public = { 0xde, 0x9e, 0xdb, 0x7d, 0x7b, 0x7d, 0xc1, 0xb4,
+ 0xd3, 0x5b, 0x61, 0xc2, 0xec, 0xe4, 0x35, 0x37,
+ 0x3f, 0x83, 0x43, 0xc8, 0x5b, 0x78, 0x67, 0x4d,
+ 0xad, 0xfc, 0x7e, 0x14, 0x6f, 0x88, 0x2b, 0x4f },
+ .result = { 0x4a, 0x5d, 0x9d, 0x5b, 0xa4, 0xce, 0x2d, 0xe1,
+ 0x72, 0x8e, 0x3b, 0xf4, 0x80, 0x35, 0x0f, 0x25,
+ 0xe0, 0x7e, 0x21, 0xc9, 0x47, 0xd1, 0x9e, 0x33,
+ 0x76, 0xf0, 0x9b, 0x3c, 0x1e, 0x16, 0x17, 0x42 },
+ .valid = true
+ },
+ {
+ .private = { 0x5d, 0xab, 0x08, 0x7e, 0x62, 0x4a, 0x8a, 0x4b,
+ 0x79, 0xe1, 0x7f, 0x8b, 0x83, 0x80, 0x0e, 0xe6,
+ 0x6f, 0x3b, 0xb1, 0x29, 0x26, 0x18, 0xb6, 0xfd,
+ 0x1c, 0x2f, 0x8b, 0x27, 0xff, 0x88, 0xe0, 0xeb },
+ .public = { 0x85, 0x20, 0xf0, 0x09, 0x89, 0x30, 0xa7, 0x54,
+ 0x74, 0x8b, 0x7d, 0xdc, 0xb4, 0x3e, 0xf7, 0x5a,
+ 0x0d, 0xbf, 0x3a, 0x0d, 0x26, 0x38, 0x1a, 0xf4,
+ 0xeb, 0xa4, 0xa9, 0x8e, 0xaa, 0x9b, 0x4e, 0x6a },
+ .result = { 0x4a, 0x5d, 0x9d, 0x5b, 0xa4, 0xce, 0x2d, 0xe1,
+ 0x72, 0x8e, 0x3b, 0xf4, 0x80, 0x35, 0x0f, 0x25,
+ 0xe0, 0x7e, 0x21, 0xc9, 0x47, 0xd1, 0x9e, 0x33,
+ 0x76, 0xf0, 0x9b, 0x3c, 0x1e, 0x16, 0x17, 0x42 },
+ .valid = true
+ },
+ {
+ .private = { 1 },
+ .public = { 0x25, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .result = { 0x3c, 0x77, 0x77, 0xca, 0xf9, 0x97, 0xb2, 0x64,
+ 0x41, 0x60, 0x77, 0x66, 0x5b, 0x4e, 0x22, 0x9d,
+ 0x0b, 0x95, 0x48, 0xdc, 0x0c, 0xd8, 0x19, 0x98,
+ 0xdd, 0xcd, 0xc5, 0xc8, 0x53, 0x3c, 0x79, 0x7f },
+ .valid = true
+ },
+ {
+ .private = { 1 },
+ .public = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+ .result = { 0xb3, 0x2d, 0x13, 0x62, 0xc2, 0x48, 0xd6, 0x2f,
+ 0xe6, 0x26, 0x19, 0xcf, 0xf0, 0x4d, 0xd4, 0x3d,
+ 0xb7, 0x3f, 0xfc, 0x1b, 0x63, 0x08, 0xed, 0xe3,
+ 0x0b, 0x78, 0xd8, 0x73, 0x80, 0xf1, 0xe8, 0x34 },
+ .valid = true
+ },
+ {
+ .private = { 0xa5, 0x46, 0xe3, 0x6b, 0xf0, 0x52, 0x7c, 0x9d,
+ 0x3b, 0x16, 0x15, 0x4b, 0x82, 0x46, 0x5e, 0xdd,
+ 0x62, 0x14, 0x4c, 0x0a, 0xc1, 0xfc, 0x5a, 0x18,
+ 0x50, 0x6a, 0x22, 0x44, 0xba, 0x44, 0x9a, 0xc4 },
+ .public = { 0xe6, 0xdb, 0x68, 0x67, 0x58, 0x30, 0x30, 0xdb,
+ 0x35, 0x94, 0xc1, 0xa4, 0x24, 0xb1, 0x5f, 0x7c,
+ 0x72, 0x66, 0x24, 0xec, 0x26, 0xb3, 0x35, 0x3b,
+ 0x10, 0xa9, 0x03, 0xa6, 0xd0, 0xab, 0x1c, 0x4c },
+ .result = { 0xc3, 0xda, 0x55, 0x37, 0x9d, 0xe9, 0xc6, 0x90,
+ 0x8e, 0x94, 0xea, 0x4d, 0xf2, 0x8d, 0x08, 0x4f,
+ 0x32, 0xec, 0xcf, 0x03, 0x49, 0x1c, 0x71, 0xf7,
+ 0x54, 0xb4, 0x07, 0x55, 0x77, 0xa2, 0x85, 0x52 },
+ .valid = true
+ },
+ {
+ .private = { 1, 2, 3, 4 },
+ .public = { 0 },
+ .result = { 0 },
+ .valid = false
+ },
+ {
+ .private = { 2, 4, 6, 8 },
+ .public = { 0xe0, 0xeb, 0x7a, 0x7c, 0x3b, 0x41, 0xb8, 0xae,
+ 0x16, 0x56, 0xe3, 0xfa, 0xf1, 0x9f, 0xc4, 0x6a,
+ 0xda, 0x09, 0x8d, 0xeb, 0x9c, 0x32, 0xb1, 0xfd,
+ 0x86, 0x62, 0x05, 0x16, 0x5f, 0x49, 0xb8 },
+ .result = { 0 },
+ .valid = false
+ },
+ {
+ .private = { 0xff, 0xff, 0xff, 0xff, 0x0a, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+ .public = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0x0a, 0x00, 0xfb, 0x9f },
+ .result = { 0x77, 0x52, 0xb6, 0x18, 0xc1, 0x2d, 0x48, 0xd2,
+ 0xc6, 0x93, 0x46, 0x83, 0x81, 0x7c, 0xc6, 0x57,
+ 0xf3, 0x31, 0x03, 0x19, 0x49, 0x48, 0x20, 0x05,
+ 0x42, 0x2b, 0x4e, 0xae, 0x8d, 0x1d, 0x43, 0x23 },
+ .valid = true
+ },
+ {
+ .private = { 0x8e, 0x0a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .public = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x8e, 0x06 },
+ .result = { 0x5a, 0xdf, 0xaa, 0x25, 0x86, 0x8e, 0x32, 0x3d,
+ 0xae, 0x49, 0x62, 0xc1, 0x01, 0x5c, 0xb3, 0x12,
+ 0xe1, 0xc5, 0xc7, 0x9e, 0x95, 0x3f, 0x03, 0x99,
+ 0xb0, 0xba, 0x16, 0x22, 0xf3, 0xb6, 0xf7, 0x0c },
+ .valid = true
+ },
+ /* wycheproof - normal case */
+ {
+ .private = { 0x48, 0x52, 0x83, 0x4d, 0x9d, 0x6b, 0x77, 0xda,
+ 0xde, 0xab, 0xaa, 0xf2, 0xe1, 0x1d, 0xca, 0x66,
+ 0xd1, 0x9f, 0xe7, 0x49, 0x93, 0xa7, 0xbe, 0xc3,
+ 0x6c, 0x6e, 0x16, 0xa0, 0x98, 0x3f, 0xea, 0xba },
+ .public = { 0x9c, 0x64, 0x7d, 0x9a, 0xe5, 0x89, 0xb9, 0xf5,
+ 0x8f, 0xdc, 0x3c, 0xa4, 0x94, 0x7e, 0xfb, 0xc9,
+ 0x15, 0xc4, 0xb2, 0xe0, 0x8e, 0x74, 0x4a, 0x0e,
+ 0xdf, 0x46, 0x9d, 0xac, 0x59, 0xc8, 0xf8, 0x5a },
+ .result = { 0x87, 0xb7, 0xf2, 0x12, 0xb6, 0x27, 0xf7, 0xa5,
+ 0x4c, 0xa5, 0xe0, 0xbc, 0xda, 0xdd, 0xd5, 0x38,
+ 0x9d, 0x9d, 0xe6, 0x15, 0x6c, 0xdb, 0xcf, 0x8e,
+ 0xbe, 0x14, 0xff, 0xbc, 0xfb, 0x43, 0x65, 0x51 },
+ .valid = true
+ },
+ /* wycheproof - public key on twist */
+ {
+ .private = { 0x58, 0x8c, 0x06, 0x1a, 0x50, 0x80, 0x4a, 0xc4,
+ 0x88, 0xad, 0x77, 0x4a, 0xc7, 0x16, 0xc3, 0xf5,
+ 0xba, 0x71, 0x4b, 0x27, 0x12, 0xe0, 0x48, 0x49,
+ 0x13, 0x79, 0xa5, 0x00, 0x21, 0x19, 0x98, 0xa8 },
+ .public = { 0x63, 0xaa, 0x40, 0xc6, 0xe3, 0x83, 0x46, 0xc5,
+ 0xca, 0xf2, 0x3a, 0x6d, 0xf0, 0xa5, 0xe6, 0xc8,
+ 0x08, 0x89, 0xa0, 0x86, 0x47, 0xe5, 0x51, 0xb3,
+ 0x56, 0x34, 0x49, 0xbe, 0xfc, 0xfc, 0x97, 0x33 },
+ .result = { 0xb1, 0xa7, 0x07, 0x51, 0x94, 0x95, 0xff, 0xff,
+ 0xb2, 0x98, 0xff, 0x94, 0x17, 0x16, 0xb0, 0x6d,
+ 0xfa, 0xb8, 0x7c, 0xf8, 0xd9, 0x11, 0x23, 0xfe,
+ 0x2b, 0xe9, 0xa2, 0x33, 0xdd, 0xa2, 0x22, 0x12 },
+ .valid = true
+ },
+ /* wycheproof - public key on twist */
+ {
+ .private = { 0xb0, 0x5b, 0xfd, 0x32, 0xe5, 0x53, 0x25, 0xd9,
+ 0xfd, 0x64, 0x8c, 0xb3, 0x02, 0x84, 0x80, 0x39,
+ 0x00, 0x0b, 0x39, 0x0e, 0x44, 0xd5, 0x21, 0xe5,
+ 0x8a, 0xab, 0x3b, 0x29, 0xa6, 0x96, 0x0b, 0xa8 },
+ .public = { 0x0f, 0x83, 0xc3, 0x6f, 0xde, 0xd9, 0xd3, 0x2f,
+ 0xad, 0xf4, 0xef, 0xa3, 0xae, 0x93, 0xa9, 0x0b,
+ 0xb5, 0xcf, 0xa6, 0x68, 0x93, 0xbc, 0x41, 0x2c,
+ 0x43, 0xfa, 0x72, 0x87, 0xdb, 0xb9, 0x97, 0x79 },
+ .result = { 0x67, 0xdd, 0x4a, 0x6e, 0x16, 0x55, 0x33, 0x53,
+ 0x4c, 0x0e, 0x3f, 0x17, 0x2e, 0x4a, 0xb8, 0x57,
+ 0x6b, 0xca, 0x92, 0x3a, 0x5f, 0x07, 0xb2, 0xc0,
+ 0x69, 0xb4, 0xc3, 0x10, 0xff, 0x2e, 0x93, 0x5b },
+ .valid = true
+ },
+ /* wycheproof - public key on twist */
+ {
+ .private = { 0x70, 0xe3, 0x4b, 0xcb, 0xe1, 0xf4, 0x7f, 0xbc,
+ 0x0f, 0xdd, 0xfd, 0x7c, 0x1e, 0x1a, 0xa5, 0x3d,
+ 0x57, 0xbf, 0xe0, 0xf6, 0x6d, 0x24, 0x30, 0x67,
+ 0xb4, 0x24, 0xbb, 0x62, 0x10, 0xbe, 0xd1, 0x9c },
+ .public = { 0x0b, 0x82, 0x11, 0xa2, 0xb6, 0x04, 0x90, 0x97,
+ 0xf6, 0x87, 0x1c, 0x6c, 0x05, 0x2d, 0x3c, 0x5f,
+ 0xc1, 0xba, 0x17, 0xda, 0x9e, 0x32, 0xae, 0x45,
+ 0x84, 0x03, 0xb0, 0x5b, 0xb2, 0x83, 0x09, 0x2a },
+ .result = { 0x4a, 0x06, 0x38, 0xcf, 0xaa, 0x9e, 0xf1, 0x93,
+ 0x3b, 0x47, 0xf8, 0x93, 0x92, 0x96, 0xa6, 0xb2,
+ 0x5b, 0xe5, 0x41, 0xef, 0x7f, 0x70, 0xe8, 0x44,
+ 0xc0, 0xbc, 0xc0, 0x0b, 0x13, 0x4d, 0xe6, 0x4a },
+ .valid = true
+ },
+ /* wycheproof - public key on twist */
+ {
+ .private = { 0x68, 0xc1, 0xf3, 0xa6, 0x53, 0xa4, 0xcd, 0xb1,
+ 0xd3, 0x7b, 0xba, 0x94, 0x73, 0x8f, 0x8b, 0x95,
+ 0x7a, 0x57, 0xbe, 0xb2, 0x4d, 0x64, 0x6e, 0x99,
+ 0x4d, 0xc2, 0x9a, 0x27, 0x6a, 0xad, 0x45, 0x8d },
+ .public = { 0x34, 0x3a, 0xc2, 0x0a, 0x3b, 0x9c, 0x6a, 0x27,
+ 0xb1, 0x00, 0x81, 0x76, 0x50, 0x9a, 0xd3, 0x07,
+ 0x35, 0x85, 0x6e, 0xc1, 0xc8, 0xd8, 0xfc, 0xae,
+ 0x13, 0x91, 0x2d, 0x08, 0xd1, 0x52, 0xf4, 0x6c },
+ .result = { 0x39, 0x94, 0x91, 0xfc, 0xe8, 0xdf, 0xab, 0x73,
+ 0xb4, 0xf9, 0xf6, 0x11, 0xde, 0x8e, 0xa0, 0xb2,
+ 0x7b, 0x28, 0xf8, 0x59, 0x94, 0x25, 0x0b, 0x0f,
+ 0x47, 0x5d, 0x58, 0x5d, 0x04, 0x2a, 0xc2, 0x07 },
+ .valid = true
+ },
+ /* wycheproof - public key on twist */
+ {
+ .private = { 0xd8, 0x77, 0xb2, 0x6d, 0x06, 0xdf, 0xf9, 0xd9,
+ 0xf7, 0xfd, 0x4c, 0x5b, 0x37, 0x69, 0xf8, 0xcd,
+ 0xd5, 0xb3, 0x05, 0x16, 0xa5, 0xab, 0x80, 0x6b,
+ 0xe3, 0x24, 0xff, 0x3e, 0xb6, 0x9e, 0xa0, 0xb2 },
+ .public = { 0xfa, 0x69, 0x5f, 0xc7, 0xbe, 0x8d, 0x1b, 0xe5,
+ 0xbf, 0x70, 0x48, 0x98, 0xf3, 0x88, 0xc4, 0x52,
+ 0xba, 0xfd, 0xd3, 0xb8, 0xea, 0xe8, 0x05, 0xf8,
+ 0x68, 0x1a, 0x8d, 0x15, 0xc2, 0xd4, 0xe1, 0x42 },
+ .result = { 0x2c, 0x4f, 0xe1, 0x1d, 0x49, 0x0a, 0x53, 0x86,
+ 0x17, 0x76, 0xb1, 0x3b, 0x43, 0x54, 0xab, 0xd4,
+ 0xcf, 0x5a, 0x97, 0x69, 0x9d, 0xb6, 0xe6, 0xc6,
+ 0x8c, 0x16, 0x26, 0xd0, 0x76, 0x62, 0xf7, 0x58 },
+ .valid = true
+ },
+ /* wycheproof - public key = 0 */
+ {
+ .private = { 0x20, 0x74, 0x94, 0x03, 0x8f, 0x2b, 0xb8, 0x11,
+ 0xd4, 0x78, 0x05, 0xbc, 0xdf, 0x04, 0xa2, 0xac,
+ 0x58, 0x5a, 0xda, 0x7f, 0x2f, 0x23, 0x38, 0x9b,
+ 0xfd, 0x46, 0x58, 0xf9, 0xdd, 0xd4, 0xde, 0xbc },
+ .public = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = false
+ },
+ /* wycheproof - public key = 1 */
+ {
+ .private = { 0x20, 0x2e, 0x89, 0x72, 0xb6, 0x1c, 0x7e, 0x61,
+ 0x93, 0x0e, 0xb9, 0x45, 0x0b, 0x50, 0x70, 0xea,
+ 0xe1, 0xc6, 0x70, 0x47, 0x56, 0x85, 0x54, 0x1f,
+ 0x04, 0x76, 0x21, 0x7e, 0x48, 0x18, 0xcf, 0xab },
+ .public = { 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = false
+ },
+ /* wycheproof - edge case on twist */
+ {
+ .private = { 0x38, 0xdd, 0xe9, 0xf3, 0xe7, 0xb7, 0x99, 0x04,
+ 0x5f, 0x9a, 0xc3, 0x79, 0x3d, 0x4a, 0x92, 0x77,
+ 0xda, 0xde, 0xad, 0xc4, 0x1b, 0xec, 0x02, 0x90,
+ 0xf8, 0x1f, 0x74, 0x4f, 0x73, 0x77, 0x5f, 0x84 },
+ .public = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .result = { 0x9a, 0x2c, 0xfe, 0x84, 0xff, 0x9c, 0x4a, 0x97,
+ 0x39, 0x62, 0x5c, 0xae, 0x4a, 0x3b, 0x82, 0xa9,
+ 0x06, 0x87, 0x7a, 0x44, 0x19, 0x46, 0xf8, 0xd7,
+ 0xb3, 0xd7, 0x95, 0xfe, 0x8f, 0x5d, 0x16, 0x39 },
+ .valid = true
+ },
+ /* wycheproof - edge case on twist */
+ {
+ .private = { 0x98, 0x57, 0xa9, 0x14, 0xe3, 0xc2, 0x90, 0x36,
+ 0xfd, 0x9a, 0x44, 0x2b, 0xa5, 0x26, 0xb5, 0xcd,
+ 0xcd, 0xf2, 0x82, 0x16, 0x15, 0x3e, 0x63, 0x6c,
+ 0x10, 0x67, 0x7a, 0xca, 0xb6, 0xbd, 0x6a, 0xa5 },
+ .public = { 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .result = { 0x4d, 0xa4, 0xe0, 0xaa, 0x07, 0x2c, 0x23, 0x2e,
+ 0xe2, 0xf0, 0xfa, 0x4e, 0x51, 0x9a, 0xe5, 0x0b,
+ 0x52, 0xc1, 0xed, 0xd0, 0x8a, 0x53, 0x4d, 0x4e,
+ 0xf3, 0x46, 0xc2, 0xe1, 0x06, 0xd2, 0x1d, 0x60 },
+ .valid = true
+ },
+ /* wycheproof - edge case on twist */
+ {
+ .private = { 0x48, 0xe2, 0x13, 0x0d, 0x72, 0x33, 0x05, 0xed,
+ 0x05, 0xe6, 0xe5, 0x89, 0x4d, 0x39, 0x8a, 0x5e,
+ 0x33, 0x36, 0x7a, 0x8c, 0x6a, 0xac, 0x8f, 0xcd,
+ 0xf0, 0xa8, 0x8e, 0x4b, 0x42, 0x82, 0x0d, 0xb7 },
+ .public = { 0xff, 0xff, 0xff, 0x03, 0x00, 0x00, 0xf8, 0xff,
+ 0xff, 0x1f, 0x00, 0x00, 0xc0, 0xff, 0xff, 0xff,
+ 0x00, 0x00, 0x00, 0xfe, 0xff, 0xff, 0x07, 0x00,
+ 0x00, 0xf0, 0xff, 0xff, 0x3f, 0x00, 0x00, 0x00 },
+ .result = { 0x9e, 0xd1, 0x0c, 0x53, 0x74, 0x7f, 0x64, 0x7f,
+ 0x82, 0xf4, 0x51, 0x25, 0xd3, 0xde, 0x15, 0xa1,
+ 0xe6, 0xb8, 0x24, 0x49, 0x6a, 0xb4, 0x04, 0x10,
+ 0xff, 0xcc, 0x3c, 0xfe, 0x95, 0x76, 0x0f, 0x3b },
+ .valid = true
+ },
+ /* wycheproof - edge case on twist */
+ {
+ .private = { 0x28, 0xf4, 0x10, 0x11, 0x69, 0x18, 0x51, 0xb3,
+ 0xa6, 0x2b, 0x64, 0x15, 0x53, 0xb3, 0x0d, 0x0d,
+ 0xfd, 0xdc, 0xb8, 0xff, 0xfc, 0xf5, 0x37, 0x00,
+ 0xa7, 0xbe, 0x2f, 0x6a, 0x87, 0x2e, 0x9f, 0xb0 },
+ .public = { 0x00, 0x00, 0x00, 0xfc, 0xff, 0xff, 0x07, 0x00,
+ 0x00, 0xe0, 0xff, 0xff, 0x3f, 0x00, 0x00, 0x00,
+ 0xff, 0xff, 0xff, 0x01, 0x00, 0x00, 0xf8, 0xff,
+ 0xff, 0x0f, 0x00, 0x00, 0xc0, 0xff, 0xff, 0x7f },
+ .result = { 0xcf, 0x72, 0xb4, 0xaa, 0x6a, 0xa1, 0xc9, 0xf8,
+ 0x94, 0xf4, 0x16, 0x5b, 0x86, 0x10, 0x9a, 0xa4,
+ 0x68, 0x51, 0x76, 0x48, 0xe1, 0xf0, 0xcc, 0x70,
+ 0xe1, 0xab, 0x08, 0x46, 0x01, 0x76, 0x50, 0x6b },
+ .valid = true
+ },
+ /* wycheproof - edge case on twist */
+ {
+ .private = { 0x18, 0xa9, 0x3b, 0x64, 0x99, 0xb9, 0xf6, 0xb3,
+ 0x22, 0x5c, 0xa0, 0x2f, 0xef, 0x41, 0x0e, 0x0a,
+ 0xde, 0xc2, 0x35, 0x32, 0x32, 0x1d, 0x2d, 0x8e,
+ 0xf1, 0xa6, 0xd6, 0x02, 0xa8, 0xc6, 0x5b, 0x83 },
+ .public = { 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff,
+ 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff,
+ 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff,
+ 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0x7f },
+ .result = { 0x5d, 0x50, 0xb6, 0x28, 0x36, 0xbb, 0x69, 0x57,
+ 0x94, 0x10, 0x38, 0x6c, 0xf7, 0xbb, 0x81, 0x1c,
+ 0x14, 0xbf, 0x85, 0xb1, 0xc7, 0xb1, 0x7e, 0x59,
+ 0x24, 0xc7, 0xff, 0xea, 0x91, 0xef, 0x9e, 0x12 },
+ .valid = true
+ },
+ /* wycheproof - edge case on twist */
+ {
+ .private = { 0xc0, 0x1d, 0x13, 0x05, 0xa1, 0x33, 0x8a, 0x1f,
+ 0xca, 0xc2, 0xba, 0x7e, 0x2e, 0x03, 0x2b, 0x42,
+ 0x7e, 0x0b, 0x04, 0x90, 0x31, 0x65, 0xac, 0xa9,
+ 0x57, 0xd8, 0xd0, 0x55, 0x3d, 0x87, 0x17, 0xb0 },
+ .public = { 0xea, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+ .result = { 0x19, 0x23, 0x0e, 0xb1, 0x48, 0xd5, 0xd6, 0x7c,
+ 0x3c, 0x22, 0xab, 0x1d, 0xae, 0xff, 0x80, 0xa5,
+ 0x7e, 0xae, 0x42, 0x65, 0xce, 0x28, 0x72, 0x65,
+ 0x7b, 0x2c, 0x80, 0x99, 0xfc, 0x69, 0x8e, 0x50 },
+ .valid = true
+ },
+ /* wycheproof - edge case for public key */
+ {
+ .private = { 0x38, 0x6f, 0x7f, 0x16, 0xc5, 0x07, 0x31, 0xd6,
+ 0x4f, 0x82, 0xe6, 0xa1, 0x70, 0xb1, 0x42, 0xa4,
+ 0xe3, 0x4f, 0x31, 0xfd, 0x77, 0x68, 0xfc, 0xb8,
+ 0x90, 0x29, 0x25, 0xe7, 0xd1, 0xe2, 0x1a, 0xbe },
+ .public = { 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .result = { 0x0f, 0xca, 0xb5, 0xd8, 0x42, 0xa0, 0x78, 0xd7,
+ 0xa7, 0x1f, 0xc5, 0x9b, 0x57, 0xbf, 0xb4, 0xca,
+ 0x0b, 0xe6, 0x87, 0x3b, 0x49, 0xdc, 0xdb, 0x9f,
+ 0x44, 0xe1, 0x4a, 0xe8, 0xfb, 0xdf, 0xa5, 0x42 },
+ .valid = true
+ },
+ /* wycheproof - edge case for public key */
+ {
+ .private = { 0xe0, 0x23, 0xa2, 0x89, 0xbd, 0x5e, 0x90, 0xfa,
+ 0x28, 0x04, 0xdd, 0xc0, 0x19, 0xa0, 0x5e, 0xf3,
+ 0xe7, 0x9d, 0x43, 0x4b, 0xb6, 0xea, 0x2f, 0x52,
+ 0x2e, 0xcb, 0x64, 0x3a, 0x75, 0x29, 0x6e, 0x95 },
+ .public = { 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00,
+ 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00,
+ 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00,
+ 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00 },
+ .result = { 0x54, 0xce, 0x8f, 0x22, 0x75, 0xc0, 0x77, 0xe3,
+ 0xb1, 0x30, 0x6a, 0x39, 0x39, 0xc5, 0xe0, 0x3e,
+ 0xef, 0x6b, 0xbb, 0x88, 0x06, 0x05, 0x44, 0x75,
+ 0x8d, 0x9f, 0xef, 0x59, 0xb0, 0xbc, 0x3e, 0x4f },
+ .valid = true
+ },
+ /* wycheproof - edge case for public key */
+ {
+ .private = { 0x68, 0xf0, 0x10, 0xd6, 0x2e, 0xe8, 0xd9, 0x26,
+ 0x05, 0x3a, 0x36, 0x1c, 0x3a, 0x75, 0xc6, 0xea,
+ 0x4e, 0xbd, 0xc8, 0x60, 0x6a, 0xb2, 0x85, 0x00,
+ 0x3a, 0x6f, 0x8f, 0x40, 0x76, 0xb0, 0x1e, 0x83 },
+ .public = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x03 },
+ .result = { 0xf1, 0x36, 0x77, 0x5c, 0x5b, 0xeb, 0x0a, 0xf8,
+ 0x11, 0x0a, 0xf1, 0x0b, 0x20, 0x37, 0x23, 0x32,
+ 0x04, 0x3c, 0xab, 0x75, 0x24, 0x19, 0x67, 0x87,
+ 0x75, 0xa2, 0x23, 0xdf, 0x57, 0xc9, 0xd3, 0x0d },
+ .valid = true
+ },
+ /* wycheproof - edge case for public key */
+ {
+ .private = { 0x58, 0xeb, 0xcb, 0x35, 0xb0, 0xf8, 0x84, 0x5c,
+ 0xaf, 0x1e, 0xc6, 0x30, 0xf9, 0x65, 0x76, 0xb6,
+ 0x2c, 0x4b, 0x7b, 0x6c, 0x36, 0xb2, 0x9d, 0xeb,
+ 0x2c, 0xb0, 0x08, 0x46, 0x51, 0x75, 0x5c, 0x96 },
+ .public = { 0xff, 0xff, 0xff, 0xfb, 0xff, 0xff, 0xfb, 0xff,
+ 0xff, 0xdf, 0xff, 0xff, 0xdf, 0xff, 0xff, 0xff,
+ 0xfe, 0xff, 0xff, 0xfe, 0xff, 0xff, 0xf7, 0xff,
+ 0xff, 0xf7, 0xff, 0xff, 0xbf, 0xff, 0xff, 0x3f },
+ .result = { 0xbf, 0x9a, 0xff, 0xd0, 0x6b, 0x84, 0x40, 0x85,
+ 0x58, 0x64, 0x60, 0x96, 0x2e, 0xf2, 0x14, 0x6f,
+ 0xf3, 0xd4, 0x53, 0x3d, 0x94, 0x44, 0xaa, 0xb0,
+ 0x06, 0xeb, 0x88, 0xcc, 0x30, 0x54, 0x40, 0x7d },
+ .valid = true
+ },
+ /* wycheproof - edge case for public key */
+ {
+ .private = { 0x18, 0x8c, 0x4b, 0xc5, 0xb9, 0xc4, 0x4b, 0x38,
+ 0xbb, 0x65, 0x8b, 0x9b, 0x2a, 0xe8, 0x2d, 0x5b,
+ 0x01, 0x01, 0x5e, 0x09, 0x31, 0x84, 0xb1, 0x7c,
+ 0xb7, 0x86, 0x35, 0x03, 0xa7, 0x83, 0xe1, 0xbb },
+ .public = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x3f },
+ .result = { 0xd4, 0x80, 0xde, 0x04, 0xf6, 0x99, 0xcb, 0x3b,
+ 0xe0, 0x68, 0x4a, 0x9c, 0xc2, 0xe3, 0x12, 0x81,
+ 0xea, 0x0b, 0xc5, 0xa9, 0xdc, 0xc1, 0x57, 0xd3,
+ 0xd2, 0x01, 0x58, 0xd4, 0x6c, 0xa5, 0x24, 0x6d },
+ .valid = true
+ },
+ /* wycheproof - edge case for public key */
+ {
+ .private = { 0xe0, 0x6c, 0x11, 0xbb, 0x2e, 0x13, 0xce, 0x3d,
+ 0xc7, 0x67, 0x3f, 0x67, 0xf5, 0x48, 0x22, 0x42,
+ 0x90, 0x94, 0x23, 0xa9, 0xae, 0x95, 0xee, 0x98,
+ 0x6a, 0x98, 0x8d, 0x98, 0xfa, 0xee, 0x23, 0xa2 },
+ .public = { 0xff, 0xff, 0xff, 0xff, 0xfe, 0xff, 0xff, 0x7f,
+ 0xff, 0xff, 0xff, 0xff, 0xfe, 0xff, 0xff, 0x7f,
+ 0xff, 0xff, 0xff, 0xff, 0xfe, 0xff, 0xff, 0x7f,
+ 0xff, 0xff, 0xff, 0xff, 0xfe, 0xff, 0xff, 0x7f },
+ .result = { 0x4c, 0x44, 0x01, 0xcc, 0xe6, 0xb5, 0x1e, 0x4c,
+ 0xb1, 0x8f, 0x27, 0x90, 0x24, 0x6c, 0x9b, 0xf9,
+ 0x14, 0xdb, 0x66, 0x77, 0x50, 0xa1, 0xcb, 0x89,
+ 0x06, 0x90, 0x92, 0xaf, 0x07, 0x29, 0x22, 0x76 },
+ .valid = true
+ },
+ /* wycheproof - edge case for public key */
+ {
+ .private = { 0xc0, 0x65, 0x8c, 0x46, 0xdd, 0xe1, 0x81, 0x29,
+ 0x29, 0x38, 0x77, 0x53, 0x5b, 0x11, 0x62, 0xb6,
+ 0xf9, 0xf5, 0x41, 0x4a, 0x23, 0xcf, 0x4d, 0x2c,
+ 0xbc, 0x14, 0x0a, 0x4d, 0x99, 0xda, 0x2b, 0x8f },
+ .public = { 0xeb, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+ .result = { 0x57, 0x8b, 0xa8, 0xcc, 0x2d, 0xbd, 0xc5, 0x75,
+ 0xaf, 0xcf, 0x9d, 0xf2, 0xb3, 0xee, 0x61, 0x89,
+ 0xf5, 0x33, 0x7d, 0x68, 0x54, 0xc7, 0x9b, 0x4c,
+ 0xe1, 0x65, 0xea, 0x12, 0x29, 0x3b, 0x3a, 0x0f },
+ .valid = true
+ },
+ /* wycheproof - public key with low order */
+ {
+ .private = { 0x10, 0x25, 0x5c, 0x92, 0x30, 0xa9, 0x7a, 0x30,
+ 0xa4, 0x58, 0xca, 0x28, 0x4a, 0x62, 0x96, 0x69,
+ 0x29, 0x3a, 0x31, 0x89, 0x0c, 0xda, 0x9d, 0x14,
+ 0x7f, 0xeb, 0xc7, 0xd1, 0xe2, 0x2d, 0x6b, 0xb1 },
+ .public = { 0xe0, 0xeb, 0x7a, 0x7c, 0x3b, 0x41, 0xb8, 0xae,
+ 0x16, 0x56, 0xe3, 0xfa, 0xf1, 0x9f, 0xc4, 0x6a,
+ 0xda, 0x09, 0x8d, 0xeb, 0x9c, 0x32, 0xb1, 0xfd,
+ 0x86, 0x62, 0x05, 0x16, 0x5f, 0x49, 0xb8, 0x00 },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = false
+ },
+ /* wycheproof - public key with low order */
+ {
+ .private = { 0x78, 0xf1, 0xe8, 0xed, 0xf1, 0x44, 0x81, 0xb3,
+ 0x89, 0x44, 0x8d, 0xac, 0x8f, 0x59, 0xc7, 0x0b,
+ 0x03, 0x8e, 0x7c, 0xf9, 0x2e, 0xf2, 0xc7, 0xef,
+ 0xf5, 0x7a, 0x72, 0x46, 0x6e, 0x11, 0x52, 0x96 },
+ .public = { 0x5f, 0x9c, 0x95, 0xbc, 0xa3, 0x50, 0x8c, 0x24,
+ 0xb1, 0xd0, 0xb1, 0x55, 0x9c, 0x83, 0xef, 0x5b,
+ 0x04, 0x44, 0x5c, 0xc4, 0x58, 0x1c, 0x8e, 0x86,
+ 0xd8, 0x22, 0x4e, 0xdd, 0xd0, 0x9f, 0x11, 0x57 },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = false
+ },
+ /* wycheproof - public key with low order */
+ {
+ .private = { 0xa0, 0xa0, 0x5a, 0x3e, 0x8f, 0x9f, 0x44, 0x20,
+ 0x4d, 0x5f, 0x80, 0x59, 0xa9, 0x4a, 0xc7, 0xdf,
+ 0xc3, 0x9a, 0x49, 0xac, 0x01, 0x6d, 0xd7, 0x43,
+ 0xdb, 0xfa, 0x43, 0xc5, 0xd6, 0x71, 0xfd, 0x88 },
+ .public = { 0xec, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = false
+ },
+ /* wycheproof - public key with low order */
+ {
+ .private = { 0xd0, 0xdb, 0xb3, 0xed, 0x19, 0x06, 0x66, 0x3f,
+ 0x15, 0x42, 0x0a, 0xf3, 0x1f, 0x4e, 0xaf, 0x65,
+ 0x09, 0xd9, 0xa9, 0x94, 0x97, 0x23, 0x50, 0x06,
+ 0x05, 0xad, 0x7c, 0x1c, 0x6e, 0x74, 0x50, 0xa9 },
+ .public = { 0xed, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = false
+ },
+ /* wycheproof - public key with low order */
+ {
+ .private = { 0xc0, 0xb1, 0xd0, 0xeb, 0x22, 0xb2, 0x44, 0xfe,
+ 0x32, 0x91, 0x14, 0x00, 0x72, 0xcd, 0xd9, 0xd9,
+ 0x89, 0xb5, 0xf0, 0xec, 0xd9, 0x6c, 0x10, 0x0f,
+ 0xeb, 0x5b, 0xca, 0x24, 0x1c, 0x1d, 0x9f, 0x8f },
+ .public = { 0xee, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = false
+ },
+ /* wycheproof - public key with low order */
+ {
+ .private = { 0x48, 0x0b, 0xf4, 0x5f, 0x59, 0x49, 0x42, 0xa8,
+ 0xbc, 0x0f, 0x33, 0x53, 0xc6, 0xe8, 0xb8, 0x85,
+ 0x3d, 0x77, 0xf3, 0x51, 0xf1, 0xc2, 0xca, 0x6c,
+ 0x2d, 0x1a, 0xbf, 0x8a, 0x00, 0xb4, 0x22, 0x9c },
+ .public = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80 },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = false
+ },
+ /* wycheproof - public key with low order */
+ {
+ .private = { 0x30, 0xf9, 0x93, 0xfc, 0xf8, 0x51, 0x4f, 0xc8,
+ 0x9b, 0xd8, 0xdb, 0x14, 0xcd, 0x43, 0xba, 0x0d,
+ 0x4b, 0x25, 0x30, 0xe7, 0x3c, 0x42, 0x76, 0xa0,
+ 0x5e, 0x1b, 0x14, 0x5d, 0x42, 0x0c, 0xed, 0xb4 },
+ .public = { 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80 },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = false
+ },
+ /* wycheproof - public key with low order */
+ {
+ .private = { 0xc0, 0x49, 0x74, 0xb7, 0x58, 0x38, 0x0e, 0x2a,
+ 0x5b, 0x5d, 0xf6, 0xeb, 0x09, 0xbb, 0x2f, 0x6b,
+ 0x34, 0x34, 0xf9, 0x82, 0x72, 0x2a, 0x8e, 0x67,
+ 0x6d, 0x3d, 0xa2, 0x51, 0xd1, 0xb3, 0xde, 0x83 },
+ .public = { 0xe0, 0xeb, 0x7a, 0x7c, 0x3b, 0x41, 0xb8, 0xae,
+ 0x16, 0x56, 0xe3, 0xfa, 0xf1, 0x9f, 0xc4, 0x6a,
+ 0xda, 0x09, 0x8d, 0xeb, 0x9c, 0x32, 0xb1, 0xfd,
+ 0x86, 0x62, 0x05, 0x16, 0x5f, 0x49, 0xb8, 0x80 },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = false
+ },
+ /* wycheproof - public key with low order */
+ {
+ .private = { 0x50, 0x2a, 0x31, 0x37, 0x3d, 0xb3, 0x24, 0x46,
+ 0x84, 0x2f, 0xe5, 0xad, 0xd3, 0xe0, 0x24, 0x02,
+ 0x2e, 0xa5, 0x4f, 0x27, 0x41, 0x82, 0xaf, 0xc3,
+ 0xd9, 0xf1, 0xbb, 0x3d, 0x39, 0x53, 0x4e, 0xb5 },
+ .public = { 0x5f, 0x9c, 0x95, 0xbc, 0xa3, 0x50, 0x8c, 0x24,
+ 0xb1, 0xd0, 0xb1, 0x55, 0x9c, 0x83, 0xef, 0x5b,
+ 0x04, 0x44, 0x5c, 0xc4, 0x58, 0x1c, 0x8e, 0x86,
+ 0xd8, 0x22, 0x4e, 0xdd, 0xd0, 0x9f, 0x11, 0xd7 },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = false
+ },
+ /* wycheproof - public key with low order */
+ {
+ .private = { 0x90, 0xfa, 0x64, 0x17, 0xb0, 0xe3, 0x70, 0x30,
+ 0xfd, 0x6e, 0x43, 0xef, 0xf2, 0xab, 0xae, 0xf1,
+ 0x4c, 0x67, 0x93, 0x11, 0x7a, 0x03, 0x9c, 0xf6,
+ 0x21, 0x31, 0x8b, 0xa9, 0x0f, 0x4e, 0x98, 0xbe },
+ .public = { 0xec, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = false
+ },
+ /* wycheproof - public key with low order */
+ {
+ .private = { 0x78, 0xad, 0x3f, 0x26, 0x02, 0x7f, 0x1c, 0x9f,
+ 0xdd, 0x97, 0x5a, 0x16, 0x13, 0xb9, 0x47, 0x77,
+ 0x9b, 0xad, 0x2c, 0xf2, 0xb7, 0x41, 0xad, 0xe0,
+ 0x18, 0x40, 0x88, 0x5a, 0x30, 0xbb, 0x97, 0x9c },
+ .public = { 0xed, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = false
+ },
+ /* wycheproof - public key with low order */
+ {
+ .private = { 0x98, 0xe2, 0x3d, 0xe7, 0xb1, 0xe0, 0x92, 0x6e,
+ 0xd9, 0xc8, 0x7e, 0x7b, 0x14, 0xba, 0xf5, 0x5f,
+ 0x49, 0x7a, 0x1d, 0x70, 0x96, 0xf9, 0x39, 0x77,
+ 0x68, 0x0e, 0x44, 0xdc, 0x1c, 0x7b, 0x7b, 0x8b },
+ .public = { 0xee, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = false
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0xf0, 0x1e, 0x48, 0xda, 0xfa, 0xc9, 0xd7, 0xbc,
+ 0xf5, 0x89, 0xcb, 0xc3, 0x82, 0xc8, 0x78, 0xd1,
+ 0x8b, 0xda, 0x35, 0x50, 0x58, 0x9f, 0xfb, 0x5d,
+ 0x50, 0xb5, 0x23, 0xbe, 0xbe, 0x32, 0x9d, 0xae },
+ .public = { 0xef, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+ .result = { 0xbd, 0x36, 0xa0, 0x79, 0x0e, 0xb8, 0x83, 0x09,
+ 0x8c, 0x98, 0x8b, 0x21, 0x78, 0x67, 0x73, 0xde,
+ 0x0b, 0x3a, 0x4d, 0xf1, 0x62, 0x28, 0x2c, 0xf1,
+ 0x10, 0xde, 0x18, 0xdd, 0x48, 0x4c, 0xe7, 0x4b },
+ .valid = true
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0x28, 0x87, 0x96, 0xbc, 0x5a, 0xff, 0x4b, 0x81,
+ 0xa3, 0x75, 0x01, 0x75, 0x7b, 0xc0, 0x75, 0x3a,
+ 0x3c, 0x21, 0x96, 0x47, 0x90, 0xd3, 0x86, 0x99,
+ 0x30, 0x8d, 0xeb, 0xc1, 0x7a, 0x6e, 0xaf, 0x8d },
+ .public = { 0xf0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+ .result = { 0xb4, 0xe0, 0xdd, 0x76, 0xda, 0x7b, 0x07, 0x17,
+ 0x28, 0xb6, 0x1f, 0x85, 0x67, 0x71, 0xaa, 0x35,
+ 0x6e, 0x57, 0xed, 0xa7, 0x8a, 0x5b, 0x16, 0x55,
+ 0xcc, 0x38, 0x20, 0xfb, 0x5f, 0x85, 0x4c, 0x5c },
+ .valid = true
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0x98, 0xdf, 0x84, 0x5f, 0x66, 0x51, 0xbf, 0x11,
+ 0x38, 0x22, 0x1f, 0x11, 0x90, 0x41, 0xf7, 0x2b,
+ 0x6d, 0xbc, 0x3c, 0x4a, 0xce, 0x71, 0x43, 0xd9,
+ 0x9f, 0xd5, 0x5a, 0xd8, 0x67, 0x48, 0x0d, 0xa8 },
+ .public = { 0xf1, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+ .result = { 0x6f, 0xdf, 0x6c, 0x37, 0x61, 0x1d, 0xbd, 0x53,
+ 0x04, 0xdc, 0x0f, 0x2e, 0xb7, 0xc9, 0x51, 0x7e,
+ 0xb3, 0xc5, 0x0e, 0x12, 0xfd, 0x05, 0x0a, 0xc6,
+ 0xde, 0xc2, 0x70, 0x71, 0xd4, 0xbf, 0xc0, 0x34 },
+ .valid = true
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0xf0, 0x94, 0x98, 0xe4, 0x6f, 0x02, 0xf8, 0x78,
+ 0x82, 0x9e, 0x78, 0xb8, 0x03, 0xd3, 0x16, 0xa2,
+ 0xed, 0x69, 0x5d, 0x04, 0x98, 0xa0, 0x8a, 0xbd,
+ 0xf8, 0x27, 0x69, 0x30, 0xe2, 0x4e, 0xdc, 0xb0 },
+ .public = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+ .result = { 0x4c, 0x8f, 0xc4, 0xb1, 0xc6, 0xab, 0x88, 0xfb,
+ 0x21, 0xf1, 0x8f, 0x6d, 0x4c, 0x81, 0x02, 0x40,
+ 0xd4, 0xe9, 0x46, 0x51, 0xba, 0x44, 0xf7, 0xa2,
+ 0xc8, 0x63, 0xce, 0xc7, 0xdc, 0x56, 0x60, 0x2d },
+ .valid = true
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0x18, 0x13, 0xc1, 0x0a, 0x5c, 0x7f, 0x21, 0xf9,
+ 0x6e, 0x17, 0xf2, 0x88, 0xc0, 0xcc, 0x37, 0x60,
+ 0x7c, 0x04, 0xc5, 0xf5, 0xae, 0xa2, 0xdb, 0x13,
+ 0x4f, 0x9e, 0x2f, 0xfc, 0x66, 0xbd, 0x9d, 0xb8 },
+ .public = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80 },
+ .result = { 0x1c, 0xd0, 0xb2, 0x82, 0x67, 0xdc, 0x54, 0x1c,
+ 0x64, 0x2d, 0x6d, 0x7d, 0xca, 0x44, 0xa8, 0xb3,
+ 0x8a, 0x63, 0x73, 0x6e, 0xef, 0x5c, 0x4e, 0x65,
+ 0x01, 0xff, 0xbb, 0xb1, 0x78, 0x0c, 0x03, 0x3c },
+ .valid = true
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0x78, 0x57, 0xfb, 0x80, 0x86, 0x53, 0x64, 0x5a,
+ 0x0b, 0xeb, 0x13, 0x8a, 0x64, 0xf5, 0xf4, 0xd7,
+ 0x33, 0xa4, 0x5e, 0xa8, 0x4c, 0x3c, 0xda, 0x11,
+ 0xa9, 0xc0, 0x6f, 0x7e, 0x71, 0x39, 0x14, 0x9e },
+ .public = { 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80 },
+ .result = { 0x87, 0x55, 0xbe, 0x01, 0xc6, 0x0a, 0x7e, 0x82,
+ 0x5c, 0xff, 0x3e, 0x0e, 0x78, 0xcb, 0x3a, 0xa4,
+ 0x33, 0x38, 0x61, 0x51, 0x6a, 0xa5, 0x9b, 0x1c,
+ 0x51, 0xa8, 0xb2, 0xa5, 0x43, 0xdf, 0xa8, 0x22 },
+ .valid = true
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0xe0, 0x3a, 0xa8, 0x42, 0xe2, 0xab, 0xc5, 0x6e,
+ 0x81, 0xe8, 0x7b, 0x8b, 0x9f, 0x41, 0x7b, 0x2a,
+ 0x1e, 0x59, 0x13, 0xc7, 0x23, 0xee, 0xd2, 0x8d,
+ 0x75, 0x2f, 0x8d, 0x47, 0xa5, 0x9f, 0x49, 0x8f },
+ .public = { 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80 },
+ .result = { 0x54, 0xc9, 0xa1, 0xed, 0x95, 0xe5, 0x46, 0xd2,
+ 0x78, 0x22, 0xa3, 0x60, 0x93, 0x1d, 0xda, 0x60,
+ 0xa1, 0xdf, 0x04, 0x9d, 0xa6, 0xf9, 0x04, 0x25,
+ 0x3c, 0x06, 0x12, 0xbb, 0xdc, 0x08, 0x74, 0x76 },
+ .valid = true
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0xf8, 0xf7, 0x07, 0xb7, 0x99, 0x9b, 0x18, 0xcb,
+ 0x0d, 0x6b, 0x96, 0x12, 0x4f, 0x20, 0x45, 0x97,
+ 0x2c, 0xa2, 0x74, 0xbf, 0xc1, 0x54, 0xad, 0x0c,
+ 0x87, 0x03, 0x8c, 0x24, 0xc6, 0xd0, 0xd4, 0xb2 },
+ .public = { 0xda, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+ .result = { 0xcc, 0x1f, 0x40, 0xd7, 0x43, 0xcd, 0xc2, 0x23,
+ 0x0e, 0x10, 0x43, 0xda, 0xba, 0x8b, 0x75, 0xe8,
+ 0x10, 0xf1, 0xfb, 0xab, 0x7f, 0x25, 0x52, 0x69,
+ 0xbd, 0x9e, 0xbb, 0x29, 0xe6, 0xbf, 0x49, 0x4f },
+ .valid = true
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0xa0, 0x34, 0xf6, 0x84, 0xfa, 0x63, 0x1e, 0x1a,
+ 0x34, 0x81, 0x18, 0xc1, 0xce, 0x4c, 0x98, 0x23,
+ 0x1f, 0x2d, 0x9e, 0xec, 0x9b, 0xa5, 0x36, 0x5b,
+ 0x4a, 0x05, 0xd6, 0x9a, 0x78, 0x5b, 0x07, 0x96 },
+ .public = { 0xdb, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+ .result = { 0x54, 0x99, 0x8e, 0xe4, 0x3a, 0x5b, 0x00, 0x7b,
+ 0xf4, 0x99, 0xf0, 0x78, 0xe7, 0x36, 0x52, 0x44,
+ 0x00, 0xa8, 0xb5, 0xc7, 0xe9, 0xb9, 0xb4, 0x37,
+ 0x71, 0x74, 0x8c, 0x7c, 0xdf, 0x88, 0x04, 0x12 },
+ .valid = true
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0x30, 0xb6, 0xc6, 0xa0, 0xf2, 0xff, 0xa6, 0x80,
+ 0x76, 0x8f, 0x99, 0x2b, 0xa8, 0x9e, 0x15, 0x2d,
+ 0x5b, 0xc9, 0x89, 0x3d, 0x38, 0xc9, 0x11, 0x9b,
+ 0xe4, 0xf7, 0x67, 0xbf, 0xab, 0x6e, 0x0c, 0xa5 },
+ .public = { 0xdc, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+ .result = { 0xea, 0xd9, 0xb3, 0x8e, 0xfd, 0xd7, 0x23, 0x63,
+ 0x79, 0x34, 0xe5, 0x5a, 0xb7, 0x17, 0xa7, 0xae,
+ 0x09, 0xeb, 0x86, 0xa2, 0x1d, 0xc3, 0x6a, 0x3f,
+ 0xee, 0xb8, 0x8b, 0x75, 0x9e, 0x39, 0x1e, 0x09 },
+ .valid = true
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0x90, 0x1b, 0x9d, 0xcf, 0x88, 0x1e, 0x01, 0xe0,
+ 0x27, 0x57, 0x50, 0x35, 0xd4, 0x0b, 0x43, 0xbd,
+ 0xc1, 0xc5, 0x24, 0x2e, 0x03, 0x08, 0x47, 0x49,
+ 0x5b, 0x0c, 0x72, 0x86, 0x46, 0x9b, 0x65, 0x91 },
+ .public = { 0xea, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+ .result = { 0x60, 0x2f, 0xf4, 0x07, 0x89, 0xb5, 0x4b, 0x41,
+ 0x80, 0x59, 0x15, 0xfe, 0x2a, 0x62, 0x21, 0xf0,
+ 0x7a, 0x50, 0xff, 0xc2, 0xc3, 0xfc, 0x94, 0xcf,
+ 0x61, 0xf1, 0x3d, 0x79, 0x04, 0xe8, 0x8e, 0x0e },
+ .valid = true
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0x80, 0x46, 0x67, 0x7c, 0x28, 0xfd, 0x82, 0xc9,
+ 0xa1, 0xbd, 0xb7, 0x1a, 0x1a, 0x1a, 0x34, 0xfa,
+ 0xba, 0x12, 0x25, 0xe2, 0x50, 0x7f, 0xe3, 0xf5,
+ 0x4d, 0x10, 0xbd, 0x5b, 0x0d, 0x86, 0x5f, 0x8e },
+ .public = { 0xeb, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+ .result = { 0xe0, 0x0a, 0xe8, 0xb1, 0x43, 0x47, 0x12, 0x47,
+ 0xba, 0x24, 0xf1, 0x2c, 0x88, 0x55, 0x36, 0xc3,
+ 0xcb, 0x98, 0x1b, 0x58, 0xe1, 0xe5, 0x6b, 0x2b,
+ 0xaf, 0x35, 0xc1, 0x2a, 0xe1, 0xf7, 0x9c, 0x26 },
+ .valid = true
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0x60, 0x2f, 0x7e, 0x2f, 0x68, 0xa8, 0x46, 0xb8,
+ 0x2c, 0xc2, 0x69, 0xb1, 0xd4, 0x8e, 0x93, 0x98,
+ 0x86, 0xae, 0x54, 0xfd, 0x63, 0x6c, 0x1f, 0xe0,
+ 0x74, 0xd7, 0x10, 0x12, 0x7d, 0x47, 0x24, 0x91 },
+ .public = { 0xef, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+ .result = { 0x98, 0xcb, 0x9b, 0x50, 0xdd, 0x3f, 0xc2, 0xb0,
+ 0xd4, 0xf2, 0xd2, 0xbf, 0x7c, 0x5c, 0xfd, 0xd1,
+ 0x0c, 0x8f, 0xcd, 0x31, 0xfc, 0x40, 0xaf, 0x1a,
+ 0xd4, 0x4f, 0x47, 0xc1, 0x31, 0x37, 0x63, 0x62 },
+ .valid = true
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0x60, 0x88, 0x7b, 0x3d, 0xc7, 0x24, 0x43, 0x02,
+ 0x6e, 0xbe, 0xdb, 0xbb, 0xb7, 0x06, 0x65, 0xf4,
+ 0x2b, 0x87, 0xad, 0xd1, 0x44, 0x0e, 0x77, 0x68,
+ 0xfb, 0xd7, 0xe8, 0xe2, 0xce, 0x5f, 0x63, 0x9d },
+ .public = { 0xf0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+ .result = { 0x38, 0xd6, 0x30, 0x4c, 0x4a, 0x7e, 0x6d, 0x9f,
+ 0x79, 0x59, 0x33, 0x4f, 0xb5, 0x24, 0x5b, 0xd2,
+ 0xc7, 0x54, 0x52, 0x5d, 0x4c, 0x91, 0xdb, 0x95,
+ 0x02, 0x06, 0x92, 0x62, 0x34, 0xc1, 0xf6, 0x33 },
+ .valid = true
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0x78, 0xd3, 0x1d, 0xfa, 0x85, 0x44, 0x97, 0xd7,
+ 0x2d, 0x8d, 0xef, 0x8a, 0x1b, 0x7f, 0xb0, 0x06,
+ 0xce, 0xc2, 0xd8, 0xc4, 0x92, 0x46, 0x47, 0xc9,
+ 0x38, 0x14, 0xae, 0x56, 0xfa, 0xed, 0xa4, 0x95 },
+ .public = { 0xf1, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+ .result = { 0x78, 0x6c, 0xd5, 0x49, 0x96, 0xf0, 0x14, 0xa5,
+ 0xa0, 0x31, 0xec, 0x14, 0xdb, 0x81, 0x2e, 0xd0,
+ 0x83, 0x55, 0x06, 0x1f, 0xdb, 0x5d, 0xe6, 0x80,
+ 0xa8, 0x00, 0xac, 0x52, 0x1f, 0x31, 0x8e, 0x23 },
+ .valid = true
+ },
+ /* wycheproof - public key >= p */
+ {
+ .private = { 0xc0, 0x4c, 0x5b, 0xae, 0xfa, 0x83, 0x02, 0xdd,
+ 0xde, 0xd6, 0xa4, 0xbb, 0x95, 0x77, 0x61, 0xb4,
+ 0xeb, 0x97, 0xae, 0xfa, 0x4f, 0xc3, 0xb8, 0x04,
+ 0x30, 0x85, 0xf9, 0x6a, 0x56, 0x59, 0xb3, 0xa5 },
+ .public = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+ .result = { 0x29, 0xae, 0x8b, 0xc7, 0x3e, 0x9b, 0x10, 0xa0,
+ 0x8b, 0x4f, 0x68, 0x1c, 0x43, 0xc3, 0xe0, 0xac,
+ 0x1a, 0x17, 0x1d, 0x31, 0xb3, 0x8f, 0x1a, 0x48,
+ 0xef, 0xba, 0x29, 0xae, 0x63, 0x9e, 0xa1, 0x34 },
+ .valid = true
+ },
+ /* wycheproof - RFC 7748 */
+ {
+ .private = { 0xa0, 0x46, 0xe3, 0x6b, 0xf0, 0x52, 0x7c, 0x9d,
+ 0x3b, 0x16, 0x15, 0x4b, 0x82, 0x46, 0x5e, 0xdd,
+ 0x62, 0x14, 0x4c, 0x0a, 0xc1, 0xfc, 0x5a, 0x18,
+ 0x50, 0x6a, 0x22, 0x44, 0xba, 0x44, 0x9a, 0x44 },
+ .public = { 0xe6, 0xdb, 0x68, 0x67, 0x58, 0x30, 0x30, 0xdb,
+ 0x35, 0x94, 0xc1, 0xa4, 0x24, 0xb1, 0x5f, 0x7c,
+ 0x72, 0x66, 0x24, 0xec, 0x26, 0xb3, 0x35, 0x3b,
+ 0x10, 0xa9, 0x03, 0xa6, 0xd0, 0xab, 0x1c, 0x4c },
+ .result = { 0xc3, 0xda, 0x55, 0x37, 0x9d, 0xe9, 0xc6, 0x90,
+ 0x8e, 0x94, 0xea, 0x4d, 0xf2, 0x8d, 0x08, 0x4f,
+ 0x32, 0xec, 0xcf, 0x03, 0x49, 0x1c, 0x71, 0xf7,
+ 0x54, 0xb4, 0x07, 0x55, 0x77, 0xa2, 0x85, 0x52 },
+ .valid = true
+ },
+ /* wycheproof - RFC 7748 */
+ {
+ .private = { 0x48, 0x66, 0xe9, 0xd4, 0xd1, 0xb4, 0x67, 0x3c,
+ 0x5a, 0xd2, 0x26, 0x91, 0x95, 0x7d, 0x6a, 0xf5,
+ 0xc1, 0x1b, 0x64, 0x21, 0xe0, 0xea, 0x01, 0xd4,
+ 0x2c, 0xa4, 0x16, 0x9e, 0x79, 0x18, 0xba, 0x4d },
+ .public = { 0xe5, 0x21, 0x0f, 0x12, 0x78, 0x68, 0x11, 0xd3,
+ 0xf4, 0xb7, 0x95, 0x9d, 0x05, 0x38, 0xae, 0x2c,
+ 0x31, 0xdb, 0xe7, 0x10, 0x6f, 0xc0, 0x3c, 0x3e,
+ 0xfc, 0x4c, 0xd5, 0x49, 0xc7, 0x15, 0xa4, 0x13 },
+ .result = { 0x95, 0xcb, 0xde, 0x94, 0x76, 0xe8, 0x90, 0x7d,
+ 0x7a, 0xad, 0xe4, 0x5c, 0xb4, 0xb8, 0x73, 0xf8,
+ 0x8b, 0x59, 0x5a, 0x68, 0x79, 0x9f, 0xa1, 0x52,
+ 0xe6, 0xf8, 0xf7, 0x64, 0x7a, 0xac, 0x79, 0x57 },
+ .valid = true
+ },
+ /* wycheproof - edge case for shared secret */
+ {
+ .private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+ 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+ 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+ 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+ .public = { 0x0a, 0xb4, 0xe7, 0x63, 0x80, 0xd8, 0x4d, 0xde,
+ 0x4f, 0x68, 0x33, 0xc5, 0x8f, 0x2a, 0x9f, 0xb8,
+ 0xf8, 0x3b, 0xb0, 0x16, 0x9b, 0x17, 0x2b, 0xe4,
+ 0xb6, 0xe0, 0x59, 0x28, 0x87, 0x74, 0x1a, 0x36 },
+ .result = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = true
+ },
+ /* wycheproof - edge case for shared secret */
+ {
+ .private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+ 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+ 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+ 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+ .public = { 0x89, 0xe1, 0x0d, 0x57, 0x01, 0xb4, 0x33, 0x7d,
+ 0x2d, 0x03, 0x21, 0x81, 0x53, 0x8b, 0x10, 0x64,
+ 0xbd, 0x40, 0x84, 0x40, 0x1c, 0xec, 0xa1, 0xfd,
+ 0x12, 0x66, 0x3a, 0x19, 0x59, 0x38, 0x80, 0x00 },
+ .result = { 0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = true
+ },
+ /* wycheproof - edge case for shared secret */
+ {
+ .private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+ 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+ 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+ 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+ .public = { 0x2b, 0x55, 0xd3, 0xaa, 0x4a, 0x8f, 0x80, 0xc8,
+ 0xc0, 0xb2, 0xae, 0x5f, 0x93, 0x3e, 0x85, 0xaf,
+ 0x49, 0xbe, 0xac, 0x36, 0xc2, 0xfa, 0x73, 0x94,
+ 0xba, 0xb7, 0x6c, 0x89, 0x33, 0xf8, 0xf8, 0x1d },
+ .result = { 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+ .valid = true
+ },
+ /* wycheproof - edge case for shared secret */
+ {
+ .private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+ 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+ 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+ 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+ .public = { 0x63, 0xe5, 0xb1, 0xfe, 0x96, 0x01, 0xfe, 0x84,
+ 0x38, 0x5d, 0x88, 0x66, 0xb0, 0x42, 0x12, 0x62,
+ 0xf7, 0x8f, 0xbf, 0xa5, 0xaf, 0xf9, 0x58, 0x5e,
+ 0x62, 0x66, 0x79, 0xb1, 0x85, 0x47, 0xd9, 0x59 },
+ .result = { 0xfe, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x3f },
+ .valid = true
+ },
+ /* wycheproof - edge case for shared secret */
+ {
+ .private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+ 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+ 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+ 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+ .public = { 0xe4, 0x28, 0xf3, 0xda, 0xc1, 0x78, 0x09, 0xf8,
+ 0x27, 0xa5, 0x22, 0xce, 0x32, 0x35, 0x50, 0x58,
+ 0xd0, 0x73, 0x69, 0x36, 0x4a, 0xa7, 0x89, 0x02,
+ 0xee, 0x10, 0x13, 0x9b, 0x9f, 0x9d, 0xd6, 0x53 },
+ .result = { 0xfc, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x3f },
+ .valid = true
+ },
+ /* wycheproof - edge case for shared secret */
+ {
+ .private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+ 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+ 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+ 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+ .public = { 0xb3, 0xb5, 0x0e, 0x3e, 0xd3, 0xa4, 0x07, 0xb9,
+ 0x5d, 0xe9, 0x42, 0xef, 0x74, 0x57, 0x5b, 0x5a,
+ 0xb8, 0xa1, 0x0c, 0x09, 0xee, 0x10, 0x35, 0x44,
+ 0xd6, 0x0b, 0xdf, 0xed, 0x81, 0x38, 0xab, 0x2b },
+ .result = { 0xf9, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x3f },
+ .valid = true
+ },
+ /* wycheproof - edge case for shared secret */
+ {
+ .private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+ 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+ 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+ 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+ .public = { 0x21, 0x3f, 0xff, 0xe9, 0x3d, 0x5e, 0xa8, 0xcd,
+ 0x24, 0x2e, 0x46, 0x28, 0x44, 0x02, 0x99, 0x22,
+ 0xc4, 0x3c, 0x77, 0xc9, 0xe3, 0xe4, 0x2f, 0x56,
+ 0x2f, 0x48, 0x5d, 0x24, 0xc5, 0x01, 0xa2, 0x0b },
+ .result = { 0xf3, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x3f },
+ .valid = true
+ },
+ /* wycheproof - edge case for shared secret */
+ {
+ .private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+ 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+ 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+ 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+ .public = { 0x91, 0xb2, 0x32, 0xa1, 0x78, 0xb3, 0xcd, 0x53,
+ 0x09, 0x32, 0x44, 0x1e, 0x61, 0x39, 0x41, 0x8f,
+ 0x72, 0x17, 0x22, 0x92, 0xf1, 0xda, 0x4c, 0x18,
+ 0x34, 0xfc, 0x5e, 0xbf, 0xef, 0xb5, 0x1e, 0x3f },
+ .result = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x03 },
+ .valid = true
+ },
+ /* wycheproof - edge case for shared secret */
+ {
+ .private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+ 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+ 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+ 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+ .public = { 0x04, 0x5c, 0x6e, 0x11, 0xc5, 0xd3, 0x32, 0x55,
+ 0x6c, 0x78, 0x22, 0xfe, 0x94, 0xeb, 0xf8, 0x9b,
+ 0x56, 0xa3, 0x87, 0x8d, 0xc2, 0x7c, 0xa0, 0x79,
+ 0x10, 0x30, 0x58, 0x84, 0x9f, 0xab, 0xcb, 0x4f },
+ .result = { 0xe5, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+ .valid = true
+ },
+ /* wycheproof - edge case for shared secret */
+ {
+ .private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+ 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+ 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+ 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+ .public = { 0x1c, 0xa2, 0x19, 0x0b, 0x71, 0x16, 0x35, 0x39,
+ 0x06, 0x3c, 0x35, 0x77, 0x3b, 0xda, 0x0c, 0x9c,
+ 0x92, 0x8e, 0x91, 0x36, 0xf0, 0x62, 0x0a, 0xeb,
+ 0x09, 0x3f, 0x09, 0x91, 0x97, 0xb7, 0xf7, 0x4e },
+ .result = { 0xe3, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+ .valid = true
+ },
+ /* wycheproof - edge case for shared secret */
+ {
+ .private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+ 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+ 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+ 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+ .public = { 0xf7, 0x6e, 0x90, 0x10, 0xac, 0x33, 0xc5, 0x04,
+ 0x3b, 0x2d, 0x3b, 0x76, 0xa8, 0x42, 0x17, 0x10,
+ 0x00, 0xc4, 0x91, 0x62, 0x22, 0xe9, 0xe8, 0x58,
+ 0x97, 0xa0, 0xae, 0xc7, 0xf6, 0x35, 0x0b, 0x3c },
+ .result = { 0xdd, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+ .valid = true
+ },
+ /* wycheproof - edge case for shared secret */
+ {
+ .private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+ 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+ 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+ 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+ .public = { 0xbb, 0x72, 0x68, 0x8d, 0x8f, 0x8a, 0xa7, 0xa3,
+ 0x9c, 0xd6, 0x06, 0x0c, 0xd5, 0xc8, 0x09, 0x3c,
+ 0xde, 0xc6, 0xfe, 0x34, 0x19, 0x37, 0xc3, 0x88,
+ 0x6a, 0x99, 0x34, 0x6c, 0xd0, 0x7f, 0xaa, 0x55 },
+ .result = { 0xdb, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+ .valid = true
+ },
+ /* wycheproof - edge case for shared secret */
+ {
+ .private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+ 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+ 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+ 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+ .public = { 0x88, 0xfd, 0xde, 0xa1, 0x93, 0x39, 0x1c, 0x6a,
+ 0x59, 0x33, 0xef, 0x9b, 0x71, 0x90, 0x15, 0x49,
+ 0x44, 0x72, 0x05, 0xaa, 0xe9, 0xda, 0x92, 0x8a,
+ 0x6b, 0x91, 0xa3, 0x52, 0xba, 0x10, 0xf4, 0x1f },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02 },
+ .valid = true
+ },
+ /* wycheproof - edge case for shared secret */
+ {
+ .private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+ 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+ 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+ 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+ .public = { 0x30, 0x3b, 0x39, 0x2f, 0x15, 0x31, 0x16, 0xca,
+ 0xd9, 0xcc, 0x68, 0x2a, 0x00, 0xcc, 0xc4, 0x4c,
+ 0x95, 0xff, 0x0d, 0x3b, 0xbe, 0x56, 0x8b, 0xeb,
+ 0x6c, 0x4e, 0x73, 0x9b, 0xaf, 0xdc, 0x2c, 0x68 },
+ .result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80, 0x00 },
+ .valid = true
+ },
+ /* wycheproof - checking for overflow */
+ {
+ .private = { 0xc8, 0x17, 0x24, 0x70, 0x40, 0x00, 0xb2, 0x6d,
+ 0x31, 0x70, 0x3c, 0xc9, 0x7e, 0x3a, 0x37, 0x8d,
+ 0x56, 0xfa, 0xd8, 0x21, 0x93, 0x61, 0xc8, 0x8c,
+ 0xca, 0x8b, 0xd7, 0xc5, 0x71, 0x9b, 0x12, 0xb2 },
+ .public = { 0xfd, 0x30, 0x0a, 0xeb, 0x40, 0xe1, 0xfa, 0x58,
+ 0x25, 0x18, 0x41, 0x2b, 0x49, 0xb2, 0x08, 0xa7,
+ 0x84, 0x2b, 0x1e, 0x1f, 0x05, 0x6a, 0x04, 0x01,
+ 0x78, 0xea, 0x41, 0x41, 0x53, 0x4f, 0x65, 0x2d },
+ .result = { 0xb7, 0x34, 0x10, 0x5d, 0xc2, 0x57, 0x58, 0x5d,
+ 0x73, 0xb5, 0x66, 0xcc, 0xb7, 0x6f, 0x06, 0x27,
+ 0x95, 0xcc, 0xbe, 0xc8, 0x91, 0x28, 0xe5, 0x2b,
+ 0x02, 0xf3, 0xe5, 0x96, 0x39, 0xf1, 0x3c, 0x46 },
+ .valid = true
+ },
+ /* wycheproof - checking for overflow */
+ {
+ .private = { 0xc8, 0x17, 0x24, 0x70, 0x40, 0x00, 0xb2, 0x6d,
+ 0x31, 0x70, 0x3c, 0xc9, 0x7e, 0x3a, 0x37, 0x8d,
+ 0x56, 0xfa, 0xd8, 0x21, 0x93, 0x61, 0xc8, 0x8c,
+ 0xca, 0x8b, 0xd7, 0xc5, 0x71, 0x9b, 0x12, 0xb2 },
+ .public = { 0xc8, 0xef, 0x79, 0xb5, 0x14, 0xd7, 0x68, 0x26,
+ 0x77, 0xbc, 0x79, 0x31, 0xe0, 0x6e, 0xe5, 0xc2,
+ 0x7c, 0x9b, 0x39, 0x2b, 0x4a, 0xe9, 0x48, 0x44,
+ 0x73, 0xf5, 0x54, 0xe6, 0x67, 0x8e, 0xcc, 0x2e },
+ .result = { 0x64, 0x7a, 0x46, 0xb6, 0xfc, 0x3f, 0x40, 0xd6,
+ 0x21, 0x41, 0xee, 0x3c, 0xee, 0x70, 0x6b, 0x4d,
+ 0x7a, 0x92, 0x71, 0x59, 0x3a, 0x7b, 0x14, 0x3e,
+ 0x8e, 0x2e, 0x22, 0x79, 0x88, 0x3e, 0x45, 0x50 },
+ .valid = true
+ },
+ /* wycheproof - checking for overflow */
+ {
+ .private = { 0xc8, 0x17, 0x24, 0x70, 0x40, 0x00, 0xb2, 0x6d,
+ 0x31, 0x70, 0x3c, 0xc9, 0x7e, 0x3a, 0x37, 0x8d,
+ 0x56, 0xfa, 0xd8, 0x21, 0x93, 0x61, 0xc8, 0x8c,
+ 0xca, 0x8b, 0xd7, 0xc5, 0x71, 0x9b, 0x12, 0xb2 },
+ .public = { 0x64, 0xae, 0xac, 0x25, 0x04, 0x14, 0x48, 0x61,
+ 0x53, 0x2b, 0x7b, 0xbc, 0xb6, 0xc8, 0x7d, 0x67,
+ 0xdd, 0x4c, 0x1f, 0x07, 0xeb, 0xc2, 0xe0, 0x6e,
+ 0xff, 0xb9, 0x5a, 0xec, 0xc6, 0x17, 0x0b, 0x2c },
+ .result = { 0x4f, 0xf0, 0x3d, 0x5f, 0xb4, 0x3c, 0xd8, 0x65,
+ 0x7a, 0x3c, 0xf3, 0x7c, 0x13, 0x8c, 0xad, 0xce,
+ 0xcc, 0xe5, 0x09, 0xe4, 0xeb, 0xa0, 0x89, 0xd0,
+ 0xef, 0x40, 0xb4, 0xe4, 0xfb, 0x94, 0x61, 0x55 },
+ .valid = true
+ },
+ /* wycheproof - checking for overflow */
+ {
+ .private = { 0xc8, 0x17, 0x24, 0x70, 0x40, 0x00, 0xb2, 0x6d,
+ 0x31, 0x70, 0x3c, 0xc9, 0x7e, 0x3a, 0x37, 0x8d,
+ 0x56, 0xfa, 0xd8, 0x21, 0x93, 0x61, 0xc8, 0x8c,
+ 0xca, 0x8b, 0xd7, 0xc5, 0x71, 0x9b, 0x12, 0xb2 },
+ .public = { 0xbf, 0x68, 0xe3, 0x5e, 0x9b, 0xdb, 0x7e, 0xee,
+ 0x1b, 0x50, 0x57, 0x02, 0x21, 0x86, 0x0f, 0x5d,
+ 0xcd, 0xad, 0x8a, 0xcb, 0xab, 0x03, 0x1b, 0x14,
+ 0x97, 0x4c, 0xc4, 0x90, 0x13, 0xc4, 0x98, 0x31 },
+ .result = { 0x21, 0xce, 0xe5, 0x2e, 0xfd, 0xbc, 0x81, 0x2e,
+ 0x1d, 0x02, 0x1a, 0x4a, 0xf1, 0xe1, 0xd8, 0xbc,
+ 0x4d, 0xb3, 0xc4, 0x00, 0xe4, 0xd2, 0xa2, 0xc5,
+ 0x6a, 0x39, 0x26, 0xdb, 0x4d, 0x99, 0xc6, 0x5b },
+ .valid = true
+ },
+ /* wycheproof - checking for overflow */
+ {
+ .private = { 0xc8, 0x17, 0x24, 0x70, 0x40, 0x00, 0xb2, 0x6d,
+ 0x31, 0x70, 0x3c, 0xc9, 0x7e, 0x3a, 0x37, 0x8d,
+ 0x56, 0xfa, 0xd8, 0x21, 0x93, 0x61, 0xc8, 0x8c,
+ 0xca, 0x8b, 0xd7, 0xc5, 0x71, 0x9b, 0x12, 0xb2 },
+ .public = { 0x53, 0x47, 0xc4, 0x91, 0x33, 0x1a, 0x64, 0xb4,
+ 0x3d, 0xdc, 0x68, 0x30, 0x34, 0xe6, 0x77, 0xf5,
+ 0x3d, 0xc3, 0x2b, 0x52, 0xa5, 0x2a, 0x57, 0x7c,
+ 0x15, 0xa8, 0x3b, 0xf2, 0x98, 0xe9, 0x9f, 0x19 },
+ .result = { 0x18, 0xcb, 0x89, 0xe4, 0xe2, 0x0c, 0x0c, 0x2b,
+ 0xd3, 0x24, 0x30, 0x52, 0x45, 0x26, 0x6c, 0x93,
+ 0x27, 0x69, 0x0b, 0xbe, 0x79, 0xac, 0xb8, 0x8f,
+ 0x5b, 0x8f, 0xb3, 0xf7, 0x4e, 0xca, 0x3e, 0x52 },
+ .valid = true
+ },
+ /* wycheproof - private key == -1 (mod order) */
+ {
+ .private = { 0xa0, 0x23, 0xcd, 0xd0, 0x83, 0xef, 0x5b, 0xb8,
+ 0x2f, 0x10, 0xd6, 0x2e, 0x59, 0xe1, 0x5a, 0x68,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x50 },
+ .public = { 0x25, 0x8e, 0x04, 0x52, 0x3b, 0x8d, 0x25, 0x3e,
+ 0xe6, 0x57, 0x19, 0xfc, 0x69, 0x06, 0xc6, 0x57,
+ 0x19, 0x2d, 0x80, 0x71, 0x7e, 0xdc, 0x82, 0x8f,
+ 0xa0, 0xaf, 0x21, 0x68, 0x6e, 0x2f, 0xaa, 0x75 },
+ .result = { 0x25, 0x8e, 0x04, 0x52, 0x3b, 0x8d, 0x25, 0x3e,
+ 0xe6, 0x57, 0x19, 0xfc, 0x69, 0x06, 0xc6, 0x57,
+ 0x19, 0x2d, 0x80, 0x71, 0x7e, 0xdc, 0x82, 0x8f,
+ 0xa0, 0xaf, 0x21, 0x68, 0x6e, 0x2f, 0xaa, 0x75 },
+ .valid = true
+ },
+ /* wycheproof - private key == 1 (mod order) on twist */
+ {
+ .private = { 0x58, 0x08, 0x3d, 0xd2, 0x61, 0xad, 0x91, 0xef,
+ 0xf9, 0x52, 0x32, 0x2e, 0xc8, 0x24, 0xc6, 0x82,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x5f },
+ .public = { 0x2e, 0xae, 0x5e, 0xc3, 0xdd, 0x49, 0x4e, 0x9f,
+ 0x2d, 0x37, 0xd2, 0x58, 0xf8, 0x73, 0xa8, 0xe6,
+ 0xe9, 0xd0, 0xdb, 0xd1, 0xe3, 0x83, 0xef, 0x64,
+ 0xd9, 0x8b, 0xb9, 0x1b, 0x3e, 0x0b, 0xe0, 0x35 },
+ .result = { 0x2e, 0xae, 0x5e, 0xc3, 0xdd, 0x49, 0x4e, 0x9f,
+ 0x2d, 0x37, 0xd2, 0x58, 0xf8, 0x73, 0xa8, 0xe6,
+ 0xe9, 0xd0, 0xdb, 0xd1, 0xe3, 0x83, 0xef, 0x64,
+ 0xd9, 0x8b, 0xb9, 0x1b, 0x3e, 0x0b, 0xe0, 0x35 },
+ .valid = true
+ }
+};
+
+static bool __init curve25519_selftest(void)
+{
+ bool success = true, ret, ret2;
+ size_t i = 0, j;
+ u8 in[CURVE25519_KEY_SIZE];
+ u8 out[CURVE25519_KEY_SIZE], out2[CURVE25519_KEY_SIZE];
+
+ for (i = 0; i < ARRAY_SIZE(curve25519_test_vectors); ++i) {
+ memset(out, 0, CURVE25519_KEY_SIZE);
+ ret = curve25519(out, curve25519_test_vectors[i].private,
+ curve25519_test_vectors[i].public);
+ if (ret != curve25519_test_vectors[i].valid ||
+ memcmp(out, curve25519_test_vectors[i].result,
+ CURVE25519_KEY_SIZE)) {
+ pr_err("curve25519 self-test %zu: FAIL\n", i + 1);
+ success = false;
+ }
+ }
+
+ for (i = 0; i < 5; ++i) {
+ get_random_bytes(in, sizeof(in));
+ ret = curve25519_generate_public(out, in);
+ ret2 = curve25519(out2, in, (u8[CURVE25519_KEY_SIZE]){ 9 });
+ if (ret != ret2 || memcmp(out, out2, CURVE25519_KEY_SIZE)) {
+ pr_err("curve25519 basepoint self-test %zu: FAIL: input - 0x",
+ i + 1);
+ for (j = CURVE25519_KEY_SIZE; j-- > 0;)
+ printk(KERN_CONT "%02x", in[j]);
+ printk(KERN_CONT "\n");
+ success = false;
+ }
+ }
+
+ return success;
+}
--
2.19.1
^ permalink raw reply related
* [PATCH net-next v8 26/28] crypto: port ChaCha20 to Zinc
From: Jason A. Donenfeld @ 2018-10-18 14:57 UTC (permalink / raw)
To: linux-kernel, netdev, linux-crypto, davem, gregkh
Cc: Jason A. Donenfeld, Samuel Neves, Andy Lutomirski
In-Reply-To: <20181018145712.7538-1-Jason@zx2c4.com>
Now that ChaCha20 is in Zinc, we can have the crypto API code simply
call into it. The crypto API expects to have a stored key per instance
and independent nonces, so we follow suite and store the key and
initialize the nonce independently.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: linux-crypto@vger.kernel.org
---
arch/arm/configs/exynos_defconfig | 1 -
arch/arm/configs/multi_v7_defconfig | 1 -
arch/arm/configs/omap2plus_defconfig | 1 -
arch/arm/crypto/Kconfig | 6 -
arch/arm/crypto/Makefile | 2 -
arch/arm/crypto/chacha20-neon-core.S | 521 --------------------
arch/arm/crypto/chacha20-neon-glue.c | 127 -----
arch/arm64/configs/defconfig | 1 -
arch/arm64/crypto/Kconfig | 6 -
arch/arm64/crypto/Makefile | 3 -
arch/arm64/crypto/chacha20-neon-core.S | 450 -----------------
arch/arm64/crypto/chacha20-neon-glue.c | 133 -----
arch/x86/crypto/Makefile | 3 -
arch/x86/crypto/chacha20-avx2-x86_64.S | 448 -----------------
arch/x86/crypto/chacha20-ssse3-x86_64.S | 630 ------------------------
arch/x86/crypto/chacha20_glue.c | 146 ------
crypto/Kconfig | 17 +-
crypto/Makefile | 2 +-
crypto/chacha20_generic.c | 136 -----
crypto/chacha20_zinc.c | 90 ++++
crypto/chacha20poly1305.c | 8 +-
include/crypto/chacha20.h | 12 -
22 files changed, 96 insertions(+), 2648 deletions(-)
delete mode 100644 arch/arm/crypto/chacha20-neon-core.S
delete mode 100644 arch/arm/crypto/chacha20-neon-glue.c
delete mode 100644 arch/arm64/crypto/chacha20-neon-core.S
delete mode 100644 arch/arm64/crypto/chacha20-neon-glue.c
delete mode 100644 arch/x86/crypto/chacha20-avx2-x86_64.S
delete mode 100644 arch/x86/crypto/chacha20-ssse3-x86_64.S
delete mode 100644 arch/x86/crypto/chacha20_glue.c
delete mode 100644 crypto/chacha20_generic.c
create mode 100644 crypto/chacha20_zinc.c
diff --git a/arch/arm/configs/exynos_defconfig b/arch/arm/configs/exynos_defconfig
index 27ea6dfcf2f2..95929b5e7b10 100644
--- a/arch/arm/configs/exynos_defconfig
+++ b/arch/arm/configs/exynos_defconfig
@@ -350,7 +350,6 @@ CONFIG_CRYPTO_SHA1_ARM_NEON=m
CONFIG_CRYPTO_SHA256_ARM=m
CONFIG_CRYPTO_SHA512_ARM=m
CONFIG_CRYPTO_AES_ARM_BS=m
-CONFIG_CRYPTO_CHACHA20_NEON=m
CONFIG_CRC_CCITT=y
CONFIG_FONTS=y
CONFIG_FONT_7x14=y
diff --git a/arch/arm/configs/multi_v7_defconfig b/arch/arm/configs/multi_v7_defconfig
index fc33444e94f0..63be07724db3 100644
--- a/arch/arm/configs/multi_v7_defconfig
+++ b/arch/arm/configs/multi_v7_defconfig
@@ -1000,4 +1000,3 @@ CONFIG_CRYPTO_AES_ARM_BS=m
CONFIG_CRYPTO_AES_ARM_CE=m
CONFIG_CRYPTO_GHASH_ARM_CE=m
CONFIG_CRYPTO_CRC32_ARM_CE=m
-CONFIG_CRYPTO_CHACHA20_NEON=m
diff --git a/arch/arm/configs/omap2plus_defconfig b/arch/arm/configs/omap2plus_defconfig
index 6491419b1dad..f585a8ecc336 100644
--- a/arch/arm/configs/omap2plus_defconfig
+++ b/arch/arm/configs/omap2plus_defconfig
@@ -547,7 +547,6 @@ CONFIG_CRYPTO_SHA512_ARM=m
CONFIG_CRYPTO_AES_ARM=m
CONFIG_CRYPTO_AES_ARM_BS=m
CONFIG_CRYPTO_GHASH_ARM_CE=m
-CONFIG_CRYPTO_CHACHA20_NEON=m
CONFIG_CRC_CCITT=y
CONFIG_CRC_T10DIF=y
CONFIG_CRC_ITU_T=y
diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index 925d1364727a..fb80fd89f0e7 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -115,12 +115,6 @@ config CRYPTO_CRC32_ARM_CE
depends on KERNEL_MODE_NEON && CRC32
select CRYPTO_HASH
-config CRYPTO_CHACHA20_NEON
- tristate "NEON accelerated ChaCha20 symmetric cipher"
- depends on KERNEL_MODE_NEON
- select CRYPTO_BLKCIPHER
- select CRYPTO_CHACHA20
-
config CRYPTO_SPECK_NEON
tristate "NEON accelerated Speck cipher algorithms"
depends on KERNEL_MODE_NEON
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index 8de542c48ade..bbfa98447063 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -9,7 +9,6 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM) += sha1-arm.o
obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
-obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
@@ -53,7 +52,6 @@ aes-arm-ce-y := aes-ce-core.o aes-ce-glue.o
ghash-arm-ce-y := ghash-ce-core.o ghash-ce-glue.o
crct10dif-arm-ce-y := crct10dif-ce-core.o crct10dif-ce-glue.o
crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
-chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
speck-neon-y := speck-neon-core.o speck-neon-glue.o
ifdef REGENERATE_ARM_CRYPTO
diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
deleted file mode 100644
index 451a849ad518..000000000000
--- a/arch/arm/crypto/chacha20-neon-core.S
+++ /dev/null
@@ -1,521 +0,0 @@
-/*
- * ChaCha20 256-bit cipher algorithm, RFC7539, ARM NEON functions
- *
- * Copyright (C) 2016 Linaro, Ltd. <ard.biesheuvel@linaro.org>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * Based on:
- * ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSE3 functions
- *
- * Copyright (C) 2015 Martin Willi
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- */
-
-#include <linux/linkage.h>
-
- .text
- .fpu neon
- .align 5
-
-ENTRY(chacha20_block_xor_neon)
- // r0: Input state matrix, s
- // r1: 1 data block output, o
- // r2: 1 data block input, i
-
- //
- // This function encrypts one ChaCha20 block by loading the state matrix
- // in four NEON registers. It performs matrix operation on four words in
- // parallel, but requireds shuffling to rearrange the words after each
- // round.
- //
-
- // x0..3 = s0..3
- add ip, r0, #0x20
- vld1.32 {q0-q1}, [r0]
- vld1.32 {q2-q3}, [ip]
-
- vmov q8, q0
- vmov q9, q1
- vmov q10, q2
- vmov q11, q3
-
- mov r3, #10
-
-.Ldoubleround:
- // x0 += x1, x3 = rotl32(x3 ^ x0, 16)
- vadd.i32 q0, q0, q1
- veor q3, q3, q0
- vrev32.16 q3, q3
-
- // x2 += x3, x1 = rotl32(x1 ^ x2, 12)
- vadd.i32 q2, q2, q3
- veor q4, q1, q2
- vshl.u32 q1, q4, #12
- vsri.u32 q1, q4, #20
-
- // x0 += x1, x3 = rotl32(x3 ^ x0, 8)
- vadd.i32 q0, q0, q1
- veor q4, q3, q0
- vshl.u32 q3, q4, #8
- vsri.u32 q3, q4, #24
-
- // x2 += x3, x1 = rotl32(x1 ^ x2, 7)
- vadd.i32 q2, q2, q3
- veor q4, q1, q2
- vshl.u32 q1, q4, #7
- vsri.u32 q1, q4, #25
-
- // x1 = shuffle32(x1, MASK(0, 3, 2, 1))
- vext.8 q1, q1, q1, #4
- // x2 = shuffle32(x2, MASK(1, 0, 3, 2))
- vext.8 q2, q2, q2, #8
- // x3 = shuffle32(x3, MASK(2, 1, 0, 3))
- vext.8 q3, q3, q3, #12
-
- // x0 += x1, x3 = rotl32(x3 ^ x0, 16)
- vadd.i32 q0, q0, q1
- veor q3, q3, q0
- vrev32.16 q3, q3
-
- // x2 += x3, x1 = rotl32(x1 ^ x2, 12)
- vadd.i32 q2, q2, q3
- veor q4, q1, q2
- vshl.u32 q1, q4, #12
- vsri.u32 q1, q4, #20
-
- // x0 += x1, x3 = rotl32(x3 ^ x0, 8)
- vadd.i32 q0, q0, q1
- veor q4, q3, q0
- vshl.u32 q3, q4, #8
- vsri.u32 q3, q4, #24
-
- // x2 += x3, x1 = rotl32(x1 ^ x2, 7)
- vadd.i32 q2, q2, q3
- veor q4, q1, q2
- vshl.u32 q1, q4, #7
- vsri.u32 q1, q4, #25
-
- // x1 = shuffle32(x1, MASK(2, 1, 0, 3))
- vext.8 q1, q1, q1, #12
- // x2 = shuffle32(x2, MASK(1, 0, 3, 2))
- vext.8 q2, q2, q2, #8
- // x3 = shuffle32(x3, MASK(0, 3, 2, 1))
- vext.8 q3, q3, q3, #4
-
- subs r3, r3, #1
- bne .Ldoubleround
-
- add ip, r2, #0x20
- vld1.8 {q4-q5}, [r2]
- vld1.8 {q6-q7}, [ip]
-
- // o0 = i0 ^ (x0 + s0)
- vadd.i32 q0, q0, q8
- veor q0, q0, q4
-
- // o1 = i1 ^ (x1 + s1)
- vadd.i32 q1, q1, q9
- veor q1, q1, q5
-
- // o2 = i2 ^ (x2 + s2)
- vadd.i32 q2, q2, q10
- veor q2, q2, q6
-
- // o3 = i3 ^ (x3 + s3)
- vadd.i32 q3, q3, q11
- veor q3, q3, q7
-
- add ip, r1, #0x20
- vst1.8 {q0-q1}, [r1]
- vst1.8 {q2-q3}, [ip]
-
- bx lr
-ENDPROC(chacha20_block_xor_neon)
-
- .align 5
-ENTRY(chacha20_4block_xor_neon)
- push {r4-r6, lr}
- mov ip, sp // preserve the stack pointer
- sub r3, sp, #0x20 // allocate a 32 byte buffer
- bic r3, r3, #0x1f // aligned to 32 bytes
- mov sp, r3
-
- // r0: Input state matrix, s
- // r1: 4 data blocks output, o
- // r2: 4 data blocks input, i
-
- //
- // This function encrypts four consecutive ChaCha20 blocks by loading
- // the state matrix in NEON registers four times. The algorithm performs
- // each operation on the corresponding word of each state matrix, hence
- // requires no word shuffling. For final XORing step we transpose the
- // matrix by interleaving 32- and then 64-bit words, which allows us to
- // do XOR in NEON registers.
- //
-
- // x0..15[0-3] = s0..3[0..3]
- add r3, r0, #0x20
- vld1.32 {q0-q1}, [r0]
- vld1.32 {q2-q3}, [r3]
-
- adr r3, CTRINC
- vdup.32 q15, d7[1]
- vdup.32 q14, d7[0]
- vld1.32 {q11}, [r3, :128]
- vdup.32 q13, d6[1]
- vdup.32 q12, d6[0]
- vadd.i32 q12, q12, q11 // x12 += counter values 0-3
- vdup.32 q11, d5[1]
- vdup.32 q10, d5[0]
- vdup.32 q9, d4[1]
- vdup.32 q8, d4[0]
- vdup.32 q7, d3[1]
- vdup.32 q6, d3[0]
- vdup.32 q5, d2[1]
- vdup.32 q4, d2[0]
- vdup.32 q3, d1[1]
- vdup.32 q2, d1[0]
- vdup.32 q1, d0[1]
- vdup.32 q0, d0[0]
-
- mov r3, #10
-
-.Ldoubleround4:
- // x0 += x4, x12 = rotl32(x12 ^ x0, 16)
- // x1 += x5, x13 = rotl32(x13 ^ x1, 16)
- // x2 += x6, x14 = rotl32(x14 ^ x2, 16)
- // x3 += x7, x15 = rotl32(x15 ^ x3, 16)
- vadd.i32 q0, q0, q4
- vadd.i32 q1, q1, q5
- vadd.i32 q2, q2, q6
- vadd.i32 q3, q3, q7
-
- veor q12, q12, q0
- veor q13, q13, q1
- veor q14, q14, q2
- veor q15, q15, q3
-
- vrev32.16 q12, q12
- vrev32.16 q13, q13
- vrev32.16 q14, q14
- vrev32.16 q15, q15
-
- // x8 += x12, x4 = rotl32(x4 ^ x8, 12)
- // x9 += x13, x5 = rotl32(x5 ^ x9, 12)
- // x10 += x14, x6 = rotl32(x6 ^ x10, 12)
- // x11 += x15, x7 = rotl32(x7 ^ x11, 12)
- vadd.i32 q8, q8, q12
- vadd.i32 q9, q9, q13
- vadd.i32 q10, q10, q14
- vadd.i32 q11, q11, q15
-
- vst1.32 {q8-q9}, [sp, :256]
-
- veor q8, q4, q8
- veor q9, q5, q9
- vshl.u32 q4, q8, #12
- vshl.u32 q5, q9, #12
- vsri.u32 q4, q8, #20
- vsri.u32 q5, q9, #20
-
- veor q8, q6, q10
- veor q9, q7, q11
- vshl.u32 q6, q8, #12
- vshl.u32 q7, q9, #12
- vsri.u32 q6, q8, #20
- vsri.u32 q7, q9, #20
-
- // x0 += x4, x12 = rotl32(x12 ^ x0, 8)
- // x1 += x5, x13 = rotl32(x13 ^ x1, 8)
- // x2 += x6, x14 = rotl32(x14 ^ x2, 8)
- // x3 += x7, x15 = rotl32(x15 ^ x3, 8)
- vadd.i32 q0, q0, q4
- vadd.i32 q1, q1, q5
- vadd.i32 q2, q2, q6
- vadd.i32 q3, q3, q7
-
- veor q8, q12, q0
- veor q9, q13, q1
- vshl.u32 q12, q8, #8
- vshl.u32 q13, q9, #8
- vsri.u32 q12, q8, #24
- vsri.u32 q13, q9, #24
-
- veor q8, q14, q2
- veor q9, q15, q3
- vshl.u32 q14, q8, #8
- vshl.u32 q15, q9, #8
- vsri.u32 q14, q8, #24
- vsri.u32 q15, q9, #24
-
- vld1.32 {q8-q9}, [sp, :256]
-
- // x8 += x12, x4 = rotl32(x4 ^ x8, 7)
- // x9 += x13, x5 = rotl32(x5 ^ x9, 7)
- // x10 += x14, x6 = rotl32(x6 ^ x10, 7)
- // x11 += x15, x7 = rotl32(x7 ^ x11, 7)
- vadd.i32 q8, q8, q12
- vadd.i32 q9, q9, q13
- vadd.i32 q10, q10, q14
- vadd.i32 q11, q11, q15
-
- vst1.32 {q8-q9}, [sp, :256]
-
- veor q8, q4, q8
- veor q9, q5, q9
- vshl.u32 q4, q8, #7
- vshl.u32 q5, q9, #7
- vsri.u32 q4, q8, #25
- vsri.u32 q5, q9, #25
-
- veor q8, q6, q10
- veor q9, q7, q11
- vshl.u32 q6, q8, #7
- vshl.u32 q7, q9, #7
- vsri.u32 q6, q8, #25
- vsri.u32 q7, q9, #25
-
- vld1.32 {q8-q9}, [sp, :256]
-
- // x0 += x5, x15 = rotl32(x15 ^ x0, 16)
- // x1 += x6, x12 = rotl32(x12 ^ x1, 16)
- // x2 += x7, x13 = rotl32(x13 ^ x2, 16)
- // x3 += x4, x14 = rotl32(x14 ^ x3, 16)
- vadd.i32 q0, q0, q5
- vadd.i32 q1, q1, q6
- vadd.i32 q2, q2, q7
- vadd.i32 q3, q3, q4
-
- veor q15, q15, q0
- veor q12, q12, q1
- veor q13, q13, q2
- veor q14, q14, q3
-
- vrev32.16 q15, q15
- vrev32.16 q12, q12
- vrev32.16 q13, q13
- vrev32.16 q14, q14
-
- // x10 += x15, x5 = rotl32(x5 ^ x10, 12)
- // x11 += x12, x6 = rotl32(x6 ^ x11, 12)
- // x8 += x13, x7 = rotl32(x7 ^ x8, 12)
- // x9 += x14, x4 = rotl32(x4 ^ x9, 12)
- vadd.i32 q10, q10, q15
- vadd.i32 q11, q11, q12
- vadd.i32 q8, q8, q13
- vadd.i32 q9, q9, q14
-
- vst1.32 {q8-q9}, [sp, :256]
-
- veor q8, q7, q8
- veor q9, q4, q9
- vshl.u32 q7, q8, #12
- vshl.u32 q4, q9, #12
- vsri.u32 q7, q8, #20
- vsri.u32 q4, q9, #20
-
- veor q8, q5, q10
- veor q9, q6, q11
- vshl.u32 q5, q8, #12
- vshl.u32 q6, q9, #12
- vsri.u32 q5, q8, #20
- vsri.u32 q6, q9, #20
-
- // x0 += x5, x15 = rotl32(x15 ^ x0, 8)
- // x1 += x6, x12 = rotl32(x12 ^ x1, 8)
- // x2 += x7, x13 = rotl32(x13 ^ x2, 8)
- // x3 += x4, x14 = rotl32(x14 ^ x3, 8)
- vadd.i32 q0, q0, q5
- vadd.i32 q1, q1, q6
- vadd.i32 q2, q2, q7
- vadd.i32 q3, q3, q4
-
- veor q8, q15, q0
- veor q9, q12, q1
- vshl.u32 q15, q8, #8
- vshl.u32 q12, q9, #8
- vsri.u32 q15, q8, #24
- vsri.u32 q12, q9, #24
-
- veor q8, q13, q2
- veor q9, q14, q3
- vshl.u32 q13, q8, #8
- vshl.u32 q14, q9, #8
- vsri.u32 q13, q8, #24
- vsri.u32 q14, q9, #24
-
- vld1.32 {q8-q9}, [sp, :256]
-
- // x10 += x15, x5 = rotl32(x5 ^ x10, 7)
- // x11 += x12, x6 = rotl32(x6 ^ x11, 7)
- // x8 += x13, x7 = rotl32(x7 ^ x8, 7)
- // x9 += x14, x4 = rotl32(x4 ^ x9, 7)
- vadd.i32 q10, q10, q15
- vadd.i32 q11, q11, q12
- vadd.i32 q8, q8, q13
- vadd.i32 q9, q9, q14
-
- vst1.32 {q8-q9}, [sp, :256]
-
- veor q8, q7, q8
- veor q9, q4, q9
- vshl.u32 q7, q8, #7
- vshl.u32 q4, q9, #7
- vsri.u32 q7, q8, #25
- vsri.u32 q4, q9, #25
-
- veor q8, q5, q10
- veor q9, q6, q11
- vshl.u32 q5, q8, #7
- vshl.u32 q6, q9, #7
- vsri.u32 q5, q8, #25
- vsri.u32 q6, q9, #25
-
- subs r3, r3, #1
- beq 0f
-
- vld1.32 {q8-q9}, [sp, :256]
- b .Ldoubleround4
-
- // x0[0-3] += s0[0]
- // x1[0-3] += s0[1]
- // x2[0-3] += s0[2]
- // x3[0-3] += s0[3]
-0: ldmia r0!, {r3-r6}
- vdup.32 q8, r3
- vdup.32 q9, r4
- vadd.i32 q0, q0, q8
- vadd.i32 q1, q1, q9
- vdup.32 q8, r5
- vdup.32 q9, r6
- vadd.i32 q2, q2, q8
- vadd.i32 q3, q3, q9
-
- // x4[0-3] += s1[0]
- // x5[0-3] += s1[1]
- // x6[0-3] += s1[2]
- // x7[0-3] += s1[3]
- ldmia r0!, {r3-r6}
- vdup.32 q8, r3
- vdup.32 q9, r4
- vadd.i32 q4, q4, q8
- vadd.i32 q5, q5, q9
- vdup.32 q8, r5
- vdup.32 q9, r6
- vadd.i32 q6, q6, q8
- vadd.i32 q7, q7, q9
-
- // interleave 32-bit words in state n, n+1
- vzip.32 q0, q1
- vzip.32 q2, q3
- vzip.32 q4, q5
- vzip.32 q6, q7
-
- // interleave 64-bit words in state n, n+2
- vswp d1, d4
- vswp d3, d6
- vswp d9, d12
- vswp d11, d14
-
- // xor with corresponding input, write to output
- vld1.8 {q8-q9}, [r2]!
- veor q8, q8, q0
- veor q9, q9, q4
- vst1.8 {q8-q9}, [r1]!
-
- vld1.32 {q8-q9}, [sp, :256]
-
- // x8[0-3] += s2[0]
- // x9[0-3] += s2[1]
- // x10[0-3] += s2[2]
- // x11[0-3] += s2[3]
- ldmia r0!, {r3-r6}
- vdup.32 q0, r3
- vdup.32 q4, r4
- vadd.i32 q8, q8, q0
- vadd.i32 q9, q9, q4
- vdup.32 q0, r5
- vdup.32 q4, r6
- vadd.i32 q10, q10, q0
- vadd.i32 q11, q11, q4
-
- // x12[0-3] += s3[0]
- // x13[0-3] += s3[1]
- // x14[0-3] += s3[2]
- // x15[0-3] += s3[3]
- ldmia r0!, {r3-r6}
- vdup.32 q0, r3
- vdup.32 q4, r4
- adr r3, CTRINC
- vadd.i32 q12, q12, q0
- vld1.32 {q0}, [r3, :128]
- vadd.i32 q13, q13, q4
- vadd.i32 q12, q12, q0 // x12 += counter values 0-3
-
- vdup.32 q0, r5
- vdup.32 q4, r6
- vadd.i32 q14, q14, q0
- vadd.i32 q15, q15, q4
-
- // interleave 32-bit words in state n, n+1
- vzip.32 q8, q9
- vzip.32 q10, q11
- vzip.32 q12, q13
- vzip.32 q14, q15
-
- // interleave 64-bit words in state n, n+2
- vswp d17, d20
- vswp d19, d22
- vswp d25, d28
- vswp d27, d30
-
- vmov q4, q1
-
- vld1.8 {q0-q1}, [r2]!
- veor q0, q0, q8
- veor q1, q1, q12
- vst1.8 {q0-q1}, [r1]!
-
- vld1.8 {q0-q1}, [r2]!
- veor q0, q0, q2
- veor q1, q1, q6
- vst1.8 {q0-q1}, [r1]!
-
- vld1.8 {q0-q1}, [r2]!
- veor q0, q0, q10
- veor q1, q1, q14
- vst1.8 {q0-q1}, [r1]!
-
- vld1.8 {q0-q1}, [r2]!
- veor q0, q0, q4
- veor q1, q1, q5
- vst1.8 {q0-q1}, [r1]!
-
- vld1.8 {q0-q1}, [r2]!
- veor q0, q0, q9
- veor q1, q1, q13
- vst1.8 {q0-q1}, [r1]!
-
- vld1.8 {q0-q1}, [r2]!
- veor q0, q0, q3
- veor q1, q1, q7
- vst1.8 {q0-q1}, [r1]!
-
- vld1.8 {q0-q1}, [r2]
- veor q0, q0, q11
- veor q1, q1, q15
- vst1.8 {q0-q1}, [r1]
-
- mov sp, ip
- pop {r4-r6, pc}
-ENDPROC(chacha20_4block_xor_neon)
-
- .align 4
-CTRINC: .word 0, 1, 2, 3
diff --git a/arch/arm/crypto/chacha20-neon-glue.c b/arch/arm/crypto/chacha20-neon-glue.c
deleted file mode 100644
index 59a7be08e80c..000000000000
--- a/arch/arm/crypto/chacha20-neon-glue.c
+++ /dev/null
@@ -1,127 +0,0 @@
-/*
- * ChaCha20 256-bit cipher algorithm, RFC7539, ARM NEON functions
- *
- * Copyright (C) 2016 Linaro, Ltd. <ard.biesheuvel@linaro.org>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * Based on:
- * ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code
- *
- * Copyright (C) 2015 Martin Willi
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- */
-
-#include <crypto/algapi.h>
-#include <crypto/chacha20.h>
-#include <crypto/internal/skcipher.h>
-#include <linux/kernel.h>
-#include <linux/module.h>
-
-#include <asm/hwcap.h>
-#include <asm/neon.h>
-#include <asm/simd.h>
-
-asmlinkage void chacha20_block_xor_neon(u32 *state, u8 *dst, const u8 *src);
-asmlinkage void chacha20_4block_xor_neon(u32 *state, u8 *dst, const u8 *src);
-
-static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
- unsigned int bytes)
-{
- u8 buf[CHACHA20_BLOCK_SIZE];
-
- while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
- chacha20_4block_xor_neon(state, dst, src);
- bytes -= CHACHA20_BLOCK_SIZE * 4;
- src += CHACHA20_BLOCK_SIZE * 4;
- dst += CHACHA20_BLOCK_SIZE * 4;
- state[12] += 4;
- }
- while (bytes >= CHACHA20_BLOCK_SIZE) {
- chacha20_block_xor_neon(state, dst, src);
- bytes -= CHACHA20_BLOCK_SIZE;
- src += CHACHA20_BLOCK_SIZE;
- dst += CHACHA20_BLOCK_SIZE;
- state[12]++;
- }
- if (bytes) {
- memcpy(buf, src, bytes);
- chacha20_block_xor_neon(state, buf, buf);
- memcpy(dst, buf, bytes);
- }
-}
-
-static int chacha20_neon(struct skcipher_request *req)
-{
- struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
- struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm);
- struct skcipher_walk walk;
- u32 state[16];
- int err;
-
- if (req->cryptlen <= CHACHA20_BLOCK_SIZE || !may_use_simd())
- return crypto_chacha20_crypt(req);
-
- err = skcipher_walk_virt(&walk, req, true);
-
- crypto_chacha20_init(state, ctx, walk.iv);
-
- kernel_neon_begin();
- while (walk.nbytes > 0) {
- unsigned int nbytes = walk.nbytes;
-
- if (nbytes < walk.total)
- nbytes = round_down(nbytes, walk.stride);
-
- chacha20_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
- nbytes);
- err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
- }
- kernel_neon_end();
-
- return err;
-}
-
-static struct skcipher_alg alg = {
- .base.cra_name = "chacha20",
- .base.cra_driver_name = "chacha20-neon",
- .base.cra_priority = 300,
- .base.cra_blocksize = 1,
- .base.cra_ctxsize = sizeof(struct chacha20_ctx),
- .base.cra_module = THIS_MODULE,
-
- .min_keysize = CHACHA20_KEY_SIZE,
- .max_keysize = CHACHA20_KEY_SIZE,
- .ivsize = CHACHA20_IV_SIZE,
- .chunksize = CHACHA20_BLOCK_SIZE,
- .walksize = 4 * CHACHA20_BLOCK_SIZE,
- .setkey = crypto_chacha20_setkey,
- .encrypt = chacha20_neon,
- .decrypt = chacha20_neon,
-};
-
-static int __init chacha20_simd_mod_init(void)
-{
- if (!(elf_hwcap & HWCAP_NEON))
- return -ENODEV;
-
- return crypto_register_skcipher(&alg);
-}
-
-static void __exit chacha20_simd_mod_fini(void)
-{
- crypto_unregister_skcipher(&alg);
-}
-
-module_init(chacha20_simd_mod_init);
-module_exit(chacha20_simd_mod_fini);
-
-MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
-MODULE_LICENSE("GPL v2");
-MODULE_ALIAS_CRYPTO("chacha20");
diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index db8d364f8476..6cc3c8a0ad88 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -709,5 +709,4 @@ CONFIG_CRYPTO_CRCT10DIF_ARM64_CE=m
CONFIG_CRYPTO_CRC32_ARM64_CE=m
CONFIG_CRYPTO_AES_ARM64_CE_CCM=y
CONFIG_CRYPTO_AES_ARM64_CE_BLK=y
-CONFIG_CRYPTO_CHACHA20_NEON=m
CONFIG_CRYPTO_AES_ARM64_BS=m
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index e3fdb0fd6f70..9db6d775a880 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -105,12 +105,6 @@ config CRYPTO_AES_ARM64_NEON_BLK
select CRYPTO_AES
select CRYPTO_SIMD
-config CRYPTO_CHACHA20_NEON
- tristate "NEON accelerated ChaCha20 symmetric cipher"
- depends on KERNEL_MODE_NEON
- select CRYPTO_BLKCIPHER
- select CRYPTO_CHACHA20
-
config CRYPTO_AES_ARM64_BS
tristate "AES in ECB/CBC/CTR/XTS modes using bit-sliced NEON algorithm"
depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index bcafd016618e..507c4bfb86e3 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -53,9 +53,6 @@ sha256-arm64-y := sha256-glue.o sha256-core.o
obj-$(CONFIG_CRYPTO_SHA512_ARM64) += sha512-arm64.o
sha512-arm64-y := sha512-glue.o sha512-core.o
-obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
-chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
-
obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
speck-neon-y := speck-neon-core.o speck-neon-glue.o
diff --git a/arch/arm64/crypto/chacha20-neon-core.S b/arch/arm64/crypto/chacha20-neon-core.S
deleted file mode 100644
index 13c85e272c2a..000000000000
--- a/arch/arm64/crypto/chacha20-neon-core.S
+++ /dev/null
@@ -1,450 +0,0 @@
-/*
- * ChaCha20 256-bit cipher algorithm, RFC7539, arm64 NEON functions
- *
- * Copyright (C) 2016 Linaro, Ltd. <ard.biesheuvel@linaro.org>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * Based on:
- * ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSSE3 functions
- *
- * Copyright (C) 2015 Martin Willi
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- */
-
-#include <linux/linkage.h>
-
- .text
- .align 6
-
-ENTRY(chacha20_block_xor_neon)
- // x0: Input state matrix, s
- // x1: 1 data block output, o
- // x2: 1 data block input, i
-
- //
- // This function encrypts one ChaCha20 block by loading the state matrix
- // in four NEON registers. It performs matrix operation on four words in
- // parallel, but requires shuffling to rearrange the words after each
- // round.
- //
-
- // x0..3 = s0..3
- adr x3, ROT8
- ld1 {v0.4s-v3.4s}, [x0]
- ld1 {v8.4s-v11.4s}, [x0]
- ld1 {v12.4s}, [x3]
-
- mov x3, #10
-
-.Ldoubleround:
- // x0 += x1, x3 = rotl32(x3 ^ x0, 16)
- add v0.4s, v0.4s, v1.4s
- eor v3.16b, v3.16b, v0.16b
- rev32 v3.8h, v3.8h
-
- // x2 += x3, x1 = rotl32(x1 ^ x2, 12)
- add v2.4s, v2.4s, v3.4s
- eor v4.16b, v1.16b, v2.16b
- shl v1.4s, v4.4s, #12
- sri v1.4s, v4.4s, #20
-
- // x0 += x1, x3 = rotl32(x3 ^ x0, 8)
- add v0.4s, v0.4s, v1.4s
- eor v3.16b, v3.16b, v0.16b
- tbl v3.16b, {v3.16b}, v12.16b
-
- // x2 += x3, x1 = rotl32(x1 ^ x2, 7)
- add v2.4s, v2.4s, v3.4s
- eor v4.16b, v1.16b, v2.16b
- shl v1.4s, v4.4s, #7
- sri v1.4s, v4.4s, #25
-
- // x1 = shuffle32(x1, MASK(0, 3, 2, 1))
- ext v1.16b, v1.16b, v1.16b, #4
- // x2 = shuffle32(x2, MASK(1, 0, 3, 2))
- ext v2.16b, v2.16b, v2.16b, #8
- // x3 = shuffle32(x3, MASK(2, 1, 0, 3))
- ext v3.16b, v3.16b, v3.16b, #12
-
- // x0 += x1, x3 = rotl32(x3 ^ x0, 16)
- add v0.4s, v0.4s, v1.4s
- eor v3.16b, v3.16b, v0.16b
- rev32 v3.8h, v3.8h
-
- // x2 += x3, x1 = rotl32(x1 ^ x2, 12)
- add v2.4s, v2.4s, v3.4s
- eor v4.16b, v1.16b, v2.16b
- shl v1.4s, v4.4s, #12
- sri v1.4s, v4.4s, #20
-
- // x0 += x1, x3 = rotl32(x3 ^ x0, 8)
- add v0.4s, v0.4s, v1.4s
- eor v3.16b, v3.16b, v0.16b
- tbl v3.16b, {v3.16b}, v12.16b
-
- // x2 += x3, x1 = rotl32(x1 ^ x2, 7)
- add v2.4s, v2.4s, v3.4s
- eor v4.16b, v1.16b, v2.16b
- shl v1.4s, v4.4s, #7
- sri v1.4s, v4.4s, #25
-
- // x1 = shuffle32(x1, MASK(2, 1, 0, 3))
- ext v1.16b, v1.16b, v1.16b, #12
- // x2 = shuffle32(x2, MASK(1, 0, 3, 2))
- ext v2.16b, v2.16b, v2.16b, #8
- // x3 = shuffle32(x3, MASK(0, 3, 2, 1))
- ext v3.16b, v3.16b, v3.16b, #4
-
- subs x3, x3, #1
- b.ne .Ldoubleround
-
- ld1 {v4.16b-v7.16b}, [x2]
-
- // o0 = i0 ^ (x0 + s0)
- add v0.4s, v0.4s, v8.4s
- eor v0.16b, v0.16b, v4.16b
-
- // o1 = i1 ^ (x1 + s1)
- add v1.4s, v1.4s, v9.4s
- eor v1.16b, v1.16b, v5.16b
-
- // o2 = i2 ^ (x2 + s2)
- add v2.4s, v2.4s, v10.4s
- eor v2.16b, v2.16b, v6.16b
-
- // o3 = i3 ^ (x3 + s3)
- add v3.4s, v3.4s, v11.4s
- eor v3.16b, v3.16b, v7.16b
-
- st1 {v0.16b-v3.16b}, [x1]
-
- ret
-ENDPROC(chacha20_block_xor_neon)
-
- .align 6
-ENTRY(chacha20_4block_xor_neon)
- // x0: Input state matrix, s
- // x1: 4 data blocks output, o
- // x2: 4 data blocks input, i
-
- //
- // This function encrypts four consecutive ChaCha20 blocks by loading
- // the state matrix in NEON registers four times. The algorithm performs
- // each operation on the corresponding word of each state matrix, hence
- // requires no word shuffling. For final XORing step we transpose the
- // matrix by interleaving 32- and then 64-bit words, which allows us to
- // do XOR in NEON registers.
- //
- adr x3, CTRINC // ... and ROT8
- ld1 {v30.4s-v31.4s}, [x3]
-
- // x0..15[0-3] = s0..3[0..3]
- mov x4, x0
- ld4r { v0.4s- v3.4s}, [x4], #16
- ld4r { v4.4s- v7.4s}, [x4], #16
- ld4r { v8.4s-v11.4s}, [x4], #16
- ld4r {v12.4s-v15.4s}, [x4]
-
- // x12 += counter values 0-3
- add v12.4s, v12.4s, v30.4s
-
- mov x3, #10
-
-.Ldoubleround4:
- // x0 += x4, x12 = rotl32(x12 ^ x0, 16)
- // x1 += x5, x13 = rotl32(x13 ^ x1, 16)
- // x2 += x6, x14 = rotl32(x14 ^ x2, 16)
- // x3 += x7, x15 = rotl32(x15 ^ x3, 16)
- add v0.4s, v0.4s, v4.4s
- add v1.4s, v1.4s, v5.4s
- add v2.4s, v2.4s, v6.4s
- add v3.4s, v3.4s, v7.4s
-
- eor v12.16b, v12.16b, v0.16b
- eor v13.16b, v13.16b, v1.16b
- eor v14.16b, v14.16b, v2.16b
- eor v15.16b, v15.16b, v3.16b
-
- rev32 v12.8h, v12.8h
- rev32 v13.8h, v13.8h
- rev32 v14.8h, v14.8h
- rev32 v15.8h, v15.8h
-
- // x8 += x12, x4 = rotl32(x4 ^ x8, 12)
- // x9 += x13, x5 = rotl32(x5 ^ x9, 12)
- // x10 += x14, x6 = rotl32(x6 ^ x10, 12)
- // x11 += x15, x7 = rotl32(x7 ^ x11, 12)
- add v8.4s, v8.4s, v12.4s
- add v9.4s, v9.4s, v13.4s
- add v10.4s, v10.4s, v14.4s
- add v11.4s, v11.4s, v15.4s
-
- eor v16.16b, v4.16b, v8.16b
- eor v17.16b, v5.16b, v9.16b
- eor v18.16b, v6.16b, v10.16b
- eor v19.16b, v7.16b, v11.16b
-
- shl v4.4s, v16.4s, #12
- shl v5.4s, v17.4s, #12
- shl v6.4s, v18.4s, #12
- shl v7.4s, v19.4s, #12
-
- sri v4.4s, v16.4s, #20
- sri v5.4s, v17.4s, #20
- sri v6.4s, v18.4s, #20
- sri v7.4s, v19.4s, #20
-
- // x0 += x4, x12 = rotl32(x12 ^ x0, 8)
- // x1 += x5, x13 = rotl32(x13 ^ x1, 8)
- // x2 += x6, x14 = rotl32(x14 ^ x2, 8)
- // x3 += x7, x15 = rotl32(x15 ^ x3, 8)
- add v0.4s, v0.4s, v4.4s
- add v1.4s, v1.4s, v5.4s
- add v2.4s, v2.4s, v6.4s
- add v3.4s, v3.4s, v7.4s
-
- eor v12.16b, v12.16b, v0.16b
- eor v13.16b, v13.16b, v1.16b
- eor v14.16b, v14.16b, v2.16b
- eor v15.16b, v15.16b, v3.16b
-
- tbl v12.16b, {v12.16b}, v31.16b
- tbl v13.16b, {v13.16b}, v31.16b
- tbl v14.16b, {v14.16b}, v31.16b
- tbl v15.16b, {v15.16b}, v31.16b
-
- // x8 += x12, x4 = rotl32(x4 ^ x8, 7)
- // x9 += x13, x5 = rotl32(x5 ^ x9, 7)
- // x10 += x14, x6 = rotl32(x6 ^ x10, 7)
- // x11 += x15, x7 = rotl32(x7 ^ x11, 7)
- add v8.4s, v8.4s, v12.4s
- add v9.4s, v9.4s, v13.4s
- add v10.4s, v10.4s, v14.4s
- add v11.4s, v11.4s, v15.4s
-
- eor v16.16b, v4.16b, v8.16b
- eor v17.16b, v5.16b, v9.16b
- eor v18.16b, v6.16b, v10.16b
- eor v19.16b, v7.16b, v11.16b
-
- shl v4.4s, v16.4s, #7
- shl v5.4s, v17.4s, #7
- shl v6.4s, v18.4s, #7
- shl v7.4s, v19.4s, #7
-
- sri v4.4s, v16.4s, #25
- sri v5.4s, v17.4s, #25
- sri v6.4s, v18.4s, #25
- sri v7.4s, v19.4s, #25
-
- // x0 += x5, x15 = rotl32(x15 ^ x0, 16)
- // x1 += x6, x12 = rotl32(x12 ^ x1, 16)
- // x2 += x7, x13 = rotl32(x13 ^ x2, 16)
- // x3 += x4, x14 = rotl32(x14 ^ x3, 16)
- add v0.4s, v0.4s, v5.4s
- add v1.4s, v1.4s, v6.4s
- add v2.4s, v2.4s, v7.4s
- add v3.4s, v3.4s, v4.4s
-
- eor v15.16b, v15.16b, v0.16b
- eor v12.16b, v12.16b, v1.16b
- eor v13.16b, v13.16b, v2.16b
- eor v14.16b, v14.16b, v3.16b
-
- rev32 v15.8h, v15.8h
- rev32 v12.8h, v12.8h
- rev32 v13.8h, v13.8h
- rev32 v14.8h, v14.8h
-
- // x10 += x15, x5 = rotl32(x5 ^ x10, 12)
- // x11 += x12, x6 = rotl32(x6 ^ x11, 12)
- // x8 += x13, x7 = rotl32(x7 ^ x8, 12)
- // x9 += x14, x4 = rotl32(x4 ^ x9, 12)
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v12.4s
- add v8.4s, v8.4s, v13.4s
- add v9.4s, v9.4s, v14.4s
-
- eor v16.16b, v5.16b, v10.16b
- eor v17.16b, v6.16b, v11.16b
- eor v18.16b, v7.16b, v8.16b
- eor v19.16b, v4.16b, v9.16b
-
- shl v5.4s, v16.4s, #12
- shl v6.4s, v17.4s, #12
- shl v7.4s, v18.4s, #12
- shl v4.4s, v19.4s, #12
-
- sri v5.4s, v16.4s, #20
- sri v6.4s, v17.4s, #20
- sri v7.4s, v18.4s, #20
- sri v4.4s, v19.4s, #20
-
- // x0 += x5, x15 = rotl32(x15 ^ x0, 8)
- // x1 += x6, x12 = rotl32(x12 ^ x1, 8)
- // x2 += x7, x13 = rotl32(x13 ^ x2, 8)
- // x3 += x4, x14 = rotl32(x14 ^ x3, 8)
- add v0.4s, v0.4s, v5.4s
- add v1.4s, v1.4s, v6.4s
- add v2.4s, v2.4s, v7.4s
- add v3.4s, v3.4s, v4.4s
-
- eor v15.16b, v15.16b, v0.16b
- eor v12.16b, v12.16b, v1.16b
- eor v13.16b, v13.16b, v2.16b
- eor v14.16b, v14.16b, v3.16b
-
- tbl v15.16b, {v15.16b}, v31.16b
- tbl v12.16b, {v12.16b}, v31.16b
- tbl v13.16b, {v13.16b}, v31.16b
- tbl v14.16b, {v14.16b}, v31.16b
-
- // x10 += x15, x5 = rotl32(x5 ^ x10, 7)
- // x11 += x12, x6 = rotl32(x6 ^ x11, 7)
- // x8 += x13, x7 = rotl32(x7 ^ x8, 7)
- // x9 += x14, x4 = rotl32(x4 ^ x9, 7)
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v12.4s
- add v8.4s, v8.4s, v13.4s
- add v9.4s, v9.4s, v14.4s
-
- eor v16.16b, v5.16b, v10.16b
- eor v17.16b, v6.16b, v11.16b
- eor v18.16b, v7.16b, v8.16b
- eor v19.16b, v4.16b, v9.16b
-
- shl v5.4s, v16.4s, #7
- shl v6.4s, v17.4s, #7
- shl v7.4s, v18.4s, #7
- shl v4.4s, v19.4s, #7
-
- sri v5.4s, v16.4s, #25
- sri v6.4s, v17.4s, #25
- sri v7.4s, v18.4s, #25
- sri v4.4s, v19.4s, #25
-
- subs x3, x3, #1
- b.ne .Ldoubleround4
-
- ld4r {v16.4s-v19.4s}, [x0], #16
- ld4r {v20.4s-v23.4s}, [x0], #16
-
- // x12 += counter values 0-3
- add v12.4s, v12.4s, v30.4s
-
- // x0[0-3] += s0[0]
- // x1[0-3] += s0[1]
- // x2[0-3] += s0[2]
- // x3[0-3] += s0[3]
- add v0.4s, v0.4s, v16.4s
- add v1.4s, v1.4s, v17.4s
- add v2.4s, v2.4s, v18.4s
- add v3.4s, v3.4s, v19.4s
-
- ld4r {v24.4s-v27.4s}, [x0], #16
- ld4r {v28.4s-v31.4s}, [x0]
-
- // x4[0-3] += s1[0]
- // x5[0-3] += s1[1]
- // x6[0-3] += s1[2]
- // x7[0-3] += s1[3]
- add v4.4s, v4.4s, v20.4s
- add v5.4s, v5.4s, v21.4s
- add v6.4s, v6.4s, v22.4s
- add v7.4s, v7.4s, v23.4s
-
- // x8[0-3] += s2[0]
- // x9[0-3] += s2[1]
- // x10[0-3] += s2[2]
- // x11[0-3] += s2[3]
- add v8.4s, v8.4s, v24.4s
- add v9.4s, v9.4s, v25.4s
- add v10.4s, v10.4s, v26.4s
- add v11.4s, v11.4s, v27.4s
-
- // x12[0-3] += s3[0]
- // x13[0-3] += s3[1]
- // x14[0-3] += s3[2]
- // x15[0-3] += s3[3]
- add v12.4s, v12.4s, v28.4s
- add v13.4s, v13.4s, v29.4s
- add v14.4s, v14.4s, v30.4s
- add v15.4s, v15.4s, v31.4s
-
- // interleave 32-bit words in state n, n+1
- zip1 v16.4s, v0.4s, v1.4s
- zip2 v17.4s, v0.4s, v1.4s
- zip1 v18.4s, v2.4s, v3.4s
- zip2 v19.4s, v2.4s, v3.4s
- zip1 v20.4s, v4.4s, v5.4s
- zip2 v21.4s, v4.4s, v5.4s
- zip1 v22.4s, v6.4s, v7.4s
- zip2 v23.4s, v6.4s, v7.4s
- zip1 v24.4s, v8.4s, v9.4s
- zip2 v25.4s, v8.4s, v9.4s
- zip1 v26.4s, v10.4s, v11.4s
- zip2 v27.4s, v10.4s, v11.4s
- zip1 v28.4s, v12.4s, v13.4s
- zip2 v29.4s, v12.4s, v13.4s
- zip1 v30.4s, v14.4s, v15.4s
- zip2 v31.4s, v14.4s, v15.4s
-
- // interleave 64-bit words in state n, n+2
- zip1 v0.2d, v16.2d, v18.2d
- zip2 v4.2d, v16.2d, v18.2d
- zip1 v8.2d, v17.2d, v19.2d
- zip2 v12.2d, v17.2d, v19.2d
- ld1 {v16.16b-v19.16b}, [x2], #64
-
- zip1 v1.2d, v20.2d, v22.2d
- zip2 v5.2d, v20.2d, v22.2d
- zip1 v9.2d, v21.2d, v23.2d
- zip2 v13.2d, v21.2d, v23.2d
- ld1 {v20.16b-v23.16b}, [x2], #64
-
- zip1 v2.2d, v24.2d, v26.2d
- zip2 v6.2d, v24.2d, v26.2d
- zip1 v10.2d, v25.2d, v27.2d
- zip2 v14.2d, v25.2d, v27.2d
- ld1 {v24.16b-v27.16b}, [x2], #64
-
- zip1 v3.2d, v28.2d, v30.2d
- zip2 v7.2d, v28.2d, v30.2d
- zip1 v11.2d, v29.2d, v31.2d
- zip2 v15.2d, v29.2d, v31.2d
- ld1 {v28.16b-v31.16b}, [x2]
-
- // xor with corresponding input, write to output
- eor v16.16b, v16.16b, v0.16b
- eor v17.16b, v17.16b, v1.16b
- eor v18.16b, v18.16b, v2.16b
- eor v19.16b, v19.16b, v3.16b
- eor v20.16b, v20.16b, v4.16b
- eor v21.16b, v21.16b, v5.16b
- st1 {v16.16b-v19.16b}, [x1], #64
- eor v22.16b, v22.16b, v6.16b
- eor v23.16b, v23.16b, v7.16b
- eor v24.16b, v24.16b, v8.16b
- eor v25.16b, v25.16b, v9.16b
- st1 {v20.16b-v23.16b}, [x1], #64
- eor v26.16b, v26.16b, v10.16b
- eor v27.16b, v27.16b, v11.16b
- eor v28.16b, v28.16b, v12.16b
- st1 {v24.16b-v27.16b}, [x1], #64
- eor v29.16b, v29.16b, v13.16b
- eor v30.16b, v30.16b, v14.16b
- eor v31.16b, v31.16b, v15.16b
- st1 {v28.16b-v31.16b}, [x1]
-
- ret
-ENDPROC(chacha20_4block_xor_neon)
-
-CTRINC: .word 0, 1, 2, 3
-ROT8: .word 0x02010003, 0x06050407, 0x0a09080b, 0x0e0d0c0f
diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha20-neon-glue.c
deleted file mode 100644
index 727579c93ded..000000000000
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ /dev/null
@@ -1,133 +0,0 @@
-/*
- * ChaCha20 256-bit cipher algorithm, RFC7539, arm64 NEON functions
- *
- * Copyright (C) 2016 - 2017 Linaro, Ltd. <ard.biesheuvel@linaro.org>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * Based on:
- * ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code
- *
- * Copyright (C) 2015 Martin Willi
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- */
-
-#include <crypto/algapi.h>
-#include <crypto/chacha20.h>
-#include <crypto/internal/skcipher.h>
-#include <linux/kernel.h>
-#include <linux/module.h>
-
-#include <asm/hwcap.h>
-#include <asm/neon.h>
-#include <asm/simd.h>
-
-asmlinkage void chacha20_block_xor_neon(u32 *state, u8 *dst, const u8 *src);
-asmlinkage void chacha20_4block_xor_neon(u32 *state, u8 *dst, const u8 *src);
-
-static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
- unsigned int bytes)
-{
- u8 buf[CHACHA20_BLOCK_SIZE];
-
- while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
- kernel_neon_begin();
- chacha20_4block_xor_neon(state, dst, src);
- kernel_neon_end();
- bytes -= CHACHA20_BLOCK_SIZE * 4;
- src += CHACHA20_BLOCK_SIZE * 4;
- dst += CHACHA20_BLOCK_SIZE * 4;
- state[12] += 4;
- }
-
- if (!bytes)
- return;
-
- kernel_neon_begin();
- while (bytes >= CHACHA20_BLOCK_SIZE) {
- chacha20_block_xor_neon(state, dst, src);
- bytes -= CHACHA20_BLOCK_SIZE;
- src += CHACHA20_BLOCK_SIZE;
- dst += CHACHA20_BLOCK_SIZE;
- state[12]++;
- }
- if (bytes) {
- memcpy(buf, src, bytes);
- chacha20_block_xor_neon(state, buf, buf);
- memcpy(dst, buf, bytes);
- }
- kernel_neon_end();
-}
-
-static int chacha20_neon(struct skcipher_request *req)
-{
- struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
- struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm);
- struct skcipher_walk walk;
- u32 state[16];
- int err;
-
- if (!may_use_simd() || req->cryptlen <= CHACHA20_BLOCK_SIZE)
- return crypto_chacha20_crypt(req);
-
- err = skcipher_walk_virt(&walk, req, false);
-
- crypto_chacha20_init(state, ctx, walk.iv);
-
- while (walk.nbytes > 0) {
- unsigned int nbytes = walk.nbytes;
-
- if (nbytes < walk.total)
- nbytes = round_down(nbytes, walk.stride);
-
- chacha20_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
- nbytes);
- err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
- }
-
- return err;
-}
-
-static struct skcipher_alg alg = {
- .base.cra_name = "chacha20",
- .base.cra_driver_name = "chacha20-neon",
- .base.cra_priority = 300,
- .base.cra_blocksize = 1,
- .base.cra_ctxsize = sizeof(struct chacha20_ctx),
- .base.cra_module = THIS_MODULE,
-
- .min_keysize = CHACHA20_KEY_SIZE,
- .max_keysize = CHACHA20_KEY_SIZE,
- .ivsize = CHACHA20_IV_SIZE,
- .chunksize = CHACHA20_BLOCK_SIZE,
- .walksize = 4 * CHACHA20_BLOCK_SIZE,
- .setkey = crypto_chacha20_setkey,
- .encrypt = chacha20_neon,
- .decrypt = chacha20_neon,
-};
-
-static int __init chacha20_simd_mod_init(void)
-{
- if (!(elf_hwcap & HWCAP_ASIMD))
- return -ENODEV;
-
- return crypto_register_skcipher(&alg);
-}
-
-static void __exit chacha20_simd_mod_fini(void)
-{
- crypto_unregister_skcipher(&alg);
-}
-
-module_init(chacha20_simd_mod_init);
-module_exit(chacha20_simd_mod_fini);
-
-MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
-MODULE_LICENSE("GPL v2");
-MODULE_ALIAS_CRYPTO("chacha20");
diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index cf830219846b..419212c31246 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -23,7 +23,6 @@ obj-$(CONFIG_CRYPTO_CAMELLIA_X86_64) += camellia-x86_64.o
obj-$(CONFIG_CRYPTO_BLOWFISH_X86_64) += blowfish-x86_64.o
obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o
obj-$(CONFIG_CRYPTO_TWOFISH_X86_64_3WAY) += twofish-x86_64-3way.o
-obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha20-x86_64.o
obj-$(CONFIG_CRYPTO_SERPENT_SSE2_X86_64) += serpent-sse2-x86_64.o
obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o
@@ -76,7 +75,6 @@ camellia-x86_64-y := camellia-x86_64-asm_64.o camellia_glue.o
blowfish-x86_64-y := blowfish-x86_64-asm_64.o blowfish_glue.o
twofish-x86_64-y := twofish-x86_64-asm_64.o twofish_glue.o
twofish-x86_64-3way-y := twofish-x86_64-asm_64-3way.o twofish_glue_3way.o
-chacha20-x86_64-y := chacha20-ssse3-x86_64.o chacha20_glue.o
serpent-sse2-x86_64-y := serpent-sse2-x86_64-asm_64.o serpent_sse2_glue.o
aegis128-aesni-y := aegis128-aesni-asm.o aegis128-aesni-glue.o
@@ -99,7 +97,6 @@ endif
ifeq ($(avx2_supported),yes)
camellia-aesni-avx2-y := camellia-aesni-avx2-asm_64.o camellia_aesni_avx2_glue.o
- chacha20-x86_64-y += chacha20-avx2-x86_64.o
serpent-avx2-y := serpent-avx2-asm_64.o serpent_avx2_glue.o
morus1280-avx2-y := morus1280-avx2-asm.o morus1280-avx2-glue.o
diff --git a/arch/x86/crypto/chacha20-avx2-x86_64.S b/arch/x86/crypto/chacha20-avx2-x86_64.S
deleted file mode 100644
index f3cd26f48332..000000000000
--- a/arch/x86/crypto/chacha20-avx2-x86_64.S
+++ /dev/null
@@ -1,448 +0,0 @@
-/*
- * ChaCha20 256-bit cipher algorithm, RFC7539, x64 AVX2 functions
- *
- * Copyright (C) 2015 Martin Willi
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- */
-
-#include <linux/linkage.h>
-
-.section .rodata.cst32.ROT8, "aM", @progbits, 32
-.align 32
-ROT8: .octa 0x0e0d0c0f0a09080b0605040702010003
- .octa 0x0e0d0c0f0a09080b0605040702010003
-
-.section .rodata.cst32.ROT16, "aM", @progbits, 32
-.align 32
-ROT16: .octa 0x0d0c0f0e09080b0a0504070601000302
- .octa 0x0d0c0f0e09080b0a0504070601000302
-
-.section .rodata.cst32.CTRINC, "aM", @progbits, 32
-.align 32
-CTRINC: .octa 0x00000003000000020000000100000000
- .octa 0x00000007000000060000000500000004
-
-.text
-
-ENTRY(chacha20_8block_xor_avx2)
- # %rdi: Input state matrix, s
- # %rsi: 8 data blocks output, o
- # %rdx: 8 data blocks input, i
-
- # This function encrypts eight consecutive ChaCha20 blocks by loading
- # the state matrix in AVX registers eight times. As we need some
- # scratch registers, we save the first four registers on the stack. The
- # algorithm performs each operation on the corresponding word of each
- # state matrix, hence requires no word shuffling. For final XORing step
- # we transpose the matrix by interleaving 32-, 64- and then 128-bit
- # words, which allows us to do XOR in AVX registers. 8/16-bit word
- # rotation is done with the slightly better performing byte shuffling,
- # 7/12-bit word rotation uses traditional shift+OR.
-
- vzeroupper
- # 4 * 32 byte stack, 32-byte aligned
- lea 8(%rsp),%r10
- and $~31, %rsp
- sub $0x80, %rsp
-
- # x0..15[0-7] = s[0..15]
- vpbroadcastd 0x00(%rdi),%ymm0
- vpbroadcastd 0x04(%rdi),%ymm1
- vpbroadcastd 0x08(%rdi),%ymm2
- vpbroadcastd 0x0c(%rdi),%ymm3
- vpbroadcastd 0x10(%rdi),%ymm4
- vpbroadcastd 0x14(%rdi),%ymm5
- vpbroadcastd 0x18(%rdi),%ymm6
- vpbroadcastd 0x1c(%rdi),%ymm7
- vpbroadcastd 0x20(%rdi),%ymm8
- vpbroadcastd 0x24(%rdi),%ymm9
- vpbroadcastd 0x28(%rdi),%ymm10
- vpbroadcastd 0x2c(%rdi),%ymm11
- vpbroadcastd 0x30(%rdi),%ymm12
- vpbroadcastd 0x34(%rdi),%ymm13
- vpbroadcastd 0x38(%rdi),%ymm14
- vpbroadcastd 0x3c(%rdi),%ymm15
- # x0..3 on stack
- vmovdqa %ymm0,0x00(%rsp)
- vmovdqa %ymm1,0x20(%rsp)
- vmovdqa %ymm2,0x40(%rsp)
- vmovdqa %ymm3,0x60(%rsp)
-
- vmovdqa CTRINC(%rip),%ymm1
- vmovdqa ROT8(%rip),%ymm2
- vmovdqa ROT16(%rip),%ymm3
-
- # x12 += counter values 0-3
- vpaddd %ymm1,%ymm12,%ymm12
-
- mov $10,%ecx
-
-.Ldoubleround8:
- # x0 += x4, x12 = rotl32(x12 ^ x0, 16)
- vpaddd 0x00(%rsp),%ymm4,%ymm0
- vmovdqa %ymm0,0x00(%rsp)
- vpxor %ymm0,%ymm12,%ymm12
- vpshufb %ymm3,%ymm12,%ymm12
- # x1 += x5, x13 = rotl32(x13 ^ x1, 16)
- vpaddd 0x20(%rsp),%ymm5,%ymm0
- vmovdqa %ymm0,0x20(%rsp)
- vpxor %ymm0,%ymm13,%ymm13
- vpshufb %ymm3,%ymm13,%ymm13
- # x2 += x6, x14 = rotl32(x14 ^ x2, 16)
- vpaddd 0x40(%rsp),%ymm6,%ymm0
- vmovdqa %ymm0,0x40(%rsp)
- vpxor %ymm0,%ymm14,%ymm14
- vpshufb %ymm3,%ymm14,%ymm14
- # x3 += x7, x15 = rotl32(x15 ^ x3, 16)
- vpaddd 0x60(%rsp),%ymm7,%ymm0
- vmovdqa %ymm0,0x60(%rsp)
- vpxor %ymm0,%ymm15,%ymm15
- vpshufb %ymm3,%ymm15,%ymm15
-
- # x8 += x12, x4 = rotl32(x4 ^ x8, 12)
- vpaddd %ymm12,%ymm8,%ymm8
- vpxor %ymm8,%ymm4,%ymm4
- vpslld $12,%ymm4,%ymm0
- vpsrld $20,%ymm4,%ymm4
- vpor %ymm0,%ymm4,%ymm4
- # x9 += x13, x5 = rotl32(x5 ^ x9, 12)
- vpaddd %ymm13,%ymm9,%ymm9
- vpxor %ymm9,%ymm5,%ymm5
- vpslld $12,%ymm5,%ymm0
- vpsrld $20,%ymm5,%ymm5
- vpor %ymm0,%ymm5,%ymm5
- # x10 += x14, x6 = rotl32(x6 ^ x10, 12)
- vpaddd %ymm14,%ymm10,%ymm10
- vpxor %ymm10,%ymm6,%ymm6
- vpslld $12,%ymm6,%ymm0
- vpsrld $20,%ymm6,%ymm6
- vpor %ymm0,%ymm6,%ymm6
- # x11 += x15, x7 = rotl32(x7 ^ x11, 12)
- vpaddd %ymm15,%ymm11,%ymm11
- vpxor %ymm11,%ymm7,%ymm7
- vpslld $12,%ymm7,%ymm0
- vpsrld $20,%ymm7,%ymm7
- vpor %ymm0,%ymm7,%ymm7
-
- # x0 += x4, x12 = rotl32(x12 ^ x0, 8)
- vpaddd 0x00(%rsp),%ymm4,%ymm0
- vmovdqa %ymm0,0x00(%rsp)
- vpxor %ymm0,%ymm12,%ymm12
- vpshufb %ymm2,%ymm12,%ymm12
- # x1 += x5, x13 = rotl32(x13 ^ x1, 8)
- vpaddd 0x20(%rsp),%ymm5,%ymm0
- vmovdqa %ymm0,0x20(%rsp)
- vpxor %ymm0,%ymm13,%ymm13
- vpshufb %ymm2,%ymm13,%ymm13
- # x2 += x6, x14 = rotl32(x14 ^ x2, 8)
- vpaddd 0x40(%rsp),%ymm6,%ymm0
- vmovdqa %ymm0,0x40(%rsp)
- vpxor %ymm0,%ymm14,%ymm14
- vpshufb %ymm2,%ymm14,%ymm14
- # x3 += x7, x15 = rotl32(x15 ^ x3, 8)
- vpaddd 0x60(%rsp),%ymm7,%ymm0
- vmovdqa %ymm0,0x60(%rsp)
- vpxor %ymm0,%ymm15,%ymm15
- vpshufb %ymm2,%ymm15,%ymm15
-
- # x8 += x12, x4 = rotl32(x4 ^ x8, 7)
- vpaddd %ymm12,%ymm8,%ymm8
- vpxor %ymm8,%ymm4,%ymm4
- vpslld $7,%ymm4,%ymm0
- vpsrld $25,%ymm4,%ymm4
- vpor %ymm0,%ymm4,%ymm4
- # x9 += x13, x5 = rotl32(x5 ^ x9, 7)
- vpaddd %ymm13,%ymm9,%ymm9
- vpxor %ymm9,%ymm5,%ymm5
- vpslld $7,%ymm5,%ymm0
- vpsrld $25,%ymm5,%ymm5
- vpor %ymm0,%ymm5,%ymm5
- # x10 += x14, x6 = rotl32(x6 ^ x10, 7)
- vpaddd %ymm14,%ymm10,%ymm10
- vpxor %ymm10,%ymm6,%ymm6
- vpslld $7,%ymm6,%ymm0
- vpsrld $25,%ymm6,%ymm6
- vpor %ymm0,%ymm6,%ymm6
- # x11 += x15, x7 = rotl32(x7 ^ x11, 7)
- vpaddd %ymm15,%ymm11,%ymm11
- vpxor %ymm11,%ymm7,%ymm7
- vpslld $7,%ymm7,%ymm0
- vpsrld $25,%ymm7,%ymm7
- vpor %ymm0,%ymm7,%ymm7
-
- # x0 += x5, x15 = rotl32(x15 ^ x0, 16)
- vpaddd 0x00(%rsp),%ymm5,%ymm0
- vmovdqa %ymm0,0x00(%rsp)
- vpxor %ymm0,%ymm15,%ymm15
- vpshufb %ymm3,%ymm15,%ymm15
- # x1 += x6, x12 = rotl32(x12 ^ x1, 16)%ymm0
- vpaddd 0x20(%rsp),%ymm6,%ymm0
- vmovdqa %ymm0,0x20(%rsp)
- vpxor %ymm0,%ymm12,%ymm12
- vpshufb %ymm3,%ymm12,%ymm12
- # x2 += x7, x13 = rotl32(x13 ^ x2, 16)
- vpaddd 0x40(%rsp),%ymm7,%ymm0
- vmovdqa %ymm0,0x40(%rsp)
- vpxor %ymm0,%ymm13,%ymm13
- vpshufb %ymm3,%ymm13,%ymm13
- # x3 += x4, x14 = rotl32(x14 ^ x3, 16)
- vpaddd 0x60(%rsp),%ymm4,%ymm0
- vmovdqa %ymm0,0x60(%rsp)
- vpxor %ymm0,%ymm14,%ymm14
- vpshufb %ymm3,%ymm14,%ymm14
-
- # x10 += x15, x5 = rotl32(x5 ^ x10, 12)
- vpaddd %ymm15,%ymm10,%ymm10
- vpxor %ymm10,%ymm5,%ymm5
- vpslld $12,%ymm5,%ymm0
- vpsrld $20,%ymm5,%ymm5
- vpor %ymm0,%ymm5,%ymm5
- # x11 += x12, x6 = rotl32(x6 ^ x11, 12)
- vpaddd %ymm12,%ymm11,%ymm11
- vpxor %ymm11,%ymm6,%ymm6
- vpslld $12,%ymm6,%ymm0
- vpsrld $20,%ymm6,%ymm6
- vpor %ymm0,%ymm6,%ymm6
- # x8 += x13, x7 = rotl32(x7 ^ x8, 12)
- vpaddd %ymm13,%ymm8,%ymm8
- vpxor %ymm8,%ymm7,%ymm7
- vpslld $12,%ymm7,%ymm0
- vpsrld $20,%ymm7,%ymm7
- vpor %ymm0,%ymm7,%ymm7
- # x9 += x14, x4 = rotl32(x4 ^ x9, 12)
- vpaddd %ymm14,%ymm9,%ymm9
- vpxor %ymm9,%ymm4,%ymm4
- vpslld $12,%ymm4,%ymm0
- vpsrld $20,%ymm4,%ymm4
- vpor %ymm0,%ymm4,%ymm4
-
- # x0 += x5, x15 = rotl32(x15 ^ x0, 8)
- vpaddd 0x00(%rsp),%ymm5,%ymm0
- vmovdqa %ymm0,0x00(%rsp)
- vpxor %ymm0,%ymm15,%ymm15
- vpshufb %ymm2,%ymm15,%ymm15
- # x1 += x6, x12 = rotl32(x12 ^ x1, 8)
- vpaddd 0x20(%rsp),%ymm6,%ymm0
- vmovdqa %ymm0,0x20(%rsp)
- vpxor %ymm0,%ymm12,%ymm12
- vpshufb %ymm2,%ymm12,%ymm12
- # x2 += x7, x13 = rotl32(x13 ^ x2, 8)
- vpaddd 0x40(%rsp),%ymm7,%ymm0
- vmovdqa %ymm0,0x40(%rsp)
- vpxor %ymm0,%ymm13,%ymm13
- vpshufb %ymm2,%ymm13,%ymm13
- # x3 += x4, x14 = rotl32(x14 ^ x3, 8)
- vpaddd 0x60(%rsp),%ymm4,%ymm0
- vmovdqa %ymm0,0x60(%rsp)
- vpxor %ymm0,%ymm14,%ymm14
- vpshufb %ymm2,%ymm14,%ymm14
-
- # x10 += x15, x5 = rotl32(x5 ^ x10, 7)
- vpaddd %ymm15,%ymm10,%ymm10
- vpxor %ymm10,%ymm5,%ymm5
- vpslld $7,%ymm5,%ymm0
- vpsrld $25,%ymm5,%ymm5
- vpor %ymm0,%ymm5,%ymm5
- # x11 += x12, x6 = rotl32(x6 ^ x11, 7)
- vpaddd %ymm12,%ymm11,%ymm11
- vpxor %ymm11,%ymm6,%ymm6
- vpslld $7,%ymm6,%ymm0
- vpsrld $25,%ymm6,%ymm6
- vpor %ymm0,%ymm6,%ymm6
- # x8 += x13, x7 = rotl32(x7 ^ x8, 7)
- vpaddd %ymm13,%ymm8,%ymm8
- vpxor %ymm8,%ymm7,%ymm7
- vpslld $7,%ymm7,%ymm0
- vpsrld $25,%ymm7,%ymm7
- vpor %ymm0,%ymm7,%ymm7
- # x9 += x14, x4 = rotl32(x4 ^ x9, 7)
- vpaddd %ymm14,%ymm9,%ymm9
- vpxor %ymm9,%ymm4,%ymm4
- vpslld $7,%ymm4,%ymm0
- vpsrld $25,%ymm4,%ymm4
- vpor %ymm0,%ymm4,%ymm4
-
- dec %ecx
- jnz .Ldoubleround8
-
- # x0..15[0-3] += s[0..15]
- vpbroadcastd 0x00(%rdi),%ymm0
- vpaddd 0x00(%rsp),%ymm0,%ymm0
- vmovdqa %ymm0,0x00(%rsp)
- vpbroadcastd 0x04(%rdi),%ymm0
- vpaddd 0x20(%rsp),%ymm0,%ymm0
- vmovdqa %ymm0,0x20(%rsp)
- vpbroadcastd 0x08(%rdi),%ymm0
- vpaddd 0x40(%rsp),%ymm0,%ymm0
- vmovdqa %ymm0,0x40(%rsp)
- vpbroadcastd 0x0c(%rdi),%ymm0
- vpaddd 0x60(%rsp),%ymm0,%ymm0
- vmovdqa %ymm0,0x60(%rsp)
- vpbroadcastd 0x10(%rdi),%ymm0
- vpaddd %ymm0,%ymm4,%ymm4
- vpbroadcastd 0x14(%rdi),%ymm0
- vpaddd %ymm0,%ymm5,%ymm5
- vpbroadcastd 0x18(%rdi),%ymm0
- vpaddd %ymm0,%ymm6,%ymm6
- vpbroadcastd 0x1c(%rdi),%ymm0
- vpaddd %ymm0,%ymm7,%ymm7
- vpbroadcastd 0x20(%rdi),%ymm0
- vpaddd %ymm0,%ymm8,%ymm8
- vpbroadcastd 0x24(%rdi),%ymm0
- vpaddd %ymm0,%ymm9,%ymm9
- vpbroadcastd 0x28(%rdi),%ymm0
- vpaddd %ymm0,%ymm10,%ymm10
- vpbroadcastd 0x2c(%rdi),%ymm0
- vpaddd %ymm0,%ymm11,%ymm11
- vpbroadcastd 0x30(%rdi),%ymm0
- vpaddd %ymm0,%ymm12,%ymm12
- vpbroadcastd 0x34(%rdi),%ymm0
- vpaddd %ymm0,%ymm13,%ymm13
- vpbroadcastd 0x38(%rdi),%ymm0
- vpaddd %ymm0,%ymm14,%ymm14
- vpbroadcastd 0x3c(%rdi),%ymm0
- vpaddd %ymm0,%ymm15,%ymm15
-
- # x12 += counter values 0-3
- vpaddd %ymm1,%ymm12,%ymm12
-
- # interleave 32-bit words in state n, n+1
- vmovdqa 0x00(%rsp),%ymm0
- vmovdqa 0x20(%rsp),%ymm1
- vpunpckldq %ymm1,%ymm0,%ymm2
- vpunpckhdq %ymm1,%ymm0,%ymm1
- vmovdqa %ymm2,0x00(%rsp)
- vmovdqa %ymm1,0x20(%rsp)
- vmovdqa 0x40(%rsp),%ymm0
- vmovdqa 0x60(%rsp),%ymm1
- vpunpckldq %ymm1,%ymm0,%ymm2
- vpunpckhdq %ymm1,%ymm0,%ymm1
- vmovdqa %ymm2,0x40(%rsp)
- vmovdqa %ymm1,0x60(%rsp)
- vmovdqa %ymm4,%ymm0
- vpunpckldq %ymm5,%ymm0,%ymm4
- vpunpckhdq %ymm5,%ymm0,%ymm5
- vmovdqa %ymm6,%ymm0
- vpunpckldq %ymm7,%ymm0,%ymm6
- vpunpckhdq %ymm7,%ymm0,%ymm7
- vmovdqa %ymm8,%ymm0
- vpunpckldq %ymm9,%ymm0,%ymm8
- vpunpckhdq %ymm9,%ymm0,%ymm9
- vmovdqa %ymm10,%ymm0
- vpunpckldq %ymm11,%ymm0,%ymm10
- vpunpckhdq %ymm11,%ymm0,%ymm11
- vmovdqa %ymm12,%ymm0
- vpunpckldq %ymm13,%ymm0,%ymm12
- vpunpckhdq %ymm13,%ymm0,%ymm13
- vmovdqa %ymm14,%ymm0
- vpunpckldq %ymm15,%ymm0,%ymm14
- vpunpckhdq %ymm15,%ymm0,%ymm15
-
- # interleave 64-bit words in state n, n+2
- vmovdqa 0x00(%rsp),%ymm0
- vmovdqa 0x40(%rsp),%ymm2
- vpunpcklqdq %ymm2,%ymm0,%ymm1
- vpunpckhqdq %ymm2,%ymm0,%ymm2
- vmovdqa %ymm1,0x00(%rsp)
- vmovdqa %ymm2,0x40(%rsp)
- vmovdqa 0x20(%rsp),%ymm0
- vmovdqa 0x60(%rsp),%ymm2
- vpunpcklqdq %ymm2,%ymm0,%ymm1
- vpunpckhqdq %ymm2,%ymm0,%ymm2
- vmovdqa %ymm1,0x20(%rsp)
- vmovdqa %ymm2,0x60(%rsp)
- vmovdqa %ymm4,%ymm0
- vpunpcklqdq %ymm6,%ymm0,%ymm4
- vpunpckhqdq %ymm6,%ymm0,%ymm6
- vmovdqa %ymm5,%ymm0
- vpunpcklqdq %ymm7,%ymm0,%ymm5
- vpunpckhqdq %ymm7,%ymm0,%ymm7
- vmovdqa %ymm8,%ymm0
- vpunpcklqdq %ymm10,%ymm0,%ymm8
- vpunpckhqdq %ymm10,%ymm0,%ymm10
- vmovdqa %ymm9,%ymm0
- vpunpcklqdq %ymm11,%ymm0,%ymm9
- vpunpckhqdq %ymm11,%ymm0,%ymm11
- vmovdqa %ymm12,%ymm0
- vpunpcklqdq %ymm14,%ymm0,%ymm12
- vpunpckhqdq %ymm14,%ymm0,%ymm14
- vmovdqa %ymm13,%ymm0
- vpunpcklqdq %ymm15,%ymm0,%ymm13
- vpunpckhqdq %ymm15,%ymm0,%ymm15
-
- # interleave 128-bit words in state n, n+4
- vmovdqa 0x00(%rsp),%ymm0
- vperm2i128 $0x20,%ymm4,%ymm0,%ymm1
- vperm2i128 $0x31,%ymm4,%ymm0,%ymm4
- vmovdqa %ymm1,0x00(%rsp)
- vmovdqa 0x20(%rsp),%ymm0
- vperm2i128 $0x20,%ymm5,%ymm0,%ymm1
- vperm2i128 $0x31,%ymm5,%ymm0,%ymm5
- vmovdqa %ymm1,0x20(%rsp)
- vmovdqa 0x40(%rsp),%ymm0
- vperm2i128 $0x20,%ymm6,%ymm0,%ymm1
- vperm2i128 $0x31,%ymm6,%ymm0,%ymm6
- vmovdqa %ymm1,0x40(%rsp)
- vmovdqa 0x60(%rsp),%ymm0
- vperm2i128 $0x20,%ymm7,%ymm0,%ymm1
- vperm2i128 $0x31,%ymm7,%ymm0,%ymm7
- vmovdqa %ymm1,0x60(%rsp)
- vperm2i128 $0x20,%ymm12,%ymm8,%ymm0
- vperm2i128 $0x31,%ymm12,%ymm8,%ymm12
- vmovdqa %ymm0,%ymm8
- vperm2i128 $0x20,%ymm13,%ymm9,%ymm0
- vperm2i128 $0x31,%ymm13,%ymm9,%ymm13
- vmovdqa %ymm0,%ymm9
- vperm2i128 $0x20,%ymm14,%ymm10,%ymm0
- vperm2i128 $0x31,%ymm14,%ymm10,%ymm14
- vmovdqa %ymm0,%ymm10
- vperm2i128 $0x20,%ymm15,%ymm11,%ymm0
- vperm2i128 $0x31,%ymm15,%ymm11,%ymm15
- vmovdqa %ymm0,%ymm11
-
- # xor with corresponding input, write to output
- vmovdqa 0x00(%rsp),%ymm0
- vpxor 0x0000(%rdx),%ymm0,%ymm0
- vmovdqu %ymm0,0x0000(%rsi)
- vmovdqa 0x20(%rsp),%ymm0
- vpxor 0x0080(%rdx),%ymm0,%ymm0
- vmovdqu %ymm0,0x0080(%rsi)
- vmovdqa 0x40(%rsp),%ymm0
- vpxor 0x0040(%rdx),%ymm0,%ymm0
- vmovdqu %ymm0,0x0040(%rsi)
- vmovdqa 0x60(%rsp),%ymm0
- vpxor 0x00c0(%rdx),%ymm0,%ymm0
- vmovdqu %ymm0,0x00c0(%rsi)
- vpxor 0x0100(%rdx),%ymm4,%ymm4
- vmovdqu %ymm4,0x0100(%rsi)
- vpxor 0x0180(%rdx),%ymm5,%ymm5
- vmovdqu %ymm5,0x00180(%rsi)
- vpxor 0x0140(%rdx),%ymm6,%ymm6
- vmovdqu %ymm6,0x0140(%rsi)
- vpxor 0x01c0(%rdx),%ymm7,%ymm7
- vmovdqu %ymm7,0x01c0(%rsi)
- vpxor 0x0020(%rdx),%ymm8,%ymm8
- vmovdqu %ymm8,0x0020(%rsi)
- vpxor 0x00a0(%rdx),%ymm9,%ymm9
- vmovdqu %ymm9,0x00a0(%rsi)
- vpxor 0x0060(%rdx),%ymm10,%ymm10
- vmovdqu %ymm10,0x0060(%rsi)
- vpxor 0x00e0(%rdx),%ymm11,%ymm11
- vmovdqu %ymm11,0x00e0(%rsi)
- vpxor 0x0120(%rdx),%ymm12,%ymm12
- vmovdqu %ymm12,0x0120(%rsi)
- vpxor 0x01a0(%rdx),%ymm13,%ymm13
- vmovdqu %ymm13,0x01a0(%rsi)
- vpxor 0x0160(%rdx),%ymm14,%ymm14
- vmovdqu %ymm14,0x0160(%rsi)
- vpxor 0x01e0(%rdx),%ymm15,%ymm15
- vmovdqu %ymm15,0x01e0(%rsi)
-
- vzeroupper
- lea -8(%r10),%rsp
- ret
-ENDPROC(chacha20_8block_xor_avx2)
diff --git a/arch/x86/crypto/chacha20-ssse3-x86_64.S b/arch/x86/crypto/chacha20-ssse3-x86_64.S
deleted file mode 100644
index 512a2b500fd1..000000000000
--- a/arch/x86/crypto/chacha20-ssse3-x86_64.S
+++ /dev/null
@@ -1,630 +0,0 @@
-/*
- * ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSSE3 functions
- *
- * Copyright (C) 2015 Martin Willi
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- */
-
-#include <linux/linkage.h>
-
-.section .rodata.cst16.ROT8, "aM", @progbits, 16
-.align 16
-ROT8: .octa 0x0e0d0c0f0a09080b0605040702010003
-.section .rodata.cst16.ROT16, "aM", @progbits, 16
-.align 16
-ROT16: .octa 0x0d0c0f0e09080b0a0504070601000302
-.section .rodata.cst16.CTRINC, "aM", @progbits, 16
-.align 16
-CTRINC: .octa 0x00000003000000020000000100000000
-
-.text
-
-ENTRY(chacha20_block_xor_ssse3)
- # %rdi: Input state matrix, s
- # %rsi: 1 data block output, o
- # %rdx: 1 data block input, i
-
- # This function encrypts one ChaCha20 block by loading the state matrix
- # in four SSE registers. It performs matrix operation on four words in
- # parallel, but requireds shuffling to rearrange the words after each
- # round. 8/16-bit word rotation is done with the slightly better
- # performing SSSE3 byte shuffling, 7/12-bit word rotation uses
- # traditional shift+OR.
-
- # x0..3 = s0..3
- movdqa 0x00(%rdi),%xmm0
- movdqa 0x10(%rdi),%xmm1
- movdqa 0x20(%rdi),%xmm2
- movdqa 0x30(%rdi),%xmm3
- movdqa %xmm0,%xmm8
- movdqa %xmm1,%xmm9
- movdqa %xmm2,%xmm10
- movdqa %xmm3,%xmm11
-
- movdqa ROT8(%rip),%xmm4
- movdqa ROT16(%rip),%xmm5
-
- mov $10,%ecx
-
-.Ldoubleround:
-
- # x0 += x1, x3 = rotl32(x3 ^ x0, 16)
- paddd %xmm1,%xmm0
- pxor %xmm0,%xmm3
- pshufb %xmm5,%xmm3
-
- # x2 += x3, x1 = rotl32(x1 ^ x2, 12)
- paddd %xmm3,%xmm2
- pxor %xmm2,%xmm1
- movdqa %xmm1,%xmm6
- pslld $12,%xmm6
- psrld $20,%xmm1
- por %xmm6,%xmm1
-
- # x0 += x1, x3 = rotl32(x3 ^ x0, 8)
- paddd %xmm1,%xmm0
- pxor %xmm0,%xmm3
- pshufb %xmm4,%xmm3
-
- # x2 += x3, x1 = rotl32(x1 ^ x2, 7)
- paddd %xmm3,%xmm2
- pxor %xmm2,%xmm1
- movdqa %xmm1,%xmm7
- pslld $7,%xmm7
- psrld $25,%xmm1
- por %xmm7,%xmm1
-
- # x1 = shuffle32(x1, MASK(0, 3, 2, 1))
- pshufd $0x39,%xmm1,%xmm1
- # x2 = shuffle32(x2, MASK(1, 0, 3, 2))
- pshufd $0x4e,%xmm2,%xmm2
- # x3 = shuffle32(x3, MASK(2, 1, 0, 3))
- pshufd $0x93,%xmm3,%xmm3
-
- # x0 += x1, x3 = rotl32(x3 ^ x0, 16)
- paddd %xmm1,%xmm0
- pxor %xmm0,%xmm3
- pshufb %xmm5,%xmm3
-
- # x2 += x3, x1 = rotl32(x1 ^ x2, 12)
- paddd %xmm3,%xmm2
- pxor %xmm2,%xmm1
- movdqa %xmm1,%xmm6
- pslld $12,%xmm6
- psrld $20,%xmm1
- por %xmm6,%xmm1
-
- # x0 += x1, x3 = rotl32(x3 ^ x0, 8)
- paddd %xmm1,%xmm0
- pxor %xmm0,%xmm3
- pshufb %xmm4,%xmm3
-
- # x2 += x3, x1 = rotl32(x1 ^ x2, 7)
- paddd %xmm3,%xmm2
- pxor %xmm2,%xmm1
- movdqa %xmm1,%xmm7
- pslld $7,%xmm7
- psrld $25,%xmm1
- por %xmm7,%xmm1
-
- # x1 = shuffle32(x1, MASK(2, 1, 0, 3))
- pshufd $0x93,%xmm1,%xmm1
- # x2 = shuffle32(x2, MASK(1, 0, 3, 2))
- pshufd $0x4e,%xmm2,%xmm2
- # x3 = shuffle32(x3, MASK(0, 3, 2, 1))
- pshufd $0x39,%xmm3,%xmm3
-
- dec %ecx
- jnz .Ldoubleround
-
- # o0 = i0 ^ (x0 + s0)
- movdqu 0x00(%rdx),%xmm4
- paddd %xmm8,%xmm0
- pxor %xmm4,%xmm0
- movdqu %xmm0,0x00(%rsi)
- # o1 = i1 ^ (x1 + s1)
- movdqu 0x10(%rdx),%xmm5
- paddd %xmm9,%xmm1
- pxor %xmm5,%xmm1
- movdqu %xmm1,0x10(%rsi)
- # o2 = i2 ^ (x2 + s2)
- movdqu 0x20(%rdx),%xmm6
- paddd %xmm10,%xmm2
- pxor %xmm6,%xmm2
- movdqu %xmm2,0x20(%rsi)
- # o3 = i3 ^ (x3 + s3)
- movdqu 0x30(%rdx),%xmm7
- paddd %xmm11,%xmm3
- pxor %xmm7,%xmm3
- movdqu %xmm3,0x30(%rsi)
-
- ret
-ENDPROC(chacha20_block_xor_ssse3)
-
-ENTRY(chacha20_4block_xor_ssse3)
- # %rdi: Input state matrix, s
- # %rsi: 4 data blocks output, o
- # %rdx: 4 data blocks input, i
-
- # This function encrypts four consecutive ChaCha20 blocks by loading the
- # the state matrix in SSE registers four times. As we need some scratch
- # registers, we save the first four registers on the stack. The
- # algorithm performs each operation on the corresponding word of each
- # state matrix, hence requires no word shuffling. For final XORing step
- # we transpose the matrix by interleaving 32- and then 64-bit words,
- # which allows us to do XOR in SSE registers. 8/16-bit word rotation is
- # done with the slightly better performing SSSE3 byte shuffling,
- # 7/12-bit word rotation uses traditional shift+OR.
-
- lea 8(%rsp),%r10
- sub $0x80,%rsp
- and $~63,%rsp
-
- # x0..15[0-3] = s0..3[0..3]
- movq 0x00(%rdi),%xmm1
- pshufd $0x00,%xmm1,%xmm0
- pshufd $0x55,%xmm1,%xmm1
- movq 0x08(%rdi),%xmm3
- pshufd $0x00,%xmm3,%xmm2
- pshufd $0x55,%xmm3,%xmm3
- movq 0x10(%rdi),%xmm5
- pshufd $0x00,%xmm5,%xmm4
- pshufd $0x55,%xmm5,%xmm5
- movq 0x18(%rdi),%xmm7
- pshufd $0x00,%xmm7,%xmm6
- pshufd $0x55,%xmm7,%xmm7
- movq 0x20(%rdi),%xmm9
- pshufd $0x00,%xmm9,%xmm8
- pshufd $0x55,%xmm9,%xmm9
- movq 0x28(%rdi),%xmm11
- pshufd $0x00,%xmm11,%xmm10
- pshufd $0x55,%xmm11,%xmm11
- movq 0x30(%rdi),%xmm13
- pshufd $0x00,%xmm13,%xmm12
- pshufd $0x55,%xmm13,%xmm13
- movq 0x38(%rdi),%xmm15
- pshufd $0x00,%xmm15,%xmm14
- pshufd $0x55,%xmm15,%xmm15
- # x0..3 on stack
- movdqa %xmm0,0x00(%rsp)
- movdqa %xmm1,0x10(%rsp)
- movdqa %xmm2,0x20(%rsp)
- movdqa %xmm3,0x30(%rsp)
-
- movdqa CTRINC(%rip),%xmm1
- movdqa ROT8(%rip),%xmm2
- movdqa ROT16(%rip),%xmm3
-
- # x12 += counter values 0-3
- paddd %xmm1,%xmm12
-
- mov $10,%ecx
-
-.Ldoubleround4:
- # x0 += x4, x12 = rotl32(x12 ^ x0, 16)
- movdqa 0x00(%rsp),%xmm0
- paddd %xmm4,%xmm0
- movdqa %xmm0,0x00(%rsp)
- pxor %xmm0,%xmm12
- pshufb %xmm3,%xmm12
- # x1 += x5, x13 = rotl32(x13 ^ x1, 16)
- movdqa 0x10(%rsp),%xmm0
- paddd %xmm5,%xmm0
- movdqa %xmm0,0x10(%rsp)
- pxor %xmm0,%xmm13
- pshufb %xmm3,%xmm13
- # x2 += x6, x14 = rotl32(x14 ^ x2, 16)
- movdqa 0x20(%rsp),%xmm0
- paddd %xmm6,%xmm0
- movdqa %xmm0,0x20(%rsp)
- pxor %xmm0,%xmm14
- pshufb %xmm3,%xmm14
- # x3 += x7, x15 = rotl32(x15 ^ x3, 16)
- movdqa 0x30(%rsp),%xmm0
- paddd %xmm7,%xmm0
- movdqa %xmm0,0x30(%rsp)
- pxor %xmm0,%xmm15
- pshufb %xmm3,%xmm15
-
- # x8 += x12, x4 = rotl32(x4 ^ x8, 12)
- paddd %xmm12,%xmm8
- pxor %xmm8,%xmm4
- movdqa %xmm4,%xmm0
- pslld $12,%xmm0
- psrld $20,%xmm4
- por %xmm0,%xmm4
- # x9 += x13, x5 = rotl32(x5 ^ x9, 12)
- paddd %xmm13,%xmm9
- pxor %xmm9,%xmm5
- movdqa %xmm5,%xmm0
- pslld $12,%xmm0
- psrld $20,%xmm5
- por %xmm0,%xmm5
- # x10 += x14, x6 = rotl32(x6 ^ x10, 12)
- paddd %xmm14,%xmm10
- pxor %xmm10,%xmm6
- movdqa %xmm6,%xmm0
- pslld $12,%xmm0
- psrld $20,%xmm6
- por %xmm0,%xmm6
- # x11 += x15, x7 = rotl32(x7 ^ x11, 12)
- paddd %xmm15,%xmm11
- pxor %xmm11,%xmm7
- movdqa %xmm7,%xmm0
- pslld $12,%xmm0
- psrld $20,%xmm7
- por %xmm0,%xmm7
-
- # x0 += x4, x12 = rotl32(x12 ^ x0, 8)
- movdqa 0x00(%rsp),%xmm0
- paddd %xmm4,%xmm0
- movdqa %xmm0,0x00(%rsp)
- pxor %xmm0,%xmm12
- pshufb %xmm2,%xmm12
- # x1 += x5, x13 = rotl32(x13 ^ x1, 8)
- movdqa 0x10(%rsp),%xmm0
- paddd %xmm5,%xmm0
- movdqa %xmm0,0x10(%rsp)
- pxor %xmm0,%xmm13
- pshufb %xmm2,%xmm13
- # x2 += x6, x14 = rotl32(x14 ^ x2, 8)
- movdqa 0x20(%rsp),%xmm0
- paddd %xmm6,%xmm0
- movdqa %xmm0,0x20(%rsp)
- pxor %xmm0,%xmm14
- pshufb %xmm2,%xmm14
- # x3 += x7, x15 = rotl32(x15 ^ x3, 8)
- movdqa 0x30(%rsp),%xmm0
- paddd %xmm7,%xmm0
- movdqa %xmm0,0x30(%rsp)
- pxor %xmm0,%xmm15
- pshufb %xmm2,%xmm15
-
- # x8 += x12, x4 = rotl32(x4 ^ x8, 7)
- paddd %xmm12,%xmm8
- pxor %xmm8,%xmm4
- movdqa %xmm4,%xmm0
- pslld $7,%xmm0
- psrld $25,%xmm4
- por %xmm0,%xmm4
- # x9 += x13, x5 = rotl32(x5 ^ x9, 7)
- paddd %xmm13,%xmm9
- pxor %xmm9,%xmm5
- movdqa %xmm5,%xmm0
- pslld $7,%xmm0
- psrld $25,%xmm5
- por %xmm0,%xmm5
- # x10 += x14, x6 = rotl32(x6 ^ x10, 7)
- paddd %xmm14,%xmm10
- pxor %xmm10,%xmm6
- movdqa %xmm6,%xmm0
- pslld $7,%xmm0
- psrld $25,%xmm6
- por %xmm0,%xmm6
- # x11 += x15, x7 = rotl32(x7 ^ x11, 7)
- paddd %xmm15,%xmm11
- pxor %xmm11,%xmm7
- movdqa %xmm7,%xmm0
- pslld $7,%xmm0
- psrld $25,%xmm7
- por %xmm0,%xmm7
-
- # x0 += x5, x15 = rotl32(x15 ^ x0, 16)
- movdqa 0x00(%rsp),%xmm0
- paddd %xmm5,%xmm0
- movdqa %xmm0,0x00(%rsp)
- pxor %xmm0,%xmm15
- pshufb %xmm3,%xmm15
- # x1 += x6, x12 = rotl32(x12 ^ x1, 16)
- movdqa 0x10(%rsp),%xmm0
- paddd %xmm6,%xmm0
- movdqa %xmm0,0x10(%rsp)
- pxor %xmm0,%xmm12
- pshufb %xmm3,%xmm12
- # x2 += x7, x13 = rotl32(x13 ^ x2, 16)
- movdqa 0x20(%rsp),%xmm0
- paddd %xmm7,%xmm0
- movdqa %xmm0,0x20(%rsp)
- pxor %xmm0,%xmm13
- pshufb %xmm3,%xmm13
- # x3 += x4, x14 = rotl32(x14 ^ x3, 16)
- movdqa 0x30(%rsp),%xmm0
- paddd %xmm4,%xmm0
- movdqa %xmm0,0x30(%rsp)
- pxor %xmm0,%xmm14
- pshufb %xmm3,%xmm14
-
- # x10 += x15, x5 = rotl32(x5 ^ x10, 12)
- paddd %xmm15,%xmm10
- pxor %xmm10,%xmm5
- movdqa %xmm5,%xmm0
- pslld $12,%xmm0
- psrld $20,%xmm5
- por %xmm0,%xmm5
- # x11 += x12, x6 = rotl32(x6 ^ x11, 12)
- paddd %xmm12,%xmm11
- pxor %xmm11,%xmm6
- movdqa %xmm6,%xmm0
- pslld $12,%xmm0
- psrld $20,%xmm6
- por %xmm0,%xmm6
- # x8 += x13, x7 = rotl32(x7 ^ x8, 12)
- paddd %xmm13,%xmm8
- pxor %xmm8,%xmm7
- movdqa %xmm7,%xmm0
- pslld $12,%xmm0
- psrld $20,%xmm7
- por %xmm0,%xmm7
- # x9 += x14, x4 = rotl32(x4 ^ x9, 12)
- paddd %xmm14,%xmm9
- pxor %xmm9,%xmm4
- movdqa %xmm4,%xmm0
- pslld $12,%xmm0
- psrld $20,%xmm4
- por %xmm0,%xmm4
-
- # x0 += x5, x15 = rotl32(x15 ^ x0, 8)
- movdqa 0x00(%rsp),%xmm0
- paddd %xmm5,%xmm0
- movdqa %xmm0,0x00(%rsp)
- pxor %xmm0,%xmm15
- pshufb %xmm2,%xmm15
- # x1 += x6, x12 = rotl32(x12 ^ x1, 8)
- movdqa 0x10(%rsp),%xmm0
- paddd %xmm6,%xmm0
- movdqa %xmm0,0x10(%rsp)
- pxor %xmm0,%xmm12
- pshufb %xmm2,%xmm12
- # x2 += x7, x13 = rotl32(x13 ^ x2, 8)
- movdqa 0x20(%rsp),%xmm0
- paddd %xmm7,%xmm0
- movdqa %xmm0,0x20(%rsp)
- pxor %xmm0,%xmm13
- pshufb %xmm2,%xmm13
- # x3 += x4, x14 = rotl32(x14 ^ x3, 8)
- movdqa 0x30(%rsp),%xmm0
- paddd %xmm4,%xmm0
- movdqa %xmm0,0x30(%rsp)
- pxor %xmm0,%xmm14
- pshufb %xmm2,%xmm14
-
- # x10 += x15, x5 = rotl32(x5 ^ x10, 7)
- paddd %xmm15,%xmm10
- pxor %xmm10,%xmm5
- movdqa %xmm5,%xmm0
- pslld $7,%xmm0
- psrld $25,%xmm5
- por %xmm0,%xmm5
- # x11 += x12, x6 = rotl32(x6 ^ x11, 7)
- paddd %xmm12,%xmm11
- pxor %xmm11,%xmm6
- movdqa %xmm6,%xmm0
- pslld $7,%xmm0
- psrld $25,%xmm6
- por %xmm0,%xmm6
- # x8 += x13, x7 = rotl32(x7 ^ x8, 7)
- paddd %xmm13,%xmm8
- pxor %xmm8,%xmm7
- movdqa %xmm7,%xmm0
- pslld $7,%xmm0
- psrld $25,%xmm7
- por %xmm0,%xmm7
- # x9 += x14, x4 = rotl32(x4 ^ x9, 7)
- paddd %xmm14,%xmm9
- pxor %xmm9,%xmm4
- movdqa %xmm4,%xmm0
- pslld $7,%xmm0
- psrld $25,%xmm4
- por %xmm0,%xmm4
-
- dec %ecx
- jnz .Ldoubleround4
-
- # x0[0-3] += s0[0]
- # x1[0-3] += s0[1]
- movq 0x00(%rdi),%xmm3
- pshufd $0x00,%xmm3,%xmm2
- pshufd $0x55,%xmm3,%xmm3
- paddd 0x00(%rsp),%xmm2
- movdqa %xmm2,0x00(%rsp)
- paddd 0x10(%rsp),%xmm3
- movdqa %xmm3,0x10(%rsp)
- # x2[0-3] += s0[2]
- # x3[0-3] += s0[3]
- movq 0x08(%rdi),%xmm3
- pshufd $0x00,%xmm3,%xmm2
- pshufd $0x55,%xmm3,%xmm3
- paddd 0x20(%rsp),%xmm2
- movdqa %xmm2,0x20(%rsp)
- paddd 0x30(%rsp),%xmm3
- movdqa %xmm3,0x30(%rsp)
-
- # x4[0-3] += s1[0]
- # x5[0-3] += s1[1]
- movq 0x10(%rdi),%xmm3
- pshufd $0x00,%xmm3,%xmm2
- pshufd $0x55,%xmm3,%xmm3
- paddd %xmm2,%xmm4
- paddd %xmm3,%xmm5
- # x6[0-3] += s1[2]
- # x7[0-3] += s1[3]
- movq 0x18(%rdi),%xmm3
- pshufd $0x00,%xmm3,%xmm2
- pshufd $0x55,%xmm3,%xmm3
- paddd %xmm2,%xmm6
- paddd %xmm3,%xmm7
-
- # x8[0-3] += s2[0]
- # x9[0-3] += s2[1]
- movq 0x20(%rdi),%xmm3
- pshufd $0x00,%xmm3,%xmm2
- pshufd $0x55,%xmm3,%xmm3
- paddd %xmm2,%xmm8
- paddd %xmm3,%xmm9
- # x10[0-3] += s2[2]
- # x11[0-3] += s2[3]
- movq 0x28(%rdi),%xmm3
- pshufd $0x00,%xmm3,%xmm2
- pshufd $0x55,%xmm3,%xmm3
- paddd %xmm2,%xmm10
- paddd %xmm3,%xmm11
-
- # x12[0-3] += s3[0]
- # x13[0-3] += s3[1]
- movq 0x30(%rdi),%xmm3
- pshufd $0x00,%xmm3,%xmm2
- pshufd $0x55,%xmm3,%xmm3
- paddd %xmm2,%xmm12
- paddd %xmm3,%xmm13
- # x14[0-3] += s3[2]
- # x15[0-3] += s3[3]
- movq 0x38(%rdi),%xmm3
- pshufd $0x00,%xmm3,%xmm2
- pshufd $0x55,%xmm3,%xmm3
- paddd %xmm2,%xmm14
- paddd %xmm3,%xmm15
-
- # x12 += counter values 0-3
- paddd %xmm1,%xmm12
-
- # interleave 32-bit words in state n, n+1
- movdqa 0x00(%rsp),%xmm0
- movdqa 0x10(%rsp),%xmm1
- movdqa %xmm0,%xmm2
- punpckldq %xmm1,%xmm2
- punpckhdq %xmm1,%xmm0
- movdqa %xmm2,0x00(%rsp)
- movdqa %xmm0,0x10(%rsp)
- movdqa 0x20(%rsp),%xmm0
- movdqa 0x30(%rsp),%xmm1
- movdqa %xmm0,%xmm2
- punpckldq %xmm1,%xmm2
- punpckhdq %xmm1,%xmm0
- movdqa %xmm2,0x20(%rsp)
- movdqa %xmm0,0x30(%rsp)
- movdqa %xmm4,%xmm0
- punpckldq %xmm5,%xmm4
- punpckhdq %xmm5,%xmm0
- movdqa %xmm0,%xmm5
- movdqa %xmm6,%xmm0
- punpckldq %xmm7,%xmm6
- punpckhdq %xmm7,%xmm0
- movdqa %xmm0,%xmm7
- movdqa %xmm8,%xmm0
- punpckldq %xmm9,%xmm8
- punpckhdq %xmm9,%xmm0
- movdqa %xmm0,%xmm9
- movdqa %xmm10,%xmm0
- punpckldq %xmm11,%xmm10
- punpckhdq %xmm11,%xmm0
- movdqa %xmm0,%xmm11
- movdqa %xmm12,%xmm0
- punpckldq %xmm13,%xmm12
- punpckhdq %xmm13,%xmm0
- movdqa %xmm0,%xmm13
- movdqa %xmm14,%xmm0
- punpckldq %xmm15,%xmm14
- punpckhdq %xmm15,%xmm0
- movdqa %xmm0,%xmm15
-
- # interleave 64-bit words in state n, n+2
- movdqa 0x00(%rsp),%xmm0
- movdqa 0x20(%rsp),%xmm1
- movdqa %xmm0,%xmm2
- punpcklqdq %xmm1,%xmm2
- punpckhqdq %xmm1,%xmm0
- movdqa %xmm2,0x00(%rsp)
- movdqa %xmm0,0x20(%rsp)
- movdqa 0x10(%rsp),%xmm0
- movdqa 0x30(%rsp),%xmm1
- movdqa %xmm0,%xmm2
- punpcklqdq %xmm1,%xmm2
- punpckhqdq %xmm1,%xmm0
- movdqa %xmm2,0x10(%rsp)
- movdqa %xmm0,0x30(%rsp)
- movdqa %xmm4,%xmm0
- punpcklqdq %xmm6,%xmm4
- punpckhqdq %xmm6,%xmm0
- movdqa %xmm0,%xmm6
- movdqa %xmm5,%xmm0
- punpcklqdq %xmm7,%xmm5
- punpckhqdq %xmm7,%xmm0
- movdqa %xmm0,%xmm7
- movdqa %xmm8,%xmm0
- punpcklqdq %xmm10,%xmm8
- punpckhqdq %xmm10,%xmm0
- movdqa %xmm0,%xmm10
- movdqa %xmm9,%xmm0
- punpcklqdq %xmm11,%xmm9
- punpckhqdq %xmm11,%xmm0
- movdqa %xmm0,%xmm11
- movdqa %xmm12,%xmm0
- punpcklqdq %xmm14,%xmm12
- punpckhqdq %xmm14,%xmm0
- movdqa %xmm0,%xmm14
- movdqa %xmm13,%xmm0
- punpcklqdq %xmm15,%xmm13
- punpckhqdq %xmm15,%xmm0
- movdqa %xmm0,%xmm15
-
- # xor with corresponding input, write to output
- movdqa 0x00(%rsp),%xmm0
- movdqu 0x00(%rdx),%xmm1
- pxor %xmm1,%xmm0
- movdqu %xmm0,0x00(%rsi)
- movdqa 0x10(%rsp),%xmm0
- movdqu 0x80(%rdx),%xmm1
- pxor %xmm1,%xmm0
- movdqu %xmm0,0x80(%rsi)
- movdqa 0x20(%rsp),%xmm0
- movdqu 0x40(%rdx),%xmm1
- pxor %xmm1,%xmm0
- movdqu %xmm0,0x40(%rsi)
- movdqa 0x30(%rsp),%xmm0
- movdqu 0xc0(%rdx),%xmm1
- pxor %xmm1,%xmm0
- movdqu %xmm0,0xc0(%rsi)
- movdqu 0x10(%rdx),%xmm1
- pxor %xmm1,%xmm4
- movdqu %xmm4,0x10(%rsi)
- movdqu 0x90(%rdx),%xmm1
- pxor %xmm1,%xmm5
- movdqu %xmm5,0x90(%rsi)
- movdqu 0x50(%rdx),%xmm1
- pxor %xmm1,%xmm6
- movdqu %xmm6,0x50(%rsi)
- movdqu 0xd0(%rdx),%xmm1
- pxor %xmm1,%xmm7
- movdqu %xmm7,0xd0(%rsi)
- movdqu 0x20(%rdx),%xmm1
- pxor %xmm1,%xmm8
- movdqu %xmm8,0x20(%rsi)
- movdqu 0xa0(%rdx),%xmm1
- pxor %xmm1,%xmm9
- movdqu %xmm9,0xa0(%rsi)
- movdqu 0x60(%rdx),%xmm1
- pxor %xmm1,%xmm10
- movdqu %xmm10,0x60(%rsi)
- movdqu 0xe0(%rdx),%xmm1
- pxor %xmm1,%xmm11
- movdqu %xmm11,0xe0(%rsi)
- movdqu 0x30(%rdx),%xmm1
- pxor %xmm1,%xmm12
- movdqu %xmm12,0x30(%rsi)
- movdqu 0xb0(%rdx),%xmm1
- pxor %xmm1,%xmm13
- movdqu %xmm13,0xb0(%rsi)
- movdqu 0x70(%rdx),%xmm1
- pxor %xmm1,%xmm14
- movdqu %xmm14,0x70(%rsi)
- movdqu 0xf0(%rdx),%xmm1
- pxor %xmm1,%xmm15
- movdqu %xmm15,0xf0(%rsi)
-
- lea -8(%r10),%rsp
- ret
-ENDPROC(chacha20_4block_xor_ssse3)
diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha20_glue.c
deleted file mode 100644
index dce7c5d39c2f..000000000000
--- a/arch/x86/crypto/chacha20_glue.c
+++ /dev/null
@@ -1,146 +0,0 @@
-/*
- * ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code
- *
- * Copyright (C) 2015 Martin Willi
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- */
-
-#include <crypto/algapi.h>
-#include <crypto/chacha20.h>
-#include <crypto/internal/skcipher.h>
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <asm/fpu/api.h>
-#include <asm/simd.h>
-
-#define CHACHA20_STATE_ALIGN 16
-
-asmlinkage void chacha20_block_xor_ssse3(u32 *state, u8 *dst, const u8 *src);
-asmlinkage void chacha20_4block_xor_ssse3(u32 *state, u8 *dst, const u8 *src);
-#ifdef CONFIG_AS_AVX2
-asmlinkage void chacha20_8block_xor_avx2(u32 *state, u8 *dst, const u8 *src);
-static bool chacha20_use_avx2;
-#endif
-
-static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src,
- unsigned int bytes)
-{
- u8 buf[CHACHA20_BLOCK_SIZE];
-
-#ifdef CONFIG_AS_AVX2
- if (chacha20_use_avx2) {
- while (bytes >= CHACHA20_BLOCK_SIZE * 8) {
- chacha20_8block_xor_avx2(state, dst, src);
- bytes -= CHACHA20_BLOCK_SIZE * 8;
- src += CHACHA20_BLOCK_SIZE * 8;
- dst += CHACHA20_BLOCK_SIZE * 8;
- state[12] += 8;
- }
- }
-#endif
- while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
- chacha20_4block_xor_ssse3(state, dst, src);
- bytes -= CHACHA20_BLOCK_SIZE * 4;
- src += CHACHA20_BLOCK_SIZE * 4;
- dst += CHACHA20_BLOCK_SIZE * 4;
- state[12] += 4;
- }
- while (bytes >= CHACHA20_BLOCK_SIZE) {
- chacha20_block_xor_ssse3(state, dst, src);
- bytes -= CHACHA20_BLOCK_SIZE;
- src += CHACHA20_BLOCK_SIZE;
- dst += CHACHA20_BLOCK_SIZE;
- state[12]++;
- }
- if (bytes) {
- memcpy(buf, src, bytes);
- chacha20_block_xor_ssse3(state, buf, buf);
- memcpy(dst, buf, bytes);
- }
-}
-
-static int chacha20_simd(struct skcipher_request *req)
-{
- struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
- struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm);
- u32 *state, state_buf[16 + 2] __aligned(8);
- struct skcipher_walk walk;
- int err;
-
- BUILD_BUG_ON(CHACHA20_STATE_ALIGN != 16);
- state = PTR_ALIGN(state_buf + 0, CHACHA20_STATE_ALIGN);
-
- if (req->cryptlen <= CHACHA20_BLOCK_SIZE || !may_use_simd())
- return crypto_chacha20_crypt(req);
-
- err = skcipher_walk_virt(&walk, req, true);
-
- crypto_chacha20_init(state, ctx, walk.iv);
-
- kernel_fpu_begin();
-
- while (walk.nbytes >= CHACHA20_BLOCK_SIZE) {
- chacha20_dosimd(state, walk.dst.virt.addr, walk.src.virt.addr,
- rounddown(walk.nbytes, CHACHA20_BLOCK_SIZE));
- err = skcipher_walk_done(&walk,
- walk.nbytes % CHACHA20_BLOCK_SIZE);
- }
-
- if (walk.nbytes) {
- chacha20_dosimd(state, walk.dst.virt.addr, walk.src.virt.addr,
- walk.nbytes);
- err = skcipher_walk_done(&walk, 0);
- }
-
- kernel_fpu_end();
-
- return err;
-}
-
-static struct skcipher_alg alg = {
- .base.cra_name = "chacha20",
- .base.cra_driver_name = "chacha20-simd",
- .base.cra_priority = 300,
- .base.cra_blocksize = 1,
- .base.cra_ctxsize = sizeof(struct chacha20_ctx),
- .base.cra_module = THIS_MODULE,
-
- .min_keysize = CHACHA20_KEY_SIZE,
- .max_keysize = CHACHA20_KEY_SIZE,
- .ivsize = CHACHA20_IV_SIZE,
- .chunksize = CHACHA20_BLOCK_SIZE,
- .setkey = crypto_chacha20_setkey,
- .encrypt = chacha20_simd,
- .decrypt = chacha20_simd,
-};
-
-static int __init chacha20_simd_mod_init(void)
-{
- if (!boot_cpu_has(X86_FEATURE_SSSE3))
- return -ENODEV;
-
-#ifdef CONFIG_AS_AVX2
- chacha20_use_avx2 = boot_cpu_has(X86_FEATURE_AVX) &&
- boot_cpu_has(X86_FEATURE_AVX2) &&
- cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
-#endif
- return crypto_register_skcipher(&alg);
-}
-
-static void __exit chacha20_simd_mod_fini(void)
-{
- crypto_unregister_skcipher(&alg);
-}
-
-module_init(chacha20_simd_mod_init);
-module_exit(chacha20_simd_mod_fini);
-
-MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Martin Willi <martin@strongswan.org>");
-MODULE_DESCRIPTION("chacha20 cipher algorithm, SIMD accelerated");
-MODULE_ALIAS_CRYPTO("chacha20");
-MODULE_ALIAS_CRYPTO("chacha20-simd");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 47859a0f8052..42dc48aa9b81 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1428,27 +1428,12 @@ config CRYPTO_SALSA20
config CRYPTO_CHACHA20
tristate "ChaCha20 cipher algorithm"
select CRYPTO_BLKCIPHER
+ select ZINC_CHACHA20
help
ChaCha20 cipher algorithm, RFC7539.
ChaCha20 is a 256-bit high-speed stream cipher designed by Daniel J.
Bernstein and further specified in RFC7539 for use in IETF protocols.
- This is the portable C implementation of ChaCha20.
-
- See also:
- <http://cr.yp.to/chacha/chacha-20080128.pdf>
-
-config CRYPTO_CHACHA20_X86_64
- tristate "ChaCha20 cipher algorithm (x86_64/SSSE3/AVX2)"
- depends on X86 && 64BIT
- select CRYPTO_BLKCIPHER
- select CRYPTO_CHACHA20
- help
- ChaCha20 cipher algorithm, RFC7539.
-
- ChaCha20 is a 256-bit high-speed stream cipher designed by Daniel J.
- Bernstein and further specified in RFC7539 for use in IETF protocols.
- This is the x86_64 assembler implementation using SIMD instructions.
See also:
<http://cr.yp.to/chacha/chacha-20080128.pdf>
diff --git a/crypto/Makefile b/crypto/Makefile
index 5e60348d02e2..587103b87890 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -117,7 +117,7 @@ obj-$(CONFIG_CRYPTO_ANUBIS) += anubis.o
obj-$(CONFIG_CRYPTO_SEED) += seed.o
obj-$(CONFIG_CRYPTO_SPECK) += speck.o
obj-$(CONFIG_CRYPTO_SALSA20) += salsa20_generic.o
-obj-$(CONFIG_CRYPTO_CHACHA20) += chacha20_generic.o
+obj-$(CONFIG_CRYPTO_CHACHA20) += chacha20_zinc.o
obj-$(CONFIG_CRYPTO_POLY1305) += poly1305_zinc.o
obj-$(CONFIG_CRYPTO_DEFLATE) += deflate.o
obj-$(CONFIG_CRYPTO_MICHAEL_MIC) += michael_mic.o
diff --git a/crypto/chacha20_generic.c b/crypto/chacha20_generic.c
deleted file mode 100644
index e451c3cb6a56..000000000000
--- a/crypto/chacha20_generic.c
+++ /dev/null
@@ -1,136 +0,0 @@
-/*
- * ChaCha20 256-bit cipher algorithm, RFC7539
- *
- * Copyright (C) 2015 Martin Willi
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- */
-
-#include <asm/unaligned.h>
-#include <crypto/algapi.h>
-#include <crypto/chacha20.h>
-#include <crypto/internal/skcipher.h>
-#include <linux/module.h>
-
-static void chacha20_docrypt(u32 *state, u8 *dst, const u8 *src,
- unsigned int bytes)
-{
- u32 stream[CHACHA20_BLOCK_WORDS];
-
- if (dst != src)
- memcpy(dst, src, bytes);
-
- while (bytes >= CHACHA20_BLOCK_SIZE) {
- chacha20_block(state, stream);
- crypto_xor(dst, (const u8 *)stream, CHACHA20_BLOCK_SIZE);
- bytes -= CHACHA20_BLOCK_SIZE;
- dst += CHACHA20_BLOCK_SIZE;
- }
- if (bytes) {
- chacha20_block(state, stream);
- crypto_xor(dst, (const u8 *)stream, bytes);
- }
-}
-
-void crypto_chacha20_init(u32 *state, struct chacha20_ctx *ctx, u8 *iv)
-{
- state[0] = 0x61707865; /* "expa" */
- state[1] = 0x3320646e; /* "nd 3" */
- state[2] = 0x79622d32; /* "2-by" */
- state[3] = 0x6b206574; /* "te k" */
- state[4] = ctx->key[0];
- state[5] = ctx->key[1];
- state[6] = ctx->key[2];
- state[7] = ctx->key[3];
- state[8] = ctx->key[4];
- state[9] = ctx->key[5];
- state[10] = ctx->key[6];
- state[11] = ctx->key[7];
- state[12] = get_unaligned_le32(iv + 0);
- state[13] = get_unaligned_le32(iv + 4);
- state[14] = get_unaligned_le32(iv + 8);
- state[15] = get_unaligned_le32(iv + 12);
-}
-EXPORT_SYMBOL_GPL(crypto_chacha20_init);
-
-int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
- unsigned int keysize)
-{
- struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm);
- int i;
-
- if (keysize != CHACHA20_KEY_SIZE)
- return -EINVAL;
-
- for (i = 0; i < ARRAY_SIZE(ctx->key); i++)
- ctx->key[i] = get_unaligned_le32(key + i * sizeof(u32));
-
- return 0;
-}
-EXPORT_SYMBOL_GPL(crypto_chacha20_setkey);
-
-int crypto_chacha20_crypt(struct skcipher_request *req)
-{
- struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
- struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm);
- struct skcipher_walk walk;
- u32 state[16];
- int err;
-
- err = skcipher_walk_virt(&walk, req, true);
-
- crypto_chacha20_init(state, ctx, walk.iv);
-
- while (walk.nbytes > 0) {
- unsigned int nbytes = walk.nbytes;
-
- if (nbytes < walk.total)
- nbytes = round_down(nbytes, walk.stride);
-
- chacha20_docrypt(state, walk.dst.virt.addr, walk.src.virt.addr,
- nbytes);
- err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
- }
-
- return err;
-}
-EXPORT_SYMBOL_GPL(crypto_chacha20_crypt);
-
-static struct skcipher_alg alg = {
- .base.cra_name = "chacha20",
- .base.cra_driver_name = "chacha20-generic",
- .base.cra_priority = 100,
- .base.cra_blocksize = 1,
- .base.cra_ctxsize = sizeof(struct chacha20_ctx),
- .base.cra_module = THIS_MODULE,
-
- .min_keysize = CHACHA20_KEY_SIZE,
- .max_keysize = CHACHA20_KEY_SIZE,
- .ivsize = CHACHA20_IV_SIZE,
- .chunksize = CHACHA20_BLOCK_SIZE,
- .setkey = crypto_chacha20_setkey,
- .encrypt = crypto_chacha20_crypt,
- .decrypt = crypto_chacha20_crypt,
-};
-
-static int __init chacha20_generic_mod_init(void)
-{
- return crypto_register_skcipher(&alg);
-}
-
-static void __exit chacha20_generic_mod_fini(void)
-{
- crypto_unregister_skcipher(&alg);
-}
-
-module_init(chacha20_generic_mod_init);
-module_exit(chacha20_generic_mod_fini);
-
-MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Martin Willi <martin@strongswan.org>");
-MODULE_DESCRIPTION("chacha20 cipher algorithm");
-MODULE_ALIAS_CRYPTO("chacha20");
-MODULE_ALIAS_CRYPTO("chacha20-generic");
diff --git a/crypto/chacha20_zinc.c b/crypto/chacha20_zinc.c
new file mode 100644
index 000000000000..55e4585de08c
--- /dev/null
+++ b/crypto/chacha20_zinc.c
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <asm/unaligned.h>
+#include <crypto/algapi.h>
+#include <crypto/internal/skcipher.h>
+#include <zinc/chacha20.h>
+#include <linux/module.h>
+
+static int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
+ unsigned int keysize)
+{
+ struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ if (keysize != CHACHA20_KEY_SIZE)
+ return -EINVAL;
+ chacha20_init(ctx, key, 0);
+ return 0;
+}
+
+static int crypto_chacha20_crypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct chacha20_ctx ctx = *(struct chacha20_ctx *)crypto_skcipher_ctx(tfm);
+ struct skcipher_walk walk;
+ simd_context_t simd_context;
+ int err, i;
+
+ err = skcipher_walk_virt(&walk, req, true);
+ if (unlikely(err))
+ return err;
+
+ for (i = 0; i < ARRAY_SIZE(ctx.counter); ++i)
+ ctx.counter[i] = get_unaligned_le32(walk.iv + i * sizeof(u32));
+
+ simd_get(&simd_context);
+ while (walk.nbytes > 0) {
+ unsigned int nbytes = walk.nbytes;
+
+ if (nbytes < walk.total)
+ nbytes = round_down(nbytes, walk.stride);
+
+ chacha20(&ctx, walk.dst.virt.addr, walk.src.virt.addr, nbytes,
+ &simd_context);
+
+ err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+ simd_relax(&simd_context);
+ }
+ simd_put(&simd_context);
+
+ return err;
+}
+
+static struct skcipher_alg alg = {
+ .base.cra_name = "chacha20",
+ .base.cra_driver_name = "chacha20-software",
+ .base.cra_priority = 100,
+ .base.cra_blocksize = 1,
+ .base.cra_ctxsize = sizeof(struct chacha20_ctx),
+ .base.cra_module = THIS_MODULE,
+
+ .min_keysize = CHACHA20_KEY_SIZE,
+ .max_keysize = CHACHA20_KEY_SIZE,
+ .ivsize = CHACHA20_NONCE_SIZE,
+ .chunksize = CHACHA20_BLOCK_SIZE,
+ .setkey = crypto_chacha20_setkey,
+ .encrypt = crypto_chacha20_crypt,
+ .decrypt = crypto_chacha20_crypt,
+};
+
+static int __init chacha20_mod_init(void)
+{
+ return crypto_register_skcipher(&alg);
+}
+
+static void __exit chacha20_mod_exit(void)
+{
+ crypto_unregister_skcipher(&alg);
+}
+
+module_init(chacha20_mod_init);
+module_exit(chacha20_mod_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Jason A. Donenfeld <Jason@zx2c4.com>");
+MODULE_DESCRIPTION("ChaCha20 stream cipher");
+MODULE_ALIAS_CRYPTO("chacha20");
+MODULE_ALIAS_CRYPTO("chacha20-software");
diff --git a/crypto/chacha20poly1305.c b/crypto/chacha20poly1305.c
index bf523797bef3..585c7ef4f543 100644
--- a/crypto/chacha20poly1305.c
+++ b/crypto/chacha20poly1305.c
@@ -13,7 +13,7 @@
#include <crypto/internal/hash.h>
#include <crypto/internal/skcipher.h>
#include <crypto/scatterwalk.h>
-#include <crypto/chacha20.h>
+#include <zinc/chacha20.h>
#include <zinc/poly1305.h>
#include <linux/err.h>
#include <linux/init.h>
@@ -51,7 +51,7 @@ struct poly_req {
};
struct chacha_req {
- u8 iv[CHACHA20_IV_SIZE];
+ u8 iv[CHACHA20_NONCE_SIZE];
struct scatterlist src[1];
struct skcipher_request req; /* must be last member */
};
@@ -91,7 +91,7 @@ static void chacha_iv(u8 *iv, struct aead_request *req, u32 icb)
memcpy(iv, &leicb, sizeof(leicb));
memcpy(iv + sizeof(leicb), ctx->salt, ctx->saltlen);
memcpy(iv + sizeof(leicb) + ctx->saltlen, req->iv,
- CHACHA20_IV_SIZE - sizeof(leicb) - ctx->saltlen);
+ CHACHA20_NONCE_SIZE - sizeof(leicb) - ctx->saltlen);
}
static int poly_verify_tag(struct aead_request *req)
@@ -639,7 +639,7 @@ static int chachapoly_create(struct crypto_template *tmpl, struct rtattr **tb,
err = -EINVAL;
/* Need 16-byte IV size, including Initial Block Counter value */
- if (crypto_skcipher_alg_ivsize(chacha) != CHACHA20_IV_SIZE)
+ if (crypto_skcipher_alg_ivsize(chacha) != CHACHA20_NONCE_SIZE)
goto out_drop_chacha;
/* Not a stream cipher? */
if (chacha->base.cra_blocksize != 1)
diff --git a/include/crypto/chacha20.h b/include/crypto/chacha20.h
index b83d66073db0..3b92f58f3891 100644
--- a/include/crypto/chacha20.h
+++ b/include/crypto/chacha20.h
@@ -6,23 +6,11 @@
#ifndef _CRYPTO_CHACHA20_H
#define _CRYPTO_CHACHA20_H
-#include <crypto/skcipher.h>
-#include <linux/types.h>
-#include <linux/crypto.h>
-
#define CHACHA20_IV_SIZE 16
#define CHACHA20_KEY_SIZE 32
#define CHACHA20_BLOCK_SIZE 64
#define CHACHA20_BLOCK_WORDS (CHACHA20_BLOCK_SIZE / sizeof(u32))
-struct chacha20_ctx {
- u32 key[8];
-};
-
void chacha20_block(u32 *state, u32 *stream);
-void crypto_chacha20_init(u32 *state, struct chacha20_ctx *ctx, u8 *iv);
-int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
- unsigned int keysize);
-int crypto_chacha20_crypt(struct skcipher_request *req);
#endif
--
2.19.1
^ permalink raw reply related
* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
From: Daniel Borkmann @ 2018-10-18 15:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs,
john.fastabend, netdev
In-Reply-To: <20181018081434.GT3121@hirez.programming.kicks-ass.net>
On 10/18/2018 10:14 AM, Peter Zijlstra wrote:
> On Thu, Oct 18, 2018 at 01:10:15AM +0200, Daniel Borkmann wrote:
>
>> Wouldn't this then also allow the kernel side to use smp_store_release()
>> when it updates the head? We'd be pretty much at the model as described
>> in Documentation/core-api/circular-buffers.rst.
>>
>> Meaning, rough pseudo-code diff would look as:
>>
>> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
>> index 5d3cf40..3d96275 100644
>> --- a/kernel/events/ring_buffer.c
>> +++ b/kernel/events/ring_buffer.c
>> @@ -84,8 +84,9 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
>> *
>> * See perf_output_begin().
>> */
>> - smp_wmb(); /* B, matches C */
>> - rb->user_page->data_head = head;
>> +
>> + /* B, matches C */
>> + smp_store_release(&rb->user_page->data_head, head);
>
> Yes, this would be correct.
>
> The reason we didn't do this is because smp_store_release() ends up
> being smp_mb() + WRITE_ONCE() for a fair number of platforms, even if
> they have a cheaper smp_wmb(). Most notably ARM.
Yep agree, that would be worse..
> (ARM64 OTOH would like to have smp_store_release() there I imagine;
> while x86 doesn't care either way around).
>
> A similar concern exists for the smp_load_acquire() I proposed for the
> userspace side, ARM would have to resort to smp_mb() in that situation,
> instead of the cheaper smp_rmb().
>
> The smp_store_release() on the userspace side will actually be of equal
> cost or cheaper, since it already has an smp_mb(). Most notably, x86 can
> avoid barrier entirely, because TSO doesn't allow the LOAD-STORE reorder
> (it only allows the STORE-LOAD reorder). And PowerPC can use LWSYNC
> instead of SYNC.
Ok, thanks a lot for your feedback, Peter! I've changed the user space
side now to the following diff (also moving to a small helper so it can
be reused by libbpf in the subsequent fix I had in the series):
tools/arch/arm64/include/asm/barrier.h | 70 +++++++++++++++++++++++++++++++
tools/arch/ia64/include/asm/barrier.h | 13 ++++++
tools/arch/powerpc/include/asm/barrier.h | 16 +++++++
tools/arch/s390/include/asm/barrier.h | 13 ++++++
tools/arch/sparc/include/asm/barrier_64.h | 13 ++++++
tools/arch/x86/include/asm/barrier.h | 14 +++++++
tools/include/asm/barrier.h | 35 ++++++++++++++++
tools/include/linux/ring_buffer.h | 69 ++++++++++++++++++++++++++++++
tools/perf/util/mmap.h | 15 ++-----
9 files changed, 246 insertions(+), 12 deletions(-)
create mode 100644 tools/include/linux/ring_buffer.h
diff --git a/tools/arch/arm64/include/asm/barrier.h b/tools/arch/arm64/include/asm/barrier.h
index 40bde6b..12835ea 100644
--- a/tools/arch/arm64/include/asm/barrier.h
+++ b/tools/arch/arm64/include/asm/barrier.h
@@ -14,4 +14,74 @@
#define wmb() asm volatile("dmb ishst" ::: "memory")
#define rmb() asm volatile("dmb ishld" ::: "memory")
+#define smp_store_release(p, v) \
+do { \
+ union { typeof(*p) __val; char __c[1]; } __u = \
+ { .__val = (__force typeof(*p)) (v) }; \
+ \
+ switch (sizeof(*p)) { \
+ case 1: \
+ asm volatile ("stlrb %w1, %0" \
+ : "=Q" (*p) \
+ : "r" (*(__u8 *)__u.__c) \
+ : "memory"); \
+ break; \
+ case 2: \
+ asm volatile ("stlrh %w1, %0" \
+ : "=Q" (*p) \
+ : "r" (*(__u16 *)__u.__c) \
+ : "memory"); \
+ break; \
+ case 4: \
+ asm volatile ("stlr %w1, %0" \
+ : "=Q" (*p) \
+ : "r" (*(__u32 *)__u.__c) \
+ : "memory"); \
+ break; \
+ case 8: \
+ asm volatile ("stlr %1, %0" \
+ : "=Q" (*p) \
+ : "r" (*(__u64 *)__u.__c) \
+ : "memory"); \
+ break; \
+ default: \
+ /* Only to shut up gcc ... */ \
+ mb(); \
+ break; \
+ } \
+} while (0)
+
+#define smp_load_acquire(p) \
+({ \
+ union { typeof(*p) __val; char __c[1]; } __u; \
+ \
+ switch (sizeof(*p)) { \
+ case 1: \
+ asm volatile ("ldarb %w0, %1" \
+ : "=r" (*(__u8 *)__u.__c) \
+ : "Q" (*p) : "memory"); \
+ break; \
+ case 2: \
+ asm volatile ("ldarh %w0, %1" \
+ : "=r" (*(__u16 *)__u.__c) \
+ : "Q" (*p) : "memory"); \
+ break; \
+ case 4: \
+ asm volatile ("ldar %w0, %1" \
+ : "=r" (*(__u32 *)__u.__c) \
+ : "Q" (*p) : "memory"); \
+ break; \
+ case 8: \
+ asm volatile ("ldar %0, %1" \
+ : "=r" (*(__u64 *)__u.__c) \
+ : "Q" (*p) : "memory"); \
+ break; \
+ default: \
+ /* Only to shut up gcc ... */ \
+ mb(); \
+ break; \
+ } \
+ __u.__val; \
+})
+
#endif /* _TOOLS_LINUX_ASM_AARCH64_BARRIER_H */
diff --git a/tools/arch/ia64/include/asm/barrier.h b/tools/arch/ia64/include/asm/barrier.h
index d808ee0..4d471d9 100644
--- a/tools/arch/ia64/include/asm/barrier.h
+++ b/tools/arch/ia64/include/asm/barrier.h
@@ -46,4 +46,17 @@
#define rmb() mb()
#define wmb() mb()
+#define smp_store_release(p, v) \
+do { \
+ barrier(); \
+ WRITE_ONCE(*p, v); \
+} while (0)
+
+#define smp_load_acquire(p) \
+({ \
+ typeof(*p) ___p1 = READ_ONCE(*p); \
+ barrier(); \
+ ___p1; \
+})
+
#endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h
index a634da0..905a2c6 100644
--- a/tools/arch/powerpc/include/asm/barrier.h
+++ b/tools/arch/powerpc/include/asm/barrier.h
@@ -27,4 +27,20 @@
#define rmb() __asm__ __volatile__ ("sync" : : : "memory")
#define wmb() __asm__ __volatile__ ("sync" : : : "memory")
+#if defined(__powerpc64__)
+#define smp_lwsync() __asm__ __volatile__ ("lwsync" : : : "memory")
+
+#define smp_store_release(p, v) \
+do { \
+ smp_lwsync(); \
+ WRITE_ONCE(*p, v); \
+} while (0)
+
+#define smp_load_acquire(p) \
+({ \
+ typeof(*p) ___p1 = READ_ONCE(*p); \
+ smp_lwsync(); \
+ ___p1; \
+})
+#endif /* defined(__powerpc64__) */
#endif /* _TOOLS_LINUX_ASM_POWERPC_BARRIER_H */
diff --git a/tools/arch/s390/include/asm/barrier.h b/tools/arch/s390/include/asm/barrier.h
index 5030c99..de362fa6 100644
--- a/tools/arch/s390/include/asm/barrier.h
+++ b/tools/arch/s390/include/asm/barrier.h
@@ -28,4 +28,17 @@
#define rmb() mb()
#define wmb() mb()
+#define smp_store_release(p, v) \
+do { \
+ barrier(); \
+ WRITE_ONCE(*p, v); \
+} while (0)
+
+#define smp_load_acquire(p) \
+({ \
+ typeof(*p) ___p1 = READ_ONCE(*p); \
+ barrier(); \
+ ___p1; \
+})
+
#endif /* __TOOLS_LIB_ASM_BARRIER_H */
diff --git a/tools/arch/sparc/include/asm/barrier_64.h b/tools/arch/sparc/include/asm/barrier_64.h
index ba61344..cfb0fdc 100644
--- a/tools/arch/sparc/include/asm/barrier_64.h
+++ b/tools/arch/sparc/include/asm/barrier_64.h
@@ -40,4 +40,17 @@ do { __asm__ __volatile__("ba,pt %%xcc, 1f\n\t" \
#define rmb() __asm__ __volatile__("":::"memory")
#define wmb() __asm__ __volatile__("":::"memory")
+#define smp_store_release(p, v) \
+do { \
+ barrier(); \
+ WRITE_ONCE(*p, v); \
+} while (0)
+
+#define smp_load_acquire(p) \
+({ \
+ typeof(*p) ___p1 = READ_ONCE(*p); \
+ barrier(); \
+ ___p1; \
+})
+
#endif /* !(__TOOLS_LINUX_SPARC64_BARRIER_H) */
diff --git a/tools/arch/x86/include/asm/barrier.h b/tools/arch/x86/include/asm/barrier.h
index 8774dee..5891986 100644
--- a/tools/arch/x86/include/asm/barrier.h
+++ b/tools/arch/x86/include/asm/barrier.h
@@ -26,4 +26,18 @@
#define wmb() asm volatile("sfence" ::: "memory")
#endif
+#if defined(__x86_64__)
+#define smp_store_release(p, v) \
+do { \
+ barrier(); \
+ WRITE_ONCE(*p, v); \
+} while (0)
+
+#define smp_load_acquire(p) \
+({ \
+ typeof(*p) ___p1 = READ_ONCE(*p); \
+ barrier(); \
+ ___p1; \
+})
+#endif /* defined(__x86_64__) */
#endif /* _TOOLS_LINUX_ASM_X86_BARRIER_H */
diff --git a/tools/include/asm/barrier.h b/tools/include/asm/barrier.h
index 391d942..8d378c5 100644
--- a/tools/include/asm/barrier.h
+++ b/tools/include/asm/barrier.h
@@ -1,4 +1,5 @@
/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/compiler.h>
#if defined(__i386__) || defined(__x86_64__)
#include "../../arch/x86/include/asm/barrier.h"
#elif defined(__arm__)
@@ -26,3 +27,37 @@
#else
#include <asm-generic/barrier.h>
#endif
+
+/*
+ * Generic fallback smp_*() definitions for archs that haven't
+ * been updated yet.
+ */
+
+#ifndef smp_rmb
+# define smp_rmb() rmb()
+#endif
+
+#ifndef smp_wmb
+# define smp_wmb() wmb()
+#endif
+
+#ifndef smp_mb
+# define smp_mb() mb()
+#endif
+
+#ifndef smp_store_release
+# define smp_store_release(p, v) \
+do { \
+ smp_mb(); \
+ WRITE_ONCE(*p, v); \
+} while (0)
+#endif
+
+#ifndef smp_load_acquire
+# define smp_load_acquire(p) \
+({ \
+ typeof(*p) ___p1 = READ_ONCE(*p); \
+ smp_mb(); \
+ ___p1; \
+})
+#endif
diff --git a/tools/include/linux/ring_buffer.h b/tools/include/linux/ring_buffer.h
new file mode 100644
index 0000000..48200e0
--- /dev/null
+++ b/tools/include/linux/ring_buffer.h
@@ -0,0 +1,69 @@
+#ifndef _TOOLS_LINUX_RING_BUFFER_H_
+#define _TOOLS_LINUX_RING_BUFFER_H_
+
+#include <linux/compiler.h>
+#include <asm/barrier.h>
+
+/*
+ * Below barriers pair as follows (kernel/events/ring_buffer.c):
+ *
+ * Since the mmap() consumer (userspace) can run on a different CPU:
+ *
+ * kernel user
+ *
+ * if (LOAD ->data_tail) { LOAD ->data_head
+ * (A) smp_rmb() (C)
+ * STORE $data LOAD $data
+ * smp_wmb() (B) smp_mb() (D)
+ * STORE ->data_head STORE ->data_tail
+ * }
+ *
+ * Where A pairs with D, and B pairs with C.
+ *
+ * In our case A is a control dependency that separates the load
+ * of the ->data_tail and the stores of $data. In case ->data_tail
+ * indicates there is no room in the buffer to store $data we do not.
+ *
+ * D needs to be a full barrier since it separates the data READ
+ * from the tail WRITE.
+ *
+ * For B a WMB is sufficient since it separates two WRITEs, and for
+ * C an RMB is sufficient since it separates two READs.
+ */
+
+/*
+ * Note, instead of B, C, D we could also use smp_store_release()
+ * in B and D as well as smp_load_acquire() in C. However, this
+ * optimization makes sense not for all architectures since it
+ * would resolve into READ_ONCE() + smp_mb() pair for smp_load_acquire()
+ * and smp_mb() + WRITE_ONCE() pair for smp_store_release(), thus
+ * for those smp_wmb() in B and smp_rmb() in C would still be less
+ * expensive. For the case of D this has either the same cost or
+ * is less expensive. For example, due to TSO (total store order),
+ * x86 can avoid the CPU barrier entirely.
+ */
+
+static inline u64 ring_buffer_read_head(struct perf_event_mmap_page *base)
+{
+/*
+ * Architectures where smp_load_acquire() does not fallback to
+ * READ_ONCE() + smp_mb() pair.
+ */
+#if defined(__x86_64__) || defined(__aarch64__) || defined(__powerpc64__) || \
+ defined(__ia64__) || defined(__sparc__) && defined(__arch64__)
+ return smp_load_acquire(&base->data_head);
+#else
+ u64 head = READ_ONCE(base->data_head);
+
+ smp_rmb();
+ return head;
+#endif
+}
+
+static inline void ring_buffer_write_tail(struct perf_event_mmap_page *base,
+ u64 tail)
+{
+ smp_store_release(&base->data_tail, tail);
+}
+
+#endif /* _TOOLS_LINUX_RING_BUFFER_H_ */
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 05a6d47..8f6531f 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -4,7 +4,7 @@
#include <linux/compiler.h>
#include <linux/refcount.h>
#include <linux/types.h>
-#include <asm/barrier.h>
+#include <linux/ring_buffer.h>
#include <stdbool.h>
#include "auxtrace.h"
#include "event.h"
@@ -71,21 +71,12 @@ void perf_mmap__consume(struct perf_mmap *map);
static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
{
- struct perf_event_mmap_page *pc = mm->base;
- u64 head = READ_ONCE(pc->data_head);
- rmb();
- return head;
+ return ring_buffer_read_head(mm->base);
}
static inline void perf_mmap__write_tail(struct perf_mmap *md, u64 tail)
{
- struct perf_event_mmap_page *pc = md->base;
-
- /*
- * ensure all reads are done before we write the tail out.
- */
- mb();
- pc->data_tail = tail;
+ ring_buffer_write_tail(md->base, tail);
}
union perf_event *perf_mmap__read_forward(struct perf_mmap *map);
--
2.9.5
^ permalink raw reply related
* Re: [PATCH net] net: ipmr: fix unresolved entry dumps
From: David Ahern @ 2018-10-18 15:16 UTC (permalink / raw)
To: David Miller, nikolay; +Cc: netdev
In-Reply-To: <20181017.223629.857704876875928943.davem@davemloft.net>
On 10/17/18 11:36 PM, David Miller wrote:
> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> Date: Wed, 17 Oct 2018 22:34:34 +0300
>
>> If the skb space ends in an unresolved entry while dumping we'll miss
>> some unresolved entries. The reason is due to zeroing the entry counter
>> between dumping resolved and unresolved mfc entries. We should just
>> keep counting until the whole table is dumped and zero when we move to
>> the next as we have a separate table counter.
>>
>> Reported-by: Colin Ian King <colin.king@canonical.com>
>> Fixes: 8fb472c09b9d ("ipmr: improve hash scalability")
>> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>
> Applied and queued up for -stable.
>
FYI: on the net to net-next merge, those 2 lines were moved to
mr_table_dump.
^ permalink raw reply
* RE: [PATCH net-next] netpoll: allow cleanup to be synchronous
From: Banerjee, Debabrata @ 2018-10-18 15:17 UTC (permalink / raw)
To: 'Neil Horman', David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <20181018112929.GA2601@hmswarspite.think-freely.org>
> From: Neil Horman <nhorman@tuxdriver.com>
> Agreed, this doesn't make sense. If you want a synchronous cleanup, create
> a wrapper function that creates a wait queue, calls __netpoll_free_async,
> and blocks on the wait queue completion. Modify the cleanup_work
> method(s) to complete the wait queue, and you've got what you want.
>
> Neil
Actually the patch looks bad, but it seems to turn out the rtnl is always
held by the parent when it is called, and asynchronous cleanup doesn't
seem to be necessary at all. Perhaps you could share your deadlock
test case for 2cde6acd49da?
Patch v2 coming next.
-Deb
^ permalink raw reply
* [PATCH net-next v2] netpoll: allow cleanup to be synchronous
From: Debabrata Banerjee @ 2018-10-18 15:18 UTC (permalink / raw)
To: David S . Miller, Neil Horman, netdev; +Cc: dbanerje
This fixes a problem introduced by:
commit 2cde6acd49da ("netpoll: Fix __netpoll_rcu_free so that it can hold the rtnl lock")
When using netconsole on a bond, __netpoll_cleanup can asynchronously
recurse multiple times, each __netpoll_free_async call can result in
more __netpoll_free_async's. This means there is now a race between
cleanup_work queues on multiple netpoll_info's on multiple devices and
the configuration of a new netpoll. For example if a netconsole is set
to enable 0, reconfigured, and enable 1 immediately, this netconsole
will likely not work.
Given the reason for __netpoll_free_async is it can be called when rtnl
is not locked, if it is locked, we should be able to execute
synchronously. It appears to be locked everywhere it's called from.
Generalize the design pattern from the teaming driver for current
callers of __netpoll_free_async.
CC: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
---
drivers/net/bonding/bond_main.c | 3 ++-
drivers/net/macvlan.c | 2 +-
drivers/net/team/team.c | 5 +----
include/linux/netpoll.h | 4 +---
net/8021q/vlan_dev.c | 3 +--
net/bridge/br_device.c | 2 +-
net/core/netpoll.c | 20 +++++---------------
net/dsa/slave.c | 2 +-
8 files changed, 13 insertions(+), 28 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index ee28ec9e0aba..ffa37adb7681 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -963,7 +963,8 @@ static inline void slave_disable_netpoll(struct slave *slave)
return;
slave->np = NULL;
- __netpoll_free_async(np);
+
+ __netpoll_free(np);
}
static void bond_poll_controller(struct net_device *bond_dev)
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index cfda146f3b3b..fc8d5f1ee1ad 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -1077,7 +1077,7 @@ static void macvlan_dev_netpoll_cleanup(struct net_device *dev)
vlan->netpoll = NULL;
- __netpoll_free_async(netpoll);
+ __netpoll_free(netpoll);
}
#endif /* CONFIG_NET_POLL_CONTROLLER */
diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index d887016e54b6..db633ae9f784 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1104,10 +1104,7 @@ static void team_port_disable_netpoll(struct team_port *port)
return;
port->np = NULL;
- /* Wait for transmitting packets to finish before freeing. */
- synchronize_rcu_bh();
- __netpoll_cleanup(np);
- kfree(np);
+ __netpoll_free(np);
}
#else
static int team_port_enable_netpoll(struct team_port *port)
diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
index 3ef82d3a78db..676f1ff161a9 100644
--- a/include/linux/netpoll.h
+++ b/include/linux/netpoll.h
@@ -31,8 +31,6 @@ struct netpoll {
bool ipv6;
u16 local_port, remote_port;
u8 remote_mac[ETH_ALEN];
-
- struct work_struct cleanup_work;
};
struct netpoll_info {
@@ -63,7 +61,7 @@ int netpoll_parse_options(struct netpoll *np, char *opt);
int __netpoll_setup(struct netpoll *np, struct net_device *ndev);
int netpoll_setup(struct netpoll *np);
void __netpoll_cleanup(struct netpoll *np);
-void __netpoll_free_async(struct netpoll *np);
+void __netpoll_free(struct netpoll *np);
void netpoll_cleanup(struct netpoll *np);
void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
struct net_device *dev);
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 546af0e73ac3..ff720f1ebf73 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -756,8 +756,7 @@ static void vlan_dev_netpoll_cleanup(struct net_device *dev)
return;
vlan->netpoll = NULL;
-
- __netpoll_free_async(netpoll);
+ __netpoll_free(netpoll);
}
#endif /* CONFIG_NET_POLL_CONTROLLER */
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index e053a4e43758..c6abf927f0c9 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -344,7 +344,7 @@ void br_netpoll_disable(struct net_bridge_port *p)
p->np = NULL;
- __netpoll_free_async(np);
+ __netpoll_free(np);
}
#endif
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index de1d1ba92f2d..6ac71624ead4 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -591,7 +591,6 @@ int __netpoll_setup(struct netpoll *np, struct net_device *ndev)
np->dev = ndev;
strlcpy(np->dev_name, ndev->name, IFNAMSIZ);
- INIT_WORK(&np->cleanup_work, netpoll_async_cleanup);
if (ndev->priv_flags & IFF_DISABLE_NETPOLL) {
np_err(np, "%s doesn't support polling, aborting\n",
@@ -790,10 +789,6 @@ void __netpoll_cleanup(struct netpoll *np)
{
struct netpoll_info *npinfo;
- /* rtnl_dereference would be preferable here but
- * rcu_cleanup_netpoll path can put us in here safely without
- * holding the rtnl, so plain rcu_dereference it is
- */
npinfo = rtnl_dereference(np->dev->npinfo);
if (!npinfo)
return;
@@ -814,21 +809,16 @@ void __netpoll_cleanup(struct netpoll *np)
}
EXPORT_SYMBOL_GPL(__netpoll_cleanup);
-static void netpoll_async_cleanup(struct work_struct *work)
+void __netpoll_free(struct netpoll *np)
{
- struct netpoll *np = container_of(work, struct netpoll, cleanup_work);
+ ASSERT_RTNL();
- rtnl_lock();
+ /* Wait for transmitting packets to finish before freeing. */
+ synchronize_rcu_bh();
__netpoll_cleanup(np);
- rtnl_unlock();
kfree(np);
}
-
-void __netpoll_free_async(struct netpoll *np)
-{
- schedule_work(&np->cleanup_work);
-}
-EXPORT_SYMBOL_GPL(__netpoll_free_async);
+EXPORT_SYMBOL_GPL(__netpoll_free);
void netpoll_cleanup(struct netpoll *np)
{
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 3f840b6eea69..3679e13b2ead 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -722,7 +722,7 @@ static void dsa_slave_netpoll_cleanup(struct net_device *dev)
p->netpoll = NULL;
- __netpoll_free_async(netpoll);
+ __netpoll_free(netpoll);
}
static void dsa_slave_poll_controller(struct net_device *dev)
--
2.19.1
^ permalink raw reply related
* Re: [PATCH] virtio_net: add local_bh_disable() around u64_stats_update_begin
From: David Miller @ 2018-10-18 23:23 UTC (permalink / raw)
To: bigeasy; +Cc: mst, netdev, virtualization, tglx
In-Reply-To: <20181018084313.oopu34iwfwgkcwwc@linutronix.de>
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Thu, 18 Oct 2018 10:43:13 +0200
> on 32bit, lockdep notices that virtnet_open() and refill_work() invoke
> try_fill_recv() from process context while virtnet_receive() invokes the
> same function from BH context. The problem that the seqcounter within
> u64_stats_update_begin() may deadlock if it is interrupted by BH and
> then acquired again.
>
> Introduce u64_stats_update_begin_bh() which disables BH on 32bit
> architectures. Since the BH might interrupt the worker, this new
> function should not limited to SMP like the others which are expected
> to be used in softirq.
>
> With this change we might lose increments but this is okay. The
> important part that the two 32bit parts of the 64bit counter are not
> corrupted.
>
> Fixes: 461f03dc99cf6 ("virtio_net: Add kick stats").
> Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Trying to get down to the bottom of this:
1) virtnet_receive() runs from softirq but only if NAPI is active and
enabled. It is in this context that it invokes try_fill_recv().
2) refill_work() runs from process context, but disables NAPI (and
thus invocation of virtnet_receive()) before calling
try_fill_recv().
3) virtnet_open() invokes from process context as well, but before the
NAPI instances are enabled, it is same as case #2.
4) virtnet_restore_up() is the same situations as #3.
Therefore I agree that this is a false positive, and simply lockdep
cannot see the NAPI synchronization done by case #2.
I think we shouldn't add unnecessary BH disabling here, and instead
find some way to annotate this for lockdep's sake.
Thank you.
^ permalink raw reply
* Re: [iproute PATCH] rdma: Fix for ineffective check in add_filter()
From: David Ahern @ 2018-10-18 15:27 UTC (permalink / raw)
To: Phil Sutter, Stephen Hemminger; +Cc: netdev, Arkadi Sharshevsky
In-Reply-To: <20181018114154.8886-1-phil@nwl.cc>
On 10/18/18 5:41 AM, Phil Sutter wrote:
> With 'name' field defined as array in struct filters, it will always
> contain a value irrespective of whether a name was assigned or not.
>
> Fix this by turning the field into a const char pointer.
>
> Fixes: 8cd644095842a ("devlink: Add support for devlink resource abstraction")
Stale paste buffer? Seems like the correct tag is
Fixes: 1174be72d1b4c ("rdma: Add filtering infrastructure")
> Signed-off-by: Phil Sutter <phil@nwl.cc>
> ---
> rdma/rdma.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/rdma/rdma.h b/rdma/rdma.h
> index d4b7ba1918b13..c3b7530b6cc71 100644
> --- a/rdma/rdma.h
> +++ b/rdma/rdma.h
> @@ -34,7 +34,7 @@
>
> #define MAX_NUMBER_OF_FILTERS 64
> struct filters {
> - char name[32];
> + const char *name;
> bool is_number;
> };
>
>
^ permalink raw reply
* Re: [PATCH net v2] net/sched: act_gact: properly init 'goto chain'
From: Davide Caratti @ 2018-10-18 15:30 UTC (permalink / raw)
To: Jamal Hadi Salim, Jiri Pirko, Cong Wang, David S. Miller, netdev
In-Reply-To: <b27083a6-ad14-0f44-f85c-1ee73b8ddb8a@mojatatu.com>
hello Jamal,
On Thu, 2018-10-18 at 08:52 -0400, Jamal Hadi Salim wrote:
> Rejection is a good solution[1].
> Would be helpful to set an ext_ack to something like
> "only one goto chain is supported currently"
ok, that's material for a v3, if you / Cong / others are ok with the other
hunks.
On Thu, 2018-10-18 at 08:52 -0400, Jamal Hadi Salim wrote:
> I didnt follow why you needed to introduce tcfg_caction...
it's a trick (maybe an abuse, up to you to decide): when the user does
# tc f a dev v0 egress matchall action pass random determ goto chain 4 5
tcfg_caction contains 'pass', tcfg_paction and tcf_action both contain
'goto chain 4'. On the opposite, when the user does
# tc f a dev v0 egress matchall action goto chain 4 random determ drop 5
tcfg_caction and tcf_action both contain 'goto chain 4', and tcf_paction
contains 'drop'. In this way, the 'goto chain' is always stored in
tcf_action, so that TC generic code can always initialize a->goto_chain.
(we need tcf_caction, becaue tcf_gact_dump() and helpers in tc_gact.h need
to read the control action on the left of "random")
The alternative is, we systematically forbid usage of 'goto chain' in
tcfg_paction, so that:
# tc f a dev v0 egress matchall action <whatever> random determ goto chain 4 5
is systematically rejected with -EINVAL. This comand never worked, so we
are not breaking anything in userspace.
thanks for reviewing!
--
davide
^ permalink raw reply
* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
From: Alexei Starovoitov @ 2018-10-18 15:33 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Peter Zijlstra, paulmck, will.deacon, acme, yhs, john.fastabend,
netdev
In-Reply-To: <df1da9c7-f9b4-a482-0bbc-a6455b57a476@iogearbox.net>
On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
> #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
> diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h
> index a634da0..905a2c6 100644
> --- a/tools/arch/powerpc/include/asm/barrier.h
> +++ b/tools/arch/powerpc/include/asm/barrier.h
> @@ -27,4 +27,20 @@
> #define rmb() __asm__ __volatile__ ("sync" : : : "memory")
> #define wmb() __asm__ __volatile__ ("sync" : : : "memory")
>
> +#if defined(__powerpc64__)
> +#define smp_lwsync() __asm__ __volatile__ ("lwsync" : : : "memory")
> +
> +#define smp_store_release(p, v) \
> +do { \
> + smp_lwsync(); \
> + WRITE_ONCE(*p, v); \
> +} while (0)
> +
> +#define smp_load_acquire(p) \
> +({ \
> + typeof(*p) ___p1 = READ_ONCE(*p); \
> + smp_lwsync(); \
> + ___p1; \
I don't like this proliferation of asm.
Why do we think that we can do better job than compiler?
can we please use gcc builtins instead?
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
__atomic_load_n(ptr, __ATOMIC_ACQUIRE);
__atomic_store_n(ptr, val, __ATOMIC_RELEASE);
are done specifically for this use case if I'm not mistaken.
I think it pays to learn what compiler provides.
^ permalink raw reply
* Re: [PATCH v4] Wait for running BPF programs when updating map-in-map
From: Joel Fernandes @ 2018-10-18 23:36 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Daniel Colascione, Joel Fernandes, LKML, Tim Murray, netdev,
Lorenzo Colitti, Chenbo Feng, Mathieu Desnoyers,
Alexei Starovoitov, Daniel Borkmann, stable
In-Reply-To: <20181018154657.ehgj3ozy62zy47hi@ast-mbp.dhcp.thefacebook.com>
On Thu, Oct 18, 2018 at 08:46:59AM -0700, Alexei Starovoitov wrote:
> On Tue, Oct 16, 2018 at 10:39:57AM -0700, Joel Fernandes wrote:
> > On Fri, Oct 12, 2018 at 7:31 PM, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > > On Fri, Oct 12, 2018 at 03:54:27AM -0700, Daniel Colascione wrote:
> > >> The map-in-map frequently serves as a mechanism for atomic
> > >> snapshotting of state that a BPF program might record. The current
> > >> implementation is dangerous to use in this way, however, since
> > >> userspace has no way of knowing when all programs that might have
> > >> retrieved the "old" value of the map may have completed.
> > >>
> > >> This change ensures that map update operations on map-in-map map types
> > >> always wait for all references to the old map to drop before returning
> > >> to userspace.
> > >>
> > >> Signed-off-by: Daniel Colascione <dancol@google.com>
> > >> ---
> > >> kernel/bpf/syscall.c | 14 ++++++++++++++
> > >> 1 file changed, 14 insertions(+)
> > >>
> > >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > >> index 8339d81cba1d..d7c16ae1e85a 100644
> > >> --- a/kernel/bpf/syscall.c
> > >> +++ b/kernel/bpf/syscall.c
> > >> @@ -741,6 +741,18 @@ static int map_lookup_elem(union bpf_attr *attr)
> > >> return err;
> > >> }
> > >>
> > >> +static void maybe_wait_bpf_programs(struct bpf_map *map)
> > >> +{
> > >> + /* Wait for any running BPF programs to complete so that
> > >> + * userspace, when we return to it, knows that all programs
> > >> + * that could be running use the new map value.
> > >> + */
> > >> + if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
> > >> + map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) {
> > >> + synchronize_rcu();
> > >> + }
> > >
> > > extra {} were not necessary. I removed them while applying to bpf-next.
> > > Please run checkpatch.pl next time.
> > > Thanks
> >
> > Thanks Alexei for taking it. Me and Lorenzo were discussing that not
> > having this causes incorrect behavior for apps using map-in-map for
> > this. So I CC'd stable as well.
>
> It is too late in the release cycle.
> We can submit it to stable releases after the merge window.
>
Sounds good, thanks.
- Joel
^ permalink raw reply
* [net 1/1] tipc: fix info leak from kernel tipc_event
From: Jon Maloy @ 2018-10-18 15:38 UTC (permalink / raw)
To: davem, netdev
Cc: gordan.mihaljevic, tung.q.nguyen, hoang.h.le, jon.maloy,
canh.d.luu, ying.xue, tipc-discussion
We initialize a struct tipc_event allocated on the kernel stack to
zero to avert info leak to user space.
Reported-by: syzbot+057458894bc8cada4dee@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
---
net/tipc/group.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/tipc/group.c b/net/tipc/group.c
index e82f13c..06fee14 100644
--- a/net/tipc/group.c
+++ b/net/tipc/group.c
@@ -666,6 +666,7 @@ static void tipc_group_create_event(struct tipc_group *grp,
struct sk_buff *skb;
struct tipc_msg *hdr;
+ memset(&evt, 0, sizeof(evt));
evt.event = event;
evt.found_lower = m->instance;
evt.found_upper = m->instance;
--
2.1.4
^ permalink raw reply related
* Re: [PATCH] net: socket: fix a missing-check bug
From: David Miller @ 2018-10-18 23:43 UTC (permalink / raw)
To: wang6495; +Cc: kjlu, netdev, linux-kernel
In-Reply-To: <1539873406-5967-1-git-send-email-wang6495@umn.edu>
From: Wenwen Wang <wang6495@umn.edu>
Date: Thu, 18 Oct 2018 09:36:46 -0500
> In ethtool_ioctl(), the ioctl command 'ethcmd' is checked through a switch
> statement to see whether it is necessary to pre-process the ethtool
> structure, because, as mentioned in the comment, the structure
> ethtool_rxnfc is defined with padding. If yes, a user-space buffer 'rxnfc'
> is allocated through compat_alloc_user_space(). One thing to note here is
> that, if 'ethcmd' is ETHTOOL_GRXCLSRLALL, the size of the buffer 'rxnfc' is
> partially determined by 'rule_cnt', which is actually acquired from the
> user-space buffer 'compat_rxnfc', i.e., 'compat_rxnfc->rule_cnt', through
> get_user(). After 'rxnfc' is allocated, the data in the original user-space
> buffer 'compat_rxnfc' is then copied to 'rxnfc' through copy_in_user(),
> including the 'rule_cnt' field. However, after this copy, no check is
> re-enforced on 'rxnfc->rule_cnt'. So it is possible that a malicious user
> race to change the value in the 'compat_rxnfc->rule_cnt' between these two
> copies. Through this way, the attacker can bypass the previous check on
> 'rule_cnt' and inject malicious data. This can cause undefined behavior of
> the kernel and introduce potential security risk.
>
> This patch avoids the above issue via copying the value acquired by
> get_user() to 'rxnfc->rule_cn', if 'ethcmd' is ETHTOOL_GRXCLSRLALL.
>
> Signed-off-by: Wenwen Wang <wang6495@umn.edu>
This isn't pretty, but I can't come up with a better fix.
Note that we check and validate the rule count value even a third time
when we copy the rules back out to userspace.
Applied and queued up for -stable, thank you.
^ permalink raw reply
* Re: [PATCH v2] net-next/hinic: add checksum offload and TSO support
From: David Miller @ 2018-10-18 23:45 UTC (permalink / raw)
To: xuechaojing
Cc: linux-kernel, netdev, wulike1, chiqijun, fy.wang, zhaochen6,
tony.qu, luoshaokai
In-Reply-To: <20181018150251.28802-1-xuechaojing@huawei.com>
From: Xue Chaojing <xuechaojing@huawei.com>
Date: Thu, 18 Oct 2018 15:02:51 +0000
> From: Zhao Chen <zhaochen6@huawei.com>
>
> This patch adds checksum offload and TSO support for the HiNIC
> driver. Perfomance test (Iperf) shows more than 100% improvement
> in TCP streams.
>
> Signed-off-by: Zhao Chen <zhaochen6@huawei.com>
> Signed-off-by: Xue Chaojing <xuechaojing@huawei.com>
> ---
> v2: made all the local variables from longest to shortest line.
Applied, thank you.
^ permalink raw reply
* linux-next: manual merge of the net-next tree with the net tree
From: Stephen Rothwell @ 2018-10-18 23:56 UTC (permalink / raw)
To: David Miller, Networking
Cc: Linux-Next Mailing List, Linux Kernel Mailing List,
Nikolay Aleksandrov, David Ahern
[-- Attachment #1: Type: text/plain, Size: 3068 bytes --]
Hi all,
Today's linux-next merge of the net-next tree got a conflict in:
net/ipv4/ipmr_base.c
between commit:
eddf016b9104 ("net: ipmr: fix unresolved entry dumps")
from the net tree and commit:
e1cedae1ba6b ("ipmr: Refactor mr_rtm_dumproute")
from the net-next tree.
I fixed it up (I think - see below) and can carry the fix as
necessary. This is now fixed as far as linux-next is concerned, but any
non trivial conflicts should be mentioned to your upstream maintainer
when your tree is submitted for merging. You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.
--
Cheers,
Stephen Rothwell
diff --cc net/ipv4/ipmr_base.c
index eab8cd5ec2f5,844806120f44..000000000000
--- a/net/ipv4/ipmr_base.c
+++ b/net/ipv4/ipmr_base.c
@@@ -268,6 -268,83 +268,81 @@@ int mr_fill_mroute(struct mr_table *mrt
}
EXPORT_SYMBOL(mr_fill_mroute);
+ static bool mr_mfc_uses_dev(const struct mr_table *mrt,
+ const struct mr_mfc *c,
+ const struct net_device *dev)
+ {
+ int ct;
+
+ for (ct = c->mfc_un.res.minvif; ct < c->mfc_un.res.maxvif; ct++) {
+ if (VIF_EXISTS(mrt, ct) && c->mfc_un.res.ttls[ct] < 255) {
+ const struct vif_device *vif;
+
+ vif = &mrt->vif_table[ct];
+ if (vif->dev == dev)
+ return true;
+ }
+ }
+ return false;
+ }
+
+ int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb,
+ struct netlink_callback *cb,
+ int (*fill)(struct mr_table *mrt, struct sk_buff *skb,
+ u32 portid, u32 seq, struct mr_mfc *c,
+ int cmd, int flags),
+ spinlock_t *lock, struct fib_dump_filter *filter)
+ {
+ unsigned int e = 0, s_e = cb->args[1];
+ unsigned int flags = NLM_F_MULTI;
+ struct mr_mfc *mfc;
+ int err;
+
+ if (filter->filter_set)
+ flags |= NLM_F_DUMP_FILTERED;
+
+ list_for_each_entry_rcu(mfc, &mrt->mfc_cache_list, list) {
+ if (e < s_e)
+ goto next_entry;
+ if (filter->dev &&
+ !mr_mfc_uses_dev(mrt, mfc, filter->dev))
+ goto next_entry;
+
+ err = fill(mrt, skb, NETLINK_CB(cb->skb).portid,
+ cb->nlh->nlmsg_seq, mfc, RTM_NEWROUTE, flags);
+ if (err < 0)
+ goto out;
+ next_entry:
+ e++;
+ }
- e = 0;
- s_e = 0;
+
+ spin_lock_bh(lock);
+ list_for_each_entry(mfc, &mrt->mfc_unres_queue, list) {
+ if (e < s_e)
+ goto next_entry2;
+ if (filter->dev &&
+ !mr_mfc_uses_dev(mrt, mfc, filter->dev))
+ goto next_entry2;
+
+ err = fill(mrt, skb, NETLINK_CB(cb->skb).portid,
+ cb->nlh->nlmsg_seq, mfc, RTM_NEWROUTE, flags);
+ if (err < 0) {
+ spin_unlock_bh(lock);
+ goto out;
+ }
+ next_entry2:
+ e++;
+ }
+ spin_unlock_bh(lock);
+ err = 0;
+ e = 0;
+
+ out:
+ cb->args[1] = e;
+ return err;
+ }
+ EXPORT_SYMBOL(mr_table_dump);
+
int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
struct mr_table *(*iter)(struct net *net,
struct mr_table *mrt),
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [PATCH net-next] net: ethernet: ti: cpsw: don't flush mcast entries while switch promisc mode
From: Grygorii Strashko @ 2018-10-19 0:03 UTC (permalink / raw)
To: Ivan Khoronzhuk, davem; +Cc: linux-omap, netdev, linux-kernel
In-Reply-To: <20181018180006.7065-1-ivan.khoronzhuk@linaro.org>
On 10/18/18 1:00 PM, Ivan Khoronzhuk wrote:
> No need now to flush mcast entries in switch mode while toggling to
> promiscuous mode. It's not needed as vlan reg_mcast = ALL_PORTS
> and mcast/vlan ports = ALL_PORTS, the same happening for vlan
> unreg_mcast, it's set to ALL_PORT_MASK just after calling promisc
> mode routine by calling set allmulti. I suppose main reason to flush
> them is to use unreg_mcast to receive all to host port. Thus, now, all
> mcast packets are received anyway and no reason to flush mcast entries
> unsafely, as they were synced with __dev_mc_sync() previously and are
> not restored. Another way is to _dev_mc_unsync() them, but no need.
User have possibility to add additional mcast entries or edit existing
in switch-mode, which is now done using custom tool. So, Host in promisc
mode will not receive packets for mcast address X if port mask for this
addr set to (ALL_PORTS - HOST_PORT). Am I missing smth?
>
> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
> ---
>
> Based on net-next/master
> Tasted on am572x EVM and BBB
>
> drivers/net/ethernet/ti/cpsw.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
> index 226be2a56c1f..0e475020a674 100644
> --- a/drivers/net/ethernet/ti/cpsw.c
> +++ b/drivers/net/ethernet/ti/cpsw.c
> @@ -638,9 +638,6 @@ static void cpsw_set_promiscious(struct net_device *ndev, bool enable)
> } while (time_after(timeout, jiffies));
> cpsw_ale_control_set(ale, 0, ALE_AGEOUT, 1);
>
> - /* Clear all mcast from ALE */
> - cpsw_ale_flush_multicast(ale, ALE_ALL_PORTS, -1);
> -
> /* Flood All Unicast Packets to Host port */
> cpsw_ale_control_set(ale, 0, ALE_P0_UNI_FLOOD, 1);
> dev_dbg(&ndev->dev, "promiscuity enabled\n");
>
--
regards,
-grygorii
^ permalink raw reply
* [PATCH v4 bpf-next 1/2] bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
From: Song Liu @ 2018-10-18 16:06 UTC (permalink / raw)
To: netdev; +Cc: ast, daniel, kernel-team, Song Liu
In-Reply-To: <20181018160649.1611530-1-songliubraving@fb.com>
BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
skb. This patch enables direct access of skb for these programs.
Two helper functions bpf_compute_and_save_data_pointers() and
bpf_restore_data_pointers() are introduced. There are used in
__cgroup_bpf_run_filter_skb(), to compute proper data_end for the
BPF program, and restore original data afterwards.
Signed-off-by: Song Liu <songliubraving@fb.com>
---
include/linux/filter.h | 24 ++++++++++++++++++++++++
kernel/bpf/cgroup.c | 6 ++++++
net/core/filter.c | 36 +++++++++++++++++++++++++++++++++++-
3 files changed, 65 insertions(+), 1 deletion(-)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 5771874bc01e..96b3ee7f14c9 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -548,6 +548,30 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
cb->data_end = skb->data + skb_headlen(skb);
}
+/* Similar to bpf_compute_data_pointers(), except that save orginal
+ * data in cb->data and cb->meta_data for restore.
+ */
+static inline void bpf_compute_and_save_data_pointers(
+ struct sk_buff *skb, void *saved_pointers[2])
+{
+ struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;
+
+ saved_pointers[0] = cb->data_meta;
+ saved_pointers[1] = cb->data_end;
+ cb->data_meta = skb->data - skb_metadata_len(skb);
+ cb->data_end = skb->data + skb_headlen(skb);
+}
+
+/* Restore data saved by bpf_compute_data_pointers(). */
+static inline void bpf_restore_data_pointers(
+ struct sk_buff *skb, void *saved_pointers[2])
+{
+ struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;
+
+ cb->data_meta = saved_pointers[0];
+ cb->data_end = saved_pointers[1];;
+}
+
static inline u8 *bpf_skb_cb(struct sk_buff *skb)
{
/* eBPF programs may read/write skb->cb[] area to transfer meta
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 00f6ed2e4f9a..5f5180104ddc 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -554,6 +554,7 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
unsigned int offset = skb->data - skb_network_header(skb);
struct sock *save_sk;
struct cgroup *cgrp;
+ void *saved_pointers[2];
int ret;
if (!sk || !sk_fullsock(sk))
@@ -566,8 +567,13 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
save_sk = skb->sk;
skb->sk = sk;
__skb_push(skb, offset);
+
+ /* compute pointers for the bpf prog */
+ bpf_compute_and_save_data_pointers(skb, saved_pointers);
+
ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
bpf_prog_run_save_cb);
+ bpf_restore_data_pointers(skb, saved_pointers);
__skb_pull(skb, offset);
skb->sk = save_sk;
return ret == 1 ? 0 : -EPERM;
diff --git a/net/core/filter.c b/net/core/filter.c
index 1a3ac6c46873..e3ca30bd6840 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5346,6 +5346,40 @@ static bool sk_filter_is_valid_access(int off, int size,
return bpf_skb_is_valid_access(off, size, type, prog, info);
}
+static bool cg_skb_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ switch (off) {
+ case bpf_ctx_range(struct __sk_buff, tc_classid):
+ case bpf_ctx_range(struct __sk_buff, data_meta):
+ case bpf_ctx_range(struct __sk_buff, flow_keys):
+ return false;
+ }
+ if (type == BPF_WRITE) {
+ switch (off) {
+ case bpf_ctx_range(struct __sk_buff, mark):
+ case bpf_ctx_range(struct __sk_buff, priority):
+ case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
+ break;
+ default:
+ return false;
+ }
+ }
+
+ switch (off) {
+ case bpf_ctx_range(struct __sk_buff, data):
+ info->reg_type = PTR_TO_PACKET;
+ break;
+ case bpf_ctx_range(struct __sk_buff, data_end):
+ info->reg_type = PTR_TO_PACKET_END;
+ break;
+ }
+
+ return bpf_skb_is_valid_access(off, size, type, prog, info);
+}
+
static bool lwt_is_valid_access(int off, int size,
enum bpf_access_type type,
const struct bpf_prog *prog,
@@ -7038,7 +7072,7 @@ const struct bpf_prog_ops xdp_prog_ops = {
const struct bpf_verifier_ops cg_skb_verifier_ops = {
.get_func_proto = cg_skb_func_proto,
- .is_valid_access = sk_filter_is_valid_access,
+ .is_valid_access = cg_skb_is_valid_access,
.convert_ctx_access = bpf_convert_ctx_access,
};
--
2.17.1
^ permalink raw reply related
* [PATCH v4 bpf-next 2/2] bpf: add tests for direct packet access from CGROUP_SKB
From: Song Liu @ 2018-10-18 16:06 UTC (permalink / raw)
To: netdev; +Cc: ast, daniel, kernel-team, Song Liu
In-Reply-To: <20181018160649.1611530-1-songliubraving@fb.com>
Tests are added to make sure CGROUP_SKB cannot access:
tc_classid, data_meta, flow_keys
and can read and write:
mark, prority, and cb[0-4]
and can read other fields.
To make selftest with skb->sk work, a dummy sk is added in
bpf_prog_test_run_skb().
Signed-off-by: Song Liu <songliubraving@fb.com>
---
net/bpf/test_run.c | 7 +
tools/testing/selftests/bpf/test_verifier.c | 170 ++++++++++++++++++++
2 files changed, 177 insertions(+)
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 0c423b8cd75c..8dccac305268 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -10,6 +10,8 @@
#include <linux/etherdevice.h>
#include <linux/filter.h>
#include <linux/sched/signal.h>
+#include <net/sock.h>
+#include <net/tcp.h>
static __always_inline u32 bpf_test_run_one(struct bpf_prog *prog, void *ctx,
struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE])
@@ -115,6 +117,7 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
u32 retval, duration;
int hh_len = ETH_HLEN;
struct sk_buff *skb;
+ struct sock sk = {0};
void *data;
int ret;
@@ -137,11 +140,15 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
break;
}
+ sock_net_set(&sk, &init_net);
+ sock_init_data(NULL, &sk);
+
skb = build_skb(data, 0);
if (!skb) {
kfree(data);
return -ENOMEM;
}
+ skb->sk = &sk;
skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
__skb_put(skb, size);
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index cf4cd32b6772..5bfba7e8afd7 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -4862,6 +4862,176 @@ static struct bpf_test tests[] = {
.result = REJECT,
.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
+ {
+ "direct packet read test#1 for CGROUP_SKB",
+ .insns = {
+ BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+ offsetof(struct __sk_buff, data)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+ offsetof(struct __sk_buff, data_end)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+ offsetof(struct __sk_buff, len)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+ offsetof(struct __sk_buff, pkt_type)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+ offsetof(struct __sk_buff, mark)),
+ BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_6,
+ offsetof(struct __sk_buff, mark)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1,
+ offsetof(struct __sk_buff, queue_mapping)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_8, BPF_REG_1,
+ offsetof(struct __sk_buff, protocol)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_9, BPF_REG_1,
+ offsetof(struct __sk_buff, vlan_present)),
+ BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+ BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
+ BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_2, 0),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .result = ACCEPT,
+ .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+ },
+ {
+ "direct packet read test#2 for CGROUP_SKB",
+ .insns = {
+ BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+ offsetof(struct __sk_buff, vlan_tci)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+ offsetof(struct __sk_buff, vlan_proto)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+ offsetof(struct __sk_buff, priority)),
+ BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_6,
+ offsetof(struct __sk_buff, priority)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1,
+ offsetof(struct __sk_buff, ingress_ifindex)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_8, BPF_REG_1,
+ offsetof(struct __sk_buff, tc_index)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_9, BPF_REG_1,
+ offsetof(struct __sk_buff, hash)),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .result = ACCEPT,
+ .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+ },
+ {
+ "direct packet read test#3 for CGROUP_SKB",
+ .insns = {
+ BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+ offsetof(struct __sk_buff, cb[0])),
+ BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+ offsetof(struct __sk_buff, cb[1])),
+ BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+ offsetof(struct __sk_buff, cb[2])),
+ BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1,
+ offsetof(struct __sk_buff, cb[3])),
+ BPF_LDX_MEM(BPF_W, BPF_REG_8, BPF_REG_1,
+ offsetof(struct __sk_buff, cb[4])),
+ BPF_LDX_MEM(BPF_W, BPF_REG_9, BPF_REG_1,
+ offsetof(struct __sk_buff, napi_id)),
+ BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_4,
+ offsetof(struct __sk_buff, cb[0])),
+ BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_5,
+ offsetof(struct __sk_buff, cb[1])),
+ BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_6,
+ offsetof(struct __sk_buff, cb[2])),
+ BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_7,
+ offsetof(struct __sk_buff, cb[3])),
+ BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_8,
+ offsetof(struct __sk_buff, cb[4])),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .result = ACCEPT,
+ .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+ },
+ {
+ "direct packet read test#4 for CGROUP_SKB",
+ .insns = {
+ BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+ offsetof(struct __sk_buff, family)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+ offsetof(struct __sk_buff, remote_ip4)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+ offsetof(struct __sk_buff, local_ip4)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+ offsetof(struct __sk_buff, remote_ip6[0])),
+ BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+ offsetof(struct __sk_buff, remote_ip6[1])),
+ BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+ offsetof(struct __sk_buff, remote_ip6[2])),
+ BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+ offsetof(struct __sk_buff, remote_ip6[3])),
+ BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+ offsetof(struct __sk_buff, local_ip6[0])),
+ BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+ offsetof(struct __sk_buff, local_ip6[1])),
+ BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+ offsetof(struct __sk_buff, local_ip6[2])),
+ BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+ offsetof(struct __sk_buff, local_ip6[3])),
+ BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1,
+ offsetof(struct __sk_buff, remote_port)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_8, BPF_REG_1,
+ offsetof(struct __sk_buff, local_port)),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .result = ACCEPT,
+ .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+ },
+ {
+ "invalid access of tc_classid for CGROUP_SKB",
+ .insns = {
+ BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+ offsetof(struct __sk_buff, tc_classid)),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .result = REJECT,
+ .errstr = "invalid bpf_context access",
+ .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+ },
+ {
+ "invalid access of data_meta for CGROUP_SKB",
+ .insns = {
+ BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+ offsetof(struct __sk_buff, data_meta)),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .result = REJECT,
+ .errstr = "invalid bpf_context access",
+ .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+ },
+ {
+ "invalid access of flow_keys for CGROUP_SKB",
+ .insns = {
+ BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+ offsetof(struct __sk_buff, flow_keys)),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .result = REJECT,
+ .errstr = "invalid bpf_context access",
+ .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+ },
+ {
+ "invalid write access to napi_id for CGROUP_SKB",
+ .insns = {
+ BPF_LDX_MEM(BPF_W, BPF_REG_9, BPF_REG_1,
+ offsetof(struct __sk_buff, napi_id)),
+ BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_9,
+ offsetof(struct __sk_buff, napi_id)),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .result = REJECT,
+ .errstr = "invalid bpf_context access",
+ .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+ },
{
"valid cgroup storage access",
.insns = {
--
2.17.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox