Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH V2 net-next] net: ena: Fix Kconfig dependency on X86
From: David Miller @ 2018-10-18  5:28 UTC (permalink / raw)
  To: netanel
  Cc: netdev, akiyano, alisaidi, dwmw, zorik, matua, saeedb, msw,
	aliguori, nafea, gtzalik
In-Reply-To: <20181017100421.56045-1-netanel@amazon.com>

From: <netanel@amazon.com>
Date: Wed, 17 Oct 2018 10:04:21 +0000

> From: Netanel Belgazal <netanel@amazon.com>
> 
> The Kconfig limitation of X86 is to too wide.
> The ENA driver only requires a little endian dependency.
> 
> Change the dependency to be on little endian CPU.
> 
> Signed-off-by: Netanel Belgazal <netanel@amazon.com>

Applied.

^ permalink raw reply

* Re: [PATCH V1 net-next] net: ena: enable Low Latency Queues
From: David Miller @ 2018-10-18  5:31 UTC (permalink / raw)
  To: akiyano
  Cc: netdev, dwmw, zorik, matua, saeedb, msw, aliguori, nafea, gtzalik,
	netanel, alisaidi
In-Reply-To: <1539779603-17895-1-git-send-email-akiyano@amazon.com>

From: <akiyano@amazon.com>
Date: Wed, 17 Oct 2018 15:33:23 +0300

> From: Arthur Kiyanovski <akiyano@amazon.com>
> 
> Use the new API to enable usage of LLQ.
> 
> Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>

Applied.

^ permalink raw reply

* Re: [PATCH net] sctp: fix the data size calculation in sctp_data_size
From: David Miller @ 2018-10-18  5:33 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, linux-sctp, marcelo.leitner, nhorman
In-Reply-To: <c407b682a424f0441d9b9a29ef6296c8e9824b64.1539781887.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Wed, 17 Oct 2018 21:11:27 +0800

> sctp data size should be calculated by subtracting data chunk header's
> length from chunk_hdr->length, not just data header.
> 
> Fixes: 668c9beb9020 ("sctp: implement assign_number for sctp_stream_interleave")
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH net] net/sched: properly init chain in case of multiple control actions
From: Cong Wang @ 2018-10-18  5:35 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Jiri Pirko, Jamal Hadi Salim, David Miller,
	Linux Kernel Network Developers
In-Reply-To: <ddbde7ffec27783a65a63e89962bf9db7565f0e0.camel@redhat.com>

On Tue, Oct 16, 2018 at 10:38 AM Davide Caratti <dcaratti@redhat.com> wrote:
>
> On Mon, 2018-10-15 at 11:31 -0700, Cong Wang wrote:
> > On Sat, Oct 13, 2018 at 8:23 AM Davide Caratti <dcaratti@redhat.com> wrote:
> > >
> > > On Fri, 2018-10-12 at 13:57 -0700, Cong Wang wrote:
> > > > Why not just validate the fallback action in each action init()?
> > > > For example, checking tcfg_paction in tcf_gact_init().
> > > >
> > > > I don't see the need of making it generic.
> ...
> > > A (legal?) trick  is to let tcf_action store the fallback action when it
> > > contains a 'goto chain' command, I just posted a proposal for gact. If you
> > > think it's ok, I will test and post the same for act_police.
> >
> > Do we really need to support TC_ACT_GOTO_CHAIN for
> > gact->tcfg_paction etc.? I mean, is it useful in practice or is it just for
> > completeness?
> >
> > IF we don't need to support it, we can just make it invalid without needing
> > to initialize it in ->init() at all.
> >
> > If we do, however, we really need to move it into each ->init(), because
> > we have to lock each action if we are modifying an existing one. With
> > your patch, tcf_action_goto_chain_init() is still called without the per-action
> > lock.
> >
> > What's more, if we support two different actions in gact, that is, tcfg_paction
> > and tcf_action, how could you still only have one a->goto_chain pointer?
> > There should be two pointers for each of them. :)
>
> whatever fixes the NULL dereference is OK for me.
> I thought that the proposal made with
>
> https://www.mail-archive.com/netdev@vger.kernel.org/msg251933.html
>
> (i.e., letting init() copy tcfg_paction to tcf_action in case it contained
> 'goto chain x') was smart enough to preserve the current behavior, and
> also let 'goto chain' work in case it was configured  *only* for the
> fallback action.
> When the action is modified, the change to tcfg_paction is done with the
> same spinlock as tcf_action, so I didn't notice anything worse than the
> current locking layout.
>
> (well, after some more thinking I looked again at that patch and yes, it
> lacked the most important thing:)

Hmm, as I said, I am not sure if the logic is correct, if we have two different
goto actions, we must have two pointers.

I will re-think about it tomorrow. (I am at a conference so don't have much
time on reviewing this.)

Thanks.

^ permalink raw reply

* Re: [PATCH net] net: ipmr: fix unresolved entry dumps
From: David Miller @ 2018-10-18  5:36 UTC (permalink / raw)
  To: nikolay; +Cc: netdev
In-Reply-To: <20181017193434.11383-1-nikolay@cumulusnetworks.com>

From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Date: Wed, 17 Oct 2018 22:34:34 +0300

> If the skb space ends in an unresolved entry while dumping we'll miss
> some unresolved entries. The reason is due to zeroing the entry counter
> between dumping resolved and unresolved mfc entries. We should just
> keep counting until the whole table is dumped and zero when we move to
> the next as we have a separate table counter.
> 
> Reported-by: Colin Ian King <colin.king@canonical.com>
> Fixes: 8fb472c09b9d ("ipmr: improve hash scalability")
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

Applied and queued up for -stable.

^ permalink raw reply

* [PATCH v3 bpf-next 2/2] bpf: add tests for direct packet access from CGROUP_SKB
From: Song Liu @ 2018-10-18  5:39 UTC (permalink / raw)
  To: netdev; +Cc: ast, daniel, kernel-team, Song Liu
In-Reply-To: <20181018053949.4064426-1-songliubraving@fb.com>

Tests are added to make sure CGROUP_SKB cannot access:
  tc_classid, data_meta, flow_keys

and can read and write:
  mark, prority, and cb[0-4]

and can read other fields.

To make selftest with skb->sk work, a dummy sk is added in
bpf_prog_test_run_skb().

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 net/bpf/test_run.c                          |   4 +
 tools/testing/selftests/bpf/test_verifier.c | 170 ++++++++++++++++++++
 2 files changed, 174 insertions(+)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 0c423b8cd75c..c7210e2f1ae9 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -10,6 +10,7 @@
 #include <linux/etherdevice.h>
 #include <linux/filter.h>
 #include <linux/sched/signal.h>
+#include <net/sock.h>
 
 static __always_inline u32 bpf_test_run_one(struct bpf_prog *prog, void *ctx,
 		struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE])
@@ -115,6 +116,7 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
 	u32 retval, duration;
 	int hh_len = ETH_HLEN;
 	struct sk_buff *skb;
+	struct sock sk;
 	void *data;
 	int ret;
 
@@ -142,6 +144,8 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
 		kfree(data);
 		return -ENOMEM;
 	}
+	sock_init_data(NULL, &sk);
+	skb->sk = &sk;
 
 	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
 	__skb_put(skb, size);
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index cf4cd32b6772..5bfba7e8afd7 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -4862,6 +4862,176 @@ static struct bpf_test tests[] = {
 		.result = REJECT,
 		.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
 	},
+	{
+		"direct packet read test#1 for CGROUP_SKB",
+		.insns = {
+			BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+				    offsetof(struct __sk_buff, data)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+				    offsetof(struct __sk_buff, data_end)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+				    offsetof(struct __sk_buff, len)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+			    	    offsetof(struct __sk_buff, pkt_type)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct __sk_buff, mark)),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_6,
+				    offsetof(struct __sk_buff, mark)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1,
+				    offsetof(struct __sk_buff, queue_mapping)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_8, BPF_REG_1,
+				    offsetof(struct __sk_buff, protocol)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_9, BPF_REG_1,
+				    offsetof(struct __sk_buff, vlan_present)),
+			BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+			BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
+			BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_2, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	},
+	{
+		"direct packet read test#2 for CGROUP_SKB",
+		.insns = {
+			BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+				    offsetof(struct __sk_buff, vlan_tci)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+			    	    offsetof(struct __sk_buff, vlan_proto)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct __sk_buff, priority)),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_6,
+				    offsetof(struct __sk_buff, priority)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1,
+				    offsetof(struct __sk_buff, ingress_ifindex)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_8, BPF_REG_1,
+				    offsetof(struct __sk_buff, tc_index)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_9, BPF_REG_1,
+				    offsetof(struct __sk_buff, hash)),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	},
+	{
+		"direct packet read test#3 for CGROUP_SKB",
+		.insns = {
+			BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+				    offsetof(struct __sk_buff, cb[0])),
+			BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+			    	    offsetof(struct __sk_buff, cb[1])),
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct __sk_buff, cb[2])),
+			BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1,
+				    offsetof(struct __sk_buff, cb[3])),
+			BPF_LDX_MEM(BPF_W, BPF_REG_8, BPF_REG_1,
+				    offsetof(struct __sk_buff, cb[4])),
+			BPF_LDX_MEM(BPF_W, BPF_REG_9, BPF_REG_1,
+				    offsetof(struct __sk_buff, napi_id)),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_4,
+				    offsetof(struct __sk_buff, cb[0])),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_5,
+				    offsetof(struct __sk_buff, cb[1])),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_6,
+				    offsetof(struct __sk_buff, cb[2])),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_7,
+				    offsetof(struct __sk_buff, cb[3])),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_8,
+				    offsetof(struct __sk_buff, cb[4])),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	},
+	{
+		"direct packet read test#4 for CGROUP_SKB",
+		.insns = {
+			BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+				    offsetof(struct __sk_buff, family)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+				    offsetof(struct __sk_buff, remote_ip4)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+				    offsetof(struct __sk_buff, local_ip4)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+				    offsetof(struct __sk_buff, remote_ip6[0])),
+			BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+				    offsetof(struct __sk_buff, remote_ip6[1])),
+			BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+				    offsetof(struct __sk_buff, remote_ip6[2])),
+			BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+				    offsetof(struct __sk_buff, remote_ip6[3])),
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct __sk_buff, local_ip6[0])),
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct __sk_buff, local_ip6[1])),
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct __sk_buff, local_ip6[2])),
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct __sk_buff, local_ip6[3])),
+			BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1,
+				    offsetof(struct __sk_buff, remote_port)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_8, BPF_REG_1,
+				    offsetof(struct __sk_buff, local_port)),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	},
+	{
+		"invalid access of tc_classid for CGROUP_SKB",
+		.insns = {
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+				    offsetof(struct __sk_buff, tc_classid)),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = REJECT,
+		.errstr = "invalid bpf_context access",
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	},
+	{
+		"invalid access of data_meta for CGROUP_SKB",
+		.insns = {
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+				    offsetof(struct __sk_buff, data_meta)),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = REJECT,
+		.errstr = "invalid bpf_context access",
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	},
+	{
+		"invalid access of flow_keys for CGROUP_SKB",
+		.insns = {
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+				    offsetof(struct __sk_buff, flow_keys)),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = REJECT,
+		.errstr = "invalid bpf_context access",
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	},
+	{
+		"invalid write access to napi_id for CGROUP_SKB",
+		.insns = {
+			BPF_LDX_MEM(BPF_W, BPF_REG_9, BPF_REG_1,
+				    offsetof(struct __sk_buff, napi_id)),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_9,
+				    offsetof(struct __sk_buff, napi_id)),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = REJECT,
+		.errstr = "invalid bpf_context access",
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	},
 	{
 		"valid cgroup storage access",
 		.insns = {
-- 
2.17.1

^ permalink raw reply related

* [PATCH v3 bpf-next 0/2] bpf: add cg_skb_is_valid_access
From: Song Liu @ 2018-10-18  5:39 UTC (permalink / raw)
  To: netdev; +Cc: ast, daniel, kernel-team, Song Liu

Changes v2 -> v3:
1. Added helper function bpf_compute_and_save_data_pointers() and
   bpf_restore_data_pointers().

Changes v1 -> v2:
1. Updated the list of read-only fields, and read-write fields.
2. Added dummy sk to bpf_prog_test_run_skb().

This set enables BPF program of type BPF_PROG_TYPE_CGROUP_SKB to access
some __skb_buff data directly.

Song Liu (2):
  bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
  bpf: add tests for direct packet access from CGROUP_SKB

 include/linux/filter.h                      |  24 +++
 kernel/bpf/cgroup.c                         |   6 +
 net/bpf/test_run.c                          |   4 +
 net/core/filter.c                           |  36 ++++-
 tools/testing/selftests/bpf/test_verifier.c | 170 ++++++++++++++++++++
 5 files changed, 239 insertions(+), 1 deletion(-)

^ permalink raw reply

* [PATCH v3 bpf-next 1/2] bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
From: Song Liu @ 2018-10-18  5:39 UTC (permalink / raw)
  To: netdev; +Cc: ast, daniel, kernel-team, Song Liu
In-Reply-To: <20181018053949.4064426-1-songliubraving@fb.com>

BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
skb. This patch enables direct access of skb for these programs.

Two helper functions bpf_compute_and_save_data_pointers() and
bpf_restore_data_pointers() are introduced. There are used in
__cgroup_bpf_run_filter_skb(), to compute proper data_end for the
BPF program, and restore original data afterwards.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/filter.h | 24 ++++++++++++++++++++++++
 kernel/bpf/cgroup.c    |  6 ++++++
 net/core/filter.c      | 36 +++++++++++++++++++++++++++++++++++-
 3 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 5771874bc01e..96b3ee7f14c9 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -548,6 +548,30 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
 	cb->data_end  = skb->data + skb_headlen(skb);
 }
 
+/* Similar to bpf_compute_data_pointers(), except that save orginal
+ * data in cb->data and cb->meta_data for restore.
+ */
+static inline void bpf_compute_and_save_data_pointers(
+	struct sk_buff *skb, void *saved_pointers[2])
+{
+	struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;
+
+	saved_pointers[0] = cb->data_meta;
+	saved_pointers[1] = cb->data_end;
+	cb->data_meta = skb->data - skb_metadata_len(skb);
+	cb->data_end  = skb->data + skb_headlen(skb);
+}
+
+/* Restore data saved by bpf_compute_data_pointers(). */
+static inline void bpf_restore_data_pointers(
+	struct sk_buff *skb, void *saved_pointers[2])
+{
+	struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;
+
+	cb->data_meta = saved_pointers[0];
+	cb->data_end = saved_pointers[1];;
+}
+
 static inline u8 *bpf_skb_cb(struct sk_buff *skb)
 {
 	/* eBPF programs may read/write skb->cb[] area to transfer meta
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 00f6ed2e4f9a..5f5180104ddc 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -554,6 +554,7 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 	unsigned int offset = skb->data - skb_network_header(skb);
 	struct sock *save_sk;
 	struct cgroup *cgrp;
+	void *saved_pointers[2];
 	int ret;
 
 	if (!sk || !sk_fullsock(sk))
@@ -566,8 +567,13 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 	save_sk = skb->sk;
 	skb->sk = sk;
 	__skb_push(skb, offset);
+
+	/* compute pointers for the bpf prog */
+	bpf_compute_and_save_data_pointers(skb, saved_pointers);
+
 	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
 				 bpf_prog_run_save_cb);
+	bpf_restore_data_pointers(skb, saved_pointers);
 	__skb_pull(skb, offset);
 	skb->sk = save_sk;
 	return ret == 1 ? 0 : -EPERM;
diff --git a/net/core/filter.c b/net/core/filter.c
index 1a3ac6c46873..e3ca30bd6840 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5346,6 +5346,40 @@ static bool sk_filter_is_valid_access(int off, int size,
 	return bpf_skb_is_valid_access(off, size, type, prog, info);
 }
 
+static bool cg_skb_is_valid_access(int off, int size,
+				   enum bpf_access_type type,
+				   const struct bpf_prog *prog,
+				   struct bpf_insn_access_aux *info)
+{
+	switch (off) {
+	case bpf_ctx_range(struct __sk_buff, tc_classid):
+	case bpf_ctx_range(struct __sk_buff, data_meta):
+	case bpf_ctx_range(struct __sk_buff, flow_keys):
+		return false;
+	}
+	if (type == BPF_WRITE) {
+		switch (off) {
+		case bpf_ctx_range(struct __sk_buff, mark):
+		case bpf_ctx_range(struct __sk_buff, priority):
+		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
+			break;
+		default:
+			return false;
+		}
+	}
+
+	switch (off) {
+	case bpf_ctx_range(struct __sk_buff, data):
+		info->reg_type = PTR_TO_PACKET;
+		break;
+	case bpf_ctx_range(struct __sk_buff, data_end):
+		info->reg_type = PTR_TO_PACKET_END;
+		break;
+	}
+
+	return bpf_skb_is_valid_access(off, size, type, prog, info);
+}
+
 static bool lwt_is_valid_access(int off, int size,
 				enum bpf_access_type type,
 				const struct bpf_prog *prog,
@@ -7038,7 +7072,7 @@ const struct bpf_prog_ops xdp_prog_ops = {
 
 const struct bpf_verifier_ops cg_skb_verifier_ops = {
 	.get_func_proto		= cg_skb_func_proto,
-	.is_valid_access	= sk_filter_is_valid_access,
+	.is_valid_access	= cg_skb_is_valid_access,
 	.convert_ctx_access	= bpf_convert_ctx_access,
 };
 
-- 
2.17.1

^ permalink raw reply related

* Re: [net-next PATCH] net: sched: cls_flower: Classify packets using port ranges
From: Cong Wang @ 2018-10-18  5:41 UTC (permalink / raw)
  To: David Miller
  Cc: amritha.nambiar, Linux Kernel Network Developers, Jakub Kicinski,
	sridhar.samudrala, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <20181017.214233.695288699109520404.davem@davemloft.net>

On Wed, Oct 17, 2018 at 9:42 PM David Miller <davem@davemloft.net> wrote:
>
> From: Amritha Nambiar <amritha.nambiar@intel.com>
> Date: Fri, 12 Oct 2018 06:53:30 -0700
>
> > Added support in tc flower for filtering based on port ranges.
> > This is a rework of the RFC patch at:
> > https://patchwork.ozlabs.org/patch/969595/
>
> You never addressed Cong's feedback asking you to explain why this
> can't be simply built using existing generic filtering facilities that
> exist already.
>
> I appreciate that you addressed Jiri's feedback, but Cong's feedback is
> just as, if not more, important.
>

My objection is against introducing a new filter just for port range, now
it is built on top of flower filter, so it is much better now.

u32 filter can do the nearly same, but requires a power-of-two, so it is
not completely duplicated.

Therefore, I think the idea of building it on top of flower is fine. But I don't
read into any code, only the description.

Thanks!

^ permalink raw reply

* Re: bond: take rcu lock in netpoll_send_skb_on_dev
From: Cong Wang @ 2018-10-18  5:46 UTC (permalink / raw)
  To: eranbe
  Cc: Dave Jones, Linux Kernel Network Developers, Tariq Toukan,
	Saeed Mahameed
In-Reply-To: <b80a8984-2100-a7c7-008d-2768c09ef0f8@mellanox.com>

On Mon, Oct 15, 2018 at 4:36 AM Eran Ben Elisha <eranbe@mellanox.com> wrote:
> Hi,
>
> This suggested fix introduced a regression while using netconsole module
> with mlx5_core module loaded.

It is already reported here:
https://marc.info/?l=linux-kernel&m=153917359528669&w=2


>
> During irq handling, we hit a warning that this rcu_read_lock_bh cannot
> be taken inside an IRQ.

Yes, I mentioned the same even before this patch was sent out:
https://marc.info/?l=linux-netdev&m=153816136624679&w=2

Thanks.

^ permalink raw reply

* Re: [PATCH net] r8169: fix NAPI handling under high load
From: Heiner Kallweit @ 2018-10-18  5:58 UTC (permalink / raw)
  To: David Miller, romieu; +Cc: nic_swsd, netdev
In-Reply-To: <20181017.222149.1241280289340067644.davem@davemloft.net>

On 18.10.2018 07:21, David Miller wrote:
> From: Francois Romieu <romieu@fr.zoreil.com>
> Date: Thu, 18 Oct 2018 01:30:45 +0200
> 
>> Heiner Kallweit <hkallweit1@gmail.com> :
>> [...]
>>> This issue has been there more or less forever (at least it exists in
>>> 3.16 already), so I can't provide a "Fixes" tag. 
>>
>> Hardly forever. It fixes da78dbff2e05630921c551dbbc70a4b7981a8fff.
> 
> I don't see exactly how that can be true.
> 
> That commit didn't change the parts of the NAPI poll processing which
> are relevant here, mainly the guarding of the RX and TX work using
> the status bits which are cleared.
> 

AFAICS Francois is right and patch da78dbff2e05 ("r8169: remove work
from irq handler") introduced the guarding of RX and TX work.
I just checked back to 3.16 as oldest LTS kernel version.

> Maybe I'm missing something?  If so, indeed it would be nice to add
> a proper Fixes: tag here.
> 
Shall I submit a v2 including the Fixes line?

> Thanks!
> 

^ permalink raw reply

* Re: [PATCH net] r8169: fix NAPI handling under high load
From: Heiner Kallweit @ 2018-10-18  6:03 UTC (permalink / raw)
  To: Jonathan Woithe
  Cc: Francois Romieu, Holger Hoffstätte, David Miller,
	Realtek linux nic maintainers, netdev@vger.kernel.org
In-Reply-To: <20181018055835.GE2487@marvin.atrad.com.au>

On 18.10.2018 07:58, Jonathan Woithe wrote:
> On Thu, Oct 18, 2018 at 01:30:51AM +0200, Francois Romieu wrote:
>> Holger Hoffstätte <holger@applied-asynchrony.com> :
>> [...]
>>> I continued to use the BQL patch in my private tree after it was reverted
>>> and also had occasional timeouts, but *only* after I started playing
>>> with ethtool to change offload settings. Without offloads or the BQL patch
>>> everything has been rock-solid since then.
>>> The other weird problem was that timeouts would occur on an otherwise
>>> *completely idle* system. Since that occasionally borked my NFS server
>>> over night I ultimately removed BQL as well. Rock-solid since then.
>>
>> The bug will induce delayed rx processing when a spike of "load" is
>> followed by an idle period.
> 
> If this is the case, I wonder whether this bug might also be the cause of
> the long reception delays we've observed at times when a period of high
> network load is followed by almost nothing[1].  That thread[2] details the
> investigations subsequently done.  A git bisect showed that commit
> da78dbff2e05630921c551dbbc70a4b7981a8fff was the origin of the misbehaviour
> we were observing.
> 
> We still see the problem when we test with recent kernels.  It would be
> great if the underlying problem has now been identified.
> 
> I can possibly scrape some hardware together to test any proposed fix under
> our workload if there was interest.
> 
Proposed fix is here:
https://patchwork.ozlabs.org/patch/985014/
Would be good if you could test it. Thanks!

Heiner

> Regards
>   jonathan
> 
> [1] https://marc.info/?l=linux-netdev&m=136281333207734&w=2
> [2] https://marc.info/?t=136281339500002&r=1&w=2
> 

^ permalink raw reply

* Re: [PATCH net] r8169: fix NAPI handling under high load
From: Jonathan Woithe @ 2018-10-18  5:58 UTC (permalink / raw)
  To: Francois Romieu
  Cc: Holger Hoffstätte, Heiner Kallweit, David Miller,
	Realtek linux nic maintainers, netdev@vger.kernel.org
In-Reply-To: <20181017233051.GB8478@electric-eye.fr.zoreil.com>

On Thu, Oct 18, 2018 at 01:30:51AM +0200, Francois Romieu wrote:
> Holger Hoffstätte <holger@applied-asynchrony.com> :
> [...]
> > I continued to use the BQL patch in my private tree after it was reverted
> > and also had occasional timeouts, but *only* after I started playing
> > with ethtool to change offload settings. Without offloads or the BQL patch
> > everything has been rock-solid since then.
> > The other weird problem was that timeouts would occur on an otherwise
> > *completely idle* system. Since that occasionally borked my NFS server
> > over night I ultimately removed BQL as well. Rock-solid since then.
> 
> The bug will induce delayed rx processing when a spike of "load" is
> followed by an idle period.

If this is the case, I wonder whether this bug might also be the cause of
the long reception delays we've observed at times when a period of high
network load is followed by almost nothing[1].  That thread[2] details the
investigations subsequently done.  A git bisect showed that commit
da78dbff2e05630921c551dbbc70a4b7981a8fff was the origin of the misbehaviour
we were observing.

We still see the problem when we test with recent kernels.  It would be
great if the underlying problem has now been identified.

I can possibly scrape some hardware together to test any proposed fix under
our workload if there was interest.

Regards
  jonathan

[1] https://marc.info/?l=linux-netdev&m=136281333207734&w=2
[2] https://marc.info/?t=136281339500002&r=1&w=2

^ permalink raw reply

* Re: [PATCH net] r8169: fix NAPI handling under high load
From: Jonathan Woithe @ 2018-10-18  6:15 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: Francois Romieu, Holger Hoffstätte, David Miller,
	Realtek linux nic maintainers, netdev@vger.kernel.org
In-Reply-To: <8beda4fa-5d04-49e6-eb3e-5656897a301f@gmail.com>

On Thu, Oct 18, 2018 at 08:03:32AM +0200, Heiner Kallweit wrote:
> On 18.10.2018 07:58, Jonathan Woithe wrote:
> > On Thu, Oct 18, 2018 at 01:30:51AM +0200, Francois Romieu wrote:
> >> Holger Hoffstätte <holger@applied-asynchrony.com> :
> >> [...]
> >> The bug will induce delayed rx processing when a spike of "load" is
> >> followed by an idle period.
> > 
> > If this is the case, I wonder whether this bug might also be the cause of
> > the long reception delays we've observed at times when a period of high
> > network load is followed by almost nothing[1].  That thread[2] details the
> > investigations subsequently done.  A git bisect showed that commit
> > da78dbff2e05630921c551dbbc70a4b7981a8fff was the origin of the misbehaviour
> > we were observing.
> > 
> > We still see the problem when we test with recent kernels.  It would be
> > great if the underlying problem has now been identified.
> > 
> > I can possibly scrape some hardware together to test any proposed fix under
> > our workload if there was interest.
> > 
> Proposed fix is here:
> https://patchwork.ozlabs.org/patch/985014/
> Would be good if you could test it. Thanks!

I should be able to do so tomorrow.  Which kernel would you like me to apply
the patch to?

Regards
  jonathan

^ permalink raw reply

* Re: [PATCH bpf] bpf: fix doc of bpf_skb_adjust_room() in uapi
From: Nicolas Dichtel @ 2018-10-18  6:21 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: ast, daniel, davem, netdev, Quentin Monnet
In-Reply-To: <20181018044900.fvsz27v76ptgelfj@ast-mbp.dhcp.thefacebook.com>

Le 18/10/2018 à 06:49, Alexei Starovoitov a écrit :
> On Wed, Oct 17, 2018 at 04:24:48PM +0200, Nicolas Dichtel wrote:
>> len_diff is signed.
>>
>> Fixes: fa15601ab31e ("bpf: add documentation for eBPF helpers (33-41)")
>> CC: Quentin Monnet <quentin.monnet@netronome.com>
>> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>> ---
>>  include/uapi/linux/bpf.h       | 2 +-
>>  tools/include/uapi/linux/bpf.h | 2 +-
>>  2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 66917a4eba27..c4ffe91d5598 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1430,7 +1430,7 @@ union bpf_attr {
>>   * 	Return
>>   * 		0 on success, or a negative error in case of failure.
>>   *
>> - * int bpf_skb_adjust_room(struct sk_buff *skb, u32 len_diff, u32 mode, u64 flags)
>> + * int bpf_skb_adjust_room(struct sk_buff *skb, s32 len_diff, u32 mode, u64 flags)
> 
> Thanks. Applied to bpf-next, since we're very late into release cycle.
> 
Yep, I was also wondering if it was not too late for the bpf tree ;-)


Thank you,
Nicolas

^ permalink raw reply

* Re: [PATCH net] r8169: fix NAPI handling under high load
From: David Miller @ 2018-10-18  6:24 UTC (permalink / raw)
  To: hkallweit1; +Cc: romieu, nic_swsd, netdev
In-Reply-To: <d54112bb-2f02-bd59-5b5b-e211f070b20a@gmail.com>

From: Heiner Kallweit <hkallweit1@gmail.com>
Date: Thu, 18 Oct 2018 07:58:52 +0200

> On 18.10.2018 07:21, David Miller wrote:
>> From: Francois Romieu <romieu@fr.zoreil.com>
>> Date: Thu, 18 Oct 2018 01:30:45 +0200
>> 
>>> Heiner Kallweit <hkallweit1@gmail.com> :
>>> [...]
>>>> This issue has been there more or less forever (at least it exists in
>>>> 3.16 already), so I can't provide a "Fixes" tag. 
>>>
>>> Hardly forever. It fixes da78dbff2e05630921c551dbbc70a4b7981a8fff.
>> 
>> I don't see exactly how that can be true.
>> 
>> That commit didn't change the parts of the NAPI poll processing which
>> are relevant here, mainly the guarding of the RX and TX work using
>> the status bits which are cleared.
>> 
> 
> AFAICS Francois is right and patch da78dbff2e05 ("r8169: remove work
> from irq handler") introduced the guarding of RX and TX work.
> I just checked back to 3.16 as oldest LTS kernel version.

Aha, now I see it.

>> Maybe I'm missing something?  If so, indeed it would be nice to add
>> a proper Fixes: tag here.
>> 
> Shall I submit a v2 including the Fixes line?

Yes, please do!

^ permalink raw reply

* Re: [PATCH v3 bpf-next 2/2] bpf: add tests for direct packet access from CGROUP_SKB
From: Alexei Starovoitov @ 2018-10-18  6:25 UTC (permalink / raw)
  To: Song Liu; +Cc: netdev, ast, daniel, kernel-team
In-Reply-To: <20181018053949.4064426-3-songliubraving@fb.com>

On Wed, Oct 17, 2018 at 10:39:49PM -0700, Song Liu wrote:
> Tests are added to make sure CGROUP_SKB cannot access:
>   tc_classid, data_meta, flow_keys
> 
> and can read and write:
>   mark, prority, and cb[0-4]
> 
> and can read other fields.
> 
> To make selftest with skb->sk work, a dummy sk is added in
> bpf_prog_test_run_skb().
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  net/bpf/test_run.c                          |   4 +
>  tools/testing/selftests/bpf/test_verifier.c | 170 ++++++++++++++++++++
>  2 files changed, 174 insertions(+)
> 
> diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
> index 0c423b8cd75c..c7210e2f1ae9 100644
> --- a/net/bpf/test_run.c
> +++ b/net/bpf/test_run.c
> @@ -10,6 +10,7 @@
>  #include <linux/etherdevice.h>
>  #include <linux/filter.h>
>  #include <linux/sched/signal.h>
> +#include <net/sock.h>
>  
>  static __always_inline u32 bpf_test_run_one(struct bpf_prog *prog, void *ctx,
>  		struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE])
> @@ -115,6 +116,7 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
>  	u32 retval, duration;
>  	int hh_len = ETH_HLEN;
>  	struct sk_buff *skb;
> +	struct sock sk;
>  	void *data;
>  	int ret;
>  
> @@ -142,6 +144,8 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
>  		kfree(data);
>  		return -ENOMEM;
>  	}
> +	sock_init_data(NULL, &sk);
> +	skb->sk = &sk;

I was about to apply it, but it crashes as:
[   16.830822] BUG: unable to handle kernel paging request at 000000014427b974
[   16.831363] PGD 8000000135ecf067 P4D 8000000135ecf067 PUD 0
[   16.831792] Oops: 0000 [#1] SMP PTI
[   16.832061] CPU: 1 PID: 1965 Comm: test_verifier Not tainted 4.19.0-rc7-02550-ga76dee97ff12 #1153
[   16.832712] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
[   16.833358] RIP: 0010:cmp_map_id+0x10/0x50
[   16.835036] RSP: 0018:ffffc9000080faf8 EFLAGS: 00010246
[   16.835429] RAX: 00000000ffffffff RBX: 0000000036069ee8 RCX: 0000000000000000
[   16.835958] RDX: 000000014427b970 RSI: 000000014427b970 RDI: ffffc9000080fb44
[   16.836496] RBP: 000000000000000c R08: ffffffff810f7330 R09: 0000000036069ee8
[   16.837026] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[   16.837554] R13: ffffffff810f7330 R14: 000000014427b970 R15: 000000001b034f74
[   16.838083] FS:  00007fae50663700(0000) GS:ffff88013ba80000(0000) knlGS:0000000000000000
[   16.838677] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   16.839105] CR2: 000000014427b974 CR3: 0000000135934005 CR4: 00000000003606e0
[   16.839632] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   16.840157] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   16.840682] Call Trace:
[   16.840897]  bsearch+0x50/0x90
[   16.841144]  map_id_range_down+0x81/0xa0
[   16.841438]  make_kuid+0xf/0x10
[   16.841677]  sock_init_data+0x24f/0x260
[   16.841979]  bpf_prog_test_run_skb+0x9e/0x270

I suspect sock_net_set(sk, &init_net) is necessary before sock_init_data() call.

^ permalink raw reply

* [PATCH] net: socket: fix a missing-check bug
From: Wenwen Wang @ 2018-10-18 14:36 UTC (permalink / raw)
  To: Wenwen Wang
  Cc: Kangjie Lu, David S. Miller, open list:NETWORKING [GENERAL],
	open list

In ethtool_ioctl(), the ioctl command 'ethcmd' is checked through a switch
statement to see whether it is necessary to pre-process the ethtool
structure, because, as mentioned in the comment, the structure
ethtool_rxnfc is defined with padding. If yes, a user-space buffer 'rxnfc'
is allocated through compat_alloc_user_space(). One thing to note here is
that, if 'ethcmd' is ETHTOOL_GRXCLSRLALL, the size of the buffer 'rxnfc' is
partially determined by 'rule_cnt', which is actually acquired from the
user-space buffer 'compat_rxnfc', i.e., 'compat_rxnfc->rule_cnt', through
get_user(). After 'rxnfc' is allocated, the data in the original user-space
buffer 'compat_rxnfc' is then copied to 'rxnfc' through copy_in_user(),
including the 'rule_cnt' field. However, after this copy, no check is
re-enforced on 'rxnfc->rule_cnt'. So it is possible that a malicious user
race to change the value in the 'compat_rxnfc->rule_cnt' between these two
copies. Through this way, the attacker can bypass the previous check on
'rule_cnt' and inject malicious data. This can cause undefined behavior of
the kernel and introduce potential security risk.

This patch avoids the above issue via copying the value acquired by
get_user() to 'rxnfc->rule_cn', if 'ethcmd' is ETHTOOL_GRXCLSRLALL.

Signed-off-by: Wenwen Wang <wang6495@umn.edu>
---
 net/socket.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index 01f3f8f..390a8ec 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2875,9 +2875,14 @@ static int ethtool_ioctl(struct net *net, struct compat_ifreq __user *ifr32)
 		    copy_in_user(&rxnfc->fs.ring_cookie,
 				 &compat_rxnfc->fs.ring_cookie,
 				 (void __user *)(&rxnfc->fs.location + 1) -
-				 (void __user *)&rxnfc->fs.ring_cookie) ||
-		    copy_in_user(&rxnfc->rule_cnt, &compat_rxnfc->rule_cnt,
-				 sizeof(rxnfc->rule_cnt)))
+				 (void __user *)&rxnfc->fs.ring_cookie))
+			return -EFAULT;
+		if (ethcmd == ETHTOOL_GRXCLSRLALL) {
+			if (put_user(rule_cnt, &rxnfc->rule_cnt))
+				return -EFAULT;
+		} else if (copy_in_user(&rxnfc->rule_cnt,
+					&compat_rxnfc->rule_cnt,
+					sizeof(rxnfc->rule_cnt)))
 			return -EFAULT;
 	}
 
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v3 bpf-next 2/2] bpf: add tests for direct packet access from CGROUP_SKB
From: Song Liu @ 2018-10-18  6:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
	Kernel Team
In-Reply-To: <20181018062557.zyeoiker7jigcv6q@ast-mbp.dhcp.thefacebook.com>



> On Oct 17, 2018, at 11:25 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> On Wed, Oct 17, 2018 at 10:39:49PM -0700, Song Liu wrote:
>> Tests are added to make sure CGROUP_SKB cannot access:
>>  tc_classid, data_meta, flow_keys
>> 
>> and can read and write:
>>  mark, prority, and cb[0-4]
>> 
>> and can read other fields.
>> 
>> To make selftest with skb->sk work, a dummy sk is added in
>> bpf_prog_test_run_skb().
>> 
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> ---
>> net/bpf/test_run.c                          |   4 +
>> tools/testing/selftests/bpf/test_verifier.c | 170 ++++++++++++++++++++
>> 2 files changed, 174 insertions(+)
>> 
>> diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
>> index 0c423b8cd75c..c7210e2f1ae9 100644
>> --- a/net/bpf/test_run.c
>> +++ b/net/bpf/test_run.c
>> @@ -10,6 +10,7 @@
>> #include <linux/etherdevice.h>
>> #include <linux/filter.h>
>> #include <linux/sched/signal.h>
>> +#include <net/sock.h>
>> 
>> static __always_inline u32 bpf_test_run_one(struct bpf_prog *prog, void *ctx,
>> 		struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE])
>> @@ -115,6 +116,7 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
>> 	u32 retval, duration;
>> 	int hh_len = ETH_HLEN;
>> 	struct sk_buff *skb;
>> +	struct sock sk;
>> 	void *data;
>> 	int ret;
>> 
>> @@ -142,6 +144,8 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
>> 		kfree(data);
>> 		return -ENOMEM;
>> 	}
>> +	sock_init_data(NULL, &sk);
>> +	skb->sk = &sk;
> 
> I was about to apply it, but it crashes as:
> [   16.830822] BUG: unable to handle kernel paging request at 000000014427b974
> [   16.831363] PGD 8000000135ecf067 P4D 8000000135ecf067 PUD 0
> [   16.831792] Oops: 0000 [#1] SMP PTI
> [   16.832061] CPU: 1 PID: 1965 Comm: test_verifier Not tainted 4.19.0-rc7-02550-ga76dee97ff12 #1153
> [   16.832712] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
> [   16.833358] RIP: 0010:cmp_map_id+0x10/0x50
> [   16.835036] RSP: 0018:ffffc9000080faf8 EFLAGS: 00010246
> [   16.835429] RAX: 00000000ffffffff RBX: 0000000036069ee8 RCX: 0000000000000000
> [   16.835958] RDX: 000000014427b970 RSI: 000000014427b970 RDI: ffffc9000080fb44
> [   16.836496] RBP: 000000000000000c R08: ffffffff810f7330 R09: 0000000036069ee8
> [   16.837026] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
> [   16.837554] R13: ffffffff810f7330 R14: 000000014427b970 R15: 000000001b034f74
> [   16.838083] FS:  00007fae50663700(0000) GS:ffff88013ba80000(0000) knlGS:0000000000000000
> [   16.838677] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   16.839105] CR2: 000000014427b974 CR3: 0000000135934005 CR4: 00000000003606e0
> [   16.839632] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   16.840157] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   16.840682] Call Trace:
> [   16.840897]  bsearch+0x50/0x90
> [   16.841144]  map_id_range_down+0x81/0xa0
> [   16.841438]  make_kuid+0xf/0x10
> [   16.841677]  sock_init_data+0x24f/0x260
> [   16.841979]  bpf_prog_test_run_skb+0x9e/0x270
> 
> I suspect sock_net_set(sk, &init_net) is necessary before sock_init_data() call.

I am not able to repro this, even with CONFIG_KASAN and CONFIG_PAGE_POISONING. 

Let me try a better approach on this.

Thanks,
Song

^ permalink raw reply

* [PATCH net-next v8 00/28] WireGuard: Secure Network Tunnel
From: Jason A. Donenfeld @ 2018-10-18 14:56 UTC (permalink / raw)
  To: linux-kernel, netdev, linux-crypto, davem, gregkh; +Cc: Jason A. Donenfeld

Changes v7->v8, along with who suggested it.
--------------------------------------------
- Implementations that fail the selftests are now disabled, after a warning
  is printed. This way users don't make wrong calculations, even in the face
  of a rather grave bug.
- [Sultan Alsawaf] When assigning to a boolean, prefer "BIT(i) & a" to "(a >>
  i) & 1".
- [Andrew Lunn] Avoid control statements inside macros.
- [Jiri Pirko] Prefix functions used in callbacks with wg_.
- [Jiri Pirko] Rename struct wireguard_peer and struct wireguard_device to
  struct wg_peer and struct wg_device.
- [Eugene Syromiatnikov] Do not use nla type field as an index, and actually
  don't use an index at all, because it has no meaning or relevance at all.
- [Joe Perches] Do not place a space between for_each iterators and
  parentheses.
- Enumerable style cleanups and nits.
- [Arnd Bergmann] Swap endianness in allowedips early on in case optimizer is
  able to look a bit further in but not too far, resulting in a warning from
  -Wmaybe-uninitialized.
- [Jiri Pirko] Use textual error labels instead of numerical ones.
- [Jiri Pirko] Better module description string.
- [Eric Biggers] In poly1305 port to crypto API, account for short inputs in
  final function, in which case -ENOKEY should be returned.

-----------------------------------------------------------

This patchset is available on git.kernel.org in this branch, where it may be
pulled directly for inclusion into net-next:

  * https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/log/?h=jd/wireguard

-----------------------------------------------------------

WireGuard is a secure network tunnel written especially for Linux, which
has faced around three years of serious development, deployment, and
scrutiny. It delivers excellent performance and is extremely easy to
use and configure. It has been designed with the primary goal of being
both easy to audit by virtue of being small and highly secure from a
cryptography and systems security perspective. WireGuard is used by some
massive companies pushing enormous amounts of traffic, and likely
already today you've consumed bytes that at some point transited through
a WireGuard tunnel. Even as an out-of-tree module, WireGuard has been
integrated into various userspace tools, Linux distributions, mobile
phones, and data centers. There are ports in several languages to
several operating systems, and even commercial hardware and services
sold integrating WireGuard. It is time, therefore, for WireGuard to be
properly integrated into Linux.

Ample information, including documentation, installation instructions,
and project details, is available at:

  * https://www.wireguard.com/
  * https://www.wireguard.com/papers/wireguard.pdf

As it is currently an out-of-tree module, it lives in its own git repo
and has its own mailing list, and every commit for the module is tested
against every stable kernel since 3.10 on a variety of architectures
using an extensive test suite:

  * https://git.zx2c4.com/WireGuard
    https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/WireGuard.git/
  * https://lists.zx2c4.com/mailman/listinfo/wireguard
  * https://www.wireguard.com/build-status/

The project has been broadly discussed at conferences, and was presented
to the Netdev developers in Seoul last November, where a paper was
released detailing some interesting aspects of the project. Dave asked
me after the talk if I would consider sending in a v1 "sooner rather
than later", hence this patchset. Zinc was presented at Kernel Recipes
in September, and a video is available online. Both Zinc and WireGuard
will be presented at the conference in Vancouver in November.

  * https://www.wireguard.com/presentations/
  * https://www.wireguard.com/papers/wireguard-netdev22.pdf
  * Zinc talk: https://www.youtube.com/watch?v=bFhdln8aJ_U
  * Netdev talk: https://www.youtube.com/watch?v=54orFwtQ1XY

The cryptography in the protocol itself has been formally verified by
several independent academic teams with positive results, and I know of
two additional efforts on their way to further corroborate those
findings. The version 1 protocol is "complete", and so the purpose of
this review is to assess the implementation of the protocol. However, it
still may be of interest to know that the thing you're reviewing uses a
protocol with various nice security properties:

  * https://www.wireguard.com/formal-verification/

This patchset is divided into four segments. The first introduces a very
simple helper for working with the FPU state for the purposes of amortizing
SIMD operations. The second segment is a small collection of cryptographic
primitives, split up into several commits by primitive and by hardware. The
third shows usage of Zinc within the existing crypto API and as a replacement
to the existing crypto API. The last is WireGuard itself, presented as an
unintrusive and self-contained virtual network driver.

It is intended that this entire patch series enter the kernel through
DaveM's net-next tree. Subsequently, WireGuard patches will go through
DaveM's net-next tree, while Zinc patches will go through Greg KH's tree in
cases when an entire development cycle has no relationships with existing code
in crypto/; however, if there are any relationships with code in crypto/, then
pull requests will be sent to Herbert instead in case there are merge
conflicts.

Enjoy,
Jason

^ permalink raw reply

* [PATCH net-next v8 01/28] ARM: makefile: use ARMv3M mode for RiscPC
From: Jason A. Donenfeld @ 2018-10-18 14:56 UTC (permalink / raw)
  To: linux-kernel, netdev, linux-crypto, davem, gregkh; +Cc: Jason A. Donenfeld
In-Reply-To: <20181018145712.7538-1-Jason@zx2c4.com>

The purpose of CONFIG_CPU_32v3 is to avoid ldrh/strh on the RiscPC,
which is pretty much an ARMv4 device, except its bus will choke on the
half-words. The way to make the C compiler not output ldrh/strh is with
-march=armv3, which doesn't support them in the ISA. However, this
prevents certain cryptography code from working that uses instructions
like umull. Fortunately there's also -march=armv3m that does support
those, making it possible to continue assembling optimized cryptography
routines for our beloved RiscPC.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---

Notes:
    This commit has been submitted to the proper ARM tree and is working its
    way upstream. It's included in this series here so that kbuild 0-day bot
    doesn't get too nervous about RiscPC, but is already entering the tree
    through arm-next.

 arch/arm/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/Makefile b/arch/arm/Makefile
index d1516f85f25d..7fd4bcaf0721 100644
--- a/arch/arm/Makefile
+++ b/arch/arm/Makefile
@@ -74,7 +74,7 @@ endif
 arch-$(CONFIG_CPU_32v5)		=-D__LINUX_ARM_ARCH__=5 $(call cc-option,-march=armv5te,-march=armv4t)
 arch-$(CONFIG_CPU_32v4T)	=-D__LINUX_ARM_ARCH__=4 -march=armv4t
 arch-$(CONFIG_CPU_32v4)		=-D__LINUX_ARM_ARCH__=4 -march=armv4
-arch-$(CONFIG_CPU_32v3)		=-D__LINUX_ARM_ARCH__=3 -march=armv3
+arch-$(CONFIG_CPU_32v3)		=-D__LINUX_ARM_ARCH__=3 -march=armv3m
 
 # Evaluate arch cc-option calls now
 arch-y := $(arch-y)
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next v8 04/28] zinc: ChaCha20 generic C implementation and selftest
From: Jason A. Donenfeld @ 2018-10-18 14:56 UTC (permalink / raw)
  To: linux-kernel, netdev, linux-crypto, davem, gregkh
  Cc: Jason A. Donenfeld, Samuel Neves, Jean-Philippe Aumasson,
	Andy Lutomirski, Andrew Morton, Linus Torvalds, kernel-hardening
In-Reply-To: <20181018145712.7538-1-Jason@zx2c4.com>

This implements the ChaCha20 permutation as a single C statement, by way
of the comma operator, which the compiler is able to simplify
terrifically.

Information: https://cr.yp.to/chacha.html

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
---
 include/zinc/chacha20.h      |   70 +
 lib/zinc/Kconfig             |    4 +
 lib/zinc/Makefile            |    3 +
 lib/zinc/chacha20/chacha20.c |  179 +++
 lib/zinc/selftest/chacha20.c | 2698 ++++++++++++++++++++++++++++++++++
 lib/zinc/selftest/run.h      |   48 +
 6 files changed, 3002 insertions(+)
 create mode 100644 include/zinc/chacha20.h
 create mode 100644 lib/zinc/chacha20/chacha20.c
 create mode 100644 lib/zinc/selftest/chacha20.c
 create mode 100644 lib/zinc/selftest/run.h

diff --git a/include/zinc/chacha20.h b/include/zinc/chacha20.h
new file mode 100644
index 000000000000..0b98bd6946ae
--- /dev/null
+++ b/include/zinc/chacha20.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef _ZINC_CHACHA20_H
+#define _ZINC_CHACHA20_H
+
+#include <asm/unaligned.h>
+#include <linux/simd.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+enum {
+	CHACHA20_NONCE_SIZE = 16,
+	CHACHA20_KEY_SIZE = 32,
+	CHACHA20_KEY_WORDS = CHACHA20_KEY_SIZE / sizeof(u32),
+	CHACHA20_BLOCK_SIZE = 64,
+	CHACHA20_BLOCK_WORDS = CHACHA20_BLOCK_SIZE / sizeof(u32),
+	HCHACHA20_NONCE_SIZE = CHACHA20_NONCE_SIZE,
+	HCHACHA20_KEY_SIZE = CHACHA20_KEY_SIZE
+};
+
+enum { /* expand 32-byte k */
+	CHACHA20_CONSTANT_EXPA = 0x61707865U,
+	CHACHA20_CONSTANT_ND_3 = 0x3320646eU,
+	CHACHA20_CONSTANT_2_BY = 0x79622d32U,
+	CHACHA20_CONSTANT_TE_K = 0x6b206574U
+};
+
+struct chacha20_ctx {
+	union {
+		u32 state[16];
+		struct {
+			u32 constant[4];
+			u32 key[8];
+			u32 counter[4];
+		};
+	};
+};
+
+static inline void chacha20_init(struct chacha20_ctx *ctx,
+				 const u8 key[CHACHA20_KEY_SIZE],
+				 const u64 nonce)
+{
+	ctx->constant[0] = CHACHA20_CONSTANT_EXPA;
+	ctx->constant[1] = CHACHA20_CONSTANT_ND_3;
+	ctx->constant[2] = CHACHA20_CONSTANT_2_BY;
+	ctx->constant[3] = CHACHA20_CONSTANT_TE_K;
+	ctx->key[0] = get_unaligned_le32(key + 0);
+	ctx->key[1] = get_unaligned_le32(key + 4);
+	ctx->key[2] = get_unaligned_le32(key + 8);
+	ctx->key[3] = get_unaligned_le32(key + 12);
+	ctx->key[4] = get_unaligned_le32(key + 16);
+	ctx->key[5] = get_unaligned_le32(key + 20);
+	ctx->key[6] = get_unaligned_le32(key + 24);
+	ctx->key[7] = get_unaligned_le32(key + 28);
+	ctx->counter[0] = 0;
+	ctx->counter[1] = 0;
+	ctx->counter[2] = nonce & U32_MAX;
+	ctx->counter[3] = nonce >> 32;
+}
+void chacha20(struct chacha20_ctx *ctx, u8 *dst, const u8 *src, u32 len,
+	      simd_context_t *simd_context);
+
+void hchacha20(u32 derived_key[CHACHA20_KEY_WORDS],
+	       const u8 nonce[HCHACHA20_NONCE_SIZE],
+	       const u8 key[HCHACHA20_KEY_SIZE], simd_context_t *simd_context);
+
+#endif /* _ZINC_CHACHA20_H */
diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig
index 90e066ea93a0..d271be37cecb 100644
--- a/lib/zinc/Kconfig
+++ b/lib/zinc/Kconfig
@@ -1,3 +1,7 @@
+config ZINC_CHACHA20
+	tristate
+	select CRYPTO_ALGAPI
+
 config ZINC_SELFTEST
 	bool "Zinc cryptography library self-tests"
 	help
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index a61c80d676cb..3d80144d55a6 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -1,3 +1,6 @@
 ccflags-y := -O2
 ccflags-y += -D'pr_fmt(fmt)="zinc: " fmt'
 ccflags-$(CONFIG_ZINC_DEBUG) += -DDEBUG
+
+zinc_chacha20-y := chacha20/chacha20.o
+obj-$(CONFIG_ZINC_CHACHA20) += zinc_chacha20.o
diff --git a/lib/zinc/chacha20/chacha20.c b/lib/zinc/chacha20/chacha20.c
new file mode 100644
index 000000000000..03209c15d1ca
--- /dev/null
+++ b/lib/zinc/chacha20/chacha20.c
@@ -0,0 +1,179 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * Implementation of the ChaCha20 stream cipher.
+ *
+ * Information: https://cr.yp.to/chacha.html
+ */
+
+#include <zinc/chacha20.h>
+#include "../selftest/run.h"
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/vmalloc.h>
+#include <crypto/algapi.h> // For crypto_xor_cpy.
+
+static bool *const chacha20_nobs[] __initconst = { };
+static void __init chacha20_fpu_init(void)
+{
+}
+static inline bool chacha20_arch(struct chacha20_ctx *ctx, u8 *dst,
+				 const u8 *src, size_t len,
+				 simd_context_t *simd_context)
+{
+	return false;
+}
+static inline bool hchacha20_arch(u32 derived_key[CHACHA20_KEY_WORDS],
+				  const u8 nonce[HCHACHA20_NONCE_SIZE],
+				  const u8 key[HCHACHA20_KEY_SIZE],
+				  simd_context_t *simd_context)
+{
+	return false;
+}
+
+#define QUARTER_ROUND(x, a, b, c, d) ( \
+	x[a] += x[b], \
+	x[d] = rol32((x[d] ^ x[a]), 16), \
+	x[c] += x[d], \
+	x[b] = rol32((x[b] ^ x[c]), 12), \
+	x[a] += x[b], \
+	x[d] = rol32((x[d] ^ x[a]), 8), \
+	x[c] += x[d], \
+	x[b] = rol32((x[b] ^ x[c]), 7) \
+)
+
+#define C(i, j) (i * 4 + j)
+
+#define DOUBLE_ROUND(x) ( \
+	/* Column Round */ \
+	QUARTER_ROUND(x, C(0, 0), C(1, 0), C(2, 0), C(3, 0)), \
+	QUARTER_ROUND(x, C(0, 1), C(1, 1), C(2, 1), C(3, 1)), \
+	QUARTER_ROUND(x, C(0, 2), C(1, 2), C(2, 2), C(3, 2)), \
+	QUARTER_ROUND(x, C(0, 3), C(1, 3), C(2, 3), C(3, 3)), \
+	/* Diagonal Round */ \
+	QUARTER_ROUND(x, C(0, 0), C(1, 1), C(2, 2), C(3, 3)), \
+	QUARTER_ROUND(x, C(0, 1), C(1, 2), C(2, 3), C(3, 0)), \
+	QUARTER_ROUND(x, C(0, 2), C(1, 3), C(2, 0), C(3, 1)), \
+	QUARTER_ROUND(x, C(0, 3), C(1, 0), C(2, 1), C(3, 2)) \
+)
+
+#define TWENTY_ROUNDS(x) ( \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x) \
+)
+
+static void chacha20_block_generic(struct chacha20_ctx *ctx, __le32 *stream)
+{
+	u32 x[CHACHA20_BLOCK_WORDS];
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(x); ++i)
+		x[i] = ctx->state[i];
+
+	TWENTY_ROUNDS(x);
+
+	for (i = 0; i < ARRAY_SIZE(x); ++i)
+		stream[i] = cpu_to_le32(x[i] + ctx->state[i]);
+
+	ctx->counter[0] += 1;
+}
+
+static void chacha20_generic(struct chacha20_ctx *ctx, u8 *out, const u8 *in,
+			     u32 len)
+{
+	__le32 buf[CHACHA20_BLOCK_WORDS];
+
+	while (len >= CHACHA20_BLOCK_SIZE) {
+		chacha20_block_generic(ctx, buf);
+		crypto_xor_cpy(out, in, (u8 *)buf, CHACHA20_BLOCK_SIZE);
+		len -= CHACHA20_BLOCK_SIZE;
+		out += CHACHA20_BLOCK_SIZE;
+		in += CHACHA20_BLOCK_SIZE;
+	}
+	if (len) {
+		chacha20_block_generic(ctx, buf);
+		crypto_xor_cpy(out, in, (u8 *)buf, len);
+	}
+}
+
+void chacha20(struct chacha20_ctx *ctx, u8 *dst, const u8 *src, u32 len,
+	      simd_context_t *simd_context)
+{
+	if (!chacha20_arch(ctx, dst, src, len, simd_context))
+		chacha20_generic(ctx, dst, src, len);
+}
+EXPORT_SYMBOL(chacha20);
+
+static void hchacha20_generic(u32 derived_key[CHACHA20_KEY_WORDS],
+			      const u8 nonce[HCHACHA20_NONCE_SIZE],
+			      const u8 key[HCHACHA20_KEY_SIZE])
+{
+	u32 x[] = { CHACHA20_CONSTANT_EXPA,
+		    CHACHA20_CONSTANT_ND_3,
+		    CHACHA20_CONSTANT_2_BY,
+		    CHACHA20_CONSTANT_TE_K,
+		    get_unaligned_le32(key +  0),
+		    get_unaligned_le32(key +  4),
+		    get_unaligned_le32(key +  8),
+		    get_unaligned_le32(key + 12),
+		    get_unaligned_le32(key + 16),
+		    get_unaligned_le32(key + 20),
+		    get_unaligned_le32(key + 24),
+		    get_unaligned_le32(key + 28),
+		    get_unaligned_le32(nonce +  0),
+		    get_unaligned_le32(nonce +  4),
+		    get_unaligned_le32(nonce +  8),
+		    get_unaligned_le32(nonce + 12)
+	};
+
+	TWENTY_ROUNDS(x);
+
+	memcpy(derived_key + 0, x +  0, sizeof(u32) * 4);
+	memcpy(derived_key + 4, x + 12, sizeof(u32) * 4);
+}
+
+/* Derived key should be 32-bit aligned */
+void hchacha20(u32 derived_key[CHACHA20_KEY_WORDS],
+	       const u8 nonce[HCHACHA20_NONCE_SIZE],
+	       const u8 key[HCHACHA20_KEY_SIZE], simd_context_t *simd_context)
+{
+	if (!hchacha20_arch(derived_key, nonce, key, simd_context))
+		hchacha20_generic(derived_key, nonce, key);
+}
+EXPORT_SYMBOL(hchacha20);
+
+#include "../selftest/chacha20.c"
+
+static bool nosimd __initdata = false;
+
+static int __init mod_init(void)
+{
+	if (!nosimd)
+		chacha20_fpu_init();
+	if (!selftest_run("chacha20", chacha20_selftest, chacha20_nobs,
+			  ARRAY_SIZE(chacha20_nobs)))
+		return -ENOTRECOVERABLE;
+	return 0;
+}
+
+static void __exit mod_exit(void)
+{
+}
+
+module_param(nosimd, bool, 0);
+module_init(mod_init);
+module_exit(mod_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("ChaCha20 stream cipher");
+MODULE_AUTHOR("Jason A. Donenfeld <Jason@zx2c4.com>");
diff --git a/lib/zinc/selftest/chacha20.c b/lib/zinc/selftest/chacha20.c
new file mode 100644
index 000000000000..b8c9c709071d
--- /dev/null
+++ b/lib/zinc/selftest/chacha20.c
@@ -0,0 +1,2698 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+struct chacha20_testvec {
+	const u8 *input, *output, *key;
+	u64 nonce;
+	size_t ilen;
+};
+
+struct hchacha20_testvec {
+	u8 key[HCHACHA20_KEY_SIZE];
+	u8 nonce[HCHACHA20_NONCE_SIZE];
+	u8 output[CHACHA20_KEY_SIZE];
+};
+
+/* These test vectors are generated by reference implementations and are
+ * designed to check chacha20 implementation block handling, as well as from
+ * the draft-arciszewski-xchacha-01 document.
+ */
+
+static const u8 input01[] __initconst = { };
+static const u8 output01[] __initconst = { };
+static const u8 key01[] __initconst = {
+	0x09, 0xf4, 0xe8, 0x57, 0x10, 0xf2, 0x12, 0xc3,
+	0xc6, 0x91, 0xc4, 0x09, 0x97, 0x46, 0xef, 0xfe,
+	0x02, 0x00, 0xe4, 0x5c, 0x82, 0xed, 0x16, 0xf3,
+	0x32, 0xbe, 0xec, 0x7a, 0xe6, 0x68, 0x12, 0x26
+};
+enum { nonce01 = 0x3834e2afca3c66d3ULL };
+
+static const u8 input02[] __initconst = {
+	0x9d
+};
+static const u8 output02[] __initconst = {
+	0x94
+};
+static const u8 key02[] __initconst = {
+	0x8c, 0x01, 0xac, 0xaf, 0x62, 0x63, 0x56, 0x7a,
+	0xad, 0x23, 0x4c, 0x58, 0x29, 0x29, 0xbe, 0xab,
+	0xe9, 0xf8, 0xdf, 0x6c, 0x8c, 0x74, 0x4d, 0x7d,
+	0x13, 0x94, 0x10, 0x02, 0x3d, 0x8e, 0x9f, 0x94
+};
+enum { nonce02 = 0x5d1b3bfdedd9f73aULL };
+
+static const u8 input03[] __initconst = {
+	0x04, 0x16
+};
+static const u8 output03[] __initconst = {
+	0x92, 0x07
+};
+static const u8 key03[] __initconst = {
+	0x22, 0x0c, 0x79, 0x2c, 0x38, 0x51, 0xbe, 0x99,
+	0xa9, 0x59, 0x24, 0x50, 0xef, 0x87, 0x38, 0xa6,
+	0xa0, 0x97, 0x20, 0xcb, 0xb4, 0x0c, 0x94, 0x67,
+	0x1f, 0x98, 0xdc, 0xc4, 0x83, 0xbc, 0x35, 0x4d
+};
+enum { nonce03 = 0x7a3353ad720a3e2eULL };
+
+static const u8 input04[] __initconst = {
+	0xc7, 0xcc, 0xd0
+};
+static const u8 output04[] __initconst = {
+	0xd8, 0x41, 0x80
+};
+static const u8 key04[] __initconst = {
+	0x81, 0x5e, 0x12, 0x01, 0xc4, 0x36, 0x15, 0x03,
+	0x11, 0xa0, 0xe9, 0x86, 0xbb, 0x5a, 0xdc, 0x45,
+	0x7d, 0x5e, 0x98, 0xf8, 0x06, 0x76, 0x1c, 0xec,
+	0xc0, 0xf7, 0xca, 0x4e, 0x99, 0xd9, 0x42, 0x38
+};
+enum { nonce04 = 0x6816e2fc66176da2ULL };
+
+static const u8 input05[] __initconst = {
+	0x48, 0xf1, 0x31, 0x5f
+};
+static const u8 output05[] __initconst = {
+	0x48, 0xf7, 0x13, 0x67
+};
+static const u8 key05[] __initconst = {
+	0x3f, 0xd6, 0xb6, 0x5e, 0x2f, 0xda, 0x82, 0x39,
+	0x97, 0x06, 0xd3, 0x62, 0x4f, 0xbd, 0xcb, 0x9b,
+	0x1d, 0xe6, 0x4a, 0x76, 0xab, 0xdd, 0x14, 0x50,
+	0x59, 0x21, 0xe3, 0xb2, 0xc7, 0x95, 0xbc, 0x45
+};
+enum { nonce05 = 0xc41a7490e228cc42ULL };
+
+static const u8 input06[] __initconst = {
+	0xae, 0xa2, 0x85, 0x1d, 0xc8
+};
+static const u8 output06[] __initconst = {
+	0xfa, 0xff, 0x45, 0x6b, 0x6f
+};
+static const u8 key06[] __initconst = {
+	0x04, 0x8d, 0xea, 0x67, 0x20, 0x78, 0xfb, 0x8f,
+	0x49, 0x80, 0x35, 0xb5, 0x7b, 0xe4, 0x31, 0x74,
+	0x57, 0x43, 0x3a, 0x64, 0x64, 0xb9, 0xe6, 0x23,
+	0x4d, 0xfe, 0xb8, 0x7b, 0x71, 0x4d, 0x9d, 0x21
+};
+enum { nonce06 = 0x251366db50b10903ULL };
+
+static const u8 input07[] __initconst = {
+	0x1a, 0x32, 0x85, 0xb6, 0xe8, 0x52
+};
+static const u8 output07[] __initconst = {
+	0xd3, 0x5f, 0xf0, 0x07, 0x69, 0xec
+};
+static const u8 key07[] __initconst = {
+	0xbf, 0x2d, 0x42, 0x99, 0x97, 0x76, 0x04, 0xad,
+	0xd3, 0x8f, 0x6e, 0x6a, 0x34, 0x85, 0xaf, 0x81,
+	0xef, 0x36, 0x33, 0xd5, 0x43, 0xa2, 0xaa, 0x08,
+	0x0f, 0x77, 0x42, 0x83, 0x58, 0xc5, 0x42, 0x2a
+};
+enum { nonce07 = 0xe0796da17dba9b58ULL };
+
+static const u8 input08[] __initconst = {
+	0x40, 0xae, 0xcd, 0xe4, 0x3d, 0x22, 0xe0
+};
+static const u8 output08[] __initconst = {
+	0xfd, 0x8a, 0x9f, 0x3d, 0x05, 0xc9, 0xd3
+};
+static const u8 key08[] __initconst = {
+	0xdc, 0x3f, 0x41, 0xe3, 0x23, 0x2a, 0x8d, 0xf6,
+	0x41, 0x2a, 0xa7, 0x66, 0x05, 0x68, 0xe4, 0x7b,
+	0xc4, 0x58, 0xd6, 0xcc, 0xdf, 0x0d, 0xc6, 0x25,
+	0x1b, 0x61, 0x32, 0x12, 0x4e, 0xf1, 0xe6, 0x29
+};
+enum { nonce08 = 0xb1d2536d9e159832ULL };
+
+static const u8 input09[] __initconst = {
+	0xba, 0x1d, 0x14, 0x16, 0x9f, 0x83, 0x67, 0x24
+};
+static const u8 output09[] __initconst = {
+	0x7c, 0xe3, 0x78, 0x1d, 0xa2, 0xe7, 0xe9, 0x39
+};
+static const u8 key09[] __initconst = {
+	0x17, 0x55, 0x90, 0x52, 0xa4, 0xce, 0x12, 0xae,
+	0xd4, 0xfd, 0xd4, 0xfb, 0xd5, 0x18, 0x59, 0x50,
+	0x4e, 0x51, 0x99, 0x32, 0x09, 0x31, 0xfc, 0xf7,
+	0x27, 0x10, 0x8e, 0xa2, 0x4b, 0xa5, 0xf5, 0x62
+};
+enum { nonce09 = 0x495fc269536d003ULL };
+
+static const u8 input10[] __initconst = {
+	0x09, 0xfd, 0x3c, 0x0b, 0x3d, 0x0e, 0xf3, 0x9d,
+	0x27
+};
+static const u8 output10[] __initconst = {
+	0xdc, 0xe4, 0x33, 0x60, 0x0c, 0x07, 0xcb, 0x51,
+	0x6b
+};
+static const u8 key10[] __initconst = {
+	0x4e, 0x00, 0x72, 0x37, 0x0f, 0x52, 0x4d, 0x6f,
+	0x37, 0x50, 0x3c, 0xb3, 0x51, 0x81, 0x49, 0x16,
+	0x7e, 0xfd, 0xb1, 0x51, 0x72, 0x2e, 0xe4, 0x16,
+	0x68, 0x5c, 0x5b, 0x8a, 0xc3, 0x90, 0x70, 0x04
+};
+enum { nonce10 = 0x1ad9d1114d88cbbdULL };
+
+static const u8 input11[] __initconst = {
+	0x70, 0x18, 0x52, 0x85, 0xba, 0x66, 0xff, 0x2c,
+	0x9a, 0x46
+};
+static const u8 output11[] __initconst = {
+	0xf5, 0x2a, 0x7a, 0xfd, 0x31, 0x7c, 0x91, 0x41,
+	0xb1, 0xcf
+};
+static const u8 key11[] __initconst = {
+	0x48, 0xb4, 0xd0, 0x7c, 0x88, 0xd1, 0x96, 0x0d,
+	0x80, 0x33, 0xb4, 0xd5, 0x31, 0x9a, 0x88, 0xca,
+	0x14, 0xdc, 0xf0, 0xa8, 0xf3, 0xac, 0xb8, 0x47,
+	0x75, 0x86, 0x7c, 0x88, 0x50, 0x11, 0x43, 0x40
+};
+enum { nonce11 = 0x47c35dd1f4f8aa4fULL };
+
+static const u8 input12[] __initconst = {
+	0x9e, 0x8e, 0x3d, 0x2a, 0x05, 0xfd, 0xe4, 0x90,
+	0x24, 0x1c, 0xd3
+};
+static const u8 output12[] __initconst = {
+	0x97, 0x72, 0x40, 0x9f, 0xc0, 0x6b, 0x05, 0x33,
+	0x42, 0x7e, 0x28
+};
+static const u8 key12[] __initconst = {
+	0xee, 0xff, 0x33, 0x33, 0xe0, 0x28, 0xdf, 0xa2,
+	0xb6, 0x5e, 0x25, 0x09, 0x52, 0xde, 0xa5, 0x9c,
+	0x8f, 0x95, 0xa9, 0x03, 0x77, 0x0f, 0xbe, 0xa1,
+	0xd0, 0x7d, 0x73, 0x2f, 0xf8, 0x7e, 0x51, 0x44
+};
+enum { nonce12 = 0xc22d044dc6ea4af3ULL };
+
+static const u8 input13[] __initconst = {
+	0x9c, 0x16, 0xa2, 0x22, 0x4d, 0xbe, 0x04, 0x9a,
+	0xb3, 0xb5, 0xc6, 0x58
+};
+static const u8 output13[] __initconst = {
+	0xf0, 0x81, 0xdb, 0x6d, 0xa3, 0xe9, 0xb2, 0xc6,
+	0x32, 0x50, 0x16, 0x9f
+};
+static const u8 key13[] __initconst = {
+	0x96, 0xb3, 0x01, 0xd2, 0x7a, 0x8c, 0x94, 0x09,
+	0x4f, 0x58, 0xbe, 0x80, 0xcc, 0xa9, 0x7e, 0x2d,
+	0xad, 0x58, 0x3b, 0x63, 0xb8, 0x5c, 0x17, 0xce,
+	0xbf, 0x43, 0x33, 0x7a, 0x7b, 0x82, 0x28, 0x2f
+};
+enum { nonce13 = 0x2a5d05d88cd7b0daULL };
+
+static const u8 input14[] __initconst = {
+	0x57, 0x4f, 0xaa, 0x30, 0xe6, 0x23, 0x50, 0x86,
+	0x91, 0xa5, 0x60, 0x96, 0x2b
+};
+static const u8 output14[] __initconst = {
+	0x6c, 0x1f, 0x3b, 0x42, 0xb6, 0x2f, 0xf0, 0xbd,
+	0x76, 0x60, 0xc7, 0x7e, 0x8d
+};
+static const u8 key14[] __initconst = {
+	0x22, 0x85, 0xaf, 0x8f, 0xa3, 0x53, 0xa0, 0xc4,
+	0xb5, 0x75, 0xc0, 0xba, 0x30, 0x92, 0xc3, 0x32,
+	0x20, 0x5a, 0x8f, 0x7e, 0x93, 0xda, 0x65, 0x18,
+	0xd1, 0xf6, 0x9a, 0x9b, 0x8f, 0x85, 0x30, 0xe6
+};
+enum { nonce14 = 0xf9946c166aa4475fULL };
+
+static const u8 input15[] __initconst = {
+	0x89, 0x81, 0xc7, 0xe2, 0x00, 0xac, 0x52, 0x70,
+	0xa4, 0x79, 0xab, 0xeb, 0x74, 0xf7
+};
+static const u8 output15[] __initconst = {
+	0xb4, 0xd0, 0xa9, 0x9d, 0x15, 0x5f, 0x48, 0xd6,
+	0x00, 0x7e, 0x4c, 0x77, 0x5a, 0x46
+};
+static const u8 key15[] __initconst = {
+	0x0a, 0x66, 0x36, 0xca, 0x5d, 0x82, 0x23, 0xb6,
+	0xe4, 0x9b, 0xad, 0x5e, 0xd0, 0x7f, 0xf6, 0x7a,
+	0x7b, 0x03, 0xa7, 0x4c, 0xfd, 0xec, 0xd5, 0xa1,
+	0xfc, 0x25, 0x54, 0xda, 0x5a, 0x5c, 0xf0, 0x2c
+};
+enum { nonce15 = 0x9ab2b87a35e772c8ULL };
+
+static const u8 input16[] __initconst = {
+	0x5f, 0x09, 0xc0, 0x8b, 0x1e, 0xde, 0xca, 0xd9,
+	0xb7, 0x5c, 0x23, 0xc9, 0x55, 0x1e, 0xcf
+};
+static const u8 output16[] __initconst = {
+	0x76, 0x9b, 0x53, 0xf3, 0x66, 0x88, 0x28, 0x60,
+	0x98, 0x80, 0x2c, 0xa8, 0x80, 0xa6, 0x48
+};
+static const u8 key16[] __initconst = {
+	0x80, 0xb5, 0x51, 0xdf, 0x17, 0x5b, 0xb0, 0xef,
+	0x8b, 0x5b, 0x2e, 0x3e, 0xc5, 0xe3, 0xa5, 0x86,
+	0xac, 0x0d, 0x8e, 0x32, 0x90, 0x9d, 0x82, 0x27,
+	0xf1, 0x23, 0x26, 0xc3, 0xea, 0x55, 0xb6, 0x63
+};
+enum { nonce16 = 0xa82e9d39e4d02ef5ULL };
+
+static const u8 input17[] __initconst = {
+	0x87, 0x0b, 0x36, 0x71, 0x7c, 0xb9, 0x0b, 0x80,
+	0x4d, 0x77, 0x5c, 0x4f, 0xf5, 0x51, 0x0e, 0x1a
+};
+static const u8 output17[] __initconst = {
+	0xf1, 0x12, 0x4a, 0x8a, 0xd9, 0xd0, 0x08, 0x67,
+	0x66, 0xd7, 0x34, 0xea, 0x32, 0x3b, 0x54, 0x0e
+};
+static const u8 key17[] __initconst = {
+	0xfb, 0x71, 0x5f, 0x3f, 0x7a, 0xc0, 0x9a, 0xc8,
+	0xc8, 0xcf, 0xe8, 0xbc, 0xfb, 0x09, 0xbf, 0x89,
+	0x6a, 0xef, 0xd5, 0xe5, 0x36, 0x87, 0x14, 0x76,
+	0x00, 0xb9, 0x32, 0x28, 0xb2, 0x00, 0x42, 0x53
+};
+enum { nonce17 = 0x229b87e73d557b96ULL };
+
+static const u8 input18[] __initconst = {
+	0x38, 0x42, 0xb5, 0x37, 0xb4, 0x3d, 0xfe, 0x59,
+	0x38, 0x68, 0x88, 0xfa, 0x89, 0x8a, 0x5f, 0x90,
+	0x3c
+};
+static const u8 output18[] __initconst = {
+	0xac, 0xad, 0x14, 0xe8, 0x7e, 0xd7, 0xce, 0x96,
+	0x3d, 0xb3, 0x78, 0x85, 0x22, 0x5a, 0xcb, 0x39,
+	0xd4
+};
+static const u8 key18[] __initconst = {
+	0xe1, 0xc1, 0xa8, 0xe0, 0x91, 0xe7, 0x38, 0x66,
+	0x80, 0x17, 0x12, 0x3c, 0x5e, 0x2d, 0xbb, 0xea,
+	0xeb, 0x6c, 0x8b, 0xc8, 0x1b, 0x6f, 0x7c, 0xea,
+	0x50, 0x57, 0x23, 0x1e, 0x65, 0x6f, 0x6d, 0x81
+};
+enum { nonce18 = 0xfaf5fcf8f30e57a9ULL };
+
+static const u8 input19[] __initconst = {
+	0x1c, 0x4a, 0x30, 0x26, 0xef, 0x9a, 0x32, 0xa7,
+	0x8f, 0xe5, 0xc0, 0x0f, 0x30, 0x3a, 0xbf, 0x38,
+	0x54, 0xba
+};
+static const u8 output19[] __initconst = {
+	0x57, 0x67, 0x54, 0x4f, 0x31, 0xd6, 0xef, 0x35,
+	0x0b, 0xd9, 0x52, 0xa7, 0x46, 0x7d, 0x12, 0x17,
+	0x1e, 0xe3
+};
+static const u8 key19[] __initconst = {
+	0x5a, 0x79, 0xc1, 0xea, 0x33, 0xb3, 0xc7, 0x21,
+	0xec, 0xf8, 0xcb, 0xd2, 0x58, 0x96, 0x23, 0xd6,
+	0x4d, 0xed, 0x2f, 0xdf, 0x8a, 0x79, 0xe6, 0x8b,
+	0x38, 0xa3, 0xc3, 0x7a, 0x33, 0xda, 0x02, 0xc7
+};
+enum { nonce19 = 0x2b23b61840429604ULL };
+
+static const u8 input20[] __initconst = {
+	0xab, 0xe9, 0x32, 0xbb, 0x35, 0x17, 0xe0, 0x60,
+	0x80, 0xb1, 0x27, 0xdc, 0xe6, 0x62, 0x9e, 0x0c,
+	0x77, 0xf4, 0x50
+};
+static const u8 output20[] __initconst = {
+	0x54, 0x6d, 0xaa, 0xfc, 0x08, 0xfb, 0x71, 0xa8,
+	0xd6, 0x1d, 0x7d, 0xf3, 0x45, 0x10, 0xb5, 0x4c,
+	0xcc, 0x4b, 0x45
+};
+static const u8 key20[] __initconst = {
+	0xa3, 0xfd, 0x3d, 0xa9, 0xeb, 0xea, 0x2c, 0x69,
+	0xcf, 0x59, 0x38, 0x13, 0x5b, 0xa7, 0x53, 0x8f,
+	0x5e, 0xa2, 0x33, 0x86, 0x4c, 0x75, 0x26, 0xaf,
+	0x35, 0x12, 0x09, 0x71, 0x81, 0xea, 0x88, 0x66
+};
+enum { nonce20 = 0x7459667a8fadff58ULL };
+
+static const u8 input21[] __initconst = {
+	0xa6, 0x82, 0x21, 0x23, 0xad, 0x27, 0x3f, 0xc6,
+	0xd7, 0x16, 0x0d, 0x6d, 0x24, 0x15, 0x54, 0xc5,
+	0x96, 0x72, 0x59, 0x8a
+};
+static const u8 output21[] __initconst = {
+	0x5f, 0x34, 0x32, 0xea, 0x06, 0xd4, 0x9e, 0x01,
+	0xdc, 0x32, 0x32, 0x40, 0x66, 0x73, 0x6d, 0x4a,
+	0x6b, 0x12, 0x20, 0xe8
+};
+static const u8 key21[] __initconst = {
+	0x96, 0xfd, 0x13, 0x23, 0xa9, 0x89, 0x04, 0xe6,
+	0x31, 0xa5, 0x2c, 0xc1, 0x40, 0xd5, 0x69, 0x5c,
+	0x32, 0x79, 0x56, 0xe0, 0x29, 0x93, 0x8f, 0xe8,
+	0x5f, 0x65, 0x53, 0x7f, 0xc1, 0xe9, 0xaf, 0xaf
+};
+enum { nonce21 = 0xba8defee9d8e13b5ULL };
+
+static const u8 input22[] __initconst = {
+	0xb8, 0x32, 0x1a, 0x81, 0xd8, 0x38, 0x89, 0x5a,
+	0xb0, 0x05, 0xbe, 0xf4, 0xd2, 0x08, 0xc6, 0xee,
+	0x79, 0x7b, 0x3a, 0x76, 0x59
+};
+static const u8 output22[] __initconst = {
+	0xb7, 0xba, 0xae, 0x80, 0xe4, 0x9f, 0x79, 0x84,
+	0x5a, 0x48, 0x50, 0x6d, 0xcb, 0xd0, 0x06, 0x0c,
+	0x15, 0x63, 0xa7, 0x5e, 0xbd
+};
+static const u8 key22[] __initconst = {
+	0x0f, 0x35, 0x3d, 0xeb, 0x5f, 0x0a, 0x82, 0x0d,
+	0x24, 0x59, 0x71, 0xd8, 0xe6, 0x2d, 0x5f, 0xe1,
+	0x7e, 0x0c, 0xae, 0xf6, 0xdc, 0x2c, 0xc5, 0x4a,
+	0x38, 0x88, 0xf2, 0xde, 0xd9, 0x5f, 0x76, 0x7c
+};
+enum { nonce22 = 0xe77f1760e9f5e192ULL };
+
+static const u8 input23[] __initconst = {
+	0x4b, 0x1e, 0x79, 0x99, 0xcf, 0xef, 0x64, 0x4b,
+	0xb0, 0x66, 0xae, 0x99, 0x2e, 0x68, 0x97, 0xf5,
+	0x5d, 0x9b, 0x3f, 0x7a, 0xa9, 0xd9
+};
+static const u8 output23[] __initconst = {
+	0x5f, 0xa4, 0x08, 0x39, 0xca, 0xfa, 0x2b, 0x83,
+	0x5d, 0x95, 0x70, 0x7c, 0x2e, 0xd4, 0xae, 0xfa,
+	0x45, 0x4a, 0x77, 0x7f, 0xa7, 0x65
+};
+static const u8 key23[] __initconst = {
+	0x4a, 0x06, 0x83, 0x64, 0xaa, 0xe3, 0x38, 0x32,
+	0x28, 0x5d, 0xa4, 0xb2, 0x5a, 0xee, 0xcf, 0x8e,
+	0x19, 0x67, 0xf1, 0x09, 0xe8, 0xc9, 0xf6, 0x40,
+	0x02, 0x6d, 0x0b, 0xde, 0xfa, 0x81, 0x03, 0xb1
+};
+enum { nonce23 = 0x9b3f349158709849ULL };
+
+static const u8 input24[] __initconst = {
+	0xc6, 0xfc, 0x47, 0x5e, 0xd8, 0xed, 0xa9, 0xe5,
+	0x4f, 0x82, 0x79, 0x35, 0xee, 0x3e, 0x7e, 0x3e,
+	0x35, 0x70, 0x6e, 0xfa, 0x6d, 0x08, 0xe8
+};
+static const u8 output24[] __initconst = {
+	0x3b, 0xc5, 0xf8, 0xc2, 0xbf, 0x2b, 0x90, 0x33,
+	0xa6, 0xae, 0xf5, 0x5a, 0x65, 0xb3, 0x3d, 0xe1,
+	0xcd, 0x5f, 0x55, 0xfa, 0xe7, 0xa5, 0x4a
+};
+static const u8 key24[] __initconst = {
+	0x00, 0x24, 0xc3, 0x65, 0x5f, 0xe6, 0x31, 0xbb,
+	0x6d, 0xfc, 0x20, 0x7b, 0x1b, 0xa8, 0x96, 0x26,
+	0x55, 0x21, 0x62, 0x25, 0x7e, 0xba, 0x23, 0x97,
+	0xc9, 0xb8, 0x53, 0xa8, 0xef, 0xab, 0xad, 0x61
+};
+enum { nonce24 = 0x13ee0b8f526177c3ULL };
+
+static const u8 input25[] __initconst = {
+	0x33, 0x07, 0x16, 0xb1, 0x34, 0x33, 0x67, 0x04,
+	0x9b, 0x0a, 0xce, 0x1b, 0xe9, 0xde, 0x1a, 0xec,
+	0xd0, 0x55, 0xfb, 0xc6, 0x33, 0xaf, 0x2d, 0xe3
+};
+static const u8 output25[] __initconst = {
+	0x05, 0x93, 0x10, 0xd1, 0x58, 0x6f, 0x68, 0x62,
+	0x45, 0xdb, 0x91, 0xae, 0x70, 0xcf, 0xd4, 0x5f,
+	0xee, 0xdf, 0xd5, 0xba, 0x9e, 0xde, 0x68, 0xe6
+};
+static const u8 key25[] __initconst = {
+	0x83, 0xa9, 0x4f, 0x5d, 0x74, 0xd5, 0x91, 0xb3,
+	0xc9, 0x97, 0x19, 0x15, 0xdb, 0x0d, 0x0b, 0x4a,
+	0x3d, 0x55, 0xcf, 0xab, 0xb2, 0x05, 0x21, 0x35,
+	0x45, 0x50, 0xeb, 0xf8, 0xf5, 0xbf, 0x36, 0x35
+};
+enum { nonce25 = 0x7c6f459e49ebfebcULL };
+
+static const u8 input26[] __initconst = {
+	0xc2, 0xd4, 0x7a, 0xa3, 0x92, 0xe1, 0xac, 0x46,
+	0x1a, 0x15, 0x38, 0xc9, 0xb5, 0xfd, 0xdf, 0x84,
+	0x38, 0xbc, 0x6b, 0x1d, 0xb0, 0x83, 0x43, 0x04,
+	0x39
+};
+static const u8 output26[] __initconst = {
+	0x7f, 0xde, 0xd6, 0x87, 0xcc, 0x34, 0xf4, 0x12,
+	0xae, 0x55, 0xa5, 0x89, 0x95, 0x29, 0xfc, 0x18,
+	0xd8, 0xc7, 0x7c, 0xd3, 0xcb, 0x85, 0x95, 0x21,
+	0xd2
+};
+static const u8 key26[] __initconst = {
+	0xe4, 0xd0, 0x54, 0x1d, 0x7d, 0x47, 0xa8, 0xc1,
+	0x08, 0xca, 0xe2, 0x42, 0x52, 0x95, 0x16, 0x43,
+	0xa3, 0x01, 0x23, 0x03, 0xcc, 0x3b, 0x81, 0x78,
+	0x23, 0xcc, 0xa7, 0x36, 0xd7, 0xa0, 0x97, 0x8d
+};
+enum { nonce26 = 0x524401012231683ULL };
+
+static const u8 input27[] __initconst = {
+	0x0d, 0xb0, 0xcf, 0xec, 0xfc, 0x38, 0x9d, 0x9d,
+	0x89, 0x00, 0x96, 0xf2, 0x79, 0x8a, 0xa1, 0x8d,
+	0x32, 0x5e, 0xc6, 0x12, 0x22, 0xec, 0xf6, 0x52,
+	0xc1, 0x0b
+};
+static const u8 output27[] __initconst = {
+	0xef, 0xe1, 0xf2, 0x67, 0x8e, 0x2c, 0x00, 0x9f,
+	0x1d, 0x4c, 0x66, 0x1f, 0x94, 0x58, 0xdc, 0xbb,
+	0xb9, 0x11, 0x8f, 0x74, 0xfd, 0x0e, 0x14, 0x01,
+	0xa8, 0x21
+};
+static const u8 key27[] __initconst = {
+	0x78, 0x71, 0xa4, 0xe6, 0xb2, 0x95, 0x44, 0x12,
+	0x81, 0xaa, 0x7e, 0x94, 0xa7, 0x8d, 0x44, 0xea,
+	0xc4, 0xbc, 0x01, 0xb7, 0x9e, 0xf7, 0x82, 0x9e,
+	0x3b, 0x23, 0x9f, 0x31, 0xdd, 0xb8, 0x0d, 0x18
+};
+enum { nonce27 = 0xd58fe0e58fb254d6ULL };
+
+static const u8 input28[] __initconst = {
+	0xaa, 0xb7, 0xaa, 0xd9, 0xa8, 0x91, 0xd7, 0x8a,
+	0x97, 0x9b, 0xdb, 0x7c, 0x47, 0x2b, 0xdb, 0xd2,
+	0xda, 0x77, 0xb1, 0xfa, 0x2d, 0x12, 0xe3, 0xe9,
+	0xc4, 0x7f, 0x54
+};
+static const u8 output28[] __initconst = {
+	0x87, 0x84, 0xa9, 0xa6, 0xad, 0x8f, 0xe6, 0x0f,
+	0x69, 0xf8, 0x21, 0xc3, 0x54, 0x95, 0x0f, 0xb0,
+	0x4e, 0xc7, 0x02, 0xe4, 0x04, 0xb0, 0x6c, 0x42,
+	0x8c, 0x63, 0xe3
+};
+static const u8 key28[] __initconst = {
+	0x12, 0x23, 0x37, 0x95, 0x04, 0xb4, 0x21, 0xe8,
+	0xbc, 0x65, 0x46, 0x7a, 0xf4, 0x01, 0x05, 0x3f,
+	0xb1, 0x34, 0x73, 0xd2, 0x49, 0xbf, 0x6f, 0x20,
+	0xbd, 0x23, 0x58, 0x5f, 0xd1, 0x73, 0x57, 0xa6
+};
+enum { nonce28 = 0x3a04d51491eb4e07ULL };
+
+static const u8 input29[] __initconst = {
+	0x55, 0xd0, 0xd4, 0x4b, 0x17, 0xc8, 0xc4, 0x2b,
+	0xc0, 0x28, 0xbd, 0x9d, 0x65, 0x4d, 0xaf, 0x77,
+	0x72, 0x7c, 0x36, 0x68, 0xa7, 0xb6, 0x87, 0x4d,
+	0xb9, 0x27, 0x25, 0x6c
+};
+static const u8 output29[] __initconst = {
+	0x0e, 0xac, 0x4c, 0xf5, 0x12, 0xb5, 0x56, 0xa5,
+	0x00, 0x9a, 0xd6, 0xe5, 0x1a, 0x59, 0x2c, 0xf6,
+	0x42, 0x22, 0xcf, 0x23, 0x98, 0x34, 0x29, 0xac,
+	0x6e, 0xe3, 0x37, 0x6d
+};
+static const u8 key29[] __initconst = {
+	0xda, 0x9d, 0x05, 0x0c, 0x0c, 0xba, 0x75, 0xb9,
+	0x9e, 0xb1, 0x8d, 0xd9, 0x73, 0x26, 0x2c, 0xa9,
+	0x3a, 0xb5, 0xcb, 0x19, 0x49, 0xa7, 0x4f, 0xf7,
+	0x64, 0x35, 0x23, 0x20, 0x2a, 0x45, 0x78, 0xc7
+};
+enum { nonce29 = 0xc25ac9982431cbfULL };
+
+static const u8 input30[] __initconst = {
+	0x4e, 0xd6, 0x85, 0xbb, 0xe7, 0x99, 0xfa, 0x04,
+	0x33, 0x24, 0xfd, 0x75, 0x18, 0xe3, 0xd3, 0x25,
+	0xcd, 0xca, 0xae, 0x00, 0xbe, 0x52, 0x56, 0x4a,
+	0x31, 0xe9, 0x4f, 0xae, 0x8a
+};
+static const u8 output30[] __initconst = {
+	0x30, 0x36, 0x32, 0xa2, 0x3c, 0xb6, 0xf9, 0xf9,
+	0x76, 0x70, 0xad, 0xa6, 0x10, 0x41, 0x00, 0x4a,
+	0xfa, 0xce, 0x1b, 0x86, 0x05, 0xdb, 0x77, 0x96,
+	0xb3, 0xb7, 0x8f, 0x61, 0x24
+};
+static const u8 key30[] __initconst = {
+	0x49, 0x35, 0x4c, 0x15, 0x98, 0xfb, 0xc6, 0x57,
+	0x62, 0x6d, 0x06, 0xc3, 0xd4, 0x79, 0x20, 0x96,
+	0x05, 0x2a, 0x31, 0x63, 0xc0, 0x44, 0x42, 0x09,
+	0x13, 0x13, 0xff, 0x1b, 0xc8, 0x63, 0x1f, 0x0b
+};
+enum { nonce30 = 0x4967f9c08e41568bULL };
+
+static const u8 input31[] __initconst = {
+	0x91, 0x04, 0x20, 0x47, 0x59, 0xee, 0xa6, 0x0f,
+	0x04, 0x75, 0xc8, 0x18, 0x95, 0x44, 0x01, 0x28,
+	0x20, 0x6f, 0x73, 0x68, 0x66, 0xb5, 0x03, 0xb3,
+	0x58, 0x27, 0x6e, 0x7a, 0x76, 0xb8
+};
+static const u8 output31[] __initconst = {
+	0xe8, 0x03, 0x78, 0x9d, 0x13, 0x15, 0x98, 0xef,
+	0x64, 0x68, 0x12, 0x41, 0xb0, 0x29, 0x94, 0x0c,
+	0x83, 0x35, 0x46, 0xa9, 0x74, 0xe1, 0x75, 0xf0,
+	0xb6, 0x96, 0xc3, 0x6f, 0xd7, 0x70
+};
+static const u8 key31[] __initconst = {
+	0xef, 0xcd, 0x5a, 0x4a, 0xf4, 0x7e, 0x6a, 0x3a,
+	0x11, 0x88, 0x72, 0x94, 0xb8, 0xae, 0x84, 0xc3,
+	0x66, 0xe0, 0xde, 0x4b, 0x00, 0xa5, 0xd6, 0x2d,
+	0x50, 0xb7, 0x28, 0xff, 0x76, 0x57, 0x18, 0x1f
+};
+enum { nonce31 = 0xcb6f428fa4192e19ULL };
+
+static const u8 input32[] __initconst = {
+	0x90, 0x06, 0x50, 0x4b, 0x98, 0x14, 0x30, 0xf1,
+	0xb8, 0xd7, 0xf0, 0xa4, 0x3e, 0x4e, 0xd8, 0x00,
+	0xea, 0xdb, 0x4f, 0x93, 0x05, 0xef, 0x02, 0x71,
+	0x1a, 0xcd, 0xa3, 0xb1, 0xae, 0xd3, 0x18
+};
+static const u8 output32[] __initconst = {
+	0xcb, 0x4a, 0x37, 0x3f, 0xea, 0x40, 0xab, 0x86,
+	0xfe, 0xcc, 0x07, 0xd5, 0xdc, 0xb2, 0x25, 0xb6,
+	0xfd, 0x2a, 0x72, 0xbc, 0x5e, 0xd4, 0x75, 0xff,
+	0x71, 0xfc, 0xce, 0x1e, 0x6f, 0x22, 0xc1
+};
+static const u8 key32[] __initconst = {
+	0xfc, 0x6d, 0xc3, 0x80, 0xce, 0xa4, 0x31, 0xa1,
+	0xcc, 0xfa, 0x9d, 0x10, 0x0b, 0xc9, 0x11, 0x77,
+	0x34, 0xdb, 0xad, 0x1b, 0xc4, 0xfc, 0xeb, 0x79,
+	0x91, 0xda, 0x59, 0x3b, 0x0d, 0xb1, 0x19, 0x3b
+};
+enum { nonce32 = 0x88551bf050059467ULL };
+
+static const u8 input33[] __initconst = {
+	0x88, 0x94, 0x71, 0x92, 0xe8, 0xd7, 0xf9, 0xbd,
+	0x55, 0xe3, 0x22, 0xdb, 0x99, 0x51, 0xfb, 0x50,
+	0xbf, 0x82, 0xb5, 0x70, 0x8b, 0x2b, 0x6a, 0x03,
+	0x37, 0xa0, 0xc6, 0x19, 0x5d, 0xc9, 0xbc, 0xcc
+};
+static const u8 output33[] __initconst = {
+	0xb6, 0x17, 0x51, 0xc8, 0xea, 0x8a, 0x14, 0xdc,
+	0x23, 0x1b, 0xd4, 0xed, 0xbf, 0x50, 0xb9, 0x38,
+	0x00, 0xc2, 0x3f, 0x78, 0x3d, 0xbf, 0xa0, 0x84,
+	0xef, 0x45, 0xb2, 0x7d, 0x48, 0x7b, 0x62, 0xa7
+};
+static const u8 key33[] __initconst = {
+	0xb9, 0x8f, 0x6a, 0xad, 0xb4, 0x6f, 0xb5, 0xdc,
+	0x48, 0xfa, 0x43, 0x57, 0x62, 0x97, 0xef, 0x89,
+	0x4c, 0x5a, 0x7b, 0x67, 0xb8, 0x9d, 0xf0, 0x42,
+	0x2b, 0x8f, 0xf3, 0x18, 0x05, 0x2e, 0x48, 0xd0
+};
+enum { nonce33 = 0x31f16488fe8447f5ULL };
+
+static const u8 input34[] __initconst = {
+	0xda, 0x2b, 0x3d, 0x63, 0x9e, 0x4f, 0xc2, 0xb8,
+	0x7f, 0xc2, 0x1a, 0x8b, 0x0d, 0x95, 0x65, 0x55,
+	0x52, 0xba, 0x51, 0x51, 0xc0, 0x61, 0x9f, 0x0a,
+	0x5d, 0xb0, 0x59, 0x8c, 0x64, 0x6a, 0xab, 0xf5,
+	0x57
+};
+static const u8 output34[] __initconst = {
+	0x5c, 0xf6, 0x62, 0x24, 0x8c, 0x45, 0xa3, 0x26,
+	0xd0, 0xe4, 0x88, 0x1c, 0xed, 0xc4, 0x26, 0x58,
+	0xb5, 0x5d, 0x92, 0xc4, 0x17, 0x44, 0x1c, 0xb8,
+	0x2c, 0xf3, 0x55, 0x7e, 0xd6, 0xe5, 0xb3, 0x65,
+	0xa8
+};
+static const u8 key34[] __initconst = {
+	0xde, 0xd1, 0x27, 0xb7, 0x7c, 0xfa, 0xa6, 0x78,
+	0x39, 0x80, 0xdf, 0xb7, 0x46, 0xac, 0x71, 0x26,
+	0xd0, 0x2a, 0x56, 0x79, 0x12, 0xeb, 0x26, 0x37,
+	0x01, 0x0d, 0x30, 0xe0, 0xe3, 0x66, 0xb2, 0xf4
+};
+enum { nonce34 = 0x92d0d9b252c24149ULL };
+
+static const u8 input35[] __initconst = {
+	0x3a, 0x15, 0x5b, 0x75, 0x6e, 0xd0, 0x52, 0x20,
+	0x6c, 0x82, 0xfa, 0xce, 0x5b, 0xea, 0xf5, 0x43,
+	0xc1, 0x81, 0x7c, 0xb2, 0xac, 0x16, 0x3f, 0xd3,
+	0x5a, 0xaf, 0x55, 0x98, 0xf4, 0xc6, 0xba, 0x71,
+	0x25, 0x8b
+};
+static const u8 output35[] __initconst = {
+	0xb3, 0xaf, 0xac, 0x6d, 0x4d, 0xc7, 0x68, 0x56,
+	0x50, 0x5b, 0x69, 0x2a, 0xe5, 0x90, 0xf9, 0x5f,
+	0x99, 0x88, 0xff, 0x0c, 0xa6, 0xb1, 0x83, 0xd6,
+	0x80, 0xa6, 0x1b, 0xde, 0x94, 0xa4, 0x2c, 0xc3,
+	0x74, 0xfa
+};
+static const u8 key35[] __initconst = {
+	0xd8, 0x24, 0xe2, 0x06, 0xd7, 0x7a, 0xce, 0x81,
+	0x52, 0x72, 0x02, 0x69, 0x89, 0xc4, 0xe9, 0x53,
+	0x3b, 0x08, 0x5f, 0x98, 0x1e, 0x1b, 0x99, 0x6e,
+	0x28, 0x17, 0x6d, 0xba, 0xc0, 0x96, 0xf9, 0x3c
+};
+enum { nonce35 = 0x7baf968c4c8e3a37ULL };
+
+static const u8 input36[] __initconst = {
+	0x31, 0x5d, 0x4f, 0xe3, 0xac, 0xad, 0x17, 0xa6,
+	0xb5, 0x01, 0xe2, 0xc6, 0xd4, 0x7e, 0xc4, 0x80,
+	0xc0, 0x59, 0x72, 0xbb, 0x4b, 0x74, 0x6a, 0x41,
+	0x0f, 0x9c, 0xf6, 0xca, 0x20, 0xb3, 0x73, 0x07,
+	0x6b, 0x02, 0x2a
+};
+static const u8 output36[] __initconst = {
+	0xf9, 0x09, 0x92, 0x94, 0x7e, 0x31, 0xf7, 0x53,
+	0xe8, 0x8a, 0x5b, 0x20, 0xef, 0x9b, 0x45, 0x81,
+	0xba, 0x5e, 0x45, 0x63, 0xc1, 0xc7, 0x9e, 0x06,
+	0x0e, 0xd9, 0x62, 0x8e, 0x96, 0xf9, 0xfa, 0x43,
+	0x4d, 0xd4, 0x28
+};
+static const u8 key36[] __initconst = {
+	0x13, 0x30, 0x4c, 0x06, 0xae, 0x18, 0xde, 0x03,
+	0x1d, 0x02, 0x40, 0xf5, 0xbb, 0x19, 0xe3, 0x88,
+	0x41, 0xb1, 0x29, 0x15, 0x97, 0xc2, 0x69, 0x3f,
+	0x32, 0x2a, 0x0c, 0x8b, 0xcf, 0x83, 0x8b, 0x6c
+};
+enum { nonce36 = 0x226d251d475075a0ULL };
+
+static const u8 input37[] __initconst = {
+	0x10, 0x18, 0xbe, 0xfd, 0x66, 0xc9, 0x77, 0xcc,
+	0x43, 0xe5, 0x46, 0x0b, 0x08, 0x8b, 0xae, 0x11,
+	0x86, 0x15, 0xc2, 0xf6, 0x45, 0xd4, 0x5f, 0xd6,
+	0xb6, 0x5f, 0x9f, 0x3e, 0x97, 0xb7, 0xd4, 0xad,
+	0x0b, 0xe8, 0x31, 0x94
+};
+static const u8 output37[] __initconst = {
+	0x03, 0x2c, 0x1c, 0xee, 0xc6, 0xdd, 0xed, 0x38,
+	0x80, 0x6d, 0x84, 0x16, 0xc3, 0xc2, 0x04, 0x63,
+	0xcd, 0xa7, 0x6e, 0x36, 0x8b, 0xed, 0x78, 0x63,
+	0x95, 0xfc, 0x69, 0x7a, 0x3f, 0x8d, 0x75, 0x6b,
+	0x6c, 0x26, 0x56, 0x4d
+};
+static const u8 key37[] __initconst = {
+	0xac, 0x84, 0x4d, 0xa9, 0x29, 0x49, 0x3c, 0x39,
+	0x7f, 0xd9, 0xa6, 0x01, 0xf3, 0x7e, 0xfa, 0x4a,
+	0x14, 0x80, 0x22, 0x74, 0xf0, 0x29, 0x30, 0x2d,
+	0x07, 0x21, 0xda, 0xc0, 0x4d, 0x70, 0x56, 0xa2
+};
+enum { nonce37 = 0x167823ce3b64925aULL };
+
+static const u8 input38[] __initconst = {
+	0x30, 0x8f, 0xfa, 0x24, 0x29, 0xb1, 0xfb, 0xce,
+	0x31, 0x62, 0xdc, 0xd0, 0x46, 0xab, 0xe1, 0x31,
+	0xd9, 0xae, 0x60, 0x0d, 0xca, 0x0a, 0x49, 0x12,
+	0x3d, 0x92, 0xe9, 0x91, 0x67, 0x12, 0x62, 0x18,
+	0x89, 0xe2, 0xf9, 0x1c, 0xcc
+};
+static const u8 output38[] __initconst = {
+	0x56, 0x9c, 0xc8, 0x7a, 0xc5, 0x98, 0xa3, 0x0f,
+	0xba, 0xd5, 0x3e, 0xe1, 0xc9, 0x33, 0x64, 0x33,
+	0xf0, 0xd5, 0xf7, 0x43, 0x66, 0x0e, 0x08, 0x9a,
+	0x6e, 0x09, 0xe4, 0x01, 0x0d, 0x1e, 0x2f, 0x4b,
+	0xed, 0x9c, 0x08, 0x8c, 0x03
+};
+static const u8 key38[] __initconst = {
+	0x77, 0x52, 0x2a, 0x23, 0xf1, 0xc5, 0x96, 0x2b,
+	0x89, 0x4f, 0x3e, 0xf3, 0xff, 0x0e, 0x94, 0xce,
+	0xf1, 0xbd, 0x53, 0xf5, 0x77, 0xd6, 0x9e, 0x47,
+	0x49, 0x3d, 0x16, 0x64, 0xff, 0x95, 0x42, 0x42
+};
+enum { nonce38 = 0xff629d7b82cef357ULL };
+
+static const u8 input39[] __initconst = {
+	0x38, 0x26, 0x27, 0xd0, 0xc2, 0xf5, 0x34, 0xba,
+	0xda, 0x0f, 0x1c, 0x1c, 0x9a, 0x70, 0xe5, 0x8a,
+	0x78, 0x2d, 0x8f, 0x9a, 0xbf, 0x89, 0x6a, 0xfd,
+	0xd4, 0x9c, 0x33, 0xf1, 0xb6, 0x89, 0x16, 0xe3,
+	0x6a, 0x00, 0xfa, 0x3a, 0x0f, 0x26
+};
+static const u8 output39[] __initconst = {
+	0x0f, 0xaf, 0x91, 0x6d, 0x9c, 0x99, 0xa4, 0xf7,
+	0x3b, 0x9d, 0x9a, 0x98, 0xca, 0xbb, 0x50, 0x48,
+	0xee, 0xcb, 0x5d, 0xa1, 0x37, 0x2d, 0x36, 0x09,
+	0x2a, 0xe2, 0x1c, 0x3d, 0x98, 0x40, 0x1c, 0x16,
+	0x56, 0xa7, 0x98, 0xe9, 0x7d, 0x2b
+};
+static const u8 key39[] __initconst = {
+	0x6e, 0x83, 0x15, 0x4d, 0xf8, 0x78, 0xa8, 0x0e,
+	0x71, 0x37, 0xd4, 0x6e, 0x28, 0x5c, 0x06, 0xa1,
+	0x2d, 0x6c, 0x72, 0x7a, 0xfd, 0xf8, 0x65, 0x1a,
+	0xb8, 0xe6, 0x29, 0x7b, 0xe5, 0xb3, 0x23, 0x79
+};
+enum { nonce39 = 0xa4d8c491cf093e9dULL };
+
+static const u8 input40[] __initconst = {
+	0x8f, 0x32, 0x7c, 0x40, 0x37, 0x95, 0x08, 0x00,
+	0x00, 0xfe, 0x2f, 0x95, 0x20, 0x12, 0x40, 0x18,
+	0x5e, 0x7e, 0x5e, 0x99, 0xee, 0x8d, 0x91, 0x7d,
+	0x50, 0x7d, 0x21, 0x45, 0x27, 0xe1, 0x7f, 0xd4,
+	0x73, 0x10, 0xe1, 0x33, 0xbc, 0xf8, 0xdd
+};
+static const u8 output40[] __initconst = {
+	0x78, 0x7c, 0xdc, 0x55, 0x2b, 0xd9, 0x2b, 0x3a,
+	0xdd, 0x56, 0x11, 0x52, 0xd3, 0x2e, 0xe0, 0x0d,
+	0x23, 0x20, 0x8a, 0xf1, 0x4f, 0xee, 0xf1, 0x68,
+	0xf6, 0xdc, 0x53, 0xcf, 0x17, 0xd4, 0xf0, 0x6c,
+	0xdc, 0x80, 0x5f, 0x1c, 0xa4, 0x91, 0x05
+};
+static const u8 key40[] __initconst = {
+	0x0d, 0x86, 0xbf, 0x8a, 0xba, 0x9e, 0x39, 0x91,
+	0xa8, 0xe7, 0x22, 0xf0, 0x0c, 0x43, 0x18, 0xe4,
+	0x1f, 0xb0, 0xaf, 0x8a, 0x34, 0x31, 0xf4, 0x41,
+	0xf0, 0x89, 0x85, 0xca, 0x5d, 0x05, 0x3b, 0x94
+};
+enum { nonce40 = 0xae7acc4f5986439eULL };
+
+static const u8 input41[] __initconst = {
+	0x20, 0x5f, 0xc1, 0x83, 0x36, 0x02, 0x76, 0x96,
+	0xf0, 0xbf, 0x8e, 0x0e, 0x1a, 0xd1, 0xc7, 0x88,
+	0x18, 0xc7, 0x09, 0xc4, 0x15, 0xd9, 0x4f, 0x5e,
+	0x1f, 0xb3, 0xb4, 0x6d, 0xcb, 0xa0, 0xd6, 0x8a,
+	0x3b, 0x40, 0x8e, 0x80, 0xf1, 0xe8, 0x8f, 0x5f
+};
+static const u8 output41[] __initconst = {
+	0x0b, 0xd1, 0x49, 0x9a, 0x9d, 0xe8, 0x97, 0xb8,
+	0xd1, 0xeb, 0x90, 0x62, 0x37, 0xd2, 0x99, 0x15,
+	0x67, 0x6d, 0x27, 0x93, 0xce, 0x37, 0x65, 0xa2,
+	0x94, 0x88, 0xd6, 0x17, 0xbc, 0x1c, 0x6e, 0xa2,
+	0xcc, 0xfb, 0x81, 0x0e, 0x30, 0x60, 0x5a, 0x6f
+};
+static const u8 key41[] __initconst = {
+	0x36, 0x27, 0x57, 0x01, 0x21, 0x68, 0x97, 0xc7,
+	0x00, 0x67, 0x7b, 0xe9, 0x0f, 0x55, 0x49, 0xbb,
+	0x92, 0x18, 0x98, 0xf5, 0x5e, 0xbc, 0xe7, 0x5a,
+	0x9d, 0x3d, 0xc7, 0xbd, 0x59, 0xec, 0x82, 0x8e
+};
+enum { nonce41 = 0x5da05e4c8dfab464ULL };
+
+static const u8 input42[] __initconst = {
+	0xca, 0x30, 0xcd, 0x63, 0xf0, 0x2d, 0xf1, 0x03,
+	0x4d, 0x0d, 0xf2, 0xf7, 0x6f, 0xae, 0xd6, 0x34,
+	0xea, 0xf6, 0x13, 0xcf, 0x1c, 0xa0, 0xd0, 0xe8,
+	0xa4, 0x78, 0x80, 0x3b, 0x1e, 0xa5, 0x32, 0x4c,
+	0x73, 0x12, 0xd4, 0x6a, 0x94, 0xbc, 0xba, 0x80,
+	0x5e
+};
+static const u8 output42[] __initconst = {
+	0xec, 0x3f, 0x18, 0x31, 0xc0, 0x7b, 0xb5, 0xe2,
+	0xad, 0xf3, 0xec, 0xa0, 0x16, 0x9d, 0xef, 0xce,
+	0x05, 0x65, 0x59, 0x9d, 0x5a, 0xca, 0x3e, 0x13,
+	0xb9, 0x5d, 0x5d, 0xb5, 0xeb, 0xae, 0xc0, 0x87,
+	0xbb, 0xfd, 0xe7, 0xe4, 0x89, 0x5b, 0xd2, 0x6c,
+	0x56
+};
+static const u8 key42[] __initconst = {
+	0x7c, 0x6b, 0x7e, 0x77, 0xcc, 0x8c, 0x1b, 0x03,
+	0x8b, 0x2a, 0xb3, 0x7c, 0x5a, 0x73, 0xcc, 0xac,
+	0xdd, 0x53, 0x54, 0x0c, 0x85, 0xed, 0xcd, 0x47,
+	0x24, 0xc1, 0xb8, 0x9b, 0x2e, 0x41, 0x92, 0x36
+};
+enum { nonce42 = 0xe4d7348b09682c9cULL };
+
+static const u8 input43[] __initconst = {
+	0x52, 0xf2, 0x4b, 0x7c, 0xe5, 0x58, 0xe8, 0xd2,
+	0xb7, 0xf3, 0xa1, 0x29, 0x68, 0xa2, 0x50, 0x50,
+	0xae, 0x9c, 0x1b, 0xe2, 0x67, 0x77, 0xe2, 0xdb,
+	0x85, 0x55, 0x7e, 0x84, 0x8a, 0x12, 0x3c, 0xb6,
+	0x2e, 0xed, 0xd3, 0xec, 0x47, 0x68, 0xfa, 0x52,
+	0x46, 0x9d
+};
+static const u8 output43[] __initconst = {
+	0x1b, 0xf0, 0x05, 0xe4, 0x1c, 0xd8, 0x74, 0x9a,
+	0xf0, 0xee, 0x00, 0x54, 0xce, 0x02, 0x83, 0x15,
+	0xfb, 0x23, 0x35, 0x78, 0xc3, 0xda, 0x98, 0xd8,
+	0x9d, 0x1b, 0xb2, 0x51, 0x82, 0xb0, 0xff, 0xbe,
+	0x05, 0xa9, 0xa4, 0x04, 0xba, 0xea, 0x4b, 0x73,
+	0x47, 0x6e
+};
+static const u8 key43[] __initconst = {
+	0xeb, 0xec, 0x0e, 0xa1, 0x65, 0xe2, 0x99, 0x46,
+	0xd8, 0x54, 0x8c, 0x4a, 0x93, 0xdf, 0x6d, 0xbf,
+	0x93, 0x34, 0x94, 0x57, 0xc9, 0x12, 0x9d, 0x68,
+	0x05, 0xc5, 0x05, 0xad, 0x5a, 0xc9, 0x2a, 0x3b
+};
+enum { nonce43 = 0xe14f6a902b7827fULL };
+
+static const u8 input44[] __initconst = {
+	0x3e, 0x22, 0x3e, 0x8e, 0xcd, 0x18, 0xe2, 0xa3,
+	0x8d, 0x8b, 0x38, 0xc3, 0x02, 0xa3, 0x31, 0x48,
+	0xc6, 0x0e, 0xec, 0x99, 0x51, 0x11, 0x6d, 0x8b,
+	0x32, 0x35, 0x3b, 0x08, 0x58, 0x76, 0x25, 0x30,
+	0xe2, 0xfc, 0xa2, 0x46, 0x7d, 0x6e, 0x34, 0x87,
+	0xac, 0x42, 0xbf
+};
+static const u8 output44[] __initconst = {
+	0x08, 0x92, 0x58, 0x02, 0x1a, 0xf4, 0x1f, 0x3d,
+	0x38, 0x7b, 0x6b, 0xf6, 0x84, 0x07, 0xa3, 0x19,
+	0x17, 0x2a, 0xed, 0x57, 0x1c, 0xf9, 0x55, 0x37,
+	0x4e, 0xf4, 0x68, 0x68, 0x82, 0x02, 0x4f, 0xca,
+	0x21, 0x00, 0xc6, 0x66, 0x79, 0x53, 0x19, 0xef,
+	0x7f, 0xdd, 0x74
+};
+static const u8 key44[] __initconst = {
+	0x73, 0xb6, 0x3e, 0xf4, 0x57, 0x52, 0xa6, 0x43,
+	0x51, 0xd8, 0x25, 0x00, 0xdb, 0xb4, 0x52, 0x69,
+	0xd6, 0x27, 0x49, 0xeb, 0x9b, 0xf1, 0x7b, 0xa0,
+	0xd6, 0x7c, 0x9c, 0xd8, 0x95, 0x03, 0x69, 0x26
+};
+enum { nonce44 = 0xf5e6dc4f35ce24e5ULL };
+
+static const u8 input45[] __initconst = {
+	0x55, 0x76, 0xc0, 0xf1, 0x74, 0x03, 0x7a, 0x6d,
+	0x14, 0xd8, 0x36, 0x2c, 0x9f, 0x9a, 0x59, 0x7a,
+	0x2a, 0xf5, 0x77, 0x84, 0x70, 0x7c, 0x1d, 0x04,
+	0x90, 0x45, 0xa4, 0xc1, 0x5e, 0xdd, 0x2e, 0x07,
+	0x18, 0x34, 0xa6, 0x85, 0x56, 0x4f, 0x09, 0xaf,
+	0x2f, 0x83, 0xe1, 0xc6
+};
+static const u8 output45[] __initconst = {
+	0x22, 0x46, 0xe4, 0x0b, 0x3a, 0x55, 0xcc, 0x9b,
+	0xf0, 0xc0, 0x53, 0xcd, 0x95, 0xc7, 0x57, 0x6c,
+	0x77, 0x46, 0x41, 0x72, 0x07, 0xbf, 0xa8, 0xe5,
+	0x68, 0x69, 0xd8, 0x1e, 0x45, 0xc1, 0xa2, 0x50,
+	0xa5, 0xd1, 0x62, 0xc9, 0x5a, 0x7d, 0x08, 0x14,
+	0xae, 0x44, 0x16, 0xb9
+};
+static const u8 key45[] __initconst = {
+	0x41, 0xf3, 0x88, 0xb2, 0x51, 0x25, 0x47, 0x02,
+	0x39, 0xe8, 0x15, 0x3a, 0x22, 0x78, 0x86, 0x0b,
+	0xf9, 0x1e, 0x8d, 0x98, 0xb2, 0x22, 0x82, 0xac,
+	0x42, 0x94, 0xde, 0x64, 0xf0, 0xfd, 0xb3, 0x6c
+};
+enum { nonce45 = 0xf51a582daf4aa01aULL };
+
+static const u8 input46[] __initconst = {
+	0xf6, 0xff, 0x20, 0xf9, 0x26, 0x7e, 0x0f, 0xa8,
+	0x6a, 0x45, 0x5a, 0x91, 0x73, 0xc4, 0x4c, 0x63,
+	0xe5, 0x61, 0x59, 0xca, 0xec, 0xc0, 0x20, 0x35,
+	0xbc, 0x9f, 0x58, 0x9c, 0x5e, 0xa1, 0x17, 0x46,
+	0xcc, 0xab, 0x6e, 0xd0, 0x4f, 0x24, 0xeb, 0x05,
+	0x4d, 0x40, 0x41, 0xe0, 0x9d
+};
+static const u8 output46[] __initconst = {
+	0x31, 0x6e, 0x63, 0x3f, 0x9c, 0xe6, 0xb1, 0xb7,
+	0xef, 0x47, 0x46, 0xd7, 0xb1, 0x53, 0x42, 0x2f,
+	0x2c, 0xc8, 0x01, 0xae, 0x8b, 0xec, 0x42, 0x2c,
+	0x6b, 0x2c, 0x9c, 0xb2, 0xf0, 0x29, 0x06, 0xa5,
+	0xcd, 0x7e, 0xc7, 0x3a, 0x38, 0x98, 0x8a, 0xde,
+	0x03, 0x29, 0x14, 0x8f, 0xf9
+};
+static const u8 key46[] __initconst = {
+	0xac, 0xa6, 0x44, 0x4a, 0x0d, 0x42, 0x10, 0xbc,
+	0xd3, 0xc9, 0x8e, 0x9e, 0x71, 0xa3, 0x1c, 0x14,
+	0x9d, 0x65, 0x0d, 0x49, 0x4d, 0x8c, 0xec, 0x46,
+	0xe1, 0x41, 0xcd, 0xf5, 0xfc, 0x82, 0x75, 0x34
+};
+enum { nonce46 = 0x25f85182df84dec5ULL };
+
+static const u8 input47[] __initconst = {
+	0xa1, 0xd2, 0xf2, 0x52, 0x2f, 0x79, 0x50, 0xb2,
+	0x42, 0x29, 0x5b, 0x44, 0x20, 0xf9, 0xbd, 0x85,
+	0xb7, 0x65, 0x77, 0x86, 0xce, 0x3e, 0x1c, 0xe4,
+	0x70, 0x80, 0xdd, 0x72, 0x07, 0x48, 0x0f, 0x84,
+	0x0d, 0xfd, 0x97, 0xc0, 0xb7, 0x48, 0x9b, 0xb4,
+	0xec, 0xff, 0x73, 0x14, 0x99, 0xe4
+};
+static const u8 output47[] __initconst = {
+	0xe5, 0x3c, 0x78, 0x66, 0x31, 0x1e, 0xd6, 0xc4,
+	0x9e, 0x71, 0xb3, 0xd7, 0xd5, 0xad, 0x84, 0xf2,
+	0x78, 0x61, 0x77, 0xf8, 0x31, 0xf0, 0x13, 0xad,
+	0x66, 0xf5, 0x31, 0x7d, 0xeb, 0xdf, 0xaf, 0xcb,
+	0xac, 0x28, 0x6c, 0xc2, 0x9e, 0xe7, 0x78, 0xa2,
+	0xa2, 0x58, 0xce, 0x84, 0x76, 0x70
+};
+static const u8 key47[] __initconst = {
+	0x05, 0x7f, 0xc0, 0x7f, 0x37, 0x20, 0x71, 0x02,
+	0x3a, 0xe7, 0x20, 0x5a, 0x0a, 0x8f, 0x79, 0x5a,
+	0xfe, 0xbb, 0x43, 0x4d, 0x2f, 0xcb, 0xf6, 0x9e,
+	0xa2, 0x97, 0x00, 0xad, 0x0d, 0x51, 0x7e, 0x17
+};
+enum { nonce47 = 0xae707c60f54de32bULL };
+
+static const u8 input48[] __initconst = {
+	0x80, 0x93, 0x77, 0x2e, 0x8d, 0xe8, 0xe6, 0xc1,
+	0x27, 0xe6, 0xf2, 0x89, 0x5b, 0x33, 0x62, 0x18,
+	0x80, 0x6e, 0x17, 0x22, 0x8e, 0x83, 0x31, 0x40,
+	0x8f, 0xc9, 0x5c, 0x52, 0x6c, 0x0e, 0xa5, 0xe9,
+	0x6c, 0x7f, 0xd4, 0x6a, 0x27, 0x56, 0x99, 0xce,
+	0x8d, 0x37, 0x59, 0xaf, 0xc0, 0x0e, 0xe1
+};
+static const u8 output48[] __initconst = {
+	0x02, 0xa4, 0x2e, 0x33, 0xb7, 0x7c, 0x2b, 0x9a,
+	0x18, 0x5a, 0xba, 0x53, 0x38, 0xaf, 0x00, 0xeb,
+	0xd8, 0x3d, 0x02, 0x77, 0x43, 0x45, 0x03, 0x91,
+	0xe2, 0x5e, 0x4e, 0xeb, 0x50, 0xd5, 0x5b, 0xe0,
+	0xf3, 0x33, 0xa7, 0xa2, 0xac, 0x07, 0x6f, 0xeb,
+	0x3f, 0x6c, 0xcd, 0xf2, 0x6c, 0x61, 0x64
+};
+static const u8 key48[] __initconst = {
+	0xf3, 0x79, 0xe7, 0xf8, 0x0e, 0x02, 0x05, 0x6b,
+	0x83, 0x1a, 0xe7, 0x86, 0x6b, 0xe6, 0x8f, 0x3f,
+	0xd3, 0xa3, 0xe4, 0x6e, 0x29, 0x06, 0xad, 0xbc,
+	0xe8, 0x33, 0x56, 0x39, 0xdf, 0xb0, 0xe2, 0xfe
+};
+enum { nonce48 = 0xd849b938c6569da0ULL };
+
+static const u8 input49[] __initconst = {
+	0x89, 0x3b, 0x88, 0x9e, 0x7b, 0x38, 0x16, 0x9f,
+	0xa1, 0x28, 0xf6, 0xf5, 0x23, 0x74, 0x28, 0xb0,
+	0xdf, 0x6c, 0x9e, 0x8a, 0x71, 0xaf, 0xed, 0x7a,
+	0x39, 0x21, 0x57, 0x7d, 0x31, 0x6c, 0xee, 0x0d,
+	0x11, 0x8d, 0x41, 0x9a, 0x5f, 0xb7, 0x27, 0x40,
+	0x08, 0xad, 0xc6, 0xe0, 0x00, 0x43, 0x9e, 0xae
+};
+static const u8 output49[] __initconst = {
+	0x4d, 0xfd, 0xdb, 0x4c, 0x77, 0xc1, 0x05, 0x07,
+	0x4d, 0x6d, 0x32, 0xcb, 0x2e, 0x0e, 0xff, 0x65,
+	0xc9, 0x27, 0xeb, 0xa9, 0x46, 0x5b, 0xab, 0x06,
+	0xe6, 0xb6, 0x5a, 0x1e, 0x00, 0xfb, 0xcf, 0xe4,
+	0xb9, 0x71, 0x40, 0x10, 0xef, 0x12, 0x39, 0xf0,
+	0xea, 0x40, 0xb8, 0x9a, 0xa2, 0x85, 0x38, 0x48
+};
+static const u8 key49[] __initconst = {
+	0xe7, 0x10, 0x40, 0xd9, 0x66, 0xc0, 0xa8, 0x6d,
+	0xa3, 0xcc, 0x8b, 0xdd, 0x93, 0xf2, 0x6e, 0xe0,
+	0x90, 0x7f, 0xd0, 0xf4, 0x37, 0x0c, 0x8b, 0x9b,
+	0x4c, 0x4d, 0xe6, 0xf2, 0x1f, 0xe9, 0x95, 0x24
+};
+enum { nonce49 = 0xf269817bdae01bc0ULL };
+
+static const u8 input50[] __initconst = {
+	0xda, 0x5b, 0x60, 0xcd, 0xed, 0x58, 0x8e, 0x7f,
+	0xae, 0xdd, 0xc8, 0x2e, 0x16, 0x90, 0xea, 0x4b,
+	0x0c, 0x74, 0x14, 0x35, 0xeb, 0xee, 0x2c, 0xff,
+	0x46, 0x99, 0x97, 0x6e, 0xae, 0xa7, 0x8e, 0x6e,
+	0x38, 0xfe, 0x63, 0xe7, 0x51, 0xd9, 0xaa, 0xce,
+	0x7b, 0x1e, 0x7e, 0x5d, 0xc0, 0xe8, 0x10, 0x06,
+	0x14
+};
+static const u8 output50[] __initconst = {
+	0xe4, 0xe5, 0x86, 0x1b, 0x66, 0x19, 0xac, 0x49,
+	0x1c, 0xbd, 0xee, 0x03, 0xaf, 0x11, 0xfc, 0x1f,
+	0x6a, 0xd2, 0x50, 0x5c, 0xea, 0x2c, 0xa5, 0x75,
+	0xfd, 0xb7, 0x0e, 0x80, 0x8f, 0xed, 0x3f, 0x31,
+	0x47, 0xac, 0x67, 0x43, 0xb8, 0x2e, 0xb4, 0x81,
+	0x6d, 0xe4, 0x1e, 0xb7, 0x8b, 0x0c, 0x53, 0xa9,
+	0x26
+};
+static const u8 key50[] __initconst = {
+	0xd7, 0xb2, 0x04, 0x76, 0x30, 0xcc, 0x38, 0x45,
+	0xef, 0xdb, 0xc5, 0x86, 0x08, 0x61, 0xf0, 0xee,
+	0x6d, 0xd8, 0x22, 0x04, 0x8c, 0xfb, 0xcb, 0x37,
+	0xa6, 0xfb, 0x95, 0x22, 0xe1, 0x87, 0xb7, 0x6f
+};
+enum { nonce50 = 0x3b44d09c45607d38ULL };
+
+static const u8 input51[] __initconst = {
+	0xa9, 0x41, 0x02, 0x4b, 0xd7, 0xd5, 0xd1, 0xf1,
+	0x21, 0x55, 0xb2, 0x75, 0x6d, 0x77, 0x1b, 0x86,
+	0xa9, 0xc8, 0x90, 0xfd, 0xed, 0x4a, 0x7b, 0x6c,
+	0xb2, 0x5f, 0x9b, 0x5f, 0x16, 0xa1, 0x54, 0xdb,
+	0xd6, 0x3f, 0x6a, 0x7f, 0x2e, 0x51, 0x9d, 0x49,
+	0x5b, 0xa5, 0x0e, 0xf9, 0xfb, 0x2a, 0x38, 0xff,
+	0x20, 0x8c
+};
+static const u8 output51[] __initconst = {
+	0x18, 0xf7, 0x88, 0xc1, 0x72, 0xfd, 0x90, 0x4b,
+	0xa9, 0x2d, 0xdb, 0x47, 0xb0, 0xa5, 0xc4, 0x37,
+	0x01, 0x95, 0xc4, 0xb1, 0xab, 0xc5, 0x5b, 0xcd,
+	0xe1, 0x97, 0x78, 0x13, 0xde, 0x6a, 0xff, 0x36,
+	0xce, 0xa4, 0x67, 0xc5, 0x4a, 0x45, 0x2b, 0xd9,
+	0xff, 0x8f, 0x06, 0x7c, 0x63, 0xbb, 0x83, 0x17,
+	0xb4, 0x6b
+};
+static const u8 key51[] __initconst = {
+	0x82, 0x1a, 0x79, 0xab, 0x9a, 0xb5, 0x49, 0x6a,
+	0x30, 0x6b, 0x99, 0x19, 0x11, 0xc7, 0xa2, 0xf4,
+	0xca, 0x55, 0xb9, 0xdd, 0xe7, 0x2f, 0xe7, 0xc1,
+	0xdd, 0x27, 0xad, 0x80, 0xf2, 0x56, 0xad, 0xf3
+};
+enum { nonce51 = 0xe93aff94ca71a4a6ULL };
+
+static const u8 input52[] __initconst = {
+	0x89, 0xdd, 0xf3, 0xfa, 0xb6, 0xc1, 0xaa, 0x9a,
+	0xc8, 0xad, 0x6b, 0x00, 0xa1, 0x65, 0xea, 0x14,
+	0x55, 0x54, 0x31, 0x8f, 0xf0, 0x03, 0x84, 0x51,
+	0x17, 0x1e, 0x0a, 0x93, 0x6e, 0x79, 0x96, 0xa3,
+	0x2a, 0x85, 0x9c, 0x89, 0xf8, 0xd1, 0xe2, 0x15,
+	0x95, 0x05, 0xf4, 0x43, 0x4d, 0x6b, 0xf0, 0x71,
+	0x3b, 0x3e, 0xba
+};
+static const u8 output52[] __initconst = {
+	0x0c, 0x42, 0x6a, 0xb3, 0x66, 0x63, 0x5d, 0x2c,
+	0x9f, 0x3d, 0xa6, 0x6e, 0xc7, 0x5f, 0x79, 0x2f,
+	0x50, 0xe3, 0xd6, 0x07, 0x56, 0xa4, 0x2b, 0x2d,
+	0x8d, 0x10, 0xc0, 0x6c, 0xa2, 0xfc, 0x97, 0xec,
+	0x3f, 0x5c, 0x8d, 0x59, 0xbe, 0x84, 0xf1, 0x3e,
+	0x38, 0x47, 0x4f, 0x75, 0x25, 0x66, 0x88, 0x14,
+	0x03, 0xdd, 0xde
+};
+static const u8 key52[] __initconst = {
+	0x4f, 0xb0, 0x27, 0xb6, 0xdd, 0x24, 0x0c, 0xdb,
+	0x6b, 0x71, 0x2e, 0xac, 0xfc, 0x3f, 0xa6, 0x48,
+	0x5d, 0xd5, 0xff, 0x53, 0xb5, 0x62, 0xf1, 0xe0,
+	0x93, 0xfe, 0x39, 0x4c, 0x9f, 0x03, 0x11, 0xa7
+};
+enum { nonce52 = 0xed8becec3bdf6f25ULL };
+
+static const u8 input53[] __initconst = {
+	0x68, 0xd1, 0xc7, 0x74, 0x44, 0x1c, 0x84, 0xde,
+	0x27, 0x27, 0x35, 0xf0, 0x18, 0x0b, 0x57, 0xaa,
+	0xd0, 0x1a, 0xd3, 0x3b, 0x5e, 0x5c, 0x62, 0x93,
+	0xd7, 0x6b, 0x84, 0x3b, 0x71, 0x83, 0x77, 0x01,
+	0x3e, 0x59, 0x45, 0xf4, 0x77, 0x6c, 0x6b, 0xcb,
+	0x88, 0x45, 0x09, 0x1d, 0xc6, 0x45, 0x6e, 0xdc,
+	0x6e, 0x51, 0xb8, 0x28
+};
+static const u8 output53[] __initconst = {
+	0xc5, 0x90, 0x96, 0x78, 0x02, 0xf5, 0xc4, 0x3c,
+	0xde, 0xd4, 0xd4, 0xc6, 0xa7, 0xad, 0x12, 0x47,
+	0x45, 0xce, 0xcd, 0x8c, 0x35, 0xcc, 0xa6, 0x9e,
+	0x5a, 0xc6, 0x60, 0xbb, 0xe3, 0xed, 0xec, 0x68,
+	0x3f, 0x64, 0xf7, 0x06, 0x63, 0x9c, 0x8c, 0xc8,
+	0x05, 0x3a, 0xad, 0x32, 0x79, 0x8b, 0x45, 0x96,
+	0x93, 0x73, 0x4c, 0xe0
+};
+static const u8 key53[] __initconst = {
+	0x42, 0x4b, 0x20, 0x81, 0x49, 0x50, 0xe9, 0xc2,
+	0x43, 0x69, 0x36, 0xe7, 0x68, 0xae, 0xd5, 0x7e,
+	0x42, 0x1a, 0x1b, 0xb4, 0x06, 0x4d, 0xa7, 0x17,
+	0xb5, 0x31, 0xd6, 0x0c, 0xb0, 0x5c, 0x41, 0x0b
+};
+enum { nonce53 = 0xf44ce1931fbda3d7ULL };
+
+static const u8 input54[] __initconst = {
+	0x7b, 0xf6, 0x8b, 0xae, 0xc0, 0xcb, 0x10, 0x8e,
+	0xe8, 0xd8, 0x2e, 0x3b, 0x14, 0xba, 0xb4, 0xd2,
+	0x58, 0x6b, 0x2c, 0xec, 0xc1, 0x81, 0x71, 0xb4,
+	0xc6, 0xea, 0x08, 0xc5, 0xc9, 0x78, 0xdb, 0xa2,
+	0xfa, 0x44, 0x50, 0x9b, 0xc8, 0x53, 0x8d, 0x45,
+	0x42, 0xe7, 0x09, 0xc4, 0x29, 0xd8, 0x75, 0x02,
+	0xbb, 0xb2, 0x78, 0xcf, 0xe7
+};
+static const u8 output54[] __initconst = {
+	0xaf, 0x2c, 0x83, 0x26, 0x6e, 0x7f, 0xa6, 0xe9,
+	0x03, 0x75, 0xfe, 0xfe, 0x87, 0x58, 0xcf, 0xb5,
+	0xbc, 0x3c, 0x9d, 0xa1, 0x6e, 0x13, 0xf1, 0x0f,
+	0x9e, 0xbc, 0xe0, 0x54, 0x24, 0x32, 0xce, 0x95,
+	0xe6, 0xa5, 0x59, 0x3d, 0x24, 0x1d, 0x8f, 0xb1,
+	0x74, 0x6c, 0x56, 0xe7, 0x96, 0xc1, 0x91, 0xc8,
+	0x2d, 0x0e, 0xb7, 0x51, 0x10
+};
+static const u8 key54[] __initconst = {
+	0x00, 0x68, 0x74, 0xdc, 0x30, 0x9e, 0xe3, 0x52,
+	0xa9, 0xae, 0xb6, 0x7c, 0xa1, 0xdc, 0x12, 0x2d,
+	0x98, 0x32, 0x7a, 0x77, 0xe1, 0xdd, 0xa3, 0x76,
+	0x72, 0x34, 0x83, 0xd8, 0xb7, 0x69, 0xba, 0x77
+};
+enum { nonce54 = 0xbea57d79b798b63aULL };
+
+static const u8 input55[] __initconst = {
+	0xb5, 0xf4, 0x2f, 0xc1, 0x5e, 0x10, 0xa7, 0x4e,
+	0x74, 0x3d, 0xa3, 0x96, 0xc0, 0x4d, 0x7b, 0x92,
+	0x8f, 0xdb, 0x2d, 0x15, 0x52, 0x6a, 0x95, 0x5e,
+	0x40, 0x81, 0x4f, 0x70, 0x73, 0xea, 0x84, 0x65,
+	0x3d, 0x9a, 0x4e, 0x03, 0x95, 0xf8, 0x5d, 0x2f,
+	0x07, 0x02, 0x13, 0x13, 0xdd, 0x82, 0xe6, 0x3b,
+	0xe1, 0x5f, 0xb3, 0x37, 0x9b, 0x88
+};
+static const u8 output55[] __initconst = {
+	0xc1, 0x88, 0xbd, 0x92, 0x77, 0xad, 0x7c, 0x5f,
+	0xaf, 0xa8, 0x57, 0x0e, 0x40, 0x0a, 0xdc, 0x70,
+	0xfb, 0xc6, 0x71, 0xfd, 0xc4, 0x74, 0x60, 0xcc,
+	0xa0, 0x89, 0x8e, 0x99, 0xf0, 0x06, 0xa6, 0x7c,
+	0x97, 0x42, 0x21, 0x81, 0x6a, 0x07, 0xe7, 0xb3,
+	0xf7, 0xa5, 0x03, 0x71, 0x50, 0x05, 0x63, 0x17,
+	0xa9, 0x46, 0x0b, 0xff, 0x30, 0x78
+};
+static const u8 key55[] __initconst = {
+	0x19, 0x8f, 0xe7, 0xd7, 0x6b, 0x7f, 0x6f, 0x69,
+	0x86, 0x91, 0x0f, 0xa7, 0x4a, 0x69, 0x8e, 0x34,
+	0xf3, 0xdb, 0xde, 0xaf, 0xf2, 0x66, 0x1d, 0x64,
+	0x97, 0x0c, 0xcf, 0xfa, 0x33, 0x84, 0xfd, 0x0c
+};
+enum { nonce55 = 0x80aa3d3e2c51ef06ULL };
+
+static const u8 input56[] __initconst = {
+	0x6b, 0xe9, 0x73, 0x42, 0x27, 0x5e, 0x12, 0xcd,
+	0xaa, 0x45, 0x12, 0x8b, 0xb3, 0xe6, 0x54, 0x33,
+	0x31, 0x7d, 0xe2, 0x25, 0xc6, 0x86, 0x47, 0x67,
+	0x86, 0x83, 0xe4, 0x46, 0xb5, 0x8f, 0x2c, 0xbb,
+	0xe4, 0xb8, 0x9f, 0xa2, 0xa4, 0xe8, 0x75, 0x96,
+	0x92, 0x51, 0x51, 0xac, 0x8e, 0x2e, 0x6f, 0xfc,
+	0xbd, 0x0d, 0xa3, 0x9f, 0x16, 0x55, 0x3e
+};
+static const u8 output56[] __initconst = {
+	0x42, 0x99, 0x73, 0x6c, 0xd9, 0x4b, 0x16, 0xe5,
+	0x18, 0x63, 0x1a, 0xd9, 0x0e, 0xf1, 0x15, 0x2e,
+	0x0f, 0x4b, 0xe4, 0x5f, 0xa0, 0x4d, 0xde, 0x9f,
+	0xa7, 0x18, 0xc1, 0x0c, 0x0b, 0xae, 0x55, 0xe4,
+	0x89, 0x18, 0xa4, 0x78, 0x9d, 0x25, 0x0d, 0xd5,
+	0x94, 0x0f, 0xf9, 0x78, 0xa3, 0xa6, 0xe9, 0x9e,
+	0x2c, 0x73, 0xf0, 0xf7, 0x35, 0xf3, 0x2b
+};
+static const u8 key56[] __initconst = {
+	0x7d, 0x12, 0xad, 0x51, 0xd5, 0x6f, 0x8f, 0x96,
+	0xc0, 0x5d, 0x9a, 0xd1, 0x7e, 0x20, 0x98, 0x0e,
+	0x3c, 0x0a, 0x67, 0x6b, 0x1b, 0x88, 0x69, 0xd4,
+	0x07, 0x8c, 0xaf, 0x0f, 0x3a, 0x28, 0xe4, 0x5d
+};
+enum { nonce56 = 0x70f4c372fb8b5984ULL };
+
+static const u8 input57[] __initconst = {
+	0x28, 0xa3, 0x06, 0xe8, 0xe7, 0x08, 0xb9, 0xef,
+	0x0d, 0x63, 0x15, 0x99, 0xb2, 0x78, 0x7e, 0xaf,
+	0x30, 0x50, 0xcf, 0xea, 0xc9, 0x91, 0x41, 0x2f,
+	0x3b, 0x38, 0x70, 0xc4, 0x87, 0xb0, 0x3a, 0xee,
+	0x4a, 0xea, 0xe3, 0x83, 0x68, 0x8b, 0xcf, 0xda,
+	0x04, 0xa5, 0xbd, 0xb2, 0xde, 0x3c, 0x55, 0x13,
+	0xfe, 0x96, 0xad, 0xc1, 0x61, 0x1b, 0x98, 0xde
+};
+static const u8 output57[] __initconst = {
+	0xf4, 0x44, 0xe9, 0xd2, 0x6d, 0xc2, 0x5a, 0xe9,
+	0xfd, 0x7e, 0x41, 0x54, 0x3f, 0xf4, 0x12, 0xd8,
+	0x55, 0x0d, 0x12, 0x9b, 0xd5, 0x2e, 0x95, 0xe5,
+	0x77, 0x42, 0x3f, 0x2c, 0xfb, 0x28, 0x9d, 0x72,
+	0x6d, 0x89, 0x82, 0x27, 0x64, 0x6f, 0x0d, 0x57,
+	0xa1, 0x25, 0xa3, 0x6b, 0x88, 0x9a, 0xac, 0x0c,
+	0x76, 0x19, 0x90, 0xe2, 0x50, 0x5a, 0xf8, 0x12
+};
+static const u8 key57[] __initconst = {
+	0x08, 0x26, 0xb8, 0xac, 0xf3, 0xa5, 0xc6, 0xa3,
+	0x7f, 0x09, 0x87, 0xf5, 0x6c, 0x5a, 0x85, 0x6c,
+	0x3d, 0xbd, 0xde, 0xd5, 0x87, 0xa3, 0x98, 0x7a,
+	0xaa, 0x40, 0x3e, 0xf7, 0xff, 0x44, 0x5d, 0xee
+};
+enum { nonce57 = 0xc03a6130bf06b089ULL };
+
+static const u8 input58[] __initconst = {
+	0x82, 0xa5, 0x38, 0x6f, 0xaa, 0xb4, 0xaf, 0xb2,
+	0x42, 0x01, 0xa8, 0x39, 0x3f, 0x15, 0x51, 0xa8,
+	0x11, 0x1b, 0x93, 0xca, 0x9c, 0xa0, 0x57, 0x68,
+	0x8f, 0xdb, 0x68, 0x53, 0x51, 0x6d, 0x13, 0x22,
+	0x12, 0x9b, 0xbd, 0x33, 0xa8, 0x52, 0x40, 0x57,
+	0x80, 0x9b, 0x98, 0xef, 0x56, 0x70, 0x11, 0xfa,
+	0x36, 0x69, 0x7d, 0x15, 0x48, 0xf9, 0x3b, 0xeb,
+	0x42
+};
+static const u8 output58[] __initconst = {
+	0xff, 0x3a, 0x74, 0xc3, 0x3e, 0x44, 0x64, 0x4d,
+	0x0e, 0x5f, 0x9d, 0xa8, 0xdb, 0xbe, 0x12, 0xef,
+	0xba, 0x56, 0x65, 0x50, 0x76, 0xaf, 0xa4, 0x4e,
+	0x01, 0xc1, 0xd3, 0x31, 0x14, 0xe2, 0xbe, 0x7b,
+	0xa5, 0x67, 0xb4, 0xe3, 0x68, 0x40, 0x9c, 0xb0,
+	0xb1, 0x78, 0xef, 0x49, 0x03, 0x0f, 0x2d, 0x56,
+	0xb4, 0x37, 0xdb, 0xbc, 0x2d, 0x68, 0x1c, 0x3c,
+	0xf1
+};
+static const u8 key58[] __initconst = {
+	0x7e, 0xf1, 0x7c, 0x20, 0x65, 0xed, 0xcd, 0xd7,
+	0x57, 0xe8, 0xdb, 0x90, 0x87, 0xdb, 0x5f, 0x63,
+	0x3d, 0xdd, 0xb8, 0x2b, 0x75, 0x8e, 0x04, 0xb5,
+	0xf4, 0x12, 0x79, 0xa9, 0x4d, 0x42, 0x16, 0x7f
+};
+enum { nonce58 = 0x92838183f80d2f7fULL };
+
+static const u8 input59[] __initconst = {
+	0x37, 0xf1, 0x9d, 0xdd, 0xd7, 0x08, 0x9f, 0x13,
+	0xc5, 0x21, 0x82, 0x75, 0x08, 0x9e, 0x25, 0x16,
+	0xb1, 0xd1, 0x71, 0x42, 0x28, 0x63, 0xac, 0x47,
+	0x71, 0x54, 0xb1, 0xfc, 0x39, 0xf0, 0x61, 0x4f,
+	0x7c, 0x6d, 0x4f, 0xc8, 0x33, 0xef, 0x7e, 0xc8,
+	0xc0, 0x97, 0xfc, 0x1a, 0x61, 0xb4, 0x87, 0x6f,
+	0xdd, 0x5a, 0x15, 0x7b, 0x1b, 0x95, 0x50, 0x94,
+	0x1d, 0xba
+};
+static const u8 output59[] __initconst = {
+	0x73, 0x67, 0xc5, 0x07, 0xbb, 0x57, 0x79, 0xd5,
+	0xc9, 0x04, 0xdd, 0x88, 0xf3, 0x86, 0xe5, 0x70,
+	0x49, 0x31, 0xe0, 0xcc, 0x3b, 0x1d, 0xdf, 0xb0,
+	0xaf, 0xf4, 0x2d, 0xe0, 0x06, 0x10, 0x91, 0x8d,
+	0x1c, 0xcf, 0x31, 0x0b, 0xf6, 0x73, 0xda, 0x1c,
+	0xf0, 0x17, 0x52, 0x9e, 0x20, 0x2e, 0x9f, 0x8c,
+	0xb3, 0x59, 0xce, 0xd4, 0xd3, 0xc1, 0x81, 0xe9,
+	0x11, 0x36
+};
+static const u8 key59[] __initconst = {
+	0xbd, 0x07, 0xd0, 0x53, 0x2c, 0xb3, 0xcc, 0x3f,
+	0xc4, 0x95, 0xfd, 0xe7, 0x81, 0xb3, 0x29, 0x99,
+	0x05, 0x45, 0xd6, 0x95, 0x25, 0x0b, 0x72, 0xd3,
+	0xcd, 0xbb, 0x73, 0xf8, 0xfa, 0xc0, 0x9b, 0x7a
+};
+enum { nonce59 = 0x4a0db819b0d519e2ULL };
+
+static const u8 input60[] __initconst = {
+	0x58, 0x4e, 0xdf, 0x94, 0x3c, 0x76, 0x0a, 0x79,
+	0x47, 0xf1, 0xbe, 0x88, 0xd3, 0xba, 0x94, 0xd8,
+	0xe2, 0x8f, 0xe3, 0x2f, 0x2f, 0x74, 0x82, 0x55,
+	0xc3, 0xda, 0xe2, 0x4e, 0x2c, 0x8c, 0x45, 0x1d,
+	0x72, 0x8f, 0x54, 0x41, 0xb5, 0xb7, 0x69, 0xe4,
+	0xdc, 0xd2, 0x36, 0x21, 0x5c, 0x28, 0x52, 0xf7,
+	0x98, 0x8e, 0x72, 0xa7, 0x6d, 0x57, 0xed, 0xdc,
+	0x3c, 0xe6, 0x6a
+};
+static const u8 output60[] __initconst = {
+	0xda, 0xaf, 0xb5, 0xe3, 0x30, 0x65, 0x5c, 0xb1,
+	0x48, 0x08, 0x43, 0x7b, 0x9e, 0xd2, 0x6a, 0x62,
+	0x56, 0x7c, 0xad, 0xd9, 0xe5, 0xf6, 0x09, 0x71,
+	0xcd, 0xe6, 0x05, 0x6b, 0x3f, 0x44, 0x3a, 0x5c,
+	0xf6, 0xf8, 0xd7, 0xce, 0x7d, 0xd1, 0xe0, 0x4f,
+	0x88, 0x15, 0x04, 0xd8, 0x20, 0xf0, 0x3e, 0xef,
+	0xae, 0xa6, 0x27, 0xa3, 0x0e, 0xfc, 0x18, 0x90,
+	0x33, 0xcd, 0xd3
+};
+static const u8 key60[] __initconst = {
+	0xbf, 0xfd, 0x25, 0xb5, 0xb2, 0xfc, 0x78, 0x0c,
+	0x8e, 0xb9, 0x57, 0x2f, 0x26, 0x4a, 0x7e, 0x71,
+	0xcc, 0xf2, 0xe0, 0xfd, 0x24, 0x11, 0x20, 0x23,
+	0x57, 0x00, 0xff, 0x80, 0x11, 0x0c, 0x1e, 0xff
+};
+enum { nonce60 = 0xf18df56fdb7954adULL };
+
+static const u8 input61[] __initconst = {
+	0xb0, 0xf3, 0x06, 0xbc, 0x22, 0xae, 0x49, 0x40,
+	0xae, 0xff, 0x1b, 0x31, 0xa7, 0x98, 0xab, 0x1d,
+	0xe7, 0x40, 0x23, 0x18, 0x4f, 0xab, 0x8e, 0x93,
+	0x82, 0xf4, 0x56, 0x61, 0xfd, 0x2b, 0xcf, 0xa7,
+	0xc4, 0xb4, 0x0a, 0xf4, 0xcb, 0xc7, 0x8c, 0x40,
+	0x57, 0xac, 0x0b, 0x3e, 0x2a, 0x0a, 0x67, 0x83,
+	0x50, 0xbf, 0xec, 0xb0, 0xc7, 0xf1, 0x32, 0x26,
+	0x98, 0x80, 0x33, 0xb4
+};
+static const u8 output61[] __initconst = {
+	0x9d, 0x23, 0x0e, 0xff, 0xcc, 0x7c, 0xd5, 0xcf,
+	0x1a, 0xb8, 0x59, 0x1e, 0x92, 0xfd, 0x7f, 0xca,
+	0xca, 0x3c, 0x18, 0x81, 0xde, 0xfa, 0x59, 0xc8,
+	0x6f, 0x9c, 0x24, 0x3f, 0x3a, 0xe6, 0x0b, 0xb4,
+	0x34, 0x48, 0x69, 0xfc, 0xb6, 0xea, 0xb2, 0xde,
+	0x9f, 0xfd, 0x92, 0x36, 0x18, 0x98, 0x99, 0xaa,
+	0x65, 0xe2, 0xea, 0xf4, 0xb1, 0x47, 0x8e, 0xb0,
+	0xe7, 0xd4, 0x7a, 0x2c
+};
+static const u8 key61[] __initconst = {
+	0xd7, 0xfd, 0x9b, 0xbd, 0x8f, 0x65, 0x0d, 0x00,
+	0xca, 0xa1, 0x6c, 0x85, 0x85, 0xa4, 0x6d, 0xf1,
+	0xb1, 0x68, 0x0c, 0x8b, 0x5d, 0x37, 0x72, 0xd0,
+	0xd8, 0xd2, 0x25, 0xab, 0x9f, 0x7b, 0x7d, 0x95
+};
+enum { nonce61 = 0xd82caf72a9c4864fULL };
+
+static const u8 input62[] __initconst = {
+	0x10, 0x77, 0xf3, 0x2f, 0xc2, 0x50, 0xd6, 0x0c,
+	0xba, 0xa8, 0x8d, 0xce, 0x0d, 0x58, 0x9e, 0x87,
+	0xb1, 0x59, 0x66, 0x0a, 0x4a, 0xb3, 0xd8, 0xca,
+	0x0a, 0x6b, 0xf8, 0xc6, 0x2b, 0x3f, 0x8e, 0x09,
+	0xe0, 0x0a, 0x15, 0x85, 0xfe, 0xaa, 0xc6, 0xbd,
+	0x30, 0xef, 0xe4, 0x10, 0x78, 0x03, 0xc1, 0xc7,
+	0x8a, 0xd9, 0xde, 0x0b, 0x51, 0x07, 0xc4, 0x7b,
+	0xe2, 0x2e, 0x36, 0x3a, 0xc2
+};
+static const u8 output62[] __initconst = {
+	0xa0, 0x0c, 0xfc, 0xc1, 0xf6, 0xaf, 0xc2, 0xb8,
+	0x5c, 0xef, 0x6e, 0xf3, 0xce, 0x15, 0x48, 0x05,
+	0xb5, 0x78, 0x49, 0x51, 0x1f, 0x9d, 0xf4, 0xbf,
+	0x2f, 0x53, 0xa2, 0xd1, 0x15, 0x20, 0x82, 0x6b,
+	0xd2, 0x22, 0x6c, 0x4e, 0x14, 0x87, 0xe3, 0xd7,
+	0x49, 0x45, 0x84, 0xdb, 0x5f, 0x68, 0x60, 0xc4,
+	0xb3, 0xe6, 0x3f, 0xd1, 0xfc, 0xa5, 0x73, 0xf3,
+	0xfc, 0xbb, 0xbe, 0xc8, 0x9d
+};
+static const u8 key62[] __initconst = {
+	0x6e, 0xc9, 0xaf, 0xce, 0x35, 0xb9, 0x86, 0xd1,
+	0xce, 0x5f, 0xd9, 0xbb, 0xd5, 0x1f, 0x7c, 0xcd,
+	0xfe, 0x19, 0xaa, 0x3d, 0xea, 0x64, 0xc1, 0x28,
+	0x40, 0xba, 0xa1, 0x28, 0xcd, 0x40, 0xb6, 0xf2
+};
+enum { nonce62 = 0xa1c0c265f900cde8ULL };
+
+static const u8 input63[] __initconst = {
+	0x7a, 0x70, 0x21, 0x2c, 0xef, 0xa6, 0x36, 0xd4,
+	0xe0, 0xab, 0x8c, 0x25, 0x73, 0x34, 0xc8, 0x94,
+	0x6c, 0x81, 0xcb, 0x19, 0x8d, 0x5a, 0x49, 0xaa,
+	0x6f, 0xba, 0x83, 0x72, 0x02, 0x5e, 0xf5, 0x89,
+	0xce, 0x79, 0x7e, 0x13, 0x3d, 0x5b, 0x98, 0x60,
+	0x5d, 0xd9, 0xfb, 0x15, 0x93, 0x4c, 0xf3, 0x51,
+	0x49, 0x55, 0xd1, 0x58, 0xdd, 0x7e, 0x6d, 0xfe,
+	0xdd, 0x84, 0x23, 0x05, 0xba, 0xe9
+};
+static const u8 output63[] __initconst = {
+	0x20, 0xb3, 0x5c, 0x03, 0x03, 0x78, 0x17, 0xfc,
+	0x3b, 0x35, 0x30, 0x9a, 0x00, 0x18, 0xf5, 0xc5,
+	0x06, 0x53, 0xf5, 0x04, 0x24, 0x9d, 0xd1, 0xb2,
+	0xac, 0x5a, 0xb6, 0x2a, 0xa5, 0xda, 0x50, 0x00,
+	0xec, 0xff, 0xa0, 0x7a, 0x14, 0x7b, 0xe4, 0x6b,
+	0x63, 0xe8, 0x66, 0x86, 0x34, 0xfd, 0x74, 0x44,
+	0xa2, 0x50, 0x97, 0x0d, 0xdc, 0xc3, 0x84, 0xf8,
+	0x71, 0x02, 0x31, 0x95, 0xed, 0x54
+};
+static const u8 key63[] __initconst = {
+	0x7d, 0x64, 0xb4, 0x12, 0x81, 0xe4, 0xe6, 0x8f,
+	0xcc, 0xe7, 0xd1, 0x1f, 0x70, 0x20, 0xfd, 0xb8,
+	0x3a, 0x7d, 0xa6, 0x53, 0x65, 0x30, 0x5d, 0xe3,
+	0x1a, 0x44, 0xbe, 0x62, 0xed, 0x90, 0xc4, 0xd1
+};
+enum { nonce63 = 0xe8e849596c942276ULL };
+
+static const u8 input64[] __initconst = {
+	0x84, 0xf8, 0xda, 0x87, 0x23, 0x39, 0x60, 0xcf,
+	0xc5, 0x50, 0x7e, 0xc5, 0x47, 0x29, 0x7c, 0x05,
+	0xc2, 0xb4, 0xf4, 0xb2, 0xec, 0x5d, 0x48, 0x36,
+	0xbf, 0xfc, 0x06, 0x8c, 0xf2, 0x0e, 0x88, 0xe7,
+	0xc9, 0xc5, 0xa4, 0xa2, 0x83, 0x20, 0xa1, 0x6f,
+	0x37, 0xe5, 0x2d, 0xa1, 0x72, 0xa1, 0x19, 0xef,
+	0x05, 0x42, 0x08, 0xf2, 0x57, 0x47, 0x31, 0x1e,
+	0x17, 0x76, 0x13, 0xd3, 0xcc, 0x75, 0x2c
+};
+static const u8 output64[] __initconst = {
+	0xcb, 0xec, 0x90, 0x88, 0xeb, 0x31, 0x69, 0x20,
+	0xa6, 0xdc, 0xff, 0x76, 0x98, 0xb0, 0x24, 0x49,
+	0x7b, 0x20, 0xd9, 0xd1, 0x1b, 0xe3, 0x61, 0xdc,
+	0xcf, 0x51, 0xf6, 0x70, 0x72, 0x33, 0x28, 0x94,
+	0xac, 0x73, 0x18, 0xcf, 0x93, 0xfd, 0xca, 0x08,
+	0x0d, 0xa2, 0xb9, 0x57, 0x1e, 0x51, 0xb6, 0x07,
+	0x5c, 0xc1, 0x13, 0x64, 0x1d, 0x18, 0x6f, 0xe6,
+	0x0b, 0xb7, 0x14, 0x03, 0x43, 0xb6, 0xaf
+};
+static const u8 key64[] __initconst = {
+	0xbf, 0x82, 0x65, 0xe4, 0x50, 0xf9, 0x5e, 0xea,
+	0x28, 0x91, 0xd1, 0xd2, 0x17, 0x7c, 0x13, 0x7e,
+	0xf5, 0xd5, 0x6b, 0x06, 0x1c, 0x20, 0xc2, 0x82,
+	0xa1, 0x7a, 0xa2, 0x14, 0xa1, 0xb0, 0x54, 0x58
+};
+enum { nonce64 = 0xe57c5095aa5723c9ULL };
+
+static const u8 input65[] __initconst = {
+	0x1c, 0xfb, 0xd3, 0x3f, 0x85, 0xd7, 0xba, 0x7b,
+	0xae, 0xb1, 0xa5, 0xd2, 0xe5, 0x40, 0xce, 0x4d,
+	0x3e, 0xab, 0x17, 0x9d, 0x7d, 0x9f, 0x03, 0x98,
+	0x3f, 0x9f, 0xc8, 0xdd, 0x36, 0x17, 0x43, 0x5c,
+	0x34, 0xd1, 0x23, 0xe0, 0x77, 0xbf, 0x35, 0x5d,
+	0x8f, 0xb1, 0xcb, 0x82, 0xbb, 0x39, 0x69, 0xd8,
+	0x90, 0x45, 0x37, 0xfd, 0x98, 0x25, 0xf7, 0x5b,
+	0xce, 0x06, 0x43, 0xba, 0x61, 0xa8, 0x47, 0xb9
+};
+static const u8 output65[] __initconst = {
+	0x73, 0xa5, 0x68, 0xab, 0x8b, 0xa5, 0xc3, 0x7e,
+	0x74, 0xf8, 0x9d, 0xf5, 0x93, 0x6e, 0xf2, 0x71,
+	0x6d, 0xde, 0x82, 0xc5, 0x40, 0xa0, 0x46, 0xb3,
+	0x9a, 0x78, 0xa8, 0xf7, 0xdf, 0xb1, 0xc3, 0xdd,
+	0x8d, 0x90, 0x00, 0x68, 0x21, 0x48, 0xe8, 0xba,
+	0x56, 0x9f, 0x8f, 0xe7, 0xa4, 0x4d, 0x36, 0x55,
+	0xd0, 0x34, 0x99, 0xa6, 0x1c, 0x4c, 0xc1, 0xe2,
+	0x65, 0x98, 0x14, 0x8e, 0x6a, 0x05, 0xb1, 0x2b
+};
+static const u8 key65[] __initconst = {
+	0xbd, 0x5c, 0x8a, 0xb0, 0x11, 0x29, 0xf3, 0x00,
+	0x7a, 0x78, 0x32, 0x63, 0x34, 0x00, 0xe6, 0x7d,
+	0x30, 0x54, 0xde, 0x37, 0xda, 0xc2, 0xc4, 0x3d,
+	0x92, 0x6b, 0x4c, 0xc2, 0x92, 0xe9, 0x9e, 0x2a
+};
+enum { nonce65 = 0xf654a3031de746f2ULL };
+
+static const u8 input66[] __initconst = {
+	0x4b, 0x27, 0x30, 0x8f, 0x28, 0xd8, 0x60, 0x46,
+	0x39, 0x06, 0x49, 0xea, 0x1b, 0x71, 0x26, 0xe0,
+	0x99, 0x2b, 0xd4, 0x8f, 0x64, 0x64, 0xcd, 0xac,
+	0x1d, 0x78, 0x88, 0x90, 0xe1, 0x5c, 0x24, 0x4b,
+	0xdc, 0x2d, 0xb7, 0xee, 0x3a, 0xe6, 0x86, 0x2c,
+	0x21, 0xe4, 0x2b, 0xfc, 0xe8, 0x19, 0xca, 0x65,
+	0xe7, 0xdd, 0x6f, 0x52, 0xb3, 0x11, 0xe1, 0xe2,
+	0xbf, 0xe8, 0x70, 0xe3, 0x0d, 0x45, 0xb8, 0xa5,
+	0x20, 0xb7, 0xb5, 0xaf, 0xff, 0x08, 0xcf, 0x23,
+	0x65, 0xdf, 0x8d, 0xc3, 0x31, 0xf3, 0x1e, 0x6a,
+	0x58, 0x8d, 0xcc, 0x45, 0x16, 0x86, 0x1f, 0x31,
+	0x5c, 0x27, 0xcd, 0xc8, 0x6b, 0x19, 0x1e, 0xec,
+	0x44, 0x75, 0x63, 0x97, 0xfd, 0x79, 0xf6, 0x62,
+	0xc5, 0xba, 0x17, 0xc7, 0xab, 0x8f, 0xbb, 0xed,
+	0x85, 0x2a, 0x98, 0x79, 0x21, 0xec, 0x6e, 0x4d,
+	0xdc, 0xfa, 0x72, 0x52, 0xba, 0xc8, 0x4c
+};
+static const u8 output66[] __initconst = {
+	0x76, 0x5b, 0x2c, 0xa7, 0x62, 0xb9, 0x08, 0x4a,
+	0xc6, 0x4a, 0x92, 0xc3, 0xbb, 0x10, 0xb3, 0xee,
+	0xff, 0xb9, 0x07, 0xc7, 0x27, 0xcb, 0x1e, 0xcf,
+	0x58, 0x6f, 0xa1, 0x64, 0xe8, 0xf1, 0x4e, 0xe1,
+	0xef, 0x18, 0x96, 0xab, 0x97, 0x28, 0xd1, 0x7c,
+	0x71, 0x6c, 0xd1, 0xe2, 0xfa, 0xd9, 0x75, 0xcb,
+	0xeb, 0xea, 0x0c, 0x86, 0x82, 0xd8, 0xf4, 0xcc,
+	0xea, 0xa3, 0x00, 0xfa, 0x82, 0xd2, 0xcd, 0xcb,
+	0xdb, 0x63, 0x28, 0xe2, 0x82, 0xe9, 0x01, 0xed,
+	0x31, 0xe6, 0x71, 0x45, 0x08, 0x89, 0x8a, 0x23,
+	0xa8, 0xb5, 0xc2, 0xe2, 0x9f, 0xe9, 0xb8, 0x9a,
+	0xc4, 0x79, 0x6d, 0x71, 0x52, 0x61, 0x74, 0x6c,
+	0x1b, 0xd7, 0x65, 0x6d, 0x03, 0xc4, 0x1a, 0xc0,
+	0x50, 0xba, 0xd6, 0xc9, 0x43, 0x50, 0xbe, 0x09,
+	0x09, 0x8a, 0xdb, 0xaa, 0x76, 0x4e, 0x3b, 0x61,
+	0x3c, 0x7c, 0x44, 0xe7, 0xdb, 0x10, 0xa7
+};
+static const u8 key66[] __initconst = {
+	0x88, 0xdf, 0xca, 0x68, 0xaf, 0x4f, 0xb3, 0xfd,
+	0x6e, 0xa7, 0x95, 0x35, 0x8a, 0xe8, 0x37, 0xe8,
+	0xc8, 0x55, 0xa2, 0x2a, 0x6d, 0x77, 0xf8, 0x93,
+	0x7a, 0x41, 0xf3, 0x7b, 0x95, 0xdf, 0x89, 0xf5
+};
+enum { nonce66 = 0x1024b4fdd415cf82ULL };
+
+static const u8 input67[] __initconst = {
+	0xd4, 0x2e, 0xfa, 0x92, 0xe9, 0x29, 0x68, 0xb7,
+	0x54, 0x2c, 0xf7, 0xa4, 0x2d, 0xb7, 0x50, 0xb5,
+	0xc5, 0xb2, 0x9d, 0x17, 0x5e, 0x0a, 0xca, 0x37,
+	0xbf, 0x60, 0xae, 0xd2, 0x98, 0xe9, 0xfa, 0x59,
+	0x67, 0x62, 0xe6, 0x43, 0x0c, 0x77, 0x80, 0x82,
+	0x33, 0x61, 0xa3, 0xff, 0xc1, 0xa0, 0x8f, 0x56,
+	0xbc, 0xec, 0x65, 0x43, 0x88, 0xa5, 0xff, 0x51,
+	0x64, 0x30, 0xee, 0x34, 0xb7, 0x5c, 0x28, 0x68,
+	0xc3, 0x52, 0xd2, 0xac, 0x78, 0x2a, 0xa6, 0x10,
+	0xb8, 0xb2, 0x4c, 0x80, 0x4f, 0x99, 0xb2, 0x36,
+	0x94, 0x8f, 0x66, 0xcb, 0xa1, 0x91, 0xed, 0x06,
+	0x42, 0x6d, 0xc1, 0xae, 0x55, 0x93, 0xdd, 0x93,
+	0x9e, 0x88, 0x34, 0x7f, 0x98, 0xeb, 0xbe, 0x61,
+	0xf9, 0xa9, 0x0f, 0xd9, 0xc4, 0x87, 0xd5, 0xef,
+	0xcc, 0x71, 0x8c, 0x0e, 0xce, 0xad, 0x02, 0xcf,
+	0xa2, 0x61, 0xdf, 0xb1, 0xfe, 0x3b, 0xdc, 0xc0,
+	0x58, 0xb5, 0x71, 0xa1, 0x83, 0xc9, 0xb4, 0xaf,
+	0x9d, 0x54, 0x12, 0xcd, 0xea, 0x06, 0xd6, 0x4e,
+	0xe5, 0x27, 0x0c, 0xc3, 0xbb, 0xa8, 0x0a, 0x81,
+	0x75, 0xc3, 0xc9, 0xd4, 0x35, 0x3e, 0x53, 0x9f,
+	0xaa, 0x20, 0xc0, 0x68, 0x39, 0x2c, 0x96, 0x39,
+	0x53, 0x81, 0xda, 0x07, 0x0f, 0x44, 0xa5, 0x47,
+	0x0e, 0xb3, 0x87, 0x0d, 0x1b, 0xc1, 0xe5, 0x41,
+	0x35, 0x12, 0x58, 0x96, 0x69, 0x8a, 0x1a, 0xa3,
+	0x9d, 0x3d, 0xd4, 0xb1, 0x8e, 0x1f, 0x96, 0x87,
+	0xda, 0xd3, 0x19, 0xe2, 0xb1, 0x3a, 0x19, 0x74,
+	0xa0, 0x00, 0x9f, 0x4d, 0xbc, 0xcb, 0x0c, 0xe9,
+	0xec, 0x10, 0xdf, 0x2a, 0x88, 0xdc, 0x30, 0x51,
+	0x46, 0x56, 0x53, 0x98, 0x6a, 0x26, 0x14, 0x05,
+	0x54, 0x81, 0x55, 0x0b, 0x3c, 0x85, 0xdd, 0x33,
+	0x81, 0x11, 0x29, 0x82, 0x46, 0x35, 0xe1, 0xdb,
+	0x59, 0x7b
+};
+static const u8 output67[] __initconst = {
+	0x64, 0x6c, 0xda, 0x7f, 0xd4, 0xa9, 0x2a, 0x5e,
+	0x22, 0xae, 0x8d, 0x67, 0xdb, 0xee, 0xfd, 0xd0,
+	0x44, 0x80, 0x17, 0xb2, 0xe3, 0x87, 0xad, 0x57,
+	0x15, 0xcb, 0x88, 0x64, 0xc0, 0xf1, 0x49, 0x3d,
+	0xfa, 0xbe, 0xa8, 0x9f, 0x12, 0xc3, 0x57, 0x56,
+	0x70, 0xa5, 0xc5, 0x6b, 0xf1, 0xab, 0xd5, 0xde,
+	0x77, 0x92, 0x6a, 0x56, 0x03, 0xf5, 0x21, 0x0d,
+	0xb6, 0xc4, 0xcc, 0x62, 0x44, 0x3f, 0xb1, 0xc1,
+	0x61, 0x41, 0x90, 0xb2, 0xd5, 0xb8, 0xf3, 0x57,
+	0xfb, 0xc2, 0x6b, 0x25, 0x58, 0xc8, 0x45, 0x20,
+	0x72, 0x29, 0x6f, 0x9d, 0xb5, 0x81, 0x4d, 0x2b,
+	0xb2, 0x89, 0x9e, 0x91, 0x53, 0x97, 0x1c, 0xd9,
+	0x3d, 0x79, 0xdc, 0x14, 0xae, 0x01, 0x73, 0x75,
+	0xf0, 0xca, 0xd5, 0xab, 0x62, 0x5c, 0x7a, 0x7d,
+	0x3f, 0xfe, 0x22, 0x7d, 0xee, 0xe2, 0xcb, 0x76,
+	0x55, 0xec, 0x06, 0xdd, 0x41, 0x47, 0x18, 0x62,
+	0x1d, 0x57, 0xd0, 0xd6, 0xb6, 0x0f, 0x4b, 0xfc,
+	0x79, 0x19, 0xf4, 0xd6, 0x37, 0x86, 0x18, 0x1f,
+	0x98, 0x0d, 0x9e, 0x15, 0x2d, 0xb6, 0x9a, 0x8a,
+	0x8c, 0x80, 0x22, 0x2f, 0x82, 0xc4, 0xc7, 0x36,
+	0xfa, 0xfa, 0x07, 0xbd, 0xc2, 0x2a, 0xe2, 0xea,
+	0x93, 0xc8, 0xb2, 0x90, 0x33, 0xf2, 0xee, 0x4b,
+	0x1b, 0xf4, 0x37, 0x92, 0x13, 0xbb, 0xe2, 0xce,
+	0xe3, 0x03, 0xcf, 0x07, 0x94, 0xab, 0x9a, 0xc9,
+	0xff, 0x83, 0x69, 0x3a, 0xda, 0x2c, 0xd0, 0x47,
+	0x3d, 0x6c, 0x1a, 0x60, 0x68, 0x47, 0xb9, 0x36,
+	0x52, 0xdd, 0x16, 0xef, 0x6c, 0xbf, 0x54, 0x11,
+	0x72, 0x62, 0xce, 0x8c, 0x9d, 0x90, 0xa0, 0x25,
+	0x06, 0x92, 0x3e, 0x12, 0x7e, 0x1a, 0x1d, 0xe5,
+	0xa2, 0x71, 0xce, 0x1c, 0x4c, 0x6a, 0x7c, 0xdc,
+	0x3d, 0xe3, 0x6e, 0x48, 0x9d, 0xb3, 0x64, 0x7d,
+	0x78, 0x40
+};
+static const u8 key67[] __initconst = {
+	0xa9, 0x20, 0x75, 0x89, 0x7e, 0x37, 0x85, 0x48,
+	0xa3, 0xfb, 0x7b, 0xe8, 0x30, 0xa7, 0xe3, 0x6e,
+	0xa6, 0xc1, 0x71, 0x17, 0xc1, 0x6c, 0x9b, 0xc2,
+	0xde, 0xf0, 0xa7, 0x19, 0xec, 0xce, 0xc6, 0x53
+};
+enum { nonce67 = 0x4adc4d1f968c8a10ULL };
+
+static const u8 input68[] __initconst = {
+	0x99, 0xae, 0x72, 0xfb, 0x16, 0xe1, 0xf1, 0x59,
+	0x43, 0x15, 0x4e, 0x33, 0xa0, 0x95, 0xe7, 0x6c,
+	0x74, 0x24, 0x31, 0xca, 0x3b, 0x2e, 0xeb, 0xd7,
+	0x11, 0xd8, 0xe0, 0x56, 0x92, 0x91, 0x61, 0x57,
+	0xe2, 0x82, 0x9f, 0x8f, 0x37, 0xf5, 0x3d, 0x24,
+	0x92, 0x9d, 0x87, 0x00, 0x8d, 0x89, 0xe0, 0x25,
+	0x8b, 0xe4, 0x20, 0x5b, 0x8a, 0x26, 0x2c, 0x61,
+	0x78, 0xb0, 0xa6, 0x3e, 0x82, 0x18, 0xcf, 0xdc,
+	0x2d, 0x24, 0xdd, 0x81, 0x42, 0xc4, 0x95, 0xf0,
+	0x48, 0x60, 0x71, 0xe3, 0xe3, 0xac, 0xec, 0xbe,
+	0x98, 0x6b, 0x0c, 0xb5, 0x6a, 0xa9, 0xc8, 0x79,
+	0x23, 0x2e, 0x38, 0x0b, 0x72, 0x88, 0x8c, 0xe7,
+	0x71, 0x8b, 0x36, 0xe3, 0x58, 0x3d, 0x9c, 0xa0,
+	0xa2, 0xea, 0xcf, 0x0c, 0x6a, 0x6c, 0x64, 0xdf,
+	0x97, 0x21, 0x8f, 0x93, 0xfb, 0xba, 0xf3, 0x5a,
+	0xd7, 0x8f, 0xa6, 0x37, 0x15, 0x50, 0x43, 0x02,
+	0x46, 0x7f, 0x93, 0x46, 0x86, 0x31, 0xe2, 0xaa,
+	0x24, 0xa8, 0x26, 0xae, 0xe6, 0xc0, 0x05, 0x73,
+	0x0b, 0x4f, 0x7e, 0xed, 0x65, 0xeb, 0x56, 0x1e,
+	0xb6, 0xb3, 0x0b, 0xc3, 0x0e, 0x31, 0x95, 0xa9,
+	0x18, 0x4d, 0xaf, 0x38, 0xd7, 0xec, 0xc6, 0x44,
+	0x72, 0x77, 0x4e, 0x25, 0x4b, 0x25, 0xdd, 0x1e,
+	0x8c, 0xa2, 0xdf, 0xf6, 0x2a, 0x97, 0x1a, 0x88,
+	0x2c, 0x8a, 0x5d, 0xfe, 0xe8, 0xfb, 0x35, 0xe8,
+	0x0f, 0x2b, 0x7a, 0x18, 0x69, 0x43, 0x31, 0x1d,
+	0x38, 0x6a, 0x62, 0x95, 0x0f, 0x20, 0x4b, 0xbb,
+	0x97, 0x3c, 0xe0, 0x64, 0x2f, 0x52, 0xc9, 0x2d,
+	0x4d, 0x9d, 0x54, 0x04, 0x3d, 0xc9, 0xea, 0xeb,
+	0xd0, 0x86, 0x52, 0xff, 0x42, 0xe1, 0x0d, 0x7a,
+	0xad, 0x88, 0xf9, 0x9b, 0x1e, 0x5e, 0x12, 0x27,
+	0x95, 0x3e, 0x0c, 0x2c, 0x13, 0x00, 0x6f, 0x8e,
+	0x93, 0x69, 0x0e, 0x01, 0x8c, 0xc1, 0xfd, 0xb3
+};
+static const u8 output68[] __initconst = {
+	0x26, 0x3e, 0xf2, 0xb1, 0xf5, 0xef, 0x81, 0xa4,
+	0xb7, 0x42, 0xd4, 0x26, 0x18, 0x4b, 0xdd, 0x6a,
+	0x47, 0x15, 0xcb, 0x0e, 0x57, 0xdb, 0xa7, 0x29,
+	0x7e, 0x7b, 0x3f, 0x47, 0x89, 0x57, 0xab, 0xea,
+	0x14, 0x7b, 0xcf, 0x37, 0xdb, 0x1c, 0xe1, 0x11,
+	0x77, 0xae, 0x2e, 0x4c, 0xd2, 0x08, 0x3f, 0xa6,
+	0x62, 0x86, 0xa6, 0xb2, 0x07, 0xd5, 0x3f, 0x9b,
+	0xdc, 0xc8, 0x50, 0x4b, 0x7b, 0xb9, 0x06, 0xe6,
+	0xeb, 0xac, 0x98, 0x8c, 0x36, 0x0c, 0x1e, 0xb2,
+	0xc8, 0xfb, 0x24, 0x60, 0x2c, 0x08, 0x17, 0x26,
+	0x5b, 0xc8, 0xc2, 0xdf, 0x9c, 0x73, 0x67, 0x4a,
+	0xdb, 0xcf, 0xd5, 0x2c, 0x2b, 0xca, 0x24, 0xcc,
+	0xdb, 0xc9, 0xa8, 0xf2, 0x5d, 0x67, 0xdf, 0x5c,
+	0x62, 0x0b, 0x58, 0xc0, 0x83, 0xde, 0x8b, 0xf6,
+	0x15, 0x0a, 0xd6, 0x32, 0xd8, 0xf5, 0xf2, 0x5f,
+	0x33, 0xce, 0x7e, 0xab, 0x76, 0xcd, 0x14, 0x91,
+	0xd8, 0x41, 0x90, 0x93, 0xa1, 0xaf, 0xf3, 0x45,
+	0x6c, 0x1b, 0x25, 0xbd, 0x48, 0x51, 0x6d, 0x15,
+	0x47, 0xe6, 0x23, 0x50, 0x32, 0x69, 0x1e, 0xb5,
+	0x94, 0xd3, 0x97, 0xba, 0xd7, 0x37, 0x4a, 0xba,
+	0xb9, 0xcd, 0xfb, 0x96, 0x9a, 0x90, 0xe0, 0x37,
+	0xf8, 0xdf, 0x91, 0x6c, 0x62, 0x13, 0x19, 0x21,
+	0x4b, 0xa9, 0xf1, 0x12, 0x66, 0xe2, 0x74, 0xd7,
+	0x81, 0xa0, 0x74, 0x8d, 0x7e, 0x7e, 0xc9, 0xb1,
+	0x69, 0x8f, 0xed, 0xb3, 0xf6, 0x97, 0xcd, 0x72,
+	0x78, 0x93, 0xd3, 0x54, 0x6b, 0x43, 0xac, 0x29,
+	0xb4, 0xbc, 0x7d, 0xa4, 0x26, 0x4b, 0x7b, 0xab,
+	0xd6, 0x67, 0x22, 0xff, 0x03, 0x92, 0xb6, 0xd4,
+	0x96, 0x94, 0x5a, 0xe5, 0x02, 0x35, 0x77, 0xfa,
+	0x3f, 0x54, 0x1d, 0xdd, 0x35, 0x39, 0xfe, 0x03,
+	0xdd, 0x8e, 0x3c, 0x8c, 0xc2, 0x69, 0x2a, 0xb1,
+	0xb7, 0xb3, 0xa1, 0x89, 0x84, 0xea, 0x16, 0xe2
+};
+static const u8 key68[] __initconst = {
+	0xd2, 0x49, 0x7f, 0xd7, 0x49, 0x66, 0x0d, 0xb3,
+	0x5a, 0x7e, 0x3c, 0xfc, 0x37, 0x83, 0x0e, 0xf7,
+	0x96, 0xd8, 0xd6, 0x33, 0x79, 0x2b, 0x84, 0x53,
+	0x06, 0xbc, 0x6c, 0x0a, 0x55, 0x84, 0xfe, 0xab
+};
+enum { nonce68 = 0x6a6df7ff0a20de06ULL };
+
+static const u8 input69[] __initconst = {
+	0xf9, 0x18, 0x4c, 0xd2, 0x3f, 0xf7, 0x22, 0xd9,
+	0x58, 0xb6, 0x3b, 0x38, 0x69, 0x79, 0xf4, 0x71,
+	0x5f, 0x38, 0x52, 0x1f, 0x17, 0x6f, 0x6f, 0xd9,
+	0x09, 0x2b, 0xfb, 0x67, 0xdc, 0xc9, 0xe8, 0x4a,
+	0x70, 0x9f, 0x2e, 0x3c, 0x06, 0xe5, 0x12, 0x20,
+	0x25, 0x29, 0xd0, 0xdc, 0x81, 0xc5, 0xc6, 0x0f,
+	0xd2, 0xa8, 0x81, 0x15, 0x98, 0xb2, 0x71, 0x5a,
+	0x9a, 0xe9, 0xfb, 0xaf, 0x0e, 0x5f, 0x8a, 0xf3,
+	0x16, 0x4a, 0x47, 0xf2, 0x5c, 0xbf, 0xda, 0x52,
+	0x9a, 0xa6, 0x36, 0xfd, 0xc6, 0xf7, 0x66, 0x00,
+	0xcc, 0x6c, 0xd4, 0xb3, 0x07, 0x6d, 0xeb, 0xfe,
+	0x92, 0x71, 0x25, 0xd0, 0xcf, 0x9c, 0xe8, 0x65,
+	0x45, 0x10, 0xcf, 0x62, 0x74, 0x7d, 0xf2, 0x1b,
+	0x57, 0xa0, 0xf1, 0x6b, 0xa4, 0xd5, 0xfa, 0x12,
+	0x27, 0x5a, 0xf7, 0x99, 0xfc, 0xca, 0xf3, 0xb8,
+	0x2c, 0x8b, 0xba, 0x28, 0x74, 0xde, 0x8f, 0x78,
+	0xa2, 0x8c, 0xaf, 0x89, 0x4b, 0x05, 0xe2, 0xf3,
+	0xf8, 0xd2, 0xef, 0xac, 0xa4, 0xc4, 0xe2, 0xe2,
+	0x36, 0xbb, 0x5e, 0xae, 0xe6, 0x87, 0x3d, 0x88,
+	0x9f, 0xb8, 0x11, 0xbb, 0xcf, 0x57, 0xce, 0xd0,
+	0xba, 0x62, 0xf4, 0xf8, 0x9b, 0x95, 0x04, 0xc9,
+	0xcf, 0x01, 0xe9, 0xf1, 0xc8, 0xc6, 0x22, 0xa4,
+	0xf2, 0x8b, 0x2f, 0x24, 0x0a, 0xf5, 0x6e, 0xb7,
+	0xd4, 0x2c, 0xb6, 0xf7, 0x5c, 0x97, 0x61, 0x0b,
+	0xd9, 0xb5, 0x06, 0xcd, 0xed, 0x3e, 0x1f, 0xc5,
+	0xb2, 0x6c, 0xa3, 0xea, 0xb8, 0xad, 0xa6, 0x42,
+	0x88, 0x7a, 0x52, 0xd5, 0x64, 0xba, 0xb5, 0x20,
+	0x10, 0xa0, 0x0f, 0x0d, 0xea, 0xef, 0x5a, 0x9b,
+	0x27, 0xb8, 0xca, 0x20, 0x19, 0x6d, 0xa8, 0xc4,
+	0x46, 0x04, 0xb3, 0xe8, 0xf8, 0x66, 0x1b, 0x0a,
+	0xce, 0x76, 0x5d, 0x59, 0x58, 0x05, 0xee, 0x3e,
+	0x3c, 0x86, 0x5b, 0x49, 0x1c, 0x72, 0x18, 0x01,
+	0x62, 0x92, 0x0f, 0x3e, 0xd1, 0x57, 0x5e, 0x20,
+	0x7b, 0xfb, 0x4d, 0x3c, 0xc5, 0x35, 0x43, 0x2f,
+	0xb0, 0xc5, 0x7c, 0xe4, 0xa2, 0x84, 0x13, 0x77
+};
+static const u8 output69[] __initconst = {
+	0xbb, 0x4a, 0x7f, 0x7c, 0xd5, 0x2f, 0x89, 0x06,
+	0xec, 0x20, 0xf1, 0x9a, 0x11, 0x09, 0x14, 0x2e,
+	0x17, 0x50, 0xf9, 0xd5, 0xf5, 0x48, 0x7c, 0x7a,
+	0x55, 0xc0, 0x57, 0x03, 0xe3, 0xc4, 0xb2, 0xb7,
+	0x18, 0x47, 0x95, 0xde, 0xaf, 0x80, 0x06, 0x3c,
+	0x5a, 0xf2, 0xc3, 0x53, 0xe3, 0x29, 0x92, 0xf8,
+	0xff, 0x64, 0x85, 0xb9, 0xf7, 0xd3, 0x80, 0xd2,
+	0x0c, 0x5d, 0x7b, 0x57, 0x0c, 0x51, 0x79, 0x86,
+	0xf3, 0x20, 0xd2, 0xb8, 0x6e, 0x0c, 0x5a, 0xce,
+	0xeb, 0x88, 0x02, 0x8b, 0x82, 0x1b, 0x7f, 0xf5,
+	0xde, 0x7f, 0x48, 0x48, 0xdf, 0xa0, 0x55, 0xc6,
+	0x0c, 0x22, 0xa1, 0x80, 0x8d, 0x3b, 0xcb, 0x40,
+	0x2d, 0x3d, 0x0b, 0xf2, 0xe0, 0x22, 0x13, 0x99,
+	0xe1, 0xa7, 0x27, 0x68, 0x31, 0xe1, 0x24, 0x5d,
+	0xd2, 0xee, 0x16, 0xc1, 0xd7, 0xa8, 0x14, 0x19,
+	0x23, 0x72, 0x67, 0x27, 0xdc, 0x5e, 0xb9, 0xc7,
+	0xd8, 0xe3, 0x55, 0x50, 0x40, 0x98, 0x7b, 0xe7,
+	0x34, 0x1c, 0x3b, 0x18, 0x14, 0xd8, 0x62, 0xc1,
+	0x93, 0x84, 0xf3, 0x5b, 0xdd, 0x9e, 0x1f, 0x3b,
+	0x0b, 0xbc, 0x4e, 0x5b, 0x79, 0xa3, 0xca, 0x74,
+	0x2a, 0x98, 0xe8, 0x04, 0x39, 0xef, 0xc6, 0x76,
+	0x6d, 0xee, 0x9f, 0x67, 0x5b, 0x59, 0x3a, 0xe5,
+	0xf2, 0x3b, 0xca, 0x89, 0xe8, 0x9b, 0x03, 0x3d,
+	0x11, 0xd2, 0x4a, 0x70, 0xaf, 0x88, 0xb0, 0x94,
+	0x96, 0x26, 0xab, 0x3c, 0xc1, 0xb8, 0xe4, 0xe7,
+	0x14, 0x61, 0x64, 0x3a, 0x61, 0x08, 0x0f, 0xa9,
+	0xce, 0x64, 0xb2, 0x40, 0xf8, 0x20, 0x3a, 0xa9,
+	0x31, 0xbd, 0x7e, 0x16, 0xca, 0xf5, 0x62, 0x0f,
+	0x91, 0x9f, 0x8e, 0x1d, 0xa4, 0x77, 0xf3, 0x87,
+	0x61, 0xe8, 0x14, 0xde, 0x18, 0x68, 0x4e, 0x9d,
+	0x73, 0xcd, 0x8a, 0xe4, 0x80, 0x84, 0x23, 0xaa,
+	0x9d, 0x64, 0x1c, 0x80, 0x41, 0xca, 0x82, 0x40,
+	0x94, 0x55, 0xe3, 0x28, 0xa1, 0x97, 0x71, 0xba,
+	0xf2, 0x2c, 0x39, 0x62, 0x29, 0x56, 0xd0, 0xff,
+	0xb2, 0x82, 0x20, 0x59, 0x1f, 0xc3, 0x64, 0x57
+};
+static const u8 key69[] __initconst = {
+	0x19, 0x09, 0xe9, 0x7c, 0xd9, 0x02, 0x4a, 0x0c,
+	0x52, 0x25, 0xad, 0x5c, 0x2e, 0x8d, 0x86, 0x10,
+	0x85, 0x2b, 0xba, 0xa4, 0x44, 0x5b, 0x39, 0x3e,
+	0x18, 0xaa, 0xce, 0x0e, 0xe2, 0x69, 0x3c, 0xcf
+};
+enum { nonce69 = 0xdb925a1948f0f060ULL };
+
+static const u8 input70[] __initconst = {
+	0x10, 0xe7, 0x83, 0xcf, 0x42, 0x9f, 0xf2, 0x41,
+	0xc7, 0xe4, 0xdb, 0xf9, 0xa3, 0x02, 0x1d, 0x8d,
+	0x50, 0x81, 0x2c, 0x6b, 0x92, 0xe0, 0x4e, 0xea,
+	0x26, 0x83, 0x2a, 0xd0, 0x31, 0xf1, 0x23, 0xf3,
+	0x0e, 0x88, 0x14, 0x31, 0xf9, 0x01, 0x63, 0x59,
+	0x21, 0xd1, 0x8b, 0xdd, 0x06, 0xd0, 0xc6, 0xab,
+	0x91, 0x71, 0x82, 0x4d, 0xd4, 0x62, 0x37, 0x17,
+	0xf9, 0x50, 0xf9, 0xb5, 0x74, 0xce, 0x39, 0x80,
+	0x80, 0x78, 0xf8, 0xdc, 0x1c, 0xdb, 0x7c, 0x3d,
+	0xd4, 0x86, 0x31, 0x00, 0x75, 0x7b, 0xd1, 0x42,
+	0x9f, 0x1b, 0x97, 0x88, 0x0e, 0x14, 0x0e, 0x1e,
+	0x7d, 0x7b, 0xc4, 0xd2, 0xf3, 0xc1, 0x6d, 0x17,
+	0x5d, 0xc4, 0x75, 0x54, 0x0f, 0x38, 0x65, 0x89,
+	0xd8, 0x7d, 0xab, 0xc9, 0xa7, 0x0a, 0x21, 0x0b,
+	0x37, 0x12, 0x05, 0x07, 0xb5, 0x68, 0x32, 0x32,
+	0xb9, 0xf8, 0x97, 0x17, 0x03, 0xed, 0x51, 0x8f,
+	0x3d, 0x5a, 0xd0, 0x12, 0x01, 0x6e, 0x2e, 0x91,
+	0x1c, 0xbe, 0x6b, 0xa3, 0xcc, 0x75, 0x62, 0x06,
+	0x8e, 0x65, 0xbb, 0xe2, 0x29, 0x71, 0x4b, 0x89,
+	0x6a, 0x9d, 0x85, 0x8c, 0x8c, 0xdf, 0x94, 0x95,
+	0x23, 0x66, 0xf8, 0x92, 0xee, 0x56, 0xeb, 0xb3,
+	0xeb, 0xd2, 0x4a, 0x3b, 0x77, 0x8a, 0x6e, 0xf6,
+	0xca, 0xd2, 0x34, 0x00, 0xde, 0xbe, 0x1d, 0x7a,
+	0x73, 0xef, 0x2b, 0x80, 0x56, 0x16, 0x29, 0xbf,
+	0x6e, 0x33, 0xed, 0x0d, 0xe2, 0x02, 0x60, 0x74,
+	0xe9, 0x0a, 0xbc, 0xd1, 0xc5, 0xe8, 0x53, 0x02,
+	0x79, 0x0f, 0x25, 0x0c, 0xef, 0xab, 0xd3, 0xbc,
+	0xb7, 0xfc, 0xf3, 0xb0, 0x34, 0xd1, 0x07, 0xd2,
+	0x5a, 0x31, 0x1f, 0xec, 0x1f, 0x87, 0xed, 0xdd,
+	0x6a, 0xc1, 0xe8, 0xb3, 0x25, 0x4c, 0xc6, 0x9b,
+	0x91, 0x73, 0xec, 0x06, 0x73, 0x9e, 0x57, 0x65,
+	0x32, 0x75, 0x11, 0x74, 0x6e, 0xa4, 0x7d, 0x0d,
+	0x74, 0x9f, 0x51, 0x10, 0x10, 0x47, 0xc9, 0x71,
+	0x6e, 0x97, 0xae, 0x44, 0x41, 0xef, 0x98, 0x78,
+	0xf4, 0xc5, 0xbd, 0x5e, 0x00, 0xe5, 0xfd, 0xe2,
+	0xbe, 0x8c, 0xc2, 0xae, 0xc2, 0xee, 0x59, 0xf6,
+	0xcb, 0x20, 0x54, 0x84, 0xc3, 0x31, 0x7e, 0x67,
+	0x71, 0xb6, 0x76, 0xbe, 0x81, 0x8f, 0x82, 0xad,
+	0x01, 0x8f, 0xc4, 0x00, 0x04, 0x3d, 0x8d, 0x34,
+	0xaa, 0xea, 0xc0, 0xea, 0x91, 0x42, 0xb6, 0xb8,
+	0x43, 0xf3, 0x17, 0xb2, 0x73, 0x64, 0x82, 0x97,
+	0xd5, 0xc9, 0x07, 0x77, 0xb1, 0x26, 0xe2, 0x00,
+	0x6a, 0xae, 0x70, 0x0b, 0xbe, 0xe6, 0xb8, 0x42,
+	0x81, 0x55, 0xf7, 0xb8, 0x96, 0x41, 0x9d, 0xd4,
+	0x2c, 0x27, 0x00, 0xcc, 0x91, 0x28, 0x22, 0xa4,
+	0x7b, 0x42, 0x51, 0x9e, 0xd6, 0xec, 0xf3, 0x6b,
+	0x00, 0xff, 0x5c, 0xa2, 0xac, 0x47, 0x33, 0x2d,
+	0xf8, 0x11, 0x65, 0x5f, 0x4d, 0x79, 0x8b, 0x4f,
+	0xad, 0xf0, 0x9d, 0xcd, 0xb9, 0x7b, 0x08, 0xf7,
+	0x32, 0x51, 0xfa, 0x39, 0xaa, 0x78, 0x05, 0xb1,
+	0xf3, 0x5d, 0xe8, 0x7c, 0x8e, 0x4f, 0xa2, 0xe0,
+	0x98, 0x0c, 0xb2, 0xa7, 0xf0, 0x35, 0x8e, 0x70,
+	0x7c, 0x82, 0xf3, 0x1b, 0x26, 0x28, 0x12, 0xe5,
+	0x23, 0x57, 0xe4, 0xb4, 0x9b, 0x00, 0x39, 0x97,
+	0xef, 0x7c, 0x46, 0x9b, 0x34, 0x6b, 0xe7, 0x0e,
+	0xa3, 0x2a, 0x18, 0x11, 0x64, 0xc6, 0x7c, 0x8b,
+	0x06, 0x02, 0xf5, 0x69, 0x76, 0xf9, 0xaa, 0x09,
+	0x5f, 0x68, 0xf8, 0x4a, 0x79, 0x58, 0xec, 0x37,
+	0xcf, 0x3a, 0xcc, 0x97, 0x70, 0x1d, 0x3e, 0x52,
+	0x18, 0x0a, 0xad, 0x28, 0x5b, 0x3b, 0xe9, 0x03,
+	0x84, 0xe9, 0x68, 0x50, 0xce, 0xc4, 0xbc, 0x3e,
+	0x21, 0xad, 0x63, 0xfe, 0xc6, 0xfd, 0x6e, 0x69,
+	0x84, 0xa9, 0x30, 0xb1, 0x7a, 0xc4, 0x31, 0x10,
+	0xc1, 0x1f, 0x6e, 0xeb, 0xa5, 0xa6, 0x01
+};
+static const u8 output70[] __initconst = {
+	0x0f, 0x93, 0x2a, 0x20, 0xb3, 0x87, 0x2d, 0xce,
+	0xd1, 0x3b, 0x30, 0xfd, 0x06, 0x6d, 0x0a, 0xaa,
+	0x3e, 0xc4, 0x29, 0x02, 0x8a, 0xde, 0xa6, 0x4b,
+	0x45, 0x1b, 0x4f, 0x25, 0x59, 0xd5, 0x56, 0x6a,
+	0x3b, 0x37, 0xbd, 0x3e, 0x47, 0x12, 0x2c, 0x4e,
+	0x60, 0x5f, 0x05, 0x75, 0x61, 0x23, 0x05, 0x74,
+	0xcb, 0xfc, 0x5a, 0xb3, 0xac, 0x5c, 0x3d, 0xab,
+	0x52, 0x5f, 0x05, 0xbc, 0x57, 0xc0, 0x7e, 0xcf,
+	0x34, 0x5d, 0x7f, 0x41, 0xa3, 0x17, 0x78, 0xd5,
+	0x9f, 0xec, 0x0f, 0x1e, 0xf9, 0xfe, 0xa3, 0xbd,
+	0x28, 0xb0, 0xba, 0x4d, 0x84, 0xdb, 0xae, 0x8f,
+	0x1d, 0x98, 0xb7, 0xdc, 0xf9, 0xad, 0x55, 0x9c,
+	0x89, 0xfe, 0x9b, 0x9c, 0xa9, 0x89, 0xf6, 0x97,
+	0x9c, 0x3f, 0x09, 0x3e, 0xc6, 0x02, 0xc2, 0x55,
+	0x58, 0x09, 0x54, 0x66, 0xe4, 0x36, 0x81, 0x35,
+	0xca, 0x88, 0x17, 0x89, 0x80, 0x24, 0x2b, 0x21,
+	0x89, 0xee, 0x45, 0x5a, 0xe7, 0x1f, 0xd5, 0xa5,
+	0x16, 0xa4, 0xda, 0x70, 0x7e, 0xe9, 0x4f, 0x24,
+	0x61, 0x97, 0xab, 0xa0, 0xe0, 0xe7, 0xb8, 0x5c,
+	0x0f, 0x25, 0x17, 0x37, 0x75, 0x12, 0xb5, 0x40,
+	0xde, 0x1c, 0x0d, 0x8a, 0x77, 0x62, 0x3c, 0x86,
+	0xd9, 0x70, 0x2e, 0x96, 0x30, 0xd2, 0x55, 0xb3,
+	0x6b, 0xc3, 0xf2, 0x9c, 0x47, 0xf3, 0x3a, 0x24,
+	0x52, 0xc6, 0x38, 0xd8, 0x22, 0xb3, 0x0c, 0xfd,
+	0x2f, 0xa3, 0x3c, 0xb5, 0xe8, 0x26, 0xe1, 0xa3,
+	0xad, 0xb0, 0x82, 0x17, 0xc1, 0x53, 0xb8, 0x34,
+	0x48, 0xee, 0x39, 0xae, 0x51, 0x43, 0xec, 0x82,
+	0xce, 0x87, 0xc6, 0x76, 0xb9, 0x76, 0xd3, 0x53,
+	0xfe, 0x49, 0x24, 0x7d, 0x02, 0x42, 0x2b, 0x72,
+	0xfb, 0xcb, 0xd8, 0x96, 0x02, 0xc6, 0x9a, 0x20,
+	0xf3, 0x5a, 0x67, 0xe8, 0x13, 0xf8, 0xb2, 0xcb,
+	0xa2, 0xec, 0x18, 0x20, 0x4a, 0xb0, 0x73, 0x53,
+	0x21, 0xb0, 0x77, 0x53, 0xd8, 0x76, 0xa1, 0x30,
+	0x17, 0x72, 0x2e, 0x33, 0x5f, 0x33, 0x6b, 0x28,
+	0xfb, 0xb0, 0xf4, 0xec, 0x8e, 0xed, 0x20, 0x7d,
+	0x57, 0x8c, 0x74, 0x28, 0x64, 0x8b, 0xeb, 0x59,
+	0x38, 0x3f, 0xe7, 0x83, 0x2e, 0xe5, 0x64, 0x4d,
+	0x5c, 0x1f, 0xe1, 0x3b, 0xd9, 0x84, 0xdb, 0xc9,
+	0xec, 0xd8, 0xc1, 0x7c, 0x1f, 0x1b, 0x68, 0x35,
+	0xc6, 0x34, 0x10, 0xef, 0x19, 0xc9, 0x0a, 0xd6,
+	0x43, 0x7f, 0xa6, 0xcb, 0x9d, 0xf4, 0xf0, 0x16,
+	0xb1, 0xb1, 0x96, 0x64, 0xec, 0x8d, 0x22, 0x4c,
+	0x4b, 0xe8, 0x1a, 0xba, 0x6f, 0xb7, 0xfc, 0xa5,
+	0x69, 0x3e, 0xad, 0x78, 0x79, 0x19, 0xb5, 0x04,
+	0x69, 0xe5, 0x3f, 0xff, 0x60, 0x8c, 0xda, 0x0b,
+	0x7b, 0xf7, 0xe7, 0xe6, 0x29, 0x3a, 0x85, 0xba,
+	0xb5, 0xb0, 0x35, 0xbd, 0x38, 0xce, 0x34, 0x5e,
+	0xf2, 0xdc, 0xd1, 0x8f, 0xc3, 0x03, 0x24, 0xa2,
+	0x03, 0xf7, 0x4e, 0x49, 0x5b, 0xcf, 0x6d, 0xb0,
+	0xeb, 0xe3, 0x30, 0x28, 0xd5, 0x5b, 0x82, 0x5f,
+	0xe4, 0x7c, 0x1e, 0xec, 0xd2, 0x39, 0xf9, 0x6f,
+	0x2e, 0xb3, 0xcd, 0x01, 0xb1, 0x67, 0xaa, 0xea,
+	0xaa, 0xb3, 0x63, 0xaf, 0xd9, 0xb2, 0x1f, 0xba,
+	0x05, 0x20, 0xeb, 0x19, 0x32, 0xf0, 0x6c, 0x3f,
+	0x40, 0xcc, 0x93, 0xb3, 0xd8, 0x25, 0xa6, 0xe4,
+	0xce, 0xd7, 0x7e, 0x48, 0x99, 0x65, 0x7f, 0x86,
+	0xc5, 0xd4, 0x79, 0x6b, 0xab, 0x43, 0xb8, 0x6b,
+	0xf1, 0x2f, 0xea, 0x4c, 0x5e, 0xf0, 0x3b, 0xb4,
+	0xb8, 0xb0, 0x94, 0x0c, 0x6b, 0xe7, 0x22, 0x93,
+	0xaa, 0x01, 0xcb, 0xf1, 0x11, 0x60, 0xf6, 0x69,
+	0xcf, 0x14, 0xde, 0xfb, 0x90, 0x05, 0x27, 0x0c,
+	0x1a, 0x9e, 0xf0, 0xb4, 0xc6, 0xa1, 0xe8, 0xdd,
+	0xd0, 0x4c, 0x25, 0x4f, 0x9c, 0xb7, 0xb1, 0xb0,
+	0x21, 0xdb, 0x87, 0x09, 0x03, 0xf2, 0xb3
+};
+static const u8 key70[] __initconst = {
+	0x3b, 0x5b, 0x59, 0x36, 0x44, 0xd1, 0xba, 0x71,
+	0x55, 0x87, 0x4d, 0x62, 0x3d, 0xc2, 0xfc, 0xaa,
+	0x3f, 0x4e, 0x1a, 0xe4, 0xca, 0x09, 0xfc, 0x6a,
+	0xb2, 0xd6, 0x5d, 0x79, 0xf9, 0x1a, 0x91, 0xa7
+};
+enum { nonce70 = 0x3fd6786dd147a85ULL };
+
+static const u8 input71[] __initconst = {
+	0x18, 0x78, 0xd6, 0x79, 0xe4, 0x9a, 0x6c, 0x73,
+	0x17, 0xd4, 0x05, 0x0f, 0x1e, 0x9f, 0xd9, 0x2b,
+	0x86, 0x48, 0x7d, 0xf4, 0xd9, 0x1c, 0x76, 0xfc,
+	0x8e, 0x22, 0x34, 0xe1, 0x48, 0x4a, 0x8d, 0x79,
+	0xb7, 0xbb, 0x88, 0xab, 0x90, 0xde, 0xc5, 0xb4,
+	0xb4, 0xe7, 0x85, 0x49, 0xda, 0x57, 0xeb, 0xc9,
+	0xcd, 0x21, 0xfc, 0x45, 0x6e, 0x32, 0x67, 0xf2,
+	0x4f, 0xa6, 0x54, 0xe5, 0x20, 0xed, 0xcf, 0xc6,
+	0x62, 0x25, 0x8e, 0x00, 0xf8, 0x6b, 0xa2, 0x80,
+	0xac, 0x88, 0xa6, 0x59, 0x27, 0x83, 0x95, 0x11,
+	0x3f, 0x70, 0x5e, 0x3f, 0x11, 0xfb, 0x26, 0xbf,
+	0xe1, 0x48, 0x75, 0xf9, 0x86, 0xbf, 0xa6, 0x5d,
+	0x15, 0x61, 0x66, 0xbf, 0x78, 0x8f, 0x6b, 0x9b,
+	0xda, 0x98, 0xb7, 0x19, 0xe2, 0xf2, 0xa3, 0x9c,
+	0x7c, 0x6a, 0x9a, 0xd8, 0x3d, 0x4c, 0x2c, 0xe1,
+	0x09, 0xb4, 0x28, 0x82, 0x4e, 0xab, 0x0c, 0x75,
+	0x63, 0xeb, 0xbc, 0xd0, 0x71, 0xa2, 0x73, 0x85,
+	0xed, 0x53, 0x7a, 0x3f, 0x68, 0x9f, 0xd0, 0xa9,
+	0x00, 0x5a, 0x9e, 0x80, 0x55, 0x00, 0xe6, 0xae,
+	0x0c, 0x03, 0x40, 0xed, 0xfc, 0x68, 0x4a, 0xb7,
+	0x1e, 0x09, 0x65, 0x30, 0x5a, 0x3d, 0x97, 0x4d,
+	0x5e, 0x51, 0x8e, 0xda, 0xc3, 0x55, 0x8c, 0xfb,
+	0xcf, 0x83, 0x05, 0x35, 0x0d, 0x08, 0x1b, 0xf3,
+	0x3a, 0x57, 0x96, 0xac, 0x58, 0x8b, 0xfa, 0x00,
+	0x49, 0x15, 0x78, 0xd2, 0x4b, 0xed, 0xb8, 0x59,
+	0x78, 0x9b, 0x7f, 0xaa, 0xfc, 0xe7, 0x46, 0xdc,
+	0x7b, 0x34, 0xd0, 0x34, 0xe5, 0x10, 0xff, 0x4d,
+	0x5a, 0x4d, 0x60, 0xa7, 0x16, 0x54, 0xc4, 0xfd,
+	0xca, 0x5d, 0x68, 0xc7, 0x4a, 0x01, 0x8d, 0x7f,
+	0x74, 0x5d, 0xff, 0xb8, 0x37, 0x15, 0x62, 0xfa,
+	0x44, 0x45, 0xcf, 0x77, 0x3b, 0x1d, 0xb2, 0xd2,
+	0x0d, 0x42, 0x00, 0x39, 0x68, 0x1f, 0xcc, 0x89,
+	0x73, 0x5d, 0xa9, 0x2e, 0xfd, 0x58, 0x62, 0xca,
+	0x35, 0x8e, 0x70, 0x70, 0xaa, 0x6e, 0x14, 0xe9,
+	0xa4, 0xe2, 0x10, 0x66, 0x71, 0xdc, 0x4c, 0xfc,
+	0xa9, 0xdc, 0x8f, 0x57, 0x4d, 0xc5, 0xac, 0xd7,
+	0xa9, 0xf3, 0xf3, 0xa1, 0xff, 0x62, 0xa0, 0x8f,
+	0xe4, 0x96, 0x3e, 0xcb, 0x9f, 0x76, 0x42, 0x39,
+	0x1f, 0x24, 0xfd, 0xfd, 0x79, 0xe8, 0x27, 0xdf,
+	0xa8, 0xf6, 0x33, 0x8b, 0x31, 0x59, 0x69, 0xcf,
+	0x6a, 0xef, 0x89, 0x4d, 0xa7, 0xf6, 0x7e, 0x97,
+	0x14, 0xbd, 0xda, 0xdd, 0xb4, 0x84, 0x04, 0x24,
+	0xe0, 0x17, 0xe1, 0x0f, 0x1f, 0x8a, 0x6a, 0x71,
+	0x74, 0x41, 0xdc, 0x59, 0x5c, 0x8f, 0x01, 0x25,
+	0x92, 0xf0, 0x2e, 0x15, 0x62, 0x71, 0x9a, 0x9f,
+	0x87, 0xdf, 0x62, 0x49, 0x7f, 0x86, 0x62, 0xfc,
+	0x20, 0x84, 0xd7, 0xe3, 0x3a, 0xd9, 0x37, 0x85,
+	0xb7, 0x84, 0x5a, 0xf9, 0xed, 0x21, 0x32, 0x94,
+	0x3e, 0x04, 0xe7, 0x8c, 0x46, 0x76, 0x21, 0x67,
+	0xf6, 0x95, 0x64, 0x92, 0xb7, 0x15, 0xf6, 0xe3,
+	0x41, 0x27, 0x9d, 0xd7, 0xe3, 0x79, 0x75, 0x92,
+	0xd0, 0xc1, 0xf3, 0x40, 0x92, 0x08, 0xde, 0x90,
+	0x22, 0x82, 0xb2, 0x69, 0xae, 0x1a, 0x35, 0x11,
+	0x89, 0xc8, 0x06, 0x82, 0x95, 0x23, 0x44, 0x08,
+	0x22, 0xf2, 0x71, 0x73, 0x1b, 0x88, 0x11, 0xcf,
+	0x1c, 0x7e, 0x8a, 0x2e, 0xdc, 0x79, 0x57, 0xce,
+	0x1f, 0xe7, 0x6c, 0x07, 0xd8, 0x06, 0xbe, 0xec,
+	0xa3, 0xcf, 0xf9, 0x68, 0xa5, 0xb8, 0xf0, 0xe3,
+	0x3f, 0x01, 0x92, 0xda, 0xf1, 0xa0, 0x2d, 0x7b,
+	0xab, 0x57, 0x58, 0x2a, 0xaf, 0xab, 0xbd, 0xf2,
+	0xe5, 0xaf, 0x7e, 0x1f, 0x46, 0x24, 0x9e, 0x20,
+	0x22, 0x0f, 0x84, 0x4c, 0xb7, 0xd8, 0x03, 0xe8,
+	0x09, 0x73, 0x6c, 0xc6, 0x9b, 0x90, 0xe0, 0xdb,
+	0xf2, 0x71, 0xba, 0xad, 0xb3, 0xec, 0xda, 0x7a
+};
+static const u8 output71[] __initconst = {
+	0x28, 0xc5, 0x9b, 0x92, 0xf9, 0x21, 0x4f, 0xbb,
+	0xef, 0x3b, 0xf0, 0xf5, 0x3a, 0x6d, 0x7f, 0xd6,
+	0x6a, 0x8d, 0xa1, 0x01, 0x5c, 0x62, 0x20, 0x8b,
+	0x5b, 0x39, 0xd5, 0xd3, 0xc2, 0xf6, 0x9d, 0x5e,
+	0xcc, 0xe1, 0xa2, 0x61, 0x16, 0xe2, 0xce, 0xe9,
+	0x86, 0xd0, 0xfc, 0xce, 0x9a, 0x28, 0x27, 0xc4,
+	0x0c, 0xb9, 0xaa, 0x8d, 0x48, 0xdb, 0xbf, 0x82,
+	0x7d, 0xd0, 0x35, 0xc4, 0x06, 0x34, 0xb4, 0x19,
+	0x51, 0x73, 0xf4, 0x7a, 0xf4, 0xfd, 0xe9, 0x1d,
+	0xdc, 0x0f, 0x7e, 0xf7, 0x96, 0x03, 0xe3, 0xb1,
+	0x2e, 0x22, 0x59, 0xb7, 0x6d, 0x1c, 0x97, 0x8c,
+	0xd7, 0x31, 0x08, 0x26, 0x4c, 0x6d, 0xc6, 0x14,
+	0xa5, 0xeb, 0x45, 0x6a, 0x88, 0xa3, 0xa2, 0x36,
+	0xc4, 0x35, 0xb1, 0x5a, 0xa0, 0xad, 0xf7, 0x06,
+	0x9b, 0x5d, 0xc1, 0x15, 0xc1, 0xce, 0x0a, 0xb0,
+	0x57, 0x2e, 0x3f, 0x6f, 0x0d, 0x10, 0xd9, 0x11,
+	0x2c, 0x9c, 0xad, 0x2d, 0xa5, 0x81, 0xfb, 0x4e,
+	0x8f, 0xd5, 0x32, 0x4e, 0xaf, 0x5c, 0xc1, 0x86,
+	0xde, 0x56, 0x5a, 0x33, 0x29, 0xf7, 0x67, 0xc6,
+	0x37, 0x6f, 0xb2, 0x37, 0x4e, 0xd4, 0x69, 0x79,
+	0xaf, 0xd5, 0x17, 0x79, 0xe0, 0xba, 0x62, 0xa3,
+	0x68, 0xa4, 0x87, 0x93, 0x8d, 0x7e, 0x8f, 0xa3,
+	0x9c, 0xef, 0xda, 0xe3, 0xa5, 0x1f, 0xcd, 0x30,
+	0xa6, 0x55, 0xac, 0x4c, 0x69, 0x74, 0x02, 0xc7,
+	0x5d, 0x95, 0x81, 0x4a, 0x68, 0x11, 0xd3, 0xa9,
+	0x98, 0xb1, 0x0b, 0x0d, 0xae, 0x40, 0x86, 0x65,
+	0xbf, 0xcc, 0x2d, 0xef, 0x57, 0xca, 0x1f, 0xe4,
+	0x34, 0x4e, 0xa6, 0x5e, 0x82, 0x6e, 0x61, 0xad,
+	0x0b, 0x3c, 0xf8, 0xeb, 0x01, 0x43, 0x7f, 0x87,
+	0xa2, 0xa7, 0x6a, 0xe9, 0x62, 0x23, 0x24, 0x61,
+	0xf1, 0xf7, 0x36, 0xdb, 0x10, 0xe5, 0x57, 0x72,
+	0x3a, 0xc2, 0xae, 0xcc, 0x75, 0xc7, 0x80, 0x05,
+	0x0a, 0x5c, 0x4c, 0x95, 0xda, 0x02, 0x01, 0x14,
+	0x06, 0x6b, 0x5c, 0x65, 0xc2, 0xb8, 0x4a, 0xd6,
+	0xd3, 0xb4, 0xd8, 0x12, 0x52, 0xb5, 0x60, 0xd3,
+	0x8e, 0x5f, 0x5c, 0x76, 0x33, 0x7a, 0x05, 0xe5,
+	0xcb, 0xef, 0x4f, 0x89, 0xf1, 0xba, 0x32, 0x6f,
+	0x33, 0xcd, 0x15, 0x8d, 0xa3, 0x0c, 0x3f, 0x63,
+	0x11, 0xe7, 0x0e, 0xe0, 0x00, 0x01, 0xe9, 0xe8,
+	0x8e, 0x36, 0x34, 0x8d, 0x96, 0xb5, 0x03, 0xcf,
+	0x55, 0x62, 0x49, 0x7a, 0x34, 0x44, 0xa5, 0xee,
+	0x8c, 0x46, 0x06, 0x22, 0xab, 0x1d, 0x53, 0x9c,
+	0xa1, 0xf9, 0x67, 0x18, 0x57, 0x89, 0xf9, 0xc2,
+	0xd1, 0x7e, 0xbe, 0x36, 0x40, 0xcb, 0xe9, 0x04,
+	0xde, 0xb1, 0x3b, 0x29, 0x52, 0xc5, 0x9a, 0xb5,
+	0xa2, 0x7c, 0x7b, 0xfe, 0xe5, 0x92, 0x73, 0xea,
+	0xea, 0x7b, 0xba, 0x0a, 0x8c, 0x88, 0x15, 0xe6,
+	0x53, 0xbf, 0x1c, 0x33, 0xf4, 0x9b, 0x9a, 0x5e,
+	0x8d, 0xae, 0x60, 0xdc, 0xcb, 0x5d, 0xfa, 0xbe,
+	0x06, 0xc3, 0x3f, 0x06, 0xe7, 0x00, 0x40, 0x7b,
+	0xaa, 0x94, 0xfa, 0x6d, 0x1f, 0xe4, 0xc5, 0xa9,
+	0x1b, 0x5f, 0x36, 0xea, 0x5a, 0xdd, 0xa5, 0x48,
+	0x6a, 0x55, 0xd2, 0x47, 0x28, 0xbf, 0x96, 0xf1,
+	0x9f, 0xb6, 0x11, 0x4b, 0xd3, 0x44, 0x7d, 0x48,
+	0x41, 0x61, 0xdb, 0x12, 0xd4, 0xc2, 0x59, 0x82,
+	0x4c, 0x47, 0x5c, 0x04, 0xf6, 0x7b, 0xd3, 0x92,
+	0x2e, 0xe8, 0x40, 0xef, 0x15, 0x32, 0x97, 0xdc,
+	0x35, 0x4c, 0x6e, 0xa4, 0x97, 0xe9, 0x24, 0xde,
+	0x63, 0x8b, 0xb1, 0x6b, 0x48, 0xbb, 0x46, 0x1f,
+	0x84, 0xd6, 0x17, 0xb0, 0x5a, 0x4a, 0x4e, 0xd5,
+	0x31, 0xd7, 0xcf, 0xa0, 0x39, 0xc6, 0x2e, 0xfc,
+	0xa6, 0xa3, 0xd3, 0x0f, 0xa4, 0x28, 0xac, 0xb2,
+	0xf4, 0x48, 0x8d, 0x50, 0xa5, 0x1c, 0x44, 0x5d,
+	0x6e, 0x38, 0xb7, 0x2b, 0x8a, 0x45, 0xa7, 0x3d
+};
+static const u8 key71[] __initconst = {
+	0x8b, 0x68, 0xc4, 0xb7, 0x0d, 0x81, 0xef, 0x52,
+	0x1e, 0x05, 0x96, 0x72, 0x62, 0x89, 0x27, 0x83,
+	0xd0, 0xc7, 0x33, 0x6d, 0xf2, 0xcc, 0x69, 0xf9,
+	0x23, 0xae, 0x99, 0xb1, 0xd1, 0x05, 0x4e, 0x54
+};
+enum { nonce71 = 0x983f03656d64b5f6ULL };
+
+static const u8 input72[] __initconst = {
+	0x6b, 0x09, 0xc9, 0x57, 0x3d, 0x79, 0x04, 0x8c,
+	0x65, 0xad, 0x4a, 0x0f, 0xa1, 0x31, 0x3a, 0xdd,
+	0x14, 0x8e, 0xe8, 0xfe, 0xbf, 0x42, 0x87, 0x98,
+	0x2e, 0x8d, 0x83, 0xa3, 0xf8, 0x55, 0x3d, 0x84,
+	0x1e, 0x0e, 0x05, 0x4a, 0x38, 0x9e, 0xe7, 0xfe,
+	0xd0, 0x4d, 0x79, 0x74, 0x3a, 0x0b, 0x9b, 0xe1,
+	0xfd, 0x51, 0x84, 0x4e, 0xb2, 0x25, 0xe4, 0x64,
+	0x4c, 0xda, 0xcf, 0x46, 0xec, 0xba, 0x12, 0xeb,
+	0x5a, 0x33, 0x09, 0x6e, 0x78, 0x77, 0x8f, 0x30,
+	0xb1, 0x7d, 0x3f, 0x60, 0x8c, 0xf2, 0x1d, 0x8e,
+	0xb4, 0x70, 0xa2, 0x90, 0x7c, 0x79, 0x1a, 0x2c,
+	0xf6, 0x28, 0x79, 0x7c, 0x53, 0xc5, 0xfa, 0xcc,
+	0x65, 0x9b, 0xe1, 0x51, 0xd1, 0x7f, 0x1d, 0xc4,
+	0xdb, 0xd4, 0xd9, 0x04, 0x61, 0x7d, 0xbe, 0x12,
+	0xfc, 0xcd, 0xaf, 0xe4, 0x0f, 0x9c, 0x20, 0xb5,
+	0x22, 0x40, 0x18, 0xda, 0xe4, 0xda, 0x8c, 0x2d,
+	0x84, 0xe3, 0x5f, 0x53, 0x17, 0xed, 0x78, 0xdc,
+	0x2f, 0xe8, 0x31, 0xc7, 0xe6, 0x39, 0x71, 0x40,
+	0xb4, 0x0f, 0xc9, 0xa9, 0x7e, 0x78, 0x87, 0xc1,
+	0x05, 0x78, 0xbb, 0x01, 0xf2, 0x8f, 0x33, 0xb0,
+	0x6e, 0x84, 0xcd, 0x36, 0x33, 0x5c, 0x5b, 0x8e,
+	0xf1, 0xac, 0x30, 0xfe, 0x33, 0xec, 0x08, 0xf3,
+	0x7e, 0xf2, 0xf0, 0x4c, 0xf2, 0xad, 0xd8, 0xc1,
+	0xd4, 0x4e, 0x87, 0x06, 0xd4, 0x75, 0xe7, 0xe3,
+	0x09, 0xd3, 0x4d, 0xe3, 0x21, 0x32, 0xba, 0xb4,
+	0x68, 0x68, 0xcb, 0x4c, 0xa3, 0x1e, 0xb3, 0x87,
+	0x7b, 0xd3, 0x0c, 0x63, 0x37, 0x71, 0x79, 0xfb,
+	0x58, 0x36, 0x57, 0x0f, 0x34, 0x1d, 0xc1, 0x42,
+	0x02, 0x17, 0xe7, 0xed, 0xe8, 0xe7, 0x76, 0xcb,
+	0x42, 0xc4, 0x4b, 0xe2, 0xb2, 0x5e, 0x42, 0xd5,
+	0xec, 0x9d, 0xc1, 0x32, 0x71, 0xe4, 0xeb, 0x10,
+	0x68, 0x1a, 0x6e, 0x99, 0x8e, 0x73, 0x12, 0x1f,
+	0x97, 0x0c, 0x9e, 0xcd, 0x02, 0x3e, 0x4c, 0xa0,
+	0xf2, 0x8d, 0xe5, 0x44, 0xca, 0x6d, 0xfe, 0x07,
+	0xe3, 0xe8, 0x9b, 0x76, 0xc1, 0x6d, 0xb7, 0x6e,
+	0x0d, 0x14, 0x00, 0x6f, 0x8a, 0xfd, 0x43, 0xc6,
+	0x43, 0xa5, 0x9c, 0x02, 0x47, 0x10, 0xd4, 0xb4,
+	0x9b, 0x55, 0x67, 0xc8, 0x7f, 0xc1, 0x8a, 0x1f,
+	0x1e, 0xd1, 0xbc, 0x99, 0x5d, 0x50, 0x4f, 0x89,
+	0xf1, 0xe6, 0x5d, 0x91, 0x40, 0xdc, 0x20, 0x67,
+	0x56, 0xc2, 0xef, 0xbd, 0x2c, 0xa2, 0x99, 0x38,
+	0xe0, 0x45, 0xec, 0x44, 0x05, 0x52, 0x65, 0x11,
+	0xfc, 0x3b, 0x19, 0xcb, 0x71, 0xc2, 0x8e, 0x0e,
+	0x03, 0x2a, 0x03, 0x3b, 0x63, 0x06, 0x31, 0x9a,
+	0xac, 0x53, 0x04, 0x14, 0xd4, 0x80, 0x9d, 0x6b,
+	0x42, 0x7e, 0x7e, 0x4e, 0xdc, 0xc7, 0x01, 0x49,
+	0x9f, 0xf5, 0x19, 0x86, 0x13, 0x28, 0x2b, 0xa6,
+	0xa6, 0xbe, 0xa1, 0x7e, 0x71, 0x05, 0x00, 0xff,
+	0x59, 0x2d, 0xb6, 0x63, 0xf0, 0x1e, 0x2e, 0x69,
+	0x9b, 0x85, 0xf1, 0x1e, 0x8a, 0x64, 0x39, 0xab,
+	0x00, 0x12, 0xe4, 0x33, 0x4b, 0xb5, 0xd8, 0xb3,
+	0x6b, 0x5b, 0x8b, 0x5c, 0xd7, 0x6f, 0x23, 0xcf,
+	0x3f, 0x2e, 0x5e, 0x47, 0xb9, 0xb8, 0x1f, 0xf0,
+	0x1d, 0xda, 0xe7, 0x4f, 0x6e, 0xab, 0xc3, 0x36,
+	0xb4, 0x74, 0x6b, 0xeb, 0xc7, 0x5d, 0x91, 0xe5,
+	0xda, 0xf2, 0xc2, 0x11, 0x17, 0x48, 0xf8, 0x9c,
+	0xc9, 0x8b, 0xc1, 0xa2, 0xf4, 0xcd, 0x16, 0xf8,
+	0x27, 0xd9, 0x6c, 0x6f, 0xb5, 0x8f, 0x77, 0xca,
+	0x1b, 0xd8, 0xef, 0x84, 0x68, 0x71, 0x53, 0xc1,
+	0x43, 0x0f, 0x9f, 0x98, 0xae, 0x7e, 0x31, 0xd2,
+	0x98, 0xfb, 0x20, 0xa2, 0xad, 0x00, 0x10, 0x83,
+	0x00, 0x8b, 0xeb, 0x56, 0xd2, 0xc4, 0xcc, 0x7f,
+	0x2f, 0x4e, 0xfa, 0x88, 0x13, 0xa4, 0x2c, 0xde,
+	0x6b, 0x77, 0x86, 0x10, 0x6a, 0xab, 0x43, 0x0a,
+	0x02
+};
+static const u8 output72[] __initconst = {
+	0x42, 0x89, 0xa4, 0x80, 0xd2, 0xcb, 0x5f, 0x7f,
+	0x2a, 0x1a, 0x23, 0x00, 0xa5, 0x6a, 0x95, 0xa3,
+	0x9a, 0x41, 0xa1, 0xd0, 0x2d, 0x1e, 0xd6, 0x13,
+	0x34, 0x40, 0x4e, 0x7f, 0x1a, 0xbe, 0xa0, 0x3d,
+	0x33, 0x9c, 0x56, 0x2e, 0x89, 0x25, 0x45, 0xf9,
+	0xf0, 0xba, 0x9c, 0x6d, 0xd1, 0xd1, 0xde, 0x51,
+	0x47, 0x63, 0xc9, 0xbd, 0xfa, 0xa2, 0x9e, 0xad,
+	0x6a, 0x7b, 0x21, 0x1a, 0x6c, 0x3e, 0xff, 0x46,
+	0xbe, 0xf3, 0x35, 0x7a, 0x6e, 0xb3, 0xb9, 0xf7,
+	0xda, 0x5e, 0xf0, 0x14, 0xb5, 0x70, 0xa4, 0x2b,
+	0xdb, 0xbb, 0xc7, 0x31, 0x4b, 0x69, 0x5a, 0x83,
+	0x70, 0xd9, 0x58, 0xd4, 0x33, 0x84, 0x23, 0xf0,
+	0xae, 0xbb, 0x6d, 0x26, 0x7c, 0xc8, 0x30, 0xf7,
+	0x24, 0xad, 0xbd, 0xe4, 0x2c, 0x38, 0x38, 0xac,
+	0xe1, 0x4a, 0x9b, 0xac, 0x33, 0x0e, 0x4a, 0xf4,
+	0x93, 0xed, 0x07, 0x82, 0x81, 0x4f, 0x8f, 0xb1,
+	0xdd, 0x73, 0xd5, 0x50, 0x6d, 0x44, 0x1e, 0xbe,
+	0xa7, 0xcd, 0x17, 0x57, 0xd5, 0x3b, 0x62, 0x36,
+	0xcf, 0x7d, 0xc8, 0xd8, 0xd1, 0x78, 0xd7, 0x85,
+	0x46, 0x76, 0x5d, 0xcc, 0xfe, 0xe8, 0x94, 0xc5,
+	0xad, 0xbc, 0x5e, 0xbc, 0x8d, 0x1d, 0xdf, 0x03,
+	0xc9, 0x6b, 0x1b, 0x81, 0xd1, 0xb6, 0x5a, 0x24,
+	0xe3, 0xdc, 0x3f, 0x20, 0xc9, 0x07, 0x73, 0x4c,
+	0x43, 0x13, 0x87, 0x58, 0x34, 0x0d, 0x14, 0x63,
+	0x0f, 0x6f, 0xad, 0x8d, 0xac, 0x7c, 0x67, 0x68,
+	0xa3, 0x9d, 0x7f, 0x00, 0xdf, 0x28, 0xee, 0x67,
+	0xf4, 0x5c, 0x26, 0xcb, 0xef, 0x56, 0x71, 0xc8,
+	0xc6, 0x67, 0x5f, 0x38, 0xbb, 0xa0, 0xb1, 0x5c,
+	0x1f, 0xb3, 0x08, 0xd9, 0x38, 0xcf, 0x74, 0x54,
+	0xc6, 0xa4, 0xc4, 0xc0, 0x9f, 0xb3, 0xd0, 0xda,
+	0x62, 0x67, 0x8b, 0x81, 0x33, 0xf0, 0xa9, 0x73,
+	0xa4, 0xd1, 0x46, 0x88, 0x8d, 0x85, 0x12, 0x40,
+	0xba, 0x1a, 0xcd, 0x82, 0xd8, 0x8d, 0xc4, 0x52,
+	0xe7, 0x01, 0x94, 0x2e, 0x0e, 0xd0, 0xaf, 0xe7,
+	0x2d, 0x3f, 0x3c, 0xaa, 0xf4, 0xf5, 0xa7, 0x01,
+	0x4c, 0x14, 0xe2, 0xc2, 0x96, 0x76, 0xbe, 0x05,
+	0xaa, 0x19, 0xb1, 0xbd, 0x95, 0xbb, 0x5a, 0xf9,
+	0xa5, 0xa7, 0xe6, 0x16, 0x38, 0x34, 0xf7, 0x9d,
+	0x19, 0x66, 0x16, 0x8e, 0x7f, 0x2b, 0x5a, 0xfb,
+	0xb5, 0x29, 0x79, 0xbf, 0x52, 0xae, 0x30, 0x95,
+	0x3f, 0x31, 0x33, 0x28, 0xde, 0xc5, 0x0d, 0x55,
+	0x89, 0xec, 0x21, 0x11, 0x0f, 0x8b, 0xfe, 0x63,
+	0x3a, 0xf1, 0x95, 0x5c, 0xcd, 0x50, 0xe4, 0x5d,
+	0x8f, 0xa7, 0xc8, 0xca, 0x93, 0xa0, 0x67, 0x82,
+	0x63, 0x5c, 0xd0, 0xed, 0xe7, 0x08, 0xc5, 0x60,
+	0xf8, 0xb4, 0x47, 0xf0, 0x1a, 0x65, 0x4e, 0xa3,
+	0x51, 0x68, 0xc7, 0x14, 0xa1, 0xd9, 0x39, 0x72,
+	0xa8, 0x6f, 0x7c, 0x7e, 0xf6, 0x03, 0x0b, 0x25,
+	0x9b, 0xf2, 0xca, 0x49, 0xae, 0x5b, 0xf8, 0x0f,
+	0x71, 0x51, 0x01, 0xa6, 0x23, 0xa9, 0xdf, 0xd0,
+	0x7a, 0x39, 0x19, 0xf5, 0xc5, 0x26, 0x44, 0x7b,
+	0x0a, 0x4a, 0x41, 0xbf, 0xf2, 0x8e, 0x83, 0x50,
+	0x91, 0x96, 0x72, 0x02, 0xf6, 0x80, 0xbf, 0x95,
+	0x41, 0xac, 0xda, 0xb0, 0xba, 0xe3, 0x76, 0xb1,
+	0x9d, 0xff, 0x1f, 0x33, 0x02, 0x85, 0xfc, 0x2a,
+	0x29, 0xe6, 0xe3, 0x9d, 0xd0, 0xef, 0xc2, 0xd6,
+	0x9c, 0x4a, 0x62, 0xac, 0xcb, 0xea, 0x8b, 0xc3,
+	0x08, 0x6e, 0x49, 0x09, 0x26, 0x19, 0xc1, 0x30,
+	0xcc, 0x27, 0xaa, 0xc6, 0x45, 0x88, 0xbd, 0xae,
+	0xd6, 0x79, 0xff, 0x4e, 0xfc, 0x66, 0x4d, 0x02,
+	0xa5, 0xee, 0x8e, 0xa5, 0xb6, 0x15, 0x72, 0x24,
+	0xb1, 0xbf, 0xbf, 0x64, 0xcf, 0xcc, 0x93, 0xe9,
+	0xb6, 0xfd, 0xb4, 0xb6, 0x21, 0xb5, 0x48, 0x08,
+	0x0f, 0x11, 0x65, 0xe1, 0x47, 0xee, 0x93, 0x29,
+	0xad
+};
+static const u8 key72[] __initconst = {
+	0xb9, 0xa2, 0xfc, 0x59, 0x06, 0x3f, 0x77, 0xa5,
+	0x66, 0xd0, 0x2b, 0x22, 0x74, 0x22, 0x4c, 0x1e,
+	0x6a, 0x39, 0xdf, 0xe1, 0x0d, 0x4c, 0x64, 0x99,
+	0x54, 0x8a, 0xba, 0x1d, 0x2c, 0x21, 0x5f, 0xc3
+};
+enum { nonce72 = 0x3d069308fa3db04bULL };
+
+static const u8 input73[] __initconst = {
+	0xe4, 0xdd, 0x36, 0xd4, 0xf5, 0x70, 0x51, 0x73,
+	0x97, 0x1d, 0x45, 0x05, 0x92, 0xe7, 0xeb, 0xb7,
+	0x09, 0x82, 0x6e, 0x25, 0x6c, 0x50, 0xf5, 0x40,
+	0x19, 0xba, 0xbc, 0xf4, 0x39, 0x14, 0xc5, 0x15,
+	0x83, 0x40, 0xbd, 0x26, 0xe0, 0xff, 0x3b, 0x22,
+	0x7c, 0x7c, 0xd7, 0x0b, 0xe9, 0x25, 0x0c, 0x3d,
+	0x92, 0x38, 0xbe, 0xe4, 0x22, 0x75, 0x65, 0xf1,
+	0x03, 0x85, 0x34, 0x09, 0xb8, 0x77, 0xfb, 0x48,
+	0xb1, 0x2e, 0x21, 0x67, 0x9b, 0x9d, 0xad, 0x18,
+	0x82, 0x0d, 0x6b, 0xc3, 0xcf, 0x00, 0x61, 0x6e,
+	0xda, 0xdc, 0xa7, 0x0b, 0x5c, 0x02, 0x1d, 0xa6,
+	0x4e, 0x0d, 0x7f, 0x37, 0x01, 0x5a, 0x37, 0xf3,
+	0x2b, 0xbf, 0xba, 0xe2, 0x1c, 0xb3, 0xa3, 0xbc,
+	0x1c, 0x93, 0x1a, 0xb1, 0x71, 0xaf, 0xe2, 0xdd,
+	0x17, 0xee, 0x53, 0xfa, 0xfb, 0x02, 0x40, 0x3e,
+	0x03, 0xca, 0xe7, 0xc3, 0x51, 0x81, 0xcc, 0x8c,
+	0xca, 0xcf, 0x4e, 0xc5, 0x78, 0x99, 0xfd, 0xbf,
+	0xea, 0xab, 0x38, 0x81, 0xfc, 0xd1, 0x9e, 0x41,
+	0x0b, 0x84, 0x25, 0xf1, 0x6b, 0x3c, 0xf5, 0x40,
+	0x0d, 0xc4, 0x3e, 0xb3, 0x6a, 0xec, 0x6e, 0x75,
+	0xdc, 0x9b, 0xdf, 0x08, 0x21, 0x16, 0xfb, 0x7a,
+	0x8e, 0x19, 0x13, 0x02, 0xa7, 0xfc, 0x58, 0x21,
+	0xc3, 0xb3, 0x59, 0x5a, 0x9c, 0xef, 0x38, 0xbd,
+	0x87, 0x55, 0xd7, 0x0d, 0x1f, 0x84, 0xdc, 0x98,
+	0x22, 0xca, 0x87, 0x96, 0x71, 0x6d, 0x68, 0x00,
+	0xcb, 0x4f, 0x2f, 0xc4, 0x64, 0x0c, 0xc1, 0x53,
+	0x0c, 0x90, 0xe7, 0x3c, 0x88, 0xca, 0xc5, 0x85,
+	0xa3, 0x2a, 0x96, 0x7c, 0x82, 0x6d, 0x45, 0xf5,
+	0xb7, 0x8d, 0x17, 0x69, 0xd6, 0xcd, 0x3c, 0xd3,
+	0xe7, 0x1c, 0xce, 0x93, 0x50, 0xd4, 0x59, 0xa2,
+	0xd8, 0x8b, 0x72, 0x60, 0x5b, 0x25, 0x14, 0xcd,
+	0x5a, 0xe8, 0x8c, 0xdb, 0x23, 0x8d, 0x2b, 0x59,
+	0x12, 0x13, 0x10, 0x47, 0xa4, 0xc8, 0x3c, 0xc1,
+	0x81, 0x89, 0x6c, 0x98, 0xec, 0x8f, 0x7b, 0x32,
+	0xf2, 0x87, 0xd9, 0xa2, 0x0d, 0xc2, 0x08, 0xf9,
+	0xd5, 0xf3, 0x91, 0xe7, 0xb3, 0x87, 0xa7, 0x0b,
+	0x64, 0x8f, 0xb9, 0x55, 0x1c, 0x81, 0x96, 0x6c,
+	0xa1, 0xc9, 0x6e, 0x3b, 0xcd, 0x17, 0x1b, 0xfc,
+	0xa6, 0x05, 0xba, 0x4a, 0x7d, 0x03, 0x3c, 0x59,
+	0xc8, 0xee, 0x50, 0xb2, 0x5b, 0xe1, 0x4d, 0x6a,
+	0x1f, 0x09, 0xdc, 0xa2, 0x51, 0xd1, 0x93, 0x3a,
+	0x5f, 0x72, 0x1d, 0x26, 0x14, 0x62, 0xa2, 0x41,
+	0x3d, 0x08, 0x70, 0x7b, 0x27, 0x3d, 0xbc, 0xdf,
+	0x15, 0xfa, 0xb9, 0x5f, 0xb5, 0x38, 0x84, 0x0b,
+	0x58, 0x3d, 0xee, 0x3f, 0x32, 0x65, 0x6d, 0xd7,
+	0xce, 0x97, 0x3c, 0x8d, 0xfb, 0x63, 0xb9, 0xb0,
+	0xa8, 0x4a, 0x72, 0x99, 0x97, 0x58, 0xc8, 0xa7,
+	0xf9, 0x4c, 0xae, 0xc1, 0x63, 0xb9, 0x57, 0x18,
+	0x8a, 0xfa, 0xab, 0xe9, 0xf3, 0x67, 0xe6, 0xfd,
+	0xd2, 0x9d, 0x5c, 0xa9, 0x8e, 0x11, 0x0a, 0xf4,
+	0x4b, 0xf1, 0xec, 0x1a, 0xaf, 0x50, 0x5d, 0x16,
+	0x13, 0x69, 0x2e, 0xbd, 0x0d, 0xe6, 0xf0, 0xb2,
+	0xed, 0xb4, 0x4c, 0x59, 0x77, 0x37, 0x00, 0x0b,
+	0xc7, 0xa7, 0x9e, 0x37, 0xf3, 0x60, 0x70, 0xef,
+	0xf3, 0xc1, 0x74, 0x52, 0x87, 0xc6, 0xa1, 0x81,
+	0xbd, 0x0a, 0x2c, 0x5d, 0x2c, 0x0c, 0x6a, 0x81,
+	0xa1, 0xfe, 0x26, 0x78, 0x6c, 0x03, 0x06, 0x07,
+	0x34, 0xaa, 0xd1, 0x1b, 0x40, 0x03, 0x39, 0x56,
+	0xcf, 0x2a, 0x92, 0xc1, 0x4e, 0xdf, 0x29, 0x24,
+	0x83, 0x22, 0x7a, 0xea, 0x67, 0x1e, 0xe7, 0x54,
+	0x64, 0xd3, 0xbd, 0x3a, 0x5d, 0xae, 0xca, 0xf0,
+	0x9c, 0xd6, 0x5a, 0x9a, 0x62, 0xc8, 0xc7, 0x83,
+	0xf9, 0x89, 0xde, 0x2d, 0x53, 0x64, 0x61, 0xf7,
+	0xa3, 0xa7, 0x31, 0x38, 0xc6, 0x22, 0x9c, 0xb4,
+	0x87, 0xe0
+};
+static const u8 output73[] __initconst = {
+	0x34, 0xed, 0x05, 0xb0, 0x14, 0xbc, 0x8c, 0xcc,
+	0x95, 0xbd, 0x99, 0x0f, 0xb1, 0x98, 0x17, 0x10,
+	0xae, 0xe0, 0x08, 0x53, 0xa3, 0x69, 0xd2, 0xed,
+	0x66, 0xdb, 0x2a, 0x34, 0x8d, 0x0c, 0x6e, 0xce,
+	0x63, 0x69, 0xc9, 0xe4, 0x57, 0xc3, 0x0c, 0x8b,
+	0xa6, 0x2c, 0xa7, 0xd2, 0x08, 0xff, 0x4f, 0xec,
+	0x61, 0x8c, 0xee, 0x0d, 0xfa, 0x6b, 0xe0, 0xe8,
+	0x71, 0xbc, 0x41, 0x46, 0xd7, 0x33, 0x1d, 0xc0,
+	0xfd, 0xad, 0xca, 0x8b, 0x34, 0x56, 0xa4, 0x86,
+	0x71, 0x62, 0xae, 0x5e, 0x3d, 0x2b, 0x66, 0x3e,
+	0xae, 0xd8, 0xc0, 0xe1, 0x21, 0x3b, 0xca, 0xd2,
+	0x6b, 0xa2, 0xb8, 0xc7, 0x98, 0x4a, 0xf3, 0xcf,
+	0xb8, 0x62, 0xd8, 0x33, 0xe6, 0x80, 0xdb, 0x2f,
+	0x0a, 0xaf, 0x90, 0x3c, 0xe1, 0xec, 0xe9, 0x21,
+	0x29, 0x42, 0x9e, 0xa5, 0x50, 0xe9, 0x93, 0xd3,
+	0x53, 0x1f, 0xac, 0x2a, 0x24, 0x07, 0xb8, 0xed,
+	0xed, 0x38, 0x2c, 0xc4, 0xa1, 0x2b, 0x31, 0x5d,
+	0x9c, 0x24, 0x7b, 0xbf, 0xd9, 0xbb, 0x4e, 0x87,
+	0x8f, 0x32, 0x30, 0xf1, 0x11, 0x29, 0x54, 0x94,
+	0x00, 0x95, 0x1d, 0x1d, 0x24, 0xc0, 0xd4, 0x34,
+	0x49, 0x1d, 0xd5, 0xe3, 0xa6, 0xde, 0x8b, 0xbf,
+	0x5a, 0x9f, 0x58, 0x5a, 0x9b, 0x70, 0xe5, 0x9b,
+	0xb3, 0xdb, 0xe8, 0xb8, 0xca, 0x1b, 0x43, 0xe3,
+	0xc6, 0x6f, 0x0a, 0xd6, 0x32, 0x11, 0xd4, 0x04,
+	0xef, 0xa3, 0xe4, 0x3f, 0x12, 0xd8, 0xc1, 0x73,
+	0x51, 0x87, 0x03, 0xbd, 0xba, 0x60, 0x79, 0xee,
+	0x08, 0xcc, 0xf7, 0xc0, 0xaa, 0x4c, 0x33, 0xc4,
+	0xc7, 0x09, 0xf5, 0x91, 0xcb, 0x74, 0x57, 0x08,
+	0x1b, 0x90, 0xa9, 0x1b, 0x60, 0x02, 0xd2, 0x3f,
+	0x7a, 0xbb, 0xfd, 0x78, 0xf0, 0x15, 0xf9, 0x29,
+	0x82, 0x8f, 0xc4, 0xb2, 0x88, 0x1f, 0xbc, 0xcc,
+	0x53, 0x27, 0x8b, 0x07, 0x5f, 0xfc, 0x91, 0x29,
+	0x82, 0x80, 0x59, 0x0a, 0x3c, 0xea, 0xc4, 0x7e,
+	0xad, 0xd2, 0x70, 0x46, 0xbd, 0x9e, 0x3b, 0x1c,
+	0x8a, 0x62, 0xea, 0x69, 0xbd, 0xf6, 0x96, 0x15,
+	0xb5, 0x57, 0xe8, 0x63, 0x5f, 0x65, 0x46, 0x84,
+	0x58, 0x50, 0x87, 0x4b, 0x0e, 0x5b, 0x52, 0x90,
+	0xb0, 0xae, 0x37, 0x0f, 0xdd, 0x7e, 0xa2, 0xa0,
+	0x8b, 0x78, 0xc8, 0x5a, 0x1f, 0x53, 0xdb, 0xc5,
+	0xbf, 0x73, 0x20, 0xa9, 0x44, 0xfb, 0x1e, 0xc7,
+	0x97, 0xb2, 0x3a, 0x5a, 0x17, 0xe6, 0x8b, 0x9b,
+	0xe8, 0xf8, 0x2a, 0x01, 0x27, 0xa3, 0x71, 0x28,
+	0xe3, 0x19, 0xc6, 0xaf, 0xf5, 0x3a, 0x26, 0xc0,
+	0x5c, 0x69, 0x30, 0x78, 0x75, 0x27, 0xf2, 0x0c,
+	0x22, 0x71, 0x65, 0xc6, 0x8e, 0x7b, 0x47, 0xe3,
+	0x31, 0xaf, 0x7b, 0xc6, 0xc2, 0x55, 0x68, 0x81,
+	0xaa, 0x1b, 0x21, 0x65, 0xfb, 0x18, 0x35, 0x45,
+	0x36, 0x9a, 0x44, 0xba, 0x5c, 0xff, 0x06, 0xde,
+	0x3a, 0xc8, 0x44, 0x0b, 0xaa, 0x8e, 0x34, 0xe2,
+	0x84, 0xac, 0x18, 0xfe, 0x9b, 0xe1, 0x4f, 0xaa,
+	0xb6, 0x90, 0x0b, 0x1c, 0x2c, 0xd9, 0x9a, 0x10,
+	0x18, 0xf9, 0x49, 0x41, 0x42, 0x1b, 0xb5, 0xe1,
+	0x26, 0xac, 0x2d, 0x38, 0x00, 0x00, 0xe4, 0xb4,
+	0x50, 0x6f, 0x14, 0x18, 0xd6, 0x3d, 0x00, 0x59,
+	0x3c, 0x45, 0xf3, 0x42, 0x13, 0x44, 0xb8, 0x57,
+	0xd4, 0x43, 0x5c, 0x8a, 0x2a, 0xb4, 0xfc, 0x0a,
+	0x25, 0x5a, 0xdc, 0x8f, 0x11, 0x0b, 0x11, 0x44,
+	0xc7, 0x0e, 0x54, 0x8b, 0x22, 0x01, 0x7e, 0x67,
+	0x2e, 0x15, 0x3a, 0xb9, 0xee, 0x84, 0x10, 0xd4,
+	0x80, 0x57, 0xd7, 0x75, 0xcf, 0x8b, 0xcb, 0x03,
+	0xc9, 0x92, 0x2b, 0x69, 0xd8, 0x5a, 0x9b, 0x06,
+	0x85, 0x47, 0xaa, 0x4c, 0x28, 0xde, 0x49, 0x58,
+	0xe6, 0x11, 0x1e, 0x5e, 0x64, 0x8e, 0x3b, 0xe0,
+	0x40, 0x2e, 0xac, 0x96, 0x97, 0x15, 0x37, 0x1e,
+	0x30, 0xdd
+};
+static const u8 key73[] __initconst = {
+	0x96, 0x06, 0x1e, 0xc1, 0x6d, 0xba, 0x49, 0x5b,
+	0x65, 0x80, 0x79, 0xdd, 0xf3, 0x67, 0xa8, 0x6e,
+	0x2d, 0x9c, 0x54, 0x46, 0xd8, 0x4a, 0xeb, 0x7e,
+	0x23, 0x86, 0x51, 0xd8, 0x49, 0x49, 0x56, 0xe0
+};
+enum { nonce73 = 0xbefb83cb67e11ffdULL };
+
+static const u8 input74[] __initconst = {
+	0x47, 0x22, 0x70, 0xe5, 0x2f, 0x41, 0x18, 0x45,
+	0x07, 0xd3, 0x6d, 0x32, 0x0d, 0x43, 0x92, 0x2b,
+	0x9b, 0x65, 0x73, 0x13, 0x1a, 0x4f, 0x49, 0x8f,
+	0xff, 0xf8, 0xcc, 0xae, 0x15, 0xab, 0x9d, 0x7d,
+	0xee, 0x22, 0x5d, 0x8b, 0xde, 0x81, 0x5b, 0x81,
+	0x83, 0x49, 0x35, 0x9b, 0xb4, 0xbc, 0x4e, 0x01,
+	0xc2, 0x29, 0xa7, 0xf1, 0xca, 0x3a, 0xce, 0x3f,
+	0xf5, 0x31, 0x93, 0xa8, 0xe2, 0xc9, 0x7d, 0x03,
+	0x26, 0xa4, 0xbc, 0xa8, 0x9c, 0xb9, 0x68, 0xf3,
+	0xb3, 0x91, 0xe8, 0xe6, 0xc7, 0x2b, 0x1a, 0xce,
+	0xd2, 0x41, 0x53, 0xbd, 0xa3, 0x2c, 0x54, 0x94,
+	0x21, 0xa1, 0x40, 0xae, 0xc9, 0x0c, 0x11, 0x92,
+	0xfd, 0x91, 0xa9, 0x40, 0xca, 0xde, 0x21, 0x4e,
+	0x1e, 0x3d, 0xcc, 0x2c, 0x87, 0x11, 0xef, 0x46,
+	0xed, 0x52, 0x03, 0x11, 0x19, 0x43, 0x25, 0xc7,
+	0x0d, 0xc3, 0x37, 0x5f, 0xd3, 0x6f, 0x0c, 0x6a,
+	0x45, 0x30, 0x88, 0xec, 0xf0, 0x21, 0xef, 0x1d,
+	0x7b, 0x38, 0x63, 0x4b, 0x49, 0x0c, 0x72, 0xf6,
+	0x4c, 0x40, 0xc3, 0xcc, 0x03, 0xa7, 0xae, 0xa8,
+	0x8c, 0x37, 0x03, 0x1c, 0x11, 0xae, 0x0d, 0x1b,
+	0x62, 0x97, 0x27, 0xfc, 0x56, 0x4b, 0xb7, 0xfd,
+	0xbc, 0xfb, 0x0e, 0xfc, 0x61, 0xad, 0xc6, 0xb5,
+	0x9c, 0x8c, 0xc6, 0x38, 0x27, 0x91, 0x29, 0x3d,
+	0x29, 0xc8, 0x37, 0xc9, 0x96, 0x69, 0xe3, 0xdc,
+	0x3e, 0x61, 0x35, 0x9b, 0x99, 0x4f, 0xb9, 0x4e,
+	0x5a, 0x29, 0x1c, 0x2e, 0xcf, 0x16, 0xcb, 0x69,
+	0x87, 0xe4, 0x1a, 0xc4, 0x6e, 0x78, 0x43, 0x00,
+	0x03, 0xb2, 0x8b, 0x03, 0xd0, 0xb4, 0xf1, 0xd2,
+	0x7d, 0x2d, 0x7e, 0xfc, 0x19, 0x66, 0x5b, 0xa3,
+	0x60, 0x3f, 0x9d, 0xbd, 0xfa, 0x3e, 0xca, 0x7b,
+	0x26, 0x08, 0x19, 0x16, 0x93, 0x5d, 0x83, 0xfd,
+	0xf9, 0x21, 0xc6, 0x31, 0x34, 0x6f, 0x0c, 0xaa,
+	0x28, 0xf9, 0x18, 0xa2, 0xc4, 0x78, 0x3b, 0x56,
+	0xc0, 0x88, 0x16, 0xba, 0x22, 0x2c, 0x07, 0x2f,
+	0x70, 0xd0, 0xb0, 0x46, 0x35, 0xc7, 0x14, 0xdc,
+	0xbb, 0x56, 0x23, 0x1e, 0x36, 0x36, 0x2d, 0x73,
+	0x78, 0xc7, 0xce, 0xf3, 0x58, 0xf7, 0x58, 0xb5,
+	0x51, 0xff, 0x33, 0x86, 0x0e, 0x3b, 0x39, 0xfb,
+	0x1a, 0xfd, 0xf8, 0x8b, 0x09, 0x33, 0x1b, 0x83,
+	0xf2, 0xe6, 0x38, 0x37, 0xef, 0x47, 0x84, 0xd9,
+	0x82, 0x77, 0x2b, 0x82, 0xcc, 0xf9, 0xee, 0x94,
+	0x71, 0x78, 0x81, 0xc8, 0x4d, 0x91, 0xd7, 0x35,
+	0x29, 0x31, 0x30, 0x5c, 0x4a, 0x23, 0x23, 0xb1,
+	0x38, 0x6b, 0xac, 0x22, 0x3f, 0x80, 0xc7, 0xe0,
+	0x7d, 0xfa, 0x76, 0x47, 0xd4, 0x6f, 0x93, 0xa0,
+	0xa0, 0x93, 0x5d, 0x68, 0xf7, 0x43, 0x25, 0x8f,
+	0x1b, 0xc7, 0x87, 0xea, 0x59, 0x0c, 0xa2, 0xfa,
+	0xdb, 0x2f, 0x72, 0x43, 0xcf, 0x90, 0xf1, 0xd6,
+	0x58, 0xf3, 0x17, 0x6a, 0xdf, 0xb3, 0x4e, 0x0e,
+	0x38, 0x24, 0x48, 0x1f, 0xb7, 0x01, 0xec, 0x81,
+	0xb1, 0x87, 0x5b, 0xec, 0x9c, 0x11, 0x1a, 0xff,
+	0xa5, 0xca, 0x5a, 0x63, 0x31, 0xb2, 0xe4, 0xc6,
+	0x3c, 0x1d, 0xaf, 0x27, 0xb2, 0xd4, 0x19, 0xa2,
+	0xcc, 0x04, 0x92, 0x42, 0xd2, 0xc1, 0x8c, 0x3b,
+	0xce, 0xf5, 0x74, 0xc1, 0x81, 0xf8, 0x20, 0x23,
+	0x6f, 0x20, 0x6d, 0x78, 0x36, 0x72, 0x2c, 0x52,
+	0xdf, 0x5e, 0xe8, 0x75, 0xce, 0x1c, 0x49, 0x9d,
+	0x93, 0x6f, 0x65, 0xeb, 0xb1, 0xbd, 0x8e, 0x5e,
+	0xe5, 0x89, 0xc4, 0x8a, 0x81, 0x3d, 0x9a, 0xa7,
+	0x11, 0x82, 0x8e, 0x38, 0x5b, 0x5b, 0xca, 0x7d,
+	0x4b, 0x72, 0xc2, 0x9c, 0x30, 0x5e, 0x7f, 0xc0,
+	0x6f, 0x91, 0xd5, 0x67, 0x8c, 0x3e, 0xae, 0xda,
+	0x2b, 0x3c, 0x53, 0xcc, 0x50, 0x97, 0x36, 0x0b,
+	0x79, 0xd6, 0x73, 0x6e, 0x7d, 0x42, 0x56, 0xe1,
+	0xaa, 0xfc, 0xb3, 0xa7, 0xc8, 0x01, 0xaa, 0xc1,
+	0xfc, 0x5c, 0x72, 0x8e, 0x63, 0xa8, 0x46, 0x18,
+	0xee, 0x11, 0xe7, 0x30, 0x09, 0x83, 0x6c, 0xd9,
+	0xf4, 0x7a, 0x7b, 0xb5, 0x1f, 0x6d, 0xc7, 0xbc,
+	0xcb, 0x55, 0xea, 0x40, 0x58, 0x7a, 0x00, 0x00,
+	0x90, 0x60, 0xc5, 0x64, 0x69, 0x05, 0x99, 0xd2,
+	0x49, 0x62, 0x4f, 0xcb, 0x97, 0xdf, 0xdd, 0x6b,
+	0x60, 0x75, 0xe2, 0xe0, 0x6f, 0x76, 0xd0, 0x37,
+	0x67, 0x0a, 0xcf, 0xff, 0xc8, 0x61, 0x84, 0x14,
+	0x80, 0x7c, 0x1d, 0x31, 0x8d, 0x90, 0xde, 0x0b,
+	0x1c, 0x74, 0x9f, 0x82, 0x96, 0x80, 0xda, 0xaf,
+	0x8d, 0x99, 0x86, 0x9f, 0x24, 0x99, 0x28, 0x3e,
+	0xe0, 0xa3, 0xc3, 0x90, 0x2d, 0x14, 0x65, 0x1e,
+	0x3b, 0xb9, 0xba, 0x13, 0xa5, 0x77, 0x73, 0x63,
+	0x9a, 0x06, 0x3d, 0xa9, 0x28, 0x9b, 0xba, 0x25,
+	0x61, 0xc9, 0xcd, 0xcf, 0x7a, 0x4d, 0x96, 0x09,
+	0xcb, 0xca, 0x03, 0x9c, 0x54, 0x34, 0x31, 0x85,
+	0xa0, 0x3d, 0xe5, 0xbc, 0xa5, 0x5f, 0x1b, 0xd3,
+	0x10, 0x63, 0x74, 0x9d, 0x01, 0x92, 0x88, 0xf0,
+	0x27, 0x9c, 0x28, 0xd9, 0xfd, 0xe2, 0x4e, 0x01,
+	0x8d, 0x61, 0x79, 0x60, 0x61, 0x5b, 0x76, 0xab,
+	0x06, 0xd3, 0x44, 0x87, 0x43, 0x52, 0xcd, 0x06,
+	0x68, 0x1e, 0x2d, 0xc5, 0xb0, 0x07, 0x25, 0xdf,
+	0x0a, 0x50, 0xd7, 0xd9, 0x08, 0x53, 0x65, 0xf1,
+	0x0c, 0x2c, 0xde, 0x3f, 0x9d, 0x03, 0x1f, 0xe1,
+	0x49, 0x43, 0x3c, 0x83, 0x81, 0x37, 0xf8, 0xa2,
+	0x0b, 0xf9, 0x61, 0x1c, 0xc1, 0xdb, 0x79, 0xbc,
+	0x64, 0xce, 0x06, 0x4e, 0x87, 0x89, 0x62, 0x73,
+	0x51, 0xbc, 0xa4, 0x32, 0xd4, 0x18, 0x62, 0xab,
+	0x65, 0x7e, 0xad, 0x1e, 0x91, 0xa3, 0xfa, 0x2d,
+	0x58, 0x9e, 0x2a, 0xe9, 0x74, 0x44, 0x64, 0x11,
+	0xe6, 0xb6, 0xb3, 0x00, 0x7e, 0xa3, 0x16, 0xef,
+	0x72
+};
+static const u8 output74[] __initconst = {
+	0xf5, 0xca, 0x45, 0x65, 0x50, 0x35, 0x47, 0x67,
+	0x6f, 0x4f, 0x67, 0xff, 0x34, 0xd9, 0xc3, 0x37,
+	0x2a, 0x26, 0xb0, 0x4f, 0x08, 0x1e, 0x45, 0x13,
+	0xc7, 0x2c, 0x14, 0x75, 0x33, 0xd8, 0x8e, 0x1e,
+	0x1b, 0x11, 0x0d, 0x97, 0x04, 0x33, 0x8a, 0xe4,
+	0xd8, 0x8d, 0x0e, 0x12, 0x8d, 0xdb, 0x6e, 0x02,
+	0xfa, 0xe5, 0xbd, 0x3a, 0xb5, 0x28, 0x07, 0x7d,
+	0x20, 0xf0, 0x12, 0x64, 0x83, 0x2f, 0x59, 0x79,
+	0x17, 0x88, 0x3c, 0x2d, 0x08, 0x2f, 0x55, 0xda,
+	0xcc, 0x02, 0x3a, 0x82, 0xcd, 0x03, 0x94, 0xdf,
+	0xdf, 0xab, 0x8a, 0x13, 0xf5, 0xe6, 0x74, 0xdf,
+	0x7b, 0xe2, 0xab, 0x34, 0xbc, 0x00, 0x85, 0xbf,
+	0x5a, 0x48, 0xc8, 0xff, 0x8d, 0x6c, 0x27, 0x48,
+	0x19, 0x2d, 0x08, 0xfa, 0x82, 0x62, 0x39, 0x55,
+	0x32, 0x11, 0xa8, 0xd7, 0xb9, 0x08, 0x2c, 0xd6,
+	0x7a, 0xd9, 0x83, 0x9f, 0x9b, 0xfb, 0xec, 0x3a,
+	0xd1, 0x08, 0xc7, 0xad, 0xdc, 0x98, 0x4c, 0xbc,
+	0x98, 0xeb, 0x36, 0xb0, 0x39, 0xf4, 0x3a, 0xd6,
+	0x53, 0x02, 0xa0, 0xa9, 0x73, 0xa1, 0xca, 0xef,
+	0xd8, 0xd2, 0xec, 0x0e, 0xf8, 0xf5, 0xac, 0x8d,
+	0x34, 0x41, 0x06, 0xa8, 0xc6, 0xc3, 0x31, 0xbc,
+	0xe5, 0xcc, 0x7e, 0x72, 0x63, 0x59, 0x3e, 0x63,
+	0xc2, 0x8d, 0x2b, 0xd5, 0xb9, 0xfd, 0x1e, 0x31,
+	0x69, 0x32, 0x05, 0xd6, 0xde, 0xc9, 0xe6, 0x4c,
+	0xac, 0x68, 0xf7, 0x1f, 0x9d, 0xcd, 0x0e, 0xa2,
+	0x15, 0x3d, 0xd6, 0x47, 0x99, 0xab, 0x08, 0x5f,
+	0x28, 0xc3, 0x4c, 0xc2, 0xd5, 0xdd, 0x10, 0xb7,
+	0xbd, 0xdb, 0x9b, 0xcf, 0x85, 0x27, 0x29, 0x76,
+	0x98, 0xeb, 0xad, 0x31, 0x64, 0xe7, 0xfb, 0x61,
+	0xe0, 0xd8, 0x1a, 0xa6, 0xe2, 0xe7, 0x43, 0x42,
+	0x77, 0xc9, 0x82, 0x00, 0xac, 0x85, 0xe0, 0xa2,
+	0xd4, 0x62, 0xe3, 0xb7, 0x17, 0x6e, 0xb2, 0x9e,
+	0x21, 0x58, 0x73, 0xa9, 0x53, 0x2d, 0x3c, 0xe1,
+	0xdd, 0xd6, 0x6e, 0x92, 0xf2, 0x1d, 0xc2, 0x22,
+	0x5f, 0x9a, 0x7e, 0xd0, 0x52, 0xbf, 0x54, 0x19,
+	0xd7, 0x80, 0x63, 0x3e, 0xd0, 0x08, 0x2d, 0x37,
+	0x0c, 0x15, 0xf7, 0xde, 0xab, 0x2b, 0xe3, 0x16,
+	0x21, 0x3a, 0xee, 0xa5, 0xdc, 0xdf, 0xde, 0xa3,
+	0x69, 0xcb, 0xfd, 0x92, 0x89, 0x75, 0xcf, 0xc9,
+	0x8a, 0xa4, 0xc8, 0xdd, 0xcc, 0x21, 0xe6, 0xfe,
+	0x9e, 0x43, 0x76, 0xb2, 0x45, 0x22, 0xb9, 0xb5,
+	0xac, 0x7e, 0x3d, 0x26, 0xb0, 0x53, 0xc8, 0xab,
+	0xfd, 0xea, 0x2c, 0xd1, 0x44, 0xc5, 0x60, 0x1b,
+	0x8a, 0x99, 0x0d, 0xa5, 0x0e, 0x67, 0x6e, 0x3a,
+	0x96, 0x55, 0xec, 0xe8, 0xcc, 0xbe, 0x49, 0xd9,
+	0xf2, 0x72, 0x9f, 0x30, 0x21, 0x97, 0x57, 0x19,
+	0xbe, 0x5e, 0x33, 0x0c, 0xee, 0xc0, 0x72, 0x0d,
+	0x2e, 0xd1, 0xe1, 0x52, 0xc2, 0xea, 0x41, 0xbb,
+	0xe1, 0x6d, 0xd4, 0x17, 0xa9, 0x8d, 0x89, 0xa9,
+	0xd6, 0x4b, 0xc6, 0x4c, 0xf2, 0x88, 0x97, 0x54,
+	0x3f, 0x4f, 0x57, 0xb7, 0x37, 0xf0, 0x2c, 0x11,
+	0x15, 0x56, 0xdb, 0x28, 0xb5, 0x16, 0x84, 0x66,
+	0xce, 0x45, 0x3f, 0x61, 0x75, 0xb6, 0xbe, 0x00,
+	0xd1, 0xe4, 0xf5, 0x27, 0x54, 0x7f, 0xc2, 0xf1,
+	0xb3, 0x32, 0x9a, 0xe8, 0x07, 0x02, 0xf3, 0xdb,
+	0xa9, 0xd1, 0xc2, 0xdf, 0xee, 0xad, 0xe5, 0x8a,
+	0x3c, 0xfa, 0x67, 0xec, 0x6b, 0xa4, 0x08, 0xfe,
+	0xba, 0x5a, 0x58, 0x0b, 0x78, 0x11, 0x91, 0x76,
+	0xe3, 0x1a, 0x28, 0x54, 0x5e, 0xbd, 0x71, 0x1b,
+	0x8b, 0xdc, 0x6c, 0xf4, 0x6f, 0xd7, 0xf4, 0xf3,
+	0xe1, 0x03, 0xa4, 0x3c, 0x8d, 0x91, 0x2e, 0xba,
+	0x5f, 0x7f, 0x8c, 0xaf, 0x69, 0x89, 0x29, 0x0a,
+	0x5b, 0x25, 0x13, 0xc4, 0x2e, 0x16, 0xc2, 0x15,
+	0x07, 0x5d, 0x58, 0x33, 0x7c, 0xe0, 0xf0, 0x55,
+	0x5f, 0xbf, 0x5e, 0xf0, 0x71, 0x48, 0x8f, 0xf7,
+	0x48, 0xb3, 0xf7, 0x0d, 0xa1, 0xd0, 0x63, 0xb1,
+	0xad, 0xae, 0xb5, 0xb0, 0x5f, 0x71, 0xaf, 0x24,
+	0x8b, 0xb9, 0x1c, 0x44, 0xd2, 0x1a, 0x53, 0xd1,
+	0xd5, 0xb4, 0xa9, 0xff, 0x88, 0x73, 0xb5, 0xaa,
+	0x15, 0x32, 0x5f, 0x59, 0x9d, 0x2e, 0xb5, 0xcb,
+	0xde, 0x21, 0x2e, 0xe9, 0x35, 0xed, 0xfd, 0x0f,
+	0xb6, 0xbb, 0xe6, 0x4b, 0x16, 0xf1, 0x45, 0x1e,
+	0xb4, 0x84, 0xe9, 0x58, 0x1c, 0x0c, 0x95, 0xc0,
+	0xcf, 0x49, 0x8b, 0x59, 0xa1, 0x78, 0xe6, 0x80,
+	0x12, 0x49, 0x7a, 0xd4, 0x66, 0x62, 0xdf, 0x9c,
+	0x18, 0xc8, 0x8c, 0xda, 0xc1, 0xa6, 0xbc, 0x65,
+	0x28, 0xd2, 0xa4, 0xe8, 0xf1, 0x35, 0xdb, 0x5a,
+	0x75, 0x1f, 0x73, 0x60, 0xec, 0xa8, 0xda, 0x5a,
+	0x43, 0x15, 0x83, 0x9b, 0xe7, 0xb1, 0xa6, 0x81,
+	0xbb, 0xef, 0xf3, 0x8f, 0x0f, 0xd3, 0x79, 0xa2,
+	0xe5, 0xaa, 0x42, 0xef, 0xa0, 0x13, 0x4e, 0x91,
+	0x2d, 0xcb, 0x61, 0x7a, 0x9a, 0x33, 0x14, 0x50,
+	0x77, 0x4a, 0xd0, 0x91, 0x48, 0xe0, 0x0c, 0xe0,
+	0x11, 0xcb, 0xdf, 0xb0, 0xce, 0x06, 0xd2, 0x79,
+	0x4d, 0x69, 0xb9, 0xc9, 0x36, 0x74, 0x8f, 0x81,
+	0x72, 0x73, 0xf3, 0x17, 0xb7, 0x13, 0xcb, 0x5b,
+	0xd2, 0x5c, 0x33, 0x61, 0xb7, 0x61, 0x79, 0xb0,
+	0xc0, 0x4d, 0xa1, 0xc7, 0x5d, 0x98, 0xc9, 0xe1,
+	0x98, 0xbd, 0x78, 0x5a, 0x2c, 0x64, 0x53, 0xaf,
+	0xaf, 0x66, 0x51, 0x47, 0xe4, 0x48, 0x66, 0x8b,
+	0x07, 0x52, 0xa3, 0x03, 0x93, 0x28, 0xad, 0xcc,
+	0xa3, 0x86, 0xad, 0x63, 0x04, 0x35, 0x6c, 0x49,
+	0xd5, 0x28, 0x0e, 0x00, 0x47, 0xf4, 0xd4, 0x32,
+	0x27, 0x19, 0xb3, 0x29, 0xe7, 0xbc, 0xbb, 0xce,
+	0x3e, 0x3e, 0xd5, 0x67, 0x20, 0xe4, 0x0b, 0x75,
+	0x95, 0x24, 0xe0, 0x6c, 0xb6, 0x29, 0x0c, 0x14,
+	0xfd
+};
+static const u8 key74[] __initconst = {
+	0xf0, 0x41, 0x5b, 0x00, 0x56, 0xc4, 0xac, 0xf6,
+	0xa2, 0x4c, 0x33, 0x41, 0x16, 0x09, 0x1b, 0x8e,
+	0x4d, 0xe8, 0x8c, 0xd9, 0x48, 0xab, 0x3e, 0x60,
+	0xcb, 0x49, 0x3e, 0xaf, 0x2b, 0x8b, 0xc8, 0xf0
+};
+enum { nonce74 = 0xcbdb0ffd0e923384ULL };
+
+static const struct chacha20_testvec chacha20_testvecs[] __initconst = {
+	{ input01, output01, key01, nonce01, sizeof(input01) },
+	{ input02, output02, key02, nonce02, sizeof(input02) },
+	{ input03, output03, key03, nonce03, sizeof(input03) },
+	{ input04, output04, key04, nonce04, sizeof(input04) },
+	{ input05, output05, key05, nonce05, sizeof(input05) },
+	{ input06, output06, key06, nonce06, sizeof(input06) },
+	{ input07, output07, key07, nonce07, sizeof(input07) },
+	{ input08, output08, key08, nonce08, sizeof(input08) },
+	{ input09, output09, key09, nonce09, sizeof(input09) },
+	{ input10, output10, key10, nonce10, sizeof(input10) },
+	{ input11, output11, key11, nonce11, sizeof(input11) },
+	{ input12, output12, key12, nonce12, sizeof(input12) },
+	{ input13, output13, key13, nonce13, sizeof(input13) },
+	{ input14, output14, key14, nonce14, sizeof(input14) },
+	{ input15, output15, key15, nonce15, sizeof(input15) },
+	{ input16, output16, key16, nonce16, sizeof(input16) },
+	{ input17, output17, key17, nonce17, sizeof(input17) },
+	{ input18, output18, key18, nonce18, sizeof(input18) },
+	{ input19, output19, key19, nonce19, sizeof(input19) },
+	{ input20, output20, key20, nonce20, sizeof(input20) },
+	{ input21, output21, key21, nonce21, sizeof(input21) },
+	{ input22, output22, key22, nonce22, sizeof(input22) },
+	{ input23, output23, key23, nonce23, sizeof(input23) },
+	{ input24, output24, key24, nonce24, sizeof(input24) },
+	{ input25, output25, key25, nonce25, sizeof(input25) },
+	{ input26, output26, key26, nonce26, sizeof(input26) },
+	{ input27, output27, key27, nonce27, sizeof(input27) },
+	{ input28, output28, key28, nonce28, sizeof(input28) },
+	{ input29, output29, key29, nonce29, sizeof(input29) },
+	{ input30, output30, key30, nonce30, sizeof(input30) },
+	{ input31, output31, key31, nonce31, sizeof(input31) },
+	{ input32, output32, key32, nonce32, sizeof(input32) },
+	{ input33, output33, key33, nonce33, sizeof(input33) },
+	{ input34, output34, key34, nonce34, sizeof(input34) },
+	{ input35, output35, key35, nonce35, sizeof(input35) },
+	{ input36, output36, key36, nonce36, sizeof(input36) },
+	{ input37, output37, key37, nonce37, sizeof(input37) },
+	{ input38, output38, key38, nonce38, sizeof(input38) },
+	{ input39, output39, key39, nonce39, sizeof(input39) },
+	{ input40, output40, key40, nonce40, sizeof(input40) },
+	{ input41, output41, key41, nonce41, sizeof(input41) },
+	{ input42, output42, key42, nonce42, sizeof(input42) },
+	{ input43, output43, key43, nonce43, sizeof(input43) },
+	{ input44, output44, key44, nonce44, sizeof(input44) },
+	{ input45, output45, key45, nonce45, sizeof(input45) },
+	{ input46, output46, key46, nonce46, sizeof(input46) },
+	{ input47, output47, key47, nonce47, sizeof(input47) },
+	{ input48, output48, key48, nonce48, sizeof(input48) },
+	{ input49, output49, key49, nonce49, sizeof(input49) },
+	{ input50, output50, key50, nonce50, sizeof(input50) },
+	{ input51, output51, key51, nonce51, sizeof(input51) },
+	{ input52, output52, key52, nonce52, sizeof(input52) },
+	{ input53, output53, key53, nonce53, sizeof(input53) },
+	{ input54, output54, key54, nonce54, sizeof(input54) },
+	{ input55, output55, key55, nonce55, sizeof(input55) },
+	{ input56, output56, key56, nonce56, sizeof(input56) },
+	{ input57, output57, key57, nonce57, sizeof(input57) },
+	{ input58, output58, key58, nonce58, sizeof(input58) },
+	{ input59, output59, key59, nonce59, sizeof(input59) },
+	{ input60, output60, key60, nonce60, sizeof(input60) },
+	{ input61, output61, key61, nonce61, sizeof(input61) },
+	{ input62, output62, key62, nonce62, sizeof(input62) },
+	{ input63, output63, key63, nonce63, sizeof(input63) },
+	{ input64, output64, key64, nonce64, sizeof(input64) },
+	{ input65, output65, key65, nonce65, sizeof(input65) },
+	{ input66, output66, key66, nonce66, sizeof(input66) },
+	{ input67, output67, key67, nonce67, sizeof(input67) },
+	{ input68, output68, key68, nonce68, sizeof(input68) },
+	{ input69, output69, key69, nonce69, sizeof(input69) },
+	{ input70, output70, key70, nonce70, sizeof(input70) },
+	{ input71, output71, key71, nonce71, sizeof(input71) },
+	{ input72, output72, key72, nonce72, sizeof(input72) },
+	{ input73, output73, key73, nonce73, sizeof(input73) },
+	{ input74, output74, key74, nonce74, sizeof(input74) }
+};
+
+static const struct hchacha20_testvec hchacha20_testvecs[] __initconst = {{
+	.key	= { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+		    0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
+		    0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
+		    0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f },
+	.nonce	= { 0x00, 0x00, 0x00, 0x09, 0x00, 0x00, 0x00, 0x4a,
+		    0x00, 0x00, 0x00, 0x00, 0x31, 0x41, 0x59, 0x27 },
+	.output	= { 0x82, 0x41, 0x3b, 0x42, 0x27, 0xb2, 0x7b, 0xfe,
+		    0xd3, 0x0e, 0x42, 0x50, 0x8a, 0x87, 0x7d, 0x73,
+		    0xa0, 0xf9, 0xe4, 0xd5, 0x8a, 0x74, 0xa8, 0x53,
+		    0xc1, 0x2e, 0xc4, 0x13, 0x26, 0xd3, 0xec, 0xdc }
+}};
+
+static bool __init chacha20_selftest(void)
+{
+	enum {
+		MAXIMUM_TEST_BUFFER_LEN = 1UL << 10,
+		OUTRAGEOUSLY_HUGE_BUFFER_LEN = PAGE_SIZE * 35 + 17 /* 143k */
+	};
+	size_t i, j, k;
+	u32 derived_key[CHACHA20_KEY_WORDS];
+	u8 *offset_input = NULL, *computed_output = NULL, *massive_input = NULL;
+	u8 offset_key[CHACHA20_KEY_SIZE + 1]
+			__aligned(__alignof__(unsigned long));
+	struct chacha20_ctx state;
+	bool success = true;
+	simd_context_t simd_context;
+
+	offset_input = kmalloc(MAXIMUM_TEST_BUFFER_LEN + 1, GFP_KERNEL);
+	computed_output = kmalloc(MAXIMUM_TEST_BUFFER_LEN + 1, GFP_KERNEL);
+	massive_input = vzalloc(OUTRAGEOUSLY_HUGE_BUFFER_LEN);
+	if (!computed_output || !offset_input || !massive_input) {
+		pr_err("chacha20 self-test malloc: FAIL\n");
+		success = false;
+		goto out;
+	}
+
+	simd_get(&simd_context);
+	for (i = 0; i < ARRAY_SIZE(chacha20_testvecs); ++i) {
+		/* Boring case */
+		memset(computed_output, 0, MAXIMUM_TEST_BUFFER_LEN + 1);
+		memset(&state, 0, sizeof(state));
+		chacha20_init(&state, chacha20_testvecs[i].key,
+			      chacha20_testvecs[i].nonce);
+		chacha20(&state, computed_output, chacha20_testvecs[i].input,
+			 chacha20_testvecs[i].ilen, &simd_context);
+		if (memcmp(computed_output, chacha20_testvecs[i].output,
+			   chacha20_testvecs[i].ilen)) {
+			pr_err("chacha20 self-test %zu: FAIL\n", i + 1);
+			success = false;
+		}
+		for (k = chacha20_testvecs[i].ilen;
+		     k < MAXIMUM_TEST_BUFFER_LEN + 1; ++k) {
+			if (computed_output[k]) {
+				pr_err("chacha20 self-test %zu (zero check): FAIL\n",
+				       i + 1);
+				success = false;
+				break;
+			}
+		}
+
+		/* Unaligned case */
+		memset(computed_output, 0, MAXIMUM_TEST_BUFFER_LEN + 1);
+		memset(&state, 0, sizeof(state));
+		memcpy(offset_input + 1, chacha20_testvecs[i].input,
+		       chacha20_testvecs[i].ilen);
+		memcpy(offset_key + 1, chacha20_testvecs[i].key,
+		       CHACHA20_KEY_SIZE);
+		chacha20_init(&state, offset_key + 1, chacha20_testvecs[i].nonce);
+		chacha20(&state, computed_output + 1, offset_input + 1,
+			 chacha20_testvecs[i].ilen, &simd_context);
+		if (memcmp(computed_output + 1, chacha20_testvecs[i].output,
+			   chacha20_testvecs[i].ilen)) {
+			pr_err("chacha20 self-test %zu (unaligned): FAIL\n",
+			       i + 1);
+			success = false;
+		}
+		if (computed_output[0]) {
+			pr_err("chacha20 self-test %zu (unaligned, zero check): FAIL\n",
+			       i + 1);
+			success = false;
+		}
+		for (k = chacha20_testvecs[i].ilen + 1;
+		     k < MAXIMUM_TEST_BUFFER_LEN + 1; ++k) {
+			if (computed_output[k]) {
+				pr_err("chacha20 self-test %zu (unaligned, zero check): FAIL\n",
+				       i + 1);
+				success = false;
+				break;
+			}
+		}
+
+		/* Chunked case */
+		if (chacha20_testvecs[i].ilen <= CHACHA20_BLOCK_SIZE)
+			goto next_test;
+		memset(computed_output, 0, MAXIMUM_TEST_BUFFER_LEN + 1);
+		memset(&state, 0, sizeof(state));
+		chacha20_init(&state, chacha20_testvecs[i].key,
+			      chacha20_testvecs[i].nonce);
+		chacha20(&state, computed_output, chacha20_testvecs[i].input,
+			 CHACHA20_BLOCK_SIZE, &simd_context);
+		chacha20(&state, computed_output + CHACHA20_BLOCK_SIZE,
+			 chacha20_testvecs[i].input + CHACHA20_BLOCK_SIZE,
+			 chacha20_testvecs[i].ilen - CHACHA20_BLOCK_SIZE,
+			 &simd_context);
+		if (memcmp(computed_output, chacha20_testvecs[i].output,
+			   chacha20_testvecs[i].ilen)) {
+			pr_err("chacha20 self-test %zu (chunked): FAIL\n",
+			       i + 1);
+			success = false;
+		}
+		for (k = chacha20_testvecs[i].ilen;
+		     k < MAXIMUM_TEST_BUFFER_LEN + 1; ++k) {
+			if (computed_output[k]) {
+				pr_err("chacha20 self-test %zu (chunked, zero check): FAIL\n",
+				       i + 1);
+				success = false;
+				break;
+			}
+		}
+
+next_test:
+		/* Sliding unaligned case */
+		if (chacha20_testvecs[i].ilen > CHACHA20_BLOCK_SIZE + 1 ||
+		    !chacha20_testvecs[i].ilen)
+			continue;
+		for (j = 1; j < CHACHA20_BLOCK_SIZE; ++j) {
+			memset(computed_output, 0, MAXIMUM_TEST_BUFFER_LEN + 1);
+			memset(&state, 0, sizeof(state));
+			memcpy(offset_input + j, chacha20_testvecs[i].input,
+			       chacha20_testvecs[i].ilen);
+			chacha20_init(&state, chacha20_testvecs[i].key,
+				      chacha20_testvecs[i].nonce);
+			chacha20(&state, computed_output + j, offset_input + j,
+				 chacha20_testvecs[i].ilen, &simd_context);
+			if (memcmp(computed_output + j,
+				   chacha20_testvecs[i].output,
+				   chacha20_testvecs[i].ilen)) {
+				pr_err("chacha20 self-test %zu (unaligned, slide %zu): FAIL\n",
+				       i + 1, j);
+				success = false;
+			}
+			for (k = j; k < j; ++k) {
+				if (computed_output[k]) {
+					pr_err("chacha20 self-test %zu (unaligned, slide %zu, zero check): FAIL\n",
+					       i + 1, j);
+					success = false;
+					break;
+				}
+			}
+			for (k = chacha20_testvecs[i].ilen + j;
+			     k < MAXIMUM_TEST_BUFFER_LEN + 1; ++k) {
+				if (computed_output[k]) {
+					pr_err("chacha20 self-test %zu (unaligned, slide %zu, zero check): FAIL\n",
+					       i + 1, j);
+					success = false;
+					break;
+				}
+			}
+		}
+	}
+	for (i = 0; i < ARRAY_SIZE(hchacha20_testvecs); ++i) {
+		memset(&derived_key, 0, sizeof(derived_key));
+		hchacha20(derived_key, hchacha20_testvecs[i].nonce,
+			  hchacha20_testvecs[i].key, &simd_context);
+		cpu_to_le32_array(derived_key, ARRAY_SIZE(derived_key));
+		if (memcmp(derived_key, hchacha20_testvecs[i].output,
+			   CHACHA20_KEY_SIZE)) {
+			pr_err("hchacha20 self-test %zu: FAIL\n", i + 1);
+			success = false;
+		}
+	}
+	memset(&state, 0, sizeof(state));
+	chacha20_init(&state, chacha20_testvecs[0].key,
+		      chacha20_testvecs[0].nonce);
+	chacha20(&state, massive_input, massive_input,
+		 OUTRAGEOUSLY_HUGE_BUFFER_LEN, &simd_context);
+	chacha20_init(&state, chacha20_testvecs[0].key,
+		      chacha20_testvecs[0].nonce);
+	chacha20(&state, massive_input, massive_input,
+		 OUTRAGEOUSLY_HUGE_BUFFER_LEN, DONT_USE_SIMD);
+	for (k = 0; k < OUTRAGEOUSLY_HUGE_BUFFER_LEN; ++k) {
+		if (massive_input[k]) {
+			pr_err("chacha20 self-test massive: FAIL\n");
+			success = false;
+			break;
+		}
+	}
+
+	simd_put(&simd_context);
+
+out:
+	kfree(offset_input);
+	kfree(computed_output);
+	vfree(massive_input);
+	return success;
+}
diff --git a/lib/zinc/selftest/run.h b/lib/zinc/selftest/run.h
new file mode 100644
index 000000000000..74f607a8e0e2
--- /dev/null
+++ b/lib/zinc/selftest/run.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef _ZINC_SELFTEST_RUN_H
+#define _ZINC_SELFTEST_RUN_H
+
+#include <linux/kernel.h>
+#include <linux/printk.h>
+#include <linux/bug.h>
+
+static inline bool selftest_run(const char *name, bool (*selftest)(void),
+				bool *const nobs[], unsigned int nobs_len)
+{
+	unsigned long set = 0, subset = 0, largest_subset = 0;
+	unsigned int i;
+
+	BUILD_BUG_ON(!__builtin_constant_p(nobs_len) ||
+		     nobs_len >= BITS_PER_LONG);
+
+	if (!IS_ENABLED(CONFIG_ZINC_SELFTEST))
+		return true;
+
+	for (i = 0; i < nobs_len; ++i)
+		set |= ((unsigned long)*nobs[i]) << i;
+
+	do {
+		for (i = 0; i < nobs_len; ++i)
+			*nobs[i] = BIT(i) & subset;
+		if (selftest())
+			largest_subset = max(subset, largest_subset);
+		else
+			pr_err("%s self-test combination 0x%lx: FAIL\n", name,
+			       subset);
+		subset = (subset - set) & set;
+	} while (subset);
+
+	for (i = 0; i < nobs_len; ++i)
+		*nobs[i] = BIT(i) & largest_subset;
+
+	if (largest_subset == set)
+		pr_info("%s self-tests: pass\n", name);
+
+	return !WARN_ON(largest_subset != set);
+}
+
+#endif
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next v8 05/28] zinc: import Andy Polyakov's ChaCha20 x86_64 implementation
From: Jason A. Donenfeld @ 2018-10-18 14:56 UTC (permalink / raw)
  To: linux-kernel, netdev, linux-crypto, davem, gregkh
  Cc: Jason A. Donenfeld, Andy Polyakov, Thomas Gleixner, Ingo Molnar,
	x86, Samuel Neves, Jean-Philippe Aumasson, Andy Lutomirski,
	Andrew Morton, Linus Torvalds, kernel-hardening
In-Reply-To: <20181018145712.7538-1-Jason@zx2c4.com>

These x86_64 vectorized implementations come from Andy Polyakov's
implementation, and are included here in raw form without modification,
so that subsequent commits that fix these up for the kernel can see how
it has changed.

While this is CRYPTOGAMS code, the originating code for this happens to
be the same as OpenSSL's commit cded951378069a478391843f5f8653c1eb5128da

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Based-on-code-from: Andy Polyakov <appro@openssl.org>
Cc: Andy Polyakov <appro@openssl.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: x86@kernel.org
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
---
 .../chacha20/chacha20-x86_64-cryptogams.S     | 3433 +++++++++++++++++
 1 file changed, 3433 insertions(+)
 create mode 100644 lib/zinc/chacha20/chacha20-x86_64-cryptogams.S

diff --git a/lib/zinc/chacha20/chacha20-x86_64-cryptogams.S b/lib/zinc/chacha20/chacha20-x86_64-cryptogams.S
new file mode 100644
index 000000000000..2bfc76f7e01f
--- /dev/null
+++ b/lib/zinc/chacha20/chacha20-x86_64-cryptogams.S
@@ -0,0 +1,3433 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+/*
+ * Copyright (C) 2006-2017 CRYPTOGAMS by <appro@openssl.org>. All Rights Reserved.
+ */
+
+.text	
+
+
+
+.align	64
+.Lzero:
+.long	0,0,0,0
+.Lone:
+.long	1,0,0,0
+.Linc:
+.long	0,1,2,3
+.Lfour:
+.long	4,4,4,4
+.Lincy:
+.long	0,2,4,6,1,3,5,7
+.Leight:
+.long	8,8,8,8,8,8,8,8
+.Lrot16:
+.byte	0x2,0x3,0x0,0x1, 0x6,0x7,0x4,0x5, 0xa,0xb,0x8,0x9, 0xe,0xf,0xc,0xd
+.Lrot24:
+.byte	0x3,0x0,0x1,0x2, 0x7,0x4,0x5,0x6, 0xb,0x8,0x9,0xa, 0xf,0xc,0xd,0xe
+.Ltwoy:
+.long	2,0,0,0, 2,0,0,0
+.align	64
+.Lzeroz:
+.long	0,0,0,0, 1,0,0,0, 2,0,0,0, 3,0,0,0
+.Lfourz:
+.long	4,0,0,0, 4,0,0,0, 4,0,0,0, 4,0,0,0
+.Lincz:
+.long	0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
+.Lsixteen:
+.long	16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16
+.Lsigma:
+.byte	101,120,112,97,110,100,32,51,50,45,98,121,116,101,32,107,0
+.byte	67,104,97,67,104,97,50,48,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
+.globl	ChaCha20_ctr32
+.type	ChaCha20_ctr32,@function
+.align	64
+ChaCha20_ctr32:
+.cfi_startproc	
+	cmpq	$0,%rdx
+	je	.Lno_data
+	movq	OPENSSL_ia32cap_P+4(%rip),%r10
+	btq	$48,%r10
+	jc	.LChaCha20_avx512
+	testq	%r10,%r10
+	js	.LChaCha20_avx512vl
+	testl	$512,%r10d
+	jnz	.LChaCha20_ssse3
+
+	pushq	%rbx
+.cfi_adjust_cfa_offset	8
+.cfi_offset	%rbx,-16
+	pushq	%rbp
+.cfi_adjust_cfa_offset	8
+.cfi_offset	%rbp,-24
+	pushq	%r12
+.cfi_adjust_cfa_offset	8
+.cfi_offset	%r12,-32
+	pushq	%r13
+.cfi_adjust_cfa_offset	8
+.cfi_offset	%r13,-40
+	pushq	%r14
+.cfi_adjust_cfa_offset	8
+.cfi_offset	%r14,-48
+	pushq	%r15
+.cfi_adjust_cfa_offset	8
+.cfi_offset	%r15,-56
+	subq	$64+24,%rsp
+.cfi_adjust_cfa_offset	64+24
+.Lctr32_body:
+
+
+	movdqu	(%rcx),%xmm1
+	movdqu	16(%rcx),%xmm2
+	movdqu	(%r8),%xmm3
+	movdqa	.Lone(%rip),%xmm4
+
+
+	movdqa	%xmm1,16(%rsp)
+	movdqa	%xmm2,32(%rsp)
+	movdqa	%xmm3,48(%rsp)
+	movq	%rdx,%rbp
+	jmp	.Loop_outer
+
+.align	32
+.Loop_outer:
+	movl	$0x61707865,%eax
+	movl	$0x3320646e,%ebx
+	movl	$0x79622d32,%ecx
+	movl	$0x6b206574,%edx
+	movl	16(%rsp),%r8d
+	movl	20(%rsp),%r9d
+	movl	24(%rsp),%r10d
+	movl	28(%rsp),%r11d
+	movd	%xmm3,%r12d
+	movl	52(%rsp),%r13d
+	movl	56(%rsp),%r14d
+	movl	60(%rsp),%r15d
+
+	movq	%rbp,64+0(%rsp)
+	movl	$10,%ebp
+	movq	%rsi,64+8(%rsp)
+.byte	102,72,15,126,214
+	movq	%rdi,64+16(%rsp)
+	movq	%rsi,%rdi
+	shrq	$32,%rdi
+	jmp	.Loop
+
+.align	32
+.Loop:
+	addl	%r8d,%eax
+	xorl	%eax,%r12d
+	roll	$16,%r12d
+	addl	%r9d,%ebx
+	xorl	%ebx,%r13d
+	roll	$16,%r13d
+	addl	%r12d,%esi
+	xorl	%esi,%r8d
+	roll	$12,%r8d
+	addl	%r13d,%edi
+	xorl	%edi,%r9d
+	roll	$12,%r9d
+	addl	%r8d,%eax
+	xorl	%eax,%r12d
+	roll	$8,%r12d
+	addl	%r9d,%ebx
+	xorl	%ebx,%r13d
+	roll	$8,%r13d
+	addl	%r12d,%esi
+	xorl	%esi,%r8d
+	roll	$7,%r8d
+	addl	%r13d,%edi
+	xorl	%edi,%r9d
+	roll	$7,%r9d
+	movl	%esi,32(%rsp)
+	movl	%edi,36(%rsp)
+	movl	40(%rsp),%esi
+	movl	44(%rsp),%edi
+	addl	%r10d,%ecx
+	xorl	%ecx,%r14d
+	roll	$16,%r14d
+	addl	%r11d,%edx
+	xorl	%edx,%r15d
+	roll	$16,%r15d
+	addl	%r14d,%esi
+	xorl	%esi,%r10d
+	roll	$12,%r10d
+	addl	%r15d,%edi
+	xorl	%edi,%r11d
+	roll	$12,%r11d
+	addl	%r10d,%ecx
+	xorl	%ecx,%r14d
+	roll	$8,%r14d
+	addl	%r11d,%edx
+	xorl	%edx,%r15d
+	roll	$8,%r15d
+	addl	%r14d,%esi
+	xorl	%esi,%r10d
+	roll	$7,%r10d
+	addl	%r15d,%edi
+	xorl	%edi,%r11d
+	roll	$7,%r11d
+	addl	%r9d,%eax
+	xorl	%eax,%r15d
+	roll	$16,%r15d
+	addl	%r10d,%ebx
+	xorl	%ebx,%r12d
+	roll	$16,%r12d
+	addl	%r15d,%esi
+	xorl	%esi,%r9d
+	roll	$12,%r9d
+	addl	%r12d,%edi
+	xorl	%edi,%r10d
+	roll	$12,%r10d
+	addl	%r9d,%eax
+	xorl	%eax,%r15d
+	roll	$8,%r15d
+	addl	%r10d,%ebx
+	xorl	%ebx,%r12d
+	roll	$8,%r12d
+	addl	%r15d,%esi
+	xorl	%esi,%r9d
+	roll	$7,%r9d
+	addl	%r12d,%edi
+	xorl	%edi,%r10d
+	roll	$7,%r10d
+	movl	%esi,40(%rsp)
+	movl	%edi,44(%rsp)
+	movl	32(%rsp),%esi
+	movl	36(%rsp),%edi
+	addl	%r11d,%ecx
+	xorl	%ecx,%r13d
+	roll	$16,%r13d
+	addl	%r8d,%edx
+	xorl	%edx,%r14d
+	roll	$16,%r14d
+	addl	%r13d,%esi
+	xorl	%esi,%r11d
+	roll	$12,%r11d
+	addl	%r14d,%edi
+	xorl	%edi,%r8d
+	roll	$12,%r8d
+	addl	%r11d,%ecx
+	xorl	%ecx,%r13d
+	roll	$8,%r13d
+	addl	%r8d,%edx
+	xorl	%edx,%r14d
+	roll	$8,%r14d
+	addl	%r13d,%esi
+	xorl	%esi,%r11d
+	roll	$7,%r11d
+	addl	%r14d,%edi
+	xorl	%edi,%r8d
+	roll	$7,%r8d
+	decl	%ebp
+	jnz	.Loop
+	movl	%edi,36(%rsp)
+	movl	%esi,32(%rsp)
+	movq	64(%rsp),%rbp
+	movdqa	%xmm2,%xmm1
+	movq	64+8(%rsp),%rsi
+	paddd	%xmm4,%xmm3
+	movq	64+16(%rsp),%rdi
+
+	addl	$0x61707865,%eax
+	addl	$0x3320646e,%ebx
+	addl	$0x79622d32,%ecx
+	addl	$0x6b206574,%edx
+	addl	16(%rsp),%r8d
+	addl	20(%rsp),%r9d
+	addl	24(%rsp),%r10d
+	addl	28(%rsp),%r11d
+	addl	48(%rsp),%r12d
+	addl	52(%rsp),%r13d
+	addl	56(%rsp),%r14d
+	addl	60(%rsp),%r15d
+	paddd	32(%rsp),%xmm1
+
+	cmpq	$64,%rbp
+	jb	.Ltail
+
+	xorl	0(%rsi),%eax
+	xorl	4(%rsi),%ebx
+	xorl	8(%rsi),%ecx
+	xorl	12(%rsi),%edx
+	xorl	16(%rsi),%r8d
+	xorl	20(%rsi),%r9d
+	xorl	24(%rsi),%r10d
+	xorl	28(%rsi),%r11d
+	movdqu	32(%rsi),%xmm0
+	xorl	48(%rsi),%r12d
+	xorl	52(%rsi),%r13d
+	xorl	56(%rsi),%r14d
+	xorl	60(%rsi),%r15d
+	leaq	64(%rsi),%rsi
+	pxor	%xmm1,%xmm0
+
+	movdqa	%xmm2,32(%rsp)
+	movd	%xmm3,48(%rsp)
+
+	movl	%eax,0(%rdi)
+	movl	%ebx,4(%rdi)
+	movl	%ecx,8(%rdi)
+	movl	%edx,12(%rdi)
+	movl	%r8d,16(%rdi)
+	movl	%r9d,20(%rdi)
+	movl	%r10d,24(%rdi)
+	movl	%r11d,28(%rdi)
+	movdqu	%xmm0,32(%rdi)
+	movl	%r12d,48(%rdi)
+	movl	%r13d,52(%rdi)
+	movl	%r14d,56(%rdi)
+	movl	%r15d,60(%rdi)
+	leaq	64(%rdi),%rdi
+
+	subq	$64,%rbp
+	jnz	.Loop_outer
+
+	jmp	.Ldone
+
+.align	16
+.Ltail:
+	movl	%eax,0(%rsp)
+	movl	%ebx,4(%rsp)
+	xorq	%rbx,%rbx
+	movl	%ecx,8(%rsp)
+	movl	%edx,12(%rsp)
+	movl	%r8d,16(%rsp)
+	movl	%r9d,20(%rsp)
+	movl	%r10d,24(%rsp)
+	movl	%r11d,28(%rsp)
+	movdqa	%xmm1,32(%rsp)
+	movl	%r12d,48(%rsp)
+	movl	%r13d,52(%rsp)
+	movl	%r14d,56(%rsp)
+	movl	%r15d,60(%rsp)
+
+.Loop_tail:
+	movzbl	(%rsi,%rbx,1),%eax
+	movzbl	(%rsp,%rbx,1),%edx
+	leaq	1(%rbx),%rbx
+	xorl	%edx,%eax
+	movb	%al,-1(%rdi,%rbx,1)
+	decq	%rbp
+	jnz	.Loop_tail
+
+.Ldone:
+	leaq	64+24+48(%rsp),%rsi
+.cfi_def_cfa	%rsi,8
+	movq	-48(%rsi),%r15
+.cfi_restore	%r15
+	movq	-40(%rsi),%r14
+.cfi_restore	%r14
+	movq	-32(%rsi),%r13
+.cfi_restore	%r13
+	movq	-24(%rsi),%r12
+.cfi_restore	%r12
+	movq	-16(%rsi),%rbp
+.cfi_restore	%rbp
+	movq	-8(%rsi),%rbx
+.cfi_restore	%rbx
+	leaq	(%rsi),%rsp
+.cfi_def_cfa_register	%rsp
+.Lno_data:
+	.byte	0xf3,0xc3
+.cfi_endproc	
+.size	ChaCha20_ctr32,.-ChaCha20_ctr32
+.type	ChaCha20_ssse3,@function
+.align	32
+ChaCha20_ssse3:
+.cfi_startproc	
+.LChaCha20_ssse3:
+	movq	%rsp,%r9
+.cfi_def_cfa_register	%r9
+	testl	$2048,%r10d
+	jnz	.LChaCha20_4xop
+	cmpq	$128,%rdx
+	je	.LChaCha20_128
+	ja	.LChaCha20_4x
+
+.Ldo_sse3_after_all:
+	subq	$64+8,%rsp
+	movdqa	.Lsigma(%rip),%xmm0
+	movdqu	(%rcx),%xmm1
+	movdqu	16(%rcx),%xmm2
+	movdqu	(%r8),%xmm3
+	movdqa	.Lrot16(%rip),%xmm6
+	movdqa	.Lrot24(%rip),%xmm7
+
+	movdqa	%xmm0,0(%rsp)
+	movdqa	%xmm1,16(%rsp)
+	movdqa	%xmm2,32(%rsp)
+	movdqa	%xmm3,48(%rsp)
+	movq	$10,%r8
+	jmp	.Loop_ssse3
+
+.align	32
+.Loop_outer_ssse3:
+	movdqa	.Lone(%rip),%xmm3
+	movdqa	0(%rsp),%xmm0
+	movdqa	16(%rsp),%xmm1
+	movdqa	32(%rsp),%xmm2
+	paddd	48(%rsp),%xmm3
+	movq	$10,%r8
+	movdqa	%xmm3,48(%rsp)
+	jmp	.Loop_ssse3
+
+.align	32
+.Loop_ssse3:
+	paddd	%xmm1,%xmm0
+	pxor	%xmm0,%xmm3
+.byte	102,15,56,0,222
+	paddd	%xmm3,%xmm2
+	pxor	%xmm2,%xmm1
+	movdqa	%xmm1,%xmm4
+	psrld	$20,%xmm1
+	pslld	$12,%xmm4
+	por	%xmm4,%xmm1
+	paddd	%xmm1,%xmm0
+	pxor	%xmm0,%xmm3
+.byte	102,15,56,0,223
+	paddd	%xmm3,%xmm2
+	pxor	%xmm2,%xmm1
+	movdqa	%xmm1,%xmm4
+	psrld	$25,%xmm1
+	pslld	$7,%xmm4
+	por	%xmm4,%xmm1
+	pshufd	$78,%xmm2,%xmm2
+	pshufd	$57,%xmm1,%xmm1
+	pshufd	$147,%xmm3,%xmm3
+	nop
+	paddd	%xmm1,%xmm0
+	pxor	%xmm0,%xmm3
+.byte	102,15,56,0,222
+	paddd	%xmm3,%xmm2
+	pxor	%xmm2,%xmm1
+	movdqa	%xmm1,%xmm4
+	psrld	$20,%xmm1
+	pslld	$12,%xmm4
+	por	%xmm4,%xmm1
+	paddd	%xmm1,%xmm0
+	pxor	%xmm0,%xmm3
+.byte	102,15,56,0,223
+	paddd	%xmm3,%xmm2
+	pxor	%xmm2,%xmm1
+	movdqa	%xmm1,%xmm4
+	psrld	$25,%xmm1
+	pslld	$7,%xmm4
+	por	%xmm4,%xmm1
+	pshufd	$78,%xmm2,%xmm2
+	pshufd	$147,%xmm1,%xmm1
+	pshufd	$57,%xmm3,%xmm3
+	decq	%r8
+	jnz	.Loop_ssse3
+	paddd	0(%rsp),%xmm0
+	paddd	16(%rsp),%xmm1
+	paddd	32(%rsp),%xmm2
+	paddd	48(%rsp),%xmm3
+
+	cmpq	$64,%rdx
+	jb	.Ltail_ssse3
+
+	movdqu	0(%rsi),%xmm4
+	movdqu	16(%rsi),%xmm5
+	pxor	%xmm4,%xmm0
+	movdqu	32(%rsi),%xmm4
+	pxor	%xmm5,%xmm1
+	movdqu	48(%rsi),%xmm5
+	leaq	64(%rsi),%rsi
+	pxor	%xmm4,%xmm2
+	pxor	%xmm5,%xmm3
+
+	movdqu	%xmm0,0(%rdi)
+	movdqu	%xmm1,16(%rdi)
+	movdqu	%xmm2,32(%rdi)
+	movdqu	%xmm3,48(%rdi)
+	leaq	64(%rdi),%rdi
+
+	subq	$64,%rdx
+	jnz	.Loop_outer_ssse3
+
+	jmp	.Ldone_ssse3
+
+.align	16
+.Ltail_ssse3:
+	movdqa	%xmm0,0(%rsp)
+	movdqa	%xmm1,16(%rsp)
+	movdqa	%xmm2,32(%rsp)
+	movdqa	%xmm3,48(%rsp)
+	xorq	%r8,%r8
+
+.Loop_tail_ssse3:
+	movzbl	(%rsi,%r8,1),%eax
+	movzbl	(%rsp,%r8,1),%ecx
+	leaq	1(%r8),%r8
+	xorl	%ecx,%eax
+	movb	%al,-1(%rdi,%r8,1)
+	decq	%rdx
+	jnz	.Loop_tail_ssse3
+
+.Ldone_ssse3:
+	leaq	(%r9),%rsp
+.cfi_def_cfa_register	%rsp
+.Lssse3_epilogue:
+	.byte	0xf3,0xc3
+.cfi_endproc	
+.size	ChaCha20_ssse3,.-ChaCha20_ssse3
+.type	ChaCha20_128,@function
+.align	32
+ChaCha20_128:
+.cfi_startproc	
+.LChaCha20_128:
+	movq	%rsp,%r9
+.cfi_def_cfa_register	%r9
+	subq	$64+8,%rsp
+	movdqa	.Lsigma(%rip),%xmm8
+	movdqu	(%rcx),%xmm9
+	movdqu	16(%rcx),%xmm2
+	movdqu	(%r8),%xmm3
+	movdqa	.Lone(%rip),%xmm1
+	movdqa	.Lrot16(%rip),%xmm6
+	movdqa	.Lrot24(%rip),%xmm7
+
+	movdqa	%xmm8,%xmm10
+	movdqa	%xmm8,0(%rsp)
+	movdqa	%xmm9,%xmm11
+	movdqa	%xmm9,16(%rsp)
+	movdqa	%xmm2,%xmm0
+	movdqa	%xmm2,32(%rsp)
+	paddd	%xmm3,%xmm1
+	movdqa	%xmm3,48(%rsp)
+	movq	$10,%r8
+	jmp	.Loop_128
+
+.align	32
+.Loop_128:
+	paddd	%xmm9,%xmm8
+	pxor	%xmm8,%xmm3
+	paddd	%xmm11,%xmm10
+	pxor	%xmm10,%xmm1
+.byte	102,15,56,0,222
+.byte	102,15,56,0,206
+	paddd	%xmm3,%xmm2
+	paddd	%xmm1,%xmm0
+	pxor	%xmm2,%xmm9
+	pxor	%xmm0,%xmm11
+	movdqa	%xmm9,%xmm4
+	psrld	$20,%xmm9
+	movdqa	%xmm11,%xmm5
+	pslld	$12,%xmm4
+	psrld	$20,%xmm11
+	por	%xmm4,%xmm9
+	pslld	$12,%xmm5
+	por	%xmm5,%xmm11
+	paddd	%xmm9,%xmm8
+	pxor	%xmm8,%xmm3
+	paddd	%xmm11,%xmm10
+	pxor	%xmm10,%xmm1
+.byte	102,15,56,0,223
+.byte	102,15,56,0,207
+	paddd	%xmm3,%xmm2
+	paddd	%xmm1,%xmm0
+	pxor	%xmm2,%xmm9
+	pxor	%xmm0,%xmm11
+	movdqa	%xmm9,%xmm4
+	psrld	$25,%xmm9
+	movdqa	%xmm11,%xmm5
+	pslld	$7,%xmm4
+	psrld	$25,%xmm11
+	por	%xmm4,%xmm9
+	pslld	$7,%xmm5
+	por	%xmm5,%xmm11
+	pshufd	$78,%xmm2,%xmm2
+	pshufd	$57,%xmm9,%xmm9
+	pshufd	$147,%xmm3,%xmm3
+	pshufd	$78,%xmm0,%xmm0
+	pshufd	$57,%xmm11,%xmm11
+	pshufd	$147,%xmm1,%xmm1
+	paddd	%xmm9,%xmm8
+	pxor	%xmm8,%xmm3
+	paddd	%xmm11,%xmm10
+	pxor	%xmm10,%xmm1
+.byte	102,15,56,0,222
+.byte	102,15,56,0,206
+	paddd	%xmm3,%xmm2
+	paddd	%xmm1,%xmm0
+	pxor	%xmm2,%xmm9
+	pxor	%xmm0,%xmm11
+	movdqa	%xmm9,%xmm4
+	psrld	$20,%xmm9
+	movdqa	%xmm11,%xmm5
+	pslld	$12,%xmm4
+	psrld	$20,%xmm11
+	por	%xmm4,%xmm9
+	pslld	$12,%xmm5
+	por	%xmm5,%xmm11
+	paddd	%xmm9,%xmm8
+	pxor	%xmm8,%xmm3
+	paddd	%xmm11,%xmm10
+	pxor	%xmm10,%xmm1
+.byte	102,15,56,0,223
+.byte	102,15,56,0,207
+	paddd	%xmm3,%xmm2
+	paddd	%xmm1,%xmm0
+	pxor	%xmm2,%xmm9
+	pxor	%xmm0,%xmm11
+	movdqa	%xmm9,%xmm4
+	psrld	$25,%xmm9
+	movdqa	%xmm11,%xmm5
+	pslld	$7,%xmm4
+	psrld	$25,%xmm11
+	por	%xmm4,%xmm9
+	pslld	$7,%xmm5
+	por	%xmm5,%xmm11
+	pshufd	$78,%xmm2,%xmm2
+	pshufd	$147,%xmm9,%xmm9
+	pshufd	$57,%xmm3,%xmm3
+	pshufd	$78,%xmm0,%xmm0
+	pshufd	$147,%xmm11,%xmm11
+	pshufd	$57,%xmm1,%xmm1
+	decq	%r8
+	jnz	.Loop_128
+	paddd	0(%rsp),%xmm8
+	paddd	16(%rsp),%xmm9
+	paddd	32(%rsp),%xmm2
+	paddd	48(%rsp),%xmm3
+	paddd	.Lone(%rip),%xmm1
+	paddd	0(%rsp),%xmm10
+	paddd	16(%rsp),%xmm11
+	paddd	32(%rsp),%xmm0
+	paddd	48(%rsp),%xmm1
+
+	movdqu	0(%rsi),%xmm4
+	movdqu	16(%rsi),%xmm5
+	pxor	%xmm4,%xmm8
+	movdqu	32(%rsi),%xmm4
+	pxor	%xmm5,%xmm9
+	movdqu	48(%rsi),%xmm5
+	pxor	%xmm4,%xmm2
+	movdqu	64(%rsi),%xmm4
+	pxor	%xmm5,%xmm3
+	movdqu	80(%rsi),%xmm5
+	pxor	%xmm4,%xmm10
+	movdqu	96(%rsi),%xmm4
+	pxor	%xmm5,%xmm11
+	movdqu	112(%rsi),%xmm5
+	pxor	%xmm4,%xmm0
+	pxor	%xmm5,%xmm1
+
+	movdqu	%xmm8,0(%rdi)
+	movdqu	%xmm9,16(%rdi)
+	movdqu	%xmm2,32(%rdi)
+	movdqu	%xmm3,48(%rdi)
+	movdqu	%xmm10,64(%rdi)
+	movdqu	%xmm11,80(%rdi)
+	movdqu	%xmm0,96(%rdi)
+	movdqu	%xmm1,112(%rdi)
+	leaq	(%r9),%rsp
+.cfi_def_cfa_register	%rsp
+.L128_epilogue:
+	.byte	0xf3,0xc3
+.cfi_endproc	
+.size	ChaCha20_128,.-ChaCha20_128
+.type	ChaCha20_4x,@function
+.align	32
+ChaCha20_4x:
+.cfi_startproc	
+.LChaCha20_4x:
+	movq	%rsp,%r9
+.cfi_def_cfa_register	%r9
+	movq	%r10,%r11
+	shrq	$32,%r10
+	testq	$32,%r10
+	jnz	.LChaCha20_8x
+	cmpq	$192,%rdx
+	ja	.Lproceed4x
+
+	andq	$71303168,%r11
+	cmpq	$4194304,%r11
+	je	.Ldo_sse3_after_all
+
+.Lproceed4x:
+	subq	$0x140+8,%rsp
+	movdqa	.Lsigma(%rip),%xmm11
+	movdqu	(%rcx),%xmm15
+	movdqu	16(%rcx),%xmm7
+	movdqu	(%r8),%xmm3
+	leaq	256(%rsp),%rcx
+	leaq	.Lrot16(%rip),%r10
+	leaq	.Lrot24(%rip),%r11
+
+	pshufd	$0x00,%xmm11,%xmm8
+	pshufd	$0x55,%xmm11,%xmm9
+	movdqa	%xmm8,64(%rsp)
+	pshufd	$0xaa,%xmm11,%xmm10
+	movdqa	%xmm9,80(%rsp)
+	pshufd	$0xff,%xmm11,%xmm11
+	movdqa	%xmm10,96(%rsp)
+	movdqa	%xmm11,112(%rsp)
+
+	pshufd	$0x00,%xmm15,%xmm12
+	pshufd	$0x55,%xmm15,%xmm13
+	movdqa	%xmm12,128-256(%rcx)
+	pshufd	$0xaa,%xmm15,%xmm14
+	movdqa	%xmm13,144-256(%rcx)
+	pshufd	$0xff,%xmm15,%xmm15
+	movdqa	%xmm14,160-256(%rcx)
+	movdqa	%xmm15,176-256(%rcx)
+
+	pshufd	$0x00,%xmm7,%xmm4
+	pshufd	$0x55,%xmm7,%xmm5
+	movdqa	%xmm4,192-256(%rcx)
+	pshufd	$0xaa,%xmm7,%xmm6
+	movdqa	%xmm5,208-256(%rcx)
+	pshufd	$0xff,%xmm7,%xmm7
+	movdqa	%xmm6,224-256(%rcx)
+	movdqa	%xmm7,240-256(%rcx)
+
+	pshufd	$0x00,%xmm3,%xmm0
+	pshufd	$0x55,%xmm3,%xmm1
+	paddd	.Linc(%rip),%xmm0
+	pshufd	$0xaa,%xmm3,%xmm2
+	movdqa	%xmm1,272-256(%rcx)
+	pshufd	$0xff,%xmm3,%xmm3
+	movdqa	%xmm2,288-256(%rcx)
+	movdqa	%xmm3,304-256(%rcx)
+
+	jmp	.Loop_enter4x
+
+.align	32
+.Loop_outer4x:
+	movdqa	64(%rsp),%xmm8
+	movdqa	80(%rsp),%xmm9
+	movdqa	96(%rsp),%xmm10
+	movdqa	112(%rsp),%xmm11
+	movdqa	128-256(%rcx),%xmm12
+	movdqa	144-256(%rcx),%xmm13
+	movdqa	160-256(%rcx),%xmm14
+	movdqa	176-256(%rcx),%xmm15
+	movdqa	192-256(%rcx),%xmm4
+	movdqa	208-256(%rcx),%xmm5
+	movdqa	224-256(%rcx),%xmm6
+	movdqa	240-256(%rcx),%xmm7
+	movdqa	256-256(%rcx),%xmm0
+	movdqa	272-256(%rcx),%xmm1
+	movdqa	288-256(%rcx),%xmm2
+	movdqa	304-256(%rcx),%xmm3
+	paddd	.Lfour(%rip),%xmm0
+
+.Loop_enter4x:
+	movdqa	%xmm6,32(%rsp)
+	movdqa	%xmm7,48(%rsp)
+	movdqa	(%r10),%xmm7
+	movl	$10,%eax
+	movdqa	%xmm0,256-256(%rcx)
+	jmp	.Loop4x
+
+.align	32
+.Loop4x:
+	paddd	%xmm12,%xmm8
+	paddd	%xmm13,%xmm9
+	pxor	%xmm8,%xmm0
+	pxor	%xmm9,%xmm1
+.byte	102,15,56,0,199
+.byte	102,15,56,0,207
+	paddd	%xmm0,%xmm4
+	paddd	%xmm1,%xmm5
+	pxor	%xmm4,%xmm12
+	pxor	%xmm5,%xmm13
+	movdqa	%xmm12,%xmm6
+	pslld	$12,%xmm12
+	psrld	$20,%xmm6
+	movdqa	%xmm13,%xmm7
+	pslld	$12,%xmm13
+	por	%xmm6,%xmm12
+	psrld	$20,%xmm7
+	movdqa	(%r11),%xmm6
+	por	%xmm7,%xmm13
+	paddd	%xmm12,%xmm8
+	paddd	%xmm13,%xmm9
+	pxor	%xmm8,%xmm0
+	pxor	%xmm9,%xmm1
+.byte	102,15,56,0,198
+.byte	102,15,56,0,206
+	paddd	%xmm0,%xmm4
+	paddd	%xmm1,%xmm5
+	pxor	%xmm4,%xmm12
+	pxor	%xmm5,%xmm13
+	movdqa	%xmm12,%xmm7
+	pslld	$7,%xmm12
+	psrld	$25,%xmm7
+	movdqa	%xmm13,%xmm6
+	pslld	$7,%xmm13
+	por	%xmm7,%xmm12
+	psrld	$25,%xmm6
+	movdqa	(%r10),%xmm7
+	por	%xmm6,%xmm13
+	movdqa	%xmm4,0(%rsp)
+	movdqa	%xmm5,16(%rsp)
+	movdqa	32(%rsp),%xmm4
+	movdqa	48(%rsp),%xmm5
+	paddd	%xmm14,%xmm10
+	paddd	%xmm15,%xmm11
+	pxor	%xmm10,%xmm2
+	pxor	%xmm11,%xmm3
+.byte	102,15,56,0,215
+.byte	102,15,56,0,223
+	paddd	%xmm2,%xmm4
+	paddd	%xmm3,%xmm5
+	pxor	%xmm4,%xmm14
+	pxor	%xmm5,%xmm15
+	movdqa	%xmm14,%xmm6
+	pslld	$12,%xmm14
+	psrld	$20,%xmm6
+	movdqa	%xmm15,%xmm7
+	pslld	$12,%xmm15
+	por	%xmm6,%xmm14
+	psrld	$20,%xmm7
+	movdqa	(%r11),%xmm6
+	por	%xmm7,%xmm15
+	paddd	%xmm14,%xmm10
+	paddd	%xmm15,%xmm11
+	pxor	%xmm10,%xmm2
+	pxor	%xmm11,%xmm3
+.byte	102,15,56,0,214
+.byte	102,15,56,0,222
+	paddd	%xmm2,%xmm4
+	paddd	%xmm3,%xmm5
+	pxor	%xmm4,%xmm14
+	pxor	%xmm5,%xmm15
+	movdqa	%xmm14,%xmm7
+	pslld	$7,%xmm14
+	psrld	$25,%xmm7
+	movdqa	%xmm15,%xmm6
+	pslld	$7,%xmm15
+	por	%xmm7,%xmm14
+	psrld	$25,%xmm6
+	movdqa	(%r10),%xmm7
+	por	%xmm6,%xmm15
+	paddd	%xmm13,%xmm8
+	paddd	%xmm14,%xmm9
+	pxor	%xmm8,%xmm3
+	pxor	%xmm9,%xmm0
+.byte	102,15,56,0,223
+.byte	102,15,56,0,199
+	paddd	%xmm3,%xmm4
+	paddd	%xmm0,%xmm5
+	pxor	%xmm4,%xmm13
+	pxor	%xmm5,%xmm14
+	movdqa	%xmm13,%xmm6
+	pslld	$12,%xmm13
+	psrld	$20,%xmm6
+	movdqa	%xmm14,%xmm7
+	pslld	$12,%xmm14
+	por	%xmm6,%xmm13
+	psrld	$20,%xmm7
+	movdqa	(%r11),%xmm6
+	por	%xmm7,%xmm14
+	paddd	%xmm13,%xmm8
+	paddd	%xmm14,%xmm9
+	pxor	%xmm8,%xmm3
+	pxor	%xmm9,%xmm0
+.byte	102,15,56,0,222
+.byte	102,15,56,0,198
+	paddd	%xmm3,%xmm4
+	paddd	%xmm0,%xmm5
+	pxor	%xmm4,%xmm13
+	pxor	%xmm5,%xmm14
+	movdqa	%xmm13,%xmm7
+	pslld	$7,%xmm13
+	psrld	$25,%xmm7
+	movdqa	%xmm14,%xmm6
+	pslld	$7,%xmm14
+	por	%xmm7,%xmm13
+	psrld	$25,%xmm6
+	movdqa	(%r10),%xmm7
+	por	%xmm6,%xmm14
+	movdqa	%xmm4,32(%rsp)
+	movdqa	%xmm5,48(%rsp)
+	movdqa	0(%rsp),%xmm4
+	movdqa	16(%rsp),%xmm5
+	paddd	%xmm15,%xmm10
+	paddd	%xmm12,%xmm11
+	pxor	%xmm10,%xmm1
+	pxor	%xmm11,%xmm2
+.byte	102,15,56,0,207
+.byte	102,15,56,0,215
+	paddd	%xmm1,%xmm4
+	paddd	%xmm2,%xmm5
+	pxor	%xmm4,%xmm15
+	pxor	%xmm5,%xmm12
+	movdqa	%xmm15,%xmm6
+	pslld	$12,%xmm15
+	psrld	$20,%xmm6
+	movdqa	%xmm12,%xmm7
+	pslld	$12,%xmm12
+	por	%xmm6,%xmm15
+	psrld	$20,%xmm7
+	movdqa	(%r11),%xmm6
+	por	%xmm7,%xmm12
+	paddd	%xmm15,%xmm10
+	paddd	%xmm12,%xmm11
+	pxor	%xmm10,%xmm1
+	pxor	%xmm11,%xmm2
+.byte	102,15,56,0,206
+.byte	102,15,56,0,214
+	paddd	%xmm1,%xmm4
+	paddd	%xmm2,%xmm5
+	pxor	%xmm4,%xmm15
+	pxor	%xmm5,%xmm12
+	movdqa	%xmm15,%xmm7
+	pslld	$7,%xmm15
+	psrld	$25,%xmm7
+	movdqa	%xmm12,%xmm6
+	pslld	$7,%xmm12
+	por	%xmm7,%xmm15
+	psrld	$25,%xmm6
+	movdqa	(%r10),%xmm7
+	por	%xmm6,%xmm12
+	decl	%eax
+	jnz	.Loop4x
+
+	paddd	64(%rsp),%xmm8
+	paddd	80(%rsp),%xmm9
+	paddd	96(%rsp),%xmm10
+	paddd	112(%rsp),%xmm11
+
+	movdqa	%xmm8,%xmm6
+	punpckldq	%xmm9,%xmm8
+	movdqa	%xmm10,%xmm7
+	punpckldq	%xmm11,%xmm10
+	punpckhdq	%xmm9,%xmm6
+	punpckhdq	%xmm11,%xmm7
+	movdqa	%xmm8,%xmm9
+	punpcklqdq	%xmm10,%xmm8
+	movdqa	%xmm6,%xmm11
+	punpcklqdq	%xmm7,%xmm6
+	punpckhqdq	%xmm10,%xmm9
+	punpckhqdq	%xmm7,%xmm11
+	paddd	128-256(%rcx),%xmm12
+	paddd	144-256(%rcx),%xmm13
+	paddd	160-256(%rcx),%xmm14
+	paddd	176-256(%rcx),%xmm15
+
+	movdqa	%xmm8,0(%rsp)
+	movdqa	%xmm9,16(%rsp)
+	movdqa	32(%rsp),%xmm8
+	movdqa	48(%rsp),%xmm9
+
+	movdqa	%xmm12,%xmm10
+	punpckldq	%xmm13,%xmm12
+	movdqa	%xmm14,%xmm7
+	punpckldq	%xmm15,%xmm14
+	punpckhdq	%xmm13,%xmm10
+	punpckhdq	%xmm15,%xmm7
+	movdqa	%xmm12,%xmm13
+	punpcklqdq	%xmm14,%xmm12
+	movdqa	%xmm10,%xmm15
+	punpcklqdq	%xmm7,%xmm10
+	punpckhqdq	%xmm14,%xmm13
+	punpckhqdq	%xmm7,%xmm15
+	paddd	192-256(%rcx),%xmm4
+	paddd	208-256(%rcx),%xmm5
+	paddd	224-256(%rcx),%xmm8
+	paddd	240-256(%rcx),%xmm9
+
+	movdqa	%xmm6,32(%rsp)
+	movdqa	%xmm11,48(%rsp)
+
+	movdqa	%xmm4,%xmm14
+	punpckldq	%xmm5,%xmm4
+	movdqa	%xmm8,%xmm7
+	punpckldq	%xmm9,%xmm8
+	punpckhdq	%xmm5,%xmm14
+	punpckhdq	%xmm9,%xmm7
+	movdqa	%xmm4,%xmm5
+	punpcklqdq	%xmm8,%xmm4
+	movdqa	%xmm14,%xmm9
+	punpcklqdq	%xmm7,%xmm14
+	punpckhqdq	%xmm8,%xmm5
+	punpckhqdq	%xmm7,%xmm9
+	paddd	256-256(%rcx),%xmm0
+	paddd	272-256(%rcx),%xmm1
+	paddd	288-256(%rcx),%xmm2
+	paddd	304-256(%rcx),%xmm3
+
+	movdqa	%xmm0,%xmm8
+	punpckldq	%xmm1,%xmm0
+	movdqa	%xmm2,%xmm7
+	punpckldq	%xmm3,%xmm2
+	punpckhdq	%xmm1,%xmm8
+	punpckhdq	%xmm3,%xmm7
+	movdqa	%xmm0,%xmm1
+	punpcklqdq	%xmm2,%xmm0
+	movdqa	%xmm8,%xmm3
+	punpcklqdq	%xmm7,%xmm8
+	punpckhqdq	%xmm2,%xmm1
+	punpckhqdq	%xmm7,%xmm3
+	cmpq	$256,%rdx
+	jb	.Ltail4x
+
+	movdqu	0(%rsi),%xmm6
+	movdqu	16(%rsi),%xmm11
+	movdqu	32(%rsi),%xmm2
+	movdqu	48(%rsi),%xmm7
+	pxor	0(%rsp),%xmm6
+	pxor	%xmm12,%xmm11
+	pxor	%xmm4,%xmm2
+	pxor	%xmm0,%xmm7
+
+	movdqu	%xmm6,0(%rdi)
+	movdqu	64(%rsi),%xmm6
+	movdqu	%xmm11,16(%rdi)
+	movdqu	80(%rsi),%xmm11
+	movdqu	%xmm2,32(%rdi)
+	movdqu	96(%rsi),%xmm2
+	movdqu	%xmm7,48(%rdi)
+	movdqu	112(%rsi),%xmm7
+	leaq	128(%rsi),%rsi
+	pxor	16(%rsp),%xmm6
+	pxor	%xmm13,%xmm11
+	pxor	%xmm5,%xmm2
+	pxor	%xmm1,%xmm7
+
+	movdqu	%xmm6,64(%rdi)
+	movdqu	0(%rsi),%xmm6
+	movdqu	%xmm11,80(%rdi)
+	movdqu	16(%rsi),%xmm11
+	movdqu	%xmm2,96(%rdi)
+	movdqu	32(%rsi),%xmm2
+	movdqu	%xmm7,112(%rdi)
+	leaq	128(%rdi),%rdi
+	movdqu	48(%rsi),%xmm7
+	pxor	32(%rsp),%xmm6
+	pxor	%xmm10,%xmm11
+	pxor	%xmm14,%xmm2
+	pxor	%xmm8,%xmm7
+
+	movdqu	%xmm6,0(%rdi)
+	movdqu	64(%rsi),%xmm6
+	movdqu	%xmm11,16(%rdi)
+	movdqu	80(%rsi),%xmm11
+	movdqu	%xmm2,32(%rdi)
+	movdqu	96(%rsi),%xmm2
+	movdqu	%xmm7,48(%rdi)
+	movdqu	112(%rsi),%xmm7
+	leaq	128(%rsi),%rsi
+	pxor	48(%rsp),%xmm6
+	pxor	%xmm15,%xmm11
+	pxor	%xmm9,%xmm2
+	pxor	%xmm3,%xmm7
+	movdqu	%xmm6,64(%rdi)
+	movdqu	%xmm11,80(%rdi)
+	movdqu	%xmm2,96(%rdi)
+	movdqu	%xmm7,112(%rdi)
+	leaq	128(%rdi),%rdi
+
+	subq	$256,%rdx
+	jnz	.Loop_outer4x
+
+	jmp	.Ldone4x
+
+.Ltail4x:
+	cmpq	$192,%rdx
+	jae	.L192_or_more4x
+	cmpq	$128,%rdx
+	jae	.L128_or_more4x
+	cmpq	$64,%rdx
+	jae	.L64_or_more4x
+
+
+	xorq	%r10,%r10
+
+	movdqa	%xmm12,16(%rsp)
+	movdqa	%xmm4,32(%rsp)
+	movdqa	%xmm0,48(%rsp)
+	jmp	.Loop_tail4x
+
+.align	32
+.L64_or_more4x:
+	movdqu	0(%rsi),%xmm6
+	movdqu	16(%rsi),%xmm11
+	movdqu	32(%rsi),%xmm2
+	movdqu	48(%rsi),%xmm7
+	pxor	0(%rsp),%xmm6
+	pxor	%xmm12,%xmm11
+	pxor	%xmm4,%xmm2
+	pxor	%xmm0,%xmm7
+	movdqu	%xmm6,0(%rdi)
+	movdqu	%xmm11,16(%rdi)
+	movdqu	%xmm2,32(%rdi)
+	movdqu	%xmm7,48(%rdi)
+	je	.Ldone4x
+
+	movdqa	16(%rsp),%xmm6
+	leaq	64(%rsi),%rsi
+	xorq	%r10,%r10
+	movdqa	%xmm6,0(%rsp)
+	movdqa	%xmm13,16(%rsp)
+	leaq	64(%rdi),%rdi
+	movdqa	%xmm5,32(%rsp)
+	subq	$64,%rdx
+	movdqa	%xmm1,48(%rsp)
+	jmp	.Loop_tail4x
+
+.align	32
+.L128_or_more4x:
+	movdqu	0(%rsi),%xmm6
+	movdqu	16(%rsi),%xmm11
+	movdqu	32(%rsi),%xmm2
+	movdqu	48(%rsi),%xmm7
+	pxor	0(%rsp),%xmm6
+	pxor	%xmm12,%xmm11
+	pxor	%xmm4,%xmm2
+	pxor	%xmm0,%xmm7
+
+	movdqu	%xmm6,0(%rdi)
+	movdqu	64(%rsi),%xmm6
+	movdqu	%xmm11,16(%rdi)
+	movdqu	80(%rsi),%xmm11
+	movdqu	%xmm2,32(%rdi)
+	movdqu	96(%rsi),%xmm2
+	movdqu	%xmm7,48(%rdi)
+	movdqu	112(%rsi),%xmm7
+	pxor	16(%rsp),%xmm6
+	pxor	%xmm13,%xmm11
+	pxor	%xmm5,%xmm2
+	pxor	%xmm1,%xmm7
+	movdqu	%xmm6,64(%rdi)
+	movdqu	%xmm11,80(%rdi)
+	movdqu	%xmm2,96(%rdi)
+	movdqu	%xmm7,112(%rdi)
+	je	.Ldone4x
+
+	movdqa	32(%rsp),%xmm6
+	leaq	128(%rsi),%rsi
+	xorq	%r10,%r10
+	movdqa	%xmm6,0(%rsp)
+	movdqa	%xmm10,16(%rsp)
+	leaq	128(%rdi),%rdi
+	movdqa	%xmm14,32(%rsp)
+	subq	$128,%rdx
+	movdqa	%xmm8,48(%rsp)
+	jmp	.Loop_tail4x
+
+.align	32
+.L192_or_more4x:
+	movdqu	0(%rsi),%xmm6
+	movdqu	16(%rsi),%xmm11
+	movdqu	32(%rsi),%xmm2
+	movdqu	48(%rsi),%xmm7
+	pxor	0(%rsp),%xmm6
+	pxor	%xmm12,%xmm11
+	pxor	%xmm4,%xmm2
+	pxor	%xmm0,%xmm7
+
+	movdqu	%xmm6,0(%rdi)
+	movdqu	64(%rsi),%xmm6
+	movdqu	%xmm11,16(%rdi)
+	movdqu	80(%rsi),%xmm11
+	movdqu	%xmm2,32(%rdi)
+	movdqu	96(%rsi),%xmm2
+	movdqu	%xmm7,48(%rdi)
+	movdqu	112(%rsi),%xmm7
+	leaq	128(%rsi),%rsi
+	pxor	16(%rsp),%xmm6
+	pxor	%xmm13,%xmm11
+	pxor	%xmm5,%xmm2
+	pxor	%xmm1,%xmm7
+
+	movdqu	%xmm6,64(%rdi)
+	movdqu	0(%rsi),%xmm6
+	movdqu	%xmm11,80(%rdi)
+	movdqu	16(%rsi),%xmm11
+	movdqu	%xmm2,96(%rdi)
+	movdqu	32(%rsi),%xmm2
+	movdqu	%xmm7,112(%rdi)
+	leaq	128(%rdi),%rdi
+	movdqu	48(%rsi),%xmm7
+	pxor	32(%rsp),%xmm6
+	pxor	%xmm10,%xmm11
+	pxor	%xmm14,%xmm2
+	pxor	%xmm8,%xmm7
+	movdqu	%xmm6,0(%rdi)
+	movdqu	%xmm11,16(%rdi)
+	movdqu	%xmm2,32(%rdi)
+	movdqu	%xmm7,48(%rdi)
+	je	.Ldone4x
+
+	movdqa	48(%rsp),%xmm6
+	leaq	64(%rsi),%rsi
+	xorq	%r10,%r10
+	movdqa	%xmm6,0(%rsp)
+	movdqa	%xmm15,16(%rsp)
+	leaq	64(%rdi),%rdi
+	movdqa	%xmm9,32(%rsp)
+	subq	$192,%rdx
+	movdqa	%xmm3,48(%rsp)
+
+.Loop_tail4x:
+	movzbl	(%rsi,%r10,1),%eax
+	movzbl	(%rsp,%r10,1),%ecx
+	leaq	1(%r10),%r10
+	xorl	%ecx,%eax
+	movb	%al,-1(%rdi,%r10,1)
+	decq	%rdx
+	jnz	.Loop_tail4x
+
+.Ldone4x:
+	leaq	(%r9),%rsp
+.cfi_def_cfa_register	%rsp
+.L4x_epilogue:
+	.byte	0xf3,0xc3
+.cfi_endproc	
+.size	ChaCha20_4x,.-ChaCha20_4x
+.type	ChaCha20_4xop,@function
+.align	32
+ChaCha20_4xop:
+.cfi_startproc	
+.LChaCha20_4xop:
+	movq	%rsp,%r9
+.cfi_def_cfa_register	%r9
+	subq	$0x140+8,%rsp
+	vzeroupper
+
+	vmovdqa	.Lsigma(%rip),%xmm11
+	vmovdqu	(%rcx),%xmm3
+	vmovdqu	16(%rcx),%xmm15
+	vmovdqu	(%r8),%xmm7
+	leaq	256(%rsp),%rcx
+
+	vpshufd	$0x00,%xmm11,%xmm8
+	vpshufd	$0x55,%xmm11,%xmm9
+	vmovdqa	%xmm8,64(%rsp)
+	vpshufd	$0xaa,%xmm11,%xmm10
+	vmovdqa	%xmm9,80(%rsp)
+	vpshufd	$0xff,%xmm11,%xmm11
+	vmovdqa	%xmm10,96(%rsp)
+	vmovdqa	%xmm11,112(%rsp)
+
+	vpshufd	$0x00,%xmm3,%xmm0
+	vpshufd	$0x55,%xmm3,%xmm1
+	vmovdqa	%xmm0,128-256(%rcx)
+	vpshufd	$0xaa,%xmm3,%xmm2
+	vmovdqa	%xmm1,144-256(%rcx)
+	vpshufd	$0xff,%xmm3,%xmm3
+	vmovdqa	%xmm2,160-256(%rcx)
+	vmovdqa	%xmm3,176-256(%rcx)
+
+	vpshufd	$0x00,%xmm15,%xmm12
+	vpshufd	$0x55,%xmm15,%xmm13
+	vmovdqa	%xmm12,192-256(%rcx)
+	vpshufd	$0xaa,%xmm15,%xmm14
+	vmovdqa	%xmm13,208-256(%rcx)
+	vpshufd	$0xff,%xmm15,%xmm15
+	vmovdqa	%xmm14,224-256(%rcx)
+	vmovdqa	%xmm15,240-256(%rcx)
+
+	vpshufd	$0x00,%xmm7,%xmm4
+	vpshufd	$0x55,%xmm7,%xmm5
+	vpaddd	.Linc(%rip),%xmm4,%xmm4
+	vpshufd	$0xaa,%xmm7,%xmm6
+	vmovdqa	%xmm5,272-256(%rcx)
+	vpshufd	$0xff,%xmm7,%xmm7
+	vmovdqa	%xmm6,288-256(%rcx)
+	vmovdqa	%xmm7,304-256(%rcx)
+
+	jmp	.Loop_enter4xop
+
+.align	32
+.Loop_outer4xop:
+	vmovdqa	64(%rsp),%xmm8
+	vmovdqa	80(%rsp),%xmm9
+	vmovdqa	96(%rsp),%xmm10
+	vmovdqa	112(%rsp),%xmm11
+	vmovdqa	128-256(%rcx),%xmm0
+	vmovdqa	144-256(%rcx),%xmm1
+	vmovdqa	160-256(%rcx),%xmm2
+	vmovdqa	176-256(%rcx),%xmm3
+	vmovdqa	192-256(%rcx),%xmm12
+	vmovdqa	208-256(%rcx),%xmm13
+	vmovdqa	224-256(%rcx),%xmm14
+	vmovdqa	240-256(%rcx),%xmm15
+	vmovdqa	256-256(%rcx),%xmm4
+	vmovdqa	272-256(%rcx),%xmm5
+	vmovdqa	288-256(%rcx),%xmm6
+	vmovdqa	304-256(%rcx),%xmm7
+	vpaddd	.Lfour(%rip),%xmm4,%xmm4
+
+.Loop_enter4xop:
+	movl	$10,%eax
+	vmovdqa	%xmm4,256-256(%rcx)
+	jmp	.Loop4xop
+
+.align	32
+.Loop4xop:
+	vpaddd	%xmm0,%xmm8,%xmm8
+	vpaddd	%xmm1,%xmm9,%xmm9
+	vpaddd	%xmm2,%xmm10,%xmm10
+	vpaddd	%xmm3,%xmm11,%xmm11
+	vpxor	%xmm4,%xmm8,%xmm4
+	vpxor	%xmm5,%xmm9,%xmm5
+	vpxor	%xmm6,%xmm10,%xmm6
+	vpxor	%xmm7,%xmm11,%xmm7
+.byte	143,232,120,194,228,16
+.byte	143,232,120,194,237,16
+.byte	143,232,120,194,246,16
+.byte	143,232,120,194,255,16
+	vpaddd	%xmm4,%xmm12,%xmm12
+	vpaddd	%xmm5,%xmm13,%xmm13
+	vpaddd	%xmm6,%xmm14,%xmm14
+	vpaddd	%xmm7,%xmm15,%xmm15
+	vpxor	%xmm0,%xmm12,%xmm0
+	vpxor	%xmm1,%xmm13,%xmm1
+	vpxor	%xmm14,%xmm2,%xmm2
+	vpxor	%xmm15,%xmm3,%xmm3
+.byte	143,232,120,194,192,12
+.byte	143,232,120,194,201,12
+.byte	143,232,120,194,210,12
+.byte	143,232,120,194,219,12
+	vpaddd	%xmm8,%xmm0,%xmm8
+	vpaddd	%xmm9,%xmm1,%xmm9
+	vpaddd	%xmm2,%xmm10,%xmm10
+	vpaddd	%xmm3,%xmm11,%xmm11
+	vpxor	%xmm4,%xmm8,%xmm4
+	vpxor	%xmm5,%xmm9,%xmm5
+	vpxor	%xmm6,%xmm10,%xmm6
+	vpxor	%xmm7,%xmm11,%xmm7
+.byte	143,232,120,194,228,8
+.byte	143,232,120,194,237,8
+.byte	143,232,120,194,246,8
+.byte	143,232,120,194,255,8
+	vpaddd	%xmm4,%xmm12,%xmm12
+	vpaddd	%xmm5,%xmm13,%xmm13
+	vpaddd	%xmm6,%xmm14,%xmm14
+	vpaddd	%xmm7,%xmm15,%xmm15
+	vpxor	%xmm0,%xmm12,%xmm0
+	vpxor	%xmm1,%xmm13,%xmm1
+	vpxor	%xmm14,%xmm2,%xmm2
+	vpxor	%xmm15,%xmm3,%xmm3
+.byte	143,232,120,194,192,7
+.byte	143,232,120,194,201,7
+.byte	143,232,120,194,210,7
+.byte	143,232,120,194,219,7
+	vpaddd	%xmm1,%xmm8,%xmm8
+	vpaddd	%xmm2,%xmm9,%xmm9
+	vpaddd	%xmm3,%xmm10,%xmm10
+	vpaddd	%xmm0,%xmm11,%xmm11
+	vpxor	%xmm7,%xmm8,%xmm7
+	vpxor	%xmm4,%xmm9,%xmm4
+	vpxor	%xmm5,%xmm10,%xmm5
+	vpxor	%xmm6,%xmm11,%xmm6
+.byte	143,232,120,194,255,16
+.byte	143,232,120,194,228,16
+.byte	143,232,120,194,237,16
+.byte	143,232,120,194,246,16
+	vpaddd	%xmm7,%xmm14,%xmm14
+	vpaddd	%xmm4,%xmm15,%xmm15
+	vpaddd	%xmm5,%xmm12,%xmm12
+	vpaddd	%xmm6,%xmm13,%xmm13
+	vpxor	%xmm1,%xmm14,%xmm1
+	vpxor	%xmm2,%xmm15,%xmm2
+	vpxor	%xmm12,%xmm3,%xmm3
+	vpxor	%xmm13,%xmm0,%xmm0
+.byte	143,232,120,194,201,12
+.byte	143,232,120,194,210,12
+.byte	143,232,120,194,219,12
+.byte	143,232,120,194,192,12
+	vpaddd	%xmm8,%xmm1,%xmm8
+	vpaddd	%xmm9,%xmm2,%xmm9
+	vpaddd	%xmm3,%xmm10,%xmm10
+	vpaddd	%xmm0,%xmm11,%xmm11
+	vpxor	%xmm7,%xmm8,%xmm7
+	vpxor	%xmm4,%xmm9,%xmm4
+	vpxor	%xmm5,%xmm10,%xmm5
+	vpxor	%xmm6,%xmm11,%xmm6
+.byte	143,232,120,194,255,8
+.byte	143,232,120,194,228,8
+.byte	143,232,120,194,237,8
+.byte	143,232,120,194,246,8
+	vpaddd	%xmm7,%xmm14,%xmm14
+	vpaddd	%xmm4,%xmm15,%xmm15
+	vpaddd	%xmm5,%xmm12,%xmm12
+	vpaddd	%xmm6,%xmm13,%xmm13
+	vpxor	%xmm1,%xmm14,%xmm1
+	vpxor	%xmm2,%xmm15,%xmm2
+	vpxor	%xmm12,%xmm3,%xmm3
+	vpxor	%xmm13,%xmm0,%xmm0
+.byte	143,232,120,194,201,7
+.byte	143,232,120,194,210,7
+.byte	143,232,120,194,219,7
+.byte	143,232,120,194,192,7
+	decl	%eax
+	jnz	.Loop4xop
+
+	vpaddd	64(%rsp),%xmm8,%xmm8
+	vpaddd	80(%rsp),%xmm9,%xmm9
+	vpaddd	96(%rsp),%xmm10,%xmm10
+	vpaddd	112(%rsp),%xmm11,%xmm11
+
+	vmovdqa	%xmm14,32(%rsp)
+	vmovdqa	%xmm15,48(%rsp)
+
+	vpunpckldq	%xmm9,%xmm8,%xmm14
+	vpunpckldq	%xmm11,%xmm10,%xmm15
+	vpunpckhdq	%xmm9,%xmm8,%xmm8
+	vpunpckhdq	%xmm11,%xmm10,%xmm10
+	vpunpcklqdq	%xmm15,%xmm14,%xmm9
+	vpunpckhqdq	%xmm15,%xmm14,%xmm14
+	vpunpcklqdq	%xmm10,%xmm8,%xmm11
+	vpunpckhqdq	%xmm10,%xmm8,%xmm8
+	vpaddd	128-256(%rcx),%xmm0,%xmm0
+	vpaddd	144-256(%rcx),%xmm1,%xmm1
+	vpaddd	160-256(%rcx),%xmm2,%xmm2
+	vpaddd	176-256(%rcx),%xmm3,%xmm3
+
+	vmovdqa	%xmm9,0(%rsp)
+	vmovdqa	%xmm14,16(%rsp)
+	vmovdqa	32(%rsp),%xmm9
+	vmovdqa	48(%rsp),%xmm14
+
+	vpunpckldq	%xmm1,%xmm0,%xmm10
+	vpunpckldq	%xmm3,%xmm2,%xmm15
+	vpunpckhdq	%xmm1,%xmm0,%xmm0
+	vpunpckhdq	%xmm3,%xmm2,%xmm2
+	vpunpcklqdq	%xmm15,%xmm10,%xmm1
+	vpunpckhqdq	%xmm15,%xmm10,%xmm10
+	vpunpcklqdq	%xmm2,%xmm0,%xmm3
+	vpunpckhqdq	%xmm2,%xmm0,%xmm0
+	vpaddd	192-256(%rcx),%xmm12,%xmm12
+	vpaddd	208-256(%rcx),%xmm13,%xmm13
+	vpaddd	224-256(%rcx),%xmm9,%xmm9
+	vpaddd	240-256(%rcx),%xmm14,%xmm14
+
+	vpunpckldq	%xmm13,%xmm12,%xmm2
+	vpunpckldq	%xmm14,%xmm9,%xmm15
+	vpunpckhdq	%xmm13,%xmm12,%xmm12
+	vpunpckhdq	%xmm14,%xmm9,%xmm9
+	vpunpcklqdq	%xmm15,%xmm2,%xmm13
+	vpunpckhqdq	%xmm15,%xmm2,%xmm2
+	vpunpcklqdq	%xmm9,%xmm12,%xmm14
+	vpunpckhqdq	%xmm9,%xmm12,%xmm12
+	vpaddd	256-256(%rcx),%xmm4,%xmm4
+	vpaddd	272-256(%rcx),%xmm5,%xmm5
+	vpaddd	288-256(%rcx),%xmm6,%xmm6
+	vpaddd	304-256(%rcx),%xmm7,%xmm7
+
+	vpunpckldq	%xmm5,%xmm4,%xmm9
+	vpunpckldq	%xmm7,%xmm6,%xmm15
+	vpunpckhdq	%xmm5,%xmm4,%xmm4
+	vpunpckhdq	%xmm7,%xmm6,%xmm6
+	vpunpcklqdq	%xmm15,%xmm9,%xmm5
+	vpunpckhqdq	%xmm15,%xmm9,%xmm9
+	vpunpcklqdq	%xmm6,%xmm4,%xmm7
+	vpunpckhqdq	%xmm6,%xmm4,%xmm4
+	vmovdqa	0(%rsp),%xmm6
+	vmovdqa	16(%rsp),%xmm15
+
+	cmpq	$256,%rdx
+	jb	.Ltail4xop
+
+	vpxor	0(%rsi),%xmm6,%xmm6
+	vpxor	16(%rsi),%xmm1,%xmm1
+	vpxor	32(%rsi),%xmm13,%xmm13
+	vpxor	48(%rsi),%xmm5,%xmm5
+	vpxor	64(%rsi),%xmm15,%xmm15
+	vpxor	80(%rsi),%xmm10,%xmm10
+	vpxor	96(%rsi),%xmm2,%xmm2
+	vpxor	112(%rsi),%xmm9,%xmm9
+	leaq	128(%rsi),%rsi
+	vpxor	0(%rsi),%xmm11,%xmm11
+	vpxor	16(%rsi),%xmm3,%xmm3
+	vpxor	32(%rsi),%xmm14,%xmm14
+	vpxor	48(%rsi),%xmm7,%xmm7
+	vpxor	64(%rsi),%xmm8,%xmm8
+	vpxor	80(%rsi),%xmm0,%xmm0
+	vpxor	96(%rsi),%xmm12,%xmm12
+	vpxor	112(%rsi),%xmm4,%xmm4
+	leaq	128(%rsi),%rsi
+
+	vmovdqu	%xmm6,0(%rdi)
+	vmovdqu	%xmm1,16(%rdi)
+	vmovdqu	%xmm13,32(%rdi)
+	vmovdqu	%xmm5,48(%rdi)
+	vmovdqu	%xmm15,64(%rdi)
+	vmovdqu	%xmm10,80(%rdi)
+	vmovdqu	%xmm2,96(%rdi)
+	vmovdqu	%xmm9,112(%rdi)
+	leaq	128(%rdi),%rdi
+	vmovdqu	%xmm11,0(%rdi)
+	vmovdqu	%xmm3,16(%rdi)
+	vmovdqu	%xmm14,32(%rdi)
+	vmovdqu	%xmm7,48(%rdi)
+	vmovdqu	%xmm8,64(%rdi)
+	vmovdqu	%xmm0,80(%rdi)
+	vmovdqu	%xmm12,96(%rdi)
+	vmovdqu	%xmm4,112(%rdi)
+	leaq	128(%rdi),%rdi
+
+	subq	$256,%rdx
+	jnz	.Loop_outer4xop
+
+	jmp	.Ldone4xop
+
+.align	32
+.Ltail4xop:
+	cmpq	$192,%rdx
+	jae	.L192_or_more4xop
+	cmpq	$128,%rdx
+	jae	.L128_or_more4xop
+	cmpq	$64,%rdx
+	jae	.L64_or_more4xop
+
+	xorq	%r10,%r10
+	vmovdqa	%xmm6,0(%rsp)
+	vmovdqa	%xmm1,16(%rsp)
+	vmovdqa	%xmm13,32(%rsp)
+	vmovdqa	%xmm5,48(%rsp)
+	jmp	.Loop_tail4xop
+
+.align	32
+.L64_or_more4xop:
+	vpxor	0(%rsi),%xmm6,%xmm6
+	vpxor	16(%rsi),%xmm1,%xmm1
+	vpxor	32(%rsi),%xmm13,%xmm13
+	vpxor	48(%rsi),%xmm5,%xmm5
+	vmovdqu	%xmm6,0(%rdi)
+	vmovdqu	%xmm1,16(%rdi)
+	vmovdqu	%xmm13,32(%rdi)
+	vmovdqu	%xmm5,48(%rdi)
+	je	.Ldone4xop
+
+	leaq	64(%rsi),%rsi
+	vmovdqa	%xmm15,0(%rsp)
+	xorq	%r10,%r10
+	vmovdqa	%xmm10,16(%rsp)
+	leaq	64(%rdi),%rdi
+	vmovdqa	%xmm2,32(%rsp)
+	subq	$64,%rdx
+	vmovdqa	%xmm9,48(%rsp)
+	jmp	.Loop_tail4xop
+
+.align	32
+.L128_or_more4xop:
+	vpxor	0(%rsi),%xmm6,%xmm6
+	vpxor	16(%rsi),%xmm1,%xmm1
+	vpxor	32(%rsi),%xmm13,%xmm13
+	vpxor	48(%rsi),%xmm5,%xmm5
+	vpxor	64(%rsi),%xmm15,%xmm15
+	vpxor	80(%rsi),%xmm10,%xmm10
+	vpxor	96(%rsi),%xmm2,%xmm2
+	vpxor	112(%rsi),%xmm9,%xmm9
+
+	vmovdqu	%xmm6,0(%rdi)
+	vmovdqu	%xmm1,16(%rdi)
+	vmovdqu	%xmm13,32(%rdi)
+	vmovdqu	%xmm5,48(%rdi)
+	vmovdqu	%xmm15,64(%rdi)
+	vmovdqu	%xmm10,80(%rdi)
+	vmovdqu	%xmm2,96(%rdi)
+	vmovdqu	%xmm9,112(%rdi)
+	je	.Ldone4xop
+
+	leaq	128(%rsi),%rsi
+	vmovdqa	%xmm11,0(%rsp)
+	xorq	%r10,%r10
+	vmovdqa	%xmm3,16(%rsp)
+	leaq	128(%rdi),%rdi
+	vmovdqa	%xmm14,32(%rsp)
+	subq	$128,%rdx
+	vmovdqa	%xmm7,48(%rsp)
+	jmp	.Loop_tail4xop
+
+.align	32
+.L192_or_more4xop:
+	vpxor	0(%rsi),%xmm6,%xmm6
+	vpxor	16(%rsi),%xmm1,%xmm1
+	vpxor	32(%rsi),%xmm13,%xmm13
+	vpxor	48(%rsi),%xmm5,%xmm5
+	vpxor	64(%rsi),%xmm15,%xmm15
+	vpxor	80(%rsi),%xmm10,%xmm10
+	vpxor	96(%rsi),%xmm2,%xmm2
+	vpxor	112(%rsi),%xmm9,%xmm9
+	leaq	128(%rsi),%rsi
+	vpxor	0(%rsi),%xmm11,%xmm11
+	vpxor	16(%rsi),%xmm3,%xmm3
+	vpxor	32(%rsi),%xmm14,%xmm14
+	vpxor	48(%rsi),%xmm7,%xmm7
+
+	vmovdqu	%xmm6,0(%rdi)
+	vmovdqu	%xmm1,16(%rdi)
+	vmovdqu	%xmm13,32(%rdi)
+	vmovdqu	%xmm5,48(%rdi)
+	vmovdqu	%xmm15,64(%rdi)
+	vmovdqu	%xmm10,80(%rdi)
+	vmovdqu	%xmm2,96(%rdi)
+	vmovdqu	%xmm9,112(%rdi)
+	leaq	128(%rdi),%rdi
+	vmovdqu	%xmm11,0(%rdi)
+	vmovdqu	%xmm3,16(%rdi)
+	vmovdqu	%xmm14,32(%rdi)
+	vmovdqu	%xmm7,48(%rdi)
+	je	.Ldone4xop
+
+	leaq	64(%rsi),%rsi
+	vmovdqa	%xmm8,0(%rsp)
+	xorq	%r10,%r10
+	vmovdqa	%xmm0,16(%rsp)
+	leaq	64(%rdi),%rdi
+	vmovdqa	%xmm12,32(%rsp)
+	subq	$192,%rdx
+	vmovdqa	%xmm4,48(%rsp)
+
+.Loop_tail4xop:
+	movzbl	(%rsi,%r10,1),%eax
+	movzbl	(%rsp,%r10,1),%ecx
+	leaq	1(%r10),%r10
+	xorl	%ecx,%eax
+	movb	%al,-1(%rdi,%r10,1)
+	decq	%rdx
+	jnz	.Loop_tail4xop
+
+.Ldone4xop:
+	vzeroupper
+	leaq	(%r9),%rsp
+.cfi_def_cfa_register	%rsp
+.L4xop_epilogue:
+	.byte	0xf3,0xc3
+.cfi_endproc	
+.size	ChaCha20_4xop,.-ChaCha20_4xop
+.type	ChaCha20_8x,@function
+.align	32
+ChaCha20_8x:
+.cfi_startproc	
+.LChaCha20_8x:
+	movq	%rsp,%r9
+.cfi_def_cfa_register	%r9
+	subq	$0x280+8,%rsp
+	andq	$-32,%rsp
+	vzeroupper
+
+
+
+
+
+
+
+
+
+
+	vbroadcasti128	.Lsigma(%rip),%ymm11
+	vbroadcasti128	(%rcx),%ymm3
+	vbroadcasti128	16(%rcx),%ymm15
+	vbroadcasti128	(%r8),%ymm7
+	leaq	256(%rsp),%rcx
+	leaq	512(%rsp),%rax
+	leaq	.Lrot16(%rip),%r10
+	leaq	.Lrot24(%rip),%r11
+
+	vpshufd	$0x00,%ymm11,%ymm8
+	vpshufd	$0x55,%ymm11,%ymm9
+	vmovdqa	%ymm8,128-256(%rcx)
+	vpshufd	$0xaa,%ymm11,%ymm10
+	vmovdqa	%ymm9,160-256(%rcx)
+	vpshufd	$0xff,%ymm11,%ymm11
+	vmovdqa	%ymm10,192-256(%rcx)
+	vmovdqa	%ymm11,224-256(%rcx)
+
+	vpshufd	$0x00,%ymm3,%ymm0
+	vpshufd	$0x55,%ymm3,%ymm1
+	vmovdqa	%ymm0,256-256(%rcx)
+	vpshufd	$0xaa,%ymm3,%ymm2
+	vmovdqa	%ymm1,288-256(%rcx)
+	vpshufd	$0xff,%ymm3,%ymm3
+	vmovdqa	%ymm2,320-256(%rcx)
+	vmovdqa	%ymm3,352-256(%rcx)
+
+	vpshufd	$0x00,%ymm15,%ymm12
+	vpshufd	$0x55,%ymm15,%ymm13
+	vmovdqa	%ymm12,384-512(%rax)
+	vpshufd	$0xaa,%ymm15,%ymm14
+	vmovdqa	%ymm13,416-512(%rax)
+	vpshufd	$0xff,%ymm15,%ymm15
+	vmovdqa	%ymm14,448-512(%rax)
+	vmovdqa	%ymm15,480-512(%rax)
+
+	vpshufd	$0x00,%ymm7,%ymm4
+	vpshufd	$0x55,%ymm7,%ymm5
+	vpaddd	.Lincy(%rip),%ymm4,%ymm4
+	vpshufd	$0xaa,%ymm7,%ymm6
+	vmovdqa	%ymm5,544-512(%rax)
+	vpshufd	$0xff,%ymm7,%ymm7
+	vmovdqa	%ymm6,576-512(%rax)
+	vmovdqa	%ymm7,608-512(%rax)
+
+	jmp	.Loop_enter8x
+
+.align	32
+.Loop_outer8x:
+	vmovdqa	128-256(%rcx),%ymm8
+	vmovdqa	160-256(%rcx),%ymm9
+	vmovdqa	192-256(%rcx),%ymm10
+	vmovdqa	224-256(%rcx),%ymm11
+	vmovdqa	256-256(%rcx),%ymm0
+	vmovdqa	288-256(%rcx),%ymm1
+	vmovdqa	320-256(%rcx),%ymm2
+	vmovdqa	352-256(%rcx),%ymm3
+	vmovdqa	384-512(%rax),%ymm12
+	vmovdqa	416-512(%rax),%ymm13
+	vmovdqa	448-512(%rax),%ymm14
+	vmovdqa	480-512(%rax),%ymm15
+	vmovdqa	512-512(%rax),%ymm4
+	vmovdqa	544-512(%rax),%ymm5
+	vmovdqa	576-512(%rax),%ymm6
+	vmovdqa	608-512(%rax),%ymm7
+	vpaddd	.Leight(%rip),%ymm4,%ymm4
+
+.Loop_enter8x:
+	vmovdqa	%ymm14,64(%rsp)
+	vmovdqa	%ymm15,96(%rsp)
+	vbroadcasti128	(%r10),%ymm15
+	vmovdqa	%ymm4,512-512(%rax)
+	movl	$10,%eax
+	jmp	.Loop8x
+
+.align	32
+.Loop8x:
+	vpaddd	%ymm0,%ymm8,%ymm8
+	vpxor	%ymm4,%ymm8,%ymm4
+	vpshufb	%ymm15,%ymm4,%ymm4
+	vpaddd	%ymm1,%ymm9,%ymm9
+	vpxor	%ymm5,%ymm9,%ymm5
+	vpshufb	%ymm15,%ymm5,%ymm5
+	vpaddd	%ymm4,%ymm12,%ymm12
+	vpxor	%ymm0,%ymm12,%ymm0
+	vpslld	$12,%ymm0,%ymm14
+	vpsrld	$20,%ymm0,%ymm0
+	vpor	%ymm0,%ymm14,%ymm0
+	vbroadcasti128	(%r11),%ymm14
+	vpaddd	%ymm5,%ymm13,%ymm13
+	vpxor	%ymm1,%ymm13,%ymm1
+	vpslld	$12,%ymm1,%ymm15
+	vpsrld	$20,%ymm1,%ymm1
+	vpor	%ymm1,%ymm15,%ymm1
+	vpaddd	%ymm0,%ymm8,%ymm8
+	vpxor	%ymm4,%ymm8,%ymm4
+	vpshufb	%ymm14,%ymm4,%ymm4
+	vpaddd	%ymm1,%ymm9,%ymm9
+	vpxor	%ymm5,%ymm9,%ymm5
+	vpshufb	%ymm14,%ymm5,%ymm5
+	vpaddd	%ymm4,%ymm12,%ymm12
+	vpxor	%ymm0,%ymm12,%ymm0
+	vpslld	$7,%ymm0,%ymm15
+	vpsrld	$25,%ymm0,%ymm0
+	vpor	%ymm0,%ymm15,%ymm0
+	vbroadcasti128	(%r10),%ymm15
+	vpaddd	%ymm5,%ymm13,%ymm13
+	vpxor	%ymm1,%ymm13,%ymm1
+	vpslld	$7,%ymm1,%ymm14
+	vpsrld	$25,%ymm1,%ymm1
+	vpor	%ymm1,%ymm14,%ymm1
+	vmovdqa	%ymm12,0(%rsp)
+	vmovdqa	%ymm13,32(%rsp)
+	vmovdqa	64(%rsp),%ymm12
+	vmovdqa	96(%rsp),%ymm13
+	vpaddd	%ymm2,%ymm10,%ymm10
+	vpxor	%ymm6,%ymm10,%ymm6
+	vpshufb	%ymm15,%ymm6,%ymm6
+	vpaddd	%ymm3,%ymm11,%ymm11
+	vpxor	%ymm7,%ymm11,%ymm7
+	vpshufb	%ymm15,%ymm7,%ymm7
+	vpaddd	%ymm6,%ymm12,%ymm12
+	vpxor	%ymm2,%ymm12,%ymm2
+	vpslld	$12,%ymm2,%ymm14
+	vpsrld	$20,%ymm2,%ymm2
+	vpor	%ymm2,%ymm14,%ymm2
+	vbroadcasti128	(%r11),%ymm14
+	vpaddd	%ymm7,%ymm13,%ymm13
+	vpxor	%ymm3,%ymm13,%ymm3
+	vpslld	$12,%ymm3,%ymm15
+	vpsrld	$20,%ymm3,%ymm3
+	vpor	%ymm3,%ymm15,%ymm3
+	vpaddd	%ymm2,%ymm10,%ymm10
+	vpxor	%ymm6,%ymm10,%ymm6
+	vpshufb	%ymm14,%ymm6,%ymm6
+	vpaddd	%ymm3,%ymm11,%ymm11
+	vpxor	%ymm7,%ymm11,%ymm7
+	vpshufb	%ymm14,%ymm7,%ymm7
+	vpaddd	%ymm6,%ymm12,%ymm12
+	vpxor	%ymm2,%ymm12,%ymm2
+	vpslld	$7,%ymm2,%ymm15
+	vpsrld	$25,%ymm2,%ymm2
+	vpor	%ymm2,%ymm15,%ymm2
+	vbroadcasti128	(%r10),%ymm15
+	vpaddd	%ymm7,%ymm13,%ymm13
+	vpxor	%ymm3,%ymm13,%ymm3
+	vpslld	$7,%ymm3,%ymm14
+	vpsrld	$25,%ymm3,%ymm3
+	vpor	%ymm3,%ymm14,%ymm3
+	vpaddd	%ymm1,%ymm8,%ymm8
+	vpxor	%ymm7,%ymm8,%ymm7
+	vpshufb	%ymm15,%ymm7,%ymm7
+	vpaddd	%ymm2,%ymm9,%ymm9
+	vpxor	%ymm4,%ymm9,%ymm4
+	vpshufb	%ymm15,%ymm4,%ymm4
+	vpaddd	%ymm7,%ymm12,%ymm12
+	vpxor	%ymm1,%ymm12,%ymm1
+	vpslld	$12,%ymm1,%ymm14
+	vpsrld	$20,%ymm1,%ymm1
+	vpor	%ymm1,%ymm14,%ymm1
+	vbroadcasti128	(%r11),%ymm14
+	vpaddd	%ymm4,%ymm13,%ymm13
+	vpxor	%ymm2,%ymm13,%ymm2
+	vpslld	$12,%ymm2,%ymm15
+	vpsrld	$20,%ymm2,%ymm2
+	vpor	%ymm2,%ymm15,%ymm2
+	vpaddd	%ymm1,%ymm8,%ymm8
+	vpxor	%ymm7,%ymm8,%ymm7
+	vpshufb	%ymm14,%ymm7,%ymm7
+	vpaddd	%ymm2,%ymm9,%ymm9
+	vpxor	%ymm4,%ymm9,%ymm4
+	vpshufb	%ymm14,%ymm4,%ymm4
+	vpaddd	%ymm7,%ymm12,%ymm12
+	vpxor	%ymm1,%ymm12,%ymm1
+	vpslld	$7,%ymm1,%ymm15
+	vpsrld	$25,%ymm1,%ymm1
+	vpor	%ymm1,%ymm15,%ymm1
+	vbroadcasti128	(%r10),%ymm15
+	vpaddd	%ymm4,%ymm13,%ymm13
+	vpxor	%ymm2,%ymm13,%ymm2
+	vpslld	$7,%ymm2,%ymm14
+	vpsrld	$25,%ymm2,%ymm2
+	vpor	%ymm2,%ymm14,%ymm2
+	vmovdqa	%ymm12,64(%rsp)
+	vmovdqa	%ymm13,96(%rsp)
+	vmovdqa	0(%rsp),%ymm12
+	vmovdqa	32(%rsp),%ymm13
+	vpaddd	%ymm3,%ymm10,%ymm10
+	vpxor	%ymm5,%ymm10,%ymm5
+	vpshufb	%ymm15,%ymm5,%ymm5
+	vpaddd	%ymm0,%ymm11,%ymm11
+	vpxor	%ymm6,%ymm11,%ymm6
+	vpshufb	%ymm15,%ymm6,%ymm6
+	vpaddd	%ymm5,%ymm12,%ymm12
+	vpxor	%ymm3,%ymm12,%ymm3
+	vpslld	$12,%ymm3,%ymm14
+	vpsrld	$20,%ymm3,%ymm3
+	vpor	%ymm3,%ymm14,%ymm3
+	vbroadcasti128	(%r11),%ymm14
+	vpaddd	%ymm6,%ymm13,%ymm13
+	vpxor	%ymm0,%ymm13,%ymm0
+	vpslld	$12,%ymm0,%ymm15
+	vpsrld	$20,%ymm0,%ymm0
+	vpor	%ymm0,%ymm15,%ymm0
+	vpaddd	%ymm3,%ymm10,%ymm10
+	vpxor	%ymm5,%ymm10,%ymm5
+	vpshufb	%ymm14,%ymm5,%ymm5
+	vpaddd	%ymm0,%ymm11,%ymm11
+	vpxor	%ymm6,%ymm11,%ymm6
+	vpshufb	%ymm14,%ymm6,%ymm6
+	vpaddd	%ymm5,%ymm12,%ymm12
+	vpxor	%ymm3,%ymm12,%ymm3
+	vpslld	$7,%ymm3,%ymm15
+	vpsrld	$25,%ymm3,%ymm3
+	vpor	%ymm3,%ymm15,%ymm3
+	vbroadcasti128	(%r10),%ymm15
+	vpaddd	%ymm6,%ymm13,%ymm13
+	vpxor	%ymm0,%ymm13,%ymm0
+	vpslld	$7,%ymm0,%ymm14
+	vpsrld	$25,%ymm0,%ymm0
+	vpor	%ymm0,%ymm14,%ymm0
+	decl	%eax
+	jnz	.Loop8x
+
+	leaq	512(%rsp),%rax
+	vpaddd	128-256(%rcx),%ymm8,%ymm8
+	vpaddd	160-256(%rcx),%ymm9,%ymm9
+	vpaddd	192-256(%rcx),%ymm10,%ymm10
+	vpaddd	224-256(%rcx),%ymm11,%ymm11
+
+	vpunpckldq	%ymm9,%ymm8,%ymm14
+	vpunpckldq	%ymm11,%ymm10,%ymm15
+	vpunpckhdq	%ymm9,%ymm8,%ymm8
+	vpunpckhdq	%ymm11,%ymm10,%ymm10
+	vpunpcklqdq	%ymm15,%ymm14,%ymm9
+	vpunpckhqdq	%ymm15,%ymm14,%ymm14
+	vpunpcklqdq	%ymm10,%ymm8,%ymm11
+	vpunpckhqdq	%ymm10,%ymm8,%ymm8
+	vpaddd	256-256(%rcx),%ymm0,%ymm0
+	vpaddd	288-256(%rcx),%ymm1,%ymm1
+	vpaddd	320-256(%rcx),%ymm2,%ymm2
+	vpaddd	352-256(%rcx),%ymm3,%ymm3
+
+	vpunpckldq	%ymm1,%ymm0,%ymm10
+	vpunpckldq	%ymm3,%ymm2,%ymm15
+	vpunpckhdq	%ymm1,%ymm0,%ymm0
+	vpunpckhdq	%ymm3,%ymm2,%ymm2
+	vpunpcklqdq	%ymm15,%ymm10,%ymm1
+	vpunpckhqdq	%ymm15,%ymm10,%ymm10
+	vpunpcklqdq	%ymm2,%ymm0,%ymm3
+	vpunpckhqdq	%ymm2,%ymm0,%ymm0
+	vperm2i128	$0x20,%ymm1,%ymm9,%ymm15
+	vperm2i128	$0x31,%ymm1,%ymm9,%ymm1
+	vperm2i128	$0x20,%ymm10,%ymm14,%ymm9
+	vperm2i128	$0x31,%ymm10,%ymm14,%ymm10
+	vperm2i128	$0x20,%ymm3,%ymm11,%ymm14
+	vperm2i128	$0x31,%ymm3,%ymm11,%ymm3
+	vperm2i128	$0x20,%ymm0,%ymm8,%ymm11
+	vperm2i128	$0x31,%ymm0,%ymm8,%ymm0
+	vmovdqa	%ymm15,0(%rsp)
+	vmovdqa	%ymm9,32(%rsp)
+	vmovdqa	64(%rsp),%ymm15
+	vmovdqa	96(%rsp),%ymm9
+
+	vpaddd	384-512(%rax),%ymm12,%ymm12
+	vpaddd	416-512(%rax),%ymm13,%ymm13
+	vpaddd	448-512(%rax),%ymm15,%ymm15
+	vpaddd	480-512(%rax),%ymm9,%ymm9
+
+	vpunpckldq	%ymm13,%ymm12,%ymm2
+	vpunpckldq	%ymm9,%ymm15,%ymm8
+	vpunpckhdq	%ymm13,%ymm12,%ymm12
+	vpunpckhdq	%ymm9,%ymm15,%ymm15
+	vpunpcklqdq	%ymm8,%ymm2,%ymm13
+	vpunpckhqdq	%ymm8,%ymm2,%ymm2
+	vpunpcklqdq	%ymm15,%ymm12,%ymm9
+	vpunpckhqdq	%ymm15,%ymm12,%ymm12
+	vpaddd	512-512(%rax),%ymm4,%ymm4
+	vpaddd	544-512(%rax),%ymm5,%ymm5
+	vpaddd	576-512(%rax),%ymm6,%ymm6
+	vpaddd	608-512(%rax),%ymm7,%ymm7
+
+	vpunpckldq	%ymm5,%ymm4,%ymm15
+	vpunpckldq	%ymm7,%ymm6,%ymm8
+	vpunpckhdq	%ymm5,%ymm4,%ymm4
+	vpunpckhdq	%ymm7,%ymm6,%ymm6
+	vpunpcklqdq	%ymm8,%ymm15,%ymm5
+	vpunpckhqdq	%ymm8,%ymm15,%ymm15
+	vpunpcklqdq	%ymm6,%ymm4,%ymm7
+	vpunpckhqdq	%ymm6,%ymm4,%ymm4
+	vperm2i128	$0x20,%ymm5,%ymm13,%ymm8
+	vperm2i128	$0x31,%ymm5,%ymm13,%ymm5
+	vperm2i128	$0x20,%ymm15,%ymm2,%ymm13
+	vperm2i128	$0x31,%ymm15,%ymm2,%ymm15
+	vperm2i128	$0x20,%ymm7,%ymm9,%ymm2
+	vperm2i128	$0x31,%ymm7,%ymm9,%ymm7
+	vperm2i128	$0x20,%ymm4,%ymm12,%ymm9
+	vperm2i128	$0x31,%ymm4,%ymm12,%ymm4
+	vmovdqa	0(%rsp),%ymm6
+	vmovdqa	32(%rsp),%ymm12
+
+	cmpq	$512,%rdx
+	jb	.Ltail8x
+
+	vpxor	0(%rsi),%ymm6,%ymm6
+	vpxor	32(%rsi),%ymm8,%ymm8
+	vpxor	64(%rsi),%ymm1,%ymm1
+	vpxor	96(%rsi),%ymm5,%ymm5
+	leaq	128(%rsi),%rsi
+	vmovdqu	%ymm6,0(%rdi)
+	vmovdqu	%ymm8,32(%rdi)
+	vmovdqu	%ymm1,64(%rdi)
+	vmovdqu	%ymm5,96(%rdi)
+	leaq	128(%rdi),%rdi
+
+	vpxor	0(%rsi),%ymm12,%ymm12
+	vpxor	32(%rsi),%ymm13,%ymm13
+	vpxor	64(%rsi),%ymm10,%ymm10
+	vpxor	96(%rsi),%ymm15,%ymm15
+	leaq	128(%rsi),%rsi
+	vmovdqu	%ymm12,0(%rdi)
+	vmovdqu	%ymm13,32(%rdi)
+	vmovdqu	%ymm10,64(%rdi)
+	vmovdqu	%ymm15,96(%rdi)
+	leaq	128(%rdi),%rdi
+
+	vpxor	0(%rsi),%ymm14,%ymm14
+	vpxor	32(%rsi),%ymm2,%ymm2
+	vpxor	64(%rsi),%ymm3,%ymm3
+	vpxor	96(%rsi),%ymm7,%ymm7
+	leaq	128(%rsi),%rsi
+	vmovdqu	%ymm14,0(%rdi)
+	vmovdqu	%ymm2,32(%rdi)
+	vmovdqu	%ymm3,64(%rdi)
+	vmovdqu	%ymm7,96(%rdi)
+	leaq	128(%rdi),%rdi
+
+	vpxor	0(%rsi),%ymm11,%ymm11
+	vpxor	32(%rsi),%ymm9,%ymm9
+	vpxor	64(%rsi),%ymm0,%ymm0
+	vpxor	96(%rsi),%ymm4,%ymm4
+	leaq	128(%rsi),%rsi
+	vmovdqu	%ymm11,0(%rdi)
+	vmovdqu	%ymm9,32(%rdi)
+	vmovdqu	%ymm0,64(%rdi)
+	vmovdqu	%ymm4,96(%rdi)
+	leaq	128(%rdi),%rdi
+
+	subq	$512,%rdx
+	jnz	.Loop_outer8x
+
+	jmp	.Ldone8x
+
+.Ltail8x:
+	cmpq	$448,%rdx
+	jae	.L448_or_more8x
+	cmpq	$384,%rdx
+	jae	.L384_or_more8x
+	cmpq	$320,%rdx
+	jae	.L320_or_more8x
+	cmpq	$256,%rdx
+	jae	.L256_or_more8x
+	cmpq	$192,%rdx
+	jae	.L192_or_more8x
+	cmpq	$128,%rdx
+	jae	.L128_or_more8x
+	cmpq	$64,%rdx
+	jae	.L64_or_more8x
+
+	xorq	%r10,%r10
+	vmovdqa	%ymm6,0(%rsp)
+	vmovdqa	%ymm8,32(%rsp)
+	jmp	.Loop_tail8x
+
+.align	32
+.L64_or_more8x:
+	vpxor	0(%rsi),%ymm6,%ymm6
+	vpxor	32(%rsi),%ymm8,%ymm8
+	vmovdqu	%ymm6,0(%rdi)
+	vmovdqu	%ymm8,32(%rdi)
+	je	.Ldone8x
+
+	leaq	64(%rsi),%rsi
+	xorq	%r10,%r10
+	vmovdqa	%ymm1,0(%rsp)
+	leaq	64(%rdi),%rdi
+	subq	$64,%rdx
+	vmovdqa	%ymm5,32(%rsp)
+	jmp	.Loop_tail8x
+
+.align	32
+.L128_or_more8x:
+	vpxor	0(%rsi),%ymm6,%ymm6
+	vpxor	32(%rsi),%ymm8,%ymm8
+	vpxor	64(%rsi),%ymm1,%ymm1
+	vpxor	96(%rsi),%ymm5,%ymm5
+	vmovdqu	%ymm6,0(%rdi)
+	vmovdqu	%ymm8,32(%rdi)
+	vmovdqu	%ymm1,64(%rdi)
+	vmovdqu	%ymm5,96(%rdi)
+	je	.Ldone8x
+
+	leaq	128(%rsi),%rsi
+	xorq	%r10,%r10
+	vmovdqa	%ymm12,0(%rsp)
+	leaq	128(%rdi),%rdi
+	subq	$128,%rdx
+	vmovdqa	%ymm13,32(%rsp)
+	jmp	.Loop_tail8x
+
+.align	32
+.L192_or_more8x:
+	vpxor	0(%rsi),%ymm6,%ymm6
+	vpxor	32(%rsi),%ymm8,%ymm8
+	vpxor	64(%rsi),%ymm1,%ymm1
+	vpxor	96(%rsi),%ymm5,%ymm5
+	vpxor	128(%rsi),%ymm12,%ymm12
+	vpxor	160(%rsi),%ymm13,%ymm13
+	vmovdqu	%ymm6,0(%rdi)
+	vmovdqu	%ymm8,32(%rdi)
+	vmovdqu	%ymm1,64(%rdi)
+	vmovdqu	%ymm5,96(%rdi)
+	vmovdqu	%ymm12,128(%rdi)
+	vmovdqu	%ymm13,160(%rdi)
+	je	.Ldone8x
+
+	leaq	192(%rsi),%rsi
+	xorq	%r10,%r10
+	vmovdqa	%ymm10,0(%rsp)
+	leaq	192(%rdi),%rdi
+	subq	$192,%rdx
+	vmovdqa	%ymm15,32(%rsp)
+	jmp	.Loop_tail8x
+
+.align	32
+.L256_or_more8x:
+	vpxor	0(%rsi),%ymm6,%ymm6
+	vpxor	32(%rsi),%ymm8,%ymm8
+	vpxor	64(%rsi),%ymm1,%ymm1
+	vpxor	96(%rsi),%ymm5,%ymm5
+	vpxor	128(%rsi),%ymm12,%ymm12
+	vpxor	160(%rsi),%ymm13,%ymm13
+	vpxor	192(%rsi),%ymm10,%ymm10
+	vpxor	224(%rsi),%ymm15,%ymm15
+	vmovdqu	%ymm6,0(%rdi)
+	vmovdqu	%ymm8,32(%rdi)
+	vmovdqu	%ymm1,64(%rdi)
+	vmovdqu	%ymm5,96(%rdi)
+	vmovdqu	%ymm12,128(%rdi)
+	vmovdqu	%ymm13,160(%rdi)
+	vmovdqu	%ymm10,192(%rdi)
+	vmovdqu	%ymm15,224(%rdi)
+	je	.Ldone8x
+
+	leaq	256(%rsi),%rsi
+	xorq	%r10,%r10
+	vmovdqa	%ymm14,0(%rsp)
+	leaq	256(%rdi),%rdi
+	subq	$256,%rdx
+	vmovdqa	%ymm2,32(%rsp)
+	jmp	.Loop_tail8x
+
+.align	32
+.L320_or_more8x:
+	vpxor	0(%rsi),%ymm6,%ymm6
+	vpxor	32(%rsi),%ymm8,%ymm8
+	vpxor	64(%rsi),%ymm1,%ymm1
+	vpxor	96(%rsi),%ymm5,%ymm5
+	vpxor	128(%rsi),%ymm12,%ymm12
+	vpxor	160(%rsi),%ymm13,%ymm13
+	vpxor	192(%rsi),%ymm10,%ymm10
+	vpxor	224(%rsi),%ymm15,%ymm15
+	vpxor	256(%rsi),%ymm14,%ymm14
+	vpxor	288(%rsi),%ymm2,%ymm2
+	vmovdqu	%ymm6,0(%rdi)
+	vmovdqu	%ymm8,32(%rdi)
+	vmovdqu	%ymm1,64(%rdi)
+	vmovdqu	%ymm5,96(%rdi)
+	vmovdqu	%ymm12,128(%rdi)
+	vmovdqu	%ymm13,160(%rdi)
+	vmovdqu	%ymm10,192(%rdi)
+	vmovdqu	%ymm15,224(%rdi)
+	vmovdqu	%ymm14,256(%rdi)
+	vmovdqu	%ymm2,288(%rdi)
+	je	.Ldone8x
+
+	leaq	320(%rsi),%rsi
+	xorq	%r10,%r10
+	vmovdqa	%ymm3,0(%rsp)
+	leaq	320(%rdi),%rdi
+	subq	$320,%rdx
+	vmovdqa	%ymm7,32(%rsp)
+	jmp	.Loop_tail8x
+
+.align	32
+.L384_or_more8x:
+	vpxor	0(%rsi),%ymm6,%ymm6
+	vpxor	32(%rsi),%ymm8,%ymm8
+	vpxor	64(%rsi),%ymm1,%ymm1
+	vpxor	96(%rsi),%ymm5,%ymm5
+	vpxor	128(%rsi),%ymm12,%ymm12
+	vpxor	160(%rsi),%ymm13,%ymm13
+	vpxor	192(%rsi),%ymm10,%ymm10
+	vpxor	224(%rsi),%ymm15,%ymm15
+	vpxor	256(%rsi),%ymm14,%ymm14
+	vpxor	288(%rsi),%ymm2,%ymm2
+	vpxor	320(%rsi),%ymm3,%ymm3
+	vpxor	352(%rsi),%ymm7,%ymm7
+	vmovdqu	%ymm6,0(%rdi)
+	vmovdqu	%ymm8,32(%rdi)
+	vmovdqu	%ymm1,64(%rdi)
+	vmovdqu	%ymm5,96(%rdi)
+	vmovdqu	%ymm12,128(%rdi)
+	vmovdqu	%ymm13,160(%rdi)
+	vmovdqu	%ymm10,192(%rdi)
+	vmovdqu	%ymm15,224(%rdi)
+	vmovdqu	%ymm14,256(%rdi)
+	vmovdqu	%ymm2,288(%rdi)
+	vmovdqu	%ymm3,320(%rdi)
+	vmovdqu	%ymm7,352(%rdi)
+	je	.Ldone8x
+
+	leaq	384(%rsi),%rsi
+	xorq	%r10,%r10
+	vmovdqa	%ymm11,0(%rsp)
+	leaq	384(%rdi),%rdi
+	subq	$384,%rdx
+	vmovdqa	%ymm9,32(%rsp)
+	jmp	.Loop_tail8x
+
+.align	32
+.L448_or_more8x:
+	vpxor	0(%rsi),%ymm6,%ymm6
+	vpxor	32(%rsi),%ymm8,%ymm8
+	vpxor	64(%rsi),%ymm1,%ymm1
+	vpxor	96(%rsi),%ymm5,%ymm5
+	vpxor	128(%rsi),%ymm12,%ymm12
+	vpxor	160(%rsi),%ymm13,%ymm13
+	vpxor	192(%rsi),%ymm10,%ymm10
+	vpxor	224(%rsi),%ymm15,%ymm15
+	vpxor	256(%rsi),%ymm14,%ymm14
+	vpxor	288(%rsi),%ymm2,%ymm2
+	vpxor	320(%rsi),%ymm3,%ymm3
+	vpxor	352(%rsi),%ymm7,%ymm7
+	vpxor	384(%rsi),%ymm11,%ymm11
+	vpxor	416(%rsi),%ymm9,%ymm9
+	vmovdqu	%ymm6,0(%rdi)
+	vmovdqu	%ymm8,32(%rdi)
+	vmovdqu	%ymm1,64(%rdi)
+	vmovdqu	%ymm5,96(%rdi)
+	vmovdqu	%ymm12,128(%rdi)
+	vmovdqu	%ymm13,160(%rdi)
+	vmovdqu	%ymm10,192(%rdi)
+	vmovdqu	%ymm15,224(%rdi)
+	vmovdqu	%ymm14,256(%rdi)
+	vmovdqu	%ymm2,288(%rdi)
+	vmovdqu	%ymm3,320(%rdi)
+	vmovdqu	%ymm7,352(%rdi)
+	vmovdqu	%ymm11,384(%rdi)
+	vmovdqu	%ymm9,416(%rdi)
+	je	.Ldone8x
+
+	leaq	448(%rsi),%rsi
+	xorq	%r10,%r10
+	vmovdqa	%ymm0,0(%rsp)
+	leaq	448(%rdi),%rdi
+	subq	$448,%rdx
+	vmovdqa	%ymm4,32(%rsp)
+
+.Loop_tail8x:
+	movzbl	(%rsi,%r10,1),%eax
+	movzbl	(%rsp,%r10,1),%ecx
+	leaq	1(%r10),%r10
+	xorl	%ecx,%eax
+	movb	%al,-1(%rdi,%r10,1)
+	decq	%rdx
+	jnz	.Loop_tail8x
+
+.Ldone8x:
+	vzeroall
+	leaq	(%r9),%rsp
+.cfi_def_cfa_register	%rsp
+.L8x_epilogue:
+	.byte	0xf3,0xc3
+.cfi_endproc	
+.size	ChaCha20_8x,.-ChaCha20_8x
+.type	ChaCha20_avx512,@function
+.align	32
+ChaCha20_avx512:
+.cfi_startproc	
+.LChaCha20_avx512:
+	movq	%rsp,%r9
+.cfi_def_cfa_register	%r9
+	cmpq	$512,%rdx
+	ja	.LChaCha20_16x
+
+	subq	$64+8,%rsp
+	vbroadcasti32x4	.Lsigma(%rip),%zmm0
+	vbroadcasti32x4	(%rcx),%zmm1
+	vbroadcasti32x4	16(%rcx),%zmm2
+	vbroadcasti32x4	(%r8),%zmm3
+
+	vmovdqa32	%zmm0,%zmm16
+	vmovdqa32	%zmm1,%zmm17
+	vmovdqa32	%zmm2,%zmm18
+	vpaddd	.Lzeroz(%rip),%zmm3,%zmm3
+	vmovdqa32	.Lfourz(%rip),%zmm20
+	movq	$10,%r8
+	vmovdqa32	%zmm3,%zmm19
+	jmp	.Loop_avx512
+
+.align	16
+.Loop_outer_avx512:
+	vmovdqa32	%zmm16,%zmm0
+	vmovdqa32	%zmm17,%zmm1
+	vmovdqa32	%zmm18,%zmm2
+	vpaddd	%zmm20,%zmm19,%zmm3
+	movq	$10,%r8
+	vmovdqa32	%zmm3,%zmm19
+	jmp	.Loop_avx512
+
+.align	32
+.Loop_avx512:
+	vpaddd	%zmm1,%zmm0,%zmm0
+	vpxord	%zmm0,%zmm3,%zmm3
+	vprold	$16,%zmm3,%zmm3
+	vpaddd	%zmm3,%zmm2,%zmm2
+	vpxord	%zmm2,%zmm1,%zmm1
+	vprold	$12,%zmm1,%zmm1
+	vpaddd	%zmm1,%zmm0,%zmm0
+	vpxord	%zmm0,%zmm3,%zmm3
+	vprold	$8,%zmm3,%zmm3
+	vpaddd	%zmm3,%zmm2,%zmm2
+	vpxord	%zmm2,%zmm1,%zmm1
+	vprold	$7,%zmm1,%zmm1
+	vpshufd	$78,%zmm2,%zmm2
+	vpshufd	$57,%zmm1,%zmm1
+	vpshufd	$147,%zmm3,%zmm3
+	vpaddd	%zmm1,%zmm0,%zmm0
+	vpxord	%zmm0,%zmm3,%zmm3
+	vprold	$16,%zmm3,%zmm3
+	vpaddd	%zmm3,%zmm2,%zmm2
+	vpxord	%zmm2,%zmm1,%zmm1
+	vprold	$12,%zmm1,%zmm1
+	vpaddd	%zmm1,%zmm0,%zmm0
+	vpxord	%zmm0,%zmm3,%zmm3
+	vprold	$8,%zmm3,%zmm3
+	vpaddd	%zmm3,%zmm2,%zmm2
+	vpxord	%zmm2,%zmm1,%zmm1
+	vprold	$7,%zmm1,%zmm1
+	vpshufd	$78,%zmm2,%zmm2
+	vpshufd	$147,%zmm1,%zmm1
+	vpshufd	$57,%zmm3,%zmm3
+	decq	%r8
+	jnz	.Loop_avx512
+	vpaddd	%zmm16,%zmm0,%zmm0
+	vpaddd	%zmm17,%zmm1,%zmm1
+	vpaddd	%zmm18,%zmm2,%zmm2
+	vpaddd	%zmm19,%zmm3,%zmm3
+
+	subq	$64,%rdx
+	jb	.Ltail64_avx512
+
+	vpxor	0(%rsi),%xmm0,%xmm4
+	vpxor	16(%rsi),%xmm1,%xmm5
+	vpxor	32(%rsi),%xmm2,%xmm6
+	vpxor	48(%rsi),%xmm3,%xmm7
+	leaq	64(%rsi),%rsi
+
+	vmovdqu	%xmm4,0(%rdi)
+	vmovdqu	%xmm5,16(%rdi)
+	vmovdqu	%xmm6,32(%rdi)
+	vmovdqu	%xmm7,48(%rdi)
+	leaq	64(%rdi),%rdi
+
+	jz	.Ldone_avx512
+
+	vextracti32x4	$1,%zmm0,%xmm4
+	vextracti32x4	$1,%zmm1,%xmm5
+	vextracti32x4	$1,%zmm2,%xmm6
+	vextracti32x4	$1,%zmm3,%xmm7
+
+	subq	$64,%rdx
+	jb	.Ltail_avx512
+
+	vpxor	0(%rsi),%xmm4,%xmm4
+	vpxor	16(%rsi),%xmm5,%xmm5
+	vpxor	32(%rsi),%xmm6,%xmm6
+	vpxor	48(%rsi),%xmm7,%xmm7
+	leaq	64(%rsi),%rsi
+
+	vmovdqu	%xmm4,0(%rdi)
+	vmovdqu	%xmm5,16(%rdi)
+	vmovdqu	%xmm6,32(%rdi)
+	vmovdqu	%xmm7,48(%rdi)
+	leaq	64(%rdi),%rdi
+
+	jz	.Ldone_avx512
+
+	vextracti32x4	$2,%zmm0,%xmm4
+	vextracti32x4	$2,%zmm1,%xmm5
+	vextracti32x4	$2,%zmm2,%xmm6
+	vextracti32x4	$2,%zmm3,%xmm7
+
+	subq	$64,%rdx
+	jb	.Ltail_avx512
+
+	vpxor	0(%rsi),%xmm4,%xmm4
+	vpxor	16(%rsi),%xmm5,%xmm5
+	vpxor	32(%rsi),%xmm6,%xmm6
+	vpxor	48(%rsi),%xmm7,%xmm7
+	leaq	64(%rsi),%rsi
+
+	vmovdqu	%xmm4,0(%rdi)
+	vmovdqu	%xmm5,16(%rdi)
+	vmovdqu	%xmm6,32(%rdi)
+	vmovdqu	%xmm7,48(%rdi)
+	leaq	64(%rdi),%rdi
+
+	jz	.Ldone_avx512
+
+	vextracti32x4	$3,%zmm0,%xmm4
+	vextracti32x4	$3,%zmm1,%xmm5
+	vextracti32x4	$3,%zmm2,%xmm6
+	vextracti32x4	$3,%zmm3,%xmm7
+
+	subq	$64,%rdx
+	jb	.Ltail_avx512
+
+	vpxor	0(%rsi),%xmm4,%xmm4
+	vpxor	16(%rsi),%xmm5,%xmm5
+	vpxor	32(%rsi),%xmm6,%xmm6
+	vpxor	48(%rsi),%xmm7,%xmm7
+	leaq	64(%rsi),%rsi
+
+	vmovdqu	%xmm4,0(%rdi)
+	vmovdqu	%xmm5,16(%rdi)
+	vmovdqu	%xmm6,32(%rdi)
+	vmovdqu	%xmm7,48(%rdi)
+	leaq	64(%rdi),%rdi
+
+	jnz	.Loop_outer_avx512
+
+	jmp	.Ldone_avx512
+
+.align	16
+.Ltail64_avx512:
+	vmovdqa	%xmm0,0(%rsp)
+	vmovdqa	%xmm1,16(%rsp)
+	vmovdqa	%xmm2,32(%rsp)
+	vmovdqa	%xmm3,48(%rsp)
+	addq	$64,%rdx
+	jmp	.Loop_tail_avx512
+
+.align	16
+.Ltail_avx512:
+	vmovdqa	%xmm4,0(%rsp)
+	vmovdqa	%xmm5,16(%rsp)
+	vmovdqa	%xmm6,32(%rsp)
+	vmovdqa	%xmm7,48(%rsp)
+	addq	$64,%rdx
+
+.Loop_tail_avx512:
+	movzbl	(%rsi,%r8,1),%eax
+	movzbl	(%rsp,%r8,1),%ecx
+	leaq	1(%r8),%r8
+	xorl	%ecx,%eax
+	movb	%al,-1(%rdi,%r8,1)
+	decq	%rdx
+	jnz	.Loop_tail_avx512
+
+	vmovdqu32	%zmm16,0(%rsp)
+
+.Ldone_avx512:
+	vzeroall
+	leaq	(%r9),%rsp
+.cfi_def_cfa_register	%rsp
+.Lavx512_epilogue:
+	.byte	0xf3,0xc3
+.cfi_endproc	
+.size	ChaCha20_avx512,.-ChaCha20_avx512
+.type	ChaCha20_avx512vl,@function
+.align	32
+ChaCha20_avx512vl:
+.cfi_startproc	
+.LChaCha20_avx512vl:
+	movq	%rsp,%r9
+.cfi_def_cfa_register	%r9
+	cmpq	$128,%rdx
+	ja	.LChaCha20_8xvl
+
+	subq	$64+8,%rsp
+	vbroadcasti128	.Lsigma(%rip),%ymm0
+	vbroadcasti128	(%rcx),%ymm1
+	vbroadcasti128	16(%rcx),%ymm2
+	vbroadcasti128	(%r8),%ymm3
+
+	vmovdqa32	%ymm0,%ymm16
+	vmovdqa32	%ymm1,%ymm17
+	vmovdqa32	%ymm2,%ymm18
+	vpaddd	.Lzeroz(%rip),%ymm3,%ymm3
+	vmovdqa32	.Ltwoy(%rip),%ymm20
+	movq	$10,%r8
+	vmovdqa32	%ymm3,%ymm19
+	jmp	.Loop_avx512vl
+
+.align	16
+.Loop_outer_avx512vl:
+	vmovdqa32	%ymm18,%ymm2
+	vpaddd	%ymm20,%ymm19,%ymm3
+	movq	$10,%r8
+	vmovdqa32	%ymm3,%ymm19
+	jmp	.Loop_avx512vl
+
+.align	32
+.Loop_avx512vl:
+	vpaddd	%ymm1,%ymm0,%ymm0
+	vpxor	%ymm0,%ymm3,%ymm3
+	vprold	$16,%ymm3,%ymm3
+	vpaddd	%ymm3,%ymm2,%ymm2
+	vpxor	%ymm2,%ymm1,%ymm1
+	vprold	$12,%ymm1,%ymm1
+	vpaddd	%ymm1,%ymm0,%ymm0
+	vpxor	%ymm0,%ymm3,%ymm3
+	vprold	$8,%ymm3,%ymm3
+	vpaddd	%ymm3,%ymm2,%ymm2
+	vpxor	%ymm2,%ymm1,%ymm1
+	vprold	$7,%ymm1,%ymm1
+	vpshufd	$78,%ymm2,%ymm2
+	vpshufd	$57,%ymm1,%ymm1
+	vpshufd	$147,%ymm3,%ymm3
+	vpaddd	%ymm1,%ymm0,%ymm0
+	vpxor	%ymm0,%ymm3,%ymm3
+	vprold	$16,%ymm3,%ymm3
+	vpaddd	%ymm3,%ymm2,%ymm2
+	vpxor	%ymm2,%ymm1,%ymm1
+	vprold	$12,%ymm1,%ymm1
+	vpaddd	%ymm1,%ymm0,%ymm0
+	vpxor	%ymm0,%ymm3,%ymm3
+	vprold	$8,%ymm3,%ymm3
+	vpaddd	%ymm3,%ymm2,%ymm2
+	vpxor	%ymm2,%ymm1,%ymm1
+	vprold	$7,%ymm1,%ymm1
+	vpshufd	$78,%ymm2,%ymm2
+	vpshufd	$147,%ymm1,%ymm1
+	vpshufd	$57,%ymm3,%ymm3
+	decq	%r8
+	jnz	.Loop_avx512vl
+	vpaddd	%ymm16,%ymm0,%ymm0
+	vpaddd	%ymm17,%ymm1,%ymm1
+	vpaddd	%ymm18,%ymm2,%ymm2
+	vpaddd	%ymm19,%ymm3,%ymm3
+
+	subq	$64,%rdx
+	jb	.Ltail64_avx512vl
+
+	vpxor	0(%rsi),%xmm0,%xmm4
+	vpxor	16(%rsi),%xmm1,%xmm5
+	vpxor	32(%rsi),%xmm2,%xmm6
+	vpxor	48(%rsi),%xmm3,%xmm7
+	leaq	64(%rsi),%rsi
+
+	vmovdqu	%xmm4,0(%rdi)
+	vmovdqu	%xmm5,16(%rdi)
+	vmovdqu	%xmm6,32(%rdi)
+	vmovdqu	%xmm7,48(%rdi)
+	leaq	64(%rdi),%rdi
+
+	jz	.Ldone_avx512vl
+
+	vextracti128	$1,%ymm0,%xmm4
+	vextracti128	$1,%ymm1,%xmm5
+	vextracti128	$1,%ymm2,%xmm6
+	vextracti128	$1,%ymm3,%xmm7
+
+	subq	$64,%rdx
+	jb	.Ltail_avx512vl
+
+	vpxor	0(%rsi),%xmm4,%xmm4
+	vpxor	16(%rsi),%xmm5,%xmm5
+	vpxor	32(%rsi),%xmm6,%xmm6
+	vpxor	48(%rsi),%xmm7,%xmm7
+	leaq	64(%rsi),%rsi
+
+	vmovdqu	%xmm4,0(%rdi)
+	vmovdqu	%xmm5,16(%rdi)
+	vmovdqu	%xmm6,32(%rdi)
+	vmovdqu	%xmm7,48(%rdi)
+	leaq	64(%rdi),%rdi
+
+	vmovdqa32	%ymm16,%ymm0
+	vmovdqa32	%ymm17,%ymm1
+	jnz	.Loop_outer_avx512vl
+
+	jmp	.Ldone_avx512vl
+
+.align	16
+.Ltail64_avx512vl:
+	vmovdqa	%xmm0,0(%rsp)
+	vmovdqa	%xmm1,16(%rsp)
+	vmovdqa	%xmm2,32(%rsp)
+	vmovdqa	%xmm3,48(%rsp)
+	addq	$64,%rdx
+	jmp	.Loop_tail_avx512vl
+
+.align	16
+.Ltail_avx512vl:
+	vmovdqa	%xmm4,0(%rsp)
+	vmovdqa	%xmm5,16(%rsp)
+	vmovdqa	%xmm6,32(%rsp)
+	vmovdqa	%xmm7,48(%rsp)
+	addq	$64,%rdx
+
+.Loop_tail_avx512vl:
+	movzbl	(%rsi,%r8,1),%eax
+	movzbl	(%rsp,%r8,1),%ecx
+	leaq	1(%r8),%r8
+	xorl	%ecx,%eax
+	movb	%al,-1(%rdi,%r8,1)
+	decq	%rdx
+	jnz	.Loop_tail_avx512vl
+
+	vmovdqu32	%ymm16,0(%rsp)
+	vmovdqu32	%ymm16,32(%rsp)
+
+.Ldone_avx512vl:
+	vzeroall
+	leaq	(%r9),%rsp
+.cfi_def_cfa_register	%rsp
+.Lavx512vl_epilogue:
+	.byte	0xf3,0xc3
+.cfi_endproc	
+.size	ChaCha20_avx512vl,.-ChaCha20_avx512vl
+.type	ChaCha20_16x,@function
+.align	32
+ChaCha20_16x:
+.cfi_startproc	
+.LChaCha20_16x:
+	movq	%rsp,%r9
+.cfi_def_cfa_register	%r9
+	subq	$64+8,%rsp
+	andq	$-64,%rsp
+	vzeroupper
+
+	leaq	.Lsigma(%rip),%r10
+	vbroadcasti32x4	(%r10),%zmm3
+	vbroadcasti32x4	(%rcx),%zmm7
+	vbroadcasti32x4	16(%rcx),%zmm11
+	vbroadcasti32x4	(%r8),%zmm15
+
+	vpshufd	$0x00,%zmm3,%zmm0
+	vpshufd	$0x55,%zmm3,%zmm1
+	vpshufd	$0xaa,%zmm3,%zmm2
+	vpshufd	$0xff,%zmm3,%zmm3
+	vmovdqa64	%zmm0,%zmm16
+	vmovdqa64	%zmm1,%zmm17
+	vmovdqa64	%zmm2,%zmm18
+	vmovdqa64	%zmm3,%zmm19
+
+	vpshufd	$0x00,%zmm7,%zmm4
+	vpshufd	$0x55,%zmm7,%zmm5
+	vpshufd	$0xaa,%zmm7,%zmm6
+	vpshufd	$0xff,%zmm7,%zmm7
+	vmovdqa64	%zmm4,%zmm20
+	vmovdqa64	%zmm5,%zmm21
+	vmovdqa64	%zmm6,%zmm22
+	vmovdqa64	%zmm7,%zmm23
+
+	vpshufd	$0x00,%zmm11,%zmm8
+	vpshufd	$0x55,%zmm11,%zmm9
+	vpshufd	$0xaa,%zmm11,%zmm10
+	vpshufd	$0xff,%zmm11,%zmm11
+	vmovdqa64	%zmm8,%zmm24
+	vmovdqa64	%zmm9,%zmm25
+	vmovdqa64	%zmm10,%zmm26
+	vmovdqa64	%zmm11,%zmm27
+
+	vpshufd	$0x00,%zmm15,%zmm12
+	vpshufd	$0x55,%zmm15,%zmm13
+	vpshufd	$0xaa,%zmm15,%zmm14
+	vpshufd	$0xff,%zmm15,%zmm15
+	vpaddd	.Lincz(%rip),%zmm12,%zmm12
+	vmovdqa64	%zmm12,%zmm28
+	vmovdqa64	%zmm13,%zmm29
+	vmovdqa64	%zmm14,%zmm30
+	vmovdqa64	%zmm15,%zmm31
+
+	movl	$10,%eax
+	jmp	.Loop16x
+
+.align	32
+.Loop_outer16x:
+	vpbroadcastd	0(%r10),%zmm0
+	vpbroadcastd	4(%r10),%zmm1
+	vpbroadcastd	8(%r10),%zmm2
+	vpbroadcastd	12(%r10),%zmm3
+	vpaddd	.Lsixteen(%rip),%zmm28,%zmm28
+	vmovdqa64	%zmm20,%zmm4
+	vmovdqa64	%zmm21,%zmm5
+	vmovdqa64	%zmm22,%zmm6
+	vmovdqa64	%zmm23,%zmm7
+	vmovdqa64	%zmm24,%zmm8
+	vmovdqa64	%zmm25,%zmm9
+	vmovdqa64	%zmm26,%zmm10
+	vmovdqa64	%zmm27,%zmm11
+	vmovdqa64	%zmm28,%zmm12
+	vmovdqa64	%zmm29,%zmm13
+	vmovdqa64	%zmm30,%zmm14
+	vmovdqa64	%zmm31,%zmm15
+
+	vmovdqa64	%zmm0,%zmm16
+	vmovdqa64	%zmm1,%zmm17
+	vmovdqa64	%zmm2,%zmm18
+	vmovdqa64	%zmm3,%zmm19
+
+	movl	$10,%eax
+	jmp	.Loop16x
+
+.align	32
+.Loop16x:
+	vpaddd	%zmm4,%zmm0,%zmm0
+	vpaddd	%zmm5,%zmm1,%zmm1
+	vpaddd	%zmm6,%zmm2,%zmm2
+	vpaddd	%zmm7,%zmm3,%zmm3
+	vpxord	%zmm0,%zmm12,%zmm12
+	vpxord	%zmm1,%zmm13,%zmm13
+	vpxord	%zmm2,%zmm14,%zmm14
+	vpxord	%zmm3,%zmm15,%zmm15
+	vprold	$16,%zmm12,%zmm12
+	vprold	$16,%zmm13,%zmm13
+	vprold	$16,%zmm14,%zmm14
+	vprold	$16,%zmm15,%zmm15
+	vpaddd	%zmm12,%zmm8,%zmm8
+	vpaddd	%zmm13,%zmm9,%zmm9
+	vpaddd	%zmm14,%zmm10,%zmm10
+	vpaddd	%zmm15,%zmm11,%zmm11
+	vpxord	%zmm8,%zmm4,%zmm4
+	vpxord	%zmm9,%zmm5,%zmm5
+	vpxord	%zmm10,%zmm6,%zmm6
+	vpxord	%zmm11,%zmm7,%zmm7
+	vprold	$12,%zmm4,%zmm4
+	vprold	$12,%zmm5,%zmm5
+	vprold	$12,%zmm6,%zmm6
+	vprold	$12,%zmm7,%zmm7
+	vpaddd	%zmm4,%zmm0,%zmm0
+	vpaddd	%zmm5,%zmm1,%zmm1
+	vpaddd	%zmm6,%zmm2,%zmm2
+	vpaddd	%zmm7,%zmm3,%zmm3
+	vpxord	%zmm0,%zmm12,%zmm12
+	vpxord	%zmm1,%zmm13,%zmm13
+	vpxord	%zmm2,%zmm14,%zmm14
+	vpxord	%zmm3,%zmm15,%zmm15
+	vprold	$8,%zmm12,%zmm12
+	vprold	$8,%zmm13,%zmm13
+	vprold	$8,%zmm14,%zmm14
+	vprold	$8,%zmm15,%zmm15
+	vpaddd	%zmm12,%zmm8,%zmm8
+	vpaddd	%zmm13,%zmm9,%zmm9
+	vpaddd	%zmm14,%zmm10,%zmm10
+	vpaddd	%zmm15,%zmm11,%zmm11
+	vpxord	%zmm8,%zmm4,%zmm4
+	vpxord	%zmm9,%zmm5,%zmm5
+	vpxord	%zmm10,%zmm6,%zmm6
+	vpxord	%zmm11,%zmm7,%zmm7
+	vprold	$7,%zmm4,%zmm4
+	vprold	$7,%zmm5,%zmm5
+	vprold	$7,%zmm6,%zmm6
+	vprold	$7,%zmm7,%zmm7
+	vpaddd	%zmm5,%zmm0,%zmm0
+	vpaddd	%zmm6,%zmm1,%zmm1
+	vpaddd	%zmm7,%zmm2,%zmm2
+	vpaddd	%zmm4,%zmm3,%zmm3
+	vpxord	%zmm0,%zmm15,%zmm15
+	vpxord	%zmm1,%zmm12,%zmm12
+	vpxord	%zmm2,%zmm13,%zmm13
+	vpxord	%zmm3,%zmm14,%zmm14
+	vprold	$16,%zmm15,%zmm15
+	vprold	$16,%zmm12,%zmm12
+	vprold	$16,%zmm13,%zmm13
+	vprold	$16,%zmm14,%zmm14
+	vpaddd	%zmm15,%zmm10,%zmm10
+	vpaddd	%zmm12,%zmm11,%zmm11
+	vpaddd	%zmm13,%zmm8,%zmm8
+	vpaddd	%zmm14,%zmm9,%zmm9
+	vpxord	%zmm10,%zmm5,%zmm5
+	vpxord	%zmm11,%zmm6,%zmm6
+	vpxord	%zmm8,%zmm7,%zmm7
+	vpxord	%zmm9,%zmm4,%zmm4
+	vprold	$12,%zmm5,%zmm5
+	vprold	$12,%zmm6,%zmm6
+	vprold	$12,%zmm7,%zmm7
+	vprold	$12,%zmm4,%zmm4
+	vpaddd	%zmm5,%zmm0,%zmm0
+	vpaddd	%zmm6,%zmm1,%zmm1
+	vpaddd	%zmm7,%zmm2,%zmm2
+	vpaddd	%zmm4,%zmm3,%zmm3
+	vpxord	%zmm0,%zmm15,%zmm15
+	vpxord	%zmm1,%zmm12,%zmm12
+	vpxord	%zmm2,%zmm13,%zmm13
+	vpxord	%zmm3,%zmm14,%zmm14
+	vprold	$8,%zmm15,%zmm15
+	vprold	$8,%zmm12,%zmm12
+	vprold	$8,%zmm13,%zmm13
+	vprold	$8,%zmm14,%zmm14
+	vpaddd	%zmm15,%zmm10,%zmm10
+	vpaddd	%zmm12,%zmm11,%zmm11
+	vpaddd	%zmm13,%zmm8,%zmm8
+	vpaddd	%zmm14,%zmm9,%zmm9
+	vpxord	%zmm10,%zmm5,%zmm5
+	vpxord	%zmm11,%zmm6,%zmm6
+	vpxord	%zmm8,%zmm7,%zmm7
+	vpxord	%zmm9,%zmm4,%zmm4
+	vprold	$7,%zmm5,%zmm5
+	vprold	$7,%zmm6,%zmm6
+	vprold	$7,%zmm7,%zmm7
+	vprold	$7,%zmm4,%zmm4
+	decl	%eax
+	jnz	.Loop16x
+
+	vpaddd	%zmm16,%zmm0,%zmm0
+	vpaddd	%zmm17,%zmm1,%zmm1
+	vpaddd	%zmm18,%zmm2,%zmm2
+	vpaddd	%zmm19,%zmm3,%zmm3
+
+	vpunpckldq	%zmm1,%zmm0,%zmm18
+	vpunpckldq	%zmm3,%zmm2,%zmm19
+	vpunpckhdq	%zmm1,%zmm0,%zmm0
+	vpunpckhdq	%zmm3,%zmm2,%zmm2
+	vpunpcklqdq	%zmm19,%zmm18,%zmm1
+	vpunpckhqdq	%zmm19,%zmm18,%zmm18
+	vpunpcklqdq	%zmm2,%zmm0,%zmm3
+	vpunpckhqdq	%zmm2,%zmm0,%zmm0
+	vpaddd	%zmm20,%zmm4,%zmm4
+	vpaddd	%zmm21,%zmm5,%zmm5
+	vpaddd	%zmm22,%zmm6,%zmm6
+	vpaddd	%zmm23,%zmm7,%zmm7
+
+	vpunpckldq	%zmm5,%zmm4,%zmm2
+	vpunpckldq	%zmm7,%zmm6,%zmm19
+	vpunpckhdq	%zmm5,%zmm4,%zmm4
+	vpunpckhdq	%zmm7,%zmm6,%zmm6
+	vpunpcklqdq	%zmm19,%zmm2,%zmm5
+	vpunpckhqdq	%zmm19,%zmm2,%zmm2
+	vpunpcklqdq	%zmm6,%zmm4,%zmm7
+	vpunpckhqdq	%zmm6,%zmm4,%zmm4
+	vshufi32x4	$0x44,%zmm5,%zmm1,%zmm19
+	vshufi32x4	$0xee,%zmm5,%zmm1,%zmm5
+	vshufi32x4	$0x44,%zmm2,%zmm18,%zmm1
+	vshufi32x4	$0xee,%zmm2,%zmm18,%zmm2
+	vshufi32x4	$0x44,%zmm7,%zmm3,%zmm18
+	vshufi32x4	$0xee,%zmm7,%zmm3,%zmm7
+	vshufi32x4	$0x44,%zmm4,%zmm0,%zmm3
+	vshufi32x4	$0xee,%zmm4,%zmm0,%zmm4
+	vpaddd	%zmm24,%zmm8,%zmm8
+	vpaddd	%zmm25,%zmm9,%zmm9
+	vpaddd	%zmm26,%zmm10,%zmm10
+	vpaddd	%zmm27,%zmm11,%zmm11
+
+	vpunpckldq	%zmm9,%zmm8,%zmm6
+	vpunpckldq	%zmm11,%zmm10,%zmm0
+	vpunpckhdq	%zmm9,%zmm8,%zmm8
+	vpunpckhdq	%zmm11,%zmm10,%zmm10
+	vpunpcklqdq	%zmm0,%zmm6,%zmm9
+	vpunpckhqdq	%zmm0,%zmm6,%zmm6
+	vpunpcklqdq	%zmm10,%zmm8,%zmm11
+	vpunpckhqdq	%zmm10,%zmm8,%zmm8
+	vpaddd	%zmm28,%zmm12,%zmm12
+	vpaddd	%zmm29,%zmm13,%zmm13
+	vpaddd	%zmm30,%zmm14,%zmm14
+	vpaddd	%zmm31,%zmm15,%zmm15
+
+	vpunpckldq	%zmm13,%zmm12,%zmm10
+	vpunpckldq	%zmm15,%zmm14,%zmm0
+	vpunpckhdq	%zmm13,%zmm12,%zmm12
+	vpunpckhdq	%zmm15,%zmm14,%zmm14
+	vpunpcklqdq	%zmm0,%zmm10,%zmm13
+	vpunpckhqdq	%zmm0,%zmm10,%zmm10
+	vpunpcklqdq	%zmm14,%zmm12,%zmm15
+	vpunpckhqdq	%zmm14,%zmm12,%zmm12
+	vshufi32x4	$0x44,%zmm13,%zmm9,%zmm0
+	vshufi32x4	$0xee,%zmm13,%zmm9,%zmm13
+	vshufi32x4	$0x44,%zmm10,%zmm6,%zmm9
+	vshufi32x4	$0xee,%zmm10,%zmm6,%zmm10
+	vshufi32x4	$0x44,%zmm15,%zmm11,%zmm6
+	vshufi32x4	$0xee,%zmm15,%zmm11,%zmm15
+	vshufi32x4	$0x44,%zmm12,%zmm8,%zmm11
+	vshufi32x4	$0xee,%zmm12,%zmm8,%zmm12
+	vshufi32x4	$0x88,%zmm0,%zmm19,%zmm16
+	vshufi32x4	$0xdd,%zmm0,%zmm19,%zmm19
+	vshufi32x4	$0x88,%zmm13,%zmm5,%zmm0
+	vshufi32x4	$0xdd,%zmm13,%zmm5,%zmm13
+	vshufi32x4	$0x88,%zmm9,%zmm1,%zmm17
+	vshufi32x4	$0xdd,%zmm9,%zmm1,%zmm1
+	vshufi32x4	$0x88,%zmm10,%zmm2,%zmm9
+	vshufi32x4	$0xdd,%zmm10,%zmm2,%zmm10
+	vshufi32x4	$0x88,%zmm6,%zmm18,%zmm14
+	vshufi32x4	$0xdd,%zmm6,%zmm18,%zmm18
+	vshufi32x4	$0x88,%zmm15,%zmm7,%zmm6
+	vshufi32x4	$0xdd,%zmm15,%zmm7,%zmm15
+	vshufi32x4	$0x88,%zmm11,%zmm3,%zmm8
+	vshufi32x4	$0xdd,%zmm11,%zmm3,%zmm3
+	vshufi32x4	$0x88,%zmm12,%zmm4,%zmm11
+	vshufi32x4	$0xdd,%zmm12,%zmm4,%zmm12
+	cmpq	$1024,%rdx
+	jb	.Ltail16x
+
+	vpxord	0(%rsi),%zmm16,%zmm16
+	vpxord	64(%rsi),%zmm17,%zmm17
+	vpxord	128(%rsi),%zmm14,%zmm14
+	vpxord	192(%rsi),%zmm8,%zmm8
+	vmovdqu32	%zmm16,0(%rdi)
+	vmovdqu32	%zmm17,64(%rdi)
+	vmovdqu32	%zmm14,128(%rdi)
+	vmovdqu32	%zmm8,192(%rdi)
+
+	vpxord	256(%rsi),%zmm19,%zmm19
+	vpxord	320(%rsi),%zmm1,%zmm1
+	vpxord	384(%rsi),%zmm18,%zmm18
+	vpxord	448(%rsi),%zmm3,%zmm3
+	vmovdqu32	%zmm19,256(%rdi)
+	vmovdqu32	%zmm1,320(%rdi)
+	vmovdqu32	%zmm18,384(%rdi)
+	vmovdqu32	%zmm3,448(%rdi)
+
+	vpxord	512(%rsi),%zmm0,%zmm0
+	vpxord	576(%rsi),%zmm9,%zmm9
+	vpxord	640(%rsi),%zmm6,%zmm6
+	vpxord	704(%rsi),%zmm11,%zmm11
+	vmovdqu32	%zmm0,512(%rdi)
+	vmovdqu32	%zmm9,576(%rdi)
+	vmovdqu32	%zmm6,640(%rdi)
+	vmovdqu32	%zmm11,704(%rdi)
+
+	vpxord	768(%rsi),%zmm13,%zmm13
+	vpxord	832(%rsi),%zmm10,%zmm10
+	vpxord	896(%rsi),%zmm15,%zmm15
+	vpxord	960(%rsi),%zmm12,%zmm12
+	leaq	1024(%rsi),%rsi
+	vmovdqu32	%zmm13,768(%rdi)
+	vmovdqu32	%zmm10,832(%rdi)
+	vmovdqu32	%zmm15,896(%rdi)
+	vmovdqu32	%zmm12,960(%rdi)
+	leaq	1024(%rdi),%rdi
+
+	subq	$1024,%rdx
+	jnz	.Loop_outer16x
+
+	jmp	.Ldone16x
+
+.align	32
+.Ltail16x:
+	xorq	%r10,%r10
+	subq	%rsi,%rdi
+	cmpq	$64,%rdx
+	jb	.Less_than_64_16x
+	vpxord	(%rsi),%zmm16,%zmm16
+	vmovdqu32	%zmm16,(%rdi,%rsi,1)
+	je	.Ldone16x
+	vmovdqa32	%zmm17,%zmm16
+	leaq	64(%rsi),%rsi
+
+	cmpq	$128,%rdx
+	jb	.Less_than_64_16x
+	vpxord	(%rsi),%zmm17,%zmm17
+	vmovdqu32	%zmm17,(%rdi,%rsi,1)
+	je	.Ldone16x
+	vmovdqa32	%zmm14,%zmm16
+	leaq	64(%rsi),%rsi
+
+	cmpq	$192,%rdx
+	jb	.Less_than_64_16x
+	vpxord	(%rsi),%zmm14,%zmm14
+	vmovdqu32	%zmm14,(%rdi,%rsi,1)
+	je	.Ldone16x
+	vmovdqa32	%zmm8,%zmm16
+	leaq	64(%rsi),%rsi
+
+	cmpq	$256,%rdx
+	jb	.Less_than_64_16x
+	vpxord	(%rsi),%zmm8,%zmm8
+	vmovdqu32	%zmm8,(%rdi,%rsi,1)
+	je	.Ldone16x
+	vmovdqa32	%zmm19,%zmm16
+	leaq	64(%rsi),%rsi
+
+	cmpq	$320,%rdx
+	jb	.Less_than_64_16x
+	vpxord	(%rsi),%zmm19,%zmm19
+	vmovdqu32	%zmm19,(%rdi,%rsi,1)
+	je	.Ldone16x
+	vmovdqa32	%zmm1,%zmm16
+	leaq	64(%rsi),%rsi
+
+	cmpq	$384,%rdx
+	jb	.Less_than_64_16x
+	vpxord	(%rsi),%zmm1,%zmm1
+	vmovdqu32	%zmm1,(%rdi,%rsi,1)
+	je	.Ldone16x
+	vmovdqa32	%zmm18,%zmm16
+	leaq	64(%rsi),%rsi
+
+	cmpq	$448,%rdx
+	jb	.Less_than_64_16x
+	vpxord	(%rsi),%zmm18,%zmm18
+	vmovdqu32	%zmm18,(%rdi,%rsi,1)
+	je	.Ldone16x
+	vmovdqa32	%zmm3,%zmm16
+	leaq	64(%rsi),%rsi
+
+	cmpq	$512,%rdx
+	jb	.Less_than_64_16x
+	vpxord	(%rsi),%zmm3,%zmm3
+	vmovdqu32	%zmm3,(%rdi,%rsi,1)
+	je	.Ldone16x
+	vmovdqa32	%zmm0,%zmm16
+	leaq	64(%rsi),%rsi
+
+	cmpq	$576,%rdx
+	jb	.Less_than_64_16x
+	vpxord	(%rsi),%zmm0,%zmm0
+	vmovdqu32	%zmm0,(%rdi,%rsi,1)
+	je	.Ldone16x
+	vmovdqa32	%zmm9,%zmm16
+	leaq	64(%rsi),%rsi
+
+	cmpq	$640,%rdx
+	jb	.Less_than_64_16x
+	vpxord	(%rsi),%zmm9,%zmm9
+	vmovdqu32	%zmm9,(%rdi,%rsi,1)
+	je	.Ldone16x
+	vmovdqa32	%zmm6,%zmm16
+	leaq	64(%rsi),%rsi
+
+	cmpq	$704,%rdx
+	jb	.Less_than_64_16x
+	vpxord	(%rsi),%zmm6,%zmm6
+	vmovdqu32	%zmm6,(%rdi,%rsi,1)
+	je	.Ldone16x
+	vmovdqa32	%zmm11,%zmm16
+	leaq	64(%rsi),%rsi
+
+	cmpq	$768,%rdx
+	jb	.Less_than_64_16x
+	vpxord	(%rsi),%zmm11,%zmm11
+	vmovdqu32	%zmm11,(%rdi,%rsi,1)
+	je	.Ldone16x
+	vmovdqa32	%zmm13,%zmm16
+	leaq	64(%rsi),%rsi
+
+	cmpq	$832,%rdx
+	jb	.Less_than_64_16x
+	vpxord	(%rsi),%zmm13,%zmm13
+	vmovdqu32	%zmm13,(%rdi,%rsi,1)
+	je	.Ldone16x
+	vmovdqa32	%zmm10,%zmm16
+	leaq	64(%rsi),%rsi
+
+	cmpq	$896,%rdx
+	jb	.Less_than_64_16x
+	vpxord	(%rsi),%zmm10,%zmm10
+	vmovdqu32	%zmm10,(%rdi,%rsi,1)
+	je	.Ldone16x
+	vmovdqa32	%zmm15,%zmm16
+	leaq	64(%rsi),%rsi
+
+	cmpq	$960,%rdx
+	jb	.Less_than_64_16x
+	vpxord	(%rsi),%zmm15,%zmm15
+	vmovdqu32	%zmm15,(%rdi,%rsi,1)
+	je	.Ldone16x
+	vmovdqa32	%zmm12,%zmm16
+	leaq	64(%rsi),%rsi
+
+.Less_than_64_16x:
+	vmovdqa32	%zmm16,0(%rsp)
+	leaq	(%rdi,%rsi,1),%rdi
+	andq	$63,%rdx
+
+.Loop_tail16x:
+	movzbl	(%rsi,%r10,1),%eax
+	movzbl	(%rsp,%r10,1),%ecx
+	leaq	1(%r10),%r10
+	xorl	%ecx,%eax
+	movb	%al,-1(%rdi,%r10,1)
+	decq	%rdx
+	jnz	.Loop_tail16x
+
+	vpxord	%zmm16,%zmm16,%zmm16
+	vmovdqa32	%zmm16,0(%rsp)
+
+.Ldone16x:
+	vzeroall
+	leaq	(%r9),%rsp
+.cfi_def_cfa_register	%rsp
+.L16x_epilogue:
+	.byte	0xf3,0xc3
+.cfi_endproc	
+.size	ChaCha20_16x,.-ChaCha20_16x
+.type	ChaCha20_8xvl,@function
+.align	32
+ChaCha20_8xvl:
+.cfi_startproc	
+.LChaCha20_8xvl:
+	movq	%rsp,%r9
+.cfi_def_cfa_register	%r9
+	subq	$64+8,%rsp
+	andq	$-64,%rsp
+	vzeroupper
+
+	leaq	.Lsigma(%rip),%r10
+	vbroadcasti128	(%r10),%ymm3
+	vbroadcasti128	(%rcx),%ymm7
+	vbroadcasti128	16(%rcx),%ymm11
+	vbroadcasti128	(%r8),%ymm15
+
+	vpshufd	$0x00,%ymm3,%ymm0
+	vpshufd	$0x55,%ymm3,%ymm1
+	vpshufd	$0xaa,%ymm3,%ymm2
+	vpshufd	$0xff,%ymm3,%ymm3
+	vmovdqa64	%ymm0,%ymm16
+	vmovdqa64	%ymm1,%ymm17
+	vmovdqa64	%ymm2,%ymm18
+	vmovdqa64	%ymm3,%ymm19
+
+	vpshufd	$0x00,%ymm7,%ymm4
+	vpshufd	$0x55,%ymm7,%ymm5
+	vpshufd	$0xaa,%ymm7,%ymm6
+	vpshufd	$0xff,%ymm7,%ymm7
+	vmovdqa64	%ymm4,%ymm20
+	vmovdqa64	%ymm5,%ymm21
+	vmovdqa64	%ymm6,%ymm22
+	vmovdqa64	%ymm7,%ymm23
+
+	vpshufd	$0x00,%ymm11,%ymm8
+	vpshufd	$0x55,%ymm11,%ymm9
+	vpshufd	$0xaa,%ymm11,%ymm10
+	vpshufd	$0xff,%ymm11,%ymm11
+	vmovdqa64	%ymm8,%ymm24
+	vmovdqa64	%ymm9,%ymm25
+	vmovdqa64	%ymm10,%ymm26
+	vmovdqa64	%ymm11,%ymm27
+
+	vpshufd	$0x00,%ymm15,%ymm12
+	vpshufd	$0x55,%ymm15,%ymm13
+	vpshufd	$0xaa,%ymm15,%ymm14
+	vpshufd	$0xff,%ymm15,%ymm15
+	vpaddd	.Lincy(%rip),%ymm12,%ymm12
+	vmovdqa64	%ymm12,%ymm28
+	vmovdqa64	%ymm13,%ymm29
+	vmovdqa64	%ymm14,%ymm30
+	vmovdqa64	%ymm15,%ymm31
+
+	movl	$10,%eax
+	jmp	.Loop8xvl
+
+.align	32
+.Loop_outer8xvl:
+
+
+	vpbroadcastd	8(%r10),%ymm2
+	vpbroadcastd	12(%r10),%ymm3
+	vpaddd	.Leight(%rip),%ymm28,%ymm28
+	vmovdqa64	%ymm20,%ymm4
+	vmovdqa64	%ymm21,%ymm5
+	vmovdqa64	%ymm22,%ymm6
+	vmovdqa64	%ymm23,%ymm7
+	vmovdqa64	%ymm24,%ymm8
+	vmovdqa64	%ymm25,%ymm9
+	vmovdqa64	%ymm26,%ymm10
+	vmovdqa64	%ymm27,%ymm11
+	vmovdqa64	%ymm28,%ymm12
+	vmovdqa64	%ymm29,%ymm13
+	vmovdqa64	%ymm30,%ymm14
+	vmovdqa64	%ymm31,%ymm15
+
+	vmovdqa64	%ymm0,%ymm16
+	vmovdqa64	%ymm1,%ymm17
+	vmovdqa64	%ymm2,%ymm18
+	vmovdqa64	%ymm3,%ymm19
+
+	movl	$10,%eax
+	jmp	.Loop8xvl
+
+.align	32
+.Loop8xvl:
+	vpaddd	%ymm4,%ymm0,%ymm0
+	vpaddd	%ymm5,%ymm1,%ymm1
+	vpaddd	%ymm6,%ymm2,%ymm2
+	vpaddd	%ymm7,%ymm3,%ymm3
+	vpxor	%ymm0,%ymm12,%ymm12
+	vpxor	%ymm1,%ymm13,%ymm13
+	vpxor	%ymm2,%ymm14,%ymm14
+	vpxor	%ymm3,%ymm15,%ymm15
+	vprold	$16,%ymm12,%ymm12
+	vprold	$16,%ymm13,%ymm13
+	vprold	$16,%ymm14,%ymm14
+	vprold	$16,%ymm15,%ymm15
+	vpaddd	%ymm12,%ymm8,%ymm8
+	vpaddd	%ymm13,%ymm9,%ymm9
+	vpaddd	%ymm14,%ymm10,%ymm10
+	vpaddd	%ymm15,%ymm11,%ymm11
+	vpxor	%ymm8,%ymm4,%ymm4
+	vpxor	%ymm9,%ymm5,%ymm5
+	vpxor	%ymm10,%ymm6,%ymm6
+	vpxor	%ymm11,%ymm7,%ymm7
+	vprold	$12,%ymm4,%ymm4
+	vprold	$12,%ymm5,%ymm5
+	vprold	$12,%ymm6,%ymm6
+	vprold	$12,%ymm7,%ymm7
+	vpaddd	%ymm4,%ymm0,%ymm0
+	vpaddd	%ymm5,%ymm1,%ymm1
+	vpaddd	%ymm6,%ymm2,%ymm2
+	vpaddd	%ymm7,%ymm3,%ymm3
+	vpxor	%ymm0,%ymm12,%ymm12
+	vpxor	%ymm1,%ymm13,%ymm13
+	vpxor	%ymm2,%ymm14,%ymm14
+	vpxor	%ymm3,%ymm15,%ymm15
+	vprold	$8,%ymm12,%ymm12
+	vprold	$8,%ymm13,%ymm13
+	vprold	$8,%ymm14,%ymm14
+	vprold	$8,%ymm15,%ymm15
+	vpaddd	%ymm12,%ymm8,%ymm8
+	vpaddd	%ymm13,%ymm9,%ymm9
+	vpaddd	%ymm14,%ymm10,%ymm10
+	vpaddd	%ymm15,%ymm11,%ymm11
+	vpxor	%ymm8,%ymm4,%ymm4
+	vpxor	%ymm9,%ymm5,%ymm5
+	vpxor	%ymm10,%ymm6,%ymm6
+	vpxor	%ymm11,%ymm7,%ymm7
+	vprold	$7,%ymm4,%ymm4
+	vprold	$7,%ymm5,%ymm5
+	vprold	$7,%ymm6,%ymm6
+	vprold	$7,%ymm7,%ymm7
+	vpaddd	%ymm5,%ymm0,%ymm0
+	vpaddd	%ymm6,%ymm1,%ymm1
+	vpaddd	%ymm7,%ymm2,%ymm2
+	vpaddd	%ymm4,%ymm3,%ymm3
+	vpxor	%ymm0,%ymm15,%ymm15
+	vpxor	%ymm1,%ymm12,%ymm12
+	vpxor	%ymm2,%ymm13,%ymm13
+	vpxor	%ymm3,%ymm14,%ymm14
+	vprold	$16,%ymm15,%ymm15
+	vprold	$16,%ymm12,%ymm12
+	vprold	$16,%ymm13,%ymm13
+	vprold	$16,%ymm14,%ymm14
+	vpaddd	%ymm15,%ymm10,%ymm10
+	vpaddd	%ymm12,%ymm11,%ymm11
+	vpaddd	%ymm13,%ymm8,%ymm8
+	vpaddd	%ymm14,%ymm9,%ymm9
+	vpxor	%ymm10,%ymm5,%ymm5
+	vpxor	%ymm11,%ymm6,%ymm6
+	vpxor	%ymm8,%ymm7,%ymm7
+	vpxor	%ymm9,%ymm4,%ymm4
+	vprold	$12,%ymm5,%ymm5
+	vprold	$12,%ymm6,%ymm6
+	vprold	$12,%ymm7,%ymm7
+	vprold	$12,%ymm4,%ymm4
+	vpaddd	%ymm5,%ymm0,%ymm0
+	vpaddd	%ymm6,%ymm1,%ymm1
+	vpaddd	%ymm7,%ymm2,%ymm2
+	vpaddd	%ymm4,%ymm3,%ymm3
+	vpxor	%ymm0,%ymm15,%ymm15
+	vpxor	%ymm1,%ymm12,%ymm12
+	vpxor	%ymm2,%ymm13,%ymm13
+	vpxor	%ymm3,%ymm14,%ymm14
+	vprold	$8,%ymm15,%ymm15
+	vprold	$8,%ymm12,%ymm12
+	vprold	$8,%ymm13,%ymm13
+	vprold	$8,%ymm14,%ymm14
+	vpaddd	%ymm15,%ymm10,%ymm10
+	vpaddd	%ymm12,%ymm11,%ymm11
+	vpaddd	%ymm13,%ymm8,%ymm8
+	vpaddd	%ymm14,%ymm9,%ymm9
+	vpxor	%ymm10,%ymm5,%ymm5
+	vpxor	%ymm11,%ymm6,%ymm6
+	vpxor	%ymm8,%ymm7,%ymm7
+	vpxor	%ymm9,%ymm4,%ymm4
+	vprold	$7,%ymm5,%ymm5
+	vprold	$7,%ymm6,%ymm6
+	vprold	$7,%ymm7,%ymm7
+	vprold	$7,%ymm4,%ymm4
+	decl	%eax
+	jnz	.Loop8xvl
+
+	vpaddd	%ymm16,%ymm0,%ymm0
+	vpaddd	%ymm17,%ymm1,%ymm1
+	vpaddd	%ymm18,%ymm2,%ymm2
+	vpaddd	%ymm19,%ymm3,%ymm3
+
+	vpunpckldq	%ymm1,%ymm0,%ymm18
+	vpunpckldq	%ymm3,%ymm2,%ymm19
+	vpunpckhdq	%ymm1,%ymm0,%ymm0
+	vpunpckhdq	%ymm3,%ymm2,%ymm2
+	vpunpcklqdq	%ymm19,%ymm18,%ymm1
+	vpunpckhqdq	%ymm19,%ymm18,%ymm18
+	vpunpcklqdq	%ymm2,%ymm0,%ymm3
+	vpunpckhqdq	%ymm2,%ymm0,%ymm0
+	vpaddd	%ymm20,%ymm4,%ymm4
+	vpaddd	%ymm21,%ymm5,%ymm5
+	vpaddd	%ymm22,%ymm6,%ymm6
+	vpaddd	%ymm23,%ymm7,%ymm7
+
+	vpunpckldq	%ymm5,%ymm4,%ymm2
+	vpunpckldq	%ymm7,%ymm6,%ymm19
+	vpunpckhdq	%ymm5,%ymm4,%ymm4
+	vpunpckhdq	%ymm7,%ymm6,%ymm6
+	vpunpcklqdq	%ymm19,%ymm2,%ymm5
+	vpunpckhqdq	%ymm19,%ymm2,%ymm2
+	vpunpcklqdq	%ymm6,%ymm4,%ymm7
+	vpunpckhqdq	%ymm6,%ymm4,%ymm4
+	vshufi32x4	$0,%ymm5,%ymm1,%ymm19
+	vshufi32x4	$3,%ymm5,%ymm1,%ymm5
+	vshufi32x4	$0,%ymm2,%ymm18,%ymm1
+	vshufi32x4	$3,%ymm2,%ymm18,%ymm2
+	vshufi32x4	$0,%ymm7,%ymm3,%ymm18
+	vshufi32x4	$3,%ymm7,%ymm3,%ymm7
+	vshufi32x4	$0,%ymm4,%ymm0,%ymm3
+	vshufi32x4	$3,%ymm4,%ymm0,%ymm4
+	vpaddd	%ymm24,%ymm8,%ymm8
+	vpaddd	%ymm25,%ymm9,%ymm9
+	vpaddd	%ymm26,%ymm10,%ymm10
+	vpaddd	%ymm27,%ymm11,%ymm11
+
+	vpunpckldq	%ymm9,%ymm8,%ymm6
+	vpunpckldq	%ymm11,%ymm10,%ymm0
+	vpunpckhdq	%ymm9,%ymm8,%ymm8
+	vpunpckhdq	%ymm11,%ymm10,%ymm10
+	vpunpcklqdq	%ymm0,%ymm6,%ymm9
+	vpunpckhqdq	%ymm0,%ymm6,%ymm6
+	vpunpcklqdq	%ymm10,%ymm8,%ymm11
+	vpunpckhqdq	%ymm10,%ymm8,%ymm8
+	vpaddd	%ymm28,%ymm12,%ymm12
+	vpaddd	%ymm29,%ymm13,%ymm13
+	vpaddd	%ymm30,%ymm14,%ymm14
+	vpaddd	%ymm31,%ymm15,%ymm15
+
+	vpunpckldq	%ymm13,%ymm12,%ymm10
+	vpunpckldq	%ymm15,%ymm14,%ymm0
+	vpunpckhdq	%ymm13,%ymm12,%ymm12
+	vpunpckhdq	%ymm15,%ymm14,%ymm14
+	vpunpcklqdq	%ymm0,%ymm10,%ymm13
+	vpunpckhqdq	%ymm0,%ymm10,%ymm10
+	vpunpcklqdq	%ymm14,%ymm12,%ymm15
+	vpunpckhqdq	%ymm14,%ymm12,%ymm12
+	vperm2i128	$0x20,%ymm13,%ymm9,%ymm0
+	vperm2i128	$0x31,%ymm13,%ymm9,%ymm13
+	vperm2i128	$0x20,%ymm10,%ymm6,%ymm9
+	vperm2i128	$0x31,%ymm10,%ymm6,%ymm10
+	vperm2i128	$0x20,%ymm15,%ymm11,%ymm6
+	vperm2i128	$0x31,%ymm15,%ymm11,%ymm15
+	vperm2i128	$0x20,%ymm12,%ymm8,%ymm11
+	vperm2i128	$0x31,%ymm12,%ymm8,%ymm12
+	cmpq	$512,%rdx
+	jb	.Ltail8xvl
+
+	movl	$0x80,%eax
+	vpxord	0(%rsi),%ymm19,%ymm19
+	vpxor	32(%rsi),%ymm0,%ymm0
+	vpxor	64(%rsi),%ymm5,%ymm5
+	vpxor	96(%rsi),%ymm13,%ymm13
+	leaq	(%rsi,%rax,1),%rsi
+	vmovdqu32	%ymm19,0(%rdi)
+	vmovdqu	%ymm0,32(%rdi)
+	vmovdqu	%ymm5,64(%rdi)
+	vmovdqu	%ymm13,96(%rdi)
+	leaq	(%rdi,%rax,1),%rdi
+
+	vpxor	0(%rsi),%ymm1,%ymm1
+	vpxor	32(%rsi),%ymm9,%ymm9
+	vpxor	64(%rsi),%ymm2,%ymm2
+	vpxor	96(%rsi),%ymm10,%ymm10
+	leaq	(%rsi,%rax,1),%rsi
+	vmovdqu	%ymm1,0(%rdi)
+	vmovdqu	%ymm9,32(%rdi)
+	vmovdqu	%ymm2,64(%rdi)
+	vmovdqu	%ymm10,96(%rdi)
+	leaq	(%rdi,%rax,1),%rdi
+
+	vpxord	0(%rsi),%ymm18,%ymm18
+	vpxor	32(%rsi),%ymm6,%ymm6
+	vpxor	64(%rsi),%ymm7,%ymm7
+	vpxor	96(%rsi),%ymm15,%ymm15
+	leaq	(%rsi,%rax,1),%rsi
+	vmovdqu32	%ymm18,0(%rdi)
+	vmovdqu	%ymm6,32(%rdi)
+	vmovdqu	%ymm7,64(%rdi)
+	vmovdqu	%ymm15,96(%rdi)
+	leaq	(%rdi,%rax,1),%rdi
+
+	vpxor	0(%rsi),%ymm3,%ymm3
+	vpxor	32(%rsi),%ymm11,%ymm11
+	vpxor	64(%rsi),%ymm4,%ymm4
+	vpxor	96(%rsi),%ymm12,%ymm12
+	leaq	(%rsi,%rax,1),%rsi
+	vmovdqu	%ymm3,0(%rdi)
+	vmovdqu	%ymm11,32(%rdi)
+	vmovdqu	%ymm4,64(%rdi)
+	vmovdqu	%ymm12,96(%rdi)
+	leaq	(%rdi,%rax,1),%rdi
+
+	vpbroadcastd	0(%r10),%ymm0
+	vpbroadcastd	4(%r10),%ymm1
+
+	subq	$512,%rdx
+	jnz	.Loop_outer8xvl
+
+	jmp	.Ldone8xvl
+
+.align	32
+.Ltail8xvl:
+	vmovdqa64	%ymm19,%ymm8
+	xorq	%r10,%r10
+	subq	%rsi,%rdi
+	cmpq	$64,%rdx
+	jb	.Less_than_64_8xvl
+	vpxor	0(%rsi),%ymm8,%ymm8
+	vpxor	32(%rsi),%ymm0,%ymm0
+	vmovdqu	%ymm8,0(%rdi,%rsi,1)
+	vmovdqu	%ymm0,32(%rdi,%rsi,1)
+	je	.Ldone8xvl
+	vmovdqa	%ymm5,%ymm8
+	vmovdqa	%ymm13,%ymm0
+	leaq	64(%rsi),%rsi
+
+	cmpq	$128,%rdx
+	jb	.Less_than_64_8xvl
+	vpxor	0(%rsi),%ymm5,%ymm5
+	vpxor	32(%rsi),%ymm13,%ymm13
+	vmovdqu	%ymm5,0(%rdi,%rsi,1)
+	vmovdqu	%ymm13,32(%rdi,%rsi,1)
+	je	.Ldone8xvl
+	vmovdqa	%ymm1,%ymm8
+	vmovdqa	%ymm9,%ymm0
+	leaq	64(%rsi),%rsi
+
+	cmpq	$192,%rdx
+	jb	.Less_than_64_8xvl
+	vpxor	0(%rsi),%ymm1,%ymm1
+	vpxor	32(%rsi),%ymm9,%ymm9
+	vmovdqu	%ymm1,0(%rdi,%rsi,1)
+	vmovdqu	%ymm9,32(%rdi,%rsi,1)
+	je	.Ldone8xvl
+	vmovdqa	%ymm2,%ymm8
+	vmovdqa	%ymm10,%ymm0
+	leaq	64(%rsi),%rsi
+
+	cmpq	$256,%rdx
+	jb	.Less_than_64_8xvl
+	vpxor	0(%rsi),%ymm2,%ymm2
+	vpxor	32(%rsi),%ymm10,%ymm10
+	vmovdqu	%ymm2,0(%rdi,%rsi,1)
+	vmovdqu	%ymm10,32(%rdi,%rsi,1)
+	je	.Ldone8xvl
+	vmovdqa32	%ymm18,%ymm8
+	vmovdqa	%ymm6,%ymm0
+	leaq	64(%rsi),%rsi
+
+	cmpq	$320,%rdx
+	jb	.Less_than_64_8xvl
+	vpxord	0(%rsi),%ymm18,%ymm18
+	vpxor	32(%rsi),%ymm6,%ymm6
+	vmovdqu32	%ymm18,0(%rdi,%rsi,1)
+	vmovdqu	%ymm6,32(%rdi,%rsi,1)
+	je	.Ldone8xvl
+	vmovdqa	%ymm7,%ymm8
+	vmovdqa	%ymm15,%ymm0
+	leaq	64(%rsi),%rsi
+
+	cmpq	$384,%rdx
+	jb	.Less_than_64_8xvl
+	vpxor	0(%rsi),%ymm7,%ymm7
+	vpxor	32(%rsi),%ymm15,%ymm15
+	vmovdqu	%ymm7,0(%rdi,%rsi,1)
+	vmovdqu	%ymm15,32(%rdi,%rsi,1)
+	je	.Ldone8xvl
+	vmovdqa	%ymm3,%ymm8
+	vmovdqa	%ymm11,%ymm0
+	leaq	64(%rsi),%rsi
+
+	cmpq	$448,%rdx
+	jb	.Less_than_64_8xvl
+	vpxor	0(%rsi),%ymm3,%ymm3
+	vpxor	32(%rsi),%ymm11,%ymm11
+	vmovdqu	%ymm3,0(%rdi,%rsi,1)
+	vmovdqu	%ymm11,32(%rdi,%rsi,1)
+	je	.Ldone8xvl
+	vmovdqa	%ymm4,%ymm8
+	vmovdqa	%ymm12,%ymm0
+	leaq	64(%rsi),%rsi
+
+.Less_than_64_8xvl:
+	vmovdqa	%ymm8,0(%rsp)
+	vmovdqa	%ymm0,32(%rsp)
+	leaq	(%rdi,%rsi,1),%rdi
+	andq	$63,%rdx
+
+.Loop_tail8xvl:
+	movzbl	(%rsi,%r10,1),%eax
+	movzbl	(%rsp,%r10,1),%ecx
+	leaq	1(%r10),%r10
+	xorl	%ecx,%eax
+	movb	%al,-1(%rdi,%r10,1)
+	decq	%rdx
+	jnz	.Loop_tail8xvl
+
+	vpxor	%ymm8,%ymm8,%ymm8
+	vmovdqa	%ymm8,0(%rsp)
+	vmovdqa	%ymm8,32(%rsp)
+
+.Ldone8xvl:
+	vzeroall
+	leaq	(%r9),%rsp
+.cfi_def_cfa_register	%rsp
+.L8xvl_epilogue:
+	.byte	0xf3,0xc3
+.cfi_endproc	
+.size	ChaCha20_8xvl,.-ChaCha20_8xvl
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next v8 07/28] zinc: import Andy Polyakov's ChaCha20 ARM and ARM64 implementations
From: Jason A. Donenfeld @ 2018-10-18 14:56 UTC (permalink / raw)
  To: linux-kernel, netdev, linux-crypto, davem, gregkh
  Cc: Jason A. Donenfeld, Andy Polyakov, Russell King, linux-arm-kernel,
	Samuel Neves, Jean-Philippe Aumasson, Andy Lutomirski,
	Andrew Morton, Linus Torvalds, kernel-hardening
In-Reply-To: <20181018145712.7538-1-Jason@zx2c4.com>

These NEON and non-NEON implementations come from Andy Polyakov's
implementation, and are included here in raw form without modification,
so that subsequent commits that fix these up for the kernel can see how
it has changed.

While this is CRYPTOGAMS code, the originating code for this happens to
be the same as OpenSSL's commit 87cc649f30aaf69b351701875b9dac07c29ce8a2

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Based-on-code-from: Andy Polyakov <appro@openssl.org>
Cc: Andy Polyakov <appro@openssl.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
---
 lib/zinc/chacha20/chacha20-arm-cryptogams.S   | 1440 ++++++++++++
 lib/zinc/chacha20/chacha20-arm64-cryptogams.S | 1973 +++++++++++++++++
 2 files changed, 3413 insertions(+)
 create mode 100644 lib/zinc/chacha20/chacha20-arm-cryptogams.S
 create mode 100644 lib/zinc/chacha20/chacha20-arm64-cryptogams.S

diff --git a/lib/zinc/chacha20/chacha20-arm-cryptogams.S b/lib/zinc/chacha20/chacha20-arm-cryptogams.S
new file mode 100644
index 000000000000..05a3a9e6e93f
--- /dev/null
+++ b/lib/zinc/chacha20/chacha20-arm-cryptogams.S
@@ -0,0 +1,1440 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+/*
+ * Copyright (C) 2006-2017 CRYPTOGAMS by <appro@openssl.org>. All Rights Reserved.
+ */
+
+#include "arm_arch.h"
+
+.text
+#if defined(__thumb2__) || defined(__clang__)
+.syntax	unified
+#endif
+#if defined(__thumb2__)
+.thumb
+#else
+.code	32
+#endif
+
+#if defined(__thumb2__) || defined(__clang__)
+#define ldrhsb	ldrbhs
+#endif
+
+.align	5
+.Lsigma:
+.long	0x61707865,0x3320646e,0x79622d32,0x6b206574	@ endian-neutral
+.Lone:
+.long	1,0,0,0
+.Lrot8:
+.long	0x02010003,0x06050407
+#if __ARM_MAX_ARCH__>=7
+.LOPENSSL_armcap:
+.word   OPENSSL_armcap_P-.LChaCha20_ctr32
+#else
+.word	-1
+#endif
+
+.globl	ChaCha20_ctr32
+.type	ChaCha20_ctr32,%function
+.align	5
+ChaCha20_ctr32:
+.LChaCha20_ctr32:
+	ldr	r12,[sp,#0]		@ pull pointer to counter and nonce
+	stmdb	sp!,{r0-r2,r4-r11,lr}
+#if __ARM_ARCH__<7 && !defined(__thumb2__)
+	sub	r14,pc,#16		@ ChaCha20_ctr32
+#else
+	adr	r14,.LChaCha20_ctr32
+#endif
+	cmp	r2,#0			@ len==0?
+#ifdef	__thumb2__
+	itt	eq
+#endif
+	addeq	sp,sp,#4*3
+	beq	.Lno_data
+#if __ARM_MAX_ARCH__>=7
+	cmp	r2,#192			@ test len
+	bls	.Lshort
+	ldr	r4,[r14,#-24]
+	ldr	r4,[r14,r4]
+# ifdef	__APPLE__
+	ldr	r4,[r4]
+# endif
+	tst	r4,#ARMV7_NEON
+	bne	.LChaCha20_neon
+.Lshort:
+#endif
+	ldmia	r12,{r4-r7}		@ load counter and nonce
+	sub	sp,sp,#4*(16)		@ off-load area
+	sub	r14,r14,#64		@ .Lsigma
+	stmdb	sp!,{r4-r7}		@ copy counter and nonce
+	ldmia	r3,{r4-r11}		@ load key
+	ldmia	r14,{r0-r3}		@ load sigma
+	stmdb	sp!,{r4-r11}		@ copy key
+	stmdb	sp!,{r0-r3}		@ copy sigma
+	str	r10,[sp,#4*(16+10)]	@ off-load "rx"
+	str	r11,[sp,#4*(16+11)]	@ off-load "rx"
+	b	.Loop_outer_enter
+
+.align	4
+.Loop_outer:
+	ldmia	sp,{r0-r9}		@ load key material
+	str	r11,[sp,#4*(32+2)]	@ save len
+	str	r12,  [sp,#4*(32+1)]	@ save inp
+	str	r14,  [sp,#4*(32+0)]	@ save out
+.Loop_outer_enter:
+	ldr	r11, [sp,#4*(15)]
+	 mov	r4,r4,ror#19	@ twist b[0..3]
+	ldr	r12,[sp,#4*(12)]	@ modulo-scheduled load
+	 mov	r5,r5,ror#19
+	ldr	r10, [sp,#4*(13)]
+	 mov	r6,r6,ror#19
+	ldr	r14,[sp,#4*(14)]
+	 mov	r7,r7,ror#19
+	mov	r11,r11,ror#8	@ twist d[0..3]
+	mov	r12,r12,ror#8
+	mov	r10,r10,ror#8
+	mov	r14,r14,ror#8
+	str	r11, [sp,#4*(16+15)]
+	mov	r11,#10
+	b	.Loop
+
+.align	4
+.Loop:
+	subs	r11,r11,#1
+	add	r0,r0,r4,ror#13
+	add	r1,r1,r5,ror#13
+	eor	r12,r0,r12,ror#24
+	eor	r10,r1,r10,ror#24
+	add	r8,r8,r12,ror#16
+	add	r9,r9,r10,ror#16
+	eor	r4,r8,r4,ror#13
+	eor	r5,r9,r5,ror#13
+	add	r0,r0,r4,ror#20
+	add	r1,r1,r5,ror#20
+	eor	r12,r0,r12,ror#16
+	eor	r10,r1,r10,ror#16
+	add	r8,r8,r12,ror#24
+	str	r10,[sp,#4*(16+13)]
+	add	r9,r9,r10,ror#24
+	ldr	r10,[sp,#4*(16+15)]
+	str	r8,[sp,#4*(16+8)]
+	eor	r4,r4,r8,ror#12
+	str	r9,[sp,#4*(16+9)]
+	eor	r5,r5,r9,ror#12
+	ldr	r8,[sp,#4*(16+10)]
+	add	r2,r2,r6,ror#13
+	ldr	r9,[sp,#4*(16+11)]
+	add	r3,r3,r7,ror#13
+	eor	r14,r2,r14,ror#24
+	eor	r10,r3,r10,ror#24
+	add	r8,r8,r14,ror#16
+	add	r9,r9,r10,ror#16
+	eor	r6,r8,r6,ror#13
+	eor	r7,r9,r7,ror#13
+	add	r2,r2,r6,ror#20
+	add	r3,r3,r7,ror#20
+	eor	r14,r2,r14,ror#16
+	eor	r10,r3,r10,ror#16
+	add	r8,r8,r14,ror#24
+	add	r9,r9,r10,ror#24
+	eor	r6,r6,r8,ror#12
+	eor	r7,r7,r9,ror#12
+	add	r0,r0,r5,ror#13
+	add	r1,r1,r6,ror#13
+	eor	r10,r0,r10,ror#24
+	eor	r12,r1,r12,ror#24
+	add	r8,r8,r10,ror#16
+	add	r9,r9,r12,ror#16
+	eor	r5,r8,r5,ror#13
+	eor	r6,r9,r6,ror#13
+	add	r0,r0,r5,ror#20
+	add	r1,r1,r6,ror#20
+	eor	r10,r0,r10,ror#16
+	eor	r12,r1,r12,ror#16
+	str	r10,[sp,#4*(16+15)]
+	add	r8,r8,r10,ror#24
+	ldr	r10,[sp,#4*(16+13)]
+	add	r9,r9,r12,ror#24
+	str	r8,[sp,#4*(16+10)]
+	eor	r5,r5,r8,ror#12
+	str	r9,[sp,#4*(16+11)]
+	eor	r6,r6,r9,ror#12
+	ldr	r8,[sp,#4*(16+8)]
+	add	r2,r2,r7,ror#13
+	ldr	r9,[sp,#4*(16+9)]
+	add	r3,r3,r4,ror#13
+	eor	r10,r2,r10,ror#24
+	eor	r14,r3,r14,ror#24
+	add	r8,r8,r10,ror#16
+	add	r9,r9,r14,ror#16
+	eor	r7,r8,r7,ror#13
+	eor	r4,r9,r4,ror#13
+	add	r2,r2,r7,ror#20
+	add	r3,r3,r4,ror#20
+	eor	r10,r2,r10,ror#16
+	eor	r14,r3,r14,ror#16
+	add	r8,r8,r10,ror#24
+	add	r9,r9,r14,ror#24
+	eor	r7,r7,r8,ror#12
+	eor	r4,r4,r9,ror#12
+	bne	.Loop
+
+	ldr	r11,[sp,#4*(32+2)]	@ load len
+
+	str	r8, [sp,#4*(16+8)]	@ modulo-scheduled store
+	str	r9, [sp,#4*(16+9)]
+	str	r12,[sp,#4*(16+12)]
+	str	r10, [sp,#4*(16+13)]
+	str	r14,[sp,#4*(16+14)]
+
+	@ at this point we have first half of 512-bit result in
+	@ rx and second half at sp+4*(16+8)
+
+	cmp	r11,#64		@ done yet?
+#ifdef	__thumb2__
+	itete	lo
+#endif
+	addlo	r12,sp,#4*(0)		@ shortcut or ...
+	ldrhs	r12,[sp,#4*(32+1)]	@ ... load inp
+	addlo	r14,sp,#4*(0)		@ shortcut or ...
+	ldrhs	r14,[sp,#4*(32+0)]	@ ... load out
+
+	ldr	r8,[sp,#4*(0)]	@ load key material
+	ldr	r9,[sp,#4*(1)]
+
+#if __ARM_ARCH__>=6 || !defined(__ARMEB__)
+# if __ARM_ARCH__<7
+	orr	r10,r12,r14
+	tst	r10,#3		@ are input and output aligned?
+	ldr	r10,[sp,#4*(2)]
+	bne	.Lunaligned
+	cmp	r11,#64		@ restore flags
+# else
+	ldr	r10,[sp,#4*(2)]
+# endif
+	ldr	r11,[sp,#4*(3)]
+
+	add	r0,r0,r8	@ accumulate key material
+	add	r1,r1,r9
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhs	r8,[r12],#16		@ load input
+	ldrhs	r9,[r12,#-12]
+
+	add	r2,r2,r10
+	add	r3,r3,r11
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhs	r10,[r12,#-8]
+	ldrhs	r11,[r12,#-4]
+# if __ARM_ARCH__>=6 && defined(__ARMEB__)
+	rev	r0,r0
+	rev	r1,r1
+	rev	r2,r2
+	rev	r3,r3
+# endif
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	eorhs	r0,r0,r8	@ xor with input
+	eorhs	r1,r1,r9
+	 add	r8,sp,#4*(4)
+	str	r0,[r14],#16		@ store output
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	eorhs	r2,r2,r10
+	eorhs	r3,r3,r11
+	 ldmia	r8,{r8-r11}	@ load key material
+	str	r1,[r14,#-12]
+	str	r2,[r14,#-8]
+	str	r3,[r14,#-4]
+
+	add	r4,r8,r4,ror#13 @ accumulate key material
+	add	r5,r9,r5,ror#13
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhs	r8,[r12],#16		@ load input
+	ldrhs	r9,[r12,#-12]
+	add	r6,r10,r6,ror#13
+	add	r7,r11,r7,ror#13
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhs	r10,[r12,#-8]
+	ldrhs	r11,[r12,#-4]
+# if __ARM_ARCH__>=6 && defined(__ARMEB__)
+	rev	r4,r4
+	rev	r5,r5
+	rev	r6,r6
+	rev	r7,r7
+# endif
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	eorhs	r4,r4,r8
+	eorhs	r5,r5,r9
+	 add	r8,sp,#4*(8)
+	str	r4,[r14],#16		@ store output
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	eorhs	r6,r6,r10
+	eorhs	r7,r7,r11
+	str	r5,[r14,#-12]
+	 ldmia	r8,{r8-r11}	@ load key material
+	str	r6,[r14,#-8]
+	 add	r0,sp,#4*(16+8)
+	str	r7,[r14,#-4]
+
+	ldmia	r0,{r0-r7}	@ load second half
+
+	add	r0,r0,r8	@ accumulate key material
+	add	r1,r1,r9
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhs	r8,[r12],#16		@ load input
+	ldrhs	r9,[r12,#-12]
+# ifdef	__thumb2__
+	itt	hi
+# endif
+	 strhi	r10,[sp,#4*(16+10)]	@ copy "rx" while at it
+	 strhi	r11,[sp,#4*(16+11)]	@ copy "rx" while at it
+	add	r2,r2,r10
+	add	r3,r3,r11
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhs	r10,[r12,#-8]
+	ldrhs	r11,[r12,#-4]
+# if __ARM_ARCH__>=6 && defined(__ARMEB__)
+	rev	r0,r0
+	rev	r1,r1
+	rev	r2,r2
+	rev	r3,r3
+# endif
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	eorhs	r0,r0,r8
+	eorhs	r1,r1,r9
+	 add	r8,sp,#4*(12)
+	str	r0,[r14],#16		@ store output
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	eorhs	r2,r2,r10
+	eorhs	r3,r3,r11
+	str	r1,[r14,#-12]
+	 ldmia	r8,{r8-r11}	@ load key material
+	str	r2,[r14,#-8]
+	str	r3,[r14,#-4]
+
+	add	r4,r8,r4,ror#24 @ accumulate key material
+	add	r5,r9,r5,ror#24
+# ifdef	__thumb2__
+	itt	hi
+# endif
+	 addhi	r8,r8,#1		@ next counter value
+	 strhi	r8,[sp,#4*(12)]	@ save next counter value
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhs	r8,[r12],#16		@ load input
+	ldrhs	r9,[r12,#-12]
+	add	r6,r10,r6,ror#24
+	add	r7,r11,r7,ror#24
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhs	r10,[r12,#-8]
+	ldrhs	r11,[r12,#-4]
+# if __ARM_ARCH__>=6 && defined(__ARMEB__)
+	rev	r4,r4
+	rev	r5,r5
+	rev	r6,r6
+	rev	r7,r7
+# endif
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	eorhs	r4,r4,r8
+	eorhs	r5,r5,r9
+# ifdef	__thumb2__
+	 it	ne
+# endif
+	 ldrne	r8,[sp,#4*(32+2)]	@ re-load len
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	eorhs	r6,r6,r10
+	eorhs	r7,r7,r11
+	str	r4,[r14],#16		@ store output
+	str	r5,[r14,#-12]
+# ifdef	__thumb2__
+	it	hs
+# endif
+	 subhs	r11,r8,#64		@ len-=64
+	str	r6,[r14,#-8]
+	str	r7,[r14,#-4]
+	bhi	.Loop_outer
+
+	beq	.Ldone
+# if __ARM_ARCH__<7
+	b	.Ltail
+
+.align	4
+.Lunaligned:				@ unaligned endian-neutral path
+	cmp	r11,#64		@ restore flags
+# endif
+#endif
+#if __ARM_ARCH__<7
+	ldr	r11,[sp,#4*(3)]
+	add	r0,r8,r0	@ accumulate key material
+	add	r1,r9,r1
+	add	r2,r10,r2
+# ifdef	__thumb2__
+	itete	lo
+# endif
+	eorlo	r8,r8,r8		@ zero or ...
+	ldrhsb	r8,[r12],#16			@ ... load input
+	eorlo	r9,r9,r9
+	ldrhsb	r9,[r12,#-12]
+
+	add	r3,r11,r3
+# ifdef	__thumb2__
+	itete	lo
+# endif
+	eorlo	r10,r10,r10
+	ldrhsb	r10,[r12,#-8]
+	eorlo	r11,r11,r11
+	ldrhsb	r11,[r12,#-4]
+
+	eor	r0,r8,r0		@ xor with input (or zero)
+	eor	r1,r9,r1
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r8,[r12,#-15]		@ load more input
+	ldrhsb	r9,[r12,#-11]
+	eor	r2,r10,r2
+	 strb	r0,[r14],#16		@ store output
+	eor	r3,r11,r3
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r10,[r12,#-7]
+	ldrhsb	r11,[r12,#-3]
+	 strb	r1,[r14,#-12]
+	eor	r0,r8,r0,lsr#8
+	 strb	r2,[r14,#-8]
+	eor	r1,r9,r1,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r8,[r12,#-14]		@ load more input
+	ldrhsb	r9,[r12,#-10]
+	 strb	r3,[r14,#-4]
+	eor	r2,r10,r2,lsr#8
+	 strb	r0,[r14,#-15]
+	eor	r3,r11,r3,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r10,[r12,#-6]
+	ldrhsb	r11,[r12,#-2]
+	 strb	r1,[r14,#-11]
+	eor	r0,r8,r0,lsr#8
+	 strb	r2,[r14,#-7]
+	eor	r1,r9,r1,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r8,[r12,#-13]		@ load more input
+	ldrhsb	r9,[r12,#-9]
+	 strb	r3,[r14,#-3]
+	eor	r2,r10,r2,lsr#8
+	 strb	r0,[r14,#-14]
+	eor	r3,r11,r3,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r10,[r12,#-5]
+	ldrhsb	r11,[r12,#-1]
+	 strb	r1,[r14,#-10]
+	 strb	r2,[r14,#-6]
+	eor	r0,r8,r0,lsr#8
+	 strb	r3,[r14,#-2]
+	eor	r1,r9,r1,lsr#8
+	 strb	r0,[r14,#-13]
+	eor	r2,r10,r2,lsr#8
+	 strb	r1,[r14,#-9]
+	eor	r3,r11,r3,lsr#8
+	 strb	r2,[r14,#-5]
+	 strb	r3,[r14,#-1]
+	add	r8,sp,#4*(4+0)
+	ldmia	r8,{r8-r11}		@ load key material
+	add	r0,sp,#4*(16+8)
+	add	r4,r8,r4,ror#13	@ accumulate key material
+	add	r5,r9,r5,ror#13
+	add	r6,r10,r6,ror#13
+# ifdef	__thumb2__
+	itete	lo
+# endif
+	eorlo	r8,r8,r8		@ zero or ...
+	ldrhsb	r8,[r12],#16			@ ... load input
+	eorlo	r9,r9,r9
+	ldrhsb	r9,[r12,#-12]
+
+	add	r7,r11,r7,ror#13
+# ifdef	__thumb2__
+	itete	lo
+# endif
+	eorlo	r10,r10,r10
+	ldrhsb	r10,[r12,#-8]
+	eorlo	r11,r11,r11
+	ldrhsb	r11,[r12,#-4]
+
+	eor	r4,r8,r4		@ xor with input (or zero)
+	eor	r5,r9,r5
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r8,[r12,#-15]		@ load more input
+	ldrhsb	r9,[r12,#-11]
+	eor	r6,r10,r6
+	 strb	r4,[r14],#16		@ store output
+	eor	r7,r11,r7
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r10,[r12,#-7]
+	ldrhsb	r11,[r12,#-3]
+	 strb	r5,[r14,#-12]
+	eor	r4,r8,r4,lsr#8
+	 strb	r6,[r14,#-8]
+	eor	r5,r9,r5,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r8,[r12,#-14]		@ load more input
+	ldrhsb	r9,[r12,#-10]
+	 strb	r7,[r14,#-4]
+	eor	r6,r10,r6,lsr#8
+	 strb	r4,[r14,#-15]
+	eor	r7,r11,r7,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r10,[r12,#-6]
+	ldrhsb	r11,[r12,#-2]
+	 strb	r5,[r14,#-11]
+	eor	r4,r8,r4,lsr#8
+	 strb	r6,[r14,#-7]
+	eor	r5,r9,r5,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r8,[r12,#-13]		@ load more input
+	ldrhsb	r9,[r12,#-9]
+	 strb	r7,[r14,#-3]
+	eor	r6,r10,r6,lsr#8
+	 strb	r4,[r14,#-14]
+	eor	r7,r11,r7,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r10,[r12,#-5]
+	ldrhsb	r11,[r12,#-1]
+	 strb	r5,[r14,#-10]
+	 strb	r6,[r14,#-6]
+	eor	r4,r8,r4,lsr#8
+	 strb	r7,[r14,#-2]
+	eor	r5,r9,r5,lsr#8
+	 strb	r4,[r14,#-13]
+	eor	r6,r10,r6,lsr#8
+	 strb	r5,[r14,#-9]
+	eor	r7,r11,r7,lsr#8
+	 strb	r6,[r14,#-5]
+	 strb	r7,[r14,#-1]
+	add	r8,sp,#4*(4+4)
+	ldmia	r8,{r8-r11}		@ load key material
+	ldmia	r0,{r0-r7}		@ load second half
+# ifdef	__thumb2__
+	itt	hi
+# endif
+	strhi	r10,[sp,#4*(16+10)]		@ copy "rx"
+	strhi	r11,[sp,#4*(16+11)]		@ copy "rx"
+	add	r0,r8,r0	@ accumulate key material
+	add	r1,r9,r1
+	add	r2,r10,r2
+# ifdef	__thumb2__
+	itete	lo
+# endif
+	eorlo	r8,r8,r8		@ zero or ...
+	ldrhsb	r8,[r12],#16			@ ... load input
+	eorlo	r9,r9,r9
+	ldrhsb	r9,[r12,#-12]
+
+	add	r3,r11,r3
+# ifdef	__thumb2__
+	itete	lo
+# endif
+	eorlo	r10,r10,r10
+	ldrhsb	r10,[r12,#-8]
+	eorlo	r11,r11,r11
+	ldrhsb	r11,[r12,#-4]
+
+	eor	r0,r8,r0		@ xor with input (or zero)
+	eor	r1,r9,r1
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r8,[r12,#-15]		@ load more input
+	ldrhsb	r9,[r12,#-11]
+	eor	r2,r10,r2
+	 strb	r0,[r14],#16		@ store output
+	eor	r3,r11,r3
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r10,[r12,#-7]
+	ldrhsb	r11,[r12,#-3]
+	 strb	r1,[r14,#-12]
+	eor	r0,r8,r0,lsr#8
+	 strb	r2,[r14,#-8]
+	eor	r1,r9,r1,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r8,[r12,#-14]		@ load more input
+	ldrhsb	r9,[r12,#-10]
+	 strb	r3,[r14,#-4]
+	eor	r2,r10,r2,lsr#8
+	 strb	r0,[r14,#-15]
+	eor	r3,r11,r3,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r10,[r12,#-6]
+	ldrhsb	r11,[r12,#-2]
+	 strb	r1,[r14,#-11]
+	eor	r0,r8,r0,lsr#8
+	 strb	r2,[r14,#-7]
+	eor	r1,r9,r1,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r8,[r12,#-13]		@ load more input
+	ldrhsb	r9,[r12,#-9]
+	 strb	r3,[r14,#-3]
+	eor	r2,r10,r2,lsr#8
+	 strb	r0,[r14,#-14]
+	eor	r3,r11,r3,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r10,[r12,#-5]
+	ldrhsb	r11,[r12,#-1]
+	 strb	r1,[r14,#-10]
+	 strb	r2,[r14,#-6]
+	eor	r0,r8,r0,lsr#8
+	 strb	r3,[r14,#-2]
+	eor	r1,r9,r1,lsr#8
+	 strb	r0,[r14,#-13]
+	eor	r2,r10,r2,lsr#8
+	 strb	r1,[r14,#-9]
+	eor	r3,r11,r3,lsr#8
+	 strb	r2,[r14,#-5]
+	 strb	r3,[r14,#-1]
+	add	r8,sp,#4*(4+8)
+	ldmia	r8,{r8-r11}		@ load key material
+	add	r4,r8,r4,ror#24	@ accumulate key material
+# ifdef	__thumb2__
+	itt	hi
+# endif
+	addhi	r8,r8,#1			@ next counter value
+	strhi	r8,[sp,#4*(12)]		@ save next counter value
+	add	r5,r9,r5,ror#24
+	add	r6,r10,r6,ror#24
+# ifdef	__thumb2__
+	itete	lo
+# endif
+	eorlo	r8,r8,r8		@ zero or ...
+	ldrhsb	r8,[r12],#16			@ ... load input
+	eorlo	r9,r9,r9
+	ldrhsb	r9,[r12,#-12]
+
+	add	r7,r11,r7,ror#24
+# ifdef	__thumb2__
+	itete	lo
+# endif
+	eorlo	r10,r10,r10
+	ldrhsb	r10,[r12,#-8]
+	eorlo	r11,r11,r11
+	ldrhsb	r11,[r12,#-4]
+
+	eor	r4,r8,r4		@ xor with input (or zero)
+	eor	r5,r9,r5
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r8,[r12,#-15]		@ load more input
+	ldrhsb	r9,[r12,#-11]
+	eor	r6,r10,r6
+	 strb	r4,[r14],#16		@ store output
+	eor	r7,r11,r7
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r10,[r12,#-7]
+	ldrhsb	r11,[r12,#-3]
+	 strb	r5,[r14,#-12]
+	eor	r4,r8,r4,lsr#8
+	 strb	r6,[r14,#-8]
+	eor	r5,r9,r5,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r8,[r12,#-14]		@ load more input
+	ldrhsb	r9,[r12,#-10]
+	 strb	r7,[r14,#-4]
+	eor	r6,r10,r6,lsr#8
+	 strb	r4,[r14,#-15]
+	eor	r7,r11,r7,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r10,[r12,#-6]
+	ldrhsb	r11,[r12,#-2]
+	 strb	r5,[r14,#-11]
+	eor	r4,r8,r4,lsr#8
+	 strb	r6,[r14,#-7]
+	eor	r5,r9,r5,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r8,[r12,#-13]		@ load more input
+	ldrhsb	r9,[r12,#-9]
+	 strb	r7,[r14,#-3]
+	eor	r6,r10,r6,lsr#8
+	 strb	r4,[r14,#-14]
+	eor	r7,r11,r7,lsr#8
+# ifdef	__thumb2__
+	itt	hs
+# endif
+	ldrhsb	r10,[r12,#-5]
+	ldrhsb	r11,[r12,#-1]
+	 strb	r5,[r14,#-10]
+	 strb	r6,[r14,#-6]
+	eor	r4,r8,r4,lsr#8
+	 strb	r7,[r14,#-2]
+	eor	r5,r9,r5,lsr#8
+	 strb	r4,[r14,#-13]
+	eor	r6,r10,r6,lsr#8
+	 strb	r5,[r14,#-9]
+	eor	r7,r11,r7,lsr#8
+	 strb	r6,[r14,#-5]
+	 strb	r7,[r14,#-1]
+# ifdef	__thumb2__
+	it	ne
+# endif
+	ldrne	r8,[sp,#4*(32+2)]		@ re-load len
+# ifdef	__thumb2__
+	it	hs
+# endif
+	subhs	r11,r8,#64			@ len-=64
+	bhi	.Loop_outer
+
+	beq	.Ldone
+#endif
+
+.Ltail:
+	ldr	r12,[sp,#4*(32+1)]	@ load inp
+	add	r9,sp,#4*(0)
+	ldr	r14,[sp,#4*(32+0)]	@ load out
+
+.Loop_tail:
+	ldrb	r10,[r9],#1	@ read buffer on stack
+	ldrb	r11,[r12],#1		@ read input
+	subs	r8,r8,#1
+	eor	r11,r11,r10
+	strb	r11,[r14],#1		@ store output
+	bne	.Loop_tail
+
+.Ldone:
+	add	sp,sp,#4*(32+3)
+.Lno_data:
+	ldmia	sp!,{r4-r11,pc}
+.size	ChaCha20_ctr32,.-ChaCha20_ctr32
+#if __ARM_MAX_ARCH__>=7
+.arch	armv7-a
+.fpu	neon
+
+.type	ChaCha20_neon,%function
+.align	5
+ChaCha20_neon:
+	ldr		r12,[sp,#0]		@ pull pointer to counter and nonce
+	stmdb		sp!,{r0-r2,r4-r11,lr}
+.LChaCha20_neon:
+	adr		r14,.Lsigma
+	vstmdb		sp!,{d8-d15}		@ ABI spec says so
+	stmdb		sp!,{r0-r3}
+
+	vld1.32		{q1-q2},[r3]		@ load key
+	ldmia		r3,{r4-r11}		@ load key
+
+	sub		sp,sp,#4*(16+16)
+	vld1.32		{q3},[r12]		@ load counter and nonce
+	add		r12,sp,#4*8
+	ldmia		r14,{r0-r3}		@ load sigma
+	vld1.32		{q0},[r14]!		@ load sigma
+	vld1.32		{q12},[r14]!		@ one
+	@ vld1.32	{d30},[r14]		@ rot8
+	vst1.32		{q2-q3},[r12]		@ copy 1/2key|counter|nonce
+	vst1.32		{q0-q1},[sp]		@ copy sigma|1/2key
+
+	str		r10,[sp,#4*(16+10)]	@ off-load "rx"
+	str		r11,[sp,#4*(16+11)]	@ off-load "rx"
+	vshl.i32	d26,d24,#1	@ two
+	vstr		d24,[sp,#4*(16+0)]
+	vshl.i32	d28,d24,#2	@ four
+	vstr		d26,[sp,#4*(16+2)]
+	vmov		q4,q0
+	vstr		d28,[sp,#4*(16+4)]
+	vmov		q8,q0
+	@ vstr		d30,[sp,#4*(16+6)]
+	vmov		q5,q1
+	vmov		q9,q1
+	b		.Loop_neon_enter
+
+.align	4
+.Loop_neon_outer:
+	ldmia		sp,{r0-r9}		@ load key material
+	cmp		r11,#64*2		@ if len<=64*2
+	bls		.Lbreak_neon		@ switch to integer-only
+	@ vldr		d30,[sp,#4*(16+6)]	@ rot8
+	vmov		q4,q0
+	str		r11,[sp,#4*(32+2)]	@ save len
+	vmov		q8,q0
+	str		r12,  [sp,#4*(32+1)]	@ save inp
+	vmov		q5,q1
+	str		r14,  [sp,#4*(32+0)]	@ save out
+	vmov		q9,q1
+.Loop_neon_enter:
+	ldr		r11, [sp,#4*(15)]
+	 mov		r4,r4,ror#19	@ twist b[0..3]
+	vadd.i32	q7,q3,q12		@ counter+1
+	ldr		r12,[sp,#4*(12)]	@ modulo-scheduled load
+	 mov		r5,r5,ror#19
+	vmov		q6,q2
+	ldr		r10, [sp,#4*(13)]
+	 mov		r6,r6,ror#19
+	vmov		q10,q2
+	ldr		r14,[sp,#4*(14)]
+	 mov		r7,r7,ror#19
+	vadd.i32	q11,q7,q12		@ counter+2
+	add		r12,r12,#3	@ counter+3
+	mov		r11,r11,ror#8	@ twist d[0..3]
+	mov		r12,r12,ror#8
+	mov		r10,r10,ror#8
+	mov		r14,r14,ror#8
+	str		r11, [sp,#4*(16+15)]
+	mov		r11,#10
+	b		.Loop_neon
+
+.align	4
+.Loop_neon:
+	subs		r11,r11,#1
+	vadd.i32	q0,q0,q1
+	add	r0,r0,r4,ror#13
+	vadd.i32	q4,q4,q5
+	add	r1,r1,r5,ror#13
+	vadd.i32	q8,q8,q9
+	eor	r12,r0,r12,ror#24
+	veor	q3,q3,q0
+	eor	r10,r1,r10,ror#24
+	veor	q7,q7,q4
+	add	r8,r8,r12,ror#16
+	veor	q11,q11,q8
+	add	r9,r9,r10,ror#16
+	vrev32.16	q3,q3
+	eor	r4,r8,r4,ror#13
+	vrev32.16	q7,q7
+	eor	r5,r9,r5,ror#13
+	vrev32.16	q11,q11
+	add	r0,r0,r4,ror#20
+	vadd.i32	q2,q2,q3
+	add	r1,r1,r5,ror#20
+	vadd.i32	q6,q6,q7
+	eor	r12,r0,r12,ror#16
+	vadd.i32	q10,q10,q11
+	eor	r10,r1,r10,ror#16
+	veor	q12,q1,q2
+	add	r8,r8,r12,ror#24
+	veor	q13,q5,q6
+	str	r10,[sp,#4*(16+13)]
+	veor	q14,q9,q10
+	add	r9,r9,r10,ror#24
+	vshr.u32	q1,q12,#20
+	ldr	r10,[sp,#4*(16+15)]
+	vshr.u32	q5,q13,#20
+	str	r8,[sp,#4*(16+8)]
+	vshr.u32	q9,q14,#20
+	eor	r4,r4,r8,ror#12
+	vsli.32	q1,q12,#12
+	str	r9,[sp,#4*(16+9)]
+	vsli.32	q5,q13,#12
+	eor	r5,r5,r9,ror#12
+	vsli.32	q9,q14,#12
+	ldr	r8,[sp,#4*(16+10)]
+	vadd.i32	q0,q0,q1
+	add	r2,r2,r6,ror#13
+	vadd.i32	q4,q4,q5
+	ldr	r9,[sp,#4*(16+11)]
+	vadd.i32	q8,q8,q9
+	add	r3,r3,r7,ror#13
+	veor	q12,q3,q0
+	eor	r14,r2,r14,ror#24
+	veor	q13,q7,q4
+	eor	r10,r3,r10,ror#24
+	veor	q14,q11,q8
+	add	r8,r8,r14,ror#16
+	vshr.u32	q3,q12,#24
+	add	r9,r9,r10,ror#16
+	vshr.u32	q7,q13,#24
+	eor	r6,r8,r6,ror#13
+	vshr.u32	q11,q14,#24
+	eor	r7,r9,r7,ror#13
+	vsli.32	q3,q12,#8
+	add	r2,r2,r6,ror#20
+	vsli.32	q7,q13,#8
+	add	r3,r3,r7,ror#20
+	vsli.32	q11,q14,#8
+	eor	r14,r2,r14,ror#16
+	vadd.i32	q2,q2,q3
+	eor	r10,r3,r10,ror#16
+	vadd.i32	q6,q6,q7
+	add	r8,r8,r14,ror#24
+	vadd.i32	q10,q10,q11
+	add	r9,r9,r10,ror#24
+	veor	q12,q1,q2
+	eor	r6,r6,r8,ror#12
+	veor	q13,q5,q6
+	eor	r7,r7,r9,ror#12
+	veor	q14,q9,q10
+	vshr.u32	q1,q12,#25
+	vshr.u32	q5,q13,#25
+	vshr.u32	q9,q14,#25
+	vsli.32	q1,q12,#7
+	vsli.32	q5,q13,#7
+	vsli.32	q9,q14,#7
+	vext.8	q2,q2,q2,#8
+	vext.8	q6,q6,q6,#8
+	vext.8	q10,q10,q10,#8
+	vext.8	q1,q1,q1,#4
+	vext.8	q5,q5,q5,#4
+	vext.8	q9,q9,q9,#4
+	vext.8	q3,q3,q3,#12
+	vext.8	q7,q7,q7,#12
+	vext.8	q11,q11,q11,#12
+	vadd.i32	q0,q0,q1
+	add	r0,r0,r5,ror#13
+	vadd.i32	q4,q4,q5
+	add	r1,r1,r6,ror#13
+	vadd.i32	q8,q8,q9
+	eor	r10,r0,r10,ror#24
+	veor	q3,q3,q0
+	eor	r12,r1,r12,ror#24
+	veor	q7,q7,q4
+	add	r8,r8,r10,ror#16
+	veor	q11,q11,q8
+	add	r9,r9,r12,ror#16
+	vrev32.16	q3,q3
+	eor	r5,r8,r5,ror#13
+	vrev32.16	q7,q7
+	eor	r6,r9,r6,ror#13
+	vrev32.16	q11,q11
+	add	r0,r0,r5,ror#20
+	vadd.i32	q2,q2,q3
+	add	r1,r1,r6,ror#20
+	vadd.i32	q6,q6,q7
+	eor	r10,r0,r10,ror#16
+	vadd.i32	q10,q10,q11
+	eor	r12,r1,r12,ror#16
+	veor	q12,q1,q2
+	str	r10,[sp,#4*(16+15)]
+	veor	q13,q5,q6
+	add	r8,r8,r10,ror#24
+	veor	q14,q9,q10
+	ldr	r10,[sp,#4*(16+13)]
+	vshr.u32	q1,q12,#20
+	add	r9,r9,r12,ror#24
+	vshr.u32	q5,q13,#20
+	str	r8,[sp,#4*(16+10)]
+	vshr.u32	q9,q14,#20
+	eor	r5,r5,r8,ror#12
+	vsli.32	q1,q12,#12
+	str	r9,[sp,#4*(16+11)]
+	vsli.32	q5,q13,#12
+	eor	r6,r6,r9,ror#12
+	vsli.32	q9,q14,#12
+	ldr	r8,[sp,#4*(16+8)]
+	vadd.i32	q0,q0,q1
+	add	r2,r2,r7,ror#13
+	vadd.i32	q4,q4,q5
+	ldr	r9,[sp,#4*(16+9)]
+	vadd.i32	q8,q8,q9
+	add	r3,r3,r4,ror#13
+	veor	q12,q3,q0
+	eor	r10,r2,r10,ror#24
+	veor	q13,q7,q4
+	eor	r14,r3,r14,ror#24
+	veor	q14,q11,q8
+	add	r8,r8,r10,ror#16
+	vshr.u32	q3,q12,#24
+	add	r9,r9,r14,ror#16
+	vshr.u32	q7,q13,#24
+	eor	r7,r8,r7,ror#13
+	vshr.u32	q11,q14,#24
+	eor	r4,r9,r4,ror#13
+	vsli.32	q3,q12,#8
+	add	r2,r2,r7,ror#20
+	vsli.32	q7,q13,#8
+	add	r3,r3,r4,ror#20
+	vsli.32	q11,q14,#8
+	eor	r10,r2,r10,ror#16
+	vadd.i32	q2,q2,q3
+	eor	r14,r3,r14,ror#16
+	vadd.i32	q6,q6,q7
+	add	r8,r8,r10,ror#24
+	vadd.i32	q10,q10,q11
+	add	r9,r9,r14,ror#24
+	veor	q12,q1,q2
+	eor	r7,r7,r8,ror#12
+	veor	q13,q5,q6
+	eor	r4,r4,r9,ror#12
+	veor	q14,q9,q10
+	vshr.u32	q1,q12,#25
+	vshr.u32	q5,q13,#25
+	vshr.u32	q9,q14,#25
+	vsli.32	q1,q12,#7
+	vsli.32	q5,q13,#7
+	vsli.32	q9,q14,#7
+	vext.8	q2,q2,q2,#8
+	vext.8	q6,q6,q6,#8
+	vext.8	q10,q10,q10,#8
+	vext.8	q1,q1,q1,#12
+	vext.8	q5,q5,q5,#12
+	vext.8	q9,q9,q9,#12
+	vext.8	q3,q3,q3,#4
+	vext.8	q7,q7,q7,#4
+	vext.8	q11,q11,q11,#4
+	bne		.Loop_neon
+
+	add		r11,sp,#32
+	vld1.32		{q12-q13},[sp]		@ load key material
+	vld1.32		{q14-q15},[r11]
+
+	ldr		r11,[sp,#4*(32+2)]	@ load len
+
+	str		r8, [sp,#4*(16+8)]	@ modulo-scheduled store
+	str		r9, [sp,#4*(16+9)]
+	str		r12,[sp,#4*(16+12)]
+	str		r10, [sp,#4*(16+13)]
+	str		r14,[sp,#4*(16+14)]
+
+	@ at this point we have first half of 512-bit result in
+	@ rx and second half at sp+4*(16+8)
+
+	ldr		r12,[sp,#4*(32+1)]	@ load inp
+	ldr		r14,[sp,#4*(32+0)]	@ load out
+
+	vadd.i32	q0,q0,q12		@ accumulate key material
+	vadd.i32	q4,q4,q12
+	vadd.i32	q8,q8,q12
+	vldr		d24,[sp,#4*(16+0)]	@ one
+
+	vadd.i32	q1,q1,q13
+	vadd.i32	q5,q5,q13
+	vadd.i32	q9,q9,q13
+	vldr		d26,[sp,#4*(16+2)]	@ two
+
+	vadd.i32	q2,q2,q14
+	vadd.i32	q6,q6,q14
+	vadd.i32	q10,q10,q14
+	vadd.i32	d14,d14,d24	@ counter+1
+	vadd.i32	d22,d22,d26	@ counter+2
+
+	vadd.i32	q3,q3,q15
+	vadd.i32	q7,q7,q15
+	vadd.i32	q11,q11,q15
+
+	cmp		r11,#64*4
+	blo		.Ltail_neon
+
+	vld1.8		{q12-q13},[r12]!	@ load input
+	 mov		r11,sp
+	vld1.8		{q14-q15},[r12]!
+	veor		q0,q0,q12		@ xor with input
+	veor		q1,q1,q13
+	vld1.8		{q12-q13},[r12]!
+	veor		q2,q2,q14
+	veor		q3,q3,q15
+	vld1.8		{q14-q15},[r12]!
+
+	veor		q4,q4,q12
+	 vst1.8		{q0-q1},[r14]!	@ store output
+	veor		q5,q5,q13
+	vld1.8		{q12-q13},[r12]!
+	veor		q6,q6,q14
+	 vst1.8		{q2-q3},[r14]!
+	veor		q7,q7,q15
+	vld1.8		{q14-q15},[r12]!
+
+	veor		q8,q8,q12
+	 vld1.32	{q0-q1},[r11]!	@ load for next iteration
+	 veor		d25,d25,d25
+	 vldr		d24,[sp,#4*(16+4)]	@ four
+	veor		q9,q9,q13
+	 vld1.32	{q2-q3},[r11]
+	veor		q10,q10,q14
+	 vst1.8		{q4-q5},[r14]!
+	veor		q11,q11,q15
+	 vst1.8		{q6-q7},[r14]!
+
+	vadd.i32	d6,d6,d24	@ next counter value
+	vldr		d24,[sp,#4*(16+0)]	@ one
+
+	ldmia		sp,{r8-r11}	@ load key material
+	add		r0,r0,r8	@ accumulate key material
+	ldr		r8,[r12],#16		@ load input
+	 vst1.8		{q8-q9},[r14]!
+	add		r1,r1,r9
+	ldr		r9,[r12,#-12]
+	 vst1.8		{q10-q11},[r14]!
+	add		r2,r2,r10
+	ldr		r10,[r12,#-8]
+	add		r3,r3,r11
+	ldr		r11,[r12,#-4]
+# ifdef	__ARMEB__
+	rev		r0,r0
+	rev		r1,r1
+	rev		r2,r2
+	rev		r3,r3
+# endif
+	eor		r0,r0,r8	@ xor with input
+	 add		r8,sp,#4*(4)
+	eor		r1,r1,r9
+	str		r0,[r14],#16		@ store output
+	eor		r2,r2,r10
+	str		r1,[r14,#-12]
+	eor		r3,r3,r11
+	 ldmia		r8,{r8-r11}	@ load key material
+	str		r2,[r14,#-8]
+	str		r3,[r14,#-4]
+
+	add		r4,r8,r4,ror#13 @ accumulate key material
+	ldr		r8,[r12],#16		@ load input
+	add		r5,r9,r5,ror#13
+	ldr		r9,[r12,#-12]
+	add		r6,r10,r6,ror#13
+	ldr		r10,[r12,#-8]
+	add		r7,r11,r7,ror#13
+	ldr		r11,[r12,#-4]
+# ifdef	__ARMEB__
+	rev		r4,r4
+	rev		r5,r5
+	rev		r6,r6
+	rev		r7,r7
+# endif
+	eor		r4,r4,r8
+	 add		r8,sp,#4*(8)
+	eor		r5,r5,r9
+	str		r4,[r14],#16		@ store output
+	eor		r6,r6,r10
+	str		r5,[r14,#-12]
+	eor		r7,r7,r11
+	 ldmia		r8,{r8-r11}	@ load key material
+	str		r6,[r14,#-8]
+	 add		r0,sp,#4*(16+8)
+	str		r7,[r14,#-4]
+
+	ldmia		r0,{r0-r7}	@ load second half
+
+	add		r0,r0,r8	@ accumulate key material
+	ldr		r8,[r12],#16		@ load input
+	add		r1,r1,r9
+	ldr		r9,[r12,#-12]
+# ifdef	__thumb2__
+	it	hi
+# endif
+	 strhi		r10,[sp,#4*(16+10)]	@ copy "rx" while at it
+	add		r2,r2,r10
+	ldr		r10,[r12,#-8]
+# ifdef	__thumb2__
+	it	hi
+# endif
+	 strhi		r11,[sp,#4*(16+11)]	@ copy "rx" while at it
+	add		r3,r3,r11
+	ldr		r11,[r12,#-4]
+# ifdef	__ARMEB__
+	rev		r0,r0
+	rev		r1,r1
+	rev		r2,r2
+	rev		r3,r3
+# endif
+	eor		r0,r0,r8
+	 add		r8,sp,#4*(12)
+	eor		r1,r1,r9
+	str		r0,[r14],#16		@ store output
+	eor		r2,r2,r10
+	str		r1,[r14,#-12]
+	eor		r3,r3,r11
+	 ldmia		r8,{r8-r11}	@ load key material
+	str		r2,[r14,#-8]
+	str		r3,[r14,#-4]
+
+	add		r4,r8,r4,ror#24 @ accumulate key material
+	 add		r8,r8,#4		@ next counter value
+	add		r5,r9,r5,ror#24
+	 str		r8,[sp,#4*(12)]	@ save next counter value
+	ldr		r8,[r12],#16		@ load input
+	add		r6,r10,r6,ror#24
+	 add		r4,r4,#3		@ counter+3
+	ldr		r9,[r12,#-12]
+	add		r7,r11,r7,ror#24
+	ldr		r10,[r12,#-8]
+	ldr		r11,[r12,#-4]
+# ifdef	__ARMEB__
+	rev		r4,r4
+	rev		r5,r5
+	rev		r6,r6
+	rev		r7,r7
+# endif
+	eor		r4,r4,r8
+# ifdef	__thumb2__
+	it	hi
+# endif
+	 ldrhi		r8,[sp,#4*(32+2)]	@ re-load len
+	eor		r5,r5,r9
+	eor		r6,r6,r10
+	str		r4,[r14],#16		@ store output
+	eor		r7,r7,r11
+	str		r5,[r14,#-12]
+	 sub		r11,r8,#64*4	@ len-=64*4
+	str		r6,[r14,#-8]
+	str		r7,[r14,#-4]
+	bhi		.Loop_neon_outer
+
+	b		.Ldone_neon
+
+.align	4
+.Lbreak_neon:
+	@ harmonize NEON and integer-only stack frames: load data
+	@ from NEON frame, but save to integer-only one; distance
+	@ between the two is 4*(32+4+16-32)=4*(20).
+
+	str		r11, [sp,#4*(20+32+2)]	@ save len
+	 add		r11,sp,#4*(32+4)
+	str		r12,   [sp,#4*(20+32+1)]	@ save inp
+	str		r14,   [sp,#4*(20+32+0)]	@ save out
+
+	ldr		r12,[sp,#4*(16+10)]
+	ldr		r14,[sp,#4*(16+11)]
+	 vldmia		r11,{d8-d15}			@ fulfill ABI requirement
+	str		r12,[sp,#4*(20+16+10)]	@ copy "rx"
+	str		r14,[sp,#4*(20+16+11)]	@ copy "rx"
+
+	ldr		r11, [sp,#4*(15)]
+	 mov		r4,r4,ror#19		@ twist b[0..3]
+	ldr		r12,[sp,#4*(12)]		@ modulo-scheduled load
+	 mov		r5,r5,ror#19
+	ldr		r10, [sp,#4*(13)]
+	 mov		r6,r6,ror#19
+	ldr		r14,[sp,#4*(14)]
+	 mov		r7,r7,ror#19
+	mov		r11,r11,ror#8		@ twist d[0..3]
+	mov		r12,r12,ror#8
+	mov		r10,r10,ror#8
+	mov		r14,r14,ror#8
+	str		r11, [sp,#4*(20+16+15)]
+	add		r11,sp,#4*(20)
+	vst1.32		{q0-q1},[r11]!		@ copy key
+	add		sp,sp,#4*(20)			@ switch frame
+	vst1.32		{q2-q3},[r11]
+	mov		r11,#10
+	b		.Loop				@ go integer-only
+
+.align	4
+.Ltail_neon:
+	cmp		r11,#64*3
+	bhs		.L192_or_more_neon
+	cmp		r11,#64*2
+	bhs		.L128_or_more_neon
+	cmp		r11,#64*1
+	bhs		.L64_or_more_neon
+
+	add		r8,sp,#4*(8)
+	vst1.8		{q0-q1},[sp]
+	add		r10,sp,#4*(0)
+	vst1.8		{q2-q3},[r8]
+	b		.Loop_tail_neon
+
+.align	4
+.L64_or_more_neon:
+	vld1.8		{q12-q13},[r12]!
+	vld1.8		{q14-q15},[r12]!
+	veor		q0,q0,q12
+	veor		q1,q1,q13
+	veor		q2,q2,q14
+	veor		q3,q3,q15
+	vst1.8		{q0-q1},[r14]!
+	vst1.8		{q2-q3},[r14]!
+
+	beq		.Ldone_neon
+
+	add		r8,sp,#4*(8)
+	vst1.8		{q4-q5},[sp]
+	add		r10,sp,#4*(0)
+	vst1.8		{q6-q7},[r8]
+	sub		r11,r11,#64*1	@ len-=64*1
+	b		.Loop_tail_neon
+
+.align	4
+.L128_or_more_neon:
+	vld1.8		{q12-q13},[r12]!
+	vld1.8		{q14-q15},[r12]!
+	veor		q0,q0,q12
+	veor		q1,q1,q13
+	vld1.8		{q12-q13},[r12]!
+	veor		q2,q2,q14
+	veor		q3,q3,q15
+	vld1.8		{q14-q15},[r12]!
+
+	veor		q4,q4,q12
+	veor		q5,q5,q13
+	 vst1.8		{q0-q1},[r14]!
+	veor		q6,q6,q14
+	 vst1.8		{q2-q3},[r14]!
+	veor		q7,q7,q15
+	vst1.8		{q4-q5},[r14]!
+	vst1.8		{q6-q7},[r14]!
+
+	beq		.Ldone_neon
+
+	add		r8,sp,#4*(8)
+	vst1.8		{q8-q9},[sp]
+	add		r10,sp,#4*(0)
+	vst1.8		{q10-q11},[r8]
+	sub		r11,r11,#64*2	@ len-=64*2
+	b		.Loop_tail_neon
+
+.align	4
+.L192_or_more_neon:
+	vld1.8		{q12-q13},[r12]!
+	vld1.8		{q14-q15},[r12]!
+	veor		q0,q0,q12
+	veor		q1,q1,q13
+	vld1.8		{q12-q13},[r12]!
+	veor		q2,q2,q14
+	veor		q3,q3,q15
+	vld1.8		{q14-q15},[r12]!
+
+	veor		q4,q4,q12
+	veor		q5,q5,q13
+	vld1.8		{q12-q13},[r12]!
+	veor		q6,q6,q14
+	 vst1.8		{q0-q1},[r14]!
+	veor		q7,q7,q15
+	vld1.8		{q14-q15},[r12]!
+
+	veor		q8,q8,q12
+	 vst1.8		{q2-q3},[r14]!
+	veor		q9,q9,q13
+	 vst1.8		{q4-q5},[r14]!
+	veor		q10,q10,q14
+	 vst1.8		{q6-q7},[r14]!
+	veor		q11,q11,q15
+	vst1.8		{q8-q9},[r14]!
+	vst1.8		{q10-q11},[r14]!
+
+	beq		.Ldone_neon
+
+	ldmia		sp,{r8-r11}	@ load key material
+	add		r0,r0,r8	@ accumulate key material
+	 add		r8,sp,#4*(4)
+	add		r1,r1,r9
+	add		r2,r2,r10
+	add		r3,r3,r11
+	 ldmia		r8,{r8-r11}	@ load key material
+
+	add		r4,r8,r4,ror#13 @ accumulate key material
+	 add		r8,sp,#4*(8)
+	add		r5,r9,r5,ror#13
+	add		r6,r10,r6,ror#13
+	add		r7,r11,r7,ror#13
+	 ldmia		r8,{r8-r11}	@ load key material
+# ifdef	__ARMEB__
+	rev		r0,r0
+	rev		r1,r1
+	rev		r2,r2
+	rev		r3,r3
+	rev		r4,r4
+	rev		r5,r5
+	rev		r6,r6
+	rev		r7,r7
+# endif
+	stmia		sp,{r0-r7}
+	 add		r0,sp,#4*(16+8)
+
+	ldmia		r0,{r0-r7}	@ load second half
+
+	add		r0,r0,r8	@ accumulate key material
+	 add		r8,sp,#4*(12)
+	add		r1,r1,r9
+	add		r2,r2,r10
+	add		r3,r3,r11
+	 ldmia		r8,{r8-r11}	@ load key material
+
+	add		r4,r8,r4,ror#24 @ accumulate key material
+	 add		r8,sp,#4*(8)
+	add		r5,r9,r5,ror#24
+	 add		r4,r4,#3		@ counter+3
+	add		r6,r10,r6,ror#24
+	add		r7,r11,r7,ror#24
+	 ldr		r11,[sp,#4*(32+2)]	@ re-load len
+# ifdef	__ARMEB__
+	rev		r0,r0
+	rev		r1,r1
+	rev		r2,r2
+	rev		r3,r3
+	rev		r4,r4
+	rev		r5,r5
+	rev		r6,r6
+	rev		r7,r7
+# endif
+	stmia		r8,{r0-r7}
+	 add		r10,sp,#4*(0)
+	 sub		r11,r11,#64*3	@ len-=64*3
+
+.Loop_tail_neon:
+	ldrb		r8,[r10],#1	@ read buffer on stack
+	ldrb		r9,[r12],#1		@ read input
+	subs		r11,r11,#1
+	eor		r8,r8,r9
+	strb		r8,[r14],#1		@ store output
+	bne		.Loop_tail_neon
+
+.Ldone_neon:
+	add		sp,sp,#4*(32+4)
+	vldmia		sp,{d8-d15}
+	add		sp,sp,#4*(16+3)
+	ldmia		sp!,{r4-r11,pc}
+.size	ChaCha20_neon,.-ChaCha20_neon
+.comm	OPENSSL_armcap_P,4,4
+#endif
diff --git a/lib/zinc/chacha20/chacha20-arm64-cryptogams.S b/lib/zinc/chacha20/chacha20-arm64-cryptogams.S
new file mode 100644
index 000000000000..4d029bfdad3a
--- /dev/null
+++ b/lib/zinc/chacha20/chacha20-arm64-cryptogams.S
@@ -0,0 +1,1973 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+/*
+ * Copyright (C) 2006-2017 CRYPTOGAMS by <appro@openssl.org>. All Rights Reserved.
+ */
+
+#include "arm_arch.h"
+
+.text
+
+
+
+.align	5
+.Lsigma:
+.quad	0x3320646e61707865,0x6b20657479622d32		// endian-neutral
+.Lone:
+.long	1,0,0,0
+.LOPENSSL_armcap_P:
+#ifdef	__ILP32__
+.long	OPENSSL_armcap_P-.
+#else
+.quad	OPENSSL_armcap_P-.
+#endif
+.byte	67,104,97,67,104,97,50,48,32,102,111,114,32,65,82,77,118,56,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
+.align	2
+
+.globl	ChaCha20_ctr32
+.type	ChaCha20_ctr32,%function
+.align	5
+ChaCha20_ctr32:
+	cbz	x2,.Labort
+	adr	x5,.LOPENSSL_armcap_P
+	cmp	x2,#192
+	b.lo	.Lshort
+#ifdef	__ILP32__
+	ldrsw	x6,[x5]
+#else
+	ldr	x6,[x5]
+#endif
+	ldr	w17,[x6,x5]
+	tst	w17,#ARMV7_NEON
+	b.ne	ChaCha20_neon
+
+.Lshort:
+	stp	x29,x30,[sp,#-96]!
+	add	x29,sp,#0
+
+	adr	x5,.Lsigma
+	stp	x19,x20,[sp,#16]
+	stp	x21,x22,[sp,#32]
+	stp	x23,x24,[sp,#48]
+	stp	x25,x26,[sp,#64]
+	stp	x27,x28,[sp,#80]
+	sub	sp,sp,#64
+
+	ldp	x22,x23,[x5]		// load sigma
+	ldp	x24,x25,[x3]		// load key
+	ldp	x26,x27,[x3,#16]
+	ldp	x28,x30,[x4]		// load counter
+#ifdef	__ARMEB__
+	ror	x24,x24,#32
+	ror	x25,x25,#32
+	ror	x26,x26,#32
+	ror	x27,x27,#32
+	ror	x28,x28,#32
+	ror	x30,x30,#32
+#endif
+
+.Loop_outer:
+	mov	w5,w22			// unpack key block
+	lsr	x6,x22,#32
+	mov	w7,w23
+	lsr	x8,x23,#32
+	mov	w9,w24
+	lsr	x10,x24,#32
+	mov	w11,w25
+	lsr	x12,x25,#32
+	mov	w13,w26
+	lsr	x14,x26,#32
+	mov	w15,w27
+	lsr	x16,x27,#32
+	mov	w17,w28
+	lsr	x19,x28,#32
+	mov	w20,w30
+	lsr	x21,x30,#32
+
+	mov	x4,#10
+	subs	x2,x2,#64
+.Loop:
+	sub	x4,x4,#1
+	add	w5,w5,w9
+	add	w6,w6,w10
+	add	w7,w7,w11
+	add	w8,w8,w12
+	eor	w17,w17,w5
+	eor	w19,w19,w6
+	eor	w20,w20,w7
+	eor	w21,w21,w8
+	ror	w17,w17,#16
+	ror	w19,w19,#16
+	ror	w20,w20,#16
+	ror	w21,w21,#16
+	add	w13,w13,w17
+	add	w14,w14,w19
+	add	w15,w15,w20
+	add	w16,w16,w21
+	eor	w9,w9,w13
+	eor	w10,w10,w14
+	eor	w11,w11,w15
+	eor	w12,w12,w16
+	ror	w9,w9,#20
+	ror	w10,w10,#20
+	ror	w11,w11,#20
+	ror	w12,w12,#20
+	add	w5,w5,w9
+	add	w6,w6,w10
+	add	w7,w7,w11
+	add	w8,w8,w12
+	eor	w17,w17,w5
+	eor	w19,w19,w6
+	eor	w20,w20,w7
+	eor	w21,w21,w8
+	ror	w17,w17,#24
+	ror	w19,w19,#24
+	ror	w20,w20,#24
+	ror	w21,w21,#24
+	add	w13,w13,w17
+	add	w14,w14,w19
+	add	w15,w15,w20
+	add	w16,w16,w21
+	eor	w9,w9,w13
+	eor	w10,w10,w14
+	eor	w11,w11,w15
+	eor	w12,w12,w16
+	ror	w9,w9,#25
+	ror	w10,w10,#25
+	ror	w11,w11,#25
+	ror	w12,w12,#25
+	add	w5,w5,w10
+	add	w6,w6,w11
+	add	w7,w7,w12
+	add	w8,w8,w9
+	eor	w21,w21,w5
+	eor	w17,w17,w6
+	eor	w19,w19,w7
+	eor	w20,w20,w8
+	ror	w21,w21,#16
+	ror	w17,w17,#16
+	ror	w19,w19,#16
+	ror	w20,w20,#16
+	add	w15,w15,w21
+	add	w16,w16,w17
+	add	w13,w13,w19
+	add	w14,w14,w20
+	eor	w10,w10,w15
+	eor	w11,w11,w16
+	eor	w12,w12,w13
+	eor	w9,w9,w14
+	ror	w10,w10,#20
+	ror	w11,w11,#20
+	ror	w12,w12,#20
+	ror	w9,w9,#20
+	add	w5,w5,w10
+	add	w6,w6,w11
+	add	w7,w7,w12
+	add	w8,w8,w9
+	eor	w21,w21,w5
+	eor	w17,w17,w6
+	eor	w19,w19,w7
+	eor	w20,w20,w8
+	ror	w21,w21,#24
+	ror	w17,w17,#24
+	ror	w19,w19,#24
+	ror	w20,w20,#24
+	add	w15,w15,w21
+	add	w16,w16,w17
+	add	w13,w13,w19
+	add	w14,w14,w20
+	eor	w10,w10,w15
+	eor	w11,w11,w16
+	eor	w12,w12,w13
+	eor	w9,w9,w14
+	ror	w10,w10,#25
+	ror	w11,w11,#25
+	ror	w12,w12,#25
+	ror	w9,w9,#25
+	cbnz	x4,.Loop
+
+	add	w5,w5,w22		// accumulate key block
+	add	x6,x6,x22,lsr#32
+	add	w7,w7,w23
+	add	x8,x8,x23,lsr#32
+	add	w9,w9,w24
+	add	x10,x10,x24,lsr#32
+	add	w11,w11,w25
+	add	x12,x12,x25,lsr#32
+	add	w13,w13,w26
+	add	x14,x14,x26,lsr#32
+	add	w15,w15,w27
+	add	x16,x16,x27,lsr#32
+	add	w17,w17,w28
+	add	x19,x19,x28,lsr#32
+	add	w20,w20,w30
+	add	x21,x21,x30,lsr#32
+
+	b.lo	.Ltail
+
+	add	x5,x5,x6,lsl#32	// pack
+	add	x7,x7,x8,lsl#32
+	ldp	x6,x8,[x1,#0]		// load input
+	add	x9,x9,x10,lsl#32
+	add	x11,x11,x12,lsl#32
+	ldp	x10,x12,[x1,#16]
+	add	x13,x13,x14,lsl#32
+	add	x15,x15,x16,lsl#32
+	ldp	x14,x16,[x1,#32]
+	add	x17,x17,x19,lsl#32
+	add	x20,x20,x21,lsl#32
+	ldp	x19,x21,[x1,#48]
+	add	x1,x1,#64
+#ifdef	__ARMEB__
+	rev	x5,x5
+	rev	x7,x7
+	rev	x9,x9
+	rev	x11,x11
+	rev	x13,x13
+	rev	x15,x15
+	rev	x17,x17
+	rev	x20,x20
+#endif
+	eor	x5,x5,x6
+	eor	x7,x7,x8
+	eor	x9,x9,x10
+	eor	x11,x11,x12
+	eor	x13,x13,x14
+	eor	x15,x15,x16
+	eor	x17,x17,x19
+	eor	x20,x20,x21
+
+	stp	x5,x7,[x0,#0]		// store output
+	add	x28,x28,#1			// increment counter
+	stp	x9,x11,[x0,#16]
+	stp	x13,x15,[x0,#32]
+	stp	x17,x20,[x0,#48]
+	add	x0,x0,#64
+
+	b.hi	.Loop_outer
+
+	ldp	x19,x20,[x29,#16]
+	add	sp,sp,#64
+	ldp	x21,x22,[x29,#32]
+	ldp	x23,x24,[x29,#48]
+	ldp	x25,x26,[x29,#64]
+	ldp	x27,x28,[x29,#80]
+	ldp	x29,x30,[sp],#96
+.Labort:
+	ret
+
+.align	4
+.Ltail:
+	add	x2,x2,#64
+.Less_than_64:
+	sub	x0,x0,#1
+	add	x1,x1,x2
+	add	x0,x0,x2
+	add	x4,sp,x2
+	neg	x2,x2
+
+	add	x5,x5,x6,lsl#32	// pack
+	add	x7,x7,x8,lsl#32
+	add	x9,x9,x10,lsl#32
+	add	x11,x11,x12,lsl#32
+	add	x13,x13,x14,lsl#32
+	add	x15,x15,x16,lsl#32
+	add	x17,x17,x19,lsl#32
+	add	x20,x20,x21,lsl#32
+#ifdef	__ARMEB__
+	rev	x5,x5
+	rev	x7,x7
+	rev	x9,x9
+	rev	x11,x11
+	rev	x13,x13
+	rev	x15,x15
+	rev	x17,x17
+	rev	x20,x20
+#endif
+	stp	x5,x7,[sp,#0]
+	stp	x9,x11,[sp,#16]
+	stp	x13,x15,[sp,#32]
+	stp	x17,x20,[sp,#48]
+
+.Loop_tail:
+	ldrb	w10,[x1,x2]
+	ldrb	w11,[x4,x2]
+	add	x2,x2,#1
+	eor	w10,w10,w11
+	strb	w10,[x0,x2]
+	cbnz	x2,.Loop_tail
+
+	stp	xzr,xzr,[sp,#0]
+	stp	xzr,xzr,[sp,#16]
+	stp	xzr,xzr,[sp,#32]
+	stp	xzr,xzr,[sp,#48]
+
+	ldp	x19,x20,[x29,#16]
+	add	sp,sp,#64
+	ldp	x21,x22,[x29,#32]
+	ldp	x23,x24,[x29,#48]
+	ldp	x25,x26,[x29,#64]
+	ldp	x27,x28,[x29,#80]
+	ldp	x29,x30,[sp],#96
+	ret
+.size	ChaCha20_ctr32,.-ChaCha20_ctr32
+
+.type	ChaCha20_neon,%function
+.align	5
+ChaCha20_neon:
+	stp	x29,x30,[sp,#-96]!
+	add	x29,sp,#0
+
+	adr	x5,.Lsigma
+	stp	x19,x20,[sp,#16]
+	stp	x21,x22,[sp,#32]
+	stp	x23,x24,[sp,#48]
+	stp	x25,x26,[sp,#64]
+	stp	x27,x28,[sp,#80]
+	cmp	x2,#512
+	b.hs	.L512_or_more_neon
+
+	sub	sp,sp,#64
+
+	ldp	x22,x23,[x5]		// load sigma
+	ld1	{v24.4s},[x5],#16
+	ldp	x24,x25,[x3]		// load key
+	ldp	x26,x27,[x3,#16]
+	ld1	{v25.4s,v26.4s},[x3]
+	ldp	x28,x30,[x4]		// load counter
+	ld1	{v27.4s},[x4]
+	ld1	{v31.4s},[x5]
+#ifdef	__ARMEB__
+	rev64	v24.4s,v24.4s
+	ror	x24,x24,#32
+	ror	x25,x25,#32
+	ror	x26,x26,#32
+	ror	x27,x27,#32
+	ror	x28,x28,#32
+	ror	x30,x30,#32
+#endif
+	add	v27.4s,v27.4s,v31.4s		// += 1
+	add	v28.4s,v27.4s,v31.4s
+	add	v29.4s,v28.4s,v31.4s
+	shl	v31.4s,v31.4s,#2			// 1 -> 4
+
+.Loop_outer_neon:
+	mov	w5,w22			// unpack key block
+	lsr	x6,x22,#32
+	mov	v0.16b,v24.16b
+	mov	w7,w23
+	lsr	x8,x23,#32
+	mov	v4.16b,v24.16b
+	mov	w9,w24
+	lsr	x10,x24,#32
+	mov	v16.16b,v24.16b
+	mov	w11,w25
+	mov	v1.16b,v25.16b
+	lsr	x12,x25,#32
+	mov	v5.16b,v25.16b
+	mov	w13,w26
+	mov	v17.16b,v25.16b
+	lsr	x14,x26,#32
+	mov	v3.16b,v27.16b
+	mov	w15,w27
+	mov	v7.16b,v28.16b
+	lsr	x16,x27,#32
+	mov	v19.16b,v29.16b
+	mov	w17,w28
+	mov	v2.16b,v26.16b
+	lsr	x19,x28,#32
+	mov	v6.16b,v26.16b
+	mov	w20,w30
+	mov	v18.16b,v26.16b
+	lsr	x21,x30,#32
+
+	mov	x4,#10
+	subs	x2,x2,#256
+.Loop_neon:
+	sub	x4,x4,#1
+	add	v0.4s,v0.4s,v1.4s
+	add	w5,w5,w9
+	add	v4.4s,v4.4s,v5.4s
+	add	w6,w6,w10
+	add	v16.4s,v16.4s,v17.4s
+	add	w7,w7,w11
+	eor	v3.16b,v3.16b,v0.16b
+	add	w8,w8,w12
+	eor	v7.16b,v7.16b,v4.16b
+	eor	w17,w17,w5
+	eor	v19.16b,v19.16b,v16.16b
+	eor	w19,w19,w6
+	rev32	v3.8h,v3.8h
+	eor	w20,w20,w7
+	rev32	v7.8h,v7.8h
+	eor	w21,w21,w8
+	rev32	v19.8h,v19.8h
+	ror	w17,w17,#16
+	add	v2.4s,v2.4s,v3.4s
+	ror	w19,w19,#16
+	add	v6.4s,v6.4s,v7.4s
+	ror	w20,w20,#16
+	add	v18.4s,v18.4s,v19.4s
+	ror	w21,w21,#16
+	eor	v20.16b,v1.16b,v2.16b
+	add	w13,w13,w17
+	eor	v21.16b,v5.16b,v6.16b
+	add	w14,w14,w19
+	eor	v22.16b,v17.16b,v18.16b
+	add	w15,w15,w20
+	ushr	v1.4s,v20.4s,#20
+	add	w16,w16,w21
+	ushr	v5.4s,v21.4s,#20
+	eor	w9,w9,w13
+	ushr	v17.4s,v22.4s,#20
+	eor	w10,w10,w14
+	sli	v1.4s,v20.4s,#12
+	eor	w11,w11,w15
+	sli	v5.4s,v21.4s,#12
+	eor	w12,w12,w16
+	sli	v17.4s,v22.4s,#12
+	ror	w9,w9,#20
+	add	v0.4s,v0.4s,v1.4s
+	ror	w10,w10,#20
+	add	v4.4s,v4.4s,v5.4s
+	ror	w11,w11,#20
+	add	v16.4s,v16.4s,v17.4s
+	ror	w12,w12,#20
+	eor	v20.16b,v3.16b,v0.16b
+	add	w5,w5,w9
+	eor	v21.16b,v7.16b,v4.16b
+	add	w6,w6,w10
+	eor	v22.16b,v19.16b,v16.16b
+	add	w7,w7,w11
+	ushr	v3.4s,v20.4s,#24
+	add	w8,w8,w12
+	ushr	v7.4s,v21.4s,#24
+	eor	w17,w17,w5
+	ushr	v19.4s,v22.4s,#24
+	eor	w19,w19,w6
+	sli	v3.4s,v20.4s,#8
+	eor	w20,w20,w7
+	sli	v7.4s,v21.4s,#8
+	eor	w21,w21,w8
+	sli	v19.4s,v22.4s,#8
+	ror	w17,w17,#24
+	add	v2.4s,v2.4s,v3.4s
+	ror	w19,w19,#24
+	add	v6.4s,v6.4s,v7.4s
+	ror	w20,w20,#24
+	add	v18.4s,v18.4s,v19.4s
+	ror	w21,w21,#24
+	eor	v20.16b,v1.16b,v2.16b
+	add	w13,w13,w17
+	eor	v21.16b,v5.16b,v6.16b
+	add	w14,w14,w19
+	eor	v22.16b,v17.16b,v18.16b
+	add	w15,w15,w20
+	ushr	v1.4s,v20.4s,#25
+	add	w16,w16,w21
+	ushr	v5.4s,v21.4s,#25
+	eor	w9,w9,w13
+	ushr	v17.4s,v22.4s,#25
+	eor	w10,w10,w14
+	sli	v1.4s,v20.4s,#7
+	eor	w11,w11,w15
+	sli	v5.4s,v21.4s,#7
+	eor	w12,w12,w16
+	sli	v17.4s,v22.4s,#7
+	ror	w9,w9,#25
+	ext	v2.16b,v2.16b,v2.16b,#8
+	ror	w10,w10,#25
+	ext	v6.16b,v6.16b,v6.16b,#8
+	ror	w11,w11,#25
+	ext	v18.16b,v18.16b,v18.16b,#8
+	ror	w12,w12,#25
+	ext	v3.16b,v3.16b,v3.16b,#12
+	ext	v7.16b,v7.16b,v7.16b,#12
+	ext	v19.16b,v19.16b,v19.16b,#12
+	ext	v1.16b,v1.16b,v1.16b,#4
+	ext	v5.16b,v5.16b,v5.16b,#4
+	ext	v17.16b,v17.16b,v17.16b,#4
+	add	v0.4s,v0.4s,v1.4s
+	add	w5,w5,w10
+	add	v4.4s,v4.4s,v5.4s
+	add	w6,w6,w11
+	add	v16.4s,v16.4s,v17.4s
+	add	w7,w7,w12
+	eor	v3.16b,v3.16b,v0.16b
+	add	w8,w8,w9
+	eor	v7.16b,v7.16b,v4.16b
+	eor	w21,w21,w5
+	eor	v19.16b,v19.16b,v16.16b
+	eor	w17,w17,w6
+	rev32	v3.8h,v3.8h
+	eor	w19,w19,w7
+	rev32	v7.8h,v7.8h
+	eor	w20,w20,w8
+	rev32	v19.8h,v19.8h
+	ror	w21,w21,#16
+	add	v2.4s,v2.4s,v3.4s
+	ror	w17,w17,#16
+	add	v6.4s,v6.4s,v7.4s
+	ror	w19,w19,#16
+	add	v18.4s,v18.4s,v19.4s
+	ror	w20,w20,#16
+	eor	v20.16b,v1.16b,v2.16b
+	add	w15,w15,w21
+	eor	v21.16b,v5.16b,v6.16b
+	add	w16,w16,w17
+	eor	v22.16b,v17.16b,v18.16b
+	add	w13,w13,w19
+	ushr	v1.4s,v20.4s,#20
+	add	w14,w14,w20
+	ushr	v5.4s,v21.4s,#20
+	eor	w10,w10,w15
+	ushr	v17.4s,v22.4s,#20
+	eor	w11,w11,w16
+	sli	v1.4s,v20.4s,#12
+	eor	w12,w12,w13
+	sli	v5.4s,v21.4s,#12
+	eor	w9,w9,w14
+	sli	v17.4s,v22.4s,#12
+	ror	w10,w10,#20
+	add	v0.4s,v0.4s,v1.4s
+	ror	w11,w11,#20
+	add	v4.4s,v4.4s,v5.4s
+	ror	w12,w12,#20
+	add	v16.4s,v16.4s,v17.4s
+	ror	w9,w9,#20
+	eor	v20.16b,v3.16b,v0.16b
+	add	w5,w5,w10
+	eor	v21.16b,v7.16b,v4.16b
+	add	w6,w6,w11
+	eor	v22.16b,v19.16b,v16.16b
+	add	w7,w7,w12
+	ushr	v3.4s,v20.4s,#24
+	add	w8,w8,w9
+	ushr	v7.4s,v21.4s,#24
+	eor	w21,w21,w5
+	ushr	v19.4s,v22.4s,#24
+	eor	w17,w17,w6
+	sli	v3.4s,v20.4s,#8
+	eor	w19,w19,w7
+	sli	v7.4s,v21.4s,#8
+	eor	w20,w20,w8
+	sli	v19.4s,v22.4s,#8
+	ror	w21,w21,#24
+	add	v2.4s,v2.4s,v3.4s
+	ror	w17,w17,#24
+	add	v6.4s,v6.4s,v7.4s
+	ror	w19,w19,#24
+	add	v18.4s,v18.4s,v19.4s
+	ror	w20,w20,#24
+	eor	v20.16b,v1.16b,v2.16b
+	add	w15,w15,w21
+	eor	v21.16b,v5.16b,v6.16b
+	add	w16,w16,w17
+	eor	v22.16b,v17.16b,v18.16b
+	add	w13,w13,w19
+	ushr	v1.4s,v20.4s,#25
+	add	w14,w14,w20
+	ushr	v5.4s,v21.4s,#25
+	eor	w10,w10,w15
+	ushr	v17.4s,v22.4s,#25
+	eor	w11,w11,w16
+	sli	v1.4s,v20.4s,#7
+	eor	w12,w12,w13
+	sli	v5.4s,v21.4s,#7
+	eor	w9,w9,w14
+	sli	v17.4s,v22.4s,#7
+	ror	w10,w10,#25
+	ext	v2.16b,v2.16b,v2.16b,#8
+	ror	w11,w11,#25
+	ext	v6.16b,v6.16b,v6.16b,#8
+	ror	w12,w12,#25
+	ext	v18.16b,v18.16b,v18.16b,#8
+	ror	w9,w9,#25
+	ext	v3.16b,v3.16b,v3.16b,#4
+	ext	v7.16b,v7.16b,v7.16b,#4
+	ext	v19.16b,v19.16b,v19.16b,#4
+	ext	v1.16b,v1.16b,v1.16b,#12
+	ext	v5.16b,v5.16b,v5.16b,#12
+	ext	v17.16b,v17.16b,v17.16b,#12
+	cbnz	x4,.Loop_neon
+
+	add	w5,w5,w22		// accumulate key block
+	add	v0.4s,v0.4s,v24.4s
+	add	x6,x6,x22,lsr#32
+	add	v4.4s,v4.4s,v24.4s
+	add	w7,w7,w23
+	add	v16.4s,v16.4s,v24.4s
+	add	x8,x8,x23,lsr#32
+	add	v2.4s,v2.4s,v26.4s
+	add	w9,w9,w24
+	add	v6.4s,v6.4s,v26.4s
+	add	x10,x10,x24,lsr#32
+	add	v18.4s,v18.4s,v26.4s
+	add	w11,w11,w25
+	add	v3.4s,v3.4s,v27.4s
+	add	x12,x12,x25,lsr#32
+	add	w13,w13,w26
+	add	v7.4s,v7.4s,v28.4s
+	add	x14,x14,x26,lsr#32
+	add	w15,w15,w27
+	add	v19.4s,v19.4s,v29.4s
+	add	x16,x16,x27,lsr#32
+	add	w17,w17,w28
+	add	v1.4s,v1.4s,v25.4s
+	add	x19,x19,x28,lsr#32
+	add	w20,w20,w30
+	add	v5.4s,v5.4s,v25.4s
+	add	x21,x21,x30,lsr#32
+	add	v17.4s,v17.4s,v25.4s
+
+	b.lo	.Ltail_neon
+
+	add	x5,x5,x6,lsl#32	// pack
+	add	x7,x7,x8,lsl#32
+	ldp	x6,x8,[x1,#0]		// load input
+	add	x9,x9,x10,lsl#32
+	add	x11,x11,x12,lsl#32
+	ldp	x10,x12,[x1,#16]
+	add	x13,x13,x14,lsl#32
+	add	x15,x15,x16,lsl#32
+	ldp	x14,x16,[x1,#32]
+	add	x17,x17,x19,lsl#32
+	add	x20,x20,x21,lsl#32
+	ldp	x19,x21,[x1,#48]
+	add	x1,x1,#64
+#ifdef	__ARMEB__
+	rev	x5,x5
+	rev	x7,x7
+	rev	x9,x9
+	rev	x11,x11
+	rev	x13,x13
+	rev	x15,x15
+	rev	x17,x17
+	rev	x20,x20
+#endif
+	ld1	{v20.16b,v21.16b,v22.16b,v23.16b},[x1],#64
+	eor	x5,x5,x6
+	eor	x7,x7,x8
+	eor	x9,x9,x10
+	eor	x11,x11,x12
+	eor	x13,x13,x14
+	eor	v0.16b,v0.16b,v20.16b
+	eor	x15,x15,x16
+	eor	v1.16b,v1.16b,v21.16b
+	eor	x17,x17,x19
+	eor	v2.16b,v2.16b,v22.16b
+	eor	x20,x20,x21
+	eor	v3.16b,v3.16b,v23.16b
+	ld1	{v20.16b,v21.16b,v22.16b,v23.16b},[x1],#64
+
+	stp	x5,x7,[x0,#0]		// store output
+	add	x28,x28,#4			// increment counter
+	stp	x9,x11,[x0,#16]
+	add	v27.4s,v27.4s,v31.4s		// += 4
+	stp	x13,x15,[x0,#32]
+	add	v28.4s,v28.4s,v31.4s
+	stp	x17,x20,[x0,#48]
+	add	v29.4s,v29.4s,v31.4s
+	add	x0,x0,#64
+
+	st1	{v0.16b,v1.16b,v2.16b,v3.16b},[x0],#64
+	ld1	{v0.16b,v1.16b,v2.16b,v3.16b},[x1],#64
+
+	eor	v4.16b,v4.16b,v20.16b
+	eor	v5.16b,v5.16b,v21.16b
+	eor	v6.16b,v6.16b,v22.16b
+	eor	v7.16b,v7.16b,v23.16b
+	st1	{v4.16b,v5.16b,v6.16b,v7.16b},[x0],#64
+
+	eor	v16.16b,v16.16b,v0.16b
+	eor	v17.16b,v17.16b,v1.16b
+	eor	v18.16b,v18.16b,v2.16b
+	eor	v19.16b,v19.16b,v3.16b
+	st1	{v16.16b,v17.16b,v18.16b,v19.16b},[x0],#64
+
+	b.hi	.Loop_outer_neon
+
+	ldp	x19,x20,[x29,#16]
+	add	sp,sp,#64
+	ldp	x21,x22,[x29,#32]
+	ldp	x23,x24,[x29,#48]
+	ldp	x25,x26,[x29,#64]
+	ldp	x27,x28,[x29,#80]
+	ldp	x29,x30,[sp],#96
+	ret
+
+.Ltail_neon:
+	add	x2,x2,#256
+	cmp	x2,#64
+	b.lo	.Less_than_64
+
+	add	x5,x5,x6,lsl#32	// pack
+	add	x7,x7,x8,lsl#32
+	ldp	x6,x8,[x1,#0]		// load input
+	add	x9,x9,x10,lsl#32
+	add	x11,x11,x12,lsl#32
+	ldp	x10,x12,[x1,#16]
+	add	x13,x13,x14,lsl#32
+	add	x15,x15,x16,lsl#32
+	ldp	x14,x16,[x1,#32]
+	add	x17,x17,x19,lsl#32
+	add	x20,x20,x21,lsl#32
+	ldp	x19,x21,[x1,#48]
+	add	x1,x1,#64
+#ifdef	__ARMEB__
+	rev	x5,x5
+	rev	x7,x7
+	rev	x9,x9
+	rev	x11,x11
+	rev	x13,x13
+	rev	x15,x15
+	rev	x17,x17
+	rev	x20,x20
+#endif
+	eor	x5,x5,x6
+	eor	x7,x7,x8
+	eor	x9,x9,x10
+	eor	x11,x11,x12
+	eor	x13,x13,x14
+	eor	x15,x15,x16
+	eor	x17,x17,x19
+	eor	x20,x20,x21
+
+	stp	x5,x7,[x0,#0]		// store output
+	add	x28,x28,#4			// increment counter
+	stp	x9,x11,[x0,#16]
+	stp	x13,x15,[x0,#32]
+	stp	x17,x20,[x0,#48]
+	add	x0,x0,#64
+	b.eq	.Ldone_neon
+	sub	x2,x2,#64
+	cmp	x2,#64
+	b.lo	.Less_than_128
+
+	ld1	{v20.16b,v21.16b,v22.16b,v23.16b},[x1],#64
+	eor	v0.16b,v0.16b,v20.16b
+	eor	v1.16b,v1.16b,v21.16b
+	eor	v2.16b,v2.16b,v22.16b
+	eor	v3.16b,v3.16b,v23.16b
+	st1	{v0.16b,v1.16b,v2.16b,v3.16b},[x0],#64
+	b.eq	.Ldone_neon
+	sub	x2,x2,#64
+	cmp	x2,#64
+	b.lo	.Less_than_192
+
+	ld1	{v20.16b,v21.16b,v22.16b,v23.16b},[x1],#64
+	eor	v4.16b,v4.16b,v20.16b
+	eor	v5.16b,v5.16b,v21.16b
+	eor	v6.16b,v6.16b,v22.16b
+	eor	v7.16b,v7.16b,v23.16b
+	st1	{v4.16b,v5.16b,v6.16b,v7.16b},[x0],#64
+	b.eq	.Ldone_neon
+	sub	x2,x2,#64
+
+	st1	{v16.16b,v17.16b,v18.16b,v19.16b},[sp]
+	b	.Last_neon
+
+.Less_than_128:
+	st1	{v0.16b,v1.16b,v2.16b,v3.16b},[sp]
+	b	.Last_neon
+.Less_than_192:
+	st1	{v4.16b,v5.16b,v6.16b,v7.16b},[sp]
+	b	.Last_neon
+
+.align	4
+.Last_neon:
+	sub	x0,x0,#1
+	add	x1,x1,x2
+	add	x0,x0,x2
+	add	x4,sp,x2
+	neg	x2,x2
+
+.Loop_tail_neon:
+	ldrb	w10,[x1,x2]
+	ldrb	w11,[x4,x2]
+	add	x2,x2,#1
+	eor	w10,w10,w11
+	strb	w10,[x0,x2]
+	cbnz	x2,.Loop_tail_neon
+
+	stp	xzr,xzr,[sp,#0]
+	stp	xzr,xzr,[sp,#16]
+	stp	xzr,xzr,[sp,#32]
+	stp	xzr,xzr,[sp,#48]
+
+.Ldone_neon:
+	ldp	x19,x20,[x29,#16]
+	add	sp,sp,#64
+	ldp	x21,x22,[x29,#32]
+	ldp	x23,x24,[x29,#48]
+	ldp	x25,x26,[x29,#64]
+	ldp	x27,x28,[x29,#80]
+	ldp	x29,x30,[sp],#96
+	ret
+.size	ChaCha20_neon,.-ChaCha20_neon
+.type	ChaCha20_512_neon,%function
+.align	5
+ChaCha20_512_neon:
+	stp	x29,x30,[sp,#-96]!
+	add	x29,sp,#0
+
+	adr	x5,.Lsigma
+	stp	x19,x20,[sp,#16]
+	stp	x21,x22,[sp,#32]
+	stp	x23,x24,[sp,#48]
+	stp	x25,x26,[sp,#64]
+	stp	x27,x28,[sp,#80]
+
+.L512_or_more_neon:
+	sub	sp,sp,#128+64
+
+	ldp	x22,x23,[x5]		// load sigma
+	ld1	{v24.4s},[x5],#16
+	ldp	x24,x25,[x3]		// load key
+	ldp	x26,x27,[x3,#16]
+	ld1	{v25.4s,v26.4s},[x3]
+	ldp	x28,x30,[x4]		// load counter
+	ld1	{v27.4s},[x4]
+	ld1	{v31.4s},[x5]
+#ifdef	__ARMEB__
+	rev64	v24.4s,v24.4s
+	ror	x24,x24,#32
+	ror	x25,x25,#32
+	ror	x26,x26,#32
+	ror	x27,x27,#32
+	ror	x28,x28,#32
+	ror	x30,x30,#32
+#endif
+	add	v27.4s,v27.4s,v31.4s		// += 1
+	stp	q24,q25,[sp,#0]		// off-load key block, invariant part
+	add	v27.4s,v27.4s,v31.4s		// not typo
+	str	q26,[sp,#32]
+	add	v28.4s,v27.4s,v31.4s
+	add	v29.4s,v28.4s,v31.4s
+	add	v30.4s,v29.4s,v31.4s
+	shl	v31.4s,v31.4s,#2			// 1 -> 4
+
+	stp	d8,d9,[sp,#128+0]		// meet ABI requirements
+	stp	d10,d11,[sp,#128+16]
+	stp	d12,d13,[sp,#128+32]
+	stp	d14,d15,[sp,#128+48]
+
+	sub	x2,x2,#512			// not typo
+
+.Loop_outer_512_neon:
+	mov	v0.16b,v24.16b
+	mov	v4.16b,v24.16b
+	mov	v8.16b,v24.16b
+	mov	v12.16b,v24.16b
+	mov	v16.16b,v24.16b
+	mov	v20.16b,v24.16b
+	mov	v1.16b,v25.16b
+	mov	w5,w22			// unpack key block
+	mov	v5.16b,v25.16b
+	lsr	x6,x22,#32
+	mov	v9.16b,v25.16b
+	mov	w7,w23
+	mov	v13.16b,v25.16b
+	lsr	x8,x23,#32
+	mov	v17.16b,v25.16b
+	mov	w9,w24
+	mov	v21.16b,v25.16b
+	lsr	x10,x24,#32
+	mov	v3.16b,v27.16b
+	mov	w11,w25
+	mov	v7.16b,v28.16b
+	lsr	x12,x25,#32
+	mov	v11.16b,v29.16b
+	mov	w13,w26
+	mov	v15.16b,v30.16b
+	lsr	x14,x26,#32
+	mov	v2.16b,v26.16b
+	mov	w15,w27
+	mov	v6.16b,v26.16b
+	lsr	x16,x27,#32
+	add	v19.4s,v3.4s,v31.4s			// +4
+	mov	w17,w28
+	add	v23.4s,v7.4s,v31.4s			// +4
+	lsr	x19,x28,#32
+	mov	v10.16b,v26.16b
+	mov	w20,w30
+	mov	v14.16b,v26.16b
+	lsr	x21,x30,#32
+	mov	v18.16b,v26.16b
+	stp	q27,q28,[sp,#48]		// off-load key block, variable part
+	mov	v22.16b,v26.16b
+	str	q29,[sp,#80]
+
+	mov	x4,#5
+	subs	x2,x2,#512
+.Loop_upper_neon:
+	sub	x4,x4,#1
+	add	v0.4s,v0.4s,v1.4s
+	add	w5,w5,w9
+	add	v4.4s,v4.4s,v5.4s
+	add	w6,w6,w10
+	add	v8.4s,v8.4s,v9.4s
+	add	w7,w7,w11
+	add	v12.4s,v12.4s,v13.4s
+	add	w8,w8,w12
+	add	v16.4s,v16.4s,v17.4s
+	eor	w17,w17,w5
+	add	v20.4s,v20.4s,v21.4s
+	eor	w19,w19,w6
+	eor	v3.16b,v3.16b,v0.16b
+	eor	w20,w20,w7
+	eor	v7.16b,v7.16b,v4.16b
+	eor	w21,w21,w8
+	eor	v11.16b,v11.16b,v8.16b
+	ror	w17,w17,#16
+	eor	v15.16b,v15.16b,v12.16b
+	ror	w19,w19,#16
+	eor	v19.16b,v19.16b,v16.16b
+	ror	w20,w20,#16
+	eor	v23.16b,v23.16b,v20.16b
+	ror	w21,w21,#16
+	rev32	v3.8h,v3.8h
+	add	w13,w13,w17
+	rev32	v7.8h,v7.8h
+	add	w14,w14,w19
+	rev32	v11.8h,v11.8h
+	add	w15,w15,w20
+	rev32	v15.8h,v15.8h
+	add	w16,w16,w21
+	rev32	v19.8h,v19.8h
+	eor	w9,w9,w13
+	rev32	v23.8h,v23.8h
+	eor	w10,w10,w14
+	add	v2.4s,v2.4s,v3.4s
+	eor	w11,w11,w15
+	add	v6.4s,v6.4s,v7.4s
+	eor	w12,w12,w16
+	add	v10.4s,v10.4s,v11.4s
+	ror	w9,w9,#20
+	add	v14.4s,v14.4s,v15.4s
+	ror	w10,w10,#20
+	add	v18.4s,v18.4s,v19.4s
+	ror	w11,w11,#20
+	add	v22.4s,v22.4s,v23.4s
+	ror	w12,w12,#20
+	eor	v24.16b,v1.16b,v2.16b
+	add	w5,w5,w9
+	eor	v25.16b,v5.16b,v6.16b
+	add	w6,w6,w10
+	eor	v26.16b,v9.16b,v10.16b
+	add	w7,w7,w11
+	eor	v27.16b,v13.16b,v14.16b
+	add	w8,w8,w12
+	eor	v28.16b,v17.16b,v18.16b
+	eor	w17,w17,w5
+	eor	v29.16b,v21.16b,v22.16b
+	eor	w19,w19,w6
+	ushr	v1.4s,v24.4s,#20
+	eor	w20,w20,w7
+	ushr	v5.4s,v25.4s,#20
+	eor	w21,w21,w8
+	ushr	v9.4s,v26.4s,#20
+	ror	w17,w17,#24
+	ushr	v13.4s,v27.4s,#20
+	ror	w19,w19,#24
+	ushr	v17.4s,v28.4s,#20
+	ror	w20,w20,#24
+	ushr	v21.4s,v29.4s,#20
+	ror	w21,w21,#24
+	sli	v1.4s,v24.4s,#12
+	add	w13,w13,w17
+	sli	v5.4s,v25.4s,#12
+	add	w14,w14,w19
+	sli	v9.4s,v26.4s,#12
+	add	w15,w15,w20
+	sli	v13.4s,v27.4s,#12
+	add	w16,w16,w21
+	sli	v17.4s,v28.4s,#12
+	eor	w9,w9,w13
+	sli	v21.4s,v29.4s,#12
+	eor	w10,w10,w14
+	add	v0.4s,v0.4s,v1.4s
+	eor	w11,w11,w15
+	add	v4.4s,v4.4s,v5.4s
+	eor	w12,w12,w16
+	add	v8.4s,v8.4s,v9.4s
+	ror	w9,w9,#25
+	add	v12.4s,v12.4s,v13.4s
+	ror	w10,w10,#25
+	add	v16.4s,v16.4s,v17.4s
+	ror	w11,w11,#25
+	add	v20.4s,v20.4s,v21.4s
+	ror	w12,w12,#25
+	eor	v24.16b,v3.16b,v0.16b
+	add	w5,w5,w10
+	eor	v25.16b,v7.16b,v4.16b
+	add	w6,w6,w11
+	eor	v26.16b,v11.16b,v8.16b
+	add	w7,w7,w12
+	eor	v27.16b,v15.16b,v12.16b
+	add	w8,w8,w9
+	eor	v28.16b,v19.16b,v16.16b
+	eor	w21,w21,w5
+	eor	v29.16b,v23.16b,v20.16b
+	eor	w17,w17,w6
+	ushr	v3.4s,v24.4s,#24
+	eor	w19,w19,w7
+	ushr	v7.4s,v25.4s,#24
+	eor	w20,w20,w8
+	ushr	v11.4s,v26.4s,#24
+	ror	w21,w21,#16
+	ushr	v15.4s,v27.4s,#24
+	ror	w17,w17,#16
+	ushr	v19.4s,v28.4s,#24
+	ror	w19,w19,#16
+	ushr	v23.4s,v29.4s,#24
+	ror	w20,w20,#16
+	sli	v3.4s,v24.4s,#8
+	add	w15,w15,w21
+	sli	v7.4s,v25.4s,#8
+	add	w16,w16,w17
+	sli	v11.4s,v26.4s,#8
+	add	w13,w13,w19
+	sli	v15.4s,v27.4s,#8
+	add	w14,w14,w20
+	sli	v19.4s,v28.4s,#8
+	eor	w10,w10,w15
+	sli	v23.4s,v29.4s,#8
+	eor	w11,w11,w16
+	add	v2.4s,v2.4s,v3.4s
+	eor	w12,w12,w13
+	add	v6.4s,v6.4s,v7.4s
+	eor	w9,w9,w14
+	add	v10.4s,v10.4s,v11.4s
+	ror	w10,w10,#20
+	add	v14.4s,v14.4s,v15.4s
+	ror	w11,w11,#20
+	add	v18.4s,v18.4s,v19.4s
+	ror	w12,w12,#20
+	add	v22.4s,v22.4s,v23.4s
+	ror	w9,w9,#20
+	eor	v24.16b,v1.16b,v2.16b
+	add	w5,w5,w10
+	eor	v25.16b,v5.16b,v6.16b
+	add	w6,w6,w11
+	eor	v26.16b,v9.16b,v10.16b
+	add	w7,w7,w12
+	eor	v27.16b,v13.16b,v14.16b
+	add	w8,w8,w9
+	eor	v28.16b,v17.16b,v18.16b
+	eor	w21,w21,w5
+	eor	v29.16b,v21.16b,v22.16b
+	eor	w17,w17,w6
+	ushr	v1.4s,v24.4s,#25
+	eor	w19,w19,w7
+	ushr	v5.4s,v25.4s,#25
+	eor	w20,w20,w8
+	ushr	v9.4s,v26.4s,#25
+	ror	w21,w21,#24
+	ushr	v13.4s,v27.4s,#25
+	ror	w17,w17,#24
+	ushr	v17.4s,v28.4s,#25
+	ror	w19,w19,#24
+	ushr	v21.4s,v29.4s,#25
+	ror	w20,w20,#24
+	sli	v1.4s,v24.4s,#7
+	add	w15,w15,w21
+	sli	v5.4s,v25.4s,#7
+	add	w16,w16,w17
+	sli	v9.4s,v26.4s,#7
+	add	w13,w13,w19
+	sli	v13.4s,v27.4s,#7
+	add	w14,w14,w20
+	sli	v17.4s,v28.4s,#7
+	eor	w10,w10,w15
+	sli	v21.4s,v29.4s,#7
+	eor	w11,w11,w16
+	ext	v2.16b,v2.16b,v2.16b,#8
+	eor	w12,w12,w13
+	ext	v6.16b,v6.16b,v6.16b,#8
+	eor	w9,w9,w14
+	ext	v10.16b,v10.16b,v10.16b,#8
+	ror	w10,w10,#25
+	ext	v14.16b,v14.16b,v14.16b,#8
+	ror	w11,w11,#25
+	ext	v18.16b,v18.16b,v18.16b,#8
+	ror	w12,w12,#25
+	ext	v22.16b,v22.16b,v22.16b,#8
+	ror	w9,w9,#25
+	ext	v3.16b,v3.16b,v3.16b,#12
+	ext	v7.16b,v7.16b,v7.16b,#12
+	ext	v11.16b,v11.16b,v11.16b,#12
+	ext	v15.16b,v15.16b,v15.16b,#12
+	ext	v19.16b,v19.16b,v19.16b,#12
+	ext	v23.16b,v23.16b,v23.16b,#12
+	ext	v1.16b,v1.16b,v1.16b,#4
+	ext	v5.16b,v5.16b,v5.16b,#4
+	ext	v9.16b,v9.16b,v9.16b,#4
+	ext	v13.16b,v13.16b,v13.16b,#4
+	ext	v17.16b,v17.16b,v17.16b,#4
+	ext	v21.16b,v21.16b,v21.16b,#4
+	add	v0.4s,v0.4s,v1.4s
+	add	w5,w5,w9
+	add	v4.4s,v4.4s,v5.4s
+	add	w6,w6,w10
+	add	v8.4s,v8.4s,v9.4s
+	add	w7,w7,w11
+	add	v12.4s,v12.4s,v13.4s
+	add	w8,w8,w12
+	add	v16.4s,v16.4s,v17.4s
+	eor	w17,w17,w5
+	add	v20.4s,v20.4s,v21.4s
+	eor	w19,w19,w6
+	eor	v3.16b,v3.16b,v0.16b
+	eor	w20,w20,w7
+	eor	v7.16b,v7.16b,v4.16b
+	eor	w21,w21,w8
+	eor	v11.16b,v11.16b,v8.16b
+	ror	w17,w17,#16
+	eor	v15.16b,v15.16b,v12.16b
+	ror	w19,w19,#16
+	eor	v19.16b,v19.16b,v16.16b
+	ror	w20,w20,#16
+	eor	v23.16b,v23.16b,v20.16b
+	ror	w21,w21,#16
+	rev32	v3.8h,v3.8h
+	add	w13,w13,w17
+	rev32	v7.8h,v7.8h
+	add	w14,w14,w19
+	rev32	v11.8h,v11.8h
+	add	w15,w15,w20
+	rev32	v15.8h,v15.8h
+	add	w16,w16,w21
+	rev32	v19.8h,v19.8h
+	eor	w9,w9,w13
+	rev32	v23.8h,v23.8h
+	eor	w10,w10,w14
+	add	v2.4s,v2.4s,v3.4s
+	eor	w11,w11,w15
+	add	v6.4s,v6.4s,v7.4s
+	eor	w12,w12,w16
+	add	v10.4s,v10.4s,v11.4s
+	ror	w9,w9,#20
+	add	v14.4s,v14.4s,v15.4s
+	ror	w10,w10,#20
+	add	v18.4s,v18.4s,v19.4s
+	ror	w11,w11,#20
+	add	v22.4s,v22.4s,v23.4s
+	ror	w12,w12,#20
+	eor	v24.16b,v1.16b,v2.16b
+	add	w5,w5,w9
+	eor	v25.16b,v5.16b,v6.16b
+	add	w6,w6,w10
+	eor	v26.16b,v9.16b,v10.16b
+	add	w7,w7,w11
+	eor	v27.16b,v13.16b,v14.16b
+	add	w8,w8,w12
+	eor	v28.16b,v17.16b,v18.16b
+	eor	w17,w17,w5
+	eor	v29.16b,v21.16b,v22.16b
+	eor	w19,w19,w6
+	ushr	v1.4s,v24.4s,#20
+	eor	w20,w20,w7
+	ushr	v5.4s,v25.4s,#20
+	eor	w21,w21,w8
+	ushr	v9.4s,v26.4s,#20
+	ror	w17,w17,#24
+	ushr	v13.4s,v27.4s,#20
+	ror	w19,w19,#24
+	ushr	v17.4s,v28.4s,#20
+	ror	w20,w20,#24
+	ushr	v21.4s,v29.4s,#20
+	ror	w21,w21,#24
+	sli	v1.4s,v24.4s,#12
+	add	w13,w13,w17
+	sli	v5.4s,v25.4s,#12
+	add	w14,w14,w19
+	sli	v9.4s,v26.4s,#12
+	add	w15,w15,w20
+	sli	v13.4s,v27.4s,#12
+	add	w16,w16,w21
+	sli	v17.4s,v28.4s,#12
+	eor	w9,w9,w13
+	sli	v21.4s,v29.4s,#12
+	eor	w10,w10,w14
+	add	v0.4s,v0.4s,v1.4s
+	eor	w11,w11,w15
+	add	v4.4s,v4.4s,v5.4s
+	eor	w12,w12,w16
+	add	v8.4s,v8.4s,v9.4s
+	ror	w9,w9,#25
+	add	v12.4s,v12.4s,v13.4s
+	ror	w10,w10,#25
+	add	v16.4s,v16.4s,v17.4s
+	ror	w11,w11,#25
+	add	v20.4s,v20.4s,v21.4s
+	ror	w12,w12,#25
+	eor	v24.16b,v3.16b,v0.16b
+	add	w5,w5,w10
+	eor	v25.16b,v7.16b,v4.16b
+	add	w6,w6,w11
+	eor	v26.16b,v11.16b,v8.16b
+	add	w7,w7,w12
+	eor	v27.16b,v15.16b,v12.16b
+	add	w8,w8,w9
+	eor	v28.16b,v19.16b,v16.16b
+	eor	w21,w21,w5
+	eor	v29.16b,v23.16b,v20.16b
+	eor	w17,w17,w6
+	ushr	v3.4s,v24.4s,#24
+	eor	w19,w19,w7
+	ushr	v7.4s,v25.4s,#24
+	eor	w20,w20,w8
+	ushr	v11.4s,v26.4s,#24
+	ror	w21,w21,#16
+	ushr	v15.4s,v27.4s,#24
+	ror	w17,w17,#16
+	ushr	v19.4s,v28.4s,#24
+	ror	w19,w19,#16
+	ushr	v23.4s,v29.4s,#24
+	ror	w20,w20,#16
+	sli	v3.4s,v24.4s,#8
+	add	w15,w15,w21
+	sli	v7.4s,v25.4s,#8
+	add	w16,w16,w17
+	sli	v11.4s,v26.4s,#8
+	add	w13,w13,w19
+	sli	v15.4s,v27.4s,#8
+	add	w14,w14,w20
+	sli	v19.4s,v28.4s,#8
+	eor	w10,w10,w15
+	sli	v23.4s,v29.4s,#8
+	eor	w11,w11,w16
+	add	v2.4s,v2.4s,v3.4s
+	eor	w12,w12,w13
+	add	v6.4s,v6.4s,v7.4s
+	eor	w9,w9,w14
+	add	v10.4s,v10.4s,v11.4s
+	ror	w10,w10,#20
+	add	v14.4s,v14.4s,v15.4s
+	ror	w11,w11,#20
+	add	v18.4s,v18.4s,v19.4s
+	ror	w12,w12,#20
+	add	v22.4s,v22.4s,v23.4s
+	ror	w9,w9,#20
+	eor	v24.16b,v1.16b,v2.16b
+	add	w5,w5,w10
+	eor	v25.16b,v5.16b,v6.16b
+	add	w6,w6,w11
+	eor	v26.16b,v9.16b,v10.16b
+	add	w7,w7,w12
+	eor	v27.16b,v13.16b,v14.16b
+	add	w8,w8,w9
+	eor	v28.16b,v17.16b,v18.16b
+	eor	w21,w21,w5
+	eor	v29.16b,v21.16b,v22.16b
+	eor	w17,w17,w6
+	ushr	v1.4s,v24.4s,#25
+	eor	w19,w19,w7
+	ushr	v5.4s,v25.4s,#25
+	eor	w20,w20,w8
+	ushr	v9.4s,v26.4s,#25
+	ror	w21,w21,#24
+	ushr	v13.4s,v27.4s,#25
+	ror	w17,w17,#24
+	ushr	v17.4s,v28.4s,#25
+	ror	w19,w19,#24
+	ushr	v21.4s,v29.4s,#25
+	ror	w20,w20,#24
+	sli	v1.4s,v24.4s,#7
+	add	w15,w15,w21
+	sli	v5.4s,v25.4s,#7
+	add	w16,w16,w17
+	sli	v9.4s,v26.4s,#7
+	add	w13,w13,w19
+	sli	v13.4s,v27.4s,#7
+	add	w14,w14,w20
+	sli	v17.4s,v28.4s,#7
+	eor	w10,w10,w15
+	sli	v21.4s,v29.4s,#7
+	eor	w11,w11,w16
+	ext	v2.16b,v2.16b,v2.16b,#8
+	eor	w12,w12,w13
+	ext	v6.16b,v6.16b,v6.16b,#8
+	eor	w9,w9,w14
+	ext	v10.16b,v10.16b,v10.16b,#8
+	ror	w10,w10,#25
+	ext	v14.16b,v14.16b,v14.16b,#8
+	ror	w11,w11,#25
+	ext	v18.16b,v18.16b,v18.16b,#8
+	ror	w12,w12,#25
+	ext	v22.16b,v22.16b,v22.16b,#8
+	ror	w9,w9,#25
+	ext	v3.16b,v3.16b,v3.16b,#4
+	ext	v7.16b,v7.16b,v7.16b,#4
+	ext	v11.16b,v11.16b,v11.16b,#4
+	ext	v15.16b,v15.16b,v15.16b,#4
+	ext	v19.16b,v19.16b,v19.16b,#4
+	ext	v23.16b,v23.16b,v23.16b,#4
+	ext	v1.16b,v1.16b,v1.16b,#12
+	ext	v5.16b,v5.16b,v5.16b,#12
+	ext	v9.16b,v9.16b,v9.16b,#12
+	ext	v13.16b,v13.16b,v13.16b,#12
+	ext	v17.16b,v17.16b,v17.16b,#12
+	ext	v21.16b,v21.16b,v21.16b,#12
+	cbnz	x4,.Loop_upper_neon
+
+	add	w5,w5,w22		// accumulate key block
+	add	x6,x6,x22,lsr#32
+	add	w7,w7,w23
+	add	x8,x8,x23,lsr#32
+	add	w9,w9,w24
+	add	x10,x10,x24,lsr#32
+	add	w11,w11,w25
+	add	x12,x12,x25,lsr#32
+	add	w13,w13,w26
+	add	x14,x14,x26,lsr#32
+	add	w15,w15,w27
+	add	x16,x16,x27,lsr#32
+	add	w17,w17,w28
+	add	x19,x19,x28,lsr#32
+	add	w20,w20,w30
+	add	x21,x21,x30,lsr#32
+
+	add	x5,x5,x6,lsl#32	// pack
+	add	x7,x7,x8,lsl#32
+	ldp	x6,x8,[x1,#0]		// load input
+	add	x9,x9,x10,lsl#32
+	add	x11,x11,x12,lsl#32
+	ldp	x10,x12,[x1,#16]
+	add	x13,x13,x14,lsl#32
+	add	x15,x15,x16,lsl#32
+	ldp	x14,x16,[x1,#32]
+	add	x17,x17,x19,lsl#32
+	add	x20,x20,x21,lsl#32
+	ldp	x19,x21,[x1,#48]
+	add	x1,x1,#64
+#ifdef	__ARMEB__
+	rev	x5,x5
+	rev	x7,x7
+	rev	x9,x9
+	rev	x11,x11
+	rev	x13,x13
+	rev	x15,x15
+	rev	x17,x17
+	rev	x20,x20
+#endif
+	eor	x5,x5,x6
+	eor	x7,x7,x8
+	eor	x9,x9,x10
+	eor	x11,x11,x12
+	eor	x13,x13,x14
+	eor	x15,x15,x16
+	eor	x17,x17,x19
+	eor	x20,x20,x21
+
+	stp	x5,x7,[x0,#0]		// store output
+	add	x28,x28,#1			// increment counter
+	mov	w5,w22			// unpack key block
+	lsr	x6,x22,#32
+	stp	x9,x11,[x0,#16]
+	mov	w7,w23
+	lsr	x8,x23,#32
+	stp	x13,x15,[x0,#32]
+	mov	w9,w24
+	lsr	x10,x24,#32
+	stp	x17,x20,[x0,#48]
+	add	x0,x0,#64
+	mov	w11,w25
+	lsr	x12,x25,#32
+	mov	w13,w26
+	lsr	x14,x26,#32
+	mov	w15,w27
+	lsr	x16,x27,#32
+	mov	w17,w28
+	lsr	x19,x28,#32
+	mov	w20,w30
+	lsr	x21,x30,#32
+
+	mov	x4,#5
+.Loop_lower_neon:
+	sub	x4,x4,#1
+	add	v0.4s,v0.4s,v1.4s
+	add	w5,w5,w9
+	add	v4.4s,v4.4s,v5.4s
+	add	w6,w6,w10
+	add	v8.4s,v8.4s,v9.4s
+	add	w7,w7,w11
+	add	v12.4s,v12.4s,v13.4s
+	add	w8,w8,w12
+	add	v16.4s,v16.4s,v17.4s
+	eor	w17,w17,w5
+	add	v20.4s,v20.4s,v21.4s
+	eor	w19,w19,w6
+	eor	v3.16b,v3.16b,v0.16b
+	eor	w20,w20,w7
+	eor	v7.16b,v7.16b,v4.16b
+	eor	w21,w21,w8
+	eor	v11.16b,v11.16b,v8.16b
+	ror	w17,w17,#16
+	eor	v15.16b,v15.16b,v12.16b
+	ror	w19,w19,#16
+	eor	v19.16b,v19.16b,v16.16b
+	ror	w20,w20,#16
+	eor	v23.16b,v23.16b,v20.16b
+	ror	w21,w21,#16
+	rev32	v3.8h,v3.8h
+	add	w13,w13,w17
+	rev32	v7.8h,v7.8h
+	add	w14,w14,w19
+	rev32	v11.8h,v11.8h
+	add	w15,w15,w20
+	rev32	v15.8h,v15.8h
+	add	w16,w16,w21
+	rev32	v19.8h,v19.8h
+	eor	w9,w9,w13
+	rev32	v23.8h,v23.8h
+	eor	w10,w10,w14
+	add	v2.4s,v2.4s,v3.4s
+	eor	w11,w11,w15
+	add	v6.4s,v6.4s,v7.4s
+	eor	w12,w12,w16
+	add	v10.4s,v10.4s,v11.4s
+	ror	w9,w9,#20
+	add	v14.4s,v14.4s,v15.4s
+	ror	w10,w10,#20
+	add	v18.4s,v18.4s,v19.4s
+	ror	w11,w11,#20
+	add	v22.4s,v22.4s,v23.4s
+	ror	w12,w12,#20
+	eor	v24.16b,v1.16b,v2.16b
+	add	w5,w5,w9
+	eor	v25.16b,v5.16b,v6.16b
+	add	w6,w6,w10
+	eor	v26.16b,v9.16b,v10.16b
+	add	w7,w7,w11
+	eor	v27.16b,v13.16b,v14.16b
+	add	w8,w8,w12
+	eor	v28.16b,v17.16b,v18.16b
+	eor	w17,w17,w5
+	eor	v29.16b,v21.16b,v22.16b
+	eor	w19,w19,w6
+	ushr	v1.4s,v24.4s,#20
+	eor	w20,w20,w7
+	ushr	v5.4s,v25.4s,#20
+	eor	w21,w21,w8
+	ushr	v9.4s,v26.4s,#20
+	ror	w17,w17,#24
+	ushr	v13.4s,v27.4s,#20
+	ror	w19,w19,#24
+	ushr	v17.4s,v28.4s,#20
+	ror	w20,w20,#24
+	ushr	v21.4s,v29.4s,#20
+	ror	w21,w21,#24
+	sli	v1.4s,v24.4s,#12
+	add	w13,w13,w17
+	sli	v5.4s,v25.4s,#12
+	add	w14,w14,w19
+	sli	v9.4s,v26.4s,#12
+	add	w15,w15,w20
+	sli	v13.4s,v27.4s,#12
+	add	w16,w16,w21
+	sli	v17.4s,v28.4s,#12
+	eor	w9,w9,w13
+	sli	v21.4s,v29.4s,#12
+	eor	w10,w10,w14
+	add	v0.4s,v0.4s,v1.4s
+	eor	w11,w11,w15
+	add	v4.4s,v4.4s,v5.4s
+	eor	w12,w12,w16
+	add	v8.4s,v8.4s,v9.4s
+	ror	w9,w9,#25
+	add	v12.4s,v12.4s,v13.4s
+	ror	w10,w10,#25
+	add	v16.4s,v16.4s,v17.4s
+	ror	w11,w11,#25
+	add	v20.4s,v20.4s,v21.4s
+	ror	w12,w12,#25
+	eor	v24.16b,v3.16b,v0.16b
+	add	w5,w5,w10
+	eor	v25.16b,v7.16b,v4.16b
+	add	w6,w6,w11
+	eor	v26.16b,v11.16b,v8.16b
+	add	w7,w7,w12
+	eor	v27.16b,v15.16b,v12.16b
+	add	w8,w8,w9
+	eor	v28.16b,v19.16b,v16.16b
+	eor	w21,w21,w5
+	eor	v29.16b,v23.16b,v20.16b
+	eor	w17,w17,w6
+	ushr	v3.4s,v24.4s,#24
+	eor	w19,w19,w7
+	ushr	v7.4s,v25.4s,#24
+	eor	w20,w20,w8
+	ushr	v11.4s,v26.4s,#24
+	ror	w21,w21,#16
+	ushr	v15.4s,v27.4s,#24
+	ror	w17,w17,#16
+	ushr	v19.4s,v28.4s,#24
+	ror	w19,w19,#16
+	ushr	v23.4s,v29.4s,#24
+	ror	w20,w20,#16
+	sli	v3.4s,v24.4s,#8
+	add	w15,w15,w21
+	sli	v7.4s,v25.4s,#8
+	add	w16,w16,w17
+	sli	v11.4s,v26.4s,#8
+	add	w13,w13,w19
+	sli	v15.4s,v27.4s,#8
+	add	w14,w14,w20
+	sli	v19.4s,v28.4s,#8
+	eor	w10,w10,w15
+	sli	v23.4s,v29.4s,#8
+	eor	w11,w11,w16
+	add	v2.4s,v2.4s,v3.4s
+	eor	w12,w12,w13
+	add	v6.4s,v6.4s,v7.4s
+	eor	w9,w9,w14
+	add	v10.4s,v10.4s,v11.4s
+	ror	w10,w10,#20
+	add	v14.4s,v14.4s,v15.4s
+	ror	w11,w11,#20
+	add	v18.4s,v18.4s,v19.4s
+	ror	w12,w12,#20
+	add	v22.4s,v22.4s,v23.4s
+	ror	w9,w9,#20
+	eor	v24.16b,v1.16b,v2.16b
+	add	w5,w5,w10
+	eor	v25.16b,v5.16b,v6.16b
+	add	w6,w6,w11
+	eor	v26.16b,v9.16b,v10.16b
+	add	w7,w7,w12
+	eor	v27.16b,v13.16b,v14.16b
+	add	w8,w8,w9
+	eor	v28.16b,v17.16b,v18.16b
+	eor	w21,w21,w5
+	eor	v29.16b,v21.16b,v22.16b
+	eor	w17,w17,w6
+	ushr	v1.4s,v24.4s,#25
+	eor	w19,w19,w7
+	ushr	v5.4s,v25.4s,#25
+	eor	w20,w20,w8
+	ushr	v9.4s,v26.4s,#25
+	ror	w21,w21,#24
+	ushr	v13.4s,v27.4s,#25
+	ror	w17,w17,#24
+	ushr	v17.4s,v28.4s,#25
+	ror	w19,w19,#24
+	ushr	v21.4s,v29.4s,#25
+	ror	w20,w20,#24
+	sli	v1.4s,v24.4s,#7
+	add	w15,w15,w21
+	sli	v5.4s,v25.4s,#7
+	add	w16,w16,w17
+	sli	v9.4s,v26.4s,#7
+	add	w13,w13,w19
+	sli	v13.4s,v27.4s,#7
+	add	w14,w14,w20
+	sli	v17.4s,v28.4s,#7
+	eor	w10,w10,w15
+	sli	v21.4s,v29.4s,#7
+	eor	w11,w11,w16
+	ext	v2.16b,v2.16b,v2.16b,#8
+	eor	w12,w12,w13
+	ext	v6.16b,v6.16b,v6.16b,#8
+	eor	w9,w9,w14
+	ext	v10.16b,v10.16b,v10.16b,#8
+	ror	w10,w10,#25
+	ext	v14.16b,v14.16b,v14.16b,#8
+	ror	w11,w11,#25
+	ext	v18.16b,v18.16b,v18.16b,#8
+	ror	w12,w12,#25
+	ext	v22.16b,v22.16b,v22.16b,#8
+	ror	w9,w9,#25
+	ext	v3.16b,v3.16b,v3.16b,#12
+	ext	v7.16b,v7.16b,v7.16b,#12
+	ext	v11.16b,v11.16b,v11.16b,#12
+	ext	v15.16b,v15.16b,v15.16b,#12
+	ext	v19.16b,v19.16b,v19.16b,#12
+	ext	v23.16b,v23.16b,v23.16b,#12
+	ext	v1.16b,v1.16b,v1.16b,#4
+	ext	v5.16b,v5.16b,v5.16b,#4
+	ext	v9.16b,v9.16b,v9.16b,#4
+	ext	v13.16b,v13.16b,v13.16b,#4
+	ext	v17.16b,v17.16b,v17.16b,#4
+	ext	v21.16b,v21.16b,v21.16b,#4
+	add	v0.4s,v0.4s,v1.4s
+	add	w5,w5,w9
+	add	v4.4s,v4.4s,v5.4s
+	add	w6,w6,w10
+	add	v8.4s,v8.4s,v9.4s
+	add	w7,w7,w11
+	add	v12.4s,v12.4s,v13.4s
+	add	w8,w8,w12
+	add	v16.4s,v16.4s,v17.4s
+	eor	w17,w17,w5
+	add	v20.4s,v20.4s,v21.4s
+	eor	w19,w19,w6
+	eor	v3.16b,v3.16b,v0.16b
+	eor	w20,w20,w7
+	eor	v7.16b,v7.16b,v4.16b
+	eor	w21,w21,w8
+	eor	v11.16b,v11.16b,v8.16b
+	ror	w17,w17,#16
+	eor	v15.16b,v15.16b,v12.16b
+	ror	w19,w19,#16
+	eor	v19.16b,v19.16b,v16.16b
+	ror	w20,w20,#16
+	eor	v23.16b,v23.16b,v20.16b
+	ror	w21,w21,#16
+	rev32	v3.8h,v3.8h
+	add	w13,w13,w17
+	rev32	v7.8h,v7.8h
+	add	w14,w14,w19
+	rev32	v11.8h,v11.8h
+	add	w15,w15,w20
+	rev32	v15.8h,v15.8h
+	add	w16,w16,w21
+	rev32	v19.8h,v19.8h
+	eor	w9,w9,w13
+	rev32	v23.8h,v23.8h
+	eor	w10,w10,w14
+	add	v2.4s,v2.4s,v3.4s
+	eor	w11,w11,w15
+	add	v6.4s,v6.4s,v7.4s
+	eor	w12,w12,w16
+	add	v10.4s,v10.4s,v11.4s
+	ror	w9,w9,#20
+	add	v14.4s,v14.4s,v15.4s
+	ror	w10,w10,#20
+	add	v18.4s,v18.4s,v19.4s
+	ror	w11,w11,#20
+	add	v22.4s,v22.4s,v23.4s
+	ror	w12,w12,#20
+	eor	v24.16b,v1.16b,v2.16b
+	add	w5,w5,w9
+	eor	v25.16b,v5.16b,v6.16b
+	add	w6,w6,w10
+	eor	v26.16b,v9.16b,v10.16b
+	add	w7,w7,w11
+	eor	v27.16b,v13.16b,v14.16b
+	add	w8,w8,w12
+	eor	v28.16b,v17.16b,v18.16b
+	eor	w17,w17,w5
+	eor	v29.16b,v21.16b,v22.16b
+	eor	w19,w19,w6
+	ushr	v1.4s,v24.4s,#20
+	eor	w20,w20,w7
+	ushr	v5.4s,v25.4s,#20
+	eor	w21,w21,w8
+	ushr	v9.4s,v26.4s,#20
+	ror	w17,w17,#24
+	ushr	v13.4s,v27.4s,#20
+	ror	w19,w19,#24
+	ushr	v17.4s,v28.4s,#20
+	ror	w20,w20,#24
+	ushr	v21.4s,v29.4s,#20
+	ror	w21,w21,#24
+	sli	v1.4s,v24.4s,#12
+	add	w13,w13,w17
+	sli	v5.4s,v25.4s,#12
+	add	w14,w14,w19
+	sli	v9.4s,v26.4s,#12
+	add	w15,w15,w20
+	sli	v13.4s,v27.4s,#12
+	add	w16,w16,w21
+	sli	v17.4s,v28.4s,#12
+	eor	w9,w9,w13
+	sli	v21.4s,v29.4s,#12
+	eor	w10,w10,w14
+	add	v0.4s,v0.4s,v1.4s
+	eor	w11,w11,w15
+	add	v4.4s,v4.4s,v5.4s
+	eor	w12,w12,w16
+	add	v8.4s,v8.4s,v9.4s
+	ror	w9,w9,#25
+	add	v12.4s,v12.4s,v13.4s
+	ror	w10,w10,#25
+	add	v16.4s,v16.4s,v17.4s
+	ror	w11,w11,#25
+	add	v20.4s,v20.4s,v21.4s
+	ror	w12,w12,#25
+	eor	v24.16b,v3.16b,v0.16b
+	add	w5,w5,w10
+	eor	v25.16b,v7.16b,v4.16b
+	add	w6,w6,w11
+	eor	v26.16b,v11.16b,v8.16b
+	add	w7,w7,w12
+	eor	v27.16b,v15.16b,v12.16b
+	add	w8,w8,w9
+	eor	v28.16b,v19.16b,v16.16b
+	eor	w21,w21,w5
+	eor	v29.16b,v23.16b,v20.16b
+	eor	w17,w17,w6
+	ushr	v3.4s,v24.4s,#24
+	eor	w19,w19,w7
+	ushr	v7.4s,v25.4s,#24
+	eor	w20,w20,w8
+	ushr	v11.4s,v26.4s,#24
+	ror	w21,w21,#16
+	ushr	v15.4s,v27.4s,#24
+	ror	w17,w17,#16
+	ushr	v19.4s,v28.4s,#24
+	ror	w19,w19,#16
+	ushr	v23.4s,v29.4s,#24
+	ror	w20,w20,#16
+	sli	v3.4s,v24.4s,#8
+	add	w15,w15,w21
+	sli	v7.4s,v25.4s,#8
+	add	w16,w16,w17
+	sli	v11.4s,v26.4s,#8
+	add	w13,w13,w19
+	sli	v15.4s,v27.4s,#8
+	add	w14,w14,w20
+	sli	v19.4s,v28.4s,#8
+	eor	w10,w10,w15
+	sli	v23.4s,v29.4s,#8
+	eor	w11,w11,w16
+	add	v2.4s,v2.4s,v3.4s
+	eor	w12,w12,w13
+	add	v6.4s,v6.4s,v7.4s
+	eor	w9,w9,w14
+	add	v10.4s,v10.4s,v11.4s
+	ror	w10,w10,#20
+	add	v14.4s,v14.4s,v15.4s
+	ror	w11,w11,#20
+	add	v18.4s,v18.4s,v19.4s
+	ror	w12,w12,#20
+	add	v22.4s,v22.4s,v23.4s
+	ror	w9,w9,#20
+	eor	v24.16b,v1.16b,v2.16b
+	add	w5,w5,w10
+	eor	v25.16b,v5.16b,v6.16b
+	add	w6,w6,w11
+	eor	v26.16b,v9.16b,v10.16b
+	add	w7,w7,w12
+	eor	v27.16b,v13.16b,v14.16b
+	add	w8,w8,w9
+	eor	v28.16b,v17.16b,v18.16b
+	eor	w21,w21,w5
+	eor	v29.16b,v21.16b,v22.16b
+	eor	w17,w17,w6
+	ushr	v1.4s,v24.4s,#25
+	eor	w19,w19,w7
+	ushr	v5.4s,v25.4s,#25
+	eor	w20,w20,w8
+	ushr	v9.4s,v26.4s,#25
+	ror	w21,w21,#24
+	ushr	v13.4s,v27.4s,#25
+	ror	w17,w17,#24
+	ushr	v17.4s,v28.4s,#25
+	ror	w19,w19,#24
+	ushr	v21.4s,v29.4s,#25
+	ror	w20,w20,#24
+	sli	v1.4s,v24.4s,#7
+	add	w15,w15,w21
+	sli	v5.4s,v25.4s,#7
+	add	w16,w16,w17
+	sli	v9.4s,v26.4s,#7
+	add	w13,w13,w19
+	sli	v13.4s,v27.4s,#7
+	add	w14,w14,w20
+	sli	v17.4s,v28.4s,#7
+	eor	w10,w10,w15
+	sli	v21.4s,v29.4s,#7
+	eor	w11,w11,w16
+	ext	v2.16b,v2.16b,v2.16b,#8
+	eor	w12,w12,w13
+	ext	v6.16b,v6.16b,v6.16b,#8
+	eor	w9,w9,w14
+	ext	v10.16b,v10.16b,v10.16b,#8
+	ror	w10,w10,#25
+	ext	v14.16b,v14.16b,v14.16b,#8
+	ror	w11,w11,#25
+	ext	v18.16b,v18.16b,v18.16b,#8
+	ror	w12,w12,#25
+	ext	v22.16b,v22.16b,v22.16b,#8
+	ror	w9,w9,#25
+	ext	v3.16b,v3.16b,v3.16b,#4
+	ext	v7.16b,v7.16b,v7.16b,#4
+	ext	v11.16b,v11.16b,v11.16b,#4
+	ext	v15.16b,v15.16b,v15.16b,#4
+	ext	v19.16b,v19.16b,v19.16b,#4
+	ext	v23.16b,v23.16b,v23.16b,#4
+	ext	v1.16b,v1.16b,v1.16b,#12
+	ext	v5.16b,v5.16b,v5.16b,#12
+	ext	v9.16b,v9.16b,v9.16b,#12
+	ext	v13.16b,v13.16b,v13.16b,#12
+	ext	v17.16b,v17.16b,v17.16b,#12
+	ext	v21.16b,v21.16b,v21.16b,#12
+	cbnz	x4,.Loop_lower_neon
+
+	add	w5,w5,w22		// accumulate key block
+	ldp	q24,q25,[sp,#0]
+	add	x6,x6,x22,lsr#32
+	ldp	q26,q27,[sp,#32]
+	add	w7,w7,w23
+	ldp	q28,q29,[sp,#64]
+	add	x8,x8,x23,lsr#32
+	add	v0.4s,v0.4s,v24.4s
+	add	w9,w9,w24
+	add	v4.4s,v4.4s,v24.4s
+	add	x10,x10,x24,lsr#32
+	add	v8.4s,v8.4s,v24.4s
+	add	w11,w11,w25
+	add	v12.4s,v12.4s,v24.4s
+	add	x12,x12,x25,lsr#32
+	add	v16.4s,v16.4s,v24.4s
+	add	w13,w13,w26
+	add	v20.4s,v20.4s,v24.4s
+	add	x14,x14,x26,lsr#32
+	add	v2.4s,v2.4s,v26.4s
+	add	w15,w15,w27
+	add	v6.4s,v6.4s,v26.4s
+	add	x16,x16,x27,lsr#32
+	add	v10.4s,v10.4s,v26.4s
+	add	w17,w17,w28
+	add	v14.4s,v14.4s,v26.4s
+	add	x19,x19,x28,lsr#32
+	add	v18.4s,v18.4s,v26.4s
+	add	w20,w20,w30
+	add	v22.4s,v22.4s,v26.4s
+	add	x21,x21,x30,lsr#32
+	add	v19.4s,v19.4s,v31.4s			// +4
+	add	x5,x5,x6,lsl#32	// pack
+	add	v23.4s,v23.4s,v31.4s			// +4
+	add	x7,x7,x8,lsl#32
+	add	v3.4s,v3.4s,v27.4s
+	ldp	x6,x8,[x1,#0]		// load input
+	add	v7.4s,v7.4s,v28.4s
+	add	x9,x9,x10,lsl#32
+	add	v11.4s,v11.4s,v29.4s
+	add	x11,x11,x12,lsl#32
+	add	v15.4s,v15.4s,v30.4s
+	ldp	x10,x12,[x1,#16]
+	add	v19.4s,v19.4s,v27.4s
+	add	x13,x13,x14,lsl#32
+	add	v23.4s,v23.4s,v28.4s
+	add	x15,x15,x16,lsl#32
+	add	v1.4s,v1.4s,v25.4s
+	ldp	x14,x16,[x1,#32]
+	add	v5.4s,v5.4s,v25.4s
+	add	x17,x17,x19,lsl#32
+	add	v9.4s,v9.4s,v25.4s
+	add	x20,x20,x21,lsl#32
+	add	v13.4s,v13.4s,v25.4s
+	ldp	x19,x21,[x1,#48]
+	add	v17.4s,v17.4s,v25.4s
+	add	x1,x1,#64
+	add	v21.4s,v21.4s,v25.4s
+
+#ifdef	__ARMEB__
+	rev	x5,x5
+	rev	x7,x7
+	rev	x9,x9
+	rev	x11,x11
+	rev	x13,x13
+	rev	x15,x15
+	rev	x17,x17
+	rev	x20,x20
+#endif
+	ld1	{v24.16b,v25.16b,v26.16b,v27.16b},[x1],#64
+	eor	x5,x5,x6
+	eor	x7,x7,x8
+	eor	x9,x9,x10
+	eor	x11,x11,x12
+	eor	x13,x13,x14
+	eor	v0.16b,v0.16b,v24.16b
+	eor	x15,x15,x16
+	eor	v1.16b,v1.16b,v25.16b
+	eor	x17,x17,x19
+	eor	v2.16b,v2.16b,v26.16b
+	eor	x20,x20,x21
+	eor	v3.16b,v3.16b,v27.16b
+	ld1	{v24.16b,v25.16b,v26.16b,v27.16b},[x1],#64
+
+	stp	x5,x7,[x0,#0]		// store output
+	add	x28,x28,#7			// increment counter
+	stp	x9,x11,[x0,#16]
+	stp	x13,x15,[x0,#32]
+	stp	x17,x20,[x0,#48]
+	add	x0,x0,#64
+	st1	{v0.16b,v1.16b,v2.16b,v3.16b},[x0],#64
+
+	ld1	{v0.16b,v1.16b,v2.16b,v3.16b},[x1],#64
+	eor	v4.16b,v4.16b,v24.16b
+	eor	v5.16b,v5.16b,v25.16b
+	eor	v6.16b,v6.16b,v26.16b
+	eor	v7.16b,v7.16b,v27.16b
+	st1	{v4.16b,v5.16b,v6.16b,v7.16b},[x0],#64
+
+	ld1	{v4.16b,v5.16b,v6.16b,v7.16b},[x1],#64
+	eor	v8.16b,v8.16b,v0.16b
+	ldp	q24,q25,[sp,#0]
+	eor	v9.16b,v9.16b,v1.16b
+	ldp	q26,q27,[sp,#32]
+	eor	v10.16b,v10.16b,v2.16b
+	eor	v11.16b,v11.16b,v3.16b
+	st1	{v8.16b,v9.16b,v10.16b,v11.16b},[x0],#64
+
+	ld1	{v8.16b,v9.16b,v10.16b,v11.16b},[x1],#64
+	eor	v12.16b,v12.16b,v4.16b
+	eor	v13.16b,v13.16b,v5.16b
+	eor	v14.16b,v14.16b,v6.16b
+	eor	v15.16b,v15.16b,v7.16b
+	st1	{v12.16b,v13.16b,v14.16b,v15.16b},[x0],#64
+
+	ld1	{v12.16b,v13.16b,v14.16b,v15.16b},[x1],#64
+	eor	v16.16b,v16.16b,v8.16b
+	eor	v17.16b,v17.16b,v9.16b
+	eor	v18.16b,v18.16b,v10.16b
+	eor	v19.16b,v19.16b,v11.16b
+	st1	{v16.16b,v17.16b,v18.16b,v19.16b},[x0],#64
+
+	shl	v0.4s,v31.4s,#1			// 4 -> 8
+	eor	v20.16b,v20.16b,v12.16b
+	eor	v21.16b,v21.16b,v13.16b
+	eor	v22.16b,v22.16b,v14.16b
+	eor	v23.16b,v23.16b,v15.16b
+	st1	{v20.16b,v21.16b,v22.16b,v23.16b},[x0],#64
+
+	add	v27.4s,v27.4s,v0.4s			// += 8
+	add	v28.4s,v28.4s,v0.4s
+	add	v29.4s,v29.4s,v0.4s
+	add	v30.4s,v30.4s,v0.4s
+
+	b.hs	.Loop_outer_512_neon
+
+	adds	x2,x2,#512
+	ushr	v0.4s,v31.4s,#2			// 4 -> 1
+
+	ldp	d8,d9,[sp,#128+0]		// meet ABI requirements
+	ldp	d10,d11,[sp,#128+16]
+	ldp	d12,d13,[sp,#128+32]
+	ldp	d14,d15,[sp,#128+48]
+
+	stp	q24,q31,[sp,#0]		// wipe off-load area
+	stp	q24,q31,[sp,#32]
+	stp	q24,q31,[sp,#64]
+
+	b.eq	.Ldone_512_neon
+
+	cmp	x2,#192
+	sub	v27.4s,v27.4s,v0.4s			// -= 1
+	sub	v28.4s,v28.4s,v0.4s
+	sub	v29.4s,v29.4s,v0.4s
+	add	sp,sp,#128
+	b.hs	.Loop_outer_neon
+
+	eor	v25.16b,v25.16b,v25.16b
+	eor	v26.16b,v26.16b,v26.16b
+	eor	v27.16b,v27.16b,v27.16b
+	eor	v28.16b,v28.16b,v28.16b
+	eor	v29.16b,v29.16b,v29.16b
+	eor	v30.16b,v30.16b,v30.16b
+	b	.Loop_outer
+
+.Ldone_512_neon:
+	ldp	x19,x20,[x29,#16]
+	add	sp,sp,#128+64
+	ldp	x21,x22,[x29,#32]
+	ldp	x23,x24,[x29,#48]
+	ldp	x25,x26,[x29,#64]
+	ldp	x27,x28,[x29,#80]
+	ldp	x29,x30,[sp],#96
+	ret
+.size	ChaCha20_512_neon,.-ChaCha20_512_neon
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next v8 08/28] zinc: port Andy Polyakov's ChaCha20 ARM and ARM64 implementations
From: Jason A. Donenfeld @ 2018-10-18 14:56 UTC (permalink / raw)
  To: linux-kernel, netdev, linux-crypto, davem, gregkh
  Cc: Jason A. Donenfeld, Russell King, linux-arm-kernel, Samuel Neves,
	Jean-Philippe Aumasson, Andy Lutomirski, Andrew Morton,
	Linus Torvalds, kernel-hardening
In-Reply-To: <20181018145712.7538-1-Jason@zx2c4.com>

These port and prepare Andy Polyakov's implementations for the kernel,
but don't actually wire up any of the code yet. The wiring will be done
in a subsequent commit, since we'll need to merge these implementations
with another one. We make a few small changes to the assembly:

  - Entries and exits use the proper kernel convention macro.
  - CPU feature checking is done in C by the glue code, so that has been
    removed from the assembly.
  - The function names have been renamed to fit kernel conventions.
  - Labels have been renamed (prefixed with .L) to fit kernel conventions.
  - Constants have been rearranged so that they are closer to the code
    that is using them. [ARM only]
  - The neon code can jump to the scalar code when it makes sense to do
    so.
  - The neon_512 function as a separate function has been removed, leaving
    the decision up to the main neon entry point. [ARM64 only]

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
---
 lib/zinc/chacha20/chacha20-arm-cryptogams.S   | 367 +++++++++---------
 lib/zinc/chacha20/chacha20-arm64-cryptogams.S |  75 ++--
 2 files changed, 202 insertions(+), 240 deletions(-)

diff --git a/lib/zinc/chacha20/chacha20-arm-cryptogams.S b/lib/zinc/chacha20/chacha20-arm-cryptogams.S
index 05a3a9e6e93f..770bab469171 100644
--- a/lib/zinc/chacha20/chacha20-arm-cryptogams.S
+++ b/lib/zinc/chacha20/chacha20-arm-cryptogams.S
@@ -1,9 +1,12 @@
 /* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
 /*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
  * Copyright (C) 2006-2017 CRYPTOGAMS by <appro@openssl.org>. All Rights Reserved.
+ *
+ * This is based in part on Andy Polyakov's implementation from CRYPTOGAMS.
  */
 
-#include "arm_arch.h"
+#include <linux/linkage.h>
 
 .text
 #if defined(__thumb2__) || defined(__clang__)
@@ -24,48 +27,25 @@
 .long	0x61707865,0x3320646e,0x79622d32,0x6b206574	@ endian-neutral
 .Lone:
 .long	1,0,0,0
-.Lrot8:
-.long	0x02010003,0x06050407
-#if __ARM_MAX_ARCH__>=7
-.LOPENSSL_armcap:
-.word   OPENSSL_armcap_P-.LChaCha20_ctr32
-#else
 .word	-1
-#endif
 
-.globl	ChaCha20_ctr32
-.type	ChaCha20_ctr32,%function
 .align	5
-ChaCha20_ctr32:
-.LChaCha20_ctr32:
+ENTRY(chacha20_arm)
 	ldr	r12,[sp,#0]		@ pull pointer to counter and nonce
 	stmdb	sp!,{r0-r2,r4-r11,lr}
-#if __ARM_ARCH__<7 && !defined(__thumb2__)
-	sub	r14,pc,#16		@ ChaCha20_ctr32
-#else
-	adr	r14,.LChaCha20_ctr32
-#endif
 	cmp	r2,#0			@ len==0?
-#ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	eq
 #endif
 	addeq	sp,sp,#4*3
-	beq	.Lno_data
-#if __ARM_MAX_ARCH__>=7
-	cmp	r2,#192			@ test len
-	bls	.Lshort
-	ldr	r4,[r14,#-24]
-	ldr	r4,[r14,r4]
-# ifdef	__APPLE__
-	ldr	r4,[r4]
-# endif
-	tst	r4,#ARMV7_NEON
-	bne	.LChaCha20_neon
-.Lshort:
-#endif
+	beq	.Lno_data_arm
 	ldmia	r12,{r4-r7}		@ load counter and nonce
 	sub	sp,sp,#4*(16)		@ off-load area
-	sub	r14,r14,#64		@ .Lsigma
+#if __LINUX_ARM_ARCH__ < 7 && !defined(__thumb2__)
+	sub	r14,pc,#100		@ .Lsigma
+#else
+	adr	r14,.Lsigma		@ .Lsigma
+#endif
 	stmdb	sp!,{r4-r7}		@ copy counter and nonce
 	ldmia	r3,{r4-r11}		@ load key
 	ldmia	r14,{r0-r3}		@ load sigma
@@ -191,7 +171,7 @@ ChaCha20_ctr32:
 	@ rx and second half at sp+4*(16+8)
 
 	cmp	r11,#64		@ done yet?
-#ifdef	__thumb2__
+#ifdef __thumb2__
 	itete	lo
 #endif
 	addlo	r12,sp,#4*(0)		@ shortcut or ...
@@ -202,49 +182,49 @@ ChaCha20_ctr32:
 	ldr	r8,[sp,#4*(0)]	@ load key material
 	ldr	r9,[sp,#4*(1)]
 
-#if __ARM_ARCH__>=6 || !defined(__ARMEB__)
-# if __ARM_ARCH__<7
+#if __LINUX_ARM_ARCH__ >= 6 || !defined(__ARMEB__)
+#if __LINUX_ARM_ARCH__ < 7
 	orr	r10,r12,r14
 	tst	r10,#3		@ are input and output aligned?
 	ldr	r10,[sp,#4*(2)]
 	bne	.Lunaligned
 	cmp	r11,#64		@ restore flags
-# else
+#else
 	ldr	r10,[sp,#4*(2)]
-# endif
+#endif
 	ldr	r11,[sp,#4*(3)]
 
 	add	r0,r0,r8	@ accumulate key material
 	add	r1,r1,r9
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhs	r8,[r12],#16		@ load input
 	ldrhs	r9,[r12,#-12]
 
 	add	r2,r2,r10
 	add	r3,r3,r11
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhs	r10,[r12,#-8]
 	ldrhs	r11,[r12,#-4]
-# if __ARM_ARCH__>=6 && defined(__ARMEB__)
+#if __LINUX_ARM_ARCH__ >= 6 && defined(__ARMEB__)
 	rev	r0,r0
 	rev	r1,r1
 	rev	r2,r2
 	rev	r3,r3
-# endif
-# ifdef	__thumb2__
+#endif
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	eorhs	r0,r0,r8	@ xor with input
 	eorhs	r1,r1,r9
 	 add	r8,sp,#4*(4)
 	str	r0,[r14],#16		@ store output
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	eorhs	r2,r2,r10
 	eorhs	r3,r3,r11
 	 ldmia	r8,{r8-r11}	@ load key material
@@ -254,34 +234,34 @@ ChaCha20_ctr32:
 
 	add	r4,r8,r4,ror#13 @ accumulate key material
 	add	r5,r9,r5,ror#13
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhs	r8,[r12],#16		@ load input
 	ldrhs	r9,[r12,#-12]
 	add	r6,r10,r6,ror#13
 	add	r7,r11,r7,ror#13
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhs	r10,[r12,#-8]
 	ldrhs	r11,[r12,#-4]
-# if __ARM_ARCH__>=6 && defined(__ARMEB__)
+#if __LINUX_ARM_ARCH__ >= 6 && defined(__ARMEB__)
 	rev	r4,r4
 	rev	r5,r5
 	rev	r6,r6
 	rev	r7,r7
-# endif
-# ifdef	__thumb2__
+#endif
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	eorhs	r4,r4,r8
 	eorhs	r5,r5,r9
 	 add	r8,sp,#4*(8)
 	str	r4,[r14],#16		@ store output
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	eorhs	r6,r6,r10
 	eorhs	r7,r7,r11
 	str	r5,[r14,#-12]
@@ -294,39 +274,39 @@ ChaCha20_ctr32:
 
 	add	r0,r0,r8	@ accumulate key material
 	add	r1,r1,r9
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhs	r8,[r12],#16		@ load input
 	ldrhs	r9,[r12,#-12]
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hi
-# endif
+#endif
 	 strhi	r10,[sp,#4*(16+10)]	@ copy "rx" while at it
 	 strhi	r11,[sp,#4*(16+11)]	@ copy "rx" while at it
 	add	r2,r2,r10
 	add	r3,r3,r11
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhs	r10,[r12,#-8]
 	ldrhs	r11,[r12,#-4]
-# if __ARM_ARCH__>=6 && defined(__ARMEB__)
+#if __LINUX_ARM_ARCH__ >= 6 && defined(__ARMEB__)
 	rev	r0,r0
 	rev	r1,r1
 	rev	r2,r2
 	rev	r3,r3
-# endif
-# ifdef	__thumb2__
+#endif
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	eorhs	r0,r0,r8
 	eorhs	r1,r1,r9
 	 add	r8,sp,#4*(12)
 	str	r0,[r14],#16		@ store output
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	eorhs	r2,r2,r10
 	eorhs	r3,r3,r11
 	str	r1,[r14,#-12]
@@ -336,79 +316,79 @@ ChaCha20_ctr32:
 
 	add	r4,r8,r4,ror#24 @ accumulate key material
 	add	r5,r9,r5,ror#24
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hi
-# endif
+#endif
 	 addhi	r8,r8,#1		@ next counter value
 	 strhi	r8,[sp,#4*(12)]	@ save next counter value
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhs	r8,[r12],#16		@ load input
 	ldrhs	r9,[r12,#-12]
 	add	r6,r10,r6,ror#24
 	add	r7,r11,r7,ror#24
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhs	r10,[r12,#-8]
 	ldrhs	r11,[r12,#-4]
-# if __ARM_ARCH__>=6 && defined(__ARMEB__)
+#if __LINUX_ARM_ARCH__ >= 6 && defined(__ARMEB__)
 	rev	r4,r4
 	rev	r5,r5
 	rev	r6,r6
 	rev	r7,r7
-# endif
-# ifdef	__thumb2__
+#endif
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	eorhs	r4,r4,r8
 	eorhs	r5,r5,r9
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	 it	ne
-# endif
+#endif
 	 ldrne	r8,[sp,#4*(32+2)]	@ re-load len
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	eorhs	r6,r6,r10
 	eorhs	r7,r7,r11
 	str	r4,[r14],#16		@ store output
 	str	r5,[r14,#-12]
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	it	hs
-# endif
+#endif
 	 subhs	r11,r8,#64		@ len-=64
 	str	r6,[r14,#-8]
 	str	r7,[r14,#-4]
 	bhi	.Loop_outer
 
 	beq	.Ldone
-# if __ARM_ARCH__<7
+#if __LINUX_ARM_ARCH__ < 7
 	b	.Ltail
 
 .align	4
 .Lunaligned:				@ unaligned endian-neutral path
 	cmp	r11,#64		@ restore flags
-# endif
 #endif
-#if __ARM_ARCH__<7
+#endif
+#if __LINUX_ARM_ARCH__ < 7
 	ldr	r11,[sp,#4*(3)]
 	add	r0,r8,r0	@ accumulate key material
 	add	r1,r9,r1
 	add	r2,r10,r2
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itete	lo
-# endif
+#endif
 	eorlo	r8,r8,r8		@ zero or ...
 	ldrhsb	r8,[r12],#16			@ ... load input
 	eorlo	r9,r9,r9
 	ldrhsb	r9,[r12,#-12]
 
 	add	r3,r11,r3
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itete	lo
-# endif
+#endif
 	eorlo	r10,r10,r10
 	ldrhsb	r10,[r12,#-8]
 	eorlo	r11,r11,r11
@@ -416,53 +396,53 @@ ChaCha20_ctr32:
 
 	eor	r0,r8,r0		@ xor with input (or zero)
 	eor	r1,r9,r1
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r8,[r12,#-15]		@ load more input
 	ldrhsb	r9,[r12,#-11]
 	eor	r2,r10,r2
 	 strb	r0,[r14],#16		@ store output
 	eor	r3,r11,r3
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r10,[r12,#-7]
 	ldrhsb	r11,[r12,#-3]
 	 strb	r1,[r14,#-12]
 	eor	r0,r8,r0,lsr#8
 	 strb	r2,[r14,#-8]
 	eor	r1,r9,r1,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r8,[r12,#-14]		@ load more input
 	ldrhsb	r9,[r12,#-10]
 	 strb	r3,[r14,#-4]
 	eor	r2,r10,r2,lsr#8
 	 strb	r0,[r14,#-15]
 	eor	r3,r11,r3,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r10,[r12,#-6]
 	ldrhsb	r11,[r12,#-2]
 	 strb	r1,[r14,#-11]
 	eor	r0,r8,r0,lsr#8
 	 strb	r2,[r14,#-7]
 	eor	r1,r9,r1,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r8,[r12,#-13]		@ load more input
 	ldrhsb	r9,[r12,#-9]
 	 strb	r3,[r14,#-3]
 	eor	r2,r10,r2,lsr#8
 	 strb	r0,[r14,#-14]
 	eor	r3,r11,r3,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r10,[r12,#-5]
 	ldrhsb	r11,[r12,#-1]
 	 strb	r1,[r14,#-10]
@@ -482,18 +462,18 @@ ChaCha20_ctr32:
 	add	r4,r8,r4,ror#13	@ accumulate key material
 	add	r5,r9,r5,ror#13
 	add	r6,r10,r6,ror#13
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itete	lo
-# endif
+#endif
 	eorlo	r8,r8,r8		@ zero or ...
 	ldrhsb	r8,[r12],#16			@ ... load input
 	eorlo	r9,r9,r9
 	ldrhsb	r9,[r12,#-12]
 
 	add	r7,r11,r7,ror#13
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itete	lo
-# endif
+#endif
 	eorlo	r10,r10,r10
 	ldrhsb	r10,[r12,#-8]
 	eorlo	r11,r11,r11
@@ -501,53 +481,53 @@ ChaCha20_ctr32:
 
 	eor	r4,r8,r4		@ xor with input (or zero)
 	eor	r5,r9,r5
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r8,[r12,#-15]		@ load more input
 	ldrhsb	r9,[r12,#-11]
 	eor	r6,r10,r6
 	 strb	r4,[r14],#16		@ store output
 	eor	r7,r11,r7
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r10,[r12,#-7]
 	ldrhsb	r11,[r12,#-3]
 	 strb	r5,[r14,#-12]
 	eor	r4,r8,r4,lsr#8
 	 strb	r6,[r14,#-8]
 	eor	r5,r9,r5,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r8,[r12,#-14]		@ load more input
 	ldrhsb	r9,[r12,#-10]
 	 strb	r7,[r14,#-4]
 	eor	r6,r10,r6,lsr#8
 	 strb	r4,[r14,#-15]
 	eor	r7,r11,r7,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r10,[r12,#-6]
 	ldrhsb	r11,[r12,#-2]
 	 strb	r5,[r14,#-11]
 	eor	r4,r8,r4,lsr#8
 	 strb	r6,[r14,#-7]
 	eor	r5,r9,r5,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r8,[r12,#-13]		@ load more input
 	ldrhsb	r9,[r12,#-9]
 	 strb	r7,[r14,#-3]
 	eor	r6,r10,r6,lsr#8
 	 strb	r4,[r14,#-14]
 	eor	r7,r11,r7,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r10,[r12,#-5]
 	ldrhsb	r11,[r12,#-1]
 	 strb	r5,[r14,#-10]
@@ -564,26 +544,26 @@ ChaCha20_ctr32:
 	add	r8,sp,#4*(4+4)
 	ldmia	r8,{r8-r11}		@ load key material
 	ldmia	r0,{r0-r7}		@ load second half
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hi
-# endif
+#endif
 	strhi	r10,[sp,#4*(16+10)]		@ copy "rx"
 	strhi	r11,[sp,#4*(16+11)]		@ copy "rx"
 	add	r0,r8,r0	@ accumulate key material
 	add	r1,r9,r1
 	add	r2,r10,r2
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itete	lo
-# endif
+#endif
 	eorlo	r8,r8,r8		@ zero or ...
 	ldrhsb	r8,[r12],#16			@ ... load input
 	eorlo	r9,r9,r9
 	ldrhsb	r9,[r12,#-12]
 
 	add	r3,r11,r3
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itete	lo
-# endif
+#endif
 	eorlo	r10,r10,r10
 	ldrhsb	r10,[r12,#-8]
 	eorlo	r11,r11,r11
@@ -591,53 +571,53 @@ ChaCha20_ctr32:
 
 	eor	r0,r8,r0		@ xor with input (or zero)
 	eor	r1,r9,r1
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r8,[r12,#-15]		@ load more input
 	ldrhsb	r9,[r12,#-11]
 	eor	r2,r10,r2
 	 strb	r0,[r14],#16		@ store output
 	eor	r3,r11,r3
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r10,[r12,#-7]
 	ldrhsb	r11,[r12,#-3]
 	 strb	r1,[r14,#-12]
 	eor	r0,r8,r0,lsr#8
 	 strb	r2,[r14,#-8]
 	eor	r1,r9,r1,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r8,[r12,#-14]		@ load more input
 	ldrhsb	r9,[r12,#-10]
 	 strb	r3,[r14,#-4]
 	eor	r2,r10,r2,lsr#8
 	 strb	r0,[r14,#-15]
 	eor	r3,r11,r3,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r10,[r12,#-6]
 	ldrhsb	r11,[r12,#-2]
 	 strb	r1,[r14,#-11]
 	eor	r0,r8,r0,lsr#8
 	 strb	r2,[r14,#-7]
 	eor	r1,r9,r1,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r8,[r12,#-13]		@ load more input
 	ldrhsb	r9,[r12,#-9]
 	 strb	r3,[r14,#-3]
 	eor	r2,r10,r2,lsr#8
 	 strb	r0,[r14,#-14]
 	eor	r3,r11,r3,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r10,[r12,#-5]
 	ldrhsb	r11,[r12,#-1]
 	 strb	r1,[r14,#-10]
@@ -654,25 +634,25 @@ ChaCha20_ctr32:
 	add	r8,sp,#4*(4+8)
 	ldmia	r8,{r8-r11}		@ load key material
 	add	r4,r8,r4,ror#24	@ accumulate key material
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hi
-# endif
+#endif
 	addhi	r8,r8,#1			@ next counter value
 	strhi	r8,[sp,#4*(12)]		@ save next counter value
 	add	r5,r9,r5,ror#24
 	add	r6,r10,r6,ror#24
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itete	lo
-# endif
+#endif
 	eorlo	r8,r8,r8		@ zero or ...
 	ldrhsb	r8,[r12],#16			@ ... load input
 	eorlo	r9,r9,r9
 	ldrhsb	r9,[r12,#-12]
 
 	add	r7,r11,r7,ror#24
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itete	lo
-# endif
+#endif
 	eorlo	r10,r10,r10
 	ldrhsb	r10,[r12,#-8]
 	eorlo	r11,r11,r11
@@ -680,53 +660,53 @@ ChaCha20_ctr32:
 
 	eor	r4,r8,r4		@ xor with input (or zero)
 	eor	r5,r9,r5
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r8,[r12,#-15]		@ load more input
 	ldrhsb	r9,[r12,#-11]
 	eor	r6,r10,r6
 	 strb	r4,[r14],#16		@ store output
 	eor	r7,r11,r7
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r10,[r12,#-7]
 	ldrhsb	r11,[r12,#-3]
 	 strb	r5,[r14,#-12]
 	eor	r4,r8,r4,lsr#8
 	 strb	r6,[r14,#-8]
 	eor	r5,r9,r5,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r8,[r12,#-14]		@ load more input
 	ldrhsb	r9,[r12,#-10]
 	 strb	r7,[r14,#-4]
 	eor	r6,r10,r6,lsr#8
 	 strb	r4,[r14,#-15]
 	eor	r7,r11,r7,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r10,[r12,#-6]
 	ldrhsb	r11,[r12,#-2]
 	 strb	r5,[r14,#-11]
 	eor	r4,r8,r4,lsr#8
 	 strb	r6,[r14,#-7]
 	eor	r5,r9,r5,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r8,[r12,#-13]		@ load more input
 	ldrhsb	r9,[r12,#-9]
 	 strb	r7,[r14,#-3]
 	eor	r6,r10,r6,lsr#8
 	 strb	r4,[r14,#-14]
 	eor	r7,r11,r7,lsr#8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	itt	hs
-# endif
+#endif
 	ldrhsb	r10,[r12,#-5]
 	ldrhsb	r11,[r12,#-1]
 	 strb	r5,[r14,#-10]
@@ -740,13 +720,13 @@ ChaCha20_ctr32:
 	eor	r7,r11,r7,lsr#8
 	 strb	r6,[r14,#-5]
 	 strb	r7,[r14,#-1]
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	it	ne
-# endif
+#endif
 	ldrne	r8,[sp,#4*(32+2)]		@ re-load len
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	it	hs
-# endif
+#endif
 	subhs	r11,r8,#64			@ len-=64
 	bhi	.Loop_outer
 
@@ -768,20 +748,33 @@ ChaCha20_ctr32:
 
 .Ldone:
 	add	sp,sp,#4*(32+3)
-.Lno_data:
+.Lno_data_arm:
 	ldmia	sp!,{r4-r11,pc}
-.size	ChaCha20_ctr32,.-ChaCha20_ctr32
-#if __ARM_MAX_ARCH__>=7
+ENDPROC(chacha20_arm)
+
+#ifdef CONFIG_KERNEL_MODE_NEON
+.align	5
+.Lsigma2:
+.long	0x61707865,0x3320646e,0x79622d32,0x6b206574	@ endian-neutral
+.Lone2:
+.long	1,0,0,0
+.word	-1
+
 .arch	armv7-a
 .fpu	neon
 
-.type	ChaCha20_neon,%function
 .align	5
-ChaCha20_neon:
+ENTRY(chacha20_neon)
 	ldr		r12,[sp,#0]		@ pull pointer to counter and nonce
 	stmdb		sp!,{r0-r2,r4-r11,lr}
-.LChaCha20_neon:
-	adr		r14,.Lsigma
+	cmp		r2,#0			@ len==0?
+#ifdef	__thumb2__
+	itt		eq
+#endif
+	addeq		sp,sp,#4*3
+	beq		.Lno_data_neon
+.Lchacha20_neon_begin:
+	adr		r14,.Lsigma2
 	vstmdb		sp!,{d8-d15}		@ ABI spec says so
 	stmdb		sp!,{r0-r3}
 
@@ -1121,12 +1114,12 @@ ChaCha20_neon:
 	ldr		r10,[r12,#-8]
 	add		r3,r3,r11
 	ldr		r11,[r12,#-4]
-# ifdef	__ARMEB__
+#ifdef	__ARMEB__
 	rev		r0,r0
 	rev		r1,r1
 	rev		r2,r2
 	rev		r3,r3
-# endif
+#endif
 	eor		r0,r0,r8	@ xor with input
 	 add		r8,sp,#4*(4)
 	eor		r1,r1,r9
@@ -1146,12 +1139,12 @@ ChaCha20_neon:
 	ldr		r10,[r12,#-8]
 	add		r7,r11,r7,ror#13
 	ldr		r11,[r12,#-4]
-# ifdef	__ARMEB__
+#ifdef	__ARMEB__
 	rev		r4,r4
 	rev		r5,r5
 	rev		r6,r6
 	rev		r7,r7
-# endif
+#endif
 	eor		r4,r4,r8
 	 add		r8,sp,#4*(8)
 	eor		r5,r5,r9
@@ -1170,24 +1163,24 @@ ChaCha20_neon:
 	ldr		r8,[r12],#16		@ load input
 	add		r1,r1,r9
 	ldr		r9,[r12,#-12]
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	it	hi
-# endif
+#endif
 	 strhi		r10,[sp,#4*(16+10)]	@ copy "rx" while at it
 	add		r2,r2,r10
 	ldr		r10,[r12,#-8]
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	it	hi
-# endif
+#endif
 	 strhi		r11,[sp,#4*(16+11)]	@ copy "rx" while at it
 	add		r3,r3,r11
 	ldr		r11,[r12,#-4]
-# ifdef	__ARMEB__
+#ifdef	__ARMEB__
 	rev		r0,r0
 	rev		r1,r1
 	rev		r2,r2
 	rev		r3,r3
-# endif
+#endif
 	eor		r0,r0,r8
 	 add		r8,sp,#4*(12)
 	eor		r1,r1,r9
@@ -1210,16 +1203,16 @@ ChaCha20_neon:
 	add		r7,r11,r7,ror#24
 	ldr		r10,[r12,#-8]
 	ldr		r11,[r12,#-4]
-# ifdef	__ARMEB__
+#ifdef	__ARMEB__
 	rev		r4,r4
 	rev		r5,r5
 	rev		r6,r6
 	rev		r7,r7
-# endif
+#endif
 	eor		r4,r4,r8
-# ifdef	__thumb2__
+#ifdef __thumb2__
 	it	hi
-# endif
+#endif
 	 ldrhi		r8,[sp,#4*(32+2)]	@ re-load len
 	eor		r5,r5,r9
 	eor		r6,r6,r10
@@ -1379,7 +1372,7 @@ ChaCha20_neon:
 	add		r6,r10,r6,ror#13
 	add		r7,r11,r7,ror#13
 	 ldmia		r8,{r8-r11}	@ load key material
-# ifdef	__ARMEB__
+#ifdef	__ARMEB__
 	rev		r0,r0
 	rev		r1,r1
 	rev		r2,r2
@@ -1388,7 +1381,7 @@ ChaCha20_neon:
 	rev		r5,r5
 	rev		r6,r6
 	rev		r7,r7
-# endif
+#endif
 	stmia		sp,{r0-r7}
 	 add		r0,sp,#4*(16+8)
 
@@ -1408,7 +1401,7 @@ ChaCha20_neon:
 	add		r6,r10,r6,ror#24
 	add		r7,r11,r7,ror#24
 	 ldr		r11,[sp,#4*(32+2)]	@ re-load len
-# ifdef	__ARMEB__
+#ifdef	__ARMEB__
 	rev		r0,r0
 	rev		r1,r1
 	rev		r2,r2
@@ -1417,7 +1410,7 @@ ChaCha20_neon:
 	rev		r5,r5
 	rev		r6,r6
 	rev		r7,r7
-# endif
+#endif
 	stmia		r8,{r0-r7}
 	 add		r10,sp,#4*(0)
 	 sub		r11,r11,#64*3	@ len-=64*3
@@ -1434,7 +1427,7 @@ ChaCha20_neon:
 	add		sp,sp,#4*(32+4)
 	vldmia		sp,{d8-d15}
 	add		sp,sp,#4*(16+3)
+.Lno_data_neon:
 	ldmia		sp!,{r4-r11,pc}
-.size	ChaCha20_neon,.-ChaCha20_neon
-.comm	OPENSSL_armcap_P,4,4
+ENDPROC(chacha20_neon)
 #endif
diff --git a/lib/zinc/chacha20/chacha20-arm64-cryptogams.S b/lib/zinc/chacha20/chacha20-arm64-cryptogams.S
index 4d029bfdad3a..1ae11a5c5a14 100644
--- a/lib/zinc/chacha20/chacha20-arm64-cryptogams.S
+++ b/lib/zinc/chacha20/chacha20-arm64-cryptogams.S
@@ -1,46 +1,24 @@
 /* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
 /*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
  * Copyright (C) 2006-2017 CRYPTOGAMS by <appro@openssl.org>. All Rights Reserved.
+ *
+ * This is based in part on Andy Polyakov's implementation from CRYPTOGAMS.
  */
 
-#include "arm_arch.h"
+#include <linux/linkage.h>
 
 .text
-
-
-
 .align	5
 .Lsigma:
 .quad	0x3320646e61707865,0x6b20657479622d32		// endian-neutral
 .Lone:
 .long	1,0,0,0
-.LOPENSSL_armcap_P:
-#ifdef	__ILP32__
-.long	OPENSSL_armcap_P-.
-#else
-.quad	OPENSSL_armcap_P-.
-#endif
-.byte	67,104,97,67,104,97,50,48,32,102,111,114,32,65,82,77,118,56,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
-.align	2
 
-.globl	ChaCha20_ctr32
-.type	ChaCha20_ctr32,%function
 .align	5
-ChaCha20_ctr32:
+ENTRY(chacha20_arm)
 	cbz	x2,.Labort
-	adr	x5,.LOPENSSL_armcap_P
-	cmp	x2,#192
-	b.lo	.Lshort
-#ifdef	__ILP32__
-	ldrsw	x6,[x5]
-#else
-	ldr	x6,[x5]
-#endif
-	ldr	w17,[x6,x5]
-	tst	w17,#ARMV7_NEON
-	b.ne	ChaCha20_neon
 
-.Lshort:
 	stp	x29,x30,[sp,#-96]!
 	add	x29,sp,#0
 
@@ -56,7 +34,7 @@ ChaCha20_ctr32:
 	ldp	x24,x25,[x3]		// load key
 	ldp	x26,x27,[x3,#16]
 	ldp	x28,x30,[x4]		// load counter
-#ifdef	__ARMEB__
+#ifdef	__AARCH64EB__
 	ror	x24,x24,#32
 	ror	x25,x25,#32
 	ror	x26,x26,#32
@@ -217,7 +195,7 @@ ChaCha20_ctr32:
 	add	x20,x20,x21,lsl#32
 	ldp	x19,x21,[x1,#48]
 	add	x1,x1,#64
-#ifdef	__ARMEB__
+#ifdef	__AARCH64EB__
 	rev	x5,x5
 	rev	x7,x7
 	rev	x9,x9
@@ -273,7 +251,7 @@ ChaCha20_ctr32:
 	add	x15,x15,x16,lsl#32
 	add	x17,x17,x19,lsl#32
 	add	x20,x20,x21,lsl#32
-#ifdef	__ARMEB__
+#ifdef	__AARCH64EB__
 	rev	x5,x5
 	rev	x7,x7
 	rev	x9,x9
@@ -309,11 +287,13 @@ ChaCha20_ctr32:
 	ldp	x27,x28,[x29,#80]
 	ldp	x29,x30,[sp],#96
 	ret
-.size	ChaCha20_ctr32,.-ChaCha20_ctr32
+ENDPROC(chacha20_arm)
 
-.type	ChaCha20_neon,%function
+#ifdef CONFIG_KERNEL_MODE_NEON
 .align	5
-ChaCha20_neon:
+ENTRY(chacha20_neon)
+	cbz	x2,.Labort_neon
+
 	stp	x29,x30,[sp,#-96]!
 	add	x29,sp,#0
 
@@ -336,7 +316,7 @@ ChaCha20_neon:
 	ldp	x28,x30,[x4]		// load counter
 	ld1	{v27.4s},[x4]
 	ld1	{v31.4s},[x5]
-#ifdef	__ARMEB__
+#ifdef	__AARCH64EB__
 	rev64	v24.4s,v24.4s
 	ror	x24,x24,#32
 	ror	x25,x25,#32
@@ -634,7 +614,7 @@ ChaCha20_neon:
 	add	x20,x20,x21,lsl#32
 	ldp	x19,x21,[x1,#48]
 	add	x1,x1,#64
-#ifdef	__ARMEB__
+#ifdef	__AARCH64EB__
 	rev	x5,x5
 	rev	x7,x7
 	rev	x9,x9
@@ -713,7 +693,7 @@ ChaCha20_neon:
 	add	x20,x20,x21,lsl#32
 	ldp	x19,x21,[x1,#48]
 	add	x1,x1,#64
-#ifdef	__ARMEB__
+#ifdef	__AARCH64EB__
 	rev	x5,x5
 	rev	x7,x7
 	rev	x9,x9
@@ -803,19 +783,6 @@ ChaCha20_neon:
 	ldp	x27,x28,[x29,#80]
 	ldp	x29,x30,[sp],#96
 	ret
-.size	ChaCha20_neon,.-ChaCha20_neon
-.type	ChaCha20_512_neon,%function
-.align	5
-ChaCha20_512_neon:
-	stp	x29,x30,[sp,#-96]!
-	add	x29,sp,#0
-
-	adr	x5,.Lsigma
-	stp	x19,x20,[sp,#16]
-	stp	x21,x22,[sp,#32]
-	stp	x23,x24,[sp,#48]
-	stp	x25,x26,[sp,#64]
-	stp	x27,x28,[sp,#80]
 
 .L512_or_more_neon:
 	sub	sp,sp,#128+64
@@ -828,7 +795,7 @@ ChaCha20_512_neon:
 	ldp	x28,x30,[x4]		// load counter
 	ld1	{v27.4s},[x4]
 	ld1	{v31.4s},[x5]
-#ifdef	__ARMEB__
+#ifdef	__AARCH64EB__
 	rev64	v24.4s,v24.4s
 	ror	x24,x24,#32
 	ror	x25,x25,#32
@@ -1341,7 +1308,7 @@ ChaCha20_512_neon:
 	add	x20,x20,x21,lsl#32
 	ldp	x19,x21,[x1,#48]
 	add	x1,x1,#64
-#ifdef	__ARMEB__
+#ifdef	__AARCH64EB__
 	rev	x5,x5
 	rev	x7,x7
 	rev	x9,x9
@@ -1855,7 +1822,7 @@ ChaCha20_512_neon:
 	add	x1,x1,#64
 	add	v21.4s,v21.4s,v25.4s
 
-#ifdef	__ARMEB__
+#ifdef	__AARCH64EB__
 	rev	x5,x5
 	rev	x7,x7
 	rev	x9,x9
@@ -1969,5 +1936,7 @@ ChaCha20_512_neon:
 	ldp	x25,x26,[x29,#64]
 	ldp	x27,x28,[x29,#80]
 	ldp	x29,x30,[sp],#96
+.Labort_neon:
 	ret
-.size	ChaCha20_512_neon,.-ChaCha20_512_neon
+ENDPROC(chacha20_neon)
+#endif
-- 
2.19.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox