Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2 02/17] zinc: introduce minimal cryptography library
From: Eric Biggers @ 2018-08-25  6:29 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: linux-kernel, netdev, davem, Andy Lutomirski, Greg KH,
	Samuel Neves, Jean-Philippe Aumasson, linux-crypto
In-Reply-To: <20180824213849.23647-3-Jason@zx2c4.com>

Hi Jason,

On Fri, Aug 24, 2018 at 03:38:34PM -0600, Jason A. Donenfeld wrote:
> Zinc stands for "Zinc Is Neat Crypto" or "Zinc as IN Crypto" or maybe
> just "Zx2c4's INsane Cryptolib." It's also short, easy to type, and
> plays nicely with the recent trend of naming crypto libraries after
> elements. The guiding principle is "don't overdo it". It's less of a
> library and more of a directory tree for organizing well-curated direct
> implementations of cryptography primitives.
> 
> Zinc is a new cryptography API that is much more minimal and lower-level
> than the current one. It intends to complement it and provide a basis
> upon which the current crypto API might build, as the provider of
> software implementations of cryptographic primitives. It is motivated by
> three primary observations in crypto API design:
> 
>   * Highly composable "cipher modes" and related abstractions from
>     90s cryptographers did not turn out to be as terrific an idea as
>     hoped, leading to a host of API misuse problems.
> 
>   * Most programmers are afraid of crypto code, and so prefer to
>     integrate it into libraries in a highly abstracted manner, so as to
>     shield themselves from implementation details. Cryptographers, on
>     the other hand, prefer simple direct implementations, which they're
>     able to verify for high assurance and optimize in accordance with
>     their expertise.
> 
>   * Overly abstracted and flexible cryptography APIs lead to a host of
>     dangerous problems and performance issues. The kernel is in the
>     business usually not of coming up with new uses of crypto, but
>     rather implementing various constructions, which means it essentially
>     needs a library of primitives, not a highly abstracted enterprise-ready
>     pluggable system, with a few particular exceptions.
> 
> This last observation has seen itself play out several times over and
> over again within the kernel:
> 
>   * The perennial move of actual primitives away from crypto/ and into
>     lib/, so that users can actually call these functions directly with
>     no overhead and without lots of allocations, function pointers,
>     string specifier parsing, and general clunkiness. For example:
>     sha256, chacha20, siphash, sha1, and so forth live in lib/ rather
>     than in crypto/. Zinc intends to stop the cluttering of lib/ and
>     introduce these direct primitives into their proper place, lib/zinc/.
> 
>   * An abundance of misuse bugs with the present crypto API that have
>     been very unpleasant to clean up.
> 
>   * A hesitance to even use cryptography, because of the overhead and
>     headaches involved in accessing the routines.
> 
> Zinc goes in a rather different direction. Rather than providing a
> thoroughly designed and abstracted API, Zinc gives you simple functions,
> which implement some primitive, or some particular and specific
> construction of primitives. It is not dynamic in the least, though one
> could imagine implementing a complex dynamic dispatch mechanism (such as
> the current crypto API) on top of these basic functions. After all,
> dynamic dispatch is usually needed for applications with cipher agility,
> such as IPsec, dm-crypt, AF_ALG, and so forth, and the existing crypto
> API will continue to play that role. However, Zinc will provide a non-
> haphazard way of directly utilizing crypto routines in applications
> that do have neither the need nor desire for abstraction and dynamic
> dispatch.
> 
> It also organizes the implementations in a simple, straight-forward,
> and direct manner, making it enjoyable and intuitive to work on.
> Rather than moving optimized assembly implementations into arch/, it
> keeps them all together in lib/zinc/, making it simple and obvious to
> compare and contrast what's happening. This is, notably, exactly what
> the lib/raid6/ tree does, and that seems to work out rather well. It's
> also the pattern of most successful crypto libraries. The architecture-
> specific glue-code is made a part of each translation unit, rather than
> being in a separate one, so that generic and architecture-optimized code
> are combined at compile-time, and incompatibility branches compiled out by
> the optimizer.
> 
> All implementations have been extensively tested and fuzzed, and are
> selected for their quality, trustworthiness, and performance. Wherever
> possible and performant, formally verified implementations are used,
> such as those from HACL* [1] and Fiat-Crypto [2]. The routines also take
> special care to zero out secrets using memzero_explicit (and future work
> is planned to have gcc do this more reliably and performantly with
> compiler plugins). The performance of the selected implementations is
> state-of-the-art and unrivaled on a broad array of hardware, though of
> course we will continue to fine tune these to the hardware demands
> needed by kernel contributors. Each implementation also comes with
> extensive self-tests and crafted test vectors, pulled from various
> places such as Wycheproof [9].
> 
> Regularity of function signatures is important, so that users can easily
> "guess" the name of the function they want. Though, individual
> primitives are oftentimes not trivially interchangeable, having been
> designed for different things and requiring different parameters and
> semantics, and so the function signatures they provide will directly
> reflect the realities of the primitives' usages, rather than hiding it
> behind (inevitably leaky) abstractions. Also, in contrast to the current
> crypto API, Zinc functions can work on stack buffers, and can be called
> with different keys, without requiring allocations or locking.
> 
> SIMD is used automatically when available, though some routines may
> benefit from either having their SIMD disabled for particular
> invocations, or to have the SIMD initialization calls amortized over
> several invocations of the function, and so Zinc utilizes function
> signatures enabling that in conjunction with the recently introduced
> simd_context_t.
> 
> More generally, Zinc provides function signatures that allow just what
> is required by the various callers. This isn't to say that users of the
> functions will be permitted to pollute the function semantics with weird
> particular needs, but we are trying very hard not to overdo it, and that
> means looking carefully at what's actually necessary, and doing just that,
> and not much more than that. Remember: practicality and cleanliness rather
> than over-zealous infrastructure.
> 
> Zinc provides also an opening for the best implementers in academia to
> contribute their time and effort to the kernel, by being sufficiently
> simple and inviting. In discussing this commit with some of the best and
> brightest over the last few years, there are many who are eager to
> devote rare talent and energy to this effort.
> 
> Following the merging of this, I expect for the primitives that
> currently exist in lib/ to work their way into lib/zinc/, after intense
> scrutiny of each implementation, potentially replacing them with either
> formally-verified implementations, or better studied and faster
> state-of-the-art implementations.
> 
> Also following the merging of this, I expect for the old crypto API
> implementations to be ported over to use Zinc for their software-based
> implementations.
> 
> As Zinc is simply library code, its config options are un-menued, with
> the exception of CONFIG_ZINC_DEBUG, which enables various selftests and
> BUG_ONs.
> 
> [1] https://github.com/project-everest/hacl-star
> [2] https://github.com/mit-plv/fiat-crypto
> [3] https://cr.yp.to/ecdh.html
> [4] https://cr.yp.to/chacha.html
> [5] https://cr.yp.to/snuffle/xsalsa-20081128.pdf
> [6] https://cr.yp.to/mac.html
> [7] https://blake2.net/
> [8] https://tools.ietf.org/html/rfc8439
> [9] https://github.com/google/wycheproof
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Samuel Neves <sneves@dei.uc.pt>
> Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
> Cc: linux-crypto@vger.kernel.org

I think the crypto portion of the patchset is looking *slightly* better from v1.
(Thanks for getting rid of the #ifdef mazes!)  But here are some more comments,
with the caveat that I haven't really reviewed any actual implementations yet,
and it's a *lot* of new code so this is really just scratching the surface...:

I thought you were going to wrap lines at 80 characters?  It's hard to read the
extremely long lines, and they encourage deep nesting.

As I said before, I still think you need to switch the crypto API ChaCha20 and
Poly1305 over to use the new implementations.  It's not okay to have two
completely different sets of ChaCha20 and Poly1305 implementations just because
you want a different API, so you might as well get started on it...  The thing
is that before you try it, it's not clear what problems will come up that
require changes to the design.  So, this really ought to be addressed up-front.

It's also not clearly explained whether/why/how the new ChaCha20 and Poly1305
implementations are better than the existing ones.  The patch adding the ARM and
ARM64 ChaCha, for example, just says who wrote them, with no mention of why the
particular implementations were chosen.

You've also documented a lot of stuff in commit messages which will be lost as
it isn't being added to the source itself.  There are various examples of this,
such as information about where the various implementations came from, what you
changed, why a particular implementation isn't used on Skylake or whatever, etc.
Can you please make sure that any important information is in comments, e.g. at
the top of the files?  There maybe should even be a Documentation/ file for
"Zinc", rather than only a long commit message.

I still think the "Zinc" name is confusing and misleading, and people are going
to forget that it means "crypto".  lib/crypto/ would be more intuitive.
But I don't care *that* much myself, and you should see what others think...

It seems you still don't explicitly clarify anywhere in the source itself that
the copyright holders of the code from OpenSSL have relicensed it under GPLv2.
I only see a GPLv2 license slapped on the files, yet no such license is presence
in the OpenSSL originals, at least in the one I checked.  If you did receive
explicit permission, then you should include an explicit clarification in each
file like the one in arch/arm/crypto/sha1-armv4-large.S.  Otherwise people will
be confused and come asking about the license status.

As Ard and I discussed recently on my patch
"crypto: arm/poly1305 - add NEON accelerated Poly1305 implementation"
which proposed adding the exact same poly1305-arm.S file, for all the OpenSSL
assembly it would probably be better to include the .pl file and generate the .S
file as part of the build process.  For one, there is semantic information like
register names in the .pl script that is lost in the .S file, thereby making the
.S file less readable.

There are still some alignment bugs where integers are loaded from byte arrays
without using the unaligned access macros, e.g. in chacha20_init(),
hchacha20_generic(), and fe_frombytes_impl().  I found these grepping for
le32_to_cpu.  Interestingly, that last one is in "formally verified" code :-)

Thanks!

Eric

^ permalink raw reply

* Re: [PATCH NET 3/3] net: hns: add configuration constraints for 1000M half
From: lipeng (Y) @ 2018-08-25  1:21 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: davem, netdev, linux-kernel, linuxarm, yisen.zhuang, salil.mehta
In-Reply-To: <20180824132857.GA1999@lunn.ch>



On 2018/8/24 21:28, Andrew Lunn wrote:
> On Fri, Aug 24, 2018 at 02:39:36PM +0800, lipeng (Y) wrote:
>>
>> On 2018/8/24 11:41, Andrew Lunn wrote:
>>> On Fri, Aug 24, 2018 at 11:42:23AM +0800, Peng Li wrote:
>>>> Hisilicon hip05 and hip06 board network card do not support
>>>> 1000M half configuration. Driver can not config gmac as
>>>> 1000M half.
>>>>
>>>> Signed-off-by: Peng Li <lipeng321@huawei.com>
>>> Hi Peng
>>>
>>> Does the driver remove SUPPORTED_1000baseT_Half from
>>> phydev->supported?  If you do that, the PHY should never negotiate
>>> this speed.
>>>
>>>     Andrew
>> Hi, Andrew,
>>
>> The driver has removed SUPPORTED_1000baseT_Half from
>>
>> phydev->supported.
>>
>> the code is :
>> #define MAC_GMAC_SUPPORTED \
>> 	(SUPPORTED_10baseT_Half \
>> 	| SUPPORTED_10baseT_Full \
>> 	| SUPPORTED_100baseT_Half \
>> 	| SUPPORTED_100baseT_Full \
>> 	| SUPPORTED_Autoneg)
>>   h->if_support = MAC_GMAC_SUPPORTED;
>>   h->if_support |= SUPPORTED_1000baseT_Full;
>> phydev->supported &= h->if_support;
>>
>> As gmac do not support 1000M half, we add this patch to
>> make sure that no users can set 1000M half in any case.
> Well, not quite. What this patch does is protect the hardware when it
> is asked to change to an unsupported mode. This patch has nothing to
> do with user APIs.
>
> The user API for setting link modes is hns_nic_set_link_ksettings().
>
> What you do have is
>
>                  if (speed == SPEED_1000 && cmd->base.duplex == DUPLEX_HALF)
>                          return -EINVAL;
>
> which should stop somebody setting up a fixed speed link at 1000Half.
>
> But you don't do anything with cmd->link_modes.advertising. However,
> when you call phy_ethtool_ksettings_set(), it gets AND'ed with
> phydev->supported, which you have already removed 1000Half from.
>
> So is this a patch for a theoretical problem? Or have you seen it
> happen? If it did happen, how did the user configure it to cause this
> problem? That user API needs to prevent it.
>
> 	 Andrew
Hi,  Andrew,

HW  do not support 1000M half, and the driver has done the 2 things:
1, remove SUPPORTED_1000baseT_Half from phydev->supported.
2,   add the protect in  hns_nic_set_link_ksettings().

This patch is a  theoretical protect, and the problem does not really  
happen.

I think you really get the point, do you think we need this patch?

Thanks
> .
>

^ permalink raw reply

* Re: [RFC v3 net-next 3/5] ebpf: fix bpf_msg_pull_data
From: John Fastabend @ 2018-08-25  1:02 UTC (permalink / raw)
  To: Tushar Dave, ast, daniel, davem, sowmini.varadhan,
	santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
	sandipan, kafai, rdna, yhs, netdev
In-Reply-To: <1534547305-25140-4-git-send-email-tushar.n.dave@oracle.com>

On 08/17/2018 04:08 PM, Tushar Dave wrote:
> Like sockmap (sk_msg), socksg also deals with struct scatterlist
> therefore socksg programs can use existing bpf helper bpf_msg_pull_data
> to access packet data contained in struct scatterlist. While doing some
> prelimnary testing, there are couple of issues found with
> bpf_msg_pull_data that are fixed in this patch.
> 
> Also, there cannot be more than MAX_SKB_FRAGS entries in sg_data
> therefore any checks for sg entry more than MAX_SKB_FRAGS in
> bpf_msg_pull_data() is removed.

In sockmap the scatterlist is used as a ring so the MAX_SKB_FRAGS
check is needed to keep searching through the ring when sg_start
is non-zero.

> 
> Besides that, I also ran into issues while put_page() is invoked.
> e.g.
> [ 450.568723] BUG: Bad page state in process swapper/10 pfn:2021540
> [ 450.575632] page:ffffea0080855000 count:0 mapcount:0
> mapping:ffff88103d006840 index:0xffff882021540000 compound_mapcount: 0
> [ 450.588069] flags: 0x6fffff80008100(slab|head)
> [ 450.593033] raw: 006fffff80008100 dead000000000100 dead000000000200
> ffff88103d006840
> [ 450.601683] raw: ffff882021540000 0000000080080007 00000000ffffffff
> 0000000000000000
> [ 450.610337] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> [ 450.617530] bad because of flags: 0x100(slab)
> 
> To avoid above issue, currently put_page() is disabled in this patch
> temporarily. I am working on alternatives so that page allocated via
> slab (in this case) can be freed without any issue.> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  net/core/filter.c | 61 +++++++++++++++++++++++++++++--------------------------
>  1 file changed, 32 insertions(+), 29 deletions(-)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index e427c8e..cc52baa 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2316,7 +2316,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>  BPF_CALL_4(bpf_msg_pull_data,
>  	   struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
>  {
> -	unsigned int len = 0, offset = 0, copy = 0;
> +	unsigned int len = 0, offset = 0, copy = 0, off = 0;
>  	struct scatterlist *sg = msg->sg_data;
>  	int first_sg, last_sg, i, shift;
>  	unsigned char *p, *to, *from;
> @@ -2330,22 +2330,28 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>  	i = msg->sg_start;
>  	do {
>  		len = sg[i].length;
> -		offset += len;
>  		if (start < offset + len)
>  			break;
> +		offset += len;

This looks like a generic fix unrelated to this series.
Can you send that as a bugfix?

>  		i++;
> -		if (i == MAX_SKB_FRAGS)
> -			i = 0;
> -	} while (i != msg->sg_end);
> +	} while (i <= msg->sg_end);
>  

As noted above the MAX_SKB_FRAGS check is needed because
sg_start can be non-zero and sg_end < st_start. In these
cases we need to search the entries at the start of the
array (being used as a ring).

> +	/* return error if start is out of range */
>  	if (unlikely(start >= offset + len))
>  		return -EINVAL;
>  
> -	if (!msg->sg_copy[i] && bytes <= len)
> -		goto out;
> +	/* return error if i is last entry in sglist and end is out of range */
> +	if (msg->sg_copy[i] && end > offset + len)
> +		return -EINVAL>  
>  	first_sg = i;
>  
> +	/* if i is not last entry in sg list and end (i.e start + bytes) is
> +	 * within this sg[i] then goto out and calculate data and data_end
> +	 */
> +	if (!msg->sg_copy[i] && end <= offset + len)
> +		goto out;
> +>  	/* At this point we need to linearize multiple scatterlist
>  	 * elements or a single shared page. Either way we need to
>  	 * copy into a linear buffer exclusively owned by BPF. Then
> @@ -2359,11 +2365,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>  	do {
>  		copy += sg[i].length;
>  		i++;
> -		if (i == MAX_SKB_FRAGS)
> -			i = 0;

same as above, need to keep.

> -		if (bytes < copy)
> +		if (end < copy)
>  			break;
> -	} while (i != msg->sg_end);
> +	} while (i <= msg->sg_end);
> +
> +	/* return error if i is last entry in sglist and end is out of range */
> +	if (i > msg->sg_end && end > offset + copy)
> +		return -EINVAL;
> +
>  	last_sg = i;
>  
>  	if (unlikely(copy < end - start))
> @@ -2373,23 +2382,25 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>  	if (unlikely(!page))
>  		return -ENOMEM;
>  	p = page_address(page);
> -	offset = 0;
>  
>  	i = first_sg;
>  	do {
>  		from = sg_virt(&sg[i]);
>  		len = sg[i].length;
> -		to = p + offset;
> +		to = p + off;

Not really sure if the change from offset->off is needed. Looks
like it just makes a bigger diff.

>  
>  		memcpy(to, from, len);
> -		offset += len;
> +		off += len;
>  		sg[i].length = 0;
> -		put_page(sg_page(&sg[i]));
> +		/* if original page is allocated via slab then put_page
> +		 * causes error BUG: Bad page state in process. So temporarily
> +		 * disabled put_page.
> +		 * Todo: fix it
> +		 */
> +		//put_page(sg_page(&sg[i]));
>  
>  		i++;
> -		if (i == MAX_SKB_FRAGS)
> -			i = 0;
> -	} while (i != last_sg);
> +	} while (i < last_sg);
>  
>  	sg[first_sg].length = copy;
>  	sg_set_page(&sg[first_sg], page, copy, 0);
> @@ -2406,12 +2417,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>  	do {
>  		int move_from;
>  
> -		if (i + shift >= MAX_SKB_FRAGS)
> -			move_from = i + shift - MAX_SKB_FRAGS;
> -		else
> -			move_from = i + shift;
> -

Need to keep same as above.

> -		if (move_from == msg->sg_end)
> +		move_from = i + shift;> +		if (move_from > msg->sg_end)
>  			break;
>  
>  		sg[i] = sg[move_from];
> @@ -2420,14 +2427,10 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>  		sg[move_from].offset = 0;
>  
>  		i++;
> -		if (i == MAX_SKB_FRAGS)
> -			i = 0;
>  	} while (1);
>  	msg->sg_end -= shift;
> -	if (msg->sg_end < 0)
> -		msg->sg_end += MAX_SKB_FRAGS;
>  out:
> -	msg->data = sg_virt(&sg[i]) + start - offset;
> +	msg->data = sg_virt(&sg[first_sg]) + start - offset;
>  	msg->data_end = msg->data + bytes;
>  
>  	return 0;
> 

Thanks,
John

^ permalink raw reply

* Re: [PATCH bpf 0/2] Two BPF sockmap fixes
From: Alexei Starovoitov @ 2018-08-25  1:01 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: john.fastabend, netdev
In-Reply-To: <20180824200851.12308-1-daniel@iogearbox.net>

On Fri, Aug 24, 2018 at 10:08:49PM +0200, Daniel Borkmann wrote:
> This series contains two fixes found while working with the BPF
> sockmap code. One refcount leakage and one case for a small window
> of use after free. See patches for more info. Thanks!

Applied, Thanks

^ permalink raw reply

* Re: [PATCH] correct comments for flags, priv flags in netdevice.h
From: Andrew Lunn @ 2018-08-25  0:38 UTC (permalink / raw)
  To: Robert P. J. Day; +Cc: Linux kernel netdev mailing list
In-Reply-To: <alpine.LFD.2.21.1808241908260.11448@localhost.localdomain>

On Fri, Aug 24, 2018 at 07:10:20PM -0400, Robert P. J. Day wrote:
> Correct the references for both flags and priv flags since they both
> refer to the incorrect file which contains their explanations.

Hi Robert

I'm guessing this is your first patch for Linux?

Please use git send-email to sending patches. That will avoid the
signature being added, and any other mangling your email client may
do.

The subject line should indicate what tree the patch is for:
[PATCH net-next] correct comments for flags, priv flags in netdevice.h

You can do this with git format-patch --subject "PATCH net-next".

And lastly, net-next is closed at the moment, because the merge window
is open. DaveM will send an email when he is read to accept patches, a
few days after -rc1 is out.

   Andrew

^ permalink raw reply

* [bpf PATCH] bpf: sockmap, decrement copied count correctly in redirect error case
From: John Fastabend @ 2018-08-25  0:37 UTC (permalink / raw)
  To: alexei.starovoitov, daniel; +Cc: netdev

Currently, when a redirect occurs in sockmap and an error occurs in
the redirect call we unwind the scatterlist once in the error path
of bpf_tcp_sendmsg_do_redirect() and then again in sendmsg(). Then
in the error path of sendmsg we decrement the copied count by the
send size.

However, its possible we partially sent data before the error was
generated. This can happen if do_tcp_sendpages() partially sends the
scatterlist before encountering a memory pressure error. If this
happens we need to decrement the copied value (the value tracking
how many bytes were actually sent to TCP stack) by the number of
remaining bytes _not_ the entire send size. Otherwise we risk
confusing userspace.

Also we don't need two calls to free the scatterlist one is
good enough. So remove the one in bpf_tcp_sendmsg_do_redirect() and
then properly reduce copied by the number of remaining bytes which
may in fact be the entire send size if no bytes were sent.

To do this use bool to indicate if free_start_sg() should do mem
accounting or not.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |   45 ++++++++++++++++++++++-----------------------
 1 file changed, 22 insertions(+), 23 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 1d092f3..b7bf1ad 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -236,7 +236,7 @@ static int bpf_tcp_init(struct sock *sk)
 }
 
 static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
-static int free_start_sg(struct sock *sk, struct sk_msg_buff *md);
+static int free_start_sg(struct sock *sk, struct sk_msg_buff *md, bool charge);
 
 static void bpf_tcp_release(struct sock *sk)
 {
@@ -248,7 +248,7 @@ static void bpf_tcp_release(struct sock *sk)
 		goto out;
 
 	if (psock->cork) {
-		free_start_sg(psock->sock, psock->cork);
+		free_start_sg(psock->sock, psock->cork, true);
 		kfree(psock->cork);
 		psock->cork = NULL;
 	}
@@ -330,14 +330,14 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
 	close_fun = psock->save_close;
 
 	if (psock->cork) {
-		free_start_sg(psock->sock, psock->cork);
+		free_start_sg(psock->sock, psock->cork, true);
 		kfree(psock->cork);
 		psock->cork = NULL;
 	}
 
 	list_for_each_entry_safe(md, mtmp, &psock->ingress, list) {
 		list_del(&md->list);
-		free_start_sg(psock->sock, md);
+		free_start_sg(psock->sock, md, true);
 		kfree(md);
 	}
 
@@ -570,14 +570,16 @@ static void free_bytes_sg(struct sock *sk, int bytes,
 	md->sg_start = i;
 }
 
-static int free_sg(struct sock *sk, int start, struct sk_msg_buff *md)
+static int free_sg(struct sock *sk, int start,
+		   struct sk_msg_buff *md, bool charge)
 {
 	struct scatterlist *sg = md->sg_data;
 	int i = start, free = 0;
 
 	while (sg[i].length) {
 		free += sg[i].length;
-		sk_mem_uncharge(sk, sg[i].length);
+		if (charge)
+			sk_mem_uncharge(sk, sg[i].length);
 		if (!md->skb)
 			put_page(sg_page(&sg[i]));
 		sg[i].length = 0;
@@ -594,9 +596,9 @@ static int free_sg(struct sock *sk, int start, struct sk_msg_buff *md)
 	return free;
 }
 
-static int free_start_sg(struct sock *sk, struct sk_msg_buff *md)
+static int free_start_sg(struct sock *sk, struct sk_msg_buff *md, bool charge)
 {
-	int free = free_sg(sk, md->sg_start, md);
+	int free = free_sg(sk, md->sg_start, md, charge);
 
 	md->sg_start = md->sg_end;
 	return free;
@@ -604,7 +606,7 @@ static int free_start_sg(struct sock *sk, struct sk_msg_buff *md)
 
 static int free_curr_sg(struct sock *sk, struct sk_msg_buff *md)
 {
-	return free_sg(sk, md->sg_curr, md);
+	return free_sg(sk, md->sg_curr, md, true);
 }
 
 static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
@@ -718,7 +720,7 @@ static int bpf_tcp_ingress(struct sock *sk, int apply_bytes,
 		list_add_tail(&r->list, &psock->ingress);
 		sk->sk_data_ready(sk);
 	} else {
-		free_start_sg(sk, r);
+		free_start_sg(sk, r, true);
 		kfree(r);
 	}
 
@@ -752,14 +754,10 @@ static int bpf_tcp_sendmsg_do_redirect(struct sock *sk, int send,
 		release_sock(sk);
 	}
 	smap_release_sock(psock, sk);
-	if (unlikely(err))
-		goto out;
-	return 0;
+	return err;
 out_rcu:
 	rcu_read_unlock();
-out:
-	free_bytes_sg(NULL, send, md, false);
-	return err;
+	return 0;
 }
 
 static inline void bpf_md_init(struct smap_psock *psock)
@@ -822,7 +820,7 @@ static int bpf_exec_tx_verdict(struct smap_psock *psock,
 	case __SK_PASS:
 		err = bpf_tcp_push(sk, send, m, flags, true);
 		if (unlikely(err)) {
-			*copied -= free_start_sg(sk, m);
+			*copied -= free_start_sg(sk, m, true);
 			break;
 		}
 
@@ -845,16 +843,17 @@ static int bpf_exec_tx_verdict(struct smap_psock *psock,
 		lock_sock(sk);
 
 		if (unlikely(err < 0)) {
-			free_start_sg(sk, m);
+			int free = free_start_sg(sk, m, false);
+
 			psock->sg_size = 0;
 			if (!cork)
-				*copied -= send;
+				*copied -= free;
 		} else {
 			psock->sg_size -= send;
 		}
 
 		if (cork) {
-			free_start_sg(sk, m);
+			free_start_sg(sk, m, true);
 			psock->sg_size = 0;
 			kfree(m);
 			m = NULL;
@@ -1122,7 +1121,7 @@ static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 		err = sk_stream_wait_memory(sk, &timeo);
 		if (err) {
 			if (m && m != psock->cork)
-				free_start_sg(sk, m);
+				free_start_sg(sk, m, true);
 			goto out_err;
 		}
 	}
@@ -1581,13 +1580,13 @@ static void smap_gc_work(struct work_struct *w)
 		bpf_prog_put(psock->bpf_tx_msg);
 
 	if (psock->cork) {
-		free_start_sg(psock->sock, psock->cork);
+		free_start_sg(psock->sock, psock->cork, true);
 		kfree(psock->cork);
 	}
 
 	list_for_each_entry_safe(md, mtmp, &psock->ingress, list) {
 		list_del(&md->list);
-		free_start_sg(psock->sock, md);
+		free_start_sg(psock->sock, md, true);
 		kfree(md);
 	}
 

^ permalink raw reply related

* Re: [PATCH bpf 2/2] bpf, sockmap: fix psock refcount leak in bpf_tcp_recvmsg
From: John Fastabend @ 2018-08-25  0:29 UTC (permalink / raw)
  To: Daniel Borkmann, alexei.starovoitov; +Cc: netdev
In-Reply-To: <20180824200851.12308-3-daniel@iogearbox.net>

On 08/24/2018 01:08 PM, Daniel Borkmann wrote:
> In bpf_tcp_recvmsg() we first took a reference on the psock, however
> once we find that there are skbs in the normal socket's receive queue
> we return with processing them through tcp_recvmsg(). Problem is that
> we leak the taken reference on the psock in that path. Given we don't
> really do anything with the psock at this point, move the skb_queue_empty()
> test before we fetch the psock to fix this case.
> 
> Fixes: 8934ce2fd081 ("bpf: sockmap redirect ingress support")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---

Oops. Thanks for catching this, all of our use cases and tests
to this point either always did redirect or normal receive and
I missed the mixed case. I'll write a tests for bpf-next to also
ensure we catch anything else.

Thanks.

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply

* Re: [PATCH bpf 1/2] bpf, sockmap: fix potential use after free in bpf_tcp_close
From: John Fastabend @ 2018-08-25  0:25 UTC (permalink / raw)
  To: Daniel Borkmann, alexei.starovoitov; +Cc: netdev
In-Reply-To: <20180824200851.12308-2-daniel@iogearbox.net>

On 08/24/2018 01:08 PM, Daniel Borkmann wrote:
> bpf_tcp_close() we pop the psock linkage to a map via psock_map_pop().
> A parallel update on the sock hash map can happen between psock_map_pop()
> and lookup_elem_raw() where we override the element under link->hash /
> link->key. In bpf_tcp_close()'s lookup_elem_raw() we subsequently only
> test whether an element is present, but we do not test whether the
> element is infact the element we were looking for.
> 
> We lock the sock in bpf_tcp_close() during that time, so do we hold
> the lock in sock_hash_update_elem(). However, the latter locks the
> sock which is newly updated, not the one we're purging from the hash
> table. This means that while one CPU is doing the lookup from bpf_tcp_close(),
> another CPU is doing the map update in parallel, dropped our sock from
> the hlist and released the psock.
> 
> Subsequently the first CPU will find the new sock and attempts to drop
> and release the old sock yet another time. Fix is that we need to check
> the elements for a match after lookup, similar as we do in the sock map.
> Note that the hash tab elems are freed via RCU, so access to their
> link->hash / link->key is fine since we're under RCU read side there.
> 
> Fixes: e9db4ef6bf4c ("bpf: sockhash fix omitted bucket lock in sock_close")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---

Looks good to me, nice catch by the way.

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply

* Re: KASAN: use-after-free Read in __rhashtable_lookup (2)
From: Dmitry Vyukov @ 2018-08-24 23:50 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: syzbot, David Miller, LKML, linux-rdma, netdev, rds-devel,
	Santosh Shilimkar, syzkaller-bugs
In-Reply-To: <20180824233700.GC31961@oracle.com>

On Fri, Aug 24, 2018 at 4:37 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (08/24/18 16:10), Dmitry Vyukov wrote:
>> You need a colon after dup.
>
> I see.
>
>> But this can't possibly be a dup of that bug. That bug was fixed in February:
>
> Apologies, I unknowingly made a mistake in my
> syzbot-tiquette in that case,
>
> I did not intend to *close* the bug as a dup, I was
> merely trying to indicate that this is yet another
> manifestation of the original report with subject
> "KASAN: use-after-free Read in rds_find_bound"
>
> All of these problems are happening because we are missing
> a synchronize_net() in rds_release.
>
> A full description of the race conditon can be found in
> https://www.spinics.net/lists/netdev/msg475074.html
>
> If rds_release is going to nuke the rds_socket
> it needs to sychronize_net() with other threads that may be
> looking up the bind_hash_table. If we dont do that,
> all of these "use-after-free" races can happen.
> As long as we dont have the synchronize_net() we are
> missing a big (and standard) bit of RCU synchronization.


If it's a note for humans, then just a plain English comment will do.

^ permalink raw reply

* [PATCH] correct comments for flags, priv flags in netdevice.h
From: Robert P. J. Day @ 2018-08-24 23:10 UTC (permalink / raw)
  To: Linux kernel netdev mailing list

Correct the references for both flags and priv flags since they both
refer to the incorrect file which contains their explanations.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>

---

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ca5ab98053c8..f01f3473bb91 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1586,9 +1586,9 @@ enum netdev_priv_flags {
  *	@header_ops:	Includes callbacks for creating,parsing,caching,etc
  *			of Layer 2 headers.
  *
- *	@flags:		Interface flags (a la BSD)
- *	@priv_flags:	Like 'flags' but invisible to userspace,
- *			see if.h for the definitions
+ *	@flags:		Interface flags (a la BSD), see if.h for definitions
+ *	@priv_flags:	Like 'flags' but invisible to userspace, see
+ *			definitions earlier in this file
  *	@gflags:	Global flags ( kept as legacy )
  *	@padded:	How much padding added by alloc_netdev()
  *	@operstate:	RFC2863 operstate

-- 

========================================================================
Robert P. J. Day                                 Ottawa, Ontario, CANADA
                  http://crashcourse.ca/dokuwiki

Twitter:                                       http://twitter.com/rpjday
LinkedIn:                               http://ca.linkedin.com/in/rpjday
========================================================================

^ permalink raw reply related

* Re: KASAN: use-after-free Read in __rhashtable_lookup (2)
From: Sowmini Varadhan @ 2018-08-24 23:04 UTC (permalink / raw)
  To: syzbot
  Cc: davem, linux-kernel, linux-rdma, netdev, rds-devel,
	santosh.shilimkar, syzkaller-bugs
In-Reply-To: <00000000000027a1e605741b2afa@google.com>



#syz dup KASAN: use-after-free Read in rds_find_bound

The problem is a missing synchronize_net() in rds_release.
References:
  https://www.spinics.net/lists/netdev/msg475074.html
  https://www.spinics.net/lists/netdev/msg507908.html

^ permalink raw reply

* Re: oops with ip6_rt_cache_alloc
From: David Ahern @ 2018-08-24 23:04 UTC (permalink / raw)
  To: Yonghong Song, netdev, Alexei Starovoitov, Martin Lau, Dave Jones
In-Reply-To: <5d3d7d56-ce9f-79c3-04ec-122a2451b580@fb.com>

On 8/24/18 4:26 PM, Yonghong Song wrote:
> Hi,
> 
> We got a kernel oops with the following stack trace:
> 
> CPU: 24 PID: 0 Comm: swapper/24 Not tainted
> 4.16.0-10_fbk1_1183_g7e4ee4c8171c #10
> "Hardware name: Quanta Leopard-DDR3/Leopard-DDR3, BIOS F06_3A16.DDR3
> 11/19/2015"
> RIP: 0010:ip6_rt_get_dev_rcu+0x6/0x60
> RSP: 0018:ffff88046fb03c78 EFLAGS: 00010286
> RAX: 0000000040000003 RBX: ffff88035a6c1500 RCX: ffffffff81ec5dc0
> RDX: ffff88033192a090 RSI: ffff88033192a0a0 RDI: 0000000000000000

RDI = 0 means the rt passed to ip6_rt_get_dev_rcu is NULL. I believe
that can't happen prior to the fib6_info changes. After the fib6_info
changes, it means the 'from' is NULL and that is not expected.

...

> Our internal experiments showed that an early version of 4.16 works fine
> and after backporting some ipv6 route related changes and the above
> problem showed up.

Can you run the test on 4.18?

^ permalink raw reply

* oops with ip6_rt_cache_alloc
From: Yonghong Song @ 2018-08-24 22:26 UTC (permalink / raw)
  To: David Ahern, netdev, Alexei Starovoitov, Martin Lau, Dave Jones

Hi,

We got a kernel oops with the following stack trace:

CPU: 24 PID: 0 Comm: swapper/24 Not tainted 
4.16.0-10_fbk1_1183_g7e4ee4c8171c #10
"Hardware name: Quanta Leopard-DDR3/Leopard-DDR3, BIOS F06_3A16.DDR3 
11/19/2015"
RIP: 0010:ip6_rt_get_dev_rcu+0x6/0x60
RSP: 0018:ffff88046fb03c78 EFLAGS: 00010286
RAX: 0000000040000003 RBX: ffff88035a6c1500 RCX: ffffffff81ec5dc0
RDX: ffff88033192a090 RSI: ffff88033192a0a0 RDI: 0000000000000000
RBP: ffff88046fb03cb0 R08: 0000000040000003 R09: ffff8803eb770d00
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88033192a0a0
R13: ffff88033192a090 R14: 0000000000000000 R15: ffff8803d748d700
FS:  0000000000000000(0000) GS:ffff88046fb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000054 CR3: 000000000220a002 CR4: 00000000001606e0
Call Trace:
  <IRQ>
  ip6_rt_cache_alloc+0x20/0x100
  __ip6_rt_update_pmtu+0xae/0x180
  ip6_tnl_xmit+0x330/0x970 [ip6_tunnel]
  ? __gre6_xmit+0x2d5/0x540 [ip6_gre]
  ? ip6_forward+0x522/0x7e0
  ? ip6_tnl_parse_tlv_enc_lim+0x59/0x190 [ip6_tunnel]
  ? ip6gre_tunnel_xmit+0xe3/0x320 [ip6_gre]
  ip6gre_tunnel_xmit+0xe3/0x320 [ip6_gre]
  dev_hard_start_xmit+0x9e/0x200
  sch_direct_xmit+0xeb/0x250
  __qdisc_run+0x146/0x510
  net_tx_action+0xde/0x210
  __do_softirq+0xd8/0x2a8
  irq_exit+0xa8/0xb0
  smp_apic_timer_interrupt+0x6c/0x120
  apic_timer_interrupt+0xf/0x20
  </IRQ>
RIP: 0010:poll_idle+0x31/0x61
RSP: 0018:ffffc9000328fed8 EFLAGS: 00000246
  ORIG_RAX: ffffffffffffff12
RAX: 0000000000000000 RBX: ffffffff822da9e0 RCX: ffff88046d4e7000
RDX: 0000000000000000 RSI: ffffffff822da9e0 RDI: ffffe8fc00301c00
RBP: ffffe8fc00301c00 R08: 0000000000000f1a R09: 0000000000000001
R10: ffffc9000328fec8 R11: 0000000000000f15 R12: 0000000000000000
R13: ffffffff822da9f8 R14: 0000000000000000 R15: 00002e37d560bb8e
  ? acpi_idle_do_entry+0x40/0x40
  cpuidle_enter_state+0x70/0x2a0
  do_idle+0xdf/0x170
  cpu_startup_entry+0x19/0x20
  secondary_startup_64+0xa5/0xb0
Code: d7 be 01 00 00 00 48 83 e0 fe 48 8b 00 48 89 42 10 ba 0f 00 00 00 
e9 7a fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 <f7> 47 
54 00 00 10 80 48 8b 9f a8 00 00 00 74 22 8b 83 0c 02 00
RIP: ip6_rt_get_dev_rcu+0x6/0x60 RSP: ffff88046fb03c78
CR2: 0000000000000054

Our internal experiments showed that an early version of 4.16 works fine
and after backporting some ipv6 route related changes and the above
problem showed up.

Have anybody seen this issue?

Thanks!

^ permalink raw reply

* [PATCH v2 16/17] zinc: Curve25519 x86_64 implementation
From: Jason A. Donenfeld @ 2018-08-24 21:38 UTC (permalink / raw)
  To: linux-kernel, netdev, davem
  Cc: Jason A. Donenfeld, Andy Lutomirski, Greg KH, Samuel Neves,
	Jean-Philippe Aumasson, Armando Faz-Hernández,
	Thomas Gleixner, Ingo Molnar, x86, linux-crypto
In-Reply-To: <20180824213849.23647-1-Jason@zx2c4.com>

This implementation is the fastest available x86_64 implementation, and
unlike Sandy2x, it doesn't requie use of the floating point registers at
all. Instead it makes use of BMI2 and ADX, available on recent
microarchitectures. The implementation was written by Armando
Faz-Hernández with contributions (upstream) from Samuel Neves and me,
in addition to further changes in the kernel implementation from us.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Armando Faz-Hernández <armfazh@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: x86@kernel.org
Cc: linux-crypto@vger.kernel.org
---
 lib/zinc/Makefile                            |    3 +
 lib/zinc/curve25519/curve25519-x86_64-glue.h |   45 +
 lib/zinc/curve25519/curve25519-x86_64.h      | 2053 ++++++++++++++++++
 3 files changed, 2101 insertions(+)
 create mode 100644 lib/zinc/curve25519/curve25519-x86_64-glue.h
 create mode 100644 lib/zinc/curve25519/curve25519-x86_64.h

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 13e403780ce0..f35a7017d847 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -56,6 +56,9 @@ ifeq ($(CONFIG_ARM)$(CONFIG_KERNEL_MODE_NEON),yy)
 zinc-y += curve25519/curve25519-arm.o
 CFLAGS_curve25519.o += -include $(srctree)/$(src)/curve25519/curve25519-arm-glue.h
 endif
+ifeq ($(CONFIG_X86_64),y)
+CFLAGS_curve25519.o += -include $(srctree)/$(src)/curve25519/curve25519-x86_64-glue.h
+endif
 endif
 
 ifeq ($(CONFIG_ZINC_BLAKE2S),y)
diff --git a/lib/zinc/curve25519/curve25519-x86_64-glue.h b/lib/zinc/curve25519/curve25519-x86_64-glue.h
new file mode 100644
index 000000000000..7dc1923e6b8c
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-x86_64-glue.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <zinc/curve25519.h>
+#include <asm/cpufeature.h>
+#include <asm/processor.h>
+
+#include "curve25519-x86_64.h"
+
+static bool curve25519_use_bmi2 __ro_after_init;
+static bool curve25519_use_adx __ro_after_init;
+
+void __init curve25519_fpu_init(void)
+{
+	curve25519_use_bmi2 = boot_cpu_has(X86_FEATURE_BMI2);
+	curve25519_use_adx = boot_cpu_has(X86_FEATURE_BMI2) && boot_cpu_has(X86_FEATURE_ADX);
+}
+
+static inline bool curve25519_arch(u8 mypublic[CURVE25519_POINT_SIZE], const u8 secret[CURVE25519_POINT_SIZE], const u8 basepoint[CURVE25519_POINT_SIZE])
+{
+	if (curve25519_use_adx) {
+		curve25519_adx(mypublic, secret, basepoint);
+		return true;
+	} else if (curve25519_use_bmi2) {
+		curve25519_bmi2(mypublic, secret, basepoint);
+		return true;
+	}
+	return false;
+}
+
+static inline bool curve25519_base_arch(u8 pub[CURVE25519_POINT_SIZE], const u8 secret[CURVE25519_POINT_SIZE])
+{
+	if (curve25519_use_adx) {
+		curve25519_adx_base(pub, secret);
+		return true;
+	} else if (curve25519_use_bmi2) {
+		curve25519_bmi2_base(pub, secret);
+		return true;
+	}
+	return false;
+}
+
+#define HAVE_CURVE25519_ARCH_IMPLEMENTATION
diff --git a/lib/zinc/curve25519/curve25519-x86_64.h b/lib/zinc/curve25519/curve25519-x86_64.h
new file mode 100644
index 000000000000..2ede548279d6
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-x86_64.h
@@ -0,0 +1,2053 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (c) 2017 Armando Faz <armfazh@ic.unicamp.br>. All Rights Reserved.
+ * Copyright (C) 2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ * Copyright (C) 2018 Samuel Neves <sneves@dei.uc.pt>. All Rights Reserved.
+ */
+
+enum { NUM_WORDS_ELTFP25519 = 4 };
+typedef __aligned(32) u64 eltfp25519_1w[NUM_WORDS_ELTFP25519];
+typedef __aligned(32) u64 eltfp25519_1w_buffer[2 * NUM_WORDS_ELTFP25519];
+
+#define mul_eltfp25519_1w_adx(c, a, b) do { \
+	mul_256x256_integer_adx(m.buffer, a, b); \
+	red_eltfp25519_1w_adx(c, m.buffer); \
+} while (0)
+
+#define mul_eltfp25519_1w_bmi2(c, a, b) do { \
+	mul_256x256_integer_bmi2(m.buffer, a, b); \
+	red_eltfp25519_1w_bmi2(c, m.buffer); \
+} while(0)
+
+#define sqr_eltfp25519_1w_adx(a) do { \
+	sqr_256x256_integer_adx(m.buffer, a); \
+	red_eltfp25519_1w_adx(a, m.buffer); \
+} while (0)
+
+#define sqr_eltfp25519_1w_bmi2(a) do { \
+	sqr_256x256_integer_bmi2(m.buffer, a); \
+	red_eltfp25519_1w_bmi2(a, m.buffer); \
+} while (0)
+
+#define mul_eltfp25519_2w_adx(c, a, b) do { \
+	mul2_256x256_integer_adx(m.buffer, a, b); \
+	red_eltfp25519_2w_adx(c, m.buffer); \
+} while (0)
+
+#define mul_eltfp25519_2w_bmi2(c, a, b) do { \
+	mul2_256x256_integer_bmi2(m.buffer, a, b); \
+	red_eltfp25519_2w_bmi2(c, m.buffer); \
+} while (0)
+
+#define sqr_eltfp25519_2w_adx(a) do { \
+	sqr2_256x256_integer_adx(m.buffer, a); \
+	red_eltfp25519_2w_adx(a, m.buffer); \
+} while (0)
+
+#define sqr_eltfp25519_2w_bmi2(a) do { \
+	sqr2_256x256_integer_bmi2(m.buffer, a); \
+	red_eltfp25519_2w_bmi2(a, m.buffer); \
+} while (0)
+
+#define sqrn_eltfp25519_1w_adx(a, times) do { \
+	int ____counter = (times); \
+	while (____counter-- > 0) \
+		sqr_eltfp25519_1w_adx(a); \
+} while (0)
+
+#define sqrn_eltfp25519_1w_bmi2(a, times) do { \
+	int ____counter = (times); \
+	while (____counter-- > 0) \
+		sqr_eltfp25519_1w_bmi2(a); \
+} while (0)
+
+#define copy_eltfp25519_1w(C, A) do { \
+	(C)[0] = (A)[0]; \
+	(C)[1] = (A)[1]; \
+	(C)[2] = (A)[2]; \
+	(C)[3] = (A)[3]; \
+} while (0)
+
+#define setzero_eltfp25519_1w(C) do { \
+	(C)[0] = 0; \
+	(C)[1] = 0; \
+	(C)[2] = 0; \
+	(C)[3] = 0; \
+} while (0)
+
+__aligned(32) static const u64 table_ladder_8k[252 * NUM_WORDS_ELTFP25519] = {
+	/*   1 */ 0xfffffffffffffff3UL, 0xffffffffffffffffUL, 0xffffffffffffffffUL, 0x5fffffffffffffffUL,
+	/*   2 */ 0x6b8220f416aafe96UL, 0x82ebeb2b4f566a34UL, 0xd5a9a5b075a5950fUL, 0x5142b2cf4b2488f4UL,
+	/*   3 */ 0x6aaebc750069680cUL, 0x89cf7820a0f99c41UL, 0x2a58d9183b56d0f4UL, 0x4b5aca80e36011a4UL,
+	/*   4 */ 0x329132348c29745dUL, 0xf4a2e616e1642fd7UL, 0x1e45bb03ff67bc34UL, 0x306912d0f42a9b4aUL,
+	/*   5 */ 0xff886507e6af7154UL, 0x04f50e13dfeec82fUL, 0xaa512fe82abab5ceUL, 0x174e251a68d5f222UL,
+	/*   6 */ 0xcf96700d82028898UL, 0x1743e3370a2c02c5UL, 0x379eec98b4e86eaaUL, 0x0c59888a51e0482eUL,
+	/*   7 */ 0xfbcbf1d699b5d189UL, 0xacaef0d58e9fdc84UL, 0xc1c20d06231f7614UL, 0x2938218da274f972UL,
+	/*   8 */ 0xf6af49beff1d7f18UL, 0xcc541c22387ac9c2UL, 0x96fcc9ef4015c56bUL, 0x69c1627c690913a9UL,
+	/*   9 */ 0x7a86fd2f4733db0eUL, 0xfdb8c4f29e087de9UL, 0x095e4b1a8ea2a229UL, 0x1ad7a7c829b37a79UL,
+	/*  10 */ 0x342d89cad17ea0c0UL, 0x67bedda6cced2051UL, 0x19ca31bf2bb42f74UL, 0x3df7b4c84980acbbUL,
+	/*  11 */ 0xa8c6444dc80ad883UL, 0xb91e440366e3ab85UL, 0xc215cda00164f6d8UL, 0x3d867c6ef247e668UL,
+	/*  12 */ 0xc7dd582bcc3e658cUL, 0xfd2c4748ee0e5528UL, 0xa0fd9b95cc9f4f71UL, 0x7529d871b0675ddfUL,
+	/*  13 */ 0xb8f568b42d3cbd78UL, 0x1233011b91f3da82UL, 0x2dce6ccd4a7c3b62UL, 0x75e7fc8e9e498603UL,
+	/*  14 */ 0x2f4f13f1fcd0b6ecUL, 0xf1a8ca1f29ff7a45UL, 0xc249c1a72981e29bUL, 0x6ebe0dbb8c83b56aUL,
+	/*  15 */ 0x7114fa8d170bb222UL, 0x65a2dcd5bf93935fUL, 0xbdc41f68b59c979aUL, 0x2f0eef79a2ce9289UL,
+	/*  16 */ 0x42ecbf0c083c37ceUL, 0x2930bc09ec496322UL, 0xf294b0c19cfeac0dUL, 0x3780aa4bedfabb80UL,
+	/*  17 */ 0x56c17d3e7cead929UL, 0xe7cb4beb2e5722c5UL, 0x0ce931732dbfe15aUL, 0x41b883c7621052f8UL,
+	/*  18 */ 0xdbf75ca0c3d25350UL, 0x2936be086eb1e351UL, 0xc936e03cb4a9b212UL, 0x1d45bf82322225aaUL,
+	/*  19 */ 0xe81ab1036a024cc5UL, 0xe212201c304c9a72UL, 0xc5d73fba6832b1fcUL, 0x20ffdb5a4d839581UL,
+	/*  20 */ 0xa283d367be5d0fadUL, 0x6c2b25ca8b164475UL, 0x9d4935467caaf22eUL, 0x5166408eee85ff49UL,
+	/*  21 */ 0x3c67baa2fab4e361UL, 0xb3e433c67ef35cefUL, 0x5259729241159b1cUL, 0x6a621892d5b0ab33UL,
+	/*  22 */ 0x20b74a387555cdcbUL, 0x532aa10e1208923fUL, 0xeaa17b7762281dd1UL, 0x61ab3443f05c44bfUL,
+	/*  23 */ 0x257a6c422324def8UL, 0x131c6c1017e3cf7fUL, 0x23758739f630a257UL, 0x295a407a01a78580UL,
+	/*  24 */ 0xf8c443246d5da8d9UL, 0x19d775450c52fa5dUL, 0x2afcfc92731bf83dUL, 0x7d10c8e81b2b4700UL,
+	/*  25 */ 0xc8e0271f70baa20bUL, 0x993748867ca63957UL, 0x5412efb3cb7ed4bbUL, 0x3196d36173e62975UL,
+	/*  26 */ 0xde5bcad141c7dffcUL, 0x47cc8cd2b395c848UL, 0xa34cd942e11af3cbUL, 0x0256dbf2d04ecec2UL,
+	/*  27 */ 0x875ab7e94b0e667fUL, 0xcad4dd83c0850d10UL, 0x47f12e8f4e72c79fUL, 0x5f1a87bb8c85b19bUL,
+	/*  28 */ 0x7ae9d0b6437f51b8UL, 0x12c7ce5518879065UL, 0x2ade09fe5cf77aeeUL, 0x23a05a2f7d2c5627UL,
+	/*  29 */ 0x5908e128f17c169aUL, 0xf77498dd8ad0852dUL, 0x74b4c4ceab102f64UL, 0x183abadd10139845UL,
+	/*  30 */ 0xb165ba8daa92aaacUL, 0xd5c5ef9599386705UL, 0xbe2f8f0cf8fc40d1UL, 0x2701e635ee204514UL,
+	/*  31 */ 0x629fa80020156514UL, 0xf223868764a8c1ceUL, 0x5b894fff0b3f060eUL, 0x60d9944cf708a3faUL,
+	/*  32 */ 0xaeea001a1c7a201fUL, 0xebf16a633ee2ce63UL, 0x6f7709594c7a07e1UL, 0x79b958150d0208cbUL,
+	/*  33 */ 0x24b55e5301d410e7UL, 0xe3a34edff3fdc84dUL, 0xd88768e4904032d8UL, 0x131384427b3aaeecUL,
+	/*  34 */ 0x8405e51286234f14UL, 0x14dc4739adb4c529UL, 0xb8a2b5b250634ffdUL, 0x2fe2a94ad8a7ff93UL,
+	/*  35 */ 0xec5c57efe843faddUL, 0x2843ce40f0bb9918UL, 0xa4b561d6cf3d6305UL, 0x743629bde8fb777eUL,
+	/*  36 */ 0x343edd46bbaf738fUL, 0xed981828b101a651UL, 0xa401760b882c797aUL, 0x1fc223e28dc88730UL,
+	/*  37 */ 0x48604e91fc0fba0eUL, 0xb637f78f052c6fa4UL, 0x91ccac3d09e9239cUL, 0x23f7eed4437a687cUL,
+	/*  38 */ 0x5173b1118d9bd800UL, 0x29d641b63189d4a7UL, 0xfdbf177988bbc586UL, 0x2959894fcad81df5UL,
+	/*  39 */ 0xaebc8ef3b4bbc899UL, 0x4148995ab26992b9UL, 0x24e20b0134f92cfbUL, 0x40d158894a05dee8UL,
+	/*  40 */ 0x46b00b1185af76f6UL, 0x26bac77873187a79UL, 0x3dc0bf95ab8fff5fUL, 0x2a608bd8945524d7UL,
+	/*  41 */ 0x26449588bd446302UL, 0x7c4bc21c0388439cUL, 0x8e98a4f383bd11b2UL, 0x26218d7bc9d876b9UL,
+	/*  42 */ 0xe3081542997c178aUL, 0x3c2d29a86fb6606fUL, 0x5c217736fa279374UL, 0x7dde05734afeb1faUL,
+	/*  43 */ 0x3bf10e3906d42babUL, 0xe4f7803e1980649cUL, 0xe6053bf89595bf7aUL, 0x394faf38da245530UL,
+	/*  44 */ 0x7a8efb58896928f4UL, 0xfbc778e9cc6a113cUL, 0x72670ce330af596fUL, 0x48f222a81d3d6cf7UL,
+	/*  45 */ 0xf01fce410d72caa7UL, 0x5a20ecc7213b5595UL, 0x7bc21165c1fa1483UL, 0x07f89ae31da8a741UL,
+	/*  46 */ 0x05d2c2b4c6830ff9UL, 0xd43e330fc6316293UL, 0xa5a5590a96d3a904UL, 0x705edb91a65333b6UL,
+	/*  47 */ 0x048ee15e0bb9a5f7UL, 0x3240cfca9e0aaf5dUL, 0x8f4b71ceedc4a40bUL, 0x621c0da3de544a6dUL,
+	/*  48 */ 0x92872836a08c4091UL, 0xce8375b010c91445UL, 0x8a72eb524f276394UL, 0x2667fcfa7ec83635UL,
+	/*  49 */ 0x7f4c173345e8752aUL, 0x061b47feee7079a5UL, 0x25dd9afa9f86ff34UL, 0x3780cef5425dc89cUL,
+	/*  50 */ 0x1a46035a513bb4e9UL, 0x3e1ef379ac575adaUL, 0xc78c5f1c5fa24b50UL, 0x321a967634fd9f22UL,
+	/*  51 */ 0x946707b8826e27faUL, 0x3dca84d64c506fd0UL, 0xc189218075e91436UL, 0x6d9284169b3b8484UL,
+	/*  52 */ 0x3a67e840383f2ddfUL, 0x33eec9a30c4f9b75UL, 0x3ec7c86fa783ef47UL, 0x26ec449fbac9fbc4UL,
+	/*  53 */ 0x5c0f38cba09b9e7dUL, 0x81168cc762a3478cUL, 0x3e23b0d306fc121cUL, 0x5a238aa0a5efdcddUL,
+	/*  54 */ 0x1ba26121c4ea43ffUL, 0x36f8c77f7c8832b5UL, 0x88fbea0b0adcf99aUL, 0x5ca9938ec25bebf9UL,
+	/*  55 */ 0xd5436a5e51fccda0UL, 0x1dbc4797c2cd893bUL, 0x19346a65d3224a08UL, 0x0f5034e49b9af466UL,
+	/*  56 */ 0xf23c3967a1e0b96eUL, 0xe58b08fa867a4d88UL, 0xfb2fabc6a7341679UL, 0x2a75381eb6026946UL,
+	/*  57 */ 0xc80a3be4c19420acUL, 0x66b1f6c681f2b6dcUL, 0x7cf7036761e93388UL, 0x25abbbd8a660a4c4UL,
+	/*  58 */ 0x91ea12ba14fd5198UL, 0x684950fc4a3cffa9UL, 0xf826842130f5ad28UL, 0x3ea988f75301a441UL,
+	/*  59 */ 0xc978109a695f8c6fUL, 0x1746eb4a0530c3f3UL, 0x444d6d77b4459995UL, 0x75952b8c054e5cc7UL,
+	/*  60 */ 0xa3703f7915f4d6aaUL, 0x66c346202f2647d8UL, 0xd01469df811d644bUL, 0x77fea47d81a5d71fUL,
+	/*  61 */ 0xc5e9529ef57ca381UL, 0x6eeeb4b9ce2f881aUL, 0xb6e91a28e8009bd6UL, 0x4b80be3e9afc3fecUL,
+	/*  62 */ 0x7e3773c526aed2c5UL, 0x1b4afcb453c9a49dUL, 0xa920bdd7baffb24dUL, 0x7c54699f122d400eUL,
+	/*  63 */ 0xef46c8e14fa94bc8UL, 0xe0b074ce2952ed5eUL, 0xbea450e1dbd885d5UL, 0x61b68649320f712cUL,
+	/*  64 */ 0x8a485f7309ccbdd1UL, 0xbd06320d7d4d1a2dUL, 0x25232973322dbef4UL, 0x445dc4758c17f770UL,
+	/*  65 */ 0xdb0434177cc8933cUL, 0xed6fe82175ea059fUL, 0x1efebefdc053db34UL, 0x4adbe867c65daf99UL,
+	/*  66 */ 0x3acd71a2a90609dfUL, 0xe5e991856dd04050UL, 0x1ec69b688157c23cUL, 0x697427f6885cfe4dUL,
+	/*  67 */ 0xd7be7b9b65e1a851UL, 0xa03d28d522c536ddUL, 0x28399d658fd2b645UL, 0x49e5b7e17c2641e1UL,
+	/*  68 */ 0x6f8c3a98700457a4UL, 0x5078f0a25ebb6778UL, 0xd13c3ccbc382960fUL, 0x2e003258a7df84b1UL,
+	/*  69 */ 0x8ad1f39be6296a1cUL, 0xc1eeaa652a5fbfb2UL, 0x33ee0673fd26f3cbUL, 0x59256173a69d2cccUL,
+	/*  70 */ 0x41ea07aa4e18fc41UL, 0xd9fc19527c87a51eUL, 0xbdaacb805831ca6fUL, 0x445b652dc916694fUL,
+	/*  71 */ 0xce92a3a7f2172315UL, 0x1edc282de11b9964UL, 0xa1823aafe04c314aUL, 0x790a2d94437cf586UL,
+	/*  72 */ 0x71c447fb93f6e009UL, 0x8922a56722845276UL, 0xbf70903b204f5169UL, 0x2f7a89891ba319feUL,
+	/*  73 */ 0x02a08eb577e2140cUL, 0xed9a4ed4427bdcf4UL, 0x5253ec44e4323cd1UL, 0x3e88363c14e9355bUL,
+	/*  74 */ 0xaa66c14277110b8cUL, 0x1ae0391610a23390UL, 0x2030bd12c93fc2a2UL, 0x3ee141579555c7abUL,
+	/*  75 */ 0x9214de3a6d6e7d41UL, 0x3ccdd88607f17efeUL, 0x674f1288f8e11217UL, 0x5682250f329f93d0UL,
+	/*  76 */ 0x6cf00b136d2e396eUL, 0x6e4cf86f1014debfUL, 0x5930b1b5bfcc4e83UL, 0x047069b48aba16b6UL,
+	/*  77 */ 0x0d4ce4ab69b20793UL, 0xb24db91a97d0fb9eUL, 0xcdfa50f54e00d01dUL, 0x221b1085368bddb5UL,
+	/*  78 */ 0xe7e59468b1e3d8d2UL, 0x53c56563bd122f93UL, 0xeee8a903e0663f09UL, 0x61efa662cbbe3d42UL,
+	/*  79 */ 0x2cf8ddddde6eab2aUL, 0x9bf80ad51435f231UL, 0x5deadacec9f04973UL, 0x29275b5d41d29b27UL,
+	/*  80 */ 0xcfde0f0895ebf14fUL, 0xb9aab96b054905a7UL, 0xcae80dd9a1c420fdUL, 0x0a63bf2f1673bbc7UL,
+	/*  81 */ 0x092f6e11958fbc8cUL, 0x672a81e804822fadUL, 0xcac8351560d52517UL, 0x6f3f7722c8f192f8UL,
+	/*  82 */ 0xf8ba90ccc2e894b7UL, 0x2c7557a438ff9f0dUL, 0x894d1d855ae52359UL, 0x68e122157b743d69UL,
+	/*  83 */ 0xd87e5570cfb919f3UL, 0x3f2cdecd95798db9UL, 0x2121154710c0a2ceUL, 0x3c66a115246dc5b2UL,
+	/*  84 */ 0xcbedc562294ecb72UL, 0xba7143c36a280b16UL, 0x9610c2efd4078b67UL, 0x6144735d946a4b1eUL,
+	/*  85 */ 0x536f111ed75b3350UL, 0x0211db8c2041d81bUL, 0xf93cb1000e10413cUL, 0x149dfd3c039e8876UL,
+	/*  86 */ 0xd479dde46b63155bUL, 0xb66e15e93c837976UL, 0xdafde43b1f13e038UL, 0x5fafda1a2e4b0b35UL,
+	/*  87 */ 0x3600bbdf17197581UL, 0x3972050bbe3cd2c2UL, 0x5938906dbdd5be86UL, 0x34fce5e43f9b860fUL,
+	/*  88 */ 0x75a8a4cd42d14d02UL, 0x828dabc53441df65UL, 0x33dcabedd2e131d3UL, 0x3ebad76fb814d25fUL,
+	/*  89 */ 0xd4906f566f70e10fUL, 0x5d12f7aa51690f5aUL, 0x45adb16e76cefcf2UL, 0x01f768aead232999UL,
+	/*  90 */ 0x2b6cc77b6248febdUL, 0x3cd30628ec3aaffdUL, 0xce1c0b80d4ef486aUL, 0x4c3bff2ea6f66c23UL,
+	/*  91 */ 0x3f2ec4094aeaeb5fUL, 0x61b19b286e372ca7UL, 0x5eefa966de2a701dUL, 0x23b20565de55e3efUL,
+	/*  92 */ 0xe301ca5279d58557UL, 0x07b2d4ce27c2874fUL, 0xa532cd8a9dcf1d67UL, 0x2a52fee23f2bff56UL,
+	/*  93 */ 0x8624efb37cd8663dUL, 0xbbc7ac20ffbd7594UL, 0x57b85e9c82d37445UL, 0x7b3052cb86a6ec66UL,
+	/*  94 */ 0x3482f0ad2525e91eUL, 0x2cb68043d28edca0UL, 0xaf4f6d052e1b003aUL, 0x185f8c2529781b0aUL,
+	/*  95 */ 0xaa41de5bd80ce0d6UL, 0x9407b2416853e9d6UL, 0x563ec36e357f4c3aUL, 0x4cc4b8dd0e297bceUL,
+	/*  96 */ 0xa2fc1a52ffb8730eUL, 0x1811f16e67058e37UL, 0x10f9a366cddf4ee1UL, 0x72f4a0c4a0b9f099UL,
+	/*  97 */ 0x8c16c06f663f4ea7UL, 0x693b3af74e970fbaUL, 0x2102e7f1d69ec345UL, 0x0ba53cbc968a8089UL,
+	/*  98 */ 0xca3d9dc7fea15537UL, 0x4c6824bb51536493UL, 0xb9886314844006b1UL, 0x40d2a72ab454cc60UL,
+	/*  99 */ 0x5936a1b712570975UL, 0x91b9d648debda657UL, 0x3344094bb64330eaUL, 0x006ba10d12ee51d0UL,
+	/* 100 */ 0x19228468f5de5d58UL, 0x0eb12f4c38cc05b0UL, 0xa1039f9dd5601990UL, 0x4502d4ce4fff0e0bUL,
+	/* 101 */ 0xeb2054106837c189UL, 0xd0f6544c6dd3b93cUL, 0x40727064c416d74fUL, 0x6e15c6114b502ef0UL,
+	/* 102 */ 0x4df2a398cfb1a76bUL, 0x11256c7419f2f6b1UL, 0x4a497962066e6043UL, 0x705b3aab41355b44UL,
+	/* 103 */ 0x365ef536d797b1d8UL, 0x00076bd622ddf0dbUL, 0x3bbf33b0e0575a88UL, 0x3777aa05c8e4ca4dUL,
+	/* 104 */ 0x392745c85578db5fUL, 0x6fda4149dbae5ae2UL, 0xb1f0b00b8adc9867UL, 0x09963437d36f1da3UL,
+	/* 105 */ 0x7e824e90a5dc3853UL, 0xccb5f6641f135cbdUL, 0x6736d86c87ce8fccUL, 0x625f3ce26604249fUL,
+	/* 106 */ 0xaf8ac8059502f63fUL, 0x0c05e70a2e351469UL, 0x35292e9c764b6305UL, 0x1a394360c7e23ac3UL,
+	/* 107 */ 0xd5c6d53251183264UL, 0x62065abd43c2b74fUL, 0xb5fbf5d03b973f9bUL, 0x13a3da3661206e5eUL,
+	/* 108 */ 0xc6bd5837725d94e5UL, 0x18e30912205016c5UL, 0x2088ce1570033c68UL, 0x7fba1f495c837987UL,
+	/* 109 */ 0x5a8c7423f2f9079dUL, 0x1735157b34023fc5UL, 0xe4f9b49ad2fab351UL, 0x6691ff72c878e33cUL,
+	/* 110 */ 0x122c2adedc5eff3eUL, 0xf8dd4bf1d8956cf4UL, 0xeb86205d9e9e5bdaUL, 0x049b92b9d975c743UL,
+	/* 111 */ 0xa5379730b0f6c05aUL, 0x72a0ffacc6f3a553UL, 0xb0032c34b20dcd6dUL, 0x470e9dbc88d5164aUL,
+	/* 112 */ 0xb19cf10ca237c047UL, 0xb65466711f6c81a2UL, 0xb3321bd16dd80b43UL, 0x48c14f600c5fbe8eUL,
+	/* 113 */ 0x66451c264aa6c803UL, 0xb66e3904a4fa7da6UL, 0xd45f19b0b3128395UL, 0x31602627c3c9bc10UL,
+	/* 114 */ 0x3120dc4832e4e10dUL, 0xeb20c46756c717f7UL, 0x00f52e3f67280294UL, 0x566d4fc14730c509UL,
+	/* 115 */ 0x7e3a5d40fd837206UL, 0xc1e926dc7159547aUL, 0x216730fba68d6095UL, 0x22e8c3843f69cea7UL,
+	/* 116 */ 0x33d074e8930e4b2bUL, 0xb6e4350e84d15816UL, 0x5534c26ad6ba2365UL, 0x7773c12f89f1f3f3UL,
+	/* 117 */ 0x8cba404da57962aaUL, 0x5b9897a81999ce56UL, 0x508e862f121692fcUL, 0x3a81907fa093c291UL,
+	/* 118 */ 0x0dded0ff4725a510UL, 0x10d8cc10673fc503UL, 0x5b9d151c9f1f4e89UL, 0x32a5c1d5cb09a44cUL,
+	/* 119 */ 0x1e0aa442b90541fbUL, 0x5f85eb7cc1b485dbUL, 0xbee595ce8a9df2e5UL, 0x25e496c722422236UL,
+	/* 120 */ 0x5edf3c46cd0fe5b9UL, 0x34e75a7ed2a43388UL, 0xe488de11d761e352UL, 0x0e878a01a085545cUL,
+	/* 121 */ 0xba493c77e021bb04UL, 0x2b4d1843c7df899aUL, 0x9ea37a487ae80d67UL, 0x67a9958011e41794UL,
+	/* 122 */ 0x4b58051a6697b065UL, 0x47e33f7d8d6ba6d4UL, 0xbb4da8d483ca46c1UL, 0x68becaa181c2db0dUL,
+	/* 123 */ 0x8d8980e90b989aa5UL, 0xf95eb14a2c93c99bUL, 0x51c6c7c4796e73a2UL, 0x6e228363b5efb569UL,
+	/* 124 */ 0xc6bbc0b02dd624c8UL, 0x777eb47dec8170eeUL, 0x3cde15a004cfafa9UL, 0x1dc6bc087160bf9bUL,
+	/* 125 */ 0x2e07e043eec34002UL, 0x18e9fc677a68dc7fUL, 0xd8da03188bd15b9aUL, 0x48fbc3bb00568253UL,
+	/* 126 */ 0x57547d4cfb654ce1UL, 0xd3565b82a058e2adUL, 0xf63eaf0bbf154478UL, 0x47531ef114dfbb18UL,
+	/* 127 */ 0xe1ec630a4278c587UL, 0x5507d546ca8e83f3UL, 0x85e135c63adc0c2bUL, 0x0aa7efa85682844eUL,
+	/* 128 */ 0x72691ba8b3e1f615UL, 0x32b4e9701fbe3ffaUL, 0x97b6d92e39bb7868UL, 0x2cfe53dea02e39e8UL,
+	/* 129 */ 0x687392cd85cd52b0UL, 0x27ff66c910e29831UL, 0x97134556a9832d06UL, 0x269bb0360a84f8a0UL,
+	/* 130 */ 0x706e55457643f85cUL, 0x3734a48c9b597d1bUL, 0x7aee91e8c6efa472UL, 0x5cd6abc198a9d9e0UL,
+	/* 131 */ 0x0e04de06cb3ce41aUL, 0xd8c6eb893402e138UL, 0x904659bb686e3772UL, 0x7215c371746ba8c8UL,
+	/* 132 */ 0xfd12a97eeae4a2d9UL, 0x9514b7516394f2c5UL, 0x266fd5809208f294UL, 0x5c847085619a26b9UL,
+	/* 133 */ 0x52985410fed694eaUL, 0x3c905b934a2ed254UL, 0x10bb47692d3be467UL, 0x063b3d2d69e5e9e1UL,
+	/* 134 */ 0x472726eedda57debUL, 0xefb6c4ae10f41891UL, 0x2b1641917b307614UL, 0x117c554fc4f45b7cUL,
+	/* 135 */ 0xc07cf3118f9d8812UL, 0x01dbd82050017939UL, 0xd7e803f4171b2827UL, 0x1015e87487d225eaUL,
+	/* 136 */ 0xc58de3fed23acc4dUL, 0x50db91c294a7be2dUL, 0x0b94d43d1c9cf457UL, 0x6b1640fa6e37524aUL,
+	/* 137 */ 0x692f346c5fda0d09UL, 0x200b1c59fa4d3151UL, 0xb8c46f760777a296UL, 0x4b38395f3ffdfbcfUL,
+	/* 138 */ 0x18d25e00be54d671UL, 0x60d50582bec8aba6UL, 0x87ad8f263b78b982UL, 0x50fdf64e9cda0432UL,
+	/* 139 */ 0x90f567aac578dcf0UL, 0xef1e9b0ef2a3133bUL, 0x0eebba9242d9de71UL, 0x15473c9bf03101c7UL,
+	/* 140 */ 0x7c77e8ae56b78095UL, 0xb678e7666e6f078eUL, 0x2da0b9615348ba1fUL, 0x7cf931c1ff733f0bUL,
+	/* 141 */ 0x26b357f50a0a366cUL, 0xe9708cf42b87d732UL, 0xc13aeea5f91cb2c0UL, 0x35d90c991143bb4cUL,
+	/* 142 */ 0x47c1c404a9a0d9dcUL, 0x659e58451972d251UL, 0x3875a8c473b38c31UL, 0x1fbd9ed379561f24UL,
+	/* 143 */ 0x11fabc6fd41ec28dUL, 0x7ef8dfe3cd2a2dcaUL, 0x72e73b5d8c404595UL, 0x6135fa4954b72f27UL,
+	/* 144 */ 0xccfc32a2de24b69cUL, 0x3f55698c1f095d88UL, 0xbe3350ed5ac3f929UL, 0x5e9bf806ca477eebUL,
+	/* 145 */ 0xe9ce8fb63c309f68UL, 0x5376f63565e1f9f4UL, 0xd1afcfb35a6393f1UL, 0x6632a1ede5623506UL,
+	/* 146 */ 0x0b7d6c390c2ded4cUL, 0x56cb3281df04cb1fUL, 0x66305a1249ecc3c7UL, 0x5d588b60a38ca72aUL,
+	/* 147 */ 0xa6ecbf78e8e5f42dUL, 0x86eeb44b3c8a3eecUL, 0xec219c48fbd21604UL, 0x1aaf1af517c36731UL,
+	/* 148 */ 0xc306a2836769bde7UL, 0x208280622b1e2adbUL, 0x8027f51ffbff94a6UL, 0x76cfa1ce1124f26bUL,
+	/* 149 */ 0x18eb00562422abb6UL, 0xf377c4d58f8c29c3UL, 0x4dbbc207f531561aUL, 0x0253b7f082128a27UL,
+	/* 150 */ 0x3d1f091cb62c17e0UL, 0x4860e1abd64628a9UL, 0x52d17436309d4253UL, 0x356f97e13efae576UL,
+	/* 151 */ 0xd351e11aa150535bUL, 0x3e6b45bb1dd878ccUL, 0x0c776128bed92c98UL, 0x1d34ae93032885b8UL,
+	/* 152 */ 0x4ba0488ca85ba4c3UL, 0x985348c33c9ce6ceUL, 0x66124c6f97bda770UL, 0x0f81a0290654124aUL,
+	/* 153 */ 0x9ed09ca6569b86fdUL, 0x811009fd18af9a2dUL, 0xff08d03f93d8c20aUL, 0x52a148199faef26bUL,
+	/* 154 */ 0x3e03f9dc2d8d1b73UL, 0x4205801873961a70UL, 0xc0d987f041a35970UL, 0x07aa1f15a1c0d549UL,
+	/* 155 */ 0xdfd46ce08cd27224UL, 0x6d0a024f934e4239UL, 0x808a7a6399897b59UL, 0x0a4556e9e13d95a2UL,
+	/* 156 */ 0xd21a991fe9c13045UL, 0x9b0e8548fe7751b8UL, 0x5da643cb4bf30035UL, 0x77db28d63940f721UL,
+	/* 157 */ 0xfc5eeb614adc9011UL, 0x5229419ae8c411ebUL, 0x9ec3e7787d1dcf74UL, 0x340d053e216e4cb5UL,
+	/* 158 */ 0xcac7af39b48df2b4UL, 0xc0faec2871a10a94UL, 0x140a69245ca575edUL, 0x0cf1c37134273a4cUL,
+	/* 159 */ 0xc8ee306ac224b8a5UL, 0x57eaee7ccb4930b0UL, 0xa1e806bdaacbe74fUL, 0x7d9a62742eeb657dUL,
+	/* 160 */ 0x9eb6b6ef546c4830UL, 0x885cca1fddb36e2eUL, 0xe6b9f383ef0d7105UL, 0x58654fef9d2e0412UL,
+	/* 161 */ 0xa905c4ffbe0e8e26UL, 0x942de5df9b31816eUL, 0x497d723f802e88e1UL, 0x30684dea602f408dUL,
+	/* 162 */ 0x21e5a278a3e6cb34UL, 0xaefb6e6f5b151dc4UL, 0xb30b8e049d77ca15UL, 0x28c3c9cf53b98981UL,
+	/* 163 */ 0x287fb721556cdd2aUL, 0x0d317ca897022274UL, 0x7468c7423a543258UL, 0x4a7f11464eb5642fUL,
+	/* 164 */ 0xa237a4774d193aa6UL, 0xd865986ea92129a1UL, 0x24c515ecf87c1a88UL, 0x604003575f39f5ebUL,
+	/* 165 */ 0x47b9f189570a9b27UL, 0x2b98cede465e4b78UL, 0x026df551dbb85c20UL, 0x74fcd91047e21901UL,
+	/* 166 */ 0x13e2a90a23c1bfa3UL, 0x0cb0074e478519f6UL, 0x5ff1cbbe3af6cf44UL, 0x67fe5438be812dbeUL,
+	/* 167 */ 0xd13cf64fa40f05b0UL, 0x054dfb2f32283787UL, 0x4173915b7f0d2aeaUL, 0x482f144f1f610d4eUL,
+	/* 168 */ 0xf6210201b47f8234UL, 0x5d0ae1929e70b990UL, 0xdcd7f455b049567cUL, 0x7e93d0f1f0916f01UL,
+	/* 169 */ 0xdd79cbf18a7db4faUL, 0xbe8391bf6f74c62fUL, 0x027145d14b8291bdUL, 0x585a73ea2cbf1705UL,
+	/* 170 */ 0x485ca03e928a0db2UL, 0x10fc01a5742857e7UL, 0x2f482edbd6d551a7UL, 0x0f0433b5048fdb8aUL,
+	/* 171 */ 0x60da2e8dd7dc6247UL, 0x88b4c9d38cd4819aUL, 0x13033ac001f66697UL, 0x273b24fe3b367d75UL,
+	/* 172 */ 0xc6e8f66a31b3b9d4UL, 0x281514a494df49d5UL, 0xd1726fdfc8b23da7UL, 0x4b3ae7d103dee548UL,
+	/* 173 */ 0xc6256e19ce4b9d7eUL, 0xff5c5cf186e3c61cUL, 0xacc63ca34b8ec145UL, 0x74621888fee66574UL,
+	/* 174 */ 0x956f409645290a1eUL, 0xef0bf8e3263a962eUL, 0xed6a50eb5ec2647bUL, 0x0694283a9dca7502UL,
+	/* 175 */ 0x769b963643a2dcd1UL, 0x42b7c8ea09fc5353UL, 0x4f002aee13397eabUL, 0x63005e2c19b7d63aUL,
+	/* 176 */ 0xca6736da63023beaUL, 0x966c7f6db12a99b7UL, 0xace09390c537c5e1UL, 0x0b696063a1aa89eeUL,
+	/* 177 */ 0xebb03e97288c56e5UL, 0x432a9f9f938c8be8UL, 0xa6a5a93d5b717f71UL, 0x1a5fb4c3e18f9d97UL,
+	/* 178 */ 0x1c94e7ad1c60cdceUL, 0xee202a43fc02c4a0UL, 0x8dafe4d867c46a20UL, 0x0a10263c8ac27b58UL,
+	/* 179 */ 0xd0dea9dfe4432a4aUL, 0x856af87bbe9277c5UL, 0xce8472acc212c71aUL, 0x6f151b6d9bbb1e91UL,
+	/* 180 */ 0x26776c527ceed56aUL, 0x7d211cb7fbf8faecUL, 0x37ae66a6fd4609ccUL, 0x1f81b702d2770c42UL,
+	/* 181 */ 0x2fb0b057eac58392UL, 0xe1dd89fe29744e9dUL, 0xc964f8eb17beb4f8UL, 0x29571073c9a2d41eUL,
+	/* 182 */ 0xa948a18981c0e254UL, 0x2df6369b65b22830UL, 0xa33eb2d75fcfd3c6UL, 0x078cd6ec4199a01fUL,
+	/* 183 */ 0x4a584a41ad900d2fUL, 0x32142b78e2c74c52UL, 0x68c4e8338431c978UL, 0x7f69ea9008689fc2UL,
+	/* 184 */ 0x52f2c81e46a38265UL, 0xfd78072d04a832fdUL, 0x8cd7d5fa25359e94UL, 0x4de71b7454cc29d2UL,
+	/* 185 */ 0x42eb60ad1eda6ac9UL, 0x0aad37dfdbc09c3aUL, 0x81004b71e33cc191UL, 0x44e6be345122803cUL,
+	/* 186 */ 0x03fe8388ba1920dbUL, 0xf5d57c32150db008UL, 0x49c8c4281af60c29UL, 0x21edb518de701aeeUL,
+	/* 187 */ 0x7fb63e418f06dc99UL, 0xa4460d99c166d7b8UL, 0x24dd5248ce520a83UL, 0x5ec3ad712b928358UL,
+	/* 188 */ 0x15022a5fbd17930fUL, 0xa4f64a77d82570e3UL, 0x12bc8d6915783712UL, 0x498194c0fc620abbUL,
+	/* 189 */ 0x38a2d9d255686c82UL, 0x785c6bd9193e21f0UL, 0xe4d5c81ab24a5484UL, 0x56307860b2e20989UL,
+	/* 190 */ 0x429d55f78b4d74c4UL, 0x22f1834643350131UL, 0x1e60c24598c71fffUL, 0x59f2f014979983efUL,
+	/* 191 */ 0x46a47d56eb494a44UL, 0x3e22a854d636a18eUL, 0xb346e15274491c3bUL, 0x2ceafd4e5390cde7UL,
+	/* 192 */ 0xba8a8538be0d6675UL, 0x4b9074bb50818e23UL, 0xcbdab89085d304c3UL, 0x61a24fe0e56192c4UL,
+	/* 193 */ 0xcb7615e6db525bcbUL, 0xdd7d8c35a567e4caUL, 0xe6b4153acafcdd69UL, 0x2d668e097f3c9766UL,
+	/* 194 */ 0xa57e7e265ce55ef0UL, 0x5d9f4e527cd4b967UL, 0xfbc83606492fd1e5UL, 0x090d52beb7c3f7aeUL,
+	/* 195 */ 0x09b9515a1e7b4d7cUL, 0x1f266a2599da44c0UL, 0xa1c49548e2c55504UL, 0x7ef04287126f15ccUL,
+	/* 196 */ 0xfed1659dbd30ef15UL, 0x8b4ab9eec4e0277bUL, 0x884d6236a5df3291UL, 0x1fd96ea6bf5cf788UL,
+	/* 197 */ 0x42a161981f190d9aUL, 0x61d849507e6052c1UL, 0x9fe113bf285a2cd5UL, 0x7c22d676dbad85d8UL,
+	/* 198 */ 0x82e770ed2bfbd27dUL, 0x4c05b2ece996f5a5UL, 0xcd40a9c2b0900150UL, 0x5895319213d9bf64UL,
+	/* 199 */ 0xe7cc5d703fea2e08UL, 0xb50c491258e2188cUL, 0xcce30baa48205bf0UL, 0x537c659ccfa32d62UL,
+	/* 200 */ 0x37b6623a98cfc088UL, 0xfe9bed1fa4d6aca4UL, 0x04d29b8e56a8d1b0UL, 0x725f71c40b519575UL,
+	/* 201 */ 0x28c7f89cd0339ce6UL, 0x8367b14469ddc18bUL, 0x883ada83a6a1652cUL, 0x585f1974034d6c17UL,
+	/* 202 */ 0x89cfb266f1b19188UL, 0xe63b4863e7c35217UL, 0xd88c9da6b4c0526aUL, 0x3e035c9df0954635UL,
+	/* 203 */ 0xdd9d5412fb45de9dUL, 0xdd684532e4cff40dUL, 0x4b5c999b151d671cUL, 0x2d8c2cc811e7f690UL,
+	/* 204 */ 0x7f54be1d90055d40UL, 0xa464c5df464aaf40UL, 0x33979624f0e917beUL, 0x2c018dc527356b30UL,
+	/* 205 */ 0xa5415024e330b3d4UL, 0x73ff3d96691652d3UL, 0x94ec42c4ef9b59f1UL, 0x0747201618d08e5aUL,
+	/* 206 */ 0x4d6ca48aca411c53UL, 0x66415f2fcfa66119UL, 0x9c4dd40051e227ffUL, 0x59810bc09a02f7ebUL,
+	/* 207 */ 0x2a7eb171b3dc101dUL, 0x441c5ab99ffef68eUL, 0x32025c9b93b359eaUL, 0x5e8ce0a71e9d112fUL,
+	/* 208 */ 0xbfcccb92429503fdUL, 0xd271ba752f095d55UL, 0x345ead5e972d091eUL, 0x18c8df11a83103baUL,
+	/* 209 */ 0x90cd949a9aed0f4cUL, 0xc5d1f4cb6660e37eUL, 0xb8cac52d56c52e0bUL, 0x6e42e400c5808e0dUL,
+	/* 210 */ 0xa3b46966eeaefd23UL, 0x0c4f1f0be39ecdcaUL, 0x189dc8c9d683a51dUL, 0x51f27f054c09351bUL,
+	/* 211 */ 0x4c487ccd2a320682UL, 0x587ea95bb3df1c96UL, 0xc8ccf79e555cb8e8UL, 0x547dc829a206d73dUL,
+	/* 212 */ 0xb822a6cd80c39b06UL, 0xe96d54732000d4c6UL, 0x28535b6f91463b4dUL, 0x228f4660e2486e1dUL,
+	/* 213 */ 0x98799538de8d3abfUL, 0x8cd8330045ebca6eUL, 0x79952a008221e738UL, 0x4322e1a7535cd2bbUL,
+	/* 214 */ 0xb114c11819d1801cUL, 0x2016e4d84f3f5ec7UL, 0xdd0e2df409260f4cUL, 0x5ec362c0ae5f7266UL,
+	/* 215 */ 0xc0462b18b8b2b4eeUL, 0x7cc8d950274d1afbUL, 0xf25f7105436b02d2UL, 0x43bbf8dcbff9ccd3UL,
+	/* 216 */ 0xb6ad1767a039e9dfUL, 0xb0714da8f69d3583UL, 0x5e55fa18b42931f5UL, 0x4ed5558f33c60961UL,
+	/* 217 */ 0x1fe37901c647a5ddUL, 0x593ddf1f8081d357UL, 0x0249a4fd813fd7a6UL, 0x69acca274e9caf61UL,
+	/* 218 */ 0x047ba3ea330721c9UL, 0x83423fc20e7e1ea0UL, 0x1df4c0af01314a60UL, 0x09a62dab89289527UL,
+	/* 219 */ 0xa5b325a49cc6cb00UL, 0xe94b5dc654b56cb6UL, 0x3be28779adc994a0UL, 0x4296e8f8ba3a4aadUL,
+	/* 220 */ 0x328689761e451eabUL, 0x2e4d598bff59594aUL, 0x49b96853d7a7084aUL, 0x4980a319601420a8UL,
+	/* 221 */ 0x9565b9e12f552c42UL, 0x8a5318db7100fe96UL, 0x05c90b4d43add0d7UL, 0x538b4cd66a5d4edaUL,
+	/* 222 */ 0xf4e94fc3e89f039fUL, 0x592c9af26f618045UL, 0x08a36eb5fd4b9550UL, 0x25fffaf6c2ed1419UL,
+	/* 223 */ 0x34434459cc79d354UL, 0xeeecbfb4b1d5476bUL, 0xddeb34a061615d99UL, 0x5129cecceb64b773UL,
+	/* 224 */ 0xee43215894993520UL, 0x772f9c7cf14c0b3bUL, 0xd2e2fce306bedad5UL, 0x715f42b546f06a97UL,
+	/* 225 */ 0x434ecdceda5b5f1aUL, 0x0da17115a49741a9UL, 0x680bd77c73edad2eUL, 0x487c02354edd9041UL,
+	/* 226 */ 0xb8efeff3a70ed9c4UL, 0x56a32aa3e857e302UL, 0xdf3a68bd48a2a5a0UL, 0x07f650b73176c444UL,
+	/* 227 */ 0xe38b9b1626e0ccb1UL, 0x79e053c18b09fb36UL, 0x56d90319c9f94964UL, 0x1ca941e7ac9ff5c4UL,
+	/* 228 */ 0x49c4df29162fa0bbUL, 0x8488cf3282b33305UL, 0x95dfda14cabb437dUL, 0x3391f78264d5ad86UL,
+	/* 229 */ 0x729ae06ae2b5095dUL, 0xd58a58d73259a946UL, 0xe9834262d13921edUL, 0x27fedafaa54bb592UL,
+	/* 230 */ 0xa99dc5b829ad48bbUL, 0x5f025742499ee260UL, 0x802c8ecd5d7513fdUL, 0x78ceb3ef3f6dd938UL,
+	/* 231 */ 0xc342f44f8a135d94UL, 0x7b9edb44828cdda3UL, 0x9436d11a0537cfe7UL, 0x5064b164ec1ab4c8UL,
+	/* 232 */ 0x7020eccfd37eb2fcUL, 0x1f31ea3ed90d25fcUL, 0x1b930d7bdfa1bb34UL, 0x5344467a48113044UL,
+	/* 233 */ 0x70073170f25e6dfbUL, 0xe385dc1a50114cc8UL, 0x2348698ac8fc4f00UL, 0x2a77a55284dd40d8UL,
+	/* 234 */ 0xfe06afe0c98c6ce4UL, 0xc235df96dddfd6e4UL, 0x1428d01e33bf1ed3UL, 0x785768ec9300bdafUL,
+	/* 235 */ 0x9702e57a91deb63bUL, 0x61bdb8bfe5ce8b80UL, 0x645b426f3d1d58acUL, 0x4804a82227a557bcUL,
+	/* 236 */ 0x8e57048ab44d2601UL, 0x68d6501a4b3a6935UL, 0xc39c9ec3f9e1c293UL, 0x4172f257d4de63e2UL,
+	/* 237 */ 0xd368b450330c6401UL, 0x040d3017418f2391UL, 0x2c34bb6090b7d90dUL, 0x16f649228fdfd51fUL,
+	/* 238 */ 0xbea6818e2b928ef5UL, 0xe28ccf91cdc11e72UL, 0x594aaa68e77a36cdUL, 0x313034806c7ffd0fUL,
+	/* 239 */ 0x8a9d27ac2249bd65UL, 0x19a3b464018e9512UL, 0xc26ccff352b37ec7UL, 0x056f68341d797b21UL,
+	/* 240 */ 0x5e79d6757efd2327UL, 0xfabdbcb6553afe15UL, 0xd3e7222c6eaf5a60UL, 0x7046c76d4dae743bUL,
+	/* 241 */ 0x660be872b18d4a55UL, 0x19992518574e1496UL, 0xc103053a302bdcbbUL, 0x3ed8e9800b218e8eUL,
+	/* 242 */ 0x7b0b9239fa75e03eUL, 0xefe9fb684633c083UL, 0x98a35fbe391a7793UL, 0x6065510fe2d0fe34UL,
+	/* 243 */ 0x55cb668548abad0cUL, 0xb4584548da87e527UL, 0x2c43ecea0107c1ddUL, 0x526028809372de35UL,
+	/* 244 */ 0x3415c56af9213b1fUL, 0x5bee1a4d017e98dbUL, 0x13f6b105b5cf709bUL, 0x5ff20e3482b29ab6UL,
+	/* 245 */ 0x0aa29c75cc2e6c90UL, 0xfc7d73ca3a70e206UL, 0x899fc38fc4b5c515UL, 0x250386b124ffc207UL,
+	/* 246 */ 0x54ea28d5ae3d2b56UL, 0x9913149dd6de60ceUL, 0x16694fc58f06d6c1UL, 0x46b23975eb018fc7UL,
+	/* 247 */ 0x470a6a0fb4b7b4e2UL, 0x5d92475a8f7253deUL, 0xabeee5b52fbd3adbUL, 0x7fa20801a0806968UL,
+	/* 248 */ 0x76f3faf19f7714d2UL, 0xb3e840c12f4660c3UL, 0x0fb4cd8df212744eUL, 0x4b065a251d3a2dd2UL,
+	/* 249 */ 0x5cebde383d77cd4aUL, 0x6adf39df882c9cb1UL, 0xa2dd242eb09af759UL, 0x3147c0e50e5f6422UL,
+	/* 250 */ 0x164ca5101d1350dbUL, 0xf8d13479c33fc962UL, 0xe640ce4d13e5da08UL, 0x4bdee0c45061f8baUL,
+	/* 251 */ 0xd7c46dc1a4edb1c9UL, 0x5514d7b6437fd98aUL, 0x58942f6bb2a1c00bUL, 0x2dffb2ab1d70710eUL,
+	/* 252 */ 0xccdfcf2fc18b6d68UL, 0xa8ebcba8b7806167UL, 0x980697f95e2937e3UL, 0x02fbba1cd0126e8cUL
+};
+
+/* c is two 512-bit products: c0[0:7]=a0[0:3]*b0[0:3] and c1[8:15]=a1[4:7]*b1[4:7]
+ * a is two 256-bit integers: a0[0:3] and a1[4:7]
+ * b is two 256-bit integers: b0[0:3] and b1[4:7]
+ */
+static void mul2_256x256_integer_adx(u64 *const c, const u64 *const a, const u64 *const b)
+{
+	asm volatile(
+		"xorl %%r14d, %%r14d ;"
+		"movq   (%1), %%rdx; "	/* A[0] */
+		"mulx   (%2),  %%r8, %%r15; " /* A[0]*B[0] */
+		"xorl %%r10d, %%r10d ;"
+		"movq %%r8, (%0) ;"
+		"mulx  8(%2), %%r10, %%rax; " /* A[0]*B[1] */
+		"adox %%r10, %%r15 ;"
+		"mulx 16(%2),  %%r8, %%rbx; " /* A[0]*B[2] */
+		"adox  %%r8, %%rax ;"
+		"mulx 24(%2), %%r10, %%rcx; " /* A[0]*B[3] */
+		"adox %%r10, %%rbx ;"
+		/******************************************/
+		"adox %%r14, %%rcx ;"
+
+		"movq  8(%1), %%rdx; "	/* A[1] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[1]*B[0] */
+		"adox %%r15,  %%r8 ;"
+		"movq  %%r8, 8(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[1]*B[1] */
+		"adox %%r10,  %%r9 ;"
+		"adcx  %%r9, %%rax ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[1]*B[2] */
+		"adox  %%r8, %%r11 ;"
+		"adcx %%r11, %%rbx ;"
+		"mulx 24(%2), %%r10, %%r15; " /* A[1]*B[3] */
+		"adox %%r10, %%r13 ;"
+		"adcx %%r13, %%rcx ;"
+		/******************************************/
+		"adox %%r14, %%r15 ;"
+		"adcx %%r14, %%r15 ;"
+
+		"movq 16(%1), %%rdx; " /* A[2] */
+		"xorl %%r10d, %%r10d ;"
+		"mulx   (%2),  %%r8,  %%r9; " /* A[2]*B[0] */
+		"adox %%rax,  %%r8 ;"
+		"movq %%r8, 16(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[2]*B[1] */
+		"adox %%r10,  %%r9 ;"
+		"adcx  %%r9, %%rbx ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[2]*B[2] */
+		"adox  %%r8, %%r11 ;"
+		"adcx %%r11, %%rcx ;"
+		"mulx 24(%2), %%r10, %%rax; " /* A[2]*B[3] */
+		"adox %%r10, %%r13 ;"
+		"adcx %%r13, %%r15 ;"
+		/******************************************/
+		"adox %%r14, %%rax ;"
+		"adcx %%r14, %%rax ;"
+
+		"movq 24(%1), %%rdx; " /* A[3] */
+		"xorl %%r10d, %%r10d ;"
+		"mulx   (%2),  %%r8,  %%r9; " /* A[3]*B[0] */
+		"adox %%rbx,  %%r8 ;"
+		"movq %%r8, 24(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[3]*B[1] */
+		"adox %%r10,  %%r9 ;"
+		"adcx  %%r9, %%rcx ;"
+		"movq %%rcx, 32(%0) ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[3]*B[2] */
+		"adox  %%r8, %%r11 ;"
+		"adcx %%r11, %%r15 ;"
+		"movq %%r15, 40(%0) ;"
+		"mulx 24(%2), %%r10, %%rbx; " /* A[3]*B[3] */
+		"adox %%r10, %%r13 ;"
+		"adcx %%r13, %%rax ;"
+		"movq %%rax, 48(%0) ;"
+		/******************************************/
+		"adox %%r14, %%rbx ;"
+		"adcx %%r14, %%rbx ;"
+		"movq %%rbx, 56(%0) ;"
+
+		"movq 32(%1), %%rdx; "	/* C[0] */
+		"mulx 32(%2),  %%r8, %%r15; " /* C[0]*D[0] */
+		"xorl %%r10d, %%r10d ;"
+		"movq %%r8, 64(%0);"
+		"mulx 40(%2), %%r10, %%rax; " /* C[0]*D[1] */
+		"adox %%r10, %%r15 ;"
+		"mulx 48(%2),  %%r8, %%rbx; " /* C[0]*D[2] */
+		"adox  %%r8, %%rax ;"
+		"mulx 56(%2), %%r10, %%rcx; " /* C[0]*D[3] */
+		"adox %%r10, %%rbx ;"
+		/******************************************/
+		"adox %%r14, %%rcx ;"
+
+		"movq 40(%1), %%rdx; " /* C[1] */
+		"xorl %%r10d, %%r10d ;"
+		"mulx 32(%2),  %%r8,  %%r9; " /* C[1]*D[0] */
+		"adox %%r15,  %%r8 ;"
+		"movq  %%r8, 72(%0);"
+		"mulx 40(%2), %%r10, %%r11; " /* C[1]*D[1] */
+		"adox %%r10,  %%r9 ;"
+		"adcx  %%r9, %%rax ;"
+		"mulx 48(%2),  %%r8, %%r13; " /* C[1]*D[2] */
+		"adox  %%r8, %%r11 ;"
+		"adcx %%r11, %%rbx ;"
+		"mulx 56(%2), %%r10, %%r15; " /* C[1]*D[3] */
+		"adox %%r10, %%r13 ;"
+		"adcx %%r13, %%rcx ;"
+		/******************************************/
+		"adox %%r14, %%r15 ;"
+		"adcx %%r14, %%r15 ;"
+
+		"movq 48(%1), %%rdx; " /* C[2] */
+		"xorl %%r10d, %%r10d ;"
+		"mulx 32(%2),  %%r8,  %%r9; " /* C[2]*D[0] */
+		"adox %%rax,  %%r8 ;"
+		"movq  %%r8, 80(%0);"
+		"mulx 40(%2), %%r10, %%r11; " /* C[2]*D[1] */
+		"adox %%r10,  %%r9 ;"
+		"adcx  %%r9, %%rbx ;"
+		"mulx 48(%2),  %%r8, %%r13; " /* C[2]*D[2] */
+		"adox  %%r8, %%r11 ;"
+		"adcx %%r11, %%rcx ;"
+		"mulx 56(%2), %%r10, %%rax; " /* C[2]*D[3] */
+		"adox %%r10, %%r13 ;"
+		"adcx %%r13, %%r15 ;"
+		/******************************************/
+		"adox %%r14, %%rax ;"
+		"adcx %%r14, %%rax ;"
+
+		"movq 56(%1), %%rdx; " /* C[3] */
+		"xorl %%r10d, %%r10d ;"
+		"mulx 32(%2),  %%r8,  %%r9; " /* C[3]*D[0] */
+		"adox %%rbx,  %%r8 ;"
+		"movq  %%r8, 88(%0);"
+		"mulx 40(%2), %%r10, %%r11; " /* C[3]*D[1] */
+		"adox %%r10,  %%r9 ;"
+		"adcx  %%r9, %%rcx ;"
+		"movq %%rcx,  96(%0) ;"
+		"mulx 48(%2),  %%r8, %%r13; " /* C[3]*D[2] */
+		"adox  %%r8, %%r11 ;"
+		"adcx %%r11, %%r15 ;"
+		"movq %%r15, 104(%0) ;"
+		"mulx 56(%2), %%r10, %%rbx; " /* C[3]*D[3] */
+		"adox %%r10, %%r13 ;"
+		"adcx %%r13, %%rax ;"
+		"movq %%rax, 112(%0) ;"
+		/******************************************/
+		"adox %%r14, %%rbx ;"
+		"adcx %%r14, %%rbx ;"
+		"movq %%rbx, 120(%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(b)
+		: "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r13", "%r14", "%r15");
+}
+
+static void mul2_256x256_integer_bmi2(u64 *const c, const u64 *const a, const u64 *const b)
+{
+	asm volatile(
+		"movq   (%1), %%rdx; "	/* A[0] */
+		"mulx   (%2),  %%r8, %%r15; " /* A[0]*B[0] */
+		"movq %%r8,  (%0) ;"
+		"mulx  8(%2), %%r10, %%rax; " /* A[0]*B[1] */
+		"addq %%r10, %%r15 ;"
+		"mulx 16(%2),  %%r8, %%rbx; " /* A[0]*B[2] */
+		"adcq  %%r8, %%rax ;"
+		"mulx 24(%2), %%r10, %%rcx; " /* A[0]*B[3] */
+		"adcq %%r10, %%rbx ;"
+		/******************************************/
+		"adcq    $0, %%rcx ;"
+
+		"movq  8(%1), %%rdx; "	/* A[1] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[1]*B[0] */
+		"addq %%r15,  %%r8 ;"
+		"movq %%r8, 8(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[1]*B[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[1]*B[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 24(%2), %%r10, %%r15; " /* A[1]*B[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%r15 ;"
+
+		"addq  %%r9, %%rax ;"
+		"adcq %%r11, %%rbx ;"
+		"adcq %%r13, %%rcx ;"
+		"adcq    $0, %%r15 ;"
+
+		"movq 16(%1), %%rdx; "	/* A[2] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[2]*B[0] */
+		"addq %%rax,  %%r8 ;"
+		"movq %%r8, 16(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[2]*B[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[2]*B[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 24(%2), %%r10, %%rax; " /* A[2]*B[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%rax ;"
+
+		"addq  %%r9, %%rbx ;"
+		"adcq %%r11, %%rcx ;"
+		"adcq %%r13, %%r15 ;"
+		"adcq    $0, %%rax ;"
+
+		"movq 24(%1), %%rdx; "	/* A[3] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[3]*B[0] */
+		"addq %%rbx,  %%r8 ;"
+		"movq %%r8, 24(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[3]*B[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[3]*B[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 24(%2), %%r10, %%rbx; " /* A[3]*B[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%rbx ;"
+
+		"addq  %%r9, %%rcx ;"
+		"movq %%rcx, 32(%0) ;"
+		"adcq %%r11, %%r15 ;"
+		"movq %%r15, 40(%0) ;"
+		"adcq %%r13, %%rax ;"
+		"movq %%rax, 48(%0) ;"
+		"adcq    $0, %%rbx ;"
+		"movq %%rbx, 56(%0) ;"
+
+		"movq 32(%1), %%rdx; "	/* C[0] */
+		"mulx 32(%2),  %%r8, %%r15; " /* C[0]*D[0] */
+		"movq %%r8, 64(%0) ;"
+		"mulx 40(%2), %%r10, %%rax; " /* C[0]*D[1] */
+		"addq %%r10, %%r15 ;"
+		"mulx 48(%2),  %%r8, %%rbx; " /* C[0]*D[2] */
+		"adcq  %%r8, %%rax ;"
+		"mulx 56(%2), %%r10, %%rcx; " /* C[0]*D[3] */
+		"adcq %%r10, %%rbx ;"
+		/******************************************/
+		"adcq    $0, %%rcx ;"
+
+		"movq 40(%1), %%rdx; "	/* C[1] */
+		"mulx 32(%2),  %%r8,  %%r9; " /* C[1]*D[0] */
+		"addq %%r15,  %%r8 ;"
+		"movq %%r8, 72(%0) ;"
+		"mulx 40(%2), %%r10, %%r11; " /* C[1]*D[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 48(%2),  %%r8, %%r13; " /* C[1]*D[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 56(%2), %%r10, %%r15; " /* C[1]*D[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%r15 ;"
+
+		"addq  %%r9, %%rax ;"
+		"adcq %%r11, %%rbx ;"
+		"adcq %%r13, %%rcx ;"
+		"adcq    $0, %%r15 ;"
+
+		"movq 48(%1), %%rdx; "	/* C[2] */
+		"mulx 32(%2),  %%r8,  %%r9; " /* C[2]*D[0] */
+		"addq %%rax,  %%r8 ;"
+		"movq %%r8, 80(%0) ;"
+		"mulx 40(%2), %%r10, %%r11; " /* C[2]*D[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 48(%2),  %%r8, %%r13; " /* C[2]*D[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 56(%2), %%r10, %%rax; " /* C[2]*D[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%rax ;"
+
+		"addq  %%r9, %%rbx ;"
+		"adcq %%r11, %%rcx ;"
+		"adcq %%r13, %%r15 ;"
+		"adcq    $0, %%rax ;"
+
+		"movq 56(%1), %%rdx; "	/* C[3] */
+		"mulx 32(%2),  %%r8,  %%r9; " /* C[3]*D[0] */
+		"addq %%rbx,  %%r8 ;"
+		"movq %%r8, 88(%0) ;"
+		"mulx 40(%2), %%r10, %%r11; " /* C[3]*D[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 48(%2),  %%r8, %%r13; " /* C[3]*D[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 56(%2), %%r10, %%rbx; " /* C[3]*D[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%rbx ;"
+
+		"addq  %%r9, %%rcx ;"
+		"movq %%rcx,  96(%0) ;"
+		"adcq %%r11, %%r15 ;"
+		"movq %%r15, 104(%0) ;"
+		"adcq %%r13, %%rax ;"
+		"movq %%rax, 112(%0) ;"
+		"adcq    $0, %%rbx ;"
+		"movq %%rbx, 120(%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(b)
+		: "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r13", "%r15");
+}
+
+static void sqr2_256x256_integer_adx(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movq   (%1), %%rdx        ;" /* A[0]      */
+		"mulx  8(%1),  %%r8, %%r14 ;" /* A[1]*A[0] */
+		"xorl %%r15d, %%r15d;"
+		"mulx 16(%1),  %%r9, %%r10 ;" /* A[2]*A[0] */
+		"adcx %%r14,  %%r9 ;"
+		"mulx 24(%1), %%rax, %%rcx ;" /* A[3]*A[0] */
+		"adcx %%rax, %%r10 ;"
+		"movq 24(%1), %%rdx        ;" /* A[3]      */
+		"mulx  8(%1), %%r11, %%rbx ;" /* A[1]*A[3] */
+		"adcx %%rcx, %%r11 ;"
+		"mulx 16(%1), %%rax, %%r13 ;" /* A[2]*A[3] */
+		"adcx %%rax, %%rbx ;"
+		"movq  8(%1), %%rdx        ;" /* A[1]      */
+		"adcx %%r15, %%r13 ;"
+		"mulx 16(%1), %%rax, %%rcx ;" /* A[2]*A[1] */
+		"movq    $0, %%r14 ;"
+		/******************************************/
+		"adcx %%r15, %%r14 ;"
+
+		"xorl %%r15d, %%r15d;"
+		"adox %%rax, %%r10 ;"
+		"adcx  %%r8,  %%r8 ;"
+		"adox %%rcx, %%r11 ;"
+		"adcx  %%r9,  %%r9 ;"
+		"adox %%r15, %%rbx ;"
+		"adcx %%r10, %%r10 ;"
+		"adox %%r15, %%r13 ;"
+		"adcx %%r11, %%r11 ;"
+		"adox %%r15, %%r14 ;"
+		"adcx %%rbx, %%rbx ;"
+		"adcx %%r13, %%r13 ;"
+		"adcx %%r14, %%r14 ;"
+
+		"movq   (%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[0]^2 */
+		/*******************/
+		"movq %%rax,  0(%0) ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,  8(%0) ;"
+		"movq  8(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[1]^2 */
+		"adcq %%rax,  %%r9 ;"
+		"movq  %%r9, 16(%0) ;"
+		"adcq %%rcx, %%r10 ;"
+		"movq %%r10, 24(%0) ;"
+		"movq 16(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[2]^2 */
+		"adcq %%rax, %%r11 ;"
+		"movq %%r11, 32(%0) ;"
+		"adcq %%rcx, %%rbx ;"
+		"movq %%rbx, 40(%0) ;"
+		"movq 24(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[3]^2 */
+		"adcq %%rax, %%r13 ;"
+		"movq %%r13, 48(%0) ;"
+		"adcq %%rcx, %%r14 ;"
+		"movq %%r14, 56(%0) ;"
+
+
+		"movq 32(%1), %%rdx        ;" /* B[0]      */
+		"mulx 40(%1),  %%r8, %%r14 ;" /* B[1]*B[0] */
+		"xorl %%r15d, %%r15d;"
+		"mulx 48(%1),  %%r9, %%r10 ;" /* B[2]*B[0] */
+		"adcx %%r14,  %%r9 ;"
+		"mulx 56(%1), %%rax, %%rcx ;" /* B[3]*B[0] */
+		"adcx %%rax, %%r10 ;"
+		"movq 56(%1), %%rdx        ;" /* B[3]      */
+		"mulx 40(%1), %%r11, %%rbx ;" /* B[1]*B[3] */
+		"adcx %%rcx, %%r11 ;"
+		"mulx 48(%1), %%rax, %%r13 ;" /* B[2]*B[3] */
+		"adcx %%rax, %%rbx ;"
+		"movq 40(%1), %%rdx        ;" /* B[1]      */
+		"adcx %%r15, %%r13 ;"
+		"mulx 48(%1), %%rax, %%rcx ;" /* B[2]*B[1] */
+		"movq    $0, %%r14 ;"
+		/******************************************/
+		"adcx %%r15, %%r14 ;"
+
+		"xorl %%r15d, %%r15d;"
+		"adox %%rax, %%r10 ;"
+		"adcx  %%r8,  %%r8 ;"
+		"adox %%rcx, %%r11 ;"
+		"adcx  %%r9,  %%r9 ;"
+		"adox %%r15, %%rbx ;"
+		"adcx %%r10, %%r10 ;"
+		"adox %%r15, %%r13 ;"
+		"adcx %%r11, %%r11 ;"
+		"adox %%r15, %%r14 ;"
+		"adcx %%rbx, %%rbx ;"
+		"adcx %%r13, %%r13 ;"
+		"adcx %%r14, %%r14 ;"
+
+		"movq 32(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* B[0]^2 */
+		/*******************/
+		"movq %%rax,  64(%0) ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,  72(%0) ;"
+		"movq 40(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* B[1]^2 */
+		"adcq %%rax,  %%r9 ;"
+		"movq  %%r9,  80(%0) ;"
+		"adcq %%rcx, %%r10 ;"
+		"movq %%r10,  88(%0) ;"
+		"movq 48(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* B[2]^2 */
+		"adcq %%rax, %%r11 ;"
+		"movq %%r11,  96(%0) ;"
+		"adcq %%rcx, %%rbx ;"
+		"movq %%rbx, 104(%0) ;"
+		"movq 56(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* B[3]^2 */
+		"adcq %%rax, %%r13 ;"
+		"movq %%r13, 112(%0) ;"
+		"adcq %%rcx, %%r14 ;"
+		"movq %%r14, 120(%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r13", "%r14", "%r15");
+}
+
+static void sqr2_256x256_integer_bmi2(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movq  8(%1), %%rdx        ;" /* A[1]      */
+		"mulx   (%1),  %%r8,  %%r9 ;" /* A[0]*A[1] */
+		"mulx 16(%1), %%r10, %%r11 ;" /* A[2]*A[1] */
+		"mulx 24(%1), %%rcx, %%r14 ;" /* A[3]*A[1] */
+
+		"movq 16(%1), %%rdx        ;" /* A[2]      */
+		"mulx 24(%1), %%r15, %%r13 ;" /* A[3]*A[2] */
+		"mulx   (%1), %%rax, %%rdx ;" /* A[0]*A[2] */
+
+		"addq %%rax,  %%r9 ;"
+		"adcq %%rdx, %%r10 ;"
+		"adcq %%rcx, %%r11 ;"
+		"adcq %%r14, %%r15 ;"
+		"adcq    $0, %%r13 ;"
+		"movq    $0, %%r14 ;"
+		"adcq    $0, %%r14 ;"
+
+		"movq   (%1), %%rdx        ;" /* A[0]      */
+		"mulx 24(%1), %%rax, %%rcx ;" /* A[0]*A[3] */
+
+		"addq %%rax, %%r10 ;"
+		"adcq %%rcx, %%r11 ;"
+		"adcq    $0, %%r15 ;"
+		"adcq    $0, %%r13 ;"
+		"adcq    $0, %%r14 ;"
+
+		"shldq $1, %%r13, %%r14 ;"
+		"shldq $1, %%r15, %%r13 ;"
+		"shldq $1, %%r11, %%r15 ;"
+		"shldq $1, %%r10, %%r11 ;"
+		"shldq $1,  %%r9, %%r10 ;"
+		"shldq $1,  %%r8,  %%r9 ;"
+		"shlq  $1,  %%r8        ;"
+
+		/*******************/
+		"mulx %%rdx, %%rax, %%rcx ; " /* A[0]^2 */
+		/*******************/
+		"movq %%rax,  0(%0) ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,  8(%0) ;"
+		"movq  8(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ; " /* A[1]^2 */
+		"adcq %%rax,  %%r9 ;"
+		"movq  %%r9, 16(%0) ;"
+		"adcq %%rcx, %%r10 ;"
+		"movq %%r10, 24(%0) ;"
+		"movq 16(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ; " /* A[2]^2 */
+		"adcq %%rax, %%r11 ;"
+		"movq %%r11, 32(%0) ;"
+		"adcq %%rcx, %%r15 ;"
+		"movq %%r15, 40(%0) ;"
+		"movq 24(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ; " /* A[3]^2 */
+		"adcq %%rax, %%r13 ;"
+		"movq %%r13, 48(%0) ;"
+		"adcq %%rcx, %%r14 ;"
+		"movq %%r14, 56(%0) ;"
+
+		"movq 40(%1), %%rdx        ;" /* B[1]      */
+		"mulx 32(%1),  %%r8,  %%r9 ;" /* B[0]*B[1] */
+		"mulx 48(%1), %%r10, %%r11 ;" /* B[2]*B[1] */
+		"mulx 56(%1), %%rcx, %%r14 ;" /* B[3]*B[1] */
+
+		"movq 48(%1), %%rdx        ;" /* B[2]      */
+		"mulx 56(%1), %%r15, %%r13 ;" /* B[3]*B[2] */
+		"mulx 32(%1), %%rax, %%rdx ;" /* B[0]*B[2] */
+
+		"addq %%rax,  %%r9 ;"
+		"adcq %%rdx, %%r10 ;"
+		"adcq %%rcx, %%r11 ;"
+		"adcq %%r14, %%r15 ;"
+		"adcq    $0, %%r13 ;"
+		"movq    $0, %%r14 ;"
+		"adcq    $0, %%r14 ;"
+
+		"movq 32(%1), %%rdx        ;" /* B[0]      */
+		"mulx 56(%1), %%rax, %%rcx ;" /* B[0]*B[3] */
+
+		"addq %%rax, %%r10 ;"
+		"adcq %%rcx, %%r11 ;"
+		"adcq    $0, %%r15 ;"
+		"adcq    $0, %%r13 ;"
+		"adcq    $0, %%r14 ;"
+
+		"shldq $1, %%r13, %%r14 ;"
+		"shldq $1, %%r15, %%r13 ;"
+		"shldq $1, %%r11, %%r15 ;"
+		"shldq $1, %%r10, %%r11 ;"
+		"shldq $1,  %%r9, %%r10 ;"
+		"shldq $1,  %%r8,  %%r9 ;"
+		"shlq  $1,  %%r8        ;"
+
+		/*******************/
+		"mulx %%rdx, %%rax, %%rcx ; " /* B[0]^2 */
+		/*******************/
+		"movq %%rax,  64(%0) ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,  72(%0) ;"
+		"movq 40(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ; " /* B[1]^2 */
+		"adcq %%rax,  %%r9 ;"
+		"movq  %%r9,  80(%0) ;"
+		"adcq %%rcx, %%r10 ;"
+		"movq %%r10,  88(%0) ;"
+		"movq 48(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ; " /* B[2]^2 */
+		"adcq %%rax, %%r11 ;"
+		"movq %%r11,  96(%0) ;"
+		"adcq %%rcx, %%r15 ;"
+		"movq %%r15, 104(%0) ;"
+		"movq 56(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ; " /* B[3]^2 */
+		"adcq %%rax, %%r13 ;"
+		"movq %%r13, 112(%0) ;"
+		"adcq %%rcx, %%r14 ;"
+		"movq %%r14, 120(%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r13", "%r14", "%r15");
+}
+
+static void red_eltfp25519_2w_adx(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movl    $38, %%edx; "	/* 2*c = 38 = 2^256 */
+		"mulx 32(%1),  %%r8, %%r10; " /* c*C[4] */
+		"xorl %%ebx, %%ebx ;"
+		"adox   (%1),  %%r8 ;"
+		"mulx 40(%1),  %%r9, %%r11; " /* c*C[5] */
+		"adcx %%r10,  %%r9 ;"
+		"adox  8(%1),  %%r9 ;"
+		"mulx 48(%1), %%r10, %%rax; " /* c*C[6] */
+		"adcx %%r11, %%r10 ;"
+		"adox 16(%1), %%r10 ;"
+		"mulx 56(%1), %%r11, %%rcx; " /* c*C[7] */
+		"adcx %%rax, %%r11 ;"
+		"adox 24(%1), %%r11 ;"
+		/***************************************/
+		"adcx %%rbx, %%rcx ;"
+		"adox  %%rbx, %%rcx ;"
+		"imul %%rdx, %%rcx ;" /* c*C[4], cf=0, of=0 */
+		"adcx %%rcx,  %%r8 ;"
+		"adcx %%rbx,  %%r9 ;"
+		"movq  %%r9,  8(%0) ;"
+		"adcx %%rbx, %%r10 ;"
+		"movq %%r10, 16(%0) ;"
+		"adcx %%rbx, %%r11 ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $0, %%ecx ;"
+		"cmovc %%edx, %%ecx ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,   (%0) ;"
+
+		"mulx  96(%1),  %%r8, %%r10; " /* c*C[4] */
+		"xorl %%ebx, %%ebx ;"
+		"adox 64(%1),  %%r8 ;"
+		"mulx 104(%1),  %%r9, %%r11; " /* c*C[5] */
+		"adcx %%r10,  %%r9 ;"
+		"adox 72(%1),  %%r9 ;"
+		"mulx 112(%1), %%r10, %%rax; " /* c*C[6] */
+		"adcx %%r11, %%r10 ;"
+		"adox 80(%1), %%r10 ;"
+		"mulx 120(%1), %%r11, %%rcx; " /* c*C[7] */
+		"adcx %%rax, %%r11 ;"
+		"adox 88(%1), %%r11 ;"
+		/****************************************/
+		"adcx %%rbx, %%rcx ;"
+		"adox  %%rbx, %%rcx ;"
+		"imul %%rdx, %%rcx ;" /* c*C[4], cf=0, of=0 */
+		"adcx %%rcx,  %%r8 ;"
+		"adcx %%rbx,  %%r9 ;"
+		"movq  %%r9, 40(%0) ;"
+		"adcx %%rbx, %%r10 ;"
+		"movq %%r10, 48(%0) ;"
+		"adcx %%rbx, %%r11 ;"
+		"movq %%r11, 56(%0) ;"
+		"mov     $0, %%ecx ;"
+		"cmovc %%edx, %%ecx ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8, 32(%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11");
+}
+
+static void red_eltfp25519_2w_bmi2(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movl    $38, %%edx ; "       /* 2*c = 38 = 2^256 */
+		"mulx 32(%1),  %%r8, %%r10 ;" /* c*C[4] */
+		"mulx 40(%1),  %%r9, %%r11 ;" /* c*C[5] */
+		"addq %%r10,  %%r9 ;"
+		"mulx 48(%1), %%r10, %%rax ;" /* c*C[6] */
+		"adcq %%r11, %%r10 ;"
+		"mulx 56(%1), %%r11, %%rcx ;" /* c*C[7] */
+		"adcq %%rax, %%r11 ;"
+		/***************************************/
+		"adcq    $0, %%rcx ;"
+		"addq   (%1),  %%r8 ;"
+		"adcq  8(%1),  %%r9 ;"
+		"adcq 16(%1), %%r10 ;"
+		"adcq 24(%1), %%r11 ;"
+		"adcq     $0, %%rcx ;"
+		"imul %%rdx, %%rcx ;" /* c*C[4], cf=0 */
+		"addq %%rcx,  %%r8 ;"
+		"adcq    $0,  %%r9 ;"
+		"movq  %%r9,  8(%0) ;"
+		"adcq    $0, %%r10 ;"
+		"movq %%r10, 16(%0) ;"
+		"adcq    $0, %%r11 ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $0, %%ecx ;"
+		"cmovc %%edx, %%ecx ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,   (%0) ;"
+
+		"mulx  96(%1),  %%r8, %%r10 ;" /* c*C[4] */
+		"mulx 104(%1),  %%r9, %%r11 ;" /* c*C[5] */
+		"addq %%r10,  %%r9 ;"
+		"mulx 112(%1), %%r10, %%rax ;" /* c*C[6] */
+		"adcq %%r11, %%r10 ;"
+		"mulx 120(%1), %%r11, %%rcx ;" /* c*C[7] */
+		"adcq %%rax, %%r11 ;"
+		/****************************************/
+		"adcq    $0, %%rcx ;"
+		"addq 64(%1),  %%r8 ;"
+		"adcq 72(%1),  %%r9 ;"
+		"adcq 80(%1), %%r10 ;"
+		"adcq 88(%1), %%r11 ;"
+		"adcq     $0, %%rcx ;"
+		"imul %%rdx, %%rcx ;" /* c*C[4], cf=0 */
+		"addq %%rcx,  %%r8 ;"
+		"adcq    $0,  %%r9 ;"
+		"movq  %%r9, 40(%0) ;"
+		"adcq    $0, %%r10 ;"
+		"movq %%r10, 48(%0) ;"
+		"adcq    $0, %%r11 ;"
+		"movq %%r11, 56(%0) ;"
+		"mov     $0, %%ecx ;"
+		"cmovc %%edx, %%ecx ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8, 32(%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11");
+}
+
+static void mul_256x256_integer_adx(u64 *const c, const u64 *const a, const u64 *const b)
+{
+	asm volatile(
+		"movq   (%1), %%rdx; "	/* A[0] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[0]*B[0] */
+		"xorl %%r10d, %%r10d ;"
+		"movq  %%r8,  (%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[0]*B[1] */
+		"adox  %%r9, %%r10 ;"
+		"movq %%r10, 8(%0) ;"
+		"mulx 16(%2), %%r15, %%r13; " /* A[0]*B[2] */
+		"adox %%r11, %%r15 ;"
+		"mulx 24(%2), %%r14, %%rdx; " /* A[0]*B[3] */
+		"adox %%r13, %%r14 ;"
+		"movq $0, %%rax ;"
+		/******************************************/
+		"adox %%rdx, %%rax ;"
+
+		"movq  8(%1), %%rdx; "	/* A[1] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[1]*B[0] */
+		"xorl %%r10d, %%r10d ;"
+		"adcx 8(%0),  %%r8 ;"
+		"movq  %%r8,  8(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[1]*B[1] */
+		"adox  %%r9, %%r10 ;"
+		"adcx %%r15, %%r10 ;"
+		"movq %%r10, 16(%0) ;"
+		"mulx 16(%2), %%r15, %%r13; " /* A[1]*B[2] */
+		"adox %%r11, %%r15 ;"
+		"adcx %%r14, %%r15 ;"
+		"movq $0, %%r8  ;"
+		"mulx 24(%2), %%r14, %%rdx; " /* A[1]*B[3] */
+		"adox %%r13, %%r14 ;"
+		"adcx %%rax, %%r14 ;"
+		"movq $0, %%rax ;"
+		/******************************************/
+		"adox %%rdx, %%rax ;"
+		"adcx  %%r8, %%rax ;"
+
+		"movq 16(%1), %%rdx; "	/* A[2] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[2]*B[0] */
+		"xorl %%r10d, %%r10d ;"
+		"adcx 16(%0), %%r8 ;"
+		"movq  %%r8, 16(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[2]*B[1] */
+		"adox  %%r9, %%r10 ;"
+		"adcx %%r15, %%r10 ;"
+		"movq %%r10, 24(%0) ;"
+		"mulx 16(%2), %%r15, %%r13; " /* A[2]*B[2] */
+		"adox %%r11, %%r15 ;"
+		"adcx %%r14, %%r15 ;"
+		"movq $0, %%r8  ;"
+		"mulx 24(%2), %%r14, %%rdx; " /* A[2]*B[3] */
+		"adox %%r13, %%r14 ;"
+		"adcx %%rax, %%r14 ;"
+		"movq $0, %%rax ;"
+		/******************************************/
+		"adox %%rdx, %%rax ;"
+		"adcx  %%r8, %%rax ;"
+
+		"movq 24(%1), %%rdx; "	/* A[3] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[3]*B[0] */
+		"xorl %%r10d, %%r10d ;"
+		"adcx 24(%0), %%r8 ;"
+		"movq  %%r8, 24(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[3]*B[1] */
+		"adox  %%r9, %%r10 ;"
+		"adcx %%r15, %%r10 ;"
+		"movq %%r10, 32(%0) ;"
+		"mulx 16(%2), %%r15, %%r13; " /* A[3]*B[2] */
+		"adox %%r11, %%r15 ;"
+		"adcx %%r14, %%r15 ;"
+		"movq %%r15, 40(%0) ;"
+		"movq $0, %%r8  ;"
+		"mulx 24(%2), %%r14, %%rdx; " /* A[3]*B[3] */
+		"adox %%r13, %%r14 ;"
+		"adcx %%rax, %%r14 ;"
+		"movq %%r14, 48(%0) ;"
+		"movq $0, %%rax ;"
+		/******************************************/
+		"adox %%rdx, %%rax ;"
+		"adcx  %%r8, %%rax ;"
+		"movq %%rax, 56(%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(b)
+		: "memory", "cc", "%rax", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r13", "%r14", "%r15");
+}
+
+static void mul_256x256_integer_bmi2(u64 *const c, const u64 *const a, const u64 *const b)
+{
+	asm volatile(
+		"movq   (%1), %%rdx; "	/* A[0] */
+		"mulx   (%2),  %%r8, %%r15; " /* A[0]*B[0] */
+		"movq %%r8,  (%0) ;"
+		"mulx  8(%2), %%r10, %%rax; " /* A[0]*B[1] */
+		"addq %%r10, %%r15 ;"
+		"mulx 16(%2),  %%r8, %%rbx; " /* A[0]*B[2] */
+		"adcq  %%r8, %%rax ;"
+		"mulx 24(%2), %%r10, %%rcx; " /* A[0]*B[3] */
+		"adcq %%r10, %%rbx ;"
+		/******************************************/
+		"adcq    $0, %%rcx ;"
+
+		"movq  8(%1), %%rdx; "	/* A[1] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[1]*B[0] */
+		"addq %%r15,  %%r8 ;"
+		"movq %%r8, 8(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[1]*B[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[1]*B[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 24(%2), %%r10, %%r15; " /* A[1]*B[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%r15 ;"
+
+		"addq  %%r9, %%rax ;"
+		"adcq %%r11, %%rbx ;"
+		"adcq %%r13, %%rcx ;"
+		"adcq    $0, %%r15 ;"
+
+		"movq 16(%1), %%rdx; "	/* A[2] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[2]*B[0] */
+		"addq %%rax,  %%r8 ;"
+		"movq %%r8, 16(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[2]*B[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[2]*B[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 24(%2), %%r10, %%rax; " /* A[2]*B[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%rax ;"
+
+		"addq  %%r9, %%rbx ;"
+		"adcq %%r11, %%rcx ;"
+		"adcq %%r13, %%r15 ;"
+		"adcq    $0, %%rax ;"
+
+		"movq 24(%1), %%rdx; "	/* A[3] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[3]*B[0] */
+		"addq %%rbx,  %%r8 ;"
+		"movq %%r8, 24(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[3]*B[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[3]*B[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 24(%2), %%r10, %%rbx; " /* A[3]*B[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%rbx ;"
+
+		"addq  %%r9, %%rcx ;"
+		"movq %%rcx, 32(%0) ;"
+		"adcq %%r11, %%r15 ;"
+		"movq %%r15, 40(%0) ;"
+		"adcq %%r13, %%rax ;"
+		"movq %%rax, 48(%0) ;"
+		"adcq    $0, %%rbx ;"
+		"movq %%rbx, 56(%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(b)
+		: "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r13", "%r15");
+}
+
+static void sqr_256x256_integer_adx(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movq   (%1), %%rdx        ;" /* A[0]      */
+		"mulx  8(%1),  %%r8, %%r14 ;" /* A[1]*A[0] */
+		"xorl %%r15d, %%r15d;"
+		"mulx 16(%1),  %%r9, %%r10 ;" /* A[2]*A[0] */
+		"adcx %%r14,  %%r9 ;"
+		"mulx 24(%1), %%rax, %%rcx ;" /* A[3]*A[0] */
+		"adcx %%rax, %%r10 ;"
+		"movq 24(%1), %%rdx        ;" /* A[3]      */
+		"mulx  8(%1), %%r11, %%rbx ;" /* A[1]*A[3] */
+		"adcx %%rcx, %%r11 ;"
+		"mulx 16(%1), %%rax, %%r13 ;" /* A[2]*A[3] */
+		"adcx %%rax, %%rbx ;"
+		"movq  8(%1), %%rdx        ;" /* A[1]      */
+		"adcx %%r15, %%r13 ;"
+		"mulx 16(%1), %%rax, %%rcx ;" /* A[2]*A[1] */
+		"movq    $0, %%r14 ;"
+		/******************************************/
+		"adcx %%r15, %%r14 ;"
+
+		"xorl %%r15d, %%r15d;"
+		"adox %%rax, %%r10 ;"
+		"adcx  %%r8,  %%r8 ;"
+		"adox %%rcx, %%r11 ;"
+		"adcx  %%r9,  %%r9 ;"
+		"adox %%r15, %%rbx ;"
+		"adcx %%r10, %%r10 ;"
+		"adox %%r15, %%r13 ;"
+		"adcx %%r11, %%r11 ;"
+		"adox %%r15, %%r14 ;"
+		"adcx %%rbx, %%rbx ;"
+		"adcx %%r13, %%r13 ;"
+		"adcx %%r14, %%r14 ;"
+
+		"movq   (%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[0]^2 */
+		/*******************/
+		"movq %%rax,  0(%0) ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,  8(%0) ;"
+		"movq  8(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[1]^2 */
+		"adcq %%rax,  %%r9 ;"
+		"movq  %%r9, 16(%0) ;"
+		"adcq %%rcx, %%r10 ;"
+		"movq %%r10, 24(%0) ;"
+		"movq 16(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[2]^2 */
+		"adcq %%rax, %%r11 ;"
+		"movq %%r11, 32(%0) ;"
+		"adcq %%rcx, %%rbx ;"
+		"movq %%rbx, 40(%0) ;"
+		"movq 24(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[3]^2 */
+		"adcq %%rax, %%r13 ;"
+		"movq %%r13, 48(%0) ;"
+		"adcq %%rcx, %%r14 ;"
+		"movq %%r14, 56(%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r13", "%r14", "%r15");
+}
+
+static void sqr_256x256_integer_bmi2(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movq  8(%1), %%rdx        ;" /* A[1]      */
+		"mulx   (%1),  %%r8,  %%r9 ;" /* A[0]*A[1] */
+		"mulx 16(%1), %%r10, %%r11 ;" /* A[2]*A[1] */
+		"mulx 24(%1), %%rcx, %%r14 ;" /* A[3]*A[1] */
+
+		"movq 16(%1), %%rdx        ;" /* A[2]      */
+		"mulx 24(%1), %%r15, %%r13 ;" /* A[3]*A[2] */
+		"mulx   (%1), %%rax, %%rdx ;" /* A[0]*A[2] */
+
+		"addq %%rax,  %%r9 ;"
+		"adcq %%rdx, %%r10 ;"
+		"adcq %%rcx, %%r11 ;"
+		"adcq %%r14, %%r15 ;"
+		"adcq    $0, %%r13 ;"
+		"movq    $0, %%r14 ;"
+		"adcq    $0, %%r14 ;"
+
+		"movq   (%1), %%rdx        ;" /* A[0]      */
+		"mulx 24(%1), %%rax, %%rcx ;" /* A[0]*A[3] */
+
+		"addq %%rax, %%r10 ;"
+		"adcq %%rcx, %%r11 ;"
+		"adcq    $0, %%r15 ;"
+		"adcq    $0, %%r13 ;"
+		"adcq    $0, %%r14 ;"
+
+		"shldq $1, %%r13, %%r14 ;"
+		"shldq $1, %%r15, %%r13 ;"
+		"shldq $1, %%r11, %%r15 ;"
+		"shldq $1, %%r10, %%r11 ;"
+		"shldq $1,  %%r9, %%r10 ;"
+		"shldq $1,  %%r8,  %%r9 ;"
+		"shlq  $1,  %%r8        ;"
+
+		/*******************/
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[0]^2 */
+		/*******************/
+		"movq %%rax,  0(%0) ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,  8(%0) ;"
+		"movq  8(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[1]^2 */
+		"adcq %%rax,  %%r9 ;"
+		"movq  %%r9, 16(%0) ;"
+		"adcq %%rcx, %%r10 ;"
+		"movq %%r10, 24(%0) ;"
+		"movq 16(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[2]^2 */
+		"adcq %%rax, %%r11 ;"
+		"movq %%r11, 32(%0) ;"
+		"adcq %%rcx, %%r15 ;"
+		"movq %%r15, 40(%0) ;"
+		"movq 24(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[3]^2 */
+		"adcq %%rax, %%r13 ;"
+		"movq %%r13, 48(%0) ;"
+		"adcq %%rcx, %%r14 ;"
+		"movq %%r14, 56(%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r13", "%r14", "%r15");
+}
+
+static void red_eltfp25519_1w_adx(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movl    $38, %%edx ;"	/* 2*c = 38 = 2^256 */
+		"mulx 32(%1),  %%r8, %%r10 ;" /* c*C[4] */
+		"xorl %%ebx, %%ebx ;"
+		"adox   (%1),  %%r8 ;"
+		"mulx 40(%1),  %%r9, %%r11 ;" /* c*C[5] */
+		"adcx %%r10,  %%r9 ;"
+		"adox  8(%1),  %%r9 ;"
+		"mulx 48(%1), %%r10, %%rax ;" /* c*C[6] */
+		"adcx %%r11, %%r10 ;"
+		"adox 16(%1), %%r10 ;"
+		"mulx 56(%1), %%r11, %%rcx ;" /* c*C[7] */
+		"adcx %%rax, %%r11 ;"
+		"adox 24(%1), %%r11 ;"
+		/***************************************/
+		"adcx %%rbx, %%rcx ;"
+		"adox  %%rbx, %%rcx ;"
+		"imul %%rdx, %%rcx ;" /* c*C[4], cf=0, of=0 */
+		"adcx %%rcx,  %%r8 ;"
+		"adcx %%rbx,  %%r9 ;"
+		"movq  %%r9,  8(%0) ;"
+		"adcx %%rbx, %%r10 ;"
+		"movq %%r10, 16(%0) ;"
+		"adcx %%rbx, %%r11 ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $0, %%ecx ;"
+		"cmovc %%edx, %%ecx ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,   (%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11");
+}
+
+static void red_eltfp25519_1w_bmi2(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movl    $38, %%edx ;"	/* 2*c = 38 = 2^256 */
+		"mulx 32(%1),  %%r8, %%r10 ;" /* c*C[4] */
+		"mulx 40(%1),  %%r9, %%r11 ;" /* c*C[5] */
+		"addq %%r10,  %%r9 ;"
+		"mulx 48(%1), %%r10, %%rax ;" /* c*C[6] */
+		"adcq %%r11, %%r10 ;"
+		"mulx 56(%1), %%r11, %%rcx ;" /* c*C[7] */
+		"adcq %%rax, %%r11 ;"
+		/***************************************/
+		"adcq    $0, %%rcx ;"
+		"addq   (%1),  %%r8 ;"
+		"adcq  8(%1),  %%r9 ;"
+		"adcq 16(%1), %%r10 ;"
+		"adcq 24(%1), %%r11 ;"
+		"adcq     $0, %%rcx ;"
+		"imul %%rdx, %%rcx ;" /* c*C[4], cf=0 */
+		"addq %%rcx,  %%r8 ;"
+		"adcq    $0,  %%r9 ;"
+		"movq  %%r9,  8(%0) ;"
+		"adcq    $0, %%r10 ;"
+		"movq %%r10, 16(%0) ;"
+		"adcq    $0, %%r11 ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $0, %%ecx ;"
+		"cmovc %%edx, %%ecx ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,   (%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11");
+}
+
+static __always_inline void add_eltfp25519_1w_adx(u64 *const c, const u64 *const a, const u64 *const b)
+{
+	asm volatile(
+		"mov     $38, %%eax ;"
+		"xorl  %%ecx, %%ecx ;"
+		"movq   (%2),  %%r8 ;"
+		"adcx   (%1),  %%r8 ;"
+		"movq  8(%2),  %%r9 ;"
+		"adcx  8(%1),  %%r9 ;"
+		"movq 16(%2), %%r10 ;"
+		"adcx 16(%1), %%r10 ;"
+		"movq 24(%2), %%r11 ;"
+		"adcx 24(%1), %%r11 ;"
+		"cmovc %%eax, %%ecx ;"
+		"xorl %%eax, %%eax  ;"
+		"adcx %%rcx,  %%r8  ;"
+		"adcx %%rax,  %%r9  ;"
+		"movq  %%r9,  8(%0) ;"
+		"adcx %%rax, %%r10  ;"
+		"movq %%r10, 16(%0) ;"
+		"adcx %%rax, %%r11  ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $38, %%ecx ;"
+		"cmovc %%ecx, %%eax ;"
+		"addq %%rax,  %%r8  ;"
+		"movq  %%r8,   (%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(b)
+		: "memory", "cc", "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11");
+}
+
+static __always_inline void add_eltfp25519_1w_bmi2(u64 *const c, const u64 *const a, const u64 *const b)
+{
+	asm volatile(
+		"mov     $38, %%eax ;"
+		"movq   (%2),  %%r8 ;"
+		"addq   (%1),  %%r8 ;"
+		"movq  8(%2),  %%r9 ;"
+		"adcq  8(%1),  %%r9 ;"
+		"movq 16(%2), %%r10 ;"
+		"adcq 16(%1), %%r10 ;"
+		"movq 24(%2), %%r11 ;"
+		"adcq 24(%1), %%r11 ;"
+		"mov      $0, %%ecx ;"
+		"cmovc %%eax, %%ecx ;"
+		"addq %%rcx,  %%r8  ;"
+		"adcq    $0,  %%r9  ;"
+		"movq  %%r9,  8(%0) ;"
+		"adcq    $0, %%r10  ;"
+		"movq %%r10, 16(%0) ;"
+		"adcq    $0, %%r11  ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $0, %%ecx  ;"
+		"cmovc %%eax, %%ecx ;"
+		"addq %%rcx,  %%r8  ;"
+		"movq  %%r8,   (%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(b)
+		: "memory", "cc", "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11");
+}
+
+static __always_inline void sub_eltfp25519_1w(u64 *const c, const u64 *const a, const u64 *const b)
+{
+	asm volatile(
+		"mov     $38, %%eax ;"
+		"movq   (%1),  %%r8 ;"
+		"subq   (%2),  %%r8 ;"
+		"movq  8(%1),  %%r9 ;"
+		"sbbq  8(%2),  %%r9 ;"
+		"movq 16(%1), %%r10 ;"
+		"sbbq 16(%2), %%r10 ;"
+		"movq 24(%1), %%r11 ;"
+		"sbbq 24(%2), %%r11 ;"
+		"mov      $0, %%ecx ;"
+		"cmovc %%eax, %%ecx ;"
+		"subq %%rcx,  %%r8  ;"
+		"sbbq    $0,  %%r9  ;"
+		"movq  %%r9,  8(%0) ;"
+		"sbbq    $0, %%r10  ;"
+		"movq %%r10, 16(%0) ;"
+		"sbbq    $0, %%r11  ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $0, %%ecx  ;"
+		"cmovc %%eax, %%ecx ;"
+		"subq %%rcx,  %%r8  ;"
+		"movq  %%r8,   (%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(b)
+		: "memory", "cc", "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11");
+}
+
+/* Multiplication by a24 = (A+2)/4 = (486662+2)/4 = 121666 */
+static __always_inline void mul_a24_eltfp25519_1w(u64 *const c, const u64 *const a)
+{
+	const u64 a24 = 121666;
+	asm volatile(
+		"movq     %2, %%rdx ;"
+		"mulx   (%1),  %%r8, %%r10 ;"
+		"mulx  8(%1),  %%r9, %%r11 ;"
+		"addq %%r10,  %%r9 ;"
+		"mulx 16(%1), %%r10, %%rax ;"
+		"adcq %%r11, %%r10 ;"
+		"mulx 24(%1), %%r11, %%rcx ;"
+		"adcq %%rax, %%r11 ;"
+		/**************************/
+		"adcq    $0, %%rcx ;"
+		"movl   $38, %%edx ;" /* 2*c = 38 = 2^256 mod 2^255-19*/
+		"imul %%rdx, %%rcx ;"
+		"addq %%rcx,  %%r8 ;"
+		"adcq    $0,  %%r9 ;"
+		"movq  %%r9,  8(%0) ;"
+		"adcq    $0, %%r10 ;"
+		"movq %%r10, 16(%0) ;"
+		"adcq    $0, %%r11 ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $0, %%ecx ;"
+		"cmovc %%edx, %%ecx ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,   (%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(a24)
+		: "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11");
+}
+
+static void inv_eltfp25519_1w_adx(u64 *const c, const u64 *const a)
+{
+	struct {
+		eltfp25519_1w_buffer buffer;
+		eltfp25519_1w x0, x1, x2;
+	} __aligned(32) m;
+	u64 *T[4];
+
+	T[0] = m.x0;
+	T[1] = c; /* x^(-1) */
+	T[2] = m.x1;
+	T[3] = m.x2;
+
+	copy_eltfp25519_1w(T[1], a);
+	sqrn_eltfp25519_1w_adx(T[1], 1);
+	copy_eltfp25519_1w(T[2], T[1]);
+	sqrn_eltfp25519_1w_adx(T[2], 2);
+	mul_eltfp25519_1w_adx(T[0], a, T[2]);
+	mul_eltfp25519_1w_adx(T[1], T[1], T[0]);
+	copy_eltfp25519_1w(T[2], T[1]);
+	sqrn_eltfp25519_1w_adx(T[2], 1);
+	mul_eltfp25519_1w_adx(T[0], T[0], T[2]);
+	copy_eltfp25519_1w(T[2], T[0]);
+	sqrn_eltfp25519_1w_adx(T[2], 5);
+	mul_eltfp25519_1w_adx(T[0], T[0], T[2]);
+	copy_eltfp25519_1w(T[2], T[0]);
+	sqrn_eltfp25519_1w_adx(T[2], 10);
+	mul_eltfp25519_1w_adx(T[2], T[2], T[0]);
+	copy_eltfp25519_1w(T[3], T[2]);
+	sqrn_eltfp25519_1w_adx(T[3], 20);
+	mul_eltfp25519_1w_adx(T[3], T[3], T[2]);
+	sqrn_eltfp25519_1w_adx(T[3], 10);
+	mul_eltfp25519_1w_adx(T[3], T[3], T[0]);
+	copy_eltfp25519_1w(T[0], T[3]);
+	sqrn_eltfp25519_1w_adx(T[0], 50);
+	mul_eltfp25519_1w_adx(T[0], T[0], T[3]);
+	copy_eltfp25519_1w(T[2], T[0]);
+	sqrn_eltfp25519_1w_adx(T[2], 100);
+	mul_eltfp25519_1w_adx(T[2], T[2], T[0]);
+	sqrn_eltfp25519_1w_adx(T[2], 50);
+	mul_eltfp25519_1w_adx(T[2], T[2], T[3]);
+	sqrn_eltfp25519_1w_adx(T[2], 5);
+	mul_eltfp25519_1w_adx(T[1], T[1], T[2]);
+
+	memzero_explicit(&m, sizeof(m));
+}
+
+static void inv_eltfp25519_1w_bmi2(u64 *const c, const u64 *const a)
+{
+	struct {
+		eltfp25519_1w_buffer buffer;
+		eltfp25519_1w x0, x1, x2;
+	} __aligned(32) m;
+	u64 *T[5];
+
+	T[0] = m.x0;
+	T[1] = c; /* x^(-1) */
+	T[2] = m.x1;
+	T[3] = m.x2;
+
+	copy_eltfp25519_1w(T[1], a);
+	sqrn_eltfp25519_1w_bmi2(T[1], 1);
+	copy_eltfp25519_1w(T[2], T[1]);
+	sqrn_eltfp25519_1w_bmi2(T[2], 2);
+	mul_eltfp25519_1w_bmi2(T[0], a, T[2]);
+	mul_eltfp25519_1w_bmi2(T[1], T[1], T[0]);
+	copy_eltfp25519_1w(T[2], T[1]);
+	sqrn_eltfp25519_1w_bmi2(T[2], 1);
+	mul_eltfp25519_1w_bmi2(T[0], T[0], T[2]);
+	copy_eltfp25519_1w(T[2], T[0]);
+	sqrn_eltfp25519_1w_bmi2(T[2], 5);
+	mul_eltfp25519_1w_bmi2(T[0], T[0], T[2]);
+	copy_eltfp25519_1w(T[2], T[0]);
+	sqrn_eltfp25519_1w_bmi2(T[2], 10);
+	mul_eltfp25519_1w_bmi2(T[2], T[2], T[0]);
+	copy_eltfp25519_1w(T[3], T[2]);
+	sqrn_eltfp25519_1w_bmi2(T[3], 20);
+	mul_eltfp25519_1w_bmi2(T[3], T[3], T[2]);
+	sqrn_eltfp25519_1w_bmi2(T[3], 10);
+	mul_eltfp25519_1w_bmi2(T[3], T[3], T[0]);
+	copy_eltfp25519_1w(T[0], T[3]);
+	sqrn_eltfp25519_1w_bmi2(T[0], 50);
+	mul_eltfp25519_1w_bmi2(T[0], T[0], T[3]);
+	copy_eltfp25519_1w(T[2], T[0]);
+	sqrn_eltfp25519_1w_bmi2(T[2], 100);
+	mul_eltfp25519_1w_bmi2(T[2], T[2], T[0]);
+	sqrn_eltfp25519_1w_bmi2(T[2], 50);
+	mul_eltfp25519_1w_bmi2(T[2], T[2], T[3]);
+	sqrn_eltfp25519_1w_bmi2(T[2], 5);
+	mul_eltfp25519_1w_bmi2(T[1], T[1], T[2]);
+
+	memzero_explicit(&m, sizeof(m));
+}
+
+/* Given c, a 256-bit number, fred_eltfp25519_1w updates c
+ * with a number such that 0 <= C < 2**255-19.
+ */
+static __always_inline void fred_eltfp25519_1w(u64 *const c)
+{
+	u64 tmp0 = 38, tmp1 = 19;
+	asm volatile(
+		"btrq   $63,    %3 ;" /* Put bit 255 in carry flag and clear */
+		"cmovncl %k5,   %k4 ;" /* c[255] ? 38 : 19 */
+
+		/* Add either 19 or 38 to c */
+		"addq    %4,   %0 ;"
+		"adcq    $0,   %1 ;"
+		"adcq    $0,   %2 ;"
+		"adcq    $0,   %3 ;"
+
+		/* Test for bit 255 again; only triggered on overflow modulo 2^255-19 */
+		"movl    $0,  %k4 ;"
+		"cmovnsl %k5,  %k4 ;" /* c[255] ? 0 : 19 */
+		"btrq   $63,   %3 ;" /* Clear bit 255 */
+
+		/* Subtract 19 if necessary */
+		"subq    %4,   %0 ;"
+		"sbbq    $0,   %1 ;"
+		"sbbq    $0,   %2 ;"
+		"sbbq    $0,   %3 ;"
+
+		: "+r"(c[0]), "+r"(c[1]), "+r"(c[2]), "+r"(c[3]), "+r"(tmp0), "+r"(tmp1)
+		:
+		: "memory", "cc");
+}
+
+static __always_inline void cswap(u8 bit, u64 *const px, u64 *const py)
+{
+	u64 temp;
+	asm volatile(
+		"test %9, %9 ;"
+		"movq %0, %8 ;"
+		"cmovnzq %4, %0 ;"
+		"cmovnzq %8, %4 ;"
+		"movq %1, %8 ;"
+		"cmovnzq %5, %1 ;"
+		"cmovnzq %8, %5 ;"
+		"movq %2, %8 ;"
+		"cmovnzq %6, %2 ;"
+		"cmovnzq %8, %6 ;"
+		"movq %3, %8 ;"
+		"cmovnzq %7, %3 ;"
+		"cmovnzq %8, %7 ;"
+		: "+r"(px[0]), "+r"(px[1]), "+r"(px[2]), "+r"(px[3]),
+		  "+r"(py[0]), "+r"(py[1]), "+r"(py[2]), "+r"(py[3]),
+		  "=r"(temp)
+		: "r"(bit)
+		: "cc"
+	);
+}
+
+static __always_inline void cselect(u8 bit, u64 *const px, const u64 *const py)
+{
+	asm volatile(
+		"test %4, %4 ;"
+		"cmovnzq %5, %0 ;"
+		"cmovnzq %6, %1 ;"
+		"cmovnzq %7, %2 ;"
+		"cmovnzq %8, %3 ;"
+		: "+r"(px[0]), "+r"(px[1]), "+r"(px[2]), "+r"(px[3])
+		: "r"(bit), "rm"(py[0]), "rm"(py[1]), "rm"(py[2]), "rm"(py[3])
+		: "cc"
+	);
+}
+
+static __always_inline void clamp_secret(u8 secret[CURVE25519_POINT_SIZE])
+{
+	secret[0] &= 248;
+	secret[31] &= 127;
+	secret[31] |= 64;
+}
+
+static void curve25519_adx(u8 shared[CURVE25519_POINT_SIZE], const u8 private_key[CURVE25519_POINT_SIZE], const u8 session_key[CURVE25519_POINT_SIZE])
+{
+	struct {
+		u64 buffer[4 * NUM_WORDS_ELTFP25519];
+		u64 coordinates[4 * NUM_WORDS_ELTFP25519];
+		u64 workspace[6 * NUM_WORDS_ELTFP25519];
+		u8 session[CURVE25519_POINT_SIZE];
+		u8 private[CURVE25519_POINT_SIZE];
+	} __aligned(32) m;
+
+	int i = 0, j = 0;
+	u64 prev = 0;
+	u64 *const X1 = (u64 *)m.session;
+	u64 *const key = (u64 *)m.private;
+	u64 *const Px = m.coordinates + 0;
+	u64 *const Pz = m.coordinates + 4;
+	u64 *const Qx = m.coordinates + 8;
+	u64 *const Qz = m.coordinates + 12;
+	u64 *const X2 = Qx;
+	u64 *const Z2 = Qz;
+	u64 *const X3 = Px;
+	u64 *const Z3 = Pz;
+	u64 *const X2Z2 = Qx;
+	u64 *const X3Z3 = Px;
+
+	u64 *const A = m.workspace + 0;
+	u64 *const B = m.workspace + 4;
+	u64 *const D = m.workspace + 8;
+	u64 *const C = m.workspace + 12;
+	u64 *const DA = m.workspace + 16;
+	u64 *const CB = m.workspace + 20;
+	u64 *const AB = A;
+	u64 *const DC = D;
+	u64 *const DACB = DA;
+
+	memcpy(m.private, private_key, sizeof(m.private));
+	memcpy(m.session, session_key, sizeof(m.session));
+
+	clamp_secret(m.private);
+
+	/* As in the draft:
+	 * When receiving such an array, implementations of curve25519
+	 * MUST mask the most-significant bit in the final byte. This
+	 * is done to preserve compatibility with point formats which
+	 * reserve the sign bit for use in other protocols and to
+	 * increase resistance to implementation fingerprinting
+	 */
+	m.session[CURVE25519_POINT_SIZE - 1] &= (1 << (255 % 8)) - 1;
+
+	copy_eltfp25519_1w(Px, X1);
+	setzero_eltfp25519_1w(Pz);
+	setzero_eltfp25519_1w(Qx);
+	setzero_eltfp25519_1w(Qz);
+
+	Pz[0] = 1;
+	Qx[0] = 1;
+
+	/* main-loop */
+	prev = 0;
+	j = 62;
+	for (i = 3; i >= 0; --i) {
+		while (j >= 0) {
+			u64 bit = (key[i] >> j) & 0x1;
+			u64 swap = bit ^ prev;
+			prev = bit;
+
+			add_eltfp25519_1w_adx(A, X2, Z2);	/* A = (X2+Z2) */
+			sub_eltfp25519_1w(B, X2, Z2);		/* B = (X2-Z2) */
+			add_eltfp25519_1w_adx(C, X3, Z3);	/* C = (X3+Z3) */
+			sub_eltfp25519_1w(D, X3, Z3);		/* D = (X3-Z3) */
+			mul_eltfp25519_2w_adx(DACB, AB, DC);	/* [DA|CB] = [A|B]*[D|C] */
+
+			cselect(swap, A, C);
+			cselect(swap, B, D);
+
+			sqr_eltfp25519_2w_adx(AB);		/* [AA|BB] = [A^2|B^2] */
+			add_eltfp25519_1w_adx(X3, DA, CB);	/* X3 = (DA+CB) */
+			sub_eltfp25519_1w(Z3, DA, CB);		/* Z3 = (DA-CB) */
+			sqr_eltfp25519_2w_adx(X3Z3);		/* [X3|Z3] = [(DA+CB)|(DA+CB)]^2 */
+
+			copy_eltfp25519_1w(X2, B);		/* X2 = B^2 */
+			sub_eltfp25519_1w(Z2, A, B);		/* Z2 = E = AA-BB */
+
+			mul_a24_eltfp25519_1w(B, Z2);		/* B = a24*E */
+			add_eltfp25519_1w_adx(B, B, X2);	/* B = a24*E+B */
+			mul_eltfp25519_2w_adx(X2Z2, X2Z2, AB);	/* [X2|Z2] = [B|E]*[A|a24*E+B] */
+			mul_eltfp25519_1w_adx(Z3, Z3, X1);	/* Z3 = Z3*X1 */
+			--j;
+		}
+		j = 63;
+	}
+
+	inv_eltfp25519_1w_adx(A, Qz);
+	mul_eltfp25519_1w_adx((u64 *)shared, Qx, A);
+	fred_eltfp25519_1w((u64 *)shared);
+
+	memzero_explicit(&m, sizeof(m));
+}
+
+static void curve25519_adx_base(u8 session_key[CURVE25519_POINT_SIZE], const u8 private_key[CURVE25519_POINT_SIZE])
+{
+	struct {
+		u64 buffer[4 * NUM_WORDS_ELTFP25519];
+		u64 coordinates[4 * NUM_WORDS_ELTFP25519];
+		u64 workspace[4 * NUM_WORDS_ELTFP25519];
+		u8 private[CURVE25519_POINT_SIZE];
+	} __aligned(32) m;
+
+	const int ite[4] = { 64, 64, 64, 63 };
+	const int q = 3;
+	u64 swap = 1;
+
+	int i = 0, j = 0, k = 0;
+	u64 *const key = (u64 *)m.private;
+	u64 *const Ur1 = m.coordinates + 0;
+	u64 *const Zr1 = m.coordinates + 4;
+	u64 *const Ur2 = m.coordinates + 8;
+	u64 *const Zr2 = m.coordinates + 12;
+
+	u64 *const UZr1 = m.coordinates + 0;
+	u64 *const ZUr2 = m.coordinates + 8;
+
+	u64 *const A = m.workspace + 0;
+	u64 *const B = m.workspace + 4;
+	u64 *const C = m.workspace + 8;
+	u64 *const D = m.workspace + 12;
+
+	u64 *const AB = m.workspace + 0;
+	u64 *const CD = m.workspace + 8;
+
+	const u64 *const P = table_ladder_8k;
+
+	memcpy(m.private, private_key, sizeof(m.private));
+
+	clamp_secret(m.private);
+
+	setzero_eltfp25519_1w(Ur1);
+	setzero_eltfp25519_1w(Zr1);
+	setzero_eltfp25519_1w(Zr2);
+	Ur1[0] = 1;
+	Zr1[0] = 1;
+	Zr2[0] = 1;
+
+	/* G-S */
+	Ur2[3] = 0x1eaecdeee27cab34UL;
+	Ur2[2] = 0xadc7a0b9235d48e2UL;
+	Ur2[1] = 0xbbf095ae14b2edf8UL;
+	Ur2[0] = 0x7e94e1fec82faabdUL;
+
+	/* main-loop */
+	j = q;
+	for (i = 0; i < NUM_WORDS_ELTFP25519; ++i) {
+		while (j < ite[i]) {
+			u64 bit = (key[i] >> j) & 0x1;
+			k = (64 * i + j - q);
+			swap = swap ^ bit;
+			cswap(swap, Ur1, Ur2);
+			cswap(swap, Zr1, Zr2);
+			swap = bit;
+			/* Addition */
+			sub_eltfp25519_1w(B, Ur1, Zr1);		/* B = Ur1-Zr1 */
+			add_eltfp25519_1w_adx(A, Ur1, Zr1);	/* A = Ur1+Zr1 */
+			mul_eltfp25519_1w_adx(C, &P[4 * k], B);	/* C = M0-B */
+			sub_eltfp25519_1w(B, A, C);		/* B = (Ur1+Zr1) - M*(Ur1-Zr1) */
+			add_eltfp25519_1w_adx(A, A, C);		/* A = (Ur1+Zr1) + M*(Ur1-Zr1) */
+			sqr_eltfp25519_2w_adx(AB);		/* A = A^2      |  B = B^2 */
+			mul_eltfp25519_2w_adx(UZr1, ZUr2, AB);	/* Ur1 = Zr2*A  |  Zr1 = Ur2*B */
+			++j;
+		}
+		j = 0;
+	}
+
+	/* Doubling */
+	for (i = 0; i < q; ++i) {
+		add_eltfp25519_1w_adx(A, Ur1, Zr1);	/*  A = Ur1+Zr1 */
+		sub_eltfp25519_1w(B, Ur1, Zr1);		/*  B = Ur1-Zr1 */
+		sqr_eltfp25519_2w_adx(AB);		/*  A = A**2     B = B**2 */
+		copy_eltfp25519_1w(C, B);		/*  C = B */
+		sub_eltfp25519_1w(B, A, B);		/*  B = A-B */
+		mul_a24_eltfp25519_1w(D, B);		/*  D = my_a24*B */
+		add_eltfp25519_1w_adx(D, D, C);		/*  D = D+C */
+		mul_eltfp25519_2w_adx(UZr1, AB, CD);	/*  Ur1 = A*B   Zr1 = Zr1*A */
+	}
+
+	/* Convert to affine coordinates */
+	inv_eltfp25519_1w_adx(A, Zr1);
+	mul_eltfp25519_1w_adx((u64 *)session_key, Ur1, A);
+	fred_eltfp25519_1w((u64 *)session_key);
+
+	memzero_explicit(&m, sizeof(m));
+}
+
+static void curve25519_bmi2(u8 shared[CURVE25519_POINT_SIZE], const u8 private_key[CURVE25519_POINT_SIZE], const u8 session_key[CURVE25519_POINT_SIZE])
+{
+	struct {
+		u64 buffer[4 * NUM_WORDS_ELTFP25519];
+		u64 coordinates[4 * NUM_WORDS_ELTFP25519];
+		u64 workspace[6 * NUM_WORDS_ELTFP25519];
+		u8 session[CURVE25519_POINT_SIZE];
+		u8 private[CURVE25519_POINT_SIZE];
+	} __aligned(32) m;
+
+	int i = 0, j = 0;
+	u64 prev = 0;
+	u64 *const X1 = (u64 *)m.session;
+	u64 *const key = (u64 *)m.private;
+	u64 *const Px = m.coordinates + 0;
+	u64 *const Pz = m.coordinates + 4;
+	u64 *const Qx = m.coordinates + 8;
+	u64 *const Qz = m.coordinates + 12;
+	u64 *const X2 = Qx;
+	u64 *const Z2 = Qz;
+	u64 *const X3 = Px;
+	u64 *const Z3 = Pz;
+	u64 *const X2Z2 = Qx;
+	u64 *const X3Z3 = Px;
+
+	u64 *const A = m.workspace + 0;
+	u64 *const B = m.workspace + 4;
+	u64 *const D = m.workspace + 8;
+	u64 *const C = m.workspace + 12;
+	u64 *const DA = m.workspace + 16;
+	u64 *const CB = m.workspace + 20;
+	u64 *const AB = A;
+	u64 *const DC = D;
+	u64 *const DACB = DA;
+
+	memcpy(m.private, private_key, sizeof(m.private));
+	memcpy(m.session, session_key, sizeof(m.session));
+
+	clamp_secret(m.private);
+
+	/* As in the draft:
+	 * When receiving such an array, implementations of curve25519
+	 * MUST mask the most-significant bit in the final byte. This
+	 * is done to preserve compatibility with point formats which
+	 * reserve the sign bit for use in other protocols and to
+	 * increase resistance to implementation fingerprinting
+	 */
+	m.session[CURVE25519_POINT_SIZE - 1] &= (1 << (255 % 8)) - 1;
+
+	copy_eltfp25519_1w(Px, X1);
+	setzero_eltfp25519_1w(Pz);
+	setzero_eltfp25519_1w(Qx);
+	setzero_eltfp25519_1w(Qz);
+
+	Pz[0] = 1;
+	Qx[0] = 1;
+
+	/* main-loop */
+	prev = 0;
+	j = 62;
+	for (i = 3; i >= 0; --i) {
+		while (j >= 0) {
+			u64 bit = (key[i] >> j) & 0x1;
+			u64 swap = bit ^ prev;
+			prev = bit;
+
+			add_eltfp25519_1w_bmi2(A, X2, Z2);	/* A = (X2+Z2) */
+			sub_eltfp25519_1w(B, X2, Z2);		/* B = (X2-Z2) */
+			add_eltfp25519_1w_bmi2(C, X3, Z3);	/* C = (X3+Z3) */
+			sub_eltfp25519_1w(D, X3, Z3);		/* D = (X3-Z3) */
+			mul_eltfp25519_2w_bmi2(DACB, AB, DC);	/* [DA|CB] = [A|B]*[D|C] */
+
+			cselect(swap, A, C);
+			cselect(swap, B, D);
+
+			sqr_eltfp25519_2w_bmi2(AB);		/* [AA|BB] = [A^2|B^2] */
+			add_eltfp25519_1w_bmi2(X3, DA, CB);	/* X3 = (DA+CB) */
+			sub_eltfp25519_1w(Z3, DA, CB);		/* Z3 = (DA-CB) */
+			sqr_eltfp25519_2w_bmi2(X3Z3);		/* [X3|Z3] = [(DA+CB)|(DA+CB)]^2 */
+
+			copy_eltfp25519_1w(X2, B);		/* X2 = B^2 */
+			sub_eltfp25519_1w(Z2, A, B);		/* Z2 = E = AA-BB */
+
+			mul_a24_eltfp25519_1w(B, Z2);		/* B = a24*E */
+			add_eltfp25519_1w_bmi2(B, B, X2);	/* B = a24*E+B */
+			mul_eltfp25519_2w_bmi2(X2Z2, X2Z2, AB);	/* [X2|Z2] = [B|E]*[A|a24*E+B] */
+			mul_eltfp25519_1w_bmi2(Z3, Z3, X1);	/* Z3 = Z3*X1 */
+			--j;
+		}
+		j = 63;
+	}
+
+	inv_eltfp25519_1w_bmi2(A, Qz);
+	mul_eltfp25519_1w_bmi2((u64 *)shared, Qx, A);
+	fred_eltfp25519_1w((u64 *)shared);
+
+	memzero_explicit(&m, sizeof(m));
+}
+
+static void curve25519_bmi2_base(u8 session_key[CURVE25519_POINT_SIZE], const u8 private_key[CURVE25519_POINT_SIZE])
+{
+	struct {
+		u64 buffer[4 * NUM_WORDS_ELTFP25519];
+		u64 coordinates[4 * NUM_WORDS_ELTFP25519];
+		u64 workspace[4 * NUM_WORDS_ELTFP25519];
+		u8 private[CURVE25519_POINT_SIZE];
+	} __aligned(32) m;
+
+	const int ite[4] = { 64, 64, 64, 63 };
+	const int q = 3;
+	u64 swap = 1;
+
+	int i = 0, j = 0, k = 0;
+	u64 *const key = (u64 *)m.private;
+	u64 *const Ur1 = m.coordinates + 0;
+	u64 *const Zr1 = m.coordinates + 4;
+	u64 *const Ur2 = m.coordinates + 8;
+	u64 *const Zr2 = m.coordinates + 12;
+
+	u64 *const UZr1 = m.coordinates + 0;
+	u64 *const ZUr2 = m.coordinates + 8;
+
+	u64 *const A = m.workspace + 0;
+	u64 *const B = m.workspace + 4;
+	u64 *const C = m.workspace + 8;
+	u64 *const D = m.workspace + 12;
+
+	u64 *const AB = m.workspace + 0;
+	u64 *const CD = m.workspace + 8;
+
+	const u64 *const P = table_ladder_8k;
+
+	memcpy(m.private, private_key, sizeof(m.private));
+
+	clamp_secret(m.private);
+
+	setzero_eltfp25519_1w(Ur1);
+	setzero_eltfp25519_1w(Zr1);
+	setzero_eltfp25519_1w(Zr2);
+	Ur1[0] = 1;
+	Zr1[0] = 1;
+	Zr2[0] = 1;
+
+	/* G-S */
+	Ur2[3] = 0x1eaecdeee27cab34UL;
+	Ur2[2] = 0xadc7a0b9235d48e2UL;
+	Ur2[1] = 0xbbf095ae14b2edf8UL;
+	Ur2[0] = 0x7e94e1fec82faabdUL;
+
+	/* main-loop */
+	j = q;
+	for (i = 0; i < NUM_WORDS_ELTFP25519; ++i) {
+		while (j < ite[i]) {
+			u64 bit = (key[i] >> j) & 0x1;
+			k = (64 * i + j - q);
+			swap = swap ^ bit;
+			cswap(swap, Ur1, Ur2);
+			cswap(swap, Zr1, Zr2);
+			swap = bit;
+			/* Addition */
+			sub_eltfp25519_1w(B, Ur1, Zr1);		/* B = Ur1-Zr1 */
+			add_eltfp25519_1w_bmi2(A, Ur1, Zr1);	/* A = Ur1+Zr1 */
+			mul_eltfp25519_1w_bmi2(C, &P[4 * k], B);/* C = M0-B */
+			sub_eltfp25519_1w(B, A, C);		/* B = (Ur1+Zr1) - M*(Ur1-Zr1) */
+			add_eltfp25519_1w_bmi2(A, A, C);	/* A = (Ur1+Zr1) + M*(Ur1-Zr1) */
+			sqr_eltfp25519_2w_bmi2(AB);		/* A = A^2      |  B = B^2 */
+			mul_eltfp25519_2w_bmi2(UZr1, ZUr2, AB);	/* Ur1 = Zr2*A  |  Zr1 = Ur2*B */
+			++j;
+		}
+		j = 0;
+	}
+
+	/* Doubling */
+	for (i = 0; i < q; ++i) {
+		add_eltfp25519_1w_bmi2(A, Ur1, Zr1);	/*  A = Ur1+Zr1 */
+		sub_eltfp25519_1w(B, Ur1, Zr1);		/*  B = Ur1-Zr1 */
+		sqr_eltfp25519_2w_bmi2(AB);		/*  A = A**2     B = B**2 */
+		copy_eltfp25519_1w(C, B);		/*  C = B */
+		sub_eltfp25519_1w(B, A, B);		/*  B = A-B */
+		mul_a24_eltfp25519_1w(D, B);		/*  D = my_a24*B */
+		add_eltfp25519_1w_bmi2(D, D, C);	/*  D = D+C */
+		mul_eltfp25519_2w_bmi2(UZr1, AB, CD);	/*  Ur1 = A*B   Zr1 = Zr1*A */
+	}
+
+	/* Convert to affine coordinates */
+	inv_eltfp25519_1w_bmi2(A, Zr1);
+	mul_eltfp25519_1w_bmi2((u64 *)session_key, Ur1, A);
+	fred_eltfp25519_1w((u64 *)session_key);
+
+	memzero_explicit(&m, sizeof(m));
+}
-- 
2.18.0

^ permalink raw reply related

* [PATCH v2 14/17] zinc: Curve25519 generic C implementations and selftest
From: Jason A. Donenfeld @ 2018-08-24 21:38 UTC (permalink / raw)
  To: linux-kernel, netdev, davem
  Cc: Jason A. Donenfeld, Andy Lutomirski, Greg KH, Samuel Neves,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan, linux-crypto
In-Reply-To: <20180824213849.23647-1-Jason@zx2c4.com>

This contains two formally verified C implementations of the Curve25519
scalar multiplication function, one for 32-bit systems, and one for
64-bit systems whose compiler supports efficient 128-bit integer types.
Not only are these implementations formally verified, but they are also
the fastest available C implementations. They have been modified to be
friendly to kernel space and to be generally less horrendous looking,
but still an effort has been made to retain their formally verified
characteristic, and so the C might look slightly unidiomatic.

The 64-bit version comes from HACL*: https://github.com/project-everest/hacl-star
The 32-bit version comes from Fiat: https://github.com/mit-plv/fiat-crypto

Information: https://cr.yp.to/ecdh.html

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Karthikeyan Bhargavan <karthik.bhargavan@gmail.com>
Cc: linux-crypto@vger.kernel.org
---
 include/zinc/curve25519.h               |  25 +
 lib/zinc/Kconfig                        |   4 +
 lib/zinc/Makefile                       |   4 +
 lib/zinc/curve25519/curve25519-fiat32.h | 838 ++++++++++++++++++++++++
 lib/zinc/curve25519/curve25519-hacl64.h | 744 +++++++++++++++++++++
 lib/zinc/curve25519/curve25519.c        |  62 ++
 lib/zinc/main.c                         |   5 +
 lib/zinc/selftest/curve25519.h          | 607 +++++++++++++++++
 8 files changed, 2289 insertions(+)
 create mode 100644 include/zinc/curve25519.h
 create mode 100644 lib/zinc/curve25519/curve25519-fiat32.h
 create mode 100644 lib/zinc/curve25519/curve25519-hacl64.h
 create mode 100644 lib/zinc/curve25519/curve25519.c
 create mode 100644 lib/zinc/selftest/curve25519.h

diff --git a/include/zinc/curve25519.h b/include/zinc/curve25519.h
new file mode 100644
index 000000000000..b6bd387e3b8a
--- /dev/null
+++ b/include/zinc/curve25519.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef _ZINC_CURVE25519_H
+#define _ZINC_CURVE25519_H
+
+#include <linux/types.h>
+
+enum curve25519_lengths {
+	CURVE25519_POINT_SIZE = 32
+};
+
+bool __must_check curve25519(u8 mypublic[CURVE25519_POINT_SIZE], const u8 secret[CURVE25519_POINT_SIZE], const u8 basepoint[CURVE25519_POINT_SIZE]);
+void curve25519_generate_secret(u8 secret[CURVE25519_POINT_SIZE]);
+bool __must_check curve25519_generate_public(u8 pub[CURVE25519_POINT_SIZE], const u8 secret[CURVE25519_POINT_SIZE]);
+
+void curve25519_fpu_init(void);
+
+#ifdef CONFIG_ZINC_DEBUG
+bool curve25519_selftest(void);
+#endif
+
+#endif /* _ZINC_CURVE25519_H */
diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig
index 25b46a87509f..285ba527efed 100644
--- a/lib/zinc/Kconfig
+++ b/lib/zinc/Kconfig
@@ -24,6 +24,10 @@ config ZINC_BLAKE2S
 	bool
 	select ZINC
 
+config ZINC_CURVE25519
+	bool
+	select ZINC
+
 config ZINC_DEBUG
 	bool "Zinc cryptography library debugging and self-tests"
 	depends on ZINC
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index e252eea24844..b6b0614f8743 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -50,6 +50,10 @@ ifeq ($(CONFIG_ZINC_CHACHA20POLY1305),y)
 zinc-y += chacha20poly1305.o
 endif
 
+ifeq ($(CONFIG_ZINC_CURVE25519),y)
+zinc-y += curve25519/curve25519.o
+endif
+
 ifeq ($(CONFIG_ZINC_BLAKE2S),y)
 zinc-y += blake2s/blake2s.o
 ifeq ($(CONFIG_X86_64),y)
diff --git a/lib/zinc/curve25519/curve25519-fiat32.h b/lib/zinc/curve25519/curve25519-fiat32.h
new file mode 100644
index 000000000000..f1e21a416a31
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-fiat32.h
@@ -0,0 +1,838 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2016 The fiat-crypto Authors.
+ * Copyright (C) 2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * This is a machine-generated formally verified implementation of curve25519 DH from:
+ * https://github.com/mit-plv/fiat-crypto
+ */
+
+/* fe means field element. Here the field is \Z/(2^255-19). An element t,
+ * entries t[0]...t[9], represents the integer t[0]+2^26 t[1]+2^51 t[2]+2^77
+ * t[3]+2^102 t[4]+...+2^230 t[9].
+ * fe limbs are bounded by 1.125*2^26,1.125*2^25,1.125*2^26,1.125*2^25,etc.
+ * Multiplication and carrying produce fe from fe_loose.
+ */
+typedef struct fe { u32 v[10]; } fe;
+
+/* fe_loose limbs are bounded by 3.375*2^26,3.375*2^25,3.375*2^26,3.375*2^25,etc.
+ * Addition and subtraction produce fe_loose from (fe, fe).
+ */
+typedef struct fe_loose { u32 v[10]; } fe_loose;
+
+static __always_inline void fe_frombytes_impl(u32 h[10], const u8 *s)
+{
+	/* Ignores top bit of s. */
+	u32 a0 = le32_to_cpup((__force __le32 *)(s));
+	u32 a1 = le32_to_cpup((__force __le32 *)(s+4));
+	u32 a2 = le32_to_cpup((__force __le32 *)(s+8));
+	u32 a3 = le32_to_cpup((__force __le32 *)(s+12));
+	u32 a4 = le32_to_cpup((__force __le32 *)(s+16));
+	u32 a5 = le32_to_cpup((__force __le32 *)(s+20));
+	u32 a6 = le32_to_cpup((__force __le32 *)(s+24));
+	u32 a7 = le32_to_cpup((__force __le32 *)(s+28));
+	h[0] = a0&((1<<26)-1);                    /* 26 used, 32-26 left.   26 */
+	h[1] = (a0>>26) | ((a1&((1<<19)-1))<< 6); /* (32-26) + 19 =  6+19 = 25 */
+	h[2] = (a1>>19) | ((a2&((1<<13)-1))<<13); /* (32-19) + 13 = 13+13 = 26 */
+	h[3] = (a2>>13) | ((a3&((1<< 6)-1))<<19); /* (32-13) +  6 = 19+ 6 = 25 */
+	h[4] = (a3>> 6);                          /* (32- 6)              = 26 */
+	h[5] = a4&((1<<25)-1);                    /*                        25 */
+	h[6] = (a4>>25) | ((a5&((1<<19)-1))<< 7); /* (32-25) + 19 =  7+19 = 26 */
+	h[7] = (a5>>19) | ((a6&((1<<12)-1))<<13); /* (32-19) + 12 = 13+12 = 25 */
+	h[8] = (a6>>12) | ((a7&((1<< 6)-1))<<20); /* (32-12) +  6 = 20+ 6 = 26 */
+	h[9] = (a7>> 6)&((1<<25)-1); /*                                     25 */
+}
+
+static __always_inline void fe_frombytes(fe *h, const u8 *s)
+{
+	fe_frombytes_impl(h->v, s);
+}
+
+static __always_inline u8 /*bool*/ addcarryx_u25(u8 /*bool*/ c, u32 a, u32 b, u32 *low)
+{
+	/* This function extracts 25 bits of result and 1 bit of carry (26 total), so
+	 * a 32-bit intermediate is sufficient.
+	 */
+	u32 x = a + b + c;
+	*low = x & ((1 << 25) - 1);
+	return (x >> 25) & 1;
+}
+
+static __always_inline u8 /*bool*/ addcarryx_u26(u8 /*bool*/ c, u32 a, u32 b, u32 *low)
+{
+	/* This function extracts 26 bits of result and 1 bit of carry (27 total), so
+	 * a 32-bit intermediate is sufficient.
+	 */
+	u32 x = a + b + c;
+	*low = x & ((1 << 26) - 1);
+	return (x >> 26) & 1;
+}
+
+static __always_inline u8 /*bool*/ subborrow_u25(u8 /*bool*/ c, u32 a, u32 b, u32 *low)
+{
+	/* This function extracts 25 bits of result and 1 bit of borrow (26 total), so
+	 * a 32-bit intermediate is sufficient.
+	 */
+	u32 x = a - b - c;
+	*low = x & ((1 << 25) - 1);
+	return x >> 31;
+}
+
+static __always_inline u8 /*bool*/ subborrow_u26(u8 /*bool*/ c, u32 a, u32 b, u32 *low)
+{
+	/* This function extracts 26 bits of result and 1 bit of borrow (27 total), so
+	 * a 32-bit intermediate is sufficient.
+	 */
+	u32 x = a - b - c;
+	*low = x & ((1 << 26) - 1);
+	return x >> 31;
+}
+
+static __always_inline u32 cmovznz32(u32 t, u32 z, u32 nz)
+{
+	t = -!!t; /* all set if nonzero, 0 if 0 */
+	return (t&nz) | ((~t)&z);
+}
+
+static __always_inline void fe_freeze(u32 out[10], const u32 in1[10])
+{
+	{ const u32 x17 = in1[9];
+	{ const u32 x18 = in1[8];
+	{ const u32 x16 = in1[7];
+	{ const u32 x14 = in1[6];
+	{ const u32 x12 = in1[5];
+	{ const u32 x10 = in1[4];
+	{ const u32 x8 = in1[3];
+	{ const u32 x6 = in1[2];
+	{ const u32 x4 = in1[1];
+	{ const u32 x2 = in1[0];
+	{ u32 x20; u8/*bool*/ x21 = subborrow_u26(0x0, x2, 0x3ffffed, &x20);
+	{ u32 x23; u8/*bool*/ x24 = subborrow_u25(x21, x4, 0x1ffffff, &x23);
+	{ u32 x26; u8/*bool*/ x27 = subborrow_u26(x24, x6, 0x3ffffff, &x26);
+	{ u32 x29; u8/*bool*/ x30 = subborrow_u25(x27, x8, 0x1ffffff, &x29);
+	{ u32 x32; u8/*bool*/ x33 = subborrow_u26(x30, x10, 0x3ffffff, &x32);
+	{ u32 x35; u8/*bool*/ x36 = subborrow_u25(x33, x12, 0x1ffffff, &x35);
+	{ u32 x38; u8/*bool*/ x39 = subborrow_u26(x36, x14, 0x3ffffff, &x38);
+	{ u32 x41; u8/*bool*/ x42 = subborrow_u25(x39, x16, 0x1ffffff, &x41);
+	{ u32 x44; u8/*bool*/ x45 = subborrow_u26(x42, x18, 0x3ffffff, &x44);
+	{ u32 x47; u8/*bool*/ x48 = subborrow_u25(x45, x17, 0x1ffffff, &x47);
+	{ u32 x49 = cmovznz32(x48, 0x0, 0xffffffff);
+	{ u32 x50 = (x49 & 0x3ffffed);
+	{ u32 x52; u8/*bool*/ x53 = addcarryx_u26(0x0, x20, x50, &x52);
+	{ u32 x54 = (x49 & 0x1ffffff);
+	{ u32 x56; u8/*bool*/ x57 = addcarryx_u25(x53, x23, x54, &x56);
+	{ u32 x58 = (x49 & 0x3ffffff);
+	{ u32 x60; u8/*bool*/ x61 = addcarryx_u26(x57, x26, x58, &x60);
+	{ u32 x62 = (x49 & 0x1ffffff);
+	{ u32 x64; u8/*bool*/ x65 = addcarryx_u25(x61, x29, x62, &x64);
+	{ u32 x66 = (x49 & 0x3ffffff);
+	{ u32 x68; u8/*bool*/ x69 = addcarryx_u26(x65, x32, x66, &x68);
+	{ u32 x70 = (x49 & 0x1ffffff);
+	{ u32 x72; u8/*bool*/ x73 = addcarryx_u25(x69, x35, x70, &x72);
+	{ u32 x74 = (x49 & 0x3ffffff);
+	{ u32 x76; u8/*bool*/ x77 = addcarryx_u26(x73, x38, x74, &x76);
+	{ u32 x78 = (x49 & 0x1ffffff);
+	{ u32 x80; u8/*bool*/ x81 = addcarryx_u25(x77, x41, x78, &x80);
+	{ u32 x82 = (x49 & 0x3ffffff);
+	{ u32 x84; u8/*bool*/ x85 = addcarryx_u26(x81, x44, x82, &x84);
+	{ u32 x86 = (x49 & 0x1ffffff);
+	{ u32 x88; addcarryx_u25(x85, x47, x86, &x88);
+	out[0] = x52;
+	out[1] = x56;
+	out[2] = x60;
+	out[3] = x64;
+	out[4] = x68;
+	out[5] = x72;
+	out[6] = x76;
+	out[7] = x80;
+	out[8] = x84;
+	out[9] = x88;
+	}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}
+}
+
+static __always_inline void fe_tobytes(u8 s[32], const fe *f)
+{
+	u32 h[10];
+	fe_freeze(h, f->v);
+	s[0] = h[0] >> 0;
+	s[1] = h[0] >> 8;
+	s[2] = h[0] >> 16;
+	s[3] = (h[0] >> 24) | (h[1] << 2);
+	s[4] = h[1] >> 6;
+	s[5] = h[1] >> 14;
+	s[6] = (h[1] >> 22) | (h[2] << 3);
+	s[7] = h[2] >> 5;
+	s[8] = h[2] >> 13;
+	s[9] = (h[2] >> 21) | (h[3] << 5);
+	s[10] = h[3] >> 3;
+	s[11] = h[3] >> 11;
+	s[12] = (h[3] >> 19) | (h[4] << 6);
+	s[13] = h[4] >> 2;
+	s[14] = h[4] >> 10;
+	s[15] = h[4] >> 18;
+	s[16] = h[5] >> 0;
+	s[17] = h[5] >> 8;
+	s[18] = h[5] >> 16;
+	s[19] = (h[5] >> 24) | (h[6] << 1);
+	s[20] = h[6] >> 7;
+	s[21] = h[6] >> 15;
+	s[22] = (h[6] >> 23) | (h[7] << 3);
+	s[23] = h[7] >> 5;
+	s[24] = h[7] >> 13;
+	s[25] = (h[7] >> 21) | (h[8] << 4);
+	s[26] = h[8] >> 4;
+	s[27] = h[8] >> 12;
+	s[28] = (h[8] >> 20) | (h[9] << 6);
+	s[29] = h[9] >> 2;
+	s[30] = h[9] >> 10;
+	s[31] = h[9] >> 18;
+}
+
+/* h = f */
+static __always_inline void fe_copy(fe *h, const fe *f)
+{
+	memmove(h, f, sizeof(u32) * 10);
+}
+
+static __always_inline void fe_copy_lt(fe_loose *h, const fe *f)
+{
+	memmove(h, f, sizeof(u32) * 10);
+}
+
+/* h = 0 */
+static __always_inline void fe_0(fe *h)
+{
+	memset(h, 0, sizeof(u32) * 10);
+}
+
+/* h = 1 */
+static __always_inline void fe_1(fe *h)
+{
+	memset(h, 0, sizeof(u32) * 10);
+	h->v[0] = 1;
+}
+
+static void fe_add_impl(u32 out[10], const u32 in1[10], const u32 in2[10])
+{
+	{ const u32 x20 = in1[9];
+	{ const u32 x21 = in1[8];
+	{ const u32 x19 = in1[7];
+	{ const u32 x17 = in1[6];
+	{ const u32 x15 = in1[5];
+	{ const u32 x13 = in1[4];
+	{ const u32 x11 = in1[3];
+	{ const u32 x9 = in1[2];
+	{ const u32 x7 = in1[1];
+	{ const u32 x5 = in1[0];
+	{ const u32 x38 = in2[9];
+	{ const u32 x39 = in2[8];
+	{ const u32 x37 = in2[7];
+	{ const u32 x35 = in2[6];
+	{ const u32 x33 = in2[5];
+	{ const u32 x31 = in2[4];
+	{ const u32 x29 = in2[3];
+	{ const u32 x27 = in2[2];
+	{ const u32 x25 = in2[1];
+	{ const u32 x23 = in2[0];
+	out[0] = (x5 + x23);
+	out[1] = (x7 + x25);
+	out[2] = (x9 + x27);
+	out[3] = (x11 + x29);
+	out[4] = (x13 + x31);
+	out[5] = (x15 + x33);
+	out[6] = (x17 + x35);
+	out[7] = (x19 + x37);
+	out[8] = (x21 + x39);
+	out[9] = (x20 + x38);
+	}}}}}}}}}}}}}}}}}}}}
+}
+
+/* h = f + g
+ * Can overlap h with f or g.
+ */
+static __always_inline void fe_add(fe_loose *h, const fe *f, const fe *g)
+{
+	fe_add_impl(h->v, f->v, g->v);
+}
+
+static void fe_sub_impl(u32 out[10], const u32 in1[10], const u32 in2[10])
+{
+	{ const u32 x20 = in1[9];
+	{ const u32 x21 = in1[8];
+	{ const u32 x19 = in1[7];
+	{ const u32 x17 = in1[6];
+	{ const u32 x15 = in1[5];
+	{ const u32 x13 = in1[4];
+	{ const u32 x11 = in1[3];
+	{ const u32 x9 = in1[2];
+	{ const u32 x7 = in1[1];
+	{ const u32 x5 = in1[0];
+	{ const u32 x38 = in2[9];
+	{ const u32 x39 = in2[8];
+	{ const u32 x37 = in2[7];
+	{ const u32 x35 = in2[6];
+	{ const u32 x33 = in2[5];
+	{ const u32 x31 = in2[4];
+	{ const u32 x29 = in2[3];
+	{ const u32 x27 = in2[2];
+	{ const u32 x25 = in2[1];
+	{ const u32 x23 = in2[0];
+	out[0] = ((0x7ffffda + x5) - x23);
+	out[1] = ((0x3fffffe + x7) - x25);
+	out[2] = ((0x7fffffe + x9) - x27);
+	out[3] = ((0x3fffffe + x11) - x29);
+	out[4] = ((0x7fffffe + x13) - x31);
+	out[5] = ((0x3fffffe + x15) - x33);
+	out[6] = ((0x7fffffe + x17) - x35);
+	out[7] = ((0x3fffffe + x19) - x37);
+	out[8] = ((0x7fffffe + x21) - x39);
+	out[9] = ((0x3fffffe + x20) - x38);
+	}}}}}}}}}}}}}}}}}}}}
+}
+
+/* h = f - g
+ * Can overlap h with f or g.
+ */
+static __always_inline void fe_sub(fe_loose *h, const fe *f, const fe *g)
+{
+	fe_sub_impl(h->v, f->v, g->v);
+}
+
+static void fe_mul_impl(u32 out[10], const u32 in1[10], const u32 in2[10])
+{
+	{ const u32 x20 = in1[9];
+	{ const u32 x21 = in1[8];
+	{ const u32 x19 = in1[7];
+	{ const u32 x17 = in1[6];
+	{ const u32 x15 = in1[5];
+	{ const u32 x13 = in1[4];
+	{ const u32 x11 = in1[3];
+	{ const u32 x9 = in1[2];
+	{ const u32 x7 = in1[1];
+	{ const u32 x5 = in1[0];
+	{ const u32 x38 = in2[9];
+	{ const u32 x39 = in2[8];
+	{ const u32 x37 = in2[7];
+	{ const u32 x35 = in2[6];
+	{ const u32 x33 = in2[5];
+	{ const u32 x31 = in2[4];
+	{ const u32 x29 = in2[3];
+	{ const u32 x27 = in2[2];
+	{ const u32 x25 = in2[1];
+	{ const u32 x23 = in2[0];
+	{ u64 x40 = ((u64)x23 * x5);
+	{ u64 x41 = (((u64)x23 * x7) + ((u64)x25 * x5));
+	{ u64 x42 = ((((u64)(0x2 * x25) * x7) + ((u64)x23 * x9)) + ((u64)x27 * x5));
+	{ u64 x43 = (((((u64)x25 * x9) + ((u64)x27 * x7)) + ((u64)x23 * x11)) + ((u64)x29 * x5));
+	{ u64 x44 = (((((u64)x27 * x9) + (0x2 * (((u64)x25 * x11) + ((u64)x29 * x7)))) + ((u64)x23 * x13)) + ((u64)x31 * x5));
+	{ u64 x45 = (((((((u64)x27 * x11) + ((u64)x29 * x9)) + ((u64)x25 * x13)) + ((u64)x31 * x7)) + ((u64)x23 * x15)) + ((u64)x33 * x5));
+	{ u64 x46 = (((((0x2 * ((((u64)x29 * x11) + ((u64)x25 * x15)) + ((u64)x33 * x7))) + ((u64)x27 * x13)) + ((u64)x31 * x9)) + ((u64)x23 * x17)) + ((u64)x35 * x5));
+	{ u64 x47 = (((((((((u64)x29 * x13) + ((u64)x31 * x11)) + ((u64)x27 * x15)) + ((u64)x33 * x9)) + ((u64)x25 * x17)) + ((u64)x35 * x7)) + ((u64)x23 * x19)) + ((u64)x37 * x5));
+	{ u64 x48 = (((((((u64)x31 * x13) + (0x2 * (((((u64)x29 * x15) + ((u64)x33 * x11)) + ((u64)x25 * x19)) + ((u64)x37 * x7)))) + ((u64)x27 * x17)) + ((u64)x35 * x9)) + ((u64)x23 * x21)) + ((u64)x39 * x5));
+	{ u64 x49 = (((((((((((u64)x31 * x15) + ((u64)x33 * x13)) + ((u64)x29 * x17)) + ((u64)x35 * x11)) + ((u64)x27 * x19)) + ((u64)x37 * x9)) + ((u64)x25 * x21)) + ((u64)x39 * x7)) + ((u64)x23 * x20)) + ((u64)x38 * x5));
+	{ u64 x50 = (((((0x2 * ((((((u64)x33 * x15) + ((u64)x29 * x19)) + ((u64)x37 * x11)) + ((u64)x25 * x20)) + ((u64)x38 * x7))) + ((u64)x31 * x17)) + ((u64)x35 * x13)) + ((u64)x27 * x21)) + ((u64)x39 * x9));
+	{ u64 x51 = (((((((((u64)x33 * x17) + ((u64)x35 * x15)) + ((u64)x31 * x19)) + ((u64)x37 * x13)) + ((u64)x29 * x21)) + ((u64)x39 * x11)) + ((u64)x27 * x20)) + ((u64)x38 * x9));
+	{ u64 x52 = (((((u64)x35 * x17) + (0x2 * (((((u64)x33 * x19) + ((u64)x37 * x15)) + ((u64)x29 * x20)) + ((u64)x38 * x11)))) + ((u64)x31 * x21)) + ((u64)x39 * x13));
+	{ u64 x53 = (((((((u64)x35 * x19) + ((u64)x37 * x17)) + ((u64)x33 * x21)) + ((u64)x39 * x15)) + ((u64)x31 * x20)) + ((u64)x38 * x13));
+	{ u64 x54 = (((0x2 * ((((u64)x37 * x19) + ((u64)x33 * x20)) + ((u64)x38 * x15))) + ((u64)x35 * x21)) + ((u64)x39 * x17));
+	{ u64 x55 = (((((u64)x37 * x21) + ((u64)x39 * x19)) + ((u64)x35 * x20)) + ((u64)x38 * x17));
+	{ u64 x56 = (((u64)x39 * x21) + (0x2 * (((u64)x37 * x20) + ((u64)x38 * x19))));
+	{ u64 x57 = (((u64)x39 * x20) + ((u64)x38 * x21));
+	{ u64 x58 = ((u64)(0x2 * x38) * x20);
+	{ u64 x59 = (x48 + (x58 << 0x4));
+	{ u64 x60 = (x59 + (x58 << 0x1));
+	{ u64 x61 = (x60 + x58);
+	{ u64 x62 = (x47 + (x57 << 0x4));
+	{ u64 x63 = (x62 + (x57 << 0x1));
+	{ u64 x64 = (x63 + x57);
+	{ u64 x65 = (x46 + (x56 << 0x4));
+	{ u64 x66 = (x65 + (x56 << 0x1));
+	{ u64 x67 = (x66 + x56);
+	{ u64 x68 = (x45 + (x55 << 0x4));
+	{ u64 x69 = (x68 + (x55 << 0x1));
+	{ u64 x70 = (x69 + x55);
+	{ u64 x71 = (x44 + (x54 << 0x4));
+	{ u64 x72 = (x71 + (x54 << 0x1));
+	{ u64 x73 = (x72 + x54);
+	{ u64 x74 = (x43 + (x53 << 0x4));
+	{ u64 x75 = (x74 + (x53 << 0x1));
+	{ u64 x76 = (x75 + x53);
+	{ u64 x77 = (x42 + (x52 << 0x4));
+	{ u64 x78 = (x77 + (x52 << 0x1));
+	{ u64 x79 = (x78 + x52);
+	{ u64 x80 = (x41 + (x51 << 0x4));
+	{ u64 x81 = (x80 + (x51 << 0x1));
+	{ u64 x82 = (x81 + x51);
+	{ u64 x83 = (x40 + (x50 << 0x4));
+	{ u64 x84 = (x83 + (x50 << 0x1));
+	{ u64 x85 = (x84 + x50);
+	{ u64 x86 = (x85 >> 0x1a);
+	{ u32 x87 = ((u32)x85 & 0x3ffffff);
+	{ u64 x88 = (x86 + x82);
+	{ u64 x89 = (x88 >> 0x19);
+	{ u32 x90 = ((u32)x88 & 0x1ffffff);
+	{ u64 x91 = (x89 + x79);
+	{ u64 x92 = (x91 >> 0x1a);
+	{ u32 x93 = ((u32)x91 & 0x3ffffff);
+	{ u64 x94 = (x92 + x76);
+	{ u64 x95 = (x94 >> 0x19);
+	{ u32 x96 = ((u32)x94 & 0x1ffffff);
+	{ u64 x97 = (x95 + x73);
+	{ u64 x98 = (x97 >> 0x1a);
+	{ u32 x99 = ((u32)x97 & 0x3ffffff);
+	{ u64 x100 = (x98 + x70);
+	{ u64 x101 = (x100 >> 0x19);
+	{ u32 x102 = ((u32)x100 & 0x1ffffff);
+	{ u64 x103 = (x101 + x67);
+	{ u64 x104 = (x103 >> 0x1a);
+	{ u32 x105 = ((u32)x103 & 0x3ffffff);
+	{ u64 x106 = (x104 + x64);
+	{ u64 x107 = (x106 >> 0x19);
+	{ u32 x108 = ((u32)x106 & 0x1ffffff);
+	{ u64 x109 = (x107 + x61);
+	{ u64 x110 = (x109 >> 0x1a);
+	{ u32 x111 = ((u32)x109 & 0x3ffffff);
+	{ u64 x112 = (x110 + x49);
+	{ u64 x113 = (x112 >> 0x19);
+	{ u32 x114 = ((u32)x112 & 0x1ffffff);
+	{ u64 x115 = (x87 + (0x13 * x113));
+	{ u32 x116 = (u32) (x115 >> 0x1a);
+	{ u32 x117 = ((u32)x115 & 0x3ffffff);
+	{ u32 x118 = (x116 + x90);
+	{ u32 x119 = (x118 >> 0x19);
+	{ u32 x120 = (x118 & 0x1ffffff);
+	out[0] = x117;
+	out[1] = x120;
+	out[2] = (x119 + x93);
+	out[3] = x96;
+	out[4] = x99;
+	out[5] = x102;
+	out[6] = x105;
+	out[7] = x108;
+	out[8] = x111;
+	out[9] = x114;
+	}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}
+}
+
+static __always_inline void fe_mul_ttt(fe *h, const fe *f, const fe *g)
+{
+	fe_mul_impl(h->v, f->v, g->v);
+}
+
+static __always_inline void fe_mul_tlt(fe *h, const fe_loose *f, const fe *g)
+{
+	fe_mul_impl(h->v, f->v, g->v);
+}
+
+static __always_inline void fe_mul_tll(fe *h, const fe_loose *f, const fe_loose *g)
+{
+	fe_mul_impl(h->v, f->v, g->v);
+}
+
+static void fe_sqr_impl(u32 out[10], const u32 in1[10])
+{
+	{ const u32 x17 = in1[9];
+	{ const u32 x18 = in1[8];
+	{ const u32 x16 = in1[7];
+	{ const u32 x14 = in1[6];
+	{ const u32 x12 = in1[5];
+	{ const u32 x10 = in1[4];
+	{ const u32 x8 = in1[3];
+	{ const u32 x6 = in1[2];
+	{ const u32 x4 = in1[1];
+	{ const u32 x2 = in1[0];
+	{ u64 x19 = ((u64)x2 * x2);
+	{ u64 x20 = ((u64)(0x2 * x2) * x4);
+	{ u64 x21 = (0x2 * (((u64)x4 * x4) + ((u64)x2 * x6)));
+	{ u64 x22 = (0x2 * (((u64)x4 * x6) + ((u64)x2 * x8)));
+	{ u64 x23 = ((((u64)x6 * x6) + ((u64)(0x4 * x4) * x8)) + ((u64)(0x2 * x2) * x10));
+	{ u64 x24 = (0x2 * ((((u64)x6 * x8) + ((u64)x4 * x10)) + ((u64)x2 * x12)));
+	{ u64 x25 = (0x2 * (((((u64)x8 * x8) + ((u64)x6 * x10)) + ((u64)x2 * x14)) + ((u64)(0x2 * x4) * x12)));
+	{ u64 x26 = (0x2 * (((((u64)x8 * x10) + ((u64)x6 * x12)) + ((u64)x4 * x14)) + ((u64)x2 * x16)));
+	{ u64 x27 = (((u64)x10 * x10) + (0x2 * ((((u64)x6 * x14) + ((u64)x2 * x18)) + (0x2 * (((u64)x4 * x16) + ((u64)x8 * x12))))));
+	{ u64 x28 = (0x2 * ((((((u64)x10 * x12) + ((u64)x8 * x14)) + ((u64)x6 * x16)) + ((u64)x4 * x18)) + ((u64)x2 * x17)));
+	{ u64 x29 = (0x2 * (((((u64)x12 * x12) + ((u64)x10 * x14)) + ((u64)x6 * x18)) + (0x2 * (((u64)x8 * x16) + ((u64)x4 * x17)))));
+	{ u64 x30 = (0x2 * (((((u64)x12 * x14) + ((u64)x10 * x16)) + ((u64)x8 * x18)) + ((u64)x6 * x17)));
+	{ u64 x31 = (((u64)x14 * x14) + (0x2 * (((u64)x10 * x18) + (0x2 * (((u64)x12 * x16) + ((u64)x8 * x17))))));
+	{ u64 x32 = (0x2 * ((((u64)x14 * x16) + ((u64)x12 * x18)) + ((u64)x10 * x17)));
+	{ u64 x33 = (0x2 * ((((u64)x16 * x16) + ((u64)x14 * x18)) + ((u64)(0x2 * x12) * x17)));
+	{ u64 x34 = (0x2 * (((u64)x16 * x18) + ((u64)x14 * x17)));
+	{ u64 x35 = (((u64)x18 * x18) + ((u64)(0x4 * x16) * x17));
+	{ u64 x36 = ((u64)(0x2 * x18) * x17);
+	{ u64 x37 = ((u64)(0x2 * x17) * x17);
+	{ u64 x38 = (x27 + (x37 << 0x4));
+	{ u64 x39 = (x38 + (x37 << 0x1));
+	{ u64 x40 = (x39 + x37);
+	{ u64 x41 = (x26 + (x36 << 0x4));
+	{ u64 x42 = (x41 + (x36 << 0x1));
+	{ u64 x43 = (x42 + x36);
+	{ u64 x44 = (x25 + (x35 << 0x4));
+	{ u64 x45 = (x44 + (x35 << 0x1));
+	{ u64 x46 = (x45 + x35);
+	{ u64 x47 = (x24 + (x34 << 0x4));
+	{ u64 x48 = (x47 + (x34 << 0x1));
+	{ u64 x49 = (x48 + x34);
+	{ u64 x50 = (x23 + (x33 << 0x4));
+	{ u64 x51 = (x50 + (x33 << 0x1));
+	{ u64 x52 = (x51 + x33);
+	{ u64 x53 = (x22 + (x32 << 0x4));
+	{ u64 x54 = (x53 + (x32 << 0x1));
+	{ u64 x55 = (x54 + x32);
+	{ u64 x56 = (x21 + (x31 << 0x4));
+	{ u64 x57 = (x56 + (x31 << 0x1));
+	{ u64 x58 = (x57 + x31);
+	{ u64 x59 = (x20 + (x30 << 0x4));
+	{ u64 x60 = (x59 + (x30 << 0x1));
+	{ u64 x61 = (x60 + x30);
+	{ u64 x62 = (x19 + (x29 << 0x4));
+	{ u64 x63 = (x62 + (x29 << 0x1));
+	{ u64 x64 = (x63 + x29);
+	{ u64 x65 = (x64 >> 0x1a);
+	{ u32 x66 = ((u32)x64 & 0x3ffffff);
+	{ u64 x67 = (x65 + x61);
+	{ u64 x68 = (x67 >> 0x19);
+	{ u32 x69 = ((u32)x67 & 0x1ffffff);
+	{ u64 x70 = (x68 + x58);
+	{ u64 x71 = (x70 >> 0x1a);
+	{ u32 x72 = ((u32)x70 & 0x3ffffff);
+	{ u64 x73 = (x71 + x55);
+	{ u64 x74 = (x73 >> 0x19);
+	{ u32 x75 = ((u32)x73 & 0x1ffffff);
+	{ u64 x76 = (x74 + x52);
+	{ u64 x77 = (x76 >> 0x1a);
+	{ u32 x78 = ((u32)x76 & 0x3ffffff);
+	{ u64 x79 = (x77 + x49);
+	{ u64 x80 = (x79 >> 0x19);
+	{ u32 x81 = ((u32)x79 & 0x1ffffff);
+	{ u64 x82 = (x80 + x46);
+	{ u64 x83 = (x82 >> 0x1a);
+	{ u32 x84 = ((u32)x82 & 0x3ffffff);
+	{ u64 x85 = (x83 + x43);
+	{ u64 x86 = (x85 >> 0x19);
+	{ u32 x87 = ((u32)x85 & 0x1ffffff);
+	{ u64 x88 = (x86 + x40);
+	{ u64 x89 = (x88 >> 0x1a);
+	{ u32 x90 = ((u32)x88 & 0x3ffffff);
+	{ u64 x91 = (x89 + x28);
+	{ u64 x92 = (x91 >> 0x19);
+	{ u32 x93 = ((u32)x91 & 0x1ffffff);
+	{ u64 x94 = (x66 + (0x13 * x92));
+	{ u32 x95 = (u32) (x94 >> 0x1a);
+	{ u32 x96 = ((u32)x94 & 0x3ffffff);
+	{ u32 x97 = (x95 + x69);
+	{ u32 x98 = (x97 >> 0x19);
+	{ u32 x99 = (x97 & 0x1ffffff);
+	out[0] = x96;
+	out[1] = x99;
+	out[2] = (x98 + x72);
+	out[3] = x75;
+	out[4] = x78;
+	out[5] = x81;
+	out[6] = x84;
+	out[7] = x87;
+	out[8] = x90;
+	out[9] = x93;
+	}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}
+}
+
+static __always_inline void fe_sq_tl(fe *h, const fe_loose *f)
+{
+	fe_sqr_impl(h->v, f->v);
+}
+
+static __always_inline void fe_sq_tt(fe *h, const fe *f)
+{
+	fe_sqr_impl(h->v, f->v);
+}
+
+static __always_inline void fe_loose_invert(fe *out, const fe_loose *z)
+{
+	fe t0;
+	fe t1;
+	fe t2;
+	fe t3;
+	int i;
+
+	fe_sq_tl(&t0, z);
+	fe_sq_tt(&t1, &t0);
+	for (i = 1; i < 2; ++i)
+		fe_sq_tt(&t1, &t1);
+	fe_mul_tlt(&t1, z, &t1);
+	fe_mul_ttt(&t0, &t0, &t1);
+	fe_sq_tt(&t2, &t0);
+	fe_mul_ttt(&t1, &t1, &t2);
+	fe_sq_tt(&t2, &t1);
+	for (i = 1; i < 5; ++i)
+		fe_sq_tt(&t2, &t2);
+	fe_mul_ttt(&t1, &t2, &t1);
+	fe_sq_tt(&t2, &t1);
+	for (i = 1; i < 10; ++i)
+		fe_sq_tt(&t2, &t2);
+	fe_mul_ttt(&t2, &t2, &t1);
+	fe_sq_tt(&t3, &t2);
+	for (i = 1; i < 20; ++i)
+		fe_sq_tt(&t3, &t3);
+	fe_mul_ttt(&t2, &t3, &t2);
+	fe_sq_tt(&t2, &t2);
+	for (i = 1; i < 10; ++i)
+		fe_sq_tt(&t2, &t2);
+	fe_mul_ttt(&t1, &t2, &t1);
+	fe_sq_tt(&t2, &t1);
+	for (i = 1; i < 50; ++i)
+		fe_sq_tt(&t2, &t2);
+	fe_mul_ttt(&t2, &t2, &t1);
+	fe_sq_tt(&t3, &t2);
+	for (i = 1; i < 100; ++i)
+		fe_sq_tt(&t3, &t3);
+	fe_mul_ttt(&t2, &t3, &t2);
+	fe_sq_tt(&t2, &t2);
+	for (i = 1; i < 50; ++i)
+		fe_sq_tt(&t2, &t2);
+	fe_mul_ttt(&t1, &t2, &t1);
+	fe_sq_tt(&t1, &t1);
+	for (i = 1; i < 5; ++i)
+		fe_sq_tt(&t1, &t1);
+	fe_mul_ttt(out, &t1, &t0);
+}
+
+static __always_inline void fe_invert(fe *out, const fe *z)
+{
+	fe_loose l;
+	fe_copy_lt(&l, z);
+	fe_loose_invert(out, &l);
+}
+
+/* Replace (f,g) with (g,f) if b == 1;
+ * replace (f,g) with (f,g) if b == 0.
+ *
+ * Preconditions: b in {0,1}
+ */
+static __always_inline void fe_cswap(fe *f, fe *g, unsigned int b)
+{
+	unsigned i;
+	b = 0-b;
+	for (i = 0; i < 10; i++) {
+		u32 x = f->v[i] ^ g->v[i];
+		x &= b;
+		f->v[i] ^= x;
+		g->v[i] ^= x;
+	}
+}
+
+/* NOTE: based on fiat-crypto fe_mul, edited for in2=121666, 0, 0.*/
+static __always_inline void fe_mul_121666_impl(u32 out[10], const u32 in1[10])
+{
+	{ const u32 x20 = in1[9];
+	{ const u32 x21 = in1[8];
+	{ const u32 x19 = in1[7];
+	{ const u32 x17 = in1[6];
+	{ const u32 x15 = in1[5];
+	{ const u32 x13 = in1[4];
+	{ const u32 x11 = in1[3];
+	{ const u32 x9 = in1[2];
+	{ const u32 x7 = in1[1];
+	{ const u32 x5 = in1[0];
+	{ const u32 x38 = 0;
+	{ const u32 x39 = 0;
+	{ const u32 x37 = 0;
+	{ const u32 x35 = 0;
+	{ const u32 x33 = 0;
+	{ const u32 x31 = 0;
+	{ const u32 x29 = 0;
+	{ const u32 x27 = 0;
+	{ const u32 x25 = 0;
+	{ const u32 x23 = 121666;
+	{ u64 x40 = ((u64)x23 * x5);
+	{ u64 x41 = (((u64)x23 * x7) + ((u64)x25 * x5));
+	{ u64 x42 = ((((u64)(0x2 * x25) * x7) + ((u64)x23 * x9)) + ((u64)x27 * x5));
+	{ u64 x43 = (((((u64)x25 * x9) + ((u64)x27 * x7)) + ((u64)x23 * x11)) + ((u64)x29 * x5));
+	{ u64 x44 = (((((u64)x27 * x9) + (0x2 * (((u64)x25 * x11) + ((u64)x29 * x7)))) + ((u64)x23 * x13)) + ((u64)x31 * x5));
+	{ u64 x45 = (((((((u64)x27 * x11) + ((u64)x29 * x9)) + ((u64)x25 * x13)) + ((u64)x31 * x7)) + ((u64)x23 * x15)) + ((u64)x33 * x5));
+	{ u64 x46 = (((((0x2 * ((((u64)x29 * x11) + ((u64)x25 * x15)) + ((u64)x33 * x7))) + ((u64)x27 * x13)) + ((u64)x31 * x9)) + ((u64)x23 * x17)) + ((u64)x35 * x5));
+	{ u64 x47 = (((((((((u64)x29 * x13) + ((u64)x31 * x11)) + ((u64)x27 * x15)) + ((u64)x33 * x9)) + ((u64)x25 * x17)) + ((u64)x35 * x7)) + ((u64)x23 * x19)) + ((u64)x37 * x5));
+	{ u64 x48 = (((((((u64)x31 * x13) + (0x2 * (((((u64)x29 * x15) + ((u64)x33 * x11)) + ((u64)x25 * x19)) + ((u64)x37 * x7)))) + ((u64)x27 * x17)) + ((u64)x35 * x9)) + ((u64)x23 * x21)) + ((u64)x39 * x5));
+	{ u64 x49 = (((((((((((u64)x31 * x15) + ((u64)x33 * x13)) + ((u64)x29 * x17)) + ((u64)x35 * x11)) + ((u64)x27 * x19)) + ((u64)x37 * x9)) + ((u64)x25 * x21)) + ((u64)x39 * x7)) + ((u64)x23 * x20)) + ((u64)x38 * x5));
+	{ u64 x50 = (((((0x2 * ((((((u64)x33 * x15) + ((u64)x29 * x19)) + ((u64)x37 * x11)) + ((u64)x25 * x20)) + ((u64)x38 * x7))) + ((u64)x31 * x17)) + ((u64)x35 * x13)) + ((u64)x27 * x21)) + ((u64)x39 * x9));
+	{ u64 x51 = (((((((((u64)x33 * x17) + ((u64)x35 * x15)) + ((u64)x31 * x19)) + ((u64)x37 * x13)) + ((u64)x29 * x21)) + ((u64)x39 * x11)) + ((u64)x27 * x20)) + ((u64)x38 * x9));
+	{ u64 x52 = (((((u64)x35 * x17) + (0x2 * (((((u64)x33 * x19) + ((u64)x37 * x15)) + ((u64)x29 * x20)) + ((u64)x38 * x11)))) + ((u64)x31 * x21)) + ((u64)x39 * x13));
+	{ u64 x53 = (((((((u64)x35 * x19) + ((u64)x37 * x17)) + ((u64)x33 * x21)) + ((u64)x39 * x15)) + ((u64)x31 * x20)) + ((u64)x38 * x13));
+	{ u64 x54 = (((0x2 * ((((u64)x37 * x19) + ((u64)x33 * x20)) + ((u64)x38 * x15))) + ((u64)x35 * x21)) + ((u64)x39 * x17));
+	{ u64 x55 = (((((u64)x37 * x21) + ((u64)x39 * x19)) + ((u64)x35 * x20)) + ((u64)x38 * x17));
+	{ u64 x56 = (((u64)x39 * x21) + (0x2 * (((u64)x37 * x20) + ((u64)x38 * x19))));
+	{ u64 x57 = (((u64)x39 * x20) + ((u64)x38 * x21));
+	{ u64 x58 = ((u64)(0x2 * x38) * x20);
+	{ u64 x59 = (x48 + (x58 << 0x4));
+	{ u64 x60 = (x59 + (x58 << 0x1));
+	{ u64 x61 = (x60 + x58);
+	{ u64 x62 = (x47 + (x57 << 0x4));
+	{ u64 x63 = (x62 + (x57 << 0x1));
+	{ u64 x64 = (x63 + x57);
+	{ u64 x65 = (x46 + (x56 << 0x4));
+	{ u64 x66 = (x65 + (x56 << 0x1));
+	{ u64 x67 = (x66 + x56);
+	{ u64 x68 = (x45 + (x55 << 0x4));
+	{ u64 x69 = (x68 + (x55 << 0x1));
+	{ u64 x70 = (x69 + x55);
+	{ u64 x71 = (x44 + (x54 << 0x4));
+	{ u64 x72 = (x71 + (x54 << 0x1));
+	{ u64 x73 = (x72 + x54);
+	{ u64 x74 = (x43 + (x53 << 0x4));
+	{ u64 x75 = (x74 + (x53 << 0x1));
+	{ u64 x76 = (x75 + x53);
+	{ u64 x77 = (x42 + (x52 << 0x4));
+	{ u64 x78 = (x77 + (x52 << 0x1));
+	{ u64 x79 = (x78 + x52);
+	{ u64 x80 = (x41 + (x51 << 0x4));
+	{ u64 x81 = (x80 + (x51 << 0x1));
+	{ u64 x82 = (x81 + x51);
+	{ u64 x83 = (x40 + (x50 << 0x4));
+	{ u64 x84 = (x83 + (x50 << 0x1));
+	{ u64 x85 = (x84 + x50);
+	{ u64 x86 = (x85 >> 0x1a);
+	{ u32 x87 = ((u32)x85 & 0x3ffffff);
+	{ u64 x88 = (x86 + x82);
+	{ u64 x89 = (x88 >> 0x19);
+	{ u32 x90 = ((u32)x88 & 0x1ffffff);
+	{ u64 x91 = (x89 + x79);
+	{ u64 x92 = (x91 >> 0x1a);
+	{ u32 x93 = ((u32)x91 & 0x3ffffff);
+	{ u64 x94 = (x92 + x76);
+	{ u64 x95 = (x94 >> 0x19);
+	{ u32 x96 = ((u32)x94 & 0x1ffffff);
+	{ u64 x97 = (x95 + x73);
+	{ u64 x98 = (x97 >> 0x1a);
+	{ u32 x99 = ((u32)x97 & 0x3ffffff);
+	{ u64 x100 = (x98 + x70);
+	{ u64 x101 = (x100 >> 0x19);
+	{ u32 x102 = ((u32)x100 & 0x1ffffff);
+	{ u64 x103 = (x101 + x67);
+	{ u64 x104 = (x103 >> 0x1a);
+	{ u32 x105 = ((u32)x103 & 0x3ffffff);
+	{ u64 x106 = (x104 + x64);
+	{ u64 x107 = (x106 >> 0x19);
+	{ u32 x108 = ((u32)x106 & 0x1ffffff);
+	{ u64 x109 = (x107 + x61);
+	{ u64 x110 = (x109 >> 0x1a);
+	{ u32 x111 = ((u32)x109 & 0x3ffffff);
+	{ u64 x112 = (x110 + x49);
+	{ u64 x113 = (x112 >> 0x19);
+	{ u32 x114 = ((u32)x112 & 0x1ffffff);
+	{ u64 x115 = (x87 + (0x13 * x113));
+	{ u32 x116 = (u32) (x115 >> 0x1a);
+	{ u32 x117 = ((u32)x115 & 0x3ffffff);
+	{ u32 x118 = (x116 + x90);
+	{ u32 x119 = (x118 >> 0x19);
+	{ u32 x120 = (x118 & 0x1ffffff);
+	out[0] = x117;
+	out[1] = x120;
+	out[2] = (x119 + x93);
+	out[3] = x96;
+	out[4] = x99;
+	out[5] = x102;
+	out[6] = x105;
+	out[7] = x108;
+	out[8] = x111;
+	out[9] = x114;
+	}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}
+}
+
+static __always_inline void fe_mul121666(fe *h, const fe_loose *f)
+{
+	fe_mul_121666_impl(h->v, f->v);
+}
+
+static void curve25519_generic(u8 out[CURVE25519_POINT_SIZE], const u8 scalar[CURVE25519_POINT_SIZE], const u8 point[CURVE25519_POINT_SIZE])
+{
+	fe x1, x2, z2, x3, z3, tmp0, tmp1;
+	fe_loose x2l, z2l, x3l, tmp0l, tmp1l;
+	unsigned swap = 0;
+	int pos;
+	u8 e[32];
+
+	memcpy(e, scalar, 32);
+	normalize_secret(e);
+
+	/* The following implementation was transcribed to Coq and proven to
+	 * correspond to unary scalar multiplication in affine coordinates given that
+	 * x1 != 0 is the x coordinate of some point on the curve. It was also checked
+	 * in Coq that doing a ladderstep with x1 = x3 = 0 gives z2' = z3' = 0, and z2
+	 * = z3 = 0 gives z2' = z3' = 0. The statement was quantified over the
+	 * underlying field, so it applies to Curve25519 itself and the quadratic
+	 * twist of Curve25519. It was not proven in Coq that prime-field arithmetic
+	 * correctly simulates extension-field arithmetic on prime-field values.
+	 * The decoding of the byte array representation of e was not considered.
+	 * Specification of Montgomery curves in affine coordinates:
+	 * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Spec/MontgomeryCurve.v#L27>
+	 * Proof that these form a group that is isomorphic to a Weierstrass curve:
+	 * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/AffineProofs.v#L35>
+	 * Coq transcription and correctness proof of the loop (where scalarbits=255):
+	 * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZ.v#L118>
+	 * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZProofs.v#L278>
+	 * preconditions: 0 <= e < 2^255 (not necessarily e < order), fe_invert(0) = 0
+	 */
+	fe_frombytes(&x1, point);
+	fe_1(&x2);
+	fe_0(&z2);
+	fe_copy(&x3, &x1);
+	fe_1(&z3);
+
+	for (pos = 254; pos >= 0; --pos) {
+		/* loop invariant as of right before the test, for the case where x1 != 0:
+		 *   pos >= -1; if z2 = 0 then x2 is nonzero; if z3 = 0 then x3 is nonzero
+		 *   let r := e >> (pos+1) in the following equalities of projective points:
+		 *   to_xz (r*P)     === if swap then (x3, z3) else (x2, z2)
+		 *   to_xz ((r+1)*P) === if swap then (x2, z2) else (x3, z3)
+		 *   x1 is the nonzero x coordinate of the nonzero point (r*P-(r+1)*P)
+		 */
+		unsigned b = 1 & (e[pos / 8] >> (pos & 7));
+		swap ^= b;
+		fe_cswap(&x2, &x3, swap);
+		fe_cswap(&z2, &z3, swap);
+		swap = b;
+		/* Coq transcription of ladderstep formula (called from transcribed loop):
+		 * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZ.v#L89>
+		 * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZProofs.v#L131>
+		 * x1 != 0 <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZProofs.v#L217>
+		 * x1  = 0 <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZProofs.v#L147>
+		 */
+		fe_sub(&tmp0l, &x3, &z3);
+		fe_sub(&tmp1l, &x2, &z2);
+		fe_add(&x2l, &x2, &z2);
+		fe_add(&z2l, &x3, &z3);
+		fe_mul_tll(&z3, &tmp0l, &x2l);
+		fe_mul_tll(&z2, &z2l, &tmp1l);
+		fe_sq_tl(&tmp0, &tmp1l);
+		fe_sq_tl(&tmp1, &x2l);
+		fe_add(&x3l, &z3, &z2);
+		fe_sub(&z2l, &z3, &z2);
+		fe_mul_ttt(&x2, &tmp1, &tmp0);
+		fe_sub(&tmp1l, &tmp1, &tmp0);
+		fe_sq_tl(&z2, &z2l);
+		fe_mul121666(&z3, &tmp1l);
+		fe_sq_tl(&x3, &x3l);
+		fe_add(&tmp0l, &tmp0, &z3);
+		fe_mul_ttt(&z3, &x1, &z2);
+		fe_mul_tll(&z2, &tmp1l, &tmp0l);
+	}
+	/* here pos=-1, so r=e, so to_xz (e*P) === if swap then (x3, z3) else (x2, z2) */
+	fe_cswap(&x2, &x3, swap);
+	fe_cswap(&z2, &z3, swap);
+
+	fe_invert(&z2, &z2);
+	fe_mul_ttt(&x2, &x2, &z2);
+	fe_tobytes(out, &x2);
+
+	memzero_explicit(&x1, sizeof(x1));
+	memzero_explicit(&x2, sizeof(x2));
+	memzero_explicit(&z2, sizeof(z2));
+	memzero_explicit(&x3, sizeof(x3));
+	memzero_explicit(&z3, sizeof(z3));
+	memzero_explicit(&tmp0, sizeof(tmp0));
+	memzero_explicit(&tmp1, sizeof(tmp1));
+	memzero_explicit(&x2l, sizeof(x2l));
+	memzero_explicit(&z2l, sizeof(z2l));
+	memzero_explicit(&x3l, sizeof(x3l));
+	memzero_explicit(&tmp0l, sizeof(tmp0l));
+	memzero_explicit(&tmp1l, sizeof(tmp1l));
+	memzero_explicit(&e, sizeof(e));
+}
diff --git a/lib/zinc/curve25519/curve25519-hacl64.h b/lib/zinc/curve25519/curve25519-hacl64.h
new file mode 100644
index 000000000000..d2637ac4459c
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-hacl64.h
@@ -0,0 +1,744 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2016-2017 INRIA and Microsoft Corporation.
+ * Copyright (C) 2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * This is a machine-generated formally verified implementation of curve25519 DH from:
+ * https://github.com/mitls/hacl-star
+ */
+
+typedef __uint128_t u128;
+static __always_inline u64 u64_eq_mask(u64 x, u64 y)
+{
+	x ^= y;
+	x |= -x;
+	return (x >> 63) - 1;
+}
+
+static __always_inline u64 u64_gte_mask(u64 x, u64 y)
+{
+	return ((x ^ ((x ^ y) | ((x - y) ^ y))) >> 63) - 1;
+}
+
+static __always_inline void modulo_carry_top(u64 *b)
+{
+	u64 b4 = b[4];
+	u64 b0 = b[0];
+	u64 b4_ = b4 & 0x7ffffffffffffLLU;
+	u64 b0_ = b0 + 19 * (b4 >> 51);
+	b[4] = b4_;
+	b[0] = b0_;
+}
+
+static __always_inline void fproduct_copy_from_wide_(u64 *output, u128 *input)
+{
+	{
+		u128 xi = input[0];
+		output[0] = ((u64)(xi));
+	}
+	{
+		u128 xi = input[1];
+		output[1] = ((u64)(xi));
+	}
+	{
+		u128 xi = input[2];
+		output[2] = ((u64)(xi));
+	}
+	{
+		u128 xi = input[3];
+		output[3] = ((u64)(xi));
+	}
+	{
+		u128 xi = input[4];
+		output[4] = ((u64)(xi));
+	}
+}
+
+static __always_inline void fproduct_sum_scalar_multiplication_(u128 *output, u64 *input, u64 s)
+{
+	output[0] += (u128)input[0] * s;
+	output[1] += (u128)input[1] * s;
+	output[2] += (u128)input[2] * s;
+	output[3] += (u128)input[3] * s;
+	output[4] += (u128)input[4] * s;
+}
+
+static __always_inline void fproduct_carry_wide_(u128 *tmp)
+{
+	{
+		u32 ctr = 0;
+		u128 tctr = tmp[ctr];
+		u128 tctrp1 = tmp[ctr + 1];
+		u64 r0 = ((u64)(tctr)) & 0x7ffffffffffffLLU;
+		u128 c = ((tctr) >> (51));
+		tmp[ctr] = ((u128)(r0));
+		tmp[ctr + 1] = ((tctrp1) + (c));
+	}
+	{
+		u32 ctr = 1;
+		u128 tctr = tmp[ctr];
+		u128 tctrp1 = tmp[ctr + 1];
+		u64 r0 = ((u64)(tctr)) & 0x7ffffffffffffLLU;
+		u128 c = ((tctr) >> (51));
+		tmp[ctr] = ((u128)(r0));
+		tmp[ctr + 1] = ((tctrp1) + (c));
+	}
+
+	{
+		u32 ctr = 2;
+		u128 tctr = tmp[ctr];
+		u128 tctrp1 = tmp[ctr + 1];
+		u64 r0 = ((u64)(tctr)) & 0x7ffffffffffffLLU;
+		u128 c = ((tctr) >> (51));
+		tmp[ctr] = ((u128)(r0));
+		tmp[ctr + 1] = ((tctrp1) + (c));
+	}
+	{
+		u32 ctr = 3;
+		u128 tctr = tmp[ctr];
+		u128 tctrp1 = tmp[ctr + 1];
+		u64 r0 = ((u64)(tctr)) & 0x7ffffffffffffLLU;
+		u128 c = ((tctr) >> (51));
+		tmp[ctr] = ((u128)(r0));
+		tmp[ctr + 1] = ((tctrp1) + (c));
+	}
+}
+
+static __always_inline void fmul_shift_reduce(u64 *output)
+{
+	u64 tmp = output[4];
+	u64 b0;
+	{
+		u32 ctr = 5 - 0 - 1;
+		u64 z = output[ctr - 1];
+		output[ctr] = z;
+	}
+	{
+		u32 ctr = 5 - 1 - 1;
+		u64 z = output[ctr - 1];
+		output[ctr] = z;
+	}
+	{
+		u32 ctr = 5 - 2 - 1;
+		u64 z = output[ctr - 1];
+		output[ctr] = z;
+	}
+	{
+		u32 ctr = 5 - 3 - 1;
+		u64 z = output[ctr - 1];
+		output[ctr] = z;
+	}
+	output[0] = tmp;
+	b0 = output[0];
+	output[0] = 19 * b0;
+}
+
+static __always_inline void fmul_mul_shift_reduce_(u128 *output, u64 *input, u64 *input21)
+{
+	u32 i;
+	u64 input2i;
+	{
+		u64 input2i = input21[0];
+		fproduct_sum_scalar_multiplication_(output, input, input2i);
+		fmul_shift_reduce(input);
+	}
+	{
+		u64 input2i = input21[1];
+		fproduct_sum_scalar_multiplication_(output, input, input2i);
+		fmul_shift_reduce(input);
+	}
+	{
+		u64 input2i = input21[2];
+		fproduct_sum_scalar_multiplication_(output, input, input2i);
+		fmul_shift_reduce(input);
+	}
+	{
+		u64 input2i = input21[3];
+		fproduct_sum_scalar_multiplication_(output, input, input2i);
+		fmul_shift_reduce(input);
+	}
+	i = 4;
+	input2i = input21[i];
+	fproduct_sum_scalar_multiplication_(output, input, input2i);
+}
+
+static __always_inline void fmul_fmul(u64 *output, u64 *input, u64 *input21)
+{
+	u64 tmp[5];
+	memcpy(tmp, input, 5 * sizeof(*input));
+	{
+		u128 b4;
+		u128 b0;
+		u128 b4_;
+		u128 b0_;
+		u64 i0;
+		u64 i1;
+		u64 i0_;
+		u64 i1_;
+		u128 t[5] = { 0 };
+		fmul_mul_shift_reduce_(t, tmp, input21);
+		fproduct_carry_wide_(t);
+		b4 = t[4];
+		b0 = t[0];
+		b4_ = ((b4) & (((u128)(0x7ffffffffffffLLU))));
+		b0_ = ((b0) + (((u128)(19) * (((u64)(((b4) >> (51))))))));
+		t[4] = b4_;
+		t[0] = b0_;
+		fproduct_copy_from_wide_(output, t);
+		i0 = output[0];
+		i1 = output[1];
+		i0_ = i0 & 0x7ffffffffffffLLU;
+		i1_ = i1 + (i0 >> 51);
+		output[0] = i0_;
+		output[1] = i1_;
+	}
+}
+
+static __always_inline void fsquare_fsquare__(u128 *tmp, u64 *output)
+{
+	u64 r0 = output[0];
+	u64 r1 = output[1];
+	u64 r2 = output[2];
+	u64 r3 = output[3];
+	u64 r4 = output[4];
+	u64 d0 = r0 * 2;
+	u64 d1 = r1 * 2;
+	u64 d2 = r2 * 2 * 19;
+	u64 d419 = r4 * 19;
+	u64 d4 = d419 * 2;
+	u128 s0 = ((((((u128)(r0) * (r0))) + (((u128)(d4) * (r1))))) + (((u128)(d2) * (r3))));
+	u128 s1 = ((((((u128)(d0) * (r1))) + (((u128)(d4) * (r2))))) + (((u128)(r3 * 19) * (r3))));
+	u128 s2 = ((((((u128)(d0) * (r2))) + (((u128)(r1) * (r1))))) + (((u128)(d4) * (r3))));
+	u128 s3 = ((((((u128)(d0) * (r3))) + (((u128)(d1) * (r2))))) + (((u128)(r4) * (d419))));
+	u128 s4 = ((((((u128)(d0) * (r4))) + (((u128)(d1) * (r3))))) + (((u128)(r2) * (r2))));
+	tmp[0] = s0;
+	tmp[1] = s1;
+	tmp[2] = s2;
+	tmp[3] = s3;
+	tmp[4] = s4;
+}
+
+static __always_inline void fsquare_fsquare_(u128 *tmp, u64 *output)
+{
+	u128 b4;
+	u128 b0;
+	u128 b4_;
+	u128 b0_;
+	u64 i0;
+	u64 i1;
+	u64 i0_;
+	u64 i1_;
+	fsquare_fsquare__(tmp, output);
+	fproduct_carry_wide_(tmp);
+	b4 = tmp[4];
+	b0 = tmp[0];
+	b4_ = ((b4) & (((u128)(0x7ffffffffffffLLU))));
+	b0_ = ((b0) + (((u128)(19) * (((u64)(((b4) >> (51))))))));
+	tmp[4] = b4_;
+	tmp[0] = b0_;
+	fproduct_copy_from_wide_(output, tmp);
+	i0 = output[0];
+	i1 = output[1];
+	i0_ = i0 & 0x7ffffffffffffLLU;
+	i1_ = i1 + (i0 >> 51);
+	output[0] = i0_;
+	output[1] = i1_;
+}
+
+static __always_inline void fsquare_fsquare_times_(u64 *output, u128 *tmp, u32 count1)
+{
+	u32 i;
+	fsquare_fsquare_(tmp, output);
+	for (i = 1; i < count1; ++i)
+		fsquare_fsquare_(tmp, output);
+}
+
+static __always_inline void fsquare_fsquare_times(u64 *output, u64 *input, u32 count1)
+{
+	u128 t[5];
+	memcpy(output, input, 5 * sizeof(*input));
+	fsquare_fsquare_times_(output, t, count1);
+}
+
+static __always_inline void fsquare_fsquare_times_inplace(u64 *output, u32 count1)
+{
+	u128 t[5];
+	fsquare_fsquare_times_(output, t, count1);
+}
+
+static __always_inline void crecip_crecip(u64 *out, u64 *z)
+{
+	u64 buf[20] = { 0 };
+	u64 *a0 = buf;
+	u64 *t00 = buf + 5;
+	u64 *b0 = buf + 10;
+	u64 *t01;
+	u64 *b1;
+	u64 *c0;
+	u64 *a;
+	u64 *t0;
+	u64 *b;
+	u64 *c;
+	fsquare_fsquare_times(a0, z, 1);
+	fsquare_fsquare_times(t00, a0, 2);
+	fmul_fmul(b0, t00, z);
+	fmul_fmul(a0, b0, a0);
+	fsquare_fsquare_times(t00, a0, 1);
+	fmul_fmul(b0, t00, b0);
+	fsquare_fsquare_times(t00, b0, 5);
+	t01 = buf + 5;
+	b1 = buf + 10;
+	c0 = buf + 15;
+	fmul_fmul(b1, t01, b1);
+	fsquare_fsquare_times(t01, b1, 10);
+	fmul_fmul(c0, t01, b1);
+	fsquare_fsquare_times(t01, c0, 20);
+	fmul_fmul(t01, t01, c0);
+	fsquare_fsquare_times_inplace(t01, 10);
+	fmul_fmul(b1, t01, b1);
+	fsquare_fsquare_times(t01, b1, 50);
+	a = buf;
+	t0 = buf + 5;
+	b = buf + 10;
+	c = buf + 15;
+	fmul_fmul(c, t0, b);
+	fsquare_fsquare_times(t0, c, 100);
+	fmul_fmul(t0, t0, c);
+	fsquare_fsquare_times_inplace(t0, 50);
+	fmul_fmul(t0, t0, b);
+	fsquare_fsquare_times_inplace(t0, 5);
+	fmul_fmul(out, t0, a);
+}
+
+static __always_inline void fsum(u64 *a, u64 *b)
+{
+	a[0] += b[0];
+	a[1] += b[1];
+	a[2] += b[2];
+	a[3] += b[3];
+	a[4] += b[4];
+}
+
+static __always_inline void fdifference(u64 *a, u64 *b)
+{
+	u64 tmp[5] = { 0 };
+	u64 b0;
+	u64 b1;
+	u64 b2;
+	u64 b3;
+	u64 b4;
+	memcpy(tmp, b, 5 * sizeof(*b));
+	b0 = tmp[0];
+	b1 = tmp[1];
+	b2 = tmp[2];
+	b3 = tmp[3];
+	b4 = tmp[4];
+	tmp[0] = b0 + 0x3fffffffffff68LLU;
+	tmp[1] = b1 + 0x3ffffffffffff8LLU;
+	tmp[2] = b2 + 0x3ffffffffffff8LLU;
+	tmp[3] = b3 + 0x3ffffffffffff8LLU;
+	tmp[4] = b4 + 0x3ffffffffffff8LLU;
+	{
+		u64 xi = a[0];
+		u64 yi = tmp[0];
+		a[0] = yi - xi;
+	}
+	{
+		u64 xi = a[1];
+		u64 yi = tmp[1];
+		a[1] = yi - xi;
+	}
+	{
+		u64 xi = a[2];
+		u64 yi = tmp[2];
+		a[2] = yi - xi;
+	}
+	{
+		u64 xi = a[3];
+		u64 yi = tmp[3];
+		a[3] = yi - xi;
+	}
+	{
+		u64 xi = a[4];
+		u64 yi = tmp[4];
+		a[4] = yi - xi;
+	}
+}
+
+static __always_inline void fscalar(u64 *output, u64 *b, u64 s)
+{
+	u128 tmp[5];
+	u128 b4;
+	u128 b0;
+	u128 b4_;
+	u128 b0_;
+	{
+		u64 xi = b[0];
+		tmp[0] = ((u128)(xi) * (s));
+	}
+	{
+		u64 xi = b[1];
+		tmp[1] = ((u128)(xi) * (s));
+	}
+	{
+		u64 xi = b[2];
+		tmp[2] = ((u128)(xi) * (s));
+	}
+	{
+		u64 xi = b[3];
+		tmp[3] = ((u128)(xi) * (s));
+	}
+	{
+		u64 xi = b[4];
+		tmp[4] = ((u128)(xi) * (s));
+	}
+	fproduct_carry_wide_(tmp);
+	b4 = tmp[4];
+	b0 = tmp[0];
+	b4_ = ((b4) & (((u128)(0x7ffffffffffffLLU))));
+	b0_ = ((b0) + (((u128)(19) * (((u64)(((b4) >> (51))))))));
+	tmp[4] = b4_;
+	tmp[0] = b0_;
+	fproduct_copy_from_wide_(output, tmp);
+}
+
+static __always_inline void fmul(u64 *output, u64 *a, u64 *b)
+{
+	fmul_fmul(output, a, b);
+}
+
+static __always_inline void crecip(u64 *output, u64 *input)
+{
+	crecip_crecip(output, input);
+}
+
+static __always_inline void point_swap_conditional_step(u64 *a, u64 *b, u64 swap1, u32 ctr)
+{
+	u32 i = ctr - 1;
+	u64 ai = a[i];
+	u64 bi = b[i];
+	u64 x = swap1 & (ai ^ bi);
+	u64 ai1 = ai ^ x;
+	u64 bi1 = bi ^ x;
+	a[i] = ai1;
+	b[i] = bi1;
+}
+
+static __always_inline void point_swap_conditional5(u64 *a, u64 *b, u64 swap1)
+{
+	point_swap_conditional_step(a, b, swap1, 5);
+	point_swap_conditional_step(a, b, swap1, 4);
+	point_swap_conditional_step(a, b, swap1, 3);
+	point_swap_conditional_step(a, b, swap1, 2);
+	point_swap_conditional_step(a, b, swap1, 1);
+}
+
+static __always_inline void point_swap_conditional(u64 *a, u64 *b, u64 iswap)
+{
+	u64 swap1 = 0 - iswap;
+	point_swap_conditional5(a, b, swap1);
+	point_swap_conditional5(a + 5, b + 5, swap1);
+}
+
+static __always_inline void point_copy(u64 *output, u64 *input)
+{
+	memcpy(output, input, 5 * sizeof(*input));
+	memcpy(output + 5, input + 5, 5 * sizeof(*input));
+}
+
+static __always_inline void addanddouble_fmonty(u64 *pp, u64 *ppq, u64 *p, u64 *pq, u64 *qmqp)
+{
+	u64 *qx = qmqp;
+	u64 *x2 = pp;
+	u64 *z2 = pp + 5;
+	u64 *x3 = ppq;
+	u64 *z3 = ppq + 5;
+	u64 *x = p;
+	u64 *z = p + 5;
+	u64 *xprime = pq;
+	u64 *zprime = pq + 5;
+	u64 buf[40] = { 0 };
+	u64 *origx = buf;
+	u64 *origxprime0 = buf + 5;
+	u64 *xxprime0;
+	u64 *zzprime0;
+	u64 *origxprime;
+	xxprime0 = buf + 25;
+	zzprime0 = buf + 30;
+	memcpy(origx, x, 5 * sizeof(*x));
+	fsum(x, z);
+	fdifference(z, origx);
+	memcpy(origxprime0, xprime, 5 * sizeof(*xprime));
+	fsum(xprime, zprime);
+	fdifference(zprime, origxprime0);
+	fmul(xxprime0, xprime, z);
+	fmul(zzprime0, x, zprime);
+	origxprime = buf + 5;
+	{
+		u64 *xx0;
+		u64 *zz0;
+		u64 *xxprime;
+		u64 *zzprime;
+		u64 *zzzprime;
+		xx0 = buf + 15;
+		zz0 = buf + 20;
+		xxprime = buf + 25;
+		zzprime = buf + 30;
+		zzzprime = buf + 35;
+		memcpy(origxprime, xxprime, 5 * sizeof(*xxprime));
+		fsum(xxprime, zzprime);
+		fdifference(zzprime, origxprime);
+		fsquare_fsquare_times(x3, xxprime, 1);
+		fsquare_fsquare_times(zzzprime, zzprime, 1);
+		fmul(z3, zzzprime, qx);
+		fsquare_fsquare_times(xx0, x, 1);
+		fsquare_fsquare_times(zz0, z, 1);
+		{
+			u64 *zzz;
+			u64 *xx;
+			u64 *zz;
+			u64 scalar;
+			zzz = buf + 10;
+			xx = buf + 15;
+			zz = buf + 20;
+			fmul(x2, xx, zz);
+			fdifference(zz, xx);
+			scalar = 121665;
+			fscalar(zzz, zz, scalar);
+			fsum(zzz, xx);
+			fmul(z2, zzz, zz);
+		}
+	}
+}
+
+static __always_inline void ladder_smallloop_cmult_small_loop_step(u64 *nq, u64 *nqpq, u64 *nq2, u64 *nqpq2, u64 *q, u8 byt)
+{
+	u64 bit0 = (u64)(byt >> 7);
+	u64 bit;
+	point_swap_conditional(nq, nqpq, bit0);
+	addanddouble_fmonty(nq2, nqpq2, nq, nqpq, q);
+	bit = (u64)(byt >> 7);
+	point_swap_conditional(nq2, nqpq2, bit);
+}
+
+static __always_inline void ladder_smallloop_cmult_small_loop_double_step(u64 *nq, u64 *nqpq, u64 *nq2, u64 *nqpq2, u64 *q, u8 byt)
+{
+	u8 byt1;
+	ladder_smallloop_cmult_small_loop_step(nq, nqpq, nq2, nqpq2, q, byt);
+	byt1 = byt << 1;
+	ladder_smallloop_cmult_small_loop_step(nq2, nqpq2, nq, nqpq, q, byt1);
+}
+
+static __always_inline void ladder_smallloop_cmult_small_loop(u64 *nq, u64 *nqpq, u64 *nq2, u64 *nqpq2, u64 *q, u8 byt, u32 i)
+{
+	while (i--) {
+		ladder_smallloop_cmult_small_loop_double_step(nq, nqpq, nq2, nqpq2, q, byt);
+		byt <<= 2;
+	}
+}
+
+static __always_inline void ladder_bigloop_cmult_big_loop(u8 *n1, u64 *nq, u64 *nqpq, u64 *nq2, u64 *nqpq2, u64 *q, u32 i)
+{
+	while (i--) {
+		u8 byte = n1[i];
+		ladder_smallloop_cmult_small_loop(nq, nqpq, nq2, nqpq2, q, byte, 4);
+	}
+}
+
+static __always_inline void ladder_cmult(u64 *result, u8 *n1, u64 *q)
+{
+	u64 point_buf[40] = { 0 };
+	u64 *nq = point_buf;
+	u64 *nqpq = point_buf + 10;
+	u64 *nq2 = point_buf + 20;
+	u64 *nqpq2 = point_buf + 30;
+	point_copy(nqpq, q);
+	nq[0] = 1;
+	ladder_bigloop_cmult_big_loop(n1, nq, nqpq, nq2, nqpq2, q, 32);
+	point_copy(result, nq);
+}
+
+static __always_inline void format_fexpand(u64 *output, const u8 *input)
+{
+	const u8 *x00 = input + 6;
+	const u8 *x01 = input + 12;
+	const u8 *x02 = input + 19;
+	const u8 *x0 = input + 24;
+	u64 i0, i1, i2, i3, i4, output0, output1, output2, output3, output4;
+	i0 = le64_to_cpup((__force __le64 *)input);
+	i1 = le64_to_cpup((__force __le64 *)x00);
+	i2 = le64_to_cpup((__force __le64 *)x01);
+	i3 = le64_to_cpup((__force __le64 *)x02);
+	i4 = le64_to_cpup((__force __le64 *)x0);
+	output0 = i0 & 0x7ffffffffffffLLU;
+	output1 = i1 >> 3 & 0x7ffffffffffffLLU;
+	output2 = i2 >> 6 & 0x7ffffffffffffLLU;
+	output3 = i3 >> 1 & 0x7ffffffffffffLLU;
+	output4 = i4 >> 12 & 0x7ffffffffffffLLU;
+	output[0] = output0;
+	output[1] = output1;
+	output[2] = output2;
+	output[3] = output3;
+	output[4] = output4;
+}
+
+static __always_inline void format_fcontract_first_carry_pass(u64 *input)
+{
+	u64 t0 = input[0];
+	u64 t1 = input[1];
+	u64 t2 = input[2];
+	u64 t3 = input[3];
+	u64 t4 = input[4];
+	u64 t1_ = t1 + (t0 >> 51);
+	u64 t0_ = t0 & 0x7ffffffffffffLLU;
+	u64 t2_ = t2 + (t1_ >> 51);
+	u64 t1__ = t1_ & 0x7ffffffffffffLLU;
+	u64 t3_ = t3 + (t2_ >> 51);
+	u64 t2__ = t2_ & 0x7ffffffffffffLLU;
+	u64 t4_ = t4 + (t3_ >> 51);
+	u64 t3__ = t3_ & 0x7ffffffffffffLLU;
+	input[0] = t0_;
+	input[1] = t1__;
+	input[2] = t2__;
+	input[3] = t3__;
+	input[4] = t4_;
+}
+
+static __always_inline void format_fcontract_first_carry_full(u64 *input)
+{
+	format_fcontract_first_carry_pass(input);
+	modulo_carry_top(input);
+}
+
+static __always_inline void format_fcontract_second_carry_pass(u64 *input)
+{
+	u64 t0 = input[0];
+	u64 t1 = input[1];
+	u64 t2 = input[2];
+	u64 t3 = input[3];
+	u64 t4 = input[4];
+	u64 t1_ = t1 + (t0 >> 51);
+	u64 t0_ = t0 & 0x7ffffffffffffLLU;
+	u64 t2_ = t2 + (t1_ >> 51);
+	u64 t1__ = t1_ & 0x7ffffffffffffLLU;
+	u64 t3_ = t3 + (t2_ >> 51);
+	u64 t2__ = t2_ & 0x7ffffffffffffLLU;
+	u64 t4_ = t4 + (t3_ >> 51);
+	u64 t3__ = t3_ & 0x7ffffffffffffLLU;
+	input[0] = t0_;
+	input[1] = t1__;
+	input[2] = t2__;
+	input[3] = t3__;
+	input[4] = t4_;
+}
+
+static __always_inline void format_fcontract_second_carry_full(u64 *input)
+{
+	u64 i0;
+	u64 i1;
+	u64 i0_;
+	u64 i1_;
+	format_fcontract_second_carry_pass(input);
+	modulo_carry_top(input);
+	i0 = input[0];
+	i1 = input[1];
+	i0_ = i0 & 0x7ffffffffffffLLU;
+	i1_ = i1 + (i0 >> 51);
+	input[0] = i0_;
+	input[1] = i1_;
+}
+
+static __always_inline void format_fcontract_trim(u64 *input)
+{
+	u64 a0 = input[0];
+	u64 a1 = input[1];
+	u64 a2 = input[2];
+	u64 a3 = input[3];
+	u64 a4 = input[4];
+	u64 mask0 = u64_gte_mask(a0, 0x7ffffffffffedLLU);
+	u64 mask1 = u64_eq_mask(a1, 0x7ffffffffffffLLU);
+	u64 mask2 = u64_eq_mask(a2, 0x7ffffffffffffLLU);
+	u64 mask3 = u64_eq_mask(a3, 0x7ffffffffffffLLU);
+	u64 mask4 = u64_eq_mask(a4, 0x7ffffffffffffLLU);
+	u64 mask = (((mask0 & mask1) & mask2) & mask3) & mask4;
+	u64 a0_ = a0 - (0x7ffffffffffedLLU & mask);
+	u64 a1_ = a1 - (0x7ffffffffffffLLU & mask);
+	u64 a2_ = a2 - (0x7ffffffffffffLLU & mask);
+	u64 a3_ = a3 - (0x7ffffffffffffLLU & mask);
+	u64 a4_ = a4 - (0x7ffffffffffffLLU & mask);
+	input[0] = a0_;
+	input[1] = a1_;
+	input[2] = a2_;
+	input[3] = a3_;
+	input[4] = a4_;
+}
+
+static __always_inline void format_fcontract_store(u8 *output, u64 *input)
+{
+	u64 t0 = input[0];
+	u64 t1 = input[1];
+	u64 t2 = input[2];
+	u64 t3 = input[3];
+	u64 t4 = input[4];
+	u64 o0 = t1 << 51 | t0;
+	u64 o1 = t2 << 38 | t1 >> 13;
+	u64 o2 = t3 << 25 | t2 >> 26;
+	u64 o3 = t4 << 12 | t3 >> 39;
+	u8 *b0 = output;
+	u8 *b1 = output + 8;
+	u8 *b2 = output + 16;
+	u8 *b3 = output + 24;
+	*(__force __le64 *)b0 = cpu_to_le64(o0);
+	*(__force __le64 *)b1 = cpu_to_le64(o1);
+	*(__force __le64 *)b2 = cpu_to_le64(o2);
+	*(__force __le64 *)b3 = cpu_to_le64(o3);
+}
+
+static __always_inline void format_fcontract(u8 *output, u64 *input)
+{
+	format_fcontract_first_carry_full(input);
+	format_fcontract_second_carry_full(input);
+	format_fcontract_trim(input);
+	format_fcontract_store(output, input);
+}
+
+static __always_inline void format_scalar_of_point(u8 *scalar, u64 *point)
+{
+	u64 *x = point;
+	u64 *z = point + 5;
+	u64 buf[10] __aligned(32) = { 0 };
+	u64 *zmone = buf;
+	u64 *sc = buf + 5;
+	crecip(zmone, z);
+	fmul(sc, x, zmone);
+	format_fcontract(scalar, sc);
+}
+
+static void curve25519_generic(u8 mypublic[CURVE25519_POINT_SIZE], const u8 secret[CURVE25519_POINT_SIZE], const u8 basepoint[CURVE25519_POINT_SIZE])
+{
+	u64 buf0[10] __aligned(32) = { 0 };
+	u64 *x0 = buf0;
+	u64 *z = buf0 + 5;
+	u64 *q;
+	format_fexpand(x0, basepoint);
+	z[0] = 1;
+	q = buf0;
+	{
+		u8 e[32] __aligned(32) = { 0 };
+		u8 *scalar;
+		memcpy(e, secret, 32);
+		normalize_secret(e);
+		scalar = e;
+		{
+			u64 buf[15] = { 0 };
+			u64 *nq = buf;
+			u64 *x = nq;
+			x[0] = 1;
+			ladder_cmult(nq, scalar, q);
+			format_scalar_of_point(mypublic, nq);
+			memzero_explicit(buf, sizeof(buf));
+		}
+		memzero_explicit(e, sizeof(e));
+	}
+	memzero_explicit(buf0, sizeof(buf0));
+}
diff --git a/lib/zinc/curve25519/curve25519.c b/lib/zinc/curve25519/curve25519.c
new file mode 100644
index 000000000000..4780c1721016
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519.c
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <zinc/curve25519.h>
+
+#include <linux/version.h>
+#include <linux/string.h>
+#include <linux/random.h>
+#include <crypto/algapi.h>
+
+#ifndef HAVE_CURVE25519_ARCH_IMPLEMENTATION
+void __init curve25519_fpu_init(void) { }
+static inline bool curve25519_arch(u8 mypublic[CURVE25519_POINT_SIZE], const u8 secret[CURVE25519_POINT_SIZE], const u8 basepoint[CURVE25519_POINT_SIZE]) { return false; }
+static inline bool curve25519_base_arch(u8 pub[CURVE25519_POINT_SIZE], const u8 secret[CURVE25519_POINT_SIZE]) { return false; }
+#endif
+
+static __always_inline void normalize_secret(u8 secret[CURVE25519_POINT_SIZE])
+{
+	secret[0] &= 248;
+	secret[31] &= 127;
+	secret[31] |= 64;
+}
+
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+#include "curve25519-hacl64.h"
+#else
+#include "curve25519-fiat32.h"
+#endif
+
+static const u8 null_point[CURVE25519_POINT_SIZE] = { 0 };
+
+bool curve25519(u8 mypublic[CURVE25519_POINT_SIZE], const u8 secret[CURVE25519_POINT_SIZE], const u8 basepoint[CURVE25519_POINT_SIZE])
+{
+	if (!curve25519_arch(mypublic, secret, basepoint))
+		curve25519_generic(mypublic, secret, basepoint);
+	return crypto_memneq(mypublic, null_point, CURVE25519_POINT_SIZE);
+}
+EXPORT_SYMBOL(curve25519);
+
+bool curve25519_generate_public(u8 pub[CURVE25519_POINT_SIZE], const u8 secret[CURVE25519_POINT_SIZE])
+{
+	static const u8 basepoint[CURVE25519_POINT_SIZE] __aligned(32) = { 9 };
+
+	if (unlikely(!crypto_memneq(secret, null_point, CURVE25519_POINT_SIZE)))
+		return false;
+
+	if (curve25519_base_arch(pub, secret))
+		return crypto_memneq(pub, null_point, CURVE25519_POINT_SIZE);
+	return curve25519(pub, secret, basepoint);
+}
+EXPORT_SYMBOL(curve25519_generate_public);
+
+void curve25519_generate_secret(u8 secret[CURVE25519_POINT_SIZE])
+{
+	get_random_bytes_wait(secret, CURVE25519_POINT_SIZE);
+	normalize_secret(secret);
+}
+EXPORT_SYMBOL(curve25519_generate_secret);
+
+#include "../selftest/curve25519.h"
diff --git a/lib/zinc/main.c b/lib/zinc/main.c
index db4704257586..a4767ea656d6 100644
--- a/lib/zinc/main.c
+++ b/lib/zinc/main.c
@@ -7,6 +7,7 @@
 #include <zinc/chacha20.h>
 #include <zinc/poly1305.h>
 #include <zinc/blake2s.h>
+#include <zinc/curve25519.h>
 
 #include <linux/init.h>
 #include <linux/module.h>
@@ -35,6 +36,10 @@ static int __init mod_init(void)
 #ifdef CONFIG_ZINC_BLAKE2S
 	blake2s_fpu_init();
 	selftest(blake2s);
+#endif
+#ifdef CONFIG_ZINC_CURVE25519
+	curve25519_fpu_init();
+	selftest(curve25519);
 #endif
 	return 0;
 }
diff --git a/lib/zinc/selftest/curve25519.h b/lib/zinc/selftest/curve25519.h
new file mode 100644
index 000000000000..5eeba4ace423
--- /dev/null
+++ b/lib/zinc/selftest/curve25519.h
@@ -0,0 +1,607 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifdef CONFIG_ZINC_DEBUG
+struct curve25519_test_vector {
+	u8 private[CURVE25519_POINT_SIZE];
+	u8 public[CURVE25519_POINT_SIZE];
+	u8 result[CURVE25519_POINT_SIZE];
+	bool valid;
+};
+static const struct curve25519_test_vector curve25519_test_vectors[] __initconst = {
+	{
+		.private = { 0x77, 0x07, 0x6d, 0x0a, 0x73, 0x18, 0xa5, 0x7d, 0x3c, 0x16, 0xc1, 0x72, 0x51, 0xb2, 0x66, 0x45, 0xdf, 0x4c, 0x2f, 0x87, 0xeb, 0xc0, 0x99, 0x2a, 0xb1, 0x77, 0xfb, 0xa5, 0x1d, 0xb9, 0x2c, 0x2a },
+		.public = { 0xde, 0x9e, 0xdb, 0x7d, 0x7b, 0x7d, 0xc1, 0xb4, 0xd3, 0x5b, 0x61, 0xc2, 0xec, 0xe4, 0x35, 0x37, 0x3f, 0x83, 0x43, 0xc8, 0x5b, 0x78, 0x67, 0x4d, 0xad, 0xfc, 0x7e, 0x14, 0x6f, 0x88, 0x2b, 0x4f },
+		.result = { 0x4a, 0x5d, 0x9d, 0x5b, 0xa4, 0xce, 0x2d, 0xe1, 0x72, 0x8e, 0x3b, 0xf4, 0x80, 0x35, 0x0f, 0x25, 0xe0, 0x7e, 0x21, 0xc9, 0x47, 0xd1, 0x9e, 0x33, 0x76, 0xf0, 0x9b, 0x3c, 0x1e, 0x16, 0x17, 0x42 },
+		.valid = true
+	},
+	{
+		.private = { 0x5d, 0xab, 0x08, 0x7e, 0x62, 0x4a, 0x8a, 0x4b, 0x79, 0xe1, 0x7f, 0x8b, 0x83, 0x80, 0x0e, 0xe6, 0x6f, 0x3b, 0xb1, 0x29, 0x26, 0x18, 0xb6, 0xfd, 0x1c, 0x2f, 0x8b, 0x27, 0xff, 0x88, 0xe0, 0xeb },
+		.public = { 0x85, 0x20, 0xf0, 0x09, 0x89, 0x30, 0xa7, 0x54, 0x74, 0x8b, 0x7d, 0xdc, 0xb4, 0x3e, 0xf7, 0x5a, 0x0d, 0xbf, 0x3a, 0x0d, 0x26, 0x38, 0x1a, 0xf4, 0xeb, 0xa4, 0xa9, 0x8e, 0xaa, 0x9b, 0x4e, 0x6a },
+		.result = { 0x4a, 0x5d, 0x9d, 0x5b, 0xa4, 0xce, 0x2d, 0xe1, 0x72, 0x8e, 0x3b, 0xf4, 0x80, 0x35, 0x0f, 0x25, 0xe0, 0x7e, 0x21, 0xc9, 0x47, 0xd1, 0x9e, 0x33, 0x76, 0xf0, 0x9b, 0x3c, 0x1e, 0x16, 0x17, 0x42 },
+		.valid = true
+	},
+	{
+		.private = { 1 },
+		.public = { 0x25, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.result = { 0x3c, 0x77, 0x77, 0xca, 0xf9, 0x97, 0xb2, 0x64, 0x41, 0x60, 0x77, 0x66, 0x5b, 0x4e, 0x22, 0x9d, 0xb, 0x95, 0x48, 0xdc, 0xc, 0xd8, 0x19, 0x98, 0xdd, 0xcd, 0xc5, 0xc8, 0x53, 0x3c, 0x79, 0x7f },
+		.valid = true
+	},
+	{
+		.private = { 1 },
+		.public = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+		.result = { 0xb3, 0x2d, 0x13, 0x62, 0xc2, 0x48, 0xd6, 0x2f, 0xe6, 0x26, 0x19, 0xcf, 0xf0, 0x4d, 0xd4, 0x3d, 0xb7, 0x3f, 0xfc, 0x1b, 0x63, 0x8, 0xed, 0xe3, 0xb, 0x78, 0xd8, 0x73, 0x80, 0xf1, 0xe8, 0x34 },
+		.valid = true
+	},
+	{
+		.private = { 0xa5, 0x46, 0xe3, 0x6b, 0xf0, 0x52, 0x7c, 0x9d, 0x3b, 0x16, 0x15, 0x4b, 0x82, 0x46, 0x5e, 0xdd, 0x62, 0x14, 0x4c, 0x0a, 0xc1, 0xfc, 0x5a, 0x18, 0x50, 0x6a, 0x22, 0x44, 0xba, 0x44, 0x9a, 0xc4 },
+		.public = { 0xe6, 0xdb, 0x68, 0x67, 0x58, 0x30, 0x30, 0xdb, 0x35, 0x94, 0xc1, 0xa4, 0x24, 0xb1, 0x5f, 0x7c, 0x72, 0x66, 0x24, 0xec, 0x26, 0xb3, 0x35, 0x3b, 0x10, 0xa9, 0x03, 0xa6, 0xd0, 0xab, 0x1c, 0x4c },
+		.result = { 0xc3, 0xda, 0x55, 0x37, 0x9d, 0xe9, 0xc6, 0x90, 0x8e, 0x94, 0xea, 0x4d, 0xf2, 0x8d, 0x08, 0x4f, 0x32, 0xec, 0xcf, 0x03, 0x49, 0x1c, 0x71, 0xf7, 0x54, 0xb4, 0x07, 0x55, 0x77, 0xa2, 0x85, 0x52 },
+		.valid = true
+	},
+	{
+		.private = { 1, 2, 3, 4 },
+		.public = { 0 },
+		.result = { 0 },
+		.valid = false
+	},
+	{
+		.private = { 2, 4, 6, 8 },
+		.public = { 0xe0, 0xeb, 0x7a, 0x7c, 0x3b, 0x41, 0xb8, 0xae, 0x16, 0x56, 0xe3, 0xfa, 0xf1, 0x9f, 0xc4, 0x6a, 0xda, 0x09, 0x8d, 0xeb, 0x9c, 0x32, 0xb1, 0xfd, 0x86, 0x62, 0x05, 0x16, 0x5f, 0x49, 0xb8 },
+		.result = { 0 },
+		.valid = false
+	},
+	{
+		.private = { 0xff, 0xff, 0xff, 0xff, 0x0a, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+		.public = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0a, 0x00, 0xfb, 0x9f },
+		.result = { 0x77, 0x52, 0xb6, 0x18, 0xc1, 0x2d, 0x48, 0xd2, 0xc6, 0x93, 0x46, 0x83, 0x81, 0x7c, 0xc6, 0x57, 0xf3, 0x31, 0x03, 0x19, 0x49, 0x48, 0x20, 0x05, 0x42, 0x2b, 0x4e, 0xae, 0x8d, 0x1d, 0x43, 0x23 },
+		.valid = true
+	},
+	{
+		.private = { 0x8e, 0x0a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.public = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x8e, 0x06 },
+		.result = { 0x5a, 0xdf, 0xaa, 0x25, 0x86, 0x8e, 0x32, 0x3d, 0xae, 0x49, 0x62, 0xc1, 0x01, 0x5c, 0xb3, 0x12, 0xe1, 0xc5, 0xc7, 0x9e, 0x95, 0x3f, 0x03, 0x99, 0xb0, 0xba, 0x16, 0x22, 0xf3, 0xb6, 0xf7, 0x0c },
+		.valid = true
+	},
+	/* wycheproof - normal case */
+	{
+		.private = { 0x48, 0x52, 0x83, 0x4d, 0x9d, 0x6b, 0x77, 0xda, 0xde, 0xab, 0xaa, 0xf2, 0xe1, 0x1d, 0xca, 0x66, 0xd1, 0x9f, 0xe7, 0x49, 0x93, 0xa7, 0xbe, 0xc3, 0x6c, 0x6e, 0x16, 0xa0, 0x98, 0x3f, 0xea, 0xba },
+		.public = { 0x9c, 0x64, 0x7d, 0x9a, 0xe5, 0x89, 0xb9, 0xf5, 0x8f, 0xdc, 0x3c, 0xa4, 0x94, 0x7e, 0xfb, 0xc9, 0x15, 0xc4, 0xb2, 0xe0, 0x8e, 0x74, 0x4a, 0x0e, 0xdf, 0x46, 0x9d, 0xac, 0x59, 0xc8, 0xf8, 0x5a },
+		.result = { 0x87, 0xb7, 0xf2, 0x12, 0xb6, 0x27, 0xf7, 0xa5, 0x4c, 0xa5, 0xe0, 0xbc, 0xda, 0xdd, 0xd5, 0x38, 0x9d, 0x9d, 0xe6, 0x15, 0x6c, 0xdb, 0xcf, 0x8e, 0xbe, 0x14, 0xff, 0xbc, 0xfb, 0x43, 0x65, 0x51 },
+		.valid = true
+	},
+	/* wycheproof - public key on twist */
+	{
+		.private = { 0x58, 0x8c, 0x06, 0x1a, 0x50, 0x80, 0x4a, 0xc4, 0x88, 0xad, 0x77, 0x4a, 0xc7, 0x16, 0xc3, 0xf5, 0xba, 0x71, 0x4b, 0x27, 0x12, 0xe0, 0x48, 0x49, 0x13, 0x79, 0xa5, 0x00, 0x21, 0x19, 0x98, 0xa8 },
+		.public = { 0x63, 0xaa, 0x40, 0xc6, 0xe3, 0x83, 0x46, 0xc5, 0xca, 0xf2, 0x3a, 0x6d, 0xf0, 0xa5, 0xe6, 0xc8, 0x08, 0x89, 0xa0, 0x86, 0x47, 0xe5, 0x51, 0xb3, 0x56, 0x34, 0x49, 0xbe, 0xfc, 0xfc, 0x97, 0x33 },
+		.result = { 0xb1, 0xa7, 0x07, 0x51, 0x94, 0x95, 0xff, 0xff, 0xb2, 0x98, 0xff, 0x94, 0x17, 0x16, 0xb0, 0x6d, 0xfa, 0xb8, 0x7c, 0xf8, 0xd9, 0x11, 0x23, 0xfe, 0x2b, 0xe9, 0xa2, 0x33, 0xdd, 0xa2, 0x22, 0x12 },
+		.valid = true
+	},
+	/* wycheproof - public key on twist */
+	{
+		.private = { 0xb0, 0x5b, 0xfd, 0x32, 0xe5, 0x53, 0x25, 0xd9, 0xfd, 0x64, 0x8c, 0xb3, 0x02, 0x84, 0x80, 0x39, 0x00, 0x0b, 0x39, 0x0e, 0x44, 0xd5, 0x21, 0xe5, 0x8a, 0xab, 0x3b, 0x29, 0xa6, 0x96, 0x0b, 0xa8 },
+		.public = { 0x0f, 0x83, 0xc3, 0x6f, 0xde, 0xd9, 0xd3, 0x2f, 0xad, 0xf4, 0xef, 0xa3, 0xae, 0x93, 0xa9, 0x0b, 0xb5, 0xcf, 0xa6, 0x68, 0x93, 0xbc, 0x41, 0x2c, 0x43, 0xfa, 0x72, 0x87, 0xdb, 0xb9, 0x97, 0x79 },
+		.result = { 0x67, 0xdd, 0x4a, 0x6e, 0x16, 0x55, 0x33, 0x53, 0x4c, 0x0e, 0x3f, 0x17, 0x2e, 0x4a, 0xb8, 0x57, 0x6b, 0xca, 0x92, 0x3a, 0x5f, 0x07, 0xb2, 0xc0, 0x69, 0xb4, 0xc3, 0x10, 0xff, 0x2e, 0x93, 0x5b },
+		.valid = true
+	},
+	/* wycheproof - public key on twist */
+	{
+		.private = { 0x70, 0xe3, 0x4b, 0xcb, 0xe1, 0xf4, 0x7f, 0xbc, 0x0f, 0xdd, 0xfd, 0x7c, 0x1e, 0x1a, 0xa5, 0x3d, 0x57, 0xbf, 0xe0, 0xf6, 0x6d, 0x24, 0x30, 0x67, 0xb4, 0x24, 0xbb, 0x62, 0x10, 0xbe, 0xd1, 0x9c },
+		.public = { 0x0b, 0x82, 0x11, 0xa2, 0xb6, 0x04, 0x90, 0x97, 0xf6, 0x87, 0x1c, 0x6c, 0x05, 0x2d, 0x3c, 0x5f, 0xc1, 0xba, 0x17, 0xda, 0x9e, 0x32, 0xae, 0x45, 0x84, 0x03, 0xb0, 0x5b, 0xb2, 0x83, 0x09, 0x2a },
+		.result = { 0x4a, 0x06, 0x38, 0xcf, 0xaa, 0x9e, 0xf1, 0x93, 0x3b, 0x47, 0xf8, 0x93, 0x92, 0x96, 0xa6, 0xb2, 0x5b, 0xe5, 0x41, 0xef, 0x7f, 0x70, 0xe8, 0x44, 0xc0, 0xbc, 0xc0, 0x0b, 0x13, 0x4d, 0xe6, 0x4a },
+		.valid = true
+	},
+	/* wycheproof - public key on twist */
+	{
+		.private = { 0x68, 0xc1, 0xf3, 0xa6, 0x53, 0xa4, 0xcd, 0xb1, 0xd3, 0x7b, 0xba, 0x94, 0x73, 0x8f, 0x8b, 0x95, 0x7a, 0x57, 0xbe, 0xb2, 0x4d, 0x64, 0x6e, 0x99, 0x4d, 0xc2, 0x9a, 0x27, 0x6a, 0xad, 0x45, 0x8d },
+		.public = { 0x34, 0x3a, 0xc2, 0x0a, 0x3b, 0x9c, 0x6a, 0x27, 0xb1, 0x00, 0x81, 0x76, 0x50, 0x9a, 0xd3, 0x07, 0x35, 0x85, 0x6e, 0xc1, 0xc8, 0xd8, 0xfc, 0xae, 0x13, 0x91, 0x2d, 0x08, 0xd1, 0x52, 0xf4, 0x6c },
+		.result = { 0x39, 0x94, 0x91, 0xfc, 0xe8, 0xdf, 0xab, 0x73, 0xb4, 0xf9, 0xf6, 0x11, 0xde, 0x8e, 0xa0, 0xb2, 0x7b, 0x28, 0xf8, 0x59, 0x94, 0x25, 0x0b, 0x0f, 0x47, 0x5d, 0x58, 0x5d, 0x04, 0x2a, 0xc2, 0x07 },
+		.valid = true
+	},
+	/* wycheproof - public key on twist */
+	{
+		.private = { 0xd8, 0x77, 0xb2, 0x6d, 0x06, 0xdf, 0xf9, 0xd9, 0xf7, 0xfd, 0x4c, 0x5b, 0x37, 0x69, 0xf8, 0xcd, 0xd5, 0xb3, 0x05, 0x16, 0xa5, 0xab, 0x80, 0x6b, 0xe3, 0x24, 0xff, 0x3e, 0xb6, 0x9e, 0xa0, 0xb2 },
+		.public = { 0xfa, 0x69, 0x5f, 0xc7, 0xbe, 0x8d, 0x1b, 0xe5, 0xbf, 0x70, 0x48, 0x98, 0xf3, 0x88, 0xc4, 0x52, 0xba, 0xfd, 0xd3, 0xb8, 0xea, 0xe8, 0x05, 0xf8, 0x68, 0x1a, 0x8d, 0x15, 0xc2, 0xd4, 0xe1, 0x42 },
+		.result = { 0x2c, 0x4f, 0xe1, 0x1d, 0x49, 0x0a, 0x53, 0x86, 0x17, 0x76, 0xb1, 0x3b, 0x43, 0x54, 0xab, 0xd4, 0xcf, 0x5a, 0x97, 0x69, 0x9d, 0xb6, 0xe6, 0xc6, 0x8c, 0x16, 0x26, 0xd0, 0x76, 0x62, 0xf7, 0x58 },
+		.valid = true
+	},
+	/* wycheproof - public key = 0 */
+	{
+		.private = { 0x20, 0x74, 0x94, 0x03, 0x8f, 0x2b, 0xb8, 0x11, 0xd4, 0x78, 0x05, 0xbc, 0xdf, 0x04, 0xa2, 0xac, 0x58, 0x5a, 0xda, 0x7f, 0x2f, 0x23, 0x38, 0x9b, 0xfd, 0x46, 0x58, 0xf9, 0xdd, 0xd4, 0xde, 0xbc },
+		.public = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = false
+	},
+	/* wycheproof - public key = 1 */
+	{
+		.private = { 0x20, 0x2e, 0x89, 0x72, 0xb6, 0x1c, 0x7e, 0x61, 0x93, 0x0e, 0xb9, 0x45, 0x0b, 0x50, 0x70, 0xea, 0xe1, 0xc6, 0x70, 0x47, 0x56, 0x85, 0x54, 0x1f, 0x04, 0x76, 0x21, 0x7e, 0x48, 0x18, 0xcf, 0xab },
+		.public = { 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = false
+	},
+	/* wycheproof - edge case on twist */
+	{
+		.private = { 0x38, 0xdd, 0xe9, 0xf3, 0xe7, 0xb7, 0x99, 0x04, 0x5f, 0x9a, 0xc3, 0x79, 0x3d, 0x4a, 0x92, 0x77, 0xda, 0xde, 0xad, 0xc4, 0x1b, 0xec, 0x02, 0x90, 0xf8, 0x1f, 0x74, 0x4f, 0x73, 0x77, 0x5f, 0x84 },
+		.public = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.result = { 0x9a, 0x2c, 0xfe, 0x84, 0xff, 0x9c, 0x4a, 0x97, 0x39, 0x62, 0x5c, 0xae, 0x4a, 0x3b, 0x82, 0xa9, 0x06, 0x87, 0x7a, 0x44, 0x19, 0x46, 0xf8, 0xd7, 0xb3, 0xd7, 0x95, 0xfe, 0x8f, 0x5d, 0x16, 0x39 },
+		.valid = true
+	},
+	/* wycheproof - edge case on twist */
+	{
+		.private = { 0x98, 0x57, 0xa9, 0x14, 0xe3, 0xc2, 0x90, 0x36, 0xfd, 0x9a, 0x44, 0x2b, 0xa5, 0x26, 0xb5, 0xcd, 0xcd, 0xf2, 0x82, 0x16, 0x15, 0x3e, 0x63, 0x6c, 0x10, 0x67, 0x7a, 0xca, 0xb6, 0xbd, 0x6a, 0xa5 },
+		.public = { 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.result = { 0x4d, 0xa4, 0xe0, 0xaa, 0x07, 0x2c, 0x23, 0x2e, 0xe2, 0xf0, 0xfa, 0x4e, 0x51, 0x9a, 0xe5, 0x0b, 0x52, 0xc1, 0xed, 0xd0, 0x8a, 0x53, 0x4d, 0x4e, 0xf3, 0x46, 0xc2, 0xe1, 0x06, 0xd2, 0x1d, 0x60 },
+		.valid = true
+	},
+	/* wycheproof - edge case on twist */
+	{
+		.private = { 0x48, 0xe2, 0x13, 0x0d, 0x72, 0x33, 0x05, 0xed, 0x05, 0xe6, 0xe5, 0x89, 0x4d, 0x39, 0x8a, 0x5e, 0x33, 0x36, 0x7a, 0x8c, 0x6a, 0xac, 0x8f, 0xcd, 0xf0, 0xa8, 0x8e, 0x4b, 0x42, 0x82, 0x0d, 0xb7 },
+		.public = { 0xff, 0xff, 0xff, 0x03, 0x00, 0x00, 0xf8, 0xff, 0xff, 0x1f, 0x00, 0x00, 0xc0, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0xfe, 0xff, 0xff, 0x07, 0x00, 0x00, 0xf0, 0xff, 0xff, 0x3f, 0x00, 0x00, 0x00 },
+		.result = { 0x9e, 0xd1, 0x0c, 0x53, 0x74, 0x7f, 0x64, 0x7f, 0x82, 0xf4, 0x51, 0x25, 0xd3, 0xde, 0x15, 0xa1, 0xe6, 0xb8, 0x24, 0x49, 0x6a, 0xb4, 0x04, 0x10, 0xff, 0xcc, 0x3c, 0xfe, 0x95, 0x76, 0x0f, 0x3b },
+		.valid = true
+	},
+	/* wycheproof - edge case on twist */
+	{
+		.private = { 0x28, 0xf4, 0x10, 0x11, 0x69, 0x18, 0x51, 0xb3, 0xa6, 0x2b, 0x64, 0x15, 0x53, 0xb3, 0x0d, 0x0d, 0xfd, 0xdc, 0xb8, 0xff, 0xfc, 0xf5, 0x37, 0x00, 0xa7, 0xbe, 0x2f, 0x6a, 0x87, 0x2e, 0x9f, 0xb0 },
+		.public = { 0x00, 0x00, 0x00, 0xfc, 0xff, 0xff, 0x07, 0x00, 0x00, 0xe0, 0xff, 0xff, 0x3f, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0x01, 0x00, 0x00, 0xf8, 0xff, 0xff, 0x0f, 0x00, 0x00, 0xc0, 0xff, 0xff, 0x7f },
+		.result = { 0xcf, 0x72, 0xb4, 0xaa, 0x6a, 0xa1, 0xc9, 0xf8, 0x94, 0xf4, 0x16, 0x5b, 0x86, 0x10, 0x9a, 0xa4, 0x68, 0x51, 0x76, 0x48, 0xe1, 0xf0, 0xcc, 0x70, 0xe1, 0xab, 0x08, 0x46, 0x01, 0x76, 0x50, 0x6b },
+		.valid = true
+	},
+	/* wycheproof - edge case on twist */
+	{
+		.private = { 0x18, 0xa9, 0x3b, 0x64, 0x99, 0xb9, 0xf6, 0xb3, 0x22, 0x5c, 0xa0, 0x2f, 0xef, 0x41, 0x0e, 0x0a, 0xde, 0xc2, 0x35, 0x32, 0x32, 0x1d, 0x2d, 0x8e, 0xf1, 0xa6, 0xd6, 0x02, 0xa8, 0xc6, 0x5b, 0x83 },
+		.public = { 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0x7f },
+		.result = { 0x5d, 0x50, 0xb6, 0x28, 0x36, 0xbb, 0x69, 0x57, 0x94, 0x10, 0x38, 0x6c, 0xf7, 0xbb, 0x81, 0x1c, 0x14, 0xbf, 0x85, 0xb1, 0xc7, 0xb1, 0x7e, 0x59, 0x24, 0xc7, 0xff, 0xea, 0x91, 0xef, 0x9e, 0x12 },
+		.valid = true
+	},
+	/* wycheproof - edge case on twist */
+	{
+		.private = { 0xc0, 0x1d, 0x13, 0x05, 0xa1, 0x33, 0x8a, 0x1f, 0xca, 0xc2, 0xba, 0x7e, 0x2e, 0x03, 0x2b, 0x42, 0x7e, 0x0b, 0x04, 0x90, 0x31, 0x65, 0xac, 0xa9, 0x57, 0xd8, 0xd0, 0x55, 0x3d, 0x87, 0x17, 0xb0 },
+		.public = { 0xea, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+		.result = { 0x19, 0x23, 0x0e, 0xb1, 0x48, 0xd5, 0xd6, 0x7c, 0x3c, 0x22, 0xab, 0x1d, 0xae, 0xff, 0x80, 0xa5, 0x7e, 0xae, 0x42, 0x65, 0xce, 0x28, 0x72, 0x65, 0x7b, 0x2c, 0x80, 0x99, 0xfc, 0x69, 0x8e, 0x50 },
+		.valid = true
+	},
+	/* wycheproof - edge case for public key */
+	{
+		.private = { 0x38, 0x6f, 0x7f, 0x16, 0xc5, 0x07, 0x31, 0xd6, 0x4f, 0x82, 0xe6, 0xa1, 0x70, 0xb1, 0x42, 0xa4, 0xe3, 0x4f, 0x31, 0xfd, 0x77, 0x68, 0xfc, 0xb8, 0x90, 0x29, 0x25, 0xe7, 0xd1, 0xe2, 0x1a, 0xbe },
+		.public = { 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.result = { 0x0f, 0xca, 0xb5, 0xd8, 0x42, 0xa0, 0x78, 0xd7, 0xa7, 0x1f, 0xc5, 0x9b, 0x57, 0xbf, 0xb4, 0xca, 0x0b, 0xe6, 0x87, 0x3b, 0x49, 0xdc, 0xdb, 0x9f, 0x44, 0xe1, 0x4a, 0xe8, 0xfb, 0xdf, 0xa5, 0x42 },
+		.valid = true
+	},
+	/* wycheproof - edge case for public key */
+	{
+		.private = { 0xe0, 0x23, 0xa2, 0x89, 0xbd, 0x5e, 0x90, 0xfa, 0x28, 0x04, 0xdd, 0xc0, 0x19, 0xa0, 0x5e, 0xf3, 0xe7, 0x9d, 0x43, 0x4b, 0xb6, 0xea, 0x2f, 0x52, 0x2e, 0xcb, 0x64, 0x3a, 0x75, 0x29, 0x6e, 0x95 },
+		.public = { 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00 },
+		.result = { 0x54, 0xce, 0x8f, 0x22, 0x75, 0xc0, 0x77, 0xe3, 0xb1, 0x30, 0x6a, 0x39, 0x39, 0xc5, 0xe0, 0x3e, 0xef, 0x6b, 0xbb, 0x88, 0x06, 0x05, 0x44, 0x75, 0x8d, 0x9f, 0xef, 0x59, 0xb0, 0xbc, 0x3e, 0x4f },
+		.valid = true
+	},
+	/* wycheproof - edge case for public key */
+	{
+		.private = { 0x68, 0xf0, 0x10, 0xd6, 0x2e, 0xe8, 0xd9, 0x26, 0x05, 0x3a, 0x36, 0x1c, 0x3a, 0x75, 0xc6, 0xea, 0x4e, 0xbd, 0xc8, 0x60, 0x6a, 0xb2, 0x85, 0x00, 0x3a, 0x6f, 0x8f, 0x40, 0x76, 0xb0, 0x1e, 0x83 },
+		.public = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x03 },
+		.result = { 0xf1, 0x36, 0x77, 0x5c, 0x5b, 0xeb, 0x0a, 0xf8, 0x11, 0x0a, 0xf1, 0x0b, 0x20, 0x37, 0x23, 0x32, 0x04, 0x3c, 0xab, 0x75, 0x24, 0x19, 0x67, 0x87, 0x75, 0xa2, 0x23, 0xdf, 0x57, 0xc9, 0xd3, 0x0d },
+		.valid = true
+	},
+	/* wycheproof - edge case for public key */
+	{
+		.private = { 0x58, 0xeb, 0xcb, 0x35, 0xb0, 0xf8, 0x84, 0x5c, 0xaf, 0x1e, 0xc6, 0x30, 0xf9, 0x65, 0x76, 0xb6, 0x2c, 0x4b, 0x7b, 0x6c, 0x36, 0xb2, 0x9d, 0xeb, 0x2c, 0xb0, 0x08, 0x46, 0x51, 0x75, 0x5c, 0x96 },
+		.public = { 0xff, 0xff, 0xff, 0xfb, 0xff, 0xff, 0xfb, 0xff, 0xff, 0xdf, 0xff, 0xff, 0xdf, 0xff, 0xff, 0xff, 0xfe, 0xff, 0xff, 0xfe, 0xff, 0xff, 0xf7, 0xff, 0xff, 0xf7, 0xff, 0xff, 0xbf, 0xff, 0xff, 0x3f },
+		.result = { 0xbf, 0x9a, 0xff, 0xd0, 0x6b, 0x84, 0x40, 0x85, 0x58, 0x64, 0x60, 0x96, 0x2e, 0xf2, 0x14, 0x6f, 0xf3, 0xd4, 0x53, 0x3d, 0x94, 0x44, 0xaa, 0xb0, 0x06, 0xeb, 0x88, 0xcc, 0x30, 0x54, 0x40, 0x7d },
+		.valid = true
+	},
+	/* wycheproof - edge case for public key */
+	{
+		.private = { 0x18, 0x8c, 0x4b, 0xc5, 0xb9, 0xc4, 0x4b, 0x38, 0xbb, 0x65, 0x8b, 0x9b, 0x2a, 0xe8, 0x2d, 0x5b, 0x01, 0x01, 0x5e, 0x09, 0x31, 0x84, 0xb1, 0x7c, 0xb7, 0x86, 0x35, 0x03, 0xa7, 0x83, 0xe1, 0xbb },
+		.public = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x3f },
+		.result = { 0xd4, 0x80, 0xde, 0x04, 0xf6, 0x99, 0xcb, 0x3b, 0xe0, 0x68, 0x4a, 0x9c, 0xc2, 0xe3, 0x12, 0x81, 0xea, 0x0b, 0xc5, 0xa9, 0xdc, 0xc1, 0x57, 0xd3, 0xd2, 0x01, 0x58, 0xd4, 0x6c, 0xa5, 0x24, 0x6d },
+		.valid = true
+	},
+	/* wycheproof - edge case for public key */
+	{
+		.private = { 0xe0, 0x6c, 0x11, 0xbb, 0x2e, 0x13, 0xce, 0x3d, 0xc7, 0x67, 0x3f, 0x67, 0xf5, 0x48, 0x22, 0x42, 0x90, 0x94, 0x23, 0xa9, 0xae, 0x95, 0xee, 0x98, 0x6a, 0x98, 0x8d, 0x98, 0xfa, 0xee, 0x23, 0xa2 },
+		.public = { 0xff, 0xff, 0xff, 0xff, 0xfe, 0xff, 0xff, 0x7f, 0xff, 0xff, 0xff, 0xff, 0xfe, 0xff, 0xff, 0x7f, 0xff, 0xff, 0xff, 0xff, 0xfe, 0xff, 0xff, 0x7f, 0xff, 0xff, 0xff, 0xff, 0xfe, 0xff, 0xff, 0x7f },
+		.result = { 0x4c, 0x44, 0x01, 0xcc, 0xe6, 0xb5, 0x1e, 0x4c, 0xb1, 0x8f, 0x27, 0x90, 0x24, 0x6c, 0x9b, 0xf9, 0x14, 0xdb, 0x66, 0x77, 0x50, 0xa1, 0xcb, 0x89, 0x06, 0x90, 0x92, 0xaf, 0x07, 0x29, 0x22, 0x76 },
+		.valid = true
+	},
+	/* wycheproof - edge case for public key */
+	{
+		.private = { 0xc0, 0x65, 0x8c, 0x46, 0xdd, 0xe1, 0x81, 0x29, 0x29, 0x38, 0x77, 0x53, 0x5b, 0x11, 0x62, 0xb6, 0xf9, 0xf5, 0x41, 0x4a, 0x23, 0xcf, 0x4d, 0x2c, 0xbc, 0x14, 0x0a, 0x4d, 0x99, 0xda, 0x2b, 0x8f },
+		.public = { 0xeb, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+		.result = { 0x57, 0x8b, 0xa8, 0xcc, 0x2d, 0xbd, 0xc5, 0x75, 0xaf, 0xcf, 0x9d, 0xf2, 0xb3, 0xee, 0x61, 0x89, 0xf5, 0x33, 0x7d, 0x68, 0x54, 0xc7, 0x9b, 0x4c, 0xe1, 0x65, 0xea, 0x12, 0x29, 0x3b, 0x3a, 0x0f },
+		.valid = true
+	},
+	/* wycheproof - public key with low order */
+	{
+		.private = { 0x10, 0x25, 0x5c, 0x92, 0x30, 0xa9, 0x7a, 0x30, 0xa4, 0x58, 0xca, 0x28, 0x4a, 0x62, 0x96, 0x69, 0x29, 0x3a, 0x31, 0x89, 0x0c, 0xda, 0x9d, 0x14, 0x7f, 0xeb, 0xc7, 0xd1, 0xe2, 0x2d, 0x6b, 0xb1 },
+		.public = { 0xe0, 0xeb, 0x7a, 0x7c, 0x3b, 0x41, 0xb8, 0xae, 0x16, 0x56, 0xe3, 0xfa, 0xf1, 0x9f, 0xc4, 0x6a, 0xda, 0x09, 0x8d, 0xeb, 0x9c, 0x32, 0xb1, 0xfd, 0x86, 0x62, 0x05, 0x16, 0x5f, 0x49, 0xb8, 0x00 },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = false
+	},
+	/* wycheproof - public key with low order */
+	{
+		.private = { 0x78, 0xf1, 0xe8, 0xed, 0xf1, 0x44, 0x81, 0xb3, 0x89, 0x44, 0x8d, 0xac, 0x8f, 0x59, 0xc7, 0x0b, 0x03, 0x8e, 0x7c, 0xf9, 0x2e, 0xf2, 0xc7, 0xef, 0xf5, 0x7a, 0x72, 0x46, 0x6e, 0x11, 0x52, 0x96 },
+		.public = { 0x5f, 0x9c, 0x95, 0xbc, 0xa3, 0x50, 0x8c, 0x24, 0xb1, 0xd0, 0xb1, 0x55, 0x9c, 0x83, 0xef, 0x5b, 0x04, 0x44, 0x5c, 0xc4, 0x58, 0x1c, 0x8e, 0x86, 0xd8, 0x22, 0x4e, 0xdd, 0xd0, 0x9f, 0x11, 0x57 },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = false
+	},
+	/* wycheproof - public key with low order */
+	{
+		.private = { 0xa0, 0xa0, 0x5a, 0x3e, 0x8f, 0x9f, 0x44, 0x20, 0x4d, 0x5f, 0x80, 0x59, 0xa9, 0x4a, 0xc7, 0xdf, 0xc3, 0x9a, 0x49, 0xac, 0x01, 0x6d, 0xd7, 0x43, 0xdb, 0xfa, 0x43, 0xc5, 0xd6, 0x71, 0xfd, 0x88 },
+		.public = { 0xec, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = false
+	},
+	/* wycheproof - public key with low order */
+	{
+		.private = { 0xd0, 0xdb, 0xb3, 0xed, 0x19, 0x06, 0x66, 0x3f, 0x15, 0x42, 0x0a, 0xf3, 0x1f, 0x4e, 0xaf, 0x65, 0x09, 0xd9, 0xa9, 0x94, 0x97, 0x23, 0x50, 0x06, 0x05, 0xad, 0x7c, 0x1c, 0x6e, 0x74, 0x50, 0xa9 },
+		.public = { 0xed, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = false
+	},
+	/* wycheproof - public key with low order */
+	{
+		.private = { 0xc0, 0xb1, 0xd0, 0xeb, 0x22, 0xb2, 0x44, 0xfe, 0x32, 0x91, 0x14, 0x00, 0x72, 0xcd, 0xd9, 0xd9, 0x89, 0xb5, 0xf0, 0xec, 0xd9, 0x6c, 0x10, 0x0f, 0xeb, 0x5b, 0xca, 0x24, 0x1c, 0x1d, 0x9f, 0x8f },
+		.public = { 0xee, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = false
+	},
+	/* wycheproof - public key with low order */
+	{
+		.private = { 0x48, 0x0b, 0xf4, 0x5f, 0x59, 0x49, 0x42, 0xa8, 0xbc, 0x0f, 0x33, 0x53, 0xc6, 0xe8, 0xb8, 0x85, 0x3d, 0x77, 0xf3, 0x51, 0xf1, 0xc2, 0xca, 0x6c, 0x2d, 0x1a, 0xbf, 0x8a, 0x00, 0xb4, 0x22, 0x9c },
+		.public = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80 },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = false
+	},
+	/* wycheproof - public key with low order */
+	{
+		.private = { 0x30, 0xf9, 0x93, 0xfc, 0xf8, 0x51, 0x4f, 0xc8, 0x9b, 0xd8, 0xdb, 0x14, 0xcd, 0x43, 0xba, 0x0d, 0x4b, 0x25, 0x30, 0xe7, 0x3c, 0x42, 0x76, 0xa0, 0x5e, 0x1b, 0x14, 0x5d, 0x42, 0x0c, 0xed, 0xb4 },
+		.public = { 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80 },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = false
+	},
+	/* wycheproof - public key with low order */
+	{
+		.private = { 0xc0, 0x49, 0x74, 0xb7, 0x58, 0x38, 0x0e, 0x2a, 0x5b, 0x5d, 0xf6, 0xeb, 0x09, 0xbb, 0x2f, 0x6b, 0x34, 0x34, 0xf9, 0x82, 0x72, 0x2a, 0x8e, 0x67, 0x6d, 0x3d, 0xa2, 0x51, 0xd1, 0xb3, 0xde, 0x83 },
+		.public = { 0xe0, 0xeb, 0x7a, 0x7c, 0x3b, 0x41, 0xb8, 0xae, 0x16, 0x56, 0xe3, 0xfa, 0xf1, 0x9f, 0xc4, 0x6a, 0xda, 0x09, 0x8d, 0xeb, 0x9c, 0x32, 0xb1, 0xfd, 0x86, 0x62, 0x05, 0x16, 0x5f, 0x49, 0xb8, 0x80 },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = false
+	},
+	/* wycheproof - public key with low order */
+	{
+		.private = { 0x50, 0x2a, 0x31, 0x37, 0x3d, 0xb3, 0x24, 0x46, 0x84, 0x2f, 0xe5, 0xad, 0xd3, 0xe0, 0x24, 0x02, 0x2e, 0xa5, 0x4f, 0x27, 0x41, 0x82, 0xaf, 0xc3, 0xd9, 0xf1, 0xbb, 0x3d, 0x39, 0x53, 0x4e, 0xb5 },
+		.public = { 0x5f, 0x9c, 0x95, 0xbc, 0xa3, 0x50, 0x8c, 0x24, 0xb1, 0xd0, 0xb1, 0x55, 0x9c, 0x83, 0xef, 0x5b, 0x04, 0x44, 0x5c, 0xc4, 0x58, 0x1c, 0x8e, 0x86, 0xd8, 0x22, 0x4e, 0xdd, 0xd0, 0x9f, 0x11, 0xd7 },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = false
+	},
+	/* wycheproof - public key with low order */
+	{
+		.private = { 0x90, 0xfa, 0x64, 0x17, 0xb0, 0xe3, 0x70, 0x30, 0xfd, 0x6e, 0x43, 0xef, 0xf2, 0xab, 0xae, 0xf1, 0x4c, 0x67, 0x93, 0x11, 0x7a, 0x03, 0x9c, 0xf6, 0x21, 0x31, 0x8b, 0xa9, 0x0f, 0x4e, 0x98, 0xbe },
+		.public = { 0xec, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = false
+	},
+	/* wycheproof - public key with low order */
+	{
+		.private = { 0x78, 0xad, 0x3f, 0x26, 0x02, 0x7f, 0x1c, 0x9f, 0xdd, 0x97, 0x5a, 0x16, 0x13, 0xb9, 0x47, 0x77, 0x9b, 0xad, 0x2c, 0xf2, 0xb7, 0x41, 0xad, 0xe0, 0x18, 0x40, 0x88, 0x5a, 0x30, 0xbb, 0x97, 0x9c },
+		.public = { 0xed, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = false
+	},
+	/* wycheproof - public key with low order */
+	{
+		.private = { 0x98, 0xe2, 0x3d, 0xe7, 0xb1, 0xe0, 0x92, 0x6e, 0xd9, 0xc8, 0x7e, 0x7b, 0x14, 0xba, 0xf5, 0x5f, 0x49, 0x7a, 0x1d, 0x70, 0x96, 0xf9, 0x39, 0x77, 0x68, 0x0e, 0x44, 0xdc, 0x1c, 0x7b, 0x7b, 0x8b },
+		.public = { 0xee, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = false
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0xf0, 0x1e, 0x48, 0xda, 0xfa, 0xc9, 0xd7, 0xbc, 0xf5, 0x89, 0xcb, 0xc3, 0x82, 0xc8, 0x78, 0xd1, 0x8b, 0xda, 0x35, 0x50, 0x58, 0x9f, 0xfb, 0x5d, 0x50, 0xb5, 0x23, 0xbe, 0xbe, 0x32, 0x9d, 0xae },
+		.public = { 0xef, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+		.result = { 0xbd, 0x36, 0xa0, 0x79, 0x0e, 0xb8, 0x83, 0x09, 0x8c, 0x98, 0x8b, 0x21, 0x78, 0x67, 0x73, 0xde, 0x0b, 0x3a, 0x4d, 0xf1, 0x62, 0x28, 0x2c, 0xf1, 0x10, 0xde, 0x18, 0xdd, 0x48, 0x4c, 0xe7, 0x4b },
+		.valid = true
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0x28, 0x87, 0x96, 0xbc, 0x5a, 0xff, 0x4b, 0x81, 0xa3, 0x75, 0x01, 0x75, 0x7b, 0xc0, 0x75, 0x3a, 0x3c, 0x21, 0x96, 0x47, 0x90, 0xd3, 0x86, 0x99, 0x30, 0x8d, 0xeb, 0xc1, 0x7a, 0x6e, 0xaf, 0x8d },
+		.public = { 0xf0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+		.result = { 0xb4, 0xe0, 0xdd, 0x76, 0xda, 0x7b, 0x07, 0x17, 0x28, 0xb6, 0x1f, 0x85, 0x67, 0x71, 0xaa, 0x35, 0x6e, 0x57, 0xed, 0xa7, 0x8a, 0x5b, 0x16, 0x55, 0xcc, 0x38, 0x20, 0xfb, 0x5f, 0x85, 0x4c, 0x5c },
+		.valid = true
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0x98, 0xdf, 0x84, 0x5f, 0x66, 0x51, 0xbf, 0x11, 0x38, 0x22, 0x1f, 0x11, 0x90, 0x41, 0xf7, 0x2b, 0x6d, 0xbc, 0x3c, 0x4a, 0xce, 0x71, 0x43, 0xd9, 0x9f, 0xd5, 0x5a, 0xd8, 0x67, 0x48, 0x0d, 0xa8 },
+		.public = { 0xf1, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+		.result = { 0x6f, 0xdf, 0x6c, 0x37, 0x61, 0x1d, 0xbd, 0x53, 0x04, 0xdc, 0x0f, 0x2e, 0xb7, 0xc9, 0x51, 0x7e, 0xb3, 0xc5, 0x0e, 0x12, 0xfd, 0x05, 0x0a, 0xc6, 0xde, 0xc2, 0x70, 0x71, 0xd4, 0xbf, 0xc0, 0x34 },
+		.valid = true
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0xf0, 0x94, 0x98, 0xe4, 0x6f, 0x02, 0xf8, 0x78, 0x82, 0x9e, 0x78, 0xb8, 0x03, 0xd3, 0x16, 0xa2, 0xed, 0x69, 0x5d, 0x04, 0x98, 0xa0, 0x8a, 0xbd, 0xf8, 0x27, 0x69, 0x30, 0xe2, 0x4e, 0xdc, 0xb0 },
+		.public = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+		.result = { 0x4c, 0x8f, 0xc4, 0xb1, 0xc6, 0xab, 0x88, 0xfb, 0x21, 0xf1, 0x8f, 0x6d, 0x4c, 0x81, 0x02, 0x40, 0xd4, 0xe9, 0x46, 0x51, 0xba, 0x44, 0xf7, 0xa2, 0xc8, 0x63, 0xce, 0xc7, 0xdc, 0x56, 0x60, 0x2d },
+		.valid = true
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0x18, 0x13, 0xc1, 0x0a, 0x5c, 0x7f, 0x21, 0xf9, 0x6e, 0x17, 0xf2, 0x88, 0xc0, 0xcc, 0x37, 0x60, 0x7c, 0x04, 0xc5, 0xf5, 0xae, 0xa2, 0xdb, 0x13, 0x4f, 0x9e, 0x2f, 0xfc, 0x66, 0xbd, 0x9d, 0xb8 },
+		.public = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80 },
+		.result = { 0x1c, 0xd0, 0xb2, 0x82, 0x67, 0xdc, 0x54, 0x1c, 0x64, 0x2d, 0x6d, 0x7d, 0xca, 0x44, 0xa8, 0xb3, 0x8a, 0x63, 0x73, 0x6e, 0xef, 0x5c, 0x4e, 0x65, 0x01, 0xff, 0xbb, 0xb1, 0x78, 0x0c, 0x03, 0x3c },
+		.valid = true
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0x78, 0x57, 0xfb, 0x80, 0x86, 0x53, 0x64, 0x5a, 0x0b, 0xeb, 0x13, 0x8a, 0x64, 0xf5, 0xf4, 0xd7, 0x33, 0xa4, 0x5e, 0xa8, 0x4c, 0x3c, 0xda, 0x11, 0xa9, 0xc0, 0x6f, 0x7e, 0x71, 0x39, 0x14, 0x9e },
+		.public = { 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80 },
+		.result = { 0x87, 0x55, 0xbe, 0x01, 0xc6, 0x0a, 0x7e, 0x82, 0x5c, 0xff, 0x3e, 0x0e, 0x78, 0xcb, 0x3a, 0xa4, 0x33, 0x38, 0x61, 0x51, 0x6a, 0xa5, 0x9b, 0x1c, 0x51, 0xa8, 0xb2, 0xa5, 0x43, 0xdf, 0xa8, 0x22 },
+		.valid = true
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0xe0, 0x3a, 0xa8, 0x42, 0xe2, 0xab, 0xc5, 0x6e, 0x81, 0xe8, 0x7b, 0x8b, 0x9f, 0x41, 0x7b, 0x2a, 0x1e, 0x59, 0x13, 0xc7, 0x23, 0xee, 0xd2, 0x8d, 0x75, 0x2f, 0x8d, 0x47, 0xa5, 0x9f, 0x49, 0x8f },
+		.public = { 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80 },
+		.result = { 0x54, 0xc9, 0xa1, 0xed, 0x95, 0xe5, 0x46, 0xd2, 0x78, 0x22, 0xa3, 0x60, 0x93, 0x1d, 0xda, 0x60, 0xa1, 0xdf, 0x04, 0x9d, 0xa6, 0xf9, 0x04, 0x25, 0x3c, 0x06, 0x12, 0xbb, 0xdc, 0x08, 0x74, 0x76 },
+		.valid = true
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0xf8, 0xf7, 0x07, 0xb7, 0x99, 0x9b, 0x18, 0xcb, 0x0d, 0x6b, 0x96, 0x12, 0x4f, 0x20, 0x45, 0x97, 0x2c, 0xa2, 0x74, 0xbf, 0xc1, 0x54, 0xad, 0x0c, 0x87, 0x03, 0x8c, 0x24, 0xc6, 0xd0, 0xd4, 0xb2 },
+		.public = { 0xda, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+		.result = { 0xcc, 0x1f, 0x40, 0xd7, 0x43, 0xcd, 0xc2, 0x23, 0x0e, 0x10, 0x43, 0xda, 0xba, 0x8b, 0x75, 0xe8, 0x10, 0xf1, 0xfb, 0xab, 0x7f, 0x25, 0x52, 0x69, 0xbd, 0x9e, 0xbb, 0x29, 0xe6, 0xbf, 0x49, 0x4f },
+		.valid = true
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0xa0, 0x34, 0xf6, 0x84, 0xfa, 0x63, 0x1e, 0x1a, 0x34, 0x81, 0x18, 0xc1, 0xce, 0x4c, 0x98, 0x23, 0x1f, 0x2d, 0x9e, 0xec, 0x9b, 0xa5, 0x36, 0x5b, 0x4a, 0x05, 0xd6, 0x9a, 0x78, 0x5b, 0x07, 0x96 },
+		.public = { 0xdb, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+		.result = { 0x54, 0x99, 0x8e, 0xe4, 0x3a, 0x5b, 0x00, 0x7b, 0xf4, 0x99, 0xf0, 0x78, 0xe7, 0x36, 0x52, 0x44, 0x00, 0xa8, 0xb5, 0xc7, 0xe9, 0xb9, 0xb4, 0x37, 0x71, 0x74, 0x8c, 0x7c, 0xdf, 0x88, 0x04, 0x12 },
+		.valid = true
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0x30, 0xb6, 0xc6, 0xa0, 0xf2, 0xff, 0xa6, 0x80, 0x76, 0x8f, 0x99, 0x2b, 0xa8, 0x9e, 0x15, 0x2d, 0x5b, 0xc9, 0x89, 0x3d, 0x38, 0xc9, 0x11, 0x9b, 0xe4, 0xf7, 0x67, 0xbf, 0xab, 0x6e, 0x0c, 0xa5 },
+		.public = { 0xdc, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+		.result = { 0xea, 0xd9, 0xb3, 0x8e, 0xfd, 0xd7, 0x23, 0x63, 0x79, 0x34, 0xe5, 0x5a, 0xb7, 0x17, 0xa7, 0xae, 0x09, 0xeb, 0x86, 0xa2, 0x1d, 0xc3, 0x6a, 0x3f, 0xee, 0xb8, 0x8b, 0x75, 0x9e, 0x39, 0x1e, 0x09 },
+		.valid = true
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0x90, 0x1b, 0x9d, 0xcf, 0x88, 0x1e, 0x01, 0xe0, 0x27, 0x57, 0x50, 0x35, 0xd4, 0x0b, 0x43, 0xbd, 0xc1, 0xc5, 0x24, 0x2e, 0x03, 0x08, 0x47, 0x49, 0x5b, 0x0c, 0x72, 0x86, 0x46, 0x9b, 0x65, 0x91 },
+		.public = { 0xea, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+		.result = { 0x60, 0x2f, 0xf4, 0x07, 0x89, 0xb5, 0x4b, 0x41, 0x80, 0x59, 0x15, 0xfe, 0x2a, 0x62, 0x21, 0xf0, 0x7a, 0x50, 0xff, 0xc2, 0xc3, 0xfc, 0x94, 0xcf, 0x61, 0xf1, 0x3d, 0x79, 0x04, 0xe8, 0x8e, 0x0e },
+		.valid = true
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0x80, 0x46, 0x67, 0x7c, 0x28, 0xfd, 0x82, 0xc9, 0xa1, 0xbd, 0xb7, 0x1a, 0x1a, 0x1a, 0x34, 0xfa, 0xba, 0x12, 0x25, 0xe2, 0x50, 0x7f, 0xe3, 0xf5, 0x4d, 0x10, 0xbd, 0x5b, 0x0d, 0x86, 0x5f, 0x8e },
+		.public = { 0xeb, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+		.result = { 0xe0, 0x0a, 0xe8, 0xb1, 0x43, 0x47, 0x12, 0x47, 0xba, 0x24, 0xf1, 0x2c, 0x88, 0x55, 0x36, 0xc3, 0xcb, 0x98, 0x1b, 0x58, 0xe1, 0xe5, 0x6b, 0x2b, 0xaf, 0x35, 0xc1, 0x2a, 0xe1, 0xf7, 0x9c, 0x26 },
+		.valid = true
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0x60, 0x2f, 0x7e, 0x2f, 0x68, 0xa8, 0x46, 0xb8, 0x2c, 0xc2, 0x69, 0xb1, 0xd4, 0x8e, 0x93, 0x98, 0x86, 0xae, 0x54, 0xfd, 0x63, 0x6c, 0x1f, 0xe0, 0x74, 0xd7, 0x10, 0x12, 0x7d, 0x47, 0x24, 0x91 },
+		.public = { 0xef, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+		.result = { 0x98, 0xcb, 0x9b, 0x50, 0xdd, 0x3f, 0xc2, 0xb0, 0xd4, 0xf2, 0xd2, 0xbf, 0x7c, 0x5c, 0xfd, 0xd1, 0x0c, 0x8f, 0xcd, 0x31, 0xfc, 0x40, 0xaf, 0x1a, 0xd4, 0x4f, 0x47, 0xc1, 0x31, 0x37, 0x63, 0x62 },
+		.valid = true
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0x60, 0x88, 0x7b, 0x3d, 0xc7, 0x24, 0x43, 0x02, 0x6e, 0xbe, 0xdb, 0xbb, 0xb7, 0x06, 0x65, 0xf4, 0x2b, 0x87, 0xad, 0xd1, 0x44, 0x0e, 0x77, 0x68, 0xfb, 0xd7, 0xe8, 0xe2, 0xce, 0x5f, 0x63, 0x9d },
+		.public = { 0xf0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+		.result = { 0x38, 0xd6, 0x30, 0x4c, 0x4a, 0x7e, 0x6d, 0x9f, 0x79, 0x59, 0x33, 0x4f, 0xb5, 0x24, 0x5b, 0xd2, 0xc7, 0x54, 0x52, 0x5d, 0x4c, 0x91, 0xdb, 0x95, 0x02, 0x06, 0x92, 0x62, 0x34, 0xc1, 0xf6, 0x33 },
+		.valid = true
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0x78, 0xd3, 0x1d, 0xfa, 0x85, 0x44, 0x97, 0xd7, 0x2d, 0x8d, 0xef, 0x8a, 0x1b, 0x7f, 0xb0, 0x06, 0xce, 0xc2, 0xd8, 0xc4, 0x92, 0x46, 0x47, 0xc9, 0x38, 0x14, 0xae, 0x56, 0xfa, 0xed, 0xa4, 0x95 },
+		.public = { 0xf1, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+		.result = { 0x78, 0x6c, 0xd5, 0x49, 0x96, 0xf0, 0x14, 0xa5, 0xa0, 0x31, 0xec, 0x14, 0xdb, 0x81, 0x2e, 0xd0, 0x83, 0x55, 0x06, 0x1f, 0xdb, 0x5d, 0xe6, 0x80, 0xa8, 0x00, 0xac, 0x52, 0x1f, 0x31, 0x8e, 0x23 },
+		.valid = true
+	},
+	/* wycheproof - public key >= p */
+	{
+		.private = { 0xc0, 0x4c, 0x5b, 0xae, 0xfa, 0x83, 0x02, 0xdd, 0xde, 0xd6, 0xa4, 0xbb, 0x95, 0x77, 0x61, 0xb4, 0xeb, 0x97, 0xae, 0xfa, 0x4f, 0xc3, 0xb8, 0x04, 0x30, 0x85, 0xf9, 0x6a, 0x56, 0x59, 0xb3, 0xa5 },
+		.public = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+		.result = { 0x29, 0xae, 0x8b, 0xc7, 0x3e, 0x9b, 0x10, 0xa0, 0x8b, 0x4f, 0x68, 0x1c, 0x43, 0xc3, 0xe0, 0xac, 0x1a, 0x17, 0x1d, 0x31, 0xb3, 0x8f, 0x1a, 0x48, 0xef, 0xba, 0x29, 0xae, 0x63, 0x9e, 0xa1, 0x34 },
+		.valid = true
+	},
+	/* wycheproof - RFC 7748 */
+	{
+		.private = { 0xa0, 0x46, 0xe3, 0x6b, 0xf0, 0x52, 0x7c, 0x9d, 0x3b, 0x16, 0x15, 0x4b, 0x82, 0x46, 0x5e, 0xdd, 0x62, 0x14, 0x4c, 0x0a, 0xc1, 0xfc, 0x5a, 0x18, 0x50, 0x6a, 0x22, 0x44, 0xba, 0x44, 0x9a, 0x44 },
+		.public = { 0xe6, 0xdb, 0x68, 0x67, 0x58, 0x30, 0x30, 0xdb, 0x35, 0x94, 0xc1, 0xa4, 0x24, 0xb1, 0x5f, 0x7c, 0x72, 0x66, 0x24, 0xec, 0x26, 0xb3, 0x35, 0x3b, 0x10, 0xa9, 0x03, 0xa6, 0xd0, 0xab, 0x1c, 0x4c },
+		.result = { 0xc3, 0xda, 0x55, 0x37, 0x9d, 0xe9, 0xc6, 0x90, 0x8e, 0x94, 0xea, 0x4d, 0xf2, 0x8d, 0x08, 0x4f, 0x32, 0xec, 0xcf, 0x03, 0x49, 0x1c, 0x71, 0xf7, 0x54, 0xb4, 0x07, 0x55, 0x77, 0xa2, 0x85, 0x52 },
+		.valid = true
+	},
+	/* wycheproof - RFC 7748 */
+	{
+		.private = { 0x48, 0x66, 0xe9, 0xd4, 0xd1, 0xb4, 0x67, 0x3c, 0x5a, 0xd2, 0x26, 0x91, 0x95, 0x7d, 0x6a, 0xf5, 0xc1, 0x1b, 0x64, 0x21, 0xe0, 0xea, 0x01, 0xd4, 0x2c, 0xa4, 0x16, 0x9e, 0x79, 0x18, 0xba, 0x4d },
+		.public = { 0xe5, 0x21, 0x0f, 0x12, 0x78, 0x68, 0x11, 0xd3, 0xf4, 0xb7, 0x95, 0x9d, 0x05, 0x38, 0xae, 0x2c, 0x31, 0xdb, 0xe7, 0x10, 0x6f, 0xc0, 0x3c, 0x3e, 0xfc, 0x4c, 0xd5, 0x49, 0xc7, 0x15, 0xa4, 0x13 },
+		.result = { 0x95, 0xcb, 0xde, 0x94, 0x76, 0xe8, 0x90, 0x7d, 0x7a, 0xad, 0xe4, 0x5c, 0xb4, 0xb8, 0x73, 0xf8, 0x8b, 0x59, 0x5a, 0x68, 0x79, 0x9f, 0xa1, 0x52, 0xe6, 0xf8, 0xf7, 0x64, 0x7a, 0xac, 0x79, 0x57 },
+		.valid = true
+	},
+	/* wycheproof - edge case for shared secret */
+	{
+		.private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4, 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3, 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc, 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+		.public = { 0x0a, 0xb4, 0xe7, 0x63, 0x80, 0xd8, 0x4d, 0xde, 0x4f, 0x68, 0x33, 0xc5, 0x8f, 0x2a, 0x9f, 0xb8, 0xf8, 0x3b, 0xb0, 0x16, 0x9b, 0x17, 0x2b, 0xe4, 0xb6, 0xe0, 0x59, 0x28, 0x87, 0x74, 0x1a, 0x36 },
+		.result = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = true
+	},
+	/* wycheproof - edge case for shared secret */
+	{
+		.private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4, 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3, 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc, 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+		.public = { 0x89, 0xe1, 0x0d, 0x57, 0x01, 0xb4, 0x33, 0x7d, 0x2d, 0x03, 0x21, 0x81, 0x53, 0x8b, 0x10, 0x64, 0xbd, 0x40, 0x84, 0x40, 0x1c, 0xec, 0xa1, 0xfd, 0x12, 0x66, 0x3a, 0x19, 0x59, 0x38, 0x80, 0x00 },
+		.result = { 0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = true
+	},
+	/* wycheproof - edge case for shared secret */
+	{
+		.private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4, 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3, 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc, 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+		.public = { 0x2b, 0x55, 0xd3, 0xaa, 0x4a, 0x8f, 0x80, 0xc8, 0xc0, 0xb2, 0xae, 0x5f, 0x93, 0x3e, 0x85, 0xaf, 0x49, 0xbe, 0xac, 0x36, 0xc2, 0xfa, 0x73, 0x94, 0xba, 0xb7, 0x6c, 0x89, 0x33, 0xf8, 0xf8, 0x1d },
+		.result = { 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+		.valid = true
+	},
+	/* wycheproof - edge case for shared secret */
+	{
+		.private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4, 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3, 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc, 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+		.public = { 0x63, 0xe5, 0xb1, 0xfe, 0x96, 0x01, 0xfe, 0x84, 0x38, 0x5d, 0x88, 0x66, 0xb0, 0x42, 0x12, 0x62, 0xf7, 0x8f, 0xbf, 0xa5, 0xaf, 0xf9, 0x58, 0x5e, 0x62, 0x66, 0x79, 0xb1, 0x85, 0x47, 0xd9, 0x59 },
+		.result = { 0xfe, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x3f },
+		.valid = true
+	},
+	/* wycheproof - edge case for shared secret */
+	{
+		.private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4, 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3, 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc, 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+		.public = { 0xe4, 0x28, 0xf3, 0xda, 0xc1, 0x78, 0x09, 0xf8, 0x27, 0xa5, 0x22, 0xce, 0x32, 0x35, 0x50, 0x58, 0xd0, 0x73, 0x69, 0x36, 0x4a, 0xa7, 0x89, 0x02, 0xee, 0x10, 0x13, 0x9b, 0x9f, 0x9d, 0xd6, 0x53 },
+		.result = { 0xfc, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x3f },
+		.valid = true
+	},
+	/* wycheproof - edge case for shared secret */
+	{
+		.private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4, 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3, 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc, 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+		.public = { 0xb3, 0xb5, 0x0e, 0x3e, 0xd3, 0xa4, 0x07, 0xb9, 0x5d, 0xe9, 0x42, 0xef, 0x74, 0x57, 0x5b, 0x5a, 0xb8, 0xa1, 0x0c, 0x09, 0xee, 0x10, 0x35, 0x44, 0xd6, 0x0b, 0xdf, 0xed, 0x81, 0x38, 0xab, 0x2b },
+		.result = { 0xf9, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x3f },
+		.valid = true
+	},
+	/* wycheproof - edge case for shared secret */
+	{
+		.private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4, 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3, 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc, 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+		.public = { 0x21, 0x3f, 0xff, 0xe9, 0x3d, 0x5e, 0xa8, 0xcd, 0x24, 0x2e, 0x46, 0x28, 0x44, 0x02, 0x99, 0x22, 0xc4, 0x3c, 0x77, 0xc9, 0xe3, 0xe4, 0x2f, 0x56, 0x2f, 0x48, 0x5d, 0x24, 0xc5, 0x01, 0xa2, 0x0b },
+		.result = { 0xf3, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x3f },
+		.valid = true
+	},
+	/* wycheproof - edge case for shared secret */
+	{
+		.private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4, 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3, 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc, 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+		.public = { 0x91, 0xb2, 0x32, 0xa1, 0x78, 0xb3, 0xcd, 0x53, 0x09, 0x32, 0x44, 0x1e, 0x61, 0x39, 0x41, 0x8f, 0x72, 0x17, 0x22, 0x92, 0xf1, 0xda, 0x4c, 0x18, 0x34, 0xfc, 0x5e, 0xbf, 0xef, 0xb5, 0x1e, 0x3f },
+		.result = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x03 },
+		.valid = true
+	},
+	/* wycheproof - edge case for shared secret */
+	{
+		.private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4, 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3, 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc, 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+		.public = { 0x04, 0x5c, 0x6e, 0x11, 0xc5, 0xd3, 0x32, 0x55, 0x6c, 0x78, 0x22, 0xfe, 0x94, 0xeb, 0xf8, 0x9b, 0x56, 0xa3, 0x87, 0x8d, 0xc2, 0x7c, 0xa0, 0x79, 0x10, 0x30, 0x58, 0x84, 0x9f, 0xab, 0xcb, 0x4f },
+		.result = { 0xe5, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+		.valid = true
+	},
+	/* wycheproof - edge case for shared secret */
+	{
+		.private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4, 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3, 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc, 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+		.public = { 0x1c, 0xa2, 0x19, 0x0b, 0x71, 0x16, 0x35, 0x39, 0x06, 0x3c, 0x35, 0x77, 0x3b, 0xda, 0x0c, 0x9c, 0x92, 0x8e, 0x91, 0x36, 0xf0, 0x62, 0x0a, 0xeb, 0x09, 0x3f, 0x09, 0x91, 0x97, 0xb7, 0xf7, 0x4e },
+		.result = { 0xe3, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+		.valid = true
+	},
+	/* wycheproof - edge case for shared secret */
+	{
+		.private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4, 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3, 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc, 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+		.public = { 0xf7, 0x6e, 0x90, 0x10, 0xac, 0x33, 0xc5, 0x04, 0x3b, 0x2d, 0x3b, 0x76, 0xa8, 0x42, 0x17, 0x10, 0x00, 0xc4, 0x91, 0x62, 0x22, 0xe9, 0xe8, 0x58, 0x97, 0xa0, 0xae, 0xc7, 0xf6, 0x35, 0x0b, 0x3c },
+		.result = { 0xdd, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+		.valid = true
+	},
+	/* wycheproof - edge case for shared secret */
+	{
+		.private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4, 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3, 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc, 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+		.public = { 0xbb, 0x72, 0x68, 0x8d, 0x8f, 0x8a, 0xa7, 0xa3, 0x9c, 0xd6, 0x06, 0x0c, 0xd5, 0xc8, 0x09, 0x3c, 0xde, 0xc6, 0xfe, 0x34, 0x19, 0x37, 0xc3, 0x88, 0x6a, 0x99, 0x34, 0x6c, 0xd0, 0x7f, 0xaa, 0x55 },
+		.result = { 0xdb, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+		.valid = true
+	},
+	/* wycheproof - edge case for shared secret */
+	{
+		.private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4, 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3, 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc, 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+		.public = { 0x88, 0xfd, 0xde, 0xa1, 0x93, 0x39, 0x1c, 0x6a, 0x59, 0x33, 0xef, 0x9b, 0x71, 0x90, 0x15, 0x49, 0x44, 0x72, 0x05, 0xaa, 0xe9, 0xda, 0x92, 0x8a, 0x6b, 0x91, 0xa3, 0x52, 0xba, 0x10, 0xf4, 0x1f },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02 },
+		.valid = true
+	},
+	/* wycheproof - edge case for shared secret */
+	{
+		.private = { 0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4, 0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3, 0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc, 0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+		.public = { 0x30, 0x3b, 0x39, 0x2f, 0x15, 0x31, 0x16, 0xca, 0xd9, 0xcc, 0x68, 0x2a, 0x00, 0xcc, 0xc4, 0x4c, 0x95, 0xff, 0x0d, 0x3b, 0xbe, 0x56, 0x8b, 0xeb, 0x6c, 0x4e, 0x73, 0x9b, 0xaf, 0xdc, 0x2c, 0x68 },
+		.result = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80, 0x00 },
+		.valid = true
+	},
+	/* wycheproof - checking for overflow */
+	{
+		.private = { 0xc8, 0x17, 0x24, 0x70, 0x40, 0x00, 0xb2, 0x6d, 0x31, 0x70, 0x3c, 0xc9, 0x7e, 0x3a, 0x37, 0x8d, 0x56, 0xfa, 0xd8, 0x21, 0x93, 0x61, 0xc8, 0x8c, 0xca, 0x8b, 0xd7, 0xc5, 0x71, 0x9b, 0x12, 0xb2 },
+		.public = { 0xfd, 0x30, 0x0a, 0xeb, 0x40, 0xe1, 0xfa, 0x58, 0x25, 0x18, 0x41, 0x2b, 0x49, 0xb2, 0x08, 0xa7, 0x84, 0x2b, 0x1e, 0x1f, 0x05, 0x6a, 0x04, 0x01, 0x78, 0xea, 0x41, 0x41, 0x53, 0x4f, 0x65, 0x2d },
+		.result = { 0xb7, 0x34, 0x10, 0x5d, 0xc2, 0x57, 0x58, 0x5d, 0x73, 0xb5, 0x66, 0xcc, 0xb7, 0x6f, 0x06, 0x27, 0x95, 0xcc, 0xbe, 0xc8, 0x91, 0x28, 0xe5, 0x2b, 0x02, 0xf3, 0xe5, 0x96, 0x39, 0xf1, 0x3c, 0x46 },
+		.valid = true
+	},
+	/* wycheproof - checking for overflow */
+	{
+		.private = { 0xc8, 0x17, 0x24, 0x70, 0x40, 0x00, 0xb2, 0x6d, 0x31, 0x70, 0x3c, 0xc9, 0x7e, 0x3a, 0x37, 0x8d, 0x56, 0xfa, 0xd8, 0x21, 0x93, 0x61, 0xc8, 0x8c, 0xca, 0x8b, 0xd7, 0xc5, 0x71, 0x9b, 0x12, 0xb2 },
+		.public = { 0xc8, 0xef, 0x79, 0xb5, 0x14, 0xd7, 0x68, 0x26, 0x77, 0xbc, 0x79, 0x31, 0xe0, 0x6e, 0xe5, 0xc2, 0x7c, 0x9b, 0x39, 0x2b, 0x4a, 0xe9, 0x48, 0x44, 0x73, 0xf5, 0x54, 0xe6, 0x67, 0x8e, 0xcc, 0x2e },
+		.result = { 0x64, 0x7a, 0x46, 0xb6, 0xfc, 0x3f, 0x40, 0xd6, 0x21, 0x41, 0xee, 0x3c, 0xee, 0x70, 0x6b, 0x4d, 0x7a, 0x92, 0x71, 0x59, 0x3a, 0x7b, 0x14, 0x3e, 0x8e, 0x2e, 0x22, 0x79, 0x88, 0x3e, 0x45, 0x50 },
+		.valid = true
+	},
+	/* wycheproof - checking for overflow */
+	{
+		.private = { 0xc8, 0x17, 0x24, 0x70, 0x40, 0x00, 0xb2, 0x6d, 0x31, 0x70, 0x3c, 0xc9, 0x7e, 0x3a, 0x37, 0x8d, 0x56, 0xfa, 0xd8, 0x21, 0x93, 0x61, 0xc8, 0x8c, 0xca, 0x8b, 0xd7, 0xc5, 0x71, 0x9b, 0x12, 0xb2 },
+		.public = { 0x64, 0xae, 0xac, 0x25, 0x04, 0x14, 0x48, 0x61, 0x53, 0x2b, 0x7b, 0xbc, 0xb6, 0xc8, 0x7d, 0x67, 0xdd, 0x4c, 0x1f, 0x07, 0xeb, 0xc2, 0xe0, 0x6e, 0xff, 0xb9, 0x5a, 0xec, 0xc6, 0x17, 0x0b, 0x2c },
+		.result = { 0x4f, 0xf0, 0x3d, 0x5f, 0xb4, 0x3c, 0xd8, 0x65, 0x7a, 0x3c, 0xf3, 0x7c, 0x13, 0x8c, 0xad, 0xce, 0xcc, 0xe5, 0x09, 0xe4, 0xeb, 0xa0, 0x89, 0xd0, 0xef, 0x40, 0xb4, 0xe4, 0xfb, 0x94, 0x61, 0x55 },
+		.valid = true
+	},
+	/* wycheproof - checking for overflow */
+	{
+		.private = { 0xc8, 0x17, 0x24, 0x70, 0x40, 0x00, 0xb2, 0x6d, 0x31, 0x70, 0x3c, 0xc9, 0x7e, 0x3a, 0x37, 0x8d, 0x56, 0xfa, 0xd8, 0x21, 0x93, 0x61, 0xc8, 0x8c, 0xca, 0x8b, 0xd7, 0xc5, 0x71, 0x9b, 0x12, 0xb2 },
+		.public = { 0xbf, 0x68, 0xe3, 0x5e, 0x9b, 0xdb, 0x7e, 0xee, 0x1b, 0x50, 0x57, 0x02, 0x21, 0x86, 0x0f, 0x5d, 0xcd, 0xad, 0x8a, 0xcb, 0xab, 0x03, 0x1b, 0x14, 0x97, 0x4c, 0xc4, 0x90, 0x13, 0xc4, 0x98, 0x31 },
+		.result = { 0x21, 0xce, 0xe5, 0x2e, 0xfd, 0xbc, 0x81, 0x2e, 0x1d, 0x02, 0x1a, 0x4a, 0xf1, 0xe1, 0xd8, 0xbc, 0x4d, 0xb3, 0xc4, 0x00, 0xe4, 0xd2, 0xa2, 0xc5, 0x6a, 0x39, 0x26, 0xdb, 0x4d, 0x99, 0xc6, 0x5b },
+		.valid = true
+	},
+	/* wycheproof - checking for overflow */
+	{
+		.private = { 0xc8, 0x17, 0x24, 0x70, 0x40, 0x00, 0xb2, 0x6d, 0x31, 0x70, 0x3c, 0xc9, 0x7e, 0x3a, 0x37, 0x8d, 0x56, 0xfa, 0xd8, 0x21, 0x93, 0x61, 0xc8, 0x8c, 0xca, 0x8b, 0xd7, 0xc5, 0x71, 0x9b, 0x12, 0xb2 },
+		.public = { 0x53, 0x47, 0xc4, 0x91, 0x33, 0x1a, 0x64, 0xb4, 0x3d, 0xdc, 0x68, 0x30, 0x34, 0xe6, 0x77, 0xf5, 0x3d, 0xc3, 0x2b, 0x52, 0xa5, 0x2a, 0x57, 0x7c, 0x15, 0xa8, 0x3b, 0xf2, 0x98, 0xe9, 0x9f, 0x19 },
+		.result = { 0x18, 0xcb, 0x89, 0xe4, 0xe2, 0x0c, 0x0c, 0x2b, 0xd3, 0x24, 0x30, 0x52, 0x45, 0x26, 0x6c, 0x93, 0x27, 0x69, 0x0b, 0xbe, 0x79, 0xac, 0xb8, 0x8f, 0x5b, 0x8f, 0xb3, 0xf7, 0x4e, 0xca, 0x3e, 0x52 },
+		.valid = true
+	},
+	/* wycheproof - private key == -1 (mod order) */
+	{
+		.private = { 0xa0, 0x23, 0xcd, 0xd0, 0x83, 0xef, 0x5b, 0xb8, 0x2f, 0x10, 0xd6, 0x2e, 0x59, 0xe1, 0x5a, 0x68, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x50 },
+		.public = { 0x25, 0x8e, 0x04, 0x52, 0x3b, 0x8d, 0x25, 0x3e, 0xe6, 0x57, 0x19, 0xfc, 0x69, 0x06, 0xc6, 0x57, 0x19, 0x2d, 0x80, 0x71, 0x7e, 0xdc, 0x82, 0x8f, 0xa0, 0xaf, 0x21, 0x68, 0x6e, 0x2f, 0xaa, 0x75 },
+		.result = { 0x25, 0x8e, 0x04, 0x52, 0x3b, 0x8d, 0x25, 0x3e, 0xe6, 0x57, 0x19, 0xfc, 0x69, 0x06, 0xc6, 0x57, 0x19, 0x2d, 0x80, 0x71, 0x7e, 0xdc, 0x82, 0x8f, 0xa0, 0xaf, 0x21, 0x68, 0x6e, 0x2f, 0xaa, 0x75 },
+		.valid = true
+	},
+	/* wycheproof - private key == 1 (mod order) on twist */
+	{
+		.private = { 0x58, 0x08, 0x3d, 0xd2, 0x61, 0xad, 0x91, 0xef, 0xf9, 0x52, 0x32, 0x2e, 0xc8, 0x24, 0xc6, 0x82, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x5f },
+		.public = { 0x2e, 0xae, 0x5e, 0xc3, 0xdd, 0x49, 0x4e, 0x9f, 0x2d, 0x37, 0xd2, 0x58, 0xf8, 0x73, 0xa8, 0xe6, 0xe9, 0xd0, 0xdb, 0xd1, 0xe3, 0x83, 0xef, 0x64, 0xd9, 0x8b, 0xb9, 0x1b, 0x3e, 0x0b, 0xe0, 0x35 },
+		.result = { 0x2e, 0xae, 0x5e, 0xc3, 0xdd, 0x49, 0x4e, 0x9f, 0x2d, 0x37, 0xd2, 0x58, 0xf8, 0x73, 0xa8, 0xe6, 0xe9, 0xd0, 0xdb, 0xd1, 0xe3, 0x83, 0xef, 0x64, 0xd9, 0x8b, 0xb9, 0x1b, 0x3e, 0x0b, 0xe0, 0x35 },
+		.valid = true
+	}
+};
+bool __init curve25519_selftest(void)
+{
+	bool success = true, ret, ret2;
+	size_t i = 0, j;
+	u8 in[CURVE25519_POINT_SIZE], out[CURVE25519_POINT_SIZE], out2[CURVE25519_POINT_SIZE];
+
+	for (i = 0; i < ARRAY_SIZE(curve25519_test_vectors); ++i) {
+		memset(out, 0, CURVE25519_POINT_SIZE);
+		ret = curve25519(out, curve25519_test_vectors[i].private, curve25519_test_vectors[i].public);
+		if (ret != curve25519_test_vectors[i].valid || memcmp(out, curve25519_test_vectors[i].result, CURVE25519_POINT_SIZE)) {
+			pr_info("curve25519 self-test %zu: FAIL\n", i + 1);
+			success = false;
+			break;
+		}
+	}
+
+	for (i = 0; i < 5; ++i) {
+		get_random_bytes(in, sizeof(in));
+		ret = curve25519_generate_public(out, in);
+		ret2 = curve25519(out2, in, (u8[CURVE25519_POINT_SIZE]){ 9 });
+		if (ret != ret2 || memcmp(out, out2, CURVE25519_POINT_SIZE)) {
+			pr_info("curve25519 basepoint self-test %zu: FAIL: input - 0x", i + 1);
+			for (j = CURVE25519_POINT_SIZE; j-- > 0;)
+				printk(KERN_CONT "%02x", in[j]);
+			printk(KERN_CONT "\n");
+			success = false;
+			break;
+		}
+	}
+
+	if (success)
+		pr_info("curve25519 self-tests: pass\n");
+	return success;
+}
+#endif
-- 
2.18.0

^ permalink raw reply related

* [PATCH v2 15/17] zinc: Curve25519 ARM implementation
From: Jason A. Donenfeld @ 2018-08-24 21:38 UTC (permalink / raw)
  To: linux-kernel, netdev, davem
  Cc: Jason A. Donenfeld, Andy Lutomirski, Greg KH, Samuel Neves,
	D . J . Bernstein, Jean-Philippe Aumasson, Russell King,
	linux-arm-kernel, linux-crypto
In-Reply-To: <20180824213849.23647-1-Jason@zx2c4.com>

This comes from Dan Bernstein and Peter Schwabe's public domain NEON
code, and has been modified to be friendly for kernel space.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: D. J. Bernstein <djb@cr.yp.to>
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-crypto@vger.kernel.org
---
 lib/zinc/Makefile                         |    4 +
 lib/zinc/curve25519/curve25519-arm-glue.h |   33 +
 lib/zinc/curve25519/curve25519-arm.S      | 2110 +++++++++++++++++++++
 3 files changed, 2147 insertions(+)
 create mode 100644 lib/zinc/curve25519/curve25519-arm-glue.h
 create mode 100644 lib/zinc/curve25519/curve25519-arm.S

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index b6b0614f8743..13e403780ce0 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -52,6 +52,10 @@ endif
 
 ifeq ($(CONFIG_ZINC_CURVE25519),y)
 zinc-y += curve25519/curve25519.o
+ifeq ($(CONFIG_ARM)$(CONFIG_KERNEL_MODE_NEON),yy)
+zinc-y += curve25519/curve25519-arm.o
+CFLAGS_curve25519.o += -include $(srctree)/$(src)/curve25519/curve25519-arm-glue.h
+endif
 endif
 
 ifeq ($(CONFIG_ZINC_BLAKE2S),y)
diff --git a/lib/zinc/curve25519/curve25519-arm-glue.h b/lib/zinc/curve25519/curve25519-arm-glue.h
new file mode 100644
index 000000000000..1d5c029e9195
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-arm-glue.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <zinc/curve25519.h>
+#include <asm/hwcap.h>
+#include <asm/neon.h>
+#include <asm/simd.h>
+
+asmlinkage void curve25519_neon(u8 mypublic[CURVE25519_POINT_SIZE], const u8 secret[CURVE25519_POINT_SIZE], const u8 basepoint[CURVE25519_POINT_SIZE]);
+
+static bool curve25519_use_neon __ro_after_init;
+
+void __init curve25519_fpu_init(void)
+{
+	curve25519_use_neon = elf_hwcap & HWCAP_NEON;
+}
+
+static inline bool curve25519_arch(u8 mypublic[CURVE25519_POINT_SIZE], const u8 secret[CURVE25519_POINT_SIZE], const u8 basepoint[CURVE25519_POINT_SIZE])
+{
+	if (curve25519_use_neon && may_use_simd()) {
+		kernel_neon_begin();
+		curve25519_neon(mypublic, secret, basepoint);
+		kernel_neon_end();
+		return true;
+	}
+	return false;
+}
+
+static inline bool curve25519_base_arch(u8 pub[CURVE25519_POINT_SIZE], const u8 secret[CURVE25519_POINT_SIZE]) { return false; }
+
+#define HAVE_CURVE25519_ARCH_IMPLEMENTATION
diff --git a/lib/zinc/curve25519/curve25519-arm.S b/lib/zinc/curve25519/curve25519-arm.S
new file mode 100644
index 000000000000..2c02e66cc87c
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-arm.S
@@ -0,0 +1,2110 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * Based on public domain code from Daniel J. Bernstein and Peter Schwabe.
+ */
+
+#if IS_ENABLED(CONFIG_KERNEL_MODE_NEON)
+
+#include <linux/linkage.h>
+
+	.text
+	.fpu		neon
+	.align		4
+
+ENTRY(curve25519_neon)
+	vpush {q4,q5,q6,q7}
+	mov r12,sp
+	sub r3,sp,#736
+	and r3,r3,#0xffffffe0
+	mov sp,r3
+	strd r4,[sp,#0]
+	strd r6,[sp,#8]
+	strd r8,[sp,#16]
+	strd r10,[sp,#24]
+	str r12,[sp,#480]
+	str r14,[sp,#484]
+	mov r0,r0
+	mov r1,r1
+	mov r2,r2
+	add r3,sp,#32
+	ldr r4,=0
+	ldr r5,=254
+	vmov.i32 q0,#1
+	vshr.u64 q1,q0,#7
+	vshr.u64 q0,q0,#8
+	vmov.i32 d4,#19
+	vmov.i32 d5,#38
+	add r6,sp,#512
+	vst1.8 {d2-d3},[r6,: 128]
+	add r6,sp,#528
+	vst1.8 {d0-d1},[r6,: 128]
+	add r6,sp,#544
+	vst1.8 {d4-d5},[r6,: 128]
+	add r6,r3,#0
+	vmov.i32 q2,#0
+	vst1.8 {d4-d5},[r6,: 128]!
+	vst1.8 {d4-d5},[r6,: 128]!
+	vst1.8 d4,[r6,: 64]
+	add r6,r3,#0
+	ldr r7,=960
+	sub r7,r7,#2
+	neg r7,r7
+	sub r7,r7,r7,LSL #7
+	str r7,[r6]
+	add r6,sp,#704
+	vld1.8 {d4-d5},[r1]!
+	vld1.8 {d6-d7},[r1]
+	vst1.8 {d4-d5},[r6,: 128]!
+	vst1.8 {d6-d7},[r6,: 128]
+	sub r1,r6,#16
+	ldrb r6,[r1]
+	and r6,r6,#248
+	strb r6,[r1]
+	ldrb r6,[r1,#31]
+	and r6,r6,#127
+	orr r6,r6,#64
+	strb r6,[r1,#31]
+	vmov.i64 q2,#0xffffffff
+	vshr.u64 q3,q2,#7
+	vshr.u64 q2,q2,#6
+	vld1.8 {d8},[r2]
+	vld1.8 {d10},[r2]
+	add r2,r2,#6
+	vld1.8 {d12},[r2]
+	vld1.8 {d14},[r2]
+	add r2,r2,#6
+	vld1.8 {d16},[r2]
+	add r2,r2,#4
+	vld1.8 {d18},[r2]
+	vld1.8 {d20},[r2]
+	add r2,r2,#6
+	vld1.8 {d22},[r2]
+	add r2,r2,#2
+	vld1.8 {d24},[r2]
+	vld1.8 {d26},[r2]
+	vshr.u64 q5,q5,#26
+	vshr.u64 q6,q6,#3
+	vshr.u64 q7,q7,#29
+	vshr.u64 q8,q8,#6
+	vshr.u64 q10,q10,#25
+	vshr.u64 q11,q11,#3
+	vshr.u64 q12,q12,#12
+	vshr.u64 q13,q13,#38
+	vand q4,q4,q2
+	vand q6,q6,q2
+	vand q8,q8,q2
+	vand q10,q10,q2
+	vand q2,q12,q2
+	vand q5,q5,q3
+	vand q7,q7,q3
+	vand q9,q9,q3
+	vand q11,q11,q3
+	vand q3,q13,q3
+	add r2,r3,#48
+	vadd.i64 q12,q4,q1
+	vadd.i64 q13,q10,q1
+	vshr.s64 q12,q12,#26
+	vshr.s64 q13,q13,#26
+	vadd.i64 q5,q5,q12
+	vshl.i64 q12,q12,#26
+	vadd.i64 q14,q5,q0
+	vadd.i64 q11,q11,q13
+	vshl.i64 q13,q13,#26
+	vadd.i64 q15,q11,q0
+	vsub.i64 q4,q4,q12
+	vshr.s64 q12,q14,#25
+	vsub.i64 q10,q10,q13
+	vshr.s64 q13,q15,#25
+	vadd.i64 q6,q6,q12
+	vshl.i64 q12,q12,#25
+	vadd.i64 q14,q6,q1
+	vadd.i64 q2,q2,q13
+	vsub.i64 q5,q5,q12
+	vshr.s64 q12,q14,#26
+	vshl.i64 q13,q13,#25
+	vadd.i64 q14,q2,q1
+	vadd.i64 q7,q7,q12
+	vshl.i64 q12,q12,#26
+	vadd.i64 q15,q7,q0
+	vsub.i64 q11,q11,q13
+	vshr.s64 q13,q14,#26
+	vsub.i64 q6,q6,q12
+	vshr.s64 q12,q15,#25
+	vadd.i64 q3,q3,q13
+	vshl.i64 q13,q13,#26
+	vadd.i64 q14,q3,q0
+	vadd.i64 q8,q8,q12
+	vshl.i64 q12,q12,#25
+	vadd.i64 q15,q8,q1
+	add r2,r2,#8
+	vsub.i64 q2,q2,q13
+	vshr.s64 q13,q14,#25
+	vsub.i64 q7,q7,q12
+	vshr.s64 q12,q15,#26
+	vadd.i64 q14,q13,q13
+	vadd.i64 q9,q9,q12
+	vtrn.32 d12,d14
+	vshl.i64 q12,q12,#26
+	vtrn.32 d13,d15
+	vadd.i64 q0,q9,q0
+	vadd.i64 q4,q4,q14
+	vst1.8 d12,[r2,: 64]!
+	vshl.i64 q6,q13,#4
+	vsub.i64 q7,q8,q12
+	vshr.s64 q0,q0,#25
+	vadd.i64 q4,q4,q6
+	vadd.i64 q6,q10,q0
+	vshl.i64 q0,q0,#25
+	vadd.i64 q8,q6,q1
+	vadd.i64 q4,q4,q13
+	vshl.i64 q10,q13,#25
+	vadd.i64 q1,q4,q1
+	vsub.i64 q0,q9,q0
+	vshr.s64 q8,q8,#26
+	vsub.i64 q3,q3,q10
+	vtrn.32 d14,d0
+	vshr.s64 q1,q1,#26
+	vtrn.32 d15,d1
+	vadd.i64 q0,q11,q8
+	vst1.8 d14,[r2,: 64]
+	vshl.i64 q7,q8,#26
+	vadd.i64 q5,q5,q1
+	vtrn.32 d4,d6
+	vshl.i64 q1,q1,#26
+	vtrn.32 d5,d7
+	vsub.i64 q3,q6,q7
+	add r2,r2,#16
+	vsub.i64 q1,q4,q1
+	vst1.8 d4,[r2,: 64]
+	vtrn.32 d6,d0
+	vtrn.32 d7,d1
+	sub r2,r2,#8
+	vtrn.32 d2,d10
+	vtrn.32 d3,d11
+	vst1.8 d6,[r2,: 64]
+	sub r2,r2,#24
+	vst1.8 d2,[r2,: 64]
+	add r2,r3,#96
+	vmov.i32 q0,#0
+	vmov.i64 d2,#0xff
+	vmov.i64 d3,#0
+	vshr.u32 q1,q1,#7
+	vst1.8 {d2-d3},[r2,: 128]!
+	vst1.8 {d0-d1},[r2,: 128]!
+	vst1.8 d0,[r2,: 64]
+	add r2,r3,#144
+	vmov.i32 q0,#0
+	vst1.8 {d0-d1},[r2,: 128]!
+	vst1.8 {d0-d1},[r2,: 128]!
+	vst1.8 d0,[r2,: 64]
+	add r2,r3,#240
+	vmov.i32 q0,#0
+	vmov.i64 d2,#0xff
+	vmov.i64 d3,#0
+	vshr.u32 q1,q1,#7
+	vst1.8 {d2-d3},[r2,: 128]!
+	vst1.8 {d0-d1},[r2,: 128]!
+	vst1.8 d0,[r2,: 64]
+	add r2,r3,#48
+	add r6,r3,#192
+	vld1.8 {d0-d1},[r2,: 128]!
+	vld1.8 {d2-d3},[r2,: 128]!
+	vld1.8 {d4},[r2,: 64]
+	vst1.8 {d0-d1},[r6,: 128]!
+	vst1.8 {d2-d3},[r6,: 128]!
+	vst1.8 d4,[r6,: 64]
+	.Lmainloop:
+	mov r2,r5,LSR #3
+	and r6,r5,#7
+	ldrb r2,[r1,r2]
+	mov r2,r2,LSR r6
+	and r2,r2,#1
+	str r5,[sp,#488]
+	eor r4,r4,r2
+	str r2,[sp,#492]
+	neg r2,r4
+	add r4,r3,#96
+	add r5,r3,#192
+	add r6,r3,#144
+	vld1.8 {d8-d9},[r4,: 128]!
+	add r7,r3,#240
+	vld1.8 {d10-d11},[r5,: 128]!
+	veor q6,q4,q5
+	vld1.8 {d14-d15},[r6,: 128]!
+	vdup.i32 q8,r2
+	vld1.8 {d18-d19},[r7,: 128]!
+	veor q10,q7,q9
+	vld1.8 {d22-d23},[r4,: 128]!
+	vand q6,q6,q8
+	vld1.8 {d24-d25},[r5,: 128]!
+	vand q10,q10,q8
+	vld1.8 {d26-d27},[r6,: 128]!
+	veor q4,q4,q6
+	vld1.8 {d28-d29},[r7,: 128]!
+	veor q5,q5,q6
+	vld1.8 {d0},[r4,: 64]
+	veor q6,q7,q10
+	vld1.8 {d2},[r5,: 64]
+	veor q7,q9,q10
+	vld1.8 {d4},[r6,: 64]
+	veor q9,q11,q12
+	vld1.8 {d6},[r7,: 64]
+	veor q10,q0,q1
+	sub r2,r4,#32
+	vand q9,q9,q8
+	sub r4,r5,#32
+	vand q10,q10,q8
+	sub r5,r6,#32
+	veor q11,q11,q9
+	sub r6,r7,#32
+	veor q0,q0,q10
+	veor q9,q12,q9
+	veor q1,q1,q10
+	veor q10,q13,q14
+	veor q12,q2,q3
+	vand q10,q10,q8
+	vand q8,q12,q8
+	veor q12,q13,q10
+	veor q2,q2,q8
+	veor q10,q14,q10
+	veor q3,q3,q8
+	vadd.i32 q8,q4,q6
+	vsub.i32 q4,q4,q6
+	vst1.8 {d16-d17},[r2,: 128]!
+	vadd.i32 q6,q11,q12
+	vst1.8 {d8-d9},[r5,: 128]!
+	vsub.i32 q4,q11,q12
+	vst1.8 {d12-d13},[r2,: 128]!
+	vadd.i32 q6,q0,q2
+	vst1.8 {d8-d9},[r5,: 128]!
+	vsub.i32 q0,q0,q2
+	vst1.8 d12,[r2,: 64]
+	vadd.i32 q2,q5,q7
+	vst1.8 d0,[r5,: 64]
+	vsub.i32 q0,q5,q7
+	vst1.8 {d4-d5},[r4,: 128]!
+	vadd.i32 q2,q9,q10
+	vst1.8 {d0-d1},[r6,: 128]!
+	vsub.i32 q0,q9,q10
+	vst1.8 {d4-d5},[r4,: 128]!
+	vadd.i32 q2,q1,q3
+	vst1.8 {d0-d1},[r6,: 128]!
+	vsub.i32 q0,q1,q3
+	vst1.8 d4,[r4,: 64]
+	vst1.8 d0,[r6,: 64]
+	add r2,sp,#544
+	add r4,r3,#96
+	add r5,r3,#144
+	vld1.8 {d0-d1},[r2,: 128]
+	vld1.8 {d2-d3},[r4,: 128]!
+	vld1.8 {d4-d5},[r5,: 128]!
+	vzip.i32 q1,q2
+	vld1.8 {d6-d7},[r4,: 128]!
+	vld1.8 {d8-d9},[r5,: 128]!
+	vshl.i32 q5,q1,#1
+	vzip.i32 q3,q4
+	vshl.i32 q6,q2,#1
+	vld1.8 {d14},[r4,: 64]
+	vshl.i32 q8,q3,#1
+	vld1.8 {d15},[r5,: 64]
+	vshl.i32 q9,q4,#1
+	vmul.i32 d21,d7,d1
+	vtrn.32 d14,d15
+	vmul.i32 q11,q4,q0
+	vmul.i32 q0,q7,q0
+	vmull.s32 q12,d2,d2
+	vmlal.s32 q12,d11,d1
+	vmlal.s32 q12,d12,d0
+	vmlal.s32 q12,d13,d23
+	vmlal.s32 q12,d16,d22
+	vmlal.s32 q12,d7,d21
+	vmull.s32 q10,d2,d11
+	vmlal.s32 q10,d4,d1
+	vmlal.s32 q10,d13,d0
+	vmlal.s32 q10,d6,d23
+	vmlal.s32 q10,d17,d22
+	vmull.s32 q13,d10,d4
+	vmlal.s32 q13,d11,d3
+	vmlal.s32 q13,d13,d1
+	vmlal.s32 q13,d16,d0
+	vmlal.s32 q13,d17,d23
+	vmlal.s32 q13,d8,d22
+	vmull.s32 q1,d10,d5
+	vmlal.s32 q1,d11,d4
+	vmlal.s32 q1,d6,d1
+	vmlal.s32 q1,d17,d0
+	vmlal.s32 q1,d8,d23
+	vmull.s32 q14,d10,d6
+	vmlal.s32 q14,d11,d13
+	vmlal.s32 q14,d4,d4
+	vmlal.s32 q14,d17,d1
+	vmlal.s32 q14,d18,d0
+	vmlal.s32 q14,d9,d23
+	vmull.s32 q11,d10,d7
+	vmlal.s32 q11,d11,d6
+	vmlal.s32 q11,d12,d5
+	vmlal.s32 q11,d8,d1
+	vmlal.s32 q11,d19,d0
+	vmull.s32 q15,d10,d8
+	vmlal.s32 q15,d11,d17
+	vmlal.s32 q15,d12,d6
+	vmlal.s32 q15,d13,d5
+	vmlal.s32 q15,d19,d1
+	vmlal.s32 q15,d14,d0
+	vmull.s32 q2,d10,d9
+	vmlal.s32 q2,d11,d8
+	vmlal.s32 q2,d12,d7
+	vmlal.s32 q2,d13,d6
+	vmlal.s32 q2,d14,d1
+	vmull.s32 q0,d15,d1
+	vmlal.s32 q0,d10,d14
+	vmlal.s32 q0,d11,d19
+	vmlal.s32 q0,d12,d8
+	vmlal.s32 q0,d13,d17
+	vmlal.s32 q0,d6,d6
+	add r2,sp,#512
+	vld1.8 {d18-d19},[r2,: 128]
+	vmull.s32 q3,d16,d7
+	vmlal.s32 q3,d10,d15
+	vmlal.s32 q3,d11,d14
+	vmlal.s32 q3,d12,d9
+	vmlal.s32 q3,d13,d8
+	add r2,sp,#528
+	vld1.8 {d8-d9},[r2,: 128]
+	vadd.i64 q5,q12,q9
+	vadd.i64 q6,q15,q9
+	vshr.s64 q5,q5,#26
+	vshr.s64 q6,q6,#26
+	vadd.i64 q7,q10,q5
+	vshl.i64 q5,q5,#26
+	vadd.i64 q8,q7,q4
+	vadd.i64 q2,q2,q6
+	vshl.i64 q6,q6,#26
+	vadd.i64 q10,q2,q4
+	vsub.i64 q5,q12,q5
+	vshr.s64 q8,q8,#25
+	vsub.i64 q6,q15,q6
+	vshr.s64 q10,q10,#25
+	vadd.i64 q12,q13,q8
+	vshl.i64 q8,q8,#25
+	vadd.i64 q13,q12,q9
+	vadd.i64 q0,q0,q10
+	vsub.i64 q7,q7,q8
+	vshr.s64 q8,q13,#26
+	vshl.i64 q10,q10,#25
+	vadd.i64 q13,q0,q9
+	vadd.i64 q1,q1,q8
+	vshl.i64 q8,q8,#26
+	vadd.i64 q15,q1,q4
+	vsub.i64 q2,q2,q10
+	vshr.s64 q10,q13,#26
+	vsub.i64 q8,q12,q8
+	vshr.s64 q12,q15,#25
+	vadd.i64 q3,q3,q10
+	vshl.i64 q10,q10,#26
+	vadd.i64 q13,q3,q4
+	vadd.i64 q14,q14,q12
+	add r2,r3,#288
+	vshl.i64 q12,q12,#25
+	add r4,r3,#336
+	vadd.i64 q15,q14,q9
+	add r2,r2,#8
+	vsub.i64 q0,q0,q10
+	add r4,r4,#8
+	vshr.s64 q10,q13,#25
+	vsub.i64 q1,q1,q12
+	vshr.s64 q12,q15,#26
+	vadd.i64 q13,q10,q10
+	vadd.i64 q11,q11,q12
+	vtrn.32 d16,d2
+	vshl.i64 q12,q12,#26
+	vtrn.32 d17,d3
+	vadd.i64 q1,q11,q4
+	vadd.i64 q4,q5,q13
+	vst1.8 d16,[r2,: 64]!
+	vshl.i64 q5,q10,#4
+	vst1.8 d17,[r4,: 64]!
+	vsub.i64 q8,q14,q12
+	vshr.s64 q1,q1,#25
+	vadd.i64 q4,q4,q5
+	vadd.i64 q5,q6,q1
+	vshl.i64 q1,q1,#25
+	vadd.i64 q6,q5,q9
+	vadd.i64 q4,q4,q10
+	vshl.i64 q10,q10,#25
+	vadd.i64 q9,q4,q9
+	vsub.i64 q1,q11,q1
+	vshr.s64 q6,q6,#26
+	vsub.i64 q3,q3,q10
+	vtrn.32 d16,d2
+	vshr.s64 q9,q9,#26
+	vtrn.32 d17,d3
+	vadd.i64 q1,q2,q6
+	vst1.8 d16,[r2,: 64]
+	vshl.i64 q2,q6,#26
+	vst1.8 d17,[r4,: 64]
+	vadd.i64 q6,q7,q9
+	vtrn.32 d0,d6
+	vshl.i64 q7,q9,#26
+	vtrn.32 d1,d7
+	vsub.i64 q2,q5,q2
+	add r2,r2,#16
+	vsub.i64 q3,q4,q7
+	vst1.8 d0,[r2,: 64]
+	add r4,r4,#16
+	vst1.8 d1,[r4,: 64]
+	vtrn.32 d4,d2
+	vtrn.32 d5,d3
+	sub r2,r2,#8
+	sub r4,r4,#8
+	vtrn.32 d6,d12
+	vtrn.32 d7,d13
+	vst1.8 d4,[r2,: 64]
+	vst1.8 d5,[r4,: 64]
+	sub r2,r2,#24
+	sub r4,r4,#24
+	vst1.8 d6,[r2,: 64]
+	vst1.8 d7,[r4,: 64]
+	add r2,r3,#240
+	add r4,r3,#96
+	vld1.8 {d0-d1},[r4,: 128]!
+	vld1.8 {d2-d3},[r4,: 128]!
+	vld1.8 {d4},[r4,: 64]
+	add r4,r3,#144
+	vld1.8 {d6-d7},[r4,: 128]!
+	vtrn.32 q0,q3
+	vld1.8 {d8-d9},[r4,: 128]!
+	vshl.i32 q5,q0,#4
+	vtrn.32 q1,q4
+	vshl.i32 q6,q3,#4
+	vadd.i32 q5,q5,q0
+	vadd.i32 q6,q6,q3
+	vshl.i32 q7,q1,#4
+	vld1.8 {d5},[r4,: 64]
+	vshl.i32 q8,q4,#4
+	vtrn.32 d4,d5
+	vadd.i32 q7,q7,q1
+	vadd.i32 q8,q8,q4
+	vld1.8 {d18-d19},[r2,: 128]!
+	vshl.i32 q10,q2,#4
+	vld1.8 {d22-d23},[r2,: 128]!
+	vadd.i32 q10,q10,q2
+	vld1.8 {d24},[r2,: 64]
+	vadd.i32 q5,q5,q0
+	add r2,r3,#192
+	vld1.8 {d26-d27},[r2,: 128]!
+	vadd.i32 q6,q6,q3
+	vld1.8 {d28-d29},[r2,: 128]!
+	vadd.i32 q8,q8,q4
+	vld1.8 {d25},[r2,: 64]
+	vadd.i32 q10,q10,q2
+	vtrn.32 q9,q13
+	vadd.i32 q7,q7,q1
+	vadd.i32 q5,q5,q0
+	vtrn.32 q11,q14
+	vadd.i32 q6,q6,q3
+	add r2,sp,#560
+	vadd.i32 q10,q10,q2
+	vtrn.32 d24,d25
+	vst1.8 {d12-d13},[r2,: 128]
+	vshl.i32 q6,q13,#1
+	add r2,sp,#576
+	vst1.8 {d20-d21},[r2,: 128]
+	vshl.i32 q10,q14,#1
+	add r2,sp,#592
+	vst1.8 {d12-d13},[r2,: 128]
+	vshl.i32 q15,q12,#1
+	vadd.i32 q8,q8,q4
+	vext.32 d10,d31,d30,#0
+	vadd.i32 q7,q7,q1
+	add r2,sp,#608
+	vst1.8 {d16-d17},[r2,: 128]
+	vmull.s32 q8,d18,d5
+	vmlal.s32 q8,d26,d4
+	vmlal.s32 q8,d19,d9
+	vmlal.s32 q8,d27,d3
+	vmlal.s32 q8,d22,d8
+	vmlal.s32 q8,d28,d2
+	vmlal.s32 q8,d23,d7
+	vmlal.s32 q8,d29,d1
+	vmlal.s32 q8,d24,d6
+	vmlal.s32 q8,d25,d0
+	add r2,sp,#624
+	vst1.8 {d14-d15},[r2,: 128]
+	vmull.s32 q2,d18,d4
+	vmlal.s32 q2,d12,d9
+	vmlal.s32 q2,d13,d8
+	vmlal.s32 q2,d19,d3
+	vmlal.s32 q2,d22,d2
+	vmlal.s32 q2,d23,d1
+	vmlal.s32 q2,d24,d0
+	add r2,sp,#640
+	vst1.8 {d20-d21},[r2,: 128]
+	vmull.s32 q7,d18,d9
+	vmlal.s32 q7,d26,d3
+	vmlal.s32 q7,d19,d8
+	vmlal.s32 q7,d27,d2
+	vmlal.s32 q7,d22,d7
+	vmlal.s32 q7,d28,d1
+	vmlal.s32 q7,d23,d6
+	vmlal.s32 q7,d29,d0
+	add r2,sp,#656
+	vst1.8 {d10-d11},[r2,: 128]
+	vmull.s32 q5,d18,d3
+	vmlal.s32 q5,d19,d2
+	vmlal.s32 q5,d22,d1
+	vmlal.s32 q5,d23,d0
+	vmlal.s32 q5,d12,d8
+	add r2,sp,#672
+	vst1.8 {d16-d17},[r2,: 128]
+	vmull.s32 q4,d18,d8
+	vmlal.s32 q4,d26,d2
+	vmlal.s32 q4,d19,d7
+	vmlal.s32 q4,d27,d1
+	vmlal.s32 q4,d22,d6
+	vmlal.s32 q4,d28,d0
+	vmull.s32 q8,d18,d7
+	vmlal.s32 q8,d26,d1
+	vmlal.s32 q8,d19,d6
+	vmlal.s32 q8,d27,d0
+	add r2,sp,#576
+	vld1.8 {d20-d21},[r2,: 128]
+	vmlal.s32 q7,d24,d21
+	vmlal.s32 q7,d25,d20
+	vmlal.s32 q4,d23,d21
+	vmlal.s32 q4,d29,d20
+	vmlal.s32 q8,d22,d21
+	vmlal.s32 q8,d28,d20
+	vmlal.s32 q5,d24,d20
+	add r2,sp,#576
+	vst1.8 {d14-d15},[r2,: 128]
+	vmull.s32 q7,d18,d6
+	vmlal.s32 q7,d26,d0
+	add r2,sp,#656
+	vld1.8 {d30-d31},[r2,: 128]
+	vmlal.s32 q2,d30,d21
+	vmlal.s32 q7,d19,d21
+	vmlal.s32 q7,d27,d20
+	add r2,sp,#624
+	vld1.8 {d26-d27},[r2,: 128]
+	vmlal.s32 q4,d25,d27
+	vmlal.s32 q8,d29,d27
+	vmlal.s32 q8,d25,d26
+	vmlal.s32 q7,d28,d27
+	vmlal.s32 q7,d29,d26
+	add r2,sp,#608
+	vld1.8 {d28-d29},[r2,: 128]
+	vmlal.s32 q4,d24,d29
+	vmlal.s32 q8,d23,d29
+	vmlal.s32 q8,d24,d28
+	vmlal.s32 q7,d22,d29
+	vmlal.s32 q7,d23,d28
+	add r2,sp,#608
+	vst1.8 {d8-d9},[r2,: 128]
+	add r2,sp,#560
+	vld1.8 {d8-d9},[r2,: 128]
+	vmlal.s32 q7,d24,d9
+	vmlal.s32 q7,d25,d31
+	vmull.s32 q1,d18,d2
+	vmlal.s32 q1,d19,d1
+	vmlal.s32 q1,d22,d0
+	vmlal.s32 q1,d24,d27
+	vmlal.s32 q1,d23,d20
+	vmlal.s32 q1,d12,d7
+	vmlal.s32 q1,d13,d6
+	vmull.s32 q6,d18,d1
+	vmlal.s32 q6,d19,d0
+	vmlal.s32 q6,d23,d27
+	vmlal.s32 q6,d22,d20
+	vmlal.s32 q6,d24,d26
+	vmull.s32 q0,d18,d0
+	vmlal.s32 q0,d22,d27
+	vmlal.s32 q0,d23,d26
+	vmlal.s32 q0,d24,d31
+	vmlal.s32 q0,d19,d20
+	add r2,sp,#640
+	vld1.8 {d18-d19},[r2,: 128]
+	vmlal.s32 q2,d18,d7
+	vmlal.s32 q2,d19,d6
+	vmlal.s32 q5,d18,d6
+	vmlal.s32 q5,d19,d21
+	vmlal.s32 q1,d18,d21
+	vmlal.s32 q1,d19,d29
+	vmlal.s32 q0,d18,d28
+	vmlal.s32 q0,d19,d9
+	vmlal.s32 q6,d18,d29
+	vmlal.s32 q6,d19,d28
+	add r2,sp,#592
+	vld1.8 {d18-d19},[r2,: 128]
+	add r2,sp,#512
+	vld1.8 {d22-d23},[r2,: 128]
+	vmlal.s32 q5,d19,d7
+	vmlal.s32 q0,d18,d21
+	vmlal.s32 q0,d19,d29
+	vmlal.s32 q6,d18,d6
+	add r2,sp,#528
+	vld1.8 {d6-d7},[r2,: 128]
+	vmlal.s32 q6,d19,d21
+	add r2,sp,#576
+	vld1.8 {d18-d19},[r2,: 128]
+	vmlal.s32 q0,d30,d8
+	add r2,sp,#672
+	vld1.8 {d20-d21},[r2,: 128]
+	vmlal.s32 q5,d30,d29
+	add r2,sp,#608
+	vld1.8 {d24-d25},[r2,: 128]
+	vmlal.s32 q1,d30,d28
+	vadd.i64 q13,q0,q11
+	vadd.i64 q14,q5,q11
+	vmlal.s32 q6,d30,d9
+	vshr.s64 q4,q13,#26
+	vshr.s64 q13,q14,#26
+	vadd.i64 q7,q7,q4
+	vshl.i64 q4,q4,#26
+	vadd.i64 q14,q7,q3
+	vadd.i64 q9,q9,q13
+	vshl.i64 q13,q13,#26
+	vadd.i64 q15,q9,q3
+	vsub.i64 q0,q0,q4
+	vshr.s64 q4,q14,#25
+	vsub.i64 q5,q5,q13
+	vshr.s64 q13,q15,#25
+	vadd.i64 q6,q6,q4
+	vshl.i64 q4,q4,#25
+	vadd.i64 q14,q6,q11
+	vadd.i64 q2,q2,q13
+	vsub.i64 q4,q7,q4
+	vshr.s64 q7,q14,#26
+	vshl.i64 q13,q13,#25
+	vadd.i64 q14,q2,q11
+	vadd.i64 q8,q8,q7
+	vshl.i64 q7,q7,#26
+	vadd.i64 q15,q8,q3
+	vsub.i64 q9,q9,q13
+	vshr.s64 q13,q14,#26
+	vsub.i64 q6,q6,q7
+	vshr.s64 q7,q15,#25
+	vadd.i64 q10,q10,q13
+	vshl.i64 q13,q13,#26
+	vadd.i64 q14,q10,q3
+	vadd.i64 q1,q1,q7
+	add r2,r3,#144
+	vshl.i64 q7,q7,#25
+	add r4,r3,#96
+	vadd.i64 q15,q1,q11
+	add r2,r2,#8
+	vsub.i64 q2,q2,q13
+	add r4,r4,#8
+	vshr.s64 q13,q14,#25
+	vsub.i64 q7,q8,q7
+	vshr.s64 q8,q15,#26
+	vadd.i64 q14,q13,q13
+	vadd.i64 q12,q12,q8
+	vtrn.32 d12,d14
+	vshl.i64 q8,q8,#26
+	vtrn.32 d13,d15
+	vadd.i64 q3,q12,q3
+	vadd.i64 q0,q0,q14
+	vst1.8 d12,[r2,: 64]!
+	vshl.i64 q7,q13,#4
+	vst1.8 d13,[r4,: 64]!
+	vsub.i64 q1,q1,q8
+	vshr.s64 q3,q3,#25
+	vadd.i64 q0,q0,q7
+	vadd.i64 q5,q5,q3
+	vshl.i64 q3,q3,#25
+	vadd.i64 q6,q5,q11
+	vadd.i64 q0,q0,q13
+	vshl.i64 q7,q13,#25
+	vadd.i64 q8,q0,q11
+	vsub.i64 q3,q12,q3
+	vshr.s64 q6,q6,#26
+	vsub.i64 q7,q10,q7
+	vtrn.32 d2,d6
+	vshr.s64 q8,q8,#26
+	vtrn.32 d3,d7
+	vadd.i64 q3,q9,q6
+	vst1.8 d2,[r2,: 64]
+	vshl.i64 q6,q6,#26
+	vst1.8 d3,[r4,: 64]
+	vadd.i64 q1,q4,q8
+	vtrn.32 d4,d14
+	vshl.i64 q4,q8,#26
+	vtrn.32 d5,d15
+	vsub.i64 q5,q5,q6
+	add r2,r2,#16
+	vsub.i64 q0,q0,q4
+	vst1.8 d4,[r2,: 64]
+	add r4,r4,#16
+	vst1.8 d5,[r4,: 64]
+	vtrn.32 d10,d6
+	vtrn.32 d11,d7
+	sub r2,r2,#8
+	sub r4,r4,#8
+	vtrn.32 d0,d2
+	vtrn.32 d1,d3
+	vst1.8 d10,[r2,: 64]
+	vst1.8 d11,[r4,: 64]
+	sub r2,r2,#24
+	sub r4,r4,#24
+	vst1.8 d0,[r2,: 64]
+	vst1.8 d1,[r4,: 64]
+	add r2,r3,#288
+	add r4,r3,#336
+	vld1.8 {d0-d1},[r2,: 128]!
+	vld1.8 {d2-d3},[r4,: 128]!
+	vsub.i32 q0,q0,q1
+	vld1.8 {d2-d3},[r2,: 128]!
+	vld1.8 {d4-d5},[r4,: 128]!
+	vsub.i32 q1,q1,q2
+	add r5,r3,#240
+	vld1.8 {d4},[r2,: 64]
+	vld1.8 {d6},[r4,: 64]
+	vsub.i32 q2,q2,q3
+	vst1.8 {d0-d1},[r5,: 128]!
+	vst1.8 {d2-d3},[r5,: 128]!
+	vst1.8 d4,[r5,: 64]
+	add r2,r3,#144
+	add r4,r3,#96
+	add r5,r3,#144
+	add r6,r3,#192
+	vld1.8 {d0-d1},[r2,: 128]!
+	vld1.8 {d2-d3},[r4,: 128]!
+	vsub.i32 q2,q0,q1
+	vadd.i32 q0,q0,q1
+	vld1.8 {d2-d3},[r2,: 128]!
+	vld1.8 {d6-d7},[r4,: 128]!
+	vsub.i32 q4,q1,q3
+	vadd.i32 q1,q1,q3
+	vld1.8 {d6},[r2,: 64]
+	vld1.8 {d10},[r4,: 64]
+	vsub.i32 q6,q3,q5
+	vadd.i32 q3,q3,q5
+	vst1.8 {d4-d5},[r5,: 128]!
+	vst1.8 {d0-d1},[r6,: 128]!
+	vst1.8 {d8-d9},[r5,: 128]!
+	vst1.8 {d2-d3},[r6,: 128]!
+	vst1.8 d12,[r5,: 64]
+	vst1.8 d6,[r6,: 64]
+	add r2,r3,#0
+	add r4,r3,#240
+	vld1.8 {d0-d1},[r4,: 128]!
+	vld1.8 {d2-d3},[r4,: 128]!
+	vld1.8 {d4},[r4,: 64]
+	add r4,r3,#336
+	vld1.8 {d6-d7},[r4,: 128]!
+	vtrn.32 q0,q3
+	vld1.8 {d8-d9},[r4,: 128]!
+	vshl.i32 q5,q0,#4
+	vtrn.32 q1,q4
+	vshl.i32 q6,q3,#4
+	vadd.i32 q5,q5,q0
+	vadd.i32 q6,q6,q3
+	vshl.i32 q7,q1,#4
+	vld1.8 {d5},[r4,: 64]
+	vshl.i32 q8,q4,#4
+	vtrn.32 d4,d5
+	vadd.i32 q7,q7,q1
+	vadd.i32 q8,q8,q4
+	vld1.8 {d18-d19},[r2,: 128]!
+	vshl.i32 q10,q2,#4
+	vld1.8 {d22-d23},[r2,: 128]!
+	vadd.i32 q10,q10,q2
+	vld1.8 {d24},[r2,: 64]
+	vadd.i32 q5,q5,q0
+	add r2,r3,#288
+	vld1.8 {d26-d27},[r2,: 128]!
+	vadd.i32 q6,q6,q3
+	vld1.8 {d28-d29},[r2,: 128]!
+	vadd.i32 q8,q8,q4
+	vld1.8 {d25},[r2,: 64]
+	vadd.i32 q10,q10,q2
+	vtrn.32 q9,q13
+	vadd.i32 q7,q7,q1
+	vadd.i32 q5,q5,q0
+	vtrn.32 q11,q14
+	vadd.i32 q6,q6,q3
+	add r2,sp,#560
+	vadd.i32 q10,q10,q2
+	vtrn.32 d24,d25
+	vst1.8 {d12-d13},[r2,: 128]
+	vshl.i32 q6,q13,#1
+	add r2,sp,#576
+	vst1.8 {d20-d21},[r2,: 128]
+	vshl.i32 q10,q14,#1
+	add r2,sp,#592
+	vst1.8 {d12-d13},[r2,: 128]
+	vshl.i32 q15,q12,#1
+	vadd.i32 q8,q8,q4
+	vext.32 d10,d31,d30,#0
+	vadd.i32 q7,q7,q1
+	add r2,sp,#608
+	vst1.8 {d16-d17},[r2,: 128]
+	vmull.s32 q8,d18,d5
+	vmlal.s32 q8,d26,d4
+	vmlal.s32 q8,d19,d9
+	vmlal.s32 q8,d27,d3
+	vmlal.s32 q8,d22,d8
+	vmlal.s32 q8,d28,d2
+	vmlal.s32 q8,d23,d7
+	vmlal.s32 q8,d29,d1
+	vmlal.s32 q8,d24,d6
+	vmlal.s32 q8,d25,d0
+	add r2,sp,#624
+	vst1.8 {d14-d15},[r2,: 128]
+	vmull.s32 q2,d18,d4
+	vmlal.s32 q2,d12,d9
+	vmlal.s32 q2,d13,d8
+	vmlal.s32 q2,d19,d3
+	vmlal.s32 q2,d22,d2
+	vmlal.s32 q2,d23,d1
+	vmlal.s32 q2,d24,d0
+	add r2,sp,#640
+	vst1.8 {d20-d21},[r2,: 128]
+	vmull.s32 q7,d18,d9
+	vmlal.s32 q7,d26,d3
+	vmlal.s32 q7,d19,d8
+	vmlal.s32 q7,d27,d2
+	vmlal.s32 q7,d22,d7
+	vmlal.s32 q7,d28,d1
+	vmlal.s32 q7,d23,d6
+	vmlal.s32 q7,d29,d0
+	add r2,sp,#656
+	vst1.8 {d10-d11},[r2,: 128]
+	vmull.s32 q5,d18,d3
+	vmlal.s32 q5,d19,d2
+	vmlal.s32 q5,d22,d1
+	vmlal.s32 q5,d23,d0
+	vmlal.s32 q5,d12,d8
+	add r2,sp,#672
+	vst1.8 {d16-d17},[r2,: 128]
+	vmull.s32 q4,d18,d8
+	vmlal.s32 q4,d26,d2
+	vmlal.s32 q4,d19,d7
+	vmlal.s32 q4,d27,d1
+	vmlal.s32 q4,d22,d6
+	vmlal.s32 q4,d28,d0
+	vmull.s32 q8,d18,d7
+	vmlal.s32 q8,d26,d1
+	vmlal.s32 q8,d19,d6
+	vmlal.s32 q8,d27,d0
+	add r2,sp,#576
+	vld1.8 {d20-d21},[r2,: 128]
+	vmlal.s32 q7,d24,d21
+	vmlal.s32 q7,d25,d20
+	vmlal.s32 q4,d23,d21
+	vmlal.s32 q4,d29,d20
+	vmlal.s32 q8,d22,d21
+	vmlal.s32 q8,d28,d20
+	vmlal.s32 q5,d24,d20
+	add r2,sp,#576
+	vst1.8 {d14-d15},[r2,: 128]
+	vmull.s32 q7,d18,d6
+	vmlal.s32 q7,d26,d0
+	add r2,sp,#656
+	vld1.8 {d30-d31},[r2,: 128]
+	vmlal.s32 q2,d30,d21
+	vmlal.s32 q7,d19,d21
+	vmlal.s32 q7,d27,d20
+	add r2,sp,#624
+	vld1.8 {d26-d27},[r2,: 128]
+	vmlal.s32 q4,d25,d27
+	vmlal.s32 q8,d29,d27
+	vmlal.s32 q8,d25,d26
+	vmlal.s32 q7,d28,d27
+	vmlal.s32 q7,d29,d26
+	add r2,sp,#608
+	vld1.8 {d28-d29},[r2,: 128]
+	vmlal.s32 q4,d24,d29
+	vmlal.s32 q8,d23,d29
+	vmlal.s32 q8,d24,d28
+	vmlal.s32 q7,d22,d29
+	vmlal.s32 q7,d23,d28
+	add r2,sp,#608
+	vst1.8 {d8-d9},[r2,: 128]
+	add r2,sp,#560
+	vld1.8 {d8-d9},[r2,: 128]
+	vmlal.s32 q7,d24,d9
+	vmlal.s32 q7,d25,d31
+	vmull.s32 q1,d18,d2
+	vmlal.s32 q1,d19,d1
+	vmlal.s32 q1,d22,d0
+	vmlal.s32 q1,d24,d27
+	vmlal.s32 q1,d23,d20
+	vmlal.s32 q1,d12,d7
+	vmlal.s32 q1,d13,d6
+	vmull.s32 q6,d18,d1
+	vmlal.s32 q6,d19,d0
+	vmlal.s32 q6,d23,d27
+	vmlal.s32 q6,d22,d20
+	vmlal.s32 q6,d24,d26
+	vmull.s32 q0,d18,d0
+	vmlal.s32 q0,d22,d27
+	vmlal.s32 q0,d23,d26
+	vmlal.s32 q0,d24,d31
+	vmlal.s32 q0,d19,d20
+	add r2,sp,#640
+	vld1.8 {d18-d19},[r2,: 128]
+	vmlal.s32 q2,d18,d7
+	vmlal.s32 q2,d19,d6
+	vmlal.s32 q5,d18,d6
+	vmlal.s32 q5,d19,d21
+	vmlal.s32 q1,d18,d21
+	vmlal.s32 q1,d19,d29
+	vmlal.s32 q0,d18,d28
+	vmlal.s32 q0,d19,d9
+	vmlal.s32 q6,d18,d29
+	vmlal.s32 q6,d19,d28
+	add r2,sp,#592
+	vld1.8 {d18-d19},[r2,: 128]
+	add r2,sp,#512
+	vld1.8 {d22-d23},[r2,: 128]
+	vmlal.s32 q5,d19,d7
+	vmlal.s32 q0,d18,d21
+	vmlal.s32 q0,d19,d29
+	vmlal.s32 q6,d18,d6
+	add r2,sp,#528
+	vld1.8 {d6-d7},[r2,: 128]
+	vmlal.s32 q6,d19,d21
+	add r2,sp,#576
+	vld1.8 {d18-d19},[r2,: 128]
+	vmlal.s32 q0,d30,d8
+	add r2,sp,#672
+	vld1.8 {d20-d21},[r2,: 128]
+	vmlal.s32 q5,d30,d29
+	add r2,sp,#608
+	vld1.8 {d24-d25},[r2,: 128]
+	vmlal.s32 q1,d30,d28
+	vadd.i64 q13,q0,q11
+	vadd.i64 q14,q5,q11
+	vmlal.s32 q6,d30,d9
+	vshr.s64 q4,q13,#26
+	vshr.s64 q13,q14,#26
+	vadd.i64 q7,q7,q4
+	vshl.i64 q4,q4,#26
+	vadd.i64 q14,q7,q3
+	vadd.i64 q9,q9,q13
+	vshl.i64 q13,q13,#26
+	vadd.i64 q15,q9,q3
+	vsub.i64 q0,q0,q4
+	vshr.s64 q4,q14,#25
+	vsub.i64 q5,q5,q13
+	vshr.s64 q13,q15,#25
+	vadd.i64 q6,q6,q4
+	vshl.i64 q4,q4,#25
+	vadd.i64 q14,q6,q11
+	vadd.i64 q2,q2,q13
+	vsub.i64 q4,q7,q4
+	vshr.s64 q7,q14,#26
+	vshl.i64 q13,q13,#25
+	vadd.i64 q14,q2,q11
+	vadd.i64 q8,q8,q7
+	vshl.i64 q7,q7,#26
+	vadd.i64 q15,q8,q3
+	vsub.i64 q9,q9,q13
+	vshr.s64 q13,q14,#26
+	vsub.i64 q6,q6,q7
+	vshr.s64 q7,q15,#25
+	vadd.i64 q10,q10,q13
+	vshl.i64 q13,q13,#26
+	vadd.i64 q14,q10,q3
+	vadd.i64 q1,q1,q7
+	add r2,r3,#288
+	vshl.i64 q7,q7,#25
+	add r4,r3,#96
+	vadd.i64 q15,q1,q11
+	add r2,r2,#8
+	vsub.i64 q2,q2,q13
+	add r4,r4,#8
+	vshr.s64 q13,q14,#25
+	vsub.i64 q7,q8,q7
+	vshr.s64 q8,q15,#26
+	vadd.i64 q14,q13,q13
+	vadd.i64 q12,q12,q8
+	vtrn.32 d12,d14
+	vshl.i64 q8,q8,#26
+	vtrn.32 d13,d15
+	vadd.i64 q3,q12,q3
+	vadd.i64 q0,q0,q14
+	vst1.8 d12,[r2,: 64]!
+	vshl.i64 q7,q13,#4
+	vst1.8 d13,[r4,: 64]!
+	vsub.i64 q1,q1,q8
+	vshr.s64 q3,q3,#25
+	vadd.i64 q0,q0,q7
+	vadd.i64 q5,q5,q3
+	vshl.i64 q3,q3,#25
+	vadd.i64 q6,q5,q11
+	vadd.i64 q0,q0,q13
+	vshl.i64 q7,q13,#25
+	vadd.i64 q8,q0,q11
+	vsub.i64 q3,q12,q3
+	vshr.s64 q6,q6,#26
+	vsub.i64 q7,q10,q7
+	vtrn.32 d2,d6
+	vshr.s64 q8,q8,#26
+	vtrn.32 d3,d7
+	vadd.i64 q3,q9,q6
+	vst1.8 d2,[r2,: 64]
+	vshl.i64 q6,q6,#26
+	vst1.8 d3,[r4,: 64]
+	vadd.i64 q1,q4,q8
+	vtrn.32 d4,d14
+	vshl.i64 q4,q8,#26
+	vtrn.32 d5,d15
+	vsub.i64 q5,q5,q6
+	add r2,r2,#16
+	vsub.i64 q0,q0,q4
+	vst1.8 d4,[r2,: 64]
+	add r4,r4,#16
+	vst1.8 d5,[r4,: 64]
+	vtrn.32 d10,d6
+	vtrn.32 d11,d7
+	sub r2,r2,#8
+	sub r4,r4,#8
+	vtrn.32 d0,d2
+	vtrn.32 d1,d3
+	vst1.8 d10,[r2,: 64]
+	vst1.8 d11,[r4,: 64]
+	sub r2,r2,#24
+	sub r4,r4,#24
+	vst1.8 d0,[r2,: 64]
+	vst1.8 d1,[r4,: 64]
+	add r2,sp,#544
+	add r4,r3,#144
+	add r5,r3,#192
+	vld1.8 {d0-d1},[r2,: 128]
+	vld1.8 {d2-d3},[r4,: 128]!
+	vld1.8 {d4-d5},[r5,: 128]!
+	vzip.i32 q1,q2
+	vld1.8 {d6-d7},[r4,: 128]!
+	vld1.8 {d8-d9},[r5,: 128]!
+	vshl.i32 q5,q1,#1
+	vzip.i32 q3,q4
+	vshl.i32 q6,q2,#1
+	vld1.8 {d14},[r4,: 64]
+	vshl.i32 q8,q3,#1
+	vld1.8 {d15},[r5,: 64]
+	vshl.i32 q9,q4,#1
+	vmul.i32 d21,d7,d1
+	vtrn.32 d14,d15
+	vmul.i32 q11,q4,q0
+	vmul.i32 q0,q7,q0
+	vmull.s32 q12,d2,d2
+	vmlal.s32 q12,d11,d1
+	vmlal.s32 q12,d12,d0
+	vmlal.s32 q12,d13,d23
+	vmlal.s32 q12,d16,d22
+	vmlal.s32 q12,d7,d21
+	vmull.s32 q10,d2,d11
+	vmlal.s32 q10,d4,d1
+	vmlal.s32 q10,d13,d0
+	vmlal.s32 q10,d6,d23
+	vmlal.s32 q10,d17,d22
+	vmull.s32 q13,d10,d4
+	vmlal.s32 q13,d11,d3
+	vmlal.s32 q13,d13,d1
+	vmlal.s32 q13,d16,d0
+	vmlal.s32 q13,d17,d23
+	vmlal.s32 q13,d8,d22
+	vmull.s32 q1,d10,d5
+	vmlal.s32 q1,d11,d4
+	vmlal.s32 q1,d6,d1
+	vmlal.s32 q1,d17,d0
+	vmlal.s32 q1,d8,d23
+	vmull.s32 q14,d10,d6
+	vmlal.s32 q14,d11,d13
+	vmlal.s32 q14,d4,d4
+	vmlal.s32 q14,d17,d1
+	vmlal.s32 q14,d18,d0
+	vmlal.s32 q14,d9,d23
+	vmull.s32 q11,d10,d7
+	vmlal.s32 q11,d11,d6
+	vmlal.s32 q11,d12,d5
+	vmlal.s32 q11,d8,d1
+	vmlal.s32 q11,d19,d0
+	vmull.s32 q15,d10,d8
+	vmlal.s32 q15,d11,d17
+	vmlal.s32 q15,d12,d6
+	vmlal.s32 q15,d13,d5
+	vmlal.s32 q15,d19,d1
+	vmlal.s32 q15,d14,d0
+	vmull.s32 q2,d10,d9
+	vmlal.s32 q2,d11,d8
+	vmlal.s32 q2,d12,d7
+	vmlal.s32 q2,d13,d6
+	vmlal.s32 q2,d14,d1
+	vmull.s32 q0,d15,d1
+	vmlal.s32 q0,d10,d14
+	vmlal.s32 q0,d11,d19
+	vmlal.s32 q0,d12,d8
+	vmlal.s32 q0,d13,d17
+	vmlal.s32 q0,d6,d6
+	add r2,sp,#512
+	vld1.8 {d18-d19},[r2,: 128]
+	vmull.s32 q3,d16,d7
+	vmlal.s32 q3,d10,d15
+	vmlal.s32 q3,d11,d14
+	vmlal.s32 q3,d12,d9
+	vmlal.s32 q3,d13,d8
+	add r2,sp,#528
+	vld1.8 {d8-d9},[r2,: 128]
+	vadd.i64 q5,q12,q9
+	vadd.i64 q6,q15,q9
+	vshr.s64 q5,q5,#26
+	vshr.s64 q6,q6,#26
+	vadd.i64 q7,q10,q5
+	vshl.i64 q5,q5,#26
+	vadd.i64 q8,q7,q4
+	vadd.i64 q2,q2,q6
+	vshl.i64 q6,q6,#26
+	vadd.i64 q10,q2,q4
+	vsub.i64 q5,q12,q5
+	vshr.s64 q8,q8,#25
+	vsub.i64 q6,q15,q6
+	vshr.s64 q10,q10,#25
+	vadd.i64 q12,q13,q8
+	vshl.i64 q8,q8,#25
+	vadd.i64 q13,q12,q9
+	vadd.i64 q0,q0,q10
+	vsub.i64 q7,q7,q8
+	vshr.s64 q8,q13,#26
+	vshl.i64 q10,q10,#25
+	vadd.i64 q13,q0,q9
+	vadd.i64 q1,q1,q8
+	vshl.i64 q8,q8,#26
+	vadd.i64 q15,q1,q4
+	vsub.i64 q2,q2,q10
+	vshr.s64 q10,q13,#26
+	vsub.i64 q8,q12,q8
+	vshr.s64 q12,q15,#25
+	vadd.i64 q3,q3,q10
+	vshl.i64 q10,q10,#26
+	vadd.i64 q13,q3,q4
+	vadd.i64 q14,q14,q12
+	add r2,r3,#144
+	vshl.i64 q12,q12,#25
+	add r4,r3,#192
+	vadd.i64 q15,q14,q9
+	add r2,r2,#8
+	vsub.i64 q0,q0,q10
+	add r4,r4,#8
+	vshr.s64 q10,q13,#25
+	vsub.i64 q1,q1,q12
+	vshr.s64 q12,q15,#26
+	vadd.i64 q13,q10,q10
+	vadd.i64 q11,q11,q12
+	vtrn.32 d16,d2
+	vshl.i64 q12,q12,#26
+	vtrn.32 d17,d3
+	vadd.i64 q1,q11,q4
+	vadd.i64 q4,q5,q13
+	vst1.8 d16,[r2,: 64]!
+	vshl.i64 q5,q10,#4
+	vst1.8 d17,[r4,: 64]!
+	vsub.i64 q8,q14,q12
+	vshr.s64 q1,q1,#25
+	vadd.i64 q4,q4,q5
+	vadd.i64 q5,q6,q1
+	vshl.i64 q1,q1,#25
+	vadd.i64 q6,q5,q9
+	vadd.i64 q4,q4,q10
+	vshl.i64 q10,q10,#25
+	vadd.i64 q9,q4,q9
+	vsub.i64 q1,q11,q1
+	vshr.s64 q6,q6,#26
+	vsub.i64 q3,q3,q10
+	vtrn.32 d16,d2
+	vshr.s64 q9,q9,#26
+	vtrn.32 d17,d3
+	vadd.i64 q1,q2,q6
+	vst1.8 d16,[r2,: 64]
+	vshl.i64 q2,q6,#26
+	vst1.8 d17,[r4,: 64]
+	vadd.i64 q6,q7,q9
+	vtrn.32 d0,d6
+	vshl.i64 q7,q9,#26
+	vtrn.32 d1,d7
+	vsub.i64 q2,q5,q2
+	add r2,r2,#16
+	vsub.i64 q3,q4,q7
+	vst1.8 d0,[r2,: 64]
+	add r4,r4,#16
+	vst1.8 d1,[r4,: 64]
+	vtrn.32 d4,d2
+	vtrn.32 d5,d3
+	sub r2,r2,#8
+	sub r4,r4,#8
+	vtrn.32 d6,d12
+	vtrn.32 d7,d13
+	vst1.8 d4,[r2,: 64]
+	vst1.8 d5,[r4,: 64]
+	sub r2,r2,#24
+	sub r4,r4,#24
+	vst1.8 d6,[r2,: 64]
+	vst1.8 d7,[r4,: 64]
+	add r2,r3,#336
+	add r4,r3,#288
+	vld1.8 {d0-d1},[r2,: 128]!
+	vld1.8 {d2-d3},[r4,: 128]!
+	vadd.i32 q0,q0,q1
+	vld1.8 {d2-d3},[r2,: 128]!
+	vld1.8 {d4-d5},[r4,: 128]!
+	vadd.i32 q1,q1,q2
+	add r5,r3,#288
+	vld1.8 {d4},[r2,: 64]
+	vld1.8 {d6},[r4,: 64]
+	vadd.i32 q2,q2,q3
+	vst1.8 {d0-d1},[r5,: 128]!
+	vst1.8 {d2-d3},[r5,: 128]!
+	vst1.8 d4,[r5,: 64]
+	add r2,r3,#48
+	add r4,r3,#144
+	vld1.8 {d0-d1},[r4,: 128]!
+	vld1.8 {d2-d3},[r4,: 128]!
+	vld1.8 {d4},[r4,: 64]
+	add r4,r3,#288
+	vld1.8 {d6-d7},[r4,: 128]!
+	vtrn.32 q0,q3
+	vld1.8 {d8-d9},[r4,: 128]!
+	vshl.i32 q5,q0,#4
+	vtrn.32 q1,q4
+	vshl.i32 q6,q3,#4
+	vadd.i32 q5,q5,q0
+	vadd.i32 q6,q6,q3
+	vshl.i32 q7,q1,#4
+	vld1.8 {d5},[r4,: 64]
+	vshl.i32 q8,q4,#4
+	vtrn.32 d4,d5
+	vadd.i32 q7,q7,q1
+	vadd.i32 q8,q8,q4
+	vld1.8 {d18-d19},[r2,: 128]!
+	vshl.i32 q10,q2,#4
+	vld1.8 {d22-d23},[r2,: 128]!
+	vadd.i32 q10,q10,q2
+	vld1.8 {d24},[r2,: 64]
+	vadd.i32 q5,q5,q0
+	add r2,r3,#240
+	vld1.8 {d26-d27},[r2,: 128]!
+	vadd.i32 q6,q6,q3
+	vld1.8 {d28-d29},[r2,: 128]!
+	vadd.i32 q8,q8,q4
+	vld1.8 {d25},[r2,: 64]
+	vadd.i32 q10,q10,q2
+	vtrn.32 q9,q13
+	vadd.i32 q7,q7,q1
+	vadd.i32 q5,q5,q0
+	vtrn.32 q11,q14
+	vadd.i32 q6,q6,q3
+	add r2,sp,#560
+	vadd.i32 q10,q10,q2
+	vtrn.32 d24,d25
+	vst1.8 {d12-d13},[r2,: 128]
+	vshl.i32 q6,q13,#1
+	add r2,sp,#576
+	vst1.8 {d20-d21},[r2,: 128]
+	vshl.i32 q10,q14,#1
+	add r2,sp,#592
+	vst1.8 {d12-d13},[r2,: 128]
+	vshl.i32 q15,q12,#1
+	vadd.i32 q8,q8,q4
+	vext.32 d10,d31,d30,#0
+	vadd.i32 q7,q7,q1
+	add r2,sp,#608
+	vst1.8 {d16-d17},[r2,: 128]
+	vmull.s32 q8,d18,d5
+	vmlal.s32 q8,d26,d4
+	vmlal.s32 q8,d19,d9
+	vmlal.s32 q8,d27,d3
+	vmlal.s32 q8,d22,d8
+	vmlal.s32 q8,d28,d2
+	vmlal.s32 q8,d23,d7
+	vmlal.s32 q8,d29,d1
+	vmlal.s32 q8,d24,d6
+	vmlal.s32 q8,d25,d0
+	add r2,sp,#624
+	vst1.8 {d14-d15},[r2,: 128]
+	vmull.s32 q2,d18,d4
+	vmlal.s32 q2,d12,d9
+	vmlal.s32 q2,d13,d8
+	vmlal.s32 q2,d19,d3
+	vmlal.s32 q2,d22,d2
+	vmlal.s32 q2,d23,d1
+	vmlal.s32 q2,d24,d0
+	add r2,sp,#640
+	vst1.8 {d20-d21},[r2,: 128]
+	vmull.s32 q7,d18,d9
+	vmlal.s32 q7,d26,d3
+	vmlal.s32 q7,d19,d8
+	vmlal.s32 q7,d27,d2
+	vmlal.s32 q7,d22,d7
+	vmlal.s32 q7,d28,d1
+	vmlal.s32 q7,d23,d6
+	vmlal.s32 q7,d29,d0
+	add r2,sp,#656
+	vst1.8 {d10-d11},[r2,: 128]
+	vmull.s32 q5,d18,d3
+	vmlal.s32 q5,d19,d2
+	vmlal.s32 q5,d22,d1
+	vmlal.s32 q5,d23,d0
+	vmlal.s32 q5,d12,d8
+	add r2,sp,#672
+	vst1.8 {d16-d17},[r2,: 128]
+	vmull.s32 q4,d18,d8
+	vmlal.s32 q4,d26,d2
+	vmlal.s32 q4,d19,d7
+	vmlal.s32 q4,d27,d1
+	vmlal.s32 q4,d22,d6
+	vmlal.s32 q4,d28,d0
+	vmull.s32 q8,d18,d7
+	vmlal.s32 q8,d26,d1
+	vmlal.s32 q8,d19,d6
+	vmlal.s32 q8,d27,d0
+	add r2,sp,#576
+	vld1.8 {d20-d21},[r2,: 128]
+	vmlal.s32 q7,d24,d21
+	vmlal.s32 q7,d25,d20
+	vmlal.s32 q4,d23,d21
+	vmlal.s32 q4,d29,d20
+	vmlal.s32 q8,d22,d21
+	vmlal.s32 q8,d28,d20
+	vmlal.s32 q5,d24,d20
+	add r2,sp,#576
+	vst1.8 {d14-d15},[r2,: 128]
+	vmull.s32 q7,d18,d6
+	vmlal.s32 q7,d26,d0
+	add r2,sp,#656
+	vld1.8 {d30-d31},[r2,: 128]
+	vmlal.s32 q2,d30,d21
+	vmlal.s32 q7,d19,d21
+	vmlal.s32 q7,d27,d20
+	add r2,sp,#624
+	vld1.8 {d26-d27},[r2,: 128]
+	vmlal.s32 q4,d25,d27
+	vmlal.s32 q8,d29,d27
+	vmlal.s32 q8,d25,d26
+	vmlal.s32 q7,d28,d27
+	vmlal.s32 q7,d29,d26
+	add r2,sp,#608
+	vld1.8 {d28-d29},[r2,: 128]
+	vmlal.s32 q4,d24,d29
+	vmlal.s32 q8,d23,d29
+	vmlal.s32 q8,d24,d28
+	vmlal.s32 q7,d22,d29
+	vmlal.s32 q7,d23,d28
+	add r2,sp,#608
+	vst1.8 {d8-d9},[r2,: 128]
+	add r2,sp,#560
+	vld1.8 {d8-d9},[r2,: 128]
+	vmlal.s32 q7,d24,d9
+	vmlal.s32 q7,d25,d31
+	vmull.s32 q1,d18,d2
+	vmlal.s32 q1,d19,d1
+	vmlal.s32 q1,d22,d0
+	vmlal.s32 q1,d24,d27
+	vmlal.s32 q1,d23,d20
+	vmlal.s32 q1,d12,d7
+	vmlal.s32 q1,d13,d6
+	vmull.s32 q6,d18,d1
+	vmlal.s32 q6,d19,d0
+	vmlal.s32 q6,d23,d27
+	vmlal.s32 q6,d22,d20
+	vmlal.s32 q6,d24,d26
+	vmull.s32 q0,d18,d0
+	vmlal.s32 q0,d22,d27
+	vmlal.s32 q0,d23,d26
+	vmlal.s32 q0,d24,d31
+	vmlal.s32 q0,d19,d20
+	add r2,sp,#640
+	vld1.8 {d18-d19},[r2,: 128]
+	vmlal.s32 q2,d18,d7
+	vmlal.s32 q2,d19,d6
+	vmlal.s32 q5,d18,d6
+	vmlal.s32 q5,d19,d21
+	vmlal.s32 q1,d18,d21
+	vmlal.s32 q1,d19,d29
+	vmlal.s32 q0,d18,d28
+	vmlal.s32 q0,d19,d9
+	vmlal.s32 q6,d18,d29
+	vmlal.s32 q6,d19,d28
+	add r2,sp,#592
+	vld1.8 {d18-d19},[r2,: 128]
+	add r2,sp,#512
+	vld1.8 {d22-d23},[r2,: 128]
+	vmlal.s32 q5,d19,d7
+	vmlal.s32 q0,d18,d21
+	vmlal.s32 q0,d19,d29
+	vmlal.s32 q6,d18,d6
+	add r2,sp,#528
+	vld1.8 {d6-d7},[r2,: 128]
+	vmlal.s32 q6,d19,d21
+	add r2,sp,#576
+	vld1.8 {d18-d19},[r2,: 128]
+	vmlal.s32 q0,d30,d8
+	add r2,sp,#672
+	vld1.8 {d20-d21},[r2,: 128]
+	vmlal.s32 q5,d30,d29
+	add r2,sp,#608
+	vld1.8 {d24-d25},[r2,: 128]
+	vmlal.s32 q1,d30,d28
+	vadd.i64 q13,q0,q11
+	vadd.i64 q14,q5,q11
+	vmlal.s32 q6,d30,d9
+	vshr.s64 q4,q13,#26
+	vshr.s64 q13,q14,#26
+	vadd.i64 q7,q7,q4
+	vshl.i64 q4,q4,#26
+	vadd.i64 q14,q7,q3
+	vadd.i64 q9,q9,q13
+	vshl.i64 q13,q13,#26
+	vadd.i64 q15,q9,q3
+	vsub.i64 q0,q0,q4
+	vshr.s64 q4,q14,#25
+	vsub.i64 q5,q5,q13
+	vshr.s64 q13,q15,#25
+	vadd.i64 q6,q6,q4
+	vshl.i64 q4,q4,#25
+	vadd.i64 q14,q6,q11
+	vadd.i64 q2,q2,q13
+	vsub.i64 q4,q7,q4
+	vshr.s64 q7,q14,#26
+	vshl.i64 q13,q13,#25
+	vadd.i64 q14,q2,q11
+	vadd.i64 q8,q8,q7
+	vshl.i64 q7,q7,#26
+	vadd.i64 q15,q8,q3
+	vsub.i64 q9,q9,q13
+	vshr.s64 q13,q14,#26
+	vsub.i64 q6,q6,q7
+	vshr.s64 q7,q15,#25
+	vadd.i64 q10,q10,q13
+	vshl.i64 q13,q13,#26
+	vadd.i64 q14,q10,q3
+	vadd.i64 q1,q1,q7
+	add r2,r3,#240
+	vshl.i64 q7,q7,#25
+	add r4,r3,#144
+	vadd.i64 q15,q1,q11
+	add r2,r2,#8
+	vsub.i64 q2,q2,q13
+	add r4,r4,#8
+	vshr.s64 q13,q14,#25
+	vsub.i64 q7,q8,q7
+	vshr.s64 q8,q15,#26
+	vadd.i64 q14,q13,q13
+	vadd.i64 q12,q12,q8
+	vtrn.32 d12,d14
+	vshl.i64 q8,q8,#26
+	vtrn.32 d13,d15
+	vadd.i64 q3,q12,q3
+	vadd.i64 q0,q0,q14
+	vst1.8 d12,[r2,: 64]!
+	vshl.i64 q7,q13,#4
+	vst1.8 d13,[r4,: 64]!
+	vsub.i64 q1,q1,q8
+	vshr.s64 q3,q3,#25
+	vadd.i64 q0,q0,q7
+	vadd.i64 q5,q5,q3
+	vshl.i64 q3,q3,#25
+	vadd.i64 q6,q5,q11
+	vadd.i64 q0,q0,q13
+	vshl.i64 q7,q13,#25
+	vadd.i64 q8,q0,q11
+	vsub.i64 q3,q12,q3
+	vshr.s64 q6,q6,#26
+	vsub.i64 q7,q10,q7
+	vtrn.32 d2,d6
+	vshr.s64 q8,q8,#26
+	vtrn.32 d3,d7
+	vadd.i64 q3,q9,q6
+	vst1.8 d2,[r2,: 64]
+	vshl.i64 q6,q6,#26
+	vst1.8 d3,[r4,: 64]
+	vadd.i64 q1,q4,q8
+	vtrn.32 d4,d14
+	vshl.i64 q4,q8,#26
+	vtrn.32 d5,d15
+	vsub.i64 q5,q5,q6
+	add r2,r2,#16
+	vsub.i64 q0,q0,q4
+	vst1.8 d4,[r2,: 64]
+	add r4,r4,#16
+	vst1.8 d5,[r4,: 64]
+	vtrn.32 d10,d6
+	vtrn.32 d11,d7
+	sub r2,r2,#8
+	sub r4,r4,#8
+	vtrn.32 d0,d2
+	vtrn.32 d1,d3
+	vst1.8 d10,[r2,: 64]
+	vst1.8 d11,[r4,: 64]
+	sub r2,r2,#24
+	sub r4,r4,#24
+	vst1.8 d0,[r2,: 64]
+	vst1.8 d1,[r4,: 64]
+	ldr r2,[sp,#488]
+	ldr r4,[sp,#492]
+	subs r5,r2,#1
+	bge .Lmainloop
+	add r1,r3,#144
+	add r2,r3,#336
+	vld1.8 {d0-d1},[r1,: 128]!
+	vld1.8 {d2-d3},[r1,: 128]!
+	vld1.8 {d4},[r1,: 64]
+	vst1.8 {d0-d1},[r2,: 128]!
+	vst1.8 {d2-d3},[r2,: 128]!
+	vst1.8 d4,[r2,: 64]
+	ldr r1,=0
+	.Linvertloop:
+	add r2,r3,#144
+	ldr r4,=0
+	ldr r5,=2
+	cmp r1,#1
+	ldreq r5,=1
+	addeq r2,r3,#336
+	addeq r4,r3,#48
+	cmp r1,#2
+	ldreq r5,=1
+	addeq r2,r3,#48
+	cmp r1,#3
+	ldreq r5,=5
+	addeq r4,r3,#336
+	cmp r1,#4
+	ldreq r5,=10
+	cmp r1,#5
+	ldreq r5,=20
+	cmp r1,#6
+	ldreq r5,=10
+	addeq r2,r3,#336
+	addeq r4,r3,#336
+	cmp r1,#7
+	ldreq r5,=50
+	cmp r1,#8
+	ldreq r5,=100
+	cmp r1,#9
+	ldreq r5,=50
+	addeq r2,r3,#336
+	cmp r1,#10
+	ldreq r5,=5
+	addeq r2,r3,#48
+	cmp r1,#11
+	ldreq r5,=0
+	addeq r2,r3,#96
+	add r6,r3,#144
+	add r7,r3,#288
+	vld1.8 {d0-d1},[r6,: 128]!
+	vld1.8 {d2-d3},[r6,: 128]!
+	vld1.8 {d4},[r6,: 64]
+	vst1.8 {d0-d1},[r7,: 128]!
+	vst1.8 {d2-d3},[r7,: 128]!
+	vst1.8 d4,[r7,: 64]
+	cmp r5,#0
+	beq .Lskipsquaringloop
+	.Lsquaringloop:
+	add r6,r3,#288
+	add r7,r3,#288
+	add r8,r3,#288
+	vmov.i32 q0,#19
+	vmov.i32 q1,#0
+	vmov.i32 q2,#1
+	vzip.i32 q1,q2
+	vld1.8 {d4-d5},[r7,: 128]!
+	vld1.8 {d6-d7},[r7,: 128]!
+	vld1.8 {d9},[r7,: 64]
+	vld1.8 {d10-d11},[r6,: 128]!
+	add r7,sp,#416
+	vld1.8 {d12-d13},[r6,: 128]!
+	vmul.i32 q7,q2,q0
+	vld1.8 {d8},[r6,: 64]
+	vext.32 d17,d11,d10,#1
+	vmul.i32 q9,q3,q0
+	vext.32 d16,d10,d8,#1
+	vshl.u32 q10,q5,q1
+	vext.32 d22,d14,d4,#1
+	vext.32 d24,d18,d6,#1
+	vshl.u32 q13,q6,q1
+	vshl.u32 d28,d8,d2
+	vrev64.i32 d22,d22
+	vmul.i32 d1,d9,d1
+	vrev64.i32 d24,d24
+	vext.32 d29,d8,d13,#1
+	vext.32 d0,d1,d9,#1
+	vrev64.i32 d0,d0
+	vext.32 d2,d9,d1,#1
+	vext.32 d23,d15,d5,#1
+	vmull.s32 q4,d20,d4
+	vrev64.i32 d23,d23
+	vmlal.s32 q4,d21,d1
+	vrev64.i32 d2,d2
+	vmlal.s32 q4,d26,d19
+	vext.32 d3,d5,d15,#1
+	vmlal.s32 q4,d27,d18
+	vrev64.i32 d3,d3
+	vmlal.s32 q4,d28,d15
+	vext.32 d14,d12,d11,#1
+	vmull.s32 q5,d16,d23
+	vext.32 d15,d13,d12,#1
+	vmlal.s32 q5,d17,d4
+	vst1.8 d8,[r7,: 64]!
+	vmlal.s32 q5,d14,d1
+	vext.32 d12,d9,d8,#0
+	vmlal.s32 q5,d15,d19
+	vmov.i64 d13,#0
+	vmlal.s32 q5,d29,d18
+	vext.32 d25,d19,d7,#1
+	vmlal.s32 q6,d20,d5
+	vrev64.i32 d25,d25
+	vmlal.s32 q6,d21,d4
+	vst1.8 d11,[r7,: 64]!
+	vmlal.s32 q6,d26,d1
+	vext.32 d9,d10,d10,#0
+	vmlal.s32 q6,d27,d19
+	vmov.i64 d8,#0
+	vmlal.s32 q6,d28,d18
+	vmlal.s32 q4,d16,d24
+	vmlal.s32 q4,d17,d5
+	vmlal.s32 q4,d14,d4
+	vst1.8 d12,[r7,: 64]!
+	vmlal.s32 q4,d15,d1
+	vext.32 d10,d13,d12,#0
+	vmlal.s32 q4,d29,d19
+	vmov.i64 d11,#0
+	vmlal.s32 q5,d20,d6
+	vmlal.s32 q5,d21,d5
+	vmlal.s32 q5,d26,d4
+	vext.32 d13,d8,d8,#0
+	vmlal.s32 q5,d27,d1
+	vmov.i64 d12,#0
+	vmlal.s32 q5,d28,d19
+	vst1.8 d9,[r7,: 64]!
+	vmlal.s32 q6,d16,d25
+	vmlal.s32 q6,d17,d6
+	vst1.8 d10,[r7,: 64]
+	vmlal.s32 q6,d14,d5
+	vext.32 d8,d11,d10,#0
+	vmlal.s32 q6,d15,d4
+	vmov.i64 d9,#0
+	vmlal.s32 q6,d29,d1
+	vmlal.s32 q4,d20,d7
+	vmlal.s32 q4,d21,d6
+	vmlal.s32 q4,d26,d5
+	vext.32 d11,d12,d12,#0
+	vmlal.s32 q4,d27,d4
+	vmov.i64 d10,#0
+	vmlal.s32 q4,d28,d1
+	vmlal.s32 q5,d16,d0
+	sub r6,r7,#32
+	vmlal.s32 q5,d17,d7
+	vmlal.s32 q5,d14,d6
+	vext.32 d30,d9,d8,#0
+	vmlal.s32 q5,d15,d5
+	vld1.8 {d31},[r6,: 64]!
+	vmlal.s32 q5,d29,d4
+	vmlal.s32 q15,d20,d0
+	vext.32 d0,d6,d18,#1
+	vmlal.s32 q15,d21,d25
+	vrev64.i32 d0,d0
+	vmlal.s32 q15,d26,d24
+	vext.32 d1,d7,d19,#1
+	vext.32 d7,d10,d10,#0
+	vmlal.s32 q15,d27,d23
+	vrev64.i32 d1,d1
+	vld1.8 {d6},[r6,: 64]
+	vmlal.s32 q15,d28,d22
+	vmlal.s32 q3,d16,d4
+	add r6,r6,#24
+	vmlal.s32 q3,d17,d2
+	vext.32 d4,d31,d30,#0
+	vmov d17,d11
+	vmlal.s32 q3,d14,d1
+	vext.32 d11,d13,d13,#0
+	vext.32 d13,d30,d30,#0
+	vmlal.s32 q3,d15,d0
+	vext.32 d1,d8,d8,#0
+	vmlal.s32 q3,d29,d3
+	vld1.8 {d5},[r6,: 64]
+	sub r6,r6,#16
+	vext.32 d10,d6,d6,#0
+	vmov.i32 q1,#0xffffffff
+	vshl.i64 q4,q1,#25
+	add r7,sp,#512
+	vld1.8 {d14-d15},[r7,: 128]
+	vadd.i64 q9,q2,q7
+	vshl.i64 q1,q1,#26
+	vshr.s64 q10,q9,#26
+	vld1.8 {d0},[r6,: 64]!
+	vadd.i64 q5,q5,q10
+	vand q9,q9,q1
+	vld1.8 {d16},[r6,: 64]!
+	add r6,sp,#528
+	vld1.8 {d20-d21},[r6,: 128]
+	vadd.i64 q11,q5,q10
+	vsub.i64 q2,q2,q9
+	vshr.s64 q9,q11,#25
+	vext.32 d12,d5,d4,#0
+	vand q11,q11,q4
+	vadd.i64 q0,q0,q9
+	vmov d19,d7
+	vadd.i64 q3,q0,q7
+	vsub.i64 q5,q5,q11
+	vshr.s64 q11,q3,#26
+	vext.32 d18,d11,d10,#0
+	vand q3,q3,q1
+	vadd.i64 q8,q8,q11
+	vadd.i64 q11,q8,q10
+	vsub.i64 q0,q0,q3
+	vshr.s64 q3,q11,#25
+	vand q11,q11,q4
+	vadd.i64 q3,q6,q3
+	vadd.i64 q6,q3,q7
+	vsub.i64 q8,q8,q11
+	vshr.s64 q11,q6,#26
+	vand q6,q6,q1
+	vadd.i64 q9,q9,q11
+	vadd.i64 d25,d19,d21
+	vsub.i64 q3,q3,q6
+	vshr.s64 d23,d25,#25
+	vand q4,q12,q4
+	vadd.i64 d21,d23,d23
+	vshl.i64 d25,d23,#4
+	vadd.i64 d21,d21,d23
+	vadd.i64 d25,d25,d21
+	vadd.i64 d4,d4,d25
+	vzip.i32 q0,q8
+	vadd.i64 d12,d4,d14
+	add r6,r8,#8
+	vst1.8 d0,[r6,: 64]
+	vsub.i64 d19,d19,d9
+	add r6,r6,#16
+	vst1.8 d16,[r6,: 64]
+	vshr.s64 d22,d12,#26
+	vand q0,q6,q1
+	vadd.i64 d10,d10,d22
+	vzip.i32 q3,q9
+	vsub.i64 d4,d4,d0
+	sub r6,r6,#8
+	vst1.8 d6,[r6,: 64]
+	add r6,r6,#16
+	vst1.8 d18,[r6,: 64]
+	vzip.i32 q2,q5
+	sub r6,r6,#32
+	vst1.8 d4,[r6,: 64]
+	subs r5,r5,#1
+	bhi .Lsquaringloop
+	.Lskipsquaringloop:
+	mov r2,r2
+	add r5,r3,#288
+	add r6,r3,#144
+	vmov.i32 q0,#19
+	vmov.i32 q1,#0
+	vmov.i32 q2,#1
+	vzip.i32 q1,q2
+	vld1.8 {d4-d5},[r5,: 128]!
+	vld1.8 {d6-d7},[r5,: 128]!
+	vld1.8 {d9},[r5,: 64]
+	vld1.8 {d10-d11},[r2,: 128]!
+	add r5,sp,#416
+	vld1.8 {d12-d13},[r2,: 128]!
+	vmul.i32 q7,q2,q0
+	vld1.8 {d8},[r2,: 64]
+	vext.32 d17,d11,d10,#1
+	vmul.i32 q9,q3,q0
+	vext.32 d16,d10,d8,#1
+	vshl.u32 q10,q5,q1
+	vext.32 d22,d14,d4,#1
+	vext.32 d24,d18,d6,#1
+	vshl.u32 q13,q6,q1
+	vshl.u32 d28,d8,d2
+	vrev64.i32 d22,d22
+	vmul.i32 d1,d9,d1
+	vrev64.i32 d24,d24
+	vext.32 d29,d8,d13,#1
+	vext.32 d0,d1,d9,#1
+	vrev64.i32 d0,d0
+	vext.32 d2,d9,d1,#1
+	vext.32 d23,d15,d5,#1
+	vmull.s32 q4,d20,d4
+	vrev64.i32 d23,d23
+	vmlal.s32 q4,d21,d1
+	vrev64.i32 d2,d2
+	vmlal.s32 q4,d26,d19
+	vext.32 d3,d5,d15,#1
+	vmlal.s32 q4,d27,d18
+	vrev64.i32 d3,d3
+	vmlal.s32 q4,d28,d15
+	vext.32 d14,d12,d11,#1
+	vmull.s32 q5,d16,d23
+	vext.32 d15,d13,d12,#1
+	vmlal.s32 q5,d17,d4
+	vst1.8 d8,[r5,: 64]!
+	vmlal.s32 q5,d14,d1
+	vext.32 d12,d9,d8,#0
+	vmlal.s32 q5,d15,d19
+	vmov.i64 d13,#0
+	vmlal.s32 q5,d29,d18
+	vext.32 d25,d19,d7,#1
+	vmlal.s32 q6,d20,d5
+	vrev64.i32 d25,d25
+	vmlal.s32 q6,d21,d4
+	vst1.8 d11,[r5,: 64]!
+	vmlal.s32 q6,d26,d1
+	vext.32 d9,d10,d10,#0
+	vmlal.s32 q6,d27,d19
+	vmov.i64 d8,#0
+	vmlal.s32 q6,d28,d18
+	vmlal.s32 q4,d16,d24
+	vmlal.s32 q4,d17,d5
+	vmlal.s32 q4,d14,d4
+	vst1.8 d12,[r5,: 64]!
+	vmlal.s32 q4,d15,d1
+	vext.32 d10,d13,d12,#0
+	vmlal.s32 q4,d29,d19
+	vmov.i64 d11,#0
+	vmlal.s32 q5,d20,d6
+	vmlal.s32 q5,d21,d5
+	vmlal.s32 q5,d26,d4
+	vext.32 d13,d8,d8,#0
+	vmlal.s32 q5,d27,d1
+	vmov.i64 d12,#0
+	vmlal.s32 q5,d28,d19
+	vst1.8 d9,[r5,: 64]!
+	vmlal.s32 q6,d16,d25
+	vmlal.s32 q6,d17,d6
+	vst1.8 d10,[r5,: 64]
+	vmlal.s32 q6,d14,d5
+	vext.32 d8,d11,d10,#0
+	vmlal.s32 q6,d15,d4
+	vmov.i64 d9,#0
+	vmlal.s32 q6,d29,d1
+	vmlal.s32 q4,d20,d7
+	vmlal.s32 q4,d21,d6
+	vmlal.s32 q4,d26,d5
+	vext.32 d11,d12,d12,#0
+	vmlal.s32 q4,d27,d4
+	vmov.i64 d10,#0
+	vmlal.s32 q4,d28,d1
+	vmlal.s32 q5,d16,d0
+	sub r2,r5,#32
+	vmlal.s32 q5,d17,d7
+	vmlal.s32 q5,d14,d6
+	vext.32 d30,d9,d8,#0
+	vmlal.s32 q5,d15,d5
+	vld1.8 {d31},[r2,: 64]!
+	vmlal.s32 q5,d29,d4
+	vmlal.s32 q15,d20,d0
+	vext.32 d0,d6,d18,#1
+	vmlal.s32 q15,d21,d25
+	vrev64.i32 d0,d0
+	vmlal.s32 q15,d26,d24
+	vext.32 d1,d7,d19,#1
+	vext.32 d7,d10,d10,#0
+	vmlal.s32 q15,d27,d23
+	vrev64.i32 d1,d1
+	vld1.8 {d6},[r2,: 64]
+	vmlal.s32 q15,d28,d22
+	vmlal.s32 q3,d16,d4
+	add r2,r2,#24
+	vmlal.s32 q3,d17,d2
+	vext.32 d4,d31,d30,#0
+	vmov d17,d11
+	vmlal.s32 q3,d14,d1
+	vext.32 d11,d13,d13,#0
+	vext.32 d13,d30,d30,#0
+	vmlal.s32 q3,d15,d0
+	vext.32 d1,d8,d8,#0
+	vmlal.s32 q3,d29,d3
+	vld1.8 {d5},[r2,: 64]
+	sub r2,r2,#16
+	vext.32 d10,d6,d6,#0
+	vmov.i32 q1,#0xffffffff
+	vshl.i64 q4,q1,#25
+	add r5,sp,#512
+	vld1.8 {d14-d15},[r5,: 128]
+	vadd.i64 q9,q2,q7
+	vshl.i64 q1,q1,#26
+	vshr.s64 q10,q9,#26
+	vld1.8 {d0},[r2,: 64]!
+	vadd.i64 q5,q5,q10
+	vand q9,q9,q1
+	vld1.8 {d16},[r2,: 64]!
+	add r2,sp,#528
+	vld1.8 {d20-d21},[r2,: 128]
+	vadd.i64 q11,q5,q10
+	vsub.i64 q2,q2,q9
+	vshr.s64 q9,q11,#25
+	vext.32 d12,d5,d4,#0
+	vand q11,q11,q4
+	vadd.i64 q0,q0,q9
+	vmov d19,d7
+	vadd.i64 q3,q0,q7
+	vsub.i64 q5,q5,q11
+	vshr.s64 q11,q3,#26
+	vext.32 d18,d11,d10,#0
+	vand q3,q3,q1
+	vadd.i64 q8,q8,q11
+	vadd.i64 q11,q8,q10
+	vsub.i64 q0,q0,q3
+	vshr.s64 q3,q11,#25
+	vand q11,q11,q4
+	vadd.i64 q3,q6,q3
+	vadd.i64 q6,q3,q7
+	vsub.i64 q8,q8,q11
+	vshr.s64 q11,q6,#26
+	vand q6,q6,q1
+	vadd.i64 q9,q9,q11
+	vadd.i64 d25,d19,d21
+	vsub.i64 q3,q3,q6
+	vshr.s64 d23,d25,#25
+	vand q4,q12,q4
+	vadd.i64 d21,d23,d23
+	vshl.i64 d25,d23,#4
+	vadd.i64 d21,d21,d23
+	vadd.i64 d25,d25,d21
+	vadd.i64 d4,d4,d25
+	vzip.i32 q0,q8
+	vadd.i64 d12,d4,d14
+	add r2,r6,#8
+	vst1.8 d0,[r2,: 64]
+	vsub.i64 d19,d19,d9
+	add r2,r2,#16
+	vst1.8 d16,[r2,: 64]
+	vshr.s64 d22,d12,#26
+	vand q0,q6,q1
+	vadd.i64 d10,d10,d22
+	vzip.i32 q3,q9
+	vsub.i64 d4,d4,d0
+	sub r2,r2,#8
+	vst1.8 d6,[r2,: 64]
+	add r2,r2,#16
+	vst1.8 d18,[r2,: 64]
+	vzip.i32 q2,q5
+	sub r2,r2,#32
+	vst1.8 d4,[r2,: 64]
+	cmp r4,#0
+	beq .Lskippostcopy
+	add r2,r3,#144
+	mov r4,r4
+	vld1.8 {d0-d1},[r2,: 128]!
+	vld1.8 {d2-d3},[r2,: 128]!
+	vld1.8 {d4},[r2,: 64]
+	vst1.8 {d0-d1},[r4,: 128]!
+	vst1.8 {d2-d3},[r4,: 128]!
+	vst1.8 d4,[r4,: 64]
+	.Lskippostcopy:
+	cmp r1,#1
+	bne .Lskipfinalcopy
+	add r2,r3,#288
+	add r4,r3,#144
+	vld1.8 {d0-d1},[r2,: 128]!
+	vld1.8 {d2-d3},[r2,: 128]!
+	vld1.8 {d4},[r2,: 64]
+	vst1.8 {d0-d1},[r4,: 128]!
+	vst1.8 {d2-d3},[r4,: 128]!
+	vst1.8 d4,[r4,: 64]
+	.Lskipfinalcopy:
+	add r1,r1,#1
+	cmp r1,#12
+	blo .Linvertloop
+	add r1,r3,#144
+	ldr r2,[r1],#4
+	ldr r3,[r1],#4
+	ldr r4,[r1],#4
+	ldr r5,[r1],#4
+	ldr r6,[r1],#4
+	ldr r7,[r1],#4
+	ldr r8,[r1],#4
+	ldr r9,[r1],#4
+	ldr r10,[r1],#4
+	ldr r1,[r1]
+	add r11,r1,r1,LSL #4
+	add r11,r11,r1,LSL #1
+	add r11,r11,#16777216
+	mov r11,r11,ASR #25
+	add r11,r11,r2
+	mov r11,r11,ASR #26
+	add r11,r11,r3
+	mov r11,r11,ASR #25
+	add r11,r11,r4
+	mov r11,r11,ASR #26
+	add r11,r11,r5
+	mov r11,r11,ASR #25
+	add r11,r11,r6
+	mov r11,r11,ASR #26
+	add r11,r11,r7
+	mov r11,r11,ASR #25
+	add r11,r11,r8
+	mov r11,r11,ASR #26
+	add r11,r11,r9
+	mov r11,r11,ASR #25
+	add r11,r11,r10
+	mov r11,r11,ASR #26
+	add r11,r11,r1
+	mov r11,r11,ASR #25
+	add r2,r2,r11
+	add r2,r2,r11,LSL #1
+	add r2,r2,r11,LSL #4
+	mov r11,r2,ASR #26
+	add r3,r3,r11
+	sub r2,r2,r11,LSL #26
+	mov r11,r3,ASR #25
+	add r4,r4,r11
+	sub r3,r3,r11,LSL #25
+	mov r11,r4,ASR #26
+	add r5,r5,r11
+	sub r4,r4,r11,LSL #26
+	mov r11,r5,ASR #25
+	add r6,r6,r11
+	sub r5,r5,r11,LSL #25
+	mov r11,r6,ASR #26
+	add r7,r7,r11
+	sub r6,r6,r11,LSL #26
+	mov r11,r7,ASR #25
+	add r8,r8,r11
+	sub r7,r7,r11,LSL #25
+	mov r11,r8,ASR #26
+	add r9,r9,r11
+	sub r8,r8,r11,LSL #26
+	mov r11,r9,ASR #25
+	add r10,r10,r11
+	sub r9,r9,r11,LSL #25
+	mov r11,r10,ASR #26
+	add r1,r1,r11
+	sub r10,r10,r11,LSL #26
+	mov r11,r1,ASR #25
+	sub r1,r1,r11,LSL #25
+	add r2,r2,r3,LSL #26
+	mov r3,r3,LSR #6
+	add r3,r3,r4,LSL #19
+	mov r4,r4,LSR #13
+	add r4,r4,r5,LSL #13
+	mov r5,r5,LSR #19
+	add r5,r5,r6,LSL #6
+	add r6,r7,r8,LSL #25
+	mov r7,r8,LSR #7
+	add r7,r7,r9,LSL #19
+	mov r8,r9,LSR #13
+	add r8,r8,r10,LSL #12
+	mov r9,r10,LSR #20
+	add r1,r9,r1,LSL #6
+	str r2,[r0],#4
+	str r3,[r0],#4
+	str r4,[r0],#4
+	str r5,[r0],#4
+	str r6,[r0],#4
+	str r7,[r0],#4
+	str r8,[r0],#4
+	str r1,[r0]
+	ldrd r4,[sp,#0]
+	ldrd r6,[sp,#8]
+	ldrd r8,[sp,#16]
+	ldrd r10,[sp,#24]
+	ldr r12,[sp,#480]
+	ldr r14,[sp,#484]
+	ldr r0,=0
+	mov sp,r12
+	vpop {q4,q5,q6,q7}
+	bx lr
+ENDPROC(curve25519_neon)
+#endif
-- 
2.18.0

^ permalink raw reply related

* [PATCH v2 09/17] zinc: Poly1305 x86_64 implementation
From: Jason A. Donenfeld @ 2018-08-24 21:38 UTC (permalink / raw)
  To: linux-kernel, netdev, davem
  Cc: Jason A. Donenfeld, Andy Lutomirski, Greg KH, Samuel Neves,
	Jean-Philippe Aumasson, Andy Polyakov, Thomas Gleixner,
	Ingo Molnar, x86, linux-crypto
In-Reply-To: <20180824213849.23647-1-Jason@zx2c4.com>

This provides AVX, AVX-2, and AVX-512F implementations for Poly1305.
The AVX-512F implementation is disabled on Skylake, due to throttling.
These come from Andy Polyakov's implementation, with some heavy
modifications from Samuel Neves and me.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Andy Polyakov <appro@openssl.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: x86@kernel.org
Cc: linux-crypto@vger.kernel.org
---
 lib/zinc/Makefile                        |    4 +
 lib/zinc/poly1305/poly1305-x86_64-glue.h |   91 +
 lib/zinc/poly1305/poly1305-x86_64.S      | 2790 ++++++++++++++++++++++
 3 files changed, 2885 insertions(+)
 create mode 100644 lib/zinc/poly1305/poly1305-x86_64-glue.h
 create mode 100644 lib/zinc/poly1305/poly1305-x86_64.S

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index db70ffedf7dd..57fe42b7f97c 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -24,6 +24,10 @@ endif
 
 ifeq ($(CONFIG_ZINC_POLY1305),y)
 zinc-y += poly1305/poly1305.o
+ifeq ($(CONFIG_X86_64),y)
+zinc-y += poly1305/poly1305-x86_64.o
+CFLAGS_poly1305.o += -include $(srctree)/$(src)/poly1305/poly1305-x86_64-glue.h
+endif
 ifeq ($(CONFIG_ARM),y)
 zinc-y += poly1305/poly1305-arm.o
 CFLAGS_poly1305.o += -include $(srctree)/$(src)/poly1305/poly1305-arm-glue.h
diff --git a/lib/zinc/poly1305/poly1305-x86_64-glue.h b/lib/zinc/poly1305/poly1305-x86_64-glue.h
new file mode 100644
index 000000000000..dde3b0ec39d5
--- /dev/null
+++ b/lib/zinc/poly1305/poly1305-x86_64-glue.h
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <zinc/poly1305.h>
+#include <asm/fpu/api.h>
+#include <asm/cpufeature.h>
+#include <asm/processor.h>
+#include <asm/intel-family.h>
+
+asmlinkage void poly1305_init_x86_64(void *ctx, const u8 key[POLY1305_KEY_SIZE]);
+asmlinkage void poly1305_blocks_x86_64(void *ctx, const u8 *inp, const size_t len, const u32 padbit);
+asmlinkage void poly1305_emit_x86_64(void *ctx, u8 mac[POLY1305_MAC_SIZE], const u32 nonce[4]);
+#ifdef CONFIG_AS_AVX
+asmlinkage void poly1305_emit_avx(void *ctx, u8 mac[POLY1305_MAC_SIZE], const u32 nonce[4]);
+asmlinkage void poly1305_blocks_avx(void *ctx, const u8 *inp, const size_t len, const u32 padbit);
+#endif
+#ifdef CONFIG_AS_AVX2
+asmlinkage void poly1305_blocks_avx2(void *ctx, const u8 *inp, const size_t len, const u32 padbit);
+#endif
+#ifdef CONFIG_AS_AVX512
+asmlinkage void poly1305_blocks_avx512(void *ctx, const u8 *inp, const size_t len, const u32 padbit);
+#endif
+
+static bool poly1305_use_avx __ro_after_init;
+static bool poly1305_use_avx2 __ro_after_init;
+static bool poly1305_use_avx512 __ro_after_init;
+
+void __init poly1305_fpu_init(void)
+{
+#ifndef CONFIG_UML
+	poly1305_use_avx = boot_cpu_has(X86_FEATURE_AVX) &&
+			   cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
+	poly1305_use_avx2 = boot_cpu_has(X86_FEATURE_AVX) && boot_cpu_has(X86_FEATURE_AVX2) &&
+			    cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
+	poly1305_use_avx512 = boot_cpu_has(X86_FEATURE_AVX) && boot_cpu_has(X86_FEATURE_AVX2) && boot_cpu_has(X86_FEATURE_AVX512F) &&
+			      cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM | XFEATURE_MASK_AVX512, NULL) &&
+			      boot_cpu_data.x86_model != INTEL_FAM6_SKYLAKE_X;
+#endif
+}
+
+static inline bool poly1305_init_arch(void *ctx, const u8 key[POLY1305_KEY_SIZE], simd_context_t simd_context)
+{
+	poly1305_init_x86_64(ctx, key);
+	return true;
+}
+
+static inline bool poly1305_blocks_arch(void *ctx, const u8 *inp, const size_t len, const u32 padbit, simd_context_t simd_context)
+{
+#ifdef CONFIG_AS_AVX512
+	if (poly1305_use_avx512 && simd_context == HAVE_FULL_SIMD)
+		poly1305_blocks_avx512(ctx, inp, len, padbit);
+	else
+#endif
+#ifdef CONFIG_AS_AVX2
+	if (poly1305_use_avx2 && simd_context == HAVE_FULL_SIMD)
+		poly1305_blocks_avx2(ctx, inp, len, padbit);
+	else
+#endif
+#ifdef CONFIG_AS_AVX
+	if (poly1305_use_avx && simd_context == HAVE_FULL_SIMD)
+		poly1305_blocks_avx(ctx, inp, len, padbit);
+	else
+#endif
+		poly1305_blocks_x86_64(ctx, inp, len, padbit);
+	return true;
+}
+
+static inline bool poly1305_emit_arch(void *ctx, u8 mac[POLY1305_MAC_SIZE], const u32 nonce[4], simd_context_t simd_context)
+{
+#ifdef CONFIG_AS_AVX512
+	if (poly1305_use_avx512 && simd_context == HAVE_FULL_SIMD)
+		poly1305_emit_avx(ctx, mac, nonce);
+	else
+#endif
+#ifdef CONFIG_AS_AVX2
+	if (poly1305_use_avx2 && simd_context == HAVE_FULL_SIMD)
+		poly1305_emit_avx(ctx, mac, nonce);
+	else
+#endif
+#ifdef CONFIG_AS_AVX
+	if (poly1305_use_avx && simd_context == HAVE_FULL_SIMD)
+		poly1305_emit_avx(ctx, mac, nonce);
+	else
+#endif
+		poly1305_emit_x86_64(ctx, mac, nonce);
+	return true;
+}
+
+#define HAVE_POLY1305_ARCH_IMPLEMENTATION
diff --git a/lib/zinc/poly1305/poly1305-x86_64.S b/lib/zinc/poly1305/poly1305-x86_64.S
new file mode 100644
index 000000000000..7c1b7363f487
--- /dev/null
+++ b/lib/zinc/poly1305/poly1305-x86_64.S
@@ -0,0 +1,2790 @@
+/* SPDX-License-Identifier: OpenSSL OR (BSD-3-Clause OR GPL-2.0)
+ *
+ * Copyright (C) 2017 Samuel Neves <sneves@dei.uc.pt>. All Rights Reserved.
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ * Copyright 2016 The OpenSSL Project Authors. All Rights Reserved.
+ */
+
+#include <linux/linkage.h>
+
+.section .rodata.cst192.Lconst, "aM", @progbits, 192
+.align	64
+.Lconst:
+.long	0x0ffffff,0,0x0ffffff,0,0x0ffffff,0,0x0ffffff,0
+.long	16777216,0,16777216,0,16777216,0,16777216,0
+.long	0x3ffffff,0,0x3ffffff,0,0x3ffffff,0,0x3ffffff,0
+.long	2,2,2,3,2,0,2,1
+.long	0,0,0,1, 0,2,0,3, 0,4,0,5, 0,6,0,7
+
+.text
+
+.align	32
+ENTRY(poly1305_init_x86_64)
+	xorq	%rax,%rax
+	movq	%rax,0(%rdi)
+	movq	%rax,8(%rdi)
+	movq	%rax,16(%rdi)
+
+	cmpq	$0,%rsi
+	je	.Lno_key
+
+	movq	$0x0ffffffc0fffffff,%rax
+	movq	$0x0ffffffc0ffffffc,%rcx
+	andq	0(%rsi),%rax
+	andq	8(%rsi),%rcx
+	movq	%rax,24(%rdi)
+	movq	%rcx,32(%rdi)
+	movl	$1,%eax
+.Lno_key:
+	ret
+ENDPROC(poly1305_init_x86_64)
+
+.align	32
+ENTRY(poly1305_blocks_x86_64)
+.Lblocks:
+	shrq	$4,%rdx
+	jz	.Lno_data
+
+	pushq	%rbx
+	pushq	%r12
+	pushq	%r13
+	pushq	%r14
+	pushq	%r15
+	pushq	%rdi
+
+.Lblocks_body:
+
+	movq	%rdx,%r15
+
+	movq	24(%rdi),%r11
+	movq	32(%rdi),%r13
+
+	movq	0(%rdi),%r14
+	movq	8(%rdi),%rbx
+	movq	16(%rdi),%r10
+
+	movq	%r13,%r12
+	shrq	$2,%r13
+	movq	%r12,%rax
+	addq	%r12,%r13
+	jmp	.Loop
+
+.align	32
+.Loop:
+
+	addq	0(%rsi),%r14
+	adcq	8(%rsi),%rbx
+	leaq	16(%rsi),%rsi
+	adcq	%rcx,%r10
+	mulq	%r14
+	movq	%rax,%r9
+	movq	%r11,%rax
+	movq	%rdx,%rdi
+
+	mulq	%r14
+	movq	%rax,%r14
+	movq	%r11,%rax
+	movq	%rdx,%r8
+
+	mulq	%rbx
+	addq	%rax,%r9
+	movq	%r13,%rax
+	adcq	%rdx,%rdi
+
+	mulq	%rbx
+	movq	%r10,%rbx
+	addq	%rax,%r14
+	adcq	%rdx,%r8
+
+	imulq	%r13,%rbx
+	addq	%rbx,%r9
+	movq	%r8,%rbx
+	adcq	$0,%rdi
+
+	imulq	%r11,%r10
+	addq	%r9,%rbx
+	movq	$-4,%rax
+	adcq	%r10,%rdi
+
+	andq	%rdi,%rax
+	movq	%rdi,%r10
+	shrq	$2,%rdi
+	andq	$3,%r10
+	addq	%rdi,%rax
+	addq	%rax,%r14
+	adcq	$0,%rbx
+	adcq	$0,%r10
+
+	movq	%r12,%rax
+	decq	%r15
+	jnz	.Loop
+
+	movq	0(%rsp),%rdi
+
+	movq	%r14,0(%rdi)
+	movq	%rbx,8(%rdi)
+	movq	%r10,16(%rdi)
+
+	movq	8(%rsp),%r15
+	movq	16(%rsp),%r14
+	movq	24(%rsp),%r13
+	movq	32(%rsp),%r12
+	movq	40(%rsp),%rbx
+	leaq	48(%rsp),%rsp
+.Lno_data:
+.Lblocks_epilogue:
+	ret
+ENDPROC(poly1305_blocks_x86_64)
+
+.align	32
+ENTRY(poly1305_emit_x86_64)
+.Lemit:
+	movq	0(%rdi),%r8
+	movq	8(%rdi),%r9
+	movq	16(%rdi),%r10
+
+	movq	%r8,%rax
+	addq	$5,%r8
+	movq	%r9,%rcx
+	adcq	$0,%r9
+	adcq	$0,%r10
+	shrq	$2,%r10
+	cmovnzq	%r8,%rax
+	cmovnzq	%r9,%rcx
+
+	addq	0(%rdx),%rax
+	adcq	8(%rdx),%rcx
+	movq	%rax,0(%rsi)
+	movq	%rcx,8(%rsi)
+
+	ret
+ENDPROC(poly1305_emit_x86_64)
+
+.macro __poly1305_block
+	mulq	%r14
+	movq	%rax,%r9
+	movq	%r11,%rax
+	movq	%rdx,%rdi
+
+	mulq	%r14
+	movq	%rax,%r14
+	movq	%r11,%rax
+	movq	%rdx,%r8
+
+	mulq	%rbx
+	addq	%rax,%r9
+	movq	%r13,%rax
+	adcq	%rdx,%rdi
+
+	mulq	%rbx
+	movq	%r10,%rbx
+	addq	%rax,%r14
+	adcq	%rdx,%r8
+
+	imulq	%r13,%rbx
+	addq	%rbx,%r9
+	movq	%r8,%rbx
+	adcq	$0,%rdi
+
+	imulq	%r11,%r10
+	addq	%r9,%rbx
+	movq	$-4,%rax
+	adcq	%r10,%rdi
+
+	andq	%rdi,%rax
+	movq	%rdi,%r10
+	shrq	$2,%rdi
+	andq	$3,%r10
+	addq	%rdi,%rax
+	addq	%rax,%r14
+	adcq	$0,%rbx
+	adcq	$0,%r10
+.endm
+
+.macro __poly1305_init_avx
+	movq	%r11,%r14
+	movq	%r12,%rbx
+	xorq	%r10,%r10
+
+	leaq	48+64(%rdi),%rdi
+
+	movq	%r12,%rax
+	movq	%rdi,0(%rsp)
+	__poly1305_block
+	movq	0(%rsp),%rdi
+
+	movl	$0x3ffffff,%eax
+	movl	$0x3ffffff,%edx
+	movq	%r14,%r8
+	andl	%r14d,%eax
+	movq	%r11,%r9
+	andl	%r11d,%edx
+	movl	%eax,-64(%rdi)
+	shrq	$26,%r8
+	movl	%edx,-60(%rdi)
+	shrq	$26,%r9
+
+	movl	$0x3ffffff,%eax
+	movl	$0x3ffffff,%edx
+	andl	%r8d,%eax
+	andl	%r9d,%edx
+	movl	%eax,-48(%rdi)
+	leal	(%rax,%rax,4),%eax
+	movl	%edx,-44(%rdi)
+	leal	(%rdx,%rdx,4),%edx
+	movl	%eax,-32(%rdi)
+	shrq	$26,%r8
+	movl	%edx,-28(%rdi)
+	shrq	$26,%r9
+
+	movq	%rbx,%rax
+	movq	%r12,%rdx
+	shlq	$12,%rax
+	shlq	$12,%rdx
+	orq	%r8,%rax
+	orq	%r9,%rdx
+	andl	$0x3ffffff,%eax
+	andl	$0x3ffffff,%edx
+	movl	%eax,-16(%rdi)
+	leal	(%rax,%rax,4),%eax
+	movl	%edx,-12(%rdi)
+	leal	(%rdx,%rdx,4),%edx
+	movl	%eax,0(%rdi)
+	movq	%rbx,%r8
+	movl	%edx,4(%rdi)
+	movq	%r12,%r9
+
+	movl	$0x3ffffff,%eax
+	movl	$0x3ffffff,%edx
+	shrq	$14,%r8
+	shrq	$14,%r9
+	andl	%r8d,%eax
+	andl	%r9d,%edx
+	movl	%eax,16(%rdi)
+	leal	(%rax,%rax,4),%eax
+	movl	%edx,20(%rdi)
+	leal	(%rdx,%rdx,4),%edx
+	movl	%eax,32(%rdi)
+	shrq	$26,%r8
+	movl	%edx,36(%rdi)
+	shrq	$26,%r9
+
+	movq	%r10,%rax
+	shlq	$24,%rax
+	orq	%rax,%r8
+	movl	%r8d,48(%rdi)
+	leaq	(%r8,%r8,4),%r8
+	movl	%r9d,52(%rdi)
+	leaq	(%r9,%r9,4),%r9
+	movl	%r8d,64(%rdi)
+	movl	%r9d,68(%rdi)
+
+	movq	%r12,%rax
+	movq	%rdi,0(%rsp)
+	__poly1305_block
+	movq	0(%rsp),%rdi
+
+	movl	$0x3ffffff,%eax
+	movq	%r14,%r8
+	andl	%r14d,%eax
+	shrq	$26,%r8
+	movl	%eax,-52(%rdi)
+
+	movl	$0x3ffffff,%edx
+	andl	%r8d,%edx
+	movl	%edx,-36(%rdi)
+	leal	(%rdx,%rdx,4),%edx
+	shrq	$26,%r8
+	movl	%edx,-20(%rdi)
+
+	movq	%rbx,%rax
+	shlq	$12,%rax
+	orq	%r8,%rax
+	andl	$0x3ffffff,%eax
+	movl	%eax,-4(%rdi)
+	leal	(%rax,%rax,4),%eax
+	movq	%rbx,%r8
+	movl	%eax,12(%rdi)
+
+	movl	$0x3ffffff,%edx
+	shrq	$14,%r8
+	andl	%r8d,%edx
+	movl	%edx,28(%rdi)
+	leal	(%rdx,%rdx,4),%edx
+	shrq	$26,%r8
+	movl	%edx,44(%rdi)
+
+	movq	%r10,%rax
+	shlq	$24,%rax
+	orq	%rax,%r8
+	movl	%r8d,60(%rdi)
+	leaq	(%r8,%r8,4),%r8
+	movl	%r8d,76(%rdi)
+
+	movq	%r12,%rax
+	movq	%rdi,0(%rsp)
+	__poly1305_block
+	movq	0(%rsp),%rdi
+
+	movl	$0x3ffffff,%eax
+	movq	%r14,%r8
+	andl	%r14d,%eax
+	shrq	$26,%r8
+	movl	%eax,-56(%rdi)
+
+	movl	$0x3ffffff,%edx
+	andl	%r8d,%edx
+	movl	%edx,-40(%rdi)
+	leal	(%rdx,%rdx,4),%edx
+	shrq	$26,%r8
+	movl	%edx,-24(%rdi)
+
+	movq	%rbx,%rax
+	shlq	$12,%rax
+	orq	%r8,%rax
+	andl	$0x3ffffff,%eax
+	movl	%eax,-8(%rdi)
+	leal	(%rax,%rax,4),%eax
+	movq	%rbx,%r8
+	movl	%eax,8(%rdi)
+
+	movl	$0x3ffffff,%edx
+	shrq	$14,%r8
+	andl	%r8d,%edx
+	movl	%edx,24(%rdi)
+	leal	(%rdx,%rdx,4),%edx
+	shrq	$26,%r8
+	movl	%edx,40(%rdi)
+
+	movq	%r10,%rax
+	shlq	$24,%rax
+	orq	%rax,%r8
+	movl	%r8d,56(%rdi)
+	leaq	(%r8,%r8,4),%r8
+	movl	%r8d,72(%rdi)
+
+	leaq	-48-64(%rdi),%rdi
+.endm
+
+#ifdef CONFIG_AS_AVX
+.align	32
+ENTRY(poly1305_blocks_avx)
+
+	movl	20(%rdi),%r8d
+	cmpq	$128,%rdx
+	jae	.Lblocks_avx
+	testl	%r8d,%r8d
+	jz	.Lblocks
+
+.Lblocks_avx:
+	andq	$-16,%rdx
+	jz	.Lno_data_avx
+
+	vzeroupper
+
+	testl	%r8d,%r8d
+	jz	.Lbase2_64_avx
+
+	testq	$31,%rdx
+	jz	.Leven_avx
+
+	pushq	%rbx
+	pushq	%r12
+	pushq	%r13
+	pushq	%r14
+	pushq	%r15
+	pushq	%rdi
+
+.Lblocks_avx_body:
+
+	movq	%rdx,%r15
+
+	movq	0(%rdi),%r8
+	movq	8(%rdi),%r9
+	movl	16(%rdi),%r10d
+
+	movq	24(%rdi),%r11
+	movq	32(%rdi),%r13
+
+
+	movl	%r8d,%r14d
+	andq	$-2147483648,%r8
+	movq	%r9,%r12
+	movl	%r9d,%ebx
+	andq	$-2147483648,%r9
+
+	shrq	$6,%r8
+	shlq	$52,%r12
+	addq	%r8,%r14
+	shrq	$12,%rbx
+	shrq	$18,%r9
+	addq	%r12,%r14
+	adcq	%r9,%rbx
+
+	movq	%r10,%r8
+	shlq	$40,%r8
+	shrq	$24,%r10
+	addq	%r8,%rbx
+	adcq	$0,%r10
+
+	movq	$-4,%r9
+	movq	%r10,%r8
+	andq	%r10,%r9
+	shrq	$2,%r8
+	andq	$3,%r10
+	addq	%r9,%r8
+	addq	%r8,%r14
+	adcq	$0,%rbx
+	adcq	$0,%r10
+
+	movq	%r13,%r12
+	movq	%r13,%rax
+	shrq	$2,%r13
+	addq	%r12,%r13
+
+	addq	0(%rsi),%r14
+	adcq	8(%rsi),%rbx
+	leaq	16(%rsi),%rsi
+	adcq	%rcx,%r10
+
+	movq	%rdi,0(%rsp)
+	__poly1305_block
+	movq	0(%rsp),%rdi
+
+	testq	%rcx,%rcx
+	jz	.Lstore_base2_64_avx
+
+
+	movq	%r14,%rax
+	movq	%r14,%rdx
+	shrq	$52,%r14
+	movq	%rbx,%r11
+	movq	%rbx,%r12
+	shrq	$26,%rdx
+	andq	$0x3ffffff,%rax
+	shlq	$12,%r11
+	andq	$0x3ffffff,%rdx
+	shrq	$14,%rbx
+	orq	%r11,%r14
+	shlq	$24,%r10
+	andq	$0x3ffffff,%r14
+	shrq	$40,%r12
+	andq	$0x3ffffff,%rbx
+	orq	%r12,%r10
+
+	subq	$16,%r15
+	jz	.Lstore_base2_26_avx
+
+	vmovd	%eax,%xmm0
+	vmovd	%edx,%xmm1
+	vmovd	%r14d,%xmm2
+	vmovd	%ebx,%xmm3
+	vmovd	%r10d,%xmm4
+	jmp	.Lproceed_avx
+
+.align	32
+.Lstore_base2_64_avx:
+	movq	%r14,0(%rdi)
+	movq	%rbx,8(%rdi)
+	movq	%r10,16(%rdi)
+	jmp	.Ldone_avx
+
+.align	16
+.Lstore_base2_26_avx:
+	movl	%eax,0(%rdi)
+	movl	%edx,4(%rdi)
+	movl	%r14d,8(%rdi)
+	movl	%ebx,12(%rdi)
+	movl	%r10d,16(%rdi)
+.align	16
+.Ldone_avx:
+	movq	8(%rsp),%r15
+	movq	16(%rsp),%r14
+	movq	24(%rsp),%r13
+	movq	32(%rsp),%r12
+	movq	40(%rsp),%rbx
+	leaq	48(%rsp),%rsp
+
+.Lno_data_avx:
+.Lblocks_avx_epilogue:
+	ret
+
+.align	32
+.Lbase2_64_avx:
+
+	pushq	%rbx
+	pushq	%r12
+	pushq	%r13
+	pushq	%r14
+	pushq	%r15
+	pushq	%rdi
+
+.Lbase2_64_avx_body:
+
+	movq	%rdx,%r15
+
+	movq	24(%rdi),%r11
+	movq	32(%rdi),%r13
+
+	movq	0(%rdi),%r14
+	movq	8(%rdi),%rbx
+	movl	16(%rdi),%r10d
+
+	movq	%r13,%r12
+	movq	%r13,%rax
+	shrq	$2,%r13
+	addq	%r12,%r13
+
+	testq	$31,%rdx
+	jz	.Linit_avx
+
+	addq	0(%rsi),%r14
+	adcq	8(%rsi),%rbx
+	leaq	16(%rsi),%rsi
+	adcq	%rcx,%r10
+	subq	$16,%r15
+
+	movq	%rdi,0(%rsp)
+	__poly1305_block
+	movq	0(%rsp),%rdi
+
+.Linit_avx:
+
+	movq	%r14,%rax
+	movq	%r14,%rdx
+	shrq	$52,%r14
+	movq	%rbx,%r8
+	movq	%rbx,%r9
+	shrq	$26,%rdx
+	andq	$0x3ffffff,%rax
+	shlq	$12,%r8
+	andq	$0x3ffffff,%rdx
+	shrq	$14,%rbx
+	orq	%r8,%r14
+	shlq	$24,%r10
+	andq	$0x3ffffff,%r14
+	shrq	$40,%r9
+	andq	$0x3ffffff,%rbx
+	orq	%r9,%r10
+
+	vmovd	%eax,%xmm0
+	vmovd	%edx,%xmm1
+	vmovd	%r14d,%xmm2
+	vmovd	%ebx,%xmm3
+	vmovd	%r10d,%xmm4
+	movl	$1,20(%rdi)
+
+	__poly1305_init_avx
+
+.Lproceed_avx:
+	movq	%r15,%rdx
+
+	movq	8(%rsp),%r15
+	movq	16(%rsp),%r14
+	movq	24(%rsp),%r13
+	movq	32(%rsp),%r12
+	movq	40(%rsp),%rbx
+	leaq	48(%rsp),%rax
+	leaq	48(%rsp),%rsp
+
+.Lbase2_64_avx_epilogue:
+	jmp	.Ldo_avx
+
+
+.align	32
+.Leven_avx:
+	vmovd	0(%rdi),%xmm0
+	vmovd	4(%rdi),%xmm1
+	vmovd	8(%rdi),%xmm2
+	vmovd	12(%rdi),%xmm3
+	vmovd	16(%rdi),%xmm4
+
+.Ldo_avx:
+	leaq	8(%rsp),%r10
+	andq	$-32,%rsp
+	subq	$8,%rsp
+	leaq	-88(%rsp),%r11
+	subq	$0x178,%rsp
+	subq	$64,%rdx
+	leaq	-32(%rsi),%rax
+	cmovcq	%rax,%rsi
+
+	vmovdqu	48(%rdi),%xmm14
+	leaq	112(%rdi),%rdi
+	leaq	.Lconst(%rip),%rcx
+
+	vmovdqu	32(%rsi),%xmm5
+	vmovdqu	48(%rsi),%xmm6
+	vmovdqa	64(%rcx),%xmm15
+
+	vpsrldq	$6,%xmm5,%xmm7
+	vpsrldq	$6,%xmm6,%xmm8
+	vpunpckhqdq	%xmm6,%xmm5,%xmm9
+	vpunpcklqdq	%xmm6,%xmm5,%xmm5
+	vpunpcklqdq	%xmm8,%xmm7,%xmm8
+
+	vpsrlq	$40,%xmm9,%xmm9
+	vpsrlq	$26,%xmm5,%xmm6
+	vpand	%xmm15,%xmm5,%xmm5
+	vpsrlq	$4,%xmm8,%xmm7
+	vpand	%xmm15,%xmm6,%xmm6
+	vpsrlq	$30,%xmm8,%xmm8
+	vpand	%xmm15,%xmm7,%xmm7
+	vpand	%xmm15,%xmm8,%xmm8
+	vpor	32(%rcx),%xmm9,%xmm9
+
+	jbe	.Lskip_loop_avx
+
+
+	vmovdqu	-48(%rdi),%xmm11
+	vmovdqu	-32(%rdi),%xmm12
+	vpshufd	$0xEE,%xmm14,%xmm13
+	vpshufd	$0x44,%xmm14,%xmm10
+	vmovdqa	%xmm13,-144(%r11)
+	vmovdqa	%xmm10,0(%rsp)
+	vpshufd	$0xEE,%xmm11,%xmm14
+	vmovdqu	-16(%rdi),%xmm10
+	vpshufd	$0x44,%xmm11,%xmm11
+	vmovdqa	%xmm14,-128(%r11)
+	vmovdqa	%xmm11,16(%rsp)
+	vpshufd	$0xEE,%xmm12,%xmm13
+	vmovdqu	0(%rdi),%xmm11
+	vpshufd	$0x44,%xmm12,%xmm12
+	vmovdqa	%xmm13,-112(%r11)
+	vmovdqa	%xmm12,32(%rsp)
+	vpshufd	$0xEE,%xmm10,%xmm14
+	vmovdqu	16(%rdi),%xmm12
+	vpshufd	$0x44,%xmm10,%xmm10
+	vmovdqa	%xmm14,-96(%r11)
+	vmovdqa	%xmm10,48(%rsp)
+	vpshufd	$0xEE,%xmm11,%xmm13
+	vmovdqu	32(%rdi),%xmm10
+	vpshufd	$0x44,%xmm11,%xmm11
+	vmovdqa	%xmm13,-80(%r11)
+	vmovdqa	%xmm11,64(%rsp)
+	vpshufd	$0xEE,%xmm12,%xmm14
+	vmovdqu	48(%rdi),%xmm11
+	vpshufd	$0x44,%xmm12,%xmm12
+	vmovdqa	%xmm14,-64(%r11)
+	vmovdqa	%xmm12,80(%rsp)
+	vpshufd	$0xEE,%xmm10,%xmm13
+	vmovdqu	64(%rdi),%xmm12
+	vpshufd	$0x44,%xmm10,%xmm10
+	vmovdqa	%xmm13,-48(%r11)
+	vmovdqa	%xmm10,96(%rsp)
+	vpshufd	$0xEE,%xmm11,%xmm14
+	vpshufd	$0x44,%xmm11,%xmm11
+	vmovdqa	%xmm14,-32(%r11)
+	vmovdqa	%xmm11,112(%rsp)
+	vpshufd	$0xEE,%xmm12,%xmm13
+	vmovdqa	0(%rsp),%xmm14
+	vpshufd	$0x44,%xmm12,%xmm12
+	vmovdqa	%xmm13,-16(%r11)
+	vmovdqa	%xmm12,128(%rsp)
+
+	jmp	.Loop_avx
+
+.align	32
+.Loop_avx:
+
+	vpmuludq	%xmm5,%xmm14,%xmm10
+	vpmuludq	%xmm6,%xmm14,%xmm11
+	vmovdqa	%xmm2,32(%r11)
+	vpmuludq	%xmm7,%xmm14,%xmm12
+	vmovdqa	16(%rsp),%xmm2
+	vpmuludq	%xmm8,%xmm14,%xmm13
+	vpmuludq	%xmm9,%xmm14,%xmm14
+
+	vmovdqa	%xmm0,0(%r11)
+	vpmuludq	32(%rsp),%xmm9,%xmm0
+	vmovdqa	%xmm1,16(%r11)
+	vpmuludq	%xmm8,%xmm2,%xmm1
+	vpaddq	%xmm0,%xmm10,%xmm10
+	vpaddq	%xmm1,%xmm14,%xmm14
+	vmovdqa	%xmm3,48(%r11)
+	vpmuludq	%xmm7,%xmm2,%xmm0
+	vpmuludq	%xmm6,%xmm2,%xmm1
+	vpaddq	%xmm0,%xmm13,%xmm13
+	vmovdqa	48(%rsp),%xmm3
+	vpaddq	%xmm1,%xmm12,%xmm12
+	vmovdqa	%xmm4,64(%r11)
+	vpmuludq	%xmm5,%xmm2,%xmm2
+	vpmuludq	%xmm7,%xmm3,%xmm0
+	vpaddq	%xmm2,%xmm11,%xmm11
+
+	vmovdqa	64(%rsp),%xmm4
+	vpaddq	%xmm0,%xmm14,%xmm14
+	vpmuludq	%xmm6,%xmm3,%xmm1
+	vpmuludq	%xmm5,%xmm3,%xmm3
+	vpaddq	%xmm1,%xmm13,%xmm13
+	vmovdqa	80(%rsp),%xmm2
+	vpaddq	%xmm3,%xmm12,%xmm12
+	vpmuludq	%xmm9,%xmm4,%xmm0
+	vpmuludq	%xmm8,%xmm4,%xmm4
+	vpaddq	%xmm0,%xmm11,%xmm11
+	vmovdqa	96(%rsp),%xmm3
+	vpaddq	%xmm4,%xmm10,%xmm10
+
+	vmovdqa	128(%rsp),%xmm4
+	vpmuludq	%xmm6,%xmm2,%xmm1
+	vpmuludq	%xmm5,%xmm2,%xmm2
+	vpaddq	%xmm1,%xmm14,%xmm14
+	vpaddq	%xmm2,%xmm13,%xmm13
+	vpmuludq	%xmm9,%xmm3,%xmm0
+	vpmuludq	%xmm8,%xmm3,%xmm1
+	vpaddq	%xmm0,%xmm12,%xmm12
+	vmovdqu	0(%rsi),%xmm0
+	vpaddq	%xmm1,%xmm11,%xmm11
+	vpmuludq	%xmm7,%xmm3,%xmm3
+	vpmuludq	%xmm7,%xmm4,%xmm7
+	vpaddq	%xmm3,%xmm10,%xmm10
+
+	vmovdqu	16(%rsi),%xmm1
+	vpaddq	%xmm7,%xmm11,%xmm11
+	vpmuludq	%xmm8,%xmm4,%xmm8
+	vpmuludq	%xmm9,%xmm4,%xmm9
+	vpsrldq	$6,%xmm0,%xmm2
+	vpaddq	%xmm8,%xmm12,%xmm12
+	vpaddq	%xmm9,%xmm13,%xmm13
+	vpsrldq	$6,%xmm1,%xmm3
+	vpmuludq	112(%rsp),%xmm5,%xmm9
+	vpmuludq	%xmm6,%xmm4,%xmm5
+	vpunpckhqdq	%xmm1,%xmm0,%xmm4
+	vpaddq	%xmm9,%xmm14,%xmm14
+	vmovdqa	-144(%r11),%xmm9
+	vpaddq	%xmm5,%xmm10,%xmm10
+
+	vpunpcklqdq	%xmm1,%xmm0,%xmm0
+	vpunpcklqdq	%xmm3,%xmm2,%xmm3
+
+
+	vpsrldq	$5,%xmm4,%xmm4
+	vpsrlq	$26,%xmm0,%xmm1
+	vpand	%xmm15,%xmm0,%xmm0
+	vpsrlq	$4,%xmm3,%xmm2
+	vpand	%xmm15,%xmm1,%xmm1
+	vpand	0(%rcx),%xmm4,%xmm4
+	vpsrlq	$30,%xmm3,%xmm3
+	vpand	%xmm15,%xmm2,%xmm2
+	vpand	%xmm15,%xmm3,%xmm3
+	vpor	32(%rcx),%xmm4,%xmm4
+
+	vpaddq	0(%r11),%xmm0,%xmm0
+	vpaddq	16(%r11),%xmm1,%xmm1
+	vpaddq	32(%r11),%xmm2,%xmm2
+	vpaddq	48(%r11),%xmm3,%xmm3
+	vpaddq	64(%r11),%xmm4,%xmm4
+
+	leaq	32(%rsi),%rax
+	leaq	64(%rsi),%rsi
+	subq	$64,%rdx
+	cmovcq	%rax,%rsi
+
+	vpmuludq	%xmm0,%xmm9,%xmm5
+	vpmuludq	%xmm1,%xmm9,%xmm6
+	vpaddq	%xmm5,%xmm10,%xmm10
+	vpaddq	%xmm6,%xmm11,%xmm11
+	vmovdqa	-128(%r11),%xmm7
+	vpmuludq	%xmm2,%xmm9,%xmm5
+	vpmuludq	%xmm3,%xmm9,%xmm6
+	vpaddq	%xmm5,%xmm12,%xmm12
+	vpaddq	%xmm6,%xmm13,%xmm13
+	vpmuludq	%xmm4,%xmm9,%xmm9
+	vpmuludq	-112(%r11),%xmm4,%xmm5
+	vpaddq	%xmm9,%xmm14,%xmm14
+
+	vpaddq	%xmm5,%xmm10,%xmm10
+	vpmuludq	%xmm2,%xmm7,%xmm6
+	vpmuludq	%xmm3,%xmm7,%xmm5
+	vpaddq	%xmm6,%xmm13,%xmm13
+	vmovdqa	-96(%r11),%xmm8
+	vpaddq	%xmm5,%xmm14,%xmm14
+	vpmuludq	%xmm1,%xmm7,%xmm6
+	vpmuludq	%xmm0,%xmm7,%xmm7
+	vpaddq	%xmm6,%xmm12,%xmm12
+	vpaddq	%xmm7,%xmm11,%xmm11
+
+	vmovdqa	-80(%r11),%xmm9
+	vpmuludq	%xmm2,%xmm8,%xmm5
+	vpmuludq	%xmm1,%xmm8,%xmm6
+	vpaddq	%xmm5,%xmm14,%xmm14
+	vpaddq	%xmm6,%xmm13,%xmm13
+	vmovdqa	-64(%r11),%xmm7
+	vpmuludq	%xmm0,%xmm8,%xmm8
+	vpmuludq	%xmm4,%xmm9,%xmm5
+	vpaddq	%xmm8,%xmm12,%xmm12
+	vpaddq	%xmm5,%xmm11,%xmm11
+	vmovdqa	-48(%r11),%xmm8
+	vpmuludq	%xmm3,%xmm9,%xmm9
+	vpmuludq	%xmm1,%xmm7,%xmm6
+	vpaddq	%xmm9,%xmm10,%xmm10
+
+	vmovdqa	-16(%r11),%xmm9
+	vpaddq	%xmm6,%xmm14,%xmm14
+	vpmuludq	%xmm0,%xmm7,%xmm7
+	vpmuludq	%xmm4,%xmm8,%xmm5
+	vpaddq	%xmm7,%xmm13,%xmm13
+	vpaddq	%xmm5,%xmm12,%xmm12
+	vmovdqu	32(%rsi),%xmm5
+	vpmuludq	%xmm3,%xmm8,%xmm7
+	vpmuludq	%xmm2,%xmm8,%xmm8
+	vpaddq	%xmm7,%xmm11,%xmm11
+	vmovdqu	48(%rsi),%xmm6
+	vpaddq	%xmm8,%xmm10,%xmm10
+
+	vpmuludq	%xmm2,%xmm9,%xmm2
+	vpmuludq	%xmm3,%xmm9,%xmm3
+	vpsrldq	$6,%xmm5,%xmm7
+	vpaddq	%xmm2,%xmm11,%xmm11
+	vpmuludq	%xmm4,%xmm9,%xmm4
+	vpsrldq	$6,%xmm6,%xmm8
+	vpaddq	%xmm3,%xmm12,%xmm2
+	vpaddq	%xmm4,%xmm13,%xmm3
+	vpmuludq	-32(%r11),%xmm0,%xmm4
+	vpmuludq	%xmm1,%xmm9,%xmm0
+	vpunpckhqdq	%xmm6,%xmm5,%xmm9
+	vpaddq	%xmm4,%xmm14,%xmm4
+	vpaddq	%xmm0,%xmm10,%xmm0
+
+	vpunpcklqdq	%xmm6,%xmm5,%xmm5
+	vpunpcklqdq	%xmm8,%xmm7,%xmm8
+
+
+	vpsrldq	$5,%xmm9,%xmm9
+	vpsrlq	$26,%xmm5,%xmm6
+	vmovdqa	0(%rsp),%xmm14
+	vpand	%xmm15,%xmm5,%xmm5
+	vpsrlq	$4,%xmm8,%xmm7
+	vpand	%xmm15,%xmm6,%xmm6
+	vpand	0(%rcx),%xmm9,%xmm9
+	vpsrlq	$30,%xmm8,%xmm8
+	vpand	%xmm15,%xmm7,%xmm7
+	vpand	%xmm15,%xmm8,%xmm8
+	vpor	32(%rcx),%xmm9,%xmm9
+
+	vpsrlq	$26,%xmm3,%xmm13
+	vpand	%xmm15,%xmm3,%xmm3
+	vpaddq	%xmm13,%xmm4,%xmm4
+
+	vpsrlq	$26,%xmm0,%xmm10
+	vpand	%xmm15,%xmm0,%xmm0
+	vpaddq	%xmm10,%xmm11,%xmm1
+
+	vpsrlq	$26,%xmm4,%xmm10
+	vpand	%xmm15,%xmm4,%xmm4
+
+	vpsrlq	$26,%xmm1,%xmm11
+	vpand	%xmm15,%xmm1,%xmm1
+	vpaddq	%xmm11,%xmm2,%xmm2
+
+	vpaddq	%xmm10,%xmm0,%xmm0
+	vpsllq	$2,%xmm10,%xmm10
+	vpaddq	%xmm10,%xmm0,%xmm0
+
+	vpsrlq	$26,%xmm2,%xmm12
+	vpand	%xmm15,%xmm2,%xmm2
+	vpaddq	%xmm12,%xmm3,%xmm3
+
+	vpsrlq	$26,%xmm0,%xmm10
+	vpand	%xmm15,%xmm0,%xmm0
+	vpaddq	%xmm10,%xmm1,%xmm1
+
+	vpsrlq	$26,%xmm3,%xmm13
+	vpand	%xmm15,%xmm3,%xmm3
+	vpaddq	%xmm13,%xmm4,%xmm4
+
+	ja	.Loop_avx
+
+.Lskip_loop_avx:
+	vpshufd	$0x10,%xmm14,%xmm14
+	addq	$32,%rdx
+	jnz	.Long_tail_avx
+
+	vpaddq	%xmm2,%xmm7,%xmm7
+	vpaddq	%xmm0,%xmm5,%xmm5
+	vpaddq	%xmm1,%xmm6,%xmm6
+	vpaddq	%xmm3,%xmm8,%xmm8
+	vpaddq	%xmm4,%xmm9,%xmm9
+
+.Long_tail_avx:
+	vmovdqa	%xmm2,32(%r11)
+	vmovdqa	%xmm0,0(%r11)
+	vmovdqa	%xmm1,16(%r11)
+	vmovdqa	%xmm3,48(%r11)
+	vmovdqa	%xmm4,64(%r11)
+
+	vpmuludq	%xmm7,%xmm14,%xmm12
+	vpmuludq	%xmm5,%xmm14,%xmm10
+	vpshufd	$0x10,-48(%rdi),%xmm2
+	vpmuludq	%xmm6,%xmm14,%xmm11
+	vpmuludq	%xmm8,%xmm14,%xmm13
+	vpmuludq	%xmm9,%xmm14,%xmm14
+
+	vpmuludq	%xmm8,%xmm2,%xmm0
+	vpaddq	%xmm0,%xmm14,%xmm14
+	vpshufd	$0x10,-32(%rdi),%xmm3
+	vpmuludq	%xmm7,%xmm2,%xmm1
+	vpaddq	%xmm1,%xmm13,%xmm13
+	vpshufd	$0x10,-16(%rdi),%xmm4
+	vpmuludq	%xmm6,%xmm2,%xmm0
+	vpaddq	%xmm0,%xmm12,%xmm12
+	vpmuludq	%xmm5,%xmm2,%xmm2
+	vpaddq	%xmm2,%xmm11,%xmm11
+	vpmuludq	%xmm9,%xmm3,%xmm3
+	vpaddq	%xmm3,%xmm10,%xmm10
+
+	vpshufd	$0x10,0(%rdi),%xmm2
+	vpmuludq	%xmm7,%xmm4,%xmm1
+	vpaddq	%xmm1,%xmm14,%xmm14
+	vpmuludq	%xmm6,%xmm4,%xmm0
+	vpaddq	%xmm0,%xmm13,%xmm13
+	vpshufd	$0x10,16(%rdi),%xmm3
+	vpmuludq	%xmm5,%xmm4,%xmm4
+	vpaddq	%xmm4,%xmm12,%xmm12
+	vpmuludq	%xmm9,%xmm2,%xmm1
+	vpaddq	%xmm1,%xmm11,%xmm11
+	vpshufd	$0x10,32(%rdi),%xmm4
+	vpmuludq	%xmm8,%xmm2,%xmm2
+	vpaddq	%xmm2,%xmm10,%xmm10
+
+	vpmuludq	%xmm6,%xmm3,%xmm0
+	vpaddq	%xmm0,%xmm14,%xmm14
+	vpmuludq	%xmm5,%xmm3,%xmm3
+	vpaddq	%xmm3,%xmm13,%xmm13
+	vpshufd	$0x10,48(%rdi),%xmm2
+	vpmuludq	%xmm9,%xmm4,%xmm1
+	vpaddq	%xmm1,%xmm12,%xmm12
+	vpshufd	$0x10,64(%rdi),%xmm3
+	vpmuludq	%xmm8,%xmm4,%xmm0
+	vpaddq	%xmm0,%xmm11,%xmm11
+	vpmuludq	%xmm7,%xmm4,%xmm4
+	vpaddq	%xmm4,%xmm10,%xmm10
+
+	vpmuludq	%xmm5,%xmm2,%xmm2
+	vpaddq	%xmm2,%xmm14,%xmm14
+	vpmuludq	%xmm9,%xmm3,%xmm1
+	vpaddq	%xmm1,%xmm13,%xmm13
+	vpmuludq	%xmm8,%xmm3,%xmm0
+	vpaddq	%xmm0,%xmm12,%xmm12
+	vpmuludq	%xmm7,%xmm3,%xmm1
+	vpaddq	%xmm1,%xmm11,%xmm11
+	vpmuludq	%xmm6,%xmm3,%xmm3
+	vpaddq	%xmm3,%xmm10,%xmm10
+
+	jz	.Lshort_tail_avx
+
+	vmovdqu	0(%rsi),%xmm0
+	vmovdqu	16(%rsi),%xmm1
+
+	vpsrldq	$6,%xmm0,%xmm2
+	vpsrldq	$6,%xmm1,%xmm3
+	vpunpckhqdq	%xmm1,%xmm0,%xmm4
+	vpunpcklqdq	%xmm1,%xmm0,%xmm0
+	vpunpcklqdq	%xmm3,%xmm2,%xmm3
+
+	vpsrlq	$40,%xmm4,%xmm4
+	vpsrlq	$26,%xmm0,%xmm1
+	vpand	%xmm15,%xmm0,%xmm0
+	vpsrlq	$4,%xmm3,%xmm2
+	vpand	%xmm15,%xmm1,%xmm1
+	vpsrlq	$30,%xmm3,%xmm3
+	vpand	%xmm15,%xmm2,%xmm2
+	vpand	%xmm15,%xmm3,%xmm3
+	vpor	32(%rcx),%xmm4,%xmm4
+
+	vpshufd	$0x32,-64(%rdi),%xmm9
+	vpaddq	0(%r11),%xmm0,%xmm0
+	vpaddq	16(%r11),%xmm1,%xmm1
+	vpaddq	32(%r11),%xmm2,%xmm2
+	vpaddq	48(%r11),%xmm3,%xmm3
+	vpaddq	64(%r11),%xmm4,%xmm4
+
+	vpmuludq	%xmm0,%xmm9,%xmm5
+	vpaddq	%xmm5,%xmm10,%xmm10
+	vpmuludq	%xmm1,%xmm9,%xmm6
+	vpaddq	%xmm6,%xmm11,%xmm11
+	vpmuludq	%xmm2,%xmm9,%xmm5
+	vpaddq	%xmm5,%xmm12,%xmm12
+	vpshufd	$0x32,-48(%rdi),%xmm7
+	vpmuludq	%xmm3,%xmm9,%xmm6
+	vpaddq	%xmm6,%xmm13,%xmm13
+	vpmuludq	%xmm4,%xmm9,%xmm9
+	vpaddq	%xmm9,%xmm14,%xmm14
+
+	vpmuludq	%xmm3,%xmm7,%xmm5
+	vpaddq	%xmm5,%xmm14,%xmm14
+	vpshufd	$0x32,-32(%rdi),%xmm8
+	vpmuludq	%xmm2,%xmm7,%xmm6
+	vpaddq	%xmm6,%xmm13,%xmm13
+	vpshufd	$0x32,-16(%rdi),%xmm9
+	vpmuludq	%xmm1,%xmm7,%xmm5
+	vpaddq	%xmm5,%xmm12,%xmm12
+	vpmuludq	%xmm0,%xmm7,%xmm7
+	vpaddq	%xmm7,%xmm11,%xmm11
+	vpmuludq	%xmm4,%xmm8,%xmm8
+	vpaddq	%xmm8,%xmm10,%xmm10
+
+	vpshufd	$0x32,0(%rdi),%xmm7
+	vpmuludq	%xmm2,%xmm9,%xmm6
+	vpaddq	%xmm6,%xmm14,%xmm14
+	vpmuludq	%xmm1,%xmm9,%xmm5
+	vpaddq	%xmm5,%xmm13,%xmm13
+	vpshufd	$0x32,16(%rdi),%xmm8
+	vpmuludq	%xmm0,%xmm9,%xmm9
+	vpaddq	%xmm9,%xmm12,%xmm12
+	vpmuludq	%xmm4,%xmm7,%xmm6
+	vpaddq	%xmm6,%xmm11,%xmm11
+	vpshufd	$0x32,32(%rdi),%xmm9
+	vpmuludq	%xmm3,%xmm7,%xmm7
+	vpaddq	%xmm7,%xmm10,%xmm10
+
+	vpmuludq	%xmm1,%xmm8,%xmm5
+	vpaddq	%xmm5,%xmm14,%xmm14
+	vpmuludq	%xmm0,%xmm8,%xmm8
+	vpaddq	%xmm8,%xmm13,%xmm13
+	vpshufd	$0x32,48(%rdi),%xmm7
+	vpmuludq	%xmm4,%xmm9,%xmm6
+	vpaddq	%xmm6,%xmm12,%xmm12
+	vpshufd	$0x32,64(%rdi),%xmm8
+	vpmuludq	%xmm3,%xmm9,%xmm5
+	vpaddq	%xmm5,%xmm11,%xmm11
+	vpmuludq	%xmm2,%xmm9,%xmm9
+	vpaddq	%xmm9,%xmm10,%xmm10
+
+	vpmuludq	%xmm0,%xmm7,%xmm7
+	vpaddq	%xmm7,%xmm14,%xmm14
+	vpmuludq	%xmm4,%xmm8,%xmm6
+	vpaddq	%xmm6,%xmm13,%xmm13
+	vpmuludq	%xmm3,%xmm8,%xmm5
+	vpaddq	%xmm5,%xmm12,%xmm12
+	vpmuludq	%xmm2,%xmm8,%xmm6
+	vpaddq	%xmm6,%xmm11,%xmm11
+	vpmuludq	%xmm1,%xmm8,%xmm8
+	vpaddq	%xmm8,%xmm10,%xmm10
+
+.Lshort_tail_avx:
+
+	vpsrldq	$8,%xmm14,%xmm9
+	vpsrldq	$8,%xmm13,%xmm8
+	vpsrldq	$8,%xmm11,%xmm6
+	vpsrldq	$8,%xmm10,%xmm5
+	vpsrldq	$8,%xmm12,%xmm7
+	vpaddq	%xmm8,%xmm13,%xmm13
+	vpaddq	%xmm9,%xmm14,%xmm14
+	vpaddq	%xmm5,%xmm10,%xmm10
+	vpaddq	%xmm6,%xmm11,%xmm11
+	vpaddq	%xmm7,%xmm12,%xmm12
+
+	vpsrlq	$26,%xmm13,%xmm3
+	vpand	%xmm15,%xmm13,%xmm13
+	vpaddq	%xmm3,%xmm14,%xmm14
+
+	vpsrlq	$26,%xmm10,%xmm0
+	vpand	%xmm15,%xmm10,%xmm10
+	vpaddq	%xmm0,%xmm11,%xmm11
+
+	vpsrlq	$26,%xmm14,%xmm4
+	vpand	%xmm15,%xmm14,%xmm14
+
+	vpsrlq	$26,%xmm11,%xmm1
+	vpand	%xmm15,%xmm11,%xmm11
+	vpaddq	%xmm1,%xmm12,%xmm12
+
+	vpaddq	%xmm4,%xmm10,%xmm10
+	vpsllq	$2,%xmm4,%xmm4
+	vpaddq	%xmm4,%xmm10,%xmm10
+
+	vpsrlq	$26,%xmm12,%xmm2
+	vpand	%xmm15,%xmm12,%xmm12
+	vpaddq	%xmm2,%xmm13,%xmm13
+
+	vpsrlq	$26,%xmm10,%xmm0
+	vpand	%xmm15,%xmm10,%xmm10
+	vpaddq	%xmm0,%xmm11,%xmm11
+
+	vpsrlq	$26,%xmm13,%xmm3
+	vpand	%xmm15,%xmm13,%xmm13
+	vpaddq	%xmm3,%xmm14,%xmm14
+
+	vmovd	%xmm10,-112(%rdi)
+	vmovd	%xmm11,-108(%rdi)
+	vmovd	%xmm12,-104(%rdi)
+	vmovd	%xmm13,-100(%rdi)
+	vmovd	%xmm14,-96(%rdi)
+	leaq	-8(%r10),%rsp
+
+	vzeroupper
+	ret
+ENDPROC(poly1305_blocks_avx)
+
+.align	32
+ENTRY(poly1305_emit_avx)
+	cmpl	$0,20(%rdi)
+	je	.Lemit
+
+	movl	0(%rdi),%eax
+	movl	4(%rdi),%ecx
+	movl	8(%rdi),%r8d
+	movl	12(%rdi),%r11d
+	movl	16(%rdi),%r10d
+
+	shlq	$26,%rcx
+	movq	%r8,%r9
+	shlq	$52,%r8
+	addq	%rcx,%rax
+	shrq	$12,%r9
+	addq	%rax,%r8
+	adcq	$0,%r9
+
+	shlq	$14,%r11
+	movq	%r10,%rax
+	shrq	$24,%r10
+	addq	%r11,%r9
+	shlq	$40,%rax
+	addq	%rax,%r9
+	adcq	$0,%r10
+
+	movq	%r10,%rax
+	movq	%r10,%rcx
+	andq	$3,%r10
+	shrq	$2,%rax
+	andq	$-4,%rcx
+	addq	%rcx,%rax
+	addq	%rax,%r8
+	adcq	$0,%r9
+	adcq	$0,%r10
+
+	movq	%r8,%rax
+	addq	$5,%r8
+	movq	%r9,%rcx
+	adcq	$0,%r9
+	adcq	$0,%r10
+	shrq	$2,%r10
+	cmovnzq	%r8,%rax
+	cmovnzq	%r9,%rcx
+
+	addq	0(%rdx),%rax
+	adcq	8(%rdx),%rcx
+	movq	%rax,0(%rsi)
+	movq	%rcx,8(%rsi)
+
+	ret
+ENDPROC(poly1305_emit_avx)
+#endif /* CONFIG_AS_AVX */
+
+#ifdef CONFIG_AS_AVX2
+.align	32
+ENTRY(poly1305_blocks_avx2)
+
+	movl	20(%rdi),%r8d
+	cmpq	$128,%rdx
+	jae	.Lblocks_avx2
+	testl	%r8d,%r8d
+	jz	.Lblocks
+
+.Lblocks_avx2:
+	andq	$-16,%rdx
+	jz	.Lno_data_avx2
+
+	vzeroupper
+
+	testl	%r8d,%r8d
+	jz	.Lbase2_64_avx2
+
+	testq	$63,%rdx
+	jz	.Leven_avx2
+
+	pushq	%rbx
+	pushq	%r12
+	pushq	%r13
+	pushq	%r14
+	pushq	%r15
+	pushq	%rdi
+
+.Lblocks_avx2_body:
+
+	movq	%rdx,%r15
+
+	movq	0(%rdi),%r8
+	movq	8(%rdi),%r9
+	movl	16(%rdi),%r10d
+
+	movq	24(%rdi),%r11
+	movq	32(%rdi),%r13
+
+
+	movl	%r8d,%r14d
+	andq	$-2147483648,%r8
+	movq	%r9,%r12
+	movl	%r9d,%ebx
+	andq	$-2147483648,%r9
+
+	shrq	$6,%r8
+	shlq	$52,%r12
+	addq	%r8,%r14
+	shrq	$12,%rbx
+	shrq	$18,%r9
+	addq	%r12,%r14
+	adcq	%r9,%rbx
+
+	movq	%r10,%r8
+	shlq	$40,%r8
+	shrq	$24,%r10
+	addq	%r8,%rbx
+	adcq	$0,%r10
+
+	movq	$-4,%r9
+	movq	%r10,%r8
+	andq	%r10,%r9
+	shrq	$2,%r8
+	andq	$3,%r10
+	addq	%r9,%r8
+	addq	%r8,%r14
+	adcq	$0,%rbx
+	adcq	$0,%r10
+
+	movq	%r13,%r12
+	movq	%r13,%rax
+	shrq	$2,%r13
+	addq	%r12,%r13
+
+.Lbase2_26_pre_avx2:
+	addq	0(%rsi),%r14
+	adcq	8(%rsi),%rbx
+	leaq	16(%rsi),%rsi
+	adcq	%rcx,%r10
+	subq	$16,%r15
+
+	movq	%rdi,0(%rsp)
+	__poly1305_block
+	movq	0(%rsp),%rdi
+	movq	%r12,%rax
+
+	testq	$63,%r15
+	jnz	.Lbase2_26_pre_avx2
+
+	testq	%rcx,%rcx
+	jz	.Lstore_base2_64_avx2
+
+
+	movq	%r14,%rax
+	movq	%r14,%rdx
+	shrq	$52,%r14
+	movq	%rbx,%r11
+	movq	%rbx,%r12
+	shrq	$26,%rdx
+	andq	$0x3ffffff,%rax
+	shlq	$12,%r11
+	andq	$0x3ffffff,%rdx
+	shrq	$14,%rbx
+	orq	%r11,%r14
+	shlq	$24,%r10
+	andq	$0x3ffffff,%r14
+	shrq	$40,%r12
+	andq	$0x3ffffff,%rbx
+	orq	%r12,%r10
+
+	testq	%r15,%r15
+	jz	.Lstore_base2_26_avx2
+
+	vmovd	%eax,%xmm0
+	vmovd	%edx,%xmm1
+	vmovd	%r14d,%xmm2
+	vmovd	%ebx,%xmm3
+	vmovd	%r10d,%xmm4
+	jmp	.Lproceed_avx2
+
+.align	32
+.Lstore_base2_64_avx2:
+	movq	%r14,0(%rdi)
+	movq	%rbx,8(%rdi)
+	movq	%r10,16(%rdi)
+	jmp	.Ldone_avx2
+
+.align	16
+.Lstore_base2_26_avx2:
+	movl	%eax,0(%rdi)
+	movl	%edx,4(%rdi)
+	movl	%r14d,8(%rdi)
+	movl	%ebx,12(%rdi)
+	movl	%r10d,16(%rdi)
+.align	16
+.Ldone_avx2:
+	movq	8(%rsp),%r15
+	movq	16(%rsp),%r14
+	movq	24(%rsp),%r13
+	movq	32(%rsp),%r12
+	movq	40(%rsp),%rbx
+	leaq	48(%rsp),%rsp
+
+.Lno_data_avx2:
+.Lblocks_avx2_epilogue:
+	ret
+
+
+.align	32
+.Lbase2_64_avx2:
+
+
+	pushq	%rbx
+	pushq	%r12
+	pushq	%r13
+	pushq	%r14
+	pushq	%r15
+	pushq	%rdi
+
+.Lbase2_64_avx2_body:
+
+	movq	%rdx,%r15
+
+	movq	24(%rdi),%r11
+	movq	32(%rdi),%r13
+
+	movq	0(%rdi),%r14
+	movq	8(%rdi),%rbx
+	movl	16(%rdi),%r10d
+
+	movq	%r13,%r12
+	movq	%r13,%rax
+	shrq	$2,%r13
+	addq	%r12,%r13
+
+	testq	$63,%rdx
+	jz	.Linit_avx2
+
+.Lbase2_64_pre_avx2:
+	addq	0(%rsi),%r14
+	adcq	8(%rsi),%rbx
+	leaq	16(%rsi),%rsi
+	adcq	%rcx,%r10
+	subq	$16,%r15
+
+	movq	%rdi,0(%rsp)
+	__poly1305_block
+	movq	0(%rsp),%rdi
+	movq	%r12,%rax
+
+	testq	$63,%r15
+	jnz	.Lbase2_64_pre_avx2
+
+.Linit_avx2:
+
+	movq	%r14,%rax
+	movq	%r14,%rdx
+	shrq	$52,%r14
+	movq	%rbx,%r8
+	movq	%rbx,%r9
+	shrq	$26,%rdx
+	andq	$0x3ffffff,%rax
+	shlq	$12,%r8
+	andq	$0x3ffffff,%rdx
+	shrq	$14,%rbx
+	orq	%r8,%r14
+	shlq	$24,%r10
+	andq	$0x3ffffff,%r14
+	shrq	$40,%r9
+	andq	$0x3ffffff,%rbx
+	orq	%r9,%r10
+
+	vmovd	%eax,%xmm0
+	vmovd	%edx,%xmm1
+	vmovd	%r14d,%xmm2
+	vmovd	%ebx,%xmm3
+	vmovd	%r10d,%xmm4
+	movl	$1,20(%rdi)
+
+	__poly1305_init_avx
+
+.Lproceed_avx2:
+	movq	%r15,%rdx
+
+	movq	8(%rsp),%r15
+	movq	16(%rsp),%r14
+	movq	24(%rsp),%r13
+	movq	32(%rsp),%r12
+	movq	40(%rsp),%rbx
+	leaq	48(%rsp),%rax
+	leaq	48(%rsp),%rsp
+
+.Lbase2_64_avx2_epilogue:
+	jmp	.Ldo_avx2
+
+
+.align	32
+.Leven_avx2:
+
+	vmovd	0(%rdi),%xmm0
+	vmovd	4(%rdi),%xmm1
+	vmovd	8(%rdi),%xmm2
+	vmovd	12(%rdi),%xmm3
+	vmovd	16(%rdi),%xmm4
+
+.Ldo_avx2:
+	leaq	8(%rsp),%r10
+	subq	$0x128,%rsp
+	leaq	.Lconst(%rip),%rcx
+	leaq	48+64(%rdi),%rdi
+	vmovdqa	96(%rcx),%ymm7
+
+
+	vmovdqu	-64(%rdi),%xmm9
+	andq	$-512,%rsp
+	vmovdqu	-48(%rdi),%xmm10
+	vmovdqu	-32(%rdi),%xmm6
+	vmovdqu	-16(%rdi),%xmm11
+	vmovdqu	0(%rdi),%xmm12
+	vmovdqu	16(%rdi),%xmm13
+	leaq	144(%rsp),%rax
+	vmovdqu	32(%rdi),%xmm14
+	vpermd	%ymm9,%ymm7,%ymm9
+	vmovdqu	48(%rdi),%xmm15
+	vpermd	%ymm10,%ymm7,%ymm10
+	vmovdqu	64(%rdi),%xmm5
+	vpermd	%ymm6,%ymm7,%ymm6
+	vmovdqa	%ymm9,0(%rsp)
+	vpermd	%ymm11,%ymm7,%ymm11
+	vmovdqa	%ymm10,32-144(%rax)
+	vpermd	%ymm12,%ymm7,%ymm12
+	vmovdqa	%ymm6,64-144(%rax)
+	vpermd	%ymm13,%ymm7,%ymm13
+	vmovdqa	%ymm11,96-144(%rax)
+	vpermd	%ymm14,%ymm7,%ymm14
+	vmovdqa	%ymm12,128-144(%rax)
+	vpermd	%ymm15,%ymm7,%ymm15
+	vmovdqa	%ymm13,160-144(%rax)
+	vpermd	%ymm5,%ymm7,%ymm5
+	vmovdqa	%ymm14,192-144(%rax)
+	vmovdqa	%ymm15,224-144(%rax)
+	vmovdqa	%ymm5,256-144(%rax)
+	vmovdqa	64(%rcx),%ymm5
+
+
+
+	vmovdqu	0(%rsi),%xmm7
+	vmovdqu	16(%rsi),%xmm8
+	vinserti128	$1,32(%rsi),%ymm7,%ymm7
+	vinserti128	$1,48(%rsi),%ymm8,%ymm8
+	leaq	64(%rsi),%rsi
+
+	vpsrldq	$6,%ymm7,%ymm9
+	vpsrldq	$6,%ymm8,%ymm10
+	vpunpckhqdq	%ymm8,%ymm7,%ymm6
+	vpunpcklqdq	%ymm10,%ymm9,%ymm9
+	vpunpcklqdq	%ymm8,%ymm7,%ymm7
+
+	vpsrlq	$30,%ymm9,%ymm10
+	vpsrlq	$4,%ymm9,%ymm9
+	vpsrlq	$26,%ymm7,%ymm8
+	vpsrlq	$40,%ymm6,%ymm6
+	vpand	%ymm5,%ymm9,%ymm9
+	vpand	%ymm5,%ymm7,%ymm7
+	vpand	%ymm5,%ymm8,%ymm8
+	vpand	%ymm5,%ymm10,%ymm10
+	vpor	32(%rcx),%ymm6,%ymm6
+
+	vpaddq	%ymm2,%ymm9,%ymm2
+	subq	$64,%rdx
+	jz	.Ltail_avx2
+	jmp	.Loop_avx2
+
+.align	32
+.Loop_avx2:
+
+	vpaddq	%ymm0,%ymm7,%ymm0
+	vmovdqa	0(%rsp),%ymm7
+	vpaddq	%ymm1,%ymm8,%ymm1
+	vmovdqa	32(%rsp),%ymm8
+	vpaddq	%ymm3,%ymm10,%ymm3
+	vmovdqa	96(%rsp),%ymm9
+	vpaddq	%ymm4,%ymm6,%ymm4
+	vmovdqa	48(%rax),%ymm10
+	vmovdqa	112(%rax),%ymm5
+
+	vpmuludq	%ymm2,%ymm7,%ymm13
+	vpmuludq	%ymm2,%ymm8,%ymm14
+	vpmuludq	%ymm2,%ymm9,%ymm15
+	vpmuludq	%ymm2,%ymm10,%ymm11
+	vpmuludq	%ymm2,%ymm5,%ymm12
+
+	vpmuludq	%ymm0,%ymm8,%ymm6
+	vpmuludq	%ymm1,%ymm8,%ymm2
+	vpaddq	%ymm6,%ymm12,%ymm12
+	vpaddq	%ymm2,%ymm13,%ymm13
+	vpmuludq	%ymm3,%ymm8,%ymm6
+	vpmuludq	64(%rsp),%ymm4,%ymm2
+	vpaddq	%ymm6,%ymm15,%ymm15
+	vpaddq	%ymm2,%ymm11,%ymm11
+	vmovdqa	-16(%rax),%ymm8
+
+	vpmuludq	%ymm0,%ymm7,%ymm6
+	vpmuludq	%ymm1,%ymm7,%ymm2
+	vpaddq	%ymm6,%ymm11,%ymm11
+	vpaddq	%ymm2,%ymm12,%ymm12
+	vpmuludq	%ymm3,%ymm7,%ymm6
+	vpmuludq	%ymm4,%ymm7,%ymm2
+	vmovdqu	0(%rsi),%xmm7
+	vpaddq	%ymm6,%ymm14,%ymm14
+	vpaddq	%ymm2,%ymm15,%ymm15
+	vinserti128	$1,32(%rsi),%ymm7,%ymm7
+
+	vpmuludq	%ymm3,%ymm8,%ymm6
+	vpmuludq	%ymm4,%ymm8,%ymm2
+	vmovdqu	16(%rsi),%xmm8
+	vpaddq	%ymm6,%ymm11,%ymm11
+	vpaddq	%ymm2,%ymm12,%ymm12
+	vmovdqa	16(%rax),%ymm2
+	vpmuludq	%ymm1,%ymm9,%ymm6
+	vpmuludq	%ymm0,%ymm9,%ymm9
+	vpaddq	%ymm6,%ymm14,%ymm14
+	vpaddq	%ymm9,%ymm13,%ymm13
+	vinserti128	$1,48(%rsi),%ymm8,%ymm8
+	leaq	64(%rsi),%rsi
+
+	vpmuludq	%ymm1,%ymm2,%ymm6
+	vpmuludq	%ymm0,%ymm2,%ymm2
+	vpsrldq	$6,%ymm7,%ymm9
+	vpaddq	%ymm6,%ymm15,%ymm15
+	vpaddq	%ymm2,%ymm14,%ymm14
+	vpmuludq	%ymm3,%ymm10,%ymm6
+	vpmuludq	%ymm4,%ymm10,%ymm2
+	vpsrldq	$6,%ymm8,%ymm10
+	vpaddq	%ymm6,%ymm12,%ymm12
+	vpaddq	%ymm2,%ymm13,%ymm13
+	vpunpckhqdq	%ymm8,%ymm7,%ymm6
+
+	vpmuludq	%ymm3,%ymm5,%ymm3
+	vpmuludq	%ymm4,%ymm5,%ymm4
+	vpunpcklqdq	%ymm8,%ymm7,%ymm7
+	vpaddq	%ymm3,%ymm13,%ymm2
+	vpaddq	%ymm4,%ymm14,%ymm3
+	vpunpcklqdq	%ymm10,%ymm9,%ymm10
+	vpmuludq	80(%rax),%ymm0,%ymm4
+	vpmuludq	%ymm1,%ymm5,%ymm0
+	vmovdqa	64(%rcx),%ymm5
+	vpaddq	%ymm4,%ymm15,%ymm4
+	vpaddq	%ymm0,%ymm11,%ymm0
+
+	vpsrlq	$26,%ymm3,%ymm14
+	vpand	%ymm5,%ymm3,%ymm3
+	vpaddq	%ymm14,%ymm4,%ymm4
+
+	vpsrlq	$26,%ymm0,%ymm11
+	vpand	%ymm5,%ymm0,%ymm0
+	vpaddq	%ymm11,%ymm12,%ymm1
+
+	vpsrlq	$26,%ymm4,%ymm15
+	vpand	%ymm5,%ymm4,%ymm4
+
+	vpsrlq	$4,%ymm10,%ymm9
+
+	vpsrlq	$26,%ymm1,%ymm12
+	vpand	%ymm5,%ymm1,%ymm1
+	vpaddq	%ymm12,%ymm2,%ymm2
+
+	vpaddq	%ymm15,%ymm0,%ymm0
+	vpsllq	$2,%ymm15,%ymm15
+	vpaddq	%ymm15,%ymm0,%ymm0
+
+	vpand	%ymm5,%ymm9,%ymm9
+	vpsrlq	$26,%ymm7,%ymm8
+
+	vpsrlq	$26,%ymm2,%ymm13
+	vpand	%ymm5,%ymm2,%ymm2
+	vpaddq	%ymm13,%ymm3,%ymm3
+
+	vpaddq	%ymm9,%ymm2,%ymm2
+	vpsrlq	$30,%ymm10,%ymm10
+
+	vpsrlq	$26,%ymm0,%ymm11
+	vpand	%ymm5,%ymm0,%ymm0
+	vpaddq	%ymm11,%ymm1,%ymm1
+
+	vpsrlq	$40,%ymm6,%ymm6
+
+	vpsrlq	$26,%ymm3,%ymm14
+	vpand	%ymm5,%ymm3,%ymm3
+	vpaddq	%ymm14,%ymm4,%ymm4
+
+	vpand	%ymm5,%ymm7,%ymm7
+	vpand	%ymm5,%ymm8,%ymm8
+	vpand	%ymm5,%ymm10,%ymm10
+	vpor	32(%rcx),%ymm6,%ymm6
+
+	subq	$64,%rdx
+	jnz	.Loop_avx2
+
+.byte	0x66,0x90
+.Ltail_avx2:
+
+	vpaddq	%ymm0,%ymm7,%ymm0
+	vmovdqu	4(%rsp),%ymm7
+	vpaddq	%ymm1,%ymm8,%ymm1
+	vmovdqu	36(%rsp),%ymm8
+	vpaddq	%ymm3,%ymm10,%ymm3
+	vmovdqu	100(%rsp),%ymm9
+	vpaddq	%ymm4,%ymm6,%ymm4
+	vmovdqu	52(%rax),%ymm10
+	vmovdqu	116(%rax),%ymm5
+
+	vpmuludq	%ymm2,%ymm7,%ymm13
+	vpmuludq	%ymm2,%ymm8,%ymm14
+	vpmuludq	%ymm2,%ymm9,%ymm15
+	vpmuludq	%ymm2,%ymm10,%ymm11
+	vpmuludq	%ymm2,%ymm5,%ymm12
+
+	vpmuludq	%ymm0,%ymm8,%ymm6
+	vpmuludq	%ymm1,%ymm8,%ymm2
+	vpaddq	%ymm6,%ymm12,%ymm12
+	vpaddq	%ymm2,%ymm13,%ymm13
+	vpmuludq	%ymm3,%ymm8,%ymm6
+	vpmuludq	68(%rsp),%ymm4,%ymm2
+	vpaddq	%ymm6,%ymm15,%ymm15
+	vpaddq	%ymm2,%ymm11,%ymm11
+
+	vpmuludq	%ymm0,%ymm7,%ymm6
+	vpmuludq	%ymm1,%ymm7,%ymm2
+	vpaddq	%ymm6,%ymm11,%ymm11
+	vmovdqu	-12(%rax),%ymm8
+	vpaddq	%ymm2,%ymm12,%ymm12
+	vpmuludq	%ymm3,%ymm7,%ymm6
+	vpmuludq	%ymm4,%ymm7,%ymm2
+	vpaddq	%ymm6,%ymm14,%ymm14
+	vpaddq	%ymm2,%ymm15,%ymm15
+
+	vpmuludq	%ymm3,%ymm8,%ymm6
+	vpmuludq	%ymm4,%ymm8,%ymm2
+	vpaddq	%ymm6,%ymm11,%ymm11
+	vpaddq	%ymm2,%ymm12,%ymm12
+	vmovdqu	20(%rax),%ymm2
+	vpmuludq	%ymm1,%ymm9,%ymm6
+	vpmuludq	%ymm0,%ymm9,%ymm9
+	vpaddq	%ymm6,%ymm14,%ymm14
+	vpaddq	%ymm9,%ymm13,%ymm13
+
+	vpmuludq	%ymm1,%ymm2,%ymm6
+	vpmuludq	%ymm0,%ymm2,%ymm2
+	vpaddq	%ymm6,%ymm15,%ymm15
+	vpaddq	%ymm2,%ymm14,%ymm14
+	vpmuludq	%ymm3,%ymm10,%ymm6
+	vpmuludq	%ymm4,%ymm10,%ymm2
+	vpaddq	%ymm6,%ymm12,%ymm12
+	vpaddq	%ymm2,%ymm13,%ymm13
+
+	vpmuludq	%ymm3,%ymm5,%ymm3
+	vpmuludq	%ymm4,%ymm5,%ymm4
+	vpaddq	%ymm3,%ymm13,%ymm2
+	vpaddq	%ymm4,%ymm14,%ymm3
+	vpmuludq	84(%rax),%ymm0,%ymm4
+	vpmuludq	%ymm1,%ymm5,%ymm0
+	vmovdqa	64(%rcx),%ymm5
+	vpaddq	%ymm4,%ymm15,%ymm4
+	vpaddq	%ymm0,%ymm11,%ymm0
+
+	vpsrldq	$8,%ymm12,%ymm8
+	vpsrldq	$8,%ymm2,%ymm9
+	vpsrldq	$8,%ymm3,%ymm10
+	vpsrldq	$8,%ymm4,%ymm6
+	vpsrldq	$8,%ymm0,%ymm7
+	vpaddq	%ymm8,%ymm12,%ymm12
+	vpaddq	%ymm9,%ymm2,%ymm2
+	vpaddq	%ymm10,%ymm3,%ymm3
+	vpaddq	%ymm6,%ymm4,%ymm4
+	vpaddq	%ymm7,%ymm0,%ymm0
+
+	vpermq	$0x2,%ymm3,%ymm10
+	vpermq	$0x2,%ymm4,%ymm6
+	vpermq	$0x2,%ymm0,%ymm7
+	vpermq	$0x2,%ymm12,%ymm8
+	vpermq	$0x2,%ymm2,%ymm9
+	vpaddq	%ymm10,%ymm3,%ymm3
+	vpaddq	%ymm6,%ymm4,%ymm4
+	vpaddq	%ymm7,%ymm0,%ymm0
+	vpaddq	%ymm8,%ymm12,%ymm12
+	vpaddq	%ymm9,%ymm2,%ymm2
+
+	vpsrlq	$26,%ymm3,%ymm14
+	vpand	%ymm5,%ymm3,%ymm3
+	vpaddq	%ymm14,%ymm4,%ymm4
+
+	vpsrlq	$26,%ymm0,%ymm11
+	vpand	%ymm5,%ymm0,%ymm0
+	vpaddq	%ymm11,%ymm12,%ymm1
+
+	vpsrlq	$26,%ymm4,%ymm15
+	vpand	%ymm5,%ymm4,%ymm4
+
+	vpsrlq	$26,%ymm1,%ymm12
+	vpand	%ymm5,%ymm1,%ymm1
+	vpaddq	%ymm12,%ymm2,%ymm2
+
+	vpaddq	%ymm15,%ymm0,%ymm0
+	vpsllq	$2,%ymm15,%ymm15
+	vpaddq	%ymm15,%ymm0,%ymm0
+
+	vpsrlq	$26,%ymm2,%ymm13
+	vpand	%ymm5,%ymm2,%ymm2
+	vpaddq	%ymm13,%ymm3,%ymm3
+
+	vpsrlq	$26,%ymm0,%ymm11
+	vpand	%ymm5,%ymm0,%ymm0
+	vpaddq	%ymm11,%ymm1,%ymm1
+
+	vpsrlq	$26,%ymm3,%ymm14
+	vpand	%ymm5,%ymm3,%ymm3
+	vpaddq	%ymm14,%ymm4,%ymm4
+
+	vmovd	%xmm0,-112(%rdi)
+	vmovd	%xmm1,-108(%rdi)
+	vmovd	%xmm2,-104(%rdi)
+	vmovd	%xmm3,-100(%rdi)
+	vmovd	%xmm4,-96(%rdi)
+	leaq	-8(%r10),%rsp
+
+	vzeroupper
+	ret
+
+ENDPROC(poly1305_blocks_avx2)
+#endif /* CONFIG_AS_AVX2 */
+
+#ifdef CONFIG_AS_AVX512
+.align	32
+ENTRY(poly1305_blocks_avx512)
+
+	movl	20(%rdi),%r8d
+	cmpq	$128,%rdx
+	jae	.Lblocks_avx2_512
+	testl	%r8d,%r8d
+	jz	.Lblocks
+
+.Lblocks_avx2_512:
+	andq	$-16,%rdx
+	jz	.Lno_data_avx2_512
+
+	vzeroupper
+
+	testl	%r8d,%r8d
+	jz	.Lbase2_64_avx2_512
+
+	testq	$63,%rdx
+	jz	.Leven_avx2_512
+
+	pushq	%rbx
+	pushq	%r12
+	pushq	%r13
+	pushq	%r14
+	pushq	%r15
+	pushq	%rdi
+
+.Lblocks_avx2_body_512:
+
+	movq	%rdx,%r15
+
+	movq	0(%rdi),%r8
+	movq	8(%rdi),%r9
+	movl	16(%rdi),%r10d
+
+	movq	24(%rdi),%r11
+	movq	32(%rdi),%r13
+
+
+	movl	%r8d,%r14d
+	andq	$-2147483648,%r8
+	movq	%r9,%r12
+	movl	%r9d,%ebx
+	andq	$-2147483648,%r9
+
+	shrq	$6,%r8
+	shlq	$52,%r12
+	addq	%r8,%r14
+	shrq	$12,%rbx
+	shrq	$18,%r9
+	addq	%r12,%r14
+	adcq	%r9,%rbx
+
+	movq	%r10,%r8
+	shlq	$40,%r8
+	shrq	$24,%r10
+	addq	%r8,%rbx
+	adcq	$0,%r10
+
+	movq	$-4,%r9
+	movq	%r10,%r8
+	andq	%r10,%r9
+	shrq	$2,%r8
+	andq	$3,%r10
+	addq	%r9,%r8
+	addq	%r8,%r14
+	adcq	$0,%rbx
+	adcq	$0,%r10
+
+	movq	%r13,%r12
+	movq	%r13,%rax
+	shrq	$2,%r13
+	addq	%r12,%r13
+
+.Lbase2_26_pre_avx2_512:
+	addq	0(%rsi),%r14
+	adcq	8(%rsi),%rbx
+	leaq	16(%rsi),%rsi
+	adcq	%rcx,%r10
+	subq	$16,%r15
+
+	movq	%rdi,0(%rsp)
+	__poly1305_block
+	movq	0(%rsp),%rdi
+	movq	%r12,%rax
+
+	testq	$63,%r15
+	jnz	.Lbase2_26_pre_avx2_512
+
+	testq	%rcx,%rcx
+	jz	.Lstore_base2_64_avx2_512
+
+
+	movq	%r14,%rax
+	movq	%r14,%rdx
+	shrq	$52,%r14
+	movq	%rbx,%r11
+	movq	%rbx,%r12
+	shrq	$26,%rdx
+	andq	$0x3ffffff,%rax
+	shlq	$12,%r11
+	andq	$0x3ffffff,%rdx
+	shrq	$14,%rbx
+	orq	%r11,%r14
+	shlq	$24,%r10
+	andq	$0x3ffffff,%r14
+	shrq	$40,%r12
+	andq	$0x3ffffff,%rbx
+	orq	%r12,%r10
+
+	testq	%r15,%r15
+	jz	.Lstore_base2_26_avx2_512
+
+	vmovd	%eax,%xmm0
+	vmovd	%edx,%xmm1
+	vmovd	%r14d,%xmm2
+	vmovd	%ebx,%xmm3
+	vmovd	%r10d,%xmm4
+	jmp	.Lproceed_avx2_512
+
+.align	32
+.Lstore_base2_64_avx2_512:
+	movq	%r14,0(%rdi)
+	movq	%rbx,8(%rdi)
+	movq	%r10,16(%rdi)
+	jmp	.Ldone_avx2_512
+
+.align	16
+.Lstore_base2_26_avx2_512:
+	movl	%eax,0(%rdi)
+	movl	%edx,4(%rdi)
+	movl	%r14d,8(%rdi)
+	movl	%ebx,12(%rdi)
+	movl	%r10d,16(%rdi)
+.align	16
+.Ldone_avx2_512:
+	movq	8(%rsp),%r15
+	movq	16(%rsp),%r14
+	movq	24(%rsp),%r13
+	movq	32(%rsp),%r12
+	movq	40(%rsp),%rbx
+	leaq	48(%rsp),%rsp
+
+.Lno_data_avx2_512:
+.Lblocks_avx2_epilogue_512:
+	ret
+
+
+.align	32
+.Lbase2_64_avx2_512:
+
+	pushq	%rbx
+	pushq	%r12
+	pushq	%r13
+	pushq	%r14
+	pushq	%r15
+	pushq	%rdi
+
+.Lbase2_64_avx2_body_512:
+
+	movq	%rdx,%r15
+
+	movq	24(%rdi),%r11
+	movq	32(%rdi),%r13
+
+	movq	0(%rdi),%r14
+	movq	8(%rdi),%rbx
+	movl	16(%rdi),%r10d
+
+	movq	%r13,%r12
+	movq	%r13,%rax
+	shrq	$2,%r13
+	addq	%r12,%r13
+
+	testq	$63,%rdx
+	jz	.Linit_avx2_512
+
+.Lbase2_64_pre_avx2_512:
+	addq	0(%rsi),%r14
+	adcq	8(%rsi),%rbx
+	leaq	16(%rsi),%rsi
+	adcq	%rcx,%r10
+	subq	$16,%r15
+
+	movq	%rdi,0(%rsp)
+	__poly1305_block
+	movq	0(%rsp),%rdi
+	movq	%r12,%rax
+
+	testq	$63,%r15
+	jnz	.Lbase2_64_pre_avx2_512
+
+.Linit_avx2_512:
+
+	movq	%r14,%rax
+	movq	%r14,%rdx
+	shrq	$52,%r14
+	movq	%rbx,%r8
+	movq	%rbx,%r9
+	shrq	$26,%rdx
+	andq	$0x3ffffff,%rax
+	shlq	$12,%r8
+	andq	$0x3ffffff,%rdx
+	shrq	$14,%rbx
+	orq	%r8,%r14
+	shlq	$24,%r10
+	andq	$0x3ffffff,%r14
+	shrq	$40,%r9
+	andq	$0x3ffffff,%rbx
+	orq	%r9,%r10
+
+	vmovd	%eax,%xmm0
+	vmovd	%edx,%xmm1
+	vmovd	%r14d,%xmm2
+	vmovd	%ebx,%xmm3
+	vmovd	%r10d,%xmm4
+	movl	$1,20(%rdi)
+
+	__poly1305_init_avx
+
+.Lproceed_avx2_512:
+	movq	%r15,%rdx
+
+	movq	8(%rsp),%r15
+	movq	16(%rsp),%r14
+	movq	24(%rsp),%r13
+	movq	32(%rsp),%r12
+	movq	40(%rsp),%rbx
+	leaq	48(%rsp),%rax
+	leaq	48(%rsp),%rsp
+
+.Lbase2_64_avx2_epilogue_512:
+	jmp	.Ldo_avx2_512
+
+
+.align	32
+.Leven_avx2_512:
+
+	vmovd	0(%rdi),%xmm0
+	vmovd	4(%rdi),%xmm1
+	vmovd	8(%rdi),%xmm2
+	vmovd	12(%rdi),%xmm3
+	vmovd	16(%rdi),%xmm4
+
+.Ldo_avx2_512:
+	cmpq	$512,%rdx
+	jae	.Lblocks_avx512
+.Lskip_avx512:
+	leaq	8(%rsp),%r10
+
+	subq	$0x128,%rsp
+	leaq	.Lconst(%rip),%rcx
+	leaq	48+64(%rdi),%rdi
+	vmovdqa	96(%rcx),%ymm7
+
+
+	vmovdqu	-64(%rdi),%xmm9
+	andq	$-512,%rsp
+	vmovdqu	-48(%rdi),%xmm10
+	vmovdqu	-32(%rdi),%xmm6
+	vmovdqu	-16(%rdi),%xmm11
+	vmovdqu	0(%rdi),%xmm12
+	vmovdqu	16(%rdi),%xmm13
+	leaq	144(%rsp),%rax
+	vmovdqu	32(%rdi),%xmm14
+	vpermd	%ymm9,%ymm7,%ymm9
+	vmovdqu	48(%rdi),%xmm15
+	vpermd	%ymm10,%ymm7,%ymm10
+	vmovdqu	64(%rdi),%xmm5
+	vpermd	%ymm6,%ymm7,%ymm6
+	vmovdqa	%ymm9,0(%rsp)
+	vpermd	%ymm11,%ymm7,%ymm11
+	vmovdqa	%ymm10,32-144(%rax)
+	vpermd	%ymm12,%ymm7,%ymm12
+	vmovdqa	%ymm6,64-144(%rax)
+	vpermd	%ymm13,%ymm7,%ymm13
+	vmovdqa	%ymm11,96-144(%rax)
+	vpermd	%ymm14,%ymm7,%ymm14
+	vmovdqa	%ymm12,128-144(%rax)
+	vpermd	%ymm15,%ymm7,%ymm15
+	vmovdqa	%ymm13,160-144(%rax)
+	vpermd	%ymm5,%ymm7,%ymm5
+	vmovdqa	%ymm14,192-144(%rax)
+	vmovdqa	%ymm15,224-144(%rax)
+	vmovdqa	%ymm5,256-144(%rax)
+	vmovdqa	64(%rcx),%ymm5
+
+
+
+	vmovdqu	0(%rsi),%xmm7
+	vmovdqu	16(%rsi),%xmm8
+	vinserti128	$1,32(%rsi),%ymm7,%ymm7
+	vinserti128	$1,48(%rsi),%ymm8,%ymm8
+	leaq	64(%rsi),%rsi
+
+	vpsrldq	$6,%ymm7,%ymm9
+	vpsrldq	$6,%ymm8,%ymm10
+	vpunpckhqdq	%ymm8,%ymm7,%ymm6
+	vpunpcklqdq	%ymm10,%ymm9,%ymm9
+	vpunpcklqdq	%ymm8,%ymm7,%ymm7
+
+	vpsrlq	$30,%ymm9,%ymm10
+	vpsrlq	$4,%ymm9,%ymm9
+	vpsrlq	$26,%ymm7,%ymm8
+	vpsrlq	$40,%ymm6,%ymm6
+	vpand	%ymm5,%ymm9,%ymm9
+	vpand	%ymm5,%ymm7,%ymm7
+	vpand	%ymm5,%ymm8,%ymm8
+	vpand	%ymm5,%ymm10,%ymm10
+	vpor	32(%rcx),%ymm6,%ymm6
+
+	vpaddq	%ymm2,%ymm9,%ymm2
+	subq	$64,%rdx
+	jz	.Ltail_avx2_512
+	jmp	.Loop_avx2_512
+
+.align	32
+.Loop_avx2_512:
+
+	vpaddq	%ymm0,%ymm7,%ymm0
+	vmovdqa	0(%rsp),%ymm7
+	vpaddq	%ymm1,%ymm8,%ymm1
+	vmovdqa	32(%rsp),%ymm8
+	vpaddq	%ymm3,%ymm10,%ymm3
+	vmovdqa	96(%rsp),%ymm9
+	vpaddq	%ymm4,%ymm6,%ymm4
+	vmovdqa	48(%rax),%ymm10
+	vmovdqa	112(%rax),%ymm5
+
+	vpmuludq	%ymm2,%ymm7,%ymm13
+	vpmuludq	%ymm2,%ymm8,%ymm14
+	vpmuludq	%ymm2,%ymm9,%ymm15
+	vpmuludq	%ymm2,%ymm10,%ymm11
+	vpmuludq	%ymm2,%ymm5,%ymm12
+
+	vpmuludq	%ymm0,%ymm8,%ymm6
+	vpmuludq	%ymm1,%ymm8,%ymm2
+	vpaddq	%ymm6,%ymm12,%ymm12
+	vpaddq	%ymm2,%ymm13,%ymm13
+	vpmuludq	%ymm3,%ymm8,%ymm6
+	vpmuludq	64(%rsp),%ymm4,%ymm2
+	vpaddq	%ymm6,%ymm15,%ymm15
+	vpaddq	%ymm2,%ymm11,%ymm11
+	vmovdqa	-16(%rax),%ymm8
+
+	vpmuludq	%ymm0,%ymm7,%ymm6
+	vpmuludq	%ymm1,%ymm7,%ymm2
+	vpaddq	%ymm6,%ymm11,%ymm11
+	vpaddq	%ymm2,%ymm12,%ymm12
+	vpmuludq	%ymm3,%ymm7,%ymm6
+	vpmuludq	%ymm4,%ymm7,%ymm2
+	vmovdqu	0(%rsi),%xmm7
+	vpaddq	%ymm6,%ymm14,%ymm14
+	vpaddq	%ymm2,%ymm15,%ymm15
+	vinserti128	$1,32(%rsi),%ymm7,%ymm7
+
+	vpmuludq	%ymm3,%ymm8,%ymm6
+	vpmuludq	%ymm4,%ymm8,%ymm2
+	vmovdqu	16(%rsi),%xmm8
+	vpaddq	%ymm6,%ymm11,%ymm11
+	vpaddq	%ymm2,%ymm12,%ymm12
+	vmovdqa	16(%rax),%ymm2
+	vpmuludq	%ymm1,%ymm9,%ymm6
+	vpmuludq	%ymm0,%ymm9,%ymm9
+	vpaddq	%ymm6,%ymm14,%ymm14
+	vpaddq	%ymm9,%ymm13,%ymm13
+	vinserti128	$1,48(%rsi),%ymm8,%ymm8
+	leaq	64(%rsi),%rsi
+
+	vpmuludq	%ymm1,%ymm2,%ymm6
+	vpmuludq	%ymm0,%ymm2,%ymm2
+	vpsrldq	$6,%ymm7,%ymm9
+	vpaddq	%ymm6,%ymm15,%ymm15
+	vpaddq	%ymm2,%ymm14,%ymm14
+	vpmuludq	%ymm3,%ymm10,%ymm6
+	vpmuludq	%ymm4,%ymm10,%ymm2
+	vpsrldq	$6,%ymm8,%ymm10
+	vpaddq	%ymm6,%ymm12,%ymm12
+	vpaddq	%ymm2,%ymm13,%ymm13
+	vpunpckhqdq	%ymm8,%ymm7,%ymm6
+
+	vpmuludq	%ymm3,%ymm5,%ymm3
+	vpmuludq	%ymm4,%ymm5,%ymm4
+	vpunpcklqdq	%ymm8,%ymm7,%ymm7
+	vpaddq	%ymm3,%ymm13,%ymm2
+	vpaddq	%ymm4,%ymm14,%ymm3
+	vpunpcklqdq	%ymm10,%ymm9,%ymm10
+	vpmuludq	80(%rax),%ymm0,%ymm4
+	vpmuludq	%ymm1,%ymm5,%ymm0
+	vmovdqa	64(%rcx),%ymm5
+	vpaddq	%ymm4,%ymm15,%ymm4
+	vpaddq	%ymm0,%ymm11,%ymm0
+
+	vpsrlq	$26,%ymm3,%ymm14
+	vpand	%ymm5,%ymm3,%ymm3
+	vpaddq	%ymm14,%ymm4,%ymm4
+
+	vpsrlq	$26,%ymm0,%ymm11
+	vpand	%ymm5,%ymm0,%ymm0
+	vpaddq	%ymm11,%ymm12,%ymm1
+
+	vpsrlq	$26,%ymm4,%ymm15
+	vpand	%ymm5,%ymm4,%ymm4
+
+	vpsrlq	$4,%ymm10,%ymm9
+
+	vpsrlq	$26,%ymm1,%ymm12
+	vpand	%ymm5,%ymm1,%ymm1
+	vpaddq	%ymm12,%ymm2,%ymm2
+
+	vpaddq	%ymm15,%ymm0,%ymm0
+	vpsllq	$2,%ymm15,%ymm15
+	vpaddq	%ymm15,%ymm0,%ymm0
+
+	vpand	%ymm5,%ymm9,%ymm9
+	vpsrlq	$26,%ymm7,%ymm8
+
+	vpsrlq	$26,%ymm2,%ymm13
+	vpand	%ymm5,%ymm2,%ymm2
+	vpaddq	%ymm13,%ymm3,%ymm3
+
+	vpaddq	%ymm9,%ymm2,%ymm2
+	vpsrlq	$30,%ymm10,%ymm10
+
+	vpsrlq	$26,%ymm0,%ymm11
+	vpand	%ymm5,%ymm0,%ymm0
+	vpaddq	%ymm11,%ymm1,%ymm1
+
+	vpsrlq	$40,%ymm6,%ymm6
+
+	vpsrlq	$26,%ymm3,%ymm14
+	vpand	%ymm5,%ymm3,%ymm3
+	vpaddq	%ymm14,%ymm4,%ymm4
+
+	vpand	%ymm5,%ymm7,%ymm7
+	vpand	%ymm5,%ymm8,%ymm8
+	vpand	%ymm5,%ymm10,%ymm10
+	vpor	32(%rcx),%ymm6,%ymm6
+
+	subq	$64,%rdx
+	jnz	.Loop_avx2_512
+
+.byte	0x66,0x90
+.Ltail_avx2_512:
+
+	vpaddq	%ymm0,%ymm7,%ymm0
+	vmovdqu	4(%rsp),%ymm7
+	vpaddq	%ymm1,%ymm8,%ymm1
+	vmovdqu	36(%rsp),%ymm8
+	vpaddq	%ymm3,%ymm10,%ymm3
+	vmovdqu	100(%rsp),%ymm9
+	vpaddq	%ymm4,%ymm6,%ymm4
+	vmovdqu	52(%rax),%ymm10
+	vmovdqu	116(%rax),%ymm5
+
+	vpmuludq	%ymm2,%ymm7,%ymm13
+	vpmuludq	%ymm2,%ymm8,%ymm14
+	vpmuludq	%ymm2,%ymm9,%ymm15
+	vpmuludq	%ymm2,%ymm10,%ymm11
+	vpmuludq	%ymm2,%ymm5,%ymm12
+
+	vpmuludq	%ymm0,%ymm8,%ymm6
+	vpmuludq	%ymm1,%ymm8,%ymm2
+	vpaddq	%ymm6,%ymm12,%ymm12
+	vpaddq	%ymm2,%ymm13,%ymm13
+	vpmuludq	%ymm3,%ymm8,%ymm6
+	vpmuludq	68(%rsp),%ymm4,%ymm2
+	vpaddq	%ymm6,%ymm15,%ymm15
+	vpaddq	%ymm2,%ymm11,%ymm11
+
+	vpmuludq	%ymm0,%ymm7,%ymm6
+	vpmuludq	%ymm1,%ymm7,%ymm2
+	vpaddq	%ymm6,%ymm11,%ymm11
+	vmovdqu	-12(%rax),%ymm8
+	vpaddq	%ymm2,%ymm12,%ymm12
+	vpmuludq	%ymm3,%ymm7,%ymm6
+	vpmuludq	%ymm4,%ymm7,%ymm2
+	vpaddq	%ymm6,%ymm14,%ymm14
+	vpaddq	%ymm2,%ymm15,%ymm15
+
+	vpmuludq	%ymm3,%ymm8,%ymm6
+	vpmuludq	%ymm4,%ymm8,%ymm2
+	vpaddq	%ymm6,%ymm11,%ymm11
+	vpaddq	%ymm2,%ymm12,%ymm12
+	vmovdqu	20(%rax),%ymm2
+	vpmuludq	%ymm1,%ymm9,%ymm6
+	vpmuludq	%ymm0,%ymm9,%ymm9
+	vpaddq	%ymm6,%ymm14,%ymm14
+	vpaddq	%ymm9,%ymm13,%ymm13
+
+	vpmuludq	%ymm1,%ymm2,%ymm6
+	vpmuludq	%ymm0,%ymm2,%ymm2
+	vpaddq	%ymm6,%ymm15,%ymm15
+	vpaddq	%ymm2,%ymm14,%ymm14
+	vpmuludq	%ymm3,%ymm10,%ymm6
+	vpmuludq	%ymm4,%ymm10,%ymm2
+	vpaddq	%ymm6,%ymm12,%ymm12
+	vpaddq	%ymm2,%ymm13,%ymm13
+
+	vpmuludq	%ymm3,%ymm5,%ymm3
+	vpmuludq	%ymm4,%ymm5,%ymm4
+	vpaddq	%ymm3,%ymm13,%ymm2
+	vpaddq	%ymm4,%ymm14,%ymm3
+	vpmuludq	84(%rax),%ymm0,%ymm4
+	vpmuludq	%ymm1,%ymm5,%ymm0
+	vmovdqa	64(%rcx),%ymm5
+	vpaddq	%ymm4,%ymm15,%ymm4
+	vpaddq	%ymm0,%ymm11,%ymm0
+
+	vpsrldq	$8,%ymm12,%ymm8
+	vpsrldq	$8,%ymm2,%ymm9
+	vpsrldq	$8,%ymm3,%ymm10
+	vpsrldq	$8,%ymm4,%ymm6
+	vpsrldq	$8,%ymm0,%ymm7
+	vpaddq	%ymm8,%ymm12,%ymm12
+	vpaddq	%ymm9,%ymm2,%ymm2
+	vpaddq	%ymm10,%ymm3,%ymm3
+	vpaddq	%ymm6,%ymm4,%ymm4
+	vpaddq	%ymm7,%ymm0,%ymm0
+
+	vpermq	$0x2,%ymm3,%ymm10
+	vpermq	$0x2,%ymm4,%ymm6
+	vpermq	$0x2,%ymm0,%ymm7
+	vpermq	$0x2,%ymm12,%ymm8
+	vpermq	$0x2,%ymm2,%ymm9
+	vpaddq	%ymm10,%ymm3,%ymm3
+	vpaddq	%ymm6,%ymm4,%ymm4
+	vpaddq	%ymm7,%ymm0,%ymm0
+	vpaddq	%ymm8,%ymm12,%ymm12
+	vpaddq	%ymm9,%ymm2,%ymm2
+
+	vpsrlq	$26,%ymm3,%ymm14
+	vpand	%ymm5,%ymm3,%ymm3
+	vpaddq	%ymm14,%ymm4,%ymm4
+
+	vpsrlq	$26,%ymm0,%ymm11
+	vpand	%ymm5,%ymm0,%ymm0
+	vpaddq	%ymm11,%ymm12,%ymm1
+
+	vpsrlq	$26,%ymm4,%ymm15
+	vpand	%ymm5,%ymm4,%ymm4
+
+	vpsrlq	$26,%ymm1,%ymm12
+	vpand	%ymm5,%ymm1,%ymm1
+	vpaddq	%ymm12,%ymm2,%ymm2
+
+	vpaddq	%ymm15,%ymm0,%ymm0
+	vpsllq	$2,%ymm15,%ymm15
+	vpaddq	%ymm15,%ymm0,%ymm0
+
+	vpsrlq	$26,%ymm2,%ymm13
+	vpand	%ymm5,%ymm2,%ymm2
+	vpaddq	%ymm13,%ymm3,%ymm3
+
+	vpsrlq	$26,%ymm0,%ymm11
+	vpand	%ymm5,%ymm0,%ymm0
+	vpaddq	%ymm11,%ymm1,%ymm1
+
+	vpsrlq	$26,%ymm3,%ymm14
+	vpand	%ymm5,%ymm3,%ymm3
+	vpaddq	%ymm14,%ymm4,%ymm4
+
+	vmovd	%xmm0,-112(%rdi)
+	vmovd	%xmm1,-108(%rdi)
+	vmovd	%xmm2,-104(%rdi)
+	vmovd	%xmm3,-100(%rdi)
+	vmovd	%xmm4,-96(%rdi)
+	leaq	-8(%r10),%rsp
+
+	vzeroupper
+	ret
+
+.Lblocks_avx512:
+
+	movl	$15,%eax
+	kmovw	%eax,%k2
+	leaq	8(%rsp),%r10
+
+	subq	$0x128,%rsp
+	leaq	.Lconst(%rip),%rcx
+	leaq	48+64(%rdi),%rdi
+	vmovdqa	96(%rcx),%ymm9
+
+	vmovdqu32	-64(%rdi),%zmm16{%k2}{z}
+	andq	$-512,%rsp
+	vmovdqu32	-48(%rdi),%zmm17{%k2}{z}
+	movq	$0x20,%rax
+	vmovdqu32	-32(%rdi),%zmm21{%k2}{z}
+	vmovdqu32	-16(%rdi),%zmm18{%k2}{z}
+	vmovdqu32	0(%rdi),%zmm22{%k2}{z}
+	vmovdqu32	16(%rdi),%zmm19{%k2}{z}
+	vmovdqu32	32(%rdi),%zmm23{%k2}{z}
+	vmovdqu32	48(%rdi),%zmm20{%k2}{z}
+	vmovdqu32	64(%rdi),%zmm24{%k2}{z}
+	vpermd	%zmm16,%zmm9,%zmm16
+	vpbroadcastq	64(%rcx),%zmm5
+	vpermd	%zmm17,%zmm9,%zmm17
+	vpermd	%zmm21,%zmm9,%zmm21
+	vpermd	%zmm18,%zmm9,%zmm18
+	vmovdqa64	%zmm16,0(%rsp){%k2}
+	vpsrlq	$32,%zmm16,%zmm7
+	vpermd	%zmm22,%zmm9,%zmm22
+	vmovdqu64	%zmm17,0(%rsp,%rax,1){%k2}
+	vpsrlq	$32,%zmm17,%zmm8
+	vpermd	%zmm19,%zmm9,%zmm19
+	vmovdqa64	%zmm21,64(%rsp){%k2}
+	vpermd	%zmm23,%zmm9,%zmm23
+	vpermd	%zmm20,%zmm9,%zmm20
+	vmovdqu64	%zmm18,64(%rsp,%rax,1){%k2}
+	vpermd	%zmm24,%zmm9,%zmm24
+	vmovdqa64	%zmm22,128(%rsp){%k2}
+	vmovdqu64	%zmm19,128(%rsp,%rax,1){%k2}
+	vmovdqa64	%zmm23,192(%rsp){%k2}
+	vmovdqu64	%zmm20,192(%rsp,%rax,1){%k2}
+	vmovdqa64	%zmm24,256(%rsp){%k2}
+
+	vpmuludq	%zmm7,%zmm16,%zmm11
+	vpmuludq	%zmm7,%zmm17,%zmm12
+	vpmuludq	%zmm7,%zmm18,%zmm13
+	vpmuludq	%zmm7,%zmm19,%zmm14
+	vpmuludq	%zmm7,%zmm20,%zmm15
+	vpsrlq	$32,%zmm18,%zmm9
+
+	vpmuludq	%zmm8,%zmm24,%zmm25
+	vpmuludq	%zmm8,%zmm16,%zmm26
+	vpmuludq	%zmm8,%zmm17,%zmm27
+	vpmuludq	%zmm8,%zmm18,%zmm28
+	vpmuludq	%zmm8,%zmm19,%zmm29
+	vpsrlq	$32,%zmm19,%zmm10
+	vpaddq	%zmm25,%zmm11,%zmm11
+	vpaddq	%zmm26,%zmm12,%zmm12
+	vpaddq	%zmm27,%zmm13,%zmm13
+	vpaddq	%zmm28,%zmm14,%zmm14
+	vpaddq	%zmm29,%zmm15,%zmm15
+
+	vpmuludq	%zmm9,%zmm23,%zmm25
+	vpmuludq	%zmm9,%zmm24,%zmm26
+	vpmuludq	%zmm9,%zmm17,%zmm28
+	vpmuludq	%zmm9,%zmm18,%zmm29
+	vpmuludq	%zmm9,%zmm16,%zmm27
+	vpsrlq	$32,%zmm20,%zmm6
+	vpaddq	%zmm25,%zmm11,%zmm11
+	vpaddq	%zmm26,%zmm12,%zmm12
+	vpaddq	%zmm28,%zmm14,%zmm14
+	vpaddq	%zmm29,%zmm15,%zmm15
+	vpaddq	%zmm27,%zmm13,%zmm13
+
+	vpmuludq	%zmm10,%zmm22,%zmm25
+	vpmuludq	%zmm10,%zmm16,%zmm28
+	vpmuludq	%zmm10,%zmm17,%zmm29
+	vpmuludq	%zmm10,%zmm23,%zmm26
+	vpmuludq	%zmm10,%zmm24,%zmm27
+	vpaddq	%zmm25,%zmm11,%zmm11
+	vpaddq	%zmm28,%zmm14,%zmm14
+	vpaddq	%zmm29,%zmm15,%zmm15
+	vpaddq	%zmm26,%zmm12,%zmm12
+	vpaddq	%zmm27,%zmm13,%zmm13
+
+	vpmuludq	%zmm6,%zmm24,%zmm28
+	vpmuludq	%zmm6,%zmm16,%zmm29
+	vpmuludq	%zmm6,%zmm21,%zmm25
+	vpmuludq	%zmm6,%zmm22,%zmm26
+	vpmuludq	%zmm6,%zmm23,%zmm27
+	vpaddq	%zmm28,%zmm14,%zmm14
+	vpaddq	%zmm29,%zmm15,%zmm15
+	vpaddq	%zmm25,%zmm11,%zmm11
+	vpaddq	%zmm26,%zmm12,%zmm12
+	vpaddq	%zmm27,%zmm13,%zmm13
+
+	vmovdqu64	0(%rsi),%zmm10
+	vmovdqu64	64(%rsi),%zmm6
+	leaq	128(%rsi),%rsi
+
+	vpsrlq	$26,%zmm14,%zmm28
+	vpandq	%zmm5,%zmm14,%zmm14
+	vpaddq	%zmm28,%zmm15,%zmm15
+
+	vpsrlq	$26,%zmm11,%zmm25
+	vpandq	%zmm5,%zmm11,%zmm11
+	vpaddq	%zmm25,%zmm12,%zmm12
+
+	vpsrlq	$26,%zmm15,%zmm29
+	vpandq	%zmm5,%zmm15,%zmm15
+
+	vpsrlq	$26,%zmm12,%zmm26
+	vpandq	%zmm5,%zmm12,%zmm12
+	vpaddq	%zmm26,%zmm13,%zmm13
+
+	vpaddq	%zmm29,%zmm11,%zmm11
+	vpsllq	$2,%zmm29,%zmm29
+	vpaddq	%zmm29,%zmm11,%zmm11
+
+	vpsrlq	$26,%zmm13,%zmm27
+	vpandq	%zmm5,%zmm13,%zmm13
+	vpaddq	%zmm27,%zmm14,%zmm14
+
+	vpsrlq	$26,%zmm11,%zmm25
+	vpandq	%zmm5,%zmm11,%zmm11
+	vpaddq	%zmm25,%zmm12,%zmm12
+
+	vpsrlq	$26,%zmm14,%zmm28
+	vpandq	%zmm5,%zmm14,%zmm14
+	vpaddq	%zmm28,%zmm15,%zmm15
+
+	vpunpcklqdq	%zmm6,%zmm10,%zmm7
+	vpunpckhqdq	%zmm6,%zmm10,%zmm6
+
+	vmovdqa32	128(%rcx),%zmm25
+	movl	$0x7777,%eax
+	kmovw	%eax,%k1
+
+	vpermd	%zmm16,%zmm25,%zmm16
+	vpermd	%zmm17,%zmm25,%zmm17
+	vpermd	%zmm18,%zmm25,%zmm18
+	vpermd	%zmm19,%zmm25,%zmm19
+	vpermd	%zmm20,%zmm25,%zmm20
+
+	vpermd	%zmm11,%zmm25,%zmm16{%k1}
+	vpermd	%zmm12,%zmm25,%zmm17{%k1}
+	vpermd	%zmm13,%zmm25,%zmm18{%k1}
+	vpermd	%zmm14,%zmm25,%zmm19{%k1}
+	vpermd	%zmm15,%zmm25,%zmm20{%k1}
+
+	vpslld	$2,%zmm17,%zmm21
+	vpslld	$2,%zmm18,%zmm22
+	vpslld	$2,%zmm19,%zmm23
+	vpslld	$2,%zmm20,%zmm24
+	vpaddd	%zmm17,%zmm21,%zmm21
+	vpaddd	%zmm18,%zmm22,%zmm22
+	vpaddd	%zmm19,%zmm23,%zmm23
+	vpaddd	%zmm20,%zmm24,%zmm24
+
+	vpbroadcastq	32(%rcx),%zmm30
+
+	vpsrlq	$52,%zmm7,%zmm9
+	vpsllq	$12,%zmm6,%zmm10
+	vporq	%zmm10,%zmm9,%zmm9
+	vpsrlq	$26,%zmm7,%zmm8
+	vpsrlq	$14,%zmm6,%zmm10
+	vpsrlq	$40,%zmm6,%zmm6
+	vpandq	%zmm5,%zmm9,%zmm9
+	vpandq	%zmm5,%zmm7,%zmm7
+
+	vpaddq	%zmm2,%zmm9,%zmm2
+	subq	$192,%rdx
+	jbe	.Ltail_avx512
+	jmp	.Loop_avx512
+
+.align	32
+.Loop_avx512:
+
+	vpmuludq	%zmm2,%zmm17,%zmm14
+	vpaddq	%zmm0,%zmm7,%zmm0
+	vpmuludq	%zmm2,%zmm18,%zmm15
+	vpandq	%zmm5,%zmm8,%zmm8
+	vpmuludq	%zmm2,%zmm23,%zmm11
+	vpandq	%zmm5,%zmm10,%zmm10
+	vpmuludq	%zmm2,%zmm24,%zmm12
+	vporq	%zmm30,%zmm6,%zmm6
+	vpmuludq	%zmm2,%zmm16,%zmm13
+	vpaddq	%zmm1,%zmm8,%zmm1
+	vpaddq	%zmm3,%zmm10,%zmm3
+	vpaddq	%zmm4,%zmm6,%zmm4
+
+	vmovdqu64	0(%rsi),%zmm10
+	vmovdqu64	64(%rsi),%zmm6
+	leaq	128(%rsi),%rsi
+	vpmuludq	%zmm0,%zmm19,%zmm28
+	vpmuludq	%zmm0,%zmm20,%zmm29
+	vpmuludq	%zmm0,%zmm16,%zmm25
+	vpmuludq	%zmm0,%zmm17,%zmm26
+	vpaddq	%zmm28,%zmm14,%zmm14
+	vpaddq	%zmm29,%zmm15,%zmm15
+	vpaddq	%zmm25,%zmm11,%zmm11
+	vpaddq	%zmm26,%zmm12,%zmm12
+
+	vpmuludq	%zmm1,%zmm18,%zmm28
+	vpmuludq	%zmm1,%zmm19,%zmm29
+	vpmuludq	%zmm1,%zmm24,%zmm25
+	vpmuludq	%zmm0,%zmm18,%zmm27
+	vpaddq	%zmm28,%zmm14,%zmm14
+	vpaddq	%zmm29,%zmm15,%zmm15
+	vpaddq	%zmm25,%zmm11,%zmm11
+	vpaddq	%zmm27,%zmm13,%zmm13
+
+	vpunpcklqdq	%zmm6,%zmm10,%zmm7
+	vpunpckhqdq	%zmm6,%zmm10,%zmm6
+
+	vpmuludq	%zmm3,%zmm16,%zmm28
+	vpmuludq	%zmm3,%zmm17,%zmm29
+	vpmuludq	%zmm1,%zmm16,%zmm26
+	vpmuludq	%zmm1,%zmm17,%zmm27
+	vpaddq	%zmm28,%zmm14,%zmm14
+	vpaddq	%zmm29,%zmm15,%zmm15
+	vpaddq	%zmm26,%zmm12,%zmm12
+	vpaddq	%zmm27,%zmm13,%zmm13
+
+	vpmuludq	%zmm4,%zmm24,%zmm28
+	vpmuludq	%zmm4,%zmm16,%zmm29
+	vpmuludq	%zmm3,%zmm22,%zmm25
+	vpmuludq	%zmm3,%zmm23,%zmm26
+	vpaddq	%zmm28,%zmm14,%zmm14
+	vpmuludq	%zmm3,%zmm24,%zmm27
+	vpaddq	%zmm29,%zmm15,%zmm15
+	vpaddq	%zmm25,%zmm11,%zmm11
+	vpaddq	%zmm26,%zmm12,%zmm12
+	vpaddq	%zmm27,%zmm13,%zmm13
+
+	vpmuludq	%zmm4,%zmm21,%zmm25
+	vpmuludq	%zmm4,%zmm22,%zmm26
+	vpmuludq	%zmm4,%zmm23,%zmm27
+	vpaddq	%zmm25,%zmm11,%zmm0
+	vpaddq	%zmm26,%zmm12,%zmm1
+	vpaddq	%zmm27,%zmm13,%zmm2
+
+	vpsrlq	$52,%zmm7,%zmm9
+	vpsllq	$12,%zmm6,%zmm10
+
+	vpsrlq	$26,%zmm14,%zmm3
+	vpandq	%zmm5,%zmm14,%zmm14
+	vpaddq	%zmm3,%zmm15,%zmm4
+
+	vporq	%zmm10,%zmm9,%zmm9
+
+	vpsrlq	$26,%zmm0,%zmm11
+	vpandq	%zmm5,%zmm0,%zmm0
+	vpaddq	%zmm11,%zmm1,%zmm1
+
+	vpandq	%zmm5,%zmm9,%zmm9
+
+	vpsrlq	$26,%zmm4,%zmm15
+	vpandq	%zmm5,%zmm4,%zmm4
+
+	vpsrlq	$26,%zmm1,%zmm12
+	vpandq	%zmm5,%zmm1,%zmm1
+	vpaddq	%zmm12,%zmm2,%zmm2
+
+	vpaddq	%zmm15,%zmm0,%zmm0
+	vpsllq	$2,%zmm15,%zmm15
+	vpaddq	%zmm15,%zmm0,%zmm0
+
+	vpaddq	%zmm9,%zmm2,%zmm2
+	vpsrlq	$26,%zmm7,%zmm8
+
+	vpsrlq	$26,%zmm2,%zmm13
+	vpandq	%zmm5,%zmm2,%zmm2
+	vpaddq	%zmm13,%zmm14,%zmm3
+
+	vpsrlq	$14,%zmm6,%zmm10
+
+	vpsrlq	$26,%zmm0,%zmm11
+	vpandq	%zmm5,%zmm0,%zmm0
+	vpaddq	%zmm11,%zmm1,%zmm1
+
+	vpsrlq	$40,%zmm6,%zmm6
+
+	vpsrlq	$26,%zmm3,%zmm14
+	vpandq	%zmm5,%zmm3,%zmm3
+	vpaddq	%zmm14,%zmm4,%zmm4
+
+	vpandq	%zmm5,%zmm7,%zmm7
+
+	subq	$128,%rdx
+	ja	.Loop_avx512
+
+.Ltail_avx512:
+
+	vpsrlq	$32,%zmm16,%zmm16
+	vpsrlq	$32,%zmm17,%zmm17
+	vpsrlq	$32,%zmm18,%zmm18
+	vpsrlq	$32,%zmm23,%zmm23
+	vpsrlq	$32,%zmm24,%zmm24
+	vpsrlq	$32,%zmm19,%zmm19
+	vpsrlq	$32,%zmm20,%zmm20
+	vpsrlq	$32,%zmm21,%zmm21
+	vpsrlq	$32,%zmm22,%zmm22
+
+	leaq	(%rsi,%rdx,1),%rsi
+
+	vpaddq	%zmm0,%zmm7,%zmm0
+
+	vpmuludq	%zmm2,%zmm17,%zmm14
+	vpmuludq	%zmm2,%zmm18,%zmm15
+	vpmuludq	%zmm2,%zmm23,%zmm11
+	vpandq	%zmm5,%zmm8,%zmm8
+	vpmuludq	%zmm2,%zmm24,%zmm12
+	vpandq	%zmm5,%zmm10,%zmm10
+	vpmuludq	%zmm2,%zmm16,%zmm13
+	vporq	%zmm30,%zmm6,%zmm6
+	vpaddq	%zmm1,%zmm8,%zmm1
+	vpaddq	%zmm3,%zmm10,%zmm3
+	vpaddq	%zmm4,%zmm6,%zmm4
+
+	vmovdqu	0(%rsi),%xmm7
+	vpmuludq	%zmm0,%zmm19,%zmm28
+	vpmuludq	%zmm0,%zmm20,%zmm29
+	vpmuludq	%zmm0,%zmm16,%zmm25
+	vpmuludq	%zmm0,%zmm17,%zmm26
+	vpaddq	%zmm28,%zmm14,%zmm14
+	vpaddq	%zmm29,%zmm15,%zmm15
+	vpaddq	%zmm25,%zmm11,%zmm11
+	vpaddq	%zmm26,%zmm12,%zmm12
+
+	vmovdqu	16(%rsi),%xmm8
+	vpmuludq	%zmm1,%zmm18,%zmm28
+	vpmuludq	%zmm1,%zmm19,%zmm29
+	vpmuludq	%zmm1,%zmm24,%zmm25
+	vpmuludq	%zmm0,%zmm18,%zmm27
+	vpaddq	%zmm28,%zmm14,%zmm14
+	vpaddq	%zmm29,%zmm15,%zmm15
+	vpaddq	%zmm25,%zmm11,%zmm11
+	vpaddq	%zmm27,%zmm13,%zmm13
+
+	vinserti128	$1,32(%rsi),%ymm7,%ymm7
+	vpmuludq	%zmm3,%zmm16,%zmm28
+	vpmuludq	%zmm3,%zmm17,%zmm29
+	vpmuludq	%zmm1,%zmm16,%zmm26
+	vpmuludq	%zmm1,%zmm17,%zmm27
+	vpaddq	%zmm28,%zmm14,%zmm14
+	vpaddq	%zmm29,%zmm15,%zmm15
+	vpaddq	%zmm26,%zmm12,%zmm12
+	vpaddq	%zmm27,%zmm13,%zmm13
+
+	vinserti128	$1,48(%rsi),%ymm8,%ymm8
+	vpmuludq	%zmm4,%zmm24,%zmm28
+	vpmuludq	%zmm4,%zmm16,%zmm29
+	vpmuludq	%zmm3,%zmm22,%zmm25
+	vpmuludq	%zmm3,%zmm23,%zmm26
+	vpmuludq	%zmm3,%zmm24,%zmm27
+	vpaddq	%zmm28,%zmm14,%zmm3
+	vpaddq	%zmm29,%zmm15,%zmm15
+	vpaddq	%zmm25,%zmm11,%zmm11
+	vpaddq	%zmm26,%zmm12,%zmm12
+	vpaddq	%zmm27,%zmm13,%zmm13
+
+	vpmuludq	%zmm4,%zmm21,%zmm25
+	vpmuludq	%zmm4,%zmm22,%zmm26
+	vpmuludq	%zmm4,%zmm23,%zmm27
+	vpaddq	%zmm25,%zmm11,%zmm0
+	vpaddq	%zmm26,%zmm12,%zmm1
+	vpaddq	%zmm27,%zmm13,%zmm2
+
+	movl	$1,%eax
+	vpermq	$0xb1,%zmm3,%zmm14
+	vpermq	$0xb1,%zmm15,%zmm4
+	vpermq	$0xb1,%zmm0,%zmm11
+	vpermq	$0xb1,%zmm1,%zmm12
+	vpermq	$0xb1,%zmm2,%zmm13
+	vpaddq	%zmm14,%zmm3,%zmm3
+	vpaddq	%zmm15,%zmm4,%zmm4
+	vpaddq	%zmm11,%zmm0,%zmm0
+	vpaddq	%zmm12,%zmm1,%zmm1
+	vpaddq	%zmm13,%zmm2,%zmm2
+
+	kmovw	%eax,%k3
+	vpermq	$0x2,%zmm3,%zmm14
+	vpermq	$0x2,%zmm4,%zmm15
+	vpermq	$0x2,%zmm0,%zmm11
+	vpermq	$0x2,%zmm1,%zmm12
+	vpermq	$0x2,%zmm2,%zmm13
+	vpaddq	%zmm14,%zmm3,%zmm3
+	vpaddq	%zmm15,%zmm4,%zmm4
+	vpaddq	%zmm11,%zmm0,%zmm0
+	vpaddq	%zmm12,%zmm1,%zmm1
+	vpaddq	%zmm13,%zmm2,%zmm2
+
+	vextracti64x4	$0x1,%zmm3,%ymm14
+	vextracti64x4	$0x1,%zmm4,%ymm15
+	vextracti64x4	$0x1,%zmm0,%ymm11
+	vextracti64x4	$0x1,%zmm1,%ymm12
+	vextracti64x4	$0x1,%zmm2,%ymm13
+	vpaddq	%zmm14,%zmm3,%zmm3{%k3}{z}
+	vpaddq	%zmm15,%zmm4,%zmm4{%k3}{z}
+	vpaddq	%zmm11,%zmm0,%zmm0{%k3}{z}
+	vpaddq	%zmm12,%zmm1,%zmm1{%k3}{z}
+	vpaddq	%zmm13,%zmm2,%zmm2{%k3}{z}
+
+	vpsrlq	$26,%ymm3,%ymm14
+	vpand	%ymm5,%ymm3,%ymm3
+	vpsrldq	$6,%ymm7,%ymm9
+	vpsrldq	$6,%ymm8,%ymm10
+	vpunpckhqdq	%ymm8,%ymm7,%ymm6
+	vpaddq	%ymm14,%ymm4,%ymm4
+
+	vpsrlq	$26,%ymm0,%ymm11
+	vpand	%ymm5,%ymm0,%ymm0
+	vpunpcklqdq	%ymm10,%ymm9,%ymm9
+	vpunpcklqdq	%ymm8,%ymm7,%ymm7
+	vpaddq	%ymm11,%ymm1,%ymm1
+
+	vpsrlq	$26,%ymm4,%ymm15
+	vpand	%ymm5,%ymm4,%ymm4
+
+	vpsrlq	$26,%ymm1,%ymm12
+	vpand	%ymm5,%ymm1,%ymm1
+	vpsrlq	$30,%ymm9,%ymm10
+	vpsrlq	$4,%ymm9,%ymm9
+	vpaddq	%ymm12,%ymm2,%ymm2
+
+	vpaddq	%ymm15,%ymm0,%ymm0
+	vpsllq	$2,%ymm15,%ymm15
+	vpsrlq	$26,%ymm7,%ymm8
+	vpsrlq	$40,%ymm6,%ymm6
+	vpaddq	%ymm15,%ymm0,%ymm0
+
+	vpsrlq	$26,%ymm2,%ymm13
+	vpand	%ymm5,%ymm2,%ymm2
+	vpand	%ymm5,%ymm9,%ymm9
+	vpand	%ymm5,%ymm7,%ymm7
+	vpaddq	%ymm13,%ymm3,%ymm3
+
+	vpsrlq	$26,%ymm0,%ymm11
+	vpand	%ymm5,%ymm0,%ymm0
+	vpaddq	%ymm2,%ymm9,%ymm2
+	vpand	%ymm5,%ymm8,%ymm8
+	vpaddq	%ymm11,%ymm1,%ymm1
+
+	vpsrlq	$26,%ymm3,%ymm14
+	vpand	%ymm5,%ymm3,%ymm3
+	vpand	%ymm5,%ymm10,%ymm10
+	vpor	32(%rcx),%ymm6,%ymm6
+	vpaddq	%ymm14,%ymm4,%ymm4
+
+	leaq	144(%rsp),%rax
+	addq	$64,%rdx
+	jnz	.Ltail_avx2_512
+
+	vpsubq	%ymm9,%ymm2,%ymm2
+	vmovd	%xmm0,-112(%rdi)
+	vmovd	%xmm1,-108(%rdi)
+	vmovd	%xmm2,-104(%rdi)
+	vmovd	%xmm3,-100(%rdi)
+	vmovd	%xmm4,-96(%rdi)
+	vzeroall
+	leaq	-8(%r10),%rsp
+
+	ret
+
+ENDPROC(poly1305_blocks_avx512)
+#endif /* CONFIG_AS_AVX512 */
-- 
2.18.0

^ permalink raw reply related

* [PATCH v2 08/17] zinc: Poly1305 ARM and ARM64 implementations
From: Jason A. Donenfeld @ 2018-08-24 21:38 UTC (permalink / raw)
  To: linux-kernel, netdev, davem
  Cc: Jason A. Donenfeld, Andy Lutomirski, Greg KH, Samuel Neves,
	Jean-Philippe Aumasson, Andy Polyakov, Russell King,
	linux-arm-kernel, linux-crypto
In-Reply-To: <20180824213849.23647-1-Jason@zx2c4.com>

These NEON and non-NEON implementations come from Andy Polyakov's
implementation, with several modifications for being friendly to kernel
space.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Andy Polyakov <appro@openssl.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-crypto@vger.kernel.org
---
 lib/zinc/Makefile                     |    8 +
 lib/zinc/poly1305/poly1305-arm-glue.h |   58 ++
 lib/zinc/poly1305/poly1305-arm.S      | 1115 +++++++++++++++++++++++++
 lib/zinc/poly1305/poly1305-arm64.S    |  820 ++++++++++++++++++
 4 files changed, 2001 insertions(+)
 create mode 100644 lib/zinc/poly1305/poly1305-arm-glue.h
 create mode 100644 lib/zinc/poly1305/poly1305-arm.S
 create mode 100644 lib/zinc/poly1305/poly1305-arm64.S

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index bf7b3a75f4c2..db70ffedf7dd 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -24,6 +24,14 @@ endif
 
 ifeq ($(CONFIG_ZINC_POLY1305),y)
 zinc-y += poly1305/poly1305.o
+ifeq ($(CONFIG_ARM),y)
+zinc-y += poly1305/poly1305-arm.o
+CFLAGS_poly1305.o += -include $(srctree)/$(src)/poly1305/poly1305-arm-glue.h
+endif
+ifeq ($(CONFIG_ARM64),y)
+zinc-y += poly1305/poly1305-arm64.o
+CFLAGS_poly1305.o += -include $(srctree)/$(src)/poly1305/poly1305-arm-glue.h
+endif
 endif
 
 zinc-y += main.o
diff --git a/lib/zinc/poly1305/poly1305-arm-glue.h b/lib/zinc/poly1305/poly1305-arm-glue.h
new file mode 100644
index 000000000000..38fa40f26393
--- /dev/null
+++ b/lib/zinc/poly1305/poly1305-arm-glue.h
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <zinc/poly1305.h>
+
+asmlinkage void poly1305_init_arm(void *ctx, const u8 key[16]);
+asmlinkage void poly1305_blocks_arm(void *ctx, const u8 *inp, const size_t len, const u32 padbit);
+asmlinkage void poly1305_emit_arm(void *ctx, u8 mac[16], const u32 nonce[4]);
+#if IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && (!defined(__LINUX_ARM_ARCH__) || __LINUX_ARM_ARCH__ >= 7)
+#define ARM_USE_NEON
+#include <asm/hwcap.h>
+#include <asm/neon.h>
+asmlinkage void poly1305_blocks_neon(void *ctx, const u8 *inp, const size_t len, const u32 padbit);
+asmlinkage void poly1305_emit_neon(void *ctx, u8 mac[16], const u32 nonce[4]);
+#endif
+
+static bool poly1305_use_neon __ro_after_init;
+
+void __init poly1305_fpu_init(void)
+{
+#if defined(CONFIG_ARM64)
+	poly1305_use_neon = elf_hwcap & HWCAP_ASIMD;
+#elif defined(CONFIG_ARM)
+	poly1305_use_neon = elf_hwcap & HWCAP_NEON;
+#endif
+}
+
+static inline bool poly1305_init_arch(void *ctx, const u8 key[POLY1305_KEY_SIZE], simd_context_t simd_context)
+{
+	poly1305_init_arm(ctx, key);
+	return true;
+}
+
+static inline bool poly1305_blocks_arch(void *ctx, const u8 *inp, const size_t len, const u32 padbit, simd_context_t simd_context)
+{
+#if defined(ARM_USE_NEON)
+	if (poly1305_use_neon && simd_context == HAVE_FULL_SIMD)
+		poly1305_blocks_neon(ctx, inp, len, padbit);
+	else
+#endif
+		poly1305_blocks_arm(ctx, inp, len, padbit);
+	return true;
+}
+
+static inline bool poly1305_emit_arch(void *ctx, u8 mac[POLY1305_MAC_SIZE], const u32 nonce[4], simd_context_t simd_context)
+{
+#if defined(ARM_USE_NEON)
+	if (poly1305_use_neon && simd_context == HAVE_FULL_SIMD)
+		poly1305_emit_neon(ctx, mac, nonce);
+	else
+#endif
+		poly1305_emit_arm(ctx, mac, nonce);
+	return true;
+}
+
+#define HAVE_POLY1305_ARCH_IMPLEMENTATION
diff --git a/lib/zinc/poly1305/poly1305-arm.S b/lib/zinc/poly1305/poly1305-arm.S
new file mode 100644
index 000000000000..d6b9a80dcc8b
--- /dev/null
+++ b/lib/zinc/poly1305/poly1305-arm.S
@@ -0,0 +1,1115 @@
+/* SPDX-License-Identifier: OpenSSL OR (BSD-3-Clause OR GPL-2.0)
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ * Copyright 2016 The OpenSSL Project Authors. All Rights Reserved.
+ */
+
+#include <linux/linkage.h>
+
+.text
+#if defined(__thumb2__)
+.syntax	unified
+.thumb
+#else
+.code	32
+#endif
+
+.align	5
+ENTRY(poly1305_init_arm)
+	stmdb	sp!,{r4-r11}
+
+	eor	r3,r3,r3
+	cmp	r1,#0
+	str	r3,[r0,#0]		@ zero hash value
+	str	r3,[r0,#4]
+	str	r3,[r0,#8]
+	str	r3,[r0,#12]
+	str	r3,[r0,#16]
+	str	r3,[r0,#36]		@ is_base2_26
+	add	r0,r0,#20
+
+#ifdef	__thumb2__
+	it	eq
+#endif
+	moveq	r0,#0
+	beq	.Lno_key
+
+	ldrb	r4,[r1,#0]
+	mov	r10,#0x0fffffff
+	ldrb	r5,[r1,#1]
+	and	r3,r10,#-4		@ 0x0ffffffc
+	ldrb	r6,[r1,#2]
+	ldrb	r7,[r1,#3]
+	orr	r4,r4,r5,lsl#8
+	ldrb	r5,[r1,#4]
+	orr	r4,r4,r6,lsl#16
+	ldrb	r6,[r1,#5]
+	orr	r4,r4,r7,lsl#24
+	ldrb	r7,[r1,#6]
+	and	r4,r4,r10
+
+	ldrb	r8,[r1,#7]
+	orr	r5,r5,r6,lsl#8
+	ldrb	r6,[r1,#8]
+	orr	r5,r5,r7,lsl#16
+	ldrb	r7,[r1,#9]
+	orr	r5,r5,r8,lsl#24
+	ldrb	r8,[r1,#10]
+	and	r5,r5,r3
+
+	ldrb	r9,[r1,#11]
+	orr	r6,r6,r7,lsl#8
+	ldrb	r7,[r1,#12]
+	orr	r6,r6,r8,lsl#16
+	ldrb	r8,[r1,#13]
+	orr	r6,r6,r9,lsl#24
+	ldrb	r9,[r1,#14]
+	and	r6,r6,r3
+
+	ldrb	r10,[r1,#15]
+	orr	r7,r7,r8,lsl#8
+	str	r4,[r0,#0]
+	orr	r7,r7,r9,lsl#16
+	str	r5,[r0,#4]
+	orr	r7,r7,r10,lsl#24
+	str	r6,[r0,#8]
+	and	r7,r7,r3
+	str	r7,[r0,#12]
+.Lno_key:
+	ldmia	sp!,{r4-r11}
+#if __LINUX_ARM_ARCH__ >= 5
+	bx	lr				@ bx	lr
+#else
+	tst	lr,#1
+	moveq	pc,lr			@ be binary compatible with V4, yet
+	.word	0xe12fff1e			@ interoperable with Thumb ISA:-)
+#endif
+ENDPROC(poly1305_init_arm)
+
+.align	5
+ENTRY(poly1305_blocks_arm)
+.Lpoly1305_blocks_arm:
+	stmdb	sp!,{r3-r11,lr}
+
+	ands	r2,r2,#-16
+	beq	.Lno_data
+
+	cmp	r3,#0
+	add	r2,r2,r1		@ end pointer
+	sub	sp,sp,#32
+
+	ldmia	r0,{r4-r12}		@ load context
+
+	str	r0,[sp,#12]		@ offload stuff
+	mov	lr,r1
+	str	r2,[sp,#16]
+	str	r10,[sp,#20]
+	str	r11,[sp,#24]
+	str	r12,[sp,#28]
+	b	.Loop
+
+.Loop:
+#if __LINUX_ARM_ARCH__ < 7
+	ldrb	r0,[lr],#16		@ load input
+#ifdef	__thumb2__
+	it	hi
+#endif
+	addhi	r8,r8,#1		@ 1<<128
+	ldrb	r1,[lr,#-15]
+	ldrb	r2,[lr,#-14]
+	ldrb	r3,[lr,#-13]
+	orr	r1,r0,r1,lsl#8
+	ldrb	r0,[lr,#-12]
+	orr	r2,r1,r2,lsl#16
+	ldrb	r1,[lr,#-11]
+	orr	r3,r2,r3,lsl#24
+	ldrb	r2,[lr,#-10]
+	adds	r4,r4,r3		@ accumulate input
+
+	ldrb	r3,[lr,#-9]
+	orr	r1,r0,r1,lsl#8
+	ldrb	r0,[lr,#-8]
+	orr	r2,r1,r2,lsl#16
+	ldrb	r1,[lr,#-7]
+	orr	r3,r2,r3,lsl#24
+	ldrb	r2,[lr,#-6]
+	adcs	r5,r5,r3
+
+	ldrb	r3,[lr,#-5]
+	orr	r1,r0,r1,lsl#8
+	ldrb	r0,[lr,#-4]
+	orr	r2,r1,r2,lsl#16
+	ldrb	r1,[lr,#-3]
+	orr	r3,r2,r3,lsl#24
+	ldrb	r2,[lr,#-2]
+	adcs	r6,r6,r3
+
+	ldrb	r3,[lr,#-1]
+	orr	r1,r0,r1,lsl#8
+	str	lr,[sp,#8]		@ offload input pointer
+	orr	r2,r1,r2,lsl#16
+	add	r10,r10,r10,lsr#2
+	orr	r3,r2,r3,lsl#24
+#else
+	ldr	r0,[lr],#16		@ load input
+#ifdef	__thumb2__
+	it	hi
+#endif
+	addhi	r8,r8,#1		@ padbit
+	ldr	r1,[lr,#-12]
+	ldr	r2,[lr,#-8]
+	ldr	r3,[lr,#-4]
+#ifdef	__ARMEB__
+	rev	r0,r0
+	rev	r1,r1
+	rev	r2,r2
+	rev	r3,r3
+#endif
+	adds	r4,r4,r0		@ accumulate input
+	str	lr,[sp,#8]		@ offload input pointer
+	adcs	r5,r5,r1
+	add	r10,r10,r10,lsr#2
+	adcs	r6,r6,r2
+#endif
+	add	r11,r11,r11,lsr#2
+	adcs	r7,r7,r3
+	add	r12,r12,r12,lsr#2
+
+	umull	r2,r3,r5,r9
+	 adc	r8,r8,#0
+	umull	r0,r1,r4,r9
+	umlal	r2,r3,r8,r10
+	umlal	r0,r1,r7,r10
+	ldr	r10,[sp,#20]		@ reload r10
+	umlal	r2,r3,r6,r12
+	umlal	r0,r1,r5,r12
+	umlal	r2,r3,r7,r11
+	umlal	r0,r1,r6,r11
+	umlal	r2,r3,r4,r10
+	str	r0,[sp,#0]		@ future r4
+	 mul	r0,r11,r8
+	ldr	r11,[sp,#24]		@ reload r11
+	adds	r2,r2,r1		@ d1+=d0>>32
+	 eor	r1,r1,r1
+	adc	lr,r3,#0		@ future r6
+	str	r2,[sp,#4]		@ future r5
+
+	mul	r2,r12,r8
+	eor	r3,r3,r3
+	umlal	r0,r1,r7,r12
+	ldr	r12,[sp,#28]		@ reload r12
+	umlal	r2,r3,r7,r9
+	umlal	r0,r1,r6,r9
+	umlal	r2,r3,r6,r10
+	umlal	r0,r1,r5,r10
+	umlal	r2,r3,r5,r11
+	umlal	r0,r1,r4,r11
+	umlal	r2,r3,r4,r12
+	ldr	r4,[sp,#0]
+	mul	r8,r9,r8
+	ldr	r5,[sp,#4]
+
+	adds	r6,lr,r0		@ d2+=d1>>32
+	ldr	lr,[sp,#8]		@ reload input pointer
+	adc	r1,r1,#0
+	adds	r7,r2,r1		@ d3+=d2>>32
+	ldr	r0,[sp,#16]		@ reload end pointer
+	adc	r3,r3,#0
+	add	r8,r8,r3		@ h4+=d3>>32
+
+	and	r1,r8,#-4
+	and	r8,r8,#3
+	add	r1,r1,r1,lsr#2		@ *=5
+	adds	r4,r4,r1
+	adcs	r5,r5,#0
+	adcs	r6,r6,#0
+	adcs	r7,r7,#0
+	adc	r8,r8,#0
+
+	cmp	r0,lr			@ done yet?
+	bhi	.Loop
+
+	ldr	r0,[sp,#12]
+	add	sp,sp,#32
+	stmia	r0,{r4-r8}		@ store the result
+
+.Lno_data:
+#if __LINUX_ARM_ARCH__ >= 5
+	ldmia	sp!,{r3-r11,pc}
+#else
+	ldmia	sp!,{r3-r11,lr}
+	tst	lr,#1
+	moveq	pc,lr			@ be binary compatible with V4, yet
+	.word	0xe12fff1e			@ interoperable with Thumb ISA:-)
+#endif
+ENDPROC(poly1305_blocks_arm)
+
+.align	5
+ENTRY(poly1305_emit_arm)
+	stmdb	sp!,{r4-r11}
+.Lpoly1305_emit_enter:
+	ldmia	r0,{r3-r7}
+	adds	r8,r3,#5		@ compare to modulus
+	adcs	r9,r4,#0
+	adcs	r10,r5,#0
+	adcs	r11,r6,#0
+	adc	r7,r7,#0
+	tst	r7,#4			@ did it carry/borrow?
+
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	r3,r8
+	ldr	r8,[r2,#0]
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	r4,r9
+	ldr	r9,[r2,#4]
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	r5,r10
+	ldr	r10,[r2,#8]
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	r6,r11
+	ldr	r11,[r2,#12]
+
+	adds	r3,r3,r8
+	adcs	r4,r4,r9
+	adcs	r5,r5,r10
+	adc	r6,r6,r11
+
+#if __LINUX_ARM_ARCH__ >= 7
+#ifdef __ARMEB__
+	rev	r3,r3
+	rev	r4,r4
+	rev	r5,r5
+	rev	r6,r6
+#endif
+	str	r3,[r1,#0]
+	str	r4,[r1,#4]
+	str	r5,[r1,#8]
+	str	r6,[r1,#12]
+#else
+	strb	r3,[r1,#0]
+	mov	r3,r3,lsr#8
+	strb	r4,[r1,#4]
+	mov	r4,r4,lsr#8
+	strb	r5,[r1,#8]
+	mov	r5,r5,lsr#8
+	strb	r6,[r1,#12]
+	mov	r6,r6,lsr#8
+
+	strb	r3,[r1,#1]
+	mov	r3,r3,lsr#8
+	strb	r4,[r1,#5]
+	mov	r4,r4,lsr#8
+	strb	r5,[r1,#9]
+	mov	r5,r5,lsr#8
+	strb	r6,[r1,#13]
+	mov	r6,r6,lsr#8
+
+	strb	r3,[r1,#2]
+	mov	r3,r3,lsr#8
+	strb	r4,[r1,#6]
+	mov	r4,r4,lsr#8
+	strb	r5,[r1,#10]
+	mov	r5,r5,lsr#8
+	strb	r6,[r1,#14]
+	mov	r6,r6,lsr#8
+
+	strb	r3,[r1,#3]
+	strb	r4,[r1,#7]
+	strb	r5,[r1,#11]
+	strb	r6,[r1,#15]
+#endif
+	ldmia	sp!,{r4-r11}
+#if __LINUX_ARM_ARCH__ >= 5
+	bx	lr				@ bx	lr
+#else
+	tst	lr,#1
+	moveq	pc,lr			@ be binary compatible with V4, yet
+	.word	0xe12fff1e			@ interoperable with Thumb ISA:-)
+#endif
+ENDPROC(poly1305_emit_arm)
+
+
+#if __LINUX_ARM_ARCH__ >= 7
+.fpu	neon
+
+.align	5
+ENTRY(poly1305_init_neon)
+.Lpoly1305_init_neon:
+	ldr	r4,[r0,#20]		@ load key base 2^32
+	ldr	r5,[r0,#24]
+	ldr	r6,[r0,#28]
+	ldr	r7,[r0,#32]
+
+	and	r2,r4,#0x03ffffff	@ base 2^32 -> base 2^26
+	mov	r3,r4,lsr#26
+	mov	r4,r5,lsr#20
+	orr	r3,r3,r5,lsl#6
+	mov	r5,r6,lsr#14
+	orr	r4,r4,r6,lsl#12
+	mov	r6,r7,lsr#8
+	orr	r5,r5,r7,lsl#18
+	and	r3,r3,#0x03ffffff
+	and	r4,r4,#0x03ffffff
+	and	r5,r5,#0x03ffffff
+
+	vdup.32	d0,r2			@ r^1 in both lanes
+	add	r2,r3,r3,lsl#2		@ *5
+	vdup.32	d1,r3
+	add	r3,r4,r4,lsl#2
+	vdup.32	d2,r2
+	vdup.32	d3,r4
+	add	r4,r5,r5,lsl#2
+	vdup.32	d4,r3
+	vdup.32	d5,r5
+	add	r5,r6,r6,lsl#2
+	vdup.32	d6,r4
+	vdup.32	d7,r6
+	vdup.32	d8,r5
+
+	mov	r5,#2		@ counter
+
+.Lsquare_neon:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
+	@ d1 = h1*r0 + h0*r1   + h4*5*r2 + h3*5*r3 + h2*5*r4
+	@ d2 = h2*r0 + h1*r1   + h0*r2   + h4*5*r3 + h3*5*r4
+	@ d3 = h3*r0 + h2*r1   + h1*r2   + h0*r3   + h4*5*r4
+	@ d4 = h4*r0 + h3*r1   + h2*r2   + h1*r3   + h0*r4
+
+	vmull.u32	q5,d0,d0[1]
+	vmull.u32	q6,d1,d0[1]
+	vmull.u32	q7,d3,d0[1]
+	vmull.u32	q8,d5,d0[1]
+	vmull.u32	q9,d7,d0[1]
+
+	vmlal.u32	q5,d7,d2[1]
+	vmlal.u32	q6,d0,d1[1]
+	vmlal.u32	q7,d1,d1[1]
+	vmlal.u32	q8,d3,d1[1]
+	vmlal.u32	q9,d5,d1[1]
+
+	vmlal.u32	q5,d5,d4[1]
+	vmlal.u32	q6,d7,d4[1]
+	vmlal.u32	q8,d1,d3[1]
+	vmlal.u32	q7,d0,d3[1]
+	vmlal.u32	q9,d3,d3[1]
+
+	vmlal.u32	q5,d3,d6[1]
+	vmlal.u32	q8,d0,d5[1]
+	vmlal.u32	q6,d5,d6[1]
+	vmlal.u32	q7,d7,d6[1]
+	vmlal.u32	q9,d1,d5[1]
+
+	vmlal.u32	q8,d7,d8[1]
+	vmlal.u32	q5,d1,d8[1]
+	vmlal.u32	q6,d3,d8[1]
+	vmlal.u32	q7,d5,d8[1]
+	vmlal.u32	q9,d0,d7[1]
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ lazy reduction as discussed in "NEON crypto" by D.J. Bernstein
+	@ and P. Schwabe
+	@
+	@ H0>>+H1>>+H2>>+H3>>+H4
+	@ H3>>+H4>>*5+H0>>+H1
+	@
+	@ Trivia.
+	@
+	@ Result of multiplication of n-bit number by m-bit number is
+	@ n+m bits wide. However! Even though 2^n is a n+1-bit number,
+	@ m-bit number multiplied by 2^n is still n+m bits wide.
+	@
+	@ Sum of two n-bit numbers is n+1 bits wide, sum of three - n+2,
+	@ and so is sum of four. Sum of 2^m n-m-bit numbers and n-bit
+	@ one is n+1 bits wide.
+	@
+	@ >>+ denotes Hnext += Hn>>26, Hn &= 0x3ffffff. This means that
+	@ H0, H2, H3 are guaranteed to be 26 bits wide, while H1 and H4
+	@ can be 27. However! In cases when their width exceeds 26 bits
+	@ they are limited by 2^26+2^6. This in turn means that *sum*
+	@ of the products with these values can still be viewed as sum
+	@ of 52-bit numbers as long as the amount of addends is not a
+	@ power of 2. For example,
+	@
+	@ H4 = H4*R0 + H3*R1 + H2*R2 + H1*R3 + H0 * R4,
+	@
+	@ which can't be larger than 5 * (2^26 + 2^6) * (2^26 + 2^6), or
+	@ 5 * (2^52 + 2*2^32 + 2^12), which in turn is smaller than
+	@ 8 * (2^52) or 2^55. However, the value is then multiplied by
+	@ by 5, so we should be looking at 5 * 5 * (2^52 + 2^33 + 2^12),
+	@ which is less than 32 * (2^52) or 2^57. And when processing
+	@ data we are looking at triple as many addends...
+	@
+	@ In key setup procedure pre-reduced H0 is limited by 5*4+1 and
+	@ 5*H4 - by 5*5 52-bit addends, or 57 bits. But when hashing the
+	@ input H0 is limited by (5*4+1)*3 addends, or 58 bits, while
+	@ 5*H4 by 5*5*3, or 59[!] bits. How is this relevant? vmlal.u32
+	@ instruction accepts 2x32-bit input and writes 2x64-bit result.
+	@ This means that result of reduction have to be compressed upon
+	@ loop wrap-around. This can be done in the process of reduction
+	@ to minimize amount of instructions [as well as amount of
+	@ 128-bit instructions, which benefits low-end processors], but
+	@ one has to watch for H2 (which is narrower than H0) and 5*H4
+	@ not being wider than 58 bits, so that result of right shift
+	@ by 26 bits fits in 32 bits. This is also useful on x86,
+	@ because it allows to use paddd in place for paddq, which
+	@ benefits Atom, where paddq is ridiculously slow.
+
+	vshr.u64	q15,q8,#26
+	vmovn.i64	d16,q8
+	 vshr.u64	q4,q5,#26
+	 vmovn.i64	d10,q5
+	vadd.i64	q9,q9,q15		@ h3 -> h4
+	vbic.i32	d16,#0xfc000000	@ &=0x03ffffff
+	 vadd.i64	q6,q6,q4		@ h0 -> h1
+	 vbic.i32	d10,#0xfc000000
+
+	vshrn.u64	d30,q9,#26
+	vmovn.i64	d18,q9
+	 vshr.u64	q4,q6,#26
+	 vmovn.i64	d12,q6
+	 vadd.i64	q7,q7,q4		@ h1 -> h2
+	vbic.i32	d18,#0xfc000000
+	 vbic.i32	d12,#0xfc000000
+
+	vadd.i32	d10,d10,d30
+	vshl.u32	d30,d30,#2
+	 vshrn.u64	d8,q7,#26
+	 vmovn.i64	d14,q7
+	vadd.i32	d10,d10,d30	@ h4 -> h0
+	 vadd.i32	d16,d16,d8	@ h2 -> h3
+	 vbic.i32	d14,#0xfc000000
+
+	vshr.u32	d30,d10,#26
+	vbic.i32	d10,#0xfc000000
+	 vshr.u32	d8,d16,#26
+	 vbic.i32	d16,#0xfc000000
+	vadd.i32	d12,d12,d30	@ h0 -> h1
+	 vadd.i32	d18,d18,d8	@ h3 -> h4
+
+	subs		r5,r5,#1
+	beq		.Lsquare_break_neon
+
+	add		r6,r0,#(48+0*9*4)
+	add		r7,r0,#(48+1*9*4)
+
+	vtrn.32		d0,d10		@ r^2:r^1
+	vtrn.32		d3,d14
+	vtrn.32		d5,d16
+	vtrn.32		d1,d12
+	vtrn.32		d7,d18
+
+	vshl.u32	d4,d3,#2		@ *5
+	vshl.u32	d6,d5,#2
+	vshl.u32	d2,d1,#2
+	vshl.u32	d8,d7,#2
+	vadd.i32	d4,d4,d3
+	vadd.i32	d2,d2,d1
+	vadd.i32	d6,d6,d5
+	vadd.i32	d8,d8,d7
+
+	vst4.32		{d0[0],d1[0],d2[0],d3[0]},[r6]!
+	vst4.32		{d0[1],d1[1],d2[1],d3[1]},[r7]!
+	vst4.32		{d4[0],d5[0],d6[0],d7[0]},[r6]!
+	vst4.32		{d4[1],d5[1],d6[1],d7[1]},[r7]!
+	vst1.32		{d8[0]},[r6,:32]
+	vst1.32		{d8[1]},[r7,:32]
+
+	b		.Lsquare_neon
+
+.align	4
+.Lsquare_break_neon:
+	add		r6,r0,#(48+2*4*9)
+	add		r7,r0,#(48+3*4*9)
+
+	vmov		d0,d10		@ r^4:r^3
+	vshl.u32	d2,d12,#2		@ *5
+	vmov		d1,d12
+	vshl.u32	d4,d14,#2
+	vmov		d3,d14
+	vshl.u32	d6,d16,#2
+	vmov		d5,d16
+	vshl.u32	d8,d18,#2
+	vmov		d7,d18
+	vadd.i32	d2,d2,d12
+	vadd.i32	d4,d4,d14
+	vadd.i32	d6,d6,d16
+	vadd.i32	d8,d8,d18
+
+	vst4.32		{d0[0],d1[0],d2[0],d3[0]},[r6]!
+	vst4.32		{d0[1],d1[1],d2[1],d3[1]},[r7]!
+	vst4.32		{d4[0],d5[0],d6[0],d7[0]},[r6]!
+	vst4.32		{d4[1],d5[1],d6[1],d7[1]},[r7]!
+	vst1.32		{d8[0]},[r6]
+	vst1.32		{d8[1]},[r7]
+
+	bx	lr				@ bx	lr
+ENDPROC(poly1305_init_neon)
+
+.align	5
+ENTRY(poly1305_blocks_neon)
+	ldr	ip,[r0,#36]		@ is_base2_26
+	ands	r2,r2,#-16
+	beq	.Lno_data_neon
+
+	cmp	r2,#64
+	bhs	.Lenter_neon
+	tst	ip,ip			@ is_base2_26?
+	beq	.Lpoly1305_blocks_arm
+
+.Lenter_neon:
+	stmdb	sp!,{r4-r7}
+	vstmdb	sp!,{d8-d15}		@ ABI specification says so
+
+	tst	ip,ip			@ is_base2_26?
+	bne	.Lbase2_26_neon
+
+	stmdb	sp!,{r1-r3,lr}
+	bl	.Lpoly1305_init_neon
+
+	ldr	r4,[r0,#0]		@ load hash value base 2^32
+	ldr	r5,[r0,#4]
+	ldr	r6,[r0,#8]
+	ldr	r7,[r0,#12]
+	ldr	ip,[r0,#16]
+
+	and	r2,r4,#0x03ffffff	@ base 2^32 -> base 2^26
+	mov	r3,r4,lsr#26
+	 veor	d10,d10,d10
+	mov	r4,r5,lsr#20
+	orr	r3,r3,r5,lsl#6
+	 veor	d12,d12,d12
+	mov	r5,r6,lsr#14
+	orr	r4,r4,r6,lsl#12
+	 veor	d14,d14,d14
+	mov	r6,r7,lsr#8
+	orr	r5,r5,r7,lsl#18
+	 veor	d16,d16,d16
+	and	r3,r3,#0x03ffffff
+	orr	r6,r6,ip,lsl#24
+	 veor	d18,d18,d18
+	and	r4,r4,#0x03ffffff
+	mov	r1,#1
+	and	r5,r5,#0x03ffffff
+	str	r1,[r0,#36]		@ is_base2_26
+
+	vmov.32	d10[0],r2
+	vmov.32	d12[0],r3
+	vmov.32	d14[0],r4
+	vmov.32	d16[0],r5
+	vmov.32	d18[0],r6
+	adr	r5,.Lzeros
+
+	ldmia	sp!,{r1-r3,lr}
+	b	.Lbase2_32_neon
+
+.align	4
+.Lbase2_26_neon:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ load hash value
+
+	veor		d10,d10,d10
+	veor		d12,d12,d12
+	veor		d14,d14,d14
+	veor		d16,d16,d16
+	veor		d18,d18,d18
+	vld4.32		{d10[0],d12[0],d14[0],d16[0]},[r0]!
+	adr		r5,.Lzeros
+	vld1.32		{d18[0]},[r0]
+	sub		r0,r0,#16		@ rewind
+
+.Lbase2_32_neon:
+	add		r4,r1,#32
+	mov		r3,r3,lsl#24
+	tst		r2,#31
+	beq		.Leven
+
+	vld4.32		{d20[0],d22[0],d24[0],d26[0]},[r1]!
+	vmov.32		d28[0],r3
+	sub		r2,r2,#16
+	add		r4,r1,#32
+
+#ifdef	__ARMEB__
+	vrev32.8	q10,q10
+	vrev32.8	q13,q13
+	vrev32.8	q11,q11
+	vrev32.8	q12,q12
+#endif
+	vsri.u32	d28,d26,#8	@ base 2^32 -> base 2^26
+	vshl.u32	d26,d26,#18
+
+	vsri.u32	d26,d24,#14
+	vshl.u32	d24,d24,#12
+	vadd.i32	d29,d28,d18	@ add hash value and move to #hi
+
+	vbic.i32	d26,#0xfc000000
+	vsri.u32	d24,d22,#20
+	vshl.u32	d22,d22,#6
+
+	vbic.i32	d24,#0xfc000000
+	vsri.u32	d22,d20,#26
+	vadd.i32	d27,d26,d16
+
+	vbic.i32	d20,#0xfc000000
+	vbic.i32	d22,#0xfc000000
+	vadd.i32	d25,d24,d14
+
+	vadd.i32	d21,d20,d10
+	vadd.i32	d23,d22,d12
+
+	mov		r7,r5
+	add		r6,r0,#48
+
+	cmp		r2,r2
+	b		.Long_tail
+
+.align	4
+.Leven:
+	subs		r2,r2,#64
+	it		lo
+	movlo		r4,r5
+
+	vmov.i32	q14,#1<<24		@ padbit, yes, always
+	vld4.32		{d20,d22,d24,d26},[r1]	@ inp[0:1]
+	add		r1,r1,#64
+	vld4.32		{d21,d23,d25,d27},[r4]	@ inp[2:3] (or 0)
+	add		r4,r4,#64
+	itt		hi
+	addhi		r7,r0,#(48+1*9*4)
+	addhi		r6,r0,#(48+3*9*4)
+
+#ifdef	__ARMEB__
+	vrev32.8	q10,q10
+	vrev32.8	q13,q13
+	vrev32.8	q11,q11
+	vrev32.8	q12,q12
+#endif
+	vsri.u32	q14,q13,#8		@ base 2^32 -> base 2^26
+	vshl.u32	q13,q13,#18
+
+	vsri.u32	q13,q12,#14
+	vshl.u32	q12,q12,#12
+
+	vbic.i32	q13,#0xfc000000
+	vsri.u32	q12,q11,#20
+	vshl.u32	q11,q11,#6
+
+	vbic.i32	q12,#0xfc000000
+	vsri.u32	q11,q10,#26
+
+	vbic.i32	q10,#0xfc000000
+	vbic.i32	q11,#0xfc000000
+
+	bls		.Lskip_loop
+
+	vld4.32		{d0[1],d1[1],d2[1],d3[1]},[r7]!	@ load r^2
+	vld4.32		{d0[0],d1[0],d2[0],d3[0]},[r6]!	@ load r^4
+	vld4.32		{d4[1],d5[1],d6[1],d7[1]},[r7]!
+	vld4.32		{d4[0],d5[0],d6[0],d7[0]},[r6]!
+	b		.Loop_neon
+
+.align	5
+.Loop_neon:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+	@ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
+	@   ___________________/
+	@ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2
+	@ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r
+	@   ___________________/ ____________________/
+	@
+	@ Note that we start with inp[2:3]*r^2. This is because it
+	@ doesn't depend on reduction in previous iteration.
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ d4 = h4*r0 + h3*r1   + h2*r2   + h1*r3   + h0*r4
+	@ d3 = h3*r0 + h2*r1   + h1*r2   + h0*r3   + h4*5*r4
+	@ d2 = h2*r0 + h1*r1   + h0*r2   + h4*5*r3 + h3*5*r4
+	@ d1 = h1*r0 + h0*r1   + h4*5*r2 + h3*5*r3 + h2*5*r4
+	@ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ inp[2:3]*r^2
+
+	vadd.i32	d24,d24,d14	@ accumulate inp[0:1]
+	vmull.u32	q7,d25,d0[1]
+	vadd.i32	d20,d20,d10
+	vmull.u32	q5,d21,d0[1]
+	vadd.i32	d26,d26,d16
+	vmull.u32	q8,d27,d0[1]
+	vmlal.u32	q7,d23,d1[1]
+	vadd.i32	d22,d22,d12
+	vmull.u32	q6,d23,d0[1]
+
+	vadd.i32	d28,d28,d18
+	vmull.u32	q9,d29,d0[1]
+	subs		r2,r2,#64
+	vmlal.u32	q5,d29,d2[1]
+	it		lo
+	movlo		r4,r5
+	vmlal.u32	q8,d25,d1[1]
+	vld1.32		d8[1],[r7,:32]
+	vmlal.u32	q6,d21,d1[1]
+	vmlal.u32	q9,d27,d1[1]
+
+	vmlal.u32	q5,d27,d4[1]
+	vmlal.u32	q8,d23,d3[1]
+	vmlal.u32	q9,d25,d3[1]
+	vmlal.u32	q6,d29,d4[1]
+	vmlal.u32	q7,d21,d3[1]
+
+	vmlal.u32	q8,d21,d5[1]
+	vmlal.u32	q5,d25,d6[1]
+	vmlal.u32	q9,d23,d5[1]
+	vmlal.u32	q6,d27,d6[1]
+	vmlal.u32	q7,d29,d6[1]
+
+	vmlal.u32	q8,d29,d8[1]
+	vmlal.u32	q5,d23,d8[1]
+	vmlal.u32	q9,d21,d7[1]
+	vmlal.u32	q6,d25,d8[1]
+	vmlal.u32	q7,d27,d8[1]
+
+	vld4.32		{d21,d23,d25,d27},[r4]	@ inp[2:3] (or 0)
+	add		r4,r4,#64
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ (hash+inp[0:1])*r^4 and accumulate
+
+	vmlal.u32	q8,d26,d0[0]
+	vmlal.u32	q5,d20,d0[0]
+	vmlal.u32	q9,d28,d0[0]
+	vmlal.u32	q6,d22,d0[0]
+	vmlal.u32	q7,d24,d0[0]
+	vld1.32		d8[0],[r6,:32]
+
+	vmlal.u32	q8,d24,d1[0]
+	vmlal.u32	q5,d28,d2[0]
+	vmlal.u32	q9,d26,d1[0]
+	vmlal.u32	q6,d20,d1[0]
+	vmlal.u32	q7,d22,d1[0]
+
+	vmlal.u32	q8,d22,d3[0]
+	vmlal.u32	q5,d26,d4[0]
+	vmlal.u32	q9,d24,d3[0]
+	vmlal.u32	q6,d28,d4[0]
+	vmlal.u32	q7,d20,d3[0]
+
+	vmlal.u32	q8,d20,d5[0]
+	vmlal.u32	q5,d24,d6[0]
+	vmlal.u32	q9,d22,d5[0]
+	vmlal.u32	q6,d26,d6[0]
+	vmlal.u32	q8,d28,d8[0]
+
+	vmlal.u32	q7,d28,d6[0]
+	vmlal.u32	q5,d22,d8[0]
+	vmlal.u32	q9,d20,d7[0]
+	vmov.i32	q14,#1<<24		@ padbit, yes, always
+	vmlal.u32	q6,d24,d8[0]
+	vmlal.u32	q7,d26,d8[0]
+
+	vld4.32		{d20,d22,d24,d26},[r1]	@ inp[0:1]
+	add		r1,r1,#64
+#ifdef	__ARMEB__
+	vrev32.8	q10,q10
+	vrev32.8	q11,q11
+	vrev32.8	q12,q12
+	vrev32.8	q13,q13
+#endif
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ lazy reduction interleaved with base 2^32 -> base 2^26 of
+	@ inp[0:3] previously loaded to q10-q13 and smashed to q10-q14.
+
+	vshr.u64	q15,q8,#26
+	vmovn.i64	d16,q8
+	 vshr.u64	q4,q5,#26
+	 vmovn.i64	d10,q5
+	vadd.i64	q9,q9,q15		@ h3 -> h4
+	vbic.i32	d16,#0xfc000000
+	  vsri.u32	q14,q13,#8		@ base 2^32 -> base 2^26
+	 vadd.i64	q6,q6,q4		@ h0 -> h1
+	  vshl.u32	q13,q13,#18
+	 vbic.i32	d10,#0xfc000000
+
+	vshrn.u64	d30,q9,#26
+	vmovn.i64	d18,q9
+	 vshr.u64	q4,q6,#26
+	 vmovn.i64	d12,q6
+	 vadd.i64	q7,q7,q4		@ h1 -> h2
+	  vsri.u32	q13,q12,#14
+	vbic.i32	d18,#0xfc000000
+	  vshl.u32	q12,q12,#12
+	 vbic.i32	d12,#0xfc000000
+
+	vadd.i32	d10,d10,d30
+	vshl.u32	d30,d30,#2
+	  vbic.i32	q13,#0xfc000000
+	 vshrn.u64	d8,q7,#26
+	 vmovn.i64	d14,q7
+	vaddl.u32	q5,d10,d30	@ h4 -> h0 [widen for a sec]
+	  vsri.u32	q12,q11,#20
+	 vadd.i32	d16,d16,d8	@ h2 -> h3
+	  vshl.u32	q11,q11,#6
+	 vbic.i32	d14,#0xfc000000
+	  vbic.i32	q12,#0xfc000000
+
+	vshrn.u64	d30,q5,#26		@ re-narrow
+	vmovn.i64	d10,q5
+	  vsri.u32	q11,q10,#26
+	  vbic.i32	q10,#0xfc000000
+	 vshr.u32	d8,d16,#26
+	 vbic.i32	d16,#0xfc000000
+	vbic.i32	d10,#0xfc000000
+	vadd.i32	d12,d12,d30	@ h0 -> h1
+	 vadd.i32	d18,d18,d8	@ h3 -> h4
+	  vbic.i32	q11,#0xfc000000
+
+	bhi		.Loop_neon
+
+.Lskip_loop:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1
+
+	add		r7,r0,#(48+0*9*4)
+	add		r6,r0,#(48+1*9*4)
+	adds		r2,r2,#32
+	it		ne
+	movne		r2,#0
+	bne		.Long_tail
+
+	vadd.i32	d25,d24,d14	@ add hash value and move to #hi
+	vadd.i32	d21,d20,d10
+	vadd.i32	d27,d26,d16
+	vadd.i32	d23,d22,d12
+	vadd.i32	d29,d28,d18
+
+.Long_tail:
+	vld4.32		{d0[1],d1[1],d2[1],d3[1]},[r7]!	@ load r^1
+	vld4.32		{d0[0],d1[0],d2[0],d3[0]},[r6]!	@ load r^2
+
+	vadd.i32	d24,d24,d14	@ can be redundant
+	vmull.u32	q7,d25,d0
+	vadd.i32	d20,d20,d10
+	vmull.u32	q5,d21,d0
+	vadd.i32	d26,d26,d16
+	vmull.u32	q8,d27,d0
+	vadd.i32	d22,d22,d12
+	vmull.u32	q6,d23,d0
+	vadd.i32	d28,d28,d18
+	vmull.u32	q9,d29,d0
+
+	vmlal.u32	q5,d29,d2
+	vld4.32		{d4[1],d5[1],d6[1],d7[1]},[r7]!
+	vmlal.u32	q8,d25,d1
+	vld4.32		{d4[0],d5[0],d6[0],d7[0]},[r6]!
+	vmlal.u32	q6,d21,d1
+	vmlal.u32	q9,d27,d1
+	vmlal.u32	q7,d23,d1
+
+	vmlal.u32	q8,d23,d3
+	vld1.32		d8[1],[r7,:32]
+	vmlal.u32	q5,d27,d4
+	vld1.32		d8[0],[r6,:32]
+	vmlal.u32	q9,d25,d3
+	vmlal.u32	q6,d29,d4
+	vmlal.u32	q7,d21,d3
+
+	vmlal.u32	q8,d21,d5
+	 it		ne
+	 addne		r7,r0,#(48+2*9*4)
+	vmlal.u32	q5,d25,d6
+	 it		ne
+	 addne		r6,r0,#(48+3*9*4)
+	vmlal.u32	q9,d23,d5
+	vmlal.u32	q6,d27,d6
+	vmlal.u32	q7,d29,d6
+
+	vmlal.u32	q8,d29,d8
+	 vorn		q0,q0,q0	@ all-ones, can be redundant
+	vmlal.u32	q5,d23,d8
+	 vshr.u64	q0,q0,#38
+	vmlal.u32	q9,d21,d7
+	vmlal.u32	q6,d25,d8
+	vmlal.u32	q7,d27,d8
+
+	beq		.Lshort_tail
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ (hash+inp[0:1])*r^4:r^3 and accumulate
+
+	vld4.32		{d0[1],d1[1],d2[1],d3[1]},[r7]!	@ load r^3
+	vld4.32		{d0[0],d1[0],d2[0],d3[0]},[r6]!	@ load r^4
+
+	vmlal.u32	q7,d24,d0
+	vmlal.u32	q5,d20,d0
+	vmlal.u32	q8,d26,d0
+	vmlal.u32	q6,d22,d0
+	vmlal.u32	q9,d28,d0
+
+	vmlal.u32	q5,d28,d2
+	vld4.32		{d4[1],d5[1],d6[1],d7[1]},[r7]!
+	vmlal.u32	q8,d24,d1
+	vld4.32		{d4[0],d5[0],d6[0],d7[0]},[r6]!
+	vmlal.u32	q6,d20,d1
+	vmlal.u32	q9,d26,d1
+	vmlal.u32	q7,d22,d1
+
+	vmlal.u32	q8,d22,d3
+	vld1.32		d8[1],[r7,:32]
+	vmlal.u32	q5,d26,d4
+	vld1.32		d8[0],[r6,:32]
+	vmlal.u32	q9,d24,d3
+	vmlal.u32	q6,d28,d4
+	vmlal.u32	q7,d20,d3
+
+	vmlal.u32	q8,d20,d5
+	vmlal.u32	q5,d24,d6
+	vmlal.u32	q9,d22,d5
+	vmlal.u32	q6,d26,d6
+	vmlal.u32	q7,d28,d6
+
+	vmlal.u32	q8,d28,d8
+	 vorn		q0,q0,q0	@ all-ones
+	vmlal.u32	q5,d22,d8
+	 vshr.u64	q0,q0,#38
+	vmlal.u32	q9,d20,d7
+	vmlal.u32	q6,d24,d8
+	vmlal.u32	q7,d26,d8
+
+.Lshort_tail:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ horizontal addition
+
+	vadd.i64	d16,d16,d17
+	vadd.i64	d10,d10,d11
+	vadd.i64	d18,d18,d19
+	vadd.i64	d12,d12,d13
+	vadd.i64	d14,d14,d15
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ lazy reduction, but without narrowing
+
+	vshr.u64	q15,q8,#26
+	vand.i64	q8,q8,q0
+	 vshr.u64	q4,q5,#26
+	 vand.i64	q5,q5,q0
+	vadd.i64	q9,q9,q15		@ h3 -> h4
+	 vadd.i64	q6,q6,q4		@ h0 -> h1
+
+	vshr.u64	q15,q9,#26
+	vand.i64	q9,q9,q0
+	 vshr.u64	q4,q6,#26
+	 vand.i64	q6,q6,q0
+	 vadd.i64	q7,q7,q4		@ h1 -> h2
+
+	vadd.i64	q5,q5,q15
+	vshl.u64	q15,q15,#2
+	 vshr.u64	q4,q7,#26
+	 vand.i64	q7,q7,q0
+	vadd.i64	q5,q5,q15		@ h4 -> h0
+	 vadd.i64	q8,q8,q4		@ h2 -> h3
+
+	vshr.u64	q15,q5,#26
+	vand.i64	q5,q5,q0
+	 vshr.u64	q4,q8,#26
+	 vand.i64	q8,q8,q0
+	vadd.i64	q6,q6,q15		@ h0 -> h1
+	 vadd.i64	q9,q9,q4		@ h3 -> h4
+
+	cmp		r2,#0
+	bne		.Leven
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ store hash value
+
+	vst4.32		{d10[0],d12[0],d14[0],d16[0]},[r0]!
+	vst1.32		{d18[0]},[r0]
+
+	vldmia	sp!,{d8-d15}			@ epilogue
+	ldmia	sp!,{r4-r7}
+.Lno_data_neon:
+	bx	lr					@ bx	lr
+ENDPROC(poly1305_blocks_neon)
+
+.align	5
+ENTRY(poly1305_emit_neon)
+	ldr	ip,[r0,#36]		@ is_base2_26
+
+	stmdb	sp!,{r4-r11}
+
+	tst	ip,ip
+	beq	.Lpoly1305_emit_enter
+
+	ldmia	r0,{r3-r7}
+	eor	r8,r8,r8
+
+	adds	r3,r3,r4,lsl#26	@ base 2^26 -> base 2^32
+	mov	r4,r4,lsr#6
+	adcs	r4,r4,r5,lsl#20
+	mov	r5,r5,lsr#12
+	adcs	r5,r5,r6,lsl#14
+	mov	r6,r6,lsr#18
+	adcs	r6,r6,r7,lsl#8
+	adc	r7,r8,r7,lsr#24	@ can be partially reduced ...
+
+	and	r8,r7,#-4		@ ... so reduce
+	and	r7,r6,#3
+	add	r8,r8,r8,lsr#2	@ *= 5
+	adds	r3,r3,r8
+	adcs	r4,r4,#0
+	adcs	r5,r5,#0
+	adcs	r6,r6,#0
+	adc	r7,r7,#0
+
+	adds	r8,r3,#5		@ compare to modulus
+	adcs	r9,r4,#0
+	adcs	r10,r5,#0
+	adcs	r11,r6,#0
+	adc	r7,r7,#0
+	tst	r7,#4			@ did it carry/borrow?
+
+	it	ne
+	movne	r3,r8
+	ldr	r8,[r2,#0]
+	it	ne
+	movne	r4,r9
+	ldr	r9,[r2,#4]
+	it	ne
+	movne	r5,r10
+	ldr	r10,[r2,#8]
+	it	ne
+	movne	r6,r11
+	ldr	r11,[r2,#12]
+
+	adds	r3,r3,r8		@ accumulate nonce
+	adcs	r4,r4,r9
+	adcs	r5,r5,r10
+	adc	r6,r6,r11
+
+#ifdef __ARMEB__
+	rev	r3,r3
+	rev	r4,r4
+	rev	r5,r5
+	rev	r6,r6
+#endif
+	str	r3,[r1,#0]		@ store the result
+	str	r4,[r1,#4]
+	str	r5,[r1,#8]
+	str	r6,[r1,#12]
+
+	ldmia	sp!,{r4-r11}
+	bx	lr				@ bx	lr
+ENDPROC(poly1305_emit_neon)
+
+.align	5
+.Lzeros:
+.long	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
+#endif
diff --git a/lib/zinc/poly1305/poly1305-arm64.S b/lib/zinc/poly1305/poly1305-arm64.S
new file mode 100644
index 000000000000..911b57edeb20
--- /dev/null
+++ b/lib/zinc/poly1305/poly1305-arm64.S
@@ -0,0 +1,820 @@
+/* SPDX-License-Identifier: OpenSSL OR (BSD-3-Clause OR GPL-2.0)
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ * Copyright 2016 The OpenSSL Project Authors. All Rights Reserved.
+ */
+
+#include <linux/linkage.h>
+.text
+
+.align	5
+ENTRY(poly1305_init_arm)
+	cmp	x1,xzr
+	stp	xzr,xzr,[x0]		// zero hash value
+	stp	xzr,xzr,[x0,#16]	// [along with is_base2_26]
+
+	csel	x0,xzr,x0,eq
+	b.eq	.Lno_key
+
+	ldp	x7,x8,[x1]		// load key
+	mov	x9,#0xfffffffc0fffffff
+	movk	x9,#0x0fff,lsl#48
+#ifdef	__ARMEB__
+	rev	x7,x7			// flip bytes
+	rev	x8,x8
+#endif
+	and	x7,x7,x9		// &=0ffffffc0fffffff
+	and	x9,x9,#-4
+	and	x8,x8,x9		// &=0ffffffc0ffffffc
+	stp	x7,x8,[x0,#32]	// save key value
+
+.Lno_key:
+	ret
+ENDPROC(poly1305_init_arm)
+
+.align	5
+ENTRY(poly1305_blocks_arm)
+	ands	x2,x2,#-16
+	b.eq	.Lno_data
+
+	ldp	x4,x5,[x0]		// load hash value
+	ldp	x7,x8,[x0,#32]	// load key value
+	ldr	x6,[x0,#16]
+	add	x9,x8,x8,lsr#2	// s1 = r1 + (r1 >> 2)
+	b	.Loop
+
+.align	5
+.Loop:
+	ldp	x10,x11,[x1],#16	// load input
+	sub	x2,x2,#16
+#ifdef	__ARMEB__
+	rev	x10,x10
+	rev	x11,x11
+#endif
+	adds	x4,x4,x10		// accumulate input
+	adcs	x5,x5,x11
+
+	mul	x12,x4,x7		// h0*r0
+	adc	x6,x6,x3
+	umulh	x13,x4,x7
+
+	mul	x10,x5,x9		// h1*5*r1
+	umulh	x11,x5,x9
+
+	adds	x12,x12,x10
+	mul	x10,x4,x8		// h0*r1
+	adc	x13,x13,x11
+	umulh	x14,x4,x8
+
+	adds	x13,x13,x10
+	mul	x10,x5,x7		// h1*r0
+	adc	x14,x14,xzr
+	umulh	x11,x5,x7
+
+	adds	x13,x13,x10
+	mul	x10,x6,x9		// h2*5*r1
+	adc	x14,x14,x11
+	mul	x11,x6,x7		// h2*r0
+
+	adds	x13,x13,x10
+	adc	x14,x14,x11
+
+	and	x10,x14,#-4		// final reduction
+	and	x6,x14,#3
+	add	x10,x10,x14,lsr#2
+	adds	x4,x12,x10
+	adcs	x5,x13,xzr
+	adc	x6,x6,xzr
+
+	cbnz	x2,.Loop
+
+	stp	x4,x5,[x0]		// store hash value
+	str	x6,[x0,#16]
+
+.Lno_data:
+	ret
+ENDPROC(poly1305_blocks_arm)
+
+.align	5
+ENTRY(poly1305_emit_arm)
+	ldp	x4,x5,[x0]		// load hash base 2^64
+	ldr	x6,[x0,#16]
+	ldp	x10,x11,[x2]	// load nonce
+
+	adds	x12,x4,#5		// compare to modulus
+	adcs	x13,x5,xzr
+	adc	x14,x6,xzr
+
+	tst	x14,#-4			// see if it's carried/borrowed
+
+	csel	x4,x4,x12,eq
+	csel	x5,x5,x13,eq
+
+#ifdef	__ARMEB__
+	ror	x10,x10,#32		// flip nonce words
+	ror	x11,x11,#32
+#endif
+	adds	x4,x4,x10		// accumulate nonce
+	adc	x5,x5,x11
+#ifdef	__ARMEB__
+	rev	x4,x4			// flip output bytes
+	rev	x5,x5
+#endif
+	stp	x4,x5,[x1]		// write result
+
+	ret
+ENDPROC(poly1305_emit_arm)
+
+.align	5
+__poly1305_mult:
+	mul	x12,x4,x7		// h0*r0
+	umulh	x13,x4,x7
+
+	mul	x10,x5,x9		// h1*5*r1
+	umulh	x11,x5,x9
+
+	adds	x12,x12,x10
+	mul	x10,x4,x8		// h0*r1
+	adc	x13,x13,x11
+	umulh	x14,x4,x8
+
+	adds	x13,x13,x10
+	mul	x10,x5,x7		// h1*r0
+	adc	x14,x14,xzr
+	umulh	x11,x5,x7
+
+	adds	x13,x13,x10
+	mul	x10,x6,x9		// h2*5*r1
+	adc	x14,x14,x11
+	mul	x11,x6,x7		// h2*r0
+
+	adds	x13,x13,x10
+	adc	x14,x14,x11
+
+	and	x10,x14,#-4		// final reduction
+	and	x6,x14,#3
+	add	x10,x10,x14,lsr#2
+	adds	x4,x12,x10
+	adcs	x5,x13,xzr
+	adc	x6,x6,xzr
+
+	ret
+
+__poly1305_splat:
+	and	x12,x4,#0x03ffffff	// base 2^64 -> base 2^26
+	ubfx	x13,x4,#26,#26
+	extr	x14,x5,x4,#52
+	and	x14,x14,#0x03ffffff
+	ubfx	x15,x5,#14,#26
+	extr	x16,x6,x5,#40
+
+	str	w12,[x0,#16*0]	// r0
+	add	w12,w13,w13,lsl#2	// r1*5
+	str	w13,[x0,#16*1]	// r1
+	add	w13,w14,w14,lsl#2	// r2*5
+	str	w12,[x0,#16*2]	// s1
+	str	w14,[x0,#16*3]	// r2
+	add	w14,w15,w15,lsl#2	// r3*5
+	str	w13,[x0,#16*4]	// s2
+	str	w15,[x0,#16*5]	// r3
+	add	w15,w16,w16,lsl#2	// r4*5
+	str	w14,[x0,#16*6]	// s3
+	str	w16,[x0,#16*7]	// r4
+	str	w15,[x0,#16*8]	// s4
+
+	ret
+
+.align	5
+ENTRY(poly1305_blocks_neon)
+	ldr	x17,[x0,#24]
+	cmp	x2,#128
+	b.hs	.Lblocks_neon
+	cbz	x17,poly1305_blocks_arm
+
+.Lblocks_neon:
+	stp	x29,x30,[sp,#-80]!
+	add	x29,sp,#0
+
+	ands	x2,x2,#-16
+	b.eq	.Lno_data_neon
+
+	cbz	x17,.Lbase2_64_neon
+
+	ldp	w10,w11,[x0]		// load hash value base 2^26
+	ldp	w12,w13,[x0,#8]
+	ldr	w14,[x0,#16]
+
+	tst	x2,#31
+	b.eq	.Leven_neon
+
+	ldp	x7,x8,[x0,#32]	// load key value
+
+	add	x4,x10,x11,lsl#26	// base 2^26 -> base 2^64
+	lsr	x5,x12,#12
+	adds	x4,x4,x12,lsl#52
+	add	x5,x5,x13,lsl#14
+	adc	x5,x5,xzr
+	lsr	x6,x14,#24
+	adds	x5,x5,x14,lsl#40
+	adc	x14,x6,xzr		// can be partially reduced...
+
+	ldp	x12,x13,[x1],#16	// load input
+	sub	x2,x2,#16
+	add	x9,x8,x8,lsr#2	// s1 = r1 + (r1 >> 2)
+
+	and	x10,x14,#-4		// ... so reduce
+	and	x6,x14,#3
+	add	x10,x10,x14,lsr#2
+	adds	x4,x4,x10
+	adcs	x5,x5,xzr
+	adc	x6,x6,xzr
+
+#ifdef	__ARMEB__
+	rev	x12,x12
+	rev	x13,x13
+#endif
+	adds	x4,x4,x12		// accumulate input
+	adcs	x5,x5,x13
+	adc	x6,x6,x3
+
+	bl	__poly1305_mult
+	ldr	x30,[sp,#8]
+
+	cbz	x3,.Lstore_base2_64_neon
+
+	and	x10,x4,#0x03ffffff	// base 2^64 -> base 2^26
+	ubfx	x11,x4,#26,#26
+	extr	x12,x5,x4,#52
+	and	x12,x12,#0x03ffffff
+	ubfx	x13,x5,#14,#26
+	extr	x14,x6,x5,#40
+
+	cbnz	x2,.Leven_neon
+
+	stp	w10,w11,[x0]		// store hash value base 2^26
+	stp	w12,w13,[x0,#8]
+	str	w14,[x0,#16]
+	b	.Lno_data_neon
+
+.align	4
+.Lstore_base2_64_neon:
+	stp	x4,x5,[x0]		// store hash value base 2^64
+	stp	x6,xzr,[x0,#16]	// note that is_base2_26 is zeroed
+	b	.Lno_data_neon
+
+.align	4
+.Lbase2_64_neon:
+	ldp	x7,x8,[x0,#32]	// load key value
+
+	ldp	x4,x5,[x0]		// load hash value base 2^64
+	ldr	x6,[x0,#16]
+
+	tst	x2,#31
+	b.eq	.Linit_neon
+
+	ldp	x12,x13,[x1],#16	// load input
+	sub	x2,x2,#16
+	add	x9,x8,x8,lsr#2	// s1 = r1 + (r1 >> 2)
+#ifdef	__ARMEB__
+	rev	x12,x12
+	rev	x13,x13
+#endif
+	adds	x4,x4,x12		// accumulate input
+	adcs	x5,x5,x13
+	adc	x6,x6,x3
+
+	bl	__poly1305_mult
+
+.Linit_neon:
+	and	x10,x4,#0x03ffffff	// base 2^64 -> base 2^26
+	ubfx	x11,x4,#26,#26
+	extr	x12,x5,x4,#52
+	and	x12,x12,#0x03ffffff
+	ubfx	x13,x5,#14,#26
+	extr	x14,x6,x5,#40
+
+	stp	d8,d9,[sp,#16]		// meet ABI requirements
+	stp	d10,d11,[sp,#32]
+	stp	d12,d13,[sp,#48]
+	stp	d14,d15,[sp,#64]
+
+	fmov	d24,x10
+	fmov	d25,x11
+	fmov	d26,x12
+	fmov	d27,x13
+	fmov	d28,x14
+
+	////////////////////////////////// initialize r^n table
+	mov	x4,x7			// r^1
+	add	x9,x8,x8,lsr#2	// s1 = r1 + (r1 >> 2)
+	mov	x5,x8
+	mov	x6,xzr
+	add	x0,x0,#48+12
+	bl	__poly1305_splat
+
+	bl	__poly1305_mult		// r^2
+	sub	x0,x0,#4
+	bl	__poly1305_splat
+
+	bl	__poly1305_mult		// r^3
+	sub	x0,x0,#4
+	bl	__poly1305_splat
+
+	bl	__poly1305_mult		// r^4
+	sub	x0,x0,#4
+	bl	__poly1305_splat
+	ldr	x30,[sp,#8]
+
+	add	x16,x1,#32
+	adr	x17,.Lzeros
+	subs	x2,x2,#64
+	csel	x16,x17,x16,lo
+
+	mov	x4,#1
+	str	x4,[x0,#-24]		// set is_base2_26
+	sub	x0,x0,#48		// restore original x0
+	b	.Ldo_neon
+
+.align	4
+.Leven_neon:
+	add	x16,x1,#32
+	adr	x17,.Lzeros
+	subs	x2,x2,#64
+	csel	x16,x17,x16,lo
+
+	stp	d8,d9,[sp,#16]		// meet ABI requirements
+	stp	d10,d11,[sp,#32]
+	stp	d12,d13,[sp,#48]
+	stp	d14,d15,[sp,#64]
+
+	fmov	d24,x10
+	fmov	d25,x11
+	fmov	d26,x12
+	fmov	d27,x13
+	fmov	d28,x14
+
+.Ldo_neon:
+	ldp	x8,x12,[x16],#16	// inp[2:3] (or zero)
+	ldp	x9,x13,[x16],#48
+
+	lsl	x3,x3,#24
+	add	x15,x0,#48
+
+#ifdef	__ARMEB__
+	rev	x8,x8
+	rev	x12,x12
+	rev	x9,x9
+	rev	x13,x13
+#endif
+	and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	and	x5,x9,#0x03ffffff
+	ubfx	x6,x8,#26,#26
+	ubfx	x7,x9,#26,#26
+	add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+	extr	x8,x12,x8,#52
+	extr	x9,x13,x9,#52
+	add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	fmov	d14,x4
+	and	x8,x8,#0x03ffffff
+	and	x9,x9,#0x03ffffff
+	ubfx	x10,x12,#14,#26
+	ubfx	x11,x13,#14,#26
+	add	x12,x3,x12,lsr#40
+	add	x13,x3,x13,lsr#40
+	add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	fmov	d15,x6
+	add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	fmov	d16,x8
+	fmov	d17,x10
+	fmov	d18,x12
+
+	ldp	x8,x12,[x1],#16	// inp[0:1]
+	ldp	x9,x13,[x1],#48
+
+	ld1	{v0.4s,v1.4s,v2.4s,v3.4s},[x15],#64
+	ld1	{v4.4s,v5.4s,v6.4s,v7.4s},[x15],#64
+	ld1	{v8.4s},[x15]
+
+#ifdef	__ARMEB__
+	rev	x8,x8
+	rev	x12,x12
+	rev	x9,x9
+	rev	x13,x13
+#endif
+	and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	and	x5,x9,#0x03ffffff
+	ubfx	x6,x8,#26,#26
+	ubfx	x7,x9,#26,#26
+	add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+	extr	x8,x12,x8,#52
+	extr	x9,x13,x9,#52
+	add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	fmov	d9,x4
+	and	x8,x8,#0x03ffffff
+	and	x9,x9,#0x03ffffff
+	ubfx	x10,x12,#14,#26
+	ubfx	x11,x13,#14,#26
+	add	x12,x3,x12,lsr#40
+	add	x13,x3,x13,lsr#40
+	add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	fmov	d10,x6
+	add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	movi	v31.2d,#-1
+	fmov	d11,x8
+	fmov	d12,x10
+	fmov	d13,x12
+	ushr	v31.2d,v31.2d,#38
+
+	b.ls	.Lskip_loop
+
+.align	4
+.Loop_neon:
+	////////////////////////////////////////////////////////////////
+	// ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+	// ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
+	//   ___________________/
+	// ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2
+	// ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r
+	//   ___________________/ ____________________/
+	//
+	// Note that we start with inp[2:3]*r^2. This is because it
+	// doesn't depend on reduction in previous iteration.
+	////////////////////////////////////////////////////////////////
+	// d4 = h0*r4 + h1*r3   + h2*r2   + h3*r1   + h4*r0
+	// d3 = h0*r3 + h1*r2   + h2*r1   + h3*r0   + h4*5*r4
+	// d2 = h0*r2 + h1*r1   + h2*r0   + h3*5*r4 + h4*5*r3
+	// d1 = h0*r1 + h1*r0   + h2*5*r4 + h3*5*r3 + h4*5*r2
+	// d0 = h0*r0 + h1*5*r4 + h2*5*r3 + h3*5*r2 + h4*5*r1
+
+	subs	x2,x2,#64
+	umull	v23.2d,v14.2s,v7.s[2]
+	csel	x16,x17,x16,lo
+	umull	v22.2d,v14.2s,v5.s[2]
+	umull	v21.2d,v14.2s,v3.s[2]
+	ldp	x8,x12,[x16],#16	// inp[2:3] (or zero)
+	umull	v20.2d,v14.2s,v1.s[2]
+	ldp	x9,x13,[x16],#48
+	umull	v19.2d,v14.2s,v0.s[2]
+#ifdef	__ARMEB__
+	rev	x8,x8
+	rev	x12,x12
+	rev	x9,x9
+	rev	x13,x13
+#endif
+
+	umlal	v23.2d,v15.2s,v5.s[2]
+	and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	umlal	v22.2d,v15.2s,v3.s[2]
+	and	x5,x9,#0x03ffffff
+	umlal	v21.2d,v15.2s,v1.s[2]
+	ubfx	x6,x8,#26,#26
+	umlal	v20.2d,v15.2s,v0.s[2]
+	ubfx	x7,x9,#26,#26
+	umlal	v19.2d,v15.2s,v8.s[2]
+	add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+
+	umlal	v23.2d,v16.2s,v3.s[2]
+	extr	x8,x12,x8,#52
+	umlal	v22.2d,v16.2s,v1.s[2]
+	extr	x9,x13,x9,#52
+	umlal	v21.2d,v16.2s,v0.s[2]
+	add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	umlal	v20.2d,v16.2s,v8.s[2]
+	fmov	d14,x4
+	umlal	v19.2d,v16.2s,v6.s[2]
+	and	x8,x8,#0x03ffffff
+
+	umlal	v23.2d,v17.2s,v1.s[2]
+	and	x9,x9,#0x03ffffff
+	umlal	v22.2d,v17.2s,v0.s[2]
+	ubfx	x10,x12,#14,#26
+	umlal	v21.2d,v17.2s,v8.s[2]
+	ubfx	x11,x13,#14,#26
+	umlal	v20.2d,v17.2s,v6.s[2]
+	add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	umlal	v19.2d,v17.2s,v4.s[2]
+	fmov	d15,x6
+
+	add	v11.2s,v11.2s,v26.2s
+	add	x12,x3,x12,lsr#40
+	umlal	v23.2d,v18.2s,v0.s[2]
+	add	x13,x3,x13,lsr#40
+	umlal	v22.2d,v18.2s,v8.s[2]
+	add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	umlal	v21.2d,v18.2s,v6.s[2]
+	add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	umlal	v20.2d,v18.2s,v4.s[2]
+	fmov	d16,x8
+	umlal	v19.2d,v18.2s,v2.s[2]
+	fmov	d17,x10
+
+	////////////////////////////////////////////////////////////////
+	// (hash+inp[0:1])*r^4 and accumulate
+
+	add	v9.2s,v9.2s,v24.2s
+	fmov	d18,x12
+	umlal	v22.2d,v11.2s,v1.s[0]
+	ldp	x8,x12,[x1],#16	// inp[0:1]
+	umlal	v19.2d,v11.2s,v6.s[0]
+	ldp	x9,x13,[x1],#48
+	umlal	v23.2d,v11.2s,v3.s[0]
+	umlal	v20.2d,v11.2s,v8.s[0]
+	umlal	v21.2d,v11.2s,v0.s[0]
+#ifdef	__ARMEB__
+	rev	x8,x8
+	rev	x12,x12
+	rev	x9,x9
+	rev	x13,x13
+#endif
+
+	add	v10.2s,v10.2s,v25.2s
+	umlal	v22.2d,v9.2s,v5.s[0]
+	umlal	v23.2d,v9.2s,v7.s[0]
+	and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	umlal	v21.2d,v9.2s,v3.s[0]
+	and	x5,x9,#0x03ffffff
+	umlal	v19.2d,v9.2s,v0.s[0]
+	ubfx	x6,x8,#26,#26
+	umlal	v20.2d,v9.2s,v1.s[0]
+	ubfx	x7,x9,#26,#26
+
+	add	v12.2s,v12.2s,v27.2s
+	add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+	umlal	v22.2d,v10.2s,v3.s[0]
+	extr	x8,x12,x8,#52
+	umlal	v23.2d,v10.2s,v5.s[0]
+	extr	x9,x13,x9,#52
+	umlal	v19.2d,v10.2s,v8.s[0]
+	add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	umlal	v21.2d,v10.2s,v1.s[0]
+	fmov	d9,x4
+	umlal	v20.2d,v10.2s,v0.s[0]
+	and	x8,x8,#0x03ffffff
+
+	add	v13.2s,v13.2s,v28.2s
+	and	x9,x9,#0x03ffffff
+	umlal	v22.2d,v12.2s,v0.s[0]
+	ubfx	x10,x12,#14,#26
+	umlal	v19.2d,v12.2s,v4.s[0]
+	ubfx	x11,x13,#14,#26
+	umlal	v23.2d,v12.2s,v1.s[0]
+	add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	umlal	v20.2d,v12.2s,v6.s[0]
+	fmov	d10,x6
+	umlal	v21.2d,v12.2s,v8.s[0]
+	add	x12,x3,x12,lsr#40
+
+	umlal	v22.2d,v13.2s,v8.s[0]
+	add	x13,x3,x13,lsr#40
+	umlal	v19.2d,v13.2s,v2.s[0]
+	add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	umlal	v23.2d,v13.2s,v0.s[0]
+	add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	umlal	v20.2d,v13.2s,v4.s[0]
+	fmov	d11,x8
+	umlal	v21.2d,v13.2s,v6.s[0]
+	fmov	d12,x10
+	fmov	d13,x12
+
+	/////////////////////////////////////////////////////////////////
+	// lazy reduction as discussed in "NEON crypto" by D.J. Bernstein
+	// and P. Schwabe
+	//
+	// [see discussion in poly1305-armv4 module]
+
+	ushr	v29.2d,v22.2d,#26
+	xtn	v27.2s,v22.2d
+	ushr	v30.2d,v19.2d,#26
+	and	v19.16b,v19.16b,v31.16b
+	add	v23.2d,v23.2d,v29.2d	// h3 -> h4
+	bic	v27.2s,#0xfc,lsl#24	// &=0x03ffffff
+	add	v20.2d,v20.2d,v30.2d	// h0 -> h1
+
+	ushr	v29.2d,v23.2d,#26
+	xtn	v28.2s,v23.2d
+	ushr	v30.2d,v20.2d,#26
+	xtn	v25.2s,v20.2d
+	bic	v28.2s,#0xfc,lsl#24
+	add	v21.2d,v21.2d,v30.2d	// h1 -> h2
+
+	add	v19.2d,v19.2d,v29.2d
+	shl	v29.2d,v29.2d,#2
+	shrn	v30.2s,v21.2d,#26
+	xtn	v26.2s,v21.2d
+	add	v19.2d,v19.2d,v29.2d	// h4 -> h0
+	bic	v25.2s,#0xfc,lsl#24
+	add	v27.2s,v27.2s,v30.2s		// h2 -> h3
+	bic	v26.2s,#0xfc,lsl#24
+
+	shrn	v29.2s,v19.2d,#26
+	xtn	v24.2s,v19.2d
+	ushr	v30.2s,v27.2s,#26
+	bic	v27.2s,#0xfc,lsl#24
+	bic	v24.2s,#0xfc,lsl#24
+	add	v25.2s,v25.2s,v29.2s		// h0 -> h1
+	add	v28.2s,v28.2s,v30.2s		// h3 -> h4
+
+	b.hi	.Loop_neon
+
+.Lskip_loop:
+	dup	v16.2d,v16.d[0]
+	add	v11.2s,v11.2s,v26.2s
+
+	////////////////////////////////////////////////////////////////
+	// multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1
+
+	adds	x2,x2,#32
+	b.ne	.Long_tail
+
+	dup	v16.2d,v11.d[0]
+	add	v14.2s,v9.2s,v24.2s
+	add	v17.2s,v12.2s,v27.2s
+	add	v15.2s,v10.2s,v25.2s
+	add	v18.2s,v13.2s,v28.2s
+
+.Long_tail:
+	dup	v14.2d,v14.d[0]
+	umull2	v19.2d,v16.4s,v6.4s
+	umull2	v22.2d,v16.4s,v1.4s
+	umull2	v23.2d,v16.4s,v3.4s
+	umull2	v21.2d,v16.4s,v0.4s
+	umull2	v20.2d,v16.4s,v8.4s
+
+	dup	v15.2d,v15.d[0]
+	umlal2	v19.2d,v14.4s,v0.4s
+	umlal2	v21.2d,v14.4s,v3.4s
+	umlal2	v22.2d,v14.4s,v5.4s
+	umlal2	v23.2d,v14.4s,v7.4s
+	umlal2	v20.2d,v14.4s,v1.4s
+
+	dup	v17.2d,v17.d[0]
+	umlal2	v19.2d,v15.4s,v8.4s
+	umlal2	v22.2d,v15.4s,v3.4s
+	umlal2	v21.2d,v15.4s,v1.4s
+	umlal2	v23.2d,v15.4s,v5.4s
+	umlal2	v20.2d,v15.4s,v0.4s
+
+	dup	v18.2d,v18.d[0]
+	umlal2	v22.2d,v17.4s,v0.4s
+	umlal2	v23.2d,v17.4s,v1.4s
+	umlal2	v19.2d,v17.4s,v4.4s
+	umlal2	v20.2d,v17.4s,v6.4s
+	umlal2	v21.2d,v17.4s,v8.4s
+
+	umlal2	v22.2d,v18.4s,v8.4s
+	umlal2	v19.2d,v18.4s,v2.4s
+	umlal2	v23.2d,v18.4s,v0.4s
+	umlal2	v20.2d,v18.4s,v4.4s
+	umlal2	v21.2d,v18.4s,v6.4s
+
+	b.eq	.Lshort_tail
+
+	////////////////////////////////////////////////////////////////
+	// (hash+inp[0:1])*r^4:r^3 and accumulate
+
+	add	v9.2s,v9.2s,v24.2s
+	umlal	v22.2d,v11.2s,v1.2s
+	umlal	v19.2d,v11.2s,v6.2s
+	umlal	v23.2d,v11.2s,v3.2s
+	umlal	v20.2d,v11.2s,v8.2s
+	umlal	v21.2d,v11.2s,v0.2s
+
+	add	v10.2s,v10.2s,v25.2s
+	umlal	v22.2d,v9.2s,v5.2s
+	umlal	v19.2d,v9.2s,v0.2s
+	umlal	v23.2d,v9.2s,v7.2s
+	umlal	v20.2d,v9.2s,v1.2s
+	umlal	v21.2d,v9.2s,v3.2s
+
+	add	v12.2s,v12.2s,v27.2s
+	umlal	v22.2d,v10.2s,v3.2s
+	umlal	v19.2d,v10.2s,v8.2s
+	umlal	v23.2d,v10.2s,v5.2s
+	umlal	v20.2d,v10.2s,v0.2s
+	umlal	v21.2d,v10.2s,v1.2s
+
+	add	v13.2s,v13.2s,v28.2s
+	umlal	v22.2d,v12.2s,v0.2s
+	umlal	v19.2d,v12.2s,v4.2s
+	umlal	v23.2d,v12.2s,v1.2s
+	umlal	v20.2d,v12.2s,v6.2s
+	umlal	v21.2d,v12.2s,v8.2s
+
+	umlal	v22.2d,v13.2s,v8.2s
+	umlal	v19.2d,v13.2s,v2.2s
+	umlal	v23.2d,v13.2s,v0.2s
+	umlal	v20.2d,v13.2s,v4.2s
+	umlal	v21.2d,v13.2s,v6.2s
+
+.Lshort_tail:
+	////////////////////////////////////////////////////////////////
+	// horizontal add
+
+	addp	v22.2d,v22.2d,v22.2d
+	ldp	d8,d9,[sp,#16]		// meet ABI requirements
+	addp	v19.2d,v19.2d,v19.2d
+	ldp	d10,d11,[sp,#32]
+	addp	v23.2d,v23.2d,v23.2d
+	ldp	d12,d13,[sp,#48]
+	addp	v20.2d,v20.2d,v20.2d
+	ldp	d14,d15,[sp,#64]
+	addp	v21.2d,v21.2d,v21.2d
+
+	////////////////////////////////////////////////////////////////
+	// lazy reduction, but without narrowing
+
+	ushr	v29.2d,v22.2d,#26
+	and	v22.16b,v22.16b,v31.16b
+	ushr	v30.2d,v19.2d,#26
+	and	v19.16b,v19.16b,v31.16b
+
+	add	v23.2d,v23.2d,v29.2d	// h3 -> h4
+	add	v20.2d,v20.2d,v30.2d	// h0 -> h1
+
+	ushr	v29.2d,v23.2d,#26
+	and	v23.16b,v23.16b,v31.16b
+	ushr	v30.2d,v20.2d,#26
+	and	v20.16b,v20.16b,v31.16b
+	add	v21.2d,v21.2d,v30.2d	// h1 -> h2
+
+	add	v19.2d,v19.2d,v29.2d
+	shl	v29.2d,v29.2d,#2
+	ushr	v30.2d,v21.2d,#26
+	and	v21.16b,v21.16b,v31.16b
+	add	v19.2d,v19.2d,v29.2d	// h4 -> h0
+	add	v22.2d,v22.2d,v30.2d	// h2 -> h3
+
+	ushr	v29.2d,v19.2d,#26
+	and	v19.16b,v19.16b,v31.16b
+	ushr	v30.2d,v22.2d,#26
+	and	v22.16b,v22.16b,v31.16b
+	add	v20.2d,v20.2d,v29.2d	// h0 -> h1
+	add	v23.2d,v23.2d,v30.2d	// h3 -> h4
+
+	////////////////////////////////////////////////////////////////
+	// write the result, can be partially reduced
+
+	st4	{v19.s,v20.s,v21.s,v22.s}[0],[x0],#16
+	st1	{v23.s}[0],[x0]
+
+.Lno_data_neon:
+	ldr	x29,[sp],#80
+	ret
+ENDPROC(poly1305_blocks_neon)
+
+.align	5
+ENTRY(poly1305_emit_neon)
+	ldr	x17,[x0,#24]
+	cbz	x17,poly1305_emit_arm
+
+	ldp	w10,w11,[x0]		// load hash value base 2^26
+	ldp	w12,w13,[x0,#8]
+	ldr	w14,[x0,#16]
+
+	add	x4,x10,x11,lsl#26	// base 2^26 -> base 2^64
+	lsr	x5,x12,#12
+	adds	x4,x4,x12,lsl#52
+	add	x5,x5,x13,lsl#14
+	adc	x5,x5,xzr
+	lsr	x6,x14,#24
+	adds	x5,x5,x14,lsl#40
+	adc	x6,x6,xzr		// can be partially reduced...
+
+	ldp	x10,x11,[x2]	// load nonce
+
+	and	x12,x6,#-4		// ... so reduce
+	add	x12,x12,x6,lsr#2
+	and	x6,x6,#3
+	adds	x4,x4,x12
+	adcs	x5,x5,xzr
+	adc	x6,x6,xzr
+
+	adds	x12,x4,#5		// compare to modulus
+	adcs	x13,x5,xzr
+	adc	x14,x6,xzr
+
+	tst	x14,#-4			// see if it's carried/borrowed
+
+	csel	x4,x4,x12,eq
+	csel	x5,x5,x13,eq
+
+#ifdef	__ARMEB__
+	ror	x10,x10,#32		// flip nonce words
+	ror	x11,x11,#32
+#endif
+	adds	x4,x4,x10		// accumulate nonce
+	adc	x5,x5,x11
+#ifdef	__ARMEB__
+	rev	x4,x4			// flip output bytes
+	rev	x5,x5
+#endif
+	stp	x4,x5,[x1]		// write result
+
+	ret
+ENDPROC(poly1305_emit_neon)
+
+.align	5
+.Lzeros:
+.long	0,0,0,0,0,0,0,0
-- 
2.18.0

^ permalink raw reply related

* [PATCH v2 03/17] zinc: ChaCha20 generic C implementation
From: Jason A. Donenfeld @ 2018-08-24 21:38 UTC (permalink / raw)
  To: linux-kernel, netdev, davem
  Cc: Jason A. Donenfeld, Andy Lutomirski, Greg KH, Samuel Neves,
	Jean-Philippe Aumasson, linux-crypto
In-Reply-To: <20180824213849.23647-1-Jason@zx2c4.com>

This implements the ChaCha20 permutation as a single C statement, by way
of the comma operator, which the compiler is able to simplify
terrifically.

Information: https://cr.yp.to/chacha.html

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: linux-crypto@vger.kernel.org
---
 include/zinc/chacha20.h      |  47 ++++++++++++
 lib/zinc/Kconfig             |   4 +
 lib/zinc/Makefile            |   4 +
 lib/zinc/chacha20/chacha20.c | 137 +++++++++++++++++++++++++++++++++++
 lib/zinc/main.c              |   5 ++
 5 files changed, 197 insertions(+)
 create mode 100644 include/zinc/chacha20.h
 create mode 100644 lib/zinc/chacha20/chacha20.c

diff --git a/include/zinc/chacha20.h b/include/zinc/chacha20.h
new file mode 100644
index 000000000000..ba50a2424ad0
--- /dev/null
+++ b/include/zinc/chacha20.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef _ZINC_CHACHA20_H
+#define _ZINC_CHACHA20_H
+
+#include <linux/simd.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+enum {
+	CHACHA20_IV_SIZE = 16,
+	CHACHA20_KEY_SIZE = 32,
+	CHACHA20_BLOCK_SIZE = 64,
+	HCHACHA20_KEY_SIZE = 32,
+	HCHACHA20_NONCE_SIZE = 16
+};
+
+struct chacha20_ctx {
+	u32 key[8];
+	u32 counter[4];
+} __aligned(32);
+
+void chacha20_fpu_init(void);
+
+static inline void chacha20_init(struct chacha20_ctx *state, const u8 key[CHACHA20_KEY_SIZE], const u64 nonce)
+{
+	__le32 *le_key = (__le32 *)key;
+	state->key[0] = le32_to_cpu(le_key[0]);
+	state->key[1] = le32_to_cpu(le_key[1]);
+	state->key[2] = le32_to_cpu(le_key[2]);
+	state->key[3] = le32_to_cpu(le_key[3]);
+	state->key[4] = le32_to_cpu(le_key[4]);
+	state->key[5] = le32_to_cpu(le_key[5]);
+	state->key[6] = le32_to_cpu(le_key[6]);
+	state->key[7] = le32_to_cpu(le_key[7]);
+	state->counter[0] = state->counter[1] = 0;
+	state->counter[2] = nonce & U32_MAX;
+	state->counter[3] = nonce >> 32;
+}
+void chacha20(struct chacha20_ctx *state, u8 *dst, const u8 *src, u32 len, simd_context_t simd_context);
+
+void hchacha20(u8 derived_key[CHACHA20_KEY_SIZE], const u8 nonce[HCHACHA20_NONCE_SIZE], const u8 key[HCHACHA20_KEY_SIZE], simd_context_t simd_context);
+
+#endif /* _ZINC_CHACHA20_H */
diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig
index aa4f8d449d6b..5311a0d6ba8b 100644
--- a/lib/zinc/Kconfig
+++ b/lib/zinc/Kconfig
@@ -6,6 +6,10 @@ config ZINC
 	select NEON
 	select KERNEL_MODE_NEON
 
+config ZINC_CHACHA20
+	bool
+	select ZINC
+
 config ZINC_DEBUG
 	bool "Zinc cryptography library debugging and self-tests"
 	depends on ZINC
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 8e30115217db..6ec77feb4ab7 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -2,6 +2,10 @@ ccflags-y := -O3
 ccflags-y += -Wframe-larger-than=8192
 ccflags-y += -D'pr_fmt(fmt)=KBUILD_MODNAME ": " fmt'
 
+ifeq ($(CONFIG_ZINC_CHACHA20),y)
+zinc-y += chacha20/chacha20.o
+endif
+
 zinc-y += main.o
 
 obj-$(CONFIG_ZINC) := zinc.o
diff --git a/lib/zinc/chacha20/chacha20.c b/lib/zinc/chacha20/chacha20.c
new file mode 100644
index 000000000000..29c487375ccd
--- /dev/null
+++ b/lib/zinc/chacha20/chacha20.c
@@ -0,0 +1,137 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <zinc/chacha20.h>
+
+#include <linux/kernel.h>
+#include <crypto/algapi.h>
+
+#ifndef HAVE_CHACHA20_ARCH_IMPLEMENTATION
+void __init chacha20_fpu_init(void) { }
+static inline bool chacha20_arch(u8 *out, const u8 *in, const size_t len, const u32 key[8], const u32 counter[4], simd_context_t simd_context) { return false; }
+static inline bool hchacha20_arch(u8 *derived_key, const u8 *nonce, const u8 *key, simd_context_t simd_context) { return false; }
+#endif
+
+#define EXPAND_32_BYTE_K 0x61707865U, 0x3320646eU, 0x79622d32U, 0x6b206574U
+
+#define QUARTER_ROUND(x, a, b, c, d) ( \
+	x[a] += x[b], \
+	x[d] = rol32((x[d] ^ x[a]), 16), \
+	x[c] += x[d], \
+	x[b] = rol32((x[b] ^ x[c]), 12), \
+	x[a] += x[b], \
+	x[d] = rol32((x[d] ^ x[a]), 8), \
+	x[c] += x[d], \
+	x[b] = rol32((x[b] ^ x[c]), 7) \
+)
+
+#define C(i, j) (i * 4 + j)
+
+#define DOUBLE_ROUND(x) ( \
+	/* Column Round */ \
+	QUARTER_ROUND(x, C(0, 0), C(1, 0), C(2, 0), C(3, 0)), \
+	QUARTER_ROUND(x, C(0, 1), C(1, 1), C(2, 1), C(3, 1)), \
+	QUARTER_ROUND(x, C(0, 2), C(1, 2), C(2, 2), C(3, 2)), \
+	QUARTER_ROUND(x, C(0, 3), C(1, 3), C(2, 3), C(3, 3)), \
+	/* Diagonal Round */ \
+	QUARTER_ROUND(x, C(0, 0), C(1, 1), C(2, 2), C(3, 3)), \
+	QUARTER_ROUND(x, C(0, 1), C(1, 2), C(2, 3), C(3, 0)), \
+	QUARTER_ROUND(x, C(0, 2), C(1, 3), C(2, 0), C(3, 1)), \
+	QUARTER_ROUND(x, C(0, 3), C(1, 0), C(2, 1), C(3, 2)) \
+)
+
+#define TWENTY_ROUNDS(x) ( \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x) \
+)
+
+static void chacha20_block_generic(__le32 *stream, u32 *state)
+{
+	u32 x[CHACHA20_BLOCK_SIZE / sizeof(u32)];
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(x); ++i)
+		x[i] = state[i];
+
+	TWENTY_ROUNDS(x);
+
+	for (i = 0; i < ARRAY_SIZE(x); ++i)
+		stream[i] = cpu_to_le32(x[i] + state[i]);
+
+	++state[12];
+}
+
+static void chacha20_generic(u8 *out, const u8 *in, u32 len, const u32 key[8], const u32 counter[4])
+{
+	__le32 buf[CHACHA20_BLOCK_SIZE / sizeof(__le32)];
+	u32 x[] = {
+		EXPAND_32_BYTE_K,
+		key[0], key[1], key[2], key[3],
+		key[4], key[5], key[6], key[7],
+		counter[0], counter[1], counter[2], counter[3]
+	};
+
+	if (out != in)
+		memmove(out, in, len);
+
+	while (len >= CHACHA20_BLOCK_SIZE) {
+		chacha20_block_generic(buf, x);
+		crypto_xor(out, (u8 *)buf, CHACHA20_BLOCK_SIZE);
+		len -= CHACHA20_BLOCK_SIZE;
+		out += CHACHA20_BLOCK_SIZE;
+	}
+	if (len) {
+		chacha20_block_generic(buf, x);
+		crypto_xor(out, (u8 *)buf, len);
+	}
+}
+
+void chacha20(struct chacha20_ctx *state, u8 *dst, const u8 *src, u32 len, simd_context_t simd_context)
+{
+	if (!chacha20_arch(dst, src, len, state->key, state->counter, simd_context))
+		chacha20_generic(dst, src, len, state->key, state->counter);
+	state->counter[0] += (len + 63) / 64;
+}
+EXPORT_SYMBOL(chacha20);
+
+static void hchacha20_generic(u8 derived_key[CHACHA20_KEY_SIZE], const u8 nonce[HCHACHA20_NONCE_SIZE], const u8 key[HCHACHA20_KEY_SIZE])
+{
+	__le32 *out = (__force __le32 *)derived_key;
+	u32 x[] = {
+		EXPAND_32_BYTE_K,
+		le32_to_cpup((__le32 *)(key + 0)), le32_to_cpup((__le32 *)(key + 4)), le32_to_cpup((__le32 *)(key + 8)), le32_to_cpup((__le32 *)(key + 12)),
+		le32_to_cpup((__le32 *)(key + 16)), le32_to_cpup((__le32 *)(key + 20)), le32_to_cpup((__le32 *)(key + 24)), le32_to_cpup((__le32 *)(key + 28)),
+		le32_to_cpup((__le32 *)(nonce + 0)), le32_to_cpup((__le32 *)(nonce + 4)), le32_to_cpup((__le32 *)(nonce +  8)), le32_to_cpup((__le32 *)(nonce + 12))
+	};
+
+	TWENTY_ROUNDS(x);
+
+	out[0] = cpu_to_le32(x[0]);
+	out[1] = cpu_to_le32(x[1]);
+	out[2] = cpu_to_le32(x[2]);
+	out[3] = cpu_to_le32(x[3]);
+	out[4] = cpu_to_le32(x[12]);
+	out[5] = cpu_to_le32(x[13]);
+	out[6] = cpu_to_le32(x[14]);
+	out[7] = cpu_to_le32(x[15]);
+}
+
+void hchacha20(u8 derived_key[CHACHA20_KEY_SIZE], const u8 nonce[HCHACHA20_NONCE_SIZE], const u8 key[HCHACHA20_KEY_SIZE], simd_context_t simd_context)
+{
+	if (!hchacha20_arch(derived_key, nonce, key, simd_context))
+		hchacha20_generic(derived_key, nonce, key);
+}
+/* Deliberately not EXPORT_SYMBOL'd, since there are few reasons why somebody
+ * should be using this directly, rather than via xchacha20. Revisit only in
+ * the unlikely event that somebody has a good reason to export this.
+ */
diff --git a/lib/zinc/main.c b/lib/zinc/main.c
index 590872563955..79b85045f22f 100644
--- a/lib/zinc/main.c
+++ b/lib/zinc/main.c
@@ -3,6 +3,8 @@
  * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
  */
 
+#include <zinc/chacha20.h>
+
 #include <linux/init.h>
 #include <linux/module.h>
 
@@ -17,6 +19,9 @@
 
 static int __init mod_init(void)
 {
+#ifdef CONFIG_ZINC_CHACHA20
+	chacha20_fpu_init();
+#endif
 	return 0;
 }
 
-- 
2.18.0

^ permalink raw reply related

* [PATCH v2 00/17] WireGuard: Secure Network Tunnel
From: Jason A. Donenfeld @ 2018-08-24 21:38 UTC (permalink / raw)
  To: linux-kernel, netdev, davem; +Cc: Jason A. Donenfeld

This patchset is available on git.kernel.org in this branch, where it may be
pulled directly for inclusion into net-next:

  * https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/log/?h=jd/wireguard

WireGuard is a secure network tunnel written especially for Linux, which
has faced around three years of serious development, deployment, and
scrutiny. It delivers excellent performance and is extremely easy to
use and configure. It has been designed with the primary goal of being
both easy to audit by virtue of being small and highly secure from a
cryptography and systems security perspective. WireGuard is used by some
massive companies pushing enormous amounts of traffic, and likely
already today you've consumed bytes that at some point transited through
a WireGuard tunnel. Even as an out-of-tree module, WireGuard has been
integrated into various userspace tools, Linux distributions, mobile
phones, and data centers. There are ports in several languages to
several operating systems, and even commercial hardware and services
sold integrating WireGuard. It is time, therefore, for WireGuard to be
properly integrated into Linux.

Ample information, including documentation, installation instructions,
and project details, is available at:

  * https://www.wireguard.com/
  * https://www.wireguard.com/papers/wireguard.pdf

As it is currently an out-of-tree module, it lives in its own git repo
and has its own mailing list, and every commit for the module is tested
against every stable kernel since 3.10 on a variety of architectures
using an extensive test suite:

  * https://git.zx2c4.com/WireGuard
    https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/WireGuard.git/
  * https://lists.zx2c4.com/mailman/listinfo/wireguard
  * https://www.wireguard.com/build-status/

The project has been broadly discussed at conferences, and was presented
to the Netdev developers in Seoul last November, where a paper was
released detailing some interesting aspects of the project. Dave asked
me after the talk if I would consider sending in a v1 "sooner rather
than later", hence this patchset. A decision is still waiting from the
LPC network track committee, but an update on these topics may be
presented in Vancouver in a few months. Presentations:

  * https://www.wireguard.com/presentations/
  * https://www.wireguard.com/papers/wireguard-netdev22.pdf

The cryptography in the protocol itself has been formally verified by
several independent academic teams with positive results, and I know of
two additional efforts on their way to further corroborate those
findings. The version 1 protocol is "complete", and so the purpose of
this review is to assess the implementation of the protocol. However, it
still may be of interest to know that the thing you're reviewing uses a
protocol with various nice security properties:

  * https://www.wireguard.com/formal-verification/

This patchset is divided into three segments. The first introduces a very
simple helper for working with the FPU state for the purposes of amortizing
SIMD operations. The second segment is a small collection of cryptographic
primitives, split up into several commits by primitive and by hardware. The
third is WireGuard itself, presented as an unintrusive and self-contained
virtual network driver.

It is intended that this entire patch series enter the kernel through
DaveM's net-next tree.

Changes v1->v2:
  - Zinc has been split into multiple patches.
  - Zinc has been split into multiple config options.
  - WireGuard has been wrapped at 80 characters.
  - Numerous cleanups of inside WireGuard.
  - Splitting out the simd helper into linux/.
  - Simplification of bitwise arithmetic inside some Zinc primitives.
  - Better handling of unaligned buffers inside Zinc.
  - Expanded CC list.

Enjoy,
Jason

^ permalink raw reply

* [PATCH bpf 2/2] bpf, sockmap: fix psock refcount leak in bpf_tcp_recvmsg
From: Daniel Borkmann @ 2018-08-24 20:08 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: john.fastabend, netdev, Daniel Borkmann
In-Reply-To: <20180824200851.12308-1-daniel@iogearbox.net>

In bpf_tcp_recvmsg() we first took a reference on the psock, however
once we find that there are skbs in the normal socket's receive queue
we return with processing them through tcp_recvmsg(). Problem is that
we leak the taken reference on the psock in that path. Given we don't
really do anything with the psock at this point, move the skb_queue_empty()
test before we fetch the psock to fix this case.

Fixes: 8934ce2fd081 ("bpf: sockmap redirect ingress support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 kernel/bpf/sockmap.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 01879e4..26d8a30 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -912,6 +912,8 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 
 	if (unlikely(flags & MSG_ERRQUEUE))
 		return inet_recv_error(sk, msg, len, addr_len);
+	if (!skb_queue_empty(&sk->sk_receive_queue))
+		return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
 
 	rcu_read_lock();
 	psock = smap_psock_sk(sk);
@@ -922,9 +924,6 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		goto out;
 	rcu_read_unlock();
 
-	if (!skb_queue_empty(&sk->sk_receive_queue))
-		return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
-
 	lock_sock(sk);
 bytes_ready:
 	while (copied != len) {
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf 1/2] bpf, sockmap: fix potential use after free in bpf_tcp_close
From: Daniel Borkmann @ 2018-08-24 20:08 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: john.fastabend, netdev, Daniel Borkmann
In-Reply-To: <20180824200851.12308-1-daniel@iogearbox.net>

bpf_tcp_close() we pop the psock linkage to a map via psock_map_pop().
A parallel update on the sock hash map can happen between psock_map_pop()
and lookup_elem_raw() where we override the element under link->hash /
link->key. In bpf_tcp_close()'s lookup_elem_raw() we subsequently only
test whether an element is present, but we do not test whether the
element is infact the element we were looking for.

We lock the sock in bpf_tcp_close() during that time, so do we hold
the lock in sock_hash_update_elem(). However, the latter locks the
sock which is newly updated, not the one we're purging from the hash
table. This means that while one CPU is doing the lookup from bpf_tcp_close(),
another CPU is doing the map update in parallel, dropped our sock from
the hlist and released the psock.

Subsequently the first CPU will find the new sock and attempts to drop
and release the old sock yet another time. Fix is that we need to check
the elements for a match after lookup, similar as we do in the sock map.
Note that the hash tab elems are freed via RCU, so access to their
link->hash / link->key is fine since we're under RCU read side there.

Fixes: e9db4ef6bf4c ("bpf: sockhash fix omitted bucket lock in sock_close")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 kernel/bpf/sockmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index cf5195c..01879e4 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -369,7 +369,7 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
 			/* If another thread deleted this object skip deletion.
 			 * The refcnt on psock may or may not be zero.
 			 */
-			if (l) {
+			if (l && l == link) {
 				hlist_del_rcu(&link->hash_node);
 				smap_release_sock(psock, link->sk);
 				free_htab_elem(htab, link);
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf 0/2] Two BPF sockmap fixes
From: Daniel Borkmann @ 2018-08-24 20:08 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: john.fastabend, netdev, Daniel Borkmann

This series contains two fixes found while working with the BPF
sockmap code. One refcount leakage and one case for a small window
of use after free. See patches for more info. Thanks!

Daniel Borkmann (2):
  bpf, sockmap: fix potential use after free in bpf_tcp_close
  bpf, sockmap: fix psock refcount leak in bpf_tcp_recvmsg

 kernel/bpf/sockmap.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

-- 
2.9.5

^ permalink raw reply

* Re: KASAN: use-after-free Read in __rhashtable_lookup (2)
From: Sowmini Varadhan @ 2018-08-24 23:37 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzbot, David Miller, LKML, linux-rdma, netdev, rds-devel,
	Santosh Shilimkar, syzkaller-bugs
In-Reply-To: <CACT4Y+a6P+q8jtRrd=cAf5hWbqQO4CQdWUaCCX-ygRbGsWHnGQ@mail.gmail.com>

On (08/24/18 16:10), Dmitry Vyukov wrote:
> You need a colon after dup.

I see.

> But this can't possibly be a dup of that bug. That bug was fixed in February:

Apologies, I unknowingly made a mistake in my 
syzbot-tiquette in that case,

I did not intend to *close* the bug as a dup, I was
merely trying to indicate that this is yet another 
manifestation of the original report with subject
"KASAN: use-after-free Read in rds_find_bound"

All of these problems are happening because we are missing
a synchronize_net() in rds_release.

A full description of the race conditon can be found in
https://www.spinics.net/lists/netdev/msg475074.html

If rds_release is going to nuke the rds_socket
it needs to sychronize_net() with other threads that may be
looking up the bind_hash_table. If we dont do that,
all of these "use-after-free" races can happen.
As long as we dont have the synchronize_net() we are
missing a big (and standard) bit of RCU synchronization.

--Sowmini

^ permalink raw reply

* Re: Re: KASAN: use-after-free Read in __rhashtable_lookup (2)
From: Dmitry Vyukov @ 2018-08-24 23:10 UTC (permalink / raw)
  To: syzbot
  Cc: Sowmini Varadhan, David Miller, LKML, linux-rdma, netdev,
	rds-devel, Santosh Shilimkar, syzkaller-bugs
In-Reply-To: <000000000000d8879a05743667a2@google.com>

On Fri, Aug 24, 2018 at 4:05 PM, syzbot
<syzbot+8967084bcac563795dc6@syzkaller.appspotmail.com> wrote:
>
>
>> #syz dup KASAN: use-after-free Read in rds_find_bound
>
>
> unknown command "dup"


You need a colon after dup.

But this can't possibly be a dup of that bug. That bug was fixed in February:
https://syzkaller.appspot.com/bug?extid=93a5839deb355537440f

And this crash happened a day ago. This is a different bug. If you dup
it to a closed one, this one will also be considered closed, and then
a new crash will immediately create a new bug.



>> The problem is a missing synchronize_net() in rds_release.
>> References:
>>    https://www.spinics.net/lists/netdev/msg475074.html
>>    https://www.spinics.net/lists/netdev/msg507908.html
>
>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "syzkaller-bugs" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to syzkaller-bugs+unsubscribe@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/syzkaller-bugs/20180824230446.GA16891%40oracle.com.
>> For more options, visit https://groups.google.com/d/optout.
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/syzkaller-bugs/000000000000d8879a05743667a2%40google.com.
>
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox