Netdev List
* Re: [PATCH net] pppoe: fix reception of frames with no mac header
From: Guillaume Nault @ 2018-09-14 14:35 UTC (permalink / raw)
  To: netdev; +Cc: Alexander Potapenko, Michal Ostrowski, Eric Dumazet
In-Reply-To: <274ac54fa02052104201d4738a6326a637e87a83.1536935190.git.g.nault@alphalink.fr>

On Fri, Sep 14, 2018 at 04:28:05PM +0200, Guillaume Nault wrote:
> pppoe_rcv() needs to look back at the Ethernet header in order to
> lookup the PPPoE session. Therefore we need to ensure that the mac
> header is big enough to contain an Ethernet header. Otherwise
> eth_hdr(skb)->h_source might access invalid data.
> 
Forgot to Cc Alexander :/
Sorry...
BTW, thanks for your first analysis.

^ permalink raw reply

* [PATCH net] pppoe: fix reception of frames with no mac header
From: Guillaume Nault @ 2018-09-14 14:28 UTC (permalink / raw)
  To: netdev; +Cc: Michal Ostrowski, Eric Dumazet

pppoe_rcv() needs to look back at the Ethernet header in order to
lookup the PPPoE session. Therefore we need to ensure that the mac
header is big enough to contain an Ethernet header. Otherwise
eth_hdr(skb)->h_source might access invalid data.
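
The length check described above can be modelled outside the kernel. The
following is only a minimal sketch: struct toy_frame and the toy_* helpers
are hypothetical stand-ins, not kernel types, and they assume the mac header
length is the distance between the mac and network header offsets.

```c
#include <stdbool.h>
#include <stddef.h>

#define ETH_HLEN 14 /* length of an Ethernet header in bytes */

/* Hypothetical, simplified stand-in for struct sk_buff: only the two
 * header offsets needed for the check are modelled. */
struct toy_frame {
	size_t mac_header;     /* offset of the link-layer header */
	size_t network_header; /* offset of the header that follows it */
};

/* Number of bytes between the two offsets, i.e. the mac header length. */
size_t toy_mac_header_len(const struct toy_frame *f)
{
	return f->network_header - f->mac_header;
}

/* True only when a full Ethernet header can be read safely. */
bool toy_has_eth_header(const struct toy_frame *f)
{
	return toy_mac_header_len(f) >= ETH_HLEN;
}
```

A frame injected without any link-layer header has both offsets equal, so
the helper returns false and the frame would be dropped.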

==================================================================
BUG: KMSAN: uninit-value in __get_item drivers/net/ppp/pppoe.c:172 [inline]
BUG: KMSAN: uninit-value in get_item drivers/net/ppp/pppoe.c:236 [inline]
BUG: KMSAN: uninit-value in pppoe_rcv+0xcef/0x10e0 drivers/net/ppp/pppoe.c:450
CPU: 0 PID: 4543 Comm: syz-executor355 Not tainted 4.16.0+ #87
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
 __get_item drivers/net/ppp/pppoe.c:172 [inline]
 get_item drivers/net/ppp/pppoe.c:236 [inline]
 pppoe_rcv+0xcef/0x10e0 drivers/net/ppp/pppoe.c:450
 __netif_receive_skb_core+0x47df/0x4a90 net/core/dev.c:4562
 __netif_receive_skb net/core/dev.c:4627 [inline]
 netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
 netif_receive_skb+0x230/0x240 net/core/dev.c:4725
 tun_rx_batched drivers/net/tun.c:1555 [inline]
 tun_get_user+0x740f/0x7c60 drivers/net/tun.c:1962
 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
 call_write_iter include/linux/fs.h:1782 [inline]
 new_sync_write fs/read_write.c:469 [inline]
 __vfs_write+0x7fb/0x9f0 fs/read_write.c:482
 vfs_write+0x463/0x8d0 fs/read_write.c:544
 SYSC_write+0x172/0x360 fs/read_write.c:589
 SyS_write+0x55/0x80 fs/read_write.c:581
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x4447c9
RSP: 002b:00007fff64c8fc28 EFLAGS: 00000297 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004447c9
RDX: 000000000000fd87 RSI: 0000000020000600 RDI: 0000000000000004
RBP: 00000000006cf018 R08: 00007fff64c8fda8 R09: 00007fff00006bda
R10: 0000000000005fe7 R11: 0000000000000297 R12: 00000000004020d0
R13: 0000000000402160 R14: 0000000000000000 R15: 0000000000000000

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
 slab_post_alloc_hook mm/slab.h:445 [inline]
 slab_alloc_node mm/slub.c:2737 [inline]
 __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
 __kmalloc_reserve net/core/skbuff.c:138 [inline]
 __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
 alloc_skb include/linux/skbuff.h:984 [inline]
 alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
 sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
 tun_alloc_skb drivers/net/tun.c:1532 [inline]
 tun_get_user+0x2242/0x7c60 drivers/net/tun.c:1829
 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
 call_write_iter include/linux/fs.h:1782 [inline]
 new_sync_write fs/read_write.c:469 [inline]
 __vfs_write+0x7fb/0x9f0 fs/read_write.c:482
 vfs_write+0x463/0x8d0 fs/read_write.c:544
 SYSC_write+0x172/0x360 fs/read_write.c:589
 SyS_write+0x55/0x80 fs/read_write.c:581
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
==================================================================

Fixes: 224cf5ad14c0 ("ppp: Move the PPP drivers")
Reported-by: syzbot+f5f6080811c849739212@syzkaller.appspotmail.com
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 drivers/net/ppp/pppoe.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index ce61231e96ea..62dc564b251d 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -429,6 +429,9 @@ static int pppoe_rcv(struct sk_buff *skb, struct net_device *dev,
 	if (!skb)
 		goto out;
 
+	if (skb_mac_header_len(skb) < ETH_HLEN)
+		goto drop;
+
 	if (!pskb_may_pull(skb, sizeof(struct pppoe_hdr)))
 		goto drop;
 
-- 
2.19.0

^ permalink raw reply related

* Re: [PATCH net] veth: Orphan skb before GRO
From: Paolo Abeni @ 2018-09-14 14:16 UTC (permalink / raw)
  To: Toshiaki Makita, David S. Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <1536899624-2438-1-git-send-email-makita.toshiaki@lab.ntt.co.jp>

On Fri, 2018-09-14 at 13:33 +0900, Toshiaki Makita wrote:
> GRO expects skbs not to be owned by sockets, but when XDP is enabled veth
> passed skbs owned by sockets. It caused corrupted sk_wmem_alloc.
> 
> Paolo Abeni reported the following splat:
> 
> [  362.098904] refcount_t overflow at skb_set_owner_w+0x5e/0xa0 in iperf3[1644], uid/euid: 0/0
> [  362.108239] WARNING: CPU: 0 PID: 1644 at kernel/panic.c:648 refcount_error_report+0xa0/0xa4
> [  362.117547] Modules linked in: tcp_diag inet_diag veth intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf ipmi_ssif iTCO_wdt sg ipmi_si iTCO_vendor_support ipmi_devintf mxm_wmi ipmi_msghandler pcspkr dcdbas mei_me wmi mei lpc_ich acpi_power_meter pcc_cpufreq xfs libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ixgbe igb ttm ahci mdio libahci ptp crc32c_intel drm pps_core libata i2c_algo_bit dca dm_mirror dm_region_hash dm_log dm_mod
> [  362.176622] CPU: 0 PID: 1644 Comm: iperf3 Not tainted 4.19.0-rc2.vanilla+ #2025
> [  362.184777] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
> [  362.193124] RIP: 0010:refcount_error_report+0xa0/0xa4
> [  362.198758] Code: 08 00 00 48 8b 95 80 00 00 00 49 8d 8c 24 80 0a 00 00 41 89 c1 44 89 2c 24 48 89 de 48 c7 c7 18 4d e7 9d 31 c0 e8 30 fa ff ff <0f> 0b eb 88 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 49 89 fc
> [  362.219711] RSP: 0018:ffff9ee6ff603c20 EFLAGS: 00010282
> [  362.225538] RAX: 0000000000000000 RBX: ffffffff9de83e10 RCX: 0000000000000000
> [  362.233497] RDX: 0000000000000001 RSI: ffff9ee6ff6167d8 RDI: ffff9ee6ff6167d8
> [  362.241457] RBP: ffff9ee6ff603d78 R08: 0000000000000490 R09: 0000000000000004
> [  362.249416] R10: 0000000000000000 R11: ffff9ee6ff603990 R12: ffff9ee664b94500
> [  362.257377] R13: 0000000000000000 R14: 0000000000000004 R15: ffffffff9de615f9
> [  362.265337] FS:  00007f1d22d28740(0000) GS:ffff9ee6ff600000(0000) knlGS:0000000000000000
> [  362.274363] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  362.280773] CR2: 00007f1d222f35d0 CR3: 0000001fddfec003 CR4: 00000000001606f0
> [  362.288733] Call Trace:
> [  362.291459]  <IRQ>
> [  362.293702]  ex_handler_refcount+0x4e/0x80
> [  362.298269]  fixup_exception+0x35/0x40
> [  362.302451]  do_trap+0x109/0x150
> [  362.306048]  do_error_trap+0xd5/0x130
> [  362.315766]  invalid_op+0x14/0x20
> [  362.319460] RIP: 0010:skb_set_owner_w+0x5e/0xa0
> [  362.324512] Code: ef ff ff 74 49 48 c7 43 60 20 7b 4a 9d 8b 85 f4 01 00 00 85 c0 75 16 8b 83 e0 00 00 00 f0 01 85 44 01 00 00 0f 88 d8 23 16 00 <5b> 5d c3 80 8b 91 00 00 00 01 8b 85 f4 01 00 00 89 83 a4 00 00 00
> [  362.345465] RSP: 0018:ffff9ee6ff603e20 EFLAGS: 00010a86
> [  362.351291] RAX: 0000000000001100 RBX: ffff9ee65deec700 RCX: ffff9ee65e829244
> [  362.359250] RDX: 0000000000000100 RSI: ffff9ee65e829100 RDI: ffff9ee65deec700
> [  362.367210] RBP: ffff9ee65e829100 R08: 000000000002a380 R09: 0000000000000000
> [  362.375169] R10: 0000000000000002 R11: fffff1a4bf77bb00 R12: ffffc0754661d000
> [  362.383130] R13: ffff9ee65deec200 R14: ffff9ee65f597000 R15: 00000000000000aa
> [  362.391092]  veth_xdp_rcv+0x4e4/0x890 [veth]
> [  362.399357]  veth_poll+0x4d/0x17a [veth]
> [  362.403731]  net_rx_action+0x2af/0x3f0
> [  362.407912]  __do_softirq+0xdd/0x29e
> [  362.411897]  do_softirq_own_stack+0x2a/0x40
> [  362.416561]  </IRQ>
> [  362.418899]  do_softirq+0x4b/0x70
> [  362.422594]  __local_bh_enable_ip+0x50/0x60
> [  362.427258]  ip_finish_output2+0x16a/0x390
> [  362.431824]  ip_output+0x71/0xe0
> [  362.440670]  __tcp_transmit_skb+0x583/0xab0
> [  362.445333]  tcp_write_xmit+0x247/0xfb0
> [  362.449609]  __tcp_push_pending_frames+0x2d/0xd0
> [  362.454760]  tcp_sendmsg_locked+0x857/0xd30
> [  362.459424]  tcp_sendmsg+0x27/0x40
> [  362.463216]  sock_sendmsg+0x36/0x50
> [  362.467104]  sock_write_iter+0x87/0x100
> [  362.471382]  __vfs_write+0x112/0x1a0
> [  362.475369]  vfs_write+0xad/0x1a0
> [  362.479062]  ksys_write+0x52/0xc0
> [  362.482759]  do_syscall_64+0x5b/0x180
> [  362.486841]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  362.492473] RIP: 0033:0x7f1d22293238
> [  362.496458] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 c5 54 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
> [  362.517409] RSP: 002b:00007ffebaef8008 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  362.525855] RAX: ffffffffffffffda RBX: 0000000000002800 RCX: 00007f1d22293238
> [  362.533816] RDX: 0000000000002800 RSI: 00007f1d22d36000 RDI: 0000000000000005
> [  362.541775] RBP: 00007f1d22d36000 R08: 00000002db777a30 R09: 0000562b70712b20
> [  362.549734] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
> [  362.557693] R13: 0000000000002800 R14: 00007ffebaef8060 R15: 0000562b70712260
> 
> In order to avoid this, orphan the skb before entering GRO.
> 
> Fixes: 948d4f214fde ("veth: Add driver XDP")
> Reported-by: Paolo Abeni <pabeni@redhat.com>
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> ---
>  drivers/net/veth.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 8d679c8..41a00cd 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -463,6 +463,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb,
>  	int mac_len, delta, off;
>  	struct xdp_buff xdp;
>  
> +	skb_orphan(skb);
> +
>  	rcu_read_lock();
>  	xdp_prog = rcu_dereference(rq->xdp_prog);
>  	if (unlikely(!xdp_prog)) {
> @@ -508,8 +510,6 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb,
>  		skb_copy_header(nskb, skb);
>  		head_off = skb_headroom(nskb) - skb_headroom(skb);
>  		skb_headers_offset_update(nskb, head_off);
> -		if (skb->sk)
> -			skb_set_owner_w(nskb, skb->sk);
>  		consume_skb(skb);
>  		skb = nskb;
>  	}

I just gave it a run in my test environment, and it fixes the reported
issue.
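
For readers following the thread, the ownership rule behind the fix can be
modelled in userspace. This is only a sketch under simplifying assumptions:
the toy_* types and functions are hypothetical and merely mimic the
accounting idea behind skb_set_owner_w() and skb_orphan().

```c
#include <stddef.h>

/* Hypothetical, simplified model of a socket's send-side accounting. */
struct toy_sock {
	int wmem_alloc; /* bytes currently charged to the socket */
};

/* Hypothetical, simplified model of an skb. */
struct toy_skb {
	struct toy_sock *sk;                  /* owning socket, or NULL */
	void (*destructor)(struct toy_skb *); /* releases the charge */
	int truesize;                         /* bytes to account for */
};

/* Destructor for write-side ownership: return the bytes to the socket. */
void toy_sock_wfree(struct toy_skb *skb)
{
	skb->sk->wmem_alloc -= skb->truesize;
}

/* Charge the skb to a socket. */
void toy_set_owner_w(struct toy_skb *skb, struct toy_sock *sk)
{
	skb->sk = sk;
	skb->destructor = toy_sock_wfree;
	sk->wmem_alloc += skb->truesize;
}

/* Drop ownership: run the destructor once, then clear both fields. */
void toy_orphan(struct toy_skb *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
	skb->destructor = NULL;
	skb->sk = NULL;
}
```

Once the buffer is orphaned, later stages see no owner, so the charge cannot
be released or re-applied a second time.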

Tested-by: Paolo Abeni <pabeni@redhat.com>

^ permalink raw reply

* [PATCH iproute2] q_cake: Also print nonat, nowash and no-ack-filter keywords
From: Toke Høiland-Jørgensen @ 2018-09-14 13:51 UTC (permalink / raw)
  To: netdev; +Cc: cake, Toke Høiland-Jørgensen

Similar to the previous patch for no-split-gso, the negative keywords for
'nat', 'wash' and 'ack-filter' were not printed either. Add those as well.
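
The intended plain-text output can be illustrated with a small standalone
sketch. append_flag() and format_cake_flags() are hypothetical helpers, not
iproute2 functions, and ack-filter is reduced to a boolean here even though
the real option has more than two states.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append the positive or the negative keyword for a
 * boolean option, followed by a space, to the string in buf. */
void append_flag(char *buf, size_t len, bool on,
		 const char *pos, const char *neg)
{
	size_t used = strlen(buf);

	if (used < len)
		snprintf(buf + used, len - used, "%s ", on ? pos : neg);
}

/* Build the human-readable keyword string for three boolean options.
 * buf must have room for at least one byte. */
void format_cake_flags(char *buf, size_t len,
		       bool nat, bool wash, bool ack_filter)
{
	buf[0] = '\0';
	append_flag(buf, len, nat, "nat", "nonat");
	append_flag(buf, len, wash, "wash", "nowash");
	append_flag(buf, len, ack_filter, "ack-filter", "no-ack-filter");
}
```

With every option disabled, the sketch emits each negative keyword rather
than printing nothing, which is the consistency this patch is after.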

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 tc/q_cake.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tc/q_cake.c b/tc/q_cake.c
index 077bf84f..e827e3f1 100644
--- a/tc/q_cake.c
+++ b/tc/q_cake.c
@@ -468,6 +468,8 @@ static int cake_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 
 	if (nat)
 		print_string(PRINT_FP, NULL, "nat ", NULL);
+	else
+		print_string(PRINT_FP, NULL, "nonat ", NULL);
 	print_bool(PRINT_JSON, "nat", NULL, nat);
 
 	if (tb[TCA_CAKE_WASH] &&
@@ -508,6 +510,8 @@ static int cake_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 
 	if (wash)
 		print_string(PRINT_FP, NULL, "wash ", NULL);
+	else
+		print_string(PRINT_FP, NULL, "nowash ", NULL);
 	print_bool(PRINT_JSON, "wash", NULL, wash);
 
 	if (ingress)
@@ -520,7 +524,7 @@ static int cake_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	else if (ack_filter == CAKE_ACK_FILTER)
 		print_string(PRINT_ANY, "ack-filter", "ack-filter ", "enabled");
 	else
-		print_string(PRINT_JSON, "ack-filter", NULL, "disabled");
+		print_string(PRINT_ANY, "ack-filter", "no-ack-filter ", "disabled");
 
 	if (split_gso)
 		print_string(PRINT_FP, NULL, "split-gso ", NULL);
-- 
2.18.0

^ permalink raw reply related

* Re: [PATCH iproute2] q_cake: Add printing of no-split-gso option
From: Toke Høiland-Jørgensen @ 2018-09-14 13:40 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20180912130743.1adfe86b@xeon-e3>

Stephen Hemminger <stephen@networkplumber.org> writes:

> On Wed, 12 Sep 2018 00:32:16 +0200
> Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>
>> When the GSO splitting was turned into dual split-gso/no-split-gso options,
>> the printing of the latter was left out. Add that, so output is consistent
>> with the options passed.
>> 
>> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
>
> Applied. I noticed that nat/nonat and wash/nowash have similar missing
> output.

Thanks! And yeah, you're right; I'll send another patch :)

-Toke

^ permalink raw reply

* Re: [PATCH] net/mlx4_core: print firmware version during driver loading
From: Qing Huang @ 2018-09-14 18:33 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Leon Romanovsky, netdev, linux-rdma, linux-kernel, tariqt, davem
In-Reply-To: <20180914181718.GD3811@lunn.ch>



On 9/14/2018 11:17 AM, Andrew Lunn wrote:
> On Fri, Sep 14, 2018 at 10:15:48AM -0700, Qing Huang wrote:
>> The FW version is actually a very crucial piece of information and only
>> printed once here when the driver is loaded. People tend to get confused
>> when switching multiple FW files back and forth without running separate
>> utility tools, especially at customer sites.
>> IMHO, this information is very useful and only takes up very little log
>> file space. :-)
> Why not use ethtool -i ?
>
> $ sudo ethtool -i eth0
> driver: r8169
> version: 2.3LK-NAPI
> firmware-version: rtl8168g-2_0.0.1 02/06/13
>
>      Andrew
Sure. You can also use the ibstat or ibv_devinfo tools if they are installed.
But that is not very convenient in some cases.

E.g.
A customer upgrades FW on HCAs and encounters issues. During triage,
it's much easier to study customer-uploaded log files when remotely
testing different FW files.

Thanks.

^ permalink raw reply

* KMSAN: uninit-value in do_ip_vs_set_ctl
From: syzbot @ 2018-09-14 18:23 UTC (permalink / raw)
  To: coreteam, davem, fw, horms, ja, kadlec, linux-kernel, lvs-devel,
	netdev, netfilter-devel, pablo, syzkaller-bugs, wensong

Hello,

syzbot found the following crash on:

HEAD commit:    06b2df0593a8 kmsan: unpoison only the created pages in get..
git tree:       https://github.com/google/kmsan.git/master
console output: https://syzkaller.appspot.com/x/log.txt?x=11a6ae37800000
kernel config:  https://syzkaller.appspot.com/x/.config?x=4ca1e57bafa8ab1f
dashboard link: https://syzkaller.appspot.com/bug?extid=23b5f9e7caf61d9a3898
compiler:       clang version 7.0.0 (trunk 329391)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=14008417800000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11deb017800000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+23b5f9e7caf61d9a3898@syzkaller.appspotmail.com

random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
==================================================================
BUG: KMSAN: uninit-value in do_ip_vs_set_ctl+0x15ac/0x2760 net/netfilter/ipvs/ip_vs_ctl.c:2424
CPU: 1 PID: 4464 Comm: syz-executor844 Not tainted 4.17.0-rc3+ #94
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x185/0x1d0 lib/dump_stack.c:113
  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1084
  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
  do_ip_vs_set_ctl+0x15ac/0x2760 net/netfilter/ipvs/ip_vs_ctl.c:2424
  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
  nf_setsockopt+0x476/0x4d0 net/netfilter/nf_sockopt.c:115
  ip_setsockopt+0x24b/0x2b0 net/ipv4/ip_sockglue.c:1253
  raw_setsockopt+0x2e5/0x350 net/ipv4/raw.c:868
  sock_common_setsockopt+0x136/0x170 net/core/sock.c:3039
  __sys_setsockopt+0x4af/0x560 net/socket.c:1903
  __do_sys_setsockopt net/socket.c:1914 [inline]
  __se_sys_setsockopt net/socket.c:1911 [inline]
  __x64_sys_setsockopt+0x15c/0x1c0 net/socket.c:1911
  do_syscall_64+0x154/0x220 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x43fca9
RSP: 002b:00007fff7a4795b8 EFLAGS: 00000213 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fca9
RDX: 0000000000000480 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000213 R12: 00000000004015d0
R13: 0000000000401660 R14: 0000000000000000 R15: 0000000000000000

Local variable description: ----arg@do_ip_vs_set_ctl
Variable was created at:
  read_pnet include/net/net_namespace.h:288 [inline]
  sock_net include/net/sock.h:2306 [inline]
  do_ip_vs_set_ctl+0x93/0x2760 net/netfilter/ipvs/ip_vs_ctl.c:2347
  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
  nf_setsockopt+0x476/0x4d0 net/netfilter/nf_sockopt.c:115
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* Re: [PATCH] net/mlx4_core: print firmware version during driver loading
From: Andrew Lunn @ 2018-09-14 18:17 UTC (permalink / raw)
  To: Qing Huang
  Cc: Leon Romanovsky, netdev, linux-rdma, linux-kernel, tariqt, davem
In-Reply-To: <c580ad9d-b63d-743b-2278-1c4cf3553186@oracle.com>

On Fri, Sep 14, 2018 at 10:15:48AM -0700, Qing Huang wrote:
> The FW version is actually a very crucial piece of information and only
> printed once here when the driver is loaded. People tend to get confused
> when switching multiple FW files back and forth without running separate
> utility tools, especially at customer sites.
> IMHO, this information is very useful and only takes up very little log
> file space. :-)

Why not use ethtool -i ?

$ sudo ethtool -i eth0
driver: r8169
version: 2.3LK-NAPI
firmware-version: rtl8168g-2_0.0.1 02/06/13

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 4/4] bnxt_en: Always forward VF MAC address to the PF.
From: Siwei Liu @ 2018-09-14 12:49 UTC (permalink / raw)
  To: Michael Chan; +Cc: David Miller, Netdev, si-wei liu
In-Reply-To: <1525763921-20698-5-git-send-email-michael.chan@broadcom.com>

This commit is toxic, if possible I hope it can be reverted and
reworked with a new patch.

First, the patch introduced backward-incompatible changes to the bnxt_en
VF driver that cause issues when interoperating with an old PF
driver that lacks this commit. In that case, VF probing fails from
within the VM:

[    5.660331] Broadcom NetXtreme-C/E driver bnxt_en v1.9.1
[    5.663653] bnxt_en 0000:00:03.0 (unnamed net_device) (uninitialized): hwrm req_type 0xf seq id 0x6 error 0x4
[    5.665804] bnxt_en 0000:00:03.0 (unnamed net_device) (uninitialized): VF MAC address 00:01:02:03:04:05 not approved by the PF
[    5.668268] bnxt_en 0000:00:03.0: Unable to initialize mac address.
[    5.670974] bnxt_en: probe of 0000:00:03.0 failed with error -99

Second, this commit contains driver changes to both PF and VF side,
and incorrectly assumes that both PF and VF can/should be updated at
the same time to resolve the original issue (zero VF MAC address in
'ip link show') it tried to address. In fact that is not warranted. A
potential warranted fix is for VF driver to ignore what
bnxt_approve_mac() may return when it got a valid MAC address from the
firmware. The only purpose for the bnxt_approve_mac call for this case
is a best-effort attempt to inform PF of the MAC address, instead of
failing the VF driver probe when talking to an old PF driver.
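
To make the suggestion concrete, here is a rough userspace sketch of the
proposed control flow. It is not the driver code: toy_vf_init_mac() and the
two stub PF callbacks are hypothetical, and the error value simply matches
the -99 seen in the log above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical callback standing in for the request sent to the PF. */
typedef int (*approve_fn)(const unsigned char *mac);

/* Stub: an old PF driver that rejects the forwarded address. */
int toy_old_pf_rejects(const unsigned char *mac)
{
	(void)mac;
	return -EADDRNOTAVAIL;
}

/* Stub: a PF driver that accepts the forwarded address. */
int toy_new_pf_accepts(const unsigned char *mac)
{
	(void)mac;
	return 0;
}

/* Hypothetical sketch of VF MAC initialisation.
 * fw_mac_valid: the firmware already handed the VF a valid address.
 * Returns 0 on success, or the PF's error code. */
int toy_vf_init_mac(const unsigned char *mac, bool fw_mac_valid,
		    approve_fn approve)
{
	int rc = approve(mac);

	/* With a firmware-provided address the call only informs the PF
	 * (best effort), so a rejection must not fail the probe. */
	if (fw_mac_valid)
		return 0;

	/* A randomly generated address still needs real approval. */
	return rc;
}
```

Under this sketch, a VF holding a firmware-provided address probes
successfully even against a PF that rejects the request.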

Canonical reported a similar issue a few days back due to the same cause.

https://www.spinics.net/lists/netdev/msg521428.html


Regards,
-Siwei

On Tue, May 8, 2018 at 12:18 AM, Michael Chan <michael.chan@broadcom.com> wrote:
> The current code already forwards the VF MAC address to the PF, except
> in one case.  If the VF driver gets a valid MAC address from the firmware
> during probe time, it will not forward the MAC address to the PF,
> incorrectly assuming that the PF already knows the MAC address.  This
> causes "ip link show" to show zero VF MAC addresses for this case.
>
> This assumption is not correct.  Newer firmware remembers the VF MAC
> address last used by the VF and provides it to the VF driver during
> probe.  So we need to always forward the VF MAC address to the PF.
>
> The forwarded MAC address may now be the PF assigned MAC address and so we
> need to make sure we approve it for this case.
>
> Signed-off-by: Michael Chan <michael.chan@broadcom.com>
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c       | 2 +-
>  drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 3 ++-
>  2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index cd3ab78..dfa0839 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -8678,8 +8678,8 @@ static int bnxt_init_mac_addr(struct bnxt *bp)
>                         memcpy(bp->dev->dev_addr, vf->mac_addr, ETH_ALEN);
>                 } else {
>                         eth_hw_addr_random(bp->dev);
> -                       rc = bnxt_approve_mac(bp, bp->dev->dev_addr);
>                 }
> +               rc = bnxt_approve_mac(bp, bp->dev->dev_addr);
>  #endif
>         }
>         return rc;
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
> index cc21d87..a649108 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
> @@ -923,7 +923,8 @@ static int bnxt_vf_configure_mac(struct bnxt *bp, struct bnxt_vf_info *vf)
>         if (req->enables & cpu_to_le32(FUNC_VF_CFG_REQ_ENABLES_DFLT_MAC_ADDR)) {
>                 if (is_valid_ether_addr(req->dflt_mac_addr) &&
>                     ((vf->flags & BNXT_VF_TRUST) ||
> -                    (!is_valid_ether_addr(vf->mac_addr)))) {
> +                    !is_valid_ether_addr(vf->mac_addr) ||
> +                    ether_addr_equal(req->dflt_mac_addr, vf->mac_addr))) {
>                         ether_addr_copy(vf->vf_mac_addr, req->dflt_mac_addr);
>                         return bnxt_hwrm_exec_fwd_resp(bp, vf, msg_size);
>                 }
> --
> 1.8.3.1
>

^ permalink raw reply

* Re: [PATCH 5/7] MIPS: mscc: ocelot: add GPIO4 pinmuxing DT node
From: Alexandre Belloni @ 2018-09-14 18:02 UTC (permalink / raw)
  To: Quentin Schulz
  Cc: ralf, paul.burton, jhogan, robh+dt, mark.rutland, davem, andrew,
	f.fainelli, allan.nielsen, linux-mips, devicetree, linux-kernel,
	netdev, thomas.petazzoni, antoine.tenart
In-Reply-To: <20180914162638.fgzzjin2bzgx74de@qschulz>

On 14/09/2018 18:26:38+0200, Quentin Schulz wrote:
> Hi Alexandre,
> 
> On Fri, Sep 14, 2018 at 04:54:46PM +0200, Alexandre Belloni wrote:
> > Hi,
> > 
> > On 14/09/2018 11:44:26+0200, Quentin Schulz wrote:
> > > In order to use GPIO4 as a GPIO, we need to mux it in this mode so let's
> > > declare a new pinctrl DT node for it.
> > > 
> > > Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
> > > ---
> > >  arch/mips/boot/dts/mscc/ocelot.dtsi | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/arch/mips/boot/dts/mscc/ocelot.dtsi b/arch/mips/boot/dts/mscc/ocelot.dtsi
> > > index 8ce317c..b5c4c74 100644
> > > --- a/arch/mips/boot/dts/mscc/ocelot.dtsi
> > > +++ b/arch/mips/boot/dts/mscc/ocelot.dtsi
> > > @@ -182,6 +182,11 @@
> > >  			interrupts = <13>;
> > >  			#interrupt-cells = <2>;
> > >  
> > > +			gpio4: gpio4 {
> > > +				pins = "GPIO_4";
> > > +				function = "gpio";
> > > +			};
> > > +
> > 
> > For a GPIO, I would do that in the board dts because it is not used
> > directly in the dtsi.
> > 
> 
> And the day we've two boards using this pinctrl we move it to a dtsi. Is
> that the plan?
> 

Not really, at least not for gpios. I've included the pinctrl for the
uart, i2c and spi because they are the only option if you are to use
those peripherals. Otherwise, I would have left the pinctrl to the board
file. From my point of view, the gpios are too board specific to be in a
soc dtsi.


-- 
Alexandre Belloni, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply

* Re: [PATCH net-next 3/7] net: phy: mscc: split config_init in two functions for VSC8584
From: Florian Fainelli @ 2018-09-14 17:57 UTC (permalink / raw)
  To: Quentin Schulz, alexandre.belloni, ralf, paul.burton, jhogan,
	robh+dt, mark.rutland, davem, andrew
  Cc: allan.nielsen, linux-mips, devicetree, linux-kernel, netdev,
	thomas.petazzoni, antoine.tenart
In-Reply-To: <5daa7f3e467b218410238ef0fb97f01779f8f49f.1536916714.git-series.quentin.schulz@bootlin.com>

On 09/14/2018 02:44 AM, Quentin Schulz wrote:
> Part of the config init is common between the VSC8584 and the VSC8574,
> so to prepare the upcoming support for VSC8574, separate config_init
> PHY-specific code to config_pre_init function which is set in the probe
> function of the PHY and used in config_init.
> 
> Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
> ---
>  drivers/net/phy/mscc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
> index b450489..69cc3cf 100644
> --- a/drivers/net/phy/mscc.c
> +++ b/drivers/net/phy/mscc.c
> @@ -355,6 +355,7 @@ struct vsc8531_private {
>  	u64 *stats;
>  	int nstats;
>  	bool pkg_init;
> +	int (*config_pre_init)(struct mii_bus *bus, int phy);

Isn't this overkill? Given that you have a reference to the phy_device,
you could check the phy_id to know which exact type you have and
call the appropriate pre_init function.

unsigned int phy might be more appropriate.
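
A rough sketch of what dispatching on the ID could look like, written as
standalone C. Everything here is hypothetical: the TOY_* IDs are placeholder
values rather than the real VSC85xx IDs, and the toy_* functions stand in
for the per-chip pre-init routines.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder IDs for illustration only; not the real PHY ID values. */
#define TOY_PHY_ID_A    0x1000u
#define TOY_PHY_ID_B    0x2000u
#define TOY_PHY_ID_MASK 0xfff0u /* ignore the revision nibble */

typedef int (*pre_init_fn)(unsigned int phy_addr);

/* Stand-ins for the chip-specific pre-init routines. */
int toy_pre_init_a(unsigned int phy_addr) { (void)phy_addr; return 1; }
int toy_pre_init_b(unsigned int phy_addr) { (void)phy_addr; return 2; }

/* Select the pre-init routine from the PHY ID; NULL if it is unknown. */
pre_init_fn toy_select_pre_init(uint32_t phy_id)
{
	switch (phy_id & TOY_PHY_ID_MASK) {
	case TOY_PHY_ID_A:
		return toy_pre_init_a;
	case TOY_PHY_ID_B:
		return toy_pre_init_b;
	default:
		return NULL;
	}
}
```

This keeps the choice in one place instead of storing a function pointer in
the private structure at probe time.
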
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next v4 18/20] crypto: port ChaCha20 to Zinc
From: Jason A. Donenfeld @ 2018-09-14 17:49 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: LKML, Netdev, Linux Crypto Mailing List, David Miller,
	Greg Kroah-Hartman, Samuel Neves, Andrew Lutomirski,
	Jean-Philippe Aumasson, Eric Biggers
In-Reply-To: <CAKv+Gu-wwFJOL82+iJYCu8rbzeDWLYH=5PtGOJBUouB1zdiZjg@mail.gmail.com>

On Fri, Sep 14, 2018 at 7:38 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> so could we please bring that discussion to a close before we drop the ARM code?

My understanding is that either these will find their way up to AndyP
and then back down here, or Eric or you will augment the .S in this
patch at a later date with an improvement commit that includes some
benchmarks.

Jason

^ permalink raw reply

* Re: [PATCH net-next v4 00/20] WireGuard: Secure Network Tunnel
From: Jason A. Donenfeld @ 2018-09-14 17:47 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: LKML, Netdev, Linux Crypto Mailing List, David Miller,
	Greg Kroah-Hartman
In-Reply-To: <CAKv+Gu_LYsNs88uF4+G1xfOtWvNPOjiiYZKqZf7qSBkvn6iEoA@mail.gmail.com>

On Fri, Sep 14, 2018 at 7:40 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> >   - Move away from makefile ifdef maze and instead prefer kconfig values,
> >     which also makes the design a bit more modular too, which could help
> >     in the future.
>
> Could you elaborate on this? From the patches, it is not clear to me
> how this has improved.

Feature detection was previously done as a confusing set of ifeq and
ifdefs. Instead, I've now put the logic for this into the kconfig,
which makes the makefiles and header files a bit simpler. This also
makes it easier to later on modularize Zinc itself if deemed
necessary.

^ permalink raw reply

* Re: [PATCH net-next v4 08/20] zinc: Poly1305 ARM and ARM64 implementations
From: Jason A. Donenfeld @ 2018-09-14 17:45 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: LKML, Netdev, Linux Crypto Mailing List, David Miller,
	Greg Kroah-Hartman, Samuel Neves, Andrew Lutomirski,
	Jean-Philippe Aumasson, Andy Polyakov, Russell King - ARM Linux,
	linux-arm-kernel
In-Reply-To: <CAKv+Gu8BD=fLk3zm8tvRQ3H-yiePqzXOrKLEz1BLFSRRz2opOQ@mail.gmail.com>

Hi Ard,

On Fri, Sep 14, 2018 at 7:27 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> As I asked in response to v3, could we please have this as a separate
> patch on top? The diff below is corrupted.

I had played with that originally, but thought it made things actually
harder to review, whereas here you have the changes presented pretty
straightforwardly, and I'd appreciate your review of them. If you and
Eric both prefer I split this into two commits, with the first one
just plopping down the CRYPTOGAMS code as is and the second one
bringing it up to kernel-snuff, I can do that.

> Also, both Andy and Eric have offered to get involved in upstreaming
> these changes to OpenSSL, so there is no delta to begin with.

Yes, I think this is probably a good long-term plan, which we can act
on sometime after Zinc is merged.

> I still don't like the GCC -includes, especially because these .h
> files contain function and variable definitions so they are not
> actually header files to begin with.

I very very strongly disagree with you here. I think doing it via
-include is significantly cleaner than any of the alternatives, and
allows the code to be cleanly expressed as conditionals that the
optimizer trivially compiles out in the case of stub functions
returning false and branch optimizes when the stub functions return
true. It is extremely important that these compile together as one
compilation unit. Yes, this is a different design than the crypto
API's approach, but I believe the approach presented here poses
significant improvements and is a lot cleaner.
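
The pattern I mean looks roughly like the following sketch. It is only an
illustration, not Zinc code: toy_sum_arch(), toy_sum_generic() and toy_sum()
are hypothetical names, and a byte sum stands in for the real primitive.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stub used when no accelerated implementation is built in. It is a
 * static inline that always returns false, so the compiler can resolve
 * the branch on its result at compile time. */
static inline bool toy_sum_arch(uint32_t *out, const uint8_t *in, size_t len)
{
	(void)out;
	(void)in;
	(void)len;
	return false;
}

/* Portable fallback: add up all input bytes. */
static void toy_sum_generic(uint32_t *out, const uint8_t *in, size_t len)
{
	uint32_t acc = 0;
	size_t i;

	for (i = 0; i < len; i++)
		acc += in[i];
	*out = acc;
}

/* Entry point: try the arch path first, fall back to the generic one. */
uint32_t toy_sum(const uint8_t *in, size_t len)
{
	uint32_t out = 0;

	if (!toy_sum_arch(&out, in, len))
		toy_sum_generic(&out, in, len);
	return out;
}
```

Because the stub and the caller sit in one compilation unit, the conditional
costs nothing when only the generic code exists.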

> Also, you mentioned in the commit log that you got rid of defines and
> made the code more modular, but as far as I can tell, libzinc is still
> a single monolithic binary that is essentially always builtin once we
> move random.c to it.

Yes, it's still monolithic, but it's now trivial to split up when the
time comes to do that. If you and AndyL think that it should be split
into multiple modules _now_, then I can go ahead and do that for v5.
But if it's not essential, it seems simpler to keep it as is. I'll
wait for word from you two on this.

Jason

^ permalink raw reply

* Re: [PATCH net-next v4 00/20] WireGuard: Secure Network Tunnel
From: Ard Biesheuvel @ 2018-09-14 17:39 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Linux Kernel Mailing List, <netdev@vger.kernel.org>,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE, David S. Miller,
	Greg Kroah-Hartman
In-Reply-To: <20180914161954.7325-1-Jason@zx2c4.com>

On 14 September 2018 at 18:19, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Changes v3->v4:
>   - Remove mistaken double 07/17 patch.
>   - Fix whitespace issues in blake2s assembly.
>   - It's not possible to put compound literals into __initconst, so
>     we now instead just use boring fixed size struct members.
>   - Move away from makefile ifdef maze and instead prefer kconfig values,
>     which also makes the design a bit more modular too, which could help
>     in the future.

Could you elaborate on this? From the patches, it is not clear to me
how this has improved.

>   - Port old crypto API implementations (ChaCha20 and Poly1305) to Zinc.
>   - Port security/keys/big_key to Zinc as second example of a good usage of
>     Zinc.
>   - Document precisely what is different between the kernel code and
>     CRYPTOGAMS code when the CRYPTOGAMS code is used.
>   - Move changelog to top of 00/20 message so that people can
>     actually find it.
>
> -----------------------------------------------------------
>
> This patchset is available on git.kernel.org in this branch, where it may be
> pulled directly for inclusion into net-next:
>
>   * https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/log/?h=jd/wireguard
>
> -----------------------------------------------------------
>
> WireGuard is a secure network tunnel written especially for Linux, which
> has faced around three years of serious development, deployment, and
> scrutiny. It delivers excellent performance and is extremely easy to
> use and configure. It has been designed with the primary goal of being
> both easy to audit by virtue of being small and highly secure from a
> cryptography and systems security perspective. WireGuard is used by some
> massive companies pushing enormous amounts of traffic, and likely
> already today you've consumed bytes that at some point transited through
> a WireGuard tunnel. Even as an out-of-tree module, WireGuard has been
> integrated into various userspace tools, Linux distributions, mobile
> phones, and data centers. There are ports in several languages to
> several operating systems, and even commercial hardware and services
> sold integrating WireGuard. It is time, therefore, for WireGuard to be
> properly integrated into Linux.
>
> Ample information, including documentation, installation instructions,
> and project details, is available at:
>
>   * https://www.wireguard.com/
>   * https://www.wireguard.com/papers/wireguard.pdf
>
> As it is currently an out-of-tree module, it lives in its own git repo
> and has its own mailing list, and every commit for the module is tested
> against every stable kernel since 3.10 on a variety of architectures
> using an extensive test suite:
>
>   * https://git.zx2c4.com/WireGuard
>     https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/WireGuard.git/
>   * https://lists.zx2c4.com/mailman/listinfo/wireguard
>   * https://www.wireguard.com/build-status/
>
> The project has been broadly discussed at conferences, and was presented
> to the Netdev developers in Seoul last November, where a paper was
> released detailing some interesting aspects of the project. Dave asked
> me after the talk if I would consider sending in a v1 "sooner rather
> than later", hence this patchset. A decision is still waiting from the
> Linux Plumbers Conference, but an update on these topics may be presented
> in Vancouver in a few months. Prior presentations:
>
>   * https://www.wireguard.com/presentations/
>   * https://www.wireguard.com/papers/wireguard-netdev22.pdf
>
> The cryptography in the protocol itself has been formally verified by
> several independent academic teams with positive results, and I know of
> two additional efforts on their way to further corroborate those
> findings. The version 1 protocol is "complete", and so the purpose of
> this review is to assess the implementation of the protocol. However, it
> still may be of interest to know that the thing you're reviewing uses a
> protocol with various nice security properties:
>
>   * https://www.wireguard.com/formal-verification/
>
> This patchset is divided into four segments. The first introduces a very
> simple helper for working with the FPU state for the purposes of amortizing
> SIMD operations. The second segment is a small collection of cryptographic
> primitives, split up into several commits by primitive and by hardware. The
> third shows usage of Zinc within the existing crypto API and as a replacement
> to the existing crypto API. The last is WireGuard itself, presented as an
> unintrusive and self-contained virtual network driver.
>
> It is intended that this entire patch series enter the kernel through
> DaveM's net-next tree. Subsequently, WireGuard patches will go through
> DaveM's net-next tree, while Zinc patches will go through Greg KH's tree.
>
> Enjoy,
> Jason

* Re: [PATCH net-next 2/7] net: phy: mscc: add support for VSC8584 PHY
From: Andrew Lunn @ 2018-09-14 17:27 UTC (permalink / raw)
  To: Quentin Schulz
  Cc: alexandre.belloni, ralf, paul.burton, jhogan, robh+dt,
	mark.rutland, davem, f.fainelli, allan.nielsen, linux-mips,
	devicetree, linux-kernel, netdev, thomas.petazzoni,
	antoine.tenart
In-Reply-To: <a61d9affd3f1ec9deb60c882cce1daf37fbe2427.1536916714.git-series.quentin.schulz@bootlin.com>


>  struct vsc8531_private {
>  	int rate_magic;
>  	u16 supp_led_modes;
> @@ -181,6 +354,7 @@ struct vsc8531_private {
>  	struct vsc85xx_hw_stat *hw_stats;
>  	u64 *stats;
>  	int nstats;
> +	bool pkg_init;

> +/* bus->mdio_lock should be locked when using this function */
> +static int vsc8584_cmd(struct mii_bus *bus, int phy, u16 val)
> +{
> +	unsigned long deadline;
> +	u16 reg_val;
> +
> +	__mdiobus_write(bus, phy, MSCC_EXT_PAGE_ACCESS,
> +			MSCC_PHY_PAGE_EXTENDED_GPIO);
> +
> +	__mdiobus_write(bus, phy, MSCC_PHY_PROC_CMD, PROC_CMD_NCOMPLETED | val);

Hi Quentin

All the __mdiobus_write() calls look a bit ugly. Maybe add bus and
base_addr to the vsc8531_private structure. Then add helpers
phy_write_base_phy(priv, reg, val) and phy_read_base_phy(priv, reg).

You could also add in:

        if (unlikely(!mutex_is_locked(&priv->bus->mdio_lock))) {
                dev_err(&priv->bus->dev, "MDIO bus lock not held!\n");
                dump_stack();
        }

Having such code in the mv88e6xxx driver has found a few bugs for me.
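
A rough userspace sketch of those helpers, just to show the shape.
Everything prefixed demo_ is a hypothetical stand-in: in the driver,
the bus would be a struct mii_bus, the locked flag would be
mutex_is_locked(&bus->mdio_lock), and the register array would be the
unlocked __mdiobus_read()/__mdiobus_write() accessors. The violation
counter exists only so the check can be observed outside the kernel:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for struct mii_bus and its mdio_lock. */
struct demo_bus {
	bool locked;			/* models mutex_is_locked() */
	uint16_t regs[32][32];		/* [phy address][register] */
	unsigned int lock_violations;	/* models dev_err() + dump_stack() */
};

/* Hypothetical stand-in for vsc8531_private with the two new members. */
struct demo_priv {
	struct demo_bus *bus;
	int base_addr;
};

/* Complain if the caller forgot to take the bus lock. */
static void demo_check_locked(struct demo_priv *priv)
{
	if (!priv->bus->locked) {
		fprintf(stderr, "MDIO bus lock not held!\n");
		priv->bus->lock_violations++;
	}
}

/* Write a register of the base PHY; the bus lock must be held. */
static int phy_write_base_phy(struct demo_priv *priv, uint32_t reg,
			      uint16_t val)
{
	demo_check_locked(priv);
	priv->bus->regs[priv->base_addr & 31][reg & 31] = val;
	return 0;
}

/* Read a register of the base PHY; the bus lock must be held. */
static int phy_read_base_phy(struct demo_priv *priv, uint32_t reg)
{
	demo_check_locked(priv);
	return priv->bus->regs[priv->base_addr & 31][reg & 31];
}
```

With this shape, callers such as vsc8584_cmd() would pass priv instead
of repeating the bus and phy arguments on every access.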

       Andrew

* Re: [PATCH net-next v4 08/20] zinc: Poly1305 ARM and ARM64 implementations
From: Ard Biesheuvel @ 2018-09-14 17:27 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Linux Kernel Mailing List, <netdev@vger.kernel.org>,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE, David S. Miller,
	Greg Kroah-Hartman, Samuel Neves, Andy Lutomirski,
	Jean-Philippe Aumasson, Andy Polyakov, Russell King,
	linux-arm-kernel
In-Reply-To: <20180914162240.7925-9-Jason@zx2c4.com>

On 14 September 2018 at 18:22, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> These NEON and non-NEON implementations come from Andy Polyakov's
> implementation. They are exactly the same as Andy Polyakov's original,
> with the following exceptions:
>
> - Entries and exits use the proper kernel convention macro.
> - CPU feature checking is done in C by the glue code, so that has been
>   removed from the assembly.
> - The function names have been renamed to fit kernel conventions.
> - Labels have been renamed to fit kernel conventions.
> - The neon code can jump to the scalar code when it makes sense to do
>   so.
>
> After '/^#/d;/^\..*[^:]$/d', the code has the following diff in actual
> instructions from the original.
>

As I asked in response to v3, could we please have this as a separate
patch on top? The diff below is corrupted.

Also, both Andy and Eric have offered to get involved in upstreaming
these changes to OpenSSL, so there is no delta to begin with.

> ARM:
>
> -poly1305_init:
> -.Lpoly1305_init:
> +ENTRY(poly1305_init_arm)
>         stmdb   sp!,{r4-r11}
>
>         eor     r3,r3,r3
> @@ -18,8 +25,6 @@
>         moveq   r0,#0
>         beq     .Lno_key
>
> -       adr     r11,.Lpoly1305_init
> -       ldr     r12,.LOPENSSL_armcap
>         ldrb    r4,[r1,#0]
>         mov     r10,#0x0fffffff
>         ldrb    r5,[r1,#1]
> @@ -34,8 +39,6 @@
>         ldrb    r7,[r1,#6]
>         and     r4,r4,r10
>
> -       ldr     r12,[r11,r12]           @ OPENSSL_armcap_P
> -       ldr     r12,[r12]
>         ldrb    r8,[r1,#7]
>         orr     r5,r5,r6,lsl#8
>         ldrb    r6,[r1,#8]
> @@ -45,22 +48,6 @@
>         ldrb    r8,[r1,#10]
>         and     r5,r5,r3
>
> -       tst     r12,#ARMV7_NEON         @ check for NEON
> -       adr     r9,poly1305_blocks_neon
> -       adr     r11,poly1305_blocks
> -       it      ne
> -       movne   r11,r9
> -       adr     r12,poly1305_emit
> -       adr     r10,poly1305_emit_neon
> -       it      ne
> -       movne   r12,r10
> -       itete   eq
> -       addeq   r12,r11,#(poly1305_emit-.Lpoly1305_init)
> -       addne   r12,r11,#(poly1305_emit_neon-.Lpoly1305_init)
> -       addeq   r11,r11,#(poly1305_blocks-.Lpoly1305_init)
> -       addne   r11,r11,#(poly1305_blocks_neon-.Lpoly1305_init)
> -       orr     r12,r12,#1      @ thumb-ify address
> -       orr     r11,r11,#1
>         ldrb    r9,[r1,#11]
>         orr     r6,r6,r7,lsl#8
>         ldrb    r7,[r1,#12]
> @@ -79,17 +66,16 @@
>         str     r6,[r0,#8]
>         and     r7,r7,r3
>         str     r7,[r0,#12]
> -       stmia   r2,{r11,r12}            @ fill functions table
> -       mov     r0,#1
> -       mov     r0,#0
>  .Lno_key:
>         ldmia   sp!,{r4-r11}
>         bx      lr                              @ bx    lr
>         tst     lr,#1
>         moveq   pc,lr                   @ be binary compatible with V4, yet
>         .word   0xe12fff1e                      @ interoperable with Thumb ISA:-)
> -poly1305_blocks:
> -.Lpoly1305_blocks:
> +ENDPROC(poly1305_init_arm)
> +
> +ENTRY(poly1305_blocks_arm)
> +.Lpoly1305_blocks_arm:
>         stmdb   sp!,{r3-r11,lr}
>
>         ands    r2,r2,#-16
> @@ -231,10 +217,11 @@
>         tst     lr,#1
>         moveq   pc,lr                   @ be binary compatible with V4, yet
>         .word   0xe12fff1e                      @ interoperable with Thumb ISA:-)
> -poly1305_emit:
> +ENDPROC(poly1305_blocks_arm)
> +
> +ENTRY(poly1305_emit_arm)
>         stmdb   sp!,{r4-r11}
>  .Lpoly1305_emit_enter:
> -
>         ldmia   r0,{r3-r7}
>         adds    r8,r3,#5                @ compare to modulus
>         adcs    r9,r4,#0
> @@ -305,8 +292,12 @@
>         tst     lr,#1
>         moveq   pc,lr                   @ be binary compatible with V4, yet
>         .word   0xe12fff1e                      @ interoperable with Thumb ISA:-)
> +ENDPROC(poly1305_emit_arm)
> +
> +
>
> -poly1305_init_neon:
> +ENTRY(poly1305_init_neon)
> +.Lpoly1305_init_neon:
>         ldr     r4,[r0,#20]             @ load key base 2^32
>         ldr     r5,[r0,#24]
>         ldr     r6,[r0,#28]
> @@ -515,8 +506,9 @@
>         vst1.32         {d8[1]},[r7]
>
>         bx      lr                              @ bx    lr
> +ENDPROC(poly1305_init_neon)
>
> -poly1305_blocks_neon:
> +ENTRY(poly1305_blocks_neon)
>         ldr     ip,[r0,#36]             @ is_base2_26
>         ands    r2,r2,#-16
>         beq     .Lno_data_neon
> @@ -524,7 +516,7 @@
>         cmp     r2,#64
>         bhs     .Lenter_neon
>         tst     ip,ip                   @ is_base2_26?
> -       beq     .Lpoly1305_blocks
> +       beq     .Lpoly1305_blocks_arm
>
>  .Lenter_neon:
>         stmdb   sp!,{r4-r7}
> @@ -534,7 +526,7 @@
>         bne     .Lbase2_26_neon
>
>         stmdb   sp!,{r1-r3,lr}
> -       bl      poly1305_init_neon
> +       bl      .Lpoly1305_init_neon
>
>         ldr     r4,[r0,#0]              @ load hash value base 2^32
>         ldr     r5,[r0,#4]
> @@ -989,8 +981,9 @@
>         ldmia   sp!,{r4-r7}
>  .Lno_data_neon:
>         bx      lr                                      @ bx    lr
> +ENDPROC(poly1305_blocks_neon)
>
> -poly1305_emit_neon:
> +ENTRY(poly1305_emit_neon)
>         ldr     ip,[r0,#36]             @ is_base2_26
>
>         stmdb   sp!,{r4-r11}
> @@ -1055,6 +1048,6 @@
>
>         ldmia   sp!,{r4-r11}
>         bx      lr                              @ bx    lr
> +ENDPROC(poly1305_emit_neon)
>
> ARM64:
>
> -poly1305_init:
> +ENTRY(poly1305_init_arm)
>         cmp     x1,xzr
>         stp     xzr,xzr,[x0]            // zero hash value
>         stp     xzr,xzr,[x0,#16]        // [along with is_base2_26]
> @@ -11,14 +15,9 @@
>         csel    x0,xzr,x0,eq
>         b.eq    .Lno_key
>
> -       ldrsw   x11,.LOPENSSL_armcap_P
> -       ldr     x11,.LOPENSSL_armcap_P

In the original, this looks like

#ifdef __ILP32__
        ldrsw $t1,.LOPENSSL_armcap_P
#else
        ldr $t1,.LOPENSSL_armcap_P
#endif


so I guess git commit ate those lines.

> -       adr     x10,.LOPENSSL_armcap_P
> -
>         ldp     x7,x8,[x1]              // load key
>         mov     x9,#0xfffffffc0fffffff
>         movk    x9,#0x0fff,lsl#48
> -       ldr     w17,[x10,x11]
>         rev     x7,x7                   // flip bytes
>         rev     x8,x8
>         and     x7,x7,x9                // &=0ffffffc0fffffff
> @@ -26,24 +25,11 @@
>         and     x8,x8,x9                // &=0ffffffc0ffffffc
>         stp     x7,x8,[x0,#32]  // save key value
>
> -       tst     w17,#ARMV7_NEON
> -
> -       adr     x12,poly1305_blocks
> -       adr     x7,poly1305_blocks_neon
> -       adr     x13,poly1305_emit
> -       adr     x8,poly1305_emit_neon
> -
> -       csel    x12,x12,x7,eq
> -       csel    x13,x13,x8,eq
> -
> -       stp     w12,w13,[x2]
> -       stp     x12,x13,[x2]
> -
> -       mov     x0,#1
>  .Lno_key:
>         ret
> +ENDPROC(poly1305_init_arm)
>
> -poly1305_blocks:
> +ENTRY(poly1305_blocks_arm)
>         ands    x2,x2,#-16
>         b.eq    .Lno_data
>
> @@ -100,8 +86,9 @@
>
>  .Lno_data:
>         ret
> +ENDPROC(poly1305_blocks_arm)
>
> -poly1305_emit:
> +ENTRY(poly1305_emit_arm)
>         ldp     x4,x5,[x0]              // load hash base 2^64
>         ldr     x6,[x0,#16]
>         ldp     x10,x11,[x2]    // load nonce
> @@ -124,7 +111,9 @@
>         stp     x4,x5,[x1]              // write result
>
>         ret
> -poly1305_mult:
> +ENDPROC(poly1305_emit_arm)
> +
> +__poly1305_mult:
>         mul     x12,x4,x7               // h0*r0
>         umulh   x13,x4,x7
>
> @@ -158,7 +147,7 @@
>
>         ret
>
> -poly1305_splat:
> +__poly1305_splat:
>         and     x12,x4,#0x03ffffff      // base 2^64 -> base 2^26
>         ubfx    x13,x4,#26,#26
>         extr    x14,x5,x4,#52
> @@ -182,11 +171,11 @@
>
>         ret
>
> -poly1305_blocks_neon:
> +ENTRY(poly1305_blocks_neon)
>         ldr     x17,[x0,#24]
>         cmp     x2,#128
>         b.hs    .Lblocks_neon
> -       cbz     x17,poly1305_blocks
> +       cbz     x17,poly1305_blocks_arm
>
>  .Lblocks_neon:
>         stp     x29,x30,[sp,#-80]!
> @@ -232,7 +221,7 @@
>         adcs    x5,x5,x13
>         adc     x6,x6,x3
>
> -       bl      poly1305_mult
> +       bl      __poly1305_mult
>         ldr     x30,[sp,#8]
>
>         cbz     x3,.Lstore_base2_64_neon
> @@ -274,7 +263,7 @@
>         adcs    x5,x5,x13
>         adc     x6,x6,x3
>
> -       bl      poly1305_mult
> +       bl      __poly1305_mult
>
>  .Linit_neon:
>         and     x10,x4,#0x03ffffff      // base 2^64 -> base 2^26
> @@ -301,19 +290,19 @@
>         mov     x5,x8
>         mov     x6,xzr
>         add     x0,x0,#48+12
> -       bl      poly1305_splat
> +       bl      __poly1305_splat
>
> -       bl      poly1305_mult           // r^2
> +       bl      __poly1305_mult         // r^2
>         sub     x0,x0,#4
> -       bl      poly1305_splat
> +       bl      __poly1305_splat
>
> -       bl      poly1305_mult           // r^3
> +       bl      __poly1305_mult         // r^3
>         sub     x0,x0,#4
> -       bl      poly1305_splat
> +       bl      __poly1305_splat
>
> -       bl      poly1305_mult           // r^4
> +       bl      __poly1305_mult         // r^4
>         sub     x0,x0,#4
> -       bl      poly1305_splat
> +       bl      __poly1305_splat
>         ldr     x30,[sp,#8]
>
>         add     x16,x1,#32
> @@ -743,10 +732,11 @@
>  .Lno_data_neon:
>         ldr     x29,[sp],#80
>         ret
> +ENDPROC(poly1305_blocks_neon)
>
> -poly1305_emit_neon:
> +ENTRY(poly1305_emit_neon)
>         ldr     x17,[x0,#24]
> -       cbz     x17,poly1305_emit
> +       cbz     x17,poly1305_emit_arm
>
>         ldp     w10,w11,[x0]            // load hash value base 2^26
>         ldp     w12,w13,[x0,#8]
> @@ -788,6 +778,6 @@
>         stp     x4,x5,[x1]              // write result
>
>         ret
> +ENDPROC(poly1305_emit_neon)
>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: Samuel Neves <sneves@dei.uc.pt>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
> Cc: Andy Polyakov <appro@openssl.org>
> Cc: Russell King <linux@armlinux.org.uk>
> Cc: linux-arm-kernel@lists.infradead.org
> ---
>  lib/zinc/Makefile                     |    8 +
>  lib/zinc/poly1305/poly1305-arm-glue.h |   69 ++
>  lib/zinc/poly1305/poly1305-arm.S      | 1117 +++++++++++++++++++++++++
>  lib/zinc/poly1305/poly1305-arm64.S    |  822 ++++++++++++++++++
>  4 files changed, 2016 insertions(+)
>  create mode 100644 lib/zinc/poly1305/poly1305-arm-glue.h
>  create mode 100644 lib/zinc/poly1305/poly1305-arm.S
>  create mode 100644 lib/zinc/poly1305/poly1305-arm64.S
>
> diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
> index d1e3892e06d9..f37df89a3f87 100644
> --- a/lib/zinc/Makefile
> +++ b/lib/zinc/Makefile
> @@ -25,6 +25,14 @@ endif
>
>  ifeq ($(CONFIG_ZINC_POLY1305),y)
>  zinc-y += poly1305/poly1305.o
> +ifeq ($(CONFIG_ZINC_ARCH_ARM),y)
> +zinc-y += poly1305/poly1305-arm.o
> +CFLAGS_poly1305.o += -include $(srctree)/$(src)/poly1305/poly1305-arm-glue.h
> +endif
> +ifeq ($(CONFIG_ZINC_ARCH_ARM64),y)
> +zinc-y += poly1305/poly1305-arm64.o
> +CFLAGS_poly1305.o += -include $(srctree)/$(src)/poly1305/poly1305-arm-glue.h
> +endif
>  endif
>

I still don't like the GCC -includes, especially because these .h
files contain function and variable definitions so they are not
actually header files to begin with.

Also, you mentioned in the commit log that you got rid of defines and
made the code more modular, but as far as I can tell, libzinc is still
a single monolithic binary that is essentially always builtin once we
move random.c to it.

>  zinc-y += main.o
> diff --git a/lib/zinc/poly1305/poly1305-arm-glue.h b/lib/zinc/poly1305/poly1305-arm-glue.h
> new file mode 100644
> index 000000000000..53f8fec7f858
> --- /dev/null
> +++ b/lib/zinc/poly1305/poly1305-arm-glue.h
> @@ -0,0 +1,69 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
> + */
> +
> +#include <zinc/poly1305.h>
> +#include <asm/hwcap.h>
> +#include <asm/neon.h>
> +
> +asmlinkage void poly1305_init_arm(void *ctx, const u8 key[16]);
> +asmlinkage void poly1305_blocks_arm(void *ctx, const u8 *inp, const size_t len,
> +                                   const u32 padbit);
> +asmlinkage void poly1305_emit_arm(void *ctx, u8 mac[16], const u32 nonce[4]);
> +#if IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&                                     \
> +       (defined(CONFIG_64BIT) || __LINUX_ARM_ARCH__ >= 7)
> +#define ARM_USE_NEON
> +asmlinkage void poly1305_blocks_neon(void *ctx, const u8 *inp, const size_t len,
> +                                    const u32 padbit);
> +asmlinkage void poly1305_emit_neon(void *ctx, u8 mac[16], const u32 nonce[4]);
> +#endif
> +
> +static bool poly1305_use_neon __ro_after_init;
> +
> +void __init poly1305_fpu_init(void)
> +{
> +#if defined(CONFIG_ARM64)
> +       poly1305_use_neon = elf_hwcap & HWCAP_ASIMD;
> +#elif defined(CONFIG_ARM)
> +       poly1305_use_neon = elf_hwcap & HWCAP_NEON;
> +#endif
> +}
> +
> +static inline bool poly1305_init_arch(void *ctx,
> +                                     const u8 key[POLY1305_KEY_SIZE],
> +                                     simd_context_t simd_context)
> +{
> +       poly1305_init_arm(ctx, key);
> +       return true;
> +}
> +
> +static inline bool poly1305_blocks_arch(void *ctx, const u8 *inp,
> +                                       const size_t len, const u32 padbit,
> +                                       simd_context_t simd_context)
> +{
> +#if defined(ARM_USE_NEON)
> +       if (simd_context == HAVE_FULL_SIMD && poly1305_use_neon) {
> +               poly1305_blocks_neon(ctx, inp, len, padbit);
> +               return true;
> +       }
> +#endif
> +       poly1305_blocks_arm(ctx, inp, len, padbit);
> +       return true;
> +}
> +
> +static inline bool poly1305_emit_arch(void *ctx, u8 mac[POLY1305_MAC_SIZE],
> +                                     const u32 nonce[4],
> +                                     simd_context_t simd_context)
> +{
> +#if defined(ARM_USE_NEON)
> +       if (simd_context == HAVE_FULL_SIMD && poly1305_use_neon) {
> +               poly1305_emit_neon(ctx, mac, nonce);
> +               return true;
> +       }
> +#endif
> +       poly1305_emit_arm(ctx, mac, nonce);
> +       return true;
> +}
> +
> +#define HAVE_POLY1305_ARCH_IMPLEMENTATION

We shouldn't #define HAVE_xxx constants in code but only in Kconfig.

> diff --git a/lib/zinc/poly1305/poly1305-arm.S b/lib/zinc/poly1305/poly1305-arm.S
> new file mode 100644
> index 000000000000..110f4317b5d7
> --- /dev/null
> +++ b/lib/zinc/poly1305/poly1305-arm.S
> @@ -0,0 +1,1117 @@
> +/* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0
> + *
> + * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
> + * Copyright (C) 2006-2017 CRYPTOGAMS by <appro@openssl.org>. All Rights Reserved.
> + *
> + * This is based in part on Andy Polyakov's implementation from CRYPTOGAMS.
> + */
> +
> +#include <linux/linkage.h>
> +
> +.text
> +#if defined(__thumb2__)
> +.syntax        unified
> +.thumb
> +#else
> +.code  32
> +#endif
> +
> +.align 5
> +ENTRY(poly1305_init_arm)
> +       stmdb   sp!,{r4-r11}
> +
> +       eor     r3,r3,r3
> +       cmp     r1,#0
> +       str     r3,[r0,#0]              @ zero hash value
> +       str     r3,[r0,#4]
> +       str     r3,[r0,#8]
> +       str     r3,[r0,#12]
> +       str     r3,[r0,#16]
> +       str     r3,[r0,#36]             @ is_base2_26
> +       add     r0,r0,#20
> +
> +#ifdef __thumb2__
> +       it      eq
> +#endif
> +       moveq   r0,#0
> +       beq     .Lno_key
> +
> +       ldrb    r4,[r1,#0]
> +       mov     r10,#0x0fffffff
> +       ldrb    r5,[r1,#1]
> +       and     r3,r10,#-4              @ 0x0ffffffc
> +       ldrb    r6,[r1,#2]
> +       ldrb    r7,[r1,#3]
> +       orr     r4,r4,r5,lsl#8
> +       ldrb    r5,[r1,#4]
> +       orr     r4,r4,r6,lsl#16
> +       ldrb    r6,[r1,#5]
> +       orr     r4,r4,r7,lsl#24
> +       ldrb    r7,[r1,#6]
> +       and     r4,r4,r10
> +
> +       ldrb    r8,[r1,#7]
> +       orr     r5,r5,r6,lsl#8
> +       ldrb    r6,[r1,#8]
> +       orr     r5,r5,r7,lsl#16
> +       ldrb    r7,[r1,#9]
> +       orr     r5,r5,r8,lsl#24
> +       ldrb    r8,[r1,#10]
> +       and     r5,r5,r3
> +
> +       ldrb    r9,[r1,#11]
> +       orr     r6,r6,r7,lsl#8
> +       ldrb    r7,[r1,#12]
> +       orr     r6,r6,r8,lsl#16
> +       ldrb    r8,[r1,#13]
> +       orr     r6,r6,r9,lsl#24
> +       ldrb    r9,[r1,#14]
> +       and     r6,r6,r3
> +
> +       ldrb    r10,[r1,#15]
> +       orr     r7,r7,r8,lsl#8
> +       str     r4,[r0,#0]
> +       orr     r7,r7,r9,lsl#16
> +       str     r5,[r0,#4]
> +       orr     r7,r7,r10,lsl#24
> +       str     r6,[r0,#8]
> +       and     r7,r7,r3
> +       str     r7,[r0,#12]
> +.Lno_key:
> +       ldmia   sp!,{r4-r11}
> +#if __LINUX_ARM_ARCH__ >= 5
> +       bx      lr                              @ bx    lr
> +#else
> +       tst     lr,#1
> +       moveq   pc,lr                   @ be binary compatible with V4, yet
> +       .word   0xe12fff1e                      @ interoperable with Thumb ISA:-)
> +#endif
> +ENDPROC(poly1305_init_arm)
> +
> +.align 5
> +ENTRY(poly1305_blocks_arm)
> +.Lpoly1305_blocks_arm:
> +       stmdb   sp!,{r3-r11,lr}
> +
> +       ands    r2,r2,#-16
> +       beq     .Lno_data
> +
> +       cmp     r3,#0
> +       add     r2,r2,r1                @ end pointer
> +       sub     sp,sp,#32
> +
> +       ldmia   r0,{r4-r12}             @ load context
> +
> +       str     r0,[sp,#12]             @ offload stuff
> +       mov     lr,r1
> +       str     r2,[sp,#16]
> +       str     r10,[sp,#20]
> +       str     r11,[sp,#24]
> +       str     r12,[sp,#28]
> +       b       .Loop
> +
> +.Loop:
> +#if __LINUX_ARM_ARCH__ < 7
> +       ldrb    r0,[lr],#16             @ load input
> +#ifdef __thumb2__
> +       it      hi
> +#endif
> +       addhi   r8,r8,#1                @ 1<<128
> +       ldrb    r1,[lr,#-15]
> +       ldrb    r2,[lr,#-14]
> +       ldrb    r3,[lr,#-13]
> +       orr     r1,r0,r1,lsl#8
> +       ldrb    r0,[lr,#-12]
> +       orr     r2,r1,r2,lsl#16
> +       ldrb    r1,[lr,#-11]
> +       orr     r3,r2,r3,lsl#24
> +       ldrb    r2,[lr,#-10]
> +       adds    r4,r4,r3                @ accumulate input
> +
> +       ldrb    r3,[lr,#-9]
> +       orr     r1,r0,r1,lsl#8
> +       ldrb    r0,[lr,#-8]
> +       orr     r2,r1,r2,lsl#16
> +       ldrb    r1,[lr,#-7]
> +       orr     r3,r2,r3,lsl#24
> +       ldrb    r2,[lr,#-6]
> +       adcs    r5,r5,r3
> +
> +       ldrb    r3,[lr,#-5]
> +       orr     r1,r0,r1,lsl#8
> +       ldrb    r0,[lr,#-4]
> +       orr     r2,r1,r2,lsl#16
> +       ldrb    r1,[lr,#-3]
> +       orr     r3,r2,r3,lsl#24
> +       ldrb    r2,[lr,#-2]
> +       adcs    r6,r6,r3
> +
> +       ldrb    r3,[lr,#-1]
> +       orr     r1,r0,r1,lsl#8
> +       str     lr,[sp,#8]              @ offload input pointer
> +       orr     r2,r1,r2,lsl#16
> +       add     r10,r10,r10,lsr#2
> +       orr     r3,r2,r3,lsl#24
> +#else
> +       ldr     r0,[lr],#16             @ load input
> +#ifdef __thumb2__
> +       it      hi
> +#endif
> +       addhi   r8,r8,#1                @ padbit
> +       ldr     r1,[lr,#-12]
> +       ldr     r2,[lr,#-8]
> +       ldr     r3,[lr,#-4]
> +#ifdef __ARMEB__
> +       rev     r0,r0
> +       rev     r1,r1
> +       rev     r2,r2
> +       rev     r3,r3
> +#endif
> +       adds    r4,r4,r0                @ accumulate input
> +       str     lr,[sp,#8]              @ offload input pointer
> +       adcs    r5,r5,r1
> +       add     r10,r10,r10,lsr#2
> +       adcs    r6,r6,r2
> +#endif
> +       add     r11,r11,r11,lsr#2
> +       adcs    r7,r7,r3
> +       add     r12,r12,r12,lsr#2
> +
> +       umull   r2,r3,r5,r9
> +        adc    r8,r8,#0
> +       umull   r0,r1,r4,r9
> +       umlal   r2,r3,r8,r10
> +       umlal   r0,r1,r7,r10
> +       ldr     r10,[sp,#20]            @ reload r10
> +       umlal   r2,r3,r6,r12
> +       umlal   r0,r1,r5,r12
> +       umlal   r2,r3,r7,r11
> +       umlal   r0,r1,r6,r11
> +       umlal   r2,r3,r4,r10
> +       str     r0,[sp,#0]              @ future r4
> +        mul    r0,r11,r8
> +       ldr     r11,[sp,#24]            @ reload r11
> +       adds    r2,r2,r1                @ d1+=d0>>32
> +        eor    r1,r1,r1
> +       adc     lr,r3,#0                @ future r6
> +       str     r2,[sp,#4]              @ future r5
> +
> +       mul     r2,r12,r8
> +       eor     r3,r3,r3
> +       umlal   r0,r1,r7,r12
> +       ldr     r12,[sp,#28]            @ reload r12
> +       umlal   r2,r3,r7,r9
> +       umlal   r0,r1,r6,r9
> +       umlal   r2,r3,r6,r10
> +       umlal   r0,r1,r5,r10
> +       umlal   r2,r3,r5,r11
> +       umlal   r0,r1,r4,r11
> +       umlal   r2,r3,r4,r12
> +       ldr     r4,[sp,#0]
> +       mul     r8,r9,r8
> +       ldr     r5,[sp,#4]
> +
> +       adds    r6,lr,r0                @ d2+=d1>>32
> +       ldr     lr,[sp,#8]              @ reload input pointer
> +       adc     r1,r1,#0
> +       adds    r7,r2,r1                @ d3+=d2>>32
> +       ldr     r0,[sp,#16]             @ reload end pointer
> +       adc     r3,r3,#0
> +       add     r8,r8,r3                @ h4+=d3>>32
> +
> +       and     r1,r8,#-4
> +       and     r8,r8,#3
> +       add     r1,r1,r1,lsr#2          @ *=5
> +       adds    r4,r4,r1
> +       adcs    r5,r5,#0
> +       adcs    r6,r6,#0
> +       adcs    r7,r7,#0
> +       adc     r8,r8,#0
> +
> +       cmp     r0,lr                   @ done yet?
> +       bhi     .Loop
> +
> +       ldr     r0,[sp,#12]
> +       add     sp,sp,#32
> +       stmia   r0,{r4-r8}              @ store the result
> +
> +.Lno_data:
> +#if __LINUX_ARM_ARCH__ >= 5
> +       ldmia   sp!,{r3-r11,pc}
> +#else
> +       ldmia   sp!,{r3-r11,lr}
> +       tst     lr,#1
> +       moveq   pc,lr                   @ be binary compatible with V4, yet
> +       .word   0xe12fff1e                      @ interoperable with Thumb ISA:-)
> +#endif
> +ENDPROC(poly1305_blocks_arm)
> +
> +.align 5
> +ENTRY(poly1305_emit_arm)
> +       stmdb   sp!,{r4-r11}
> +.Lpoly1305_emit_enter:
> +       ldmia   r0,{r3-r7}
> +       adds    r8,r3,#5                @ compare to modulus
> +       adcs    r9,r4,#0
> +       adcs    r10,r5,#0
> +       adcs    r11,r6,#0
> +       adc     r7,r7,#0
> +       tst     r7,#4                   @ did it carry/borrow?
> +
> +#ifdef __thumb2__
> +       it      ne
> +#endif
> +       movne   r3,r8
> +       ldr     r8,[r2,#0]
> +#ifdef __thumb2__
> +       it      ne
> +#endif
> +       movne   r4,r9
> +       ldr     r9,[r2,#4]
> +#ifdef __thumb2__
> +       it      ne
> +#endif
> +       movne   r5,r10
> +       ldr     r10,[r2,#8]
> +#ifdef __thumb2__
> +       it      ne
> +#endif
> +       movne   r6,r11
> +       ldr     r11,[r2,#12]
> +
> +       adds    r3,r3,r8
> +       adcs    r4,r4,r9
> +       adcs    r5,r5,r10
> +       adc     r6,r6,r11
> +
> +#if __LINUX_ARM_ARCH__ >= 7
> +#ifdef __ARMEB__
> +       rev     r3,r3
> +       rev     r4,r4
> +       rev     r5,r5
> +       rev     r6,r6
> +#endif
> +       str     r3,[r1,#0]
> +       str     r4,[r1,#4]
> +       str     r5,[r1,#8]
> +       str     r6,[r1,#12]
> +#else
> +       strb    r3,[r1,#0]
> +       mov     r3,r3,lsr#8
> +       strb    r4,[r1,#4]
> +       mov     r4,r4,lsr#8
> +       strb    r5,[r1,#8]
> +       mov     r5,r5,lsr#8
> +       strb    r6,[r1,#12]
> +       mov     r6,r6,lsr#8
> +
> +       strb    r3,[r1,#1]
> +       mov     r3,r3,lsr#8
> +       strb    r4,[r1,#5]
> +       mov     r4,r4,lsr#8
> +       strb    r5,[r1,#9]
> +       mov     r5,r5,lsr#8
> +       strb    r6,[r1,#13]
> +       mov     r6,r6,lsr#8
> +
> +       strb    r3,[r1,#2]
> +       mov     r3,r3,lsr#8
> +       strb    r4,[r1,#6]
> +       mov     r4,r4,lsr#8
> +       strb    r5,[r1,#10]
> +       mov     r5,r5,lsr#8
> +       strb    r6,[r1,#14]
> +       mov     r6,r6,lsr#8
> +
> +       strb    r3,[r1,#3]
> +       strb    r4,[r1,#7]
> +       strb    r5,[r1,#11]
> +       strb    r6,[r1,#15]
> +#endif
> +       ldmia   sp!,{r4-r11}
> +#if __LINUX_ARM_ARCH__ >= 5
> +       bx      lr                              @ bx    lr
> +#else
> +       tst     lr,#1
> +       moveq   pc,lr                   @ be binary compatible with V4, yet
> +       .word   0xe12fff1e                      @ interoperable with Thumb ISA:-)
> +#endif
> +ENDPROC(poly1305_emit_arm)
> +
> +
> +#if __LINUX_ARM_ARCH__ >= 7
> +.fpu   neon
> +
> +.align 5
> +ENTRY(poly1305_init_neon)
> +.Lpoly1305_init_neon:
> +       ldr     r4,[r0,#20]             @ load key base 2^32
> +       ldr     r5,[r0,#24]
> +       ldr     r6,[r0,#28]
> +       ldr     r7,[r0,#32]
> +
> +       and     r2,r4,#0x03ffffff       @ base 2^32 -> base 2^26
> +       mov     r3,r4,lsr#26
> +       mov     r4,r5,lsr#20
> +       orr     r3,r3,r5,lsl#6
> +       mov     r5,r6,lsr#14
> +       orr     r4,r4,r6,lsl#12
> +       mov     r6,r7,lsr#8
> +       orr     r5,r5,r7,lsl#18
> +       and     r3,r3,#0x03ffffff
> +       and     r4,r4,#0x03ffffff
> +       and     r5,r5,#0x03ffffff
> +
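
For anyone following the limb arithmetic here: the shift/orr/and sequence
above is a radix change from four 32-bit words to five 26-bit limbs. A
minimal C sketch of the same packing follows. The function name and array
layout are made up for illustration and are not part of the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: split a 128-bit value held as four little-endian
 * 32-bit words into five base 2^26 limbs, using the same shifts as the
 * quoted and/mov/orr sequence.
 */
static void to_base26(const uint32_t w[4], uint32_t limb[5])
{
    limb[0] =   w[0]                          & 0x03ffffff;
    limb[1] = ((w[0] >> 26) | (w[1] <<  6))   & 0x03ffffff;
    limb[2] = ((w[1] >> 20) | (w[2] << 12))   & 0x03ffffff;
    limb[3] = ((w[2] >> 14) | (w[3] << 18))   & 0x03ffffff;
    limb[4] =   w[3] >>  8;   /* only 24 bits remain, so no mask is needed */
}
```

The top limb is narrower than the others because 128 = 4*26 + 24.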
> +       vdup.32 d0,r2                   @ r^1 in both lanes
> +       add     r2,r3,r3,lsl#2          @ *5
> +       vdup.32 d1,r3
> +       add     r3,r4,r4,lsl#2
> +       vdup.32 d2,r2
> +       vdup.32 d3,r4
> +       add     r4,r5,r5,lsl#2
> +       vdup.32 d4,r3
> +       vdup.32 d5,r5
> +       add     r5,r6,r6,lsl#2
> +       vdup.32 d6,r4
> +       vdup.32 d7,r6
> +       vdup.32 d8,r5
> +
> +       mov     r5,#2           @ counter
> +
> +.Lsquare_neon:
> +       @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
> +       @ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
> +       @ d1 = h1*r0 + h0*r1   + h4*5*r2 + h3*5*r3 + h2*5*r4
> +       @ d2 = h2*r0 + h1*r1   + h0*r2   + h4*5*r3 + h3*5*r4
> +       @ d3 = h3*r0 + h2*r1   + h1*r2   + h0*r3   + h4*5*r4
> +       @ d4 = h4*r0 + h3*r1   + h2*r2   + h1*r3   + h0*r4
> +
> +       vmull.u32       q5,d0,d0[1]
> +       vmull.u32       q6,d1,d0[1]
> +       vmull.u32       q7,d3,d0[1]
> +       vmull.u32       q8,d5,d0[1]
> +       vmull.u32       q9,d7,d0[1]
> +
> +       vmlal.u32       q5,d7,d2[1]
> +       vmlal.u32       q6,d0,d1[1]
> +       vmlal.u32       q7,d1,d1[1]
> +       vmlal.u32       q8,d3,d1[1]
> +       vmlal.u32       q9,d5,d1[1]
> +
> +       vmlal.u32       q5,d5,d4[1]
> +       vmlal.u32       q6,d7,d4[1]
> +       vmlal.u32       q8,d1,d3[1]
> +       vmlal.u32       q7,d0,d3[1]
> +       vmlal.u32       q9,d3,d3[1]
> +
> +       vmlal.u32       q5,d3,d6[1]
> +       vmlal.u32       q8,d0,d5[1]
> +       vmlal.u32       q6,d5,d6[1]
> +       vmlal.u32       q7,d7,d6[1]
> +       vmlal.u32       q9,d1,d5[1]
> +
> +       vmlal.u32       q8,d7,d8[1]
> +       vmlal.u32       q5,d1,d8[1]
> +       vmlal.u32       q6,d3,d8[1]
> +       vmlal.u32       q7,d5,d8[1]
> +       vmlal.u32       q9,d0,d7[1]
> +
> +       @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
> +       @ lazy reduction as discussed in "NEON crypto" by D.J. Bernstein
> +       @ and P. Schwabe
> +       @
> +       @ H0>>+H1>>+H2>>+H3>>+H4
> +       @ H3>>+H4>>*5+H0>>+H1
> +       @
> +       @ Trivia.
> +       @
> +       @ Result of multiplication of n-bit number by m-bit number is
> +       @ n+m bits wide. However! Even though 2^n is a n+1-bit number,
> +       @ m-bit number multiplied by 2^n is still n+m bits wide.
> +       @
> +       @ Sum of two n-bit numbers is n+1 bits wide, sum of three - n+2,
> +       @ and so is sum of four. Sum of 2^m n-m-bit numbers and n-bit
> +       @ one is n+1 bits wide.
> +       @
> +       @ >>+ denotes Hnext += Hn>>26, Hn &= 0x3ffffff. This means that
> +       @ H0, H2, H3 are guaranteed to be 26 bits wide, while H1 and H4
> +       @ can be 27. However! In cases when their width exceeds 26 bits
> +       @ they are limited by 2^26+2^6. This in turn means that *sum*
> +       @ of the products with these values can still be viewed as sum
> +       @ of 52-bit numbers as long as the amount of addends is not a
> +       @ power of 2. For example,
> +       @
> +       @ H4 = H4*R0 + H3*R1 + H2*R2 + H1*R3 + H0 * R4,
> +       @
> +       @ which can't be larger than 5 * (2^26 + 2^6) * (2^26 + 2^6), or
> +       @ 5 * (2^52 + 2*2^32 + 2^12), which in turn is smaller than
> +       @ 8 * (2^52) or 2^55. However, the value is then multiplied by
> +       @ 5, so we should be looking at 5 * 5 * (2^52 + 2^33 + 2^12),
> +       @ which is less than 32 * (2^52) or 2^57. And when processing
> +       @ data we are looking at triple as many addends...
> +       @
> +       @ In key setup procedure pre-reduced H0 is limited by 5*4+1 and
> +       @ 5*H4 - by 5*5 52-bit addends, or 57 bits. But when hashing the
> +       @ input H0 is limited by (5*4+1)*3 addends, or 58 bits, while
> +       @ 5*H4 by 5*5*3, or 59[!] bits. How is this relevant? vmlal.u32
> +       @ instruction accepts 2x32-bit input and writes 2x64-bit result.
> +       @ This means that result of reduction has to be compressed upon
> +       @ loop wrap-around. This can be done in the process of reduction
> +       @ to minimize amount of instructions [as well as amount of
> +       @ 128-bit instructions, which benefits low-end processors], but
> +       @ one has to watch for H2 (which is narrower than H0) and 5*H4
> +       @ not being wider than 58 bits, so that result of right shift
> +       @ by 26 bits fits in 32 bits. This is also useful on x86,
> +       @ because it allows using paddd in place of paddq, which
> +       @ benefits Atom, where paddq is ridiculously slow.
> +
> +       vshr.u64        q15,q8,#26
> +       vmovn.i64       d16,q8
> +        vshr.u64       q4,q5,#26
> +        vmovn.i64      d10,q5
> +       vadd.i64        q9,q9,q15               @ h3 -> h4
> +       vbic.i32        d16,#0xfc000000 @ &=0x03ffffff
> +        vadd.i64       q6,q6,q4                @ h0 -> h1
> +        vbic.i32       d10,#0xfc000000
> +
> +       vshrn.u64       d30,q9,#26
> +       vmovn.i64       d18,q9
> +        vshr.u64       q4,q6,#26
> +        vmovn.i64      d12,q6
> +        vadd.i64       q7,q7,q4                @ h1 -> h2
> +       vbic.i32        d18,#0xfc000000
> +        vbic.i32       d12,#0xfc000000
> +
> +       vadd.i32        d10,d10,d30
> +       vshl.u32        d30,d30,#2
> +        vshrn.u64      d8,q7,#26
> +        vmovn.i64      d14,q7
> +       vadd.i32        d10,d10,d30     @ h4 -> h0
> +        vadd.i32       d16,d16,d8      @ h2 -> h3
> +        vbic.i32       d14,#0xfc000000
> +
> +       vshr.u32        d30,d10,#26
> +       vbic.i32        d10,#0xfc000000
> +        vshr.u32       d8,d16,#26
> +        vbic.i32       d16,#0xfc000000
> +       vadd.i32        d12,d12,d30     @ h0 -> h1
> +        vadd.i32       d18,d18,d8      @ h3 -> h4
> +
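
The carry order in the reduction just above (h3->h4, h0->h1, h1->h2,
h4->h0 times 5, h2->h3, h0->h1, h3->h4) may be easier to read in scalar
form. This is only a sketch of one lane, assuming the inputs respect the
width bounds from the comment so that every carry fits in 32 bits. The
names are illustrative, not taken from the patch.

```c
#include <stdint.h>

#define MASK26 0x03ffffffu

/*
 * Hypothetical scalar model of the lazy reduction: d[] holds the five
 * 64-bit accumulators (q5..q9 in the NEON code), h[] receives the
 * narrowed 32-bit limbs. The result is only partially reduced.
 */
static void lazy_reduce(uint64_t d[5], uint32_t h[5])
{
    uint64_t c, c4;

    c  = d[3] >> 26; h[3] = (uint32_t)d[3] & MASK26; d[4] += c; /* h3 -> h4 */
    c  = d[0] >> 26; h[0] = (uint32_t)d[0] & MASK26; d[1] += c; /* h0 -> h1 */

    c4 = d[4] >> 26; h[4] = (uint32_t)d[4] & MASK26;
    c  = d[1] >> 26; h[1] = (uint32_t)d[1] & MASK26; d[2] += c; /* h1 -> h2 */

    h[0] += (uint32_t)(c4 + (c4 << 2));                         /* h4 -> h0, 2^130 = 5 mod p */
    c  = d[2] >> 26; h[2] = (uint32_t)d[2] & MASK26;
    h[3] += (uint32_t)c;                                        /* h2 -> h3 */

    c  = h[0] >> 26; h[0] &= MASK26; h[1] += (uint32_t)c;       /* h0 -> h1 */
    c  = h[3] >> 26; h[3] &= MASK26; h[4] += (uint32_t)c;       /* h3 -> h4 */
}
```

As in the assembly, h1 and h4 can end up one bit wider than 26 bits,
which is what the comment block means by "lazy".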
> +       subs            r5,r5,#1
> +       beq             .Lsquare_break_neon
> +
> +       add             r6,r0,#(48+0*9*4)
> +       add             r7,r0,#(48+1*9*4)
> +
> +       vtrn.32         d0,d10          @ r^2:r^1
> +       vtrn.32         d3,d14
> +       vtrn.32         d5,d16
> +       vtrn.32         d1,d12
> +       vtrn.32         d7,d18
> +
> +       vshl.u32        d4,d3,#2                @ *5
> +       vshl.u32        d6,d5,#2
> +       vshl.u32        d2,d1,#2
> +       vshl.u32        d8,d7,#2
> +       vadd.i32        d4,d4,d3
> +       vadd.i32        d2,d2,d1
> +       vadd.i32        d6,d6,d5
> +       vadd.i32        d8,d8,d7
> +
> +       vst4.32         {d0[0],d1[0],d2[0],d3[0]},[r6]!
> +       vst4.32         {d0[1],d1[1],d2[1],d3[1]},[r7]!
> +       vst4.32         {d4[0],d5[0],d6[0],d7[0]},[r6]!
> +       vst4.32         {d4[1],d5[1],d6[1],d7[1]},[r7]!
> +       vst1.32         {d8[0]},[r6,:32]
> +       vst1.32         {d8[1]},[r7,:32]
> +
> +       b               .Lsquare_neon
> +
> +.align 4
> +.Lsquare_break_neon:
> +       add             r6,r0,#(48+2*4*9)
> +       add             r7,r0,#(48+3*4*9)
> +
> +       vmov            d0,d10          @ r^4:r^3
> +       vshl.u32        d2,d12,#2               @ *5
> +       vmov            d1,d12
> +       vshl.u32        d4,d14,#2
> +       vmov            d3,d14
> +       vshl.u32        d6,d16,#2
> +       vmov            d5,d16
> +       vshl.u32        d8,d18,#2
> +       vmov            d7,d18
> +       vadd.i32        d2,d2,d12
> +       vadd.i32        d4,d4,d14
> +       vadd.i32        d6,d6,d16
> +       vadd.i32        d8,d8,d18
> +
> +       vst4.32         {d0[0],d1[0],d2[0],d3[0]},[r6]!
> +       vst4.32         {d0[1],d1[1],d2[1],d3[1]},[r7]!
> +       vst4.32         {d4[0],d5[0],d6[0],d7[0]},[r6]!
> +       vst4.32         {d4[1],d5[1],d6[1],d7[1]},[r7]!
> +       vst1.32         {d8[0]},[r6]
> +       vst1.32         {d8[1]},[r7]
> +
> +       bx      lr                              @ bx    lr
> +ENDPROC(poly1305_init_neon)
> +
> +.align 5
> +ENTRY(poly1305_blocks_neon)
> +       ldr     ip,[r0,#36]             @ is_base2_26
> +       ands    r2,r2,#-16
> +       beq     .Lno_data_neon
> +
> +       cmp     r2,#64
> +       bhs     .Lenter_neon
> +       tst     ip,ip                   @ is_base2_26?
> +       beq     .Lpoly1305_blocks_arm
> +
> +.Lenter_neon:
> +       stmdb   sp!,{r4-r7}
> +       vstmdb  sp!,{d8-d15}            @ ABI specification says so
> +
> +       tst     ip,ip                   @ is_base2_26?
> +       bne     .Lbase2_26_neon
> +
> +       stmdb   sp!,{r1-r3,lr}
> +       bl      .Lpoly1305_init_neon
> +
> +       ldr     r4,[r0,#0]              @ load hash value base 2^32
> +       ldr     r5,[r0,#4]
> +       ldr     r6,[r0,#8]
> +       ldr     r7,[r0,#12]
> +       ldr     ip,[r0,#16]
> +
> +       and     r2,r4,#0x03ffffff       @ base 2^32 -> base 2^26
> +       mov     r3,r4,lsr#26
> +        veor   d10,d10,d10
> +       mov     r4,r5,lsr#20
> +       orr     r3,r3,r5,lsl#6
> +        veor   d12,d12,d12
> +       mov     r5,r6,lsr#14
> +       orr     r4,r4,r6,lsl#12
> +        veor   d14,d14,d14
> +       mov     r6,r7,lsr#8
> +       orr     r5,r5,r7,lsl#18
> +        veor   d16,d16,d16
> +       and     r3,r3,#0x03ffffff
> +       orr     r6,r6,ip,lsl#24
> +        veor   d18,d18,d18
> +       and     r4,r4,#0x03ffffff
> +       mov     r1,#1
> +       and     r5,r5,#0x03ffffff
> +       str     r1,[r0,#36]             @ is_base2_26
> +
> +       vmov.32 d10[0],r2
> +       vmov.32 d12[0],r3
> +       vmov.32 d14[0],r4
> +       vmov.32 d16[0],r5
> +       vmov.32 d18[0],r6
> +       adr     r5,.Lzeros
> +
> +       ldmia   sp!,{r1-r3,lr}
> +       b       .Lbase2_32_neon
> +
> +.align 4
> +.Lbase2_26_neon:
> +       @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
> +       @ load hash value
> +
> +       veor            d10,d10,d10
> +       veor            d12,d12,d12
> +       veor            d14,d14,d14
> +       veor            d16,d16,d16
> +       veor            d18,d18,d18
> +       vld4.32         {d10[0],d12[0],d14[0],d16[0]},[r0]!
> +       adr             r5,.Lzeros
> +       vld1.32         {d18[0]},[r0]
> +       sub             r0,r0,#16               @ rewind
> +
> +.Lbase2_32_neon:
> +       add             r4,r1,#32
> +       mov             r3,r3,lsl#24
> +       tst             r2,#31
> +       beq             .Leven
> +
> +       vld4.32         {d20[0],d22[0],d24[0],d26[0]},[r1]!
> +       vmov.32         d28[0],r3
> +       sub             r2,r2,#16
> +       add             r4,r1,#32
> +
> +#ifdef __ARMEB__
> +       vrev32.8        q10,q10
> +       vrev32.8        q13,q13
> +       vrev32.8        q11,q11
> +       vrev32.8        q12,q12
> +#endif
> +       vsri.u32        d28,d26,#8      @ base 2^32 -> base 2^26
> +       vshl.u32        d26,d26,#18
> +
> +       vsri.u32        d26,d24,#14
> +       vshl.u32        d24,d24,#12
> +       vadd.i32        d29,d28,d18     @ add hash value and move to #hi
> +
> +       vbic.i32        d26,#0xfc000000
> +       vsri.u32        d24,d22,#20
> +       vshl.u32        d22,d22,#6
> +
> +       vbic.i32        d24,#0xfc000000
> +       vsri.u32        d22,d20,#26
> +       vadd.i32        d27,d26,d16
> +
> +       vbic.i32        d20,#0xfc000000
> +       vbic.i32        d22,#0xfc000000
> +       vadd.i32        d25,d24,d14
> +
> +       vadd.i32        d21,d20,d10
> +       vadd.i32        d23,d22,d12
> +
> +       mov             r7,r5
> +       add             r6,r0,#48
> +
> +       cmp             r2,r2
> +       b               .Long_tail
> +
> +.align 4
> +.Leven:
> +       subs            r2,r2,#64
> +       it              lo
> +       movlo           r4,r5
> +
> +       vmov.i32        q14,#1<<24              @ padbit, yes, always
> +       vld4.32         {d20,d22,d24,d26},[r1]  @ inp[0:1]
> +       add             r1,r1,#64
> +       vld4.32         {d21,d23,d25,d27},[r4]  @ inp[2:3] (or 0)
> +       add             r4,r4,#64
> +       itt             hi
> +       addhi           r7,r0,#(48+1*9*4)
> +       addhi           r6,r0,#(48+3*9*4)
> +
> +#ifdef __ARMEB__
> +       vrev32.8        q10,q10
> +       vrev32.8        q13,q13
> +       vrev32.8        q11,q11
> +       vrev32.8        q12,q12
> +#endif
> +       vsri.u32        q14,q13,#8              @ base 2^32 -> base 2^26
> +       vshl.u32        q13,q13,#18
> +
> +       vsri.u32        q13,q12,#14
> +       vshl.u32        q12,q12,#12
> +
> +       vbic.i32        q13,#0xfc000000
> +       vsri.u32        q12,q11,#20
> +       vshl.u32        q11,q11,#6
> +
> +       vbic.i32        q12,#0xfc000000
> +       vsri.u32        q11,q10,#26
> +
> +       vbic.i32        q10,#0xfc000000
> +       vbic.i32        q11,#0xfc000000
> +
> +       bls             .Lskip_loop
> +
> +       vld4.32         {d0[1],d1[1],d2[1],d3[1]},[r7]! @ load r^2
> +       vld4.32         {d0[0],d1[0],d2[0],d3[0]},[r6]! @ load r^4
> +       vld4.32         {d4[1],d5[1],d6[1],d7[1]},[r7]!
> +       vld4.32         {d4[0],d5[0],d6[0],d7[0]},[r6]!
> +       b               .Loop_neon
> +
> +.align 5
> +.Loop_neon:
> +       @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
> +       @ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
> +       @ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
> +       @   ___________________/
> +       @ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2
> +       @ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r
> +       @   ___________________/ ____________________/
> +       @
> +       @ Note that we start with inp[2:3]*r^2. This is because it
> +       @ doesn't depend on reduction in previous iteration.
> +       @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
> +       @ d4 = h4*r0 + h3*r1   + h2*r2   + h1*r3   + h0*r4
> +       @ d3 = h3*r0 + h2*r1   + h1*r2   + h0*r3   + h4*5*r4
> +       @ d2 = h2*r0 + h1*r1   + h0*r2   + h4*5*r3 + h3*5*r4
> +       @ d1 = h1*r0 + h0*r1   + h4*5*r2 + h3*5*r3 + h2*5*r4
> +       @ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
> +
> +       @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
> +       @ inp[2:3]*r^2
> +
> +       vadd.i32        d24,d24,d14     @ accumulate inp[0:1]
> +       vmull.u32       q7,d25,d0[1]
> +       vadd.i32        d20,d20,d10
> +       vmull.u32       q5,d21,d0[1]
> +       vadd.i32        d26,d26,d16
> +       vmull.u32       q8,d27,d0[1]
> +       vmlal.u32       q7,d23,d1[1]
> +       vadd.i32        d22,d22,d12
> +       vmull.u32       q6,d23,d0[1]
> +
> +       vadd.i32        d28,d28,d18
> +       vmull.u32       q9,d29,d0[1]
> +       subs            r2,r2,#64
> +       vmlal.u32       q5,d29,d2[1]
> +       it              lo
> +       movlo           r4,r5
> +       vmlal.u32       q8,d25,d1[1]
> +       vld1.32         d8[1],[r7,:32]
> +       vmlal.u32       q6,d21,d1[1]
> +       vmlal.u32       q9,d27,d1[1]
> +
> +       vmlal.u32       q5,d27,d4[1]
> +       vmlal.u32       q8,d23,d3[1]
> +       vmlal.u32       q9,d25,d3[1]
> +       vmlal.u32       q6,d29,d4[1]
> +       vmlal.u32       q7,d21,d3[1]
> +
> +       vmlal.u32       q8,d21,d5[1]
> +       vmlal.u32       q5,d25,d6[1]
> +       vmlal.u32       q9,d23,d5[1]
> +       vmlal.u32       q6,d27,d6[1]
> +       vmlal.u32       q7,d29,d6[1]
> +
> +       vmlal.u32       q8,d29,d8[1]
> +       vmlal.u32       q5,d23,d8[1]
> +       vmlal.u32       q9,d21,d7[1]
> +       vmlal.u32       q6,d25,d8[1]
> +       vmlal.u32       q7,d27,d8[1]
> +
> +       vld4.32         {d21,d23,d25,d27},[r4]  @ inp[2:3] (or 0)
> +       add             r4,r4,#64
> +
> +       @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
> +       @ (hash+inp[0:1])*r^4 and accumulate
> +
> +       vmlal.u32       q8,d26,d0[0]
> +       vmlal.u32       q5,d20,d0[0]
> +       vmlal.u32       q9,d28,d0[0]
> +       vmlal.u32       q6,d22,d0[0]
> +       vmlal.u32       q7,d24,d0[0]
> +       vld1.32         d8[0],[r6,:32]
> +
> +       vmlal.u32       q8,d24,d1[0]
> +       vmlal.u32       q5,d28,d2[0]
> +       vmlal.u32       q9,d26,d1[0]
> +       vmlal.u32       q6,d20,d1[0]
> +       vmlal.u32       q7,d22,d1[0]
> +
> +       vmlal.u32       q8,d22,d3[0]
> +       vmlal.u32       q5,d26,d4[0]
> +       vmlal.u32       q9,d24,d3[0]
> +       vmlal.u32       q6,d28,d4[0]
> +       vmlal.u32       q7,d20,d3[0]
> +
> +       vmlal.u32       q8,d20,d5[0]
> +       vmlal.u32       q5,d24,d6[0]
> +       vmlal.u32       q9,d22,d5[0]
> +       vmlal.u32       q6,d26,d6[0]
> +       vmlal.u32       q8,d28,d8[0]
> +
> +       vmlal.u32       q7,d28,d6[0]
> +       vmlal.u32       q5,d22,d8[0]
> +       vmlal.u32       q9,d20,d7[0]
> +       vmov.i32        q14,#1<<24              @ padbit, yes, always
> +       vmlal.u32       q6,d24,d8[0]
> +       vmlal.u32       q7,d26,d8[0]
> +
> +       vld4.32         {d20,d22,d24,d26},[r1]  @ inp[0:1]
> +       add             r1,r1,#64
> +#ifdef __ARMEB__
> +       vrev32.8        q10,q10
> +       vrev32.8        q11,q11
> +       vrev32.8        q12,q12
> +       vrev32.8        q13,q13
> +#endif
> +
> +       @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
> +       @ lazy reduction interleaved with base 2^32 -> base 2^26 of
> +       @ inp[0:3] previously loaded to q10-q13 and smashed to q10-q14.
> +
> +       vshr.u64        q15,q8,#26
> +       vmovn.i64       d16,q8
> +        vshr.u64       q4,q5,#26
> +        vmovn.i64      d10,q5
> +       vadd.i64        q9,q9,q15               @ h3 -> h4
> +       vbic.i32        d16,#0xfc000000
> +         vsri.u32      q14,q13,#8              @ base 2^32 -> base 2^26
> +        vadd.i64       q6,q6,q4                @ h0 -> h1
> +         vshl.u32      q13,q13,#18
> +        vbic.i32       d10,#0xfc000000
> +
> +       vshrn.u64       d30,q9,#26
> +       vmovn.i64       d18,q9
> +        vshr.u64       q4,q6,#26
> +        vmovn.i64      d12,q6
> +        vadd.i64       q7,q7,q4                @ h1 -> h2
> +         vsri.u32      q13,q12,#14
> +       vbic.i32        d18,#0xfc000000
> +         vshl.u32      q12,q12,#12
> +        vbic.i32       d12,#0xfc000000
> +
> +       vadd.i32        d10,d10,d30
> +       vshl.u32        d30,d30,#2
> +         vbic.i32      q13,#0xfc000000
> +        vshrn.u64      d8,q7,#26
> +        vmovn.i64      d14,q7
> +       vaddl.u32       q5,d10,d30      @ h4 -> h0 [widen for a sec]
> +         vsri.u32      q12,q11,#20
> +        vadd.i32       d16,d16,d8      @ h2 -> h3
> +         vshl.u32      q11,q11,#6
> +        vbic.i32       d14,#0xfc000000
> +         vbic.i32      q12,#0xfc000000
> +
> +       vshrn.u64       d30,q5,#26              @ re-narrow
> +       vmovn.i64       d10,q5
> +         vsri.u32      q11,q10,#26
> +         vbic.i32      q10,#0xfc000000
> +        vshr.u32       d8,d16,#26
> +        vbic.i32       d16,#0xfc000000
> +       vbic.i32        d10,#0xfc000000
> +       vadd.i32        d12,d12,d30     @ h0 -> h1
> +        vadd.i32       d18,d18,d8      @ h3 -> h4
> +         vbic.i32      q11,#0xfc000000
> +
> +       bhi             .Loop_neon
> +
> +.Lskip_loop:
> +       @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
> +       @ multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1
> +
> +       add             r7,r0,#(48+0*9*4)
> +       add             r6,r0,#(48+1*9*4)
> +       adds            r2,r2,#32
> +       it              ne
> +       movne           r2,#0
> +       bne             .Long_tail
> +
> +       vadd.i32        d25,d24,d14     @ add hash value and move to #hi
> +       vadd.i32        d21,d20,d10
> +       vadd.i32        d27,d26,d16
> +       vadd.i32        d23,d22,d12
> +       vadd.i32        d29,d28,d18
> +
> +.Long_tail:
> +       vld4.32         {d0[1],d1[1],d2[1],d3[1]},[r7]! @ load r^1
> +       vld4.32         {d0[0],d1[0],d2[0],d3[0]},[r6]! @ load r^2
> +
> +       vadd.i32        d24,d24,d14     @ can be redundant
> +       vmull.u32       q7,d25,d0
> +       vadd.i32        d20,d20,d10
> +       vmull.u32       q5,d21,d0
> +       vadd.i32        d26,d26,d16
> +       vmull.u32       q8,d27,d0
> +       vadd.i32        d22,d22,d12
> +       vmull.u32       q6,d23,d0
> +       vadd.i32        d28,d28,d18
> +       vmull.u32       q9,d29,d0
> +
> +       vmlal.u32       q5,d29,d2
> +       vld4.32         {d4[1],d5[1],d6[1],d7[1]},[r7]!
> +       vmlal.u32       q8,d25,d1
> +       vld4.32         {d4[0],d5[0],d6[0],d7[0]},[r6]!
> +       vmlal.u32       q6,d21,d1
> +       vmlal.u32       q9,d27,d1
> +       vmlal.u32       q7,d23,d1
> +
> +       vmlal.u32       q8,d23,d3
> +       vld1.32         d8[1],[r7,:32]
> +       vmlal.u32       q5,d27,d4
> +       vld1.32         d8[0],[r6,:32]
> +       vmlal.u32       q9,d25,d3
> +       vmlal.u32       q6,d29,d4
> +       vmlal.u32       q7,d21,d3
> +
> +       vmlal.u32       q8,d21,d5
> +        it             ne
> +        addne          r7,r0,#(48+2*9*4)
> +       vmlal.u32       q5,d25,d6
> +        it             ne
> +        addne          r6,r0,#(48+3*9*4)
> +       vmlal.u32       q9,d23,d5
> +       vmlal.u32       q6,d27,d6
> +       vmlal.u32       q7,d29,d6
> +
> +       vmlal.u32       q8,d29,d8
> +        vorn           q0,q0,q0        @ all-ones, can be redundant
> +       vmlal.u32       q5,d23,d8
> +        vshr.u64       q0,q0,#38
> +       vmlal.u32       q9,d21,d7
> +       vmlal.u32       q6,d25,d8
> +       vmlal.u32       q7,d27,d8
> +
> +       beq             .Lshort_tail
> +
> +       @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
> +       @ (hash+inp[0:1])*r^4:r^3 and accumulate
> +
> +       vld4.32         {d0[1],d1[1],d2[1],d3[1]},[r7]! @ load r^3
> +       vld4.32         {d0[0],d1[0],d2[0],d3[0]},[r6]! @ load r^4
> +
> +       vmlal.u32       q7,d24,d0
> +       vmlal.u32       q5,d20,d0
> +       vmlal.u32       q8,d26,d0
> +       vmlal.u32       q6,d22,d0
> +       vmlal.u32       q9,d28,d0
> +
> +       vmlal.u32       q5,d28,d2
> +       vld4.32         {d4[1],d5[1],d6[1],d7[1]},[r7]!
> +       vmlal.u32       q8,d24,d1
> +       vld4.32         {d4[0],d5[0],d6[0],d7[0]},[r6]!
> +       vmlal.u32       q6,d20,d1
> +       vmlal.u32       q9,d26,d1
> +       vmlal.u32       q7,d22,d1
> +
> +       vmlal.u32       q8,d22,d3
> +       vld1.32         d8[1],[r7,:32]
> +       vmlal.u32       q5,d26,d4
> +       vld1.32         d8[0],[r6,:32]
> +       vmlal.u32       q9,d24,d3
> +       vmlal.u32       q6,d28,d4
> +       vmlal.u32       q7,d20,d3
> +
> +       vmlal.u32       q8,d20,d5
> +       vmlal.u32       q5,d24,d6
> +       vmlal.u32       q9,d22,d5
> +       vmlal.u32       q6,d26,d6
> +       vmlal.u32       q7,d28,d6
> +
> +       vmlal.u32       q8,d28,d8
> +        vorn           q0,q0,q0        @ all-ones
> +       vmlal.u32       q5,d22,d8
> +        vshr.u64       q0,q0,#38
> +       vmlal.u32       q9,d20,d7
> +       vmlal.u32       q6,d24,d8
> +       vmlal.u32       q7,d26,d8
> +
> +.Lshort_tail:
> +       @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
> +       @ horizontal addition
> +
> +       vadd.i64        d16,d16,d17
> +       vadd.i64        d10,d10,d11
> +       vadd.i64        d18,d18,d19
> +       vadd.i64        d12,d12,d13
> +       vadd.i64        d14,d14,d15
> +
> +       @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
> +       @ lazy reduction, but without narrowing
> +
> +       vshr.u64        q15,q8,#26
> +       vand.i64        q8,q8,q0
> +        vshr.u64       q4,q5,#26
> +        vand.i64       q5,q5,q0
> +       vadd.i64        q9,q9,q15               @ h3 -> h4
> +        vadd.i64       q6,q6,q4                @ h0 -> h1
> +
> +       vshr.u64        q15,q9,#26
> +       vand.i64        q9,q9,q0
> +        vshr.u64       q4,q6,#26
> +        vand.i64       q6,q6,q0
> +        vadd.i64       q7,q7,q4                @ h1 -> h2
> +
> +       vadd.i64        q5,q5,q15
> +       vshl.u64        q15,q15,#2
> +        vshr.u64       q4,q7,#26
> +        vand.i64       q7,q7,q0
> +       vadd.i64        q5,q5,q15               @ h4 -> h0
> +        vadd.i64       q8,q8,q4                @ h2 -> h3
> +
> +       vshr.u64        q15,q5,#26
> +       vand.i64        q5,q5,q0
> +        vshr.u64       q4,q8,#26
> +        vand.i64       q8,q8,q0
> +       vadd.i64        q6,q6,q15               @ h0 -> h1
> +        vadd.i64       q9,q9,q4                @ h3 -> h4
> +
> +       cmp             r2,#0
> +       bne             .Leven
> +
> +       @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
> +       @ store hash value
> +
> +       vst4.32         {d10[0],d12[0],d14[0],d16[0]},[r0]!
> +       vst1.32         {d18[0]},[r0]
> +
> +       vldmia  sp!,{d8-d15}                    @ epilogue
> +       ldmia   sp!,{r4-r7}
> +.Lno_data_neon:
> +       bx      lr                                      @ bx    lr
> +ENDPROC(poly1305_blocks_neon)
> +
> +.align 5
> +ENTRY(poly1305_emit_neon)
> +       ldr     ip,[r0,#36]             @ is_base2_26
> +
> +       stmdb   sp!,{r4-r11}
> +
> +       tst     ip,ip
> +       beq     .Lpoly1305_emit_enter
> +
> +       ldmia   r0,{r3-r7}
> +       eor     r8,r8,r8
> +
> +       adds    r3,r3,r4,lsl#26 @ base 2^26 -> base 2^32
> +       mov     r4,r4,lsr#6
> +       adcs    r4,r4,r5,lsl#20
> +       mov     r5,r5,lsr#12
> +       adcs    r5,r5,r6,lsl#14
> +       mov     r6,r6,lsr#18
> +       adcs    r6,r6,r7,lsl#8
> +       adc     r7,r8,r7,lsr#24 @ can be partially reduced ...
> +
> +       and     r8,r7,#-4               @ ... so reduce
> +       and     r7,r6,#3
> +       add     r8,r8,r8,lsr#2  @ *= 5
> +       adds    r3,r3,r8
> +       adcs    r4,r4,#0
> +       adcs    r5,r5,#0
> +       adcs    r6,r6,#0
> +       adc     r7,r7,#0
> +
> +       adds    r8,r3,#5                @ compare to modulus
> +       adcs    r9,r4,#0
> +       adcs    r10,r5,#0
> +       adcs    r11,r6,#0
> +       adc     r7,r7,#0
> +       tst     r7,#4                   @ did it carry/borrow?
> +
> +       it      ne
> +       movne   r3,r8
> +       ldr     r8,[r2,#0]
> +       it      ne
> +       movne   r4,r9
> +       ldr     r9,[r2,#4]
> +       it      ne
> +       movne   r5,r10
> +       ldr     r10,[r2,#8]
> +       it      ne
> +       movne   r6,r11
> +       ldr     r11,[r2,#12]
> +
> +       adds    r3,r3,r8                @ accumulate nonce
> +       adcs    r4,r4,r9
> +       adcs    r5,r5,r10
> +       adc     r6,r6,r11
> +
> +#ifdef __ARMEB__
> +       rev     r3,r3
> +       rev     r4,r4
> +       rev     r5,r5
> +       rev     r6,r6
> +#endif
> +       str     r3,[r1,#0]              @ store the result
> +       str     r4,[r1,#4]
> +       str     r5,[r1,#8]
> +       str     r6,[r1,#12]
> +
> +       ldmia   sp!,{r4-r11}
> +       bx      lr                              @ bx    lr
> +ENDPROC(poly1305_emit_neon)
> +
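
The tail of the emit routine (compare to modulus, conditional select,
accumulate nonce) can be modelled in C roughly as below. It is a sketch
under the assumption that h is already below 2^130, held in five 32-bit
words. The function name is hypothetical, and unlike the conditional
moves in the assembly this version branches, so it is not constant-time
and is meant for reading only.

```c
#include <stdint.h>

/*
 * Hypothetical model of the final step: out = (h mod (2^130 - 5)) + nonce,
 * truncated to 128 bits. h[4] holds bits 128 and up.
 */
static void emit_sketch(const uint32_t h[5], const uint32_t nonce[4],
                        uint32_t out[4])
{
    uint32_t g[5];
    const uint32_t *sel;
    uint64_t t = 5;                     /* g = h + 5 */
    int i;

    for (i = 0; i < 5; i++) {
        t += h[i];
        g[i] = (uint32_t)t;
        t >>= 32;
    }

    /* Bit 2 of the top word is bit 130: set iff h + 5 >= 2^130, i.e. h >= p. */
    sel = (g[4] & 4) ? g : h;

    t = 0;
    for (i = 0; i < 4; i++) {           /* add nonce modulo 2^128 */
        t += (uint64_t)sel[i] + nonce[i];
        out[i] = (uint32_t)t;
        t >>= 32;
    }
}
```

Adding 5 and testing bit 130 avoids an explicit subtraction of the
modulus: the low 128 bits of h + 5 are already h - p when h >= p.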
> +.align 5
> +.Lzeros:
> +.long  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
> +#endif
> diff --git a/lib/zinc/poly1305/poly1305-arm64.S b/lib/zinc/poly1305/poly1305-arm64.S
> new file mode 100644
> index 000000000000..c20023544183
> --- /dev/null
> +++ b/lib/zinc/poly1305/poly1305-arm64.S
> @@ -0,0 +1,822 @@
> +/* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0
> + *
> + * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
> + * Copyright (C) 2006-2017 CRYPTOGAMS by <appro@openssl.org>. All Rights Reserved.
> + *
> + * This is based in part on Andy Polyakov's implementation from CRYPTOGAMS.
> + */
> +
> +#include <linux/linkage.h>
> +.text
> +
> +.align 5
> +ENTRY(poly1305_init_arm)
> +       cmp     x1,xzr
> +       stp     xzr,xzr,[x0]            // zero hash value
> +       stp     xzr,xzr,[x0,#16]        // [along with is_base2_26]
> +
> +       csel    x0,xzr,x0,eq
> +       b.eq    .Lno_key
> +
> +       ldp     x7,x8,[x1]              // load key
> +       mov     x9,#0xfffffffc0fffffff
> +       movk    x9,#0x0fff,lsl#48
> +#ifdef __ARMEB__
> +       rev     x7,x7                   // flip bytes
> +       rev     x8,x8
> +#endif
> +       and     x7,x7,x9                // &=0ffffffc0fffffff
> +       and     x9,x9,#-4
> +       and     x8,x8,x9                // &=0ffffffc0ffffffc
> +       stp     x7,x8,[x0,#32]  // save key value
> +
> +.Lno_key:
> +       ret
> +ENDPROC(poly1305_init_arm)
> +
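
The mov/movk pair in the init routine builds the usual Poly1305 clamp
mask, and the "and x9,x9,#-4" derives the second mask from the first. In
C the same clamping looks roughly like this. The helper name is
illustrative, and the two key halves are assumed to be already loaded as
little-endian 64-bit values.

```c
#include <stdint.h>

/*
 * Hypothetical helper: clamp the r half of a Poly1305 key held as two
 * little-endian 64-bit words, using the constants from the quoted code.
 */
static void clamp_key(uint64_t r[2])
{
    uint64_t mask = 0x0ffffffc0fffffffULL;

    r[0] &= mask;            /* &= 0ffffffc0fffffff */
    mask &= ~3ULL;           /* same as and x9,x9,#-4 */
    r[1] &= mask;            /* &= 0ffffffc0ffffffc */
}
```

Clearing the low two bits of r1 is what later makes r1 + (r1 >> 2) an
exact 5*r1/4 in the block function.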
> +.align 5
> +ENTRY(poly1305_blocks_arm)
> +       ands    x2,x2,#-16
> +       b.eq    .Lno_data
> +
> +       ldp     x4,x5,[x0]              // load hash value
> +       ldp     x7,x8,[x0,#32]  // load key value
> +       ldr     x6,[x0,#16]
> +       add     x9,x8,x8,lsr#2  // s1 = r1 + (r1 >> 2)
> +       b       .Loop
> +
> +.align 5
> +.Loop:
> +       ldp     x10,x11,[x1],#16        // load input
> +       sub     x2,x2,#16
> +#ifdef __ARMEB__
> +       rev     x10,x10
> +       rev     x11,x11
> +#endif
> +       adds    x4,x4,x10               // accumulate input
> +       adcs    x5,x5,x11
> +
> +       mul     x12,x4,x7               // h0*r0
> +       adc     x6,x6,x3
> +       umulh   x13,x4,x7
> +
> +       mul     x10,x5,x9               // h1*5*r1
> +       umulh   x11,x5,x9
> +
> +       adds    x12,x12,x10
> +       mul     x10,x4,x8               // h0*r1
> +       adc     x13,x13,x11
> +       umulh   x14,x4,x8
> +
> +       adds    x13,x13,x10
> +       mul     x10,x5,x7               // h1*r0
> +       adc     x14,x14,xzr
> +       umulh   x11,x5,x7
> +
> +       adds    x13,x13,x10
> +       mul     x10,x6,x9               // h2*5*r1
> +       adc     x14,x14,x11
> +       mul     x11,x6,x7               // h2*r0
> +
> +       adds    x13,x13,x10
> +       adc     x14,x14,x11
> +
> +       and     x10,x14,#-4             // final reduction
> +       and     x6,x14,#3
> +       add     x10,x10,x14,lsr#2
> +       adds    x4,x12,x10
> +       adcs    x5,x13,xzr
> +       adc     x6,x6,xzr
> +
> +       cbnz    x2,.Loop
> +
> +       stp     x4,x5,[x0]              // store hash value
> +       str     x6,[x0,#16]
> +
> +.Lno_data:
> +       ret
> +ENDPROC(poly1305_blocks_arm)
> +
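
For reference, one iteration of the scalar loop above (accumulate input,
multiply by r with s1 = r1 + (r1 >> 2), fold the top bits back times 5)
can be written in C along these lines. This is a reading aid rather than
a drop-in replacement: it assumes a compiler with unsigned __int128 (GCC
or Clang on a 64-bit target), a clamped key, and a partially reduced h.
The function name and parameter layout are made up.

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/*
 * Hypothetical model of one block: h = (h + m + padbit * 2^128) * r,
 * partially reduced modulo 2^130 - 5. h is three 64-bit words.
 */
static void block_sketch(uint64_t h[3], const uint64_t m[2], uint64_t padbit,
                         uint64_t r0, uint64_t r1)
{
    uint64_t s1 = r1 + (r1 >> 2);   /* 5*r1/4, exact because r1 & 3 == 0 */
    uint64_t h0, h1, h2, d0, d1, d2, c;
    u128 t;

    /* accumulate input */
    t  = (u128)h[0] + m[0];
    h0 = (uint64_t)t;
    t  = (u128)h[1] + m[1] + (uint64_t)(t >> 64);
    h1 = (uint64_t)t;
    h2 = h[2] + padbit + (uint64_t)(t >> 64);

    /* d0 = h0*r0 + h1*s1, d1 = h0*r1 + h1*r0 + h2*s1, d2 = h2*r0 */
    t  = (u128)h0 * r0 + (u128)h1 * s1;
    d0 = (uint64_t)t;
    t  = (t >> 64) + (u128)h0 * r1 + (u128)h1 * r0 + (u128)h2 * s1;
    d1 = (uint64_t)t;
    d2 = (uint64_t)(t >> 64) + h2 * r0;

    /* final reduction: bits 130 and up wrap around multiplied by 5 */
    c    = (d2 & ~3ULL) + (d2 >> 2);
    t    = (u128)d0 + c;
    h[0] = (uint64_t)t;
    t    = (u128)d1 + (uint64_t)(t >> 64);
    h[1] = (uint64_t)t;
    h[2] = (d2 & 3) + (uint64_t)(t >> 64);
}
```

The s1 trick works because 2^128 is congruent to 5/4 modulo 2^130 - 5,
so a product landing at weight 2^128 can be folded to weight 2^0 by
multiplying with 5*r1/4 instead of r1.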
> +.align 5
> +ENTRY(poly1305_emit_arm)
> +       ldp     x4,x5,[x0]              // load hash base 2^64
> +       ldr     x6,[x0,#16]
> +       ldp     x10,x11,[x2]    // load nonce
> +
> +       adds    x12,x4,#5               // compare to modulus
> +       adcs    x13,x5,xzr
> +       adc     x14,x6,xzr
> +
> +       tst     x14,#-4                 // see if it's carried/borrowed
> +
> +       csel    x4,x4,x12,eq
> +       csel    x5,x5,x13,eq
> +
> +#ifdef __ARMEB__
> +       ror     x10,x10,#32             // flip nonce words
> +       ror     x11,x11,#32
> +#endif
> +       adds    x4,x4,x10               // accumulate nonce
> +       adc     x5,x5,x11
> +#ifdef __ARMEB__
> +       rev     x4,x4                   // flip output bytes
> +       rev     x5,x5
> +#endif
> +       stp     x4,x5,[x1]              // write result
> +
> +       ret
> +ENDPROC(poly1305_emit_arm)
> +
> +.align 5
> +__poly1305_mult:
> +       mul     x12,x4,x7               // h0*r0
> +       umulh   x13,x4,x7
> +
> +       mul     x10,x5,x9               // h1*5*r1
> +       umulh   x11,x5,x9
> +
> +       adds    x12,x12,x10
> +       mul     x10,x4,x8               // h0*r1
> +       adc     x13,x13,x11
> +       umulh   x14,x4,x8
> +
> +       adds    x13,x13,x10
> +       mul     x10,x5,x7               // h1*r0
> +       adc     x14,x14,xzr
> +       umulh   x11,x5,x7
> +
> +       adds    x13,x13,x10
> +       mul     x10,x6,x9               // h2*5*r1
> +       adc     x14,x14,x11
> +       mul     x11,x6,x7               // h2*r0
> +
> +       adds    x13,x13,x10
> +       adc     x14,x14,x11
> +
> +       and     x10,x14,#-4             // final reduction
> +       and     x6,x14,#3
> +       add     x10,x10,x14,lsr#2
> +       adds    x4,x12,x10
> +       adcs    x5,x13,xzr
> +       adc     x6,x6,xzr
> +
> +       ret
> +
> +__poly1305_splat:
> +       and     x12,x4,#0x03ffffff      // base 2^64 -> base 2^26
> +       ubfx    x13,x4,#26,#26
> +       extr    x14,x5,x4,#52
> +       and     x14,x14,#0x03ffffff
> +       ubfx    x15,x5,#14,#26
> +       extr    x16,x6,x5,#40
> +
> +       str     w12,[x0,#16*0]  // r0
> +       add     w12,w13,w13,lsl#2       // r1*5
> +       str     w13,[x0,#16*1]  // r1
> +       add     w13,w14,w14,lsl#2       // r2*5
> +       str     w12,[x0,#16*2]  // s1
> +       str     w14,[x0,#16*3]  // r2
> +       add     w14,w15,w15,lsl#2       // r3*5
> +       str     w13,[x0,#16*4]  // s2
> +       str     w15,[x0,#16*5]  // r3
> +       add     w15,w16,w16,lsl#2       // r4*5
> +       str     w14,[x0,#16*6]  // s3
> +       str     w16,[x0,#16*7]  // r4
> +       str     w15,[x0,#16*8]  // s4
> +
> +       ret
> +
> +.align 5
> +ENTRY(poly1305_blocks_neon)
> +       ldr     x17,[x0,#24]
> +       cmp     x2,#128
> +       b.hs    .Lblocks_neon
> +       cbz     x17,poly1305_blocks_arm
> +
> +.Lblocks_neon:
> +       stp     x29,x30,[sp,#-80]!
> +       add     x29,sp,#0
> +
> +       ands    x2,x2,#-16
> +       b.eq    .Lno_data_neon
> +
> +       cbz     x17,.Lbase2_64_neon
> +
> +       ldp     w10,w11,[x0]            // load hash value base 2^26
> +       ldp     w12,w13,[x0,#8]
> +       ldr     w14,[x0,#16]
> +
> +       tst     x2,#31
> +       b.eq    .Leven_neon
> +
> +       ldp     x7,x8,[x0,#32]  // load key value
> +
> +       add     x4,x10,x11,lsl#26       // base 2^26 -> base 2^64
> +       lsr     x5,x12,#12
> +       adds    x4,x4,x12,lsl#52
> +       add     x5,x5,x13,lsl#14
> +       adc     x5,x5,xzr
> +       lsr     x6,x14,#24
> +       adds    x5,x5,x14,lsl#40
> +       adc     x14,x6,xzr              // can be partially reduced...
> +
> +       ldp     x12,x13,[x1],#16        // load input
> +       sub     x2,x2,#16
> +       add     x9,x8,x8,lsr#2  // s1 = r1 + (r1 >> 2)
> +
> +       and     x10,x14,#-4             // ... so reduce
> +       and     x6,x14,#3
> +       add     x10,x10,x14,lsr#2
> +       adds    x4,x4,x10
> +       adcs    x5,x5,xzr
> +       adc     x6,x6,xzr
> +
> +#ifdef __ARMEB__
> +       rev     x12,x12
> +       rev     x13,x13
> +#endif
> +       adds    x4,x4,x12               // accumulate input
> +       adcs    x5,x5,x13
> +       adc     x6,x6,x3
> +
> +       bl      __poly1305_mult
> +       ldr     x30,[sp,#8]
> +
> +       cbz     x3,.Lstore_base2_64_neon
> +
> +       and     x10,x4,#0x03ffffff      // base 2^64 -> base 2^26
> +       ubfx    x11,x4,#26,#26
> +       extr    x12,x5,x4,#52
> +       and     x12,x12,#0x03ffffff
> +       ubfx    x13,x5,#14,#26
> +       extr    x14,x6,x5,#40
> +
> +       cbnz    x2,.Leven_neon
> +
> +       stp     w10,w11,[x0]            // store hash value base 2^26
> +       stp     w12,w13,[x0,#8]
> +       str     w14,[x0,#16]
> +       b       .Lno_data_neon
> +
> +.align 4
> +.Lstore_base2_64_neon:
> +       stp     x4,x5,[x0]              // store hash value base 2^64
> +       stp     x6,xzr,[x0,#16] // note that is_base2_26 is zeroed
> +       b       .Lno_data_neon
> +
> +.align 4
> +.Lbase2_64_neon:
> +       ldp     x7,x8,[x0,#32]  // load key value
> +
> +       ldp     x4,x5,[x0]              // load hash value base 2^64
> +       ldr     x6,[x0,#16]
> +
> +       tst     x2,#31
> +       b.eq    .Linit_neon
> +
> +       ldp     x12,x13,[x1],#16        // load input
> +       sub     x2,x2,#16
> +       add     x9,x8,x8,lsr#2  // s1 = r1 + (r1 >> 2)
> +#ifdef __ARMEB__
> +       rev     x12,x12
> +       rev     x13,x13
> +#endif
> +       adds    x4,x4,x12               // accumulate input
> +       adcs    x5,x5,x13
> +       adc     x6,x6,x3
> +
> +       bl      __poly1305_mult
> +
> +.Linit_neon:
> +       and     x10,x4,#0x03ffffff      // base 2^64 -> base 2^26
> +       ubfx    x11,x4,#26,#26
> +       extr    x12,x5,x4,#52
> +       and     x12,x12,#0x03ffffff
> +       ubfx    x13,x5,#14,#26
> +       extr    x14,x6,x5,#40
> +
> +       stp     d8,d9,[sp,#16]          // meet ABI requirements
> +       stp     d10,d11,[sp,#32]
> +       stp     d12,d13,[sp,#48]
> +       stp     d14,d15,[sp,#64]
> +
> +       fmov    d24,x10
> +       fmov    d25,x11
> +       fmov    d26,x12
> +       fmov    d27,x13
> +       fmov    d28,x14
> +
> +       ////////////////////////////////// initialize r^n table
> +       mov     x4,x7                   // r^1
> +       add     x9,x8,x8,lsr#2  // s1 = r1 + (r1 >> 2)
> +       mov     x5,x8
> +       mov     x6,xzr
> +       add     x0,x0,#48+12
> +       bl      __poly1305_splat
> +
> +       bl      __poly1305_mult         // r^2
> +       sub     x0,x0,#4
> +       bl      __poly1305_splat
> +
> +       bl      __poly1305_mult         // r^3
> +       sub     x0,x0,#4
> +       bl      __poly1305_splat
> +
> +       bl      __poly1305_mult         // r^4
> +       sub     x0,x0,#4
> +       bl      __poly1305_splat
> +       ldr     x30,[sp,#8]
> +
> +       add     x16,x1,#32
> +       adr     x17,.Lzeros
> +       subs    x2,x2,#64
> +       csel    x16,x17,x16,lo
> +
> +       mov     x4,#1
> +       str     x4,[x0,#-24]            // set is_base2_26
> +       sub     x0,x0,#48               // restore original x0
> +       b       .Ldo_neon
> +
> +.align 4
> +.Leven_neon:
> +       add     x16,x1,#32
> +       adr     x17,.Lzeros
> +       subs    x2,x2,#64
> +       csel    x16,x17,x16,lo
> +
> +       stp     d8,d9,[sp,#16]          // meet ABI requirements
> +       stp     d10,d11,[sp,#32]
> +       stp     d12,d13,[sp,#48]
> +       stp     d14,d15,[sp,#64]
> +
> +       fmov    d24,x10
> +       fmov    d25,x11
> +       fmov    d26,x12
> +       fmov    d27,x13
> +       fmov    d28,x14
> +
> +.Ldo_neon:
> +       ldp     x8,x12,[x16],#16        // inp[2:3] (or zero)
> +       ldp     x9,x13,[x16],#48
> +
> +       lsl     x3,x3,#24
> +       add     x15,x0,#48
> +
> +#ifdef __ARMEB__
> +       rev     x8,x8
> +       rev     x12,x12
> +       rev     x9,x9
> +       rev     x13,x13
> +#endif
> +       and     x4,x8,#0x03ffffff       // base 2^64 -> base 2^26
> +       and     x5,x9,#0x03ffffff
> +       ubfx    x6,x8,#26,#26
> +       ubfx    x7,x9,#26,#26
> +       add     x4,x4,x5,lsl#32         // bfi  x4,x5,#32,#32
> +       extr    x8,x12,x8,#52
> +       extr    x9,x13,x9,#52
> +       add     x6,x6,x7,lsl#32         // bfi  x6,x7,#32,#32
> +       fmov    d14,x4
> +       and     x8,x8,#0x03ffffff
> +       and     x9,x9,#0x03ffffff
> +       ubfx    x10,x12,#14,#26
> +       ubfx    x11,x13,#14,#26
> +       add     x12,x3,x12,lsr#40
> +       add     x13,x3,x13,lsr#40
> +       add     x8,x8,x9,lsl#32         // bfi  x8,x9,#32,#32
> +       fmov    d15,x6
> +       add     x10,x10,x11,lsl#32      // bfi  x10,x11,#32,#32
> +       add     x12,x12,x13,lsl#32      // bfi  x12,x13,#32,#32
> +       fmov    d16,x8
> +       fmov    d17,x10
> +       fmov    d18,x12
> +
> +       ldp     x8,x12,[x1],#16 // inp[0:1]
> +       ldp     x9,x13,[x1],#48
> +
> +       ld1     {v0.4s,v1.4s,v2.4s,v3.4s},[x15],#64
> +       ld1     {v4.4s,v5.4s,v6.4s,v7.4s},[x15],#64
> +       ld1     {v8.4s},[x15]
> +
> +#ifdef __ARMEB__
> +       rev     x8,x8
> +       rev     x12,x12
> +       rev     x9,x9
> +       rev     x13,x13
> +#endif
> +       and     x4,x8,#0x03ffffff       // base 2^64 -> base 2^26
> +       and     x5,x9,#0x03ffffff
> +       ubfx    x6,x8,#26,#26
> +       ubfx    x7,x9,#26,#26
> +       add     x4,x4,x5,lsl#32         // bfi  x4,x5,#32,#32
> +       extr    x8,x12,x8,#52
> +       extr    x9,x13,x9,#52
> +       add     x6,x6,x7,lsl#32         // bfi  x6,x7,#32,#32
> +       fmov    d9,x4
> +       and     x8,x8,#0x03ffffff
> +       and     x9,x9,#0x03ffffff
> +       ubfx    x10,x12,#14,#26
> +       ubfx    x11,x13,#14,#26
> +       add     x12,x3,x12,lsr#40
> +       add     x13,x3,x13,lsr#40
> +       add     x8,x8,x9,lsl#32         // bfi  x8,x9,#32,#32
> +       fmov    d10,x6
> +       add     x10,x10,x11,lsl#32      // bfi  x10,x11,#32,#32
> +       add     x12,x12,x13,lsl#32      // bfi  x12,x13,#32,#32
> +       movi    v31.2d,#-1
> +       fmov    d11,x8
> +       fmov    d12,x10
> +       fmov    d13,x12
> +       ushr    v31.2d,v31.2d,#38
> +
> +       b.ls    .Lskip_loop
> +
> +.align 4
> +.Loop_neon:
> +       ////////////////////////////////////////////////////////////////
> +       // ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
> +       // ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
> +       //   ___________________/
> +       // ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2
> +       // ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r
> +       //   ___________________/ ____________________/
> +       //
> +       // Note that we start with inp[2:3]*r^2. This is because it
> +       // doesn't depend on reduction in previous iteration.
> +       ////////////////////////////////////////////////////////////////
> +       // d4 = h0*r4 + h1*r3   + h2*r2   + h3*r1   + h4*r0
> +       // d3 = h0*r3 + h1*r2   + h2*r1   + h3*r0   + h4*5*r4
> +       // d2 = h0*r2 + h1*r1   + h2*r0   + h3*5*r4 + h4*5*r3
> +       // d1 = h0*r1 + h1*r0   + h2*5*r4 + h3*5*r3 + h4*5*r2
> +       // d0 = h0*r0 + h1*5*r4 + h2*5*r3 + h3*5*r2 + h4*5*r1
> +
> +       subs    x2,x2,#64
> +       umull   v23.2d,v14.2s,v7.s[2]
> +       csel    x16,x17,x16,lo
> +       umull   v22.2d,v14.2s,v5.s[2]
> +       umull   v21.2d,v14.2s,v3.s[2]
> +       ldp     x8,x12,[x16],#16        // inp[2:3] (or zero)
> +       umull   v20.2d,v14.2s,v1.s[2]
> +       ldp     x9,x13,[x16],#48
> +       umull   v19.2d,v14.2s,v0.s[2]
> +#ifdef __ARMEB__
> +       rev     x8,x8
> +       rev     x12,x12
> +       rev     x9,x9
> +       rev     x13,x13
> +#endif
> +
> +       umlal   v23.2d,v15.2s,v5.s[2]
> +       and     x4,x8,#0x03ffffff       // base 2^64 -> base 2^26
> +       umlal   v22.2d,v15.2s,v3.s[2]
> +       and     x5,x9,#0x03ffffff
> +       umlal   v21.2d,v15.2s,v1.s[2]
> +       ubfx    x6,x8,#26,#26
> +       umlal   v20.2d,v15.2s,v0.s[2]
> +       ubfx    x7,x9,#26,#26
> +       umlal   v19.2d,v15.2s,v8.s[2]
> +       add     x4,x4,x5,lsl#32         // bfi  x4,x5,#32,#32
> +
> +       umlal   v23.2d,v16.2s,v3.s[2]
> +       extr    x8,x12,x8,#52
> +       umlal   v22.2d,v16.2s,v1.s[2]
> +       extr    x9,x13,x9,#52
> +       umlal   v21.2d,v16.2s,v0.s[2]
> +       add     x6,x6,x7,lsl#32         // bfi  x6,x7,#32,#32
> +       umlal   v20.2d,v16.2s,v8.s[2]
> +       fmov    d14,x4
> +       umlal   v19.2d,v16.2s,v6.s[2]
> +       and     x8,x8,#0x03ffffff
> +
> +       umlal   v23.2d,v17.2s,v1.s[2]
> +       and     x9,x9,#0x03ffffff
> +       umlal   v22.2d,v17.2s,v0.s[2]
> +       ubfx    x10,x12,#14,#26
> +       umlal   v21.2d,v17.2s,v8.s[2]
> +       ubfx    x11,x13,#14,#26
> +       umlal   v20.2d,v17.2s,v6.s[2]
> +       add     x8,x8,x9,lsl#32         // bfi  x8,x9,#32,#32
> +       umlal   v19.2d,v17.2s,v4.s[2]
> +       fmov    d15,x6
> +
> +       add     v11.2s,v11.2s,v26.2s
> +       add     x12,x3,x12,lsr#40
> +       umlal   v23.2d,v18.2s,v0.s[2]
> +       add     x13,x3,x13,lsr#40
> +       umlal   v22.2d,v18.2s,v8.s[2]
> +       add     x10,x10,x11,lsl#32      // bfi  x10,x11,#32,#32
> +       umlal   v21.2d,v18.2s,v6.s[2]
> +       add     x12,x12,x13,lsl#32      // bfi  x12,x13,#32,#32
> +       umlal   v20.2d,v18.2s,v4.s[2]
> +       fmov    d16,x8
> +       umlal   v19.2d,v18.2s,v2.s[2]
> +       fmov    d17,x10
> +
> +       ////////////////////////////////////////////////////////////////
> +       // (hash+inp[0:1])*r^4 and accumulate
> +
> +       add     v9.2s,v9.2s,v24.2s
> +       fmov    d18,x12
> +       umlal   v22.2d,v11.2s,v1.s[0]
> +       ldp     x8,x12,[x1],#16 // inp[0:1]
> +       umlal   v19.2d,v11.2s,v6.s[0]
> +       ldp     x9,x13,[x1],#48
> +       umlal   v23.2d,v11.2s,v3.s[0]
> +       umlal   v20.2d,v11.2s,v8.s[0]
> +       umlal   v21.2d,v11.2s,v0.s[0]
> +#ifdef __ARMEB__
> +       rev     x8,x8
> +       rev     x12,x12
> +       rev     x9,x9
> +       rev     x13,x13
> +#endif
> +
> +       add     v10.2s,v10.2s,v25.2s
> +       umlal   v22.2d,v9.2s,v5.s[0]
> +       umlal   v23.2d,v9.2s,v7.s[0]
> +       and     x4,x8,#0x03ffffff       // base 2^64 -> base 2^26
> +       umlal   v21.2d,v9.2s,v3.s[0]
> +       and     x5,x9,#0x03ffffff
> +       umlal   v19.2d,v9.2s,v0.s[0]
> +       ubfx    x6,x8,#26,#26
> +       umlal   v20.2d,v9.2s,v1.s[0]
> +       ubfx    x7,x9,#26,#26
> +
> +       add     v12.2s,v12.2s,v27.2s
> +       add     x4,x4,x5,lsl#32         // bfi  x4,x5,#32,#32
> +       umlal   v22.2d,v10.2s,v3.s[0]
> +       extr    x8,x12,x8,#52
> +       umlal   v23.2d,v10.2s,v5.s[0]
> +       extr    x9,x13,x9,#52
> +       umlal   v19.2d,v10.2s,v8.s[0]
> +       add     x6,x6,x7,lsl#32         // bfi  x6,x7,#32,#32
> +       umlal   v21.2d,v10.2s,v1.s[0]
> +       fmov    d9,x4
> +       umlal   v20.2d,v10.2s,v0.s[0]
> +       and     x8,x8,#0x03ffffff
> +
> +       add     v13.2s,v13.2s,v28.2s
> +       and     x9,x9,#0x03ffffff
> +       umlal   v22.2d,v12.2s,v0.s[0]
> +       ubfx    x10,x12,#14,#26
> +       umlal   v19.2d,v12.2s,v4.s[0]
> +       ubfx    x11,x13,#14,#26
> +       umlal   v23.2d,v12.2s,v1.s[0]
> +       add     x8,x8,x9,lsl#32         // bfi  x8,x9,#32,#32
> +       umlal   v20.2d,v12.2s,v6.s[0]
> +       fmov    d10,x6
> +       umlal   v21.2d,v12.2s,v8.s[0]
> +       add     x12,x3,x12,lsr#40
> +
> +       umlal   v22.2d,v13.2s,v8.s[0]
> +       add     x13,x3,x13,lsr#40
> +       umlal   v19.2d,v13.2s,v2.s[0]
> +       add     x10,x10,x11,lsl#32      // bfi  x10,x11,#32,#32
> +       umlal   v23.2d,v13.2s,v0.s[0]
> +       add     x12,x12,x13,lsl#32      // bfi  x12,x13,#32,#32
> +       umlal   v20.2d,v13.2s,v4.s[0]
> +       fmov    d11,x8
> +       umlal   v21.2d,v13.2s,v6.s[0]
> +       fmov    d12,x10
> +       fmov    d13,x12
> +
> +       /////////////////////////////////////////////////////////////////
> +       // lazy reduction as discussed in "NEON crypto" by D.J. Bernstein
> +       // and P. Schwabe
> +       //
> +       // [see discussion in poly1305-armv4 module]
> +
> +       ushr    v29.2d,v22.2d,#26
> +       xtn     v27.2s,v22.2d
> +       ushr    v30.2d,v19.2d,#26
> +       and     v19.16b,v19.16b,v31.16b
> +       add     v23.2d,v23.2d,v29.2d    // h3 -> h4
> +       bic     v27.2s,#0xfc,lsl#24     // &=0x03ffffff
> +       add     v20.2d,v20.2d,v30.2d    // h0 -> h1
> +
> +       ushr    v29.2d,v23.2d,#26
> +       xtn     v28.2s,v23.2d
> +       ushr    v30.2d,v20.2d,#26
> +       xtn     v25.2s,v20.2d
> +       bic     v28.2s,#0xfc,lsl#24
> +       add     v21.2d,v21.2d,v30.2d    // h1 -> h2
> +
> +       add     v19.2d,v19.2d,v29.2d
> +       shl     v29.2d,v29.2d,#2
> +       shrn    v30.2s,v21.2d,#26
> +       xtn     v26.2s,v21.2d
> +       add     v19.2d,v19.2d,v29.2d    // h4 -> h0
> +       bic     v25.2s,#0xfc,lsl#24
> +       add     v27.2s,v27.2s,v30.2s            // h2 -> h3
> +       bic     v26.2s,#0xfc,lsl#24
> +
> +       shrn    v29.2s,v19.2d,#26
> +       xtn     v24.2s,v19.2d
> +       ushr    v30.2s,v27.2s,#26
> +       bic     v27.2s,#0xfc,lsl#24
> +       bic     v24.2s,#0xfc,lsl#24
> +       add     v25.2s,v25.2s,v29.2s            // h0 -> h1
> +       add     v28.2s,v28.2s,v30.2s            // h3 -> h4
> +
> +       b.hi    .Loop_neon
> +
> +.Lskip_loop:
> +       dup     v16.2d,v16.d[0]
> +       add     v11.2s,v11.2s,v26.2s
> +
> +       ////////////////////////////////////////////////////////////////
> +       // multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1
> +
> +       adds    x2,x2,#32
> +       b.ne    .Long_tail
> +
> +       dup     v16.2d,v11.d[0]
> +       add     v14.2s,v9.2s,v24.2s
> +       add     v17.2s,v12.2s,v27.2s
> +       add     v15.2s,v10.2s,v25.2s
> +       add     v18.2s,v13.2s,v28.2s
> +
> +.Long_tail:
> +       dup     v14.2d,v14.d[0]
> +       umull2  v19.2d,v16.4s,v6.4s
> +       umull2  v22.2d,v16.4s,v1.4s
> +       umull2  v23.2d,v16.4s,v3.4s
> +       umull2  v21.2d,v16.4s,v0.4s
> +       umull2  v20.2d,v16.4s,v8.4s
> +
> +       dup     v15.2d,v15.d[0]
> +       umlal2  v19.2d,v14.4s,v0.4s
> +       umlal2  v21.2d,v14.4s,v3.4s
> +       umlal2  v22.2d,v14.4s,v5.4s
> +       umlal2  v23.2d,v14.4s,v7.4s
> +       umlal2  v20.2d,v14.4s,v1.4s
> +
> +       dup     v17.2d,v17.d[0]
> +       umlal2  v19.2d,v15.4s,v8.4s
> +       umlal2  v22.2d,v15.4s,v3.4s
> +       umlal2  v21.2d,v15.4s,v1.4s
> +       umlal2  v23.2d,v15.4s,v5.4s
> +       umlal2  v20.2d,v15.4s,v0.4s
> +
> +       dup     v18.2d,v18.d[0]
> +       umlal2  v22.2d,v17.4s,v0.4s
> +       umlal2  v23.2d,v17.4s,v1.4s
> +       umlal2  v19.2d,v17.4s,v4.4s
> +       umlal2  v20.2d,v17.4s,v6.4s
> +       umlal2  v21.2d,v17.4s,v8.4s
> +
> +       umlal2  v22.2d,v18.4s,v8.4s
> +       umlal2  v19.2d,v18.4s,v2.4s
> +       umlal2  v23.2d,v18.4s,v0.4s
> +       umlal2  v20.2d,v18.4s,v4.4s
> +       umlal2  v21.2d,v18.4s,v6.4s
> +
> +       b.eq    .Lshort_tail
> +
> +       ////////////////////////////////////////////////////////////////
> +       // (hash+inp[0:1])*r^4:r^3 and accumulate
> +
> +       add     v9.2s,v9.2s,v24.2s
> +       umlal   v22.2d,v11.2s,v1.2s
> +       umlal   v19.2d,v11.2s,v6.2s
> +       umlal   v23.2d,v11.2s,v3.2s
> +       umlal   v20.2d,v11.2s,v8.2s
> +       umlal   v21.2d,v11.2s,v0.2s
> +
> +       add     v10.2s,v10.2s,v25.2s
> +       umlal   v22.2d,v9.2s,v5.2s
> +       umlal   v19.2d,v9.2s,v0.2s
> +       umlal   v23.2d,v9.2s,v7.2s
> +       umlal   v20.2d,v9.2s,v1.2s
> +       umlal   v21.2d,v9.2s,v3.2s
> +
> +       add     v12.2s,v12.2s,v27.2s
> +       umlal   v22.2d,v10.2s,v3.2s
> +       umlal   v19.2d,v10.2s,v8.2s
> +       umlal   v23.2d,v10.2s,v5.2s
> +       umlal   v20.2d,v10.2s,v0.2s
> +       umlal   v21.2d,v10.2s,v1.2s
> +
> +       add     v13.2s,v13.2s,v28.2s
> +       umlal   v22.2d,v12.2s,v0.2s
> +       umlal   v19.2d,v12.2s,v4.2s
> +       umlal   v23.2d,v12.2s,v1.2s
> +       umlal   v20.2d,v12.2s,v6.2s
> +       umlal   v21.2d,v12.2s,v8.2s
> +
> +       umlal   v22.2d,v13.2s,v8.2s
> +       umlal   v19.2d,v13.2s,v2.2s
> +       umlal   v23.2d,v13.2s,v0.2s
> +       umlal   v20.2d,v13.2s,v4.2s
> +       umlal   v21.2d,v13.2s,v6.2s
> +
> +.Lshort_tail:
> +       ////////////////////////////////////////////////////////////////
> +       // horizontal add
> +
> +       addp    v22.2d,v22.2d,v22.2d
> +       ldp     d8,d9,[sp,#16]          // meet ABI requirements
> +       addp    v19.2d,v19.2d,v19.2d
> +       ldp     d10,d11,[sp,#32]
> +       addp    v23.2d,v23.2d,v23.2d
> +       ldp     d12,d13,[sp,#48]
> +       addp    v20.2d,v20.2d,v20.2d
> +       ldp     d14,d15,[sp,#64]
> +       addp    v21.2d,v21.2d,v21.2d
> +
> +       ////////////////////////////////////////////////////////////////
> +       // lazy reduction, but without narrowing
> +
> +       ushr    v29.2d,v22.2d,#26
> +       and     v22.16b,v22.16b,v31.16b
> +       ushr    v30.2d,v19.2d,#26
> +       and     v19.16b,v19.16b,v31.16b
> +
> +       add     v23.2d,v23.2d,v29.2d    // h3 -> h4
> +       add     v20.2d,v20.2d,v30.2d    // h0 -> h1
> +
> +       ushr    v29.2d,v23.2d,#26
> +       and     v23.16b,v23.16b,v31.16b
> +       ushr    v30.2d,v20.2d,#26
> +       and     v20.16b,v20.16b,v31.16b
> +       add     v21.2d,v21.2d,v30.2d    // h1 -> h2
> +
> +       add     v19.2d,v19.2d,v29.2d
> +       shl     v29.2d,v29.2d,#2
> +       ushr    v30.2d,v21.2d,#26
> +       and     v21.16b,v21.16b,v31.16b
> +       add     v19.2d,v19.2d,v29.2d    // h4 -> h0
> +       add     v22.2d,v22.2d,v30.2d    // h2 -> h3
> +
> +       ushr    v29.2d,v19.2d,#26
> +       and     v19.16b,v19.16b,v31.16b
> +       ushr    v30.2d,v22.2d,#26
> +       and     v22.16b,v22.16b,v31.16b
> +       add     v20.2d,v20.2d,v29.2d    // h0 -> h1
> +       add     v23.2d,v23.2d,v30.2d    // h3 -> h4
> +
> +       ////////////////////////////////////////////////////////////////
> +       // write the result, can be partially reduced
> +
> +       st4     {v19.s,v20.s,v21.s,v22.s}[0],[x0],#16
> +       st1     {v23.s}[0],[x0]
> +
> +.Lno_data_neon:
> +       ldr     x29,[sp],#80
> +       ret
> +ENDPROC(poly1305_blocks_neon)
> +
> +.align 5
> +ENTRY(poly1305_emit_neon)
> +       ldr     x17,[x0,#24]
> +       cbz     x17,poly1305_emit_arm
> +
> +       ldp     w10,w11,[x0]            // load hash value base 2^26
> +       ldp     w12,w13,[x0,#8]
> +       ldr     w14,[x0,#16]
> +
> +       add     x4,x10,x11,lsl#26       // base 2^26 -> base 2^64
> +       lsr     x5,x12,#12
> +       adds    x4,x4,x12,lsl#52
> +       add     x5,x5,x13,lsl#14
> +       adc     x5,x5,xzr
> +       lsr     x6,x14,#24
> +       adds    x5,x5,x14,lsl#40
> +       adc     x6,x6,xzr               // can be partially reduced...
> +
> +       ldp     x10,x11,[x2]    // load nonce
> +
> +       and     x12,x6,#-4              // ... so reduce
> +       add     x12,x12,x6,lsr#2
> +       and     x6,x6,#3
> +       adds    x4,x4,x12
> +       adcs    x5,x5,xzr
> +       adc     x6,x6,xzr
> +
> +       adds    x12,x4,#5               // compare to modulus
> +       adcs    x13,x5,xzr
> +       adc     x14,x6,xzr
> +
> +       tst     x14,#-4                 // see if it's carried/borrowed
> +
> +       csel    x4,x4,x12,eq
> +       csel    x5,x5,x13,eq
> +
> +#ifdef __ARMEB__
> +       ror     x10,x10,#32             // flip nonce words
> +       ror     x11,x11,#32
> +#endif
> +       adds    x4,x4,x10               // accumulate nonce
> +       adc     x5,x5,x11
> +#ifdef __ARMEB__
> +       rev     x4,x4                   // flip output bytes
> +       rev     x5,x5
> +#endif
> +       stp     x4,x5,[x1]              // write result
> +
> +       ret
> +ENDPROC(poly1305_emit_neon)
> +
> +.align 5
> +.Lzeros:
> +.long  0,0,0,0,0,0,0,0
> --
> 2.19.0
>
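
For readers following the "base 2^64 -> base 2^26" comments in the quoted
assembly, here is a minimal C sketch of the limb conversion that
__poly1305_splat and the NEON entry paths perform. It is not the patch's
code: the function names are made up for illustration, and the inverse
assumes every limb is already below 2^26, whereas the assembly uses
add-with-carry because the limbs may be only partially reduced.

```c
#include <assert.h>
#include <stdint.h>

#define LIMB_MASK 0x03ffffffULL	/* low 26 bits */

/* Hypothetical helper: split h0 + h1*2^64 + h2*2^128 into five 26-bit limbs. */
void limbs64_to_limbs26(uint64_t h0, uint64_t h1, uint64_t h2, uint32_t l[5])
{
	l[0] = (uint32_t)(h0 & LIMB_MASK);			  /* and  #0x03ffffff */
	l[1] = (uint32_t)((h0 >> 26) & LIMB_MASK);		  /* ubfx #26,#26 */
	l[2] = (uint32_t)(((h0 >> 52) | (h1 << 12)) & LIMB_MASK); /* extr #52, and */
	l[3] = (uint32_t)((h1 >> 14) & LIMB_MASK);		  /* ubfx #14,#26 */
	l[4] = (uint32_t)((h1 >> 40) | (h2 << 24));		  /* extr #40 */
}

/* Hypothetical inverse; assumes l[0..3] < 2^26, so OR cannot produce a carry. */
void limbs26_to_limbs64(const uint32_t l[5], uint64_t h[3])
{
	h[0] = (uint64_t)l[0] | ((uint64_t)l[1] << 26) | ((uint64_t)l[2] << 52);
	h[1] = ((uint64_t)l[2] >> 12) | ((uint64_t)l[3] << 14) |
	       ((uint64_t)l[4] << 40);
	h[2] = (uint64_t)l[4] >> 24;
}

/* Round-trip check: converting to 26-bit limbs and back is lossless. */
void check_limb_round_trip(void)
{
	uint32_t l[5];
	uint64_t h[3];

	limbs64_to_limbs26(0x0123456789abcdefULL, 0xfedcba9876543210ULL, 3, l);
	limbs26_to_limbs64(l, h);
	assert(h[0] == 0x0123456789abcdefULL);
	assert(h[1] == 0xfedcba9876543210ULL);
	assert(h[2] == 3);
}
```

The 26-bit limbs leave headroom in 32-bit lanes, which is what lets the
NEON loop accumulate 32x32->64 products and defer carries to the lazy
reduction step.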

^ permalink raw reply

* [PATCH net-next] cxgb4: update supported DCB version
From: Ganesh Goudar @ 2018-09-14 12:05 UTC (permalink / raw)
  To: netdev, davem; +Cc: nirranjan, indranil, dt, varun, Ganesh Goudar

- In the CXGB4_DCB_STATE_FW_INCOMPLETE state, check whether the DCB
  version has changed and update the supported DCB version.

- Also, fill in the priority code point value for priority-based
  flow control.

Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_dcb.c | 27 ++++++++++++++++++++++++++
 drivers/net/ethernet/chelsio/cxgb4/l2t.c       |  6 ++++--
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_dcb.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_dcb.c
index b34f0f0..6ba3104 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_dcb.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_dcb.c
@@ -114,6 +114,24 @@ void cxgb4_dcb_reset(struct net_device *dev)
 	cxgb4_dcb_state_init(dev);
 }
 
+/* Update the DCB port support. If the version is IEEE, set
+ * DCB_CAP_DCBX_VER_IEEE and clear DCB_CAP_DCBX_VER_CEE if it is set.
+ * If the version is CEE, set DCB_CAP_DCBX_VER_CEE and clear
+ * DCB_CAP_DCBX_VER_IEEE if it is set.
+ */
+static inline void cxgb4_dcb_update_support(struct port_dcb_info *dcb)
+{
+	if (dcb->dcb_version == FW_PORT_DCB_VER_IEEE) {
+		if (dcb->supported & DCB_CAP_DCBX_VER_CEE)
+			dcb->supported &= ~DCB_CAP_DCBX_VER_CEE;
+		dcb->supported |= DCB_CAP_DCBX_VER_IEEE;
+	} else if (dcb->dcb_version == FW_PORT_DCB_VER_CEE1D01) {
+		if (dcb->supported & DCB_CAP_DCBX_VER_IEEE)
+			dcb->supported &= ~DCB_CAP_DCBX_VER_IEEE;
+		dcb->supported |= DCB_CAP_DCBX_VER_CEE;
+	}
+}
+
 /* Finite State machine for Data Center Bridging.
  */
 void cxgb4_dcb_state_fsm(struct net_device *dev,
@@ -165,6 +183,15 @@ void cxgb4_dcb_state_fsm(struct net_device *dev,
 	}
 
 	case CXGB4_DCB_STATE_FW_INCOMPLETE: {
+		if (transition_to != CXGB4_DCB_INPUT_FW_DISABLED) {
+			/* during this CXGB4_DCB_STATE_FW_INCOMPLETE state,
+			 * check if the dcb version is changed (there can be
+			 * mismatch in default config & the negotiated switch
+			 * configuration at FW), so update the dcb support
+			 * accordingly.
+			 */
+			cxgb4_dcb_update_support(dcb);
+		}
 		switch (transition_to) {
 		case CXGB4_DCB_INPUT_FW_ENABLED: {
 			/* we're alreaady in firmware DCB mode */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/l2t.c b/drivers/net/ethernet/chelsio/cxgb4/l2t.c
index 301c4df..99022c0 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/l2t.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/l2t.c
@@ -433,10 +433,12 @@ struct l2t_entry *cxgb4_l2t_get(struct l2t_data *d, struct neighbour *neigh,
 	else
 		lport = netdev2pinfo(physdev)->lport;
 
-	if (is_vlan_dev(neigh->dev))
+	if (is_vlan_dev(neigh->dev)) {
 		vlan = vlan_dev_vlan_id(neigh->dev);
-	else
+		vlan |= vlan_dev_get_egress_qos_mask(neigh->dev, priority);
+	} else {
 		vlan = VLAN_NONE;
+	}
 
 	write_lock_bh(&d->lock);
 	for (e = d->l2tab[hash].first; e; e = e->next)
-- 
2.1.0
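
The flag handling in cxgb4_dcb_update_support() can be sketched outside
the kernel as follows. This is only an illustration under stated
assumptions: the constants and the enum below are placeholder stand-ins,
not the kernel's definitions, and the helper returns the new mask instead
of updating a struct port_dcb_info. It also shows that testing a bit
before clearing it is not needed, since clearing an unset bit is a no-op.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder stand-ins for the DCB_CAP_DCBX_VER_* capability bits. */
enum { CAP_OTHER = 0x01, CAP_VER_CEE = 0x04, CAP_VER_IEEE = 0x08 };

/* Placeholder stand-in for the firmware-reported DCB version. */
enum dcb_ver { VER_UNKNOWN, VER_CEE, VER_IEEE };

/* Keep exactly one version bit set, matching the negotiated version. */
uint8_t dcb_update_support(uint8_t supported, enum dcb_ver ver)
{
	switch (ver) {
	case VER_IEEE:
		supported = (uint8_t)(supported & ~CAP_VER_CEE);
		supported |= CAP_VER_IEEE;
		break;
	case VER_CEE:
		supported = (uint8_t)(supported & ~CAP_VER_IEEE);
		supported |= CAP_VER_CEE;
		break;
	default:
		break;	/* unknown version: leave the mask untouched */
	}
	return supported;
}

/* The version bits swap; unrelated capability bits are preserved. */
void check_dcb_update_support(void)
{
	assert(dcb_update_support(CAP_OTHER | CAP_VER_CEE, VER_IEEE) ==
	       (CAP_OTHER | CAP_VER_IEEE));
	assert(dcb_update_support(CAP_OTHER | CAP_VER_IEEE, VER_CEE) ==
	       (CAP_OTHER | CAP_VER_CEE));
	assert(dcb_update_support(CAP_VER_CEE, VER_UNKNOWN) == CAP_VER_CEE);
}
```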

^ permalink raw reply related

* Re: [PATCH] net/mlx4_core: print firmware version during driver loading
From: Qing Huang @ 2018-09-14 17:15 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: netdev, linux-rdma, linux-kernel, tariqt, davem
In-Reply-To: <20180914044314.GC5257@mtr-leonro.mtl.com>

The FW version is actually a crucial piece of information, and it is
printed only once, here, when the driver is loaded. People tend to get
confused when switching back and forth between multiple FW files without
running separate utility tools, especially at customer sites. IMHO, this
information is very useful and takes up very little log file space. :-)

I was also thinking of doing something slightly different. Maybe we just
trim down the output string and add something like this?
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2208,6 +2208,11 @@ static int mlx4_init_fw(struct mlx4_dev *dev)
                         return err;
                 }

+               mlx4_info(dev, "Installed FW version is %d.%d.%03d.\n",
+                         (int) (dev->caps.fw_ver >> 32),
+                         (int) (dev->caps.fw_ver >> 16) & 0xffff,
+                         (int) dev->caps.fw_ver & 0xffff);
+
                 err = mlx4_load_fw(dev);
                 if (err) {
                         mlx4_err(dev, "Failed to start FW, aborting\n");
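
For reference, the shifts and masks in the print above imply that fw_ver
packs the major number above bit 32, the minor number in bits 16..31 and
the sub-minor number in bits 0..15. A minimal user-space sketch of that
decoding, where the struct and function names are hypothetical and not
part of the driver:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct fw_version {
	unsigned int major, minor, subminor;
};

/* Unpack the three fields using the same shifts and masks as the print. */
struct fw_version fw_ver_decode(uint64_t fw_ver)
{
	struct fw_version v = {
		.major = (unsigned int)(fw_ver >> 32),
		.minor = (unsigned int)(fw_ver >> 16) & 0xffff,
		.subminor = (unsigned int)fw_ver & 0xffff,
	};
	return v;
}

/* Format as major.minor.subminor, sub-minor zero-padded to three digits. */
int fw_ver_format(char *buf, size_t len, uint64_t fw_ver)
{
	struct fw_version v = fw_ver_decode(fw_ver);

	return snprintf(buf, len, "%u.%u.%03u", v.major, v.minor, v.subminor);
}

void check_fw_ver_format(void)
{
	char buf[32];
	uint64_t fw_ver = (2ULL << 32) | (42ULL << 16) | 7;

	assert(fw_ver_format(buf, sizeof(buf), fw_ver) == 8);
	assert(strcmp(buf, "2.42.007") == 0);
}
```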

Thanks,
Qing

On 9/13/2018 9:43 PM, Leon Romanovsky wrote:
> On Thu, Sep 13, 2018 at 05:25:14PM -0700, Qing Huang wrote:
>> When debugging firmware related issues, it's very helpful to have
>        ^^^^^^^^^^ exactly, this is why we set this print as mlx4_dbg and
>        not mlx4_info.
>
>> the installed FW version info in the kernel log when the driver is
>> loaded. It's easier to match error/warning messages with different
>> FW versions in the log other than running a separate tool to get
>> the information back and forth.
>>
>> Signed-off-by: Qing Huang <qing.huang@oracle.com>
>> ---
>>   drivers/net/ethernet/mellanox/mlx4/fw.c | 10 +++++-----
>>   1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c
>> index babcfd9..e1c5218 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/fw.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
>> @@ -1686,11 +1686,11 @@ int mlx4_QUERY_FW(struct mlx4_dev *dev)
>>   	MLX4_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET);
>>   	cmd->max_cmds = 1 << lg;
>>
>> -	mlx4_dbg(dev, "FW version %d.%d.%03d (cmd intf rev %d), max commands %d\n",
>> -		 (int) (dev->caps.fw_ver >> 32),
>> -		 (int) (dev->caps.fw_ver >> 16) & 0xffff,
>> -		 (int) dev->caps.fw_ver & 0xffff,
>> -		 cmd_if_rev, cmd->max_cmds);
>> +	mlx4_info(dev, "FW version %d.%d.%03d (cmd intf rev %d), max commands %d\n",
>> +		  (int)(dev->caps.fw_ver >> 32),
>> +		  (int)(dev->caps.fw_ver >> 16) & 0xffff,
>> +		  (int)dev->caps.fw_ver & 0xffff,
>> +		  cmd_if_rev, cmd->max_cmds);
>>
>>   	MLX4_GET(fw->catas_offset, outbox, QUERY_FW_ERR_START_OFFSET);
>>   	MLX4_GET(fw->catas_size,   outbox, QUERY_FW_ERR_SIZE_OFFSET);
>> --
>> 2.9.3
>>

^ permalink raw reply

* Re: [PATCH 5/7] MIPS: mscc: ocelot: add GPIO4 pinmuxing DT node
From: Andrew Lunn @ 2018-09-14 17:02 UTC (permalink / raw)
  To: Quentin Schulz
  Cc: Alexandre Belloni, ralf, paul.burton, jhogan, robh+dt,
	mark.rutland, davem, f.fainelli, allan.nielsen, linux-mips,
	devicetree, linux-kernel, netdev, thomas.petazzoni,
	antoine.tenart
In-Reply-To: <20180914162638.fgzzjin2bzgx74de@qschulz>

On Fri, Sep 14, 2018 at 06:26:38PM +0200, Quentin Schulz wrote:
> Hi Alexandre,
> 
> On Fri, Sep 14, 2018 at 04:54:46PM +0200, Alexandre Belloni wrote:
> > Hi,
> > 
> > On 14/09/2018 11:44:26+0200, Quentin Schulz wrote:
> > > In order to use GPIO4 as a GPIO, we need to mux it in this mode so let's
> > > declare a new pinctrl DT node for it.
> > > 
> > > Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
> > > ---
> > >  arch/mips/boot/dts/mscc/ocelot.dtsi | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/arch/mips/boot/dts/mscc/ocelot.dtsi b/arch/mips/boot/dts/mscc/ocelot.dtsi
> > > index 8ce317c..b5c4c74 100644
> > > --- a/arch/mips/boot/dts/mscc/ocelot.dtsi
> > > +++ b/arch/mips/boot/dts/mscc/ocelot.dtsi
> > > @@ -182,6 +182,11 @@
> > >  			interrupts = <13>;
> > >  			#interrupt-cells = <2>;
> > >  
> > > +			gpio4: gpio4 {
> > > +				pins = "GPIO_4";
> > > +				function = "gpio";
> > > +			};
> > > +
> > 
> > For a GPIO, I would do that in the board dts because it is not used
> > directly in the dtsi.
> > 
> 
> And the day we've two boards using this pinctrl we move it to a dtsi. Is
> that the plan?

Hi Quentin

gpio4 appears to be pretty arbitrary. Could a different design use a
different gpio? To me, this seems like a board property.

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 2/7] net: phy: mscc: add support for VSC8584 PHY
From: Andrew Lunn @ 2018-09-14 16:58 UTC (permalink / raw)
  To: Quentin Schulz
  Cc: alexandre.belloni, ralf, paul.burton, jhogan, robh+dt,
	mark.rutland, davem, f.fainelli, allan.nielsen, linux-mips,
	devicetree, linux-kernel, netdev, thomas.petazzoni,
	antoine.tenart
In-Reply-To: <20180914162828.5e75ffh5sig4om3d@qschulz>

> Confirmed by HW engineers, it only impacts PHYs in the same package.

Hi Quentin

Thanks for checking. As you said, it would be counterintuitive and
cause a lot of confusion if it actually did happen.

Maybe you can add "in package" before broadcast in the commit message
and the code comments.

       Andrew

^ permalink raw reply

* [PATCH] net: hp100: fix always-true check for link up state
From: Colin King @ 2018-09-14 16:39 UTC (permalink / raw)
  To: Jaroslav Kysela, David S . Miller, netdev; +Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

The expression ~(hp100_inb(VG_LAN_CFG_1) & HP100_LINK_UP_ST) is always
non-zero, so the wait for the link to drop always terminates
prematurely.  Fix this by using a logical not operator instead of a
bitwise complement.  This issue has been in the driver since
pre-2.6.12-rc2.

Detected by CoverityScan, CID#114157 ("Logical vs. bitwise operator")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/ethernet/hp/hp100.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hp/hp100.c b/drivers/net/ethernet/hp/hp100.c
index c8c7ad2eff77..9b5a68b65432 100644
--- a/drivers/net/ethernet/hp/hp100.c
+++ b/drivers/net/ethernet/hp/hp100.c
@@ -2634,7 +2634,7 @@ static int hp100_login_to_vg_hub(struct net_device *dev, u_short force_relogin)
 		/* Wait for link to drop */
 		time = jiffies + (HZ / 10);
 		do {
-			if (~(hp100_inb(VG_LAN_CFG_1) & HP100_LINK_UP_ST))
+			if (!(hp100_inb(VG_LAN_CFG_1) & HP100_LINK_UP_ST))
 				break;
 			if (!in_interrupt())
 				schedule_timeout_interruptible(1);
-- 
2.17.1
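
Editorial note: the difference between the two operators can be sketched
in isolation. The snippet below is only an illustration, not driver code.
LINK_UP_ST and both helper names are hypothetical, and the bit value 0x40
is an assumption chosen for the example, not the real HP100_LINK_UP_ST
definition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-bit status mask; the real value is not shown above. */
#define LINK_UP_ST 0x40

/* Old form: bitwise complement of the masked value.
 * The masked value is either 0 or 0x40 (promoted to int), so its
 * complement is either ~0 or ~0x40, and neither of those is zero. */
static bool link_down_bitwise(uint8_t reg)
{
	return ~(reg & LINK_UP_ST) != 0;
}

/* Fixed form: logical not, true only when the status bit is clear. */
static bool link_down_logical(uint8_t reg)
{
	return !(reg & LINK_UP_ST);
}
```

With the bitwise form the check passes whether or not the bit is set, so
the wait loop would break on its first iteration. The logical form only
passes once the bit reads as clear.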

^ permalink raw reply related

* [PATCH] net: ethernet: ti: add missing GENERIC_ALLOCATOR dependency
From: Corentin Labbe @ 2018-09-14 11:20 UTC (permalink / raw)
  To: davem; +Cc: linux-kernel, netdev, Corentin Labbe

This patch makes TI_DAVINCI_CPDMA select GENERIC_ALLOCATOR.
Without that, the following sparc64 build failure happens:

drivers/net/ethernet/ti/davinci_cpdma.o: In function `cpdma_check_free_tx_desc':
(.text+0x278): undefined reference to `gen_pool_avail'
drivers/net/ethernet/ti/davinci_cpdma.o: In function `cpdma_chan_submit':
(.text+0x340): undefined reference to `gen_pool_alloc'
(.text+0x5c4): undefined reference to `gen_pool_free'
drivers/net/ethernet/ti/davinci_cpdma.o: In function `__cpdma_chan_free':
davinci_cpdma.c:(.text+0x64c): undefined reference to `gen_pool_free'
drivers/net/ethernet/ti/davinci_cpdma.o: In function `cpdma_desc_pool_destroy.isra.6':
davinci_cpdma.c:(.text+0x17ac): undefined reference to `gen_pool_size'
davinci_cpdma.c:(.text+0x17b8): undefined reference to `gen_pool_avail'
davinci_cpdma.c:(.text+0x1824): undefined reference to `gen_pool_size'
davinci_cpdma.c:(.text+0x1830): undefined reference to `gen_pool_avail'
drivers/net/ethernet/ti/davinci_cpdma.o: In function `cpdma_ctlr_create':
(.text+0x19f8): undefined reference to `devm_gen_pool_create'
(.text+0x1a90): undefined reference to `gen_pool_add_virt'
Makefile:1011: recipe for target 'vmlinux' failed

Signed-off-by: Corentin Labbe <clabbe@baylibre.com>
---
 drivers/net/ethernet/ti/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/ti/Kconfig b/drivers/net/ethernet/ti/Kconfig
index 9263d63..f932923 100644
--- a/drivers/net/ethernet/ti/Kconfig
+++ b/drivers/net/ethernet/ti/Kconfig
@@ -41,6 +41,7 @@ config TI_DAVINCI_MDIO
 config TI_DAVINCI_CPDMA
 	tristate "TI DaVinci CPDMA Support"
 	depends on ARCH_DAVINCI || ARCH_OMAP2PLUS || COMPILE_TEST
+	select GENERIC_ALLOCATOR
 	---help---
 	  This driver supports TI's DaVinci CPDMA dma engine.
 
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net-next 2/7] net: phy: mscc: add support for VSC8584 PHY
From: Quentin Schulz @ 2018-09-14 16:28 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: alexandre.belloni, ralf, paul.burton, jhogan, robh+dt,
	mark.rutland, davem, f.fainelli, allan.nielsen, linux-mips,
	devicetree, linux-kernel, netdev, thomas.petazzoni,
	antoine.tenart
In-Reply-To: <20180914132930.fphdm3dm2incetbq@qschulz>

Hi Andrew,

On Fri, Sep 14, 2018 at 03:29:30PM +0200, Quentin Schulz wrote:
> Hi Andrew,
> 
> On Fri, Sep 14, 2018 at 03:18:46PM +0200, Andrew Lunn wrote:
> > > Most of the init sequence of a PHY of the package is common to all PHYs
> > > in the package, thus we use the SMI broadcast feature which enables us
> > > to propagate a write in one register of one PHY to all PHYs in the
> > > package.
> > 
> > Hi Quentin
> > 
> > Could you say a bit more about the broadcast? Does the SMI broadcast
> > go to all PHYs everywhere on an MDIO bus, or only all PHYs within one
> > package? I'm just thinking about the case where you need two of these
> > packages to cover 8 switch ports.
> > 
> 
> Ah sorry, that wasn't very explicit. That's a feature on the PHY side so
> my wildest guess is that it wouldn't impact any other PHY outside of
> this package. Affecting any other PHY on the bus is counter-intuitive to
> me but I'll ask the HW engineers for confirmation.
> 

Confirmed by HW engineers, it only impacts PHYs in the same package.

Quentin
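
Editorial note: the package-scoped behaviour confirmed above can be
pictured with a toy model. This is a sketch under stated assumptions, not
the mscc driver and not a real MDIO API: every name is hypothetical, and
it assumes four PHYs per package at contiguous bus addresses, with two
packages on one bus as in the 8-port case discussed in the thread.

```c
#include <stdint.h>

/* Assumptions for the toy model only. */
#define PHYS_PER_PACKAGE 4
#define NUM_PACKAGES     2
#define NUM_PHYS         (PHYS_PER_PACKAGE * NUM_PACKAGES)
#define NUM_REGS         32

/* Register file of every PHY on the simulated bus, zero-initialised. */
static uint16_t toy_regs[NUM_PHYS][NUM_REGS];

/* Map a bus address to its package (contiguous addresses assumed). */
static int toy_package_of(int addr)
{
	return addr / PHYS_PER_PACKAGE;
}

/* Model a write issued to one PHY while broadcast is enabled:
 * the value lands in every PHY of the same package and nowhere else. */
static void toy_broadcast_write(int addr, int reg, uint16_t val)
{
	int pkg = toy_package_of(addr);

	for (int i = 0; i < NUM_PHYS; i++)
		if (toy_package_of(i) == pkg)
			toy_regs[i][reg] = val;
}
```

In this model, a broadcast write addressed to PHY 1 updates PHYs 0 to 3
and leaves PHYs 4 to 7 of the second package untouched.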

^ permalink raw reply

* Re: [PATCH 5/7] MIPS: mscc: ocelot: add GPIO4 pinmuxing DT node
From: Quentin Schulz @ 2018-09-14 16:26 UTC (permalink / raw)
  To: Alexandre Belloni
  Cc: ralf, paul.burton, jhogan, robh+dt, mark.rutland, davem, andrew,
	f.fainelli, allan.nielsen, linux-mips, devicetree, linux-kernel,
	netdev, thomas.petazzoni, antoine.tenart
In-Reply-To: <20180914145446.GQ14988@piout.net>

Hi Alexandre,

On Fri, Sep 14, 2018 at 04:54:46PM +0200, Alexandre Belloni wrote:
> Hi,
> 
> On 14/09/2018 11:44:26+0200, Quentin Schulz wrote:
> > In order to use GPIO4 as a GPIO, we need to mux it in this mode so let's
> > declare a new pinctrl DT node for it.
> > 
> > Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
> > ---
> >  arch/mips/boot/dts/mscc/ocelot.dtsi | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/arch/mips/boot/dts/mscc/ocelot.dtsi b/arch/mips/boot/dts/mscc/ocelot.dtsi
> > index 8ce317c..b5c4c74 100644
> > --- a/arch/mips/boot/dts/mscc/ocelot.dtsi
> > +++ b/arch/mips/boot/dts/mscc/ocelot.dtsi
> > @@ -182,6 +182,11 @@
> >  			interrupts = <13>;
> >  			#interrupt-cells = <2>;
> >  
> > +			gpio4: gpio4 {
> > +				pins = "GPIO_4";
> > +				function = "gpio";
> > +			};
> > +
> 
> For a GPIO, I would do that in the board dts because it is not used
> directly in the dtsi.
> 

And the day we have two boards using this pinctrl, we move it to a dtsi.
Is that the plan?

Thanks,
Quentin

^ permalink raw reply
