Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next v5] openvswitch: enable NSH support
From: Yang, Yi @ 2017-08-21  8:41 UTC (permalink / raw)
  To: Jiri Benc; +Cc: netdev, dev, blp, e, jan.scheurich
In-Reply-To: <20170821102509.076d7cc9@griffin>

On Mon, Aug 21, 2017 at 10:25:09AM +0200, Jiri Benc wrote:
> On Mon, 21 Aug 2017 13:54:23 +0800, Yi Yang wrote:
> > v4->v5
> >  - Fix many comments by Jiri Benc and Eric Garver
> >    for v4.
> 
> NACK for v5, we haven't finished discussing v4. You're sending new
> versions too quickly.
> 
> My comment about sending a new version was meant only for the rest of
> the patch that I have not reviewed - i.e. that once the comments I've
> made so far are addressed, I'm fine with continuing reviewing in a new
> version.
> 
> It did not mean that I'm okay with sending a new version without my
> comments to the first part being addressed. Which they are not, see my
> reply to your mail at v4.
> 
> Sorry if that was not clear but I won't review v5 until we have
> conclusion for the v4 comments.

Ok, let me fix comments for v4 first.

> 
>  Jiri

^ permalink raw reply

* Re: [PATCH v2] net: sunrpc: svcsock: fix NULL-pointer exception
From: Vadim Lomovtsev @ 2017-08-21  8:50 UTC (permalink / raw)
  To: trond.myklebust, anna.schumaker, bfields, jlayton, davem,
	linux-nfs, netdev, linux-kernel, pabeni
  Cc: vlomovts
In-Reply-To: <1503305085-5488-1-git-send-email-vlomovts@redhat.com>


Sorry guys, please ignore this - being sent by mistake. 
Have some typos at comments by copy-paste, will correct and re-send

Vadim

On Mon, Aug 21, 2017 at 04:44:45AM -0400, Vadim Lomovtsev wrote:
> While running nfs/connectathon tests kernel NULL-pointer exception
> has been observed due to races in svcsock.c.
> 
> Race is appear when kernel accepts connection by kernel_accept
> (which creates new socket) and start queuing ingress packets
> to new socket. This happens in ksoftirq context which could run
> concurrently on a different core while new socket setup is not done yet.
> 
> The fix is to re-order socket user data init sequence and add
> write/read barrier calls to be sure that we got proper values
> for callback pointers before actually calling them.
> 
> Test results: nfs/connectathon reports '0' failed tests for about 200+ iterations.
> 
> Crash log:
> ---<-snip->---
> [ 6708.638984] Unable to handle kernel NULL pointer dereference at virtual address 00000000
> [ 6708.647093] pgd = ffff0000094e0000
> [ 6708.650497] [00000000] *pgd=0000010ffff90003, *pud=0000010ffff90003, *pmd=0000010ffff80003, *pte=0000000000000000
> [ 6708.660761] Internal error: Oops: 86000005 [#1] SMP
> [ 6708.665630] Modules linked in: nfsv3 nfnetlink_queue nfnetlink_log nfnetlink rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache overlay xt_CONNSECMARK xt_SECMARK xt_conntrack iptable_security ip_tables ah4 xfrm4_mode_transport sctp tun binfmt_misc ext4 jbd2 mbcache loop tcp_diag udp_diag inet_diag rpcrdma ib_isert iscsi_target_mod ib_iser rdma_cm iw_cm libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib ib_ucm ib_uverbs ib_umad ib_cm ib_core nls_koi8_u nls_cp932 ts_kmp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack vfat fat ghash_ce sha2_ce sha1_ce cavium_rng_vf i2c_thunderx sg thunderx_edac i2c_smbus edac_core cavium_rng nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c nicvf nicpf ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgb
 lt fb_sys_fops
> [ 6708.736446]  ttm drm i2c_core thunder_bgx thunder_xcv mdio_thunder mdio_cavium dm_mirror dm_region_hash dm_log dm_mod [last unloaded: stap_3c300909c5b3f46dcacd49aab3334af_87021]
> [ 6708.752275] CPU: 84 PID: 0 Comm: swapper/84 Tainted: G        W  OE   4.11.0-4.el7.aarch64 #1
> [ 6708.760787] Hardware name: www.cavium.com CRB-2S/CRB-2S, BIOS 0.3 Mar 13 2017
> [ 6708.767910] task: ffff810006842e80 task.stack: ffff81000689c000
> [ 6708.773822] PC is at 0x0
> [ 6708.776739] LR is at svc_data_ready+0x38/0x88 [sunrpc]
> [ 6708.781866] pc : [<0000000000000000>] lr : [<ffff0000029d7378>] pstate: 60000145
> [ 6708.789248] sp : ffff810ffbad3900
> [ 6708.792551] x29: ffff810ffbad3900 x28: ffff000008c73d58
> [ 6708.797853] x27: 0000000000000000 x26: ffff81000bbe1e00
> [ 6708.803156] x25: 0000000000000020 x24: ffff800f7410bf28
> [ 6708.808458] x23: ffff000008c63000 x22: ffff000008c63000
> [ 6708.813760] x21: ffff800f7410bf28 x20: ffff81000bbe1e00
> [ 6708.819063] x19: ffff810012412400 x18: 00000000d82a9df2
> [ 6708.824365] x17: 0000000000000000 x16: 0000000000000000
> [ 6708.829667] x15: 0000000000000000 x14: 0000000000000001
> [ 6708.834969] x13: 0000000000000000 x12: 722e736f622e676e
> [ 6708.840271] x11: 00000000f814dd99 x10: 0000000000000000
> [ 6708.845573] x9 : 7374687225000000 x8 : 0000000000000000
> [ 6708.850875] x7 : 0000000000000000 x6 : 0000000000000000
> [ 6708.856177] x5 : 0000000000000028 x4 : 0000000000000000
> [ 6708.861479] x3 : 0000000000000000 x2 : 00000000e5000000
> [ 6708.866781] x1 : 0000000000000000 x0 : ffff81000bbe1e00
> [ 6708.872084]
> [ 6708.873565] Process swapper/84 (pid: 0, stack limit = 0xffff81000689c000)
> [ 6708.880341] Stack: (0xffff810ffbad3900 to 0xffff8100068a0000)
> [ 6708.886075] Call trace:
> [ 6708.888513] Exception stack(0xffff810ffbad3710 to 0xffff810ffbad3840)
> [ 6708.894942] 3700:                                   ffff810012412400 0001000000000000
> [ 6708.902759] 3720: ffff810ffbad3900 0000000000000000 0000000060000145 ffff800f79300000
> [ 6708.910577] 3740: ffff000009274d00 00000000000003ea 0000000000000015 ffff000008c63000
> [ 6708.918395] 3760: ffff810ffbad3830 ffff800f79300000 000000000000004d 0000000000000000
> [ 6708.926212] 3780: ffff810ffbad3890 ffff0000080f88dc ffff800f79300000 000000000000004d
> [ 6708.934030] 37a0: ffff800f7930093c ffff000008c63000 0000000000000000 0000000000000140
> [ 6708.941848] 37c0: ffff000008c2c000 0000000000040b00 ffff81000bbe1e00 0000000000000000
> [ 6708.949665] 37e0: 00000000e5000000 0000000000000000 0000000000000000 0000000000000028
> [ 6708.957483] 3800: 0000000000000000 0000000000000000 0000000000000000 7374687225000000
> [ 6708.965300] 3820: 0000000000000000 00000000f814dd99 722e736f622e676e 0000000000000000
> [ 6708.973117] [<          (null)>]           (null)
> [ 6708.977824] [<ffff0000086f9fa4>] tcp_data_queue+0x754/0xc5c
> [ 6708.983386] [<ffff0000086fa64c>] tcp_rcv_established+0x1a0/0x67c
> [ 6708.989384] [<ffff000008704120>] tcp_v4_do_rcv+0x15c/0x22c
> [ 6708.994858] [<ffff000008707418>] tcp_v4_rcv+0xaf0/0xb58
> [ 6709.000077] [<ffff0000086df784>] ip_local_deliver_finish+0x10c/0x254
> [ 6709.006419] [<ffff0000086dfea4>] ip_local_deliver+0xf0/0xfc
> [ 6709.011980] [<ffff0000086dfad4>] ip_rcv_finish+0x208/0x3a4
> [ 6709.017454] [<ffff0000086e018c>] ip_rcv+0x2dc/0x3c8
> [ 6709.022328] [<ffff000008692fc8>] __netif_receive_skb_core+0x2f8/0xa0c
> [ 6709.028758] [<ffff000008696068>] __netif_receive_skb+0x38/0x84
> [ 6709.034580] [<ffff00000869611c>] netif_receive_skb_internal+0x68/0xdc
> [ 6709.041010] [<ffff000008696bc0>] napi_gro_receive+0xcc/0x1a8
> [ 6709.046690] [<ffff0000014b0fc4>] nicvf_cq_intr_handler+0x59c/0x730 [nicvf]
> [ 6709.053559] [<ffff0000014b1380>] nicvf_poll+0x38/0xb8 [nicvf]
> [ 6709.059295] [<ffff000008697a6c>] net_rx_action+0x2f8/0x464
> [ 6709.064771] [<ffff000008081824>] __do_softirq+0x11c/0x308
> [ 6709.070164] [<ffff0000080d14e4>] irq_exit+0x12c/0x174
> [ 6709.075206] [<ffff00000813101c>] __handle_domain_irq+0x78/0xc4
> [ 6709.081027] [<ffff000008081608>] gic_handle_irq+0x94/0x190
> [ 6709.086501] Exception stack(0xffff81000689fdf0 to 0xffff81000689ff20)
> [ 6709.092929] fde0:                                   0000810ff2ec0000 ffff000008c10000
> [ 6709.100747] fe00: ffff000008c70ef4 0000000000000001 0000000000000000 ffff810ffbad9b18
> [ 6709.108565] fe20: ffff810ffbad9c70 ffff8100169d3800 ffff810006843ab0 ffff81000689fe80
> [ 6709.116382] fe40: 0000000000000bd0 0000ffffdf979cd0 183f5913da192500 0000ffff8a254ce4
> [ 6709.124200] fe60: 0000ffff8a254b78 0000aaab10339808 0000000000000000 0000ffff8a0c2a50
> [ 6709.132018] fe80: 0000ffffdf979b10 ffff000008d6d450 ffff000008c10000 ffff000008d6d000
> [ 6709.139836] fea0: 0000000000000054 ffff000008cd3dbc 0000000000000000 0000000000000000
> [ 6709.147653] fec0: 0000000000000000 0000000000000000 0000000000000000 ffff81000689ff20
> [ 6709.155471] fee0: ffff000008085240 ffff81000689ff20 ffff000008085244 0000000060000145
> [ 6709.163289] ff00: ffff81000689ff10 ffff00000813f1e4 ffffffffffffffff ffff00000813f238
> [ 6709.171107] [<ffff000008082eb4>] el1_irq+0xb4/0x140
> [ 6709.175976] [<ffff000008085244>] arch_cpu_idle+0x44/0x11c
> [ 6709.181368] [<ffff0000087bf3b8>] default_idle_call+0x20/0x30
> [ 6709.187020] [<ffff000008116d50>] do_idle+0x158/0x1e4
> [ 6709.191973] [<ffff000008116ff4>] cpu_startup_entry+0x2c/0x30
> [ 6709.197624] [<ffff00000808e7cc>] secondary_start_kernel+0x13c/0x160
> [ 6709.203878] [<0000000001bc71c4>] 0x1bc71c4
> [ 6709.207967] Code: bad PC value
> [ 6709.211061] SMP: stopping secondary CPUs
> [ 6709.218830] Starting crashdump kernel...
> [ 6709.222749] Bye!
> ---<-snip>---
> 
> Signed-off-by: Vadim Lomovtsev <vlomovts@redhat.com>
> ---
>  net/sunrpc/svcsock.c | 27 +++++++++++++++++++++++++--
>  1 file changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 2b720fa..2be967f 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -421,6 +421,10 @@ static void svc_data_ready(struct sock *sk)
>  		dprintk("svc: socket %p(inet %p), busy=%d\n",
>  			svsk, sk,
>  			test_bit(XPT_BUSY, &svsk->sk_xprt.xpt_flags));
> +		/* this barrier is necessary to prevent kernel crash
> +		   (due to bad CPU-value) caused by races aginst
> +		   svc_setup_socket() while calling sk_odata() callback. */
> +		rmb();
>  		svsk->sk_odata(sk);
>  		if (!test_and_set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags))
>  			svc_xprt_enqueue(&svsk->sk_xprt);
> @@ -437,6 +441,10 @@ static void svc_write_space(struct sock *sk)
>  	if (svsk) {
>  		dprintk("svc: socket %p(inet %p), write_space busy=%d\n",
>  			svsk, sk, test_bit(XPT_BUSY, &svsk->sk_xprt.xpt_flags));
> +		/* this barrier is necessary to prevent kernel crash
> +		   (due to bad CPU-value) caused by races aginst
> +		   svc_setup_socket() while calling sk_owspace() callback. */
> +		rmb();
>  		svsk->sk_owspace(sk);
>  		svc_xprt_enqueue(&svsk->sk_xprt);
>  	}
> @@ -760,8 +768,14 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
>  	dprintk("svc: socket %p TCP (listen) state change %d\n",
>  		sk, sk->sk_state);
>  
> -	if (svsk)
> +	if (svsk) {
> +		/* this barrier is necessary to prevent kernel crash
> +		   (due to bad CPU-value) caused by races aginst
> +		   svc_setup_socket() while calling sk_odata() callback.*/
> +		rmb();
>  		svsk->sk_odata(sk);
> +	}
> +
>  	/*
>  	 * This callback may called twice when a new connection
>  	 * is established as a child socket inherits everything
> @@ -794,7 +808,12 @@ static void svc_tcp_state_change(struct sock *sk)
>  	if (!svsk)
>  		printk("svc: socket %p: no user data\n", sk);
>  	else {
> +		/* this barrier is necessary to prevent kernel crash
> +		   (due to bad CPU-value) caused by races aginst
> +		   svc_setup_socket() while calling sk_odata() callback. */
> +		rmb();
>  		svsk->sk_ostate(sk);
> +
>  		if (sk->sk_state != TCP_ESTABLISHED) {
>  			set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
>  			svc_xprt_enqueue(&svsk->sk_xprt);
> @@ -1381,12 +1400,16 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
>  		return ERR_PTR(err);
>  	}
>  
> -	inet->sk_user_data = svsk;
>  	svsk->sk_sock = sock;
>  	svsk->sk_sk = inet;
>  	svsk->sk_ostate = inet->sk_state_change;
>  	svsk->sk_odata = inet->sk_data_ready;
>  	svsk->sk_owspace = inet->sk_write_space;
> +	/* this barrier is necessary in order to prevent race condition
> +	   with svc_data_ready(), svc_listen_data_ready()
> +	   and others when calling callbacks above */
> +	wmb();
> +	inet->sk_user_data = svsk;
>  
>  	/* Initialize the socket */
>  	if (sock->type == SOCK_DGRAM)
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* [PATCH v2] net: sunrpc: svcsock: fix NULL-pointer exception
From: Vadim Lomovtsev @ 2017-08-21  8:44 UTC (permalink / raw)
  To: trond.myklebust-7I+n7zu2hftEKMMhf/gKZA,
	anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA,
	bfields-uC3wQj2KruNg9hUCZPvPmw, jlayton-vpEMnDpepFuMZCB2o+C8xQ,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	pabeni-H+wXaHxf7aLQT0dZR+AlfA
  Cc: Vadim Lomovtsev
In-Reply-To: <1503050447-13362-1-git-send-email-vlomovts-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

While running nfs/connectathon tests kernel NULL-pointer exception
has been observed due to races in svcsock.c.

Race is appear when kernel accepts connection by kernel_accept
(which creates new socket) and start queuing ingress packets
to new socket. This happens in ksoftirq context which could run
concurrently on a different core while new socket setup is not done yet.

The fix is to re-order socket user data init sequence and add
write/read barrier calls to be sure that we got proper values
for callback pointers before actually calling them.

Test results: nfs/connectathon reports '0' failed tests for about 200+ iterations.

Crash log:
---<-snip->---
[ 6708.638984] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[ 6708.647093] pgd = ffff0000094e0000
[ 6708.650497] [00000000] *pgd=0000010ffff90003, *pud=0000010ffff90003, *pmd=0000010ffff80003, *pte=0000000000000000
[ 6708.660761] Internal error: Oops: 86000005 [#1] SMP
[ 6708.665630] Modules linked in: nfsv3 nfnetlink_queue nfnetlink_log nfnetlink rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache overlay xt_CONNSECMARK xt_SECMARK xt_conntrack iptable_security ip_tables ah4 xfrm4_mode_transport sctp tun binfmt_misc ext4 jbd2 mbcache loop tcp_diag udp_diag inet_diag rpcrdma ib_isert iscsi_target_mod ib_iser rdma_cm iw_cm libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib ib_ucm ib_uverbs ib_umad ib_cm ib_core nls_koi8_u nls_cp932 ts_kmp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack vfat fat ghash_ce sha2_ce sha1_ce cavium_rng_vf i2c_thunderx sg thunderx_edac i2c_smbus edac_core cavium_rng nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c nicvf nicpf ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
  fb_sys_fops
[ 6708.736446]  ttm drm i2c_core thunder_bgx thunder_xcv mdio_thunder mdio_cavium dm_mirror dm_region_hash dm_log dm_mod [last unloaded: stap_3c300909c5b3f46dcacd49aab3334af_87021]
[ 6708.752275] CPU: 84 PID: 0 Comm: swapper/84 Tainted: G        W  OE   4.11.0-4.el7.aarch64 #1
[ 6708.760787] Hardware name: www.cavium.com CRB-2S/CRB-2S, BIOS 0.3 Mar 13 2017
[ 6708.767910] task: ffff810006842e80 task.stack: ffff81000689c000
[ 6708.773822] PC is at 0x0
[ 6708.776739] LR is at svc_data_ready+0x38/0x88 [sunrpc]
[ 6708.781866] pc : [<0000000000000000>] lr : [<ffff0000029d7378>] pstate: 60000145
[ 6708.789248] sp : ffff810ffbad3900
[ 6708.792551] x29: ffff810ffbad3900 x28: ffff000008c73d58
[ 6708.797853] x27: 0000000000000000 x26: ffff81000bbe1e00
[ 6708.803156] x25: 0000000000000020 x24: ffff800f7410bf28
[ 6708.808458] x23: ffff000008c63000 x22: ffff000008c63000
[ 6708.813760] x21: ffff800f7410bf28 x20: ffff81000bbe1e00
[ 6708.819063] x19: ffff810012412400 x18: 00000000d82a9df2
[ 6708.824365] x17: 0000000000000000 x16: 0000000000000000
[ 6708.829667] x15: 0000000000000000 x14: 0000000000000001
[ 6708.834969] x13: 0000000000000000 x12: 722e736f622e676e
[ 6708.840271] x11: 00000000f814dd99 x10: 0000000000000000
[ 6708.845573] x9 : 7374687225000000 x8 : 0000000000000000
[ 6708.850875] x7 : 0000000000000000 x6 : 0000000000000000
[ 6708.856177] x5 : 0000000000000028 x4 : 0000000000000000
[ 6708.861479] x3 : 0000000000000000 x2 : 00000000e5000000
[ 6708.866781] x1 : 0000000000000000 x0 : ffff81000bbe1e00
[ 6708.872084]
[ 6708.873565] Process swapper/84 (pid: 0, stack limit = 0xffff81000689c000)
[ 6708.880341] Stack: (0xffff810ffbad3900 to 0xffff8100068a0000)
[ 6708.886075] Call trace:
[ 6708.888513] Exception stack(0xffff810ffbad3710 to 0xffff810ffbad3840)
[ 6708.894942] 3700:                                   ffff810012412400 0001000000000000
[ 6708.902759] 3720: ffff810ffbad3900 0000000000000000 0000000060000145 ffff800f79300000
[ 6708.910577] 3740: ffff000009274d00 00000000000003ea 0000000000000015 ffff000008c63000
[ 6708.918395] 3760: ffff810ffbad3830 ffff800f79300000 000000000000004d 0000000000000000
[ 6708.926212] 3780: ffff810ffbad3890 ffff0000080f88dc ffff800f79300000 000000000000004d
[ 6708.934030] 37a0: ffff800f7930093c ffff000008c63000 0000000000000000 0000000000000140
[ 6708.941848] 37c0: ffff000008c2c000 0000000000040b00 ffff81000bbe1e00 0000000000000000
[ 6708.949665] 37e0: 00000000e5000000 0000000000000000 0000000000000000 0000000000000028
[ 6708.957483] 3800: 0000000000000000 0000000000000000 0000000000000000 7374687225000000
[ 6708.965300] 3820: 0000000000000000 00000000f814dd99 722e736f622e676e 0000000000000000
[ 6708.973117] [<          (null)>]           (null)
[ 6708.977824] [<ffff0000086f9fa4>] tcp_data_queue+0x754/0xc5c
[ 6708.983386] [<ffff0000086fa64c>] tcp_rcv_established+0x1a0/0x67c
[ 6708.989384] [<ffff000008704120>] tcp_v4_do_rcv+0x15c/0x22c
[ 6708.994858] [<ffff000008707418>] tcp_v4_rcv+0xaf0/0xb58
[ 6709.000077] [<ffff0000086df784>] ip_local_deliver_finish+0x10c/0x254
[ 6709.006419] [<ffff0000086dfea4>] ip_local_deliver+0xf0/0xfc
[ 6709.011980] [<ffff0000086dfad4>] ip_rcv_finish+0x208/0x3a4
[ 6709.017454] [<ffff0000086e018c>] ip_rcv+0x2dc/0x3c8
[ 6709.022328] [<ffff000008692fc8>] __netif_receive_skb_core+0x2f8/0xa0c
[ 6709.028758] [<ffff000008696068>] __netif_receive_skb+0x38/0x84
[ 6709.034580] [<ffff00000869611c>] netif_receive_skb_internal+0x68/0xdc
[ 6709.041010] [<ffff000008696bc0>] napi_gro_receive+0xcc/0x1a8
[ 6709.046690] [<ffff0000014b0fc4>] nicvf_cq_intr_handler+0x59c/0x730 [nicvf]
[ 6709.053559] [<ffff0000014b1380>] nicvf_poll+0x38/0xb8 [nicvf]
[ 6709.059295] [<ffff000008697a6c>] net_rx_action+0x2f8/0x464
[ 6709.064771] [<ffff000008081824>] __do_softirq+0x11c/0x308
[ 6709.070164] [<ffff0000080d14e4>] irq_exit+0x12c/0x174
[ 6709.075206] [<ffff00000813101c>] __handle_domain_irq+0x78/0xc4
[ 6709.081027] [<ffff000008081608>] gic_handle_irq+0x94/0x190
[ 6709.086501] Exception stack(0xffff81000689fdf0 to 0xffff81000689ff20)
[ 6709.092929] fde0:                                   0000810ff2ec0000 ffff000008c10000
[ 6709.100747] fe00: ffff000008c70ef4 0000000000000001 0000000000000000 ffff810ffbad9b18
[ 6709.108565] fe20: ffff810ffbad9c70 ffff8100169d3800 ffff810006843ab0 ffff81000689fe80
[ 6709.116382] fe40: 0000000000000bd0 0000ffffdf979cd0 183f5913da192500 0000ffff8a254ce4
[ 6709.124200] fe60: 0000ffff8a254b78 0000aaab10339808 0000000000000000 0000ffff8a0c2a50
[ 6709.132018] fe80: 0000ffffdf979b10 ffff000008d6d450 ffff000008c10000 ffff000008d6d000
[ 6709.139836] fea0: 0000000000000054 ffff000008cd3dbc 0000000000000000 0000000000000000
[ 6709.147653] fec0: 0000000000000000 0000000000000000 0000000000000000 ffff81000689ff20
[ 6709.155471] fee0: ffff000008085240 ffff81000689ff20 ffff000008085244 0000000060000145
[ 6709.163289] ff00: ffff81000689ff10 ffff00000813f1e4 ffffffffffffffff ffff00000813f238
[ 6709.171107] [<ffff000008082eb4>] el1_irq+0xb4/0x140
[ 6709.175976] [<ffff000008085244>] arch_cpu_idle+0x44/0x11c
[ 6709.181368] [<ffff0000087bf3b8>] default_idle_call+0x20/0x30
[ 6709.187020] [<ffff000008116d50>] do_idle+0x158/0x1e4
[ 6709.191973] [<ffff000008116ff4>] cpu_startup_entry+0x2c/0x30
[ 6709.197624] [<ffff00000808e7cc>] secondary_start_kernel+0x13c/0x160
[ 6709.203878] [<0000000001bc71c4>] 0x1bc71c4
[ 6709.207967] Code: bad PC value
[ 6709.211061] SMP: stopping secondary CPUs
[ 6709.218830] Starting crashdump kernel...
[ 6709.222749] Bye!
---<-snip>---

Signed-off-by: Vadim Lomovtsev <vlomovts-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 net/sunrpc/svcsock.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 2b720fa..2be967f 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -421,6 +421,10 @@ static void svc_data_ready(struct sock *sk)
 		dprintk("svc: socket %p(inet %p), busy=%d\n",
 			svsk, sk,
 			test_bit(XPT_BUSY, &svsk->sk_xprt.xpt_flags));
+		/* this barrier is necessary to prevent kernel crash
+		   (due to bad CPU-value) caused by races aginst
+		   svc_setup_socket() while calling sk_odata() callback. */
+		rmb();
 		svsk->sk_odata(sk);
 		if (!test_and_set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags))
 			svc_xprt_enqueue(&svsk->sk_xprt);
@@ -437,6 +441,10 @@ static void svc_write_space(struct sock *sk)
 	if (svsk) {
 		dprintk("svc: socket %p(inet %p), write_space busy=%d\n",
 			svsk, sk, test_bit(XPT_BUSY, &svsk->sk_xprt.xpt_flags));
+		/* this barrier is necessary to prevent kernel crash
+		   (due to bad CPU-value) caused by races aginst
+		   svc_setup_socket() while calling sk_owspace() callback. */
+		rmb();
 		svsk->sk_owspace(sk);
 		svc_xprt_enqueue(&svsk->sk_xprt);
 	}
@@ -760,8 +768,14 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
 	dprintk("svc: socket %p TCP (listen) state change %d\n",
 		sk, sk->sk_state);
 
-	if (svsk)
+	if (svsk) {
+		/* this barrier is necessary to prevent kernel crash
+		   (due to bad CPU-value) caused by races aginst
+		   svc_setup_socket() while calling sk_odata() callback.*/
+		rmb();
 		svsk->sk_odata(sk);
+	}
+
 	/*
 	 * This callback may called twice when a new connection
 	 * is established as a child socket inherits everything
@@ -794,7 +808,12 @@ static void svc_tcp_state_change(struct sock *sk)
 	if (!svsk)
 		printk("svc: socket %p: no user data\n", sk);
 	else {
+		/* this barrier is necessary to prevent kernel crash
+		   (due to bad CPU-value) caused by races aginst
+		   svc_setup_socket() while calling sk_odata() callback. */
+		rmb();
 		svsk->sk_ostate(sk);
+
 		if (sk->sk_state != TCP_ESTABLISHED) {
 			set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
 			svc_xprt_enqueue(&svsk->sk_xprt);
@@ -1381,12 +1400,16 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
 		return ERR_PTR(err);
 	}
 
-	inet->sk_user_data = svsk;
 	svsk->sk_sock = sock;
 	svsk->sk_sk = inet;
 	svsk->sk_ostate = inet->sk_state_change;
 	svsk->sk_odata = inet->sk_data_ready;
 	svsk->sk_owspace = inet->sk_write_space;
+	/* this barrier is necessary in order to prevent race condition
+	   with svc_data_ready(), svc_listen_data_ready()
+	   and others when calling callbacks above */
+	wmb();
+	inet->sk_user_data = svsk;
 
 	/* Initialize the socket */
 	if (sock->type == SOCK_DGRAM)
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH net-next v4] openvswitch: enable NSH support
From: Yang, Yi @ 2017-08-21  8:39 UTC (permalink / raw)
  To: Jiri Benc
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, e@erig.me
In-Reply-To: <20170821101925.3f9b36a1@griffin>

On Mon, Aug 21, 2017 at 10:19:24AM +0200, Jiri Benc wrote:
> On Mon, 21 Aug 2017 14:11:10 +0800, Yang, Yi wrote:
> > In OVS code, it has been removed because of Microsoft compiler issue.
> 
> We absolutely, completely and utterly do not care in the kernel. Please
> never make such arguments and never make the code look worse because of
> a compiler for other operating systems.

Anyway, we need to keep the code in userspace consistent with the one in
kernel as possible as, otherwise it will be a burden for developer, I
know userspace has different coding standard from kernel, this will make
developer painful if we have two sets of code although they have same
functionality.

> 
> > > I was wondering about this during the reviews of the previous versions.
> > > Now I've given this more thought but I still don't see it - why is the
> > > inner_protocol set here?
> > 
> > I saw push_mpls has it, so also set it.
> 
> MPLS supports GSO and needs this for segmentation. I don't see anything
> GSO related in this patch.
> 
> How do you plan to address GSO, anyway?

No plan to do that, I'm not an expert on this, we can remove it if
you're very sure it is necessary.

> 
> > > > +	err = check_header(skb, length);
> > > > +	if (unlikely(err))
> > > > +		return err;
> > > > +
> > > > +	key->nsh.flags = nsh_get_flags(nsh);
> > > 
> > > Again, need to reload nsh.
> > 
> > I used skb->len in v5, so we can't avoid such issue.
> 
> And how do you ensure that the skb has enough headroom, then? That is
> wrong. All I said is that you have to reload the pointers to the header
> which is what you have to do.

Ok, got it, will add it.

> 
> > > Wouldn't it be nicer if the fields of struct ovs_nsh_key_base and of
> > > struct nsh_hdr had the same names?
> > 
> > Such change also will impact on OVS code, so I prefer not to change
> > them.
> 
> We do not care.
> 
> The order in which you send patches to different projects is your
> choice. The only standard by which we measure and evaluate kernel
> patches is the technical suitability of the patches. Whether or not
> some other projects have dependencies on some kind of previous versions
> of out of tree kernel patches have zero relevance here.
> 
> If you designed other code to depend on your notion on how a kernel API
> will look like before getting the actual patches accepted to the
> kernel, that's your problem and you'll have to deal with that.
> 
> > For struct nsh_hdr, we need more self-descriptive fields, but for struct
> > ovs_nsh_key_base, because we need to use keys in OVS OpenFlow, so np is
> > obviously better than next_proto, we also try our best to make sure the
> > old NSH implementation has same match fields as the new one does.
> 
> Not relevant.

Ok, I can follow your standard :-)

To make sure we make agreement, please confirm if this one is ok?

struct nsh_hdr {
    ovs_be16 ver_flags_ttl_len;
    uint8_t mdtype;
    uint8_t np;
    ovs_16aligned_be32 path_hdr;
    union {
        struct nsh_md1_ctx md1;
        struct nsh_md2_tlv md2;
    };
};

Or it will be better if you can provide your preferred version here.

> 
>  Jiri

^ permalink raw reply

* Re: [PATCH net-next v5] openvswitch: enable NSH support
From: Jiri Benc @ 2017-08-21  8:25 UTC (permalink / raw)
  To: Yi Yang; +Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA, e
In-Reply-To: <1503294863-12859-1-git-send-email-yi.y.yang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

On Mon, 21 Aug 2017 13:54:23 +0800, Yi Yang wrote:
> v4->v5
>  - Fix many comments by Jiri Benc and Eric Garver
>    for v4.

NACK for v5, we haven't finished discussing v4. You're sending new
versions too quickly.

My comment about sending a new version was meant only for the rest of
the patch that I have not reviewed - i.e. that once the comments I've
made so far are addressed, I'm fine with continuing reviewing in a new
version.

It did not mean that I'm okay with sending a new version without my
comments to the first part being addressed. Which they are not, see my
reply to your mail at v4.

Sorry if that was not clear but I won't review v5 until we have
conclusion for the v4 comments.

 Jiri

^ permalink raw reply

* Re: [PATCH 0/3] MIPS,bpf: Improvements for MIPS eBPF JIT
From: Daniel Borkmann @ 2017-08-21  8:24 UTC (permalink / raw)
  To: David Miller, david.daney; +Cc: ast, netdev, linux-kernel, linux-mips, ralf
In-Reply-To: <20170820.200619.821714603187353951.davem@davemloft.net>

On 08/21/2017 05:06 AM, David Miller wrote:
> From: David Daney <david.daney@cavium.com>
> Date: Fri, 18 Aug 2017 16:40:30 -0700
>
>> I suggest that the whole thing go via the BPF/net-next path as there
>> are dependencies on code that is not yet merged to Linus' tree.
>
> What kind of dependency?  On networking or MIPS changes?

On networking, David implemented the JLT, JLE, JSLT and JSLE
ops for the JIT in the patch set. Back then the MIPS JIT wasn't
in net-next tree, thus this is basically just a follow-up, so
that we have all covered with JIT support for net-next.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH net-next v4] openvswitch: enable NSH support
From: Jiri Benc @ 2017-08-21  8:19 UTC (permalink / raw)
  To: Yang, Yi
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, e@erig.me
In-Reply-To: <20170821061109.GA72656-re2EX8HDrk21gSHoDXDV2kEOCMrvLtNR@public.gmane.org>

On Mon, 21 Aug 2017 14:11:10 +0800, Yang, Yi wrote:
> In OVS code, it has been removed because of Microsoft compiler issue.

We absolutely, completely and utterly do not care in the kernel. Please
never make such arguments and never make the code look worse because of
a compiler for other operating systems.

> > I was wondering about this during the reviews of the previous versions.
> > Now I've given this more thought but I still don't see it - why is the
> > inner_protocol set here?
> 
> I saw push_mpls has it, so also set it.

MPLS supports GSO and needs this for segmentation. I don't see anything
GSO related in this patch.

How do you plan to address GSO, anyway?

> > > +	err = check_header(skb, length);
> > > +	if (unlikely(err))
> > > +		return err;
> > > +
> > > +	key->nsh.flags = nsh_get_flags(nsh);
> > 
> > Again, need to reload nsh.
> 
> I used skb->len in v5, so we can't avoid such issue.

And how do you ensure that the skb has enough headroom, then? That is
wrong. All I said is that you have to reload the pointers to the header
which is what you have to do.

> > Wouldn't it be nicer if the fields of struct ovs_nsh_key_base and of
> > struct nsh_hdr had the same names?
> 
> Such change also will impact on OVS code, so I prefer not to change
> them.

We do not care.

The order in which you send patches to different projects is your
choice. The only standard by which we measure and evaluate kernel
patches is the technical suitability of the patches. Whether or not
some other projects have dependencies on some kind of previous versions
of out of tree kernel patches have zero relevance here.

If you designed other code to depend on your notion on how a kernel API
will look like before getting the actual patches accepted to the
kernel, that's your problem and you'll have to deal with that.

> For struct nsh_hdr, we need more self-descriptive fields, but for struct
> ovs_nsh_key_base, because we need to use keys in OVS OpenFlow, so np is
> obviously better than next_proto, we also try our best to make sure the
> old NSH implementation has same match fields as the new one does.

Not relevant.

 Jiri

^ permalink raw reply

* [PATCH RFC] net_sched/codel: do not defer queue length update
From: Konstantin Khlebnikov @ 2017-08-21  8:14 UTC (permalink / raw)
  To: netdev, David S. Miller; +Cc: Cong Wang, Eric Dumazet, Jiri Pirko

When codel wants to drop last packet in ->dequeue() it cannot call
qdisc_tree_reduce_backlog() right away - it will notify parent qdisc
about zero qlen and HTB/HFSC will deactivate class. The same class will
be deactivated second time by caller of ->dequeue(). Currently codel and
fq_codel defer update. This triggers warning in HFSC when it's qlen != 0
but there is no active classes.

This patch update parent queue length immediately: just temporary increase
qlen around qdisc_tree_reduce_backlog() to prevent first class deactivation
if we have skb to return.

This might open another problem in HFSC - now operation peek could fail and
deactivate parent class.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=109581
---
 net/sched/sch_codel.c    |   14 ++++++++++----
 net/sched/sch_fq_codel.c |   24 +++++++++++++++---------
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/net/sched/sch_codel.c b/net/sched/sch_codel.c
index c518a1efcb9d..e8194e455efe 100644
--- a/net/sched/sch_codel.c
+++ b/net/sched/sch_codel.c
@@ -95,11 +95,17 @@ static struct sk_buff *codel_qdisc_dequeue(struct Qdisc *sch)
 			    &q->stats, qdisc_pkt_len, codel_get_enqueue_time,
 			    drop_func, dequeue_func);
 
-	/* We cant call qdisc_tree_reduce_backlog() if our qlen is 0,
-	 * or HTB crashes. Defer it for next round.
+	/* If our qlen is 0 qdisc_tree_reduce_backlog() will deactivate
+	 * parent class, dequeue in parent qdisc will do the same if we
+	 * return skb. Temporary increment qlen if we have skb.
 	 */
-	if (q->stats.drop_count && sch->q.qlen) {
-		qdisc_tree_reduce_backlog(sch, q->stats.drop_count, q->stats.drop_len);
+	if (q->stats.drop_count) {
+		if (skb)
+			sch->q.qlen++;
+		qdisc_tree_reduce_backlog(sch, q->stats.drop_count,
+					  q->stats.drop_len);
+		if (skb)
+			sch->q.qlen--;
 		q->stats.drop_count = 0;
 		q->stats.drop_len = 0;
 	}
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 337f2d6d81e4..4048790a8c99 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -316,6 +316,21 @@ static struct sk_buff *fq_codel_dequeue(struct Qdisc *sch)
 	flow->dropped += q->cstats.drop_count - prev_drop_count;
 	flow->dropped += q->cstats.ecn_mark - prev_ecn_mark;
 
+	/* If our qlen is 0 qdisc_tree_reduce_backlog() will deactivate
+	 * parent class, dequeue in parent qdisc will do the same if we
+	 * return skb. Temporary increment qlen if we have skb.
+	 */
+	if (q->cstats.drop_count) {
+		if (skb)
+			sch->q.qlen++;
+		qdisc_tree_reduce_backlog(sch, q->cstats.drop_count,
+					  q->cstats.drop_len);
+		if (skb)
+			sch->q.qlen--;
+		q->cstats.drop_count = 0;
+		q->cstats.drop_len = 0;
+	}
+
 	if (!skb) {
 		/* force a pass through old_flows to prevent starvation */
 		if ((head == &q->new_flows) && !list_empty(&q->old_flows))
@@ -326,15 +341,6 @@ static struct sk_buff *fq_codel_dequeue(struct Qdisc *sch)
 	}
 	qdisc_bstats_update(sch, skb);
 	flow->deficit -= qdisc_pkt_len(skb);
-	/* We cant call qdisc_tree_reduce_backlog() if our qlen is 0,
-	 * or HTB crashes. Defer it for next round.
-	 */
-	if (q->cstats.drop_count && sch->q.qlen) {
-		qdisc_tree_reduce_backlog(sch, q->cstats.drop_count,
-					  q->cstats.drop_len);
-		q->cstats.drop_count = 0;
-		q->cstats.drop_len = 0;
-	}
 	return skb;
 }
 

^ permalink raw reply related

* [PATCH] net_sched/hhf: update hierarchical backlog when drop packet
From: Konstantin Khlebnikov @ 2017-08-21  8:12 UTC (permalink / raw)
  To: netdev, David S. Miller; +Cc: Cong Wang, Jiri Pirko

When hhf_enqueue() drops packet from another bucket it
have to update backlog at upper qdiscs too.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too")
---
 net/sched/sch_hhf.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_hhf.c b/net/sched/sch_hhf.c
index 51d3ba682af9..c8d84436e427 100644
--- a/net/sched/sch_hhf.c
+++ b/net/sched/sch_hhf.c
@@ -406,8 +406,11 @@ static int hhf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	/* Return Congestion Notification only if we dropped a packet from this
 	 * bucket.
 	 */
-	if (hhf_drop(sch, to_free) == idx)
+	if (hhf_drop(sch, to_free) == idx) {
+		qdisc_tree_reduce_backlog(sch, 0,
+					  prev_backlog - sch->qstats.backlog);
 		return NET_XMIT_CN;
+	}
 
 	/* As we dropped a packet, better let upper stack know this. */
 	qdisc_tree_reduce_backlog(sch, 1, prev_backlog - sch->qstats.backlog);

^ permalink raw reply related

* Re: [PATCH v3 3/4] net: stmmac: register parent MDIO node for sun8i-h3-emac
From: Chen-Yu Tsai @ 2017-08-21  8:10 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Corentin Labbe, Chen-Yu Tsai, Rob Herring, Mark Rutland,
	Russell King, Maxime Ripard, Giuseppe Cavallaro, Alexandre Torgue,
	devicetree, linux-arm-kernel, linux-kernel, netdev
In-Reply-To: <20170820142545.GB24150@lunn.ch>

On Sun, Aug 20, 2017 at 10:25 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>> I think we cannot use mdio-mux-mmioreg since the register for doing
>> the switch is in middle of the "System Control" and shared with
>> other functions.  This is why we use a sycon/regmap for selecting
>> the MDIO.
>
> You could add a mdio-mux-regmap.c.
>
> However, it probably need restructuring of the stmmac mdio code, to
> make the mdio bus usable as a separate driver. You need stmmac mdio
> to probe first, then mdio-mux-remap should probe, and then lastly
> stmmmac mac driver.
>
> With stmmac mdio and stmmac mac being in one driver, there is no time
> in the middle to allow the mux driver to probe.
>
> It is some effort, but a nice cleanup and generalization.

I'm not sure mdio-mux is the right thing here.

Looking at mdio-mux, it seems to provide a way to access multiplexed
MDIO buses. It says nothing about the MAC<->PHY connection.

With our hardware, and likely Rockchip's as well, the muxed connections
include the MDIO and MII connections, which are muxed at the same time.
It's likely to cause some confusion or even bugs if we implement it as
a separate driver.

ChenYu

^ permalink raw reply

* [PATCH 2/2] net: phy: Don't use drv when it is NULL in phy_attached_print
From: Romain Perier @ 2017-08-21  7:52 UTC (permalink / raw)
  To: Giuseppe Cavallaro, Alexandre Torgue, Andrew Lunn,
	Florian Fainelli
  Cc: netdev, linux-kernel, Romain Perier
In-Reply-To: <20170821075235.28473-1-romain.perier@collabora.com>

Currently, if this logging function is used prior the phy driver is
binded to the phy device (that is usually done from .ndo_open),
'phydev->drv' might be NULL, resulting in a kernel crash. That is
typically the case in the stmmac driver, info about the phy is displayed
during the registration of the MDIO bus, and then genphy driver is binded
to this phydev when .ndo_open is called.

This commit fixes the issue by using the right genphy driver, when
phydev->drv is NULL.

Fixes: commit fbca164776e4 ("net: stmmac: Use the right logging functi")
Signed-off-by: Romain Perier <romain.perier@collabora.com>
---
 drivers/net/phy/phy_device.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 9493fb369682..b38926bc275f 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -877,15 +877,24 @@ EXPORT_SYMBOL(phy_attached_info);
 #define ATTACHED_FMT "attached PHY driver [%s] (mii_bus:phy_addr=%s, irq=%d)"
 void phy_attached_print(struct phy_device *phydev, const char *fmt, ...)
 {
+	struct phy_driver *drv = phydev->drv;
+
+	if (!drv) {
+		if (phydev->is_c45)
+			drv = &genphy_10g_driver;
+		else
+			drv = &genphy_driver;
+	}
+
 	if (!fmt) {
 		dev_info(&phydev->mdio.dev, ATTACHED_FMT "\n",
-			 phydev->drv->name, phydev_name(phydev),
+			 drv->name, phydev_name(phydev),
 			 phydev->irq);
 	} else {
 		va_list ap;
 
 		dev_info(&phydev->mdio.dev, ATTACHED_FMT,
-			 phydev->drv->name, phydev_name(phydev),
+			 drv->name, phydev_name(phydev),
 			 phydev->irq);
 
 		va_start(ap, fmt);
-- 
2.11.0

^ permalink raw reply related

* [PATCH 1/2] net: stmmac: Delete dead code for MDIO registration
From: Romain Perier @ 2017-08-21  7:52 UTC (permalink / raw)
  To: Giuseppe Cavallaro, Alexandre Torgue, Andrew Lunn,
	Florian Fainelli
  Cc: netdev, linux-kernel, Romain Perier
In-Reply-To: <20170821075235.28473-1-romain.perier@collabora.com>

This code is no longer used, the logging function was changed by commit
fbca164776e4 ("net: stmmac: Use the right logging functi").

Fixes: commit fbca164776e4 ("net: stmmac: Use the right logging functi")
Signed-off-by: Romain Perier <romain.perier@collabora.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
index 72ec711fcba2..f5f37bfa1d58 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
@@ -248,9 +248,6 @@ int stmmac_mdio_register(struct net_device *ndev)
 	found = 0;
 	for (addr = 0; addr < PHY_MAX_ADDR; addr++) {
 		struct phy_device *phydev = mdiobus_get_phy(new_bus, addr);
-		int act = 0;
-		char irq_num[4];
-		char *irq_str;
 
 		if (!phydev)
 			continue;
@@ -273,19 +270,6 @@ int stmmac_mdio_register(struct net_device *ndev)
 		if (priv->plat->phy_addr == -1)
 			priv->plat->phy_addr = addr;
 
-		act = (priv->plat->phy_addr == addr);
-		switch (phydev->irq) {
-		case PHY_POLL:
-			irq_str = "POLL";
-			break;
-		case PHY_IGNORE_INTERRUPT:
-			irq_str = "IGNORE";
-			break;
-		default:
-			sprintf(irq_num, "%d", phydev->irq);
-			irq_str = irq_num;
-			break;
-		}
 		phy_attached_info(phydev);
 		found = 1;
 	}
-- 
2.11.0

^ permalink raw reply related

* [PATCH 0/2] net: stmmac: phy logging fixes
From: Romain Perier @ 2017-08-21  7:52 UTC (permalink / raw)
  To: Giuseppe Cavallaro, Alexandre Torgue, Andrew Lunn,
	Florian Fainelli
  Cc: netdev, linux-kernel, Romain Perier

This set of patches fixes issues related to logging. First it delete old
code that is no longer used in stmmac_mdio_register(). Then, it fixes a
crash that happens when phydev->drv is NULL and phy_attached_info() is
called prior phy_driver is binded to phydev.

Romain Perier (2):
  net: stmmac: Delete dead code for MDIO registration
  net: phy: Don't use drv when it is NULL in phy_attached_info

 drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c | 16 ----------------
 drivers/net/phy/phy_device.c                      | 13 +++++++++++--
 2 files changed, 11 insertions(+), 18 deletions(-)

-- 
2.11.0

^ permalink raw reply

* IPsec workshop at netdevconf
From: Steffen Klassert @ 2017-08-21  7:10 UTC (permalink / raw)
  To: Skidmore, Donald C, Boris Pismenny, Florian Westphal,
	Sowmini Varadhan, Hannes Frederic Sowa
  Cc: netdev

Hi all,

we plan for an IPsec workshop at the next netdev conference in November.
Is there anybody interested to present something or is there anybody who
want to discuss something?

So far we have the following tentative topics:

Donald Skidmore:

Presentation: The intel implementation of the IPsec offload support.
Discussion: Virtualization with IPsec offload; IPsec offload with OVS

Boris Pismenny:

Presentation: Describe trailer removal; Describe ESN; show performance
numbers.

Discussion: Full device terminated IPsec offload; software fallback
problems common to TLS and IPsec; auth failure using decrypted packet
vs. plaintext packet

Steffen Klassert:

Presentation: The IPsec status update.

Discussion: Redesigning the IPsec VTI interfaces.

^ permalink raw reply

* Re: [PATCH net-next v4] openvswitch: enable NSH support
From: Yang, Yi @ 2017-08-21  6:31 UTC (permalink / raw)
  To: Jiri Benc
  Cc: netdev@vger.kernel.org, dev@openvswitch.org, blp@ovn.org,
	e@erig.me, jan.scheurich@ericsson.com
In-Reply-To: <20170818153103.5a757abe@griffin>

On Fri, Aug 18, 2017 at 09:31:03PM +0800, Jiri Benc wrote:
> On Fri, 18 Aug 2017 15:26:01 +0200, Jiri Benc wrote:
> > Out of time for today, will continue the review next week. Again, feel
> > free to send a new version meanwhile or wait for the rest of the
> > review, whatever works better for you.
> 
> One more thing: before you send a new version, be sure and double check
> that you addressed *all* of the feedback. In this version, you still
> have not addressed a few things I gave feedback on in the previous
> version. This wastes everyone's time.

I did check every comment you added and tried my best to fix them, maybe
you mean that hacky attr & mask code, I think that one is ok in v4, I have
used your proposal in v5 and has small change, from my perspective, they
don't have what difference :-)

> 
> It doesn't mean I'm always right, of course, but you either explain why
> I'm wrong or change the code. It's generally not acceptable to ignore
> feedback.

Will be more careful later in order that we can have a more efficient
review. Thank you so much for your great comments.

> 
> Thank you,
> 
>  Jiri

^ permalink raw reply

* Re: [PATCH net-next v4] openvswitch: enable NSH support
From: Yang, Yi @ 2017-08-21  6:21 UTC (permalink / raw)
  To: Eric Garver
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	jbenc-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
In-Reply-To: <20170818190947.GA1479@dev-rhel7>

On Sat, Aug 19, 2017 at 03:09:47AM +0800, Eric Garver wrote:
> On Fri, Aug 18, 2017 at 03:24:31PM +0800, Yi Yang wrote:
> > v3->v4
> >  - Add new NSH match field ttl
> >  - Update NSH header to the latest format
> >    which will be final format and won't change
> >    per its author's confirmation.
> >  - Fix comments for v3.
> 
> Hi Yi,
> Only a few comments below since Jiri already supplied lots of feedback.
>

Eric, thank you for your comments, I have fixed them in v5, please help
review v5, thanks a lot.
 
> > @@ -1210,6 +1373,20 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
> >  		case OVS_ACTION_ATTR_POP_ETH:
> >  			err = pop_eth(skb, key);
> >  			break;
> > +
> > +		case OVS_ACTION_ATTR_PUSH_NSH: {
> > +			u8 buffer[256];
> 
> Use NSH_M_TYPE2_MAX_LEN

Replaced 256 to NSH_M_TYPE2_MAX_LEN in v5.

> > +		case OVS_NSH_KEY_ATTR_MD1: {
> > +			const struct ovs_nsh_key_md1 *md1 =
> > +				(struct ovs_nsh_key_md1 *)nla_data(a);
> > +
> > +			has_md1 = true;
> > +			memcpy(nsh->context, md1->context, sizeof(*md1));
> > +			break;
> > +		}
> > +		case OVS_NSH_KEY_ATTR_MD2:
> > +			/* Not supported yet */
> 
> return -ENOTPSUPP if it's not supported.

Did that way in v5.

> > +		case OVS_NSH_KEY_ATTR_MD1: {
> > +			const struct ovs_nsh_key_md1 *md1 =
> > +				(struct ovs_nsh_key_md1 *)nla_data(a);
> > +
> > +			has_md1 = true;
> > +			for (i = 0; i < NSH_MD1_CONTEXT_SIZE; i++)
> > +				SW_FLOW_KEY_PUT(match, nsh.context[i],
> > +						md1->context[i], is_mask);
> > +			break;
> > +		}
> > +		case OVS_NSH_KEY_ATTR_MD2:
> > +			/* Not supported yet */
> 
> return -ENOTPSUPP if it's not supported.

Done in v5.

> > @@ -2636,6 +2984,17 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
> >  			mac_proto = MAC_PROTO_ETHERNET;
> >  			break;
> >  
> > +		case OVS_ACTION_ATTR_PUSH_NSH:
> 
> You need to some validation here, especially the metadata lengths.
> Relying on action_lens is not enough because it's variable.
> 

Good point, I added validate_nsh for this.

^ permalink raw reply

* Re: [PATCH net-next v4] openvswitch: enable NSH support
From: Yang, Yi @ 2017-08-21  6:11 UTC (permalink / raw)
  To: Jiri Benc
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, e@erig.me
In-Reply-To: <20170818152601.3760aaec@griffin>

On Fri, Aug 18, 2017 at 09:26:01PM +0800, Jiri Benc wrote:
> On Fri, 18 Aug 2017 15:24:31 +0800, Yi Yang wrote:
> > +struct nsh_md2_tlv {
> > +	__be16 md_class;
> > +	u8 type;
> > +	u8 length;
> > +	/* Followed by variable-length data. */
> > +};
> 
> What was wrong with the u8[] field that was present at the end of the
> struct in the previous version of the patch?

In OVS code, it has been removed because of Microsoft compiler issue.

> 
> > +#define NSH_M_TYPE2_MAX_LEN 256
> 
> This is defined twice, please delete this define and keep the one lower
> in the file.

Removed duplicate one in v5.

> 
> > +#define NSH_DST_PORT    4790     /* UDP Port for NSH on VXLAN. */
> 
> This is a VXLAN-GPE port, it has nothing to do with NSH (except that
> VXLAN-GPE can contain a NSH packet). It's also unused. Please remove it.
> 

Removed in v5.

> > +/* NSH Metadata Length. */
> > +#define NSH_M_TYPE1_MDLEN 16
> 
> This is unused and it seems it's not much useful anyway,
> sizeof(struct nsh_md1_ctx) provides the same value. Please remove this
> define.

Removed in v5.

> 
> > +#define NSH_MD1_CTX(nsh_hdr_ptr) (&(nsh_hdr_ptr)->md1)
> > +
> > +#define NSH_MD2_CTX(nsh_hdr_ptr) (&(nsh_hdr_ptr)->md2)
> 
> Please remove these two. They are unused and would just obscure things
> anyway.
> 
> > +static inline struct nsh_md1_ctx *nsh_md1_ctx(struct nsh_hdr *nsh)
> > +{
> > +	return &nsh->md1;
> > +}
> > +
> > +static inline struct nsh_md2_tlv *nsh_md2_ctx(struct nsh_hdr *nsh)
> > +{
> > +	return &nsh->md2;
> > +}
> 
> And remove these too, for the same reason. Just use nsh->md1 when you
> need the metadata, there's no reason for these helper functions. They
> just obscure things.
> 

Removed them in v5

> > +static inline void nsh_set_flags_and_ttl(struct nsh_hdr *nsh, u8 flags, u8 ttl)
> > +{
> > +	nsh->ver_flags_ttl_len
> > +		= htons((ntohs(nsh->ver_flags_ttl_len)
> > +			& ~(NSH_FLAGS_MASK | NSH_TTL_MASK))
> > +			| ((flags << NSH_FLAGS_SHIFT) & NSH_FLAGS_MASK)
> > +			| ((ttl << NSH_TTL_SHIFT) & NSH_TTL_MASK));
> > +}
> > +
> > +static inline void nsh_set_flags_ttl_len(struct nsh_hdr *nsh, u8 flags,
> > +					 u8 ttl, u8 len)
> > +{
> > +	nsh->ver_flags_ttl_len
> > +		= htons((ntohs(nsh->ver_flags_ttl_len)
> > +			& ~(NSH_FLAGS_MASK | NSH_TTL_MASK | NSH_LEN_MASK))
> > +			| ((flags << NSH_FLAGS_SHIFT) & NSH_FLAGS_MASK)
> > +			| ((ttl << NSH_TTL_SHIFT) & NSH_TTL_MASK)
> > +			| ((len << NSH_LEN_SHIFT) & NSH_LEN_MASK));
> > +}
> 
> Okay. Could those two perhaps use a common function?
> 
> static inline void __nsh_set_flags(struct nsh_hdr *nsh, u16 value, u16 mask)
> {
> 	nsh->ver_flags_ttl_len = nsh->ver_flags_ttl_len & ~htons(mask)
> 							| htons(value);
> }
> 
> static inline void nsh_set_flags_and_ttl(struct nsh_hdr *nsh, u8 flags, u8 ttl)
> {
> 	__nsh_set_flags(nsh, flags << NSH_FLAGS_SHIFT | ttl << NSH_TTL_SHIFT,
> 			NSH_FLAGS_MASK | NSH_TTL_MASK);
> }
> 
> etc.

Thanks for this good suggestion, applied in v5 with small change.

> 
> > +static int push_nsh(struct sk_buff *skb, struct sw_flow_key *key,
> > +		    const struct nsh_hdr *nsh_src)
> > +{
> [...]
> > +	if (!skb->inner_protocol)
> > +		skb_set_inner_protocol(skb, skb->protocol);
> 
> I was wondering about this during the reviews of the previous versions.
> Now I've given this more thought but I still don't see it - why is the
> inner_protocol set here?

I saw push_mpls has it, so also set it.

> 
> > +	case OVS_KEY_ATTR_NSH: {
> > +		struct ovs_key_nsh nsh;
> > +		struct ovs_key_nsh nsh_mask;
> > +		size_t size = nla_len(a) / 2;
> > +		struct nlattr attr[1 + size / sizeof(struct nlattr) + 1];
> > +		struct nlattr mask[1 + size / sizeof(struct nlattr) + 1];
> > +
> > +		attr->nla_type = nla_type(a);
> > +		mask->nla_type = attr->nla_type;
> > +		attr->nla_len = NLA_HDRLEN + size;
> > +		mask->nla_len = attr->nla_len;
> > +		memcpy(attr + 1, (char *)(a + 1), size);
> > +		memcpy(mask + 1, (char *)(a + 1) + size, size);
> 
> No, please. See my reply to the previous version for how to do this in
> a less hacky way.

I have used your proposal in previous comments and have it in v5.

> 
> > +		case OVS_ACTION_ATTR_PUSH_NSH: {
> > +			u8 buffer[256];
> > +			struct nsh_hdr *nsh_hdr = (struct nsh_hdr *)buffer;
> > +			const struct nsh_hdr *nsh_src = nsh_hdr;
> > +
> > +			nsh_hdr_from_nlattr(nla_data(a), nsh_hdr);
> 
> This is very dangerous security wise. You have to protect against
> buffer overflow, one way or other. The current code may not overflow
> (I have not checked that, though) but a future addition may break the
> assumption without being obvious it's a problem.
> 
> Note that the previous version had exactly the same problem but it was
> hidden and I didn't notice it. Which means that getting rid of that
> push_nsh_para struct was a very good thing, the code is more clean and
> more obvious now.

I have added a size parameter for nsh_hdr_from_nlattr in which there is
size check code in order to make sure there will not buffer overflow
happening. please chech v5 for details.

> 
> > +static int parse_nsh(struct sk_buff *skb, struct sw_flow_key *key)
> > +{
> > +	struct nsh_hdr *nsh = (struct nsh_hdr *)skb_network_header(skb);
> > +	u8 version, length;
> > +	int err;
> > +
> > +	err = check_header(skb, NSH_BASE_HDR_LEN);
> > +	if (unlikely(err))
> > +		return err;
> > +
> > +	memset(&key->nsh, 0, sizeof(struct ovs_key_nsh));
> 
> This is unnecessary and expensive. We're initializing all the fields
> below.

Removed in v5.

> 
> > +	version = nsh_get_ver(nsh);
> > +	length = nsh_hdr_len(nsh);
> 
> You have to reload nsh after pskb_may_pull (which is called by
> check_header).

I have removed check_header and use skb->len to check in v5.

> 
> > +	if (version != 0)
> > +		return -EINVAL;
> > +
> > +	if (nsh->md_type == NSH_M_TYPE1 && length != NSH_M_TYPE1_LEN)
> > +		return -EINVAL;
> > +
> > +	if (nsh->md_type == NSH_M_TYPE2 && length < NSH_BASE_HDR_LEN)
> > +		return -EINVAL;
> 
> This might better be merged to the switch below. Or are you concerned
> about potentially expensive pskb_may_pull with unchecked length? In
> that case, it would be better to convert to switch and reject on
> unknown md_types.

Good point, I have moved them to switch in v5.

> 
> > +	err = check_header(skb, length);
> > +	if (unlikely(err))
> > +		return err;
> > +
> > +	key->nsh.flags = nsh_get_flags(nsh);
> 
> Again, need to reload nsh.

I used skb->len in v5, so we can't avoid such issue.

> 
> > +	key->nsh.ttl = nsh_get_ttl(nsh);
> > +	key->nsh.mdtype = nsh->md_type;
> > +	key->nsh.np = nsh->next_proto;
> > +	key->nsh.path_hdr = nsh->path_hdr;
> > +	switch (key->nsh.mdtype) {
> > +	case NSH_M_TYPE1:
> > +		memcpy(key->nsh.context, nsh->md1.context,
> > +		       sizeof(nsh->md1));
> > +		break;
> > +	case NSH_M_TYPE2:
> > +		/* Don't support MD type 2 metedata parsing yet */
> > +		break;
> > +	default:
> > +		return -EINVAL;
> > +	}
> 
> This is the switch I mentioned above.

Yes, done in v5.

> 
> > +struct ovs_key_nsh {
> > +	__u8 flags;
> > +	__u8 ttl;
> > +	__u8 mdtype;
> > +	__u8 np;
> 
> Just u8, please, this is kernel internal.

Changed to u8 in v5.

> 
> > +size_t ovs_nsh_key_attr_size(void)
> > +{
> > +	/* Whenever adding new OVS_NSH_KEY_ FIELDS, we should consider
> > +	 * updating this function.
> > +	 */
> > +	return  nla_total_size(8)      /* OVS_NSH_KEY_ATTR_BASE */
> 
> NSH_BASE_HDR_LEN, perhaps? Not that much important, though.

Replaced 8 with NSH_BASE_HDR_LEN in v5.

> 
> > +		switch (type) {
> > +		case OVS_NSH_KEY_ATTR_BASE: {
> > +			const struct ovs_nsh_key_base *base =
> > +				(struct ovs_nsh_key_base *)nla_data(a);
> > +			flags = base->flags;
> > +			ttl = base->ttl;
> > +			nsh->next_proto = base->np;
> > +			nsh->md_type = base->mdtype;
> > +			nsh->path_hdr = base->path_hdr;
> 
> Wouldn't it be nicer if the fields of struct ovs_nsh_key_base and of
> struct nsh_hdr had the same names?

Such change also will impact on OVS code, so I prefer not to change
them.

For struct nsh_hdr, we need more self-descriptive fields, but for struct
ovs_nsh_key_base, because we need to use keys in OVS OpenFlow, so np is
obviously better than next_proto, we also try our best to make sure the
old NSH implementation has same match fields as the new one does.

> 
> > +		case OVS_NSH_KEY_ATTR_MD1: {
> > +			const struct ovs_nsh_key_md1 *md1 =
> > +				(struct ovs_nsh_key_md1 *)nla_data(a);
> > +			struct nsh_md1_ctx *md1_dst = nsh_md1_ctx(nsh);
> > +
> > +			has_md1 = true;
> > +			mdlen = nla_len(a);
> > +			memcpy(md1_dst, md1, mdlen);
> 
> How can we be sure there's enough room in the nsh buffer? See also my
> previous remark.

I have added a size parameter for nsh_hdr_from_nlattr and also added
check code here in v5.

> 
> > +			break;
> > +		}
> > +		case OVS_NSH_KEY_ATTR_MD2: {
> > +			const struct u8 *md2 = nla_data(a);
> > +			struct nsh_md2_tlv *md2_dst = nsh_md2_ctx(nsh);
> > +
> > +			has_md2 = true;
> > +			mdlen = nla_len(a);
> > +			if ((mdlen > NSH_M_TYPE2_MD_MAX_LEN) ||
> > +			    (mdlen == 0)) {
> > +				OVS_NLERR(
> > +				    1,
> > +				    "length %d of nsh attr %d is invalid",
> > +				    mdlen,
> > +				    type
> > +				);
> > +				return -EINVAL;
> > +			}
> > +			memcpy(md2_dst, md2, mdlen);
> 
> And, more importantly, here. It seems that it's currently capped at
> 256 bytes by the mdlen check yet it's too fragile. Either add a
> parameter with the nsh buffer size or find other way to make this more
> robust. Otherwise we're going to hunt a buffer overflow in a year.

Done in v5.

> 
> > +	if ((has_md1 && nsh->md_type != NSH_M_TYPE1) ||
> > +	    (has_md2 && nsh->md_type != NSH_M_TYPE2)) {
> > +		OVS_NLERR(1,
> > +			  "nsh attribute has unmatched MD type %d.",
> > +			  nsh->md_type);
> > +		return -EINVAL;
> > +	}
> 
> What if both type 1 and type 2 attributes were specified? Or neither?
> This condition does not catch that.

I have added these checks in the function, but for set action, we may
only have OVS_NSH_KEY_BASE without OVS_NSH_KEY_MD1 and OVS_NSH_KEY_MD2,
so these checks will be different in different use case.

> 
> > +	/* nsh header length  = NSH_BASE_HDR_LEN + mdlen */
> > +	nsh_set_flags_ttl_len(nsh, flags, ttl,
> > +			      (NSH_BASE_HDR_LEN + mdlen) >> 2);
> 
> Just specify the len. It's the job of the helper function to convert it
> to whatever format is needed in the header. (I'm talking about the
> ">> 2". That should not be done by the caller but by the helper
> function.)

Changed nsh_set_flags_ttl_len for this in v5.

> 
> Out of time for today, will continue the review next week. Again, feel
> free to send a new version meanwhile or wait for the rest of the
> review, whatever works better for you.

I have sent out v5, please continue to review that version, thanks a
lot.
> 
>  Jiri

^ permalink raw reply

* [PATCH 3/4] xfrm: policy: check policy direction value
From: Steffen Klassert @ 2017-08-21  6:08 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <1503295683-19153-1-git-send-email-steffen.klassert@secunet.com>

From: Vladis Dronov <vdronov@redhat.com>

The 'dir' parameter in xfrm_migrate() is a user-controlled byte which is used
as an array index. This can lead to an out-of-bound access, kernel lockup and
DoS. Add a check for the 'dir' value.

This fixes CVE-2017-11600.

References: https://bugzilla.redhat.com/show_bug.cgi?id=1474928
Fixes: 80c9abaabf42 ("[XFRM]: Extension for dynamic update of endpoint address(es)")
Cc: <stable@vger.kernel.org> # v2.6.21-rc1
Reported-by: "bo Zhang" <zhangbo5891001@gmail.com>
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_policy.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index ff61d85..6f5a0dad 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -3308,9 +3308,15 @@ int xfrm_migrate(const struct xfrm_selector *sel, u8 dir, u8 type,
 	struct xfrm_state *x_new[XFRM_MAX_DEPTH];
 	struct xfrm_migrate *mp;
 
+	/* Stage 0 - sanity checks */
 	if ((err = xfrm_migrate_check(m, num_migrate)) < 0)
 		goto out;
 
+	if (dir >= XFRM_POLICY_MAX) {
+		err = -EINVAL;
+		goto out;
+	}
+
 	/* Stage 1 - find policy */
 	if ((pol = xfrm_migrate_policy_find(sel, dir, type, net)) == NULL) {
 		err = -ENOENT;
-- 
2.7.4

^ permalink raw reply related

* [PATCH 4/4] esp: Fix error handling on layer 2 xmit.
From: Steffen Klassert @ 2017-08-21  6:08 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <1503295683-19153-1-git-send-email-steffen.klassert@secunet.com>

esp_output_tail() and esp6_output_tail() can return negative
and positive error values. We currently treat only negative
values as errors, fix this to treat both cases as error.

Fixes: fca11ebde3f0 ("esp4: Reorganize esp_output")
Fixes: 383d0350f2cc ("esp6: Reorganize esp_output")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/ipv4/esp4_offload.c | 2 +-
 net/ipv6/esp6_offload.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index e066601..5011232 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -257,7 +257,7 @@ static int esp_xmit(struct xfrm_state *x, struct sk_buff *skb,  netdev_features_
 	esp.seqno = cpu_to_be64(xo->seq.low + ((u64)xo->seq.hi << 32));
 
 	err = esp_output_tail(x, skb, &esp);
-	if (err < 0)
+	if (err)
 		return err;
 
 	secpath_reset(skb);
diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
index f02f131..1cf437f 100644
--- a/net/ipv6/esp6_offload.c
+++ b/net/ipv6/esp6_offload.c
@@ -286,7 +286,7 @@ static int esp6_xmit(struct xfrm_state *x, struct sk_buff *skb,  netdev_features
 	esp.seqno = cpu_to_be64(xo->seq.low + ((u64)xo->seq.hi << 32));
 
 	err = esp6_output_tail(x, skb, &esp);
-	if (err < 0)
+	if (err)
 		return err;
 
 	secpath_reset(skb);
-- 
2.7.4

^ permalink raw reply related

* [PATCH 2/4] xfrm: fix null pointer dereference on state and tmpl sort
From: Steffen Klassert @ 2017-08-21  6:08 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <1503295683-19153-1-git-send-email-steffen.klassert@secunet.com>

From: Koichiro Den <den@klaipeden.com>

Creating sub policy that matches the same outer flow as main policy does
leads to a null pointer dereference if the outer mode's family is ipv4.
For userspace compatibility, this patch just eliminates the crash i.e.,
does not introduce any new sorting rule, which would fruitlessly affect
all but the aforementioned case.

Signed-off-by: Koichiro Den <den@klaipeden.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_state.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 6c0956d..a792eff 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1620,6 +1620,7 @@ int
 xfrm_tmpl_sort(struct xfrm_tmpl **dst, struct xfrm_tmpl **src, int n,
 	       unsigned short family, struct net *net)
 {
+	int i;
 	int err = 0;
 	struct xfrm_state_afinfo *afinfo = xfrm_state_get_afinfo(family);
 	if (!afinfo)
@@ -1628,6 +1629,9 @@ xfrm_tmpl_sort(struct xfrm_tmpl **dst, struct xfrm_tmpl **src, int n,
 	spin_lock_bh(&net->xfrm.xfrm_state_lock); /*FIXME*/
 	if (afinfo->tmpl_sort)
 		err = afinfo->tmpl_sort(dst, src, n);
+	else
+		for (i = 0; i < n; i++)
+			dst[i] = src[i];
 	spin_unlock_bh(&net->xfrm.xfrm_state_lock);
 	rcu_read_unlock();
 	return err;
@@ -1638,6 +1642,7 @@ int
 xfrm_state_sort(struct xfrm_state **dst, struct xfrm_state **src, int n,
 		unsigned short family)
 {
+	int i;
 	int err = 0;
 	struct xfrm_state_afinfo *afinfo = xfrm_state_get_afinfo(family);
 	struct net *net = xs_net(*src);
@@ -1648,6 +1653,9 @@ xfrm_state_sort(struct xfrm_state **dst, struct xfrm_state **src, int n,
 	spin_lock_bh(&net->xfrm.xfrm_state_lock);
 	if (afinfo->state_sort)
 		err = afinfo->state_sort(dst, src, n);
+	else
+		for (i = 0; i < n; i++)
+			dst[i] = src[i];
 	spin_unlock_bh(&net->xfrm.xfrm_state_lock);
 	rcu_read_unlock();
 	return err;
-- 
2.7.4

^ permalink raw reply related

* pull request (net): ipsec 2017-08-21
From: Steffen Klassert @ 2017-08-21  6:07 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev

1) Fix memleaks when ESP takes an error path.

2) Fix null pointer dereference when creating a sub policy
   that matches the same outer flow as main policy does.
   From Koichiro Den.

3) Fix possible out-of-bound access in xfrm_migrate.
   This patch should go to the stable trees too.
   From Vladis Dronov.

4) ESP can return positive and negative error values,
   so treat both cases as an error.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit edaf3825182958a1fd5e39708fcb6ea48eca2060:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2017-07-12 19:30:57 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master

for you to fetch changes up to 4ff0308f06da5016aafb05330ed37809b54f81ae:

  esp: Fix error handling on layer 2 xmit. (2017-08-07 08:31:07 +0200)

----------------------------------------------------------------
Koichiro Den (1):
      xfrm: fix null pointer dereference on state and tmpl sort

Steffen Klassert (2):
      esp: Fix memleaks on error paths.
      esp: Fix error handling on layer 2 xmit.

Vladis Dronov (1):
      xfrm: policy: check policy direction value

 net/ipv4/esp4.c         | 13 ++++++++-----
 net/ipv4/esp4_offload.c |  2 +-
 net/ipv6/esp6.c         |  9 +++++----
 net/ipv6/esp6_offload.c |  2 +-
 net/xfrm/xfrm_policy.c  |  6 ++++++
 net/xfrm/xfrm_state.c   |  8 ++++++++
 6 files changed, 29 insertions(+), 11 deletions(-)

^ permalink raw reply

* [PATCH 1/4] esp: Fix memleaks on error paths.
From: Steffen Klassert @ 2017-08-21  6:08 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <1503295683-19153-1-git-send-email-steffen.klassert@secunet.com>

We leak the temporary allocated resources in error paths,
fix this by freeing them.

Fixes: fca11ebde3f ("esp4: Reorganize esp_output")
Fixes: 383d0350f2c ("esp6: Reorganize esp_output")
Fixes: 3f29770723f ("ipsec: check return value of skb_to_sgvec always")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/ipv4/esp4.c | 13 ++++++++-----
 net/ipv6/esp6.c |  9 +++++----
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 0cbee0a..dbb31a9 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -381,7 +381,7 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
 		           (unsigned char *)esph - skb->data,
 		           assoclen + ivlen + esp->clen + alen);
 	if (unlikely(err < 0))
-		goto error;
+		goto error_free;
 
 	if (!esp->inplace) {
 		int allocsize;
@@ -392,7 +392,7 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
 		spin_lock_bh(&x->lock);
 		if (unlikely(!skb_page_frag_refill(allocsize, pfrag, GFP_ATOMIC))) {
 			spin_unlock_bh(&x->lock);
-			goto error;
+			goto error_free;
 		}
 
 		skb_shinfo(skb)->nr_frags = 1;
@@ -409,7 +409,7 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
 			           (unsigned char *)esph - skb->data,
 			           assoclen + ivlen + esp->clen + alen);
 		if (unlikely(err < 0))
-			goto error;
+			goto error_free;
 	}
 
 	if ((x->props.flags & XFRM_STATE_ESN))
@@ -442,8 +442,9 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
 
 	if (sg != dsg)
 		esp_ssg_unref(x, tmp);
-	kfree(tmp);
 
+error_free:
+	kfree(tmp);
 error:
 	return err;
 }
@@ -695,8 +696,10 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
 
 	sg_init_table(sg, nfrags);
 	err = skb_to_sgvec(skb, sg, 0, skb->len);
-	if (unlikely(err < 0))
+	if (unlikely(err < 0)) {
+		kfree(tmp);
 		goto out;
+	}
 
 	skb->ip_summed = CHECKSUM_NONE;
 
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 9ed3547..392def1 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -345,7 +345,7 @@ int esp6_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info
 		           (unsigned char *)esph - skb->data,
 		           assoclen + ivlen + esp->clen + alen);
 	if (unlikely(err < 0))
-		goto error;
+		goto error_free;
 
 	if (!esp->inplace) {
 		int allocsize;
@@ -356,7 +356,7 @@ int esp6_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info
 		spin_lock_bh(&x->lock);
 		if (unlikely(!skb_page_frag_refill(allocsize, pfrag, GFP_ATOMIC))) {
 			spin_unlock_bh(&x->lock);
-			goto error;
+			goto error_free;
 		}
 
 		skb_shinfo(skb)->nr_frags = 1;
@@ -373,7 +373,7 @@ int esp6_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info
 			           (unsigned char *)esph - skb->data,
 			           assoclen + ivlen + esp->clen + alen);
 		if (unlikely(err < 0))
-			goto error;
+			goto error_free;
 	}
 
 	if ((x->props.flags & XFRM_STATE_ESN))
@@ -406,8 +406,9 @@ int esp6_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info
 
 	if (sg != dsg)
 		esp_ssg_unref(x, tmp);
-	kfree(tmp);
 
+error_free:
+	kfree(tmp);
 error:
 	return err;
 }
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next v5] openvswitch: enable NSH support
From: Yi Yang @ 2017-08-21  5:54 UTC (permalink / raw)
  To: netdev; +Cc: dev, blp, jbenc, e, jan.scheurich, Yi Yang

v4->v5
 - Fix many comments by Jiri Benc and Eric Garver
   for v4.

v3->v4
 - Add new NSH match field ttl
 - Update NSH header to the latest format
   which will be final format and won't change
   per its author's confirmation.
 - Fix comments for v3.

v2->v3
 - Change OVS_KEY_ATTR_NSH to nested key to handle
   length-fixed attributes and length-variable
   attriubte more flexibly.
 - Remove struct ovs_action_push_nsh completely
 - Add code to handle nested attribute for SET_MASKED
 - Change PUSH_NSH to use the nested OVS_KEY_ATTR_NSH
   to transfer NSH header data.
 - Fix comments and coding style issues by Jiri and Eric

v1->v2
 - Change encap_nsh and decap_nsh to push_nsh and pop_nsh
 - Dynamically allocate struct ovs_action_push_nsh for
   length-variable metadata.

OVS master and 2.8 branch has merged NSH userspace
patch series, this patch is to enable NSH support
in kernel data path in order that OVS can support
NSH in 2.8 release in compat mode by porting this.

Signed-off-by: Yi Yang <yi.y.yang@intel.com>
---
 drivers/net/vxlan.c              |   7 +
 include/net/nsh.h                | 307 +++++++++++++++++++++++++++++
 include/uapi/linux/if_ether.h    |   1 +
 include/uapi/linux/openvswitch.h |  28 +++
 net/openvswitch/actions.c        | 181 ++++++++++++++++++
 net/openvswitch/flow.c           |  50 +++++
 net/openvswitch/flow.h           |  11 ++
 net/openvswitch/flow_netlink.c   | 404 ++++++++++++++++++++++++++++++++++++++-
 net/openvswitch/flow_netlink.h   |   4 +
 9 files changed, 992 insertions(+), 1 deletion(-)
 create mode 100644 include/net/nsh.h

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ae3a1da..a36c41e 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -27,6 +27,7 @@
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 #include <net/vxlan.h>
+#include <net/nsh.h>
 
 #if IS_ENABLED(CONFIG_IPV6)
 #include <net/ip6_tunnel.h>
@@ -1268,6 +1269,9 @@ static bool vxlan_parse_gpe_hdr(struct vxlanhdr *unparsed,
 	case VXLAN_GPE_NP_IPV6:
 		*protocol = htons(ETH_P_IPV6);
 		break;
+	case VXLAN_GPE_NP_NSH:
+		*protocol = htons(ETH_P_NSH);
+		break;
 	case VXLAN_GPE_NP_ETHERNET:
 		*protocol = htons(ETH_P_TEB);
 		break;
@@ -1807,6 +1811,9 @@ static int vxlan_build_gpe_hdr(struct vxlanhdr *vxh, u32 vxflags,
 	case htons(ETH_P_IPV6):
 		gpe->next_protocol = VXLAN_GPE_NP_IPV6;
 		return 0;
+	case htons(ETH_P_NSH):
+		gpe->next_protocol = VXLAN_GPE_NP_NSH;
+		return 0;
 	case htons(ETH_P_TEB):
 		gpe->next_protocol = VXLAN_GPE_NP_ETHERNET;
 		return 0;
diff --git a/include/net/nsh.h b/include/net/nsh.h
new file mode 100644
index 0000000..df5812e
--- /dev/null
+++ b/include/net/nsh.h
@@ -0,0 +1,307 @@
+#ifndef __NET_NSH_H
+#define __NET_NSH_H 1
+
+/*
+ * Network Service Header:
+ *  0                   1                   2                   3
+ *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |Ver|O|U|    TTL    |   Length  |U|U|U|U|MD Type| Next Protocol |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |          Service Path Identifier (SPI)        | Service Index |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |                                                               |
+ * ~               Mandatory/Optional Context Headers              ~
+ * |                                                               |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *
+ * Version: The version field is used to ensure backward compatibility
+ * going forward with future NSH specification updates.  It MUST be set
+ * to 0x0 by the sender, in this first revision of NSH.  Given the
+ * widespread implementation of existing hardware that uses the first
+ * nibble after an MPLS label stack for ECMP decision processing, this
+ * document reserves version 01b and this value MUST NOT be used in
+ * future versions of the protocol.  Please see [RFC7325] for further
+ * discussion of MPLS-related forwarding requirements.
+ *
+ * O bit: Setting this bit indicates an Operations, Administration, and
+ * Maintenance (OAM) packet.  The actual format and processing of SFC
+ * OAM packets is outside the scope of this specification (see for
+ * example [I-D.ietf-sfc-oam-framework] for one approach).
+ *
+ * The O bit MUST be set for OAM packets and MUST NOT be set for non-OAM
+ * packets.  The O bit MUST NOT be modified along the SFP.
+ *
+ * SF/SFF/SFC Proxy/Classifier implementations that do not support SFC
+ * OAM procedures SHOULD discard packets with O bit set, but MAY support
+ * a configurable parameter to enable forwarding received SFC OAM
+ * packets unmodified to the next element in the chain.  Forwarding OAM
+ * packets unmodified by SFC elements that do not support SFC OAM
+ * procedures may be acceptable for a subset of OAM functions, but can
+ * result in unexpected outcomes for others, thus it is recommended to
+ * analyze the impact of forwarding an OAM packet for all OAM functions
+ * prior to enabling this behavior.  The configurable parameter MUST be
+ * disabled by default.
+ *
+ * TTL: Indicates the maximum SFF hops for an SFP.  This field is used
+ * for service plane loop detection.  The initial TTL value SHOULD be
+ * configurable via the control plane; the configured initial value can
+ * be specific to one or more SFPs.  If no initial value is explicitly
+ * provided, the default initial TTL value of 63 MUST be used.  Each SFF
+ * involved in forwarding an NSH packet MUST decrement the TTL value by
+ * 1 prior to NSH forwarding lookup.  Decrementing by 1 from an incoming
+ * value of 0 shall result in a TTL value of 63.  The packet MUST NOT be
+ * forwarded if TTL is, after decrement, 0.
+ *
+ * All other flag fields, marked U, are unassigned and available for
+ * future use, see Section 11.2.1.  Unassigned bits MUST be set to zero
+ * upon origination, and MUST be ignored and preserved unmodified by
+ * other NSH supporting elements.  Elements which do not understand the
+ * meaning of any of these bits MUST NOT modify their actions based on
+ * those unknown bits.
+ *
+ * Length: The total length, in 4-byte words, of NSH including the Base
+ * Header, the Service Path Header, the Fixed Length Context Header or
+ * Variable Length Context Header(s).  The length MUST be 0x6 for MD
+ * Type equal to 0x1, and MUST be 0x2 or greater for MD Type equal to
+ * 0x2.  The length of the NSH header MUST be an integer multiple of 4
+ * bytes, thus variable length metadata is always padded out to a
+ * multiple of 4 bytes.
+ *
+ * MD Type: Indicates the format of NSH beyond the mandatory Base Header
+ * and the Service Path Header.  MD Type defines the format of the
+ * metadata being carried.
+ *
+ * 0x0 - This is a reserved value.  Implementations SHOULD silently
+ * discard packets with MD Type 0x0.
+ *
+ * 0x1 - This indicates that the format of the header includes a fixed
+ * length Context Header (see Figure 4 below).
+ *
+ * 0x2 - This does not mandate any headers beyond the Base Header and
+ * Service Path Header, but may contain optional variable length Context
+ * Header(s).  The semantics of the variable length Context Header(s)
+ * are not defined in this document.  The format of the optional
+ * variable length Context Headers is provided in Section 2.5.1.
+ *
+ * 0xF - This value is reserved for experimentation and testing, as per
+ * [RFC3692].  Implementations not explicitly configured to be part of
+ * an experiment SHOULD silently discard packets with MD Type 0xF.
+ *
+ * Next Protocol: indicates the protocol type of the encapsulated data.
+ * NSH does not alter the inner payload, and the semantics on the inner
+ * protocol remain unchanged due to NSH service function chaining.
+ * Please see the IANA Considerations section below, Section 11.2.5.
+ *
+ * This document defines the following Next Protocol values:
+ *
+ * 0x1: IPv4
+ * 0x2: IPv6
+ * 0x3: Ethernet
+ * 0x4: NSH
+ * 0x5: MPLS
+ * 0xFE: Experiment 1
+ * 0xFF: Experiment 2
+ *
+ * Packets with Next Protocol values not supported SHOULD be silently
+ * dropped by default, although an implementation MAY provide a
+ * configuration parameter to forward them.  Additionally, an
+ * implementation not explicitly configured for a specific experiment
+ * [RFC3692] SHOULD silently drop packets with Next Protocol values 0xFE
+ * and 0xFF.
+ *
+ * Service Path Identifier (SPI): Identifies a service path.
+ * Participating nodes MUST use this identifier for Service Function
+ * Path selection.  The initial classifier MUST set the appropriate SPI
+ * for a given classification result.
+ *
+ * Service Index (SI): Provides location within the SFP.  The initial
+ * classifier for a given SFP SHOULD set the SI to 255, however the
+ * control plane MAY configure the initial value of SI as appropriate
+ * (i.e., taking into account the length of the service function path).
+ * The Service Index MUST be decremented by a value of 1 by Service
+ * Functions or by SFC Proxy nodes after performing required services
+ * and the new decremented SI value MUST be used in the egress packet's
+ * NSH.  The initial Classifier MUST send the packet to the first SFF in
+ * the identified SFP for forwarding along an SFP.  If re-classification
+ * occurs, and that re-classification results in a new SPI, the
+ * (re)classifier is, in effect, the initial classifier for the
+ * resultant SPI.
+ *
+ * The SI is used in conjunction the with Service Path Identifier for
+ * Service Function Path Selection and for determining the next SFF/SF
+ * in the path.  The SI is also valuable when troubleshooting or
+ * reporting service paths.  Additionally, while the TTL field is the
+ * main mechanism for service plane loop detection, the SI can also be
+ * used for detecting service plane loops.
+ *
+ * When the Base Header specifies MD Type = 0x1, a Fixed Length Context
+ * Header (16-bytes) MUST be present immediately following the Service
+ * Path Header. The value of a Fixed Length Context
+ * Header that carries no metadata MUST be set to zero.
+ *
+ * When the base header specifies MD Type = 0x2, zero or more Variable
+ * Length Context Headers MAY be added, immediately following the
+ * Service Path Header (see Figure 5).  Therefore, Length = 0x2,
+ * indicates that only the Base Header followed by the Service Path
+ * Header are present.  The optional Variable Length Context Headers
+ * MUST be of an integer number of 4-bytes.  The base header Length
+ * field MUST be used to determine the offset to locate the original
+ * packet or frame for SFC nodes that require access to that
+ * information.
+ *
+ * The format of the optional variable length Context Headers
+ *
+ *  0                   1                   2                   3
+ *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |          Metadata Class       |      Type     |U|    Length   |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |                      Variable Metadata                        |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *
+ * Metadata Class (MD Class): Defines the scope of the 'Type' field to
+ * provide a hierarchical namespace.  The IANA Considerations
+ * Section 11.2.4 defines how the MD Class values can be allocated to
+ * standards bodies, vendors, and others.
+ *
+ * Type: Indicates the explicit type of metadata being carried.  The
+ * definition of the Type is the responsibility of the MD Class owner.
+ *
+ * Unassigned bit: One unassigned bit is available for future use. This
+ * bit MUST NOT be set, and MUST be ignored on receipt.
+ *
+ * Length: Indicates the length of the variable metadata, in bytes.  In
+ * case the metadata length is not an integer number of 4-byte words,
+ * the sender MUST add pad bytes immediately following the last metadata
+ * byte to extend the metadata to an integer number of 4-byte words.
+ * The receiver MUST round up the length field to the nearest 4-byte
+ * word boundary, to locate and process the next field in the packet.
+ * The receiver MUST access only those bytes in the metadata indicated
+ * by the length field (i.e., actual number of bytes) and MUST ignore
+ * the remaining bytes up to the nearest 4-byte word boundary.  The
+ * Length may be 0 or greater.
+ *
+ * A value of 0 denotes a Context Header without a Variable Metadata
+ * field.
+ *
+ * [0] https://datatracker.ietf.org/doc/draft-ietf-sfc-nsh/
+ */
+
+/**
+ * struct nsh_md1_ctx - Keeps track of NSH context data
+ * @nshc<1-4>: NSH Contexts.
+ */
+struct nsh_md1_ctx {
+	__be32 context[4];
+};
+
+struct nsh_md2_tlv {
+	__be16 md_class;
+	u8 type;
+	u8 length;
+	/* Followed by variable-length data. */
+};
+
+struct nsh_hdr {
+	__be16 ver_flags_ttl_len;
+	u8 md_type;
+	u8 next_proto;
+	__be32 path_hdr;
+	union {
+	    struct nsh_md1_ctx md1;
+	    struct nsh_md2_tlv md2;
+	};
+};
+
+/* Masking NSH header fields. */
+#define NSH_VER_MASK       0xc000
+#define NSH_VER_SHIFT      14
+#define NSH_FLAGS_MASK     0x3000
+#define NSH_FLAGS_SHIFT    12
+#define NSH_TTL_MASK       0x0fc0
+#define NSH_TTL_SHIFT      6
+#define NSH_LEN_MASK       0x003f
+#define NSH_LEN_SHIFT      0
+
+#define NSH_MDTYPE_MASK    0x0f
+#define NSH_MDTYPE_SHIFT   0
+
+#define NSH_SPI_MASK       0xffffff00
+#define NSH_SPI_SHIFT      8
+#define NSH_SI_MASK        0x000000ff
+#define NSH_SI_SHIFT       0
+
+/* NSH Base Header Next Protocol. */
+#define NSH_P_IPV4        0x01
+#define NSH_P_IPV6        0x02
+#define NSH_P_ETHERNET    0x03
+#define NSH_P_NSH         0x04
+#define NSH_P_MPLS        0x05
+
+/* MD Type Registry. */
+#define NSH_M_TYPE1     0x01
+#define NSH_M_TYPE2     0x02
+#define NSH_M_EXP1      0xFE
+#define NSH_M_EXP2      0xFF
+
+/* NSH Base Header Length */
+#define NSH_BASE_HDR_LEN  8
+
+/* NSH MD Type 1 header Length. */
+#define NSH_M_TYPE1_LEN   24
+
+/* NSH MD Type 2 header maximum Length. */
+#define NSH_M_TYPE2_MAX_LEN 256
+
+/* NSH MD Type 2 Metadata maximum Length. */
+#define NSH_M_TYPE2_MD_MAX_LEN (NSH_M_TYPE2_MAX_LEN - NSH_BASE_HDR_LEN)
+
+static inline u16 nsh_hdr_len(const struct nsh_hdr *nsh)
+{
+	return ((ntohs(nsh->ver_flags_ttl_len) & NSH_LEN_MASK)
+		>> NSH_LEN_SHIFT) << 2;
+}
+
+static inline u8 nsh_get_ver(const struct nsh_hdr *nsh)
+{
+	return (ntohs(nsh->ver_flags_ttl_len) & NSH_VER_MASK)
+		>> NSH_VER_SHIFT;
+}
+
+static inline u8 nsh_get_flags(const struct nsh_hdr *nsh)
+{
+	return (ntohs(nsh->ver_flags_ttl_len) & NSH_FLAGS_MASK)
+		>> NSH_FLAGS_SHIFT;
+}
+
+static inline u8 nsh_get_ttl(const struct nsh_hdr *nsh)
+{
+	return (ntohs(nsh->ver_flags_ttl_len) & NSH_TTL_MASK)
+		>> NSH_TTL_SHIFT;
+}
+
+static inline void __nsh_set_xflag(struct nsh_hdr *nsh, u16 xflag, u16 xmask)
+{
+	nsh->ver_flags_ttl_len
+		= (nsh->ver_flags_ttl_len & ~htons(xmask)) | htons(xflag);
+}
+
+static inline void nsh_set_flags_and_ttl(struct nsh_hdr *nsh, u8 flags, u8 ttl)
+{
+	__nsh_set_xflag(nsh, ((flags << NSH_FLAGS_SHIFT) & NSH_FLAGS_MASK) |
+			     ((ttl << NSH_TTL_SHIFT) & NSH_TTL_MASK),
+			NSH_FLAGS_MASK | NSH_TTL_MASK);
+}
+
+static inline void nsh_set_flags_ttl_len(struct nsh_hdr *nsh, u8 flags,
+					 u8 ttl, u8 len)
+{
+	len = len >> 2;
+	__nsh_set_xflag(nsh, ((flags << NSH_FLAGS_SHIFT) & NSH_FLAGS_MASK) |
+			     ((ttl << NSH_TTL_SHIFT) & NSH_TTL_MASK) |
+			     ((len << NSH_LEN_SHIFT) & NSH_LEN_MASK),
+			NSH_FLAGS_MASK | NSH_TTL_MASK | NSH_LEN_MASK);
+}
+
+#endif /* __NET_NSH_H */
diff --git a/include/uapi/linux/if_ether.h b/include/uapi/linux/if_ether.h
index 5bc9bfd..e7f1a61 100644
--- a/include/uapi/linux/if_ether.h
+++ b/include/uapi/linux/if_ether.h
@@ -137,6 +137,7 @@
 #define ETH_P_IEEE802154 0x00F6		/* IEEE802.15.4 frame		*/
 #define ETH_P_CAIF	0x00F7		/* ST-Ericsson CAIF protocol	*/
 #define ETH_P_XDSA	0x00F8		/* Multiplexed DSA protocol	*/
+#define ETH_P_NSH	0x894F		/* Ethertype for NSH. */
 
 /*
  *	This is an Ethernet frame header.
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 156ee4c..91dee5b 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -333,6 +333,7 @@ enum ovs_key_attr {
 	OVS_KEY_ATTR_CT_LABELS,	/* 16-octet connection tracking label */
 	OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4,   /* struct ovs_key_ct_tuple_ipv4 */
 	OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6,   /* struct ovs_key_ct_tuple_ipv6 */
+	OVS_KEY_ATTR_NSH,       /* Nested set of ovs_nsh_key_* */
 
 #ifdef __KERNEL__
 	OVS_KEY_ATTR_TUNNEL_INFO,  /* struct ip_tunnel_info */
@@ -491,6 +492,29 @@ struct ovs_key_ct_tuple_ipv6 {
 	__u8   ipv6_proto;
 };
 
+enum ovs_nsh_key_attr {
+	OVS_NSH_KEY_ATTR_BASE,  /* struct ovs_nsh_key_base. */
+	OVS_NSH_KEY_ATTR_MD1,   /* struct ovs_nsh_key_md1. */
+	OVS_NSH_KEY_ATTR_MD2,   /* variable-length octets for MD type 2. */
+	__OVS_NSH_KEY_ATTR_MAX
+};
+
+#define OVS_NSH_KEY_ATTR_MAX (__OVS_NSH_KEY_ATTR_MAX - 1)
+
+struct ovs_nsh_key_base {
+	__u8 flags;
+	__u8 ttl;
+	__u8 mdtype;
+	__u8 np;
+	__be32 path_hdr;
+};
+
+#define NSH_MD1_CONTEXT_SIZE 4
+
+struct ovs_nsh_key_md1 {
+	__be32 context[NSH_MD1_CONTEXT_SIZE];
+};
+
 /**
  * enum ovs_flow_attr - attributes for %OVS_FLOW_* commands.
  * @OVS_FLOW_ATTR_KEY: Nested %OVS_KEY_ATTR_* attributes specifying the flow
@@ -806,6 +830,8 @@ struct ovs_action_push_eth {
  * packet.
  * @OVS_ACTION_ATTR_POP_ETH: Pop the outermost Ethernet header off the
  * packet.
+ * @OVS_ACTION_ATTR_PUSH_NSH: push NSH header to the packet.
+ * @OVS_ACTION_ATTR_POP_NSH: pop the outermost NSH header off the packet.
  *
  * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not all
  * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
@@ -835,6 +861,8 @@ enum ovs_action_attr {
 	OVS_ACTION_ATTR_TRUNC,        /* u32 struct ovs_action_trunc. */
 	OVS_ACTION_ATTR_PUSH_ETH,     /* struct ovs_action_push_eth. */
 	OVS_ACTION_ATTR_POP_ETH,      /* No argument. */
+	OVS_ACTION_ATTR_PUSH_NSH,     /* Nested OVS_NSH_KEY_ATTR_*. */
+	OVS_ACTION_ATTR_POP_NSH,      /* No argument. */
 
 	__OVS_ACTION_ATTR_MAX,	      /* Nothing past this will be accepted
 				       * from userspace. */
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index e461067..efdcd37 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -43,6 +43,7 @@
 #include "flow.h"
 #include "conntrack.h"
 #include "vport.h"
+#include "flow_netlink.h"
 
 struct deferred_action {
 	struct sk_buff *skb;
@@ -380,6 +381,98 @@ static int push_eth(struct sk_buff *skb, struct sw_flow_key *key,
 	return 0;
 }
 
+static int push_nsh(struct sk_buff *skb, struct sw_flow_key *key,
+		    const struct nsh_hdr *nsh_src)
+{
+	struct nsh_hdr *nsh;
+	size_t length = nsh_hdr_len(nsh_src);
+	u8 next_proto;
+
+	if (key->mac_proto == MAC_PROTO_ETHERNET) {
+		next_proto = NSH_P_ETHERNET;
+	} else {
+		switch (ntohs(skb->protocol)) {
+		case ETH_P_IP:
+			next_proto = NSH_P_IPV4;
+			break;
+		case ETH_P_IPV6:
+			next_proto = NSH_P_IPV6;
+			break;
+		case ETH_P_NSH:
+			next_proto = NSH_P_NSH;
+			break;
+		default:
+			return -ENOTSUPP;
+		}
+	}
+
+	/* Add the NSH header */
+	if (skb_cow_head(skb, length) < 0)
+		return -ENOMEM;
+
+	skb_push(skb, length);
+	nsh = (struct nsh_hdr *)(skb->data);
+	memcpy(nsh, nsh_src, length);
+	nsh->next_proto = next_proto;
+	nsh->md_type &= NSH_MDTYPE_MASK;
+
+	if (!skb->inner_protocol)
+		skb_set_inner_protocol(skb, skb->protocol);
+
+	skb->protocol = htons(ETH_P_NSH);
+	key->eth.type = htons(ETH_P_NSH);
+	skb_reset_mac_header(skb);
+	skb_reset_mac_len(skb);
+
+	/* safe right before invalidate_flow_key */
+	key->mac_proto = MAC_PROTO_NONE;
+	invalidate_flow_key(key);
+	return 0;
+}
+
+static int pop_nsh(struct sk_buff *skb, struct sw_flow_key *key)
+{
+	struct nsh_hdr *nsh = (struct nsh_hdr *)(skb->data);
+	size_t length;
+	u16 inner_proto;
+
+	if (ovs_key_mac_proto(key) != MAC_PROTO_NONE ||
+	    skb->protocol != htons(ETH_P_NSH)) {
+		return -EINVAL;
+	}
+
+	switch (nsh->next_proto) {
+	case NSH_P_ETHERNET:
+		inner_proto = htons(ETH_P_TEB);
+		break;
+	case NSH_P_IPV4:
+		inner_proto = htons(ETH_P_IP);
+		break;
+	case NSH_P_IPV6:
+		inner_proto = htons(ETH_P_IPV6);
+		break;
+	case NSH_P_NSH:
+		inner_proto = htons(ETH_P_NSH);
+		break;
+	default:
+		return -ENOTSUPP;
+	}
+
+	length = nsh_hdr_len(nsh);
+	skb_pull(skb, length);
+	skb_reset_mac_header(skb);
+	skb_reset_mac_len(skb);
+	skb->protocol = inner_proto;
+
+	/* safe right before invalidate_flow_key */
+	if (inner_proto == htons(ETH_P_TEB))
+		key->mac_proto = MAC_PROTO_ETHERNET;
+	else
+		key->mac_proto = MAC_PROTO_NONE;
+	invalidate_flow_key(key);
+	return 0;
+}
+
 static void update_ip_l4_checksum(struct sk_buff *skb, struct iphdr *nh,
 				  __be32 addr, __be32 new_addr)
 {
@@ -602,6 +695,53 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
 	return 0;
 }
 
+static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
+		   const struct ovs_key_nsh *key,
+		   const struct ovs_key_nsh *mask)
+{
+	struct nsh_hdr *nsh;
+	int err;
+	u8 flags;
+	u8 ttl;
+	int i;
+
+	err = skb_ensure_writable(skb, skb_network_offset(skb) +
+				  sizeof(struct nsh_hdr));
+	if (unlikely(err))
+		return err;
+
+	nsh = (struct nsh_hdr *)skb_network_header(skb);
+
+	flags = nsh_get_flags(nsh);
+	flags = OVS_MASKED(flags, key->flags, mask->flags);
+	flow_key->nsh.flags = flags;
+	ttl = nsh_get_ttl(nsh);
+	ttl = OVS_MASKED(ttl, key->ttl, mask->ttl);
+	flow_key->nsh.ttl = ttl;
+	nsh_set_flags_and_ttl(nsh, flags, ttl);
+	nsh->path_hdr = OVS_MASKED(nsh->path_hdr, key->path_hdr,
+				   mask->path_hdr);
+	flow_key->nsh.path_hdr = nsh->path_hdr;
+	switch (nsh->md_type) {
+	case NSH_M_TYPE1:
+		for (i = 0; i < NSH_MD1_CONTEXT_SIZE; i++) {
+			nsh->md1.context[i] =
+			    OVS_MASKED(nsh->md1.context[i], key->context[i],
+				       mask->context[i]);
+		}
+		memcpy(flow_key->nsh.context, nsh->md1.context,
+		       sizeof(nsh->md1.context));
+		break;
+	case NSH_M_TYPE2:
+		memset(flow_key->nsh.context, 0,
+		       sizeof(flow_key->nsh.context));
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
 /* Must follow skb_ensure_writable() since that can move the skb data. */
 static void set_tp_port(struct sk_buff *skb, __be16 *port,
 			__be16 new_port, __sum16 *check)
@@ -1024,6 +1164,32 @@ static int execute_masked_set_action(struct sk_buff *skb,
 				   get_mask(a, struct ovs_key_ethernet *));
 		break;
 
+	case OVS_KEY_ATTR_NSH: {
+		struct ovs_key_nsh nsh;
+		struct ovs_key_nsh nsh_mask;
+		size_t size = nla_len(a) / 2;
+		struct {
+			struct nlattr nla;
+			u8 data[size];
+		} attr, mask;
+
+		attr.nla.nla_type = nla_type(a);
+		attr.nla.nla_len = NLA_HDRLEN + size;
+		memcpy(attr.data, (char *)(a + 1), size);
+
+		mask.nla = attr.nla;
+		memcpy(mask.data, (char *)(a + 1) + size, size);
+
+		err = nsh_key_from_nlattr(&attr.nla, &nsh);
+		if (err)
+			break;
+		err = nsh_key_from_nlattr(&mask.nla, &nsh_mask);
+		if (err)
+			break;
+		err = set_nsh(skb, flow_key, &nsh, &nsh_mask);
+		break;
+	}
+
 	case OVS_KEY_ATTR_IPV4:
 		err = set_ipv4(skb, flow_key, nla_data(a),
 			       get_mask(a, struct ovs_key_ipv4 *));
@@ -1210,6 +1376,21 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 		case OVS_ACTION_ATTR_POP_ETH:
 			err = pop_eth(skb, key);
 			break;
+
+		case OVS_ACTION_ATTR_PUSH_NSH: {
+			u8 buffer[NSH_M_TYPE2_MAX_LEN];
+			struct nsh_hdr *nsh_hdr = (struct nsh_hdr *)buffer;
+			const struct nsh_hdr *nsh_src = nsh_hdr;
+
+			nsh_hdr_from_nlattr(nla_data(a), nsh_hdr,
+					    NSH_M_TYPE2_MAX_LEN);
+			err = push_nsh(skb, key, nsh_src);
+			break;
+		}
+
+		case OVS_ACTION_ATTR_POP_NSH:
+			err = pop_nsh(skb, key);
+			break;
 		}
 
 		if (unlikely(err)) {
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 8c94cef..301bf71 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -46,6 +46,7 @@
 #include <net/ipv6.h>
 #include <net/mpls.h>
 #include <net/ndisc.h>
+#include <net/nsh.h>
 
 #include "conntrack.h"
 #include "datapath.h"
@@ -490,6 +491,51 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
 	return 0;
 }
 
+static int parse_nsh(struct sk_buff *skb, struct sw_flow_key *key)
+{
+	struct nsh_hdr *nsh = (struct nsh_hdr *)skb_network_header(skb);
+	u8 version, length;
+	int err;
+
+	if (unlikely(skb->len < NSH_BASE_HDR_LEN))
+		return -EINVAL;
+
+	version = nsh_get_ver(nsh);
+	length = nsh_hdr_len(nsh);
+
+	if (version != 0)
+		return -EINVAL;
+
+	if (unlikely(skb->len < length))
+		return -EINVAL;
+
+	key->nsh.flags = nsh_get_flags(nsh);
+	key->nsh.ttl = nsh_get_ttl(nsh);
+	key->nsh.mdtype = nsh->md_type;
+	key->nsh.np = nsh->next_proto;
+	key->nsh.path_hdr = nsh->path_hdr;
+	switch (key->nsh.mdtype) {
+	case NSH_M_TYPE1:
+		if (length != NSH_M_TYPE1_LEN)
+			return -EINVAL;
+		memcpy(key->nsh.context, nsh->md1.context,
+		       sizeof(nsh->md1));
+		break;
+	case NSH_M_TYPE2:
+		/* Don't support MD type 2 metedata parsing yet */
+		if (length < NSH_BASE_HDR_LEN)
+			return -EINVAL;
+
+		memset(key->nsh.context, 0,
+		       sizeof(nsh->md1));
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 /**
  * key_extract - extracts a flow key from an Ethernet frame.
  * @skb: sk_buff that contains the frame, with skb->data pointing to the
@@ -735,6 +781,10 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
 				memset(&key->tp, 0, sizeof(key->tp));
 			}
 		}
+	} else if (key->eth.type == htons(ETH_P_NSH)) {
+		error = parse_nsh(skb, key);
+		if (error)
+			return error;
 	}
 	return 0;
 }
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 1875bba..6a3cd9c 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -35,6 +35,7 @@
 #include <net/inet_ecn.h>
 #include <net/ip_tunnels.h>
 #include <net/dst_metadata.h>
+#include <net/nsh.h>
 
 struct sk_buff;
 
@@ -66,6 +67,15 @@ struct vlan_head {
 	(offsetof(struct sw_flow_key, recirc_id) +	\
 	FIELD_SIZEOF(struct sw_flow_key, recirc_id))
 
+struct ovs_key_nsh {
+	u8 flags;
+	u8 ttl;
+	u8 mdtype;
+	u8 np;
+	__be32 path_hdr;
+	__be32 context[NSH_MD1_CONTEXT_SIZE];
+};
+
 struct sw_flow_key {
 	u8 tun_opts[IP_TUNNEL_OPTS_MAX];
 	u8 tun_opts_len;
@@ -144,6 +154,7 @@ struct sw_flow_key {
 			};
 		} ipv6;
 	};
+	struct ovs_key_nsh nsh;         /* network service header */
 	struct {
 		/* Connection tracking fields not packed above. */
 		struct {
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index e8eb427..1cd4785 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -78,9 +78,11 @@ static bool actions_may_change_flow(const struct nlattr *actions)
 		case OVS_ACTION_ATTR_HASH:
 		case OVS_ACTION_ATTR_POP_ETH:
 		case OVS_ACTION_ATTR_POP_MPLS:
+		case OVS_ACTION_ATTR_POP_NSH:
 		case OVS_ACTION_ATTR_POP_VLAN:
 		case OVS_ACTION_ATTR_PUSH_ETH:
 		case OVS_ACTION_ATTR_PUSH_MPLS:
+		case OVS_ACTION_ATTR_PUSH_NSH:
 		case OVS_ACTION_ATTR_PUSH_VLAN:
 		case OVS_ACTION_ATTR_SAMPLE:
 		case OVS_ACTION_ATTR_SET:
@@ -322,12 +324,27 @@ size_t ovs_tun_key_attr_size(void)
 		+ nla_total_size(2);   /* OVS_TUNNEL_KEY_ATTR_TP_DST */
 }
 
+size_t ovs_nsh_key_attr_size(void)
+{
+	/* Whenever adding new OVS_NSH_KEY_ FIELDS, we should consider
+	 * updating this function.
+	 */
+	return  nla_total_size(NSH_BASE_HDR_LEN) /* OVS_NSH_KEY_ATTR_BASE */
+		/* OVS_NSH_KEY_ATTR_MD1 and OVS_NSH_KEY_ATTR_MD2 are
+		 * mutually exclusive, so the bigger one can cover
+		 * the small one.
+		 *
+		 * OVS_NSH_KEY_ATTR_MD2
+		 */
+		+ nla_total_size(NSH_M_TYPE2_MD_MAX_LEN);
+}
+
 size_t ovs_key_attr_size(void)
 {
 	/* Whenever adding new OVS_KEY_ FIELDS, we should consider
 	 * updating this function.
 	 */
-	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 28);
+	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 29);
 
 	return    nla_total_size(4)   /* OVS_KEY_ATTR_PRIORITY */
 		+ nla_total_size(0)   /* OVS_KEY_ATTR_TUNNEL */
@@ -341,6 +358,8 @@ size_t ovs_key_attr_size(void)
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_CT_MARK */
 		+ nla_total_size(16)  /* OVS_KEY_ATTR_CT_LABELS */
 		+ nla_total_size(40)  /* OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6 */
+		+ nla_total_size(0)   /* OVS_KEY_ATTR_NSH */
+		  + ovs_nsh_key_attr_size()
 		+ nla_total_size(12)  /* OVS_KEY_ATTR_ETHERNET */
 		+ nla_total_size(2)   /* OVS_KEY_ATTR_ETHERTYPE */
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_VLAN */
@@ -373,6 +392,13 @@ static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
 	[OVS_TUNNEL_KEY_ATTR_IPV6_DST]      = { .len = sizeof(struct in6_addr) },
 };
 
+static const struct ovs_len_tbl
+ovs_nsh_key_attr_lens[OVS_NSH_KEY_ATTR_MAX + 1] = {
+	[OVS_NSH_KEY_ATTR_BASE]     = { .len = 8 },
+	[OVS_NSH_KEY_ATTR_MD1]      = { .len = 16 },
+	[OVS_NSH_KEY_ATTR_MD2]      = { .len = OVS_ATTR_VARIABLE },
+};
+
 /* The size of the argument for each %OVS_KEY_ATTR_* Netlink attribute.  */
 static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
 	[OVS_KEY_ATTR_ENCAP]	 = { .len = OVS_ATTR_NESTED },
@@ -405,6 +431,8 @@ static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
 		.len = sizeof(struct ovs_key_ct_tuple_ipv4) },
 	[OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6] = {
 		.len = sizeof(struct ovs_key_ct_tuple_ipv6) },
+	[OVS_KEY_ATTR_NSH]       = { .len = OVS_ATTR_NESTED,
+				     .next = ovs_nsh_key_attr_lens, },
 };
 
 static bool check_attr_len(unsigned int attr_len, unsigned int expected_len)
@@ -1179,6 +1207,303 @@ static int metadata_from_nlattrs(struct net *net, struct sw_flow_match *match,
 	return 0;
 }
 
+int nsh_hdr_from_nlattr(const struct nlattr *attr,
+			struct nsh_hdr *nsh, size_t size)
+{
+	struct nlattr *a;
+	int rem;
+	u8 flags = 0;
+	u8 ttl = 0;
+	int mdlen = 0;
+	bool has_md1 = false;
+	bool has_md2 = false;
+	u8 len;
+
+	nla_for_each_nested(a, attr, rem) {
+		int type = nla_type(a);
+
+		if (type > OVS_NSH_KEY_ATTR_MAX) {
+			OVS_NLERR(1, "nsh attr %d is out of range max %d",
+				  type, OVS_NSH_KEY_ATTR_MAX);
+			return -EINVAL;
+		}
+
+		if (!check_attr_len(nla_len(a),
+				    ovs_nsh_key_attr_lens[type].len)) {
+			OVS_NLERR(
+			    1,
+			    "nsh attr %d has unexpected len %d expected %d",
+			    type,
+			    nla_len(a),
+			    ovs_nsh_key_attr_lens[type].len
+			);
+			return -EINVAL;
+		}
+
+		switch (type) {
+		case OVS_NSH_KEY_ATTR_BASE: {
+			const struct ovs_nsh_key_base *base =
+				(struct ovs_nsh_key_base *)nla_data(a);
+			flags = base->flags;
+			ttl = base->ttl;
+			nsh->next_proto = base->np;
+			nsh->md_type = base->mdtype;
+			nsh->path_hdr = base->path_hdr;
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD1: {
+			const struct ovs_nsh_key_md1 *md1 =
+				(struct ovs_nsh_key_md1 *)nla_data(a);
+			struct nsh_md1_ctx *md1_dst = &nsh->md1;
+
+			has_md1 = true;
+			mdlen = nla_len(a);
+			if (((mdlen + NSH_BASE_HDR_LEN) != NSH_M_TYPE1_LEN) ||
+			    ((mdlen + NSH_BASE_HDR_LEN) > size) ||
+			    (mdlen <= 0)) {
+				OVS_NLERR(
+				    1,
+				    "length %d of nsh attr %d is invalid",
+				    mdlen,
+				    type
+				);
+				return -EINVAL;
+			}
+			memcpy(md1_dst, md1, mdlen);
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD2: {
+			const struct u8 *md2 = nla_data(a);
+			struct nsh_md2_tlv *md2_dst = &nsh->md2;
+
+			has_md2 = true;
+			mdlen = nla_len(a);
+			if (((mdlen + NSH_BASE_HDR_LEN) > size) ||
+			    (mdlen <= 0)) {
+				OVS_NLERR(
+				    1,
+				    "length %d of nsh attr %d is invalid",
+				    mdlen,
+				    type
+				);
+				return -EINVAL;
+			}
+			memcpy(md2_dst, md2, mdlen);
+			break;
+		}
+		default:
+			OVS_NLERR(1, "Unknown nsh attribute %d",
+				  type);
+			return -EINVAL;
+		}
+	}
+
+	if (rem > 0) {
+		OVS_NLERR(1, "nsh attribute has %d unknown bytes.", rem);
+		return -EINVAL;
+	}
+
+	if ((has_md1 && nsh->md_type != NSH_M_TYPE1) ||
+	    (has_md2 && nsh->md_type != NSH_M_TYPE2)) {
+		OVS_NLERR(1,
+			  "nsh attribute has unmatched MD type %d.",
+			  nsh->md_type);
+		return -EINVAL;
+	}
+
+	if (unlikely(has_md1 && has_md2)) {
+		OVS_NLERR(1, "both nsh md1 and md2 attribute are there");
+		return -EINVAL;
+	}
+
+	if (unlikely(!has_md1 && !has_md2)) {
+		OVS_NLERR(1, "neither nsh md1 nor md2 attribute is there");
+		return -EINVAL;
+	}
+
+	/* nsh header length  = NSH_BASE_HDR_LEN + mdlen */
+	len = NSH_BASE_HDR_LEN + mdlen;
+	nsh_set_flags_ttl_len(nsh, flags, ttl, len);
+
+	return 0;
+}
+
+int nsh_key_from_nlattr(const struct nlattr *attr,
+			struct ovs_key_nsh *nsh)
+{
+	struct nlattr *a;
+	int rem;
+	bool has_md1 = false;
+
+	nla_for_each_nested(a, attr, rem) {
+		int type = nla_type(a);
+
+		if (type > OVS_NSH_KEY_ATTR_MAX) {
+			OVS_NLERR(1, "nsh attr %d is out of range max %d",
+				  type, OVS_NSH_KEY_ATTR_MAX);
+			return -EINVAL;
+		}
+
+		if (!check_attr_len(nla_len(a),
+				    ovs_nsh_key_attr_lens[type].len)) {
+			OVS_NLERR(
+			    1,
+			    "nsh attr %d has unexpected len %d expected %d",
+			    type,
+			    nla_len(a),
+			    ovs_nsh_key_attr_lens[type].len
+			);
+			return -EINVAL;
+		}
+
+		switch (type) {
+		case OVS_NSH_KEY_ATTR_BASE: {
+			const struct ovs_nsh_key_base *base =
+				(struct ovs_nsh_key_base *)nla_data(a);
+
+			memcpy(nsh, base, sizeof(*base));
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD1: {
+			const struct ovs_nsh_key_md1 *md1 =
+				(struct ovs_nsh_key_md1 *)nla_data(a);
+
+			has_md1 = true;
+			memcpy(nsh->context, md1->context, sizeof(*md1));
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD2:
+			/* Not supported yet */
+			return -ENOTSUPP;
+		default:
+			OVS_NLERR(1, "Unknown nsh attribute %d",
+				  type);
+			return -EINVAL;
+		}
+	}
+
+	if (rem > 0) {
+		OVS_NLERR(1, "nsh attribute has %d unknown bytes.", rem);
+		return -EINVAL;
+	}
+
+	if ((has_md1 && nsh->mdtype != NSH_M_TYPE1)) {
+		OVS_NLERR(1, "nsh attribute has unmatched MD type %d.",
+			  nsh->mdtype);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int nsh_key_put_from_nlattr(const struct nlattr *attr,
+				   struct sw_flow_match *match, bool is_mask,
+				   bool is_push_nsh, bool log)
+{
+	struct nlattr *a;
+	int rem;
+	bool has_md1 = false;
+	bool has_md2 = false;
+	u8 mdtype = 0;
+	int mdlen = 0;
+
+	nla_for_each_nested(a, attr, rem) {
+		int type = nla_type(a);
+		int i;
+
+		if (type > OVS_NSH_KEY_ATTR_MAX) {
+			OVS_NLERR(log, "nsh attr %d is out of range max %d",
+				  type, OVS_NSH_KEY_ATTR_MAX);
+			return -EINVAL;
+		}
+
+		if (!check_attr_len(nla_len(a),
+				    ovs_nsh_key_attr_lens[type].len)) {
+			OVS_NLERR(
+			    log,
+			    "nsh attr %d has unexpected len %d expected %d",
+			    type,
+			    nla_len(a),
+			    ovs_nsh_key_attr_lens[type].len
+			);
+			return -EINVAL;
+		}
+
+		switch (type) {
+		case OVS_NSH_KEY_ATTR_BASE: {
+			const struct ovs_nsh_key_base *base =
+				(struct ovs_nsh_key_base *)nla_data(a);
+
+			mdtype = base->mdtype;
+			SW_FLOW_KEY_PUT(match, nsh.flags,
+					base->flags, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.ttl,
+					base->ttl, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.mdtype,
+					base->mdtype, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.np,
+					base->np, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.path_hdr,
+					base->path_hdr, is_mask);
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD1: {
+			const struct ovs_nsh_key_md1 *md1 =
+				(struct ovs_nsh_key_md1 *)nla_data(a);
+
+			has_md1 = true;
+			for (i = 0; i < NSH_MD1_CONTEXT_SIZE; i++)
+				SW_FLOW_KEY_PUT(match, nsh.context[i],
+						md1->context[i], is_mask);
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD2:
+			if (!is_push_nsh) /* Not supported MD type 2 yet */
+				return -ENOTSUPP;
+
+			has_md2 = true;
+			mdlen = nla_len(a);
+			if ((mdlen + NSH_BASE_HDR_LEN) > NSH_M_TYPE2_MAX_LEN ||
+			    mdlen <= 0)
+				return -EINVAL;
+			break;
+		default:
+			OVS_NLERR(log, "Unknown nsh attribute %d",
+				  type);
+			return -EINVAL;
+		}
+	}
+
+	if (rem > 0) {
+		OVS_NLERR(log, "nsh attribute has %d unknown bytes.", rem);
+		return -EINVAL;
+	}
+
+	if (!is_mask) {
+		if ((has_md1 && mdtype != NSH_M_TYPE1)) {
+			OVS_NLERR(1, "nsh attribute has unmatched MD type %d.",
+				  mdtype);
+			return -EINVAL;
+		}
+	}
+
+	if (is_push_nsh & !is_mask) {
+		if ((has_md1 && mdtype != NSH_M_TYPE1) ||
+		    (has_md2 && mdtype != NSH_M_TYPE2) ||
+		    (has_md1 && has_md2) ||
+		    (!has_md1 && !has_md2)) {
+			OVS_NLERR(
+			    1,
+			    "push nsh attributes are invalid for type %d.",
+			    mdtype
+			);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
 static int ovs_key_from_nlattrs(struct net *net, struct sw_flow_match *match,
 				u64 attrs, const struct nlattr **a,
 				bool is_mask, bool log)
@@ -1306,6 +1631,13 @@ static int ovs_key_from_nlattrs(struct net *net, struct sw_flow_match *match,
 		attrs &= ~(1 << OVS_KEY_ATTR_ARP);
 	}
 
+	if (attrs & (1 << OVS_KEY_ATTR_NSH)) {
+		if (nsh_key_put_from_nlattr(a[OVS_KEY_ATTR_NSH], match,
+					    is_mask, false, log) < 0)
+			return -EINVAL;
+		attrs &= ~(1 << OVS_KEY_ATTR_NSH);
+	}
+
 	if (attrs & (1 << OVS_KEY_ATTR_MPLS)) {
 		const struct ovs_key_mpls *mpls_key;
 
@@ -1622,6 +1954,40 @@ static int ovs_nla_put_vlan(struct sk_buff *skb, const struct vlan_head *vh,
 	return 0;
 }
 
+static int nsh_key_to_nlattr(const struct ovs_key_nsh *nsh, bool is_mask,
+			     struct sk_buff *skb)
+{
+	struct nlattr *start;
+	struct ovs_nsh_key_base base;
+	struct ovs_nsh_key_md1 md1;
+
+	memcpy(&base, nsh, sizeof(base));
+
+	if (is_mask || nsh->mdtype == NSH_M_TYPE1)
+		memcpy(md1.context, nsh->context, sizeof(md1));
+
+	start = nla_nest_start(skb, OVS_KEY_ATTR_NSH);
+	if (!start)
+		return -EMSGSIZE;
+
+	if (nla_put(skb, OVS_NSH_KEY_ATTR_BASE, sizeof(base), &base))
+		goto nla_put_failure;
+
+	if (is_mask || nsh->mdtype == NSH_M_TYPE1) {
+		if (nla_put(skb, OVS_NSH_KEY_ATTR_MD1, sizeof(md1), &md1))
+			goto nla_put_failure;
+	}
+
+	/* Don't support MD type 2 yet */
+
+	nla_nest_end(skb, start);
+
+	return 0;
+
+nla_put_failure:
+	return -EMSGSIZE;
+}
+
 static int __ovs_nla_put_key(const struct sw_flow_key *swkey,
 			     const struct sw_flow_key *output, bool is_mask,
 			     struct sk_buff *skb)
@@ -1750,6 +2116,9 @@ static int __ovs_nla_put_key(const struct sw_flow_key *swkey,
 		ipv6_key->ipv6_tclass = output->ip.tos;
 		ipv6_key->ipv6_hlimit = output->ip.ttl;
 		ipv6_key->ipv6_frag = output->ip.frag;
+	} else if (swkey->eth.type == htons(ETH_P_NSH)) {
+		if (nsh_key_to_nlattr(&output->nsh, is_mask, skb))
+			goto nla_put_failure;
 	} else if (swkey->eth.type == htons(ETH_P_ARP) ||
 		   swkey->eth.type == htons(ETH_P_RARP)) {
 		struct ovs_key_arp *arp_key;
@@ -2242,6 +2611,19 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 	return err;
 }
 
+static bool validate_nsh(const struct nlattr *attr, bool is_mask,
+			 bool is_push_nsh, bool log)
+{
+	struct sw_flow_match match;
+	struct sw_flow_key key;
+	int ret = 0;
+
+	ovs_match_init(&match, &key, true, NULL);
+	ret = nsh_key_put_from_nlattr(attr, &match, is_mask,
+				      is_push_nsh, log);
+	return ((ret != 0) ? false : true);
+}
+
 /* Return false if there are any non-masked bits set.
  * Mask follows data immediately, before any netlink padding.
  */
@@ -2384,6 +2766,11 @@ static int validate_set(const struct nlattr *a,
 
 		break;
 
+	case OVS_KEY_ATTR_NSH:
+		if (!validate_nsh(nla_data(a), masked, false, log))
+			return -EINVAL;
+		break;
+
 	default:
 		return -EINVAL;
 	}
@@ -2482,6 +2869,8 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 			[OVS_ACTION_ATTR_TRUNC] = sizeof(struct ovs_action_trunc),
 			[OVS_ACTION_ATTR_PUSH_ETH] = sizeof(struct ovs_action_push_eth),
 			[OVS_ACTION_ATTR_POP_ETH] = 0,
+			[OVS_ACTION_ATTR_PUSH_NSH] = (u32)-1,
+			[OVS_ACTION_ATTR_POP_NSH] = 0,
 		};
 		const struct ovs_action_push_vlan *vlan;
 		int type = nla_type(a);
@@ -2636,6 +3025,19 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 			mac_proto = MAC_PROTO_ETHERNET;
 			break;
 
+		case OVS_ACTION_ATTR_PUSH_NSH:
+			mac_proto = MAC_PROTO_NONE;
+			if (!validate_nsh(nla_data(a), false, true, true))
+				return -EINVAL;
+			break;
+
+		case OVS_ACTION_ATTR_POP_NSH:
+			if (key->nsh.np == NSH_P_ETHERNET)
+				mac_proto = MAC_PROTO_ETHERNET;
+			else
+				mac_proto = MAC_PROTO_NONE;
+			break;
+
 		default:
 			OVS_NLERR(log, "Unknown Action type %d", type);
 			return -EINVAL;
diff --git a/net/openvswitch/flow_netlink.h b/net/openvswitch/flow_netlink.h
index 929c665..a849945 100644
--- a/net/openvswitch/flow_netlink.h
+++ b/net/openvswitch/flow_netlink.h
@@ -79,4 +79,8 @@ int ovs_nla_put_actions(const struct nlattr *attr,
 void ovs_nla_free_flow_actions(struct sw_flow_actions *);
 void ovs_nla_free_flow_actions_rcu(struct sw_flow_actions *);
 
+int nsh_key_from_nlattr(const struct nlattr *attr, struct ovs_key_nsh *nsh);
+int nsh_hdr_from_nlattr(const struct nlattr *attr, struct nsh_hdr *nsh_src,
+			size_t size);
+
 #endif /* flow_netlink.h */
-- 
2.5.5

^ permalink raw reply related

* [PATCH 8/8] net: xfrm: support setting an output mark.
From: Steffen Klassert @ 2017-08-21  5:49 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <1503294573-10206-1-git-send-email-steffen.klassert@secunet.com>

From: Lorenzo Colitti <lorenzo@google.com>

On systems that use mark-based routing it may be necessary for
routing lookups to use marks in order for packets to be routed
correctly. An example of such a system is Android, which uses
socket marks to route packets via different networks.

Currently, routing lookups in tunnel mode always use a mark of
zero, making routing incorrect on such systems.

This patch adds a new output_mark element to the xfrm state and
a corresponding XFRMA_OUTPUT_MARK netlink attribute. The output
mark differs from the existing xfrm mark in two ways:

1. The xfrm mark is used to match xfrm policies and states, while
   the xfrm output mark is used to set the mark (and influence
   the routing) of the packets emitted by those states.
2. The existing mark is constrained to be a subset of the bits of
   the originating socket or transformed packet, but the output
   mark is arbitrary and depends only on the state.

The use of a separate mark provides additional flexibility. For
example:

- A packet subject to two transforms (e.g., transport mode inside
  tunnel mode) can have two different output marks applied to it,
  one for the transport mode SA and one for the tunnel mode SA.
- On a system where socket marks determine routing, the packets
  emitted by an IPsec tunnel can be routed based on a mark that
  is determined by the tunnel, not by the marks of the
  unencrypted packets.
- Support for setting the output marks can be introduced without
  breaking any existing setups that employ both mark-based
  routing and xfrm tunnel mode. Simply changing the code to use
  the xfrm mark for routing output packets could xfrm mark could
  change behaviour in a way that breaks these setups.

If the output mark is unspecified or set to zero, the mark is not
set or changed.

Tested: make allyesconfig; make -j64
Tested: https://android-review.googlesource.com/452776
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 include/net/xfrm.h        |  9 ++++++---
 include/uapi/linux/xfrm.h |  1 +
 net/ipv4/xfrm4_policy.c   | 14 +++++++++-----
 net/ipv6/xfrm6_policy.c   |  9 ++++++---
 net/xfrm/xfrm_device.c    |  3 ++-
 net/xfrm/xfrm_output.c    |  3 +++
 net/xfrm/xfrm_policy.c    | 17 +++++++++--------
 net/xfrm/xfrm_user.c      | 11 +++++++++++
 8 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 18d7de3..9c7b70c 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -165,6 +165,7 @@ struct xfrm_state {
 		int		header_len;
 		int		trailer_len;
 		u32		extra_flags;
+		u32		output_mark;
 	} props;
 
 	struct xfrm_lifetime_cfg lft;
@@ -298,10 +299,12 @@ struct xfrm_policy_afinfo {
 	struct dst_entry	*(*dst_lookup)(struct net *net,
 					       int tos, int oif,
 					       const xfrm_address_t *saddr,
-					       const xfrm_address_t *daddr);
+					       const xfrm_address_t *daddr,
+					       u32 mark);
 	int			(*get_saddr)(struct net *net, int oif,
 					     xfrm_address_t *saddr,
-					     xfrm_address_t *daddr);
+					     xfrm_address_t *daddr,
+					     u32 mark);
 	void			(*decode_session)(struct sk_buff *skb,
 						  struct flowi *fl,
 						  int reverse);
@@ -1640,7 +1643,7 @@ static inline int xfrm4_udp_encap_rcv(struct sock *sk, struct sk_buff *skb)
 struct dst_entry *__xfrm_dst_lookup(struct net *net, int tos, int oif,
 				    const xfrm_address_t *saddr,
 				    const xfrm_address_t *daddr,
-				    int family);
+				    int family, u32 mark);
 
 struct xfrm_policy *xfrm_policy_alloc(struct net *net, gfp_t gfp);
 
diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h
index 2b384ff..5fe7370 100644
--- a/include/uapi/linux/xfrm.h
+++ b/include/uapi/linux/xfrm.h
@@ -304,6 +304,7 @@ enum xfrm_attr_type_t {
 	XFRMA_ADDRESS_FILTER,	/* struct xfrm_address_filter */
 	XFRMA_PAD,
 	XFRMA_OFFLOAD_DEV,	/* struct xfrm_state_offload */
+	XFRMA_OUTPUT_MARK,	/* __u32 */
 	__XFRMA_MAX
 
 #define XFRMA_MAX (__XFRMA_MAX - 1)
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 4aefb14..d7bf0b0 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -20,7 +20,8 @@
 static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4,
 					    int tos, int oif,
 					    const xfrm_address_t *saddr,
-					    const xfrm_address_t *daddr)
+					    const xfrm_address_t *daddr,
+					    u32 mark)
 {
 	struct rtable *rt;
 
@@ -28,6 +29,7 @@ static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4,
 	fl4->daddr = daddr->a4;
 	fl4->flowi4_tos = tos;
 	fl4->flowi4_oif = l3mdev_master_ifindex_by_index(net, oif);
+	fl4->flowi4_mark = mark;
 	if (saddr)
 		fl4->saddr = saddr->a4;
 
@@ -42,20 +44,22 @@ static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4,
 
 static struct dst_entry *xfrm4_dst_lookup(struct net *net, int tos, int oif,
 					  const xfrm_address_t *saddr,
-					  const xfrm_address_t *daddr)
+					  const xfrm_address_t *daddr,
+					  u32 mark)
 {
 	struct flowi4 fl4;
 
-	return __xfrm4_dst_lookup(net, &fl4, tos, oif, saddr, daddr);
+	return __xfrm4_dst_lookup(net, &fl4, tos, oif, saddr, daddr, mark);
 }
 
 static int xfrm4_get_saddr(struct net *net, int oif,
-			   xfrm_address_t *saddr, xfrm_address_t *daddr)
+			   xfrm_address_t *saddr, xfrm_address_t *daddr,
+			   u32 mark)
 {
 	struct dst_entry *dst;
 	struct flowi4 fl4;
 
-	dst = __xfrm4_dst_lookup(net, &fl4, 0, oif, NULL, daddr);
+	dst = __xfrm4_dst_lookup(net, &fl4, 0, oif, NULL, daddr, mark);
 	if (IS_ERR(dst))
 		return -EHOSTUNREACH;
 
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index f44b25a..11d1314 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -27,7 +27,8 @@
 
 static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos, int oif,
 					  const xfrm_address_t *saddr,
-					  const xfrm_address_t *daddr)
+					  const xfrm_address_t *daddr,
+					  u32 mark)
 {
 	struct flowi6 fl6;
 	struct dst_entry *dst;
@@ -36,6 +37,7 @@ static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos, int oif,
 	memset(&fl6, 0, sizeof(fl6));
 	fl6.flowi6_oif = l3mdev_master_ifindex_by_index(net, oif);
 	fl6.flowi6_flags = FLOWI_FLAG_SKIP_NH_OIF;
+	fl6.flowi6_mark = mark;
 	memcpy(&fl6.daddr, daddr, sizeof(fl6.daddr));
 	if (saddr)
 		memcpy(&fl6.saddr, saddr, sizeof(fl6.saddr));
@@ -52,12 +54,13 @@ static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos, int oif,
 }
 
 static int xfrm6_get_saddr(struct net *net, int oif,
-			   xfrm_address_t *saddr, xfrm_address_t *daddr)
+			   xfrm_address_t *saddr, xfrm_address_t *daddr,
+			   u32 mark)
 {
 	struct dst_entry *dst;
 	struct net_device *dev;
 
-	dst = xfrm6_dst_lookup(net, 0, oif, NULL, daddr);
+	dst = xfrm6_dst_lookup(net, 0, oif, NULL, daddr, mark);
 	if (IS_ERR(dst))
 		return -EHOSTUNREACH;
 
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 1904127..acf0010 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -79,7 +79,8 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 			daddr = &x->props.saddr;
 		}
 
-		dst = __xfrm_dst_lookup(net, 0, 0, saddr, daddr, x->props.family);
+		dst = __xfrm_dst_lookup(net, 0, 0, saddr, daddr,
+					x->props.family, x->props.output_mark);
 		if (IS_ERR(dst))
 			return 0;
 
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 8c0b672..31a2e6d 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -66,6 +66,9 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
 			goto error_nolock;
 		}
 
+		if (x->props.output_mark)
+			skb->mark = x->props.output_mark;
+
 		err = x->outer_mode->output(x, skb);
 		if (err) {
 			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTSTATEMODEERROR);
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 06c3bf7..1de52f3 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -122,7 +122,7 @@ static const struct xfrm_policy_afinfo *xfrm_policy_get_afinfo(unsigned short fa
 struct dst_entry *__xfrm_dst_lookup(struct net *net, int tos, int oif,
 				    const xfrm_address_t *saddr,
 				    const xfrm_address_t *daddr,
-				    int family)
+				    int family, u32 mark)
 {
 	const struct xfrm_policy_afinfo *afinfo;
 	struct dst_entry *dst;
@@ -131,7 +131,7 @@ struct dst_entry *__xfrm_dst_lookup(struct net *net, int tos, int oif,
 	if (unlikely(afinfo == NULL))
 		return ERR_PTR(-EAFNOSUPPORT);
 
-	dst = afinfo->dst_lookup(net, tos, oif, saddr, daddr);
+	dst = afinfo->dst_lookup(net, tos, oif, saddr, daddr, mark);
 
 	rcu_read_unlock();
 
@@ -143,7 +143,7 @@ static inline struct dst_entry *xfrm_dst_lookup(struct xfrm_state *x,
 						int tos, int oif,
 						xfrm_address_t *prev_saddr,
 						xfrm_address_t *prev_daddr,
-						int family)
+						int family, u32 mark)
 {
 	struct net *net = xs_net(x);
 	xfrm_address_t *saddr = &x->props.saddr;
@@ -159,7 +159,7 @@ static inline struct dst_entry *xfrm_dst_lookup(struct xfrm_state *x,
 		daddr = x->coaddr;
 	}
 
-	dst = __xfrm_dst_lookup(net, tos, oif, saddr, daddr, family);
+	dst = __xfrm_dst_lookup(net, tos, oif, saddr, daddr, family, mark);
 
 	if (!IS_ERR(dst)) {
 		if (prev_saddr != saddr)
@@ -1340,14 +1340,14 @@ int __xfrm_sk_clone_policy(struct sock *sk, const struct sock *osk)
 
 static int
 xfrm_get_saddr(struct net *net, int oif, xfrm_address_t *local,
-	       xfrm_address_t *remote, unsigned short family)
+	       xfrm_address_t *remote, unsigned short family, u32 mark)
 {
 	int err;
 	const struct xfrm_policy_afinfo *afinfo = xfrm_policy_get_afinfo(family);
 
 	if (unlikely(afinfo == NULL))
 		return -EINVAL;
-	err = afinfo->get_saddr(net, oif, local, remote);
+	err = afinfo->get_saddr(net, oif, local, remote, mark);
 	rcu_read_unlock();
 	return err;
 }
@@ -1378,7 +1378,7 @@ xfrm_tmpl_resolve_one(struct xfrm_policy *policy, const struct flowi *fl,
 			if (xfrm_addr_any(local, tmpl->encap_family)) {
 				error = xfrm_get_saddr(net, fl->flowi_oif,
 						       &tmp, remote,
-						       tmpl->encap_family);
+						       tmpl->encap_family, 0);
 				if (error)
 					goto fail;
 				local = &tmp;
@@ -1598,7 +1598,8 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy,
 		if (xfrm[i]->props.mode != XFRM_MODE_TRANSPORT) {
 			family = xfrm[i]->props.family;
 			dst = xfrm_dst_lookup(xfrm[i], tos, fl->flowi_oif,
-					      &saddr, &daddr, family);
+					      &saddr, &daddr, family,
+					      xfrm[i]->props.output_mark);
 			err = PTR_ERR(dst);
 			if (IS_ERR(dst))
 				goto put_states;
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index ffe8d5e..cc3268d 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -584,6 +584,9 @@ static struct xfrm_state *xfrm_state_construct(struct net *net,
 
 	xfrm_mark_get(attrs, &x->mark);
 
+	if (attrs[XFRMA_OUTPUT_MARK])
+		x->props.output_mark = nla_get_u32(attrs[XFRMA_OUTPUT_MARK]);
+
 	err = __xfrm_init_state(x, false, attrs[XFRMA_OFFLOAD_DEV]);
 	if (err)
 		goto error;
@@ -899,6 +902,11 @@ static int copy_to_user_state_extra(struct xfrm_state *x,
 		goto out;
 	if (x->security)
 		ret = copy_sec_ctx(x->security, skb);
+	if (x->props.output_mark) {
+		ret = nla_put_u32(skb, XFRMA_OUTPUT_MARK, x->props.output_mark);
+		if (ret)
+			goto out;
+	}
 out:
 	return ret;
 }
@@ -2454,6 +2462,7 @@ static const struct nla_policy xfrma_policy[XFRMA_MAX+1] = {
 	[XFRMA_PROTO]		= { .type = NLA_U8 },
 	[XFRMA_ADDRESS_FILTER]	= { .len = sizeof(struct xfrm_address_filter) },
 	[XFRMA_OFFLOAD_DEV]	= { .len = sizeof(struct xfrm_user_offload) },
+	[XFRMA_OUTPUT_MARK]	= { .len = NLA_U32 },
 };
 
 static const struct nla_policy xfrma_spd_policy[XFRMA_SPD_MAX+1] = {
@@ -2673,6 +2682,8 @@ static inline size_t xfrm_sa_len(struct xfrm_state *x)
 		l += nla_total_size(sizeof(x->props.extra_flags));
 	if (x->xso.dev)
 		 l += nla_total_size(sizeof(x->xso));
+	if (x->props.output_mark)
+		l += nla_total_size(sizeof(x->props.output_mark));
 
 	/* Must count x->lastused as it may become non-zero behind our back. */
 	l += nla_total_size_64bit(sizeof(u64));
-- 
2.7.4

^ permalink raw reply related

* [PATCH 5/8] xfrm: Auto-load xfrm offload modules
From: Steffen Klassert @ 2017-08-21  5:49 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <1503294573-10206-1-git-send-email-steffen.klassert@secunet.com>

From: Ilan Tayari <ilant@mellanox.com>

IPSec crypto offload depends on the protocol-specific
offload module (such as esp_offload.ko).

When the user installs an SA with crypto-offload, load
the offload module automatically, in the same way
that the protocol module is loaded (such as esp.ko)

Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 include/net/xfrm.h      |  4 +++-
 net/ipv4/esp4_offload.c |  1 +
 net/ipv6/esp6_offload.c |  1 +
 net/xfrm/xfrm_device.c  |  2 +-
 net/xfrm/xfrm_state.c   | 16 ++++++++++++----
 net/xfrm/xfrm_user.c    |  2 +-
 6 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index afb4929..5a36010 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -43,6 +43,8 @@
 	MODULE_ALIAS("xfrm-mode-" __stringify(family) "-" __stringify(encap))
 #define MODULE_ALIAS_XFRM_TYPE(family, proto) \
 	MODULE_ALIAS("xfrm-type-" __stringify(family) "-" __stringify(proto))
+#define MODULE_ALIAS_XFRM_OFFLOAD_TYPE(family, proto) \
+	MODULE_ALIAS("xfrm-offload-" __stringify(family) "-" __stringify(proto))
 
 #ifdef CONFIG_XFRM_STATISTICS
 #define XFRM_INC_STATS(net, field)	SNMP_INC_STATS((net)->mib.xfrm_statistics, field)
@@ -1558,7 +1560,7 @@ void xfrm_spd_getinfo(struct net *net, struct xfrmk_spdinfo *si);
 u32 xfrm_replay_seqhi(struct xfrm_state *x, __be32 net_seq);
 int xfrm_init_replay(struct xfrm_state *x);
 int xfrm_state_mtu(struct xfrm_state *x, int mtu);
-int __xfrm_init_state(struct xfrm_state *x, bool init_replay);
+int __xfrm_init_state(struct xfrm_state *x, bool init_replay, bool offload);
 int xfrm_init_state(struct xfrm_state *x);
 int xfrm_prepare_input(struct xfrm_state *x, struct sk_buff *skb);
 int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type);
diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index 05831de..aca1c85 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -305,3 +305,4 @@ module_init(esp4_offload_init);
 module_exit(esp4_offload_exit);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Steffen Klassert <steffen.klassert@secunet.com>");
+MODULE_ALIAS_XFRM_OFFLOAD_TYPE(AF_INET, XFRM_PROTO_ESP);
diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
index eec3add..8d4e2ba 100644
--- a/net/ipv6/esp6_offload.c
+++ b/net/ipv6/esp6_offload.c
@@ -334,3 +334,4 @@ module_init(esp6_offload_init);
 module_exit(esp6_offload_exit);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Steffen Klassert <steffen.klassert@secunet.com>");
+MODULE_ALIAS_XFRM_OFFLOAD_TYPE(AF_INET6, XFRM_PROTO_ESP);
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 5cd7a24..1904127 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -63,7 +63,7 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 	xfrm_address_t *daddr;
 
 	if (!x->type_offload)
-		return 0;
+		return -EINVAL;
 
 	/* We don't yet support UDP encapsulation, TFC padding and ESN. */
 	if (x->encap || x->tfcpad || (x->props.flags & XFRM_STATE_ESN))
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 82cbbce..a41e2ef 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -296,12 +296,14 @@ int xfrm_unregister_type_offload(const struct xfrm_type_offload *type,
 }
 EXPORT_SYMBOL(xfrm_unregister_type_offload);
 
-static const struct xfrm_type_offload *xfrm_get_type_offload(u8 proto, unsigned short family)
+static const struct xfrm_type_offload *
+xfrm_get_type_offload(u8 proto, unsigned short family, bool try_load)
 {
 	struct xfrm_state_afinfo *afinfo;
 	const struct xfrm_type_offload **typemap;
 	const struct xfrm_type_offload *type;
 
+retry:
 	afinfo = xfrm_state_get_afinfo(family);
 	if (unlikely(afinfo == NULL))
 		return NULL;
@@ -311,6 +313,12 @@ static const struct xfrm_type_offload *xfrm_get_type_offload(u8 proto, unsigned
 	if ((type && !try_module_get(type->owner)))
 		type = NULL;
 
+	if (!type && try_load) {
+		request_module("xfrm-offload-%d-%d", family, proto);
+		try_load = 0;
+		goto retry;
+	}
+
 	rcu_read_unlock();
 	return type;
 }
@@ -2165,7 +2173,7 @@ int xfrm_state_mtu(struct xfrm_state *x, int mtu)
 	return mtu - x->props.header_len;
 }
 
-int __xfrm_init_state(struct xfrm_state *x, bool init_replay)
+int __xfrm_init_state(struct xfrm_state *x, bool init_replay, bool offload)
 {
 	struct xfrm_state_afinfo *afinfo;
 	struct xfrm_mode *inner_mode;
@@ -2230,7 +2238,7 @@ int __xfrm_init_state(struct xfrm_state *x, bool init_replay)
 	if (x->type == NULL)
 		goto error;
 
-	x->type_offload = xfrm_get_type_offload(x->id.proto, family);
+	x->type_offload = xfrm_get_type_offload(x->id.proto, family, offload);
 
 	err = x->type->init_state(x);
 	if (err)
@@ -2258,7 +2266,7 @@ EXPORT_SYMBOL(__xfrm_init_state);
 
 int xfrm_init_state(struct xfrm_state *x)
 {
-	return __xfrm_init_state(x, true);
+	return __xfrm_init_state(x, true, false);
 }
 
 EXPORT_SYMBOL(xfrm_init_state);
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 1b539b7..ffe8d5e 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -584,7 +584,7 @@ static struct xfrm_state *xfrm_state_construct(struct net *net,
 
 	xfrm_mark_get(attrs, &x->mark);
 
-	err = __xfrm_init_state(x, false);
+	err = __xfrm_init_state(x, false, attrs[XFRMA_OFFLOAD_DEV]);
 	if (err)
 		goto error;
 
-- 
2.7.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox