Netdev List
 help / color / mirror / Atom feed
* Re: [net-next PATCH 0/6] i40e: Add VF port representator support or SR-IOV VFs
From: Samudrala, Sridhar @ 2016-12-30 16:55 UTC (permalink / raw)
  To: David Miller
  Cc: alexander.h.duyck, john.r.fastabend, anjali.singhai,
	jakub.kicinski, intel-wired-lan, netdev
In-Reply-To: <20161229.215356.2143143534347167491.davem@davemloft.net>

On 12/29/2016 6:53 PM, David Miller wrote:
> From: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Date: Thu, 29 Dec 2016 22:20:57 -0800
>
>> - Patch 1 introduces devlink interface to get/set the mode of SRIOV switch.
>> - Patch 2 adds support to create VF port representor(VFPR) netdevs associated
>>    with SR-IOV VFs that can be used to control/configure VFs from PF name space.
>> - Patch 3 enables syncing link state between VFs and VFPRs.
>> - Patch 4 adds a new type to metadata_dst to allow passing VF id to lower device.
>> - Patch 5 adds TX and RX support to VFPR netdevs.
>> - Patch 6 enables HW and SW VFPR statistics to be exposed via netlink on VFPR
>>    netdevs.
> Patch 4 did not show up on the lists.
I see it on intel-wired-lan list.  not sure why it didn't make it to netdev.
>
> The computer you did this work on is configured with a time in the future.
Will fix and send a v2 that will address other comments and include patch 4.

^ permalink raw reply

* [PATCH net-next] af_packet: Provide a TPACKET_V2 compatible Tx path for TPACKET_V3
From: Sowmini Varadhan @ 2016-12-30 16:54 UTC (permalink / raw)
  To: netdev, sowmini.varadhan, sowmini.varadhan; +Cc: daniel, willemb, davem

Although TPACKET_V3 Rx has some benefits over TPACKET_V2 Rx,
*_v3 does not currently have Tx support. As a result an application
that wants the best perf for Tx and Rx (e.g. to handle
request/response transacations) ends up needing 2 sockets,
one with *_v2 for Tx and another with *_v3 for Rx.

This patch enables TPACKET_V2 compatible Tx features in TPACKET_V3
so that _v3 supports at least as many features as _v2.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 Documentation/networking/packet_mmap.txt |    6 ++++--
 net/packet/af_packet.c                   |   23 +++++++++++++++--------
 2 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index daa015a..6425062 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -565,7 +565,7 @@ where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
 		   (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
 
 TPACKET_V2 --> TPACKET_V3:
-	- Flexible buffer implementation:
+	- Flexible buffer implementation for RX_RING:
 		1. Blocks can be configured with non-static frame-size
 		2. Read/poll is at a block-level (as opposed to packet-level)
 		3. Added poll timeout to avoid indefinite user-space wait
@@ -574,7 +574,9 @@ where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
 			4.1 block::timeout
 			4.2 tpkt_hdr::sk_rxhash
 	- RX Hash data available in user space
-	- Currently only RX_RING available
+	- TX_RING semantics are conceptually similar to TPACKET_V2;
+	  use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN
+	  instead of TPACKET2_HDRLEN.
 
 -------------------------------------------------------------------------------
 + AF_PACKET fanout mode
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 89f2e8c..57692af 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -409,6 +409,9 @@ static void __packet_set_status(struct packet_sock *po, void *frame, int status)
 		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
 		break;
 	case TPACKET_V3:
+		h.h3->tp_status = status;
+		flush_dcache_page(pgv_to_page(&h.h3->tp_status));
+		break;
 	default:
 		WARN(1, "TPACKET version not supported.\n");
 		BUG();
@@ -432,6 +435,8 @@ static int __packet_get_status(struct packet_sock *po, void *frame)
 		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
 		return h.h2->tp_status;
 	case TPACKET_V3:
+		flush_dcache_page(pgv_to_page(&h.h3->tp_status));
+		return h.h3->tp_status;
 	default:
 		WARN(1, "TPACKET version not supported.\n");
 		BUG();
@@ -2500,6 +2505,9 @@ static int tpacket_parse_header(struct packet_sock *po, void *frame,
 	ph.raw = frame;
 
 	switch (po->tp_version) {
+	case TPACKET_V3:
+		tp_len = ph.h3->tp_len;
+		break;
 	case TPACKET_V2:
 		tp_len = ph.h2->tp_len;
 		break;
@@ -2519,6 +2527,9 @@ static int tpacket_parse_header(struct packet_sock *po, void *frame,
 		off_max = po->tx_ring.frame_size - tp_len;
 		if (po->sk.sk_type == SOCK_DGRAM) {
 			switch (po->tp_version) {
+			case TPACKET_V3:
+				off = ph.h3->tp_net;
+				break;
 			case TPACKET_V2:
 				off = ph.h2->tp_net;
 				break;
@@ -2528,6 +2539,9 @@ static int tpacket_parse_header(struct packet_sock *po, void *frame,
 			}
 		} else {
 			switch (po->tp_version) {
+			case TPACKET_V3:
+				off = ph.h3->tp_mac;
+				break;
 			case TPACKET_V2:
 				off = ph.h2->tp_mac;
 				break;
@@ -4116,11 +4130,6 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 	struct tpacket_req *req = &req_u->req;
 
 	lock_sock(sk);
-	/* Opening a Tx-ring is NOT supported in TPACKET_V3 */
-	if (!closing && tx_ring && (po->tp_version > TPACKET_V2)) {
-		net_warn_ratelimited("Tx-ring is not supported.\n");
-		goto out;
-	}
 
 	rb = tx_ring ? &po->tx_ring : &po->rx_ring;
 	rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
@@ -4180,9 +4189,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 			goto out;
 		switch (po->tp_version) {
 		case TPACKET_V3:
-		/* Transmit path is not supported. We checked
-		 * it above but just being paranoid
-		 */
+			/* Block transmit is not supported yet */
 			if (!tx_ring)
 				init_prb_bdqc(po, rb, pg_vec, req_u);
 			break;
-- 
1.7.1

^ permalink raw reply related

* Re: [net-next PATCH 5/6] i40e: Add TX and RX support in switchdev mode.
From: Samudrala, Sridhar @ 2016-12-30 17:04 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexander Duyck, John Fastabend, Anjali Singhai Jain,
	jakub.kicinski, intel-wired-lan, Linux Netdev List
In-Reply-To: <CAJ3xEMgDAUSE2gVD3dU4UuXFb7k_W-9QgHpg-0c7b5kxRC2=FQ@mail.gmail.com>

On 12/30/2016 7:31 AM, Or Gerlitz wrote:
> On Fri, Dec 30, 2016 at 8:21 AM, Sridhar Samudrala
> <sridhar.samudrala@intel.com> wrote:
>> In switchdev mode, broadcast filter is not enabled on VFs, The broadcasts and
> nit ", The broadcast" --> ", the broadcasts" or ". The broadcasts"
>
>> unknown frames from VFs are received by the PF and passed to corresponding VF
>> port representator netdev.
>> A host based switching entity like a linux bridge or OVS redirects these frames
>> to the right VFs via VFPR netdevs. Any frames sent via VFPR netdevs are sent as
>> directed transmits to the corresponding VFs. To enable directed transmit, skb
>> metadata dst is used to pass the VF id and the frame is requeued to call the PFs
>> transmit routine.
> So the representator netdevs do or don't have TX/RX rings of their
> own? if the latter, was there any special locking you had to do for
> that?
Atleast in the inital implementation,  we are not creating any TX/RX 
rings for VFPR netdevs.
skb->dev is set to the PF device  and skb is requeued via dev_queue_xmit().


>
> Are you exposing switchdev ops for the representators? didn't see that
> or maybe it's in the 4th patch which didn't make it to the list?
Not at this time. In the future patches when we offload fdb/vlan 
functionality, we could use
switchdev ops.

^ permalink raw reply

* Re: [net-next PATCH 3/6] i40e: Sync link state between VFs and VFPRs
From: Samudrala, Sridhar @ 2016-12-30 17:10 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexander Duyck, John Fastabend, Anjali Singhai Jain,
	jakub.kicinski, intel-wired-lan, Linux Netdev List
In-Reply-To: <CAJ3xEMiQaVc=aJMFJdWEDHEeYL5NnxAcqbM_cTa+X5CkNrLtZg@mail.gmail.com>

On 12/30/2016 7:43 AM, Or Gerlitz wrote:
> On Fri, Dec 30, 2016 at 8:21 AM, Sridhar Samudrala
> <sridhar.samudrala@intel.com> wrote:
>
> [..]
>> - enable/disable VF link state via ndo_set_vf_link_state will update the
>>    admin state of associated VFPR.
> why you want to do that? anything missing here w.o using the legacy mode?
Nothing missing.  legacy way of controlling the VF link state is 
supported even in switchdev mode
until it is deprecated.

^ permalink raw reply

* [PATCH net 0/2] l2tp: socket lookup fixes for l2tp_ip and l2tp_ip6
From: Guillaume Nault @ 2016-12-30 18:48 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston

There are still some cases that aren't correctly handled in the socket
lookup functions of l2tp_ip and l2tp_ip6. This series fixes lookups for
connected sockets and for sockets bound to the IPv6 unspecified
address.

bind() and connect() should now work as expected on IPPROTO_L2TP
sockets. Extra features, like SO_REUSEADDR, remain unsupported.

The matching conditions in __l2tp_ip6_bind_lookup() and
__l2tp_ip_bind_lookup() are getting hard to read. I've kept the single
test approach to make the intend of the patches clear. I'll split the
conditionals once these fixes reach net-next.


Guillaume Nault (2):
  l2tp: consider '::' as wildcard address in l2tp_ip6 socket lookup
  l2tp: take remote address into account in l2tp_ip and l2tp_ip6 socket
    lookups

 net/l2tp/l2tp_ip.c  | 19 ++++++-------------
 net/l2tp/l2tp_ip6.c | 24 ++++++++----------------
 2 files changed, 14 insertions(+), 29 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH net 1/2] l2tp: consider '::' as wildcard address in l2tp_ip6 socket lookup
From: Guillaume Nault @ 2016-12-30 18:48 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1483123297.git.g.nault@alphalink.fr>

An L2TP socket bound to the unspecified address should match with any
address. If not, it can't receive any packet and __l2tp_ip6_bind_lookup()
can't prevent another socket from binding on the same device/tunnel ID.

While there, rename the 'addr' variable to 'sk_laddr' (local addr), to
make following patch clearer.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ip6.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index f092ac441fdd..3135b9d55df5 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -64,7 +64,7 @@ static struct sock *__l2tp_ip6_bind_lookup(struct net *net,
 	struct sock *sk;
 
 	sk_for_each_bound(sk, &l2tp_ip6_bind_table) {
-		const struct in6_addr *addr = inet6_rcv_saddr(sk);
+		const struct in6_addr *sk_laddr = inet6_rcv_saddr(sk);
 		struct l2tp_ip6_sock *l2tp = l2tp_ip6_sk(sk);
 
 		if (l2tp == NULL)
@@ -72,7 +72,7 @@ static struct sock *__l2tp_ip6_bind_lookup(struct net *net,
 
 		if ((l2tp->conn_id == tunnel_id) &&
 		    net_eq(sock_net(sk), net) &&
-		    (!addr || ipv6_addr_equal(addr, laddr)) &&
+		    (!sk_laddr || ipv6_addr_any(sk_laddr) || ipv6_addr_equal(sk_laddr, laddr)) &&
 		    (!sk->sk_bound_dev_if || !dif ||
 		     sk->sk_bound_dev_if == dif))
 			goto found;
-- 
2.11.0

^ permalink raw reply related

* [PATCH net 2/2] l2tp: take remote address into account in l2tp_ip and l2tp_ip6 socket lookups
From: Guillaume Nault @ 2016-12-30 18:48 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1483123297.git.g.nault@alphalink.fr>

For connected sockets, __l2tp_ip{,6}_bind_lookup() needs to check the
remote IP when looking for a matching socket. Otherwise a connected
socket can receive traffic not originating from its peer.

Drop l2tp_ip_bind_lookup() and l2tp_ip6_bind_lookup() instead of
updating their prototype, as these functions aren't used.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ip.c  | 19 ++++++-------------
 net/l2tp/l2tp_ip6.c | 20 ++++++--------------
 2 files changed, 12 insertions(+), 27 deletions(-)

diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 8938b6ba57a0..3d73278b86ca 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -47,7 +47,8 @@ static inline struct l2tp_ip_sock *l2tp_ip_sk(const struct sock *sk)
 	return (struct l2tp_ip_sock *)sk;
 }
 
-static struct sock *__l2tp_ip_bind_lookup(struct net *net, __be32 laddr, int dif, u32 tunnel_id)
+static struct sock *__l2tp_ip_bind_lookup(const struct net *net, __be32 laddr,
+					  __be32 raddr, int dif, u32 tunnel_id)
 {
 	struct sock *sk;
 
@@ -61,6 +62,7 @@ static struct sock *__l2tp_ip_bind_lookup(struct net *net, __be32 laddr, int dif
 		if ((l2tp->conn_id == tunnel_id) &&
 		    net_eq(sock_net(sk), net) &&
 		    !(inet->inet_rcv_saddr && inet->inet_rcv_saddr != laddr) &&
+		    (!inet->inet_daddr || !raddr || inet->inet_daddr == raddr) &&
 		    (!sk->sk_bound_dev_if || !dif ||
 		     sk->sk_bound_dev_if == dif))
 			goto found;
@@ -71,15 +73,6 @@ static struct sock *__l2tp_ip_bind_lookup(struct net *net, __be32 laddr, int dif
 	return sk;
 }
 
-static inline struct sock *l2tp_ip_bind_lookup(struct net *net, __be32 laddr, int dif, u32 tunnel_id)
-{
-	struct sock *sk = __l2tp_ip_bind_lookup(net, laddr, dif, tunnel_id);
-	if (sk)
-		sock_hold(sk);
-
-	return sk;
-}
-
 /* When processing receive frames, there are two cases to
  * consider. Data frames consist of a non-zero session-id and an
  * optional cookie. Control frames consist of a regular L2TP header
@@ -183,8 +176,8 @@ static int l2tp_ip_recv(struct sk_buff *skb)
 		struct iphdr *iph = (struct iphdr *) skb_network_header(skb);
 
 		read_lock_bh(&l2tp_ip_lock);
-		sk = __l2tp_ip_bind_lookup(net, iph->daddr, inet_iif(skb),
-					   tunnel_id);
+		sk = __l2tp_ip_bind_lookup(net, iph->daddr, iph->saddr,
+					   inet_iif(skb), tunnel_id);
 		if (!sk) {
 			read_unlock_bh(&l2tp_ip_lock);
 			goto discard;
@@ -280,7 +273,7 @@ static int l2tp_ip_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		inet->inet_saddr = 0;  /* Use device */
 
 	write_lock_bh(&l2tp_ip_lock);
-	if (__l2tp_ip_bind_lookup(net, addr->l2tp_addr.s_addr,
+	if (__l2tp_ip_bind_lookup(net, addr->l2tp_addr.s_addr, 0,
 				  sk->sk_bound_dev_if, addr->l2tp_conn_id)) {
 		write_unlock_bh(&l2tp_ip_lock);
 		ret = -EADDRINUSE;
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 3135b9d55df5..331ccf5a7bad 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -59,12 +59,14 @@ static inline struct l2tp_ip6_sock *l2tp_ip6_sk(const struct sock *sk)
 
 static struct sock *__l2tp_ip6_bind_lookup(struct net *net,
 					   struct in6_addr *laddr,
+					   const struct in6_addr *raddr,
 					   int dif, u32 tunnel_id)
 {
 	struct sock *sk;
 
 	sk_for_each_bound(sk, &l2tp_ip6_bind_table) {
 		const struct in6_addr *sk_laddr = inet6_rcv_saddr(sk);
+		const struct in6_addr *sk_raddr = &sk->sk_v6_daddr;
 		struct l2tp_ip6_sock *l2tp = l2tp_ip6_sk(sk);
 
 		if (l2tp == NULL)
@@ -73,6 +75,7 @@ static struct sock *__l2tp_ip6_bind_lookup(struct net *net,
 		if ((l2tp->conn_id == tunnel_id) &&
 		    net_eq(sock_net(sk), net) &&
 		    (!sk_laddr || ipv6_addr_any(sk_laddr) || ipv6_addr_equal(sk_laddr, laddr)) &&
+		    (!raddr || ipv6_addr_any(sk_raddr) || ipv6_addr_equal(sk_raddr, raddr)) &&
 		    (!sk->sk_bound_dev_if || !dif ||
 		     sk->sk_bound_dev_if == dif))
 			goto found;
@@ -83,17 +86,6 @@ static struct sock *__l2tp_ip6_bind_lookup(struct net *net,
 	return sk;
 }
 
-static inline struct sock *l2tp_ip6_bind_lookup(struct net *net,
-						struct in6_addr *laddr,
-						int dif, u32 tunnel_id)
-{
-	struct sock *sk = __l2tp_ip6_bind_lookup(net, laddr, dif, tunnel_id);
-	if (sk)
-		sock_hold(sk);
-
-	return sk;
-}
-
 /* When processing receive frames, there are two cases to
  * consider. Data frames consist of a non-zero session-id and an
  * optional cookie. Control frames consist of a regular L2TP header
@@ -197,8 +189,8 @@ static int l2tp_ip6_recv(struct sk_buff *skb)
 		struct ipv6hdr *iph = ipv6_hdr(skb);
 
 		read_lock_bh(&l2tp_ip6_lock);
-		sk = __l2tp_ip6_bind_lookup(net, &iph->daddr, inet6_iif(skb),
-					    tunnel_id);
+		sk = __l2tp_ip6_bind_lookup(net, &iph->daddr, &iph->saddr,
+					    inet6_iif(skb), tunnel_id);
 		if (!sk) {
 			read_unlock_bh(&l2tp_ip6_lock);
 			goto discard;
@@ -330,7 +322,7 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	rcu_read_unlock();
 
 	write_lock_bh(&l2tp_ip6_lock);
-	if (__l2tp_ip6_bind_lookup(net, &addr->l2tp_addr, bound_dev_if,
+	if (__l2tp_ip6_bind_lookup(net, &addr->l2tp_addr, NULL, bound_dev_if,
 				   addr->l2tp_conn_id)) {
 		write_unlock_bh(&l2tp_ip6_lock);
 		err = -EADDRINUSE;
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH v2] scm: fix possible control message header alignment issue
From: David Miller @ 2016-12-30 20:20 UTC (permalink / raw)
  To: cugyly; +Cc: netdev, Linyu.Yuan
In-Reply-To: <1483015172-5017-1-git-send-email-cugyly@163.com>

From: yuan linyu <cugyly@163.com>
Date: Thu, 29 Dec 2016 20:39:32 +0800

> From: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
> 
> 1. put_cmsg{_compat}() may copy data to user when buffer free space less than
>    control message header alignment size.
> 2. scm_detach_fds{_compat}() may calc wrong fdmax if control message header
>    have greater alignment size.
> 
> Signed-off-by: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>

But can this actually happen, in practice?

Take, for example, COMPAT_CMSG_DATA().

It aligns "struct compat_cmsghdr" to a multiple of a u32.

I cannot think of any possibly way that, on any architecture
whatsoever:

	CMSG_COMPAT_ALIGN(sizeof(struct compat_cmsghdr))

evaludates to any value other than, exactly:

	sizeof(struct compat_cmsghdr)

If you can come up with a case where this does happen in
practice, I will continue to consider this patch.

Otherwise, we should make the assumptions that exist explicit
and get rid of all of the code that does that funny alignment
upon the cmsghdr structure.

Thanks.

^ permalink raw reply

* Re: [PATCH] ipv4: make tcp_notsent_lowat sysctl knob behave as true unsigned int
From: David Miller @ 2016-12-30 20:23 UTC (permalink / raw)
  To: ptikhomirov
  Cc: edumazet, kuznet, jmorris, yoshfuji, kaber, netdev, linux-kernel,
	khorenko
In-Reply-To: <20161229143507.4934-1-ptikhomirov@virtuozzo.com>

From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Date: Thu, 29 Dec 2016 17:35:07 +0300

> @@ -958,7 +959,9 @@ static struct ctl_table ipv4_net_table[] = {
>  		.data		= &init_net.ipv4.sysctl_tcp_notsent_lowat,
>  		.maxlen		= sizeof(unsigned int),
>  		.mode		= 0644,
> -		.proc_handler	= proc_dointvec,
> +		.proc_handler	= proc_doulongvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &uint_max,

It seems much simpler to use "proc_douintvec()" to fix this bug.

Doesn't it?

^ permalink raw reply

* Re: [PATCH v1 0/2] bpf: add longest prefix match map
From: David Miller @ 2016-12-30 20:25 UTC (permalink / raw)
  To: daniel; +Cc: ast, dh.herrmann, daniel, netdev
In-Reply-To: <20161229172855.14910-1-daniel@zonque.org>

From: Daniel Mack <daniel@zonque.org>
Date: Thu, 29 Dec 2016 18:28:53 +0100

> This patch set adds a longest prefix match algorithm that can be used
> to match IP addresses to a stored set of ranges. It is exposed as a
> bpf map type.
>    
> Internally, data is stored in an unbalanced tree of nodes that has a
> maximum height of n, where n is the prefixlen the trie was created
> with.
>  
> Not that this has nothing to do with fib or fib6 and is in no way meant
> to replace or share code with it. It's rather a much simpler
> implementation that is specifically written with bpf maps in mind.
>  
> Patch 1/2 adds the implementation, and 2/2 an extensive test suite.
>  
> Feedback is much appreciated.

I'll give Alexei and Daniel time to provide feedback on this series.

^ permalink raw reply

* Re: [PATCH net-next] net: Allow IP_MULTICAST_IF to set index to L3 slave
From: David Miller @ 2016-12-30 20:25 UTC (permalink / raw)
  To: dsa; +Cc: netdev, darwin.dingel, Mark.Tomlinson
In-Reply-To: <1483054777-11306-1-git-send-email-dsa@cumulusnetworks.com>

From: David Ahern <dsa@cumulusnetworks.com>
Date: Thu, 29 Dec 2016 15:39:37 -0800

> IP_MULTICAST_IF fails if sk_bound_dev_if is already set and the new index
> does not match it. e.g.,
> 
>     ntpd[15381]: setsockopt IP_MULTICAST_IF 192.168.1.23 fails: Invalid argument
> 
> Relax the check in setsockopt to allow setting mc_index to an L3 slave if
> sk_bound_dev_if points to an L3 master.
> 
> Make a similar change for IPv6. In this case change the device lookup to
> take the rcu_read_lock avoiding a refcnt. The rcu lock is also needed for
> the lookup of a potential L3 master device.
> 
> This really only silences a setsockopt failure since uses of mc_index are
> secondary to sk_bound_dev_if if it is set. In both cases, if either index
> is an L3 slave or master, lookups are directed to the same FIB table so
> relaxing the check at setsockopt time causes no harm.
> 
> Patch is based on a suggested change by Darwin for a problem noted in
> their code base.
> 
> Suggested-by: Darwin Dingel <darwin.dingel@alliedtelesis.co.nz>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>

This seems ok, applied, thanks.

^ permalink raw reply

* Re: [PATCH v4 1/2] net: 3com: typhoon: typhoon_init_one: fix incorrect return values
From: David Miller @ 2016-12-30 20:27 UTC (permalink / raw)
  To: thomas.preisner+linux
  Cc: dave, netdev, linux-kernel, linux-kernel, milan.stephan+linux,
	sergei.shtylyov
In-Reply-To: <1483065474-11512-2-git-send-email-thomas.preisner+linux@fau.de>

From: Thomas Preisner <thomas.preisner+linux@fau.de>
Date: Fri, 30 Dec 2016 03:37:53 +0100

> In a few cases the err-variable is not set to a negative error code if a
> function call in typhoon_init_one() fails and thus 0 is returned
> instead.
> It may be better to set err to the appropriate negative error
> code before returning.
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188841
> 
> Reported-by: Pan Bian <bianpan2016@163.com>
> Signed-off-by: Thomas Preisner <thomas.preisner+linux@fau.de>
> Signed-off-by: Milan Stephan <milan.stephan+linux@fau.de>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH v4 2/2] net: 3com: typhoon: typhoon_init_one: make return values more specific
From: David Miller @ 2016-12-30 20:27 UTC (permalink / raw)
  To: thomas.preisner+linux
  Cc: dave, netdev, linux-kernel, linux-kernel, milan.stephan+linux,
	sergei.shtylyov
In-Reply-To: <1483065474-11512-3-git-send-email-thomas.preisner+linux@fau.de>

From: Thomas Preisner <thomas.preisner+linux@fau.de>
Date: Fri, 30 Dec 2016 03:37:54 +0100

> In some cases the return value of a failing function is not being used
> and the function typhoon_init_one() returns another negative error code
> instead.
> 
> Signed-off-by: Thomas Preisner <thomas.preisner+linux@fau.de>
> Signed-off-by: Milan Stephan <milan.stephan+linux@fau.de>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH net-next] af_packet: Provide a TPACKET_V2 compatible Tx path for TPACKET_V3
From: Willem de Bruijn @ 2016-12-30 21:33 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, Daniel Borkmann, Willem de Bruijn,
	David Miller
In-Reply-To: <1483116853-45769-1-git-send-email-sowmini.varadhan@oracle.com>

On Fri, Dec 30, 2016 at 11:54 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> Although TPACKET_V3 Rx has some benefits over TPACKET_V2 Rx,
> *_v3 does not currently have Tx support. As a result an application
> that wants the best perf for Tx and Rx (e.g. to handle
> request/response transacations) ends up needing 2 sockets,
> one with *_v2 for Tx and another with *_v3 for Rx.
>
> This patch enables TPACKET_V2 compatible Tx features in TPACKET_V3
> so that _v3 supports at least as many features as _v2.

Once we define the interface as equivalent to v2, we cannot redefine it to
support v3-only features later.

> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  Documentation/networking/packet_mmap.txt |    6 ++++--
>  net/packet/af_packet.c                   |   23 +++++++++++++++--------
>  2 files changed, 19 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
> index daa015a..6425062 100644
> --- a/Documentation/networking/packet_mmap.txt
> +++ b/Documentation/networking/packet_mmap.txt
> @@ -565,7 +565,7 @@ where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
>                    (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
>
>  TPACKET_V2 --> TPACKET_V3:
> -       - Flexible buffer implementation:
> +       - Flexible buffer implementation for RX_RING:
>                 1. Blocks can be configured with non-static frame-size

This is one of the main advantages of the v3 interface, and also
relevant to Tx. The current implementation does not consult
tpacket3_hdr->tp_next_offset and would preclude adding that
later.

^ permalink raw reply

* Re: [PATCH net-next] af_packet: Provide a TPACKET_V2 compatible Tx path for TPACKET_V3
From: Sowmini Varadhan @ 2016-12-30 23:03 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, Daniel Borkmann, Willem de Bruijn,
	David Miller
In-Reply-To: <CAF=yD-+CMjr_9s2wVZOQEPAP5h9jpnMe3vckYGbQ9hnQoGWFEw@mail.gmail.com>

On (12/30/16 16:33), Willem de Bruijn wrote:
> 
> Once we define the interface as equivalent to v2, we cannot redefine it to
> support v3-only features later.

What v3 only features do we think we want to support?

Tpacket_v3 went in 

  commit f6fb8f100b807378fda19e83e5ac6828b638603a
  :
  Date:   Fri Aug 19 10:18:16 2011 +0000

since then apps that want to use the Rx benefits
have to deal with this dual socket feature, where
with "one socket for super-fast rx, zero Tx".
The zero-tx part sounds like a regression to me.

If we want to have something that does block Tx,
and we cannot figure out how to retro-fit it to 
the exisiting APIs, we can always go for TPACKET_V4. 


> >  TPACKET_V2 --> TPACKET_V3:
> > -       - Flexible buffer implementation:
> > +       - Flexible buffer implementation for RX_RING:
> >                 1. Blocks can be configured with non-static frame-size
> 
> This is one of the main advantages of the v3 interface, and also

sure, and we see some marginal benefits for this, when
we try to use it for our apps. But the "marginal" part
is not worth it, if I have to use separate sockets for tx and rx.

> relevant to Tx. The current implementation does not consult
> tpacket3_hdr->tp_next_offset and would preclude adding that
> later.

When is "later"? its been 6+ years.

--Sowmini

^ permalink raw reply

* Bug w/ (policy) routing
From: Olivier Brunel @ 2016-12-30 23:00 UTC (permalink / raw)
  To: netdev

Hi,

(Please cc me as I'm not subscribed to the list, thanks.)

I'm trying to set things up using some policy routing, and having some
weird issues I can't really explain. It looks to me like there might be
a bug somewhere...

This is done under Arch Linux 64bits, iproute2 4.9.0 (`ip -V` says ip
utility, iproute2-ss161212), kernel 4.8.13

Basically here's what I could reduce it to:
- create a new network namespace, create a pair of veth devices: one in
there, one sent back to the original namespace
- I'm giving them IPs 10.4.0.1 (original namespace) & 10.4.0.2 (new
namespace)
- in that new namespace, I'm trying to add a route to 10.4.0.1, but
  inside a new table. I also want a default route via 10.4.0.1 on the
  table main. It seems to work, only not really...

It's not very easy to describe so hopefully this will make things
clearer:

$ sudo unshare -n sh
sh-4.4# ip rule add table 50 prio 50
sh-4.4# ip link add test type veth peer name test2
sh-4.4# ip addr add 10.4.0.2 dev test
sh-4.4# ip link set dev test up
sh-4.4# ip link set netns 1 dev test2
# back in original namespace, we add 10.4.0.1 to test2 and bring it up

sh-4.4# ip route add 10.4.0.1 dev test table 50
sh-4.4# ip route add default via 10.4.0.1 dev test
sh-4.4# ip route flush cache
sh-4.4# ip rule
0:	from all lookup local 
50:	from all lookup 50 
32766:	from all lookup main 
32767:	from all lookup default 
sh-4.4# ip route show table 50
10.4.0.1 dev test scope link 
sh-4.4# ip route get 10.4.0.1
10.4.0.1 via 10.4.0.1 dev test table local src 10.4.0.2 
    cache 
# !?? why isn't table 50 used as, I believe, it should. And why
does adding a rule "fixes" it :

sh-4.4# ip rule add prio 55555
sh-4.4# ip route get 10.4.0.1
10.4.0.1 dev test table 50 src 10.4.0.2 
    cache 
# deleting the new rule makes no difference. It could even have been
done right after adding it. It's like it just triggered something
(reload, cache flushed, ...)
As seen I did flush cached routes as mentionned in the man page, I don't
know of anything else that would need done to "refresh" things?

Any ideas as to why this is happening? Should this work as I expect it,
or is there anything I'm doing wrong?

Thanks,

^ permalink raw reply

* Re: [PATCH net-next] af_packet: Provide a TPACKET_V2 compatible Tx path for TPACKET_V3
From: Willem de Bruijn @ 2016-12-30 23:39 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, Daniel Borkmann, Willem de Bruijn,
	David Miller
In-Reply-To: <20161230230316.GB31800@oracle.com>

>> Once we define the interface as equivalent to v2, we cannot redefine it to
>> support v3-only features later.
>
> What v3 only features do we think we want to support?

Variable length slots seems like the only one from that list that
makes sense on Tx.

It is already possible to prepare multiple buffers before triggering
transmit, so the block-based signal moderation is not very relevant.

> since then apps that want to use the Rx benefits
> have to deal with this dual socket feature, where
> with "one socket for super-fast rx, zero Tx".
> The zero-tx part sounds like a regression to me.

What is the issue with using separate sockets that you are
having? I generally end up using that even with V2.

> If we want to have something that does block Tx,
> and we cannot figure out how to retro-fit it to
> the exisiting APIs, we can always go for TPACKET_V4.

But the semantics for V3 are currenty well defined. Calling something
V3, but using V2 semantics is a somewhat unintuitive interface to me.

I don't see a benefit in defining something that does not implement
any new features. Especially if it blocks adding the expected
semantics later.

That said, if there is a workload that benefits from using a
single socket, and especially if it can be forward compatible with
support for variable sized slots, then I do not object. I was just
having a look at that second point, actually.

Could you also extend the TX_RING test in
tools/testing/selftests/net/psock_tpacket.c if there are no other
blocking issues?

^ permalink raw reply

* [PATCH] net: socket: don't set sk_uid to garbage value in ->setattr()
From: Eric Biggers @ 2016-12-30 23:42 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Lorenzo Colitti, linux-fsdevel, linux-kernel,
	Eric Biggers

From: Eric Biggers <ebiggers@google.com>

->setattr() was recently implemented for socket files to sync the socket
inode's uid to the new 'sk_uid' member of struct sock.  It does this by
copying over the ia_uid member of struct iattr.  However, ia_uid is
actually only valid when ATTR_UID is set in ia_valid, indicating that
the uid is being changed, e.g. by chown.  Other metadata operations such
as chmod or utimes leave ia_uid uninitialized.  Therefore, sk_uid could
be set to a "garbage" value from the stack.

Fix this by only copying the uid over when ATTR_UID is set.

Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 net/socket.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/socket.c b/net/socket.c
index 8487bf136e5c..a8c2307590b8 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -537,7 +537,7 @@ int sockfs_setattr(struct dentry *dentry, struct iattr *iattr)
 {
 	int err = simple_setattr(dentry, iattr);
 
-	if (!err) {
+	if (!err && (iattr->ia_valid & ATTR_UID)) {
 		struct socket *sock = SOCKET_I(d_inode(dentry));
 
 		sock->sk->sk_uid = iattr->ia_uid;
-- 
2.11.0


^ permalink raw reply related

* Re: [PATCH net-next rfc 0/6] convert tc_verd to integer bitfields
From: Willem de Bruijn @ 2016-12-31  0:02 UTC (permalink / raw)
  To: David Miller
  Cc: Network Development, Florian Westphal, dborkman, Jamal Hadi Salim,
	Alexei Starovoitov, Willem de Bruijn
In-Reply-To: <20161229.222113.146921294105321953.davem@davemloft.net>

On Thu, Dec 29, 2016 at 10:21 PM, David Miller <davem@davemloft.net> wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Wed, 28 Dec 2016 14:13:24 -0500
>
>> The skb tc_verd field takes up two bytes but uses far fewer bits.
>> Convert the remaining use cases to bitfields that fit in existing
>> holes (depending on config options) and potentially save the two
>> bytes in struct sk_buff.
>
> I like the looks of this, for sure :-)

Great, thanks. I sent it as RFC initially, because there were some unresolved
issues with TC_FROM semantics last time around. bpf_redirect has been
merged in the meantime, so I suspect that that has been resolved.

But there is no rush. I will run a few more tests, such as the new mirred
ingress path, and will hold off resubmitting until people now on holiday
have a chance to chime in next week.

^ permalink raw reply

* Re: [PATCH net-next] af_packet: Provide a TPACKET_V2 compatible Tx path for TPACKET_V3
From: Sowmini Varadhan @ 2016-12-31  0:48 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, Daniel Borkmann, Willem de Bruijn,
	David Miller
In-Reply-To: <CAF=yD-JiRDnYPB=W37T5tF7rVVGD5bACK6YkSWwuGdF2PdOdBQ@mail.gmail.com>

On (12/30/16 18:39), Willem de Bruijn wrote:
> 
> Variable length slots seems like the only one from that list that
> makes sense on Tx.
> 
> It is already possible to prepare multiple buffers before triggering
> transmit, so the block-based signal moderation is not very relevant.

FWIW, here is our experience

In our use cases, the blocking on the RX side comes quite naturally
to the application (since, upon waking from select(), we try 
to read as many requests as possible, until we run out of buffers
and/or input), but the response side is not batched today: the server
application sends out one response at a time, and trying to change
this would need additional batching-intelligence in the server. We
are working on the latter part, but as you point out, we can  prepare
multiple buffers before triggering transmit, so some variant of block
TX seems achievable.

Our response messages are usually well-defined multiples of PAGE_SIZE,
(and we are able to set Jumbo MTU) so the variable length slots is not
an issue we foresee (see additional comment on this below).

The block RX is interesting because it allows the server better control
over context-switches and system-calls. This is important because our
input request stream tends to be bursty - the senders (clients) of the
request have to do some computationally intense work before sending the
next request, so being able to adjust the timeout for poll wakeup at
the server is a useful knob.

Having 2 sockets instead of one is unattractive because it just makes
the existing API more clumsy - today  we are using UDP, RDS-TCP and
RDS-IB sockets, and all of this is built around a POSIX-like paradigm
of having some type of select(), sendmsg(), recvmsg() API with a single
socket. Even just extending this to also handle TPACKET_V2 (and tracking
needed context) is messy. Having to convert all this to a 2-socket model
would need significant perf justification, and we havent seen that
justification in our micro-benchmarks yet.

(and fwiw, the POSIX-like API with a single file desc for all I/O
is a major consideration, since the I/O can come from other sources 
like disk, fs etc, and it's cleanest if we follow the same paradigm
for networking as well)

> > since then apps that want to use the Rx benefits
> > have to deal with this dual socket feature, where
> > with "one socket for super-fast rx, zero Tx".
> > The zero-tx part sounds like a regression to me.
> 
> What is the issue with using separate sockets that you are
> having? I generally end up using that even with V2.

Why do you end up having to use 2 sockets with V2? That part
worked out quite nicely for my case (for a simple netserver like 
req/resp handler).

> But the semantics for V3 are currenty well defined. Calling something
> V3, but using V2 semantics is a somewhat unintuitive interface to me.

One fundamental part of tpacket that makes it attractive to 
alternatives like netmap, dpdk etc is that the API follows the
semantics of the classic  unix socket and fd APIs: support for basic
select/sendmsg/recvmsg that work for everything until _V3. 

> I don't see a benefit in defining something that does not implement
> any new features. Especially if it blocks adding the expected
> semantics later.

V3 removed the sendmsg feature.  This patch puts back that feature.

> That said, if there is a workload that benefits from using a
> single socket, and especially if it can be forward compatible with
> support for variable sized slots, then I do not object. I was just
> having a look at that second point, actually.

Actually I'm not averse to looking at extensions (or at least,
place-holders) to allow variable sized slots- do you have any
suggestions? As I mentioned before, the use-cases that I see
do not need variable length slots, thus I have not thought
too deeply about it. But if we think this may be needed in the 
future can't it be accomodated by additional sockopts (or even 
per-packet cmsghdr?) on top of V3? 

> Could you also extend the TX_RING test in
> tools/testing/selftests/net/psock_tpacket.c if there are no other
> blocking issues?

sure, I can do that. Let me do this for patchv2. 

--Sowmini

^ permalink raw reply

* [PATCH] Ipvlan should return an error when an address is already in use.
From: Krister Johansen @ 2016-12-31  4:10 UTC (permalink / raw)
  To: David S. Miller; +Cc: Mahesh Bandewar, netdev

The ipvlan code already knows how to detect when a duplicate address is
about to be assigned to an ipvlan device.  However, that failure is not
propogated outward and leads to a silent failure.  This teaches the ip
address addition functions how to report this error to the user
applications so that a notifier chain failure during ip address addition
will not appear to succeed when it actually has not.

This can be especially useful if it is necessary to provision many
ipvlans in containers.  The provisioning software (or operator) can use
this to detect situations where an ip address is unexpectedly in use.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
 drivers/net/ipvlan/ipvlan_main.c | 12 ++++++++----
 net/ipv4/devinet.c               |  8 +++++++-
 net/ipv6/addrconf.c              | 23 ++++++++++++++++++-----
 3 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 975f9dd..3bd2506 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -739,6 +739,7 @@ static int ipvlan_addr6_event(struct notifier_block *unused,
 	struct inet6_ifaddr *if6 = (struct inet6_ifaddr *)ptr;
 	struct net_device *dev = (struct net_device *)if6->idev->dev;
 	struct ipvl_dev *ipvlan = netdev_priv(dev);
+	int err;
 
 	/* FIXME IPv6 autoconf calls us from bh without RTNL */
 	if (in_softirq())
@@ -752,8 +753,9 @@ static int ipvlan_addr6_event(struct notifier_block *unused,
 
 	switch (event) {
 	case NETDEV_UP:
-		if (ipvlan_add_addr6(ipvlan, &if6->addr))
-			return NOTIFY_BAD;
+		err = ipvlan_add_addr6(ipvlan, &if6->addr);
+		if (err)
+			return notifier_from_errno(err);
 		break;
 
 	case NETDEV_DOWN:
@@ -788,6 +790,7 @@ static int ipvlan_addr4_event(struct notifier_block *unused,
 	struct net_device *dev = (struct net_device *)if4->ifa_dev->dev;
 	struct ipvl_dev *ipvlan = netdev_priv(dev);
 	struct in_addr ip4_addr;
+	int err;
 
 	if (!netif_is_ipvlan(dev))
 		return NOTIFY_DONE;
@@ -798,8 +801,9 @@ static int ipvlan_addr4_event(struct notifier_block *unused,
 	switch (event) {
 	case NETDEV_UP:
 		ip4_addr.s_addr = if4->ifa_address;
-		if (ipvlan_add_addr4(ipvlan, &ip4_addr))
-			return NOTIFY_BAD;
+		err = ipvlan_add_addr4(ipvlan, &ip4_addr);
+		if (err)
+			return notifier_from_errno(err);
 		break;
 
 	case NETDEV_DOWN:
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 4cd2ee8..5bc26f8e 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -442,6 +442,7 @@ static int __inet_insert_ifa(struct in_ifaddr *ifa, struct nlmsghdr *nlh,
 {
 	struct in_device *in_dev = ifa->ifa_dev;
 	struct in_ifaddr *ifa1, **ifap, **last_primary;
+	int ret;
 
 	ASSERT_RTNL();
 
@@ -489,7 +490,12 @@ static int __inet_insert_ifa(struct in_ifaddr *ifa, struct nlmsghdr *nlh,
 	   Notifier will trigger FIB update, so that
 	   listeners of netlink will know about new ifaddr */
 	rtmsg_ifa(RTM_NEWADDR, ifa, nlh, portid);
-	blocking_notifier_call_chain(&inetaddr_chain, NETDEV_UP, ifa);
+	ret = blocking_notifier_call_chain(&inetaddr_chain, NETDEV_UP, ifa);
+	ret = notifier_to_errno(ret);
+	if (ret) {
+		__inet_del_ifa(in_dev, ifap, 1, NULL, portid);
+		return ret;
+	}
 
 	return 0;
 }
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c1e124b..24d89f3 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -149,6 +149,7 @@ static inline void addrconf_sysctl_unregister(struct inet6_dev *idev)
 
 static void ipv6_regen_rndid(struct inet6_dev *idev);
 static void ipv6_try_regen_rndid(struct inet6_dev *idev, struct in6_addr *tmpaddr);
+static void __ipv6_del_addr(struct inet6_ifaddr *ifp, bool notify);
 
 static int ipv6_generate_eui64(u8 *eui, struct net_device *dev);
 static int ipv6_count_addresses(struct inet6_dev *idev);
@@ -1031,9 +1032,15 @@ ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr,
 out2:
 	rcu_read_unlock_bh();
 
-	if (likely(err == 0))
-		inet6addr_notifier_call_chain(NETDEV_UP, ifa);
-	else {
+	if (likely(err == 0)) {
+		err = inet6addr_notifier_call_chain(NETDEV_UP, ifa);
+		err = notifier_to_errno(err);
+		if (err) {
+			__ipv6_del_addr(ifa, false);
+			ifa = ERR_PTR(err);
+			return ifa;
+		}
+	} else {
 		kfree(ifa);
 		ifa = ERR_PTR(err);
 	}
@@ -1128,7 +1135,7 @@ cleanup_prefix_route(struct inet6_ifaddr *ifp, unsigned long expires, bool del_r
 
 /* This function wants to get referenced ifp and releases it before return */
 
-static void ipv6_del_addr(struct inet6_ifaddr *ifp)
+static void __ipv6_del_addr(struct inet6_ifaddr *ifp, bool notify)
 {
 	int state;
 	enum cleanup_prefix_rt_t action = CLEANUP_PREFIX_RT_NOP;
@@ -1169,7 +1176,8 @@ static void ipv6_del_addr(struct inet6_ifaddr *ifp)
 
 	addrconf_del_dad_work(ifp);
 
-	ipv6_ifa_notify(RTM_DELADDR, ifp);
+	if (notify)
+		ipv6_ifa_notify(RTM_DELADDR, ifp);
 
 	inet6addr_notifier_call_chain(NETDEV_DOWN, ifp);
 
@@ -1184,6 +1192,11 @@ static void ipv6_del_addr(struct inet6_ifaddr *ifp)
 	in6_ifa_put(ifp);
 }
 
+static void ipv6_del_addr(struct inet6_ifaddr *ifp)
+{
+	__ipv6_del_addr(ifp, true);
+}
+
 static int ipv6_create_tempaddr(struct inet6_ifaddr *ifp, struct inet6_ifaddr *ift)
 {
 	struct inet6_dev *idev = ifp->idev;
-- 
2.7.4

^ permalink raw reply related

* [PATCH] Introduce a sysctl that modifies the value of PROT_SOCK.
From: Krister Johansen @ 2016-12-31  4:11 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

Add net.ipv4.ip_unprotected_port_start, which is a per namespace sysctl
that denotes the first unprotected inet port in the namespace.  To
disable all protected ports set this to zero.  It also checks for
overlap with the local port range.  The protected and local range may
not overlap.

The use case for this change is to allow containerized processes to bind
to priviliged ports, but prevent them from ever being allowed to modify
their container's network configuration.  The latter is accomplished by
ensuring that the network namespace is not a child of the user
namespace.  This modification was needed to allow the container manager
to disable a namespace's priviliged port restrictions without exposing
control of the network namespace to processes in the user namespace.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
 include/net/ip.h               | 12 +++++++++
 include/net/netns/ipv4.h       |  3 +++
 net/Kconfig                    |  7 +++++
 net/ipv4/af_inet.c             |  5 +++-
 net/ipv4/sysctl_net_ipv4.c     | 59 ++++++++++++++++++++++++++++++++++++++++++
 net/ipv6/af_inet6.c            |  3 ++-
 net/netfilter/ipvs/ip_vs_ctl.c |  7 +++--
 net/sctp/socket.c              | 10 ++++---
 security/selinux/hooks.c       |  3 ++-
 9 files changed, 98 insertions(+), 11 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index ab6761a..65f1375 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -270,6 +270,18 @@ static inline int inet_is_local_reserved_port(struct net *net, int port)
 }
 #endif
 
+#ifdef CONFIG_LOWPORT_SYSCTL
+static inline int inet_prot_sock(struct net *net)
+{
+	return net->ipv4.sysctl_ip_prot_sock;
+}
+#else
+static inline int inet_prot_sock(struct net *net)
+{
+	return PROT_SOCK;
+}
+#endif
+
 __be32 inet_current_timestamp(void);
 
 /* From inetpeer.c */
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8e3f5b6..f1126686 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -96,6 +96,9 @@ struct netns_ipv4 {
 	/* Shall we try to damage output packets if routing dev changes? */
 	int sysctl_ip_dynaddr;
 	int sysctl_ip_early_demux;
+#ifdef CONFIG_LOWPORT_SYSCTL
+	int sysctl_ip_prot_sock;
+#endif
 
 	int sysctl_fwmark_reflect;
 	int sysctl_tcp_fwmark_accept;
diff --git a/net/Kconfig b/net/Kconfig
index a100500..946999c 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -109,6 +109,13 @@ config NETWORK_PHY_TIMESTAMPING
 
 	  If you are unsure how to answer this question, answer N.
 
+config LOWPORT_SYSCTL
+	bool "Adjust reserved port range via sysctl"
+	depends on SYSCTL
+	help
+	  This allows the administrator to adjust the reserved port range
+	  using a sysctl.
+
 menuconfig NETFILTER
 	bool "Network packet filtering framework (Netfilter)"
 	---help---
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index aae410b..c44ea78 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -479,7 +479,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 
 	snum = ntohs(addr->sin_port);
 	err = -EACCES;
-	if (snum && snum < PROT_SOCK &&
+	if (snum && snum < inet_prot_sock(net) &&
 	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		goto out;
 
@@ -1700,6 +1700,9 @@ static __net_init int inet_init_net(struct net *net)
 	net->ipv4.sysctl_ip_default_ttl = IPDEFTTL;
 	net->ipv4.sysctl_ip_dynaddr = 0;
 	net->ipv4.sysctl_ip_early_demux = 1;
+#ifdef CONFIG_LOWPORT_SYSCTL
+	net->ipv4.sysctl_ip_prot_sock = PROT_SOCK;
+#endif
 
 	return 0;
 }
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 134d8e1..2167e3f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -35,6 +35,10 @@ static int ip_local_port_range_min[] = { 1, 1 };
 static int ip_local_port_range_max[] = { 65535, 65535 };
 static int tcp_adv_win_scale_min = -31;
 static int tcp_adv_win_scale_max = 31;
+#ifdef CONFIG_LOWPORT_SYSCTL
+static int ip_protected_port_min;
+static int ip_protected_port_max = 65535;
+#endif
 static int ip_ttl_min = 1;
 static int ip_ttl_max = 255;
 static int tcp_syn_retries_min = 1;
@@ -79,7 +83,17 @@ static int ipv4_local_port_range(struct ctl_table *table, int write,
 	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
 
 	if (write && ret == 0) {
+		/* Ensure that the upper limit is not smaller than the lower,
+		 * and that the lower does not encroach upon the protected
+		 * port limit.
+		 */
+#ifdef CONFIG_LOWPORT_SYSCTL
+		if ((range[1] < range[0]) ||
+		    (range[0] < net->ipv4.sysctl_ip_prot_sock))
+#else
+
 		if (range[1] < range[0])
+#endif
 			ret = -EINVAL;
 		else
 			set_local_port_range(net, range);
@@ -88,6 +102,42 @@ static int ipv4_local_port_range(struct ctl_table *table, int write,
 	return ret;
 }
 
+#ifdef CONFIG_LOWPORT_SYSCTL
+/* Validate changes from /proc interface. */
+static int ipv4_protected_ports(struct ctl_table *table, int write,
+				void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct net *net = container_of(table->data, struct net,
+	    ipv4.sysctl_ip_prot_sock);
+	int ret;
+	int pports;
+	int range[2];
+	struct ctl_table tmp = {
+		.data = &pports,
+		.maxlen = sizeof(pports),
+		.mode = table->mode,
+		.extra1 = &ip_protected_port_min,
+		.extra2 = &ip_protected_port_max,
+	};
+
+	pports = net->ipv4.sysctl_ip_prot_sock;
+
+	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
+
+	if (write && ret == 0) {
+		inet_get_local_port_range(net, &range[0], &range[1]);
+		/* Ensure that the local port range doesn't overlap with the
+		 * protected port range.
+		 */
+		if (range[0] < pports)
+			ret = -EINVAL;
+		else
+			net->ipv4.sysctl_ip_prot_sock = pports;
+	}
+
+	return ret;
+}
+#endif
 
 static void inet_get_ping_group_range_table(struct ctl_table *table, kgid_t *low, kgid_t *high)
 {
@@ -971,6 +1021,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_LOWPORT_SYSCTL
+	{
+		.procname	= "ip_unprotected_port_start",
+		.maxlen		= sizeof(int),
+		.data		= &init_net.ipv4.sysctl_ip_prot_sock,
+		.mode		= 0644,
+		.proc_handler	= ipv4_protected_ports,
+	},
+#endif
 	{ }
 };
 
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index aa42123..04db406 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -302,7 +302,8 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		return -EINVAL;
 
 	snum = ntohs(addr->sin6_port);
-	if (snum && snum < PROT_SOCK && !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
+	if (snum && snum < inet_prot_sock(net) &&
+	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
 
 	lock_sock(sk);
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 55e0169..8b7416f 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -426,10 +426,9 @@ ip_vs_service_find(struct netns_ipvs *ipvs, int af, __u32 fwmark, __u16 protocol
 	 */
 	svc = __ip_vs_service_find(ipvs, af, protocol, vaddr, vport);
 
-	if (svc == NULL
-	    && protocol == IPPROTO_TCP
-	    && atomic_read(&ipvs->ftpsvc_counter)
-	    && (vport == FTPDATA || ntohs(vport) >= PROT_SOCK)) {
+	if (!svc && protocol == IPPROTO_TCP &&
+	    atomic_read(&ipvs->ftpsvc_counter) &&
+	    (vport == FTPDATA || ntohs(vport) >= inet_prot_sock(ipvs->net))) {
 		/*
 		 * Check if ftp service entry exists, the packet
 		 * might belong to FTP data connections.
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 318c678..2723f4a 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -360,7 +360,7 @@ static int sctp_do_bind(struct sock *sk, union sctp_addr *addr, int len)
 		}
 	}
 
-	if (snum && snum < PROT_SOCK &&
+	if (snum && snum < inet_prot_sock(net) &&
 	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
 
@@ -1152,8 +1152,10 @@ static int __sctp_connect(struct sock *sk,
 				 * accept new associations, but it SHOULD NOT
 				 * be permitted to open new associations.
 				 */
-				if (ep->base.bind_addr.port < PROT_SOCK &&
-				    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE)) {
+				if (ep->base.bind_addr.port <
+				    inet_prot_sock(net) &&
+				    !ns_capable(net->user_ns,
+				    CAP_NET_BIND_SERVICE)) {
 					err = -EACCES;
 					goto out_free;
 				}
@@ -1818,7 +1820,7 @@ static int sctp_sendmsg(struct sock *sk, struct msghdr *msg, size_t msg_len)
 			 * but it SHOULD NOT be permitted to open new
 			 * associations.
 			 */
-			if (ep->base.bind_addr.port < PROT_SOCK &&
+			if (ep->base.bind_addr.port < inet_prot_sock(net) &&
 			    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE)) {
 				err = -EACCES;
 				goto out_unlock;
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index c7c6619..53cb6da 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4365,7 +4365,8 @@ static int selinux_socket_bind(struct socket *sock, struct sockaddr *address, in
 
 			inet_get_local_port_range(sock_net(sk), &low, &high);
 
-			if (snum < max(PROT_SOCK, low) || snum > high) {
+			if (snum < max(inet_prot_sock(sock_net(sk)), low) ||
+			    snum > high) {
 				err = sel_netport_sid(sk->sk_protocol,
 						      snum, &sid);
 				if (err)
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net-next] af_packet: Provide a TPACKET_V2 compatible Tx path for TPACKET_V3
From: Willem de Bruijn @ 2016-12-31  4:59 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, Daniel Borkmann, Willem de Bruijn,
	David Miller
In-Reply-To: <20161231004826.GC31800@oracle.com>

>> What is the issue with using separate sockets that you are
>> having? I generally end up using that even with V2.
>
> Why do you end up having to use 2 sockets with V2? That part
> worked out quite nicely for my case (for a simple netserver like
> req/resp handler).

Yes, that should work fine, actually. I was thinking of multi-threaded
setups where using per-cpu tx sockets avoids cacheline bouncing, but
there is no reason to have a receive socket on each cpu. But that's
not a standard setup.

>> But the semantics for V3 are currenty well defined. Calling something
>> V3, but using V2 semantics is a somewhat unintuitive interface to me.
>
> One fundamental part of tpacket that makes it attractive to
> alternatives like netmap, dpdk etc is that the API follows the
> semantics of the classic  unix socket and fd APIs: support for basic
> select/sendmsg/recvmsg that work for everything until _V3.
>
>> I don't see a benefit in defining something that does not implement
>> any new features. Especially if it blocks adding the expected
>> semantics later.
>
> V3 removed the sendmsg feature.  This patch puts back that feature.

You mean send ring, right? The sendmsg call works fine on a single socket
alongside an RX_RING until a TX_RING is installed.

>> That said, if there is a workload that benefits from using a
>> single socket, and especially if it can be forward compatible with
>> support for variable sized slots, then I do not object. I was just
>> having a look at that second point, actually.
>
> Actually I'm not averse to looking at extensions (or at least,
> place-holders) to allow variable sized slots- do you have any
> suggestions?

It is probably enough to just enforce that tp_next_offset is always
sane. Either that it points to the start of the next packet, even
though that currently can also be inferred from frame_size. Or, that
it is always zero unless the ring is to be interpreted as holding
variable sized frames. If the field is non-zero, drop the packet.

> As I mentioned before, the use-cases that I see
> do not need variable length slots, thus I have not thought
> too deeply about it. But if we think this may be needed in the
> future can't it be accomodated by additional sockopts (or even
> per-packet cmsghdr?) on top of V3?
>
>> Could you also extend the TX_RING test in
>> tools/testing/selftests/net/psock_tpacket.c if there are no other
>> blocking issues?
>
> sure, I can do that. Let me do this for patchv2.

Great. Thanks!

^ permalink raw reply

* ping: What's the purpose of rate limit?
From: Guan Xin @ 2016-12-31 12:04 UTC (permalink / raw)
  To: netdev

Hello,

Excuse me, I searched but didn't find an answer --

What's the purpose of setting a limit to the "-i" and "-l" parameters
of ping for non-root users?

It seems that this is only intended to prevent accidental misuse
because these restrictions can be easily worked around by starting
multiple instances or wrapping the program in a loop, e.g.,
while true; do ping -l3 -c3 192.168.0.1; done

(Please Cc to me because I'm not subscribed to this list. Thanks!)

Regards,
Guan

^ permalink raw reply

* Re: [PATCH net-next] af_packet: Provide a TPACKET_V2 compatible Tx path for TPACKET_V3
From: Sowmini Varadhan @ 2016-12-31 12:21 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, Daniel Borkmann, Willem de Bruijn,
	David Miller
In-Reply-To: <CAF=yD-Jgui0gBeZuzfJRN1uozV+LqVm76tnKUnChSg7_z9-1ng@mail.gmail.com>

On (12/30/16 23:59), Willem de Bruijn wrote:
> > Actually I'm not averse to looking at extensions (or at least,
> > place-holders) to allow variable sized slots- do you have any
> > suggestions?
> 
> It is probably enough to just enforce that tp_next_offset is always
> sane. Either that it points to the start of the next packet, even
> though that currently can also be inferred from frame_size. Or, that
> it is always zero unless the ring is to be interpreted as holding
> variable sized frames. If the field is non-zero, drop the packet.

Ok, let me add this to patchv2, along with extra test cases.

--Sowmini

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox