Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v4 1/2] net: 3com: typhoon: typhoon_init_one: fix incorrect return values
From: David Miller @ 2016-12-30 20:27 UTC (permalink / raw)
  To: thomas.preisner+linux
  Cc: dave, netdev, linux-kernel, linux-kernel, milan.stephan+linux,
	sergei.shtylyov
In-Reply-To: <1483065474-11512-2-git-send-email-thomas.preisner+linux@fau.de>

From: Thomas Preisner <thomas.preisner+linux@fau.de>
Date: Fri, 30 Dec 2016 03:37:53 +0100

> In a few cases the err-variable is not set to a negative error code if a
> function call in typhoon_init_one() fails and thus 0 is returned
> instead.
> It may be better to set err to the appropriate negative error
> code before returning.
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188841
> 
> Reported-by: Pan Bian <bianpan2016@163.com>
> Signed-off-by: Thomas Preisner <thomas.preisner+linux@fau.de>
> Signed-off-by: Milan Stephan <milan.stephan+linux@fau.de>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH v4 2/2] net: 3com: typhoon: typhoon_init_one: make return values more specific
From: David Miller @ 2016-12-30 20:27 UTC (permalink / raw)
  To: thomas.preisner+linux
  Cc: dave, netdev, linux-kernel, linux-kernel, milan.stephan+linux,
	sergei.shtylyov
In-Reply-To: <1483065474-11512-3-git-send-email-thomas.preisner+linux@fau.de>

From: Thomas Preisner <thomas.preisner+linux@fau.de>
Date: Fri, 30 Dec 2016 03:37:54 +0100

> In some cases the return value of a failing function is not being used
> and the function typhoon_init_one() returns another negative error code
> instead.
> 
> Signed-off-by: Thomas Preisner <thomas.preisner+linux@fau.de>
> Signed-off-by: Milan Stephan <milan.stephan+linux@fau.de>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH net-next] af_packet: Provide a TPACKET_V2 compatible Tx path for TPACKET_V3
From: Willem de Bruijn @ 2016-12-30 21:33 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, Daniel Borkmann, Willem de Bruijn,
	David Miller
In-Reply-To: <1483116853-45769-1-git-send-email-sowmini.varadhan@oracle.com>

On Fri, Dec 30, 2016 at 11:54 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> Although TPACKET_V3 Rx has some benefits over TPACKET_V2 Rx,
> *_v3 does not currently have Tx support. As a result an application
> that wants the best perf for Tx and Rx (e.g. to handle
> request/response transacations) ends up needing 2 sockets,
> one with *_v2 for Tx and another with *_v3 for Rx.
>
> This patch enables TPACKET_V2 compatible Tx features in TPACKET_V3
> so that _v3 supports at least as many features as _v2.

Once we define the interface as equivalent to v2, we cannot redefine it to
support v3-only features later.

> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  Documentation/networking/packet_mmap.txt |    6 ++++--
>  net/packet/af_packet.c                   |   23 +++++++++++++++--------
>  2 files changed, 19 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
> index daa015a..6425062 100644
> --- a/Documentation/networking/packet_mmap.txt
> +++ b/Documentation/networking/packet_mmap.txt
> @@ -565,7 +565,7 @@ where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
>                    (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
>
>  TPACKET_V2 --> TPACKET_V3:
> -       - Flexible buffer implementation:
> +       - Flexible buffer implementation for RX_RING:
>                 1. Blocks can be configured with non-static frame-size

This is one of the main advantages of the v3 interface, and also
relevant to Tx. The current implementation does not consult
tpacket3_hdr->tp_next_offset and would preclude adding that
later.

^ permalink raw reply

* Re: [PATCH net-next] af_packet: Provide a TPACKET_V2 compatible Tx path for TPACKET_V3
From: Sowmini Varadhan @ 2016-12-30 23:03 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, Daniel Borkmann, Willem de Bruijn,
	David Miller
In-Reply-To: <CAF=yD-+CMjr_9s2wVZOQEPAP5h9jpnMe3vckYGbQ9hnQoGWFEw@mail.gmail.com>

On (12/30/16 16:33), Willem de Bruijn wrote:
> 
> Once we define the interface as equivalent to v2, we cannot redefine it to
> support v3-only features later.

What v3 only features do we think we want to support?

Tpacket_v3 went in 

  commit f6fb8f100b807378fda19e83e5ac6828b638603a
  :
  Date:   Fri Aug 19 10:18:16 2011 +0000

since then apps that want to use the Rx benefits
have to deal with this dual socket feature, where
with "one socket for super-fast rx, zero Tx".
The zero-tx part sounds like a regression to me.

If we want to have something that does block Tx,
and we cannot figure out how to retro-fit it to 
the exisiting APIs, we can always go for TPACKET_V4. 


> >  TPACKET_V2 --> TPACKET_V3:
> > -       - Flexible buffer implementation:
> > +       - Flexible buffer implementation for RX_RING:
> >                 1. Blocks can be configured with non-static frame-size
> 
> This is one of the main advantages of the v3 interface, and also

sure, and we see some marginal benefits for this, when
we try to use it for our apps. But the "marginal" part
is not worth it, if I have to use separate sockets for tx and rx.

> relevant to Tx. The current implementation does not consult
> tpacket3_hdr->tp_next_offset and would preclude adding that
> later.

When is "later"? its been 6+ years.

--Sowmini

^ permalink raw reply

* Bug w/ (policy) routing
From: Olivier Brunel @ 2016-12-30 23:00 UTC (permalink / raw)
  To: netdev

Hi,

(Please cc me as I'm not subscribed to the list, thanks.)

I'm trying to set things up using some policy routing, and having some
weird issues I can't really explain. It looks to me like there might be
a bug somewhere...

This is done under Arch Linux 64bits, iproute2 4.9.0 (`ip -V` says ip
utility, iproute2-ss161212), kernel 4.8.13

Basically here's what I could reduce it to:
- create a new network namespace, create a pair of veth devices: one in
there, one sent back to the original namespace
- I'm giving them IPs 10.4.0.1 (original namespace) & 10.4.0.2 (new
namespace)
- in that new namespace, I'm trying to add a route to 10.4.0.1, but
  inside a new table. I also want a default route via 10.4.0.1 on the
  table main. It seems to work, only not really...

It's not very easy to describe so hopefully this will make things
clearer:

$ sudo unshare -n sh
sh-4.4# ip rule add table 50 prio 50
sh-4.4# ip link add test type veth peer name test2
sh-4.4# ip addr add 10.4.0.2 dev test
sh-4.4# ip link set dev test up
sh-4.4# ip link set netns 1 dev test2
# back in original namespace, we add 10.4.0.1 to test2 and bring it up

sh-4.4# ip route add 10.4.0.1 dev test table 50
sh-4.4# ip route add default via 10.4.0.1 dev test
sh-4.4# ip route flush cache
sh-4.4# ip rule
0:	from all lookup local 
50:	from all lookup 50 
32766:	from all lookup main 
32767:	from all lookup default 
sh-4.4# ip route show table 50
10.4.0.1 dev test scope link 
sh-4.4# ip route get 10.4.0.1
10.4.0.1 via 10.4.0.1 dev test table local src 10.4.0.2 
    cache 
# !?? why isn't table 50 used as, I believe, it should. And why
does adding a rule "fixes" it :

sh-4.4# ip rule add prio 55555
sh-4.4# ip route get 10.4.0.1
10.4.0.1 dev test table 50 src 10.4.0.2 
    cache 
# deleting the new rule makes no difference. It could even have been
done right after adding it. It's like it just triggered something
(reload, cache flushed, ...)
As seen I did flush cached routes as mentionned in the man page, I don't
know of anything else that would need done to "refresh" things?

Any ideas as to why this is happening? Should this work as I expect it,
or is there anything I'm doing wrong?

Thanks,

^ permalink raw reply

* Re: [PATCH net-next] af_packet: Provide a TPACKET_V2 compatible Tx path for TPACKET_V3
From: Willem de Bruijn @ 2016-12-30 23:39 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, Daniel Borkmann, Willem de Bruijn,
	David Miller
In-Reply-To: <20161230230316.GB31800@oracle.com>

>> Once we define the interface as equivalent to v2, we cannot redefine it to
>> support v3-only features later.
>
> What v3 only features do we think we want to support?

Variable length slots seems like the only one from that list that
makes sense on Tx.

It is already possible to prepare multiple buffers before triggering
transmit, so the block-based signal moderation is not very relevant.

> since then apps that want to use the Rx benefits
> have to deal with this dual socket feature, where
> with "one socket for super-fast rx, zero Tx".
> The zero-tx part sounds like a regression to me.

What is the issue with using separate sockets that you are
having? I generally end up using that even with V2.

> If we want to have something that does block Tx,
> and we cannot figure out how to retro-fit it to
> the exisiting APIs, we can always go for TPACKET_V4.

But the semantics for V3 are currenty well defined. Calling something
V3, but using V2 semantics is a somewhat unintuitive interface to me.

I don't see a benefit in defining something that does not implement
any new features. Especially if it blocks adding the expected
semantics later.

That said, if there is a workload that benefits from using a
single socket, and especially if it can be forward compatible with
support for variable sized slots, then I do not object. I was just
having a look at that second point, actually.

Could you also extend the TX_RING test in
tools/testing/selftests/net/psock_tpacket.c if there are no other
blocking issues?

^ permalink raw reply

* [PATCH] net: socket: don't set sk_uid to garbage value in ->setattr()
From: Eric Biggers @ 2016-12-30 23:42 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Lorenzo Colitti, linux-fsdevel, linux-kernel,
	Eric Biggers

From: Eric Biggers <ebiggers@google.com>

->setattr() was recently implemented for socket files to sync the socket
inode's uid to the new 'sk_uid' member of struct sock.  It does this by
copying over the ia_uid member of struct iattr.  However, ia_uid is
actually only valid when ATTR_UID is set in ia_valid, indicating that
the uid is being changed, e.g. by chown.  Other metadata operations such
as chmod or utimes leave ia_uid uninitialized.  Therefore, sk_uid could
be set to a "garbage" value from the stack.

Fix this by only copying the uid over when ATTR_UID is set.

Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 net/socket.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/socket.c b/net/socket.c
index 8487bf136e5c..a8c2307590b8 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -537,7 +537,7 @@ int sockfs_setattr(struct dentry *dentry, struct iattr *iattr)
 {
 	int err = simple_setattr(dentry, iattr);
 
-	if (!err) {
+	if (!err && (iattr->ia_valid & ATTR_UID)) {
 		struct socket *sock = SOCKET_I(d_inode(dentry));
 
 		sock->sk->sk_uid = iattr->ia_uid;
-- 
2.11.0


^ permalink raw reply related

* Re: [PATCH net-next rfc 0/6] convert tc_verd to integer bitfields
From: Willem de Bruijn @ 2016-12-31  0:02 UTC (permalink / raw)
  To: David Miller
  Cc: Network Development, Florian Westphal, dborkman, Jamal Hadi Salim,
	Alexei Starovoitov, Willem de Bruijn
In-Reply-To: <20161229.222113.146921294105321953.davem@davemloft.net>

On Thu, Dec 29, 2016 at 10:21 PM, David Miller <davem@davemloft.net> wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Wed, 28 Dec 2016 14:13:24 -0500
>
>> The skb tc_verd field takes up two bytes but uses far fewer bits.
>> Convert the remaining use cases to bitfields that fit in existing
>> holes (depending on config options) and potentially save the two
>> bytes in struct sk_buff.
>
> I like the looks of this, for sure :-)

Great, thanks. I sent it as RFC initially, because there were some unresolved
issues with TC_FROM semantics last time around. bpf_redirect has been
merged in the meantime, so I suspect that that has been resolved.

But there is no rush. I will run a few more tests, such as the new mirred
ingress path, and will hold off resubmitting until people now on holiday
have a chance to chime in next week.

^ permalink raw reply

* Re: [PATCH net-next] af_packet: Provide a TPACKET_V2 compatible Tx path for TPACKET_V3
From: Sowmini Varadhan @ 2016-12-31  0:48 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, Daniel Borkmann, Willem de Bruijn,
	David Miller
In-Reply-To: <CAF=yD-JiRDnYPB=W37T5tF7rVVGD5bACK6YkSWwuGdF2PdOdBQ@mail.gmail.com>

On (12/30/16 18:39), Willem de Bruijn wrote:
> 
> Variable length slots seems like the only one from that list that
> makes sense on Tx.
> 
> It is already possible to prepare multiple buffers before triggering
> transmit, so the block-based signal moderation is not very relevant.

FWIW, here is our experience

In our use cases, the blocking on the RX side comes quite naturally
to the application (since, upon waking from select(), we try 
to read as many requests as possible, until we run out of buffers
and/or input), but the response side is not batched today: the server
application sends out one response at a time, and trying to change
this would need additional batching-intelligence in the server. We
are working on the latter part, but as you point out, we can  prepare
multiple buffers before triggering transmit, so some variant of block
TX seems achievable.

Our response messages are usually well-defined multiples of PAGE_SIZE,
(and we are able to set Jumbo MTU) so the variable length slots is not
an issue we foresee (see additional comment on this below).

The block RX is interesting because it allows the server better control
over context-switches and system-calls. This is important because our
input request stream tends to be bursty - the senders (clients) of the
request have to do some computationally intense work before sending the
next request, so being able to adjust the timeout for poll wakeup at
the server is a useful knob.

Having 2 sockets instead of one is unattractive because it just makes
the existing API more clumsy - today  we are using UDP, RDS-TCP and
RDS-IB sockets, and all of this is built around a POSIX-like paradigm
of having some type of select(), sendmsg(), recvmsg() API with a single
socket. Even just extending this to also handle TPACKET_V2 (and tracking
needed context) is messy. Having to convert all this to a 2-socket model
would need significant perf justification, and we havent seen that
justification in our micro-benchmarks yet.

(and fwiw, the POSIX-like API with a single file desc for all I/O
is a major consideration, since the I/O can come from other sources 
like disk, fs etc, and it's cleanest if we follow the same paradigm
for networking as well)

> > since then apps that want to use the Rx benefits
> > have to deal with this dual socket feature, where
> > with "one socket for super-fast rx, zero Tx".
> > The zero-tx part sounds like a regression to me.
> 
> What is the issue with using separate sockets that you are
> having? I generally end up using that even with V2.

Why do you end up having to use 2 sockets with V2? That part
worked out quite nicely for my case (for a simple netserver like 
req/resp handler).

> But the semantics for V3 are currenty well defined. Calling something
> V3, but using V2 semantics is a somewhat unintuitive interface to me.

One fundamental part of tpacket that makes it attractive to 
alternatives like netmap, dpdk etc is that the API follows the
semantics of the classic  unix socket and fd APIs: support for basic
select/sendmsg/recvmsg that work for everything until _V3. 

> I don't see a benefit in defining something that does not implement
> any new features. Especially if it blocks adding the expected
> semantics later.

V3 removed the sendmsg feature.  This patch puts back that feature.

> That said, if there is a workload that benefits from using a
> single socket, and especially if it can be forward compatible with
> support for variable sized slots, then I do not object. I was just
> having a look at that second point, actually.

Actually I'm not averse to looking at extensions (or at least,
place-holders) to allow variable sized slots- do you have any
suggestions? As I mentioned before, the use-cases that I see
do not need variable length slots, thus I have not thought
too deeply about it. But if we think this may be needed in the 
future can't it be accomodated by additional sockopts (or even 
per-packet cmsghdr?) on top of V3? 

> Could you also extend the TX_RING test in
> tools/testing/selftests/net/psock_tpacket.c if there are no other
> blocking issues?

sure, I can do that. Let me do this for patchv2. 

--Sowmini

^ permalink raw reply

* [PATCH] Ipvlan should return an error when an address is already in use.
From: Krister Johansen @ 2016-12-31  4:10 UTC (permalink / raw)
  To: David S. Miller; +Cc: Mahesh Bandewar, netdev

The ipvlan code already knows how to detect when a duplicate address is
about to be assigned to an ipvlan device.  However, that failure is not
propogated outward and leads to a silent failure.  This teaches the ip
address addition functions how to report this error to the user
applications so that a notifier chain failure during ip address addition
will not appear to succeed when it actually has not.

This can be especially useful if it is necessary to provision many
ipvlans in containers.  The provisioning software (or operator) can use
this to detect situations where an ip address is unexpectedly in use.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
 drivers/net/ipvlan/ipvlan_main.c | 12 ++++++++----
 net/ipv4/devinet.c               |  8 +++++++-
 net/ipv6/addrconf.c              | 23 ++++++++++++++++++-----
 3 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 975f9dd..3bd2506 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -739,6 +739,7 @@ static int ipvlan_addr6_event(struct notifier_block *unused,
 	struct inet6_ifaddr *if6 = (struct inet6_ifaddr *)ptr;
 	struct net_device *dev = (struct net_device *)if6->idev->dev;
 	struct ipvl_dev *ipvlan = netdev_priv(dev);
+	int err;
 
 	/* FIXME IPv6 autoconf calls us from bh without RTNL */
 	if (in_softirq())
@@ -752,8 +753,9 @@ static int ipvlan_addr6_event(struct notifier_block *unused,
 
 	switch (event) {
 	case NETDEV_UP:
-		if (ipvlan_add_addr6(ipvlan, &if6->addr))
-			return NOTIFY_BAD;
+		err = ipvlan_add_addr6(ipvlan, &if6->addr);
+		if (err)
+			return notifier_from_errno(err);
 		break;
 
 	case NETDEV_DOWN:
@@ -788,6 +790,7 @@ static int ipvlan_addr4_event(struct notifier_block *unused,
 	struct net_device *dev = (struct net_device *)if4->ifa_dev->dev;
 	struct ipvl_dev *ipvlan = netdev_priv(dev);
 	struct in_addr ip4_addr;
+	int err;
 
 	if (!netif_is_ipvlan(dev))
 		return NOTIFY_DONE;
@@ -798,8 +801,9 @@ static int ipvlan_addr4_event(struct notifier_block *unused,
 	switch (event) {
 	case NETDEV_UP:
 		ip4_addr.s_addr = if4->ifa_address;
-		if (ipvlan_add_addr4(ipvlan, &ip4_addr))
-			return NOTIFY_BAD;
+		err = ipvlan_add_addr4(ipvlan, &ip4_addr);
+		if (err)
+			return notifier_from_errno(err);
 		break;
 
 	case NETDEV_DOWN:
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 4cd2ee8..5bc26f8e 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -442,6 +442,7 @@ static int __inet_insert_ifa(struct in_ifaddr *ifa, struct nlmsghdr *nlh,
 {
 	struct in_device *in_dev = ifa->ifa_dev;
 	struct in_ifaddr *ifa1, **ifap, **last_primary;
+	int ret;
 
 	ASSERT_RTNL();
 
@@ -489,7 +490,12 @@ static int __inet_insert_ifa(struct in_ifaddr *ifa, struct nlmsghdr *nlh,
 	   Notifier will trigger FIB update, so that
 	   listeners of netlink will know about new ifaddr */
 	rtmsg_ifa(RTM_NEWADDR, ifa, nlh, portid);
-	blocking_notifier_call_chain(&inetaddr_chain, NETDEV_UP, ifa);
+	ret = blocking_notifier_call_chain(&inetaddr_chain, NETDEV_UP, ifa);
+	ret = notifier_to_errno(ret);
+	if (ret) {
+		__inet_del_ifa(in_dev, ifap, 1, NULL, portid);
+		return ret;
+	}
 
 	return 0;
 }
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c1e124b..24d89f3 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -149,6 +149,7 @@ static inline void addrconf_sysctl_unregister(struct inet6_dev *idev)
 
 static void ipv6_regen_rndid(struct inet6_dev *idev);
 static void ipv6_try_regen_rndid(struct inet6_dev *idev, struct in6_addr *tmpaddr);
+static void __ipv6_del_addr(struct inet6_ifaddr *ifp, bool notify);
 
 static int ipv6_generate_eui64(u8 *eui, struct net_device *dev);
 static int ipv6_count_addresses(struct inet6_dev *idev);
@@ -1031,9 +1032,15 @@ ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr,
 out2:
 	rcu_read_unlock_bh();
 
-	if (likely(err == 0))
-		inet6addr_notifier_call_chain(NETDEV_UP, ifa);
-	else {
+	if (likely(err == 0)) {
+		err = inet6addr_notifier_call_chain(NETDEV_UP, ifa);
+		err = notifier_to_errno(err);
+		if (err) {
+			__ipv6_del_addr(ifa, false);
+			ifa = ERR_PTR(err);
+			return ifa;
+		}
+	} else {
 		kfree(ifa);
 		ifa = ERR_PTR(err);
 	}
@@ -1128,7 +1135,7 @@ cleanup_prefix_route(struct inet6_ifaddr *ifp, unsigned long expires, bool del_r
 
 /* This function wants to get referenced ifp and releases it before return */
 
-static void ipv6_del_addr(struct inet6_ifaddr *ifp)
+static void __ipv6_del_addr(struct inet6_ifaddr *ifp, bool notify)
 {
 	int state;
 	enum cleanup_prefix_rt_t action = CLEANUP_PREFIX_RT_NOP;
@@ -1169,7 +1176,8 @@ static void ipv6_del_addr(struct inet6_ifaddr *ifp)
 
 	addrconf_del_dad_work(ifp);
 
-	ipv6_ifa_notify(RTM_DELADDR, ifp);
+	if (notify)
+		ipv6_ifa_notify(RTM_DELADDR, ifp);
 
 	inet6addr_notifier_call_chain(NETDEV_DOWN, ifp);
 
@@ -1184,6 +1192,11 @@ static void ipv6_del_addr(struct inet6_ifaddr *ifp)
 	in6_ifa_put(ifp);
 }
 
+static void ipv6_del_addr(struct inet6_ifaddr *ifp)
+{
+	__ipv6_del_addr(ifp, true);
+}
+
 static int ipv6_create_tempaddr(struct inet6_ifaddr *ifp, struct inet6_ifaddr *ift)
 {
 	struct inet6_dev *idev = ifp->idev;
-- 
2.7.4

^ permalink raw reply related

* [PATCH] Introduce a sysctl that modifies the value of PROT_SOCK.
From: Krister Johansen @ 2016-12-31  4:11 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

Add net.ipv4.ip_unprotected_port_start, which is a per namespace sysctl
that denotes the first unprotected inet port in the namespace.  To
disable all protected ports set this to zero.  It also checks for
overlap with the local port range.  The protected and local range may
not overlap.

The use case for this change is to allow containerized processes to bind
to priviliged ports, but prevent them from ever being allowed to modify
their container's network configuration.  The latter is accomplished by
ensuring that the network namespace is not a child of the user
namespace.  This modification was needed to allow the container manager
to disable a namespace's priviliged port restrictions without exposing
control of the network namespace to processes in the user namespace.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
 include/net/ip.h               | 12 +++++++++
 include/net/netns/ipv4.h       |  3 +++
 net/Kconfig                    |  7 +++++
 net/ipv4/af_inet.c             |  5 +++-
 net/ipv4/sysctl_net_ipv4.c     | 59 ++++++++++++++++++++++++++++++++++++++++++
 net/ipv6/af_inet6.c            |  3 ++-
 net/netfilter/ipvs/ip_vs_ctl.c |  7 +++--
 net/sctp/socket.c              | 10 ++++---
 security/selinux/hooks.c       |  3 ++-
 9 files changed, 98 insertions(+), 11 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index ab6761a..65f1375 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -270,6 +270,18 @@ static inline int inet_is_local_reserved_port(struct net *net, int port)
 }
 #endif
 
+#ifdef CONFIG_LOWPORT_SYSCTL
+static inline int inet_prot_sock(struct net *net)
+{
+	return net->ipv4.sysctl_ip_prot_sock;
+}
+#else
+static inline int inet_prot_sock(struct net *net)
+{
+	return PROT_SOCK;
+}
+#endif
+
 __be32 inet_current_timestamp(void);
 
 /* From inetpeer.c */
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8e3f5b6..f1126686 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -96,6 +96,9 @@ struct netns_ipv4 {
 	/* Shall we try to damage output packets if routing dev changes? */
 	int sysctl_ip_dynaddr;
 	int sysctl_ip_early_demux;
+#ifdef CONFIG_LOWPORT_SYSCTL
+	int sysctl_ip_prot_sock;
+#endif
 
 	int sysctl_fwmark_reflect;
 	int sysctl_tcp_fwmark_accept;
diff --git a/net/Kconfig b/net/Kconfig
index a100500..946999c 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -109,6 +109,13 @@ config NETWORK_PHY_TIMESTAMPING
 
 	  If you are unsure how to answer this question, answer N.
 
+config LOWPORT_SYSCTL
+	bool "Adjust reserved port range via sysctl"
+	depends on SYSCTL
+	help
+	  This allows the administrator to adjust the reserved port range
+	  using a sysctl.
+
 menuconfig NETFILTER
 	bool "Network packet filtering framework (Netfilter)"
 	---help---
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index aae410b..c44ea78 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -479,7 +479,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 
 	snum = ntohs(addr->sin_port);
 	err = -EACCES;
-	if (snum && snum < PROT_SOCK &&
+	if (snum && snum < inet_prot_sock(net) &&
 	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		goto out;
 
@@ -1700,6 +1700,9 @@ static __net_init int inet_init_net(struct net *net)
 	net->ipv4.sysctl_ip_default_ttl = IPDEFTTL;
 	net->ipv4.sysctl_ip_dynaddr = 0;
 	net->ipv4.sysctl_ip_early_demux = 1;
+#ifdef CONFIG_LOWPORT_SYSCTL
+	net->ipv4.sysctl_ip_prot_sock = PROT_SOCK;
+#endif
 
 	return 0;
 }
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 134d8e1..2167e3f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -35,6 +35,10 @@ static int ip_local_port_range_min[] = { 1, 1 };
 static int ip_local_port_range_max[] = { 65535, 65535 };
 static int tcp_adv_win_scale_min = -31;
 static int tcp_adv_win_scale_max = 31;
+#ifdef CONFIG_LOWPORT_SYSCTL
+static int ip_protected_port_min;
+static int ip_protected_port_max = 65535;
+#endif
 static int ip_ttl_min = 1;
 static int ip_ttl_max = 255;
 static int tcp_syn_retries_min = 1;
@@ -79,7 +83,17 @@ static int ipv4_local_port_range(struct ctl_table *table, int write,
 	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
 
 	if (write && ret == 0) {
+		/* Ensure that the upper limit is not smaller than the lower,
+		 * and that the lower does not encroach upon the protected
+		 * port limit.
+		 */
+#ifdef CONFIG_LOWPORT_SYSCTL
+		if ((range[1] < range[0]) ||
+		    (range[0] < net->ipv4.sysctl_ip_prot_sock))
+#else
+
 		if (range[1] < range[0])
+#endif
 			ret = -EINVAL;
 		else
 			set_local_port_range(net, range);
@@ -88,6 +102,42 @@ static int ipv4_local_port_range(struct ctl_table *table, int write,
 	return ret;
 }
 
+#ifdef CONFIG_LOWPORT_SYSCTL
+/* Validate changes from /proc interface. */
+static int ipv4_protected_ports(struct ctl_table *table, int write,
+				void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct net *net = container_of(table->data, struct net,
+	    ipv4.sysctl_ip_prot_sock);
+	int ret;
+	int pports;
+	int range[2];
+	struct ctl_table tmp = {
+		.data = &pports,
+		.maxlen = sizeof(pports),
+		.mode = table->mode,
+		.extra1 = &ip_protected_port_min,
+		.extra2 = &ip_protected_port_max,
+	};
+
+	pports = net->ipv4.sysctl_ip_prot_sock;
+
+	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
+
+	if (write && ret == 0) {
+		inet_get_local_port_range(net, &range[0], &range[1]);
+		/* Ensure that the local port range doesn't overlap with the
+		 * protected port range.
+		 */
+		if (range[0] < pports)
+			ret = -EINVAL;
+		else
+			net->ipv4.sysctl_ip_prot_sock = pports;
+	}
+
+	return ret;
+}
+#endif
 
 static void inet_get_ping_group_range_table(struct ctl_table *table, kgid_t *low, kgid_t *high)
 {
@@ -971,6 +1021,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_LOWPORT_SYSCTL
+	{
+		.procname	= "ip_unprotected_port_start",
+		.maxlen		= sizeof(int),
+		.data		= &init_net.ipv4.sysctl_ip_prot_sock,
+		.mode		= 0644,
+		.proc_handler	= ipv4_protected_ports,
+	},
+#endif
 	{ }
 };
 
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index aa42123..04db406 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -302,7 +302,8 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		return -EINVAL;
 
 	snum = ntohs(addr->sin6_port);
-	if (snum && snum < PROT_SOCK && !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
+	if (snum && snum < inet_prot_sock(net) &&
+	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
 
 	lock_sock(sk);
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 55e0169..8b7416f 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -426,10 +426,9 @@ ip_vs_service_find(struct netns_ipvs *ipvs, int af, __u32 fwmark, __u16 protocol
 	 */
 	svc = __ip_vs_service_find(ipvs, af, protocol, vaddr, vport);
 
-	if (svc == NULL
-	    && protocol == IPPROTO_TCP
-	    && atomic_read(&ipvs->ftpsvc_counter)
-	    && (vport == FTPDATA || ntohs(vport) >= PROT_SOCK)) {
+	if (!svc && protocol == IPPROTO_TCP &&
+	    atomic_read(&ipvs->ftpsvc_counter) &&
+	    (vport == FTPDATA || ntohs(vport) >= inet_prot_sock(ipvs->net))) {
 		/*
 		 * Check if ftp service entry exists, the packet
 		 * might belong to FTP data connections.
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 318c678..2723f4a 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -360,7 +360,7 @@ static int sctp_do_bind(struct sock *sk, union sctp_addr *addr, int len)
 		}
 	}
 
-	if (snum && snum < PROT_SOCK &&
+	if (snum && snum < inet_prot_sock(net) &&
 	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
 
@@ -1152,8 +1152,10 @@ static int __sctp_connect(struct sock *sk,
 				 * accept new associations, but it SHOULD NOT
 				 * be permitted to open new associations.
 				 */
-				if (ep->base.bind_addr.port < PROT_SOCK &&
-				    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE)) {
+				if (ep->base.bind_addr.port <
+				    inet_prot_sock(net) &&
+				    !ns_capable(net->user_ns,
+				    CAP_NET_BIND_SERVICE)) {
 					err = -EACCES;
 					goto out_free;
 				}
@@ -1818,7 +1820,7 @@ static int sctp_sendmsg(struct sock *sk, struct msghdr *msg, size_t msg_len)
 			 * but it SHOULD NOT be permitted to open new
 			 * associations.
 			 */
-			if (ep->base.bind_addr.port < PROT_SOCK &&
+			if (ep->base.bind_addr.port < inet_prot_sock(net) &&
 			    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE)) {
 				err = -EACCES;
 				goto out_unlock;
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index c7c6619..53cb6da 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4365,7 +4365,8 @@ static int selinux_socket_bind(struct socket *sock, struct sockaddr *address, in
 
 			inet_get_local_port_range(sock_net(sk), &low, &high);
 
-			if (snum < max(PROT_SOCK, low) || snum > high) {
+			if (snum < max(inet_prot_sock(sock_net(sk)), low) ||
+			    snum > high) {
 				err = sel_netport_sid(sk->sk_protocol,
 						      snum, &sid);
 				if (err)
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net-next] af_packet: Provide a TPACKET_V2 compatible Tx path for TPACKET_V3
From: Willem de Bruijn @ 2016-12-31  4:59 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, Daniel Borkmann, Willem de Bruijn,
	David Miller
In-Reply-To: <20161231004826.GC31800@oracle.com>

>> What is the issue with using separate sockets that you are
>> having? I generally end up using that even with V2.
>
> Why do you end up having to use 2 sockets with V2? That part
> worked out quite nicely for my case (for a simple netserver like
> req/resp handler).

Yes, that should work fine, actually. I was thinking of multi-threaded
setups where using per-cpu tx sockets avoids cacheline bouncing, but
there is no reason to have a receive socket on each cpu. But that's
not a standard setup.

>> But the semantics for V3 are currenty well defined. Calling something
>> V3, but using V2 semantics is a somewhat unintuitive interface to me.
>
> One fundamental part of tpacket that makes it attractive to
> alternatives like netmap, dpdk etc is that the API follows the
> semantics of the classic  unix socket and fd APIs: support for basic
> select/sendmsg/recvmsg that work for everything until _V3.
>
>> I don't see a benefit in defining something that does not implement
>> any new features. Especially if it blocks adding the expected
>> semantics later.
>
> V3 removed the sendmsg feature.  This patch puts back that feature.

You mean send ring, right? The sendmsg call works fine on a single socket
alongside an RX_RING until a TX_RING is installed.

>> That said, if there is a workload that benefits from using a
>> single socket, and especially if it can be forward compatible with
>> support for variable sized slots, then I do not object. I was just
>> having a look at that second point, actually.
>
> Actually I'm not averse to looking at extensions (or at least,
> place-holders) to allow variable sized slots- do you have any
> suggestions?

It is probably enough to just enforce that tp_next_offset is always
sane. Either that it points to the start of the next packet, even
though that currently can also be inferred from frame_size. Or, that
it is always zero unless the ring is to be interpreted as holding
variable sized frames. If the field is non-zero, drop the packet.

> As I mentioned before, the use-cases that I see
> do not need variable length slots, thus I have not thought
> too deeply about it. But if we think this may be needed in the
> future can't it be accomodated by additional sockopts (or even
> per-packet cmsghdr?) on top of V3?
>
>> Could you also extend the TX_RING test in
>> tools/testing/selftests/net/psock_tpacket.c if there are no other
>> blocking issues?
>
> sure, I can do that. Let me do this for patchv2.

Great. Thanks!

^ permalink raw reply

* ping: What's the purpose of rate limit?
From: Guan Xin @ 2016-12-31 12:04 UTC (permalink / raw)
  To: netdev

Hello,

Excuse me, I searched but didn't find an answer --

What's the purpose of setting a limit to the "-i" and "-l" parameters
of ping for non-root users?

It seems that this is only intended to prevent accidental misuse
because these restrictions can be easily worked around by starting
multiple instances or wrapping the program in a loop, e.g.,
while true; do ping -l3 -c3 192.168.0.1; done

(Please Cc to me because I'm not subscribed to this list. Thanks!)

Regards,
Guan

^ permalink raw reply

* Re: [PATCH net-next] af_packet: Provide a TPACKET_V2 compatible Tx path for TPACKET_V3
From: Sowmini Varadhan @ 2016-12-31 12:21 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, Daniel Borkmann, Willem de Bruijn,
	David Miller
In-Reply-To: <CAF=yD-Jgui0gBeZuzfJRN1uozV+LqVm76tnKUnChSg7_z9-1ng@mail.gmail.com>

On (12/30/16 23:59), Willem de Bruijn wrote:
> > Actually I'm not averse to looking at extensions (or at least,
> > place-holders) to allow variable sized slots- do you have any
> > suggestions?
> 
> It is probably enough to just enforce that tp_next_offset is always
> sane. Either that it points to the start of the next packet, even
> though that currently can also be inferred from frame_size. Or, that
> it is always zero unless the ring is to be interpreted as holding
> variable sized frames. If the field is non-zero, drop the packet.

Ok, let me add this to patchv2, along with extra test cases.

--Sowmini

^ permalink raw reply

* Re: [PATCH net-next V3 3/3] tun: rx batching
From: David Miller @ 2016-12-31 17:31 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, virtualization, wexu, kvm, mst
In-Reply-To: <1483075251-6889-4-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Fri, 30 Dec 2016 13:20:51 +0800

> @@ -1283,10 +1314,15 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>  	skb_probe_transport_header(skb, 0);
>  
>  	rxhash = skb_get_hash(skb);
> +
>  #ifndef CONFIG_4KSTACKS
> -	local_bh_disable();
> -	netif_receive_skb(skb);
> -	local_bh_enable();
> +	if (!rx_batched) {
> +		local_bh_disable();
> +		netif_receive_skb(skb);
> +		local_bh_enable();
> +	} else {
> +		tun_rx_batched(tfile, skb, more);
> +	}
>  #else
>  	netif_rx_ni(skb);
>  #endif

If rx_batched has been set, and we are talking to clients not using
this new MSG_MORE facility (or such clients don't have multiple TX
packets to send to you, thus MSG_MORE is often clear), you are doing a
lot more work per-packet than the existing code.

You take the queue lock, you test state, you splice into a local queue
on the stack, then you walk that local stack queue to submit just one
SKB to netif_receive_skb().

I think you want to streamline this sequence in such cases so that the
cost before and after is similar if not equivalent.

^ permalink raw reply

* Re: [PATCH] net: socket: don't set sk_uid to garbage value in ->setattr()
From: David Miller @ 2016-12-31 17:36 UTC (permalink / raw)
  To: ebiggers3; +Cc: netdev, lorenzo, linux-fsdevel, linux-kernel, ebiggers
In-Reply-To: <20161230234232.4221-1-ebiggers3@gmail.com>

From: Eric Biggers <ebiggers3@gmail.com>
Date: Fri, 30 Dec 2016 17:42:32 -0600

> From: Eric Biggers <ebiggers@google.com>
> 
> ->setattr() was recently implemented for socket files to sync the socket
> inode's uid to the new 'sk_uid' member of struct sock.  It does this by
> copying over the ia_uid member of struct iattr.  However, ia_uid is
> actually only valid when ATTR_UID is set in ia_valid, indicating that
> the uid is being changed, e.g. by chown.  Other metadata operations such
> as chmod or utimes leave ia_uid uninitialized.  Therefore, sk_uid could
> be set to a "garbage" value from the stack.
> 
> Fix this by only copying the uid over when ATTR_UID is set.
> 
> Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.")
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Lorenzo, please review.

^ permalink raw reply

* Re: [PATCH v3] net: ethernet: faraday: To support device tree usage.
From: Florian Fainelli @ 2016-12-31 18:48 UTC (permalink / raw)
  To: Greentime Hu, netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA, andrew-g2DYL2Zd6BY,
	arnd-r2nGTMty4D4, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jiri-rHqAuBHg3fBzbRFIqnYvSA
In-Reply-To: <1483083470-15779-1-git-send-email-green.hu-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>



On 12/29/2016 11:37 PM, Greentime Hu wrote:
> Signed-off-by: Greentime Hu <green.hu-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

This is not enough, you need to add a Device Tree binding document under
Documentation/devicetree/bindings/net/ which documents this compatible
string, as well as additional properties that may be required to
describe this hardware block.

-- 
Florian
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH] drop_monitor: add missing call to genlmsg_end
From: Reiter Wolfgang @ 2016-12-31 20:11 UTC (permalink / raw)
  To: nhorman; +Cc: davem, netdev, linux-kernel, Reiter Wolfgang

Update nlmsg_len field with genlmsg_end to enable userspace processing
using nlmsg_next helper. Also adds error handling.

Signed-off-by: Reiter Wolfgang <wr0112358@gmail.com>
---
 net/core/drop_monitor.c | 33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index 8e0c063..f465bad 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -75,6 +75,7 @@ static struct sk_buff *reset_per_cpu_data(struct per_cpu_dm_data *data)
 	struct nlattr *nla;
 	struct sk_buff *skb;
 	unsigned long flags;
+	void *msg_header;
 
 	al = sizeof(struct net_dm_alert_msg);
 	al += dm_hit_limit * sizeof(struct net_dm_drop_point);
@@ -82,17 +83,31 @@ static struct sk_buff *reset_per_cpu_data(struct per_cpu_dm_data *data)
 
 	skb = genlmsg_new(al, GFP_KERNEL);
 
-	if (skb) {
-		genlmsg_put(skb, 0, 0, &net_drop_monitor_family,
-				0, NET_DM_CMD_ALERT);
-		nla = nla_reserve(skb, NLA_UNSPEC,
-				  sizeof(struct net_dm_alert_msg));
-		msg = nla_data(nla);
-		memset(msg, 0, al);
-	} else {
-		mod_timer(&data->send_timer, jiffies + HZ / 10);
+	if (!skb)
+		goto err;
+
+	msg_header = genlmsg_put(skb, 0, 0, &net_drop_monitor_family,
+				 0, NET_DM_CMD_ALERT);
+	if (!msg_header) {
+		nlmsg_free(skb);
+		skb = NULL;
+		goto err;
+	}
+	nla = nla_reserve(skb, NLA_UNSPEC,
+			  sizeof(struct net_dm_alert_msg));
+	if (!nla) {
+		nlmsg_free(skb);
+		skb = NULL;
+		goto err;
 	}
+	msg = nla_data(nla);
+	memset(msg, 0, al);
+	genlmsg_end(skb, msg_header);
+	goto out;
 
+err:
+	mod_timer(&data->send_timer, jiffies + HZ / 10);
+out:
 	spin_lock_irqsave(&data->lock, flags);
 	swap(data->skb, skb);
 	spin_unlock_irqrestore(&data->lock, flags);
-- 
2.9.3

^ permalink raw reply related

* Re: Bug w/ (policy) routing
From: David Ahern @ 2016-12-31 20:15 UTC (permalink / raw)
  To: Olivier Brunel, netdev, Alexander Duyck
In-Reply-To: <20161231000006.17677918@jjacky.com>

On 12/30/16 4:00 PM, Olivier Brunel wrote:
> Hi,
> 
> (Please cc me as I'm not subscribed to the list, thanks.)
> 
> I'm trying to set things up using some policy routing, and having some
> weird issues I can't really explain. It looks to me like there might be
> a bug somewhere...
> 
> This is done under Arch Linux 64bits, iproute2 4.9.0 (`ip -V` says ip
> utility, iproute2-ss161212), kernel 4.8.13
> 
> Basically here's what I could reduce it to:
> - create a new network namespace, create a pair of veth devices: one in
> there, one sent back to the original namespace
> - I'm giving them IPs 10.4.0.1 (original namespace) & 10.4.0.2 (new
> namespace)
> - in that new namespace, I'm trying to add a route to 10.4.0.1, but
>   inside a new table. I also want a default route via 10.4.0.1 on the
>   table main. It seems to work, only not really...
> 
> It's not very easy to describe so hopefully this will make things
> clearer:
> 
> $ sudo unshare -n sh

The main and local fib tables start merged into a single fib_table instance.

> sh-4.4# ip rule add table 50 prio 50
> sh-4.4# ip link add test type veth peer name test2
> sh-4.4# ip addr add 10.4.0.2 dev test
> sh-4.4# ip link set dev test up
> sh-4.4# ip link set netns 1 dev test2
> # back in original namespace, we add 10.4.0.1 to test2 and bring it up
> 
> sh-4.4# ip route add 10.4.0.1 dev test table 50
> sh-4.4# ip route add default via 10.4.0.1 dev test
> sh-4.4# ip route flush cache
> sh-4.4# ip rule
> 0:	from all lookup local 
> 50:	from all lookup 50 
> 32766:	from all lookup main 
> 32767:	from all lookup default 
> sh-4.4# ip route show table 50
> 10.4.0.1 dev test scope link 
> sh-4.4# ip route get 10.4.0.1
> 10.4.0.1 via 10.4.0.1 dev test table local src 10.4.0.2 
>     cache 
> # !?? why isn't table 50 used as, I believe, it should. And why

The default rule set has the local table at priority 0 so all lookups hit that table first:

# ip ru ls
0:	from all lookup local
32766:	from all lookup main
32767:	from all lookup default

For some reason it hits a match doing the lookup in the merged local/main fib_table for this ip route get.

> does adding a rule "fixes" it :
> 
> sh-4.4# ip rule add prio 55555

Adding this rule causes the local and main tables to be split into actual separate fib tables. Doing so causes the lookup in the local table to fail, so the lookup continues to the next rule -- which is your prio 50 table 50 rule.

I did not look into why the earlier rule addition did not cause the unmerge to happen -- it should have.


> sh-4.4# ip route get 10.4.0.1
> 10.4.0.1 dev test table 50 src 10.4.0.2 
>     cache 
> # deleting the new rule makes no difference. It could even have been
> done right after adding it. It's like it just triggered something
> (reload, cache flushed, ...)
> As seen I did flush cached routes as mentionned in the man page, I don't
> know of anything else that would need done to "refresh" things?
> 
> Any ideas as to why this is happening? Should this work as I expect it,
> or is there anything I'm doing wrong?

For your purposes now this should fix the problem for you:

ip ru del from all lookup local
ip ru add prio 32765 from all lookup local

^ permalink raw reply

* Re: [PATCH v3] net: ethernet: faraday: To support device tree usage.
From: Arnd Bergmann @ 2016-12-31 20:23 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Greentime Hu, netdev, devicetree, andrew, linux-kernel, jiri
In-Reply-To: <351d1ae7-ca51-7962-0e52-38cdc69bcebd@gmail.com>

On Saturday, December 31, 2016 10:48:39 AM CET Florian Fainelli wrote:
> 
> On 12/29/2016 11:37 PM, Greentime Hu wrote:
> > Signed-off-by: Greentime Hu <green.hu@gmail.com>
> 
> This is not enough, you need to add a Device Tree binding document under
> Documentation/devicetree/bindings/net/ which documents this compatible
> string, as well as additional properties that may be required to
> describe this hardware block.

We already have

Documentation/devicetree/bindings/net/moxa,moxart-mac.txt

for the same hardware (though used by a different driver).

I'd suggest renaming that one to a more generic file name and
adding the new compatible string there.

Aside from that, every patch should also have a changelog comment.

	Arnd

^ permalink raw reply

* Re: [PATCH] Introduce a sysctl that modifies the value of PROT_SOCK.
From: Stephen Hemminger @ 2016-12-31 20:55 UTC (permalink / raw)
  To: Krister Johansen; +Cc: David S. Miller, netdev
In-Reply-To: <20161231041111.GD2448@templeofstupid.com>

On Fri, 30 Dec 2016 20:11:11 -0800
Krister Johansen <kjlx@templeofstupid.com> wrote:

>  
> +config LOWPORT_SYSCTL
> +	bool "Adjust reserved port range via sysctl"
> +	depends on SYSCTL
> +	help
> +	  This allows the administrator to adjust the reserved port range
> +	  using a sysctl.

This looks like a good idea, and makes a lot of sense.

Please don't introduce yet another config option. All distro's will enable it anyway.
Having more config options doesn't help reliability or testability.

Do or do not, please no new config options.

^ permalink raw reply

* Re: [PATCH net-next V3 3/3] tun: rx batching
From: Stephen Hemminger @ 2016-12-31 21:03 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, virtualization, wexu, kvm, mst
In-Reply-To: <1483075251-6889-4-git-send-email-jasowang@redhat.com>

On Fri, 30 Dec 2016 13:20:51 +0800
Jason Wang <jasowang@redhat.com> wrote:

> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index cd8e02c..a268ed9 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -75,6 +75,10 @@
>  
>  #include <linux/uaccess.h>
>  
> +static int rx_batched;
> +module_param(rx_batched, int, 0444);
> +MODULE_PARM_DESC(rx_batched, "Number of packets batched in rx");
> +
>  /* Uncomment to enable debugging */

I like the concept or rx batching. But controlling it via a module parameter
is one of the worst API choices.  Ethtool would be better to use because that is
how other network devices control batching.

If you do ethtool, you could even extend it to have an number of packets
and max latency value.

^ permalink raw reply

* Re: [PATCH v1 0/2] bpf: add longest prefix match map
From: Alexei Starovoitov @ 2016-12-31 22:31 UTC (permalink / raw)
  To: David Miller
  Cc: Daniel Mack, Alexei Starovoitov, dh.herrmann, Daniel Borkmann,
	netdev@vger.kernel.org

On Fri, Dec 30, 2016 at 12:25 PM, David Miller <davem@davemloft.net> wrote:
> From: Daniel Mack <daniel@zonque.org>
> Date: Thu, 29 Dec 2016 18:28:53 +0100
>
>> This patch set adds a longest prefix match algorithm that can be used
>> to match IP addresses to a stored set of ranges. It is exposed as a
>> bpf map type.
>>
>> Internally, data is stored in an unbalanced tree of nodes that has a
>> maximum height of n, where n is the prefixlen the trie was created
>> with.
>>
>> Not that this has nothing to do with fib or fib6 and is in no way meant
>> to replace or share code with it. It's rather a much simpler
>> implementation that is specifically written with bpf maps in mind.
>>
>> Patch 1/2 adds the implementation, and 2/2 an extensive test suite.
>>
>> Feedback is much appreciated.
>
> I'll give Alexei and Daniel time to provide feedback on this series.

I did a quick glance over and, in general, I'm very much in favor.
Will do a proper review Jan 3rd when I'm back from pto.
Daniel,
could you provide performance numbers for lookups per second
for different key lengths with almost empty and almost
full 100k+ lpm rules?
And the rate of updates per second?
I guess your use case is more traditional 32 or 128 bit lpm
whereas I'd like to use it as 64-bit lpm.
Would be good to add few tests with non-power of 8 key length
(it seems to me they should be supported by the algo).
Thank you for working on it!

^ permalink raw reply

* Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable
From: Ansis Atteka @ 2017-01-01  0:07 UTC (permalink / raw)
  To: Hayes Wang
  Cc: David Miller, mlord@pobox.com, greg@kroah.com,
	romieu@fr.zoreil.com, netdev@vger.kernel.org, nic_swsd,
	linux-kernel@vger.kernel.org, linux-usb@vger.kernel.org
In-Reply-To: <0835B3720019904CB8F7AA43166CEEB201057793@RTITMBSV03.realtek.com.tw>

On Wed, Nov 30, 2016 at 3:58 AM, Hayes Wang <hayeswang@realtek.com> wrote:
> Mark Lord <mlord@pobox.com>
> [...]
>> > Not sure why, because there really is no other way for the data to
>> > appear where it does at the beginning of that URB buffer.
>> >
>> > This does seem a rather unexpected burden to place upon someone
>> > reporting a regression in a USB network driver that corrupts user data.
>>
>> If you are the only person who can actively reproduce this, which
>> seems to be the case right now, this is unfortunately the only way to
>> reach a proper analysis and fix.
>
> I have tested it with iperf more than five days without any error.
> I would think if there is any other way to reproduce it.
>

For the past few days I have been debugging a similar data corruption
bug related to r8152 driver, but on x86-64 platform. Also, I think
that this data corruption bug has some serious security implications,
because it appears that "corrupted data" is actually 530 byte fragment
from one of the previous Ethernet frames that Realtek device just
received. See the ping test in the bottom of my email that
demonstrates this.

Besides the data corruption problem I am also experiencing another
serious problem that could be related and manifests itself in XHCI
module when Realtek Ethernet port receives packets at "high" rate (ie
10Mbps or higher). This second problem correlates with error messages
in kern.log printed by xhci-hcd. Ethernet connectivity is completely
lost at this time until I reload r8152 driver:

[ 2540.426240] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.426258] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f010 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0
[ 2540.426259] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.426260] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f020 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0
[ 2540.426334] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.426336] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f030 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0
[ 2540.426372] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.426373] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f040 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0
[ 2540.426488] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.426491] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f050 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0
[ 2540.437020] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.437024] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f060 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0
[ 2540.438239] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.438246] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f070 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0
[ 2540.438493] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.438495] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f080 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0


All of that is happening on my X86-64  Dell XPS15 9550 laptop that is
connected to Ethernet via Dell TB15 dock. This Dell TB 15 Dock uses
Realtek chip to provide Ethernet connectivity to laptop:

# lsusb
...
Bus 004 Device 003: ID 0bda:8153 Realtek Semiconductor Corp.
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               3.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0         9
  idVendor           0x0bda Realtek Semiconductor Corp.
  idProduct          0x8153
  bcdDevice           30.01
  iManufacturer           1 Realtek
  iProduct                2 USB 10/100/1000 LAN
  iSerial                 6 000001000000
  bNumConfigurations      2

This Realtek Ethernet port is connected to a XHCI ASMedia host
controller that also resides on Dell TB15 Dock. The dock itself is
connected via Thunderbolt 3 cable to laptop:

# lspci
....
0e:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller


In my case it is easy to reproduce either of those two issues. Here
are my observations:
1. The Ethernet controller on Dell TB15 dock was working completely
fine while I had Windows 10 installed on my Laptop.
2. I have tried various Linux distributions - Ubuntu 16.10, Ubuntu
14.04, CentOS 7. All of them fail with "ERROR Transfer event TRB DMA
ptr not part of current TD ep_index 2 comp_code 13" error message
under high load.
3. I have tried Ubuntu 16.10 and Ubuntu 16.04. Both of them are
affected by this data corruption bug. I did not test for data
corruption on CentOS or other Linux distributions that come with older
Linux kernels than Ubuntu.
4. If I start two ping instances at the same time then it appears that
530 bytes from the first ping instance are occasionally "injected"
into ping payload of the second ping instance. Also, I was able to
reproduce this exact same issue with TCP.

sudo ping -i 0.05 -p ff -s 15000 10.33.75.80 # Sending 0xff as payload
....
15008 bytes from 10.33.75.80: icmp_seq=39 ttl=64 time=104 ms
wrong data byte #9822 should be 0xff but was 0x0
#16 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff
#9776 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#9808 ff ff ff ff ff ff ff ff ff ff ff ff ff ff 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0
#9840 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#9872 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#9904 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#9936 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#9968 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10032 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10064 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10096 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10128 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10160 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10192 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10224 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10288 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10320 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10352 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#10384 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
...

sudo ping -i 0.05 -p 00 -s 15000 10.33.75.80 # Sending 0x00 as payload
...
15008 bytes from 10.33.75.80: icmp_seq=164 ttl=64 time=95.4 ms
wrong data byte #11302 should be 0x0 but was 0xff
#16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...
#11248 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#11280 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ff ff ff ff ff ff ff ff ff ff
#11312 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11344 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11376 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11408 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11472 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11504 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11536 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11568 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11600 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11632 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11664 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11696 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11728 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11760 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11792 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11824 ff ff ff ff ff ff ff ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#11856 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#11888 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...

^ permalink raw reply

* Re: [PATCH] drop_monitor: add missing call to genlmsg_end
From: Neil Horman @ 2017-01-01  2:41 UTC (permalink / raw)
  To: Reiter Wolfgang; +Cc: davem, netdev, linux-kernel
In-Reply-To: <20161231201157.10800-1-wr0112358@gmail.com>

On Sat, Dec 31, 2016 at 09:11:57PM +0100, Reiter Wolfgang wrote:
> Update nlmsg_len field with genlmsg_end to enable userspace processing
> using nlmsg_next helper. Also adds error handling.
> 
> Signed-off-by: Reiter Wolfgang <wr0112358@gmail.com>
> ---
>  net/core/drop_monitor.c | 33 ++++++++++++++++++++++++---------
>  1 file changed, 24 insertions(+), 9 deletions(-)
> 
> diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
> index 8e0c063..f465bad 100644
> --- a/net/core/drop_monitor.c
> +++ b/net/core/drop_monitor.c
> @@ -75,6 +75,7 @@ static struct sk_buff *reset_per_cpu_data(struct per_cpu_dm_data *data)
>  	struct nlattr *nla;
>  	struct sk_buff *skb;
>  	unsigned long flags;
> +	void *msg_header;
>  
>  	al = sizeof(struct net_dm_alert_msg);
>  	al += dm_hit_limit * sizeof(struct net_dm_drop_point);
> @@ -82,17 +83,31 @@ static struct sk_buff *reset_per_cpu_data(struct per_cpu_dm_data *data)
>  
>  	skb = genlmsg_new(al, GFP_KERNEL);
>  
> -	if (skb) {
> -		genlmsg_put(skb, 0, 0, &net_drop_monitor_family,
> -				0, NET_DM_CMD_ALERT);
> -		nla = nla_reserve(skb, NLA_UNSPEC,
> -				  sizeof(struct net_dm_alert_msg));
> -		msg = nla_data(nla);
> -		memset(msg, 0, al);
> -	} else {
> -		mod_timer(&data->send_timer, jiffies + HZ / 10);
> +	if (!skb)
> +		goto err;
> +
> +	msg_header = genlmsg_put(skb, 0, 0, &net_drop_monitor_family,
> +				 0, NET_DM_CMD_ALERT);
> +	if (!msg_header) {
> +		nlmsg_free(skb);
> +		skb = NULL;
> +		goto err;
> +	}
> +	nla = nla_reserve(skb, NLA_UNSPEC,
> +			  sizeof(struct net_dm_alert_msg));
> +	if (!nla) {
> +		nlmsg_free(skb);
> +		skb = NULL;
> +		goto err;
>  	}
> +	msg = nla_data(nla);
> +	memset(msg, 0, al);
> +	genlmsg_end(skb, msg_header);
> +	goto out;
>  
> +err:
> +	mod_timer(&data->send_timer, jiffies + HZ / 10);
> +out:
>  	spin_lock_irqsave(&data->lock, flags);
>  	swap(data->skb, skb);
>  	spin_unlock_irqrestore(&data->lock, flags);
> -- 
> 2.9.3
> 
> 
Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox