Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH bpf-next v11 3/7] bpf: handle GSO in bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-13 19:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190213195341.184969-1-posk@google.com>

This patch adds handling of GSO packets in bpf_lwt_push_ip_encap()
(called from bpf_lwt_push_encap):

* IPIP, GRE, and UDP encapsulation types are deduced by looking
  into iphdr->protocol or ipv6hdr->next_header;
* SCTP GSO packets are not supported (as bpf_skb_proto_4_to_6
  and similar do);
* UDP_L4 GSO packets are also not supported (although they are
  not blocked in bpf_skb_proto_4_to_6 and similar), as
  skb_decrease_gso_size() will break it;
* SKB_GSO_DODGY bit is set.

Note: it may be possible to support SCTP and UDP_L4 gso packets;
      but as these cases seem to be not well handled by other
      tunneling/encapping code paths, the solution should
      be generic enough to apply to all tunneling/encapping code.

v8 changes:
   - make sure that if GRE or UDP encap is detected, there is
     enough of pushed bytes to cover both IP[v6] + GRE|UDP headers;
   - do not reject double-encapped packets;
   - whitelist TCP GSO packets rather than block SCTP GSO and
     UDP GSO.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 net/core/lwt_bpf.c | 67 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 65 insertions(+), 2 deletions(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index e5a9850d9f48..079871fc020f 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -16,6 +16,7 @@
 #include <linux/types.h>
 #include <linux/bpf.h>
 #include <net/lwtunnel.h>
+#include <net/gre.h>
 
 struct bpf_lwt_prog {
 	struct bpf_prog *prog;
@@ -390,10 +391,72 @@ static const struct lwtunnel_encap_ops bpf_encap_ops = {
 	.owner		= THIS_MODULE,
 };
 
+static int handle_gso_type(struct sk_buff *skb, unsigned int gso_type,
+			   int encap_len)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+	gso_type |= SKB_GSO_DODGY;
+	shinfo->gso_type |= gso_type;
+	skb_decrease_gso_size(shinfo, encap_len);
+	shinfo->gso_segs = 0;
+	return 0;
+}
+
 static int handle_gso_encap(struct sk_buff *skb, bool ipv4, int encap_len)
 {
-	/* Handling of GSO-enabled packets is added in the next patch. */
-	return -EOPNOTSUPP;
+	int next_hdr_offset;
+	void *next_hdr;
+	__u8 protocol;
+
+	/* SCTP and UDP_L4 gso need more nuanced handling than what
+	 * handle_gso_type() does above: skb_decrease_gso_size() is not enough.
+	 * So at the moment only TCP GSO packets are let through.
+	 */
+	if (!(skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
+		return -ENOTSUPP;
+
+	if (ipv4) {
+		protocol = ip_hdr(skb)->protocol;
+		next_hdr_offset = sizeof(struct iphdr);
+		next_hdr = skb_network_header(skb) + next_hdr_offset;
+	} else {
+		protocol = ipv6_hdr(skb)->nexthdr;
+		next_hdr_offset = sizeof(struct ipv6hdr);
+		next_hdr = skb_network_header(skb) + next_hdr_offset;
+	}
+
+	switch (protocol) {
+	case IPPROTO_GRE:
+		next_hdr_offset += sizeof(struct gre_base_hdr);
+		if (next_hdr_offset > encap_len)
+			return -EINVAL;
+
+		if (((struct gre_base_hdr *)next_hdr)->flags & GRE_CSUM)
+			return handle_gso_type(skb, SKB_GSO_GRE_CSUM,
+					       encap_len);
+		return handle_gso_type(skb, SKB_GSO_GRE, encap_len);
+
+	case IPPROTO_UDP:
+		next_hdr_offset += sizeof(struct udphdr);
+		if (next_hdr_offset > encap_len)
+			return -EINVAL;
+
+		if (((struct udphdr *)next_hdr)->check)
+			return handle_gso_type(skb, SKB_GSO_UDP_TUNNEL_CSUM,
+					       encap_len);
+		return handle_gso_type(skb, SKB_GSO_UDP_TUNNEL, encap_len);
+
+	case IPPROTO_IP:
+	case IPPROTO_IPV6:
+		if (ipv4)
+			return handle_gso_type(skb, SKB_GSO_IPXIP4, encap_len);
+		else
+			return handle_gso_type(skb, SKB_GSO_IPXIP6, encap_len);
+
+	default:
+		return -EPROTONOSUPPORT;
+	}
 }
 
 int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress)
-- 
2.20.1.791.gb4d0f1c61a-goog


^ permalink raw reply related

* [PATCH bpf-next v11 4/7] ipv6_stub: add ipv6_route_input stub/proxy.
From: Peter Oskolkov @ 2019-02-13 19:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190213195341.184969-1-posk@google.com>

Proxy ip6_route_input via ipv6_stub, for later use by lwt bpf ip encap
(see the next patch in the patchset).

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/net/addrconf.h   | 1 +
 net/ipv6/addrconf_core.c | 6 ++++++
 net/ipv6/af_inet6.c      | 7 +++++++
 3 files changed, 14 insertions(+)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 20d523ee2fec..269ec27385e9 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -248,6 +248,7 @@ struct ipv6_stub {
 				 const struct in6_addr *addr);
 	int (*ipv6_dst_lookup)(struct net *net, struct sock *sk,
 			       struct dst_entry **dst, struct flowi6 *fl6);
+	int (*ipv6_route_input)(struct sk_buff *skb);
 
 	struct fib6_table *(*fib6_get_table)(struct net *net, u32 id);
 	struct fib6_info *(*fib6_lookup)(struct net *net, int oif,
diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c
index 5cd0029d930e..6c79af056d9b 100644
--- a/net/ipv6/addrconf_core.c
+++ b/net/ipv6/addrconf_core.c
@@ -134,6 +134,11 @@ static int eafnosupport_ipv6_dst_lookup(struct net *net, struct sock *u1,
 	return -EAFNOSUPPORT;
 }
 
+static int eafnosupport_ipv6_route_input(struct sk_buff *skb)
+{
+	return -EAFNOSUPPORT;
+}
+
 static struct fib6_table *eafnosupport_fib6_get_table(struct net *net, u32 id)
 {
 	return NULL;
@@ -170,6 +175,7 @@ eafnosupport_ip6_mtu_from_fib6(struct fib6_info *f6i, struct in6_addr *daddr,
 
 const struct ipv6_stub *ipv6_stub __read_mostly = &(struct ipv6_stub) {
 	.ipv6_dst_lookup   = eafnosupport_ipv6_dst_lookup,
+	.ipv6_route_input  = eafnosupport_ipv6_route_input,
 	.fib6_get_table    = eafnosupport_fib6_get_table,
 	.fib6_table_lookup = eafnosupport_fib6_table_lookup,
 	.fib6_lookup       = eafnosupport_fib6_lookup,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index d99753b5e39b..2f45d2a3e3a3 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -900,10 +900,17 @@ static struct pernet_operations inet6_net_ops = {
 	.exit = inet6_net_exit,
 };
 
+static int ipv6_route_input(struct sk_buff *skb)
+{
+	ip6_route_input(skb);
+	return skb_dst(skb)->error;
+}
+
 static const struct ipv6_stub ipv6_stub_impl = {
 	.ipv6_sock_mc_join = ipv6_sock_mc_join,
 	.ipv6_sock_mc_drop = ipv6_sock_mc_drop,
 	.ipv6_dst_lookup   = ip6_dst_lookup,
+	.ipv6_route_input  = ipv6_route_input,
 	.fib6_get_table	   = fib6_get_table,
 	.fib6_table_lookup = fib6_table_lookup,
 	.fib6_lookup       = fib6_lookup,
-- 
2.20.1.791.gb4d0f1c61a-goog


^ permalink raw reply related

* [PATCH bpf-next v11 5/7] bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c
From: Peter Oskolkov @ 2019-02-13 19:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190213195341.184969-1-posk@google.com>

This patch builds on top of the previous patch in the patchset,
which added BPF_LWT_ENCAP_IP mode to bpf_lwt_push_encap. As the
encapping can result in the skb needing to go via a different
interface/route/dst, bpf programs can indicate this by returning
BPF_LWT_REROUTE, which triggers a new route lookup for the skb.

v8 changes: fix kbuild errors when LWTUNNEL_BPF is builtin, but
   IPV6 is a module: as LWTUNNEL_BPF can only be either Y or N,
   call IPV6 routing functions only if they are built-in.

v9 changes:
   - fixed a kbuild test robot compiler warning;
   - call IPV6 routing functions via ipv6_stub.

v10 changes: removed unnecessary IS_ENABLED and pr_warn_once.

v11 changes: fixed a potential dst leak.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 net/core/lwt_bpf.c | 126 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 124 insertions(+), 2 deletions(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 079871fc020f..32251f3fcda0 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -17,6 +17,7 @@
 #include <linux/bpf.h>
 #include <net/lwtunnel.h>
 #include <net/gre.h>
+#include <net/ip6_route.h>
 
 struct bpf_lwt_prog {
 	struct bpf_prog *prog;
@@ -56,6 +57,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 
 	switch (ret) {
 	case BPF_OK:
+	case BPF_LWT_REROUTE:
 		break;
 
 	case BPF_REDIRECT:
@@ -88,6 +90,30 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 	return ret;
 }
 
+static int bpf_lwt_input_reroute(struct sk_buff *skb)
+{
+	int err = -EINVAL;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		struct iphdr *iph = ip_hdr(skb);
+
+		err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
+					   iph->tos, skb_dst(skb)->dev);
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		err = ipv6_stub->ipv6_route_input(skb);
+	} else {
+		err = -EAFNOSUPPORT;
+	}
+
+	if (err)
+		goto err;
+	return dst_input(skb);
+
+err:
+	kfree_skb(skb);
+	return err;
+}
+
 static int bpf_input(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -99,11 +125,11 @@ static int bpf_input(struct sk_buff *skb)
 		ret = run_lwt_bpf(skb, &bpf->in, dst, NO_REDIRECT);
 		if (ret < 0)
 			return ret;
+		if (ret == BPF_LWT_REROUTE)
+			return bpf_lwt_input_reroute(skb);
 	}
 
 	if (unlikely(!dst->lwtstate->orig_input)) {
-		pr_warn_once("orig_input not set on dst for prog %s\n",
-			     bpf->out.name);
 		kfree_skb(skb);
 		return -EINVAL;
 	}
@@ -148,6 +174,91 @@ static int xmit_check_hhlen(struct sk_buff *skb)
 	return 0;
 }
 
+static int bpf_lwt_xmit_reroute(struct sk_buff *skb)
+{
+	struct net_device *l3mdev = l3mdev_master_dev_rcu(skb_dst(skb)->dev);
+	int oif = l3mdev ? l3mdev->ifindex : 0;
+	struct dst_entry *dst = NULL;
+	struct sock *sk;
+	struct net *net;
+	bool ipv4;
+	int err;
+
+	if (skb->protocol == htons(ETH_P_IP))
+		ipv4 = true;
+	else if (skb->protocol == htons(ETH_P_IPV6))
+		ipv4 = false;
+	else
+		return -EAFNOSUPPORT;
+
+	sk = sk_to_full_sk(skb->sk);
+	if (sk) {
+		if (sk->sk_bound_dev_if)
+			oif = sk->sk_bound_dev_if;
+		net = sock_net(sk);
+	} else {
+		net = dev_net(skb_dst(skb)->dev);
+	}
+
+	if (ipv4) {
+		struct iphdr *iph = ip_hdr(skb);
+		struct flowi4 fl4 = {};
+		struct rtable *rt;
+
+		fl4.flowi4_oif = oif;
+		fl4.flowi4_mark = skb->mark;
+		fl4.flowi4_uid = sock_net_uid(net, sk);
+		fl4.flowi4_tos = RT_TOS(iph->tos);
+		fl4.flowi4_flags = FLOWI_FLAG_ANYSRC;
+		fl4.flowi4_proto = iph->protocol;
+		fl4.daddr = iph->daddr;
+		fl4.saddr = iph->saddr;
+
+		rt = ip_route_output_key(net, &fl4);
+		if (IS_ERR(rt))
+			return -EINVAL;
+		dst = &rt->dst;
+	} else {
+		struct ipv6hdr *iph6 = ipv6_hdr(skb);
+		struct flowi6 fl6 = {};
+
+		fl6.flowi6_oif = oif;
+		fl6.flowi6_mark = skb->mark;
+		fl6.flowi6_uid = sock_net_uid(net, sk);
+		fl6.flowlabel = ip6_flowinfo(iph6);
+		fl6.flowi6_proto = iph6->nexthdr;
+		fl6.daddr = iph6->daddr;
+		fl6.saddr = iph6->saddr;
+
+		err = ipv6_stub->ipv6_dst_lookup(net, skb->sk, &dst, &fl6);
+		if (err || IS_ERR(dst))
+			return -EINVAL;
+	}
+	if (unlikely(dst->error)) {
+		dst_release(dst);
+		return -EINVAL;
+	}
+
+	/* Although skb header was reserved in bpf_lwt_push_ip_encap(), it
+	 * was done for the previous dst, so we are doing it here again, in
+	 * case the new dst needs much more space. The call below is a noop
+	 * if there is enough header space in skb.
+	 */
+	err = skb_cow_head(skb, LL_RESERVED_SPACE(dst->dev));
+	if (unlikely(err))
+		return err;
+
+	skb_dst_drop(skb);
+	skb_dst_set(skb, dst);
+
+	err = dst_output(dev_net(skb_dst(skb)->dev), skb->sk, skb);
+	if (unlikely(err))
+		return err;
+
+	/* ip[6]_finish_output2 understand LWTUNNEL_XMIT_DONE */
+	return LWTUNNEL_XMIT_DONE;
+}
+
 static int bpf_xmit(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -155,11 +266,20 @@ static int bpf_xmit(struct sk_buff *skb)
 
 	bpf = bpf_lwt_lwtunnel(dst->lwtstate);
 	if (bpf->xmit.prog) {
+		__be16 proto = skb->protocol;
 		int ret;
 
 		ret = run_lwt_bpf(skb, &bpf->xmit, dst, CAN_REDIRECT);
 		switch (ret) {
 		case BPF_OK:
+			/* If the header changed, e.g. via bpf_lwt_push_encap,
+			 * BPF_LWT_REROUTE below should have been used if the
+			 * protocol was also changed.
+			 */
+			if (skb->protocol != proto) {
+				kfree_skb(skb);
+				return -EINVAL;
+			}
 			/* If the header was expanded, headroom might be too
 			 * small for L2 header to come, expand as needed.
 			 */
@@ -170,6 +290,8 @@ static int bpf_xmit(struct sk_buff *skb)
 			return LWTUNNEL_XMIT_CONTINUE;
 		case BPF_REDIRECT:
 			return LWTUNNEL_XMIT_DONE;
+		case BPF_LWT_REROUTE:
+			return bpf_lwt_xmit_reroute(skb);
 		default:
 			return ret;
 		}
-- 
2.20.1.791.gb4d0f1c61a-goog


^ permalink raw reply related

* [PATCH bpf-next v11 6/7] bpf: sync <kdir>/include/.../bpf.h with tools/include/.../bpf.h
From: Peter Oskolkov @ 2019-02-13 19:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190213195341.184969-1-posk@google.com>

This patch copies changes in bpf.h done by a previous patch
in this patchset from the kernel uapi include dir into tools
uapi include dir.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/include/uapi/linux/bpf.h | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 25c8c0e62ecf..bcdd2474eee7 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2016,6 +2016,19 @@ union bpf_attr {
  *			Only works if *skb* contains an IPv6 packet. Insert a
  *			Segment Routing Header (**struct ipv6_sr_hdr**) inside
  *			the IPv6 header.
+ *		**BPF_LWT_ENCAP_IP**
+ *			IP encapsulation (GRE/GUE/IPIP/etc). The outer header
+ *			must be IPv4 or IPv6, followed by zero or more
+ *			additional headers, up to LWT_BPF_MAX_HEADROOM total
+ *			bytes in all prepended headers. Please note that
+ *			if skb_is_gso(skb) is true, no more than two headers
+ *			can be prepended, and the inner header, if present,
+ *			should be either GRE or UDP/GUE.
+ *
+ *		BPF_LWT_ENCAP_SEG6*** types can be called by bpf programs of
+ *		type BPF_PROG_TYPE_LWT_IN; BPF_LWT_ENCAP_IP type can be called
+ *		by bpf programs of types BPF_PROG_TYPE_LWT_IN and
+ *		BPF_PROG_TYPE_LWT_XMIT.
  *
  * 		A call to this helper is susceptible to change the underlaying
  * 		packet buffer. Therefore, at load time, all checks on pointers
@@ -2517,7 +2530,8 @@ enum bpf_hdr_start_off {
 /* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
 enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_SEG6,
-	BPF_LWT_ENCAP_SEG6_INLINE
+	BPF_LWT_ENCAP_SEG6_INLINE,
+	BPF_LWT_ENCAP_IP,
 };
 
 #define __bpf_md_ptr(type, name)	\
@@ -2606,7 +2620,15 @@ enum bpf_ret_code {
 	BPF_DROP = 2,
 	/* 3-6 reserved */
 	BPF_REDIRECT = 7,
-	/* >127 are reserved for prog type specific return codes */
+	/* >127 are reserved for prog type specific return codes.
+	 *
+	 * BPF_LWT_REROUTE: used by BPF_PROG_TYPE_LWT_IN and
+	 *    BPF_PROG_TYPE_LWT_XMIT to indicate that skb had been
+	 *    changed and should be routed based on its new L3 header.
+	 *    (This is an L3 redirect, as opposed to L2 redirect
+	 *    represented by BPF_REDIRECT above).
+	 */
+	BPF_LWT_REROUTE = 128,
 };
 
 struct bpf_sock {
-- 
2.20.1.791.gb4d0f1c61a-goog


^ permalink raw reply related

* [PATCH bpf-next v11 7/7] selftests: bpf: add test_lwt_ip_encap selftest
From: Peter Oskolkov @ 2019-02-13 19:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190213195341.184969-1-posk@google.com>

This patch adds a bpf self-test to cover BPF_LWT_ENCAP_IP mode
in bpf_lwt_push_encap.

Covered:
- encapping in LWT_IN and LWT_XMIT
- IPv4 and IPv6

A follow-up patch will add GSO and VRF-enabled tests.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../selftests/bpf/progs/test_lwt_ip_encap.c   |  85 +++++
 .../selftests/bpf/test_lwt_ip_encap.sh        | 311 ++++++++++++++++++
 3 files changed, 398 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_lwt_ip_encap.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_ip_encap.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index c3edf47da05d..ccffaa0a0787 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -50,7 +50,8 @@ TEST_PROGS := test_kmod.sh \
 	test_lirc_mode2.sh \
 	test_skb_cgroup_id.sh \
 	test_flow_dissector.sh \
-	test_xdp_vlan.sh
+	test_xdp_vlan.sh \
+	test_lwt_ip_encap.sh
 
 TEST_PROGS_EXTENDED := with_addr.sh \
 	with_tunnels.sh \
diff --git a/tools/testing/selftests/bpf/progs/test_lwt_ip_encap.c b/tools/testing/selftests/bpf/progs/test_lwt_ip_encap.c
new file mode 100644
index 000000000000..c957d6dfe6d7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_lwt_ip_encap.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stddef.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+struct grehdr {
+	__be16 flags;
+	__be16 protocol;
+};
+
+SEC("encap_gre")
+int bpf_lwt_encap_gre(struct __sk_buff *skb)
+{
+	struct encap_hdr {
+		struct iphdr iph;
+		struct grehdr greh;
+	} hdr;
+	int err;
+
+	memset(&hdr, 0, sizeof(struct encap_hdr));
+
+	hdr.iph.ihl = 5;
+	hdr.iph.version = 4;
+	hdr.iph.ttl = 0x40;
+	hdr.iph.protocol = 47;  /* IPPROTO_GRE */
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	hdr.iph.saddr = 0x640110ac;  /* 172.16.1.100 */
+	hdr.iph.daddr = 0x641010ac;  /* 172.16.16.100 */
+#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+	hdr.iph.saddr = 0xac100164;  /* 172.16.1.100 */
+	hdr.iph.daddr = 0xac101064;  /* 172.16.16.100 */
+#else
+#error "Fix your compiler's __BYTE_ORDER__?!"
+#endif
+	hdr.iph.tot_len = bpf_htons(skb->len + sizeof(struct encap_hdr));
+
+	hdr.greh.protocol = skb->protocol;
+
+	err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr,
+				 sizeof(struct encap_hdr));
+	if (err)
+		return BPF_DROP;
+
+	return BPF_LWT_REROUTE;
+}
+
+SEC("encap_gre6")
+int bpf_lwt_encap_gre6(struct __sk_buff *skb)
+{
+	struct encap_hdr {
+		struct ipv6hdr ip6hdr;
+		struct grehdr greh;
+	} hdr;
+	int err;
+
+	memset(&hdr, 0, sizeof(struct encap_hdr));
+
+	hdr.ip6hdr.version = 6;
+	hdr.ip6hdr.payload_len = bpf_htons(skb->len + sizeof(struct grehdr));
+	hdr.ip6hdr.nexthdr = 47;  /* IPPROTO_GRE */
+	hdr.ip6hdr.hop_limit = 0x40;
+	/* fb01::1 */
+	hdr.ip6hdr.saddr.s6_addr[0] = 0xfb;
+	hdr.ip6hdr.saddr.s6_addr[1] = 1;
+	hdr.ip6hdr.saddr.s6_addr[15] = 1;
+	/* fb10::1 */
+	hdr.ip6hdr.daddr.s6_addr[0] = 0xfb;
+	hdr.ip6hdr.daddr.s6_addr[1] = 0x10;
+	hdr.ip6hdr.daddr.s6_addr[15] = 1;
+
+	hdr.greh.protocol = skb->protocol;
+
+	err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr,
+				 sizeof(struct encap_hdr));
+	if (err)
+		return BPF_DROP;
+
+	return BPF_LWT_REROUTE;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
new file mode 100755
index 000000000000..4ca714e23ab0
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
@@ -0,0 +1,311 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Setup/topology:
+#
+#    NS1             NS2             NS3
+#   veth1 <---> veth2   veth3 <---> veth4 (the top route)
+#   veth5 <---> veth6   veth7 <---> veth8 (the bottom route)
+#
+#   each vethN gets IPv[4|6]_N address
+#
+#   IPv*_SRC = IPv*_1
+#   IPv*_DST = IPv*_4
+#
+#   all tests test pings from IPv*_SRC to IPv*_DST
+#
+#   by default, routes are configured to allow packets to go
+#   IP*_1 <=> IP*_2 <=> IP*_3 <=> IP*_4 (the top route)
+#
+#   a GRE device is installed in NS3 with IPv*_GRE, and
+#   NS1/NS2 are configured to route packets to IPv*_GRE via IP*_8
+#   (the bottom route)
+#
+# Tests:
+#
+#   1. routes NS2->IPv*_DST are brought down, so the only way a ping
+#      from IP*_SRC to IP*_DST can work is via IPv*_GRE
+#
+#   2a. in an egress test, a bpf LWT_XMIT program is installed on veth1
+#       that encaps the packets with an IP/GRE header to route to IPv*_GRE
+#
+#       ping: SRC->[encap at veth1:egress]->GRE:decap->DST
+#       ping replies go DST->SRC directly
+#
+#   2b. in an ingress test, a bpf LWT_IN program is installed on veth2
+#       that encaps the packets with an IP/GRE header to route to IPv*_GRE
+#
+#       ping: SRC->[encap at veth2:ingress]->GRE:decap->DST
+#       ping replies go DST->SRC directly
+
+set -e  # exit on error
+
+if [[ $EUID -ne 0 ]]; then
+	echo "This script must be run as root"
+	echo "FAIL"
+	exit 1
+fi
+
+readonly NS1="ns1-$(mktemp -u XXXXXX)"
+readonly NS2="ns2-$(mktemp -u XXXXXX)"
+readonly NS3="ns3-$(mktemp -u XXXXXX)"
+
+readonly IPv4_1="172.16.1.100"
+readonly IPv4_2="172.16.2.100"
+readonly IPv4_3="172.16.3.100"
+readonly IPv4_4="172.16.4.100"
+readonly IPv4_5="172.16.5.100"
+readonly IPv4_6="172.16.6.100"
+readonly IPv4_7="172.16.7.100"
+readonly IPv4_8="172.16.8.100"
+readonly IPv4_GRE="172.16.16.100"
+
+readonly IPv4_SRC=$IPv4_1
+readonly IPv4_DST=$IPv4_4
+
+readonly IPv6_1="fb01::1"
+readonly IPv6_2="fb02::1"
+readonly IPv6_3="fb03::1"
+readonly IPv6_4="fb04::1"
+readonly IPv6_5="fb05::1"
+readonly IPv6_6="fb06::1"
+readonly IPv6_7="fb07::1"
+readonly IPv6_8="fb08::1"
+readonly IPv6_GRE="fb10::1"
+
+readonly IPv6_SRC=$IPv6_1
+readonly IPv6_DST=$IPv6_4
+
+setup() {
+set -e  # exit on error
+	# create devices and namespaces
+	ip netns add "${NS1}"
+	ip netns add "${NS2}"
+	ip netns add "${NS3}"
+
+	ip link add veth1 type veth peer name veth2
+	ip link add veth3 type veth peer name veth4
+	ip link add veth5 type veth peer name veth6
+	ip link add veth7 type veth peer name veth8
+
+	ip netns exec ${NS2} sysctl -wq net.ipv4.ip_forward=1
+	ip netns exec ${NS2} sysctl -wq net.ipv6.conf.all.forwarding=1
+
+	ip link set veth1 netns ${NS1}
+	ip link set veth2 netns ${NS2}
+	ip link set veth3 netns ${NS2}
+	ip link set veth4 netns ${NS3}
+	ip link set veth5 netns ${NS1}
+	ip link set veth6 netns ${NS2}
+	ip link set veth7 netns ${NS2}
+	ip link set veth8 netns ${NS3}
+
+	# configure addesses: the top route (1-2-3-4)
+	ip -netns ${NS1}    addr add ${IPv4_1}/24  dev veth1
+	ip -netns ${NS2}    addr add ${IPv4_2}/24  dev veth2
+	ip -netns ${NS2}    addr add ${IPv4_3}/24  dev veth3
+	ip -netns ${NS3}    addr add ${IPv4_4}/24  dev veth4
+	ip -netns ${NS1} -6 addr add ${IPv6_1}/128 nodad dev veth1
+	ip -netns ${NS2} -6 addr add ${IPv6_2}/128 nodad dev veth2
+	ip -netns ${NS2} -6 addr add ${IPv6_3}/128 nodad dev veth3
+	ip -netns ${NS3} -6 addr add ${IPv6_4}/128 nodad dev veth4
+
+	# configure addresses: the bottom route (5-6-7-8)
+	ip -netns ${NS1}    addr add ${IPv4_5}/24  dev veth5
+	ip -netns ${NS2}    addr add ${IPv4_6}/24  dev veth6
+	ip -netns ${NS2}    addr add ${IPv4_7}/24  dev veth7
+	ip -netns ${NS3}    addr add ${IPv4_8}/24  dev veth8
+	ip -netns ${NS1} -6 addr add ${IPv6_5}/128 nodad dev veth5
+	ip -netns ${NS2} -6 addr add ${IPv6_6}/128 nodad dev veth6
+	ip -netns ${NS2} -6 addr add ${IPv6_7}/128 nodad dev veth7
+	ip -netns ${NS3} -6 addr add ${IPv6_8}/128 nodad dev veth8
+
+
+	ip -netns ${NS1} link set dev veth1 up
+	ip -netns ${NS2} link set dev veth2 up
+	ip -netns ${NS2} link set dev veth3 up
+	ip -netns ${NS3} link set dev veth4 up
+	ip -netns ${NS1} link set dev veth5 up
+	ip -netns ${NS2} link set dev veth6 up
+	ip -netns ${NS2} link set dev veth7 up
+	ip -netns ${NS3} link set dev veth8 up
+
+	# configure routes: IP*_SRC -> veth1/IP*_2 (= top route) default;
+	# the bottom route to specific bottom addresses
+
+	# NS1
+	# top route
+	ip -netns ${NS1}    route add ${IPv4_2}/32  dev veth1
+	ip -netns ${NS1}    route add default dev veth1 via ${IPv4_2}  # go top by default
+	ip -netns ${NS1} -6 route add ${IPv6_2}/128 dev veth1
+	ip -netns ${NS1} -6 route add default dev veth1 via ${IPv6_2}  # go top by default
+	# bottom route
+	ip -netns ${NS1}    route add ${IPv4_6}/32  dev veth5
+	ip -netns ${NS1}    route add ${IPv4_7}/32  dev veth5 via ${IPv4_6}
+	ip -netns ${NS1}    route add ${IPv4_8}/32  dev veth5 via ${IPv4_6}
+	ip -netns ${NS1} -6 route add ${IPv6_6}/128 dev veth5
+	ip -netns ${NS1} -6 route add ${IPv6_7}/128 dev veth5 via ${IPv6_6}
+	ip -netns ${NS1} -6 route add ${IPv6_8}/128 dev veth5 via ${IPv6_6}
+
+	# NS2
+	# top route
+	ip -netns ${NS2}    route add ${IPv4_1}/32  dev veth2
+	ip -netns ${NS2}    route add ${IPv4_4}/32  dev veth3
+	ip -netns ${NS2} -6 route add ${IPv6_1}/128 dev veth2
+	ip -netns ${NS2} -6 route add ${IPv6_4}/128 dev veth3
+	# bottom route
+	ip -netns ${NS2}    route add ${IPv4_5}/32  dev veth6
+	ip -netns ${NS2}    route add ${IPv4_8}/32  dev veth7
+	ip -netns ${NS2} -6 route add ${IPv6_5}/128 dev veth6
+	ip -netns ${NS2} -6 route add ${IPv6_8}/128 dev veth7
+
+	# NS3
+	# top route
+	ip -netns ${NS3}    route add ${IPv4_3}/32  dev veth4
+	ip -netns ${NS3}    route add ${IPv4_1}/32  dev veth4 via ${IPv4_3}
+	ip -netns ${NS3}    route add ${IPv4_2}/32  dev veth4 via ${IPv4_3}
+	ip -netns ${NS3} -6 route add ${IPv6_3}/128 dev veth4
+	ip -netns ${NS3} -6 route add ${IPv6_1}/128 dev veth4 via ${IPv6_3}
+	ip -netns ${NS3} -6 route add ${IPv6_2}/128 dev veth4 via ${IPv6_3}
+	# bottom route
+	ip -netns ${NS3}    route add ${IPv4_7}/32  dev veth8
+	ip -netns ${NS3}    route add ${IPv4_5}/32  dev veth8 via ${IPv4_7}
+	ip -netns ${NS3}    route add ${IPv4_6}/32  dev veth8 via ${IPv4_7}
+	ip -netns ${NS3} -6 route add ${IPv6_7}/128 dev veth8
+	ip -netns ${NS3} -6 route add ${IPv6_5}/128 dev veth8 via ${IPv6_7}
+	ip -netns ${NS3} -6 route add ${IPv6_6}/128 dev veth8 via ${IPv6_7}
+
+	# configure IPv4 GRE device in NS3, and a route to it via the "bottom" route
+	ip -netns ${NS3} tunnel add gre_dev mode gre remote ${IPv4_1} local ${IPv4_GRE} ttl 255
+	ip -netns ${NS3} link set gre_dev up
+	ip -netns ${NS3} addr add ${IPv4_GRE} dev gre_dev
+	ip -netns ${NS1} route add ${IPv4_GRE}/32 dev veth5 via ${IPv4_6}
+	ip -netns ${NS2} route add ${IPv4_GRE}/32 dev veth7 via ${IPv4_8}
+
+
+	# configure IPv6 GRE device in NS3, and a route to it via the "bottom" route
+	ip -netns ${NS3} -6 tunnel add name gre6_dev mode ip6gre remote ${IPv6_1} local ${IPv6_GRE} ttl 255
+	ip -netns ${NS3} link set gre6_dev up
+	ip -netns ${NS3} -6 addr add ${IPv6_GRE} nodad dev gre6_dev
+	ip -netns ${NS1} -6 route add ${IPv6_GRE}/128 dev veth5 via ${IPv6_6}
+	ip -netns ${NS2} -6 route add ${IPv6_GRE}/128 dev veth7 via ${IPv6_8}
+
+	# rp_filter gets confused by what these tests are doing, so disable it
+	ip netns exec ${NS1} sysctl -wq net.ipv4.conf.all.rp_filter=0
+	ip netns exec ${NS2} sysctl -wq net.ipv4.conf.all.rp_filter=0
+	ip netns exec ${NS3} sysctl -wq net.ipv4.conf.all.rp_filter=0
+}
+
+cleanup() {
+	ip netns del ${NS1} 2> /dev/null
+	ip netns del ${NS2} 2> /dev/null
+	ip netns del ${NS3} 2> /dev/null
+}
+
+trap cleanup EXIT
+
+test_ping() {
+	local readonly PROTO=$1
+	local readonly EXPECTED=$2
+	local RET=0
+
+	set +e
+	if [ "${PROTO}" == "IPv4" ] ; then
+		ip netns exec ${NS1} ping  -c 1 -W 1 -I ${IPv4_SRC} ${IPv4_DST} 2>&1 > /dev/null
+		RET=$?
+	elif [ "${PROTO}" == "IPv6" ] ; then
+		ip netns exec ${NS1} ping6 -c 1 -W 6 -I ${IPv6_SRC} ${IPv6_DST} 2>&1 > /dev/null
+		RET=$?
+	else
+		echo "test_ping: unknown PROTO: ${PROTO}"
+		exit 1
+	fi
+	set -e
+
+	if [ "0" != "${RET}" ]; then
+		RET=1
+	fi
+
+	if [ "${EXPECTED}" != "${RET}" ] ; then
+		echo "FAIL: test_ping: ${RET}"
+		exit 1
+	fi
+}
+
+test_egress() {
+	local readonly ENCAP=$1
+	echo "starting egress ${ENCAP} encap test"
+	setup
+
+	# need to wait a bit for IPv6 to autoconf, otherwise
+	# ping6 sometimes fails with "unable to bind to address"
+
+	# by default, pings work
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	# remove NS2->DST routes, ping fails
+	ip -netns ${NS2}    route del ${IPv4_DST}/32  dev veth3
+	ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3
+	test_ping IPv4 1
+	test_ping IPv6 1
+
+	# install replacement routes (LWT/eBPF), pings succeed
+	if [ "${ENCAP}" == "IPv4" ] ; then
+		ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1
+		ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1
+	elif [ "${ENCAP}" == "IPv6" ] ; then
+		ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
+		ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
+	else
+		echo "FAIL: unknown encap ${ENCAP}"
+	fi
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	cleanup
+	echo "PASS"
+}
+
+test_ingress() {
+	local readonly ENCAP=$1
+	echo "starting ingress ${ENCAP} encap test"
+	setup
+
+	# need to wait a bit for IPv6 to autoconf, otherwise
+	# ping6 sometimes fails with "unable to bind to address"
+
+	# by default, pings work
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	# remove NS2->DST routes, pings fail
+	ip -netns ${NS2}    route del ${IPv4_DST}/32  dev veth3
+	ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3
+	test_ping IPv4 1
+	test_ping IPv6 1
+
+	# install replacement routes (LWT/eBPF), pings succeed
+	if [ "${ENCAP}" == "IPv4" ] ; then
+		ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2
+		ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2
+	elif [ "${ENCAP}" == "IPv6" ] ; then
+		ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2
+		ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2
+	else
+		echo "FAIL: unknown encap ${ENCAP}"
+	fi
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	cleanup
+	echo "PASS"
+}
+
+test_egress IPv4
+test_egress IPv6
+
+test_ingress IPv4
+test_ingress IPv6
+
+echo "all tests passed"
-- 
2.20.1.791.gb4d0f1c61a-goog


^ permalink raw reply related

* Re: [PATCH bpf-next v10 5/7] bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c
From: Peter Oskolkov @ 2019-02-13 19:57 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Peter Oskolkov,
	Willem de Bruijn
In-Reply-To: <681aca28-b4e5-eb0d-46cd-94db7a2c368c@gmail.com>

On Tue, Feb 12, 2019 at 6:58 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 2/12/19 10:32 AM, Peter Oskolkov wrote:
> > @@ -148,6 +174,87 @@ static int xmit_check_hhlen(struct sk_buff *skb)
> >       return 0;
> >  }
> >
> > +static int bpf_lwt_xmit_reroute(struct sk_buff *skb)
> > +{
> > +     struct net_device *l3mdev = l3mdev_master_dev_rcu(skb_dst(skb)->dev);
> > +     int oif = l3mdev ? l3mdev->ifindex : 0;
> > +     struct dst_entry *dst = NULL;
> > +     struct sock *sk;
> > +     struct net *net;
> > +     bool ipv4;
> > +     int err;
> > +
> > +     if (skb->protocol == htons(ETH_P_IP))
> > +             ipv4 = true;
> > +     else if (skb->protocol == htons(ETH_P_IPV6))
> > +             ipv4 = false;
> > +     else
> > +             return -EAFNOSUPPORT;
> > +
> > +     sk = sk_to_full_sk(skb->sk);
> > +     if (sk) {
> > +             if (sk->sk_bound_dev_if)
> > +                     oif = sk->sk_bound_dev_if;
> > +             net = sock_net(sk);
> > +     } else {
> > +             net = dev_net(skb_dst(skb)->dev);
> > +     }
> > +
> > +     if (ipv4) {
> > +             struct iphdr *iph = ip_hdr(skb);
> > +             struct flowi4 fl4 = {};
> > +             struct rtable *rt;
> > +
> > +             fl4.flowi4_oif = oif;
> > +             fl4.flowi4_mark = skb->mark;
> > +             fl4.flowi4_uid = sock_net_uid(net, sk);
> > +             fl4.flowi4_tos = RT_TOS(iph->tos);
> > +             fl4.flowi4_flags = FLOWI_FLAG_ANYSRC;
> > +             fl4.flowi4_proto = iph->protocol;
> > +             fl4.daddr = iph->daddr;
> > +             fl4.saddr = iph->saddr;
> > +
> > +             rt = ip_route_output_key(net, &fl4);
> > +             if (IS_ERR(rt) || rt->dst.error)
> > +                     return -EINVAL;
>
> I think you have a dst leak here if rt is valid but the lookup is a
> reject (e.g., unreachable or blackhole).

Thanks, David! I was not able to reproduce the leak, but based on your
suggestion and similar code elsewhere I made a change in v11 to explicitly
release a dst with error.

>
> > +             dst = &rt->dst;
> > +     } else {
> > +             struct ipv6hdr *iph6 = ipv6_hdr(skb);
> > +             struct flowi6 fl6 = {};
> > +
> > +             fl6.flowi6_oif = oif;
> > +             fl6.flowi6_mark = skb->mark;
> > +             fl6.flowi6_uid = sock_net_uid(net, sk);
> > +             fl6.flowlabel = ip6_flowinfo(iph6);
> > +             fl6.flowi6_proto = iph6->nexthdr;
> > +             fl6.daddr = iph6->daddr;
> > +             fl6.saddr = iph6->saddr;
> > +
> > +             err = ipv6_stub->ipv6_dst_lookup(net, skb->sk, &dst, &fl6);
> > +             if (err || IS_ERR(dst) || dst->error)
> > +                     return -EINVAL;
>
> same here.
>
> You could check this by adding a route with unreachable as the target in
> your tests. Test cleanup and namespace teardown will tell you pretty quick.

^ permalink raw reply

* Re: [PATCH] net: phy: at803x: disable delay only for RGMII mode
From: Niklas Cassel @ 2019-02-13 20:07 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Marc Gonzalez, Andrew Lunn, Vinod Koul, David S Miller,
	linux-arm-msm, Bjorn Andersson, netdev, Nori, Sekhar,
	Peter Ujfalusi, hkallweit1
In-Reply-To: <34037b72-b082-89fa-f586-8c032ebe5aea@gmail.com>

On Wed, Feb 13, 2019 at 09:59:43AM -0800, Florian Fainelli wrote:
> On 2/13/19 9:40 AM, Niklas Cassel wrote:
> > On Wed, Feb 13, 2019 at 02:40:18PM +0100, Marc Gonzalez wrote:
> >> On 13/02/2019 14:29, Andrew Lunn wrote:
> >>
> >>>> So we have these modes:
> >>>>
> >>>> PHY_INTERFACE_MODE_RGMII: TX and RX delays disabled
> >>>> PHY_INTERFACE_MODE_RGMII_ID: TX and RX delays enabled
> >>>> PHY_INTERFACE_MODE_RGMII_RXID: RX delay enabled, TX delay disabled
> >>>> PHY_INTERFACE_MODE_RGMII_TXID: TX delay enabled, RX delay disabled
> >>>>
> >>>> What I don't like with this patch, is that if we specify phy-mode
> >>>> PHY_INTERFACE_MODE_RGMII_TXID, this patch will enable TX delay,
> >>>> but RX delay will not be explicitly set.
> >>>
> >>> That is not the behaviour we want. It is best to assume the device is
> >>> in a random state, and correctly enable/disable all delays as
> >>> requested. Only leave the hardware alone if PHY_INTERFACE_MODE_NA is
> >>> used.
> >>
> >> That's what my patch did:
> >> https://www.spinics.net/lists/netdev/msg445053.html
> >>
> >> But see Florian's remarks:
> >> https://www.spinics.net/lists/netdev/msg445133.html
> > 
> > Hello Marc,
> > 
> > I saw that comment from Florian. However that was way back in 2017.
> > Maybe the phy-modes were not as well defined back then?
> 
> The definition of the 'phy-mode' was clarified to be understood from the
> perspective of the PHY device (hence the name) after we had several
> fruitful exchanges with Marc (at least from my perspective), but since
> the definition was not clear before, there is a high chance of finding
> DTS/DTBs out there with the 'phy-mode' property understood from the
> MAC's perspective, which would now be wrong.

Hello Florian,


We have a specification:
Documentation/devicetree/bindings/net/ethernet.txt

And several implementations: the PHY drivers.

Either we decide that all PHY drivers have to follow
the specification for "phy-mode" in
Documentation/devicetree/bindings/net/ethernet.txt
or we decide that they don't.

If we decide that all PHY drivers have to follow the specification,
then we can fix the PHY drivers that currently do not follow the
specification.

If we decide that all PHY drivers do not have to follow the spec,
then the "phy-mode" property is basically useless, and then we should
introduce a new device tree property, e.g. "phy-mode2", that is
guaranteed to respect the definitons in
Documentation/devicetree/bindings/net/ethernet.txt

> 
> 
> > 
> > Andrew recently suggested to fix the driver so that it conforms with the
> > phy-modes, and fix any SoC that specified an incorrect phy-mode in DT
> > and thus relied upon the broken behavior of the PHY driver:
> > https://www.spinics.net/lists/netdev/msg445133.html
> > 
> > 
> > So, I've rebased your old patch, see attachment.
> > I suggest that Peter test it on am335x-evm.
> > 
> > am335x-evm appears to rely on the current broken behavior of the PHY
> > driver, so we will probably need to fix the am335x-evm according to this:
> > https://www.spinics.net/lists/netdev/msg445117.html
> > and merge that as well.
> > 
> > 
> > Andrew, Florian, do you both agree?
> 
> In my reply to Marc, there was a concern that while am335x-evm was
> identified and reported to be broken after fixing the PHY driver, there
> could be platforms out there that we have little to no visibility that
> would most likely be equally broken. That concern still exists, and I
> don't think there is anything we can do to even assess the size of the
> problem unless we attempt to fix it, so maybe we should attempt to fix that.
> 
> There was a suggestion to Marc that one way to possibly "ignore" an
> incorrectly broken 'phy-mode' property would be to allow specifying
> rx/tx delay properties such that if the driver obtained its
> phy_interface_t, yet still parsed rx/tx delays, the rx/tx delays would
> take precedence, and we could possibly derive some sort of a "more
> correct" phy_interface_t that we could assign back to phydev->interface
> and issue a warning about that.

You mean to add new device tree properties to
Documentation/devicetree/bindings/net/ethernet.txt

- phy-id-tx: "true" if PHY should add internal delay on TX lines;
             "false" or not specified if PHY should not add internal
	     delay on TX lines. This property overrides any delay
	     requested by "phy-mode".
- phy-id-rx: "true" if PHY should add internal delay on RX lines;
             "false" or not specified if PHY should not add internal
	     delay on RX lines. This property overrides any delay
	     requested by "phy-mode".

Perhaps something like that?

Personally, I prefer making "phy-mode" strict,
but whatever you guys decide:
- making "phy-mode" strict
- introducing a "phy-mode2"
- introducing "phy-id-tx/phy-id-rx"
- introducing "mac-mode"
- some other solution

It is probably wise to introduce helper functions in phy.h
 phy_wants_id_rx()
 phy_wants_id_tx()
so that PHY drivers can simply use e.g.:

if (phy_wants_id_rx(phydev))
	at803x_enable_rx_delay(phydev);
else
	at803x_disable_rx_delay(phydev);

if (phy_wants_id_tx(phydev))
	at803x_enable_tx_delay(phydev);
else
	at803x_disable_tx_delay(phydev);

> 
> Another possible way to resolve that could be to introduce a 'mac-mode'
> property, which must be strictly compatible with specifying a 'phy-mode'
> property. For instance:
> 
> - MAC specifies mac-mode = 'rgmii-id', then the PHY must have phy-mode =
> 'rmgii' since the MAC is taking of inserting both RX and TX delays,
> reverse also applies
> 
> - MAC specifies mac-mode = 'rgmii-txid', then the PHY must have phy-mode
> = 'rgmii-rxid' because the MAC adds the TX delay, but the PHY should
> insert the delay on the RX lines, reverse also applies
> 
> Because there is usually (not always, DSA is an exception) a 1:1 mapping
> between MAC and PHY devices we could look up the 'mac-mode' property in
> the MAC in the PHY library code and make sure that we have a compatible
> matrix and if we do not, maybe pass something like PHY_INTERFACE_MODE_NA
> such that the driver retains its settings.

Is there any advantage of creating a "mac-mode" over creating a
"phy-mode2" ?


Kind regards,
Niklas

> 
> Maybe another way to approach this is if we assume that the PHY comes up
> configured correctly by the boot loader, or upon power on reset, we add
> some PHY driver methods that allow us to determine the RGMII mode in
> which a PHY is and that tells us whether we are compatible with the
> MAC's phy_interface_t upon connection. We check both at connect() time
> and if something does not look right, we flip the meaning of
> phy_interface_t.
> 
> None of those solutions are entirely fool proof, but at least we might
> be able to detect incorrect combinations, yet still make them work by
> reversing the meaning of the 'phy-mode' property given information at hand.
> 
> Let me know if none of that makes sense and this just looks like yet
> another brain dump.
> 
> Wonderful RGMII...
> -- 
> Florian

^ permalink raw reply

* [RFC iproute2] ip route: get: allow zero-length subnet mask
From: Luca Boccassi @ 2019-02-13 20:09 UTC (permalink / raw)
  To: netdev; +Cc: stephen, Luca Boccassi, Clément Hertling

A /0 subnet mask is theoretically valid, but ip route get doesn't allow
it:

$ ip route get 1.0.0.0/0
need at least a destination address

Remove the check so that it can go through:

$ ip/ip route get 1.0.0.0/0
1.0.0.0 via 192.168.1.1 dev eth0 src 192.168.1.91 uid 1000
    cache

Reported-by: Clément Hertling <wxcafe@wxcafe.net>
Signed-off-by: Luca Boccassi <bluca@debian.org>
---
Stephen et al, this was reported by a Debian user:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=921737

It makes sense to me at a cursory glance, but sending as RFC as I'm
not 100% familiar with the route get function.

 ip/iproute.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 5f58a3b3..d78f43d8 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -2041,11 +2041,6 @@ static int iproute_get(int argc, char **argv)
 		argc--; argv++;
 	}
 
-	if (req.r.rtm_dst_len == 0) {
-		fprintf(stderr, "need at least a destination address\n");
-		return -1;
-	}
-
 	if (idev || odev)  {
 		int idx;
 
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH bpf-next v10 5/7] bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c
From: David Ahern @ 2019-02-13 20:11 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Peter Oskolkov,
	Willem de Bruijn
In-Reply-To: <CAPNVh5eFMaXAdbhkn3Le5eQ-ZYaf2kWjKAxf4dfW9tYhyyXAKQ@mail.gmail.com>

On 2/13/19 12:57 PM, Peter Oskolkov wrote:
> Thanks, David! I was not able to reproduce the leak, but based on your
> suggestion and similar code elsewhere I made a change in v11 to explicitly
> release a dst with error.

ok. Did you run the test with a debug kernel - checking refcount, use
after free, etc?

^ permalink raw reply

* Re: [RFC iproute2] ip route: get: allow zero-length subnet mask
From: Stephen Hemminger @ 2019-02-13 20:37 UTC (permalink / raw)
  To: Luca Boccassi; +Cc: netdev, Clément Hertling
In-Reply-To: <20190213200954.32271-1-bluca@debian.org>

On Wed, 13 Feb 2019 20:09:53 +0000
Luca Boccassi <bluca@debian.org> wrote:

> A /0 subnet mask is theoretically valid, but ip route get doesn't allow
> it:
> 
> $ ip route get 1.0.0.0/0
> need at least a destination address
> 
> Remove the check so that it can go through:
> 
> $ ip/ip route get 1.0.0.0/0
> 1.0.0.0 via 192.168.1.1 dev eth0 src 192.168.1.91 uid 1000
>     cache
> 
> Reported-by: Clément Hertling <wxcafe@wxcafe.net>
> Signed-off-by: Luca Boccassi <bluca@debian.org>
> ---
> Stephen et al, this was reported by a Debian user:
> 
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=921737
> 
> It makes sense to me at a cursory glance, but sending as RFC as I'm
> not 100% familiar with the route get function.
> 
>  ip/iproute.c | 5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/ip/iproute.c b/ip/iproute.c
> index 5f58a3b3..d78f43d8 100644
> --- a/ip/iproute.c
> +++ b/ip/iproute.c
> @@ -2041,11 +2041,6 @@ static int iproute_get(int argc, char **argv)
>  		argc--; argv++;
>  	}
>  
> -	if (req.r.rtm_dst_len == 0) {
> -		fprintf(stderr, "need at least a destination address\n");
> -		return -1;
> -	}
> -
>  	if (idev || odev)  {
>  		int idx;
>  

You still need a way to report error for:
	ip route get
(i.e when no address is present)

^ permalink raw reply

* Re: [PATCH bpf-next v10 5/7] bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c
From: Peter Oskolkov @ 2019-02-13 20:41 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Peter Oskolkov,
	Willem de Bruijn
In-Reply-To: <80849fb5-c5de-ce6b-6c25-bd152326196c@gmail.com>

On Wed, Feb 13, 2019 at 12:11 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 2/13/19 12:57 PM, Peter Oskolkov wrote:
> > Thanks, David! I was not able to reproduce the leak, but based on your
> > suggestion and similar code elsewhere I made a change in v11 to explicitly
> > release a dst with error.
>
> ok. Did you run the test with a debug kernel - checking refcount, use
> after free, etc?

In my tests I was always getting ERR_PTR for unroutable packets,
not a full rt/dst with an error flag set. But I checked several
similar route lookups,
and they all release bad dsts, so I did not feel it was worth it to
investigate further.

^ permalink raw reply

* [PATCH] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
From: Jann Horn @ 2019-02-13 20:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, jannh
  Cc: linux-kernel, Michal Hocko, Vlastimil Babka, Pavel Tatashin,
	Oscar Salvador, Mel Gorman, Aaron Lu, netdev, Alexander Duyck

The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
number of references that we might need to create in the fastpath later,
the bump-allocation fastpath only has to modify the non-atomic bias value
that tracks the number of extra references we hold instead of the atomic
refcount. The maximum number of allocations we can serve (under the
assumption that no allocation is made with size 0) is nc->size, so that's
the bias used.

However, even when all memory in the allocation has been given away, a
reference to the page is still held; and in the `offset < 0` slowpath, the
page may be reused if everyone else has dropped their references.
This means that the necessary number of references is actually
`nc->size+1`.

Luckily, from a quick grep, it looks like the only path that can call
page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
requires CAP_NET_ADMIN in the init namespace and is only intended to be
used for kernel testing and fuzzing.

To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
`offset < 0` path, below the virt_to_page() call, and then repeatedly call
writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
with a vector consisting of 15 elements containing 1 byte each.

Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
---
 mm/page_alloc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 35fdde041f5c..46285d28e43b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
 		/* Even if we own the page, we do not use atomic_set().
 		 * This would break get_page_unless_zero() users.
 		 */
-		page_ref_add(page, size - 1);
+		page_ref_add(page, size);

 		/* reset page count bias and offset to start of new frag */
 		nc->pfmemalloc = page_is_pfmemalloc(page);
-		nc->pagecnt_bias = size;
+		nc->pagecnt_bias = size + 1;
 		nc->offset = size;
 	}

@@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
 		size = nc->size;
 #endif
 		/* OK, page count is 0, we can safely set it */
-		set_page_count(page, size);
+		set_page_count(page, size + 1);

 		/* reset page count bias and offset to start of new frag */
-		nc->pagecnt_bias = size;
+		nc->pagecnt_bias = size + 1;
 		offset = size - fragsz;
 	}

-- 
2.20.1.791.gb4d0f1c61a-goog

^ permalink raw reply related

* Re: [PATCH bpf-next v4 0/2] libbpf: adding AF_XDP support
From: Jonathan Lemon @ 2019-02-13 20:49 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Magnus Karlsson, Björn Töpel, ast, Daniel Borkmann,
	Network Development, Jakub Kicinski, Björn Töpel,
	Zhang, Qi Z, Jesper Dangaard Brouer, xiaolong.ye
In-Reply-To: <CAJ8uoz19UjmEHTc28Qd_9KdY9D-ojXSBRTbmffRhUTX49mnWvg@mail.gmail.com>

On 13 Feb 2019, at 3:32, Magnus Karlsson wrote:

> On Mon, Feb 11, 2019 at 9:44 PM Jonathan Lemon 
> <jonathan.lemon@gmail.com> wrote:
>>
>> On 8 Feb 2019, at 5:05, Magnus Karlsson wrote:
>>
>>> This patch proposes to add AF_XDP support to libbpf. The main reason
>>> for this is to facilitate writing applications that use AF_XDP by
>>> offering higher-level APIs that hide many of the details of the 
>>> AF_XDP
>>> uapi. This is in the same vein as libbpf facilitates XDP adoption by
>>> offering easy-to-use higher level interfaces of XDP
>>> functionality. Hopefully this will facilitate adoption of AF_XDP, 
>>> make
>>> applications using it simpler and smaller, and finally also make it
>>> possible for applications to benefit from optimizations in the 
>>> AF_XDP
>>> user space access code. Previously, people just copied and pasted 
>>> the
>>> code from the sample application into their application, which is 
>>> not
>>> desirable.
>>
>> I like the idea of encapsulating the boilerplate logic in a library.
>>
>> I do think there is an important missing piece though - there should 
>> be
>> some code which queries the netdev for how many queues are attached, 
>> and
>> create the appropriate number of umem/AF_XDP sockets.
>>
>> I ran into this issue when testing the current AF_XDP code - on my 
>> test
>> boxes, the mlx5 card has 55 channels (aka queues), so when the test 
>> program
>> binds only to channel 0, nothing works as expected, since not all 
>> traffic
>> is being intercepted.  While obvious in hindsight, this took a while 
>> to
>> track down.
>
> Yes, agreed. You are not the first one to stumble upon this problem
> :-). Let me think a little bit on how to solve this in a good way. We
> need this to be simple and intuitive, as you say.

Has any investigation been done on using some variant of MPSC 
implementation
as an intermediate form for AF_XDP?  E.g.: something like LCRQ or the 
bulkQ
in bpf devmap/cpumap.  I'm aware that this would be slightly slower, as 
it
would introduce a lock in the path, but I'd think that having DEVMAP, 
CPUMAP
and XSKMAP all behave the same way would add more flexibility.

Ideally, if the configuration matches the underlying hardware, then the
implementation would reduce to the current setup (and allow ZC 
implementations),
but a non-matching configuration would still work - as opposed to the 
current
situation.
-- 
Jonathan

^ permalink raw reply

* Re: [PATCH] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
From: Andrew Morton @ 2019-02-13 20:59 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-mm, linux-kernel, Michal Hocko, Vlastimil Babka,
	Pavel Tatashin, Oscar Salvador, Mel Gorman, Aaron Lu, netdev,
	Alexander Duyck
In-Reply-To: <20190213204157.12570-1-jannh@google.com>

On Wed, 13 Feb 2019 21:41:57 +0100 Jann Horn <jannh@google.com> wrote:

> The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> number of references that we might need to create in the fastpath later,
> the bump-allocation fastpath only has to modify the non-atomic bias value
> that tracks the number of extra references we hold instead of the atomic
> refcount. The maximum number of allocations we can serve (under the
> assumption that no allocation is made with size 0) is nc->size, so that's
> the bias used.
> 
> However, even when all memory in the allocation has been given away, a
> reference to the page is still held; and in the `offset < 0` slowpath, the
> page may be reused if everyone else has dropped their references.
> This means that the necessary number of references is actually
> `nc->size+1`.
> 
> Luckily, from a quick grep, it looks like the only path that can call
> page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> requires CAP_NET_ADMIN in the init namespace and is only intended to be
> used for kernel testing and fuzzing.

For the net-naive, what is TAP?  It doesn't appear to mean
drivers/net/tap.c.

> To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> with a vector consisting of 15 elements containing 1 byte each.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>  		/* Even if we own the page, we do not use atomic_set().
>  		 * This would break get_page_unless_zero() users.
>  		 */
> -		page_ref_add(page, size - 1);
> +		page_ref_add(page, size);
>  
>  		/* reset page count bias and offset to start of new frag */
>  		nc->pfmemalloc = page_is_pfmemalloc(page);
> -		nc->pagecnt_bias = size;
> +		nc->pagecnt_bias = size + 1;
>  		nc->offset = size;
>  	}
>  
> @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>  		size = nc->size;
>  #endif
>  		/* OK, page count is 0, we can safely set it */
> -		set_page_count(page, size);
> +		set_page_count(page, size + 1);
>  
>  		/* reset page count bias and offset to start of new frag */
> -		nc->pagecnt_bias = size;
> +		nc->pagecnt_bias = size + 1;
>  		offset = size - fragsz;
>  	}

This is probably more a davem patch than a -mm one.

^ permalink raw reply

* Re: [PATCH] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
From: Jann Horn @ 2019-02-13 21:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, kernel list, Michal Hocko, Vlastimil Babka,
	Pavel Tatashin, Oscar Salvador, Mel Gorman, Aaron Lu,
	Network Development, Alexander Duyck
In-Reply-To: <20190213125906.eae96c18fe585e060aaf0ef7@linux-foundation.org>

On Wed, Feb 13, 2019 at 9:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 13 Feb 2019 21:41:57 +0100 Jann Horn <jannh@google.com> wrote:
>
> > The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> > number of references that we might need to create in the fastpath later,
> > the bump-allocation fastpath only has to modify the non-atomic bias value
> > that tracks the number of extra references we hold instead of the atomic
> > refcount. The maximum number of allocations we can serve (under the
> > assumption that no allocation is made with size 0) is nc->size, so that's
> > the bias used.
> >
> > However, even when all memory in the allocation has been given away, a
> > reference to the page is still held; and in the `offset < 0` slowpath, the
> > page may be reused if everyone else has dropped their references.
> > This means that the necessary number of references is actually
> > `nc->size+1`.
> >
> > Luckily, from a quick grep, it looks like the only path that can call
> > page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> > requires CAP_NET_ADMIN in the init namespace and is only intended to be
> > used for kernel testing and fuzzing.
>
> For the net-naive, what is TAP?  It doesn't appear to mean
> drivers/net/tap.c.

It's implemented in drivers/net/tun.c; the combined functionality
implemented in there is called TUN/TAP. TUN refers to providing raw IP
packets to the kernel, TAP refers to providing raw ethernet packets.
It's documented in Documentation/networking/tuntap.txt. The code
that's interesting here is tun_get_user(), which calls into
tun_napi_alloc_frags() if tun_napi_frags_enabled(tfile) is true, which
in turn calls into netdev_alloc_frag(), which ends up in
page_frag_alloc(). This is how you can use it (except that if you were
using it legitimately, you'd be writing an ethernet header, a layer 3
header, and application data instead of writing "aaaaaaaaaaaaaaa" like
me):

================
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdarg.h>
#include <net/if.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <err.h>
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>

void systemf(const char *command, ...) {
  char *full_command;
  va_list ap;
  va_start(ap, command);
  if (vasprintf(&full_command, command, ap) == -1)
    err(1, "vasprintf");
  va_end(ap);
  printf("systemf: <<<%s>>>\n", full_command);
  system(full_command);
}

char *devname;

int tun_alloc(char *name) {
  int fd = open("/dev/net/tun", O_RDWR);
  if (fd == -1)
    err(1, "open tun dev");
  static struct ifreq req = { .ifr_flags =
IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI };
  strcpy(req.ifr_name, name);
  if (ioctl(fd, TUNSETIFF, &req))
    err(1, "TUNSETIFF");
  devname = req.ifr_name;
  printf("device name: %s\n", devname);
  return fd;
}

int main(void) {
  int tun_fd = tun_alloc("inject_dev%d");
  systemf("ip link set %s up", devname);

  while (1) {
    struct iovec iov[15];
    for (int i=0; i<sizeof(iov)/sizeof(iov[0]); i++) {
      iov[i].iov_base = "a";
      iov[i].iov_len = 1;
    }
    writev(tun_fd, iov, sizeof(iov)/sizeof(iov[0]));
  }
}
================

> > To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> > `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> > writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> > with a vector consisting of 15 elements containing 1 byte each.
> >
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> >               /* Even if we own the page, we do not use atomic_set().
> >                * This would break get_page_unless_zero() users.
> >                */
> > -             page_ref_add(page, size - 1);
> > +             page_ref_add(page, size);
> >
> >               /* reset page count bias and offset to start of new frag */
> >               nc->pfmemalloc = page_is_pfmemalloc(page);
> > -             nc->pagecnt_bias = size;
> > +             nc->pagecnt_bias = size + 1;
> >               nc->offset = size;
> >       }
> >
> > @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> >               size = nc->size;
> >  #endif
> >               /* OK, page count is 0, we can safely set it */
> > -             set_page_count(page, size);
> > +             set_page_count(page, size + 1);
> >
> >               /* reset page count bias and offset to start of new frag */
> > -             nc->pagecnt_bias = size;
> > +             nc->pagecnt_bias = size + 1;
> >               offset = size - fragsz;
> >       }
>
> This is probably more a davem patch than a -mm one.

Ah, sorry. I assumed that I just should go by which directory the
patched code is in.

You did just add it to the -mm tree though, right? So I shouldn't
resend it to davem?

^ permalink raw reply

* Re: [PATCH iproute2 net-next v2 3/4] ss: Buffer raw fields first, then render them as a table
From: Stefano Brivio @ 2019-02-13 21:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stephen Hemminger, netdev, Sabrina Dubroca, David Ahern
In-Reply-To: <dfdb5a99-d922-5be8-b110-e5f069600ecd@gmail.com>

On Wed, 13 Feb 2019 09:31:03 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On 02/13/2019 12:37 AM, Stefano Brivio wrote:
> > On Tue, 12 Feb 2019 16:42:04 -0800
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >   
> >> I do not get it.
> >>
> >> "ss -emoi " uses almost 1KB per socket.
> >>
> >> 10,000,000 sockets -> we need about 10GB of memory  ???
> >>
> >> This is a serious regression.  
> > 
> > I guess this is rather subjective: the worst case I considered back then
> > was the output of 'ss -tei0' (less than 500 bytes) for one million
> > sockets, which gives 500M of memory, which should in turn be fine on a
> > machine handling one million sockets.
> > 
> > Now, if 'ss -emoi' on 10 million sockets is an actual use case (out of
> > curiosity: how are you going to process that output? Would JSON help?),
> > I see two easy options to solve this:  
> 
> 
> ss -temoi | parser (written in shell or awk or whatever...)
> 
> This is a use case, I just got bitten because using ss command
> actually OOM my container, while trying to debug a busy GFE.
> 
> The host itself can have 10,000,000 TCP sockets, but usually sysadmin shells
> run in a container with no more than 500 MB available. 
> 
> Otherwise, it would be too easy for a buggy program to OOM the whole machine
> and have angry customers.
> 
> > 
> > 1. flush the output every time we reach a given buffer size (1M
> >    perhaps). This might make the resulting blocks slightly unaligned,
> >    with occasional loss of readability on lines occurring every 1k to
> >    10k sockets approximately, even though after 1k sockets column sizes
> >    won't change much (it looks anyway better than the original), and I
> >    don't expect anybody to actually scroll that output
> > 
> > 2. add a switch for unbuffered output, but then you need to remember to
> >    pass it manually, and the whole output would be as bad as the
> >    original in case you need the switch.
> > 
> > I'd rather go with 1., it's easy to implement (we already have partial
> > flushing with '--events') and it looks like a good compromise on
> > usability. Thoughts?
> >   
> 
> 1 seems fine, but a switch for 'please do not try to format' would be fine.
> 
> I wonder why we try to 'format' when stdout is a pipe or a regular file .

On a second thought: what about | less, or | grep [ports],
or > readable.log? I guess those might also be rather common use cases,
what do you think?

I'm tempted to skip this for the moment and just go with option 1.

-- 
Stefano

^ permalink raw reply

* Re: [PATCH] net: phy: at803x: disable delay only for RGMII mode
From: Florian Fainelli @ 2019-02-13 21:38 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Marc Gonzalez, Andrew Lunn, Vinod Koul, David S Miller,
	linux-arm-msm, Bjorn Andersson, netdev, Nori, Sekhar,
	Peter Ujfalusi, hkallweit1
In-Reply-To: <20190213200738.GB460@centauri.lan>

On 2/13/19 12:07 PM, Niklas Cassel wrote:
> On Wed, Feb 13, 2019 at 09:59:43AM -0800, Florian Fainelli wrote:
>> On 2/13/19 9:40 AM, Niklas Cassel wrote:
>>> On Wed, Feb 13, 2019 at 02:40:18PM +0100, Marc Gonzalez wrote:
>>>> On 13/02/2019 14:29, Andrew Lunn wrote:
>>>>
>>>>>> So we have these modes:
>>>>>>
>>>>>> PHY_INTERFACE_MODE_RGMII: TX and RX delays disabled
>>>>>> PHY_INTERFACE_MODE_RGMII_ID: TX and RX delays enabled
>>>>>> PHY_INTERFACE_MODE_RGMII_RXID: RX delay enabled, TX delay disabled
>>>>>> PHY_INTERFACE_MODE_RGMII_TXID: TX delay enabled, RX delay disabled
>>>>>>
>>>>>> What I don't like with this patch, is that if we specify phy-mode
>>>>>> PHY_INTERFACE_MODE_RGMII_TXID, this patch will enable TX delay,
>>>>>> but RX delay will not be explicitly set.
>>>>>
>>>>> That is not the behaviour we want. It is best to assume the device is
>>>>> in a random state, and correctly enable/disable all delays as
>>>>> requested. Only leave the hardware alone if PHY_INTERFACE_MODE_NA is
>>>>> used.
>>>>
>>>> That's what my patch did:
>>>> https://www.spinics.net/lists/netdev/msg445053.html
>>>>
>>>> But see Florian's remarks:
>>>> https://www.spinics.net/lists/netdev/msg445133.html
>>>
>>> Hello Marc,
>>>
>>> I saw that comment from Florian. However that was way back in 2017.
>>> Maybe the phy-modes were not as well defined back then?
>>
>> The definition of the 'phy-mode' was clarified to be understood from the
>> perspective of the PHY device (hence the name) after we had several
>> fruitful exchanges with Marc (at least from my perspective), but since
>> the definition was not clear before, there is a high chance of finding
>> DTS/DTBs out there with the 'phy-mode' property understood from the
>> MAC's perspective, which would now be wrong.
> 
> Hello Florian,
> 
> 
> We have a specification:
> Documentation/devicetree/bindings/net/ethernet.txt
> 
> And several implementations: the PHY drivers.
> 
> Either we decide that all PHY drivers have to follow
> the specification for "phy-mode" in
> Documentation/devicetree/bindings/net/ethernet.txt
> or we decide that they don't.
> 
> If we decide that all PHY drivers have to follow the specification,
> then we can fix the PHY drivers that currently do not follow the
> specification.
> 
> If we decide that all PHY drivers do not have to follow the spec,
> then the "phy-mode" property is basically useless, and then we should
> introduce a new device tree property, e.g. "phy-mode2", that is
> guaranteed to respect the definitons in
> Documentation/devicetree/bindings/net/ethernet.txt

If the specification had been clear from day one, then we would not be
in the situation we are in today, so in that case it is not as simple
as: a) deprecating an existing property that was misused because the
spec was not well enough defined and b) go and fix all drivers. The
amount of breakage that can be introduced is just immense, and quite
frankly, for absolutely no good reason.

It's all well and good to introduce a 'phy-mode2' but let's think about
the future:

- what is depreciation path for 'phy-mode'/'phy-connection-type' looking
like then?
- do we have the manpower to review every new binding, DTS submission
that gets included in Linux, FreeBSD, Zephyr, for correctness?

> 
>>
>>
>>>
>>> Andrew recently suggested to fix the driver so that it conforms with the
>>> phy-modes, and fix any SoC that specified an incorrect phy-mode in DT
>>> and thus relied upon the broken behavior of the PHY driver:
>>> https://www.spinics.net/lists/netdev/msg445133.html
>>>
>>>
>>> So, I've rebased your old patch, see attachment.
>>> I suggest that Peter test it on am335x-evm.
>>>
>>> am335x-evm appears to rely on the current broken behavior of the PHY
>>> driver, so we will probably need to fix the am335x-evm according to this:
>>> https://www.spinics.net/lists/netdev/msg445117.html
>>> and merge that as well.
>>>
>>>
>>> Andrew, Florian, do you both agree?
>>
>> In my reply to Marc, there was a concern that while am335x-evm was
>> identified and reported to be broken after fixing the PHY driver, there
>> could be platforms out there that we have little to no visibility that
>> would most likely be equally broken. That concern still exists, and I
>> don't think there is anything we can do to even assess the size of the
>> problem unless we attempt to fix it, so maybe we should attempt to fix that.
>>
>> There was a suggestion to Marc that one way to possibly "ignore" an
>> incorrectly broken 'phy-mode' property would be to allow specifying
>> rx/tx delay properties such that if the driver obtained its
>> phy_interface_t, yet still parsed rx/tx delays, the rx/tx delays would
>> take precedence, and we could possibly derive some sort of a "more
>> correct" phy_interface_t that we could assign back to phydev->interface
>> and issue a warning about that.
> 
> You mean to add new device tree properties to
> Documentation/devicetree/bindings/net/ethernet.txt
> 
> - phy-id-tx: "true" if PHY should add internal delay on TX lines;
>              "false" or not specified if PHY should not add internal
> 	     delay on TX lines. This property overrides any delay
> 	     requested by "phy-mode".
> - phy-id-rx: "true" if PHY should add internal delay on RX lines;
>              "false" or not specified if PHY should not add internal
> 	     delay on RX lines. This property overrides any delay
> 	     requested by "phy-mode".
> 
> Perhaps something like that?

Not quite booleans, actual delay values, e.g.:

tx-delay-ps = <2000>
rx-delay-ps = <2000>

this is something that exists already:

Documentation/devicetree/bindings/net/apm-xgene-enet.txt
Documentation/devicetree/bindings/net/cavium-pip.txt
Documentation/devicetree/bindings/net/dwmac-sun8i.txt

because conceptually, telling the PHY driver that a TX or RX delay is
simply not enough, sometimes the standard 2ns (90 degree shift at
125Mhz) is not good enough and gets you out of spec because of some
board design.

> 
> Personally, I prefer making "phy-mode" strict,
> but whatever you guys decide:
> - making "phy-mode" strict
> - introducing a "phy-mode2"
> - introducing "phy-id-tx/phy-id-rx"
> - introducing "mac-mode"
> - some other solution
> 
> It is probably wise to introduce helper functions in phy.h
>  phy_wants_id_rx()
>  phy_wants_id_tx()
> so that PHY drivers can simply use e.g.:
> 
> if (phy_wants_id_rx(phydev))
> 	at803x_enable_rx_delay(phydev);
> else
> 	at803x_disable_rx_delay(phydev);
> 
> if (phy_wants_id_tx(phydev))
> 	at803x_enable_tx_delay(phydev);
> else
> 	at803x_disable_tx_delay(phydev);

Yes, that I think is pretty much orthogonal to the end solution we
decide to choose, having a way to tell what the PHY is currently
configured, or capable of supporting is step 1 in trying to find a
compatibility solution.

> 
>>
>> Another possible way to resolve that could be to introduce a 'mac-mode'
>> property, which must be strictly compatible with specifying a 'phy-mode'
>> property. For instance:
>>
>> - MAC specifies mac-mode = 'rgmii-id', then the PHY must have phy-mode =
>> 'rmgii' since the MAC is taking of inserting both RX and TX delays,
>> reverse also applies
>>
>> - MAC specifies mac-mode = 'rgmii-txid', then the PHY must have phy-mode
>> = 'rgmii-rxid' because the MAC adds the TX delay, but the PHY should
>> insert the delay on the RX lines, reverse also applies
>>
>> Because there is usually (not always, DSA is an exception) a 1:1 mapping
>> between MAC and PHY devices we could look up the 'mac-mode' property in
>> the MAC in the PHY library code and make sure that we have a compatible
>> matrix and if we do not, maybe pass something like PHY_INTERFACE_MODE_NA
>> such that the driver retains its settings.
> 
> Is there any advantage of creating a "mac-mode" over creating a
> "phy-mode2" ?
> 
> 
> Kind regards,
> Niklas
> 
>>
>> Maybe another way to approach this is if we assume that the PHY comes up
>> configured correctly by the boot loader, or upon power on reset, we add
>> some PHY driver methods that allow us to determine the RGMII mode in
>> which a PHY is and that tells us whether we are compatible with the
>> MAC's phy_interface_t upon connection. We check both at connect() time
>> and if something does not look right, we flip the meaning of
>> phy_interface_t.
>>
>> None of those solutions are entirely fool proof, but at least we might
>> be able to detect incorrect combinations, yet still make them work by
>> reversing the meaning of the 'phy-mode' property given information at hand.
>>
>> Let me know if none of that makes sense and this just looks like yet
>> another brain dump.
>>
>> Wonderful RGMII...
>> -- 
>> Florian


-- 
Florian

^ permalink raw reply

* Re: [PATCH] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
From: Andrew Morton @ 2019-02-13 21:40 UTC (permalink / raw)
  To: Jann Horn
  Cc: Linux-MM, kernel list, Michal Hocko, Vlastimil Babka,
	Pavel Tatashin, Oscar Salvador, Mel Gorman, Aaron Lu,
	Network Development, Alexander Duyck
In-Reply-To: <CAG48ez2Qo7N-+=y=eFhzw9HfYS3HODAY-zLaubFMGyXEV_nwpg@mail.gmail.com>

On Wed, 13 Feb 2019 22:11:58 +0100 Jann Horn <jannh@google.com> wrote:

> > This is probably more a davem patch than a -mm one.
> 
> Ah, sorry. I assumed that I just should go by which directory the
> patched code is in.
> 
> You did just add it to the -mm tree though, right? So I shouldn't
> resend it to davem?

Yes, please send to Dave.  I'll autodrop the -mm copy if/when it turns
up in -next.


^ permalink raw reply

* Re: [PATCH] NETWORKING: avoid use IPCB in cipso_v4_error
From: Paul Moore @ 2019-02-13 21:41 UTC (permalink / raw)
  To: Nazarov Sergey
  Cc: netdev@vger.kernel.org, linux-security-module@vger.kernel.org,
	davem, kuznet, yoshfuji
In-Reply-To: <6691891549984203@myt5-a323eb993ef7.qloud-c.yandex.net>

On Tue, Feb 12, 2019 at 10:10 AM Nazarov Sergey <s-nazarov@yandex.ru> wrote:
> Since cipso_v4_error might be called from different network stack layers, we can't safely use icmp_send there.
> icmp_send copies IP options with ip_option_echo, which uses IPCB to take access to IP header compiled data.
> But after commit 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses"), IPCB can't be used
> above IP layer.
> This patch fixes the problem by creating in cipso_v4_error a local copy of compiled IP options and using it with
> introduced __icmp_send function. This looks some overloaded, but in quite rare error conditions only.
>
> The original discussion is here:
> https://lore.kernel.org/linux-security-module/16659801547571984@sas1-890ba5c2334a.qloud-c.yandex.net/
>
> Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
> ---
>  include/net/icmp.h    |    9 ++++++++-
>  net/ipv4/cipso_ipv4.c |   18 ++++++++++++++++--
>  net/ipv4/icmp.c       |    7 ++++---
>  3 files changed, 28 insertions(+), 6 deletions(-)

Hi Sergey,

Thanks for your work on finding this and putting a fix together.  As
we discussed previously, I think this looks good, but can you describe
the testing you did to verify that this works correctly?

> diff --git a/include/net/icmp.h b/include/net/icmp.h
> index 6ac3a5b..e0f709d 100644
> --- a/include/net/icmp.h
> +++ b/include/net/icmp.h
> @@ -22,6 +22,7 @@
>
>  #include <net/inet_sock.h>
>  #include <net/snmp.h>
> +#include <net/ip.h>
>
>  struct icmp_err {
>    int          errno;
> @@ -39,7 +40,13 @@ struct icmp_err {
>  struct sk_buff;
>  struct net;
>
> -void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info);
> +void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
> +                const struct ip_options *opt);
> +static inline void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
> +{
> +       __icmp_send(skb_in, type, code, info, &IPCB(skb_in)->opt);
> +}
> +
>  int icmp_rcv(struct sk_buff *skb);
>  int icmp_err(struct sk_buff *skb, u32 info);
>  int icmp_init(void);
> diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
> index 777fa3b..234d12e 100644
> --- a/net/ipv4/cipso_ipv4.c
> +++ b/net/ipv4/cipso_ipv4.c
> @@ -1735,13 +1735,27 @@ int cipso_v4_validate(const struct sk_buff *skb, unsigned char **option)
>   */
>  void cipso_v4_error(struct sk_buff *skb, int error, u32 gateway)
>  {
> +       unsigned char optbuf[sizeof(struct ip_options) + 40];
> +       struct ip_options *opt = (struct ip_options *)optbuf;
> +
>         if (ip_hdr(skb)->protocol == IPPROTO_ICMP || error != -EACCES)
>                 return;
>
> +       /*
> +        * We might be called above the IP layer,
> +        * so we can not use icmp_send and IPCB here.
> +        */
> +
> +       memset(opt, 0, sizeof(struct ip_options));
> +       opt->optlen = ip_hdr(skb)->ihl*4 - sizeof(struct iphdr);
> +       memcpy(opt->__data, (unsigned char *)&(ip_hdr(skb)[1]), opt->optlen);
> +       if (ip_options_compile(dev_net(skb->dev), opt, NULL))
> +               return;
> +
>         if (gateway)
> -               icmp_send(skb, ICMP_DEST_UNREACH, ICMP_NET_ANO, 0);
> +               __icmp_send(skb, ICMP_DEST_UNREACH, ICMP_NET_ANO, 0, opt);
>         else
> -               icmp_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_ANO, 0);
> +               __icmp_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_ANO, 0, opt);
>  }
>
>  /**
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index 065997f..3f24414 100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -570,7 +570,8 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
>   *                     MUST reply to only the first fragment.
>   */
>
> -void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
> +void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
> +                const struct ip_options *opt)
>  {
>         struct iphdr *iph;
>         int room;
> @@ -691,7 +692,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
>                                           iph->tos;
>         mark = IP4_REPLY_MARK(net, skb_in->mark);
>
> -       if (ip_options_echo(net, &icmp_param.replyopts.opt.opt, skb_in))
> +       if (__ip_options_echo(net, &icmp_param.replyopts.opt.opt, skb_in, opt))
>                 goto out_unlock;
>
>
> @@ -742,7 +743,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
>         local_bh_enable();
>  out:;
>  }
> -EXPORT_SYMBOL(icmp_send);
> +EXPORT_SYMBOL(__icmp_send);
>
>
>  static void icmp_socket_deliver(struct sk_buff *skb, u32 info)
> --
>


-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* [RESEND PATCH net] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
From: Jann Horn @ 2019-02-13 21:45 UTC (permalink / raw)
  To: David S. Miller, netdev, jannh
  Cc: linux-mm, linux-kernel, Michal Hocko, Vlastimil Babka,
	Pavel Tatashin, Oscar Salvador, Mel Gorman, Aaron Lu,
	Alexander Duyck

The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
number of references that we might need to create in the fastpath later,
the bump-allocation fastpath only has to modify the non-atomic bias value
that tracks the number of extra references we hold instead of the atomic
refcount. The maximum number of allocations we can serve (under the
assumption that no allocation is made with size 0) is nc->size, so that's
the bias used.

However, even when all memory in the allocation has been given away, a
reference to the page is still held; and in the `offset < 0` slowpath, the
page may be reused if everyone else has dropped their references.
This means that the necessary number of references is actually
`nc->size+1`.

Luckily, from a quick grep, it looks like the only path that can call
page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
requires CAP_NET_ADMIN in the init namespace and is only intended to be
used for kernel testing and fuzzing.

To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
`offset < 0` path, below the virt_to_page() call, and then repeatedly call
writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
with a vector consisting of 15 elements containing 1 byte each.

Signed-off-by: Jann Horn <jannh@google.com>
---
Resending to davem at the request of akpm.

 mm/page_alloc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 35fdde041f5c..46285d28e43b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
 		/* Even if we own the page, we do not use atomic_set().
 		 * This would break get_page_unless_zero() users.
 		 */
-		page_ref_add(page, size - 1);
+		page_ref_add(page, size);

 		/* reset page count bias and offset to start of new frag */
 		nc->pfmemalloc = page_is_pfmemalloc(page);
-		nc->pagecnt_bias = size;
+		nc->pagecnt_bias = size + 1;
 		nc->offset = size;
 	}

@@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
 		size = nc->size;
 #endif
 		/* OK, page count is 0, we can safely set it */
-		set_page_count(page, size);
+		set_page_count(page, size + 1);

 		/* reset page count bias and offset to start of new frag */
-		nc->pagecnt_bias = size;
+		nc->pagecnt_bias = size + 1;
 		offset = size - fragsz;
 	}

-- 
2.20.1.791.gb4d0f1c61a-goog

^ permalink raw reply related

* Re: [PATCH iproute2 net-next v2 3/4] ss: Buffer raw fields first, then render them as a table
From: Stephen Hemminger @ 2019-02-13 21:55 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Eric Dumazet, netdev, Sabrina Dubroca, David Ahern
In-Reply-To: <20190213221716.5f958c2a@redhat.com>

On Wed, 13 Feb 2019 22:17:16 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Wed, 13 Feb 2019 09:31:03 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On 02/13/2019 12:37 AM, Stefano Brivio wrote:  
> > > On Tue, 12 Feb 2019 16:42:04 -0800
> > > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >     
> > >> I do not get it.
> > >>
> > >> "ss -emoi " uses almost 1KB per socket.
> > >>
> > >> 10,000,000 sockets -> we need about 10GB of memory  ???
> > >>
> > >> This is a serious regression.    
> > > 
> > > I guess this is rather subjective: the worst case I considered back then
> > > was the output of 'ss -tei0' (less than 500 bytes) for one million
> > > sockets, which gives 500M of memory, which should in turn be fine on a
> > > machine handling one million sockets.
> > > 
> > > Now, if 'ss -emoi' on 10 million sockets is an actual use case (out of
> > > curiosity: how are you going to process that output? Would JSON help?),
> > > I see two easy options to solve this:    
> > 
> > 
> > ss -temoi | parser (written in shell or awk or whatever...)
> > 
> > This is a use case, I just got bitten because using ss command
> > actually OOM my container, while trying to debug a busy GFE.
> > 
> > The host itself can have 10,000,000 TCP sockets, but usually sysadmin shells
> > run in a container with no more than 500 MB available. 
> > 
> > Otherwise, it would be too easy for a buggy program to OOM the whole machine
> > and have angry customers.
> >   
> > > 
> > > 1. flush the output every time we reach a given buffer size (1M
> > >    perhaps). This might make the resulting blocks slightly unaligned,
> > >    with occasional loss of readability on lines occurring every 1k to
> > >    10k sockets approximately, even though after 1k sockets column sizes
> > >    won't change much (it looks anyway better than the original), and I
> > >    don't expect anybody to actually scroll that output
> > > 
> > > 2. add a switch for unbuffered output, but then you need to remember to
> > >    pass it manually, and the whole output would be as bad as the
> > >    original in case you need the switch.
> > > 
> > > I'd rather go with 1., it's easy to implement (we already have partial
> > > flushing with '--events') and it looks like a good compromise on
> > > usability. Thoughts?
> > >     
> > 
> > 1 seems fine, but a switch for 'please do not try to format' would be fine.
> > 
> > I wonder why we try to 'format' when stdout is a pipe or a regular file .  
> 
> On a second thought: what about | less, or | grep [ports],
> or > readable.log? I guess those might also be rather common use cases,
> what do you think?
> 
> I'm tempted to skip this for the moment and just go with option 1.
> 

What I would favor:
	* use big enough columns that for the common case everything lines up fine
	* if column is to wide just print that element wider (which is what print %Ns does)
and
	* add json output for programs that want to parse
	* use print_uint etc for that

The buffering patch (in iproute2-next) can/will be reverted.

^ permalink raw reply

* Re: [PATCH iproute2] lib/libnetlink: ensure a minimum of 32KB for the buffer used in rtnl_recvmsg()
From: Stephen Hemminger @ 2019-02-13 21:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Ahern, netdev, Eric Dumazet, Hangbin Liu, Phil Sutter
In-Reply-To: <20190213015841.140383-1-edumazet@google.com>

On Tue, 12 Feb 2019 17:58:41 -0800
Eric Dumazet <edumazet@google.com> wrote:

> In the past, we tried to increase the buffer size up to 32 KB in order
> to reduce number of syscalls per dump.
> 
> Commit 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
> brought the size back to 4KB because the kernel can not know the application
> is ready to receive bigger requests.
> 
> See kernel commits 9063e21fb026 ("netlink: autosize skb lengthes") and
> d35c99ff77ec ("netlink: do not enter direct reclaim from netlink_dump()")
> for more details.
> 
> Fixes: 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Hangbin Liu <liuhangbin@gmail.com>
> Cc: Phil Sutter <phil@nwl.cc>

Applied, although maybe we should bump it to 64K or bigger?

^ permalink raw reply

* Re: [PATCH iproute2] lib/libnetlink: ensure a minimum of 32KB for the buffer used in rtnl_recvmsg()
From: Eric Dumazet @ 2019-02-13 21:59 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Ahern, netdev, Eric Dumazet, Hangbin Liu, Phil Sutter
In-Reply-To: <20190213135718.1ed23c3a@shemminger-XPS-13-9360>

On Wed, Feb 13, 2019 at 1:57 PM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Tue, 12 Feb 2019 17:58:41 -0800
> Eric Dumazet <edumazet@google.com> wrote:
>
> > In the past, we tried to increase the buffer size up to 32 KB in order
> > to reduce number of syscalls per dump.
> >
> > Commit 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
> > brought the size back to 4KB because the kernel can not know the application
> > is ready to receive bigger requests.
> >
> > See kernel commits 9063e21fb026 ("netlink: autosize skb lengthes") and
> > d35c99ff77ec ("netlink: do not enter direct reclaim from netlink_dump()")
> > for more details.
> >
> > Fixes: 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Hangbin Liu <liuhangbin@gmail.com>
> > Cc: Phil Sutter <phil@nwl.cc>
>
> Applied, although maybe we should bump it to 64K or bigger?

Note the kernel does not yet try 64KB allocations, so I do not see an
urgent need for that :)

^ permalink raw reply

* Re: [PATCH iproute2] ss: add option --tos for requesting ipv4 tos and ipv6 tclass
From: Stephen Hemminger @ 2019-02-13 22:00 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: netdev, Eric Dumazet
In-Reply-To: <155006154185.449020.2783123004054072980.stgit@buzz>

On Wed, 13 Feb 2019 15:39:01 +0300
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote:

> Also show socket class_id/priority used by classful qdisc.
> Kernel report this together with tclass since commit
> ("inet_diag: fix reporting cgroup classid and fallback to priority")
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

Applied, this is useful even if diffserv is not.

^ permalink raw reply

* Re: [PATCH -next] net: ipvlan_l3s: fix kconfig dependency warning
From: Daniel Borkmann @ 2019-02-13 22:03 UTC (permalink / raw)
  To: Randy Dunlap, netdev@vger.kernel.org; +Cc: Mahesh Bandewar, David Miller
In-Reply-To: <204a7785-a1d2-e714-653e-2cb19e36f279@infradead.org>

On 02/13/2019 05:55 PM, Randy Dunlap wrote:
> From: Randy Dunlap <rdunlap@infradead.org>
> 
> Fix the kconfig warning in IPVLAN_L3S when neither INET nor IPV6
> is enabled:
> 
> WARNING: unmet direct dependencies detected for NET_L3_MASTER_DEV
>   Depends on [n]: NET [=y] && (INET [=n] || IPV6 [=n])
>   Selected by [y]:
>   - IPVLAN_L3S [=y] && NETDEVICES [=y] && NET_CORE [=y] && NETFILTER [=y]
> 
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
> Cc: Mahesh Bandewar <maheshb@google.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox