Netdev List


* Re: [PATCH 2/3] mm/gup: Introduce get_user_pages_fast_longterm()
From: Ira Weiny @ 2019-02-12  0:08 UTC
  To: Jason Gunthorpe
  Cc: Dan Williams, John Hubbard, linux-rdma, Linux Kernel Mailing List,
	Linux MM, Daniel Borkmann, Davidlohr Bueso, Netdev,
	Mike Marciniszyn, Dennis Dalessandro, Doug Ledford, Andrew Morton,
	Kirill A. Shutemov
In-Reply-To: <20190211232510.GP24692@ziepe.ca>

On Mon, Feb 11, 2019 at 04:25:10PM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 02:55:10PM -0800, Dan Williams wrote:
> 
> > > I also wonder if someone should think about making fast into a flag
> > > too..
> > >
> > > But I'm not sure when fast should be used vs when it shouldn't :(
> > 
> > Effectively fast should always be used just in case the user cares
> > about performance. It's just that it may fail and need to fall back to
> > requiring the vma.
> 
> But the fall back / slow path is hidden inside the API, so when should
> the caller care? 
> 
> ie when should the caller care to use gup_fast vs gup_unlocked? (the
> comments say they are the same, but this seems to be a mistake)
> 
> Based on some of the comments in the code it looks like this API is
> trying to convert itself into:
> 
> long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
>                            unsigned long start, unsigned long nr_pages,
> 			   unsigned int gup_flags, struct page **pages,
> 			   struct vm_area_struct **vmas, bool *locked)
> 
> long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
>                              unsigned long start, unsigned long nr_pages,
> 			     unsigned int gup_flags, struct page **pages)
> 
> (and maybe a FOLL_FAST if there is some reason we have _fast and
> _unlocked)
> 
> The reason I ask, is that if there is no reason for fast vs unlocked
> then maybe Ira should convert HFI to use gup_unlocked and move the
> 'fast' code into unlocked?
> 
> ie move incrementally closer to the desired end-state here.

If the pages are not in the page tables then fast is probably going to be
slightly slower because it will have to fall back after walking the tables and
finding something missing.

For PSM2 (MPI) applications, our performance improvement was probably because
the memory in question was in the page tables and very much in use.

Ira

> 
> Jason


* linux-next: manual merge of the net-next tree with the net tree
From: Stephen Rothwell @ 2019-02-12  0:23 UTC
  To: David Miller, Networking
  Cc: Linux Next Mailing List, Linux Kernel Mailing List,
	Florian Westphal, Pablo Neira Ayuso


Hi all,

Today's linux-next merge of the net-next tree got conflicts in:

  net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
  net/ipv6/netfilter/nf_nat_l3proto_ipv6.c

between commit:

  8303b7e8f018 ("netfilter: nat: fix spurious connection timeouts")

from the net tree and commit:

  303e0c558959 ("netfilter: conntrack: avoid unneeded nf_conntrack_l4proto lookups")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non-trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
index fa2ba7c500e4,e26165af45cb..000000000000
--- a/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
@@@ -214,8 -214,7 +214,8 @@@ int nf_nat_icmp_reply_translation(struc
  	}
  
  	/* Change outer to look like the reply to an incoming packet */
- 	nf_ct_invert_tuplepr(&target, &ct->tuplehash[!dir].tuple);
+ 	nf_ct_invert_tuple(&target, &ct->tuplehash[!dir].tuple);
 +	target.dst.protonum = IPPROTO_ICMP;
  	if (!nf_nat_ipv4_manip_pkt(skb, 0, &target, manip))
  		return 0;
  
diff --cc net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
index 7a41ee3c11b4,9c914db44bec..000000000000
--- a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
+++ b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
@@@ -225,8 -225,7 +225,8 @@@ int nf_nat_icmpv6_reply_translation(str
  						     skb->len - hdrlen, 0));
  	}
  
- 	nf_ct_invert_tuplepr(&target, &ct->tuplehash[!dir].tuple);
+ 	nf_ct_invert_tuple(&target, &ct->tuplehash[!dir].tuple);
 +	target.dst.protonum = IPPROTO_ICMPV6;
  	if (!nf_nat_ipv6_manip_pkt(skb, 0, &target, manip))
  		return 0;
  



* [PATCH bpf-next v9 1/7] bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-12  0:42 UTC
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190212004249.219268-1-posk@google.com>

This patch adds all the plumbing needed to allow bpf programs
to do IP encapping via bpf_lwt_push_encap. The actual
implementation is added in the next patch in the patchset.

Of note:
- bpf_lwt_push_encap can now be called from BPF_PROG_TYPE_LWT_XMIT
  prog types in addition to BPF_PROG_TYPE_LWT_IN;
- if the skb being encapped has GSO set, encapsulation is limited
  to IPIP/IP+GRE/IP+GUE (both IPv4 and IPv6);
- as route lookups are different for ingress vs egress, the single
  external bpf_lwt_push_encap BPF helper is routed internally to
  either bpf_lwt_in_push_encap or bpf_lwt_xmit_push_encap BPF_CALLs,
  depending on prog type.
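
A minimal userspace sketch of the per-prog-type dispatch described above,
for illustration only. The enum, the PATH_* return tags and the push_*()
stand-ins are hypothetical; the real helpers operate on an skb and return
0 on success.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Mirrors the order of enum bpf_lwt_encap_mode after this patch. */
enum encap_mode { ENCAP_SEG6, ENCAP_SEG6_INLINE, ENCAP_IP };

/* Hypothetical tags so that a caller can see which path was taken. */
enum { PATH_SEG6 = 1, PATH_IP_INGRESS = 2, PATH_IP_EGRESS = 3 };

/* Hypothetical stand-ins for bpf_push_seg6_encap()/bpf_push_ip_encap(). */
static int push_seg6(void) { return PATH_SEG6; }
static int push_ip(bool ingress)
{
	return ingress ? PATH_IP_INGRESS : PATH_IP_EGRESS;
}

/* LWT_IN programs: both SEG6 modes plus IP encap, on the ingress path. */
static int lwt_in_push_encap(int mode)
{
	switch (mode) {
	case ENCAP_SEG6:
	case ENCAP_SEG6_INLINE:
		return push_seg6();
	case ENCAP_IP:
		return push_ip(true);
	default:
		return -EINVAL;
	}
}

/* LWT_XMIT programs: IP encap only, on the egress path. */
static int lwt_xmit_push_encap(int mode)
{
	switch (mode) {
	case ENCAP_IP:
		return push_ip(false);
	default:
		return -EINVAL;
	}
}

/* Usage: the same mode is accepted or rejected depending on prog type. */
static void encap_dispatch_example(void)
{
	assert(lwt_in_push_encap(ENCAP_SEG6) == PATH_SEG6);
	assert(lwt_xmit_push_encap(ENCAP_SEG6) == -EINVAL);
}
```

The point of the split is that the single external helper ID stays the same
while each prog type gets its own internal entry point.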

v8 changes: fixed a typo.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/uapi/linux/bpf.h | 26 ++++++++++++++++++++--
 net/core/filter.c        | 48 +++++++++++++++++++++++++++++++++++-----
 2 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 25c8c0e62ecf..bcdd2474eee7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2016,6 +2016,19 @@ union bpf_attr {
  *			Only works if *skb* contains an IPv6 packet. Insert a
  *			Segment Routing Header (**struct ipv6_sr_hdr**) inside
  *			the IPv6 header.
+ *		**BPF_LWT_ENCAP_IP**
+ *			IP encapsulation (GRE/GUE/IPIP/etc). The outer header
+ *			must be IPv4 or IPv6, followed by zero or more
+ *			additional headers, up to LWT_BPF_MAX_HEADROOM total
+ *			bytes in all prepended headers. Please note that
+ *			if skb_is_gso(skb) is true, no more than two headers
+ *			can be prepended, and the inner header, if present,
+ *			should be either GRE or UDP/GUE.
+ *
+ *		BPF_LWT_ENCAP_SEG6*** types can be called by bpf programs of
+ *		type BPF_PROG_TYPE_LWT_IN; BPF_LWT_ENCAP_IP type can be called
+ *		by bpf programs of types BPF_PROG_TYPE_LWT_IN and
+ *		BPF_PROG_TYPE_LWT_XMIT.
  *
  * 		A call to this helper is susceptible to change the underlaying
  * 		packet buffer. Therefore, at load time, all checks on pointers
@@ -2517,7 +2530,8 @@ enum bpf_hdr_start_off {
 /* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
 enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_SEG6,
-	BPF_LWT_ENCAP_SEG6_INLINE
+	BPF_LWT_ENCAP_SEG6_INLINE,
+	BPF_LWT_ENCAP_IP,
 };
 
 #define __bpf_md_ptr(type, name)	\
@@ -2606,7 +2620,15 @@ enum bpf_ret_code {
 	BPF_DROP = 2,
 	/* 3-6 reserved */
 	BPF_REDIRECT = 7,
-	/* >127 are reserved for prog type specific return codes */
+	/* >127 are reserved for prog type specific return codes.
+	 *
+	 * BPF_LWT_REROUTE: used by BPF_PROG_TYPE_LWT_IN and
+	 *    BPF_PROG_TYPE_LWT_XMIT to indicate that skb had been
+	 *    changed and should be routed based on its new L3 header.
+	 *    (This is an L3 redirect, as opposed to L2 redirect
+	 *    represented by BPF_REDIRECT above).
+	 */
+	BPF_LWT_REROUTE = 128,
 };
 
 struct bpf_sock {
diff --git a/net/core/filter.c b/net/core/filter.c
index 353735575204..12c88c21b6b8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4815,7 +4815,15 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len
 }
 #endif /* CONFIG_IPV6_SEG6_BPF */
 
-BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
+#if IS_ENABLED(CONFIG_LWTUNNEL_BPF)
+static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len,
+			     bool ingress)
+{
+	return -EINVAL;  /* Implemented in the next patch. */
+}
+#endif
+
+BPF_CALL_4(bpf_lwt_in_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
 	   u32, len)
 {
 	switch (type) {
@@ -4823,14 +4831,41 @@ BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
 	case BPF_LWT_ENCAP_SEG6:
 	case BPF_LWT_ENCAP_SEG6_INLINE:
 		return bpf_push_seg6_encap(skb, type, hdr, len);
+#endif
+#if IS_ENABLED(CONFIG_LWTUNNEL_BPF)
+	case BPF_LWT_ENCAP_IP:
+		return bpf_push_ip_encap(skb, hdr, len, true /* ingress */);
+#endif
+	default:
+		return -EINVAL;
+	}
+}
+
+BPF_CALL_4(bpf_lwt_xmit_push_encap, struct sk_buff *, skb, u32, type,
+	   void *, hdr, u32, len)
+{
+	switch (type) {
+#if IS_ENABLED(CONFIG_LWTUNNEL_BPF)
+	case BPF_LWT_ENCAP_IP:
+		return bpf_push_ip_encap(skb, hdr, len, false /* egress */);
 #endif
 	default:
 		return -EINVAL;
 	}
 }
 
-static const struct bpf_func_proto bpf_lwt_push_encap_proto = {
-	.func		= bpf_lwt_push_encap,
+static const struct bpf_func_proto bpf_lwt_in_push_encap_proto = {
+	.func		= bpf_lwt_in_push_encap,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_MEM,
+	.arg4_type	= ARG_CONST_SIZE
+};
+
+static const struct bpf_func_proto bpf_lwt_xmit_push_encap_proto = {
+	.func		= bpf_lwt_xmit_push_encap,
 	.gpl_only	= false,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
@@ -5417,7 +5452,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 	    func == bpf_lwt_seg6_adjust_srh ||
 	    func == bpf_lwt_seg6_action ||
 #endif
-	    func == bpf_lwt_push_encap)
+	    func == bpf_lwt_in_push_encap ||
+	    func == bpf_lwt_xmit_push_encap)
 		return true;
 
 	return false;
@@ -5815,7 +5851,7 @@ lwt_in_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_lwt_push_encap:
-		return &bpf_lwt_push_encap_proto;
+		return &bpf_lwt_in_push_encap_proto;
 	default:
 		return lwt_out_func_proto(func_id, prog);
 	}
@@ -5851,6 +5887,8 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_l4_csum_replace_proto;
 	case BPF_FUNC_set_hash_invalid:
 		return &bpf_set_hash_invalid_proto;
+	case BPF_FUNC_lwt_push_encap:
+		return &bpf_lwt_xmit_push_encap_proto;
 	default:
 		return lwt_out_func_proto(func_id, prog);
 	}
-- 
2.20.1.791.gb4d0f1c61a-goog



* [PATCH bpf-next v9 0/7] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-12  0:42 UTC
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov

This patchset implements BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
BPF helper. It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN
and BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
to packets (e.g. IP/GRE, GUE, IPIP).

This is useful when thousands of different short-lived flows should be
encapped, each with a different, dynamically determined destination.
Although lwtunnels can be used in some of these scenarios, the ability
to dynamically generate encap headers adds more flexibility, e.g.
when routing depends on the state of the host (reflected in global bpf
maps).

V2 changes: added flowi-based route lookup, IPv6 encapping, and
   encapping on ingress.

V3 changes: incorporated David Ahern's suggestions:
   - added l3mdev check/oif (patch 2)
   - sync bpf.h from include/uapi into tools/include/uapi
   - selftest tweaks

V4 changes: moved route lookup/dst change from bpf_push_ip_encap
   to when BPF_LWT_REROUTE is handled, as suggested by David Ahern.

V5 changes: added a check in lwt_xmit that skb->protocol stays the
   same if the skb is to be passed back to the stack (ret == BPF_OK).
   Again, suggested by David Ahern.

V6 changes: abandoned.

V7 changes: added handling of GSO packets (patch 3 in the patchset added),
   as suggested by BPF maintainers.

V8 changes:
   - fixed build errors when LWT or IPV6 are not enabled;
   - whitelisted TCP GSO instead of blacklisting SCTP and UDP GSO, as
     suggested by Willem de Bruijn;
   - added validation that the pushed length covers the needed headers when
     GRE/UDP encap is detected, as suggested by Willem de Bruijn;
   - a couple of minor/stylistic tweaks/fixed typos.

V9 changes:
   - fixed a kbuild test robot compiler warning;
   - added ipv6_route_input to ipv6_stub (patch 4 in the patchset
     added), and IPv6 routing functions are now invoked via ipv6_stub,
     as suggested by David Ahern.

Peter Oskolkov (7):
  bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap
  bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
  bpf: handle GSO in bpf_lwt_push_encap
  ipv6_stub: add ipv6_route_input stub/proxy.
  bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c
  bpf: sync <kdir>/include/.../bpf.h with tools/include/.../bpf.h
  selftests: bpf: add test_lwt_ip_encap selftest

 include/net/addrconf.h                        |   1 +
 include/net/lwtunnel.h                        |   2 +
 include/uapi/linux/bpf.h                      |  26 +-
 net/core/filter.c                             |  49 ++-
 net/core/lwt_bpf.c                            | 261 +++++++++++++++
 net/ipv6/addrconf_core.c                      |   6 +
 net/ipv6/af_inet6.c                           |   7 +
 tools/include/uapi/linux/bpf.h                |  26 +-
 tools/testing/selftests/bpf/Makefile          |   5 +-
 .../testing/selftests/bpf/test_lwt_ip_encap.c |  85 +++++
 .../selftests/bpf/test_lwt_ip_encap.sh        | 311 ++++++++++++++++++
 11 files changed, 768 insertions(+), 11 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lwt_ip_encap.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_ip_encap.sh

-- 
2.20.1.791.gb4d0f1c61a-goog


* [PATCH bpf-next v9 2/7] bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-12  0:42 UTC
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190212004249.219268-1-posk@google.com>

Implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap BPF helper.
It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN and
BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
to packets (e.g. IP/GRE, GUE, IPIP).

This is useful when thousands of different short-lived flows should be
encapped, each with a different, dynamically determined destination.
Although lwtunnels can be used in some of these scenarios, the ability
to dynamically generate encap headers adds more flexibility, e.g.
when routing depends on the state of the host (reflected in global bpf
maps).

v7 changes:
 - added a call to skb_clear_hash();
 - removed calls to skb_set_transport_header();
 - refuse to encap GSO-enabled packets.

v8 changes:
 - fix build errors when LWT is not enabled.

Note: the next patch in the patchset will deal with GSO-enabled packets,
which are currently rejected when encapping is attempted.
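
For illustration, the header checks and the IPv4 checksum fill-in done by
bpf_lwt_push_ip_encap() in this patch can be sketched in userspace roughly
as follows. This is a sketch under assumptions, not the kernel code: it
works on a raw byte buffer instead of struct iphdr, SKETCH_MAX_HEADROOM is
an assumed stand-in for LWT_BPF_MAX_HEADROOM, and ipv4_header_checksum() is
a plain ones'-complement sum standing in for ip_fast_csum().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_HEADROOM 256u /* assumed stand-in for LWT_BPF_MAX_HEADROOM */
#define IPV4_MIN_HDR 20u         /* size of an IPv4 header without options */
#define IPV6_HDR 40u             /* size of the fixed IPv6 header */

/* Returns 4 or 6 for an acceptable outer header, -EINVAL otherwise. */
static int check_outer_header(const uint8_t *hdr, size_t len)
{
	unsigned int version;

	assert(hdr != NULL);
	if (len < IPV4_MIN_HDR || len > SKETCH_MAX_HEADROOM)
		return -EINVAL;

	version = hdr[0] >> 4;
	if (version == 4) {
		/* The low nibble is IHL, counted in 32-bit words. */
		size_t ihl_bytes = (size_t)(hdr[0] & 0x0f) * 4u;

		return len < ihl_bytes ? -EINVAL : 4;
	}
	if (version == 6)
		return len < IPV6_HDR ? -EINVAL : 6;
	return -EINVAL;
}

/* Ones'-complement checksum over an IPv4 header whose checksum field
 * (bytes 10 and 11) is currently zero. The result is the value that the
 * two checksum bytes hold when read in network order.
 */
static uint16_t ipv4_header_checksum(const uint8_t *hdr, size_t ihl_bytes)
{
	uint32_t sum = 0;
	size_t i;

	assert(hdr != NULL);
	for (i = 0; i + 1 < ihl_bytes; i += 2)
		sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

As in the patch, the checksum would only be filled in when the program left
the field at zero, so a program that precomputes it is not overridden.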

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/net/lwtunnel.h |  2 ++
 net/core/filter.c      |  3 +-
 net/core/lwt_bpf.c     | 65 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index 33fd9ba7e0e5..671113bcb2cc 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -126,6 +126,8 @@ int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b);
 int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb);
 int lwtunnel_input(struct sk_buff *skb);
 int lwtunnel_xmit(struct sk_buff *skb);
+int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len,
+			  bool ingress);
 
 static inline void lwtunnel_set_redirect(struct dst_entry *dst)
 {
diff --git a/net/core/filter.c b/net/core/filter.c
index 12c88c21b6b8..a78deb2656e1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -73,6 +73,7 @@
 #include <linux/seg6_local.h>
 #include <net/seg6.h>
 #include <net/seg6_local.h>
+#include <net/lwtunnel.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -4819,7 +4820,7 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len
 static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len,
 			     bool ingress)
 {
-	return -EINVAL;  /* Implemented in the next patch. */
+	return bpf_lwt_push_ip_encap(skb, hdr, len, ingress);
 }
 #endif
 
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a648568c5e8f..e5a9850d9f48 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -390,6 +390,71 @@ static const struct lwtunnel_encap_ops bpf_encap_ops = {
 	.owner		= THIS_MODULE,
 };
 
+static int handle_gso_encap(struct sk_buff *skb, bool ipv4, int encap_len)
+{
+	/* Handling of GSO-enabled packets is added in the next patch. */
+	return -EOPNOTSUPP;
+}
+
+int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress)
+{
+	struct iphdr *iph;
+	bool ipv4;
+	int err;
+
+	if (unlikely(len < sizeof(struct iphdr) || len > LWT_BPF_MAX_HEADROOM))
+		return -EINVAL;
+
+	/* validate protocol and length */
+	iph = (struct iphdr *)hdr;
+	if (iph->version == 4) {
+		ipv4 = true;
+		if (unlikely(len < iph->ihl * 4))
+			return -EINVAL;
+	} else if (iph->version == 6) {
+		ipv4 = false;
+		if (unlikely(len < sizeof(struct ipv6hdr)))
+			return -EINVAL;
+	} else {
+		return -EINVAL;
+	}
+
+	if (ingress)
+		err = skb_cow_head(skb, len + skb->mac_len);
+	else
+		err = skb_cow_head(skb,
+				   len + LL_RESERVED_SPACE(skb_dst(skb)->dev));
+	if (unlikely(err))
+		return err;
+
+	/* push the encap headers and fix pointers */
+	skb_reset_inner_headers(skb);
+	skb->encapsulation = 1;
+	skb_push(skb, len);
+	if (ingress)
+		skb_postpush_rcsum(skb, iph, len);
+	skb_reset_network_header(skb);
+	memcpy(skb_network_header(skb), hdr, len);
+	bpf_compute_data_pointers(skb);
+	skb_clear_hash(skb);
+
+	if (ipv4) {
+		skb->protocol = htons(ETH_P_IP);
+		iph = ip_hdr(skb);
+
+		if (!iph->check)
+			iph->check = ip_fast_csum((unsigned char *)iph,
+						  iph->ihl);
+	} else {
+		skb->protocol = htons(ETH_P_IPV6);
+	}
+
+	if (skb_is_gso(skb))
+		return handle_gso_encap(skb, ipv4, len);
+
+	return 0;
+}
+
 static int __init bpf_lwt_init(void)
 {
 	return lwtunnel_encap_add_ops(&bpf_encap_ops, LWTUNNEL_ENCAP_BPF);
-- 
2.20.1.791.gb4d0f1c61a-goog



* [PATCH bpf-next v9 3/7] bpf: handle GSO in bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-12  0:42 UTC
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190212004249.219268-1-posk@google.com>

This patch adds handling of GSO packets in bpf_lwt_push_ip_encap()
(called from bpf_lwt_push_encap):

* IPIP, GRE, and UDP encapsulation types are deduced by looking
  into iphdr->protocol or ipv6hdr->nexthdr;
* SCTP GSO packets are not supported (bpf_skb_proto_4_to_6
  and similar do not support them either);
* UDP_L4 GSO packets are also not supported (although they are
  not blocked in bpf_skb_proto_4_to_6 and similar), as
  skb_decrease_gso_size() will break them;
* SKB_GSO_DODGY bit is set.

Note: it may be possible to support SCTP and UDP_L4 gso packets;
      but as these cases seem to be not well handled by other
      tunneling/encapping code paths, the solution should
      be generic enough to apply to all tunneling/encapping code.

v8 changes:
   - make sure that if GRE or UDP encap is detected, there are
     enough pushed bytes to cover both IP[v6] + GRE|UDP headers;
   - do not reject double-encapped packets;
   - whitelist TCP GSO packets rather than block SCTP GSO and
     UDP GSO.
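
A rough userspace sketch of how the tunnel type is deduced from the pushed
headers, including the length validation mentioned in the v8 changes. The
TUN_* labels are hypothetical stand-ins for the SKB_GSO_* bits, the PROTO_*
constants are the IANA protocol numbers, and treating protocol 4 as the
IPIP case is an assumption about the intent (the patch itself matches on
IPPROTO_IP). The TCP-only GSO whitelist is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels standing in for the kernel's SKB_GSO_* bits. */
enum tunnel_gso {
	TUN_GRE = 1, TUN_GRE_CSUM, TUN_UDP, TUN_UDP_CSUM,
	TUN_IPXIP4, TUN_IPXIP6
};

/* IANA protocol numbers. */
#define PROTO_IPIP 4
#define PROTO_UDP  17
#define PROTO_IPV6 41
#define PROTO_GRE  47

/* hdr points at the pushed outer header, encap_len is its total length.
 * Returns a TUN_* label or a negative errno.
 */
static int deduce_tunnel_gso(const uint8_t *hdr, size_t encap_len, bool ipv4)
{
	/* Like the patch, assume no IPv4 options or IPv6 extension headers. */
	size_t off = ipv4 ? 20 : 40;
	uint8_t proto;

	assert(hdr != NULL);
	if (encap_len < off)
		return -EINVAL;

	/* Protocol is byte 9 of IPv4, next header is byte 6 of IPv6. */
	proto = ipv4 ? hdr[9] : hdr[6];

	switch (proto) {
	case PROTO_GRE:
		if (off + 4 > encap_len)	/* GRE base header */
			return -EINVAL;
		/* The checksum-present flag is the top bit of the first byte. */
		return (hdr[off] & 0x80) ? TUN_GRE_CSUM : TUN_GRE;
	case PROTO_UDP:
		if (off + 8 > encap_len)	/* UDP header */
			return -EINVAL;
		/* A non-zero checksum field (bytes 6-7) means CSUM offload. */
		return (hdr[off + 6] | hdr[off + 7]) ? TUN_UDP_CSUM : TUN_UDP;
	case PROTO_IPIP:
	case PROTO_IPV6:
		return ipv4 ? TUN_IPXIP4 : TUN_IPXIP6;
	default:
		return -EPROTONOSUPPORT;
	}
}
```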

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 net/core/lwt_bpf.c | 67 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 65 insertions(+), 2 deletions(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index e5a9850d9f48..079871fc020f 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -16,6 +16,7 @@
 #include <linux/types.h>
 #include <linux/bpf.h>
 #include <net/lwtunnel.h>
+#include <net/gre.h>
 
 struct bpf_lwt_prog {
 	struct bpf_prog *prog;
@@ -390,10 +391,72 @@ static const struct lwtunnel_encap_ops bpf_encap_ops = {
 	.owner		= THIS_MODULE,
 };
 
+static int handle_gso_type(struct sk_buff *skb, unsigned int gso_type,
+			   int encap_len)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+	gso_type |= SKB_GSO_DODGY;
+	shinfo->gso_type |= gso_type;
+	skb_decrease_gso_size(shinfo, encap_len);
+	shinfo->gso_segs = 0;
+	return 0;
+}
+
 static int handle_gso_encap(struct sk_buff *skb, bool ipv4, int encap_len)
 {
-	/* Handling of GSO-enabled packets is added in the next patch. */
-	return -EOPNOTSUPP;
+	int next_hdr_offset;
+	void *next_hdr;
+	__u8 protocol;
+
+	/* SCTP and UDP_L4 gso need more nuanced handling than what
+	 * handle_gso_type() does above: skb_decrease_gso_size() is not enough.
+	 * So at the moment only TCP GSO packets are let through.
+	 */
+	if (!(skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
+		return -ENOTSUPP;
+
+	if (ipv4) {
+		protocol = ip_hdr(skb)->protocol;
+		next_hdr_offset = sizeof(struct iphdr);
+		next_hdr = skb_network_header(skb) + next_hdr_offset;
+	} else {
+		protocol = ipv6_hdr(skb)->nexthdr;
+		next_hdr_offset = sizeof(struct ipv6hdr);
+		next_hdr = skb_network_header(skb) + next_hdr_offset;
+	}
+
+	switch (protocol) {
+	case IPPROTO_GRE:
+		next_hdr_offset += sizeof(struct gre_base_hdr);
+		if (next_hdr_offset > encap_len)
+			return -EINVAL;
+
+		if (((struct gre_base_hdr *)next_hdr)->flags & GRE_CSUM)
+			return handle_gso_type(skb, SKB_GSO_GRE_CSUM,
+					       encap_len);
+		return handle_gso_type(skb, SKB_GSO_GRE, encap_len);
+
+	case IPPROTO_UDP:
+		next_hdr_offset += sizeof(struct udphdr);
+		if (next_hdr_offset > encap_len)
+			return -EINVAL;
+
+		if (((struct udphdr *)next_hdr)->check)
+			return handle_gso_type(skb, SKB_GSO_UDP_TUNNEL_CSUM,
+					       encap_len);
+		return handle_gso_type(skb, SKB_GSO_UDP_TUNNEL, encap_len);
+
+	case IPPROTO_IP:
+	case IPPROTO_IPV6:
+		if (ipv4)
+			return handle_gso_type(skb, SKB_GSO_IPXIP4, encap_len);
+		else
+			return handle_gso_type(skb, SKB_GSO_IPXIP6, encap_len);
+
+	default:
+		return -EPROTONOSUPPORT;
+	}
 }
 
 int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress)
-- 
2.20.1.791.gb4d0f1c61a-goog



* [PATCH bpf-next v9 4/7] ipv6_stub: add ipv6_route_input stub/proxy.
From: Peter Oskolkov @ 2019-02-12  0:42 UTC
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190212004249.219268-1-posk@google.com>

Proxy ip6_route_input via ipv6_stub, for later use by lwt bpf ip encap
(see the next patch in the patchset).
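
The stub/proxy idea can be sketched in plain C as below, as an illustration
only. All names here (struct pkt, struct v6_stub, v6_module_load) are
hypothetical; the real ipv6_stub has many more members and is switched over
when the ipv6 module registers itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for sk_buff: just enough state for the sketch. */
struct pkt {
	int routed;	/* set once a route lookup has run */
	int dst_error;	/* error recorded on the resulting dst */
};

/* Table of function pointers that callers use instead of direct calls. */
struct v6_stub {
	int (*route_input)(struct pkt *p);
};

/* Default entry, used while IPv6 is not available. */
static int nosupport_route_input(struct pkt *p)
{
	(void)p;
	return -EAFNOSUPPORT;
}

/* Real entry: run the lookup, then report the error stored on the dst. */
static int real_route_input(struct pkt *p)
{
	assert(p != NULL);
	p->routed = 1;
	return p->dst_error;
}

static const struct v6_stub default_stub = {
	.route_input = nosupport_route_input,
};
static const struct v6_stub loaded_stub = {
	.route_input = real_route_input,
};

/* Callers always go through this pointer. */
static const struct v6_stub *stub = &default_stub;

/* Hypothetical hook run when the IPv6 implementation becomes available. */
static void v6_module_load(void)
{
	stub = &loaded_stub;
}
```

Built-in code can then call stub->route_input() unconditionally and gets
-EAFNOSUPPORT rather than a link error when IPv6 is modular and not loaded.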

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/net/addrconf.h   | 1 +
 net/ipv6/addrconf_core.c | 6 ++++++
 net/ipv6/af_inet6.c      | 7 +++++++
 3 files changed, 14 insertions(+)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 20d523ee2fec..269ec27385e9 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -248,6 +248,7 @@ struct ipv6_stub {
 				 const struct in6_addr *addr);
 	int (*ipv6_dst_lookup)(struct net *net, struct sock *sk,
 			       struct dst_entry **dst, struct flowi6 *fl6);
+	int (*ipv6_route_input)(struct sk_buff *skb);
 
 	struct fib6_table *(*fib6_get_table)(struct net *net, u32 id);
 	struct fib6_info *(*fib6_lookup)(struct net *net, int oif,
diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c
index 5cd0029d930e..6c79af056d9b 100644
--- a/net/ipv6/addrconf_core.c
+++ b/net/ipv6/addrconf_core.c
@@ -134,6 +134,11 @@ static int eafnosupport_ipv6_dst_lookup(struct net *net, struct sock *u1,
 	return -EAFNOSUPPORT;
 }
 
+static int eafnosupport_ipv6_route_input(struct sk_buff *skb)
+{
+	return -EAFNOSUPPORT;
+}
+
 static struct fib6_table *eafnosupport_fib6_get_table(struct net *net, u32 id)
 {
 	return NULL;
@@ -170,6 +175,7 @@ eafnosupport_ip6_mtu_from_fib6(struct fib6_info *f6i, struct in6_addr *daddr,
 
 const struct ipv6_stub *ipv6_stub __read_mostly = &(struct ipv6_stub) {
 	.ipv6_dst_lookup   = eafnosupport_ipv6_dst_lookup,
+	.ipv6_route_input  = eafnosupport_ipv6_route_input,
 	.fib6_get_table    = eafnosupport_fib6_get_table,
 	.fib6_table_lookup = eafnosupport_fib6_table_lookup,
 	.fib6_lookup       = eafnosupport_fib6_lookup,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index d99753b5e39b..2f45d2a3e3a3 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -900,10 +900,17 @@ static struct pernet_operations inet6_net_ops = {
 	.exit = inet6_net_exit,
 };
 
+static int ipv6_route_input(struct sk_buff *skb)
+{
+	ip6_route_input(skb);
+	return skb_dst(skb)->error;
+}
+
 static const struct ipv6_stub ipv6_stub_impl = {
 	.ipv6_sock_mc_join = ipv6_sock_mc_join,
 	.ipv6_sock_mc_drop = ipv6_sock_mc_drop,
 	.ipv6_dst_lookup   = ip6_dst_lookup,
+	.ipv6_route_input  = ipv6_route_input,
 	.fib6_get_table	   = fib6_get_table,
 	.fib6_table_lookup = fib6_table_lookup,
 	.fib6_lookup       = fib6_lookup,
-- 
2.20.1.791.gb4d0f1c61a-goog



* [PATCH bpf-next v9 5/7] bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c
From: Peter Oskolkov @ 2019-02-12  0:42 UTC
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190212004249.219268-1-posk@google.com>

This patch builds on top of the earlier patches in the patchset,
which added BPF_LWT_ENCAP_IP mode to bpf_lwt_push_encap. As the
encapping can result in the skb needing to go via a different
interface/route/dst, bpf programs can indicate this by returning
BPF_LWT_REROUTE, which triggers a new route lookup for the skb.
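
As an illustration, the way the xmit path reacts to the program's return
code can be sketched as follows. The RET_* values match BPF_OK, BPF_REDIRECT
and BPF_LWT_REROUTE from this patchset; the ACT_* labels and xmit_verdict()
are hypothetical and only name the outcome instead of acting on an skb.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values of BPF_OK, BPF_REDIRECT and BPF_LWT_REROUTE. */
enum { RET_OK = 0, RET_REDIRECT = 7, RET_LWT_REROUTE = 128 };

/* Hypothetical labels for what the xmit path does next. */
enum xmit_action { ACT_CONTINUE = 1, ACT_DONE, ACT_REROUTE };

/* proto_before/proto_after: the L3 protocol of the packet before and
 * after the program ran. Returns an ACT_* label or a negative errno.
 */
static int xmit_verdict(int ret, uint16_t proto_before, uint16_t proto_after)
{
	switch (ret) {
	case RET_OK:
		/* A changed L3 protocol requires a reroute, not a plain OK. */
		if (proto_before != proto_after)
			return -EINVAL;
		return ACT_CONTINUE;
	case RET_REDIRECT:
		return ACT_DONE;
	case RET_LWT_REROUTE:
		return ACT_REROUTE;	/* new route lookup, then output */
	default:
		return ret;		/* errors are passed through */
	}
}

/* Usage: IPv4 (0x0800) encapped into IPv6 (0x86dd) must ask for a reroute. */
static void xmit_verdict_example(void)
{
	assert(xmit_verdict(RET_OK, 0x0800, 0x86dd) == -EINVAL);
	assert(xmit_verdict(RET_LWT_REROUTE, 0x0800, 0x86dd) == ACT_REROUTE);
}
```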

v8 changes: fix kbuild errors when LWTUNNEL_BPF is builtin, but
   IPV6 is a module: as LWTUNNEL_BPF can only be either Y or N,
   call IPV6 routing functions only if they are built-in.

v9 changes:
   - fixed a kbuild test robot compiler warning;
   - call IPV6 routing functions via ipv6_stub.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 net/core/lwt_bpf.c | 133 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 079871fc020f..aec5e6df880e 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -17,6 +17,7 @@
 #include <linux/bpf.h>
 #include <net/lwtunnel.h>
 #include <net/gre.h>
+#include <net/ip6_route.h>
 
 struct bpf_lwt_prog {
 	struct bpf_prog *prog;
@@ -56,6 +57,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 
 	switch (ret) {
 	case BPF_OK:
+	case BPF_LWT_REROUTE:
 		break;
 
 	case BPF_REDIRECT:
@@ -88,6 +90,35 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 	return ret;
 }
 
+static int bpf_lwt_input_reroute(struct sk_buff *skb)
+{
+	int err = -EINVAL;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		struct iphdr *iph = ip_hdr(skb);
+
+		err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
+					   iph->tos, skb_dst(skb)->dev);
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+#if IS_ENABLED(CONFIG_IPV6)
+		err = ipv6_stub->ipv6_route_input(skb);
+#else
+		pr_warn_once("BPF_LWT_REROUTE input: IPV6 not available\n");
+#endif
+	} else {
+		pr_warn_once("BPF_LWT_REROUTE input: unsupported proto %d\n",
+			     skb->protocol);
+	}
+
+	if (err)
+		goto err;
+	return dst_input(skb);
+
+err:
+	kfree_skb(skb);
+	return err;
+}
+
 static int bpf_input(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -99,6 +130,8 @@ static int bpf_input(struct sk_buff *skb)
 		ret = run_lwt_bpf(skb, &bpf->in, dst, NO_REDIRECT);
 		if (ret < 0)
 			return ret;
+		if (ret == BPF_LWT_REROUTE)
+			return bpf_lwt_input_reroute(skb);
 	}
 
 	if (unlikely(!dst->lwtstate->orig_input)) {
@@ -148,6 +181,95 @@ static int xmit_check_hhlen(struct sk_buff *skb)
 	return 0;
 }
 
+static int bpf_lwt_xmit_reroute(struct sk_buff *skb)
+{
+	struct net_device *l3mdev = l3mdev_master_dev_rcu(skb_dst(skb)->dev);
+	int oif = l3mdev ? l3mdev->ifindex : 0;
+	struct dst_entry *dst = NULL;
+	struct sock *sk;
+	struct net *net;
+	bool ipv4;
+	int err;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		ipv4 = true;
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		ipv4 = false;
+	} else {
+		pr_warn_once("BPF_LWT_REROUTE xmit: unsupported proto %d\n",
+			     skb->protocol);
+		return -EINVAL;
+	}
+
+	sk = sk_to_full_sk(skb->sk);
+	if (sk) {
+		if (sk->sk_bound_dev_if)
+			oif = sk->sk_bound_dev_if;
+		net = sock_net(sk);
+	} else {
+		net = dev_net(skb_dst(skb)->dev);
+	}
+
+	if (ipv4) {
+		struct iphdr *iph = ip_hdr(skb);
+		struct flowi4 fl4 = {};
+		struct rtable *rt;
+
+		fl4.flowi4_oif = oif;
+		fl4.flowi4_mark = skb->mark;
+		fl4.flowi4_uid = sock_net_uid(net, sk);
+		fl4.flowi4_tos = RT_TOS(iph->tos);
+		fl4.flowi4_flags = FLOWI_FLAG_ANYSRC;
+		fl4.flowi4_proto = iph->protocol;
+		fl4.daddr = iph->daddr;
+		fl4.saddr = iph->saddr;
+
+		rt = ip_route_output_key(net, &fl4);
+		if (IS_ERR(rt) || rt->dst.error)
+			return -EINVAL;
+		dst = &rt->dst;
+	} else {
+#if IS_ENABLED(CONFIG_IPV6)
+		struct ipv6hdr *iph6 = ipv6_hdr(skb);
+		struct flowi6 fl6 = {};
+
+		fl6.flowi6_oif = oif;
+		fl6.flowi6_mark = skb->mark;
+		fl6.flowi6_uid = sock_net_uid(net, sk);
+		fl6.flowlabel = ip6_flowinfo(iph6);
+		fl6.flowi6_proto = iph6->nexthdr;
+		fl6.daddr = iph6->daddr;
+		fl6.saddr = iph6->saddr;
+
+		err = ipv6_stub->ipv6_dst_lookup(net, skb->sk, &dst, &fl6);
+		if (err || IS_ERR(dst) || dst->error)
+			return -EINVAL;
+#else
+		pr_warn_once("BPF_LWT_REROUTE xmit: IPV6 not available\n");
+		return -EINVAL;
+#endif
+	}
+
+	/* Although skb header was reserved in bpf_lwt_push_ip_encap(), it
+	 * was done for the previous dst, so we are doing it here again, in
+	 * case the new dst needs much more space. The call below is a noop
+	 * if there is enough header space in skb.
+	 */
+	err = skb_cow_head(skb, LL_RESERVED_SPACE(dst->dev));
+	if (unlikely(err))
+		return err;
+
+	skb_dst_drop(skb);
+	skb_dst_set(skb, dst);
+
+	err = dst_output(dev_net(skb_dst(skb)->dev), skb->sk, skb);
+	if (unlikely(err))
+		return err;
+
+	/* ip[6]_finish_output2 understand LWTUNNEL_XMIT_DONE */
+	return LWTUNNEL_XMIT_DONE;
+}
+
 static int bpf_xmit(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -155,11 +277,20 @@ static int bpf_xmit(struct sk_buff *skb)
 
 	bpf = bpf_lwt_lwtunnel(dst->lwtstate);
 	if (bpf->xmit.prog) {
+		__be16 proto = skb->protocol;
 		int ret;
 
 		ret = run_lwt_bpf(skb, &bpf->xmit, dst, CAN_REDIRECT);
 		switch (ret) {
 		case BPF_OK:
+			/* If the header changed, e.g. via bpf_lwt_push_encap,
+			 * BPF_LWT_REROUTE below should have been used if the
+			 * protocol was also changed.
+			 */
+			if (skb->protocol != proto) {
+				kfree_skb(skb);
+				return -EINVAL;
+			}
 			/* If the header was expanded, headroom might be too
 			 * small for L2 header to come, expand as needed.
 			 */
@@ -170,6 +301,8 @@ static int bpf_xmit(struct sk_buff *skb)
 			return LWTUNNEL_XMIT_CONTINUE;
 		case BPF_REDIRECT:
 			return LWTUNNEL_XMIT_DONE;
+		case BPF_LWT_REROUTE:
+			return bpf_lwt_xmit_reroute(skb);
 		default:
 			return ret;
 		}
-- 
2.20.1.791.gb4d0f1c61a-goog



* [PATCH bpf-next v9 6/7] bpf: sync <kdir>/include/.../bpf.h with tools/include/.../bpf.h
From: Peter Oskolkov @ 2019-02-12  0:42 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190212004249.219268-1-posk@google.com>

This patch copies the changes to bpf.h made by a previous patch
in this patchset from the kernel uapi include dir into the tools
uapi include dir.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/include/uapi/linux/bpf.h | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 25c8c0e62ecf..bcdd2474eee7 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2016,6 +2016,19 @@ union bpf_attr {
  *			Only works if *skb* contains an IPv6 packet. Insert a
  *			Segment Routing Header (**struct ipv6_sr_hdr**) inside
  *			the IPv6 header.
+ *		**BPF_LWT_ENCAP_IP**
+ *			IP encapsulation (GRE/GUE/IPIP/etc). The outer header
+ *			must be IPv4 or IPv6, followed by zero or more
+ *			additional headers, up to LWT_BPF_MAX_HEADROOM total
+ *			bytes in all prepended headers. Please note that
+ *			if skb_is_gso(skb) is true, no more than two headers
+ *			can be prepended, and the inner header, if present,
+ *			should be either GRE or UDP/GUE.
+ *
+ *		BPF_LWT_ENCAP_SEG6*** types can be called by bpf programs of
+ *		type BPF_PROG_TYPE_LWT_IN; BPF_LWT_ENCAP_IP type can be called
+ *		by bpf programs of types BPF_PROG_TYPE_LWT_IN and
+ *		BPF_PROG_TYPE_LWT_XMIT.
  *
  * 		A call to this helper is susceptible to change the underlaying
  * 		packet buffer. Therefore, at load time, all checks on pointers
@@ -2517,7 +2530,8 @@ enum bpf_hdr_start_off {
 /* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
 enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_SEG6,
-	BPF_LWT_ENCAP_SEG6_INLINE
+	BPF_LWT_ENCAP_SEG6_INLINE,
+	BPF_LWT_ENCAP_IP,
 };
 
 #define __bpf_md_ptr(type, name)	\
@@ -2606,7 +2620,15 @@ enum bpf_ret_code {
 	BPF_DROP = 2,
 	/* 3-6 reserved */
 	BPF_REDIRECT = 7,
-	/* >127 are reserved for prog type specific return codes */
+	/* >127 are reserved for prog type specific return codes.
+	 *
+	 * BPF_LWT_REROUTE: used by BPF_PROG_TYPE_LWT_IN and
+	 *    BPF_PROG_TYPE_LWT_XMIT to indicate that skb had been
+	 *    changed and should be routed based on its new L3 header.
+	 *    (This is an L3 redirect, as opposed to L2 redirect
+	 *    represented by BPF_REDIRECT above).
+	 */
+	BPF_LWT_REROUTE = 128,
 };
 
 struct bpf_sock {
-- 
2.20.1.791.gb4d0f1c61a-goog



* [PATCH bpf-next v9 7/7] selftests: bpf: add test_lwt_ip_encap selftest
From: Peter Oskolkov @ 2019-02-12  0:42 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190212004249.219268-1-posk@google.com>

This patch adds a bpf self-test to cover BPF_LWT_ENCAP_IP mode
in bpf_lwt_push_encap.

Covered:
- encapping in LWT_IN and LWT_XMIT
- IPv4 and IPv6

A follow-up patch will add GSO and VRF-enabled tests.
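
The two programs added below fill in the outer length fields by hand, and
the IPv4 and IPv6 cases differ: the IPv4 total length counts the outer IP
header itself, while the IPv6 payload length counts only what follows the
fixed IPv6 header. A minimal userspace sketch of that arithmetic, assuming
a 20-byte IPv4 header without options and the 4-byte base GRE header used
in the test (the constant and function names are illustrative and do not
appear in the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Sizes assumed for this sketch: IPv4 header without options (ihl = 5)
 * and the base GRE header (flags + protocol) used by the selftest.
 */
#define SKETCH_IPV4_HDR_LEN 20u
#define SKETCH_GRE_HDR_LEN   4u

/* Outer IPv4 tot_len: outer IP header + GRE header + inner packet. */
static uint16_t sketch_outer_ipv4_tot_len(uint16_t inner_len)
{
	uint32_t total = (uint32_t)inner_len + SKETCH_IPV4_HDR_LEN +
			 SKETCH_GRE_HDR_LEN;

	assert(total <= UINT16_MAX);	/* tot_len is a 16-bit field */
	return (uint16_t)total;
}

/* Outer IPv6 payload_len: everything after the fixed IPv6 header. */
static uint16_t sketch_outer_ipv6_payload_len(uint16_t inner_len)
{
	uint32_t total = (uint32_t)inner_len + SKETCH_GRE_HDR_LEN;

	assert(total <= UINT16_MAX);	/* payload_len is a 16-bit field */
	return (uint16_t)total;
}
```

Both values are in host byte order; the test programs convert them with
bpf_htons() before storing them in the header.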

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/testing/selftests/bpf/Makefile          |   5 +-
 .../testing/selftests/bpf/test_lwt_ip_encap.c |  85 +++++
 .../selftests/bpf/test_lwt_ip_encap.sh        | 311 ++++++++++++++++++
 3 files changed, 399 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lwt_ip_encap.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_ip_encap.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index c7e1e3255448..3ebd41a0c253 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -36,7 +36,7 @@ BPF_OBJ_FILES = \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
 	test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o test_xdp_vlan.o \
 	xdp_dummy.o test_map_in_map.o test_spin_lock.o test_map_lock.o \
-	test_sock_fields_kern.o
+	test_sock_fields_kern.o test_lwt_ip_encap.o
 
 # Objects are built with default compilation flags and with sub-register
 # code-gen enabled.
@@ -74,7 +74,8 @@ TEST_PROGS := test_kmod.sh \
 	test_lirc_mode2.sh \
 	test_skb_cgroup_id.sh \
 	test_flow_dissector.sh \
-	test_xdp_vlan.sh
+	test_xdp_vlan.sh \
+	test_lwt_ip_encap.sh
 
 TEST_PROGS_EXTENDED := with_addr.sh \
 	with_tunnels.sh \
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.c b/tools/testing/selftests/bpf/test_lwt_ip_encap.c
new file mode 100644
index 000000000000..c957d6dfe6d7
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stddef.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+struct grehdr {
+	__be16 flags;
+	__be16 protocol;
+};
+
+SEC("encap_gre")
+int bpf_lwt_encap_gre(struct __sk_buff *skb)
+{
+	struct encap_hdr {
+		struct iphdr iph;
+		struct grehdr greh;
+	} hdr;
+	int err;
+
+	memset(&hdr, 0, sizeof(struct encap_hdr));
+
+	hdr.iph.ihl = 5;
+	hdr.iph.version = 4;
+	hdr.iph.ttl = 0x40;
+	hdr.iph.protocol = 47;  /* IPPROTO_GRE */
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	hdr.iph.saddr = 0x640110ac;  /* 172.16.1.100 */
+	hdr.iph.daddr = 0x641010ac;  /* 172.16.16.100 */
+#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+	hdr.iph.saddr = 0xac100164;  /* 172.16.1.100 */
+	hdr.iph.daddr = 0xac101064;  /* 172.16.16.100 */
+#else
+#error "Fix your compiler's __BYTE_ORDER__?!"
+#endif
+	hdr.iph.tot_len = bpf_htons(skb->len + sizeof(struct encap_hdr));
+
+	hdr.greh.protocol = skb->protocol;
+
+	err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr,
+				 sizeof(struct encap_hdr));
+	if (err)
+		return BPF_DROP;
+
+	return BPF_LWT_REROUTE;
+}
+
+SEC("encap_gre6")
+int bpf_lwt_encap_gre6(struct __sk_buff *skb)
+{
+	struct encap_hdr {
+		struct ipv6hdr ip6hdr;
+		struct grehdr greh;
+	} hdr;
+	int err;
+
+	memset(&hdr, 0, sizeof(struct encap_hdr));
+
+	hdr.ip6hdr.version = 6;
+	hdr.ip6hdr.payload_len = bpf_htons(skb->len + sizeof(struct grehdr));
+	hdr.ip6hdr.nexthdr = 47;  /* IPPROTO_GRE */
+	hdr.ip6hdr.hop_limit = 0x40;
+	/* fb01::1 */
+	hdr.ip6hdr.saddr.s6_addr[0] = 0xfb;
+	hdr.ip6hdr.saddr.s6_addr[1] = 1;
+	hdr.ip6hdr.saddr.s6_addr[15] = 1;
+	/* fb10::1 */
+	hdr.ip6hdr.daddr.s6_addr[0] = 0xfb;
+	hdr.ip6hdr.daddr.s6_addr[1] = 0x10;
+	hdr.ip6hdr.daddr.s6_addr[15] = 1;
+
+	hdr.greh.protocol = skb->protocol;
+
+	err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr,
+				 sizeof(struct encap_hdr));
+	if (err)
+		return BPF_DROP;
+
+	return BPF_LWT_REROUTE;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
new file mode 100755
index 000000000000..4ca714e23ab0
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
@@ -0,0 +1,311 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Setup/topology:
+#
+#    NS1             NS2             NS3
+#   veth1 <---> veth2   veth3 <---> veth4 (the top route)
+#   veth5 <---> veth6   veth7 <---> veth8 (the bottom route)
+#
+#   each vethN gets IPv[4|6]_N address
+#
+#   IPv*_SRC = IPv*_1
+#   IPv*_DST = IPv*_4
+#
+#   all tests test pings from IPv*_SRC to IPv*_DST
+#
+#   by default, routes are configured to allow packets to go
+#   IP*_1 <=> IP*_2 <=> IP*_3 <=> IP*_4 (the top route)
+#
+#   a GRE device is installed in NS3 with IPv*_GRE, and
+#   NS1/NS2 are configured to route packets to IPv*_GRE via IP*_8
+#   (the bottom route)
+#
+# Tests:
+#
+#   1. routes NS2->IPv*_DST are brought down, so the only way a ping
+#      from IP*_SRC to IP*_DST can work is via IPv*_GRE
+#
+#   2a. in an egress test, a bpf LWT_XMIT program is installed on veth1
+#       that encaps the packets with an IP/GRE header to route to IPv*_GRE
+#
+#       ping: SRC->[encap at veth1:egress]->GRE:decap->DST
+#       ping replies go DST->SRC directly
+#
+#   2b. in an ingress test, a bpf LWT_IN program is installed on veth2
+#       that encaps the packets with an IP/GRE header to route to IPv*_GRE
+#
+#       ping: SRC->[encap at veth2:ingress]->GRE:decap->DST
+#       ping replies go DST->SRC directly
+
+set -e  # exit on error
+
+if [[ $EUID -ne 0 ]]; then
+	echo "This script must be run as root"
+	echo "FAIL"
+	exit 1
+fi
+
+readonly NS1="ns1-$(mktemp -u XXXXXX)"
+readonly NS2="ns2-$(mktemp -u XXXXXX)"
+readonly NS3="ns3-$(mktemp -u XXXXXX)"
+
+readonly IPv4_1="172.16.1.100"
+readonly IPv4_2="172.16.2.100"
+readonly IPv4_3="172.16.3.100"
+readonly IPv4_4="172.16.4.100"
+readonly IPv4_5="172.16.5.100"
+readonly IPv4_6="172.16.6.100"
+readonly IPv4_7="172.16.7.100"
+readonly IPv4_8="172.16.8.100"
+readonly IPv4_GRE="172.16.16.100"
+
+readonly IPv4_SRC=$IPv4_1
+readonly IPv4_DST=$IPv4_4
+
+readonly IPv6_1="fb01::1"
+readonly IPv6_2="fb02::1"
+readonly IPv6_3="fb03::1"
+readonly IPv6_4="fb04::1"
+readonly IPv6_5="fb05::1"
+readonly IPv6_6="fb06::1"
+readonly IPv6_7="fb07::1"
+readonly IPv6_8="fb08::1"
+readonly IPv6_GRE="fb10::1"
+
+readonly IPv6_SRC=$IPv6_1
+readonly IPv6_DST=$IPv6_4
+
+setup() {
+	set -e  # exit on error
+	# create devices and namespaces
+	ip netns add "${NS1}"
+	ip netns add "${NS2}"
+	ip netns add "${NS3}"
+
+	ip link add veth1 type veth peer name veth2
+	ip link add veth3 type veth peer name veth4
+	ip link add veth5 type veth peer name veth6
+	ip link add veth7 type veth peer name veth8
+
+	ip netns exec ${NS2} sysctl -wq net.ipv4.ip_forward=1
+	ip netns exec ${NS2} sysctl -wq net.ipv6.conf.all.forwarding=1
+
+	ip link set veth1 netns ${NS1}
+	ip link set veth2 netns ${NS2}
+	ip link set veth3 netns ${NS2}
+	ip link set veth4 netns ${NS3}
+	ip link set veth5 netns ${NS1}
+	ip link set veth6 netns ${NS2}
+	ip link set veth7 netns ${NS2}
+	ip link set veth8 netns ${NS3}
+
+	# configure addresses: the top route (1-2-3-4)
+	ip -netns ${NS1}    addr add ${IPv4_1}/24  dev veth1
+	ip -netns ${NS2}    addr add ${IPv4_2}/24  dev veth2
+	ip -netns ${NS2}    addr add ${IPv4_3}/24  dev veth3
+	ip -netns ${NS3}    addr add ${IPv4_4}/24  dev veth4
+	ip -netns ${NS1} -6 addr add ${IPv6_1}/128 nodad dev veth1
+	ip -netns ${NS2} -6 addr add ${IPv6_2}/128 nodad dev veth2
+	ip -netns ${NS2} -6 addr add ${IPv6_3}/128 nodad dev veth3
+	ip -netns ${NS3} -6 addr add ${IPv6_4}/128 nodad dev veth4
+
+	# configure addresses: the bottom route (5-6-7-8)
+	ip -netns ${NS1}    addr add ${IPv4_5}/24  dev veth5
+	ip -netns ${NS2}    addr add ${IPv4_6}/24  dev veth6
+	ip -netns ${NS2}    addr add ${IPv4_7}/24  dev veth7
+	ip -netns ${NS3}    addr add ${IPv4_8}/24  dev veth8
+	ip -netns ${NS1} -6 addr add ${IPv6_5}/128 nodad dev veth5
+	ip -netns ${NS2} -6 addr add ${IPv6_6}/128 nodad dev veth6
+	ip -netns ${NS2} -6 addr add ${IPv6_7}/128 nodad dev veth7
+	ip -netns ${NS3} -6 addr add ${IPv6_8}/128 nodad dev veth8
+
+
+	ip -netns ${NS1} link set dev veth1 up
+	ip -netns ${NS2} link set dev veth2 up
+	ip -netns ${NS2} link set dev veth3 up
+	ip -netns ${NS3} link set dev veth4 up
+	ip -netns ${NS1} link set dev veth5 up
+	ip -netns ${NS2} link set dev veth6 up
+	ip -netns ${NS2} link set dev veth7 up
+	ip -netns ${NS3} link set dev veth8 up
+
+	# configure routes: IP*_SRC -> veth1/IP*_2 (= top route) default;
+	# the bottom route to specific bottom addresses
+
+	# NS1
+	# top route
+	ip -netns ${NS1}    route add ${IPv4_2}/32  dev veth1
+	ip -netns ${NS1}    route add default dev veth1 via ${IPv4_2}  # go top by default
+	ip -netns ${NS1} -6 route add ${IPv6_2}/128 dev veth1
+	ip -netns ${NS1} -6 route add default dev veth1 via ${IPv6_2}  # go top by default
+	# bottom route
+	ip -netns ${NS1}    route add ${IPv4_6}/32  dev veth5
+	ip -netns ${NS1}    route add ${IPv4_7}/32  dev veth5 via ${IPv4_6}
+	ip -netns ${NS1}    route add ${IPv4_8}/32  dev veth5 via ${IPv4_6}
+	ip -netns ${NS1} -6 route add ${IPv6_6}/128 dev veth5
+	ip -netns ${NS1} -6 route add ${IPv6_7}/128 dev veth5 via ${IPv6_6}
+	ip -netns ${NS1} -6 route add ${IPv6_8}/128 dev veth5 via ${IPv6_6}
+
+	# NS2
+	# top route
+	ip -netns ${NS2}    route add ${IPv4_1}/32  dev veth2
+	ip -netns ${NS2}    route add ${IPv4_4}/32  dev veth3
+	ip -netns ${NS2} -6 route add ${IPv6_1}/128 dev veth2
+	ip -netns ${NS2} -6 route add ${IPv6_4}/128 dev veth3
+	# bottom route
+	ip -netns ${NS2}    route add ${IPv4_5}/32  dev veth6
+	ip -netns ${NS2}    route add ${IPv4_8}/32  dev veth7
+	ip -netns ${NS2} -6 route add ${IPv6_5}/128 dev veth6
+	ip -netns ${NS2} -6 route add ${IPv6_8}/128 dev veth7
+
+	# NS3
+	# top route
+	ip -netns ${NS3}    route add ${IPv4_3}/32  dev veth4
+	ip -netns ${NS3}    route add ${IPv4_1}/32  dev veth4 via ${IPv4_3}
+	ip -netns ${NS3}    route add ${IPv4_2}/32  dev veth4 via ${IPv4_3}
+	ip -netns ${NS3} -6 route add ${IPv6_3}/128 dev veth4
+	ip -netns ${NS3} -6 route add ${IPv6_1}/128 dev veth4 via ${IPv6_3}
+	ip -netns ${NS3} -6 route add ${IPv6_2}/128 dev veth4 via ${IPv6_3}
+	# bottom route
+	ip -netns ${NS3}    route add ${IPv4_7}/32  dev veth8
+	ip -netns ${NS3}    route add ${IPv4_5}/32  dev veth8 via ${IPv4_7}
+	ip -netns ${NS3}    route add ${IPv4_6}/32  dev veth8 via ${IPv4_7}
+	ip -netns ${NS3} -6 route add ${IPv6_7}/128 dev veth8
+	ip -netns ${NS3} -6 route add ${IPv6_5}/128 dev veth8 via ${IPv6_7}
+	ip -netns ${NS3} -6 route add ${IPv6_6}/128 dev veth8 via ${IPv6_7}
+
+	# configure IPv4 GRE device in NS3, and a route to it via the "bottom" route
+	ip -netns ${NS3} tunnel add gre_dev mode gre remote ${IPv4_1} local ${IPv4_GRE} ttl 255
+	ip -netns ${NS3} link set gre_dev up
+	ip -netns ${NS3} addr add ${IPv4_GRE} dev gre_dev
+	ip -netns ${NS1} route add ${IPv4_GRE}/32 dev veth5 via ${IPv4_6}
+	ip -netns ${NS2} route add ${IPv4_GRE}/32 dev veth7 via ${IPv4_8}
+
+
+	# configure IPv6 GRE device in NS3, and a route to it via the "bottom" route
+	ip -netns ${NS3} -6 tunnel add name gre6_dev mode ip6gre remote ${IPv6_1} local ${IPv6_GRE} ttl 255
+	ip -netns ${NS3} link set gre6_dev up
+	ip -netns ${NS3} -6 addr add ${IPv6_GRE} nodad dev gre6_dev
+	ip -netns ${NS1} -6 route add ${IPv6_GRE}/128 dev veth5 via ${IPv6_6}
+	ip -netns ${NS2} -6 route add ${IPv6_GRE}/128 dev veth7 via ${IPv6_8}
+
+	# rp_filter gets confused by what these tests are doing, so disable it
+	ip netns exec ${NS1} sysctl -wq net.ipv4.conf.all.rp_filter=0
+	ip netns exec ${NS2} sysctl -wq net.ipv4.conf.all.rp_filter=0
+	ip netns exec ${NS3} sysctl -wq net.ipv4.conf.all.rp_filter=0
+}
+
+cleanup() {
+	ip netns del ${NS1} 2> /dev/null
+	ip netns del ${NS2} 2> /dev/null
+	ip netns del ${NS3} 2> /dev/null
+}
+
+trap cleanup EXIT
+
+test_ping() {
+	local -r PROTO=$1
+	local -r EXPECTED=$2
+	local RET=0
+
+	set +e
+	if [ "${PROTO}" == "IPv4" ] ; then
+		ip netns exec ${NS1} ping  -c 1 -W 1 -I ${IPv4_SRC} ${IPv4_DST} > /dev/null 2>&1
+		RET=$?
+	elif [ "${PROTO}" == "IPv6" ] ; then
+		ip netns exec ${NS1} ping6 -c 1 -W 6 -I ${IPv6_SRC} ${IPv6_DST} > /dev/null 2>&1
+		RET=$?
+	else
+		echo "test_ping: unknown PROTO: ${PROTO}"
+		exit 1
+	fi
+	set -e
+
+	if [ "0" != "${RET}" ]; then
+		RET=1
+	fi
+
+	if [ "${EXPECTED}" != "${RET}" ] ; then
+		echo "FAIL: test_ping: ${RET}"
+		exit 1
+	fi
+}
+
+test_egress() {
+	local -r ENCAP=$1
+	echo "starting egress ${ENCAP} encap test"
+	setup
+
+	# need to wait a bit for IPv6 to autoconf, otherwise
+	# ping6 sometimes fails with "unable to bind to address"
+
+	# by default, pings work
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	# remove NS2->DST routes, ping fails
+	ip -netns ${NS2}    route del ${IPv4_DST}/32  dev veth3
+	ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3
+	test_ping IPv4 1
+	test_ping IPv6 1
+
+	# install replacement routes (LWT/eBPF), pings succeed
+	if [ "${ENCAP}" == "IPv4" ] ; then
+		ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1
+		ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1
+	elif [ "${ENCAP}" == "IPv6" ] ; then
+		ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
+		ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
+	else
+		echo "FAIL: unknown encap ${ENCAP}"
+	fi
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	cleanup
+	echo "PASS"
+}
+
+test_ingress() {
+	local -r ENCAP=$1
+	echo "starting ingress ${ENCAP} encap test"
+	setup
+
+	# need to wait a bit for IPv6 to autoconf, otherwise
+	# ping6 sometimes fails with "unable to bind to address"
+
+	# by default, pings work
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	# remove NS2->DST routes, pings fail
+	ip -netns ${NS2}    route del ${IPv4_DST}/32  dev veth3
+	ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3
+	test_ping IPv4 1
+	test_ping IPv6 1
+
+	# install replacement routes (LWT/eBPF), pings succeed
+	if [ "${ENCAP}" == "IPv4" ] ; then
+		ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2
+		ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2
+	elif [ "${ENCAP}" == "IPv6" ] ; then
+		ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2
+		ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2
+	else
+		echo "FAIL: unknown encap ${ENCAP}"
+	fi
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	cleanup
+	echo "PASS"
+}
+
+test_egress IPv4
+test_egress IPv6
+
+test_ingress IPv4
+test_ingress IPv6
+
+echo "all tests passed"
-- 
2.20.1.791.gb4d0f1c61a-goog



* [PATCH bpf-next 0/4] libbpf: Add support for 32-bit static data
From: Joe Stringer @ 2019-02-12  0:47 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast

This series adds support to libbpf for relocating references to 32-bit
static data inside ELF files, both for .data and .bss, similar to one of
the approaches proposed at LPC 2018[0]. This improves a common workflow
for BPF users, where the BPF program may be customised each time it is
loaded, for example to tailor IP addresses for each instance of the
loaded program. Current approaches require full recompilation of the
programs for each load. With templatized BPF programs, one ELF template
program can be generated once, and the static data can then be
substituted prior to loading into the kernel without invoking the
compiler again.

The approach here is useful for templating limited static data for ELF
programs, and will work regardless of kernel support for static data
sections. Its main limitation is that static data must be defined as
32-bit values in the BPF C input code (or defined using macros that use
32-bit values as the underlying store). The alternative approach
proposed at LPC would be more general and is being actively explored.
However, it requires a kernel extension and so will not solve this
problem for any existing kernels that are in use today.
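
The substitution step described above amounts to overwriting a 32-bit
value at a known offset in the object's .data contents before the loader
reads it. A rough sketch under stated assumptions: the .data bytes are
already in memory, the symbol's offset within the section is known, and
sketch_template_set_u32() is a hypothetical helper, not a libbpf API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Overwrite the 32-bit static variable at byte offset sym_off in a copy
 * of the .data section. Returns 0 on success, -1 if the write would run
 * past the end of the section.
 */
static int sketch_template_set_u32(uint8_t *data, size_t data_size,
				   size_t sym_off, uint32_t value)
{
	assert(data != NULL);
	if (sym_off > data_size || data_size - sym_off < sizeof(value))
		return -1;
	/* memcpy avoids any alignment assumption about sym_off. */
	memcpy(data + sym_off, &value, sizeof(value));
	return 0;
}
```

The bounds check is written as a subtraction so that a large sym_off
cannot wrap around and pass the comparison.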

There are similar patches floating around for iproute2 which I would
like to upstream as well[1].

[0] https://linuxplumbersconf.org/event/2/contributions/115/
[1] https://github.com/joestringer/iproute2/tree/bss

Joe Stringer (4):
  libbpf: Refactor relocations
  libbpf: Support 32-bit static data loads
  libbpf: Support relocations for bss.
  selftests/bpf: Test static data relocation

 tools/lib/bpf/libbpf.c                        | 108 ++++++++++++------
 tools/testing/selftests/bpf/Makefile          |   2 +-
 tools/testing/selftests/bpf/test_progs.c      |  44 +++++++
 .../selftests/bpf/test_static_data_kern.c     |  47 ++++++++
 4 files changed, 168 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_static_data_kern.c

-- 
2.19.1


* [PATCH bpf-next 1/4] libbpf: Refactor relocations
From: Joe Stringer @ 2019-02-12  0:47 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast
In-Reply-To: <20190212004729.535-1-joe@wand.net.nz>

Adjust the code for relocations slightly with no functional changes, so
that upcoming patches that will introduce support for relocations into
the .data and .bss sections can be added independently of these changes.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
 tools/lib/bpf/libbpf.c | 62 ++++++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 30 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e3c39edfb9d3..1ec28d5154dc 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -852,20 +852,20 @@ static int bpf_object__elf_collect(struct bpf_object *obj, int flags)
 				obj->efile.symbols = data;
 				obj->efile.strtabidx = sh.sh_link;
 			}
-		} else if ((sh.sh_type == SHT_PROGBITS) &&
-			   (sh.sh_flags & SHF_EXECINSTR) &&
-			   (data->d_size > 0)) {
-			if (strcmp(name, ".text") == 0)
-				obj->efile.text_shndx = idx;
-			err = bpf_object__add_program(obj, data->d_buf,
-						      data->d_size, name, idx);
-			if (err) {
-				char errmsg[STRERR_BUFSIZE];
-				char *cp = libbpf_strerror_r(-err, errmsg,
-							     sizeof(errmsg));
-
-				pr_warning("failed to alloc program %s (%s): %s",
-					   name, obj->path, cp);
+		} else if (sh.sh_type == SHT_PROGBITS && data->d_size > 0) {
+			if (sh.sh_flags & SHF_EXECINSTR) {
+				if (strcmp(name, ".text") == 0)
+					obj->efile.text_shndx = idx;
+				err = bpf_object__add_program(obj, data->d_buf,
+							      data->d_size, name, idx);
+				if (err) {
+					char errmsg[STRERR_BUFSIZE];
+					char *cp = libbpf_strerror_r(-err, errmsg,
+								     sizeof(errmsg));
+
+					pr_warning("failed to alloc program %s (%s): %s",
+						   name, obj->path, cp);
+				}
 			}
 		} else if (sh.sh_type == SHT_REL) {
 			void *reloc = obj->efile.reloc;
@@ -1027,24 +1027,26 @@ bpf_program__collect_reloc(struct bpf_program *prog, GElf_Shdr *shdr,
 			return -LIBBPF_ERRNO__RELOC;
 		}
 
-		/* TODO: 'maps' is sorted. We can use bsearch to make it faster. */
-		for (map_idx = 0; map_idx < nr_maps; map_idx++) {
-			if (maps[map_idx].offset == sym.st_value) {
-				pr_debug("relocation: find map %zd (%s) for insn %u\n",
-					 map_idx, maps[map_idx].name, insn_idx);
-				break;
+		if (sym.st_shndx == maps_shndx) {
+			/* TODO: 'maps' is sorted. We can use bsearch to make it faster. */
+			for (map_idx = 0; map_idx < nr_maps; map_idx++) {
+				if (maps[map_idx].offset == sym.st_value) {
+					pr_debug("relocation: find map %zd (%s) for insn %u\n",
+						 map_idx, maps[map_idx].name, insn_idx);
+					break;
+				}
 			}
-		}
 
-		if (map_idx >= nr_maps) {
-			pr_warning("bpf relocation: map_idx %d large than %d\n",
-				   (int)map_idx, (int)nr_maps - 1);
-			return -LIBBPF_ERRNO__RELOC;
-		}
+			if (map_idx >= nr_maps) {
+				pr_warning("bpf relocation: map_idx %d large than %d\n",
+					   (int)map_idx, (int)nr_maps - 1);
+				return -LIBBPF_ERRNO__RELOC;
+			}
 
-		prog->reloc_desc[i].type = RELO_LD64;
-		prog->reloc_desc[i].insn_idx = insn_idx;
-		prog->reloc_desc[i].map_idx = map_idx;
+			prog->reloc_desc[i].type = RELO_LD64;
+			prog->reloc_desc[i].insn_idx = insn_idx;
+			prog->reloc_desc[i].map_idx = map_idx;
+		}
 	}
 	return 0;
 }
@@ -1392,7 +1394,7 @@ bpf_program__relocate(struct bpf_program *prog, struct bpf_object *obj)
 			}
 			insns[insn_idx].src_reg = BPF_PSEUDO_MAP_FD;
 			insns[insn_idx].imm = obj->maps[map_idx].fd;
-		} else {
+		} else if (prog->reloc_desc[i].type == RELO_CALL) {
 			err = bpf_program__reloc_text(prog, obj,
 						      &prog->reloc_desc[i]);
 			if (err)
-- 
2.19.1



* [PATCH bpf-next 2/4] libbpf: Support 32-bit static data loads
From: Joe Stringer @ 2019-02-12  0:47 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast
In-Reply-To: <20190212004729.535-1-joe@wand.net.nz>

Support loads of static 32-bit data when BPF writers make use of
convenience macros for accessing static global data variables. A later
patch in this series will demonstrate its usage in a selftest.

As of LLVM-7, this technique only works with 32-bit data, as LLVM will
complain if this technique is attempted with data of other sizes:

    LLVM ERROR: Unsupported relocation: try to compile with -O2 or above,
    or check your static variable usage

Based on the proof of concept by Daniel Borkmann (presented at LPC 2018).
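
The relocation itself can be pictured as reading the 32-bit value that
the symbol points at in .data and storing it in the immediate field of
the referencing instruction. A simplified userspace sketch, not the
libbpf code: struct sketch_insn is a stand-in for struct bpf_insn, and
sketch_relocate_data() is a hypothetical name used only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct bpf_insn; only the imm field matters here. */
struct sketch_insn {
	uint8_t code;
	uint8_t regs;
	int16_t off;
	int32_t imm;
};

/* Resolve one relocation against .data: read the 32-bit value stored at
 * sym_off and place it in the immediate of the referencing instruction.
 * Returns 0 on success, -1 if the read would go past the section.
 */
static int sketch_relocate_data(struct sketch_insn *insns, size_t insn_idx,
				const uint8_t *data, size_t data_size,
				uint64_t sym_off)
{
	uint32_t value;

	assert(insns != NULL && data != NULL);
	if (sym_off > data_size || data_size - sym_off < sizeof(value))
		return -1;
	memcpy(&value, data + sym_off, sizeof(value));
	insns[insn_idx].imm = (int32_t)value;
	return 0;
}
```

Because the value is copied into the instruction at load time, later
changes to the .data buffer have no effect on the relocated program.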

Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
 tools/lib/bpf/libbpf.c | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 1ec28d5154dc..da35d5559b22 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -140,11 +140,13 @@ struct bpf_program {
 		enum {
 			RELO_LD64,
 			RELO_CALL,
+			RELO_DATA,
 		} type;
 		int insn_idx;
 		union {
 			int map_idx;
 			int text_off;
+			uint32_t data;
 		};
 	} *reloc_desc;
 	int nr_reloc;
@@ -210,6 +212,7 @@ struct bpf_object {
 		Elf *elf;
 		GElf_Ehdr ehdr;
 		Elf_Data *symbols;
+		Elf_Data *global_data;
 		size_t strtabidx;
 		struct {
 			GElf_Shdr shdr;
@@ -218,6 +221,7 @@ struct bpf_object {
 		int nr_reloc;
 		int maps_shndx;
 		int text_shndx;
+		int data_shndx;
 	} efile;
 	/*
 	 * All loaded bpf_object is linked in a list, which is
@@ -476,6 +480,7 @@ static void bpf_object__elf_finish(struct bpf_object *obj)
 		obj->efile.elf = NULL;
 	}
 	obj->efile.symbols = NULL;
+	obj->efile.global_data = NULL;
 
 	zfree(&obj->efile.reloc);
 	obj->efile.nr_reloc = 0;
@@ -866,6 +871,9 @@ static int bpf_object__elf_collect(struct bpf_object *obj, int flags)
 					pr_warning("failed to alloc program %s (%s): %s",
 						   name, obj->path, cp);
 				}
+			} else if (strcmp(name, ".data") == 0) {
+				obj->efile.global_data = data;
+				obj->efile.data_shndx = idx;
 			}
 		} else if (sh.sh_type == SHT_REL) {
 			void *reloc = obj->efile.reloc;
@@ -962,6 +970,7 @@ bpf_program__collect_reloc(struct bpf_program *prog, GElf_Shdr *shdr,
 	Elf_Data *symbols = obj->efile.symbols;
 	int text_shndx = obj->efile.text_shndx;
 	int maps_shndx = obj->efile.maps_shndx;
+	int data_shndx = obj->efile.data_shndx;
 	struct bpf_map *maps = obj->maps;
 	size_t nr_maps = obj->nr_maps;
 	int i, nrels;
@@ -1000,8 +1009,9 @@ bpf_program__collect_reloc(struct bpf_program *prog, GElf_Shdr *shdr,
 			 (long long) (rel.r_info >> 32),
 			 (long long) sym.st_value, sym.st_name);
 
-		if (sym.st_shndx != maps_shndx && sym.st_shndx != text_shndx) {
-			pr_warning("Program '%s' contains non-map related relo data pointing to section %u\n",
+		if (sym.st_shndx != maps_shndx && sym.st_shndx != text_shndx &&
+		    sym.st_shndx != data_shndx) {
+			pr_warning("Program '%s' contains unrecognized relo data pointing to section %u\n",
 				   prog->section_name, sym.st_shndx);
 			return -LIBBPF_ERRNO__RELOC;
 		}
@@ -1046,6 +1056,20 @@ bpf_program__collect_reloc(struct bpf_program *prog, GElf_Shdr *shdr,
 			prog->reloc_desc[i].type = RELO_LD64;
 			prog->reloc_desc[i].insn_idx = insn_idx;
 			prog->reloc_desc[i].map_idx = map_idx;
+		} else if (sym.st_shndx == data_shndx) {
+			Elf_Data *global_data = obj->efile.global_data;
+			uint32_t *static_data;
+
+			if (sym.st_value + sizeof(uint32_t) > (int)global_data->d_size) {
+				pr_warning("bpf relocation: static data load beyond data size %zu\n",
+					   global_data->d_size);
+				return -LIBBPF_ERRNO__RELOC;
+			}
+
+			static_data = global_data->d_buf + sym.st_value;
+			prog->reloc_desc[i].type = RELO_DATA;
+			prog->reloc_desc[i].insn_idx = insn_idx;
+			prog->reloc_desc[i].data = *static_data;
 		}
 	}
 	return 0;
@@ -1399,6 +1423,12 @@ bpf_program__relocate(struct bpf_program *prog, struct bpf_object *obj)
 						      &prog->reloc_desc[i]);
 			if (err)
 				return err;
+		} else if (prog->reloc_desc[i].type == RELO_DATA) {
+			struct bpf_insn *insns = prog->insns;
+			int insn_idx;
+
+			insn_idx = prog->reloc_desc[i].insn_idx;
+			insns[insn_idx].imm = prog->reloc_desc[i].data;
 		}
 	}
 
-- 
2.19.1



* [PATCH bpf-next 3/4] libbpf: Support relocations for bss.
From: Joe Stringer @ 2019-02-12  0:47 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast
In-Reply-To: <20190212004729.535-1-joe@wand.net.nz>

The .bss section in an ELF file generated by LLVM holds static variables
that are uninitialized or initialized to zero. Support such variables by
parsing the relocations that reference the .bss section and setting the
immediate of each referencing instruction to zero.
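
With .bss added, the loader chooses a relocation type from the section
that the symbol lives in. A condensed sketch of that dispatch follows. It
assumes that sections missing from the object are recorded as -1, and it
ignores the additional check libbpf makes that a .text reference comes
from a call instruction. All names prefixed with sketch_ are illustrative:

```c
#include <assert.h>
#include <stddef.h>

enum sketch_relo_type {
	SKETCH_RELO_NONE = -1,	/* unrecognized section */
	SKETCH_RELO_LD64,	/* map reference */
	SKETCH_RELO_CALL,	/* call into .text */
	SKETCH_RELO_DATA,	/* 32-bit load from .data */
	SKETCH_RELO_ZERO,	/* reference into .bss, resolved to 0 */
};

/* Section indexes found while scanning the ELF; -1 means "not present"
 * (an assumption of this sketch).
 */
struct sketch_sections {
	int maps_shndx;
	int text_shndx;
	int data_shndx;
	int bss_shndx;
};

static enum sketch_relo_type
sketch_classify(const struct sketch_sections *s, int sym_shndx)
{
	assert(s != NULL);
	/* Index 0 is the undefined section; it never names real data. */
	if (sym_shndx <= 0)
		return SKETCH_RELO_NONE;
	if (sym_shndx == s->maps_shndx)
		return SKETCH_RELO_LD64;
	if (sym_shndx == s->text_shndx)
		return SKETCH_RELO_CALL;
	if (sym_shndx == s->data_shndx)
		return SKETCH_RELO_DATA;
	if (sym_shndx == s->bss_shndx)
		return SKETCH_RELO_ZERO;
	return SKETCH_RELO_NONE;
}
```

Rejecting non-positive indexes up front keeps an undefined symbol from
matching a section that was never found in the object.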

Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
 tools/lib/bpf/libbpf.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index da35d5559b22..ff66d7e970c9 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -141,6 +141,7 @@ struct bpf_program {
 			RELO_LD64,
 			RELO_CALL,
 			RELO_DATA,
+			RELO_ZERO,
 		} type;
 		int insn_idx;
 		union {
@@ -222,6 +223,7 @@ struct bpf_object {
 		int maps_shndx;
 		int text_shndx;
 		int data_shndx;
+		int bss_shndx;
 	} efile;
 	/*
 	 * All loaded bpf_object is linked in a list, which is
@@ -901,6 +903,8 @@ static int bpf_object__elf_collect(struct bpf_object *obj, int flags)
 				obj->efile.reloc[n].shdr = sh;
 				obj->efile.reloc[n].data = data;
 			}
+		} else if (sh.sh_type == SHT_NOBITS && strcmp(name, ".bss") == 0) {
+			obj->efile.bss_shndx = idx;
 		} else {
 			pr_debug("skip section(%d) %s\n", idx, name);
 		}
@@ -971,6 +975,7 @@ bpf_program__collect_reloc(struct bpf_program *prog, GElf_Shdr *shdr,
 	int text_shndx = obj->efile.text_shndx;
 	int maps_shndx = obj->efile.maps_shndx;
 	int data_shndx = obj->efile.data_shndx;
+	int bss_shndx = obj->efile.bss_shndx;
 	struct bpf_map *maps = obj->maps;
 	size_t nr_maps = obj->nr_maps;
 	int i, nrels;
@@ -1010,7 +1015,7 @@ bpf_program__collect_reloc(struct bpf_program *prog, GElf_Shdr *shdr,
 			 (long long) sym.st_value, sym.st_name);
 
 		if (sym.st_shndx != maps_shndx && sym.st_shndx != text_shndx &&
-		    sym.st_shndx != data_shndx) {
+		    sym.st_shndx != data_shndx && sym.st_shndx != bss_shndx) {
 			pr_warning("Program '%s' contains unrecognized relo data pointing to section %u\n",
 				   prog->section_name, sym.st_shndx);
 			return -LIBBPF_ERRNO__RELOC;
@@ -1070,6 +1075,9 @@ bpf_program__collect_reloc(struct bpf_program *prog, GElf_Shdr *shdr,
 			prog->reloc_desc[i].type = RELO_DATA;
 			prog->reloc_desc[i].insn_idx = insn_idx;
 			prog->reloc_desc[i].data = *static_data;
+		} else if (sym.st_shndx == bss_shndx) {
+			prog->reloc_desc[i].type = RELO_ZERO;
+			prog->reloc_desc[i].insn_idx = insn_idx;
 		}
 	}
 	return 0;
@@ -1429,6 +1437,10 @@ bpf_program__relocate(struct bpf_program *prog, struct bpf_object *obj)
 
 			insn_idx = prog->reloc_desc[i].insn_idx;
 			insns[insn_idx].imm = prog->reloc_desc[i].data;
+		} else if (prog->reloc_desc[i].type == RELO_ZERO) {
+			int insn_idx = prog->reloc_desc[i].insn_idx;
+
+			prog->insns[insn_idx].imm = 0;
 		}
 	}
 
-- 
2.19.1



* [PATCH bpf-next 4/4] selftests/bpf: Test static data relocation
From: Joe Stringer @ 2019-02-12  0:47 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast
In-Reply-To: <20190212004729.535-1-joe@wand.net.nz>

Add tests for libbpf relocation of static variable references into the
.data and .bss sections of the ELF.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
 tools/testing/selftests/bpf/Makefile          |  2 +-
 tools/testing/selftests/bpf/test_progs.c      | 44 +++++++++++++++++
 .../selftests/bpf/test_static_data_kern.c     | 47 +++++++++++++++++++
 3 files changed, 92 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_static_data_kern.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index c7e1e3255448..ef52a58e2368 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -36,7 +36,7 @@ BPF_OBJ_FILES = \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
 	test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o test_xdp_vlan.o \
 	xdp_dummy.o test_map_in_map.o test_spin_lock.o test_map_lock.o \
-	test_sock_fields_kern.o
+	test_sock_fields_kern.o test_static_data_kern.o
 
 # Objects are built with default compilation flags and with sub-register
 # code-gen enabled.
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index c52bd90fbb34..72899d58a77c 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -736,6 +736,49 @@ static void test_pkt_md_access(void)
 	bpf_object__close(obj);
 }
 
+static void test_static_data_access(void)
+{
+	const char *file = "./test_static_data_kern.o";
+	struct bpf_object *obj;
+	__u32 duration = 0, retval;
+	int i, err, prog_fd, map_fd;
+	uint32_t value;
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_SCHED_CLS, &obj, &prog_fd);
+	if (CHECK(err, "load program", "error %d loading %s\n", err, file))
+		return;
+
+	map_fd = bpf_find_map(__func__, obj, "result");
+	if (map_fd < 0) {
+		error_cnt++;
+		goto close_prog;
+	}
+
+	err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
+				NULL, NULL, &retval, &duration);
+	CHECK(err || retval, "pass packet",
+	      "err %d errno %d retval %d duration %d\n",
+	      err, errno, retval, duration);
+
+	struct {
+		char *name;
+		uint32_t key;
+		uint32_t value;
+	} tests[] = {
+		{ "relocate .bss reference", 0, 0 },
+		{ "relocate .data reference", 1, 42 },
+	};
+	for (i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) {
+		err = bpf_map_lookup_elem(map_fd, &tests[i].key, &value);
+		CHECK (err || value != tests[i].value, tests[i].name,
+		       "err %d result %d expected %d\n",
+		       err, value, tests[i].value);
+	}
+
+close_prog:
+	bpf_object__close(obj);
+}
+
 static void test_obj_name(void)
 {
 	struct {
@@ -2138,6 +2181,7 @@ int main(void)
 	test_flow_dissector();
 	test_spinlock();
 	test_map_lock();
+	test_static_data_access();
 
 	printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
 	return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
diff --git a/tools/testing/selftests/bpf/test_static_data_kern.c b/tools/testing/selftests/bpf/test_static_data_kern.c
new file mode 100644
index 000000000000..f2485af6bd0b
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_static_data_kern.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Isovalent, Inc.
+
+#include <linux/bpf.h>
+#include <linux/pkt_cls.h>
+
+#include <string.h>
+
+#include "bpf_helpers.h"
+
+#define NUM_CGROUP_LEVELS	4
+
+struct bpf_map_def SEC("maps") result = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 2,
+};
+
+#define __fetch(x) (__u32)(&(x))
+
+static __u32 static_bss = 0;	/* Reloc reference to .bss section */
+static __u32 static_data = 42;	/* Reloc reference to .data section */
+
+/**
+ * Load a u32 value from a static variable into a map, for the userland test
+ * program to validate.
+ */
+SEC("static_data_load")
+int load_static_data(struct __sk_buff *skb)
+{
+	__u32 key, value;
+
+	key = 0;
+	value = __fetch(static_bss);
+	bpf_map_update_elem(&result, &key, &value, 0);
+
+	key = 1;
+	value = __fetch(static_data);
+	bpf_map_update_elem(&result, &key, &value, 0);
+
+	return TC_ACT_OK;
+}
+
+int _version SEC("version") = 1;
+
+char _license[] SEC("license") = "GPL";
-- 
2.19.1
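
The relocation step exercised by this test can be sketched in plain userspace
C. This is a minimal illustration under stated assumptions, not libbpf code:
`struct insn`, `struct reloc` and `apply_relocs()` are hypothetical stand-ins,
and only the RELO_DATA and RELO_ZERO names come from the series. It shows a
.data reference receiving the variable's initial value in the instruction
immediate, and a .bss reference receiving zero, which is what the test
expects (42 and 0).

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for illustration only; not libbpf's real types. */
enum reloc_type { RELO_DATA, RELO_ZERO };

struct insn {
	int32_t imm;		/* immediate operand patched by relocation */
};

struct reloc {
	enum reloc_type type;
	size_t insn_idx;	/* instruction that references the variable */
	int32_t data;		/* initial value from .data (RELO_DATA only) */
};

/* Patch each referenced instruction; returns -1 on an out-of-range index. */
static int apply_relocs(struct insn *insns, size_t insn_cnt,
			const struct reloc *relocs, size_t reloc_cnt)
{
	size_t i;

	for (i = 0; i < reloc_cnt; i++) {
		if (relocs[i].insn_idx >= insn_cnt)
			return -1;
		if (relocs[i].type == RELO_DATA)
			insns[relocs[i].insn_idx].imm = relocs[i].data;
		else	/* RELO_ZERO: .bss has no file contents, so use 0 */
			insns[relocs[i].insn_idx].imm = 0;
	}
	return 0;
}
```

The bounds check is an addition of the sketch; the quoted hunks do not show
one.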



* Re: [PATCH net] dsa: mv88e6xxx: Ensure all pending interrupts are handled prior to exit
From: John David Anglin @ 2019-02-12  0:57 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Russell King, Vivien Didelot, Florian Fainelli, netdev
In-Reply-To: <20190211233327.GB8591@lunn.ch>

On 2019-02-11 6:33 p.m., Andrew Lunn wrote:
>> Signed-off-by:  John David Anglin <dave.anglin@bell.net>
>> ---
>>  drivers/net/dsa/mv88e6xxx/chip.c | 28 ++++++++++++++++++++++------
>>  1 file changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
>> index 8dca2c949e73..12fd7ce3f1ff 100644
>> --- a/drivers/net/dsa/mv88e6xxx/chip.c
>> +++ b/drivers/net/dsa/mv88e6xxx/chip.c
>> @@ -261,6 +261,7 @@ static irqreturn_t mv88e6xxx_g1_irq_thread_work(struct mv88e6xxx_chip *chip)
>>  	unsigned int sub_irq;
>>  	unsigned int n;
>>  	u16 reg;
>> +	u16 ctl1;
>>  	int err;
>>
>>  	mutex_lock(&chip->reg_lock);
>> @@ -270,13 +271,28 @@ static irqreturn_t mv88e6xxx_g1_irq_thread_work(struct mv88e6xxx_chip *chip)
>>  	if (err)
>>  		goto out;
>>
>> -	for (n = 0; n < chip->g1_irq.nirqs; ++n) {
>> -		if (reg & (1 << n)) {
>> -			sub_irq = irq_find_mapping(chip->g1_irq.domain, n);
>> -			handle_nested_irq(sub_irq);
>> -			++nhandled;
>> +	do {
>> +		for (n = 0; n < chip->g1_irq.nirqs; ++n) {
>> +			if (reg & (1 << n)) {
>> +				sub_irq = irq_find_mapping(chip->g1_irq.domain,
>> +							   n);
>> +				handle_nested_irq(sub_irq);
>> +				++nhandled;
>> +			}
>>  		}
>> -	}
>> +
>> +		mutex_lock(&chip->reg_lock);
>> +		err = mv88e6xxx_g1_read(chip, MV88E6XXX_G1_CTL1, &ctl1);
>> +		if (err)
>> +			goto unlock;
>> +		err = mv88e6xxx_g1_read(chip, MV88E6XXX_G1_STS, &reg);
>> +unlock:
>> +		mutex_unlock(&chip->reg_lock);
>> +		if (err)
>> +			goto out;
>> +		ctl1 &= GENMASK(chip->g1_irq.nirqs, 0);
>> +	} while (reg & ctl1);
> Hi David
>
> I just tested this on one of my boards. It loops endlessly:
>
> [   47.173396] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> [   47.182108] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> [   47.190820] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> [   47.199535] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> [   47.208254] mv88e6xxx_g1_irq_thread_work: c881 a8 80   
>
> These are reg, ctl1, reg & ctl1.
>
> So there is an unhandled device interrupt. I think this is because
> device interrupts are not masked before installing the interrupt
> handler. But i've not fully got to the bottom of this yet.
Yes, it is true the PHY and SERDES enables in Global 2 should be cleared before the interrupt handler
is installed for device interrupts.  That's what is done for the interrupt enables in Global 1.  I'm
not seeing that these enables are initialized.

Which switch?  The device interrupts are not being cleared properly on that board.  Would it be possible
to also print the Global 2 status and enables?  Unplugging the cable that's causing the loop might
cause the loop to stop.
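
As a worked check of the log values quoted above (reg, ctl1, reg & ctl1), here
is a small sketch of the loop's exit condition. `pending_irqs()` is a
hypothetical helper, not a driver function, and the nirqs value a caller
passes in is an assumption. With status 0xc881 and enables 0xa8 the result is
0x80, the bit Andrew reads as an unhandled device interrupt, so a loop that
repeats while this value is nonzero cannot exit until that bit clears.

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: returns the status bits that
 * are both raised and enabled, limited to the low nirqs bits.
 */
static uint16_t pending_irqs(uint16_t status, uint16_t enables,
			     unsigned int nirqs)
{
	uint16_t mask = (nirqs >= 16) ? 0xffff : (uint16_t)((1u << nirqs) - 1);

	return status & enables & mask;
}
```

This sketch masks exactly nirqs bits.  GENMASK(h, l) includes bit h, so the
GENMASK(chip->g1_irq.nirqs, 0) in the quoted patch covers one bit more.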

I suspect the same would happen if level interrupts were used.

I tested both edge and polling on espressobin with Armada 3700.  There's no problem with
looping there.  I've booted it many times.  I've unplugged and plugged cables many times.

Dave

-- 
John David Anglin  dave.anglin@bell.net



* Re: [PATCH net] dsa: mv88e6xxx: Ensure all pending interrupts are handled prior to exit
From: Andrew Lunn @ 2019-02-12  1:21 UTC (permalink / raw)
  To: John David Anglin; +Cc: Russell King, Vivien Didelot, Florian Fainelli, netdev
In-Reply-To: <2b6bbb4c-1346-461b-ff7a-cb96b4142f7a@bell.net>

> Yes, it is true the PHY and SERDES enables in Global 2 should be
> cleared before the interrupt handler is installed for device
> interrupts.  That's what is done for the interrupt enables in
> Global 1.  I'm not seeing that these enables are initialized.
> 
> Which switch?

6390X.

> The device interrupts are not being cleared properly on that board.

I added in code to mask all interrupts. It did not help. I need to go
deeper and see if it is a PHY problem.

> I suspect the same would happen if level interrupts were used.

I've not seen it loop. Which is why i want to understand it fully.

     Andrew


* [PATCH v2 bpf-next] tools: bpftool: doc, add text about feature-subcommand
From: Prashant Bhole @ 2019-02-12  1:25 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Prashant Bhole, Quentin Monnet, netdev

This patch adds the missing information about the feature subcommand to
bpftool.rst.

Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
---

v2: used tabs instead of spaces

 tools/bpf/bpftool/Documentation/bpftool.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst b/tools/bpf/bpftool/Documentation/bpftool.rst
index 27153bb816ac..4f2188845dd8 100644
--- a/tools/bpf/bpftool/Documentation/bpftool.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool.rst
@@ -16,7 +16,7 @@ SYNOPSIS
 
 	**bpftool** **version**
 
-	*OBJECT* := { **map** | **program** | **cgroup** | **perf** | **net** }
+	*OBJECT* := { **map** | **program** | **cgroup** | **perf** | **net** | **feature** }
 
 	*OPTIONS* := { { **-V** | **--version** } | { **-h** | **--help** }
 	| { **-j** | **--json** } [{ **-p** | **--pretty** }] }
@@ -34,6 +34,8 @@ SYNOPSIS
 
 	*NET-COMMANDS* := { **show** | **list** | **help** }
 
+	*FEATURE-COMMANDS* := { **probe** | **help** }
+
 DESCRIPTION
 ===========
 	*bpftool* allows for inspection and simple modification of BPF objects
-- 
2.20.1




* Re: [PATCH net-next] ipvs: Use struct_size() helper
From: Gustavo A. R. Silva @ 2019-02-12  1:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Simon Horman
  Cc: Wensong Zhang, Julian Anastasov, Jozsef Kadlecsik,
	Florian Westphal, David S. Miller, netdev, lvs-devel,
	netfilter-devel, coreteam, linux-kernel
In-Reply-To: <20190211234033.wwpygxamqwvsuxmv@salvia>



On 2/11/19 5:40 PM, Pablo Neira Ayuso wrote:
> On Fri, Feb 08, 2019 at 10:56:48AM +0100, Simon Horman wrote:
>> On Thu, Feb 07, 2019 at 06:44:56PM -0600, Gustavo A. R. Silva wrote:
>>> One of the more common cases of allocation size calculations is finding
>>> the size of a structure that has a zero-sized array at the end, along
>>> with memory for some number of elements for that array. For example:
>>>
>>> struct foo {
>>>     int stuff;
>>>     struct boo entry[];
>>> };
>>>
>>> size = sizeof(struct foo) + count * sizeof(struct boo);
>>> instance = alloc(size, GFP_KERNEL)
>>>
>>> Instead of leaving these open-coded and prone to type mistakes, we can
>>> now use the new struct_size() helper:
>>>
>>> size = struct_size(instance, entry, count);
>>>
>>> This code was detected with the help of Coccinelle.
>>>
>>> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
>>
>> Acked-by: Simon Horman <horms+renesas@verge.net.au>
>>
>> Pablo, could you consider applying this?
> 
> Applied, thanks!
> 

Thank you both, Simon and Pablo.

--
Gustavo
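
For readers without the kernel tree at hand, the size calculation described in
the quoted commit message can be approximated in userspace as below. This is
a sketch under assumptions, not the kernel macro: `foo_size()` is a
hypothetical helper, `struct boo` is given an arbitrary body, and the
saturate-to-SIZE_MAX behaviour is modelled by hand so that an overflowing
count produces a size no allocator will satisfy.

```c
#include <stddef.h>
#include <stdint.h>

struct boo {
	int value;
};

struct foo {
	int stuff;
	struct boo entry[];	/* flexible array member */
};

/*
 * Userspace approximation, for illustration only: size of a struct foo with
 * room for count entries, saturating to SIZE_MAX if the math would overflow.
 */
static size_t foo_size(size_t count)
{
	size_t base = sizeof(struct foo);

	if (count > (SIZE_MAX - base) / sizeof(struct boo))
		return SIZE_MAX;
	return base + count * sizeof(struct boo);
}
```

The open-coded form, sizeof(struct foo) + count * sizeof(struct boo), has no
such check and silently wraps for a large enough count.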


* Re: [net] tipc: fix skb may be leaky in tipc_link_input
From: David Miller @ 2019-02-12  2:36 UTC (permalink / raw)
  To: hoang.h.le; +Cc: tipc-discussion, jon.maloy, maloy, ying.xue, netdev
In-Reply-To: <20190211021828.6145-1-hoang.h.le@dektech.com.au>

From: Hoang Le <hoang.h.le@dektech.com.au>
Date: Mon, 11 Feb 2019 09:18:28 +0700

> When we free the skb in tipc_data_input, we return a 'false' boolean.
> The skb is then passed on to tipc_link_input by the caller, tipc_link_rcv:
> 
> <snip>
> 1303 int tipc_link_rcv:
> ...
> 1354    if (!tipc_data_input(l, skb, l->inputq))
> 1355        rc |= tipc_link_input(l, skb, l->inputq);
> </snip>
> 
> Fix it by simply returning a 'true' boolean when the skb is freed.
> tipc_link_rcv will then skip the call to tipc_link_input, per the
> condition above.
> 
> Acked-by: Ying Xue <ying.xue@windriver.com>
> Acked-by: Jon Maloy <maloy@donjonn.com>
> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>

Applied, thanks.
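
The ownership rule behind this fix can be sketched generically.  The code
below is an illustration only: `first_stage()`, `second_stage()` and
`receive()` are hypothetical names, not TIPC functions, and plain malloc/free
stands in for skb handling.  The point it shows is that a stage which frees a
buffer must report it as consumed, so the caller does not hand the same
pointer to the next stage.

```c
#include <stdbool.h>
#include <stdlib.h>

enum kind { KIND_DATA, KIND_UNDELIVERABLE, KIND_CONTROL };

struct buf {
	enum kind kind;
};

static int second_stage_calls;	/* how often the fallback handler ran */

/*
 * First stage.  Returns true when it has taken ownership of b (delivered or
 * freed it), false when the caller must hand b to the next stage.
 */
static bool first_stage(struct buf *b)
{
	switch (b->kind) {
	case KIND_DATA:
	case KIND_UNDELIVERABLE:
		free(b);	/* consumed here, so report true */
		return true;
	default:
		return false;	/* b is still owned by the caller */
	}
}

static void second_stage(struct buf *b)
{
	second_stage_calls++;
	free(b);
}

/* Mirrors the quoted call site: fall through only if b was not consumed. */
static void receive(enum kind kind)
{
	struct buf *b = malloc(sizeof(*b));

	if (!b)
		return;
	b->kind = kind;
	if (!first_stage(b))
		second_stage(b);
}
```

If first_stage() returned false after freeing an undeliverable buffer,
second_stage() would be handed a stale pointer.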


* [PATCH net-next v4] ipmr: ip6mr: Create new sockopt to clear mfc cache or vifs
From: Callum Sinclair @ 2019-02-12  3:12 UTC (permalink / raw)
  To: davem, kuznet, yoshfuji, nikolay, netdev, linux-kernel
  Cc: nicolas.dichtel, Callum Sinclair

Create a way to clear the multicast forwarding cache on a socket
without having to either remove the entries manually using the delete
entry socket option or destroy and recreate the multicast socket.

The new socket option MRT_FLUSH accepts any combination of the four
flags below and clears the matching entries:

MRT_FLUSH_MFC will clear all non-static mfc entries
MRT_FLUSH_MFC_STATIC will clear all static mfc entries
MRT_FLUSH_VIFS will clear all non-static interfaces
MRT_FLUSH_VIFS_STATIC will clear all static interfaces
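
From userspace, the option would be used roughly as follows.  This is a
sketch under assumptions: `mroute_flush()` is a hypothetical wrapper, the
constants are copied from this patch for toolchains whose headers predate it
(MRT_BASE is 200 in include/uapi/linux/mroute.h), and the descriptor is
assumed to be a multicast routing socket already set up with MRT_INIT.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Values as proposed in this patch; defined here only if headers lack them. */
#ifndef MRT_FLUSH
#define MRT_FLUSH		(200 + 12)
#define MRT_FLUSH_MFC		1
#define MRT_FLUSH_MFC_STATIC	2
#define MRT_FLUSH_VIFS		4
#define MRT_FLUSH_VIFS_STATIC	8
#endif

/* Hypothetical wrapper: returns 0 on success or a negative errno value. */
static int mroute_flush(int mroute_fd, int flags)
{
	if (setsockopt(mroute_fd, IPPROTO_IP, MRT_FLUSH, &flags,
		       sizeof(flags)) < 0)
		return -errno;
	return 0;
}
```

For example, mroute_flush(fd, MRT_FLUSH_MFC | MRT_FLUSH_VIFS) would request
the same cleanup that closing the socket performs, leaving static entries
and static vifs in place.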

Callum Sinclair (1):
  ipmr: ip6mr: Create new sockopt to clear mfc cache or vifs

 include/uapi/linux/mroute.h  |  9 ++++-
 include/uapi/linux/mroute6.h |  9 ++++-
 net/ipv4/ipmr.c              | 73 ++++++++++++++++++++-------------
 net/ipv6/ip6mr.c             | 78 +++++++++++++++++++++++-------------
 4 files changed, 112 insertions(+), 57 deletions(-)

-- 
2.20.1


* [PATCH net-next v4] ipmr: ip6mr: Create new sockopt to clear mfc cache or vifs
From: Callum Sinclair @ 2019-02-12  3:12 UTC (permalink / raw)
  To: davem, kuznet, yoshfuji, nikolay, netdev, linux-kernel
  Cc: nicolas.dichtel, Callum Sinclair
In-Reply-To: <20190212031255.16121-1-callum.sinclair@alliedtelesis.co.nz>

Currently the only way to clear the forwarding cache is to delete the
entries one by one using the MRT_DEL_MFC socket option or to destroy and
recreate the socket.

Create a new socket option which with the use of optional flags can
clear any combination of multicast entries (static or not static) and
multicast vifs (static or not static).

Calling the new socket option MRT_FLUSH with the flags MRT_FLUSH_MFC and
MRT_FLUSH_VIFS will clear all entries and vifs on the socket except for
static entries.

Signed-off-by: Callum Sinclair <callum.sinclair@alliedtelesis.co.nz>
---
v1 -> v2:
  Implemented additional flags for static entries
v2 -> v3:
  Cleaned up flag logic so any combination of routes can be cleared.
  Fixed style errors
  Fixed incorrect flag values
v3 -> v4:
  Fixed style errors
  Fixed incorrect flag (MRT_FLUSH was used instead of MRT_FLUSH_VIFS)

 include/uapi/linux/mroute.h  |  9 ++++-
 include/uapi/linux/mroute6.h |  9 ++++-
 net/ipv4/ipmr.c              | 73 ++++++++++++++++++++-------------
 net/ipv6/ip6mr.c             | 78 +++++++++++++++++++++++-------------
 4 files changed, 112 insertions(+), 57 deletions(-)

diff --git a/include/uapi/linux/mroute.h b/include/uapi/linux/mroute.h
index 5d37a9ccce63..11c8c1fc1124 100644
--- a/include/uapi/linux/mroute.h
+++ b/include/uapi/linux/mroute.h
@@ -28,12 +28,19 @@
 #define MRT_TABLE	(MRT_BASE+9)	/* Specify mroute table ID		*/
 #define MRT_ADD_MFC_PROXY	(MRT_BASE+10)	/* Add a (*,*|G) mfc entry	*/
 #define MRT_DEL_MFC_PROXY	(MRT_BASE+11)	/* Del a (*,*|G) mfc entry	*/
-#define MRT_MAX		(MRT_BASE+11)
+#define MRT_FLUSH	(MRT_BASE+12)	/* Flush all mfc entries and/or vifs	*/
+#define MRT_MAX		(MRT_BASE+12)
 
 #define SIOCGETVIFCNT	SIOCPROTOPRIVATE	/* IP protocol privates */
 #define SIOCGETSGCNT	(SIOCPROTOPRIVATE+1)
 #define SIOCGETRPF	(SIOCPROTOPRIVATE+2)
 
+/* MRT_FLUSH optional flags */
+#define MRT_FLUSH_MFC	1	/* Flush multicast entries */
+#define MRT_FLUSH_MFC_STATIC	2	/* Flush static multicast entries */
+#define MRT_FLUSH_VIFS	4	/* Flush multicast vifs */
+#define MRT_FLUSH_VIFS_STATIC	8	/* Flush static multicast vifs */
+
 #define MAXVIFS		32
 typedef unsigned long vifbitmap_t;	/* User mode code depends on this lot */
 typedef unsigned short vifi_t;
diff --git a/include/uapi/linux/mroute6.h b/include/uapi/linux/mroute6.h
index 9999cc006390..ac84ef11b29c 100644
--- a/include/uapi/linux/mroute6.h
+++ b/include/uapi/linux/mroute6.h
@@ -31,12 +31,19 @@
 #define MRT6_TABLE	(MRT6_BASE+9)	/* Specify mroute table ID		*/
 #define MRT6_ADD_MFC_PROXY	(MRT6_BASE+10)	/* Add a (*,*|G) mfc entry	*/
 #define MRT6_DEL_MFC_PROXY	(MRT6_BASE+11)	/* Del a (*,*|G) mfc entry	*/
-#define MRT6_MAX	(MRT6_BASE+11)
+#define MRT6_FLUSH	(MRT6_BASE+12)	/* Flush all mfc entries and/or vifs	*/
+#define MRT6_MAX	(MRT6_BASE+12)
 
 #define SIOCGETMIFCNT_IN6	SIOCPROTOPRIVATE	/* IP protocol privates */
 #define SIOCGETSGCNT_IN6	(SIOCPROTOPRIVATE+1)
 #define SIOCGETRPF	(SIOCPROTOPRIVATE+2)
 
+/* MRT6_FLUSH optional flags */
+#define MRT6_FLUSH_MFC	1	/* Flush multicast entries */
+#define MRT6_FLUSH_MFC_STATIC	2	/* Flush static multicast entries */
+#define MRT6_FLUSH_VIFS	4	/* Flush multicast vifs */
+#define MRT6_FLUSH_VIFS_STATIC	8	/* Flush static multicast vifs */
+
 #define MAXMIFS		32
 typedef unsigned long mifbitmap_t;	/* User mode code depends on this lot */
 typedef unsigned short mifi_t;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index e536970557dd..a232645d3335 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -110,7 +110,7 @@ static int ipmr_cache_report(struct mr_table *mrt,
 static void mroute_netlink_event(struct mr_table *mrt, struct mfc_cache *mfc,
 				 int cmd);
 static void igmpmsg_netlink_event(struct mr_table *mrt, struct sk_buff *pkt);
-static void mroute_clean_tables(struct mr_table *mrt, bool all);
+static void mroute_clean_tables(struct mr_table *mrt, int flags);
 static void ipmr_expire_process(struct timer_list *t);
 
 #ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES
@@ -415,7 +415,8 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 id)
 static void ipmr_free_table(struct mr_table *mrt)
 {
 	del_timer_sync(&mrt->ipmr_expire_timer);
-	mroute_clean_tables(mrt, true);
+	mroute_clean_tables(mrt, MRT_FLUSH_VIFS | MRT_FLUSH_VIFS_STATIC |
+					  MRT_FLUSH_MFC | MRT_FLUSH_MFC_STATIC);
 	rhltable_destroy(&mrt->mfc_hash);
 	kfree(mrt);
 }
@@ -1296,7 +1297,7 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
 }
 
 /* Close the multicast socket, and clear the vif tables etc */
-static void mroute_clean_tables(struct mr_table *mrt, bool all)
+static void mroute_clean_tables(struct mr_table *mrt, int flags)
 {
 	struct net *net = read_pnet(&mrt->net);
 	struct mr_mfc *c, *tmp;
@@ -1305,35 +1306,42 @@ static void mroute_clean_tables(struct mr_table *mrt, bool all)
 	int i;
 
 	/* Shut down all active vif entries */
-	for (i = 0; i < mrt->maxvif; i++) {
-		if (!all && (mrt->vif_table[i].flags & VIFF_STATIC))
-			continue;
-		vif_delete(mrt, i, 0, &list);
+	if (flags & (MRT_FLUSH_VIFS | MRT_FLUSH_VIFS_STATIC)) {
+		for (i = 0; i < mrt->maxvif; i++) {
+			if (((mrt->vif_table[i].flags & VIFF_STATIC) &&
+			     !(flags & MRT_FLUSH_VIFS_STATIC)) ||
+			    (!(mrt->vif_table[i].flags & VIFF_STATIC) && !(flags & MRT_FLUSH_VIFS)))
+				continue;
+			vif_delete(mrt, i, 0, &list);
+		}
+		unregister_netdevice_many(&list);
 	}
-	unregister_netdevice_many(&list);
 
 	/* Wipe the cache */
-	list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
-		if (!all && (c->mfc_flags & MFC_STATIC))
-			continue;
-		rhltable_remove(&mrt->mfc_hash, &c->mnode, ipmr_rht_params);
-		list_del_rcu(&c->list);
-		cache = (struct mfc_cache *)c;
-		call_ipmr_mfc_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, cache,
-					      mrt->id);
-		mroute_netlink_event(mrt, cache, RTM_DELROUTE);
-		mr_cache_put(c);
-	}
-
-	if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
-		spin_lock_bh(&mfc_unres_lock);
-		list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
-			list_del(&c->list);
+	if (flags & (MRT_FLUSH_MFC | MRT_FLUSH_MFC_STATIC)) {
+		list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
+			if (((c->mfc_flags & MFC_STATIC) && !(flags & MRT_FLUSH_MFC_STATIC)) ||
+			    (!(c->mfc_flags & MFC_STATIC) && !(flags & MRT_FLUSH_MFC)))
+				continue;
+			rhltable_remove(&mrt->mfc_hash, &c->mnode, ipmr_rht_params);
+			list_del_rcu(&c->list);
 			cache = (struct mfc_cache *)c;
+			call_ipmr_mfc_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, cache,
+						      mrt->id);
 			mroute_netlink_event(mrt, cache, RTM_DELROUTE);
-			ipmr_destroy_unres(mrt, cache);
+			mr_cache_put(c);
+		}
+
+		if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
+			spin_lock_bh(&mfc_unres_lock);
+			list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
+				list_del(&c->list);
+				cache = (struct mfc_cache *)c;
+				mroute_netlink_event(mrt, cache, RTM_DELROUTE);
+				ipmr_destroy_unres(mrt, cache);
+			}
+			spin_unlock_bh(&mfc_unres_lock);
 		}
-		spin_unlock_bh(&mfc_unres_lock);
 	}
 }
 
@@ -1354,7 +1362,7 @@ static void mrtsock_destruct(struct sock *sk)
 						    NETCONFA_IFINDEX_ALL,
 						    net->ipv4.devconf_all);
 			RCU_INIT_POINTER(mrt->mroute_sk, NULL);
-			mroute_clean_tables(mrt, false);
+			mroute_clean_tables(mrt, MRT_FLUSH_VIFS | MRT_FLUSH_MFC);
 		}
 	}
 	rtnl_unlock();
@@ -1479,6 +1487,17 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval,
 					   sk == rtnl_dereference(mrt->mroute_sk),
 					   parent);
 		break;
+	case MRT_FLUSH:
+		if (optlen != sizeof(val)) {
+			ret = -EINVAL;
+			break;
+		}
+		if (get_user(val, (int __user *)optval)) {
+			ret = -EFAULT;
+			break;
+		}
+		mroute_clean_tables(mrt, val);
+		break;
 	/* Control PIM assert. */
 	case MRT_ASSERT:
 		if (optlen != sizeof(val)) {
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index cc01aa3f2b5e..b0d8989540a3 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -97,7 +97,7 @@ static void mr6_netlink_event(struct mr_table *mrt, struct mfc6_cache *mfc,
 static void mrt6msg_netlink_event(struct mr_table *mrt, struct sk_buff *pkt);
 static int ip6mr_rtm_dumproute(struct sk_buff *skb,
 			       struct netlink_callback *cb);
-static void mroute_clean_tables(struct mr_table *mrt, bool all);
+static void mroute_clean_tables(struct mr_table *mrt, int flags);
 static void ipmr_expire_process(struct timer_list *t);
 
 #ifdef CONFIG_IPV6_MROUTE_MULTIPLE_TABLES
@@ -393,7 +393,8 @@ static struct mr_table *ip6mr_new_table(struct net *net, u32 id)
 static void ip6mr_free_table(struct mr_table *mrt)
 {
 	del_timer_sync(&mrt->ipmr_expire_timer);
-	mroute_clean_tables(mrt, true);
+	mroute_clean_tables(mrt, MRT6_FLUSH_VIFS | MRT6_FLUSH_VIFS_STATIC |
+					  MRT6_FLUSH_MFC | MRT6_FLUSH_MFC_STATIC);
 	rhltable_destroy(&mrt->mfc_hash);
 	kfree(mrt);
 }
@@ -1496,42 +1497,49 @@ static int ip6mr_mfc_add(struct net *net, struct mr_table *mrt,
  *	Close the multicast socket, and clear the vif tables etc
  */
 
-static void mroute_clean_tables(struct mr_table *mrt, bool all)
+static void mroute_clean_tables(struct mr_table *mrt, int flags)
 {
 	struct mr_mfc *c, *tmp;
 	LIST_HEAD(list);
 	int i;
 
 	/* Shut down all active vif entries */
-	for (i = 0; i < mrt->maxvif; i++) {
-		if (!all && (mrt->vif_table[i].flags & VIFF_STATIC))
-			continue;
-		mif6_delete(mrt, i, 0, &list);
+	if (flags & (MRT6_FLUSH_VIFS | MRT6_FLUSH_VIFS_STATIC)) {
+		for (i = 0; i < mrt->maxvif; i++) {
+			if (((mrt->vif_table[i].flags & VIFF_STATIC) &&
+			     !(flags & MRT6_FLUSH_VIFS_STATIC)) ||
+			    (!(mrt->vif_table[i].flags & VIFF_STATIC) && !(flags & MRT6_FLUSH_VIFS)))
+				continue;
+			mif6_delete(mrt, i, 0, &list);
+		}
+		unregister_netdevice_many(&list);
 	}
-	unregister_netdevice_many(&list);
 
 	/* Wipe the cache */
-	list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
-		if (!all && (c->mfc_flags & MFC_STATIC))
-			continue;
-		rhltable_remove(&mrt->mfc_hash, &c->mnode, ip6mr_rht_params);
-		list_del_rcu(&c->list);
-		call_ip6mr_mfc_entry_notifiers(read_pnet(&mrt->net),
-					       FIB_EVENT_ENTRY_DEL,
-					       (struct mfc6_cache *)c, mrt->id);
-		mr6_netlink_event(mrt, (struct mfc6_cache *)c, RTM_DELROUTE);
-		mr_cache_put(c);
-	}
+	if (flags & (MRT6_FLUSH_MFC | MRT6_FLUSH_MFC_STATIC)) {
+		list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
+			if (((c->mfc_flags & MFC_STATIC) && !(flags & MRT6_FLUSH_MFC_STATIC)) ||
+			    (!(c->mfc_flags & MFC_STATIC) && !(flags & MRT6_FLUSH_MFC)))
+				continue;
+			rhltable_remove(&mrt->mfc_hash, &c->mnode, ip6mr_rht_params);
+			list_del_rcu(&c->list);
+			call_ip6mr_mfc_entry_notifiers(read_pnet(&mrt->net),
+						       FIB_EVENT_ENTRY_DEL,
+										   (struct mfc6_cache *)c, mrt->id);
+			mr6_netlink_event(mrt, (struct mfc6_cache *)c, RTM_DELROUTE);
+			mr_cache_put(c);
+		}
 
-	if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
-		spin_lock_bh(&mfc_unres_lock);
-		list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
-			list_del(&c->list);
-			mr6_netlink_event(mrt, (struct mfc6_cache *)c,
-					  RTM_DELROUTE);
-			ip6mr_destroy_unres(mrt, (struct mfc6_cache *)c);
+		if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
+			spin_lock_bh(&mfc_unres_lock);
+			list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
+				list_del(&c->list);
+				mr6_netlink_event(mrt, (struct mfc6_cache *)c,
+						  RTM_DELROUTE);
+				ip6mr_destroy_unres(mrt, (struct mfc6_cache *)c);
+			}
+			spin_unlock_bh(&mfc_unres_lock);
 		}
-		spin_unlock_bh(&mfc_unres_lock);
 	}
 }
 
@@ -1587,7 +1595,7 @@ int ip6mr_sk_done(struct sock *sk)
 						     NETCONFA_IFINDEX_ALL,
 						     net->ipv6.devconf_all);
 
-			mroute_clean_tables(mrt, false);
+			mroute_clean_tables(mrt, MRT6_FLUSH_VIFS | MRT6_FLUSH_MFC);
 			err = 0;
 			break;
 		}
@@ -1703,6 +1711,20 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, uns
 		rtnl_unlock();
 		return ret;
 
+	case MRT6_FLUSH:
+	{
+		int flags;
+
+		if (optlen != sizeof(flags))
+			return -EINVAL;
+		if (get_user(flags, (int __user *)optval))
+			return -EFAULT;
+		rtnl_lock();
+		mroute_clean_tables(mrt, flags);
+		rtnl_unlock();
+		return 0;
+	}
+
 	/*
 	 *	Control PIM assert (to activate pim will activate assert)
 	 */
-- 
2.20.1
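
The per-entry test that mroute_clean_tables() now applies can be restated as
a single predicate.  The sketch below is an illustration only:
`should_flush()` is a hypothetical helper that does not appear in the patch,
and the flag values are copied from the mroute.h hunk above.  An entry is
flushed when the flag matching its kind (static or not) is set in the request.

```c
#include <stdbool.h>

/* Flag values as proposed in the mroute.h hunk of this patch. */
#define MRT_FLUSH_MFC		1
#define MRT_FLUSH_MFC_STATIC	2
#define MRT_FLUSH_VIFS		4
#define MRT_FLUSH_VIFS_STATIC	8

/*
 * Hypothetical helper, for illustration only: decide whether one entry is
 * flushed.  dyn_flag selects non-static entries, static_flag static ones.
 */
static bool should_flush(bool is_static, int flags, int dyn_flag,
			 int static_flag)
{
	return (flags & (is_static ? static_flag : dyn_flag)) != 0;
}
```

The two-clause "continue" conditions in the patch are the negation of this
predicate, once for vifs and once for mfc entries.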



* RE: [PATCH v3] arm64: dts: lx2160aqds: Add mdio mux nodes
From: Pankaj Bansal @ 2019-02-12  3:26 UTC (permalink / raw)
  To: Leo Li, Shawn Guo
  Cc: Andrew Lunn, Florian Fainelli, netdev@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <AM6PR04MB5863A68D22DD06F81743AC8B8F640@AM6PR04MB5863.eurprd04.prod.outlook.com>



> -----Original Message-----
> From: Leo Li
> Sent: Tuesday, 12 February, 2019 02:14 AM
> To: Shawn Guo <shawnguo@kernel.org>; Pankaj Bansal
> <pankaj.bansal@nxp.com>
> Cc: Andrew Lunn <andrew@lunn.ch>; Florian Fainelli <f.fainelli@gmail.com>;
> netdev@vger.kernel.org; linux-arm-kernel@lists.infradead.org
> Subject: RE: [PATCH v3] arm64: dts: lx2160aqds: Add mdio mux nodes
> 
> 
> 
> > -----Original Message-----
> > From: Shawn Guo <shawnguo@kernel.org>
> > Sent: Sunday, February 10, 2019 9:00 PM
> > To: Pankaj Bansal <pankaj.bansal@nxp.com>
> > Cc: Leo Li <leoyang.li@nxp.com>; Andrew Lunn <andrew@lunn.ch>; Florian
> > Fainelli <f.fainelli@gmail.com>; netdev@vger.kernel.org; linux-arm-
> > kernel@lists.infradead.org
> > Subject: Re: [PATCH v3] arm64: dts: lx2160aqds: Add mdio mux nodes
> >
> > On Wed, Feb 06, 2019 at 09:40:33AM +0000, Pankaj Bansal wrote:
> > > The two external MDIO buses used to communicate with phy devices
> > > that are external to SOC are muxed in LX2160AQDS board.
> > >
> > > These buses can be routed to any one of the eight IO slots on
> > > LX2160AQDS board depending on value in fpga register 0x54.
> > >
> > > Additionally the external MDIO1 is used to communicate to the
> > > onboard RGMII phy devices.
> > >
> > > The mdio1 is controlled by bits 4-7 of fpga register and mdio2 is
> > > controlled by bits 0-3 of fpga register.
> > >
> > > Signed-off-by: Pankaj Bansal <pankaj.bansal@nxp.com>
> > > ---
> > >
> > > Notes:
> > >     V3:
> > >     - Add status = disabled in soc file and status = okay in board file
> > >       for external MDIO nodes
> > >     - Add interrupts property in external mdio nodes in soc file
> > >     V2:
> > >     - removed unnecessary TODO statements
> > >     - removed device_type from mdio nodes
> > >     - change the case of hex number to lowercase
> > >     - removed board specific comments from soc file
> > >
> > >  .../boot/dts/freescale/fsl-lx2160a-qds.dts   | 123 +++++++++++++++++
> > >  .../boot/dts/freescale/fsl-lx2160a.dtsi      |  22 +++
> > >  2 files changed, 145 insertions(+)
> > >
> > > diff --git a/arch/arm64/boot/dts/freescale/fsl-lx2160a-qds.dts
> > > b/arch/arm64/boot/dts/freescale/fsl-lx2160a-qds.dts
> > > index 99a22abbe725..079264b391a2 100644
> > > --- a/arch/arm64/boot/dts/freescale/fsl-lx2160a-qds.dts
> > > +++ b/arch/arm64/boot/dts/freescale/fsl-lx2160a-qds.dts
> > > @@ -35,6 +35,14 @@
> > >  	status = "okay";
> > >  };
> > >
> > > +&emdio1 {
> > > +	status = "okay";
> > > +};
> > > +
> > > +&emdio2 {
> > > +	status = "okay";
> > > +};
> > > +
> > >  &esdhc0 {
> > >  	status = "okay";
> > >  };
> > > @@ -46,6 +54,121 @@
> > >  &i2c0 {
> > >  	status = "okay";
> > >
> > > +	fpga@66 {
> > > +		compatible = "fsl,lx2160aqds-fpga", "fsl,fpga-qixis-i2c";
> > > +		reg = <0x66>;
> > > +		#address-cells = <1>;
> > > +		#size-cells = <0>;
> > > +
> > > +		mdio-mux-1@54 {
> > > +			mdio-parent-bus = <&emdio1>;
> > > +			reg = <0x54>;		 /* BRDCFG4 */
> > > +			mux-mask = <0xf8>;      /* EMI1_MDIO */
> > > +			#address-cells=<1>;
> > > +			#size-cells = <0>;
> > > +
> > > +			mdio@0 {
> > > +				reg = <0x00>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> >
> > Please have a newline between nodes.  It doesn't deserve a respin
> > though.  I can fix them up when applying if Leo is fine with this version.
> 
> I think there should be a compatible string defined for the binding of parent
> node mdio-mux, probably "mdio-mux-regmap", and be used here in the device
> tree.

I have two concerns:
1. regmap is a Linux software construct, while the device tree is a hardware description and is software agnostic. Can we refer to regmap in the device tree?
2. By convention, a device tree compatible string is defined as "<manufacturer>,<model>", e.g. "fsl,mpc8349-uart". The mdio-mux node and its sub-nodes are a generic representation of an MDIO mux and do not depend on a particular manufacturer's device. How should the compatible be defined in this case?

> 
> >
> > Shawn
> >
> > > +			mdio@40 {
> > > +				reg = <0x40>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +			mdio@c0 {
> > > +				reg = <0xc0>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +			mdio@c8 {
> > > +				reg = <0xc8>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +			mdio@d0 {
> > > +				reg = <0xd0>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +			mdio@d8 {
> > > +				reg = <0xd8>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +			mdio@e0 {
> > > +				reg = <0xe0>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +			mdio@e8 {
> > > +				reg = <0xe8>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +			mdio@f0 {
> > > +				reg = <0xf0>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +			mdio@f8 {
> > > +				reg = <0xf8>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +		};
> > > +
> > > +		mdio-mux-2@54 {
> > > +			mdio-parent-bus = <&emdio2>;
> > > +			reg = <0x54>;		 /* BRDCFG4 */
> > > +			mux-mask = <0x07>;      /* EMI2_MDIO */
> > > +			#address-cells=<1>;
> > > +			#size-cells = <0>;
> > > +
> > > +			mdio@0 {
> > > +				reg = <0x00>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +			mdio@1 {
> > > +				reg = <0x01>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +			mdio@2 {
> > > +				reg = <0x02>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +			mdio@3 {
> > > +				reg = <0x03>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +			mdio@4 {
> > > +				reg = <0x04>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +			mdio@5 {
> > > +				reg = <0x05>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +			mdio@6 {
> > > +				reg = <0x06>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +			mdio@7 {
> > > +				reg = <0x07>;
> > > +				#address-cells = <1>;
> > > +				#size-cells = <0>;
> > > +			};
> > > +		};
> > > +	};
> > > +
> > >  	i2c-mux@77 {
> > >  		compatible = "nxp,pca9547";
> > >  		reg = <0x77>;
> > > diff --git a/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi
> > > b/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi
> > > index a79f5c1ea56d..7def5252ac1a 100644
> > > --- a/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi
> > > +++ b/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi
> > > @@ -762,5 +762,27 @@
> > >  				     <GIC_SPI 209 IRQ_TYPE_LEVEL_HIGH>;
> > >  			dma-coherent;
> > >  		};
> > > +
> > > +		/* WRIOP0: 0x8b8_0000, E-MDIO1: 0x1_6000 */
> > > +		emdio1: mdio@8b96000 {
> > > +			compatible = "fsl,fman-memac-mdio";
> > > +			reg = <0x0 0x8b96000 0x0 0x1000>;
> > > +			interrupts = <GIC_SPI 90 IRQ_TYPE_LEVEL_HIGH>;
> > > +			#address-cells = <1>;
> > > +			#size-cells = <0>;
> > > +			little-endian;	/* force the driver in LE mode */
> > > +			status = "disabled";
> > > +		};
> > > +
> > > +		/* WRIOP0: 0x8b8_0000, E-MDIO2: 0x1_7000 */
> > > +		emdio2: mdio@8b97000 {
> > > +			compatible = "fsl,fman-memac-mdio";
> > > +			reg = <0x0 0x8b97000 0x0 0x1000>;
> > > +			interrupts = <GIC_SPI 91 IRQ_TYPE_LEVEL_HIGH>;
> > > +			#address-cells = <1>;
> > > +			#size-cells = <0>;
> > > +			little-endian;	/* force the driver in LE mode */
> > > +			status = "disabled";
> > > +		};
> > >  	};
> > >  };
> > > --
> > > 2.17.1
> > >


* Re: [PATCH] ipv6: fix icmp6_send() route lookup
From: Ivan Delalande @ 2019-02-12  3:31 UTC (permalink / raw)
  To: David Miller; +Cc: alin.nastac, netdev
In-Reply-To: <20190211.123818.1763509059512954986.davem@davemloft.net>

Hi David,

On Mon, Feb 11, 2019 at 12:38:18PM -0800, David Miller wrote:
> From: Alin Nastac <alin.nastac@gmail.com>
> Date: Thu,  7 Feb 2019 16:05:31 +0100
> 
> > Original packet destination address must be used as saddr for the
> > route lookup performed by icmp6_send() even when this address is
> > not local. This fixes the IPv6 router ability to send back
> > destination unreachable ICMPv6 errors for forwarded packets when
> > the route toward the saddr of the original packet is source
> > filtered (e.g. a default route with a "from PD" attribute, where
> > PD is the delegated prefix).
> > 
> > Signed-off-by: Alin Nastac <alin.nastac@gmail.com>
> 
> Yes, but however this will change behavior for a lot of situations
> not just the one you are interested in.
> 
> The base ipv6_chk_addr() test has been there for more than a decade
> and I'm not comfortable with changing this logic until I see you
> write up a full audit of all of the use cases of icmp6_send() and
> how they are impacted by your changes.

For what it's worth, we also have 3 internal patches changing the
selection of saddr in icmp6_send() (to prefer an address from the
receiving interface, or the address most specific to the source address
of the original packet, etc.) that we would like to submit in some form,
but that would most likely break existing setups if enabled by default.

Could we introduce a sysctl with a set of flags to enable the different
behaviors from our patches and Alin's? Or some other configuration
interface, if sysctls are not the most appropriate one?
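
As a rough illustration of the flag idea, here is a minimal sketch of how
one integer setting could gate each behavior independently. All names are
hypothetical (none of them exist in the kernel), and the three bits only
mirror the behaviors mentioned in this thread:

```c
#include <stdbool.h>

/*
 * Hypothetical flag bits for a single integer knob. These names are
 * made up for illustration and are not an existing kernel interface.
 */
enum sk_icmp6_saddr_flags {
	/* Use the original packet's daddr for the lookup even if not local */
	SK_SADDR_ORIG_DST_NONLOCAL = 1u << 0,
	/* Prefer an address from the receiving interface */
	SK_SADDR_PREFER_RX_IFACE   = 1u << 1,
	/* Prefer the address most specific to the original source */
	SK_SADDR_MOST_SPECIFIC     = 1u << 2,
};

/* Return true if the given behavior bit is set in the knob value. */
static bool sk_flag_enabled(unsigned int knob, unsigned int flag)
{
	return (knob & flag) != 0;
}
```

With a default value of 0 every bit is clear, so existing setups would keep
the current behavior unless an administrator opts in.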

Thank you,

-- 
Ivan Delalande
Arista Networks


* Re: [PATCH net] dsa: mv88e6xxx: Ensure all pending interrupts are handled prior to exit
From: Andrew Lunn @ 2019-02-12  3:58 UTC (permalink / raw)
  To: John David Anglin, Heiner Kallweit
  Cc: Russell King, Vivien Didelot, Florian Fainelli, netdev
In-Reply-To: <2b6bbb4c-1346-461b-ff7a-cb96b4142f7a@bell.net>

> > Hi David
> >
> > I just tested this on one of my boards. It loops endlessly:
> >
> > [   47.173396] mv88e6xxx_g1_irq_thread_work: c881 a8 80
> > [   47.182108] mv88e6xxx_g1_irq_thread_work: c881 a8 80
> > [   47.190820] mv88e6xxx_g1_irq_thread_work: c881 a8 80
> > [   47.199535] mv88e6xxx_g1_irq_thread_work: c881 a8 80
> > [   47.208254] mv88e6xxx_g1_irq_thread_work: c881 a8 80
> >
> > These are reg, ctl1, reg & ctl1.
> >
> > So there is an unhandled device interrupt.

Hi Heiner

Your patch Fixes: 2b3e88ea6528 ("net: phy: improve phy state
checking") is causing me problems with interrupts for the Marvell
switches.

That change means we don't ask the PHY device whether it caused an
interrupt when its state is less than PHY_UP.

What I'm seeing is that the PHY is interrupting pretty early on after
a reboot when the previous boot had the interface up.

[   10.125702] Marvell 88E6390 mv88e6xxx-0:02: phy_start_interrupts
[   10.162798] Marvell 88E6390 mv88e6xxx-0:02: phy_enable_interrupts
[   10.168931] Marvell 88E6390 mv88e6xxx-0:02: marvell_ack_interrupt
[   10.180164] Marvell 88E6390 mv88e6xxx-0:02: marvell_config_intr 1

A little later it interrupts:

[   12.999717] mv88e6xxx_g1_irq_thread_fn
[   13.007253] mv88e6xxx_g2_irq_thread_fn: 4 811c 4
[   13.012015] libphy: __phy_is_started: phydev->state 1 PHY_UP 3
[   13.017941] Marvell 88E6390 mv88e6xxx-0:02: phy_interrupt: phy_is_started(phydev) 0

The current code just causes it to be ignored. So the interrupt fires
again, and again...
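
To make the loop concrete, here is a small user-space model, assuming the
interrupt source stays asserted until something acknowledges it. The
struct and function names are hypothetical and are not taken from the
mv88e6xxx or phylib code:

```c
#include <stdbool.h>

/* Hypothetical model of a device whose interrupt stays asserted until acked. */
struct sk_dev {
	bool irq_pending;
};

/* Handler: when 'service' is false it returns without acking the device. */
static void sk_irq_handler(struct sk_dev *d, bool service)
{
	if (service)
		d->irq_pending = false;	/* ack clears the source */
}

/*
 * Count how many times the handler runs before the line deasserts,
 * giving up after max_iters so an ignored interrupt cannot spin forever.
 */
static int sk_count_irqs(struct sk_dev *d, bool service, int max_iters)
{
	int n = 0;

	while (d->irq_pending && n < max_iters) {
		sk_irq_handler(d, service);
		n++;
	}
	return n;
}
```

In this model an ignored interrupt always runs up to the cap, while a
serviced one is handled exactly once.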

If I change the code to call into the PHY driver and let it handle the
interrupts, things keep running. A little bit later the interface is
configured up:

[   15.921326] mv88e6085 gpio-0:00 red: configuring for phy/gmii link mode
[   15.928693] libphy: __phy_is_started: phydev->state 3 PHY_UP 3
[   15.929442] IPv6: ADDRCONF(NETDEV_UP): red: link is not ready
[   15.935596] Marvell 88E6390 mv88e6xxx-0:02: m88e6390_config_aneg
[   15.935608] Marvell 88E6390 mv88e6xxx-0:02: m88e6390_errata

[   16.071364] Marvell 88E6390 mv88e6xxx-0:02: m88e1510_config_aneg
[   16.112362] Marvell 88E6390 mv88e6xxx-0:02: m88e1318_config_aneg
[   16.151245] Marvell 88E6390 mv88e6xxx-0:02: m88e1121_config_aneg
[   16.368206] Marvell 88E6390 mv88e6xxx-0:02: PHY state change UP -> NOLINK

And after another interrupt the link goes up:

[   19.519840] mv88e6xxx_g1_irq_thread_fn
[   19.528546] mv88e6xxx_g2_irq_thread_fn: 4 811c 4
[   19.534152] libphy: __phy_is_started: phydev->state 5 PHY_UP 3
[   19.540030] Marvell 88E6390 mv88e6xxx-0:02: phy_interrupt: phy_is_started(phydev) 1
[   19.547721] Marvell 88E6390 mv88e6xxx-0:02: m88e1121_did_interrupt
[   19.559829] Marvell 88E6390 mv88e6xxx-0:02: marvell_ack_interrupt
[   19.590753] Marvell 88E6390 mv88e6xxx-0:02: marvell_read_status
[   19.596712] Marvell 88E6390 mv88e6xxx-0:02: marvell_update_link
[   19.628387] Marvell 88E6390 mv88e6xxx-0:02: PHY state change NOLINK -> RUNNING
[   19.628453] mv88e6085 gpio-0:00 red: Link is Up - 1Gbps/Full - flow control off
[   19.635920] IPv6: ADDRCONF(NETDEV_CHANGE): red: link becomes ready

I don't yet know why the first interrupt happens, before we configure
auto-neg, etc. But it is not too unreasonable. We have configured
interrupts, so it could be reporting link down etc.

So I think we might need to revert part of this change and call into the
driver so long as the PHY is not in state PHY_HALTED.
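
A minimal sketch of the two checks being compared, as a user-space model
rather than the actual phylib code. The enum is a simplified stand-in whose
ordering is an assumption chosen to agree with the log above (PHY_UP
printed as 3, the early interrupt arriving in state 1), and the helper
names are hypothetical:

```c
#include <stdbool.h>

/* Simplified stand-in for the PHY state machine; ordering is assumed. */
enum sk_phy_state {
	SK_PHY_DOWN = 0,
	SK_PHY_READY,	/* 1: state seen at the early interrupt */
	SK_PHY_HALTED,
	SK_PHY_UP,	/* 3: matches "PHY_UP 3" in the log */
	SK_PHY_RUNNING,
	SK_PHY_NOLINK,
};

/* Current behavior as described: only handle when state >= UP. */
static bool sk_handle_current(enum sk_phy_state s)
{
	return s >= SK_PHY_UP;
}

/* Proposed behavior: call into the driver unless the PHY is halted. */
static bool sk_handle_proposed(enum sk_phy_state s)
{
	return s != SK_PHY_HALTED;
}
```

Under this model the early interrupt in state 1 is skipped by the current
check but would reach the driver with the proposed one.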

What do you think?

     Andrew

