Netdev List
* Re: iproute2 ss: Some thoughts about additional info output layout
From: Stephen Hemminger @ 2015-01-14 22:55 UTC
  To: Vadim Kochan; +Cc: netdev
In-Reply-To: <20150108221240.GA23636@angus-think.lan>

On Fri, 9 Jan 2015 00:12:40 +0200
Vadim Kochan <vadim4j@gmail.com> wrote:

> Hi,
> 
> I think the current output of the ss utility looks a little odd when
> additional info options are specified. I expect many of you will say
> that changing it will break existing scripts that parse ss output, but
> I will try anyway. This is how I think ss output would look better
> when additional info options are specified (I am not sure how it
> will look in the email):

The current ss output is pretty much unparseable once options are given.
For parseable output, generating JSON would be much better; I did that for
a couple of other utilities.

I see no problem with having a better human format.


* [PATCH 4/5] openvswitch: Allow for any level of nesting in flow attributes
From: Thomas Graf @ 2015-01-15  2:53 UTC
  To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov,
	nicolas.dichtel
  Cc: dev, netdev
In-Reply-To: <cover.1421290198.git.tgraf@suug.ch>

nlattr_set() is currently hardcoded to two levels of nesting. This change
introduces struct ovs_len_tbl to define minimal length requirements plus
next level nesting tables to traverse the key attributes to arbitrary depth.
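
The table-driven walk described above can be sketched in user space as
follows. This is only an illustration under stated assumptions: the TLV
layout (2-byte length, 2-byte type, 4-byte alignment), the attribute types
and the helpers put_attr() and attr_set() are hypothetical stand-ins, not
the kernel's Netlink API. Only the shape of struct len_tbl follows the
struct ovs_len_tbl introduced by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ATTR_NESTED   (-1)	/* plays the role of OVS_ATTR_NESTED */
#define ATTR_HDRLEN   4		/* 2-byte length + 2-byte type */
#define ATTR_ALIGN(n) (((n) + 3u) & ~3u)

/* Same shape as the patch's struct ovs_len_tbl. */
struct len_tbl {
	int len;			/* payload length, or ATTR_NESTED */
	const struct len_tbl *next;	/* table for the next level, or NULL */
};

/* Hypothetical two-level layout: OUTER_NEST is a container. */
enum { INNER_A = 1, INNER_MAX = 1 };
enum { OUTER_X = 1, OUTER_NEST = 2, OUTER_MAX = 2 };

static const struct len_tbl inner_lens[INNER_MAX + 1] = {
	[INNER_A]    = { .len = 2 },
};
static const struct len_tbl outer_lens[OUTER_MAX + 1] = {
	[OUTER_X]    = { .len = 4 },
	[OUTER_NEST] = { .len = ATTR_NESTED, .next = inner_lens },
};

/* Write one attribute header plus zeroed payload; return the next offset. */
static size_t put_attr(uint8_t *buf, size_t off, uint16_t type,
		       uint16_t payload)
{
	uint16_t len = ATTR_HDRLEN + payload;

	memcpy(buf + off, &len, 2);
	memcpy(buf + off + 2, &type, 2);
	memset(buf + off + ATTR_HDRLEN, 0, ATTR_ALIGN(payload));
	return off + ATTR_ALIGN(len);
}

/* Set every leaf payload byte to val. A container is descended into only
 * when the current table marks it as nested; without a table, its whole
 * payload is treated as a leaf. The stream must already be validated.
 */
static void attr_set(uint8_t *buf, size_t size, uint8_t val,
		     const struct len_tbl *tbl)
{
	size_t off = 0;

	while (off + ATTR_HDRLEN <= size) {
		uint16_t len, type;

		memcpy(&len, buf + off, 2);
		memcpy(&type, buf + off + 2, 2);
		assert(len >= ATTR_HDRLEN && off + len <= size);

		if (tbl && tbl[type].len == ATTR_NESTED)
			attr_set(buf + off + ATTR_HDRLEN, len - ATTR_HDRLEN,
				 val, tbl[type].next);
		else
			memset(buf + off + ATTR_HDRLEN, val,
			       len - ATTR_HDRLEN);
		off += ATTR_ALIGN(len);
	}
}
```

Because each table entry carries its own next pointer, adding a further
level of nesting only requires another table, not another code path.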

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v5->v6:
 - No change
v4->v5:
 - No change
v3->v4:
 - No change. The spotted bug is unrelated to this series and will be fixed
   in a separate patch
v2->v3:
 - No change
v1->v2:
 - New patch to allow nested Netlink attributes inside
   OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS

 net/openvswitch/flow_netlink.c | 106 ++++++++++++++++++++++-------------------
 1 file changed, 56 insertions(+), 50 deletions(-)

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 2e8a9cd..518941c 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -50,6 +50,13 @@
 
 #include "flow_netlink.h"
 
+struct ovs_len_tbl {
+	int len;
+	const struct ovs_len_tbl *next;
+};
+
+#define OVS_ATTR_NESTED -1
+
 static void update_range(struct sw_flow_match *match,
 			 size_t offset, size_t size, bool is_mask)
 {
@@ -289,29 +296,44 @@ size_t ovs_key_attr_size(void)
 		+ nla_total_size(28); /* OVS_KEY_ATTR_ND */
 }
 
+static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1] = {
+	[OVS_TUNNEL_KEY_ATTR_ID]	    = { .len = sizeof(u64) },
+	[OVS_TUNNEL_KEY_ATTR_IPV4_SRC]	    = { .len = sizeof(u32) },
+	[OVS_TUNNEL_KEY_ATTR_IPV4_DST]	    = { .len = sizeof(u32) },
+	[OVS_TUNNEL_KEY_ATTR_TOS]	    = { .len = 1 },
+	[OVS_TUNNEL_KEY_ATTR_TTL]	    = { .len = 1 },
+	[OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT] = { .len = 0 },
+	[OVS_TUNNEL_KEY_ATTR_CSUM]	    = { .len = 0 },
+	[OVS_TUNNEL_KEY_ATTR_TP_SRC]	    = { .len = sizeof(u16) },
+	[OVS_TUNNEL_KEY_ATTR_TP_DST]	    = { .len = sizeof(u16) },
+	[OVS_TUNNEL_KEY_ATTR_OAM]	    = { .len = 0 },
+	[OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS]   = { .len = OVS_ATTR_NESTED },
+};
+
 /* The size of the argument for each %OVS_KEY_ATTR_* Netlink attribute.  */
-static const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
-	[OVS_KEY_ATTR_ENCAP] = -1,
-	[OVS_KEY_ATTR_PRIORITY] = sizeof(u32),
-	[OVS_KEY_ATTR_IN_PORT] = sizeof(u32),
-	[OVS_KEY_ATTR_SKB_MARK] = sizeof(u32),
-	[OVS_KEY_ATTR_ETHERNET] = sizeof(struct ovs_key_ethernet),
-	[OVS_KEY_ATTR_VLAN] = sizeof(__be16),
-	[OVS_KEY_ATTR_ETHERTYPE] = sizeof(__be16),
-	[OVS_KEY_ATTR_IPV4] = sizeof(struct ovs_key_ipv4),
-	[OVS_KEY_ATTR_IPV6] = sizeof(struct ovs_key_ipv6),
-	[OVS_KEY_ATTR_TCP] = sizeof(struct ovs_key_tcp),
-	[OVS_KEY_ATTR_TCP_FLAGS] = sizeof(__be16),
-	[OVS_KEY_ATTR_UDP] = sizeof(struct ovs_key_udp),
-	[OVS_KEY_ATTR_SCTP] = sizeof(struct ovs_key_sctp),
-	[OVS_KEY_ATTR_ICMP] = sizeof(struct ovs_key_icmp),
-	[OVS_KEY_ATTR_ICMPV6] = sizeof(struct ovs_key_icmpv6),
-	[OVS_KEY_ATTR_ARP] = sizeof(struct ovs_key_arp),
-	[OVS_KEY_ATTR_ND] = sizeof(struct ovs_key_nd),
-	[OVS_KEY_ATTR_RECIRC_ID] = sizeof(u32),
-	[OVS_KEY_ATTR_DP_HASH] = sizeof(u32),
-	[OVS_KEY_ATTR_TUNNEL] = -1,
-	[OVS_KEY_ATTR_MPLS] = sizeof(struct ovs_key_mpls),
+static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
+	[OVS_KEY_ATTR_ENCAP]	 = { .len = OVS_ATTR_NESTED },
+	[OVS_KEY_ATTR_PRIORITY]	 = { .len = sizeof(u32) },
+	[OVS_KEY_ATTR_IN_PORT]	 = { .len = sizeof(u32) },
+	[OVS_KEY_ATTR_SKB_MARK]	 = { .len = sizeof(u32) },
+	[OVS_KEY_ATTR_ETHERNET]	 = { .len = sizeof(struct ovs_key_ethernet) },
+	[OVS_KEY_ATTR_VLAN]	 = { .len = sizeof(__be16) },
+	[OVS_KEY_ATTR_ETHERTYPE] = { .len = sizeof(__be16) },
+	[OVS_KEY_ATTR_IPV4]	 = { .len = sizeof(struct ovs_key_ipv4) },
+	[OVS_KEY_ATTR_IPV6]	 = { .len = sizeof(struct ovs_key_ipv6) },
+	[OVS_KEY_ATTR_TCP]	 = { .len = sizeof(struct ovs_key_tcp) },
+	[OVS_KEY_ATTR_TCP_FLAGS] = { .len = sizeof(__be16) },
+	[OVS_KEY_ATTR_UDP]	 = { .len = sizeof(struct ovs_key_udp) },
+	[OVS_KEY_ATTR_SCTP]	 = { .len = sizeof(struct ovs_key_sctp) },
+	[OVS_KEY_ATTR_ICMP]	 = { .len = sizeof(struct ovs_key_icmp) },
+	[OVS_KEY_ATTR_ICMPV6]	 = { .len = sizeof(struct ovs_key_icmpv6) },
+	[OVS_KEY_ATTR_ARP]	 = { .len = sizeof(struct ovs_key_arp) },
+	[OVS_KEY_ATTR_ND]	 = { .len = sizeof(struct ovs_key_nd) },
+	[OVS_KEY_ATTR_RECIRC_ID] = { .len = sizeof(u32) },
+	[OVS_KEY_ATTR_DP_HASH]	 = { .len = sizeof(u32) },
+	[OVS_KEY_ATTR_TUNNEL]	 = { .len = OVS_ATTR_NESTED,
+				     .next = ovs_tunnel_key_lens, },
+	[OVS_KEY_ATTR_MPLS]	 = { .len = sizeof(struct ovs_key_mpls) },
 };
 
 static bool is_all_zero(const u8 *fp, size_t size)
@@ -352,8 +374,8 @@ static int __parse_flow_nlattrs(const struct nlattr *attr,
 			return -EINVAL;
 		}
 
-		expected_len = ovs_key_lens[type];
-		if (nla_len(nla) != expected_len && expected_len != -1) {
+		expected_len = ovs_key_lens[type].len;
+		if (nla_len(nla) != expected_len && expected_len != OVS_ATTR_NESTED) {
 			OVS_NLERR(log, "Key %d has unexpected len %d expected %d",
 				  type, nla_len(nla), expected_len);
 			return -EINVAL;
@@ -451,30 +473,16 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 		int type = nla_type(a);
 		int err;
 
-		static const u32 ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1] = {
-			[OVS_TUNNEL_KEY_ATTR_ID] = sizeof(u64),
-			[OVS_TUNNEL_KEY_ATTR_IPV4_SRC] = sizeof(u32),
-			[OVS_TUNNEL_KEY_ATTR_IPV4_DST] = sizeof(u32),
-			[OVS_TUNNEL_KEY_ATTR_TOS] = 1,
-			[OVS_TUNNEL_KEY_ATTR_TTL] = 1,
-			[OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT] = 0,
-			[OVS_TUNNEL_KEY_ATTR_CSUM] = 0,
-			[OVS_TUNNEL_KEY_ATTR_TP_SRC] = sizeof(u16),
-			[OVS_TUNNEL_KEY_ATTR_TP_DST] = sizeof(u16),
-			[OVS_TUNNEL_KEY_ATTR_OAM] = 0,
-			[OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS] = -1,
-		};
-
 		if (type > OVS_TUNNEL_KEY_ATTR_MAX) {
 			OVS_NLERR(log, "Tunnel attr %d out of range max %d",
 				  type, OVS_TUNNEL_KEY_ATTR_MAX);
 			return -EINVAL;
 		}
 
-		if (ovs_tunnel_key_lens[type] != nla_len(a) &&
-		    ovs_tunnel_key_lens[type] != -1) {
+		if (ovs_tunnel_key_lens[type].len != nla_len(a) &&
+		    ovs_tunnel_key_lens[type].len != OVS_ATTR_NESTED) {
 			OVS_NLERR(log, "Tunnel attr %d has unexpected len %d expected %d",
-				  type, nla_len(a), ovs_tunnel_key_lens[type]);
+				  type, nla_len(a), ovs_tunnel_key_lens[type].len);
 			return -EINVAL;
 		}
 
@@ -912,18 +920,16 @@ static int ovs_key_from_nlattrs(struct sw_flow_match *match, u64 attrs,
 	return 0;
 }
 
-static void nlattr_set(struct nlattr *attr, u8 val, bool is_attr_mask_key)
+static void nlattr_set(struct nlattr *attr, u8 val,
+		       const struct ovs_len_tbl *tbl)
 {
 	struct nlattr *nla;
 	int rem;
 
 	/* The nlattr stream should already have been validated */
 	nla_for_each_nested(nla, attr, rem) {
-		/* We assume that ovs_key_lens[type] == -1 means that type is a
-		 * nested attribute
-		 */
-		if (is_attr_mask_key && ovs_key_lens[nla_type(nla)] == -1)
-			nlattr_set(nla, val, false);
+		if (tbl && tbl[nla_type(nla)].len == OVS_ATTR_NESTED)
+			nlattr_set(nla, val, tbl[nla_type(nla)].next);
 		else
 			memset(nla_data(nla), val, nla_len(nla));
 	}
@@ -931,7 +937,7 @@ static void nlattr_set(struct nlattr *attr, u8 val, bool is_attr_mask_key)
 
 static void mask_set_nlattr(struct nlattr *attr, u8 val)
 {
-	nlattr_set(attr, val, true);
+	nlattr_set(attr, val, ovs_key_lens);
 }
 
 /**
@@ -1628,8 +1634,8 @@ static int validate_set(const struct nlattr *a,
 		return -EINVAL;
 
 	if (key_type > OVS_KEY_ATTR_MAX ||
-	    (ovs_key_lens[key_type] != nla_len(ovs_key) &&
-	     ovs_key_lens[key_type] != -1))
+	    (ovs_key_lens[key_type].len != nla_len(ovs_key) &&
+	     ovs_key_lens[key_type].len != OVS_ATTR_NESTED))
 		return -EINVAL;
 
 	switch (key_type) {
-- 
1.9.3


* [PATCH 1/5] vxlan: Group Policy extension
From: Thomas Graf @ 2015-01-15  2:53 UTC
  To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov,
	nicolas.dichtel
  Cc: netdev, dev
In-Reply-To: <cover.1421290198.git.tgraf@suug.ch>

Implements support for the Group Policy VXLAN extension [0] to provide
a lightweight and simple security label mechanism across network peers
based on VXLAN. The security context and associated metadata are mapped
to/from skb->mark. This allows further mapping to a SELinux context
using SECMARK, and implementing ACLs directly with nftables, iptables,
OVS, tc, etc.

The group membership is defined by the lower 16 bits of skb->mark; the
upper 16 bits are used for flags.
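
As an illustration of this layout, the following user-space sketch packs
and unpacks a mark value. The three mask values are taken from the
include/net/vxlan.h hunk of this patch; the GBP_BIT() macro and the
gbp_mark_*() helpers are hypothetical names introduced only for this
example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask values as defined by this patch in include/net/vxlan.h. */
#define GBP_BIT(n)			(UINT32_C(1) << (n))
#define VXLAN_GBP_DONT_LEARN		(GBP_BIT(6) << 16)
#define VXLAN_GBP_POLICY_APPLIED	(GBP_BIT(3) << 16)
#define VXLAN_GBP_ID_MASK		UINT32_C(0xFFFF)

/* Hypothetical helper: build a mark from a group ID and the two flags. */
static uint32_t gbp_mark_pack(uint16_t group, bool dont_learn, bool applied)
{
	uint32_t mark = group;

	if (dont_learn)
		mark |= VXLAN_GBP_DONT_LEARN;
	if (applied)
		mark |= VXLAN_GBP_POLICY_APPLIED;
	return mark;
}

/* Hypothetical helpers: extract the individual fields again. */
static uint16_t gbp_mark_group(uint32_t mark)
{
	return (uint16_t)(mark & VXLAN_GBP_ID_MASK);
}

static bool gbp_mark_dont_learn(uint32_t mark)
{
	return (mark & VXLAN_GBP_DONT_LEARN) != 0;
}

static bool gbp_mark_applied(uint32_t mark)
{
	return (mark & VXLAN_GBP_POLICY_APPLIED) != 0;
}
```

With these definitions, the value 0x200 used in the iptables example of
this message corresponds to group 0x200 with neither flag set.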

SELinux allows managing labels to secure local resources. However,
distributed applications require ACLs to be implemented across hosts.
This is typically achieved by matching on L2-L4 fields to identify the
original sending host and process on the receiver. On top of that,
netlabel and specifically CIPSO [1] allow mapping security contexts to
universal labels. However, netlabel and CIPSO are relatively complex.
This patch provides a lightweight alternative for overlay network
environments with a trusted underlay. No additional control protocol
is required.

           Host 1:                       Host 2:

      Group A        Group B        Group B     Group A
      +-----+   +-------------+    +-------+   +-----+
      | lxc |   | SELinux CTX |    | httpd |   | VM  |
      +--+--+   +--+----------+    +---+---+   +--+--+
	  \---+---/                     \----+---/
	      |                              |
	  +---+---+                      +---+---+
	  | vxlan |                      | vxlan |
	  +---+---+                      +---+---+
	      +------------------------------+

Backwards compatibility:
A VXLAN-GBP socket can receive standard VXLAN frames and will assign
the default group 0x0000 to such frames. A Linux VXLAN socket without
the extension enabled will drop VXLAN-GBP frames. The extension is
therefore disabled by default and needs to be specifically enabled:

   ip link add [...] type vxlan [...] gbp

In a mixed environment with VXLAN and VXLAN-GBP sockets, the GBP socket
must run on a separate port number.
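
The compatibility rules above follow from how the receive path treats the
flag word. The sketch below is a simplified user-space model, not the
driver code: vxlan_flags_ok() is a hypothetical helper, it takes the first
header word in host byte order, it assumes the VNI flag is always
required, and it ignores the RCO flag and the reserved bits of the VNI
word. The flag values are the ones defined by this patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define HF_BIT(n)		(UINT32_C(1) << (n))
#define VXLAN_HF_VNI		HF_BIT(27)
#define VXLAN_HF_GBP		HF_BIT(31)
#define VXLAN_GBP_USED_BITS	(VXLAN_HF_GBP | UINT32_C(0xFFFFFF))

/* Hypothetical helper: would a socket accept a frame with these flags?
 * sock_gbp says whether the receiving socket has the extension enabled.
 */
static bool vxlan_flags_ok(uint32_t flags, bool sock_gbp)
{
	/* Assumption: frames without the VNI flag are always rejected. */
	if (!(flags & VXLAN_HF_VNI))
		return false;
	flags &= ~VXLAN_HF_VNI;

	/* Only a GBP-enabled socket consumes the GBP flag, the D/A bits
	 * and the policy ID.
	 */
	if ((flags & VXLAN_HF_GBP) && sock_gbp)
		flags &= ~VXLAN_GBP_USED_BITS;

	/* Any bit left over marks the frame as malformed. */
	return flags == 0;
}
```

A plain VXLAN frame leaves no bits behind on either kind of socket, while
a GBP frame only ends up with a zero flag word on a GBP-enabled socket.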

Examples:
 iptables:
  host1# iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
  host2# iptables -I INPUT -m mark --mark 0x200 -j DROP

 OVS:
  # ovs-ofctl add-flow br0 'in_port=1,actions=load:0x200->NXM_NX_TUN_GBP_ID[],NORMAL'
  # ovs-ofctl add-flow br0 'in_port=2,tun_gbp_id=0x200,actions=drop'

[0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
[1] http://lwn.net/Articles/204905/

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v5->v6:
 - Use flags instead of exts member to store enablement of GBP as suggested
   by Tom
v4->v5:
 - Rebased on top of Tom's RCO work
 - Dropped IFLA_VXLAN_EXTENSION container attribute and embedded IFLA_VXLAN_GBP
   as top level VXLAN attribute like RCO for consistency. 
v3->v4:
 - Patch 1 was no longer needed due to Tom Herbert's 3bf394 ("vxlan: Improve
   support for header flags"). Moved remaining header description to this patch.
 - Zero out vxlan_metadata in vxlan_tnl_send() as suggested by Jesse.
 - Reported enabled extensions to user space as requested by Nicolas.
 - Use VXLAN_HF_GBP instead of bitfield to be in line with Tom's work.
v2->v3:
 - Removed empty struct vxlan_gbp as spotted by Alexei
v1->v2:
 - split GBP header definition into separate struct vxlanhdr_gbp as requested
   by Alexei

 drivers/net/vxlan.c           | 84 ++++++++++++++++++++++++++++++++++++-------
 include/net/vxlan.h           | 79 +++++++++++++++++++++++++++++++++++++---
 include/uapi/linux/if_link.h  |  1 +
 net/openvswitch/vport-vxlan.c |  9 +++--
 4 files changed, 152 insertions(+), 21 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 99df0d7..6dbf8e0 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -620,7 +620,8 @@ static struct sk_buff **vxlan_gro_receive(struct sk_buff **head,
 			continue;
 
 		vh2 = (struct vxlanhdr *)(p->data + off_vx);
-		if (vh->vx_vni != vh2->vx_vni) {
+		if (vh->vx_flags != vh2->vx_flags ||
+		    vh->vx_vni != vh2->vx_vni) {
 			NAPI_GRO_CB(p)->same_flow = 0;
 			continue;
 		}
@@ -1183,6 +1184,7 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 	struct vxlan_sock *vs;
 	struct vxlanhdr *vxh;
 	u32 flags, vni;
+	struct vxlan_metadata md = {0};
 
 	/* Need Vxlan and inner Ethernet header to be present */
 	if (!pskb_may_pull(skb, VXLAN_HLEN))
@@ -1216,6 +1218,24 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 		vni &= VXLAN_VID_MASK;
 	}
 
+	/* For backwards compatibility, only allow reserved fields to be
+	 * used by VXLAN extensions if explicitly requested.
+	 */
+	if ((flags & VXLAN_HF_GBP) && (vs->flags & VXLAN_F_GBP)) {
+		struct vxlanhdr_gbp *gbp;
+
+		gbp = (struct vxlanhdr_gbp *)vxh;
+		md.gbp = ntohs(gbp->policy_id);
+
+		if (gbp->dont_learn)
+			md.gbp |= VXLAN_GBP_DONT_LEARN;
+
+		if (gbp->policy_applied)
+			md.gbp |= VXLAN_GBP_POLICY_APPLIED;
+
+		flags &= ~VXLAN_GBP_USED_BITS;
+	}
+
 	if (flags || (vni & ~VXLAN_VID_MASK)) {
 		/* If there are any unprocessed flags remaining treat
 		 * this as a malformed packet. This behavior diverges from
@@ -1229,7 +1249,8 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 		goto bad_flags;
 	}
 
-	vs->rcv(vs, skb, vxh->vx_vni);
+	md.vni = vxh->vx_vni;
+	vs->rcv(vs, skb, &md);
 	return 0;
 
 drop:
@@ -1246,8 +1267,8 @@ error:
 	return 1;
 }
 
-static void vxlan_rcv(struct vxlan_sock *vs,
-		      struct sk_buff *skb, __be32 vx_vni)
+static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb,
+		      struct vxlan_metadata *md)
 {
 	struct iphdr *oip = NULL;
 	struct ipv6hdr *oip6 = NULL;
@@ -1258,7 +1279,7 @@ static void vxlan_rcv(struct vxlan_sock *vs,
 	int err = 0;
 	union vxlan_addr *remote_ip;
 
-	vni = ntohl(vx_vni) >> 8;
+	vni = ntohl(md->vni) >> 8;
 	/* Is this VNI defined? */
 	vxlan = vxlan_vs_find_vni(vs, vni);
 	if (!vxlan)
@@ -1292,6 +1313,7 @@ static void vxlan_rcv(struct vxlan_sock *vs,
 		goto drop;
 
 	skb_reset_network_header(skb);
+	skb->mark = md->gbp;
 
 	if (oip6)
 		err = IP6_ECN_decapsulate(oip6, skb);
@@ -1641,13 +1663,30 @@ static bool route_shortcircuit(struct net_device *dev, struct sk_buff *skb)
 	return false;
 }
 
+static void vxlan_build_gbp_hdr(struct vxlanhdr *vxh, struct vxlan_sock *vs,
+				struct vxlan_metadata *md)
+{
+	struct vxlanhdr_gbp *gbp;
+
+	gbp = (struct vxlanhdr_gbp *)vxh;
+	vxh->vx_flags |= htonl(VXLAN_HF_GBP);
+
+	if (md->gbp & VXLAN_GBP_DONT_LEARN)
+		gbp->dont_learn = 1;
+
+	if (md->gbp & VXLAN_GBP_POLICY_APPLIED)
+		gbp->policy_applied = 1;
+
+	gbp->policy_id = htons(md->gbp & VXLAN_GBP_ID_MASK);
+}
+
 #if IS_ENABLED(CONFIG_IPV6)
 static int vxlan6_xmit_skb(struct vxlan_sock *vs,
 			   struct dst_entry *dst, struct sk_buff *skb,
 			   struct net_device *dev, struct in6_addr *saddr,
 			   struct in6_addr *daddr, __u8 prio, __u8 ttl,
-			   __be16 src_port, __be16 dst_port, __be32 vni,
-			   bool xnet)
+			   __be16 src_port, __be16 dst_port,
+			   struct vxlan_metadata *md, bool xnet)
 {
 	struct vxlanhdr *vxh;
 	int min_headroom;
@@ -1696,7 +1735,7 @@ static int vxlan6_xmit_skb(struct vxlan_sock *vs,
 
 	vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh));
 	vxh->vx_flags = htonl(VXLAN_HF_VNI);
-	vxh->vx_vni = vni;
+	vxh->vx_vni = md->vni;
 
 	if (type & SKB_GSO_TUNNEL_REMCSUM) {
 		u32 data = (skb_checksum_start_offset(skb) - hdrlen) >>
@@ -1714,6 +1753,9 @@ static int vxlan6_xmit_skb(struct vxlan_sock *vs,
 		}
 	}
 
+	if (vs->flags & VXLAN_F_GBP)
+		vxlan_build_gbp_hdr(vxh, vs, md);
+
 	skb_set_inner_protocol(skb, htons(ETH_P_TEB));
 
 	udp_tunnel6_xmit_skb(vs->sock, dst, skb, dev, saddr, daddr, prio,
@@ -1728,7 +1770,8 @@ err:
 int vxlan_xmit_skb(struct vxlan_sock *vs,
 		   struct rtable *rt, struct sk_buff *skb,
 		   __be32 src, __be32 dst, __u8 tos, __u8 ttl, __be16 df,
-		   __be16 src_port, __be16 dst_port, __be32 vni, bool xnet)
+		   __be16 src_port, __be16 dst_port,
+		   struct vxlan_metadata *md, bool xnet)
 {
 	struct vxlanhdr *vxh;
 	int min_headroom;
@@ -1771,7 +1814,7 @@ int vxlan_xmit_skb(struct vxlan_sock *vs,
 
 	vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh));
 	vxh->vx_flags = htonl(VXLAN_HF_VNI);
-	vxh->vx_vni = vni;
+	vxh->vx_vni = md->vni;
 
 	if (type & SKB_GSO_TUNNEL_REMCSUM) {
 		u32 data = (skb_checksum_start_offset(skb) - hdrlen) >>
@@ -1789,6 +1832,9 @@ int vxlan_xmit_skb(struct vxlan_sock *vs,
 		}
 	}
 
+	if (vs->flags & VXLAN_F_GBP)
+		vxlan_build_gbp_hdr(vxh, vs, md);
+
 	skb_set_inner_protocol(skb, htons(ETH_P_TEB));
 
 	return udp_tunnel_xmit_skb(vs->sock, rt, skb, src, dst, tos,
@@ -1849,6 +1895,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 	const struct iphdr *old_iph;
 	struct flowi4 fl4;
 	union vxlan_addr *dst;
+	struct vxlan_metadata md;
 	__be16 src_port = 0, dst_port;
 	u32 vni;
 	__be16 df = 0;
@@ -1919,11 +1966,12 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 
 		tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
 		ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
+		md.vni = htonl(vni << 8);
+		md.gbp = skb->mark;
 
 		err = vxlan_xmit_skb(vxlan->vn_sock, rt, skb,
 				     fl4.saddr, dst->sin.sin_addr.s_addr,
-				     tos, ttl, df, src_port, dst_port,
-				     htonl(vni << 8),
+				     tos, ttl, df, src_port, dst_port, &md,
 				     !net_eq(vxlan->net, dev_net(vxlan->dev)));
 		if (err < 0) {
 			/* skb is already freed. */
@@ -1976,10 +2024,12 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 		}
 
 		ttl = ttl ? : ip6_dst_hoplimit(ndst);
+		md.vni = htonl(vni << 8);
+		md.gbp = skb->mark;
 
 		err = vxlan6_xmit_skb(vxlan->vn_sock, ndst, skb,
 				      dev, &fl6.saddr, &fl6.daddr, 0, ttl,
-				      src_port, dst_port, htonl(vni << 8),
+				      src_port, dst_port, &md,
 				      !net_eq(vxlan->net, dev_net(vxlan->dev)));
 #endif
 	}
@@ -2382,6 +2432,7 @@ static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = {
 	[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]	= { .type = NLA_U8 },
 	[IFLA_VXLAN_REMCSUM_TX]	= { .type = NLA_U8 },
 	[IFLA_VXLAN_REMCSUM_RX]	= { .type = NLA_U8 },
+	[IFLA_VXLAN_GBP]	= { .type = NLA_FLAG, },
 };
 
 static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[])
@@ -2706,6 +2757,9 @@ static int vxlan_newlink(struct net *net, struct net_device *dev,
 	    nla_get_u8(data[IFLA_VXLAN_REMCSUM_RX]))
 		vxlan->flags |= VXLAN_F_REMCSUM_RX;
 
+	if (data[IFLA_VXLAN_GBP])
+		vxlan->flags |= VXLAN_F_GBP;
+
 	if (vxlan_find_vni(net, vni, use_ipv6 ? AF_INET6 : AF_INET,
 			   vxlan->dst_port)) {
 		pr_info("duplicate VNI %u\n", vni);
@@ -2851,6 +2905,10 @@ static int vxlan_fill_info(struct sk_buff *skb, const struct net_device *dev)
 	if (nla_put(skb, IFLA_VXLAN_PORT_RANGE, sizeof(ports), &ports))
 		goto nla_put_failure;
 
+	if (vxlan->flags & VXLAN_F_GBP &&
+	    nla_put_flag(skb, IFLA_VXLAN_GBP))
+		goto nla_put_failure;
+
 	return 0;
 
 nla_put_failure:
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 0a7443b..f4a3583 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -11,15 +11,76 @@
 #define VNI_HASH_BITS	10
 #define VNI_HASH_SIZE	(1<<VNI_HASH_BITS)
 
-/* VXLAN protocol header */
+/*
+ * VXLAN Group Based Policy Extension:
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |1|-|-|-|1|-|-|-|R|D|R|R|A|R|R|R|        Group Policy ID        |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |                VXLAN Network Identifier (VNI) |   Reserved    |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *
+ * D = Don't Learn bit. When set, this bit indicates that the egress
+ *     VTEP MUST NOT learn the source address of the encapsulated frame.
+ *
+ * A = Indicates that the group policy has already been applied to
+ *     this packet. Policies MUST NOT be applied by devices when the
+ *     A bit is set.
+ *
+ * [0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
+ */
+struct vxlanhdr_gbp {
+	__u8	vx_flags;
+#ifdef __LITTLE_ENDIAN_BITFIELD
+	__u8	reserved_flags1:3,
+		policy_applied:1,
+		reserved_flags2:2,
+		dont_learn:1,
+		reserved_flags3:1;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+	__u8	reserved_flags1:1,
+		dont_learn:1,
+		reserved_flags2:2,
+		policy_applied:1,
+		reserved_flags3:3;
+#else
+#error	"Please fix <asm/byteorder.h>"
+#endif
+	__be16	policy_id;
+	__be32	vx_vni;
+};
+
+#define VXLAN_GBP_USED_BITS (VXLAN_HF_GBP | 0xFFFFFF)
+
+/* skb->mark mapping
+ *
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |R|R|R|R|R|R|R|R|R|D|R|R|A|R|R|R|        Group Policy ID        |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ */
+#define VXLAN_GBP_DONT_LEARN		(BIT(6) << 16)
+#define VXLAN_GBP_POLICY_APPLIED	(BIT(3) << 16)
+#define VXLAN_GBP_ID_MASK		(0xFFFF)
+
+/* VXLAN protocol header:
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |G|R|R|R|I|R|R|C|               Reserved                        |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |                VXLAN Network Identifier (VNI) |   Reserved    |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *
+ * G = 1	Group Policy (VXLAN-GBP)
+ * I = 1	VXLAN Network Identifier (VNI) present
+ * C = 1	Remote checksum offload (RCO)
+ */
 struct vxlanhdr {
 	__be32 vx_flags;
 	__be32 vx_vni;
 };
 
 /* VXLAN header flags. */
-#define VXLAN_HF_VNI 0x08000000
-#define VXLAN_HF_RCO 0x00200000
+#define VXLAN_HF_RCO BIT(24)
+#define VXLAN_HF_VNI BIT(27)
+#define VXLAN_HF_GBP BIT(31)
 
 /* Remote checksum offload header option */
 #define VXLAN_RCO_MASK  0x7f    /* Last byte of vni field */
@@ -32,8 +93,14 @@ struct vxlanhdr {
 #define VXLAN_VID_MASK  (VXLAN_N_VID - 1)
 #define VXLAN_HLEN (sizeof(struct udphdr) + sizeof(struct vxlanhdr))
 
+struct vxlan_metadata {
+	__be32		vni;
+	u32		gbp;
+};
+
 struct vxlan_sock;
-typedef void (vxlan_rcv_t)(struct vxlan_sock *vh, struct sk_buff *skb, __be32 key);
+typedef void (vxlan_rcv_t)(struct vxlan_sock *vh, struct sk_buff *skb,
+			   struct vxlan_metadata *md);
 
 /* per UDP socket information */
 struct vxlan_sock {
@@ -60,6 +127,7 @@ struct vxlan_sock {
 #define VXLAN_F_UDP_ZERO_CSUM6_RX	0x100
 #define VXLAN_F_REMCSUM_TX		0x200
 #define VXLAN_F_REMCSUM_RX		0x400
+#define VXLAN_F_GBP			0x800
 
 struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
 				  vxlan_rcv_t *rcv, void *data,
@@ -70,7 +138,8 @@ void vxlan_sock_release(struct vxlan_sock *vs);
 int vxlan_xmit_skb(struct vxlan_sock *vs,
 		   struct rtable *rt, struct sk_buff *skb,
 		   __be32 src, __be32 dst, __u8 tos, __u8 ttl, __be16 df,
-		   __be16 src_port, __be16 dst_port, __be32 vni, bool xnet);
+		   __be16 src_port, __be16 dst_port, struct vxlan_metadata *md,
+		   bool xnet);
 
 static inline netdev_features_t vxlan_features_check(struct sk_buff *skb,
 						     netdev_features_t features)
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index b2723f6..2a8380e 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -372,6 +372,7 @@ enum {
 	IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
 	IFLA_VXLAN_REMCSUM_TX,
 	IFLA_VXLAN_REMCSUM_RX,
+	IFLA_VXLAN_GBP,
 	__IFLA_VXLAN_MAX
 };
 #define IFLA_VXLAN_MAX	(__IFLA_VXLAN_MAX - 1)
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index d7c46b3..deed9e3 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -59,7 +59,8 @@ static inline struct vxlan_port *vxlan_vport(const struct vport *vport)
 }
 
 /* Called with rcu_read_lock and BH disabled. */
-static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
+static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb,
+		      struct vxlan_metadata *md)
 {
 	struct ovs_tunnel_info tun_info;
 	struct vport *vport = vs->data;
@@ -68,7 +69,7 @@ static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
 
 	/* Save outer tunnel values */
 	iph = ip_hdr(skb);
-	key = cpu_to_be64(ntohl(vx_vni) >> 8);
+	key = cpu_to_be64(ntohl(md->vni) >> 8);
 	ovs_flow_tun_info_init(&tun_info, iph,
 			       udp_hdr(skb)->source, udp_hdr(skb)->dest,
 			       key, TUNNEL_KEY, NULL, 0);
@@ -146,6 +147,7 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
 	struct vxlan_port *vxlan_port = vxlan_vport(vport);
 	__be16 dst_port = inet_sk(vxlan_port->vs->sock->sk)->inet_sport;
 	struct ovs_key_ipv4_tunnel *tun_key;
+	struct vxlan_metadata md = {0};
 	struct rtable *rt;
 	struct flowi4 fl;
 	__be16 src_port;
@@ -178,12 +180,13 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
 	skb->ignore_df = 1;
 
 	src_port = udp_flow_src_port(net, skb, 0, 0, true);
+	md.vni = htonl(be64_to_cpu(tun_key->tun_id) << 8);
 
 	err = vxlan_xmit_skb(vxlan_port->vs, rt, skb,
 			     fl.saddr, tun_key->ipv4_dst,
 			     tun_key->ipv4_tos, tun_key->ipv4_ttl, df,
 			     src_port, dst_port,
-			     htonl(be64_to_cpu(tun_key->tun_id) << 8),
+			     &md,
 			     false);
 	if (err < 0)
 		ip_rt_put(rt);
-- 
1.9.3


* [PATCH 3/5] openvswitch: Rename GENEVE_TUN_OPTS() to TUN_METADATA_OPTS()
From: Thomas Graf @ 2015-01-15  2:53 UTC
  To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov,
	nicolas.dichtel
  Cc: netdev, dev
In-Reply-To: <cover.1421290198.git.tgraf@suug.ch>

Also factors out the Geneve validation code into a new, separate function,
validate_geneve_opts().

A subsequent patch will introduce VXLAN options. Rename the existing
GENEVE_OPTS() macro to TUN_METADATA_OPTS() to reflect its extended purpose
of carrying generic tunnel metadata options.
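
The renamed macros place the options at the end of the tun_opts buffer.
The following user-space sketch shows the same arithmetic; struct
flow_key_sketch, the 255-byte buffer size and the tun_metadata_*() helper
functions are assumptions made for illustration and are not the kernel
definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TUN_OPTS_SIZE 255	/* assumption: illustrative buffer size */

/* Hypothetical stand-in for the relevant part of struct sw_flow_key. */
struct flow_key_sketch {
	uint8_t tun_opts[TUN_OPTS_SIZE];
	uint8_t tun_opts_len;
};

/* Same arithmetic as TUN_METADATA_OFFSET(): options are right-aligned,
 * so shorter options start further into the buffer.
 */
static size_t tun_metadata_offset(size_t opt_len)
{
	assert(opt_len <= TUN_OPTS_SIZE);
	return TUN_OPTS_SIZE - opt_len;
}

/* Same idea as TUN_METADATA_OPTS(): untyped pointer to the first option
 * byte, left to the caller to interpret (Geneve, VXLAN, ...).
 */
static void *tun_metadata_opts(struct flow_key_sketch *key, size_t opt_len)
{
	return key->tun_opts + tun_metadata_offset(opt_len);
}

/* Copy opt_len bytes of options into the tail of the buffer. */
static void tun_metadata_store(struct flow_key_sketch *key,
			       const void *opts, uint8_t opt_len)
{
	memcpy(tun_metadata_opts(key, opt_len), opts, opt_len);
	key->tun_opts_len = opt_len;
}
```

Returning void * rather than struct geneve_opt * is what lets the same
storage carry metadata for other tunnel types.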

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v5->v6:
 - No change
v4->v5:
 - No change
v3->v4:
 - Renamed validate_and_copy_geneve_opts() to validate_geneve_opts() as
   suggested by Jesse
v2->v3:
 - No change
v1->v2:
 - Don't rename genev_tun_opt_from_nlattr() and keep it Geneve specific,
   pointed out by Jesse.
 - Factor out Geneve specific validation code into separate function as
   requested by Jesse.
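
The Geneve validation that this patch factors out walks the option list
one header at a time. Below is a minimal user-space sketch of that walk
over raw bytes. It assumes the option header layout of the Geneve
specification (2-byte class, 1-byte type whose high bit marks a critical
option, then 3 reserved bits and a 5-bit length counted in 4-byte words);
geneve_opts_check() and the two constants are hypothetical names, not
kernel symbols.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GENEVE_OPT_HDRLEN 4	/* class (2) + type (1) + rsvd/length (1) */
#define GENEVE_CRIT_OPT   0x80	/* high bit of the type byte */

/* Hypothetical helper: returns 0 and sets *crit when the option list is
 * well formed, -1 when an option header or its data would run past the
 * end of the buffer.
 */
static int geneve_opts_check(const uint8_t *opts, size_t opts_len,
			     bool *crit)
{
	*crit = false;

	while (opts_len > 0) {
		size_t len;

		if (opts_len < GENEVE_OPT_HDRLEN)
			return -1;

		/* Low 5 bits of the fourth byte: data length in words. */
		len = GENEVE_OPT_HDRLEN + (size_t)(opts[3] & 0x1f) * 4;
		if (len > opts_len)
			return -1;

		if (opts[2] & GENEVE_CRIT_OPT)
			*crit = true;

		opts += len;
		opts_len -= len;
	}
	return 0;
}
```

The critical-option result corresponds to what the patch records in
tun_flags as TUNNEL_CRIT_OPT.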

 net/openvswitch/flow.c         |  2 +-
 net/openvswitch/flow.h         | 14 ++++----
 net/openvswitch/flow_netlink.c | 72 +++++++++++++++++++++++-------------------
 3 files changed, 47 insertions(+), 41 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index df334fe..e2c348b 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -691,7 +691,7 @@ int ovs_flow_key_extract(const struct ovs_tunnel_info *tun_info,
 			BUILD_BUG_ON((1 << (sizeof(tun_info->options_len) *
 						   8)) - 1
 					> sizeof(key->tun_opts));
-			memcpy(GENEVE_OPTS(key, tun_info->options_len),
+			memcpy(TUN_METADATA_OPTS(key, tun_info->options_len),
 			       tun_info->options, tun_info->options_len);
 			key->tun_opts_len = tun_info->options_len;
 		} else {
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index a8b30f3..d3d0a40 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -53,7 +53,7 @@ struct ovs_key_ipv4_tunnel {
 
 struct ovs_tunnel_info {
 	struct ovs_key_ipv4_tunnel tunnel;
-	const struct geneve_opt *options;
+	const void *options;
 	u8 options_len;
 };
 
@@ -61,10 +61,10 @@ struct ovs_tunnel_info {
  * maximum size. This allows us to get the benefits of variable length
  * matching for small options.
  */
-#define GENEVE_OPTS(flow_key, opt_len)	\
-	((struct geneve_opt *)((flow_key)->tun_opts + \
-			       FIELD_SIZEOF(struct sw_flow_key, tun_opts) - \
-			       opt_len))
+#define TUN_METADATA_OFFSET(opt_len) \
+	(FIELD_SIZEOF(struct sw_flow_key, tun_opts) - opt_len)
+#define TUN_METADATA_OPTS(flow_key, opt_len) \
+	((void *)((flow_key)->tun_opts + TUN_METADATA_OFFSET(opt_len)))
 
 static inline void __ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
 					    __be32 saddr, __be32 daddr,
@@ -73,7 +73,7 @@ static inline void __ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
 					    __be16 tp_dst,
 					    __be64 tun_id,
 					    __be16 tun_flags,
-					    const struct geneve_opt *opts,
+					    const void *opts,
 					    u8 opts_len)
 {
 	tun_info->tunnel.tun_id = tun_id;
@@ -105,7 +105,7 @@ static inline void ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
 					  __be16 tp_dst,
 					  __be64 tun_id,
 					  __be16 tun_flags,
-					  const struct geneve_opt *opts,
+					  const void *opts,
 					  u8 opts_len)
 {
 	__ovs_flow_tun_info_init(tun_info, iph->saddr, iph->daddr,
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index d1eecf7..2e8a9cd 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -432,8 +432,7 @@ static int genev_tun_opt_from_nlattr(const struct nlattr *a,
 		SW_FLOW_KEY_PUT(match, tun_opts_len, 0xff, true);
 	}
 
-	opt_key_offset = (unsigned long)GENEVE_OPTS((struct sw_flow_key *)0,
-						    nla_len(a));
+	opt_key_offset = TUN_METADATA_OFFSET(nla_len(a));
 	SW_FLOW_KEY_MEMCPY_OFFSET(match, opt_key_offset, nla_data(a),
 				  nla_len(a), is_mask);
 	return 0;
@@ -558,8 +557,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 
 static int __ipv4_tun_to_nlattr(struct sk_buff *skb,
 				const struct ovs_key_ipv4_tunnel *output,
-				const struct geneve_opt *tun_opts,
-				int swkey_tun_opts_len)
+				const void *tun_opts, int swkey_tun_opts_len)
 {
 	if (output->tun_flags & TUNNEL_KEY &&
 	    nla_put_be64(skb, OVS_TUNNEL_KEY_ATTR_ID, output->tun_id))
@@ -600,8 +598,7 @@ static int __ipv4_tun_to_nlattr(struct sk_buff *skb,
 
 static int ipv4_tun_to_nlattr(struct sk_buff *skb,
 			      const struct ovs_key_ipv4_tunnel *output,
-			      const struct geneve_opt *tun_opts,
-			      int swkey_tun_opts_len)
+			      const void *tun_opts, int swkey_tun_opts_len)
 {
 	struct nlattr *nla;
 	int err;
@@ -1148,10 +1145,10 @@ int ovs_nla_put_flow(const struct sw_flow_key *swkey,
 		goto nla_put_failure;
 
 	if ((swkey->tun_key.ipv4_dst || is_mask)) {
-		const struct geneve_opt *opts = NULL;
+		const void *opts = NULL;
 
 		if (output->tun_key.tun_flags & TUNNEL_OPTIONS_PRESENT)
-			opts = GENEVE_OPTS(output, swkey->tun_opts_len);
+			opts = TUN_METADATA_OPTS(output, swkey->tun_opts_len);
 
 		if (ipv4_tun_to_nlattr(skb, &output->tun_key, opts,
 				       swkey->tun_opts_len))
@@ -1540,6 +1537,34 @@ void ovs_match_init(struct sw_flow_match *match,
 	}
 }
 
+static int validate_geneve_opts(struct sw_flow_key *key)
+{
+	struct geneve_opt *option;
+	int opts_len = key->tun_opts_len;
+	bool crit_opt = false;
+
+	option = (struct geneve_opt *)TUN_METADATA_OPTS(key, key->tun_opts_len);
+	while (opts_len > 0) {
+		int len;
+
+		if (opts_len < sizeof(*option))
+			return -EINVAL;
+
+		len = sizeof(*option) + option->length * 4;
+		if (len > opts_len)
+			return -EINVAL;
+
+		crit_opt |= !!(option->type & GENEVE_CRIT_OPT_TYPE);
+
+		option = (struct geneve_opt *)((u8 *)option + len);
+		opts_len -= len;
+	};
+
+	key->tun_key.tun_flags |= crit_opt ? TUNNEL_CRIT_OPT : 0;
+
+	return 0;
+}
+
 static int validate_and_copy_set_tun(const struct nlattr *attr,
 				     struct sw_flow_actions **sfa, bool log)
 {
@@ -1555,28 +1580,9 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 		return err;
 
 	if (key.tun_opts_len) {
-		struct geneve_opt *option = GENEVE_OPTS(&key,
-							key.tun_opts_len);
-		int opts_len = key.tun_opts_len;
-		bool crit_opt = false;
-
-		while (opts_len > 0) {
-			int len;
-
-			if (opts_len < sizeof(*option))
-				return -EINVAL;
-
-			len = sizeof(*option) + option->length * 4;
-			if (len > opts_len)
-				return -EINVAL;
-
-			crit_opt |= !!(option->type & GENEVE_CRIT_OPT_TYPE);
-
-			option = (struct geneve_opt *)((u8 *)option + len);
-			opts_len -= len;
-		};
-
-		key.tun_key.tun_flags |= crit_opt ? TUNNEL_CRIT_OPT : 0;
+		err = validate_geneve_opts(&key);
+		if (err < 0)
+			return err;
 	};
 
 	start = add_nested_action_start(sfa, OVS_ACTION_ATTR_SET, log);
@@ -1597,9 +1603,9 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 		 * everything else will go away after flow setup. We can append
 		 * it to tun_info and then point there.
 		 */
-		memcpy((tun_info + 1), GENEVE_OPTS(&key, key.tun_opts_len),
-		       key.tun_opts_len);
-		tun_info->options = (struct geneve_opt *)(tun_info + 1);
+		memcpy((tun_info + 1),
+		       TUN_METADATA_OPTS(&key, key.tun_opts_len), key.tun_opts_len);
+		tun_info->options = (tun_info + 1);
 	} else {
 		tun_info->options = NULL;
 	}
-- 
1.9.3

^ permalink raw reply related

* [PATCH 0/5 net-next v6] VXLAN Group Policy Extension
From: Thomas Graf @ 2015-01-15  2:53 UTC (permalink / raw)
  To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov,
	nicolas.dichtel
  Cc: netdev, dev

Implements support for the Group Policy VXLAN extension [0] to provide
a lightweight and simple security label mechanism across network peers
based on VXLAN. The security context and associated metadata are mapped
to/from skb->mark. This allows further mapping to a SELinux context
using SECMARK, to implement ACLs directly with nftables, iptables, OVS,
tc, etc.

The extension is disabled by default and should be run on a distinct
port in mixed Linux VXLAN VTEP environments. Liberal VXLAN VTEPs
which ignore unknown reserved bits will be able to receive VXLAN-GBP
frames.

Simple usage example:

10.1.1.1:
   # ip link add vxlan0 type vxlan id 10 remote 10.1.1.2 gbp
   # iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200

10.1.1.2:
   # ip link add vxlan0 type vxlan id 10 remote 10.1.1.1 gbp
   # iptables -I INPUT -m mark --mark 0x200 -j DROP

iproute2 [1] and OVS [2] support will be provided in separate patches.

[0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
[1] https://github.com/tgraf/iproute2/tree/vxlan-gbp
[2] https://github.com/tgraf/ovs/tree/vxlan-gbp

Thomas Graf (5):
  vxlan: Group Policy extension
  vxlan: Only bind to sockets with compatible flags enabled
  openvswitch: Rename GENEVE_TUN_OPTS() to TUN_METADATA_OPTS()
  openvswitch: Allow for any level of nesting in flow attributes
  openvswitch: Support VXLAN Group Policy extension

 drivers/net/vxlan.c              | 113 ++++++++++++----
 include/net/ip_tunnels.h         |   5 +-
 include/net/vxlan.h              |  82 ++++++++++-
 include/uapi/linux/if_link.h     |   1 +
 include/uapi/linux/openvswitch.h |  11 ++
 net/openvswitch/flow.c           |   2 +-
 net/openvswitch/flow.h           |  14 +-
 net/openvswitch/flow_netlink.c   | 286 ++++++++++++++++++++++++++-------------
 net/openvswitch/vport-geneve.c   |  15 +-
 net/openvswitch/vport-vxlan.c    |  91 ++++++++++++-
 net/openvswitch/vport-vxlan.h    |  11 ++
 11 files changed, 491 insertions(+), 140 deletions(-)
 create mode 100644 net/openvswitch/vport-vxlan.h

-- 
1.9.3

^ permalink raw reply

* [PATCH 2/5] vxlan: Only bind to sockets with compatible flags enabled
From: Thomas Graf @ 2015-01-15  2:53 UTC (permalink / raw)
  To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov,
	nicolas.dichtel
  Cc: netdev, dev
In-Reply-To: <cover.1421290198.git.tgraf@suug.ch>

A VXLAN net_device looking for an appropriate socket may only consider
a socket which has a matching set of flags/extensions enabled. If
incompatible flags are enabled, return a conflict to have the caller
create a distinct socket with a distinct port.

The OVS VXLAN port is kept unaware of extensions at this point.
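
The matching rule described above can be sketched in user space roughly
as follows. This is only an illustration, not the driver code: the two
flag values are copied from include/net/vxlan.h in this patch, while
struct demo_sock and demo_find_sock() are hypothetical names, and a
plain array stands in for the per-netns hash list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as in this patch; all demo_* names are hypothetical. */
#define VXLAN_F_REMCSUM_RX	0x400u
#define VXLAN_F_GBP		0x800u
#define VXLAN_F_UNSHAREABLE	VXLAN_F_GBP

struct demo_sock {
	uint16_t port;
	uint32_t flags;
};

/* Return the first socket on `port` whose unshareable flags equal the
 * caller's unshareable flags, or NULL if there is none. Flags outside
 * VXLAN_F_UNSHAREABLE do not influence the match.
 */
static const struct demo_sock *demo_find_sock(const struct demo_sock *socks,
					      size_t n, uint16_t port,
					      uint32_t flags)
{
	uint32_t match_flags = flags & VXLAN_F_UNSHAREABLE;
	size_t i;

	for (i = 0; i < n; i++) {
		if (socks[i].port == port &&
		    (socks[i].flags & VXLAN_F_UNSHAREABLE) == match_flags)
			return &socks[i];
	}
	return NULL;
}
```

With this rule, a GBP-enabled device never finds a non-GBP socket on
the same port (and vice versa), so the caller ends up creating or
requesting a separate socket.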

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v5->v6:
 - Keep sharing logic but base it off unsharable flags instead of exts
   member as suggested by Tom
v4->v5:
 - No change
v3->v4:
 - No change
v2->v3:
 - No change
v1->v2:
 - Improved commit message, reported by Jesse

 drivers/net/vxlan.c | 29 ++++++++++++++++++-----------
 include/net/vxlan.h |  3 +++
 2 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 6dbf8e0..6b6b456 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -263,15 +263,19 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
 	return list_first_entry(&fdb->remotes, struct vxlan_rdst, list);
 }
 
-/* Find VXLAN socket based on network namespace, address family and UDP port */
-static struct vxlan_sock *vxlan_find_sock(struct net *net,
-					  sa_family_t family, __be16 port)
+/* Find VXLAN socket based on network namespace, address family and UDP port
+ * and enabled unshareable flags.
+ */
+static struct vxlan_sock *vxlan_find_sock(struct net *net, sa_family_t family,
+					  __be16 port, u32 flags)
 {
 	struct vxlan_sock *vs;
+	u32 match_flags = flags & VXLAN_F_UNSHAREABLE;
 
 	hlist_for_each_entry_rcu(vs, vs_head(net, port), hlist) {
 		if (inet_sk(vs->sock->sk)->inet_sport == port &&
-		    inet_sk(vs->sock->sk)->sk.sk_family == family)
+		    inet_sk(vs->sock->sk)->sk.sk_family == family &&
+		    (vs->flags & VXLAN_F_UNSHAREABLE) == match_flags)
 			return vs;
 	}
 	return NULL;
@@ -291,11 +295,12 @@ static struct vxlan_dev *vxlan_vs_find_vni(struct vxlan_sock *vs, u32 id)
 
 /* Look up VNI in a per net namespace table */
 static struct vxlan_dev *vxlan_find_vni(struct net *net, u32 id,
-					sa_family_t family, __be16 port)
+					sa_family_t family, __be16 port,
+					u32 flags)
 {
 	struct vxlan_sock *vs;
 
-	vs = vxlan_find_sock(net, family, port);
+	vs = vxlan_find_sock(net, family, port, flags);
 	if (!vs)
 		return NULL;
 
@@ -1957,7 +1962,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 
 			ip_rt_put(rt);
 			dst_vxlan = vxlan_find_vni(vxlan->net, vni,
-						   dst->sa.sa_family, dst_port);
+						   dst->sa.sa_family, dst_port,
+						   vxlan->flags);
 			if (!dst_vxlan)
 				goto tx_error;
 			vxlan_encap_bypass(skb, vxlan, dst_vxlan);
@@ -2016,7 +2022,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 
 			dst_release(ndst);
 			dst_vxlan = vxlan_find_vni(vxlan->net, vni,
-						   dst->sa.sa_family, dst_port);
+						   dst->sa.sa_family, dst_port,
+						   vxlan->flags);
 			if (!dst_vxlan)
 				goto tx_error;
 			vxlan_encap_bypass(skb, vxlan, dst_vxlan);
@@ -2186,7 +2193,7 @@ static int vxlan_init(struct net_device *dev)
 
 	spin_lock(&vn->sock_lock);
 	vs = vxlan_find_sock(vxlan->net, ipv6 ? AF_INET6 : AF_INET,
-			     vxlan->dst_port);
+			     vxlan->dst_port, vxlan->flags);
 	if (vs && atomic_add_unless(&vs->refcnt, 1, 0)) {
 		/* If we have a socket with same port already, reuse it */
 		vxlan_vs_add_dev(vs, vxlan);
@@ -2593,7 +2600,7 @@ struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
 		return vs;
 
 	spin_lock(&vn->sock_lock);
-	vs = vxlan_find_sock(net, ipv6 ? AF_INET6 : AF_INET, port);
+	vs = vxlan_find_sock(net, ipv6 ? AF_INET6 : AF_INET, port, flags);
 	if (vs && ((vs->rcv != rcv) ||
 		   !atomic_add_unless(&vs->refcnt, 1, 0)))
 			vs = ERR_PTR(-EBUSY);
@@ -2761,7 +2768,7 @@ static int vxlan_newlink(struct net *net, struct net_device *dev,
 		vxlan->flags |= VXLAN_F_GBP;
 
 	if (vxlan_find_vni(net, vni, use_ipv6 ? AF_INET6 : AF_INET,
-			   vxlan->dst_port)) {
+			   vxlan->dst_port, vxlan->flags)) {
 		pr_info("duplicate VNI %u\n", vni);
 		return -EEXIST;
 	}
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index f4a3583..7be8c34 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -129,6 +129,9 @@ struct vxlan_sock {
 #define VXLAN_F_REMCSUM_RX		0x400
 #define VXLAN_F_GBP			0x800
 
+/* These flags must match in order for a socket to be shareable */
+#define VXLAN_F_UNSHAREABLE		VXLAN_F_GBP
+
 struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
 				  vxlan_rcv_t *rcv, void *data,
 				  bool no_share, u32 flags);
-- 
1.9.3

^ permalink raw reply related

* [PATCH 5/5] openvswitch: Support VXLAN Group Policy extension
From: Thomas Graf @ 2015-01-15  2:53 UTC (permalink / raw)
  To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov,
	nicolas.dichtel
  Cc: netdev, dev
In-Reply-To: <cover.1421290198.git.tgraf@suug.ch>

Introduces support for the group policy extension to the VXLAN virtual
port. The extension is disabled by default and only enabled if the user
has provided the respective configuration.

  ovs-vsctl add-port br0 vxlan0 -- \
     set Interface vxlan0 type=vxlan options:exts=gbp

The configuration interface to enable the extension is based on a new
attribute OVS_VXLAN_EXT_GBP nested inside OVS_TUNNEL_ATTR_EXTENSION
which can carry additional extensions as needed in the future.

The group policy metadata is stored as a binary blob (struct ovs_vxlan_opts)
internally just like Geneve options but transported as nested Netlink
attributes to user space.

Renames the existing TUNNEL_OPTIONS_PRESENT to TUNNEL_GENEVE_OPT with the
binary value kept intact and introduces a new flag, TUNNEL_VXLAN_OPT.

The attributes OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and existing
OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS are implemented as mutually exclusive.
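
The mutual exclusion can be pictured with the simplified sketch below.
It assumes the attributes have already been reduced to an array of type
codes; the demo_* enum and function are hypothetical and only mirror
the opts_type bookkeeping, not the real Netlink parsing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical attribute types standing in for OVS_TUNNEL_KEY_ATTR_*. */
enum demo_attr {
	DEMO_ATTR_ID = 1,
	DEMO_ATTR_GENEVE_OPTS,
	DEMO_ATTR_VXLAN_OPTS,
};

/* Walk the attribute types and remember which metadata block was seen.
 * Returns 0 if there was none, the type of the single metadata block
 * if there was exactly one, or -EINVAL if a second block shows up.
 */
static int demo_scan_attrs(const int *types, size_t n)
{
	int opts_type = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		switch (types[i]) {
		case DEMO_ATTR_GENEVE_OPTS:
		case DEMO_ATTR_VXLAN_OPTS:
			if (opts_type)
				return -EINVAL; /* multiple metadata blocks */
			opts_type = types[i];
			break;
		default:
			break;
		}
	}
	return opts_type;
}
```

Returning the option type instead of a plain 0 lets the caller decide
later which validation to run, which is why callers have to test for
a negative value rather than for non-zero.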

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v5->v6:
 - No change
v4->v5:
 - No change
v3->v4:
 - Fixed OVS_VXLAN_EXT_MAX->OVS_VXLAN_EXT_GBP typo as spotted by Jesse
 - Only applied tunnel options if they are of the right type as
   suggested by Jesse
v2->v3:
 - No change
v1->v2:
 - Addressed Jesse's request to transport VXLAN options as Netlink
   attributes instead of a binary blob. Allows a partial transport of
   VXLAN extensions. Internally, the datapath continues to use a binary
   blob (defined in vport-vxlan.h) for performance reasons.
 - Added new TUNNEL_GENEVE_OPT and TUNNEL_VXLAN_OPT flags to mark
   tunnel option flavour
 - Correctly report VXLAN options to user space

 include/net/ip_tunnels.h         |   5 +-
 include/uapi/linux/openvswitch.h |  11 ++++
 net/openvswitch/flow_netlink.c   | 114 ++++++++++++++++++++++++++++++++++-----
 net/openvswitch/vport-geneve.c   |  15 ++++--
 net/openvswitch/vport-vxlan.c    |  82 +++++++++++++++++++++++++++-
 net/openvswitch/vport-vxlan.h    |  11 ++++
 6 files changed, 218 insertions(+), 20 deletions(-)
 create mode 100644 net/openvswitch/vport-vxlan.h

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 25a59eb..ce4db3c 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -97,7 +97,10 @@ struct ip_tunnel {
 #define TUNNEL_DONT_FRAGMENT    __cpu_to_be16(0x0100)
 #define TUNNEL_OAM		__cpu_to_be16(0x0200)
 #define TUNNEL_CRIT_OPT		__cpu_to_be16(0x0400)
-#define TUNNEL_OPTIONS_PRESENT	__cpu_to_be16(0x0800)
+#define TUNNEL_GENEVE_OPT	__cpu_to_be16(0x0800)
+#define TUNNEL_VXLAN_OPT	__cpu_to_be16(0x1000)
+
+#define TUNNEL_OPTIONS_PRESENT	(TUNNEL_GENEVE_OPT | TUNNEL_VXLAN_OPT)
 
 struct tnl_ptk_info {
 	__be16 flags;
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 3a6dcaa..e474c95 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -248,11 +248,21 @@ enum ovs_vport_attr {
 
 #define OVS_VPORT_ATTR_MAX (__OVS_VPORT_ATTR_MAX - 1)
 
+enum {
+	OVS_VXLAN_EXT_UNSPEC,
+	OVS_VXLAN_EXT_GBP,	/* Flag or __u32 */
+	__OVS_VXLAN_EXT_MAX,
+};
+
+#define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
+
+
 /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
  */
 enum {
 	OVS_TUNNEL_ATTR_UNSPEC,
 	OVS_TUNNEL_ATTR_DST_PORT, /* 16-bit UDP port, used by L4 tunnels. */
+	OVS_TUNNEL_ATTR_EXTENSION,
 	__OVS_TUNNEL_ATTR_MAX
 };
 
@@ -324,6 +334,7 @@ enum ovs_tunnel_key_attr {
 	OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS,        /* Array of Geneve options. */
 	OVS_TUNNEL_KEY_ATTR_TP_SRC,		/* be16 src Transport Port. */
 	OVS_TUNNEL_KEY_ATTR_TP_DST,		/* be16 dst Transport Port. */
+	OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS,		/* Nested OVS_VXLAN_EXT_* */
 	__OVS_TUNNEL_KEY_ATTR_MAX
 };
 
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 518941c..d210d1b 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -49,6 +49,7 @@
 #include <net/mpls.h>
 
 #include "flow_netlink.h"
+#include "vport-vxlan.h"
 
 struct ovs_len_tbl {
 	int len;
@@ -268,6 +269,9 @@ size_t ovs_tun_key_attr_size(void)
 		+ nla_total_size(0)    /* OVS_TUNNEL_KEY_ATTR_CSUM */
 		+ nla_total_size(0)    /* OVS_TUNNEL_KEY_ATTR_OAM */
 		+ nla_total_size(256)  /* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS */
+		/* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS is mutually exclusive with
+		 * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
+		 */
 		+ nla_total_size(2)    /* OVS_TUNNEL_KEY_ATTR_TP_SRC */
 		+ nla_total_size(2);   /* OVS_TUNNEL_KEY_ATTR_TP_DST */
 }
@@ -308,6 +312,7 @@ static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
 	[OVS_TUNNEL_KEY_ATTR_TP_DST]	    = { .len = sizeof(u16) },
 	[OVS_TUNNEL_KEY_ATTR_OAM]	    = { .len = 0 },
 	[OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS]   = { .len = OVS_ATTR_NESTED },
+	[OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS]    = { .len = OVS_ATTR_NESTED },
 };
 
 /* The size of the argument for each %OVS_KEY_ATTR_* Netlink attribute.  */
@@ -460,6 +465,41 @@ static int genev_tun_opt_from_nlattr(const struct nlattr *a,
 	return 0;
 }
 
+static const struct nla_policy vxlan_opt_policy[OVS_VXLAN_EXT_MAX + 1] = {
+	[OVS_VXLAN_EXT_GBP]	= { .type = NLA_U32 },
+};
+
+static int vxlan_tun_opt_from_nlattr(const struct nlattr *a,
+				     struct sw_flow_match *match, bool is_mask,
+				     bool log)
+{
+	struct nlattr *tb[OVS_VXLAN_EXT_MAX+1];
+	unsigned long opt_key_offset;
+	struct ovs_vxlan_opts opts;
+	int err;
+
+	BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts));
+
+	err = nla_parse_nested(tb, OVS_VXLAN_EXT_MAX, a, vxlan_opt_policy);
+	if (err < 0)
+		return err;
+
+	memset(&opts, 0, sizeof(opts));
+
+	if (tb[OVS_VXLAN_EXT_GBP])
+		opts.gbp = nla_get_u32(tb[OVS_VXLAN_EXT_GBP]);
+
+	if (!is_mask)
+		SW_FLOW_KEY_PUT(match, tun_opts_len, sizeof(opts), false);
+	else
+		SW_FLOW_KEY_PUT(match, tun_opts_len, 0xff, true);
+
+	opt_key_offset = TUN_METADATA_OFFSET(sizeof(opts));
+	SW_FLOW_KEY_MEMCPY_OFFSET(match, opt_key_offset, &opts, sizeof(opts),
+				  is_mask);
+	return 0;
+}
+
 static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 				struct sw_flow_match *match, bool is_mask,
 				bool log)
@@ -468,6 +508,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 	int rem;
 	bool ttl = false;
 	__be16 tun_flags = 0;
+	int opts_type = 0;
 
 	nla_for_each_nested(a, attr, rem) {
 		int type = nla_type(a);
@@ -527,11 +568,30 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 			tun_flags |= TUNNEL_OAM;
 			break;
 		case OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS:
+			if (opts_type) {
+				OVS_NLERR(log, "Multiple metadata blocks provided");
+				return -EINVAL;
+			}
+
 			err = genev_tun_opt_from_nlattr(a, match, is_mask, log);
 			if (err)
 				return err;
 
-			tun_flags |= TUNNEL_OPTIONS_PRESENT;
+			tun_flags |= TUNNEL_GENEVE_OPT;
+			opts_type = type;
+			break;
+		case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS:
+			if (opts_type) {
+				OVS_NLERR(log, "Multiple metadata blocks provided");
+				return -EINVAL;
+			}
+
+			err = vxlan_tun_opt_from_nlattr(a, match, is_mask, log);
+			if (err)
+				return err;
+
+			tun_flags |= TUNNEL_VXLAN_OPT;
+			opts_type = type;
 			break;
 		default:
 			OVS_NLERR(log, "Unknown IPv4 tunnel attribute %d",
@@ -560,6 +620,23 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 		}
 	}
 
+	return opts_type;
+}
+
+static int vxlan_opt_to_nlattr(struct sk_buff *skb,
+			       const void *tun_opts, int swkey_tun_opts_len)
+{
+	const struct ovs_vxlan_opts *opts = tun_opts;
+	struct nlattr *nla;
+
+	nla = nla_nest_start(skb, OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS);
+	if (!nla)
+		return -EMSGSIZE;
+
+	if (nla_put_u32(skb, OVS_VXLAN_EXT_GBP, opts->gbp) < 0)
+		return -EMSGSIZE;
+
+	nla_nest_end(skb, nla);
 	return 0;
 }
 
@@ -596,10 +673,15 @@ static int __ipv4_tun_to_nlattr(struct sk_buff *skb,
 	if ((output->tun_flags & TUNNEL_OAM) &&
 	    nla_put_flag(skb, OVS_TUNNEL_KEY_ATTR_OAM))
 		return -EMSGSIZE;
-	if (tun_opts &&
-	    nla_put(skb, OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS,
-		    swkey_tun_opts_len, tun_opts))
-		return -EMSGSIZE;
+	if (tun_opts) {
+		if (output->tun_flags & TUNNEL_GENEVE_OPT &&
+		    nla_put(skb, OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS,
+			    swkey_tun_opts_len, tun_opts))
+			return -EMSGSIZE;
+		else if (output->tun_flags & TUNNEL_VXLAN_OPT &&
+			 vxlan_opt_to_nlattr(skb, tun_opts, swkey_tun_opts_len))
+			return -EMSGSIZE;
+	}
 
 	return 0;
 }
@@ -680,7 +762,7 @@ static int metadata_from_nlattrs(struct sw_flow_match *match,  u64 *attrs,
 	}
 	if (*attrs & (1 << OVS_KEY_ATTR_TUNNEL)) {
 		if (ipv4_tun_from_nlattr(a[OVS_KEY_ATTR_TUNNEL], match,
-					 is_mask, log))
+					 is_mask, log) < 0)
 			return -EINVAL;
 		*attrs &= ~(1 << OVS_KEY_ATTR_TUNNEL);
 	}
@@ -1578,17 +1660,23 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 	struct sw_flow_key key;
 	struct ovs_tunnel_info *tun_info;
 	struct nlattr *a;
-	int err, start;
+	int err, start, opts_type;
 
 	ovs_match_init(&match, &key, NULL);
-	err = ipv4_tun_from_nlattr(nla_data(attr), &match, false, log);
-	if (err)
-		return err;
+	opts_type = ipv4_tun_from_nlattr(nla_data(attr), &match, false, log);
+	if (opts_type < 0)
+		return opts_type;
 
 	if (key.tun_opts_len) {
-		err = validate_geneve_opts(&key);
-		if (err < 0)
-			return err;
+		switch (opts_type) {
+		case OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS:
+			err = validate_geneve_opts(&key);
+			if (err < 0)
+				return err;
+			break;
+		case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS:
+			break;
+		}
 	};
 
 	start = add_nested_action_start(sfa, OVS_ACTION_ATTR_SET, log);
diff --git a/net/openvswitch/vport-geneve.c b/net/openvswitch/vport-geneve.c
index 2daf144..17b0840 100644
--- a/net/openvswitch/vport-geneve.c
+++ b/net/openvswitch/vport-geneve.c
@@ -88,7 +88,7 @@ static void geneve_rcv(struct geneve_sock *gs, struct sk_buff *skb)
 
 	opts_len = geneveh->opt_len * 4;
 
-	flags = TUNNEL_KEY | TUNNEL_OPTIONS_PRESENT |
+	flags = TUNNEL_KEY | TUNNEL_GENEVE_OPT |
 		(udp_hdr(skb)->check != 0 ? TUNNEL_CSUM : 0) |
 		(geneveh->oam ? TUNNEL_OAM : 0) |
 		(geneveh->critical ? TUNNEL_CRIT_OPT : 0);
@@ -178,7 +178,7 @@ static int geneve_tnl_send(struct vport *vport, struct sk_buff *skb)
 	__be16 sport;
 	struct rtable *rt;
 	struct flowi4 fl;
-	u8 vni[3];
+	u8 vni[3], opts_len, *opts;
 	__be16 df;
 	int err;
 
@@ -209,11 +209,18 @@ static int geneve_tnl_send(struct vport *vport, struct sk_buff *skb)
 	tunnel_id_to_vni(tun_key->tun_id, vni);
 	skb->ignore_df = 1;
 
+	if (tun_key->tun_flags & TUNNEL_GENEVE_OPT) {
+		opts = (u8 *)tun_info->options;
+		opts_len = tun_info->options_len;
+	} else {
+		opts = NULL;
+		opts_len = 0;
+	}
+
 	err = geneve_xmit_skb(geneve_port->gs, rt, skb, fl.saddr,
 			      tun_key->ipv4_dst, tun_key->ipv4_tos,
 			      tun_key->ipv4_ttl, df, sport, dport,
-			      tun_key->tun_flags, vni,
-			      tun_info->options_len, (u8 *)tun_info->options,
+			      tun_key->tun_flags, vni, opts_len, opts,
 			      false);
 	if (err < 0)
 		ip_rt_put(rt);
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index deed9e3..0beddd0 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -40,6 +40,7 @@
 
 #include "datapath.h"
 #include "vport.h"
+#include "vport-vxlan.h"
 
 /**
  * struct vxlan_port - Keeps track of open UDP ports
@@ -49,6 +50,7 @@
 struct vxlan_port {
 	struct vxlan_sock *vs;
 	char name[IFNAMSIZ];
+	u32 exts; /* VXLAN_F_* in <net/vxlan.h> */
 };
 
 static struct vport_ops ovs_vxlan_vport_ops;
@@ -63,16 +65,26 @@ static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb,
 		      struct vxlan_metadata *md)
 {
 	struct ovs_tunnel_info tun_info;
+	struct vxlan_port *vxlan_port;
 	struct vport *vport = vs->data;
 	struct iphdr *iph;
+	struct ovs_vxlan_opts opts = {
+		.gbp = md->gbp,
+	};
 	__be64 key;
+	__be16 flags;
+
+	flags = TUNNEL_KEY;
+	vxlan_port = vxlan_vport(vport);
+	if (vxlan_port->exts & VXLAN_F_GBP)
+		flags |= TUNNEL_VXLAN_OPT;
 
 	/* Save outer tunnel values */
 	iph = ip_hdr(skb);
 	key = cpu_to_be64(ntohl(md->vni) >> 8);
 	ovs_flow_tun_info_init(&tun_info, iph,
 			       udp_hdr(skb)->source, udp_hdr(skb)->dest,
-			       key, TUNNEL_KEY, NULL, 0);
+			       key, flags, &opts, sizeof(opts));
 
 	ovs_vport_receive(vport, skb, &tun_info);
 }
@@ -84,6 +96,21 @@ static int vxlan_get_options(const struct vport *vport, struct sk_buff *skb)
 
 	if (nla_put_u16(skb, OVS_TUNNEL_ATTR_DST_PORT, ntohs(dst_port)))
 		return -EMSGSIZE;
+
+	if (vxlan_port->exts) {
+		struct nlattr *exts;
+
+		exts = nla_nest_start(skb, OVS_TUNNEL_ATTR_EXTENSION);
+		if (!exts)
+			return -EMSGSIZE;
+
+		if (vxlan_port->exts & VXLAN_F_GBP &&
+		    nla_put_flag(skb, OVS_VXLAN_EXT_GBP))
+			return -EMSGSIZE;
+
+		nla_nest_end(skb, exts);
+	}
+
 	return 0;
 }
 
@@ -96,6 +123,31 @@ static void vxlan_tnl_destroy(struct vport *vport)
 	ovs_vport_deferred_free(vport);
 }
 
+static const struct nla_policy exts_policy[OVS_VXLAN_EXT_MAX+1] = {
+	[OVS_VXLAN_EXT_GBP]	= { .type = NLA_FLAG, },
+};
+
+static int vxlan_configure_exts(struct vport *vport, struct nlattr *attr)
+{
+	struct nlattr *exts[OVS_VXLAN_EXT_MAX+1];
+	struct vxlan_port *vxlan_port;
+	int err;
+
+	if (nla_len(attr) < sizeof(struct nlattr))
+		return -EINVAL;
+
+	err = nla_parse_nested(exts, OVS_VXLAN_EXT_MAX, attr, exts_policy);
+	if (err < 0)
+		return err;
+
+	vxlan_port = vxlan_vport(vport);
+
+	if (exts[OVS_VXLAN_EXT_GBP])
+		vxlan_port->exts |= VXLAN_F_GBP;
+
+	return 0;
+}
+
 static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
 {
 	struct net *net = ovs_dp_get_net(parms->dp);
@@ -128,7 +180,17 @@ static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
 	vxlan_port = vxlan_vport(vport);
 	strncpy(vxlan_port->name, parms->name, IFNAMSIZ);
 
-	vs = vxlan_sock_add(net, htons(dst_port), vxlan_rcv, vport, true, 0);
+	a = nla_find_nested(options, OVS_TUNNEL_ATTR_EXTENSION);
+	if (a) {
+		err = vxlan_configure_exts(vport, a);
+		if (err) {
+			ovs_vport_free(vport);
+			goto error;
+		}
+	}
+
+	vs = vxlan_sock_add(net, htons(dst_port), vxlan_rcv, vport, true,
+			    vxlan_port->exts);
 	if (IS_ERR(vs)) {
 		ovs_vport_free(vport);
 		return (void *)vs;
@@ -141,6 +203,21 @@ error:
 	return ERR_PTR(err);
 }
 
+static int vxlan_ext_gbp(struct sk_buff *skb)
+{
+	const struct ovs_tunnel_info *tun_info;
+	const struct ovs_vxlan_opts *opts;
+
+	tun_info = OVS_CB(skb)->egress_tun_info;
+	opts = tun_info->options;
+
+	if (tun_info->tunnel.tun_flags & TUNNEL_VXLAN_OPT &&
+	    tun_info->options_len >= sizeof(*opts))
+		return opts->gbp;
+	else
+		return 0;
+}
+
 static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
 {
 	struct net *net = ovs_dp_get_net(vport->dp);
@@ -181,6 +258,7 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
 
 	src_port = udp_flow_src_port(net, skb, 0, 0, true);
 	md.vni = htonl(be64_to_cpu(tun_key->tun_id) << 8);
+	md.gbp = vxlan_ext_gbp(skb);
 
 	err = vxlan_xmit_skb(vxlan_port->vs, rt, skb,
 			     fl.saddr, tun_key->ipv4_dst,
diff --git a/net/openvswitch/vport-vxlan.h b/net/openvswitch/vport-vxlan.h
new file mode 100644
index 0000000..4b08233e
--- /dev/null
+++ b/net/openvswitch/vport-vxlan.h
@@ -0,0 +1,11 @@
+#ifndef VPORT_VXLAN_H
+#define VPORT_VXLAN_H 1
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+struct ovs_vxlan_opts {
+	__u32 gbp;
+};
+
+#endif
-- 
1.9.3

^ permalink raw reply related

* Re: [PATCH net-next v13 3/3] net: hisilicon: new hip04 ethernet driver
From: Ding Tianhong @ 2015-01-15  2:54 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: robh+dt, davem, grant.likely, agraf, sergei.shtylyov,
	linux-arm-kernel, eric.dumazet, xuwei5, zhangfei.gao, netdev,
	devicetree, linux
In-Reply-To: <3131780.HetDHI4Cfl@wuerfel>

On 2015/1/14 16:53, Arnd Bergmann wrote:
> On Wednesday 14 January 2015 14:34:14 Ding Tianhong wrote:
>> +#define HIP04_MAX_TX_COALESCE_USECS    200
>> +#define HIP04_MIN_TX_COALESCE_USECS    100
>> +#define HIP04_MAX_TX_COALESCE_FRAMES   200
>> +#define HIP04_MIN_TX_COALESCE_FRAMES   100
> 
> It's not important, but in case you are creating another version of the
> patch, maybe the allowed range can be extended somewhat. The example values
> I picked when I sent my suggestion were really made up. It's great if
> they work fine, but users might want to  tune this far more depending on
> their workloads,  How about these
> 
> #define HIP04_MAX_TX_COALESCE_USECS    100000
> #define HIP04_MIN_TX_COALESCE_USECS    1
> #define HIP04_MAX_TX_COALESCE_FRAMES   (TX_DESC_NUM - 1)
> #define HIP04_MIN_TX_COALESCE_FRAMES   1
> 

Is such a large range really OK? I am worried that it may break the driver so
that hip04 no longer works properly. I am not sure it is OK, but I will change
it in the next version.
 
Ding

> 	Arnd
> 
> .
> 

^ permalink raw reply

* Re: [PATCH 1/5] vxlan: Group Policy extension
From: Thomas Graf @ 2015-01-15  2:55 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Miller, Jesse Gross, Stephen Hemminger, Pravin B Shelar,
	Alexei Starovoitov, Nicolas Dichtel, Linux Netdev List,
	dev@openvswitch.org
In-Reply-To: <20150115012804.GI2105@casper.infradead.org>

On 01/15/15 at 01:28am, Thomas Graf wrote:
> What exactly is the problem of having a distinct bitmap used by
> extensions? It is the least error prone method because it's clear that
> all extensions must match and we don't have to maintain an additional
> bitmask which can be forgotten to be updated.
> 
> If you need to compare additional receive checksum settings for RCO
> then that should be separate because as you say it's not an extension.

Tom,

OK. I have changed it to use flags instead of exts in v6. You should
be able to add whatever RCO flags need to be matched in
vxlan_find_sock() to the VXLAN_F_UNSHAREABLE bitmask.

Hope this makes everybody happy ;-)

^ permalink raw reply

* Re: [PATCH net-next v13 3/3] net: hisilicon: new hip04 ethernet driver
From: Ding Tianhong @ 2015-01-15  2:56 UTC (permalink / raw)
  To: Joe Perches
  Cc: arnd, robh+dt, davem, grant.likely, agraf, sergei.shtylyov,
	linux-arm-kernel, eric.dumazet, xuwei5, zhangfei.gao, netdev,
	devicetree, linux
In-Reply-To: <1421222782.2847.1.camel@perches.com>

On 2015/1/14 16:06, Joe Perches wrote:
> On Wed, 2015-01-14 at 14:34 +0800, Ding Tianhong wrote:
>> Support Hisilicon hip04 ethernet driver, including 100M / 1000M controller.
>> The controller has no tx done interrupt, reclaim xmitted buffer in the poll.
> 
> Single comment:
> 
>> diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c
> []
>> +static int hip04_rx_poll(struct napi_struct *napi, int budget)
>> +{
> []
>> +	while (cnt && !last) {
>> +		buf = priv->rx_buf[priv->rx_head];
>> +		skb = build_skb(buf, priv->rx_buf_size);
>> +		if (unlikely(!skb))
>> +			net_dbg_ratelimited("build_skb failed\n");
>> +
>> +		dma_unmap_single(&ndev->dev, priv->rx_phys[priv->rx_head],
>> +				 RX_BUF_SIZE, DMA_FROM_DEVICE);
>> +		priv->rx_phys[priv->rx_head] = 0;
>> +
>> +		desc = (struct rx_desc *)skb->data;
> 
> Perhaps if (!skb) is possible, there shouldn't be
> a dereference of that known null here.
> 
Yes, we should return early there to avoid an oops, thanks.

Ding

> 
> 
> .
> 

^ permalink raw reply

* Re: [PATCH 1/5] vxlan: Group Policy extension
From: Tom Herbert @ 2015-01-15  2:59 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David Miller, Jesse Gross, Stephen Hemminger, Pravin B Shelar,
	Alexei Starovoitov, Nicolas Dichtel, Linux Netdev List,
	dev@openvswitch.org
In-Reply-To: <20150115025539.GA20315@casper.infradead.org>

On Wed, Jan 14, 2015 at 6:55 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 01/15/15 at 01:28am, Thomas Graf wrote:
>> What exactly is the problem of having a distinct bitmap used by
>> extensions? It is the least error prone method because it's clear that
>> all extensions must match and we don't have to maintain an additional
>> bitmask which can be forgotten to be updated.
>>
>> If you need to compare additional receive checksum settings for RCO
>> then that should be separate because as you say it's not an extension.
>
> Tom,
>
> OK. I have changed it to use flags instead of exts in v6. You should
> be able to add whatever RCO flags need to be matched in
> vxlan_find_sock() to the VXLAN_F_UNSHAREABLE bitmask.
>
> Hope this makes everybody happy ;-)

Awesome. Thanks!

^ permalink raw reply

* Re: [PATCH 1/5] vxlan: Group Policy extension
From: Tom Herbert @ 2015-01-15  3:06 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David Miller, Jesse Gross, Stephen Hemminger, Pravin B Shelar,
	Alexei Starovoitov, Nicolas Dichtel, Linux Netdev List,
	dev@openvswitch.org
In-Reply-To: <80677f296e829f43ca57a4c0dc4f45782e2e6f8e.1421290198.git.tgraf@suug.ch>

> +struct vxlan_metadata {
> +       __be32          vni;
> +       u32             gbp;

Should this be __be32 also and use ntohl/htonl when setting to/from skb->mark?

> +};
> +

^ permalink raw reply

* [PATCH] net: core: Fix race by protecting process_queues at CPU hotplug
From: subashab @ 2015-01-15  3:13 UTC (permalink / raw)
  To: netdev

I am seeing frequent crashes under high throughput conditions on a
multiprocessor system running kernel version 3.10 where cores are being
hotplugged. I have pinned the network stack to a particular core using
receive packet steering (RPS). At the time of the crash, it looks like
contention on the process_queue between dev_cpu_callback and
process_backlog.

The following is the log at the moment of the crash:
 75241.598056:   <6> Unable to handle kernel NULL pointer dereference at virtual address 00000004
 75241.598064:   <6> pgd = c0004000
 75241.598072:   <2> [00000004] *pgd=00000000
 75241.598082:   <6> Internal error: Oops: 817 [#1] PREEMPT SMP ARM
 75241.598096:   <2> Modules linked in: wlan(O) [last unloaded: wlan]
 75241.598106:   <6> CPU: 0 PID: 3 Comm: ksoftirqd/0 Tainted: G        W O 3.10.49-g0b78c2e-00003-g2eab3e3 #1
 75241.598113:   <6> task: ee870a80 ti: ee888000 task.ti: ee888000
 75241.598133:   <2> PC is at process_backlog+0x78/0x140
 75241.598142:   <2> LR is at __netif_receive_skb_core+0x6fc/0x724

Here is the call stack involved in the crash:
-006|__skb_unlink
-006|__skb_dequeue
-006|dev_cpu_callback
-007|notifier_call_chain
-008|__raw_notifier_call_chain
-009|notifier_to_errno
-009|__cpu_notify
-010|cpu_notify_nofail
-011|check_for_tasks
-011|cpu_down
-012|disable_nonboot_cpus()
-013|suspend_enter
-013|suspend_devices_and_enter
-014|enter_state
-014|pm_suspend
-015|state_store
-016|kobj_attr_store
-017|flush_write_buffer
-017|sysfs_write_file
-018|vfs_write
-019|SYSC_write
-019|sys_write
-020|ret_fast_syscall

I have a fix in mind in which the process queues are protected using
spinlocks. With this patch applied, I no longer observe the crash. Since
the patch touches the hot path for incoming packet processing, I would
like some reviews of it. I would also like to know whether this change
introduces any potential performance bottlenecks.
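
As a rough userspace illustration of the locking pattern used in the patch
(take the queue lock, dequeue one packet, drop the lock while the packet is
handled, retake it before touching the queue again), here is a minimal
sketch. It is not kernel code: struct pkt, struct pkt_queue, queue_drain()
and the atomic_flag spinlock are hypothetical stand-ins for sk_buff,
process_queue, process_backlog() and spin_lock_irqsave(), and interrupt
masking is not modelled at all.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for an sk_buff on a singly linked queue. */
struct pkt {
	struct pkt *next;
	int id;
};

/* Hypothetical stand-in for process_queue plus its lock. */
struct pkt_queue {
	atomic_flag lock;	/* simple spinlock guarding head and tail */
	struct pkt *head;
	struct pkt *tail;
};

static void queue_init(struct pkt_queue *q)
{
	atomic_flag_clear(&q->lock);
	q->head = q->tail = NULL;
}

static void queue_lock(struct pkt_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void queue_unlock(struct pkt_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Append a packet; the caller must hold the queue lock. */
static void queue_push_locked(struct pkt_queue *q, struct pkt *p)
{
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
}

/* Remove and return the first packet; the caller must hold the queue lock. */
static struct pkt *queue_pop_locked(struct pkt_queue *q)
{
	struct pkt *p = q->head;

	if (p) {
		q->head = p->next;
		if (!q->head)
			q->tail = NULL;
		p->next = NULL;
	}
	return p;
}

/* Example handler: accumulates the ids of the packets it has seen. */
static int handled;

static void count_pkt(struct pkt *p)
{
	handled += p->id;
}

/*
 * Drain up to quota packets. The lock is held only while the queue itself
 * is touched and is dropped around the (possibly slow) handler, so a second
 * drainer can never see a half-unlinked packet.
 */
static int queue_drain(struct pkt_queue *q, int quota,
		       void (*handler)(struct pkt *))
{
	struct pkt *p;
	int work = 0;

	queue_lock(q);
	while (work < quota && (p = queue_pop_locked(q)) != NULL) {
		queue_unlock(q);
		handler(p);
		queue_lock(q);
		work++;
	}
	queue_unlock(q);
	return work;
}
```

Two contexts calling queue_drain() on the same queue serialize on the lock
for each dequeue, which is the property the patch is after. The cost is one
lock/unlock pair per packet, which is the hot-path overhead I am asking
about.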

    net: core: Protect process_queues at CPU hotplug

    When a CPU is hotplugged while processing incoming packets,
    dev_cpu_callback() will copy the poll list from the offline
    CPU and raise the softIRQ. In the same context, it will also
    process the process_queue of the offline CPU by dequeueing
    packets and calling netif_rx(). This creates a potential race
    condition between process_backlog() and dev_cpu_callback(),
    which both access the same process_queue. This patch
    protects against this concurrent access by locking.

    Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
    Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>

diff --git a/net/core/dev.c b/net/core/dev.c
index df0b522..aa8f503 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3640,6 +3640,7 @@ static void flush_backlog(void *arg)
        struct net_device *dev = arg;
        struct softnet_data *sd = &__get_cpu_var(softnet_data);
        struct sk_buff *skb, *tmp;
+       unsigned long flags;

        rps_lock(sd);
        skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp) {
@@ -3651,6 +3652,7 @@ static void flush_backlog(void *arg)
        }
        rps_unlock(sd);

+       spin_lock_irqsave(&sd->process_queue.lock, flags);
        skb_queue_walk_safe(&sd->process_queue, skb, tmp) {
                if (skb->dev == dev) {
                        __skb_unlink(skb, &sd->process_queue);
@@ -3658,6 +3660,7 @@ static void flush_backlog(void *arg)
                        input_queue_head_incr(sd);
                }
        }
+       spin_unlock_irqrestore(&sd->process_queue.lock, flags);
 }

 static int napi_gro_complete(struct sk_buff *skb)
@@ -4021,7 +4024,7 @@ static int process_backlog(struct napi_struct *napi, int quota)
 {
        int work = 0;
        struct softnet_data *sd = container_of(napi, struct softnet_data, backlog);
-
+       unsigned long flags;
 #ifdef CONFIG_RPS
        /* Check if we have pending ipi, its better to send them now,
         * not waiting net_rx_action() end.
@@ -4032,18 +4035,19 @@ static int process_backlog(struct napi_struct *napi, int quota)
        }
 #endif
        napi->weight = weight_p;
-       local_irq_disable();
+       spin_lock_irqsave(&sd->process_queue.lock, flags);
        while (work < quota) {
                struct sk_buff *skb;
                unsigned int qlen;

                while ((skb = __skb_dequeue(&sd->process_queue))) {
-                       local_irq_enable();
+                       spin_unlock_irqrestore(&sd->process_queue.lock, flags);
                        __netif_receive_skb(skb);
-                       local_irq_disable();
+                       spin_lock_irqsave(&sd->process_queue.lock, flags);
                        input_queue_head_incr(sd);
                        if (++work >= quota) {
-                               local_irq_enable();
+                               spin_unlock_irqrestore(&sd->process_queue.lock,
+                                                      flags);
                                return work;
                        }
                }
@@ -4054,6 +4058,7 @@ static int process_backlog(struct napi_struct *napi, int quota)
                        skb_queue_splice_tail_init(&sd->input_pkt_queue,
                                                   &sd->process_queue);

+
                if (qlen < quota - work) {
                        /*
                         * Inline a custom version of __napi_complete().
@@ -4069,7 +4074,7 @@ static int process_backlog(struct napi_struct *napi, int quota)
                }
                rps_unlock(sd);
        }
-       local_irq_enable();
+       spin_unlock_irqrestore(&sd->process_queue.lock, flags);

        return work;
 }
@@ -5991,6 +5996,7 @@ static int dev_cpu_callback(struct notifier_block *nfb,
        struct sk_buff *skb;
        unsigned int cpu, oldcpu = (unsigned long)ocpu;
        struct softnet_data *sd, *oldsd;
+       unsigned long flags;

        if (action != CPU_DEAD && action != CPU_DEAD_FROZEN)
                return NOTIFY_OK;
@@ -6024,11 +6030,15 @@ static int dev_cpu_callback(struct notifier_block *nfb,
        raise_softirq_irqoff(NET_TX_SOFTIRQ);
        local_irq_enable();

+       spin_lock_irqsave(&oldsd->process_queue.lock, flags);
        /* Process offline CPU's input_pkt_queue */
        while ((skb = __skb_dequeue(&oldsd->process_queue))) {
+               spin_unlock_irqrestore(&oldsd->process_queue.lock, flags);
                netif_rx(skb);
+               spin_lock_irqsave(&oldsd->process_queue.lock, flags);
                input_queue_head_incr(oldsd);
        }
+       spin_unlock_irqrestore(&oldsd->process_queue.lock, flags);
        while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
                netif_rx(skb);
                input_queue_head_incr(oldsd);

Thanks
KS

The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
 a Linux Foundation Collaborative Project

^ permalink raw reply related

* Re: [PATCH 1/5] vxlan: Group Policy extension
From: Thomas Graf @ 2015-01-15  3:20 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Miller, Jesse Gross, Stephen Hemminger, Pravin B Shelar,
	Alexei Starovoitov, Nicolas Dichtel, Linux Netdev List,
	dev@openvswitch.org
In-Reply-To: <CA+mtBx9QV89vb+kwY4PUnemwqq-vWgpU4paDEs0FjYVDS2MsXQ@mail.gmail.com>

On 01/14/15 at 07:06pm, Tom Herbert wrote:
> > +struct vxlan_metadata {
> > +       __be32          vni;
> > +       u32             gbp;
> 
> Should this be __be32 also and use ntohl/htonl when setting to/from skb->mark?

The bitmask is stored in host byte order in vxlan_metadata to be
compatible with skb->mark and converted to network byte order on
the wire, see:

        gbp = (struct vxlanhdr_gbp *)vxh;
        md.gbp = ntohs(gbp->policy_id);

and:

	gbp->policy_id = htons(md->gbp & VXLAN_GBP_ID_MASK);
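
To make the two directions concrete, here is a small userspace sketch of
the same idea: the value lives in host byte order in the metadata and is
only converted at the header boundary. The struct names, the helper names
and the GBP_ID_MASK value are assumptions made for this illustration; they
are not the definitions from the patch.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Assumed 16-bit mask for the policy ID; illustrative only. */
#define GBP_ID_MASK 0xFFFFu

/* Hypothetical on-the-wire header: policy_id is in network byte order. */
struct wire_hdr {
	uint16_t policy_id;
};

/* Hypothetical metadata: gbp is in host byte order, like skb->mark. */
struct metadata {
	uint32_t gbp;
};

/* Receive path: network order on the wire -> host order in metadata. */
static void hdr_to_md(const struct wire_hdr *h, struct metadata *md)
{
	md->gbp = ntohs(h->policy_id);
}

/* Transmit path: host order in metadata -> network order on the wire. */
static void md_to_hdr(const struct metadata *md, struct wire_hdr *h)
{
	h->policy_id = htons((uint16_t)(md->gbp & GBP_ID_MASK));
}
```

Because the conversion happens only in these two places, the metadata value
can be compared with or copied to a host-order mark without further
swapping.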

^ permalink raw reply

* Re: [net-next PATCH v2 03/12] net: flow: implement flow cache for get routines
From: John Fastabend @ 2015-01-15  3:21 UTC (permalink / raw)
  To: Thomas Graf; +Cc: simon.horman, sfeldma, netdev, gerlitz.or, jhs, andy, davem
In-Reply-To: <20150114215232.GF2105@casper.infradead.org>

On 01/14/2015 01:52 PM, Thomas Graf wrote:
> On 01/13/15 at 01:36pm, John Fastabend wrote:
>> I chose rhashtable to get the dynamic resizing. I could use arrays
>> but I don't want to pre-allocate large cache tables when we may
>> never use them.
>>
>> One oddity in the rhashtable implementation is there is no way
>> AFAICS to do delayed free's so we use rcu_sync heavily. This should be
>> fine, get operations shouldn't be a used heavily.
>
> John, can you please clarify a bit, I'm not sure I understand. Are you
> talking about delayed freeing of the table itself or elements?
>
> The Netlink usage would be an example of a user with delayed element
> freeing.
>
> I'm glad to add whatever is required.
>

I took another look at the netlink code, and it looks like this is the
correct pattern, where call_rcu implements the delayed freeing after a
grace period:

	mutex_lock(&my_hash_lock);
	rhashtable_remove(&my_hash, &my_obj->rhash_head);
	mutex_unlock(&my_hash_lock);

	[...]

	call_rcu(&my_obj->rcu, deferred_my_obj_free);

Anyway, it looks like there is no problem after all, and I don't
recall what I was thinking. Thanks for bearing with me. I'll convert
this code to avoid the overuse of rcu_sync.
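
For anyone following along, the shape of that pattern (unlink first, free
only after a grace period has passed) can be mimicked in plain userspace C.
This is only an analogy under stated assumptions: defer_call() and
grace_period_end() are hypothetical helpers standing in for call_rcu() and
the end of an RCU grace period, and the "free" just sets a flag so the
ordering is easy to observe.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct rcu_head. */
struct deferred {
	struct deferred *next;
	void (*func)(struct deferred *);
};

/* Callbacks queued for the next (simulated) grace period. */
static struct deferred *pending;

/* Hypothetical stand-in for call_rcu(): queue the callback, do not run it. */
static void defer_call(struct deferred *d, void (*func)(struct deferred *))
{
	d->func = func;
	d->next = pending;
	pending = d;
}

/* Simulate the end of a grace period: run every queued callback. */
static int grace_period_end(void)
{
	int ran = 0;

	while (pending) {
		struct deferred *d = pending;

		pending = d->next;
		d->func(d);
		ran++;
	}
	return ran;
}

struct my_obj {
	int in_table;		/* 1 while readers can still look it up */
	int freed;		/* set by the deferred free callback */
	struct deferred rcu;
};

/* Deferred free callback: recover the object from its embedded member. */
static void deferred_my_obj_free(struct deferred *d)
{
	struct my_obj *obj =
		(struct my_obj *)((char *)d - offsetof(struct my_obj, rcu));

	obj->freed = 1;
}

/* Unlink now, free later: no blocking wait is needed on this path. */
static void my_obj_remove(struct my_obj *obj)
{
	obj->in_table = 0;
	defer_call(&obj->rcu, deferred_my_obj_free);
}
```

The point of the pattern is that the remove path returns immediately,
whereas a synchronous wait after every removal would block each caller for
a full grace period.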

Thanks,
John

-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [net-next PATCH v2 03/12] net: flow: implement flow cache for get routines
From: Thomas Graf @ 2015-01-15  3:24 UTC (permalink / raw)
  To: John Fastabend
  Cc: simon.horman, sfeldma, netdev, gerlitz.or, jhs, andy, davem
In-Reply-To: <54B7323D.9030506@gmail.com>

On 01/14/15 at 07:21pm, John Fastabend wrote:
> Anyway, it looks like there is no problem after all, and I don't
> recall what I was thinking. Thanks for bearing with me. I'll convert
> this code to avoid the overuse of rcu_sync.

It's likely that you looked at this before the per-bucket lock work
occurred. That's when the Netlink rhashtable code was converted to RCU,
with the rhashtable API changed accordingly.

^ permalink raw reply

* [PATCH net-next v2] bridge: fix setlink/dellink notifications
From: roopa @ 2015-01-15  4:02 UTC (permalink / raw)
  To: netdev
  Cc: shemminger, vyasevic, john.fastabend, tgraf, jhs, sfeldma, jiri,
	wkok, ronen.arad

From: Roopa Prabhu <roopa@cumulusnetworks.com>

Problems with bridge getlink/setlink notifications today:
        - bridge setlink generates two notifications to userspace
                - one from the bridge driver
                - one from rtnetlink.c (rtnl_bridge_notify)
        - dellink generates one notification from rtnetlink.c, which
          means bridge setlink and dellink notifications are not
          consistent

        - Looking at the code, it appears that if both
          BRIDGE_FLAGS_MASTER and BRIDGE_FLAGS_SELF are set, the size
          calculation in rtnl_bridge_notify can be wrong.
          Example: if you set both BRIDGE_FLAGS_MASTER and
          BRIDGE_FLAGS_SELF in a setlink request to rocker dev,
          rtnl_bridge_notify will allocate an skb for one set of bridge
          attributes, but both the bridge driver and rocker dev will
          try to add attributes, resulting in twice the number of
          attributes being added to the skb. (rocker dev calls
          ndo_dflt_bridge_getlink)
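
The sizing problem can be pictured with a small userspace sketch. It is an
illustration under assumptions, not the rtnetlink code: struct msg_buf and
msg_add_attrs() are hypothetical names, and the buffer is just a byte
counter sized for a single set of attributes.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical message buffer: cap is fixed when the buffer is allocated. */
struct msg_buf {
	size_t cap;	/* room reserved, here for ONE set of attributes */
	size_t len;	/* bytes used so far */
};

/*
 * Append one set of attributes of set_size bytes. Fails with -EMSGSIZE
 * when the set does not fit, leaving the buffer unchanged.
 */
static int msg_add_attrs(struct msg_buf *m, size_t set_size)
{
	if (m->len + set_size > m->cap)
		return -EMSGSIZE;
	m->len += set_size;
	return 0;
}
```

With a buffer sized for one set, the first writer (standing in for the
master) succeeds and the second writer (standing in for self) runs out of
room. That is the situation described above when both flags are set.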

There are multiple options:
1) Generate one notification including all attributes from master and self:
   But I don't think it will work, because both master and self may use
   the same attributes/policy. We cannot pack the same set of attributes
   from both master and self into a single notification (duplicate
   attributes).

2) Generate one notification from master and the other notification from
   self (this seems to be ideal):
     For master: the master driver will send the notification (bridge in
	this example).
     For self: the self driver will send the notification (rocker in the
	above example). It can use helpers from rtnetlink.c to do so, like
	the ndo_dflt_bridge_getlink api.

This patch implements 2) (leaving the 'rtnl_bridge_notify' around to be used
with 'self').

v1->v2 :
	- rtnl_bridge_notify is now called only for self,
	so, remove 'BRIDGE_FLAGS_SELF' check and cleanup a few things
	- rtnl_bridge_dellink used to always send a RTM_NEWLINK msg
	earlier. So, I have changed the notification from br_dellink to
	go as RTM_NEWLINK

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 net/bridge/br_netlink.c |    5 +++++
 net/core/rtnetlink.c    |   45 +++++++++++++++++++++------------------------
 2 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 9f5eb55..f996324 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -432,6 +432,11 @@ int br_dellink(struct net_device *dev, struct nlmsghdr *nlh)
 
 	err = br_afspec((struct net_bridge *)netdev_priv(dev), p,
 			afspec, RTM_DELLINK);
+	if (err == 0)
+		/* Send RTM_NEWLINK because userspace
+		 * expects RTM_NEWLINK for vlan dels
+		 */
+		br_ifinfo_notify(RTM_NEWLINK, p);
 
 	return err;
 }
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index d06107d..acefd60 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2863,32 +2863,24 @@ static inline size_t bridge_nlmsg_size(void)
 		+ nla_total_size(sizeof(u16));	/* IFLA_BRIDGE_MODE */
 }
 
-static int rtnl_bridge_notify(struct net_device *dev, u16 flags)
+static int rtnl_bridge_notify(struct net_device *dev)
 {
 	struct net *net = dev_net(dev);
-	struct net_device *br_dev = netdev_master_upper_dev_get(dev);
 	struct sk_buff *skb;
 	int err = -EOPNOTSUPP;
 
+	if (!dev->netdev_ops->ndo_bridge_getlink)
+		return 0;
+
 	skb = nlmsg_new(bridge_nlmsg_size(), GFP_ATOMIC);
 	if (!skb) {
 		err = -ENOMEM;
 		goto errout;
 	}
 
-	if ((!flags || (flags & BRIDGE_FLAGS_MASTER)) &&
-	    br_dev && br_dev->netdev_ops->ndo_bridge_getlink) {
-		err = br_dev->netdev_ops->ndo_bridge_getlink(skb, 0, 0, dev, 0);
-		if (err < 0)
-			goto errout;
-	}
-
-	if ((flags & BRIDGE_FLAGS_SELF) &&
-	    dev->netdev_ops->ndo_bridge_getlink) {
-		err = dev->netdev_ops->ndo_bridge_getlink(skb, 0, 0, dev, 0);
-		if (err < 0)
-			goto errout;
-	}
+	err = dev->netdev_ops->ndo_bridge_getlink(skb, 0, 0, dev, 0);
+	if (err < 0)
+		goto errout;
 
 	rtnl_notify(skb, net, 0, RTNLGRP_LINK, NULL, GFP_ATOMIC);
 	return 0;
@@ -2958,16 +2950,18 @@ static int rtnl_bridge_setlink(struct sk_buff *skb, struct nlmsghdr *nlh)
 			err = -EOPNOTSUPP;
 		else
 			err = dev->netdev_ops->ndo_bridge_setlink(dev, nlh);
-
-		if (!err)
+		if (!err) {
 			flags &= ~BRIDGE_FLAGS_SELF;
+
+			/* Generate event to notify upper layer of bridge
+			 * change
+			 */
+			err = rtnl_bridge_notify(dev);
+		}
 	}
 
 	if (have_flags)
 		memcpy(nla_data(attr), &flags, sizeof(flags));
-	/* Generate event to notify upper layer of bridge change */
-	if (!err)
-		err = rtnl_bridge_notify(dev, oflags);
 out:
 	return err;
 }
@@ -3032,15 +3026,18 @@ static int rtnl_bridge_dellink(struct sk_buff *skb, struct nlmsghdr *nlh)
 		else
 			err = dev->netdev_ops->ndo_bridge_dellink(dev, nlh);
 
-		if (!err)
+		if (!err) {
 			flags &= ~BRIDGE_FLAGS_SELF;
+
+			/* Generate event to notify upper layer of bridge
+			 * change
+			 */
+			err = rtnl_bridge_notify(dev);
+		}
 	}
 
 	if (have_flags)
 		memcpy(nla_data(attr), &flags, sizeof(flags));
-	/* Generate event to notify upper layer of bridge change */
-	if (!err)
-		err = rtnl_bridge_notify(dev, oflags);
 out:
 	return err;
 }
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH net-next v13 3/3] net: hisilicon: new hip04 ethernet driver
From: Joe Perches @ 2015-01-15  4:39 UTC (permalink / raw)
  To: Ding Tianhong
  Cc: arnd, robh+dt, davem, grant.likely, agraf, sergei.shtylyov,
	linux-arm-kernel, eric.dumazet, xuwei5, zhangfei.gao, netdev,
	devicetree, linux
In-Reply-To: <1421217254-12008-4-git-send-email-dingtianhong@huawei.com>

On Wed, 2015-01-14 at 14:34 +0800, Ding Tianhong wrote:
> Support Hisilicon hip04 ethernet driver, including 100M / 1000M controller.
> The controller has no tx done interrupt, reclaim xmitted buffer in the poll.

Mostly trivial comments:

> +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c

[]

> +#define GMAC_MAX_PKT_LEN		1516

[]

> +static int hip04_rx_poll(struct napi_struct *napi, int budget)
> +{
[]
> +	while (cnt && !last) {
[]
> +		desc = (struct rx_desc *)skb->data;
> +		len = be16_to_cpu(desc->pkt_len);
> +		err = be32_to_cpu(desc->pkt_err);
> +
> +		if (0 == len) {
> +			dev_kfree_skb_any(skb);
> +			last = true;
> +		} else if ((err & RX_PKT_ERR) || (len >= GMAC_MAX_PKT_LEN)) {

Is this ">=" correct?  Maybe it should be ">" ?

[]

> +static irqreturn_t hip04_mac_interrupt(int irq, void *dev_id)
> +{
> +	struct net_device *ndev = (struct net_device *)dev_id;

Unnecessary cast of void *

[]

> +static int hip04_set_coalesce(struct net_device *netdev,
> +			      struct ethtool_coalesce *ec)
> +{
> +	struct hip04_priv *priv = netdev_priv(netdev);
> +
> +	/* Check not supported parameters  */
> +	if ((ec->rx_max_coalesced_frames) || (ec->rx_coalesce_usecs_irq) ||
> +	    (ec->rx_max_coalesced_frames_irq) || (ec->tx_coalesce_usecs_irq) ||
> +	    (ec->use_adaptive_rx_coalesce) || (ec->use_adaptive_tx_coalesce) ||
> +	    (ec->pkt_rate_low) || (ec->rx_coalesce_usecs_low) ||
> +	    (ec->rx_max_coalesced_frames_low) || (ec->tx_coalesce_usecs_high) ||
> +	    (ec->tx_max_coalesced_frames_low) || (ec->pkt_rate_high) ||
> +	    (ec->tx_coalesce_usecs_low) || (ec->rx_coalesce_usecs_high) ||
> +	    (ec->rx_max_coalesced_frames_high) || (ec->rx_coalesce_usecs) ||
> +	    (ec->tx_max_coalesced_frames_irq) ||
> +	    (ec->stats_block_coalesce_usecs) ||
> +	    (ec->tx_max_coalesced_frames_high) || (ec->rate_sample_interval))
> +		return -EOPNOTSUPP;

Rather than a somewhat haphazard mix of these values, 
this might be simpler to read as something like:

	/* Check not supported parameters  */
	if (ec->pkt_rate_low ||
	    ec->pkt_rate_high ||

	    ec->use_adaptive_rx_coalesce ||
	    ec->rx_coalesce_usecs ||
	    ec->rx_coalesce_usecs_low ||
	    ec->rx_coalesce_usecs_high ||
	    ec->rx_coalesce_usecs_irq ||
	    ec->rx_max_coalesced_frames ||
	    ec->rx_max_coalesced_frames_low ||
	    ec->rx_max_coalesced_frames_high ||
	    ec->rx_max_coalesced_frames_irq ||
	    
	    ec->use_adaptive_tx_coalesce ||
	    ec->tx_coalesce_usecs_low ||
	    ec->tx_coalesce_usecs_high ||
	    ec->tx_max_coalesced_frames_low ||
	    ec->tx_max_coalesced_frames_high ||
	    ec->tx_max_coalesced_frames_irq ||

	    ec->stats_block_coalesce_usecs ||
	    ec->rate_sample_interval)
		return -EOPNOTSUPP;


> +static void hip04_free_ring(struct net_device *ndev, struct device *d)
> +{
> +	struct hip04_priv *priv = netdev_priv(ndev);
> +	int i;
> +
> +	for (i = 0; i < RX_DESC_NUM; i++)
> +		if (priv->rx_buf[i])
> +			put_page(virt_to_head_page(priv->rx_buf[i]));

It's generally nicer to use braces around
for loops with single ifs.

> +
> +	for (i = 0; i < TX_DESC_NUM; i++)
> +		if (priv->tx_skb[i])
> +			dev_kfree_skb_any(priv->tx_skb[i]);

^ permalink raw reply

* Re: [PATCH iproute2] ss: Filter inet dgram sockets with established state by default
From: Vadim Kochan @ 2015-01-15  4:49 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev@vger.kernel.org
In-Reply-To: <20150114184806.56a7b7e2@urahara>

So that means showing only sockets in the established state by default.
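
For reference, the default filter in the patch is a bitmask with one bit
per socket state, so "established by default" is a one-bit mask. A minimal
sketch of that check follows; the enum values and the state_matches()
helper are made up for this illustration and are not the definitions from
misc/ss.c.

```c
/* Illustrative state numbers only; not the real SS_* values. */
enum sock_state {
	ST_ESTABLISHED = 1,
	ST_CLOSE = 7,
};

/* Return nonzero if a socket in the given state passes the filter mask. */
static int state_matches(unsigned int mask, enum sock_state state)
{
	return (mask & (1u << state)) != 0;
}
```

With a mask of (1u << ST_ESTABLISHED), a connected UDP socket matches and a
socket in the close state does not, which is why bound but unconnected
sockets drop out of the default 'ss -u' output.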

On Thu, Jan 15, 2015 at 4:48 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Thu, 15 Jan 2015 00:43:47 +0200
> Vadim Kochan <vadim4j@gmail.com> wrote:
>
>> On Wed, Jan 14, 2015 at 02:41:20PM -0800, Stephen Hemminger wrote:
>> > On Wed, 14 Jan 2015 08:49:44 +0200
>> > Vadim Kochan <vadim4j@gmail.com> wrote:
>> >
>> > > On Tue, Jan 13, 2015 at 05:31:50PM -0800, Stephen Hemminger wrote:
>> > > > On Thu,  8 Jan 2015 19:32:22 +0200
>> > > > Vadim Kochan <vadim4j@gmail.com> wrote:
>> > > >
>> > > > > From: Vadim Kochan <vadim4j@gmail.com>
>> > > > >
>> > > > > As inet dgram sockets (udp, raw) can call connect(...)  - they
>> > > > > might be set in ESTABLISHED state. So keep the original behaviour of
>> > > > > 'ss' which filtered them by ESTABLISHED state by default. So:
>> > > > >
>> > > > >     $ ss -u
>> > > > >
>> > > > >     or
>> > > > >
>> > > > >     $ ss -w
>> > > > >
>> > > > > Will show only ESTABLISHED UDP sockets by default.
>> > > > >
>> > > > > Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
>> > > > > ---
>> > > > >  misc/ss.c | 4 ++--
>> > > > >  1 file changed, 2 insertions(+), 2 deletions(-)
>> > > > >
>> > > > > diff --git a/misc/ss.c b/misc/ss.c
>> > > > > index 08d210a..015d829 100644
>> > > > > --- a/misc/ss.c
>> > > > > +++ b/misc/ss.c
>> > > > > @@ -170,11 +170,11 @@ static const struct filter default_dbs[MAX_DB] = {
>> > > > >               .families = (1 << AF_INET) | (1 << AF_INET6),
>> > > > >       },
>> > > > >       [UDP_DB] = {
>> > > > > -             .states   = (1 << SS_CLOSE),
>> > > > > +             .states   = (1 << SS_ESTABLISHED),
>> > > > >               .families = (1 << AF_INET) | (1 << AF_INET6),
>> > > > >       },
>> > > > >       [RAW_DB] = {
>> > > > > -             .states   = (1 << SS_CLOSE),
>> > > > > +             .states   = (1 << SS_ESTABLISHED),
>> > > > >               .families = (1 << AF_INET) | (1 << AF_INET6),
>> > > > >       },
>> > > > >       [UNIX_DG_DB] = {
>> > > >
>> > > > This is a change likely to break somebody using 'ss -u' now and the bound
>> > > > sockets will disappear from the output.
>> > > >
>> > >
>> > > But thats was as original behaviour before I added table-driven code
>> > > (about few commits ago), so thats a rather fix (sorry I did not noticed
>> > > about it) to keep the previous behaviour for dgram sockets - show
>> > > established states by default.
>> > >
>> > > Regards,
>> >
>> > Ok, I will merge it and update the comments.
>> Even with this PATCH I am still confused what is preferred behaviour -
>> show established dgram sockets (as it was all the way) or closed + established by default.
>>
>> What do you think ?
>>
>> Thanks,
>
> Make it work like earliest releases (like 3 months ago).

^ permalink raw reply

* RE: [PATCH v2 net] be2net: Allow GRE to work concurrently while a VxLAN tunnel is configured
From: Sriharsha Basavapatna @ 2015-01-15  4:59 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Netdev List
In-Reply-To: <CA+mtBx9Kd8JM53RJFJ_R-ZW_FN=UhDQgBNgLNPqadGA9o3ZNiA@mail.gmail.com>



-----Original Message-----
From: Tom Herbert [mailto:therbert@google.com] 
Sent: Thursday, January 15, 2015 8:00 AM
To: Sriharsha Basavapatna
Cc: Linux Netdev List
Subject: Re: [PATCH v2 net] be2net: Allow GRE to work concurrently while a VxLAN tunnel is configured

On Thu, Jan 15, 2015 at 2:38 AM, Sriharsha Basavapatna <sriharsha.basavapatna@emulex.com> wrote:
> Other tunnels like GRE break while VxLAN offloads are enabled in 
> Skyhawk-R. To avoid this, we should restrict offload features on a 
> per-packet basis in such conditions.
>
> Signed-off-by: Sriharsha Basavapatna 
> <sriharsha.basavapatna@emulex.com>
> ---
> v2 changes: fixed minor nits pointed out by Sergei Shtylyov
> ---
>  drivers/net/ethernet/emulex/benet/be_main.c |   41 +++++++++++++++++++++++++--
>  1 file changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/emulex/benet/be_main.c 
> b/drivers/net/ethernet/emulex/benet/be_main.c
> index 41a0a54..d48806b 100644
> --- a/drivers/net/ethernet/emulex/benet/be_main.c
> +++ b/drivers/net/ethernet/emulex/benet/be_main.c
> @@ -4383,8 +4383,9 @@ static int be_ndo_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
>   * distinguish various types of transports (VxLAN, GRE, NVGRE ..). So, offload
>   * is expected to work across all types of IP tunnels once exported. Skyhawk
>   * supports offloads for either VxLAN or NVGRE, exclusively. So we 
> export VxLAN
> - * offloads in hw_enc_features only when a VxLAN port is added. Note 
> this only
> - * ensures that other tunnels work fine while VxLAN offloads are not enabled.
> + * offloads in hw_enc_features only when a VxLAN port is added. If 
> + other (non
> + * VxLAN) tunnels are configured while VxLAN offloads are enabled, 
> + offloads for
> + * those other tunnels are unexported on the fly through ndo_features_check().
>   *
>   * Skyhawk supports VxLAN offloads only for one UDP dport. So, if the stack
>   * adds more than one port, disable offloads and don't re-enable them 
> again @@ -4463,7 +4464,41 @@ static netdev_features_t be_features_check(struct sk_buff *skb,
>                                            struct net_device *dev,
>                                            netdev_features_t features)  
> {
> -       return vxlan_features_check(skb, features);
> +       struct be_adapter *adapter = netdev_priv(dev);
> +       u8 l4_hdr = 0;
> +
> +       /* The code below restricts offload features for some tunneled packets.
> +        * Offload features for normal (non tunnel) packets are unchanged.
> +        */
> +       if (!skb->encapsulation ||
> +           !(adapter->flags & BE_FLAGS_VXLAN_OFFLOADS))
> +               return features;
> +
> +       /* It's an encapsulated packet and VxLAN offloads are enabled. We
> +        * should disable tunnel offload features if it's not a VxLAN packet,
> +        * as tunnel offloads have been enabled only for VxLAN. This is done to
> +        * allow other tunneled traffic like GRE work fine while VxLAN
> +        * offloads are configured in Skyhawk-R.
> +        */
> +       switch (vlan_get_protocol(skb)) {
> +       case htons(ETH_P_IP):
> +               l4_hdr = ip_hdr(skb)->protocol;
> +               break;
> +       case htons(ETH_P_IPV6):
> +               l4_hdr = ipv6_hdr(skb)->nexthdr;
> +               break;
> +       default:
> +               return features;
> +       }
> +
> +       if (l4_hdr != IPPROTO_UDP ||

I don't understand why this is needed. The only GSO type with encapsulation allowed by device features is SKB_GSO_UDP_TUNNEL. This should not be GRE for instance. Do you see cases where protocol is not UDP at this point?
[Harsha] It's needed for the GRE checksum case; without this, GRE pkts are sent down without checksum by the stack, but HW
can only offload checksum for VxLAN pkts.

> +           skb->inner_protocol_type != ENCAP_TYPE_ETHER ||
> +           skb->inner_protocol != htons(ETH_P_TEB) ||
> +           skb_inner_mac_header(skb) - skb_transport_header(skb) !=
> +           sizeof(struct udphdr) + sizeof(struct vxlanhdr))
> +               return features & ~(NETIF_F_ALL_CSUM | 
> + NETIF_F_GSO_MASK);
> +
> +       return features;
>  }
>  #endif
>
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in 
> the body of a message to majordomo@vger.kernel.org More majordomo info 
> at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2 net] be2net: Allow GRE to work concurrently while a VxLAN tunnel is configured
From: Tom Herbert @ 2015-01-15  5:33 UTC (permalink / raw)
  To: Sriharsha Basavapatna; +Cc: Linux Netdev List
In-Reply-To: <31318D46B5DF3F4AB71CC057601E9FEB3A7F5178@CMEXMB1.ad.emulex.com>

> I don't understand why this is needed. The only GSO type with encapsulation allowed by device features is SKB_GSO_UDP_TUNNEL. This should not be GRE for instance. Do you see cases where protocol is not UDP at this point?
> [Harsha] It's needed for the GRE checksum case; without this, GRE pkts are sent down without checksum by the stack, but HW
> can only offload checksum for VxLAN pkts.
>
Okay, I see that this is a problem with checksum not GSO. Please ask
your hardware guys to provide NETIF_F_HW_CSUM to avoid any more of
this unpleasantness in the future :-)

>> +           skb->inner_protocol_type != ENCAP_TYPE_ETHER ||
>> +           skb->inner_protocol != htons(ETH_P_TEB) ||
>> +           skb_inner_mac_header(skb) - skb_transport_header(skb) !=
>> +           sizeof(struct udphdr) + sizeof(struct vxlanhdr))
>> +               return features & ~(NETIF_F_ALL_CSUM |
>> + NETIF_F_GSO_MASK);
>> +
>> +       return features;
>>  }
>>  #endif
>>
>> --
>> 1.7.9.5

^ permalink raw reply

* RE: [PATCH v2 net] be2net: Allow GRE to work concurrently while a VxLAN tunnel is configured
From: Sriharsha Basavapatna @ 2015-01-15  5:45 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Netdev List
In-Reply-To: <CA+mtBx-tLRi21Sveh-TOmMC0T2cRNqcaEOx2thnqhU-ZLyaN2A@mail.gmail.com>



-----Original Message-----
From: Tom Herbert [mailto:therbert@google.com] 
Sent: Thursday, January 15, 2015 11:04 AM
To: Sriharsha Basavapatna
Cc: Linux Netdev List
Subject: Re: [PATCH v2 net] be2net: Allow GRE to work concurrently while a VxLAN tunnel is configured

> I don't understand why this is needed. The only GSO type with encapsulation allowed by device features is SKB_GSO_UDP_TUNNEL. This should not be GRE for instance. Do you see cases where protocol is not UDP at this point?
> [Harsha] It's needed for the GRE checksum case; without this, GRE pkts
> are sent down without checksum by the stack, but HW can only offload checksum for VxLAN pkts.
>
Okay, I see that this is a problem with checksum not GSO. Please ask your hardware guys to provide NETIF_F_HW_CSUM to avoid any more of this unpleasantness in the future :-)
[Harsha] Ok, will do :-)

>> +           skb->inner_protocol_type != ENCAP_TYPE_ETHER ||
>> +           skb->inner_protocol != htons(ETH_P_TEB) ||
>> +           skb_inner_mac_header(skb) - skb_transport_header(skb) !=
>> +           sizeof(struct udphdr) + sizeof(struct vxlanhdr))
>> +               return features & ~(NETIF_F_ALL_CSUM | 
>> + NETIF_F_GSO_MASK);
>> +
>> +       return features;
>>  }
>>  #endif
>>
>> --
>> 1.7.9.5

^ permalink raw reply

* PROBLEM: [3.4] neigh_destroy() crashes on unplug netdev.
From: Nakashima Akihiro @ 2015-01-15  5:57 UTC (permalink / raw)
  To: davem@davemloft.net, netdev@vger.kernel.org
  Cc: Ueda Motoki, Otsu Takahiro, Tomono Mitsunori,
	linux-kernel@vger.kernel.org

Dear David and networking developers:

I got a kernel panic on the 3.4.105 kernel.
Please see the report below.
 
[1.] One line summary of the problem: [3.4] neigh_destroy() crashes on unplug netdev.
[2.] Full description of the problem/report:
I got a kernel panic: neigh_destroy() crashes when I unplug my wlan dongle. Please see the Oops message below for details.
I found that this problem occurs on the 3.4 kernel branch, but kernel 3.6 and later do not have it.
It does not occur with every netdev device, but I do not think it is a driver-specific problem.
I also found that the 20 patches you released on 05-Jul-2012 appear to solve it.
The patches are listed below:
 01. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a263b3093641fb1ec377582c90986a7fd0625184
 02. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3c521f2ba9646c5543963cbc2b9c9d3f02a82594
 03. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=60d354ebebd9d0f760cb6c3b9f53a7ade0f8cd0e
 04. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5110effee8fde2edfacac9cd12a9960ab2dc39ea
 05. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f894cbf847c9bea1955095bf37aca6c050553167
 06. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=dbedbe6d56e8944f220e34deb9ebdf4bec2a2afd
 07. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=178709bbfe9d4fe432c272ed65a34b8582703c23
 08. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=24db1ba866eebf5b516df80ea2212d2479bfb502
 09. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0b399d46b317a6d0a73ad523e014ecfa4d449769
 10. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c473737765c0f72ceb5b245ada7ead798d88b4f6
 11. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f9d751667fd60788fe3641738938e0968e99cece
 12. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=13a43d94ab026c423dc8902170ef27c2bd36aa87
 13. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=fccd7d5c77ff61d5283e7ce8242791d5f00dcc17
 14. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=1d248b1cf4e09dbec8cef5f7fbeda5874248bd09
 15. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=534cb283efef9fdbd9f70f4615054d26aa444dd6
 16. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=97cac0821af4474ec4ba3a9e7a36b98ed9b6db88
 17. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f187bc6efb7250afee0e2009b6106370319b0c8b
 18. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d1e31fb02b31ba88d5650d97c35eb58f52bfe0e1
 19. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=36bdbcae2fa2a6dfa99344d4190fcea0aa7b7c25
 20. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a2de86f63cfc92f7aaf11e7b9d9f2150946a1622
I applied these patches to the 3.4.105 kernel and confirmed that the problem is solved on my box.
Could you confirm this and backport them to the 3.4 branch?
[3.] Keywords (i.e., modules, networking, kernel): networking
[4.] Kernel version (from /proc/version):
Linux version 3.4.105 (root@JP1201393) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #2 SMP Tue Jan 13 13:39:40 JST 2015
[5.] Output of Oops.. message (if applicable) with symbolic information 
 resolved (see Documentation/oops-tracing.txt)
BUG: unable to handle kernel paging request at f87be0ac
IP: [<c1475c5f>] neigh_destroy+0x8f/0x110
*pdpt = 00000000018c0001 *pde = 0000000032f7c067 *pte = 0000000000000000
Oops: 0000 [#1] SMP
Modules linked in: mt7603u_sta(0) nls_iso8859_1 bnep rfcomm bluetooth snd_hda_codec_realtek snd_hda_intel snd_hda_codec i915 snd_hwdep snd_pcm drm_kms_helper binfmt_misc snd_seq_midi drm snd_rawmidi snd_seq_midi_event snd_seq aesni_intel snd_timer snd_seq_device snd cryptd aes_i586 i2c_algo_bit microcode soundcore psmouse video snd_page_alloc serio_raw mac_hid ppdev parport_pc lp parport usbhid hid usb_storage ahci firewire_ohci libahci firewire_core crc_itu_t e1000e
Pid: 19, comm: ksoftirqd/3 Tainted: G O 3.4.105 #1 EPSON DIRECT CORP. MR690D0F61/MR6900
EIP: 0060:[<c1475c5f>] EFLAGS: 00010206 CPU: 3
EIP is at neigh_destroy+0x8f/0x110
EAX: f87be000 EBX: f1fd461c ECX: 80150006 EDX: 00000100
ESI: f1fd4600 EDI: ec5ae000 EBP: f2d71ef4 ESP: f2d71edc
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: f87be0ac CR3: 3184b000 CR4: 000407f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process ksoftirqd/3 (pid: 19, ti=f2d70000 task=f2d68000 task.ti=f2d70000)
Stack:
 c1472b23 c1472b23 f1fd4614 ee27b0c0 00000000 00000005 f2d71f0c c1472b95
 c1552da3 0000000a ee26a9c0 00000005 f2d71f14 c149292c f2d71f44 c10a80f6
 f33bc0e0 0000000a f1b1cc8c f33bc0f8 c17b9ec0 f2d68000 00000000 00000003
Call Trace:
 [<c1472b23>] ? dst_destroy+0x43/0xe0
 [<c1472b23>] ? dst_destroy+0x43/0xe0
 [<c1472b95>] dst_destroy+0xb5/0xe0
 [<c1552da3>] ? _raw_spin_unlock_bh+0x13/0x20
 [<c149292c>] dst_rcu_free+0x1c/0x30
 [<c10a80f6>] __rcu_process_callbacks+0x186/0x310
 [<c10a82bc>] rcu_process_callbacks+0x3c/0xc0
 [<c1038041>] __do_softirq+0x81/0x190
 [<c15533cd>] ? apic_timer_interrupt+0x31/0x38
 [<c10381f8>] run_ksoftirqd+0xa8/0x130
 [<c1038150>] ? __do_softirq+0x190/0x190
 [<c104ff82>] kthread+0x72/0x80
 [<c104ff10>] ? flush_kthread_work+0xc0/0xc0
 [<c1559ebe>] kernel_thread_helper+0x6/0x10
Code: 40 04 00 00 00 00 89 51 04 89 0a e8 bc a2 fe ff 8b 03 39 c3 75 d6 8b 45 f0 e8 fe d0 0d 00 c7 46 2c 00 00 00 00 8b 87 34 01 00 00 <8b> 90 ac 00 00 00 85 d2 74 04 89 f0 ff d2 8b 87 98 02 00 00 64
EIP: [<c1475c5f>] neigh_destroy+0x8f/0x110 SS:ESP 0068:f2d71edc
CR2: 00000000f87be0ac
Kernel panic - not syncing: Fatal exception in interrupt
panic occurred, switching back to text console

The addresses in the call trace correspond to these source lines:
<c1475c5f>: net/core/neighbour.c:729
<c1472b23>: net/core/dst.c:250
<c1472b95>: include/net/neighbour.h:294
<c1552da3>: kernel/spinlock.c:194
<c149292c>: include/net/dst.h:385
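
A mapping like the one above can be reproduced roughly as follows. This is only a minimal sketch, assuming an unstripped vmlinux with debug info from exactly the same build as the crashing kernel. The helper names oops_addrs and resolve_addrs and the file names oops.txt and ./vmlinux are hypothetical. Note that call trace entries are return addresses, so the reported line may belong to the statement following the call.

```shell
#!/bin/sh
# Sketch: extract the bracketed code addresses from an oops and resolve
# them to source lines with addr2line (binutils).

# Read oops text on stdin and print each [<address>] once, in order of
# first appearance. Lines without a bracketed address are skipped.
oops_addrs() {
    sed -n 's/.*\[<\([0-9a-f]\{8,16\}\)>\].*/\1/p' | awk '!seen[$0]++'
}

# Read addresses on stdin and resolve each one against the image given
# as the first argument (assumed: vmlinux built with debug info).
resolve_addrs() {
    vmlinux=$1
    while read -r addr; do
        addr2line -f -e "$vmlinux" "0x$addr"
    done
}

# Usage (hypothetical file names):
#   oops_addrs < oops.txt | resolve_addrs ./vmlinux
```

Deduplicating first keeps the output short, since entries marked with "?" in the trace often repeat an address that also appears as a reliable frame.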
[6.] A small shell script or example program which triggers the
 problem (if possible)
Method to reproduce the problem:
 1. Run the shell script below:
#!/bin/sh
# Bring wlan0 up with a fixed address in a tight loop.
while true
do
 ifconfig wlan0 192.168.1.2 up
done
 2. Repeatedly unplug and replug the netdev dongle.
[7.] Environment
[7.1.] Software (add the output of the ver_linux script here)
--- ver_linux ---
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.
 
Linux JP1201393 3.4.105 #2 SMP Tue Jan 13 13:39:40 JST 2015 i686 i686 i386 GNU/Linux
 
Gnu C 4.6
Gnu make 3.81
binutils 2.22
util-linux 2.20.1
mount support
module-init-tools 3.16
e2fsprogs 1.42
pcmciautils 018
PPP 2.4.5
Linux C Library 2.15
Dynamic linker (ldd) 2.15
Procps 3.2.8
Net-tools 1.60
Kbd 1.15.2
Sh-utils 8.13
wireless-tools 30
Modules Loaded mt7603u_sta nls_iso8859_1 rfcomm bnep bluetooth snd_hda_codec_realtek snd_hda_intel snd_hda_codec i915 snd_hwdep snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event drm_kms_helper aesni_intel snd_seq drm cryptd psmouse snd_timer snd_seq_device aes_i586 microcode binfmt_misc serio_raw snd soundcore snd_page_alloc i2c_algo_bit mac_hid video ppdev parport_pc lp parport usbhid hid usb_storage ahci libahci e1000e firewire_ohci firewire_core crc_itu_t
[7.2.] Processor information (from /proc/cpuinfo):
I think this is unrelated to the problem.
If it is needed, I will gather it.
[7.3.] Module information (from /proc/modules):
--- /proc/modules ---
mt7603u_sta 1114536 1 - Live 0x00000000 (O)
nls_iso8859_1 12618 1 - Live 0x00000000
rfcomm 57545 0 - Live 0x00000000
bnep 18868 2 - Live 0x00000000
bluetooth 263846 10 rfcomm,bnep, Live 0x00000000
snd_hda_codec_realtek 63163 1 - Live 0x00000000
snd_hda_intel 31907 3 - Live 0x00000000
snd_hda_codec 102579 2 snd_hda_codec_realtek,snd_hda_intel, Live 0x00000000
i915 427399 2 - Live 0x00000000
snd_hwdep 13277 1 snd_hda_codec, Live 0x00000000
snd_pcm 84645 2 snd_hda_intel,snd_hda_codec, Live 0x00000000
snd_seq_midi 13133 0 - Live 0x00000000
snd_rawmidi 25115 1 snd_seq_midi, Live 0x00000000
snd_seq_midi_event 14476 1 snd_seq_midi, Live 0x00000000
drm_kms_helper 45322 1 i915, Live 0x00000000
aesni_intel 18135 0 - Live 0x00000000
snd_seq 55404 2 snd_seq_midi,snd_seq_midi_event, Live 0x00000000
drm 215637 3 i915,drm_kms_helper, Live 0x00000000
cryptd 15580 1 aesni_intel, Live 0x00000000
psmouse 81253 0 - Live 0x00000000
snd_timer 24503 2 snd_pcm,snd_seq, Live 0x00000000
snd_seq_device 14138 3 snd_seq_midi,snd_rawmidi,snd_seq, Live 0x00000000
aes_i586 16996 1 aesni_intel, Live 0x00000000
microcode 18819 0 - Live 0x00000000
binfmt_misc 17208 1 - Live 0x00000000
serio_raw 13156 0 - Live 0x00000000
snd 60917 16 snd_hda_codec_realtek,snd_hda_intel,snd_hda_codec,snd_hwdep,snd_pcm,snd_seq_midi,snd_rawmidi,snd_seq,snd_timer,snd_seq_device, Live 0x00000000
soundcore 12601 1 snd, Live 0x00000000
snd_page_alloc 14037 2 snd_hda_intel,snd_pcm, Live 0x00000000
i2c_algo_bit 13198 1 i915, Live 0x00000000
mac_hid 13038 0 - Live 0x00000000
video 18688 1 i915, Live 0x00000000
ppdev 17364 0 - Live 0x00000000
parport_pc 27505 1 - Live 0x00000000
lp 13300 0 - Live 0x00000000
parport 40763 3 ppdev,parport_pc,lp, Live 0x00000000
usbhid 47307 0 - Live 0x00000000
hid 81906 1 usbhid, Live 0x00000000
usb_storage 48081 1 - Live 0x00000000
ahci 25497 2 - Live 0x00000000
libahci 25871 1 ahci, Live 0x00000000
e1000e 175750 0 - Live 0x00000000
firewire_ohci 35480 0 - Live 0x00000000
firewire_core 56954 1 firewire_ohci, Live 0x00000000
crc_itu_t 12628 1 firewire_core, Live 0x00000000
[7.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)
--- /proc/ioports ---
0000-03af : PCI Bus 0000:00
 0000-001f : dma1
 0020-0021 : pic1
 0040-0043 : timer0
 0050-0053 : timer1
 0060-0060 : keyboard
 0064-0064 : keyboard
 0070-0071 : rtc0
 0080-008f : dma page reg
 00a0-00a1 : pic2
 00c0-00df : dma2
 00f0-00ff : fpu
 0200-0201 : pnp 00:02
 0378-037a : parport0
03b0-03df : PCI Bus 0000:00
03e0-0cf7 : PCI Bus 0000:00
 03f8-03ff : serial
 0400-0453 : pnp 00:0a
 0400-0403 : ACPI PM1a_EVT_BLK
 0404-0405 : ACPI PM1a_CNT_BLK
 0408-040b : ACPI PM_TMR
 0410-0415 : ACPI CPU throttle
 0420-042f : ACPI GPE0_BLK
 0450-0450 : ACPI PM2_CNT_BLK
 0454-0457 : pnp 00:0b
 0458-047f : pnp 00:0a
 04d0-04d1 : pnp 00:08
 0500-057f : pnp 00:0a
0cf8-0cff : PCI conf1
0d00-ffff : PCI Bus 0000:00
 1180-119f : pnp 00:0a
 d000-dfff : PCI Bus 0000:04
 d000-d01f : 0000:04:00.0
 e000-efff : PCI Bus 0000:03
 e000-e0ff : 0000:03:00.0
 f000-f03f : 0000:00:02.0
 f040-f05f : 0000:00:1f.3
 f060-f07f : 0000:00:1f.2
 f060-f07f : ahci
 f080-f09f : 0000:00:19.0
 f0a0-f0a3 : 0000:00:1f.2
 f0a0-f0a3 : ahci
 f0b0-f0b7 : 0000:00:1f.2
 f0b0-f0b7 : ahci
 f0c0-f0c3 : 0000:00:1f.2
 f0c0-f0c3 : ahci
 f0d0-f0d7 : 0000:00:1f.2
 f0d0-f0d7 : ahci
--- /proc/iomem ---
00000000-0000ffff : reserved
00010000-0009d7ff : System RAM
0009d800-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
 000a0000-000bffff : Video RAM area
000c0000-000dffff : PCI Bus 0000:00
 000c0000-000cd7ff : Video ROM
000e0000-000fffff : reserved
 000f0000-000fffff : System ROM
00100000-1fffffff : System RAM
 01000000-0155addd : Kernel code
 0155adde-01813f3f : Kernel data
 018c0000-0194bfff : Kernel bss
20000000-201fffff : reserved
20200000-3fffffff : System RAM
40000000-401fffff : reserved
40200000-bad8bfff : System RAM
bad8c000-badd8fff : ACPI Non-volatile Storage
badd9000-bade0fff : ACPI Tables
bade1000-badf5fff : reserved
badf6000-badf7fff : System RAM
badf8000-bae04fff : ACPI Non-volatile Storage
bae05000-bae2bfff : reserved
bae2c000-bae6efff : ACPI Non-volatile Storage
bae6f000-baffffff : System RAM
bb000000-bb7fffff : RAM buffer
bb800000-bf9fffff : reserved
bfa00000-ffffffff : PCI Bus 0000:00
 d0000000-dfffffff : 0000:00:02.0
 e0000000-efffffff : PCI MMCONFIG 0000 [bus 00-ff]
 e0000000-efffffff : pnp 00:01
 fe000000-fe3fffff : 0000:00:02.0
 fe400000-fe4fffff : PCI Bus 0000:04
 fe400000-fe47ffff : 0000:04:00.0
 fe400000-fe47ffff : e1000e
 fe480000-fe4bffff : 0000:04:00.0
 fe4c0000-fe4dffff : 0000:04:00.0
 fe4c0000-fe4dffff : e1000e
 fe4e0000-fe4e3fff : 0000:04:00.0
 fe4e0000-fe4e3fff : e1000e
 fe500000-fe5fffff : PCI Bus 0000:03
 fe500000-fe5007ff : 0000:03:00.0
 fe500000-fe5007ff : firewire_ohci
 fe600000-fe61ffff : 0000:00:19.0
 fe600000-fe61ffff : e1000e
 fe620000-fe623fff : 0000:00:1b.0
 fe620000-fe623fff : ICH HD audio
 fe624000-fe6240ff : 0000:00:1f.3
 fe625000-fe6257ff : 0000:00:1f.2
 fe625000-fe6257ff : ahci
 fe626000-fe6263ff : 0000:00:1d.0
 fe626000-fe6263ff : ehci_hcd
 fe627000-fe6273ff : 0000:00:1a.0
 fe627000-fe6273ff : ehci_hcd
 fe628000-fe628fff : 0000:00:19.0
 fe628000-fe628fff : e1000e
 fe629000-fe62900f : 0000:00:16.0
 fec00000-fec003ff : IOAPIC 0
 fed00000-fed003ff : HPET 0
 fed08000-fed08fff : pnp 00:0a
 fed10000-fed19fff : pnp 00:01
 fed1c000-fed1ffff : reserved
 fed1c000-fed1ffff : pnp 00:0a
 fed20000-fed3ffff : pnp 00:01
 fed90000-fed93fff : pnp 00:01
 fee00000-fee0ffff : pnp 00:01
 fee00000-fee00fff : Local APIC
 ff000000-ffffffff : reserved
 ff000000-ffffffff : pnp 00:0a
100000000-23f7fffff : System RAM
23f800000-23fffffff : RAM buffer
[7.5.] PCI information ('lspci -vvv' as root)
I think this is unrelated to the problem.
If it is needed, I will gather it.
[7.6.] SCSI information (from /proc/scsi/scsi):
I think this is unrelated to the issue.
If it is needed, I will gather it.
[7.7.] Other information that might be relevant to the problem
 (please look in /proc and include all information that you
 think to be relevant):
[X.] Other notes, patches, fixes, workarounds:
The patches I described in [2.] appear to be effective.

Best regards,
Akihiro Nakashima

----------------------------------
NAKASHIMA Akihiro
Nakashima.Akihiro@exc.epson.co.jp
----------------------------------

^ permalink raw reply

* [MERGE] net --> net-next
From: David Miller @ 2015-01-15  6:06 UTC (permalink / raw)
  To: netdev


With Linus taking in my bug fixes from 'net' I subsequently merged
his tree into 'net' and the merged 'net' into 'net-next'.

Just FYI...

^ permalink raw reply

* Re: linux-next: manual merge of the net-next tree with the net tree
From: David Miller @ 2015-01-15  6:06 UTC (permalink / raw)
  To: sfr; +Cc: netdev, linux-next, linux-kernel, david.vrabel
In-Reply-To: <20150115134735.1e4612c6@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Thu, 15 Jan 2015 13:47:35 +1100

> Today's linux-next merge of the net-next tree got a conflict in
> drivers/net/xen-netfront.c between commit 900e183301b5 ("xen-netfront:
> use different locks for Rx and Tx stats") from the net tree and commit
> a55e8bb8fb89 ("xen-netfront: refactor making Tx requests") from the
> net-next tree.
> 
> I fixed it up (see below) and can carry the fix as necessary (no action
> is required).

Thanks a lot Stephen, I just resolved this.

^ permalink raw reply

