Netdev List


* Re: [PATCH iproute2] ss: Filter inet dgram sockets with established state by default
From: Vadim Kochan @ 2015-01-14 22:43 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Vadim Kochan, netdev
In-Reply-To: <20150114144120.0e15ac1f@urahara>

On Wed, Jan 14, 2015 at 02:41:20PM -0800, Stephen Hemminger wrote:
> On Wed, 14 Jan 2015 08:49:44 +0200
> Vadim Kochan <vadim4j@gmail.com> wrote:
> 
> > On Tue, Jan 13, 2015 at 05:31:50PM -0800, Stephen Hemminger wrote:
> > > On Thu,  8 Jan 2015 19:32:22 +0200
> > > Vadim Kochan <vadim4j@gmail.com> wrote:
> > > 
> > > > From: Vadim Kochan <vadim4j@gmail.com>
> > > > 
> > > > As inet dgram sockets (udp, raw) can call connect(...)  - they
> > > > might be set in ESTABLISHED state. So keep the original behaviour of
> > > > 'ss' which filtered them by ESTABLISHED state by default. So:
> > > > 
> > > >     $ ss -u
> > > > 
> > > >     or
> > > > 
> > > >     $ ss -w
> > > > 
> > > > Will show only ESTABLISHED UDP sockets by default.
> > > > 
> > > > Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
> > > > ---
> > > >  misc/ss.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/misc/ss.c b/misc/ss.c
> > > > index 08d210a..015d829 100644
> > > > --- a/misc/ss.c
> > > > +++ b/misc/ss.c
> > > > @@ -170,11 +170,11 @@ static const struct filter default_dbs[MAX_DB] = {
> > > >  		.families = (1 << AF_INET) | (1 << AF_INET6),
> > > >  	},
> > > >  	[UDP_DB] = {
> > > > -		.states   = (1 << SS_CLOSE),
> > > > +		.states   = (1 << SS_ESTABLISHED),
> > > >  		.families = (1 << AF_INET) | (1 << AF_INET6),
> > > >  	},
> > > >  	[RAW_DB] = {
> > > > -		.states   = (1 << SS_CLOSE),
> > > > +		.states   = (1 << SS_ESTABLISHED),
> > > >  		.families = (1 << AF_INET) | (1 << AF_INET6),
> > > >  	},
> > > >  	[UNIX_DG_DB] = {
> > > 
> > > This is a change likely to break somebody using 'ss -u' now and the bound
> > > sockets will disappear from the output.
> > > 
> > 
> > But that was the original behaviour before I added the table-driven code
> > (a few commits ago), so this is rather a fix (sorry I did not notice
> > it) to keep the previous behaviour for dgram sockets: show
> > established states by default.
> > 
> > Regards,
> 
> Ok, I will merge it and update the comments.
Even with this patch I am still unsure which behaviour is preferred:
showing only established dgram sockets by default (as it always was), or
showing closed + established.

What do you think?

Thanks,
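
For reference, the two defaults being weighed can be sketched in plain
userspace C. This is only an illustration of the bitmask idea behind the
.states fields in the patch: the enum values, struct name and helper below
are hypothetical and are not the definitions used in misc/ss.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative state numbering; not the values used by ss. */
enum sock_state {
	ST_UNKNOWN,
	ST_ESTABLISHED,
	ST_CLOSE,
	ST_MAX
};

/* One bit per state, in the style of the .states masks in the patch. */
#define STATE_BIT(s)	(1u << (s))

struct state_filter {
	unsigned int states;	/* bitmask of states to display */
};

/* Default restored by the patch: connected datagram sockets only. */
static const struct state_filter dgram_established_only = {
	.states = STATE_BIT(ST_ESTABLISHED),
};

/* Alternative raised in the thread: also show unconnected sockets. */
static const struct state_filter dgram_closed_and_established = {
	.states = STATE_BIT(ST_ESTABLISHED) | STATE_BIT(ST_CLOSE),
};

/* A socket is listed only if the bit for its state is set in the mask. */
static bool filter_matches(const struct state_filter *f, enum sock_state s)
{
	assert(s < ST_MAX);
	return (f->states & STATE_BIT(s)) != 0;
}
```

With the first mask a socket in the closed state is filtered out, which is
the "bound sockets will disappear" effect mentioned earlier in the thread;
with the second mask both kinds are listed.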


* Re: [PATCH] net: rocker: Add basic netdev counters
From: Florian Fainelli @ 2015-01-14 22:57 UTC (permalink / raw)
  To: David Ahern, netdev; +Cc: Scott Feldman, Jiri Pirko
In-Reply-To: <1421275161-99434-1-git-send-email-dsahern@gmail.com>

On 14/01/15 14:39, David Ahern wrote:
> Add packet and byte counters for RX and TX paths.
> 
> $ ifconfig eth1
> eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         inet6 fe80::5054:ff:fe12:3501  prefixlen 64  scopeid 0x20<link>
>         ether 52:54:00:12:35:01  txqueuelen 1000  (Ethernet)
>         RX packets 63  bytes 15813 (15.4 KiB)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 79  bytes 17991 (17.5 KiB)
>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>
> Cc: Scott Feldman <sfeldma@gmail.com>
> Cc: Jiri Pirko <jiri@resnulli.us>
> ---
>  drivers/net/ethernet/rocker/rocker.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
> index 2f398fa4b9e6..9743279d9121 100644
> --- a/drivers/net/ethernet/rocker/rocker.c
> +++ b/drivers/net/ethernet/rocker/rocker.c
> @@ -3557,6 +3557,9 @@ static netdev_tx_t rocker_port_xmit(struct sk_buff *skb, struct net_device *dev)
>  	if (!desc_info)
>  		netif_stop_queue(dev);
>  
> +	dev->stats.tx_packets++;
> +	dev->stats.tx_bytes += skb->len;

Potential use after free: the skb pointer is not guaranteed to be valid
anymore here.

BTW, increasing statistics here is valid because this is a driver for a
virtual piece of HW, which does not have TX reclaim/completion logic,
but if it did, statistics update should occur there, not in the
ndo_start_xmit() function.
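
A minimal userspace sketch of the ordering concern, assuming the usual
remedy of copying the length before the buffer is handed off. struct pkt,
struct tx_stats and hand_off() are hypothetical stand-ins for the skb, the
netdev counters and whatever step gives up ownership; this is not the
rocker driver code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff. */
struct pkt {
	unsigned int len;
};

/* Hypothetical stand-in for the dev->stats counters. */
struct tx_stats {
	unsigned long packets;
	unsigned long bytes;
};

/* Models a hand-off that consumes the packet; the caller must not
 * dereference it afterwards.
 */
static void hand_off(struct pkt *p)
{
	free(p);
}

static void xmit_and_count(struct pkt *p, struct tx_stats *st)
{
	unsigned int len;

	assert(p != NULL);
	len = p->len;		/* read while we still own the packet */

	hand_off(p);		/* p may be freed from here on */

	st->packets++;
	st->bytes += len;	/* uses the saved copy, never p->len */
}
```

The same pattern would apply on the receive side: take the length before
passing the buffer up the stack, then update the counters from the copy.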

> +
>  	return NETDEV_TX_OK;
>  
>  unmap_frags:
> @@ -3565,6 +3568,8 @@ static netdev_tx_t rocker_port_xmit(struct sk_buff *skb, struct net_device *dev)
>  	rocker_tlv_nest_cancel(desc_info, frags);
>  out:
>  	dev_kfree_skb(skb);
> +	dev->stats.tx_dropped++;
> +
>  	return NETDEV_TX_OK;
>  }
>  
> @@ -3890,6 +3895,9 @@ static int rocker_port_rx_proc(struct rocker *rocker,
>  	skb->protocol = eth_type_trans(skb, rocker_port->dev);
>  	netif_receive_skb(skb);
>  
> +	rocker_port->dev->stats.rx_packets++;
> +	rocker_port->dev->stats.rx_bytes += skb->len;

Same here, past netif_receive_skb() you should not assume that this skb
reference is valid.

> +
>  	return rocker_dma_rx_ring_skb_alloc(rocker, rocker_port, desc_info);
>  }
>  
> 


-- 
Florian


* [PATCH net-next] ipv4: per cpu uncached list
From: Eric Dumazet @ 2015-01-14 23:17 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Willem de Bruijn

From: Eric Dumazet <edumazet@google.com>

RAW sockets with hdrinc suffer from contention on rt_uncached_lock
spinlock.

One solution is to use percpu lists, since most routes are destroyed
by the cpu that created them.

It is unclear why we even have to put these routes in uncached_list,
as all outgoing packets should be freed when a device is dismantled.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: caacf05e5ad1 ("ipv4: Properly purge netdev references on uncached routes.")
---
 include/net/route.h |    2 +
 net/ipv4/route.c    |   46 ++++++++++++++++++++++++++++++------------
 2 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index b17cf28f996e..fe22d03afb6a 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -46,6 +46,7 @@
 
 struct fib_nh;
 struct fib_info;
+struct uncached_list;
 struct rtable {
 	struct dst_entry	dst;
 
@@ -64,6 +65,7 @@ struct rtable {
 	u32			rt_pmtu;
 
 	struct list_head	rt_uncached;
+	struct uncached_list	*rt_uncached_list;
 };
 
 static inline bool rt_is_input_route(const struct rtable *rt)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6a2155b02602..ce112d0f2698 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1325,14 +1325,22 @@ static bool rt_cache_route(struct fib_nh *nh, struct rtable *rt)
 	return ret;
 }
 
-static DEFINE_SPINLOCK(rt_uncached_lock);
-static LIST_HEAD(rt_uncached_list);
+struct uncached_list {
+	spinlock_t		lock;
+	struct list_head	head;
+};
+
+static DEFINE_PER_CPU_ALIGNED(struct uncached_list, rt_uncached_list);
 
 static void rt_add_uncached_list(struct rtable *rt)
 {
-	spin_lock_bh(&rt_uncached_lock);
-	list_add_tail(&rt->rt_uncached, &rt_uncached_list);
-	spin_unlock_bh(&rt_uncached_lock);
+	struct uncached_list *ul = raw_cpu_ptr(&rt_uncached_list);
+
+	rt->rt_uncached_list = ul;
+
+	spin_lock_bh(&ul->lock);
+	list_add_tail(&rt->rt_uncached, &ul->head);
+	spin_unlock_bh(&ul->lock);
 }
 
 static void ipv4_dst_destroy(struct dst_entry *dst)
@@ -1340,27 +1348,32 @@ static void ipv4_dst_destroy(struct dst_entry *dst)
 	struct rtable *rt = (struct rtable *) dst;
 
 	if (!list_empty(&rt->rt_uncached)) {
-		spin_lock_bh(&rt_uncached_lock);
+		struct uncached_list *ul = rt->rt_uncached_list;
+
+		spin_lock_bh(&ul->lock);
 		list_del(&rt->rt_uncached);
-		spin_unlock_bh(&rt_uncached_lock);
+		spin_unlock_bh(&ul->lock);
 	}
 }
 
 void rt_flush_dev(struct net_device *dev)
 {
-	if (!list_empty(&rt_uncached_list)) {
-		struct net *net = dev_net(dev);
-		struct rtable *rt;
+	struct net *net = dev_net(dev);
+	struct rtable *rt;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct uncached_list *ul = &per_cpu(rt_uncached_list, cpu);
 
-		spin_lock_bh(&rt_uncached_lock);
-		list_for_each_entry(rt, &rt_uncached_list, rt_uncached) {
+		spin_lock_bh(&ul->lock);
+		list_for_each_entry(rt, &ul->head, rt_uncached) {
 			if (rt->dst.dev != dev)
 				continue;
 			rt->dst.dev = net->loopback_dev;
 			dev_hold(rt->dst.dev);
 			dev_put(dev);
 		}
-		spin_unlock_bh(&rt_uncached_lock);
+		spin_unlock_bh(&ul->lock);
 	}
 }
 
@@ -2717,6 +2730,7 @@ struct ip_rt_acct __percpu *ip_rt_acct __read_mostly;
 int __init ip_rt_init(void)
 {
 	int rc = 0;
+	int cpu;
 
 	ip_idents = kmalloc(IP_IDENTS_SZ * sizeof(*ip_idents), GFP_KERNEL);
 	if (!ip_idents)
@@ -2724,6 +2738,12 @@ int __init ip_rt_init(void)
 
 	prandom_bytes(ip_idents, IP_IDENTS_SZ * sizeof(*ip_idents));
 
+	for_each_possible_cpu(cpu) {
+		struct uncached_list *ul = &per_cpu(rt_uncached_list, cpu);
+
+		INIT_LIST_HEAD(&ul->head);
+		spin_lock_init(&ul->lock);
+	}
 #ifdef CONFIG_IP_ROUTE_CLASSID
 	ip_rt_acct = __alloc_percpu(256 * sizeof(struct ip_rt_acct), __alignof__(struct ip_rt_acct));
 	if (!ip_rt_acct)

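The idea in the patch above (one list and one lock per CPU, plus a
back-pointer so removal can find the right lock from any CPU) can be
sketched in userspace C. This is an illustration under stated assumptions,
not the kernel code: NR_SHARDS stands in for the possible CPUs, an
atomic_flag stands in for the spinlock, the caller passes the CPU index
explicitly, and nodes are pushed at the head rather than the tail. All
names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_SHARDS 4	/* stands in for the number of possible CPUs */

struct node;

struct shard {
	atomic_flag lock;	/* tiny spinlock, one per shard */
	struct node *head;
};

struct node {
	struct node *prev, *next;
	struct shard *owner;	/* shard whose lock protects this node */
};

static struct shard shards[NR_SHARDS];

static void shard_lock(struct shard *s)
{
	while (atomic_flag_test_and_set(&s->lock))
		;	/* spin until the flag was previously clear */
}

static void shard_unlock(struct shard *s)
{
	atomic_flag_clear(&s->lock);
}

static void shards_init(void)
{
	for (int i = 0; i < NR_SHARDS; i++) {
		atomic_flag_clear(&shards[i].lock);
		shards[i].head = NULL;
	}
}

/* Add on the given shard and remember it for later removal. */
static void node_add(struct node *n, int cpu)
{
	struct shard *s;

	assert(cpu >= 0 && cpu < NR_SHARDS);
	s = &shards[cpu];
	n->owner = s;

	shard_lock(s);
	n->prev = NULL;
	n->next = s->head;
	if (s->head)
		s->head->prev = n;
	s->head = n;
	shard_unlock(s);
}

/* Removal may run anywhere; it takes the lock of the recorded shard. */
static void node_del(struct node *n)
{
	struct shard *s = n->owner;

	shard_lock(s);
	if (n->prev)
		n->prev->next = n->next;
	else
		s->head = n->next;
	if (n->next)
		n->next->prev = n->prev;
	n->prev = n->next = NULL;
	shard_unlock(s);
}

/* A full walk has to visit every shard, taking each lock in turn. */
static int count_all(void)
{
	int total = 0;

	for (int i = 0; i < NR_SHARDS; i++) {
		struct shard *s = &shards[i];

		shard_lock(s);
		for (struct node *n = s->head; n; n = n->next)
			total++;
		shard_unlock(s);
	}
	return total;
}
```

The trade-off is the one visible in the patch: add and delete contend only
on one shard's lock, while the rare full walk pays for iterating over all
shards.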

* [PATCH net-next v2 0/2] cxgb4/cxgb4i : Update & use ipv6 handling api
From: Anish Bhatt @ 2015-01-14 23:17 UTC (permalink / raw)
  To: netdev; +Cc: davem, hariprasad, kxie, deepak.s, Anish Bhatt

This patch series consolidates and updates the ipv6 api, and exports it
for use by upper level drivers that depend on cxgb4.

v2: Fix formatting issues in clip_tbl.c

Anish Bhatt (2):
  cxgb4 : Update ipv6 address handling api
  cxgb4i : Call into recently added cxgb4 ipv6 api

 drivers/net/ethernet/chelsio/cxgb4/Makefile        |   2 +-
 drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c      | 314 +++++++++++++++++++++
 drivers/net/ethernet/chelsio/cxgb4/clip_tbl.h      |  41 +++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h         |   3 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c |  19 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c    | 228 +++++----------
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h     |   3 -
 drivers/scsi/cxgbi/cxgb4i/cxgb4i.c                 |  23 +-
 8 files changed, 469 insertions(+), 164 deletions(-)
 create mode 100644 drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c
 create mode 100644 drivers/net/ethernet/chelsio/cxgb4/clip_tbl.h

-- 
2.2.1


* [PATCH net-next v2 1/2] cxgb4 : Update ipv6 address handling api
From: Anish Bhatt @ 2015-01-14 23:17 UTC (permalink / raw)
  To: netdev; +Cc: davem, hariprasad, kxie, deepak.s, Anish Bhatt
In-Reply-To: <1421277455-20158-1-git-send-email-anish@chelsio.com>

This patch improves on previously added support for ipv6 addresses. The code
is consolidated into a single file, and an api is added for use by dependent
upper level drivers such as cxgb4i/iw_cxgb4 etc.

Signed-off-by: Anish Bhatt <anish@chelsio.com>
Signed-off-by: Deepak Singh <deepak.s@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/Makefile        |   2 +-
 drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c      | 314 +++++++++++++++++++++
 drivers/net/ethernet/chelsio/cxgb4/clip_tbl.h      |  41 +++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h         |   3 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c |  19 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c    | 228 +++++----------
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h     |   3 -
 7 files changed, 447 insertions(+), 163 deletions(-)
 create mode 100644 drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c
 create mode 100644 drivers/net/ethernet/chelsio/cxgb4/clip_tbl.h

diff --git a/drivers/net/ethernet/chelsio/cxgb4/Makefile b/drivers/net/ethernet/chelsio/cxgb4/Makefile
index b85280775997..ae50cd72358c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/Makefile
+++ b/drivers/net/ethernet/chelsio/cxgb4/Makefile
@@ -4,6 +4,6 @@
 
 obj-$(CONFIG_CHELSIO_T4) += cxgb4.o
 
-cxgb4-objs := cxgb4_main.o l2t.o t4_hw.o sge.o
+cxgb4-objs := cxgb4_main.o l2t.o t4_hw.o sge.o clip_tbl.o
 cxgb4-$(CONFIG_CHELSIO_T4_DCB) +=  cxgb4_dcb.o
 cxgb4-$(CONFIG_DEBUG_FS) += cxgb4_debugfs.o
diff --git a/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c b/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c
new file mode 100644
index 000000000000..2b407b6a35a8
--- /dev/null
+++ b/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c
@@ -0,0 +1,314 @@
+/*
+ *  This file is part of the Chelsio T4 Ethernet driver for Linux.
+ *  Copyright (C) 2003-2014 Chelsio Communications.  All rights reserved.
+ *
+ *  Written by Deepak (deepak.s@chelsio.com)
+ *
+ *  This program is distributed in the hope that it will be useful, but WITHOUT
+ *  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ *  FITNESS FOR A PARTICULAR PURPOSE.  See the LICENSE file included in this
+ *  release for licensing terms and conditions.
+ */
+
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <linux/jhash.h>
+#include <linux/if_vlan.h>
+#include <net/addrconf.h>
+#include "cxgb4.h"
+#include "clip_tbl.h"
+
+static inline unsigned int ipv4_clip_hash(struct clip_tbl *c, const u32 *key)
+{
+	unsigned int clipt_size_half = c->clipt_size / 2;
+
+	return jhash_1word(*key, 0) % clipt_size_half;
+}
+
+static inline unsigned int ipv6_clip_hash(struct clip_tbl *d, const u32 *key)
+{
+	unsigned int clipt_size_half = d->clipt_size / 2;
+	u32 xor = key[0] ^ key[1] ^ key[2] ^ key[3];
+
+	return clipt_size_half +
+		(jhash_1word(xor, 0) % clipt_size_half);
+}
+
+static unsigned int clip_addr_hash(struct clip_tbl *ctbl, const u32 *addr,
+				   int addr_len)
+{
+	return addr_len == 4 ? ipv4_clip_hash(ctbl, addr) :
+				ipv6_clip_hash(ctbl, addr);
+}
+
+static int clip6_get_mbox(const struct net_device *dev,
+			  const struct in6_addr *lip)
+{
+	struct adapter *adap = netdev2adap(dev);
+	struct fw_clip_cmd c;
+
+	memset(&c, 0, sizeof(c));
+	c.op_to_write = htonl(FW_CMD_OP_V(FW_CLIP_CMD) |
+			      FW_CMD_REQUEST_F | FW_CMD_WRITE_F);
+	c.alloc_to_len16 = htonl(FW_CLIP_CMD_ALLOC_F | FW_LEN16(c));
+	*(__be64 *)&c.ip_hi = *(__be64 *)(lip->s6_addr);
+	*(__be64 *)&c.ip_lo = *(__be64 *)(lip->s6_addr + 8);
+	return t4_wr_mbox_meat(adap, adap->mbox, &c, sizeof(c), &c, false);
+}
+
+static int clip6_release_mbox(const struct net_device *dev,
+			      const struct in6_addr *lip)
+{
+	struct adapter *adap = netdev2adap(dev);
+	struct fw_clip_cmd c;
+
+	memset(&c, 0, sizeof(c));
+	c.op_to_write = htonl(FW_CMD_OP_V(FW_CLIP_CMD) |
+			      FW_CMD_REQUEST_F | FW_CMD_READ_F);
+	c.alloc_to_len16 = htonl(FW_CLIP_CMD_FREE_F | FW_LEN16(c));
+	*(__be64 *)&c.ip_hi = *(__be64 *)(lip->s6_addr);
+	*(__be64 *)&c.ip_lo = *(__be64 *)(lip->s6_addr + 8);
+	return t4_wr_mbox_meat(adap, adap->mbox, &c, sizeof(c), &c, false);
+}
+
+int cxgb4_clip_get(const struct net_device *dev, const u32 *lip, u8 v6)
+{
+	struct adapter *adap = netdev2adap(dev);
+	struct clip_tbl *ctbl = adap->clipt;
+	struct clip_entry *ce, *cte;
+	u32 *addr = (u32 *)lip;
+	int hash;
+	int addr_len;
+	int ret = 0;
+
+	if (v6)
+		addr_len = 16;
+	else
+		addr_len = 4;
+
+	hash = clip_addr_hash(ctbl, addr, addr_len);
+
+	read_lock_bh(&ctbl->lock);
+	list_for_each_entry(cte, &ctbl->hash_list[hash], list) {
+		if (addr_len == cte->addr_len &&
+		    memcmp(lip, cte->addr, cte->addr_len) == 0) {
+			ce = cte;
+			read_unlock_bh(&ctbl->lock);
+			goto found;
+		}
+	}
+	read_unlock_bh(&ctbl->lock);
+
+	write_lock_bh(&ctbl->lock);
+	if (!list_empty(&ctbl->ce_free_head)) {
+		ce = list_first_entry(&ctbl->ce_free_head,
+				      struct clip_entry, list);
+		list_del(&ce->list);
+		INIT_LIST_HEAD(&ce->list);
+		spin_lock_init(&ce->lock);
+		atomic_set(&ce->refcnt, 0);
+		atomic_dec(&ctbl->nfree);
+		ce->addr_len = addr_len;
+		memcpy(ce->addr, lip, addr_len);
+		list_add_tail(&ce->list, &ctbl->hash_list[hash]);
+		if (v6) {
+			ret = clip6_get_mbox(dev, (const struct in6_addr *)lip);
+			if (ret) {
+				write_unlock_bh(&ctbl->lock);
+				return ret;
+			}
+		}
+	} else {
+		write_unlock_bh(&ctbl->lock);
+		return -ENOMEM;
+	}
+	write_unlock_bh(&ctbl->lock);
+found:
+	atomic_inc(&ce->refcnt);
+
+	return 0;
+}
+EXPORT_SYMBOL(cxgb4_clip_get);
+
+void cxgb4_clip_release(const struct net_device *dev, const u32 *lip, u8 v6)
+{
+	struct adapter *adap = netdev2adap(dev);
+	struct clip_tbl *ctbl = adap->clipt;
+	struct clip_entry *ce, *cte;
+	u32 *addr = (u32 *)lip;
+	int hash;
+	int addr_len;
+
+	if (v6)
+		addr_len = 16;
+	else
+		addr_len = 4;
+
+	hash = clip_addr_hash(ctbl, addr, addr_len);
+
+	read_lock_bh(&ctbl->lock);
+	list_for_each_entry(cte, &ctbl->hash_list[hash], list) {
+		if (addr_len == cte->addr_len &&
+		    memcmp(lip, cte->addr, cte->addr_len) == 0) {
+			ce = cte;
+			read_unlock_bh(&ctbl->lock);
+			goto found;
+		}
+	}
+	read_unlock_bh(&ctbl->lock);
+
+	return;
+found:
+	write_lock_bh(&ctbl->lock);
+	spin_lock_bh(&ce->lock);
+	if (atomic_dec_and_test(&ce->refcnt)) {
+		list_del(&ce->list);
+		INIT_LIST_HEAD(&ce->list);
+		list_add_tail(&ce->list, &ctbl->ce_free_head);
+		atomic_inc(&ctbl->nfree);
+		if (v6)
+			clip6_release_mbox(dev, (const struct in6_addr *)lip);
+	}
+	spin_unlock_bh(&ce->lock);
+	write_unlock_bh(&ctbl->lock);
+}
+EXPORT_SYMBOL(cxgb4_clip_release);
+
+/* Retrieves IPv6 addresses from a root device (bond, vlan) associated with
+ * a physical device.
+ * The physical device reference is needed to send the actual CLIP command.
+ */
+static int cxgb4_update_dev_clip(struct net_device *root_dev,
+				 struct net_device *dev)
+{
+	struct inet6_dev *idev = NULL;
+	struct inet6_ifaddr *ifa;
+	int ret = 0;
+
+	idev = __in6_dev_get(root_dev);
+	if (!idev)
+		return ret;
+
+	read_lock_bh(&idev->lock);
+	list_for_each_entry(ifa, &idev->addr_list, if_list) {
+		ret = cxgb4_clip_get(dev, (const u32 *)ifa->addr.s6_addr, 1);
+		if (ret < 0)
+			break;
+	}
+	read_unlock_bh(&idev->lock);
+
+	return ret;
+}
+
+int cxgb4_update_root_dev_clip(struct net_device *dev)
+{
+	struct net_device *root_dev = NULL;
+	int i, ret = 0;
+
+	/* First populate the real net device's IPv6 addresses */
+	ret = cxgb4_update_dev_clip(dev, dev);
+	if (ret)
+		return ret;
+
+	/* Parse all bond and vlan devices layered on top of the physical dev */
+	root_dev = netdev_master_upper_dev_get_rcu(dev);
+	if (root_dev) {
+		ret = cxgb4_update_dev_clip(root_dev, dev);
+		if (ret)
+			return ret;
+	}
+
+	for (i = 0; i < VLAN_N_VID; i++) {
+		root_dev = __vlan_find_dev_deep_rcu(dev, htons(ETH_P_8021Q), i);
+		if (!root_dev)
+			continue;
+
+		ret = cxgb4_update_dev_clip(root_dev, dev);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(cxgb4_update_root_dev_clip);
+
+int clip_tbl_show(struct seq_file *seq, void *v)
+{
+	struct adapter *adapter = seq->private;
+	struct clip_tbl *ctbl = adapter->clipt;
+	struct clip_entry *ce;
+	char ip[60];
+	int i;
+
+	read_lock_bh(&ctbl->lock);
+
+	seq_puts(seq, "IP Address                  Users\n");
+	for (i = 0 ; i < ctbl->clipt_size;  ++i) {
+		list_for_each_entry(ce, &ctbl->hash_list[i], list) {
+			ip[0] = '\0';
+			if (ce->addr_len == 16)
+				sprintf(ip, "%pI6c", ce->addr);
+			else
+				sprintf(ip, "%pI4c", ce->addr);
+			seq_printf(seq, "%-25s   %u\n", ip,
+				   atomic_read(&ce->refcnt));
+		}
+	}
+	seq_printf(seq, "Free clip entries : %d\n", atomic_read(&ctbl->nfree));
+
+	read_unlock_bh(&ctbl->lock);
+
+	return 0;
+}
+
+struct clip_tbl *t4_init_clip_tbl(unsigned int clipt_start,
+				  unsigned int clipt_end)
+{
+	struct clip_entry *cl_list;
+	struct clip_tbl *ctbl;
+	unsigned int clipt_size;
+	int i;
+
+	if (clipt_start >= clipt_end)
+		return NULL;
+	clipt_size = clipt_end - clipt_start + 1;
+	if (clipt_size < CLIPT_MIN_HASH_BUCKETS)
+		return NULL;
+
+	ctbl = t4_alloc_mem(sizeof(*ctbl) +
+			    clipt_size*sizeof(struct list_head));
+	if (!ctbl)
+		return NULL;
+
+	ctbl->clipt_start = clipt_start;
+	ctbl->clipt_size = clipt_size;
+	INIT_LIST_HEAD(&ctbl->ce_free_head);
+
+	atomic_set(&ctbl->nfree, clipt_size);
+	rwlock_init(&ctbl->lock);
+
+	for (i = 0; i < ctbl->clipt_size; ++i)
+		INIT_LIST_HEAD(&ctbl->hash_list[i]);
+
+	cl_list = t4_alloc_mem(clipt_size*sizeof(struct clip_entry));
+	ctbl->cl_list = (void *)cl_list;
+
+	for (i = 0; i < clipt_size; i++) {
+		INIT_LIST_HEAD(&cl_list[i].list);
+		list_add_tail(&cl_list[i].list, &ctbl->ce_free_head);
+	}
+
+	return ctbl;
+}
+
+void t4_cleanup_clip_tbl(struct adapter *adap)
+{
+	struct clip_tbl *ctbl = adap->clipt;
+
+	if (ctbl) {
+		if (ctbl->cl_list)
+			t4_free_mem(ctbl->cl_list);
+		t4_free_mem(ctbl);
+	}
+}
+EXPORT_SYMBOL(t4_cleanup_clip_tbl);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.h b/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.h
new file mode 100644
index 000000000000..2eaba0161cf8
--- /dev/null
+++ b/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.h
@@ -0,0 +1,41 @@
+/*
+ *  This file is part of the Chelsio T4 Ethernet driver for Linux.
+ *  Copyright (C) 2003-2014 Chelsio Communications.  All rights reserved.
+ *
+ *  Written by Deepak (deepak.s@chelsio.com)
+ *
+ *  This program is distributed in the hope that it will be useful, but WITHOUT
+ *  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ *  FITNESS FOR A PARTICULAR PURPOSE.  See the LICENSE file included in this
+ *  release for licensing terms and conditions.
+ */
+
+struct clip_entry {
+	spinlock_t lock;	/* Hold while modifying clip reference */
+	atomic_t refcnt;
+	struct list_head list;
+	u32 addr[4];
+	int addr_len;
+};
+
+struct clip_tbl {
+	unsigned int clipt_start;
+	unsigned int clipt_size;
+	rwlock_t lock;
+	atomic_t nfree;
+	struct list_head ce_free_head;
+	void *cl_list;
+	struct list_head hash_list[0];
+};
+
+enum {
+	CLIPT_MIN_HASH_BUCKETS = 2,
+};
+
+struct clip_tbl *t4_init_clip_tbl(unsigned int clipt_start,
+				  unsigned int clipt_end);
+int cxgb4_clip_get(const struct net_device *dev, const u32 *lip, u8 v6);
+void cxgb4_clip_release(const struct net_device *dev, const u32 *lip, u8 v6);
+int clip_tbl_show(struct seq_file *seq, void *v);
+int cxgb4_update_root_dev_clip(struct net_device *dev);
+void t4_cleanup_clip_tbl(struct adapter *adap);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 7c785b5e7757..e468f920892f 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -668,6 +668,9 @@ struct adapter {
 	unsigned int l2t_start;
 	unsigned int l2t_end;
 	struct l2t_data *l2t;
+	unsigned int clipt_start;
+	unsigned int clipt_end;
+	struct clip_tbl *clipt;
 	void *uld_handle[CXGB4_ULD_MAX];
 	struct list_head list_node;
 	struct list_head rcu_node;
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index e9f348942eb0..6dabfe5ba44e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -41,6 +41,7 @@
 #include "t4_regs.h"
 #include "t4fw_api.h"
 #include "cxgb4_debugfs.h"
+#include "clip_tbl.h"
 #include "l2t.h"
 
 /* generic seq_file support for showing a table of size rows x width. */
@@ -563,6 +564,21 @@ static const struct file_operations mps_tcam_debugfs_fops = {
 	.release = seq_release,
 };
 
+#if IS_ENABLED(CONFIG_IPV6)
+static int clip_tbl_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, clip_tbl_show, PDE_DATA(inode));
+}
+
+static const struct file_operations clip_tbl_debugfs_fops = {
+	.owner   = THIS_MODULE,
+	.open    = clip_tbl_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = single_release
+};
+#endif
+
 static ssize_t mem_read(struct file *file, char __user *buf, size_t count,
 			loff_t *ppos)
 {
@@ -646,6 +662,9 @@ int t4_setup_debugfs(struct adapter *adap)
 		{ "devlog", &devlog_fops, S_IRUSR, 0 },
 		{ "l2t", &t4_l2t_fops, S_IRUSR, 0},
 		{ "mps_tcam", &mps_tcam_debugfs_fops, S_IRUSR, 0 },
+#if IS_ENABLED(CONFIG_IPV6)
+		{ "clip_tbl", &clip_tbl_debugfs_fops, S_IRUSR, 0 },
+#endif
 	};
 
 	add_debugfs_files(adap,
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 082a596a4264..1147e1e88314 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -62,6 +62,7 @@
 #include <net/netevent.h>
 #include <net/addrconf.h>
 #include <net/bonding.h>
+#include <net/addrconf.h>
 #include <asm/uaccess.h>
 
 #include "cxgb4.h"
@@ -71,6 +72,7 @@
 #include "t4fw_api.h"
 #include "cxgb4_dcb.h"
 #include "cxgb4_debugfs.h"
+#include "clip_tbl.h"
 #include "l2t.h"
 
 #ifdef DRV_VERSION
@@ -3236,40 +3238,6 @@ static int tid_init(struct tid_info *t)
 	return 0;
 }
 
-int cxgb4_clip_get(const struct net_device *dev,
-		   const struct in6_addr *lip)
-{
-	struct adapter *adap;
-	struct fw_clip_cmd c;
-
-	adap = netdev2adap(dev);
-	memset(&c, 0, sizeof(c));
-	c.op_to_write = htonl(FW_CMD_OP_V(FW_CLIP_CMD) |
-			FW_CMD_REQUEST_F | FW_CMD_WRITE_F);
-	c.alloc_to_len16 = htonl(FW_CLIP_CMD_ALLOC_F | FW_LEN16(c));
-	c.ip_hi = *(__be64 *)(lip->s6_addr);
-	c.ip_lo = *(__be64 *)(lip->s6_addr + 8);
-	return t4_wr_mbox_meat(adap, adap->mbox, &c, sizeof(c), &c, false);
-}
-EXPORT_SYMBOL(cxgb4_clip_get);
-
-int cxgb4_clip_release(const struct net_device *dev,
-		       const struct in6_addr *lip)
-{
-	struct adapter *adap;
-	struct fw_clip_cmd c;
-
-	adap = netdev2adap(dev);
-	memset(&c, 0, sizeof(c));
-	c.op_to_write = htonl(FW_CMD_OP_V(FW_CLIP_CMD) |
-			FW_CMD_REQUEST_F | FW_CMD_READ_F);
-	c.alloc_to_len16 = htonl(FW_CLIP_CMD_FREE_F | FW_LEN16(c));
-	c.ip_hi = *(__be64 *)(lip->s6_addr);
-	c.ip_lo = *(__be64 *)(lip->s6_addr + 8);
-	return t4_wr_mbox_meat(adap, adap->mbox, &c, sizeof(c), &c, false);
-}
-EXPORT_SYMBOL(cxgb4_clip_release);
-
 /**
  *	cxgb4_create_server - create an IP server
  *	@dev: the device
@@ -4122,148 +4090,61 @@ int cxgb4_unregister_uld(enum cxgb4_uld type)
 }
 EXPORT_SYMBOL(cxgb4_unregister_uld);
 
-/* Check if netdev on which event is occured belongs to us or not. Return
- * success (true) if it belongs otherwise failure (false).
- * Called with rcu_read_lock() held.
- */
 #if IS_ENABLED(CONFIG_IPV6)
-static bool cxgb4_netdev(const struct net_device *netdev)
+static int cxgb4_inet6addr_handler(struct notifier_block *this,
+				   unsigned long event, void *data)
 {
+	struct inet6_ifaddr *ifa = data;
+	struct net_device *event_dev = ifa->idev->dev;
+	const struct device *parent = NULL;
+#if IS_ENABLED(CONFIG_BONDING)
 	struct adapter *adap;
-	int i;
-
-	list_for_each_entry_rcu(adap, &adap_rcu_list, rcu_node)
-		for (i = 0; i < MAX_NPORTS; i++)
-			if (adap->port[i] == netdev)
-				return true;
-	return false;
-}
+#endif
+	if (event_dev->priv_flags & IFF_802_1Q_VLAN)
+		event_dev = vlan_dev_real_dev(event_dev);
+#if IS_ENABLED(CONFIG_BONDING)
+	if (event_dev->flags & IFF_MASTER) {
+		list_for_each_entry(adap, &adapter_list, list_node) {
+			switch (event) {
+			case NETDEV_UP:
+				cxgb4_clip_get(adap->port[0],
+					       (const u32 *)ifa, 1);
+				break;
+			case NETDEV_DOWN:
+				cxgb4_clip_release(adap->port[0],
+						   (const u32 *)ifa, 1);
+				break;
+			default:
+				break;
+			}
+		}
+		return NOTIFY_OK;
+	}
+#endif
 
-static int clip_add(struct net_device *event_dev, struct inet6_ifaddr *ifa,
-		    unsigned long event)
-{
-	int ret = NOTIFY_DONE;
+	if (event_dev)
+		parent = event_dev->dev.parent;
 
-	rcu_read_lock();
-	if (cxgb4_netdev(event_dev)) {
+	if (parent && parent->driver == &cxgb4_driver.driver) {
 		switch (event) {
 		case NETDEV_UP:
-			ret = cxgb4_clip_get(event_dev, &ifa->addr);
-			if (ret < 0) {
-				rcu_read_unlock();
-				return ret;
-			}
-			ret = NOTIFY_OK;
+			cxgb4_clip_get(event_dev, (const u32 *)ifa, 1);
 			break;
 		case NETDEV_DOWN:
-			cxgb4_clip_release(event_dev, &ifa->addr);
-			ret = NOTIFY_OK;
+			cxgb4_clip_release(event_dev, (const u32 *)ifa, 1);
 			break;
 		default:
 			break;
 		}
 	}
-	rcu_read_unlock();
-	return ret;
-}
-
-static int cxgb4_inet6addr_handler(struct notifier_block *this,
-		unsigned long event, void *data)
-{
-	struct inet6_ifaddr *ifa = data;
-	struct net_device *event_dev;
-	int ret = NOTIFY_DONE;
-	struct bonding *bond = netdev_priv(ifa->idev->dev);
-	struct list_head *iter;
-	struct slave *slave;
-	struct pci_dev *first_pdev = NULL;
-
-	if (ifa->idev->dev->priv_flags & IFF_802_1Q_VLAN) {
-		event_dev = vlan_dev_real_dev(ifa->idev->dev);
-		ret = clip_add(event_dev, ifa, event);
-	} else if (ifa->idev->dev->flags & IFF_MASTER) {
-		/* It is possible that two different adapters are bonded in one
-		 * bond. We need to find such different adapters and add clip
-		 * in all of them only once.
-		 */
-		bond_for_each_slave(bond, slave, iter) {
-			if (!first_pdev) {
-				ret = clip_add(slave->dev, ifa, event);
-				/* If clip_add is success then only initialize
-				 * first_pdev since it means it is our device
-				 */
-				if (ret == NOTIFY_OK)
-					first_pdev = to_pci_dev(
-							slave->dev->dev.parent);
-			} else if (first_pdev !=
-				   to_pci_dev(slave->dev->dev.parent))
-					ret = clip_add(slave->dev, ifa, event);
-		}
-	} else
-		ret = clip_add(ifa->idev->dev, ifa, event);
-
-	return ret;
+	return NOTIFY_OK;
 }
 
+static bool inet6addr_registered;
 static struct notifier_block cxgb4_inet6addr_notifier = {
 	.notifier_call = cxgb4_inet6addr_handler
 };
 
-/* Retrieves IPv6 addresses from a root device (bond, vlan) associated with
- * a physical device.
- * The physical device reference is needed to send the actul CLIP command.
- */
-static int update_dev_clip(struct net_device *root_dev, struct net_device *dev)
-{
-	struct inet6_dev *idev = NULL;
-	struct inet6_ifaddr *ifa;
-	int ret = 0;
-
-	idev = __in6_dev_get(root_dev);
-	if (!idev)
-		return ret;
-
-	read_lock_bh(&idev->lock);
-	list_for_each_entry(ifa, &idev->addr_list, if_list) {
-		ret = cxgb4_clip_get(dev, &ifa->addr);
-		if (ret < 0)
-			break;
-	}
-	read_unlock_bh(&idev->lock);
-
-	return ret;
-}
-
-static int update_root_dev_clip(struct net_device *dev)
-{
-	struct net_device *root_dev = NULL;
-	int i, ret = 0;
-
-	/* First populate the real net device's IPv6 addresses */
-	ret = update_dev_clip(dev, dev);
-	if (ret)
-		return ret;
-
-	/* Parse all bond and vlan devices layered on top of the physical dev */
-	root_dev = netdev_master_upper_dev_get_rcu(dev);
-	if (root_dev) {
-		ret = update_dev_clip(root_dev, dev);
-		if (ret)
-			return ret;
-	}
-
-	for (i = 0; i < VLAN_N_VID; i++) {
-		root_dev = __vlan_find_dev_deep_rcu(dev, htons(ETH_P_8021Q), i);
-		if (!root_dev)
-			continue;
-
-		ret = update_dev_clip(root_dev, dev);
-		if (ret)
-			break;
-	}
-	return ret;
-}
-
 static void update_clip(const struct adapter *adap)
 {
 	int i;
@@ -4277,7 +4158,7 @@ static void update_clip(const struct adapter *adap)
 		ret = 0;
 
 		if (dev)
-			ret = update_root_dev_clip(dev);
+			ret = cxgb4_update_root_dev_clip(dev);
 
 		if (ret < 0)
 			break;
@@ -5391,6 +5272,14 @@ static int adap_init0(struct adapter *adap)
 	adap->tids.nftids = val[4] - val[3] + 1;
 	adap->sge.ingr_start = val[5];
 
+	params[0] = FW_PARAM_PFVF(CLIP_START);
+	params[1] = FW_PARAM_PFVF(CLIP_END);
+	ret = t4_query_params(adap, adap->mbox, adap->fn, 0, 2, params, val);
+	if (ret < 0)
+		goto bye;
+	adap->clipt_start = val[0];
+	adap->clipt_end = val[1];
+
 	/* query params related to active filter region */
 	params[0] = FW_PARAM_PFVF(ACTIVE_FILTER_START);
 	params[1] = FW_PARAM_PFVF(ACTIVE_FILTER_END);
@@ -6211,6 +6100,18 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 		adapter->params.offload = 0;
 	}
 
+#if IS_ENABLED(CONFIG_IPV6)
+	adapter->clipt = t4_init_clip_tbl(adapter->clipt_start,
+					  adapter->clipt_end);
+	if (!adapter->clipt) {
+		/* We tolerate a lack of clip_table, giving up
+		 * some functionality
+		 */
+		dev_warn(&pdev->dev,
+			 "could not allocate Clip table, continuing\n");
+		adapter->params.offload = 0;
+	}
+#endif
 	if (is_offload(adapter) && tid_init(&adapter->tids) < 0) {
 		dev_warn(&pdev->dev, "could not allocate TID table, "
 			 "continuing\n");
@@ -6336,6 +6237,9 @@ static void remove_one(struct pci_dev *pdev)
 			cxgb_down(adapter);
 
 		free_some_resources(adapter);
+#if IS_ENABLED(CONFIG_IPV6)
+		t4_cleanup_clip_tbl(adapter);
+#endif
 		iounmap(adapter->regs);
 		if (!is_t4(adapter->params.chip))
 			iounmap(adapter->bar2);
@@ -6374,7 +6278,10 @@ static int __init cxgb4_init_module(void)
 		debugfs_remove(cxgb4_debugfs_root);
 
 #if IS_ENABLED(CONFIG_IPV6)
-	register_inet6addr_notifier(&cxgb4_inet6addr_notifier);
+	if (!inet6addr_registered) {
+		register_inet6addr_notifier(&cxgb4_inet6addr_notifier);
+		inet6addr_registered = true;
+	}
 #endif
 
 	return ret;
@@ -6383,7 +6290,10 @@ static int __init cxgb4_init_module(void)
 static void __exit cxgb4_cleanup_module(void)
 {
 #if IS_ENABLED(CONFIG_IPV6)
-	unregister_inet6addr_notifier(&cxgb4_inet6addr_notifier);
+	if (inet6addr_registered && list_empty(&adapter_list)) {
+		unregister_inet6addr_notifier(&cxgb4_inet6addr_notifier);
+		inet6addr_registered = false;
+	}
 #endif
 	pci_unregister_driver(&cxgb4_driver);
 	debugfs_remove(cxgb4_debugfs_root);  /* NULL ok */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
index 152b4c4c7809..78ab4d406ce2 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
@@ -173,9 +173,6 @@ int cxgb4_create_server_filter(const struct net_device *dev, unsigned int stid,
 			       unsigned char port, unsigned char mask);
 int cxgb4_remove_server_filter(const struct net_device *dev, unsigned int stid,
 			       unsigned int queue, bool ipv6);
-int cxgb4_clip_get(const struct net_device *dev, const struct in6_addr *lip);
-int cxgb4_clip_release(const struct net_device *dev,
-		       const struct in6_addr *lip);
 
 static inline void set_wr_txq(struct sk_buff *skb, int prio, int queue)
 {
-- 
2.2.1

^ permalink raw reply related

* [PATCH net-next v2 2/2] cxgb4i : Call into recently added cxgb4 ipv6 api
From: Anish Bhatt @ 2015-01-14 23:17 UTC (permalink / raw)
  To: netdev; +Cc: davem, hariprasad, kxie, deepak.s, Anish Bhatt
In-Reply-To: <1421277455-20158-1-git-send-email-anish@chelsio.com>

Get a reference on every ipv6 address we offload to hardware so that it cannot
be prematurely cleared out before cleanup.

Signed-off-by: Anish Bhatt <anish@chelsio.com>
---
 drivers/scsi/cxgbi/cxgb4i/cxgb4i.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/cxgbi/cxgb4i/cxgb4i.c b/drivers/scsi/cxgbi/cxgb4i/cxgb4i.c
index 37d7191a3c38..dd00e5fe4a5e 100644
--- a/drivers/scsi/cxgbi/cxgb4i/cxgb4i.c
+++ b/drivers/scsi/cxgbi/cxgb4i/cxgb4i.c
@@ -28,6 +28,7 @@
 #include "t4fw_api.h"
 #include "l2t.h"
 #include "cxgb4i.h"
+#include "clip_tbl.h"
 
 static unsigned int dbg_level;
 
@@ -1322,6 +1323,9 @@ static inline void l2t_put(struct cxgbi_sock *csk)
 static void release_offload_resources(struct cxgbi_sock *csk)
 {
 	struct cxgb4_lld_info *lldi;
+#if IS_ENABLED(CONFIG_IPV6)
+	struct net_device *ndev = csk->cdev->ports[csk->port_id];
+#endif
 
 	log_debug(1 << CXGBI_DBG_TOE | 1 << CXGBI_DBG_SOCK,
 		"csk 0x%p,%u,0x%lx,%u.\n",
@@ -1334,6 +1338,12 @@ static void release_offload_resources(struct cxgbi_sock *csk)
 	}
 
 	l2t_put(csk);
+#if IS_ENABLED(CONFIG_IPV6)
+	if (csk->csk_family == AF_INET6)
+		cxgb4_clip_release(ndev,
+				   (const u32 *)&csk->saddr6.sin6_addr, 1);
+#endif
+
 	if (cxgbi_sock_flag(csk, CTPF_HAS_ATID))
 		free_atid(csk);
 	else if (cxgbi_sock_flag(csk, CTPF_HAS_TID)) {
@@ -1391,10 +1401,15 @@ static int init_act_open(struct cxgbi_sock *csk)
 	csk->l2t = cxgb4_l2t_get(lldi->l2t, n, ndev, 0);
 	if (!csk->l2t) {
 		pr_err("%s, cannot alloc l2t.\n", ndev->name);
-		goto rel_resource;
+		goto rel_resource_without_clip;
 	}
 	cxgbi_sock_get(csk);
 
+#if IS_ENABLED(CONFIG_IPV6)
+	if (csk->csk_family == AF_INET6)
+		cxgb4_clip_get(ndev, (const u32 *)&csk->saddr6.sin6_addr, 1);
+#endif
+
 	if (t4) {
 		size = sizeof(struct cpl_act_open_req);
 		size6 = sizeof(struct cpl_act_open_req6);
@@ -1451,6 +1466,12 @@ static int init_act_open(struct cxgbi_sock *csk)
 	return 0;
 
 rel_resource:
+#if IS_ENABLED(CONFIG_IPV6)
+	if (csk->csk_family == AF_INET6)
+		cxgb4_clip_release(ndev,
+				   (const u32 *)&csk->saddr6.sin6_addr, 1);
+#endif
+rel_resource_without_clip:
 	if (n)
 		neigh_release(n);
 	if (skb)
-- 
2.2.1
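
For illustration only: the get/release pairing in this patch, together with
the two unwind labels in init_act_open(), can be sketched in user space
roughly as follows. struct clip_entry, clip_get(), clip_release(),
open_conn() and close_conn() are hypothetical stand-ins and not the cxgb4
API. The only assumption carried over from the patch is that a failure
before the reference is taken must skip the release, while a failure after
it must drop the reference again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one CLIP table entry; not the cxgb4 layout. */
struct clip_entry {
	int refcnt;
};

static void clip_get(struct clip_entry *ce)
{
	ce->refcnt++;
}

static void clip_release(struct clip_entry *ce)
{
	/* Releasing a reference that was never taken is a caller bug. */
	assert(ce->refcnt > 0);
	ce->refcnt--;
}

/*
 * Mirrors the shape of init_act_open(): an early failure jumps past the
 * release, a late failure falls through it.
 */
static int open_conn(struct clip_entry *ce, bool l2t_ok, bool send_ok)
{
	if (!l2t_ok)
		goto rel_resource_without_clip;

	clip_get(ce);

	if (!send_ok)
		goto rel_resource;

	return 0;	/* reference stays held until close_conn() */

rel_resource:
	clip_release(ce);
rel_resource_without_clip:
	return -1;
}

/* Counterpart of release_offload_resources() in this sketch. */
static void close_conn(struct clip_entry *ce)
{
	clip_release(ce);
}
```

With this layout every successful open_conn() is balanced by exactly one
close_conn(), and neither failure path leaves the count changed.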

^ permalink raw reply related

* RE: [PATCH net-next] bridge: fix setlink/dellink notifications
From: Arad, Ronen @ 2015-01-14 23:22 UTC (permalink / raw)
  To: roopa@cumulusnetworks.com, netdev@vger.kernel.org,
	shemminger@vyatta.com, vyasevic@redhat.com,
	john.fastabend@gmail.com, tgraf@suug.ch, jhs@mojatatu.com,
	sfeldma@gmail.com, jiri@resnulli.us
  Cc: wkok@cumulusnetworks.com
In-Reply-To: <1421218123-18346-1-git-send-email-roopa@cumulusnetworks.com>



>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>Behalf Of roopa@cumulusnetworks.com
>Sent: Tuesday, January 13, 2015 10:49 PM
>To: netdev@vger.kernel.org; shemminger@vyatta.com; vyasevic@redhat.com;
>john.fastabend@gmail.com; tgraf@suug.ch; jhs@mojatatu.com; sfeldma@gmail.com;
>jiri@resnulli.us
>Cc: wkok@cumulusnetworks.com
>Subject: [PATCH net-next] bridge: fix setlink/dellink notifications
>
>From: Roopa Prabhu <roopa@cumulusnetworks.com>
>
[..]
>diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
>index d06107d..4ac79ff 100644
>--- a/net/core/rtnetlink.c
>+++ b/net/core/rtnetlink.c
>@@ -2876,13 +2876,6 @@ static int rtnl_bridge_notify(struct net_device *dev,
>u16 flags)

The 'flags' argument was only used for applying the same handling of
MASTER/SELF flags to notification as used for setlink/dellink.
This patch eliminates the MASTER case and leaves only SELF notification.
It seems clearer to eliminate flags argument and rename the function to
something like rtnl_bridge_self_notify().
   
> 		goto errout;
> 	}
>
>-	if ((!flags || (flags & BRIDGE_FLAGS_MASTER)) &&
>-	    br_dev && br_dev->netdev_ops->ndo_bridge_getlink) {
>-		err = br_dev->netdev_ops->ndo_bridge_getlink(skb, 0, 0, dev, 0);
>-		if (err < 0)
>-			goto errout;
>-	}
>-
> 	if ((flags & BRIDGE_FLAGS_SELF) &&
> 	    dev->netdev_ops->ndo_bridge_getlink) {
> 		err = dev->netdev_ops->ndo_bridge_getlink(skb, 0, 0, dev, 0);
>@@ -2958,16 +2951,19 @@ static int rtnl_bridge_setlink(struct sk_buff *skb,
>struct nlmsghdr *nlh)
> 			err = -EOPNOTSUPP;
> 		else
> 			err = dev->netdev_ops->ndo_bridge_setlink(dev, nlh);
>-
>-		if (!err)
>+		if (!err) {
> 			flags &= ~BRIDGE_FLAGS_SELF;
>+
>+			/* Generate event to notify upper layer of bridge
>+			 * change
>+			 */
>+			if (!err)
>+				err = rtnl_bridge_notify(dev, oflags);
>+		}
> 	}
>
> 	if (have_flags)
> 		memcpy(nla_data(attr), &flags, sizeof(flags));

What is the purpose of the above two lines (not changed by the patch)?
They seem to copy over the flags with the successfully applied cases
(MASTER and/or SELF) flags cleared back into the incoming netlink message.
I could not find any place where the modified flags attribute is used.

>-	/* Generate event to notify upper layer of bridge change */
>-	if (!err)
>-		err = rtnl_bridge_notify(dev, oflags);
> out:
> 	return err;
> }
>@@ -3032,15 +3028,19 @@ static int rtnl_bridge_dellink(struct sk_buff *skb,
>struct nlmsghdr *nlh)
> 		else
> 			err = dev->netdev_ops->ndo_bridge_dellink(dev, nlh);
>
>-		if (!err)
>+		if (!err) {
> 			flags &= ~BRIDGE_FLAGS_SELF;
>+
>+			/* Generate event to notify upper layer of bridge
>+			 * change
>+			 */
>+			err = rtnl_bridge_notify(dev, oflags);
>+		}
>+
> 	}
>
> 	if (have_flags)
> 		memcpy(nla_data(attr), &flags, sizeof(flags));
>-	/* Generate event to notify upper layer of bridge change */
>-	if (!err)
>-		err = rtnl_bridge_notify(dev, oflags);
> out:
> 	return err;
> }
>--
>1.7.10.4
>

^ permalink raw reply

* Re: [ 2375.793397] WARNING: CPU: 0 PID: 1149 at net/netlink/genetlink.c:1037 genl_unbind+0xc0/0xd0()
From: Johannes Berg @ 2015-01-14 23:25 UTC (permalink / raw)
  To: Jeff Layton; +Cc: netdev
In-Reply-To: <1421275700.1950.34.camel@sipsolutions.net>

On Wed, 2015-01-14 at 23:48 +0100, Johannes Berg wrote:

> > [ 2375.793396] ------------[ cut here ]------------
> > [ 2375.793397] WARNING: CPU: 0 PID: 1149 at net/netlink/genetlink.c:1037 genl_unbind+0xc0/0xd0()
> 
> This warning is supposed to happen only when you somehow manage to
> unsubscribe from a generic netlink group that doesn't actually exist, or
> so.

Ok - after long deliberation I found a way to trigger it. It requires
that you leave a multicast group (likely by destroying a socket) at the
same time as the kernel unregisters the generic netlink group. I have no
idea what generic netlink group you might be using here, but I could
reproduce it with a strategically placed delay in the netlink code and
the nl80211 genl group by opening a socket, closing the socket, and
removing the cfg80211 module (to unregister the nl80211 genl group)
while the socket was still being closed.

I'll think about a fix tomorrow - it doesn't seem trivial due to
possible locking concerns.

On the bright side, I cannot see a way to reproduce this without
removing the genl family at the same time - which is good because it
means that I've just again audited the case I was worried about (the
bind/unbind not being symmetric) - it is asymmetric but only in the case
of genl family removal which seems reasonable (but I should document
it.)

johannes

^ permalink raw reply

* [PATCH] e1000e: Fix 82574/82583 TimeSync errata handling for SYSTIM read
From: Bhavesh Davda @ 2015-01-14 23:30 UTC (permalink / raw)
  To: linux.nics, netdev
  Cc: nithin, ninad, gyang, smurali, pv-drivers, Bhavesh Davda

In emulated 82574 vNICs, TIMINCA might read as '0', so this change prevents a
divide-by-zero error.

Signed-off-by: Bhavesh Davda <bhavesh@vmware.com>
Acked-by: Nithin Raju <nithin@vmware.com>
Acked-by: Ninad Ghodke <ninad@vmware.com>
Reviewed-by: Guolin Yang <gyang@vmware.com>
Reviewed-by: Srividya Murali <smurali@vmware.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index e14fd85..a4727e3 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -4141,6 +4141,8 @@ static cycle_t e1000e_cyclecounter_read(const struct cyclecounter *cc)
 		 * rate and is a multiple of incvalue
 		 */
 		incvalue = er32(TIMINCA) & E1000_TIMINCA_INCVALUE_MASK;
+		if (incvalue == 0)
+			goto out;
 		for (i = 0; i < E1000_MAX_82574_SYSTIM_REREADS; i++) {
 			/* latch SYSTIMH on read of SYSTIML */
 			systim_next = (cycle_t)er32(SYSTIML);
@@ -4157,6 +4159,7 @@ static cycle_t e1000e_cyclecounter_read(const struct cyclecounter *cc)
 				break;
 		}
 	}
+out:
 	return systim;
 }
 
-- 
2.3.0.rc0
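
As a rough user-space sketch of the guard added above (an illustration,
not the driver code): the check bails out before any division when the
increment value reads as zero. The helper name systim_rem() and its
signature are hypothetical, and the modulo stands in for whatever
division the driver performs on incvalue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute delta % incvalue into *rem.
 * Returns false, leaving *rem untouched, when incvalue is zero, which
 * is what an emulated NIC may report for TIMINCA.
 */
static bool systim_rem(uint64_t delta, uint32_t incvalue, uint64_t *rem)
{
	assert(rem != NULL);

	if (incvalue == 0)
		return false;	/* skip the division entirely */

	*rem = delta % incvalue;
	return true;
}
```

A caller would treat a false return the same way the patch treats
incvalue == 0: skip the re-read logic and use the value already read.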

^ permalink raw reply related

* Re: [PATCH net-next] bridge: fix setlink/dellink notifications
From: John Fastabend @ 2015-01-14 23:36 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: roopa@cumulusnetworks.com, netdev@vger.kernel.org,
	shemminger@vyatta.com, vyasevic@redhat.com, tgraf@suug.ch,
	jhs@mojatatu.com, sfeldma@gmail.com, jiri@resnulli.us,
	wkok@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505DEE86A@ORSMSX101.amr.corp.intel.com>

On 01/14/2015 03:22 PM, Arad, Ronen wrote:
>
>
>> -----Original Message-----
>> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>> Behalf Of roopa@cumulusnetworks.com
>> Sent: Tuesday, January 13, 2015 10:49 PM
>> To: netdev@vger.kernel.org; shemminger@vyatta.com; vyasevic@redhat.com;
>> john.fastabend@gmail.com; tgraf@suug.ch; jhs@mojatatu.com; sfeldma@gmail.com;
>> jiri@resnulli.us
>> Cc: wkok@cumulusnetworks.com
>> Subject: [PATCH net-next] bridge: fix setlink/dellink notifications
>>
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
> [..]

[...]

>> 			err = dev->netdev_ops->ndo_bridge_setlink(dev, nlh);
>> -
>> -		if (!err)
>> +		if (!err) {
>> 			flags &= ~BRIDGE_FLAGS_SELF;
>> +
>> +			/* Generate event to notify upper layer of bridge
>> +			 * change
>> +			 */
>> +			if (!err)
>> +				err = rtnl_bridge_notify(dev, oflags);
>> +		}
>> 	}
>>
>> 	if (have_flags)
>> 		memcpy(nla_data(attr), &flags, sizeof(flags));
>
> What is the purpose of the above two lines (not changed by the patch)?
> They seem to copy over the flags with the successfully applied cases
> (MASTER and/or SELF) flags cleared back into the incoming netlink message.
> I could not figure any place where the modified flags attribute is used

This allows userspace to learn which operation failed when it is an
operation to set both the software bridge via BRIDGE_FLAGS_MASTER and
the hardware via BRIDGE_FLAGS_SELF. When we get the error back
software looks at the flags to figure out how to recover/retry/etc.
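
A simplified sketch of that flag handling, for illustration only. The
function bridge_setlink_sketch() and its master_err/self_err parameters
are hypothetical stand-ins for the two ndo_bridge_setlink() calls; the
flag values are the ones from include/uapi/linux/if_bridge.h. Each flag
is cleared once its operation succeeds, so whatever is still set when an
error comes back names the part that did not get applied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BRIDGE_FLAGS_MASTER	1	/* command to/from the bridge master */
#define BRIDGE_FLAGS_SELF	2	/* command to/from the lower device */

/*
 * Apply the MASTER part first, then the SELF part. On the first
 * failure, return the error with the remaining flags left set.
 */
static int bridge_setlink_sketch(uint16_t *flags, int master_err, int self_err)
{
	assert(flags != NULL);

	if (*flags & BRIDGE_FLAGS_MASTER) {
		if (master_err)
			return master_err;
		*flags &= (uint16_t)~BRIDGE_FLAGS_MASTER;
	}

	if (*flags & BRIDGE_FLAGS_SELF) {
		if (self_err)
			return self_err;
		*flags &= (uint16_t)~BRIDGE_FLAGS_SELF;
	}

	return 0;
}
```

If the MASTER step succeeds and the SELF step fails, the caller sees the
error with only BRIDGE_FLAGS_SELF still set, and can retry just that part.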

.John

-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH net-next] bridge: fix setlink/dellink notifications
From: roopa @ 2015-01-14 23:54 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: netdev@vger.kernel.org, shemminger@vyatta.com,
	vyasevic@redhat.com, john.fastabend@gmail.com, tgraf@suug.ch,
	jhs@mojatatu.com, sfeldma@gmail.com, jiri@resnulli.us,
	wkok@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505DEE86A@ORSMSX101.amr.corp.intel.com>

On 1/14/15, 3:22 PM, Arad, Ronen wrote:
>
>> -----Original Message-----
>> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>> Behalf Of roopa@cumulusnetworks.com
>> Sent: Tuesday, January 13, 2015 10:49 PM
>> To: netdev@vger.kernel.org; shemminger@vyatta.com; vyasevic@redhat.com;
>> john.fastabend@gmail.com; tgraf@suug.ch; jhs@mojatatu.com; sfeldma@gmail.com;
>> jiri@resnulli.us
>> Cc: wkok@cumulusnetworks.com
>> Subject: [PATCH net-next] bridge: fix setlink/dellink notifications
>>
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
> [..]
>> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
>> index d06107d..4ac79ff 100644
>> --- a/net/core/rtnetlink.c
>> +++ b/net/core/rtnetlink.c
>> @@ -2876,13 +2876,6 @@ static int rtnl_bridge_notify(struct net_device *dev,
>> u16 flags)
> The 'flags' argument was only used for applying the same handling of
> MASTER/SELF flags to notification as used for setlink/dellink.
> This patch eliminates the MASTER case and leaves only SELF notification.
> It seems clearer to eliminate flags argument and rename the function to
> something like rtnl_bridge_self_notify().
sure, if that makes it clearer.

Thanks,
Roopa

^ permalink raw reply

* Re: [PATCH] net: rocker: Add basic netdev counters
From: David Ahern @ 2015-01-14 23:54 UTC (permalink / raw)
  To: Florian Fainelli, netdev; +Cc: Scott Feldman, Jiri Pirko
In-Reply-To: <54B6F43C.4030207@gmail.com>

On 1/14/15 3:57 PM, Florian Fainelli wrote:
> On 14/01/15 14:39, David Ahern wrote:
>> Add packet and byte counters for RX and TX paths.
>>
>> $ ifconfig eth1
>> eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>>          inet6 fe80::5054:ff:fe12:3501  prefixlen 64  scopeid 0x20<link>
>>          ether 52:54:00:12:35:01  txqueuelen 1000  (Ethernet)
>>          RX packets 63  bytes 15813 (15.4 KiB)
>>          RX errors 0  dropped 0  overruns 0  frame 0
>>          TX packets 79  bytes 17991 (17.5 KiB)
>>          TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>
>> Signed-off-by: David Ahern <dsahern@gmail.com>
>> Cc: Scott Feldman <sfeldma@gmail.com>
>> Cc: Jiri Pirko <jiri@resnulli.us>
>> ---
>>   drivers/net/ethernet/rocker/rocker.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
>> index 2f398fa4b9e6..9743279d9121 100644
>> --- a/drivers/net/ethernet/rocker/rocker.c
>> +++ b/drivers/net/ethernet/rocker/rocker.c
>> @@ -3557,6 +3557,9 @@ static netdev_tx_t rocker_port_xmit(struct sk_buff *skb, struct net_device *dev)
>>   	if (!desc_info)
>>   		netif_stop_queue(dev);
>>
>> +	dev->stats.tx_packets++;
>> +	dev->stats.tx_bytes += skb->len;
>
> Potential use after free, the skb pointer is certainly not valid anymore
> here.
>
> BTW, increasing statistics here is valid because this is a driver for a
> virtual piece of HW, which does not have TX reclaim/completion logic,
> but if it did, statistics update should occur there, not in the
> ndo_start_xmit() function.

sure. I had considered putting it in the rocker_port_poll_tx function, like
this:

-       dev_kfree_skb_any(rocker_desc_cookie_ptr_get(desc_info));
+
+       skb = rocker_desc_cookie_ptr_get(desc_info);
+       rocker_port->dev->stats.tx_packets++;
+       rocker_port->dev->stats.tx_bytes += skb->len;
+
+       dev_kfree_skb_any(skb);

I think this is the reclaim point.

>
>> +
>>   	return NETDEV_TX_OK;
>>
>>   unmap_frags:
>> @@ -3565,6 +3568,8 @@ static netdev_tx_t rocker_port_xmit(struct sk_buff *skb, struct net_device *dev)
>>   	rocker_tlv_nest_cancel(desc_info, frags);
>>   out:
>>   	dev_kfree_skb(skb);
>> +	dev->stats.tx_dropped++;
>> +
>>   	return NETDEV_TX_OK;
>>   }
>>
>> @@ -3890,6 +3895,9 @@ static int rocker_port_rx_proc(struct rocker *rocker,
>>   	skb->protocol = eth_type_trans(skb, rocker_port->dev);
>>   	netif_receive_skb(skb);
>>
>> +	rocker_port->dev->stats.rx_packets++;
>> +	rocker_port->dev->stats.rx_bytes += skb->len;
>
> Same here, past netif_receive_skb() you should not assume that this skb
> reference is valid.

right. I'll move the stats update above the netif_receive_skb() call.
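
The ordering can be sketched in user space like this (illustration only;
struct pkt, struct stats, deliver() and rx_proc() are hypothetical
stand-ins for the skb, the netdev counters, netif_receive_skb() and
rocker_port_rx_proc()). The counters are updated while the buffer is
still owned by the caller, and nothing touches it after the hand-off.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for sk_buff and the netdev counters. */
struct pkt {
	unsigned int len;
};

struct stats {
	unsigned long rx_packets;
	unsigned long rx_bytes;
};

/* Takes ownership of p and may free it at any point. */
static void deliver(struct pkt *p)
{
	free(p);
}

static void rx_proc(struct stats *st, struct pkt *p)
{
	assert(st != NULL && p != NULL);

	/* Account first: p must not be dereferenced after deliver(). */
	st->rx_packets++;
	st->rx_bytes += p->len;

	deliver(p);
}
```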

Thanks,
David

^ permalink raw reply

* Re: [PATCH net-next] bridge: fix setlink/dellink notifications
From: roopa @ 2015-01-14 23:56 UTC (permalink / raw)
  To: John Fastabend
  Cc: Arad, Ronen, netdev@vger.kernel.org, shemminger@vyatta.com,
	vyasevic@redhat.com, tgraf@suug.ch, jhs@mojatatu.com,
	sfeldma@gmail.com, jiri@resnulli.us, wkok@cumulusnetworks.com
In-Reply-To: <54B6FD60.8020106@gmail.com>

On 1/14/15, 3:36 PM, John Fastabend wrote:
> On 01/14/2015 03:22 PM, Arad, Ronen wrote:
>>
>>
>>> -----Original Message-----
>>> From: netdev-owner@vger.kernel.org 
>>> [mailto:netdev-owner@vger.kernel.org] On
>>> Behalf Of roopa@cumulusnetworks.com
>>> Sent: Tuesday, January 13, 2015 10:49 PM
>>> To: netdev@vger.kernel.org; shemminger@vyatta.com; vyasevic@redhat.com;
>>> john.fastabend@gmail.com; tgraf@suug.ch; jhs@mojatatu.com; 
>>> sfeldma@gmail.com;
>>> jiri@resnulli.us
>>> Cc: wkok@cumulusnetworks.com
>>> Subject: [PATCH net-next] bridge: fix setlink/dellink notifications
>>>
>>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>
>> [..]
>
> [...]
>
>>>             err = dev->netdev_ops->ndo_bridge_setlink(dev, nlh);
>>> -
>>> -        if (!err)
>>> +        if (!err) {
>>>             flags &= ~BRIDGE_FLAGS_SELF;
>>> +
>>> +            /* Generate event to notify upper layer of bridge
>>> +             * change
>>> +             */
>>> +            if (!err)
>>> +                err = rtnl_bridge_notify(dev, oflags);
>>> +        }
>>>     }
>>>
>>>     if (have_flags)
>>>         memcpy(nla_data(attr), &flags, sizeof(flags));
>>
>> What is the purpose of the above two lines (not changed by the patch)?
>> They seem to copy over the flags with the successfully applied cases
>> (MASTER and/or SELF) flags cleared back into the incoming netlink 
>> message.
>> I could not figure any place where the modified flags attribute is used
>
> This allows userspace to learn which operation failed when it is an
> operation to set both the software bridge via BRIDGE_FLAGS_MASTER and
> the hardware via BRIDGE_FLAGS_SELF. When we get the error back
> software looks at the flags to figure out how to recover/retry/etc.
Ah ok, I was also wondering why that was there,

thanks,
Roopa

^ permalink raw reply

* RE: [PATCH] ixgbe: Re-enable relaxed ordering as part of init/restart sequence for non-DCA config
From: Tantilov, Emil S @ 2015-01-15  0:01 UTC (permalink / raw)
  To: Sowmini Varadhan, Kirsher, Jeffrey T, Brandeburg, Jesse,
	Allan, Bruce W, Wyborny, Carolyn, Skidmore, Donald C,
	Rose, Gregory V, Vick, Matthew, Ronciak, John, Williams, Mitch A
  Cc: Linux NICS, e1000-devel@lists.sourceforge.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	sparclinux@vger.kernel.org
In-Reply-To: <20150114144009.GG19534@oracle.com>

>-----Original Message-----
>From: Sowmini Varadhan [mailto:sowmini.varadhan@oracle.com] 
>Relaxed ordering is disabled by default at driver initialization
>and re-enabled when DCA is used. The reason it is disabled  was
>due to an issue on some chipsets (see comments in ixgbe_update_tx_dca()).
>But when DCA is not used, RO needs to be re-enabled, else we have
>a serialization bottleneck on platforms like SPARC.
>
>This patch eliminates the bottleneck for ixgbe when DCA is not configured.
>
>Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>Cc: Emil Tantilov <emil.s.tantilov@intel.com>
>---
> drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c  |    1 +
> drivers/net/ethernet/intel/ixgbe/ixgbe_common.c |   20 ++++++++++++++++++++
> drivers/net/ethernet/intel/ixgbe/ixgbe_common.h |    1 +
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |   11 +++++++++++
> drivers/net/ethernet/intel/ixgbe/ixgbe_type.h   |    1 +
> drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c   |    1 +
> 6 files changed, 35 insertions(+), 0 deletions(-)
>
>diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c
>index c5c97b4..85c7a28 100644
>--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c
>+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c
>@@ -1161,6 +1161,7 @@ static struct ixgbe_mac_operations mac_ops_82598 = {
> 	.clear_hw_cntrs		= &ixgbe_clear_hw_cntrs_generic,
> 	.get_media_type		= &ixgbe_get_media_type_82598,
> 	.enable_rx_dma          = &ixgbe_enable_rx_dma_generic,
>+	.enable_relaxed_ordering = &ixgbe_enable_relaxed_ordering,

The IXGBE_DCA_TXCTRL register for 82598 is at a different offset. Also, there
is a limit of 16 registers. The function we have in our code for 82598 is as
follows:

/**
 *  ixgbe_enable_relaxed_ordering_82598 - enable relaxed ordering
 *  @hw: pointer to hardware structure
 *
 **/
void ixgbe_enable_relaxed_ordering_82598(struct ixgbe_hw *hw)
{
	u32 regval;
	u32 i;

	/* Enable relaxed ordering */
	for (i = 0; ((i < hw->mac.max_tx_queues) &&
	     (i < IXGBE_DCA_MAX_QUEUES_82598)); i++) {
		regval = IXGBE_READ_REG(hw, IXGBE_DCA_TXCTRL(i));
		regval |= IXGBE_DCA_TXCTRL_DESC_WRO_EN;
		IXGBE_WRITE_REG(hw, IXGBE_DCA_TXCTRL(i), regval);
	}

	for (i = 0; ((i < hw->mac.max_rx_queues) &&
	     (i < IXGBE_DCA_MAX_QUEUES_82598)); i++) {
		regval = IXGBE_READ_REG(hw, IXGBE_DCA_RXCTRL(i));
		regval |= IXGBE_DCA_RXCTRL_DATA_WRO_EN |
			  IXGBE_DCA_RXCTRL_HEAD_WRO_EN;
		IXGBE_WRITE_REG(hw, IXGBE_DCA_RXCTRL(i), regval);
	}
}

Thanks,
Emil

^ permalink raw reply

* Re: [PATCH net-next] bridge: fix setlink/dellink notifications
From: tgraf @ 2015-01-15  0:01 UTC (permalink / raw)
  To: John Fastabend
  Cc: Arad, Ronen, roopa@cumulusnetworks.com, netdev@vger.kernel.org,
	shemminger@vyatta.com, vyasevic@redhat.com, jhs@mojatatu.com,
	sfeldma@gmail.com, jiri@resnulli.us, wkok@cumulusnetworks.com
In-Reply-To: <54B6FD60.8020106@gmail.com>

On 01/14/15 at 03:36pm, John Fastabend wrote:
> On 01/14/2015 03:22 PM, Arad, Ronen wrote:
> >What is the purpose of the above two lines (not changed by the patch)?
> >They seem to copy over the flags with the successfully applied cases
> >(MASTER and/or SELF) flags cleared back into the incoming netlink message.
> >I could not figure any place where the modified flags attribute is used
> 
> This allows userspace to learn which operation failed when it is an
> operation to set both the software bridge via BRIDGE_FLAGS_MASTER and
> the hardware via BRIDGE_FLAGS_SELF. When we get the error back
> software looks at the flags to figure out how to recover/retry/etc.

The intent of including the original message in the error Netlink
message was originally to track the request that led to the error ;-)

^ permalink raw reply

* [PATCH 0/5 net-next v5] VXLAN Group Policy Extension
From: Thomas Graf @ 2015-01-15  0:10 UTC (permalink / raw)
  To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov,
	nicolas.dichtel
  Cc: netdev, dev

Implements support for the Group Policy VXLAN extension [0] to provide
a lightweight and simple security label mechanism across network peers
based on VXLAN. The security context and associated metadata is mapped
to/from skb->mark. This allows further mapping to a SELinux context
using SECMARK, to implement ACLs directly with nftables, iptables, OVS,
tc, etc.

The extension is disabled by default and should be run on a distinct
port in mixed Linux VXLAN VTEP environments. Liberal VXLAN VTEPs
which ignore unknown reserved bits will be able to receive VXLAN-GBP
frames.

Simple usage example:

10.1.1.1:
   # ip link add vxlan0 type vxlan id 10 remote 10.1.1.2 gbp
   # iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200

10.1.1.2:
   # ip link add vxlan0 type vxlan id 10 remote 10.1.1.1 gbp
   # iptables -I INPUT -m mark --mark 0x200 -j DROP

iproute2 [1] and OVS [2] support will be provided in separate patches.

[0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
[1] https://github.com/tgraf/iproute2/tree/vxlan-gbp
[2] https://github.com/tgraf/ovs/tree/vxlan-gbp

Thomas Graf (5):
  vxlan: Group Policy extension
  vxlan: Only bind to sockets with correct extensions enabled
  openvswitch: Rename GENEVE_TUN_OPTS() to TUN_METADATA_OPTS()
  openvswitch: Allow for any level of nesting in flow attributes
  openvswitch: Support VXLAN Group Policy extension

 drivers/net/vxlan.c              | 125 +++++++++++++----
 include/net/ip_tunnels.h         |   5 +-
 include/net/vxlan.h              |  83 +++++++++++-
 include/uapi/linux/if_link.h     |   1 +
 include/uapi/linux/openvswitch.h |  11 ++
 net/openvswitch/flow.c           |   2 +-
 net/openvswitch/flow.h           |  14 +-
 net/openvswitch/flow_netlink.c   | 286 ++++++++++++++++++++++++++-------------
 net/openvswitch/vport-geneve.c   |  15 +-
 net/openvswitch/vport-vxlan.c    |  91 ++++++++++++-
 net/openvswitch/vport-vxlan.h    |  11 ++
 11 files changed, 500 insertions(+), 144 deletions(-)
 create mode 100644 net/openvswitch/vport-vxlan.h

-- 
1.9.3

^ permalink raw reply

* [PATCH 2/5] vxlan: Only bind to sockets with correct extensions enabled
From: Thomas Graf @ 2015-01-15  0:10 UTC (permalink / raw)
  To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov,
	nicolas.dichtel
  Cc: netdev, dev
In-Reply-To: <cover.1421280226.git.tgraf@suug.ch>

A VXLAN net_device looking for an appropriate socket may only consider
a socket which has a matching set of extensions enabled. If the
extensions don't match, return a conflict to have the caller create a
distinct socket with distinct port.

The OVS VXLAN port is kept unaware of extensions at this point.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v4->v5:
 - No change
v3->v4:
 - No change
v2->v3:
 - No change
v1->v2:
 - Improved commit message, reported by Jesse

 drivers/net/vxlan.c           | 35 +++++++++++++++++++++--------------
 include/net/vxlan.h           |  2 +-
 net/openvswitch/vport-vxlan.c |  2 +-
 3 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 06f7196..ca94f2f 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -265,14 +265,15 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
 }
 
 /* Find VXLAN socket based on network namespace, address family and UDP port */
-static struct vxlan_sock *vxlan_find_sock(struct net *net,
-					  sa_family_t family, __be16 port)
+static struct vxlan_sock *vxlan_find_sock(struct net *net, sa_family_t family,
+					  __be16 port, u32 exts)
 {
 	struct vxlan_sock *vs;
 
 	hlist_for_each_entry_rcu(vs, vs_head(net, port), hlist) {
 		if (inet_sk(vs->sock->sk)->inet_sport == port &&
-		    inet_sk(vs->sock->sk)->sk.sk_family == family)
+		    inet_sk(vs->sock->sk)->sk.sk_family == family &&
+		    vs->exts == exts)
 			return vs;
 	}
 	return NULL;
@@ -292,11 +293,12 @@ static struct vxlan_dev *vxlan_vs_find_vni(struct vxlan_sock *vs, u32 id)
 
 /* Look up VNI in a per net namespace table */
 static struct vxlan_dev *vxlan_find_vni(struct net *net, u32 id,
-					sa_family_t family, __be16 port)
+					sa_family_t family, __be16 port,
+					u32 exts)
 {
 	struct vxlan_sock *vs;
 
-	vs = vxlan_find_sock(net, family, port);
+	vs = vxlan_find_sock(net, family, port, exts);
 	if (!vs)
 		return NULL;
 
@@ -1963,7 +1965,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 
 			ip_rt_put(rt);
 			dst_vxlan = vxlan_find_vni(vxlan->net, vni,
-						   dst->sa.sa_family, dst_port);
+						   dst->sa.sa_family, dst_port,
+						   vxlan->exts);
 			if (!dst_vxlan)
 				goto tx_error;
 			vxlan_encap_bypass(skb, vxlan, dst_vxlan);
@@ -2022,7 +2025,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 
 			dst_release(ndst);
 			dst_vxlan = vxlan_find_vni(vxlan->net, vni,
-						   dst->sa.sa_family, dst_port);
+						   dst->sa.sa_family, dst_port,
+						   vxlan->exts);
 			if (!dst_vxlan)
 				goto tx_error;
 			vxlan_encap_bypass(skb, vxlan, dst_vxlan);
@@ -2192,7 +2196,7 @@ static int vxlan_init(struct net_device *dev)
 
 	spin_lock(&vn->sock_lock);
 	vs = vxlan_find_sock(vxlan->net, ipv6 ? AF_INET6 : AF_INET,
-			     vxlan->dst_port);
+			     vxlan->dst_port, vxlan->exts);
 	if (vs && atomic_add_unless(&vs->refcnt, 1, 0)) {
 		/* If we have a socket with same port already, reuse it */
 		vxlan_vs_add_dev(vs, vxlan);
@@ -2532,7 +2536,7 @@ static struct socket *vxlan_create_sock(struct net *net, bool ipv6,
 /* Create new listen socket if needed */
 static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port,
 					      vxlan_rcv_t *rcv, void *data,
-					      u32 flags)
+					      u32 flags, u32 exts)
 {
 	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
 	struct vxlan_sock *vs;
@@ -2561,6 +2565,7 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port,
 	vs->rcv = rcv;
 	vs->data = data;
 	vs->flags = flags;
+	vs->exts = exts;
 
 	/* Initialize the vxlan udp offloads structure */
 	vs->udp_offloads.port = port;
@@ -2585,13 +2590,14 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port,
 
 struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
 				  vxlan_rcv_t *rcv, void *data,
-				  bool no_share, u32 flags)
+				  bool no_share, u32 flags,
+				  u32 exts)
 {
 	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
 	struct vxlan_sock *vs;
 	bool ipv6 = flags & VXLAN_F_IPV6;
 
-	vs = vxlan_socket_create(net, port, rcv, data, flags);
+	vs = vxlan_socket_create(net, port, rcv, data, flags, exts);
 	if (!IS_ERR(vs))
 		return vs;
 
@@ -2599,7 +2605,7 @@ struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
 		return vs;
 
 	spin_lock(&vn->sock_lock);
-	vs = vxlan_find_sock(net, ipv6 ? AF_INET6 : AF_INET, port);
+	vs = vxlan_find_sock(net, ipv6 ? AF_INET6 : AF_INET, port, exts);
 	if (vs && ((vs->rcv != rcv) ||
 		   !atomic_add_unless(&vs->refcnt, 1, 0)))
 			vs = ERR_PTR(-EBUSY);
@@ -2621,7 +2627,8 @@ static void vxlan_sock_work(struct work_struct *work)
 	__be16 port = vxlan->dst_port;
 	struct vxlan_sock *nvs;
 
-	nvs = vxlan_sock_add(net, port, vxlan_rcv, NULL, false, vxlan->flags);
+	nvs = vxlan_sock_add(net, port, vxlan_rcv, NULL, false, vxlan->flags,
+			     vxlan->exts);
 	spin_lock(&vn->sock_lock);
 	if (!IS_ERR(nvs))
 		vxlan_vs_add_dev(nvs, vxlan);
@@ -2767,7 +2774,7 @@ static int vxlan_newlink(struct net *net, struct net_device *dev,
 		vxlan->exts |= VXLAN_EXT_GBP;
 
 	if (vxlan_find_vni(net, vni, use_ipv6 ? AF_INET6 : AF_INET,
-			   vxlan->dst_port)) {
+			   vxlan->dst_port, vxlan->exts)) {
 		pr_info("duplicate VNI %u\n", vni);
 		return -EEXIST;
 	}
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index db02582..446b018 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -133,7 +133,7 @@ struct vxlan_sock {
 
 struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
 				  vxlan_rcv_t *rcv, void *data,
-				  bool no_share, u32 flags);
+				  bool no_share, u32 flags, u32 exts);
 
 void vxlan_sock_release(struct vxlan_sock *vs);
 
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index deed9e3..40a16fb 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -128,7 +128,7 @@ static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
 	vxlan_port = vxlan_vport(vport);
 	strncpy(vxlan_port->name, parms->name, IFNAMSIZ);
 
-	vs = vxlan_sock_add(net, htons(dst_port), vxlan_rcv, vport, true, 0);
+	vs = vxlan_sock_add(net, htons(dst_port), vxlan_rcv, vport, true, 0, 0);
 	if (IS_ERR(vs)) {
 		ovs_vport_free(vport);
 		return (void *)vs;
-- 
1.9.3


* [PATCH 3/5] openvswitch: Rename GENEVE_TUN_OPTS() to TUN_METADATA_OPTS()
From: Thomas Graf @ 2015-01-15  0:10 UTC (permalink / raw)
  To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov,
	nicolas.dichtel
  Cc: netdev, dev
In-Reply-To: <cover.1421280226.git.tgraf@suug.ch>

Also factors out the Geneve validation code into a new separate function,
validate_geneve_opts().
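
For readers who want to follow the validation loop outside the kernel, here
is a minimal user-space sketch of the same walk over a buffer of Geneve
options. It is not the patch's code: walk_geneve_opts() and
GENEVE_OPT_HDR_LEN are hypothetical names, and the option header is parsed
byte-wise (2 bytes class, 1 byte type, 1 byte with the length in its low 5
bits, counted in 4-byte words) instead of through struct geneve_opt.

```c
#include <stddef.h>
#include <stdint.h>

#define GENEVE_OPT_HDR_LEN   4     /* hypothetical name: class (2) + type (1) + flags/length (1) */
#define GENEVE_CRIT_OPT_TYPE 0x80  /* high bit of the type byte marks a critical option */

/* Walk a buffer of Geneve options.
 * Returns -1 if the buffer is malformed, 1 if at least one critical
 * option was seen, and 0 otherwise.
 */
static int walk_geneve_opts(const uint8_t *buf, size_t opts_len)
{
	int crit = 0;

	while (opts_len > 0) {
		size_t len;

		/* Not even a full option header left. */
		if (opts_len < GENEVE_OPT_HDR_LEN)
			return -1;

		/* The length field counts 4-byte words of option data. */
		len = GENEVE_OPT_HDR_LEN + (size_t)(buf[3] & 0x1f) * 4;
		if (len > opts_len)
			return -1;

		if (buf[2] & GENEVE_CRIT_OPT_TYPE)
			crit = 1;

		buf += len;
		opts_len -= len;
	}

	return crit;
}
```

In the patch itself the "critical option seen" result is folded into
tun_flags as TUNNEL_CRIT_OPT rather than returned.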

A subsequent patch will introduce VXLAN options. Rename the existing
GENEVE_TUN_OPTS() to reflect its extended purpose of carrying generic
tunnel metadata options.
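
As an illustration of what the renamed macros compute, the following
user-space sketch places options at the end of the option buffer, so that
options of length opt_len start at (buffer size - opt_len). The struct, the
buffer size of 255 and the helper names are assumptions made for this
example only; the real definitions live in net/openvswitch/flow.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TUN_OPTS_SIZE 255	/* assumed buffer size for this sketch */

/* Hypothetical stand-in for the relevant part of struct sw_flow_key. */
struct flow_key_sketch {
	uint8_t tun_opts[TUN_OPTS_SIZE];
	uint8_t tun_opts_len;
};

/* Offset of the first option byte: options are right-aligned in tun_opts. */
static size_t tun_metadata_offset(size_t opt_len)
{
	return sizeof(((struct flow_key_sketch *)0)->tun_opts) - opt_len;
}

/* Untyped pointer to the start of the options, as TUN_METADATA_OPTS() returns. */
static void *tun_metadata_opts(struct flow_key_sketch *key, size_t opt_len)
{
	return key->tun_opts + tun_metadata_offset(opt_len);
}

/* Copy len bytes of options into the tail of the buffer and record the length. */
static void store_opts(struct flow_key_sketch *key, const void *opts, uint8_t len)
{
	memcpy(tun_metadata_opts(key, len), opts, len);
	key->tun_opts_len = len;
}
```

Returning void * instead of struct geneve_opt * is what lets a later patch
store a different option layout in the same buffer.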

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v4->v5:
 - No change
v3->v4:
 - Renamed validate_and_copy_geneve_opts() to validate_geneve_opts() as
   suggested by Jesse
v2->v3:
 - No change
v1->v2:
 - Don't rename genev_tun_opt_from_nlattr() and keep it Geneve specific,
   pointed out by Jesse.
 - Factor out Geneve specific validation code into separate function as
   requested by Jesse.

 net/openvswitch/flow.c         |  2 +-
 net/openvswitch/flow.h         | 14 ++++----
 net/openvswitch/flow_netlink.c | 72 +++++++++++++++++++++++-------------------
 3 files changed, 47 insertions(+), 41 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index df334fe..e2c348b 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -691,7 +691,7 @@ int ovs_flow_key_extract(const struct ovs_tunnel_info *tun_info,
 			BUILD_BUG_ON((1 << (sizeof(tun_info->options_len) *
 						   8)) - 1
 					> sizeof(key->tun_opts));
-			memcpy(GENEVE_OPTS(key, tun_info->options_len),
+			memcpy(TUN_METADATA_OPTS(key, tun_info->options_len),
 			       tun_info->options, tun_info->options_len);
 			key->tun_opts_len = tun_info->options_len;
 		} else {
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index a8b30f3..d3d0a40 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -53,7 +53,7 @@ struct ovs_key_ipv4_tunnel {
 
 struct ovs_tunnel_info {
 	struct ovs_key_ipv4_tunnel tunnel;
-	const struct geneve_opt *options;
+	const void *options;
 	u8 options_len;
 };
 
@@ -61,10 +61,10 @@ struct ovs_tunnel_info {
  * maximum size. This allows us to get the benefits of variable length
  * matching for small options.
  */
-#define GENEVE_OPTS(flow_key, opt_len)	\
-	((struct geneve_opt *)((flow_key)->tun_opts + \
-			       FIELD_SIZEOF(struct sw_flow_key, tun_opts) - \
-			       opt_len))
+#define TUN_METADATA_OFFSET(opt_len) \
+	(FIELD_SIZEOF(struct sw_flow_key, tun_opts) - opt_len)
+#define TUN_METADATA_OPTS(flow_key, opt_len) \
+	((void *)((flow_key)->tun_opts + TUN_METADATA_OFFSET(opt_len)))
 
 static inline void __ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
 					    __be32 saddr, __be32 daddr,
@@ -73,7 +73,7 @@ static inline void __ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
 					    __be16 tp_dst,
 					    __be64 tun_id,
 					    __be16 tun_flags,
-					    const struct geneve_opt *opts,
+					    const void *opts,
 					    u8 opts_len)
 {
 	tun_info->tunnel.tun_id = tun_id;
@@ -105,7 +105,7 @@ static inline void ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
 					  __be16 tp_dst,
 					  __be64 tun_id,
 					  __be16 tun_flags,
-					  const struct geneve_opt *opts,
+					  const void *opts,
 					  u8 opts_len)
 {
 	__ovs_flow_tun_info_init(tun_info, iph->saddr, iph->daddr,
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index d1eecf7..2e8a9cd 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -432,8 +432,7 @@ static int genev_tun_opt_from_nlattr(const struct nlattr *a,
 		SW_FLOW_KEY_PUT(match, tun_opts_len, 0xff, true);
 	}
 
-	opt_key_offset = (unsigned long)GENEVE_OPTS((struct sw_flow_key *)0,
-						    nla_len(a));
+	opt_key_offset = TUN_METADATA_OFFSET(nla_len(a));
 	SW_FLOW_KEY_MEMCPY_OFFSET(match, opt_key_offset, nla_data(a),
 				  nla_len(a), is_mask);
 	return 0;
@@ -558,8 +557,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 
 static int __ipv4_tun_to_nlattr(struct sk_buff *skb,
 				const struct ovs_key_ipv4_tunnel *output,
-				const struct geneve_opt *tun_opts,
-				int swkey_tun_opts_len)
+				const void *tun_opts, int swkey_tun_opts_len)
 {
 	if (output->tun_flags & TUNNEL_KEY &&
 	    nla_put_be64(skb, OVS_TUNNEL_KEY_ATTR_ID, output->tun_id))
@@ -600,8 +598,7 @@ static int __ipv4_tun_to_nlattr(struct sk_buff *skb,
 
 static int ipv4_tun_to_nlattr(struct sk_buff *skb,
 			      const struct ovs_key_ipv4_tunnel *output,
-			      const struct geneve_opt *tun_opts,
-			      int swkey_tun_opts_len)
+			      const void *tun_opts, int swkey_tun_opts_len)
 {
 	struct nlattr *nla;
 	int err;
@@ -1148,10 +1145,10 @@ int ovs_nla_put_flow(const struct sw_flow_key *swkey,
 		goto nla_put_failure;
 
 	if ((swkey->tun_key.ipv4_dst || is_mask)) {
-		const struct geneve_opt *opts = NULL;
+		const void *opts = NULL;
 
 		if (output->tun_key.tun_flags & TUNNEL_OPTIONS_PRESENT)
-			opts = GENEVE_OPTS(output, swkey->tun_opts_len);
+			opts = TUN_METADATA_OPTS(output, swkey->tun_opts_len);
 
 		if (ipv4_tun_to_nlattr(skb, &output->tun_key, opts,
 				       swkey->tun_opts_len))
@@ -1540,6 +1537,34 @@ void ovs_match_init(struct sw_flow_match *match,
 	}
 }
 
+static int validate_geneve_opts(struct sw_flow_key *key)
+{
+	struct geneve_opt *option;
+	int opts_len = key->tun_opts_len;
+	bool crit_opt = false;
+
+	option = (struct geneve_opt *)TUN_METADATA_OPTS(key, key->tun_opts_len);
+	while (opts_len > 0) {
+		int len;
+
+		if (opts_len < sizeof(*option))
+			return -EINVAL;
+
+		len = sizeof(*option) + option->length * 4;
+		if (len > opts_len)
+			return -EINVAL;
+
+		crit_opt |= !!(option->type & GENEVE_CRIT_OPT_TYPE);
+
+		option = (struct geneve_opt *)((u8 *)option + len);
+		opts_len -= len;
+	};
+
+	key->tun_key.tun_flags |= crit_opt ? TUNNEL_CRIT_OPT : 0;
+
+	return 0;
+}
+
 static int validate_and_copy_set_tun(const struct nlattr *attr,
 				     struct sw_flow_actions **sfa, bool log)
 {
@@ -1555,28 +1580,9 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 		return err;
 
 	if (key.tun_opts_len) {
-		struct geneve_opt *option = GENEVE_OPTS(&key,
-							key.tun_opts_len);
-		int opts_len = key.tun_opts_len;
-		bool crit_opt = false;
-
-		while (opts_len > 0) {
-			int len;
-
-			if (opts_len < sizeof(*option))
-				return -EINVAL;
-
-			len = sizeof(*option) + option->length * 4;
-			if (len > opts_len)
-				return -EINVAL;
-
-			crit_opt |= !!(option->type & GENEVE_CRIT_OPT_TYPE);
-
-			option = (struct geneve_opt *)((u8 *)option + len);
-			opts_len -= len;
-		};
-
-		key.tun_key.tun_flags |= crit_opt ? TUNNEL_CRIT_OPT : 0;
+		err = validate_geneve_opts(&key);
+		if (err < 0)
+			return err;
 	};
 
 	start = add_nested_action_start(sfa, OVS_ACTION_ATTR_SET, log);
@@ -1597,9 +1603,9 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 		 * everything else will go away after flow setup. We can append
 		 * it to tun_info and then point there.
 		 */
-		memcpy((tun_info + 1), GENEVE_OPTS(&key, key.tun_opts_len),
-		       key.tun_opts_len);
-		tun_info->options = (struct geneve_opt *)(tun_info + 1);
+		memcpy((tun_info + 1),
+		       TUN_METADATA_OPTS(&key, key.tun_opts_len), key.tun_opts_len);
+		tun_info->options = (tun_info + 1);
 	} else {
 		tun_info->options = NULL;
 	}
-- 
1.9.3


* [PATCH 1/5] vxlan: Group Policy extension
From: Thomas Graf @ 2015-01-15  0:10 UTC (permalink / raw)
  To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov,
	nicolas.dichtel
  Cc: netdev, dev
In-Reply-To: <cover.1421280226.git.tgraf@suug.ch>

Implements support for the Group Policy VXLAN extension [0] to provide
a lightweight and simple security label mechanism across network peers
based on VXLAN. The security context and associated metadata are mapped
to/from skb->mark. This allows further mapping to a SELinux context
using SECMARK, and implementing ACLs directly with nftables, iptables,
OVS, tc, etc.

The group membership is defined by the lower 16 bits of skb->mark; the
upper 16 bits are used for flags.
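
A small user-space sketch of this mark layout may help when writing
iptables or tc rules against it. The bit positions mirror
VXLAN_GBP_DONT_LEARN, VXLAN_GBP_POLICY_APPLIED and VXLAN_GBP_ID_MASK from
this patch; the GBP_* names and the two helpers are hypothetical and exist
only for this example.

```c
#include <stdint.h>

/* Same values as the VXLAN_GBP_* definitions in include/net/vxlan.h. */
#define GBP_DONT_LEARN     (UINT32_C(1) << (6 + 16))	/* 0x400000 */
#define GBP_POLICY_APPLIED (UINT32_C(1) << (3 + 16))	/* 0x080000 */
#define GBP_ID_MASK        UINT32_C(0xFFFF)

/* Build a mark value from a group ID and the two flag bits. */
static uint32_t gbp_make_mark(uint16_t group, int dont_learn, int policy_applied)
{
	uint32_t mark = group & GBP_ID_MASK;

	if (dont_learn)
		mark |= GBP_DONT_LEARN;
	if (policy_applied)
		mark |= GBP_POLICY_APPLIED;

	return mark;
}

/* Extract the group ID (lower 16 bits) from a mark value. */
static uint16_t gbp_mark_group(uint32_t mark)
{
	return (uint16_t)(mark & GBP_ID_MASK);
}
```

With this layout, a rule that should match on the group alone, regardless
of flags, needs to mask the mark with 0xFFFF.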

SELinux allows managing labels to secure local resources. However,
distributed applications require ACLs to be implemented across hosts.
This is typically achieved by matching on L2-L4 fields to identify the
original sending host and process on the receiver. On top of that,
netlabel and specifically CIPSO [1] allow mapping security contexts to
universal labels. However, netlabel and CIPSO are relatively complex.
This patch provides a lightweight alternative for overlay network
environments with a trusted underlay. No additional control protocol
is required.

           Host 1:                       Host 2:

      Group A        Group B        Group B     Group A
      +-----+   +-------------+    +-------+   +-----+
      | lxc |   | SELinux CTX |    | httpd |   | VM  |
      +--+--+   +--+----------+    +---+---+   +--+--+
	  \---+---/                     \----+---/
	      |                              |
	  +---+---+                      +---+---+
	  | vxlan |                      | vxlan |
	  +---+---+                      +---+---+
	      +------------------------------+

Backwards compatibility:
A VXLAN-GBP socket can receive standard VXLAN frames and will assign
the default group 0x0000 to such frames. A Linux VXLAN socket will
drop VXLAN-GBP frames. The extension is therefore disabled by default
and needs to be specifically enabled:

   ip link add [...] type vxlan [...] gbp

In a mixed environment with VXLAN and VXLAN-GBP sockets, the GBP socket
must run on a separate port number.
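
The accept/drop behaviour described above can be sketched in user space as
follows. This is a simplified model of the flag handling in
vxlan_udp_encap_recv(), not the function itself: it assumes the VNI flag
must be present and is cleared first, it ignores RCO, and vxlan_flags_ok()
and the shortened macro names are hypothetical.

```c
#include <stdint.h>

#define HF_VNI        (UINT32_C(1) << 27)	/* VXLAN_HF_VNI */
#define HF_GBP        (UINT32_C(1) << 31)	/* VXLAN_HF_GBP */
#define EXT_GBP       (UINT32_C(1) << 0)	/* VXLAN_EXT_GBP */
#define GBP_USED_BITS (HF_GBP | UINT32_C(0xFFFFFF))

/* flags is the first header word in host byte order, sock_exts the
 * extensions enabled on the receiving socket.
 * Returns 1 if the frame would be accepted, 0 if it would be dropped.
 */
static int vxlan_flags_ok(uint32_t flags, uint32_t sock_exts)
{
	/* Assumption: a frame without the VNI flag is rejected. */
	if (!(flags & HF_VNI))
		return 0;
	flags &= ~HF_VNI;

	/* Reserved fields may only be used if an extension was requested. */
	if (sock_exts && (flags & HF_GBP)) {
		if (!(sock_exts & EXT_GBP))
			return 0;
		flags &= ~GBP_USED_BITS;
	}

	/* Any flag bit still set is unprocessed, so drop the frame. */
	return flags == 0;
}
```

A plain socket (sock_exts == 0) never clears the G bit, which is why it
drops VXLAN-GBP frames, while a GBP socket still accepts frames that carry
only the VNI flag.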

Examples:
 iptables:
  host1# iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
  host2# iptables -I INPUT -m mark --mark 0x200 -j DROP

 OVS:
  # ovs-ofctl add-flow br0 'in_port=1,actions=load:0x200->NXM_NX_TUN_GBP_ID[],NORMAL'
  # ovs-ofctl add-flow br0 'in_port=2,tun_gbp_id=0x200,actions=drop'

[0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
[1] http://lwn.net/Articles/204905/

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v4->v5:
 - Rebased on top of Tom's RCO work
 - Dropped IFLA_VXLAN_EXTENSION container attribute and embedded IFLA_VXLAN_GBP
   as a top-level VXLAN attribute, like RCO, for consistency.
v3->v4:
 - Patch 1 was no longer needed due to Tom Herbert's 3bf394 ("vxlan: Improve
   support for header flags"). Moved remaining header description to this patch.
 - Zero out vxlan_metadata in vxlan_tnl_send() as suggested by Jesse.
 - Reported enabled extensions to user space as requested by Nicolas.
 - Use VXLAN_HF_GBP instead of bitfield to be in line with Tom's work.
v2->v3:
 - Removed empty struct vxlan_gbp as spotted by Alexei
v1->v2:
 - split GBP header definition into separate struct vxlanhdr_gbp as requested
   by Alexei

 drivers/net/vxlan.c           | 90 ++++++++++++++++++++++++++++++++++++-------
 include/net/vxlan.h           | 81 +++++++++++++++++++++++++++++++++++---
 include/uapi/linux/if_link.h  |  1 +
 net/openvswitch/vport-vxlan.c |  9 +++--
 4 files changed, 160 insertions(+), 21 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 99df0d7..06f7196 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -126,6 +126,7 @@ struct vxlan_dev {
 	__u8		  tos;		/* TOS override */
 	__u8		  ttl;
 	u32		  flags;	/* VXLAN_F_* in vxlan.h */
+	u32		  exts;		/* Enabled extensions */
 
 	struct work_struct sock_work;
 	struct work_struct igmp_join;
@@ -620,7 +621,8 @@ static struct sk_buff **vxlan_gro_receive(struct sk_buff **head,
 			continue;
 
 		vh2 = (struct vxlanhdr *)(p->data + off_vx);
-		if (vh->vx_vni != vh2->vx_vni) {
+		if (vh->vx_flags != vh2->vx_flags ||
+		    vh->vx_vni != vh2->vx_vni) {
 			NAPI_GRO_CB(p)->same_flow = 0;
 			continue;
 		}
@@ -1183,6 +1185,7 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 	struct vxlan_sock *vs;
 	struct vxlanhdr *vxh;
 	u32 flags, vni;
+	struct vxlan_metadata md = {0};
 
 	/* Need Vxlan and inner Ethernet header to be present */
 	if (!pskb_may_pull(skb, VXLAN_HLEN))
@@ -1216,6 +1219,29 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 		vni &= VXLAN_VID_MASK;
 	}
 
+	/* For backwards compatibility, only allow reserved fields to be
+	 * used by VXLAN extensions if explicitly requested.
+	 */
+	if (vs->exts) {
+		if (flags & VXLAN_HF_GBP) {
+			struct vxlanhdr_gbp *gbp;
+
+			if (!(vs->exts & VXLAN_EXT_GBP))
+				goto bad_flags;
+
+			gbp = (struct vxlanhdr_gbp *)vxh;
+			md.gbp = ntohs(gbp->policy_id);
+
+			if (gbp->dont_learn)
+				md.gbp |= VXLAN_GBP_DONT_LEARN;
+
+			if (gbp->policy_applied)
+				md.gbp |= VXLAN_GBP_POLICY_APPLIED;
+
+			flags &= ~VXLAN_GBP_USED_BITS;
+		}
+	}
+
 	if (flags || (vni & ~VXLAN_VID_MASK)) {
 		/* If there are any unprocessed flags remaining treat
 		 * this as a malformed packet. This behavior diverges from
@@ -1229,7 +1255,8 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 		goto bad_flags;
 	}
 
-	vs->rcv(vs, skb, vxh->vx_vni);
+	md.vni = vxh->vx_vni;
+	vs->rcv(vs, skb, &md);
 	return 0;
 
 drop:
@@ -1246,8 +1273,8 @@ error:
 	return 1;
 }
 
-static void vxlan_rcv(struct vxlan_sock *vs,
-		      struct sk_buff *skb, __be32 vx_vni)
+static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb,
+		      struct vxlan_metadata *md)
 {
 	struct iphdr *oip = NULL;
 	struct ipv6hdr *oip6 = NULL;
@@ -1258,7 +1285,7 @@ static void vxlan_rcv(struct vxlan_sock *vs,
 	int err = 0;
 	union vxlan_addr *remote_ip;
 
-	vni = ntohl(vx_vni) >> 8;
+	vni = ntohl(md->vni) >> 8;
 	/* Is this VNI defined? */
 	vxlan = vxlan_vs_find_vni(vs, vni);
 	if (!vxlan)
@@ -1292,6 +1319,7 @@ static void vxlan_rcv(struct vxlan_sock *vs,
 		goto drop;
 
 	skb_reset_network_header(skb);
+	skb->mark = md->gbp;
 
 	if (oip6)
 		err = IP6_ECN_decapsulate(oip6, skb);
@@ -1641,13 +1669,30 @@ static bool route_shortcircuit(struct net_device *dev, struct sk_buff *skb)
 	return false;
 }
 
+static void vxlan_build_gbp_hdr(struct vxlanhdr *vxh, struct vxlan_sock *vs,
+				struct vxlan_metadata *md)
+{
+	struct vxlanhdr_gbp *gbp;
+
+	gbp = (struct vxlanhdr_gbp *)vxh;
+	vxh->vx_flags |= htonl(VXLAN_HF_GBP);
+
+	if (md->gbp & VXLAN_GBP_DONT_LEARN)
+		gbp->dont_learn = 1;
+
+	if (md->gbp & VXLAN_GBP_POLICY_APPLIED)
+		gbp->policy_applied = 1;
+
+	gbp->policy_id = htons(md->gbp & VXLAN_GBP_ID_MASK);
+}
+
 #if IS_ENABLED(CONFIG_IPV6)
 static int vxlan6_xmit_skb(struct vxlan_sock *vs,
 			   struct dst_entry *dst, struct sk_buff *skb,
 			   struct net_device *dev, struct in6_addr *saddr,
 			   struct in6_addr *daddr, __u8 prio, __u8 ttl,
-			   __be16 src_port, __be16 dst_port, __be32 vni,
-			   bool xnet)
+			   __be16 src_port, __be16 dst_port,
+			   struct vxlan_metadata *md, bool xnet)
 {
 	struct vxlanhdr *vxh;
 	int min_headroom;
@@ -1696,7 +1741,7 @@ static int vxlan6_xmit_skb(struct vxlan_sock *vs,
 
 	vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh));
 	vxh->vx_flags = htonl(VXLAN_HF_VNI);
-	vxh->vx_vni = vni;
+	vxh->vx_vni = md->vni;
 
 	if (type & SKB_GSO_TUNNEL_REMCSUM) {
 		u32 data = (skb_checksum_start_offset(skb) - hdrlen) >>
@@ -1714,6 +1759,9 @@ static int vxlan6_xmit_skb(struct vxlan_sock *vs,
 		}
 	}
 
+	if (vs->exts & VXLAN_EXT_GBP)
+		vxlan_build_gbp_hdr(vxh, vs, md);
+
 	skb_set_inner_protocol(skb, htons(ETH_P_TEB));
 
 	udp_tunnel6_xmit_skb(vs->sock, dst, skb, dev, saddr, daddr, prio,
@@ -1728,7 +1776,8 @@ err:
 int vxlan_xmit_skb(struct vxlan_sock *vs,
 		   struct rtable *rt, struct sk_buff *skb,
 		   __be32 src, __be32 dst, __u8 tos, __u8 ttl, __be16 df,
-		   __be16 src_port, __be16 dst_port, __be32 vni, bool xnet)
+		   __be16 src_port, __be16 dst_port,
+		   struct vxlan_metadata *md, bool xnet)
 {
 	struct vxlanhdr *vxh;
 	int min_headroom;
@@ -1771,7 +1820,7 @@ int vxlan_xmit_skb(struct vxlan_sock *vs,
 
 	vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh));
 	vxh->vx_flags = htonl(VXLAN_HF_VNI);
-	vxh->vx_vni = vni;
+	vxh->vx_vni = md->vni;
 
 	if (type & SKB_GSO_TUNNEL_REMCSUM) {
 		u32 data = (skb_checksum_start_offset(skb) - hdrlen) >>
@@ -1789,6 +1838,9 @@ int vxlan_xmit_skb(struct vxlan_sock *vs,
 		}
 	}
 
+	if (vs->exts & VXLAN_EXT_GBP)
+		vxlan_build_gbp_hdr(vxh, vs, md);
+
 	skb_set_inner_protocol(skb, htons(ETH_P_TEB));
 
 	return udp_tunnel_xmit_skb(vs->sock, rt, skb, src, dst, tos,
@@ -1849,6 +1901,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 	const struct iphdr *old_iph;
 	struct flowi4 fl4;
 	union vxlan_addr *dst;
+	struct vxlan_metadata md;
 	__be16 src_port = 0, dst_port;
 	u32 vni;
 	__be16 df = 0;
@@ -1919,11 +1972,12 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 
 		tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
 		ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
+		md.vni = htonl(vni << 8);
+		md.gbp = skb->mark;
 
 		err = vxlan_xmit_skb(vxlan->vn_sock, rt, skb,
 				     fl4.saddr, dst->sin.sin_addr.s_addr,
-				     tos, ttl, df, src_port, dst_port,
-				     htonl(vni << 8),
+				     tos, ttl, df, src_port, dst_port, &md,
 				     !net_eq(vxlan->net, dev_net(vxlan->dev)));
 		if (err < 0) {
 			/* skb is already freed. */
@@ -1976,10 +2030,12 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 		}
 
 		ttl = ttl ? : ip6_dst_hoplimit(ndst);
+		md.vni = htonl(vni << 8);
+		md.gbp = skb->mark;
 
 		err = vxlan6_xmit_skb(vxlan->vn_sock, ndst, skb,
 				      dev, &fl6.saddr, &fl6.daddr, 0, ttl,
-				      src_port, dst_port, htonl(vni << 8),
+				      src_port, dst_port, &md,
 				      !net_eq(vxlan->net, dev_net(vxlan->dev)));
 #endif
 	}
@@ -2382,6 +2438,7 @@ static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = {
 	[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]	= { .type = NLA_U8 },
 	[IFLA_VXLAN_REMCSUM_TX]	= { .type = NLA_U8 },
 	[IFLA_VXLAN_REMCSUM_RX]	= { .type = NLA_U8 },
+	[IFLA_VXLAN_GBP]	= { .type = NLA_FLAG, },
 };
 
 static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[])
@@ -2706,6 +2763,9 @@ static int vxlan_newlink(struct net *net, struct net_device *dev,
 	    nla_get_u8(data[IFLA_VXLAN_REMCSUM_RX]))
 		vxlan->flags |= VXLAN_F_REMCSUM_RX;
 
+	if (data[IFLA_VXLAN_GBP])
+		vxlan->exts |= VXLAN_EXT_GBP;
+
 	if (vxlan_find_vni(net, vni, use_ipv6 ? AF_INET6 : AF_INET,
 			   vxlan->dst_port)) {
 		pr_info("duplicate VNI %u\n", vni);
@@ -2851,6 +2911,10 @@ static int vxlan_fill_info(struct sk_buff *skb, const struct net_device *dev)
 	if (nla_put(skb, IFLA_VXLAN_PORT_RANGE, sizeof(ports), &ports))
 		goto nla_put_failure;
 
+	if (vxlan->exts & VXLAN_EXT_GBP &&
+	    nla_put_flag(skb, IFLA_VXLAN_GBP))
+		goto nla_put_failure;
+
 	return 0;
 
 nla_put_failure:
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 0a7443b..db02582 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -11,15 +11,76 @@
 #define VNI_HASH_BITS	10
 #define VNI_HASH_SIZE	(1<<VNI_HASH_BITS)
 
-/* VXLAN protocol header */
+/*
+ * VXLAN Group Based Policy Extension:
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |1|-|-|-|1|-|-|-|R|D|R|R|A|R|R|R|        Group Policy ID        |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |                VXLAN Network Identifier (VNI) |   Reserved    |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *
+ * D = Don't Learn bit. When set, this bit indicates that the egress
+ *     VTEP MUST NOT learn the source address of the encapsulated frame.
+ *
+ * A = Indicates that the group policy has already been applied to
+ *     this packet. Policies MUST NOT be applied by devices when the
+ *     A bit is set.
+ *
+ * [0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
+ */
+struct vxlanhdr_gbp {
+	__u8	vx_flags;
+#ifdef __LITTLE_ENDIAN_BITFIELD
+	__u8	reserved_flags1:3,
+		policy_applied:1,
+		reserved_flags2:2,
+		dont_learn:1,
+		reserved_flags3:1;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+	__u8	reserved_flags1:1,
+		dont_learn:1,
+		reserved_flags2:2,
+		policy_applied:1,
+		reserved_flags3:3;
+#else
+#error	"Please fix <asm/byteorder.h>"
+#endif
+	__be16	policy_id;
+	__be32	vx_vni;
+};
+
+#define VXLAN_GBP_USED_BITS (VXLAN_HF_GBP | 0xFFFFFF)
+
+/* skb->mark mapping
+ *
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |R|R|R|R|R|R|R|R|R|D|R|R|A|R|R|R|        Group Policy ID        |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ */
+#define VXLAN_GBP_DONT_LEARN		(BIT(6) << 16)
+#define VXLAN_GBP_POLICY_APPLIED	(BIT(3) << 16)
+#define VXLAN_GBP_ID_MASK		(0xFFFF)
+
+/* VXLAN protocol header:
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |G|R|R|R|I|R|R|C|               Reserved                        |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |                VXLAN Network Identifier (VNI) |   Reserved    |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *
+ * G = 1	Group Policy (VXLAN-GBP)
+ * I = 1	VXLAN Network Identifier (VNI) present
+ * C = 1	Remote checksum offload (RCO)
+ */
 struct vxlanhdr {
 	__be32 vx_flags;
 	__be32 vx_vni;
 };
 
 /* VXLAN header flags. */
-#define VXLAN_HF_VNI 0x08000000
-#define VXLAN_HF_RCO 0x00200000
+#define VXLAN_HF_RCO BIT(24)
+#define VXLAN_HF_VNI BIT(27)
+#define VXLAN_HF_GBP BIT(31)
 
 /* Remote checksum offload header option */
 #define VXLAN_RCO_MASK  0x7f    /* Last byte of vni field */
@@ -32,14 +93,23 @@ struct vxlanhdr {
 #define VXLAN_VID_MASK  (VXLAN_N_VID - 1)
 #define VXLAN_HLEN (sizeof(struct udphdr) + sizeof(struct vxlanhdr))
 
+struct vxlan_metadata {
+	__be32		vni;
+	u32		gbp;
+};
+
 struct vxlan_sock;
-typedef void (vxlan_rcv_t)(struct vxlan_sock *vh, struct sk_buff *skb, __be32 key);
+typedef void (vxlan_rcv_t)(struct vxlan_sock *vh, struct sk_buff *skb,
+			   struct vxlan_metadata *md);
+
+#define VXLAN_EXT_GBP			BIT(0)
 
 /* per UDP socket information */
 struct vxlan_sock {
 	struct hlist_node hlist;
 	vxlan_rcv_t	 *rcv;
 	void		 *data;
+	u32		  exts;
 	struct work_struct del_work;
 	struct socket	 *sock;
 	struct rcu_head	  rcu;
@@ -70,7 +140,8 @@ void vxlan_sock_release(struct vxlan_sock *vs);
 int vxlan_xmit_skb(struct vxlan_sock *vs,
 		   struct rtable *rt, struct sk_buff *skb,
 		   __be32 src, __be32 dst, __u8 tos, __u8 ttl, __be16 df,
-		   __be16 src_port, __be16 dst_port, __be32 vni, bool xnet);
+		   __be16 src_port, __be16 dst_port, struct vxlan_metadata *md,
+		   bool xnet);
 
 static inline netdev_features_t vxlan_features_check(struct sk_buff *skb,
 						     netdev_features_t features)
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index b2723f6..2a8380e 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -372,6 +372,7 @@ enum {
 	IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
 	IFLA_VXLAN_REMCSUM_TX,
 	IFLA_VXLAN_REMCSUM_RX,
+	IFLA_VXLAN_GBP,
 	__IFLA_VXLAN_MAX
 };
 #define IFLA_VXLAN_MAX	(__IFLA_VXLAN_MAX - 1)
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index d7c46b3..deed9e3 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -59,7 +59,8 @@ static inline struct vxlan_port *vxlan_vport(const struct vport *vport)
 }
 
 /* Called with rcu_read_lock and BH disabled. */
-static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
+static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb,
+		      struct vxlan_metadata *md)
 {
 	struct ovs_tunnel_info tun_info;
 	struct vport *vport = vs->data;
@@ -68,7 +69,7 @@ static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
 
 	/* Save outer tunnel values */
 	iph = ip_hdr(skb);
-	key = cpu_to_be64(ntohl(vx_vni) >> 8);
+	key = cpu_to_be64(ntohl(md->vni) >> 8);
 	ovs_flow_tun_info_init(&tun_info, iph,
 			       udp_hdr(skb)->source, udp_hdr(skb)->dest,
 			       key, TUNNEL_KEY, NULL, 0);
@@ -146,6 +147,7 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
 	struct vxlan_port *vxlan_port = vxlan_vport(vport);
 	__be16 dst_port = inet_sk(vxlan_port->vs->sock->sk)->inet_sport;
 	struct ovs_key_ipv4_tunnel *tun_key;
+	struct vxlan_metadata md = {0};
 	struct rtable *rt;
 	struct flowi4 fl;
 	__be16 src_port;
@@ -178,12 +180,13 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
 	skb->ignore_df = 1;
 
 	src_port = udp_flow_src_port(net, skb, 0, 0, true);
+	md.vni = htonl(be64_to_cpu(tun_key->tun_id) << 8);
 
 	err = vxlan_xmit_skb(vxlan_port->vs, rt, skb,
 			     fl.saddr, tun_key->ipv4_dst,
 			     tun_key->ipv4_tos, tun_key->ipv4_ttl, df,
 			     src_port, dst_port,
-			     htonl(be64_to_cpu(tun_key->tun_id) << 8),
+			     &md,
 			     false);
 	if (err < 0)
 		ip_rt_put(rt);
-- 
1.9.3


* [PATCH 5/5] openvswitch: Support VXLAN Group Policy extension
From: Thomas Graf @ 2015-01-15  0:10 UTC (permalink / raw)
  To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov,
	nicolas.dichtel
  Cc: netdev, dev
In-Reply-To: <cover.1421280226.git.tgraf@suug.ch>

Introduces support for the group policy extension to the VXLAN virtual
port. The extension is disabled by default and only enabled if the user
has provided the respective configuration.

  ovs-vsctl add-port br0 vxlan0 -- \
     set Interface vxlan0 type=vxlan options:exts=gbp

The configuration interface to enable the extension is based on a new
attribute OVS_VXLAN_EXT_GBP nested inside OVS_TUNNEL_ATTR_EXTENSION
which can carry additional extensions as needed in the future.

The group policy metadata is stored as a binary blob (struct ovs_vxlan_opts)
internally, just like Geneve options, but transported as nested Netlink
attributes to user space.

Renames the existing TUNNEL_OPTIONS_PRESENT to TUNNEL_GENEVE_OPT with the
binary value kept intact. A new flag TUNNEL_VXLAN_OPT is introduced.

The attributes OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and existing
OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS are implemented as mutually exclusive.
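
The flag split and the mutual exclusion can be modelled with a short
user-space sketch. It is only an approximation of the logic in
ipv4_tun_from_nlattr(): the flag values are written in host byte order
(the kernel uses __cpu_to_be16()), and enum opt_attr and
collect_opt_flags() are hypothetical names for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define TUN_GENEVE_OPT      0x0800u	/* value of the old TUNNEL_OPTIONS_PRESENT */
#define TUN_VXLAN_OPT       0x1000u
#define TUN_OPTIONS_PRESENT (TUN_GENEVE_OPT | TUN_VXLAN_OPT)

/* Hypothetical stand-ins for the OVS_TUNNEL_KEY_ATTR_* types. */
enum opt_attr {
	ATTR_OTHER = 0,
	ATTR_GENEVE_OPTS,
	ATTR_VXLAN_OPTS,
};

/* Scan a list of attributes and set the matching option flag.
 * Returns the option attribute type that was seen (0 if none), or -1 if
 * more than one metadata block was provided.
 */
static int collect_opt_flags(const enum opt_attr *attrs, size_t n,
			     uint16_t *tun_flags)
{
	int opts_type = 0;
	size_t i;

	*tun_flags = 0;
	for (i = 0; i < n; i++) {
		if (attrs[i] != ATTR_GENEVE_OPTS && attrs[i] != ATTR_VXLAN_OPTS)
			continue;

		/* A second metadata block of any flavour is rejected. */
		if (opts_type)
			return -1;

		*tun_flags |= (attrs[i] == ATTR_GENEVE_OPTS) ?
			      TUN_GENEVE_OPT : TUN_VXLAN_OPT;
		opts_type = attrs[i];
	}

	return opts_type;
}
```

Code that only needs to know whether any options are attached can keep
testing against the combined TUN_OPTIONS_PRESENT mask.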

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v4->v5:
 - No change
v3->v4:
 - Fixed OVS_VXLAN_EXT_MAX->OVS_VXLAN_EXT_GBP typo as spotted by Jesse
 - Only applied tunnel options if they are of the right type as
   suggested by Jesse
v2->v3:
 - No change
v1->v2:
 - Addressed Jesse's request to transport VXLAN options as Netlink
   attributes instead of a binary blob. Allows a partial transport of
   VXLAN extensions. Internally, the datapath continues to use a binary
   blob (defined in vport-vxlan.h) for performance reasons.
 - Added new TUNNEL_GENEVE_OPT and TUNNEL_VXLAN_OPT flags to mark
   tunnel option flavour
 - Correctly report VXLAN options to user space

 include/net/ip_tunnels.h         |   5 +-
 include/uapi/linux/openvswitch.h |  11 ++++
 net/openvswitch/flow_netlink.c   | 114 ++++++++++++++++++++++++++++++++++-----
 net/openvswitch/vport-geneve.c   |  15 ++++--
 net/openvswitch/vport-vxlan.c    |  82 +++++++++++++++++++++++++++-
 net/openvswitch/vport-vxlan.h    |  11 ++++
 6 files changed, 218 insertions(+), 20 deletions(-)
 create mode 100644 net/openvswitch/vport-vxlan.h

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 25a59eb..ce4db3c 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -97,7 +97,10 @@ struct ip_tunnel {
 #define TUNNEL_DONT_FRAGMENT    __cpu_to_be16(0x0100)
 #define TUNNEL_OAM		__cpu_to_be16(0x0200)
 #define TUNNEL_CRIT_OPT		__cpu_to_be16(0x0400)
-#define TUNNEL_OPTIONS_PRESENT	__cpu_to_be16(0x0800)
+#define TUNNEL_GENEVE_OPT	__cpu_to_be16(0x0800)
+#define TUNNEL_VXLAN_OPT	__cpu_to_be16(0x1000)
+
+#define TUNNEL_OPTIONS_PRESENT	(TUNNEL_GENEVE_OPT | TUNNEL_VXLAN_OPT)
 
 struct tnl_ptk_info {
 	__be16 flags;
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 3a6dcaa..e474c95 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -248,11 +248,21 @@ enum ovs_vport_attr {
 
 #define OVS_VPORT_ATTR_MAX (__OVS_VPORT_ATTR_MAX - 1)
 
+enum {
+	OVS_VXLAN_EXT_UNSPEC,
+	OVS_VXLAN_EXT_GBP,	/* Flag or __u32 */
+	__OVS_VXLAN_EXT_MAX,
+};
+
+#define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
+
+
 /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
  */
 enum {
 	OVS_TUNNEL_ATTR_UNSPEC,
 	OVS_TUNNEL_ATTR_DST_PORT, /* 16-bit UDP port, used by L4 tunnels. */
+	OVS_TUNNEL_ATTR_EXTENSION,
 	__OVS_TUNNEL_ATTR_MAX
 };
 
@@ -324,6 +334,7 @@ enum ovs_tunnel_key_attr {
 	OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS,        /* Array of Geneve options. */
 	OVS_TUNNEL_KEY_ATTR_TP_SRC,		/* be16 src Transport Port. */
 	OVS_TUNNEL_KEY_ATTR_TP_DST,		/* be16 dst Transport Port. */
+	OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS,		/* Nested OVS_VXLAN_EXT_* */
 	__OVS_TUNNEL_KEY_ATTR_MAX
 };
 
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 518941c..d210d1b 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -49,6 +49,7 @@
 #include <net/mpls.h>
 
 #include "flow_netlink.h"
+#include "vport-vxlan.h"
 
 struct ovs_len_tbl {
 	int len;
@@ -268,6 +269,9 @@ size_t ovs_tun_key_attr_size(void)
 		+ nla_total_size(0)    /* OVS_TUNNEL_KEY_ATTR_CSUM */
 		+ nla_total_size(0)    /* OVS_TUNNEL_KEY_ATTR_OAM */
 		+ nla_total_size(256)  /* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS */
+		/* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS is mutually exclusive with
+		 * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
+		 */
 		+ nla_total_size(2)    /* OVS_TUNNEL_KEY_ATTR_TP_SRC */
 		+ nla_total_size(2);   /* OVS_TUNNEL_KEY_ATTR_TP_DST */
 }
@@ -308,6 +312,7 @@ static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
 	[OVS_TUNNEL_KEY_ATTR_TP_DST]	    = { .len = sizeof(u16) },
 	[OVS_TUNNEL_KEY_ATTR_OAM]	    = { .len = 0 },
 	[OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS]   = { .len = OVS_ATTR_NESTED },
+	[OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS]    = { .len = OVS_ATTR_NESTED },
 };
 
 /* The size of the argument for each %OVS_KEY_ATTR_* Netlink attribute.  */
@@ -460,6 +465,41 @@ static int genev_tun_opt_from_nlattr(const struct nlattr *a,
 	return 0;
 }
 
+static const struct nla_policy vxlan_opt_policy[OVS_VXLAN_EXT_MAX + 1] = {
+	[OVS_VXLAN_EXT_GBP]	= { .type = NLA_U32 },
+};
+
+static int vxlan_tun_opt_from_nlattr(const struct nlattr *a,
+				     struct sw_flow_match *match, bool is_mask,
+				     bool log)
+{
+	struct nlattr *tb[OVS_VXLAN_EXT_MAX+1];
+	unsigned long opt_key_offset;
+	struct ovs_vxlan_opts opts;
+	int err;
+
+	BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts));
+
+	err = nla_parse_nested(tb, OVS_VXLAN_EXT_MAX, a, vxlan_opt_policy);
+	if (err < 0)
+		return err;
+
+	memset(&opts, 0, sizeof(opts));
+
+	if (tb[OVS_VXLAN_EXT_GBP])
+		opts.gbp = nla_get_u32(tb[OVS_VXLAN_EXT_GBP]);
+
+	if (!is_mask)
+		SW_FLOW_KEY_PUT(match, tun_opts_len, sizeof(opts), false);
+	else
+		SW_FLOW_KEY_PUT(match, tun_opts_len, 0xff, true);
+
+	opt_key_offset = TUN_METADATA_OFFSET(sizeof(opts));
+	SW_FLOW_KEY_MEMCPY_OFFSET(match, opt_key_offset, &opts, sizeof(opts),
+				  is_mask);
+	return 0;
+}
+
 static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 				struct sw_flow_match *match, bool is_mask,
 				bool log)
@@ -468,6 +508,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 	int rem;
 	bool ttl = false;
 	__be16 tun_flags = 0;
+	int opts_type = 0;
 
 	nla_for_each_nested(a, attr, rem) {
 		int type = nla_type(a);
@@ -527,11 +568,30 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 			tun_flags |= TUNNEL_OAM;
 			break;
 		case OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS:
+			if (opts_type) {
+				OVS_NLERR(log, "Multiple metadata blocks provided");
+				return -EINVAL;
+			}
+
 			err = genev_tun_opt_from_nlattr(a, match, is_mask, log);
 			if (err)
 				return err;
 
-			tun_flags |= TUNNEL_OPTIONS_PRESENT;
+			tun_flags |= TUNNEL_GENEVE_OPT;
+			opts_type = type;
+			break;
+		case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS:
+			if (opts_type) {
+				OVS_NLERR(log, "Multiple metadata blocks provided");
+				return -EINVAL;
+			}
+
+			err = vxlan_tun_opt_from_nlattr(a, match, is_mask, log);
+			if (err)
+				return err;
+
+			tun_flags |= TUNNEL_VXLAN_OPT;
+			opts_type = type;
 			break;
 		default:
 			OVS_NLERR(log, "Unknown IPv4 tunnel attribute %d",
@@ -560,6 +620,23 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 		}
 	}
 
+	return opts_type;
+}
+
+static int vxlan_opt_to_nlattr(struct sk_buff *skb,
+			       const void *tun_opts, int swkey_tun_opts_len)
+{
+	const struct ovs_vxlan_opts *opts = tun_opts;
+	struct nlattr *nla;
+
+	nla = nla_nest_start(skb, OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS);
+	if (!nla)
+		return -EMSGSIZE;
+
+	if (nla_put_u32(skb, OVS_VXLAN_EXT_GBP, opts->gbp) < 0)
+		return -EMSGSIZE;
+
+	nla_nest_end(skb, nla);
 	return 0;
 }
 
@@ -596,10 +673,15 @@ static int __ipv4_tun_to_nlattr(struct sk_buff *skb,
 	if ((output->tun_flags & TUNNEL_OAM) &&
 	    nla_put_flag(skb, OVS_TUNNEL_KEY_ATTR_OAM))
 		return -EMSGSIZE;
-	if (tun_opts &&
-	    nla_put(skb, OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS,
-		    swkey_tun_opts_len, tun_opts))
-		return -EMSGSIZE;
+	if (tun_opts) {
+		if (output->tun_flags & TUNNEL_GENEVE_OPT &&
+		    nla_put(skb, OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS,
+			    swkey_tun_opts_len, tun_opts))
+			return -EMSGSIZE;
+		else if (output->tun_flags & TUNNEL_VXLAN_OPT &&
+			 vxlan_opt_to_nlattr(skb, tun_opts, swkey_tun_opts_len))
+			return -EMSGSIZE;
+	}
 
 	return 0;
 }
@@ -680,7 +762,7 @@ static int metadata_from_nlattrs(struct sw_flow_match *match,  u64 *attrs,
 	}
 	if (*attrs & (1 << OVS_KEY_ATTR_TUNNEL)) {
 		if (ipv4_tun_from_nlattr(a[OVS_KEY_ATTR_TUNNEL], match,
-					 is_mask, log))
+					 is_mask, log) < 0)
 			return -EINVAL;
 		*attrs &= ~(1 << OVS_KEY_ATTR_TUNNEL);
 	}
@@ -1578,17 +1660,23 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 	struct sw_flow_key key;
 	struct ovs_tunnel_info *tun_info;
 	struct nlattr *a;
-	int err, start;
+	int err, start, opts_type;
 
 	ovs_match_init(&match, &key, NULL);
-	err = ipv4_tun_from_nlattr(nla_data(attr), &match, false, log);
-	if (err)
-		return err;
+	opts_type = ipv4_tun_from_nlattr(nla_data(attr), &match, false, log);
+	if (opts_type < 0)
+		return opts_type;
 
 	if (key.tun_opts_len) {
-		err = validate_geneve_opts(&key);
-		if (err < 0)
-			return err;
+		switch (opts_type) {
+		case OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS:
+			err = validate_geneve_opts(&key);
+			if (err < 0)
+				return err;
+			break;
+		case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS:
+			break;
+		}
 	};
 
 	start = add_nested_action_start(sfa, OVS_ACTION_ATTR_SET, log);
diff --git a/net/openvswitch/vport-geneve.c b/net/openvswitch/vport-geneve.c
index 2daf144..17b0840 100644
--- a/net/openvswitch/vport-geneve.c
+++ b/net/openvswitch/vport-geneve.c
@@ -88,7 +88,7 @@ static void geneve_rcv(struct geneve_sock *gs, struct sk_buff *skb)
 
 	opts_len = geneveh->opt_len * 4;
 
-	flags = TUNNEL_KEY | TUNNEL_OPTIONS_PRESENT |
+	flags = TUNNEL_KEY | TUNNEL_GENEVE_OPT |
 		(udp_hdr(skb)->check != 0 ? TUNNEL_CSUM : 0) |
 		(geneveh->oam ? TUNNEL_OAM : 0) |
 		(geneveh->critical ? TUNNEL_CRIT_OPT : 0);
@@ -178,7 +178,7 @@ static int geneve_tnl_send(struct vport *vport, struct sk_buff *skb)
 	__be16 sport;
 	struct rtable *rt;
 	struct flowi4 fl;
-	u8 vni[3];
+	u8 vni[3], opts_len, *opts;
 	__be16 df;
 	int err;
 
@@ -209,11 +209,18 @@ static int geneve_tnl_send(struct vport *vport, struct sk_buff *skb)
 	tunnel_id_to_vni(tun_key->tun_id, vni);
 	skb->ignore_df = 1;
 
+	if (tun_key->tun_flags & TUNNEL_GENEVE_OPT) {
+		opts = (u8 *)tun_info->options;
+		opts_len = tun_info->options_len;
+	} else {
+		opts = NULL;
+		opts_len = 0;
+	}
+
 	err = geneve_xmit_skb(geneve_port->gs, rt, skb, fl.saddr,
 			      tun_key->ipv4_dst, tun_key->ipv4_tos,
 			      tun_key->ipv4_ttl, df, sport, dport,
-			      tun_key->tun_flags, vni,
-			      tun_info->options_len, (u8 *)tun_info->options,
+			      tun_key->tun_flags, vni, opts_len, opts,
 			      false);
 	if (err < 0)
 		ip_rt_put(rt);
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index 40a16fb..9f47c23 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -40,6 +40,7 @@
 
 #include "datapath.h"
 #include "vport.h"
+#include "vport-vxlan.h"
 
 /**
  * struct vxlan_port - Keeps track of open UDP ports
@@ -49,6 +50,7 @@
 struct vxlan_port {
 	struct vxlan_sock *vs;
 	char name[IFNAMSIZ];
+	u32 exts; /* VXLAN_EXT_* in <net/vxlan.h> */
 };
 
 static struct vport_ops ovs_vxlan_vport_ops;
@@ -63,16 +65,26 @@ static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb,
 		      struct vxlan_metadata *md)
 {
 	struct ovs_tunnel_info tun_info;
+	struct vxlan_port *vxlan_port;
 	struct vport *vport = vs->data;
 	struct iphdr *iph;
+	struct ovs_vxlan_opts opts = {
+		.gbp = md->gbp,
+	};
 	__be64 key;
+	__be16 flags;
+
+	flags = TUNNEL_KEY;
+	vxlan_port = vxlan_vport(vport);
+	if (vxlan_port->exts & VXLAN_EXT_GBP)
+		flags |= TUNNEL_VXLAN_OPT;
 
 	/* Save outer tunnel values */
 	iph = ip_hdr(skb);
 	key = cpu_to_be64(ntohl(md->vni) >> 8);
 	ovs_flow_tun_info_init(&tun_info, iph,
 			       udp_hdr(skb)->source, udp_hdr(skb)->dest,
-			       key, TUNNEL_KEY, NULL, 0);
+			       key, flags, &opts, sizeof(opts));
 
 	ovs_vport_receive(vport, skb, &tun_info);
 }
@@ -84,6 +96,21 @@ static int vxlan_get_options(const struct vport *vport, struct sk_buff *skb)
 
 	if (nla_put_u16(skb, OVS_TUNNEL_ATTR_DST_PORT, ntohs(dst_port)))
 		return -EMSGSIZE;
+
+	if (vxlan_port->exts) {
+		struct nlattr *exts;
+
+		exts = nla_nest_start(skb, OVS_TUNNEL_ATTR_EXTENSION);
+		if (!exts)
+			return -EMSGSIZE;
+
+		if (vxlan_port->exts & VXLAN_EXT_GBP &&
+		    nla_put_flag(skb, OVS_VXLAN_EXT_GBP))
+			return -EMSGSIZE;
+
+		nla_nest_end(skb, exts);
+	}
+
 	return 0;
 }
 
@@ -96,6 +123,31 @@ static void vxlan_tnl_destroy(struct vport *vport)
 	ovs_vport_deferred_free(vport);
 }
 
+static const struct nla_policy exts_policy[OVS_VXLAN_EXT_MAX+1] = {
+	[OVS_VXLAN_EXT_GBP]	= { .type = NLA_FLAG, },
+};
+
+static int vxlan_configure_exts(struct vport *vport, struct nlattr *attr)
+{
+	struct nlattr *exts[OVS_VXLAN_EXT_MAX+1];
+	struct vxlan_port *vxlan_port;
+	int err;
+
+	if (nla_len(attr) < sizeof(struct nlattr))
+		return -EINVAL;
+
+	err = nla_parse_nested(exts, OVS_VXLAN_EXT_MAX, attr, exts_policy);
+	if (err < 0)
+		return err;
+
+	vxlan_port = vxlan_vport(vport);
+
+	if (exts[OVS_VXLAN_EXT_GBP])
+		vxlan_port->exts |= VXLAN_EXT_GBP;
+
+	return 0;
+}
+
 static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
 {
 	struct net *net = ovs_dp_get_net(parms->dp);
@@ -128,7 +180,17 @@ static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
 	vxlan_port = vxlan_vport(vport);
 	strncpy(vxlan_port->name, parms->name, IFNAMSIZ);
 
-	vs = vxlan_sock_add(net, htons(dst_port), vxlan_rcv, vport, true, 0, 0);
+	a = nla_find_nested(options, OVS_TUNNEL_ATTR_EXTENSION);
+	if (a) {
+		err = vxlan_configure_exts(vport, a);
+		if (err) {
+			ovs_vport_free(vport);
+			goto error;
+		}
+	}
+
+	vs = vxlan_sock_add(net, htons(dst_port), vxlan_rcv, vport, true, 0,
+			    vxlan_port->exts);
 	if (IS_ERR(vs)) {
 		ovs_vport_free(vport);
 		return (void *)vs;
@@ -141,6 +203,21 @@ error:
 	return ERR_PTR(err);
 }
 
+static int vxlan_ext_gbp(struct sk_buff *skb)
+{
+	const struct ovs_tunnel_info *tun_info;
+	const struct ovs_vxlan_opts *opts;
+
+	tun_info = OVS_CB(skb)->egress_tun_info;
+	opts = tun_info->options;
+
+	if (tun_info->tunnel.tun_flags & TUNNEL_VXLAN_OPT &&
+	    tun_info->options_len >= sizeof(*opts))
+		return opts->gbp;
+	else
+		return 0;
+}
+
 static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
 {
 	struct net *net = ovs_dp_get_net(vport->dp);
@@ -181,6 +258,7 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
 
 	src_port = udp_flow_src_port(net, skb, 0, 0, true);
 	md.vni = htonl(be64_to_cpu(tun_key->tun_id) << 8);
+	md.gbp = vxlan_ext_gbp(skb);
 
 	err = vxlan_xmit_skb(vxlan_port->vs, rt, skb,
 			     fl.saddr, tun_key->ipv4_dst,
diff --git a/net/openvswitch/vport-vxlan.h b/net/openvswitch/vport-vxlan.h
new file mode 100644
index 0000000..4b08233e
--- /dev/null
+++ b/net/openvswitch/vport-vxlan.h
@@ -0,0 +1,11 @@
+#ifndef VPORT_VXLAN_H
+#define VPORT_VXLAN_H 1
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+struct ovs_vxlan_opts {
+	__u32 gbp;
+};
+
+#endif
-- 
1.9.3

^ permalink raw reply related

* [PATCH 4/5] openvswitch: Allow for any level of nesting in flow attributes
From: Thomas Graf @ 2015-01-15  0:10 UTC (permalink / raw)
  To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov,
	nicolas.dichtel
  Cc: netdev, dev
In-Reply-To: <cover.1421280226.git.tgraf@suug.ch>

nlattr_set() is currently hardcoded to two levels of nesting. This change
introduces struct ovs_len_tbl to define minimal length requirements plus
next level nesting tables to traverse the key attributes to arbitrary depth.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v4->v5:
 - No change
v3->v4:
 - No change. The spotted bug is unrelated to this series and will be fixed
   in a separate patch
v2->v3:
 - No change
v1->v2:
 - New patch to allow nested Netlink attributes inside
   OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS
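
The table-driven recursion described above can be sketched in userspace C.
This is only an illustration, not the kernel code: the 4-byte header layout
mimics struct nlattr, and put_attr(), attr_set(), struct len_tbl and the
K_*/T_*/O_* attribute types are hypothetical names made up for the example.
As in the patch, the walker assumes the stream was validated beforehand, so
attribute types are not range-checked against the table.

```c
#include <stdint.h>
#include <string.h>

#define HDRLEN     4                 /* u16 len + u16 type, like struct nlattr */
#define ALIGN4(n)  (((n) + 3) & ~3)
#define LEN_NESTED (-1)

struct len_tbl {
	int len;                     /* exact payload length, or LEN_NESTED */
	const struct len_tbl *next;  /* table for the children, may be NULL */
};

/* Hypothetical attribute types, three levels deep. */
enum { K_MARK, K_TUNNEL };
enum { T_ID, T_OPTS };
enum { O_GBP };

static const struct len_tbl opt_lens[] = {
	[O_GBP]    = { .len = 4 },
};
static const struct len_tbl tun_lens[] = {
	[T_ID]     = { .len = 8 },
	[T_OPTS]   = { .len = LEN_NESTED, .next = opt_lens },
};
static const struct len_tbl key_lens[] = {
	[K_MARK]   = { .len = 4 },
	[K_TUNNEL] = { .len = LEN_NESTED, .next = tun_lens },
};

/* Append one attribute at buf and return the padded number of bytes used.
 * payload must not overlap buf.
 */
static int put_attr(unsigned char *buf, uint16_t type,
		    const void *payload, uint16_t plen)
{
	uint16_t len = HDRLEN + plen;

	memset(buf, 0, ALIGN4(len));
	memcpy(buf, &len, sizeof(len));
	memcpy(buf + 2, &type, sizeof(type));
	if (plen)
		memcpy(buf + HDRLEN, payload, plen);
	return ALIGN4(len);
}

/* Set every leaf payload byte in the stream to val. Nested attributes are
 * descended into using the next-level table; with no table, the payload is
 * treated as a leaf and overwritten as a whole.
 */
static void attr_set(unsigned char *stream, int rem, uint8_t val,
		     const struct len_tbl *tbl)
{
	while (rem >= HDRLEN) {
		uint16_t len, type;
		int step;

		memcpy(&len, stream, sizeof(len));
		memcpy(&type, stream + 2, sizeof(type));
		if (len < HDRLEN || len > rem)
			break;

		if (tbl && tbl[type].len == LEN_NESTED)
			attr_set(stream + HDRLEN, len - HDRLEN, val,
				 tbl[type].next);
		else
			memset(stream + HDRLEN, val, len - HDRLEN);

		step = ALIGN4(len) > rem ? rem : ALIGN4(len);
		stream += step;
		rem -= step;
	}
}
```

Calling attr_set(key, key_len, 0xff, key_lens) on a stream built with
put_attr() fills the leaf payloads at every depth while leaving the headers
of nested attributes intact, which is the property the mask handling relies
on.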

 net/openvswitch/flow_netlink.c | 106 ++++++++++++++++++++++-------------------
 1 file changed, 56 insertions(+), 50 deletions(-)

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 2e8a9cd..518941c 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -50,6 +50,13 @@
 
 #include "flow_netlink.h"
 
+struct ovs_len_tbl {
+	int len;
+	const struct ovs_len_tbl *next;
+};
+
+#define OVS_ATTR_NESTED -1
+
 static void update_range(struct sw_flow_match *match,
 			 size_t offset, size_t size, bool is_mask)
 {
@@ -289,29 +296,44 @@ size_t ovs_key_attr_size(void)
 		+ nla_total_size(28); /* OVS_KEY_ATTR_ND */
 }
 
+static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1] = {
+	[OVS_TUNNEL_KEY_ATTR_ID]	    = { .len = sizeof(u64) },
+	[OVS_TUNNEL_KEY_ATTR_IPV4_SRC]	    = { .len = sizeof(u32) },
+	[OVS_TUNNEL_KEY_ATTR_IPV4_DST]	    = { .len = sizeof(u32) },
+	[OVS_TUNNEL_KEY_ATTR_TOS]	    = { .len = 1 },
+	[OVS_TUNNEL_KEY_ATTR_TTL]	    = { .len = 1 },
+	[OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT] = { .len = 0 },
+	[OVS_TUNNEL_KEY_ATTR_CSUM]	    = { .len = 0 },
+	[OVS_TUNNEL_KEY_ATTR_TP_SRC]	    = { .len = sizeof(u16) },
+	[OVS_TUNNEL_KEY_ATTR_TP_DST]	    = { .len = sizeof(u16) },
+	[OVS_TUNNEL_KEY_ATTR_OAM]	    = { .len = 0 },
+	[OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS]   = { .len = OVS_ATTR_NESTED },
+};
+
 /* The size of the argument for each %OVS_KEY_ATTR_* Netlink attribute.  */
-static const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
-	[OVS_KEY_ATTR_ENCAP] = -1,
-	[OVS_KEY_ATTR_PRIORITY] = sizeof(u32),
-	[OVS_KEY_ATTR_IN_PORT] = sizeof(u32),
-	[OVS_KEY_ATTR_SKB_MARK] = sizeof(u32),
-	[OVS_KEY_ATTR_ETHERNET] = sizeof(struct ovs_key_ethernet),
-	[OVS_KEY_ATTR_VLAN] = sizeof(__be16),
-	[OVS_KEY_ATTR_ETHERTYPE] = sizeof(__be16),
-	[OVS_KEY_ATTR_IPV4] = sizeof(struct ovs_key_ipv4),
-	[OVS_KEY_ATTR_IPV6] = sizeof(struct ovs_key_ipv6),
-	[OVS_KEY_ATTR_TCP] = sizeof(struct ovs_key_tcp),
-	[OVS_KEY_ATTR_TCP_FLAGS] = sizeof(__be16),
-	[OVS_KEY_ATTR_UDP] = sizeof(struct ovs_key_udp),
-	[OVS_KEY_ATTR_SCTP] = sizeof(struct ovs_key_sctp),
-	[OVS_KEY_ATTR_ICMP] = sizeof(struct ovs_key_icmp),
-	[OVS_KEY_ATTR_ICMPV6] = sizeof(struct ovs_key_icmpv6),
-	[OVS_KEY_ATTR_ARP] = sizeof(struct ovs_key_arp),
-	[OVS_KEY_ATTR_ND] = sizeof(struct ovs_key_nd),
-	[OVS_KEY_ATTR_RECIRC_ID] = sizeof(u32),
-	[OVS_KEY_ATTR_DP_HASH] = sizeof(u32),
-	[OVS_KEY_ATTR_TUNNEL] = -1,
-	[OVS_KEY_ATTR_MPLS] = sizeof(struct ovs_key_mpls),
+static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
+	[OVS_KEY_ATTR_ENCAP]	 = { .len = OVS_ATTR_NESTED },
+	[OVS_KEY_ATTR_PRIORITY]	 = { .len = sizeof(u32) },
+	[OVS_KEY_ATTR_IN_PORT]	 = { .len = sizeof(u32) },
+	[OVS_KEY_ATTR_SKB_MARK]	 = { .len = sizeof(u32) },
+	[OVS_KEY_ATTR_ETHERNET]	 = { .len = sizeof(struct ovs_key_ethernet) },
+	[OVS_KEY_ATTR_VLAN]	 = { .len = sizeof(__be16) },
+	[OVS_KEY_ATTR_ETHERTYPE] = { .len = sizeof(__be16) },
+	[OVS_KEY_ATTR_IPV4]	 = { .len = sizeof(struct ovs_key_ipv4) },
+	[OVS_KEY_ATTR_IPV6]	 = { .len = sizeof(struct ovs_key_ipv6) },
+	[OVS_KEY_ATTR_TCP]	 = { .len = sizeof(struct ovs_key_tcp) },
+	[OVS_KEY_ATTR_TCP_FLAGS] = { .len = sizeof(__be16) },
+	[OVS_KEY_ATTR_UDP]	 = { .len = sizeof(struct ovs_key_udp) },
+	[OVS_KEY_ATTR_SCTP]	 = { .len = sizeof(struct ovs_key_sctp) },
+	[OVS_KEY_ATTR_ICMP]	 = { .len = sizeof(struct ovs_key_icmp) },
+	[OVS_KEY_ATTR_ICMPV6]	 = { .len = sizeof(struct ovs_key_icmpv6) },
+	[OVS_KEY_ATTR_ARP]	 = { .len = sizeof(struct ovs_key_arp) },
+	[OVS_KEY_ATTR_ND]	 = { .len = sizeof(struct ovs_key_nd) },
+	[OVS_KEY_ATTR_RECIRC_ID] = { .len = sizeof(u32) },
+	[OVS_KEY_ATTR_DP_HASH]	 = { .len = sizeof(u32) },
+	[OVS_KEY_ATTR_TUNNEL]	 = { .len = OVS_ATTR_NESTED,
+				     .next = ovs_tunnel_key_lens, },
+	[OVS_KEY_ATTR_MPLS]	 = { .len = sizeof(struct ovs_key_mpls) },
 };
 
 static bool is_all_zero(const u8 *fp, size_t size)
@@ -352,8 +374,8 @@ static int __parse_flow_nlattrs(const struct nlattr *attr,
 			return -EINVAL;
 		}
 
-		expected_len = ovs_key_lens[type];
-		if (nla_len(nla) != expected_len && expected_len != -1) {
+		expected_len = ovs_key_lens[type].len;
+		if (nla_len(nla) != expected_len && expected_len != OVS_ATTR_NESTED) {
 			OVS_NLERR(log, "Key %d has unexpected len %d expected %d",
 				  type, nla_len(nla), expected_len);
 			return -EINVAL;
@@ -451,30 +473,16 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 		int type = nla_type(a);
 		int err;
 
-		static const u32 ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1] = {
-			[OVS_TUNNEL_KEY_ATTR_ID] = sizeof(u64),
-			[OVS_TUNNEL_KEY_ATTR_IPV4_SRC] = sizeof(u32),
-			[OVS_TUNNEL_KEY_ATTR_IPV4_DST] = sizeof(u32),
-			[OVS_TUNNEL_KEY_ATTR_TOS] = 1,
-			[OVS_TUNNEL_KEY_ATTR_TTL] = 1,
-			[OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT] = 0,
-			[OVS_TUNNEL_KEY_ATTR_CSUM] = 0,
-			[OVS_TUNNEL_KEY_ATTR_TP_SRC] = sizeof(u16),
-			[OVS_TUNNEL_KEY_ATTR_TP_DST] = sizeof(u16),
-			[OVS_TUNNEL_KEY_ATTR_OAM] = 0,
-			[OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS] = -1,
-		};
-
 		if (type > OVS_TUNNEL_KEY_ATTR_MAX) {
 			OVS_NLERR(log, "Tunnel attr %d out of range max %d",
 				  type, OVS_TUNNEL_KEY_ATTR_MAX);
 			return -EINVAL;
 		}
 
-		if (ovs_tunnel_key_lens[type] != nla_len(a) &&
-		    ovs_tunnel_key_lens[type] != -1) {
+		if (ovs_tunnel_key_lens[type].len != nla_len(a) &&
+		    ovs_tunnel_key_lens[type].len != OVS_ATTR_NESTED) {
 			OVS_NLERR(log, "Tunnel attr %d has unexpected len %d expected %d",
-				  type, nla_len(a), ovs_tunnel_key_lens[type]);
+				  type, nla_len(a), ovs_tunnel_key_lens[type].len);
 			return -EINVAL;
 		}
 
@@ -912,18 +920,16 @@ static int ovs_key_from_nlattrs(struct sw_flow_match *match, u64 attrs,
 	return 0;
 }
 
-static void nlattr_set(struct nlattr *attr, u8 val, bool is_attr_mask_key)
+static void nlattr_set(struct nlattr *attr, u8 val,
+		       const struct ovs_len_tbl *tbl)
 {
 	struct nlattr *nla;
 	int rem;
 
 	/* The nlattr stream should already have been validated */
 	nla_for_each_nested(nla, attr, rem) {
-		/* We assume that ovs_key_lens[type] == -1 means that type is a
-		 * nested attribute
-		 */
-		if (is_attr_mask_key && ovs_key_lens[nla_type(nla)] == -1)
-			nlattr_set(nla, val, false);
+		if (tbl && tbl[nla_type(nla)].len == OVS_ATTR_NESTED)
+			nlattr_set(nla, val, tbl[nla_type(nla)].next);
 		else
 			memset(nla_data(nla), val, nla_len(nla));
 	}
@@ -931,7 +937,7 @@ static void nlattr_set(struct nlattr *attr, u8 val, bool is_attr_mask_key)
 
 static void mask_set_nlattr(struct nlattr *attr, u8 val)
 {
-	nlattr_set(attr, val, true);
+	nlattr_set(attr, val, ovs_key_lens);
 }
 
 /**
@@ -1628,8 +1634,8 @@ static int validate_set(const struct nlattr *a,
 		return -EINVAL;
 
 	if (key_type > OVS_KEY_ATTR_MAX ||
-	    (ovs_key_lens[key_type] != nla_len(ovs_key) &&
-	     ovs_key_lens[key_type] != -1))
+	    (ovs_key_lens[key_type].len != nla_len(ovs_key) &&
+	     ovs_key_lens[key_type].len != OVS_ATTR_NESTED))
 		return -EINVAL;
 
 	switch (key_type) {
-- 
1.9.3

^ permalink raw reply related

* Re: [PATCH RFC v2 net-next 1/2] ip_tunnel: Create percpu gro_cell
From: Eric Dumazet @ 2015-01-15  0:14 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, kernel-team
In-Reply-To: <1421192564-437455-2-git-send-email-kafai@fb.com>

On Tue, 2015-01-13 at 15:42 -0800, Martin KaFai Lau wrote:
> In the ipip tunnel, the skb->queue_mapping is lost in ipip_rcv().
> All skb will be queued to the same cell->napi_skbs.  The
> gro_cell_poll is pinned to one core under load.  In production traffic,
> we also see severe rx_dropped in the tunl iface and it is probably due to
> this limit: skb_queue_len(&cell->napi_skbs) > netdev_max_backlog.
> 
> This patch is trying to alloc_percpu(struct gro_cell) and schedule
> gro_cell_poll to process the skb in the same core.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH 1/5] vxlan: Group Policy extension
From: Tom Herbert @ 2015-01-15  0:18 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David Miller, Jesse Gross, Stephen Hemminger, Pravin B Shelar,
	Alexei Starovoitov, Nicolas Dichtel, Linux Netdev List,
	dev@openvswitch.org
In-Reply-To: <c73939bbb55a8e450de7931d830e0467a8340665.1421280226.git.tgraf@suug.ch>

> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 99df0d7..06f7196 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -126,6 +126,7 @@ struct vxlan_dev {
>         __u8              tos;          /* TOS override */
>         __u8              ttl;
>         u32               flags;        /* VXLAN_F_* in vxlan.h */
> +       u32               exts;         /* Enabled extensions */
>

Thomas, why not just make a VXLAN_F_GBP flag? Then this setting can be
saved in the flags for vxlan_dev and vxlan_sock so no exts field.

Tom


>         struct work_struct sock_work;
>         struct work_struct igmp_join;
> @@ -620,7 +621,8 @@ static struct sk_buff **vxlan_gro_receive(struct sk_buff **head,
>                         continue;
>

^ permalink raw reply

* Re: [PATCH 1/5] vxlan: Group Policy extension
From: Thomas Graf @ 2015-01-15  0:23 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Miller, Jesse Gross, Stephen Hemminger, Pravin B Shelar,
	Alexei Starovoitov, Nicolas Dichtel, Linux Netdev List,
	dev@openvswitch.org
In-Reply-To: <CA+mtBx-CH2=D-wKmkjVTvhQOWhbf73NxcGrXrXOPE-E2PzJajQ@mail.gmail.com>

On 01/14/15 at 04:18pm, Tom Herbert wrote:
> > diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> > index 99df0d7..06f7196 100644
> > --- a/drivers/net/vxlan.c
> > +++ b/drivers/net/vxlan.c
> > @@ -126,6 +126,7 @@ struct vxlan_dev {
> >         __u8              tos;          /* TOS override */
> >         __u8              ttl;
> >         u32               flags;        /* VXLAN_F_* in vxlan.h */
> > +       u32               exts;         /* Enabled extensions */
> >
> 
> Thomas, why not just make a VXLAN_F_GBP flag? Then this setting can be
> saved in the flags for vxlan_dev and vxlan_sock so no exts field.

Because we need to compare enabled extensions in vxlan_find_sock() to
make sure we are not sharing a VXLAN socket with extensions enabled
with a user which does not have the same extensions enabled.

However, we do not want vxlan_find_sock() to compare all flags.

So we need a bitmap that is ignored during the share check (flags) and
a bitmap that must match to allow sharing (exts).

The RCO extension currently suffers from this bug, which causes a
compatibility issue. I explained this in the thread of your patch. I was
under the impression that you would either send a v2 or fix it in a
follow-up.
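
To illustrate the split between the two bitmaps, here is a minimal
userspace sketch. It is not the actual vxlan_find_sock(); struct
sock_entry, find_sock() and the F_LEARN/EXT_GBP values are hypothetical
and only model the idea that exts must match exactly for a socket to be
shared while flags are ignored by the lookup.

```c
#include <stddef.h>
#include <stdint.h>

#define F_LEARN 0x1u   /* hypothetical per-device flag, ignored by lookup */
#define EXT_GBP 0x1u   /* hypothetical extension bit, must match to share */

struct sock_entry {
	uint16_t port;
	uint32_t flags;    /* not compared when looking for a shareable socket */
	uint32_t exts;     /* compared: changes how received frames are parsed */
};

/* Return an existing socket on the port only if its enabled extensions are
 * identical to the requested ones; otherwise the caller cannot share it.
 */
static struct sock_entry *find_sock(struct sock_entry *tbl, size_t n,
				    uint16_t port, uint32_t exts)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].port == port && tbl[i].exts == exts)
			return &tbl[i];
	}
	return NULL;
}
```

With a single entry { 4789, F_LEARN, EXT_GBP } in the table, a lookup for
port 4789 with EXT_GBP finds it regardless of the caller's flags, while a
lookup for port 4789 without extensions finds nothing.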

^ permalink raw reply

* Re: [linux-nics] [PATCH] e1000e: Fix 82574/82583 TimeSync errata handling for SYSTIM read
From: Jeff Kirsher @ 2015-01-15  0:36 UTC (permalink / raw)
  To: Bhavesh Davda
  Cc: linux.nics, netdev, pv-drivers, nithin, ninad, smurali, gyang
In-Reply-To: <1421278252-13622-1-git-send-email-bhavesh@vmware.com>

On Wed, 2015-01-14 at 15:30 -0800, Bhavesh Davda wrote:
> In emulated 82574 vNICs, TIMINCA might read as '0', so this change
> prevents a divide-by-zero error.
> 
> Signed-off-by: Bhavesh Davda <bhavesh@vmware.com>
> Acked-by: Nithin Raju <nithin@vmware.com>
> Acked-by: Ninad Ghodke <ninad@vmware.com>
> Reviewed-by: Guolin Yang <gyang@vmware.com>
> Reviewed-by: Srividya Murali <smurali@vmware.com>
> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 3 +++
>  1 file changed, 3 insertions(+)

Thanks Bhavesh, I will add your patch to my queue.

^ permalink raw reply
