Netdev List
* [PATCH 1/5] net: ethernet: ti: cpsw: improve re-split policy
From: Ivan Khoronzhuk @ 2016-12-10 12:23 UTC (permalink / raw)
  To: mugunthanvnm, grygorii.strashko
  Cc: linux-omap, netdev, linux-kernel, Ivan Khoronzhuk

This series adds several simplifications and improvements to setting the
maximum rate for channels, taking into account switch and dual emac modes.

Don't re-split res in the following cases:
- speed of phys is not changed
- speed of phys is changed and no rate limited channels
- speed of phys is changed and all channels are rate limited
- phy is unlinked while dev is open
- phy is linked back but speed is not changed

The maximum speed is the sum of the speeds of the "linked" phys, thus res
are split taking both interfaces into account, for dual emac mode as well
as for switch mode.
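
The policy above can be modelled outside the driver. The following is a minimal userspace sketch, not the driver code: `struct phy_model`, `struct switch_model`, `common_speed()` and `need_resplit()` are hypothetical names, and the sketch counts every rate limited channel, which may differ from how the driver walks its channel list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver state. */
struct phy_model {
	bool link;	/* true when the phy reports link up */
	int speed;	/* link speed in Mbps, e.g. 10, 100 or 1000 */
};

struct switch_model {
	struct phy_model phys[2];	/* two slave ports */
	int speed;			/* speed used for the last split, Mbps */
	unsigned int ch_rate[8];	/* per tx channel rate, 0 = not limited */
	size_t tx_ch_num;
};

/* Sum the speeds of all ports whose link is currently up. */
static int common_speed(const struct switch_model *sw)
{
	int speed = 0;

	for (size_t i = 0; i < 2; i++)
		if (sw->phys[i].link)
			speed += sw->phys[i].speed;

	return speed;
}

/* Return true only when a re-split could change the outcome. */
static bool need_resplit(struct switch_model *sw)
{
	int speed = common_speed(sw);
	size_t rlim = 0;

	assert(sw->tx_ch_num <= 8);

	/* unchanged speed, or every phy unlinked: keep the old split */
	if (speed == sw->speed || speed == 0)
		return false;

	sw->speed = speed;

	for (size_t i = 0; i < sw->tx_ch_num; i++)
		if (sw->ch_rate[i])
			rlim++;

	/* no limited channels, or all limited: split is speed independent */
	return rlim != 0 && rlim != sw->tx_ch_num;
}
```

Note that an unlink leaves the remembered speed untouched, so a phy that links back at the same speed does not trigger a new split.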

Tested on am572x

Based on net-next/master

Ivan Khoronzhuk (5):
  net: ethernet: ti: cpsw: use same macros to get active slave
  net: ethernet: ti: cpsw: don't start queue twice
  net: ethernet: ti: cpsw: combine budget and weight split and check
  net: ethernet: ti: cpsw: re-split res only when speed is changed
  net: ethernet: ti: cpsw: sync rates for channels in dual emac mode

 drivers/net/ethernet/ti/cpsw.c | 178 +++++++++++++++++++++++------------------
 1 file changed, 99 insertions(+), 79 deletions(-)

-- 
2.7.4


* [PATCH 1/5] net: ethernet: ti: cpsw: use same macros to get active slave
From: Ivan Khoronzhuk @ 2016-12-10 12:23 UTC (permalink / raw)
  To: mugunthanvnm, grygorii.strashko
  Cc: linux-omap, netdev, linux-kernel, Ivan Khoronzhuk
In-Reply-To: <1481372630-14914-1-git-send-email-ivan.khoronzhuk@linaro.org>

Use the same, more convenient macro to get the active slave.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
 drivers/net/ethernet/ti/cpsw.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index b62d958..c45f7d2 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1624,10 +1624,7 @@ static void cpsw_hwtstamp_v2(struct cpsw_priv *priv)
 	struct cpsw_common *cpsw = priv->cpsw;
 	u32 ctrl, mtype;
 
-	if (cpsw->data.dual_emac)
-		slave = &cpsw->slaves[priv->emac_port];
-	else
-		slave = &cpsw->slaves[cpsw->data.active_slave];
+	slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
 
 	ctrl = slave_read(slave, CPSW2_CONTROL);
 	switch (cpsw->version) {
-- 
2.7.4


* [PATCH 2/5] net: ethernet: ti: cpsw: don't start queue twice
From: Ivan Khoronzhuk @ 2016-12-10 12:23 UTC (permalink / raw)
  To: mugunthanvnm, grygorii.strashko
  Cc: linux-omap, netdev, linux-kernel, Ivan Khoronzhuk
In-Reply-To: <1481372630-14914-1-git-send-email-ivan.khoronzhuk@linaro.org>

There is no need to start the queues after cpsw is started, as that is
done in cpsw_adjust_link() once the phy is connected.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
 drivers/net/ethernet/ti/cpsw.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index c45f7d2..23213a3 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1506,8 +1506,6 @@ static int cpsw_ndo_open(struct net_device *ndev)
 	if (cpsw->data.dual_emac)
 		cpsw->slaves[priv->emac_port].open_stat = true;
 
-	netif_tx_start_all_queues(ndev);
-
 	return 0;
 
 err_cleanup:
-- 
2.7.4


* [PATCH 4/5] net: ethernet: ti: cpsw: re-split res only when speed is changed
From: Ivan Khoronzhuk @ 2016-12-10 12:23 UTC (permalink / raw)
  To: mugunthanvnm, grygorii.strashko
  Cc: linux-omap, netdev, linux-kernel, Ivan Khoronzhuk
In-Reply-To: <1481372630-14914-1-git-send-email-ivan.khoronzhuk@linaro.org>

Don't re-split res in the following cases:
- speed of phys is not changed
- speed of phys is changed and no rate limited channels
- speed of phys is changed and all channels are rate limited
- phy is unlinked while dev is open
- phy is linked back but speed is not changed

The maximum speed is the sum of the speeds of the "linked" phys, thus res
are split taking both interfaces into account, for dual emac mode as well
as for switch mode.
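
The diff in this patch also steps max_rate up by a factor of ten, at most twice, when the configured channel rates exceed the current link speed. A minimal sketch of that arithmetic follows, assuming speed is given in Mbps and rates in kbps; `pick_max_rate()` is a hypothetical helper name, not a function in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Pick the rate (kbps) against which channel shares are computed. */
static uint32_t pick_max_rate(int speed_mbps, uint32_t consumed_kbps)
{
	uint32_t max_rate;

	assert(speed_mbps > 0);
	max_rate = (uint32_t)speed_mbps * 1000;

	/* step up by a decade at most twice, e.g. 10 -> 100 -> 1000 Mbps */
	for (int step = 0; step < 2 && max_rate < consumed_kbps; step++)
		max_rate *= 10;

	return max_rate;
}
```

For example, with a 10 Mbps link and 50000 kbps of configured channel rates, the split would be computed against 100000 kbps, the next potential link speed.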

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
 drivers/net/ethernet/ti/cpsw.c | 64 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 59 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index a2c2c06..7ccfa63 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -394,6 +394,7 @@ struct cpsw_common {
 	u32 irqs_table[IRQ_NUM];
 	struct cpts			*cpts;
 	int				rx_ch_num, tx_ch_num;
+	int				speed;
 };
 
 struct cpsw_priv {
@@ -761,7 +762,6 @@ static void cpsw_split_res(struct net_device *ndev)
 	struct cpsw_vector *txv = cpsw->txv;
 	int i, ch_weight, rlim_ch_num = 0;
 	int budget, bigest_rate_ch = 0;
-	struct cpsw_slave *slave;
 	u32 ch_rate, max_rate;
 	int ch_budget = 0;
 
@@ -781,8 +781,16 @@ static void cpsw_split_res(struct net_device *ndev)
 		bigest_rate = 0;
 		max_rate = consumed_rate;
 	} else {
-		slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
-		max_rate = slave->phy->speed * 1000;
+		max_rate = cpsw->speed * 1000;
+
+		/* if max_rate is less than expected due to reduced link speed,
+		 * split proportionally according to next potential max speed
+		 */
+		if (max_rate < consumed_rate)
+			max_rate *= 10;
+
+		if (max_rate < consumed_rate)
+			max_rate *= 10;
 
 		ch_budget = (consumed_rate * CPSW_POLL_WEIGHT) / max_rate;
 		ch_budget = (CPSW_POLL_WEIGHT - ch_budget) /
@@ -1013,15 +1021,56 @@ static void _cpsw_adjust_link(struct cpsw_slave *slave,
 	slave->mac_control = mac_control;
 }
 
+static int cpsw_get_common_speed(struct cpsw_common *cpsw)
+{
+	int i, speed;
+
+	for (i = 0, speed = 0; i < cpsw->data.slaves; i++)
+		if (cpsw->slaves[i].phy && cpsw->slaves[i].phy->link)
+			speed += cpsw->slaves[i].phy->speed;
+
+	return speed;
+}
+
+static int cpsw_need_resplit(struct cpsw_common *cpsw)
+{
+	int i, rlim_ch_num;
+	int speed, ch_rate;
+
+	/* re-split resources only in case speed was changed */
+	speed = cpsw_get_common_speed(cpsw);
+	if (speed == cpsw->speed || !speed)
+		return 0;
+
+	cpsw->speed = speed;
+
+	for (i = 0, rlim_ch_num = 0; i < cpsw->tx_ch_num; i++) {
+		ch_rate = cpdma_chan_get_rate(cpsw->txv[i].ch);
+		if (!ch_rate)
+			break;
+
+		rlim_ch_num++;
+	}
+
+	/* cases not dependent on speed */
+	if (!rlim_ch_num || rlim_ch_num == cpsw->tx_ch_num)
+		return 0;
+
+	return 1;
+}
+
 static void cpsw_adjust_link(struct net_device *ndev)
 {
 	struct cpsw_priv	*priv = netdev_priv(ndev);
+	struct cpsw_common	*cpsw = priv->cpsw;
 	bool			link = false;
 
 	for_each_slave(priv, _cpsw_adjust_link, priv, &link);
 
 	if (link) {
-		cpsw_split_res(priv->ndev);
+		if (cpsw_need_resplit(cpsw))
+			cpsw_split_res(ndev);
+
 		netif_carrier_on(ndev);
 		if (netif_running(ndev))
 			netif_tx_wake_all_queues(ndev);
@@ -1538,6 +1587,10 @@ static int cpsw_ndo_stop(struct net_device *ndev)
 		cpsw_ale_stop(cpsw->ale);
 	}
 	for_each_slave(priv, cpsw_slave_stop, cpsw);
+
+	if (cpsw_need_resplit(cpsw))
+		cpsw_split_res(ndev);
+
 	pm_runtime_put_sync(cpsw->dev);
 	if (cpsw->data.dual_emac)
 		cpsw->slaves[priv->emac_port].open_stat = false;
@@ -1983,7 +2036,7 @@ static int cpsw_ndo_set_tx_maxrate(struct net_device *ndev, int queue, u32 rate)
 		return -EINVAL;
 	}
 
-	if (rate > 2000) {
+	if (rate > cpsw->speed) {
 		dev_err(priv->dev, "The channel rate cannot be more than 2Gbps");
 		return -EINVAL;
 	}
@@ -2998,6 +3051,7 @@ static int cpsw_probe(struct platform_device *pdev)
 	ndev->ethtool_ops = &cpsw_ethtool_ops;
 	netif_napi_add(ndev, &cpsw->napi_rx, cpsw_rx_poll, CPSW_POLL_WEIGHT);
 	netif_tx_napi_add(ndev, &cpsw->napi_tx, cpsw_tx_poll, CPSW_POLL_WEIGHT);
+	cpsw_split_res(ndev);
 
 	/* register the network device */
 	SET_NETDEV_DEV(ndev, &pdev->dev);
-- 
2.7.4


* [PATCH 5/5] net: ethernet: ti: cpsw: sync rates for channels in dual emac mode
From: Ivan Khoronzhuk @ 2016-12-10 12:23 UTC (permalink / raw)
  To: mugunthanvnm, grygorii.strashko
  Cc: linux-omap, netdev, linux-kernel, Ivan Khoronzhuk
In-Reply-To: <1481372630-14914-1-git-send-email-ivan.khoronzhuk@linaro.org>

The channels are common to both ndevs in dual emac mode. Hence, keep
their rates in sync.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
 drivers/net/ethernet/ti/cpsw.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 7ccfa63..b203143 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2020,9 +2020,10 @@ static int cpsw_ndo_set_tx_maxrate(struct net_device *ndev, int queue, u32 rate)
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
 	struct cpsw_common *cpsw = priv->cpsw;
+	struct cpsw_slave *slave;
 	u32 min_rate;
 	u32 ch_rate;
-	int ret;
+	int i, ret;
 
 	ch_rate = netdev_get_tx_queue(ndev, queue)->tx_maxrate;
 	if (ch_rate == rate)
@@ -2053,6 +2054,15 @@ static int cpsw_ndo_set_tx_maxrate(struct net_device *ndev, int queue, u32 rate)
 	if (ret)
 		return ret;
 
+	/* update rates for slaves tx queues */
+	for (i = 0; i < cpsw->data.slaves; i++) {
+		slave = &cpsw->slaves[i];
+		if (!slave->ndev)
+			continue;
+
+		netdev_get_tx_queue(slave->ndev, queue)->tx_maxrate = rate;
+	}
+
 	cpsw_split_res(ndev);
 	return ret;
 }
-- 
2.7.4


* [PATCH 3/5] net: ethernet: ti: cpsw: combine budget and weight split and check
From: Ivan Khoronzhuk @ 2016-12-10 12:23 UTC (permalink / raw)
  To: mugunthanvnm, grygorii.strashko
  Cc: linux-omap, netdev, linux-kernel, Ivan Khoronzhuk
In-Reply-To: <1481372630-14914-1-git-send-email-ivan.khoronzhuk@linaro.org>

Re-split the weight along with the budget. This simplifies the code a
little and updates the state after every rate change. It is also necessary
to move the argument checks into this combined function. Replace the
maximum rate check for an interface with a check against the maximum
possible rate.
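
As a worked illustration of the combined split, here is a minimal userspace sketch of the budget and weight arithmetic. It assumes a poll weight of 64 and rates in kbps; `struct chan_model` and `split_res()` are hypothetical names, and the sketch leaves out how the driver hands out any leftover budget.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POLL_WEIGHT 64	/* assumed value, stands in for CPSW_POLL_WEIGHT */

struct chan_model {
	uint32_t rate;	/* kbps, 0 means not rate limited */
	int budget;	/* share of POLL_WEIGHT */
	int weight;	/* percent of max_rate, 0 means unweighted */
};

static void split_res(struct chan_model *ch, size_t n, uint32_t link_kbps)
{
	uint64_t consumed = 0, max_rate;
	size_t i, rlim = 0;
	int free_budget = 0;

	assert(n > 0 && link_kbps > 0);

	/* total rate claimed by rate limited channels */
	for (i = 0; i < n; i++) {
		if (ch[i].rate) {
			consumed += ch[i].rate;
			rlim++;
		}
	}

	if (rlim == n) {
		/* all limited: shares are relative to their own sum */
		max_rate = consumed;
	} else if (!rlim) {
		/* none limited: equal budget for every channel */
		max_rate = 0;	/* never used as a divisor below */
		free_budget = POLL_WEIGHT / (int)n;
	} else {
		/* mixed: unlimited channels share what the limited ones leave */
		assert(consumed <= link_kbps);
		max_rate = link_kbps;
		free_budget = POLL_WEIGHT -
			      (int)(consumed * POLL_WEIGHT / max_rate);
		free_budget /= (int)(n - rlim);
	}

	for (i = 0; i < n; i++) {
		if (!ch[i].rate) {
			ch[i].budget = free_budget;
			ch[i].weight = 0;
			continue;
		}
		ch[i].budget = (int)(ch[i].rate * (uint64_t)POLL_WEIGHT / max_rate);
		if (!ch[i].budget)
			ch[i].budget = 1;
		ch[i].weight = (int)(ch[i].rate * (uint64_t)100 / max_rate);
		if (!ch[i].weight)
			ch[i].weight = 1;
	}
}
```

With four channels at 500000, 250000, 0 and 0 kbps on a 1000000 kbps link, this gives budgets 32, 16, 8 and 8 and weights 50, 25, 0 and 0.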

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
 drivers/net/ethernet/ti/cpsw.c | 107 +++++++++++++----------------------------
 1 file changed, 34 insertions(+), 73 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 23213a3..a2c2c06 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -753,27 +753,18 @@ static void cpsw_rx_handler(void *token, int len, int status)
 		dev_kfree_skb_any(new_skb);
 }
 
-/* split budget depending on channel rates */
-static void cpsw_split_budget(struct net_device *ndev)
+static void cpsw_split_res(struct net_device *ndev)
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
+	u32 consumed_rate = 0, bigest_rate = 0;
 	struct cpsw_common *cpsw = priv->cpsw;
 	struct cpsw_vector *txv = cpsw->txv;
-	u32 consumed_rate, bigest_rate = 0;
+	int i, ch_weight, rlim_ch_num = 0;
 	int budget, bigest_rate_ch = 0;
 	struct cpsw_slave *slave;
-	int i, rlim_ch_num = 0;
 	u32 ch_rate, max_rate;
 	int ch_budget = 0;
 
-	if (cpsw->data.dual_emac)
-		slave = &cpsw->slaves[priv->emac_port];
-	else
-		slave = &cpsw->slaves[cpsw->data.active_slave];
-
-	max_rate = slave->phy->speed * 1000;
-
-	consumed_rate = 0;
 	for (i = 0; i < cpsw->tx_ch_num; i++) {
 		ch_rate = cpdma_chan_get_rate(txv[i].ch);
 		if (!ch_rate)
@@ -785,7 +776,14 @@ static void cpsw_split_budget(struct net_device *ndev)
 
 	if (cpsw->tx_ch_num == rlim_ch_num) {
 		max_rate = consumed_rate;
+	} else if (!rlim_ch_num) {
+		ch_budget = CPSW_POLL_WEIGHT / cpsw->tx_ch_num;
+		bigest_rate = 0;
+		max_rate = consumed_rate;
 	} else {
+		slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
+		max_rate = slave->phy->speed * 1000;
+
 		ch_budget = (consumed_rate * CPSW_POLL_WEIGHT) / max_rate;
 		ch_budget = (CPSW_POLL_WEIGHT - ch_budget) /
 			    (cpsw->tx_ch_num - rlim_ch_num);
@@ -793,22 +791,28 @@ static void cpsw_split_budget(struct net_device *ndev)
 			      (cpsw->tx_ch_num - rlim_ch_num);
 	}
 
-	/* split tx budget */
+	/* split tx weight/budget */
 	budget = CPSW_POLL_WEIGHT;
 	for (i = 0; i < cpsw->tx_ch_num; i++) {
 		ch_rate = cpdma_chan_get_rate(txv[i].ch);
 		if (ch_rate) {
 			txv[i].budget = (ch_rate * CPSW_POLL_WEIGHT) / max_rate;
 			if (!txv[i].budget)
-				txv[i].budget = 1;
+				txv[i].budget++;
 			if (ch_rate > bigest_rate) {
 				bigest_rate_ch = i;
 				bigest_rate = ch_rate;
 			}
+
+			ch_weight = (ch_rate * 100) / max_rate;
+			if (!ch_weight)
+				ch_weight++;
+			cpdma_chan_set_weight(cpsw->txv[i].ch, ch_weight);
 		} else {
 			txv[i].budget = ch_budget;
 			if (!bigest_rate_ch)
 				bigest_rate_ch = i;
+			cpdma_chan_set_weight(cpsw->txv[i].ch, 0);
 		}
 
 		budget -= txv[i].budget;
@@ -1017,7 +1021,7 @@ static void cpsw_adjust_link(struct net_device *ndev)
 	for_each_slave(priv, _cpsw_adjust_link, priv, &link);
 
 	if (link) {
-		cpsw_split_budget(priv->ndev);
+		cpsw_split_res(priv->ndev);
 		netif_carrier_on(ndev);
 		if (netif_running(ndev))
 			netif_tx_wake_all_queues(ndev);
@@ -1962,64 +1966,25 @@ static int cpsw_ndo_vlan_rx_kill_vid(struct net_device *ndev,
 static int cpsw_ndo_set_tx_maxrate(struct net_device *ndev, int queue, u32 rate)
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
-	int tx_ch_num = ndev->real_num_tx_queues;
-	u32 consumed_rate, min_rate, max_rate;
 	struct cpsw_common *cpsw = priv->cpsw;
-	struct cpsw_slave *slave;
-	int ret, i, weight;
-	int rlim_num = 0;
+	u32 min_rate;
 	u32 ch_rate;
+	int ret;
 
 	ch_rate = netdev_get_tx_queue(ndev, queue)->tx_maxrate;
 	if (ch_rate == rate)
 		return 0;
 
-	if (cpsw->data.dual_emac)
-		slave = &cpsw->slaves[priv->emac_port];
-	else
-		slave = &cpsw->slaves[cpsw->data.active_slave];
-	max_rate = slave->phy->speed;
-
-	consumed_rate = 0;
-	for (i = 0; i < tx_ch_num; i++) {
-		if (i == queue)
-			ch_rate = rate;
-		else
-			ch_rate = netdev_get_tx_queue(ndev, i)->tx_maxrate;
-		if (!ch_rate)
-			continue;
-
-		rlim_num++;
-		consumed_rate += ch_rate;
-	}
-
-	if (consumed_rate > max_rate)
-		dev_info(priv->dev, "The common rate shouldn't be more than %dMbps",
-			 max_rate);
-
-	if (consumed_rate > max_rate) {
-		if (max_rate == 10 && consumed_rate <= 100) {
-			max_rate = 100;
-		} else if (max_rate <= 100 && consumed_rate <= 1000) {
-			max_rate = 1000;
-		} else {
-			dev_err(priv->dev, "The common rate cannot be more than %dMbps",
-				max_rate);
-			return -EINVAL;
-		}
-	}
-
-	if (consumed_rate > max_rate) {
-		dev_err(priv->dev, "The common rate cannot be more than %dMbps",
-			max_rate);
+	ch_rate = rate * 1000;
+	min_rate = cpdma_chan_get_min_rate(cpsw->dma);
+	if ((ch_rate < min_rate && ch_rate)) {
+		dev_err(priv->dev, "The channel rate cannot be less than %dMbps",
+			min_rate);
 		return -EINVAL;
 	}
 
-	rate *= 1000;
-	min_rate = cpdma_chan_get_min_rate(cpsw->dma);
-	if ((rate < min_rate && rate)) {
-		dev_err(priv->dev, "The common rate cannot be less than %dMbps",
-			min_rate);
+	if (rate > 2000) {
+		dev_err(priv->dev, "The channel rate cannot be more than 2Gbps");
 		return -EINVAL;
 	}
 
@@ -2029,17 +1994,13 @@ static int cpsw_ndo_set_tx_maxrate(struct net_device *ndev, int queue, u32 rate)
 		return ret;
 	}
 
-	if (rlim_num == tx_ch_num)
-		max_rate = consumed_rate;
+	ret = cpdma_chan_set_rate(cpsw->txv[queue].ch, ch_rate);
+	pm_runtime_put(cpsw->dev);
 
-	weight = (rate * 100) / (max_rate * 1000);
-	cpdma_chan_set_weight(cpsw->txv[queue].ch, weight);
-	ret = cpdma_chan_set_rate(cpsw->txv[queue].ch, rate);
+	if (ret)
+		return ret;
 
-	/* re-split budget between channels */
-	if (!rate)
-		cpsw_split_budget(ndev);
-	pm_runtime_put(cpsw->dev);
+	cpsw_split_res(ndev);
 	return ret;
 }
 
@@ -2399,7 +2360,7 @@ static int cpsw_set_channels(struct net_device *ndev,
 		if (ret)
 			goto err;
 
-		cpsw_split_budget(ndev);
+		cpsw_split_res(ndev);
 
 		/* After this receive is started */
 		cpdma_ctlr_start(cpsw->dma);
-- 
2.7.4


* Re: Misalignment, MIPS, and ip_hdr(skb)->version
From: Felix Fietkau @ 2016-12-10 12:25 UTC (permalink / raw)
  To: Jason A. Donenfeld, David Miller
  Cc: Netdev, WireGuard mailing list, LKML, linux-mips
In-Reply-To: <CAHmME9oLgjDA2F0gkFzHU2Es8-XCxQHRABS18OKF0EnZgt1=LQ@mail.gmail.com>

On 2016-12-07 19:54, Jason A. Donenfeld wrote:
> On Wed, Dec 7, 2016 at 7:51 PM, David Miller <davem@davemloft.net> wrote:
>> It's so much better to analyze properly where the misalignment comes from
>> and address it at the source, as we have for various cases that trip up
>> Sparc too.
> 
> That's sort of my attitude too, hence starting this thread. Any
> pointers you have about this would be most welcome, so as not to
> perpetuate what already seems like an issue in other parts of the
> stack.
Hi Jason,

I'm the author of that hackish LEDE/OpenWrt patch that works around the
misalignment issues. Here's some context regarding that patch:

I intentionally put it in the target specific patches for only one of
our MIPS targets. There are a few ar71xx devices where the misalignment
cannot be fixed, because the Ethernet MAC has a 4-byte DMA alignment
requirement, and does not support inserting 2 bytes of padding to
correct the IP header misalignment.

With these limitations the choice was between this ugly network stack
patch or inserting a very expensive memmove in the data path (which is
better than taking the mis-alignment traps, but still hurts routing
performance significantly).

There are a lot of places in the network stack that assume full 32 bit
alignment, and you only get to see those once you start using more of
netfilter, play with various tunnel encapsulations, etc.

I think you have 3 options to deal with this properly:
1. add 3 bytes of padding
2. allocate a separate skb for decryption (might be more expensive)
3. save the header and decrypt to the start of the packet data
(overwriting the misaligned header).

I'm not sure what the performance impact of 2 and 3 is, so it's probably
best to stick with the padding.

I've taken a quick look at the wireguard message headers, and my
recommendation would be to insert the 3-byte padding in struct
message_header and remove __packed from your structs.
This will also remove misalignment of your own protocol fields.
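
To illustrate the padding idea, here is a minimal sketch with a hypothetical message layout. The struct and field names are assumptions for illustration only and are not WireGuard's actual definitions. The point is that explicit padding after a 1-byte type keeps every later field, and the packet that follows the header, on a 4-byte boundary without __packed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, not the real WireGuard structs. */
struct msg_header {
	uint8_t type;
	uint8_t reserved[3];	/* explicit padding, sent as zero */
};

struct msg_data {
	struct msg_header header;	/* offset 0, 4 bytes */
	uint32_t key_idx;		/* offset 4 */
	uint64_t counter;		/* offset 8 */
	/* the encrypted inner packet would follow at offset 16 */
};

static_assert(offsetof(struct msg_data, key_idx) % 4 == 0,
	      "key_idx is 32 bit aligned");
static_assert(sizeof(struct msg_data) % 4 == 0,
	      "the data after the header keeps 32 bit alignment");

/* Nonzero when the inner packet after the header is 4-byte aligned. */
static int inner_packet_aligned(uintptr_t msg_addr)
{
	return ((msg_addr + sizeof(struct msg_data)) & 3) == 0;
}
```

Under these assumptions, a message that starts on a 4-byte boundary also places the inner IP header on a 4-byte boundary.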

- Felix


* Re: [PATCH] ARM: add cmpxchg64 helper for ARMv7-M
From: Pablo Neira Ayuso @ 2016-12-10 12:32 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Russell King, netdev, Paul E. McKenney, David S. Miller,
	netfilter-devel, coreteam, linux-arm-kernel, linux-kernel
In-Reply-To: <20161210103646.1407256-1-arnd@arndb.de>

Hi Arnd,

On Sat, Dec 10, 2016 at 11:36:34AM +0100, Arnd Bergmann wrote:
> A change to the netfilter code in net-next introduced the first caller of
> cmpxchg64 that can get built on ARMv7-M, leading to an error from the
> assembler that points out the lack of 64-bit atomics on this architecture:
> 
> /tmp/ccMe7djj.s: Assembler messages:
> /tmp/ccMe7djj.s:367: Error: selected processor does not support `ldrexd r0,r1,[lr]' in Thumb mode
> /tmp/ccMe7djj.s:371: Error: selected processor does not support `strexd ip,r2,r3,[lr]' in Thumb mode
> /tmp/ccMe7djj.s:389: Error: selected processor does not support `ldrexd r8,r9,[r7]' in Thumb mode
> /tmp/ccMe7djj.s:393: Error: selected processor does not support `strexd lr,r0,r1,[r7]' in Thumb mode
> scripts/Makefile.build:299: recipe for target 'net/netfilter/nft_counter.o' failed
> 
> This makes ARMv7-M use the same emulation from asm-generic/cmpxchg-local.h
> that we use on architectures earlier than ARMv6K, to fix the build. The
> 32-bit atomics are available on ARMv7-M and we keep using them there.
> This ARM specific change is probably something we should do regardless
> of the netfilter code.
> 
> However, looking at the new nft_counter_reset() function in nft_counter.c,
> this looks incorrect to me not just on ARMv7-M but also on other
> architectures, with at least the following possible race:

Right, Eric Dumazet already spotted this problem. I'm preparing a
patch that doesn't require cmpxchg64(). Will keep you on Cc. Thanks.


* Re: Misalignment, MIPS, and ip_hdr(skb)->version
From: Måns Rullgård @ 2016-12-10 13:25 UTC (permalink / raw)
  To: Felix Fietkau
  Cc: linux-mips, Netdev, LKML, David Miller, WireGuard mailing list
In-Reply-To: <040bcdb2-2725-c8de-11d9-a4f77b75d9d8@nbd.name>

Felix Fietkau <nbd@nbd.name> writes:

> On 2016-12-07 19:54, Jason A. Donenfeld wrote:
>> On Wed, Dec 7, 2016 at 7:51 PM, David Miller <davem@davemloft.net> wrote:
>>> It's so much better to analyze properly where the misalignment comes from
>>> and address it at the source, as we have for various cases that trip up
>>> Sparc too.
>> 
>> That's sort of my attitude too, hence starting this thread. Any
>> pointers you have about this would be most welcome, so as not to
>> perpetuate what already seems like an issue in other parts of the
>> stack.
> Hi Jason,
>
> I'm the author of that hackish LEDE/OpenWrt patch that works around the
> misalignment issues. Here's some context regarding that patch:
>
> I intentionally put it in the target specific patches for only one of
> our MIPS targets. There are a few ar71xx devices where the misalignment
> cannot be fixed, because the Ethernet MAC has a 4-byte DMA alignment
> requirement, and does not support inserting 2 bytes of padding to
> correct the IP header misalignment.
>
> With these limitations the choice was between this ugly network stack
> patch or inserting a very expensive memmove in the data path (which is
> better than taking the mis-alignment traps, but still hurts routing
> performance significantly).

I solved this problem in an Ethernet driver by copying the initial part
of the packet to an aligned skb and appending the remainder using
skb_add_rx_frag().  The kernel network stack only cares about the
headers, so the alignment of the packet payload doesn't matter.
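
The approach can be pictured with a userspace model. This is only an analogy under stated assumptions: it uses no kernel API, and `struct rx_model`, `rx_split()`, `HDR_COPY_MAX` and `IP_ALIGN_PAD` are hypothetical names. The headers are copied to a buffer offset by two bytes so that the IP header after a 14-byte Ethernet header lands on a 4-byte boundary, while the rest of the frame is referenced in place.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN	14
#define IP_ALIGN_PAD	2	/* hypothetical, plays the role of a 2-byte pad */
#define HDR_COPY_MAX	128	/* hypothetical cap on copied header bytes */

struct rx_model {
	alignas(4) uint8_t head[IP_ALIGN_PAD + HDR_COPY_MAX];
	size_t head_len;	/* header bytes copied into head */
	const uint8_t *frag;	/* rest of the frame, left where DMA put it */
	size_t frag_len;
};

/* Start of the copied frame data inside the aligned buffer. */
static uint8_t *rx_data(struct rx_model *rx)
{
	return rx->head + IP_ALIGN_PAD;
}

/* Start of the IP header inside the copied data. */
static uint8_t *rx_ip_header(struct rx_model *rx)
{
	return rx_data(rx) + ETH_HDR_LEN;
}

/* Copy hdr_len bytes of headers, reference the remainder in place. */
static void rx_split(struct rx_model *rx, const uint8_t *frame,
		     size_t len, size_t hdr_len)
{
	assert(hdr_len <= HDR_COPY_MAX);

	if (hdr_len > len)
		hdr_len = len;

	memcpy(rx_data(rx), frame, hdr_len);
	rx->head_len = hdr_len;
	rx->frag = frame + hdr_len;
	rx->frag_len = len - hdr_len;
}
```

Only the short header copy is paid for per packet; the payload is never moved, whatever its alignment.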

-- 
Måns Rullgård


* Re: [PATCH net-next 2/5] liquidio VF vxlan
From: Or Gerlitz @ 2016-12-10 13:46 UTC (permalink / raw)
  To: Vatsavayi, Raghu
  Cc: David Miller, Linux Netdev List, Chickles, Derek,
	Burla, Satananda, Manlunas, Felix
In-Reply-To: <DM3PR07MB21386696FCE7727BD9B3B48C81840@DM3PR07MB2138.namprd07.prod.outlook.com>

On Fri, Dec 9, 2016 at 12:42 AM, Vatsavayi, Raghu
<Raghu.Vatsavayi@cavium.com> wrote:
>> From: Or Gerlitz [mailto:gerlitz.or@gmail.com]
>> On Thu, Dec 8, 2016 at 11:00 PM, Raghu Vatsavayi
>> <rvatsavayi@caviumnetworks.com> wrote:

>>> Adds VF vxlan offload support.

>> What's the use case for that? a VM running a VTEP, isn't that part needs to
>> run @ the host?

> Our HW can support offloads for VF which is required if we load it on Hypervisor.


+       nctrl.ncmd.u64 = 0;
+       nctrl.ncmd.s.cmd = command;
+       nctrl.ncmd.s.more = vxlan_cmd_bit;
+       nctrl.ncmd.s.param1 = vxlan_port;
+       nctrl.iq_no = lio->linfo.txpciq[0].s.q_no;
+       nctrl.wait_time = 100;
+       nctrl.netpndev = (u64)netdev;
+       nctrl.cb_fn = liquidio_link_ctrl_cmd_completion;
+
+       ret = octnet_send_nic_ctrl_pkt(lio->oct_dev, &nctrl);

1. What happens if more than one VF runs this code, each with a different
port? Who wins? Is the result well defined?

2. Does octnet_send_nic_ctrl_pkt() go to sleep? That is disallowed here.

Or.


* Re: [PATCH V2 18/22] bnxt_re: Support for DCB
From: Or Gerlitz @ 2016-12-10 13:50 UTC (permalink / raw)
  To: Selvin Xavier
  Cc: Doug Ledford, linux-rdma@vger.kernel.org, Linux Netdev List,
	Eddie Wai, Devesh Sharma, Somnath Kotur, Sriharsha Basavapatna
In-Reply-To: <1481266096-23331-19-git-send-email-selvin.xavier@broadcom.com>

On Fri, Dec 9, 2016 at 8:48 AM, Selvin Xavier
<selvin.xavier@broadcom.com> wrote:
> This patch queries the configured RoCE APP Priority on the host
> using the dcbnl API and programs the RoCE FW with the corresponding
> Traffic Class(es) for the priority.

> +#define BNXT_RE_ROCE_V1_ETH_TYPE       0x8915
> +#define BNXT_RE_ROCE_V2_PORT_NO                4791

I believe these two are already defined; try running git grep for each under include/


* [PATCH] net: nicvf: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes @ 2016-12-10 14:00 UTC (permalink / raw)
  To: sgoutham, rric, davem; +Cc: netdev, linux-kernel, Philippe Reynes

The ethtool API {get|set}_settings is deprecated.
Move this driver to the new API {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
---
 .../net/ethernet/cavium/thunder/nicvf_ethtool.c    |   56 +++++++++++---------
 1 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c b/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
index b048241..2e74bba 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
@@ -116,33 +116,34 @@ struct nicvf_stat {
 static const unsigned int nicvf_n_drv_stats = ARRAY_SIZE(nicvf_drv_stats);
 static const unsigned int nicvf_n_queue_stats = ARRAY_SIZE(nicvf_queue_stats);
 
-static int nicvf_get_settings(struct net_device *netdev,
-			      struct ethtool_cmd *cmd)
+static int nicvf_get_link_ksettings(struct net_device *netdev,
+				    struct ethtool_link_ksettings *cmd)
 {
 	struct nicvf *nic = netdev_priv(netdev);
+	u32 supported, advertising;
 
-	cmd->supported = 0;
-	cmd->transceiver = XCVR_EXTERNAL;
+	supported = 0;
+	advertising = 0;
 
 	if (!nic->link_up) {
-		cmd->duplex = DUPLEX_UNKNOWN;
-		ethtool_cmd_speed_set(cmd, SPEED_UNKNOWN);
+		cmd->base.duplex = DUPLEX_UNKNOWN;
+		cmd->base.speed = SPEED_UNKNOWN;
 		return 0;
 	}
 
 	switch (nic->speed) {
 	case SPEED_1000:
-		cmd->port = PORT_MII | PORT_TP;
-		cmd->autoneg = AUTONEG_ENABLE;
-		cmd->supported |= SUPPORTED_MII | SUPPORTED_TP;
-		cmd->supported |= SUPPORTED_1000baseT_Full |
+		cmd->base.port = PORT_MII | PORT_TP;
+		cmd->base.autoneg = AUTONEG_ENABLE;
+		supported |= SUPPORTED_MII | SUPPORTED_TP;
+		supported |= SUPPORTED_1000baseT_Full |
 				  SUPPORTED_1000baseT_Half |
 				  SUPPORTED_100baseT_Full  |
 				  SUPPORTED_100baseT_Half  |
 				  SUPPORTED_10baseT_Full   |
 				  SUPPORTED_10baseT_Half;
-		cmd->supported |= SUPPORTED_Autoneg;
-		cmd->advertising |= ADVERTISED_1000baseT_Full |
+		supported |= SUPPORTED_Autoneg;
+		advertising |= ADVERTISED_1000baseT_Full |
 				    ADVERTISED_1000baseT_Half |
 				    ADVERTISED_100baseT_Full  |
 				    ADVERTISED_100baseT_Half  |
@@ -151,24 +152,29 @@ static int nicvf_get_settings(struct net_device *netdev,
 		break;
 	case SPEED_10000:
 		if (nic->mac_type == BGX_MODE_RXAUI) {
-			cmd->port = PORT_TP;
-			cmd->supported |= SUPPORTED_TP;
+			cmd->base.port = PORT_TP;
+			supported |= SUPPORTED_TP;
 		} else {
-			cmd->port = PORT_FIBRE;
-			cmd->supported |= SUPPORTED_FIBRE;
+			cmd->base.port = PORT_FIBRE;
+			supported |= SUPPORTED_FIBRE;
 		}
-		cmd->autoneg = AUTONEG_DISABLE;
-		cmd->supported |= SUPPORTED_10000baseT_Full;
+		cmd->base.autoneg = AUTONEG_DISABLE;
+		supported |= SUPPORTED_10000baseT_Full;
 		break;
 	case SPEED_40000:
-		cmd->port = PORT_FIBRE;
-		cmd->autoneg = AUTONEG_DISABLE;
-		cmd->supported |= SUPPORTED_FIBRE;
-		cmd->supported |= SUPPORTED_40000baseCR4_Full;
+		cmd->base.port = PORT_FIBRE;
+		cmd->base.autoneg = AUTONEG_DISABLE;
+		supported |= SUPPORTED_FIBRE;
+		supported |= SUPPORTED_40000baseCR4_Full;
 		break;
 	}
-	cmd->duplex = nic->duplex;
-	ethtool_cmd_speed_set(cmd, nic->speed);
+	cmd->base.duplex = nic->duplex;
+	cmd->base.speed = nic->speed;
+
+	ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.supported,
+						supported);
+	ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.advertising,
+						advertising);
 
 	return 0;
 }
@@ -770,7 +776,6 @@ static int nicvf_set_pauseparam(struct net_device *dev,
 }
 
 static const struct ethtool_ops nicvf_ethtool_ops = {
-	.get_settings		= nicvf_get_settings,
 	.get_link		= nicvf_get_link,
 	.get_drvinfo		= nicvf_get_drvinfo,
 	.get_msglevel		= nicvf_get_msglevel,
@@ -793,6 +798,7 @@ static int nicvf_set_pauseparam(struct net_device *dev,
 	.get_pauseparam         = nicvf_get_pauseparam,
 	.set_pauseparam         = nicvf_set_pauseparam,
 	.get_ts_info		= ethtool_op_get_ts_info,
+	.get_link_ksettings	= nicvf_get_link_ksettings,
 };
 
 void nicvf_set_ethtool_ops(struct net_device *netdev)
-- 
1.7.4.4


* [PATCH net-next] netfilter: nft_counter: rework atomic dump and reset
From: Pablo Neira Ayuso @ 2016-12-10 14:05 UTC (permalink / raw)
  To: netfilter-devel; +Cc: eric.dumazet, arnd, netdev

Dump and reset doesn't work unless cmpxchg64() is used from both the
packet and control plane paths. That approach would be slow, though.
Instead, use a percpu seqcount to fetch counters consistently,
then subtract bytes and packets in case a reset was requested.

This patch is based on the original sketch from Eric Dumazet.
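
The fetch-then-subtract idea can be sketched in userspace. The model below uses C11 atomics rather than the kernel's seqcount_t, assumes a single writer per counter, and all names in it (`struct counter_model`, `counter_add()`, `counter_fetch()`) are hypothetical. Subtracting the snapshot, instead of writing zero, means that packets counted between the snapshot and the reset are kept.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct counter_model {
	atomic_uint seq;	/* even: stable, odd: update in progress */
	_Atomic uint64_t bytes;
	_Atomic uint64_t packets;
};

/* Writer side: bracket the update with two sequence increments. */
static void counter_add(struct counter_model *c, uint64_t len)
{
	atomic_fetch_add(&c->seq, 1);	/* begin, seq becomes odd */
	atomic_fetch_add(&c->bytes, len);
	atomic_fetch_add(&c->packets, 1);
	atomic_fetch_add(&c->seq, 1);	/* end, seq becomes even */
}

/* Reader side: retry until a snapshot was taken with no writer active. */
static void counter_fetch(struct counter_model *c, uint64_t *bytes,
			  uint64_t *packets, bool reset)
{
	unsigned int seq;

	assert(bytes && packets);

	do {
		seq = atomic_load(&c->seq);
		*bytes = atomic_load(&c->bytes);
		*packets = atomic_load(&c->packets);
	} while ((seq & 1) || seq != atomic_load(&c->seq));

	if (reset) {
		/* remove only what was reported, keep later increments */
		atomic_fetch_sub(&c->bytes, *bytes);
		atomic_fetch_sub(&c->packets, *packets);
	}
}
```

A dump with reset then reports the snapshot and leaves the counter holding only what arrived afterwards.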

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 43da04a593d8 ("netfilter: nf_tables: atomic dump and reset for stateful objects")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_counter.c | 128 ++++++++++++++++++--------------------------
 1 file changed, 51 insertions(+), 77 deletions(-)

diff --git a/net/netfilter/nft_counter.c b/net/netfilter/nft_counter.c
index f6a02c5071c2..c37983d0a141 100644
--- a/net/netfilter/nft_counter.c
+++ b/net/netfilter/nft_counter.c
@@ -22,27 +22,29 @@ struct nft_counter {
 	u64		packets;
 };
 
-struct nft_counter_percpu {
-	struct nft_counter	counter;
-	struct u64_stats_sync	syncp;
-};
-
 struct nft_counter_percpu_priv {
-	struct nft_counter_percpu __percpu *counter;
+	struct nft_counter __percpu *counter;
 };
 
+static DEFINE_PER_CPU(seqcount_t, nft_counter_seq);
+
 static inline void nft_counter_do_eval(struct nft_counter_percpu_priv *priv,
 				       struct nft_regs *regs,
 				       const struct nft_pktinfo *pkt)
 {
-	struct nft_counter_percpu *this_cpu;
+	struct nft_counter *this_cpu;
+	seqcount_t *myseq;
 
 	local_bh_disable();
 	this_cpu = this_cpu_ptr(priv->counter);
-	u64_stats_update_begin(&this_cpu->syncp);
-	this_cpu->counter.bytes += pkt->skb->len;
-	this_cpu->counter.packets++;
-	u64_stats_update_end(&this_cpu->syncp);
+	myseq = this_cpu_ptr(&nft_counter_seq);
+
+	write_seqcount_begin(myseq);
+
+	this_cpu->bytes += pkt->skb->len;
+	this_cpu->packets++;
+
+	write_seqcount_end(myseq);
 	local_bh_enable();
 }
 
@@ -58,21 +60,21 @@ static inline void nft_counter_obj_eval(struct nft_object *obj,
 static int nft_counter_do_init(const struct nlattr * const tb[],
 			       struct nft_counter_percpu_priv *priv)
 {
-	struct nft_counter_percpu __percpu *cpu_stats;
-	struct nft_counter_percpu *this_cpu;
+	struct nft_counter __percpu *cpu_stats;
+	struct nft_counter *this_cpu;
 
-	cpu_stats = netdev_alloc_pcpu_stats(struct nft_counter_percpu);
+	cpu_stats = alloc_percpu(struct nft_counter);
 	if (cpu_stats == NULL)
 		return -ENOMEM;
 
 	preempt_disable();
 	this_cpu = this_cpu_ptr(cpu_stats);
 	if (tb[NFTA_COUNTER_PACKETS]) {
-	        this_cpu->counter.packets =
+	        this_cpu->packets =
 			be64_to_cpu(nla_get_be64(tb[NFTA_COUNTER_PACKETS]));
 	}
 	if (tb[NFTA_COUNTER_BYTES]) {
-		this_cpu->counter.bytes =
+		this_cpu->bytes =
 			be64_to_cpu(nla_get_be64(tb[NFTA_COUNTER_BYTES]));
 	}
 	preempt_enable();
@@ -100,74 +102,44 @@ static void nft_counter_obj_destroy(struct nft_object *obj)
 	nft_counter_do_destroy(priv);
 }
 
-static void nft_counter_fetch(struct nft_counter_percpu __percpu *counter,
-			      struct nft_counter *total)
+static void nft_counter_fetch(struct nft_counter_percpu_priv *priv,
+			      struct nft_counter *total, bool reset)
 {
-	struct nft_counter_percpu *cpu_stats;
+	struct nft_counter *this_cpu;
+	const seqcount_t *myseq;
 	u64 bytes, packets;
 	unsigned int seq;
 	int cpu;
 
 	memset(total, 0, sizeof(*total));
 	for_each_possible_cpu(cpu) {
-		cpu_stats = per_cpu_ptr(counter, cpu);
+		myseq = per_cpu_ptr(&nft_counter_seq, cpu);
+		this_cpu = per_cpu_ptr(priv->counter, cpu);
 		do {
-			seq	= u64_stats_fetch_begin_irq(&cpu_stats->syncp);
-			bytes	= cpu_stats->counter.bytes;
-			packets	= cpu_stats->counter.packets;
-		} while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, seq));
-
-		total->packets += packets;
-		total->bytes += bytes;
-	}
-}
-
-static u64 __nft_counter_reset(u64 *counter)
-{
-	u64 ret, old;
-
-	do {
-		old = *counter;
-		ret = cmpxchg64(counter, old, 0);
-	} while (ret != old);
-
-	return ret;
-}
-
-static void nft_counter_reset(struct nft_counter_percpu __percpu *counter,
-			      struct nft_counter *total)
-{
-	struct nft_counter_percpu *cpu_stats;
-	u64 bytes, packets;
-	unsigned int seq;
-	int cpu;
-
-	memset(total, 0, sizeof(*total));
-	for_each_possible_cpu(cpu) {
-		bytes = packets = 0;
-
-		cpu_stats = per_cpu_ptr(counter, cpu);
-		do {
-			seq	= u64_stats_fetch_begin_irq(&cpu_stats->syncp);
-			packets	+= __nft_counter_reset(&cpu_stats->counter.packets);
-			bytes	+= __nft_counter_reset(&cpu_stats->counter.bytes);
-		} while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, seq));
-
-		total->packets += packets;
-		total->bytes += bytes;
+			seq	= read_seqcount_begin(myseq);
+			bytes	= this_cpu->bytes;
+			packets	= this_cpu->packets;
+		} while (read_seqcount_retry(myseq, seq));
+
+		total->bytes	+= bytes;
+		total->packets	+= packets;
+
+		if (reset) {
+			local_bh_disable();
+			this_cpu->packets -= packets;
+			this_cpu->bytes -= bytes;
+			local_bh_enable();
+		}
 	}
 }
 
 static int nft_counter_do_dump(struct sk_buff *skb,
-			       const struct nft_counter_percpu_priv *priv,
+			       struct nft_counter_percpu_priv *priv,
 			       bool reset)
 {
 	struct nft_counter total;
 
-	if (reset)
-		nft_counter_reset(priv->counter, &total);
-	else
-		nft_counter_fetch(priv->counter, &total);
+	nft_counter_fetch(priv, &total, reset);
 
 	if (nla_put_be64(skb, NFTA_COUNTER_BYTES, cpu_to_be64(total.bytes),
 			 NFTA_COUNTER_PAD) ||
@@ -216,7 +188,7 @@ static void nft_counter_eval(const struct nft_expr *expr,
 
 static int nft_counter_dump(struct sk_buff *skb, const struct nft_expr *expr)
 {
-	const struct nft_counter_percpu_priv *priv = nft_expr_priv(expr);
+	struct nft_counter_percpu_priv *priv = nft_expr_priv(expr);
 
 	return nft_counter_do_dump(skb, priv, false);
 }
@@ -242,21 +214,20 @@ static int nft_counter_clone(struct nft_expr *dst, const struct nft_expr *src)
 {
 	struct nft_counter_percpu_priv *priv = nft_expr_priv(src);
 	struct nft_counter_percpu_priv *priv_clone = nft_expr_priv(dst);
-	struct nft_counter_percpu __percpu *cpu_stats;
-	struct nft_counter_percpu *this_cpu;
+	struct nft_counter __percpu *cpu_stats;
+	struct nft_counter *this_cpu;
 	struct nft_counter total;
 
-	nft_counter_fetch(priv->counter, &total);
+	nft_counter_fetch(priv, &total, false);
 
-	cpu_stats = __netdev_alloc_pcpu_stats(struct nft_counter_percpu,
-					      GFP_ATOMIC);
+	cpu_stats = alloc_percpu_gfp(struct nft_counter, GFP_ATOMIC);
 	if (cpu_stats == NULL)
 		return -ENOMEM;
 
 	preempt_disable();
 	this_cpu = this_cpu_ptr(cpu_stats);
-	this_cpu->counter.packets = total.packets;
-	this_cpu->counter.bytes = total.bytes;
+	this_cpu->packets = total.packets;
+	this_cpu->bytes = total.bytes;
 	preempt_enable();
 
 	priv_clone->counter = cpu_stats;
@@ -285,7 +256,10 @@ static struct nft_expr_type nft_counter_type __read_mostly = {
 
 static int __init nft_counter_module_init(void)
 {
-	int err;
+	int cpu, err;
+
+	for_each_possible_cpu(cpu)
+		seqcount_init(per_cpu_ptr(&nft_counter_seq, cpu));
 
 	err = nft_register_obj(&nft_counter_obj);
 	if (err < 0)
-- 
2.1.4


^ permalink raw reply related

* Re: [PATCH net-next] netfilter: nft_counter: rework atomic dump and reset
From: Pablo Neira Ayuso @ 2016-12-10 14:16 UTC (permalink / raw)
  To: netfilter-devel; +Cc: eric.dumazet, arnd, netdev
In-Reply-To: <1481378741-19871-1-git-send-email-pablo@netfilter.org>

On Sat, Dec 10, 2016 at 03:05:41PM +0100, Pablo Neira Ayuso wrote:
[...]
> -static void nft_counter_reset(struct nft_counter_percpu __percpu *counter,
> -			      struct nft_counter *total)
> -{
> -	struct nft_counter_percpu *cpu_stats;
> -	u64 bytes, packets;
> -	unsigned int seq;
> -	int cpu;
> -
> -	memset(total, 0, sizeof(*total));
> -	for_each_possible_cpu(cpu) {
> -		bytes = packets = 0;
> -
> -		cpu_stats = per_cpu_ptr(counter, cpu);
> -		do {
> -			seq	= u64_stats_fetch_begin_irq(&cpu_stats->syncp);
> -			packets	+= __nft_counter_reset(&cpu_stats->counter.packets);
> -			bytes	+= __nft_counter_reset(&cpu_stats->counter.bytes);
> -		} while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, seq));
> -
> -		total->packets += packets;
> -		total->bytes += bytes;
> +			seq	= read_seqcount_begin(myseq);
> +			bytes	= this_cpu->bytes;
> +			packets	= this_cpu->packets;
> +		} while (read_seqcount_retry(myseq, seq));
> +
> +		total->bytes	+= bytes;
> +		total->packets	+= packets;
> +
> +		if (reset) {
> +			local_bh_disable();
> +			this_cpu->packets -= packets;
> +			this_cpu->bytes -= bytes;
> +			local_bh_enable();
> +		}

Actually this is not right either, Eric proposed this instead:

static void nft_counter_reset(struct nft_counter_percpu __percpu *counter,
                              struct nft_counter *total)
{
      struct nft_counter_percpu *cpu_stats;

      local_bh_disable();
      cpu_stats = this_cpu_ptr(counter);
      cpu_stats->counter.packets -= total->packets;
      cpu_stats->counter.bytes -= total->bytes;
      local_bh_enable();
}

The CPU running the reset code is guaranteed to own its own stats
exclusively, but my patch does not guarantee this, since it also
modifies the stats of other CPUs.

I'm going to send a v2. I think I need to turn the packet and byte
counters into s64, otherwise a sufficiently large total->packets may
underflow and confuse stats.
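
To illustrate the reset scheme described above, here is a minimal
user-space sketch. It is not kernel code: NR_CPUS, counter_fetch() and
counter_reset() are names made up for this illustration, and a plain array
stands in for the per-CPU data. It only shows that subtracting the fetched
total from a single slot can leave that slot negative, which is the case
the s64 change is meant to cover, while the sum across all slots drops to
zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 4	/* hypothetical CPU count for the simulation */

struct counter {
	int64_t bytes;
	int64_t packets;
};

/* Sum every per-CPU slot into *total. */
static void counter_fetch(const struct counter *percpu, struct counter *total)
{
	size_t cpu;

	total->bytes = 0;
	total->packets = 0;
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		total->bytes += percpu[cpu].bytes;
		total->packets += percpu[cpu].packets;
	}
}

/* Subtract the fetched total from one slot only (the "local" CPU). */
static void counter_reset(struct counter *percpu, size_t this_cpu,
			  const struct counter *total)
{
	assert(this_cpu < NR_CPUS);

	percpu[this_cpu].bytes -= total->bytes;
	percpu[this_cpu].packets -= total->packets;
}
```

After a fetch followed by a reset, a second fetch reports only the traffic
counted since the reset, regardless of which slot absorbed the subtraction.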

^ permalink raw reply

* Re: [PATCH net-next] netfilter: nft_counter: rework atomic dump and reset
From: Pablo Neira Ayuso @ 2016-12-10 14:25 UTC (permalink / raw)
  To: netfilter-devel; +Cc: eric.dumazet, arnd, netdev
In-Reply-To: <20161210141655.GA19963@salvia>

On Sat, Dec 10, 2016 at 03:16:55PM +0100, Pablo Neira Ayuso wrote:
> On Sat, Dec 10, 2016 at 03:05:41PM +0100, Pablo Neira Ayuso wrote:
> [...]
> > -static void nft_counter_reset(struct nft_counter_percpu __percpu *counter,
> > -			      struct nft_counter *total)
> > -{
> > -	struct nft_counter_percpu *cpu_stats;
> > -	u64 bytes, packets;
> > -	unsigned int seq;
> > -	int cpu;
> > -
> > -	memset(total, 0, sizeof(*total));
> > -	for_each_possible_cpu(cpu) {
> > -		bytes = packets = 0;
> > -
> > -		cpu_stats = per_cpu_ptr(counter, cpu);
> > -		do {
> > -			seq	= u64_stats_fetch_begin_irq(&cpu_stats->syncp);
> > -			packets	+= __nft_counter_reset(&cpu_stats->counter.packets);
> > -			bytes	+= __nft_counter_reset(&cpu_stats->counter.bytes);
> > -		} while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, seq));
> > -
> > -		total->packets += packets;
> > -		total->bytes += bytes;
> > +			seq	= read_seqcount_begin(myseq);
> > +			bytes	= this_cpu->bytes;
> > +			packets	= this_cpu->packets;
> > +		} while (read_seqcount_retry(myseq, seq));
> > +
> > +		total->bytes	+= bytes;
> > +		total->packets	+= packets;
> > +
> > +		if (reset) {
> > +			local_bh_disable();
> > +			this_cpu->packets -= packets;
> > +			this_cpu->bytes -= bytes;
> > +			local_bh_enable();
> > +		}
> 
> Actually this is not right either, Eric proposed this instead:
> 
> static void nft_counter_reset(struct nft_counter_percpu __percpu *counter,
>                               struct nft_counter *total)
> {
>       struct nft_counter_percpu *cpu_stats;
> 
>       local_bh_disable();
>       cpu_stats = this_cpu_ptr(counter);
>       cpu_stats->counter.packets -= total->packets;
>       cpu_stats->counter.bytes -= total->bytes;
>       local_bh_enable();
> }
> 
> The cpu that running over the reset code is guaranteed to own this
> stats exclusively, but this is not guaranteed by my patch.
> 
> I'm going to send a v2. I think I need to turn packet and byte
> counters into s64, otherwise a sufficient large total->packets may
> underflow and confuse stats.

So my plan is to fold this incremental change on this patch and send a
v2.

diff --git a/net/netfilter/nft_counter.c b/net/netfilter/nft_counter.c
index c37983d0a141..5647feb43f43 100644
--- a/net/netfilter/nft_counter.c
+++ b/net/netfilter/nft_counter.c
@@ -18,8 +18,8 @@
 #include <net/netfilter/nf_tables.h>
 
 struct nft_counter {
-       u64             bytes;
-       u64             packets;
+       s64             bytes;
+       s64             packets;
 };
 
 struct nft_counter_percpu_priv {
@@ -102,8 +102,20 @@ static void nft_counter_obj_destroy(struct nft_object *obj)
        nft_counter_do_destroy(priv);
 }
 
+static void nft_counter_reset(struct nft_counter_percpu_priv __percpu *priv,
+                             struct nft_counter *total)
+{
+       struct nft_counter *this_cpu;
+
+       local_bh_disable();
+       this_cpu = this_cpu_ptr(priv->counter);
+       this_cpu->packets -= total->packets;
+       this_cpu->bytes -= total->bytes;
+       local_bh_enable();
+}
+
 static void nft_counter_fetch(struct nft_counter_percpu_priv *priv,
-                             struct nft_counter *total, bool reset)
+                             struct nft_counter *total)
 {
        struct nft_counter *this_cpu;
        const seqcount_t *myseq;
@@ -123,13 +135,6 @@ static void nft_counter_fetch(struct nft_counter_percpu_priv *priv,
 
                total->bytes    += bytes;
                total->packets  += packets;
-
-               if (reset) {
-                       local_bh_disable();
-                       this_cpu->packets -= packets;
-                       this_cpu->bytes -= bytes;
-                       local_bh_enable();
-               }
        }
 }

@@ -139,7 +144,9 @@ static int nft_counter_do_dump(struct sk_buff *skb,
 {
        struct nft_counter total;
 
-       nft_counter_fetch(priv, &total, reset);
+       nft_counter_fetch(priv, &total);
+       if (reset)
+               nft_counter_reset(priv, &total);
 
        if (nla_put_be64(skb, NFTA_COUNTER_BYTES, cpu_to_be64(total.bytes),
                         NFTA_COUNTER_PAD) ||
@@ -218,7 +225,7 @@ static int nft_counter_clone(struct nft_expr *dst, const struct nft_expr *src)
        struct nft_counter *this_cpu;
        struct nft_counter total;
 
-       nft_counter_fetch(priv, &total, false);
+       nft_counter_fetch(priv, &total);
 
        cpu_stats = alloc_percpu_gfp(struct nft_counter, GFP_ATOMIC);
        if (cpu_stats == NULL)

^ permalink raw reply related

* Re: [PATCH net-next] netfilter: nft_counter: rework atomic dump and reset
From: Eric Dumazet @ 2016-12-10 15:40 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, arnd, netdev
In-Reply-To: <20161210142501.GA25221@salvia>

On Sat, 2016-12-10 at 15:25 +0100, Pablo Neira Ayuso wrote:
> On Sat, Dec 10, 2016 at 03:16:55PM +0100, Pablo Neira Ayuso wrote:
>  
> -       nft_counter_fetch(priv, &total, reset);
> +       nft_counter_fetch(priv, &total);
> +       if (reset)
> +               nft_counter_reset(priv, &total);
>  
>         if (nla_put_be64(skb, NFTA_COUNTER_BYTES,
> cpu_to_be64(total.bytes),
>                          NFTA_COUNTER_PAD) ||

Might be nitpicking, but you might want to reset the stats only if
nla_put_be64() succeeded.
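
A rough user-space sketch of that ordering, under stated assumptions:
put_stats() is a hypothetical stand-in for the nla_put_be64() calls, the
has_room flag simulates a message with or without space left, and a single
struct replaces the per-CPU data. The point is only that the reset happens
after the stats were successfully written out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct counter {
	int64_t bytes;
	int64_t packets;
};

/* Hypothetical stand-in for the netlink put: fails when there is no room. */
static int put_stats(struct counter *out, const struct counter *total,
		     bool has_room)
{
	if (!has_room)
		return -1;

	*out = *total;
	return 0;
}

/* Fetch, report, and reset only once the report has succeeded. */
static int counter_dump(struct counter *live, struct counter *out,
			bool has_room, bool reset)
{
	struct counter total;

	assert(live && out);

	total = *live;				/* "fetch" */

	if (put_stats(out, &total, has_room))
		return -1;			/* nothing reported: keep the counts */

	if (reset) {
		live->bytes -= total.bytes;
		live->packets -= total.packets;
	}
	return 0;
}
```

With this ordering a failed dump leaves the counters untouched, so the
values are not lost if the dump has to be retried.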

But regardless of this detail, patch looks good and is very close to the
one I cooked and was about to send this morning.

Thanks Pablo !

^ permalink raw reply

* Re: [PATCH/RFC net-next] net: fec: allow "mini jumbo" frames
From: Vivien Didelot @ 2016-12-10 16:55 UTC (permalink / raw)
  To: Nikita Yushchenko, Fugang Duan, David S. Miller, Troy Kisky,
	Florian Fainelli, Andrew Lunn, Eric Nelson, Philippe Reynes,
	Johannes Berg, netdev
  Cc: Chris Healy, Fabio Estevam, linux-kernel, Nikita Yushchenko
In-Reply-To: <1481275255-7650-1-git-send-email-nikita.yoush@cogentembedded.com>

Hi Nikita,

Nikita Yushchenko <nikita.yoush@cogentembedded.com> writes:

> This adds support for MTU slightly larger than default, on modern
> FEC flavours.
>
> Currently FEC driver uses single hardware Rx buffer per frame. On most
> FEC flavours, size of single buffer is limited by 11-bit field, and
> has to be multiple of 64 (in the worst case). Thus maximum usable Rx
> buffer size is 1984 bytes.
>
> Of those:
> - 2 bytes are used for IP header alignment,
> - 14 bytes are used by ethhdr,
> - up to 8 bytes are needed for VLAN and/or DSA tags,
> - 4 bytes are needed for CRC.
>
> Thus maximum MTU possible within current RX architecture is 1956.
>
> This patch allows exactly that. For further increase, Rx architecture
> change is needed.
>
> Use of MTU=1956 gives about 1.5% throughput improvement between two Vybrid
> boards, compared to default MTU=1500.
>
> Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>

For what it's worth, I have tested your patch on my ZII Rev B boards
(see vf610-zii-dev-rev-b.dts) which have a FEC as the master net device
of their DSA trees. They still work as expected.
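
For reference, the buffer and MTU arithmetic from the quoted description
can be checked with a small sketch. The helper names max_rx_buf() and
max_mtu() are invented here; the constants are the ones listed in the
commit message (11-bit length field, 64-byte granularity, 2 + 14 + 8 + 4
bytes of overhead).

```c
#include <assert.h>

/* Largest multiple of `align` that fits in a `bits`-wide length field. */
static unsigned int max_rx_buf(unsigned int bits, unsigned int align)
{
	unsigned int max;

	assert(align != 0 && bits < 32);

	max = (1u << bits) - 1;
	return max - (max % align);
}

/* Rx buffer size minus the per-frame overhead listed in the description. */
static unsigned int max_mtu(unsigned int rx_buf)
{
	const unsigned int ip_align = 2;	/* IP header alignment */
	const unsigned int eth_hdr = 14;	/* ethhdr */
	const unsigned int tags = 8;		/* VLAN and/or DSA tags */
	const unsigned int crc = 4;		/* CRC */
	const unsigned int overhead = ip_align + eth_hdr + tags + crc;

	assert(rx_buf >= overhead);

	return rx_buf - overhead;
}
```

max_rx_buf(11, 64) gives 1984 and max_mtu(1984) gives 1956, matching the
numbers in the patch description.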

Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Thanks,

        Vivien

^ permalink raw reply

* [iproute2 net-next 0/8] Add support for vrf helper
From: David Ahern @ 2016-12-10 17:47 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern

This series adds support to iproute2 to run a command against a specific
VRF. The user semantics are similar to 'ip netns'.

The 'ip vrf' subcommand supports 3 usages:

1. Run a command against a given vrf:
       ip vrf exec NAME CMD

   Uses the recently committed cgroup/sock BPF option. A vrf directory
   is added to the cgroup2 mount, and individual VRFs are created under
   it. A BPF filter is attached to the vrf/NAME cgroup2 to set
   sk_bound_dev_if to the device index of the VRF. From there the
   current process (ip's pid) is added to the cgroup.procs file and the
   given command is executed. In doing so all AF_INET/AF_INET6
   (ipv4/ipv6) sockets are automatically bound to the VRF domain.

   The association is inherited parent to child allowing the command to
   be a shell from which other commands are run relative to the VRF.

2. Show the VRF a process is bound to:
       ip vrf id
   This command essentially looks at /proc/pid/cgroup for a "::/vrf/"
   entry.

3. Show process ids bound to a VRF
       ip vrf pids NAME
   This command dumps the file MNT/vrf/NAME/cgroup.procs since that file
   shows the process ids in the particular vrf cgroup.
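
As an illustration of usage 2, the lookup of a "::/vrf/" entry could be
sketched roughly as below. This is not the code from the series:
vrf_from_cgroup_line() is a hypothetical helper, and it assumes the
cgroup2 entry in /proc/pid/cgroup has the form "0::/vrf/NAME".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the VRF name out of one /proc/<pid>/cgroup line such as
 * "0::/vrf/red\n". Returns 0 on success, -1 if the line has no
 * "::/vrf/" entry or the name does not fit in buf.
 */
static int vrf_from_cgroup_line(const char *line, char *buf, size_t len)
{
	static const char marker[] = "::/vrf/";
	const char *name;
	size_t n;

	assert(line && buf);

	name = strstr(line, marker);
	if (!name)
		return -1;
	name += sizeof(marker) - 1;

	/* the name ends at a nested cgroup separator or at end of line */
	n = strcspn(name, "/\n");
	if (n == 0 || n >= len)
		return -1;

	memcpy(buf, name, n);
	buf[n] = '\0';
	return 0;
}
```

A caller would read /proc/pid/cgroup line by line and stop at the first
line for which the helper returns 0.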

David Ahern (8):
  lib bpf: Add support for BPF_PROG_ATTACH and BPF_PROG_DETACH
  bpf: export bpf_prog_load
  Add libbpf.h header with BPF_XXXX macros
  move cmd_exec to lib utils
  Add filesystem APIs to lib
  change name_is_vrf to return index
  libnetlink: Add variant of rtnl_talk that does not display RTNETLINK
    answers error
  Introduce ip vrf command

 include/bpf_util.h   |   6 ++
 include/libbpf.h     | 184 ++++++++++++++++++++++++++++++++
 include/libnetlink.h |   3 +
 include/utils.h      |   4 +
 ip/Makefile          |   3 +-
 ip/ip.c              |   4 +-
 ip/ip_common.h       |   4 +-
 ip/iplink_vrf.c      |  29 ++++--
 ip/ipnetns.c         |  34 ------
 ip/ipvrf.c           | 289 +++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/Makefile         |   2 +-
 lib/bpf.c            |  71 ++++++++-----
 lib/exec.c           |  41 ++++++++
 lib/fs.c             | 143 +++++++++++++++++++++++++
 lib/libnetlink.c     |  20 +++-
 man/man8/ip-vrf.8    |  88 ++++++++++++++++
 16 files changed, 850 insertions(+), 75 deletions(-)
 create mode 100644 include/libbpf.h
 create mode 100644 ip/ipvrf.c
 create mode 100644 lib/exec.c
 create mode 100644 lib/fs.c
 create mode 100644 man/man8/ip-vrf.8

-- 
2.1.4

^ permalink raw reply

* [iproute2 net-next 1/8] lib bpf: Add support for BPF_PROG_ATTACH and BPF_PROG_DETACH
From: David Ahern @ 2016-12-10 17:47 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern
In-Reply-To: <1481392069-3138-1-git-send-email-dsa@cumulusnetworks.com>

For consistency with other bpf commands, the functions are named
bpf_prog_attach and bpf_prog_detach. The existing bpf_prog_attach is
renamed to bpf_prog_load_and_report since it calls bpf_prog_load and
bpf_prog_report.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/bpf_util.h |  3 +++
 lib/bpf.c          | 31 ++++++++++++++++++++++++++-----
 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/bpf_util.h b/include/bpf_util.h
index 05baeecda57f..49b96bbc208f 100644
--- a/include/bpf_util.h
+++ b/include/bpf_util.h
@@ -75,6 +75,9 @@ int bpf_trace_pipe(void);
 
 void bpf_print_ops(FILE *f, struct rtattr *bpf_ops, __u16 len);
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type);
+
 #ifdef HAVE_ELF
 int bpf_send_map_fds(const char *path, const char *obj);
 int bpf_recv_map_fds(const char *path, int *fds, struct bpf_map_aux *aux,
diff --git a/lib/bpf.c b/lib/bpf.c
index 2a8cd51d4dae..103fc1ef0593 100644
--- a/lib/bpf.c
+++ b/lib/bpf.c
@@ -850,6 +850,27 @@ int bpf_graft_map(const char *map_path, uint32_t *key, int argc, char **argv)
 	return ret;
 }
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+	union bpf_attr attr = {
+		.target_fd = target_fd,
+		.attach_bpf_fd = prog_fd,
+		.attach_type = type,
+	};
+
+	return bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+	union bpf_attr attr = {
+		.target_fd = target_fd,
+		.attach_type = type,
+	};
+
+	return bpf(BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
 #ifdef HAVE_ELF
 struct bpf_elf_prog {
 	enum bpf_prog_type	type;
@@ -1262,9 +1283,9 @@ static void bpf_prog_report(int fd, const char *section,
 	bpf_dump_error(ctx, "Verifier analysis:\n\n");
 }
 
-static int bpf_prog_attach(const char *section,
-			   const struct bpf_elf_prog *prog,
-			   struct bpf_elf_ctx *ctx)
+static int bpf_prog_load_and_report(const char *section,
+				    const struct bpf_elf_prog *prog,
+				    struct bpf_elf_ctx *ctx)
 {
 	int tries = 0, fd;
 retry:
@@ -1656,7 +1677,7 @@ static int bpf_fetch_prog(struct bpf_elf_ctx *ctx, const char *section,
 		prog.size    = data.sec_data->d_size;
 		prog.license = ctx->license;
 
-		fd = bpf_prog_attach(section, &prog, ctx);
+		fd = bpf_prog_load_and_report(section, &prog, ctx);
 		if (fd < 0)
 			return fd;
 
@@ -1755,7 +1776,7 @@ static int bpf_fetch_prog_relo(struct bpf_elf_ctx *ctx, const char *section,
 		prog.size    = data_insn.sec_data->d_size;
 		prog.license = ctx->license;
 
-		fd = bpf_prog_attach(section, &prog, ctx);
+		fd = bpf_prog_load_and_report(section, &prog, ctx);
 		if (fd < 0) {
 			*lderr = true;
 			return fd;
-- 
2.1.4

^ permalink raw reply related

* [iproute2 net-next 2/8] bpf: export bpf_prog_load
From: David Ahern @ 2016-12-10 17:47 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern
In-Reply-To: <1481392069-3138-1-git-send-email-dsa@cumulusnetworks.com>

Code move only; no functional change intended.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/bpf_util.h |  3 +++
 lib/bpf.c          | 40 ++++++++++++++++++++--------------------
 2 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/include/bpf_util.h b/include/bpf_util.h
index 49b96bbc208f..dcbdca6978d6 100644
--- a/include/bpf_util.h
+++ b/include/bpf_util.h
@@ -75,6 +75,9 @@ int bpf_trace_pipe(void);
 
 void bpf_print_ops(FILE *f, struct rtattr *bpf_ops, __u16 len);
 
+int bpf_prog_load(enum bpf_prog_type type, const struct bpf_insn *insns,
+		  size_t size_insns, const char *license, char *log,
+		  size_t size_log);
 int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type);
 int bpf_prog_detach(int target_fd, enum bpf_attach_type type);
 
diff --git a/lib/bpf.c b/lib/bpf.c
index 103fc1ef0593..b04c3a678b9c 100644
--- a/lib/bpf.c
+++ b/lib/bpf.c
@@ -871,6 +871,26 @@ int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
 	return bpf(BPF_PROG_DETACH, &attr, sizeof(attr));
 }
 
+int bpf_prog_load(enum bpf_prog_type type, const struct bpf_insn *insns,
+		  size_t size_insns, const char *license, char *log,
+		  size_t size_log)
+{
+	union bpf_attr attr = {};
+
+	attr.prog_type = type;
+	attr.insns = bpf_ptr_to_u64(insns);
+	attr.insn_cnt = size_insns / sizeof(struct bpf_insn);
+	attr.license = bpf_ptr_to_u64(license);
+
+	if (size_log > 0) {
+		attr.log_buf = bpf_ptr_to_u64(log);
+		attr.log_size = size_log;
+		attr.log_level = 1;
+	}
+
+	return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
+}
+
 #ifdef HAVE_ELF
 struct bpf_elf_prog {
 	enum bpf_prog_type	type;
@@ -988,26 +1008,6 @@ static int bpf_map_create(enum bpf_map_type type, uint32_t size_key,
 	return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 }
 
-static int bpf_prog_load(enum bpf_prog_type type, const struct bpf_insn *insns,
-			 size_t size_insns, const char *license, char *log,
-			 size_t size_log)
-{
-	union bpf_attr attr = {};
-
-	attr.prog_type = type;
-	attr.insns = bpf_ptr_to_u64(insns);
-	attr.insn_cnt = size_insns / sizeof(struct bpf_insn);
-	attr.license = bpf_ptr_to_u64(license);
-
-	if (size_log > 0) {
-		attr.log_buf = bpf_ptr_to_u64(log);
-		attr.log_size = size_log;
-		attr.log_level = 1;
-	}
-
-	return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
-}
-
 static int bpf_obj_pin(int fd, const char *pathname)
 {
 	union bpf_attr attr = {};
-- 
2.1.4

^ permalink raw reply related

* [iproute2 net-next 4/8] move cmd_exec to lib utils
From: David Ahern @ 2016-12-10 17:47 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern
In-Reply-To: <1481392069-3138-1-git-send-email-dsa@cumulusnetworks.com>

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/utils.h |  2 ++
 ip/ipnetns.c    | 34 ----------------------------------
 lib/Makefile    |  2 +-
 lib/exec.c      | 41 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 44 insertions(+), 35 deletions(-)
 create mode 100644 lib/exec.c

diff --git a/include/utils.h b/include/utils.h
index 26c970daa5d0..ac4517a3bde1 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -256,4 +256,6 @@ char *int_to_str(int val, char *buf);
 int get_guid(__u64 *guid, const char *arg);
 int get_real_family(int rtm_type, int rtm_family);
 
+int cmd_exec(const char *cmd, char **argv, bool do_fork);
+
 #endif /* __UTILS_H__ */
diff --git a/ip/ipnetns.c b/ip/ipnetns.c
index bd1e9013706c..db9a541769f1 100644
--- a/ip/ipnetns.c
+++ b/ip/ipnetns.c
@@ -357,40 +357,6 @@ static int netns_list(int argc, char **argv)
 	return 0;
 }
 
-static int cmd_exec(const char *cmd, char **argv, bool do_fork)
-{
-	fflush(stdout);
-	if (do_fork) {
-		int status;
-		pid_t pid;
-
-		pid = fork();
-		if (pid < 0) {
-			perror("fork");
-			exit(1);
-		}
-
-		if (pid != 0) {
-			/* Parent  */
-			if (waitpid(pid, &status, 0) < 0) {
-				perror("waitpid");
-				exit(1);
-			}
-
-			if (WIFEXITED(status)) {
-				return WEXITSTATUS(status);
-			}
-
-			exit(1);
-		}
-	}
-
-	if (execvp(cmd, argv)  < 0)
-		fprintf(stderr, "exec of \"%s\" failed: %s\n",
-				cmd, strerror(errno));
-	_exit(1);
-}
-
 static int on_netns_exec(char *nsname, void *arg)
 {
 	char **argv = arg;
diff --git a/lib/Makefile b/lib/Makefile
index 5b7ec169048a..749073261c49 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -8,7 +8,7 @@ CFLAGS += -fPIC
 
 UTILOBJ = utils.o rt_names.o ll_types.o ll_proto.o ll_addr.o \
 	inet_proto.o namespace.o json_writer.o \
-	names.o color.o bpf.o
+	names.o color.o bpf.o exec.o
 
 NLOBJ=libgenl.o ll_map.o libnetlink.o
 
diff --git a/lib/exec.c b/lib/exec.c
new file mode 100644
index 000000000000..96edbc422e84
--- /dev/null
+++ b/lib/exec.c
@@ -0,0 +1,41 @@
+#define _ATFILE_SOURCE
+#include <sys/wait.h>
+#include <stdio.h>
+#include <errno.h>
+#include <unistd.h>
+
+#include "utils.h"
+
+int cmd_exec(const char *cmd, char **argv, bool do_fork)
+{
+	fflush(stdout);
+	if (do_fork) {
+		int status;
+		pid_t pid;
+
+		pid = fork();
+		if (pid < 0) {
+			perror("fork");
+			exit(1);
+		}
+
+		if (pid != 0) {
+			/* Parent  */
+			if (waitpid(pid, &status, 0) < 0) {
+				perror("waitpid");
+				exit(1);
+			}
+
+			if (WIFEXITED(status)) {
+				return WEXITSTATUS(status);
+			}
+
+			exit(1);
+		}
+	}
+
+	if (execvp(cmd, argv)  < 0)
+		fprintf(stderr, "exec of \"%s\" failed: %s\n",
+				cmd, strerror(errno));
+	_exit(1);
+}
-- 
2.1.4

^ permalink raw reply related

* [iproute2 net-next 6/8] change name_is_vrf to return index
From: David Ahern @ 2016-12-10 17:47 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern
In-Reply-To: <1481392069-3138-1-git-send-email-dsa@cumulusnetworks.com>

An index of 0 means the name is not a valid VRF.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 ip/ip_common.h  |  2 +-
 ip/iplink_vrf.c | 15 +++++++++------
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index 0147f45a7a31..3162f1ca5b2c 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -91,7 +91,7 @@ struct link_util *get_link_kind(const char *kind);
 void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len);
 
 __u32 ipvrf_get_table(const char *name);
-bool name_is_vrf(const char *name);
+int name_is_vrf(const char *name);
 
 #ifndef	INFINITY_LIFE_TIME
 #define     INFINITY_LIFE_TIME      0xFFFFFFFFU
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
index a238b2906805..c101ed770f87 100644
--- a/ip/iplink_vrf.c
+++ b/ip/iplink_vrf.c
@@ -159,7 +159,7 @@ __u32 ipvrf_get_table(const char *name)
 	return tb_id;
 }
 
-bool name_is_vrf(const char *name)
+int name_is_vrf(const char *name)
 {
 	struct {
 		struct nlmsghdr		n;
@@ -187,24 +187,27 @@ bool name_is_vrf(const char *name)
 	addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
 
 	if (rtnl_talk(&rth, &req.n, &answer.n, sizeof(answer)) < 0)
-		return false;
+		return 0;
 
 	ifi = NLMSG_DATA(&answer.n);
 	len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
 	if (len < 0) {
 		fprintf(stderr, "BUG: Invalid response to link query.\n");
-		return false;
+		return 0;
 	}
 
 	parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len);
 
 	if (!tb[IFLA_LINKINFO])
-		return false;
+		return 0;
 
 	parse_rtattr_nested(li, IFLA_INFO_MAX, tb[IFLA_LINKINFO]);
 
 	if (!li[IFLA_INFO_KIND])
-		return false;
+		return 0;
+
+	if (strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf"))
+		return 0;
 
-	return strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf") == 0;
+	return ifi->ifi_index;
 }
-- 
2.1.4

^ permalink raw reply related

* [iproute2 net-next 5/8] Add filesystem APIs to lib
From: David Ahern @ 2016-12-10 17:47 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern
In-Reply-To: <1481392069-3138-1-git-send-email-dsa@cumulusnetworks.com>

Add make_path to recursively call mkdir as needed to create a given
path with the given mode.

Add find_cgroup2_mount to lookup path where cgroup2 is mounted. If it
is not already mounted, cgroup2 is mounted under /var/run/cgroup2 for
use by iproute2.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/utils.h |   2 +
 lib/Makefile    |   2 +-
 lib/fs.c        | 143 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 146 insertions(+), 1 deletion(-)
 create mode 100644 lib/fs.c

diff --git a/include/utils.h b/include/utils.h
index ac4517a3bde1..dc1d6b9607dd 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -257,5 +257,7 @@ int get_guid(__u64 *guid, const char *arg);
 int get_real_family(int rtm_type, int rtm_family);
 
 int cmd_exec(const char *cmd, char **argv, bool do_fork);
+int make_path(const char *path, mode_t mode);
+char *find_cgroup2_mount(void);
 
 #endif /* __UTILS_H__ */
diff --git a/lib/Makefile b/lib/Makefile
index 749073261c49..0c57662b4f8f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -8,7 +8,7 @@ CFLAGS += -fPIC
 
 UTILOBJ = utils.o rt_names.o ll_types.o ll_proto.o ll_addr.o \
 	inet_proto.o namespace.o json_writer.o \
-	names.o color.o bpf.o exec.o
+	names.o color.o bpf.o exec.o fs.o
 
 NLOBJ=libgenl.o ll_map.o libnetlink.o
 
diff --git a/lib/fs.c b/lib/fs.c
new file mode 100644
index 000000000000..39cc96dccca9
--- /dev/null
+++ b/lib/fs.c
@@ -0,0 +1,143 @@
+/*
+ * fs.c         filesystem APIs
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	David Ahern <dsa@cumulusnetworks.com>
+ *
+ */
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/socket.h>
+#include <sys/mount.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <limits.h>
+
+#include "utils.h"
+
+#define CGROUP2_FS_NAME "cgroup2"
+
+/* if not already mounted cgroup2 is mounted here for iproute2's use */
+#define MNT_CGRP2_PATH  "/var/run/cgroup2"
+
+/* return mount path of first occurrence of given fstype */
+static char *find_fs_mount(const char *fs_to_find)
+{
+	char path[4096];
+	char fstype[128];    /* max length of any filesystem name */
+	char *mnt = NULL;
+	FILE *fp;
+
+	fp = fopen("/proc/mounts", "r");
+	if (!fp) {
+		fprintf(stderr,
+			"Failed to open mounts file: %s\n", strerror(errno));
+		return NULL;
+	}
+
+	while (fscanf(fp, "%*s %4095s %127s %*s %*d %*d\n",
+		      path, fstype) == 2) {
+		if (strcmp(fstype, fs_to_find) == 0) {
+			mnt = strdup(path);
+			break;
+		}
+	}
+
+	fclose(fp);
+
+	return mnt;
+}
+
+/* caller needs to free string returned */
+char *find_cgroup2_mount(void)
+{
+	char *mnt = find_fs_mount(CGROUP2_FS_NAME);
+
+	if (mnt)
+		return mnt;
+
+	mnt = strdup(MNT_CGRP2_PATH);
+	if (!mnt) {
+		fprintf(stderr, "Failed to allocate memory for cgroup2 path\n");
+		return NULL;
+
+	}
+
+	if (make_path(mnt, 0755)) {
+		fprintf(stderr, "Failed to setup vrf cgroup2 directory\n");
+		free(mnt);
+		return NULL;
+	}
+
+	if (mount("none", mnt, CGROUP2_FS_NAME, 0, NULL)) {
+		/* EBUSY means already mounted */
+		if (errno != EBUSY) {
+			fprintf(stderr,
+				"Failed to mount cgroup2. Are CGROUPS enabled in your kernel?\n");
+			free(mnt);
+			return NULL;
+		}
+	}
+	return mnt;
+}
+
+int make_path(const char *path, mode_t mode)
+{
+	char *dir, *delim;
+	struct stat sbuf;
+	int rc = -1;
+
+	delim = dir = strdup(path);
+	if (dir == NULL) {
+		fprintf(stderr, "strdup failed copying path\n");
+		return -1;
+	}
+
+	/* skip '/' -- it had better exist */
+	if (*delim == '/')
+		delim++;
+
+	while (1) {
+		delim = strchr(delim, '/');
+		if (delim)
+			*delim = '\0';
+
+		if (stat(dir, &sbuf) != 0) {
+			if (errno != ENOENT) {
+				fprintf(stderr,
+					"stat failed for %s: %s\n",
+					dir, strerror(errno));
+				goto out;
+			}
+
+			if (mkdir(dir, mode) != 0) {
+				fprintf(stderr,
+					"mkdir failed for %s: %s\n",
+					dir, strerror(errno));
+				goto out;
+			}
+		}
+
+		if (delim == NULL)
+			break;
+
+		*delim = '/';
+		delim++;
+		if (*delim == '\0')
+			break;
+	}
+	rc = 0;
+out:
+	free(dir);
+
+	return rc;
+}
-- 
2.1.4

^ permalink raw reply related

* [iproute2 net-next 7/8] libnetlink: Add variant of rtnl_talk that does not display RTNETLINK answers error
From: David Ahern @ 2016-12-10 17:47 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern
In-Reply-To: <1481392069-3138-1-git-send-email-dsa@cumulusnetworks.com>

iplink_vrf has 2 functions used to validate that a user-given device name
is a VRF device and to return the table id. If the user string is not a
device name, ip commands with a vrf keyword show a confusing error
message: "RTNETLINK answers: No such device".

Add a variant of rtnl_talk that does not display the "RTNETLINK answers"
message and update iplink_vrf to use it.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/libnetlink.h |  3 +++
 ip/iplink_vrf.c      | 14 +++++++++++---
 lib/libnetlink.c     | 20 +++++++++++++++++---
 3 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index 751ebf186dd4..bd0267dfcc02 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -81,6 +81,9 @@ int rtnl_dump_filter_nc(struct rtnl_handle *rth,
 int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
 	      struct nlmsghdr *answer, size_t len)
 	__attribute__((warn_unused_result));
+int rtnl_talk_suppress_rtnl_errmsg(struct rtnl_handle *rtnl, struct nlmsghdr *n,
+				   struct nlmsghdr *answer, size_t len)
+	__attribute__((warn_unused_result));
 int rtnl_send(struct rtnl_handle *rth, const void *buf, int)
 	__attribute__((warn_unused_result));
 int rtnl_send_check(struct rtnl_handle *rth, const void *buf, int)
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
index c101ed770f87..917630e85337 100644
--- a/ip/iplink_vrf.c
+++ b/ip/iplink_vrf.c
@@ -13,6 +13,7 @@
 #include <string.h>
 #include <sys/socket.h>
 #include <linux/if_link.h>
+#include <errno.h>
 
 #include "rt_names.h"
 #include "utils.h"
@@ -126,8 +127,14 @@ __u32 ipvrf_get_table(const char *name)
 
 	addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
 
-	if (rtnl_talk(&rth, &req.n, &answer.n, sizeof(answer)) < 0)
-		return 0;
+	if (rtnl_talk_suppress_rtnl_errmsg(&rth, &req.n,
+					   &answer.n, sizeof(answer)) < 0) {
+		/* special case "default" vrf to be the main table */
+		if (errno == ENODEV && !strcmp(name, "default"))
+			rtnl_rttable_a2n(&tb_id, "main");
+
+		return tb_id;
+	}
 
 	ifi = NLMSG_DATA(&answer.n);
 	len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
@@ -186,7 +193,8 @@ int name_is_vrf(const char *name)
 
 	addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
 
-	if (rtnl_talk(&rth, &req.n, &answer.n, sizeof(answer)) < 0)
+	if (rtnl_talk_suppress_rtnl_errmsg(&rth, &req.n,
+					   &answer.n, sizeof(answer)) < 0)
 		return 0;
 
 	ifi = NLMSG_DATA(&answer.n);
diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index a5db168e50eb..9d7e89aebbd0 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -12,6 +12,7 @@
 
 #include <stdio.h>
 #include <stdlib.h>
+#include <stdbool.h>
 #include <unistd.h>
 #include <syslog.h>
 #include <fcntl.h>
@@ -397,8 +398,9 @@ int rtnl_dump_filter_nc(struct rtnl_handle *rth,
 	return rtnl_dump_filter_l(rth, a);
 }
 
-int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
-	      struct nlmsghdr *answer, size_t maxlen)
+static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
+		       struct nlmsghdr *answer, size_t maxlen,
+		       bool show_rtnl_err)
 {
 	int status;
 	unsigned int seq;
@@ -485,7 +487,7 @@ int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
 					return 0;
 				}
 
-				if (rtnl->proto != NETLINK_SOCK_DIAG)
+				if (rtnl->proto != NETLINK_SOCK_DIAG && show_rtnl_err)
 					fprintf(stderr,
 						"RTNETLINK answers: %s\n",
 						strerror(-err->error));
@@ -517,6 +519,18 @@ int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
 	}
 }
 
+int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
+	      struct nlmsghdr *answer, size_t maxlen)
+{
+	return __rtnl_talk(rtnl, n, answer, maxlen, true);
+}
+
+int rtnl_talk_suppress_rtnl_errmsg(struct rtnl_handle *rtnl, struct nlmsghdr *n,
+				   struct nlmsghdr *answer, size_t maxlen)
+{
+	return __rtnl_talk(rtnl, n, answer, maxlen, false);
+}
+
 int rtnl_listen_all_nsid(struct rtnl_handle *rth)
 {
 	unsigned int on = 1;
-- 
2.1.4

* [iproute2 net-next 8/8] Introduce ip vrf command
From: David Ahern @ 2016-12-10 17:47 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern
In-Reply-To: <1481392069-3138-1-git-send-email-dsa@cumulusnetworks.com>

'ip vrf' follows the user semantics established by 'ip netns'.

The 'ip vrf' subcommand supports 3 usages:

1. Run a command against a given vrf:
       ip vrf exec NAME CMD

   Uses the recently committed cgroup/sock BPF option. A vrf directory
   is added to the cgroup2 mount, and individual VRFs are created under
   it. A BPF filter is attached to the vrf/NAME cgroup2 to set
   sk_bound_dev_if to the VRF device index. From there the current
   process (ip's pid) is added to the cgroup.procs file and the given
   command is executed. In doing so, all AF_INET/AF_INET6 (ipv4/ipv6)
   sockets are automatically bound to the VRF domain.

   The association is inherited parent to child allowing the command to
   be a shell from which other commands are run relative to the VRF.

2. Show the VRF a process is bound to:
       ip vrf id
   This command essentially looks at /proc/pid/cgroup for a "::/vrf/"
   entry with the VRF name following.

3. Show process ids bound to a VRF:
       ip vrf pids NAME
   This command dumps the file MNT/vrf/NAME/cgroup.procs since that file
   shows the process ids in the particular vrf cgroup.
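
The lookup described in usage 2 can be sketched in isolation as below.
This is a hedged illustration rather than the code in ip/ipvrf.c:
vrf_name_from_cgroup() is a hypothetical helper that works on text already
read from /proc/pid/cgroup, and the sample strings in the usage note are
made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the VRF name that follows "::/vrf/" in cgroup text into out.
 * Returns 0 on success, -1 if there is no VRF entry, the name is
 * empty, or out is too small to hold it.
 */
int vrf_name_from_cgroup(const char *text, char *out, size_t outlen)
{
	static const char marker[] = "::/vrf/";
	const char *start;
	size_t len;

	assert(text != NULL && out != NULL);

	start = strstr(text, marker);
	if (!start)
		return -1;

	start += sizeof(marker) - 1;	/* skip past the marker */
	len = strcspn(start, "\n");	/* name ends at newline or end of text */
	if (len == 0 || len >= outlen)
		return -1;

	memcpy(out, start, len);
	out[len] = '\0';
	return 0;
}
```

Given the text "0::/vrf/red\n" the helper stores "red" in out; given
"0::/\n" (a process that is not in a vrf cgroup) it returns -1.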

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 ip/Makefile       |   3 +-
 ip/ip.c           |   4 +-
 ip/ip_common.h    |   2 +
 ip/ipvrf.c        | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 man/man8/ip-vrf.8 |  88 +++++++++++++++++
 5 files changed, 384 insertions(+), 2 deletions(-)
 create mode 100644 ip/ipvrf.c
 create mode 100644 man/man8/ip-vrf.8

diff --git a/ip/Makefile b/ip/Makefile
index c8e6c6172741..1928489e7f90 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -7,7 +7,8 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
     iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
     link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
     iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \
-    iplink_geneve.o iplink_vrf.o iproute_lwtunnel.o ipmacsec.o ipila.o
+    iplink_geneve.o iplink_vrf.o iproute_lwtunnel.o ipmacsec.o ipila.o \
+    ipvrf.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/ip.c b/ip/ip.c
index cb3adcb3f57d..07050b07592a 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -51,7 +51,8 @@ static void usage(void)
 "       ip [ -force ] -batch filename\n"
 "where  OBJECT := { link | address | addrlabel | route | rule | neigh | ntable |\n"
 "                   tunnel | tuntap | maddress | mroute | mrule | monitor | xfrm |\n"
-"                   netns | l2tp | fou | macsec | tcp_metrics | token | netconf | ila }\n"
+"                   netns | l2tp | fou | macsec | tcp_metrics | token | netconf | ila |\n"
+"                   vrf }\n"
 "       OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] | -r[esolve] |\n"
 "                    -h[uman-readable] | -iec |\n"
 "                    -f[amily] { inet | inet6 | ipx | dnet | mpls | bridge | link } |\n"
@@ -99,6 +100,7 @@ static const struct cmd {
 	{ "mrule",	do_multirule },
 	{ "netns",	do_netns },
 	{ "netconf",	do_ipnetconf },
+	{ "vrf",	do_ipvrf},
 	{ "help",	do_help },
 	{ 0 }
 };
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 3162f1ca5b2c..28763e81e4a4 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -57,6 +57,8 @@ extern int do_ipila(int argc, char **argv);
 int do_tcp_metrics(int argc, char **argv);
 int do_ipnetconf(int argc, char **argv);
 int do_iptoken(int argc, char **argv);
+int do_ipvrf(int argc, char **argv);
+
 int iplink_get(unsigned int flags, char *name, __u32 filt_mask);
 
 static inline int rtm_get_table(struct rtmsg *r, struct rtattr **tb)
diff --git a/ip/ipvrf.c b/ip/ipvrf.c
new file mode 100644
index 000000000000..c4f0e53532e2
--- /dev/null
+++ b/ip/ipvrf.c
@@ -0,0 +1,289 @@
+/*
+ * ipvrf.c	"ip vrf"
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	David Ahern <dsa@cumulusnetworks.com>
+ *
+ */
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/socket.h>
+#include <sys/mount.h>
+#include <linux/bpf.h>
+#include <linux/if.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <limits.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+#include "libbpf.h"
+#include "bpf_util.h"
+
+#define CGRP_PROC_FILE  "/cgroup.procs"
+
+static void usage(void)
+{
+	fprintf(stderr, "Usage: ip vrf exec [NAME] cmd ...\n");
+	fprintf(stderr, "       ip vrf identify [PID]\n");
+	fprintf(stderr, "       ip vrf pids [NAME]\n");
+
+	exit(-1);
+}
+
+static int ipvrf_identify(int argc, char **argv)
+{
+	char path[PATH_MAX];
+	char buf[4096];
+	char *vrf, *end;
+	int fd, rc = -1;
+	unsigned int pid;
+	ssize_t n;
+
+	if (argc < 1)
+		pid = getpid();
+	else if (argc > 1)
+		invarg("Extra arguments specified\n", argv[1]);
+	else if (get_unsigned(&pid, argv[0], 10))
+		invarg("Invalid pid\n", argv[0]);
+
+	snprintf(path, sizeof(path), "/proc/%u/cgroup", pid);
+	fd = open(path, O_RDONLY);
+	if (fd < 0) {
+		fprintf(stderr,
+			"Failed to open cgroups file: %s\n", strerror(errno));
+		return -1;
+	}
+
+	n = read(fd, buf, sizeof(buf) - 1);
+	if (n < 0) {
+		fprintf(stderr,
+			"Failed to read cgroups file: %s\n", strerror(errno));
+		goto out;
+	}
+	buf[n] = '\0';
+	vrf = strstr(buf, "::/vrf/");
+	if (vrf) {
+		vrf += 7;  /* skip past "::/vrf/" */
+		end = strchr(vrf, '\n');
+		if (end)
+			*end = '\0';
+
+		printf("%s\n", vrf);
+	}
+
+	rc = 0;
+out:
+	close(fd);
+
+	return rc;
+}
+
+static int ipvrf_pids(int argc, char **argv)
+{
+	char path[PATH_MAX];
+	char buf[4096];
+	char *mnt, *vrf;
+	int fd, rc = -1;
+	ssize_t n;
+
+	if (argc != 1) {
+		fprintf(stderr, "Invalid arguments\n");
+		return -1;
+	}
+
+	vrf = argv[0];
+
+	mnt = find_cgroup2_mount();
+	if (!mnt)
+		return -1;
+
+	snprintf(path, sizeof(path), "%s/vrf/%s%s", mnt, vrf, CGRP_PROC_FILE);
+	free(mnt);
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return 0; /* no cgroup file, nothing to show */
+
+	while (1) {
+		n = read(fd, buf, sizeof(buf) - 1);
+		if (n < 0) {
+			fprintf(stderr,
+				"Failed to read cgroups file: %s\n", strerror(errno));
+			break;
+		} else if (n == 0) {
+			rc = 0;
+			break;
+		}
+		printf("%.*s", (int)n, buf);
+	}
+
+	close(fd);
+
+	return rc;
+}
+
+/* load BPF program to set sk_bound_dev_if for sockets */
+static char bpf_log_buf[256*1024];
+
+static int prog_load(int idx)
+{
+	struct bpf_insn prog[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+		BPF_MOV64_IMM(BPF_REG_3, idx),
+		BPF_MOV64_IMM(BPF_REG_2, offsetof(struct bpf_sock, bound_dev_if)),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_3, offsetof(struct bpf_sock, bound_dev_if)),
+		BPF_MOV64_IMM(BPF_REG_0, 1), /* r0 = verdict */
+		BPF_EXIT_INSN(),
+	};
+
+	return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCK, prog, sizeof(prog),
+			     "GPL", bpf_log_buf, sizeof(bpf_log_buf));
+}
+
+static int vrf_configure_cgroup(const char *path, int ifindex)
+{
+	int rc = -1, cg_fd, prog_fd = -1;
+
+	cg_fd = open(path, O_DIRECTORY | O_RDONLY);
+	if (cg_fd < 0) {
+		fprintf(stderr, "Failed to open cgroup path: '%s'\n", strerror(errno));
+		goto out;
+	}
+
+	/*
+	 * Load bpf program into kernel and attach to cgroup to affect
+	 * socket creates
+	 */
+	prog_fd = prog_load(ifindex);
+	if (prog_fd < 0) {
+		fprintf(stderr, "Failed to load BPF prog: '%s'\n", strerror(errno));
+		goto out;
+	}
+
+	if (bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_SOCK_CREATE)) {
+		fprintf(stderr, "Failed to attach prog to cgroup: '%s'\n",
+			strerror(errno));
+		fprintf(stderr, "Kernel compiled with CGROUP_BPF enabled?\n");
+		goto out;
+	}
+
+	rc = 0;
+out:
+	close(cg_fd);
+	close(prog_fd);
+
+	return rc;
+}
+
+static int vrf_switch(const char *name)
+{
+	char path[PATH_MAX], *mnt, pid[16];
+	int ifindex = name_is_vrf(name);
+	bool default_vrf = false;
+	int rc = -1, len, fd = -1;
+
+	if (!ifindex) {
+		if (strcmp(name, "default")) {
+			fprintf(stderr, "Invalid VRF name\n");
+			return -1;
+		}
+		default_vrf = true;
+	}
+
+	mnt = find_cgroup2_mount();
+	if (!mnt)
+		return -1;
+
+	/* path to cgroup; make sure buffer has room to cat "/cgroup.procs"
+	 * to the end of the path
+	 */
+	len = snprintf(path, sizeof(path) - sizeof(CGRP_PROC_FILE), "%s%s/%s",
+		       mnt, default_vrf ? "" : "/vrf", name);
+	if (len >= sizeof(path) - sizeof(CGRP_PROC_FILE)) {
+		fprintf(stderr, "Invalid path to cgroup2 mount\n");
+		goto out;
+	}
+
+	if (make_path(path, 0755)) {
+		fprintf(stderr, "Failed to setup vrf cgroup2 directory\n");
+		goto out;
+	}
+
+	if (!default_vrf && vrf_configure_cgroup(path, ifindex))
+		goto out;
+
+	/*
+	 * write pid to cgroup.procs making process part of cgroup
+	 */
+	strcat(path, CGRP_PROC_FILE);
+	fd = open(path, O_RDWR | O_APPEND);
+	if (fd < 0) {
+		fprintf(stderr, "cgroup.procs file does not exist.\n");
+		goto out;
+	}
+
+	snprintf(pid, sizeof(pid), "%d", getpid());
+	if (write(fd, pid, strlen(pid)) < 0) {
+		fprintf(stderr, "Failed to join cgroup\n");
+		goto out;
+	}
+
+	rc = 0;
+out:
+	free(mnt);
+	close(fd);
+
+	return rc;
+}
+
+static int ipvrf_exec(int argc, char **argv)
+{
+	if (argc < 1) {
+		fprintf(stderr, "No VRF name specified\n");
+		return -1;
+	}
+	if (argc < 2) {
+		fprintf(stderr, "No command specified\n");
+		return -1;
+	}
+
+	if (vrf_switch(argv[0]))
+		return -1;
+
+	return -cmd_exec(argv[1], argv + 1, !!batch_mode);
+}
+
+int do_ipvrf(int argc, char **argv)
+{
+	if (argc == 0) {
+		fprintf(stderr, "No command given. Try \"ip vrf help\".\n");
+		exit(-1);
+	}
+
+	if (matches(*argv, "identify") == 0)
+		return ipvrf_identify(argc-1, argv+1);
+
+	if (matches(*argv, "pids") == 0)
+		return ipvrf_pids(argc-1, argv+1);
+
+	if (matches(*argv, "exec") == 0)
+		return ipvrf_exec(argc-1, argv+1);
+
+	if (matches(*argv, "help") == 0)
+		usage();
+
+	fprintf(stderr, "Command \"%s\" is unknown, try \"ip vrf help\".\n",
+		*argv);
+
+	exit(-1);
+}
diff --git a/man/man8/ip-vrf.8 b/man/man8/ip-vrf.8
new file mode 100644
index 000000000000..57a7c7692ce8
--- /dev/null
+++ b/man/man8/ip-vrf.8
@@ -0,0 +1,88 @@
+.TH IP\-VRF 8 "7 Dec 2016" "iproute2" "Linux"
+.SH NAME
+ip-vrf \- run a command against a vrf
+.SH SYNOPSIS
+.sp
+.ad l
+.in +8
+.ti -8
+.B ip
+.B vrf
+.RI  " { " COMMAND " | "
+.BR help " }"
+.sp
+
+.ti -8
+.BR "ip vrf identify"
+.RI "[ " PID " ]"
+
+.ti -8
+.BR "ip vrf pids"
+.I NAME
+
+.ti -8
+.BR "ip vrf exec "
+.RI "[ " NAME " ] " command ...
+
+.SH DESCRIPTION
+A VRF provides traffic isolation at layer 3 for routing, similar to how a
+VLAN is used to isolate traffic at layer 2. Fundamentally, a VRF is a separate
+routing table. Network devices are associated with a VRF by enslaving the
+device to the VRF. At that point network addresses assigned to the device are
+local to the VRF with host and connected routes moved to the table associated
+with the VRF.
+
+A process can specify a VRF using several APIs -- binding the socket to the
+VRF device using SO_BINDTODEVICE, setting the VRF association using
+IP_UNICAST_IF or IPV6_UNICAST_IF, or specifying the VRF for a specific message
+using IP_PKTINFO or IPV6_PKTINFO.
+
+By default a process is not bound to any VRF. An association can be set
+explicitly by making the program use one of the APIs mentioned above or
+implicitly using a helper to set SO_BINDTODEVICE for all IPv4 and IPv6
+sockets (AF_INET and AF_INET6) when the socket is created. This ip-vrf command
+is a helper to run a command against a specific VRF with the VRF association
+inherited parent to child.
+
+.TP
+.B ip vrf exec [ NAME ] cmd ... - Run cmd against the named VRF
+.sp
+This command allows applications that are VRF unaware to be run against
+a VRF other than the default VRF (main table). A command can be run against
+the default VRF by passing "default" as the VRF name. This is useful if
+the current shell is associated with another VRF (e.g., Management VRF).
+
+.TP
+.B ip vrf identify [PID] - Report VRF association for process
+.sp
+This command shows the VRF association of the specified process. If PID is
+not specified then the id of the current process is used.
+
+.TP
+.B ip vrf pids NAME - Report processes associated with the named VRF
+.sp
+This command shows all process ids that are associated with the given
+VRF.
+
+.SH CAVEATS
+This command requires a kernel compiled with CGROUPS and CGROUP_BPF enabled.
+
+The VRF helper *only* affects network layer sockets.
+
+.SH EXAMPLES
+.PP
+ip vrf exec red ssh 10.100.1.254
+.RS
+Executes ssh to 10.100.1.254 against the VRF red table.
+.RE
+
+.SH SEE ALSO
+.br
+.BR ip (8),
+.BR ip-link (8),
+.BR ip-address (8),
+.BR ip-route (8),
+.BR ip-neighbor (8)
+
+.SH AUTHOR
+Original Manpage by David Ahern
-- 
2.1.4
