Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next 1/3] net: mvpp2: make the phy optional
From: Antoine Tenart @ 2017-08-31  7:12 UTC (permalink / raw)
  To: davem
  Cc: Antoine Tenart, andrew, gregory.clement, thomas.petazzoni, nadavh,
	linux, linux-kernel, mw, stefanc, miquel.raynal, netdev
In-Reply-To: <20170831071256.18416-1-antoine.tenart@free-electrons.com>

There is not necessarily a PHY between the GoP and the physical port.
However, the driver currently makes the "phy" property mandatory,
contrary to what is stated in the device tree bindings. This patch makes
the PHY optional, and aligns the PPv2 driver on its device tree
documentation. However if a PHY is provided, the GoP link interrupt
won't be used.

With this patch switches directly connected to the serdes lanes and SFP
ports on the Armada 8040-db and Armada 7040-db can be used if the link
interrupt is described in the device tree.

Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Tested-by: Marcin Wojtas <mw@semihalf.com>
---
 drivers/net/ethernet/marvell/mvpp2.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
index 9e64b1ba3d43..1916a4035ea0 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -6484,7 +6484,8 @@ static void mvpp2_start_dev(struct mvpp2_port *port)
 
 	mvpp2_port_mii_set(port);
 	mvpp2_port_enable(port);
-	phy_start(ndev->phydev);
+	if (ndev->phydev)
+		phy_start(ndev->phydev);
 	netif_tx_start_all_queues(port->dev);
 }
 
@@ -6510,7 +6511,8 @@ static void mvpp2_stop_dev(struct mvpp2_port *port)
 
 	mvpp2_egress_disable(port);
 	mvpp2_port_disable(port);
-	phy_stop(ndev->phydev);
+	if (ndev->phydev)
+		phy_stop(ndev->phydev);
 	phy_power_off(port->comphy);
 }
 
@@ -6567,6 +6569,10 @@ static int mvpp2_phy_connect(struct mvpp2_port *port)
 {
 	struct phy_device *phy_dev;
 
+	/* No PHY is attached */
+	if (!port->phy_node)
+		return 0;
+
 	phy_dev = of_phy_connect(port->dev, port->phy_node, mvpp2_link_event, 0,
 				 port->phy_interface);
 	if (!phy_dev) {
@@ -6587,6 +6593,9 @@ static void mvpp2_phy_disconnect(struct mvpp2_port *port)
 {
 	struct net_device *ndev = port->dev;
 
+	if (!ndev->phydev)
+		return;
+
 	phy_disconnect(ndev->phydev);
 }
 
@@ -7375,12 +7384,6 @@ static int mvpp2_port_probe(struct platform_device *pdev,
 		return -ENOMEM;
 
 	phy_node = of_parse_phandle(port_node, "phy", 0);
-	if (!phy_node) {
-		dev_err(&pdev->dev, "missing phy\n");
-		err = -ENODEV;
-		goto err_free_netdev;
-	}
-
 	phy_mode = of_get_phy_mode(port_node);
 	if (phy_mode < 0) {
 		dev_err(&pdev->dev, "incorrect phy mode\n");
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 2/3] net: mvpp2: use the GoP interrupt for link status changes
From: Antoine Tenart @ 2017-08-31  7:12 UTC (permalink / raw)
  To: davem
  Cc: Antoine Tenart, andrew, gregory.clement, thomas.petazzoni, nadavh,
	linux, linux-kernel, mw, stefanc, miquel.raynal, netdev
In-Reply-To: <20170831071256.18416-1-antoine.tenart@free-electrons.com>

This patch adds the GoP link interrupt support for when a port isn't
connected to a PHY. Because of this the phylib callback is never called
and the link status management isn't done. This patch use the GoP link
interrupt in such cases to still have a minimal link management. Without
this patch ports not connected to a PHY cannot work.

Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Tested-by: Marcin Wojtas <mw@semihalf.com>
---
 drivers/net/ethernet/marvell/mvpp2.c | 189 ++++++++++++++++++++++++++++++++++-
 1 file changed, 184 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
index 1916a4035ea0..4a4560d9151a 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -348,16 +348,24 @@
 #define     MVPP2_GMAC_FLOW_CTRL_AUTONEG	BIT(11)
 #define     MVPP2_GMAC_CONFIG_FULL_DUPLEX	BIT(12)
 #define     MVPP2_GMAC_AN_DUPLEX_EN		BIT(13)
+#define MVPP2_GMAC_STATUS0			0x10
+#define     MVPP2_GMAC_STATUS0_LINK_UP		BIT(0)
 #define MVPP2_GMAC_PORT_FIFO_CFG_1_REG		0x1c
 #define     MVPP2_GMAC_TX_FIFO_MIN_TH_OFFS	6
 #define     MVPP2_GMAC_TX_FIFO_MIN_TH_ALL_MASK	0x1fc0
 #define     MVPP2_GMAC_TX_FIFO_MIN_TH_MASK(v)	(((v) << 6) & \
 					MVPP2_GMAC_TX_FIFO_MIN_TH_ALL_MASK)
+#define MVPP22_GMAC_INT_STAT			0x20
+#define     MVPP22_GMAC_INT_STAT_LINK		BIT(1)
+#define MVPP22_GMAC_INT_MASK			0x24
+#define     MVPP22_GMAC_INT_MASK_LINK_STAT	BIT(1)
 #define MVPP22_GMAC_CTRL_4_REG			0x90
 #define     MVPP22_CTRL4_EXT_PIN_GMII_SEL	BIT(0)
 #define     MVPP22_CTRL4_DP_CLK_SEL		BIT(5)
 #define     MVPP22_CTRL4_SYNC_BYPASS_DIS	BIT(6)
 #define     MVPP22_CTRL4_QSGMII_BYPASS_ACTIVE	BIT(7)
+#define MVPP22_GMAC_INT_SUM_MASK		0xa4
+#define     MVPP22_GMAC_INT_SUM_MASK_LINK_STAT	BIT(1)
 
 /* Per-port XGMAC registers. PPv2.2 only, only for GOP port 0,
  * relative to port->base.
@@ -370,11 +378,19 @@
 #define MVPP22_XLG_CTRL1_REG			0x104
 #define     MVPP22_XLG_CTRL1_FRAMESIZELIMIT_OFFS	0
 #define     MVPP22_XLG_CTRL1_FRAMESIZELIMIT_MASK	0x1fff
+#define MVPP22_XLG_STATUS			0x10c
+#define     MVPP22_XLG_STATUS_LINK_UP		BIT(0)
+#define MVPP22_XLG_INT_STAT			0x114
+#define     MVPP22_XLG_INT_STAT_LINK		BIT(1)
+#define MVPP22_XLG_INT_MASK			0x118
+#define     MVPP22_XLG_INT_MASK_LINK		BIT(1)
 #define MVPP22_XLG_CTRL3_REG			0x11c
 #define     MVPP22_XLG_CTRL3_MACMODESELECT_MASK	(7 << 13)
 #define     MVPP22_XLG_CTRL3_MACMODESELECT_GMAC	(0 << 13)
 #define     MVPP22_XLG_CTRL3_MACMODESELECT_10G	(1 << 13)
-
+#define MVPP22_XLG_EXT_INT_MASK			0x15c
+#define     MVPP22_XLG_EXT_INT_MASK_XLG		BIT(1)
+#define     MVPP22_XLG_EXT_INT_MASK_GIG		BIT(2)
 #define MVPP22_XLG_CTRL4_REG			0x184
 #define     MVPP22_XLG_CTRL4_FWD_FC		BIT(5)
 #define     MVPP22_XLG_CTRL4_FWD_PFC		BIT(6)
@@ -837,6 +853,8 @@ struct mvpp2_port {
 	 */
 	int gop_id;
 
+	int link_irq;
+
 	struct mvpp2 *priv;
 
 	/* Per-port registers' base address */
@@ -4422,6 +4440,77 @@ static int mvpp22_gop_init(struct mvpp2_port *port)
 	return -EINVAL;
 }
 
+static void mvpp22_gop_unmask_irq(struct mvpp2_port *port)
+{
+	u32 val;
+
+	if (port->phy_interface == PHY_INTERFACE_MODE_RGMII ||
+	    port->phy_interface == PHY_INTERFACE_MODE_RGMII_ID ||
+	    port->phy_interface == PHY_INTERFACE_MODE_RGMII_RXID ||
+	    port->phy_interface == PHY_INTERFACE_MODE_RGMII_TXID ||
+	    port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
+		/* Enable the GMAC link status irq for this port */
+		val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK);
+		val |= MVPP22_GMAC_INT_SUM_MASK_LINK_STAT;
+		writel(val, port->base + MVPP22_GMAC_INT_SUM_MASK);
+	}
+
+	if (port->gop_id == 0) {
+		/* Enable the XLG/GIG irqs for this port */
+		val = readl(port->base + MVPP22_XLG_EXT_INT_MASK);
+		if (port->phy_interface == PHY_INTERFACE_MODE_10GKR)
+			val |= MVPP22_XLG_EXT_INT_MASK_XLG;
+		else
+			val |= MVPP22_XLG_EXT_INT_MASK_GIG;
+		writel(val, port->base + MVPP22_XLG_EXT_INT_MASK);
+	}
+}
+
+static void mvpp22_gop_mask_irq(struct mvpp2_port *port)
+{
+	u32 val;
+
+	if (port->gop_id == 0) {
+		val = readl(port->base + MVPP22_XLG_EXT_INT_MASK);
+		val &= ~(MVPP22_XLG_EXT_INT_MASK_XLG |
+		         MVPP22_XLG_EXT_INT_MASK_GIG);
+		writel(val, port->base + MVPP22_XLG_EXT_INT_MASK);
+	}
+
+	if (port->phy_interface == PHY_INTERFACE_MODE_RGMII ||
+	    port->phy_interface == PHY_INTERFACE_MODE_RGMII_ID ||
+	    port->phy_interface == PHY_INTERFACE_MODE_RGMII_RXID ||
+	    port->phy_interface == PHY_INTERFACE_MODE_RGMII_TXID ||
+	    port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
+		val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK);
+		val &= ~MVPP22_GMAC_INT_SUM_MASK_LINK_STAT;
+		writel(val, port->base + MVPP22_GMAC_INT_SUM_MASK);
+	}
+}
+
+static void mvpp22_gop_setup_irq(struct mvpp2_port *port)
+{
+	u32 val;
+
+	if (port->phy_interface == PHY_INTERFACE_MODE_RGMII ||
+	    port->phy_interface == PHY_INTERFACE_MODE_RGMII_ID ||
+	    port->phy_interface == PHY_INTERFACE_MODE_RGMII_RXID ||
+	    port->phy_interface == PHY_INTERFACE_MODE_RGMII_TXID ||
+	    port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
+		val = readl(port->base + MVPP22_GMAC_INT_MASK);
+		val |= MVPP22_GMAC_INT_MASK_LINK_STAT;
+		writel(val, port->base + MVPP22_GMAC_INT_MASK);
+	}
+
+	if (port->gop_id == 0) {
+		val = readl(port->base + MVPP22_XLG_INT_MASK);
+		val |= MVPP22_XLG_INT_MASK_LINK;
+		writel(val, port->base + MVPP22_XLG_INT_MASK);
+	}
+
+	mvpp22_gop_unmask_irq(port);
+}
+
 static int mvpp22_comphy_init(struct mvpp2_port *port)
 {
 	enum phy_mode mode;
@@ -5735,6 +5824,63 @@ static irqreturn_t mvpp2_isr(int irq, void *dev_id)
 	return IRQ_HANDLED;
 }
 
+/* Per-port interrupt for link status changes */
+static irqreturn_t mvpp2_link_status_isr(int irq, void *dev_id)
+{
+	struct mvpp2_port *port = (struct mvpp2_port *)dev_id;
+	struct net_device *dev = port->dev;
+	bool event = false, link = false;
+	u32 val;
+
+	mvpp22_gop_mask_irq(port);
+
+	if (port->gop_id == 0 &&
+	    port->phy_interface == PHY_INTERFACE_MODE_10GKR) {
+		val = readl(port->base + MVPP22_XLG_INT_STAT);
+		if (val & MVPP22_XLG_INT_STAT_LINK) {
+			event = true;
+			val = readl(port->base + MVPP22_XLG_STATUS);
+			if (val & MVPP22_XLG_STATUS_LINK_UP)
+				link = true;
+		}
+	} else if (port->phy_interface == PHY_INTERFACE_MODE_RGMII ||
+		   port->phy_interface == PHY_INTERFACE_MODE_RGMII_ID ||
+		   port->phy_interface == PHY_INTERFACE_MODE_RGMII_RXID ||
+		   port->phy_interface == PHY_INTERFACE_MODE_RGMII_TXID ||
+		   port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
+		val = readl(port->base + MVPP22_GMAC_INT_STAT);
+		if (val & MVPP22_GMAC_INT_STAT_LINK) {
+			event = true;
+			val = readl(port->base + MVPP2_GMAC_STATUS0);
+			if (val & MVPP2_GMAC_STATUS0_LINK_UP)
+				link = true;
+		}
+	}
+
+	if (!netif_running(dev) || !event)
+		goto handled;
+
+	if (link) {
+		mvpp2_interrupts_enable(port);
+
+		mvpp2_egress_enable(port);
+		mvpp2_ingress_enable(port);
+		netif_carrier_on(dev);
+		netif_tx_wake_all_queues(dev);
+	} else {
+		netif_tx_stop_all_queues(dev);
+		netif_carrier_off(dev);
+		mvpp2_ingress_disable(port);
+		mvpp2_egress_disable(port);
+
+		mvpp2_interrupts_disable(port);
+	}
+
+handled:
+	mvpp22_gop_unmask_irq(port);
+	return IRQ_HANDLED;
+}
+
 static void mvpp2_gmac_set_autoneg(struct mvpp2_port *port,
 				   struct phy_device *phydev)
 {
@@ -5763,7 +5909,6 @@ static void mvpp2_gmac_set_autoneg(struct mvpp2_port *port,
 		val |= MVPP2_GMAC_CONFIG_MII_SPEED;
 
 	writel(val, port->base + MVPP2_GMAC_AUTONEG_CONFIG);
-
 }
 
 /* Adjust link */
@@ -6642,6 +6787,7 @@ static void mvpp2_irqs_deinit(struct mvpp2_port *port)
 static int mvpp2_open(struct net_device *dev)
 {
 	struct mvpp2_port *port = netdev_priv(dev);
+	struct mvpp2 *priv = port->priv;
 	unsigned char mac_bcast[ETH_ALEN] = {
 			0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
 	int err;
@@ -6687,12 +6833,24 @@ static int mvpp2_open(struct net_device *dev)
 		goto err_cleanup_txqs;
 	}
 
+	if (priv->hw_version == MVPP22 && !port->phy_node && port->link_irq) {
+		err = request_irq(port->link_irq, mvpp2_link_status_isr, 0,
+				  dev->name, port);
+		if (err) {
+			netdev_err(port->dev, "cannot request link IRQ %d\n",
+				   port->link_irq);
+			goto err_free_irq;
+		}
+
+		mvpp22_gop_setup_irq(port);
+	}
+
 	/* In default link is down */
 	netif_carrier_off(port->dev);
 
 	err = mvpp2_phy_connect(port);
 	if (err < 0)
-		goto err_free_irq;
+		goto err_free_link_irq;
 
 	/* Unmask interrupts on all CPUs */
 	on_each_cpu(mvpp2_interrupts_unmask, port, 1);
@@ -6702,6 +6860,9 @@ static int mvpp2_open(struct net_device *dev)
 
 	return 0;
 
+err_free_link_irq:
+	if (priv->hw_version == MVPP22 && !port->phy_node && port->link_irq)
+		free_irq(port->link_irq, port);
 err_free_irq:
 	mvpp2_irqs_deinit(port);
 err_cleanup_txqs:
@@ -6715,6 +6876,7 @@ static int mvpp2_stop(struct net_device *dev)
 {
 	struct mvpp2_port *port = netdev_priv(dev);
 	struct mvpp2_port_pcpu *port_pcpu;
+	struct mvpp2 *priv = port->priv;
 	int cpu;
 
 	mvpp2_stop_dev(port);
@@ -6724,6 +6886,9 @@ static int mvpp2_stop(struct net_device *dev)
 	on_each_cpu(mvpp2_interrupts_mask, port, 1);
 	mvpp2_shared_interrupt_mask_unmask(port, true);
 
+	if (priv->hw_version == MVPP22 && !port->phy_node && port->link_irq)
+		free_irq(port->link_irq, port);
+
 	mvpp2_irqs_deinit(port);
 	if (!port->has_tx_irqs) {
 		for_each_present_cpu(cpu) {
@@ -7422,6 +7587,15 @@ static int mvpp2_port_probe(struct platform_device *pdev,
 	if (err)
 		goto err_free_netdev;
 
+	port->link_irq = of_irq_get_byname(port_node, "link");
+	if (port->link_irq == -EPROBE_DEFER) {
+		err = -EPROBE_DEFER;
+		goto err_deinit_qvecs;
+	}
+	if (port->link_irq <= 0)
+		/* the link irq is optional */
+		port->link_irq = 0;
+
 	if (of_property_read_bool(port_node, "marvell,loopback"))
 		port->flags |= MVPP2_F_LOOPBACK;
 
@@ -7440,7 +7614,7 @@ static int mvpp2_port_probe(struct platform_device *pdev,
 		port->base = devm_ioremap_resource(&pdev->dev, res);
 		if (IS_ERR(port->base)) {
 			err = PTR_ERR(port->base);
-			goto err_deinit_qvecs;
+			goto err_free_irq;
 		}
 	} else {
 		if (of_property_read_u32(port_node, "gop-port-id",
@@ -7457,7 +7631,7 @@ static int mvpp2_port_probe(struct platform_device *pdev,
 	port->stats = netdev_alloc_pcpu_stats(struct mvpp2_pcpu_stats);
 	if (!port->stats) {
 		err = -ENOMEM;
-		goto err_deinit_qvecs;
+		goto err_free_irq;
 	}
 
 	mvpp2_port_copy_mac_addr(dev, priv, port_node, &mac_from);
@@ -7527,6 +7701,9 @@ static int mvpp2_port_probe(struct platform_device *pdev,
 		free_percpu(port->txqs[i]->pcpu);
 err_free_stats:
 	free_percpu(port->stats);
+err_free_irq:
+	if (port->link_irq)
+		irq_dispose_mapping(port->link_irq);
 err_deinit_qvecs:
 	mvpp2_queue_vectors_deinit(port);
 err_free_netdev:
@@ -7547,6 +7724,8 @@ static void mvpp2_port_remove(struct mvpp2_port *port)
 	for (i = 0; i < port->ntxqs; i++)
 		free_percpu(port->txqs[i]->pcpu);
 	mvpp2_queue_vectors_deinit(port);
+	if (port->link_irq)
+		irq_dispose_mapping(port->link_irq);
 	free_netdev(port->dev);
 }
 
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 3/3] Documentation/bindings: net: marvell-pp2: add the link interrupt
From: Antoine Tenart @ 2017-08-31  7:12 UTC (permalink / raw)
  To: davem
  Cc: Antoine Tenart, andrew, gregory.clement, thomas.petazzoni, nadavh,
	linux, linux-kernel, mw, stefanc, miquel.raynal, netdev
In-Reply-To: <20170831071256.18416-1-antoine.tenart@free-electrons.com>

A link interrupt can be described. Document this valid interrupt name.

Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Tested-by: Marcin Wojtas <mw@semihalf.com>
---
 Documentation/devicetree/bindings/net/marvell-pp2.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/net/marvell-pp2.txt b/Documentation/devicetree/bindings/net/marvell-pp2.txt
index 49484db81583..c78f3187dfea 100644
--- a/Documentation/devicetree/bindings/net/marvell-pp2.txt
+++ b/Documentation/devicetree/bindings/net/marvell-pp2.txt
@@ -44,7 +44,7 @@ Optional properties (port):
 - interrupt-names: if more than a single interrupt for rx is given, must
                    be the name associated to the interrupts listed. Valid
                    names are: "tx-cpu0", "tx-cpu1", "tx-cpu2", "tx-cpu3",
-		   "rx-shared".
+		   "rx-shared", "link".
 - marvell,system-controller: a phandle to the system controller.
 
 Example for marvell,armada-375-pp2:
-- 
2.13.5

^ permalink raw reply related

* Re: [patch net-next 3/8] mlxsw: spectrum_dpipe: Add IPv6 host table initial support
From: Jiri Pirko @ 2017-08-31  7:17 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, davem, arkadis, idosch, mlxsw
In-Reply-To: <200b2c64-f252-c8cd-9e7c-e237bb7a8ce2@gmail.com>

Wed, Aug 30, 2017 at 07:36:13PM CEST, dsahern@gmail.com wrote:
>On 8/30/17 6:03 AM, Jiri Pirko wrote:
>> @@ -328,9 +329,21 @@ static int mlxsw_sp_dpipe_table_host_matches_dump(struct sk_buff *skb, int type)
>>  	if (err)
>>  		return err;
>>  
>> -	match.type = DEVLINK_DPIPE_MATCH_TYPE_FIELD_EXACT;
>> -	match.header = &devlink_dpipe_header_ipv4;
>> -	match.field_id = DEVLINK_DPIPE_FIELD_IPV4_DST_IP;
>> +	switch (type) {
>> +	case AF_INET:
>> +		match.type = DEVLINK_DPIPE_MATCH_TYPE_FIELD_EXACT;
>> +		match.header = &devlink_dpipe_header_ipv4;
>> +		match.field_id = DEVLINK_DPIPE_FIELD_IPV4_DST_IP;
>> +		break;
>> +	case AF_INET6:
>> +		match.type = DEVLINK_DPIPE_MATCH_TYPE_FIELD_EXACT;
>> +		match.header = &devlink_dpipe_header_ipv6;
>> +		match.field_id = DEVLINK_DPIPE_FIELD_IPV6_DST_IP;
>> +		break;
>> +	default:
>> +		WARN_ON(1);
>
>Why a warn for dump request of an unsupported family?

It's a handling of default case that should not happen unless there is a
bug in kernel.


>
>> +		return -EINVAL;
>> +	}
>>  
>>  	return devlink_dpipe_match_put(skb, &match);
>>  }
>> @@ -342,7 +355,7 @@ mlxsw_sp_dpipe_table_host4_matches_dump(void *priv, struct sk_buff *skb)
>

^ permalink raw reply

* Re: [patch net-next 6/8] mlxsw: spectrum_dpipe: Add support for IPv6 host table dump
From: Jiri Pirko @ 2017-08-31  7:18 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, davem, arkadis, idosch, mlxsw
In-Reply-To: <e1e906eb-7c4e-1653-9bd2-2924127f3dcb@gmail.com>

Wed, Aug 30, 2017 at 07:42:36PM CEST, dsahern@gmail.com wrote:
>On 8/30/17 6:03 AM, Jiri Pirko wrote:
>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c
>> index 5924e97..75da2ef 100644
>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c
>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c
>> @@ -386,8 +386,19 @@ mlxsw_sp_dpipe_table_host_match_action_prepare(struct devlink_dpipe_match *match
>>  
>>  	match = &matches[MLXSW_SP_DPIPE_TABLE_HOST_MATCH_DIP];
>>  	match->type = DEVLINK_DPIPE_MATCH_TYPE_FIELD_EXACT;
>> -	match->header = &devlink_dpipe_header_ipv4;
>> -	match->field_id = DEVLINK_DPIPE_FIELD_IPV4_DST_IP;
>> +	switch (type) {
>> +	case AF_INET:
>> +		match->header = &devlink_dpipe_header_ipv4;
>> +		match->field_id = DEVLINK_DPIPE_FIELD_IPV4_DST_IP;
>> +		break;
>> +	case AF_INET6:
>> +		match->header = &devlink_dpipe_header_ipv6;
>> +		match->field_id = DEVLINK_DPIPE_FIELD_IPV6_DST_IP;
>> +		break;
>> +	default:
>> +		WARN_ON(1);
>
>Here as well.
>
>> +		return;
>> +	}
>>  
>>  	action->type = DEVLINK_DPIPE_ACTION_TYPE_FIELD_MODIFY;
>>  	action->header = &devlink_dpipe_header_ethernet;
>> @@ -424,7 +435,18 @@ mlxsw_sp_dpipe_table_host_entry_prepare(struct devlink_dpipe_entry *entry,
>>  	match_value = &match_values[MLXSW_SP_DPIPE_TABLE_HOST_MATCH_DIP];
>>  
>>  	match_value->match = match;
>> -	match_value->value_size = sizeof(u32);
>> +	switch (type) {
>> +	case AF_INET:
>> +		match_value->value_size = sizeof(u32);
>> +		break;
>> +	case AF_INET6:
>> +		match_value->value_size = sizeof(struct in6_addr);
>> +		break;
>> +	default:
>> +		WARN_ON(1);
>
>And here. WARN_ON is overkill

Again, only in case of bug in kernel.


>
>> +		return -EINVAL;
>> +	}
>> +
>>  	match_value->value = kmalloc(match_value->value_size, GFP_KERNEL);
>>  	if (!match_value->value)
>>  		return -ENOMEM;
>

^ permalink raw reply

* Re: [PATCH net-next] dp83640: don't hold spinlock while calling netif_rx_ni
From: Richard Cochran @ 2017-08-31  7:25 UTC (permalink / raw)
  To: David Miller; +Cc: stefan.sorensen, netdev
In-Reply-To: <20170830.145031.1158341283102125578.davem@davemloft.net>

On Wed, Aug 30, 2017 at 02:50:31PM -0700, David Miller wrote:
> From: Stefan Sørensen <stefan.sorensen@spectralink.com>
> Date: Wed, 30 Aug 2017 08:58:47 +0200
> 
> > We should not hold a spinlock while pushing the skb into the networking
> > stack, so move the call to netif_rx_ni out of the critical region to where
> > we have dropped the spinlock.
> > 
> > Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
> 
> Looks good, applied, thanks.

And thanks, Stefan, for following up on this.  For the record, we
discussed this fix back in April.

    https://patchwork.ozlabs.org/patch/752201/

Cheers,
Richard

^ permalink raw reply

* Re: [PATCH 0/4] irda: move it to drivers/staging so we can delete it
From: Ondrej Zary @ 2017-08-31  7:31 UTC (permalink / raw)
  To: Greg KH; +Cc: devel, netdev, samuel, David Miller, linux-kernel
In-Reply-To: <20170831043042.GB3359@kroah.com>

On Thursday 31 August 2017, Greg KH wrote:
> On Tue, Aug 29, 2017 at 11:32:58PM +0200, Ondrej Zary wrote:
> > On Tuesday 29 August 2017 01:42:08 David Miller wrote:
> > > From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > Date: Sun, 27 Aug 2017 17:03:30 +0200
> > >
> > > > The IRDA code has long been obsolete and broken.  So, to keep people
> > > > from trying to use it, and to prevent people from having to maintain
> > > > it, let's move it to drivers/staging/ so that we can delete it
> > > > entirely from the kernel in a few releases.
> > >
> > > No objection, I'll apply this to net-next, thanks Greg.
> >
> > IRDA works fine in Debian 9 (kernel 4.9) and I use it for simple file
> > transfer. Hope I'm not the only one...
> >
> > # irattach /dev/ttyS0 -d tekram -s
> > # irdadump
> > 21:28:52.830350 xid:cmd aed8eb79 > ffffffff S=6 s=0 (14)
> > 21:28:52.922368 xid:cmd aed8eb79 > ffffffff S=6 s=1 (14)
> > 21:28:53.014350 xid:cmd aed8eb79 > ffffffff S=6 s=2 (14)
> > 21:28:53.106338 xid:cmd aed8eb79 > ffffffff S=6 s=3 (14)
> > 21:28:53.190276 xid:rsp aed8eb79 < 000035d1 S=6 s=3 Nokia 6230i hint=b125
> > [ PnP Modem Fax Telephony IrCOMM IrOBEX ] (28)
> > 21:28:53.198384 xid:cmd aed8eb79 > ffffffff S=6 s=4 (14)
> > 21:28:53.290382 xid:cmd aed8eb79 > ffffffff S=6 s=5 (14)
> > 21:28:53.382341 xid:cmd aed8eb79 > ffffffff S=6 s=* pentium hint=0400 [
> > Computer ] (23)
> > ^C
> > 8 packets received by filter
> >
> > $ obexftp -i -l MMC
> > Connecting..\done
> > Receiving "MMC".../<?xml version="1.0"?>
> > <!DOCTYPE folder-listing SYSTEM "obex-folder-listing.dtd"
> >  [ <!ATTLIST folder mem-type CDATA #IMPLIED> ]>
> > <folder-listing version="1.0">
> >     <parent-folder />
> >     <file name="Image000.jpg" size="304300" modified="20160219T135924"
> > user-perm="RWD"/>
> >     <file name="Image001.jpg" size="270037" modified="20170811T233122"
> > user-perm="RWD"/>
> >     <file name="Image004.jpg" size="53519" modified="20170814T074550"
> > user-perm="RWD"/>
> > ....
> > $ obexftp -i -c MMC -g Image004.jpg
> > Connecting..\done
> > Sending "MMC"...|done
> > Receiving "Image004.jpg"...-done
> > Disconnecting..\done
>
> Odd, and is this just a ir device connected to a "real" serial port, or
> a specific IRDA device?
>
> thanks,
>
> greg k-h

Yes, it's an external IrDA dongle connected to a real serial port.

I also have an ARK3116-based USB IrDA dongle and some laptops with integrated 
IrDA ports that used to work fine but haven't tested them recently (i.e. 
Debian 9).

-- 
Ondrej Zary

^ permalink raw reply

* Re: [PATCH net-next] net: cpsw: Don't handle SIOC[GS]HWTSTAMP when CPTS is disabled
From: Richard Cochran @ 2017-08-31  7:48 UTC (permalink / raw)
  To: David Miller; +Cc: stefan.sorensen, grygorii.strashko, netdev, linux-omap
In-Reply-To: <20170830.144745.947488279115809130.davem@davemloft.net>

On Wed, Aug 30, 2017 at 02:47:45PM -0700, David Miller wrote:
> It should not be required to disable a Kconfig option just to get PHY
> timestamping to work properly.

Well, if the MAC driver handles the ioctl and enables time stamping,
then the PHY driver's time stamping remains disabled.  We don't have a
way to choose PHY time stamping at run time.
 
> Rather, if the CPTS code returns -EOPNOTSUPP we should try to
> fallthrough to the PHY library based methods.

I agree that it would be better for the core (rather than the
individual drivers) to handle this case.

There are a few callers of .ndo_do_ioctl to consider.  Besides
dev_ifsioc() there is at least vlan_dev_ioctl() that needs to handle
the EOPNOTSUPP.

Thanks,
Richard

^ permalink raw reply

* Re: DSA mv88e6xxx RX frame errors and TCP/IP RX failure
From: Maxim Uvarov @ 2017-08-31  9:20 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Tim Harvey, netdev, Vivien Didelot, linux-kernel@vger.kernel.org,
	Fugang Duan
In-Reply-To: <20170831014624.GA18554@lunn.ch>

check with ping -s 1500 that packets passed to cpu. mv88e6xxx add
additional dsa tag before the frame so small packets can pass and big
rejected.

also ethtool -S dsaethdevX shows more details stats for Marvell chips.

or opposite, lower mtu on cpu to 1400 and check that ping works. So
from description of patch above it has to work.

Maxim.

2017-08-31 4:46 GMT+03:00 Andrew Lunn <andrew@lunn.ch>:
>> >                        /* Report late collisions as a frame error. */
>> >                         if (status & (BD_ENET_RX_NO | BD_ENET_RX_CL))
>> >                                 ndev->stats.rx_frame_errors++;
>> >
>> > I don't see anywhere else frame errors are counted, but it would be
>> > good to prove we are looking in the right place.
>> >
>>
>> Andrew,
>>
>> (adding IMX FEC driver maintainer to CC)
>>
>> Yes, that's one of them being hit. It looks like ifconfig reports
>> 'frame' as the accumulation of a few stats so here are some more
>> specifics from /sys/class/net/eth0/statistics:
>>
>> root@xenial:/sys/devices/soc0/soc/2100000.aips-bus/2188000.ethernet/net/eth0/statistics#
>> for i in `ls rx_*`; do echo $i:$(cat $i); done
>> rx_bytes:103229
>> rx_compressed:0
>> rx_crc_errors:22
>> rx_dropped:0
>> rx_errors:22
>> rx_fifo_errors:0
>> rx_frame_errors:22
>> rx_length_errors:22
>> rx_missed_errors:0
>> rx_nohandler:0
>> rx_over_errors:0
>> rx_packets:1174
>> root@xenial:/sys/devices/soc0/soc/2100000.aips-bus/2188000.ethernet/net/eth0/statistics#
>> ifconfig eth0
>> eth0      Link encap:Ethernet  HWaddr 00:D0:12:41:F3:E7
>>           inet6 addr: fe80::2d0:12ff:fe41:f3e7/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:1207 errors:22 dropped:0 overruns:0 frame:66
>>           TX packets:42 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:106009 (103.5 KiB)  TX bytes:4604 (4.4 KiB)
>>
>> Instrumenting fec driver I see the following getting hit:
>>
>> status & BD_ENET_RX_LG /* rx_length_errors: Frame too long */
>> status & BD_ENET_RX_CR  /* rx_crc_errors: CRC Error */
>> status & BD_ENET_RX_CL /* rx_frame_errors: Collision? */
>>
>> Is this a frame size issue where the MV88E6176 is sending frames down
>> that exceed the MTU because of headers added?
>
> I did fix an issue recently with that. See
>
> commit fbbeefdd21049fcf9437c809da3828b210577f36
> Author: Andrew Lunn <andrew@lunn.ch>
> Date:   Sun Jul 30 19:36:05 2017 +0200
>
>     net: fec: Allow reception of frames bigger than 1522 bytes
>
>     The FEC Receive Control Register has a 14 bit field indicating the
>     longest frame that may be received. It is being set to 1522. Frames
>     longer than this are discarded, but counted as being in error.
>
>     When using DSA, frames from the switch has an additional header,
>     either 4 or 8 bytes if a Marvell switch is used. Thus a full MTU frame
>     of 1522 bytes received by the switch on a port becomes 1530 bytes when
>     passed to the host via the FEC interface.
>
>     Change the maximum receive size to 2048 - 64, where 64 is the maximum
>     rx_alignment applied on the receive buffer for AVB capable FEC
>     cores. Use this value also for the maximum receive buffer size. The
>     driver is already allocating a receive SKB of 2048 bytes, so this
>     change should not have any significant effects.
>
>     Tested on imx51, imx6, vf610.
>
>     Signed-off-by: Andrew Lunn <andrew@lunn.ch>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
>
>
> However, this is was of an all/nothing problem. All frames with the
> full MTU were getting dropped, where as i think you are only seeing a
> few dropped?
>
> Anyway, try cherry picking that patch and see if it helps.
>
>         Andrew



-- 
Best regards,
Maxim Uvarov

^ permalink raw reply

* Re: [PATCH net-next 6/7] samples/bpf: xdp_redirect load XDP dummy prog on TX device
From: Tariq Toukan @ 2017-08-31 10:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev; +Cc: John Fastabend, Eran Ben Elisha
In-Reply-To: <150401748630.16384.478521489037692464.stgit@firesoul>

Hi Jesper,

On 29/08/2017 5:38 PM, Jesper Dangaard Brouer wrote:
>   
> +/* Redirect require an XDP bpf_prog loaded on the TX device */
> +SEC("xdp_redirect_dummy")
> +int xdp_redirect_dummy(struct xdp_md *ctx)
> +{
> +	return XDP_PASS;
> +}
> +

I get a compilation error related to this:

$ make samples/bpf/

...
LLVM ERROR: 'xdp_redirect_dummy' label emitted multiple times to 
assembly file
make[1]: *** [samples/bpf/xdp_redirect_kern.o] Error 1
make: *** [samples/bpf/] Error 2


It can be fixed by the following patch.
I can submit it in a separate mail if you want to.


diff --git a/samples/bpf/xdp_redirect_kern.c 
b/samples/bpf/xdp_redirect_kern.c
index 1c90288d0203..8abb151e385f 100644
--- a/samples/bpf/xdp_redirect_kern.c
+++ b/samples/bpf/xdp_redirect_kern.c
@@ -82,7 +82,7 @@ int xdp_redirect_prog(struct xdp_md *ctx)

  /* Redirect require an XDP bpf_prog loaded on the TX device */
  SEC("xdp_redirect_dummy")
-int xdp_redirect_dummy(struct xdp_md *ctx)
+int xdp_redirect_dummy_prog(struct xdp_md *ctx)
  {
         return XDP_PASS;
  }
diff --git a/samples/bpf/xdp_redirect_map_kern.c 
b/samples/bpf/xdp_redirect_map_kern.c
index 79795d41ad0d..740a529ba84f 100644
--- a/samples/bpf/xdp_redirect_map_kern.c
+++ b/samples/bpf/xdp_redirect_map_kern.c
@@ -84,7 +84,7 @@ int xdp_redirect_map_prog(struct xdp_md *ctx)

  /* Redirect require an XDP bpf_prog loaded on the TX device */
  SEC("xdp_redirect_dummy")
-int xdp_redirect_dummy(struct xdp_md *ctx)
+int xdp_redirect_dummy_prog(struct xdp_md *ctx)
  {
         return XDP_PASS;
  }


Regards,
Tariq

^ permalink raw reply related

* Re: [PATCH v2 net-next 0/6] flow_dissector: Protocol specific flow dissector offload
From: Hannes Frederic Sowa @ 2017-08-31 10:11 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David S . Miller, Linux Kernel Network Developers
In-Reply-To: <CAPDqMepnrDRecc5FcrKK39+_EMvpvpNjhJW7LyReZBGjcZ9TJw@mail.gmail.com>

Hello,

Tom Herbert <tom@quantonium.net> writes:

> On Wed, Aug 30, 2017 at 1:41 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> Hello Tom,
>>
>> Tom Herbert <tom@quantonium.net> writes:
>>
>>> This patch set adds a new offload type to perform flow dissection for
>>> specific protocols (either by EtherType or by IP protocol). This is
>>> primary useful to crack open UDP encapsulations (like VXLAN, GUE) for
>>> the purposes of parsing the encapsulated packet.
>>>
>>> Items in this patch set:
>>> - Constify skb argument to UDP lookup functions
>>> - Create new protocol case in __skb_dissect for ETH_P_TEB. This is based
>>>   on the code in the GRE dissect function and the special handling in
>>>   GRE can now be removed (it sets protocol to ETH_P_TEB and returns so
>>>   goto proto_again is done)
>>> - Add infrastructure for protocol specific flow dissection offload
>>> - Add infrastructure to perform UDP flow dissection. Uses same model of
>>>   GRO where a flow_dissect callback can be associated with a UDP
>>>   socket
>>> - Use the infrastructure to support flow dissection of VXLAN and GUE
>>>
>>> Tested:
>>>
>>> Forced RPS to call flow dissection for VXLAN, FOU, and GUE. Observed
>>> that inner packet was being properly dissected.
>>>
>>> v2: Add signed off
>>
>> [...]
>>
>> Can you provide more context on why you did this series? Is the entropy
>> insufficient you receive via UDP source ports? I assume this is the case
>> for HW RSS hashing but actually not for the software dissector.
>>
> Hi Hannes,
>
> I think entropy is sufficient looking at UDP source ports, but there
> is not universal agreement on that. In any case there are now many
> other uses of flow dissector, for those that want DPI like getting TCP
> flags, UDP encapsulation is currently a blind spot.

Regarding entropy, Toeplitz seems to do worse while mixing it in
compared to jenkins hash used in flow dissection in software.

I have a number of things I don't understand yet and haven't wound my
head around, yet:

* it seems you implemented boundless looping in the flow dissection. If
  you do know the outer vxlan tunnel parameter (dst-ip and port) I
  basically can let your implementation loop a while until the packet
  data is exceeded. This is not good. This seriously needs to be limited
  to one layer above the tunnels. Never trust user input! It seems a
  user can even overwrite the VID in the flow keys while reparsing? (I
  got this only from looking at the code)

* MPLS/VPLS do encapsulate IP or Ethernet depnding on the label but
  don't have representative sockets but would need other ways to query
  inner content - is this relevant to you.

>> Btw. we forbid hardware to use L4 information if IP_PROTO is UDP but we
>> allow it in RPS (not in IPv6 if flowlabel is present). Your series could
>> solve this problem by being more protocol specific and disallow
>> fragmentation on a particular quadtuple, very much the same like hw
>> encap offload, where we tell the specific port number to the hardware
>> and then disallow using L4 information for all other UDP protocols.
>>
> IMO the fact that HW is protocol specific and operates solely on ports
> is a problem (remember Less Is More...). It's better to be protocol
> generic and do the socket lookup in SW which no longer has atomic
> operations. Matching by bound socket tuple is more accurate than just
> a port. However, technically this solution still isn't 100% correct
> since it's possible that macvlan or ipvlan may intercede and steer
> packet to a namespace where the socket isn't valid.

Your implementation needs to do hierachical socket lookups with checking
the bound interface and traverse the stack figure out the next stacked
interface and use that for the next socket lookup.

I don't think this approach works, to be honest.

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH net-next 6/7] samples/bpf: xdp_redirect load XDP dummy prog on TX device
From: Jesper Dangaard Brouer @ 2017-08-31 10:15 UTC (permalink / raw)
  To: Tariq Toukan; +Cc: netdev, John Fastabend, Eran Ben Elisha, brouer
In-Reply-To: <be534824-72d9-5f58-70bd-45cca0877330@mellanox.com>

On Thu, 31 Aug 2017 13:08:22 +0300
Tariq Toukan <tariqt@mellanox.com> wrote:

> Hi Jesper,
> 
> On 29/08/2017 5:38 PM, Jesper Dangaard Brouer wrote:
> >   
> > +/* Redirect require an XDP bpf_prog loaded on the TX device */
> > +SEC("xdp_redirect_dummy")
> > +int xdp_redirect_dummy(struct xdp_md *ctx)
> > +{
> > +	return XDP_PASS;
> > +}
> > +  
> 
> I get a compilation error related to this:
> 
> $ make samples/bpf/
> 
> ...
> LLVM ERROR: 'xdp_redirect_dummy' label emitted multiple times to 
> assembly file
> make[1]: *** [samples/bpf/xdp_redirect_kern.o] Error 1
> make: *** [samples/bpf/] Error 2
> 
> 
> It can be fixed by the following patch.
> I can submit it in a separate mail if you want to.

Ups! - yes please submit an official fix-patch for this!

Below fix looks good to me...

> diff --git a/samples/bpf/xdp_redirect_kern.c 
> b/samples/bpf/xdp_redirect_kern.c
> index 1c90288d0203..8abb151e385f 100644
> --- a/samples/bpf/xdp_redirect_kern.c
> +++ b/samples/bpf/xdp_redirect_kern.c
> @@ -82,7 +82,7 @@ int xdp_redirect_prog(struct xdp_md *ctx)
> 
>   /* Redirect require an XDP bpf_prog loaded on the TX device */
>   SEC("xdp_redirect_dummy")
> -int xdp_redirect_dummy(struct xdp_md *ctx)
> +int xdp_redirect_dummy_prog(struct xdp_md *ctx)
>   {
>          return XDP_PASS;
>   }
> diff --git a/samples/bpf/xdp_redirect_map_kern.c 
> b/samples/bpf/xdp_redirect_map_kern.c
> index 79795d41ad0d..740a529ba84f 100644
> --- a/samples/bpf/xdp_redirect_map_kern.c
> +++ b/samples/bpf/xdp_redirect_map_kern.c
> @@ -84,7 +84,7 @@ int xdp_redirect_map_prog(struct xdp_md *ctx)
> 
>   /* Redirect require an XDP bpf_prog loaded on the TX device */
>   SEC("xdp_redirect_dummy")
> -int xdp_redirect_dummy(struct xdp_md *ctx)
> +int xdp_redirect_dummy_prog(struct xdp_md *ctx)
>   {
>          return XDP_PASS;
>   }

Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [RFC PATCH] net: frag limit checks need to use percpu_counter_compare
From: Jesper Dangaard Brouer @ 2017-08-31 10:20 UTC (permalink / raw)
  To: liujian56; +Cc: netdev, Florian Westphal, Jesper Dangaard Brouer

To: Liujian can you please test this patch?
 I want to understand if using __percpu_counter_compare() solves
 the problem correctness wise (even-though this will be slower
 than using a simple atomic_t on your big system).

Fix bug in fragmentation codes use of the percpu_counter API, that
cause issues on systems with many CPUs.

The frag_mem_limit() just reads the global counter (fbc->count),
without considering other CPUs can have upto batch size (130K) that
haven't been subtracted yet.  Due to the 3MBytes lower thresh limit,
this become dangerous at >=24 CPUs (3*1024*1024/130000=24).

The __percpu_counter_compare() does the right thing, and takes into
account the number of (online) CPUs and batch size, to account for
this and call __percpu_counter_sum() when needed.

On systems with many CPUs this will unfortunately always result in the
heavier fully locked __percpu_counter_sum() which touch the
per_cpu_ptr of all (online) CPUs.

On systems with a smaller number of CPUs this solution is also not
optimal, because __percpu_counter_compare()/__percpu_counter_sum()
doesn't help synchronize the global counter.
 Florian Westphal have an idea of adding some counter sync points,
which should help address this issue.
---
 include/net/inet_frag.h  |   16 ++++++++++++++--
 net/ipv4/inet_fragment.c |    6 +++---
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 6fdcd2427776..b586e320783d 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -147,9 +147,21 @@ static inline bool inet_frag_evicting(struct inet_frag_queue *q)
  */
 static unsigned int frag_percpu_counter_batch = 130000;
 
-static inline int frag_mem_limit(struct netns_frags *nf)
+static inline bool frag_mem_over_limit(struct netns_frags *nf, int thresh)
 {
-	return percpu_counter_read(&nf->mem);
+	/* When reading counter here, __percpu_counter_compare() call
+	 * will invoke __percpu_counter_sum() when needed.  Which
+	 * depend on num_online_cpus()*batch size, as each CPU can
+	 * potentential can hold a batch count.
+	 *
+	 * With many CPUs this heavier sum operation will
+	 * unfortunately always occur.
+	 */
+	if (__percpu_counter_compare(&nf->mem, thresh,
+				     frag_percpu_counter_batch) > 0)
+		return true;
+	else
+		return false;
 }
 
 static inline void sub_frag_mem_limit(struct netns_frags *nf, int i)
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 96e95e83cc61..ee2cf56900e6 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -120,7 +120,7 @@ static void inet_frag_secret_rebuild(struct inet_frags *f)
 static bool inet_fragq_should_evict(const struct inet_frag_queue *q)
 {
 	return q->net->low_thresh == 0 ||
-	       frag_mem_limit(q->net) >= q->net->low_thresh;
+		frag_mem_over_limit(q->net, q->net->low_thresh);
 }
 
 static unsigned int
@@ -355,7 +355,7 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 {
 	struct inet_frag_queue *q;
 
-	if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh) {
+	if (!nf->high_thresh || frag_mem_over_limit(nf, nf->high_thresh)) {
 		inet_frag_schedule_worker(f);
 		return NULL;
 	}
@@ -396,7 +396,7 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 	struct inet_frag_queue *q;
 	int depth = 0;
 
-	if (frag_mem_limit(nf) > nf->low_thresh)
+	if (frag_mem_over_limit(nf, nf->low_thresh))
 		inet_frag_schedule_worker(f);
 
 	hash &= (INETFRAGS_HASHSZ - 1);

^ permalink raw reply related

* [PATCH net-next] net: fix two typos in net_device_ops documentation.
From: Rami Rosen @ 2017-08-31 10:41 UTC (permalink / raw)
  To: davem; +Cc: netdev, jasowang, Rami Rosen

This patch fixes two trivial typos in net_device_ops documentation,
related to ndo_xdp_flush callback.

Signed-off-by: Rami Rosen <rami.rosen@intel.com>
---
 include/linux/netdevice.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 29bf06f..35de831 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1124,8 +1124,8 @@ struct xfrmdev_ops {
  *	This function is used to submit a XDP packet for transmit on a
  *	netdevice.
  * void (*ndo_xdp_flush)(struct net_device *dev);
- *	This function is used to inform the driver to flush a paticular
- *	xpd tx queue. Must be called on same CPU as xdp_xmit.
+ *	This function is used to inform the driver to flush a particular
+ *	xdp tx queue. Must be called on same CPU as xdp_xmit.
  */
 struct net_device_ops {
 	int			(*ndo_init)(struct net_device *dev);
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH net-next v7] openvswitch: enable NSH support
From: Jiri Benc @ 2017-08-31 10:45 UTC (permalink / raw)
  To: Yi Yang
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA, e,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1504096752-102003-1-git-send-email-yi.y.yang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

On Wed, 30 Aug 2017 20:39:12 +0800, Yi Yang wrote:
> --- a/net/nsh/nsh.c
> +++ b/net/nsh/nsh.c
> @@ -14,6 +14,47 @@
>  #include <net/nsh.h>
>  #include <net/tun_proto.h>
>  
> +int skb_push_nsh(struct sk_buff *skb, const struct nshhdr *nsh_src, bool is_eth)
> +{
> +	struct nshhdr *nsh;
> +	size_t length = nsh_hdr_len(nsh_src);
> +	u8 next_proto;
> +
> +	if (is_eth) {
> +		next_proto = TUN_P_ETHERNET;

I don't like the is_eth parameter. It should be clear from the skb being
passed to the function, specifically from skb->mac_len.

> +	} else {
> +		next_proto = tun_p_from_eth_p(skb->protocol);
> +		if (!next_proto)
> +			return -ENOTSUPP;

EAFNOSUPPORT is more suitable here.

> +	}
> +
> +	/* Add the NSH header */
> +	if (skb_cow_head(skb, length) < 0)
> +		return -ENOMEM;
> +
> +	/* Add the NSH header */
> +	if (skb_cow_head(skb, length) < 0)
> +		return -ENOMEM;

Duplicate statement.

> +
> +	if (!skb->inner_protocol) {
> +		skb_set_inner_network_header(skb, skb->mac_len);
> +		skb_set_inner_protocol(skb, skb->protocol);

It doesn't make sense to set either of those. Nothing uses the fields for
NSH.

> +	}
> +
> +	skb_push(skb, length);
> +	nsh = (struct nshhdr *)(skb->data);

Use nsh_hdr.

> +	memcpy(nsh, nsh_src, length);
> +	nsh->np = next_proto;
> +	nsh->mdtype &= NSH_MDTYPE_MASK;

The "& NSH_MDTYPE_MASK" operation does not make sense. Just trust the caller
to provide a meaningful value. It's its job to verify the parameters.

> +	skb->protocol = htons(ETH_P_NSH);
> +	skb_reset_mac_header(skb);
> +	skb_reset_mac_len(skb);

You have to reset also the network header.

> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(skb_push_nsh);
> +
>  static struct sk_buff *nsh_gso_segment(struct sk_buff *skb,
>  				       netdev_features_t features)
>  {
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index a54a556..e969fad 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -38,11 +38,13 @@
>  #include <net/dsfield.h>
>  #include <net/mpls.h>
>  #include <net/sctp/checksum.h>
> +#include <net/tun_proto.h>
>  
>  #include "datapath.h"
>  #include "flow.h"
>  #include "conntrack.h"
>  #include "vport.h"
> +#include "flow_netlink.h"
>  
>  struct deferred_action {
>  	struct sk_buff *skb;
> @@ -380,6 +382,57 @@ static int push_eth(struct sk_buff *skb, struct sw_flow_key *key,
>  	return 0;
>  }
>  
> +static int push_nsh(struct sk_buff *skb, struct sw_flow_key *key,
> +		    const struct nshhdr *nsh_src)
> +{
> +	bool is_eth = false;
> +	int ret;
> +
> +	if (key->mac_proto == MAC_PROTO_ETHERNET)
> +		is_eth = true;
> +
> +	ret = skb_push_nsh(skb, nsh_src, is_eth);
> +	if (ret != 0)

The usual idiom is "if (ret)". And for consistency with the rest of the
file, the name of the variable should be "err".

> +		return ret;
> +
> +	key->eth.type = htons(ETH_P_NSH);
> +
> +	/* safe right before invalidate_flow_key */
> +	key->mac_proto = MAC_PROTO_NONE;
> +	invalidate_flow_key(key);
> +	return 0;
> +}
> +
> +static int pop_nsh(struct sk_buff *skb, struct sw_flow_key *key)
> +{
> +	struct nshhdr *nsh = (struct nshhdr *)(skb->data);
> +	size_t length;
> +	u16 inner_proto;
> +
> +	if (ovs_key_mac_proto(key) != MAC_PROTO_NONE ||
> +	    skb->protocol != htons(ETH_P_NSH)) {
> +		return -EINVAL;
> +	}
> +
> +	inner_proto = tun_p_to_eth_p(nsh->np);
> +	if (!inner_proto)
> +		return -ENOTSUPP;
> +
> +	length = nsh_hdr_len(nsh);
> +	skb_pull(skb, length);
> +	skb_reset_mac_header(skb);
> +	skb_reset_mac_len(skb);
> +	skb->protocol = inner_proto;

Please factor this out to skb_pop_nsh, similarly to what you did with
skb_push_nsh.

> +
> +	/* safe right before invalidate_flow_key */
> +	if (inner_proto == htons(ETH_P_TEB))
> +		key->mac_proto = MAC_PROTO_ETHERNET;
> +	else
> +		key->mac_proto = MAC_PROTO_NONE;
> +	invalidate_flow_key(key);
> +	return 0;
> +}
> +
>  static void update_ip_l4_checksum(struct sk_buff *skb, struct iphdr *nh,
>  				  __be32 addr, __be32 new_addr)
>  {
> @@ -602,6 +655,53 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
>  	return 0;
>  }
>  
> +static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
> +		   const struct ovs_key_nsh *key,
> +		   const struct ovs_key_nsh *mask)
> +{
> +	struct nshhdr *nsh;
> +	int err;
> +	u8 flags;
> +	u8 ttl;
> +	int i;
> +
> +	err = skb_ensure_writable(skb, skb_network_offset(skb) +
> +				  sizeof(struct nshhdr));
> +	if (unlikely(err))
> +		return err;
> +
> +	nsh = (struct nshhdr *)skb_network_header(skb);

nsh_hdr

> +
> +	flags = nsh_get_flags(nsh);
> +	flags = OVS_MASKED(flags, key->flags, mask->flags);
> +	flow_key->nsh.flags = flags;
> +	ttl = nsh_get_ttl(nsh);
> +	ttl = OVS_MASKED(ttl, key->ttl, mask->ttl);
> +	flow_key->nsh.ttl = ttl;
> +	nsh_set_flags_and_ttl(nsh, flags, ttl);
> +	nsh->path_hdr = OVS_MASKED(nsh->path_hdr, key->path_hdr,
> +				   mask->path_hdr);
> +	flow_key->nsh.path_hdr = nsh->path_hdr;
> +	switch (nsh->mdtype) {
> +	case NSH_M_TYPE1:
> +		for (i = 0; i < NSH_MD1_CONTEXT_SIZE; i++) {
> +			nsh->md1.context[i] =
> +			    OVS_MASKED(nsh->md1.context[i], key->context[i],
> +				       mask->context[i]);
> +		}
> +		memcpy(flow_key->nsh.context, nsh->md1.context,
> +		       sizeof(nsh->md1.context));

Do you follow the discussion that Hannes Sowa started on the ovs list
regarding matching on the context fields? It would be better to hold on this
until there's a conclusion reached.

> +		break;
> +	case NSH_M_TYPE2:
> +		memset(flow_key->nsh.context, 0,
> +		       sizeof(flow_key->nsh.context));
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
>  /* Must follow skb_ensure_writable() since that can move the skb data. */
>  static void set_tp_port(struct sk_buff *skb, __be16 *port,
>  			__be16 new_port, __sum16 *check)
> @@ -1024,6 +1124,32 @@ static int execute_masked_set_action(struct sk_buff *skb,
>  				   get_mask(a, struct ovs_key_ethernet *));
>  		break;
>  
> +	case OVS_KEY_ATTR_NSH: {
> +		struct ovs_key_nsh nsh;
> +		struct ovs_key_nsh nsh_mask;
> +		size_t size = nla_len(a) / 2;
> +		struct {
> +			struct nlattr nla;
> +			u8 data[size];
> +		} attr, mask;
> +
> +		attr.nla.nla_type = nla_type(a);
> +		attr.nla.nla_len = NLA_HDRLEN + size;
> +		memcpy(attr.data, (char *)(a + 1), size);
> +
> +		mask.nla = attr.nla;
> +		memcpy(mask.data, (char *)(a + 1) + size, size);

This is much cleaner than before. Thank you for doing the change. It now
allows me to understand what's actually going on here.

> +		err = nsh_key_from_nlattr(&attr.nla, &nsh);
> +		if (err)
> +			break;
> +		err = nsh_key_from_nlattr(&mask.nla, &nsh_mask);
> +		if (err)
> +			break;

This is a lot of copying to just be able to use nla_for_each_nested. While
I'm not a fan of how ovs uses the attributes to store both value and mask,
it's not completely irrational. However, I don't think that the intent was
to store a copy of the whole nested group. It's quite obvious that it does
not fit the pattern from the gymnastics you need to do.

Instead, store the mask in each of the nested attributes individually. There
will be no need for the copying above and the code will get cleaner and
faster. It's not that the nsh_key_from_nlattr function needs to be able to
work separately for the key and mask. Nothing else than the line above uses
this function. Just move the logic inside.

Actually, it seems it can be moved directly to set_nsh.

> +		err = set_nsh(skb, flow_key, &nsh, &nsh_mask);
> +		break;
> +	}
> +
>  	case OVS_KEY_ATTR_IPV4:
>  		err = set_ipv4(skb, flow_key, nla_data(a),
>  			       get_mask(a, struct ovs_key_ipv4 *));
> @@ -1210,6 +1336,21 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
>  		case OVS_ACTION_ATTR_POP_ETH:
>  			err = pop_eth(skb, key);
>  			break;
> +
> +		case OVS_ACTION_ATTR_PUSH_NSH: {
> +			u8 buffer[NSH_HDR_MAX_LEN];
> +			struct nshhdr *nsh_hdr = (struct nshhdr *)buffer;
> +			const struct nshhdr *nsh_src = nsh_hdr;
> +
> +			nsh_hdr_from_nlattr(nla_data(a), nsh_hdr,
> +					    NSH_HDR_MAX_LEN);

This is costly. This is called for every packet in the fast path. We should
precompute and store the header instead.

I know I had comments to this part before and it would be ideal if I raised
this earlier but since you had trivial bugs like not taking the buffer size
into account, it was hard to focus on the design side of things. You can
prevent this churn by paying more attention to details before submission.
It's really hard to focus on high level picture when the first thing one
encounters is a trivial bug.

> +			err = push_nsh(skb, key, nsh_src);
> +			break;
> +		}
> +
> +		case OVS_ACTION_ATTR_POP_NSH:
> +			err = pop_nsh(skb, key);
> +			break;
>  		}
>  
>  		if (unlikely(err)) {
> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
> index 8c94cef..7a178d1 100644
> --- a/net/openvswitch/flow.c
> +++ b/net/openvswitch/flow.c
> @@ -46,6 +46,7 @@
>  #include <net/ipv6.h>
>  #include <net/mpls.h>
>  #include <net/ndisc.h>
> +#include <net/nsh.h>
>  
>  #include "conntrack.h"
>  #include "datapath.h"
> @@ -490,6 +491,56 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
>  	return 0;
>  }
>  
> +static int parse_nsh(struct sk_buff *skb, struct sw_flow_key *key)
> +{
> +	struct nshhdr *nsh;
> +	unsigned int nh_ofs = skb_network_offset(skb);
> +	u8 version, length;
> +	int err;
> +
> +	err = check_header(skb, nh_ofs + NSH_BASE_HDR_LEN);
> +	if (unlikely(err))
> +		return err;
> +
> +	nsh = (struct nshhdr *)skb_network_header(skb);

nsh_hdr

> +	version = nsh_get_ver(nsh);
> +	length = nsh_hdr_len(nsh);
> +
> +	if (version != 0)
> +		return -EINVAL;
> +
> +	err = check_header(skb, nh_ofs + length);
> +	if (unlikely(err))
> +		return err;
> +
> +	nsh = (struct nshhdr *)skb_network_header(skb);

nsh_hdr

> +	key->nsh.flags = nsh_get_flags(nsh);
> +	key->nsh.ttl = nsh_get_ttl(nsh);
> +	key->nsh.mdtype = nsh->mdtype;
> +	key->nsh.np = nsh->np;
> +	key->nsh.path_hdr = nsh->path_hdr;
> +	switch (key->nsh.mdtype) {
> +	case NSH_M_TYPE1:
> +		if (length != NSH_M_TYPE1_LEN)
> +			return -EINVAL;
> +		memcpy(key->nsh.context, nsh->md1.context,
> +		       sizeof(nsh->md1));
> +		break;
> +	case NSH_M_TYPE2:
> +		/* Don't support MD type 2 metedata parsing yet */
> +		if (length < NSH_BASE_HDR_LEN)
> +			return -EINVAL;

Shouldn't we reject the packet, then? Pretending that something was parsed
correctly while in fact it was not, does not look correct.

> +
> +		memset(key->nsh.context, 0,
> +		       sizeof(nsh->md1));
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
>  /**
>   * key_extract - extracts a flow key from an Ethernet frame.
>   * @skb: sk_buff that contains the frame, with skb->data pointing to the
> @@ -735,6 +786,10 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
>  				memset(&key->tp, 0, sizeof(key->tp));
>  			}
>  		}
> +	} else if (key->eth.type == htons(ETH_P_NSH)) {
> +		error = parse_nsh(skb, key);
> +		if (error)
> +			return error;
>  	}
>  	return 0;
>  }
> diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
> index 1875bba..6a3cd9c 100644
> --- a/net/openvswitch/flow.h
> +++ b/net/openvswitch/flow.h
> @@ -35,6 +35,7 @@
>  #include <net/inet_ecn.h>
>  #include <net/ip_tunnels.h>
>  #include <net/dst_metadata.h>
> +#include <net/nsh.h>
>  
>  struct sk_buff;
>  
> @@ -66,6 +67,15 @@ struct vlan_head {
>  	(offsetof(struct sw_flow_key, recirc_id) +	\
>  	FIELD_SIZEOF(struct sw_flow_key, recirc_id))
>  
> +struct ovs_key_nsh {
> +	u8 flags;
> +	u8 ttl;
> +	u8 mdtype;
> +	u8 np;
> +	__be32 path_hdr;
> +	__be32 context[NSH_MD1_CONTEXT_SIZE];
> +};
> +
>  struct sw_flow_key {
>  	u8 tun_opts[IP_TUNNEL_OPTS_MAX];
>  	u8 tun_opts_len;
> @@ -144,6 +154,7 @@ struct sw_flow_key {
>  			};
>  		} ipv6;
>  	};
> +	struct ovs_key_nsh nsh;         /* network service header */
>  	struct {
>  		/* Connection tracking fields not packed above. */
>  		struct {
> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> index e8eb427..17df00a 100644
> --- a/net/openvswitch/flow_netlink.c
> +++ b/net/openvswitch/flow_netlink.c
> @@ -48,6 +48,7 @@
>  #include <net/ndisc.h>
>  #include <net/mpls.h>
>  #include <net/vxlan.h>
> +#include <net/tun_proto.h>
>  
>  #include "flow_netlink.h"
>  
> @@ -78,9 +79,11 @@ static bool actions_may_change_flow(const struct nlattr *actions)
>  		case OVS_ACTION_ATTR_HASH:
>  		case OVS_ACTION_ATTR_POP_ETH:
>  		case OVS_ACTION_ATTR_POP_MPLS:
> +		case OVS_ACTION_ATTR_POP_NSH:
>  		case OVS_ACTION_ATTR_POP_VLAN:
>  		case OVS_ACTION_ATTR_PUSH_ETH:
>  		case OVS_ACTION_ATTR_PUSH_MPLS:
> +		case OVS_ACTION_ATTR_PUSH_NSH:
>  		case OVS_ACTION_ATTR_PUSH_VLAN:
>  		case OVS_ACTION_ATTR_SAMPLE:
>  		case OVS_ACTION_ATTR_SET:
> @@ -322,12 +325,27 @@ size_t ovs_tun_key_attr_size(void)
>  		+ nla_total_size(2);   /* OVS_TUNNEL_KEY_ATTR_TP_DST */
>  }
>  
> +size_t ovs_nsh_key_attr_size(void)
> +{
> +	/* Whenever adding new OVS_NSH_KEY_ FIELDS, we should consider
> +	 * updating this function.
> +	 */
> +	return  nla_total_size(NSH_BASE_HDR_LEN) /* OVS_NSH_KEY_ATTR_BASE */
> +		/* OVS_NSH_KEY_ATTR_MD1 and OVS_NSH_KEY_ATTR_MD2 are
> +		 * mutually exclusive, so the bigger one can cover
> +		 * the small one.
> +		 *
> +		 * OVS_NSH_KEY_ATTR_MD2
> +		 */
> +		+ nla_total_size(NSH_CTX_HDRS_MAX_LEN);
> +}
> +
>  size_t ovs_key_attr_size(void)
>  {
>  	/* Whenever adding new OVS_KEY_ FIELDS, we should consider
>  	 * updating this function.
>  	 */
> -	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 28);
> +	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 29);
>  
>  	return    nla_total_size(4)   /* OVS_KEY_ATTR_PRIORITY */
>  		+ nla_total_size(0)   /* OVS_KEY_ATTR_TUNNEL */
> @@ -341,6 +359,8 @@ size_t ovs_key_attr_size(void)
>  		+ nla_total_size(4)   /* OVS_KEY_ATTR_CT_MARK */
>  		+ nla_total_size(16)  /* OVS_KEY_ATTR_CT_LABELS */
>  		+ nla_total_size(40)  /* OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6 */
> +		+ nla_total_size(0)   /* OVS_KEY_ATTR_NSH */
> +		  + ovs_nsh_key_attr_size()
>  		+ nla_total_size(12)  /* OVS_KEY_ATTR_ETHERNET */
>  		+ nla_total_size(2)   /* OVS_KEY_ATTR_ETHERTYPE */
>  		+ nla_total_size(4)   /* OVS_KEY_ATTR_VLAN */
> @@ -373,6 +393,13 @@ static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
>  	[OVS_TUNNEL_KEY_ATTR_IPV6_DST]      = { .len = sizeof(struct in6_addr) },
>  };
>  
> +static const struct ovs_len_tbl
> +ovs_nsh_key_attr_lens[OVS_NSH_KEY_ATTR_MAX + 1] = {
> +	[OVS_NSH_KEY_ATTR_BASE]     = { .len = 8 },

sizeof(struct ovs_nsh_key_base)

> +	[OVS_NSH_KEY_ATTR_MD1]      = { .len = 16 },

sizeof(struct ovs_nsh_key_md1)

> +	[OVS_NSH_KEY_ATTR_MD2]      = { .len = OVS_ATTR_VARIABLE },
> +};
> +
>  /* The size of the argument for each %OVS_KEY_ATTR_* Netlink attribute.  */
>  static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
>  	[OVS_KEY_ATTR_ENCAP]	 = { .len = OVS_ATTR_NESTED },
> @@ -405,6 +432,8 @@ static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
>  		.len = sizeof(struct ovs_key_ct_tuple_ipv4) },
>  	[OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6] = {
>  		.len = sizeof(struct ovs_key_ct_tuple_ipv6) },
> +	[OVS_KEY_ATTR_NSH]       = { .len = OVS_ATTR_NESTED,
> +				     .next = ovs_nsh_key_attr_lens, },
>  };
>  
>  static bool check_attr_len(unsigned int attr_len, unsigned int expected_len)
> @@ -1179,6 +1208,304 @@ static int metadata_from_nlattrs(struct net *net, struct sw_flow_match *match,
>  	return 0;
>  }
>  
> +int nsh_hdr_from_nlattr(const struct nlattr *attr,
> +			struct nshhdr *nsh, size_t size)
> +{
> +	struct nlattr *a;
> +	int rem;
> +	u8 flags = 0;
> +	u8 ttl = 0;
> +	int mdlen = 0;
> +	bool has_md1 = false;
> +	bool has_md2 = false;
> +	u8 len;
> +
> +	nla_for_each_nested(a, attr, rem) {
> +		int type = nla_type(a);
> +
> +		if (type > OVS_NSH_KEY_ATTR_MAX) {
> +			OVS_NLERR(1, "nsh attr %d is out of range max %d",
> +				  type, OVS_NSH_KEY_ATTR_MAX);
> +			return -EINVAL;
> +		}
> +
> +		if (!check_attr_len(nla_len(a),
> +				    ovs_nsh_key_attr_lens[type].len)) {
> +			OVS_NLERR(
> +			    1,
> +			    "nsh attr %d has unexpected len %d expected %d",
> +			    type,
> +			    nla_len(a),
> +			    ovs_nsh_key_attr_lens[type].len
> +			);
> +			return -EINVAL;
> +		}

These checks should be done only once when the action is configured, not for
each packet.

> +
> +		switch (type) {
> +		case OVS_NSH_KEY_ATTR_BASE: {
> +			const struct ovs_nsh_key_base *base =
> +				(struct ovs_nsh_key_base *)nla_data(a);
> +			flags = base->flags;
> +			ttl = base->ttl;
> +			nsh->np = base->np;
> +			nsh->mdtype = base->mdtype;
> +			nsh->path_hdr = base->path_hdr;
> +			break;
> +		}
> +		case OVS_NSH_KEY_ATTR_MD1: {
> +			const struct ovs_nsh_key_md1 *md1 =
> +				(struct ovs_nsh_key_md1 *)nla_data(a);
> +			struct nsh_md1_ctx *md1_dst = &nsh->md1;
> +
> +			has_md1 = true;
> +			mdlen = nla_len(a);
> +			if (((mdlen + NSH_BASE_HDR_LEN) != NSH_M_TYPE1_LEN) ||
> +			    ((mdlen + NSH_BASE_HDR_LEN) > size) ||
> +			    (mdlen <= 0)) {
> +				OVS_NLERR(
> +				    1,
> +				    "length %d of nsh attr %d is invalid",
> +				    mdlen,
> +				    type
> +				);
> +				return -EINVAL;
> +			}

Ditto for most parts of the check. Plus the maximum size should be validated
elsewhere. In this function, there should be a WARN_ON_ONCE if the computed
length > size.

> +			memcpy(md1_dst, md1, mdlen);
> +			break;
> +		}
> +		case OVS_NSH_KEY_ATTR_MD2: {
> +			const struct u8 *md2 = nla_data(a);
> +			struct nsh_md2_tlv *md2_dst = &nsh->md2;
> +
> +			has_md2 = true;
> +			mdlen = nla_len(a);
> +			if (((mdlen + NSH_BASE_HDR_LEN) > size) ||
> +			    (mdlen <= 0)) {
> +				OVS_NLERR(
> +				    1,
> +				    "length %d of nsh attr %d is invalid",
> +				    mdlen,
> +				    type
> +				);
> +				return -EINVAL;
> +			}

Ditto.

> +			memcpy(md2_dst, md2, mdlen);
> +			break;
> +		}
> +		default:
> +			OVS_NLERR(1, "Unknown nsh attribute %d",
> +				  type);
> +			return -EINVAL;

Ditto.

> +		}
> +	}
> +
> +	if (rem > 0) {
> +		OVS_NLERR(1, "nsh attribute has %d unknown bytes.", rem);
> +		return -EINVAL;
> +	}
> +
> +	if ((has_md1 && nsh->mdtype != NSH_M_TYPE1) ||
> +	    (has_md2 && nsh->mdtype != NSH_M_TYPE2)) {
> +		OVS_NLERR(1,
> +			  "nsh attribute has unmatched MD type %d.",
> +			  nsh->mdtype);
> +		return -EINVAL;
> +	}
> +
> +	if (unlikely(has_md1 && has_md2)) {
> +		OVS_NLERR(1, "both nsh md1 and md2 attribute are there");
> +		return -EINVAL;
> +	}
> +
> +	if (unlikely(!has_md1 && !has_md2)) {
> +		OVS_NLERR(1, "neither nsh md1 nor md2 attribute is there");
> +		return -EINVAL;
> +	}

Ditto. Plus I don't see a check that the OVS_NSH_KEY_ATTR_BASE attribute is
present. Seems that the compiler warned you about possibly unitialized flags
and ttl variables and you just silenced the warning without considering
whether it's not actually justified.

> +
> +	/* nsh header length  = NSH_BASE_HDR_LEN + mdlen */
> +	nsh->ver_flags_ttl_len = 0;
> +	len = NSH_BASE_HDR_LEN + mdlen;

In the whole function you open code NSH_BASE_HDR_LEN + mdlen and suddenly,
at the very end, you store it to a variable to be used just once? Call me
confused.

> +	nsh_set_flags_ttl_len(nsh, flags, ttl, len);
> +
> +	return 0;
> +}

So, this whole function does a lot of processing and copying. Why don't we
just parse the attributes once, cache the resulting header and just memcpy
it in the hot path?

> +
> +int nsh_key_from_nlattr(const struct nlattr *attr,
> +			struct ovs_key_nsh *nsh)

See above plus what I wrote for the previous function, it applies here as
well.

I stop here. I suspect there are similar things in the rest of the patch.
Please think about what I wrote and apply it to all similar situations. Not
as you do with nsh_push where you apparenly ignored nsh_pop just because
I mentioned nsh_push only.

 Jiri

^ permalink raw reply

* Re: net/ipv4: divide error in __tcp_select_window
From: Neal Cardwell @ 2017-08-31 11:11 UTC (permalink / raw)
  To: idaifish
  Cc: David Miller, Alexey Kuznetsov, Netdev, syzkaller, Wei Wang,
	Eric Dumazet
In-Reply-To: <CADUsjNnba_AmnTdoY45Bh7HasqjmBja-W1+-+6eMt-d_dCs2qA@mail.gmail.com>

On Thu, Aug 31, 2017 at 1:56 AM, idaifish <idaifish@gmail.com> wrote:
> Hi:
>    This bug seems still can be triggered by the attached PoC on latest
> Ubuntu1604 (4.4.0-94-generic)
>
> ============================================================================
> divide error: 0000 [#1] SMP KASAN
> Modules linked in:
> CPU: 0 PID: 14933 Comm: syz-executor0 Not tainted 4.9.45 #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> Ubuntu-1.8.2-1ubuntu1 04/01/2014
> task: ffff880076ab9900 task.stack: ffff880062ae8000
> RIP: 0010:[<ffffffff829c1df3>]  [<ffffffff829c1df3>]
> __tcp_select_window+0x2f3/0x6b0 net/ipv4/tcp_output.c:2499
...
>  [<ffffffff8297c36e>] tcp_cleanup_rbuf+0x43e/0x4f0 net/ipv4/tcp.c:1468
>  [<ffffffff829815df>] tcp_recvmsg+0xc2f/0x25d0 net/ipv4/tcp.c:1937

Thanks for the report. I believe this tcp_recvmsg  => tcp_cleanup_rbuf
 => __tcp_select_window divide-by-zero issue was fixed in May by Wei,
in:

 499350a5a6e7 tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0
 https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=499350a5a6e7

Looks like we should probably mark this as a -stable candidate, so
that it will eventually make it to 4.4.y, 4.9.y, 4.12.y users, etc. (I
don't see the commit in those stable branches.)

thanks,
neal

^ permalink raw reply

* [PATCH net-next] samples/bpf: Fix compilation issue in redirect dummy program
From: Tariq Toukan @ 2017-08-31 11:16 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Jesper Dangaard Brouer, Tariq Toukan

Fix compilation error below:

$ make samples/bpf/

LLVM ERROR: 'xdp_redirect_dummy' label emitted multiple times to
assembly file
make[1]: *** [samples/bpf/xdp_redirect_kern.o] Error 1
make: *** [samples/bpf/] Error 2

Fixes: 306da4e685b4 ("samples/bpf: xdp_redirect load XDP dummy prog on TX device")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 samples/bpf/xdp_redirect_kern.c     | 2 +-
 samples/bpf/xdp_redirect_map_kern.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/xdp_redirect_kern.c b/samples/bpf/xdp_redirect_kern.c
index 1c90288d0203..8abb151e385f 100644
--- a/samples/bpf/xdp_redirect_kern.c
+++ b/samples/bpf/xdp_redirect_kern.c
@@ -82,7 +82,7 @@ int xdp_redirect_prog(struct xdp_md *ctx)
 
 /* Redirect require an XDP bpf_prog loaded on the TX device */
 SEC("xdp_redirect_dummy")
-int xdp_redirect_dummy(struct xdp_md *ctx)
+int xdp_redirect_dummy_prog(struct xdp_md *ctx)
 {
 	return XDP_PASS;
 }
diff --git a/samples/bpf/xdp_redirect_map_kern.c b/samples/bpf/xdp_redirect_map_kern.c
index 79795d41ad0d..740a529ba84f 100644
--- a/samples/bpf/xdp_redirect_map_kern.c
+++ b/samples/bpf/xdp_redirect_map_kern.c
@@ -84,7 +84,7 @@ int xdp_redirect_map_prog(struct xdp_md *ctx)
 
 /* Redirect require an XDP bpf_prog loaded on the TX device */
 SEC("xdp_redirect_dummy")
-int xdp_redirect_dummy(struct xdp_md *ctx)
+int xdp_redirect_dummy_prog(struct xdp_md *ctx)
 {
 	return XDP_PASS;
 }
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net-next] samples/bpf: Fix compilation issue in redirect dummy program
From: Jesper Dangaard Brouer @ 2017-08-31 11:27 UTC (permalink / raw)
  To: Tariq Toukan; +Cc: brouer, David S. Miller, netdev, Eran Ben Elisha
In-Reply-To: <1504178199-12410-1-git-send-email-tariqt@mellanox.com>

On Thu, 31 Aug 2017 14:16:39 +0300
Tariq Toukan <tariqt@mellanox.com> wrote:

> Fix compilation error below:
> 
> $ make samples/bpf/
> 
> LLVM ERROR: 'xdp_redirect_dummy' label emitted multiple times to
> assembly file
> make[1]: *** [samples/bpf/xdp_redirect_kern.o] Error 1
> make: *** [samples/bpf/] Error 2
> 
> Fixes: 306da4e685b4 ("samples/bpf: xdp_redirect load XDP dummy prog on TX device")
> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
> ---

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

What LLVM/clang version do you use?

I don't see this compile error, and I have:
 $ clang --version
 clang version 3.9.1 (tags/RELEA

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* hi
From: SUSAN JEAN @ 2017-08-31 11:32 UTC (permalink / raw)


you speak French or English

^ permalink raw reply

* Re: [PATCH net-next] samples/bpf: Fix compilation issue in redirect dummy program
From: Daniel Borkmann @ 2017-08-31 11:43 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Tariq Toukan
  Cc: David S. Miller, netdev, Eran Ben Elisha, ys114321,
	alexei.starovoitov
In-Reply-To: <20170831132710.3b245a6a@redhat.com>

On 08/31/2017 01:27 PM, Jesper Dangaard Brouer wrote:
> On Thu, 31 Aug 2017 14:16:39 +0300
> Tariq Toukan <tariqt@mellanox.com> wrote:
>
>> Fix compilation error below:
>>
>> $ make samples/bpf/
>>
>> LLVM ERROR: 'xdp_redirect_dummy' label emitted multiple times to
>> assembly file
>> make[1]: *** [samples/bpf/xdp_redirect_kern.o] Error 1
>> make: *** [samples/bpf/] Error 2
>>
>> Fixes: 306da4e685b4 ("samples/bpf: xdp_redirect load XDP dummy prog on TX device")
>> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
>> ---
>
> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

> What LLVM/clang version do you use?
>
> I don't see this compile error, and I have:
>   $ clang --version
>   clang version 3.9.1 (tags/RELEA

I'm seeing the error as well with a fairly recent LLVM from git
tree (6.0.0git-2d810c2).

Looks like the llvm error is triggered when section name and
the function name for XDP prog is the same. Changing either the
function or the section name right above resolves the issue. If
such error didn't trigger on older versions, people could be
using such naming scheme as done here, so seems to me like a
regression on LLVM side we might need to look at ...

In any case, patch here is fine, thanks!

^ permalink raw reply

* Re: [GIT] Networking
From: Kalle Valo @ 2017-08-31 11:50 UTC (permalink / raw)
  To: Pavel Machek
  Cc: David Miller, xiyou.wangcong, torvalds, akpm, netdev,
	linux-kernel, linux-wireless
In-Reply-To: <20170831065204.GA17812@amd>

(Adding linux-wireless)

Pavel Machek <pavel@ucw.cz> writes:

> On Thu 2017-08-31 07:44:58, Kalle Valo wrote:
>> David Miller <davem@davemloft.net> writes:
>> 
>> > From: Kalle Valo <kvalo@codeaurora.org>
>> > Date: Wed, 30 Aug 2017 20:31:31 +0300
>> >
>> >> AFAICS the bug was introduced by 9df86e2e702c6 back in 2010. If the bug
>> >> has been there for 7 years so waiting for a few more weeks should not
>> >> hurt.
>> >
>> > As a maintainer you have a right to handle bug fixing in that way, but
>> > certainly that is not how I would handle this.
>> >
>> > It's easy to validate this fix, it's extremely unlikely to cause
>> > a regression, and fixes a problem someone actually was able to
>> > trigger.
>> >
>> > Deferring to -next only has the side effect of making people wait
>> > longer for the fix.
>> 
>> Yeah, you are right there. I did actually ponder which I tree should
>> commit it back in July but due to various reasons decided differently.
>
> Can we still get the fix to v4.13-final? :-).

I'm not planning to submit pull requests to 4.13 anymore. If you think
this is so important that it needs to be applied in the last minute (I
don't) you could always try to convince Dave to take it directly.

Or better yet, push it to the stable tree. If the merge window opens on
Sunday I suspect that the commit will be in Linus' tree sometime next
week. Then you can submit the request to the stable team to take it.

-- 
Kalle Valo

^ permalink raw reply

* [PATCH net-next] x86: bpf_jit: small optimization in emit_bpf_tail_call()
From: Eric Dumazet @ 2017-08-31 11:53 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Alexei Starovoitov, Daniel Borkmann

From: Eric Dumazet <edumazet@google.com>

Saves 4 bytes replacing following instructions :

lea rax, [rsi + rdx * 8 + offsetof(...)] 
mov rax, qword ptr [rax]
cmp rax, 0

by :

mov rax, [rsi + rdx * 8 + offsetof(...)] 
test rax, rax

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
---
 arch/x86/net/bpf_jit_comp.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 8194696e2805852a8246bf1db2a5862d5758f76b..7af5ee584bf8546f3c1b92df9d9f513416f2089d 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -287,7 +287,7 @@ static void emit_bpf_tail_call(u8 **pprog)
 	EMIT4(0x48, 0x8B, 0x46,                   /* mov rax, qword ptr [rsi + 16] */
 	      offsetof(struct bpf_array, map.max_entries));
 	EMIT3(0x48, 0x39, 0xD0);                  /* cmp rax, rdx */
-#define OFFSET1 47 /* number of bytes to jump */
+#define OFFSET1 43 /* number of bytes to jump */
 	EMIT2(X86_JBE, OFFSET1);                  /* jbe out */
 	label1 = cnt;
 
@@ -296,21 +296,20 @@ static void emit_bpf_tail_call(u8 **pprog)
 	 */
 	EMIT2_off32(0x8B, 0x85, 36);              /* mov eax, dword ptr [rbp + 36] */
 	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
-#define OFFSET2 36
+#define OFFSET2 32
 	EMIT2(X86_JA, OFFSET2);                   /* ja out */
 	label2 = cnt;
 	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
 	EMIT2_off32(0x89, 0x85, 36);              /* mov dword ptr [rbp + 36], eax */
 
 	/* prog = array->ptrs[index]; */
-	EMIT4_off32(0x48, 0x8D, 0x84, 0xD6,       /* lea rax, [rsi + rdx * 8 + offsetof(...)] */
+	EMIT4_off32(0x48, 0x8B, 0x84, 0xD6,       /* mov rax, [rsi + rdx * 8 + offsetof(...)] */
 		    offsetof(struct bpf_array, ptrs));
-	EMIT3(0x48, 0x8B, 0x00);                  /* mov rax, qword ptr [rax] */
 
 	/* if (prog == NULL)
 	 *   goto out;
 	 */
-	EMIT4(0x48, 0x83, 0xF8, 0x00);            /* cmp rax, 0 */
+	EMIT3(0x48, 0x85, 0xC0);		  /* test rax,rax */
 #define OFFSET3 10
 	EMIT2(X86_JE, OFFSET3);                   /* je out */
 	label3 = cnt;

^ permalink raw reply related

* Re: [PATCH] vsock: only load vmci transport on VMware hypervisor by default
From: Stefan Hajnoczi @ 2017-08-31 11:54 UTC (permalink / raw)
  To: Jorgen S. Hansen
  Cc: Michal Kubecek, jasowang@redhat.com, George Zhang,
	Stephen Hemminger, Rolf Neugebauer, Marcelo Cerri, Asias He,
	Dan Carpenter, olaf@aepfle.de, Haiyang Zhang, Dave Scott,
	apw@canonical.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, gregkh@linuxfoundation.org,
	joe@perches.com, devel@linuxdriverproject.org, Vitaly Kuznetsov,
	davem@davemloft.net
In-Reply-To: <61AE41B3-D143-46D2-8C68-009B78AF12C8@vmware.com>

On Tue, Aug 29, 2017 at 03:37:07PM +0000, Jorgen S. Hansen wrote:
> > On Aug 29, 2017, at 4:36 AM, Dexuan Cui <decui@microsoft.com> wrote:
> If we allow multiple host side transports, virtio host side support and
> vmci should be able to coexist regardless of the order of initialization.

That sounds good to me.

This means af_vsock.c needs to be aware of CID allocation.  Currently the
vhost_vsock.ko driver handles this itself (it keeps a list of CIDs and
checks that they are not used twice).  It should be possible to move
that state into af_vsock.c so we have <cid, host_transport> pairs.

I'm currently working on NFS over AF_VSOCK and sock_diag support (for
ss(8) and netstat-like tools).

Multi-transport support is lower priority for me at the moment.  I'm
happy to review patches though.  If there is no progress on this by the
end of the year then I will have time to work on it.

Are either of you are in Prague, Czech Republic on October 25-27 for
Linux Kernel Summit, Open Source Summit Europe, Embedded Linux
Conference Europe, KVM Forum, or MesosCon Europe?

Stefan

^ permalink raw reply

* Re: [PATCH net-next] xfrm: Add support for network devices capable of removing the ESP trailer
From: Steffen Klassert @ 2017-08-31 11:56 UTC (permalink / raw)
  To: yossiku
  Cc: Herbert Xu, David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	netdev, linux-kernel, borisp, kliteyn
In-Reply-To: <1504081839-22019-1-git-send-email-yossiku@mellanox.com>

On Wed, Aug 30, 2017 at 11:30:39AM +0300, yossiku@mellanox.com wrote:
> From: Yossi Kuperman <yossiku@mellanox.com>
> 
> In conjunction with crypto offload [1], removing the ESP trailer by
> hardware can potentially improve the performance by avoiding (1) a
> cache miss incurred by reading the nexthdr field and (2) the necessity
> to calculate the csum value of the trailer in order to keep skb->csum
> valid.
> 
> This patch introduces the changes to the xfrm stack and merely serves
> as an infrastructure. Subsequent patch to mlx5 driver will put this to
> a good use.
> 
> [1] https://www.mail-archive.com/netdev@vger.kernel.org/msg175733.html
> 
> Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>

Applied to ipsec-next, thanks Yossi!

^ permalink raw reply

* Re: [PATCH net 0/2] netfilter: ipvs: some fixes in sctp_conn_schedule
From: Simon Horman @ 2017-08-31 11:59 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Xin Long, netfilter-devel, Alex Gartrell, lvs-devel, netdev, ja,
	wensong
In-Reply-To: <20170828161732.GA13738@salvia>

On Mon, Aug 28, 2017 at 06:17:32PM +0200, Pablo Neira Ayuso wrote:
> On Sun, Aug 20, 2017 at 01:38:06PM +0800, Xin Long wrote:
> > Patch 1/2 fixes the regression introduced by commit 5e26b1b3abce.
> > Patch 2/2 makes ipvs not create conn for sctp ABORT packet.
> 
> Will wait for Julian and Simon to tell me what I should do with this.

Hi Pablo,

could you take these directly with Julian's Ack and the following?

Signed-off-by: Simon Horman <horms@verge.net.au>


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox