Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v4] route: do not cache fib route info on local routes with oif
From: Julian Anastasov @ 2016-04-08 22:08 UTC (permalink / raw)
  To: Chris Friesen; +Cc: netdev
In-Reply-To: <570820DA.2070007@windriver.com>


	Hello,

On Fri, 8 Apr 2016, Chris Friesen wrote:

> For local routes that require a particular output interface we do not want
> to cache the result.  Caching the result causes incorrect behaviour when
> there are multiple source addresses on the interface.  The end result
> being that if the intended recipient is waiting on that interface for the
> packet he won't receive it because it will be delivered on the loopback
> interface and the IP_PKTINFO ipi_ifindex will be set to the loopback
> interface as well.
> 
> This can be tested by running a program such as "dhcp_release" which
> attempts to inject a packet on a particular interface so that it is
> received by another program on the same board.  The receiving process
> should see an IP_PKTINFO ipi_ifndex value of the source interface
> (e.g., eth1) instead of the loopback interface (e.g., lo).  The packet
> will still appear on the loopback interface in tcpdump but the important
> aspect is that the CMSG info is correct.
> 
> Sample dhcp_release command line:
> 
>    dhcp_release eth1 192.168.204.222 02:11:33:22:44:66
> 
> Signed-off-by: Allain Legacy <allain.legacy@windriver.com>
> Signed off-by: Chris Friesen <chris.friesen@windriver.com>

	Looks good to me.

Reviewed-by: Julian Anastasov <ja@ssi.bg>

> ---
>  net/ipv4/route.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 02c6229..b050cf9 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2045,6 +2045,18 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
>  		 */
>  		if (fi && res->prefixlen < 4)
>  			fi = NULL;
> +	} else if ((type == RTN_LOCAL) && (orig_oif != 0) &&
> +		   (orig_oif != dev_out->ifindex)) {
> +		/* For local routes that require a particular output interface
> +		 * we do not want to cache the result.  Caching the result
> +		 * causes incorrect behaviour when there are multiple source
> +		 * addresses on the interface, the end result being that if the
> +		 * intended recipient is waiting on that interface for the
> +		 * packet he won't receive it because it will be delivered on
> +		 * the loopback interface and the IP_PKTINFO ipi_ifindex will
> +		 * be set to the loopback interface as well.
> +		 */
> +		fi = NULL;
>  	}
>  
>  	fnhe = NULL;

Regards

^ permalink raw reply

* Re: [PATCH net-next 1/8] perf: optimize perf_fetch_caller_regs
From: Steven Rostedt @ 2016-04-08 22:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexei Starovoitov, David S . Miller, Ingo Molnar,
	Daniel Borkmann, Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik,
	Brendan Gregg, netdev, linux-kernel, kernel-team
In-Reply-To: <20160405120626.GM3448@twins.programming.kicks-ass.net>

On Tue, 5 Apr 2016 14:06:26 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Apr 04, 2016 at 09:52:47PM -0700, Alexei Starovoitov wrote:
> > avoid memset in perf_fetch_caller_regs, since it's the critical path of all tracepoints.
> > It's called from perf_sw_event_sched, perf_event_task_sched_in and all of perf_trace_##call
> > with this_cpu_ptr(&__perf_regs[..]) which are zero initialized by perpcu_alloc  
> 
> Its not actually allocated; but because its a static uninitialized
> variable we get .bss like behaviour and the initial value is copied to
> all CPUs when the per-cpu allocator thingy bootstraps SMP IIRC.
> 
> > and
> > subsequent call to perf_arch_fetch_caller_regs initializes the same fields on all archs,
> > so we can safely drop memset from all of the above cases and   
> 
> Indeed.
> 
> > move it into
> > perf_ftrace_function_call that calls it with stack allocated pt_regs.  
> 
> Hmm, is there a reason that's still on-stack instead of using the
> per-cpu thing, Steve?

Well, what do you do when you are tracing with regs in an interrupt
that already set the per cpu regs field? We could create our own
per-cpu one as well, but then that would require checking which level
we are in, as we can have one for normal context, one for softirq
context, one for irq context and one for nmi context.

-- Steve



> 
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>  
> 
> In any case,
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply

* Re: [net-next PATCH 2/5] GSO: Add GSO type for fixed IPv4 ID
From: Alexander Duyck @ 2016-04-08 22:12 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Alexander Duyck, Herbert Xu, Tom Herbert, Eric Dumazet,
	Linux Kernel Network Developers, David Miller
In-Reply-To: <CAEh+42j21N+3ScmR9D9P3esgUL-G0kWxNT1x2zfGi_9M-bPKzw@mail.gmail.com>

On Fri, Apr 8, 2016 at 2:41 PM, Jesse Gross <jesse@kernel.org> wrote:
> On Fri, Apr 8, 2016 at 5:33 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
>> This patch adds support for TSO using IPv4 headers with a fixed IP ID
>> field.  This is meant to allow us to do a lossless GRO in the case of TCP
>> flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
>> headers.
>>
>> In addition I am adding a feature that for now I am referring to TSO with
>> IP ID mangling.  Basically when this flag is enabled the device has the
>> option to either output the flow with incrementing IP IDs or with a fixed
>> IP ID regardless of what the original IP ID ordering was.  This is useful
>> in cases where the DF bit is set and we do not care if the original IP ID
>> value is maintained.
>>
>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>
> I think SKB_GSO_TCP_FIXEDID would also need to be added to the list of
> enumerated OK GSO types for MPLS GSO.

I'll have that fixed for v2.

- Alex

^ permalink raw reply

* [PATCH RFT 0/2] Teach phylib hard-resetting devices
From: Sergei Shtylyov @ 2016-04-08 22:21 UTC (permalink / raw)
  To: grant.likely, robh+dt, devicetree, f.fainelli, netdev,
	frowand.list, pawel.moll, mark.rutland, ijc+devicetree, galak
  Cc: linux-kernel

Hello.

   Here's the set of 2 patches against DaveM's 'net-next.git' repo. They add to
'phylib' support for resetting devices via GPIO and do some clean up after
doing that...

[1/2] phylib: add device reset GPIO support
[2/2] macb: kill PHY reset code

MBR, Sergei

^ permalink raw reply

* [PATCH RFT 1/2] phylib: add device reset GPIO support
From: Sergei Shtylyov @ 2016-04-08 22:22 UTC (permalink / raw)
  To: grant.likely, robh+dt, devicetree, f.fainelli, netdev,
	frowand.list, pawel.moll, mark.rutland, ijc+devicetree, galak
  Cc: linux-kernel
In-Reply-To: <81129033.NXiOLTg1so@wasted.cogentembedded.com>

The PHY  devices sometimes do have their reset signal (maybe even power
supply?) tied to some GPIO and sometimes it also does happen that a boot
loader does not leave it deasserted. So far this issue has been attacked
from (as I believe) a wrong angle: by teaching the MAC driver to manipulate
the GPIO in question;  that solution, when  applied to the device trees,
led to adding the PHY reset GPIO properties to the MAC device node, with
one exception: Cadence MACB driver which could handle the "reset-gpios"
prop  in a PHY device  subnode.  I believe that the correct approach is to
teach the 'phylib' to get the MDIO device reset GPIO from the device tree
node corresponding to this device -- which this patch is doing...

Note that I had to modify the  AT803x PHY driver as it would stop working
otherwise as it made use of the reset GPIO for its own purposes...

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>

---
 Documentation/devicetree/bindings/net/phy.txt |    2 +
 drivers/net/phy/at803x.c                      |   19 ++------------
 drivers/net/phy/mdio_bus.c                    |    4 +++
 drivers/net/phy/mdio_device.c                 |   27 +++++++++++++++++++--
 drivers/net/phy/phy_device.c                  |   33 ++++++++++++++++++++++++--
 drivers/of/of_mdio.c                          |   16 ++++++++++++
 include/linux/mdio.h                          |    3 ++
 include/linux/phy.h                           |    5 +++
 8 files changed, 89 insertions(+), 20 deletions(-)

Index: net-next/Documentation/devicetree/bindings/net/phy.txt
===================================================================
--- net-next.orig/Documentation/devicetree/bindings/net/phy.txt
+++ net-next/Documentation/devicetree/bindings/net/phy.txt
@@ -35,6 +35,8 @@ Optional Properties:
 - broken-turn-around: If set, indicates the PHY device does not correctly
   release the turn around line low at the end of a MDIO transaction.
 
+- reset-gpios: The GPIO phandle and specifier for the PHY reset signal.
+
 Example:
 
 ethernet-phy@0 {
Index: net-next/drivers/net/phy/at803x.c
===================================================================
--- net-next.orig/drivers/net/phy/at803x.c
+++ net-next/drivers/net/phy/at803x.c
@@ -65,7 +65,6 @@ MODULE_LICENSE("GPL");
 
 struct at803x_priv {
 	bool phy_reset:1;
-	struct gpio_desc *gpiod_reset;
 };
 
 struct at803x_context {
@@ -271,22 +270,10 @@ static int at803x_probe(struct phy_devic
 {
 	struct device *dev = &phydev->mdio.dev;
 	struct at803x_priv *priv;
-	struct gpio_desc *gpiod_reset;
 
 	priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
 	if (!priv)
 		return -ENOMEM;
-
-	if (phydev->drv->phy_id != ATH8030_PHY_ID)
-		goto does_not_require_reset_workaround;
-
-	gpiod_reset = devm_gpiod_get_optional(dev, "reset", GPIOD_OUT_LOW);
-	if (IS_ERR(gpiod_reset))
-		return PTR_ERR(gpiod_reset);
-
-	priv->gpiod_reset = gpiod_reset;
-
-does_not_require_reset_workaround:
 	phydev->priv = priv;
 
 	return 0;
@@ -361,14 +348,14 @@ static void at803x_link_change_notify(st
 	 */
 	if (phydev->drv->phy_id == ATH8030_PHY_ID) {
 		if (phydev->state == PHY_NOLINK) {
-			if (priv->gpiod_reset && !priv->phy_reset) {
+			if (phydev->mdio.reset && !priv->phy_reset) {
 				struct at803x_context context;
 
 				at803x_context_save(phydev, &context);
 
-				gpiod_set_value(priv->gpiod_reset, 1);
+				phy_device_reset(phydev, 1);
 				msleep(1);
-				gpiod_set_value(priv->gpiod_reset, 0);
+				phy_device_reset(phydev, 0);
 				msleep(1);
 
 				at803x_context_restore(phydev, &context);
Index: net-next/drivers/net/phy/mdio_bus.c
===================================================================
--- net-next.orig/drivers/net/phy/mdio_bus.c
+++ net-next/drivers/net/phy/mdio_bus.c
@@ -35,6 +35,7 @@
 #include <linux/phy.h>
 #include <linux/io.h>
 #include <linux/uaccess.h>
+#include <linux/gpio/consumer.h>
 
 #include <asm/irq.h>
 
@@ -371,6 +372,9 @@ void mdiobus_unregister(struct mii_bus *
 		if (!mdiodev)
 			continue;
 
+		if (mdiodev->reset)
+			gpiod_put(mdiodev->reset);
+
 		mdiodev->device_remove(mdiodev);
 		mdiodev->device_free(mdiodev);
 	}
Index: net-next/drivers/net/phy/mdio_device.c
===================================================================
--- net-next.orig/drivers/net/phy/mdio_device.c
+++ net-next/drivers/net/phy/mdio_device.c
@@ -12,6 +12,8 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include <linux/errno.h>
+#include <linux/gpio.h>
+#include <linux/gpio/consumer.h>
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
@@ -103,6 +105,13 @@ void mdio_device_remove(struct mdio_devi
 }
 EXPORT_SYMBOL(mdio_device_remove);
 
+void mdio_device_reset(struct mdio_device *mdiodev, int value)
+{
+	if (mdiodev->reset)
+		gpiod_set_value(mdiodev->reset, value);
+}
+EXPORT_SYMBOL(mdio_device_reset);
+
 /**
  * mdio_probe - probe an MDIO device
  * @dev: device to probe
@@ -117,9 +126,16 @@ static int mdio_probe(struct device *dev
 	struct mdio_driver *mdiodrv = to_mdio_driver(drv);
 	int err = 0;
 
-	if (mdiodrv->probe)
+	if (mdiodrv->probe) {
+		/* Deassert the reset signal */
+		mdio_device_reset(mdiodev, 0);
+
 		err = mdiodrv->probe(mdiodev);
 
+		/* Assert the reset signal */
+		mdio_device_reset(mdiodev, 1);
+	}
+
 	return err;
 }
 
@@ -129,9 +145,16 @@ static int mdio_remove(struct device *de
 	struct device_driver *drv = mdiodev->dev.driver;
 	struct mdio_driver *mdiodrv = to_mdio_driver(drv);
 
-	if (mdiodrv->remove)
+	if (mdiodrv->remove) {
+		/* Deassert the reset signal */
+		mdio_device_reset(mdiodev, 0);
+
 		mdiodrv->remove(mdiodev);
 
+		/* Assert the reset signal */
+		mdio_device_reset(mdiodev, 1);
+	}
+
 	return 0;
 }
 
Index: net-next/drivers/net/phy/phy_device.c
===================================================================
--- net-next.orig/drivers/net/phy/phy_device.c
+++ net-next/drivers/net/phy/phy_device.c
@@ -589,6 +589,9 @@ int phy_device_register(struct phy_devic
 	if (err)
 		return err;
 
+	/* Deassert the reset signal */
+	phy_device_reset(phydev, 0);
+
 	/* Run all of the fixups for this PHY */
 	err = phy_scan_fixups(phydev);
 	if (err) {
@@ -604,9 +607,15 @@ int phy_device_register(struct phy_devic
 		goto out;
 	}
 
+	/* Assert the reset signal */
+	phy_device_reset(phydev, 1);
+
 	return 0;
 
  out:
+	/* Assert the reset signal */
+	phy_device_reset(phydev, 1);
+
 	mdiobus_unregister_device(&phydev->mdio);
 	return err;
 }
@@ -792,6 +801,9 @@ int phy_init_hw(struct phy_device *phyde
 {
 	int ret = 0;
 
+	/* Deassert the reset signal */
+	phy_device_reset(phydev, 0);
+
 	if (!phydev->drv || !phydev->drv->config_init)
 		return 0;
 
@@ -997,6 +1009,9 @@ void phy_detach(struct phy_device *phyde
 
 	put_device(&phydev->mdio.dev);
 	module_put(bus->owner);
+
+	/* Assert the reset signal */
+	phy_device_reset(phydev, 1);
 }
 EXPORT_SYMBOL(phy_detach);
 
@@ -1595,9 +1610,16 @@ static int phy_probe(struct device *dev)
 	/* Set the state to READY by default */
 	phydev->state = PHY_READY;
 
-	if (phydev->drv->probe)
+	if (phydev->drv->probe) {
+		/* Deassert the reset signal */
+		phy_device_reset(phydev, 0);
+
 		err = phydev->drv->probe(phydev);
 
+		/* Assert the reset signal */
+		phy_device_reset(phydev, 1);
+	}
+
 	mutex_unlock(&phydev->lock);
 
 	return err;
@@ -1611,8 +1633,15 @@ static int phy_remove(struct device *dev
 	phydev->state = PHY_DOWN;
 	mutex_unlock(&phydev->lock);
 
-	if (phydev->drv->remove)
+	if (phydev->drv->remove) {
+		/* Deassert the reset signal */
+		phy_device_reset(phydev, 0);
+
 		phydev->drv->remove(phydev);
+
+		/* Assert the reset signal */
+		phy_device_reset(phydev, 1);
+	}
 	phydev->drv = NULL;
 
 	return 0;
Index: net-next/drivers/of/of_mdio.c
===================================================================
--- net-next.orig/drivers/of/of_mdio.c
+++ net-next/drivers/of/of_mdio.c
@@ -44,6 +44,7 @@ static int of_get_phy_id(struct device_n
 static int of_mdiobus_register_phy(struct mii_bus *mdio, struct device_node *child,
 				   u32 addr)
 {
+	struct gpio_desc *gpiod;
 	struct phy_device *phy;
 	bool is_c45;
 	int rc;
@@ -52,10 +53,17 @@ static int of_mdiobus_register_phy(struc
 	is_c45 = of_device_is_compatible(child,
 					 "ethernet-phy-ieee802.3-c45");
 
+	gpiod = fwnode_get_named_gpiod(&child->fwnode, "reset-gpios");
+	/* Deassert the reset signal */
+	if (!IS_ERR(gpiod))
+		gpiod_direction_output(gpiod, 0);
 	if (!is_c45 && !of_get_phy_id(child, &phy_id))
 		phy = phy_device_create(mdio, addr, phy_id, 0, NULL);
 	else
 		phy = get_phy_device(mdio, addr, is_c45);
+	/* Assert the reset signal again */
+	if (!IS_ERR(gpiod))
+		gpiod_set_value(gpiod, 1);
 	if (IS_ERR_OR_NULL(phy))
 		return 1;
 
@@ -75,6 +83,9 @@ static int of_mdiobus_register_phy(struc
 	of_node_get(child);
 	phy->mdio.dev.of_node = child;
 
+	if (!IS_ERR(gpiod))
+		phy->mdio.reset = gpiod;
+
 	/* All data is now stored in the phy struct;
 	 * register it */
 	rc = phy_device_register(phy);
@@ -95,6 +106,7 @@ static int of_mdiobus_register_device(st
 				      u32 addr)
 {
 	struct mdio_device *mdiodev;
+	struct gpio_desc *gpiod;
 	int rc;
 
 	mdiodev = mdio_device_create(mdio, addr);
@@ -107,6 +119,10 @@ static int of_mdiobus_register_device(st
 	of_node_get(child);
 	mdiodev->dev.of_node = child;
 
+	gpiod = fwnode_get_named_gpiod(&child->fwnode, "reset-gpios");
+	if (!IS_ERR(gpiod))
+		mdiodev->reset = gpiod;
+
 	/* All data is now stored in the mdiodev struct; register it. */
 	rc = mdio_device_register(mdiodev);
 	if (rc) {
Index: net-next/include/linux/mdio.h
===================================================================
--- net-next.orig/include/linux/mdio.h
+++ net-next/include/linux/mdio.h
@@ -11,6 +11,7 @@
 
 #include <uapi/linux/mdio.h>
 
+struct gpio_desc;
 struct mii_bus;
 
 struct mdio_device {
@@ -26,6 +27,7 @@ struct mdio_device {
 	/* Bus address of the MDIO device (0-31) */
 	int addr;
 	int flags;
+	struct gpio_desc *reset;
 };
 #define to_mdio_device(d) container_of(d, struct mdio_device, dev)
 
@@ -58,6 +60,7 @@ void mdio_device_free(struct mdio_device
 struct mdio_device *mdio_device_create(struct mii_bus *bus, int addr);
 int mdio_device_register(struct mdio_device *mdiodev);
 void mdio_device_remove(struct mdio_device *mdiodev);
+void mdio_device_reset(struct mdio_device *mdiodev, int value);
 int mdio_driver_register(struct mdio_driver *drv);
 void mdio_driver_unregister(struct mdio_driver *drv);
 
Index: net-next/include/linux/phy.h
===================================================================
--- net-next.orig/include/linux/phy.h
+++ net-next/include/linux/phy.h
@@ -769,6 +769,11 @@ static inline int phy_read_status(struct
 	return phydev->drv->read_status(phydev);
 }
 
+static inline void phy_device_reset(struct phy_device *phydev, int value)
+{
+	mdio_device_reset(&phydev->mdio, value);
+}
+
 #define phydev_err(_phydev, format, args...)	\
 	dev_err(&_phydev->mdio.dev, format, ##args)
 

^ permalink raw reply

* [PATCH RFT 2/2] macb: kill PHY reset code
From: Sergei Shtylyov @ 2016-04-08 22:25 UTC (permalink / raw)
  To: nicolas.ferre, netdev; +Cc: linux-kernel
In-Reply-To: <81129033.NXiOLTg1so@wasted.cogentembedded.com>

With  the 'phylib' now  being aware of  the "reset-gpios" PHY node property,
there should be no need to frob the PHY reset in this  driver anymore...

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>

---
 drivers/net/ethernet/cadence/macb.c |   17 -----------------
 drivers/net/ethernet/cadence/macb.h |    1 -
 2 files changed, 18 deletions(-)

Index: net-next/drivers/net/ethernet/cadence/macb.c
===================================================================
--- net-next.orig/drivers/net/ethernet/cadence/macb.c
+++ net-next/drivers/net/ethernet/cadence/macb.c
@@ -2884,7 +2884,6 @@ static int macb_probe(struct platform_de
 					      = macb_clk_init;
 	int (*init)(struct platform_device *) = macb_init;
 	struct device_node *np = pdev->dev.of_node;
-	struct device_node *phy_node;
 	const struct macb_config *macb_config = NULL;
 	struct clk *pclk, *hclk = NULL, *tx_clk = NULL;
 	unsigned int queue_mask, num_queues;
@@ -2977,18 +2976,6 @@ static int macb_probe(struct platform_de
 	else
 		macb_get_hwaddr(bp);
 
-	/* Power up the PHY if there is a GPIO reset */
-	phy_node =  of_get_next_available_child(np, NULL);
-	if (phy_node) {
-		int gpio = of_get_named_gpio(phy_node, "reset-gpios", 0);
-
-		if (gpio_is_valid(gpio)) {
-			bp->reset_gpio = gpio_to_desc(gpio);
-			gpiod_direction_output(bp->reset_gpio, 1);
-		}
-	}
-	of_node_put(phy_node);
-
 	err = of_get_phy_mode(np);
 	if (err < 0) {
 		pdata = dev_get_platdata(&pdev->dev);
@@ -3054,10 +3041,6 @@ static int macb_remove(struct platform_d
 		mdiobus_unregister(bp->mii_bus);
 		mdiobus_free(bp->mii_bus);
 
-		/* Shutdown the PHY if there is a GPIO reset */
-		if (bp->reset_gpio)
-			gpiod_set_value(bp->reset_gpio, 0);
-
 		unregister_netdev(dev);
 		clk_disable_unprepare(bp->tx_clk);
 		clk_disable_unprepare(bp->hclk);
Index: net-next/drivers/net/ethernet/cadence/macb.h
===================================================================
--- net-next.orig/drivers/net/ethernet/cadence/macb.h
+++ net-next/drivers/net/ethernet/cadence/macb.h
@@ -832,7 +832,6 @@ struct macb {
 	unsigned int		dma_burst_length;
 
 	phy_interface_t		phy_interface;
-	struct gpio_desc	*reset_gpio;
 
 	/* AT91RM9200 transmit */
 	struct sk_buff *skb;			/* holds skb until xmit interrupt completes */

^ permalink raw reply

* [net-next][PATCH 0/2] RDS: couple of fixes for 4.6
From: Santosh Shilimkar @ 2016-04-08 22:26 UTC (permalink / raw)
  To: netdev, davem; +Cc: linux-kernel

Patches are also available at below git tree. 

git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git for_4.6/net-next/rds-fixes

Qing Huang (1):
  RDS: fix endianness for dp_ack_seq

Santosh Shilimkar (1):
  RDS: Fix the atomicity for congestion map update

 net/rds/cong.c  | 4 ++--
 net/rds/ib_cm.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

-- 
1.9.1

^ permalink raw reply

* [net-next][PATCH 1/2] RDS: fix endianness for dp_ack_seq
From: Santosh Shilimkar @ 2016-04-08 22:26 UTC (permalink / raw)
  To: netdev, davem; +Cc: linux-kernel, Santosh Shilimkar
In-Reply-To: <1460154400-14546-1-git-send-email-santosh.shilimkar@oracle.com>

From: Qing Huang <qing.huang@oracle.com>

dp->dp_ack_seq is used in big endian format. We need to do the
big endianness conversion when we assign a value in host format
to it.

Signed-off-by: Qing Huang <qing.huang@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
---
 net/rds/ib_cm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 8764970..310cabc 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -194,7 +194,7 @@ static void rds_ib_cm_fill_conn_param(struct rds_connection *conn,
 		dp->dp_protocol_major = RDS_PROTOCOL_MAJOR(protocol_version);
 		dp->dp_protocol_minor = RDS_PROTOCOL_MINOR(protocol_version);
 		dp->dp_protocol_minor_mask = cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS);
-		dp->dp_ack_seq = rds_ib_piggyb_ack(ic);
+		dp->dp_ack_seq = cpu_to_be64(rds_ib_piggyb_ack(ic));
 
 		/* Advertise flow control */
 		if (ic->i_flowctl) {
-- 
1.9.1

^ permalink raw reply related

* [net-next][PATCH 2/2] RDS: Fix the atomicity for congestion map update
From: Santosh Shilimkar @ 2016-04-08 22:26 UTC (permalink / raw)
  To: netdev, davem; +Cc: linux-kernel, Santosh Shilimkar
In-Reply-To: <1460154400-14546-1-git-send-email-santosh.shilimkar@oracle.com>

Two different threads with different rds sockets may be in
rds_recv_rcvbuf_delta() via receive path. If their ports
both map to the same word in the congestion map, then
using non-atomic ops to update it could cause the map to
be incorrect. Lets use atomics to avoid such an issue.

Full credit to Wengang <wen.gang.wang@oracle.com> for
finding the issue, analysing it and also pointing out
to offending code with spin lock based fix.

Reviewed-by: Leon Romanovsky <leon@leon.nu>
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
---
 net/rds/cong.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/cong.c b/net/rds/cong.c
index e6144b8..6641bcf 100644
--- a/net/rds/cong.c
+++ b/net/rds/cong.c
@@ -299,7 +299,7 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port)
 	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
 	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
 
-	__set_bit_le(off, (void *)map->m_page_addrs[i]);
+	set_bit_le(off, (void *)map->m_page_addrs[i]);
 }
 
 void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
@@ -313,7 +313,7 @@ void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
 	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
 	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
 
-	__clear_bit_le(off, (void *)map->m_page_addrs[i]);
+	clear_bit_le(off, (void *)map->m_page_addrs[i]);
 }
 
 static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port)
-- 
1.9.1

^ permalink raw reply related

* [PATCH iproute2 -master 0/3] Minor tc/bpf updates
From: Daniel Borkmann @ 2016-04-08 22:32 UTC (permalink / raw)
  To: stephen; +Cc: netdev, alexei.starovoitov, Daniel Borkmann

Some minor updates to improve error reporting, add signatures
and recently introduced map flags attribute. Set is against
master branch.

Thanks!

Daniel Borkmann (3):
  tc, bpf: add new csum and tunnel signatures
  tc, bpf: further improve error reporting
  tc, bpf: add support for map pre/allocation

 examples/bpf/bpf_cyclic.c   |  9 ++++-
 examples/bpf/bpf_graft.c    |  8 +++-
 examples/bpf/bpf_prog.c     |  2 +
 examples/bpf/bpf_shared.c   |  8 +++-
 examples/bpf/bpf_tailcall.c | 29 ++++++++++++--
 include/bpf_api.h           | 52 ++++--------------------
 include/bpf_elf.h           |  1 +
 tc/tc_bpf.c                 | 98 +++++++++++++++++++++++++++++++++++----------
 tc/tc_bpf.h                 |  4 ++
 9 files changed, 138 insertions(+), 73 deletions(-)

-- 
1.9.3

^ permalink raw reply

* [PATCH iproute2 -master 1/3] tc, bpf: add new csum and tunnel signatures
From: Daniel Borkmann @ 2016-04-08 22:32 UTC (permalink / raw)
  To: stephen; +Cc: netdev, alexei.starovoitov, Daniel Borkmann
In-Reply-To: <cover.1460153854.git.daniel@iogearbox.net>

Add new signatures for BPF_FUNC_csum_diff, BPF_FUNC_skb_get_tunnel_opt
and BPF_FUNC_skb_set_tunnel_opt.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/bpf_api.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/bpf_api.h b/include/bpf_api.h
index 4b16d25..0f278f0 100644
--- a/include/bpf_api.h
+++ b/include/bpf_api.h
@@ -212,6 +212,8 @@ static int BPF_FUNC(l3_csum_replace, struct __sk_buff *skb, uint32_t off,
 		    uint32_t from, uint32_t to, uint32_t flags);
 static int BPF_FUNC(l4_csum_replace, struct __sk_buff *skb, uint32_t off,
 		    uint32_t from, uint32_t to, uint32_t flags);
+static int BPF_FUNC(csum_diff, const void *from, uint32_t from_size,
+		    const void *to, uint32_t to_size, uint32_t seed);
 
 /* Packet vlan encap/decap */
 static int BPF_FUNC(skb_vlan_push, struct __sk_buff *skb, uint16_t proto,
@@ -225,6 +227,11 @@ static int BPF_FUNC(skb_set_tunnel_key, struct __sk_buff *skb,
 		    const struct bpf_tunnel_key *from, uint32_t size,
 		    uint32_t flags);
 
+static int BPF_FUNC(skb_get_tunnel_opt, struct __sk_buff *skb,
+		    void *to, uint32_t size);
+static int BPF_FUNC(skb_set_tunnel_opt, struct __sk_buff *skb,
+		    const void *from, uint32_t size);
+
 /** LLVM built-ins, mem*() routines work for constant size */
 
 #ifndef lock_xadd
-- 
1.9.3

^ permalink raw reply related

* [PATCH iproute2 -master 3/3] tc, bpf: add support for map pre/allocation
From: Daniel Borkmann @ 2016-04-08 22:32 UTC (permalink / raw)
  To: stephen; +Cc: netdev, alexei.starovoitov, Daniel Borkmann
In-Reply-To: <cover.1460153854.git.daniel@iogearbox.net>

Follow-up to kernel commit 6c9059817432 ("bpf: pre-allocate hash map
elements"). Add flags support, so that we can pass in BPF_F_NO_PREALLOC
flag for disallowing preallocation. Update examples accordingly and also
remove the BPF_* map helper macros from them as they were not very useful.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 examples/bpf/bpf_cyclic.c   |  9 ++++++++-
 examples/bpf/bpf_graft.c    |  8 +++++++-
 examples/bpf/bpf_prog.c     |  2 ++
 examples/bpf/bpf_shared.c   |  8 +++++++-
 examples/bpf/bpf_tailcall.c | 29 +++++++++++++++++++++++++----
 include/bpf_api.h           | 45 ---------------------------------------------
 include/bpf_elf.h           |  1 +
 tc/tc_bpf.c                 | 16 ++++++++++++----
 8 files changed, 62 insertions(+), 56 deletions(-)

diff --git a/examples/bpf/bpf_cyclic.c b/examples/bpf/bpf_cyclic.c
index 36745a3..11d1c06 100644
--- a/examples/bpf/bpf_cyclic.c
+++ b/examples/bpf/bpf_cyclic.c
@@ -6,7 +6,14 @@
  */
 #define JMP_MAP_ID	0xabccba
 
-BPF_PROG_ARRAY(jmp_tc, JMP_MAP_ID, PIN_OBJECT_NS, 1);
+struct bpf_elf_map __section_maps jmp_tc = {
+	.type		= BPF_MAP_TYPE_PROG_ARRAY,
+	.id		= JMP_MAP_ID,
+	.size_key	= sizeof(uint32_t),
+	.size_value	= sizeof(uint32_t),
+	.pinning	= PIN_OBJECT_NS,
+	.max_elem	= 1,
+};
 
 __section_tail(JMP_MAP_ID, 0)
 int cls_loop(struct __sk_buff *skb)
diff --git a/examples/bpf/bpf_graft.c b/examples/bpf/bpf_graft.c
index 20784ff..07113d4 100644
--- a/examples/bpf/bpf_graft.c
+++ b/examples/bpf/bpf_graft.c
@@ -33,7 +33,13 @@
  *   [...]
  */
 
-BPF_PROG_ARRAY(jmp_tc, 0, PIN_GLOBAL_NS, 1);
+struct bpf_elf_map __section_maps jmp_tc = {
+	.type		= BPF_MAP_TYPE_PROG_ARRAY,
+	.size_key	= sizeof(uint32_t),
+	.size_value	= sizeof(uint32_t),
+	.pinning	= PIN_GLOBAL_NS,
+	.max_elem	= 1,
+};
 
 __section("aaa")
 int cls_aaa(struct __sk_buff *skb)
diff --git a/examples/bpf/bpf_prog.c b/examples/bpf/bpf_prog.c
index f15e876..d6caf37 100644
--- a/examples/bpf/bpf_prog.c
+++ b/examples/bpf/bpf_prog.c
@@ -192,6 +192,7 @@ struct bpf_elf_map __section("maps") map_proto = {
 	.size_key	=	sizeof(uint8_t),
 	.size_value	=	sizeof(struct count_tuple),
 	.max_elem	=	256,
+	.flags		=	BPF_F_NO_PREALLOC,
 };
 
 struct bpf_elf_map __section("maps") map_queue = {
@@ -200,6 +201,7 @@ struct bpf_elf_map __section("maps") map_queue = {
 	.size_key	=	sizeof(uint32_t),
 	.size_value	=	sizeof(struct count_queue),
 	.max_elem	=	1024,
+	.flags		=	BPF_F_NO_PREALLOC,
 };
 
 struct bpf_elf_map __section("maps") map_drops = {
diff --git a/examples/bpf/bpf_shared.c b/examples/bpf/bpf_shared.c
index 7fe9ef3..21fe6f1 100644
--- a/examples/bpf/bpf_shared.c
+++ b/examples/bpf/bpf_shared.c
@@ -18,7 +18,13 @@
  * instance is being created.
  */
 
-BPF_ARRAY4(map_sh, 0, PIN_OBJECT_NS, 1); /* or PIN_GLOBAL_NS, or PIN_NONE */
+struct bpf_elf_map __section_maps map_sh = {
+	.type		= BPF_MAP_TYPE_ARRAY,
+	.size_key	= sizeof(uint32_t),
+	.size_value	= sizeof(uint32_t),
+	.pinning	= PIN_OBJECT_NS, /* or PIN_GLOBAL_NS, or PIN_NONE */
+	.max_elem	= 1,
+};
 
 __section("egress")
 int emain(struct __sk_buff *skb)
diff --git a/examples/bpf/bpf_tailcall.c b/examples/bpf/bpf_tailcall.c
index f545430..1a30426 100644
--- a/examples/bpf/bpf_tailcall.c
+++ b/examples/bpf/bpf_tailcall.c
@@ -26,10 +26,31 @@
  * classifier behaviour.
  */
 
-BPF_PROG_ARRAY(jmp_tc, FOO, PIN_OBJECT_NS, MAX_JMP_SIZE);
-BPF_PROG_ARRAY(jmp_ex, BAR, PIN_OBJECT_NS, 1);
-
-BPF_ARRAY4(map_sh, 0, PIN_OBJECT_NS, 1);
+struct bpf_elf_map __section_maps jmp_tc = {
+	.type		= BPF_MAP_TYPE_PROG_ARRAY,
+	.id		= FOO,
+	.size_key	= sizeof(uint32_t),
+	.size_value	= sizeof(uint32_t),
+	.pinning	= PIN_OBJECT_NS,
+	.max_elem	= MAX_JMP_SIZE,
+};
+
+struct bpf_elf_map __section_maps jmp_ex = {
+	.type		= BPF_MAP_TYPE_PROG_ARRAY,
+	.id		= BAR,
+	.size_key	= sizeof(uint32_t),
+	.size_value	= sizeof(uint32_t),
+	.pinning	= PIN_OBJECT_NS,
+	.max_elem	= 1,
+};
+
+struct bpf_elf_map __section_maps map_sh = {
+	.type		= BPF_MAP_TYPE_ARRAY,
+	.size_key	= sizeof(uint32_t),
+	.size_value	= sizeof(uint32_t),
+	.pinning	= PIN_OBJECT_NS,
+	.max_elem	= 1,
+};
 
 __section_tail(FOO, ENTRY_0)
 int cls_case1(struct __sk_buff *skb)
diff --git a/include/bpf_api.h b/include/bpf_api.h
index 0f278f0..1b250d2 100644
--- a/include/bpf_api.h
+++ b/include/bpf_api.h
@@ -99,51 +99,6 @@
 	char ____license[] __section_license = NAME
 #endif
 
-#ifndef __BPF_MAP
-# define __BPF_MAP(NAME, TYPE, ID, SIZE_KEY, SIZE_VALUE, PIN, MAX_ELEM)	\
-	struct bpf_elf_map __section_maps NAME = {			\
-		.type		= (TYPE),				\
-		.id		= (ID),					\
-		.size_key	= (SIZE_KEY),				\
-		.size_value	= (SIZE_VALUE),				\
-		.pinning	= (PIN),				\
-		.max_elem	= (MAX_ELEM),				\
-	}
-#endif
-
-#ifndef BPF_HASH
-# define BPF_HASH(NAME, ID, SIZE_KEY, SIZE_VALUE, PIN, MAX_ELEM)	\
-	__BPF_MAP(NAME, BPF_MAP_TYPE_HASH, ID, SIZE_KEY, SIZE_VALUE,	\
-		  PIN, MAX_ELEM)
-#endif
-
-#ifndef BPF_ARRAY
-# define BPF_ARRAY(NAME, ID, SIZE_VALUE, PIN, MAX_ELEM)			\
-	__BPF_MAP(NAME, BPF_MAP_TYPE_ARRAY, ID, sizeof(uint32_t), 	\
-		  SIZE_VALUE, PIN, MAX_ELEM)
-#endif
-
-#ifndef BPF_ARRAY2
-# define BPF_ARRAY2(NAME, ID, PIN, MAX_ELEM)				\
-	BPF_ARRAY(NAME, ID, sizeof(uint16_t), PIN, MAX_ELEM)
-#endif
-
-#ifndef BPF_ARRAY4
-# define BPF_ARRAY4(NAME, ID, PIN, MAX_ELEM)				\
-	BPF_ARRAY(NAME, ID, sizeof(uint32_t), PIN, MAX_ELEM)
-#endif
-
-#ifndef BPF_ARRAY8
-# define BPF_ARRAY8(NAME, ID, PIN, MAX_ELEM)				\
-	BPF_ARRAY(NAME, ID, sizeof(uint64_t), PIN, MAX_ELEM)
-#endif
-
-#ifndef BPF_PROG_ARRAY
-# define BPF_PROG_ARRAY(NAME, ID, PIN, MAX_ELEM)			\
-	__BPF_MAP(NAME, BPF_MAP_TYPE_PROG_ARRAY, ID, sizeof(uint32_t),	\
-		  sizeof(uint32_t), PIN, MAX_ELEM)
-#endif
-
 /** Classifier helper */
 
 #ifndef BPF_H_DEFAULT
diff --git a/include/bpf_elf.h b/include/bpf_elf.h
index 31a8974..36cc988 100644
--- a/include/bpf_elf.h
+++ b/include/bpf_elf.h
@@ -32,6 +32,7 @@ struct bpf_elf_map {
 	__u32 size_key;
 	__u32 size_value;
 	__u32 max_elem;
+	__u32 flags;
 	__u32 id;
 	__u32 pinning;
 };
diff --git a/tc/tc_bpf.c b/tc/tc_bpf.c
index 0c59427..fe927ac 100644
--- a/tc/tc_bpf.c
+++ b/tc/tc_bpf.c
@@ -231,6 +231,9 @@ static void bpf_map_pin_report(const struct bpf_elf_map *pin,
 	if (obj->max_elem != pin->max_elem)
 		fprintf(stderr, " - Max elems:    %u (obj) != %u (pin)\n",
 			obj->max_elem, pin->max_elem);
+	if (obj->flags != pin->flags)
+		fprintf(stderr, " - Flags:        %#x (obj) != %#x (pin)\n",
+			obj->flags, pin->flags);
 
 	fprintf(stderr, "\n");
 }
@@ -261,6 +264,8 @@ static int bpf_map_selfcheck_pinned(int fd, const struct bpf_elf_map *map,
 			tmp.size_value = val;
 		else if (sscanf(buff, "max_entries:\t%u", &val) == 1)
 			tmp.max_elem = val;
+		else if (sscanf(buff, "map_flags:\t%i", &val) == 1)
+			tmp.flags = val;
 	}
 
 	fclose(fp);
@@ -796,8 +801,9 @@ static int bpf_log_realloc(struct bpf_elf_ctx *ctx)
 	return 0;
 }
 
-static int bpf_map_create(enum bpf_map_type type, unsigned int size_key,
-			  unsigned int size_value, unsigned int max_elem)
+static int bpf_map_create(enum bpf_map_type type, uint32_t size_key,
+			  uint32_t size_value, uint32_t max_elem,
+			  uint32_t flags)
 {
 	union bpf_attr attr;
 
@@ -806,6 +812,7 @@ static int bpf_map_create(enum bpf_map_type type, unsigned int size_key,
 	attr.key_size = size_key;
 	attr.value_size = size_value;
 	attr.max_entries = max_elem;
+	attr.map_flags = flags;
 
 	return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 }
@@ -1147,7 +1154,8 @@ static void bpf_map_report(int fd, const char *name,
 	fprintf(stderr, " - Pinning:      %u\n", map->pinning);
 	fprintf(stderr, " - Size key:     %u\n", map->size_key);
 	fprintf(stderr, " - Size value:   %u\n", map->size_value);
-	fprintf(stderr, " - Max elems:    %u\n\n", map->max_elem);
+	fprintf(stderr, " - Max elems:    %u\n", map->max_elem);
+	fprintf(stderr, " - Flags:        %#x\n\n", map->flags);
 }
 
 static int bpf_map_attach(const char *name, const struct bpf_elf_map *map,
@@ -1174,7 +1182,7 @@ static int bpf_map_attach(const char *name, const struct bpf_elf_map *map,
 
 	errno = 0;
 	fd = bpf_map_create(map->type, map->size_key, map->size_value,
-			    map->max_elem);
+			    map->max_elem, map->flags);
 	if (fd < 0 || ctx->verbose) {
 		bpf_map_report(fd, name, map, ctx);
 		if (fd < 0)
-- 
1.9.3

^ permalink raw reply related

* [PATCH iproute2 -master 2/3] tc, bpf: further improve error reporting
From: Daniel Borkmann @ 2016-04-08 22:32 UTC (permalink / raw)
  To: stephen; +Cc: netdev, alexei.starovoitov, Daniel Borkmann
In-Reply-To: <cover.1460153854.git.daniel@iogearbox.net>

Make it easier to spot issues when loading the object file fails. This
includes reporting in what pinned object specs differ, better indication
when we've reached instruction limits. Don't retry to load a non relo
program once we failed with bpf(2), and report out of bounds tail call key.

Also, add truncation of huge log outputs by default. Sometimes errors are
quite easy to spot by only looking at the tail of the verifier log, but
logs can get huge in size e.g. up to few MB (due to verifier checking all
possible program paths). Thus, by default limit output to the last 4096
bytes and indicate that it's truncated. For the full log, the verbose option
can be used.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tc/tc_bpf.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 tc/tc_bpf.h |  4 +++
 2 files changed, 69 insertions(+), 17 deletions(-)

diff --git a/tc/tc_bpf.c b/tc/tc_bpf.c
index d94af82..0c59427 100644
--- a/tc/tc_bpf.c
+++ b/tc/tc_bpf.c
@@ -184,7 +184,7 @@ static int bpf_ops_parse(int argc, char **argv, struct sock_filter *bpf_ops,
 	}
 
 	if (i != bpf_len) {
-		fprintf(stderr, "Parsed program length is less than encodedlength parameter!\n");
+		fprintf(stderr, "Parsed program length is less than encoded length parameter!\n");
 		ret = -EINVAL;
 		goto out;
 	}
@@ -214,6 +214,27 @@ void bpf_print_ops(FILE *f, struct rtattr *bpf_ops, __u16 len)
 		ops[i].jf, ops[i].k);
 }
 
+static void bpf_map_pin_report(const struct bpf_elf_map *pin,
+			       const struct bpf_elf_map *obj)
+{
+	fprintf(stderr, "Map specification differs from pinned file!\n");
+
+	if (obj->type != pin->type)
+		fprintf(stderr, " - Type:         %u (obj) != %u (pin)\n",
+			obj->type, pin->type);
+	if (obj->size_key != pin->size_key)
+		fprintf(stderr, " - Size key:     %u (obj) != %u (pin)\n",
+			obj->size_key, pin->size_key);
+	if (obj->size_value != pin->size_value)
+		fprintf(stderr, " - Size value:   %u (obj) != %u (pin)\n",
+			obj->size_value, pin->size_value);
+	if (obj->max_elem != pin->max_elem)
+		fprintf(stderr, " - Max elems:    %u (obj) != %u (pin)\n",
+			obj->max_elem, pin->max_elem);
+
+	fprintf(stderr, "\n");
+}
+
 static int bpf_map_selfcheck_pinned(int fd, const struct bpf_elf_map *map,
 				    int length)
 {
@@ -256,7 +277,7 @@ static int bpf_map_selfcheck_pinned(int fd, const struct bpf_elf_map *map,
 		if (!memcmp(&tmp, &zero, length))
 			return 0;
 
-		fprintf(stderr, "Map specs from pinned file differ!\n");
+		bpf_map_pin_report(&tmp, map);
 		return -EINVAL;
 	}
 }
@@ -735,7 +756,19 @@ bpf_dump_error(struct bpf_elf_ctx *ctx, const char *format, ...)
 	va_end(vl);
 
 	if (ctx->log && ctx->log[0]) {
-		fprintf(stderr, "%s\n", ctx->log);
+		if (ctx->verbose) {
+			fprintf(stderr, "%s\n", ctx->log);
+		} else {
+			unsigned int off = 0, len = strlen(ctx->log);
+
+			if (len > BPF_MAX_LOG) {
+				off = len - BPF_MAX_LOG;
+				fprintf(stderr, "Skipped %u bytes, use \'verb\' option for the full verbose log.\n[...]\n",
+					off);
+			}
+			fprintf(stderr, "%s\n", ctx->log + off);
+		}
+
 		memset(ctx->log, 0, ctx->log_size);
 	}
 }
@@ -1055,14 +1088,16 @@ static void bpf_prog_report(int fd, const char *section,
 			    const struct bpf_elf_prog *prog,
 			    struct bpf_elf_ctx *ctx)
 {
-	fprintf(stderr, "Prog section \'%s\' %s%s (%d)!\n", section,
+	unsigned int insns = prog->size / sizeof(struct bpf_insn);
+
+	fprintf(stderr, "\nProg section \'%s\' %s%s (%d)!\n", section,
 		fd < 0 ? "rejected: " : "loaded",
 		fd < 0 ? strerror(errno) : "",
 		fd < 0 ? errno : fd);
 
 	fprintf(stderr, " - Type:         %u\n", prog->type);
-	fprintf(stderr, " - Instructions: %zu\n",
-		prog->size / sizeof(struct bpf_insn));
+	fprintf(stderr, " - Instructions: %u (%u over limit)\n",
+		insns, insns > BPF_MAXINSNS ? insns - BPF_MAXINSNS : 0);
 	fprintf(stderr, " - License:      %s\n\n", prog->license);
 
 	bpf_dump_error(ctx, "Verifier analysis:\n\n");
@@ -1283,6 +1318,11 @@ static int bpf_fetch_strtab(struct bpf_elf_ctx *ctx, int section,
 	return 0;
 }
 
+static bool bpf_has_map_data(const struct bpf_elf_ctx *ctx)
+{
+	return ctx->sym_tab && ctx->str_tab && ctx->sec_maps;
+}
+
 static int bpf_fetch_ancillary(struct bpf_elf_ctx *ctx)
 {
 	struct bpf_elf_sec_data data;
@@ -1306,13 +1346,13 @@ static int bpf_fetch_ancillary(struct bpf_elf_ctx *ctx)
 			 !strcmp(data.sec_name, ".strtab"))
 			ret = bpf_fetch_strtab(ctx, i, &data);
 		if (ret < 0) {
-			fprintf(stderr, "Error parsing section %d! Perhapscheck with readelf -a?\n",
+			fprintf(stderr, "Error parsing section %d! Perhaps check with readelf -a?\n",
 				i);
 			break;
 		}
 	}
 
-	if (ctx->sym_tab && ctx->str_tab && ctx->sec_maps) {
+	if (bpf_has_map_data(ctx)) {
 		ret = bpf_maps_attach_all(ctx);
 		if (ret < 0) {
 			fprintf(stderr, "Error loading maps into kernel!\n");
@@ -1348,7 +1388,7 @@ static int bpf_fetch_prog(struct bpf_elf_ctx *ctx, const char *section)
 
 		fd = bpf_prog_attach(section, &prog, ctx);
 		if (fd < 0)
-			continue;
+			break;
 
 		ctx->sec_done[i] = true;
 		break;
@@ -1412,7 +1452,8 @@ static int bpf_apply_relo_data(struct bpf_elf_ctx *ctx,
 	return 0;
 }
 
-static int bpf_fetch_prog_relo(struct bpf_elf_ctx *ctx, const char *section)
+static int bpf_fetch_prog_relo(struct bpf_elf_ctx *ctx, const char *section,
+			       bool *lderr)
 {
 	struct bpf_elf_sec_data data_relo, data_insn;
 	struct bpf_elf_prog prog;
@@ -1442,8 +1483,10 @@ static int bpf_fetch_prog_relo(struct bpf_elf_ctx *ctx, const char *section)
 		prog.license = ctx->license;
 
 		fd = bpf_prog_attach(section, &prog, ctx);
-		if (fd < 0)
-			continue;
+		if (fd < 0) {
+			*lderr = true;
+			break;
+		}
 
 		ctx->sec_done[i]   = true;
 		ctx->sec_done[idx] = true;
@@ -1455,11 +1498,12 @@ static int bpf_fetch_prog_relo(struct bpf_elf_ctx *ctx, const char *section)
 
 static int bpf_fetch_prog_sec(struct bpf_elf_ctx *ctx, const char *section)
 {
+	bool lderr = false;
 	int ret = -1;
 
-	if (ctx->sym_tab)
-		ret = bpf_fetch_prog_relo(ctx, section);
-	if (ret < 0)
+	if (bpf_has_map_data(ctx))
+		ret = bpf_fetch_prog_relo(ctx, section, &lderr);
+	if (ret < 0 && !lderr)
 		ret = bpf_fetch_prog(ctx, section);
 
 	return ret;
@@ -1504,8 +1548,12 @@ static int bpf_fill_prog_arrays(struct bpf_elf_ctx *ctx)
 
 		ret = bpf_map_update(ctx->map_fds[idx], &key_id,
 				     &fd, BPF_ANY);
-		if (ret < 0)
-			return -ENOENT;
+		if (ret < 0) {
+			if (errno == E2BIG)
+				fprintf(stderr, "Tail call key %u for map %u out of bounds?\n",
+					key_id, map_id);
+			return -errno;
+		}
 
 		ctx->sec_done[i] = true;
 	}
diff --git a/tc/tc_bpf.h b/tc/tc_bpf.h
index 93f7f0e..30306de 100644
--- a/tc/tc_bpf.h
+++ b/tc/tc_bpf.h
@@ -33,6 +33,10 @@ enum {
 #define BPF_ENV_UDS	"TC_BPF_UDS"
 #define BPF_ENV_MNT	"TC_BPF_MNT"
 
+#ifndef BPF_MAX_LOG
+# define BPF_MAX_LOG	4096
+#endif
+
 #ifndef BPF_FS_MAGIC
 # define BPF_FS_MAGIC	0xcafe4a11
 #endif
-- 
1.9.3

^ permalink raw reply related

* Re: [PATCH V3] net: emac: emac gigabit ethernet controller driver
From: Bjorn Andersson @ 2016-04-08 22:43 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Andrew Lunn, Rob Herring, Gilad Avidov, netdev,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-arm-msm,
	Sagar Dharia, shankerd-sgV2jX0FEOL9JmXXK+q4OQ, Greg Kroah-Hartman,
	vikrams-sgV2jX0FEOL9JmXXK+q4OQ, Christopher Covington
In-Reply-To: <5708013F.90207-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>

On Fri, Apr 8, 2016 at 12:06 PM, Timur Tabi <timur-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org> wrote:
> Andrew Lunn wrote:
>
>> There are two different things here. One is configuring the pin to be
>> a GPIO. The second is using the GPIO as a GPIO. In this case,
>> bit-banging the MDIO bus.
>>
>> The firmware could be doing the configuration, setting the pin as a
>> GPIO. However, the firmware cannot be doing the MDIO bit-banging to
>> make an MDIO bus available. Linux has to do that.
>>
>> Or it could be we have all completely misunderstood the hardware, and
>> we are not doing bit-banging GPIO MDIO. There is a real MDIO
>> controller there, we don't use these pins as GPIOs, etc....
>
>
> Actually, I think there is a misunderstanding.
>
> On the FSM9900 SOC (which uses device-tree), the two pins that connect to
> the external PHY are gpio pins.  However, the driver needs to reprogram the
> pinmux so that those pins are wired to the Emac controller.  That's what the
> the gpio code in this driver is doing: it's just configuring the pins so
> that they connect directly between the Emac and the external PHY.  After
> that, they are no longer GPIO pins, and you cannot use the "GPIO controlled
> MDIO bus".  There is no MDIO controller on the SOC.  The external PHY is
> controlled directly from the Emac and also from the internal PHY.  It is
> screwy, I know, but that's what Gilad was trying to explain.
>

It sounds like you're trying to say that the pins used can be are
muxed as GPIO or MDIO, in the TLMM.

In the downstream kernel this is often seen with the drivers calling
gpio_request() to "reserve" said pins, but all you should do is
described the desired configuration and muxing in the pinctrl node,
reference that from your driver and simply ignore the fact that those
pins could have been used as GPIO pins.

Regards,
Bjorn
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH V3] net: emac: emac gigabit ethernet controller driver
From: Timur Tabi @ 2016-04-08 23:01 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: Andrew Lunn, Rob Herring, Gilad Avidov, netdev,
	linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
	linux-arm-msm, Sagar Dharia, shankerd, Greg Kroah-Hartman,
	vikrams, Christopher Covington
In-Reply-To: <CAOCOHw5R=PnkafRqx2eGKKm10eLgD4H3u7TjbbhdZMbqVR7LrA@mail.gmail.com>

Bjorn Andersson wrote:

> It sounds like you're trying to say that the pins used can be are
> muxed as GPIO or MDIO, in the TLMM.

I'm not 100% sure, but I think that's correct.  If you don't want to 
have normal networking, you could connect those external pins to some 
GPIO device (like an LED or whatever), and then configure the pin muxing 
for GPIO purposes.  But if that's true, it's only true on the FSM9900. 
On the QDF2432, those lines are not connected to the TLMM.  They are 
instead hard-wired to the Emac.

> In the downstream kernel this is often seen with the drivers calling
> gpio_request() to "reserve" said pins, but all you should do is
> described the desired configuration and muxing in the pinctrl node,
> reference that from your driver and simply ignore the fact that those
> pins could have been used as GPIO pins.

That makes sense, but I think the driver already does that.

https://patchwork.ozlabs.org/patch/561667/

Function emac_probe_resources() has a call to of_get_named_gpio().  And 
then emac_mac_up() calls gpio_request().  As far as I can tell, that's it.

I'm guessing that the of_get_named_gpio() call needs to be changed 
somehow, but I'm not sure how.

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation collaborative project.

^ permalink raw reply

* Re: [PATCH net-next v2] vxlan: synchronously and race-free destruction of vxlan sockets
From: Cong Wang @ 2016-04-08 23:24 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Linux Kernel Network Developers, Eric Dumazet, Jiri Benc,
	Marcelo Ricardo Leitner
In-Reply-To: <1460148901-23740-1-git-send-email-hannes@stressinduktion.org>

On Fri, Apr 8, 2016 at 1:55 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Due to the fact that the udp socket is destructed asynchronously in a
> work queue, we have some nondeterministic behavior during shutdown of
> vxlan tunnels and creating new ones. Fix this by keeping the destruction
> process synchronous in regards to the user space process so IFF_UP can
> be reliably set.
>
> udp_tunnel_sock_release destroys vs->sock->sk if reference counter
> indicates so. We expect to have the same lifetime of vxlan_sock and
> vxlan_sock->sock->sk even in fast paths with only rcu locks held. So
> only destruct the whole socket after we can be sure it cannot be found
> by searching vxlan_net->sock_list.
>

I am wondering what is the reason why we used work queue from
the beginning?

^ permalink raw reply

* Re: [PATCH V3] net: emac: emac gigabit ethernet controller driver
From: Bjorn Andersson @ 2016-04-08 23:25 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Andrew Lunn, Rob Herring, Gilad Avidov, netdev,
	linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
	linux-arm-msm, Sagar Dharia, shankerd, Greg Kroah-Hartman,
	vikrams, Christopher Covington
In-Reply-To: <57083834.2060102@codeaurora.org>

On Fri 08 Apr 16:01 PDT 2016, Timur Tabi wrote:

> Bjorn Andersson wrote:
> 
> >It sounds like you're trying to say that the pins used can be are
> >muxed as GPIO or MDIO, in the TLMM.
> 
> I'm not 100% sure, but I think that's correct.  If you don't want to have
> normal networking, you could connect those external pins to some GPIO device
> (like an LED or whatever), and then configure the pin muxing for GPIO
> purposes.  But if that's true, it's only true on the FSM9900. On the
> QDF2432, those lines are not connected to the TLMM.  They are instead
> hard-wired to the Emac.
> 

Then through proper use of the pinctrl framework you should configure
the FSM9900 to mux these pins appropriately and the two solutions are
equivalent.

> >In the downstream kernel this is often seen with the drivers calling
> >gpio_request() to "reserve" said pins, but all you should do is
> >described the desired configuration and muxing in the pinctrl node,
> >reference that from your driver and simply ignore the fact that those
> >pins could have been used as GPIO pins.
> 
> That makes sense, but I think the driver already does that.
> 
> https://patchwork.ozlabs.org/patch/561667/
> 
> Function emac_probe_resources() has a call to of_get_named_gpio().  And then
> emac_mac_up() calls gpio_request().  As far as I can tell, that's it.
> 
> I'm guessing that the of_get_named_gpio() call needs to be changed somehow,
> but I'm not sure how.
> 

Thanks for the link.

In short those call to the gpio framework should just be removed. They
should only be there if you're using the gpiolib to control the state of
those pins, and you're not as far as I can see.


The general outline of what you should have in your dts instead is:

soc {
	tlmm {
		compatible = "qcom,pinctrl-xyz";

		mdio_pins_a: mdio {
			state {
				pins = "gpio0", "gpio1";
				function = "mdio";
			};
		};
	};

	emac {
		compatible = "qcom,somthing-emac";

		pinctrl-names = "default";
		pinctrl-0 = <&mdio_pins_a>;
	};
};

Regards,
Bjorn

^ permalink raw reply

* Re: [PATCH net-next v2] vxlan: synchronously and race-free destruction of vxlan sockets
From: Hannes Frederic Sowa @ 2016-04-08 23:55 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, Eric Dumazet, Jiri Benc,
	Marcelo Ricardo Leitner, Stephen Hemminger
In-Reply-To: <CAM_iQpUbTz2dRS+ZTZSSL0AvkOb_2_HEfXo+ap9__aQZmajMjA@mail.gmail.com>



On Sat, Apr 9, 2016, at 01:24, Cong Wang wrote:
> On Fri, Apr 8, 2016 at 1:55 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> > Due to the fact that the udp socket is destructed asynchronously in a
> > work queue, we have some nondeterministic behavior during shutdown of
> > vxlan tunnels and creating new ones. Fix this by keeping the destruction
> > process synchronous in regards to the user space process so IFF_UP can
> > be reliably set.
> >
> > udp_tunnel_sock_release destroys vs->sock->sk if reference counter
> > indicates so. We expect to have the same lifetime of vxlan_sock and
> > vxlan_sock->sock->sk even in fast paths with only rcu locks held. So
> > only destruct the whole socket after we can be sure it cannot be found
> > by searching vxlan_net->sock_list.
> >
> 
> I am wondering what is the reason why we used work queue from
> the beginning?

I actually don't know. It was like that from the beginning. I cc'ed
Stephen, maybe he remembers?

Bye,
Hannes

^ permalink raw reply

* Re: How do I avoid recvmsg races with IP_RECVERR?
From: Andy Lutomirski @ 2016-04-09  0:02 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Andy Lutomirski, Network Development
In-Reply-To: <1433291591.3300318.285256449.713A7E1E@webmail.messagingengine.com>

On Tue, Jun 2, 2015 at 5:33 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On Wed, Jun 3, 2015, at 02:03, Andy Lutomirski wrote:
>> On Tue, Jun 2, 2015 at 2:50 PM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>> >> My proposal would be to make the error conversion lazy:
>> >>
>> >> Keeping duplicate data is not a good idea in general: So we shouldn't
>> >> use sk->sk_err if IP_RECVERR is set at all but let sock_error just use
>> >> the sk_error_queue and extract the error code from there.
>> >>
>> >> Only if IP_RECVERR was not set, we use sk->sk_err logic.
>> >>
>> >> What do you think?
>> >
>> > I just noticed that this will probably break existing user space
>> > applications which require that icmp errors are transient even with
>> > IP_RECVERR. We can mark that with a bit in the sk_error_queue pointer
>> > and xchg the pointer, hmmm....
>>
>> Do you mean to fix the race like this but to otherwise leave the
>> semantics
>> alone?  That would be an improvement, but it might be nice to also add
>> a non-crappy API for this, too.
>
> Yes, keep current semantics but fix the race you reported.
>
> I currently don't have good proposals for a decent API to handle this
> besides adding some ancillary cmsg data to msg_control. This still would
> not solve the problem fundamentally, as a -EFAULT/-EINVAL return value
> could also mean that msg_control should not be touched, thus we end up
> again relying on errno checking. :/ Thus checking error queue after
> receiving an error indications is my best hunch so far.
>
> Your proposal with MSG_IGNORE_ERROR seems reasonable so far for ping or
> udp, but I haven't fully grasped the TCP semantics of sk->sk_err, yet.

I was looking at this a bit, and I was thinking about adding a new
socket option, but I'm a bit vague on how all this fits together.

One option would be a socket option that simply causes sock_error to
return 0 (and change SO_ERROR to peek at sk_err directly).  But there
seem to be sock_error callers all over the place, and maybe this
change would cause problems.

Another option would be to add a socket option that explicitly turns
off everything that queues soft errors to sk_err.

I think that, for IP datagrams at least, the ideal semantics would be
for soft errors not to affect sk_err and for POLLERR to be set if the
error queue is nonempty.

--Andy

^ permalink raw reply

* RE: [PATCH -v2] drivers: net: ethernet: intel: e1000e: fix ethtool autoneg off for non-copper
From: Brown, Aaron F @ 2016-04-09  1:03 UTC (permalink / raw)
  To: Daniel Walker, Ruinskiy, Dima, Kirsher, Jeffrey T,
	Brandeburg, Jesse, Nelson, Shannon, Wyborny, Carolyn,
	Skidmore, Donald C, Allan, Bruce W, Ronciak, John,
	Williams, Mitch A
  Cc: Steve Shih, xe-kernel@external.cisco.com, Daniel Walker,
	intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <1459881004-13992-1-git-send-email-danielwa@cisco.com>

> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Daniel Walker
> Sent: Tuesday, April 5, 2016 11:30 AM
> To: Ruinskiy, Dima <dima.ruinskiy@intel.com>; Kirsher, Jeffrey T
> <jeffrey.t.kirsher@intel.com>; Brandeburg, Jesse
> <jesse.brandeburg@intel.com>; Nelson, Shannon
> <shannon.nelson@intel.com>; Wyborny, Carolyn
> <carolyn.wyborny@intel.com>; Skidmore, Donald C
> <donald.c.skidmore@intel.com>; Allan, Bruce W <bruce.w.allan@intel.com>;
> Ronciak, John <john.ronciak@intel.com>; Williams, Mitch A
> <mitch.a.williams@intel.com>
> Cc: Steve Shih <sshih@cisco.com>; xe-kernel@external.cisco.com; Daniel
> Walker <dwalker@fifo99.com>; intel-wired-lan@lists.osuosl.org;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH -v2] drivers: net: ethernet: intel: e1000e: fix ethtool autoneg
> off for non-copper
> 
> From: Steve Shih <sshih@cisco.com>
> 
> This patch fixes the issues for disabling auto-negotiation and forcing
> speed and duplex settings for the non-copper media.
> 
> For non-copper media, e1000_get_settings should return
> ETH_TP_MDI_INVALID for
> eth_tp_mdix_ctrl instead of ETH_TP_MDI_AUTO so subsequent
> e1000_set_settings
> call would not fail with -EOPNOTSUPP.
> 
> e1000_set_spd_dplx should not automatically turn autoneg back on for
> forced
> 1000 Mbps full duplex settings for non-copper media.
> 
> Cc: xe-kernel@external.cisco.com
> Cc: Daniel Walker <dwalker@fifo99.com>
> Signed-off-by: Steve Shih <sshih@cisco.com>
> ---
>  drivers/net/ethernet/intel/e1000e/ethtool.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)

Tested-by: Aaron Brown <aaron.f.brown@intel.com>

^ permalink raw reply

* Re: [PATCH net-next] net: bcmgenet: add BQL support
From: Petri Gynther @ 2016-04-09  1:39 UTC (permalink / raw)
  To: David Miller, Eric Dumazet; +Cc: netdev, Florian Fainelli, opendmb, Jaedon Shin
In-Reply-To: <20160408.163648.2055481752216023669.davem@davemloft.net>

On Fri, Apr 8, 2016 at 1:36 PM, David Miller <davem@davemloft.net> wrote:
> From: Petri Gynther <pgynther@google.com>
> Date: Tue,  5 Apr 2016 17:50:01 -0700
>
>> Add Byte Queue Limits (BQL) support to bcmgenet driver.
>>
>> Signed-off-by: Petri Gynther <pgynther@google.com>
>
> As Eric Dumazet indicated, your ->ndo_init() code to reset the queues is
> probably not necessary at all.

I added the netdev_tx_reset_queue(txq) calls to ndo_open() path:
netdev->ndo_open()
  bcmgenet_open()
    bcmgenet_netif_start()
      for all Tx queues:
        netdev_tx_reset_queue(txq)
          clear __QUEUE_STATE_STACK_XOFF
          dql_reset()
      netif_tx_start_all_queues(dev)
        for all Tx queues:
          clear __QUEUE_STATE_DRV_XOFF

So, I think the call to netdev_tx_reset_queue(txq) is in the right
place. It ensures that the Tx queue state is clean when the device is
opened.

^ permalink raw reply

* Re: [PATCH net-next] net: bcmgenet: add BQL support
From: Eric Dumazet @ 2016-04-09  1:56 UTC (permalink / raw)
  To: Petri Gynther
  Cc: David Miller, netdev, Florian Fainelli, opendmb, Jaedon Shin
In-Reply-To: <CAGXr9JE8Qp-u4gjm08oB8nXDoQxFM12=Da2=MZbWcZrRXOCi_w@mail.gmail.com>

On Fri, 2016-04-08 at 18:39 -0700, Petri Gynther wrote:
> On Fri, Apr 8, 2016 at 1:36 PM, David Miller <davem@davemloft.net> wrote:
> > From: Petri Gynther <pgynther@google.com>
> > Date: Tue,  5 Apr 2016 17:50:01 -0700
> >
> >> Add Byte Queue Limits (BQL) support to bcmgenet driver.
> >>
> >> Signed-off-by: Petri Gynther <pgynther@google.com>
> >
> > As Eric Dumazet indicated, your ->ndo_init() code to reset the queues is
> > probably not necessary at all.
> 
> I added the netdev_tx_reset_queue(txq) calls to ndo_open() path:
> netdev->ndo_open()
>   bcmgenet_open()
>     bcmgenet_netif_start()
>       for all Tx queues:
>         netdev_tx_reset_queue(txq)
>           clear __QUEUE_STATE_STACK_XOFF
>           dql_reset()
>       netif_tx_start_all_queues(dev)
>         for all Tx queues:
>           clear __QUEUE_STATE_DRV_XOFF
> 
> So, I think the call to netdev_tx_reset_queue(txq) is in the right
> place. It ensures that the Tx queue state is clean when the device is
> opened.


The netdev_tx_reset_queue(txq) calls are only needed in exceptional
conditions.

Not at device start, as the core networking layer init all txq
(including their BQL state) properly before giving them to drivers for
use.

For example, tg3 calls netdev_tx_reset_queue() only when freeing tx
rings, as it might have freed skb(s) not from normal TX complete path
and thus missed appropriate dql_completed().

If you believe BQL drivers need a fix, please elaborate ?

Thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: bcmgenet: add BQL support
From: Alexander Duyck @ 2016-04-09  2:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Petri Gynther, David Miller, netdev, Florian Fainelli, opendmb,
	Jaedon Shin
In-Reply-To: <1460166979.6473.451.camel@edumazet-glaptop3.roam.corp.google.com>

On Fri, Apr 8, 2016 at 6:56 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2016-04-08 at 18:39 -0700, Petri Gynther wrote:
>> On Fri, Apr 8, 2016 at 1:36 PM, David Miller <davem@davemloft.net> wrote:
>> > From: Petri Gynther <pgynther@google.com>
>> > Date: Tue,  5 Apr 2016 17:50:01 -0700
>> >
>> >> Add Byte Queue Limits (BQL) support to bcmgenet driver.
>> >>
>> >> Signed-off-by: Petri Gynther <pgynther@google.com>
>> >
>> > As Eric Dumazet indicated, your ->ndo_init() code to reset the queues is
>> > probably not necessary at all.
>>
>> I added the netdev_tx_reset_queue(txq) calls to ndo_open() path:
>> netdev->ndo_open()
>>   bcmgenet_open()
>>     bcmgenet_netif_start()
>>       for all Tx queues:
>>         netdev_tx_reset_queue(txq)
>>           clear __QUEUE_STATE_STACK_XOFF
>>           dql_reset()
>>       netif_tx_start_all_queues(dev)
>>         for all Tx queues:
>>           clear __QUEUE_STATE_DRV_XOFF
>>
>> So, I think the call to netdev_tx_reset_queue(txq) is in the right
>> place. It ensures that the Tx queue state is clean when the device is
>> opened.
>
>
> The netdev_tx_reset_queue(txq) calls are only needed in exceptional
> conditions.
>
> Not at device start, as the core networking layer init all txq
> (including their BQL state) properly before giving them to drivers for
> use.
>
> For example, tg3 calls netdev_tx_reset_queue() only when freeing tx
> rings, as it might have freed skb(s) not from normal TX complete path
> and thus missed appropriate dql_completed().
>
> If you believe BQL drivers need a fix, please elaborate ?
>
> Thanks.

For a bit of history on why you might want to do the reset on clean-up
instead of init you might take a look at commit dad8a3b3eaa0 ("igb,
ixgbe: netdev_tx_reset_queue incorrectly called from tx init path").

Basically you want to make certain you flush the queues after bringing
the interface down so that you don't possibly trigger any false hangs
for having stalled queues.  Basically the rule with the BQL stuff is
you need to leave the Tx queue in the state you found it in instead of
just wiping it and making use of it and putting it away dirty.

- Alex

^ permalink raw reply

* Re: [PATCH net-next] net: bcmgenet: add BQL support
From: Petri Gynther @ 2016-04-09  4:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Florian Fainelli, opendmb, Jaedon Shin
In-Reply-To: <1460166979.6473.451.camel@edumazet-glaptop3.roam.corp.google.com>

On Fri, Apr 8, 2016 at 6:56 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2016-04-08 at 18:39 -0700, Petri Gynther wrote:
>> On Fri, Apr 8, 2016 at 1:36 PM, David Miller <davem@davemloft.net> wrote:
>> > From: Petri Gynther <pgynther@google.com>
>> > Date: Tue,  5 Apr 2016 17:50:01 -0700
>> >
>> >> Add Byte Queue Limits (BQL) support to bcmgenet driver.
>> >>
>> >> Signed-off-by: Petri Gynther <pgynther@google.com>
>> >
>> > As Eric Dumazet indicated, your ->ndo_init() code to reset the queues is
>> > probably not necessary at all.
>>
>> I added the netdev_tx_reset_queue(txq) calls to ndo_open() path:
>> netdev->ndo_open()
>>   bcmgenet_open()
>>     bcmgenet_netif_start()
>>       for all Tx queues:
>>         netdev_tx_reset_queue(txq)
>>           clear __QUEUE_STATE_STACK_XOFF
>>           dql_reset()
>>       netif_tx_start_all_queues(dev)
>>         for all Tx queues:
>>           clear __QUEUE_STATE_DRV_XOFF
>>
>> So, I think the call to netdev_tx_reset_queue(txq) is in the right
>> place. It ensures that the Tx queue state is clean when the device is
>> opened.
>
>
> The netdev_tx_reset_queue(txq) calls are only needed in exceptional
> conditions.
>
> Not at device start, as the core networking layer init all txq
> (including their BQL state) properly before giving them to drivers for
> use.
>

What values does the networking core program into BQL dynamic limits
that my code in netdev->ndo_open() would wipe out?

You mentioned the queue init path:
netdev_init_one_queue() -> dql_init() -> dql_reset()

that is called when the netdev is created and Tx queues allocated.

But, does the networking core somewhere set *different* values for BQL
dynamic limits than what dql_reset() did, before opening the device?

> For example, tg3 calls netdev_tx_reset_queue() only when freeing tx
> rings, as it might have freed skb(s) not from normal TX complete path
> and thus missed appropriate dql_completed().
>

Looking at the tg3 driver, it calls:
tg3_stop()
  tg3_free_rings()
    netdev_tx_reset_queue()

netdev_tx_reset_queue() is called unconditionally, as long as the Tx
ring exists. So "ip link set dev eth<x> down" would cause it to be
called.

Why is it OK to call netdev_tx_reset_queue() from the
netdev->ndo_stop() path, but not from netdev->ndo_open() path?

> If you believe BQL drivers need a fix, please elaborate ?
>
> Thanks.
>
>

^ permalink raw reply

* Re: [PATCH net-next] ibmvnic: Enable use of multiple tx/rx scrqs
From: David Miller @ 2016-04-09  4:24 UTC (permalink / raw)
  To: jallen; +Cc: tlfalcon, netdev, linuxppc-dev
In-Reply-To: <57053E33.6020706@linux.vnet.ibm.com>

From: John Allen <jallen@linux.vnet.ibm.com>
Date: Wed, 6 Apr 2016 11:49:55 -0500

> Enables the use of multiple transmit and receive scrqs allowing the ibmvnic
> driver to take advantage of multiqueue functionality. To achieve this, the
> driver must implement the process of negotiating the maximum number of
> queues allowed by the server. Initially, the driver will attempt to login
> with the maximum number of tx and rx queues supported by the server. If
> the server fails to allocate the requested number of scrqs, it will return
> partial success in the login response. In this case, we must reinitiate
> the login process from the request capabilities stage and attempt to login
> requesting fewer scrqs.
> 
> Signed-off-by: John Allen <jallen@linux.vnet.ibm.com>

Applied, thanks.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox