Netdev List
 help / color / mirror / Atom feed
* [net PATCH v2 1/2] GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
From: Alexander Duyck @ 2016-04-04 16:28 UTC (permalink / raw)
  To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem
In-Reply-To: <20160404162545.14332.653.stgit@localhost.localdomain>

This patch fixes an issue I found in which we were dropping frames if we
had enabled checksums on GRE headers that were encapsulated by either FOU
or GUE.  Without this patch I was barely able to get 1 Gb/s of throughput.
With this patch applied I am now at least getting around 6 Gb/s.

The issue is due to the fact that with FOU or GUE applied we do not provide
a transport offset pointing to the GRE header, nor do we offload it in
software as the GRE header is completely skipped by GSO and treated like a
VXLAN or GENEVE type header.  As such we need to prevent the stack from
generating it and also prevent GRE from generating it via any interface we
create.

Fixes: c3483384ee511 ("gro: Allow tunnel stacking in the case of FOU/GUE")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 include/linux/netdevice.h |    5 ++++-
 net/core/dev.c            |    1 +
 net/ipv4/fou.c            |    6 ++++++
 net/ipv4/gre_offload.c    |    8 ++++++++
 net/ipv4/ip_gre.c         |   13 ++++++++++---
 5 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cb0d5d09c2e4..8395308a2445 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2120,7 +2120,10 @@ struct napi_gro_cb {
 	/* Used in foo-over-udp, set in udp[46]_gro_receive */
 	u8	is_ipv6:1;
 
-	/* 7 bit hole */
+	/* Used in GRE, set in fou/gue_gro_receive */
+	u8	is_fou:1;
+
+	/* 6 bit hole */
 
 	/* used to support CHECKSUM_COMPLETE for tunneling protocols */
 	__wsum	csum;
diff --git a/net/core/dev.c b/net/core/dev.c
index b9bcbe77d913..77a71cd68535 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4439,6 +4439,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 		NAPI_GRO_CB(skb)->flush = 0;
 		NAPI_GRO_CB(skb)->free = 0;
 		NAPI_GRO_CB(skb)->encap_mark = 0;
+		NAPI_GRO_CB(skb)->is_fou = 0;
 		NAPI_GRO_CB(skb)->gro_remcsum_start = 0;
 
 		/* Setup for GRO checksum validation */
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 5a94aea280d3..a39068b4a4d9 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -203,6 +203,9 @@ static struct sk_buff **fou_gro_receive(struct sk_buff **head,
 	 */
 	NAPI_GRO_CB(skb)->encap_mark = 0;
 
+	/* Flag this frame as already having an outer encap header */
+	NAPI_GRO_CB(skb)->is_fou = 1;
+
 	rcu_read_lock();
 	offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
 	ops = rcu_dereference(offloads[proto]);
@@ -368,6 +371,9 @@ static struct sk_buff **gue_gro_receive(struct sk_buff **head,
 	 */
 	NAPI_GRO_CB(skb)->encap_mark = 0;
 
+	/* Flag this frame as already having an outer encap header */
+	NAPI_GRO_CB(skb)->is_fou = 1;
+
 	rcu_read_lock();
 	offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
 	ops = rcu_dereference(offloads[guehdr->proto_ctype]);
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index c47539d04b88..6a5bd4317866 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -150,6 +150,14 @@ static struct sk_buff **gre_gro_receive(struct sk_buff **head,
 	if ((greh->flags & ~(GRE_KEY|GRE_CSUM)) != 0)
 		goto out;
 
+	/* We can only support GRE_CSUM if we can track the location of
+	 * the GRE header.  In the case of FOU/GUE we cannot because the
+	 * outer UDP header displaces the GRE header leaving us in a state
+	 * of limbo.
+	 */
+	if ((greh->flags & GRE_CSUM) && NAPI_GRO_CB(skb)->is_fou)
+		goto out;
+
 	type = greh->protocol;
 
 	rcu_read_lock();
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 31936d387cfd..af5d1f38217f 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -862,9 +862,16 @@ static void __gre_tunnel_init(struct net_device *dev)
 	dev->hw_features	|= GRE_FEATURES;
 
 	if (!(tunnel->parms.o_flags & TUNNEL_SEQ)) {
-		/* TCP offload with GRE SEQ is not supported. */
-		dev->features    |= NETIF_F_GSO_SOFTWARE;
-		dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+		/* TCP offload with GRE SEQ is not supported, nor
+		 * can we support 2 levels of outer headers requiring
+		 * an update.
+		 */
+		if (!(tunnel->parms.o_flags & TUNNEL_CSUM) ||
+		    (tunnel->encap.type == TUNNEL_ENCAP_NONE)) {
+			dev->features    |= NETIF_F_GSO_SOFTWARE;
+			dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+		}
+
 		/* Can use a lockless transmit, unless we generate
 		 * output sequences
 		 */

^ permalink raw reply related

* [net PATCH v2 0/2] Fixes for GRO and GRE tunnels
From: Alexander Duyck @ 2016-04-04 16:27 UTC (permalink / raw)
  To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem

This pair of patches addresses a few issues I have discovered over the last
week or so concerning GRO and GRE tunnels.

The first patch addresses an item I called out as an issue with FOU/GUE
encapsulating GRE, and I finally had a chance to test it and verify that
the code concerning it was broken so I took the opportunity to fix it so
that we cannot generate a FOU/GUE frame that is encapsulating a GRE tunnel
with checksum while requesting TSO/GSO for the frame.

The second patch actually addresses something I realized was an issue if we
feed a tunnel through GRO and back out through GSO.  Specifically it was
possible for GRO to generate overlapping IPv4 ID ranges as the outer IP IDs
were being ignored for tunnels.  Ignoring the IP IDs like this should only
be valid if the DF bit is set.  This is normally the case for IPIP, SIT,
and GRE tunnels, but not so for UDP tunnels.  In the case that the DF bit
is not set we store off the fact that there was a delta from what we were
expecting and when we hit the inner-most header we validate the value as to
avoid generating a frame which could lead to an IP ID collision on packets
that could eventually be fragmented.  A side effect is that the inner-most
IP ID test is relaxed as well, but the worst case scenario is that we GRO a
frame with a throw-away ID sequence anyway so if anything segmenting such a
frame with the wrong IP IDs should have no negative effects.

v2: Updated RFC 6864 patch so that we only support one of two modes.  Either
    the IP ID is incrementing, or it is fixed at a specific value with DF bit
    set.

---

Alexander Duyck (2):
      GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
      ipv4/GRO: Make GRO conform to RFC 6864


 include/linux/netdevice.h |    5 ++++-
 net/core/dev.c            |    2 ++
 net/ipv4/af_inet.c        |   25 ++++++++++++++++++-------
 net/ipv4/fou.c            |    6 ++++++
 net/ipv4/gre_offload.c    |    8 ++++++++
 net/ipv4/ip_gre.c         |   13 ++++++++++---
 net/ipv6/ip6_offload.c    |    3 ---
 7 files changed, 48 insertions(+), 14 deletions(-)

^ permalink raw reply

* [PATCH RFC 5/5] ARM: dts: sun8i: Enable Ethernet controller on the Orange PI PC
From: Chen-Yu Tsai @ 2016-04-04 16:22 UTC (permalink / raw)
  To: Maxime Ripard, Florian Fainelli, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala
  Cc: Chen-Yu Tsai, netdev, linux-arm-kernel, linux-kernel, devicetree,
	LABBE Corentin
In-Reply-To: <1459786954-12649-1-git-send-email-wens@csie.org>

The Orange PI PC uses the H3's internal Ethernet PHY with the EMAC
Ethernet controller.

Set a proper address for the PHY and enable the EMAC.

Signed-off-by: Chen-Yu Tsai <wens@csie.org>
---

This patch depends on "ARM: dts: sun8i-h3: Add Ethernet controller device",
which uses an binding still in development.

Do not merge.

---
 arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts b/arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts
index daf50b9a6657..f01e10df812a 100644
--- a/arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts
+++ b/arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts
@@ -102,6 +102,20 @@
 	status = "okay";
 };
 
+&ephy {
+	allwinner,ephy-addr = <0x1>;
+};
+
+&emac {
+	phy = <&phy1>;
+	phy-mode = "mii";
+	status = "okay";
+
+	phy1: ethernet-phy@1 {
+		reg = <1>;
+	};
+};
+
 &ir {
 	pinctrl-names = "default";
 	pinctrl-0 = <&ir_pins_a>;
-- 
2.7.0

^ permalink raw reply related

* [PATCH RFC 4/5] ARM: dts: sun8i-h3: Add Ethernet controller device node to sun8i-h3.dtsi
From: Chen-Yu Tsai @ 2016-04-04 16:22 UTC (permalink / raw)
  To: Maxime Ripard, Florian Fainelli, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala
  Cc: devicetree, netdev, linux-kernel, Chen-Yu Tsai, LABBE Corentin,
	linux-arm-kernel
In-Reply-To: <1459786954-12649-1-git-send-email-wens@csie.org>

From: LABBE Corentin <clabbe.montjoie@gmail.com>

The Allwinner H3 SoC has a gigabit Ethernet controller.

Signed-off-by: LABBE Corentin <clabbe.montjoie@gmail.com>
[wens@csie.org: drop pinmux; update clocks/resets]
Signed-off-by: Chen-Yu Tsai <wens@csie.org>
---

This is not a stable binding. Do not merge.

---
 arch/arm/boot/dts/sun8i-h3.dtsi | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/arm/boot/dts/sun8i-h3.dtsi b/arch/arm/boot/dts/sun8i-h3.dtsi
index 9a28aeba9bc6..7749af6354bb 100644
--- a/arch/arm/boot/dts/sun8i-h3.dtsi
+++ b/arch/arm/boot/dts/sun8i-h3.dtsi
@@ -635,6 +635,18 @@
 			status = "disabled";
 		};
 
+		emac: ethernet@1c30000 {
+			compatible = "allwinner,sun8i-h3-emac";
+			reg = <0x01c30000 0x1000>;
+			interrupts = <GIC_SPI 82 IRQ_TYPE_LEVEL_HIGH>;
+			resets = <&ahb_rst 17>;
+			reset-names = "ahb";
+			clocks = <&bus_gates 17>, <&ephy>;
+			clock-names = "ahb", "tx";
+			#address-cells = <1>;
+			#size-cells = <0>;
+		};
+
 		gic: interrupt-controller@01c81000 {
 			compatible = "arm,cortex-a7-gic", "arm,cortex-a15-gic";
 			reg = <0x01c81000 0x1000>,
-- 
2.7.0

^ permalink raw reply related

* [PATCH RFC 3/5] ARM: dts: sun8i-h3: Add H3 Ethernet PHY device node to sun8i-h3.dtsi
From: Chen-Yu Tsai @ 2016-04-04 16:22 UTC (permalink / raw)
  To: Maxime Ripard, Florian Fainelli, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala
  Cc: Chen-Yu Tsai, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA, LABBE Corentin
In-Reply-To: <1459786954-12649-1-git-send-email-wens-jdAy2FN1RRM@public.gmane.org>

The Allwinner H3 SoC incorporates an Ethernet PHY, whose controls are
mapped to a system control register.

Signed-off-by: Chen-Yu Tsai <wens-jdAy2FN1RRM@public.gmane.org>
---
 arch/arm/boot/dts/sun8i-h3.dtsi | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/arm/boot/dts/sun8i-h3.dtsi b/arch/arm/boot/dts/sun8i-h3.dtsi
index 4a4926b0b0ed..9a28aeba9bc6 100644
--- a/arch/arm/boot/dts/sun8i-h3.dtsi
+++ b/arch/arm/boot/dts/sun8i-h3.dtsi
@@ -323,6 +323,15 @@
 		#size-cells = <1>;
 		ranges;
 
+		ephy: ethernet-phy@01c00030 {
+			compatible = "allwinner,sun8i-h3-ephy";
+			reg = <0x01c00030 0x4>;
+			clocks = <&bus_gates 128>;
+			resets = <&ahb_rst 66>;
+			#clock-cells = <0>;
+			clock-output-names = "emac_tx";
+		};
+
 		dma: dma-controller@01c02000 {
 			compatible = "allwinner,sun8i-h3-dma";
 			reg = <0x01c02000 0x1000>;
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH RFC 2/5] net: phy: sun8i-h3-ephy: Add driver for Allwinner H3 Ethernet PHY
From: Chen-Yu Tsai @ 2016-04-04 16:22 UTC (permalink / raw)
  To: Maxime Ripard, Florian Fainelli, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala
  Cc: Chen-Yu Tsai, netdev, linux-arm-kernel, linux-kernel, devicetree,
	LABBE Corentin
In-Reply-To: <1459786954-12649-1-git-send-email-wens@csie.org>

The Allwinner H3 SoC incorporates an Ethernet PHY. This is enabled and
configured through a memory mapped hardware register.

This same register also configures the MAC interface mode and TX clock
source. Also covered by the register, but not supported in this driver,
are TX/RX clock delay chains and inverters, and an RMII module.

Signed-off-by: Chen-Yu Tsai <wens@csie.org>
---
 drivers/net/phy/Kconfig         |   8 ++
 drivers/net/phy/Makefile        |   1 +
 drivers/net/phy/sun8i-h3-ephy.c | 266 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 275 insertions(+)
 create mode 100644 drivers/net/phy/sun8i-h3-ephy.c

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 6dad9a9c356c..f4b9eb6e9114 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -276,3 +276,11 @@ endif # PHYLIB
 config MICREL_KS8995MA
 	tristate "Micrel KS8995MA 5-ports 10/100 managed Ethernet switch"
 	depends on SPI
+
+config SUN8I_H3_EPHY
+	tristate "Allwinner sun8i H3 Ethernet PHY"
+	depends on OF && HAS_IOMEM && COMMON_CLK && RESET_CONTROLLER
+	depends on MACH_SUN8I || COMPILE_TEST
+	help
+	  This modules provides a driver for the internal Ethernet PHY
+	  and Ethernet MAC controls found in the Allwinner H3 SoC.
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index fcdbb9299fab..7cec3aceb518 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -44,3 +44,4 @@ obj-$(CONFIG_MDIO_MOXART)	+= mdio-moxart.o
 obj-$(CONFIG_MDIO_BCM_UNIMAC)	+= mdio-bcm-unimac.o
 obj-$(CONFIG_MICROCHIP_PHY)	+= microchip.o
 obj-$(CONFIG_MDIO_BCM_IPROC)	+= mdio-bcm-iproc.o
+obj-$(CONFIG_SUN8I_H3_EPHY)	+= sun8i-h3-ephy.o
diff --git a/drivers/net/phy/sun8i-h3-ephy.c b/drivers/net/phy/sun8i-h3-ephy.c
new file mode 100644
index 000000000000..bf2703bee024
--- /dev/null
+++ b/drivers/net/phy/sun8i-h3-ephy.c
@@ -0,0 +1,266 @@
+/*
+ * Allwinner sun8i H3 E(thernet) PHY driver
+ *
+ * Copyright (C) 2016 Chen-Yu Tsai <wens@csie.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/clk.h>
+#include <linux/clk-provider.h>
+#include <linux/err.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/reset.h>
+#include <linux/spinlock.h>
+
+#define REG_PHY_ADDR_SHIFT	20
+#define REG_PHY_ADDR_MASK	GENMASK(4, 0)
+#define REG_LED_POL		BIT(17)	/* 1: active low, 0: active high */
+#define REG_SHUTDOWN		BIT(16)	/* 1: shutdown, 0: power up */
+#define REG_PHY_SELECT		BIT(15) /* 1: internal PHY, 0: external PHY */
+
+#define REG_DEFAULT_VALUE	0x58000
+#define REG_DEFAULT_MASK	GENMASK(31, 15)
+
+#define REG_GPIT		2
+#define REG_ETCS_MASK		0x3
+#define SUN8I_H3_EMAC_PARENTS	2
+#define MII_PHY_CLK_RATE	25000000
+#define INT_TX_CLK_RATE		125000000
+#define MII_PHY_CLK_NAME	"mii_phy_tx"
+#define INT_TX_CLK_NAME		"int_tx"
+
+struct sun8i_h3_ephy {
+	struct device *dev;
+	void __iomem *reg;
+	struct clk *clk;
+	struct reset_control *reset;
+	struct clk *mac_clks[SUN8I_H3_EMAC_PARENTS + 1];
+};
+
+static DEFINE_SPINLOCK(sun8i_h3_ephy_lock);
+
+/* The EMAC clock/interface controls are much like those found on the A20,
+ * which is supported by sun7i-a20-gmac-clk. The H3 adds an RMII module,
+ * enabled by bit 13. Unfortunately vendor code and documentation don't
+ * explain  how it should work with the other bits, and no hardware
+ * actually uses it.
+ */
+static u32 sun8i_h3_emac_mux_table[SUN8I_H3_EMAC_PARENTS] = {
+	0x0,	/* Select TX clock from MII PHY */
+	0x2,	/* Select internal TX clock (for RGMII) */
+};
+
+static const char *sun8i_h3_emac_mux_parents[SUN8I_H3_EMAC_PARENTS] = {
+	MII_PHY_CLK_NAME,
+	INT_TX_CLK_NAME,
+};
+
+int sun8i_h3_ephy_register_clocks(struct sun8i_h3_ephy *phy)
+{
+	struct device *dev = phy->dev;
+	struct device_node *np = dev->of_node;
+	const char *clk_name;
+	struct clk_mux *mux;
+	struct clk_gate *gate;
+	int ret;
+
+	if (of_property_read_string(np, "clock-output-names", &clk_name)) {
+		dev_err(phy->dev, "No clock output name given\n");
+		return -EINVAL;
+	}
+
+	/* allocate mux and gate clock structs */
+	mux = devm_kzalloc(dev, sizeof(struct clk_mux), GFP_KERNEL);
+	if (!mux)
+		return -ENOMEM;
+
+	gate = devm_kzalloc(dev, sizeof(struct clk_gate), GFP_KERNEL);
+	if (!gate)
+		return -ENOMEM;
+
+	/* Register fixed clocks to use as parents */
+	phy->mac_clks[0] = clk_register_fixed_rate(dev, MII_PHY_CLK_NAME,
+						   NULL, 0, MII_PHY_CLK_RATE);
+	if (IS_ERR(phy->mac_clks[0]))
+		return PTR_ERR(phy->mac_clks[0]);
+
+	phy->mac_clks[1] = clk_register_fixed_rate(dev, INT_TX_CLK_NAME,
+						   NULL, 0, INT_TX_CLK_RATE);
+	if (IS_ERR(phy->mac_clks[1])) {
+		ret = PTR_ERR(phy->mac_clks[1]);
+		goto err_int_tx_clk;
+	}
+
+	/* set up gate and mux properties */
+	gate->reg = phy->reg;
+	gate->bit_idx = REG_GPIT;
+	gate->lock = &sun8i_h3_ephy_lock;
+	mux->reg = phy->reg;
+	mux->mask = REG_ETCS_MASK;
+	mux->table = sun8i_h3_emac_mux_table;
+	mux->lock = &sun8i_h3_ephy_lock;
+
+	phy->mac_clks[2] = clk_register_composite(dev, clk_name,
+						  sun8i_h3_emac_mux_parents,
+						  SUN8I_H3_EMAC_PARENTS,
+						  &mux->hw, &clk_mux_ops,
+						  NULL, NULL,
+						  &gate->hw, &clk_gate_ops,
+						  0);
+	if (IS_ERR(phy->mac_clks[2])) {
+		ret = PTR_ERR(phy->mac_clks[2]);
+		goto err_tx_clk;
+	}
+
+	ret = of_clk_add_provider(dev->of_node, of_clk_src_simple_get,
+				  phy->mac_clks[2]);
+	if (ret)
+		goto err_of_clk;
+
+	return 0;
+
+err_of_clk:
+	/* TODO: switch to clk_unregister_composite when it's available */
+	clk_unregister(phy->mac_clks[2]);
+err_tx_clk:
+	clk_unregister_fixed_rate(phy->mac_clks[1]);
+err_int_tx_clk:
+	clk_unregister_fixed_rate(phy->mac_clks[0]);
+
+	return ret;
+}
+
+static void sun8i_h3_ephy_init_phy(struct sun8i_h3_ephy *phy, u32 addr)
+{
+	u32 val = readl(phy->reg);
+
+	/* set default values, but don't touch clock settings */
+	val = (val & ~REG_DEFAULT_MASK) | REG_DEFAULT_VALUE;
+
+	if (addr < REG_PHY_ADDR_MASK) {
+		dev_info(phy->dev, "Using interal PHY at 0x%x\n", addr);
+
+		val |= addr << REG_PHY_ADDR_SHIFT;
+		val &= ~REG_SHUTDOWN;
+		val |= REG_PHY_SELECT;
+
+		if (of_property_read_bool(phy->dev->of_node,
+					  "allwinner,leds-active-low"))
+			val |= REG_LED_POL;
+	} else {
+		dev_info(phy->dev, "Using external PHY\n");
+	}
+
+	writel(val, phy->reg);
+}
+
+static void sun8i_h3_ephy_reset_phy(struct sun8i_h3_ephy *phy)
+{
+	u32 val;
+
+	val = readl(phy->reg);
+	val = (val & ~REG_DEFAULT_MASK) | REG_DEFAULT_VALUE;
+	writel(val, phy->reg);
+}
+
+int sun8i_h3_ephy_probe(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct device_node *np = dev->of_node;
+	struct sun8i_h3_ephy *phy;
+	struct resource *res;
+	u32 addr = REG_PHY_ADDR_MASK;
+	int ret;
+
+	ret = of_property_read_u32(np, "allwinner,ephy-addr", &addr);
+	if (!ret && addr > REG_PHY_ADDR_MASK) {
+		dev_err(dev, "invalid PHY address: 0x%x\n", addr);
+		return -EINVAL;
+	} else if (ret && ret != -EINVAL) {
+		dev_err(dev, "could not get PHY address: %d\n", ret);
+		return ret;
+	}
+
+	phy = devm_kzalloc(dev, sizeof(*phy), GFP_KERNEL);
+	if (!phy)
+		return -ENOMEM;
+
+	phy->dev = dev;
+	platform_set_drvdata(pdev, phy);
+
+	phy->clk = devm_clk_get(dev, NULL);
+	if (IS_ERR(phy->clk)) {
+		dev_err(dev, "failed to get clock\n");
+		return PTR_ERR(phy->clk);
+	}
+
+	phy->reset = devm_reset_control_get(dev, NULL);
+	if (IS_ERR(phy->reset)) {
+		dev_err(dev, "failed to get reset control\n");
+		return PTR_ERR(phy->reset);
+	}
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	phy->reg = devm_ioremap_resource(dev, res);
+	if (IS_ERR(phy->reg)) {
+		dev_err(dev, "failed to map registers\n");
+		return PTR_ERR(phy->reg);
+	}
+
+	ret = clk_prepare_enable(phy->clk);
+	if (ret) {
+		dev_err(dev, "failed to enable clock\n");
+		return ret;
+	}
+
+	ret = reset_control_deassert(phy->reset);
+	if (ret) {
+		dev_err(dev, "failed to deassert reset control\n");
+		goto err_reset;
+	}
+
+	sun8i_h3_ephy_init_phy(phy, addr);
+
+	ret = sun8i_h3_ephy_register_clocks(phy);
+	if (ret)
+		goto err_tx_clk;
+
+	return 0;
+
+err_tx_clk:
+	sun8i_h3_ephy_reset_phy(phy);
+	reset_control_assert(phy->reset);
+err_reset:
+	clk_disable_unprepare(phy->clk);
+	return ret;
+}
+
+static const struct of_device_id sun8i_h3_ephy_of_match[] = {
+	{ .compatible = "allwinner,sun8i-h3-ephy", },
+	{ },
+};
+MODULE_DEVICE_TABLE(of, sun8i_h3_ephy_of_match);
+
+static struct platform_driver sun8i_h3_ephy_driver = {
+	.probe	= sun8i_h3_ephy_probe,
+	.driver = {
+		.of_match_table	= sun8i_h3_ephy_of_match,
+		.name  = "sun8i-h3-ephy",
+	}
+};
+module_platform_driver(sun8i_h3_ephy_driver);
+
+MODULE_DESCRIPTION("Allwinner sun8i H3 Ethernet PHY driver");
+MODULE_AUTHOR("Chen-Yu Tsai <wens@csie.org>");
+MODULE_LICENSE("GPL v2");
-- 
2.7.0

^ permalink raw reply related

* [PATCH RFC 1/5] net: phy: sun8i-h3-ephy: Add bindings for Allwinner H3 Ethernet PHY
From: Chen-Yu Tsai @ 2016-04-04 16:22 UTC (permalink / raw)
  To: Maxime Ripard, Florian Fainelli, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala
  Cc: Chen-Yu Tsai, netdev, linux-arm-kernel, linux-kernel, devicetree,
	LABBE Corentin
In-Reply-To: <1459786954-12649-1-git-send-email-wens@csie.org>

The Allwinner H3 SoC incorporates an Ethernet PHY. This is enabled and
configured through a memory mapped hardware register.

This same register also configures the MAC interface mode and TX clock
source. Also covered by the register, but not supported in these bindings,
are TX/RX clock delay chains and inverters, and an RMII module.

Signed-off-by: Chen-Yu Tsai <wens@csie.org>
---
 .../bindings/net/allwinner,sun8i-h3-ephy.txt       | 44 ++++++++++++++++++++++
 1 file changed, 44 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt

diff --git a/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt b/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt
new file mode 100644
index 000000000000..146f227e6d88
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt
@@ -0,0 +1,44 @@
+* Allwinner H3 E(thernet) PHY
+
+The Allwinner H3 integrates an MII ethernet PHY. As with external PHYs,
+before it can be configured over the MDIO bus and used, certain hardware
+features must be configured, such as the PHY address and LED polarity.
+The PHY must also be powered on and brought out of reset.
+
+This is accomplished with regulators and pull-up/downs for external PHYs.
+For this internal PHY, a hardware register is programmed.
+
+The same hardware register also contains clock and interface controls
+for the MAC. This is also present in earlier SoCs, and is covered by
+"allwinner,sun7i-a20-gmac-clk". The controls in the H3 are slightly
+different due to the inclusion of what appears to be an RMII-MII
+bridge.
+
+Required properties:
+- compatible: should be "allwinner,sun8i-h3-ephy"
+- reg: address and length of the register set for the device
+- clocks: A phandle to the reference clock for this device
+- resets: A phandle to the reset control for this device
+
+Ethernet PHY related properties:
+- allwinner,ephy-addr: the MDIO bus address the PHY should respond to.
+		       If this is not set, the external PHY is used, and
+		       everything else in this section is ignored.
+- allwinner,leds-active-low: LEDs are active low. Without this, LEDs are
+			     active high.
+
+Ethernet MAC clock related properties:
+- #clock-cells: should be 0
+- clock-output-names: "mac_tx"
+
+Example:
+
+ethernet-phy@01c00030 {
+	compatible = "allwinner,sun8i-h3-ephy";
+	reg = <0x01c00030 0x4>;
+	clocks = <&bus_gates 128>;
+	resets = <&ahb_rst 66>;
+	allwinner,ephy-addr = <0x1>;
+	#clock-cells = <0>;
+	clock-output-names = "mac_tx";
+};
-- 
2.7.0

^ permalink raw reply related

* [PATCH RFC 0/5] net: phy: sun8i-h3-ephy: Add Allwinner H3 Ethernet PHY driver
From: Chen-Yu Tsai @ 2016-04-04 16:22 UTC (permalink / raw)
  To: Maxime Ripard, Florian Fainelli, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala
  Cc: Chen-Yu Tsai, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA, LABBE Corentin

Hi everyone,

This is an attempt to support the Ethernet PHY incorporated in Allwinner's
H3 SoC. It is a MII PHY, supporting 10/100 Mbps Ethernet.

Before the PHY can be accessed via MDIO and used, it needs to be powered
on and configured. This is done via a system control register in the CPU
address space. The same register also controls PHY interface mode and
MAC TX clock sources.

This driver brings up the Ethernet PHY (if it is used), and exports a
clock control for the MAC TX clock.

Patch 1 adds the bindings for this piece of hardware.

Patch 2 adds the driver for it.

Patch 3 adds a device node for the PHY.

Patch 4 & 5 are not to be merged. They are provided here as a solid
example of how this driver is used. The sun8i-emac bindings have not
been submitted and are not stable yet.

Comments welcome.

Regards
ChenYu


Chen-Yu Tsai (4):
  net: phy: sun8i-h3-ephy: Add bindings for Allwinner H3 Ethernet PHY
  net: phy: sun8i-h3-ephy: Add driver for Allwinner H3 Ethernet PHY
  ARM: dts: sun8i-h3: Add H3 Ethernet PHY device node to sun8i-h3.dtsi
  ARM: dts: sun8i: Enable Ethernet controller on the Orange PI PC

LABBE Corentin (1):
  ARM: dts: sun8i-h3: Add Ethernet controller device node to
    sun8i-h3.dtsi

 .../bindings/net/allwinner,sun8i-h3-ephy.txt       |  44 ++++
 arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts         |  14 ++
 arch/arm/boot/dts/sun8i-h3.dtsi                    |  21 ++
 drivers/net/phy/Kconfig                            |   8 +
 drivers/net/phy/Makefile                           |   1 +
 drivers/net/phy/sun8i-h3-ephy.c                    | 266 +++++++++++++++++++++
 6 files changed, 354 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt
 create mode 100644 drivers/net/phy/sun8i-h3-ephy.c

-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Brenden Blanco @ 2016-04-04 16:17 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jesper Dangaard Brouer, Tom Herbert, Daniel Borkmann,
	David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, ogerlitz
In-Reply-To: <57029127.3040303@gmail.com>

On Mon, Apr 04, 2016 at 09:07:03AM -0700, John Fastabend wrote:
> On 16-04-04 08:29 AM, Brenden Blanco wrote:
> > On Mon, Apr 04, 2016 at 05:12:27PM +0200, Jesper Dangaard Brouer wrote:
> >> On Mon, 4 Apr 2016 11:09:57 -0300
> >> Tom Herbert <tom@herbertland.com> wrote:
> >>
> >>> On Mon, Apr 4, 2016 at 10:36 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>> On 04/04/2016 03:07 PM, Jesper Dangaard Brouer wrote:  
> >>>>>
> >>>>> On Mon, 04 Apr 2016 10:49:09 +0200 Daniel Borkmann <daniel@iogearbox.net>
> >>>>> wrote:  
> >>>>>>
> >>>>>> On 04/02/2016 03:21 AM, Brenden Blanco wrote:  
> >>>>>>>
> >>>>>>> Add a new bpf prog type that is intended to run in early stages of the
> >>>>>>> packet rx path. Only minimal packet metadata will be available, hence a
> >>>>>>> new
> >>>>>>> context type, struct xdp_metadata, is exposed to userspace. So far only
> >>>>>>> expose the readable packet length, and only in read mode.
> >>>>>>>
> >>>>>>> The PHYS_DEV name is chosen to represent that the program is meant only
> >>>>>>> for physical adapters, rather than all netdevs.
> >>>>>>>
> >>>>>>> While the user visible struct is new, the underlying context must be
> >>>>>>> implemented as a minimal skb in order for the packet load_* instructions
> >>>>>>> to work. The skb filled in by the driver must have skb->len, skb->head,
> >>>>>>> and skb->data set, and skb->data_len == 0.
> >>>>>>>  
> >>>>> [...]  
> >>>>>>
> >>>>>>
> >>>>>> Do you plan to support bpf_skb_load_bytes() as well? I like using
> >>>>>> this API especially when dealing with larger chunks (>4 bytes) to
> >>>>>> load into stack memory, plus content is kept in network byte order.
> >>>>>>
> >>>>>> What about other helpers such as bpf_skb_store_bytes() et al that
> >>>>>> work on skbs. Do you intent to reuse them as is and thus populate
> >>>>>> the per cpu skb with needed fields (faking linear data), or do you
> >>>>>> see larger obstacles that prevent for this?  
> >>>>>
> >>>>>
> >>>>> Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
> >>>>> to users of this API.
> >>>>>
> >>>>> The hole idea is that an SKB is NOT allocated yet, and not needed at
> >>>>> this level.  If we start supporting calling underlying SKB functions,
> >>>>> then we will end-up in the same place (performance wise).  
> >>>>
> >>>>
> >>>> I'm talking about the current skb-related BPF helper functions we have,
> >>>> so the question is how much from that code we have we can reuse under
> >>>> these constraints (obviously things like the tunnel helpers are a different
> >>>> story) and if that trade-off is acceptable for us. I'm also thinking
> >>>> that, for example, if you need to parse the packet data anyway for a drop
> >>>> verdict, you might as well pass some meta data (that is set in the real
> >>>> skb later on) for those packets that go up the stack.  
> >>>
> >>> Right, the meta data in this case is an abstracted receive descriptor.
> >>> This would include items that we get in a device receive descriptor
> >>> (computed checksum, hash, VLAN tag). This is purposely a small
> >>> restricted data structure. I'm hoping we can minimize the size of this
> >>> to not much more than 32 bytes (including pointers to data and
> >>> linkage).
> >>
> >> I agree.
> >>  
> >>> How this translates to skb to maintain compatibility is with BPF
> >>> interesting question. One other consideration is that skb's are kernel
> >>> specific, we should be able to use the same BPF filter program in
> >>> userspace over DPDK for instance-- so an skb interface as the packet
> >>> abstraction might not be the right model...
> >>
> >> I agree.  I don't think reusing the SKB data structure is the right
> >> model.  We should drop the SKB pointer from the API.
> >>
> >> As Tom also points out, making the BPF interface independent of the SKB
> >> meta-data structure, would also make the eBPF program more generally
> >> applicable.
> > The initial approach that I tried went down this path. Alexei advised
> > that I use the pseudo skb, and in the future the API between drivers and
> > bpf can change to adopt non-skb context. The only user facing ABIs in
> > this patchset are the IFLA, the xdp_metadata struct, and the name of the
> > new enum.
> > 
> > The reason to use a pseudo skb for now is that there will be a fair
> > amount of churn to get bpf jit and interpreter to understand non-skb
> > context in the bpf_load_pointer() code. I don't see the need for
> > requiring that for this patchset, as it will be internal-only change
> > if/when we use something else.
> 
> Another option would be to have per driver JIT code to patch up the
> skb read/loads with descriptor reads and metadata. From a strictly
> performance stand point it should be better than pseudo skbs.

I considered (and implemented) this as well, but there my problem was
that I needed to inform the bpf() syscall at BPF_PROG_LOAD time which
ifindex to look at for fixups, so I had to add a new ifindex field to
bpf_attr. Then during verification I had to use a new ndo to get the
driver-specific offsets for its particular descriptor format. It seemed
kludgy.
> 
> .John

^ permalink raw reply

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: John Fastabend @ 2016-04-04 16:07 UTC (permalink / raw)
  To: Brenden Blanco, Jesper Dangaard Brouer
  Cc: Tom Herbert, Daniel Borkmann, David S. Miller,
	Linux Kernel Network Developers, Alexei Starovoitov, ogerlitz
In-Reply-To: <20160404152948.GA495@gmail.com>

On 16-04-04 08:29 AM, Brenden Blanco wrote:
> On Mon, Apr 04, 2016 at 05:12:27PM +0200, Jesper Dangaard Brouer wrote:
>> On Mon, 4 Apr 2016 11:09:57 -0300
>> Tom Herbert <tom@herbertland.com> wrote:
>>
>>> On Mon, Apr 4, 2016 at 10:36 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>> On 04/04/2016 03:07 PM, Jesper Dangaard Brouer wrote:  
>>>>>
>>>>> On Mon, 04 Apr 2016 10:49:09 +0200 Daniel Borkmann <daniel@iogearbox.net>
>>>>> wrote:  
>>>>>>
>>>>>> On 04/02/2016 03:21 AM, Brenden Blanco wrote:  
>>>>>>>
>>>>>>> Add a new bpf prog type that is intended to run in early stages of the
>>>>>>> packet rx path. Only minimal packet metadata will be available, hence a
>>>>>>> new
>>>>>>> context type, struct xdp_metadata, is exposed to userspace. So far only
>>>>>>> expose the readable packet length, and only in read mode.
>>>>>>>
>>>>>>> The PHYS_DEV name is chosen to represent that the program is meant only
>>>>>>> for physical adapters, rather than all netdevs.
>>>>>>>
>>>>>>> While the user visible struct is new, the underlying context must be
>>>>>>> implemented as a minimal skb in order for the packet load_* instructions
>>>>>>> to work. The skb filled in by the driver must have skb->len, skb->head,
>>>>>>> and skb->data set, and skb->data_len == 0.
>>>>>>>  
>>>>> [...]  
>>>>>>
>>>>>>
>>>>>> Do you plan to support bpf_skb_load_bytes() as well? I like using
>>>>>> this API especially when dealing with larger chunks (>4 bytes) to
>>>>>> load into stack memory, plus content is kept in network byte order.
>>>>>>
>>>>>> What about other helpers such as bpf_skb_store_bytes() et al that
>>>>>> work on skbs. Do you intent to reuse them as is and thus populate
>>>>>> the per cpu skb with needed fields (faking linear data), or do you
>>>>>> see larger obstacles that prevent for this?  
>>>>>
>>>>>
>>>>> Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
>>>>> to users of this API.
>>>>>
>>>>> The hole idea is that an SKB is NOT allocated yet, and not needed at
>>>>> this level.  If we start supporting calling underlying SKB functions,
>>>>> then we will end-up in the same place (performance wise).  
>>>>
>>>>
>>>> I'm talking about the current skb-related BPF helper functions we have,
>>>> so the question is how much from that code we have we can reuse under
>>>> these constraints (obviously things like the tunnel helpers are a different
>>>> story) and if that trade-off is acceptable for us. I'm also thinking
>>>> that, for example, if you need to parse the packet data anyway for a drop
>>>> verdict, you might as well pass some meta data (that is set in the real
>>>> skb later on) for those packets that go up the stack.  
>>>
>>> Right, the meta data in this case is an abstracted receive descriptor.
>>> This would include items that we get in a device receive descriptor
>>> (computed checksum, hash, VLAN tag). This is purposely a small
>>> restricted data structure. I'm hoping we can minimize the size of this
>>> to not much more than 32 bytes (including pointers to data and
>>> linkage).
>>
>> I agree.
>>  
>>> How this translates to skb to maintain compatibility is with BPF
>>> interesting question. One other consideration is that skb's are kernel
>>> specific, we should be able to use the same BPF filter program in
>>> userspace over DPDK for instance-- so an skb interface as the packet
>>> abstraction might not be the right model...
>>
>> I agree.  I don't think reusing the SKB data structure is the right
>> model.  We should drop the SKB pointer from the API.
>>
>> As Tom also points out, making the BPF interface independent of the SKB
>> meta-data structure, would also make the eBPF program more generally
>> applicable.
> The initial approach that I tried went down this path. Alexei advised
> that I use the pseudo skb, and in the future the API between drivers and
> bpf can change to adopt non-skb context. The only user facing ABIs in
> this patchset are the IFLA, the xdp_metadata struct, and the name of the
> new enum.
> 
> The reason to use a pseudo skb for now is that there will be a fair
> amount of churn to get bpf jit and interpreter to understand non-skb
> context in the bpf_load_pointer() code. I don't see the need for
> requiring that for this patchset, as it will be internal-only change
> if/when we use something else.

Another option would be to have per driver JIT code to patch up the
skb read/loads with descriptor reads and metadata. From a strictly
performance stand point it should be better than pseudo skbs.

.John

^ permalink raw reply

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Brenden Blanco @ 2016-04-04 15:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tom Herbert, Daniel Borkmann, David S. Miller,
	Linux Kernel Network Developers, Alexei Starovoitov, ogerlitz,
	john fastabend
In-Reply-To: <20160404171227.1f862cb1@redhat.com>

On Mon, Apr 04, 2016 at 05:12:27PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 4 Apr 2016 11:09:57 -0300
> Tom Herbert <tom@herbertland.com> wrote:
> 
> > On Mon, Apr 4, 2016 at 10:36 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > On 04/04/2016 03:07 PM, Jesper Dangaard Brouer wrote:  
> > >>
> > >> On Mon, 04 Apr 2016 10:49:09 +0200 Daniel Borkmann <daniel@iogearbox.net>
> > >> wrote:  
> > >>>
> > >>> On 04/02/2016 03:21 AM, Brenden Blanco wrote:  
> > >>>>
> > >>>> Add a new bpf prog type that is intended to run in early stages of the
> > >>>> packet rx path. Only minimal packet metadata will be available, hence a
> > >>>> new
> > >>>> context type, struct xdp_metadata, is exposed to userspace. So far only
> > >>>> expose the readable packet length, and only in read mode.
> > >>>>
> > >>>> The PHYS_DEV name is chosen to represent that the program is meant only
> > >>>> for physical adapters, rather than all netdevs.
> > >>>>
> > >>>> While the user visible struct is new, the underlying context must be
> > >>>> implemented as a minimal skb in order for the packet load_* instructions
> > >>>> to work. The skb filled in by the driver must have skb->len, skb->head,
> > >>>> and skb->data set, and skb->data_len == 0.
> > >>>>  
> > >> [...]  
> > >>>
> > >>>
> > >>> Do you plan to support bpf_skb_load_bytes() as well? I like using
> > >>> this API especially when dealing with larger chunks (>4 bytes) to
> > >>> load into stack memory, plus content is kept in network byte order.
> > >>>
> > >>> What about other helpers such as bpf_skb_store_bytes() et al that
> > >>> work on skbs. Do you intent to reuse them as is and thus populate
> > >>> the per cpu skb with needed fields (faking linear data), or do you
> > >>> see larger obstacles that prevent for this?  
> > >>
> > >>
> > >> Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
> > >> to users of this API.
> > >>
> > >> The hole idea is that an SKB is NOT allocated yet, and not needed at
> > >> this level.  If we start supporting calling underlying SKB functions,
> > >> then we will end-up in the same place (performance wise).  
> > >
> > >
> > > I'm talking about the current skb-related BPF helper functions we have,
> > > so the question is how much from that code we have we can reuse under
> > > these constraints (obviously things like the tunnel helpers are a different
> > > story) and if that trade-off is acceptable for us. I'm also thinking
> > > that, for example, if you need to parse the packet data anyway for a drop
> > > verdict, you might as well pass some meta data (that is set in the real
> > > skb later on) for those packets that go up the stack.  
> > 
> > Right, the meta data in this case is an abstracted receive descriptor.
> > This would include items that we get in a device receive descriptor
> > (computed checksum, hash, VLAN tag). This is purposely a small
> > restricted data structure. I'm hoping we can minimize the size of this
> > to not much more than 32 bytes (including pointers to data and
> > linkage).
> 
> I agree.
>  
> > How this translates to skb to maintain compatibility is with BPF
> > interesting question. One other consideration is that skb's are kernel
> > specific, we should be able to use the same BPF filter program in
> > userspace over DPDK for instance-- so an skb interface as the packet
> > abstraction might not be the right model...
> 
> I agree.  I don't think reusing the SKB data structure is the right
> model.  We should drop the SKB pointer from the API.
> 
> As Tom also points out, making the BPF interface independent of the SKB
> meta-data structure, would also make the eBPF program more generally
> applicable.
The initial approach that I tried went down this path. Alexei advised
that I use the pseudo skb, and in the future the API between drivers and
bpf can change to adopt non-skb context. The only user facing ABIs in
this patchset are the IFLA, the xdp_metadata struct, and the name of the
new enum.

The reason to use a pseudo skb for now is that there will be a fair
amount of churn to get bpf jit and interpreter to understand non-skb
context in the bpf_load_pointer() code. I don't see the need for
requiring that for this patchset, as it will be internal-only change
if/when we use something else.
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Eric Dumazet @ 2016-04-04 15:22 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Brenden Blanco, davem, netdev, tom,
	Or Gerlitz, daniel, john.fastabend
In-Reply-To: <20160404165701.2a25a17a@redhat.com>

On Mon, 2016-04-04 at 16:57 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 1 Apr 2016 19:47:12 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > My guess we're hitting 14.5Mpps limit for empty bpf program
> > and for program that actually looks into the packet because we're
> > hitting 10G phy limit of 40G nic. Since physically 40G nic
> > consists of four 10G phys. There will be the same problem
> > with 100G and 50G nics. Both will be hitting 25G phy limit.
> > We need to vary packets somehow. Hopefully Or can explain that
> > bit of hw design.
> > Jesper's experiments with mlx4 showed the same 14.5Mpps limit
> > when sender blasting the same packet over and over again.
> 
> That is an interesting observation Alexei, and could explain the pps limit
> I hit on 40G, with single flow testing.  AFAIK 40G is 4x 10G PHYs, and
> 100G is 4x 25G PHYs.
> 
> I have a pktgen script that tried to avoid this pitfall.  By creating a
> new flow per pktgen kthread. I call it "pktgen_sample05_flow_per_thread.sh"[1]
> 
> [1] https://github.com/netoptimizer/network-testing/blob/master/pktgen/pktgen_sample05_flow_per_thread.sh
> 

A single flow is able to use 40Gbit on those 40Gbit NIC, so there is not
a single 10GB trunk used for a given flow.

This 14Mpps thing seems to be a queue limitation on mlx4.

^ permalink raw reply

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Edward Cree @ 2016-04-04 15:18 UTC (permalink / raw)
  To: Eric Dumazet, Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Brenden Blanco, davem, netdev, tom,
	alexei.starovoitov, gerlitz, john.fastabend
In-Reply-To: <1459780420.6473.337.camel@edumazet-glaptop3.roam.corp.google.com>

On 04/04/16 15:33, Eric Dumazet wrote:
> On Mon, 2016-04-04 at 15:07 +0200, Jesper Dangaard Brouer wrote:
> 
>> Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
>> to users of this API.
>>
>> The hole idea is that an SKB is NOT allocated yet, and not needed at
>> this level.  If we start supporting calling underlying SKB functions,
>> then we will end-up in the same place (performance wise).
> 
> A BPF program can access many skb fields.
> 
> If you plan to support BPF, your fake skb needs to be populated like a
> real one. Looks like some code will be replicated in all drivers that
> want this facility...
> 
> Or accept (document ?) that some BPF instructions are just not there.
> (hash, queue_mapping ...)

If these progs are eventually going to get pushed down into supporting
hardware, many skb things won't make sense at all at that level.  I would
suggest that anything hardware wouldn't reasonably have available should
be "just not there"; I suspect that'll lead you to the right API for early
driver filter as well.  And it probably won't look much like an skb.

-Ed

^ permalink raw reply

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Jesper Dangaard Brouer @ 2016-04-04 15:12 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Daniel Borkmann, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Alexei Starovoitov, gerlitz,
	john fastabend, brouer
In-Reply-To: <CALx6S37aK79AbkUPBFTHkonUziSb7A1KV47vnG1OgciPD2qXcA@mail.gmail.com>

On Mon, 4 Apr 2016 11:09:57 -0300
Tom Herbert <tom@herbertland.com> wrote:

> On Mon, Apr 4, 2016 at 10:36 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> > On 04/04/2016 03:07 PM, Jesper Dangaard Brouer wrote:  
> >>
> >> On Mon, 04 Apr 2016 10:49:09 +0200 Daniel Borkmann <daniel@iogearbox.net>
> >> wrote:  
> >>>
> >>> On 04/02/2016 03:21 AM, Brenden Blanco wrote:  
> >>>>
> >>>> Add a new bpf prog type that is intended to run in early stages of the
> >>>> packet rx path. Only minimal packet metadata will be available, hence a
> >>>> new
> >>>> context type, struct xdp_metadata, is exposed to userspace. So far only
> >>>> expose the readable packet length, and only in read mode.
> >>>>
> >>>> The PHYS_DEV name is chosen to represent that the program is meant only
> >>>> for physical adapters, rather than all netdevs.
> >>>>
> >>>> While the user visible struct is new, the underlying context must be
> >>>> implemented as a minimal skb in order for the packet load_* instructions
> >>>> to work. The skb filled in by the driver must have skb->len, skb->head,
> >>>> and skb->data set, and skb->data_len == 0.
> >>>>  
> >> [...]  
> >>>
> >>>
> >>> Do you plan to support bpf_skb_load_bytes() as well? I like using
> >>> this API especially when dealing with larger chunks (>4 bytes) to
> >>> load into stack memory, plus content is kept in network byte order.
> >>>
> >>> What about other helpers such as bpf_skb_store_bytes() et al that
> >>> work on skbs. Do you intent to reuse them as is and thus populate
> >>> the per cpu skb with needed fields (faking linear data), or do you
> >>> see larger obstacles that prevent for this?  
> >>
> >>
> >> Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
> >> to users of this API.
> >>
> >> The hole idea is that an SKB is NOT allocated yet, and not needed at
> >> this level.  If we start supporting calling underlying SKB functions,
> >> then we will end-up in the same place (performance wise).  
> >
> >
> > I'm talking about the current skb-related BPF helper functions we have,
> > so the question is how much from that code we have we can reuse under
> > these constraints (obviously things like the tunnel helpers are a different
> > story) and if that trade-off is acceptable for us. I'm also thinking
> > that, for example, if you need to parse the packet data anyway for a drop
> > verdict, you might as well pass some meta data (that is set in the real
> > skb later on) for those packets that go up the stack.  
> 
> Right, the meta data in this case is an abstracted receive descriptor.
> This would include items that we get in a device receive descriptor
> (computed checksum, hash, VLAN tag). This is purposely a small
> restricted data structure. I'm hoping we can minimize the size of this
> to not much more than 32 bytes (including pointers to data and
> linkage).

I agree.
 
> How this translates to skb to maintain compatibility is with BPF
> interesting question. One other consideration is that skb's are kernel
> specific, we should be able to use the same BPF filter program in
> userspace over DPDK for instance-- so an skb interface as the packet
> abstraction might not be the right model...

I agree.  I don't think reusing the SKB data structure is the right
model.  We should drop the SKB pointer from the API.

As Tom also points out, making the BPF interface independent of the SKB
meta-data structure, would also make the eBPF program more generally
applicable.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH] ath5k: Change led pin configuration for compaq c700 laptop
From: Kalle Valo @ 2016-04-04 15:03 UTC (permalink / raw)
  To: Joseph Salisbury
  Cc: jirislaby, mickflemm, mcgrof, linux-kernel, stable,
	linux-wireless, netdev
In-Reply-To: <830d8fa0b7cb0b8cfb0b34360925cd50c3187341.1457977047.git.joseph.salisbury@canonical.com>

Joseph Salisbury <joseph.salisbury@canonical.com> writes:

> BugLink: http://bugs.launchpad.net/bugs/972604
>
> Commit 09c9bae26b0d3c9472cb6ae45010460a2cee8b8d ("ath5k: add led pin 
> configuration for compaq c700 laptop") added a pin configuration for the Compaq 
> c700 laptop.  However, the polarity of the led pin is reversed.  It should be 
> red for wifi off and blue for wifi on, but it is the opposite.  This bug was 
> reported in the following bug report: 
> http://pad.lv/972604
>
>
> Fixes: 09c9bae26b0d3c9472cb6ae45010460a2cee8b8d ("ath5k: add led pin 
> configuration for compaq c700 laptop")
>
> Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com>
> Cc: stable@vger.kernel.org

Applied to ath.git, thanks.

I just fixed the Fixes line to be one continuous line.

-- 
Kalle Valo

^ permalink raw reply

* Re: System hangs (unable to handle kernel paging request)
From: Oleksii Berezhniak @ 2016-04-04 15:01 UTC (permalink / raw)
  To: netdev
In-Reply-To: <57027A82.6040807@gmail.com>

Can you please point me to more detailed description of similar issues
that you mentioned?

I can only find this:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=32b3e08fff60494cd1d281a39b51583edfd2b18f

But there are no any hangs. Only performance issues.

BTW, GRO (Generic Receive Offloading) is disabled on our network adapter.

2016-04-04 17:30 GMT+03:00 Bastien Philbert <bastienphilbert@gmail.com>:
>
>
> On 2016-04-04 03:59 AM, Oleksii Berezhniak wrote:
>> Good day.
>>
>> We have PPPoE server with CentOS 7 (kernel 3.10.0-327.10.1.el7.dsip.x86_64)
>>
>> We applied some PPPoE related patches to this kernel:
>>
>> ppp: don't override sk->sk_state in pppoe_flush_dev()
>> ppp: fix pppoe_dev deletion condition in pppoe_release()
>> pppoe: fix memory corruption in padt work structure
>> pppoe: fix reference counting in PPPoE proxy
>>
>> Also we built latest version of ixgbe driver from Intel.
>>
>> Now we have crashes after approx. one week of uptime:
>>
>> [545444.673270] BUG: unable to handle kernel paging request at ffff88a005040200
>> [545444.673306] IP: [<ffffffff811c0e95>] kmem_cache_alloc+0x75/0x1d0
>> [545444.673335] PGD 0
>> [545444.673348] Oops: 0000 [#1] SMP
>> [545444.673367] Modules linked in: arc4 ppp_mppe act_police cls_u32
>> sch_ingress sch_tbf pptp gre pppoe pppox ppp_generic slhc 8021q garp
>> stp mrp llc iptable_nat nf_conn
>> track_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_filter xt_TCPMSS
>> iptable_mangle xt_CT nf_conntrack iptable_raw w83793 hwmon_vid
>> snd_hda_codec_realtek snd_hda_codec
>> _generic snd_hda_intel snd_hda_codec coretemp snd_hda_core iTCO_wdt
>> kvm iTCO_vendor_support snd_hwdep snd_seq snd_seq_device ipmi_ssif
>> ppdev lpc_ich snd_pcm pcspkr mfd_
>> core sg ipmi_si snd_timer snd i2c_i801 ipmi_msghandler ioatdma
>> parport_pc parport shpchp soundcore i7core_edac tpm_infineon edac_core
>> ip_tables ext4 mbcache jbd2 sd_mod
>>  crct10dif_generic crc_t10dif crct10dif_common syscopyarea sysfillrect
>> firewire_ohci sysimgblt i2c_algo_bit drm_kms_helper ata_generic
>> pata_acpi
>> [545444.674383]  ttm firewire_core crc_itu_t serio_raw drm ata_piix
>> libata crc32c_intel i2c_core ixgbe(OE) vxlan e1000e ip6_udp_tunnel
>> udp_tunnel aacraid dca ptp pps_co
>> re
>> [545444.674783] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G           OE
>> ------------   3.10.0-327.10.1.el7.dsip.x86_64 #1
>> [545444.675032] Hardware name: empty empty/S7010, BIOS 'V2.06  ' 03/31/2010
>> [545444.675162] task: ffff880139c55c00 ti: ffff880139c84000 task.ti:
>> ffff880139c84000
>> [545444.675400] RIP: 0010:[<ffffffff811c0e95>]  [<ffffffff811c0e95>]
>> kmem_cache_alloc+0x75/0x1d0
>> [545444.675641] RSP: 0018:ffff88023fc23ce8  EFLAGS: 00010286
>> [545444.675766] RAX: 0000000000000000 RBX: ffff8802302eab00 RCX:
>> 000000010eb8edbe
>> [545444.676002] RDX: 000000010eb8edbd RSI: 0000000000000020 RDI:
>> ffff88013b803700
>> [545444.676237] RBP: ffff88023fc23d18 R08: 00000000000175a0 R09:
>> ffffffff81517e70
>> [545444.676472] R10: 000000000000006b R11: 0000000000000000 R12:
>> ffff88a005040200
>> [545444.676706] R13: 0000000000000020 R14: ffff88013b803700 R15:
>> ffff88013b803700
>> [545444.676942] FS:  0000000000000000(0000) GS:ffff88023fc20000(0000)
>> knlGS:0000000000000000
>> [545444.677180] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [545444.677307] CR2: ffff88a005040200 CR3: 0000000237e63000 CR4:
>> 00000000000007e0
>> [545444.677543] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [545444.677779] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>> 0000000000000400
>> [545444.678014] Stack:
>> [545444.678127]  ffff880237ea2040 ffff8802302eab00 0000000000000280
>> 0000000000000280
>> [545444.678370]  0000000000000006 ffff880236bb1b60 ffff88023fc23d40
>> ffffffff81517e70
>> [545444.678614]  0000000000000280 ffff8802302eab00 0000000000000000
>> ffff88023fc23d60
>> [545444.678857] Call Trace:
>> [545444.678973]  <IRQ>
>>
>> [545444.678982]
>> [545444.679100]  [<ffffffff81517e70>] build_skb+0x30/0x1d0
>> [545444.679222]  [<ffffffff8151a973>] __alloc_rx_skb+0x63/0xb0
>> [545444.679349]  [<ffffffff8151a9db>] __netdev_alloc_skb+0x1b/0x40
>> [545444.679492]  [<ffffffffa0104d8e>] ixgbe_clean_rx_irq+0xee/0xa50 [ixgbe]
>> [545444.679624]  [<ffffffff8152862f>] ? __napi_complete+0x1f/0x30
>> [545444.679756]  [<ffffffffa0106738>] ixgbe_poll+0x2d8/0x6d0 [ixgbe]
>> [545444.679886]  [<ffffffff8152b092>] net_rx_action+0x152/0x240
>> [545444.680015]  [<ffffffff81084aef>] __do_softirq+0xef/0x280
>> [545444.680144]  [<ffffffff8164735c>] call_softirq+0x1c/0x30
>> [545444.680277]  [<ffffffff81016fc5>] do_softirq+0x65/0xa0
>> [545444.680402]  [<ffffffff81084e85>] irq_exit+0x115/0x120
>> [545444.680529]  [<ffffffff81647ef8>] do_IRQ+0x58/0xf0
>> [545444.680660]  [<ffffffff8163d1ad>] common_interrupt+0x6d/0x6d
>> [545444.680786]  <EOI>
>> [545444.680794]
>> [545444.680914]  [<ffffffff81058e96>] ? native_safe_halt+0x6/0x10
>> [545444.681041]  [<ffffffff8101dbcf>] default_idle+0x1f/0xc0
>> [545444.681168]  [<ffffffff8101e4d6>] arch_cpu_idle+0x26/0x30
>> [545444.681297]  [<ffffffff810d62c5>] cpu_startup_entry+0x245/0x290
>> [545444.681427]  [<ffffffff810475fa>] start_secondary+0x1ba/0x230
>> [545444.681554] Code: ce 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85
>> e4 0f 84 1f 01 00 00 48 85 c0 0f 84 16 01 00 00 49 63 46 20 48 8d 4a
>> 01 4d 8b 06 <49> 8b 1c 04 4c
>> 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 49 63
>> [545444.682056] RIP  [<ffffffff811c0e95>] kmem_cache_alloc+0x75/0x1d0
>> [545444.682186]  RSP <ffff88023fc23ce8>
>> [545444.682305] CR2: ffff88a005040200
>>
>>
>> Every time description and call stack are the same.
>>
>> What can be cause of these crashes?
>>
>> Thanks.
>>
> I am wondering if your kernel has this commit id, 32b3e08fff60494cd1d281a39b51583edfd2b18f.
> As this seems to be added to fix issues that look very similar to the trace you are receiving.
> Nick



-- 
WBR

^ permalink raw reply

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Jesper Dangaard Brouer @ 2016-04-04 14:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Brenden Blanco, davem, netdev, tom, Or Gerlitz,
	daniel, john.fastabend, brouer
In-Reply-To: <20160402024710.GA59703@ast-mbp.thefacebook.com>


On Fri, 1 Apr 2016 19:47:12 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> My guess we're hitting 14.5Mpps limit for empty bpf program
> and for program that actually looks into the packet because we're
> hitting 10G phy limit of 40G nic. Since physically 40G nic
> consists of four 10G phys. There will be the same problem
> with 100G and 50G nics. Both will be hitting 25G phy limit.
> We need to vary packets somehow. Hopefully Or can explain that
> bit of hw design.
> Jesper's experiments with mlx4 showed the same 14.5Mpps limit
> when sender blasting the same packet over and over again.

That is an interesting observation Alexei, and could explain the pps limit
I hit on 40G, with single flow testing.  AFAIK 40G is 4x 10G PHYs, and
100G is 4x 25G PHYs.

I have a pktgen script that tried to avoid this pitfall.  By creating a
new flow per pktgen kthread. I call it "pktgen_sample05_flow_per_thread.sh"[1]

[1] https://github.com/netoptimizer/network-testing/blob/master/pktgen/pktgen_sample05_flow_per_thread.sh

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH v5 2/4] Documentation: Bindings: Add STM32 DWMAC glue
From: Alexandre Torgue @ 2016-04-04 14:40 UTC (permalink / raw)
  To: Joachim Eastwood
  Cc: Rob Herring, Chen-Yu Tsai, Maxime Coquelin, Giuseppe Cavallaro,
	netdev, devicetree, linux-kernel, linux-arm-kernel
In-Reply-To: <CAJgp7zxfFRhfNjcE23n+DKD0ffrvH4_sowPjK_-DZAS4ecH5dg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi Rob,

2016-03-22 17:11 GMT+01:00 Alexandre Torgue <alexandre.torgue-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
> Hi guys,
>
> I will fix typo issues (s/vesrion/version and ethernet @).
>
> Concerning compatible string. For sure "snps,dwmac-3.50a" string is
> not used inside glue driver.
> I perfere to keep it for information but if you really want that I
> remove it I will not block ;)
>
> 2016-03-21 16:36 GMT+01:00 Joachim  Eastwood <manabian-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>> On 21 March 2016 at 13:40, Rob Herring <robh-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>>> On Sat, Mar 19, 2016 at 12:00:22AM +0800, Chen-Yu Tsai wrote:
>>>> Hi,
>>>>
>>>> On Fri, Mar 18, 2016 at 11:37 PM, Alexandre TORGUE
>>>> <alexandre.torgue-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> > +- clocks: Must contain a phandle for each entry in clock-names.
>>>> > +- clock-names: Should be "stmmaceth" for the host clock.
>>>
> We can remove host clock (stmmac eth) entry here and refer to
> stmmac.txt binding for common entry
>
>>> This doesn't sound like the clock input signal name...
>>>
>>>> > +              Should be "tx-clk" for the MAC TX clock.
>>>> > +              Should be "rx-clk" for the MAC RX clock.
>>>
>>> How can other DWMAC blocks not have these clocks? The glue can't really
>>> add these clocks. It could combine them into one or a new version of
>>> DWMAC could have a different number of clock inputs. So if there is
>>> variation here, then some of the bindings are probably wrong. I guess
>>> the only change I'm suggesting is possibly moving these into common
>>> binding doc.
>>
>> The LPC18xx implementation probably have these clocks as well but the
>> LPC1850 user manual only documents the main clock. Someone with access
>> to the IP block doc from Synopsys should be able to check which clocks
>> the MAC really needs.
>>
>> Rockchip bindings have two clocks named "mac_clk_rx" and "mac_clk_tx".
>> These are probably the same as stm32 needs so maybe use these names
>> and move them into the main doc and update the rockchip binding.
>>
> I think we can use same name. But I have a doubt on moving it in a
> common bindings (maybe I don't well understood). When you say "common
> binding file" is it "stmmac.txt" binding ? If yes does it mean that we
> have to control it inside stmmac driver (no more in glue) ? In this
> case those clocks will become "required" for stm32 and rockship but
> not for others chip. It could create confusion?

A gentle ping. Can you give me your feedback please ?
I will send next patchset version according to your answer.

Thanks in advance

Alex

>
> Best regards
>
> Alex
>
>>
>> regards,
>> Joachim Eastwood
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Best way to reduce system call overhead for tun device I/O?
From: Guus Sliepen @ 2016-04-04 14:40 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stephen Hemminger, David Miller, Tom Herbert, Network Development
In-Reply-To: <CAF=yD-LUaxsJMZiGXQdEDh-6UE11ApL89rjt=13oLK3FM1rerQ@mail.gmail.com>

On Sun, Apr 03, 2016 at 07:03:09PM -0400, Willem de Bruijn wrote:

> On Thu, Mar 31, 2016 at 7:39 PM, Stephen Hemminger <stephen@networkplumber.org> wrote:
>
> > Rather than bodge AF_PACKET onto TUN, why not just create a new device type
> > and control it from something modern like netlink.

Do we really want to introduce a whole new device type? The tun device
is working perfectly fine, except for the fact that there is no way to
send/receive multiple packets in one go.

> Depending on the use-case, it may be sufficient to extend AF_PACKET
> with limited tap functionality:
> 
> - add a po->xmit mode that reinjects into the kernel receive path,
>   analogous to pktgen's M_NETIF_RECEIVE mode.
> 
> - optionally drop packets in __netif_receive_skb_core and xmit_one
>   if any of the registered packet sockets accepted the packet and has
>   a new intercept feature flag enabled.
> 
> This can be applied to a dummy device, but much more interesting
> is to interpose on the flow of a normal nic. It is clearly not a drop-in
> replacement for a tap (let alone tun) device. I have some preliminary
> code.

It's not really tinc's use case, but I did try using socket(AF_PACKET)
bound to a tun interface, just to see if sendmmsg()/recvmmsg() works
then. It does, but indeed for packets that are sent to the socket, they
need to be reinjected into the kernel receive path. So I'll be happy to
test out your preliminary code.

-- 
Met vriendelijke groet / with kind regards,
     Guus Sliepen <guus@tinc-vpn.org>

^ permalink raw reply

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Eric Dumazet @ 2016-04-04 14:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Brenden Blanco, davem, netdev, tom,
	alexei.starovoitov, gerlitz, john.fastabend
In-Reply-To: <20160404150700.1456ae80@redhat.com>

On Mon, 2016-04-04 at 15:07 +0200, Jesper Dangaard Brouer wrote:

> Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
> to users of this API.
> 
> The hole idea is that an SKB is NOT allocated yet, and not needed at
> this level.  If we start supporting calling underlying SKB functions,
> then we will end-up in the same place (performance wise).

A BPF program can access many skb fields.

If you plan to support BPF, your fake skb needs to be populated like a
real one. Looks like some code will be replicated in all drivers that
want this facility...

Or accept (document ?) that some BPF instructions are just not there.
(hash, queue_mapping ...)

^ permalink raw reply

* Re: Best way to reduce system call overhead for tun device I/O?
From: Guus Sliepen @ 2016-04-04 14:31 UTC (permalink / raw)
  To: ValdikSS
  Cc: Stephen Hemminger, Willem de Bruijn, David Miller, Tom Herbert,
	netdev
In-Reply-To: <57026C8F.8050406@valdikss.org.ru>

On Mon, Apr 04, 2016 at 04:30:55PM +0300, ValdikSS wrote:

> I'm trying to increase OpenVPN throughput by optimizing tun manipulations, too.
> Right now I have more questions than answers.
> 
> I get about 800 Mbit/s speeds via OpenVPN with authentication and encryption disabled on a local machine with OpenVPN server and client running in a different
> network namespaces, which use veth for networking, with 1500 MTU on a TUN interface. This is rather limiting. Low-end devices like SOHO routers could only
> achieve 15-20 Mbit/s via OpenVPN with encryption with a 560 MHz CPU.
> Increasing MTU reduces overhead. You can get > 5GBit/s if you set 16000 MTU on a TUN interface.
> That's not only OpenVPN related. All the tunneling software I tried can't achieve gigabit speeds without encryption on my machine with MTU 1500. Didn't test
> tinc though.

It's exactly the same issue for tinc. But tinc does path MTU discovery,
and actively limits the size of packets inside the tunnel so that the
outer packets are not bigger than the PMTU. Of course this can be
disabled, but experience has shown that transmitting large UDP packets
over the Internet is not ideal, since they will be fragmented, and the
loss of one fragment means the whole packet is dropped. In the case of
OpenVPN, I think many users use -mssfix, so they too are in effect
limiting the size of packets inside the tunnel.

Of course, tinc could fragment packets internally (it actually does so
in some circumstances), but I'd rather avoid that.

Also, GSO and GRO only deal with optimizations within one UDP packet or
one TCP stream. If you have many concurrent programs sending data, or
one program sending lots of small UDP packets, those will never be
optimized.

So I think GSO/GRO is not the way to go, but there really should be a
way to receive and send many individual packets in one system call.

-- 
Met vriendelijke groet / with kind regards,
     Guus Sliepen <guus@tinc-vpn.org>

^ permalink raw reply

* Re: System hangs (unable to handle kernel paging request)
From: Bastien Philbert @ 2016-04-04 14:30 UTC (permalink / raw)
  To: Oleksii Berezhniak, netdev
In-Reply-To: <CAJHPw-M7ZLRXcEHovc4EZL1OiE_SihBte2tE+P8UZenKNya7hg@mail.gmail.com>



On 2016-04-04 03:59 AM, Oleksii Berezhniak wrote:
> Good day.
> 
> We have PPPoE server with CentOS 7 (kernel 3.10.0-327.10.1.el7.dsip.x86_64)
> 
> We applied some PPPoE related patches to this kernel:
> 
> ppp: don't override sk->sk_state in pppoe_flush_dev()
> ppp: fix pppoe_dev deletion condition in pppoe_release()
> pppoe: fix memory corruption in padt work structure
> pppoe: fix reference counting in PPPoE proxy
> 
> Also we built latest version of ixgbe driver from Intel.
> 
> Now we have crashes after approx. one week of uptime:
> 
> [545444.673270] BUG: unable to handle kernel paging request at ffff88a005040200
> [545444.673306] IP: [<ffffffff811c0e95>] kmem_cache_alloc+0x75/0x1d0
> [545444.673335] PGD 0
> [545444.673348] Oops: 0000 [#1] SMP
> [545444.673367] Modules linked in: arc4 ppp_mppe act_police cls_u32
> sch_ingress sch_tbf pptp gre pppoe pppox ppp_generic slhc 8021q garp
> stp mrp llc iptable_nat nf_conn
> track_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_filter xt_TCPMSS
> iptable_mangle xt_CT nf_conntrack iptable_raw w83793 hwmon_vid
> snd_hda_codec_realtek snd_hda_codec
> _generic snd_hda_intel snd_hda_codec coretemp snd_hda_core iTCO_wdt
> kvm iTCO_vendor_support snd_hwdep snd_seq snd_seq_device ipmi_ssif
> ppdev lpc_ich snd_pcm pcspkr mfd_
> core sg ipmi_si snd_timer snd i2c_i801 ipmi_msghandler ioatdma
> parport_pc parport shpchp soundcore i7core_edac tpm_infineon edac_core
> ip_tables ext4 mbcache jbd2 sd_mod
>  crct10dif_generic crc_t10dif crct10dif_common syscopyarea sysfillrect
> firewire_ohci sysimgblt i2c_algo_bit drm_kms_helper ata_generic
> pata_acpi
> [545444.674383]  ttm firewire_core crc_itu_t serio_raw drm ata_piix
> libata crc32c_intel i2c_core ixgbe(OE) vxlan e1000e ip6_udp_tunnel
> udp_tunnel aacraid dca ptp pps_co
> re
> [545444.674783] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G           OE
> ------------   3.10.0-327.10.1.el7.dsip.x86_64 #1
> [545444.675032] Hardware name: empty empty/S7010, BIOS 'V2.06  ' 03/31/2010
> [545444.675162] task: ffff880139c55c00 ti: ffff880139c84000 task.ti:
> ffff880139c84000
> [545444.675400] RIP: 0010:[<ffffffff811c0e95>]  [<ffffffff811c0e95>]
> kmem_cache_alloc+0x75/0x1d0
> [545444.675641] RSP: 0018:ffff88023fc23ce8  EFLAGS: 00010286
> [545444.675766] RAX: 0000000000000000 RBX: ffff8802302eab00 RCX:
> 000000010eb8edbe
> [545444.676002] RDX: 000000010eb8edbd RSI: 0000000000000020 RDI:
> ffff88013b803700
> [545444.676237] RBP: ffff88023fc23d18 R08: 00000000000175a0 R09:
> ffffffff81517e70
> [545444.676472] R10: 000000000000006b R11: 0000000000000000 R12:
> ffff88a005040200
> [545444.676706] R13: 0000000000000020 R14: ffff88013b803700 R15:
> ffff88013b803700
> [545444.676942] FS:  0000000000000000(0000) GS:ffff88023fc20000(0000)
> knlGS:0000000000000000
> [545444.677180] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [545444.677307] CR2: ffff88a005040200 CR3: 0000000237e63000 CR4:
> 00000000000007e0
> [545444.677543] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [545444.677779] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [545444.678014] Stack:
> [545444.678127]  ffff880237ea2040 ffff8802302eab00 0000000000000280
> 0000000000000280
> [545444.678370]  0000000000000006 ffff880236bb1b60 ffff88023fc23d40
> ffffffff81517e70
> [545444.678614]  0000000000000280 ffff8802302eab00 0000000000000000
> ffff88023fc23d60
> [545444.678857] Call Trace:
> [545444.678973]  <IRQ>
> 
> [545444.678982]
> [545444.679100]  [<ffffffff81517e70>] build_skb+0x30/0x1d0
> [545444.679222]  [<ffffffff8151a973>] __alloc_rx_skb+0x63/0xb0
> [545444.679349]  [<ffffffff8151a9db>] __netdev_alloc_skb+0x1b/0x40
> [545444.679492]  [<ffffffffa0104d8e>] ixgbe_clean_rx_irq+0xee/0xa50 [ixgbe]
> [545444.679624]  [<ffffffff8152862f>] ? __napi_complete+0x1f/0x30
> [545444.679756]  [<ffffffffa0106738>] ixgbe_poll+0x2d8/0x6d0 [ixgbe]
> [545444.679886]  [<ffffffff8152b092>] net_rx_action+0x152/0x240
> [545444.680015]  [<ffffffff81084aef>] __do_softirq+0xef/0x280
> [545444.680144]  [<ffffffff8164735c>] call_softirq+0x1c/0x30
> [545444.680277]  [<ffffffff81016fc5>] do_softirq+0x65/0xa0
> [545444.680402]  [<ffffffff81084e85>] irq_exit+0x115/0x120
> [545444.680529]  [<ffffffff81647ef8>] do_IRQ+0x58/0xf0
> [545444.680660]  [<ffffffff8163d1ad>] common_interrupt+0x6d/0x6d
> [545444.680786]  <EOI>
> [545444.680794]
> [545444.680914]  [<ffffffff81058e96>] ? native_safe_halt+0x6/0x10
> [545444.681041]  [<ffffffff8101dbcf>] default_idle+0x1f/0xc0
> [545444.681168]  [<ffffffff8101e4d6>] arch_cpu_idle+0x26/0x30
> [545444.681297]  [<ffffffff810d62c5>] cpu_startup_entry+0x245/0x290
> [545444.681427]  [<ffffffff810475fa>] start_secondary+0x1ba/0x230
> [545444.681554] Code: ce 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85
> e4 0f 84 1f 01 00 00 48 85 c0 0f 84 16 01 00 00 49 63 46 20 48 8d 4a
> 01 4d 8b 06 <49> 8b 1c 04 4c
> 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 49 63
> [545444.682056] RIP  [<ffffffff811c0e95>] kmem_cache_alloc+0x75/0x1d0
> [545444.682186]  RSP <ffff88023fc23ce8>
> [545444.682305] CR2: ffff88a005040200
> 
> 
> Every time description and call stack are the same.
> 
> What can be cause of these crashes?
> 
> Thanks.
> 
I am wondering if your kernel has this commit id, 32b3e08fff60494cd1d281a39b51583edfd2b18f.
As this seems to be added to fix issues that look very similar to the trace you are receiving.
Nick

^ permalink raw reply

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Tom Herbert @ 2016-04-04 14:09 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Jesper Dangaard Brouer, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Alexei Starovoitov, gerlitz,
	john fastabend
In-Reply-To: <57026DFA.3090201@iogearbox.net>

On Mon, Apr 4, 2016 at 10:36 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 04/04/2016 03:07 PM, Jesper Dangaard Brouer wrote:
>>
>> On Mon, 04 Apr 2016 10:49:09 +0200 Daniel Borkmann <daniel@iogearbox.net>
>> wrote:
>>>
>>> On 04/02/2016 03:21 AM, Brenden Blanco wrote:
>>>>
>>>> Add a new bpf prog type that is intended to run in early stages of the
>>>> packet rx path. Only minimal packet metadata will be available, hence a
>>>> new
>>>> context type, struct xdp_metadata, is exposed to userspace. So far only
>>>> expose the readable packet length, and only in read mode.
>>>>
>>>> The PHYS_DEV name is chosen to represent that the program is meant only
>>>> for physical adapters, rather than all netdevs.
>>>>
>>>> While the user visible struct is new, the underlying context must be
>>>> implemented as a minimal skb in order for the packet load_* instructions
>>>> to work. The skb filled in by the driver must have skb->len, skb->head,
>>>> and skb->data set, and skb->data_len == 0.
>>>>
>> [...]
>>>
>>>
>>> Do you plan to support bpf_skb_load_bytes() as well? I like using
>>> this API especially when dealing with larger chunks (>4 bytes) to
>>> load into stack memory, plus content is kept in network byte order.
>>>
>>> What about other helpers such as bpf_skb_store_bytes() et al that
>>> work on skbs. Do you intent to reuse them as is and thus populate
>>> the per cpu skb with needed fields (faking linear data), or do you
>>> see larger obstacles that prevent for this?
>>
>>
>> Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
>> to users of this API.
>>
>> The hole idea is that an SKB is NOT allocated yet, and not needed at
>> this level.  If we start supporting calling underlying SKB functions,
>> then we will end-up in the same place (performance wise).
>
>
> I'm talking about the current skb-related BPF helper functions we have,
> so the question is how much from that code we have we can reuse under
> these constraints (obviously things like the tunnel helpers are a different
> story) and if that trade-off is acceptable for us. I'm also thinking
> that, for example, if you need to parse the packet data anyway for a drop
> verdict, you might as well pass some meta data (that is set in the real
> skb later on) for those packets that go up the stack.

Right, the meta data in this case is an abstracted receive descriptor.
This would include items that we get in a device receive descriptor
(computed checksum, hash, VLAN tag). This is purposely a small
restricted data structure. I'm hoping we can minimize the size of this
to not much more than 32 bytes (including pointers to data and
linkage).

How this translates to skb to maintain compatibility is with BPF
interesting question. One other consideration is that skb's are kernel
specific, we should be able to use the same BPF filter program in
userspace over DPDK for instance-- so an skb interface as the packet
abstraction might not be the right model...

Tom

^ permalink raw reply

* Re: davinci-mdio: failing to connect to PHY
From: Petr Kulhavy @ 2016-04-04 14:01 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev
In-Reply-To: <20160404135813.GA25131@lunn.ch>



On 04.04.2016 15:58, Andrew Lunn wrote:
> On Mon, Apr 04, 2016 at 03:50:02PM +0200, Petr Kulhavy wrote:
>> Hi Andrew,
>>
>> thanks a lot for the link. In the meantime I've understood the issue
>> better. It is due to the fact that the PHY is pin-strapped to
>> address 1 and broadcast (at address 0) is  enabled. The Micrel
>> driver's config_init() disables the broadcast and the PHY stops
>> responding, which causes the troubles. The kernel 3.17 didn't
>> disable the broadcast and therefore it worked.
>>
>> I'm wondering how to solve or workaround this...
> One option is in your device tree is to explicitly list the phy on
> your mdio bus. Something like:
>
> &mdio {
>          status = "okay";
>
>          ethphy0: ethernet-phy@1 {
>                  reg = <1>;
>          };
> };
>
> This alone might be sufficient. If not, you need to reference the phy
> via a phandle in the ethernet node.
>
>
> &eth0 {
>          status = "okay";
>          phy-handle = <&ethphy0>;
> };
>
> 	Andrew
Thanks a lot, I'm going to try it out right now!

Cheers
Petr

^ permalink raw reply

* Re: davinci-mdio: failing to connect to PHY
From: Andrew Lunn @ 2016-04-04 13:58 UTC (permalink / raw)
  To: Petr Kulhavy; +Cc: netdev
In-Reply-To: <5702710A.5010804@barix.com>

On Mon, Apr 04, 2016 at 03:50:02PM +0200, Petr Kulhavy wrote:
> 
> 
> On 04.04.2016 14:31, Andrew Lunn wrote:
> >Hi Petr
> >
> >You might want to take a look at:
> >
> >http://lxr.free-electrons.com/source/drivers/net/ethernet/ti/davinci_mdio.c#L137
> >
> >It seems to be asking the hardware about the phy mask.
> >
> >    Andrew
> 
> Hi Andrew,
> 
> thanks a lot for the link. In the meantime I've understood the issue
> better. It is due to the fact that the PHY is pin-strapped to
> address 1 and broadcast (at address 0) is  enabled. The Micrel
> driver's config_init() disables the broadcast and the PHY stops
> responding, which causes the troubles. The kernel 3.17 didn't
> disable the broadcast and therefore it worked.
> 
> I'm wondering how to solve or workaround this...

One option is in your device tree is to explicitly list the phy on
your mdio bus. Something like:

&mdio {
        status = "okay";

        ethphy0: ethernet-phy@1 {
                reg = <1>;
        };
};

This alone might be sufficient. If not, you need to reference the phy
via a phandle in the ethernet node.


&eth0 {
        status = "okay";
        phy-handle = <&ethphy0>;
};

	Andrew

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox