Netdev List


* [PATCH 4/4] [DO NOT MERGE] arm64: allwinner: a64: enable RTL8211E PHY workaround
From: Icenowy Zheng @ 2017-04-21 23:24 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, Rob Herring
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw, Icenowy Zheng
In-Reply-To: <20170421232436.10924-1-icenowy-h8G6r0blFSE@public.gmane.org>

From: Icenowy Zheng <icenowy-ymACFijhrKM@public.gmane.org>

Some Pine64+ boards are reported to have a broken RTL8211E PHY.

Enable the workaround in the Pine64+ device tree file.

Signed-off-by: Icenowy Zheng <icenowy-ymACFijhrKM@public.gmane.org>
---
 arch/arm64/boot/dts/allwinner/sun50i-a64-pine64-plus.dts | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64-pine64-plus.dts b/arch/arm64/boot/dts/allwinner/sun50i-a64-pine64-plus.dts
index 790d14daaa6a..1f59ee4f8b45 100644
--- a/arch/arm64/boot/dts/allwinner/sun50i-a64-pine64-plus.dts
+++ b/arch/arm64/boot/dts/allwinner/sun50i-a64-pine64-plus.dts
@@ -48,3 +48,7 @@
 
 	/* TODO: Camera, Ethernet PHY, touchscreen, etc. */
 };
+
+&ext_phy {
+	realtek,disable-rx-delay;
+};
-- 
2.12.2


* [PATCH 3/4] net: phy: realtek: add disable RX delay hack for RTL8211E
From: Icenowy Zheng @ 2017-04-21 23:24 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, Rob Herring
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw, Icenowy Zheng
In-Reply-To: <20170421232436.10924-1-icenowy-h8G6r0blFSE@public.gmane.org>

From: Icenowy Zheng <icenowy-ymACFijhrKM@public.gmane.org>

Some RTL8211E chips have a broken GbE function, which needs a hack to
fix. This fix is said to affect performance on non-buggy PHYs, so it
should only be enabled on boards with the broken PHY. Currently only
some Pine64+ boards are known to have this issue.

According to Realtek, this hack disables the RX delay of the RTL8211E.

Enable this hack when a certain device tree property is set.

As this hack is not documented in the datasheet at all, it contains
magic numbers whose meaning is not known. These magic numbers were
received from Realtek via Pine64.

Signed-off-by: Icenowy Zheng <icenowy-ymACFijhrKM@public.gmane.org>
---
 drivers/net/phy/realtek.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
index d820d00addf6..880022160cd2 100644
--- a/drivers/net/phy/realtek.c
+++ b/drivers/net/phy/realtek.c
@@ -13,6 +13,7 @@
  * option) any later version.
  *
  */
+#include <linux/of.h>
 #include <linux/phy.h>
 #include <linux/module.h>
 
@@ -26,6 +27,8 @@
 #define RTL8211_PAGE_SELECT	0x1f
 
 #define RTL8211E_INER_LINK_STATUS 0x400
+#define RTL8211E_EXT_PAGE_SELECT 0x1e
+#define RTL8211E_EXT_PAGE	0x7
 
 #define RTL8211F_INER_LINK_STATUS 0x0010
 #define RTL8211F_INSR		0x1d
@@ -121,6 +124,38 @@ static int rtl8211f_config_init(struct phy_device *phydev)
 	return 0;
 }
 
+static int rtl8211e_config_init(struct phy_device *phydev)
+{
+	struct device *dev = &phydev->mdio.dev;
+	struct device_node *of_node = dev->of_node;
+	int ret;
+
+	ret = genphy_config_init(phydev);
+	if (ret < 0)
+		return ret;
+
+	if (of_node &&
+	    of_property_read_bool(of_node, "realtek,disable-rx-delay")) {
+		/* All these magic numbers were obtained from Pine64, and
+		 * they're said to originate from Realtek.
+		 *
+		 * The RTL8211E datasheet doesn't cover this ext page.
+		 *
+		 * Select extension page 0xa4 here.
+		 */
+		phy_write(phydev, RTL8211_PAGE_SELECT, RTL8211E_EXT_PAGE);
+		phy_write(phydev, RTL8211E_EXT_PAGE_SELECT, 0xa4);
+
+		/* Write the magic number */
+		phy_write(phydev, 0x1c, 0xb591);
+
+		/* Restore to default page 0 */
+		phy_write(phydev, RTL8211_PAGE_SELECT, 0);
+	}
+
+	return 0;
+}
+
 static struct phy_driver realtek_drvs[] = {
 	{
 		.phy_id         = 0x00008201,
@@ -159,6 +194,7 @@ static struct phy_driver realtek_drvs[] = {
 		.features	= PHY_GBIT_FEATURES,
 		.flags		= PHY_HAS_INTERRUPT,
 		.config_aneg	= &genphy_config_aneg,
+		.config_init	= rtl8211e_config_init,
 		.read_status	= &genphy_read_status,
 		.ack_interrupt	= &rtl821x_ack_interrupt,
 		.config_intr	= &rtl8211e_config_intr,
-- 
2.12.2
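
The select, write, restore sequence in rtl8211e_config_init() can be tried outside the kernel against a toy register model. This is only a minimal sketch under stated assumptions, not the driver code: `toy_phy`, `toy_phy_write()` and `toy_disable_rx_delay()` are hypothetical names, the paging behaviour is simplified from what the patch describes, and, unlike the patch, the sketch checks the result of every write and always attempts to restore page 0.

```c
#include <stdint.h>

#define PAGE_SELECT      0x1f  /* RTL8211_PAGE_SELECT in the patch */
#define EXT_PAGE_SELECT  0x1e  /* RTL8211E_EXT_PAGE_SELECT in the patch */
#define EXT_PAGE         0x7   /* RTL8211E_EXT_PAGE in the patch */

/* Toy model (hypothetical): tracks the selected pages and one register. */
struct toy_phy {
	uint16_t page;          /* last value written to PAGE_SELECT */
	uint16_t ext_page;      /* last value written to EXT_PAGE_SELECT */
	uint16_t ext_a4_reg1c;  /* register 0x1c of extension page 0xa4 */
	int fail_after;         /* writes allowed before failing; < 0 = never fail */
};

/* Stand-in for phy_write(): returns 0 on success, negative on error. */
static int toy_phy_write(struct toy_phy *phy, uint8_t reg, uint16_t val)
{
	if (phy->fail_after == 0)
		return -5;  /* simulated bus error */
	if (phy->fail_after > 0)
		phy->fail_after--;

	if (reg == PAGE_SELECT)
		phy->page = val;
	else if (reg == EXT_PAGE_SELECT && phy->page == EXT_PAGE)
		phy->ext_page = val;
	else if (reg == 0x1c && phy->page == EXT_PAGE && phy->ext_page == 0xa4)
		phy->ext_a4_reg1c = val;
	return 0;
}

/* Same write order as the patch, with error propagation added. */
static int toy_disable_rx_delay(struct toy_phy *phy)
{
	int ret, restore;

	ret = toy_phy_write(phy, PAGE_SELECT, EXT_PAGE);
	if (ret < 0)
		return ret;

	ret = toy_phy_write(phy, EXT_PAGE_SELECT, 0xa4);
	if (ret == 0)
		ret = toy_phy_write(phy, 0x1c, 0xb591);  /* the magic number */

	/* Always try to return to default page 0, even after a failure. */
	restore = toy_phy_write(phy, PAGE_SELECT, 0);
	return ret < 0 ? ret : restore;
}
```

Whether the real driver should propagate these errors is a design question for the patch author; the sketch only shows what that would look like.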


* [PATCH 2/4] dt-bindings: add binding for RTL8211E Ethernet PHY
From: Icenowy Zheng @ 2017-04-21 23:24 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, Rob Herring
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw, Icenowy Zheng
In-Reply-To: <20170421232436.10924-1-icenowy-h8G6r0blFSE@public.gmane.org>

From: Icenowy Zheng <icenowy-ymACFijhrKM@public.gmane.org>

Some RTL8211E Ethernet PHYs have an issue that needs a workaround,
which is indicated via the device tree.

Add the binding for a property that indicates this workaround.

Signed-off-by: Icenowy Zheng <icenowy-ymACFijhrKM@public.gmane.org>
---
 .../devicetree/bindings/net/realtek,rtl8211e.txt   | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/realtek,rtl8211e.txt

diff --git a/Documentation/devicetree/bindings/net/realtek,rtl8211e.txt b/Documentation/devicetree/bindings/net/realtek,rtl8211e.txt
new file mode 100644
index 000000000000..c1913301bfe8
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/realtek,rtl8211e.txt
@@ -0,0 +1,22 @@
+Realtek RTL8211E Ethernet PHY
+
+One batch of RTL8211E is slightly broken and needs some special (and
+full of magic numbers) tweaking in order to make GbE operate properly.
+The only well-known board that used the broken batch is Pine64+.
+Configure it through an Ethernet OF device node.
+
+Optional properties:
+
+- realtek,disable-rx-delay:
+  If set, RX delay will be completely disabled (according to Realtek). This
+  will affect the performance on non-broken boards.
+  default: do not disable RX delay.
+
+Examples:
+Pine64+ with broken RTL8211E:
+&mdio {
+	ext_phy: ethernet-phy@0 {
+		reg = <0>;
+		realtek,disable-rx-delay;
+	};
+};
-- 
2.12.2


* [PATCH 1/4] net: phy: realtek: change macro name for page select register
From: Icenowy Zheng @ 2017-04-21 23:24 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, Rob Herring
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw, Icenowy Zheng
In-Reply-To: <20170421232436.10924-1-icenowy-h8G6r0blFSE@public.gmane.org>

From: Icenowy Zheng <icenowy-ymACFijhrKM@public.gmane.org>

The page select register also exists on RTL8211E PHY (although it
behaves slightly differently).

Change the register macro name to remove the F.

Signed-off-by: Icenowy Zheng <icenowy-ymACFijhrKM@public.gmane.org>
---
 drivers/net/phy/realtek.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
index 9cbe645e3d89..d820d00addf6 100644
--- a/drivers/net/phy/realtek.c
+++ b/drivers/net/phy/realtek.c
@@ -22,11 +22,13 @@
 #define RTL821x_INER		0x12
 #define RTL821x_INER_INIT	0x6400
 #define RTL821x_INSR		0x13
+
+#define RTL8211_PAGE_SELECT	0x1f
+
 #define RTL8211E_INER_LINK_STATUS 0x400
 
 #define RTL8211F_INER_LINK_STATUS 0x0010
 #define RTL8211F_INSR		0x1d
-#define RTL8211F_PAGE_SELECT	0x1f
 #define RTL8211F_TX_DELAY	0x100
 
 MODULE_DESCRIPTION("Realtek PHY driver");
@@ -46,10 +48,10 @@ static int rtl8211f_ack_interrupt(struct phy_device *phydev)
 {
 	int err;
 
-	phy_write(phydev, RTL8211F_PAGE_SELECT, 0xa43);
+	phy_write(phydev, RTL8211_PAGE_SELECT, 0xa43);
 	err = phy_read(phydev, RTL8211F_INSR);
 	/* restore to default page 0 */
-	phy_write(phydev, RTL8211F_PAGE_SELECT, 0x0);
+	phy_write(phydev, RTL8211_PAGE_SELECT, 0x0);
 
 	return (err < 0) ? err : 0;
 }
@@ -102,7 +104,7 @@ static int rtl8211f_config_init(struct phy_device *phydev)
 	if (ret < 0)
 		return ret;
 
-	phy_write(phydev, RTL8211F_PAGE_SELECT, 0xd08);
+	phy_write(phydev, RTL8211_PAGE_SELECT, 0xd08);
 	reg = phy_read(phydev, 0x11);
 
 	/* enable TX-delay for rgmii-id and rgmii-txid, otherwise disable it */
@@ -114,7 +116,7 @@ static int rtl8211f_config_init(struct phy_device *phydev)
 
 	phy_write(phydev, 0x11, reg);
 	/* restore to default page 0 */
-	phy_write(phydev, RTL8211F_PAGE_SELECT, 0x0);
+	phy_write(phydev, RTL8211_PAGE_SELECT, 0x0);
 
 	return 0;
 }
-- 
2.12.2


* [PATCH 0/4] RTL8211E-specified hacks
From: Icenowy Zheng @ 2017-04-21 23:24 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, Rob Herring
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw, Icenowy Zheng

Some Pine64 boards are reported to have broken RTL8211E PHYs, which fail
to work in 1000BASE-T mode without a workaround.

The workaround was obtained from Pine64, and is said to come from a Realtek
engineer. It's undocumented but effective. (Tested on my Pine64 with broken
GbE.)

The first patch is a small tweak to the Realtek PHY driver, which removes the
"F" from the page select register name, as RTL8211E also uses the same register
for page select (although with a somewhat different definition).

The second patch adds a binding for the PHY, specific to this hack.

The third patch is the real driver part of this hack, which contains
some magic numbers from Pine64/Realtek.

The fourth patch is for reference only and should not be merged -- to
use it you will need sun8i-emac or dwmac-sun8i patchset applied.

Icenowy Zheng (4):
  net: phy: realtek: change macro name for page select register
  dt-bindings: add binding for RTL8211E Ethernet PHY
  net: phy: realtek: add disable RX delay hack for RTL8211E
  [DO NOT MERGE] arm64: allwinner: a64: enable RTL8211E PHY workaround

 .../devicetree/bindings/net/realtek,rtl8211e.txt   | 22 ++++++++++
 .../boot/dts/allwinner/sun50i-a64-pine64-plus.dts  |  4 ++
 drivers/net/phy/realtek.c                          | 48 +++++++++++++++++++---
 3 files changed, 69 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/realtek,rtl8211e.txt

-- 
2.12.2



* Re: [PATCH net-next v2 5/5] virtio-net: keep tx interrupts disabled unless kick
From: Willem de Bruijn @ 2017-04-21 23:13 UTC (permalink / raw)
  To: Jason Wang
  Cc: Network Development, Michael S. Tsirkin, virtualization,
	David Miller, Willem de Bruijn
In-Reply-To: <CAF=yD-KVwRkkGr=pGDmTTXtrrTuAuD7SmTPVtQuwu9aufoBeEQ@mail.gmail.com>

On Thu, Apr 20, 2017 at 10:03 AM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>>> -       if (!use_napi)
>>> +       if (use_napi) {
>>> +               if (kick)
>>> +                       virtqueue_enable_cb_delayed(sq->vq);
>>> +               else
>>> +                       virtqueue_disable_cb(sq->vq);
>>
>>
>> Since virtqueue_disable_cb() do nothing for event idx. I wonder whether or
>> not just calling enable_cb_dealyed() is ok here.
>
> Good point.
>
>> Btw, it does not disable interrupt at all, I propose a patch in the past
>> which can do more than this:
>>
>> https://patchwork.kernel.org/patch/6472601/
>
> Interesting. Yes, let me evaluate that variant.

In initial tests I don't see a significant change, but we can look
into this more closely as a follow-on patch.


* Re: [RFC] change the default Kconfig value of mlx5_en
From: Ian Kumlien @ 2017-04-21 23:10 UTC (permalink / raw)
  To: David Miller; +Cc: saeedm, Linux Kernel Network Developers
In-Reply-To: <CAA85sZtBo5dORg4sj8X_SROBU0ySh3_CQh1WiYcVX-RVv4E+gA@mail.gmail.com>

Sorry,

Back again, fighting cold, hot whiskey has been consumed...

Something like this would perhaps be a better solution:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 60154a175bd3..fe192e247601 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1139,6 +1139,10 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,

 #ifdef CONFIG_MLX5_CORE_EN
        mlx5_eswitch_attach(dev->priv.eswitch);
+#else
+       if (MLX5_CAP_GEN(dev, port_type) == MLX5_CAP_PORT_TYPE_ETH) {
+               dev_info(&pdev->dev, "Ethernet device discovered but support not enabled in kernel.");
+       }
 #endif

        err = mlx5_sriov_attach(dev);

This is in no way tested and just a thought for now - I suspect that a
better message would be required though...


* [PATCH net] udp: disable inner UDP checksum offloads in IPsec case
From: Ansis Atteka @ 2017-04-21 22:23 UTC (permalink / raw)
  To: netdev; +Cc: Ansis Atteka

Otherwise, UDP checksum offloads could corrupt ESP packets by attempting
to calculate UDP checksum when this inner UDP packet is already protected
by IPsec.

One way to reproduce this bug is to have a VM with the virtio_net driver
(UFO set to ON in the guest VM), then encapsulate all of the guest's
Ethernet frames in Geneve, and then further encrypt Geneve with IPsec.
In this case the following symptoms are observed:
1. If using an ixgbe NIC, it will complain with the following error message:
   ixgbe 0000:01:00.1: partial checksum but l4 proto=32!
2. The receiving IPsec stack will drop all the corrupted ESP packets and
   increase the XfrmInStateProtoError counter in /proc/net/xfrm_stat.
3. iperf UDP test from the VM with packet sizes above MTU will not work at
   all.
4. iperf TCP test from the VM will get ridiculously low performance.

Signed-off-by: Ansis Atteka <aatteka@ovn.org>
Co-authored-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/ipv4/udp_offload.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index b2be1d9..7812501 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -29,6 +29,7 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
 	u16 mac_len = skb->mac_len;
 	int udp_offset, outer_hlen;
 	__wsum partial;
+	bool need_ipsec;

 	if (unlikely(!pskb_may_pull(skb, tnl_hlen)))
 		goto out;
@@ -62,8 +63,10 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,

 	ufo = !!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP);

+	need_ipsec = skb_dst(skb) && dst_xfrm(skb_dst(skb));
 	/* Try to offload checksum if possible */
 	offload_csum = !!(need_csum &&
+			  !need_ipsec &&
 			  (skb->dev->features &
 			   (is_ipv6 ? (NETIF_F_HW_CSUM | NETIF_F_IPV6_CSUM) :
 				      (NETIF_F_HW_CSUM | NETIF_F_IP_CSUM))));
-- 
1.9.1
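
The decision the patch changes can be reduced to a single boolean expression. The following is a minimal sketch, assuming hypothetical `TOY_F_*` feature bits with arbitrary values in place of the kernel's NETIF_F_* flags and a hypothetical helper `toy_offload_csum()`; it mirrors the shape of the `offload_csum` expression in the hunk above and is not the kernel function itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for NETIF_F_* feature bits; values are arbitrary. */
#define TOY_F_HW_CSUM    (1u << 0)
#define TOY_F_IP_CSUM    (1u << 1)
#define TOY_F_IPV6_CSUM  (1u << 2)

/*
 * Offload the checksum only if one is needed, the packet is not going
 * to be transformed by IPsec, and the device advertises a checksum
 * feature that covers the address family.
 */
static bool toy_offload_csum(bool need_csum, bool need_ipsec,
			     bool is_ipv6, uint32_t dev_features)
{
	uint32_t wanted = TOY_F_HW_CSUM |
			  (is_ipv6 ? TOY_F_IPV6_CSUM : TOY_F_IP_CSUM);

	return need_csum && !need_ipsec && (dev_features & wanted) != 0;
}
```

With `need_ipsec` set, the function returns false regardless of the device features, so the checksum has to be computed in software before the packet is encrypted.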


* Re: [PATCH net] xfrm: calculate L4 checksums also for GSO case before encrypting packets
From: Ansis Atteka @ 2017-04-21 21:45 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: Ansis Atteka, netdev
In-Reply-To: <20170420094743.GI2649@secunet.com>

On 20 April 2017 at 02:47, Steffen Klassert
<steffen.klassert@secunet.com> wrote:
> On Tue, Apr 18, 2017 at 07:10:03PM -0700, Ansis Atteka wrote:
>>
>> However, after taking pointers from your patch I came up with this one
>> that may solve this problem once and for all (note, that I was seeing
>> this bug only with ixgbe NIC that supports tx csum offloads). I hope
>> it does not break any other IPsec tests that you have.
>>
>> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
>> index b2be1d9..7812501 100644
>> --- a/net/ipv4/udp_offload.c
>> +++ b/net/ipv4/udp_offload.c
>> @@ -29,6 +29,7 @@ static struct sk_buff
>> *__skb_udp_tunnel_segment(struct sk_buff *skb,
>>         u16 mac_len = skb->mac_len;
>>         int udp_offset, outer_hlen;
>>         __wsum partial;
>> +       bool need_ipsec;
>>
>>         if (unlikely(!pskb_may_pull(skb, tnl_hlen)))
>>                 goto out;
>> @@ -62,8 +63,10 @@ static struct sk_buff
>> *__skb_udp_tunnel_segment(struct sk_buff *skb,
>>
>>         ufo = !!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP);
>>
>> +       need_ipsec = skb_dst(skb) && dst_xfrm(skb_dst(skb));
>>         /* Try to offload checksum if possible */
>>         offload_csum = !!(need_csum &&
>> +                         !need_ipsec &&
>>                           (skb->dev->features &
>>                            (is_ipv6 ? (NETIF_F_HW_CSUM | NETIF_F_IPV6_CSUM) :
>>                                       (NETIF_F_HW_CSUM | NETIF_F_IP_CSUM))));
>
> This looks good, but we should fix udp4_ufo_fragment() too.
>
> Thanks!

I removed Geneve tunneling from equation and tried to run a simple
iperf underlay UDP test while IPsec was still enabled to observe
issues with the udp4_ufo_fragment() case.

Unfortunately, as can be seen from kernel tracer output below, I was
unable to come up with a test case where udp4_ufo_fragment function
would ever be invoked while IPsec is enabled:

admin1@ubuntu1:~/xfrm_test/net$ ifconfig em2.4001 | grep "inet addr"
          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
admin1@ubuntu1:~/xfrm_test/net$ ethtool -k em2.4001 | grep udp-fragmentation-offload
udp-fragmentation-offload: on
admin1@ubuntu1:~/xfrm_test/net$ sudo trace-cmd record -p function_graph -c -F iperf -c 192.168.1.2 -u -l20000
admin1@ubuntu1:~/xfrm_test/net$ trace-cmd report | grep udp4
admin1@ubuntu1:~/xfrm_test/net$


Nevertheless, after disabling IPsec and leaving everything else the
same, I start to see that udp4_ufo_fragment() gets invoked:

admin1@ubuntu1:~/xfrm_test/net$ trace-cmd report | grep udp4
           iperf-25466 [004] 242431.203307: funcgraph_entry:        0.113 us   |                  udp4_hwcsum();
           iperf-25466 [004] 242431.203360: funcgraph_entry:                   |                  udp4_ufo_fragment() {
           iperf-25466 [004] 242431.508436: funcgraph_entry:        0.080 us   |                  udp4_hwcsum();
           iperf-25466 [004] 242431.508542: funcgraph_entry:                   |                  udp4_ufo_fragment() {


However, the non-IPsec case does not have this ESP packet corruption
problem, because the packets are then sent in plaintext and can
utilize checksum offloads. Do we really have a problem there for
IPsec? I have not yet had time to look into the code to understand why
exactly udp4_ufo_fragment() is not called in the IPsec case, but since I
can't come up with a real-life test case, for now I will simply
resubmit the previous patch as-is to the netdev mailing list to solve the
problem I encountered previously in the Geneve tunneling case.

If you could drop more hints on how to come up with an IPsec test case
where udp4_ufo_fragment() is still invoked and packets get corrupted,
then I can later send another patch for that.


* [PATCH 2/2] igb: Remove useless argument
From: Benjamin Poirier @ 2017-04-21 21:20 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: Stefan Priebe, intel-wired-lan, netdev
In-Reply-To: <20170421212012.25950-1-bpoirier@suse.com>

Given that all callers of igb_update_stats() pass the same two arguments:
(adapter, &adapter->stats64), the second argument can be removed.

Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
---
 drivers/net/ethernet/intel/igb/igb.h         |  2 +-
 drivers/net/ethernet/intel/igb/igb_ethtool.c |  2 +-
 drivers/net/ethernet/intel/igb/igb_main.c    | 10 +++++-----
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index acbc3abe2ddd..3f0c06847fc2 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -593,7 +593,7 @@ void igb_setup_rctl(struct igb_adapter *);
 netdev_tx_t igb_xmit_frame_ring(struct sk_buff *, struct igb_ring *);
 void igb_unmap_and_free_tx_resource(struct igb_ring *, struct igb_tx_buffer *);
 void igb_alloc_rx_buffers(struct igb_ring *, u16);
-void igb_update_stats(struct igb_adapter *, struct rtnl_link_stats64 *);
+void igb_update_stats(struct igb_adapter *);
 bool igb_has_link(struct igb_adapter *adapter);
 void igb_set_ethtool_ops(struct net_device *);
 void igb_power_up_link(struct igb_adapter *);
diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 737b664d004c..8c913958c2eb 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2287,7 +2287,7 @@ static void igb_get_ethtool_stats(struct net_device *netdev,
 	char *p;
 
 	spin_lock(&adapter->stats64_lock);
-	igb_update_stats(adapter, net_stats);
+	igb_update_stats(adapter);
 
 	for (i = 0; i < IGB_GLOBAL_STATS_LEN; i++) {
 		p = (char *)adapter + igb_gstrings_stats[i].stat_offset;
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index be456bae8169..20da5e9d9d3c 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1815,7 +1815,7 @@ void igb_down(struct igb_adapter *adapter)
 
 	/* record the stats before reset*/
 	spin_lock(&adapter->stats64_lock);
-	igb_update_stats(adapter, &adapter->stats64);
+	igb_update_stats(adapter);
 	spin_unlock(&adapter->stats64_lock);
 
 	adapter->link_speed = 0;
@@ -4628,7 +4628,7 @@ static void igb_watchdog_task(struct work_struct *work)
 	}
 
 	spin_lock(&adapter->stats64_lock);
-	igb_update_stats(adapter, &adapter->stats64);
+	igb_update_stats(adapter);
 	spin_unlock(&adapter->stats64_lock);
 
 	for (i = 0; i < adapter->num_tx_queues; i++) {
@@ -5410,7 +5410,7 @@ static void igb_get_stats64(struct net_device *netdev,
 	struct igb_adapter *adapter = netdev_priv(netdev);
 
 	spin_lock(&adapter->stats64_lock);
-	igb_update_stats(adapter, &adapter->stats64);
+	igb_update_stats(adapter);
 	memcpy(stats, &adapter->stats64, sizeof(*stats));
 	spin_unlock(&adapter->stats64_lock);
 }
@@ -5459,9 +5459,9 @@ static int igb_change_mtu(struct net_device *netdev, int new_mtu)
  *  igb_update_stats - Update the board statistics counters
  *  @adapter: board private structure
  **/
-void igb_update_stats(struct igb_adapter *adapter,
-		      struct rtnl_link_stats64 *net_stats)
+void igb_update_stats(struct igb_adapter *adapter)
 {
+	struct rtnl_link_stats64 *net_stats = &adapter->stats64;
 	struct e1000_hw *hw = &adapter->hw;
 	struct pci_dev *pdev = adapter->pdev;
 	u32 reg, mpc;
-- 
2.12.2


* [PATCH 1/2] e1000e: Don't return uninitialized stats
From: Benjamin Poirier @ 2017-04-21 21:20 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: Stefan Priebe, intel-wired-lan, netdev

Some statistics passed to ethtool are garbage because e1000e_get_stats64()
doesn't write them, for example: tx_heartbeat_errors. This leaks kernel
memory to userspace and confuses users.

Do like ixgbe and use dev_get_stats() which first zeroes out
rtnl_link_stats64.

Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
---
 drivers/net/ethernet/intel/e1000e/ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c b/drivers/net/ethernet/intel/e1000e/ethtool.c
index 7aff68a4a4df..f117b90cdc2f 100644
--- a/drivers/net/ethernet/intel/e1000e/ethtool.c
+++ b/drivers/net/ethernet/intel/e1000e/ethtool.c
@@ -2063,7 +2063,7 @@ static void e1000_get_ethtool_stats(struct net_device *netdev,
 
 	pm_runtime_get_sync(netdev->dev.parent);
 
-	e1000e_get_stats64(netdev, &net_stats);
+	dev_get_stats(netdev, &net_stats);
 
 	pm_runtime_put_sync(netdev->dev.parent);
 
-- 
2.12.2
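
The difference between calling the driver callback directly and going through a wrapper that zeroes the structure first can be sketched in plain C. This is a toy model under stated assumptions: `toy_stats`, `toy_driver_get_stats()` and `toy_dev_get_stats()` are hypothetical stand-ins for rtnl_link_stats64, e1000e_get_stats64() and dev_get_stats(), reduced to the one property the commit message relies on.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, much smaller stand-in for struct rtnl_link_stats64. */
struct toy_stats {
	uint64_t rx_packets;
	uint64_t tx_packets;
	uint64_t tx_heartbeat_errors;  /* never written by the driver callback */
};

/* Like a driver's stats callback: fills only the fields it tracks. */
static void toy_driver_get_stats(struct toy_stats *s)
{
	s->rx_packets = 10;
	s->tx_packets = 20;
}

/* Like the core wrapper: zero the whole struct, then call the driver. */
static void toy_dev_get_stats(struct toy_stats *s)
{
	memset(s, 0, sizeof(*s));
	toy_driver_get_stats(s);
}
```

If the caller's buffer starts out holding stale data, only the wrapper guarantees that `tx_heartbeat_errors` reads as zero afterwards; the direct call leaves whatever was in memory.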


* [PATCH] macsec: avoid heap overflow in skb_to_sgvec
From: Jason A. Donenfeld @ 2017-04-21 21:14 UTC (permalink / raw)
  To: netdev, linux-kernel, davem; +Cc: Jason A. Donenfeld, stable, security

While this may appear as a humdrum one line change, it's actually quite
important. An sk_buff stores data in three places:

1. A linear chunk of allocated memory in skb->data. This is the easiest
   one to work with, but it precludes using scattergather since the memory
   must be linear.
2. The array skb_shinfo(skb)->frags, which is of maximum length
   MAX_SKB_FRAGS. This is nice for scattergather, since these fragments
   can point to different pages.
3. skb_shinfo(skb)->frag_list, which is a pointer to another sk_buff,
   which in turn can have data in either (1) or (2).

The first two are rather easy to deal with, since they're of a fixed
maximum length, while the third one is not, since there can be
potentially limitless chains of fragments. Fortunately dealing with
frag_list is opt-in for drivers, so drivers don't actually have to deal
with this mess. For whatever reason, macsec decided it wanted pain, and
so it explicitly specified NETIF_F_FRAGLIST.

Because dealing with (1), (2), and (3) is insane, most users of sk_buff
doing any sort of crypto or paging operation call a convenient function
called skb_to_sgvec (which happens to be recursive if (3) is in use!).
This takes an sk_buff as input, and writes into its output pointer an
array of scattergather list items. Sometimes people like to declare a
fixed size scattergather list on the stack; other times people like to
allocate a fixed size scattergather list on the heap. However, if you're
doing it in a fixed-size fashion, you really shouldn't be using
NETIF_F_FRAGLIST too (unless you're also ensuring the sk_buff and its
frag_list children aren't shared and then you check the number of
fragments in total required.)

Macsec specifically does this:

        size += sizeof(struct scatterlist) * (MAX_SKB_FRAGS + 1);
        tmp = kmalloc(size, GFP_ATOMIC);
        *sg = (struct scatterlist *)(tmp + sg_offset);
	...
        sg_init_table(sg, MAX_SKB_FRAGS + 1);
        skb_to_sgvec(skb, sg, 0, skb->len);

Specifying MAX_SKB_FRAGS + 1 is the right answer usually, but not if you're
using NETIF_F_FRAGLIST, in which case the call to skb_to_sgvec will
overflow the heap, and disaster ensues.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: stable@vger.kernel.org
Cc: security@kernel.org
---
 drivers/net/macsec.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index ff0a5ed3ca80..dbab05afcdbe 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -2716,7 +2716,7 @@ static netdev_tx_t macsec_start_xmit(struct sk_buff *skb,
 }

 #define MACSEC_FEATURES \
-	(NETIF_F_SG | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST)
+	(NETIF_F_SG | NETIF_F_HIGHDMA)
 static struct lock_class_key macsec_netdev_addr_lock_key;

 static int macsec_dev_init(struct net_device *dev)
-- 
2.12.2
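
Why a table of MAX_SKB_FRAGS + 1 entries stops being enough once frag_list is in play can be shown with a toy buffer model. This is a minimal sketch, not kernel code: `toy_skb`, `toy_sg_entries_needed()` and `toy_fits_fixed_table()` are hypothetical names, `TOY_MAX_FRAGS` is an illustrative value, and the entry count is a simplification of what skb_to_sgvec would actually emit.

```c
#include <stddef.h>

#define TOY_MAX_FRAGS 17  /* stand-in for MAX_SKB_FRAGS; value is illustrative */

/* Toy model (hypothetical) of the three data locations in the message above. */
struct toy_skb {
	size_t headlen;             /* bytes in the linear area, case (1) */
	size_t nr_frags;            /* used entries in the frags array, case (2) */
	struct toy_skb *frag_list;  /* first child buffer, case (3) */
	struct toy_skb *next;       /* next sibling on a parent's frag_list */
};

/* Count the scatterlist entries a full mapping needs, recursing into frag_list. */
static size_t toy_sg_entries_needed(const struct toy_skb *skb)
{
	size_t n = (skb->headlen ? 1 : 0) + skb->nr_frags;
	const struct toy_skb *child;

	for (child = skb->frag_list; child; child = child->next)
		n += toy_sg_entries_needed(child);
	return n;
}

/* A table of TOY_MAX_FRAGS + 1 entries is only safe when this returns nonzero. */
static int toy_fits_fixed_table(const struct toy_skb *skb)
{
	return toy_sg_entries_needed(skb) <= TOY_MAX_FRAGS + 1;
}
```

A buffer with a linear area and a full frags array needs exactly TOY_MAX_FRAGS + 1 entries and fits; attaching even one child through `frag_list` pushes the count past the fixed table, which is the overflow the patch avoids by not advertising NETIF_F_FRAGLIST.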


* Re: macvlan: Fix device ref leak when purging bc_queue
From: Joe.Ghalam @ 2017-04-21 20:37 UTC (permalink / raw)
  To: maheshb; +Cc: herbert, davem, Clifford.Wichmann, netdev
In-Reply-To: <CAF2d9jis_NcDO4bfOuf1OxPf_1XvOJQA1eo0OKW5SEk2O-Ny0A@mail.gmail.com>

________________________________________
> From: Mahesh Bandewar (महेश बंडेवार) <maheshb@google.com>
> Sent: Friday, April 21, 2017 12:23 PM
> To: Ghalam, Joe
> Cc: herbert@gondor.apana.org.au; David Miller; Wichmann, Clifford; linux-netdev
> Subject: Re: macvlan: Fix device ref leak when purging bc_queue

> May be the system is busy and snapshot is too small, and eventually
> process_broadcast() should get called. Deleting a slave does nothing
> about cancelling the work-queue so it would happen eventually.

> The change that Herbert proposed is correct. When packets are enqueued
> for processing later a dev reference is taken and it's removed when
> it's processed when it gets scheduled. The backlog is per port so it
> makes sense to remove reference(s) before purging the queue prior to
> deleting the port.

I only included the snapshot of the logs that's relevant. The system in question has been left in that state for hours, without ever seeing process_broadcast() being called. And yes, I did check the CPU load, and the system was running at around 20% load, so I don't think that's the case. I would suggest taking a closer look at the code in macvlan_dellink(), where it performs the unlink and unregister:

void macvlan_dellink(struct net_device *dev, struct list_head *head)
{
	struct macvlan_dev *vlan = netdev_priv(dev);
	list_del_rcu(&vlan->list);
	unregister_netdevice_queue(dev, head);
	netdev_upper_dev_unlink(vlan->lowerdev, dev);
}

As I stated in my reply to Herbert initially, the code change he suggested is correct and needed, but not enough. We have tested with his code change and observed the same behavior. I can guarantee you that the code change to macvlan_port_destroy() has no effect on this issue, since macvlan_port_destroy() is not even called during the operation.

Here is the forced stack trace that I caused to show the removal call:
Apr 20 06:23:40 OS10 kernel:  [<ffffffff810d312c>] __netdev_adjacent_dev_remove+0x3c/0x1a0
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81bb6e87>] __netdev_adjacent_dev_unlink_lists+0x67/0x69
Apr 20 06:23:40 OS10 kernel:  [<ffffffff810d32a0>] __netdev_adjacent_dev_unlink+0x82/0x40
Apr 20 06:23:40 OS10 kernel:  [<ffffffff811d31e0>] netdev_upper_dev_unlink+0x10/0x20
Apr 20 06:23:40 OS10 kernel:  [<ffffffff8180e770>] macvlan_dellink+0x50/0x130
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a2ca27>] rtnl_dellink+0xb7/0x120
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a609ab>] ? __netlink_ns_capable+0x3b/0x40
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a2a6c5>] rtnetlink_rcv_msg+0x95/0x250
Apr 20 06:23:40 OS10 kernel:  [<ffffffff811c1499>] ? zone_statistics+0x89/0xa0
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a0a9de>] ? __alloc_skb+0x7e/0x2a0
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a2a630>] ? rtnetlink_rcv+0x30/0x30
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a64f59>] netlink_rcv_skb+0xa9/0xc0
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a2a628>] rtnetlink_rcv+0x28/0x30
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a64603>] netlink_unicast+0xf3/0x200
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a64a1e>] netlink_sendmsg+0x30e/0x680
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a014fb>] sock_sendmsg+0x8b/0xc0
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a011ee>] ? move_addr_to_kernel.part.18+0x1e/0x60
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a01ff1>] ? move_addr_to_kernel+0x21/0x30
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a018f6>] ___sys_sendmsg+0x376/0x390
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a0019f>] ? sock_destroy_inode+0x2f/0x40
Apr 20 06:23:40 OS10 kernel:  [<ffffffff810a161c>] ? __do_page_fault+0x20c/0x560
Apr 20 06:23:40 OS10 kernel:  [<ffffffff812279ad>] ? dput+0xad/0x180
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81230a74>] ? mntput+0x24/0x40
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81212a50>] ? __fput+0x190/0x220
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a026b2>] __sys_sendmsg+0x42/0x80
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81a02702>] SyS_sendmsg+0x12/0x20
Apr 20 06:23:40 OS10 kernel:  [<ffffffff81bc86cd>] system_call_fast_compare_end+0x10/0x15


* Re: [PATCH net-next 0/3] l3mdev: Improve use with main table
From: David Ahern @ 2017-04-21 20:47 UTC (permalink / raw)
  To: Robert Shearman, davem; +Cc: netdev
In-Reply-To: <a4360e77-4386-2760-0b7e-72a21b3aad99@brocade.com>

On 4/21/17 11:44 AM, Robert Shearman wrote:
> 
> Can you send me some more details of your testing?

It's a shell script that runs a long list of combinations of client,
server and local traffic for ipv4 and ipv6 with addresses on the
external interface, the vrf device and 127.0.0.1 on the VRF device and
link local and mcast addresses for IPv6. Tests are run for icmp, tcp and
udp - and includes negative testing to verify tcp resets or icmp
unreachables are properly sent and received. There are so many
combinations. And then you have to check tcp_l3mdev_accept = {0,1} and
udp_l3mdev_accept = {0,1}.

For your 'main table' vrf, run through the different combinations and
check IPv6 link-local addressing too.

^ permalink raw reply

* [PATCH net-next 3/7] ibmvnic: Only retrieve error info if present
From: Nathan Fontenot @ 2017-04-21 19:38 UTC (permalink / raw)
  To: netdev; +Cc: brking, jallen, muvic, tlfalcon
In-Reply-To: <20170421193627.11030.34813.stgit@ltcalpine2-lp23.aus.stglabs.ibm.com>

When handling a fatal error in the driver, there can be additional
error information provided by the vios. This information is not
always present, so only retrieve the additional error information
when present.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c |   71 ++++++++++++++++++++++++++----------
 1 file changed, 51 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 0f359543..cc34bf9 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2361,25 +2361,22 @@ static void handle_error_info_rsp(union ibmvnic_crq *crq,
 	kfree(error_buff);
 }
 
-static void handle_error_indication(union ibmvnic_crq *crq,
-				    struct ibmvnic_adapter *adapter)
+static void request_error_information(struct ibmvnic_adapter *adapter,
+				      union ibmvnic_crq *err_crq)
 {
-	int detail_len = be32_to_cpu(crq->error_indication.detail_error_sz);
 	struct device *dev = &adapter->vdev->dev;
+	struct net_device *netdev = adapter->netdev;
 	struct ibmvnic_error_buff *error_buff;
-	union ibmvnic_crq new_crq;
+	unsigned long timeout = msecs_to_jiffies(30000);
+	union ibmvnic_crq crq;
 	unsigned long flags;
-
-	dev_err(dev, "Firmware reports %serror id %x, cause %d\n",
-		crq->error_indication.
-		    flags & IBMVNIC_FATAL_ERROR ? "FATAL " : "",
-		be32_to_cpu(crq->error_indication.error_id),
-		be16_to_cpu(crq->error_indication.error_cause));
+	int rc, detail_len;
 
 	error_buff = kmalloc(sizeof(*error_buff), GFP_ATOMIC);
 	if (!error_buff)
 		return;
 
+	detail_len = be32_to_cpu(err_crq->error_indication.detail_error_sz);
 	error_buff->buff = kmalloc(detail_len, GFP_ATOMIC);
 	if (!error_buff->buff) {
 		kfree(error_buff);
@@ -2389,27 +2386,61 @@ static void handle_error_indication(union ibmvnic_crq *crq,
 	error_buff->dma = dma_map_single(dev, error_buff->buff, detail_len,
 					 DMA_FROM_DEVICE);
 	if (dma_mapping_error(dev, error_buff->dma)) {
-		if (!firmware_has_feature(FW_FEATURE_CMO))
-			dev_err(dev, "Couldn't map error buffer\n");
+		netdev_err(netdev, "Couldn't map error buffer\n");
 		kfree(error_buff->buff);
 		kfree(error_buff);
 		return;
 	}
 
 	error_buff->len = detail_len;
-	error_buff->error_id = crq->error_indication.error_id;
+	error_buff->error_id = err_crq->error_indication.error_id;
 
 	spin_lock_irqsave(&adapter->error_list_lock, flags);
 	list_add_tail(&error_buff->list, &adapter->errors);
 	spin_unlock_irqrestore(&adapter->error_list_lock, flags);
 
-	memset(&new_crq, 0, sizeof(new_crq));
-	new_crq.request_error_info.first = IBMVNIC_CRQ_CMD;
-	new_crq.request_error_info.cmd = REQUEST_ERROR_INFO;
-	new_crq.request_error_info.ioba = cpu_to_be32(error_buff->dma);
-	new_crq.request_error_info.len = cpu_to_be32(detail_len);
-	new_crq.request_error_info.error_id = crq->error_indication.error_id;
-	ibmvnic_send_crq(adapter, &new_crq);
+	memset(&crq, 0, sizeof(crq));
+	crq.request_error_info.first = IBMVNIC_CRQ_CMD;
+	crq.request_error_info.cmd = REQUEST_ERROR_INFO;
+	crq.request_error_info.ioba = cpu_to_be32(error_buff->dma);
+	crq.request_error_info.len = cpu_to_be32(detail_len);
+	crq.request_error_info.error_id = err_crq->error_indication.error_id;
+
+	rc = ibmvnic_send_crq(adapter, &crq);
+	if (rc) {
+		netdev_err(netdev, "failed to request error information\n");
+		goto err_info_fail;
+	}
+
+	if (!wait_for_completion_timeout(&adapter->init_done, timeout)) {
+		netdev_err(netdev, "timeout waiting for error information\n");
+		goto err_info_fail;
+	}
+
+	return;
+
+err_info_fail:
+	spin_lock_irqsave(&adapter->error_list_lock, flags);
+	list_del(&error_buff->list);
+	spin_unlock_irqrestore(&adapter->error_list_lock, flags);
+
+	kfree(error_buff->buff);
+	kfree(error_buff);
+}
+
+static void handle_error_indication(union ibmvnic_crq *crq,
+				    struct ibmvnic_adapter *adapter)
+{
+	struct device *dev = &adapter->vdev->dev;
+
+	dev_err(dev, "Firmware reports %serror id %x, cause %d\n",
+		crq->error_indication.flags
+			& IBMVNIC_FATAL_ERROR ? "FATAL " : "",
+		be32_to_cpu(crq->error_indication.error_id),
+		be16_to_cpu(crq->error_indication.error_cause));
+
+	if (be32_to_cpu(crq->error_indication.error_id))
+		request_error_information(adapter, crq);
 }
 
 static void handle_change_mac_rsp(union ibmvnic_crq *crq,
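
As a rough illustration of the new gate in handle_error_indication(), the
userspace sketch below decodes a big-endian error_id and requests details
only when it is nonzero. Everything prefixed with model_ is hypothetical
and merely stands in for the driver's types and for be32_to_cpu(); it is
not the driver code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode a 32-bit big-endian field byte by byte; a portable userspace
 * stand-in for the kernel's be32_to_cpu(). */
static uint32_t model_be32_to_cpu(const uint8_t be[4])
{
	return ((uint32_t)be[0] << 24) | ((uint32_t)be[1] << 16) |
	       ((uint32_t)be[2] << 8) | (uint32_t)be[3];
}

/* Hypothetical, trimmed-down view of the error indication CRQ. */
struct model_error_indication {
	uint8_t error_id[4];        /* big-endian on the wire */
	uint8_t detail_error_sz[4]; /* big-endian on the wire */
};

/* Mirrors the patched check: ask for details only when error_id != 0. */
static bool model_should_request_details(const struct model_error_indication *ind)
{
	return model_be32_to_cpu(ind->error_id) != 0;
}
```

Zero is zero in either byte order, so the conversion does not change the
result of this particular test; it keeps the comparison type-correct.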

^ permalink raw reply related

* Re: [PATCH net v2] ipv4: Avoid caching l3mdev dst on mismatched local route
From: David Ahern @ 2017-04-21 20:37 UTC (permalink / raw)
  To: Robert Shearman, davem; +Cc: netdev
In-Reply-To: <1492806899-6215-1-git-send-email-rshearma@brocade.com>

On 4/21/17 2:34 PM, Robert Shearman wrote:
> David reported that doing the following:
> 
>     ip li add red type vrf table 10
>     ip link set dev eth1 vrf red
>     ip addr add 127.0.0.1/8 dev red
>     ip link set dev eth1 up
>     ip li set red up
>     ping -c1 -w1 -I red 127.0.0.1
>     ip li del red
> 
> when either policy routing IP rules are present or the local table
> lookup ip rule is before the l3mdev lookup results in a hang with
> these messages:
> 
>     unregister_netdevice: waiting for red to become free. Usage count = 1
> 
> The problem is caused by caching the dst used for sending the packet
> out of the specified interface on a local route with a different
> nexthop interface. Thus the dst could stay around until the route in
> the table where the lookup was done is deleted, which may be never.
> 
> Address the problem by not forcing output device to be the l3mdev in
> the flow's output interface if the lookup didn't use the l3mdev. This
> then results in the dst using the right device according to the route.
> 
> Changes in v2:
>  - make the dev_out passed in by __ip_route_output_key_hash correct
>    instead of checking the nh dev if FLOWI_FLAG_SKIP_NH_OIF is set as
>    suggested by David.
> 
> Fixes: 5f02ce24c2696 ("net: l3mdev: Allow the l3mdev to be a loopback")
> Reported-by: David Ahern <dsa@cumulusnetworks.com>
> Suggested-by: David Ahern <dsa@cumulusnetworks.com>
> Signed-off-by: Robert Shearman <rshearma@brocade.com>
> ---
>  net/ipv4/route.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index acd69cfe2951..d9724889ff09 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2359,7 +2359,8 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
>  		}
>  
>  		/* L3 master device is the loopback for that domain */
> -		dev_out = l3mdev_master_dev_rcu(dev_out) ? : net->loopback_dev;
> +		dev_out = l3mdev_master_dev_rcu(FIB_RES_DEV(res)) ? :
> +			net->loopback_dev;
>  		fl4->flowi4_oif = dev_out->ifindex;
>  		flags |= RTCF_LOCAL;
>  		goto make_route;
> 

LGTM

Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>

^ permalink raw reply

* [PATCH net v2] ipv4: Avoid caching l3mdev dst on mismatched local route
From: Robert Shearman @ 2017-04-21 20:34 UTC (permalink / raw)
  To: davem; +Cc: netdev, David Ahern, Robert Shearman
In-Reply-To: <f3a3ab2d-4a32-a3ec-6a72-071e56577017@brocade.com>

David reported that doing the following:

    ip li add red type vrf table 10
    ip link set dev eth1 vrf red
    ip addr add 127.0.0.1/8 dev red
    ip link set dev eth1 up
    ip li set red up
    ping -c1 -w1 -I red 127.0.0.1
    ip li del red

when either policy routing IP rules are present or the local table
lookup ip rule is before the l3mdev lookup results in a hang with
these messages:

    unregister_netdevice: waiting for red to become free. Usage count = 1

The problem is caused by caching the dst used for sending the packet
out of the specified interface on a local route with a different
nexthop interface. Thus the dst could stay around until the route in
the table where the lookup was done is deleted, which may be never.

Address the problem by not forcing output device to be the l3mdev in
the flow's output interface if the lookup didn't use the l3mdev. This
then results in the dst using the right device according to the route.

Changes in v2:
 - make the dev_out passed in by __ip_route_output_key_hash correct
   instead of checking the nh dev if FLOWI_FLAG_SKIP_NH_OIF is set as
   suggested by David.

Fixes: 5f02ce24c2696 ("net: l3mdev: Allow the l3mdev to be a loopback")
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Suggested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
---
 net/ipv4/route.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index acd69cfe2951..d9724889ff09 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2359,7 +2359,8 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
 		}
 
 		/* L3 master device is the loopback for that domain */
-		dev_out = l3mdev_master_dev_rcu(dev_out) ? : net->loopback_dev;
+		dev_out = l3mdev_master_dev_rcu(FIB_RES_DEV(res)) ? :
+			net->loopback_dev;
 		fl4->flowi4_oif = dev_out->ifindex;
 		flags |= RTCF_LOCAL;
 		goto make_route;
-- 
2.1.4
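
To make the one-line change easier to follow, here is a minimal userspace
sketch of the device selection after the fix. struct model_dev and both
helpers are hypothetical; model_l3mdev_master() is a simplified reading of
l3mdev_master_dev_rcu() (a master maps to itself, an enslaved device to its
master), which is an assumption rather than something shown in this patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device; only the fields the
 * sketch needs. */
struct model_dev {
	int ifindex;
	bool is_l3_master;              /* e.g. a VRF device such as "red" */
	const struct model_dev *master; /* enslaving L3 master, or NULL */
};

/* Simplified model of l3mdev_master_dev_rcu(): a master maps to itself,
 * an enslaved device maps to its master, anything else to NULL. */
static const struct model_dev *model_l3mdev_master(const struct model_dev *dev)
{
	if (!dev)
		return NULL;
	if (dev->is_l3_master)
		return dev;
	return dev->master;
}

/* Output device for a local route after the fix: derive it from the
 * device the FIB lookup returned, falling back to the loopback. */
static const struct model_dev *model_local_dev_out(const struct model_dev *res_dev,
						   const struct model_dev *loopback)
{
	const struct model_dev *master = model_l3mdev_master(res_dev);

	return master ? master : loopback;
}
```

In the reported scenario the lookup matches the local route on lo rather
than one in the VRF table, so the model returns the loopback and the dst is
no longer tied to red.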

^ permalink raw reply related

* Re: Heads-up: two regressions in v4.11-rc series
From: Frederic Weisbecker @ 2017-04-21 20:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Jesper Dangaard Brouer, Andrew Morton, Tariq Toukan,
	LKML, linux-mm, netdev@vger.kernel.org, Peter Zijlstra
In-Reply-To: <CA+55aFznH8Y9_okyQ=dU1AeJL8rtHx=n5DqT3sGJj7kr6QMYXA@mail.gmail.com>

On Fri, Apr 21, 2017 at 10:52:29AM -0700, Linus Torvalds wrote:
> On Thu, Apr 20, 2017 at 7:30 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
> >> The end result was a revert, and this is waiting in AKPMs quilt queue:
> >>  http://ozlabs.org/~akpm/mmots/broken-out/revert-mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch
> >>
> >
> > This was flagged to Andrew that it should go in for either 4.11 or if
> > there were concerns about how close to the release we are then put it in
> > for 4.11-stable. At worst, I can do a resubmit to -stable myself after
> > it gets merged in the next window if it falls between the cracks.
> 
> This got merged (commit d34b0733b452: "Revert "mm, page_alloc: only
> use per-cpu allocator for irq-safe requests"").
> 
> The other issue (caused by commit a499a5a14dbd: "sched/cputime:
> Increment kcpustat directly on irqtime account") is still open.
> 
> Frederic? Revert? But I guess it's something we can delay for
> backporting, it's presumably not possible to hit maliciously except on
> some fast local network attacker just causing an effective DoS.

I can't speak to the security impact. But indeed I think we should rather
delay for backporting if we can't manage to fix it in the coming days.
Especially as you can't revert this patch alone: it's part of a whole series
of ~30 commits that removed cputime_t and it's in the middle of the series,
so those that come after depend on it and those that come before just don't
make sense alone.

But I'll fix this ASAP.

Thanks.


^ permalink raw reply

* Re: [Patch net] ip6mr: avoid double unregister of pim6reg device
From: Nikolay Aleksandrov @ 2017-04-21 20:25 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, Andrey Konovalov, Linus Torvalds
In-Reply-To: <CAM_iQpXSP5ECw6ZXZXf44xb1Hh-yk4egbR55ZpjQwusBAOcgMw@mail.gmail.com>

On 21/04/17 23:20, Cong Wang wrote:
> On Fri, Apr 21, 2017 at 12:34 PM, Nikolay Aleksandrov
> <nikolay@cumulusnetworks.com> wrote:
>> On 21/04/17 22:27, Cong Wang wrote:
>>> If we unregister the pim6reg device via default_device_exit_batch(),
>>> we will receive a notification and ip6mr_device_event() will
>>> unregister it again. This causes a kernel BUG at net/core/dev.c:6813.
>>>
>>> Like commit 7dc00c82cbb0 ("ipv4: Fix ipmr unregister device oops")
>>> we should avoid double-unregister in netdevice notifier.
>>>
>>> Reported-by: Andrey Konovalov <andreyknvl@google.com>
>>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>>> ---
>>>  net/ipv6/ip6mr.c | 11 ++++++-----
>>>  1 file changed, 6 insertions(+), 5 deletions(-)
>>>
>>
>> Cong,
>> Please read the full thread, I've already provided a fix which is similar.
>> https://patchwork.ozlabs.org/patch/753531/
> 
> You beat me to that. ;) You should leave a reply to Andrey's report,
> otherwise people could miss it.
> 

The patch was sent as a reply to the original report. :-)

^ permalink raw reply

* Re: [Patch net] ip6mr: avoid double unregister of pim6reg device
From: Cong Wang @ 2017-04-21 20:20 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Linux Kernel Network Developers, Andrey Konovalov, Linus Torvalds
In-Reply-To: <db28d88e-ed4d-de81-fea2-c404a5b89af0@cumulusnetworks.com>

On Fri, Apr 21, 2017 at 12:34 PM, Nikolay Aleksandrov
<nikolay@cumulusnetworks.com> wrote:
> On 21/04/17 22:27, Cong Wang wrote:
>> If we unregister the pim6reg device via default_device_exit_batch(),
>> we will receive a notification and ip6mr_device_event() will
>> unregister it again. This causes a kernel BUG at net/core/dev.c:6813.
>>
>> Like commit 7dc00c82cbb0 ("ipv4: Fix ipmr unregister device oops")
>> we should avoid double-unregister in netdevice notifier.
>>
>> Reported-by: Andrey Konovalov <andreyknvl@google.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> ---
>>  net/ipv6/ip6mr.c | 11 ++++++-----
>>  1 file changed, 6 insertions(+), 5 deletions(-)
>>
>
> Cong,
> Please read the full thread, I've already provided a fix which is similar.
> https://patchwork.ozlabs.org/patch/753531/

You beat me to that. ;) You should leave a reply to Andrey's report,
otherwise people could miss it.

^ permalink raw reply

* Re: [PATCH net-next] Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning
From: Eric Dumazet @ 2017-04-21 20:07 UTC (permalink / raw)
  To: David Miller; +Cc: tedheadster, netdev
In-Reply-To: <20170421.160052.848351578908155648.davem@davemloft.net>

On Fri, 2017-04-21 at 16:00 -0400, David Miller wrote:

> That's true, I'll kill this.
> 

Thanks !

^ permalink raw reply

* Re: [PATCH net-next] Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning
From: David Miller @ 2017-04-21 20:00 UTC (permalink / raw)
  To: eric.dumazet; +Cc: tedheadster, netdev
In-Reply-To: <1492804622.6453.32.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 21 Apr 2017 12:57:02 -0700

> On Fri, 2017-04-21 at 13:22 -0400, David Miller wrote:
>> From: Matthew Whitehead <tedheadster@gmail.com>
>> Date: Wed, 19 Apr 2017 12:37:10 -0400
>> 
>> > Constants used for tuning are generally a bad idea, especially as hardware
>> > changes over time. Replace the constant 2 jiffies with sysctl variable
>> > netdev_budget_usecs to enable sysadmins to tune the softirq processing.
>> > Also document the variable.
>> > 
>> > For example, a very fast machine might tune this to 1000 microseconds,
>> > while my regression testing 486DX-25 needs it to be 4000 microseconds on
>> > a nearly idle network to prevent time_squeeze from being incremented.
>> > 
>> > Version 2: changed jiffies to microseconds for predictable units.
>> > 
>> > Signed-off-by: Matthew Whitehead <tedheadster@gmail.com>
>> 
>> Applied, thanks.
> 
> Can we revert the changes in kernel/sysctl_binary.c &
> include/uapi/linux/sysctl.h ?
> 
> { CTL_INT,      NET_CORE_BUDGET_USECS,  "netdev_budget_usecs" },
> 
> NET_CORE_BUDGET_USECS=23,
> 
> Unless I am missing something, we should not add new binary sysctls.

That's true, I'll kill this.

====================
[PATCH] net: Remove NET_CORE_BUDGET_USECS from sysctl binary interface.

We are not supposed to add new entries to this thing
any more.

Thanks to Eric Dumazet for noticing this.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/uapi/linux/sysctl.h | 1 -
 kernel/sysctl_binary.c      | 1 -
 2 files changed, 2 deletions(-)

diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index 177f5f1..e13d480 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -274,7 +274,6 @@ enum
 	NET_CORE_AEVENT_ETIME=20,
 	NET_CORE_AEVENT_RSEQTH=21,
 	NET_CORE_WARNINGS=22,
-	NET_CORE_BUDGET_USECS=23,
 };
 
 /* /proc/sys/net/ethernet */
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 4ee3e49..ece4b17 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -197,7 +197,6 @@ static const struct bin_table bin_net_core_table[] = {
 	{ CTL_INT,	NET_CORE_AEVENT_ETIME,	"xfrm_aevent_etime" },
 	{ CTL_INT,	NET_CORE_AEVENT_RSEQTH,	"xfrm_aevent_rseqth" },
 	{ CTL_INT,	NET_CORE_WARNINGS,	"warnings" },
-	{ CTL_INT,	NET_CORE_BUDGET_USECS,	"netdev_budget_usecs" },
 	{},
 };
 
-- 
2.4.11

^ permalink raw reply related

* Re: [PATCH net] ip6mr: fix notification device destruction
From: David Miller @ 2017-04-21 19:58 UTC (permalink / raw)
  To: nikolay
  Cc: netdev, yoshfuji, dvyukov, kcc, syzkaller, edumazet, roopa,
	torvalds, linux-kernel
In-Reply-To: <1e536c10-3bbb-f39b-d5dc-c397c121dce0@cumulusnetworks.com>

From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Date: Fri, 21 Apr 2017 22:56:26 +0300

> On 21/04/17 22:50, Nikolay Aleksandrov wrote:
>> On 21/04/17 22:36, David Miller wrote:
>>> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>>> Date: Fri, 21 Apr 2017 21:30:42 +0300
>>>
>>>> On 21/04/17 20:42, Nikolay Aleksandrov wrote:
>>>>> Andrey Konovalov reported a BUG caused by the ip6mr code which is caused
>>>>> because we call unregister_netdevice_many for a device that is already
>>>>> being destroyed. In IPv4's ipmr that has been resolved by two commits
>>>>> long time ago by introducing the "notify" parameter to the delete
>>>>> function and avoiding the unregister when called from a notifier, so
>>>>> let's do the same for ip6mr.
>>>  ...
>>>> +CC LKML and Linus
>>>
>>> Applied, thanks Nikolay and thanks Andrey for the report and testing.
>>>
>>> Nikolay, how far does this bug go back?
>>>
>> 
>> Good question, AFAICS since ip6mr exists because it was copied from ipmr:
>> commit 7bc570c8b4f7
>> Author: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
>> Date:   Thu Apr 3 09:22:53 2008 +0900
>> 
>>     [IPV6] MROUTE: Support multicast forwarding.
>> 
>> 
> 
> Oops no, my bad. That wouldn't cause it to BUG because it was already removed by mif6_delete
> earlier. So since it can be destroyed by a netns exiting, currently I don't see any other
> way which is outside of ip6mr for destroying that device.
> 
> That should be:
> commit 8229efdaef1e
> Author: Benjamin Thery <benjamin.thery@bull.net>
> Date:   Wed Dec 10 16:30:15 2008 -0800
> 
>     netns: ip6mr: enable namespace support in ipv6 multicast forwarding code
> 
> 
> Which allowed the notifier to be executed for pimreg devices in other network namespaces.

That still makes it -stable material as far as I'm concerned.

Thanks again! :)
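
For readers who have not seen the ipmr fix referred to above, the "notify"
idea can be modelled in a few lines of userspace C. All names here are
hypothetical; the sketch only assumes what the thread describes, namely
that the delete helper skips the unregister when it is called from the
netdevice notifier.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the pim6reg net_device state. */
struct model_reg_dev {
	bool registered;
	int unregister_calls;
};

/* Stand-in for queueing a device unregister. Doing this twice for one
 * device is the bug, so the model simply counts the calls. */
static void model_unregister(struct model_reg_dev *dev)
{
	dev->unregister_calls++;
	dev->registered = false;
}

/* Model of a delete helper with a "notify" parameter. A nonzero notify
 * means the call comes from the netdevice notifier, i.e. the core is
 * already tearing the device down, so only the multicast bookkeeping
 * is dropped. */
static void model_mif_delete(struct model_reg_dev *dev, bool *mif_present,
			     int notify)
{
	*mif_present = false;
	if (!notify)
		model_unregister(dev);
}
```

With notify set, a netns exit followed by the notifier callback results in
exactly one unregister instead of two.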

^ permalink raw reply

* Re: [PATCH net-next] Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning
From: Eric Dumazet @ 2017-04-21 19:57 UTC (permalink / raw)
  To: David Miller; +Cc: tedheadster, netdev
In-Reply-To: <20170421.132245.1207278487695293623.davem@davemloft.net>

On Fri, 2017-04-21 at 13:22 -0400, David Miller wrote:
> From: Matthew Whitehead <tedheadster@gmail.com>
> Date: Wed, 19 Apr 2017 12:37:10 -0400
> 
> > Constants used for tuning are generally a bad idea, especially as hardware
> > changes over time. Replace the constant 2 jiffies with sysctl variable
> > netdev_budget_usecs to enable sysadmins to tune the softirq processing.
> > Also document the variable.
> > 
> > For example, a very fast machine might tune this to 1000 microseconds,
> > while my regression testing 486DX-25 needs it to be 4000 microseconds on
> > a nearly idle network to prevent time_squeeze from being incremented.
> > 
> > Version 2: changed jiffies to microseconds for predictable units.
> > 
> > Signed-off-by: Matthew Whitehead <tedheadster@gmail.com>
> 
> Applied, thanks.

Can we revert the changes in kernel/sysctl_binary.c &
include/uapi/linux/sysctl.h ?

{ CTL_INT,      NET_CORE_BUDGET_USECS,  "netdev_budget_usecs" },

NET_CORE_BUDGET_USECS=23,

Unless I am missing something, we should not add new binary sysctls.


Thanks.
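
As background to the quoted changelog, the sketch below shows why a budget
of "2 jiffies" is not a predictable unit: its wall-clock length depends on
the tick rate. Both helpers are hypothetical userspace models taking the
tick rate as a parameter (hz); they are not the kernel's
jiffies_to_usecs() or usecs_to_jiffies().

```c
#include <stdint.h>

/* Hypothetical model: length of a jiffies count in microseconds for a
 * given tick rate in Hz. */
static uint64_t model_jiffies_to_usecs(uint64_t jiffies, unsigned int hz)
{
	return jiffies * 1000000ULL / hz;
}

/* Hypothetical model: convert microseconds to jiffies, rounding up so
 * that a nonzero budget never collapses to zero ticks. */
static uint64_t model_usecs_to_jiffies(uint64_t usecs, unsigned int hz)
{
	return (usecs * hz + 999999ULL) / 1000000ULL;
}
```

Under this model, 2 jiffies last 2000 microseconds at HZ=1000 but 20000
microseconds at HZ=100, while a 2000 microsecond budget maps to 2 ticks at
HZ=1000 and rounds up to a single tick at HZ=100.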

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2017-04-21 19:56 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) Don't race in IPSEC dumps, from Yuejie Shi.

2) Verify lengths properly in IPSEC requests, from Herbert Xu.

3) Fix out of bounds access in ipv6 segment routing code, from David
   Lebrun.

4) Don't write into the header of cloned SKBs in smsc95xx driver, from
   James Hughes.

5) Several other drivers have this bug too, fix them.  From Eric
   Dumazet.

6) Fix access to uninitialized data in TC action cookie code, from
   Wolfgang Bumiller.

7) Fix double free in IPV6 segment routing, again from David Lebrun.

8) Don't let userspace set the RTF_PCPU flag, oops.  From David Ahern.

9) Fix use after free in qrtr code, from Dan Carpenter.

10) Don't double-destroy devices in ip6mr code, from Nikolay
    Aleksandrov.

11) Don't pass out-of-range TX queue indices into drivers, from Tushar
    Dave.
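
Item 11 can be pictured with a small userspace model. The netpoll diff is
not shown in this message, so the wrap-around policy below (modulo the
number of real TX queues) and the handling of a zero queue count are
assumptions for illustration, and the function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model: bring a queue index into the range the driver
 * actually has before handing the packet over. */
static uint16_t model_fix_queue_mapping(uint16_t q_index,
					uint16_t real_num_tx_queues)
{
	if (real_num_tx_queues == 0)
		return 0; /* assumption: nothing sensible to map to */
	if (q_index >= real_num_tx_queues)
		q_index = q_index % real_num_tx_queues;
	return q_index;
}
```

In-range indices pass through unchanged; only out-of-range values are
folded back, so a driver never sees an index beyond its queue array.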

Please pull, thanks a lot!

The following changes since commit 005882e53d62f25dae10351a8d3f13326051e8f5:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc (2017-04-18 13:56:51 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 

for you to fetch changes up to c70b17b775edb21280e9de7531acf6db3b365274:

  netpoll: Check for skb->queue_mapping (2017-04-21 15:45:19 -0400)

----------------------------------------------------------------
Dan Carpenter (2):
      dp83640: don't recieve time stamps twice
      net: qrtr: potential use after free in qrtr_sendmsg()

Daniel Borkmann (1):
      bpf, doc: update bpf maintainers entry

David Ahern (1):
      net: ipv6: RTF_PCPU should not be settable from userspace

David Lebrun (2):
      ipv6: sr: fix out-of-bounds access in SRH validation
      ipv6: sr: fix double free of skb after handling invalid SRH

David Miller (1):
      bpf: Fix values type used in test_maps

David S. Miller (5):
      Merge tag 'mac80211-for-davem-2017-04-18' of git://git.kernel.org/.../jberg/mac80211
      Merge branch 'master' of git://git.kernel.org/.../klassert/ipsec
      Merge branch 'qed-dcbx-fixes'
      MAINTAINERS: Add "B:" field for networking.
      Merge branch 'skb_cow_head'

Eric Dumazet (6):
      smsc75xx: use skb_cow_head() to deal with cloned skbs
      cx82310_eth: use skb_cow_head() to deal with cloned skbs
      sr9700: use skb_cow_head() to deal with cloned skbs
      lan78xx: use skb_cow_head() to deal with cloned skbs
      ch9200: use skb_cow_head() to deal with cloned skbs
      kaweth: use skb_cow_head() to deal with cloned skbs

Herbert Xu (1):
      af_key: Fix sadb_x_ipsecrequest parsing

Ilan Tayari (1):
      gso: Validate assumption of frag_list segementation

James Hughes (1):
      smsc95xx: Use skb_cow_head to deal with cloned skbs

Johannes Berg (2):
      mac80211: fix MU-MIMO follow-MAC mode
      mac80211: reject ToDS broadcast data frames

Mike Maloney (1):
      selftests/net: Fixes psock_fanout CBPF test case

Nikolay Aleksandrov (1):
      ip6mr: fix notification device destruction

Sekhar Nori (1):
      MAINTAINERS: update entry for TI's CPSW driver

Sergei Shtylyov (1):
      sh_eth: unmap DMA buffers when freeing rings

Tushar Dave (1):
      netpoll: Check for skb->queue_mapping

Wolfgang Bumiller (1):
      net sched actions: allocate act cookie early

Yuejie Shi (1):
      af_key: Add lock to key dump

sudarsana.kalluru@cavium.com (4):
      qed: Fix possible error in populating max_tc field.
      qed: Fix sending an invalid PFC error mask to MFW.
      qed: Fix possible system hang in the dcbnl-getdcbx() path.
      qed: Fix issue in populating the PFC config paramters.

 MAINTAINERS                                |  18 +++++++++++++++--
 drivers/net/ethernet/qlogic/qed/qed_dcbx.c |  13 +++++++++++-
 drivers/net/ethernet/renesas/sh_eth.c      | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------
 drivers/net/phy/dp83640.c                  |   2 --
 drivers/net/usb/ch9200.c                   |   9 ++-------
 drivers/net/usb/cx82310_eth.c              |   7 ++-----
 drivers/net/usb/kaweth.c                   |  18 ++++++-----------
 drivers/net/usb/lan78xx.c                  |   9 ++-------
 drivers/net/usb/smsc75xx.c                 |   8 ++------
 drivers/net/usb/smsc95xx.c                 |  12 +++++------
 drivers/net/usb/sr9700.c                   |   9 ++-------
 include/uapi/linux/ipv6_route.h            |   2 +-
 net/core/netpoll.c                         |  10 ++++++++--
 net/core/skbuff.c                          |  18 +++++++++++++----
 net/ipv6/exthdrs.c                         |   1 -
 net/ipv6/ip6mr.c                           |  13 ++++++------
 net/ipv6/route.c                           |   4 ++++
 net/ipv6/seg6.c                            |   3 +++
 net/key/af_key.c                           |  93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------
 net/mac80211/rx.c                          |  86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------
 net/qrtr/qrtr.c                            |   4 +++-
 net/sched/act_api.c                        |  55 ++++++++++++++++++++++++++++++---------------------
 tools/testing/selftests/bpf/test_maps.c    |   4 ++--
 tools/testing/selftests/net/psock_fanout.c |  22 +++++++++++++++++++--
 tools/testing/selftests/net/psock_lib.h    |  13 +++---------
 25 files changed, 345 insertions(+), 210 deletions(-)

^ permalink raw reply
