Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] virtio_net: split out ctrl buffer
From: kbuild test robot @ 2018-04-19  7:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kbuild-all, linux-kernel, Mikulas Patocka, Eric Dumazet,
	David Miller, Jason Wang, virtualization, netdev
In-Reply-To: <1524113437-308621-1-git-send-email-mst@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 1822 bytes --]

Hi Michael,

I love your patch! Yet something to improve:

[auto build test ERROR on net/master]

url:    https://github.com/0day-ci/linux/commits/Michael-S-Tsirkin/virtio_net-split-out-ctrl-buffer/20180419-145754
config: x86_64-randconfig-x006-201815 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers//net/virtio_net.c: In function 'virtnet_free_queues':
>> drivers//net/virtio_net.c:2365:12: error: 'struct virtnet_info' has no member named 'err_ctrl'; did you mean 'ctrl'?
     kfree(vi->err_ctrl);
               ^~~~~~~~
               ctrl
   drivers//net/virtio_net.c: In function 'virtnet_alloc_queues':
>> drivers//net/virtio_net.c:2589:8: error: 'vq' undeclared (first use in this function); did you mean 'vi'?
     kfree(vq->ctrl);
           ^~
           vi
   drivers//net/virtio_net.c:2589:8: note: each undeclared identifier is reported only once for each function it appears in

vim +2365 drivers//net/virtio_net.c

  2347	
  2348	static void virtnet_free_queues(struct virtnet_info *vi)
  2349	{
  2350		int i;
  2351	
  2352		for (i = 0; i < vi->max_queue_pairs; i++) {
  2353			napi_hash_del(&vi->rq[i].napi);
  2354			netif_napi_del(&vi->rq[i].napi);
  2355			netif_napi_del(&vi->sq[i].napi);
  2356		}
  2357	
  2358		/* We called napi_hash_del() before netif_napi_del(),
  2359		 * we need to respect an RCU grace period before freeing vi->rq
  2360		 */
  2361		synchronize_net();
  2362	
  2363		kfree(vi->rq);
  2364		kfree(vi->sq);
> 2365		kfree(vi->err_ctrl);
  2366	}
  2367	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27230 bytes --]

^ permalink raw reply

* Greetings
From: "Miss Zeliha ömer faruk" @ 2018-04-19  7:37 UTC (permalink / raw)





Hello

Greetings to you please i have a business proposal for you contact me
for more detailes asap thanks.

Best Regards,
Miss.Zeliha ömer faruk
Esentepe Mahallesi Büyükdere
Caddesi Kristal Kule Binasi
No:215
Sisli - Istanbul, Turkey

^ permalink raw reply

* [PATCH net-next v2 3/3] net: ethernet: ave: add support for phy-mode setting of system controller
From: Kunihiko Hayashi @ 2018-04-19  7:24 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: Andrew Lunn, Florian Fainelli, Rob Herring, Mark Rutland,
	linux-arm-kernel, linux-kernel, devicetree, Masahiro Yamada,
	Masami Hiramatsu, Jassi Brar, Kunihiko Hayashi
In-Reply-To: <1524122695-19597-1-git-send-email-hayashi.kunihiko@socionext.com>

This patch adds support for specifying system controller that configures
phy-mode setting.

According to the DT property "phy-mode", it's necessary to configure the
controller, which is used to choose the settings of the MAC suitable,
for example, mdio pin connections, internal clocks, and so on.

Supported phy-modes are SoC-dependent. The driver allows phy-mode to set
"internal" if the SoC has a built-in PHY, and {"mii", "rmii", "rgmii"}
if the SoC supports each mode. So we have to check whether the phy-mode
is valid or not.

This adds the following features for each SoC:
- check whether the SoC supports the specified phy-mode
- configure the controller accroding to phy-mode

The DT property accepts one argument to distinguish them for multiple MAC
instances.

ethernet@65000000 {
	...
	socionext,syscon-phy-mode = <&soc_glue 0>;
};

ethernet@65200000 {
	...
	socionext,syscon-phy-mode = <&soc_glue 1>;
};

Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
---
 drivers/net/ethernet/socionext/Kconfig   |   2 +
 drivers/net/ethernet/socionext/sni_ave.c | 150 ++++++++++++++++++++++++++++---
 2 files changed, 140 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/socionext/Kconfig b/drivers/net/ethernet/socionext/Kconfig
index 6bcfe27..b80048c 100644
--- a/drivers/net/ethernet/socionext/Kconfig
+++ b/drivers/net/ethernet/socionext/Kconfig
@@ -14,6 +14,8 @@ if NET_VENDOR_SOCIONEXT
 config SNI_AVE
 	tristate "Socionext AVE ethernet support"
 	depends on (ARCH_UNIPHIER || COMPILE_TEST) && OF
+	depends on HAS_IOMEM
+	select MFD_SYSCON
 	select PHYLIB
 	---help---
 	  Driver for gigabit ethernet MACs, called AVE, in the
diff --git a/drivers/net/ethernet/socionext/sni_ave.c b/drivers/net/ethernet/socionext/sni_ave.c
index 52940bd..f7eccee 100644
--- a/drivers/net/ethernet/socionext/sni_ave.c
+++ b/drivers/net/ethernet/socionext/sni_ave.c
@@ -11,6 +11,7 @@
 #include <linux/interrupt.h>
 #include <linux/io.h>
 #include <linux/iopoll.h>
+#include <linux/mfd/syscon.h>
 #include <linux/mii.h>
 #include <linux/module.h>
 #include <linux/netdevice.h>
@@ -18,6 +19,7 @@
 #include <linux/of_mdio.h>
 #include <linux/of_platform.h>
 #include <linux/phy.h>
+#include <linux/regmap.h>
 #include <linux/reset.h>
 #include <linux/types.h>
 #include <linux/u64_stats_sync.h>
@@ -197,6 +199,11 @@
 #define AVE_INTM_COUNT		20
 #define AVE_FORCE_TXINTCNT	1
 
+/* SG */
+#define SG_ETPINMODE		0x540
+#define SG_ETPINMODE_EXTPHY	BIT(1)	/* for LD11 */
+#define SG_ETPINMODE_RMII(ins)	BIT(ins)
+
 #define IS_DESC_64BIT(p)	((p)->data->is_desc_64bit)
 
 #define AVE_MAX_CLKS		4
@@ -228,12 +235,6 @@ struct ave_desc_info {
 	struct ave_desc *desc;	/* skb info related descriptor */
 };
 
-struct ave_soc_data {
-	bool	is_desc_64bit;
-	const char	*clock_names[AVE_MAX_CLKS];
-	const char	*reset_names[AVE_MAX_RSTS];
-};
-
 struct ave_stats {
 	struct	u64_stats_sync	syncp;
 	u64	packets;
@@ -257,6 +258,9 @@ struct ave_private {
 	phy_interface_t		phy_mode;
 	struct phy_device	*phydev;
 	struct mii_bus		*mdio;
+	struct regmap		*regmap;
+	unsigned int		pinmode_mask;
+	unsigned int		pinmode_val;
 
 	/* stats */
 	struct ave_stats	stats_rx;
@@ -279,6 +283,14 @@ struct ave_private {
 	const struct ave_soc_data *data;
 };
 
+struct ave_soc_data {
+	bool	is_desc_64bit;
+	const char	*clock_names[AVE_MAX_CLKS];
+	const char	*reset_names[AVE_MAX_RSTS];
+	int	(*get_pinmode)(struct ave_private *priv,
+			       phy_interface_t phy_mode, u32 arg);
+};
+
 static u32 ave_desc_read(struct net_device *ndev, enum desc_id id, int entry,
 			 int offset)
 {
@@ -1179,6 +1191,11 @@ static int ave_init(struct net_device *ndev)
 		}
 	}
 
+	ret = regmap_update_bits(priv->regmap, SG_ETPINMODE,
+				 priv->pinmode_mask, priv->pinmode_val);
+	if (ret)
+		return ret;
+
 	ave_global_reset(ndev);
 
 	mdio_np = of_get_child_by_name(np, "mdio");
@@ -1537,6 +1554,7 @@ static int ave_probe(struct platform_device *pdev)
 	const struct ave_soc_data *data;
 	struct device *dev = &pdev->dev;
 	char buf[ETHTOOL_FWVERS_LEN];
+	struct of_phandle_args args;
 	phy_interface_t phy_mode;
 	struct ave_private *priv;
 	struct net_device *ndev;
@@ -1559,12 +1577,6 @@ static int ave_probe(struct platform_device *pdev)
 		dev_err(dev, "phy-mode not found\n");
 		return -EINVAL;
 	}
-	if ((!phy_interface_mode_is_rgmii(phy_mode)) &&
-	    phy_mode != PHY_INTERFACE_MODE_RMII &&
-	    phy_mode != PHY_INTERFACE_MODE_MII) {
-		dev_err(dev, "phy-mode is invalid\n");
-		return -EINVAL;
-	}
 
 	irq = platform_get_irq(pdev, 0);
 	if (irq < 0) {
@@ -1656,6 +1668,26 @@ static int ave_probe(struct platform_device *pdev)
 		priv->nrsts++;
 	}
 
+	ret = of_parse_phandle_with_fixed_args(np,
+					       "socionext,syscon-phy-mode",
+					       1, 0, &args);
+	if (ret) {
+		netdev_err(ndev, "can't get syscon-phy-mode property\n");
+		goto out_free_netdev;
+	}
+	priv->regmap = syscon_node_to_regmap(args.np);
+	of_node_put(args.np);
+	if (IS_ERR(priv->regmap)) {
+		netdev_err(ndev, "can't map syscon-phy-mode\n");
+		ret = PTR_ERR(priv->regmap);
+		goto out_free_netdev;
+	}
+	ret = priv->data->get_pinmode(priv, phy_mode, args.args[0]);
+	if (ret) {
+		netdev_err(ndev, "invalid phy-mode setting\n");
+		goto out_free_netdev;
+	}
+
 	priv->mdio = devm_mdiobus_alloc(dev);
 	if (!priv->mdio) {
 		ret = -ENOMEM;
@@ -1715,6 +1747,95 @@ static int ave_remove(struct platform_device *pdev)
 	return 0;
 }
 
+static int ave_pro4_get_pinmode(struct ave_private *priv,
+				phy_interface_t phy_mode, u32 arg)
+{
+	if (arg > 0)
+		return -EINVAL;
+
+	priv->pinmode_mask = SG_ETPINMODE_RMII(0);
+
+	switch (phy_mode) {
+	case PHY_INTERFACE_MODE_RMII:
+		priv->pinmode_val = SG_ETPINMODE_RMII(0);
+		break;
+	case PHY_INTERFACE_MODE_MII:
+	case PHY_INTERFACE_MODE_RGMII:
+		priv->pinmode_val = 0;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int ave_ld11_get_pinmode(struct ave_private *priv,
+				phy_interface_t phy_mode, u32 arg)
+{
+	if (arg > 0)
+		return -EINVAL;
+
+	priv->pinmode_mask = SG_ETPINMODE_EXTPHY | SG_ETPINMODE_RMII(0);
+
+	switch (phy_mode) {
+	case PHY_INTERFACE_MODE_INTERNAL:
+		priv->pinmode_val = 0;
+		break;
+	case PHY_INTERFACE_MODE_RMII:
+		priv->pinmode_val = SG_ETPINMODE_EXTPHY | SG_ETPINMODE_RMII(0);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int ave_ld20_get_pinmode(struct ave_private *priv,
+				phy_interface_t phy_mode, u32 arg)
+{
+	if (arg > 0)
+		return -EINVAL;
+
+	priv->pinmode_mask = SG_ETPINMODE_RMII(0);
+
+	switch (phy_mode) {
+	case PHY_INTERFACE_MODE_RMII:
+		priv->pinmode_val = SG_ETPINMODE_RMII(0);
+		break;
+	case PHY_INTERFACE_MODE_RGMII:
+		priv->pinmode_val = 0;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int ave_pxs3_get_pinmode(struct ave_private *priv,
+				phy_interface_t phy_mode, u32 arg)
+{
+	if (arg > 1)
+		return -EINVAL;
+
+	priv->pinmode_mask = SG_ETPINMODE_RMII(arg);
+
+	switch (phy_mode) {
+	case PHY_INTERFACE_MODE_RMII:
+		priv->pinmode_val = SG_ETPINMODE_RMII(arg);
+		break;
+	case PHY_INTERFACE_MODE_RGMII:
+		priv->pinmode_val = 0;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static const struct ave_soc_data ave_pro4_data = {
 	.is_desc_64bit = false,
 	.clock_names = {
@@ -1723,6 +1844,7 @@ static const struct ave_soc_data ave_pro4_data = {
 	.reset_names = {
 		"gio", "ether",
 	},
+	.get_pinmode = ave_pro4_get_pinmode,
 };
 
 static const struct ave_soc_data ave_pxs2_data = {
@@ -1733,6 +1855,7 @@ static const struct ave_soc_data ave_pxs2_data = {
 	.reset_names = {
 		"ether",
 	},
+	.get_pinmode = ave_pro4_get_pinmode,
 };
 
 static const struct ave_soc_data ave_ld11_data = {
@@ -1743,6 +1866,7 @@ static const struct ave_soc_data ave_ld11_data = {
 	.reset_names = {
 		"ether",
 	},
+	.get_pinmode = ave_ld11_get_pinmode,
 };
 
 static const struct ave_soc_data ave_ld20_data = {
@@ -1753,6 +1877,7 @@ static const struct ave_soc_data ave_ld20_data = {
 	.reset_names = {
 		"ether",
 	},
+	.get_pinmode = ave_ld20_get_pinmode,
 };
 
 static const struct ave_soc_data ave_pxs3_data = {
@@ -1763,6 +1888,7 @@ static const struct ave_soc_data ave_pxs3_data = {
 	.reset_names = {
 		"ether",
 	},
+	.get_pinmode = ave_pxs3_get_pinmode,
 };
 
 static const struct of_device_id of_ave_match[] = {
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next v2 2/3] dt-bindings: net: ave: add syscon-phy-mode property to configure phy-mode setting
From: Kunihiko Hayashi @ 2018-04-19  7:24 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: Andrew Lunn, Florian Fainelli, Rob Herring, Mark Rutland,
	linux-arm-kernel, linux-kernel, devicetree, Masahiro Yamada,
	Masami Hiramatsu, Jassi Brar, Kunihiko Hayashi
In-Reply-To: <1524122695-19597-1-git-send-email-hayashi.kunihiko@socionext.com>

Add "socionext,syscon-phy-mode" property to specify system controller that
configures the settings about phy-mode.

Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Reviewed-by: Rob Herring <robh@kernel.org>
---
 Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
index 85e0c49..fc8f017 100644
--- a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
+++ b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
@@ -13,7 +13,8 @@ Required properties:
  - reg: Address where registers are mapped and size of region.
  - interrupts: Should contain the MAC interrupt.
  - phy-mode: See ethernet.txt in the same directory. Allow to choose
-	"rgmii", "rmii", or "mii" according to the PHY.
+	"rgmii", "rmii", "mii", or "internal" according to the PHY.
+	The acceptable mode is SoC-dependent.
  - phy-handle: Should point to the external phy device.
 	See ethernet.txt file in the same directory.
  - clocks: A phandle to the clock for the MAC.
@@ -27,6 +28,8 @@ Required properties:
  - reset-names: Should contain
 	- "ether", "gio" for Pro4 SoC
 	- "ether" for others
+ - socionext,syscon-phy-mode: A phandle to syscon with one argument
+	that configures phy mode. The argument is the ID of MAC instance.
 
 Optional properties:
  - local-mac-address: See ethernet.txt in the same directory.
@@ -47,6 +50,7 @@ Example:
 		clocks = <&sys_clk 6>;
 		reset-names = "ether";
 		resets = <&sys_rst 6>;
+		socionext,syscon-phy-mode = <&soc_glue 0>;
 		local-mac-address = [00 00 00 00 00 00];
 
 		mdio {
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next v2 1/3] net: ethernet: ave: add multiple clocks and resets support as required property
From: Kunihiko Hayashi @ 2018-04-19  7:24 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: Andrew Lunn, Florian Fainelli, Rob Herring, Mark Rutland,
	linux-arm-kernel, linux-kernel, devicetree, Masahiro Yamada,
	Masami Hiramatsu, Jassi Brar, Kunihiko Hayashi
In-Reply-To: <1524122695-19597-1-git-send-email-hayashi.kunihiko@socionext.com>

When the link is becoming up for Pro4 SoC, the kernel is stalled
due to some missing clocks and resets.

The AVE block for Pro4 is connected to the GIO bus in the SoC.
Without its clock/reset, the access to the AVE register makes the
system stall.

In the same way, another MAC clock for Giga-bit Connection and
the PHY clock are also required for Pro4 to activate the Giga-bit feature
and to recognize the PHY.

To satisfy these requirements, this patch adds support for multiple clocks
and resets, and adds the clock-names and reset-names to the binding because
we need to distinguish clock/reset for the AVE main block and the others.

Also, make the resets a required property. Currently, "reset is
optional" relies on that the bootloader or firmware has deasserted
the reset before booting the kernel.  Drivers should work without
such expectation.

Fixes: 4c270b55a5af ("net: ethernet: socionext: add AVE ethernet driver")
Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Reviewed-by: Rob Herring <robh@kernel.org>
---
 .../bindings/net/socionext,uniphier-ave4.txt       |  13 ++-
 drivers/net/ethernet/socionext/sni_ave.c           | 108 ++++++++++++++++-----
 2 files changed, 96 insertions(+), 25 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
index 96398cc..85e0c49 100644
--- a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
+++ b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
@@ -17,9 +17,18 @@ Required properties:
  - phy-handle: Should point to the external phy device.
 	See ethernet.txt file in the same directory.
  - clocks: A phandle to the clock for the MAC.
+	For Pro4 SoC, that is "socionext,uniphier-pro4-ave4",
+	another MAC clock, GIO bus clock and PHY clock are also required.
+ - clock-names: Should contain
+	- "ether", "ether-gb", "gio", "ether-phy" for Pro4 SoC
+	- "ether" for others
+ - resets: A phandle to the reset control for the MAC. For Pro4 SoC,
+	GIO bus reset is also required.
+ - reset-names: Should contain
+	- "ether", "gio" for Pro4 SoC
+	- "ether" for others
 
 Optional properties:
- - resets: A phandle to the reset control for the MAC.
  - local-mac-address: See ethernet.txt in the same directory.
 
 Required subnode:
@@ -34,7 +43,9 @@ Example:
 		interrupts = <0 66 4>;
 		phy-mode = "rgmii";
 		phy-handle = <&ethphy>;
+		clock-names = "ether";
 		clocks = <&sys_clk 6>;
+		reset-names = "ether";
 		resets = <&sys_rst 6>;
 		local-mac-address = [00 00 00 00 00 00];
 
diff --git a/drivers/net/ethernet/socionext/sni_ave.c b/drivers/net/ethernet/socionext/sni_ave.c
index 0b3b7a4..52940bd 100644
--- a/drivers/net/ethernet/socionext/sni_ave.c
+++ b/drivers/net/ethernet/socionext/sni_ave.c
@@ -199,6 +199,9 @@
 
 #define IS_DESC_64BIT(p)	((p)->data->is_desc_64bit)
 
+#define AVE_MAX_CLKS		4
+#define AVE_MAX_RSTS		2
+
 enum desc_id {
 	AVE_DESCID_RX,
 	AVE_DESCID_TX,
@@ -227,6 +230,8 @@ struct ave_desc_info {
 
 struct ave_soc_data {
 	bool	is_desc_64bit;
+	const char	*clock_names[AVE_MAX_CLKS];
+	const char	*reset_names[AVE_MAX_RSTS];
 };
 
 struct ave_stats {
@@ -245,8 +250,10 @@ struct ave_private {
 	int			phy_id;
 	unsigned int		desc_size;
 	u32			msg_enable;
-	struct clk		*clk;
-	struct reset_control	*rst;
+	int			nclks;
+	struct clk		*clk[AVE_MAX_CLKS];
+	int			nrsts;
+	struct reset_control	*rst[AVE_MAX_RSTS];
 	phy_interface_t		phy_mode;
 	struct phy_device	*phydev;
 	struct mii_bus		*mdio;
@@ -1153,18 +1160,23 @@ static int ave_init(struct net_device *ndev)
 	struct device_node *np = dev->of_node;
 	struct device_node *mdio_np;
 	struct phy_device *phydev;
-	int ret;
+	int nc, nr, ret;
 
 	/* enable clk because of hw access until ndo_open */
-	ret = clk_prepare_enable(priv->clk);
-	if (ret) {
-		dev_err(dev, "can't enable clock\n");
-		return ret;
+	for (nc = 0; nc < priv->nclks; nc++) {
+		ret = clk_prepare_enable(priv->clk[nc]);
+		if (ret) {
+			dev_err(dev, "can't enable clock\n");
+			goto out_clk_disable;
+		}
 	}
-	ret = reset_control_deassert(priv->rst);
-	if (ret) {
-		dev_err(dev, "can't deassert reset\n");
-		goto out_clk_disable;
+
+	for (nr = 0; nr < priv->nrsts; nr++) {
+		ret = reset_control_deassert(priv->rst[nr]);
+		if (ret) {
+			dev_err(dev, "can't deassert reset\n");
+			goto out_reset_assert;
+		}
 	}
 
 	ave_global_reset(ndev);
@@ -1207,9 +1219,11 @@ static int ave_init(struct net_device *ndev)
 out_mdio_unregister:
 	mdiobus_unregister(priv->mdio);
 out_reset_assert:
-	reset_control_assert(priv->rst);
+	while (--nr >= 0)
+		reset_control_assert(priv->rst[nr]);
 out_clk_disable:
-	clk_disable_unprepare(priv->clk);
+	while (--nc >= 0)
+		clk_disable_unprepare(priv->clk[nc]);
 
 	return ret;
 }
@@ -1217,13 +1231,16 @@ static int ave_init(struct net_device *ndev)
 static void ave_uninit(struct net_device *ndev)
 {
 	struct ave_private *priv = netdev_priv(ndev);
+	int i;
 
 	phy_disconnect(priv->phydev);
 	mdiobus_unregister(priv->mdio);
 
 	/* disable clk because of hw access after ndo_stop */
-	reset_control_assert(priv->rst);
-	clk_disable_unprepare(priv->clk);
+	for (i = 0; i < priv->nrsts; i++)
+		reset_control_assert(priv->rst[i]);
+	for (i = 0; i < priv->nclks; i++)
+		clk_disable_unprepare(priv->clk[i]);
 }
 
 static int ave_open(struct net_device *ndev)
@@ -1527,8 +1544,9 @@ static int ave_probe(struct platform_device *pdev)
 	struct resource	*res;
 	const void *mac_addr;
 	void __iomem *base;
+	const char *name;
+	int i, irq, ret;
 	u64 dma_mask;
-	int irq, ret;
 	u32 ave_id;
 
 	data = of_device_get_match_data(dev);
@@ -1614,16 +1632,28 @@ static int ave_probe(struct platform_device *pdev)
 	u64_stats_init(&priv->stats_tx.syncp);
 	u64_stats_init(&priv->stats_rx.syncp);
 
-	priv->clk = devm_clk_get(dev, NULL);
-	if (IS_ERR(priv->clk)) {
-		ret = PTR_ERR(priv->clk);
-		goto out_free_netdev;
+	for (i = 0; i < AVE_MAX_CLKS; i++) {
+		name = priv->data->clock_names[i];
+		if (!name)
+			break;
+		priv->clk[i] = devm_clk_get(dev, name);
+		if (IS_ERR(priv->clk[i])) {
+			ret = PTR_ERR(priv->clk[i]);
+			goto out_free_netdev;
+		}
+		priv->nclks++;
 	}
 
-	priv->rst = devm_reset_control_get_optional_shared(dev, NULL);
-	if (IS_ERR(priv->rst)) {
-		ret = PTR_ERR(priv->rst);
-		goto out_free_netdev;
+	for (i = 0; i < AVE_MAX_RSTS; i++) {
+		name = priv->data->reset_names[i];
+		if (!name)
+			break;
+		priv->rst[i] = devm_reset_control_get_shared(dev, name);
+		if (IS_ERR(priv->rst[i])) {
+			ret = PTR_ERR(priv->rst[i]);
+			goto out_free_netdev;
+		}
+		priv->nrsts++;
 	}
 
 	priv->mdio = devm_mdiobus_alloc(dev);
@@ -1687,22 +1717,52 @@ static int ave_remove(struct platform_device *pdev)
 
 static const struct ave_soc_data ave_pro4_data = {
 	.is_desc_64bit = false,
+	.clock_names = {
+		"gio", "ether", "ether-gb", "ether-phy",
+	},
+	.reset_names = {
+		"gio", "ether",
+	},
 };
 
 static const struct ave_soc_data ave_pxs2_data = {
 	.is_desc_64bit = false,
+	.clock_names = {
+		"ether",
+	},
+	.reset_names = {
+		"ether",
+	},
 };
 
 static const struct ave_soc_data ave_ld11_data = {
 	.is_desc_64bit = false,
+	.clock_names = {
+		"ether",
+	},
+	.reset_names = {
+		"ether",
+	},
 };
 
 static const struct ave_soc_data ave_ld20_data = {
 	.is_desc_64bit = true,
+	.clock_names = {
+		"ether",
+	},
+	.reset_names = {
+		"ether",
+	},
 };
 
 static const struct ave_soc_data ave_pxs3_data = {
 	.is_desc_64bit = false,
+	.clock_names = {
+		"ether",
+	},
+	.reset_names = {
+		"ether",
+	},
 };
 
 static const struct of_device_id of_ave_match[] = {
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next v2 0/3] ave: fix the activation issues for some UniPhier SoCs
From: Kunihiko Hayashi @ 2018-04-19  7:24 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: Andrew Lunn, Florian Fainelli, Rob Herring, Mark Rutland,
	linux-arm-kernel, linux-kernel, devicetree, Masahiro Yamada,
	Masami Hiramatsu, Jassi Brar, Kunihiko Hayashi

This add the following stuffs to fix the activation issues and satisfy
requirements for AVE ethernet driver implemented on some UniPhier SoCs.

- Add support for additional necessary clocks and resets, because the kernel
  is stalled on Pro4 due to lack of them.

- Check whether the SoC supports the specified phy-mode

- Add DT property support indicating system controller that has the feature
  for configurating phy-mode including built-in phy on LD11.

v1: https://www.spinics.net/lists/netdev/msg494904.html

Changes since v1:
- Add 'Reviewed-by' lines

Kunihiko Hayashi (3):
  net: ethernet: ave: add multiple clocks and resets support as required
    property
  dt-bindings: net: ave: add syscon-phy-mode property to configure
    phy-mode setting
  net: ethernet: ave: add support for phy-mode setting of system
    controller

 .../bindings/net/socionext,uniphier-ave4.txt       |  19 +-
 drivers/net/ethernet/socionext/Kconfig             |   2 +
 drivers/net/ethernet/socionext/sni_ave.c           | 252 ++++++++++++++++++---
 3 files changed, 238 insertions(+), 35 deletions(-)

-- 
2.7.4

^ permalink raw reply

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Jiri Pirko @ 2018-04-19  7:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, stephen, davem, netdev, virtualization,
	virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
	jasowang, loseweigh
In-Reply-To: <20180419070752-mutt-send-email-mst@kernel.org>

Thu, Apr 19, 2018 at 06:08:58AM CEST, mst@redhat.com wrote:
>On Wed, Apr 18, 2018 at 10:32:06PM +0200, Jiri Pirko wrote:
>> >> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
>> >> >> > am not too happy with it.
>> >> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> >> >> No. The netdev could be any netdevice. It does not have to be a "VF".
>> >> >> I think "stolen" is quite appropriate since it describes the modus
>> >> >> operandi. The bypass master steals some netdevice according to some
>> >> >> match.
>> >> >> 
>> >> >> But I don't insist on "stolen". Just sounds right.
>> >> >
>> >> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>> >> >'backup' name is consistent.
>> >> 
>> >> It perhaps makes sense from the view of virtio device. However, as I
>> >> described couple of times, for master/slave device the name "backup" is
>> >> highly misleading.
>> >
>> >virtio is the backup. You are supposed to use another
>> >(typically passthrough) device, if that fails use virtio.
>> >It does seem appropriate to me. If you like, we can
>> >change that to "standby".  Active I don't like either. "main"?
>> 
>> Sounds much better, yes.
>
>Excuse me, which of the versions are better in your eyes?

standby is okay. main/primary is fine too.

>
>
>> 
>> >
>> >In fact would failover be better than bypass?
>> 
>> Also, much better.
>> 

^ permalink raw reply

* [PATCH net] bnxt_en: Fix memory fault in bnxt_ethtool_init()
From: Michael Chan @ 2018-04-19  7:16 UTC (permalink / raw)
  To: davem; +Cc: netdev, kernel-team

From: Vasundhara Volam <vasundhara-v.volam@broadcom.com>

In some firmware images, the length of BNX_DIR_TYPE_PKG_LOG nvram type
could be greater than the fixed buffer length of 4096 bytes allocated by
the driver.  This was causing HWRM_NVM_READ to copy more data to the buffer
than the allocated size, causing general protection fault.

Fix the issue by allocating the exact buffer length returned by
HWRM_NVM_FIND_DIR_ENTRY, instead of 4096.  Move the kzalloc() call
into the bnxt_get_pkgver() function.

Fixes: 3ebf6f0a09a2 ("bnxt_en: Add installed-package firmware version reporting via Ethtool GDRVINFO")
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c  | 49 ++++++++++++----------
 drivers/net/ethernet/broadcom/bnxt/bnxt_nvm_defs.h |  2 -
 2 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 1f622ca..8ba14ae 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -1927,22 +1927,39 @@ static char *bnxt_parse_pkglog(int desired_field, u8 *data, size_t datalen)
 	return retval;
 }
 
-static char *bnxt_get_pkgver(struct net_device *dev, char *buf, size_t buflen)
+static void bnxt_get_pkgver(struct net_device *dev)
 {
+	struct bnxt *bp = netdev_priv(dev);
 	u16 index = 0;
-	u32 datalen;
+	char *pkgver;
+	u32 pkglen;
+	u8 *pkgbuf;
+	int len;
 
 	if (bnxt_find_nvram_item(dev, BNX_DIR_TYPE_PKG_LOG,
 				 BNX_DIR_ORDINAL_FIRST, BNX_DIR_EXT_NONE,
-				 &index, NULL, &datalen) != 0)
-		return NULL;
+				 &index, NULL, &pkglen) != 0)
+		return;
 
-	memset(buf, 0, buflen);
-	if (bnxt_get_nvram_item(dev, index, 0, datalen, buf) != 0)
-		return NULL;
+	pkgbuf = kzalloc(pkglen, GFP_KERNEL);
+	if (!pkgbuf) {
+		dev_err(&bp->pdev->dev, "Unable to allocate memory for pkg version, length = %u\n",
+			pkglen);
+		return;
+	}
+
+	if (bnxt_get_nvram_item(dev, index, 0, pkglen, pkgbuf))
+		goto err;
 
-	return bnxt_parse_pkglog(BNX_PKG_LOG_FIELD_IDX_PKG_VERSION, buf,
-		datalen);
+	pkgver = bnxt_parse_pkglog(BNX_PKG_LOG_FIELD_IDX_PKG_VERSION, pkgbuf,
+				   pkglen);
+	if (pkgver && *pkgver != 0 && isdigit(*pkgver)) {
+		len = strlen(bp->fw_ver_str);
+		snprintf(bp->fw_ver_str + len, FW_VER_STR_LEN - len - 1,
+			 "/pkg %s", pkgver);
+	}
+err:
+	kfree(pkgbuf);
 }
 
 static int bnxt_get_eeprom(struct net_device *dev,
@@ -2615,22 +2632,10 @@ void bnxt_ethtool_init(struct bnxt *bp)
 	struct hwrm_selftest_qlist_input req = {0};
 	struct bnxt_test_info *test_info;
 	struct net_device *dev = bp->dev;
-	char *pkglog;
 	int i, rc;
 
-	pkglog = kzalloc(BNX_PKG_LOG_MAX_LENGTH, GFP_KERNEL);
-	if (pkglog) {
-		char *pkgver;
-		int len;
+	bnxt_get_pkgver(dev);
 
-		pkgver = bnxt_get_pkgver(dev, pkglog, BNX_PKG_LOG_MAX_LENGTH);
-		if (pkgver && *pkgver != 0 && isdigit(*pkgver)) {
-			len = strlen(bp->fw_ver_str);
-			snprintf(bp->fw_ver_str + len, FW_VER_STR_LEN - len - 1,
-				 "/pkg %s", pkgver);
-		}
-		kfree(pkglog);
-	}
 	if (bp->hwrm_spec_code < 0x10704 || !BNXT_SINGLE_PF(bp))
 		return;
 
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_nvm_defs.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_nvm_defs.h
index 73f2249..8344481 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_nvm_defs.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_nvm_defs.h
@@ -59,8 +59,6 @@ enum bnxt_nvm_directory_type {
 #define BNX_DIR_ATTR_NO_CHKSUM			(1 << 0)
 #define BNX_DIR_ATTR_PROP_STREAM		(1 << 1)
 
-#define BNX_PKG_LOG_MAX_LENGTH			4096
-
 enum bnxnvm_pkglog_field_index {
 	BNX_PKG_LOG_FIELD_IDX_INSTALLED_TIMESTAMP	= 0,
 	BNX_PKG_LOG_FIELD_IDX_PKG_DESCRIPTION		= 1,
-- 
1.8.3.1

^ permalink raw reply related

* Re: [net] xfrm: allow to release xfrm_state with flush
From: Steffen Klassert @ 2018-04-19  6:56 UTC (permalink / raw)
  To: Jacek Kalwas; +Cc: davem, netdev, intel-wired-lan
In-Reply-To: <20180412190315.3102-2-jacek.kalwas@intel.com>

On Thu, Apr 12, 2018 at 12:03:14PM -0700, Jacek Kalwas wrote:
> Call to flush SAs doesn't release xfrm_state in case there was a
> traffic associated with that state and state was already deleted.
> 
> Given patch calls xfrm_policy_cache_flush despite of actual states
> deleted in xfrm_state_flush function.
> 
> Signed-off-by: Jacek Kalwas <jacek.kalwas@intel.com>

This is a fix that needs to be backported to -stable,
so please add a proper 'Fixes:' tag and resend based on
the ipsec tree.

Thanks!

^ permalink raw reply

* Re: [net] udp: enable UDP checksum offload for ESP
From: Steffen Klassert @ 2018-04-19  6:54 UTC (permalink / raw)
  To: Jacek Kalwas; +Cc: davem, netdev, intel-wired-lan
In-Reply-To: <20180412190315.3102-1-jacek.kalwas@intel.com>

On Thu, Apr 12, 2018 at 12:03:13PM -0700, Jacek Kalwas wrote:
> In case NIC has support for ESP TX CSUM offload skb->ip_summed is not
> set to CHECKSUM_PARTIAL which results in checksum calculated by SW.
> 
> Fix enables ESP TX CSUM for UDP by extending condition with check for
> NETIF_F_HW_ESP_TX_CSUM.
> 
> Signed-off-by: Jacek Kalwas <jacek.kalwas@intel.com>

Applied to ipsec-next, thanks!

^ permalink raw reply

* Re: kernel BUG at /home/blee/project/race-fuzzer/kernels/kernel_v4.16-rc3/net/packet/af_packet.c:LINE!
From: DaeRyong Jeong @ 2018-04-19  6:45 UTC (permalink / raw)
  To: LKML
  Cc: Byoungyoung Lee, Kyungtae Kim, Linux Kernel Network Developers,
	Willem de Bruijn
In-Reply-To: <CACsK=jcYaF8ZP6DKTnvTh6SFLh+=eh+SJtn5xjpuxRxpiM9_BA@mail.gmail.com>

Hello.
We have analyzed the cause of the crash, kernel BUG at
net/packet/af_packet.c:LINE!,
which is found by RaceFuzzer (a modified version of Syzkaller) in v4.16-rc7.

Since struct packet_sock's member variables, running, has_vnet_hdr, origdev
and auxdata are declared as bitfields, accessing these variables can race if
there is no synchronization mechanism.

We think racing between following lines in af_packet.c causes the crash.
In function __unregister_prot_hook,
    po->running = 0;
In function packet_setsockopt,
    po->has_vnet_hdr = !!val;

Analysis:
CPU0
pakcet_setsockopt
    po->has_vnet_hdr = !!val;

CPU1
packet_do_bind
    __unregister_prot_hook
        po->running = 0;

In the CPU1, the value of po->running should become 0, but because of racing,
it is possible that po->running can keep the value 1.
Consequently, after returning from __unregister_prot_hook, BUG_ON at
net/packet/af_packet.c:3107 can be triggered.


Possible interleaving between racy C source lines is as follows (built with
gcc-7.1.0).
CPU0 (po->has_vnet_hdr = !!val)            CPU1 (po->running = 0)
movzbl 0x6e0(%r15),%eax
                                                           andb
$0xfe,0x6e0(%r13)
shl    $0x3,%r12d
and    $0xfffffff7,%eax
or     %r12d,%eax
mov    %al,0x6e0(%r15)


Please, check out the following reproducer.
C repro code : https://kiwi.cs.purdue.edu/static/race-fuzzer/afpacket-setsockopt-bind-repro.c
kernel config v4.16-rc3 :
https://kiwi.cs.purdue.edu/static/race-fuzzer/afpacket-setsockopt-bind-v4.16-rc3.config
kernel config v4.16-rc7 :
https://kiwi.cs.purdue.edu/static/race-fuzzer/afpacket-setsockopt-bind-v4.16-rc7.config
kernel config v4.15.14 :
https://kiwi.cs.purdue.edu/static/race-fuzzer/afpacket-setsockopt-bind-v4.15.14.config


= About RaceFuzzer

RaceFuzzer is a customized version of Syzkaller, specifically tailored
to find race condition bugs in the Linux kernel. While we leverage
many different technique, the notable feature of RaceFuzzer is in
leveraging a custom hypervisor (QEMU/KVM) to interleave the
scheduling. In particular, we modified the hypervisor to intentionally
stall a per-core execution, which is similar to supporting per-core
breakpoint functionality. This allows RaceFuzzer to force the kernel
to deterministically trigger racy condition (which may rarely happen
in practice due to randomness in scheduling).

RaceFuzzer's C repro always pinpoints two racy syscalls. Since C
repro's scheduling synchronization should be performed at the user
space, its reproducibility is limited (reproduction may take from 1
second to 10 minutes (or even more), depending on a bug). This is
because, while RaceFuzzer precisely interleaves the scheduling at the
kernel's instruction level when finding this bug, C repro cannot fully
utilize such a feature. Please disregard all code related to
"should_hypercall" in the C repro, as this is only for our debugging
purposes using our own hypervisor.

On Sat, Mar 31, 2018 at 1:33 AM, DaeRyong Jeong <threeearcat@gmail.com> wrote:
> We report the crash: kernel BUG at
> /home/blee/project/race-fuzzer/kernels/kernel_v4.16-rc3/net/packet/af_packet.c:LINE!
>
> This crash has been found in v4.16-rc3 using RaceFuzzer (a modified
> version of Syzkaller), which we describe more at the end of this
> report. Our analysis shows that the race occurs when invoking two
> syscalls concurrently, (setsockopt$packet_int) and (bind$packet).
> We have confirmed that the kernel v4.16-rc3, v4.16-rc7, and v4.15.14
> built with gcc 7.1.0 are crashing by running the provided C repro
> program within a few minutes (5 minutes).
> Note that this crash can be triggered from the user space.
>
> C repro code : https://kiwi.cs.purdue.edu/static/race-fuzzer/afpacket-setsockopt-bind-repro.c
> kernel config v4.16-rc3 :
> https://kiwi.cs.purdue.edu/static/race-fuzzer/afpacket-setsockopt-bind-v4.16-rc3.config
> kernel config v4.16-rc7 :
> https://kiwi.cs.purdue.edu/static/race-fuzzer/afpacket-setsockopt-bind-v4.16-rc7.config
> kernel config v4.15.14 :
> https://kiwi.cs.purdue.edu/static/race-fuzzer/afpacket-setsockopt-bind-v4.15.14.config
>
> [  881.047513] ------------[ cut here ]------------
> [  881.048416] kernel BUG at
> /home/blee/project/race-fuzzer/kernels/kernel_v4.16-rc3/net/packet/af_packet.c:3107!
> [  881.050014] invalid opcode: 0000 [#1] SMP KASAN
> [  881.050698] Dumping ftrace buffer:
> [  881.051244]    (ftrace buffer empty)
> [  881.051768] Modules linked in:
> [  881.052236] CPU: 1 PID: 18247 Comm: syz-executor0 Not tainted 4.16.0-rc3 #1
> [  881.053247] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> [  881.054880] RIP: 0010:packet_do_bind+0x88d/0x950
> [  881.055553] RSP: 0018:ffff8802231d7b08 EFLAGS: 00010212
> [  881.056310] RAX: 0000000000010000 RBX: ffff8800af831740 RCX: ffffc900025ce000
> [  881.057318] RDX: 00000000000000a5 RSI: ffffffff838b257d RDI: 0000000000000001
> [  881.058301] RBP: ffff8802231d7c10 R08: ffff8802342f2480 R09: 0000000000000000
> [  881.059298] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802309f8f00
> [  881.060314] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000001000
> [  881.061320] FS:  00007f7fab50d700(0000) GS:ffff88023fc00000(0000)
> knlGS:0000000000000000
> [  881.062467] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  881.063285] CR2: 0000000020038000 CR3: 00000000b11c9000 CR4: 00000000000006e0
> [  881.064317] Call Trace:
> [  881.064686]  ? compat_packet_setsockopt+0x100/0x100
> [  881.065430]  ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
> [  881.066188]  packet_bind+0xa2/0xe0
> [  881.066690]  SYSC_bind+0x279/0x2f0
> [  881.067180]  ? move_addr_to_kernel.part.19+0xc0/0xc0
> [  881.067896]  ? do_futex+0x1e90/0x1e90
> [  881.068435]  ? SyS_sched_getaffinity+0xe3/0x100
> [  881.069112]  ? mark_held_locks+0x25/0xb0
> [  881.069677]  ? SyS_socketpair+0x4a0/0x4a0
> [  881.070265]  SyS_bind+0x24/0x30
> [  881.070732]  do_syscall_64+0x209/0x5d0
> [  881.071270]  ? syscall_return_slowpath+0x3e0/0x3e0
> [  881.071929]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
> [  881.072675]  ? syscall_return_slowpath+0x260/0x3e0
> [  881.073365]  ? mark_held_locks+0x25/0xb0
> [  881.073950]  ? entry_SYSCALL_64_after_hwframe+0x52/0xb7
> [  881.074693]  ? trace_hardirqs_off_caller+0xb5/0x120
> [  881.075390]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [  881.076079]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> [  881.076797] RIP: 0033:0x453909
> [  881.077238] RSP: 002b:00007f7fab50caf8 EFLAGS: 00000212 ORIG_RAX:
> 0000000000000031
> [  881.078268] RAX: ffffffffffffffda RBX: 00000000007080d8 RCX: 0000000000453909
> [  881.079239] RDX: 0000000000000014 RSI: 000000002001f000 RDI: 0000000000000015
> [  881.080268] RBP: 0000000000000250 R08: 0000000000000000 R09: 0000000000000000
> [  881.081256] R10: 0000000000000000 R11: 0000000000000212 R12: 00000000004a82d3
> [  881.082272] R13: 00000000ffffffff R14: 0000000000000015 R15: 000000002001f000
> [  881.083251] Code: c0 fd 48 c7 c2 00 c8 d9 84 be ab 02 00 00 48 c7
> c7 60 c8 d9 84 c6 05 e7 a2 48 02 01 e8 3f 17 af fd e9 60 fb ff ff e8
> 43 b3 c0 fd <0f> 0b e8 3c b3 c0 fd 48 8b bd 20 ff ff ff e8 60 1e e7 fd
> 4c 89
> [  881.085828] RIP: packet_do_bind+0x88d/0x950 RSP: ffff8802231d7b08
> [  881.086619] ---[ end trace 9c461502752b4f3e ]---
> [  881.087181] Kernel panic - not syncing: Fatal exception
> [  881.088352] Dumping ftrace buffer:
> [  881.088877]    (ftrace buffer empty)
> [  881.089414] Kernel Offset: disabled
> [  881.089950] Rebooting in 86400 seconds..
>
> = About RaceFuzzer
>
> RaceFuzzer is a customized version of Syzkaller, specifically tailored
> to find race condition bugs in the Linux kernel. While we leverage
> many different technique, the notable feature of RaceFuzzer is in
> leveraging a custom hypervisor (QEMU/KVM) to interleave the
> scheduling. In particular, we modified the hypervisor to intentionally
> stall a per-core execution, which is similar to supporting per-core
> breakpoint functionality. This allows RaceFuzzer to force the kernel
> to deterministically trigger racy condition (which may rarely happen
> in practice due to randomness in scheduling).
>
> RaceFuzzer's C repro always pinpoints two racy syscalls. Since C
> repro's scheduling synchronization should be performed at the user
> space, its reproducibility is limited (reproduction may take from 1
> second to 10 minutes (or even more), depending on a bug). This is
> because, while RaceFuzzer precisely interleaves the scheduling at the
> kernel's instruction level when finding this bug, C repro cannot fully
> utilize such a feature. Please disregard all code related to
> "should_hypercall" in the C repro, as this is only for our debugging
> purposes using our own hypervisor.

^ permalink raw reply

* Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Siwei Liu @ 2018-04-19  6:43 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, David Miller, David Ahern, Jiri Pirko,
	si-wei liu, Stephen Hemminger, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Netdev, virtualization, virtio-dev
In-Reply-To: <5d5f1e52-bddf-5551-bdea-8e7b408b39b9@intel.com>

On Wed, Apr 18, 2018 at 11:10 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
> On 4/18/2018 10:07 PM, Michael S. Tsirkin wrote:
>>
>> On Wed, Apr 18, 2018 at 10:00:51PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
>>>>
>>>> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>>>>>
>>>>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>>>>>
>>>>>> I ran this with a few folks offline and gathered some good feedbacks
>>>>>> that I'd like to share thus revive the discussion.
>>>>>>
>>>>>> First of all, as illustrated in the reply below, cloud service
>>>>>> providers require transparent live migration. Specifically, the main
>>>>>> target of our case is to support SR-IOV live migration via kernel
>>>>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>>>>> because this use case is not appealing enough for the mainline to
>>>>>> adopt, I will shut up and not continue discussing, although
>>>>>> technically it's entirely possible (and there's precedent in other
>>>>>> implementation) to do so to benefit any cloud service providers.
>>>>>>
>>>>>> If it's just the implementation of hiding netdev itself needs to be
>>>>>> improved, such as implementing it as attribute flag or adding linkdump
>>>>>> API, that's completely fine and we can look into that. However, the
>>>>>> specific issue needs to be undestood beforehand is to make transparent
>>>>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>>>>> from the lower netdev, which needs some games with uevents and name
>>>>>> space reservation. So far I don't think it's been well discussed.
>>>>>>
>>>>>> One thing in particular I'd like to point out is that the 3-netdev
>>>>>> model currently missed to address the core problem of live migration:
>>>>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>>>>> and hardware offloading states. Only general network state (IP
>>>>>> address, gateway, for eg.) associated with the bypass interface can be
>>>>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>>>>> save and apply those hardware specific configs before or after
>>>>>> migration as needed. The transparent 1-netdev model being proposed as
>>>>>> part of this patch series will be able to solve that problem naturally
>>>>>> by making all hardware specific configurations go through the central
>>>>>> bypass driver, such that hardware configurations can be replayed when
>>>>>> new VF or passthrough gets plugged back in. Although that
>>>>>> corresponding function hasn't been implemented today, I'd like to
>>>>>> refresh everyone's mind that is the core problem any live migration
>>>>>> proposal should have addressed.
>>>>>>
>>>>>> If it would make things more clear to defer netdev hiding until all
>>>>>> functionalities regarding centralizing and replay are implemented,
>>>>>> we'd take advices like that and move on to implementing those features
>>>>>> as follow-up patches. Once all needed features get done, we'd resume
>>>>>> the work for hiding lower netdev at that point. Think it would be the
>>>>>> best to make everyone understand the big picture in advance before
>>>>>> going too far.
>>>>>
>>>>> I think we should get the 3-netdev model integrated and add any
>>>>> additional
>>>>> ndo_ops/ethool ops that we would like to support/migrate before looking
>>>>> into
>>>>> hiding the lower netdevs.
>>>>
>>>> Once they are exposed, I don't think we'll be able to hide them -
>>>> they will be a kernel ABI.
>>>>
>>>> Do you think everyone needs to hide the SRIOV device?
>>>> Or that only some users need this?
>>>
>>> Hyper-V is currently supporting live migration without hiding the SR-IOV
>>> device. So i don't
>>> think it is a hard requirement.
>>
>> OK, fine.
>>
>>> And also,  as we don't yet have a consensus on how to hide
>>> the lower netdevs, we could make it as another feature bit to hide lower
>>> netdevs once
>>> we have an acceptable solution.
>>
>> Guest/host interface isn't more flexible than the userspace/kernel
>> interface.  The feature bit you propose would say what exactly?
>> Hypervisor has no idea what guest kernel shows guest userspace.
>> Note that the backup flag doesn't tell guest kernel what to do,
>> it just tells guest that there is or will be a faster main device
>> connected to the same backend, so the backup should only be used
>> when main device is not present.
>
>
> The current bypass module supports 3-netdev and 2-netdev models via 2 sets
> of interfaces
> bypass_master_create/destroy and bypass_master_register/unregister.  So
> theoretically
> we can support the 2 models via 2 different feature bits. BACKUP and
> BACKUP_2_NETDEV.

I'm still trying to understand the value of so many models to support.
If we all agree eventually the transparent 1-netdev model can address
the more general case while 2-netdev or 3-netdev is unable to, what's
the point for supporting these many features?

-Siwei
>
> Similarly if we can figure out a way to hide both the netdevs, we can add
> BACKUP_1_NETDEV
> feature bit and update the bypass module to provide another set of
> interfaces that can
> be used by virtio_net to support this model.
>
> Now that we are leaning towards 'standby' as the name for the lower
> virtio-net, should we
> change the feature bit name also to VIRTIO_NET_F_STANDBY?
>
>>
>

^ permalink raw reply

* Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
From: Johannes Berg @ 2018-04-19  6:38 UTC (permalink / raw)
  To: Ben Greear, netdev; +Cc: linux-wireless, ath10k
In-Reply-To: <f43d7ee4-7f27-b145-125e-32dbc31a27f1@candelatech.com>

On Wed, 2018-04-18 at 14:51 -0700, Ben Greear wrote:

> > It'd be pretty hard to know which flags are firmware stats?
> 
> Yes, it is, but ethtool stats are difficult to understand in a generic
> manner anyway, so someone using them is already likely aware of low-level
> details of the driver(s) they are using.

Right. Come to think of it though,

> + * @get_ethtool_stats2: Return extended statistics about the device.
> + *     This is only useful if the device maintains statistics not
> + *     included in &struct rtnl_link_stats64.
> + *      Takes a flags argument:  0 means all (same as get_ethtool_stats),
> + *      0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats.
> + *      Other flags are reserved for now.
> + *      Same number of stats will be returned, but some of them might
> + *      not be as accurate/refreshed.  This is to allow not querying
> + *      firmware or other expensive-to-read stats, for instance.

"skip" vs. "don't refresh" is a bit ambiguous - I'd argue better to
either really skip and not return the non-refreshed ones (also helps
with the identifying), or rename the flag.

Also, wrt. the rest of the patch, I'd argue that it'd be worthwhile to
write the spatch and just add the flags argument to "get_ethtool_stats"
instead of adding a separate method - internally to the kernel it's not
that hard to change.

> I posted the patches to netdev, ath10k and linux-wireless.  If I had only
> posted them individually to different lists I figure I'd be hearing about how
> the netdev patch is useless because it has no driver support, etc.

Sure. I missed netdev, perhaps because it was in To, or more likely
because I was too sleepy. Sorry for the noise.

johannes

^ permalink raw reply

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Jiri Pirko @ 2018-04-19  6:35 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, stephen, davem, netdev, virtualization,
	virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
	jasowang, loseweigh
In-Reply-To: <ff0c5ea1-16d4-00a7-9952-9049efa818eb@intel.com>

Thu, Apr 19, 2018 at 12:46:11AM CEST, sridhar.samudrala@intel.com wrote:
>On 4/18/2018 1:32 PM, Jiri Pirko wrote:
>> > > > > > > You still use "active"/"backup" names which is highly misleading as
>> > > > > > > it has completely different meaning that in bond for example.
>> > > > > > > I noted that in my previous review already. Please change it.
>> > > > > > I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>> > > > > > matches with the BACKUP feature bit we are adding to virtio_net.
>> > > > > I think that "backup" is also misleading. Both "active" and "backup"
>> > > > > mean a *state* of slaves. This should be named differently.
>> > > > > 
>> > > > > 
>> > > > > 
>> > > > > > With regards to alternate names for 'active', you suggested 'stolen', but i
>> > > > > > am not too happy with it.
>> > > > > > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> > > > > No. The netdev could be any netdevice. It does not have to be a "VF".
>> > > > > I think "stolen" is quite appropriate since it describes the modus
>> > > > > operandi. The bypass master steals some netdevice according to some
>> > > > > match.
>> > > > > 
>> > > > > But I don't insist on "stolen". Just sounds right.
>> > > > We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>> > > > 'backup' name is consistent.
>> > > It perhaps makes sense from the view of virtio device. However, as I
>> > > described couple of times, for master/slave device the name "backup" is
>> > > highly misleading.
>> > virtio is the backup. You are supposed to use another
>> > (typically passthrough) device, if that fails use virtio.
>> > It does seem appropriate to me. If you like, we can
>> > change that to "standby".  Active I don't like either. "main"?
>> Sounds much better, yes.
>
>OK. Will change backup to 'standby'.
>'main' is fine, what about 'primary'?

Primary is also bonding terminology. But in this case, I think it would
fit. The primary slave is used as the active one whenever the link is
up.


>
>
>> 
>> 
>> > In fact would failover be better than bypass?
>> Also, much better.
>
>So do we want to change all 'bypass' references to 'failover' including
>the filenames.(net/core/failover.c and include/net/failover.h)
>
><snip>
>
>
>
>> 
>> 
>> > 
>> > > > The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>> > > > a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>> > > > 
>> > > > Will look for any suggestions in the next day or two. If i don't get any, i will go
>> > > > with 'stolen'
>> > > > 
>> > > > <snip>
>> > > > 
>> > > > 
>> > > > > +
>> > > > > +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> > > > > +						  struct bypass_ops **ops)
>> > > > > +{
>> > > > > +	struct bypass_master *bypass_master;
>> > > > > +	struct net_device *bypass_netdev;
>> > > > > +
>> > > > > +	spin_lock(&bypass_lock);
>> > > > > +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> > > > > > > As I wrote the last time, you don't need this list, spinlock.
>> > > > > > > You can do just something like:
>> > > > > > >            for_each_net(net) {
>> > > > > > >                    for_each_netdev(net, dev) {
>> > > > > > > 			if (netif_is_bypass_master(dev)) {
>> > > > > > This function returns the upper netdev as well as the ops associated
>> > > > > > with that netdev.
>> > > > > > bypass_master_list is a list of 'struct bypass_master' that associates
>> > > > > Well, can't you have it in netdev priv?
>> > > > We cannot do this for 2-netdev model as there is no bypass_netdev created.
>> > > Howcome? You have no master? I don't understand..
>
>For 2-netdev model, the master netdev is not a new one created by the bypass module.
>It is created by netvsc internally and passed via bypass_master_register()

But virtio_net alho has to create the master and pass it down to the
bypass module. Howcome it is different?


>
><snip>
>
>
>
>> > > 
>> > > > > > > > +
>> > > > > > > > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> > > > > > > > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> > > > > > > Yeah, this is certainly incorrect. One thing is, you should be using the
>> > > > > > > helpers netif_is_bond_master().
>> > > > > > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> > > > > > > 
>> > > > > > > You need to do it not by blacklisting, but with whitelisting. You need
>> > > > > > > to whitelist VF devices. My port flavours patchset might help with this.
>> > > > > > May be i can use netdev_has_lower_dev() helper to make sure that the slave
>> > > > > I don't see such function in the code.
>> > > > It is netdev_has_any_lower_dev(). I need to export it.
>> > > Come on, you cannot use that. That would allow bonding without slaves,
>> > > but the slaves could be added later on.
>> > > 
>> > > What exactly you are trying to achieve by this?
>
>I think i can remove this check.  In pre-register,
>for backup device, i check that its parent matches bypass device &
>for vf device, we make sure that it is a pci device.

Okay. That is a start.


>
>

^ permalink raw reply

* Re: WARNING in refcount_dec
From: DaeRyong Jeong @ 2018-04-19  6:32 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Cong Wang, Byoungyoung Lee, LKML, Kyungtae Kim,
	Linux Kernel Network Developers, Willem de Bruijn
In-Reply-To: <CACsK=jeTu1mhOXXAFAkp5nGVG6WbcV+dJZmTavdjWerHhDQfVg@mail.gmail.com>

Hello.
We have analyzed the cause of the crash in v4.16-rc3, WARNING in refcount_dec,
which is found by RaceFuzzer (a modified version of Syzkaller).

Since struct packet_sock's member variables, running, has_vnet_hdr, origdev
and auxdata are declared as bitfields, accessing these variables can race if
there is no synchronization mechanism.

We think racing between following lines in af_packet.c causes the crash.
In function __unregister_prot_hook,
    po->running = 0;
In function packet_setsockopt,
    po->has_vnet_hdr = !!val;

Analysis:
CPU0
pakcet_setsockopt
    po->has_vnet_hdr = !!val;

CPU1
packet_setsockop
    packet_set_ring
        __unregister_prot_hook
            po->running = 0;

In the CPU1, the value of po->running should become 0, but because of racing,
it is possible that po->running can keep the value 1.
Consequently, the followings can happen.
- When packet_set_ring calls register_prot_hook, register_prot_hook
return immediately without calling sock_hold(sk).
- When packet_release is called, __unregister_prot_hook will be called
because po->running == 1 and sk->sk_refcnt hits zero.


Possible interleaving between racy C source lines is as follows (built
with gcc-7.1.0).
CPU0 (po->has_vnet_hdr = !!val)                  CPU1 (po->running = 0)
movzbl 0x6e0(%r15),%eax
                                                                 andb
 $0xfe,0x6e0(%r13)
shl    $0x3,%r12d
and    $0xfffffff7,%eax
or     %r12d,%eax
mov    %al,0x6e0(%r15)


Please, check out the following reproducer.
C repro code : https://kiwi.cs.purdue.edu/static/race-fuzzer/repro-refcount_dec.c
kernel config: https://kiwi.cs.purdue.edu/static/race-fuzzer/kernel-config-v4.16-rc3

Since there is a small room to race, it may take a long time to
reproduce the crash.


= About RaceFuzzer

RaceFuzzer is a customized version of Syzkaller, specifically tailored
to find race condition bugs in the Linux kernel. While we leverage
many different technique, the notable feature of RaceFuzzer is in
leveraging a custom hypervisor (QEMU/KVM) to interleave the
scheduling. In particular, we modified the hypervisor to intentionally
stall a per-core execution, which is similar to supporting per-core
breakpoint functionality. This allows RaceFuzzer to force the kernel
to deterministically trigger racy condition (which may rarely happen
in practice due to randomness in scheduling).

RaceFuzzer's C repro always pinpoints two racy syscalls. Since C
repro's scheduling synchronization should be performed at the user
space, its reproducibility is limited (reproduction may take from 1
second to 10 minutes (or even more), depending on a bug). This is
because, while RaceFuzzer precisely interleaves the scheduling at the
kernel's instruction level when finding this bug, C repro cannot fully
utilize such a feature. Please disregard all code related to
"should_hypercall" in the C repro, as this is only for our debugging
purposes using our own hypervisor.


On Tue, Apr 3, 2018 at 1:12 PM, DaeRyong Jeong <threeearcat@gmail.com> wrote:
> No. Only the first crash (WARNING in refcount_dec) is reproduced by
> the attached reproducer.
>
> The second crash (kernel bug at af_packet.c:3107) is reproduced by
> another reproducer.
> We reported it here.
> http://lkml.iu.edu/hypermail/linux/kernel/1803.3/05324.html
>
> On Sun, Apr 1, 2018 at 4:38 PM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>> On Thu, Mar 29, 2018 at 1:16 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>> (Cc'ing netdev and Willem)
>>>
>>> On Wed, Mar 28, 2018 at 12:03 PM, Byoungyoung Lee
>>> <byoungyoung@purdue.edu> wrote:
>>>> Another crash patterns observed: race between (setsockopt$packet_int)
>>>> and (bind$packet).
>>>>
>>>> ------------------------------
>>>> [  357.731597] kernel BUG at
>>>> /home/blee/project/race-fuzzer/kernels/kernel_v4.16-rc3/net/packet/af_packet.c:3107!
>>>> [  357.733382] invalid opcode: 0000 [#1] SMP KASAN
>>>> [  357.734017] Modules linked in:
>>>> [  357.734662] CPU: 1 PID: 3871 Comm: repro.exe Not tainted 4.16.0-rc3 #1
>>>> [  357.735791] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>> BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
>>>> [  357.737434] RIP: 0010:packet_do_bind+0x88d/0x950
>>>> [  357.738121] RSP: 0018:ffff8800b2787b08 EFLAGS: 00010293
>>>> [  357.738906] RAX: ffff8800b2fdc780 RBX: ffff880234358cc0 RCX: ffffffff838b244c
>>>> [  357.739905] RDX: 0000000000000000 RSI: ffffffff838b257d RDI: 0000000000000001
>>>> [  357.741315] RBP: ffff8800b2787c10 R08: ffff8800b2fdc780 R09: 0000000000000000
>>>> [  357.743055] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88023352ecc0
>>>> [  357.744744] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000001d00
>>>> [  357.746377] FS:  00007f4b43733700(0000) GS:ffff8800b8b00000(0000)
>>>> knlGS:0000000000000000
>>>> [  357.749599] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [  357.752096] CR2: 0000000020058000 CR3: 00000002334b8000 CR4: 00000000000006e0
>>>> [  357.755045] Call Trace:
>>>> [  357.755822]  ? compat_packet_setsockopt+0x100/0x100
>>>> [  357.757324]  ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
>>>> [  357.758810]  packet_bind+0xa2/0xe0
>>>> [  357.759640]  SYSC_bind+0x279/0x2f0
>>>> [  357.760364]  ? move_addr_to_kernel.part.19+0xc0/0xc0
>>>> [  357.761491]  ? __handle_mm_fault+0x25d0/0x25d0
>>>> [  357.762449]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>>> [  357.763663]  ? __do_page_fault+0x417/0xba0
>>>> [  357.764569]  ? vmalloc_fault+0x910/0x910
>>>> [  357.765405]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>>> [  357.766525]  ? mark_held_locks+0x25/0xb0
>>>> [  357.767336]  ? SyS_socketpair+0x4a0/0x4a0
>>>> [  357.768182]  SyS_bind+0x24/0x30
>>>> [  357.768851]  do_syscall_64+0x209/0x5d0
>>>> [  357.769650]  ? syscall_return_slowpath+0x3e0/0x3e0
>>>> [  357.770665]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>>> [  357.771779]  ? syscall_return_slowpath+0x260/0x3e0
>>>> [  357.772748]  ? mark_held_locks+0x25/0xb0
>>>> [  357.773581]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>>> [  357.774720]  ? retint_user+0x18/0x18
>>>> [  357.775493]  ? trace_hardirqs_off_caller+0xb5/0x120
>>>> [  357.776567]  ? trace_hardirqs_off_thunk+0x1a/0x1c
>>>> [  357.777512]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>> [  357.778508] RIP: 0033:0x4503a9
>>>> [  357.779156] RSP: 002b:00007f4b43732ce8 EFLAGS: 00000246 ORIG_RAX:
>>>> 0000000000000031
>>>> [  357.780737] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004503a9
>>>> [  357.782169] RDX: 0000000000000014 RSI: 0000000020058000 RDI: 0000000000000003
>>>> [  357.783710] RBP: 00007f4b43732d10 R08: 0000000000000000 R09: 0000000000000000
>>>> [  357.785202] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>>>> [  357.786664] R13: 0000000000000000 R14: 00007f4b437339c0 R15: 00007f4b43733700
>>>> [  357.788210] Code: c0 fd 48 c7 c2 00 c8 d9 84 be ab 02 00 00 48 c7
>>>> c7 60 c8 d9 84 c6 05 e7 a2 48 02 01 e8 3f 17 af fd e9 60 fb ff ff e8
>>>> 43 b3 c0 fd <0f> 0b e8 3c b3 c0 fd 48 8b bd 20 ff ff ff e8 60 1e e7 fd
>>>> 4c 89
>>>> [  357.792260] RIP: packet_do_bind+0x88d/0x950 RSP: ffff8800b2787b08
>>>> [  357.793698] ---[ end trace 0c5a2539f0247369 ]---
>>>> [  357.794696] Kernel panic - not syncing: Fatal exception
>>>> [  357.795918] Kernel Offset: disabled
>>>> [  357.796614] Rebooting in 86400 seconds..
>>>>
>>>> On Wed, Mar 28, 2018 at 1:19 AM, Byoungyoung Lee <byoungyoung@purdue.edu> wrote:
>>>>> We report the crash: WARNING in refcount_dec
>>>>>
>>>>> This crash has been found in v4.16-rc3 using RaceFuzzer (a modified
>>>>> version of Syzkaller), which we describe more at the end of this
>>>>> report. Our analysis shows that the race occurs when invoking two
>>>>> syscalls concurrently, (setsockopt$packet_int) and
>>>>> (setsockopt$packet_rx_ring).
>>>>>
>>>>> C repro code : https://kiwi.cs.purdue.edu/static/race-fuzzer/repro-refcount_dec.c
>>>>> kernel config: https://kiwi.cs.purdue.edu/static/race-fuzzer/kernel-config-v4.16-rc3
>>>
>>>
>>> I tried your reproducer, no luck here.
>>>
>>
>> Are both crashes with the same reproducer?
>>
>> It races setsockopt PACKET_RX_RING with PACKET_VNET_HDR.
>>
>> There have been previous bug fixes for other setsockopts racing with
>> ring creation. The change would be
>>
>> @@ -3763,14 +3763,19 @@ packet_setsockopt(struct socket *sock, int
>> level, int optname, char __user *optv
>>
>>                 if (sock->type != SOCK_RAW)
>>                         return -EINVAL;
>> -               if (po->rx_ring.pg_vec || po->tx_ring.pg_vec)
>> -                       return -EBUSY;
>>                 if (optlen < sizeof(val))
>>                         return -EINVAL;
>>                 if (copy_from_user(&val, optval, sizeof(val)))
>>                         return -EFAULT;
>>
>> -               po->has_vnet_hdr = !!val;
>> +               lock_sock(sk);
>> +               if (po->rx_ring.pg_vec || po->tx_ring.pg_vec) {
>> +                       ret = -EBUSY;
>> +               } else {
>> +                       po->has_vnet_hdr = !!val;
>> +                       ret = 0;
>> +               }
>> +               release_sock(sk);
>>                 return 0;
>>         }
>>
>> But I do not immediately see why these concurrent operations
>> would be unsafe.
>>
>> The program races a lot more complex operations, like bind and
>> close. So the specific setsockopt may be a red herring.
>>
>> I'm traveling; haven't been able to setup your fuzzer and run the
>> repro locally yet.
>>
>>>
>>>
>>>>>
>>>>> ---------------------------------------
>>>>> [  305.838560] refcount_t: decrement hit 0; leaking memory.
>>>>> [  305.839669] WARNING: CPU: 0 PID: 3867 at
>>>>> /home/blee/project/race-fuzzer/kernels/kernel_v4.16-rc3/lib/refcount.c:228
>>>>> refcount_dec+0x62/0x70
>>>>> [  305.841441] Modules linked in:
>>>>> [  305.841883] CPU: 0 PID: 3867 Comm: repro.exe Not tainted 4.16.0-rc3 #1
>>>>> [  305.842803] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>>> BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
>>>>> [  305.844345] RIP: 0010:refcount_dec+0x62/0x70
>>>>> [  305.845005] RSP: 0018:ffff880224d374f8 EFLAGS: 00010286
>>>>> [  305.845802] RAX: 000000000000002c RBX: 0000000000000000 RCX: ffffffff81538991
>>>>> [  305.846768] RDX: 0000000000000000 RSI: ffffffff813cd761 RDI: 0000000000000005
>>>>> [  305.847748] RBP: ffff880224d37500 R08: ffff88023169a440 R09: 0000000000000000
>>>>> [  305.848748] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88023473ad40
>>>>> [  305.849738] R13: ffff88023473b368 R14: 0000000000000001 R15: 0000000000000000
>>>>> [  305.850733] FS:  0000000000c6e940(0000) GS:ffff8800b8a00000(0000)
>>>>> knlGS:0000000000000000
>>>>> [  305.851837] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [  305.852652] CR2: 00007fb120571db8 CR3: 0000000005422000 CR4: 00000000000006f0
>>>>> [  305.853619] Call Trace:
>>>>> [  305.854086]  __unregister_prot_hook+0x15f/0x1d0
>>>>> [  305.854722]  packet_release+0x77a/0x7a0
>>>>> [  305.855335]  ? mark_held_locks+0x25/0xb0
>>>>> [  305.855883]  ? packet_lookup_frame+0x110/0x110
>>>>> [  305.856576]  ? __lock_is_held+0x39/0xc0
>>>>> [  305.857109]  ? note_gp_changes+0x300/0x300
>>>>> [  305.857745]  ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
>>>>> [  305.858548]  ? locks_remove_file+0x31b/0x420
>>>>> [  305.859138]  ? fcntl_setlk+0xad0/0xad0
>>>>> [  305.859743]  ? trace_event_raw_event_sched_switch+0x320/0x320
>>>>> [  305.860534]  ? fsnotify_first_mark+0x2c0/0x2c0
>>>>> [  305.861234]  sock_release+0x53/0xe0
>>>>> [  305.861711]  ? sock_alloc_file+0x2c0/0x2c0
>>>>> [  305.862334]  sock_close+0x16/0x20
>>>>> [  305.862801]  __fput+0x246/0x4e0
>>>>> [  305.863310]  ? fput+0x130/0x130
>>>>> [  305.863743]  ? trace_event_raw_event_sched_switch+0x320/0x320
>>>>> [  305.864604]  ____fput+0x15/0x20
>>>>> [  305.865046]  task_work_run+0x1a5/0x200
>>>>> [  305.865636]  ? kmem_cache_free+0x25c/0x2d0
>>>>> [  305.866194]  ? task_work_cancel+0x1a0/0x1a0
>>>>> [  305.866829]  ? switch_task_namespaces+0x9e/0xb0
>>>>> [  305.867458]  do_exit+0xacf/0x10d0
>>>>> [  305.868023]  ? mm_update_next_owner+0x650/0x650
>>>>> [  305.868642]  ? __pv_queued_spin_lock_slowpath+0xbf0/0xbf0
>>>>> [  305.869427]  ? trace_hardirqs_on_caller+0x136/0x2a0
>>>>> [  305.870102]  ? trace_hardirqs_on+0xd/0x10
>>>>> [  305.870701]  ? wake_up_new_task+0x41c/0x650
>>>>> [  305.871324]  ? to_ratio+0x20/0x20
>>>>> [  305.871816]  ? lock_release+0x530/0x530
>>>>> [  305.872341]  ? trace_event_raw_event_sched_switch+0x320/0x320
>>>>> [  305.873161]  ? match_held_lock+0x7e/0x360
>>>>> [  305.873777]  ? trace_hardirqs_off+0x10/0x10
>>>>> [  305.874359]  ? put_pid+0x111/0x140
>>>>> [  305.874897]  ? task_active_pid_ns+0x70/0x70
>>>>> [  305.875500]  ? find_held_lock+0xca/0xf0
>>>>> [  305.876118]  ? do_group_exit+0x1f9/0x260
>>>>> [  305.876650]  ? lock_downgrade+0x380/0x380
>>>>> [  305.877297]  ? task_clear_jobctl_pending+0xb5/0xd0
>>>>> [  305.877951]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>>>> [  305.878725]  ? do_raw_spin_unlock+0x112/0x1a0
>>>>> [  305.879309]  ? do_raw_spin_trylock+0x100/0x100
>>>>> [  305.879969]  ? mark_held_locks+0x25/0xb0
>>>>> [  305.880505]  ? force_sig+0x30/0x30
>>>>> [  305.881054]  ? _raw_spin_unlock_irq+0x27/0x50
>>>>> [  305.881671]  ? trace_hardirqs_on_caller+0x136/0x2a0
>>>>> [  305.882412]  do_group_exit+0xfb/0x260
>>>>> [  305.882945]  ? SyS_exit+0x30/0x30
>>>>> [  305.883442]  ? find_mergeable_anon_vma+0x90/0x90
>>>>> [  305.884103]  ? SyS_read+0x1c0/0x1c0
>>>>> [  305.884785]  ? mark_held_locks+0x25/0xb0
>>>>> [  305.885503]  ? do_syscall_64+0xb2/0x5d0
>>>>> [  305.886093]  ? do_group_exit+0x260/0x260
>>>>> [  305.886741]  SyS_exit_group+0x1d/0x20
>>>>> [  305.887255]  do_syscall_64+0x209/0x5d0
>>>>> [  305.887888]  ? syscall_return_slowpath+0x3e0/0x3e0
>>>>> [  305.888611]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>>>> [  305.889420]  ? syscall_return_slowpath+0x260/0x3e0
>>>>> [  305.890188]  ? mark_held_locks+0x25/0xb0
>>>>> [  305.890724]  ? entry_SYSCALL_64_after_hwframe+0x52/0xb7
>>>>> [  305.891556]  ? trace_hardirqs_off_caller+0xb5/0x120
>>>>> [  305.892265]  ? trace_hardirqs_off_thunk+0x1a/0x1c
>>>>> [  305.892939]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>>> [  305.893676] RIP: 0033:0x44d989
>>>>> [  305.894100] RSP: 002b:00000000007fff38 EFLAGS: 00000202 ORIG_RAX:
>>>>> 00000000000000e7
>>>>> [  305.895158] RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000044d989
>>>>> [  305.896174] RDX: 00007fb120d739c0 RSI: 00007fb120572700 RDI: 0000000000000001
>>>>> [  305.897161] RBP: 00000000007fff60 R08: 00007fb120572700 R09: 0000000000000000
>>>>> [  305.898128] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
>>>>> [  305.899464] R13: 000000000040d270 R14: 000000000040d300 R15: 0000000000000000
>>>>> [  305.900823] Code: b6 1d 81 9a ef 03 31 ff 89 de e8 ca a3 67 ff 84
>>>>> db 75 df e8 f1 a2 67 ff 48 c7 c7 60 8f 83 84 c6 05 61 9a ef 03 01 e8
>>>>> ee 5f 49 ff <0f> 0b eb c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41
>>>>> 57 49
>>>>> [  305.904324] ---[ end trace 360c084b02d93021 ]---
>>>>> [  305.919117] ------------[ cut here ]------------
>>>>> [  305.920120] refcount_t: underflow; use-after-free.
>>>>> [  305.921335] WARNING: CPU: 0 PID: 3867 at
>>>>> /home/blee/project/race-fuzzer/kernels/kernel_v4.16-rc3/lib/refcount.c:187
>>>>> refcount_sub_and_test+0x1ec/0x200
>>>>> [  305.923927] Modules linked in:
>>>>> [  305.924611] CPU: 0 PID: 3867 Comm: repro.exe Tainted: G        W
>>>>>     4.16.0-rc3 #1
>>>>> [  305.925987] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>>> BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
>>>>> [  305.928119] RIP: 0010:refcount_sub_and_test+0x1ec/0x200
>>>>> [  305.929124] RSP: 0018:ffff880224d374a0 EFLAGS: 00010282
>>>>> [  305.930161] RAX: 0000000000000026 RBX: 0000000000000000 RCX: ffffffff813c9644
>>>>> [  305.931504] RDX: 0000000000000000 RSI: ffffffff813cd761 RDI: ffff880224d37018
>>>>> [  305.932942] RBP: ffff880224d37538 R08: ffff88023169a440 R09: 0000000000000000
>>>>> [  305.934365] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffff
>>>>> [  305.935734] R13: ffff88023473adc0 R14: 0000000000000001 R15: 1ffff100449a6e96
>>>>> [  305.937114] FS:  0000000000c6e940(0000) GS:ffff8800b8a00000(0000)
>>>>> knlGS:0000000000000000
>>>>> [  305.938668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [  305.939768] CR2: 00007fb120571db8 CR3: 0000000005422000 CR4: 00000000000006f0
>>>>> [  305.941212] Call Trace:
>>>>> [  305.941689]  ? refcount_inc+0x70/0x70
>>>>> [  305.942216]  ? skb_dequeue+0xa5/0xc0
>>>>> [  305.942713]  refcount_dec_and_test+0x1a/0x20
>>>>> [  305.943295]  packet_release+0x702/0x7a0
>>>>> [  305.943816]  ? mark_held_locks+0x25/0xb0
>>>>> [  305.944378]  ? packet_lookup_frame+0x110/0x110
>>>>> [  305.945021]  ? __lock_is_held+0x39/0xc0
>>>>> [  305.945561]  ? note_gp_changes+0x300/0x300
>>>>> [  305.946132]  ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
>>>>> [  305.946866]  ? locks_remove_file+0x31b/0x420
>>>>> [  305.947464]  ? fcntl_setlk+0xad0/0xad0
>>>>> [  305.948000]  ? trace_event_raw_event_sched_switch+0x320/0x320
>>>>> [  305.948781]  ? fsnotify_first_mark+0x2c0/0x2c0
>>>>> [  305.949386]  sock_release+0x53/0xe0
>>>>> [  305.949866]  ? sock_alloc_file+0x2c0/0x2c0
>>>>> [  305.950437]  sock_close+0x16/0x20
>>>>> [  305.950906]  __fput+0x246/0x4e0
>>>>> [  305.951360]  ? fput+0x130/0x130
>>>>> [  305.951807]  ? trace_event_raw_event_sched_switch+0x320/0x320
>>>>> [  305.952620]  ____fput+0x15/0x20
>>>>> [  305.953071]  task_work_run+0x1a5/0x200
>>>>> [  305.953585]  ? kmem_cache_free+0x25c/0x2d0
>>>>> [  305.954143]  ? task_work_cancel+0x1a0/0x1a0
>>>>> [  305.954714]  ? switch_task_namespaces+0x9e/0xb0
>>>>> [  305.955334]  do_exit+0xacf/0x10d0
>>>>> [  305.955801]  ? mm_update_next_owner+0x650/0x650
>>>>> [  305.956431]  ? __pv_queued_spin_lock_slowpath+0xbf0/0xbf0
>>>>> [  305.957157]  ? trace_hardirqs_on_caller+0x136/0x2a0
>>>>> [  305.957811]  ? trace_hardirqs_on+0xd/0x10
>>>>> [  305.958360]  ? wake_up_new_task+0x41c/0x650
>>>>> [  305.958937]  ? to_ratio+0x20/0x20
>>>>> [  305.959391]  ? lock_release+0x530/0x530
>>>>> [  305.959924]  ? trace_event_raw_event_sched_switch+0x320/0x320
>>>>> [  305.960693]  ? match_held_lock+0x7e/0x360
>>>>> [  305.961244]  ? trace_hardirqs_off+0x10/0x10
>>>>> [  305.961810]  ? put_pid+0x111/0x140
>>>>> [  305.962277]  ? task_active_pid_ns+0x70/0x70
>>>>> [  305.962862]  ? find_held_lock+0xca/0xf0
>>>>> [  305.963396]  ? do_group_exit+0x1f9/0x260
>>>>> [  305.963933]  ? lock_downgrade+0x380/0x380
>>>>> [  305.964508]  ? task_clear_jobctl_pending+0xb5/0xd0
>>>>> [  305.965147]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>>>> [  305.965871]  ? do_raw_spin_unlock+0x112/0x1a0
>>>>> [  305.966459]  ? do_raw_spin_trylock+0x100/0x100
>>>>> [  305.967060]  ? mark_held_locks+0x25/0xb0
>>>>> [  305.967592]  ? force_sig+0x30/0x30
>>>>> [  305.968135]  ? _raw_spin_unlock_irq+0x27/0x50
>>>>> [  305.968741]  ? trace_hardirqs_on_caller+0x136/0x2a0
>>>>> [  305.969470]  do_group_exit+0xfb/0x260
>>>>> [  305.969987]  ? SyS_exit+0x30/0x30
>>>>> [  305.970505]  ? find_mergeable_anon_vma+0x90/0x90
>>>>> [  305.971126]  ? SyS_read+0x1c0/0x1c0
>>>>> [  305.971718]  ? mark_held_locks+0x25/0xb0
>>>>> [  305.972259]  ? do_syscall_64+0xb2/0x5d0
>>>>> [  305.972843]  ? do_group_exit+0x260/0x260
>>>>> [  305.973374]  SyS_exit_group+0x1d/0x20
>>>>> [  305.973932]  do_syscall_64+0x209/0x5d0
>>>>> [  305.974452]  ? syscall_return_slowpath+0x3e0/0x3e0
>>>>> [  305.975149]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>>>> [  305.975941]  ? syscall_return_slowpath+0x260/0x3e0
>>>>> [  305.976669]  ? mark_held_locks+0x25/0xb0
>>>>> [  305.977206]  ? entry_SYSCALL_64_after_hwframe+0x52/0xb7
>>>>> [  305.977978]  ? trace_hardirqs_off_caller+0xb5/0x120
>>>>> [  305.978690]  ? trace_hardirqs_off_thunk+0x1a/0x1c
>>>>> [  305.979381]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>>> [  305.980114] RIP: 0033:0x44d989
>>>>> [  305.980531] RSP: 002b:00000000007fff38 EFLAGS: 00000202 ORIG_RAX:
>>>>> 00000000000000e7
>>>>> [  305.981664] RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000044d989
>>>>> [  305.982655] RDX: 00007fb120d739c0 RSI: 00007fb120572700 RDI: 0000000000000001
>>>>> [  305.983654] RBP: 00000000007fff60 R08: 00007fb120572700 R09: 0000000000000000
>>>>> [  305.984656] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
>>>>> [  305.985707] R13: 000000000040d270 R14: 000000000040d300 R15: 0000000000000000
>>>>> [  305.986724] Code: b6 1d 18 9b ef 03 31 ff 89 de e8 60 a4 67 ff 84
>>>>> db 75 1a e8 87 a3 67 ff 48 c7 c7 00 8f 83 84 c6 05 f8 9a ef 03 01 e8
>>>>> 84 60 49 ff <0f> 0b 31 db e9 2b ff ff ff 66 66 2e 0f 1f 84 00 00 00 00
>>>>> 00 55
>>>>> [  305.990106] ---[ end trace 360c084b02d93022 ]---
>>>>> [  305.998636] IPVS: ftp: loaded support on port[0] = 21
>>>>> ---------------------------------------
>>>>>
>>>>> = About RaceFuzzer
>>>>>
>>>>> RaceFuzzer is a customized version of Syzkaller, specifically tailored
>>>>> to find race condition bugs in the Linux kernel. While we leverage
>>>>> many different technique, the notable feature of RaceFuzzer is in
>>>>> leveraging a custom hypervisor (QEMU/KVM) to interleave the
>>>>> scheduling. In particular, we modified the hypervisor to intentionally
>>>>> stall a per-core execution, which is similar to supporting per-core
>>>>> breakpoint functionality. This allows RaceFuzzer to force the kernel
>>>>> to deterministically trigger racy condition (which may rarely happen
>>>>> in practice due to randomness in scheduling).
>>>>>
>>>>> RaceFuzzer's C repro always pinpoints two racy syscalls. Since C
>>>>> repro's scheduling synchronization should be performed at the user
>>>>> space, its reproducibility is limited (reproduction may take from 1
>>>>> second to 10 minutes (or even more), depending on a bug). This is
>>>>> because, while RaceFuzzer precisely interleaves the scheduling at the
>>>>> kernel's instruction level when finding this bug, C repro cannot fully
>>>>> utilize such a feature. Please disregard all code related to
>>>>> "should_hypercall" in the C repro, as this is only for our debugging
>>>>> purposes using our own hypervisor.

^ permalink raw reply

* Re: Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Siwei Liu @ 2018-04-19  6:31 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, David Miller, David Ahern, Jiri Pirko,
	si-wei liu, Stephen Hemminger, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Netdev, virtualization, virtio-dev
In-Reply-To: <4e029524-d542-0f12-bfdb-7c8a2f198e28@intel.com>

On Wed, Apr 18, 2018 at 10:00 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
>>
>> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>>>
>>>> I ran this with a few folks offline and gathered some good feedbacks
>>>> that I'd like to share thus revive the discussion.
>>>>
>>>> First of all, as illustrated in the reply below, cloud service
>>>> providers require transparent live migration. Specifically, the main
>>>> target of our case is to support SR-IOV live migration via kernel
>>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>>> because this use case is not appealing enough for the mainline to
>>>> adopt, I will shut up and not continue discussing, although
>>>> technically it's entirely possible (and there's precedent in other
>>>> implementation) to do so to benefit any cloud service providers.
>>>>
>>>> If it's just the implementation of hiding netdev itself needs to be
>>>> improved, such as implementing it as attribute flag or adding linkdump
>>>> API, that's completely fine and we can look into that. However, the
>>>> specific issue needs to be undestood beforehand is to make transparent
>>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>>> from the lower netdev, which needs some games with uevents and name
>>>> space reservation. So far I don't think it's been well discussed.
>>>>
>>>> One thing in particular I'd like to point out is that the 3-netdev
>>>> model currently missed to address the core problem of live migration:
>>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>>> and hardware offloading states. Only general network state (IP
>>>> address, gateway, for eg.) associated with the bypass interface can be
>>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>>> save and apply those hardware specific configs before or after
>>>> migration as needed. The transparent 1-netdev model being proposed as
>>>> part of this patch series will be able to solve that problem naturally
>>>> by making all hardware specific configurations go through the central
>>>> bypass driver, such that hardware configurations can be replayed when
>>>> new VF or passthrough gets plugged back in. Although that
>>>> corresponding function hasn't been implemented today, I'd like to
>>>> refresh everyone's mind that is the core problem any live migration
>>>> proposal should have addressed.
>>>>
>>>> If it would make things more clear to defer netdev hiding until all
>>>> functionalities regarding centralizing and replay are implemented,
>>>> we'd take advices like that and move on to implementing those features
>>>> as follow-up patches. Once all needed features get done, we'd resume
>>>> the work for hiding lower netdev at that point. Think it would be the
>>>> best to make everyone understand the big picture in advance before
>>>> going too far.
>>>
>>> I think we should get the 3-netdev model integrated and add any
>>> additional
>>> ndo_ops/ethool ops that we would like to support/migrate before looking
>>> into
>>> hiding the lower netdevs.
>>
>> Once they are exposed, I don't think we'll be able to hide them -
>> they will be a kernel ABI.
>>
>> Do you think everyone needs to hide the SRIOV device?
>> Or that only some users need this?
>
>
> Hyper-V is currently supporting live migration without hiding the SR-IOV
> device. So i don't
> think it is a hard requirement. And also,  as we don't yet have a consensus
This is a vague point as Hyper-V is mostly Windows oriented: the
target users don't change adapter settings in device manager much as
it's hidden too deep already. Actually it does not address the general
case for SR-IOV live migration but just a subset, why are we making
such comparison?

Note it's always the hard requirement for live migration that *all
states* should be migrated no matter what the implementation it is
going to be. The current 3-netdev model is remote to be useful for
real world scenario and it has no advantage compared to using
userspace generic bonding.

-Siwei

> on how to hide
> the lower netdevs, we could make it as another feature bit to hide lower
> netdevs once
> we have an acceptable solution.
>

^ permalink raw reply

* [PATCH net-next] liquidio: Added ndo_get_vf_stats support
From: Felix Manlunas @ 2018-04-19  6:18 UTC (permalink / raw)
  To: davem
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	felix.manlunas, intiyaz.basha

From: Intiyaz Basha <intiyaz.basha@cavium.com>

Added the ndo to gather VF statistics through the PF.

Collect VF statistics via mailbox from VF.

Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
---
 .../ethernet/cavium/liquidio/cn23xx_pf_device.c    | 54 ++++++++++++++++++++++
 .../ethernet/cavium/liquidio/cn23xx_pf_device.h    | 12 +++++
 drivers/net/ethernet/cavium/liquidio/lio_main.c    | 26 +++++++++++
 .../net/ethernet/cavium/liquidio/octeon_mailbox.c  | 52 +++++++++++++++++++++
 .../net/ethernet/cavium/liquidio/octeon_mailbox.h  |  7 +++
 5 files changed, 151 insertions(+)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
index e8b2904..bc9861c 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
@@ -1497,3 +1497,57 @@ void cn23xx_tell_vf_its_macaddr_changed(struct octeon_device *oct, int vfidx,
 		octeon_mbox_write(oct, &mbox_cmd);
 	}
 }
+
+static void
+cn23xx_get_vf_stats_callback(struct octeon_device *oct,
+			     struct octeon_mbox_cmd *cmd, void *arg)
+{
+	struct oct_vf_stats_ctx *ctx = arg;
+
+	memcpy(ctx->stats, cmd->data, sizeof(struct oct_vf_stats));
+	atomic_set(&ctx->status, 1);
+}
+
+int cn23xx_get_vf_stats(struct octeon_device *oct, int vfidx,
+			struct oct_vf_stats *stats)
+{
+	u32 timeout = HZ; // 1sec
+	struct octeon_mbox_cmd mbox_cmd;
+	struct oct_vf_stats_ctx ctx;
+	u32 count = 0, ret;
+
+	if (!(oct->sriov_info.vf_drv_loaded_mask & (1ULL << vfidx)))
+		return -1;
+
+	if (sizeof(struct oct_vf_stats) > sizeof(mbox_cmd.data))
+		return -1;
+
+	mbox_cmd.msg.u64 = 0;
+	mbox_cmd.msg.s.type = OCTEON_MBOX_REQUEST;
+	mbox_cmd.msg.s.resp_needed = 1;
+	mbox_cmd.msg.s.cmd = OCTEON_GET_VF_STATS;
+	mbox_cmd.msg.s.len = 1;
+	mbox_cmd.q_no = vfidx * oct->sriov_info.rings_per_vf;
+	mbox_cmd.recv_len = 0;
+	mbox_cmd.recv_status = 0;
+	mbox_cmd.fn = (octeon_mbox_callback_t)cn23xx_get_vf_stats_callback;
+	ctx.stats = stats;
+	atomic_set(&ctx.status, 0);
+	mbox_cmd.fn_arg = (void *)&ctx;
+	memset(mbox_cmd.data, 0, sizeof(mbox_cmd.data));
+	octeon_mbox_write(oct, &mbox_cmd);
+
+	do {
+		schedule_timeout_uninterruptible(1);
+	} while ((atomic_read(&ctx.status) == 0) && (count++ < timeout));
+
+	ret = atomic_read(&ctx.status);
+	if (ret == 0) {
+		octeon_mbox_cancel(oct, 0);
+		dev_err(&oct->pci_dev->dev, "Unable to get stats from VF-%d, timedout\n",
+			vfidx);
+		return -1;
+	}
+
+	return 0;
+}
diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h
index 2aba524..63b3de4 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h
@@ -43,6 +43,15 @@ struct octeon_cn23xx_pf {
 
 #define CN23XX_SLI_DEF_BP			0x40
 
+struct oct_vf_stats {
+	u64 rx_packets;
+	u64 tx_packets;
+	u64 rx_bytes;
+	u64 tx_bytes;
+	u64 broadcast;
+	u64 multicast;
+};
+
 int setup_cn23xx_octeon_pf_device(struct octeon_device *oct);
 
 int validate_cn23xx_pf_config_info(struct octeon_device *oct,
@@ -56,4 +65,7 @@ int validate_cn23xx_pf_config_info(struct octeon_device *oct,
 
 void cn23xx_tell_vf_its_macaddr_changed(struct octeon_device *oct, int vfidx,
 					u8 *mac);
+
+int cn23xx_get_vf_stats(struct octeon_device *oct, int ifidx,
+			struct oct_vf_stats *stats);
 #endif
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 603a144..6f2a5dd 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -3324,6 +3324,31 @@ static int liquidio_set_vf_link_state(struct net_device *netdev, int vfidx,
 	.switchdev_port_attr_get = lio_pf_switchdev_attr_get,
 };
 
+static int liquidio_get_vf_stats(struct net_device *netdev, int vfidx,
+				 struct ifla_vf_stats *vf_stats)
+{
+	struct lio *lio = GET_LIO(netdev);
+	struct octeon_device *oct = lio->oct_dev;
+	struct oct_vf_stats stats;
+	int ret;
+
+	if (vfidx < 0 || vfidx >= oct->sriov_info.num_vfs_alloced)
+		return -EINVAL;
+
+	memset(&stats, 0, sizeof(struct oct_vf_stats));
+	ret = cn23xx_get_vf_stats(oct, vfidx, &stats);
+	if (!ret) {
+		vf_stats->rx_packets = stats.rx_packets;
+		vf_stats->tx_packets = stats.tx_packets;
+		vf_stats->rx_bytes = stats.rx_bytes;
+		vf_stats->tx_bytes = stats.tx_bytes;
+		vf_stats->broadcast = stats.broadcast;
+		vf_stats->multicast = stats.multicast;
+	}
+
+	return ret;
+}
+
 static const struct net_device_ops lionetdevops = {
 	.ndo_open		= liquidio_open,
 	.ndo_stop		= liquidio_stop,
@@ -3346,6 +3371,7 @@ static int liquidio_set_vf_link_state(struct net_device *netdev, int vfidx,
 	.ndo_get_vf_config	= liquidio_get_vf_config,
 	.ndo_set_vf_trust	= liquidio_set_vf_trust,
 	.ndo_set_vf_link_state  = liquidio_set_vf_link_state,
+	.ndo_get_vf_stats	= liquidio_get_vf_stats,
 };
 
 /** \brief Entry point for the liquidio module
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
index 28e74ee..021d99c 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
@@ -24,6 +24,7 @@
 #include "octeon_device.h"
 #include "octeon_main.h"
 #include "octeon_mailbox.h"
+#include "cn23xx_pf_device.h"
 
 /**
  * octeon_mbox_read:
@@ -205,6 +206,26 @@ int octeon_mbox_write(struct octeon_device *oct,
 	return ret;
 }
 
+static void get_vf_stats(struct octeon_device *oct,
+			 struct oct_vf_stats *stats)
+{
+	int i;
+
+	for (i = 0; i < oct->num_iqs; i++) {
+		if (!oct->instr_queue[i])
+			continue;
+		stats->tx_packets += oct->instr_queue[i]->stats.tx_done;
+		stats->tx_bytes += oct->instr_queue[i]->stats.tx_tot_bytes;
+	}
+
+	for (i = 0; i < oct->num_oqs; i++) {
+		if (!oct->droq[i])
+			continue;
+		stats->rx_packets += oct->droq[i]->stats.rx_pkts_received;
+		stats->rx_bytes += oct->droq[i]->stats.rx_bytes_received;
+	}
+}
+
 /**
  * octeon_mbox_process_cmd:
  * @mbox: Pointer mailbox
@@ -250,6 +271,15 @@ static int octeon_mbox_process_cmd(struct octeon_mbox *mbox,
 						     mbox_cmd->msg.s.params);
 		break;
 
+	case OCTEON_GET_VF_STATS:
+		dev_dbg(&oct->pci_dev->dev, "Got VF stats request. Sending data back\n");
+		mbox_cmd->msg.s.type = OCTEON_MBOX_RESPONSE;
+		mbox_cmd->msg.s.resp_needed = 1;
+		mbox_cmd->msg.s.len = 1 +
+			sizeof(struct oct_vf_stats) / sizeof(u64);
+		get_vf_stats(oct, (struct oct_vf_stats *)mbox_cmd->data);
+		octeon_mbox_write(oct, mbox_cmd);
+		break;
 	default:
 		break;
 	}
@@ -322,3 +352,25 @@ int octeon_mbox_process_message(struct octeon_mbox *mbox)
 
 	return 0;
 }
+
+int octeon_mbox_cancel(struct octeon_device *oct, int q_no)
+{
+	struct octeon_mbox *mbox = oct->mbox[q_no];
+	struct octeon_mbox_cmd *mbox_cmd;
+	unsigned long flags = 0;
+
+	spin_lock_irqsave(&mbox->lock, flags);
+	mbox_cmd = &mbox->mbox_resp;
+
+	if (!(mbox->state & OCTEON_MBOX_STATE_RESPONSE_PENDING)) {
+		spin_unlock_irqrestore(&mbox->lock, flags);
+		return 1;
+	}
+
+	mbox->state = OCTEON_MBOX_STATE_IDLE;
+	memset(mbox_cmd, 0, sizeof(*mbox_cmd));
+	writeq(OCTEON_PFVFSIG, mbox->mbox_read_reg);
+	spin_unlock_irqrestore(&mbox->lock, flags);
+
+	return 0;
+}
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.h b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.h
index 1def22a..d92bd7e 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.h
@@ -25,6 +25,7 @@
 #define OCTEON_VF_ACTIVE		0x1
 #define OCTEON_VF_FLR_REQUEST		0x2
 #define OCTEON_PF_CHANGED_VF_MACADDR	0x4
+#define OCTEON_GET_VF_STATS		0x8
 
 /*Macro for Read acknowldgement*/
 #define OCTEON_PFVFACK			0xffffffffffffffffULL
@@ -107,9 +108,15 @@ struct octeon_mbox {
 
 };
 
+struct oct_vf_stats_ctx {
+	atomic_t status;
+	struct oct_vf_stats *stats;
+};
+
 int octeon_mbox_read(struct octeon_mbox *mbox);
 int octeon_mbox_write(struct octeon_device *oct,
 		      struct octeon_mbox_cmd *mbox_cmd);
 int octeon_mbox_process_message(struct octeon_mbox *mbox);
+int octeon_mbox_cancel(struct octeon_device *oct, int q_no);
 
 #endif
-- 
1.8.3.1

^ permalink raw reply related

* Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Samudrala, Sridhar @ 2018-04-19  6:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Siwei Liu, David Miller, David Ahern, Jiri Pirko, si-wei liu,
	Stephen Hemminger, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Netdev, virtualization, virtio-dev
In-Reply-To: <20180419080215-mutt-send-email-mst@kernel.org>


On 4/18/2018 10:07 PM, Michael S. Tsirkin wrote:
> On Wed, Apr 18, 2018 at 10:00:51PM -0700, Samudrala, Sridhar wrote:
>> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
>>> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>>>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>>>> I ran this with a few folks offline and gathered some good feedbacks
>>>>> that I'd like to share thus revive the discussion.
>>>>>
>>>>> First of all, as illustrated in the reply below, cloud service
>>>>> providers require transparent live migration. Specifically, the main
>>>>> target of our case is to support SR-IOV live migration via kernel
>>>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>>>> because this use case is not appealing enough for the mainline to
>>>>> adopt, I will shut up and not continue discussing, although
>>>>> technically it's entirely possible (and there's precedent in other
>>>>> implementation) to do so to benefit any cloud service providers.
>>>>>
>>>>> If it's just the implementation of hiding netdev itself needs to be
>>>>> improved, such as implementing it as attribute flag or adding linkdump
>>>>> API, that's completely fine and we can look into that. However, the
>>>>> specific issue needs to be undestood beforehand is to make transparent
>>>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>>>> from the lower netdev, which needs some games with uevents and name
>>>>> space reservation. So far I don't think it's been well discussed.
>>>>>
>>>>> One thing in particular I'd like to point out is that the 3-netdev
>>>>> model currently missed to address the core problem of live migration:
>>>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>>>> and hardware offloading states. Only general network state (IP
>>>>> address, gateway, for eg.) associated with the bypass interface can be
>>>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>>>> save and apply those hardware specific configs before or after
>>>>> migration as needed. The transparent 1-netdev model being proposed as
>>>>> part of this patch series will be able to solve that problem naturally
>>>>> by making all hardware specific configurations go through the central
>>>>> bypass driver, such that hardware configurations can be replayed when
>>>>> new VF or passthrough gets plugged back in. Although that
>>>>> corresponding function hasn't been implemented today, I'd like to
>>>>> refresh everyone's mind that is the core problem any live migration
>>>>> proposal should have addressed.
>>>>>
>>>>> If it would make things more clear to defer netdev hiding until all
>>>>> functionalities regarding centralizing and replay are implemented,
>>>>> we'd take advices like that and move on to implementing those features
>>>>> as follow-up patches. Once all needed features get done, we'd resume
>>>>> the work for hiding lower netdev at that point. Think it would be the
>>>>> best to make everyone understand the big picture in advance before
>>>>> going too far.
>>>> I think we should get the 3-netdev model integrated and add any additional
>>>> ndo_ops/ethool ops that we would like to support/migrate before looking into
>>>> hiding the lower netdevs.
>>> Once they are exposed, I don't think we'll be able to hide them -
>>> they will be a kernel ABI.
>>>
>>> Do you think everyone needs to hide the SRIOV device?
>>> Or that only some users need this?
>> Hyper-V is currently supporting live migration without hiding the SR-IOV device. So i don't
>> think it is a hard requirement.
> OK, fine.
>
>> And also,  as we don't yet have a consensus on how to hide
>> the lower netdevs, we could make it as another feature bit to hide lower netdevs once
>> we have an acceptable solution.
> Guest/host interface isn't more flexible than the userspace/kernel
> interface.  The feature bit you propose would say what exactly?
> Hypervisor has no idea what guest kernel shows guest userspace.
> Note that the backup flag doesn't tell guest kernel what to do,
> it just tells guest that there is or will be a faster main device
> connected to the same backend, so the backup should only be used
> when main device is not present.

The current bypass module supports 3-netdev and 2-netdev models via 2 sets of interfaces
bypass_master_create/destroy and bypass_master_register/unregister.  So theoretically
we can support the 2 models via 2 different feature bits. BACKUP and BACKUP_2_NETDEV.

Similarly if we can figure out a way to hide both the netdevs, we can add BACKUP_1_NETDEV
feature bit and update the bypass module to provide another set of interfaces that can
be used by virtio_net to support this model.

Now that we are leaning towards 'standby' as the name for the lower virtio-net, should we
change the feature bit name also to VIRTIO_NET_F_STANDBY?

>

^ permalink raw reply

* 答复: [PATCH][net-next] net: ip tos cgroup
From: Li,Rongqing @ 2018-04-19  4:56 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev@vger.kernel.org, tj@kernel.org, ast@fb.com, brakmo@fb.com
In-Reply-To: <e5915527-5846-dce5-18de-5aee977c2e4b@iogearbox.net>



> -----邮件原件-----
> 发件人: Daniel Borkmann [mailto:daniel@iogearbox.net]
> 发送时间: 2018年4月17日 22:11
> 收件人: Li,Rongqing <lirongqing@baidu.com>
> 抄送: netdev@vger.kernel.org; tj@kernel.org; ast@fb.com;
> brakmo@fb.com
> 主题: Re: [PATCH][net-next] net: ip tos cgroup
> 
> On 04/17/2018 05:36 AM, Li RongQing wrote:
> > ip tos segment can be changed by setsockopt(IP_TOS), or by iptables;
> > this patch creates a new method to change socket tos segment of
> > processes based on cgroup
> >
> > The usage:
> >
> >     1. mount ip_tos cgroup, and setting tos value
> >     mount -t cgroup -o ip_tos ip_tos /cgroups/tos
> >     echo tos_value >/cgroups/tos/ip_tos.tos
> >     2. then move processes to cgroup, or create processes in cgroup
> >
> > Signed-off-by: jimyan <jimyan@baidu.com>
> > Signed-off-by: Li RongQing <lirongqing@baidu.com>
> 
> This functionality is already possible through the help of BPF programs
> attached to cgroups, have you had a chance to look into that?
> 

I think this method is easier to use than BPF, and more efficient


-RongQing 



^ permalink raw reply

* Re: [PATCH net 3/3] net: sched: ife: check on metadata length
From: yotam gigi @ 2018-04-19  5:37 UTC (permalink / raw)
  To: Alexander Aring
  Cc: Jamal Hadi Salim, davem, Cong Wang, Jiří Pírko,
	Yuval Mintz, netdev, kernel
In-Reply-To: <20180418213534.6215-4-aring@mojatatu.com>

On Thu, Apr 19, 2018 at 12:35 AM, Alexander Aring <aring@mojatatu.com> wrote:
> This patch checks if sk buffer is available to dererence ife header. If
> not then NULL will returned to signal an malformed ife packet. This
> avoids to crashing the kernel from outside.
>
> Signed-off-by: Alexander Aring <aring@mojatatu.com>

Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>

> ---
>  net/ife/ife.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/ife/ife.c b/net/ife/ife.c
> index 8632d2685efb..7c100034fbee 100644
> --- a/net/ife/ife.c
> +++ b/net/ife/ife.c
> @@ -70,6 +70,9 @@ void *ife_decode(struct sk_buff *skb, u16 *metalen)
>         u16 ifehdrln;
>
>         ifehdr = (struct ifeheadr *) (skb->data + skb->dev->hard_header_len);
> +       if (skb->len < skb->dev->hard_header_len + IFE_METAHDRLEN)
> +               return NULL;
> +
>         ifehdrln = ntohs(ifehdr->metalen);
>         total_pull = skb->dev->hard_header_len + ifehdrln;
>
> --
> 2.11.0
>

^ permalink raw reply

* Re: [PATCH net 2/3] net: sched: ife: handle malformed tlv length
From: yotam gigi @ 2018-04-19  5:37 UTC (permalink / raw)
  To: Alexander Aring
  Cc: Jamal Hadi Salim, davem, Cong Wang, Jiří Pírko,
	Yuval Mintz, netdev, kernel
In-Reply-To: <20180418213534.6215-3-aring@mojatatu.com>

On Thu, Apr 19, 2018 at 12:35 AM, Alexander Aring <aring@mojatatu.com> wrote:
> There is currently no handling to check on a invalid tlv length. This
> patch adds such handling to avoid killing the kernel with a malformed
> ife packet.

That's very important. Thanks for that!

>
> Signed-off-by: Alexander Aring <aring@mojatatu.com>
> ---
>  include/net/ife.h   |  3 ++-
>  net/ife/ife.c       | 35 +++++++++++++++++++++++++++++++++--
>  net/sched/act_ife.c |  7 ++++++-
>  3 files changed, 41 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/ife.h b/include/net/ife.h
> index 44b9c00f7223..e117617e3c34 100644
> --- a/include/net/ife.h
> +++ b/include/net/ife.h
> @@ -12,7 +12,8 @@
>  void *ife_encode(struct sk_buff *skb, u16 metalen);
>  void *ife_decode(struct sk_buff *skb, u16 *metalen);
>
> -void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen);
> +void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 *attrtype,
> +                         u16 *dlen, u16 *totlen);
>  int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
>                         const void *dval);
>
> diff --git a/net/ife/ife.c b/net/ife/ife.c
> index 7d1ec76e7f43..8632d2685efb 100644
> --- a/net/ife/ife.c
> +++ b/net/ife/ife.c
> @@ -92,12 +92,43 @@ struct meta_tlvhdr {
>         __be16 len;
>  };
>
> +static inline bool __ife_tlv_meta_valid(const unsigned char *skbdata,
> +                                       const unsigned char *ifehdr_end)
> +{
> +       const struct meta_tlvhdr *tlv;
> +       u16 tlvlen;
> +
> +       if (unlikely(skbdata + sizeof(*tlv) > ifehdr_end))
> +               return false;
> +
> +       tlv = (struct meta_tlvhdr *)skbdata;
> +       tlvlen = ntohs(tlv->len);
> +
> +       /* tlv length field is inc header, check on minimum */
> +       if (tlvlen < NLA_HDRLEN)
> +               return false;
> +
> +       /* overflow by NLA_ALIGN check */
> +       if (NLA_ALIGN(tlvlen) < tlvlen)
> +               return false;
> +
> +       if (unlikely(skbdata + NLA_ALIGN(tlvlen) > ifehdr_end))
> +               return false;
> +
> +       return true;
> +}
> +
>  /* Caller takes care of presenting data in network order
>   */
> -void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen)
> +void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 *attrtype,
> +                         u16 *dlen, u16 *totlen)

Now it is not critical, but it looks a bit odd to me: the function ife_decode
returns "metalen" - maybe it is better to change it too to return ifehdr_end?
If not, the caller has to calculate it himself for no particular reason.
Having both metalen and ifehdr_end is redundant, so we should stick to only
one of these.

Other than that,
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>

>  {
> -       struct meta_tlvhdr *tlv = (struct meta_tlvhdr *) skbdata;
> +       struct meta_tlvhdr *tlv;
> +
> +       if (!__ife_tlv_meta_valid(skbdata, ifehdr_end))
> +               return NULL;
>
> +       tlv = (struct meta_tlvhdr *)skbdata;
>         *dlen = ntohs(tlv->len) - NLA_HDRLEN;
>         *attrtype = ntohs(tlv->type);
>
> diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
> index 49b8ab551fbe..8527cfdc446d 100644
> --- a/net/sched/act_ife.c
> +++ b/net/sched/act_ife.c
> @@ -682,7 +682,12 @@ static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
>                 u16 mtype;
>                 u16 dlen;
>
> -               curr_data = ife_tlv_meta_decode(tlv_data, &mtype, &dlen, NULL);
> +               curr_data = ife_tlv_meta_decode(tlv_data, ifehdr_end, &mtype,
> +                                               &dlen, NULL);
> +               if (!curr_data) {
> +                       qstats_drop_inc(this_cpu_ptr(ife->common.cpu_qstats));
> +                       return TC_ACT_SHOT;
> +               }
>
>                 if (find_decode_metaid(skb, ife, mtype, dlen, curr_data)) {
>                         /* abuse overlimits to count when we receive metadata
> --
> 2.11.0
>

^ permalink raw reply

* Re: [PATCH net 1/3] net: sched: ife: signal not finding metaid
From: yotam gigi @ 2018-04-19  5:37 UTC (permalink / raw)
  To: Alexander Aring
  Cc: Jamal Hadi Salim, davem, Cong Wang, Jiří Pírko,
	Yuval Mintz, netdev, kernel
In-Reply-To: <20180418213534.6215-2-aring@mojatatu.com>

On Thu, Apr 19, 2018 at 12:35 AM, Alexander Aring <aring@mojatatu.com> wrote:
> We need to record stats for received metadata that we dont know how
> to process. Have find_decode_metaid() return -ENOENT to capture this.

Agree.

>
> Signed-off-by: Alexander Aring <aring@mojatatu.com>

Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>

> ---
>  net/sched/act_ife.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
> index a5994cf0512b..49b8ab551fbe 100644
> --- a/net/sched/act_ife.c
> +++ b/net/sched/act_ife.c
> @@ -652,7 +652,7 @@ static int find_decode_metaid(struct sk_buff *skb, struct tcf_ife_info *ife,
>                 }
>         }
>
> -       return 0;
> +       return -ENOENT;
>  }
>
>  static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
> --
> 2.11.0
>

^ permalink raw reply

* Re: [PATCH bpf-next 4/5] samples/bpf: Refine printing symbol for sampleip
From: Leo Yan @ 2018-04-19  5:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, linux-kernel
In-Reply-To: <20180419052124.xzaz66vehd6pofnh@ast-mbp>

On Wed, Apr 18, 2018 at 10:21:25PM -0700, Alexei Starovoitov wrote:
> On Thu, Apr 19, 2018 at 01:12:49PM +0800, Leo Yan wrote:
> > On Wed, Apr 18, 2018 at 09:47:45PM -0700, Alexei Starovoitov wrote:
> > > On Thu, Apr 19, 2018 at 09:34:05AM +0800, Leo Yan wrote:
> > > > The code defines macro 'PAGE_OFFSET' and uses it to decide if the
> > > > address is in kernel space or not.  But different architecture has
> > > > different 'PAGE_OFFSET' so this program cannot be used for all
> > > > platforms.
> > > > 
> > > > This commit changes to check returned pointer from ksym_search() to
> > > > judge if the address falls into kernel space or not, and removes
> > > > macro 'PAGE_OFFSET' as it isn't used anymore.  As result, this program
> > > > has no architecture dependency.
> > > > 
> > > > Signed-off-by: Leo Yan <leo.yan@linaro.org>
> > > > ---
> > > >  samples/bpf/sampleip_user.c | 8 +++-----
> > > >  1 file changed, 3 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/samples/bpf/sampleip_user.c b/samples/bpf/sampleip_user.c
> > > > index 4ed690b..0eea1b3 100644
> > > > --- a/samples/bpf/sampleip_user.c
> > > > +++ b/samples/bpf/sampleip_user.c
> > > > @@ -26,7 +26,6 @@
> > > >  #define DEFAULT_FREQ	99
> > > >  #define DEFAULT_SECS	5
> > > >  #define MAX_IPS		8192
> > > > -#define PAGE_OFFSET	0xffff880000000000
> > > >  
> > > >  static int nr_cpus;
> > > >  
> > > > @@ -107,14 +106,13 @@ static void print_ip_map(int fd)
> > > >  	/* sort and print */
> > > >  	qsort(counts, max, sizeof(struct ipcount), count_cmp);
> > > >  	for (i = 0; i < max; i++) {
> > > > -		if (counts[i].ip > PAGE_OFFSET) {
> > > > -			sym = ksym_search(counts[i].ip);
> > > 
> > > yes. it is x64 specific, since it's a sample code,
> > > but simply removing it is not a fix.
> > > It makes this sampleip code behaving incorrectly.
> > > To do such 'cleanup of ksym' please refactor it in the true generic way,
> > > so these ksym* helpers can work on all archs and put this new
> > > functionality into selftests.
> > 
> > Just want to check, are you suggesting to create a standalone
> > testing for kallsyms (like a folder tools/testing/selftests/ksym) or
> > do you mean to place the generic code into folder
> > tools/testing/selftests/bpf/?
> 
> I think the minimal first step is to cleanup ksym bits into something
> that bpf selftests can use and keep it as new .c in
> tools/testing/selftests/bpf/
> Later if it becomes useful for other tests in selftests it can be moved.

Thanks for explanation, now it's clear for me :)

> Thanks!
> 

^ permalink raw reply

* [PATCH v2 net 3/3] virtio_net: sparse annotation fix
From: Michael S. Tsirkin @ 2018-04-19  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mikulas Patocka, Eric Dumazet, David Miller, Thomas Huth,
	Cornelia Huck, Jason Wang, virtualization, netdev
In-Reply-To: <1524115776-334953-1-git-send-email-mst@redhat.com>

offloads is a buffer in virtio format, should use
the __virtio64 tag.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f84fe04..c5b11f2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -155,7 +155,7 @@ struct control_buf {
 	u8 promisc;
 	u8 allmulti;
 	__virtio16 vid;
-	u64 offloads;
+	__virtio64 offloads;
 };
 
 struct virtnet_info {
-- 
MST

^ permalink raw reply related

* [PATCH v2 net 2/3] virtio_net: fix adding vids on big-endian
From: Michael S. Tsirkin @ 2018-04-19  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mikulas Patocka, Eric Dumazet, David Miller, Thomas Huth,
	Cornelia Huck, Jason Wang, virtualization, netdev
In-Reply-To: <1524115776-334953-1-git-send-email-mst@redhat.com>

Programming vids (adding or removing them) still passes
guest-endian values in the DMA buffer. That's wrong
if guest is big-endian and when virtio 1 is enabled.

Note: this is on top of a previous patch:
	virtio_net: split out ctrl buffer

Fixes: 9465a7a6f ("virtio_net: enable v1.0 support")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3d0eff53..f84fe04 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -154,7 +154,7 @@ struct control_buf {
 	struct virtio_net_ctrl_mq mq;
 	u8 promisc;
 	u8 allmulti;
-	u16 vid;
+	__virtio16 vid;
 	u64 offloads;
 };
 
@@ -1718,7 +1718,7 @@ static int virtnet_vlan_rx_add_vid(struct net_device *dev,
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct scatterlist sg;
 
-	vi->ctrl->vid = vid;
+	vi->ctrl->vid = cpu_to_virtio16(vi->vdev, vid);
 	sg_init_one(&sg, &vi->ctrl->vid, sizeof(vi->ctrl->vid));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
@@ -1733,7 +1733,7 @@ static int virtnet_vlan_rx_kill_vid(struct net_device *dev,
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct scatterlist sg;
 
-	vi->ctrl->vid = vid;
+	vi->ctrl->vid = cpu_to_virtio16(vi->vdev, vid);
 	sg_init_one(&sg, &vi->ctrl->vid, sizeof(vi->ctrl->vid));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
-- 
MST

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox