Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH v6 3/3] netdev/of/phy: Add MDIO bus multiplexer driven by GPIO lines.
From: David Daney @ 2012-05-03  1:16 UTC (permalink / raw)
  To: Ralf Baechle, Grant Likely, Rob Herring,
	devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ, David S. Miller,
	netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-mips-6z/3iImG2C8G8FEW9MqTrA,
	afleming-Re5JQEeQqe8AvxtiuMwx3w, David Daney,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1336007799-31016-1-git-send-email-ddaney.cavm-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

From: David Daney <david.daney-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org>

The GPIO pins select which sub bus is connected to the master.

Initially tested with an sn74cbtlv3253 switch device wired into the
MDIO bus.

Signed-off-by: David Daney <david.daney-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org>
---
 .../devicetree/bindings/net/mdio-mux-gpio.txt      |  127 +++++++++++++++++
 drivers/net/phy/Kconfig                            |   10 ++
 drivers/net/phy/Makefile                           |    1 +
 drivers/net/phy/mdio-mux-gpio.c                    |  142 ++++++++++++++++++++
 4 files changed, 280 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/mdio-mux-gpio.txt
 create mode 100644 drivers/net/phy/mdio-mux-gpio.c

diff --git a/Documentation/devicetree/bindings/net/mdio-mux-gpio.txt b/Documentation/devicetree/bindings/net/mdio-mux-gpio.txt
new file mode 100644
index 0000000..7938411
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/mdio-mux-gpio.txt
@@ -0,0 +1,127 @@
+Properties for an MDIO bus multiplexer/switch controlled by GPIO pins.
+
+This is a special case of a MDIO bus multiplexer.  One or more GPIO
+lines are used to control which child bus is connected.
+
+Required properties in addition to the generic multiplexer properties:
+
+- compatible : mdio-mux-gpio.
+- gpios : GPIO specifiers for each GPIO line.  One or more must be specified.
+
+
+Example :
+
+	/* The parent MDIO bus. */
+	smi1: mdio@1180000001900 {
+		compatible = "cavium,octeon-3860-mdio";
+		#address-cells = <1>;
+		#size-cells = <0>;
+		reg = <0x11800 0x00001900 0x0 0x40>;
+	};
+
+	/*
+	   An NXP sn74cbtlv3253 dual 1-of-4 switch controlled by a
+	   pair of GPIO lines.  Child busses 2 and 3 populated with 4
+	   PHYs each.
+	 */
+	mdio-mux {
+		compatible = "mdio-mux-gpio";
+		gpios = <&gpio1 3 0>, <&gpio1 4 0>;
+		mdio-parent-bus = <&smi1>;
+		#address-cells = <1>;
+		#size-cells = <0>;
+
+		mdio@2 {
+			reg = <2>;
+			#address-cells = <1>;
+			#size-cells = <0>;
+
+			phy11: ethernet-phy@1 {
+				reg = <1>;
+				compatible = "marvell,88e1149r";
+				marvell,reg-init = <3 0x10 0 0x5777>,
+					<3 0x11 0 0x00aa>,
+					<3 0x12 0 0x4105>,
+					<3 0x13 0 0x0a60>;
+				interrupt-parent = <&gpio>;
+				interrupts = <10 8>; /* Pin 10, active low */
+			};
+			phy12: ethernet-phy@2 {
+				reg = <2>;
+				compatible = "marvell,88e1149r";
+				marvell,reg-init = <3 0x10 0 0x5777>,
+					<3 0x11 0 0x00aa>,
+					<3 0x12 0 0x4105>,
+					<3 0x13 0 0x0a60>;
+				interrupt-parent = <&gpio>;
+				interrupts = <10 8>; /* Pin 10, active low */
+			};
+			phy13: ethernet-phy@3 {
+				reg = <3>;
+				compatible = "marvell,88e1149r";
+				marvell,reg-init = <3 0x10 0 0x5777>,
+					<3 0x11 0 0x00aa>,
+					<3 0x12 0 0x4105>,
+					<3 0x13 0 0x0a60>;
+				interrupt-parent = <&gpio>;
+				interrupts = <10 8>; /* Pin 10, active low */
+			};
+			phy14: ethernet-phy@4 {
+				reg = <4>;
+				compatible = "marvell,88e1149r";
+				marvell,reg-init = <3 0x10 0 0x5777>,
+					<3 0x11 0 0x00aa>,
+					<3 0x12 0 0x4105>,
+					<3 0x13 0 0x0a60>;
+				interrupt-parent = <&gpio>;
+				interrupts = <10 8>; /* Pin 10, active low */
+			};
+		};
+
+		mdio@3 {
+			reg = <3>;
+			#address-cells = <1>;
+			#size-cells = <0>;
+
+			phy21: ethernet-phy@1 {
+				reg = <1>;
+				compatible = "marvell,88e1149r";
+				marvell,reg-init = <3 0x10 0 0x5777>,
+					<3 0x11 0 0x00aa>,
+					<3 0x12 0 0x4105>,
+					<3 0x13 0 0x0a60>;
+				interrupt-parent = <&gpio>;
+				interrupts = <12 8>; /* Pin 12, active low */
+			};
+			phy22: ethernet-phy@2 {
+				reg = <2>;
+				compatible = "marvell,88e1149r";
+				marvell,reg-init = <3 0x10 0 0x5777>,
+					<3 0x11 0 0x00aa>,
+					<3 0x12 0 0x4105>,
+					<3 0x13 0 0x0a60>;
+				interrupt-parent = <&gpio>;
+				interrupts = <12 8>; /* Pin 12, active low */
+			};
+			phy23: ethernet-phy@3 {
+				reg = <3>;
+				compatible = "marvell,88e1149r";
+				marvell,reg-init = <3 0x10 0 0x5777>,
+					<3 0x11 0 0x00aa>,
+					<3 0x12 0 0x4105>,
+					<3 0x13 0 0x0a60>;
+				interrupt-parent = <&gpio>;
+				interrupts = <12 8>; /* Pin 12, active low */
+			};
+			phy24: ethernet-phy@4 {
+				reg = <4>;
+				compatible = "marvell,88e1149r";
+				marvell,reg-init = <3 0x10 0 0x5777>,
+					<3 0x11 0 0x00aa>,
+					<3 0x12 0 0x4105>,
+					<3 0x13 0 0x0a60>;
+				interrupt-parent = <&gpio>;
+				interrupts = <12 8>; /* Pin 12, active low */
+			};
+		};
+	};
diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 99c0674a..944cdfb 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -144,6 +144,16 @@ config MDIO_BUS_MUX
 	  to a parent bus.  Switching between child busses is done by
 	  device specific drivers.
 
+config MDIO_BUS_MUX_GPIO
+	tristate "Support for GPIO controlled MDIO bus multiplexers"
+	depends on OF_GPIO && OF_MDIO
+	select MDIO_BUS_MUX
+	help
+	  This module provides a driver for MDIO bus multiplexers that
+	  are controlled via GPIO lines.  The multiplexer connects one of
+	  several child MDIO busses to a parent bus.  Child bus
+	  selection is under the control of GPIO lines.
+
 endif # PHYLIB
 
 config MICREL_KS8995MA
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index a6b50e7..f51af68 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -26,3 +26,4 @@ obj-$(CONFIG_MDIO_OCTEON)	+= mdio-octeon.o
 obj-$(CONFIG_MICREL_KS8995MA)	+= spi_ks8995.o
 obj-$(CONFIG_AMD_PHY)		+= amd.o
 obj-$(CONFIG_MDIO_BUS_MUX)	+= mdio-mux.o
+obj-$(CONFIG_MDIO_BUS_MUX_GPIO)	+= mdio-mux-gpio.o
diff --git a/drivers/net/phy/mdio-mux-gpio.c b/drivers/net/phy/mdio-mux-gpio.c
new file mode 100644
index 0000000..e0cc4ef
--- /dev/null
+++ b/drivers/net/phy/mdio-mux-gpio.c
@@ -0,0 +1,142 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (C) 2011, 2012 Cavium, Inc.
+ */
+
+#include <linux/platform_device.h>
+#include <linux/device.h>
+#include <linux/of_mdio.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/phy.h>
+#include <linux/mdio-mux.h>
+#include <linux/of_gpio.h>
+
+#define DRV_VERSION "1.0"
+#define DRV_DESCRIPTION "GPIO controlled MDIO bus multiplexer driver"
+
+#define MDIO_MUX_GPIO_MAX_BITS 8
+
+struct mdio_mux_gpio_state {
+	int gpio[MDIO_MUX_GPIO_MAX_BITS];
+	unsigned int num_gpios;
+	void *mux_handle;
+};
+
+static int mdio_mux_gpio_switch_fn(int current_child, int desired_child,
+				   void *data)
+{
+	int change;
+	unsigned int n;
+	struct mdio_mux_gpio_state *s = data;
+
+	if (current_child == desired_child)
+		return 0;
+
+	change = current_child == -1 ? -1 : current_child ^ desired_child;
+
+	for (n = 0; n < s->num_gpios; n++) {
+		if (change & 1)
+			gpio_set_value_cansleep(s->gpio[n],
+						(desired_child & 1) != 0);
+		change >>= 1;
+		desired_child >>= 1;
+	}
+
+	return 0;
+}
+
+static int __devinit mdio_mux_gpio_probe(struct platform_device *pdev)
+{
+	enum of_gpio_flags f;
+	struct mdio_mux_gpio_state *s;
+	unsigned int num_gpios;
+	unsigned int n;
+	int r;
+
+	if (!pdev->dev.of_node)
+		return -ENODEV;
+
+	num_gpios = of_gpio_count(pdev->dev.of_node);
+	if (num_gpios == 0 || num_gpios > MDIO_MUX_GPIO_MAX_BITS)
+		return -ENODEV;
+
+	s = devm_kzalloc(&pdev->dev, sizeof(*s), GFP_KERNEL);
+	if (!s)
+		return -ENOMEM;
+
+	s->num_gpios = num_gpios;
+
+	for (n = 0; n < num_gpios; ) {
+		int gpio = of_get_gpio_flags(pdev->dev.of_node, n, &f);
+		if (gpio < 0) {
+			r = (gpio == -ENODEV) ? -EPROBE_DEFER : gpio;
+			goto err;
+		}
+		s->gpio[n] = gpio;
+
+		n++;
+
+		r = gpio_request(gpio, "mdio_mux_gpio");
+		if (r)
+			goto err;
+
+		r = gpio_direction_output(gpio, 0);
+		if (r)
+			goto err;
+	}
+
+	r = mdio_mux_init(&pdev->dev,
+			  mdio_mux_gpio_switch_fn, &s->mux_handle, s);
+
+	if (r == 0) {
+		pdev->dev.platform_data = s;
+		return 0;
+	}
+err:
+	while (n) {
+		n--;
+		gpio_free(s->gpio[n]);
+	}
+	devm_kfree(&pdev->dev, s);
+	return r;
+}
+
+static int __devexit mdio_mux_gpio_remove(struct platform_device *pdev)
+{
+	struct mdio_mux_gpio_state *s = pdev->dev.platform_data;
+	mdio_mux_uninit(s->mux_handle);
+	return 0;
+}
+
+static struct of_device_id mdio_mux_gpio_match[] = {
+	{
+		.compatible = "mdio-mux-gpio",
+	},
+	{
+		/* Legacy compatible property. */
+		.compatible = "cavium,mdio-mux-sn74cbtlv3253",
+	},
+	{},
+};
+MODULE_DEVICE_TABLE(of, mdio_mux_gpio_match);
+
+static struct platform_driver mdio_mux_gpio_driver = {
+	.driver = {
+		.name		= "mdio-mux-gpio",
+		.owner		= THIS_MODULE,
+		.of_match_table = mdio_mux_gpio_match,
+	},
+	.probe		= mdio_mux_gpio_probe,
+	.remove		= __devexit_p(mdio_mux_gpio_remove),
+};
+
+module_platform_driver(mdio_mux_gpio_driver);
+
+MODULE_DESCRIPTION(DRV_DESCRIPTION);
+MODULE_VERSION(DRV_VERSION);
+MODULE_AUTHOR("David Daney");
+MODULE_LICENSE("GPL");
-- 
1.7.2.3

^ permalink raw reply related

* linux-next: manual merge of the wireless-next tree with the net-next tree
From: Stephen Rothwell @ 2012-05-03  1:31 UTC (permalink / raw)
  To: John W. Linville
  Cc: linux-next, linux-kernel, Ashok Nagarajan, David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 5530 bytes --]

Hi John,

Today's linux-next merge of the wireless-next tree got a conflict in
net/wireless/nl80211.c between commit 9360ffd18597 ("wireless: Stop using
NLA_PUT*()") from the net-next tree and commit 0a9b3782ef40 ("{nl,cfg,mac}
80211: Allow user to see/configure HT protection mode") from the
wireless-next tree.

I fixed it up (see below) and can carry the fix as necessary.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

diff --cc net/wireless/nl80211.c
index d5005c5,859bd66..0000000
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@@ -3348,50 -3292,51 +3348,52 @@@ static int nl80211_get_mesh_config(stru
  	pinfoattr = nla_nest_start(msg, NL80211_ATTR_MESH_CONFIG);
  	if (!pinfoattr)
  		goto nla_put_failure;
 -	NLA_PUT_U32(msg, NL80211_ATTR_IFINDEX, dev->ifindex);
 -	NLA_PUT_U16(msg, NL80211_MESHCONF_RETRY_TIMEOUT,
 -			cur_params.dot11MeshRetryTimeout);
 -	NLA_PUT_U16(msg, NL80211_MESHCONF_CONFIRM_TIMEOUT,
 -			cur_params.dot11MeshConfirmTimeout);
 -	NLA_PUT_U16(msg, NL80211_MESHCONF_HOLDING_TIMEOUT,
 -			cur_params.dot11MeshHoldingTimeout);
 -	NLA_PUT_U16(msg, NL80211_MESHCONF_MAX_PEER_LINKS,
 -			cur_params.dot11MeshMaxPeerLinks);
 -	NLA_PUT_U8(msg, NL80211_MESHCONF_MAX_RETRIES,
 -			cur_params.dot11MeshMaxRetries);
 -	NLA_PUT_U8(msg, NL80211_MESHCONF_TTL,
 -			cur_params.dot11MeshTTL);
 -	NLA_PUT_U8(msg, NL80211_MESHCONF_ELEMENT_TTL,
 -			cur_params.element_ttl);
 -	NLA_PUT_U8(msg, NL80211_MESHCONF_AUTO_OPEN_PLINKS,
 -			cur_params.auto_open_plinks);
 -	NLA_PUT_U32(msg, NL80211_MESHCONF_SYNC_OFFSET_MAX_NEIGHBOR,
 -			cur_params.dot11MeshNbrOffsetMaxNeighbor);
 -	NLA_PUT_U8(msg, NL80211_MESHCONF_HWMP_MAX_PREQ_RETRIES,
 -			cur_params.dot11MeshHWMPmaxPREQretries);
 -	NLA_PUT_U32(msg, NL80211_MESHCONF_PATH_REFRESH_TIME,
 -			cur_params.path_refresh_time);
 -	NLA_PUT_U16(msg, NL80211_MESHCONF_MIN_DISCOVERY_TIMEOUT,
 -			cur_params.min_discovery_timeout);
 -	NLA_PUT_U32(msg, NL80211_MESHCONF_HWMP_ACTIVE_PATH_TIMEOUT,
 -			cur_params.dot11MeshHWMPactivePathTimeout);
 -	NLA_PUT_U16(msg, NL80211_MESHCONF_HWMP_PREQ_MIN_INTERVAL,
 -			cur_params.dot11MeshHWMPpreqMinInterval);
 -	NLA_PUT_U16(msg, NL80211_MESHCONF_HWMP_PERR_MIN_INTERVAL,
 -			cur_params.dot11MeshHWMPperrMinInterval);
 -	NLA_PUT_U16(msg, NL80211_MESHCONF_HWMP_NET_DIAM_TRVS_TIME,
 -			cur_params.dot11MeshHWMPnetDiameterTraversalTime);
 -	NLA_PUT_U8(msg, NL80211_MESHCONF_HWMP_ROOTMODE,
 -			cur_params.dot11MeshHWMPRootMode);
 -	NLA_PUT_U16(msg, NL80211_MESHCONF_HWMP_RANN_INTERVAL,
 -			cur_params.dot11MeshHWMPRannInterval);
 -	NLA_PUT_U8(msg, NL80211_MESHCONF_GATE_ANNOUNCEMENTS,
 -			cur_params.dot11MeshGateAnnouncementProtocol);
 -	NLA_PUT_U8(msg, NL80211_MESHCONF_FORWARDING,
 -			cur_params.dot11MeshForwarding);
 -	NLA_PUT_U32(msg, NL80211_MESHCONF_RSSI_THRESHOLD,
 -			cur_params.rssi_threshold);
 -	NLA_PUT_U32(msg, NL80211_MESHCONF_HT_OPMODE,
 -			cur_params.ht_opmode);
 +	if (nla_put_u32(msg, NL80211_ATTR_IFINDEX, dev->ifindex) ||
 +	    nla_put_u16(msg, NL80211_MESHCONF_RETRY_TIMEOUT,
 +			cur_params.dot11MeshRetryTimeout) ||
 +	    nla_put_u16(msg, NL80211_MESHCONF_CONFIRM_TIMEOUT,
 +			cur_params.dot11MeshConfirmTimeout) ||
 +	    nla_put_u16(msg, NL80211_MESHCONF_HOLDING_TIMEOUT,
 +			cur_params.dot11MeshHoldingTimeout) ||
 +	    nla_put_u16(msg, NL80211_MESHCONF_MAX_PEER_LINKS,
 +			cur_params.dot11MeshMaxPeerLinks) ||
 +	    nla_put_u8(msg, NL80211_MESHCONF_MAX_RETRIES,
 +		       cur_params.dot11MeshMaxRetries) ||
 +	    nla_put_u8(msg, NL80211_MESHCONF_TTL,
 +		       cur_params.dot11MeshTTL) ||
 +	    nla_put_u8(msg, NL80211_MESHCONF_ELEMENT_TTL,
 +		       cur_params.element_ttl) ||
 +	    nla_put_u8(msg, NL80211_MESHCONF_AUTO_OPEN_PLINKS,
 +		       cur_params.auto_open_plinks) ||
 +	    nla_put_u32(msg, NL80211_MESHCONF_SYNC_OFFSET_MAX_NEIGHBOR,
 +			cur_params.dot11MeshNbrOffsetMaxNeighbor) ||
 +	    nla_put_u8(msg, NL80211_MESHCONF_HWMP_MAX_PREQ_RETRIES,
 +		       cur_params.dot11MeshHWMPmaxPREQretries) ||
 +	    nla_put_u32(msg, NL80211_MESHCONF_PATH_REFRESH_TIME,
 +			cur_params.path_refresh_time) ||
 +	    nla_put_u16(msg, NL80211_MESHCONF_MIN_DISCOVERY_TIMEOUT,
 +			cur_params.min_discovery_timeout) ||
 +	    nla_put_u32(msg, NL80211_MESHCONF_HWMP_ACTIVE_PATH_TIMEOUT,
 +			cur_params.dot11MeshHWMPactivePathTimeout) ||
 +	    nla_put_u16(msg, NL80211_MESHCONF_HWMP_PREQ_MIN_INTERVAL,
 +			cur_params.dot11MeshHWMPpreqMinInterval) ||
 +	    nla_put_u16(msg, NL80211_MESHCONF_HWMP_PERR_MIN_INTERVAL,
 +			cur_params.dot11MeshHWMPperrMinInterval) ||
 +	    nla_put_u16(msg, NL80211_MESHCONF_HWMP_NET_DIAM_TRVS_TIME,
 +			cur_params.dot11MeshHWMPnetDiameterTraversalTime) ||
 +	    nla_put_u8(msg, NL80211_MESHCONF_HWMP_ROOTMODE,
 +		       cur_params.dot11MeshHWMPRootMode) ||
 +	    nla_put_u16(msg, NL80211_MESHCONF_HWMP_RANN_INTERVAL,
 +			cur_params.dot11MeshHWMPRannInterval) ||
 +	    nla_put_u8(msg, NL80211_MESHCONF_GATE_ANNOUNCEMENTS,
 +		       cur_params.dot11MeshGateAnnouncementProtocol) ||
 +	    nla_put_u8(msg, NL80211_MESHCONF_FORWARDING,
 +		       cur_params.dot11MeshForwarding) ||
 +	    nla_put_u32(msg, NL80211_MESHCONF_RSSI_THRESHOLD,
- 			cur_params.rssi_threshold))
++			cur_params.rssi_threshold) ||
++	    nla_put_u32(msg, NL80211_MESHCONF_HT_OPMODE,
++			cur_params.ht_opmode))
 +		goto nla_put_failure;
  	nla_nest_end(msg, pinfoattr);
  	genlmsg_end(msg, hdr);
  	return genlmsg_reply(msg, info);

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: linux-next: manual merge of the wireless-next tree with the net-next tree
From: David Miller @ 2012-05-03  1:35 UTC (permalink / raw)
  To: sfr; +Cc: linville, linux-next, linux-kernel, ashok, netdev
In-Reply-To: <20120503113110.3eded43c5dc526a471c69dd2@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Thu, 3 May 2012 11:31:10 +1000

> Today's linux-next merge of the wireless-next tree got a conflict in
> net/wireless/nl80211.c between commit 9360ffd18597 ("wireless: Stop using
> NLA_PUT*()") from the net-next tree and commit 0a9b3782ef40 ("{nl,cfg,mac}
> 80211: Allow user to see/configure HT protection mode") from the
> wireless-next tree.

John, the NLA_PUT macros were removed from the net-next tree more
than a month ago.

There is zero reason why this should still be happening, and this
issue is causing a large, unnecessary, burdon upon Stephen.

Please take care of this, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
From: Eric Dumazet @ 2012-05-03  1:52 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell, Tom Herbert,
	Jeff Kirsher, Michael Chan, Matt Carlson, Herbert Xu,
	Ben Hutchings, Ilpo Järvinen, Maciej Żenczykowski
In-Reply-To: <4FA19F5B.7040407@intel.com>

On Wed, 2012-05-02 at 13:55 -0700, Alexander Duyck wrote:
> On 05/02/2012 11:15 AM, Eric Dumazet wrote:
> > On Wed, 2012-05-02 at 11:05 -0700, Alexander Duyck wrote:
> >
> >> You're correct about the fragstolen case, I actually was thinking of the
> >> first patch you sent, not this second one.
> >>
> >> However we still have a problem.  What we end up with now is a case of
> >> sharing in which the clone skb no longer knows that it is sharing the
> >> head with another skb.  The dataref will drop to 1 when we call
> >> __kfree_skb.  This means that any other function out there that tries to
> >> see if the skb is shared would return false.  This could lead to issues
> >> if there is anything out there that manipulates the data in head based
> >> on the false assumption that it is not cloned.  What we would probably
> >> need to do in this case is tweak the logic for skb_cloned.  If you are
> >> using a head_frag you should probably add a check that returns true if
> >> cloned is true and page_count is greater than 1.  We should be safe in
> >> the case of skb_header_cloned since we already dropped are dataref when
> >> we stole the page and freed the skb.
> > I really dont understand this concern.
> >
> > When skb is cloned, we copy in head_frag __skb_clone()
> >
> > So both skbs have the bit set, and dataref = 2.
> >
> > first skb is freed, dataref becomes 1 and nothing special  happen
> >
> > >From this point, skb->head is not 'shared' anymore (taken your own
> > words). And we are free to do whatever we want.
> >
> > second skb is freed, dataref becomes 0 and we call the right destructor.
> The problem is that the stack will not be able to detect sharing.  As
> long as page_count is greater than 2 and skb->cloned is set we should be
> telling any callers to skb_cloned that the head is cloned.  Otherwise we
> can run into issues elsewhere with well meaning code checking and not
> detecting sharing, and then mangling the header.
> 

page count is irrelevant, since if PAGE_SIZE=65536, you can have 32
fragments (of 2048 bytes) per page. Still we can call put_page() for
each individual frag that must be freed.

You forgot to give an example of path that would be failing. Since the
skb_cloned() check is still valid.

Head is cloned if : skb->cloned is set and dataref value is not 1

(minus the skb_header_release() tweaks done on output path for tcp)

Every time a 'caller' is going to modify/mangle its skb head, it must
first call pskb_expand_head() (or various helpers around it) to :

- allocate a new skb->head
- copy old content to new head
- release a reference on old head dataref
- if old dataref reaches 0, 'free' old head (might be a kfree() or
put_page())

> Also I am not sure if the big monolithic changes are really the best way
> to approach this.  It would be nice if we could fix this incrementally
> instead of trying to do it all at once since there are multiple issues
> that need to be addressed.
> 
> I will try to submit a few patches from my end later today.  I still
> need to look over all of the changes from the past couple of weeks that
> were based on the assumption that the IP stack completely owned the skb.

I did my best to provide small changes.

Plus TCP coalescing is done after IP processing.

Owning skb is a vague concept anyway. IP borrows skb but not owns them.
They already could be cloned skb.

^ permalink raw reply

* Re: [PATCH v4 1/3] tcp: early retransmit: tcp_enter_recovery()
From: Eric Dumazet @ 2012-05-03  1:56 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: Yuchung Cheng, davem, ilpo.jarvinen, nanditad, netdev
In-Reply-To: <CADVnQymbDLXSR=aRB8SSd8GxpsFB3j2DLCBxH-1t2oS0T+sxgA@mail.gmail.com>

On Wed, 2012-05-02 at 20:13 -0400, Neal Cardwell wrote:
> On Wed, May 2, 2012 at 7:30 PM, Yuchung Cheng <ycheng@google.com> wrote:
> > This a prepartion patch that refactors the code to enter recovery
> > into a new function tcp_enter_recovery(). It's needed to implement
> > the delayed fast retransmit in ER.
> >
> > Signed-off-by: Yuchung Cheng <ycheng@google.com>
> > ---
> > ChangeLog since v1:
> >  - swaped with part 1 and part2
> > ChangeLog since v2:
> >  - removed RFC in commit message
> > ChangeLog since v3:
> >  - nothing
> 
> Acked-by: Neal Cardwell <ncardwell@google.com>

Yuchung, you should add our Acked-by to new submissions if no change on
a particular part was done, or else final inclusion wont have them

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv()
From: Eric Dumazet @ 2012-05-03  2:14 UTC (permalink / raw)
  To: David Miller
  Cc: alexander.duyck, alexander.h.duyck, netdev, ncardwell, therbert,
	jeffrey.t.kirsher, mchan, mcarlson, herbert, bhutchings,
	ilpo.jarvinen, maze
In-Reply-To: <20120502.211157.2201073892985382154.davem@davemloft.net>

On Wed, 2012-05-02 at 21:11 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 02 May 2012 21:58:29 +0200
> 
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > Extend tcp coalescing implementing it from tcp_queue_rcv(), the main
> > receiver function when application is not blocked in recvmsg().
> > 
> > Function tcp_queue_rcv() is moved a bit to allow its call from
> > tcp_data_queue()
> > 
> > This gives good results especially if GRO could not kick, and if skb
> > head is a fragment.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> Applied.

Thanks David

My next step is to provide a common helper to NAPI drivers to replace 
netdev_alloc_skb_ip_align(struct net_device *dev, unsigned int len)

by :

struct sk_buff *napi_alloc_rxskb(struct napi_struct *napi, unsigned int len)

That will manage a cache of one page, splitted in fragments as needed.
(roughly the code we added in tg3 as POC)

Because converting drivers to build_skb() is a too slow process.

^ permalink raw reply

* Re: [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv()
From: David Miller @ 2012-05-03  2:21 UTC (permalink / raw)
  To: eric.dumazet
  Cc: alexander.duyck, alexander.h.duyck, netdev, ncardwell, therbert,
	jeffrey.t.kirsher, mchan, mcarlson, herbert, bhutchings,
	ilpo.jarvinen, maze
In-Reply-To: <1336011260.22133.754.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 03 May 2012 04:14:20 +0200

> My next step is to provide a common helper to NAPI drivers to replace 
> netdev_alloc_skb_ip_align(struct net_device *dev, unsigned int len)
> 
> by :
> 
> struct sk_buff *napi_alloc_rxskb(struct napi_struct *napi, unsigned int len)
> 
> That will manage a cache of one page, splitted in fragments as needed.
> (roughly the code we added in tg3 as POC)
> 
> Because converting drivers to build_skb() is a too slow process.

Sounds good.

^ permalink raw reply

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
From: Alexander Duyck @ 2012-05-03  3:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell, Tom Herbert,
	Jeff Kirsher, Michael Chan, Matt Carlson, Herbert Xu,
	Ben Hutchings, Ilpo Järvinen, Maciej Żenczykowski
In-Reply-To: <1336009974.22133.706.camel@edumazet-glaptop>

On 5/2/2012 6:52 PM, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 13:55 -0700, Alexander Duyck wrote:
>> On 05/02/2012 11:15 AM, Eric Dumazet wrote:
>>> On Wed, 2012-05-02 at 11:05 -0700, Alexander Duyck wrote:
>>>> You're correct about the fragstolen case, I actually was thinking of the
>>>> first patch you sent, not this second one.
>>>>
>>>> However we still have a problem.  What we end up with now is a case of
>>>> sharing in which the clone skb no longer knows that it is sharing the
>>>> head with another skb.  The dataref will drop to 1 when we call
>>>> __kfree_skb.  This means that any other function out there that tries to
>>>> see if the skb is shared would return false.  This could lead to issues
>>>> if there is anything out there that manipulates the data in head based
>>>> on the false assumption that it is not cloned.  What we would probably
>>>> need to do in this case is tweak the logic for skb_cloned.  If you are
>>>> using a head_frag you should probably add a check that returns true if
>>>> cloned is true and page_count is greater than 1.  We should be safe in
>>>> the case of skb_header_cloned since we already dropped are dataref when
>>>> we stole the page and freed the skb.
>>> I really dont understand this concern.
>>>
>>> When skb is cloned, we copy in head_frag __skb_clone()
>>>
>>> So both skbs have the bit set, and dataref = 2.
>>>
>>> first skb is freed, dataref becomes 1 and nothing special  happen
>>>
>>> > From this point, skb->head is not 'shared' anymore (taken your own
>>> words). And we are free to do whatever we want.
>>>
>>> second skb is freed, dataref becomes 0 and we call the right destructor.
>> The problem is that the stack will not be able to detect sharing.  As
>> long as page_count is greater than 2 and skb->cloned is set we should be
>> telling any callers to skb_cloned that the head is cloned.  Otherwise we
>> can run into issues elsewhere with well meaning code checking and not
>> detecting sharing, and then mangling the header.
> page count is irrelevant, since if PAGE_SIZE=65536, you can have 32
> fragments (of 2048 bytes) per page. Still we can call put_page() for
> each individual frag that must be freed.
>
> You forgot to give an example of path that would be failing. Since the
> skb_cloned() check is still valid.
>
> Head is cloned if : skb->cloned is set and dataref value is not 1
>
> (minus the skb_header_release() tweaks done on output path for tcp)
>
> Every time a 'caller' is going to modify/mangle its skb head, it must
> first call pskb_expand_head() (or various helpers around it) to :
>
> - allocate a new skb->head
> - copy old content to new head
> - release a reference on old head dataref
> - if old dataref reaches 0, 'free' old head (might be a kfree() or
> put_page())
This is exactly my point.  The way your current code works the check in 
pskb_expand_head will not detect the header is cloned.

So for example lets say you have one of your skbs that goes through and 
is merged.

1.  You start with a cloned skb.  dataref = 2, head_frag page count = 1;
2.  You go through tcp_try_coalesce.  dataref = 2, head_frag page count = 2;
3.  You call __kfree_skb on the skb.  dataref = 1, head_frag page count = 2;
4   At this point if the holder of the clone decides to modify the skb-> 
head there is nothing to stop it from doing so.

Odds are you would have to have a pretty extreme corner case to really 
see any issues with this since I know most raw sockets won't mangle the 
data portion of the frame.  I just figure we could save ourselves a lot 
of trouble by just not coalescing the head_frag in the case that the skb 
is cloned.

>> Also I am not sure if the big monolithic changes are really the best way
>> to approach this.  It would be nice if we could fix this incrementally
>> instead of trying to do it all at once since there are multiple issues
>> that need to be addressed.
>>
>> I will try to submit a few patches from my end later today.  I still
>> need to look over all of the changes from the past couple of weeks that
>> were based on the assumption that the IP stack completely owned the skb.
> I did my best to provide small changes.
I just meant you didn't need to do things like add the kfree_skb_partial 
helper in the same patch.

> Plus TCP coalescing is done after IP processing.
>
> Owning skb is a vague concept anyway. IP borrows skb but not owns them.
> They already could be cloned skb.
I know it is a vague concept.  However it is one of those concepts where 
if we don't get it right we start to see random panics and call traces 
due to memory corruption.  That is why I prefer to avoid it if at all 
possible.  All I am really asking for is that we don't just use the 
head_frag as a page unless we know it is not cloned.  All those checks 
that just look for skb->head_frag could check for skb->head_frag && 
!skb_cloned(skb) and I would be a much happier person since then I know 
all owners of the clones will be using the same variable for reference 
count tracking.

I'm still working on those patches.  I thought I had sent them earlier 
but it looks like I had an email issue, and since then Dave pulled your 
patches so I am rebasing the patches on the updated tree and should have 
something in the next hour or so.

Thanks,

Alex

^ permalink raw reply

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
From: Eric Dumazet @ 2012-05-03  3:14 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell, Tom Herbert,
	Jeff Kirsher, Michael Chan, Matt Carlson, Herbert Xu,
	Ben Hutchings, Ilpo Järvinen, Maciej Żenczykowski
In-Reply-To: <4FA1F4C3.8000804@gmail.com>

On Wed, 2012-05-02 at 20:00 -0700, Alexander Duyck wrote:

> This is exactly my point.  The way your current code works the check in 
> pskb_expand_head will not detect the header is cloned.
> 
> So for example lets say you have one of your skbs that goes through and 
> is merged.
> 
> 1.  You start with a cloned skb.  dataref = 2, head_frag page count = 1;
> 2.  You go through tcp_try_coalesce.  dataref = 2, head_frag page count = 2;
> 3.  You call __kfree_skb on the skb.  dataref = 1, head_frag page count = 2;


If page count was incremented in 2., this is because we stole the head.

But we could not stole the head since dataref = 2

So try to find a real example, based on current state, not on the first
patch I sent, since we made progress since this time.

Thanks

^ permalink raw reply

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
From: Alexander Duyck @ 2012-05-03  3:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell, Tom Herbert,
	Jeff Kirsher, Michael Chan, Matt Carlson, Herbert Xu,
	Ben Hutchings, Ilpo Järvinen, Maciej Żenczykowski
In-Reply-To: <1336014869.22133.821.camel@edumazet-glaptop>

On 5/2/2012 8:14 PM, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 20:00 -0700, Alexander Duyck wrote:
>> This is exactly my point.  The way your current code works the check in
>> pskb_expand_head will not detect the header is cloned.
>>
>> So for example lets say you have one of your skbs that goes through and
>> is merged.
>>
>> 1.  You start with a cloned skb.  dataref = 2, head_frag page count = 1;
>> 2.  You go through tcp_try_coalesce.  dataref = 2, head_frag page count = 2;

4592         if (from->head_frag) {
4593                 struct page *page;
4594                 unsigned int offset;
4595
4596                 if (skb_shinfo(to)->nr_frags + skb_shinfo(from)->nr_frags>= MAX_SKB_FRAGS)
4597                         return false;
4598                 page = virt_to_head_page(from->head);
4599                 offset = from->data - (unsigned char *)page_address(page);
4600                 skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
4601                                    page, offset, skb_headlen(from));
4602
4603                 if (skb_cloned(from))
4604                         get_page(page);
4605                 else
4606                         *fragstolen = true;
4607
4608                 delta = len; /* we dont know real truesize... */
4609                 goto copyfrags;
4610         }


>> 3.  You call __kfree_skb on the skb.  dataref = 1, head_frag page count = 2;
4614 static void kfree_skb_partial(struct sk_buff *skb, bool head_stolen)
4615 {
4616         if (head_stolen)
4617                 kmem_cache_free(skbuff_head_cache, skb);
4618         else
4619                 __kfree_skb(skb);
4620 }
4621

>
> If page count was incremented in 2., this is because we stole the head.
>
> But we could not stole the head since dataref = 2
I don't see how that is the case.  It looks pretty clear to me that if 
dataref ==2, as represented by skb_cloned(from) being true above, we are 
calling get_page which will bump the page count and we are still 
stealing the head and then calling __kfree_skb which will decrement dataref.
> So try to find a real example, based on current state, not on the first
> patch I sent, since we made progress since this time.
>
> Thanks

I included a couple of grabs from blobs on git.kernel.org for Dave's 
current tree.  I should have patches out in about 10 minutes and then we 
can discuss those.

Thanks,

Alex

^ permalink raw reply

* [PATCH 0/2] Cleanups for head_frag usage and tcp_try_coalese
From: Alexander Duyck @ 2012-05-03  3:38 UTC (permalink / raw)
  To: netdev; +Cc: davem

I wrote up these patches earlier today to address issues related to the
splitting of reference counting into two seperate counts in the case of
skbs containing a head frag.  In addition I have a secondary patch for
tcp_try_coalesce which is meant to be a general cleanup as I found it
difficult to read through the code as it was.

I have done some performance testing on the series with netperf and saw no
appreciable difference after these changes for the non-cloned case.
---

Alexander Duyck (2):
      tcp: cleanup tcp_try_coalesce
      net: Stop decapitating clones that have a head_frag

 net/core/skbuff.c    |    2 +
 net/ipv4/tcp_input.c |   85 +++++++++++++++++++++++++-------------------------
 2 files changed, 44 insertions(+), 43 deletions(-)

-- 

^ permalink raw reply

* [PATCH 1/2] net: Stop decapitating clones that have a head_frag
From: Alexander Duyck @ 2012-05-03  3:38 UTC (permalink / raw)
  To: netdev; +Cc: davem, Alexander Duyck, Eric Dumazet, Jeff Kirsher
In-Reply-To: <20120503033018.5482.89902.stgit@gitlad.jf.intel.com>

This change is meant ot prevent stealing the skb->head to use as a page in
the event that the skb->head was cloned.  This allows the other clones to
track each other via shinfo->dataref.

Without this we break down to two methods for tracking the reference count,
one being dataref, the other being the page count.  As a result it becomes
difficult to track how many references there are to skb->head.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 net/core/skbuff.c    |    2 +-
 net/ipv4/tcp_input.c |    9 ++-------
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 52ba2b5..8703754 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1699,7 +1699,7 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
 			      struct splice_pipe_desc *spd, struct sock *sk)
 {
 	int seg;
-	bool head_is_linear = !skb->head_frag;
+	bool head_is_linear = !skb->head_frag || skb_cloned(skb);
 
 	/* map the linear part :
 	 * If skb->head_frag is set, this 'linear' part is backed
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2f696ef..c6f78e2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4589,7 +4589,7 @@ copyfrags:
 		to->data_len += len;
 		goto merge;
 	}
-	if (from->head_frag) {
+	if (from->head_frag && !skb_cloned(from)) {
 		struct page *page;
 		unsigned int offset;
 
@@ -4599,12 +4599,7 @@ copyfrags:
 		offset = from->data - (unsigned char *)page_address(page);
 		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
 				   page, offset, skb_headlen(from));
-
-		if (skb_cloned(from))
-			get_page(page);
-		else
-			*fragstolen = true;
-
+		*fragstolen = true;
 		delta = len; /* we dont know real truesize... */
 		goto copyfrags;
 	}

^ permalink raw reply related

* [PATCH 2/2] tcp: cleanup tcp_try_coalesce
From: Alexander Duyck @ 2012-05-03  3:39 UTC (permalink / raw)
  To: netdev; +Cc: davem, Alexander Duyck, Eric Dumazet, Jeff Kirsher
In-Reply-To: <20120503033018.5482.89902.stgit@gitlad.jf.intel.com>

This change is mostly meant to help improve the readability of
tcp_try_coalesce.  I had a few issues since there were several points when
the code would test for a conditional, fail, then succeed on another
conditional take some action, and then follow a goto back into the previous
conditional.  I just tore all of that apart and made the whole thing one
linear flow with a single goto.

Also there were multiple ways of computing the delta, the one for head_frag
made the least amount of sense to me since we were only dropping the
sk_buff so I have updated the logic for the stolen head case so that delta
is only truesize - sizeof(skb_buff), and for the case where we are dropping
the head as well it is truesize - SKB_TRUESIZE(skb_end_pointer - head).
This way we can also account for the head_frag with headlen == 0.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 net/ipv4/tcp_input.c |   80 +++++++++++++++++++++++++++-----------------------
 1 files changed, 43 insertions(+), 37 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c6f78e2..23bc3ff 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4548,62 +4548,68 @@ static bool tcp_try_coalesce(struct sock *sk,
 	int i, delta, len = from->len;
 
 	*fragstolen = false;
+
 	if (tcp_hdr(from)->fin || skb_cloned(to))
 		return false;
+
 	if (len <= skb_tailroom(to)) {
 		BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));
-merge:
-		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE);
-		TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq;
-		TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq;
-		return true;
+		goto merge;
 	}
 
 	if (skb_has_frag_list(to) || skb_has_frag_list(from))
 		return false;
 
-	if (skb_headlen(from) == 0 &&
-	    (skb_shinfo(to)->nr_frags +
-	     skb_shinfo(from)->nr_frags <= MAX_SKB_FRAGS)) {
-		WARN_ON_ONCE(from->head_frag);
-		delta = from->truesize - ksize(from->head) -
-			SKB_DATA_ALIGN(sizeof(struct sk_buff));
-
-		WARN_ON_ONCE(delta < len);
-copyfrags:
-		memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags,
-		       skb_shinfo(from)->frags,
-		       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
-		skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
-
-		if (skb_cloned(from))
-			for (i = 0; i < skb_shinfo(from)->nr_frags; i++)
-				skb_frag_ref(from, i);
-		else
-			skb_shinfo(from)->nr_frags = 0;
-
-		to->truesize += delta;
-		atomic_add(delta, &sk->sk_rmem_alloc);
-		sk_mem_charge(sk, delta);
-		to->len += len;
-		to->data_len += len;
-		goto merge;
-	}
-	if (from->head_frag && !skb_cloned(from)) {
+	if (skb_headlen(from) != 0) {
 		struct page *page;
 		unsigned int offset;
 
-		if (skb_shinfo(to)->nr_frags + skb_shinfo(from)->nr_frags >= MAX_SKB_FRAGS)
+		if (skb_shinfo(to)->nr_frags +
+		    skb_shinfo(from)->nr_frags >= MAX_SKB_FRAGS)
+			return false;
+
+		if (!from->head_frag || skb_cloned(from))
 			return false;
+
+		delta = from->truesize - sizeof(struct sk_buff);
+
 		page = virt_to_head_page(from->head);
 		offset = from->data - (unsigned char *)page_address(page);
+
 		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
 				   page, offset, skb_headlen(from));
 		*fragstolen = true;
-		delta = len; /* we dont know real truesize... */
-		goto copyfrags;
+	} else {
+		if (skb_shinfo(to)->nr_frags +
+		    skb_shinfo(from)->nr_frags > MAX_SKB_FRAGS)
+			return false;
+
+		delta = from->truesize -
+			SKB_TRUESIZE(skb_end_pointer(from) - from->head);
 	}
-	return false;
+
+	memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags,
+	       skb_shinfo(from)->frags,
+	       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
+	skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
+
+	if (!skb_cloned(from))
+		skb_shinfo(from)->nr_frags = 0;
+
+	for (i = 0; i < skb_shinfo(from)->nr_frags; i++)
+		skb_frag_ref(from, i);
+
+	to->truesize += delta;
+	atomic_add(delta, &sk->sk_rmem_alloc);
+	sk_mem_charge(sk, delta);
+	to->len += len;
+	to->data_len += len;
+
+merge:
+	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE);
+	TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq;
+	TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq;
+	return true;
 }
 
 static void kfree_skb_partial(struct sk_buff *skb, bool head_stolen)

^ permalink raw reply related

* Re: [PATCH 1/2] net: Stop decapitating clones that have a head_frag
From: Eric Dumazet @ 2012-05-03  3:56 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, davem, Eric Dumazet, Jeff Kirsher
In-Reply-To: <20120503033856.5482.70122.stgit@gitlad.jf.intel.com>

On Wed, 2012-05-02 at 20:38 -0700, Alexander Duyck wrote:
> This change is meant ot prevent stealing the skb->head to use as a page in
> the event that the skb->head was cloned.  This allows the other clones to
> track each other via shinfo->dataref.
> 
> Without this we break down to two methods for tracking the reference count,
> one being dataref, the other being the page count.  As a result it becomes
> difficult to track how many references there are to skb->head.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
> 
>  net/core/skbuff.c    |    2 +-
>  net/ipv4/tcp_input.c |    9 ++-------
>  2 files changed, 3 insertions(+), 8 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 52ba2b5..8703754 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1699,7 +1699,7 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
>  			      struct splice_pipe_desc *spd, struct sock *sk)
>  {
>  	int seg;
> -	bool head_is_linear = !skb->head_frag;
> +	bool head_is_linear = !skb->head_frag || skb_cloned(skb);
>  

Can you chose a better name than head_is_linear maybe ?

>  	/* map the linear part :
>  	 * If skb->head_frag is set, this 'linear' part is backed
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 2f696ef..c6f78e2 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4589,7 +4589,7 @@ copyfrags:
>  		to->data_len += len;
>  		goto merge;
>  	}
> -	if (from->head_frag) {
> +	if (from->head_frag && !skb_cloned(from)) {
>  		struct page *page;
>  		unsigned int offset;
>  
> @@ -4599,12 +4599,7 @@ copyfrags:
>  		offset = from->data - (unsigned char *)page_address(page);
>  		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
>  				   page, offset, skb_headlen(from));
> -
> -		if (skb_cloned(from))
> -			get_page(page);
> -		else
> -			*fragstolen = true;
> -
> +		*fragstolen = true;
>  		delta = len; /* we dont know real truesize... */
>  		goto copyfrags;
>  	}

Thanks

^ permalink raw reply

* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce
From: Eric Dumazet @ 2012-05-03  4:06 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, davem, Eric Dumazet, Jeff Kirsher
In-Reply-To: <20120503033901.5482.27183.stgit@gitlad.jf.intel.com>

On Wed, 2012-05-02 at 20:39 -0700, Alexander Duyck wrote:
> This change is mostly meant to help improve the readability of
> tcp_try_coalesce.  I had a few issues since there were several points when
> the code would test for a conditional, fail, then succeed on another
> conditional take some action, and then follow a goto back into the previous
> conditional.  I just tore all of that apart and made the whole thing one
> linear flow with a single goto.
> 
> Also there were multiple ways of computing the delta, the one for head_frag
> made the least amount of sense to me since we were only dropping the
> sk_buff so I have updated the logic for the stolen head case so that delta
> is only truesize - sizeof(skb_buff), and for the case where we are dropping
> the head as well it is truesize - SKB_TRUESIZE(skb_end_pointer - head).
> This way we can also account for the head_frag with headlen == 0.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
> 

Sorry I prefer you dont touch this code like this.

The truesize bits must stay as is, since it'll track drivers that lies
about truesize.

>  net/ipv4/tcp_input.c |   80 +++++++++++++++++++++++++++-----------------------
>  1 files changed, 43 insertions(+), 37 deletions(-)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index c6f78e2..23bc3ff 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4548,62 +4548,68 @@ static bool tcp_try_coalesce(struct sock *sk,
>  	int i, delta, len = from->len;
>  
>  	*fragstolen = false;
> +
>  	if (tcp_hdr(from)->fin || skb_cloned(to))
>  		return false;
> +
>  	if (len <= skb_tailroom(to)) {
>  		BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));
> -merge:
> -		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE);
> -		TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq;
> -		TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq;
> -		return true;
> +		goto merge;
>  	}
>  
>  	if (skb_has_frag_list(to) || skb_has_frag_list(from))
>  		return false;
>  
> -	if (skb_headlen(from) == 0 &&
> -	    (skb_shinfo(to)->nr_frags +
> -	     skb_shinfo(from)->nr_frags <= MAX_SKB_FRAGS)) {
> -		WARN_ON_ONCE(from->head_frag);
> -		delta = from->truesize - ksize(from->head) -
> -			SKB_DATA_ALIGN(sizeof(struct sk_buff));


This delta was done on purpose. We want the ksize()

> -
> -		WARN_ON_ONCE(delta < len);
> -copyfrags:
> -		memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags,
> -		       skb_shinfo(from)->frags,
> -		       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
> -		skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
> -
> -		if (skb_cloned(from))
> -			for (i = 0; i < skb_shinfo(from)->nr_frags; i++)
> -				skb_frag_ref(from, i);
> -		else
> -			skb_shinfo(from)->nr_frags = 0;
> -
> -		to->truesize += delta;
> -		atomic_add(delta, &sk->sk_rmem_alloc);
> -		sk_mem_charge(sk, delta);
> -		to->len += len;
> -		to->data_len += len;
> -		goto merge;
> -	}
> -	if (from->head_frag && !skb_cloned(from)) {
> +	if (skb_headlen(from) != 0) {
>  		struct page *page;
>  		unsigned int offset;
>  
> -		if (skb_shinfo(to)->nr_frags + skb_shinfo(from)->nr_frags >= MAX_SKB_FRAGS)
> +		if (skb_shinfo(to)->nr_frags +
> +		    skb_shinfo(from)->nr_frags >= MAX_SKB_FRAGS)
> +			return false;
> +
> +		if (!from->head_frag || skb_cloned(from))
>  			return false;
> +
> +		delta = from->truesize - sizeof(struct sk_buff);



> +
>  		page = virt_to_head_page(from->head);
>  		offset = from->data - (unsigned char *)page_address(page);
> +
>  		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
>  				   page, offset, skb_headlen(from));
>  		*fragstolen = true;
> -		delta = len; /* we dont know real truesize... */
> -		goto copyfrags;
> +	} else {
> +		if (skb_shinfo(to)->nr_frags +
> +		    skb_shinfo(from)->nr_frags > MAX_SKB_FRAGS)
> +			return false;
> +
> +		delta = from->truesize -
> +			SKB_TRUESIZE(skb_end_pointer(from) - from->head);

No... SKB_TRUESIZE() doesnt account of power-of-two roundings in
kmalloc()

>  	}
> -	return false;
> +
> +	memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags,
> +	       skb_shinfo(from)->frags,
> +	       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
> +	skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
> +
> +	if (!skb_cloned(from))
> +		skb_shinfo(from)->nr_frags = 0;
> +

You break the code here... we had an else.

> +	for (i = 0; i < skb_shinfo(from)->nr_frags; i++)
> +		skb_frag_ref(from, i);
> +
> +	to->truesize += delta;
> +	atomic_add(delta, &sk->sk_rmem_alloc);
> +	sk_mem_charge(sk, delta);
> +	to->len += len;
> +	to->data_len += len;
> +
> +merge:
> +	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE);
> +	TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq;
> +	TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq;
> +	return true;
>  }
>  
>  static void kfree_skb_partial(struct sk_buff *skb, bool head_stolen)
> 

Really this patch is too hard to review.

Thanks

^ permalink raw reply

* [v2 PATCH] net: Stop decapitating clones that have a head_frag
From: Alexander Duyck @ 2012-05-03  4:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, Alexander Duyck, Eric Dumazet, Jeff Kirsher

This change is meant ot prevent stealing the skb->head to use as a page in
the event that the skb->head was cloned.  This allows the other clones to
track each other via shinfo->dataref.

Without this we break down to two methods for tracking the reference count,
one being dataref, the other being the page count.  As a result it becomes
difficult to track how many references there are to skb->head.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 net/core/skbuff.c    |    9 +++++----
 net/ipv4/tcp_input.c |    9 ++-------
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 52ba2b5..9e8caa0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1699,17 +1699,18 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
 			      struct splice_pipe_desc *spd, struct sock *sk)
 {
 	int seg;
-	bool head_is_linear = !skb->head_frag;
+	bool head_is_locked = !skb->head_frag || skb_cloned(skb);
 
 	/* map the linear part :
-	 * If skb->head_frag is set, this 'linear' part is backed
-	 * by a fragment, and we can avoid a copy.
+	 * If skb->head_frag is set, this 'linear' part is backed by a
+	 * fragment, and if the head is not shared with any clones then
+	 * we can avoid a copy since we own the head portion of this page.
 	 */
 	if (__splice_segment(virt_to_page(skb->data),
 			     (unsigned long) skb->data & (PAGE_SIZE - 1),
 			     skb_headlen(skb),
 			     offset, len, skb, spd,
-			     head_is_linear,
+			     head_is_locked,
 			     sk, pipe))
 		return true;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2f696ef..c6f78e2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4589,7 +4589,7 @@ copyfrags:
 		to->data_len += len;
 		goto merge;
 	}
-	if (from->head_frag) {
+	if (from->head_frag && !skb_cloned(from)) {
 		struct page *page;
 		unsigned int offset;
 
@@ -4599,12 +4599,7 @@ copyfrags:
 		offset = from->data - (unsigned char *)page_address(page);
 		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
 				   page, offset, skb_headlen(from));
-
-		if (skb_cloned(from))
-			get_page(page);
-		else
-			*fragstolen = true;
-
+		*fragstolen = true;
 		delta = len; /* we dont know real truesize... */
 		goto copyfrags;
 	}

^ permalink raw reply related

* Re: [v2 PATCH] net: Stop decapitating clones that have a head_frag
From: Eric Dumazet @ 2012-05-03  4:33 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, davem, Eric Dumazet, Jeff Kirsher
In-Reply-To: <20120503041842.8315.16138.stgit@gitlad.jf.intel.com>

On Wed, 2012-05-02 at 21:18 -0700, Alexander Duyck wrote:
> This change is meant ot prevent stealing the skb->head to use as a page in
> the event that the skb->head was cloned.  This allows the other clones to
> track each other via shinfo->dataref.
> 
> Without this we break down to two methods for tracking the reference count,
> one being dataref, the other being the page count.  As a result it becomes
> difficult to track how many references there are to skb->head.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
> 
>  net/core/skbuff.c    |    9 +++++----
>  net/ipv4/tcp_input.c |    9 ++-------
>  2 files changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 52ba2b5..9e8caa0 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1699,17 +1699,18 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
>  			      struct splice_pipe_desc *spd, struct sock *sk)
>  {
>  	int seg;
> -	bool head_is_linear = !skb->head_frag;
> +	bool head_is_locked = !skb->head_frag || skb_cloned(skb);
>  

Thanks

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH net v2] cdc_ether: Ignore bogus union descriptor for RNDIS devices
From: Markus Kolb @ 2012-05-03  4:57 UTC (permalink / raw)
  To: David Miller, bjorn-yOkvZcmFvRU
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA,
	shaola-r/HQZD9NxM9g9hUCZPvPmw, jrnieder-Re5JQEeQqe8AvxtiuMwx3w,
	oliver-GvhC2dPhHPQdnm+yROfE0A, 655387-61a8vm9lEZVf4u+23C9RwQ,
	stable-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20120502.211411.560705499667456177.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>



David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> schrieb:

>From: Bjørn Mork <bjorn-yOkvZcmFvRU@public.gmane.org>
>Date: Thu, 26 Apr 2012 14:35:10 +0200
>
>> The same comments as for v1 regarding testing applies.  This is build
>> tested only.  Should go through some functional testing before being
>> applied.
>
>Well?  Is anyone gonna test this?

I'll build it during next rainy day and will report its success after some usage ;-)

Markus
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce
From: Alexander Duyck @ 2012-05-03  4:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Alexander Duyck, netdev, davem, Eric Dumazet, Jeff Kirsher
In-Reply-To: <1336017985.12425.9.camel@edumazet-glaptop>

On 5/2/2012 9:06 PM, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 20:39 -0700, Alexander Duyck wrote:
>> This change is mostly meant to help improve the readability of
>> tcp_try_coalesce.  I had a few issues since there were several points when
>> the code would test for a conditional, fail, then succeed on another
>> conditional take some action, and then follow a goto back into the previous
>> conditional.  I just tore all of that apart and made the whole thing one
>> linear flow with a single goto.
>>
>> Also there were multiple ways of computing the delta, the one for head_frag
>> made the least amount of sense to me since we were only dropping the
>> sk_buff so I have updated the logic for the stolen head case so that delta
>> is only truesize - sizeof(skb_buff), and for the case where we are dropping
>> the head as well it is truesize - SKB_TRUESIZE(skb_end_pointer - head).
>> This way we can also account for the head_frag with headlen == 0.
>>
>> Signed-off-by: Alexander Duyck<alexander.h.duyck@intel.com>
>> Cc: Eric Dumazet<edumazet@google.com>
>> Cc: Jeff Kirsher<jeffrey.t.kirsher@intel.com>
>> ---
>>
> Sorry I prefer you dont touch this code like this.
>
> The truesize bits must stay as is, since it'll track drivers that lies
> about truesize.

I can understand that.  But from what I can tell the only way they can 
lie about truesize is to actually change the value itself.  The code I 
modified was just tightening things up so we didn't do things like use 
the length to track the truesize like we were in the original code.  
 From what I can tell the new code should have been more accurate.

>>   net/ipv4/tcp_input.c |   80 +++++++++++++++++++++++++++-----------------------
>>   1 files changed, 43 insertions(+), 37 deletions(-)
>>
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index c6f78e2..23bc3ff 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -4548,62 +4548,68 @@ static bool tcp_try_coalesce(struct sock *sk,
>>   	int i, delta, len = from->len;
>>
>>   	*fragstolen = false;
>> +
>>   	if (tcp_hdr(from)->fin || skb_cloned(to))
>>   		return false;
>> +
>>   	if (len<= skb_tailroom(to)) {
>>   		BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));
>> -merge:
>> -		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE);
>> -		TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq;
>> -		TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq;
>> -		return true;
>> +		goto merge;
>>   	}
>>
>>   	if (skb_has_frag_list(to) || skb_has_frag_list(from))
>>   		return false;
>>
>> -	if (skb_headlen(from) == 0&&
>> -	    (skb_shinfo(to)->nr_frags +
>> -	     skb_shinfo(from)->nr_frags<= MAX_SKB_FRAGS)) {
>> -		WARN_ON_ONCE(from->head_frag);
>> -		delta = from->truesize - ksize(from->head) -
>> -			SKB_DATA_ALIGN(sizeof(struct sk_buff));
>
> This delta was done on purpose. We want the ksize()
The question I have is how can you get into a case where the ksize is 
different from the end offset plus the aligned size of skb_shared_info?  
>From what I can tell it looks like the only place we can lie is if we 
use build_skb with the frag_size option, and in that case we are using a 
page, not kmalloc memory.  Otherwise in all other cases __alloc_skb or 
build_skb is using ksize(skb->head) - SKB_DATA_ALIGN(struct 
skb_shared_info) to set the end pointer, so reversing that should give 
us the same value as ksize(skb->head).

>> -
>> -		WARN_ON_ONCE(delta<  len);
>> -copyfrags:
>> -		memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags,
>> -		       skb_shinfo(from)->frags,
>> -		       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
>> -		skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
>> -
>> -		if (skb_cloned(from))
>> -			for (i = 0; i<  skb_shinfo(from)->nr_frags; i++)
>> -				skb_frag_ref(from, i);
>> -		else
>> -			skb_shinfo(from)->nr_frags = 0;
>> -
>> -		to->truesize += delta;
>> -		atomic_add(delta,&sk->sk_rmem_alloc);
>> -		sk_mem_charge(sk, delta);
>> -		to->len += len;
>> -		to->data_len += len;
>> -		goto merge;
>> -	}
>> -	if (from->head_frag&&  !skb_cloned(from)) {
>> +	if (skb_headlen(from) != 0) {
>>   		struct page *page;
>>   		unsigned int offset;
>>
>> -		if (skb_shinfo(to)->nr_frags + skb_shinfo(from)->nr_frags>= MAX_SKB_FRAGS)
>> +		if (skb_shinfo(to)->nr_frags +
>> +		    skb_shinfo(from)->nr_frags>= MAX_SKB_FRAGS)
>> +			return false;
>> +
>> +		if (!from->head_frag || skb_cloned(from))
>>   			return false;
>> +
>> +		delta = from->truesize - sizeof(struct sk_buff);
>>
It looks like you were thinking of something here since the quote lines 
were broken.  This was the piece I saw in the original code that caught 
my eye as probably being wrong.  We are letting packets using head_frag 
just report length as the truesize.  This is why I favored using the end 
offset.  In the case of us using the head_frag we should only be 
deducting SKB_DATA_ALIGN(sizeof(struct sk_buff)) (I just relized the 
line above is incorrect), and in the case of us dropping the head_frag 
as well we should be able to also deduct the size of the shared_info and 
everything up to the offset of end.

>> +
>>   		page = virt_to_head_page(from->head);
>>   		offset = from->data - (unsigned char *)page_address(page);
>> +
>>   		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
>>   				   page, offset, skb_headlen(from));
>>   		*fragstolen = true;
>> -		delta = len; /* we dont know real truesize... */
>> -		goto copyfrags;
>> +	} else {
>> +		if (skb_shinfo(to)->nr_frags +
>> +		    skb_shinfo(from)->nr_frags>  MAX_SKB_FRAGS)
>> +			return false;
>> +
>> +		delta = from->truesize -
>> +			SKB_TRUESIZE(skb_end_pointer(from) - from->head);
> No... SKB_TRUESIZE() doesnt account of power-of-two roundings in
> kmalloc()
You used ksize in alloc_skb or build_skb to set the end pointer.  All I 
am doing by using the end pointer is getting the value you had already 
extracted.  The end pointer should be unmovable once it is set so I 
wouldn't think it would be an issue to use it to compute the truesize 
for the section including the sk_buff and skb->head.


>>   	}
>> -	return false;
>> +
>> +	memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags,
>> +	       skb_shinfo(from)->frags,
>> +	       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
>> +	skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
>> +
>> +	if (!skb_cloned(from))
>> +		skb_shinfo(from)->nr_frags = 0;
>> +
> You break the code here... we had an else.
Yes and no.  I just realized that I didn't need the else because the for 
loop below does nothing if nr_frags is 0.  I suspect gcc is probably 
smart enough to just skip the loop if the skb is cloned.  I should 
probably have added a one line comment explaining that above the loop.

>> +	for (i = 0; i<  skb_shinfo(from)->nr_frags; i++)
>> +		skb_frag_ref(from, i);
>> +
>> +	to->truesize += delta;
>> +	atomic_add(delta,&sk->sk_rmem_alloc);
>> +	sk_mem_charge(sk, delta);
>> +	to->len += len;
>> +	to->data_len += len;
>> +
>> +merge:
>> +	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE);
>> +	TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq;
>> +	TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq;
>> +	return true;
>>   }
>>
>>   static void kfree_skb_partial(struct sk_buff *skb, bool head_stolen)
>>
> Really this patch is too hard to review.
Yeah, it is a bit difficult.  After Dave pulled your patches it ended up 
pushing most of the changes into this one.  I'll see if I can break this 
back up into smaller bits.  I just needed to flush the patches I had so 
I could get some feedback and stop talking in circles about the other patch.

Thanks,

Alex

^ permalink raw reply

* Re: [PATCH net v2] cdc_ether: Ignore bogus union descriptor for RNDIS devices
From: David Miller @ 2012-05-03  5:11 UTC (permalink / raw)
  To: linux-201011-TzK+PxyQ8t4hFhg+JK9F0w
  Cc: bjorn-yOkvZcmFvRU, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-usb-u79uwXL29TY76Z2rM5mHXA, shaola-r/HQZD9NxM9g9hUCZPvPmw,
	jrnieder-Re5JQEeQqe8AvxtiuMwx3w, oliver-GvhC2dPhHPQdnm+yROfE0A,
	655387-61a8vm9lEZVf4u+23C9RwQ, stable-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <01138af4-69c3-4adb-b7f6-1e5e6f471df2-2ueSQiBKiTY7tOexoI0I+QC/G2K4zDHf@public.gmane.org>

From: Markus Kolb <linux-201011-TzK+PxyQ8t4hFhg+JK9F0w@public.gmane.org>
Date: Thu, 03 May 2012 06:57:39 +0200

> I'll build it during next rainy day and will report its success
> after some usage ;-)

Thank you.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce
From: Eric Dumazet @ 2012-05-03  5:19 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, netdev, davem, Eric Dumazet, Jeff Kirsher
In-Reply-To: <4FA21087.1080801@gmail.com>

On Wed, 2012-05-02 at 21:58 -0700, Alexander Duyck wrote:
> The question I have is how can you get into a case where the ksize is 
> different from the end offset plus the aligned size of skb_shared_info?  
> From what I can tell it looks like the only place we can lie is if we 
> use build_skb with the frag_size option, and in that case we are using a 
> page, not kmalloc memory.  Otherwise in all other cases __alloc_skb or 
> build_skb is using ksize(skb->head) - SKB_DATA_ALIGN(struct 
> skb_shared_info) to set the end pointer, so reversing that should give 
> us the same value as ksize(skb->head).


Right after skb is allocated (build_skb() or other skb_alloc...
variants), truesize is correct by construction.

Then drivers add fragments and can make truesize smaller than reality.

And Intel drivers are known to abuse truesize.

My last patch against iwlwifi is still waiting to make its way into
official tree.

http://www.spinics.net/lists/netdev/msg192629.html

^ permalink raw reply

* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce
From: David Miller @ 2012-05-03  5:25 UTC (permalink / raw)
  To: eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w
  Cc: alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w,
	alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, edumazet-hpIqsD4AKlfQT0dZR+AlfA,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	linville-2XuSBdqkA4R54TAoqtyWWQ,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1336022373.12425.24.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Thu, 03 May 2012 07:19:33 +0200

> My last patch against iwlwifi is still waiting to make its way into
> official tree.
> 
> http://www.spinics.net/lists/netdev/msg192629.html

John, please rectify this situation.

The Intel Wireless folks said they would test it, but that was more
than a month ago.

It's not acceptable to let bug fixes rot for that long, I don't care
what their special internal testing procedure is.

If they give you further pushback, please just ignore them and apply
Eric's fix directly.

Thank you.

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* IPv6 bonding in VM doesn't work (was Re: bonding and IPv6 "doesn't work"?)
From: Amos Kong @ 2012-05-03  5:33 UTC (permalink / raw)
  To: Chris Friesen
  Cc: netdev, bridge, qemu-devel, Tomasz Chmielewski, shemminger,
	David Lamparter

[-- Attachment #1: Type: text/plain, Size: 1844 bytes --]

On Wed, Jul 13, 2011 at 1:15 AM, Chris Friesen <chris.friesen@genband.com>wrote:

> On 07/12/2011 10:25 AM, Tomasz Chmielewski wrote:
>
>> On 12.07.2011 18:14, David Lamparter wrote:
>>
>
>  Your bonding peer is probably looping those
>>> packets back on the other link, most likely because...
>>>
>>>  Bonding Mode: load balancing (round-robin)
>>>>
>>>
>>> ... most likely because you maybe have a switch on the other side, and
>>> that switch expects you to do 802.3ad?
>>>
>>
>> It's a virtual machine, so the host shouldn't know or care much about
>> 802.3ad (I think!).
>>
>

Bonding mode 1 and 5 work when slave nics(guest) connect with same virtual
bridge in host.
other modes don't work.

>
> I suspect that connecting multiple links of a bond to the same unmanaged
> switch (or virtual bridge) is going to confuse things.
>
> Try using multiple virtual bridges instead, one for each slave in the
> bond.  That way they won't interfere with each other.
>

>>  Continue original thread: http://marc.info/?t=131048686200002&r=1&w=2

http://en.wikipedia.org/wiki/IEEE_802.1AX
""" With modes balance-rr, balance-xor, broadcast and 802.3ad all physical
ports in the link aggregation group _must reside on the same_ logical
switch, which in most scenarios will leave a single point of failure when
the physical switch to which both links are connected goes offline. Modes
active-backup, balance-tlb, and balance-alb can also be set up with two or
more switches."""

We could not resolve this issue by connect links to different virtual
bridges?
But it doesn't make sense to connect slave nics with same physical path,
NO stable /performance benefit.

What will happen if we connect two physical NIC directly (back-to-back)?
duplicated address detected?

BTW, openvswitch already support 802.3ad, it might resolve this issue.

Thanks, Amos

[-- Attachment #2: Type: text/html, Size: 3384 bytes --]

^ permalink raw reply

* Re: [v2 PATCH] net: Stop decapitating clones that have a head_frag
From: David Miller @ 2012-05-03  5:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alexander.h.duyck, netdev, edumazet, jeffrey.t.kirsher
In-Reply-To: <1336019638.12425.16.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 03 May 2012 06:33:58 +0200

> On Wed, 2012-05-02 at 21:18 -0700, Alexander Duyck wrote:
>> This change is meant ot prevent stealing the skb->head to use as a page in
>> the event that the skb->head was cloned.  This allows the other clones to
>> track each other via shinfo->dataref.
>> 
>> Without this we break down to two methods for tracking the reference count,
>> one being dataref, the other being the page count.  As a result it becomes
>> difficult to track how many references there are to skb->head.
>> 
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
...
> Acked-by: Eric Dumazet <edumazet@google.com>

Applied, thanks.

These tests of the form:

	head_frag || cloned

and

	head_frag && cloned

will probably crop up in other places.  Therefore we should probably
add a "skb_head_is_locked()" helper function to skbuff.h

Well, there are already two instances, which is candidate enough. :-)

^ permalink raw reply

* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce
From: Alexander Duyck @ 2012-05-03  5:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Alexander Duyck, netdev, davem, Eric Dumazet, Jeff Kirsher
In-Reply-To: <1336022373.12425.24.camel@edumazet-glaptop>

On 5/2/2012 10:19 PM, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 21:58 -0700, Alexander Duyck wrote:
>> The question I have is how can you get into a case where the ksize is
>> different from the end offset plus the aligned size of skb_shared_info?
>>  From what I can tell it looks like the only place we can lie is if we
>> use build_skb with the frag_size option, and in that case we are using a
>> page, not kmalloc memory.  Otherwise in all other cases __alloc_skb or
>> build_skb is using ksize(skb->head) - SKB_DATA_ALIGN(struct
>> skb_shared_info) to set the end pointer, so reversing that should give
>> us the same value as ksize(skb->head).
>
> Right after skb is allocated (build_skb() or other skb_alloc...
> variants), truesize is correct by construction.
>
> Then drivers add fragments and can make truesize smaller than reality.
>
> And Intel drivers are known to abuse truesize.
>
> My last patch against iwlwifi is still waiting to make its way into
> official tree.
>
> http://www.spinics.net/lists/netdev/msg192629.html
I think the part that has me confused is how being more precise about 
removing from truesize gets in the way of detecting abuses of truesize.  
It seems like it should be more as good as or better then the original 
approach of just using skb->len.

Then again we might just be talking in circles again.  I have things 
broken out into 3 patches now that are much more readable.  I will email 
them out in an hour or so once I do some quick tests to verify they are 
building and don't break anything.

Thanks,

Alex

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox