Netdev List


* Re: [PATCH 1/4] ksz884x: delete useless variable
From: David Miller @ 2014-01-15 21:43 UTC (permalink / raw)
  To: Julia.Lawall; +Cc: netdev, kernel-janitors, linux-kernel
In-Reply-To: <1389629847-5330-2-git-send-email-Julia.Lawall@lip6.fr>

From: Julia Lawall <Julia.Lawall@lip6.fr>
Date: Mon, 13 Jan 2014 17:17:24 +0100

> From: Julia Lawall <Julia.Lawall@lip6.fr>
> 
> Delete a variable that is at most only assigned to a constant, but never
> used otherwise.  In this code, it is the variable result that is used for
> the return code, not rc.
> 
> A simplified version of the semantic patch that fixes this problem is as
> follows: (http://coccinelle.lip6.fr/)
 ...
> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>

Applied, thank you.


* [PATCH] net/dt: Add support for overriding phy configuration from device tree
From: Matthew Garrett @ 2014-01-15 21:38 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kishon-l0cyMroinI0,
	Matthew Garrett

Some hardware may be broken in interesting and board-specific ways, such
that various bits of functionality don't work. This patch provides a
mechanism for overriding mii registers during init based on the contents of
the device tree data, allowing board-specific fixups without having to
pollute generic code.

Signed-off-by: Matthew Garrett <matthew.garrett-05XSO3Yj/JvQT0dZR+AlfA@public.gmane.org>
---
 Documentation/devicetree/bindings/net/phy.txt |  13 +++
 drivers/net/phy/phy_device.c                  |  29 +++++-
 drivers/of/of_net.c                           | 124 ++++++++++++++++++++++++++
 include/linux/of_net.h                        |  12 +++
 4 files changed, 177 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/net/phy.txt b/Documentation/devicetree/bindings/net/phy.txt
index 7cd18fb..552a5e0 100644
--- a/Documentation/devicetree/bindings/net/phy.txt
+++ b/Documentation/devicetree/bindings/net/phy.txt
@@ -23,6 +23,19 @@ Optional Properties:
   assume clause 22. The compatible list may also contain other
   elements.
 
+The following properties may be added to either the phy node or the parent
+ethernet device:
+
+- phy-mii-advertise-10half: Whether to advertise half-duplex 10MBit
+- phy-mii-advertise-10full: Whether to advertise full-duplex 10MBit
+- phy-mii-advertise-100half: Whether to advertise half-duplex 100MBit
+- phy-mii-advertise-100full: Whether to advertise full-duplex 100MBit
+- phy-mii-advertise-100base4: Whether to advertise 100base4
+- phy-mii-advertise-1000half: Whether to advertise half-duplex 1000MBit
+- phy-mii-advertise-1000full: Whether to advertise full-duplex 1000MBit
+- phy-mii-as-master: Configure phy to act as master/slave
+- phy-mii-manual-master: Enable/disable manual master/slave configuration
+
 Example:
 
 ethernet-phy@0 {
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index d6447b3..91793bc 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -33,6 +33,7 @@
 #include <linux/mii.h>
 #include <linux/ethtool.h>
 #include <linux/phy.h>
+#include <linux/of_net.h>
 
 #include <asm/io.h>
 #include <asm/irq.h>
@@ -497,6 +498,28 @@ void phy_disconnect(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(phy_disconnect);
 
+int phy_override_from_of(struct phy_device *phydev)
+{
+	int reg, regval;
+	u16 val, mask;
+
+	/* Check for phy register overrides from OF */
+	for (reg = 0; reg < 16; reg++) {
+		if (!of_get_mii_register(phydev, reg, &val, &mask)) {
+			if (!mask)
+				continue;
+			regval = phy_read(phydev, reg);
+			if (regval < 0)
+				continue;
+			regval &= ~mask;
+			regval |= val;
+			phy_write(phydev, reg, regval);
+		}
+	}
+
+	return 0;
+}
+
 int phy_init_hw(struct phy_device *phydev)
 {
 	int ret;
@@ -508,7 +531,11 @@ int phy_init_hw(struct phy_device *phydev)
 	if (ret < 0)
 		return ret;
 
-	return phydev->drv->config_init(phydev);
+	ret = phydev->drv->config_init(phydev);
+	if (ret < 0)
+		return ret;
+
+	return phy_override_from_of(phydev);
 }
 
 /**
diff --git a/drivers/of/of_net.c b/drivers/of/of_net.c
index 8f9be2e..4545608 100644
--- a/drivers/of/of_net.c
+++ b/drivers/of/of_net.c
@@ -93,3 +93,127 @@ const void *of_get_mac_address(struct device_node *np)
 	return NULL;
 }
 EXPORT_SYMBOL(of_get_mac_address);
+
+/**
+ * Provide phy register overrides from the device tree. Some hardware may
+ * be broken in interesting and board-specific ways, so we want a mechanism
+ * for the board data to provide overrides for default values. This should be
+ * called during phy init.
+ */
+int of_get_mii_register(struct phy_device *phydev, int reg, u16 *val,
+			u16 *mask)
+{
+	u32 tmp;
+	struct device *dev = &phydev->dev;
+	struct device_node *np = dev->of_node;
+
+	*val = 0;
+	*mask = 0;
+
+	if (!np && dev->parent && dev->parent->of_node)
+		np = dev->parent->of_node;
+
+	if (!np)
+		return 0;
+
+	switch (reg) {
+	case MII_ADVERTISE:
+		if (!of_property_read_u32(np, "phy-mii-advertise-10half",
+					   &tmp)) {
+			if (tmp) {
+				*val |= ADVERTISE_10HALF;
+				phydev->advertising |= SUPPORTED_10baseT_Half;
+			} else {
+				phydev->advertising &=
+					~(SUPPORTED_10baseT_Half);
+			}
+
+			*mask |= ADVERTISE_10HALF;
+		}
+		if (!of_property_read_u32(np, "phy-mii-advertise-10full",
+					   &tmp)) {
+			if (tmp) {
+				*val |= ADVERTISE_10FULL;
+				phydev->advertising |= SUPPORTED_10baseT_Full;
+			} else {
+				phydev->advertising &=
+					~(SUPPORTED_10baseT_Full);
+			}
+
+			*mask |= ADVERTISE_10FULL;
+		}
+		if (!of_property_read_u32(np, "phy-mii-advertise-100half",
+					   &tmp)) {
+			if (tmp) {
+				*val |= ADVERTISE_100HALF;
+				phydev->advertising |= SUPPORTED_100baseT_Half;
+			} else {
+				phydev->advertising &=
+					~(SUPPORTED_100baseT_Half);
+			}
+
+			*mask |= ADVERTISE_100HALF;
+		}
+		if (!of_property_read_u32(np, "phy-mii-advertise-100full",
+					   &tmp)) {
+			if (tmp) {
+				*val |= ADVERTISE_100FULL;
+				phydev->advertising |= SUPPORTED_100baseT_Full;
+			} else {
+				phydev->advertising &=
+					~(SUPPORTED_100baseT_Full);
+			}
+
+			*mask |= ADVERTISE_100FULL;
+		}
+		if (!of_property_read_u32(np, "phy-mii-advertise-100base4",
+					   &tmp)) {
+			if (tmp)
+				*val |= ADVERTISE_100BASE4;
+			*mask |= ADVERTISE_100BASE4;
+		}
+		break;
+	case MII_CTRL1000:
+		if (!of_property_read_u32(np, "phy-mii-advertise-1000full",
+					   &tmp)) {
+			if (tmp) {
+				*val |= ADVERTISE_1000FULL;
+				phydev->advertising |= SUPPORTED_1000baseT_Full;
+			} else {
+				phydev->advertising &=
+					~(SUPPORTED_1000baseT_Full);
+			}
+
+			*mask |= ADVERTISE_1000FULL;
+		}
+		if (!of_property_read_u32(np, "phy-mii-advertise-1000half",
+					   &tmp)) {
+			if (tmp) {
+				*val |= ADVERTISE_1000HALF;
+				phydev->advertising |= SUPPORTED_1000baseT_Half;
+			} else {
+				phydev->advertising &=
+					~(SUPPORTED_1000baseT_Half);
+			}
+
+			*mask |= ADVERTISE_1000HALF;
+		}
+		if (!of_property_read_u32(np, "phy-mii-as-master",
+					   &tmp)) {
+			if (tmp)
+				*val |= CTL1000_AS_MASTER;
+			*mask |= CTL1000_AS_MASTER;
+		}
+		if (!of_property_read_u32(np, "phy-mii-manual-master",
+					   &tmp)) {
+			if (tmp)
+				*val |= CTL1000_ENABLE_MASTER;
+			*mask |= CTL1000_ENABLE_MASTER;
+		}
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(of_get_mii_register);
diff --git a/include/linux/of_net.h b/include/linux/of_net.h
index 34597c8..2e478bc 100644
--- a/include/linux/of_net.h
+++ b/include/linux/of_net.h
@@ -7,10 +7,14 @@
 #ifndef __LINUX_OF_NET_H
 #define __LINUX_OF_NET_H
 
+#include <linux/phy.h>
+
 #ifdef CONFIG_OF_NET
 #include <linux/of.h>
 extern int of_get_phy_mode(struct device_node *np);
 extern const void *of_get_mac_address(struct device_node *np);
+extern int of_get_mii_register(struct phy_device *np, int reg, u16 *val,
+			       u16 *mask);
 #else
 static inline int of_get_phy_mode(struct device_node *np)
 {
@@ -21,6 +25,14 @@ static inline const void *of_get_mac_address(struct device_node *np)
 {
 	return NULL;
 }
+static inline int of_get_mii_register(struct phy_device *np, int reg, u16 *val,
+				      u16 *mask)
+{
+	*val = 0;
+	*mask = 0;
+
+	return -EINVAL;
+}
 #endif
 
 #endif /* __LINUX_OF_NET_H */
-- 
1.8.4.2
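
The read-modify-write step in phy_override_from_of() can be modelled in user space. This is only a minimal sketch, not code from the patch: apply_override() is a hypothetical helper, and it assumes (as the patch's callers arrange) that val never has bits set outside mask. The two bit values are the ones defined in include/uapi/linux/mii.h.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/mii.h. */
#define ADVERTISE_10HALF  0x0020
#define ADVERTISE_100FULL 0x0100

/*
 * Hypothetical helper mirroring the loop body of phy_override_from_of():
 * every bit set in mask is replaced by the corresponding bit of val,
 * every other bit keeps its current value. Assumes val has no bits set
 * outside mask.
 */
static uint16_t apply_override(uint16_t regval, uint16_t val, uint16_t mask)
{
	regval &= (uint16_t)~mask;	/* clear the bits being overridden */
	regval |= val;			/* set the requested ones */
	return regval;
}
```

For example, a device tree that sets phy-mii-advertise-10half to 0 and phy-mii-advertise-100full to 1 yields val = ADVERTISE_100FULL and mask = ADVERTISE_10HALF | ADVERTISE_100FULL, so a register that read 0x01e1 would be written back as 0x01c1.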

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH v3 1/4] net_dma: simple removal
From: Dan Williams @ 2014-01-15 21:33 UTC (permalink / raw)
  To: saeed bishara
  Cc: dmaengine@vger.kernel.org, Alexander Duyck, Dave Jiang,
	Vinod Koul, netdev@vger.kernel.org, David Whipple, lkml,
	David S. Miller
In-Reply-To: <CAPcyv4hzNT15R41zSOM98f-0aQ60HBZoD_DAvf6VED7iXoCZ8w@mail.gmail.com>

On Wed, Jan 15, 2014 at 1:31 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, Jan 15, 2014 at 1:20 PM, saeed bishara <saeed.bishara@gmail.com> wrote:
>> Hi Dan,
>>
>> I'm using net_dma on my system and I achieve a meaningful performance
>> boost when running Iperf receive.
>>
>> As far as I know net_dma is used by many embedded systems out
>> there, and its removal might affect their performance.
>> Can you please elaborate on the exact scenario that causes the memory corruption?
>>
>> Is the scenario mentioned here caused by a "real life" application, or
>> is this more of a theoretical issue found through manual testing? I was
>> trying to find the thread describing the failing scenario and couldn't
>> find it; any pointer will be appreciated.
>
> Did you see the referenced commit?
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=77873803363c
>
> This is a real issue in that any app that calls fork() while receiving
> data can cause the dma data to be lost.  The problem is that the copy
> operation falls back to cpu at many locations.  Any one of those
> instances could touch a mapped page and trigger a copy-on-write event.
> The dma then completes to the wrong location.
>

Btw, do you have benchmark data showing that NET_DMA is beneficial on
these platforms?  I would have expected worse performance on platforms
without i/o coherent caches.


* Re: [PATCH v3 1/4] net_dma: simple removal
From: Dan Williams @ 2014-01-15 21:31 UTC (permalink / raw)
  To: saeed bishara
  Cc: dmaengine@vger.kernel.org, Alexander Duyck, Dave Jiang,
	Vinod Koul, netdev@vger.kernel.org, David Whipple, lkml,
	David S. Miller
In-Reply-To: <CAMAG_eduH4M2OPVh-R4Q6KG1DDcinEDzC-fQyXj1mLdZG=49hw@mail.gmail.com>

On Wed, Jan 15, 2014 at 1:20 PM, saeed bishara <saeed.bishara@gmail.com> wrote:
> Hi Dan,
>
> I'm using net_dma on my system and I achieve a meaningful performance
> boost when running Iperf receive.
>
> As far as I know net_dma is used by many embedded systems out
> there, and its removal might affect their performance.
> Can you please elaborate on the exact scenario that causes the memory corruption?
>
> Is the scenario mentioned here caused by a "real life" application, or
> is this more of a theoretical issue found through manual testing? I was
> trying to find the thread describing the failing scenario and couldn't
> find it; any pointer will be appreciated.

Did you see the referenced commit?

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=77873803363c

This is a real issue in that any app that calls fork() while receiving
data can cause the dma data to be lost.  The problem is that the copy
operation falls back to cpu at many locations.  Any one of those
instances could touch a mapped page and trigger a copy-on-write event.
The dma then completes to the wrong location.

--
Dan
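
The copy-on-write behaviour behind this failure can be observed from user space. The sketch below is only an illustration of COW after fork(): it involves no DMA engine, and child_write_is_private() is a hypothetical name. It shows that once a process writes to a page shared across fork(), the same virtual address is backed by different physical pages in parent and child, which is why a transfer programmed against the old page would land in the wrong place.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns 1 if a write made by the child after fork()
 * is invisible to the parent (the page was copied on write), 0 if the
 * parent sees the write, and -1 on error.
 */
static int child_write_is_private(void)
{
	char buf[16] = "before";
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* The write fault makes the kernel give the child its own copy. */
		strcpy(buf, "after");
		_exit(0);
	}

	if (waitpid(pid, &status, 0) < 0)
		return -1;

	/* The parent still sees the original contents at the same address. */
	return strcmp(buf, "before") == 0;
}
```

Calling child_write_is_private() from a test program is expected to return 1 on a system with the usual fork() semantics.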


* Re: TI CPSW Ethernet Tx performance regression
From: Florian Fainelli @ 2014-01-15 21:21 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Mugunthan V N, netdev
In-Reply-To: <1389808467.11912.9.camel@bwh-desktop.uk.level5networks.com>

2014/1/15 Ben Hutchings <bhutchings@solarflare.com>:
> On Wed, 2014-01-15 at 18:18 +0530, Mugunthan V N wrote:
>> Hi
>>
>> I am seeing a performance regression with the CPSW driver on the AM335x EVM.
>> The AM335x EVM CPSW has 3.2 kernel support [1] and mainline support from 3.7.
>> Comparing 3.2 with 3.13-rc4, TCP receive performance of CPSW is the same
>> (~180Mbps), but TCP transmit performance is poor compared to the 3.2 kernel:
>> in 3.2 it is *256Mbps* and in 3.13-rc4 it is *70Mbps*.
>>
>> Iperf version is *iperf version 2.0.5 (08 Jul 2010) pthreads* on both PC and EVM.
>>
>> UDP transmit performance is also down compared to the 3.2 kernel: in 3.2 it is
>> 196Mbps for 200Mbps bandwidth and in 3.13-rc4 it is 92Mbps.
>>
>> Can someone point out where I can look to improve Tx performance? I also
>> checked whether there is a Tx descriptor overflow and there is none. I have
>> tried 3.11 and some older kernels; all give only ~75Mbps transmit
>> performance.
>>
>> [1] - http://arago-project.org/git/projects/?p=linux-am33x.git;a=summary
>
> If you don't get any specific suggestions, you could try bisecting to
> find out which specific commit(s) changed the performance.

Not necessarily related to that issue, but there are a few
weird/unusual things done in the CPSW interrupt handler:

static irqreturn_t cpsw_interrupt(int irq, void *dev_id)
{
        struct cpsw_priv *priv = dev_id;

        cpsw_intr_disable(priv);
        if (priv->irq_enabled == true) {
                cpsw_disable_irq(priv);
                priv->irq_enabled = false;
        }

        if (netif_running(priv->ndev)) {
                napi_schedule(&priv->napi);
                return IRQ_HANDLED;
        }

Checking for netif_running() should not be required; you should not
get any TX/RX interrupts if your interface is not running.


        priv = cpsw_get_slave_priv(priv, 1);
        if (!priv)
                return IRQ_NONE;

Shouldn't this be moved up to be the very first check? Isn't there a
risk of leaving the interrupts disabled and never re-enabling them
because of the first 5 lines at the top?


        if (netif_running(priv->ndev)) {
                napi_schedule(&priv->napi);
                return IRQ_HANDLED;
        }

This was already done above; why do it again?

cpdma_chan_process() in drivers/net/ethernet/ti/davinci_cpdma.c treats
an error while processing a packet (it stops there) the same as
successfully processing num_tx packets; is that also intentional?
Should you attempt to keep processing "quota" packets?

As Ben suggests, bisecting what is causing the regression is your best bet here.
-- 
Florian
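
The ordering concern raised above can be modelled in user space. This is a hedged sketch of one reading of the question, not the driver's code: struct model_priv and both handlers are hypothetical stand-ins, the quoted excerpt stops before the end of the function so the model assumes it finishes with IRQ_NONE, and it assumes that only a scheduled NAPI poll would re-enable the interrupt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum irq_result { IRQ_NONE, IRQ_HANDLED };

/* Hypothetical stand-in for struct cpsw_priv; only what the model needs. */
struct model_priv {
	bool running;			/* models netif_running(priv->ndev) */
	bool irq_enabled;		/* models priv->irq_enabled */
	bool napi_scheduled;		/* set where napi_schedule() would run */
	struct model_priv *other;	/* models cpsw_get_slave_priv(); may be NULL */
};

/* Ordering as quoted: disable first, then look for a running interface. */
static enum irq_result handler_as_quoted(struct model_priv *priv)
{
	priv->irq_enabled = false;

	if (priv->running) {
		priv->napi_scheduled = true;
		return IRQ_HANDLED;
	}

	priv = priv->other;
	if (!priv)
		return IRQ_NONE;	/* IRQ stays disabled, no poll scheduled */

	if (priv->running) {
		priv->napi_scheduled = true;
		return IRQ_HANDLED;
	}
	return IRQ_NONE;		/* assumed ending of the function */
}

/* Alternative ordering: pick the interface first, disable only if one runs. */
static enum irq_result handler_reordered(struct model_priv *priv)
{
	struct model_priv *target = NULL;

	if (priv->running)
		target = priv;
	else if (priv->other && priv->other->running)
		target = priv->other;

	if (!target)
		return IRQ_NONE;	/* nothing was disabled */

	priv->irq_enabled = false;
	target->napi_scheduled = true;
	return IRQ_HANDLED;
}
```

With no running interface and no second slave, handler_as_quoted() returns IRQ_NONE with irq_enabled already cleared, while handler_reordered() returns IRQ_NONE and leaves it set.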


* Re: [PATCH v3 1/4] net_dma: simple removal
From: saeed bishara @ 2014-01-15 21:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: dmaengine, Alexander Duyck, Dave Jiang, Vinod Koul,
	netdev@vger.kernel.org, David Whipple, lkml, David S. Miller
In-Reply-To: <20140114004622.27138.54103.stgit@viggo.jf.intel.com>

Hi Dan,

I'm using net_dma on my system and I achieve a meaningful performance
boost when running Iperf receive.

As far as I know net_dma is used by many embedded systems out
there, and its removal might affect their performance.
Can you please elaborate on the exact scenario that causes the memory corruption?

Is the scenario mentioned here caused by a "real life" application, or
is this more of a theoretical issue found through manual testing? I was
trying to find the thread describing the failing scenario and couldn't
find it; any pointer will be appreciated.

Thanks

On Tue, Jan 14, 2014 at 2:46 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> Per commit "77873803363c net_dma: mark broken" net_dma is no longer used
> and there is no plan to fix it.
>
> This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards.
> Reverting the remainder of the net_dma induced changes is deferred to
> subsequent patches.
>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Vinod Koul <vinod.koul@intel.com>
> Cc: David Whipple <whipple@securedatainnovations.ch>
> Cc: Alexander Duyck <alexander.h.duyck@intel.com>
> Acked-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>
> No changes since v2
>
>
>  Documentation/ABI/removed/net_dma      |    8 +
>  Documentation/networking/ip-sysctl.txt |    6 -
>  drivers/dma/Kconfig                    |   12 -
>  drivers/dma/Makefile                   |    1
>  drivers/dma/dmaengine.c                |  104 ------------
>  drivers/dma/ioat/dma.c                 |    1
>  drivers/dma/ioat/dma.h                 |    7 -
>  drivers/dma/ioat/dma_v2.c              |    1
>  drivers/dma/ioat/dma_v3.c              |    1
>  drivers/dma/iovlock.c                  |  280 --------------------------------
>  include/linux/dmaengine.h              |   22 ---
>  include/linux/skbuff.h                 |    8 -
>  include/linux/tcp.h                    |    8 -
>  include/net/netdma.h                   |   32 ----
>  include/net/sock.h                     |   19 --
>  include/net/tcp.h                      |    8 -
>  kernel/sysctl_binary.c                 |    1
>  net/core/Makefile                      |    1
>  net/core/dev.c                         |   10 -
>  net/core/sock.c                        |    6 -
>  net/core/user_dma.c                    |  131 ---------------
>  net/dccp/proto.c                       |    4
>  net/ipv4/sysctl_net_ipv4.c             |    9 -
>  net/ipv4/tcp.c                         |  147 ++---------------
>  net/ipv4/tcp_input.c                   |   61 -------
>  net/ipv4/tcp_ipv4.c                    |   18 --
>  net/ipv6/tcp_ipv6.c                    |   13 -
>  net/llc/af_llc.c                       |   10 +
>  28 files changed, 35 insertions(+), 894 deletions(-)
>  create mode 100644 Documentation/ABI/removed/net_dma
>  delete mode 100644 drivers/dma/iovlock.c
>  delete mode 100644 include/net/netdma.h
>  delete mode 100644 net/core/user_dma.c
>
> diff --git a/Documentation/ABI/removed/net_dma b/Documentation/ABI/removed/net_dma
> new file mode 100644
> index 000000000000..a173aecc2f18
> --- /dev/null
> +++ b/Documentation/ABI/removed/net_dma
> @@ -0,0 +1,8 @@
> +What:          tcp_dma_copybreak sysctl
> +Date:          Removed in kernel v3.13
> +Contact:       Dan Williams <dan.j.williams@intel.com>
> +Description:
> +       Formerly the lower limit, in bytes, of the size of socket reads
> +       that will be offloaded to a DMA copy engine.  Removed due to
> +       coherency issues of the cpu potentially touching the buffers
> +       while dma is in flight.
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 3c12d9a7ed00..bdd8a67f0be2 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -538,12 +538,6 @@ tcp_workaround_signed_windows - BOOLEAN
>         not receive a window scaling option from them.
>         Default: 0
>
> -tcp_dma_copybreak - INTEGER
> -       Lower limit, in bytes, of the size of socket reads that will be
> -       offloaded to a DMA copy engine, if one is present in the system
> -       and CONFIG_NET_DMA is enabled.
> -       Default: 4096
> -
>  tcp_thin_linear_timeouts - BOOLEAN
>         Enable dynamic triggering of linear timeouts for thin streams.
>         If set, a check is performed upon retransmission by timeout to
> diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
> index c823daaf9043..b24f13195272 100644
> --- a/drivers/dma/Kconfig
> +++ b/drivers/dma/Kconfig
> @@ -351,18 +351,6 @@ config DMA_OF
>  comment "DMA Clients"
>         depends on DMA_ENGINE
>
> -config NET_DMA
> -       bool "Network: TCP receive copy offload"
> -       depends on DMA_ENGINE && NET
> -       default (INTEL_IOATDMA || FSL_DMA)
> -       depends on BROKEN
> -       help
> -         This enables the use of DMA engines in the network stack to
> -         offload receive copy-to-user operations, freeing CPU cycles.
> -
> -         Say Y here if you enabled INTEL_IOATDMA or FSL_DMA, otherwise
> -         say N.
> -
>  config ASYNC_TX_DMA
>         bool "Async_tx: Offload support for the async_tx api"
>         depends on DMA_ENGINE
> diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
> index 0ce2da97e429..024b008a25de 100644
> --- a/drivers/dma/Makefile
> +++ b/drivers/dma/Makefile
> @@ -6,7 +6,6 @@ obj-$(CONFIG_DMA_VIRTUAL_CHANNELS) += virt-dma.o
>  obj-$(CONFIG_DMA_ACPI) += acpi-dma.o
>  obj-$(CONFIG_DMA_OF) += of-dma.o
>
> -obj-$(CONFIG_NET_DMA) += iovlock.o
>  obj-$(CONFIG_INTEL_MID_DMAC) += intel_mid_dma.o
>  obj-$(CONFIG_DMATEST) += dmatest.o
>  obj-$(CONFIG_INTEL_IOATDMA) += ioat/
> diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
> index ef63b9058f3c..d7f4f4e0d71f 100644
> --- a/drivers/dma/dmaengine.c
> +++ b/drivers/dma/dmaengine.c
> @@ -1029,110 +1029,6 @@ dmaengine_get_unmap_data(struct device *dev, int nr, gfp_t flags)
>  }
>  EXPORT_SYMBOL(dmaengine_get_unmap_data);
>
> -/**
> - * dma_async_memcpy_pg_to_pg - offloaded copy from page to page
> - * @chan: DMA channel to offload copy to
> - * @dest_pg: destination page
> - * @dest_off: offset in page to copy to
> - * @src_pg: source page
> - * @src_off: offset in page to copy from
> - * @len: length
> - *
> - * Both @dest_page/@dest_off and @src_page/@src_off must be mappable to a bus
> - * address according to the DMA mapping API rules for streaming mappings.
> - * Both @dest_page/@dest_off and @src_page/@src_off must stay memory resident
> - * (kernel memory or locked user space pages).
> - */
> -dma_cookie_t
> -dma_async_memcpy_pg_to_pg(struct dma_chan *chan, struct page *dest_pg,
> -       unsigned int dest_off, struct page *src_pg, unsigned int src_off,
> -       size_t len)
> -{
> -       struct dma_device *dev = chan->device;
> -       struct dma_async_tx_descriptor *tx;
> -       struct dmaengine_unmap_data *unmap;
> -       dma_cookie_t cookie;
> -       unsigned long flags;
> -
> -       unmap = dmaengine_get_unmap_data(dev->dev, 2, GFP_NOWAIT);
> -       if (!unmap)
> -               return -ENOMEM;
> -
> -       unmap->to_cnt = 1;
> -       unmap->from_cnt = 1;
> -       unmap->addr[0] = dma_map_page(dev->dev, src_pg, src_off, len,
> -                                     DMA_TO_DEVICE);
> -       unmap->addr[1] = dma_map_page(dev->dev, dest_pg, dest_off, len,
> -                                     DMA_FROM_DEVICE);
> -       unmap->len = len;
> -       flags = DMA_CTRL_ACK;
> -       tx = dev->device_prep_dma_memcpy(chan, unmap->addr[1], unmap->addr[0],
> -                                        len, flags);
> -
> -       if (!tx) {
> -               dmaengine_unmap_put(unmap);
> -               return -ENOMEM;
> -       }
> -
> -       dma_set_unmap(tx, unmap);
> -       cookie = tx->tx_submit(tx);
> -       dmaengine_unmap_put(unmap);
> -
> -       preempt_disable();
> -       __this_cpu_add(chan->local->bytes_transferred, len);
> -       __this_cpu_inc(chan->local->memcpy_count);
> -       preempt_enable();
> -
> -       return cookie;
> -}
> -EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg);
> -
> -/**
> - * dma_async_memcpy_buf_to_buf - offloaded copy between virtual addresses
> - * @chan: DMA channel to offload copy to
> - * @dest: destination address (virtual)
> - * @src: source address (virtual)
> - * @len: length
> - *
> - * Both @dest and @src must be mappable to a bus address according to the
> - * DMA mapping API rules for streaming mappings.
> - * Both @dest and @src must stay memory resident (kernel memory or locked
> - * user space pages).
> - */
> -dma_cookie_t
> -dma_async_memcpy_buf_to_buf(struct dma_chan *chan, void *dest,
> -                           void *src, size_t len)
> -{
> -       return dma_async_memcpy_pg_to_pg(chan, virt_to_page(dest),
> -                                        (unsigned long) dest & ~PAGE_MASK,
> -                                        virt_to_page(src),
> -                                        (unsigned long) src & ~PAGE_MASK, len);
> -}
> -EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf);
> -
> -/**
> - * dma_async_memcpy_buf_to_pg - offloaded copy from address to page
> - * @chan: DMA channel to offload copy to
> - * @page: destination page
> - * @offset: offset in page to copy to
> - * @kdata: source address (virtual)
> - * @len: length
> - *
> - * Both @page/@offset and @kdata must be mappable to a bus address according
> - * to the DMA mapping API rules for streaming mappings.
> - * Both @page/@offset and @kdata must stay memory resident (kernel memory or
> - * locked user space pages)
> - */
> -dma_cookie_t
> -dma_async_memcpy_buf_to_pg(struct dma_chan *chan, struct page *page,
> -                          unsigned int offset, void *kdata, size_t len)
> -{
> -       return dma_async_memcpy_pg_to_pg(chan, page, offset,
> -                                        virt_to_page(kdata),
> -                                        (unsigned long) kdata & ~PAGE_MASK, len);
> -}
> -EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg);
> -
>  void dma_async_tx_descriptor_init(struct dma_async_tx_descriptor *tx,
>         struct dma_chan *chan)
>  {
> diff --git a/drivers/dma/ioat/dma.c b/drivers/dma/ioat/dma.c
> index 1a49c777607c..97fa394ca855 100644
> --- a/drivers/dma/ioat/dma.c
> +++ b/drivers/dma/ioat/dma.c
> @@ -1175,7 +1175,6 @@ int ioat1_dma_probe(struct ioatdma_device *device, int dca)
>         err = ioat_probe(device);
>         if (err)
>                 return err;
> -       ioat_set_tcp_copy_break(4096);
>         err = ioat_register(device);
>         if (err)
>                 return err;
> diff --git a/drivers/dma/ioat/dma.h b/drivers/dma/ioat/dma.h
> index 11fb877ddca9..664ec9cbd651 100644
> --- a/drivers/dma/ioat/dma.h
> +++ b/drivers/dma/ioat/dma.h
> @@ -214,13 +214,6 @@ __dump_desc_dbg(struct ioat_chan_common *chan, struct ioat_dma_descriptor *hw,
>  #define dump_desc_dbg(c, d) \
>         ({ if (d) __dump_desc_dbg(&c->base, d->hw, &d->txd, desc_id(d)); 0; })
>
> -static inline void ioat_set_tcp_copy_break(unsigned long copybreak)
> -{
> -       #ifdef CONFIG_NET_DMA
> -       sysctl_tcp_dma_copybreak = copybreak;
> -       #endif
> -}
> -
>  static inline struct ioat_chan_common *
>  ioat_chan_by_index(struct ioatdma_device *device, int index)
>  {
> diff --git a/drivers/dma/ioat/dma_v2.c b/drivers/dma/ioat/dma_v2.c
> index 5d3affe7e976..31e8098e444f 100644
> --- a/drivers/dma/ioat/dma_v2.c
> +++ b/drivers/dma/ioat/dma_v2.c
> @@ -900,7 +900,6 @@ int ioat2_dma_probe(struct ioatdma_device *device, int dca)
>         err = ioat_probe(device);
>         if (err)
>                 return err;
> -       ioat_set_tcp_copy_break(2048);
>
>         list_for_each_entry(c, &dma->channels, device_node) {
>                 chan = to_chan_common(c);
> diff --git a/drivers/dma/ioat/dma_v3.c b/drivers/dma/ioat/dma_v3.c
> index 820817e97e62..4bb81346bee2 100644
> --- a/drivers/dma/ioat/dma_v3.c
> +++ b/drivers/dma/ioat/dma_v3.c
> @@ -1652,7 +1652,6 @@ int ioat3_dma_probe(struct ioatdma_device *device, int dca)
>         err = ioat_probe(device);
>         if (err)
>                 return err;
> -       ioat_set_tcp_copy_break(262144);
>
>         list_for_each_entry(c, &dma->channels, device_node) {
>                 chan = to_chan_common(c);
> diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
> deleted file mode 100644
> index bb48a57c2fc1..000000000000
> --- a/drivers/dma/iovlock.c
> +++ /dev/null
> @@ -1,280 +0,0 @@
> -/*
> - * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
> - * Portions based on net/core/datagram.c and copyrighted by their authors.
> - *
> - * This program is free software; you can redistribute it and/or modify it
> - * under the terms of the GNU General Public License as published by the Free
> - * Software Foundation; either version 2 of the License, or (at your option)
> - * any later version.
> - *
> - * This program is distributed in the hope that it will be useful, but WITHOUT
> - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> - * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> - * more details.
> - *
> - * You should have received a copy of the GNU General Public License along with
> - * this program; if not, write to the Free Software Foundation, Inc., 59
> - * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
> - *
> - * The full GNU General Public License is included in this distribution in the
> - * file called COPYING.
> - */
> -
> -/*
> - * This code allows the net stack to make use of a DMA engine for
> - * skb to iovec copies.
> - */
> -
> -#include <linux/dmaengine.h>
> -#include <linux/pagemap.h>
> -#include <linux/slab.h>
> -#include <net/tcp.h> /* for memcpy_toiovec */
> -#include <asm/io.h>
> -#include <asm/uaccess.h>
> -
> -static int num_pages_spanned(struct iovec *iov)
> -{
> -       return
> -       ((PAGE_ALIGN((unsigned long)iov->iov_base + iov->iov_len) -
> -       ((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT);
> -}
> -
> -/*
> - * Pin down all the iovec pages needed for len bytes.
> - * Return a struct dma_pinned_list to keep track of pages pinned down.
> - *
> - * We are allocating a single chunk of memory, and then carving it up into
> - * 3 sections, the latter 2 whose size depends on the number of iovecs and the
> - * total number of pages, respectively.
> - */
> -struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
> -{
> -       struct dma_pinned_list *local_list;
> -       struct page **pages;
> -       int i;
> -       int ret;
> -       int nr_iovecs = 0;
> -       int iovec_len_used = 0;
> -       int iovec_pages_used = 0;
> -
> -       /* don't pin down non-user-based iovecs */
> -       if (segment_eq(get_fs(), KERNEL_DS))
> -               return NULL;
> -
> -       /* determine how many iovecs/pages there are, up front */
> -       do {
> -               iovec_len_used += iov[nr_iovecs].iov_len;
> -               iovec_pages_used += num_pages_spanned(&iov[nr_iovecs]);
> -               nr_iovecs++;
> -       } while (iovec_len_used < len);
> -
> -       /* single kmalloc for pinned list, page_list[], and the page arrays */
> -       local_list = kmalloc(sizeof(*local_list)
> -               + (nr_iovecs * sizeof (struct dma_page_list))
> -               + (iovec_pages_used * sizeof (struct page*)), GFP_KERNEL);
> -       if (!local_list)
> -               goto out;
> -
> -       /* list of pages starts right after the page list array */
> -       pages = (struct page **) &local_list->page_list[nr_iovecs];
> -
> -       local_list->nr_iovecs = 0;
> -
> -       for (i = 0; i < nr_iovecs; i++) {
> -               struct dma_page_list *page_list = &local_list->page_list[i];
> -
> -               len -= iov[i].iov_len;
> -
> -               if (!access_ok(VERIFY_WRITE, iov[i].iov_base, iov[i].iov_len))
> -                       goto unpin;
> -
> -               page_list->nr_pages = num_pages_spanned(&iov[i]);
> -               page_list->base_address = iov[i].iov_base;
> -
> -               page_list->pages = pages;
> -               pages += page_list->nr_pages;
> -
> -               /* pin pages down */
> -               down_read(&current->mm->mmap_sem);
> -               ret = get_user_pages(
> -                       current,
> -                       current->mm,
> -                       (unsigned long) iov[i].iov_base,
> -                       page_list->nr_pages,
> -                       1,      /* write */
> -                       0,      /* force */
> -                       page_list->pages,
> -                       NULL);
> -               up_read(&current->mm->mmap_sem);
> -
> -               if (ret != page_list->nr_pages)
> -                       goto unpin;
> -
> -               local_list->nr_iovecs = i + 1;
> -       }
> -
> -       return local_list;
> -
> -unpin:
> -       dma_unpin_iovec_pages(local_list);
> -out:
> -       return NULL;
> -}
> -
> -void dma_unpin_iovec_pages(struct dma_pinned_list *pinned_list)
> -{
> -       int i, j;
> -
> -       if (!pinned_list)
> -               return;
> -
> -       for (i = 0; i < pinned_list->nr_iovecs; i++) {
> -               struct dma_page_list *page_list = &pinned_list->page_list[i];
> -               for (j = 0; j < page_list->nr_pages; j++) {
> -                       set_page_dirty_lock(page_list->pages[j]);
> -                       page_cache_release(page_list->pages[j]);
> -               }
> -       }
> -
> -       kfree(pinned_list);
> -}
> -
> -
> -/*
> - * We have already pinned down the pages we will be using in the iovecs.
> - * Each entry in iov array has corresponding entry in pinned_list->page_list.
> - * Using array indexing to keep iov[] and page_list[] in sync.
> - * Initial elements in iov array's iov->iov_len will be 0 if already copied into
> - *   by another call.
> - * iov array length remaining guaranteed to be bigger than len.
> - */
> -dma_cookie_t dma_memcpy_to_iovec(struct dma_chan *chan, struct iovec *iov,
> -       struct dma_pinned_list *pinned_list, unsigned char *kdata, size_t len)
> -{
> -       int iov_byte_offset;
> -       int copy;
> -       dma_cookie_t dma_cookie = 0;
> -       int iovec_idx;
> -       int page_idx;
> -
> -       if (!chan)
> -               return memcpy_toiovec(iov, kdata, len);
> -
> -       iovec_idx = 0;
> -       while (iovec_idx < pinned_list->nr_iovecs) {
> -               struct dma_page_list *page_list;
> -
> -               /* skip already used-up iovecs */
> -               while (!iov[iovec_idx].iov_len)
> -                       iovec_idx++;
> -
> -               page_list = &pinned_list->page_list[iovec_idx];
> -
> -               iov_byte_offset = ((unsigned long)iov[iovec_idx].iov_base & ~PAGE_MASK);
> -               page_idx = (((unsigned long)iov[iovec_idx].iov_base & PAGE_MASK)
> -                        - ((unsigned long)page_list->base_address & PAGE_MASK)) >> PAGE_SHIFT;
> -
> -               /* break up copies to not cross page boundary */
> -               while (iov[iovec_idx].iov_len) {
> -                       copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
> -                       copy = min_t(int, copy, iov[iovec_idx].iov_len);
> -
> -                       dma_cookie = dma_async_memcpy_buf_to_pg(chan,
> -                                       page_list->pages[page_idx],
> -                                       iov_byte_offset,
> -                                       kdata,
> -                                       copy);
> -                       /* poll for a descriptor slot */
> -                       if (unlikely(dma_cookie < 0)) {
> -                               dma_async_issue_pending(chan);
> -                               continue;
> -                       }
> -
> -                       len -= copy;
> -                       iov[iovec_idx].iov_len -= copy;
> -                       iov[iovec_idx].iov_base += copy;
> -
> -                       if (!len)
> -                               return dma_cookie;
> -
> -                       kdata += copy;
> -                       iov_byte_offset = 0;
> -                       page_idx++;
> -               }
> -               iovec_idx++;
> -       }
> -
> -       /* really bad if we ever run out of iovecs */
> -       BUG();
> -       return -EFAULT;
> -}
> -
> -dma_cookie_t dma_memcpy_pg_to_iovec(struct dma_chan *chan, struct iovec *iov,
> -       struct dma_pinned_list *pinned_list, struct page *page,
> -       unsigned int offset, size_t len)
> -{
> -       int iov_byte_offset;
> -       int copy;
> -       dma_cookie_t dma_cookie = 0;
> -       int iovec_idx;
> -       int page_idx;
> -       int err;
> -
> -       /* this needs as-yet-unimplemented buf-to-buff, so punt. */
> -       /* TODO: use dma for this */
> -       if (!chan || !pinned_list) {
> -               u8 *vaddr = kmap(page);
> -               err = memcpy_toiovec(iov, vaddr + offset, len);
> -               kunmap(page);
> -               return err;
> -       }
> -
> -       iovec_idx = 0;
> -       while (iovec_idx < pinned_list->nr_iovecs) {
> -               struct dma_page_list *page_list;
> -
> -               /* skip already used-up iovecs */
> -               while (!iov[iovec_idx].iov_len)
> -                       iovec_idx++;
> -
> -               page_list = &pinned_list->page_list[iovec_idx];
> -
> -               iov_byte_offset = ((unsigned long)iov[iovec_idx].iov_base & ~PAGE_MASK);
> -               page_idx = (((unsigned long)iov[iovec_idx].iov_base & PAGE_MASK)
> -                        - ((unsigned long)page_list->base_address & PAGE_MASK)) >> PAGE_SHIFT;
> -
> -               /* break up copies to not cross page boundary */
> -               while (iov[iovec_idx].iov_len) {
> -                       copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
> -                       copy = min_t(int, copy, iov[iovec_idx].iov_len);
> -
> -                       dma_cookie = dma_async_memcpy_pg_to_pg(chan,
> -                                       page_list->pages[page_idx],
> -                                       iov_byte_offset,
> -                                       page,
> -                                       offset,
> -                                       copy);
> -                       /* poll for a descriptor slot */
> -                       if (unlikely(dma_cookie < 0)) {
> -                               dma_async_issue_pending(chan);
> -                               continue;
> -                       }
> -
> -                       len -= copy;
> -                       iov[iovec_idx].iov_len -= copy;
> -                       iov[iovec_idx].iov_base += copy;
> -
> -                       if (!len)
> -                               return dma_cookie;
> -
> -                       offset += copy;
> -                       iov_byte_offset = 0;
> -                       page_idx++;
> -               }
> -               iovec_idx++;
> -       }
> -
> -       /* really bad if we ever run out of iovecs */
> -       BUG();
> -       return -EFAULT;
> -}
> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
> index 41cf0c399288..890545871af0 100644
> --- a/include/linux/dmaengine.h
> +++ b/include/linux/dmaengine.h
> @@ -875,18 +875,6 @@ static inline void dmaengine_put(void)
>  }
>  #endif
>
> -#ifdef CONFIG_NET_DMA
> -#define net_dmaengine_get()    dmaengine_get()
> -#define net_dmaengine_put()    dmaengine_put()
> -#else
> -static inline void net_dmaengine_get(void)
> -{
> -}
> -static inline void net_dmaengine_put(void)
> -{
> -}
> -#endif
> -
>  #ifdef CONFIG_ASYNC_TX_DMA
>  #define async_dmaengine_get()  dmaengine_get()
>  #define async_dmaengine_put()  dmaengine_put()
> @@ -908,16 +896,8 @@ async_dma_find_channel(enum dma_transaction_type type)
>         return NULL;
>  }
>  #endif /* CONFIG_ASYNC_TX_DMA */
> -
> -dma_cookie_t dma_async_memcpy_buf_to_buf(struct dma_chan *chan,
> -       void *dest, void *src, size_t len);
> -dma_cookie_t dma_async_memcpy_buf_to_pg(struct dma_chan *chan,
> -       struct page *page, unsigned int offset, void *kdata, size_t len);
> -dma_cookie_t dma_async_memcpy_pg_to_pg(struct dma_chan *chan,
> -       struct page *dest_pg, unsigned int dest_off, struct page *src_pg,
> -       unsigned int src_off, size_t len);
>  void dma_async_tx_descriptor_init(struct dma_async_tx_descriptor *tx,
> -       struct dma_chan *chan);
> +                                 struct dma_chan *chan);
>
>  static inline void async_tx_ack(struct dma_async_tx_descriptor *tx)
>  {
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index bec1cc7d5e3c..ac4f84dfa84b 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -28,7 +28,6 @@
>  #include <linux/textsearch.h>
>  #include <net/checksum.h>
>  #include <linux/rcupdate.h>
> -#include <linux/dmaengine.h>
>  #include <linux/hrtimer.h>
>  #include <linux/dma-mapping.h>
>  #include <linux/netdev_features.h>
> @@ -496,11 +495,8 @@ struct sk_buff {
>         /* 6/8 bit hole (depending on ndisc_nodetype presence) */
>         kmemcheck_bitfield_end(flags2);
>
> -#if defined CONFIG_NET_DMA || defined CONFIG_NET_RX_BUSY_POLL
> -       union {
> -               unsigned int    napi_id;
> -               dma_cookie_t    dma_cookie;
> -       };
> +#ifdef CONFIG_NET_RX_BUSY_POLL
> +       unsigned int    napi_id;
>  #endif
>  #ifdef CONFIG_NETWORK_SECMARK
>         __u32                   secmark;
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index d68633452d9b..26f16021ce1d 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -19,7 +19,6 @@
>
>
>  #include <linux/skbuff.h>
> -#include <linux/dmaengine.h>
>  #include <net/sock.h>
>  #include <net/inet_connection_sock.h>
>  #include <net/inet_timewait_sock.h>
> @@ -169,13 +168,6 @@ struct tcp_sock {
>                 struct iovec            *iov;
>                 int                     memory;
>                 int                     len;
> -#ifdef CONFIG_NET_DMA
> -               /* members for async copy */
> -               struct dma_chan         *dma_chan;
> -               int                     wakeup;
> -               struct dma_pinned_list  *pinned_list;
> -               dma_cookie_t            dma_cookie;
> -#endif
>         } ucopy;
>
>         u32     snd_wl1;        /* Sequence for window update           */
> diff --git a/include/net/netdma.h b/include/net/netdma.h
> deleted file mode 100644
> index 8ba8ce284eeb..000000000000
> --- a/include/net/netdma.h
> +++ /dev/null
> @@ -1,32 +0,0 @@
> -/*
> - * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
> - *
> - * This program is free software; you can redistribute it and/or modify it
> - * under the terms of the GNU General Public License as published by the Free
> - * Software Foundation; either version 2 of the License, or (at your option)
> - * any later version.
> - *
> - * This program is distributed in the hope that it will be useful, but WITHOUT
> - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> - * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> - * more details.
> - *
> - * You should have received a copy of the GNU General Public License along with
> - * this program; if not, write to the Free Software Foundation, Inc., 59
> - * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
> - *
> - * The full GNU General Public License is included in this distribution in the
> - * file called COPYING.
> - */
> -#ifndef NETDMA_H
> -#define NETDMA_H
> -#ifdef CONFIG_NET_DMA
> -#include <linux/dmaengine.h>
> -#include <linux/skbuff.h>
> -
> -int dma_skb_copy_datagram_iovec(struct dma_chan* chan,
> -               struct sk_buff *skb, int offset, struct iovec *to,
> -               size_t len, struct dma_pinned_list *pinned_list);
> -
> -#endif /* CONFIG_NET_DMA */
> -#endif /* NETDMA_H */
> diff --git a/include/net/sock.h b/include/net/sock.h
> index e3a18ff0c38b..9d5f716e921e 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -231,7 +231,6 @@ struct cg_proto;
>    *    @sk_receive_queue: incoming packets
>    *    @sk_wmem_alloc: transmit queue bytes committed
>    *    @sk_write_queue: Packet sending queue
> -  *    @sk_async_wait_queue: DMA copied packets
>    *    @sk_omem_alloc: "o" is "option" or "other"
>    *    @sk_wmem_queued: persistent queue size
>    *    @sk_forward_alloc: space allocated forward
> @@ -354,10 +353,6 @@ struct sock {
>         struct sk_filter __rcu  *sk_filter;
>         struct socket_wq __rcu  *sk_wq;
>
> -#ifdef CONFIG_NET_DMA
> -       struct sk_buff_head     sk_async_wait_queue;
> -#endif
> -
>  #ifdef CONFIG_XFRM
>         struct xfrm_policy      *sk_policy[2];
>  #endif
> @@ -2200,27 +2195,15 @@ void sock_tx_timestamp(struct sock *sk, __u8 *tx_flags);
>   * sk_eat_skb - Release a skb if it is no longer needed
>   * @sk: socket to eat this skb from
>   * @skb: socket buffer to eat
> - * @copied_early: flag indicating whether DMA operations copied this data early
>   *
>   * This routine must be called with interrupts disabled or with the socket
>   * locked so that the sk_buff queue operation is ok.
>  */
> -#ifdef CONFIG_NET_DMA
> -static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb, bool copied_early)
> -{
> -       __skb_unlink(skb, &sk->sk_receive_queue);
> -       if (!copied_early)
> -               __kfree_skb(skb);
> -       else
> -               __skb_queue_tail(&sk->sk_async_wait_queue, skb);
> -}
> -#else
> -static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb, bool copied_early)
> +static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
>  {
>         __skb_unlink(skb, &sk->sk_receive_queue);
>         __kfree_skb(skb);
>  }
> -#endif
>
>  static inline
>  struct net *sock_net(const struct sock *sk)
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 70e55d200610..084c163e9d40 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -27,7 +27,6 @@
>  #include <linux/cache.h>
>  #include <linux/percpu.h>
>  #include <linux/skbuff.h>
> -#include <linux/dmaengine.h>
>  #include <linux/crypto.h>
>  #include <linux/cryptohash.h>
>  #include <linux/kref.h>
> @@ -267,7 +266,6 @@ extern int sysctl_tcp_adv_win_scale;
>  extern int sysctl_tcp_tw_reuse;
>  extern int sysctl_tcp_frto;
>  extern int sysctl_tcp_low_latency;
> -extern int sysctl_tcp_dma_copybreak;
>  extern int sysctl_tcp_nometrics_save;
>  extern int sysctl_tcp_moderate_rcvbuf;
>  extern int sysctl_tcp_tso_win_divisor;
> @@ -1032,12 +1030,6 @@ static inline void tcp_prequeue_init(struct tcp_sock *tp)
>         tp->ucopy.len = 0;
>         tp->ucopy.memory = 0;
>         skb_queue_head_init(&tp->ucopy.prequeue);
> -#ifdef CONFIG_NET_DMA
> -       tp->ucopy.dma_chan = NULL;
> -       tp->ucopy.wakeup = 0;
> -       tp->ucopy.pinned_list = NULL;
> -       tp->ucopy.dma_cookie = 0;
> -#endif
>  }
>
>  bool tcp_prequeue(struct sock *sk, struct sk_buff *skb);
> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> index 653cbbd9e7ad..d457005acedf 100644
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -390,7 +390,6 @@ static const struct bin_table bin_net_ipv4_table[] = {
>         { CTL_INT,      NET_TCP_MTU_PROBING,                    "tcp_mtu_probing" },
>         { CTL_INT,      NET_TCP_BASE_MSS,                       "tcp_base_mss" },
>         { CTL_INT,      NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS, "tcp_workaround_signed_windows" },
> -       { CTL_INT,      NET_TCP_DMA_COPYBREAK,                  "tcp_dma_copybreak" },
>         { CTL_INT,      NET_TCP_SLOW_START_AFTER_IDLE,          "tcp_slow_start_after_idle" },
>         { CTL_INT,      NET_CIPSOV4_CACHE_ENABLE,               "cipso_cache_enable" },
>         { CTL_INT,      NET_CIPSOV4_CACHE_BUCKET_SIZE,          "cipso_cache_bucket_size" },
> diff --git a/net/core/Makefile b/net/core/Makefile
> index b33b996f5dd6..5f98e5983bd3 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -16,7 +16,6 @@ obj-y += net-sysfs.o
>  obj-$(CONFIG_PROC_FS) += net-procfs.o
>  obj-$(CONFIG_NET_PKTGEN) += pktgen.o
>  obj-$(CONFIG_NETPOLL) += netpoll.o
> -obj-$(CONFIG_NET_DMA) += user_dma.o
>  obj-$(CONFIG_FIB_RULES) += fib_rules.o
>  obj-$(CONFIG_TRACEPOINTS) += net-traces.o
>  obj-$(CONFIG_NET_DROP_MONITOR) += drop_monitor.o
> diff --git a/net/core/dev.c b/net/core/dev.c
> index ba3b7ea5ebb3..677a5a4dcca7 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1262,7 +1262,6 @@ static int __dev_open(struct net_device *dev)
>                 clear_bit(__LINK_STATE_START, &dev->state);
>         else {
>                 dev->flags |= IFF_UP;
> -               net_dmaengine_get();
>                 dev_set_rx_mode(dev);
>                 dev_activate(dev);
>                 add_device_randomness(dev->dev_addr, dev->addr_len);
> @@ -1338,7 +1337,6 @@ static int __dev_close_many(struct list_head *head)
>                         ops->ndo_stop(dev);
>
>                 dev->flags &= ~IFF_UP;
> -               net_dmaengine_put();
>         }
>
>         return 0;
> @@ -4362,14 +4360,6 @@ static void net_rx_action(struct softirq_action *h)
>  out:
>         net_rps_action_and_irq_enable(sd);
>
> -#ifdef CONFIG_NET_DMA
> -       /*
> -        * There may not be any more sk_buffs coming right now, so push
> -        * any pending DMA copies to hardware
> -        */
> -       dma_issue_pending_all();
> -#endif
> -
>         return;
>
>  softnet_break:
> diff --git a/net/core/sock.c b/net/core/sock.c
> index ab20ed9b0f31..411dab3a5726 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1461,9 +1461,6 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
>                 atomic_set(&newsk->sk_omem_alloc, 0);
>                 skb_queue_head_init(&newsk->sk_receive_queue);
>                 skb_queue_head_init(&newsk->sk_write_queue);
> -#ifdef CONFIG_NET_DMA
> -               skb_queue_head_init(&newsk->sk_async_wait_queue);
> -#endif
>
>                 spin_lock_init(&newsk->sk_dst_lock);
>                 rwlock_init(&newsk->sk_callback_lock);
> @@ -2290,9 +2287,6 @@ void sock_init_data(struct socket *sock, struct sock *sk)
>         skb_queue_head_init(&sk->sk_receive_queue);
>         skb_queue_head_init(&sk->sk_write_queue);
>         skb_queue_head_init(&sk->sk_error_queue);
> -#ifdef CONFIG_NET_DMA
> -       skb_queue_head_init(&sk->sk_async_wait_queue);
> -#endif
>
>         sk->sk_send_head        =       NULL;
>
> diff --git a/net/core/user_dma.c b/net/core/user_dma.c
> deleted file mode 100644
> index 1b5fefdb8198..000000000000
> --- a/net/core/user_dma.c
> +++ /dev/null
> @@ -1,131 +0,0 @@
> -/*
> - * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
> - * Portions based on net/core/datagram.c and copyrighted by their authors.
> - *
> - * This program is free software; you can redistribute it and/or modify it
> - * under the terms of the GNU General Public License as published by the Free
> - * Software Foundation; either version 2 of the License, or (at your option)
> - * any later version.
> - *
> - * This program is distributed in the hope that it will be useful, but WITHOUT
> - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> - * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> - * more details.
> - *
> - * You should have received a copy of the GNU General Public License along with
> - * this program; if not, write to the Free Software Foundation, Inc., 59
> - * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
> - *
> - * The full GNU General Public License is included in this distribution in the
> - * file called COPYING.
> - */
> -
> -/*
> - * This code allows the net stack to make use of a DMA engine for
> - * skb to iovec copies.
> - */
> -
> -#include <linux/dmaengine.h>
> -#include <linux/socket.h>
> -#include <linux/export.h>
> -#include <net/tcp.h>
> -#include <net/netdma.h>
> -
> -#define NET_DMA_DEFAULT_COPYBREAK 4096
> -
> -int sysctl_tcp_dma_copybreak = NET_DMA_DEFAULT_COPYBREAK;
> -EXPORT_SYMBOL(sysctl_tcp_dma_copybreak);
> -
> -/**
> - *     dma_skb_copy_datagram_iovec - Copy a datagram to an iovec.
> - *     @skb - buffer to copy
> - *     @offset - offset in the buffer to start copying from
> - *     @iovec - io vector to copy to
> - *     @len - amount of data to copy from buffer to iovec
> - *     @pinned_list - locked iovec buffer data
> - *
> - *     Note: the iovec is modified during the copy.
> - */
> -int dma_skb_copy_datagram_iovec(struct dma_chan *chan,
> -                       struct sk_buff *skb, int offset, struct iovec *to,
> -                       size_t len, struct dma_pinned_list *pinned_list)
> -{
> -       int start = skb_headlen(skb);
> -       int i, copy = start - offset;
> -       struct sk_buff *frag_iter;
> -       dma_cookie_t cookie = 0;
> -
> -       /* Copy header. */
> -       if (copy > 0) {
> -               if (copy > len)
> -                       copy = len;
> -               cookie = dma_memcpy_to_iovec(chan, to, pinned_list,
> -                                           skb->data + offset, copy);
> -               if (cookie < 0)
> -                       goto fault;
> -               len -= copy;
> -               if (len == 0)
> -                       goto end;
> -               offset += copy;
> -       }
> -
> -       /* Copy paged appendix. Hmm... why does this look so complicated? */
> -       for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> -               int end;
> -               const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> -
> -               WARN_ON(start > offset + len);
> -
> -               end = start + skb_frag_size(frag);
> -               copy = end - offset;
> -               if (copy > 0) {
> -                       struct page *page = skb_frag_page(frag);
> -
> -                       if (copy > len)
> -                               copy = len;
> -
> -                       cookie = dma_memcpy_pg_to_iovec(chan, to, pinned_list, page,
> -                                       frag->page_offset + offset - start, copy);
> -                       if (cookie < 0)
> -                               goto fault;
> -                       len -= copy;
> -                       if (len == 0)
> -                               goto end;
> -                       offset += copy;
> -               }
> -               start = end;
> -       }
> -
> -       skb_walk_frags(skb, frag_iter) {
> -               int end;
> -
> -               WARN_ON(start > offset + len);
> -
> -               end = start + frag_iter->len;
> -               copy = end - offset;
> -               if (copy > 0) {
> -                       if (copy > len)
> -                               copy = len;
> -                       cookie = dma_skb_copy_datagram_iovec(chan, frag_iter,
> -                                                            offset - start,
> -                                                            to, copy,
> -                                                            pinned_list);
> -                       if (cookie < 0)
> -                               goto fault;
> -                       len -= copy;
> -                       if (len == 0)
> -                               goto end;
> -                       offset += copy;
> -               }
> -               start = end;
> -       }
> -
> -end:
> -       if (!len) {
> -               skb->dma_cookie = cookie;
> -               return cookie;
> -       }
> -
> -fault:
> -       return -EFAULT;
> -}
> diff --git a/net/dccp/proto.c b/net/dccp/proto.c
> index eb892b4f4814..f9076f295b13 100644
> --- a/net/dccp/proto.c
> +++ b/net/dccp/proto.c
> @@ -848,7 +848,7 @@ int dccp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
>                 default:
>                         dccp_pr_debug("packet_type=%s\n",
>                                       dccp_packet_name(dh->dccph_type));
> -                       sk_eat_skb(sk, skb, false);
> +                       sk_eat_skb(sk, skb);
>                 }
>  verify_sock_status:
>                 if (sock_flag(sk, SOCK_DONE)) {
> @@ -905,7 +905,7 @@ verify_sock_status:
>                         len = skb->len;
>         found_fin_ok:
>                 if (!(flags & MSG_PEEK))
> -                       sk_eat_skb(sk, skb, false);
> +                       sk_eat_skb(sk, skb);
>                 break;
>         } while (1);
>  out:
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 3d69ec8dac57..79a90b92e12d 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -642,15 +642,6 @@ static struct ctl_table ipv4_table[] = {
>                 .mode           = 0644,
>                 .proc_handler   = proc_dointvec
>         },
> -#ifdef CONFIG_NET_DMA
> -       {
> -               .procname       = "tcp_dma_copybreak",
> -               .data           = &sysctl_tcp_dma_copybreak,
> -               .maxlen         = sizeof(int),
> -               .mode           = 0644,
> -               .proc_handler   = proc_dointvec
> -       },
> -#endif
>         {
>                 .procname       = "tcp_slow_start_after_idle",
>                 .data           = &sysctl_tcp_slow_start_after_idle,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index c4638e6f0238..8dc913dfbaef 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -274,7 +274,6 @@
>  #include <net/tcp.h>
>  #include <net/xfrm.h>
>  #include <net/ip.h>
> -#include <net/netdma.h>
>  #include <net/sock.h>
>
>  #include <asm/uaccess.h>
> @@ -1409,39 +1408,6 @@ static void tcp_prequeue_process(struct sock *sk)
>         tp->ucopy.memory = 0;
>  }
>
> -#ifdef CONFIG_NET_DMA
> -static void tcp_service_net_dma(struct sock *sk, bool wait)
> -{
> -       dma_cookie_t done, used;
> -       dma_cookie_t last_issued;
> -       struct tcp_sock *tp = tcp_sk(sk);
> -
> -       if (!tp->ucopy.dma_chan)
> -               return;
> -
> -       last_issued = tp->ucopy.dma_cookie;
> -       dma_async_issue_pending(tp->ucopy.dma_chan);
> -
> -       do {
> -               if (dma_async_is_tx_complete(tp->ucopy.dma_chan,
> -                                             last_issued, &done,
> -                                             &used) == DMA_COMPLETE) {
> -                       /* Safe to free early-copied skbs now */
> -                       __skb_queue_purge(&sk->sk_async_wait_queue);
> -                       break;
> -               } else {
> -                       struct sk_buff *skb;
> -                       while ((skb = skb_peek(&sk->sk_async_wait_queue)) &&
> -                              (dma_async_is_complete(skb->dma_cookie, done,
> -                                                     used) == DMA_COMPLETE)) {
> -                               __skb_dequeue(&sk->sk_async_wait_queue);
> -                               kfree_skb(skb);
> -                       }
> -               }
> -       } while (wait);
> -}
> -#endif
> -
>  static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
>  {
>         struct sk_buff *skb;
> @@ -1459,7 +1425,7 @@ static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
>                  * splitted a fat GRO packet, while we released socket lock
>                  * in skb_splice_bits()
>                  */
> -               sk_eat_skb(sk, skb, false);
> +               sk_eat_skb(sk, skb);
>         }
>         return NULL;
>  }
> @@ -1525,11 +1491,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
>                                 continue;
>                 }
>                 if (tcp_hdr(skb)->fin) {
> -                       sk_eat_skb(sk, skb, false);
> +                       sk_eat_skb(sk, skb);
>                         ++seq;
>                         break;
>                 }
> -               sk_eat_skb(sk, skb, false);
> +               sk_eat_skb(sk, skb);
>                 if (!desc->count)
>                         break;
>                 tp->copied_seq = seq;
> @@ -1567,7 +1533,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
>         int target;             /* Read at least this many bytes */
>         long timeo;
>         struct task_struct *user_recv = NULL;
> -       bool copied_early = false;
>         struct sk_buff *skb;
>         u32 urg_hole = 0;
>
> @@ -1610,28 +1575,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
>
>         target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
>
> -#ifdef CONFIG_NET_DMA
> -       tp->ucopy.dma_chan = NULL;
> -       preempt_disable();
> -       skb = skb_peek_tail(&sk->sk_receive_queue);
> -       {
> -               int available = 0;
> -
> -               if (skb)
> -                       available = TCP_SKB_CB(skb)->seq + skb->len - (*seq);
> -               if ((available < target) &&
> -                   (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
> -                   !sysctl_tcp_low_latency &&
> -                   net_dma_find_channel()) {
> -                       preempt_enable_no_resched();
> -                       tp->ucopy.pinned_list =
> -                                       dma_pin_iovec_pages(msg->msg_iov, len);
> -               } else {
> -                       preempt_enable_no_resched();
> -               }
> -       }
> -#endif
> -
>         do {
>                 u32 offset;
>
> @@ -1762,16 +1705,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
>                         /* __ Set realtime policy in scheduler __ */
>                 }
>
> -#ifdef CONFIG_NET_DMA
> -               if (tp->ucopy.dma_chan) {
> -                       if (tp->rcv_wnd == 0 &&
> -                           !skb_queue_empty(&sk->sk_async_wait_queue)) {
> -                               tcp_service_net_dma(sk, true);
> -                               tcp_cleanup_rbuf(sk, copied);
> -                       } else
> -                               dma_async_issue_pending(tp->ucopy.dma_chan);
> -               }
> -#endif
>                 if (copied >= target) {
>                         /* Do not sleep, just process backlog. */
>                         release_sock(sk);
> @@ -1779,11 +1712,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
>                 } else
>                         sk_wait_data(sk, &timeo);
>
> -#ifdef CONFIG_NET_DMA
> -               tcp_service_net_dma(sk, false);  /* Don't block */
> -               tp->ucopy.wakeup = 0;
> -#endif
> -
>                 if (user_recv) {
>                         int chunk;
>
> @@ -1841,43 +1769,13 @@ do_prequeue:
>                 }
>
>                 if (!(flags & MSG_TRUNC)) {
> -#ifdef CONFIG_NET_DMA
> -                       if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
> -                               tp->ucopy.dma_chan = net_dma_find_channel();
> -
> -                       if (tp->ucopy.dma_chan) {
> -                               tp->ucopy.dma_cookie = dma_skb_copy_datagram_iovec(
> -                                       tp->ucopy.dma_chan, skb, offset,
> -                                       msg->msg_iov, used,
> -                                       tp->ucopy.pinned_list);
> -
> -                               if (tp->ucopy.dma_cookie < 0) {
> -
> -                                       pr_alert("%s: dma_cookie < 0\n",
> -                                                __func__);
> -
> -                                       /* Exception. Bailout! */
> -                                       if (!copied)
> -                                               copied = -EFAULT;
> -                                       break;
> -                               }
> -
> -                               dma_async_issue_pending(tp->ucopy.dma_chan);
> -
> -                               if ((offset + used) == skb->len)
> -                                       copied_early = true;
> -
> -                       } else
> -#endif
> -                       {
> -                               err = skb_copy_datagram_iovec(skb, offset,
> -                                               msg->msg_iov, used);
> -                               if (err) {
> -                                       /* Exception. Bailout! */
> -                                       if (!copied)
> -                                               copied = -EFAULT;
> -                                       break;
> -                               }
> +                       err = skb_copy_datagram_iovec(skb, offset,
> +                                                     msg->msg_iov, used);
> +                       if (err) {
> +                               /* Exception. Bailout! */
> +                               if (!copied)
> +                                       copied = -EFAULT;
> +                               break;
>                         }
>                 }
>
> @@ -1897,19 +1795,15 @@ skip_copy:
>
>                 if (tcp_hdr(skb)->fin)
>                         goto found_fin_ok;
> -               if (!(flags & MSG_PEEK)) {
> -                       sk_eat_skb(sk, skb, copied_early);
> -                       copied_early = false;
> -               }
> +               if (!(flags & MSG_PEEK))
> +                       sk_eat_skb(sk, skb);
>                 continue;
>
>         found_fin_ok:
>                 /* Process the FIN. */
>                 ++*seq;
> -               if (!(flags & MSG_PEEK)) {
> -                       sk_eat_skb(sk, skb, copied_early);
> -                       copied_early = false;
> -               }
> +               if (!(flags & MSG_PEEK))
> +                       sk_eat_skb(sk, skb);
>                 break;
>         } while (len > 0);
>
> @@ -1932,16 +1826,6 @@ skip_copy:
>                 tp->ucopy.len = 0;
>         }
>
> -#ifdef CONFIG_NET_DMA
> -       tcp_service_net_dma(sk, true);  /* Wait for queue to drain */
> -       tp->ucopy.dma_chan = NULL;
> -
> -       if (tp->ucopy.pinned_list) {
> -               dma_unpin_iovec_pages(tp->ucopy.pinned_list);
> -               tp->ucopy.pinned_list = NULL;
> -       }
> -#endif
> -
>         /* According to UNIX98, msg_name/msg_namelen are ignored
>          * on connected socket. I was just happy when found this 8) --ANK
>          */
> @@ -2285,9 +2169,6 @@ int tcp_disconnect(struct sock *sk, int flags)
>         __skb_queue_purge(&sk->sk_receive_queue);
>         tcp_write_queue_purge(sk);
>         __skb_queue_purge(&tp->out_of_order_queue);
> -#ifdef CONFIG_NET_DMA
> -       __skb_queue_purge(&sk->sk_async_wait_queue);
> -#endif
>
>         inet->inet_dport = 0;
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index c53b7f35c51d..33ef18e550c5 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -73,7 +73,6 @@
>  #include <net/inet_common.h>
>  #include <linux/ipsec.h>
>  #include <asm/unaligned.h>
> -#include <net/netdma.h>
>
>  int sysctl_tcp_timestamps __read_mostly = 1;
>  int sysctl_tcp_window_scaling __read_mostly = 1;
> @@ -4967,53 +4966,6 @@ static inline bool tcp_checksum_complete_user(struct sock *sk,
>                __tcp_checksum_complete_user(sk, skb);
>  }
>
> -#ifdef CONFIG_NET_DMA
> -static bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
> -                                 int hlen)
> -{
> -       struct tcp_sock *tp = tcp_sk(sk);
> -       int chunk = skb->len - hlen;
> -       int dma_cookie;
> -       bool copied_early = false;
> -
> -       if (tp->ucopy.wakeup)
> -               return false;
> -
> -       if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
> -               tp->ucopy.dma_chan = net_dma_find_channel();
> -
> -       if (tp->ucopy.dma_chan && skb_csum_unnecessary(skb)) {
> -
> -               dma_cookie = dma_skb_copy_datagram_iovec(tp->ucopy.dma_chan,
> -                                                        skb, hlen,
> -                                                        tp->ucopy.iov, chunk,
> -                                                        tp->ucopy.pinned_list);
> -
> -               if (dma_cookie < 0)
> -                       goto out;
> -
> -               tp->ucopy.dma_cookie = dma_cookie;
> -               copied_early = true;
> -
> -               tp->ucopy.len -= chunk;
> -               tp->copied_seq += chunk;
> -               tcp_rcv_space_adjust(sk);
> -
> -               if ((tp->ucopy.len == 0) ||
> -                   (tcp_flag_word(tcp_hdr(skb)) & TCP_FLAG_PSH) ||
> -                   (atomic_read(&sk->sk_rmem_alloc) > (sk->sk_rcvbuf >> 1))) {
> -                       tp->ucopy.wakeup = 1;
> -                       sk->sk_data_ready(sk, 0);
> -               }
> -       } else if (chunk > 0) {
> -               tp->ucopy.wakeup = 1;
> -               sk->sk_data_ready(sk, 0);
> -       }
> -out:
> -       return copied_early;
> -}
> -#endif /* CONFIG_NET_DMA */
> -
>  /* Does PAWS and seqno based validation of an incoming segment, flags will
>   * play significant role here.
>   */
> @@ -5198,14 +5150,6 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
>
>                         if (tp->copied_seq == tp->rcv_nxt &&
>                             len - tcp_header_len <= tp->ucopy.len) {
> -#ifdef CONFIG_NET_DMA
> -                               if (tp->ucopy.task == current &&
> -                                   sock_owned_by_user(sk) &&
> -                                   tcp_dma_try_early_copy(sk, skb, tcp_header_len)) {
> -                                       copied_early = 1;
> -                                       eaten = 1;
> -                               }
> -#endif
>                                 if (tp->ucopy.task == current &&
>                                     sock_owned_by_user(sk) && !copied_early) {
>                                         __set_current_state(TASK_RUNNING);
> @@ -5271,11 +5215,6 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
>                         if (!copied_early || tp->rcv_nxt != tp->rcv_wup)
>                                 __tcp_ack_snd_check(sk, 0);
>  no_ack:
> -#ifdef CONFIG_NET_DMA
> -                       if (copied_early)
> -                               __skb_queue_tail(&sk->sk_async_wait_queue, skb);
> -                       else
> -#endif
>                         if (eaten)
>                                 kfree_skb_partial(skb, fragstolen);
>                         sk->sk_data_ready(sk, 0);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 59a6f8b90cd9..dc92ba9d0350 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -72,7 +72,6 @@
>  #include <net/inet_common.h>
>  #include <net/timewait_sock.h>
>  #include <net/xfrm.h>
> -#include <net/netdma.h>
>  #include <net/secure_seq.h>
>  #include <net/tcp_memcontrol.h>
>  #include <net/busy_poll.h>
> @@ -2000,18 +1999,8 @@ process:
>         bh_lock_sock_nested(sk);
>         ret = 0;
>         if (!sock_owned_by_user(sk)) {
> -#ifdef CONFIG_NET_DMA
> -               struct tcp_sock *tp = tcp_sk(sk);
> -               if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
> -                       tp->ucopy.dma_chan = net_dma_find_channel();
> -               if (tp->ucopy.dma_chan)
> +               if (!tcp_prequeue(sk, skb))
>                         ret = tcp_v4_do_rcv(sk, skb);
> -               else
> -#endif
> -               {
> -                       if (!tcp_prequeue(sk, skb))
> -                               ret = tcp_v4_do_rcv(sk, skb);
> -               }
>         } else if (unlikely(sk_add_backlog(sk, skb,
>                                            sk->sk_rcvbuf + sk->sk_sndbuf))) {
>                 bh_unlock_sock(sk);
> @@ -2170,11 +2159,6 @@ void tcp_v4_destroy_sock(struct sock *sk)
>         }
>  #endif
>
> -#ifdef CONFIG_NET_DMA
> -       /* Cleans up our sk_async_wait_queue */
> -       __skb_queue_purge(&sk->sk_async_wait_queue);
> -#endif
> -
>         /* Clean prequeue, it must be empty really */
>         __skb_queue_purge(&tp->ucopy.prequeue);
>
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 0740f93a114a..e27972590379 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -59,7 +59,6 @@
>  #include <net/snmp.h>
>  #include <net/dsfield.h>
>  #include <net/timewait_sock.h>
> -#include <net/netdma.h>
>  #include <net/inet_common.h>
>  #include <net/secure_seq.h>
>  #include <net/tcp_memcontrol.h>
> @@ -1504,18 +1503,8 @@ process:
>         bh_lock_sock_nested(sk);
>         ret = 0;
>         if (!sock_owned_by_user(sk)) {
> -#ifdef CONFIG_NET_DMA
> -               struct tcp_sock *tp = tcp_sk(sk);
> -               if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
> -                       tp->ucopy.dma_chan = net_dma_find_channel();
> -               if (tp->ucopy.dma_chan)
> +               if (!tcp_prequeue(sk, skb))
>                         ret = tcp_v6_do_rcv(sk, skb);
> -               else
> -#endif
> -               {
> -                       if (!tcp_prequeue(sk, skb))
> -                               ret = tcp_v6_do_rcv(sk, skb);
> -               }
>         } else if (unlikely(sk_add_backlog(sk, skb,
>                                            sk->sk_rcvbuf + sk->sk_sndbuf))) {
>                 bh_unlock_sock(sk);
> diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
> index 7b01b9f5846c..e1b46709f8d6 100644
> --- a/net/llc/af_llc.c
> +++ b/net/llc/af_llc.c
> @@ -838,7 +838,7 @@ static int llc_ui_recvmsg(struct kiocb *iocb, struct socket *sock,
>
>                 if (!(flags & MSG_PEEK)) {
>                         spin_lock_irqsave(&sk->sk_receive_queue.lock, cpu_flags);
> -                       sk_eat_skb(sk, skb, false);
> +                       sk_eat_skb(sk, skb);
>                         spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
>                         *seq = 0;
>                 }
> @@ -860,10 +860,10 @@ copy_uaddr:
>                 llc_cmsg_rcv(msg, skb);
>
>         if (!(flags & MSG_PEEK)) {
> -                       spin_lock_irqsave(&sk->sk_receive_queue.lock, cpu_flags);
> -                       sk_eat_skb(sk, skb, false);
> -                       spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
> -                       *seq = 0;
> +               spin_lock_irqsave(&sk->sk_receive_queue.lock, cpu_flags);
> +               sk_eat_skb(sk, skb);
> +               spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
> +               *seq = 0;
>         }
>
>         goto out;
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH net-next 1/2] r6040: add delays in MDIO read/write polling loops
From: Florian Fainelli @ 2014-01-15 21:04 UTC (permalink / raw)
  To: netdev; +Cc: davem, Florian Fainelli
In-Reply-To: <1389819866-32142-1-git-send-email-florian@openwrt.org>

On newer and faster machines (Vortex X86DX) using the r6040 driver, the
driver was seen returning an error during probing. This was traced down
to the MDIO bus probing failing to complete an MDIO read operation in
time. On these faster machines, MDIO operations usually complete after
~2140 iterations, which exceeds the 2048-iteration limit
(MAC_DEF_TIMEOUT) and results in spurious timeouts depending on the
system load.

Update r6040_phy_read() and r6040_phy_write() to include a 1 microsecond
delay in each iteration of the busy-wait loop, which is much safer than
increasing MAC_DEF_TIMEOUT.
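
To illustrate why a per-iteration delay helps where a bare iteration
count does not, here is a rough user-space model, not driver code. The
100 ns cost per register read and the 214 us completion time are
assumptions chosen only to match the ~2140 iterations quoted above;
poll_until_ready() is a hypothetical helper.

```c
#define POLL_LIMIT 2048	/* mirrors MAC_DEF_TIMEOUT from the description */

/*
 * Model of a bounded polling loop using virtual time.
 * ready_after_ns: when the (simulated) MDIO operation completes.
 * poll_cost_ns:   assumed cost of one register read on this CPU.
 * delay_ns:       extra delay per iteration (1000 models udelay(1)).
 * Returns 0 on success, -1 if the iteration budget runs out first.
 */
static int poll_until_ready(unsigned long long ready_after_ns,
			    unsigned long long poll_cost_ns,
			    unsigned long long delay_ns)
{
	unsigned long long now_ns = 0;
	int i;

	for (i = 0; i < POLL_LIMIT; i++) {
		now_ns += poll_cost_ns;		/* one register read */
		if (now_ns >= ready_after_ns)
			return 0;		/* busy bit has cleared */
		now_ns += delay_ns;		/* wait before polling again */
	}
	return -1;				/* budget exhausted */
}
```

With these assumed numbers, poll_until_ready(214000, 100, 0) runs out of
iterations, while poll_until_ready(214000, 100, 1000) succeeds well
within the budget, because the timeout now depends on elapsed time
rather than on how fast the CPU can spin.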

Reported-by: Nils Koehler <nils.koehler@ibt-interfaces.de>
Reported-by: Daniel Goertzen <daniel.goertzen@gmail.com>
Signed-off-by: Florian Fainelli <florian@openwrt.org>
---
 drivers/net/ethernet/rdc/r6040.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/rdc/r6040.c b/drivers/net/ethernet/rdc/r6040.c
index 1e49ec5..ff4683a 100644
--- a/drivers/net/ethernet/rdc/r6040.c
+++ b/drivers/net/ethernet/rdc/r6040.c
@@ -222,6 +222,7 @@ static int r6040_phy_read(void __iomem *ioaddr, int phy_addr, int reg)
 		cmd = ioread16(ioaddr + MMDIO);
 		if (!(cmd & MDIO_READ))
 			break;
+		udelay(1);
 	}

 	if (limit < 0)
@@ -245,6 +246,7 @@ static int r6040_phy_write(void __iomem *ioaddr,
 		cmd = ioread16(ioaddr + MMDIO);
 		if (!(cmd & MDIO_WRITE))
 			break;
+		udelay(1);
 	}

 	return (limit < 0) ? -ETIMEDOUT : 0;
-- 
1.8.3.2

^ permalink raw reply related

* [PATCH net-next 2/2] r6040: use ETH_ZLEN instead of MISR for SKB length checking
From: Florian Fainelli @ 2014-01-15 21:04 UTC (permalink / raw)
  To: netdev; +Cc: davem, Florian Fainelli
In-Reply-To: <1389819866-32142-1-git-send-email-florian@openwrt.org>

Ever since this driver was merged the following code was included:

if (skb->len < MISR)
	descptr->len = MISR;

MISR is defined as 0x3C, which happens to equal ETH_ZLEN, but use
ETH_ZLEN directly since that is exactly what we want to check against.
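
As a stand-alone sketch of the intended check (user-space C, not the
driver itself): tx_desc_len() is a hypothetical helper, and ETH_ZLEN is
redefined locally with the value it has in <linux/if_ether.h>.

```c
#define ETH_ZLEN 60	/* minimum frame length without FCS; 60 == 0x3C */

/* Pad the descriptor length up to the Ethernet minimum. */
static unsigned int tx_desc_len(unsigned int skb_len)
{
	return skb_len < ETH_ZLEN ? ETH_ZLEN : skb_len;
}
```

The named constant documents the intent (minimum frame size), whereas
MISR only matched by numeric coincidence.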

Reported-by: Marc Volovic <marcv@ezchip.com>
Signed-off-by: Florian Fainelli <florian@openwrt.org>
---
 drivers/net/ethernet/rdc/r6040.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/rdc/r6040.c b/drivers/net/ethernet/rdc/r6040.c
index ff4683a..eb15ebf 100644
--- a/drivers/net/ethernet/rdc/r6040.c
+++ b/drivers/net/ethernet/rdc/r6040.c
@@ -836,8 +836,8 @@ static netdev_tx_t r6040_start_xmit(struct sk_buff *skb,
 	/* Set TX descriptor & Transmit it */
 	lp->tx_free_desc--;
 	descptr = lp->tx_insert_ptr;
-	if (skb->len < MISR)
-		descptr->len = MISR;
+	if (skb->len < ETH_ZLEN)
+		descptr->len = ETH_ZLEN;
 	else
 		descptr->len = skb->len;
 
-- 
1.8.3.2

^ permalink raw reply related

* [PATCH net-next 0/2] r6040: misc fixes
From: Florian Fainelli @ 2014-01-15 21:04 UTC (permalink / raw)
  To: netdev; +Cc: davem, Florian Fainelli

Hi David,

Here are two small fixes, patch 1 could potentially be backported to stable
trees since it affects MDIO operations.

Thanks!

Florian Fainelli (2):
  r6040: add delays in MDIO read/write polling loops
  r6040: use ETH_ZLEN instead of MISR for SKB length checking

 drivers/net/ethernet/rdc/r6040.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

-- 
1.8.3.2

^ permalink raw reply

* [PATCH-next v2] netfilter: don't use module_init/exit in core IPV4 code
From: Paul Gortmaker @ 2014-01-15 20:57 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik
  Cc: David S. Miller, netfilter-devel, netdev, Paul Gortmaker
In-Reply-To: <1389638147-30399-1-git-send-email-paul.gortmaker@windriver.com>

The file net/ipv4/netfilter.o is created based on whether
CONFIG_NETFILTER is set.  However that is defined as a bool, and
hence this file with the core netfilter hooks will never be
modular.  So using module_init as an alias for __initcall can be
somewhat misleading.

Fix this up now, so that we can relocate module_init from
init.h into module.h in the future.  If we don't do this, we'd
have to add module.h to obviously non-modular code, and that
would be a worse thing.  Also add an inclusion of init.h, as
that was previously implicit here in the netfilter.c file.

Note that direct use of __initcall is discouraged, vs. one
of the priority categorized subgroups.  As __initcall gets
mapped onto device_initcall, our use of subsys_initcall (which
seems to make sense for netfilter code) will thus change this
registration from level 6-device to level 4-subsys (i.e. slightly
earlier).  However no impact from that small difference
has been observed during testing, or is expected. (i.e. the
location of the netfilter messages in dmesg remains unchanged
with respect to all the other surrounding messages.)

As for the module_exit, rather than replace it with __exitcall,
we simply remove it, since it appears only UML does anything
with those, and even for UML, there is no relevant cleanup
to be done here.
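
For readers unfamiliar with the level numbering mentioned above, here is
a simplified user-space model of initcall ordering. It is only a sketch:
the table, the helper names and the two example init functions are
hypothetical, and the real kernel orders initcalls via linker sections
rather than a runtime table. Only the level numbers (4 for subsys, 6 for
device) are taken from the text.

```c
#include <stddef.h>

/* Subset of the initcall levels; numbers as referred to in the text. */
enum initcall_level {
	LEVEL_CORE = 1,
	LEVEL_SUBSYS = 4,
	LEVEL_DEVICE = 6,	/* where a built-in module_init() ends up */
	LEVEL_LATE = 7,
};

struct initcall_entry {
	enum initcall_level level;
	int (*fn)(void);
};

static int seq;			/* incremented on every call */
static int driver_seq, netfilter_seq;

static int init_driver(void) { driver_seq = ++seq; return 0; }
static int init_netfilter(void) { netfilter_seq = ++seq; return 0; }

/* Run entries level by level; table order is kept within a level. */
static void run_initcalls(const struct initcall_entry *tbl, size_t n)
{
	int level;
	size_t i;

	for (level = 0; level <= LEVEL_LATE; level++)
		for (i = 0; i < n; i++)
			if ((int)tbl[i].level == level)
				tbl[i].fn();
}

/* Returns 1 if the netfilter init ran before the driver init. */
static int netfilter_runs_first(enum initcall_level netfilter_level)
{
	const struct initcall_entry tbl[] = {
		{ LEVEL_DEVICE, init_driver },
		{ netfilter_level, init_netfilter },
	};

	seq = 0;
	run_initcalls(tbl, sizeof(tbl) / sizeof(tbl[0]));
	return netfilter_seq < driver_seq;
}
```

In this model, registering at LEVEL_SUBSYS makes the netfilter init run
ahead of anything at LEVEL_DEVICE, while registering at LEVEL_DEVICE
leaves the relative order to the table (link) order.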

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---

[v2: Drop __exitcall stuff completely, as per Eric's suggestion
 given for patch at  http://patchwork.ozlabs.org/patch/311164/ ]

 net/ipv4/netfilter.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
index c3e0adea9c27..31abf9636ba7 100644
--- a/net/ipv4/netfilter.c
+++ b/net/ipv4/netfilter.c
@@ -197,11 +197,4 @@ static int __init ipv4_netfilter_init(void)
 {
 	return nf_register_afinfo(&nf_ip_afinfo);
 }
-
-static void __exit ipv4_netfilter_fini(void)
-{
-	nf_unregister_afinfo(&nf_ip_afinfo);
-}
-
-module_init(ipv4_netfilter_init);
-module_exit(ipv4_netfilter_fini);
+device_initcall(ipv4_netfilter_init);
-- 
1.8.5.2

^ permalink raw reply related

* Re: [RFC net] tcp: metrics: Avoid duplicate entries with the same destination-IP
From: David Miller @ 2014-01-15 20:18 UTC (permalink / raw)
  To: eric.dumazet; +Cc: christoph.paasch, netdev
In-Reply-To: <1389631669.31367.221.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 13 Jan 2014 08:47:49 -0800

> On Mon, 2014-01-13 at 16:25 +0100, Christoph Paasch wrote:
> 
>> Another solution might be to leave tcp_get_metrics() as it is, and in
>> tcpm_new do another call to __tcp_get_metrics() while holding the
>> spin-lock. We would then check __tcp_get_metrics twice for new entries
>> but we won't hold the spin-lock needlessly anymore.
> 
> This is the only solution if you want to fix this.
> The cost of a lookup is the cache line misses.
> Avoiding the spinlock is a must.
> 
> The second 'lookup' is basically free, as the first one has populated
> cpu caches.

Indeed, taking the lock in tcp_get_metrics() is to be avoided at all
costs.
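
The approach discussed in the quoted text (a lockless lookup first, then
a second lookup under the lock just before inserting) can be sketched
in user space as follows. This is an illustration under assumptions,
not the tcp_metrics code: the list, get_or_create() and the other names
are hypothetical, C11 atomics stand in for RCU, and entries are never
freed.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct entry {
	unsigned int addr;
	struct entry *next;	/* set before publication, never changed */
};

static _Atomic(struct entry *) head;
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Lockless read-side lookup. */
static struct entry *lookup(unsigned int addr)
{
	struct entry *e;

	for (e = atomic_load_explicit(&head, memory_order_acquire);
	     e; e = e->next)
		if (e->addr == addr)
			return e;
	return NULL;
}

static struct entry *get_or_create(unsigned int addr)
{
	struct entry *e = lookup(addr);	/* fast path: no lock taken */

	if (e)
		return e;

	pthread_mutex_lock(&list_lock);
	e = lookup(addr);	/* re-check: someone may have raced us */
	if (!e) {
		e = malloc(sizeof(*e));
		if (e) {
			e->addr = addr;
			e->next = atomic_load_explicit(&head,
						       memory_order_relaxed);
			atomic_store_explicit(&head, e,
					      memory_order_release);
		}
	}
	pthread_mutex_unlock(&list_lock);
	return e;
}

/* Helper for checking that no duplicates were created. */
static int count_entries(void)
{
	struct entry *e;
	int n = 0;

	for (e = atomic_load_explicit(&head, memory_order_acquire);
	     e; e = e->next)
		n++;
	return n;
}
```

The lock is only taken on the slow path, when the entry is missing, and
the second lookup is what prevents two racing creators from inserting
the same destination twice.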

^ permalink raw reply

* Re: [PATCH v4 0/3] Send audit/procinfo/cgroup data in socket-level control message
From: David Miller @ 2014-01-15 20:17 UTC (permalink / raw)
  To: jkaluza-H+wXaHxf7aLQT0dZR+AlfA
  Cc: rgb-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	eparis-H+wXaHxf7aLQT0dZR+AlfA,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, tj-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1389600109-30739-1-git-send-email-jkaluza-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

From: Jan Kaluza <jkaluza-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Date: Mon, 13 Jan 2014 09:01:46 +0100

> Changes introduced in this patchset can also increase performance
> of such server-like processes, because current way of opening and
> parsing /proc/$PID/* files is much more expensive than receiving these
> metadata using SCM.

The problem with this line of reasoning is that these changes will
hurt everyone else, because these new control messages are sent
unconditionally, whether the application is interested in them or not.

I really don't like this cost tradeoff, it's terrible, and therefore
I'm really not inclined to apply these patches, sorry.

^ permalink raw reply

* Re: [patch net-next] neigh: use NEIGH_VAR_INIT in ndo_neigh_setup functions.
From: David Miller @ 2014-01-15 20:09 UTC (permalink / raw)
  To: jiri; +Cc: netdev, jes
In-Reply-To: <1389273227-17532-1-git-send-email-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Thu,  9 Jan 2014 14:13:47 +0100

> When ndo_neigh_setup is called, the bitfield used by NEIGH_VAR_SET is
> not initialized yet. This might cause confusion for the people who use
> NEIGH_VAR_SET in ndo_neigh_setup. So rather introduce NEIGH_VAR_INIT for
> usage in ndo_neigh_setup.
> 
> Signed-off-by: Jiri Pirko <jiri@resnulli.us>

Jiri, please respond to my feedback, this patch has been rotting in
patchwork for 6 days.

^ permalink raw reply

* [PATCH net-next] ipv6: send Change Status Report after DAD is completed
From: Flavio Leitner @ 2014-01-15 19:10 UTC (permalink / raw)
  To: netdev; +Cc: Hideaki YOSHIFUJI, Hannes Frederic Sowa, Flavio Leitner

RFC 3810 defines two types of messages for multicast
listeners. The "Current State Report" message, as the name
implies, refreshes the *current* state to the querier.
Since the querier sends Query messages periodically, there
is no need to retransmit the report.

On the other hand, any change should be reported immediately
using "State Change Report" messages. Since such a report is
triggered by a change and can be lost in transit, the RFC
states it should be retransmitted so that it is sent
[Robustness Variable] times in total, to make sure routers
receive it in a timely manner.

Currently, we are sending "Current State Reports" after
DAD is completed.  Before that, we send messages using the
unspecified address (::), which should be silently discarded
by routers.

This patch switches to sending "State Change Report" messages
after DAD is completed, making the behavior RFC compliant
and allowing it to pass the TAHI IPv6 test suite.
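
The retransmission counting can be modelled in user space as below. It
is a simplified sketch that mirrors the counter handling of
ipv6_mc_dad_complete() and mld_dad_timer_expire() in the diff; struct
fake_idev, the timer_armed flag and total_reports() are hypothetical
stand-ins, and no packets are built or sent.

```c
struct fake_idev {
	int qrv;		/* robustness variable (mc_qrv) */
	int dad_count;		/* transmissions still owed (mc_dad_count) */
	int timer_armed;	/* stands in for mld_dad_start_timer() */
	int reports_sent;
};

static void send_state_change_report(struct fake_idev *idev)
{
	idev->reports_sent++;
}

/* Called once DAD has completed: first transmission. */
static void dad_complete(struct fake_idev *idev)
{
	idev->dad_count = idev->qrv;
	if (idev->dad_count) {
		send_state_change_report(idev);
		idev->dad_count--;
		if (idev->dad_count)
			idev->timer_armed = 1;
	}
}

/* Timer callback: one retransmission per expiry. */
static void dad_timer_expire(struct fake_idev *idev)
{
	idev->timer_armed = 0;
	send_state_change_report(idev);
	if (idev->dad_count) {
		idev->dad_count--;
		if (idev->dad_count)
			idev->timer_armed = 1;
	}
}

/* Drive the model until no timer is pending; return reports sent. */
static int total_reports(int qrv)
{
	struct fake_idev idev = { .qrv = qrv };

	dad_complete(&idev);
	while (idev.timer_armed)
		dad_timer_expire(&idev);
	return idev.reports_sent;
}
```

In this model the number of reports sent equals the robustness
variable, e.g. total_reports(2) gives one initial report plus one
retransmission.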

Signed-off-by: Flavio Leitner <fbl@redhat.com>
---
 net/ipv6/mcast.c | 64 ++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 39 insertions(+), 25 deletions(-)

diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 99cd65c..8ac17f5 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1493,7 +1493,7 @@ static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc,
 	skb_tailroom(skb)) : 0)
 
 static struct sk_buff *add_grec(struct sk_buff *skb, struct ifmcaddr6 *pmc,
-	int type, int gdeleted, int sdeleted)
+	int type, int gdeleted, int sdeleted, int crsend)
 {
 	struct inet6_dev *idev = pmc->idev;
 	struct net_device *dev = idev->dev;
@@ -1585,7 +1585,7 @@ empty_source:
 		if (type == MLD2_ALLOW_NEW_SOURCES ||
 		    type == MLD2_BLOCK_OLD_SOURCES)
 			return skb;
-		if (pmc->mca_crcount || isquery) {
+		if (pmc->mca_crcount || isquery || crsend) {
 			/* make sure we have room for group header */
 			if (skb && AVAILABLE(skb) < sizeof(struct mld2_grec)) {
 				mld_sendpack(skb);
@@ -1602,6 +1602,28 @@ empty_source:
 	return skb;
 }
 
+static void mld_send_initial_cr(struct inet6_dev *idev)
+{
+	struct sk_buff *skb;
+	struct ifmcaddr6 *pmc;
+	int type;
+
+	skb = NULL;
+	read_lock_bh(&idev->lock);
+	for (pmc=idev->mc_list; pmc; pmc=pmc->next) {
+		spin_lock_bh(&pmc->mca_lock);
+		if (pmc->mca_sfcount[MCAST_EXCLUDE])
+			type = MLD2_CHANGE_TO_EXCLUDE;
+		else
+			type = MLD2_CHANGE_TO_INCLUDE;
+		skb = add_grec(skb, pmc, type, 0, 0, 1);
+		spin_unlock_bh(&pmc->mca_lock);
+	}
+	read_unlock_bh(&idev->lock);
+	if (skb)
+		mld_sendpack(skb);
+}
+
 static void mld_send_report(struct inet6_dev *idev, struct ifmcaddr6 *pmc)
 {
 	struct sk_buff *skb = NULL;
@@ -1617,7 +1639,7 @@ static void mld_send_report(struct inet6_dev *idev, struct ifmcaddr6 *pmc)
 				type = MLD2_MODE_IS_EXCLUDE;
 			else
 				type = MLD2_MODE_IS_INCLUDE;
-			skb = add_grec(skb, pmc, type, 0, 0);
+			skb = add_grec(skb, pmc, type, 0, 0, 0);
 			spin_unlock_bh(&pmc->mca_lock);
 		}
 	} else {
@@ -1626,7 +1648,7 @@ static void mld_send_report(struct inet6_dev *idev, struct ifmcaddr6 *pmc)
 			type = MLD2_MODE_IS_EXCLUDE;
 		else
 			type = MLD2_MODE_IS_INCLUDE;
-		skb = add_grec(skb, pmc, type, 0, 0);
+		skb = add_grec(skb, pmc, type, 0, 0, 0);
 		spin_unlock_bh(&pmc->mca_lock);
 	}
 	read_unlock_bh(&idev->lock);
@@ -1671,13 +1693,13 @@ static void mld_send_cr(struct inet6_dev *idev)
 		if (pmc->mca_sfmode == MCAST_INCLUDE) {
 			type = MLD2_BLOCK_OLD_SOURCES;
 			dtype = MLD2_BLOCK_OLD_SOURCES;
-			skb = add_grec(skb, pmc, type, 1, 0);
-			skb = add_grec(skb, pmc, dtype, 1, 1);
+			skb = add_grec(skb, pmc, type, 1, 0, 0);
+			skb = add_grec(skb, pmc, dtype, 1, 1, 0);
 		}
 		if (pmc->mca_crcount) {
 			if (pmc->mca_sfmode == MCAST_EXCLUDE) {
 				type = MLD2_CHANGE_TO_INCLUDE;
-				skb = add_grec(skb, pmc, type, 1, 0);
+				skb = add_grec(skb, pmc, type, 1, 0, 0);
 			}
 			pmc->mca_crcount--;
 			if (pmc->mca_crcount == 0) {
@@ -1708,8 +1730,8 @@ static void mld_send_cr(struct inet6_dev *idev)
 			type = MLD2_ALLOW_NEW_SOURCES;
 			dtype = MLD2_BLOCK_OLD_SOURCES;
 		}
-		skb = add_grec(skb, pmc, type, 0, 0);
-		skb = add_grec(skb, pmc, dtype, 0, 1);	/* deleted sources */
+		skb = add_grec(skb, pmc, type, 0, 0, 0);
+		skb = add_grec(skb, pmc, dtype, 0, 1, 0);	/* deleted sources */
 
 		/* filter mode changes */
 		if (pmc->mca_crcount) {
@@ -1717,7 +1739,7 @@ static void mld_send_cr(struct inet6_dev *idev)
 				type = MLD2_CHANGE_TO_EXCLUDE;
 			else
 				type = MLD2_CHANGE_TO_INCLUDE;
-			skb = add_grec(skb, pmc, type, 0, 0);
+			skb = add_grec(skb, pmc, type, 0, 0, 0);
 			pmc->mca_crcount--;
 		}
 		spin_unlock_bh(&pmc->mca_lock);
@@ -1825,27 +1847,19 @@ err_out:
 	goto out;
 }
 
-static void mld_resend_report(struct inet6_dev *idev)
+static void mld_resend_cr(struct inet6_dev *idev)
 {
-	if (MLD_V1_SEEN(idev)) {
-		struct ifmcaddr6 *mcaddr;
-		read_lock_bh(&idev->lock);
-		for (mcaddr = idev->mc_list; mcaddr; mcaddr = mcaddr->next) {
-			if (!(mcaddr->mca_flags & MAF_NOREPORT))
-				igmp6_send(&mcaddr->mca_addr, idev->dev,
-					   ICMPV6_MGM_REPORT);
-		}
-		read_unlock_bh(&idev->lock);
-	} else {
-		mld_send_report(idev, NULL);
-	}
+	if (MLD_V1_SEEN(idev))
+		return;
+
+	mld_send_initial_cr(idev);
 }
 
 void ipv6_mc_dad_complete(struct inet6_dev *idev)
 {
 	idev->mc_dad_count = idev->mc_qrv;
 	if (idev->mc_dad_count) {
-		mld_resend_report(idev);
+		mld_resend_cr(idev);
 		idev->mc_dad_count--;
 		if (idev->mc_dad_count)
 			mld_dad_start_timer(idev, idev->mc_maxdelay);
@@ -1856,7 +1870,7 @@ static void mld_dad_timer_expire(unsigned long data)
 {
 	struct inet6_dev *idev = (struct inet6_dev *)data;
 
-	mld_resend_report(idev);
+	mld_resend_cr(idev);
 	if (idev->mc_dad_count) {
 		idev->mc_dad_count--;
 		if (idev->mc_dad_count)
-- 
1.8.4.2

^ permalink raw reply related

* Re: [Patch net-next] net_sched: act: remove headers in include/net/tc_act/
From: Cong Wang @ 2014-01-15 18:57 UTC (permalink / raw)
  To: David Miller; +Cc: Cong Wang, netdev, Jamal Hadi Salim
In-Reply-To: <20140114.181213.695613248997119103.davem@davemloft.net>

On Tue, Jan 14, 2014 at 6:12 PM, David Miller <davem@davemloft.net> wrote:
> From: Cong Wang <xiyou.wangcong@gmail.com>
> Date: Tue, 14 Jan 2014 17:01:39 -0800
>
>> These headers are not necessary because those definitions in them
>> are action specific and are not shared for others. Just move them
>> into the C files.
>>
>> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
>> Cc: David S. Miller <davem@davemloft.net>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>
> Like Eric, I think this is a dubious change.
>
> There is nothing wrong with using these headers to define the
> core data structures used by each of these actions modules.
>

Nothing is wrong here, just that it is not necessary.

act_police defines similar things in its C file, not a header.
I don't see any reason why the others can't.

^ permalink raw reply

* Re: [PATCH] [RFC] netfilter: nf_conntrack: don't relase a conntrack with non-zero refcnt
From: Andrew Vagin @ 2014-01-15 18:08 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Andrey Vagin, netfilter-devel, Eric Dumazet, netfilter, netdev,
	linux-kernel, vvs, Cyrill Gorcunov, Vasiliy Averin
In-Reply-To: <20140114185329.GB28205@breakpoint.cc>

On Tue, Jan 14, 2014 at 07:53:29PM +0100, Florian Westphal wrote:
> Andrey Vagin <avagin@openvz.org> wrote:
> > ----
> > Eric and Florian, could you look at this patch. When you say,
> > that it looks good, I will ask the user to validate it.
> > I can't reorder these actions, because it's reproduced on a real host
> > with real users. Thanks.
> > ----
> > 
> > nf_conntrack_free can't be called for a conntrack with non-zero ref-counter,
> > because it can race with nf_conntrack_find_get().
> 
> Indeed.
> 
> > A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero
> > ref-counter says that this conntrack is used now. So when we release a
> > conntrack with non-zero counter, we break this assumption.
> > 
> > CPU1                                    CPU2
> > ____nf_conntrack_find()
> >                                         nf_ct_put()
> >                                          destroy_conntrack()
> >                                         ...
> >                                         init_conntrack
> >                                          __nf_conntrack_alloc (set use = 1)
> > atomic_inc_not_zero(&ct->use) (use = 2)
> >                                          if (!l4proto->new(ct, skb, dataoff, timeouts))
> >                                           nf_conntrack_free(ct); (use = 2 !!!)
> >                                         ...
> 
> Yes, I think this sequence is possible; we must not use nf_conntrack_free here.
> 
> > -	/* We overload first tuple to link into unconfirmed or dying list.*/
> > -	BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
> > -	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
> > +	if (!hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode))
> > +		hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
> 
> This is the only thing that I don't like about this patch.  Currently
> all the conntracks in the system are always put on a list before they're
> supposed to be visible/handled via refcnt system (unconfirmed, hash, or
> dying list).
> 
> I think it would be nice if we could keep it that way.
> If everything fails we could probably introduce a 'larval' dummy list
> similar to the one used by template conntracks?

I'm not sure that this is required. Could you elaborate on when this
would be useful?

For now I see only overhead, because we would need to take
nf_conntrack_lock to add the conntrack to a list.

Thanks,
Andrey
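
The interleaving from the quoted diagram can be replayed in a small
user-space model. This is only a sketch of the refcount invariant,
assuming C11 atomics in place of the kernel's atomic_t; struct obj,
get_unless_zero(), put() and replay_race_with_put() are hypothetical
names, and whether the actual patch drops a reference in exactly this
way is not shown in the quoted text.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	atomic_int use;
	bool freed;
};

/* Take a reference only if the object is still live (count != 0). */
static bool get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->use);

	while (old != 0)
		if (atomic_compare_exchange_weak(&o->use, &old, old + 1))
			return true;
	return false;
}

/* Drop a reference; only the last holder releases the object. */
static void put(struct obj *o)
{
	if (atomic_fetch_sub(&o->use, 1) == 1)
		o->freed = true;
}

/* Replays the diagram with the error path dropping a ref, not freeing. */
static bool replay_race_with_put(void)
{
	struct obj ct = { .freed = false };
	bool got, alive_while_held;

	atomic_init(&ct.use, 1);	/* allocation: use = 1 */
	got = get_unless_zero(&ct);	/* CPU1 lookup: use = 2 */
	put(&ct);			/* CPU2 error path: use = 1 */
	alive_while_held = !ct.freed;	/* CPU1 still holds a valid object */
	if (got)
		put(&ct);		/* CPU1 done: use = 0, released */
	return got && alive_while_held && ct.freed;
}
```

In the model, freeing unconditionally on the error path would release
the object while the lookup side still holds the reference it obtained
through the inc-not-zero step, which is the broken assumption described
above.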

^ permalink raw reply

* Re: [PATCH net-next] netfilter: remove double colon
From: Denis Kirjanov @ 2014-01-15 18:05 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Pablo Neira Ayuso, David S. Miller, netdev, netfilter-devel
In-Reply-To: <20140115081250.56958f9a@nehalam.linuxnetplumber.net>

You did miss the "---" after SOB ;)

On 1/15/14, Stephen Hemminger <stephen@networkplumber.org> wrote:
> This is C not shell script
>
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>
> --- a/net/ipv4/netfilter.c	2013-12-31 17:45:31.993942921 -0800
> +++ b/net/ipv4/netfilter.c	2014-01-15 08:10:49.793785943 -0800
> @@ -61,7 +61,7 @@ int ip_route_me_harder(struct sk_buff *s
>  		skb_dst_set(skb, NULL);
>  		dst = xfrm_lookup(net, dst, flowi4_to_flowi(&fl4), skb->sk, 0);
>  		if (IS_ERR(dst))
> -			return PTR_ERR(dst);;
> +			return PTR_ERR(dst);
>  		skb_dst_set(skb, dst);
>  	}
>  #endif

^ permalink raw reply

* Re: [PATCH v2 2/2] Documentation: Document the cephroot functionality
From: Randy Dunlap @ 2014-01-15 18:00 UTC (permalink / raw)
  To: mark.doffman
  Cc: ceph-devel, Rob Taylor, sage, netdev, linux-kernel, linux-nfs
In-Reply-To: <29b3bd9700e23df2aee095df8c34a15a62f57d27.1389806186.git.mark.doffman@codethink.co.uk>

On 01/15/2014 09:26 AM, mark.doffman@codethink.co.uk wrote:
> From: Rob Taylor <rob.taylor@codethink.co.uk>
>
> Document using the cephfs as a root device, its purpose,
> functionality and use.
>
> Signed-off-by: Mark Doffman <mark.doffman@codethink.co.uk>
> Signed-off-by: Rob Taylor <rob.taylor@codethink.co.uk>
> Reviewed-by: Ian Molton <ian.molton@codethink.co.uk>
> ---
>   Documentation/filesystems/{ => ceph}/ceph.txt |  0
>   Documentation/filesystems/ceph/cephroot.txt   | 86 +++++++++++++++++++++++++++
>   2 files changed, 86 insertions(+)
>   rename Documentation/filesystems/{ => ceph}/ceph.txt (100%)
>   create mode 100644 Documentation/filesystems/ceph/cephroot.txt
>
> diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph/ceph.txt
> similarity index 100%
> rename from Documentation/filesystems/ceph.txt
> rename to Documentation/filesystems/ceph/ceph.txt
> diff --git a/Documentation/filesystems/ceph/cephroot.txt b/Documentation/filesystems/ceph/cephroot.txt
> new file mode 100644
> index 0000000..deda4f0
> --- /dev/null
> +++ b/Documentation/filesystems/ceph/cephroot.txt
> @@ -0,0 +1,86 @@
> +Mounting the root filesystem via Ceph (cephroot)
> +===============================================
> +
> +Written 2013 by Rob Taylor <rob.taylor@codethink.co.uk>
> +
> +derived from nfsroot.txt:
> +
> +Written 1996 by Gero Kuhlmann <gero@gkminix.han.de>
> +Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz>
> +Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org>
> +Updated 2006 by Horms <horms@verge.net.au>
> +
> +
> +
> +In order to use a diskless system, such as an X-terminal or printer server
> +for example, it is necessary for the root filesystem to be present on a
> +non-disk device. This may be an initramfs (see Documentation/filesystems/
> +ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt), a
> +filesystem mounted via NFS or a filesystem mounted via Ceph. The following
> +text describes on how to use Ceph for the root filesystem.
> +
> +For the rest of this text 'client' means the diskless system, and 'server'
> +means the Ceph server.
> +
> +
> +1.) Enabling cephroot capabilities
> +    -----------------------------
> +
> +In order to use cephroot, CEPH_FS needs to be selected as
> +built-in during configuration. Once this has been selected, the cephroot
> +option will become available, which should also be selected.
> +
> +In the networking options, kernel level autoconfiguration can be selected,
> +along with the types of autoconfiguration to support. Selecting all of
> +DHCP, BOOTP and RARP is safe.
> +
> +
> +2.) Kernel command line
> +    -------------------
> +
> +When the kernel has been loaded by a boot loader (see below) it needs to be
> +told what root fs device to use. And in the case of cephroot, where to find

                                use, and

> +both the server and the name of the directory on the server to mount as root.
> +This can be established using the following kernel command line parameters:
> +
> +root=/dev/ceph
> +
> +This is necessary to enable the pseudo-Ceph-device. Note that it's not a
> +real device but just a synonym to tell the kernel to use Ceph instead of
> +a real device.
> +
> +If cephroot is not specified, it is expected that that a valid mount will be

drop duplicate:                                 that

> +found via DHCP option 17, Root Path [1]
> +
> +cephroot=<monaddrs>:/[<subdir>],<ceph-opts>
> +
> +  <monaddrs>    Monitor addresses separated by commas. Each takes the form
> +		host[:port]. If the port is not specified, the Ceph default
> +		of 6789 is assumed.
> +
> +  <subdir>	A subdirectory subdir may be specified if a subset of the file
> +		system is to be mounted

		                mounted.
		
> +
> +  <ceph-opts>	Standard Ceph options. All options are separated by commas.
> +		See Documentation/filesystems/ceph/ceph.txt for options and
> +		their defaults.
> +
> +4.) References
> +    ----------
> +
> +[1] http://tools.ietf.org/html/rfc2132
> +
> +5.) Credits
> +    -------
> +
> +  cephroot was derived from nfsroot by Rob Taylor <rob.taylor@codethink.co.uk>
> +  and Mark Doffman <mark.doffman@codethink.co.uk>
> +
> +  The nfsroot code in the kernel and the RARP support have been written
> +  by Gero Kuhlmann <gero@gkminix.han.de>.
> +
> +  The rest of the IP layer autoconfiguration code has been written
> +  by Martin Mares <mj@atrey.karlin.mff.cuni.cz>.
> +
> +  In order to write the initial version of nfsroot I would like to thank
> +  Jens-Uwe Mager <jum@anubis.han.de> for his help.
>


-- 
~Randy

^ permalink raw reply

* Re: [RFC PATCH net-next 2/3] virtio_net: Introduce one dummy function virtnet_filter_rfs()
From: Tom Herbert @ 2014-01-15 17:54 UTC (permalink / raw)
  To: Zhi Yong Wu; +Cc: Linux Netdev List, Eric Dumazet, David Miller, Zhi Yong Wu
In-Reply-To: <1389795654-28381-3-git-send-email-zwu.kernel@gmail.com>

Zhi, this is promising work! I can't wait to see how this impacts
network virtualization performance :-)

On Wed, Jan 15, 2014 at 6:20 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>
> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> ---
>  drivers/net/virtio_net.c |   11 +++++++++++
>  1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7b17240..046421c 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1295,6 +1295,14 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
>         return 0;
>  }
>
> +#ifdef CONFIG_RFS_ACCEL
> +static int virtnet_filter_rfs(struct net_device *net_dev,
> +               const struct sk_buff *skb, u16 rxq_index, u32 flow_id)
> +{
Does this need to be filled out with more stuff?

> +       return 0;
> +}
> +#endif /* CONFIG_RFS_ACCEL */
> +
>  static const struct net_device_ops virtnet_netdev = {
>         .ndo_open            = virtnet_open,
>         .ndo_stop            = virtnet_close,
> @@ -1309,6 +1317,9 @@ static const struct net_device_ops virtnet_netdev = {
>  #ifdef CONFIG_NET_POLL_CONTROLLER
>         .ndo_poll_controller = virtnet_netpoll,
>  #endif
> +#ifdef CONFIG_RFS_ACCEL
> +       .ndo_rx_flow_steer   = virtnet_filter_rfs,
> +#endif
>  };
>
>  static void virtnet_config_changed_work(struct work_struct *work)
> --
> 1.7.6.5
>

^ permalink raw reply

* Re: TI CPSW Ethernet Tx performance regression
From: Ben Hutchings @ 2014-01-15 17:54 UTC (permalink / raw)
  To: Mugunthan V N; +Cc: netdev
In-Reply-To: <1389790129-5721-1-git-send-email-mugunthanvnm@ti.com>

On Wed, 2014-01-15 at 18:18 +0530, Mugunthan V N wrote:
> Hi
> 
> I am seeing a performance regression with CPSW driver on AM335x EVM. AM335x EVM
> CPSW has 3.2 kernel support [1] and Mainline support from 3.7. When I am
> comparing the performance between 3.2 and 3.13-rc4. TCP receive performance of
> CPSW between 3.2 and 3.13-rc4 is same (~180Mbps) but TCP Transmit performance
> is poor comparing to 3.2 kernel. In 3.2 kernel is it *256Mbps* and in 3.13-rc4
> it is *70Mbps*
> 
> Iperf version is *iperf version 2.0.5 (08 Jul 2010) pthreads* on both PC and EVM
> 
> On UDP transmit also performance is down comparing to 3.2 kernel. In 3.2 it is
> 196Mbps for 200Mbps band width and in 3.13-rc4 it is 92Mbps
> 
> Can someone point me out where can I look for improving Tx performance. I also
> checked whether there is Tx descriptor over flow and there is none. I have
> tries 3.11 and some older kernel, all are giving ~75Mbps Transmit performance
> only.
> 
> [1] - http://arago-project.org/git/projects/?p=linux-am33x.git;a=summary

If you don't get any specific suggestions, you could try bisecting to
find out which specific commit(s) changed the performance.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [Patch net-next] net_sched: act: fix a bug in tcf_register_action()
From: Cong Wang @ 2014-01-15 17:40 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: Linux Kernel Network Developers, David S. Miller
In-Reply-To: <52D68069.1090804@mojatatu.com>

On Wed, Jan 15, 2014 at 4:34 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 01/14/14 17:48, Cong Wang wrote:
>>
>> In tcf_register_action() we check ->type and ->kind to see if there
>> is an existing action registered, but ipt action registers two
>> actions with same type but different kinds. This should be a valid
>> case, otherwise only xt can be registered.
>>
>
>
> We cant allow for conflicts by name or id - we want to catch them.
> So just introduce TCA_ACT_XT instead (ID 7)

Oh, I thought it is intentional to use the same type for xt and ipt.

>
> [
> Note: iptables used to be a constant moving API target
> and this is supposed to be the latest "backward compat mode".
> New kernel/iproute ==> We want to love "xt" more than "ipt".
> We infact want to eventually kill "ipt".
> but this preference is hard to achieve as you may have run into.
> I would be curious how you tested and run into this..
> ].
>

Just load the module, and you would see an error message. :)
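
For anyone following the thread, here is a minimal userspace sketch of the kind of duplicate check being discussed, assuming registration is refused when either the numeric type or the kind string is already taken. The struct, the table and the id values are illustrative stand-ins, not the kernel's tcf_register_action() code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_ACTIONS 16

/* Hypothetical, stripped-down stand-in for an action descriptor. */
struct action_ops {
	const char *kind;	/* name, e.g. "ipt" or "xt" */
	int type;		/* numeric id */
};

static const struct action_ops *registered[MAX_ACTIONS];
static size_t nr_registered;

/* Refuse registration if either the id or the name is already in use. */
static int register_action(const struct action_ops *act)
{
	size_t i;

	assert(act != NULL && act->kind != NULL);

	for (i = 0; i < nr_registered; i++) {
		if (registered[i]->type == act->type ||
		    strcmp(registered[i]->kind, act->kind) == 0)
			return -EEXIST;
	}

	if (nr_registered == MAX_ACTIONS)
		return -ENOSPC;

	registered[nr_registered++] = act;
	return 0;
}
```

With a check like this, two actions that share an id cannot both register, which would match the error seen on module load; giving the second action its own id lets both coexist.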

^ permalink raw reply

* [PATCH net-next v2] xen-netfront: add support for IPv6 offloads
From: Paul Durrant @ 2014-01-15 17:30 UTC (permalink / raw)
  To: netdev, xen-devel
  Cc: Paul Durrant, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	David Vrabel

This patch adds support for IPv6 checksum offload and GSO when those
features are available in the backend.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
v2:
- Use xenbus_write rather than xenbus_printf
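
For readers who want the negotiation logic in isolation, the following is a rough userspace sketch of the feature masking that xennet_fix_features() performs in the diff below. The FEAT_* bits, struct backend_caps and fix_features() are hypothetical names made up for this illustration; the real driver reads each key from xenstore with xenbus_scanf() and treats a missing key as 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits standing in for the kernel's NETIF_F_* flags. */
#define FEAT_SG		(1u << 0)
#define FEAT_IPV6_CSUM	(1u << 1)
#define FEAT_TSO	(1u << 2)
#define FEAT_TSO6	(1u << 3)

/*
 * Hypothetical snapshot of the backend's xenstore keys. A value of 0 covers
 * both "key absent" and "key present but set to 0".
 */
struct backend_caps {
	int feature_sg;
	int feature_ipv6_csum_offload;
	int feature_gso_tcpv4;
	int feature_gso_tcpv6;
};

/* Clear every requested feature that the backend does not advertise. */
static uint32_t fix_features(uint32_t features, const struct backend_caps *be)
{
	assert(be != NULL);

	if ((features & FEAT_SG) && !be->feature_sg)
		features &= ~FEAT_SG;
	if ((features & FEAT_IPV6_CSUM) && !be->feature_ipv6_csum_offload)
		features &= ~FEAT_IPV6_CSUM;
	if ((features & FEAT_TSO) && !be->feature_gso_tcpv4)
		features &= ~FEAT_TSO;
	if ((features & FEAT_TSO6) && !be->feature_gso_tcpv6)
		features &= ~FEAT_TSO6;

	return features;
}
```

A backend that predates the IPv6 keys simply leaves them unset, so the IPv6 checksum and TSO6 bits are dropped while the existing IPv4 behaviour is unchanged.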

 drivers/net/xen-netfront.c |   48 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 43 insertions(+), 5 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index c41537b..d7bee8a 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -617,7 +617,9 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		tx->flags |= XEN_NETTXF_extra_info;
 
 		gso->u.gso.size = skb_shinfo(skb)->gso_size;
-		gso->u.gso.type = XEN_NETIF_GSO_TYPE_TCPV4;
+		gso->u.gso.type = (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6) ?
+			XEN_NETIF_GSO_TYPE_TCPV6 :
+			XEN_NETIF_GSO_TYPE_TCPV4;
 		gso->u.gso.pad = 0;
 		gso->u.gso.features = 0;
 
@@ -809,15 +811,18 @@ static int xennet_set_skb_gso(struct sk_buff *skb,
 		return -EINVAL;
 	}
 
-	/* Currently only TCPv4 S.O. is supported. */
-	if (gso->u.gso.type != XEN_NETIF_GSO_TYPE_TCPV4) {
+	if (gso->u.gso.type != XEN_NETIF_GSO_TYPE_TCPV4 &&
+	    gso->u.gso.type != XEN_NETIF_GSO_TYPE_TCPV6) {
 		if (net_ratelimit())
 			pr_warn("Bad GSO type %d\n", gso->u.gso.type);
 		return -EINVAL;
 	}
 
 	skb_shinfo(skb)->gso_size = gso->u.gso.size;
-	skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
+	skb_shinfo(skb)->gso_type =
+		(gso->u.gso.type == XEN_NETIF_GSO_TYPE_TCPV4) ?
+		SKB_GSO_TCPV4 :
+		SKB_GSO_TCPV6;
 
 	/* Header must be checked, and gso_segs computed. */
 	skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
@@ -1191,6 +1196,15 @@ static netdev_features_t xennet_fix_features(struct net_device *dev,
 			features &= ~NETIF_F_SG;
 	}
 
+	if (features & NETIF_F_IPV6_CSUM) {
+		if (xenbus_scanf(XBT_NIL, np->xbdev->otherend,
+				 "feature-ipv6-csum-offload", "%d", &val) < 0)
+			val = 0;
+
+		if (!val)
+			features &= ~NETIF_F_IPV6_CSUM;
+	}
+
 	if (features & NETIF_F_TSO) {
 		if (xenbus_scanf(XBT_NIL, np->xbdev->otherend,
 				 "feature-gso-tcpv4", "%d", &val) < 0)
@@ -1200,6 +1214,15 @@ static netdev_features_t xennet_fix_features(struct net_device *dev,
 			features &= ~NETIF_F_TSO;
 	}
 
+	if (features & NETIF_F_TSO6) {
+		if (xenbus_scanf(XBT_NIL, np->xbdev->otherend,
+				 "feature-gso-tcpv6", "%d", &val) < 0)
+			val = 0;
+
+		if (!val)
+			features &= ~NETIF_F_TSO6;
+	}
+
 	return features;
 }
 
@@ -1338,7 +1361,9 @@ static struct net_device *xennet_create_dev(struct xenbus_device *dev)
 	netif_napi_add(netdev, &np->napi, xennet_poll, 64);
 	netdev->features        = NETIF_F_IP_CSUM | NETIF_F_RXCSUM |
 				  NETIF_F_GSO_ROBUST;
-	netdev->hw_features	= NETIF_F_IP_CSUM | NETIF_F_SG | NETIF_F_TSO;
+	netdev->hw_features	= NETIF_F_SG |
+				  NETIF_F_IPV6_CSUM |
+				  NETIF_F_TSO | NETIF_F_TSO6;
 
 	/*
          * Assume that all hw features are available for now. This set
@@ -1716,6 +1741,19 @@ again:
 		goto abort_transaction;
 	}
 
+	err = xenbus_write(xbt, dev->nodename, "feature-gso-tcpv6", "1");
+	if (err) {
+		message = "writing feature-gso-tcpv6";
+		goto abort_transaction;
+	}
+
+	err = xenbus_write(xbt, dev->nodename, "feature-ipv6-csum-offload",
+			   "1");
+	if (err) {
+		message = "writing feature-ipv6-csum-offload";
+		goto abort_transaction;
+	}
+
 	err = xenbus_transaction_end(xbt, 0);
 	if (err) {
 		if (err == -EAGAIN)
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH v2 2/2] Documentation: Document the cephroot functionality
From: mark.doffman @ 2014-01-15 17:26 UTC (permalink / raw)
  To: ceph-devel
  Cc: Rob Taylor, sage, netdev, linux-kernel, linux-nfs, Mark Doffman
In-Reply-To: <cover.1389806186.git.mark.doffman@codethink.co.uk>

From: Rob Taylor <rob.taylor@codethink.co.uk>

Document using the cephfs as a root device, its purpose,
functionality and use.

Signed-off-by: Mark Doffman <mark.doffman@codethink.co.uk>
Signed-off-by: Rob Taylor <rob.taylor@codethink.co.uk>
Reviewed-by: Ian Molton <ian.molton@codethink.co.uk>
---
 Documentation/filesystems/{ => ceph}/ceph.txt |  0
 Documentation/filesystems/ceph/cephroot.txt   | 86 +++++++++++++++++++++++++++
 2 files changed, 86 insertions(+)
 rename Documentation/filesystems/{ => ceph}/ceph.txt (100%)
 create mode 100644 Documentation/filesystems/ceph/cephroot.txt

diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph/ceph.txt
similarity index 100%
rename from Documentation/filesystems/ceph.txt
rename to Documentation/filesystems/ceph/ceph.txt
diff --git a/Documentation/filesystems/ceph/cephroot.txt b/Documentation/filesystems/ceph/cephroot.txt
new file mode 100644
index 0000000..deda4f0
--- /dev/null
+++ b/Documentation/filesystems/ceph/cephroot.txt
@@ -0,0 +1,86 @@
+Mounting the root filesystem via Ceph (cephroot)
+===============================================
+
+Written 2013 by Rob Taylor <rob.taylor@codethink.co.uk>
+
+derived from nfsroot.txt:
+
+Written 1996 by Gero Kuhlmann <gero@gkminix.han.de>
+Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz>
+Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org>
+Updated 2006 by Horms <horms@verge.net.au>
+
+
+
+In order to use a diskless system, such as an X-terminal or printer server
+for example, it is necessary for the root filesystem to be present on a
+non-disk device. This may be an initramfs (see Documentation/filesystems/
+ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt), a
+filesystem mounted via NFS or a filesystem mounted via Ceph. The following
+text describes how to use Ceph for the root filesystem.
+
+For the rest of this text 'client' means the diskless system, and 'server'
+means the Ceph server.
+
+
+1.) Enabling cephroot capabilities
+    -----------------------------
+
+In order to use cephroot, CEPH_FS needs to be selected as
+built-in during configuration. Once this has been selected, the cephroot
+option will become available, which should also be selected.
+
+In the networking options, kernel level autoconfiguration can be selected,
+along with the types of autoconfiguration to support. Selecting all of
+DHCP, BOOTP and RARP is safe.
+
+
+2.) Kernel command line
+    -------------------
+
+When the kernel has been loaded by a boot loader, it needs to be told what
+root fs device to use and, in the case of cephroot, where to find both the
+server and the name of the directory on the server to mount as root.
+This can be established using the following kernel command line parameters:
+
+root=/dev/ceph
+
+This is necessary to enable the pseudo-Ceph-device. Note that it's not a
+real device but just a synonym to tell the kernel to use Ceph instead of
+a real device.
+
+If cephroot is not specified, it is expected that a valid mount will be
+found via DHCP option 17, Root Path [1].
+
+cephroot=<monaddrs>:/[<subdir>],<ceph-opts>
+
+  <monaddrs>    Monitor addresses separated by commas. Each takes the form
+		host[:port]. If the port is not specified, the Ceph default
+		of 6789 is assumed.
+
+  <subdir>	A subdirectory subdir may be specified if a subset of the file
+		system is to be mounted.
+
+  <ceph-opts>	Standard Ceph options. All options are separated by commas.
+		See Documentation/filesystems/ceph/ceph.txt for options and
+		their defaults.
+
+3.) References
+    ----------
+
+[1] http://tools.ietf.org/html/rfc2132
+
+4.) Credits
+    -------
+
+  cephroot was derived from nfsroot by Rob Taylor <rob.taylor@codethink.co.uk>
+  and Mark Doffman <mark.doffman@codethink.co.uk>
+
+  The nfsroot code in the kernel and the RARP support have been written
+  by Gero Kuhlmann <gero@gkminix.han.de>.
+
+  The rest of the IP layer autoconfiguration code has been written
+  by Martin Mares <mj@atrey.karlin.mff.cuni.cz>.
+
+  In order to write the initial version of nfsroot I would like to thank
+  Jens-Uwe Mager <jum@anubis.han.de> for his help.
-- 
1.8.4

^ permalink raw reply related

* [PATCH v2 1/2] init: Add a new root device option, the Ceph file system
From: mark.doffman @ 2014-01-15 17:26 UTC (permalink / raw)
  To: ceph-devel
  Cc: Mark Doffman, sage, netdev, linux-kernel, linux-nfs, rob.taylor
In-Reply-To: <cover.1389806186.git.mark.doffman@codethink.co.uk>

From: Mark Doffman <mark.doffman@codethink.co.uk>

Analogous to NFS, add a new root device option: the ability
to boot using the Ceph networked file system as the root fs.

This patch adds a new root device option '/dev/ceph' that
uses a ceph networked file system. File system parameters
are passed using a new kernel parameter: 'cephroot'.

The 'cephroot' parameters are very similar to 'nfsroot'.

Signed-off-by: Mark Doffman <mark.doffman@codethink.co.uk>
Reviewed-by: Ian Molton <ian.molton@codethink.co.uk>
---
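A note on the truncation checks in fs/ceph/root.c: as far as I know, strlcat() returns the length of the string it tried to create, so a return value equal to the buffer size already means the result was cut short. The sketch below is a userspace approximation of the comma-joining helper with that convention; bounded_cat() and append_with_comma() are hypothetical names for illustration and are not part of this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace stand-in for strlcat(): append src to dst without writing past
 * dstsize bytes, and return the length the full result would have had.
 * A return value >= dstsize therefore signals truncation.
 */
static size_t bounded_cat(char *dst, const char *src, size_t dstsize)
{
	size_t dlen = 0;
	size_t slen = strlen(src);
	size_t room;

	while (dlen < dstsize && dst[dlen] != '\0')
		dlen++;
	if (dlen == dstsize)		/* dst is not terminated in range */
		return dstsize + slen;

	room = dstsize - dlen - 1;	/* bytes available before the NUL */
	if (slen < room)
		room = slen;
	memcpy(dst + dlen, src, room);
	dst[dlen + room] = '\0';

	return dlen + slen;
}

/* Append 'incoming' to 'dest', inserting a ',' if dest is non-empty. */
static int append_with_comma(const char *incoming, char *dest, size_t destlen)
{
	size_t len;

	assert(dest != NULL && destlen > 0);

	if (incoming == NULL || *incoming == '\0')
		return 0;

	len = strlen(dest);
	if (len && dest[len - 1] != ',' &&
	    bounded_cat(dest, ",", destlen) >= destlen)
		return -E2BIG;

	if (bounded_cat(dest, incoming, destlen) >= destlen)
		return -E2BIG;

	return 0;
}
```

The boundary case is a buffer that is exactly full: the attempted length then equals the buffer size, which a strict "greater than" comparison would not report.
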
 fs/ceph/Kconfig                |  10 +++
 fs/ceph/Makefile               |   1 +
 fs/ceph/root.c                 | 176 +++++++++++++++++++++++++++++++++++++++++
 include/linux/ceph/ceph_root.h |  10 +++
 include/linux/root_dev.h       |   1 +
 init/do_mounts.c               |  32 +++++++-
 net/ipv4/ipconfig.c            |  10 ++-
 7 files changed, 237 insertions(+), 3 deletions(-)
 create mode 100644 fs/ceph/root.c
 create mode 100644 include/linux/ceph/ceph_root.h

diff --git a/fs/ceph/Kconfig b/fs/ceph/Kconfig
index ac9a2ef..325e83d 100644
--- a/fs/ceph/Kconfig
+++ b/fs/ceph/Kconfig
@@ -25,3 +25,13 @@ config CEPH_FSCACHE
 	  caching support for Ceph clients using FS-Cache
 
 endif
+
+config ROOT_CEPH
+	bool "Root file system on Ceph FS"
+	depends on CEPH_FS=y && IP_PNP
+	help
+	  If you want your system to mount its root file system via CEPH,
+	  choose Y here.  For details, read
+	  <file:Documentation/filesystems/ceph/cephroot.txt>.
+
+	  If unsure say N.
diff --git a/fs/ceph/Makefile b/fs/ceph/Makefile
index 32e3010..af2dcbf 100644
--- a/fs/ceph/Makefile
+++ b/fs/ceph/Makefile
@@ -10,3 +10,4 @@ ceph-y := super.o inode.o dir.o file.o locks.o addr.o ioctl.o \
 	debugfs.o
 
 ceph-$(CONFIG_CEPH_FSCACHE) += cache.o
+ceph-$(CONFIG_ROOT_CEPH) += root.o
diff --git a/fs/ceph/root.c b/fs/ceph/root.c
new file mode 100644
index 0000000..1559c19
--- /dev/null
+++ b/fs/ceph/root.c
@@ -0,0 +1,176 @@
+/*
+ * Copyright (C) 2012 Codethink Ltd. <mark.doffman@codethink.co.uk>
+ *
+ * This file is released under the GPL v2
+ *
+ * Allow a CephFS filesystem to be mounted as root.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/utsname.h>
+#include <linux/root_dev.h>
+#include <linux/in.h>
+#include <net/ipconfig.h>
+#include <linux/ceph/ceph_root.h>
+
+#define MAXPATHLEN 1024
+
+/* Parameters passed from the kernel command line */
+static char ceph_command_line_params[256] __initdata;
+
+/* server:path string passed to mount */
+static char ceph_root_device[MAXPATHLEN + 1] __initdata;
+
+/* Name of directory to mount */
+static char ceph_export_path[MAXPATHLEN + 1] __initdata;
+
+/* Mount options */
+static char ceph_root_options[256] __initdata;
+
+/*
+ *  Parse CephFS server and directory information passed on the kernel
+ *  command line.
+ *
+ *  cephroot=[<server-ip>][,<server-ips>]:<root-dir>[,<cephfs-options>]
+ */
+static int __init ceph_root_setup(char *line)
+{
+	ROOT_DEV = Root_CEPH;
+
+	strlcpy(ceph_command_line_params, line,
+		sizeof(ceph_command_line_params));
+
+	return 1;
+}
+
+__setup("cephroot=", ceph_root_setup);
+
+/*
+ * ceph_root_append - Concatenates an options or address string
+ * adding a ',' delimiter if necessary.
+ *
+ * Returns 0 on success, -E2BIG if the resulting string is too long.
+ */
+static int __init ceph_root_append(char *incoming,
+				   char *dest,
+				   const size_t destlen)
+{
+	int res = 0;
+
+	if (incoming != NULL && *incoming != '\0') {
+		size_t len = strlen(dest);
+
+		if (len && dest[len - 1] != ',') {
+			if (strlcat(dest, ",", destlen) >= destlen)
+				res = -E2BIG;
+		}
+
+		if (strlcat(dest, incoming, destlen) >= destlen)
+			res = -E2BIG;
+
+	}
+	return res;
+}
+
+/*
+ * ceph_root_parse_params - Parse out root export path and mount options from
+ * passed-in string @incoming.
+ *
+ * Copy the path into @outpath.
+ *
+ * Returns 0 on success, -EINVAL if no path is found, or -E2BIG if the
+ * resulting options string or device string is too long.
+ */
+static int __init ceph_root_parse_params(char *incoming, char *outpath,
+					 const size_t outpathlen)
+{
+	int res = -EINVAL;
+	char *options;
+	char *path;
+
+	options = strstr(incoming, ":/");
+	if (options == NULL)
+		options = strstr(incoming, "default");
+
+	if (options != NULL) {
+		path = strsep(&options, ",");
+		if (*path != '\0' && strcmp(path, "default") != 0)
+			strlcpy(outpath, path, outpathlen);
+		res = ceph_root_append(options, ceph_root_options,
+				sizeof(ceph_root_options));
+
+		if (res == 0) {
+			*path = '\0';
+			res = ceph_root_append(incoming, ceph_root_device,
+					sizeof(ceph_root_device));
+		}
+	}
+
+	return res;
+}
+
+/*
+ * ceph_root_data - Return mount device and data for CEPHROOT mount.
+ *
+ * @root_device: OUT: Address of string containing CEPHROOT device.
+ * @root_data: OUT: Address of string containing CEPHROOT mount options.
+ *
+ * Returns: 0 and sets @root_device and @root_data if successful.
+ *          error code if unsuccessful.
+ */
+int __init ceph_root_data(char **root_device, char **root_data)
+{
+	char *tmp_root_path = NULL;
+	const size_t tmplen = sizeof(ceph_export_path);
+	int len;
+	int res = -E2BIG;
+
+	tmp_root_path = kzalloc(tmplen, GFP_KERNEL);
+	if (tmp_root_path == NULL)
+		return -ENOMEM;
+
+	if (root_server_path[0] != '\0') {
+		if (ceph_root_parse_params(root_server_path, tmp_root_path,
+					tmplen))
+			goto out;
+	}
+
+	if (ceph_command_line_params[0] != '\0') {
+		if (ceph_root_parse_params(ceph_command_line_params,
+					tmp_root_path, tmplen))
+			goto out;
+	}
+
+	/*
+	 * Set up ceph_root_device. This looks like: server:/path
+	 *
+	 * At this point, utsname()->nodename contains our local
+	 * IP address or hostname, set by ipconfig.  If "%s" exists
+	 * in tmp_root_path, substitute the nodename, then shovel the whole
+	 * mess into ceph_root_device.
+	 */
+	len = snprintf(ceph_export_path, sizeof(ceph_export_path),
+				   tmp_root_path, utsname()->nodename);
+	if (len >= (int)sizeof(ceph_export_path))
+		goto out;
+
+	len = strlcat(ceph_root_device, ceph_export_path,
+			sizeof(ceph_root_device));
+	if (len >= (int)sizeof(ceph_root_device))
+		goto out;
+
+	pr_debug("Root-CEPH: Root device: %s\n", ceph_root_device);
+	pr_debug("Root-CEPH: Root options: %s\n", ceph_root_options);
+	*root_device = ceph_root_device;
+	*root_data = ceph_root_options;
+
+	res = 0;
+
+out:
+	kfree(tmp_root_path);
+	return res;
+}
diff --git a/include/linux/ceph/ceph_root.h b/include/linux/ceph/ceph_root.h
new file mode 100644
index 0000000..e6bae63
--- /dev/null
+++ b/include/linux/ceph/ceph_root.h
@@ -0,0 +1,10 @@
+/*
+ * Copyright (C) 2012 Codethink Ltd. <mark.doffman@codethink.co.uk>
+ *
+ * This file is released under the GPL v2
+ *
+ * ceph_root.h
+ */
+
+/* linux/fs/ceph/root.c */
+extern int ceph_root_data(char **root_device, char **root_data); /*__init*/
diff --git a/include/linux/root_dev.h b/include/linux/root_dev.h
index ed241aa..af6b182 100644
--- a/include/linux/root_dev.h
+++ b/include/linux/root_dev.h
@@ -16,6 +16,7 @@ enum {
 	Root_SDA2 = MKDEV(SCSI_DISK0_MAJOR, 2),
 	Root_HDC1 = MKDEV(IDE1_MAJOR, 1),
 	Root_SR0 = MKDEV(SCSI_CDROM_MAJOR, 0),
+	Root_CEPH = MKDEV(UNNAMED_MAJOR, 254),
 };
 
 extern dev_t ROOT_DEV;
diff --git a/init/do_mounts.c b/init/do_mounts.c
index 8e5addc..d075020 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -33,6 +33,8 @@
 #include <linux/nfs_fs_sb.h>
 #include <linux/nfs_mount.h>
 
+#include <linux/ceph/ceph_root.h>
+
 #include "do_mounts.h"
 
 int __initdata rd_doload;	/* 1 = load RAM disk, 0 = don't load */
@@ -199,6 +201,7 @@ done:
  *	   a partition with a known unique id.
  *	8) <major>:<minor> major and minor number of the device separated by
  *	   a colon.
+ *	9) /dev/ceph represents Root_CEPH
  *
  *	If name doesn't have fall into the categories above, we return (0,0).
  *	block_class is used to check if something is a disk name. If the disk
@@ -245,7 +248,9 @@ dev_t name_to_dev_t(char *name)
 	res = Root_RAM0;
 	if (strcmp(name, "ram") == 0)
 		goto done;
-
+	res = Root_CEPH;
+	if (strcmp(name, "ceph") == 0)
+		goto done;
 	if (strlen(name) > 31)
 		goto fail;
 	strcpy(s, name);
@@ -473,6 +478,22 @@ static int __init mount_nfs_root(void)
 }
 #endif
 
+#ifdef CONFIG_ROOT_CEPH
+static int __init mount_ceph_root(void)
+{
+	char *root_dev, *root_data;
+
+	if (ceph_root_data(&root_dev, &root_data))
+		return 0;
+
+	if (do_mount_root(root_dev, "ceph",
+				root_mountflags, root_data))
+		return 0;
+
+	return 1;
+}
+#endif
+
 #if defined(CONFIG_BLK_DEV_RAM) || defined(CONFIG_BLK_DEV_FD)
 void __init change_floppy(char *fmt, ...)
 {
@@ -514,6 +535,15 @@ void __init mount_root(void)
 		ROOT_DEV = Root_FD0;
 	}
 #endif
+#ifdef CONFIG_ROOT_CEPH
+	if (ROOT_DEV == Root_CEPH) {
+		if (mount_ceph_root())
+			return;
+
+		printk(KERN_ERR "VFS: Unable to mount root fs via CephFS, trying floppy.\n");
+		ROOT_DEV = Root_FD0;
+	}
+#endif
 #ifdef CONFIG_BLK_DEV_FD
 	if (MAJOR(ROOT_DEV) == FLOPPY_MAJOR) {
 		/* rd_doload is 2 for a dual initrd/ramload setup */
diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index efa1138..765eea4 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -1435,10 +1435,10 @@ static int __init ip_auto_config(void)
 	 * missing values.
 	 */
 	if (ic_myaddr == NONE ||
-#ifdef CONFIG_ROOT_NFS
+#if defined(CONFIG_ROOT_NFS) || defined(CONFIG_ROOT_CEPH)
 	    (root_server_addr == NONE &&
 	     ic_servaddr == NONE &&
-	     ROOT_DEV == Root_NFS) ||
+	     (ROOT_DEV == Root_NFS || ROOT_DEV == Root_CEPH)) ||
 #endif
 	    ic_first_dev->next) {
 #ifdef IPCONFIG_DYNAMIC
@@ -1465,6 +1465,12 @@ static int __init ip_auto_config(void)
 				goto try_try_again;
 			}
 #endif
+#ifdef CONFIG_ROOT_CEPH
+			if (ROOT_DEV ==  Root_CEPH) {
+				pr_err("IP-Config: Retrying forever (CEPH root)...\n");
+				goto try_try_again;
+			}
+#endif
 
 			if (--retries) {
 				pr_err("IP-Config: Reopening network devices...\n");
-- 
1.8.4

^ permalink raw reply related

* [PATCH v2 0/2] Add ceph root filesystem functionality and documentation.
From: mark.doffman-4yDnlxn2s6sWdaTGBSpHTA @ 2014-01-15 17:26 UTC (permalink / raw)
  To: ceph-devel-u79uwXL29TY76Z2rM5mHXA
  Cc: Mark Doffman, sage-4GqslpFJ+cxBDgjK7y7TUQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	rob.taylor-4yDnlxn2s6sWdaTGBSpHTA
In-Reply-To: <1385000024-23463-1-git-send-email-mark.doffman-4yDnlxn2s6sWdaTGBSpHTA@public.gmane.org>

From: Mark Doffman <mark.doffman-4yDnlxn2s6sWdaTGBSpHTA@public.gmane.org>

Hi All,

The following is a second version of a patch series that adds the ability to use
a ceph distributed file system as the root device.

Changes from version 1

fs/ceph/root.c:

The parsing code that takes the DHCP option 17 and kernel command line
parameters has been extensively altered.

The parsing now accepts multiple monitor addresses and IPv6 addresses.

The monitors listed in DHCP option 17 are now concatenated with those
listed on the kernel command line.
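
To make the string format concrete, here is a small userspace sketch that splits a cephroot-style specification of the form <monaddrs>:/[<subdir>][,<ceph-opts>] (the syntax documented in patch 2/2) into its three parts. struct cephroot_parts and split_cephroot() are hypothetical names for this illustration only; the code in fs/ceph/root.c is organised differently and additionally merges the DHCP and command line values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: all pointers refer into the caller's buffer. */
struct cephroot_parts {
	char *monaddrs;	/* "host[:port][,host[:port]...]" */
	char *path;	/* "/" or "/<subdir>" */
	char *opts;	/* comma-separated options, "" if none */
};

/*
 * Split "<monaddrs>:/[<subdir>][,<ceph-opts>]" in place.
 * Returns 0 on success, -1 if no ":/" separator is present.
 */
static int split_cephroot(char *spec, struct cephroot_parts *out)
{
	char *sep, *comma;

	assert(spec != NULL && out != NULL);

	/* The first ":/" ends the monitor list; "host:port" has no '/'. */
	sep = strstr(spec, ":/");
	if (sep == NULL)
		return -1;

	*sep = '\0';
	out->monaddrs = spec;
	out->path = sep + 1;		/* now points at the leading '/' */

	/* Everything after the first ',' following the path is options. */
	comma = strchr(out->path, ',');
	if (comma != NULL) {
		*comma = '\0';
		out->opts = comma + 1;
	} else {
		out->opts = out->path + strlen(out->path);
	}

	return 0;
}
```

Searching for ":/" rather than a bare ':' is what lets monitor entries carry their own ":port" suffix without being mistaken for the start of the path.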

The patch series applies to v3.13-rc8-7-g3539717

Thanks

Mark

Mark Doffman (1):
  init: Add a new root device option, the Ceph file system

Rob Taylor (1):
  Documentation: Document the cephroot functionality

 Documentation/filesystems/{ => ceph}/ceph.txt |   0
 Documentation/filesystems/ceph/cephroot.txt   |  86 +++++++++++++
 fs/ceph/Kconfig                               |  10 ++
 fs/ceph/Makefile                              |   1 +
 fs/ceph/root.c                                | 176 ++++++++++++++++++++++++++
 include/linux/ceph/ceph_root.h                |  10 ++
 include/linux/root_dev.h                      |   1 +
 init/do_mounts.c                              |  32 ++++-
 net/ipv4/ipconfig.c                           |  10 +-
 9 files changed, 323 insertions(+), 3 deletions(-)
 rename Documentation/filesystems/{ => ceph}/ceph.txt (100%)
 create mode 100644 Documentation/filesystems/ceph/cephroot.txt
 create mode 100644 fs/ceph/root.c
 create mode 100644 include/linux/ceph/ceph_root.h

-- 
1.8.4

^ permalink raw reply
