Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 1/8] [I/OAT] DMA memcpy subsystem
From: Andrew Grover @ 2006-03-28 18:44 UTC (permalink / raw)
  To: Kumar Gala; +Cc: Chris Leech, linux kernel mailing list, netdev
In-Reply-To: <2FF801BB-F96C-4864-AC44-09B4B92531F7@kernel.crashing.org>

On 3/16/06, Kumar Gala <galak@kernel.crashing.org> wrote:
> It would seem that when a client registers (or shortly there after
> when they call dma_async_client_chan_request()) they would expect to
> get the number of channels they need by some given time period.
>
> For example, lets say a client registers but no dma device exists.
> They will never get called to be aware of this condition.
>
> I would think most clients would either spin until they have all the
> channels they need or fall back to a non-async mechanism.

Clients *are* expected to fall back to non-async if they are not given
channels. The reason it was implemented with callbacks for
added/removed was that the client may be initializing before the
channels are enumerated. For example, the net subsystem will ask for
channels and not get them for a while, until the ioatdma PCI device is
found and its driver loads. In this scenario, we'd like the net
subsystem to be given these channels, instead of them going unused.

> Also, what do you think about adding an operation type (MEMCPY, XOR,
> CRYPTO_AES, etc).  We can than validate if the operation type
> expected is supported by the devices that exist.

No objections, but this speculative support doesn't need to be in our
initial patchset.

> Shouldn't we also have a dma_async_client_chan_free()?

Well we could just define it to be chan_request(0) but it doesn't seem
to be needed. Also, the allocation mechanism we have for channels is
different from alloc/free's semantics, so it may be best to not muddy
the water in this area.

Regards -- Andy

^ permalink raw reply

* Re: [PATCH 1/8] [I/OAT] DMA memcpy subsystem
From: Kumar Gala @ 2006-03-28 18:58 UTC (permalink / raw)
  To: Andrew Grover; +Cc: Chris Leech, linux kernel mailing list, netdev
In-Reply-To: <c0a09e5c0603281044i57730c66ye08c45aadd352cf8@mail.gmail.com>


On Mar 28, 2006, at 12:44 PM, Andrew Grover wrote:

> On 3/16/06, Kumar Gala <galak@kernel.crashing.org> wrote:
>> It would seem that when a client registers (or shortly there after
>> when they call dma_async_client_chan_request()) they would expect to
>> get the number of channels they need by some given time period.
>>
>> For example, lets say a client registers but no dma device exists.
>> They will never get called to be aware of this condition.
>>
>> I would think most clients would either spin until they have all the
>> channels they need or fall back to a non-async mechanism.
>
> Clients *are* expected to fall back to non-async if they are not given
> channels. The reason it was implemented with callbacks for
> added/removed was that the client may be initializing before the
> channels are enumerated. For example, the net subsystem will ask for
> channels and not get them for a while, until the ioatdma PCI device is
> found and its driver loads. In this scenario, we'd like the net
> subsystem to be given these channels, instead of them going unused.

Fair, I need to think on that a little more.

>> Also, what do you think about adding an operation type (MEMCPY, XOR,
>> CRYPTO_AES, etc).  We can than validate if the operation type
>> expected is supported by the devices that exist.
>
> No objections, but this speculative support doesn't need to be in our
> initial patchset.

I don't consider it speculative.  The patch is for a generic DMA  
engine interface.  That interface should encompass all users.  I have  
a security/crypto DMA engine that I'd like to front with the generic  
DMA interface today.  Also, I believe there is another Intel group  
with an XOR engine that had a similar concept called ADMA posted a  
while ago.

http://marc.theaimsgroup.com/?t=112603120100004&r=1&w=2

>> Shouldn't we also have a dma_async_client_chan_free()?
>
> Well we could just define it to be chan_request(0) but it doesn't seem
> to be needed. Also, the allocation mechanism we have for channels is
> different from alloc/free's semantics, so it may be best to not muddy
> the water in this area.

Can you explain what the semantics are.

It's been a little while since I posted so my thoughts on the subject  
are going to take a little while to come back to me :)

- kumar

^ permalink raw reply

* Re: [PATCH 1/8] [I/OAT] DMA memcpy subsystem
From: Andrew Grover @ 2006-03-28 22:01 UTC (permalink / raw)
  To: Kumar Gala; +Cc: Chris Leech, linux kernel mailing list, netdev
In-Reply-To: <D760971F-3C6A-400B-99EA-E95358B37F82@kernel.crashing.org>

On 3/28/06, Kumar Gala <galak@kernel.crashing.org> wrote:

> >> Also, what do you think about adding an operation type (MEMCPY, XOR,
> >> CRYPTO_AES, etc).  We can than validate if the operation type
> >> expected is supported by the devices that exist.
> >
> > No objections, but this speculative support doesn't need to be in our
> > initial patchset.
>
> I don't consider it speculative.  The patch is for a generic DMA
> engine interface.  That interface should encompass all users.  I have
> a security/crypto DMA engine that I'd like to front with the generic
> DMA interface today.  Also, I believe there is another Intel group
> with an XOR engine that had a similar concept called ADMA posted a
> while ago.

Please submit patches then. We will be doing another rev of the I/OAT
patch very soon, which you will be able to patch against. Or, once the
patch gets in mainline then we can enhance it. Code in the Linux
kernel is never "done", and the burden of implementing additional
functionality falls on those who want it.

> Can you explain what the semantics are.
>
> It's been a little while since I posted so my thoughts on the subject
> are going to take a little while to come back to me :)

Yeah. Basically you register as a DMA client, and say how many DMA
channels you want. Our net_dma patch for example uses multiple
channels to help lock contention. Then when channels are available
(i.e. a DMA device added or another client gives them up) then you get
a callback. If the channel goes away (i.e. DMA device is removed
(theoretically possible but practically never happens) or *you* are
going away and change your request to 0 channels) then you get a
remove callback.

This gets around the problem of DMA clients registering (and therefore
not getting) channels simply because they init before the DMA device
is discovered.

Regards -- Andy

^ permalink raw reply

* [PATCH] Janitor: drivers/net/pcnet32: fix incorrect comments
From: Linas Vepstas @ 2006-03-28 22:36 UTC (permalink / raw)
  To: tsbogend, donf; +Cc: jklewis, linuxppc-dev, linux-kernel, netdev


Please sign-off/ack/apply and/or forward upstream.

[PATCH] Janitor: drivers/net/pcnet32: fix incorrect comments

The comments concerning how the pcnet32 ethernet device driver selects 
the MAC addr to use are incorrect. A recent patch (in the last 3 months)
changed how the code worked, but did not change the comments.

Side comment: the new behaviour is good; I've got a pcnet32 card which
powers up with garbage in the CSR's, and a good MAC addr in the PROM.

Signed-off-by: Linas Vepstas <linas@linas.org>

----

 drivers/net/pcnet32.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.16-git6/drivers/net/pcnet32.c
===================================================================
--- linux-2.6.16-git6.orig/drivers/net/pcnet32.c	2006-03-23 12:21:41.000000000 -0600
+++ linux-2.6.16-git6/drivers/net/pcnet32.c	2006-03-28 16:08:23.398158717 -0600
@@ -1167,8 +1167,8 @@ pcnet32_probe1(unsigned long ioaddr, int
 	 * station address PROM at the base address and programmed into the
 	 * "Physical Address Registers" CSR12-14.
 	 * As a precautionary measure, we read the PROM values and complain if
-	 * they disagree with the CSRs.  Either way, we use the CSR values, and
-	 * double check that they are valid.
+	 * they disagree with the CSRs.  If they miscompare, and the PROM addr
+	 * is valid, then the PROM addr is used.
 	 */
 	for (i = 0; i < 3; i++) {
 		unsigned int val;

^ permalink raw reply

* Re: [PATCH 1/8] [I/OAT] DMA memcpy subsystem
From: Kumar Gala @ 2006-03-28 23:03 UTC (permalink / raw)
  To: Andrew Grover; +Cc: Chris Leech, linux kernel mailing list, netdev
In-Reply-To: <c0a09e5c0603281401uaeea6aci57054aef444a5e1@mail.gmail.com>


On Mar 28, 2006, at 4:01 PM, Andrew Grover wrote:

> On 3/28/06, Kumar Gala <galak@kernel.crashing.org> wrote:
>
>>>> Also, what do you think about adding an operation type (MEMCPY,  
>>>> XOR,
>>>> CRYPTO_AES, etc).  We can than validate if the operation type
>>>> expected is supported by the devices that exist.
>>>
>>> No objections, but this speculative support doesn't need to be in  
>>> our
>>> initial patchset.
>>
>> I don't consider it speculative.  The patch is for a generic DMA
>> engine interface.  That interface should encompass all users.  I have
>> a security/crypto DMA engine that I'd like to front with the generic
>> DMA interface today.  Also, I believe there is another Intel group
>> with an XOR engine that had a similar concept called ADMA posted a
>> while ago.
>
> Please submit patches then. We will be doing another rev of the I/OAT
> patch very soon, which you will be able to patch against. Or, once the
> patch gets in mainline then we can enhance it. Code in the Linux
> kernel is never "done", and the burden of implementing additional
> functionality falls on those who want it.

I completely understand that.  However, I think putting something  
into mainline that only works or solves the particular problem you  
have is a bad idea.  I'll provide patches for the changes I'd like to  
see.  However, I figured a little discussion on the subject before I  
went off an spent time on it was worth while.

>> Can you explain what the semantics are.
>>
>> It's been a little while since I posted so my thoughts on the subject
>> are going to take a little while to come back to me :)
>
> Yeah. Basically you register as a DMA client, and say how many DMA
> channels you want. Our net_dma patch for example uses multiple
> channels to help lock contention. Then when channels are available
> (i.e. a DMA device added or another client gives them up) then you get
> a callback. If the channel goes away (i.e. DMA device is removed
> (theoretically possible but practically never happens) or *you* are
> going away and change your request to 0 channels) then you get a
> remove callback.

Do you only get callback when a channel is available?  How do you  
decide to do to provide PIO to the client?

A client should only request multiple channel to handle multiple  
concurrent operations.

> This gets around the problem of DMA clients registering (and therefore
> not getting) channels simply because they init before the DMA device
> is discovered.

What do you expect to happen in a system in which the channels are  
over subscribed?

Do you expect the DMA device driver to handle scheduling of channels  
between multiple clients?

- kumar

^ permalink raw reply

* Re: dcache leak in 2.6.16-git8 II
From: Andi Kleen @ 2006-03-29 22:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: bharata, linux-kernel, netdev
In-Reply-To: <20060327190027.24498e3a.akpm@osdl.org>

On Tuesday 28 March 2006 05:00, Andrew Morton wrote:
> Andi Kleen <ak@suse.de> wrote:
> >
> > On Monday 27 March 2006 13:48, Bharata B Rao wrote:
> > > On Mon, Mar 27, 2006 at 07:50:20AM +0200, Andi Kleen wrote:
> > > > 
> > > > A 2GB x86-64 desktop system here is currently swapping itself to death after
> > > > a few days uptime.
> > > > 
> > > > Some investigation shows this:
> > > > 
> > > > inode_cache         1287   1337    568    7    1 : tunables   54   27    8 : slabdata    191    191      0
> > > > dentry_cache      1867436 1867643    208   19    1 : tunables  120   60    8 : slabdata  98297  98297      0
> > > > 
> > > 
> > > Would it be possible to try out this experimental patch which
> > > gets some stats from the dentry cache ?
> > 
> > It should be trivial to reproduce by other people. Biggest workload
> > is kernel compiles and quilt.
> > 
> > After a few hours with -git12 it's already at
> > 
> > dentry_cache      947013 952014    208   19    1 : tunables  120   60    8 : slabdata  50100  50106    480
> > 
> > and starting to go into swap.
> > 
> > I can't imagine I'm the only one seeing this?
> > 
> > I have a few x86-64 patches applied too, but they don't change anything
> > in this area.
> 
> I don't think I can reproduce this on x86 uniproc.  (avtab_node_cache
> is a different story - maintainers separately pinged).


I got another OOM from this with -git13. Unfortunately it seems to take a few days at least
to trigger.

dentry_cache      999168 1024594    208   19    1 : tunables  120   60    8 : slabdata  53926  53926      0 : shrinker stat 18522624 8871000

Hrm interesting is this one:

sock_inode_cache  996784 996805    704    5    1 : tunables   54   27    8 : slabdata 199361 199361      0

Most of the leaked dentries seem to be sockets. I didn't notice this earlier.

This was with the debugging patches applied btw. 

So maybe we have a socket leak?

I still got a copy of the /proc in case anybody wants more information.

-Andi
 
meminfo debugging was

r_dentries/page        nr_pages        nr_inuse
         0              0               0
         1              139             139
         2              115             230
         3              99              297
         4              103             412
         5              84              419
         6              80              480
         7              75              519
         8              65              520
         9              52              466
        10              58              580
        11              55              605
        12              73              868
        13              124             1610
        14              184             2576
        15              313             4685
        16              518             8288
        17              1338            22744
        18              4591            82633
        19              45843           870758
        20              0               0
        21              0               0
        22              0               0
        23              0               0
        24              0               0
        25              0               0
        26              0               0
        27              0               0
        28              0               0
        29              0               0
Total:                  53909           998829
dcache lru: total 232 inuse 1

^ permalink raw reply

* Re: [PATCH] Janitor: drivers/net/pcnet32: fix incorrect comments
From: Jeff Garzik @ 2006-03-29 22:35 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: tsbogend, donf, jklewis, linuxppc-dev, linux-kernel, netdev
In-Reply-To: <20060328223623.GD2172@austin.ibm.com>

Linas Vepstas wrote:
> Please sign-off/ack/apply and/or forward upstream.
> 
> [PATCH] Janitor: drivers/net/pcnet32: fix incorrect comments
> 
> The comments concerning how the pcnet32 ethernet device driver selects 
> the MAC addr to use are incorrect. A recent patch (in the last 3 months)
> changed how the code worked, but did not change the comments.
> 
> Side comment: the new behaviour is good; I've got a pcnet32 card which
> powers up with garbage in the CSR's, and a good MAC addr in the PROM.
> 
> Signed-off-by: Linas Vepstas <linas@linas.org>

applied

^ permalink raw reply

* Re: dcache leak in 2.6.16-git8 II
From: Andrew Morton @ 2006-03-29 22:50 UTC (permalink / raw)
  To: Andi Kleen; +Cc: bharata, linux-kernel, netdev
In-Reply-To: <200603300026.59131.ak@suse.de>

Andi Kleen <ak@suse.de> wrote:
>
> On Tuesday 28 March 2006 05:00, Andrew Morton wrote:
> > Andi Kleen <ak@suse.de> wrote:
> > >
> > > On Monday 27 March 2006 13:48, Bharata B Rao wrote:
> > > > On Mon, Mar 27, 2006 at 07:50:20AM +0200, Andi Kleen wrote:
> > > > > 
> > > > > A 2GB x86-64 desktop system here is currently swapping itself to death after
> > > > > a few days uptime.
> > > > > 
> > > > > Some investigation shows this:
> > > > > 
> > > > > inode_cache         1287   1337    568    7    1 : tunables   54   27    8 : slabdata    191    191      0
> > > > > dentry_cache      1867436 1867643    208   19    1 : tunables  120   60    8 : slabdata  98297  98297      0
> > > > > 
> > > > 
> > > > Would it be possible to try out this experimental patch which
> > > > gets some stats from the dentry cache ?
> > > 
> > > It should be trivial to reproduce by other people. Biggest workload
> > > is kernel compiles and quilt.
> > > 
> > > After a few hours with -git12 it's already at
> > > 
> > > dentry_cache      947013 952014    208   19    1 : tunables  120   60    8 : slabdata  50100  50106    480
> > > 
> > > and starting to go into swap.
> > > 
> > > I can't imagine I'm the only one seeing this?
> > > 
> > > I have a few x86-64 patches applied too, but they don't change anything
> > > in this area.
> > 
> > I don't think I can reproduce this on x86 uniproc.  (avtab_node_cache
> > is a different story - maintainers separately pinged).
> 
> 
> I got another OOM from this with -git13. Unfortunately it seems to take a few days at least
> to trigger.
> 
> dentry_cache      999168 1024594    208   19    1 : tunables  120   60    8 : slabdata  53926  53926      0 : shrinker stat 18522624 8871000
> 
> Hrm interesting is this one:
> 
> sock_inode_cache  996784 996805    704    5    1 : tunables   54   27    8 : slabdata 199361 199361      0
> 
> Most of the leaked dentries seem to be sockets. I didn't notice this earlier.
> 
> This was with the debugging patches applied btw. 
> 
> So maybe we have a socket leak?

It looks that way.  Didn't someone else report a sock_inode_cache leak?

> I still got a copy of the /proc in case anybody wants more information.

We have this fancy new /proc/slab_allocators now, it might show something
interesting.  It needs CONFIG_DEBUG_SLAB_LEAK.

^ permalink raw reply

* Re: dcache leak in 2.6.16-git8 II
From: Andi Kleen @ 2006-03-29 22:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: bharata, linux-kernel, netdev
In-Reply-To: <20060329145013.37c87323.akpm@osdl.org>

On Thursday 30 March 2006 00:50, Andrew Morton wrote:

> It looks that way.  Didn't someone else report a sock_inode_cache leak?

Didn't see it.
 
> > I still got a copy of the /proc in case anybody wants more information.
> 
> We have this fancy new /proc/slab_allocators now, it might show something
> interesting.  It needs CONFIG_DEBUG_SLAB_LEAK.

I didn't have that enabled unfortunately. I can try it on the next round.

-Andi

^ permalink raw reply

* [PATCH 2/9] I/OAT
From: Chris Leech @ 2006-03-29 22:54 UTC (permalink / raw)
  To: linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 589 bytes --]

[I/OAT] Driver for the Intel(R) I/OAT DMA engine

From: Chris Leech <christopher.leech@intel.com>

Adds a new ioatdma driver

Signed-off-by: Chris Leech <christopher.leech@intel.com>
---

 drivers/dma/Kconfig             |    9 
 drivers/dma/Makefile            |    1 
 drivers/dma/ioatdma.c           |  805
+++++++++++++++++++++++++++++++++++++++
 drivers/dma/ioatdma.h           |  126 ++++++
 drivers/dma/ioatdma_hw.h        |   52 +++
 drivers/dma/ioatdma_io.h        |  118 ++++++
 drivers/dma/ioatdma_registers.h |  128 ++++++
 7 files changed, 1239 insertions(+), 0 deletions(-)


[-- Attachment #2: ioatdma_driver.gz --]
[-- Type: application/x-gzip, Size: 9231 bytes --]

^ permalink raw reply

* [PATCH 0/9] I/OAT
From: Chris Leech @ 2006-03-29 22:55 UTC (permalink / raw)
  To: linux-kernel, netdev

This patch series is the a full release of the Intel(R) I/O
Acceleration Technology (I/OAT) for Linux.  It includes an in kernel API
for offloading memory copies to hardware, a driver for the I/OAT DMA memcpy
engine, and changes to the TCP stack to offload copies of received
networking data to application space.

Changes from last posting:
	Fixed a page reference leak that happened when offloaded copies were 
	set up but never used for a recv.
	Fixed the ioatdma self test to handle failures correctly.
	Serialized DMA ADD and REMOVE events in the networking core with a lock.
	Added a long comment in dmaengine.c to describe the locking and 
	reference counting being used.
	Disabled preempt around a use of get_cpu_var.
	Made tcp_dma_try_early_copy static, it is only used in one file.
	Made some GFP_ATOMIC allocations GFP_KERNEL where safe to sleep.
	Made changes to sk_eat_skb, removing some ifdefs in the TCP code.
	

These changes apply to DaveM's net-2.6.17 tree as of commit
68907dad58cd7ef11536e1db6baeb98b20af91b2 ([DCCP]: Use NULL for pointers, comfort sparse.)

They are available to pull from
	git://198.78.49.142/~cleech/linux-2.6 ioat-2.6.17

There are 9 patches in the series:
	1) The memcpy offload APIs and class code
	2) The Intel I/OAT DMA driver (ioatdma)
	3) Core networking code to setup networking as a DMA memcpy client
	4) Utility functions for sk_buff to iovec offloaded copy
	5) Structure changes needed for TCP receive offload
	6) Rename cleanup_rbuf to tcp_cleanup_rbuf
	7) Make sk_eat_skb aware of early copied packets
	8) Add a sysctl to tune the minimum offloaded I/O size for TCP
	9) The main TCP receive offload changes

--
Chris Leech <christopher.leech@intel.com>
I/O Acceleration Technology Software Development
LAN Access Division / Digital Enterprise Group 

^ permalink raw reply

* [PATCH 1/9] [I/OAT] DMA memcpy subsystem
From: Chris Leech @ 2006-03-29 22:55 UTC (permalink / raw)
  To: linux-kernel, netdev
In-Reply-To: <20060329225505.25585.30392.stgit@gitlost.site>

Provides an API for offloading memory copies to DMA devices

Signed-off-by: Chris Leech <christopher.leech@intel.com>
---

 drivers/Kconfig           |    2 
 drivers/Makefile          |    1 
 drivers/dma/Kconfig       |   13 +
 drivers/dma/Makefile      |    1 
 drivers/dma/dmaengine.c   |  405 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dmaengine.h |  337 +++++++++++++++++++++++++++++++++++++
 6 files changed, 759 insertions(+), 0 deletions(-)

diff --git a/drivers/Kconfig b/drivers/Kconfig
index 9f5c0da..f89ac05 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -72,4 +72,6 @@ source "drivers/edac/Kconfig"
 
 source "drivers/rtc/Kconfig"
 
+source "drivers/dma/Kconfig"
+
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index 4249552..9b808a6 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -74,3 +74,4 @@ obj-$(CONFIG_SGI_SN)		+= sn/
 obj-y				+= firmware/
 obj-$(CONFIG_CRYPTO)		+= crypto/
 obj-$(CONFIG_SUPERH)		+= sh/
+obj-$(CONFIG_DMA_ENGINE)	+= dma/
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
new file mode 100644
index 0000000..f9ac4bc
--- /dev/null
+++ b/drivers/dma/Kconfig
@@ -0,0 +1,13 @@
+#
+# DMA engine configuration
+#
+
+menu "DMA Engine support"
+
+config DMA_ENGINE
+	bool "Support for DMA engines"
+	---help---
+	  DMA engines offload copy operations from the CPU to dedicated
+	  hardware, allowing the copies to happen asynchronously.
+
+endmenu
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
new file mode 100644
index 0000000..10b7391
--- /dev/null
+++ b/drivers/dma/Makefile
@@ -0,0 +1 @@
+obj-y += dmaengine.o
diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
new file mode 100644
index 0000000..683456a
--- /dev/null
+++ b/drivers/dma/dmaengine.c
@@ -0,0 +1,405 @@
+/*
+ * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+
+/*
+ * This code implements the DMA subsystem. It provides a HW-neutral interface
+ * for other kernel code to use asynchronous memory copy capabilities,
+ * if present, and allows different HW DMA drivers to register as providing
+ * this capability.
+ *
+ * Due to the fact we are accelerating what is already a relatively fast
+ * operation, the code goes to great lengths to avoid additional overhead,
+ * such as locking.
+ *
+ * LOCKING:
+ *
+ * The subsystem keeps two global lists, dma_device_list and dma_client_list.
+ * Both of these are protected by a spinlock, dma_list_lock.
+ *
+ * Each device has a channels list, which runs unlocked but is never modified
+ * once the device is registered, it's just setup by the driver.
+ *
+ * Each client has a channels list, it's only modified under the client->lock
+ * and in an RCU callback, so it's safe to read under rcu_read_lock().
+ *
+ * Each device has a kref, which is initialized to 1 when the device is
+ * registered. A kref_put is done for each class_device registered.  When the
+ * class_device is released, the coresponding kref_put is done in the release
+ * method. Every time one of the device's channels is allocated to a client,
+ * a kref_get occurs.  When the channel is freed, the coresponding kref_put
+ * happens. The device's release function does a completion, so
+ * unregister_device does a remove event, class_device_unregister, a kref_put
+ * for the first reference, then waits on the completion for all other
+ * references to finish.
+ *
+ * Each channel has an open-coded implementation of Rusty Russell's "bigref,"
+ * with a kref and a per_cpu local_t.  A single reference is set when on an
+ * ADDED event, and removed with a REMOVE event.  Net DMA client takes an
+ * extra reference per outstanding transaction.  The relase function does a
+ * kref_put on the device. -ChrisL
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/dmaengine.h>
+#include <linux/hardirq.h>
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
+#include <linux/rcupdate.h>
+
+static DEFINE_SPINLOCK(dma_list_lock);
+static LIST_HEAD(dma_device_list);
+static LIST_HEAD(dma_client_list);
+
+/* --- sysfs implementation --- */
+
+static ssize_t show_memcpy_count(struct class_device *cd, char *buf)
+{
+	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
+	unsigned long count = 0;
+	int i;
+
+	for_each_cpu(i)
+		count += per_cpu_ptr(chan->local, i)->memcpy_count;
+
+	return sprintf(buf, "%lu\n", count);
+}
+
+static ssize_t show_bytes_transferred(struct class_device *cd, char *buf)
+{
+	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
+	unsigned long count = 0;
+	int i;
+
+	for_each_cpu(i)
+		count += per_cpu_ptr(chan->local, i)->bytes_transferred;
+
+	return sprintf(buf, "%lu\n", count);
+}
+
+static ssize_t show_in_use(struct class_device *cd, char *buf)
+{
+	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
+
+	return sprintf(buf, "%d\n", (chan->client ? 1 : 0));
+}
+
+static struct class_device_attribute dma_class_attrs[] = {
+	__ATTR(memcpy_count, S_IRUGO, show_memcpy_count, NULL),
+	__ATTR(bytes_transferred, S_IRUGO, show_bytes_transferred, NULL),
+	__ATTR(in_use, S_IRUGO, show_in_use, NULL),
+	__ATTR_NULL
+};
+
+static void dma_async_device_cleanup(struct kref *kref);
+
+static void dma_class_dev_release(struct class_device *cd)
+{
+	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
+	kref_put(&chan->device->refcount, dma_async_device_cleanup);
+}
+
+static struct class dma_devclass = {
+	.name            = "dma",
+	.class_dev_attrs = dma_class_attrs,
+	.release = dma_class_dev_release,
+};
+
+/* --- client and device registration --- */
+
+/**
+ * dma_client_chan_alloc - try to allocate a channel to a client
+ * @client: &dma_client
+ *
+ * Called with dma_list_lock held.
+ */
+static struct dma_chan *dma_client_chan_alloc(struct dma_client *client)
+{
+	struct dma_device *device;
+	struct dma_chan *chan;
+	unsigned long flags;
+	int desc;	/* allocated descriptor count */
+
+	/* Find a channel, any DMA engine will do */
+	list_for_each_entry(device, &dma_device_list, global_node) {
+		list_for_each_entry(chan, &device->channels, device_node) {
+			if (chan->client)
+				continue;
+
+			desc = chan->device->device_alloc_chan_resources(chan);
+			if (desc >= 0) {
+				kref_get(&device->refcount);
+				kref_init(&chan->refcount);
+				chan->slow_ref = 0;
+				INIT_RCU_HEAD(&chan->rcu);
+				chan->client = client;
+				spin_lock_irqsave(&client->lock, flags);
+				list_add_tail_rcu(&chan->client_node,
+				                  &client->channels);
+				spin_unlock_irqrestore(&client->lock, flags);
+				return chan;
+			}
+		}
+	}
+
+	return NULL;
+}
+
+/**
+ * dma_client_chan_free - release a DMA channel
+ * @chan: &dma_chan
+ */
+void dma_chan_cleanup(struct kref *kref)
+{
+	struct dma_chan *chan = container_of(kref, struct dma_chan, refcount);
+	chan->device->device_free_chan_resources(chan);
+	chan->client = NULL;
+	kref_put(&chan->device->refcount, dma_async_device_cleanup);
+}
+
+static void dma_chan_free_rcu(struct rcu_head *rcu)
+{
+	struct dma_chan *chan = container_of(rcu, struct dma_chan, rcu);
+	int bias = 0x7FFFFFFF;
+	int i;
+	for_each_cpu(i)
+		bias -= local_read(&per_cpu_ptr(chan->local, i)->refcount);
+	atomic_sub(bias, &chan->refcount.refcount);
+	kref_put(&chan->refcount, dma_chan_cleanup);
+}
+
+static void dma_client_chan_free(struct dma_chan *chan)
+{
+	atomic_add(0x7FFFFFFF, &chan->refcount.refcount);
+	chan->slow_ref = 1;
+	call_rcu(&chan->rcu, dma_chan_free_rcu);
+}
+
+/**
+ * dma_chans_rebalance - reallocate channels to clients
+ *
+ * When the number of DMA channel in the system changes,
+ * channels need to be rebalanced among clients
+ */
+static void dma_chans_rebalance(void)
+{
+	struct dma_client *client;
+	struct dma_chan *chan;
+	unsigned long flags;
+
+	spin_lock(&dma_list_lock);
+	list_for_each_entry(client, &dma_client_list, global_node) {
+		while (client->chans_desired > client->chan_count) {
+			chan = dma_client_chan_alloc(client);
+			if (!chan)
+				break;
+			client->chan_count++;
+			client->event_callback(client,
+	                                       chan,
+	                                       DMA_RESOURCE_ADDED);
+		}
+		while (client->chans_desired < client->chan_count) {
+			spin_lock_irqsave(&client->lock, flags);
+			chan = list_entry(client->channels.next,
+			                  struct dma_chan,
+			                  client_node);
+			list_del_rcu(&chan->client_node);
+			spin_unlock_irqrestore(&client->lock, flags);
+			client->chan_count--;
+			client->event_callback(client,
+			                       chan,
+			                       DMA_RESOURCE_REMOVED);
+			dma_client_chan_free(chan);
+		}
+	}
+	spin_unlock(&dma_list_lock);
+}
+
+/**
+ * dma_async_client_register - allocate and register a &dma_client
+ * @event_callback: callback for notification of channel addition/removal
+ */
+struct dma_client *dma_async_client_register(dma_event_callback event_callback)
+{
+	struct dma_client *client;
+
+	client = kzalloc(sizeof(*client), GFP_KERNEL);
+	if (!client)
+		return NULL;
+
+	INIT_LIST_HEAD(&client->channels);
+	spin_lock_init(&client->lock);
+	client->chans_desired = 0;
+	client->chan_count = 0;
+	client->event_callback = event_callback;
+
+	spin_lock(&dma_list_lock);
+	list_add_tail(&client->global_node, &dma_client_list);
+	spin_unlock(&dma_list_lock);
+
+	return client;
+}
+
+/**
+ * dma_async_client_unregister - unregister a client and free the &dma_client
+ * @client:
+ *
+ * Force frees any allocated DMA channels, frees the &dma_client memory
+ */
+void dma_async_client_unregister(struct dma_client *client)
+{
+	struct dma_chan *chan;
+
+	if (!client)
+		return;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(chan, &client->channels, client_node)
+		dma_client_chan_free(chan);
+	rcu_read_unlock();
+
+	spin_lock(&dma_list_lock);
+	list_del(&client->global_node);
+	spin_unlock(&dma_list_lock);
+
+	kfree(client);
+	dma_chans_rebalance();
+}
+
+/**
+ * dma_async_client_chan_request - request DMA channels
+ * @client: &dma_client
+ * @number: count of DMA channels requested
+ *
+ * Clients call dma_async_client_chan_request() to specify how many
+ * DMA channels they need, 0 to free all currently allocated.
+ * The resulting allocations/frees are indicated to the client via the
+ * event callback.
+ */
+void dma_async_client_chan_request(struct dma_client *client,
+			unsigned int number)
+{
+	client->chans_desired = number;
+	dma_chans_rebalance();
+}
+
+/**
+ * dma_async_device_register -
+ * @device: &dma_device
+ */
+int dma_async_device_register(struct dma_device *device)
+{
+	static int id;
+	int chancnt = 0;
+	struct dma_chan* chan;
+
+	if (!device)
+		return -ENODEV;
+
+	init_completion(&device->done);
+	kref_init(&device->refcount);
+	device->dev_id = id++;
+
+	/* represent channels in sysfs. Probably want devs too */
+	list_for_each_entry(chan, &device->channels, device_node) {
+		chan->local = alloc_percpu(typeof(*chan->local));
+		if (chan->local == NULL)
+			continue;
+
+		chan->chan_id = chancnt++;
+		chan->class_dev.class = &dma_devclass;
+		chan->class_dev.dev = NULL;
+		snprintf(chan->class_dev.class_id, BUS_ID_SIZE, "dma%dchan%d",
+		         device->dev_id, chan->chan_id);
+
+		kref_get(&device->refcount);
+		class_device_register(&chan->class_dev);
+	}
+
+	spin_lock(&dma_list_lock);
+	list_add_tail(&device->global_node, &dma_device_list);
+	spin_unlock(&dma_list_lock);
+
+	dma_chans_rebalance();
+
+	return 0;
+}
+
+/**
+ * dma_async_device_unregister -
+ * @device: &dma_device
+ */
+static void dma_async_device_cleanup(struct kref *kref)
+{
+	struct dma_device *device;
+
+	device = container_of(kref, struct dma_device, refcount);
+	complete(&device->done);
+}
+
+void dma_async_device_unregister(struct dma_device* device)
+{
+	struct dma_chan *chan;
+	unsigned long flags;
+
+	spin_lock(&dma_list_lock);
+	list_del(&device->global_node);
+	spin_unlock(&dma_list_lock);
+
+	list_for_each_entry(chan, &device->channels, device_node) {
+		if (chan->client) {
+			spin_lock_irqsave(&chan->client->lock, flags);
+			list_del(&chan->client_node);
+			chan->client->chan_count--;
+			spin_unlock_irqrestore(&chan->client->lock, flags);
+			chan->client->event_callback(chan->client,
+			                             chan,
+			                             DMA_RESOURCE_REMOVED);
+			dma_client_chan_free(chan);
+		}
+		class_device_unregister(&chan->class_dev);
+	}
+	dma_chans_rebalance();
+
+	kref_put(&device->refcount, dma_async_device_cleanup);
+	wait_for_completion(&device->done);
+}
+
+static int __init dma_bus_init(void)
+{
+	spin_lock_init(&dma_list_lock);
+	return class_register(&dma_devclass);
+}
+
+subsys_initcall(dma_bus_init);
+
+EXPORT_SYMBOL(dma_async_client_register);
+EXPORT_SYMBOL(dma_async_client_unregister);
+EXPORT_SYMBOL(dma_async_client_chan_request);
+EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf);
+EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg);
+EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg);
+EXPORT_SYMBOL(dma_async_memcpy_complete);
+EXPORT_SYMBOL(dma_async_memcpy_issue_pending);
+EXPORT_SYMBOL(dma_async_device_register);
+EXPORT_SYMBOL(dma_async_device_unregister);
+EXPORT_SYMBOL(dma_chan_cleanup);
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
new file mode 100644
index 0000000..3078154
--- /dev/null
+++ b/include/linux/dmaengine.h
@@ -0,0 +1,337 @@
+/*
+ * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+#ifndef DMAENGINE_H
+#define DMAENGINE_H
+#include <linux/config.h>
+#ifdef CONFIG_DMA_ENGINE
+
+#include <linux/device.h>
+#include <linux/uio.h>
+#include <linux/kref.h>
+#include <linux/completion.h>
+#include <linux/rcupdate.h>
+
+/**
+ * enum dma_event - resource PNP/power managment events
+ * @DMA_RESOURCE_SUSPEND: DMA device going into low power state
+ * @DMA_RESOURCE_RESUME: DMA device returning to full power
+ * @DMA_RESOURCE_ADDED: DMA device added to the system
+ * @DMA_RESOURCE_REMOVED: DMA device removed from the system
+ */
+enum dma_event {
+	DMA_RESOURCE_SUSPEND,
+	DMA_RESOURCE_RESUME,
+	DMA_RESOURCE_ADDED,
+	DMA_RESOURCE_REMOVED,
+};
+
+/**
+ * typedef dma_cookie_t
+ *
+ * if dma_cookie_t is >0 it's a DMA request cookie, <0 it's an error code
+ */
+typedef s32 dma_cookie_t;
+
+#define dma_submit_error(cookie) ((cookie) < 0 ? 1 : 0)
+
+/**
+ * enum dma_status - DMA transaction status
+ * @DMA_SUCCESS: transaction completed successfully
+ * @DMA_IN_PROGRESS: transaction not yet processed
+ * @DMA_ERROR: transaction failed
+ */
+enum dma_status {
+	DMA_SUCCESS,
+	DMA_IN_PROGRESS,
+	DMA_ERROR,
+};
+
+/**
+ * struct dma_chan_percpu - the per-CPU part of struct dma_chan
+ * @refcount: local_t used for open-coded "bigref" counting
+ * @memcpy_count: transaction counter
+ * @bytes_transferred: byte counter
+ */
+
+struct dma_chan_percpu {
+	local_t refcount;
+	/* stats */
+	unsigned long memcpy_count;
+	unsigned long bytes_transferred;
+};
+
+/**
+ * struct dma_chan - devices supply DMA channels, clients use them
+ * @client: ptr to the client user of this chan, will be NULL when unused
+ * @device: ptr to the dma device who supplies this channel, always !NULL
+ * @cookie: last cookie value returned to client
+ * @chan_id:
+ * @class_dev:
+ * @refcount: kref, used in "bigref" slow-mode
+ * @slow_ref:
+ * @rcu:
+ * @client_node: used to add this to the client chan list
+ * @device_node: used to add this to the device chan list
+ * @local: per-cpu pointer to a struct dma_chan_percpu
+ */
+struct dma_chan {
+	struct dma_client *client;
+	struct dma_device *device;
+	dma_cookie_t cookie;
+
+	/* sysfs */
+	int chan_id;
+	struct class_device class_dev;
+
+	struct kref refcount;
+	int slow_ref;
+	struct rcu_head rcu;
+
+	struct list_head client_node;
+	struct list_head device_node;
+	struct dma_chan_percpu *local;
+};
+
+void dma_chan_cleanup(struct kref *kref);
+
+static inline void dma_chan_get(struct dma_chan *chan)
+{
+	if (unlikely(chan->slow_ref))
+		kref_get(&chan->refcount);
+	else {
+		local_inc(&(per_cpu_ptr(chan->local, get_cpu())->refcount));
+		put_cpu();
+	}
+}
+
+static inline void dma_chan_put(struct dma_chan *chan)
+{
+	if (unlikely(chan->slow_ref))
+		kref_put(&chan->refcount, dma_chan_cleanup);
+	else {
+		local_dec(&(per_cpu_ptr(chan->local, get_cpu())->refcount));
+		put_cpu();
+	}
+}
+
+/*
+ * typedef dma_event_callback - function pointer to a DMA event callback
+ */
+typedef void (*dma_event_callback) (struct dma_client *client,
+		struct dma_chan *chan, enum dma_event event);
+
+/**
+ * struct dma_client - info on the entity making use of DMA services
+ * @event_callback: func ptr to call when something happens
+ * @chan_count: number of chans allocated
+ * @chans_desired: number of chans requested. Can be +/- chan_count
+ * @lock: protects access to the channels list
+ * @channels: the list of DMA channels allocated
+ * @global_node: list_head for global dma_client_list
+ */
+struct dma_client {
+	dma_event_callback	event_callback;
+	unsigned int		chan_count;
+	unsigned int		chans_desired;
+
+	spinlock_t		lock;
+	struct list_head	channels;
+	struct list_head	global_node;
+};
+
+/**
+ * struct dma_device - info on the entity supplying DMA services
+ * @chancnt: how many DMA channels are supported
+ * @channels: the list of struct dma_chan
+ * @global_node: list_head for global dma_device_list
+ * @refcount:
+ * @done:
+ * @dev_id:
+ * Other func ptrs: used to make use of this device's capabilities
+ */
+struct dma_device {
+
+	unsigned int chancnt;
+	struct list_head channels;
+	struct list_head global_node;
+
+	struct kref refcount;
+	struct completion done;
+
+	int dev_id;
+
+	int (*device_alloc_chan_resources)(struct dma_chan *chan);
+	void (*device_free_chan_resources)(struct dma_chan *chan);
+	dma_cookie_t (*device_memcpy_buf_to_buf)(struct dma_chan *chan,
+			void *dest, void *src, size_t len);
+	dma_cookie_t (*device_memcpy_buf_to_pg)(struct dma_chan *chan,
+			struct page *page, unsigned int offset, void *kdata,
+			size_t len);
+	dma_cookie_t (*device_memcpy_pg_to_pg)(struct dma_chan *chan,
+			struct page *dest_pg, unsigned int dest_off,
+			struct page *src_pg, unsigned int src_off, size_t len);
+	enum dma_status (*device_memcpy_complete)(struct dma_chan *chan,
+			dma_cookie_t cookie, dma_cookie_t *last,
+			dma_cookie_t *used);
+	void (*device_memcpy_issue_pending)(struct dma_chan *chan);
+};
+
+/* --- public DMA engine API --- */
+
+struct dma_client *dma_async_client_register(dma_event_callback event_callback);
+void dma_async_client_unregister(struct dma_client *client);
+void dma_async_client_chan_request(struct dma_client *client,
+		unsigned int number);
+
+/**
+ * dma_async_memcpy_buf_to_buf - offloaded copy between virtual addresses
+ * @chan: DMA channel to offload copy to
+ * @dest: destination address (virtual)
+ * @src: source address (virtual)
+ * @len: length
+ *
+ * Both @dest and @src must be mappable to a bus address according to the
+ * DMA mapping API rules for streaming mappings.
+ * Both @dest and @src must stay memory resident (kernel memory or locked
+ * user space pages)
+ */
+static inline dma_cookie_t dma_async_memcpy_buf_to_buf(struct dma_chan *chan,
+	void *dest, void *src, size_t len)
+{
+	int cpu = get_cpu();
+	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
+	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	put_cpu();
+
+	return chan->device->device_memcpy_buf_to_buf(chan, dest, src, len);
+}
+
+/**
+ * dma_async_memcpy_buf_to_pg - offloaded copy
+ * @chan: DMA channel to offload copy to
+ * @page: destination page
+ * @offset: offset in page to copy to
+ * @kdata: source address (virtual)
+ * @len: length
+ *
+ * Both @page/@offset and @kdata must be mappable to a bus address according
+ * to the DMA mapping API rules for streaming mappings.
+ * Both @page/@offset and @kdata must stay memory resident (kernel memory or
+ * locked user space pages)
+ */
+static inline dma_cookie_t dma_async_memcpy_buf_to_pg(struct dma_chan *chan,
+	struct page *page, unsigned int offset, void *kdata, size_t len)
+{
+	int cpu = get_cpu();
+	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
+	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	put_cpu();
+
+	return chan->device->device_memcpy_buf_to_pg(chan, page, offset,
+	                                             kdata, len);
+}
+
+/**
+ * dma_async_memcpy_buf_to_pg - offloaded copy
+ * @chan: DMA channel to offload copy to
+ * @dest_page: destination page
+ * @dest_off: offset in page to copy to
+ * @src_page: source page
+ * @src_off: offset in page to copy from
+ * @len: length
+ *
+ * Both @dest_page/@dest_off and @src_page/@src_off must be mappable to a bus
+ * address according to the DMA mapping API rules for streaming mappings.
+ * Both @dest_page/@dest_off and @src_page/@src_off must stay memory resident
+ * (kernel memory or locked user space pages)
+ */
+static inline dma_cookie_t dma_async_memcpy_pg_to_pg(struct dma_chan *chan,
+	struct page *dest_pg, unsigned int dest_off, struct page *src_pg,
+	unsigned int src_off, size_t len)
+{
+	int cpu = get_cpu();
+	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
+	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	put_cpu();
+
+	return chan->device->device_memcpy_pg_to_pg(chan, dest_pg, dest_off,
+	                                            src_pg, src_off, len);
+}
+
+/**
+ * dma_async_memcpy_issue_pending - flush pending copies to HW
+ * @chan:
+ *
+ * This allows drivers to push copies to HW in batches,
+ * reducing MMIO writes where possible.
+ */
+static inline void dma_async_memcpy_issue_pending(struct dma_chan *chan)
+{
+	return chan->device->device_memcpy_issue_pending(chan);
+}
+
+/**
+ * dma_async_memcpy_complete - poll for transaction completion
+ * @chan: DMA channel
+ * @cookie: transaction identifier to check status of
+ * @last: returns last completed cookie, can be NULL
+ * @used: returns last issued cookie, can be NULL
+ *
+ * If @last and @used are passed in, upon return they reflect the driver
+ * internal state and can be used with dma_async_is_complete() to check
+ * the status of multiple cookies without re-checking hardware state.
+ */
+static inline enum dma_status dma_async_memcpy_complete(struct dma_chan *chan,
+	dma_cookie_t cookie, dma_cookie_t *last, dma_cookie_t *used)
+{
+	return chan->device->device_memcpy_complete(chan, cookie, last, used);
+}
+
+/**
+ * dma_async_is_complete - test a cookie against chan state
+ * @cookie: transaction identifier to test status of
+ * @last_complete: last know completed transaction
+ * @last_used: last cookie value handed out
+ *
+ * dma_async_is_complete() is used in dma_async_memcpy_complete()
+ * the test logic is seperated for lightweight testing of multiple cookies
+ */
+static inline enum dma_status dma_async_is_complete(dma_cookie_t cookie,
+			dma_cookie_t last_complete, dma_cookie_t last_used)
+{
+	if (last_complete <= last_used) {
+		if ((cookie <= last_complete) || (cookie > last_used))
+			return DMA_SUCCESS;
+	} else {
+		if ((cookie <= last_complete) && (cookie > last_used))
+			return DMA_SUCCESS;
+	}
+	return DMA_IN_PROGRESS;
+}
+
+
+/* --- DMA device --- */
+
+int dma_async_device_register(struct dma_device *device);
+void dma_async_device_unregister(struct dma_device *device);
+
+#endif /* CONFIG_DMA_ENGINE */
+#endif /* DMAENGINE_H */

^ permalink raw reply related

* [PATCH 3/9] [I/OAT] Setup the networking subsystem as a DMA client
From: Chris Leech @ 2006-03-29 22:55 UTC (permalink / raw)
  To: linux-kernel, netdev
In-Reply-To: <20060329225505.25585.30392.stgit@gitlost.site>

Attempts to allocate per-CPU DMA channels

Signed-off-by: Chris Leech <christopher.leech@intel.com>
---

 drivers/dma/Kconfig       |   12 +++++
 include/linux/netdevice.h |    4 ++
 include/net/netdma.h      |   38 ++++++++++++++++
 net/core/dev.c            |  104 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 158 insertions(+), 0 deletions(-)

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 0f15e76..30d021d 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -10,6 +10,18 @@ config DMA_ENGINE
 	  DMA engines offload copy operations from the CPU to dedicated
 	  hardware, allowing the copies to happen asynchronously.
 
+comment "DMA Clients"
+
+config NET_DMA
+	bool "Network: TCP receive copy offload"
+	depends on DMA_ENGINE && NET
+	default y
+	---help---
+	  This enables the use of DMA engines in the network stack to
+	  offload receive copy-to-user operations, freeing CPU cycles.
+	  Since this is the main user of the DMA engine, it should be enabled;
+	  say Y here.
+
 comment "DMA Devices"
 
 config INTEL_IOATDMA
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 950dc55..7fda35f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -37,6 +37,7 @@
 #include <linux/config.h>
 #include <linux/device.h>
 #include <linux/percpu.h>
+#include <linux/dmaengine.h>
 
 struct divert_blk;
 struct vlan_group;
@@ -592,6 +593,9 @@ struct softnet_data
 	struct sk_buff		*completion_queue;
 
 	struct net_device	backlog_dev;	/* Sorry. 8) */
+#ifdef CONFIG_NET_DMA
+	struct dma_chan		*net_dma;
+#endif
 };
 
 DECLARE_PER_CPU(struct softnet_data,softnet_data);
diff --git a/include/net/netdma.h b/include/net/netdma.h
new file mode 100644
index 0000000..cbfe89d
--- /dev/null
+++ b/include/net/netdma.h
@@ -0,0 +1,38 @@
+/*
+ * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+#ifndef NETDMA_H
+#define NETDMA_H
+#include <linux/config.h>
+#ifdef CONFIG_NET_DMA
+#include <linux/dmaengine.h>
+
+static inline struct dma_chan *get_softnet_dma(void)
+{
+	struct dma_chan *chan;
+	rcu_read_lock();
+	chan = rcu_dereference(__get_cpu_var(softnet_data.net_dma));
+	if (chan)
+		dma_chan_get(chan);
+	rcu_read_unlock();
+	return chan;
+}
+#endif /* CONFIG_NET_DMA */
+#endif /* NETDMA_H */
diff --git a/net/core/dev.c b/net/core/dev.c
index a3ab11f..ffd3d6d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -115,6 +115,7 @@
 #include <net/iw_handler.h>
 #include <asm/current.h>
 #include <linux/audit.h>
+#include <linux/dmaengine.h>
 
 /*
  *	The list of packet types we will receive (as opposed to discard)
@@ -148,6 +149,12 @@ static DEFINE_SPINLOCK(ptype_lock);
 static struct list_head ptype_base[16];	/* 16 way hashed list */
 static struct list_head ptype_all;		/* Taps */
 
+#ifdef CONFIG_NET_DMA
+static struct dma_client *net_dma_client;
+static unsigned int net_dma_count;
+static spinlock_t net_dma_event_lock;
+#endif
+
 /*
  * The @dev_base list is protected by @dev_base_lock and the rtln
  * semaphore.
@@ -1780,6 +1787,19 @@ static void net_rx_action(struct softirq
 		}
 	}
 out:
+#ifdef CONFIG_NET_DMA
+	/*
+	 * There may not be any more sk_buffs coming right now, so push
+	 * any pending DMA copies to hardware
+	 */
+	if (net_dma_client) {
+		struct dma_chan *chan;
+		rcu_read_lock();
+		list_for_each_entry_rcu(chan, &net_dma_client->channels, client_node)
+			dma_async_memcpy_issue_pending(chan);
+		rcu_read_unlock();
+	}
+#endif
 	local_irq_enable();
 	return;
 
@@ -3243,6 +3263,88 @@ static int dev_cpu_callback(struct notif
 }
 #endif /* CONFIG_HOTPLUG_CPU */
 
+#ifdef CONFIG_NET_DMA
+/**
+ * net_dma_rebalance -
+ * This is called when the number of channels allocated to the net_dma_client
+ * changes.  The net_dma_client tries to have one DMA channel per CPU.
+ */
+static void net_dma_rebalance(void)
+{
+	unsigned int cpu, i, n;
+	struct dma_chan *chan;
+
+	lock_cpu_hotplug();
+
+	if (net_dma_count == 0) {
+		for_each_online_cpu(cpu)
+			rcu_assign_pointer(per_cpu(softnet_data.net_dma, cpu), NULL);
+		unlock_cpu_hotplug();
+		return;
+	}
+
+	i = 0;
+	cpu = first_cpu(cpu_online_map);
+
+	rcu_read_lock();
+	list_for_each_entry(chan, &net_dma_client->channels, client_node) {
+		n = ((num_online_cpus() / net_dma_count)
+		   + (i < (num_online_cpus() % net_dma_count) ? 1 : 0));
+
+		while(n) {
+			per_cpu(softnet_data.net_dma, cpu) = chan;
+			cpu = next_cpu(cpu, cpu_online_map);
+			n--;
+		}
+		i++;
+	}
+	rcu_read_unlock();
+
+	unlock_cpu_hotplug();
+}
+
+/**
+ * netdev_dma_event - event callback for the net_dma_client
+ * @client: should always be net_dma_client
+ * @chan:
+ * @event:
+ */
+static void netdev_dma_event(struct dma_client *client, struct dma_chan *chan,
+	enum dma_event event)
+{
+	spin_lock(&net_dma_event_lock);
+	switch (event) {
+	case DMA_RESOURCE_ADDED:
+		net_dma_count++;
+		net_dma_rebalance();
+		break;
+	case DMA_RESOURCE_REMOVED:
+		net_dma_count--;
+		net_dma_rebalance();
+		break;
+	default:
+		break;
+	}
+	spin_unlock(&net_dma_event_lock);
+}
+
+/**
+ * netdev_dma_regiser - register the networking subsystem as a DMA client
+ */
+static int __init netdev_dma_register(void)
+{
+	spin_lock_init(&net_dma_event_lock);
+	net_dma_client = dma_async_client_register(netdev_dma_event);
+	if (net_dma_client == NULL)
+		return -ENOMEM;
+
+	dma_async_client_chan_request(net_dma_client, num_online_cpus());
+	return 0;
+}
+
+#else
+static int __init netdev_dma_register(void) { return -ENODEV; }
+#endif /* CONFIG_NET_DMA */
 
 /*
  *	Initialize the DEV module. At boot time this walks the device list and
@@ -3296,6 +3398,8 @@ static int __init net_dev_init(void)
 		atomic_set(&queue->backlog_dev.refcnt, 1);
 	}
 
+	netdev_dma_register();
+
 	dev_boot_phase = 0;
 
 	open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);

^ permalink raw reply related

* [PATCH 4/9] [I/OAT] Utility functions for offloading sk_buff to iovec copies
From: Chris Leech @ 2006-03-29 22:55 UTC (permalink / raw)
  To: linux-kernel, netdev
In-Reply-To: <20060329225505.25585.30392.stgit@gitlost.site>

Provides for pinning user space pages in memory, copying to iovecs,
and copying from sk_buffs including fragmented and chained sk_buffs.

Signed-off-by: Chris Leech <christopher.leech@intel.com>
---

 drivers/dma/Makefile      |    3 
 drivers/dma/iovlock.c     |  301 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dmaengine.h |   22 +++
 include/net/netdma.h      |    6 +
 net/core/Makefile         |    1 
 net/core/user_dma.c       |  141 +++++++++++++++++++++
 6 files changed, 473 insertions(+), 1 deletions(-)

diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index c8a5f56..bdcfdbd 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -1,2 +1,3 @@
-obj-y += dmaengine.o
+obj-$(CONFIG_DMA_ENGINE) += dmaengine.o
+obj-$(CONFIG_NET_DMA) += iovlock.o
 obj-$(CONFIG_INTEL_IOATDMA) += ioatdma.o
diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
new file mode 100644
index 0000000..5ed327e
--- /dev/null
+++ b/drivers/dma/iovlock.c
@@ -0,0 +1,301 @@
+/*
+ * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
+ * Portions based on net/core/datagram.c and copyrighted by their authors.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+
+/*
+ * This code allows the net stack to make use of a DMA engine for
+ * skb to iovec copies.
+ */
+
+#include <linux/dmaengine.h>
+#include <linux/pagemap.h>
+#include <net/tcp.h> /* for memcpy_toiovec */
+#include <asm/io.h>
+#include <asm/uaccess.h>
+
+int num_pages_spanned(struct iovec *iov)
+{
+	return
+	((PAGE_ALIGN((unsigned long)iov->iov_base + iov->iov_len) -
+	((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT);
+}
+
+/*
+ * Pin down all the iovec pages needed for len bytes.
+ * Return a struct dma_pinned_list to keep track of pages pinned down.
+ *
+ * We are allocating a single chunk of memory, and then carving it up into
+ * 3 sections, the latter 2 whose size depends on the number of iovecs and the
+ * total number of pages, respectively.
+ */
+struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
+{
+	struct dma_pinned_list *local_list;
+	struct page **pages;
+	int i;
+	int ret;
+	int nr_iovecs = 0;
+	int iovec_len_used = 0;
+	int iovec_pages_used = 0;
+	long err;
+
+	/* don't pin down non-user-based iovecs */
+	if (segment_eq(get_fs(), KERNEL_DS))
+		return NULL;
+
+	/* determine how many iovecs/pages there are, up front */
+	do {
+		iovec_len_used += iov[nr_iovecs].iov_len;
+		iovec_pages_used += num_pages_spanned(&iov[nr_iovecs]);
+		nr_iovecs++;
+	} while (iovec_len_used < len);
+
+	/* single kmalloc for pinned list, page_list[], and the page arrays */
+	local_list = kmalloc(sizeof(*local_list)
+		+ (nr_iovecs * sizeof (struct dma_page_list))
+		+ (iovec_pages_used * sizeof (struct page*)), GFP_KERNEL);
+	if (!local_list) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	/* list of pages starts right after the page list array */
+	pages = (struct page **) &local_list->page_list[nr_iovecs];
+
+	for (i = 0; i < nr_iovecs; i++) {
+		struct dma_page_list *page_list = &local_list->page_list[i];
+
+		len -= iov[i].iov_len;
+
+		if (!access_ok(VERIFY_WRITE, iov[i].iov_base, iov[i].iov_len)) {
+			err = -EFAULT;
+			goto unpin;
+		}
+
+		page_list->nr_pages = num_pages_spanned(&iov[i]);
+		page_list->base_address = iov[i].iov_base;
+
+		page_list->pages = pages;
+		pages += page_list->nr_pages;
+
+		/* pin pages down */
+		down_read(&current->mm->mmap_sem);
+		ret = get_user_pages(
+			current,
+			current->mm,
+			(unsigned long) iov[i].iov_base,
+			page_list->nr_pages,
+			1,	/* write */
+			0,	/* force */
+			page_list->pages,
+			NULL);
+		up_read(&current->mm->mmap_sem);
+
+		if (ret != page_list->nr_pages) {
+			err = -ENOMEM;
+			goto unpin;
+		}
+
+		local_list->nr_iovecs = i + 1;
+	}
+
+	return local_list;
+
+unpin:
+	dma_unpin_iovec_pages(local_list);
+out:
+	return ERR_PTR(err);
+}
+
+void dma_unpin_iovec_pages(struct dma_pinned_list *pinned_list)
+{
+	int i, j;
+
+	if (!pinned_list)
+		return;
+
+	for (i = 0; i < pinned_list->nr_iovecs; i++) {
+		struct dma_page_list *page_list = &pinned_list->page_list[i];
+		for (j = 0; j < page_list->nr_pages; j++) {
+			set_page_dirty_lock(page_list->pages[j]);
+			page_cache_release(page_list->pages[j]);
+		}
+	}
+
+	kfree(pinned_list);
+}
+
+static dma_cookie_t dma_memcpy_to_kernel_iovec(struct dma_chan *chan, struct
+	iovec *iov, unsigned char *kdata, size_t len)
+{
+	dma_cookie_t dma_cookie = 0;
+
+	while (len > 0) {
+		if (iov->iov_len) {
+			int copy = min_t(unsigned int, iov->iov_len, len);
+			dma_cookie = dma_async_memcpy_buf_to_buf(
+					chan,
+					iov->iov_base,
+					kdata,
+					copy);
+			kdata += copy;
+			len -= copy;
+			iov->iov_len -= copy;
+			iov->iov_base += copy;
+		}
+		iov++;
+	}
+
+	return dma_cookie;
+}
+
+/*
+ * We have already pinned down the pages we will be using in the iovecs.
+ * Each entry in iov array has corresponding entry in pinned_list->page_list.
+ * Using array indexing to keep iov[] and page_list[] in sync.
+ * Initial elements in iov array's iov->iov_len will be 0 if already copied into
+ *   by another call.
+ * iov array length remaining guaranteed to be bigger than len.
+ */
+dma_cookie_t dma_memcpy_to_iovec(struct dma_chan *chan, struct iovec *iov,
+	struct dma_pinned_list *pinned_list, unsigned char *kdata, size_t len)
+{
+	int iov_byte_offset;
+	int copy;
+	dma_cookie_t dma_cookie = 0;
+	int iovec_idx;
+	int page_idx;
+
+	if (!chan)
+		return memcpy_toiovec(iov, kdata, len);
+
+	/* -> kernel copies (e.g. smbfs) */
+	if (!pinned_list)
+		return dma_memcpy_to_kernel_iovec(chan, iov, kdata, len);
+
+	iovec_idx = 0;
+	while (iovec_idx < pinned_list->nr_iovecs) {
+		struct dma_page_list *page_list;
+
+		/* skip already used-up iovecs */
+		while (!iov[iovec_idx].iov_len)
+			iovec_idx++;
+
+		page_list = &pinned_list->page_list[iovec_idx];
+
+		iov_byte_offset = ((unsigned long)iov[iovec_idx].iov_base & ~PAGE_MASK);
+		page_idx = (((unsigned long)iov[iovec_idx].iov_base & PAGE_MASK)
+			 - ((unsigned long)page_list->base_address & PAGE_MASK)) >> PAGE_SHIFT;
+
+		/* break up copies to not cross page boundary */
+		while (iov[iovec_idx].iov_len) {
+			copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
+			copy = min_t(int, copy, iov[iovec_idx].iov_len);
+
+			dma_cookie = dma_async_memcpy_buf_to_pg(chan,
+					page_list->pages[page_idx],
+					iov_byte_offset,
+					kdata,
+					copy);
+
+			len -= copy;
+			iov[iovec_idx].iov_len -= copy;
+			iov[iovec_idx].iov_base += copy;
+
+			if (!len)
+				return dma_cookie;
+
+			kdata += copy;
+			iov_byte_offset = 0;
+			page_idx++;
+		}
+		iovec_idx++;
+	}
+
+	/* really bad if we ever run out of iovecs */
+	BUG();
+	return -EFAULT;
+}
+
+dma_cookie_t dma_memcpy_pg_to_iovec(struct dma_chan *chan, struct iovec *iov,
+	struct dma_pinned_list *pinned_list, struct page *page,
+	unsigned int offset, size_t len)
+{
+	int iov_byte_offset;
+	int copy;
+	dma_cookie_t dma_cookie = 0;
+	int iovec_idx;
+	int page_idx;
+	int err;
+
+	/* this needs as-yet-unimplemented buf-to-buff, so punt. */
+	/* TODO: use dma for this */
+	if (!chan || !pinned_list) {
+		u8 *vaddr = kmap(page);
+		err = memcpy_toiovec(iov, vaddr + offset, len);
+		kunmap(page);
+		return err;
+	}
+
+	iovec_idx = 0;
+	while (iovec_idx < pinned_list->nr_iovecs) {
+		struct dma_page_list *page_list;
+
+		/* skip already used-up iovecs */
+		while (!iov[iovec_idx].iov_len)
+			iovec_idx++;
+
+		page_list = &pinned_list->page_list[iovec_idx];
+
+		iov_byte_offset = ((unsigned long)iov[iovec_idx].iov_base & ~PAGE_MASK);
+		page_idx = (((unsigned long)iov[iovec_idx].iov_base & PAGE_MASK)
+			 - ((unsigned long)page_list->base_address & PAGE_MASK)) >> PAGE_SHIFT;
+
+		/* break up copies to not cross page boundary */
+		while (iov[iovec_idx].iov_len) {
+			copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
+			copy = min_t(int, copy, iov[iovec_idx].iov_len);
+
+			dma_cookie = dma_async_memcpy_pg_to_pg(chan,
+					page_list->pages[page_idx],
+					iov_byte_offset,
+					page,
+					offset,
+					copy);
+
+			len -= copy;
+			iov[iovec_idx].iov_len -= copy;
+			iov[iovec_idx].iov_base += copy;
+
+			if (!len)
+				return dma_cookie;
+
+			offset += copy;
+			iov_byte_offset = 0;
+			page_idx++;
+		}
+		iovec_idx++;
+	}
+
+	/* really bad if we ever run out of iovecs */
+	BUG();
+	return -EFAULT;
+}
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 3078154..78b236c 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -333,5 +333,27 @@ static inline enum dma_status dma_async_
 int dma_async_device_register(struct dma_device *device);
 void dma_async_device_unregister(struct dma_device *device);
 
+/* --- Helper iov-locking functions --- */
+
+struct dma_page_list {
+	char *base_address;
+	int nr_pages;
+	struct page **pages;
+};
+
+struct dma_pinned_list {
+	int nr_iovecs;
+	struct dma_page_list page_list[0];
+};
+
+struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len);
+void dma_unpin_iovec_pages(struct dma_pinned_list* pinned_list);
+
+dma_cookie_t dma_memcpy_to_iovec(struct dma_chan *chan, struct iovec *iov,
+	struct dma_pinned_list *pinned_list, unsigned char *kdata, size_t len);
+dma_cookie_t dma_memcpy_pg_to_iovec(struct dma_chan *chan, struct iovec *iov,
+	struct dma_pinned_list *pinned_list, struct page *page,
+	unsigned int offset, size_t len);
+
 #endif /* CONFIG_DMA_ENGINE */
 #endif /* DMAENGINE_H */
diff --git a/include/net/netdma.h b/include/net/netdma.h
index cbfe89d..19760eb 100644
--- a/include/net/netdma.h
+++ b/include/net/netdma.h
@@ -23,6 +23,7 @@
 #include <linux/config.h>
 #ifdef CONFIG_NET_DMA
 #include <linux/dmaengine.h>
+#include <linux/skbuff.h>
 
 static inline struct dma_chan *get_softnet_dma(void)
 {
@@ -34,5 +35,10 @@ static inline struct dma_chan *get_softn
 	rcu_read_unlock();
 	return chan;
 }
+
+int dma_skb_copy_datagram_iovec(struct dma_chan* chan,
+		const struct sk_buff *skb, int offset, struct iovec *to,
+		size_t len, struct dma_pinned_list *pinned_list);
+
 #endif /* CONFIG_NET_DMA */
 #endif /* NETDMA_H */
diff --git a/net/core/Makefile b/net/core/Makefile
index 79fe12c..e9bd246 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_NET_DIVERT) += dv.o
 obj-$(CONFIG_NET_PKTGEN) += pktgen.o
 obj-$(CONFIG_WIRELESS_EXT) += wireless.o
 obj-$(CONFIG_NETPOLL) += netpoll.o
+obj-$(CONFIG_NET_DMA) += user_dma.o
diff --git a/net/core/user_dma.c b/net/core/user_dma.c
new file mode 100644
index 0000000..ec177ef
--- /dev/null
+++ b/net/core/user_dma.c
@@ -0,0 +1,141 @@
+/*
+ * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
+ * Portions based on net/core/datagram.c and copyrighted by their authors.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+
+/*
+ * This code allows the net stack to make use of a DMA engine for
+ * skb to iovec copies.
+ */
+
+#include <linux/dmaengine.h>
+#include <linux/socket.h>
+#include <linux/rtnetlink.h> /* for BUG_TRAP */
+#include <net/tcp.h>
+
+
+#ifdef CONFIG_NET_DMA
+
+/**
+ *	dma_skb_copy_datagram_iovec - Copy a datagram to an iovec.
+ *	@skb - buffer to copy
+ *	@offset - offset in the buffer to start copying from
+ *	@iovec - io vector to copy to
+ *	@len - amount of data to copy from buffer to iovec
+ *	@pinned_list - locked iovec buffer data
+ *
+ *	Note: the iovec is modified during the copy.
+ */
+int dma_skb_copy_datagram_iovec(struct dma_chan *chan,
+			struct sk_buff *skb, int offset, struct iovec *to,
+			size_t len, struct dma_pinned_list *pinned_list)
+{
+	int start = skb_headlen(skb);
+	int i, copy = start - offset;
+	dma_cookie_t cookie = 0;
+
+	/* Copy header. */
+	if (copy > 0) {
+		if (copy > len)
+			copy = len;
+		cookie = dma_memcpy_to_iovec(chan, to, pinned_list,
+		                            skb->data + offset, copy);
+		if (cookie < 0)
+			goto fault;
+		len -= copy;
+		if (len == 0)
+			goto end;
+		offset += copy;
+	}
+
+	/* Copy paged appendix. Hmm... why does this look so complicated? */
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		int end;
+
+		BUG_TRAP(start <= offset + len);
+
+		end = start + skb_shinfo(skb)->frags[i].size;
+		copy = end - offset;
+		if ((copy = end - offset) > 0) {
+			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+			struct page *page = frag->page;
+
+			if (copy > len)
+				copy = len;
+
+			cookie = dma_memcpy_pg_to_iovec(chan, to, pinned_list, page,
+					frag->page_offset + offset - start, copy);
+			if (cookie < 0)
+				goto fault;
+			len -= copy;
+			if (len == 0)
+				goto end;
+			offset += copy;
+		}
+		start = end;
+	}
+
+	if (skb_shinfo(skb)->frag_list) {
+		struct sk_buff *list = skb_shinfo(skb)->frag_list;
+
+		for (; list; list = list->next) {
+			int end;
+
+			BUG_TRAP(start <= offset + len);
+
+			end = start + list->len;
+			copy = end - offset;
+			if (copy > 0) {
+				if (copy > len)
+					copy = len;
+				cookie = dma_skb_copy_datagram_iovec(chan, list,
+				                offset - start, to, copy,
+				                pinned_list);
+				if (cookie < 0)
+					goto fault;
+				len -= copy;
+				if (len == 0)
+					goto end;
+				offset += copy;
+			}
+			start = end;
+		}
+	}
+
+end:
+	if (!len) {
+		skb->dma_cookie = cookie;
+		return cookie;
+	}
+
+fault:
+ 	return -EFAULT;
+}
+
+#else
+
+int dma_skb_copy_datagram_iovec(struct dma_chan *chan,
+			const struct sk_buff *skb, int offset, struct iovec *to,
+			size_t len, struct dma_pinned_list *pinned_list)
+{
+	return skb_copy_datagram_iovec(skb, offset, to, len);
+}
+
+#endif

^ permalink raw reply related

* [PATCH 5/9] [I/OAT] Structure changes for TCP recv offload to I/OAT
From: Chris Leech @ 2006-03-29 22:55 UTC (permalink / raw)
  To: linux-kernel, netdev
In-Reply-To: <20060329225505.25585.30392.stgit@gitlost.site>

Adds an async_wait_queue and some additional fields to tcp_sock, and a
dma_cookie_t to sk_buff.

Signed-off-by: Chris Leech <christopher.leech@intel.com>
---

 include/linux/skbuff.h |    4 ++++
 include/linux/tcp.h    |    8 ++++++++
 include/net/sock.h     |    2 ++
 include/net/tcp.h      |    7 +++++++
 net/core/sock.c        |    6 ++++++
 5 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 613b951..76861a8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -29,6 +29,7 @@
 #include <linux/net.h>
 #include <linux/textsearch.h>
 #include <net/checksum.h>
+#include <linux/dmaengine.h>
 
 #define HAVE_ALLOC_SKB		/* For the drivers to know */
 #define HAVE_ALIGNABLE_SKB	/* Ditto 8)		   */
@@ -285,6 +286,9 @@ struct sk_buff {
 	__u16			tc_verd;	/* traffic control verdict */
 #endif
 #endif
+#ifdef CONFIG_NET_DMA
+	dma_cookie_t		dma_cookie;
+#endif
 
 
 	/* These elements must be at the end, see alloc_skb() for details.  */
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 542d395..c90daa5 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -18,6 +18,7 @@
 #define _LINUX_TCP_H
 
 #include <linux/types.h>
+#include <linux/dmaengine.h>
 #include <asm/byteorder.h>
 
 struct tcphdr {
@@ -233,6 +234,13 @@ struct tcp_sock {
 		struct iovec		*iov;
 		int			memory;
 		int			len;
+#ifdef CONFIG_NET_DMA
+		/* members for async copy */
+		struct dma_chan		*dma_chan;
+		int			wakeup;
+		struct dma_pinned_list	*pinned_list;
+		dma_cookie_t		dma_cookie;
+#endif
 	} ucopy;
 
 	__u32	snd_wl1;	/* Sequence for window update		*/
diff --git a/include/net/sock.h b/include/net/sock.h
index af2b054..190809c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -132,6 +132,7 @@ struct sock_common {
   *	@sk_receive_queue: incoming packets
   *	@sk_wmem_alloc: transmit queue bytes committed
   *	@sk_write_queue: Packet sending queue
+  *	@sk_async_wait_queue: DMA copied packets
   *	@sk_omem_alloc: "o" is "option" or "other"
   *	@sk_wmem_queued: persistent queue size
   *	@sk_forward_alloc: space allocated forward
@@ -205,6 +206,7 @@ struct sock {
 	atomic_t		sk_omem_alloc;
 	struct sk_buff_head	sk_receive_queue;
 	struct sk_buff_head	sk_write_queue;
+	struct sk_buff_head	sk_async_wait_queue;
 	int			sk_wmem_queued;
 	int			sk_forward_alloc;
 	gfp_t			sk_allocation;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9418f4d..54e4367 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -28,6 +28,7 @@
 #include <linux/cache.h>
 #include <linux/percpu.h>
 #include <linux/skbuff.h>
+#include <linux/dmaengine.h>
 
 #include <net/inet_connection_sock.h>
 #include <net/inet_timewait_sock.h>
@@ -820,6 +821,12 @@ static inline void tcp_prequeue_init(str
 	tp->ucopy.len = 0;
 	tp->ucopy.memory = 0;
 	skb_queue_head_init(&tp->ucopy.prequeue);
+#ifdef CONFIG_NET_DMA
+	tp->ucopy.dma_chan = NULL;
+	tp->ucopy.wakeup = 0;
+	tp->ucopy.pinned_list = NULL;
+	tp->ucopy.dma_cookie = 0;
+#endif
 }
 
 /* Packet is added to VJ-style prequeue for processing in process
diff --git a/net/core/sock.c b/net/core/sock.c
index a96ea7d..d2acd35 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -818,6 +818,9 @@ struct sock *sk_clone(const struct sock 
 		atomic_set(&newsk->sk_omem_alloc, 0);
 		skb_queue_head_init(&newsk->sk_receive_queue);
 		skb_queue_head_init(&newsk->sk_write_queue);
+#ifdef CONFIG_NET_DMA
+		skb_queue_head_init(&newsk->sk_async_wait_queue);
+#endif
 
 		rwlock_init(&newsk->sk_dst_lock);
 		rwlock_init(&newsk->sk_callback_lock);
@@ -1369,6 +1372,9 @@ void sock_init_data(struct socket *sock,
 	skb_queue_head_init(&sk->sk_receive_queue);
 	skb_queue_head_init(&sk->sk_write_queue);
 	skb_queue_head_init(&sk->sk_error_queue);
+#ifdef CONFIG_NET_DMA
+	skb_queue_head_init(&sk->sk_async_wait_queue);
+#endif
 
 	sk->sk_send_head	=	NULL;
 

^ permalink raw reply related

* [PATCH 6/9] [I/OAT] Rename cleanup_rbuf to tcp_cleanup_rbuf and make non-static
From: Chris Leech @ 2006-03-29 22:56 UTC (permalink / raw)
  To: linux-kernel, netdev
In-Reply-To: <20060329225505.25585.30392.stgit@gitlost.site>

Needed to be able to call tcp_cleanup_rbuf in tcp_input.c for I/OAT

Signed-off-by: Chris Leech <christopher.leech@intel.com>
---

 include/net/tcp.h |    2 ++
 net/ipv4/tcp.c    |   10 +++++-----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 54e4367..ca5bdaf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -294,6 +294,8 @@ extern int			tcp_rcv_established(struct 
 
 extern void			tcp_rcv_space_adjust(struct sock *sk);
 
+extern void			tcp_cleanup_rbuf(struct sock *sk, int copied);
+
 extern int			tcp_twsk_unique(struct sock *sk,
 						struct sock *sktw, void *twp);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 87f68e7..b10f78c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -937,7 +937,7 @@ static int tcp_recv_urg(struct sock *sk,
  * calculation of whether or not we must ACK for the sake of
  * a window update.
  */
-static void cleanup_rbuf(struct sock *sk, int copied)
+void tcp_cleanup_rbuf(struct sock *sk, int copied)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int time_to_ack = 0;
@@ -1086,7 +1086,7 @@ int tcp_read_sock(struct sock *sk, read_
 
 	/* Clean up data we have read: This will do ACK frames. */
 	if (copied)
-		cleanup_rbuf(sk, copied);
+		tcp_cleanup_rbuf(sk, copied);
 	return copied;
 }
 
@@ -1220,7 +1220,7 @@ int tcp_recvmsg(struct kiocb *iocb, stru
 			}
 		}
 
-		cleanup_rbuf(sk, copied);
+		tcp_cleanup_rbuf(sk, copied);
 
 		if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
 			/* Install new reader */
@@ -1391,7 +1391,7 @@ skip_copy:
 	 */
 
 	/* Clean up data we have read: This will do ACK frames. */
-	cleanup_rbuf(sk, copied);
+	tcp_cleanup_rbuf(sk, copied);
 
 	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
@@ -1853,7 +1853,7 @@ static int do_tcp_setsockopt(struct sock
 			    (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
 			    inet_csk_ack_scheduled(sk)) {
 				icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
-				cleanup_rbuf(sk, 1);
+				tcp_cleanup_rbuf(sk, 1);
 				if (!(val & 1))
 					icsk->icsk_ack.pingpong = 1;
 			}

^ permalink raw reply related

* [PATCH 7/9] [I/OAT] make sk_eat_skb I/OAT aware
From: Chris Leech @ 2006-03-29 22:56 UTC (permalink / raw)
  To: linux-kernel, netdev
In-Reply-To: <20060329225505.25585.30392.stgit@gitlost.site>

Add an extra argument to sk_eat_skb, and make it move early copied packets
to the async_wait_queue instead of freeing them.
Signed-off-by: Chris Leech <christopher.leech@intel.com>
---

 include/net/sock.h |   13 ++++++++++++-
 net/dccp/proto.c   |    4 ++--
 net/ipv4/tcp.c     |    8 ++++----
 net/llc/af_llc.c   |    2 +-
 4 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 190809c..e3723b6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1272,11 +1272,22 @@ sock_recv_timestamp(struct msghdr *msg, 
  * This routine must be called with interrupts disabled or with the socket
  * locked so that the sk_buff queue operation is ok.
 */
-static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
+#ifdef CONFIG_NET_DMA
+static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb, int copied_early)
+{
+	__skb_unlink(skb, &sk->sk_receive_queue);
+	if (!copied_early)
+		__kfree_skb(skb);
+	else
+		__skb_queue_tail(&sk->sk_async_wait_queue, skb);
+}
+#else
+static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb, int copied_early)
 {
 	__skb_unlink(skb, &sk->sk_receive_queue);
 	__kfree_skb(skb);
 }
+#endif
 
 extern void sock_enable_timestamp(struct sock *sk);
 extern int sock_get_timestamp(struct sock *, struct timeval __user *);
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 1ff7328..35d7dfd 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -719,7 +719,7 @@ int dccp_recvmsg(struct kiocb *iocb, str
 		}
 		dccp_pr_debug("packet_type=%s\n",
 			      dccp_packet_name(dh->dccph_type));
-		sk_eat_skb(sk, skb);
+		sk_eat_skb(sk, skb, 0);
 verify_sock_status:
 		if (sock_flag(sk, SOCK_DONE)) {
 			len = 0;
@@ -773,7 +773,7 @@ verify_sock_status:
 		}
 	found_fin_ok:
 		if (!(flags & MSG_PEEK))
-			sk_eat_skb(sk, skb);
+			sk_eat_skb(sk, skb, 0);
 		break;
 	} while (1);
 out:
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b10f78c..2346539 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1072,11 +1072,11 @@ int tcp_read_sock(struct sock *sk, read_
 				break;
 		}
 		if (skb->h.th->fin) {
-			sk_eat_skb(sk, skb);
+			sk_eat_skb(sk, skb, 0);
 			++seq;
 			break;
 		}
-		sk_eat_skb(sk, skb);
+		sk_eat_skb(sk, skb, 0);
 		if (!desc->count)
 			break;
 	}
@@ -1356,14 +1356,14 @@ skip_copy:
 		if (skb->h.th->fin)
 			goto found_fin_ok;
 		if (!(flags & MSG_PEEK))
-			sk_eat_skb(sk, skb);
+			sk_eat_skb(sk, skb, 0);
 		continue;
 
 	found_fin_ok:
 		/* Process the FIN. */
 		++*seq;
 		if (!(flags & MSG_PEEK))
-			sk_eat_skb(sk, skb);
+			sk_eat_skb(sk, skb, 0);
 		break;
 	} while (len > 0);
 
diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index 5a04db7..7465170 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -789,7 +789,7 @@ static int llc_ui_recvmsg(struct kiocb *
 			continue;
 
 		if (!(flags & MSG_PEEK)) {
-			sk_eat_skb(sk, skb);
+			sk_eat_skb(sk, skb, 0);
 			*seq = 0;
 		}
 	} while (len > 0);

^ permalink raw reply related

* [PATCH 8/9] [I/OAT] Add a sysctl for tuning the I/OAT offloaded I/O threshold
From: Chris Leech @ 2006-03-29 22:56 UTC (permalink / raw)
  To: linux-kernel, netdev
In-Reply-To: <20060329225505.25585.30392.stgit@gitlost.site>

Any socket recv of less than this ammount will not be offloaded

Signed-off-by: Chris Leech <christopher.leech@intel.com>
---

 include/linux/sysctl.h     |    1 +
 include/net/tcp.h          |    1 +
 net/core/user_dma.c        |    4 ++++
 net/ipv4/sysctl_net_ipv4.c |   10 ++++++++++
 4 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 76eaeff..cd9e7c0 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -403,6 +403,7 @@ enum
  	NET_TCP_MTU_PROBING=113,
 	NET_TCP_BASE_MSS=114,
 	NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS=115,
+	NET_TCP_DMA_COPYBREAK=116,
 };
 
 enum {
diff --git a/include/net/tcp.h b/include/net/tcp.h
index ca5bdaf..2e6fdef 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -219,6 +219,7 @@ extern int sysctl_tcp_adv_win_scale;
 extern int sysctl_tcp_tw_reuse;
 extern int sysctl_tcp_frto;
 extern int sysctl_tcp_low_latency;
+extern int sysctl_tcp_dma_copybreak;
 extern int sysctl_tcp_nometrics_save;
 extern int sysctl_tcp_moderate_rcvbuf;
 extern int sysctl_tcp_tso_win_divisor;
diff --git a/net/core/user_dma.c b/net/core/user_dma.c
index ec177ef..642a3f3 100644
--- a/net/core/user_dma.c
+++ b/net/core/user_dma.c
@@ -33,6 +33,10 @@
 
 #ifdef CONFIG_NET_DMA
 
+#define NET_DMA_DEFAULT_COPYBREAK 1024
+
+int sysctl_tcp_dma_copybreak = NET_DMA_DEFAULT_COPYBREAK;
+
 /**
  *	dma_skb_copy_datagram_iovec - Copy a datagram to an iovec.
  *	@skb - buffer to copy
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 6b6c3ad..6a6aa53 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -688,6 +688,16 @@ ctl_table ipv4_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec
 	},
+#ifdef CONFIG_NET_DMA
+	{
+		.ctl_name	= NET_TCP_DMA_COPYBREAK,
+		.procname	= "tcp_dma_copybreak",
+		.data		= &sysctl_tcp_dma_copybreak,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec
+	},
+#endif
 	{ .ctl_name = 0 }
 };
 

^ permalink raw reply related

* [PATCH 9/9] [I/OAT] TCP recv offload to I/OAT
From: Chris Leech @ 2006-03-29 22:56 UTC (permalink / raw)
  To: linux-kernel, netdev
In-Reply-To: <20060329225505.25585.30392.stgit@gitlost.site>

Locks down user pages and sets up for DMA in tcp_recvmsg, then calls
dma_async_try_early_copy in tcp_v4_do_rcv

Signed-off-by: Chris Leech <christopher.leech@intel.com>
---

 net/ipv4/tcp.c       |  101 ++++++++++++++++++++++++++++++++++++++++++++------
 net/ipv4/tcp_input.c |   74 +++++++++++++++++++++++++++++++++----
 net/ipv4/tcp_ipv4.c  |   18 ++++++++-
 net/ipv6/tcp_ipv6.c  |   12 +++++-
 4 files changed, 183 insertions(+), 22 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2346539..8be8d69 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -263,7 +263,7 @@
 #include <net/tcp.h>
 #include <net/xfrm.h>
 #include <net/ip.h>
-
+#include <net/netdma.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -1110,6 +1110,7 @@ int tcp_recvmsg(struct kiocb *iocb, stru
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct task_struct *user_recv = NULL;
+	int copied_early = 0;
 
 	lock_sock(sk);
 
@@ -1133,6 +1134,15 @@ int tcp_recvmsg(struct kiocb *iocb, stru
 
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
 
+#ifdef CONFIG_NET_DMA
+	tp->ucopy.dma_chan = NULL;
+	preempt_disable();
+	if ((len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
+	    !sysctl_tcp_low_latency && __get_cpu_var(softnet_data.net_dma))
+		tp->ucopy.pinned_list = dma_pin_iovec_pages(msg->msg_iov, len);
+	preempt_enable_no_resched();
+#endif
+
 	do {
 		struct sk_buff *skb;
 		u32 offset;
@@ -1274,6 +1284,10 @@ int tcp_recvmsg(struct kiocb *iocb, stru
 		} else
 			sk_wait_data(sk, &timeo);
 
+#ifdef CONFIG_NET_DMA
+		tp->ucopy.wakeup = 0;
+#endif
+
 		if (user_recv) {
 			int chunk;
 
@@ -1329,13 +1343,39 @@ do_prequeue:
 		}
 
 		if (!(flags & MSG_TRUNC)) {
-			err = skb_copy_datagram_iovec(skb, offset,
-						      msg->msg_iov, used);
-			if (err) {
-				/* Exception. Bailout! */
-				if (!copied)
-					copied = -EFAULT;
-				break;
+#ifdef CONFIG_NET_DMA
+			if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+				tp->ucopy.dma_chan = get_softnet_dma();
+
+			if (tp->ucopy.dma_chan) {
+				tp->ucopy.dma_cookie = dma_skb_copy_datagram_iovec(
+					tp->ucopy.dma_chan, skb, offset,
+					msg->msg_iov, used,
+					tp->ucopy.pinned_list);
+
+				if (tp->ucopy.dma_cookie < 0) {
+
+					printk(KERN_ALERT "dma_cookie < 0\n");
+
+					/* Exception. Bailout! */
+					if (!copied)
+						copied = -EFAULT;
+					break;
+				}
+				if ((offset + used) == skb->len)
+					copied_early = 1;
+
+			} else
+#endif
+			{
+				err = skb_copy_datagram_iovec(skb, offset,
+						msg->msg_iov, used);
+				if (err) {
+					/* Exception. Bailout! */
+					if (!copied)
+						copied = -EFAULT;
+					break;
+				}
 			}
 		}
 
@@ -1355,15 +1395,19 @@ skip_copy:
 
 		if (skb->h.th->fin)
 			goto found_fin_ok;
-		if (!(flags & MSG_PEEK))
-			sk_eat_skb(sk, skb, 0);
+		if (!(flags & MSG_PEEK)) {
+			sk_eat_skb(sk, skb, copied_early);
+			copied_early = 0;
+		}
 		continue;
 
 	found_fin_ok:
 		/* Process the FIN. */
 		++*seq;
-		if (!(flags & MSG_PEEK))
-			sk_eat_skb(sk, skb, 0);
+		if (!(flags & MSG_PEEK)) {
+			sk_eat_skb(sk, skb, copied_early);
+			copied_early = 0;
+		}
 		break;
 	} while (len > 0);
 
@@ -1386,6 +1430,36 @@ skip_copy:
 		tp->ucopy.len = 0;
 	}
 
+#ifdef CONFIG_NET_DMA
+	if (tp->ucopy.dma_chan) {
+		struct sk_buff *skb;
+		dma_cookie_t done, used;
+
+		dma_async_memcpy_issue_pending(tp->ucopy.dma_chan);
+
+		while (dma_async_memcpy_complete(tp->ucopy.dma_chan,
+		                                 tp->ucopy.dma_cookie, &done,
+		                                 &used) == DMA_IN_PROGRESS) {
+			/* do partial cleanup of sk_async_wait_queue */
+			while ((skb = skb_peek(&sk->sk_async_wait_queue)) &&
+			       (dma_async_is_complete(skb->dma_cookie, done,
+			                              used) == DMA_SUCCESS)) {
+				__skb_dequeue(&sk->sk_async_wait_queue);
+				kfree_skb(skb);
+			}
+		}
+
+		/* Safe to free early-copied skbs now */
+		__skb_queue_purge(&sk->sk_async_wait_queue);
+		dma_chan_put(tp->ucopy.dma_chan);
+		tp->ucopy.dma_chan = NULL;
+	}
+	if (tp->ucopy.pinned_list) {
+		dma_unpin_iovec_pages(tp->ucopy.pinned_list);
+		tp->ucopy.pinned_list = NULL;
+	}
+#endif
+
 	/* According to UNIX98, msg_name/msg_namelen are ignored
 	 * on connected socket. I was just happy when found this 8) --ANK
 	 */
@@ -1653,6 +1727,9 @@ int tcp_disconnect(struct sock *sk, int 
 	__skb_queue_purge(&sk->sk_receive_queue);
 	sk_stream_writequeue_purge(sk);
 	__skb_queue_purge(&tp->out_of_order_queue);
+#ifdef CONFIG_NET_DMA
+	__skb_queue_purge(&sk->sk_async_wait_queue);
+#endif
 
 	inet->dport = 0;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 195d835..c87f4eb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -71,6 +71,7 @@
 #include <net/inet_common.h>
 #include <linux/ipsec.h>
 #include <asm/unaligned.h>
+#include <net/netdma.h>
 
 int sysctl_tcp_timestamps = 1;
 int sysctl_tcp_window_scaling = 1;
@@ -3785,6 +3786,50 @@ static inline int tcp_checksum_complete_
 		__tcp_checksum_complete_user(sk, skb);
 }
 
+#ifdef CONFIG_NET_DMA
+static int tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb, int hlen)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	int chunk = skb->len - hlen;
+	int dma_cookie;
+	int copied_early = 0;
+
+	if (tp->ucopy.wakeup)
+          	return 0;
+
+	if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+		tp->ucopy.dma_chan = get_softnet_dma();
+
+	if (tp->ucopy.dma_chan && skb->ip_summed == CHECKSUM_UNNECESSARY) {
+
+		dma_cookie = dma_skb_copy_datagram_iovec(tp->ucopy.dma_chan,
+			skb, hlen, tp->ucopy.iov, chunk, tp->ucopy.pinned_list);
+
+		if (dma_cookie < 0)
+			goto out;
+
+		tp->ucopy.dma_cookie = dma_cookie;
+		copied_early = 1;
+
+		tp->ucopy.len -= chunk;
+		tp->copied_seq += chunk;
+		tcp_rcv_space_adjust(sk);
+
+		if ((tp->ucopy.len == 0) ||
+		    (tcp_flag_word(skb->h.th) & TCP_FLAG_PSH) ||
+		    (atomic_read(&sk->sk_rmem_alloc) > (sk->sk_rcvbuf >> 1))) {
+			tp->ucopy.wakeup = 1;
+			sk->sk_data_ready(sk, 0);
+		}
+	} else if (chunk > 0) {
+		tp->ucopy.wakeup = 1;
+		sk->sk_data_ready(sk, 0);
+	}
+out:
+	return copied_early;
+}
+#endif /* CONFIG_NET_DMA */
+
 /*
  *	TCP receive function for the ESTABLISHED state. 
  *
@@ -3901,14 +3946,23 @@ int tcp_rcv_established(struct sock *sk,
 			}
 		} else {
 			int eaten = 0;
+			int copied_early = 0;
 
-			if (tp->ucopy.task == current &&
-			    tp->copied_seq == tp->rcv_nxt &&
-			    len - tcp_header_len <= tp->ucopy.len &&
-			    sock_owned_by_user(sk)) {
-				__set_current_state(TASK_RUNNING);
+			if (tp->copied_seq == tp->rcv_nxt &&
+			    len - tcp_header_len <= tp->ucopy.len) {
+#ifdef CONFIG_NET_DMA
+				if (tcp_dma_try_early_copy(sk, skb, tcp_header_len)) {
+					copied_early = 1;
+					eaten = 1;
+				}
+#endif
+				if (tp->ucopy.task == current && sock_owned_by_user(sk) && !copied_early) {
+					__set_current_state(TASK_RUNNING);
 
-				if (!tcp_copy_to_iovec(sk, skb, tcp_header_len)) {
+					if (!tcp_copy_to_iovec(sk, skb, tcp_header_len))
+						eaten = 1;
+				}
+				if (eaten) {
 					/* Predicted packet is in window by definition.
 					 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
 					 * Hence, check seq<=rcv_wup reduces to:
@@ -3924,8 +3978,9 @@ int tcp_rcv_established(struct sock *sk,
 					__skb_pull(skb, tcp_header_len);
 					tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
 					NET_INC_STATS_BH(LINUX_MIB_TCPHPHITSTOUSER);
-					eaten = 1;
 				}
+				if (copied_early)
+					tcp_cleanup_rbuf(sk, skb->len);
 			}
 			if (!eaten) {
 				if (tcp_checksum_complete_user(sk, skb))
@@ -3966,6 +4021,11 @@ int tcp_rcv_established(struct sock *sk,
 
 			__tcp_ack_snd_check(sk, 0);
 no_ack:
+#ifdef CONFIG_NET_DMA
+			if (copied_early)
+				__skb_queue_tail(&sk->sk_async_wait_queue, skb);
+			else
+#endif
 			if (eaten)
 				__kfree_skb(skb);
 			else
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9e85c04..5ed065f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -71,6 +71,7 @@
 #include <net/inet_common.h>
 #include <net/timewait_sock.h>
 #include <net/xfrm.h>
+#include <net/netdma.h>
 
 #include <linux/inet.h>
 #include <linux/ipv6.h>
@@ -1091,8 +1092,18 @@ process:
 	bh_lock_sock(sk);
 	ret = 0;
 	if (!sock_owned_by_user(sk)) {
-		if (!tcp_prequeue(sk, skb))
+#ifdef CONFIG_NET_DMA
+		struct tcp_sock *tp = tcp_sk(sk);
+		if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+			tp->ucopy.dma_chan = get_softnet_dma();
+		if (tp->ucopy.dma_chan)
 			ret = tcp_v4_do_rcv(sk, skb);
+		else
+#endif
+		{
+			if (!tcp_prequeue(sk, skb))
+			ret = tcp_v4_do_rcv(sk, skb);
+		}
 	} else
 		sk_add_backlog(sk, skb);
 	bh_unlock_sock(sk);
@@ -1296,6 +1307,11 @@ int tcp_v4_destroy_sock(struct sock *sk)
 	/* Cleans up our, hopefully empty, out_of_order_queue. */
   	__skb_queue_purge(&tp->out_of_order_queue);
 
+#ifdef CONFIG_NET_DMA
+	/* Cleans up our sk_async_wait_queue */
+  	__skb_queue_purge(&sk->sk_async_wait_queue);
+#endif
+
 	/* Clean prequeue, it must be empty really */
 	__skb_queue_purge(&tp->ucopy.prequeue);
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 301eee7..a50eb30 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1218,8 +1218,16 @@ process:
 	bh_lock_sock(sk);
 	ret = 0;
 	if (!sock_owned_by_user(sk)) {
-		if (!tcp_prequeue(sk, skb))
-			ret = tcp_v6_do_rcv(sk, skb);
+#ifdef CONFIG_NET_DMA
+                struct tcp_sock *tp = tcp_sk(sk);
+                if (tp->ucopy.dma_chan)
+                        ret = tcp_v6_do_rcv(sk, skb);
+                else
+#endif
+		{
+			if (!tcp_prequeue(sk, skb))
+				ret = tcp_v6_do_rcv(sk, skb);
+		}
 	} else
 		sk_add_backlog(sk, skb);
 	bh_unlock_sock(sk);

^ permalink raw reply related

* [git patches] net driver updates
From: Jeff Garzik @ 2006-03-29 23:00 UTC (permalink / raw)
  To: netdev, linux-kernel


[just sent upstream]

The 'upstream-linus' branch of
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git

contains the following updates:

 Documentation/networking/bcm43xx.txt            |   36 
 drivers/net/8390.h                              |    2 
 drivers/net/acenic_firmware.h                   |   10 
 drivers/net/b44.c                               |   14 
 drivers/net/bonding/bond_3ad.c                  |   28 
 drivers/net/bonding/bond_3ad.h                  |    1 
 drivers/net/bonding/bond_main.c                 |   97 
 drivers/net/bonding/bonding.h                   |    4 
 drivers/net/ixp2000/ixpdev.c                    |    5 
 drivers/net/natsemi.c                           |   18 
 drivers/net/pcmcia/axnet_cs.c                   |   61 
 drivers/net/pcnet32.c                           |    4 
 drivers/net/spider_net.c                        |    4 
 drivers/net/via-rhine.c                         |   21 
 drivers/net/wireless/Kconfig                    |    7 
 drivers/net/wireless/Makefile                   |    1 
 drivers/net/wireless/bcm43xx/Kconfig            |   62 
 drivers/net/wireless/bcm43xx/Makefile           |   12 
 drivers/net/wireless/bcm43xx/bcm43xx.h          |  926 +++++
 drivers/net/wireless/bcm43xx/bcm43xx_debugfs.c  |  499 +++
 drivers/net/wireless/bcm43xx/bcm43xx_debugfs.h  |  117 
 drivers/net/wireless/bcm43xx/bcm43xx_dma.c      |  968 +++++
 drivers/net/wireless/bcm43xx/bcm43xx_dma.h      |  218 +
 drivers/net/wireless/bcm43xx/bcm43xx_ethtool.c  |   50 
 drivers/net/wireless/bcm43xx/bcm43xx_ethtool.h  |    8 
 drivers/net/wireless/bcm43xx/bcm43xx_ilt.c      |  337 ++
 drivers/net/wireless/bcm43xx/bcm43xx_ilt.h      |   32 
 drivers/net/wireless/bcm43xx/bcm43xx_leds.c     |  293 +
 drivers/net/wireless/bcm43xx/bcm43xx_leds.h     |   56 
 drivers/net/wireless/bcm43xx/bcm43xx_main.c     | 3973 ++++++++++++++++++++++++
 drivers/net/wireless/bcm43xx/bcm43xx_main.h     |  168 +
 drivers/net/wireless/bcm43xx/bcm43xx_phy.c      | 2345 ++++++++++++++
 drivers/net/wireless/bcm43xx/bcm43xx_phy.h      |   74 
 drivers/net/wireless/bcm43xx/bcm43xx_pio.c      |  606 +++
 drivers/net/wireless/bcm43xx/bcm43xx_pio.h      |  138 
 drivers/net/wireless/bcm43xx/bcm43xx_power.c    |  358 ++
 drivers/net/wireless/bcm43xx/bcm43xx_power.h    |   47 
 drivers/net/wireless/bcm43xx/bcm43xx_radio.c    | 2026 ++++++++++++
 drivers/net/wireless/bcm43xx/bcm43xx_radio.h    |   99 
 drivers/net/wireless/bcm43xx/bcm43xx_sysfs.c    |  322 +
 drivers/net/wireless/bcm43xx/bcm43xx_sysfs.h    |   25 
 drivers/net/wireless/bcm43xx/bcm43xx_wx.c       | 1002 ++++++
 drivers/net/wireless/bcm43xx/bcm43xx_wx.h       |   36 
 drivers/net/wireless/bcm43xx/bcm43xx_xmit.c     |  582 +++
 drivers/net/wireless/bcm43xx/bcm43xx_xmit.h     |  156 
 drivers/net/wireless/hostap/hostap_80211.h      |    2 
 drivers/net/wireless/hostap/hostap_80211_tx.c   |    9 
 drivers/usb/net/zd1201.c                        |    2 
 net/ieee80211/ieee80211_wx.c                    |    4 
 net/ieee80211/softmac/ieee80211softmac_module.c |   17 
 net/ieee80211/softmac/ieee80211softmac_priv.h   |    2 
 net/ieee80211/softmac/ieee80211softmac_wx.c     |   12 
 52 files changed, 15820 insertions(+), 76 deletions(-)

Adrian Bunk:
      PCMCIA_SPECTRUM must select FW_LOADER

Arthur Othieno:
      net: remove CONFIG_NET_CBUS conditional for NS8390

Danny van Dyk:
      Sync bcm43xx_phy_initb6() with specs

David Woodhouse:
      softmac: reduce scan dwell time
      softmac: reduce default rate to 11Mbps.

Gary Zambrano:
      b44: fix force mac address before ifconfig up
      b44: ensure valid mac addr

Jay Vosburgh:
      bonding: support carrier state for master

Jean Tourrilhes:
      zd1201 wireless stat update

Jens Osterkamp:
      spidernet : reduce console spam
      spidernet : enable tx checksum offloading by default

John W. Linville:
      wireless: import bcm43xx sources
      bcm43xx: patch Kconfig and wireless/Makefile for import

Jouni Malinen:
      hostap: Make hostap_tx_encrypt() static
      hostap: Fix EAPOL frame encryption

Komuro:
      axnet_cs.c : add hardware multicast support

Larry Finger:
      Minor (janitorial) change to ieee80211

Lennert Buytenhek:
      ixp2000: fix gcc4 breakage

Linas Vepstas:
      Janitor: drivers/net/pcnet32: fix incorrect comments

Mark Brown:
      natsemi: Support oversized EEPROMs

Michael Buesch:
      bcm43xx: sync with svn.berlios.de
      bcm43xx: remove linux version compatibility code.
      bcm43xx: Move README file to Documentation directory.
      bcm43xx: remove redundant COPYING file.
      bcm43xx: add DEBUG Kconfig option. Also fix indention.
      bcm43xx: Fix makefile. Remove all the "out-of-tree" stuff.
      bcm43xx: Add more initvals sanity checks and error out, if one sanity check fails.
      bcm43xx: Remove function bcm43xx_channel_is_allowed()
      bcm43xx: basic ethtool support
      bcm43xx: Wireless Ext update
      bcm43xx: fix txpower reporting in WE.
      bcm43xx: enable SPROM writing.
      bcm43xx: heavily increase mac_suspend timeout.
      bcm43xx: fix compiletime warning (phy_xmitpower)
      bcm43xx: remove WX debugging.
      bcm43xx: Partially fix PIO code. Add Kconfig option for PIO or DMA mode (or both).
      bcm43xx: add a note that not all devices support PIO.
      Apple Airport: Add Kconfig note that the bcm43xx driver has to be used for Airport Extreme cards.
      bcm43xx: update README
      bcm43xx: fix LED code.
      bcm43xx: rewrite and simplify the periodic task handling.
      bcm43xx: Code cleanups. This removes various "inline" statements and reduces codesize.
      bcm43xx: Move sprom lowlevel reading/writing to its own functions.
      bcm43xx: make bcm43xx_sprom_crc() static.
      bcm43xx: split the channel helper functions, so that they can be used without a valid running core.
      bcm43xx: remove old unused struct.
      bcm43xx: Fix Kconfig typo (transfer mode default)
      bcm43xx: Workaround init_board vs IRQ race.
      bcm43xx: move initialized = 1 to the end of init_board.
      bcm43xx: add assert(bcm->initialized) to periodic_tasks_setup().
      bcm43xx: Move TX/RX related functions to its own file. Add basic RTS/CTS code.
      bcm43xx: Add sysfs attributes for device specific tunables.
      bcm43xx: Set both, the DMAmask and the coherent DMAmask.
      bcm43xx: Abstract the locking mechanism.
      bcm43xx: Remove the mmio access printing facility overhead.
      bcm43xx: fix some stuff, add a few missing mmiowb(), remove dead code.
      bcm43xx: receive TX status on MMIO or DMA unconditionally regarding the 80211 core rev.
      bcm43xx: add functions bcm43xx_dma_read/write, bcm43xx_dma_tx_suspend/resume.
      bcm43xx: reduce the size of bcm43xx_private by removing unneeded members.
      bcm43xx: Fix crash on ifdown, by being careful in pio/dma freeing.
      bcm43xx: Remove the workaround in dummy_transmission,
      bcm43xx: Do boardflags workarounds for specific boards.
      bcm43xx: properly mask txctl1 before writing it to hardware.
      bcm43xx: remove check for mmio length, as it differs among platforms. (especially embedded)
      bcm43xx: fix some gpio register trashing (hopefully :D)
      bcm43xx: merge all iwmode code into the set_iwmode function.
      bcm43xx: some IRQ handler cleanups.
      bcm43xx: set default attenuation values.
      bcm43xx: sync interference mitigation code to the specs.
      bcm43xx: fix nrssi_threshold calculation.
      bcm43xx: add useless and broken statistics stuff. People seem to want it. well...
      bcm43xx: get rid of "/* vim: ..." lines at the end of several files.
      bcm43xx: fix "include" issues on some platforms.
      bcm43xx: remove some compilerwarnings.
      bcm43xx: fix the remaining sparse warnings.
      bcm43xx: sync GPHY init with the specs.
      bcm43xx: don't set the channel on a device, which is down.

Pete Zaitcev:
      bcm43xx: fix DMA TX skb freeing in case of fragmented packets.

Randy Dunlap:
      acenic: fix section mismatches

Roger Luethi:
      via-rhine: link state fix

^ permalink raw reply

* Re: [PATCH 1/8] [I/OAT] DMA memcpy subsystem
From: Andrew Grover @ 2006-03-29 23:05 UTC (permalink / raw)
  To: Kumar Gala; +Cc: Chris Leech, linux kernel mailing list, netdev
In-Reply-To: <3B202D51-1683-465D-AE3D-DE301017BD69@kernel.crashing.org>

On 3/28/06, Kumar Gala <galak@kernel.crashing.org> wrote:
> Do you only get callback when a channel is available?

Yes

> How do you
> decide to do to provide PIO to the client?

The client is responsible for using any channels it gets, or falling
back to memcpy() if it doesn't get any. (I don't understand how PIO
comes into the picture..?)

> A client should only request multiple channel to handle multiple
> concurrent operations.

Correct, if there aren't any CPU concurrency issues then 1 channel
will use the device's full bandwidth (unless some other client has
acquired the other channels and is using them, of course.)

> > This gets around the problem of DMA clients registering (and therefore
> > not getting) channels simply because they init before the DMA device
> > is discovered.
>
> What do you expect to happen in a system in which the channels are
> over subscribed?
>
> Do you expect the DMA device driver to handle scheduling of channels
> between multiple clients?

It does the simplest thing that could possibly work right now:
channels are allocated first come first serve. When there is a need,
it should be straightforward to allow multiple clients to share DMA
channels.

Regards -- Andy

^ permalink raw reply

* [Patch 4/8] generic netlink utility functions
From: Shailabh Nagar @ 2006-03-30  0:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, Thomas Graf, Jamal
In-Reply-To: <442B271D.10208@watson.ibm.com>

genetlink-utils.patch

Two utilities for simplifying usage of NETLINK_GENERIC
interface.

Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>

 include/net/genetlink.h |   20 ++++++++++++++++++++
 1 files changed, 20 insertions(+)

Index: linux-2.6.16/include/net/genetlink.h
===================================================================
--- linux-2.6.16.orig/include/net/genetlink.h	2006-03-29 18:12:54.000000000 -0500
+++ linux-2.6.16/include/net/genetlink.h	2006-03-29 18:13:17.000000000 -0500
@@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct
 	return nlmsg_unicast(genl_sock, skb, pid);
 }

+/**
+ * gennlmsg_data - head of message payload
+ * @gnlh: genetlink messsage header
+ */
+static inline void *genlmsg_data(const struct genlmsghdr *gnlh)
+{
+       return ((unsigned char *) gnlh + GENL_HDRLEN);
+}
+
+/**
+ * genlmsg_len - length of message payload
+ * @gnlh: genetlink message header
+ */
+static inline int genlmsg_len(const struct genlmsghdr *gnlh)
+{
+       struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh -
+                                                   NLMSG_HDRLEN);
+       return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
+}
+
 #endif	/* __NET_GENERIC_NETLINK_H */

^ permalink raw reply

* [Patch 5/8] generic netlink interface for delay accounting
From: Shailabh Nagar @ 2006-03-30  0:52 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, Thomas Graf, Jamal
In-Reply-To: <442B271D.10208@watson.ibm.com>

delayacct-genetlink.patch

Create a generic netlink interface (NETLINK_GENERIC family),
called "taskstats", for getting delay and cpu statistics of
tasks and thread groups during their lifetime and when they exit.


Signed-off-by: Shailabh Nagar <nagar@us.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>

 include/linux/delayacct.h |   11 +
 include/linux/taskstats.h |  113 +++++++++++++++++
 init/Kconfig              |   16 ++
 kernel/Makefile           |    1
 kernel/delayacct.c        |   44 ++++++
 kernel/taskstats.c        |  292 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 474 insertions(+), 3 deletions(-)

Index: linux-2.6.16/include/linux/delayacct.h
===================================================================
--- linux-2.6.16.orig/include/linux/delayacct.h	2006-03-29 18:13:13.000000000 -0500
+++ linux-2.6.16/include/linux/delayacct.h	2006-03-29 18:13:18.000000000 -0500
@@ -18,6 +18,7 @@
 #define _LINUX_TASKDELAYS_H

 #include <linux/sched.h>
+#include <linux/taskstats.h>

 #ifdef CONFIG_TASK_DELAY_ACCT
 extern int delayacct_on;	/* Delay accounting turned on/off */
@@ -27,6 +28,7 @@ extern void __delayacct_tsk_init(struct
 extern void __delayacct_tsk_exit(struct task_struct *);
 extern void __delayacct_blkio_start(void);
 extern void __delayacct_blkio_end(void);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);

 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
@@ -64,4 +66,13 @@ static inline void delayacct_blkio_start
 static inline void delayacct_blkio_end(void)
 {}
 #endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+				    struct task_struct *tsk)
+{
+	if (!tsk->delays)
+		return -EINVAL;
+	return __delayacct_add_tsk(d, tsk);
+}
+#endif
 #endif /* _LINUX_TASKDELAYS_H */
Index: linux-2.6.16/include/linux/taskstats.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16/include/linux/taskstats.h	2006-03-29 18:13:18.000000000 -0500
@@ -0,0 +1,113 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ *	- a task exits
+ *	- listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION	1
+#define TASKSTATS_NOPID	-1
+
+struct taskstats {
+	/* Maintain 64-bit alignment while extending */
+	/* Version 1 */
+
+	/* XXX_count is number of delay values recorded.
+	 * XXX_total is corresponding cumulative delay in nanoseconds
+	 */
+
+#define TASKSTATS_NOCPUSTATS	1
+	__u64	cpu_count;
+	__u64	cpu_delay_total;	/* wait, while runnable, for cpu */
+	__u64	blkio_count;
+	__u64	blkio_delay_total;	/* sync,block io completion wait*/
+	__u64	swapin_count;
+	__u64	swapin_delay_total;	/* swapin page fault wait*/
+
+	__u64	cpu_run_total;		/* cpu running time
+					 * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP	0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+	TASKSTATS_CMD_UNSPEC = 0,	/* Reserved */
+	TASKSTATS_CMD_GET,		/* user->kernel request/get-response */
+	TASKSTATS_CMD_NEW,		/* kernel->user event */
+	__TASKSTATS_CMD_MAX,
+};
+
+#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)
+
+enum {
+	TASKSTATS_TYPE_UNSPEC = 0,	/* Reserved */
+	TASKSTATS_TYPE_PID,		/* Process id */
+	TASKSTATS_TYPE_TGID,		/* Thread group id */
+	TASKSTATS_TYPE_STATS,		/* taskstats structure */
+	TASKSTATS_TYPE_AGGR_PID,	/* contains pid + stats */
+	TASKSTATS_TYPE_AGGR_TGID,	/* contains tgid + stats */
+	__TASKSTATS_TYPE_MAX,
+};
+
+#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1)
+
+enum {
+	TASKSTATS_CMD_ATTR_UNSPEC = 0,
+	TASKSTATS_CMD_ATTR_PID,
+	TASKSTATS_CMD_ATTR_TGID,
+	__TASKSTATS_CMD_ATTR_MAX,
+};
+
+#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME	"TASKSTATS"
+#define TASKSTATS_GENL_VERSION	0x1
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+enum {
+	TASKSTATS_MSG_UNICAST,		/* send data only to requester */
+	TASKSTATS_MSG_MULTICAST,	/* send data to a group */
+};
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
Index: linux-2.6.16/init/Kconfig
===================================================================
--- linux-2.6.16.orig/init/Kconfig	2006-03-29 18:12:57.000000000 -0500
+++ linux-2.6.16/init/Kconfig	2006-03-29 18:13:18.000000000 -0500
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
 	  in pages. Such statistics can help in setting a task's priorities
 	  relative to other tasks for cpu, io, rss limits etc.

-	  Unlike BSD process accounting, this information is available
-	  continuously during the lifetime of a task.
-
 	  Say N if unsure.

+config TASKSTATS
+	bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+	depends on TASK_DELAY_ACCT
+	default y
+	help
+	  Export selected statistics for tasks/processes through the
+	  generic netlink interface. Unlike BSD process accounting, the
+	  statistics are available during the lifetime of tasks/processes as
+	  responses to commands. Like BSD accounting, they are sent to user
+	  space on task exit.
+
+	  Say Y if unsure.
+
 config SYSCTL
 	bool "Sysctl support"
 	---help---
Index: linux-2.6.16/kernel/delayacct.c
===================================================================
--- linux-2.6.16.orig/kernel/delayacct.c	2006-03-29 18:13:13.000000000 -0500
+++ linux-2.6.16/kernel/delayacct.c	2006-03-29 18:13:18.000000000 -0500
@@ -1,6 +1,7 @@
 /* delayacct.c - per-task delay accounting
  *
  * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * Copyright (C) Balbir Singh, IBM Corp. 2006
  *
  * This program is free software;  you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -18,9 +19,12 @@
 #include <linux/time.h>
 #include <linux/sysctl.h>
 #include <linux/delayacct.h>
+#include <linux/taskstats.h>
+#include <linux/mutex.h>

 int delayacct_on __read_mostly = 0;	/* Delay accounting turned on/off */
 kmem_cache_t *delayacct_cache;
+static DEFINE_MUTEX(delayacct_exit_mutex);

 static int __init delayacct_setup_enable(char *str)
 {
@@ -50,10 +54,16 @@ void __delayacct_tsk_init(struct task_st

 void __delayacct_tsk_exit(struct task_struct *tsk)
 {
+	/*
+	 * Protect against racing thread group exits
+	 */
+	mutex_lock(&delayacct_exit_mutex);
+	taskstats_exit_pid(tsk);
 	if (tsk->delays) {
 		kmem_cache_free(delayacct_cache, tsk->delays);
 		tsk->delays = NULL;
 	}
+	mutex_unlock(&delayacct_exit_mutex);
 }

 /*
@@ -108,3 +118,37 @@ void __delayacct_blkio_end(void)
 			      &current->delays->blkio_delay,
 			      &current->delays->blkio_count);
 }
+#ifdef CONFIG_TASKSTATS
+int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+{
+	nsec_t tmp;
+	struct timespec ts;
+	unsigned long t1,t2;
+
+	/* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
+
+	tmp = (nsec_t)d->cpu_run_total ;
+	tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
+	d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;
+
+	/* No locking available for sched_info. Take snapshot first. */
+	t1 = tsk->sched_info.pcnt;
+	t2 = tsk->sched_info.run_delay;
+
+	d->cpu_count += t1;
+
+	jiffies_to_timespec(t2, &ts);
+	tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts);
+	d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp;
+
+	spin_lock(&tsk->delays->lock);
+	tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
+	d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;
+	tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
+	d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp;
+	d->blkio_count += tsk->delays->blkio_count;
+	d->swapin_count += tsk->delays->swapin_count;
+	spin_unlock(&tsk->delays->lock);
+	return 0;
+}
+#endif /* CONFIG_TASKSTATS */
Index: linux-2.6.16/kernel/Makefile
===================================================================
--- linux-2.6.16.orig/kernel/Makefile	2006-03-29 18:12:57.000000000 -0500
+++ linux-2.6.16/kernel/Makefile	2006-03-29 18:13:18.000000000 -0500
@@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o

 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
Index: linux-2.6.16/kernel/taskstats.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16/kernel/taskstats.c	2006-03-29 18:13:18.000000000 -0500
@@ -0,0 +1,292 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+static struct genl_family family = {
+	.id             = GENL_ID_GENERATE,
+	.name           = TASKSTATS_GENL_NAME,
+	.version        = TASKSTATS_GENL_VERSION,
+	.maxattr        = TASKSTATS_CMD_ATTR_MAX,
+};
+
+static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = {
+	[TASKSTATS_CMD_ATTR_PID]  = { .type = NLA_U32 },
+	[TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
+};
+
+
+static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
+			 void **replyp, size_t size)
+{
+	struct sk_buff *skb;
+	void *reply;
+
+	/*
+	 * If new attributes are added, please revisit this allocation
+	 */
+	skb = nlmsg_new(size);
+	if (!skb)
+		return -ENOMEM;
+
+	if (!info) {
+		int seq = get_cpu_var(taskstats_seqnum)++;
+		put_cpu_var(taskstats_seqnum);
+
+		reply = genlmsg_put(skb, 0, seq,
+				    family.id, 0, 0,
+				    cmd, family.version);
+	} else
+		reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+				    family.id, 0, 0,
+				    cmd, family.version);
+	if (reply == NULL) {
+		nlmsg_free(skb);
+		return -EINVAL;
+	}
+
+	*skbp = skb;
+	*replyp = reply;
+	return 0;
+}
+
+static int send_reply(struct sk_buff *skb, pid_t pid, int event)
+{
+	struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+	void *reply;
+	int rc;
+
+	reply = genlmsg_data(genlhdr);
+
+	rc = genlmsg_end(skb, reply);
+	if (rc < 0) {
+		nlmsg_free(skb);
+		return rc;
+	}
+
+	if (event == TASKSTATS_MSG_MULTICAST)
+		return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+	return genlmsg_unicast(skb, pid);
+}
+
+static inline int fill_pid(pid_t pid, struct task_struct *pidtsk,
+			   struct taskstats *stats)
+{
+	int rc;
+	struct task_struct *tsk = pidtsk;
+
+	if (!pidtsk) {
+		read_lock(&tasklist_lock);
+		tsk = find_task_by_pid(pid);
+		if (!tsk) {
+			read_unlock(&tasklist_lock);
+			return -ESRCH;
+		}
+		get_task_struct(tsk);
+		read_unlock(&tasklist_lock);
+	} else
+		get_task_struct(tsk);
+
+	rc = delayacct_add_tsk(stats, tsk);
+	put_task_struct(tsk);
+
+	return rc;
+
+}
+
+static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
+			    struct taskstats *stats)
+{
+	int rc;
+	struct task_struct *tsk, *first;
+
+	first = tgidtsk;
+	read_lock(&tasklist_lock);
+	if (!first) {
+		first = find_task_by_pid(tgid);
+		if (!first) {
+			read_unlock(&tasklist_lock);
+			return -ESRCH;
+		}
+	}
+	tsk = first;
+	do {
+		rc = delayacct_add_tsk(stats, tsk);
+		if (rc)
+			break;
+	} while_each_thread(first, tsk);
+	read_unlock(&tasklist_lock);
+
+	return rc;
+}
+
+static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+{
+	int rc = 0;
+	struct sk_buff *rep_skb;
+	struct taskstats stats;
+	void *reply;
+	size_t size;
+	struct nlattr *na;
+
+	/*
+	 * Size includes space for nested attribute as well
+	 * The returned data is of the format
+	 * TASKSTATS_TYPE_AGGR_PID/TGID
+	 * --> TASKSTATS_TYPE_PID/TGID
+	 * --> TASKSTATS_TYPE_STATS
+	 */
+	size = nla_total_size(sizeof(u32)) +
+		nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+
+	memset(&stats, 0, sizeof(stats));
+	rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
+	if (rc < 0)
+		return rc;
+
+	if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
+		u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
+		rc = fill_pid((pid_t)pid, NULL, &stats);
+		if (rc < 0)
+			goto err;
+
+		na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
+		NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
+	} else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
+		u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
+		rc = fill_tgid((pid_t)tgid, NULL, &stats);
+		if (rc < 0)
+			goto err;
+
+		na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
+		NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
+	} else {
+		rc = -EINVAL;
+		goto err;
+	}
+
+	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+	nla_nest_end(rep_skb, na);
+
+	return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
+
+nla_put_failure:
+	return  genlmsg_cancel(rep_skb, reply);
+err:
+	nlmsg_free(rep_skb);
+	return rc;
+}
+
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+	int rc = 0;
+	struct sk_buff *rep_skb;
+	void *reply;
+	struct taskstats stats;
+	size_t size;
+	int is_thread_group = !thread_group_empty(tsk);
+	struct nlattr *na;
+
+	/*
+	 * tasks can start to exit very early. Ensure that the family
+	 * is registered before notifications are sent out
+	 */
+	if (!family_registered)
+		return;
+
+	/*
+	 * Size includes space for nested attributes
+	 */
+	size = nla_total_size(sizeof(u32)) +
+		nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+
+	if (is_thread_group)
+		size = 2 * size;	// PID + STATS + TGID + STATS
+
+	memset(&stats, 0, sizeof(stats));
+	rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
+	if (rc < 0)
+		return;
+
+	rc = fill_pid(tsk->pid, tsk, &stats);
+	if (rc < 0)
+		goto err;
+
+	na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
+	NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
+	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+	nla_nest_end(rep_skb, na);
+
+	if (!is_thread_group) {
+		send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+		return;
+	}
+
+	memset(&stats, 0, sizeof(stats));
+	rc = fill_tgid(tsk->tgid, tsk, &stats);
+	if (rc < 0)
+		goto err;
+
+	na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
+	NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
+	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+	nla_nest_end(rep_skb, na);
+
+	send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+	return;
+
+nla_put_failure:
+	genlmsg_cancel(rep_skb, reply);
+	return;
+err:
+	nlmsg_free(rep_skb);
+}
+
+static struct genl_ops taskstats_ops = {
+	.cmd            = TASKSTATS_CMD_GET,
+	.doit           = taskstats_send_stats,
+	.policy		= taskstats_cmd_get_policy,
+};
+
+static int __init taskstats_init(void)
+{
+	if (genl_register_family(&family))
+		return -EFAULT;
+        family_registered = 1;
+
+	if (genl_register_ops(&family, &taskstats_ops))
+		goto err;
+
+	return 0;
+err:
+	genl_unregister_family(&family);
+	family_registered = 0;
+	return -EFAULT;
+}
+
+late_initcall(taskstats_init);

^ permalink raw reply

* [e1000 debug] KERNEL: assertion (!sk_forward_alloc) failed...
From: Brandeburg, Jesse @ 2006-03-30  2:53 UTC (permalink / raw)
  To: nipsy, jrlundgren, cat, djani22, yoseph.basri, bb, mykleb, olel,
	michal, chris
  Cc: netdev, Jesse Brandeburg, davem, E1000-devel, Brandeburg, Jesse

Hi all, I've identified you as people who have at some point in the past
emailed one of the Linux lists with problems with e1000 and
sk_forward_alloc.  It seems to be fairly widespread, but only seems to
have appeared with recent kernel changes (after 2.6.12...)

What I need from you is a reproducible test, and some information.  I
have never been able to reproduce this, and I'm trying to isolate the
problem a bit.  What motherboards are you using?  What seems to cause
this problem?  Are you all using iptables?  Are you all routing? From
the reports I assume none of you are using an 82571/2/3 (pci express)

As far as I know e1000 has the same requirement as tg3 and some others
where we have to modify the header of the skb in the case of transmits
using TSO.  I don't see anywhere else that the driver modifies the skb.
Tomorrow I'll generate a patch to try a more paranoid copying of the
skb, I hope some of you can test.

To do this we have code like so in e1000_tso:
2529         if (skb_shinfo(skb)->tso_size) {
2530                 if (skb_header_cloned(skb)) {
2531                         err = pskb_expand_head(skb, 0, 0,
GFP_ATOMIC);
2532                         if (err)
2533                                 return err;
2534                 }
2535 
2536                 hdr_len = ((skb->h.raw - skb->data) +
(skb->h.th->doff << 2));
2537                 mss = skb_shinfo(skb)->tso_size;
2538                 if (skb->protocol == ntohs(ETH_P_IP)) {
2539                         skb->nh.iph->tot_len = 0;
2540                         skb->nh.iph->check = 0;

Thanks for your assistance

Jesse


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid\x110944&bid$1720&dat\x121642

^ permalink raw reply

* Re: [e1000 debug] KERNEL: assertion (!sk_forward_alloc) failed...
From: Yoseph Basri @ 2006-03-30  4:02 UTC (permalink / raw)
  To: Brandeburg, Jesse
  Cc: nipsy, jrlundgren, cat, djani22, bb, mykleb, olel, michal, chris,
	netdev, Jesse Brandeburg, davem, E1000-devel
In-Reply-To: <C925F8B43D79CC49ACD0601FB68FF50C07461F59@orsmsx408>

Hi Jesse,

Thanks for your concern,

My server still send warning message regarding this KERNEL: assertion
(!sk_forward_alloc)  after upgrade kernel 2.6.12 or 2.6.15.

This is from dmesg server:

Linux version 2.6.15.4 (root@xxxxx) (gcc version 3.3.4 (Debian
1:3.3.4-13)) #1 SMP Tue Feb 21 17:12:27 SGT 2006
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009d400 (usable)
 BIOS-e820: 000000000009d400 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000f7ffb300 (usable)
 BIOS-e820: 00000000f7ffb300 - 00000000f8000000 (ACPI data)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000108000000 (usable)
Warning only 4GB will be used.
Use a PAE enabled kernel.
3200MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 0009d540
On node 0 totalpages: 1048576
  DMA zone: 4096 pages, LIFO batch:0
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 225280 pages, LIFO batch:31
  HighMem zone: 819200 pages, LIFO batch:31
DMI 2.3 present.
ACPI: RSDP (v000 IBM                                   ) @ 0x000fdfc0
ACPI: RSDT (v001 IBM    SERONYXP 0x00001000 IBM  0x45444f43) @ 0xf7ffff80
ACPI: FADT (v001 IBM    SERONYXP 0x00001000 IBM  0x45444f43) @ 0xf7ffff00
ACPI: MADT (v001 IBM    SERONYXP 0x00001000 IBM  0x45444f43) @ 0xf7fffe80
ACPI: ASF! (v016 IBM    SERONYXP 0x00000001 IBM  0x45444f43) @ 0xf7fffdc0
ACPI: DSDT (v001 IBM    SERGEODE 0x00001000 MSFT 0x0100000b) @ 0x00000000
ACPI: PM-Timer IO Port: 0x488
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:2 APIC version 20
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 15:2 APIC version 20
ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
ACPI: IOAPIC (id[0x0e] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 14, version 17, address 0xfec00000, GSI 0-15
ACPI: IOAPIC (id[0x0d] address[0xfec01000] gsi_base[16])
IOAPIC[1]: apic_id 13, version 17, address 0xfec01000, GSI 16-31
ACPI: IOAPIC (id[0x0c] address[0xfec02000] gsi_base[32])
IOAPIC[2]: apic_id 12, version 17, address 0xfec02000, GSI 32-47
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ7 used by override.
Enabling APIC mode:  Flat.  Using 3 I/O APICs
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at f8800000 (gap: f8000000:06c00000)
Built 1 zonelists
Kernel command line: BOOT_IMAGE=linux-2.6.15.4 ro root=801
mapped APIC to ffffd000 (fee00000)
mapped IOAPIC to ffffc000 (fec00000)
mapped IOAPIC to ffffb000 (fec01000)
mapped IOAPIC to ffffa000 (fec02000)
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 65536 bytes)
Detected 2794.685 MHz processor.
Using pmtmr for high-res timesource
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 4025292k/4194304k available (1906k kernel code, 36796k
reserved, 650k data, 240k init, 3145708k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 5593.74 BogoMIPS (lpj=11187486)
Mount-cache hash table entries: 512
CPU: After generic identify, caps: bfebfbff 00000000 00000000 00000000
00004400 00000000 00000000
CPU: After vendor identify, caps: bfebfbff 00000000 00000000 00000000
00004400 00000000 00000000
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: Physical Processor ID: 0
CPU: After all inits, caps: bfebfbff 00000000 00000000 00000080
00004400 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU0: Intel P4/Xeon Extended MCE MSRs (12) available
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
CPU0: Intel(R) Xeon(TM) CPU 2.80GHz stepping 07
Booting processor 1/1 eip 2000
Initializing CPU#1
Calibrating delay using timer specific routine.. 5588.12 BogoMIPS (lpj=11176253)
CPU: After generic identify, caps: bfebfbff 00000000 00000000 00000000
00004400 00000000 00000000
CPU: After vendor identify, caps: bfebfbff 00000000 00000000 00000000
00004400 00000000 00000000
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: Physical Processor ID: 0
CPU: After all inits, caps: bfebfbff 00000000 00000000 00000080
00004400 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel P4/Xeon Extended MCE MSRs (12) available
CPU1: Intel(R) Xeon(TM) CPU 2.80GHz stepping 07
Total of 2 processors activated (11181.86 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
checking TSC synchronization across 2 CPUs: passed.
Brought up 2 CPUs
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xfd7dc, last bus=8
PCI: Using configuration type 1
ACPI: Subsystem revision 20050902
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
Boot video device is 0000:00:06.0
PCI: Ignoring BAR0-3 of IDE controller 0000:00:0f.1
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Root Bridge [PCI1] (0000:02)
PCI: Probing PCI hardware (bus 02)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI1._PRT]
ACPI: PCI Root Bridge [PCI2] (0000:04)
PCI: Probing PCI hardware (bus 04)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI2._PRT]
ACPI: PCI Root Bridge [PCI3] (0000:06)
PCI: Probing PCI hardware (bus 06)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI3._PRT]
ACPI: PCI Root Bridge [PCI4] (0000:08)
PCI: Probing PCI hardware (bus 08)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI4._PRT]
ACPI: PCI Interrupt Link [LP00] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP01] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP02] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP03] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP04] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP05] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP06] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP07] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP08] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP09] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP0A] (IRQs *10)
ACPI: PCI Interrupt Link [LP0B] (IRQs *9)
ACPI: PCI Interrupt Link [LP0C] (IRQs *9)
ACPI: PCI Interrupt Link [LP0D] (IRQs *3)
ACPI: PCI Interrupt Link [LP0E] (IRQs *5)
ACPI: PCI Interrupt Link [LP0F] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP10] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP11] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP12] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP13] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP14] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP15] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP16] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP17] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP18] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP19] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP1A] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP1B] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP1C] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP1D] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP1E] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LP1F] (IRQs) *0, disabled.
ACPI: PCI Interrupt Link [LPUS] (IRQs *11)
SCSI subsystem initialized
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
IA-32 Microcode Update Driver: v1.14 <tigran@veritas.com>
highmem bounce pool size: 64 pages
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered
Real Time Clock Driver v1.12
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
Intel(R) PRO/1000 Network Driver - version 6.1.16-k2
Copyright (c) 1999-2005 Intel Corporation.
ACPI: PCI Interrupt 0000:06:08.0[A] -> GSI 29 (level, low) -> IRQ 16
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
ACPI: PCI Interrupt 0000:06:08.1[B] -> GSI 30 (level, low) -> IRQ 17
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
e100: Intel(R) PRO/100 Network Driver, 3.4.14-k4-NAPI
e100: Copyright(c) 1999-2005 Intel Corporation
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
Probing IDE interface ide0...
hda: LG CD-ROM CRN-8245B, ATAPI CD/DVD-ROM drive
Probing IDE interface ide1...
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Fusion MPT base driver 3.03.04
Copyright (c) 1999-2005 LSI Logic Corporation
Fusion MPT SPI Host driver 3.03.04
Fusion MPT base driver 3.03.04
Copyright (c) 1999-2005 LSI Logic Corporation
Fusion MPT SPI Host driver 3.03.04
ACPI: PCI Interrupt 0000:08:07.0[A] -> GSI 27 (level, low) -> IRQ 18
mptbase: Initiating ioc0 bringup
ioc0: 53C1030: Capabilities={Initiator}
scsi0 : ioc0: LSI53C1030, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=18
  Vendor: IBM-ESXS  Model: MAP3735NC     FN  Rev: B109
  Type:   Direct-Access                      ANSI SCSI revision: 03
SCSI device sda: 143374000 512-byte hdwr sectors (73407 MB)
SCSI device sda: drive cache: write through
SCSI device sda: 143374000 512-byte hdwr sectors (73407 MB)
SCSI device sda: drive cache: write through
 sda: sda1 sda2 sda3 < sda5 sda6 sda7 sda8 sda9 sda10 >
sd 0:0:0:0: Attached scsi disk sda
  Vendor: IBM-ESXS  Model: MAP3735NC     FN  Rev: B109
  Type:   Direct-Access                      ANSI SCSI revision: 03
SCSI device sdb: 143374000 512-byte hdwr sectors (73407 MB)
SCSI device sdb: drive cache: write through
SCSI device sdb: 143374000 512-byte hdwr sectors (73407 MB)
SCSI device sdb: drive cache: write through
 sdb: sdb1 sdb2 sdb3 sdb4
sd 0:0:1:0: Attached scsi disk sdb
  Vendor: IBM-ESXS  Model: MAP3735NC     FN  Rev: B109
  Type:   Direct-Access                      ANSI SCSI revision: 03
SCSI device sdc: 143374000 512-byte hdwr sectors (73407 MB)
SCSI device sdc: drive cache: write through
SCSI device sdc: 143374000 512-byte hdwr sectors (73407 MB)
SCSI device sdc: drive cache: write through
 sdc: sdc1 sdc2 sdc3 sdc4
sd 0:0:2:0: Attached scsi disk sdc
  Vendor: IBM-ESXS  Model: MAP3735NC     FN  Rev: B109
  Type:   Direct-Access                      ANSI SCSI revision: 03
SCSI device sdd: 143374000 512-byte hdwr sectors (73407 MB)
SCSI device sdd: drive cache: write through
SCSI device sdd: 143374000 512-byte hdwr sectors (73407 MB)
SCSI device sdd: drive cache: write through
 sdd: sdd1 sdd2 sdd3 sdd4
sd 0:0:3:0: Attached scsi disk sdd
  Vendor: IBM-ESXS  Model: MAP3735NC     FN  Rev: B109
  Type:   Direct-Access                      ANSI SCSI revision: 03
SCSI device sde: 143374000 512-byte hdwr sectors (73407 MB)
SCSI device sde: drive cache: write through
SCSI device sde: 143374000 512-byte hdwr sectors (73407 MB)
SCSI device sde: drive cache: write through
 sde: sde1 sde2 sde3 sde4
sd 0:0:4:0: Attached scsi disk sde
  Vendor: IBM-ESXS  Model: MAP3735NC     FN  Rev: B109
  Type:   Direct-Access                      ANSI SCSI revision: 03
SCSI device sdf: 143374000 512-byte hdwr sectors (73407 MB)
SCSI device sdf: drive cache: write through
SCSI device sdf: 143374000 512-byte hdwr sectors (73407 MB)
SCSI device sdf: drive cache: write through
 sdf: sdf1 sdf2 sdf3 sdf4
sd 0:0:5:0: Attached scsi disk sdf
  Vendor: IBM       Model: 32P0032a S320  1  Rev: 1
  Type:   Processor                          ANSI SCSI revision: 02
ACPI: PCI Interrupt 0000:08:07.1[B] -> GSI 28 (level, low) -> IRQ 19
mptbase: Initiating ioc1 bringup
ioc1: 53C1030: Capabilities={Initiator}
scsi1 : ioc1: LSI53C1030, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=19
mice: PS/2 mouse device common for all mice
NET: Registered protocol family 2
IP route cache hash table entries: 262144 (order: 8, 1048576 bytes)
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 7, 524288 bytes)
input: AT Translated Set 2 keyboard as /class/input/input0
TCP: Hash tables configured (established 524288 bind 65536)
TCP reno registered
ip_conntrack version 2.4 (8192 buckets, 65536 max) - 176 bytes per conntrack
ip_tables: (C) 2000-2002 Netfilter core team
TCP bic registered
NET: Registered protocol family 1
NET: Registered protocol family 17
Starting balanced_irq
Using IPI Shortcut mode
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 240k freed
Adding 2104504k swap on /dev/sda2.  Priority:-1 extents:1 across:2104504k
EXT3 FS on sda1, internal journal
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda6, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda5, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda7, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda10, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda9, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda8, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdb1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdb2, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdb3, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdb4, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdc1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdc2, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdc3, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdc4, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdd1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdd2, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdd3, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdd4, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sde1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sde2, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sde3, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sde4, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdf1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdf2, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdf3, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdf4, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
e1000: eth0: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex

Are you all using iptables?

YES

Are you all routing?

NO

>From the reports I assume none of you are using an 82571/2/3 (pci express)?

think so.

Thnaks again.

YB



On 3/30/06, Brandeburg, Jesse <jesse.brandeburg@intel.com> wrote:
> Hi all, I've identified you as people who have at some point in the past
> emailed one of the Linux lists with problems with e1000 and
> sk_forward_alloc.  It seems to be fairly widespread, but only seems to
> have appeared with recent kernel changes (after 2.6.12...)
>
> What I need from you is a reproducible test, and some information.  I
> have never been able to reproduce this, and I'm trying to isolate the
> problem a bit.  What motherboards are you using?  What seems to cause
> this problem?  Are you all using iptables?  Are you all routing? From
> the reports I assume none of you are using an 82571/2/3 (pci express)
>
> As far as I know e1000 has the same requirement as tg3 and some others
> where we have to modify the header of the skb in the case of transmits
> using TSO.  I don't see anywhere else that the driver modifies the skb.
> Tomorrow I'll generate a patch to try a more paranoid copying of the
> skb, I hope some of you can test.
>
> To do this we have code like so in e1000_tso:
> 2529         if (skb_shinfo(skb)->tso_size) {
> 2530                 if (skb_header_cloned(skb)) {
> 2531                         err = pskb_expand_head(skb, 0, 0,
> GFP_ATOMIC);
> 2532                         if (err)
> 2533                                 return err;
> 2534                 }
> 2535
> 2536                 hdr_len = ((skb->h.raw - skb->data) +
> (skb->h.th->doff << 2));
> 2537                 mss = skb_shinfo(skb)->tso_size;
> 2538                 if (skb->protocol == ntohs(ETH_P_IP)) {
> 2539                         skb->nh.iph->tot_len = 0;
> 2540                         skb->nh.iph->check = 0;
>
> Thanks for your assistance
>
> Jesse
>


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid\x110944&bid$1720&dat\x121642

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox