Netdev List

* Re: [RFC] [PATCH 0/3] ioat: DMA engine support
From: Alan Cox @ 2005-11-23 23:02 UTC
  To: Andrew Grover; +Cc: netdev, linux-kernel, john.ronciak, christopher.leech
In-Reply-To: <Pine.LNX.4.44.0511231143380.32487-100000@isotope.jf.intel.com>

On Wed, 2005-11-23 at 12:26 -0800, Andrew Grover wrote:
> early next year. Until then, the code doesn't really *do* anything, but we
> wanted to release what we could right away, and start getting some 
> feedback.

First comment, partly based on Jeff Garzik's comments: if you added an
"operation" to the base functions and an operation mask to the DMA
engines, it becomes possible to support engines that can do other ops (e.g.
abusing an NCR53c8xx for both copy and clear).
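
One way to picture the "operation plus operation mask" idea is the
user-space sketch below. Everything in it is an assumption made for
illustration: the dma_op codes, the dma_engine struct and dma_do_op()
do not appear in the posted patch, and the CPU stands in for the
hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical operation codes; none of these names come from the patch. */
enum dma_op {
	DMA_OP_COPY  = 1u << 0,
	DMA_OP_CLEAR = 1u << 1,
	DMA_OP_XOR   = 1u << 2,
};

/* Hypothetical engine descriptor carrying a mask of supported operations. */
struct dma_engine {
	unsigned int op_mask;
};

static bool dma_engine_supports(const struct dma_engine *eng, enum dma_op op)
{
	return (eng->op_mask & op) != 0;
}

/*
 * Base function that takes the operation as an argument.  Returns 0 on
 * success and -1 if the engine does not advertise the operation (or if
 * this sketch does not implement it).  memcpy()/memset() stand in for
 * the work a real engine would do.
 */
static int dma_do_op(const struct dma_engine *eng, enum dma_op op,
		     void *dest, const void *src, size_t len)
{
	assert(eng && dest);

	if (!dma_engine_supports(eng, op))
		return -1;

	switch (op) {
	case DMA_OP_COPY:
		memcpy(dest, src, len);
		return 0;
	case DMA_OP_CLEAR:
		memset(dest, 0, len);
		return 0;
	default:
		return -1;
	}
}
```

A client would check the mask (or the return value) and fall back to
the CPU when the engine cannot perform the requested operation.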

Second one: you obviously tested this somehow. Was that all done by
simulation, or do you have a "CPU" memcpy test engine for use before the
hardware pops up?

* Re: [RFC: 2.6 patch] remove drivers/net/eepro100.c
From: David S. Miller @ 2005-11-23 23:01 UTC
  To: rmk+lkml; +Cc: jgarzik, bunk, saw, linux-kernel, netdev
In-Reply-To: <20051123225319.GP15449@flint.arm.linux.org.uk>

From: Russell King <rmk+lkml@arm.linux.org.uk>
Date: Wed, 23 Nov 2005 22:53:19 +0000

> That means there's about 15 minutes left before I go to sleep before
> having to be up early tomorrow to go on a 2 hour journey to attend a
> meeting.  What do you want me to do with those 15 minutes?  Perform a
> miracle maybe?

It's perfectly fine that you are not able to test the fix.  But being
so visibly angry about it, that's the part I don't get.

How about a "I'm not able to test this due to lack of access to the
necessary hardware, but I did give it a try although unsuccessful.
Perhaps someone else can lend a hand so we can resolve this for good?"

That's the "calm" response.

* Re: [RFC] [PATCH 0/3] ioat: DMA engine support
From: Jeff Garzik @ 2005-11-23 22:56 UTC
  To: Alan Cox
  Cc: Andrew Grover, netdev, linux-kernel, john.ronciak,
	christopher.leech
In-Reply-To: <1132786445.13095.32.camel@localhost.localdomain>

Alan Cox wrote:
>>Additionally, current IOAT is memory->memory.  I would love to be able 
>>to convince Intel to add transforms and checksums, 
> 
> 
> Not just transforms but also masks and maybe even merges and textures
> would be rather handy 8)


Ah yes:  I totally forgot to mention XOR.

Software RAID would love that.

	Jeff

* Re: [RFC] [PATCH 0/3] ioat: DMA engine support
From: Alan Cox @ 2005-11-23 22:54 UTC
  To: Jeff Garzik
  Cc: Andrew Grover, netdev, linux-kernel, john.ronciak,
	christopher.leech
In-Reply-To: <4384E7F2.2030508@pobox.com>

On Wed, 2005-11-23 at 17:06 -0500, Jeff Garzik wrote:
> Sample ideas:  VM page pre-zeroing.  ATA PIO data xfers (async copy to 
> static buffer, to dramatically shorten length of kmap+irqsave time). 
> Extremely large memcpy() calls.

ATA PIO copies are 512 bytes of memory per sector and that is usually
already in cache and on cache line boundaries. You won't even be able to
measure it done by the CPU. I can't see the I/O engine sync cost being
worth it.

Might just about help large transfers I guess but you don't do
multisector which is the only case you'd get perhaps 8K an I/O.

> Additionally, current IOAT is memory->memory.  I would love to be able 
> to convince Intel to add transforms and checksums, 

Not just transforms but also masks and maybe even merges and textures
would be rather handy 8)

* Re: [RFC] [PATCH 0/3] ioat: DMA engine support
From: Jeff Garzik @ 2005-11-23 22:53 UTC
  To: Andrew Grover; +Cc: netdev, linux-kernel, john.ronciak, christopher.leech
In-Reply-To: <Pine.LNX.4.44.0511231143380.32487-100000@isotope.jf.intel.com>

Andrew Grover wrote:
> overall diffstat information:
>  drivers/Kconfig           |    2 
>  drivers/Makefile          |    1 
>  drivers/dma/Kconfig       |   40 ++
>  drivers/dma/Makefile      |    5 
>  drivers/dma/cb_list.h     |   12 
>  drivers/dma/dmaengine.c   |  394 ++++++++++++++++++++++++
>  drivers/dma/testclient.c  |  132 ++++++++
>  include/linux/dmaengine.h |  268 ++++++++++++++++
>  net/core/Makefile         |    3 
>  net/core/dev.c            |   78 ++++
>  net/core/user_dma.c       |  422 ++++++++++++++++++++++++++
>  11 files changed, 1356 insertions(+), 1 deletion(-)


Overall, there was a distinct lack of any useful 
description/documentation, over and above the code itself.

	Jeff

* Re: [RFC: 2.6 patch] remove drivers/net/eepro100.c
From: Russell King @ 2005-11-23 22:53 UTC
  To: David S. Miller; +Cc: jgarzik, bunk, saw, linux-kernel, netdev
In-Reply-To: <20051123.143946.41188551.davem@davemloft.net>

On Wed, Nov 23, 2005 at 02:39:46PM -0800, David S. Miller wrote:
> From: Russell King <rmk+lkml@arm.linux.org.uk>
> Date: Wed, 23 Nov 2005 22:15:48 +0000
> 
> > I leave it up to you how to proceed.  Effectively I'm now completely
> > out of the loop on this with no hardware to worry about.  Sorry.
> > 
> > Finally, please don't assign any blame for this in my direction; I
> > reported it and I kept bugging people about it, and in spite of my
> > best efforts there was very little which was forthcoming.  Obviously
> > that wasn't enough.
> 
> I think you're being unreasonable.

I think you're being unreasonable telling me that I'm being unreasonable.

> They've worked on a fix for the problem, and now you're unable to test
> the fix, and you're angry at them because they took so long to code up
> the fix.
> 
> If you're overextended and have too much work to do and that's
> stressing you out, that doesn't give you permission to take it
> out on other people.

No.  It's quite simple.

I've worked on trying to replicate the problem today.  Tomorrow I'm
out at a meeting and since I'm no longer working on the problematical
hardware, it is being returned.

That means there's about 15 minutes left before I go to sleep before
having to be up early tomorrow to go on a 2 hour journey to attend a
meeting.  What do you want me to do with those 15 minutes?  Perform a
miracle maybe?

David, I ask you to retract your unreasonable mail.  I'm being quite
calm here.  I'm just pointing out the facts that as of *now* I'm no
longer in a position to test.

I was rather hoping that by being crystal clear about the reasons
_why_ I'm no longer able to continue participating in this problem,
I would be seen not to be unreasonable.

I guess I'm just cursed.

Sorry.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core

* Re: [RFC] [PATCH 3/3] ioat: testclient
From: Jeff Garzik @ 2005-11-23 22:53 UTC
  To: Andrew Grover; +Cc: netdev, linux-kernel, john.ronciak, christopher.leech
In-Reply-To: <Pine.LNX.4.44.0511231217340.32487-100000@isotope.jf.intel.com>

Andrew Grover wrote:
> diff --git a/drivers/dma/testclient.c b/drivers/dma/testclient.c
> new file mode 100644
> index 0000000..9bfb979
> --- /dev/null
> +++ b/drivers/dma/testclient.c
> @@ -0,0 +1,132 @@
> +/*******************************************************************************
> +
> +  
> +  Copyright(c) 2004 - 2005 Intel Corporation. All rights reserved.
> +  
> +  This program is free software; you can redistribute it and/or modify it 
> +  under the terms of the GNU General Public License as published by the Free 
> +  Software Foundation; either version 2 of the License, or (at your option) 
> +  any later version.
> +  
> +  This program is distributed in the hope that it will be useful, but WITHOUT 
> +  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 
> +  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for 
> +  more details.
> +  
> +  You should have received a copy of the GNU General Public License along with
> +  this program; if not, write to the Free Software Foundation, Inc., 59 
> +  Temple Place - Suite 330, Boston, MA  02111-1307, USA.
> +  
> +  The full GNU General Public License is included in this distribution in the
> +  file called LICENSE.
> +  
> +*******************************************************************************/
> +
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/device.h>
> +#include <linux/dmaengine.h>
> +#include <linux/delay.h>
> +#include <asm/io.h>
> +
> +/* MODULE API */
> +
> +static volatile u8 *buffer1;
> +static volatile u8 *buffer2;

why do you think volatile is needed?


> +struct dma_client *test_dma_client;
> +struct dma_chan *test_dma_chan;
> +static dma_cookie_t cookie;
> +
> +void
> +test_added_chan(void)
> +{
> +	int i;
> +
> +	printk("buffer1 = %p\n", buffer1);
> +	printk("buffer2 = %p\n", buffer2);
> +	for (i = 0; i < 20; i+=4)
> +		printk("%u %u %u %u\n", buffer2[i], buffer2[i+1], buffer2[i+2], buffer2[i+3]);
> +
> +//	for (i = 0; i < 10; i++) {
> +	cookie = dma_async_memcpy_buf_to_buf(test_dma_chan, 
> +		(void *)buffer2,
> +		(void *)buffer1,
> +		2000);
> +	dma_async_memcpy_issue_pending(test_dma_chan);
> +//	}
> +//	printk("dma cookie = %i\n", cookie);
> +	if (dma_async_memcpy_complete(test_dma_chan, cookie, NULL, NULL) == DMA_IN_PROGRESS)
> +		printk("DMA cookie == IN PROGRESS\n");
> +	else
> +		printk("DMA cookie == SUCCESS\n");
> +#if 0
> +	for (i = 0; i < 1000; i++) {
> +		if (buffer2[1] != 0)
> +			break;
> +		mdelay(1);

ummm....


> +	printk("i = %d\n", i);
> +	for (i = 0; i < 20; i+=4)
> +		printk("%u %u %u %u\n", buffer2[i], buffer2[i+1], buffer2[i+2], buffer2[i+3]);
> +	for (i = 0; i < 20; i+=4)
> +		printk("%u %u %u %u\n", buffer1[i], buffer1[i+1], buffer1[i+2], buffer1[i+3]);
> +#endif
> +}
> +
> +void test_dma_event(struct dma_client *client, struct dma_chan *chan, enum dma_event_t event)
> +{
> +	switch (event) {
> +	case DMA_RESOURCE_ADDED:
> +		test_dma_chan = chan;
> +		test_added_chan();
> +		break;
> +	case DMA_RESOURCE_REMOVED:
> +		test_dma_chan = NULL;
> +		break;
> +	default:
> +		break;
> +	}
> +}

what keeps DMA_RESOURCE_ADDED from being called multiple times?
What happens when there is more than one resource?
dma_async_client_chan_request(...,1) prevents this, perhaps?
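
If the callback can in fact fire more than once, one defensive option
is for the client to keep only the first channel it is offered. The
sketch below is a user-space stand-in, not the patch's code: the
dma_chan struct is reduced to a single field and the
extra_chans_ignored counter is an assumption added for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the types in the posted patch. */
enum dma_event_t {
	DMA_RESOURCE_ADDED,
	DMA_RESOURCE_REMOVED,
};

struct dma_chan {
	int chan_id;
};

static struct dma_chan *test_dma_chan;
static unsigned int extra_chans_ignored;	/* hypothetical counter */

/*
 * Event callback that tolerates repeated DMA_RESOURCE_ADDED events:
 * the first channel is kept, later ones are counted and ignored, and
 * a removal only clears the pointer if it names the channel in use.
 */
static void test_dma_event(struct dma_chan *chan, enum dma_event_t event)
{
	assert(chan);

	switch (event) {
	case DMA_RESOURCE_ADDED:
		if (test_dma_chan) {
			extra_chans_ignored++;
			break;
		}
		test_dma_chan = chan;
		break;
	case DMA_RESOURCE_REMOVED:
		if (test_dma_chan == chan)
			test_dma_chan = NULL;
		break;
	default:
		break;
	}
}
```

Whether such a guard is needed depends on what
dma_async_client_chan_request(..., 1) actually guarantees, which the
thread leaves open.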


> +static int __init
> +testclient_init_module(void)
> +{
> +	int i;
> +
> +	buffer1 = kmalloc(sizeof(u8) * 2000, SLAB_KERNEL);
> +	buffer2 = kmalloc(sizeof(u8) * 2000, SLAB_KERNEL);
> +	memset((void *)buffer2, 0, 2000);

1) GFP_KERNEL not SLAB_KERNEL

2) kzalloc()

3) be consistent:  either use "2000" or "sizeof * 2000"
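
A user-space analogue of points 2) and 3) might look like the sketch
below. It is only an illustration: calloc() stands in for kzalloc(),
there is no GFP flag outside the kernel, and TEST_BUF_SIZE and the two
helper functions are names assumed here, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One named size, used for allocation, initialisation and the copy. */
#define TEST_BUF_SIZE 2000

static uint8_t *buffer1;
static uint8_t *buffer2;

/*
 * Allocate both test buffers.  Returns 0 on success, -1 on failure, in
 * which case nothing is left allocated.
 */
static int test_buffers_alloc(void)
{
	size_t i;

	assert(!buffer1 && !buffer2);	/* not already allocated */

	buffer1 = malloc(TEST_BUF_SIZE);
	buffer2 = calloc(1, TEST_BUF_SIZE);	/* zeroed, as kzalloc() would */
	if (!buffer1 || !buffer2) {
		free(buffer1);
		free(buffer2);
		buffer1 = buffer2 = NULL;
		return -1;
	}

	/* Same fill pattern as the test client: byte i holds (u8)i. */
	for (i = 0; i < TEST_BUF_SIZE; i++)
		buffer1[i] = (uint8_t)i;

	return 0;
}

static void test_buffers_free(void)
{
	free(buffer1);
	free(buffer2);
	buffer1 = buffer2 = NULL;
}
```

In the kernel version the same shape would use kzalloc(size, GFP_KERNEL)
for the zeroed buffer, with the size written once.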


> +	for (i = 0; i < 2000; i++)
> +		buffer1[i] = i;
> +	test_dma_client = dma_async_client_register(test_dma_event);
> +	if (!test_dma_client) {
> +		printk(KERN_ERR "Could not register dma client!\n");
> +		return 0;
> +	}
> +
> +	dma_async_client_chan_request(test_dma_client, 1);
> +
> +	return 0;
> +}
> +
> +module_init(testclient_init_module);
> +
> +static void __exit
> +testclient_exit_module(void)
> +{
> +	int i;
> +	for (i = 0; i < 20; i+=4)
> +		printk("%u %u %u %u\n", buffer2[i], buffer2[i+1], buffer2[i+2], buffer2[i+3]);
> +	if (dma_async_memcpy_complete(test_dma_chan, cookie, NULL, NULL) == DMA_SUCCESS)
> +		printk("DMA cookie == SUCCESS\n");
> +	else
> +		printk("DMA cookie == IN PROGRESS\n");
> +
> +	dma_async_client_unregister(test_dma_client);
> +}
> +
> +module_exit(testclient_exit_module);
> +

* Re: [RFC] [PATCH 2/3] ioat: user buffer pin; net DMA client register
From: Jeff Garzik @ 2005-11-23 22:46 UTC
  To: Andrew Grover; +Cc: netdev, linux-kernel, john.ronciak, christopher.leech
In-Reply-To: <Pine.LNX.4.44.0511231215020.32487-100000@isotope.jf.intel.com>

A per-patch description would be nice, as DaveM mentioned... and also 
please put a diffstat in each email.

	Jeff

* Re: [RFC] [PATCH 0/3] ioat: DMA engine support
From: Jeff Garzik @ 2005-11-23 22:45 UTC
  To: Andrew Grover; +Cc: netdev, linux-kernel, john.ronciak, christopher.leech
In-Reply-To: <Pine.LNX.4.44.0511231143380.32487-100000@isotope.jf.intel.com>

Andrew Grover wrote:
> As presented in our talk at this year's OLS, the Bensley platform, which 
> will be out in early 2006, will have an asynchronous DMA engine. It can be 
> used to offload copies from the CPU, such as the kernel copies of received 
> packets into the user buffer.

More than a one-paragraph description would be nice...  URLs to OLS and 
IDF presentations, other info?

	Jeff

* Re: [RFC] [PATCH 1/3] ioat: DMA subsystem
From: Jeff Garzik @ 2005-11-23 22:44 UTC
  To: Andrew Grover; +Cc: netdev, linux-kernel, john.ronciak, christopher.leech
In-Reply-To: <Pine.LNX.4.44.0511231207410.32487-100000@isotope.jf.intel.com>



Mostly ok, but some minor nits.



Andrew Grover wrote:
> index 0000000..f2cc2d7
> --- /dev/null
> +++ b/drivers/dma/cb_list.h
> @@ -0,0 +1,12 @@
> +/* Extra macros that build on <linux/list.h> */
> +#ifndef CB_LIST_H
> +#define CB_LIST_H
> +
> +#include <linux/list.h>
> +
> +/* Provide some safety to list_add, which I find easy to swap the arguments to */
> +
> +#define list_add_entry(pos, head, member)      list_add(&pos->member, head)
> +#define list_add_entry_tail(pos, head, member) list_add_tail(&pos->member, head)
> +
> +#endif /* CB_LIST_H */

Maybe this just adds fuel to the fire, given your code comment, but I 
tend to think most people are used to

	object_foo(object, other args...)

where the object in question is "the list".  That would imply a

	list_foo(head, other args...)

pattern.

As a side note, I have the same problem as you, WRT swapping the 
list_add arguments.  I've always thought that was the one big drawback 
to Linus's otherwise elegant list implementation.
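
The "list first" ordering described above could be sketched as follows.
This is a self-contained user-space illustration: the tiny list_head
implementation mimics the shape of <linux/list.h>, and
list_append_entry() is a hypothetical name, not a macro from the patch
or the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on <linux/list.h>. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

/* Same argument order as the kernel's list_add_tail(new, head). */
static void list_add_tail(struct list_head *new, struct list_head *head)
{
	assert(new && head);

	new->prev = head->prev;
	new->next = head;
	head->prev->next = new;
	head->prev = new;
}

/*
 * Hypothetical wrapper following the object_foo(object, other args...)
 * convention: the list comes first, then the entry and its member name.
 */
#define list_append_entry(head, pos, member) \
	list_add_tail(&(pos)->member, (head))

/* Reduced stand-in for the patch's struct dma_chan. */
struct dma_chan {
	int chan_id;
	struct list_head client_node;
};
```

With this ordering, a call reads list_append_entry(&client->channels,
chan, client_node), so the list being modified is always the first
argument.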

general nits:

1) docbook-able function headers, with useful documentation, would be 
nice.  Using libata as an example, even if I don't provide any useful 
function description, I at least document the locking details/context 
for each function.

2) more inline code commenting would be nice.



> +			if (chan->device->device_alloc_chan_resources(chan) >= 0) {
> +				chan->client = client;
> +				list_add_entry_tail(chan, &client->channels, client_node);
> +				return chan;
> +			}


device_alloc_chan_resources is a very long name.  :)


> +static void
> +dma_client_chan_free(struct dma_chan *chan)
> +{
> +	BUG_ON(!chan);
> +
> +	chan->device->device_free_chan_resources(chan);
> +	chan->client = NULL;
> +}

ditto


> +static void
> +dma_chans_rebalance(void)

explanation of this function would be nice.  remember to answer "how?" 
and "why?", not "what?".


> +{
> +	struct dma_client *client;
> +	struct dma_chan *chan;
> +
> +	list_for_each_entry(client, &dma_client_list, global_node) {

locking of dma_client_list?

> +		while (client->chans_desired > client->chan_count) {
> +			chan = dma_client_chan_alloc(client);
> +			if (!chan)
> +				break;
> +
> +			client->chan_count++;
> +			client->event_callback(client, chan, DMA_RESOURCE_ADDED);
> +		}
> +
> +		while (client->chans_desired < client->chan_count) {
> +			chan = list_entry(client->channels.next, struct dma_chan, client_node);
> +			list_del(&chan->client_node);
> +			client->chan_count--;
> +			client->event_callback(client, chan, DMA_RESOURCE_REMOVED);
> +			dma_client_chan_free(chan);

In general, this DMA_RESOURCE_REMOVED operation feels like a "yanking 
the carpet out from under my feet" operation, something we should avoid 
for object-lifetime reasons.

However in this case, AFAICS dmaengine.c completely controls object 
lifetime, so I do not see a real problem.  I'm just nervous.  :)


> +		}
> +	}
> +}
> +
> +struct dma_client *
> +dma_async_client_register(dma_event_callback event_callback)
> +{
> +	struct dma_client *client;
> +
> +	BUG_ON(!event_callback);
> +
> +	client = kmalloc(sizeof(*client), GFP_KERNEL);
> +	if (!client)
> +		return NULL;
> +
> +	INIT_LIST_HEAD(&client->channels);
> +
> +	client->chans_desired = 0;
> +	client->chan_count = 0;
> +	client->event_callback = event_callback;
> +
> +	list_add_entry_tail(client, &dma_client_list, global_node);
> +
> +	return client;

Possible SMP bug here?

So far, in my code read, I was presuming that the caller was doing some 
sort of locking on dma_client_list and dma_device_list.  (Hint: need 
locking docs for each function)

But if you are using GFP_KERNEL, it certainly appears that two callers 
could race with each other when touching dma_client_list.
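
A minimal sketch of the kind of locking being asked for, written as
user-space C so it can stand alone. The assumptions are plain: a C11
atomic_flag spin lock stands in for a kernel spinlock, calloc() stands
in for kmalloc(), the list is a simple singly linked one that inserts
at the head (the patch appends to the tail), and the function names are
invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Reduced stand-in for the patch's struct dma_client. */
struct dma_client {
	struct dma_client *next;
	unsigned int chans_desired;
};

static struct dma_client *dma_client_list;
static atomic_flag dma_list_lock = ATOMIC_FLAG_INIT;

static void dma_list_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&dma_list_lock,
						 memory_order_acquire))
		;	/* spin until the holder releases the lock */
}

static void dma_list_lock_release(void)
{
	atomic_flag_clear_explicit(&dma_list_lock, memory_order_release);
}

static struct dma_client *dma_client_register(void)
{
	/*
	 * Allocate before taking the lock: an allocation that may sleep
	 * (GFP_KERNEL in the kernel) must not run under a spin lock.
	 */
	struct dma_client *client = calloc(1, sizeof(*client));

	if (!client)
		return NULL;

	dma_list_lock_acquire();
	client->next = dma_client_list;
	dma_client_list = client;
	dma_list_lock_release();

	return client;
}

static void dma_client_unregister(struct dma_client *client)
{
	struct dma_client **pp;

	assert(client);

	dma_list_lock_acquire();
	for (pp = &dma_client_list; *pp; pp = &(*pp)->next) {
		if (*pp == client) {
			*pp = client->next;
			break;
		}
	}
	dma_list_lock_release();

	free(client);
}
```

The point of the shape is that every access to the shared list happens
between acquire and release, while the sleeping allocation stays
outside the critical section.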



> +dma_cookie_t
> +dma_async_memcpy_buf_to_buf(
> +	struct dma_chan *chan,
> +	void *dest,
> +	void *src,
> +	size_t len)
> +{
> +	chan->bytes_transferred += len;
> +	chan->memcpy_count++;
> +
> +	return chan->device->device_memcpy_buf_to_buf(chan, dest, src, len);
> +}
> +
> +dma_cookie_t
> +dma_async_memcpy_buf_to_pg(
> +	struct dma_chan *chan,
> +	struct page *page,
> +	unsigned int offset,
> +	void *kdata,
> +	size_t len)
> +{
> +	chan->bytes_transferred += len;
> +	chan->memcpy_count++;
> +
> +	return chan->device->device_memcpy_buf_to_pg(chan, page, offset, kdata, len);
> +}
> +
> +dma_cookie_t
> +dma_async_memcpy_pg_to_pg(
> +	struct dma_chan *chan,
> +	struct page *dest_pg,
> +	unsigned int dest_off,
> +	struct page *src_pg,
> +	unsigned int src_off,
> +	size_t len)
> +{
> +	chan->bytes_transferred += len;
> +	chan->memcpy_count++;
> +
> +	return chan->device->device_memcpy_pg_to_pg(chan, dest_pg, dest_off,
> +		src_pg, src_off, len);
> +}
> +
> +void
> +dma_async_memcpy_issue_pending(struct dma_chan *chan)
> +{
> +	return chan->device->device_memcpy_issue_pending(chan);
> +}
> +
> +enum dma_status_t
> +dma_async_memcpy_complete(struct dma_chan *chan, dma_cookie_t cookie, dma_cookie_t *last, dma_cookie_t *used)
> +{
> +	return chan->device->device_memcpy_complete(chan, cookie, last, used);
> +}

Making these 'static inline' might be a good idea?


> +int
> +dma_async_device_register(struct dma_device *device)
> +{
> +	static int id;
> +	int chancnt = 0;
> +	struct dma_chan* chan;
> +
> +	if (!device)
> +		return -ENODEV;
> +
> +	list_add_entry_tail(device, &dma_device_list, global_node);
> +
> +	dma_chans_rebalance();
> +
> +	device->dev_id = id++;
> +
> +	/* represent channels in sysfs. Probably want devs too */
> +	list_for_each_entry(chan, &device->channels, device_node) {
> +		chan->chan_id = chancnt++;
> +		chan->class_dev.class = &dma_devclass;
> +		chan->class_dev.dev = NULL;
> +		snprintf(chan->class_dev.class_id, BUS_ID_SIZE, "dma%dchan%d",
> +			device->dev_id, chan->chan_id);
> +
> +		chan->min_copy_size = DMA_DEFAULT_MIN_COPY_SIZE;
> +		class_device_register(&chan->class_dev);
> +	}
> +
> +	return 0;
> +}
> +
> +void
> +dma_async_device_unregister(struct dma_device* device)
> +{
> +	struct dma_chan *chan;
> +
> +	BUG_ON(!device);
> +
> +	list_for_each_entry(chan, &device->channels, device_node) {
> +		if (chan->client) {
> +			list_del(&chan->client_node);
> +			chan->client->chan_count--;
> +			chan->client->event_callback(chan->client, chan, DMA_RESOURCE_REMOVED);
> +			dma_client_chan_free(chan);
> +		}
> +		class_device_unregister(&chan->class_dev);
> +	}
> +
> +	list_del(&device->global_node);
> +
> +	dma_chans_rebalance();
> +}
> +
> +static struct workqueue_struct *dma_wait_wq;
> +static LIST_HEAD(dma_poll_list);
> +
> +enum dma_status_t
> +dma_async_wait_for_completion(struct dma_chan *chan, dma_cookie_t cookie)
> +{
> +	while (dma_async_memcpy_complete(chan, cookie, NULL, NULL) == DMA_IN_PROGRESS)
> +		schedule();

1) Is it worth adding a loop above the 'while', which does

	retries = 5
	while (operation == in progress &&
	       retries-- > 0)
		udelay(1)

2) at that point, perhaps replace schedule() with schedule_timeout(1). 
WARNING:  this might introduce too much latency, and be a bad idea.
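
The two-phase wait suggested above (a few cheap polls, then give up the
CPU between polls) can be sketched in user space as below. The names
dma_wait(), dma_poll_fn and fake_poll() are assumptions for the
example; sched_yield() stands in for schedule(), the short spin has no
udelay(1) because none exists outside the kernel, and unlike the posted
function this sketch returns the final status instead of always
returning DMA_SUCCESS.

```c
#include <assert.h>
#include <sched.h>

enum dma_status_t {
	DMA_SUCCESS,
	DMA_IN_PROGRESS,
	DMA_ERROR,
};

/* Hypothetical poll hook standing in for dma_async_memcpy_complete(). */
typedef enum dma_status_t (*dma_poll_fn)(void *ctx);

#define DMA_SPIN_RETRIES 5

static enum dma_status_t dma_wait(dma_poll_fn poll, void *ctx,
				  unsigned int *yields)
{
	int retries = DMA_SPIN_RETRIES;
	enum dma_status_t status;

	assert(poll);

	/* Phase 1: a handful of back-to-back polls for short copies. */
	while ((status = poll(ctx)) == DMA_IN_PROGRESS && retries-- > 0)
		;	/* the kernel version would udelay(1) here */

	/* Phase 2: still busy, so let other work run between polls. */
	while (status == DMA_IN_PROGRESS) {
		if (yields)
			(*yields)++;
		sched_yield();
		status = poll(ctx);
	}

	return status;
}

/* Test double: reports DMA_IN_PROGRESS for the next *ctx polls. */
static enum dma_status_t fake_poll(void *ctx)
{
	unsigned int *remaining = ctx;

	if (*remaining == 0)
		return DMA_SUCCESS;
	(*remaining)--;
	return DMA_IN_PROGRESS;
}
```

A copy that finishes within the first few polls never yields; a longer
one pays the spin once and then yields on every further poll, which is
where the latency concern in the warning comes from.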


> +	return DMA_SUCCESS;
> +}
> +
> +#if 0
> +static void
> +dma_poll(void *data)
> +{
> +	struct dma_completion *comp = data;
> +
> +	comp->status = dma_memcpy_complete(comp->chan, comp->cookie);
> +	while (comp->status == DMA_IN_PROGRESS) {
> +		comp->chan->device->device_arm_interrupt(comp->chan);
> +		wait_for_completion(&__get_cpu_var(kick_dma_poll));
> +		comp->status = dma_memcpy_complete(comp->chan, comp->cookie);
> +	}
> +	complete(&comp->comp);
> +}
> +
> +enum dma_status_t
> +dma_wait_for_completion(struct dma_chan *chan, dma_cookie_t cookie)
> +{
> +	enum dma_status_t status;
> +	DECLARE_DMA_COMPLETION(comp, chan, cookie);
> +	DECLARE_WORK(dma_wait_work, dma_poll, &comp);
> +
> +	BUG_ON(in_interrupt());
> +
> +	status = dma_memcpy_complete(chan, cookie);
> +	if (status != DMA_IN_PROGRESS)
> +		return status;
> +
> +	queue_work(dma_wait_wq, &dma_wait_work);
> +	wait_for_completion(&comp.comp);
> +	return comp.status;
> +}
> +#endif

is this for future use?  never to be used?


> +static int __init dma_bus_init(void)
> +{
> +	int cpu;
> +
> +	dma_wait_wq = create_workqueue("dmapoll");

dma_wait_wq is never used, due to #if 0


> +	for_each_online_cpu(cpu) {
> +		init_completion(&per_cpu(kick_dma_poll, cpu));
> +	}
> +	return class_register(&dma_devclass);
> +}
> +
> +subsys_initcall(dma_bus_init);
> +
> +EXPORT_SYMBOL(dma_async_client_register);
> +EXPORT_SYMBOL(dma_async_client_unregister);
> +EXPORT_SYMBOL(dma_async_client_chan_request);
> +EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf);
> +EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg);
> +EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg);
> +EXPORT_SYMBOL(dma_async_memcpy_complete);
> +EXPORT_SYMBOL(dma_async_memcpy_issue_pending);
> +EXPORT_SYMBOL(dma_async_device_register);
> +EXPORT_SYMBOL(dma_async_device_unregister);
> +EXPORT_SYMBOL(dma_async_wait_for_completion);
> +EXPORT_PER_CPU_SYMBOL(kick_dma_poll);
> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
> new file mode 100644
> index 0000000..7b4f58b
> --- /dev/null
> +++ b/include/linux/dmaengine.h
> @@ -0,0 +1,268 @@
> +/*******************************************************************************
> +
> +  
> +  Copyright(c) 2004 - 2005 Intel Corporation. All rights reserved.
> +  
> +  This program is free software; you can redistribute it and/or modify it 
> +  under the terms of the GNU General Public License as published by the Free 
> +  Software Foundation; either version 2 of the License, or (at your option) 
> +  any later version.
> +  
> +  This program is distributed in the hope that it will be useful, but WITHOUT 
> +  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 
> +  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for 
> +  more details.
> +  
> +  You should have received a copy of the GNU General Public License along with
> +  this program; if not, write to the Free Software Foundation, Inc., 59 
> +  Temple Place - Suite 330, Boston, MA  02111-1307, USA.
> +  
> +  The full GNU General Public License is included in this distribution in the
> +  file called LICENSE.
> +  
> +*******************************************************************************/
> +
> +
> +#ifndef DMAENGINE_H
> +#define DMAENGINE_H
> +
> +#include <linux/device.h>
> +#include <linux/uio.h>
> +#include <linux/skbuff.h>
> +
> +DECLARE_PER_CPU(struct completion, kick_dma_poll);
> +
> +#define DMA_DEFAULT_MIN_COPY_SIZE 16
> +
> +/**
> + * enum dma_event_t - resource PNP/power management events
> + * @DMA_RESOURCE_SUSPEND: DMA device going into low power state
> + * @DMA_RESOURCE_RESUME: DMA device returning to full power
> + * @DMA_RESOURCE_ADDED: DMA device added to the system
> + * @DMA_RESOURCE_REMOVED: DMA device removed from the system
> + */
> +enum dma_event_t {
> +	DMA_RESOURCE_SUSPEND,
> +	DMA_RESOURCE_RESUME,
> +	DMA_RESOURCE_ADDED,
> +	DMA_RESOURCE_REMOVED,
> +};
> +
> +/**
> + * typedef dma_cookie_t
> + *
> + * if dma_cookie_t is >0 it's a DMA request cookie, <0 it's an error code
> + */
> +typedef s32 dma_cookie_t;

More natural to use [signed] long?  i.e. a machine int.  Or _must_ this 
match hardware somewhere?


> +/*#define dma_submit_error(cookie) ((cookie) < 0 ? 1 : 0)*/
> +
> +/**
> + * enum dma_status_t - DMA transaction status
> + * @DMA_SUCCESS: transaction completed successfully
> + * @DMA_IN_PROGRESS: transaction not yet processed
> + * @DMA_ERROR: transaction failed
> + */
> +enum dma_status_t {
> +	DMA_SUCCESS,
> +	DMA_IN_PROGRESS,
> +	DMA_ERROR,
> +};
> +
> +/**
> + * struct dma_chan - devices supply DMA channels, clients use them
> + * @client: ptr to the client user of this chan, will be NULL when unused
> + * @device: ptr to the dma device who supplies this channel, always !NULL
> + * @client_node: used to add this to the client chan list
> + * @device_node: used to add this to the device chan list
> + */
> +struct dma_chan
> +{
> +	struct dma_client *client;
> +	struct dma_device *device;
> +	dma_cookie_t cookie;
> +
> +	/* sysfs */
> +	int chan_id;
> +	struct class_device class_dev;
> +
> +	/* stats */
> +	unsigned long memcpy_count;
> +	unsigned long bytes_transferred;
> +	unsigned int min_copy_size;

very very minor nit, but it bugs me at least:  the stats variables 
strike me as overly long and verbose.


> +	struct list_head client_node;
> +	struct list_head device_node;
> +
> +	cpumask_t cpumask;
> +};
> +
> +/*
> + * typedef dma_event_callback - function pointer to a DMA event callback
> + */
> +typedef void (*dma_event_callback) (struct dma_client *client, struct dma_chan *chan, enum dma_event_t event);
> +
> +/**
> + * struct dma_client - info on the entity making use of DMA services
> + * @event_callback: func ptr to call when something happens
> + * @chan_count: number of chans allocated
> + * @chans_desired: number of chans requested. Can be +- chan_count
> + * @port: upstream DMA port from the client's PCI device
> + * @channels: the list of DMA channels allocated
> + * @global_node: list_head for global dma_client_list
> + */
> +struct dma_client {
> +	dma_event_callback	event_callback;
> +	unsigned int		chan_count;
> +	unsigned int		chans_desired;
> +
> +	/* TODO keep some stats */
> +	struct list_head	channels;
> +	struct list_head	global_node;
> +};
> +
> +/**
> + * struct dma_device - info on the entity supplying DMA services
> + * @chancnt: how many DMA channels are supported
> + * @channels: the list of struct dma_chan
> + * @global_node: list_head for global dma_device_list
> + * Other func ptrs: used to make use of this device's capabilities
> + */
> +struct dma_device {
> +
> +	unsigned int chancnt;
> +	struct list_head channels;
> +	struct list_head global_node;
> +
> +	int dev_id;
> +	/*struct class_device class_dev;*/
> +
> +	int (*device_alloc_chan_resources)(struct dma_chan *chan);
> +	void (*device_free_chan_resources)(struct dma_chan *chan);
> +	dma_cookie_t (*device_memcpy_buf_to_buf)(struct dma_chan *chan, void *dest,
> +		void *src, size_t len);
> +	dma_cookie_t (*device_memcpy_buf_to_pg)(struct dma_chan *chan, struct page *page,
> +		unsigned int offset, void *kdata, size_t len);
> +	dma_cookie_t (*device_memcpy_pg_to_pg)(struct dma_chan *chan, struct page *dest_pg,
> +		unsigned int dest_off, struct page *src_pg, unsigned int src_off,
> +		size_t len);
> +	void (*device_arm_interrupt)(struct dma_chan *chan);
> +	enum dma_status_t (*device_memcpy_complete)(struct dma_chan *chan, dma_cookie_t cookie, dma_cookie_t *last, dma_cookie_t *used);
> +	void (*device_memcpy_issue_pending)(struct dma_chan *chan);

names feel a bit long

* Re: [RFC: 2.6 patch] remove drivers/net/eepro100.c
From: David S. Miller @ 2005-11-23 22:39 UTC
  To: rmk+lkml; +Cc: jgarzik, bunk, saw, linux-kernel, netdev
In-Reply-To: <20051123221547.GM15449@flint.arm.linux.org.uk>

From: Russell King <rmk+lkml@arm.linux.org.uk>
Date: Wed, 23 Nov 2005 22:15:48 +0000

> I leave it up to you how to proceed.  Effectively I'm now completely
> out of the loop on this with no hardware to worry about.  Sorry.
> 
> Finally, please don't assign any blame for this in my direction; I
> reported it and I kept bugging people about it, and in spite of my
> best efforts there was very little which was forthcoming.  Obviously
> that wasn't enough.

I think you're being unreasonable.

They've worked on a fix for the problem, and now you're unable to test
the fix, and you're angry at them because they took so long to code up
the fix.

If you're overextended and have too much work to do and that's
stressing you out, that doesn't give you permission to take it
out on other people.

* Re: [RFC] [PATCH 0/3] ioat: DMA engine support
From: Andi Kleen @ 2005-11-23 22:30 UTC
  To: Jeff Garzik
  Cc: Andrew Grover, netdev, linux-kernel, john.ronciak,
	christopher.leech
In-Reply-To: <4384E7F2.2030508@pobox.com>

On Wed, Nov 23, 2005 at 05:06:42PM -0500, Jeff Garzik wrote:
> IOAT is super-neat stuff.

The main problem I see is that it'll likely only pay off when you can keep 
the queue of copies long (to amortize the cost of 
talking to an external chip). At least for the standard recvmsg 
skb->user space, user space-> skb cases these queues are 
likely short in most cases. That's because most applications
do relatively small recvmsg or sendmsgs. 

It definitely will need a threshold under which it is disabled.
With bad luck the threshold will be high enough that it doesn't
help very often :/
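
A rough sketch of such a threshold, in user-space C with the CPU
standing in for the engine. The helper names (copy_with_threshold,
offload_fn, fake_engine_copy, copy_stats) are assumptions for
illustration; the patch itself only shows a per-channel min_copy_size
field with a default of 16.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct copy_stats {
	unsigned long cpu_copies;
	unsigned long offloaded_copies;
};

/* Hypothetical offload hook; returns true if the engine took the copy. */
typedef bool (*offload_fn)(void *dest, const void *src, size_t len);

/*
 * Offload only when an engine is present and the copy is at least
 * min_offload_len bytes; otherwise (or if the engine declines) copy on
 * the CPU.
 */
static void copy_with_threshold(void *dest, const void *src, size_t len,
				size_t min_offload_len, offload_fn offload,
				struct copy_stats *stats)
{
	assert(stats);

	if (offload && len >= min_offload_len && offload(dest, src, len)) {
		stats->offloaded_copies++;
		return;
	}

	memcpy(dest, src, len);
	stats->cpu_copies++;
}

/* Stand-in engine: performs the copy synchronously on the CPU. */
static bool fake_engine_copy(void *dest, const void *src, size_t len)
{
	memcpy(dest, src, len);
	return true;
}
```

The counters make it easy to see how often a given threshold actually
lets the engine be used for a particular workload.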

Longer term the right way to handle this would be likely to use
POSIX AIO on sockets. With that interface it would be easier
to keep long queues of data in flight, which would be best for
the DMA engine.

> In addition to helping speed up network RX, I would like to see how 
> possible it is to experiment with IOAT uses outside of networking. 
> Sample ideas:  VM page pre-zeroing.  ATA PIO data xfers (async copy to 
> static buffer, to dramatically shorten length of kmap+irqsave time). 
> Extremely large memcpy() calls.

Another proposal was swiotlb.

But it's not clear it's a good idea: a lot of these applications prefer to 
have the target in cache. And IOAT will force it out of cache.

> Additionally, current IOAT is memory->memory.  I would love to be able 
> to convince Intel to add transforms and checksums, to enable offload of 
> memory->transform->memory and memory->checksum->result operations like 
> sha-{1,256} hashing[1], crc32*, aes crypto, and other highly common 
> operations.  All of that could be made async.

I remember the registers in the Amiga Blitter for this and I'm
still scared... Maybe it's better to keep it simple.

-Andi

* Re: [RFC: 2.6 patch] remove drivers/net/eepro100.c
From: Russell King @ 2005-11-23 22:24 UTC
  To: Jeff Garzik, Adrian Bunk, saw, linux-kernel, netdev,
	David S. Miller
In-Reply-To: <20051123221547.GM15449@flint.arm.linux.org.uk>

On Wed, Nov 23, 2005 at 10:15:48PM +0000, Russell King wrote:
> On Fri, Nov 18, 2005 at 11:12:28AM -0500, Jeff Garzik wrote:
> > Russell King wrote:
> > >On Fri, Nov 18, 2005 at 04:33:02AM +0100, Adrian Bunk wrote:
> > >
> > >>This patch removes the obsolete drivers/net/eepro100.c driver.
> > >>
> > >>Is there any reason why it should be kept?
> > >
> > >
> > >It's the only driver which works correctly on ARM CPUs.  e100 is
> > >basically buggy.  This has been discussed here on lkml and more
> > >recently on linux-netdev.  If anyone has any further questions
> > >please read the archives of those two lists.
> > 
> > After reading the archives, one discovers the current status is:
> > 
> > 	waiting on ARM folks to test e100
> > 
> > Latest reference is public message-id <4371A373.6000308@pobox.com>, 
> > which was CC'd to you.
> > 
> > There is a patch in netdev-2.6.git#e100-sbit and in Andrew's -mm tree 
> > that should solve the ARM problems, and finally allow us to kill 
> > eepro100.  But it's waiting for feedback...
> 
> Well, I've run 2.6.15-rc2 on what I think was the ARM platform which
> exhibited the problem, but it doesn't show up.  However, that's
> meaningless as it has been literally _years_ (4 or more) since the
> problem was reported.  It's rather unsurprising that I can't reproduce
> it - I don't even know if I'm using the right processor module!

Additionally, looking back at my 30th June 2004 message, I don't
think I've even managed sufficient testing to make any claim of
working-ness or non-working-ness against either driver.

The test was merely a "did it successfully BOOTP" because I can't
get it to mount and run /sbin/init from the jffs2 rootfs, which
2.5.70 was perfectly happy to do earlier today.  However, the
failure point seemed to be when NFS tried to use the card.

Whether that means I was or was not using BOOTP back in 2004...
your guess is as good as mine.

Anyway, that's the end of the issue as far as I'm concerned.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core


* Re: [RFC: 2.6 patch] remove drivers/net/eepro100.c
From: Russell King @ 2005-11-23 22:15 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Adrian Bunk, saw, linux-kernel, netdev, David S. Miller
In-Reply-To: <437DFD6C.1020106@pobox.com>

On Fri, Nov 18, 2005 at 11:12:28AM -0500, Jeff Garzik wrote:
> Russell King wrote:
> >On Fri, Nov 18, 2005 at 04:33:02AM +0100, Adrian Bunk wrote:
> >
> >>This patch removes the obsolete drivers/net/eepro100.c driver.
> >>
> >>Is there any reason why it should be kept?
> >
> >
> >It's the only driver which works correctly on ARM CPUs.  e100 is
> >basically buggy.  This has been discussed here on lkml and more
> >recently on linux-netdev.  If anyone has any further questions
> >please read the archives of those two lists.
> 
> After reading the archives, one discovers the current status is:
> 
> 	waiting on ARM folks to test e100
> 
> Latest reference is public message-id <4371A373.6000308@pobox.com>, 
> which was CC'd to you.
> 
> There is a patch in netdev-2.6.git#e100-sbit and in Andrew's -mm tree 
> that should solve the ARM problems, and finally allow us to kill 
> eepro100.  But it's waiting for feedback...

Well, I've run 2.6.15-rc2 on what I think was the ARM platform which
exhibited the problem, but it doesn't show up.  However, that's
meaningless as it has been literally _years_ (4 or more) since the
problem was reported.  It's rather unsurprising that I can't reproduce
it - I don't even know if I'm using the right processor module!

It's far too late to try swapping modules and rebuilding everything
(if I can find the code for the boot loader still.)  Also since this
platform is being returned to whence it came tomorrow, I lose the
ability to test this.

I've been struggling all day trying to get the kernel back up and
running on this hardware due to various issues with sizes of kernels
and mtd/jffs2 incompatibilities with the jffs2 filesystem in flash.

So, all in all, resolving this issue has taken far too long and it is
now far too late to do any kind of positive testing.

I leave it up to you how to proceed.  Effectively I'm now completely
out of the loop on this with no hardware to worry about.  Sorry.

Finally, please don't assign any blame for this in my direction; I
reported it and I kept bugging people about it, and in spite of my
best efforts there was very little which was forthcoming.  Obviously
that wasn't enough.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core


* Re: [RFC] [PATCH 0/3] ioat: DMA engine support
From: Jeff Garzik @ 2005-11-23 22:06 UTC (permalink / raw)
  To: Andrew Grover; +Cc: netdev, linux-kernel, john.ronciak, christopher.leech
In-Reply-To: <Pine.LNX.4.44.0511231143380.32487-100000@isotope.jf.intel.com>

Andrew Grover wrote:
> As presented in our talk at this year's OLS, the Bensley platform, which 
> will be out in early 2006, will have an asynchronous DMA engine. It can be 
> used to offload copies from the CPU, such as the kernel copies of received 
> packets into the user buffer.

IOAT is super-neat stuff.

In addition to helping speed up network RX, I would like to see how 
possible it is to experiment with IOAT uses outside of networking. 
Sample ideas:  VM page pre-zeroing.  ATA PIO data xfers (async copy to 
static buffer, to dramatically shorten length of kmap+irqsave time). 
Extremely large memcpy() calls.

Additionally, current IOAT is memory->memory.  I would love to be able 
to convince Intel to add transforms and checksums, to enable offload of 
memory->transform->memory and memory->checksum->result operations like 
sha-{1,256} hashing[1], crc32*, aes crypto, and other highly common 
operations.  All of that could be made async.

	Jeff
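
For illustration only: one way to picture the copy/transform/checksum idea
above is an operation code per request plus a capability mask per engine.
Every name below (sk_op, sk_engine, sk_request, sk_submit) is hypothetical
and not part of the posted patches. The work is done synchronously on the
CPU, and the plain byte sum merely stands in for a real checksum such as
crc32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical operation codes; one bit each so they can form a mask. */
enum sk_op {
	SK_OP_COPY = 1 << 0,	/* memory -> memory */
	SK_OP_ZERO = 1 << 1,	/* clear a destination buffer */
	SK_OP_SUM  = 1 << 2	/* memory -> checksum-like result */
};

/* Hypothetical engine: advertises the operations it can perform. */
struct sk_engine {
	unsigned int op_mask;
};

/* Hypothetical request descriptor. */
struct sk_request {
	enum sk_op op;
	void *dest;		/* unused for SK_OP_SUM */
	const void *src;	/* unused for SK_OP_ZERO */
	size_t len;
	uint32_t result;	/* filled in by SK_OP_SUM */
};

/*
 * Run one request on the CPU. Returns 0 on success, or -1 when the
 * engine does not advertise the requested operation, so the caller can
 * fall back to a software path.
 */
static int sk_submit(const struct sk_engine *eng, struct sk_request *req)
{
	const uint8_t *p = req->src;
	size_t i;

	assert(eng != NULL && req != NULL);

	if (!(eng->op_mask & req->op))
		return -1;

	switch (req->op) {
	case SK_OP_COPY:
		memcpy(req->dest, req->src, req->len);
		break;
	case SK_OP_ZERO:
		memset(req->dest, 0, req->len);
		break;
	case SK_OP_SUM:
		req->result = 0;
		for (i = 0; i < req->len; i++)
			req->result += p[i];
		break;
	}
	return 0;
}
```

A caller would test the return value and use an ordinary memcpy() or
software checksum when the engine lacks the operation.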


* Re: [RFC] [PATCH 1/3] ioat: DMA subsystem
From: David S. Miller @ 2005-11-23 21:51 UTC (permalink / raw)
  To: andrew.grover; +Cc: netdev, linux-kernel, john.ronciak, christopher.leech
In-Reply-To: <Pine.LNX.4.44.0511231207410.32487-100000@isotope.jf.intel.com>

Please provide a complete and detailed changelog message for each
patch and an introductory email explaining the top-level purpose of
these changes.

Yes, I personally know what these changes are all about, but not
everyone does.

Sending a bunch of non-descript patches to the list is always a very
bad idea, and will result in little, if any, patch review.


* [RFC] [PATCH 3/3] ioat: testclient
From: Andrew Grover @ 2005-11-23 20:26 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: john.ronciak, christopher.leech


diff --git a/drivers/dma/testclient.c b/drivers/dma/testclient.c
new file mode 100644
index 0000000..9bfb979
--- /dev/null
+++ b/drivers/dma/testclient.c
@@ -0,0 +1,132 @@
+/*******************************************************************************
+
+  
+  Copyright(c) 2004 - 2005 Intel Corporation. All rights reserved.
+  
+  This program is free software; you can redistribute it and/or modify it 
+  under the terms of the GNU General Public License as published by the Free 
+  Software Foundation; either version 2 of the License, or (at your option) 
+  any later version.
+  
+  This program is distributed in the hope that it will be useful, but WITHOUT 
+  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 
+  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for 
+  more details.
+  
+  You should have received a copy of the GNU General Public License along with
+  this program; if not, write to the Free Software Foundation, Inc., 59 
+  Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+  
+  The full GNU General Public License is included in this distribution in the
+  file called LICENSE.
+  
+*******************************************************************************/
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/device.h>
+#include <linux/dmaengine.h>
+#include <linux/delay.h>
+#include <asm/io.h>
+
+/* MODULE API */
+
+static volatile u8 *buffer1;
+static volatile u8 *buffer2;
+
+struct dma_client *test_dma_client;
+struct dma_chan *test_dma_chan;
+static dma_cookie_t cookie;
+
+void
+test_added_chan(void)
+{
+	int i;
+
+	printk("buffer1 = %p\n", buffer1);
+	printk("buffer2 = %p\n", buffer2);
+	for (i = 0; i < 20; i+=4)
+		printk("%u %u %u %u\n", buffer2[i], buffer2[i+1], buffer2[i+2], buffer2[i+3]);
+
+//	for (i = 0; i < 10; i++) {
+	cookie = dma_async_memcpy_buf_to_buf(test_dma_chan, 
+		(void *)buffer2,
+		(void *)buffer1,
+		2000);
+	dma_async_memcpy_issue_pending(test_dma_chan);
+//	}
+//	printk("dma cookie = %i\n", cookie);
+	if (dma_async_memcpy_complete(test_dma_chan, cookie, NULL, NULL) == DMA_IN_PROGRESS)
+		printk("DMA cookie == IN PROGRESS\n");
+	else
+		printk("DMA cookie == SUCCESS\n");
+#if 0
+	for (i = 0; i < 1000; i++) {
+		if (buffer2[1] != 0)
+			break;
+		mdelay(1);
+	}
+	printk("i = %d\n", i);
+	for (i = 0; i < 20; i+=4)
+		printk("%u %u %u %u\n", buffer2[i], buffer2[i+1], buffer2[i+2], buffer2[i+3]);
+	for (i = 0; i < 20; i+=4)
+		printk("%u %u %u %u\n", buffer1[i], buffer1[i+1], buffer1[i+2], buffer1[i+3]);
+#endif
+}
+
+void test_dma_event(struct dma_client *client, struct dma_chan *chan, enum dma_event_t event)
+{
+	switch (event) {
+	case DMA_RESOURCE_ADDED:
+		test_dma_chan = chan;
+		test_added_chan();
+		break;
+	case DMA_RESOURCE_REMOVED:
+		test_dma_chan = NULL;
+		break;
+	default:
+		break;
+	}
+}
+
+static int __init
+testclient_init_module(void)
+{
+	int i;
+
+	buffer1 = kmalloc(sizeof(u8) * 2000, SLAB_KERNEL);
+	buffer2 = kmalloc(sizeof(u8) * 2000, SLAB_KERNEL);
+
+	memset((void *)buffer2, 0, 2000);
+	for (i = 0; i < 2000; i++)
+		buffer1[i] = i;
+
+	test_dma_client = dma_async_client_register(test_dma_event);
+	if (!test_dma_client) {
+		printk(KERN_ERR "Could not register dma client!\n");
+		return -ENOMEM;
+	}
+
+	dma_async_client_chan_request(test_dma_client, 1);
+
+	return 0;
+}
+
+module_init(testclient_init_module);
+
+static void __exit
+testclient_exit_module(void)
+{
+	int i;
+	for (i = 0; i < 20; i+=4)
+		printk("%u %u %u %u\n", buffer2[i], buffer2[i+1], buffer2[i+2], buffer2[i+3]);
+	if (dma_async_memcpy_complete(test_dma_chan, cookie, NULL, NULL) == DMA_SUCCESS)
+		printk("DMA cookie == SUCCESS\n");
+	else
+		printk("DMA cookie == IN PROGRESS\n");
+
+	dma_async_client_unregister(test_dma_client);
+}
+
+module_exit(testclient_exit_module);
+
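
The flow the test client exercises (fill a source buffer with a pattern,
zero the destination, queue one copy, kick the channel, then check the
cookie) can be tried in user space without hardware. The sketch below is
only an approximation under stated assumptions: the mock_* names are
hypothetical stand-ins rather than the dmaengine API from this series, and
the "engine" is just memcpy() run when pending work is issued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_LEN     2000	/* same length the test client copies */
#define MAX_PENDING 8		/* arbitrary queue depth for this sketch */

/* One queued copy request. */
struct mock_copy {
	void *dest;
	const void *src;
	size_t len;
};

/* Hypothetical software-only channel; not the patch's struct dma_chan. */
struct mock_chan {
	struct mock_copy pending[MAX_PENDING];
	int nr_pending;
	int last_issued;	/* cookie handed out most recently */
	int last_completed;	/* highest cookie whose copy has finished */
};

/* Queue a copy. Nothing moves yet. Returns a positive cookie or -1. */
static int mock_memcpy_buf_to_buf(struct mock_chan *chan, void *dest,
				  const void *src, size_t len)
{
	struct mock_copy *c;

	assert(chan != NULL && dest != NULL && src != NULL);

	if (chan->nr_pending == MAX_PENDING)
		return -1;

	c = &chan->pending[chan->nr_pending++];
	c->dest = dest;
	c->src = src;
	c->len = len;
	return ++chan->last_issued;
}

/* Stand-in for the hardware: perform every queued copy on the CPU. */
static void mock_issue_pending(struct mock_chan *chan)
{
	int i;

	for (i = 0; i < chan->nr_pending; i++)
		memcpy(chan->pending[i].dest, chan->pending[i].src,
		       chan->pending[i].len);
	chan->nr_pending = 0;
	chan->last_completed = chan->last_issued;
}

/* Nonzero once the copy identified by cookie has finished. */
static int mock_is_complete(const struct mock_chan *chan, int cookie)
{
	return cookie > 0 && cookie <= chan->last_completed;
}

/*
 * Mirror of the test client: pattern in buffer1, zeroes in buffer2, one
 * copy. Returns 0 if every step behaved as expected, otherwise the
 * number of the step that failed.
 */
static int run_copy_test(void)
{
	static uint8_t buffer1[BUF_LEN], buffer2[BUF_LEN];
	struct mock_chan chan;
	int i, cookie;

	memset(&chan, 0, sizeof(chan));
	for (i = 0; i < BUF_LEN; i++)
		buffer1[i] = (uint8_t)i;
	memset(buffer2, 0, sizeof(buffer2));

	cookie = mock_memcpy_buf_to_buf(&chan, buffer2, buffer1, BUF_LEN);
	if (cookie <= 0)
		return 1;
	if (mock_is_complete(&chan, cookie))
		return 2;	/* must still be in progress here */

	mock_issue_pending(&chan);
	if (!mock_is_complete(&chan, cookie))
		return 3;

	return memcmp(buffer1, buffer2, BUF_LEN) == 0 ? 0 : 4;
}
```

Calling run_copy_test() from a small main() and checking for 0 is enough
to exercise the queue/issue/complete sequence.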


* [RFC] [PATCH 2/3] ioat: user buffer pin; net DMA client register
From: Andrew Grover @ 2005-11-23 20:26 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: john.ronciak, christopher.leech


diff --git a/net/core/Makefile b/net/core/Makefile
index 630da0f..d02132b 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -8,7 +8,8 @@ obj-y := sock.o request_sock.o skbuff.o 
 obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 
 obj-y		     += dev.o ethtool.o dev_mcast.o dst.o \
-			neighbour.o rtnetlink.o utils.o link_watch.o filter.o
+			neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
+			user_dma.o
 
 obj-$(CONFIG_XFRM) += flow.o
 obj-$(CONFIG_SYSFS) += net-sysfs.o
diff --git a/net/core/dev.c b/net/core/dev.c
index a44eeef..a81bee8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -113,6 +113,7 @@
 #include <linux/wireless.h>		/* Note : will define WIRELESS_EXT */
 #include <net/iw_handler.h>
 #endif	/* CONFIG_NET_RADIO */
+#include <linux/dmaengine.h>
 #include <asm/current.h>
 
 /*
@@ -147,6 +148,12 @@ static DEFINE_SPINLOCK(ptype_lock);
 static struct list_head ptype_base[16];	/* 16 way hashed list */
 static struct list_head ptype_all;		/* Taps */
 
+#ifdef CONFIG_NET_DMA
+struct dma_client *net_dma_client;
+DEFINE_PER_CPU(struct dma_chan *, net_dma);
+static unsigned int net_dma_count;
+#endif
+
 /*
  * The @dev_base list is protected by @dev_base_lock and the rtln
  * semaphore.
@@ -1708,6 +1715,9 @@ static void net_rx_action(struct softirq
 	unsigned long start_time = jiffies;
 	int budget = netdev_budget;
 	void *have;
+#ifdef CONFIG_NET_DMA
+	struct dma_chan *chan;
+#endif
 
 	local_irq_disable();
 
@@ -1739,6 +1749,10 @@ static void net_rx_action(struct softirq
 		}
 	}
 out:
+#ifdef CONFIG_NET_DMA
+	list_for_each_entry(chan, &net_dma_client->channels, client_node)
+		dma_async_memcpy_issue_pending(chan);
+#endif
 	local_irq_enable();
 	return;
 
@@ -3171,6 +3185,68 @@ static int dev_cpu_callback(struct notif
 }
 #endif /* CONFIG_HOTPLUG_CPU */
 
+#ifdef CONFIG_NET_DMA
+static void net_dma_rebalance(void)
+{
+	unsigned int cpu, i, n;
+	struct dma_chan *chan;
+
+	lock_cpu_hotplug();
+
+	if (net_dma_count == 0) {
+		for_each_online_cpu(cpu)
+			per_cpu(net_dma, cpu) = NULL;
+		unlock_cpu_hotplug();
+		return;
+	}
+
+	i = 0;
+	cpu = first_cpu(cpu_online_map);
+
+	list_for_each_entry(chan, &net_dma_client->channels, client_node) {
+		/* cpus_clear(chan->cpumask); */
+		n = ((num_online_cpus() / net_dma_count) + (i < (num_online_cpus() % net_dma_count) ? 1 : 0));
+
+		while(n) {
+			per_cpu(net_dma, cpu) = chan;
+			/* cpu_set(cpu, chan->cpumask); */
+			cpu = next_cpu(cpu, cpu_online_map);
+			n--;
+		}
+		i++;
+	}
+
+	unlock_cpu_hotplug();
+}
+
+static void netdev_dma_event(struct dma_client *client, struct dma_chan *chan, enum dma_event_t event)
+{
+	switch (event) {
+	case DMA_RESOURCE_ADDED:
+		net_dma_count++;
+		net_dma_rebalance();
+		break;
+	case DMA_RESOURCE_REMOVED:
+		net_dma_count--;
+		net_dma_rebalance();
+		break;
+	default:
+		break;
+	}
+}
+
+static int __init netdev_dma_register(void)
+{
+	net_dma_client = dma_async_client_register(netdev_dma_event);
+
+	dma_async_client_chan_request(net_dma_client, num_online_cpus());
+
+	return 0;
+}
+
+#else
+static int __init netdev_dma_register(void) { return -ENODEV; }
+#endif /* CONFIG_NET_DMA */
 
 /*
  *	Initialize the DEV module. At boot time this walks the device list and
@@ -3224,6 +3300,8 @@ static int __init net_dev_init(void)
 		atomic_set(&queue->backlog_dev.refcnt, 1);
 	}
 
+	netdev_dma_register();
+
 	dev_boot_phase = 0;
 
 	open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);
diff --git a/net/core/user_dma.c b/net/core/user_dma.c
new file mode 100644
index 0000000..958e2c8
--- /dev/null
+++ b/net/core/user_dma.c
@@ -0,0 +1,422 @@
+/*
+  Copyright(c) 2004 - 2005 Intel Corporation
+  Portions based on net/core/datagram.c and copyrighted by their authors.
+
+  This code allows the net stack to make use of a DMA engine for
+  skb to iovec copies.
+*/
+
+#include <linux/dmaengine.h>
+#include <linux/pagemap.h>
+#include <linux/socket.h>
+#include <linux/rtnetlink.h> /* for BUG_TRAP */
+#include <net/tcp.h>
+#include <asm/io.h>
+#include <asm/uaccess.h>
+
+#ifdef CONFIG_NET_DMA
+
+#define NUM_PAGES_SPANNED(start, length) \
+	((PAGE_ALIGN((unsigned long)start + length) - \
+	((unsigned long)start & PAGE_MASK)) >> PAGE_SHIFT)
+
+/*
+ * Lock down all the iovec pages needed for len bytes.
+ * Return a struct dma_locked_list to keep track of pages locked down.
+ *
+ * We are allocating a single chunk of memory, and then carving it up into
+ * 3 sections, the latter 2 whose size depends on the number of iovecs and the
+ * total number of pages, respectively.
+ */
+int
+dma_lock_iovec_pages(struct iovec *iov, size_t len, struct dma_locked_list **locked_list)
+{
+	struct dma_locked_list *local_list;
+	struct page **pages;
+	int i;
+	int ret;
+
+	int nr_iovecs = 0;
+	int iovec_len_used = 0;
+	int iovec_pages_used = 0;
+
+	/* don't lock down non-user-based iovecs */
+	if (segment_eq(get_fs(), KERNEL_DS)) {
+		*locked_list = NULL;
+		return 0;
+	}
+
+	/* determine how many iovecs/pages there are, up front */
+	do {
+		iovec_len_used += iov[nr_iovecs].iov_len;
+		iovec_pages_used += NUM_PAGES_SPANNED(iov[nr_iovecs].iov_base,
+			iov[nr_iovecs].iov_len);
+		nr_iovecs++;
+	} while (iovec_len_used < len);
+
+	/* single kmalloc for locked list, page_list[], and the page arrays */
+	local_list = kmalloc(sizeof(*local_list)
+		+ (nr_iovecs * sizeof (struct dma_page_list))
+		+ (iovec_pages_used * sizeof (struct page*)), GFP_KERNEL);
+	if (!local_list)
+		return -ENOMEM;
+
+	/* list of pages starts right after the page list array */
+	pages = (struct page **) &local_list->page_list[nr_iovecs];
+
+	/* it's a userspace pointer */
+	might_sleep();
+
+	for (i = 0; i < nr_iovecs; i++) {
+		struct dma_page_list *page_list = &local_list->page_list[i];
+
+		len -= iov[i].iov_len;
+
+		if (!access_ok(VERIFY_WRITE, iov[i].iov_base, iov[i].iov_len)) {
+			dma_unlock_iovec_pages(local_list);
+			return -EFAULT;
+		}
+
+		page_list->nr_pages = NUM_PAGES_SPANNED(iov[i].iov_base, iov[i].iov_len);
+		page_list->base_address = iov[i].iov_base;
+
+		page_list->pages = pages;
+		pages += page_list->nr_pages;
+
+		/* lock pages down */
+		down_read(&current->mm->mmap_sem);
+		ret = get_user_pages(
+			current,
+			current->mm,
+			(unsigned long) iov[i].iov_base,
+			page_list->nr_pages,
+			1,
+			0,
+			page_list->pages,
+			NULL);
+		up_read(&current->mm->mmap_sem);
+
+		if (ret != page_list->nr_pages) {
+			goto mem_error;
+		}
+
+		local_list->nr_iovecs = i + 1;
+	}
+
+	*locked_list = local_list;
+	return 0;
+
+mem_error:
+	dma_unlock_iovec_pages(local_list);
+	return -ENOMEM;
+}
+
+void
+dma_unlock_iovec_pages(struct dma_locked_list *locked_list)
+{
+	int i, j;
+
+	if (!locked_list)
+		return;
+
+	for (i = 0; i < locked_list->nr_iovecs; i++) {
+		struct dma_page_list *page_list = &locked_list->page_list[i];
+		for (j = 0; j < page_list->nr_pages; j++) {
+			SetPageDirty(page_list->pages[j]);
+			page_cache_release(page_list->pages[j]);
+		}
+	}
+
+	kfree(locked_list);
+}
+
+static dma_cookie_t
+dma_memcpy_tokerneliovec(struct dma_chan *chan, struct iovec *iov,
+	unsigned char *kdata, size_t len)
+{
+	dma_cookie_t dma_cookie = 0;
+
+	while (len > 0) {
+		if (iov->iov_len) {
+			int copy = min_t(unsigned int, iov->iov_len, len);
+			dma_cookie = dma_async_memcpy_buf_to_buf(
+					chan,
+					iov->iov_base,
+					kdata,
+					copy);
+			kdata += copy;
+			len -= copy;
+			iov->iov_len -= copy;
+			iov->iov_base += copy;
+		}
+		iov++;
+	}
+
+	return dma_cookie;
+}
+
+/*
+ * We have already locked down the pages we will be using in the iovecs.
+ * Each entry in iov array has corresponding entry in locked_list->page_list.
+ * Using array indexing to keep iov[] and page_list[] in sync.
+ * Initial elements in iov array's iov->iov_len will be 0 if already copied into
+ *   by another call.
+ * iov array length remaining guaranteed to be bigger than len.
+ */
+static dma_cookie_t
+dma_memcpy_toiovec(struct dma_chan *chan, struct iovec *iov,
+	struct dma_locked_list *locked_list, unsigned char *kdata, size_t len)
+{
+	int iov_byte_offset;
+	int copy;
+	dma_cookie_t dma_cookie = 0;
+	int iovec_idx;
+	int page_idx;
+
+	if (!chan)
+		return memcpy_toiovec(iov, kdata, len);
+
+	/* -> kernel copies (e.g. smbfs) */
+	if (!locked_list)
+		return dma_memcpy_tokerneliovec(chan, iov, kdata, len);
+
+	iovec_idx = 0;
+	while (iovec_idx < locked_list->nr_iovecs) {
+		struct dma_page_list *page_list;
+
+		/* skip already used-up iovecs */
+		while (!iov[iovec_idx].iov_len)
+			iovec_idx++;
+
+		page_list = &locked_list->page_list[iovec_idx];
+
+		iov_byte_offset = ((unsigned long)iov[iovec_idx].iov_base & ~PAGE_MASK);
+		page_idx = (((unsigned long)iov[iovec_idx].iov_base & PAGE_MASK)
+			 - ((unsigned long)page_list->base_address & PAGE_MASK)) >> PAGE_SHIFT;
+
+		/* break up copies to not cross page boundary */
+		while (iov[iovec_idx].iov_len) {
+			copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
+			copy = min_t(int, copy, iov[iovec_idx].iov_len);
+
+			dma_cookie = dma_async_memcpy_buf_to_pg(chan,
+					page_list->pages[page_idx],
+					iov_byte_offset,
+					kdata,
+					copy);
+
+			len -= copy;
+			iov[iovec_idx].iov_len -= copy;
+			iov[iovec_idx].iov_base += copy;
+
+			if (!len)
+				return dma_cookie;
+
+			kdata += copy;
+			iov_byte_offset = 0;
+			page_idx++;
+		}
+		iovec_idx++;
+	}
+
+	/* really bad if we ever run out of iovecs */
+	BUG();
+	return -EFAULT;
+}
+
+static dma_cookie_t
+dma_memcpy_pg_toiovec(struct dma_chan *chan, struct iovec *iov,
+	struct dma_locked_list *locked_list, struct page *page,
+	unsigned int offset, size_t len)
+{
+	int iov_byte_offset;
+	int copy;
+	dma_cookie_t dma_cookie = 0;
+	int iovec_idx;
+	int page_idx;
+	int err;
+
+	/* this needs as-yet-unimplemented buf-to-buf, so punt. */
+	/* TODO: use dma for this */
+	if (!chan || !locked_list) {
+		u8 *vaddr = kmap(page);
+		err = memcpy_toiovec(iov, vaddr + offset, len);
+		kunmap(page);
+		return err;
+	}
+
+	iovec_idx = 0;
+	while (iovec_idx < locked_list->nr_iovecs) {
+		struct dma_page_list *page_list;
+
+		/* skip already used-up iovecs */
+		while (!iov[iovec_idx].iov_len)
+			iovec_idx++;
+
+		page_list = &locked_list->page_list[iovec_idx];
+
+		iov_byte_offset = ((unsigned long)iov[iovec_idx].iov_base & ~PAGE_MASK);
+		page_idx = (((unsigned long)iov[iovec_idx].iov_base & PAGE_MASK)
+			 - ((unsigned long)page_list->base_address & PAGE_MASK)) >> PAGE_SHIFT;
+
+		/* break up copies to not cross page boundary */
+		while (iov[iovec_idx].iov_len) {
+			copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
+			copy = min_t(int, copy, iov[iovec_idx].iov_len);
+
+			dma_cookie = dma_async_memcpy_pg_to_pg(chan,
+					page_list->pages[page_idx],
+					iov_byte_offset,
+					page,
+					offset,
+					copy);
+
+			len -= copy;
+			iov[iovec_idx].iov_len -= copy;
+			iov[iovec_idx].iov_base += copy;
+
+			if (!len)
+				return dma_cookie;
+
+			offset += copy;
+			iov_byte_offset = 0;
+			page_idx++;
+		}
+		iovec_idx++;
+	}
+
+	/* really bad if we ever run out of iovecs */
+	BUG();
+	return -EFAULT;
+}
+
+void
+dma_memcpy_toiovec_wait(struct dma_chan *chan, dma_cookie_t cookie)
+{
+	if (cookie <= 0)
+		return;
+
+	dma_async_wait_for_completion(chan, cookie);
+}
+
+/**
+ *	dma_skb_copy_datagram_iovec - Copy a datagram to an iovec.
+ *	@skb - buffer to copy
+ *	@offset - offset in the buffer to start copying from
+ *	@iovec - io vector to copy to
+ *	@len - amount of data to copy from buffer to iovec
+ *	@locked_list - locked iovec buffer data
+ *
+ *	Note: the iovec is modified during the copy.
+ */
+int
+dma_skb_copy_datagram_iovec(
+	struct dma_chan *chan,
+	const struct sk_buff *skb,
+	int offset,
+	struct iovec *to,
+	size_t len,
+	struct dma_locked_list *locked_list)
+{
+	int start = skb_headlen(skb);
+	int i, copy = start - offset;
+	dma_cookie_t cookie = 0;
+
+	/* Copy header. */
+	if (copy > 0) {
+		if (copy > len)
+			copy = len;
+		if ((cookie = dma_memcpy_toiovec(chan, to, locked_list,
+		     skb->data + offset, copy)) < 0)
+			goto fault;
+		if ((len -= copy) == 0)
+			goto end;
+		offset += copy;
+	}
+
+	/* Copy paged appendix. Hmm... why does this look so complicated? */
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		int end;
+
+		BUG_TRAP(start <= offset + len);
+
+		end = start + skb_shinfo(skb)->frags[i].size;
+		if ((copy = end - offset) > 0) {
+			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+			struct page *page = frag->page;
+
+			if (copy > len)
+				copy = len;
+
+			cookie = dma_memcpy_pg_toiovec(chan, to, locked_list, page,
+					frag->page_offset + offset - start, copy);
+			if (cookie < 0)
+				goto fault;
+			if (!(len -= copy))
+				goto end;
+			offset += copy;
+		}
+		start = end;
+	}
+
+	if (skb_shinfo(skb)->frag_list) {
+		struct sk_buff *list = skb_shinfo(skb)->frag_list;
+
+		for (; list; list = list->next) {
+			int end;
+
+			BUG_TRAP(start <= offset + len);
+
+			end = start + list->len;
+			if ((copy = end - offset) > 0) {
+				if (copy > len)
+					copy = len;
+				if ((cookie = dma_skb_copy_datagram_iovec(chan, list,
+					        offset - start, to, copy, locked_list)) < 0)
+					goto fault;
+				if ((len -= copy) == 0)
+					goto end;
+				offset += copy;
+			}
+			start = end;
+		}
+	}
+
+end:
+	if (!len) {
+#if 0
+		TCP_SKB_CB(skb)->dma_cookie = cookie;
+#endif
+		return cookie;
+	}
+
+fault:
+ 	return -EFAULT;
+}
+
+#else
+
+int
+dma_lock_iovec_pages(struct iovec *iov, size_t len, struct dma_locked_list **locked_list)
+{
+	*locked_list = NULL;
+
+	return 0;
+}
+
+void
+dma_unlock_iovec_pages(struct dma_locked_list* locked_list)
+{ }
+
+int
+dma_skb_copy_datagram_iovec(struct dma_chan *chan, const struct sk_buff *skb, int offset,
+			    struct iovec *to, size_t len, struct dma_locked_list *locked_list)
+{
+	return skb_copy_datagram_iovec(skb, offset, to, len);
+}
+
+void
+dma_memcpy_toiovec_wait(struct dma_chan *chan, dma_cookie_t cookie)
+{ }
+
+#endif
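
The page arithmetic in user_dma.c (NUM_PAGES_SPANNED and the "break up
copies to not cross page boundary" loops) is easy to check in user space.
The sketch below is a stand-alone approximation, assuming 4 KiB pages;
the SK_* macros and both function names are made up for this example and
only mirror the kernel's PAGE_* helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed 4 KiB pages; stand-ins for PAGE_SHIFT/PAGE_SIZE/PAGE_MASK. */
#define SK_PAGE_SHIFT 12
#define SK_PAGE_SIZE  (1UL << SK_PAGE_SHIFT)
#define SK_PAGE_MASK  (~(SK_PAGE_SIZE - 1))
#define SK_PAGE_ALIGN(addr) (((addr) + SK_PAGE_SIZE - 1) & SK_PAGE_MASK)

/* Same arithmetic as NUM_PAGES_SPANNED(start, length) in the patch. */
static unsigned long pages_spanned(unsigned long start, unsigned long length)
{
	return (SK_PAGE_ALIGN(start + length) - (start & SK_PAGE_MASK))
		>> SK_PAGE_SHIFT;
}

/*
 * Split [start, start + len) into chunks that never cross a page
 * boundary, the way the copy loops do. Chunk lengths go into out[].
 * Returns the number of chunks, or -1 if out[] is too small.
 */
static int split_at_pages(unsigned long start, size_t len,
			  size_t *out, int max_out)
{
	unsigned long offset = start & ~SK_PAGE_MASK;	/* offset in page */
	int n = 0;

	assert(out != NULL);

	while (len > 0) {
		size_t copy = SK_PAGE_SIZE - offset;

		if (copy > len)
			copy = len;
		if (n == max_out)
			return -1;
		out[n++] = copy;
		len -= copy;
		offset = 0;	/* later chunks start on a page boundary */
	}
	return n;
}
```

For a buffer starting 16 bytes before a page boundary and 0x1020 bytes
long, both helpers agree on three pages: a 16-byte chunk, a full page,
and a final 16-byte chunk.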

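The per-channel CPU count computed in net_dma_rebalance() is a plain
"divide, then hand out the remainder" split. A small user-space sketch of
that arithmetic follows; the function names and the flat chan_of_cpu[]
array are assumptions made for illustration, whereas the patch itself
walks cpu_online_map and stores per-CPU channel pointers.

```c
#include <assert.h>
#include <stddef.h>

/*
 * How many of nr_cpus CPUs channel i (0-based) serves when nr_chans
 * channels share them as evenly as possible. The first
 * (nr_cpus % nr_chans) channels take one extra CPU each.
 * nr_chans must be nonzero.
 */
static unsigned int cpus_for_chan(unsigned int nr_cpus,
				  unsigned int nr_chans, unsigned int i)
{
	assert(nr_chans != 0);
	return nr_cpus / nr_chans + (i < nr_cpus % nr_chans ? 1 : 0);
}

/*
 * Fill chan_of_cpu[cpu] with the index of the channel serving that CPU,
 * or -1 when there are no channels. Returns the number of CPUs that
 * received a channel.
 */
static unsigned int assign_chans(unsigned int nr_cpus, unsigned int nr_chans,
				 int *chan_of_cpu)
{
	unsigned int cpu = 0, i, n;

	assert(chan_of_cpu != NULL);

	for (i = 0; i < nr_cpus; i++)
		chan_of_cpu[i] = -1;

	if (nr_chans == 0)
		return 0;

	for (i = 0; i < nr_chans; i++) {
		for (n = cpus_for_chan(nr_cpus, nr_chans, i); n > 0; n--)
			chan_of_cpu[cpu++] = (int)i;
	}
	return cpu;
}
```

With 8 CPUs and 3 channels the split is 3, 3 and 2 CPUs; with 2 CPUs and
4 channels only the first two channels are used.
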

* [RFC] [PATCH 1/3] ioat: DMA subsystem
From: Andrew Grover @ 2005-11-23 20:26 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: john.ronciak, christopher.leech


diff --git a/drivers/Kconfig b/drivers/Kconfig
index 48f446d..fbe5116 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -66,4 +66,6 @@ source "drivers/infiniband/Kconfig"
 
 source "drivers/sn/Kconfig"
 
+source "drivers/dma/Kconfig"
+
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index 1a109a6..4bd0ab6 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -67,3 +67,4 @@ obj-$(CONFIG_INFINIBAND)	+= infiniband/
 obj-$(CONFIG_SGI_IOC4)		+= sn/
 obj-y				+= firmware/
 obj-$(CONFIG_CRYPTO)		+= crypto/
+obj-$(CONFIG_DMA_ENGINE)	+= dma/
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
new file mode 100644
index 0000000..dde603d
--- /dev/null
+++ b/drivers/dma/Kconfig
@@ -0,0 +1,40 @@
+#
+# DMA engine configuration
+#
+
+menu "DMA Engine support"
+
+config DMA_ENGINE
+	bool "Support for DMA engines"
+	---help---
+	  DMA engines offload copy operations from the CPU to dedicated
+	  hardware, allowing the copies to happen asynchronously.
+
+config NET_DMA
+	bool "Use DMA engines in the network stack"
+	depends on DMA_ENGINE
+	---help---
+	  Say Y to enable the use of DMA engines in the network stack to
+	  offload receive copy-to-user operations, freeing CPU cycles.
+
+config NET_DMA_EARLY
+	bool "Do early DMA copies"
+	depends on NET_DMA
+	---help---
+	  Enabling this will cause the network stack to start DMA copies
+	  earlier. This can improve throughput, but this is also a more
+	  invasive change, and can be unstable.
+
+#
+# 
+#
+
+config DMA_TESTCLIENT
+	tristate "DMA test client"
+	depends on DMA_ENGINE
+	---help---
+	  The CB test client driver performs a DMA-assisted memcpy on module
+	  load, and prints the result when unloaded. It is pretty simple, but
+	  maybe someday this will grow up into an actually useful test client.
+
+endmenu
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
new file mode 100644
index 0000000..abb83be
--- /dev/null
+++ b/drivers/dma/Makefile
@@ -0,0 +1,5 @@
+-include $(PWD)/config
+
+obj-y += dmaengine.o
+
+obj-$(CONFIG_DMA_TESTCLIENT) += testclient.o
diff --git a/drivers/dma/cb_list.h b/drivers/dma/cb_list.h
new file mode 100644
index 0000000..f2cc2d7
--- /dev/null
+++ b/drivers/dma/cb_list.h
@@ -0,0 +1,12 @@
+/* Extra macros that build on <linux/list.h> */
+#ifndef CB_LIST_H
+#define CB_LIST_H
+
+#include <linux/list.h>
+
+/* Provide some safety to list_add, which I find easy to swap the arguments to */
+
+#define list_add_entry(pos, head, member)      list_add(&pos->member, head)
+#define list_add_entry_tail(pos, head, member) list_add_tail(&pos->member, head)
+
+#endif /* CB_LIST_H */
diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
new file mode 100644
index 0000000..fe240c8
--- /dev/null
+++ b/drivers/dma/dmaengine.c
@@ -0,0 +1,394 @@
+/*******************************************************************************
+
+  
+  Copyright(c) 2004 - 2005 Intel Corporation. All rights reserved.
+  
+  This program is free software; you can redistribute it and/or modify it 
+  under the terms of the GNU General Public License as published by the Free 
+  Software Foundation; either version 2 of the License, or (at your option) 
+  any later version.
+  
+  This program is distributed in the hope that it will be useful, but WITHOUT 
+  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 
+  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for 
+  more details.
+  
+  You should have received a copy of the GNU General Public License along with
+  this program; if not, write to the Free Software Foundation, Inc., 59 
+  Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+  
+  The full GNU General Public License is included in this distribution in the
+  file called LICENSE.
+  
+*******************************************************************************/
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/dmaengine.h>
+#include <linux/hardirq.h>
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
+#include "cb_list.h"
+
+static LIST_HEAD(dma_device_list);
+static LIST_HEAD(dma_client_list);
+
+DEFINE_PER_CPU(struct completion, kick_dma_poll);
+
+/* --- sysfs implementation --- */
+
+static ssize_t show_memcpy_count(struct class_device *cd, char *buf)
+{
+	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
+
+	sprintf(buf, "%lu\n", chan->memcpy_count);
+	return strlen(buf) + 1;
+}
+
+static ssize_t show_bytes_transferred(struct class_device *cd, char *buf)
+{
+	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
+
+	sprintf(buf, "%lu\n", chan->bytes_transferred);
+	return strlen(buf) + 1;
+}
+
+static ssize_t show_in_use(struct class_device *cd, char *buf)
+{
+	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
+
+	sprintf(buf, "%d\n", (chan->client ? 1 : 0));
+	return strlen(buf) + 1;
+}
+
+static ssize_t show_min_hw_copy_size(struct class_device *cd, char *buf)
+{
+	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
+
+	sprintf(buf, "%d\n", chan->min_copy_size);
+	return strlen(buf) + 1;
+}
+
+static ssize_t store_min_hw_copy_size(struct class_device *cd, const char *buf, size_t count)
+{
+	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
+
+	chan->min_copy_size = simple_strtoul(buf, NULL, 0);
+
+	return count;
+}
+
+static struct class_device_attribute dma_class_attrs[] = {
+	__ATTR(memcpy_count, S_IRUGO, show_memcpy_count, NULL),
+	__ATTR(bytes_transferred, S_IRUGO, show_bytes_transferred, NULL),
+	__ATTR(in_use, S_IRUGO, show_in_use, NULL),
+	__ATTR(min_copy_size, S_IRUGO | S_IWUSR, show_min_hw_copy_size, store_min_hw_copy_size),
+	__ATTR_NULL
+};
+
+static void
+dma_class_release(struct class_device *cd)
+{
+	/* do something */
+}
+
+static struct class dma_devclass = {
+	.name		= "dma",
+	.release	= dma_class_release,
+	.class_dev_attrs = dma_class_attrs,
+};
+
+/* --- client and device registration --- */
+
+static struct dma_chan *
+dma_client_chan_alloc(struct dma_client *client)
+{
+	struct dma_device *device;
+	struct dma_chan *chan;
+
+	BUG_ON(!client);
+
+	/* Find a channel, any DMA engine will do */
+	list_for_each_entry(device, &dma_device_list, global_node) {
+		list_for_each_entry(chan, &device->channels, device_node) {
+			if (chan->client)
+				continue;
+
+			if (chan->device->device_alloc_chan_resources(chan) >= 0) {
+				chan->client = client;
+				list_add_entry_tail(chan, &client->channels, client_node);
+				return chan;
+			}
+		}
+	}
+
+	return NULL;
+}
+
+static void
+dma_client_chan_free(struct dma_chan *chan)
+{
+	BUG_ON(!chan);
+
+	chan->device->device_free_chan_resources(chan);
+	chan->client = NULL;
+}
+
+static void
+dma_chans_rebalance(void)
+{
+	struct dma_client *client;
+	struct dma_chan *chan;
+
+	list_for_each_entry(client, &dma_client_list, global_node) {
+
+		while (client->chans_desired > client->chan_count) {
+			chan = dma_client_chan_alloc(client);
+			if (!chan)
+				break;
+
+			client->chan_count++;
+			client->event_callback(client, chan, DMA_RESOURCE_ADDED);
+		}
+
+		while (client->chans_desired < client->chan_count) {
+			chan = list_entry(client->channels.next, struct dma_chan, client_node);
+			list_del(&chan->client_node);
+			client->chan_count--;
+			client->event_callback(client, chan, DMA_RESOURCE_REMOVED);
+			dma_client_chan_free(chan);
+		}
+	}
+}
+
+struct dma_client *
+dma_async_client_register(dma_event_callback event_callback)
+{
+	struct dma_client *client;
+
+	BUG_ON(!event_callback);
+
+	client = kmalloc(sizeof(*client), GFP_KERNEL);
+	if (!client)
+		return NULL;
+
+	INIT_LIST_HEAD(&client->channels);
+
+	client->chans_desired = 0;
+	client->chan_count = 0;
+	client->event_callback = event_callback;
+
+	list_add_entry_tail(client, &dma_client_list, global_node);
+
+	return client;
+}
+
+void
+dma_async_client_unregister(struct dma_client *client)
+{
+	struct dma_chan *chan, *_chan;
+
+	if (!client)
+		return;
+
+	list_for_each_entry_safe(chan, _chan, &client->channels, client_node) {
+		dma_client_chan_free(chan);
+	}
+
+	list_del(&client->global_node);
+
+	kfree(client);
+
+	dma_chans_rebalance();
+}
+
+void
+dma_async_client_chan_request(struct dma_client *client, unsigned int number)
+{
+	BUG_ON(!client);
+
+	client->chans_desired = number;
+
+	dma_chans_rebalance();
+}
+
+dma_cookie_t
+dma_async_memcpy_buf_to_buf(
+	struct dma_chan *chan,
+	void *dest,
+	void *src,
+	size_t len)
+{
+	chan->bytes_transferred += len;
+	chan->memcpy_count++;
+
+	return chan->device->device_memcpy_buf_to_buf(chan, dest, src, len);
+}
+
+dma_cookie_t
+dma_async_memcpy_buf_to_pg(
+	struct dma_chan *chan,
+	struct page *page,
+	unsigned int offset,
+	void *kdata,
+	size_t len)
+{
+	chan->bytes_transferred += len;
+	chan->memcpy_count++;
+
+	return chan->device->device_memcpy_buf_to_pg(chan, page, offset, kdata, len);
+}
+
+dma_cookie_t
+dma_async_memcpy_pg_to_pg(
+	struct dma_chan *chan,
+	struct page *dest_pg,
+	unsigned int dest_off,
+	struct page *src_pg,
+	unsigned int src_off,
+	size_t len)
+{
+	chan->bytes_transferred += len;
+	chan->memcpy_count++;
+
+	return chan->device->device_memcpy_pg_to_pg(chan, dest_pg, dest_off,
+		src_pg, src_off, len);
+}
+
+void
+dma_async_memcpy_issue_pending(struct dma_chan *chan)
+{
+	chan->device->device_memcpy_issue_pending(chan);
+}
+
+enum dma_status_t
+dma_async_memcpy_complete(struct dma_chan *chan, dma_cookie_t cookie, dma_cookie_t *last, dma_cookie_t *used)
+{
+	return chan->device->device_memcpy_complete(chan, cookie, last, used);
+}
+
+int
+dma_async_device_register(struct dma_device *device)
+{
+	static int id;
+	int chancnt = 0;
+	struct dma_chan* chan;
+
+	if (!device)
+		return -ENODEV;
+
+	list_add_entry_tail(device, &dma_device_list, global_node);
+
+	dma_chans_rebalance();
+
+	device->dev_id = id++;
+
+	/* represent channels in sysfs. Probably want devs too */
+	list_for_each_entry(chan, &device->channels, device_node) {
+		chan->chan_id = chancnt++;
+		chan->class_dev.class = &dma_devclass;
+		chan->class_dev.dev = NULL;
+		snprintf(chan->class_dev.class_id, BUS_ID_SIZE, "dma%dchan%d",
+			device->dev_id, chan->chan_id);
+
+		chan->min_copy_size = DMA_DEFAULT_MIN_COPY_SIZE;
+		class_device_register(&chan->class_dev);
+	}
+
+	return 0;
+}
+
+void
+dma_async_device_unregister(struct dma_device* device)
+{
+	struct dma_chan *chan;
+
+	BUG_ON(!device);
+
+	list_for_each_entry(chan, &device->channels, device_node) {
+		if (chan->client) {
+			list_del(&chan->client_node);
+			chan->client->chan_count--;
+			chan->client->event_callback(chan->client, chan, DMA_RESOURCE_REMOVED);
+			dma_client_chan_free(chan);
+		}
+		class_device_unregister(&chan->class_dev);
+	}
+
+	list_del(&device->global_node);
+
+	dma_chans_rebalance();
+}
+
+static struct workqueue_struct *dma_wait_wq;
+static LIST_HEAD(dma_poll_list);
+
+enum dma_status_t
+dma_async_wait_for_completion(struct dma_chan *chan, dma_cookie_t cookie)
+{
+	while (dma_async_memcpy_complete(chan, cookie, NULL, NULL) == DMA_IN_PROGRESS)
+		schedule();
+
+	return DMA_SUCCESS;
+}
+
+#if 0
+static void
+dma_poll(void *data)
+{
+	struct dma_completion *comp = data;
+
+	comp->status = dma_memcpy_complete(comp->chan, comp->cookie);
+	while (comp->status == DMA_IN_PROGRESS) {
+		comp->chan->device->device_arm_interrupt(comp->chan);
+		wait_for_completion(&__get_cpu_var(kick_dma_poll));
+		comp->status = dma_memcpy_complete(comp->chan, comp->cookie);
+	}
+	complete(&comp->comp);
+}
+
+enum dma_status_t
+dma_wait_for_completion(struct dma_chan *chan, dma_cookie_t cookie)
+{
+	enum dma_status_t status;
+	DECLARE_DMA_COMPLETION(comp, chan, cookie);
+	DECLARE_WORK(dma_wait_work, dma_poll, &comp);
+
+	BUG_ON(in_interrupt());
+
+	status = dma_memcpy_complete(chan, cookie);
+	if (status != DMA_IN_PROGRESS)
+		return status;
+
+	queue_work(dma_wait_wq, &dma_wait_work);
+	wait_for_completion(&comp.comp);
+	return comp.status;
+}
+#endif
+
+static int __init dma_bus_init(void)
+{
+	int cpu;
+
+	dma_wait_wq = create_workqueue("dmapoll");
+	for_each_online_cpu(cpu) {
+		init_completion(&per_cpu(kick_dma_poll, cpu));
+	}
+	return class_register(&dma_devclass);
+}
+
+subsys_initcall(dma_bus_init);
+
+EXPORT_SYMBOL(dma_async_client_register);
+EXPORT_SYMBOL(dma_async_client_unregister);
+EXPORT_SYMBOL(dma_async_client_chan_request);
+EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf);
+EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg);
+EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg);
+EXPORT_SYMBOL(dma_async_memcpy_complete);
+EXPORT_SYMBOL(dma_async_memcpy_issue_pending);
+EXPORT_SYMBOL(dma_async_device_register);
+EXPORT_SYMBOL(dma_async_device_unregister);
+EXPORT_SYMBOL(dma_async_wait_for_completion);
+EXPORT_PER_CPU_SYMBOL(kick_dma_poll);
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
new file mode 100644
index 0000000..7b4f58b
--- /dev/null
+++ b/include/linux/dmaengine.h
@@ -0,0 +1,268 @@
+/*******************************************************************************
+
+  
+  Copyright(c) 2004 - 2005 Intel Corporation. All rights reserved.
+  
+  This program is free software; you can redistribute it and/or modify it 
+  under the terms of the GNU General Public License as published by the Free 
+  Software Foundation; either version 2 of the License, or (at your option) 
+  any later version.
+  
+  This program is distributed in the hope that it will be useful, but WITHOUT 
+  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 
+  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for 
+  more details.
+  
+  You should have received a copy of the GNU General Public License along with
+  this program; if not, write to the Free Software Foundation, Inc., 59 
+  Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+  
+  The full GNU General Public License is included in this distribution in the
+  file called LICENSE.
+  
+*******************************************************************************/
+
+
+#ifndef DMAENGINE_H
+#define DMAENGINE_H
+
+#include <linux/device.h>
+#include <linux/uio.h>
+#include <linux/skbuff.h>
+
+DECLARE_PER_CPU(struct completion, kick_dma_poll);
+
+#define DMA_DEFAULT_MIN_COPY_SIZE 16
+
+/**
+ * enum dma_event_t - resource PNP/power management events
+ * @DMA_RESOURCE_SUSPEND: DMA device going into low power state
+ * @DMA_RESOURCE_RESUME: DMA device returning to full power
+ * @DMA_RESOURCE_ADDED: DMA device added to the system
+ * @DMA_RESOURCE_REMOVED: DMA device removed from the system
+ */
+enum dma_event_t {
+	DMA_RESOURCE_SUSPEND,
+	DMA_RESOURCE_RESUME,
+	DMA_RESOURCE_ADDED,
+	DMA_RESOURCE_REMOVED,
+};
+
+/**
+ * typedef dma_cookie_t
+ *
+ * if dma_cookie_t is >0 it's a DMA request cookie, <0 it's an error code
+ */
+typedef s32 dma_cookie_t;
+
+/*#define dma_submit_error(cookie) ((cookie) < 0 ? 1 : 0)*/
+
+/**
+ * enum dma_status_t - DMA transaction status
+ * @DMA_SUCCESS: transaction completed successfully
+ * @DMA_IN_PROGRESS: transaction not yet processed
+ * @DMA_ERROR: transaction failed
+ */
+enum dma_status_t {
+	DMA_SUCCESS,
+	DMA_IN_PROGRESS,
+	DMA_ERROR,
+};
+
+/**
+ * struct dma_chan - devices supply DMA channels, clients use them
+ * @client: ptr to the client user of this chan, will be NULL when unused
+ * @device: ptr to the dma device who supplies this channel, always !NULL
+ * @client_node: used to add this to the client chan list
+ * @device_node: used to add this to the device chan list
+ */
+struct dma_chan
+{
+	struct dma_client *client;
+	struct dma_device *device;
+	dma_cookie_t cookie;
+
+	/* sysfs */
+	int chan_id;
+	struct class_device class_dev;
+
+	/* stats */
+	unsigned long memcpy_count;
+	unsigned long bytes_transferred;
+	unsigned int min_copy_size;
+
+	struct list_head client_node;
+	struct list_head device_node;
+
+	cpumask_t cpumask;
+};
+
+/*
+ * typedef dma_event_callback - function pointer to a DMA event callback
+ */
+typedef void (*dma_event_callback) (struct dma_client *client, struct dma_chan *chan, enum dma_event_t event);
+
+/**
+ * struct dma_client - info on the entity making use of DMA services
+ * @event_callback: func ptr to call when something happens
+ * @chan_count: number of chans allocated
+ * @chans_desired: number of chans requested. Can be +- chan_count
+ * @port: upstream DMA port from the client's PCI device
+ * @channels: the list of DMA channels allocated
+ * @global_node: list_head for global dma_client_list
+ */
+struct dma_client {
+	dma_event_callback	event_callback;
+	unsigned int		chan_count;
+	unsigned int		chans_desired;
+
+	/* TODO keep some stats */
+	struct list_head	channels;
+	struct list_head	global_node;
+};
+
+/**
+ * struct dma_device - info on the entity supplying DMA services
+ * @chancnt: how many DMA channels are supported
+ * @channels: the list of struct dma_chan
+ * @global_node: list_head for global dma_device_list
+ * Other func ptrs: used to make use of this device's capabilities
+ */
+struct dma_device {
+
+	unsigned int chancnt;
+	struct list_head channels;
+	struct list_head global_node;
+
+	int dev_id;
+	/*struct class_device class_dev;*/
+
+	int (*device_alloc_chan_resources)(struct dma_chan *chan);
+	void (*device_free_chan_resources)(struct dma_chan *chan);
+	dma_cookie_t (*device_memcpy_buf_to_buf)(struct dma_chan *chan, void *dest,
+		void *src, size_t len);
+	dma_cookie_t (*device_memcpy_buf_to_pg)(struct dma_chan *chan, struct page *page,
+		unsigned int offset, void *kdata, size_t len);
+	dma_cookie_t (*device_memcpy_pg_to_pg)(struct dma_chan *chan, struct page *dest_pg,
+		unsigned int dest_off, struct page *src_pg, unsigned int src_off,
+		size_t len);
+	void (*device_arm_interrupt)(struct dma_chan *chan);
+	enum dma_status_t (*device_memcpy_complete)(struct dma_chan *chan, dma_cookie_t cookie, dma_cookie_t *last, dma_cookie_t *used);
+	void (*device_memcpy_issue_pending)(struct dma_chan *chan);
+};
+
+/* --- public DMA engine API --- */
+
+struct dma_client *
+dma_async_client_register(dma_event_callback event_callback);
+
+void
+dma_async_client_unregister(struct dma_client *client);
+
+void
+dma_async_client_chan_request(struct dma_client *client, unsigned int number);
+
+dma_cookie_t
+dma_async_memcpy_buf_to_buf(
+	struct dma_chan *chan,
+	void *dest,
+	void *src,
+	size_t len);
+
+dma_cookie_t
+dma_async_memcpy_buf_to_pg(
+	struct dma_chan *chan,
+	struct page *page,
+	unsigned int offset,
+	void *kdata,
+	size_t len);
+
+dma_cookie_t
+dma_async_memcpy_pg_to_pg(
+	struct dma_chan *chan,
+	struct page *dest_pg,
+	unsigned int dest_off,
+	struct page *src_pg,
+	unsigned int src_off,
+	size_t len);
+
+void dma_async_memcpy_issue_pending(struct dma_chan *);
+
+enum dma_status_t
+dma_async_wait_for_completion(struct dma_chan *chan, dma_cookie_t cookie);
+
+static inline enum dma_status_t
+dma_async_is_complete(
+	dma_cookie_t cookie,
+	dma_cookie_t last_complete,
+	dma_cookie_t last_used) {
+	
+	if (last_complete <= last_used) {
+		if ((cookie <= last_complete) || (cookie > last_used))
+			return DMA_SUCCESS;
+	} else {
+		if ((cookie <= last_complete) && (cookie > last_used))
+			return DMA_SUCCESS;
+	}
+	return DMA_IN_PROGRESS;
+}
+
+enum dma_status_t
+dma_async_memcpy_complete(struct dma_chan *chan, dma_cookie_t cookie, dma_cookie_t *last, dma_cookie_t *used);
+
+u32
+dma_async_get_errors(struct dma_chan *chan, dma_cookie_t cookie);
+
+/* --- DMA device --- */
+
+int dma_async_device_register(
+	struct dma_device *device);
+
+void dma_async_device_unregister(
+	struct dma_device *device);
+
+/* --- DMA completion --- */
+
+struct dma_completion
+{
+	struct dma_chan *chan;
+	dma_cookie_t cookie;
+	enum dma_status_t status;
+	struct completion comp;
+};
+
+#define DMA_COMPLETION_INITIALIZER(name, chan, cookie) \
+{	.chan = chan, \
+	.cookie = cookie, \
+	.status = DMA_IN_PROGRESS, \
+	.comp = COMPLETION_INITIALIZER((name).comp)	}
+
+#define DECLARE_DMA_COMPLETION(name, chan, cookie) \
+struct dma_completion name = DMA_COMPLETION_INITIALIZER(name, chan, cookie)
+
+/* --- net iovec stuff --- */
+
+DECLARE_PER_CPU(struct dma_chan *, net_dma);
+
+struct dma_page_list
+{
+	char *base_address;
+	int nr_pages;
+	struct page **pages;
+};
+
+struct dma_locked_list
+{
+	int nr_iovecs;
+	struct dma_page_list page_list[0];
+};
+
+int dma_lock_iovec_pages(struct iovec *iov, size_t len, struct dma_locked_list **locked_list);
+void dma_unlock_iovec_pages(struct dma_locked_list* locked_list);
+int
+dma_skb_copy_datagram_iovec(struct dma_chan* chan, const struct sk_buff *skb, int offset,
+			    struct iovec *to, size_t len, struct dma_locked_list *locked_list);
+void dma_memcpy_toiovec_wait(struct dma_chan *chan, dma_cookie_t cookie);
+void dma_async_try_early_copy(struct sock *sk, struct sk_buff *skb);
+
+#endif /* DMAENGINE_H */
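
The wrap-around comparison in dma_async_is_complete() above can be tried
outside the kernel. The following is a minimal userspace sketch, not part of
the patch: the name cookie_is_complete() and the use of int32_t in place of
s32 are assumptions made so that it compiles standalone. The comparison
itself is copied from the header.

```c
#include <stdint.h>

/* Mirrors the patch: >0 is a request cookie, <0 an error code. */
typedef int32_t dma_cookie_t;

enum dma_status_t { DMA_SUCCESS, DMA_IN_PROGRESS, DMA_ERROR };

/*
 * Same comparison as dma_async_is_complete() in the patch.
 * last_complete: newest cookie the engine has finished.
 * last_used:     newest cookie handed out to a caller.
 * (cookie_is_complete is a hypothetical name for this sketch.)
 */
enum dma_status_t
cookie_is_complete(dma_cookie_t cookie, dma_cookie_t last_complete,
		   dma_cookie_t last_used)
{
	if (last_complete <= last_used) {
		/* No wrap: pending cookies lie in (last_complete, last_used]. */
		if (cookie <= last_complete || cookie > last_used)
			return DMA_SUCCESS;
	} else {
		/* Counter wrapped: done cookies lie in (last_used, last_complete]. */
		if (cookie <= last_complete && cookie > last_used)
			return DMA_SUCCESS;
	}
	return DMA_IN_PROGRESS;
}
```

With last_complete = 7 and last_used = 10, cookie 5 reports DMA_SUCCESS and
cookie 9 reports DMA_IN_PROGRESS. After a wrap (last_complete = 100,
last_used = 3), cookie 50 is done while cookie 2 is still pending.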

^ permalink raw reply related

* [RFC] [PATCH 0/3] ioat: DMA engine support
From: Andrew Grover @ 2005-11-23 20:26 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: john.ronciak, christopher.leech

As presented in our talk at this year's OLS, the Bensley platform, which 
will be out in early 2006, will have an asynchronous DMA engine. It can be 
used to offload copies from the CPU, such as the kernel copies of received 
packets into the user buffer.

The code consists of the following sections:
1) The HW driver for the DMA engine device
2) The DMA subsystem, which abstracts the HW details from users of the 
async DMA
3) Modifications to net/ to make use of the DMA engine for receive copy 
offload:
    3a) Code to register the net stack as a "DMA client"
    3b) Code to pin and unpin pages associated with a user buffer
    3c) Code to initiate async DMA transactions in the net receive path

Today we are releasing 2, 3a, and 3b, as well as "testclient", a throwaway
driver we wrote to demonstrate the DMA subsystem API. We will be releasing
3c shortly. We will be releasing 1 (the HW driver) when the platform ships
early next year. Until then, the code doesn't really *do* anything, but we
wanted to release what we could right away, and start getting some 
feedback.
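
For readers who want a feel for section 2 before reading the patch, here is a
rough userspace model of how the subsystem hands channels to clients. It is
only a sketch: the model_* names, the fixed-size pool and the array of
clients are assumptions made for illustration, standing in for the list-based
struct dma_chan / struct dma_client and dma_chans_rebalance() in patch 1.
There are no event callbacks and no locking here.

```c
#include <stddef.h>

#define MODEL_MAX_CHANS 4

/* Hypothetical stand-in for struct dma_client. */
struct model_client {
	unsigned int chans_desired;	/* what the client asked for */
	unsigned int chan_count;	/* what it currently holds */
};

/* Hypothetical stand-in for struct dma_chan. */
struct model_chan {
	struct model_client *client;	/* NULL while the channel is free */
};

struct model_chan model_pool[MODEL_MAX_CHANS];

/* Hand the first free channel to the client, or return NULL if none is left. */
struct model_chan *model_chan_alloc(struct model_client *client)
{
	for (size_t i = 0; i < MODEL_MAX_CHANS; i++) {
		if (!model_pool[i].client) {
			model_pool[i].client = client;
			return &model_pool[i];
		}
	}
	return NULL;
}

/* Give one of the client's channels back to the pool. */
void model_chan_release_one(struct model_client *client)
{
	for (size_t i = 0; i < MODEL_MAX_CHANS; i++) {
		if (model_pool[i].client == client) {
			model_pool[i].client = NULL;
			client->chan_count--;
			return;
		}
	}
}

/* One pass over all clients: grow those that want more, shrink the rest. */
void model_rebalance(struct model_client *clients, size_t nclients)
{
	for (size_t c = 0; c < nclients; c++) {
		struct model_client *client = &clients[c];

		while (client->chans_desired > client->chan_count) {
			if (!model_chan_alloc(client))
				break;		/* pool exhausted */
			client->chan_count++;
		}
		while (client->chans_desired < client->chan_count)
			model_chan_release_one(client);
	}
}
```

A client that sets chans_desired to more than the pool can supply simply ends
up with fewer channels. It gets the rest on a later rebalance pass, once
another client has lowered its own request.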

Against 2.6.14:
patch 1: DMA engine
patch 2: iovec pin/unpin code; register net as a DMA client
patch 3: testclient

overall diffstat information:
 drivers/Kconfig           |    2 
 drivers/Makefile          |    1 
 drivers/dma/Kconfig       |   40 ++
 drivers/dma/Makefile      |    5 
 drivers/dma/cb_list.h     |   12 
 drivers/dma/dmaengine.c   |  394 ++++++++++++++++++++++++
 drivers/dma/testclient.c  |  132 ++++++++
 include/linux/dmaengine.h |  268 ++++++++++++++++
 net/core/Makefile         |    3 
 net/core/dev.c            |   78 ++++
 net/core/user_dma.c       |  422 ++++++++++++++++++++++++++
 11 files changed, 1356 insertions(+), 1 deletion(-)

Regards -- Andy and Chris

^ permalink raw reply

* Re: 2.6.15-rc2-mm1
From: Andrew Morton @ 2005-11-23 19:38 UTC (permalink / raw)
  To: Marc Koschewski; +Cc: linux-kernel, Harald Welte, netdev
In-Reply-To: <20051123175045.GA6760@stiffy.osknowledge.org>

Marc Koschewski <marc@osknowledge.org> wrote:
>
> * Andrew Morton <akpm@osdl.org> [2005-11-23 03:35:50 -0800]:
> 
> > 
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.15-rc2/2.6.15-rc2-mm1/
> > 
> > (temp copy at http://www.zip.com.au/~akpm/linux/patches/stuff/2.6.15-rc2-mm1.gz)
> > 
> > - Added git-sym2.patch to the -mm lineup: updates to the sym2 scsi driver
> >   (Matthew Wilcox).  
> > 
> > - The JSM tty driver still doesn't compile.
> > 
> > - The git-powerpc tree is included now.
> 
> Just booted into 2.6.15-rc2-mm1. The 'mouse problem' (as reported earlier) still
> persists,

You'll probably need to re-report the mouse problem if the previous report
didn't get any action.

> moreover, some stuff's now really not gonna work anymore. I logged in
> via gdm once and rebooted. 

Yes, netfilter broke.

> ...
> Nov 23 18:34:01 stiffy kernel: 0.0: ttyS3 at I/O 0xe108 (irq = 3) is a 8250
> Nov 23 18:34:01 stiffy kernel: ip_conntrack version 2.4 (4095 buckets, 32760 max) - 212 bytes per conntrack
> Nov 23 18:34:01 stiffy kernel: ip_tables: (C) 2000-2002 Netfilter core team
> Nov 23 18:34:01 stiffy kernel:  [schedule+1453/1679] schedule+0x5ad/0x68f
> Nov 23 18:34:01 stiffy kernel:  [__wake_up_common+60/94] __wake_up_common+0x3c/0x5e
> Nov 23 18:34:01 stiffy kernel:  [wait_for_completion+134/242] wait_for_completion+0x86/0xf2
> Nov 23 18:34:01 stiffy kernel:  [default_wake_function+0/18] default_wake_function+0x0/0x12
> Nov 23 18:34:01 stiffy kernel:  [call_usermodehelper_keys+175/186] call_usermodehelper_keys+0xaf/0xba
> Nov 23 18:34:01 stiffy kernel:  [__call_usermodehelper+0/110] __call_usermodehelper+0x0/0x6e
> Nov 23 18:34:01 stiffy kernel:  [request_module+175/240] request_module+0xaf/0xf0
> Nov 23 18:34:01 stiffy kernel:  [buffered_rmqueue+241/514] buffered_rmqueue+0xf1/0x202
> Nov 23 18:34:01 stiffy kernel:  [get_page_from_freelist+136/162] get_page_from_freelist+0x88/0xa2
> Nov 23 18:34:01 stiffy kernel:  [pg0+553595222/1069659136] translate_table+0x95f/0xbcb [ip_tables]
> Nov 23 18:34:01 stiffy kernel:  [map_vm_area+109/149] map_vm_area+0x6d/0x95
> Nov 23 18:34:01 stiffy kernel:  [__vmalloc_area_node+246/362] __vmalloc_area_node+0xf6/0x16a
> Nov 23 18:34:01 stiffy kernel:  [__vmalloc_node+79/110] __vmalloc_node+0x4f/0x6e
> Nov 23 18:34:01 stiffy kernel:  [__vmalloc+39/43] __vmalloc+0x27/0x2b
> Nov 23 18:34:01 stiffy kernel:  [pg0+553597367/1069659136] do_replace+0x145/0x6d6 [ip_tables]
> Nov 23 18:34:03 stiffy kernel:  [pg0+553596209/1069659136] copy_entries_to_user+0xaf/0x1e3 [ip_tables]
> Nov 23 18:34:04 stiffy kernel:  [pg0+553599347/1069659136] do_ipt_set_ctl+0x1e/0x62 [ip_tables]
> Nov 23 18:34:04 stiffy kernel:  [nf_sockopt+198/277] nf_sockopt+0xc6/0x115
> Nov 23 18:34:04 stiffy kernel:  [nf_setsockopt+55/59] nf_setsockopt+0x37/0x3b
> Nov 23 18:34:04 stiffy kernel:  [ip_setsockopt+219/3448] ip_setsockopt+0xdb/0xd78
> Nov 23 18:34:04 stiffy kernel:  [nf_sockopt+136/277] nf_sockopt+0x88/0x115
> Nov 23 18:34:04 stiffy kernel:  [nf_getsockopt+55/59] nf_getsockopt+0x37/0x3b
> Nov 23 18:34:04 stiffy kernel:  [ip_getsockopt+254/1764] ip_getsockopt+0xfe/0x6e4
> Nov 23 18:34:04 stiffy kernel:  [prio_tree_remove+150/191] prio_tree_remove+0x96/0xbf
> Nov 23 18:34:04 stiffy kernel:  [free_pgtables+59/167] free_pgtables+0x3b/0xa7
> Nov 23 18:34:04 stiffy kernel:  [buffered_rmqueue+241/514] buffered_rmqueue+0xf1/0x202
> Nov 23 18:34:04 stiffy kernel:  [get_page_from_freelist+136/162] get_page_from_freelist+0x88/0xa2
> ...

^ permalink raw reply

* Re: 2.6.15-rc2-mm1
From: Andrew Morton @ 2005-11-23 19:22 UTC (permalink / raw)
  To: Michal Piotrowski; +Cc: linux-kernel, Harald Welte, netdev
In-Reply-To: <6bffcb0e0511230615y7531e268n@mail.gmail.com>

Michal Piotrowski <michal.k.k.piotrowski@gmail.com> wrote:
>
>  Debug: sleeping function called from invalid context at
>  include/asm/semaphore.h:123
>  in_atomic():1, irqs_disabled():0
>   [<c0103be6>] dump_stack+0x17/0x19
>   [<c011a0c3>] __might_sleep+0x9c/0xae
>   [<fd9a090d>] translate_table+0x147/0xc14 [ip_tables]
>   [<fd9a2b2a>] ipt_register_table+0x93/0x20d [ip_tables]
>   [<f98fe027>] init+0x27/0x9e [iptable_filter]
>   [<c01376d0>] sys_init_module+0xd7/0x26c
>   [<c0102cc7>] sysenter_past_esp+0x54/0x75
>  ---------------------------
>  | preempt count: 00000001 ]
>  | 1 level deep critical section nesting:
>  ----------------------------------------
>  .. [<fd9a2aca>] .... ipt_register_table+0x33/0x20d [ip_tables]
>  .....[<f98fe027>] ..   ( <= init+0x27/0x9e [iptable_filter])
> 

ipt_register_table() does get_cpu() then calls translate_table(), and
somewhere under translate_table() we do something which sleeps, only I'm not
sure what it is - netfilter likes to hide things in unexpected places.

check_entry() will do sleepy things under that get_cpu(), but that doesn't
seem to be in this particular call chain.

Anyway, the new get_cpu() in ipt_register_table() is the problem.
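
To make the pattern concrete, here is a toy userspace model of what the
debug check is complaining about. It is not kernel code and not a proposed
fix: every model_* name is made up for this sketch, standing in for the
preempt counter, get_cpu()/put_cpu() and the might_sleep() check.

```c
/*
 * Userspace model only. The names are hypothetical stand-ins for the
 * kernel's preempt counter, get_cpu()/put_cpu() and might_sleep().
 */
int model_preempt_count;	/* >0 means "atomic, must not sleep" */
int model_sleep_warnings;	/* sleeps attempted while atomic */

int model_get_cpu(void)
{
	model_preempt_count++;	/* get_cpu() disables preemption */
	return 0;		/* pretend we are on CPU 0 */
}

void model_put_cpu(void)
{
	model_preempt_count--;
}

/* Stands in for anything that may block, e.g. taking a semaphore. */
void model_might_sleep(void)
{
	if (model_preempt_count > 0)
		model_sleep_warnings++;
}

/* The pattern from the trace: sleeping work inside the get_cpu() section. */
void model_register_table_atomic(void)
{
	model_get_cpu();
	model_might_sleep();	/* translate_table() ends up here */
	model_put_cpu();
}

/* The same work done while still preemptible: no warning. */
void model_register_table_preemptible(void)
{
	model_might_sleep();
	model_get_cpu();
	model_put_cpu();
}
```

Calling model_register_table_atomic() bumps model_sleep_warnings, which is
the toy equivalent of the "sleeping function called from invalid context"
message above. The preemptible variant leaves the counter untouched.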

^ permalink raw reply

* [Fwd: [Bug 5644] New: NFS v3 TCP 3-way handshake incorrect, iptables blocks access]
From: Trond Myklebust @ 2005-11-23 17:51 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 151 bytes --]

Sorry to be cross-posting, but does this bug ring any bells? I'm having
trouble seeing how the sunrpc server code could be at fault.

Cheers,
  Trond

[-- Attachment #2: Forwarded message - [Bug 5644] New: NFS v3 TCP 3-way handshake incorrect, iptables blocks access --]
[-- Type: message/rfc822, Size: 4262 bytes --]

From: bugme-daemon@bugzilla.kernel.org
To: trond.myklebust@fys.uio.no
Subject: [Bug 5644] New: NFS v3 TCP 3-way handshake incorrect, iptables blocks access
Date: Wed, 23 Nov 2005 08:03:52 -0800
Message-ID: <200511231603.jANG3qDT026280@fire-2.osdl.org>

http://bugzilla.kernel.org/show_bug.cgi?id=5644

           Summary: NFS v3 TCP 3-way handshake incorrect, iptables blocks
                    access
    Kernel Version: 2.6.14
            Status: NEW
          Severity: blocking
             Owner: trond.myklebust@fys.uio.no
         Submitter: jl-icase@comcast.net

Most recent kernel where this bug did not occur:
Distribution: Can't remember, possibly FC2.
Hardware Environment:
Software Environment:
Problem Description:

Steps to reproduce:
1. Boot NFS v3 TCP client running iptables & mount NFS filesystem
2. Do a normal NFS client reboot & try mounting the same filesystem again
3. Experience intermittent failure to read superblock

The cause of this problem is NFS server's improper response to SYN packet sent
by the client.  This occurs *after* successful client authorization, when the
client tries to open the connection (i.e. sends SYN to the server's nfs port) to
read the superblock.  The server (sometimes) responds with a pure ACK without
the SYN bit set.  This is blocked by iptables -- thus, mount fails with a "could
not read superblock" message.

Here is an excerpt from ethereal log:

      3 0.021733    client           SERVER           TCP      800 > nfs [SYN] Seq=0 Ack=0 Win=5840 Len=0 MSS=1460 TSV=24095 TSER=0 WS=2
      4 0.021846    SERVER           client           TCP      nfs > 800 [ACK] Seq=9138391 Ack=3580883479 Win=16022 Len=0 TSV=244936050 TSER=1149400
      5 0.021864    client           SERVER           ICMP     Destination unreachable (Host administratively prohibited)

The above problem occurs with a very simple default+ssh iptables configuration.
 Disabling iptables on the client makes the problem go away.  Even with iptables
active, there is no problem when nfsd responds with a proper [SYN,ACK] instead
of just pure ACK (this happens intermittently after the client reboot).
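
As a rough illustration of why the pure ACK is dropped, the check a stateful
filter applies to the first reply after a SYN can be sketched as below. This
is a simplified model and an assumption on my part, not netfilter's actual
code. The function name is hypothetical; the flag values are the standard
TCP header bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard TCP header flag bits. */
#define TCP_FLAG_SYN 0x02
#define TCP_FLAG_RST 0x04
#define TCP_FLAG_ACK 0x10

/*
 * After the client has sent a SYN with initial sequence number client_isn,
 * the only reply that advances the handshake is SYN+ACK acknowledging
 * client_isn + 1. (reply_completes_handshake is a hypothetical name.)
 */
bool reply_completes_handshake(uint8_t flags, uint32_t ack_num,
			       uint32_t client_isn)
{
	if (flags & TCP_FLAG_RST)
		return false;
	if ((flags & (TCP_FLAG_SYN | TCP_FLAG_ACK)) !=
	    (TCP_FLAG_SYN | TCP_FLAG_ACK))
		return false;	/* e.g. the pure ACK seen in packet 4 */
	return ack_num == client_isn + 1u;	/* unsigned, wraps naturally */
}
```

Packet 4 in the trace carries only ACK, so a check of this shape rejects it
whatever its acknowledgment number is.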

Please fix nfsd so that it reliably responds to SYN packets with proper
[SYN,ACK] packets instead of just ACK packets.  Apparently, nfsd state doesn't
get properly reset on client reboots.  Other people have reported autofs
failures which may be related (e.g. on remounts).

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply

* Re: [2.6 patch] net/sunrpc/xdr.c: remove xdr_decode_string()
From: Adrian Bunk @ 2005-11-23 16:25 UTC (permalink / raw)
  To: Lever, Charles
  Cc: David Miller, neilb, trond.myklebust, linux-kernel, nfs, netdev
In-Reply-To: <044B81DE141D7443BCE91E8F44B3C1E2013327DF@exsvl02.hq.netapp.com>

On Wed, Nov 23, 2005 at 04:31:14AM -0800, Lever, Charles wrote:
> > On Thu, Oct 06, 2005 at 07:13:14AM -0700, Lever, Charles wrote:
> > 
> > > actually, can we hold off on this change?  the RPC 
> > transport switch will
> > > eventually need most of those EXPORT_SYMBOLs.
> > 
> > Am I right to assume this will happen in the foreseeable future?
> 
> the first portion of the transport switch is in 2.6.15-rcX.  at this
> point i'm expecting the EXPORT_SYMBOL changes to go in 2.6.17 or later.

OK.

> so i don't remember why you are removing xdr_decode_string.  are we sure
> that no-one will need this functionality in the future?  it is harmless
> to remove today, but i wonder if someone is just going to add it back
> sometime.

It's unused and you said:
  the only harmless change i see below is removing xdr_decode_string().

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

^ permalink raw reply

* Re: ip_conntrack: Make "hashsize" conntrack parameter writable
From: Jesper Dangaard Brouer @ 2005-11-23 14:08 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Harald Welte, netdev, netfilter-devel, Jesper Dangaard Brouer
In-Reply-To: <1132707085.7720.2.camel@localhost.localdomain>

On Wed, 23 Nov 2005, Rusty Russell wrote:

> On Tue, 2005-11-22 at 15:49 +0100, Jesper Dangaard Brouer wrote:
>> Hi Rusty (and Harald)
>>
>> We met at the Netfilter Workshop 2005, where I complained that the
>> conntrack hashsize were statically set at module load time.
>>
>> Thank you making a kernel patch, which changes this...
>> BUT I cannot make it work! :-(
>>
>> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=eed75f191d8318a2b144da8aae9774e1cfcae492
>>
>> Am I missing some part of the patch?
>>
>> I cannot find the link to the /proc file system. Should there not be
>> any changes to ip_conntrack_standalone.c ??
>
> /sys/module/ip_conntrack/parameters/hashsize
>
> Cheers!
> Rusty.

Aha I see, the sysfs filesystem.

I was confused, because the hashsize is already exported as 
/proc/sys/net/ipv4/netfilter/ip_conntrack_buckets.

It is a bit confusing that the Netfilter team is moving away from the 
/proc filesystem, but I don't mind; sysfs seems to be the more powerful 
choice.

The permissions on "/sys/module/ip_conntrack/parameters/hashsize" are set 
to 600, whereas /proc/../ip_conntrack_buckets is readable by all (444). 
I think we should change the /sys/../hashsize parameter to 644, since 
hiding it makes little sense when the same value is readable through /proc.
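
To spell out the difference between the two modes, here is a small userspace
sketch using the standard <sys/stat.h> permission bits. The helper name is
made up for this mail.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* True if "other" users may read a file with these permission bits.
 * (mode_world_readable is a hypothetical helper name.) */
bool mode_world_readable(mode_t mode)
{
	return (mode & S_IROTH) != 0;
}
```

With this, 0600 (the current sysfs mode) is not world readable, while 0644
(proposed) and 0444 (the /proc entry) both are.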

Hilsen
   Jesper Dangaard Brouer

ps. Cc'ing -> lets keep google updated ;-)
--
-------------------------------------------------------------------
Cand. scient datalog
Dept. of Computer Science, University of Copenhagen
-------------------------------------------------------------------

^ permalink raw reply
