public inbox for linux-kernel@vger.kernel.org
* Re: [RFT][PATCH] generic device DMA implementation
@ 2002-12-27 20:21 David Brownell
  2002-12-27 21:40 ` James Bottomley
  2002-12-27 21:47 ` [RFT][PATCH] generic device DMA implementation James Bottomley
  0 siblings, 2 replies; 44+ messages in thread
From: David Brownell @ 2002-12-27 20:21 UTC (permalink / raw)
  To: James Bottomley, linux-kernel

I think you saw that patch to let the new 2.5.53 generic dma code
replace one of the two indirections USB needs.  Here are some of
the key open issues I'm thinking of:

- DMA mapping calls still return no errors; so BUG() out instead?

   Consider systems where DMA-able memory is limited (like SA-1111,
   to 1 MByte); clearly it should be possible for these calls to
   fail, when they can't allocate a bounce buffer.  Or (see below)
   when an invalid argument is provided to a dma mapping call.

   Fix by defining fault returns for the current signatures,
   starting with the api specs:

     * dma_map_sg() returns negative errno (or zero?) when it
       fails.  (Those are illegal sglist lengths.)

     * dma_map_single() returns an arch-specific value, like
       DMA_ADDR_INVALID, when it fails.  (DaveM's suggestion,
       from a while back; it's seemingly arch-specific.)

   Yes, the PCI dma calls would benefit from those bugfixes too.
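
   To make that concrete, here's a plain userspace sketch of the
   proposed convention (everything here is invented for illustration;
   DMA_ADDR_INVALID is just the suggested arch-specific sentinel, and
   the "toy" functions only model the fail-instead-of-BUG behavior):

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t dma_addr_t;

/* Hypothetical arch-provided sentinel, per the suggestion above. */
#define DMA_ADDR_INVALID ((dma_addr_t)~0ULL)

/* Toy stand-in for dma_map_single(): fail when the (small) bounce
 * region can't hold the buffer, instead of BUG()ing.
 */
static dma_addr_t toy_map_single(size_t size, size_t bounce_bytes_free)
{
	if (size == 0 || size > bounce_bytes_free)
		return DMA_ADDR_INVALID;
	return 0x40000000ULL;	/* pretend bus address */
}

/* A driver could then recover instead of crashing the machine: */
static int toy_start_io(size_t len)
{
	dma_addr_t dma = toy_map_single(len, 1 << 20);	/* e.g. SA-1111: 1 MByte */

	if (dma == DMA_ADDR_INVALID)
		return -1;	/* back off, retry later, etc. */
	return 0;
}
```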

- Implementation-wise, I'm rather surprised that the generic
   version doesn't just add new bus driver methods rather than
   still insisting everything be PCI underneath.

   It's not clear to me how I'd make, for example, a USB device
   or interface work with dma_map_sg() ... those "generic" calls
   are going to fail (except on x86, where all memory is DMA-able)
   since USB != PCI. Even when usb_buffer_map_sg() would succeed.
   (The second indirection:  the usb controller hardware does the
   mapping, not the device or hcd.  That's usually PCI.)

   Hmm, I suppose there'd need to be a default implementation
   of the mapping operations (for all non-pci busses) that'd
   fail cleanly ... :)
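
   Something like this, modeled in userspace C (names invented; the
   point is just "buses with no mapping ops fail cleanly, mapping
   nothing, instead of BUG()ing"):

```c
#include <stddef.h>

struct toy_bus {
	const char *name;
	/* NULL for buses (like USB) that can't map memory themselves */
	int (*map_sg)(void *sg, int nents);
};

/* A bus that really can map, e.g. PCI on most arches. */
static int pci_like_map_sg(void *sg, int nents)
{
	(void)sg;
	return nents;	/* pretend everything mapped */
}

/* Generic entry point: returns mapped count, or 0 on clean failure. */
static int toy_dma_map_sg(struct toy_bus *bus, void *sg, int nents)
{
	if (!bus->map_sg)
		return 0;	/* default implementation: fail cleanly */
	return bus->map_sg(sg, nents);
}
```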

- There's no analogue to pci_pool, and there's nothing like
   "kmalloc" (likely built from N dma-coherent pools).

   That forces drivers to write and maintain their own memory
   allocators, which wastes energy and is bug-prone.  So in that
   sense this API isn't a complete functional replacement of
   the current PCI (has pools, ~= kmem_cache_t) or USB (with
   simpler buffer_alloc ~= kmalloc) APIs for dma.

- The API says drivers "must" satisfy dma_get_cache_alignment(),
   yet both implementations, asm-{i386,generic}/dma-mapping.h,
   ignore that rule.

   Are you certain of that rule, for all cache coherency models?
   I thought only some machines (with dma-incoherent caches) had
   that as a hard constraint.  (Otherwise it's a soft one:  even
   if there's cacheline contention, the hardware won't lose data
   when drivers use memory barriers correctly.)

   I expect that combination is likely to be problematic, since the
   previous rule has been (wrongly?) that kmalloc or kmem_cache
   memory is fine for DMA mappings, no size restrictions.  Yet for
   one example on x86 dma_get_cache_alignment() returns 128 bytes,
   but kmalloc has several smaller pool sizes ... and lately will
   align to L1_CACHE_BYTES (wasting memory on small allocs?) even
   when that's smaller than L1_CACHE_MAX (in the new dma calls).

   All the more reason to have a drop-in kmalloc alternative for
   dma-aware code to use, handling such details transparently!
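
   For instance, such a helper might just round sizes up and align the
   start, so callers never share a cacheline (userspace sketch only;
   128 stands in for a dma_get_cache_alignment() return value, and the
   function name is invented):

```c
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

#define TOY_DMA_ALIGN 128	/* e.g. dma_get_cache_alignment() on x86 */

/* Round the size up and align the start, so no allocation ever shares
 * a cacheline with its neighbor -- callers needn't know the rule.
 */
static void *toy_dma_kmalloc(size_t size)
{
	size_t rounded = (size + TOY_DMA_ALIGN - 1)
			& ~(size_t)(TOY_DMA_ALIGN - 1);

	if (rounded == 0)
		return NULL;
	return aligned_alloc(TOY_DMA_ALIGN, rounded);
}
```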

- Isn't arch/i386/kernel/pci-dma.c handling DMA masks wrong?
   It's passing GFP_DMA in cases where GFP_HIGHMEM is correct ...

I'm glad to see progress on making DMA more generic, thanks!

- Dave

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFT][PATCH] generic device DMA implementation
  2002-12-27 20:21 [RFT][PATCH] generic device DMA implementation David Brownell
@ 2002-12-27 21:40 ` James Bottomley
  2002-12-28  1:29   ` David Brownell
  2002-12-28  1:56   ` David Brownell
  2002-12-27 21:47 ` [RFT][PATCH] generic device DMA implementation James Bottomley
  1 sibling, 2 replies; 44+ messages in thread
From: James Bottomley @ 2002-12-27 21:40 UTC (permalink / raw)
  To: David Brownell; +Cc: James Bottomley, linux-kernel

david-b@pacbell.net said:
> - Implementation-wise, I'm rather surprised that the generic
>    version doesn't just add new bus driver methods rather than
>    still insisting everything be PCI underneath. 

You mean dma-mapping.h in asm-generic?  The reason for that is to provide an 
implementation that functions now for non x86 (and non-parisc) archs without 
having to write specific code for them all.  Since all the other arches now 
function in terms of the pci_ API, that was the only way of sliding the dma_ 
API in without breaking them all.

Bus driver methods have been advocated before, but it's not clear to me that 
they should be exposed in the *generic* API.

>    It's not clear to me how I'd make, for example, a USB device
>    or interface work with dma_map_sg() ... those "generic" calls
>    are going to fail (except on x86, where all memory is DMA-able)
>    since USB != PCI.

Actually, they should work on parisc out of the box as well because of the way 
its DMA implementation is built in terms of the generic dma_ API.

As far as implementing this generically, just adding a case for the 
usb_bus_type in asm-generic/dma-mapping.h will probably get you where you need 
to be. (the asm-generic is, after all, only intended as a stopgap.  Fully 
coherent platforms with no IOMMUs will probably take the x86 route to 
implementing the dma_ API, platforms with IOMMUs will probably (eventually) do 
similar things to parisc).


>    (The second indirection:  the usb controller hardware does the
>    mapping, not the device or hcd.  That's usually PCI.) 

Could you clarify this a little.  I tend to think of "mapping" as something 
done by the IO MMU managing the bus.  I think you mean that the usb controller 
will mark a region of memory to be accessed by the device.  If such a region 
were also "mapped" by an IOMMU, it would be done outside the control of the 
USB controller, correct? (the IOMMU would translate between the address the 
processor sees and the address the USB controller thinks it's responding to)

Is the problem actually that the USB controller needs to be able to allocate 
coherent memory in a range much more narrowly defined than the current 
dma_mask allows?

> - There's no analogue to pci_pool, and there's nothing like
>    "kmalloc" (likely built from N dma-coherent pools). 

I didn't want to build another memory pool re-implementation.  The mempool API 
seems to me to be flexible enough for this, is there some reason it won't work?

I did consider wrapping mempool to make it easier, but I couldn't really 
find a simplifying wrapper that wouldn't lose flexibility.

James




* Re: [RFT][PATCH] generic device DMA implementation
  2002-12-27 20:21 [RFT][PATCH] generic device DMA implementation David Brownell
  2002-12-27 21:40 ` James Bottomley
@ 2002-12-27 21:47 ` James Bottomley
  2002-12-28  2:28   ` David Brownell
  1 sibling, 1 reply; 44+ messages in thread
From: James Bottomley @ 2002-12-27 21:47 UTC (permalink / raw)
  To: David Brownell; +Cc: James Bottomley, linux-kernel

david-b@pacbell.net said:
> - DMA mapping calls still return no errors; so BUG() out instead? 

That's actually an open question.  The line of least resistance (which is what 
I followed) is to do what the pci_ API does (i.e. BUG()).

It's not clear to me that adding error returns rather than BUGging would buy 
us anything (because now all the drivers have to know about the errors and 
process them).

>    Consider systems where DMA-able memory is limited (like SA-1111,
>    to 1 MByte); clearly it should be possible for these calls to
>    fail, when they can't allocate a bounce buffer.  Or (see below)
>    when an invalid argument is provided to a dma mapping call. 

That's pretty much an edge case.  I'm not opposed to putting edge cases in the 
api (I did it for dma_alloc_noncoherent() to help parisc), but I don't think 
the main line should be affected unless there's a good case for it.

Perhaps there is a compromise where the driver flags in the struct 
device_driver that it wants error returns otherwise it takes the default 
behaviour (i.e. no error return checking and BUG if there's a problem).

James




* Re: [RFT][PATCH] generic device DMA implementation
  2002-12-27 21:40 ` James Bottomley
@ 2002-12-28  1:29   ` David Brownell
  2002-12-28 16:18     ` James Bottomley
  2002-12-28  1:56   ` David Brownell
  1 sibling, 1 reply; 44+ messages in thread
From: David Brownell @ 2002-12-28  1:29 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel

James Bottomley wrote:
> david-b@pacbell.net said:
> 
>>- Implementation-wise, I'm rather surprised that the generic
>>   version doesn't just add new bus driver methods rather than
>>   still insisting everything be PCI underneath. 
> 
> 
> You mean dma-mapping.h in asm-generic?  ..

Yes.  As noted, it can't work for USB directly.  And your
suggestion of more "... else if (bus is USB) ... else  ... "
logic (#including each bus type's headers?) bothers me.


> Bus driver methods have been advocated before, but it's not clear to me that 
> they should be exposed in the *generic* API.

Isn't the goal to make sure that for every kind of "struct device *"
it should be possible to use those dma_*() calls, without BUGging
out?  If that's not true ... then why were they defined?

That's certainly the notion I was talking about when this "generic
dma" API notion came up this summer [1].  (That discussion led to
the USB DMA APIs, and then the usb_sg_* calls that let usb-storage
queue scatterlists directly to disk:  performance wins, including
DaveM's "USB keyboards don't allocate IOMMU pages", but structures
looking ahead to having real generic DMA calls.)


>>   It's not clear to me how I'd make, for example, a USB device
>>   or interface work with dma_map_sg() ... those "generic" calls
>>   are going to fail (except on x86, where all memory is DMA-able)
>>   since USB != PCI.
> 
> 
> Actually, they should work on parisc out of the box as well because of the way 
> its DMA implementation is built in terms of the generic dma_ API.

Most of us haven't seen your PARISC code; it's not in Linus' tree.  :)

>>   (The second indirection:  the usb controller hardware does the
>>   mapping, not the device or hcd.  That's usually PCI.) 
> 
> 
> Could you clarify this a little.

Actually, make that "hardware-specific code".

The USB controller is what does the DMA.  But USB device drivers don't
talk to USB controllers, at least not directly.  Instead they talk to a
"struct usb_interface *", or a "struct usb_device *" ... those are more
or less software proxies for the real devices with usbcore and some
HCD turning proxy i/o requests into USB controller operations.

The indirection is getting from the USB device (or interface) to the
object representing the USB controller.  All USB calls need that, at
least for host-side APIs, since the controller driver is multiplexing
up to almost 4000 I/O channels.  (127 devices * 31 endpoints, max; and
of course typical usage is more like dozens of channels.)


> Is the problem actually that the USB controller needs to be able to allocate 
> coherent memory in a range much more narrowly defined than the current 
> dma_mask allows?

Nope, it's just an indirection issue.  Even on a PCI based system, the "struct
device" used by a USB driver (likely usb_interface->dev) will never be a USB
controller.  Since it's the USB controller actually doing the I/O something
needs to use that controller to do the DMA mapping(s).

So any generic DMA logic needs to be able to drill down a level or so before
doing DMA mappings (or allocations) for a USB "struct device *".
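
In userspace terms, that "drill down" is just a parent walk (all names
here are invented, just modeling the proxy structure described above):

```c
#include <stddef.h>

struct toy_dev {
	const char *bus_name;
	struct toy_dev *parent;
	int can_dma;		/* set only on the real DMA master */
};

/* Walk from a usb_interface/usb_device proxy up to the device that
 * actually does the DMA (usually the PCI host controller).
 */
static struct toy_dev *toy_dma_master(struct toy_dev *dev)
{
	while (dev && !dev->can_dma)
		dev = dev->parent;
	return dev;
}
```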

- Dave

[1] http://marc.theaimsgroup.com/?l=linux-kernel&m=102389137402497&w=2




* Re: [RFT][PATCH] generic device DMA implementation
  2002-12-27 21:40 ` James Bottomley
  2002-12-28  1:29   ` David Brownell
@ 2002-12-28  1:56   ` David Brownell
  2002-12-28 16:13     ` James Bottomley
  2002-12-30 23:11     ` [PATCH] generic device DMA (dma_pool update) David Brownell
  1 sibling, 2 replies; 44+ messages in thread
From: David Brownell @ 2002-12-28  1:56 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel


>>- There's no analogue to pci_pool, and there's nothing like
>>   "kmalloc" (likely built from N dma-coherent pools). 
> 
> 
> I didn't want to build another memory pool re-implementation.  The mempool API 
> seems to me to be flexible enough for this, is there some reason it won't work?

I didn't notice any way it would track, and return, DMA addresses.
It's much like a kmem_cache in that way.


> I did consider wrappering mempool to make it easier, but I couldn't really 
> find a simplifying wrapper that wouldn't lose flexibility.

In My Ideal World (tm) Linux would have some kind of memory allocator
that'd be configured to use __get_free_pages() or dma_alloc_coherent()
as appropriate.  Fast, efficient; caching pre-initted objects; etc.

I'm not sure how realistic that is.  So long as APIs keep getting written
so that drivers _must_ re-invent the "memory allocator" wheel, it's not.

But ... if the generic DMA API includes such stuff, it'd be easy to replace
a dumb implementation (have you seen pci_pool, or how usb_buffer_alloc
works? :) with something more intelligent than any driver could justify
writing for its own use.

- Dave



* Re: [RFT][PATCH] generic device DMA implementation
  2002-12-27 21:47 ` [RFT][PATCH] generic device DMA implementation James Bottomley
@ 2002-12-28  2:28   ` David Brownell
  0 siblings, 0 replies; 44+ messages in thread
From: David Brownell @ 2002-12-28  2:28 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel

Hi,

>>- DMA mapping calls still return no errors; so BUG() out instead? 
> 
> 
> That's actually an open question.  The line of least resistance (which is what 
> I followed) is to do what the pci_ API does (i.e. BUG()).

That might have been appropriate for PCI-mostly APIs, since those tend to
be resource-rich.  Maybe.  (It always seemed like an API bug to me.)

I can't buy that logic in the "generic" case though.  Heck, haven't all
the address space allocation calls in Linux always exposed ENOMEM type
faults ... except PCI?  This one is _really_ easy to fix now.  Resources
are never infinite.


> It's not clear to me that adding error returns rather than BUGging would buy 
> us anything (because now all the drivers have to know about the errors and 
> process them).

For me, designing any "generic" API to handle common cases (like allocation
failures) reasonably (no BUGging!) is a fundamental design requirement.

Robust drivers are aware of things like allocation faults, and handle them.
If they do so poorly, that can be fixed like any other driver bug.


>>   Consider systems where DMA-able memory is limited (like SA-1111,
>>   to 1 MByte); clearly it should be possible for these calls to
>>   fail, when they can't allocate a bounce buffer.  Or (see below)
>>   when an invalid argument is provided to a dma mapping call. 
> 
> 
> That's pretty much an edge case.  I'm not opposed to putting edge cases in the 
> api (I did it for dma_alloc_noncoherent() to help parisc), but I don't think 
> the main line should be affected unless there's a good case for it.

Absolutely *any* system can have situations where the relevant address space
(or memory) was all in use, or wasn't available to a non-blocking request
without blocking, etc.  Happens more often on some systems than others; I
just chose SA-1111 since your approach would seem to make that unusable.

If that isn't a "good case", why not?  And what could ever be a "good case"?


> Perhaps there is a compromise where the driver flags in the struct 
> device_driver that it wants error returns otherwise it takes the default 
> behaviour (i.e. no error return checking and BUG if there's a problem).

IMO that's the worst of all possible worlds.  The error paths would get
even less testing than they do today.  If there's a fault path defined,
use it in all cases:  don't just BUG() in some modes, and some drivers.

- Dave



* Re: [RFT][PATCH] generic device DMA implementation
  2002-12-28  1:56   ` David Brownell
@ 2002-12-28 16:13     ` James Bottomley
  2002-12-28 17:41       ` David Brownell
  2002-12-30 23:11     ` [PATCH] generic device DMA (dma_pool update) David Brownell
  1 sibling, 1 reply; 44+ messages in thread
From: James Bottomley @ 2002-12-28 16:13 UTC (permalink / raw)
  To: David Brownell; +Cc: James Bottomley, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 757 bytes --]

OK, the attached is a sketch of an implementation of bus_type operations.

It renames all the platform dma_ operations to platform_dma_  and will call 
only the bus specific operation if it exists.  Thus it will be the 
responsibility of the bus to call the platform_dma_ functions correctly (this 
one is a large loaded gun).

The answer to error handling in the general case is still no (because I don't 
want to impact the main line code for a specific problem, and the main line is 
x86 which effectively has infinite mapping resources), but I don't see why the 
platforms can't export a set of values they guarantee not to return as 
dma_addr_t's that you can use for errors in the bus implementations.

Would this solve most of your problems?

James


[-- Attachment #2: tmp.diff --]
[-- Type: text/plain, Size: 5094 bytes --]

===== include/linux/device.h 1.70 vs edited =====
--- 1.70/include/linux/device.h	Mon Dec 16 10:01:41 2002
+++ edited/include/linux/device.h	Sat Dec 28 09:57:51 2002
@@ -61,6 +61,7 @@
 struct device;
 struct device_driver;
 struct device_class;
+struct bus_dma_ops;
 
 struct bus_type {
 	char			* name;
@@ -75,6 +76,7 @@
 	struct device * (*add)	(struct device * parent, char * bus_id);
 	int		(*hotplug) (struct device *dev, char **envp, 
 				    int num_envp, char *buffer, int buffer_size);
+	struct bus_dma_ops *	dma_ops;
 };
 
 
===== include/linux/dma-mapping.h 1.1 vs edited =====
--- 1.1/include/linux/dma-mapping.h	Sat Dec 21 22:37:05 2002
+++ edited/include/linux/dma-mapping.h	Sat Dec 28 10:02:37 2002
@@ -10,7 +10,144 @@
 	DMA_NONE = 3,
 };
 
+struct bus_dma_ops {
+	int (*supported)(struct device *dev, u64 mask);
+	int (*set_mask)(struct device *dev, u64 mask);
+	void *(*alloc_coherent)(struct device *dev, size_t size,
+				dma_addr_t *dma_handle);
+	void (*free_coherent)(struct device *dev, size_t size, void *cpu_addr,
+			      dma_addr_t dma_handle);
+	dma_addr_t (*map_single)(struct device *dev, void *cpu_addr,
+				 size_t size, enum dma_data_direction direction);
+	void (*unmap_single)(struct device *dev, dma_addr_t dma_addr,
+			     size_t size, enum dma_data_direction direction);
+	int (*map_sg)(struct device *dev, struct scatterlist *sg, int nents,
+		      enum dma_data_direction direction);
+	void (*unmap_sg)(struct device *dev, struct scatterlist *sg,
+			 int nhwentries, enum dma_data_direction direction);
+	void (*sync_single)(struct device *dev, dma_addr_t dma_handle,
+			    size_t size, enum dma_data_direction direction);
+	void (*sync_sg)(struct device *dev, struct scatterlist *sg, int nelems,
+			enum dma_data_direction direction);
+};
+
 #include <asm/dma-mapping.h>
+
+static inline int
+dma_supported(struct device *dev, u64 mask)
+{
+	struct bus_type *bus = dev->bus;
+
+	if(bus->dma_ops && bus->dma_ops->supported)
+		return bus->dma_ops->supported(dev, mask);
+
+	return platform_dma_supported(dev, mask);
+}
+
+static inline int
+dma_set_mask(struct device *dev, u64 dma_mask)
+{
+	struct bus_type *bus = dev->bus;
+
+	if(bus->dma_ops && bus->dma_ops->set_mask)
+		return bus->dma_ops->set_mask(dev, dma_mask);
+	
+	return platform_dma_set_mask(dev, dma_mask);
+}
+
+static inline void *
+dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle)
+{
+	struct bus_type *bus = dev->bus;
+
+	if(bus->dma_ops && bus->dma_ops->alloc_coherent)
+		return bus->dma_ops->alloc_coherent(dev, size, dma_handle);
+
+	return platform_dma_alloc_coherent(dev, size, dma_handle);
+}
+
+static inline void
+dma_free_coherent(struct device *dev, size_t size, void *cpu_addr,
+		    dma_addr_t dma_handle)
+{
+	struct bus_type *bus = dev->bus;
+
+	if(bus->dma_ops && bus->dma_ops->free_coherent)
+		bus->dma_ops->free_coherent(dev, size, cpu_addr, dma_handle);
+	else
+		platform_dma_free_coherent(dev, size, cpu_addr, dma_handle);
+}
+
+static inline dma_addr_t
+dma_map_single(struct device *dev, void *cpu_addr, size_t size,
+	       enum dma_data_direction direction)
+{
+	struct bus_type *bus = dev->bus;
+
+	if(bus->dma_ops && bus->dma_ops->map_single)
+		return bus->dma_ops->map_single(dev, cpu_addr, size, direction);
+	return platform_dma_map_single(dev, cpu_addr, size, direction);
+}
+
+static inline void
+dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size,
+		 enum dma_data_direction direction)
+{
+	struct bus_type *bus = dev->bus;
+
+	if(bus->dma_ops && bus->dma_ops->unmap_single)
+		bus->dma_ops->unmap_single(dev, dma_addr, size, direction);
+	else
+		platform_dma_unmap_single(dev, dma_addr, size, direction);
+}
+
+static inline int
+dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+	   enum dma_data_direction direction)
+{
+	struct bus_type *bus = dev->bus;
+
+	if(bus->dma_ops && bus->dma_ops->map_sg)
+		return bus->dma_ops->map_sg(dev, sg, nents, direction);
+
+	return platform_dma_map_sg(dev, sg, nents, direction);
+}
+
+static inline void
+dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nhwentries,
+	     enum dma_data_direction direction)
+{
+	struct bus_type *bus = dev->bus;
+
+	if(bus->dma_ops && bus->dma_ops->unmap_sg)
+		bus->dma_ops->unmap_sg(dev, sg, nhwentries, direction);
+	else
+		platform_dma_unmap_sg(dev, sg, nhwentries, direction);
+}
+
+static inline void
+dma_sync_single(struct device *dev, dma_addr_t dma_handle, size_t size,
+		enum dma_data_direction direction)
+{
+	struct bus_type *bus = dev->bus;
+
+	if(bus->dma_ops && bus->dma_ops->sync_single)
+		bus->dma_ops->sync_single(dev, dma_handle, size, direction);
+	else
+		platform_dma_sync_single(dev, dma_handle, size, direction);
+}
+
+static inline void
+dma_sync_sg(struct device *dev, struct scatterlist *sg, int nelems,
+	    enum dma_data_direction direction)
+{
+	struct bus_type *bus = dev->bus;
+
+	if(bus->dma_ops && bus->dma_ops->sync_sg)
+		bus->dma_ops->sync_sg(dev, sg, nelems, direction);
+	else
+		platform_dma_sync_sg(dev, sg, nelems, direction);
+}
 
 #endif
 


* Re: [RFT][PATCH] generic device DMA implementation
  2002-12-28  1:29   ` David Brownell
@ 2002-12-28 16:18     ` James Bottomley
  2002-12-28 18:16       ` David Brownell
  0 siblings, 1 reply; 44+ messages in thread
From: James Bottomley @ 2002-12-28 16:18 UTC (permalink / raw)
  To: David Brownell; +Cc: James Bottomley, linux-kernel

david-b@pacbell.net said:
> The indirection is getting from the USB device (or interface) to the
> object representing the USB controller.  All USB calls need that, at
> least for host-side APIs, since the controller driver is multiplexing
> up to almost 4000 I/O channels.  (127 devices * 31 endpoints, max; and
> of course typical usage is more like dozens of channels.) 

This sounds like a mirror of the problem of finding the IOMMU on parisc (there 
can be more than one).

The way parisc solves this is to look in dev->platform_data and if that's null 
walk up the dev->parent until the IOMMU is found and then cache the IOMMU ops 
in the current dev->platform_data.  Obviously, you can't use platform_data, 
but you could use driver_data for this.  The IOMMU's actually lie on a parisc 
specific bus, so the ability to walk up the device tree without having to know 
the device types was crucial to implementing this.
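
That lookup-then-cache pattern can be sketched in userspace like this
(field and function names invented; dev_data stands in for whichever of
platform_data/driver_data ends up holding the cached ops):

```c
#include <stddef.h>

struct toy_iommu_ops { int id; };

struct toy_node {
	struct toy_node *parent;
	struct toy_iommu_ops *iommu;	/* non-NULL only on the IOMMU itself */
	void *dev_data;			/* cache slot, a la platform_data */
};

static struct toy_iommu_ops *toy_find_iommu(struct toy_node *dev)
{
	struct toy_node *p;

	if (dev->dev_data)		/* cached from a previous walk */
		return dev->dev_data;

	for (p = dev; p; p = p->parent) {
		if (p->iommu) {
			dev->dev_data = p->iommu;	/* cache for next time */
			return p->iommu;
		}
	}
	return NULL;			/* no IOMMU above this device */
}
```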

James




* Re: [RFT][PATCH] generic device DMA implementation
  2002-12-28 16:13     ` James Bottomley
@ 2002-12-28 17:41       ` David Brownell
  0 siblings, 0 replies; 44+ messages in thread
From: David Brownell @ 2002-12-28 17:41 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel

James Bottomley wrote:
> OK, the attached is a sketch of an implementation of bus_type operations.

Quick reaction ....

Those signatures look more or less right, at a quick glance,
except that allocating N bytes should pass a __GFP_WAIT flag.
(And of course, allocating a mapping needs a failure return.)

That bus_dma_ops is more of a "vtable" approach, and I confess
I'd been thinking of hanging some object that had internal state
as well as method pointers.  (Call it a "whatsit" for the moment.)

That'd make it possible for layered busses like USB and SCSI
to just reference the "whatsit" from the parent bus in their
layered "struct device" objects. [1]

In many cases that'd just end up being a ref to the "platform
whatsit", eliminating a conditional test from the hot path from
your sketch as well as an entire set of new "platform_*()" APIs.
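
A minimal model of that "whatsit" idea (names invented): each device
carries a pointer to one ops-plus-state object, layered buses like USB
just reference the parent's, and the hot path becomes a single indirect
call with no bus test at all:

```c
#include <stddef.h>

struct toy_whatsit {
	int mappings;				/* per-mapper state */
	int (*map)(struct toy_whatsit *w);	/* method pointer */
};

struct toy_wdev {
	struct toy_whatsit *whatsit;	/* often the parent bus's whatsit */
};

static int platform_like_map(struct toy_whatsit *w)
{
	return ++w->mappings;
}

/* Hot path: no "if (bus is USB) ... else ..." chain, one indirection. */
static int toy_map(struct toy_wdev *dev)
{
	return dev->whatsit->map(dev->whatsit);
}
```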

- Dave

[1] That is, resembling what Benjamin Herrenschmidt suggested:

http://marc.theaimsgroup.com/?l=linux-kernel&m=102389432006266&w=2


* Re: [RFT][PATCH] generic device DMA implementation
  2002-12-28 16:18     ` James Bottomley
@ 2002-12-28 18:16       ` David Brownell
  0 siblings, 0 replies; 44+ messages in thread
From: David Brownell @ 2002-12-28 18:16 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel

James Bottomley wrote:
> david-b@pacbell.net said:
> 
>>The indirection is getting from the USB device (or interface) to the
>>object representing the USB controller.  ...
> 
> This sounds like a mirror of the problem of finding the IOMMU on parisc (there 
> can be more than one).

Wouldn't it be straightforward to package that IOMMU solution using the
"call dev->whatsit->dma_op()" approach I mentioned?  Storing data in
the "whatsit" seems more practical than saying driver_data is no longer
available to the device's driver.  (I'll be agnostic on platform_data.)

This problem seems to me to be a common layering requirement.  All the
indirections are known when the device structure is being initted, so it
might as well be set up then.  True for PARISC (right?), as well as USB,
SCSI, and most other driver stacks.  I suspect it'd even allow complex
voodoo for multi-path I/O too...

- Dave


> The way parisc solves this is to look in dev->platform_data and if that's null 
> walk up the dev->parent until the IOMMU is found and then cache the IOMMU ops 
> in the current dev->platform_data.  Obviously, you can't use platform_data, 
> but you could use driver_data for this.  The IOMMU's actually lie on a parisc 
> specific bus, so the ability to walk up the device tree without having to know 
> the device types was crucial to implementing this.
> 
> James

* [PATCH] generic device DMA (dma_pool update)
  2002-12-28  1:56   ` David Brownell
  2002-12-28 16:13     ` James Bottomley
@ 2002-12-30 23:11     ` David Brownell
  2002-12-31 15:00       ` James Bottomley
  2002-12-31 16:36       ` James Bottomley
  1 sibling, 2 replies; 44+ messages in thread
From: David Brownell @ 2002-12-30 23:11 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1729 bytes --]

  >>> - There's no analogue to pci_pool, and there's nothing like
  >>>   "kmalloc" (likely built from N dma-coherent pools).
  >>
  >> I didn't want to build another memory pool re-implementation.  The
  >> mempool API seems to me to be flexible enough for this, is there some
  >> reason it won't work?
  >
  > I didn't notice any way it would track, and return, DMA addresses.
  > It's much like a kmem_cache in that way.

Here's a patch (against 2.5.53) that does it with current allocators:

- adds new dma_pool calls, implemented by inlines using "kmem_cache_t *"
   (wrapped) or "struct pci_pool *" depending on the arch/config.

        struct dma_pool *dma_pool_create(char *, struct device *, size_t)
        void dma_pool_destroy (struct dma_pool *pool)
        void *dma_pool_alloc(struct dma_pool *, int mem_flags, dma_addr_t *)
        void dma_pool_free(struct dma_pool *, void *, dma_addr_t)

- new kmalloc/kfree style calls, likewise turned by inlines into kmalloc
   or code that can use dma_pool (but can't yet, needs a device dma state
   hook), falling back to dma_alloc_coherent for page allocation otherwise.

        void *dma_alloc(struct device *, size_t, dma_addr_t *, int mem_flags)
        void dma_free(struct device *, size_t, void *, dma_addr_t)

- pci_pool is now arch-specific like the rest of the pci dma calls.
   along with a DMA-API.txt update, this makes 2/3 of this patch. so
   x86 uses dma_pool instead, through <asm-generic/pci-dma-compat.h>

An immediate effect of this patch is that on x86, both USB and Firewire will
do their pci_pool allocations through kmem_cache_t, which is a smarter (and
faster) allocator.  (So would other drivers using pci_pool calls.)

Comments?

- Dave

[-- Attachment #2: dma-1230.patch --]
[-- Type: text/plain, Size: 27449 bytes --]

--- ./include-dist/asm-generic/dma-mapping.h	Wed Dec 25 08:47:40 2002
+++ ./include/asm-generic/dma-mapping.h	Mon Dec 30 11:17:32 2002
@@ -152,5 +152,33 @@
 	BUG();
 }
 
+#ifdef __dma_slab_flags
+/* if we can use the slab allocator, do so.  */
+#include <asm-generic/dma-slab.h>
+
+#else
+
+extern void *
+dma_alloc (struct device *dev, size_t size, dma_addr_t *handle, int mem_flags);
+
+extern void
+dma_free (struct device *dev, size_t size, void *vaddr, dma_addr_t dma);
+
+
+#define dma_pool pci_pool
+
+static inline struct dma_pool *
+dma_pool_create (const char *name, struct device *dev, size_t size)
+{
+	BUG_ON(dev->bus != &pci_bus_type);
+	return pci_pool_create(name, to_pci_dev(dev), size, 0, 0);
+}
+
+#define dma_pool_destroy	pci_pool_destroy
+#define dma_pool_alloc		pci_pool_alloc
+#define dma_pool_free		pci_pool_free
+
+#endif
+
 #endif
 
--- ./include-dist/asm-generic/dma-slab.h	Wed Dec 31 16:00:00 1969
+++ ./include/asm-generic/dma-slab.h	Mon Dec 30 10:35:47 2002
@@ -0,0 +1,59 @@
+#ifndef _ASM_GENERIC_DMA_SLAB_H
+#define _ASM_GENERIC_DMA_SLAB_H
+
+/* flags == SLAB_KERNEL or SLAB_ATOMIC */
+static inline void *
+dma_alloc (struct device *dev, size_t size, dma_addr_t *handle, int mem_flags)
+{
+	void	*ret;
+
+	/* We rely on kmalloc memory being aligned to L1_CACHE_BYTES, to
+	 * prevent cacheline sharing during DMA.  With dma-incoherent caches,
+	 * such sharing causes bugs, not just cache-related slowdown.
+	 */
+	ret = kmalloc (size, mem_flags | __dma_slab_flags (dev));
+	if (likely (ret != 0))
+		*handle = virt_to_phys (ret);
+	return ret;
+}
+
+static inline void
+dma_free (struct device *dev, size_t size, void *vaddr, dma_addr_t dma)
+{
+	kfree (vaddr);
+}
+
+
+struct dma_pool {
+	/* these caches are forced to be hw-aligned too */
+	kmem_cache_t 	*cache;
+	struct device	*dev;
+	char		name [0];
+};
+
+extern struct dma_pool *
+dma_pool_create (const char *name, struct device *dev, size_t size);
+
+extern void dma_pool_destroy (struct dma_pool *pool);
+
+
+/* flags == SLAB_KERNEL or SLAB_ATOMIC */
+static inline void *
+dma_pool_alloc (struct dma_pool *pool, int flags, dma_addr_t *handle)
+{
+	void		*ret;
+
+	ret = kmem_cache_alloc (pool->cache, flags);
+	if (likely (ret != 0))
+		*handle = virt_to_phys (ret);
+	return ret;
+}
+
+static inline void
+dma_pool_free (struct dma_pool *pool, void *vaddr, dma_addr_t addr)
+{
+	kmem_cache_free (pool->cache, vaddr);
+}
+
+#endif
+
--- ./include-dist/asm-i386/dma-mapping.h	Wed Dec 25 08:49:09 2002
+++ ./include/asm-i386/dma-mapping.h	Mon Dec 30 13:12:07 2002
@@ -134,4 +134,9 @@
 	flush_write_buffers();
 }
 
+/* use the slab allocator to manage dma buffers */
+#define __dma_slab_flags(dev) \
+	((*(dev)->dma_mask <= 0x00ffffff) ? SLAB_DMA : 0)
+#include <asm-generic/dma-slab.h>
+
 #endif
--- ./include-dist/asm-ppc/dma-mapping.h	Wed Dec 25 08:48:02 2002
+++ ./include/asm-ppc/dma-mapping.h	Mon Dec 30 10:43:51 2002
@@ -1 +1,7 @@
+#ifndef CONFIG_NOT_COHERENT_CACHE
+/* use the slab allocator to manage dma buffers */
+#define __dma_slab_flags(dev) \
+	((*(dev)->dma_mask != 0xffffffff) ? SLAB_DMA : 0)
+#endif
+
 #include <asm-generic/dma-mapping.h>
--- ./include-dist/asm-um/dma-mapping.h	Wed Dec 25 08:49:40 2002
+++ ./include/asm-um/dma-mapping.h	Mon Dec 30 13:25:28 2002
@@ -1 +1,4 @@
+/* use the slab allocator to manage dma buffers */
+#define __dma_slab_flags(dev) 0
+
 #include <asm-generic/dma-mapping.h>
--- ./include-dist/asm-arm/dma-mapping.h	Wed Dec 25 08:48:19 2002
+++ ./include/asm-arm/dma-mapping.h	Mon Dec 30 13:25:36 2002
@@ -1 +1,6 @@
+#ifdef	CONFIG_SA1111
+/* we can just use the slab allocator to manage dma buffers */
+#define __dma_slab_flags(dev) SLAB_DMA
+#endif
+
 #include <asm-generic/dma-mapping.h>
--- ./drivers/base-dist/dma.c	Wed Dec 31 16:00:00 1969
+++ ./drivers/base/dma.c	Mon Dec 30 14:37:46 2002
@@ -0,0 +1,106 @@
+/* the non-inlined support for coherent dma memory buffer calls */
+
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/version.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/device.h>
+
+#include <asm/io.h>
+#include <asm/scatterlist.h>
+#include <linux/mm.h>
+#include <linux/dma-mapping.h>
+
+
+#ifdef __dma_slab_flags
+
+/*
+ * Slab memory can be used for DMA buffers on some platforms (including i386).
+ * This implementation is mostly inlined in <asm-generic/dma-slab.h>.
+ */
+
+struct dma_pool *
+dma_pool_create (const char *name, struct device *dev, size_t size)
+{
+	int		tmp;
+	struct dma_pool	*pool;
+
+	tmp = strlen (name);
+	pool = kmalloc (sizeof *pool + tmp + strlen (dev->bus_id) + 2, SLAB_KERNEL);
+	if (!pool)
+		return pool;
+
+	/* bus_id makes the name unique over dev->bus, not globally */
+	memcpy (pool->name, name, tmp);
+	pool->name [tmp] = '/';
+	strcpy (&pool->name [tmp + 1], dev->bus_id);
+
+	tmp = SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN;
+	if (__dma_slab_flags (dev) & SLAB_DMA)
+		tmp |= SLAB_CACHE_DMA;
+
+	pool->cache = kmem_cache_create (pool->name, size, 0, tmp, 0, 0);
+	if (!pool->cache) {
+		kfree (pool);
+		pool = 0;
+	} else
+		pool->dev = get_device (dev);
+	return pool;
+}
+EXPORT_SYMBOL (dma_pool_create);
+
+void dma_pool_destroy (struct dma_pool *pool)
+{
+	kmem_cache_destroy (pool->cache);
+	put_device (pool->dev);
+	kfree (pool);
+}
+EXPORT_SYMBOL (dma_pool_destroy);
+
+#elif	defined (CONFIG_PCI)
+
+#include <linux/pci.h>
+
+/*
+ * Otherwise, assume wrapping pci_pool will do.
+ * This implementation is mostly inlined in <asm-generic/dma-mapping.h>.
+ */
+
+void *
+dma_alloc (struct device *dev, size_t size, dma_addr_t *handle, int mem_flags)
+{
+	struct pci_dev	*pdev = to_pci_dev(dev);
+
+	BUG_ON(dev->bus != &pci_bus_type);
+
+	if (size >= PAGE_SIZE)
+		return pci_alloc_consistent (pdev, size, handle);
+
+	/* FIXME keep N pools of per-device dma state, not just a u64 for the
+	 * dma mask.  allocate from the right pool here.  (unless this platform
+	 * has a pci_alloc_consistent() that does 1/Nth page allocations.)
+	 * and since pci_alloc_consistent can't fail (so must BUG out), it
+	 * might be a good idea to cache a few larger buffers here too.
+	 */
+#warning "broken dma_alloc()"
+
+	return 0;
+}
+EXPORT_SYMBOL (dma_alloc);
+
+void
+dma_free (struct device *dev, size_t size, void *vaddr, dma_addr_t dma)
+{
+	struct pci_dev	*pdev = to_pci_dev(dev);
+
+	BUG_ON(dev->bus != &pci_bus_type);
+
+	pci_free_consistent (pdev, size, vaddr, dma);
+}
+EXPORT_SYMBOL (dma_free);
+
+
+#else
+#error "no dma buffer allocation is available"
+#endif
--- ./drivers/base-dist/Makefile	Wed Dec 25 08:50:00 2002
+++ ./drivers/base/Makefile	Sat Dec 28 11:10:42 2002
@@ -2,7 +2,7 @@
 
 obj-y		:= core.o sys.o interface.o power.o bus.o \
 			driver.o class.o intf.o platform.o \
-			cpu.o firmware.o
+			cpu.o firmware.o dma.o
 
 obj-$(CONFIG_NUMA)	+= node.o  memblk.o
 
@@ -11,4 +11,4 @@
 obj-$(CONFIG_HOTPLUG)	+= hotplug.o
 
 export-objs	:= core.o power.o sys.o bus.o driver.o \
-			class.o intf.o platform.o firmware.o
+			class.o intf.o platform.o firmware.o dma.o
--- ./Documentation-dist/DMA-API.txt	Wed Dec 25 08:54:25 2002
+++ ./Documentation/DMA-API.txt	Mon Dec 30 12:01:37 2002
@@ -5,7 +5,8 @@
 
 This document describes the DMA API.  For a more gentle introduction
 phrased in terms of the pci_ equivalents (and actual examples) see
-DMA-mapping.txt
+DMA-mapping.txt.  USB device drivers will normally use the USB DMA
+APIs instead; see Documentation/usb/dma.txt.
 
 This API is split into two pieces.  Part I describes the API and the
 corresponding pci_ API.  Part II describes the extensions to the API
@@ -20,18 +21,31 @@
 To get the pci_ API, you must #include <linux/pci.h>
 To get the dma_ API, you must #include <linux/dma-mapping.h>
 
-void *
-dma_alloc_coherent(struct device *dev, size_t size,
+
+I.A Coherent memory
++++++++++++++++++++
+Coherent memory is memory for which a write by either the device or the
+processor can immediately be read by the other, without having to worry
+about caches losing the write.  Drivers should normally use memory barrier
+instructions to ensure that the device and processor see writes in the
+intended order.  CPUs and compilers commonly re-order reads, unless rmb()
+or mb() is used.  Many also reorder writes unless wmb() or mb() is used.
+
+Just as with normal kernel memory, there are several allocation APIs.
+One talks in terms of pages, another talks in terms of arbitrary sized
+buffers, and another talks in terms of pools of fixed-size blocks.
+
+
+ALLOCATING AND FREEING ONE OR MORE PAGES
+
+	void *
+	dma_alloc_coherent(struct device *dev, size_t size,
 			     dma_addr_t *dma_handle)
-void *
-pci_alloc_consistent(struct pci_dev *dev, size_t size,
+	void *
+	pci_alloc_consistent(struct pci_dev *dev, size_t size,
 			     dma_addr_t *dma_handle)
 
-Consistent memory is memory for which a write by either the device or
-the processor can immediately be read by the processor or device
-without having to worry about caching effects.
-
-This routine allocates a region of <size> bytes of consistent memory.
+This routine allocates a region of <size> bytes of coherent memory.
 it also returns a <dma_handle> which may be cast to an unsigned
 integer the same width as the bus and used as the physical address
 base of the region.
@@ -39,26 +53,113 @@
 Returns: a pointer to the allocated region (in the processor's virtual
 address space) or NULL if the allocation failed.
 
-Note: consistent memory can be expensive on some platforms, and the
+Note: coherent memory can be expensive on some platforms, and the
 minimum allocation length may be as big as a page, so you should
-consolidate your requests for consistent memory as much as possible.
+consolidate your requests for coherent memory as much as possible.
+Or better yet, use the other APIs (below), which handle 1/N page
+allocations as well as N-page allocations.
 
-void
-dma_free_coherent(struct device *dev, size_t size, void *cpu_addr
+	void
+	dma_free_coherent(struct device *dev, size_t size, void *cpu_addr
 			   dma_addr_t dma_handle)
-void
-pci_free_consistent(struct pci_dev *dev, size_t size, void *cpu_addr
+	void
+	pci_free_consistent(struct pci_dev *dev, size_t size, void *cpu_addr
 			   dma_addr_t dma_handle)
 
-Free the region of consistent memory you previously allocated.  dev,
+Free the region of coherent memory you previously allocated.  dev,
 size and dma_handle must all be the same as those passed into the
-consistent allocate.  cpu_addr must be the virtual address returned by
-the consistent allocate
+coherent allocation.  cpu_addr must be the virtual address returned by
+the coherent allocation.
+
+
+If you are working with coherent memory that's smaller than a page, use the
+pool allocators (for fixed size allocations) or dma_alloc (otherwise).  All
+this memory is aligned to at least L1_CACHE_BYTES, so that cacheline sharing
+can't cause bugs (on systems with cache-incoherent dma) or performance loss
+(on the more widely used systems with cache-coherent DMA).
+
+Note that there is no pci analogue to dma_alloc(), and there's no USB
+analogue to dma pools.
+
+
+ALLOCATING AND FREEING ARBITRARY SIZED BUFFERS
+
+	void *
+	dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
+		int mem_flags)
+
+	void
+	dma_free(struct device *dev, size_t size, void *vaddr, dma_addr_t dma)
+
+These are like kmalloc and kfree, except that since they return dma-coherent
+memory they must also return DMA addresses.  Also, they aren't guaranteed to
+accept mem_flags values other than SLAB_KERNEL and SLAB_ATOMIC.  The allocator
+returns null when it fails for any reason.
+
+
+POOL CREATION AND DESTRUCTION
+
+Device drivers often use pools of coherent memory, along with memory
+barriers, to communicate to devices through request queues.
+
+	struct dma_pool *
+	dma_pool_create(const char *name, struct device *dev, size_t size)
+
+	struct pci_pool *
+	pci_pool_create(const char *name, struct pci_dev *dev,
+		size_t size, size_t align, size_t allocation)
 
-int
-dma_supported(struct device *dev, u64 mask)
-int
-pci_dma_supported(struct device *dev, u64 mask)
+Pools are like a kmem_cache_t but they work for dma-coherent memory.  They
+must be created (or destroyed) in thread contexts.  The generic API takes
+only 'size'; use a larger size to force alignment beyond L1_CACHE_BYTES.
+With the pci_ API, allocations also never cross 'allocation' byte boundaries.
+
+Returns the pool, or null if one can't be created.
+
+	void
+	dma_pool_destroy(struct dma_pool *pool)
+
+	void
+	pci_pool_destroy(struct pci_pool *pool)
+
+These calls just destroy the pool provided.  The caller guarantees all the
+memory has been returned to the pool, and is not currently in use.
+
+ALLOCATING AND FREEING MEMORY FROM A POOL
+
+	void *
+	dma_pool_alloc(struct dma_pool *pool, int mem_flags, dma_addr_t *handle)
+
+	void *
+	pci_pool_alloc(struct pci_pool *pool, int mem_flags, dma_addr_t *handle)
+
+When allocation succeeds, a pointer to the allocated memory is returned
+and the dma address is stored through the handle pointer.  "mem_flags"
+can be SLAB_KERNEL or SLAB_ATOMIC, controlling whether it's OK to sleep.
+If allocation fails for any reason, null is returned.
+
+	void
+	dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
+
+	void
+	pci_pool_free(struct pci_pool *pool, void *vaddr, dma_addr_t dma)
+
+When your driver is done using memory allocated using the pool allocation
+calls, pass both of the memory addresses to the free() routine.
+
+
+I.B DMA masks
++++++++++++++
+Hardware often has limitations in the DMA address space it supports.
+Some common examples are x86 ISA addresses, in the first 16 MBytes of
+PC address space, and "highmem" I/O, over the 4 GB mark.  The coherent
+memory allocators will never return "highmem" pointers.
+
+	int
+	dma_supported(struct device *dev, u64 mask)
+
+	int
+	pci_dma_supported(struct device *dev, u64 mask)
 
 Checks to see if the device can support DMA to the memory described by
 mask.
@@ -70,21 +171,31 @@
 internal API for use by the platform than an external API for use by
 driver writers.
 
-int
-dma_set_mask(struct device *dev, u64 mask)
-int
-pci_dma_set_mask(struct pci_device *dev, u64 mask)
+	int
+	dma_set_mask(struct device *dev, u64 mask)
+
+	int
+	pci_dma_set_mask(struct pci_device *dev, u64 mask)
 
 Checks to see if the mask is possible and updates the device
 parameters if it is.
 
 Returns: 1 if successful and 0 if not
 
-dma_addr_t
-dma_map_single(struct device *dev, void *cpu_addr, size_t size,
+
+I.C DMA mappings
+++++++++++++++++
+To work with arbitrary DMA-ready memory, you need to map it into the DMA
+address space of the device.  This might involve an IOMMU or a bounce buffer.
+The two kinds of mapping are flat "single" mappings, sometimes for single
+pages; and grouped "scatter/gather" ones.  If you're not using the mappings
+for one-shot I/O, you may also need to synchronize the mappings.
+
+	dma_addr_t
+	dma_map_single(struct device *dev, void *cpu_addr, size_t size,
 		      enum dma_data_direction direction)
-dma_addr_t
-pci_map_single(struct device *dev, void *cpu_addr, size_t size,
+	dma_addr_t
+	pci_map_single(struct device *dev, void *cpu_addr, size_t size,
 		      int direction)
 
 Maps a piece of processor virtual memory so it can be accessed by the
--- ./include-dist/linux/pci.h	Wed Dec 25 08:48:53 2002
+++ ./include/linux/pci.h	Mon Dec 30 12:47:36 2002
@@ -672,14 +672,6 @@
 struct pci_bus * pci_add_new_bus(struct pci_bus *parent, struct pci_dev *dev, int busnr);
 int pci_scan_bridge(struct pci_bus *bus, struct pci_dev * dev, int max, int pass);
 
-/* kmem_cache style wrapper around pci_alloc_consistent() */
-struct pci_pool *pci_pool_create (const char *name, struct pci_dev *dev,
-		size_t size, size_t align, size_t allocation);
-void pci_pool_destroy (struct pci_pool *pool);
-
-void *pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle);
-void pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr);
-
 #if defined(CONFIG_ISA) || defined(CONFIG_EISA)
 extern struct pci_dev *isa_bridge;
 #endif
--- ./include-dist/asm-generic/pci-dma-compat.h	Wed Dec 25 08:47:40 2002
+++ ./include/asm-generic/pci-dma-compat.h	Mon Dec 30 13:14:41 2002
@@ -35,6 +35,39 @@
 	return dma_map_single(&hwdev->dev, ptr, size, (enum dma_data_direction)direction);
 }
 
+static inline struct pci_pool *
+pci_pool_create (const char *name, struct pci_dev *hwdev,
+		size_t size, size_t align, size_t allocation)
+{
+	if (size < align)
+		size = align;
+	if (allocation) {
+		while (align < allocation && (allocation % size) != 0) {
+			align <<= 1;
+			if (size < align)
+				size = align;
+		}
+	}
+	return (struct pci_pool *) dma_pool_create(name, &hwdev->dev, size);
+}
+
+static inline void pci_pool_destroy (struct pci_pool *pool)
+{
+	dma_pool_destroy ((struct dma_pool *) pool);
+}
+
+static inline void *
+pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle)
+{
+	return dma_pool_alloc((struct dma_pool *) pool, flags, handle);
+}
+
+static inline void
+pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr)
+{
+	return dma_pool_free((struct dma_pool *) pool, vaddr, addr);
+}
+
 static inline void
 pci_unmap_single(struct pci_dev *hwdev, dma_addr_t dma_addr,
 		 size_t size, int direction)
--- ./drivers/pci-dist/pool.c	Wed Dec 25 08:51:41 2002
+++ ./drivers/pci/pool.c	Mon Dec 30 12:51:57 2002
@@ -2,6 +2,8 @@
 #include <linux/slab.h>
 #include <linux/module.h>
 
+#ifndef __dma_slab_flags
+
 /*
  * Pool allocator ... wraps the pci_alloc_consistent page allocator, so
  * small blocks are easily used by drivers for bus mastering controllers.
@@ -403,3 +405,5 @@
 EXPORT_SYMBOL (pci_pool_destroy);
 EXPORT_SYMBOL (pci_pool_alloc);
 EXPORT_SYMBOL (pci_pool_free);
+
+#endif /* !__dma_slab_flags */
--- ./include-dist/asm-alpha/pci.h	Wed Dec 25 08:49:02 2002
+++ ./include/asm-alpha/pci.h	Mon Dec 30 12:35:11 2002
@@ -81,6 +81,14 @@
 
 extern void pci_free_consistent(struct pci_dev *, size_t, void *, dma_addr_t);
 
+/* kmem_cache style wrapper around pci_alloc_consistent() */
+struct pci_pool *pci_pool_create (const char *name, struct pci_dev *dev,
+		size_t size, size_t align, size_t allocation);
+void pci_pool_destroy (struct pci_pool *pool);
+
+void *pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle);
+void pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr);
+
 /* Map a single buffer of the indicate size for PCI DMA in streaming
    mode.  The 32-bit PCI bus mastering address to use is returned.
    Once the device is given the dma address, the device owns this memory
--- ./include-dist/asm-arm/pci.h	Wed Dec 25 08:48:18 2002
+++ ./include/asm-arm/pci.h	Mon Dec 30 12:35:29 2002
@@ -76,6 +76,14 @@
 	consistent_free(vaddr, size, dma_handle);
 }
 
+/* kmem_cache style wrapper around pci_alloc_consistent() */
+struct pci_pool *pci_pool_create (const char *name, struct pci_dev *dev,
+		size_t size, size_t align, size_t allocation);
+void pci_pool_destroy (struct pci_pool *pool);
+
+void *pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle);
+void pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr);
+
 /* Map a single buffer of the indicated size for DMA in streaming mode.
  * The 32-bit bus address to use is returned.
  *
--- ./include-dist/asm-ia64/pci.h	Wed Dec 25 08:47:47 2002
+++ ./include/asm-ia64/pci.h	Mon Dec 30 12:36:21 2002
@@ -58,6 +58,14 @@
 #define sg_dma_address			platform_pci_dma_address
 #define pci_dma_supported		platform_pci_dma_supported
 
+/* kmem_cache style wrapper around pci_alloc_consistent() */
+struct pci_pool *pci_pool_create (const char *name, struct pci_dev *dev,
+		size_t size, size_t align, size_t allocation);
+void pci_pool_destroy (struct pci_pool *pool);
+
+void *pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle);
+void pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr);
+
 /* pci_unmap_{single,page} is not a nop, thus... */
 #define DECLARE_PCI_UNMAP_ADDR(ADDR_NAME)	\
 	dma_addr_t ADDR_NAME;
--- ./include-dist/asm-mips/pci.h	Wed Dec 25 08:48:27 2002
+++ ./include/asm-mips/pci.h	Mon Dec 30 12:36:52 2002
@@ -77,6 +77,14 @@
 extern void pci_free_consistent(struct pci_dev *hwdev, size_t size,
 				void *vaddr, dma_addr_t dma_handle);
 
+/* kmem_cache style wrapper around pci_alloc_consistent() */
+struct pci_pool *pci_pool_create (const char *name, struct pci_dev *dev,
+		size_t size, size_t align, size_t allocation);
+void pci_pool_destroy (struct pci_pool *pool);
+
+void *pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle);
+void pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr);
+
 /*
  * Map a single buffer of the indicated size for DMA in streaming mode.
  * The 32-bit bus address to use is returned.
--- ./include-dist/asm-mips64/pci.h	Wed Dec 25 08:49:31 2002
+++ ./include/asm-mips64/pci.h	Mon Dec 30 12:37:04 2002
@@ -70,6 +70,14 @@
 extern void pci_free_consistent(struct pci_dev *hwdev, size_t size,
 				void *vaddr, dma_addr_t dma_handle);
 
+/* kmem_cache style wrapper around pci_alloc_consistent() */
+struct pci_pool *pci_pool_create (const char *name, struct pci_dev *dev,
+		size_t size, size_t align, size_t allocation);
+void pci_pool_destroy (struct pci_pool *pool);
+
+void *pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle);
+void pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr);
+
 
 #ifdef CONFIG_MAPPED_PCI_IO
 
--- ./include-dist/asm-parisc/pci.h	Wed Dec 25 08:48:31 2002
+++ ./include/asm-parisc/pci.h	Mon Dec 30 12:37:26 2002
@@ -146,6 +146,14 @@
 	void (*dma_sync_sg)(struct pci_dev *dev, struct scatterlist *sg, int nelems, int direction);
 };
 
+/* kmem_cache style wrapper around pci_alloc_consistent() */
+struct pci_pool *pci_pool_create (const char *name, struct pci_dev *dev,
+		size_t size, size_t align, size_t allocation);
+void pci_pool_destroy (struct pci_pool *pool);
+
+void *pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle);
+void pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr);
+
 
 /*
 ** We could live without the hppa_dma_ops indirection if we didn't want
--- ./include-dist/asm-ppc/pci.h	Wed Dec 25 08:48:02 2002
+++ ./include/asm-ppc/pci.h	Mon Dec 30 12:37:45 2002
@@ -91,6 +91,14 @@
 extern void pci_free_consistent(struct pci_dev *hwdev, size_t size,
 				void *vaddr, dma_addr_t dma_handle);
 
+/* kmem_cache style wrapper around pci_alloc_consistent() */
+struct pci_pool *pci_pool_create (const char *name, struct pci_dev *dev,
+		size_t size, size_t align, size_t allocation);
+void pci_pool_destroy (struct pci_pool *pool);
+
+void *pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle);
+void pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr);
+
 /* Map a single buffer of the indicated size for DMA in streaming mode.
  * The 32-bit bus address to use is returned.
  *
--- ./include-dist/asm-ppc64/pci.h	Wed Dec 25 08:47:38 2002
+++ ./include/asm-ppc64/pci.h	Mon Dec 30 12:37:53 2002
@@ -43,6 +43,14 @@
 extern void pci_free_consistent(struct pci_dev *hwdev, size_t size,
 				void *vaddr, dma_addr_t dma_handle);
 
+/* kmem_cache style wrapper around pci_alloc_consistent() */
+struct pci_pool *pci_pool_create (const char *name, struct pci_dev *dev,
+		size_t size, size_t align, size_t allocation);
+void pci_pool_destroy (struct pci_pool *pool);
+
+void *pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle);
+void pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr);
+
 extern dma_addr_t pci_map_single(struct pci_dev *hwdev, void *ptr,
 				 size_t size, int direction);
 extern void pci_unmap_single(struct pci_dev *hwdev, dma_addr_t dma_addr,
--- ./include-dist/asm-sh/pci.h	Wed Dec 25 08:49:37 2002
+++ ./include/asm-sh/pci.h	Mon Dec 30 12:38:03 2002
@@ -71,6 +71,14 @@
 extern void pci_free_consistent(struct pci_dev *hwdev, size_t size,
 				void *vaddr, dma_addr_t dma_handle);
 
+/* kmem_cache style wrapper around pci_alloc_consistent() */
+struct pci_pool *pci_pool_create (const char *name, struct pci_dev *dev,
+		size_t size, size_t align, size_t allocation);
+void pci_pool_destroy (struct pci_pool *pool);
+
+void *pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle);
+void pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr);
+
 /* Map a single buffer of the indicated size for DMA in streaming mode.
  * The 32-bit bus address to use is returned.
  *
--- ./include-dist/asm-sparc/pci.h	Wed Dec 25 08:49:17 2002
+++ ./include/asm-sparc/pci.h	Mon Dec 30 12:38:13 2002
@@ -47,6 +47,14 @@
  */
 extern void pci_free_consistent(struct pci_dev *hwdev, size_t size, void *vaddr, dma_addr_t dma_handle);
 
+/* kmem_cache style wrapper around pci_alloc_consistent() */
+struct pci_pool *pci_pool_create (const char *name, struct pci_dev *dev,
+		size_t size, size_t align, size_t allocation);
+void pci_pool_destroy (struct pci_pool *pool);
+
+void *pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle);
+void pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr);
+
 /* Map a single buffer of the indicated size for DMA in streaming mode.
  * The 32-bit bus address to use is returned.
  *
--- ./include-dist/asm-sparc64/pci.h	Wed Dec 25 08:49:25 2002
+++ ./include/asm-sparc64/pci.h	Mon Dec 30 12:38:21 2002
@@ -55,6 +55,14 @@
  */
 extern void pci_free_consistent(struct pci_dev *hwdev, size_t size, void *vaddr, dma_addr_t dma_handle);
 
+/* kmem_cache style wrapper around pci_alloc_consistent() */
+struct pci_pool *pci_pool_create (const char *name, struct pci_dev *dev,
+		size_t size, size_t align, size_t allocation);
+void pci_pool_destroy (struct pci_pool *pool);
+
+void *pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle);
+void pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr);
+
 /* Map a single buffer of the indicated size for DMA in streaming mode.
  * The 32-bit bus address to use is returned.
  *
--- ./include-dist/asm-v850/pci.h	Wed Dec 25 08:49:46 2002
+++ ./include/asm-v850/pci.h	Mon Dec 30 12:38:36 2002
@@ -74,4 +74,12 @@
 pci_free_consistent (struct pci_dev *pdev, size_t size, void *cpu_addr,
 		     dma_addr_t dma_addr);
 
+/* kmem_cache style wrapper around pci_alloc_consistent() */
+struct pci_pool *pci_pool_create (const char *name, struct pci_dev *dev,
+		size_t size, size_t align, size_t allocation);
+void pci_pool_destroy (struct pci_pool *pool);
+
+void *pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle);
+void pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr);
+
 #endif /* __V850_PCI_H__ */
--- ./include-dist/asm-x86_64/pci.h	Wed Dec 25 08:47:55 2002
+++ ./include/asm-x86_64/pci.h	Mon Dec 30 12:40:43 2002
@@ -69,6 +69,14 @@
 extern void pci_free_consistent(struct pci_dev *hwdev, size_t size,
 				void *vaddr, dma_addr_t dma_handle);
 
+/* kmem_cache style wrapper around pci_alloc_consistent() */
+struct pci_pool *pci_pool_create (const char *name, struct pci_dev *dev,
+		size_t size, size_t align, size_t allocation);
+void pci_pool_destroy (struct pci_pool *pool);
+
+void *pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle);
+void pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr);
+
 #ifdef CONFIG_GART_IOMMU
 
 /* Map a single buffer of the indicated size for DMA in streaming mode.

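For illustration, the per-arch __dma_slab_flags() choice in the patch (the
i386 version keys off the device's DMA mask) boils down to a one-line
predicate.  The following is a hedged userspace sketch, not kernel code; the
SLAB_DMA value here is made up, and real code would use the kernel's flag:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's slab flag; the value is illustrative only. */
#define SLAB_DMA 0x01

/* Mirror of the i386 __dma_slab_flags() test in the patch: a device whose
 * DMA mask fits inside the ISA 16 MByte window must allocate from the DMA
 * zone; any other device can use ordinary kmalloc memory.
 */
static int dma_slab_flags(uint64_t dma_mask)
{
	return (dma_mask <= 0x00ffffff) ? SLAB_DMA : 0;
}
```

The ppc variant in the patch uses the inverse test (mask != 0xffffffff), but
the shape of the decision is the same.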
^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-30 23:11     ` [PATCH] generic device DMA (dma_pool update) David Brownell
@ 2002-12-31 15:00       ` James Bottomley
  2002-12-31 17:04         ` David Brownell
  2002-12-31 16:36       ` James Bottomley
  1 sibling, 1 reply; 44+ messages in thread
From: James Bottomley @ 2002-12-31 15:00 UTC (permalink / raw)
  To: David Brownell; +Cc: James Bottomley, linux-kernel

I think a much easier way of doing this would be a slight modification to
the current pci_pool code: we know it works already.  All that really has
to change is that it take a struct device instead of a pci_dev, and that it
call dma_alloc_coherent to get the source memory.  The rest of the pci pool
code is mainly memory management; it is well tested and so should remain
(much as I'd like to see a mempool implementation).
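The pci_pool machinery being kept is, at its core, a per-page free bitmap
over fixed-size blocks.  The following userspace sketch models just that
core; malloc() stands in for dma_alloc_coherent(), and every name and size
here is illustrative rather than the kernel's:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of the pci/pool.c block allocator: one "page" carved into
 * fixed-size blocks, tracked by a bitmap in which a set bit means "free"
 * (matching the memset(bitmap, 0xff, ...) in the driver).
 */
struct toy_pool {
	unsigned long	bitmap;		/* enough for a handful of blocks */
	size_t		block_size;
	size_t		blocks_per_page;
	char		*page;
};

static int toy_pool_init(struct toy_pool *p, size_t block_size, size_t nblocks)
{
	p->page = malloc(block_size * nblocks);	/* dma_alloc_coherent() here */
	if (!p->page)
		return -1;
	p->block_size = block_size;
	p->blocks_per_page = nblocks;
	p->bitmap = (1UL << nblocks) - 1;	/* all blocks start free */
	return 0;
}

static void *toy_pool_alloc(struct toy_pool *p)
{
	size_t i;

	for (i = 0; i < p->blocks_per_page; i++) {
		if (p->bitmap & (1UL << i)) {
			p->bitmap &= ~(1UL << i);	/* clear == in use */
			return p->page + i * p->block_size;
		}
	}
	return NULL;				/* page exhausted */
}

static void toy_pool_free(struct toy_pool *p, void *vaddr)
{
	size_t i = (size_t) ((char *) vaddr - p->page) / p->block_size;

	p->bitmap |= 1UL << i;			/* set == free again */
}
```

The real driver adds locking, multiple pages per pool, and boundary
handling, but the bookkeeping James wants to preserve is this bitmap scan.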

You have this in your code:

+void *
+dma_alloc (struct device *dev, size_t size, dma_addr_t *handle, int mem_flags)
+{
+	struct pci_dev	*pdev = to_pci_dev(dev);
+
+	BUG_ON(dev->bus != &pci_bus_type);
+
+	if (size >= PAGE_SIZE)
+		return pci_alloc_consistent (pdev, size, handle);

This would work if you just passed it to dma_alloc_coherent and dropped the
check for pci_bus_type.  (If the dma_ APIs are implemented in terms of the
pci_ ones, they will BUG out in the generic implementation of
dma_alloc_coherent, so there's no need to do this type of transform twice.)
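A minimal sketch of that simplification, with userspace stand-ins (the
struct device and dma_alloc_coherent() below are mock-ups, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t dma_addr_t;		/* userspace stand-in */
struct device { int unused; };		/* mock, not the kernel's */

/* Mock of dma_alloc_coherent(): in the kernel, this generic call is where
 * any bus-specific checking (and bounce-buffer or IOMMU work) belongs.
 */
static void *dma_alloc_coherent(struct device *dev, size_t size,
				dma_addr_t *handle)
{
	void *ret = malloc(size);

	(void) dev;
	if (ret)
		*handle = (dma_addr_t) ret;	/* fake bus address */
	return ret;
}

/* The simplified dma_alloc() being suggested: no pci_bus_type check and no
 * to_pci_dev() cast, just delegation to the generic coherent allocator.
 */
static void *dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
		       int mem_flags)
{
	(void) mem_flags;	/* SLAB_KERNEL vs SLAB_ATOMIC, unused in the mock */
	return dma_alloc_coherent(dev, size, handle);
}
```

The point is structural: the wrapper stays bus-agnostic, and any failure for
a non-DMA-capable device surfaces once, inside the coherent allocator.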

This:

+dma_alloc (struct device *dev, size_t size, dma_addr_t *handle, int mem_flags)
+{
+	void	*ret;
+
+	/* We rely on kmalloc memory being aligned to L1_CACHE_BYTES, to
+	 * prevent cacheline sharing during DMA.  With dma-incoherent caches,
+	 * such sharing causes bugs, not just cache-related slowdown.
+	 */
+	ret = kmalloc (size, mem_flags | __dma_slab_flags (dev));
+	if (likely (ret != 0))
+		*handle = virt_to_phys (ret);
+	return ret;

Just can't be done: you can't get consistent memory from kmalloc and you 
certainly can't use virt_to_phys and expect it to work on IOMMU hardware.
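A toy model of that second objection: with an IOMMU, the bus address a
device must use comes from a translation the IOMMU programs, and it need not
equal the CPU physical address.  Everything below (the window base, the
masking) is invented purely for illustration:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;		/* userspace stand-in */

/* Made-up IOMMU window base, purely for illustration. */
#define IOMMU_WINDOW_BASE 0x80000000ULL

/* Toy "IOMMU mapping": the address a device must use is a translated
 * window address, not the CPU physical address.  A real IOMMU would
 * program a translation entry; here we just remap the low bits.
 */
static dma_addr_t iommu_map(void *vaddr)
{
	return IOMMU_WINDOW_BASE + ((uintptr_t) vaddr & 0xfffff);
}

/* What relying on virt_to_phys() amounts to: assuming the CPU-visible
 * address is also a valid bus address (identity mapping, as on plain
 * x86 without an IOMMU).
 */
static dma_addr_t fake_virt_to_phys(void *vaddr)
{
	return (dma_addr_t) (uintptr_t) vaddr;
}
```

On this toy translation the two functions disagree for any address outside
the invented window, which is the failure mode being pointed at.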

James

* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-30 23:11     ` [PATCH] generic device DMA (dma_pool update) David Brownell
  2002-12-31 15:00       ` James Bottomley
@ 2002-12-31 16:36       ` James Bottomley
  2002-12-31 17:32         ` David Brownell
  1 sibling, 1 reply; 44+ messages in thread
From: James Bottomley @ 2002-12-31 16:36 UTC (permalink / raw)
  To: David Brownell; +Cc: James Bottomley, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 323 bytes --]

How about the attached as the basis for a generic coherent memory pool
implementation?  It basically generalizes pci/pool.c, and thus makes use
of well tested code.

Obviously, as a final tidy-up, pci/pool.c should probably be moved to
base/pool.c, with compile options for drivers that want it.

James


[-- Attachment #2: tmp.diff --]
[-- Type: text/plain , Size: 15048 bytes --]

diff -Nru a/drivers/base/core.c b/drivers/base/core.c
--- a/drivers/base/core.c	Tue Dec 31 10:33:58 2002
+++ b/drivers/base/core.c	Tue Dec 31 10:33:58 2002
@@ -146,6 +146,7 @@
 	INIT_LIST_HEAD(&dev->bus_list);
 	INIT_LIST_HEAD(&dev->class_list);
 	INIT_LIST_HEAD(&dev->intf_list);
+	INIT_LIST_HEAD(&dev->pools);
 }
 
 /**
diff -Nru a/drivers/pci/pool.c b/drivers/pci/pool.c
--- a/drivers/pci/pool.c	Tue Dec 31 10:33:58 2002
+++ b/drivers/pci/pool.c	Tue Dec 31 10:33:58 2002
@@ -1,6 +1,8 @@
-#include <linux/pci.h>
 #include <linux/slab.h>
 #include <linux/module.h>
+#include <linux/types.h>
+#include <linux/blkdev.h>
+#include <linux/dma-mapping.h>
 
 /*
  * Pool allocator ... wraps the pci_alloc_consistent page allocator, so
@@ -8,19 +10,19 @@
  * This should probably be sharing the guts of the slab allocator.
  */
 
-struct pci_pool {	/* the pool */
+struct dma_pool {	/* the pool */
 	struct list_head	page_list;
 	spinlock_t		lock;
 	size_t			blocks_per_page;
 	size_t			size;
-	struct pci_dev		*dev;
+	struct device		*dev;
 	size_t			allocation;
 	char			name [32];
 	wait_queue_head_t	waitq;
 	struct list_head	pools;
 };
 
-struct pci_page {	/* cacheable header for 'allocation' bytes */
+struct dma_page {	/* cacheable header for 'allocation' bytes */
 	struct list_head	page_list;
 	void			*vaddr;
 	dma_addr_t		dma;
@@ -36,7 +38,6 @@
 static ssize_t
 show_pools (struct device *dev, char *buf, size_t count, loff_t off)
 {
-	struct pci_dev		*pdev;
 	unsigned		temp, size;
 	char			*next;
 	struct list_head	*i, *j;
@@ -44,7 +45,6 @@
 	if (off != 0)
 		return 0;
 
-	pdev = container_of (dev, struct pci_dev, dev);
 	next = buf;
 	size = count;
 
@@ -53,16 +53,16 @@
 	next += temp;
 
 	down (&pools_lock);
-	list_for_each (i, &pdev->pools) {
-		struct pci_pool	*pool;
+	list_for_each (i, &dev->pools) {
+		struct dma_pool	*pool;
 		unsigned	pages = 0, blocks = 0;
 
-		pool = list_entry (i, struct pci_pool, pools);
+		pool = list_entry (i, struct dma_pool, pools);
 
 		list_for_each (j, &pool->page_list) {
-			struct pci_page	*page;
+			struct dma_page	*page;
 
-			page = list_entry (j, struct pci_page, page_list);
+			page = list_entry (j, struct dma_page, page_list);
 			pages++;
 			blocks += page->in_use;
 		}
@@ -82,31 +82,31 @@
 static DEVICE_ATTR (pools, S_IRUGO, show_pools, NULL);
 
 /**
- * pci_pool_create - Creates a pool of pci consistent memory blocks, for dma.
+ * dma_pool_create - Creates a pool of coherent memory blocks, for dma.
  * @name: name of pool, for diagnostics
- * @pdev: pci device that will be doing the DMA
+ * @dev: device that will be doing the DMA
  * @size: size of the blocks in this pool.
  * @align: alignment requirement for blocks; must be a power of two
  * @allocation: returned blocks won't cross this boundary (or zero)
  * Context: !in_interrupt()
  *
- * Returns a pci allocation pool with the requested characteristics, or
- * null if one can't be created.  Given one of these pools, pci_pool_alloc()
+ * Returns a dma allocation pool with the requested characteristics, or
+ * null if one can't be created.  Given one of these pools, dma_pool_alloc()
  * may be used to allocate memory.  Such memory will all have "consistent"
  * DMA mappings, accessible by the device and its driver without using
  * cache flushing primitives.  The actual size of blocks allocated may be
  * larger than requested because of alignment.
  *
- * If allocation is nonzero, objects returned from pci_pool_alloc() won't
+ * If allocation is nonzero, objects returned from dma_pool_alloc() won't
  * cross that size boundary.  This is useful for devices which have
  * addressing restrictions on individual DMA transfers, such as not crossing
  * boundaries of 4KBytes.
  */
-struct pci_pool *
-pci_pool_create (const char *name, struct pci_dev *pdev,
+struct dma_pool *
+dma_pool_create (const char *name, struct device *dev,
 	size_t size, size_t align, size_t allocation)
 {
-	struct pci_pool		*retval;
+	struct dma_pool		*retval;
 
 	if (align == 0)
 		align = 1;
@@ -134,7 +134,7 @@
 	strncpy (retval->name, name, sizeof retval->name);
 	retval->name [sizeof retval->name - 1] = 0;
 
-	retval->dev = pdev;
+	retval->dev = dev;
 
 	INIT_LIST_HEAD (&retval->page_list);
 	spin_lock_init (&retval->lock);
@@ -143,12 +143,12 @@
 	retval->blocks_per_page = allocation / size;
 	init_waitqueue_head (&retval->waitq);
 
-	if (pdev) {
+	if (dev) {
 		down (&pools_lock);
-		if (list_empty (&pdev->pools))
-			device_create_file (&pdev->dev, &dev_attr_pools);
+		if (list_empty (&dev->pools))
+			device_create_file (dev, &dev_attr_pools);
 		/* note:  not currently insisting "name" be unique */
-		list_add (&retval->pools, &pdev->pools);
+		list_add (&retval->pools, &dev->pools);
 		up (&pools_lock);
 	} else
 		INIT_LIST_HEAD (&retval->pools);
@@ -157,22 +157,22 @@
 }
 
 
-static struct pci_page *
-pool_alloc_page (struct pci_pool *pool, int mem_flags)
+static struct dma_page *
+pool_alloc_page (struct dma_pool *pool, int mem_flags)
 {
-	struct pci_page	*page;
+	struct dma_page	*page;
 	int		mapsize;
 
 	mapsize = pool->blocks_per_page;
 	mapsize = (mapsize + BITS_PER_LONG - 1) / BITS_PER_LONG;
 	mapsize *= sizeof (long);
 
-	page = (struct pci_page *) kmalloc (mapsize + sizeof *page, mem_flags);
+	page = (struct dma_page *) kmalloc (mapsize + sizeof *page, mem_flags);
 	if (!page)
 		return 0;
-	page->vaddr = pci_alloc_consistent (pool->dev,
-					    pool->allocation,
-					    &page->dma);
+	page->vaddr = dma_alloc_coherent (pool->dev,
+					  pool->allocation,
+					  &page->dma);
 	if (page->vaddr) {
 		memset (page->bitmap, 0xff, mapsize);	// bit set == free
 #ifdef	CONFIG_DEBUG_SLAB
@@ -200,43 +200,43 @@
 }
 
 static void
-pool_free_page (struct pci_pool *pool, struct pci_page *page)
+pool_free_page (struct dma_pool *pool, struct dma_page *page)
 {
 	dma_addr_t	dma = page->dma;
 
 #ifdef	CONFIG_DEBUG_SLAB
 	memset (page->vaddr, POOL_POISON_BYTE, pool->allocation);
 #endif
-	pci_free_consistent (pool->dev, pool->allocation, page->vaddr, dma);
+	dma_free_coherent (pool->dev, pool->allocation, page->vaddr, dma);
 	list_del (&page->page_list);
 	kfree (page);
 }
 
 
 /**
- * pci_pool_destroy - destroys a pool of pci memory blocks.
- * @pool: pci pool that will be destroyed
+ * dma_pool_destroy - destroys a pool of dma memory blocks.
+ * @pool: dma pool that will be destroyed
  * Context: !in_interrupt()
  *
  * Caller guarantees that no more memory from the pool is in use,
  * and that nothing will try to use the pool after this call.
  */
 void
-pci_pool_destroy (struct pci_pool *pool)
+dma_pool_destroy (struct dma_pool *pool)
 {
 	down (&pools_lock);
 	list_del (&pool->pools);
 	if (pool->dev && list_empty (&pool->dev->pools))
-		device_remove_file (&pool->dev->dev, &dev_attr_pools);
+		device_remove_file (pool->dev, &dev_attr_pools);
 	up (&pools_lock);
 
 	while (!list_empty (&pool->page_list)) {
-		struct pci_page		*page;
+		struct dma_page		*page;
 		page = list_entry (pool->page_list.next,
-				struct pci_page, page_list);
+				struct dma_page, page_list);
 		if (is_page_busy (pool->blocks_per_page, page->bitmap)) {
-			printk (KERN_ERR "pci_pool_destroy %s/%s, %p busy\n",
-				pool->dev ? pool->dev->slot_name : NULL,
+			printk (KERN_ERR "dma_pool_destroy %s/%s, %p busy\n",
+				pool->dev ? pool->dev->name : NULL,
 				pool->name, page->vaddr);
 			/* leak the still-in-use consistent memory */
 			list_del (&page->page_list);
@@ -250,8 +250,8 @@
 
 
 /**
- * pci_pool_alloc - get a block of consistent memory
- * @pool: pci pool that will produce the block
+ * dma_pool_alloc - get a block of consistent memory
+ * @pool: dma pool that will produce the block
  * @mem_flags: SLAB_KERNEL or SLAB_ATOMIC
  * @handle: pointer to dma address of block
  *
@@ -260,11 +260,11 @@
  * If such a memory block can't be allocated, null is returned.
  */
 void *
-pci_pool_alloc (struct pci_pool *pool, int mem_flags, dma_addr_t *handle)
+dma_pool_alloc (struct dma_pool *pool, int mem_flags, dma_addr_t *handle)
 {
 	unsigned long		flags;
 	struct list_head	*entry;
-	struct pci_page		*page;
+	struct dma_page		*page;
 	int			map, block;
 	size_t			offset;
 	void			*retval;
@@ -273,7 +273,7 @@
 	spin_lock_irqsave (&pool->lock, flags);
 	list_for_each (entry, &pool->page_list) {
 		int		i;
-		page = list_entry (entry, struct pci_page, page_list);
+		page = list_entry (entry, struct dma_page, page_list);
 		/* only cachable accesses here ... */
 		for (map = 0, i = 0;
 				i < pool->blocks_per_page;
@@ -319,16 +319,16 @@
 }
 
 
-static struct pci_page *
-pool_find_page (struct pci_pool *pool, dma_addr_t dma)
+static struct dma_page *
+pool_find_page (struct dma_pool *pool, dma_addr_t dma)
 {
 	unsigned long		flags;
 	struct list_head	*entry;
-	struct pci_page		*page;
+	struct dma_page		*page;
 
 	spin_lock_irqsave (&pool->lock, flags);
 	list_for_each (entry, &pool->page_list) {
-		page = list_entry (entry, struct pci_page, page_list);
+		page = list_entry (entry, struct dma_page, page_list);
 		if (dma < page->dma)
 			continue;
 		if (dma < (page->dma + pool->allocation))
@@ -342,8 +342,8 @@
 
 
 /**
- * pci_pool_free - put block back into pci pool
- * @pool: the pci pool holding the block
+ * dma_pool_free - put block back into dma pool
+ * @pool: the dma pool holding the block
  * @vaddr: virtual address of block
  * @dma: dma address of block
  *
@@ -351,15 +351,15 @@
  * unless it is first re-allocated.
  */
 void
-pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t dma)
+dma_pool_free (struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 {
-	struct pci_page		*page;
+	struct dma_page		*page;
 	unsigned long		flags;
 	int			map, block;
 
 	if ((page = pool_find_page (pool, dma)) == 0) {
-		printk (KERN_ERR "pci_pool_free %s/%s, %p/%lx (bad dma)\n",
-			pool->dev ? pool->dev->slot_name : NULL,
+		printk (KERN_ERR "dma_pool_free %s/%s, %p/%lx (bad dma)\n",
+			pool->dev ? pool->dev->name : NULL,
 			pool->name, vaddr, (unsigned long) dma);
 		return;
 	}
@@ -371,13 +371,13 @@
 
 #ifdef	CONFIG_DEBUG_SLAB
 	if (((dma - page->dma) + (void *)page->vaddr) != vaddr) {
-		printk (KERN_ERR "pci_pool_free %s/%s, %p (bad vaddr)/%Lx\n",
+		printk (KERN_ERR "dma_pool_free %s/%s, %p (bad vaddr)/%Lx\n",
-			pool->dev ? pool->dev->slot_name : NULL,
+			pool->dev ? pool->dev->name : NULL,
 			pool->name, vaddr, (unsigned long long) dma);
 		return;
 	}
 	if (page->bitmap [map] & (1UL << block)) {
-		printk (KERN_ERR "pci_pool_free %s/%s, dma %Lx already free\n",
+		printk (KERN_ERR "dma_pool_free %s/%s, dma %Lx already free\n",
-			pool->dev ? pool->dev->slot_name : NULL,
+			pool->dev ? pool->dev->name : NULL,
 			pool->name, (unsigned long long)dma);
 		return;
@@ -399,7 +399,7 @@
 }
 
 
-EXPORT_SYMBOL (pci_pool_create);
-EXPORT_SYMBOL (pci_pool_destroy);
-EXPORT_SYMBOL (pci_pool_alloc);
-EXPORT_SYMBOL (pci_pool_free);
+EXPORT_SYMBOL (dma_pool_create);
+EXPORT_SYMBOL (dma_pool_destroy);
+EXPORT_SYMBOL (dma_pool_alloc);
+EXPORT_SYMBOL (dma_pool_free);
diff -Nru a/drivers/pci/probe.c b/drivers/pci/probe.c
--- a/drivers/pci/probe.c	Tue Dec 31 10:33:58 2002
+++ b/drivers/pci/probe.c	Tue Dec 31 10:33:58 2002
@@ -353,8 +353,6 @@
 
 	sprintf(dev->slot_name, "%02x:%02x.%d", dev->bus->number, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
 	sprintf(dev->dev.name, "PCI device %04x:%04x", dev->vendor, dev->device);
-	INIT_LIST_HEAD(&dev->pools);
-	
 	pci_read_config_dword(dev, PCI_CLASS_REVISION, &class);
 	class >>= 8;				    /* upper 3 bytes */
 	dev->class = class;
diff -Nru a/include/linux/device.h b/include/linux/device.h
--- a/include/linux/device.h	Tue Dec 31 10:33:58 2002
+++ b/include/linux/device.h	Tue Dec 31 10:33:58 2002
@@ -256,6 +256,8 @@
 	struct list_head driver_list;
 	struct list_head children;
 	struct list_head intf_list;
+	struct list_head pools;		/* dma_pools tied to this device */
+
 	struct device 	* parent;
 
 	struct kobject kobj;
diff -Nru a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
--- a/include/linux/dma-mapping.h	Tue Dec 31 10:33:58 2002
+++ b/include/linux/dma-mapping.h	Tue Dec 31 10:33:58 2002
@@ -1,6 +1,8 @@
 #ifndef _ASM_LINUX_DMA_MAPPING_H
 #define _ASM_LINUX_DMA_MAPPING_H
 
+#include <linux/dma-pool.h>
+
 /* These definitions mirror those in pci.h, so they can be used
  * interchangeably with their PCI_ counterparts */
 enum dma_data_direction {
diff -Nru a/include/linux/dma-pool.h b/include/linux/dma-pool.h
--- /dev/null	Wed Dec 31 16:00:00 1969
+++ b/include/linux/dma-pool.h	Tue Dec 31 10:33:58 2002
@@ -0,0 +1,15 @@
+#ifndef _LINUX_DMA_POOL_H
+#define _LINUX_DMA_POOL_H
+
+#include <linux/device.h>
+
+/* dma_pool is an opaque structure pointer */
+struct dma_pool;
+
+struct dma_pool *dma_pool_create(const char *name, struct device *dev,
+				 size_t size, size_t align, size_t allocation);
+void dma_pool_destroy(struct dma_pool *pool);
+void *dma_pool_alloc(struct dma_pool *pool, int flags, dma_addr_t *handle);
+void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t addr);
+
+#endif
diff -Nru a/include/linux/pci.h b/include/linux/pci.h
--- a/include/linux/pci.h	Tue Dec 31 10:33:58 2002
+++ b/include/linux/pci.h	Tue Dec 31 10:33:58 2002
@@ -388,8 +388,6 @@
 					   0xffffffff.  You only need to change
 					   this if your device has broken DMA
 					   or supports 64-bit transfers.  */
-	struct list_head pools;		/* pci_pools tied to this device */
-
 	u32             current_state;  /* Current operating state. In ACPI-speak,
 					   this is D0-D3, D0 being fully functional,
 					   and D3 being off. */
@@ -658,14 +656,38 @@
 unsigned int pci_do_scan_bus(struct pci_bus *bus);
 struct pci_bus * pci_add_new_bus(struct pci_bus *parent, struct pci_dev *dev, int busnr);
 int pci_scan_bridge(struct pci_bus *bus, struct pci_dev * dev, int max, int pass);
+#include <linux/dma-pool.h>
 
 /* kmem_cache style wrapper around pci_alloc_consistent() */
-struct pci_pool *pci_pool_create (const char *name, struct pci_dev *dev,
-		size_t size, size_t align, size_t allocation);
-void pci_pool_destroy (struct pci_pool *pool);
 
-void *pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle);
-void pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr);
+/* struct pci_pool is an alias for struct dma_pool.  However, its contents
+ * are never exposed so just declare it here */
+struct pci_pool;
+
+static inline struct pci_pool *
+pci_pool_create (const char *name, struct pci_dev *dev,
+		size_t size, size_t align, size_t allocation)
+{
+	return (struct pci_pool *)dma_pool_create(name, &dev->dev, size, align, allocation);
+}
+
+static inline void
+pci_pool_destroy(struct pci_pool *pool)
+{
+	dma_pool_destroy((struct dma_pool *)pool);
+}
+
+static inline void *
+pci_pool_alloc (struct pci_pool *pool, int flags, dma_addr_t *handle)
+{
+	return dma_pool_alloc((struct dma_pool *)pool, flags, handle);
+}
+
+static inline void
+pci_pool_free (struct pci_pool *pool, void *vaddr, dma_addr_t addr)
+{
+	dma_pool_free((struct dma_pool *)pool, vaddr, addr);
+}
 
 #if defined(CONFIG_ISA) || defined(CONFIG_EISA)
 extern struct pci_dev *isa_bridge;

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-31 15:00       ` James Bottomley
@ 2002-12-31 17:04         ` David Brownell
  2002-12-31 17:23           ` James Bottomley
  0 siblings, 1 reply; 44+ messages in thread
From: David Brownell @ 2002-12-31 17:04 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel

James Bottomley wrote:
> I think a much easier way of doing this would be a slight modification to the 
> current pci_pool code: we know it works already.  All that really has to 

The same is true of the slab code, which is a better allocator.  Why should
kernels require an extra allocator, when pci_pool can safely be eliminated on
quite a few systems?


> change is that it should take a struct device instead of a pci_dev and it 
> should call dma_alloc_coherent to get the source memory.  The rest of the pci 
> pool code is mainly about memory management and is well tested and so should 
> remain (much as I'd like to see a mempool implementation).

In that patch, the pci_pool code _does_ remain ... only for use on platforms
where __get_free_pages() memory, used by the slab allocator, can't be used
(at least without a lot of setup work).

You finally got me to look at the mempool code.  I didn't notice any
code to actually allocate 1/Nth page units; that's delegated to some
other code.  (Except for that "Raced" buglet, where it doesn't delegate,
and uses kfree instead of calling pool->free().)  So that's another reason
it'd not be appropriate to try to use it in this area:  it doesn't solve
the core problem, which is non-wasteful allocation of 1/Nth page units.
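The core problem -- carving a page into 1/Nth-page blocks with a free bitmap, which is what pci_pool does -- can be modelled in a few lines of userspace C.  This is a toy sketch with illustrative names and sizes, not the kernel API; the only thing it shares with pool_alloc_page() is the "bit set == free" convention:

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model of the 1/Nth-page problem: one "page" carved
 * into equal blocks, tracked by a bitmap where a set bit means free
 * (the same convention pool_alloc_page() uses).  PAGE/BLOCK sizes
 * and all names here are illustrative only. */
#define PAGE   4096
#define BLOCK  256
#define BLOCKS (PAGE / BLOCK)              /* 16 blocks per page */

static unsigned char page[PAGE];
static unsigned long bitmap = (1UL << BLOCKS) - 1;    /* all free */

static void *toy_alloc(void)
{
    for (int i = 0; i < BLOCKS; i++) {
        if (bitmap & (1UL << i)) {
            bitmap &= ~(1UL << i);         /* clear bit == in use */
            return page + i * BLOCK;
        }
    }
    return NULL;                           /* page exhausted */
}

static void toy_free(void *vaddr)
{
    int i = (int)(((unsigned char *)vaddr - page) / BLOCK);

    assert(i >= 0 && i < BLOCKS);
    assert(!(bitmap & (1UL << i)));        /* catch double free */
    bitmap |= 1UL << i;
}
```

Neither slab (page-and-above backing store, fixed object sizes) nor mempool (reservation only, allocation delegated) replaces this bitmap bookkeeping; something has to do it.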


> +	BUG_ON(dev->bus != &pci_bus_type);
> +
> +	if (size >= PAGE_SIZE)
> +		return pci_alloc_consistent (pdev, size, handle);
> 
> This would work if just passed to dma_alloc_coherent and drop the check for 
> the pci_bus_type 

True ... though that wouldn't be true on the other branch, where it'd
be choosing some pci_pool based on size (when it gets a way to find a
per-device array of pools).  Do you know for a fact that GCC isn't
smart enough to get rid of that duplicate check?  (I'm curious.)


> +dma_alloc (struct device *dev, size_t size, dma_addr_t *handle, int mem_flags)
> +{
> +	void	*ret;
> +
> +	/* We rely on kmalloc memory being aligned to L1_CACHE_BYTES, to
> +	 * prevent cacheline sharing during DMA.  With dma-incoherent caches,
> +	 * such sharing causes bugs, not just cache-related slowdown.
> +	 */
> +	ret = kmalloc (size, mem_flags | __dma_slab_flags (dev));
> +	if (likely (ret != 0))
> +		*handle = virt_to_phys (ret);
> +	return ret;
> 
> Just can't be done: you can't get consistent memory from kmalloc and you 
> certainly can't use virt_to_phys and expect it to work on IOMMU hardware.

Actually it _can_ be done on x86 and several other common platforms.  (But
"coherent" memory was the agreed terminology change, yes?)  Which is where
that particular code kicks in ... x86, some ppc, sa1111, and uml for now.
No IOMMUs there.

Any system where pci_alloc_consistent() is calling  __get_free_pages(), and
doing no further setup beyond a virt_to_phys() call, is a candidate for having
much simpler allocation of I/O memory.  I could even imagine that being true on
some systems where an "IOMMU" just accelerates I/O (prefetch, cache bypass, etc)
for streaming mappings, and isn't strictly needed otherwise.

- Dave


> James






* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-31 17:04         ` David Brownell
@ 2002-12-31 17:23           ` James Bottomley
  2002-12-31 18:11             ` David Brownell
  0 siblings, 1 reply; 44+ messages in thread
From: James Bottomley @ 2002-12-31 17:23 UTC (permalink / raw)
  To: David Brownell; +Cc: James Bottomley, linux-kernel

david-b@pacbell.net said:
> The same is true of the slab code, which is a better allocator.  Why
> should kernels require an extra allocator, when pci_pool can safely be
> eliminated on quite a few systems? 

I agree in principle.  But the way to get this to work is to allow the slab 
allocator to specify its getpages and freepages in the same way as ctor and 
dtor, so it can be fed by dma_alloc_coherent.  Then you can wrapper the slab 
allocator to be used across all platforms (rather than only those which are 
coherent and have no IOMMU).

The benefit mempool has over slab is its guaranteed minimum pool size and 
hence guaranteed non-failing allocations as long as you manage pool resizing 
correctly.  I suppose this is primarily useful for storage devices, which 
sometimes need memory at a point where it cannot be reclaimed by doing more I/O.

However, you can base mempool off slab modified to use dma_alloc_coherent() 
and get the benefits of everything.

James




* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-31 16:36       ` James Bottomley
@ 2002-12-31 17:32         ` David Brownell
  0 siblings, 0 replies; 44+ messages in thread
From: David Brownell @ 2002-12-31 17:32 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel

James Bottomley wrote:
> How about the attached as the basis for a generic coherent memory pool 
> implementation.  It basically leverages pci/pool.c to be more generic, and 
> thus makes use of well tested code.

I'd still rather have configuration completely eliminate that particular
pool allocator on the platforms (most, by volume) that don't need it,
in favor of the slab code ... which is not only well-tested, but also
has had a fair amount of cross-platform performance work done on it.


> Obviously, as a final tidy up, pci/pool.c should probably be moved to 
> base/pool.c with compile options for drivers that want it.

I'd have no problems with making it even more generic, and moving it.

Though "compile options" doesn't sound right, unless you mean letting
arch-specific code choose whether to use that or the slab allocator.

- Dave




* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-31 17:23           ` James Bottomley
@ 2002-12-31 18:11             ` David Brownell
  2002-12-31 18:44               ` James Bottomley
  0 siblings, 1 reply; 44+ messages in thread
From: David Brownell @ 2002-12-31 18:11 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel

James Bottomley wrote:
> David Brownell said:
> 
>>The same is true of the slab code, which is a better allocator.  Why
>>should kernels require an extra allocator, when pci_pool can safely be
>>eliminated on quite a few systems? 
> 
> 
> I agree in principle.  But the way to get this to work is to allow the slab 
> allocater to specify its getpages and freepages in the same way as ctor and 
> dtor, so it can be fed by dma_alloc_coherent.  Then you can wrapper the slab 
> allocator to be used across all platforms (rather than only those which are 
> coherent and have no IOMMU).

Nobody's done that yet, and it's been a couple years now since that point
was made as a desirable evolution path for pci_pool.  In fact the first
(only!) step on that path was the dma_pool update I just posted.

Modifying the slab allocator like that means passing extra parameters around
(for dma_addr_t), causing extra register pressure even for non-dma allocations.
I have a hard time seeing how that would not slow things down, even on x86 (etc) where
we know we can already get all the benefits of the slab allocator without any
changes to that critical code.


> The benefit mempool has over slab is its guaranteed minimum pool size and 
> hence guaranteed non-failing allocations as long as you manage pool resizing 
> correctly.  I suppose this is primarily useful for storage devices, which 
> sometimes need memory that cannot be obtained from doing I/O.

Yes:  not all drivers need (or want) such a memory-reservation layer.


> However, you can base mempool off slab modified to use dma_alloc_coherent() 
> and get the benefits of everything.

I'd say instead that drivers which happen to need the protection against
failures that mempool provides can safely layer that behavior on top of
any allocator they want.  (Modulo that buglet, which requires them to be
layered on top of kmalloc.)

Which means that the mempool discussion is just a detour, it's not an
implementation tool for any kind of dma_pool but is instead an option
for drivers (in block i/o paths) to use on top of a dma_pool allocator.

- Dave




* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-31 18:11             ` David Brownell
@ 2002-12-31 18:44               ` James Bottomley
  2002-12-31 19:29                 ` David Brownell
  0 siblings, 1 reply; 44+ messages in thread
From: James Bottomley @ 2002-12-31 18:44 UTC (permalink / raw)
  To: David Brownell; +Cc: James Bottomley, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1220 bytes --]

david-b@pacbell.net said:
> Nobody's done that yet, and it's been a couple years now since that
> point was made as a desirable evolution path for pci_pool.  In fact
> the first (only!) step on that path was the dma_pool update I just
> posted. 

> Modifying the slab allocator like that means passing extra parameters
> around (for dma_addr_t), causing extra register pressure even for
> non-dma allocations. I have a hard time seeing that not slow things
> down, even on x86 (etc) where we know we can already get all the
> benefits of the slab allocator without any changes to that critical
> code. 

I think the attached should do the necessary with no slowdown in the slab 
allocator.

Now all that has to happen for use with the dma pools is to wrapper 
dma_alloc/free_coherent().

Note that the semantics won't be quite the same as the pci_pool one since you 
can't guarantee that allocations don't cross the particular boundary 
parameter.  There's also going to be a reliance on the concept that the dma 
coherent allocators will return a page bounded chunk of memory.  Most seem 
already to be doing this because the page properties are only controllable at 
that level, so it shouldn't be a problem.

James


[-- Attachment #2: tmp.diff --]
[-- Type: text/plain , Size: 4627 bytes --]

===== mm/slab.c 1.52 vs edited =====
--- 1.52/mm/slab.c	Sat Dec 21 10:24:34 2002
+++ edited/mm/slab.c	Tue Dec 31 12:35:39 2002
@@ -260,6 +260,12 @@
 	/* de-constructor func */
 	void (*dtor)(void *, kmem_cache_t *, unsigned long);
 
+	/* page based allocator */
+	unsigned long FASTCALL((*getpages)(unsigned int gfp_mask, unsigned int order));
+
+	/* release pages allocated above */
+	void FASTCALL((*freepages)(unsigned long addr, unsigned int order));
+
 /* 4) cache creation/removal */
 	const char		*name;
 	struct list_head	next;
@@ -715,7 +721,7 @@
 	 * would be relatively rare and ignorable.
 	 */
 	flags |= cachep->gfpflags;
-	addr = (void*) __get_free_pages(flags, cachep->gfporder);
+	addr = (void*) cachep->getpages(flags, cachep->gfporder);
 	/* Assume that now we have the pages no one else can legally
 	 * messes with the 'struct page's.
 	 * However vm_scan() might try to test the structure to see if
@@ -741,7 +747,7 @@
 		dec_page_state(nr_slab);
 		page++;
 	}
-	free_pages((unsigned long)addr, cachep->gfporder);
+	cachep->freepages((unsigned long)addr, cachep->gfporder);
 }
 
 #if DEBUG
@@ -811,13 +817,15 @@
 }
 
 /**
- * kmem_cache_create - Create a cache.
+ * kmem_cache_create_with_alloc - Create a cache with a custom allocator.
  * @name: A string which is used in /proc/slabinfo to identify this cache.
  * @size: The size of objects to be created in this cache.
  * @offset: The offset to use within the page.
  * @flags: SLAB flags
  * @ctor: A constructor for the objects.
  * @dtor: A destructor for the objects.
+ * @getpages: A function to get the memory pages
+ * @freepages: A function to free the corresponding pages
  *
  * Returns a ptr to the cache on success, NULL on failure.
  * Cannot be called within a int, but can be interrupted.
@@ -844,9 +852,12 @@
  * as davem.
  */
 kmem_cache_t *
-kmem_cache_create (const char *name, size_t size, size_t offset,
-	unsigned long flags, void (*ctor)(void*, kmem_cache_t *, unsigned long),
-	void (*dtor)(void*, kmem_cache_t *, unsigned long))
+kmem_cache_create_with_alloc (const char *name, size_t size, size_t offset,
+	unsigned long flags,
+	void (*ctor)(void*, kmem_cache_t *, unsigned long),
+	void (*dtor)(void*, kmem_cache_t *, unsigned long),
+	unsigned long FASTCALL((*getpages)(unsigned int gfp_mask, unsigned int order)),
+	void FASTCALL((*freepages)(unsigned long addr, unsigned int order)))
 {
 	const char *func_nm = KERN_ERR "kmem_create: ";
 	size_t left_over, align, slab_size;
@@ -1010,6 +1021,9 @@
 		cachep->slabp_cache = kmem_find_general_cachep(slab_size,0);
 	cachep->ctor = ctor;
 	cachep->dtor = dtor;
+	BUG_ON(getpages == NULL || freepages == NULL);
+	cachep->getpages = getpages;
+	cachep->freepages = freepages;
 	cachep->name = name;
 
 	if (g_cpucache_up == FULL) {
@@ -1072,6 +1086,50 @@
 	up(&cache_chain_sem);
 opps:
 	return cachep;
+}
+
+/**
+ * kmem_cache_create - Create a cache using the default page allocator.
+ * @name: A string which is used in /proc/slabinfo to identify this cache.
+ * @size: The size of objects to be created in this cache.
+ * @offset: The offset to use within the page.
+ * @flags: SLAB flags
+ * @ctor: A constructor for the objects.
+ * @dtor: A destructor for the objects.
+ *
+ * Returns a ptr to the cache on success, NULL on failure.
+ * Cannot be called within a int, but can be interrupted.
+ * The @ctor is run when new pages are allocated by the cache
+ * and the @dtor is run before the pages are handed back.
+ *
+ * @name must be valid until the cache is destroyed. This implies that
+ * the module calling this has to destroy the cache before getting 
+ * unloaded.
+ * 
+ * The flags are
+ *
+ * %SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5)
+ * to catch references to uninitialised memory.
+ *
+ * %SLAB_RED_ZONE - Insert `Red' zones around the allocated memory to check
+ * for buffer overruns.
+ *
+ * %SLAB_NO_REAP - Don't automatically reap this cache when we're under
+ * memory pressure.
+ *
+ * %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
+ * cacheline.  This can be beneficial if you're counting cycles as closely
+ * as davem.
+ */
+kmem_cache_t *
+kmem_cache_create (const char *name, size_t size, size_t offset,
+	unsigned long flags,
+	void (*ctor)(void*, kmem_cache_t *, unsigned long),
+	void (*dtor)(void*, kmem_cache_t *, unsigned long))
+{
+	return kmem_cache_create_with_alloc(name, size, offset, flags, ctor, dtor, __get_free_pages, free_pages);
 }
 
 static inline void check_irq_off(void)


* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-31 18:44               ` James Bottomley
@ 2002-12-31 19:29                 ` David Brownell
  2002-12-31 19:50                   ` James Bottomley
  0 siblings, 1 reply; 44+ messages in thread
From: David Brownell @ 2002-12-31 19:29 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel


> I think the attached should do the necessary with no slow down in the slab 
> allocator.
> 
> Now all that has to happen for use with the dma pools is to wrapper 
> dma_alloc/free_coherent().

You didn't make anything store or return the dma_addr_t ... that was the
issue I was referring to, it's got to be either (a) passed up from the
very lowest level, like the pci_*() calls assume, or else (b) cheaply
derived from the virtual address.  My patch added slab support in common
cases where (b) applies.

(By the way -- I'm glad we seem to be agreeing on the notion that we
should have a dma_pool!  Is that also true of kmalloc/kfree analogues?)


> Note that the semantics won't be quite the same as the pci_pool one since you 
> can't guarantee that allocations don't cross the particular boundary 
> parameter.

Go back and look at the dma_pool patch I posted, which shows how to
handle that using extra padding.  Only one driver needed that, and
it looks like maybe its hardware spec changed so that's not an issue.
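The padding trick amounts to rounding the block size up to a power of two.  The helper below is hypothetical (not from the posted patch) and assumes the pool's pages are themselves aligned to the boundary; under that assumption, back-to-back blocks of the padded size can never straddle it:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper sketching the padding idea: round the block
 * size up to a power of two.  If the pool's pages are aligned to the
 * boundary, and the padded size is a power of two no larger than the
 * boundary, then blocks laid out back to back cannot straddle it. */
static size_t pad_for_boundary(size_t size, size_t boundary)
{
    size_t padded = 1;

    while (padded < size)
        padded <<= 1;
    assert(padded <= boundary);            /* otherwise unsatisfiable */
    return padded;
}
```

So the caller pays a little memory per block instead of passing a boundary parameter to the pool.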

That's why the signature of dma_pool_create() is simpler than the
one for pci_pool_create() ... I never liked it before, and now I'm
fully convinced it _can_ be simpler.


> There's also going to be a reliance on the concept that the dma
> coherent allocators will return a page bounded chunk of memory.  Most seem 
> already to be doing this because the page properties are only controllable at 
> that level, so it shouldn't be a problem.

I think that's a desirable way to specify dma_alloc_coherent(), as handling
page-and-above.  The same layering as "normal" memory allocation.  (In fact,
that's how I had updated the docs.)

- Dave



* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-31 19:29                 ` David Brownell
@ 2002-12-31 19:50                   ` James Bottomley
  2002-12-31 21:17                     ` David Brownell
  0 siblings, 1 reply; 44+ messages in thread
From: James Bottomley @ 2002-12-31 19:50 UTC (permalink / raw)
  To: David Brownell; +Cc: James Bottomley, linux-kernel

david-b@pacbell.net said:
> You didn't make anything store or return the dma_addr_t ... that was
> the issue I was referring to, it's got to be either (a) passed up from
> the very lowest level, like the pci_*() calls assume, or else (b)
> cheaply derived from the virtual address.  My patch added slab support
> in common cases where (b) applies. 

That's fairly simply done as part of the wrappers: The allocator stores the 
vaddr, paddr and size in a hash table.  Thus, the paddr can be deduced when 
kmem_cache_alloc is called by the allocation wrapper using the linearity 
property.

I've got to say though that the most sensible course of action is still to 
generalise pci_pool, which can be done easily and safely.  I think replacing 
it with a slab based scheme is probably a 2.7 thing.

James






* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-31 19:50                   ` James Bottomley
@ 2002-12-31 21:17                     ` David Brownell
  0 siblings, 0 replies; 44+ messages in thread
From: David Brownell @ 2002-12-31 21:17 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel

James Bottomley wrote:
> david-b@pacbell.net said:
> 
>>You didn't make anything store or return the dma_addr_t ... that was
>>the issue I was referring to, it's got to be either (a) passed up from
>>the very lowest level, like the pci_*() calls assume, or else (b)
>>cheaply derived from the virtual address.  My patch added slab support
>>in common cases where (b) applies. 
> 
> 
> That's fairly simply done as part of the wrappers: The allocator stores the 
> vaddr, paddr and size in a hash table.  Thus, the paddr can be deduced when 
> kmem_cache_alloc is called by the allocation wrapper using the linearity 
> property.

However it'd be done, it'd be an essential part, and it was missing.  In fact,
your getpages() didn't have a way to return the dma_addr_t values, and your
freepages() didn't provide it as an input.  (But it did pass mem_flags in, as
I had at some point suggested should be done with dma_alloc_coherent.)


> I've got to say though that the most sensible course of action is still to 
> generalise pci_pool, which can be done easily and safely.  I think replacing 
> it with a slab based scheme is probably a 2.7 thing.

I think the patch I posted "easily and safely" does that, appropriate for 2.5,
even though some platforms can't yet collect that win.

I'd agree that morphing drivers/pci/pool.c into drivers/base/pool.c (or whatever)
would be appropriate, but I haven't heard any arguments to justify using that
allocator on systems where the slab code can be used (more cheaply).

- Dave






* Re: [PATCH] generic device DMA (dma_pool update)
@ 2002-12-31 22:02 Adam J. Richter
  2002-12-31 22:41 ` Andrew Morton
  2002-12-31 23:35 ` David Brownell
  0 siblings, 2 replies; 44+ messages in thread
From: Adam J. Richter @ 2002-12-31 22:02 UTC (permalink / raw)
  To: david-b, James.Bottomley, linux-kernel

David Brownell wrote:

>struct dma_pool *dma_pool_create(char *, struct device *, size_t)
>void dma_pool_destroy (struct dma_pool *pool)
>void *dma_pool_alloc(struct dma_pool *, int mem_flags, dma_addr_t *)
>void dma_pool_free(struct dma_pool *, void *, dma_addr_t)

	I would like to be able to have failure-free, deadlock-free
blocking memory allocation, such as we have with the non-DMA mempool
library so that we can guarantee that drivers that have been
successfully initialized will continue to work regardless of memory
pressure, and reduce error branches that drivers have to deal with.

	Such a facility could be layered on top of your interface
perhaps by extending the mempool code to pass an extra parameter
around.  If so, then you should think about arranging your interface
so that it could be driven with as little glue as possible by mempool.

	I think that the term "pool" is more descriptively used by
mempool and more misleadingly used by the pci_pool code, as there is
no guaranteed pool being reserved in the pci_pool code.  Alas, I don't
have a good alternative term to suggest at the moment.

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."


* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-31 22:02 [PATCH] generic device DMA (dma_pool update) Adam J. Richter
@ 2002-12-31 22:41 ` Andrew Morton
  2002-12-31 23:23   ` David Brownell
  2003-01-01 17:10   ` James Bottomley
  2002-12-31 23:35 ` David Brownell
  1 sibling, 2 replies; 44+ messages in thread
From: Andrew Morton @ 2002-12-31 22:41 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: david-b, James.Bottomley, linux-kernel

"Adam J. Richter" wrote:
> 
> David Brownell wrote:
> 
> >struct dma_pool *dma_pool_create(char *, struct device *, size_t)
> >void dma_pool_destroy (struct dma_pool *pool)
> >void *dma_pool_alloc(struct dma_pool *, int mem_flags, dma_addr_t *)
> >void dma_pool_free(struct dma_pool *, void *, dma_addr_t)
> 
>         I would like to be able to have failure-free, deadlock-free
> blocking memory allocation, such as we have with the non-DMA mempool
> library so that we can guarantee that drivers that have been
> successfully initialized will continue to work regardless of memory
> pressure, and reduce error branches that drivers have to deal with.
> 
>         Such a facility could be layered on top of your interface
> perhaps by extending the mempool code to pass an extra parameter
> around.  If so, then you should think about arranging your interface
> so that it could be driven with as little glue as possible by mempool.
> 

What is that parameter?  The size, I assume.   To do that you'd
need to create different pools for different allocation sizes.

Bear in mind that mempool only makes sense with memory objects
which have the special characteristic that "if you wait long enough,
someone will free one".  ie: BIOs, nfs requests, etc.   Probably,
DMA buffers fit into that picture as well.
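The reserve behaviour Andrew is referring to can be modelled in a few lines of userspace C.  This is only a sketch of the idea, not the kernel's mempool: `tiny_mempool` and `td_alloc` are invented names, malloc stands in for the underlying allocator (dma_pool_alloc in this discussion), and the real mempool would sleep and retry rather than return NULL.

```c
#include <assert.h>
#include <stdlib.h>

#define MIN_NR 4	/* guaranteed number of reserved elements */

struct tiny_mempool {
	void *reserve[MIN_NR];
	int   nr_free;		/* elements currently held in reserve */
	void *(*alloc)(void);	/* underlying allocator */
};

static void *td_alloc(void) { return malloc(64); }  /* dma_pool stand-in */

static struct tiny_mempool *tiny_mempool_create(void *(*alloc)(void))
{
	struct tiny_mempool *mp = malloc(sizeof *mp);
	int i;

	mp->alloc = alloc;
	mp->nr_free = 0;
	for (i = 0; i < MIN_NR; i++)	/* pre-fill the reserve */
		mp->reserve[mp->nr_free++] = alloc();
	return mp;
}

static void *tiny_mempool_alloc(struct tiny_mempool *mp)
{
	void *p = mp->alloc();		/* try the real allocator first */

	if (!p && mp->nr_free > 0)
		p = mp->reserve[--mp->nr_free];	/* fall back on reserve */
	return p;	/* the real mempool would sleep here, not fail */
}

static void tiny_mempool_free(struct tiny_mempool *mp, void *p)
{
	if (mp->nr_free < MIN_NR)
		mp->reserve[mp->nr_free++] = p;	/* top the reserve back up */
	else
		free(p);	/* underlying free (dma_pool_free) */
}
```

The "if you wait long enough, someone will free one" property comes from tiny_mempool_free refilling the reserve before returning anything to the underlying allocator.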

If anything comes out of this discussion, please let it be the 
removal of the hard-wired GFP_ATOMIC in dma_alloc_coherent.  That's
quite broken - the interface should (always) be designed so that the
caller can pass in the gfp flags (__GFP_WAIT,__GFP_IO,__GFP_FS, at least)


* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-31 22:41 ` Andrew Morton
@ 2002-12-31 23:23   ` David Brownell
  2002-12-31 23:27     ` Andrew Morton
  2002-12-31 23:47     ` James Bottomley
  2003-01-01 17:10   ` James Bottomley
  1 sibling, 2 replies; 44+ messages in thread
From: David Brownell @ 2002-12-31 23:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Adam J. Richter, James.Bottomley, linux-kernel


>>        Such a facility could be layered on top of your interface
>>perhaps by extending the mempool code to pass an extra parameter
>>around.  If so, then you should think about arranging your interface
>>so that it could be driven with as little glue as possible by mempool.
>>
> 
> 
> What is that parameter?  The size, I assume.   To do that you'd
> need to create different pools for different allocation sizes.

In the other allocators it'd be the dma_addr_t for the memory
being returned ...

I don't think the mempool stuff needs that, see the fragments below.


> Bear in mind that mempool only makes sense with memory objects
> which have the special characteristic that "if you wait long enough,
> someone will free one".  ie: BIOs, nfs requests, etc.   Probably,
> DMA buffers fit into that picture as well.

Inside the USB host controller drivers I think that mostly applies
to transfer descriptors (tds), which are freed when some other request
completes.  An 8K buffer takes  1 (ehci), 2 (ohci) or 128 (uhci)
of those, and as you know scatterlists can queue quite a few pages.

I'd imagine mempool based td allocation might go like this; it should
be easy enough to slip into most of the HCDs:

	void *mempool_alloc_td (int mem_flags, void *pool)
	{
		struct td *td;
		dma_addr_t dma;

		td = dma_pool_alloc (pool, mem_flags, &dma);
		if (!td)
			return td;
		td->td_dma = dma;	/* feed to the hardware */
		... plus other init
		return td;
	}

	void mempool_free_td (void *_td, void *pool)
	{
		struct td *td = _td;

		dma_pool_free (pool, td, td->td_dma);
	}

USB device drivers tend to either allocate and reuse one dma buffer
(kmalloc/kfree usage pattern) or use dma mapping ... so far.

- Dave




* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-31 23:23   ` David Brownell
@ 2002-12-31 23:27     ` Andrew Morton
  2002-12-31 23:44       ` David Brownell
  2002-12-31 23:47     ` James Bottomley
  1 sibling, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2002-12-31 23:27 UTC (permalink / raw)
  To: David Brownell; +Cc: Adam J. Richter, James.Bottomley, linux-kernel

David Brownell wrote:
> 
>         void *mempool_alloc_td (int mem_flags, void *pool)
>         {
>                 struct td *td;
>                 dma_addr_t dma;
> 
>                 td = dma_pool_alloc (pool, mem_flags, &dma);
>                 if (!td)
>                         return td;
>                 td->td_dma = dma;       /* feed to the hardware */
>                 ... plus other init
>                 return td;
>         }

The existing mempool code can be used to implement this, I believe.  The
pool->alloc callback is passed an opaque void *, and it returns
a void * which can point at any old composite caller-defined blob.


* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-31 22:02 [PATCH] generic device DMA (dma_pool update) Adam J. Richter
  2002-12-31 22:41 ` Andrew Morton
@ 2002-12-31 23:35 ` David Brownell
  1 sibling, 0 replies; 44+ messages in thread
From: David Brownell @ 2002-12-31 23:35 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: James.Bottomley, linux-kernel

Adam J. Richter wrote:

> 	I think that the term "pool" is more descriptively used by
> mempool and more misleadingly used by the pci_pool code, as there is
> no guaranteed pool being reserved in the pci_pool code.  Alas, I don't
> have a good alternative term to suggest at the moment.

FWIW pci_pool predates mempool by quite a bit (2.4.early vs 2.5.later),
and I don't think I've noticed any correlation between allocation using
the "pool" word and reserving memory ... so I thought it was "mempool"
that clashed.  No big deal IMO, "all the good words are taken".

I seem to recall it was a portability issue that made pci_pool never
release pages once it allocates them ... some platform couldn't cope
with pci_free_consistent() being called in_interrupt().  In practice
that seems to have been a good enough reservation scheme so far.

- Dave




* Re: [PATCH] generic device DMA (dma_pool update)
@ 2002-12-31 23:38 Adam J. Richter
  0 siblings, 0 replies; 44+ messages in thread
From: Adam J. Richter @ 2002-12-31 23:38 UTC (permalink / raw)
  To: akpm; +Cc: david-b, James.Bottomley, linux-kernel

Andrew Morton wrote:
>"Adam J. Richter" wrote:
>> 
>> David Brownell wrote:
>> 
>> >struct dma_pool *dma_pool_create(char *, struct device *, size_t)
>> >void dma_pool_destroy (struct dma_pool *pool)
>> >void *dma_pool_alloc(struct dma_pool *, int mem_flags, dma_addr_t *)
>> >void dma_pool_free(struct dma_pool *, void *, dma_addr_t)
>> 
>>         I would like to be able to have failure-free, deadlock-free
>> blocking memory allocation, such as we have with the non-DMA mempool
>> library so that we can guarantee that drivers that have been
>> successfully initialized will continue to work regardless of memory
>> pressure, and reduce error branches that drivers have to deal with.
>> 
>>         Such a facility could be layered on top of your interface
>> perhaps by extending the mempool code to pass an extra parameter
>> around.  If so, then you should think about arranging your interface
>> so that it could be driven with as little glue as possible by mempool.
>> 

>What is that parameter?  The size, I assume.

	No, dma address.  All allocations in a memory pool (in both
the mempool and pci_pool sense) are the same size.

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."


* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-31 23:27     ` Andrew Morton
@ 2002-12-31 23:44       ` David Brownell
  0 siblings, 0 replies; 44+ messages in thread
From: David Brownell @ 2002-12-31 23:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Adam J. Richter, James.Bottomley, linux-kernel

Yes, that was exactly what I was demonstrating ... those are
the callbacks that'd be supplied to mempool_create(), showing
that it doesn't need to change for that kind of usage.  (Which
isn't allocating DMA buffers, note!)

But we still need lower level allocators that are reasonable
for use with 1/Nth page allocations ... which isn't a problem
that mempool even attempts to solve.  Hence dma_pool, in any of
its implementations, would go underneath mempool to achieve what
Adam was describing (for drivers that need it).

- Dave


Andrew Morton wrote:
> David Brownell wrote:
> 
>>        void *mempool_alloc_td (int mem_flags, void *pool)
>>        {
>>                struct td *td;
>>                dma_addr_t dma;
>>
>>                td = dma_pool_alloc (pool, mem_flags, &dma);
>>                if (!td)
>>                        return td;
>>                td->td_dma = dma;       /* feed to the hardware */
>>                ... plus other init
>>                return td;
>>        }
> 
> 
> The existing mempool code can be used to implement this, I believe.  The
> pool->alloc callback is passed an opaque void *, and it returns
> a void * which can point at any old composite caller-defined blob.
> 





* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-31 23:23   ` David Brownell
  2002-12-31 23:27     ` Andrew Morton
@ 2002-12-31 23:47     ` James Bottomley
  1 sibling, 0 replies; 44+ messages in thread
From: James Bottomley @ 2002-12-31 23:47 UTC (permalink / raw)
  To: David Brownell
  Cc: Andrew Morton, Adam J. Richter, James.Bottomley, linux-kernel

david-b@pacbell.net said:
> USB device drivers tend to either allocate and reuse one dma buffer
> (kmalloc/kfree usage pattern) or use dma mapping ... so far. 

Please be careful with this in drivers.  Coherent memory can be phenomenally 
expensive to obtain on some hardware.  Sometimes it has to be implemented by 
turning off caching  and globally flushing the tlb.  Thus it really makes 
sense most of the time for drivers to allocate all the coherent memory they 
need initially and not have it go through the usual memory allocate/free cycle 
except under extreme conditions.  That's sort of what both pci_pool and 
mempool aim for.

James




* Re: [PATCH] generic device DMA (dma_pool update)
@ 2003-01-01  0:02 Adam J. Richter
  0 siblings, 0 replies; 44+ messages in thread
From: Adam J. Richter @ 2003-01-01  0:02 UTC (permalink / raw)
  To: akpm; +Cc: david-b, James.Bottomley, linux-kernel

Andrew Morton wrote:
>The existing mempool code can be used to implement this, I believe.  The
>pool->alloc callback is passed an opaque void *, and it returns
>a void * which can point at any old composite caller-defined blob.

	That field is the same for all allocations (its value
is stored in the struct mempool), so you can't use it.

	David's example would work because it happens to store a copy
of the DMA address in the structure being allocated, and the mempool
code currently does not overwrite the contents of any chunks of memory
when it holds a freed chunk to give out later.

	We could generalize David's technique for other data
structures by using a wrapper to store the DMA address and
adopting the convention of having the DMA allocator initially
stuff the DMA address in the beginning of the chunk being
allocated, like so.


/* Set mempool.alloc to this */
void *dma_mempool_alloc_callback(int gfp_mask, void *mempool_data)
{
	struct dma_pool *dma_pool = mempool_data;
	dma_addr_t dma_addr;
	void *result = dma_pool_alloc(dma_pool, gfp_mask, &dma_addr);
	if (result)
		memcpy(result, &dma_addr, sizeof(dma_addr_t));
	return result;
}

/* Set mempool.free to this */
void dma_mempool_free_callback(void *element, void *mempool_data)
{
	struct dma_pool *dma_pool = mempool_data;
	dma_pool_free(dma_pool, element, *(dma_addr_t *) element);
}




void *dma_mempool_alloc(mempool_t *pool, int gfp_mask, dma_addr_t *dma_addr)
{
        void *result = mempool_alloc(pool, gfp_mask);
        if (result)
                memcpy(dma_addr, result, sizeof(dma_addr_t));
        return result;
}

void dma_mempool_free(void *element, mempool_t *pool,
		      dma_addr_t dma_addr)
{
        /* We rely on mempool_free not trashing the first
           sizeof(dma_addr_t) bytes of element if it is going to
           give this element to another caller rather than freeing it.
           Currently, the mempool code does not know the size of the
           elements, so this is safe to do, but it would be nice if
           in the future, we could let the mempool code use some of the
           remaining bytes to maintain its free list. */
        memcpy(element, &dma_addr, sizeof(dma_addr_t));
        mempool_free(element, pool);
}



	So, I guess there is no need to change the mempool interface,
at least if we guarantee that mempool is not going to overwrite any
freed memory chunks that it is holding in reserve, although this also
means that there will continue to be a small but unnecessary memory
overhead in the mempool allocator.  I guess that could be addressed
later.

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."


* Re: [PATCH] generic device DMA (dma_pool update)
  2002-12-31 22:41 ` Andrew Morton
  2002-12-31 23:23   ` David Brownell
@ 2003-01-01 17:10   ` James Bottomley
  1 sibling, 0 replies; 44+ messages in thread
From: James Bottomley @ 2003-01-01 17:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Adam J. Richter, david-b, James.Bottomley, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 491 bytes --]

akpm@digeo.com said:
> If anything comes out of this discussion, please let it be the
> removal of the hard-wired GFP_ATOMIC in dma_alloc_coherent.  That's
> quite broken - the interface should (always) be designed so that the
> caller can pass in the gfp flags (__GFP_WAIT,__GFP_IO,__GFP_FS, at
> least) 

How about the attached?  I'll also make the changes for parisc (any arch 
maintainers who just implemented the dma_ API are going to be annoyed about 
this change, though).

James





[-- Attachment #2: tmp.diff --]
[-- Type: text/plain , Size: 5632 bytes --]

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.930   -> 1.934  
#	drivers/scsi/53c700.c	1.21    -> 1.22   
#	include/asm-generic/pci-dma-compat.h	1.2     -> 1.3    
#	arch/i386/kernel/pci-dma.c	1.9     -> 1.11   
#	Documentation/DMA-API.txt	1.1     -> 1.3    
#	include/asm-generic/dma-mapping.h	1.2     -> 1.3    
#	include/asm-i386/dma-mapping.h	1.1     -> 1.3    
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/12/31	jejb@raven.il.steeleye.com	1.931
# add GFP_ flag to dma_alloc_[non]coherent
# --------------------------------------------
# 03/01/01	jejb@raven.il.steeleye.com	1.932
# update noncoherent #define for flag
# --------------------------------------------
# 03/01/01	jejb@raven.il.steeleye.com	1.933
# tidy up docs and flags
# --------------------------------------------
# 03/01/01	jejb@raven.il.steeleye.com	1.934
# update generic prototype for gfp flag
# --------------------------------------------
#
diff -Nru a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
--- a/Documentation/DMA-API.txt	Wed Jan  1 11:08:55 2003
+++ b/Documentation/DMA-API.txt	Wed Jan  1 11:08:55 2003
@@ -22,7 +22,7 @@
 
 void *
 dma_alloc_coherent(struct device *dev, size_t size,
-			     dma_addr_t *dma_handle)
+			     dma_addr_t *dma_handle, int flag)
 void *
 pci_alloc_consistent(struct pci_dev *dev, size_t size,
 			     dma_addr_t *dma_handle)
@@ -43,6 +43,12 @@
 minimum allocation length may be as big as a page, so you should
 consolidate your requests for consistent memory as much as possible.
 
+The flag parameter (dma_alloc_coherent only) allows the caller to
+specify the GFP_ flags (see kmalloc) for the allocation (the
+implementation may chose to ignore flags that affect the location of
+the returned memory, like GFP_DMA).  For pci_alloc_consistent, you
+must assume GFP_ATOMIC behaviour.
+
 void
 dma_free_coherent(struct device *dev, size_t size, void *cpu_addr
 			   dma_addr_t dma_handle)
@@ -261,7 +267,7 @@
 
 void *
 dma_alloc_noncoherent(struct device *dev, size_t size,
-			       dma_addr_t *dma_handle)
+			       dma_addr_t *dma_handle, int flag)
 
 Identical to dma_alloc_coherent() except that the platform will
 choose to return either consistent or non-consistent memory as it sees
diff -Nru a/arch/i386/kernel/pci-dma.c b/arch/i386/kernel/pci-dma.c
--- a/arch/i386/kernel/pci-dma.c	Wed Jan  1 11:08:55 2003
+++ b/arch/i386/kernel/pci-dma.c	Wed Jan  1 11:08:55 2003
@@ -14,10 +14,12 @@
 #include <asm/io.h>
 
 void *dma_alloc_coherent(struct device *dev, size_t size,
-			   dma_addr_t *dma_handle)
+			   dma_addr_t *dma_handle, int gfp)
 {
 	void *ret;
-	int gfp = GFP_ATOMIC;
+
+	/* ignore region specifiers */
+	gfp &= ~(__GFP_DMA | __GFP_HIGHMEM);
 
 	if (dev == NULL || ((u32)*dev->dma_mask != 0xffffffff))
 		gfp |= GFP_DMA;
diff -Nru a/drivers/scsi/53c700.c b/drivers/scsi/53c700.c
--- a/drivers/scsi/53c700.c	Wed Jan  1 11:08:55 2003
+++ b/drivers/scsi/53c700.c	Wed Jan  1 11:08:55 2003
@@ -246,7 +246,7 @@
 	int j;
 
 	memory = dma_alloc_noncoherent(hostdata->dev, TOTAL_MEM_SIZE,
-					 &pScript);
+				       &pScript, GFP_KERNEL);
 	if(memory == NULL) {
 		printk(KERN_ERR "53c700: Failed to allocate memory for driver, detatching\n");
 		return NULL;
diff -Nru a/include/asm-generic/dma-mapping.h b/include/asm-generic/dma-mapping.h
--- a/include/asm-generic/dma-mapping.h	Wed Jan  1 11:08:55 2003
+++ b/include/asm-generic/dma-mapping.h	Wed Jan  1 11:08:55 2003
@@ -30,9 +30,10 @@
 }
 
 static inline void *
-dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle)
+dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
+		   int flag)
 {
-	BUG_ON(dev->bus != &pci_bus_type);
+	BUG_ON(dev->bus != &pci_bus_type || (flag & GFP_ATOMIC) != GFP_ATOMIC);
 
 	return pci_alloc_consistent(to_pci_dev(dev), size, dma_handle);
 }
@@ -121,7 +122,7 @@
 
 /* Now for the API extensions over the pci_ one */
 
-#define dma_alloc_noncoherent(d, s, h) dma_alloc_coherent(d, s, h)
+#define dma_alloc_noncoherent(d, s, h, f) dma_alloc_coherent(d, s, h, f)
 #define dma_free_noncoherent(d, s, v, h) dma_free_coherent(d, s, v, h)
 #define dma_is_consistent(d)	(1)
 
diff -Nru a/include/asm-generic/pci-dma-compat.h b/include/asm-generic/pci-dma-compat.h
--- a/include/asm-generic/pci-dma-compat.h	Wed Jan  1 11:08:55 2003
+++ b/include/asm-generic/pci-dma-compat.h	Wed Jan  1 11:08:55 2003
@@ -19,7 +19,7 @@
 pci_alloc_consistent(struct pci_dev *hwdev, size_t size,
 		     dma_addr_t *dma_handle)
 {
-	return dma_alloc_coherent(hwdev == NULL ? NULL : &hwdev->dev, size, dma_handle);
+	return dma_alloc_coherent(hwdev == NULL ? NULL : &hwdev->dev, size, dma_handle, GFP_ATOMIC);
 }
 
 static inline void
diff -Nru a/include/asm-i386/dma-mapping.h b/include/asm-i386/dma-mapping.h
--- a/include/asm-i386/dma-mapping.h	Wed Jan  1 11:08:55 2003
+++ b/include/asm-i386/dma-mapping.h	Wed Jan  1 11:08:55 2003
@@ -3,11 +3,11 @@
 
 #include <asm/cache.h>
 
-#define dma_alloc_noncoherent(d, s, h) dma_alloc_coherent(d, s, h)
+#define dma_alloc_noncoherent(d, s, h, f) dma_alloc_coherent(d, s, h, f)
 #define dma_free_noncoherent(d, s, v, h) dma_free_coherent(d, s, v, h)
 
 void *dma_alloc_coherent(struct device *dev, size_t size,
-			   dma_addr_t *dma_handle);
+			   dma_addr_t *dma_handle, int flag);
 
 void dma_free_coherent(struct device *dev, size_t size,
 			 void *vaddr, dma_addr_t dma_handle);


* Re: [PATCH] generic device DMA (dma_pool update)
@ 2003-01-01 19:21 Adam J. Richter
  2003-01-01 19:48 ` James Bottomley
  0 siblings, 1 reply; 44+ messages in thread
From: Adam J. Richter @ 2003-01-01 19:21 UTC (permalink / raw)
  To: akpm, James.Bottomley; +Cc: david-b, linux-kernel

James Bottomley wrote:
> void *
> dma_alloc_coherent(struct device *dev, size_t size,
>-                            dma_addr_t *dma_handle)
>+                            dma_addr_t *dma_handle, int flag)

	I thought Andrew Morton's request for a gfp flag was for
allocating memory from a pool (for example, a "read ahead" will want
to abort if memory is unavailable rather than wait).

	The big DMA allocations, however, will always occur during
initialization, and I think will always have the intermediate policy
of "I can block, but I should fail rather than block for more than a
few seconds and potentially deadlock from loading too many drivers on
a system with very little memory."

	Can someone show me or invent an example of two different uses
of dma_alloc_coherent that really should use different policies on
whether to block or not?

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."


* Re: [PATCH] generic device DMA (dma_pool update)
  2003-01-01 19:21 Adam J. Richter
@ 2003-01-01 19:48 ` James Bottomley
  2003-01-02  2:11   ` David Brownell
  0 siblings, 1 reply; 44+ messages in thread
From: James Bottomley @ 2003-01-01 19:48 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: akpm, James.Bottomley, david-b, linux-kernel

adam@yggdrasil.com said:
> 	I thought Andrew Morton's request for a gfp flag was for allocating
> memory from a pool (for example, a "read ahead" will want to abort if
> memory is unavailable rather than wait).

Well, yes, but the underlying allocators will also have to take the flag too 
so that all the semantics are correct.


adam@yggdrasil.com said:
> 	Can someone show me or invent an example of two different uses of
> dma_alloc_coherent that really should use different policies on
> whether to block or not? 

The obvious one is allocations from interrupt routines, which must be 
GFP_ATOMIC (ignoring the issue of whether a driver should be doing a memory 
allocation in an interrupt).  Allocating large pools at driver initialisation 
should probably be GFP_KERNEL as you say.

James





* Re: [PATCH] generic device DMA (dma_pool update)
  2003-01-01 19:48 ` James Bottomley
@ 2003-01-02  2:11   ` David Brownell
  0 siblings, 0 replies; 44+ messages in thread
From: David Brownell @ 2003-01-02  2:11 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: James Bottomley, akpm, linux-kernel

James Bottomley wrote:
> adam@yggdrasil.com said:
> 
>>	Can someone show me or invent an example of two different uses of
>>dma_alloc_coherent that really should use different policies on
>>whether to block or not? 
> 
> 
> The obvious one is allocations from interrupt routines, which must be 
> GFP_ATOMIC (ignoring the issue of whether a driver should be doing a memory 
> allocation in an interrupt).  Allocating large pools at driver initialisation 
> should probably be GFP_KERNEL as you say.

More:  not just "from interrupt routines", also "when a spinlock is held".
Though one expects that if a dma_pool were in use, it'd probably have one
of that kind of chunk already available (ideally, even initialized).

Many task-initiated allocations would use GFP_KERNEL; not just major driver
activation points like open().

- Dave



* Re: [PATCH] generic device DMA (dma_pool update)
@ 2003-01-02  4:13 Adam J. Richter
  2003-01-02 16:41 ` James Bottomley
  2003-01-02 18:26 ` David Brownell
  0 siblings, 2 replies; 44+ messages in thread
From: Adam J. Richter @ 2003-01-02  4:13 UTC (permalink / raw)
  To: david-b; +Cc: akpm, James.Bottomley, linux-kernel

David Brownell wrote:
>James Bottomley wrote:
>> adam@yggdrasil.com said:
>> 
>>>	Can someone show me or invent an example of two different uses of
>>>dma_alloc_coherent that really should use different policies on
>>>whether to block or not? 
>> 
>> 
>> The obvious one is allocations from interrupt routines, which must be 
>> GFP_ATOMIC (ignoring the issue of whether a driver should be doing a memory 
>> allocation in an interrupt).  Allocating large pools at driver initialisation 
>> should probably be GFP_KERNEL as you say.

>More:  not just "from interrupt routines", also "when a spinlock is held".
>Though one expects that if a dma_pool were in use, it'd probably have one
>of that kind of chunk already available (ideally, even initialized).

	Let me clarify or revise my request.  By "show me or invent an
example" I mean describe a case where this would be used, as in
specific hardware devices that Linux has trouble supporting right now,
or specific programs that can't be run efficiently under Linux, etc.
What device would need to do this kind of allocation?  Why haven't I
seen requests for this from people working on real device drivers?
Where is this going to make the kernel smaller, more reliable, faster,
more maintainable, able to make a computer do something it could do
before under Linux, etc.?

	I have trouble understanding why, for example, a USB hard
disk driver would want anything more than a fixed size pool of
transfer descriptors.  At some point you know that you've queued
enough IO so that the driver can be confident that it will be
called again before that queue is completely emptied.


>Many task-initiated allocations would use GFP_KERNEL; not just major driver
>activation points like open().

	Here too, it would help me to understand if you would try
to construct an example.

	Also, your use of the term GFP_KERNEL is potentially
ambiguous.  In some cases GFP_KERNEL seems to mean "wait indefinitely
until memory is available; never fail."  In other cases it means
"perhaps wait a second or two if memory is unavailable, but fail if
memory is not available by then."

	I don't dispute that adding a memory allocation class argument
to dma_alloc_consistent would make it capable of being used in more
cases.  I just don't know if any of those cases actually come up, in
which case the Linux community is probably better off with the smaller
kernel footprint.

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."


* Re: [PATCH] generic device DMA (dma_pool update)
  2003-01-02  4:13 Adam J. Richter
@ 2003-01-02 16:41 ` James Bottomley
  2003-01-02 18:26 ` David Brownell
  1 sibling, 0 replies; 44+ messages in thread
From: James Bottomley @ 2003-01-02 16:41 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: david-b, akpm, James.Bottomley, linux-kernel

adam@yggdrasil.com said:
> 	Let me clarify or revise my request.  By "show me or invent an
> example" I mean describe a case where this would be used, as in
> specific hardware devices that Linux has trouble supporting right now,
> or specific programs that can't be run efficiently under Linux, etc.
> What device would need to do this kind of allocation?  Why haven't I
> seen requests for this from people working on real device drivers?
> Where is this going to make the kernel smaller, more reliable, faster,
> more maintainable, able to make a computer do something it could do
> before under Linux, etc.?

I'm not really the right person to be answering this.  For any transfer you 
set up (which encompasses all of the SCSI stuff bar target mode and AENs) you 
should have all the resources ready and waiting in the interrupt, and so never 
require an in_interrupt allocation.

However, for unsolicited transfer requests---the best example I can think of 
would be incoming network packets---it does make sense:  You allocate with 
GFP_ATOMIC, if the kernel can fulfil the request, fine; if not, you drop the 
packet on the floor.  Now, whether there's an unsolicited transfer that's 
going to require coherent memory, that I can't say.  It does seem to be 
possible, though.

James




* Re: [PATCH] generic device DMA (dma_pool update)
@ 2003-01-02 17:04 Adam J. Richter
  0 siblings, 0 replies; 44+ messages in thread
From: Adam J. Richter @ 2003-01-02 17:04 UTC (permalink / raw)
  To: James.Bottomley; +Cc: akpm, david-b, linux-kernel

James Bottomley wrote:
>adam@yggdrasil.com said:
>> 	Let me clarify or revise my request.  By "show me or invent an
>> example" I mean describe a case where this would be used, as in
>> specific hardware devices that Linux has trouble supporting right now,
>> or specific programs that can't be run efficiently under Linux, etc.
>> What device would need to do this kind of allocation?  Why haven't I
>> seen requests for this from people working on real device drivers?
>> Where is this going to make the kernel smaller, more reliable, faster,
>> more maintainable, able to make a computer do something it could do
>> before under Linux, etc.?

>I'm not really the right person to be answering this.  For any transfer you 
>set up (which encompasses all of the SCSI stuff bar target mode and AENs) you 
>should have all the resources ready and waiting in the interrupt, and so never 
>require an in_interrupt allocation.

>However, for unsolicited transfer requests---the best example I can think of 
>would be incoming network packets---it does make sense:  You allocate with 
>GFP_ATOMIC, if the kernel can fulfil the request, fine; if not, you drop the 
>packet on the floor.  Now, whether there's an unsolicited transfer that's 
>going to require coherent memory, that I can't say.  It does seem to be 
>possible, though.

	When a network device driver receives a packet, it gives the
network packet that it pre-allocated to the higher layers with
netif_rx() and then allocates a replacement packet with dev_alloc_skb(),
but that is non-consistent "streaming" memory.  The consistent memory
is generally for the DMA scatter-gather stub(s), which is (are)
generally reused once the receive for the packet that arrived has
completed.
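The receive path Adam describes can be modeled in userspace C: hand the filled, pre-allocated buffer up the stack (the netif_rx() step), then allocate a replacement (the dev_alloc_skb() step). The struct and function names below are invented for this sketch:

```c
#include <assert.h>
#include <stdlib.h>

/* One receive-ring slot holding the driver's pre-allocated buffer. */
struct ring_slot { void *buf; };

static void *alloc_rx_buf(void)
{
    return malloc(1536);            /* typical Ethernet-sized buffer */
}

/* On success, returns the filled buffer for the upper layers and leaves
 * a fresh buffer in the slot.  On allocation failure the old buffer is
 * recycled and NULL is returned: the packet is dropped, nothing blocks. */
static void *rx_complete(struct ring_slot *slot, int simulate_oom)
{
    void *filled = slot->buf;
    void *fresh = simulate_oom ? NULL : alloc_rx_buf();

    if (!fresh)
        return NULL;                /* keep slot->buf for the next packet */
    slot->buf = fresh;
    return filled;
}
```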

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] generic device DMA (dma_pool update)
  2003-01-02  4:13 Adam J. Richter
  2003-01-02 16:41 ` James Bottomley
@ 2003-01-02 18:26 ` David Brownell
  1 sibling, 0 replies; 44+ messages in thread
From: David Brownell @ 2003-01-02 18:26 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: akpm, James.Bottomley, linux-kernel

Hi Adam,

Note that this "add gfp flags to dma_alloc_coherent()" issue is a tangent
to the dma_pool topic ... it's a different "generic device DMA" issue.

We already have the pci_pool allocator that knows how to cope (timeout and
retry) with the awkward semantics of pci_alloc_consistent() ... likewise,
we can tell that those semantics are problematic, both because they need
that kind of workaround and because they complicate reusing "better"
allocator code (like mm/slab.c) on top of the DMA page allocators.


> 	Let me clarify or revise my request.  By "show me or invent an
> example" I mean describe a case where this would be used, as in
> specific hardware devices that Linux has trouble supporting right now,
> or specific programs that can't be run efficiently under Linux, etc.

That doesn't really strike me as a reasonable revision, since that
wasn't an issue an improved dma_alloc_coherent() syntax was intended
to address directly.

To the extent that it's reasonable, you should also be considering
this corresponding issue:  ways that removing gfp_flags from the
corresponding generic memory allocator, __get_free_pages(), would be
improving those characteristics of Linux.  (IMO, it wouldn't.)


> 	I have trouble understanding why, for example, a USB hard
> disk driver would want anything more than a fixed size pool of
> transfer descriptors.  At some point you know that you've queued
> enough IO so that the driver can be confident that it will be
> called again before that queue is completely emptied.

For better or worse, that's not how it works today and it's not
likely to change in the 2.5 kernels.  "Transfer Descriptors" are
resources inside host controller drivers (bus drivers), allocated
dynamically (GFP_KERNEL in many cases, normally using a pci_pool)
when USB device drivers (like usb-storage) submit requests (URBs).

They are invisible to usb device drivers, except implicitly as
another reason that submitting an urb might trigger -ENOMEM.

And for that matter, the usb-storage driver doesn't "fire and
forget" as you described; that'd be a network driver model.
The storage driver needs to block till its request completes.


> 	Also, your use of the term GFP_KERNEL is potentially
> ambiguous.  In some cases GFP_KERNEL seems to mean "wait indefinitely
> until memory is available; never fail."  In other cases it means
> "perhaps wait a second or two if memory is unavailable, but fail if
> memory is not available by then."

Hmm, I was unaware that anyone expected GFP_KERNEL (or rather,
__GFP_WAIT) to guarantee that memory was always returned.  It's
not called __GFP_NEVERFAIL, after all.

I've always seen such fault returns documented as unrelated to
allocator parameters ... and treated callers that break on fault
returns as buggy, regardless of GFP_* parameters in use.  A
random sample of kernel docs agrees with me on that:  both
in Documentation/* (like the i2c stuff) or the O'Reilly book
on device drivers.

- Dave




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] generic device DMA (dma_pool update)
@ 2003-01-02 22:07 Adam J. Richter
  2003-01-03  0:20 ` Russell King
                   ` (3 more replies)
  0 siblings, 4 replies; 44+ messages in thread
From: Adam J. Richter @ 2003-01-02 22:07 UTC (permalink / raw)
  To: david-b; +Cc: akpm, James.Bottomley, linux-kernel

>Note that this "add gfp flags to dma_alloc_coherent()" issue is a tangent
>to the dma_pool topic ... it's a different "generic device DMA" issue.

	I think we understand each other, but let's walk through
the three levels of it: mempool_alloc, {dma,pci}_pool_alloc, and
{dma,pci}_alloc_consistent.

	We agree that mempool_alloc should take gfp_flags so that
it can fail rather than block on optional IO such as read-aheads.

	Whether pci_pool_alloc (which does not guarantee a reserve of
memory) should continue to take gfp_flags is not something we were
discussing, but is an interesting question for the future.  I see
that pci_pool_alloc is only called with GFP_ATOMIC in two places:

	uhci_alloc_td in drivers/usb/host/uhci-hcd.c
	alloc_safe_buffer in arch/arm/mach-sa1100/sa1111-pcibuf.c

	The use in uhci_alloc_td means that USB IO can fail under
memory load.  Conceivably, swapping could even deadlock this way.
uhci_alloc_td should be going through a mempool which individual
drivers should expand and shrink as they are added and removed so that
transfers that are sent with GFP_KERNEL will be guaranteed not to fail
for lack of memory (and gfp_flags should be passed down so that it
will be possible to abort transfers like read-aheads rather than
blocking, but that's not essential).
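The reserve idea suggested above can be modeled with a minimal userspace pool: pre-allocate min_nr objects when the driver binds, so a "can wait" (GFP_KERNEL-style) allocation never fails while at most min_nr objects are in flight. This mocks the concept, not the kernel mempool API; all names are invented:

```c
#include <assert.h>
#include <stdlib.h>

struct mini_pool {
    void **reserve;
    int    nr_free;
};

static struct mini_pool *pool_create(int min_nr, size_t objsize)
{
    struct mini_pool *p = malloc(sizeof(*p));

    p->reserve = malloc(min_nr * sizeof(void *));
    p->nr_free = 0;
    while (p->nr_free < min_nr)     /* only this initial fill may fail */
        p->reserve[p->nr_free++] = malloc(objsize);
    return p;
}

/* can_wait == 0 models GFP_ATOMIC: fail when the reserve is empty.
 * can_wait == 1 models GFP_KERNEL: the real thing would sleep until a
 * pool_free() call returns an object; correct reserve sizing guarantees
 * one eventually appears, so the caller need not check for NULL. */
static void *pool_alloc(struct mini_pool *p, int can_wait)
{
    if (p->nr_free > 0)
        return p->reserve[--p->nr_free];
    if (!can_wait)
        return NULL;
    /* real code: wait on a waitqueue here, then retry */
    return NULL;
}

static void pool_free(struct mini_pool *p, void *obj)
{
    p->reserve[p->nr_free++] = obj;
}
```

The deadlock-freedom argument rests entirely on sizing: with at most min_nr objects ever in flight, every waiting allocation is eventually satisfied by a free.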

	The pci_pool_alloc in sa1111-pcibuf.c is more interesting.
alloc_safe_buffer is used to implement an unusual version of
pci_map_single.  It is unclear to me whether this approach is optimal.
I'll look into this more.

	Anyhow, the question that we actually were talking about is
whether dma_alloc_consistent should take gfp_flags.  So far, the only
potential example that I'm aware of is the question of whether the
call to pci_pool_alloc in sa1111-pcibuf.c needs to call down to
{pci,dma}_alloc_consistent with GFP_ATOMIC.  It's something that I
just noticed and will look into.


>We already have the pci_pool allocator that knows how to cope (timeout and
>retry) with the awkward semantics of pci_alloc_consistent() ...

	pci_pool_alloc currently implements mempool's GFP_KERNEL
semantics (block indefinitely, never fail), but, unlike mempool, it
does not guarantee a minimum memory allocation, so there is no way
to guarantee that it will ever return.  It can hang if there is
insufficient memory.  It looks like you cannot even control-C out
of it.
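The hang Adam describes can be modeled in userspace C: a pool with no guaranteed reserve that retries "until memory appears" has nothing ensuring the wait ends. The retry bound below exists only so the sketch terminates; the criticized behavior is the unbounded loop. All names are invented, with flaky_page_alloc() standing in for pci_alloc_consistent() failing under memory load:

```c
#include <assert.h>
#include <stdlib.h>

/* Fails *failures_left times, then succeeds -- a stand-in for the
 * underlying page allocator under memory pressure. */
static void *flaky_page_alloc(int *failures_left)
{
    if ((*failures_left)-- > 0)
        return NULL;
    return malloc(4096);
}

/* Models pci_pool_alloc's "block indefinitely" GFP_KERNEL behavior;
 * max_tries bounds the loop only for the sketch's sake. */
static void *pool_alloc_blocking(int *failures_left, int max_tries)
{
    void *p = NULL;

    while (!p && max_tries-- > 0)   /* real code: schedule_timeout() */
        p = flaky_page_alloc(failures_left);
    return p;                       /* NULL models "still hung" */
}
```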

	By the way, since you call pci_alloc_consistent's GFP_KERNEL
semantics "awkward", are you saying that you think it should wait
indefinitely for memory to become available?

>likewise,
>we can tell that those semantics are problematic, both because they need
>that kind of workaround and because they complicate reusing "better"
>allocator code (like mm/slab.c) on top of the DMA page allocators.

	Maybe it's irrelevant, but I don't understand what you mean by
"better" here.


>> 	Let me clarify or revise my request.  By "show me or invent an
>> example" I mean describe a case where this would be used, as in
>> specific hardware devices that Linux has trouble supporting right now,
>> or specific programs that can't be run efficiently under Linux, etc.

>That doesn't really strike me as a reasonable revision, since that
>wasn't an issue an improved dma_alloc_coherent() syntax was intended
>to address directly.

	It is always reasonable to ask for an example of where a
requested change would be used, as in specific hardware devices that
Linux has trouble supporting right now, or specific programs that
can't be run efficiently under Linux, or some other real external
benefit, because that is the purpose of software.  This applies to
every byte of code in Linux.

	Don't start by positing some situation in the middle of a
call graph.  Instead, I'm asking you to show an example starting
at the top of the call graph.  What *actual* use will cause this
parameter to be useful in reality?  By what external metric
will users benefit?

	I'll look into the sa1111-pcibuf.c case.  That's the only
potential example that I've seen so far.


>To the extent that it's reasonable, you should also be considering
>this corresponding issue:  ways that removing gfp_flags from the
>corresponding generic memory allocator, __get_free_pages(), would be
>improving those characteristics of Linux.  (IMO, it wouldn't.)

	There are other users of __get_free_pages.  If it turns out
that there is no true use for gfp_mask in dma_alloc_consistent then it
might be worth exploring in the future, but we don't have to decide to
eliminate gfp_mask from __get_free_pages just to decide not to add it
to dma_alloc_consistent.


>> 	I have trouble understanding why, for example, a USB hard
>> disk driver would want anything more than a fixed size pool of
>> transfer descriptors.  At some point you know that you've queued
>> enough IO so that the driver can be confident that it will be
>> called again before that queue is completely emptied.

>For better or worse, that's not how it works today and it's not
>likely to change in the 2.5 kernels.  "Transfer Descriptors" are
>resources inside host controller drivers (bus drivers), allocated
>dynamically (GFP_KERNEL in many cases, normally using a pci_pool)
>when USB device drivers (like usb-storage) submit requests (URBs).

>They are invisible to usb device drivers, except implicitly as
>another reason that submitting an urb might trigger -ENOMEM.

	It might not be trivial to fix that but it would be
straightforward to do so.  In the future, submit_urb(...GFP_KERNEL)
should be guaranteed never to fail with -ENOMEM (see my comments
about uhci_alloc_td above on how to fix this).

	This existing, straightforwardly fixable bug does not justify
changing the DMA API just to encrust it further.  Users will be better
off if we fix the bug (they will be able to swap reliably onto USB
disks, or at least they'll be closer to it if there are other bugs in
the way).

	In practice, I think that if we just added one, maybe two,
URB's by default for every endpoint when a device is added, that
would be enough to reduce the number of drivers needing to reserve
more URB's than that to few or none.

>And for that matter, the usb-storage driver doesn't "fire and
>forget" as you described; that'd be a network driver model.
>The storage driver needs to block till its request completes.

	That only makes it easier to calculate the number of
URB's that need to be guaranteed to be available.

	By the way, as far as I can tell, usb_stor_msg_common in
drivers/usb/storage/transport.c only waits for the urb to be
transferred, not for the operation to complete.


>> 	Also, your use of the term GFP_KERNEL is potentially
>> ambiguous.  In some cases GFP_KERNEL seems to mean "wait indefinitely
>> until memory is available; never fail."  In other cases it means
>> "perhaps wait a second or two if memory is unavailable, but fail if
>> memory is not available by then."

>Hmm, I was unaware that anyone expected GFP_KERNEL (or rather,
>__GFP_WAIT) to guarantee that memory was always returned.  It's
>not called __GFP_NEVERFAIL, after all.

	mempool_alloc does.  That's the point of it.  You calculate
how many objects you need in order to guarantee no deadlocks and
reserve that number in advance (the initial reservation can fail).

>I've always seen such fault returns documented as unrelated to
>allocator parameters ...

	By design, mempool_alloc(GFP_KERNEL) never fails (although
mempool_alloc(GFP_ATOMIC) can).

	By the way, pci_pool_alloc(GFP_KERNEL) also never returns
failure, but I consider that a bug as I explained above.

>and treated callers that break on fault
>returns as buggy, regardless of GFP_* parameters in use.

	It is not a bug for mempool_alloc(GFP_KERNEL) callers to assume
success upon return.  It is a waste of kernel footprint for them to
check for failure.

>A
>random sample of kernel docs agrees with me on that:  both
>in Documentation/* (like the i2c stuff) or the O'Reilly book
>on device drivers.

	mempool really needs a file in Documentation/.  The only reference
to it there is in linux-2.5.54/Documentation/block/biodoc.txt, line 663 ff:
| This makes use of Ingo Molnar's mempool implementation, which enables
| subsystems like bio to maintain their own reserve memory pools for guaranteed
| deadlock-free allocations during extreme VM load. [...]

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] generic device DMA (dma_pool update)
  2003-01-02 22:07 Adam J. Richter
@ 2003-01-03  0:20 ` Russell King
  2003-01-03  4:50 ` David Brownell
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 44+ messages in thread
From: Russell King @ 2003-01-03  0:20 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: david-b, akpm, James.Bottomley, linux-kernel

On Thu, Jan 02, 2003 at 02:07:46PM -0800, Adam J. Richter wrote:
> 	The pci_pool_alloc in sa1111-pcibuf.c is more interesting.
> alloc_safe_buffer is used to implement an unusual version of
> pci_map_single.  It is unclear to me whether this approach is optimal.
> I'll look into this more.

Welcome to the world of seriously broken hardware that a fair number
of people put on their boards.  Unfortunately, the hardware bug got
marked as "never fix" by Intel.

Basically, the chip is only able to DMA from certain memory regions,
and it's RAM-size dependent.  For 32MB of memory, it works out at
around 1MB regions - odd regions are not able to perform DMA, even
regions can.
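The constraint Russell describes can be sketched, assuming exactly 1 MB granularity: only the even-numbered megabyte regions of RAM are reachable by DMA on the affected boards. The helper below is hypothetical, invented for this sketch, not a kernel function:

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero when a physical address falls in an even-numbered
 * 1 MB region, i.e. one the SA-1111 can (per the description above)
 * actually DMA from. */
static int sa1111_dma_capable(uint32_t phys)
{
    return ((phys >> 20) & 1) == 0;     /* even 1 MB region? */
}
```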

Linux only supports one DMA region per memory node though, so in
this configuration, we only have 1MB of memory able to be used for
OHCI.

To work around this, we have "unusual" versions of the mapping
functions.  Although they may not be optimal, they are, afaik,
completely functional.

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] generic device DMA (dma_pool update)
  2003-01-02 22:07 Adam J. Richter
  2003-01-03  0:20 ` Russell King
@ 2003-01-03  4:50 ` David Brownell
  2003-01-03  6:11 ` David Brownell
  2003-01-03  6:46 ` David Brownell
  3 siblings, 0 replies; 44+ messages in thread
From: David Brownell @ 2003-01-03  4:50 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: linux-kernel

Adam J. Richter wrote:

 > 	The pci_pool_alloc in sa1111-pcibuf.c is more interesting.
 > alloc_safe_buffer is used to implement an unusual version of
 > pci_map_single.  It is unclear to me whether this approach is optimal.
 > I'll look into this more.

Darn if I didn't mention SA1111 up front!  Here's where avoiding
mapping APIs is a win, in favor of allocating dma buffers that
will work from the get go.  (That is, dma_addr_t can usefully push
up the call stack, while gfp_flags pushes down.)

USB device drivers like "hid" and "usblp" do that already; later,
more drivers could convert.

Of course, the ohci-sa1111.c codepaths are also relevant here.
They use pci_pools for TD allocation.  The fact that they use
SLAB_ATOMIC (you missed one!), and once didn't, bothers me;
luckily that became an easy fix a few months back.

And the dma_pool_patch I posted would make all this code morph
to using the slab allocator underneath.  The slab code would be
smarter about managing it than pci_pool, so it'd be better at
sharing that 1 MByte of DMA memory with other devices.

- Dave



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] generic device DMA (dma_pool update)
  2003-01-02 22:07 Adam J. Richter
  2003-01-03  0:20 ` Russell King
  2003-01-03  4:50 ` David Brownell
@ 2003-01-03  6:11 ` David Brownell
  2003-01-03  6:46 ` David Brownell
  3 siblings, 0 replies; 44+ messages in thread
From: David Brownell @ 2003-01-03  6:11 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: akpm, James.Bottomley, linux-kernel

Adam J. Richter wrote:
>>Note that this "add gfp flags to dma_alloc_coherent()" issue is a tangent
>>to the dma_pool topic ... it's a different "generic device DMA" issue.
> 
> 
> 	I think we understand each other, but let's walk through
> the three levels of it: mempool_alloc, {dma,pci}_pool_alloc, and
> {dma,pci}_alloc_consistent.
> 
> 	We agree that mempool_alloc should take gfp_flags so that
> it can fail rather than block on optional IO such as read-aheads.

I can imagine code using mempool to implement the __GFP_HIGH "use
emergency pools" behavior.  In the case of USB, that'd mean more or
less that GFP_ATOMIC allocation paths, and many usb-storage paths,
might want to initialize such emergency pools ... freeing into them,
or allocating from them when __GFP_HIGH is specified.


> 	Whether pci_pool_alloc (which does not guarantee a reserve of
> memory) should continue to take gfp_flags is not something we were
> discussing, but is an interesting question for the future.  I see
> that pci_pool_alloc is only called with GFP_ATOMIC in two places:
> 
> 	uhci_alloc_td in drivers/usb/host/uhci-hcd.c
> 	alloc_safe_buffer in arch/arm/mach-sa1100/sa1111-pcibuf.c

And SLAB_* elsewhere, per spec ... :)  There are other places it gets
called, even with SLAB_ATOMIC.  Plus, things like HID and usblp buffer
alloc will be passing down SLAB_KERNEL ... without __GFP_HIGH, so the
allocator clearly shouldn't tap emergency reserves then.

The gfp_flags parameter is a good example of something that shouldn't
be discarded except when it's unavoidable. (Like "I have to hold that
spinlock during this allocation" forcing GFP_ATOMIC.)


> 	The use in uhci_alloc_td means that USB IO can fail under
> memory load.  Conceivably, swapping could even deadlock this way.

That's not new, or just because of UHCI.  There's per-request
allocation happening, and the APIs don't support pre-allocation
of most anything.  Maybe 2.7 should change that.

The LK 2.5 USB calls (not 2.4) will pass gfp_flags down from callers,
with the net effect that many allocations respect __GFP_HIGH, and maybe
even __GFP_WAIT.  Clearly uhci-hcd doesn't; ehci-hcd does, and ohci-hcd
could, with a small patch (longstanding minor peeve).


The problem is that at the next layer down, when pci_pool runs out of
its preallocated pages, it's got to talk to pci_alloc_consistent().

THAT forces pci_pool to discard gfp_flags; the lower levels are
stuck using GFP_ATOMIC.  There lies potential lossage:  it enables
emergency allocations (on x86, this can fragment a zone, and it'd
stay that way since pci_pool won't free pages) when it'd be better
to just block for a moment, or maybe even just fail.  Portably
supporting graceful degradation just isn't in the cards.

Which is why it'd be better if dma_alloc_coherent() took gfp_flags.
That way any allocator layered on top of it won't do dumb things
like that, when the whole stack on top of it is saying "don't!!".
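The "lossy API" problem argued above can be modeled in userspace C: the caller's gfp flags should flow through every allocator layer down to the page allocator, and a middle layer that discards them and hardwires an ATOMIC-style request taps the emergency reserve even when the whole stack above said not to. The MOCK_GFP_* flags and all function names are invented for this sketch:

```c
#include <assert.h>
#include <stdlib.h>

#define MOCK_GFP_WAIT 0x1   /* caller may sleep */
#define MOCK_GFP_HIGH 0x2   /* may dip into the emergency reserve */

static int emergency_pages = 1;
static int normal_pages;    /* 0: the normal zone is exhausted */

static void *mock_page_alloc(unsigned flags)
{
    if (normal_pages > 0) {
        normal_pages--;
        return malloc(4096);
    }
    if ((flags & MOCK_GFP_HIGH) && emergency_pages > 0) {
        emergency_pages--;
        return malloc(4096);
    }
    return NULL;            /* a waiting caller would block, then fail */
}

/* pci_pool-style middle layer: forced to discard the caller's flags */
static void *lossy_pool_alloc(unsigned caller_flags)
{
    (void)caller_flags;                     /* information thrown away */
    return mock_page_alloc(MOCK_GFP_HIGH);  /* hardwired "ATOMIC" */
}

/* the proposed dma_alloc_coherent() shape: flags pass straight through */
static void *clean_pool_alloc(unsigned caller_flags)
{
    return mock_page_alloc(caller_flags);
}
```

With the normal zone exhausted, the flag-preserving path fails cleanly and leaves the emergency reserve intact, while the lossy path silently drains it.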



> 	By the way, since you call pci_alloc_consistent's GFP_KERNEL
> semantics "awkward", are you saying that you think it should wait
> indefinitely for memory to become available?

You mean change its semantics so it can't be used in non-sleeping
contexts any more?  No. I didn't like James' notion to change them
so this call _had_ to be used in non-sleeping contexts, either.

The point is that dma_alloc_coherent() shouldn't be making the same
mistake of _precluding_ implementations from implementing __GFP_WAIT,
which might be appropriate, or distinguishing normal allocations from
those that are __GFP_HIGH priority.  (And so on.)

The "awkward" aspect is that pci_alloc_consistent() implementations
may only use GFP_ATOMIC, regardless of whether the caller has better
knowledge of the true restrictions.



> 	It is always reasonable to ask for an example of where a
> requested change would be used, as in specific hardware devices that
> Linux has trouble supporting right now, or specific programs that
> can't be run efficiently under Linux, or some other real external
> benefit, because that is the purpose of software.  This applies to
> every byte of code in Linux.

Then there are basic characteristics that don't work well when
taken out of context.  The context here is that higher level code
is telling lower levels some important things in gfp_flags, but
there's this darn lossy API in between, discarding that information.

Sort of like thinking the shortest distance between two cities
in California involves McLean, Virginia:  that makes no sense.

So:  this is something that a shiny new API introduced a week or
so ago can/should fix ASAP (IMO).  There's no impact on an installed
base, and implementations can reasonably ignore all the gfp_flags
that are really quality-of-service hints ... they have ignored
such information so far, but future code could be smarter.

- Dave



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] generic device DMA (dma_pool update)
  2003-01-02 22:07 Adam J. Richter
                   ` (2 preceding siblings ...)
  2003-01-03  6:11 ` David Brownell
@ 2003-01-03  6:46 ` David Brownell
  2003-01-03  6:52   ` William Lee Irwin III
  3 siblings, 1 reply; 44+ messages in thread
From: David Brownell @ 2003-01-03  6:46 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: linux-kernel

Adam J. Richter wrote:
> 	In practice, I think that if we just added one, maybe two,
> URB's by default for every endpoint when a device is added, that
> would be enough to reduce the number of drivers needing to reserve
> more URB's than that to few or none.

I seem to recall someone posted a patch to make non-iso URB allocation
use a mempool.


>>Hmm, I was unaware that anyone expected GFP_KERNEL (or rather,
>>__GFP_WAIT) to guarantee that memory was always returned.  It's
>>not called __GFP_NEVERFAIL, after all.
> 
> 
> 	mempool_alloc does.  That's the point of it.  You calculate
> how many objects you need in order to guarantee no deadlocks and
> reserve that number in advance (the initial reservation can fail).

To rephrase that so it illustrates my point:  the whole reason to
use mempool is to try adding __GFP_NEVERFAIL when __GFP_WAIT is
given ... because __GFP_WAIT doesn't otherwise mean NEVERFAIL.

- Dave



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] generic device DMA (dma_pool update)
  2003-01-03  6:46 ` David Brownell
@ 2003-01-03  6:52   ` William Lee Irwin III
  0 siblings, 0 replies; 44+ messages in thread
From: William Lee Irwin III @ 2003-01-03  6:52 UTC (permalink / raw)
  To: David Brownell; +Cc: Adam J. Richter, linux-kernel

Adam J. Richter wrote:
>>	mempool_alloc does.  That's the point of it.  You calculate
>> how many objects you need in order to guarantee no deadlocks and
>> reserve that number in advance (the initial reservation can fail).

On Thu, Jan 02, 2003 at 10:46:43PM -0800, David Brownell wrote:
> To rephrase that so it illustrates my point:  the whole reason to
> use mempool is to try adding __GFP_NEVERFAIL when __GFP_WAIT is
> given ... because __GFP_WAIT doesn't otherwise mean NEVERFAIL.

Well, it's not quite that general.  There is a constraint that
objects allocated from a mempool must have a finite lifetime.  And
non-waiting mempool allocations are also available, but may fail.

So long as the queueing is fair everyone eventually gets their turn.


Bill

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2003-01-03  6:44 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-12-27 20:21 [RFT][PATCH] generic device DMA implementation David Brownell
2002-12-27 21:40 ` James Bottomley
2002-12-28  1:29   ` David Brownell
2002-12-28 16:18     ` James Bottomley
2002-12-28 18:16       ` David Brownell
2002-12-28  1:56   ` David Brownell
2002-12-28 16:13     ` James Bottomley
2002-12-28 17:41       ` David Brownell
2002-12-30 23:11     ` [PATCH] generic device DMA (dma_pool update) David Brownell
2002-12-31 15:00       ` James Bottomley
2002-12-31 17:04         ` David Brownell
2002-12-31 17:23           ` James Bottomley
2002-12-31 18:11             ` David Brownell
2002-12-31 18:44               ` James Bottomley
2002-12-31 19:29                 ` David Brownell
2002-12-31 19:50                   ` James Bottomley
2002-12-31 21:17                     ` David Brownell
2002-12-31 16:36       ` James Bottomley
2002-12-31 17:32         ` David Brownell
2002-12-27 21:47 ` [RFT][PATCH] generic device DMA implementation James Bottomley
2002-12-28  2:28   ` David Brownell
  -- strict thread matches above, loose matches on Subject: below --
2002-12-31 22:02 [PATCH] generic device DMA (dma_pool update) Adam J. Richter
2002-12-31 22:41 ` Andrew Morton
2002-12-31 23:23   ` David Brownell
2002-12-31 23:27     ` Andrew Morton
2002-12-31 23:44       ` David Brownell
2002-12-31 23:47     ` James Bottomley
2003-01-01 17:10   ` James Bottomley
2002-12-31 23:35 ` David Brownell
2002-12-31 23:38 Adam J. Richter
2003-01-01  0:02 Adam J. Richter
2003-01-01 19:21 Adam J. Richter
2003-01-01 19:48 ` James Bottomley
2003-01-02  2:11   ` David Brownell
2003-01-02  4:13 Adam J. Richter
2003-01-02 16:41 ` James Bottomley
2003-01-02 18:26 ` David Brownell
2003-01-02 17:04 Adam J. Richter
2003-01-02 22:07 Adam J. Richter
2003-01-03  0:20 ` Russell King
2003-01-03  4:50 ` David Brownell
2003-01-03  6:11 ` David Brownell
2003-01-03  6:46 ` David Brownell
2003-01-03  6:52   ` William Lee Irwin III

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox