* 32-bit dma allocations on 64-bit platforms
@ 2004-06-23 18:35 Terence Ripperda
2004-06-23 19:19 ` Jeff Garzik
2004-06-26 5:02 ` David Mosberger
0 siblings, 2 replies; 20+ messages in thread
From: Terence Ripperda @ 2004-06-23 18:35 UTC (permalink / raw)
To: Linux Kernel Mailing List; +Cc: Terence Ripperda
[-- Attachment #1: Type: text/plain, Size: 3008 bytes --]
I'm working on cleaning up some of our dma allocation code to properly allocate 32-bit physical pages for dma on 64-bit platforms. I think our first pass at supporting em64t is sub-par. I'd like to fix that by using the correct kernel interfaces.
From our early efforts in supporting AMD's x86_64, we've used the pci_map_sg/pci_map_single interface for remapping > 32-bit physical addresses through the system's IOMMU. Since Intel's em64t does not provide an IOMMU, the kernel falls back to a swiotlb to implement these interfaces. For our first pass at supporting em64t, we tried to work with the swiotlb, but this works very poorly.
We should have gone back and reviewed how we use the kernel interfaces and followed DMA-API.txt and DMA-mapping.txt. We're now working on using these interfaces (mainly pci_alloc_consistent), but we're still running into some general shortcomings of these interfaces. The main problem is the ability to allocate enough 32-bit addressable memory.
The physical addressing of memory allocations seems to boil down to the behavior of GFP_DMA and GFP_NORMAL. But there seems to be some disconnect between what these mean for each architecture and what various drivers expect them to mean.
Based on each architecture's paging_init routines, the zones look like this:
              x86:        ia64:   x86_64:
ZONE_DMA:     < 16M       < ~4G   < 16M
ZONE_NORMAL:  16M - ~1G   > ~4G   > 16M
ZONE_HIMEM:   1G+
An example of this disconnect is vmalloc_32. This function is obviously intended to allocate 32-bit addresses (this was specifically mentioned in a comment in 2.4.x header files), but vmalloc_32 is an inline routine that calls __vmalloc(GFP_KERNEL). Based on the above zone descriptions, this will do the correct thing for x86, but not for ia64 or x86_64. On ia64, a driver could just use GFP_DMA for the desired behavior, but this doesn't work well for x86_64.
AMD's x86_64 provides remapping of > 32-bit pages through the iommu, but obviously Intel's em64t provides no such ability. Based on the above zonings, this leaves us with the options of either relying on the swiotlb interfaces for dma, or relying on ISA memory for dma.
For the last day or two, I've been experimenting with using the pci_alloc_consistent interface, which uses the latter (note the attached patch to fix an apparent memory leak in the x86_64 pci_alloc_consistent). Unfortunately, this provides very little dma-able memory. In theory, up to 16 Megs, but in practice I'm only getting about 5 1/2 Megs.
I was rather surprised by these limitations on allocating 32-bit addresses. I checked through the dri and bttv drivers to see if they had dealt with these issues, and they did not appear to have done so. Has anyone tested these drivers on ia64/x86_64/em64t platforms w/ 4+ Gigs of memory?
Are these limitations on allocating 32-bit addresses intentional and known? Is there anything we can do to help improve this situation? Help work on development?
Thanks,
Terence
[-- Attachment #2: pci-gart.patch --]
[-- Type: text/plain, Size: 330 bytes --]
--- pci-gart.c.old 2004-06-21 18:33:29.000000000 -0500
+++ pci-gart.c.new 2004-06-21 18:33:57.000000000 -0500
@@ -211,6 +211,7 @@
if (no_iommu || dma_mask < 0xffffffffUL) {
if (high) {
if (!(gfp & GFP_DMA)) {
+ free_pages((unsigned long)memory, get_order(size));
gfp |= GFP_DMA;
goto again;
}
^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-23 18:35 32-bit dma allocations on 64-bit platforms Terence Ripperda
@ 2004-06-23 19:19 ` Jeff Garzik
  2004-06-26  5:05   ` David Mosberger
  2004-06-26  5:02 ` David Mosberger
  1 sibling, 1 reply; 20+ messages in thread
From: Jeff Garzik @ 2004-06-23 19:19 UTC (permalink / raw)
To: Terence Ripperda; +Cc: Linux Kernel Mailing List

Terence Ripperda wrote:

Fix your word wrap.

> I'm working on cleaning up some of our dma allocation code to properly allocate 32-bit physical pages for dma on 64-bit platforms. I think our first pass at supporting em64t is sub-par. I'd like to fix that by using the correct kernel interfaces.
>
> From our early efforts in supporting AMD's x86_64, we've used the pci_map_sg/pci_map_single interface for remapping > 32-bit physical addresses through the system's IOMMU. Since Intel's em64t does not provide an IOMMU, the kernel falls back to a swiotlb to implement these interfaces. For our first pass at supporting em64t, we tried to work with the swiotlb, but this works very poorly.

swiotlb was a dumb idea when it hit ia64, and it's now been propagated
to x86-64 :(

> We should have gone back and reviewed how we use the kernel interfaces and followed DMA-API.txt and DMA-mapping.txt. We're now working on using these interfaces (mainly pci_alloc_consistent), but we're still running into some general shortcomings of these interfaces. The main problem is the ability to allocate enough 32-bit addressable memory.
>
> The physical addressing of memory allocations seems to boil down to the behavior of GFP_DMA and GFP_NORMAL. But there seems to be some disconnect between what these mean for each architecture and what various drivers expect them to mean.
>
> Based on each architecture's paging_init routines, the zones look like this:
>
>               x86:        ia64:   x86_64:
> ZONE_DMA:     < 16M       < ~4G   < 16M
> ZONE_NORMAL:  16M - ~1G   > ~4G   > 16M
> ZONE_HIMEM:   1G+
>
> An example of this disconnect is vmalloc_32. This function is obviously intended to allocate 32-bit addresses (this was specifically mentioned in a comment in 2.4.x header files), but vmalloc_32 is an inline routine that calls __vmalloc(GFP_KERNEL). Based on the above zone descriptions, this will do the correct thing for x86, but not for ia64 or x86_64. On ia64, a driver could just use GFP_DMA for the desired behavior, but this doesn't work well for x86_64.
>
> AMD's x86_64 provides remapping of > 32-bit pages through the iommu, but obviously Intel's em64t provides no such ability. Based on the above zonings, this leaves us with the options of either relying on the swiotlb interfaces for dma, or relying on ISA memory for dma.

FWIW, note that there are two main considerations:

Higher-level layers (block, net) provide bounce buffers when needed, as
you don't want to do that purely with iommu.

Once you have bounce buffers properly allocated by <something>
(swiotlb? special DRM bounce buffer allocator?), you then pci_map the
bounce buffers.

> For the last day or two, I've been experimenting with using the pci_alloc_consistent interface, which uses the latter (note the attached patch to fix an apparent memory leak in the x86_64 pci_alloc_consistent). Unfortunately, this provides very little dma-able memory. In theory, up to 16 Megs, but in practice I'm only getting about 5 1/2 Megs.
>
> I was rather surprised by these limitations on allocating 32-bit addresses. I checked through the dri and bttv drivers to see if they had dealt with these issues, and they did not appear to have done so. Has anyone tested these drivers on ia64/x86_64/em64t platforms w/ 4+ Gigs of memory?
>
> Are these limitations on allocating 32-bit addresses intentional and known? Is there anything we can do to help improve this situation? Help work on development?

Sounds like you're not setting the PCI DMA mask properly, or perhaps
passing NULL rather than a struct pci_dev to the PCI DMA API?

	Jeff

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-23 19:19 ` Jeff Garzik
@ 2004-06-26  5:05   ` David Mosberger
  2004-06-26  7:16     ` Arjan van de Ven
  0 siblings, 1 reply; 20+ messages in thread
From: David Mosberger @ 2004-06-26 5:05 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Terence Ripperda, Linux Kernel Mailing List

>>>>> On Wed, 23 Jun 2004 15:19:22 -0400, Jeff Garzik <jgarzik@pobox.com> said:

  Jeff> swiotlb was a dumb idea when it hit ia64, and it's now been propagated
  Jeff> to x86-64 :(

If it's such a dumb idea, why not submit a better solution?

	--david

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-26  5:05   ` David Mosberger
@ 2004-06-26  7:16     ` Arjan van de Ven
  2004-06-29  6:13       ` David Mosberger
  0 siblings, 1 reply; 20+ messages in thread
From: Arjan van de Ven @ 2004-06-26 7:16 UTC (permalink / raw)
To: davidm; +Cc: Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 462 bytes --]

On Sat, 2004-06-26 at 07:05, David Mosberger wrote:
> >>>>> On Wed, 23 Jun 2004 15:19:22 -0400, Jeff Garzik <jgarzik@pobox.com> said:
>
>   Jeff> swiotlb was a dumb idea when it hit ia64, and it's now been propagated
>   Jeff> to x86-64 :(
>
> If it's such a dumb idea, why not submit a better solution?

the real solution is an iommu of course, but the highmem solution has
quite some merit too..... I know you disagree with me on that one
though.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-26  7:16     ` Arjan van de Ven
@ 2004-06-29  6:13       ` David Mosberger
  2004-06-29  6:55         ` Arjan van de Ven
  2004-06-30  8:00         ` Jes Sorensen
  0 siblings, 2 replies; 20+ messages in thread
From: David Mosberger @ 2004-06-29 6:13 UTC (permalink / raw)
To: arjanv; +Cc: davidm, Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List

>>>>> On Sat, 26 Jun 2004 09:16:27 +0200, Arjan van de Ven <arjanv@redhat.com> said:

  Arjan> the real solution is an iommu of course, but the highmem
  Arjan> solution has quite some merit too..... I know you disagree
  Arjan> with me on that one though.

Yes, some merits and some faults.  The real solution is iommu or
64-bit capable devices.  Interesting that graphics controllers should
be last to get 64-bit DMA capability, considering how much more
complex they are than disk controllers or NICs.

	--david

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-29  6:13       ` David Mosberger
@ 2004-06-29  6:55         ` Arjan van de Ven
  2004-06-30  8:00         ` Jes Sorensen
  1 sibling, 0 replies; 20+ messages in thread
From: Arjan van de Ven @ 2004-06-29 6:55 UTC (permalink / raw)
To: davidm; +Cc: Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 661 bytes --]

On Mon, Jun 28, 2004 at 11:13:12PM -0700, David Mosberger wrote:
> >>>>> On Sat, 26 Jun 2004 09:16:27 +0200, Arjan van de Ven <arjanv@redhat.com> said:
>
>   Arjan> the real solution is an iommu of course, but the highmem
>   Arjan> solution has quite some merit too..... I know you disagree
>   Arjan> with me on that one though.
>
> Yes, some merits and some faults.  The real solution is iommu or
> 64-bit capable devices.  Interesting that graphics controllers should
> be last to get 64-bit DMA capability, considering how much more
> complex they are than disk controllers or NICs.

I guess the first game with more than 4Gb in textures will fix it ;)

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-29  6:13       ` David Mosberger
  2004-06-29  6:55         ` Arjan van de Ven
@ 2004-06-30  8:00         ` Jes Sorensen
  1 sibling, 0 replies; 20+ messages in thread
From: Jes Sorensen @ 2004-06-30 8:00 UTC (permalink / raw)
To: davidm; +Cc: arjanv, Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List

>>>>> "David" == David Mosberger <davidm@napali.hpl.hp.com> writes:
>>>>> On Sat, 26 Jun 2004 09:16:27 +0200, Arjan van de Ven <arjanv@redhat.com> said:

  Arjan> the real solution is an iommu of course, but the highmem
  Arjan> solution has quite some merit too..... I know you disagree with
  Arjan> me on that one though.

  David> Yes, some merits and some faults. The real solution is iommu
  David> or 64-bit capable devices. Interesting that graphics
  David> controllers should be last to get 64-bit DMA capability,
  David> considering how much more complex they are than disk
  David> controllers or NICs.

You found a 64 bit capable sound card yet? ;-)

Cheers,
Jes

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-23 18:35 32-bit dma allocations on 64-bit platforms Terence Ripperda
  2004-06-23 19:19 ` Jeff Garzik
@ 2004-06-26  5:02 ` David Mosberger
  1 sibling, 0 replies; 20+ messages in thread
From: David Mosberger @ 2004-06-26 5:02 UTC (permalink / raw)
To: Terence Ripperda; +Cc: Linux Kernel Mailing List

Terence,

>>>>> On Wed, 23 Jun 2004 13:35:35 -0500, Terence Ripperda <tripperda@nvidia.com> said:

  Terence> based on each architecture's paging_init routines, the
  Terence> zones look like this:

  Terence>               x86:        ia64:   x86_64:
  Terence> ZONE_DMA:     < 16M       < ~4G   < 16M
  Terence> ZONE_NORMAL:  16M - ~1G   > ~4G   > 16M
  Terence> ZONE_HIMEM:   1G+

Not that it matters here, but for correctness let me note that the
ia64 column is correct only for machines which don't have an I/O MMU.
With I/O MMU, ZONE_DMA will have the same coverage as ZONE_NORMAL with
a recent enough kernel (older kernels had a bug which limited ZONE_DMA
to < 4GB, but that was unintentional).

	--david

^ permalink raw reply	[flat|nested] 20+ messages in thread
[parent not found: <2akPm-16l-65@gated-at.bofh.it>]
* Re: 32-bit dma allocations on 64-bit platforms
       [not found] <2akPm-16l-65@gated-at.bofh.it>
@ 2004-06-23 21:46 ` Andi Kleen
  2004-06-24  6:18   ` Arjan van de Ven
  0 siblings, 1 reply; 20+ messages in thread
From: Andi Kleen @ 2004-06-23 21:46 UTC (permalink / raw)
To: Terence Ripperda; +Cc: discuss, tiwai, linux-kernel

Terence Ripperda <tripperda@nvidia.com> writes:

[sending again with linux-kernel in cc]

> I'm working on cleaning up some of our dma allocation code to properly allocate 32-bit physical pages for dma on 64-bit platforms. I think our first pass at supporting em64t is sub-par. I'd like to fix that by using the correct kernel interfaces.

I get from this that your hardware cannot DMA to >32bit.

> > The physical addressing of memory allocations seems to boil down to the behavior of GFP_DMA and GFP_NORMAL. But there seems to be some disconnect between what these mean for each architecture and what various drivers expect them to mean.
> >
> > Based on each architecture's paging_init routines, the zones look like this:
> >
> >               x86:        ia64:   x86_64:
> > ZONE_DMA:     < 16M       < ~4G   < 16M
> > ZONE_NORMAL:  16M - ~1G   > ~4G   > 16M
> > ZONE_HIMEM:   1G+
> >
> > An example of this disconnect is vmalloc_32. This function is obviously intended to allocate 32-bit addresses (this was specifically mentioned in a comment in 2.4.x header files), but vmalloc_32 is an inline routine that calls __vmalloc(GFP_KERNEL). Based on the above zone descriptions, this will do the correct thing for x86, but not for ia64 or x86_64. On ia64, a driver could just use GFP_DMA for the desired behavior, but this doesn't work well for x86_64.
> >
> > AMD's x86_64 provides remapping of > 32-bit pages through the iommu, but obviously Intel's em64t provides no such ability. Based on the above zonings, this leaves us with the options of either relying on the swiotlb interfaces for dma, or relying on ISA memory for dma.
> >
> > For the last day or two, I've been experimenting with using the pci_alloc_consistent interface, which uses the latter (note the attached patch to fix an apparent memory leak in the x86_64 pci_alloc_consistent). Unfortunately, this provides very little dma-able memory. In theory, up to 16 Megs, but in practice I'm only getting about 5 1/2 Megs.
> >
> > I was rather surprised by these limitations on allocating 32-bit addresses. I checked through the dri and bttv drivers to see if they had dealt with these issues, and they did not appear to have done so. Has anyone tested these drivers on ia64/x86_64/em64t platforms w/ 4+ Gigs of memory?
> >
> > Are these limitations on allocating 32-bit addresses intentional and known? Is there anything we can do to help improve this situation? Help work on development?

First, vmalloc_32 is a rather broken interface and should imho just be
removed. The function name just gives promises that cannot be kept. It
was always quite bogus. Please don't use it.

The situation on EM64T is very unfortunate, I agree. There was a reason
we asked AMD to add an IOMMU, and it's quite bad that the Intel chipset
people ignored that wisdom and put us into this compatibility mess.
Failing that, it would be best if the other PCI DMA hardware could just
address enough memory, but that's less realistic than just fixing the
chipset.

The x86-64 port had decided early to keep the 16MB GFP_DMA zone to get
maximum driver compatibility and because the AMD IOMMU gave us a nice
alternative over bounce buffering.

In theory I'm not totally against enlarging GFP_DMA a bit on x86-64. It
would just be difficult to find a good value. The problem is that there
may be existing drivers that rely on the 16MB limit, and it would not
be very nice to break them. We got rid of a lot of them by disallowing
CONFIG_ISA, but there may be some left. So before doing this it would
need a full driver tree audit to check every device. The most prominent
example used to be the floppy driver, but the current floppy driver
seems to use some other way to get around this. There seem to be quite
a few sound chipsets with DMA limits < 32bit; e.g. 29 bits seems to be
quite common, but I see several 24bit PCI ones too.

I must say I'm somewhat reluctant to break a working in-tree driver,
especially for the sake of an out-of-tree binary driver. Arguably the
problem is probably not limited to you; it's quite possible that even
the in-tree DRI drivers have it, so it would still be worth fixing.

I see two somewhat realistic ways to handle this:

- We enlarge GFP_DMA and find some way to do double buffering for these
  sound drivers (it would need a PCI-DMA API extension that always
  calls swiotlb for this). For sound that's not too bad, because they
  are relatively slow. It would require reserving bootmem memory early
  for the bounces, but I guess requiring the user to pass a special
  boot-time parameter for these devices would be reasonable. If yes,
  someone would need to do this work. Also the question would be how
  large to make GFP_DMA. Ideally it should not be too big, so that e.g.
  29bit devices don't require the bounce buffering.

- We introduce multiple GFP_DMA zones: keep the 16MB GFP_DMA and add
  GFP_BIGDMA or somesuch for larger DMA. The VM should be able to
  handle this, but it may still require some tuning. It would need some
  generic changes, but not too bad. Still would need a decision on how
  big GFP_BIGDMA should be. I suspect 4GB would be too big again.

Comments?

-Andi

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-23 21:46 ` Andi Kleen
@ 2004-06-24  6:18   ` Arjan van de Ven
  2004-06-24 10:33     ` Andi Kleen
  2004-06-24 13:48     ` Jesse Barnes
  0 siblings, 2 replies; 20+ messages in thread
From: Arjan van de Ven @ 2004-06-24 6:18 UTC (permalink / raw)
To: Andi Kleen; +Cc: Terence Ripperda, discuss, tiwai, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 636 bytes --]

On Wed, 2004-06-23 at 23:46, Andi Kleen wrote:
> The VM should be able to handle this, but it may still require
> some tuning. It would need some generic changes, but not too bad.
> Still would need a decision on how big GFP_BIGDMA should be.
> I suspect 4GB would be too big again.

What is the problem again? Can't the driver use the dynamic pci mapping
API, which does allow more memory to be mapped even on crippled machines
without an iommu?

And isn't this a problem that will vanish since PCI Express and PCI-X
both *require* support for 64 bit addressing, so all higher speed cards
are going to be ok in principle?

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24  6:18   ` Arjan van de Ven
@ 2004-06-24 10:33     ` Andi Kleen
  0 siblings, 0 replies; 20+ messages in thread
From: Andi Kleen @ 2004-06-24 10:33 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Andi Kleen, Terence Ripperda, discuss, tiwai, linux-kernel

On Thu, Jun 24, 2004 at 08:18:06AM +0200, Arjan van de Ven wrote:
> On Wed, 2004-06-23 at 23:46, Andi Kleen wrote:
>
> > The VM should be able to handle this, but it may still require
> > some tuning. It would need some generic changes, but not too bad.
> > Still would need a decision on how big GFP_BIGDMA should be.
> > I suspect 4GB would be too big again.
>
> What is the problem again, can't the driver use the dynamic pci mapping
> API which does allow more memory to be mapped even on crippled machines
> without iommu ?

In theory one could make pci_alloc_consistent allocate from the swiotlb
pool, yes; the problem is just that this pool is completely
preallocated. If enough memory is needed that would be quite nasty,
because you suddenly lose 1 or 2GB of RAM.

> And isn't this a problem that will vanish since PCI Express and PCI X
> both *require* support for 64 bit addressing, so all higher speed cards
> are going to be ok in principle ?

There are EM64T systems with AGP only, and not all PCI-Express cards
seem to follow this. PCI-Express unfortunately discouraged the AGP
aperture too, so not even that can be used on those Intel systems.

-Andi

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24  6:18   ` Arjan van de Ven
  2004-06-24 10:33     ` Andi Kleen
@ 2004-06-24 13:48     ` Jesse Barnes
  2004-06-24 14:39       ` Terence Ripperda
  1 sibling, 1 reply; 20+ messages in thread
From: Jesse Barnes @ 2004-06-24 13:48 UTC (permalink / raw)
To: arjanv; +Cc: Andi Kleen, Terence Ripperda, discuss, tiwai, linux-kernel

On Thursday, June 24, 2004 2:18 am, Arjan van de Ven wrote:
> What is the problem again, can't the driver use the dynamic pci mapping
> API which does allow more memory to be mapped even on crippled machines
> without iommu ?
> And isn't this a problem that will vanish since PCI Express and PCI X
> both *require* support for 64 bit addressing, so all higher speed cards
> are going to be ok in principle ?

Well, PCI-X may require it, but there certainly are PCI-X devices that
don't do 64 bit addressing, or if they do, it's a crippled
implementation (e.g. top 32 bits have to be constant).

Jesse

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 13:48     ` Jesse Barnes
@ 2004-06-24 14:39       ` Terence Ripperda
  0 siblings, 0 replies; 20+ messages in thread
From: Terence Ripperda @ 2004-06-24 14:39 UTC (permalink / raw)
To: Jesse Barnes
Cc: arjanv, Andi Kleen, Terence Ripperda, discuss, tiwai, linux-kernel

Correct. I checked with my contacts here on the PCI Express
requirements. Apparently the spec says "A PCI Express Endpoint
operating as the Requester of a Memory Transaction is required to be
capable of generating addresses greater than 4GB", but my contact
claims this is a "soft" requirement.

But even if all PCI-X and PCI-E devices properly addressed the full 64
bits, legacy 32-bit PCI devices can be plugged into the motherboards as
well. My Intel em64t boards have mostly PCI-X slots but one PCI slot,
and my AMD x86_64 boards have all PCI slots (aside from the main PCI-E
slot). Also, at least one motherboard manufacturer claims PCI-E + AGP,
but the AGP is really just an AGP form-factor slot on the PCI bus.

Thanks,
Terence

On Thu, Jun 24, 2004 at 06:48:07AM -0700, jbarnes@engr.sgi.com wrote:
> On Thursday, June 24, 2004 2:18 am, Arjan van de Ven wrote:
> > What is the problem again, can't the driver use the dynamic pci mapping
> > API which does allow more memory to be mapped even on crippled machines
> > without iommu ?
> > And isn't this a problem that will vanish since PCI Express and PCI X
> > both *require* support for 64 bit addressing, so all higher speed cards
> > are going to be ok in principle ?
>
> Well, PCI-X may require it, but there certainly are PCI-X devices that
> don't do 64 bit addressing, or if they do, it's a crippled
> implementation (e.g. top 32 bits have to be constant).
>
> Jesse

^ permalink raw reply	[flat|nested] 20+ messages in thread
[parent not found: <m3acyu6pwd.fsf@averell.firstfloor.org>]
[parent not found: <20040623213643.GB32456@hygelac>]
* Re: 32-bit dma allocations on 64-bit platforms
       [not found] ` <20040623213643.GB32456@hygelac>
@ 2004-06-23 23:46   ` Andi Kleen
  2004-06-24 11:13     ` Takashi Iwai
  2004-06-24 15:44     ` Terence Ripperda
  0 siblings, 2 replies; 20+ messages in thread
From: Andi Kleen @ 2004-06-23 23:46 UTC (permalink / raw)
To: Terence Ripperda; +Cc: Andi Kleen, discuss, tiwai, linux-kernel, andrea

On Wed, Jun 23, 2004 at 04:36:43PM -0500, Terence Ripperda wrote:
> > The x86-64 port had decided early to keep the 16MB GFP_DMA zone
> > to get maximum driver compatibility and because the AMD IOMMU gave
> > us a nice alternative over bounce buffering.
>
> that was a very understandable decision. and I do agree that using the AMD IOMMU is a very nice architecture. it is unfortunate to have to deal with this on EM64T. Will AMD's pci-express chipsets still maintain an IOMMU, even if it's not needed for AGP anymore? (probably not public information, I'll check via my channels).

The IOMMU is actually implemented in the CPU northbridge on K8, so yes.
I hope they'll keep it in future CPUs too.

> > I must say I'm somewhat reluctant to break a working in-tree driver,
> > especially for the sake of an out-of-tree binary driver. Arguably the
> > problem is probably not limited to you; it's quite possible that even
> > the in-tree DRI drivers have it, so it would still be worth fixing.
>
> agreed. I completely understand that there is no desire to modify the core kernel to help our driver. that's one of the reasons I looked through the other drivers, as I suspect that this is a problem for many drivers. I only looked through the code for each briefly, but didn't see anything to handle this. I suspect it's more of a case that the drivers have not been stress tested on an x86_64 machine w/ 4+ G of memory.

We usually handle it using the swiotlb, which works.

pci_alloc_consistent is limited to 16MB, but so far nobody has really
complained about that. If that should be a real issue we can make it
allocate from the swiotlb pool, which is usually 64MB (and can be made
bigger at boot time).

Would that work for you too BTW? How much memory do you expect to need?

The drawback is that the swiotlb pool is not unified with the rest of
the VM, so tying up too much memory there is quite unfriendly. E.g. if
you can use up 1GB then I wouldn't consider this suitable; for 128MB
max it may be possible.

> > I see two somewhat realistic ways to handle this:
>
> either of those approaches sounds good. keeping compatibility with older devices/drivers is certainly a good thing.
>
> can the core kernel handle multiple new zones? I haven't looked at the code, but the zones seem to always be ZONE_DMA and ZONE_NORMAL, with some architectures adding ZONE_HIMEM at the end. if you add a ZONE_DMA_32 (or whatever) between ZONE_DMA and ZONE_NORMAL, will the core vm code be able to handle that? I guess one could argue if it can't yet, it should be able to, then each architecture could create as many zones as they wanted.

Sure, we create multiple zones on NUMA systems (even on x86-64). Each
node has one zone. But they're all ZONE_NORMAL. And the first node has
two zones, one ZONE_DMA and one ZONE_NORMAL (actually the others have a
ZONE_DMA too, but it's empty).

Multiple ZONE_DMA zones would be a novelty, but may be doable (I have
not checked all the implications of this, but I don't immediately see
any show stopper; maybe someone like Andrea can correct me on that). It
will probably be a somewhat intrusive patch though.

> another brainstorm: instead of counting on just a large-grained zone and call to __get_free_pages() returning an allocation within a given bit-range, perhaps there could be large-grained zones, with a fine-grained hint used to look for a subset within the zone. for example, there could be a DMA32 zone, but a mask w/ 24- or 29- bits enabled could be used to scan the DMA32 zone for a valid address. (don't know how well that fits into the current architecture).

Not very well. Or rather, the allocation would not be O(1) anymore
because you would need to scan the queues. That could be still
tolerable, but when there are no pages you have to call the VM and then
teach try_to_free_pages and friends that you are only interested in
pages under some mask. And that would probably get quite nasty. I did
something like this in 2.4 for an old prototype of the NUMA API, but it
never worked very well and also was quite ugly. Multiple zones are
probably better.

One of the reasons we rejected this early when the x86-64 port was
designed was that the VM had quite bad zone balancing problems at that
time. It should be better now though, or at least the NUMA setup works
reasonably well. But NUMA zones tend to be a lot bigger than DMA zones
and don't show all the corner cases.

-Andi

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-23 23:46   ` Andi Kleen
@ 2004-06-24 11:13     ` Takashi Iwai
  2004-06-24 14:45       ` Terence Ripperda
  1 sibling, 1 reply; 20+ messages in thread
From: Takashi Iwai @ 2004-06-24 11:13 UTC (permalink / raw)
To: Andi Kleen; +Cc: Terence Ripperda, discuss, linux-kernel, andrea

At 24 Jun 2004 01:46:44 +0200, Andi Kleen wrote:
>
> > > I must say I'm somewhat reluctant to break a working in-tree driver,
> > > especially for the sake of an out-of-tree binary driver. Arguably the
> > > problem is probably not limited to you; it's quite possible that even
> > > the in-tree DRI drivers have it, so it would still be worth fixing.
> >
> > agreed. I completely understand that there is no desire to modify the core kernel to help our driver. that's one of the reasons I looked through the other drivers, as I suspect that this is a problem for many drivers. I only looked through the code for each briefly, but didn't see anything to handle this. I suspect it's more of a case that the drivers have not been stress tested on an x86_64 machine w/ 4+ G of memory.
>
> We usually handle it using the swiotlb, which works.
>
> pci_alloc_consistent is limited to 16MB, but so far nobody has really
> complained about that. If that should be a real issue we can make
> it allocate from the swiotlb pool, which is usually 64MB (and can
> be made bigger at boot time)

Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
allocated pages are out of dma mask, just like in pci-gart.c?
(with ifdef x86-64)

Takashi

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 11:13     ` Takashi Iwai
@ 2004-06-24 14:45       ` Terence Ripperda
  2004-06-24 15:41         ` Andrea Arcangeli
  0 siblings, 1 reply; 20+ messages in thread
From: Terence Ripperda @ 2004-06-24 14:45 UTC (permalink / raw)
To: Takashi Iwai; +Cc: Andi Kleen, Terence Ripperda, discuss, linux-kernel, andrea

On Thu, Jun 24, 2004 at 04:13:47AM -0700, tiwai@suse.de wrote:
> > pci_alloc_consistent is limited to 16MB, but so far nobody has really
> > complained about that. If that should be a real issue we can make
> > it allocate from the swiotlb pool, which is usually 64MB (and can
> > be made bigger at boot time)
>
> Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
> allocated pages are out of dma mask, just like in pci-gart.c?
> (with ifdef x86-64)

pci_alloc_consistent (at least on x86-64) does do this. One of the
problems I've seen in experimentation is that GFP_KERNEL tends to
allocate from the top of memory down. This means that all of the
GFP_KERNEL allocations are > 32-bits, which forces us down to GFP_DMA
and the < 16M allocations.

I've mainly tested this after a cold boot, so after running for a
while, GFP_KERNEL may hit good allocations a lot more.

Thanks,
Terence

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 14:45 ` Terence Ripperda
@ 2004-06-24 15:41   ` Andrea Arcangeli
  0 siblings, 0 replies; 20+ messages in thread
From: Andrea Arcangeli @ 2004-06-24 15:41 UTC (permalink / raw)
  To: Terence Ripperda; +Cc: Takashi Iwai, Andi Kleen, discuss, linux-kernel

On Thu, Jun 24, 2004 at 09:45:51AM -0500, Terence Ripperda wrote:
> On Thu, Jun 24, 2004 at 04:13:47AM -0700, tiwai@suse.de wrote:
> > > pci_alloc_consistent is limited to 16MB, but so far nobody has really
> > > complained about that. If that should be a real issue we can make
> > > it allocate from the swiotlb pool, which is usually 64MB (and can
> > > be made bigger at boot time)
> > 
> > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
> > allocated pages are out of dma mask, just like in pci-gart.c?
> > (with ifdef x86-64)
> 
> pci_alloc_consistent (at least on x86-64) does do this. one of the
> problems I've seen in experimentation is that GFP_KERNEL tends to
> allocate from the top of memory down. this means that all of the
> GFP_KERNEL allocations are > 32 bits, which forces us down to GFP_DMA
> and the < 16M allocations.
> 
> I've mainly tested this after a cold boot, so after running for a
> while, GFP_KERNEL may hit good allocations a lot more.

it's trivial to change the order in the freelist to allocate from lower
addresses first, but the point is still that over time it will be
random. the 16M must be reserved entirely for __GFP_DMA on any machine
with >= 1G of ram, and the lowmem_reserve_ratio algorithm accomplishes
this; it scales down the reserve ratio according to the balance between
the lowmem and dma zones. I believe if anything you should try
GFP_KERNEL if GFP_DMA fails, not the other way around.

btw, 2.6 is even more efficient at shrinking and paging out the dma
zone than 2.4 could be.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-23 23:46     ` Andi Kleen
  2004-06-24 11:13       ` Takashi Iwai
@ 2004-06-24 15:44       ` Terence Ripperda
  2004-06-24 18:51         ` Andi Kleen
  1 sibling, 1 reply; 20+ messages in thread
From: Terence Ripperda @ 2004-06-24 15:44 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Terence Ripperda, discuss, tiwai, linux-kernel, andrea

On Wed, Jun 23, 2004 at 04:46:44PM -0700, ak@muc.de wrote:
> pci_alloc_consistent is limited to 16MB, but so far nobody has really
> complained about that. If that should be a real issue we can make
> it allocate from the swiotlb pool, which is usually 64MB (and can
> be made bigger at boot time)

In all of the cases I've seen, it defaults to 4M. in swiotlb.c,
io_tlb_nslabs defaults to 1024, * PAGE_SIZE == 4194304.

> Would that work for you too BTW ? How much memory do you expect
> to need?

potentially. our currently pending release uses pci_map_sg, which relies
on the swiotlb for em64t systems. it "works", but we have some ugly
hacks and were hoping to get away from using it (at least in its current
form). here are some of the problems we encountered:

probably the biggest problem is that the size is way too small for our
needs (more on our memory usage shortly). this is compounded by the
swiotlb code throwing a kernel panic when it can't allocate memory, and
if the panic doesn't halt the machine, the routine returns a random
value off the stack as the dma_addr_t. for this reason, we have an ugly
hack that notices when the swiotlb is enabled (it just checks whether
swiotlb is set) and prints a warning telling the user to bump the size
of the swiotlb up to 16384, or 64M.

also, the proper usage of the bounce buffers, calling pci_dma_sync_*,
would be a performance killer for us. we stream a considerable amount of
data to the gpu per second (on the order of 100s of megs a second), so
having to do an additional memcpy would reduce performance considerably,
in some cases by 30-50%. for this reason, we detect when
dma_addr != phys_addr and map the dma_addr directly to opengl to avoid
the copy. I know this is ugly, and that's one of the things I'd really
like to get away from.

finally, our driver already uses a considerable amount of memory. by
definition, the swiotlb interface doubles that memory usage. if our
driver used the swiotlb correctly (as in, didn't know about the swiotlb
and always called pci_dma_sync_*), we'd lock down the physical addresses
opengl writes to, since they're normally used directly for dma, plus the
pages allocated from the swiotlb would be locked down (currently we do
this manually, but if the swiotlb is supposed to be transparent to the
driver and used for dma, I assume it must already be locked down,
perhaps by definition of being bootmem?). this means not only is the
memory usage doubled, but it's all locked down and unpageable.

in this case, it almost would make more sense to treat the bootmem
allocated for the swiotlb as a pool of 32-bit memory that can be
directly allocated from, rather than as bounce buffers. I don't know
that this would be an acceptable interface though. but if we could come
up with reasonable solutions to these problems, this may work.

> drawback is that the swiotlb pool is not unified with the rest of the
> VM, so tying up too much memory there is quite unfriendly.
> e.g. if you can use up 1GB then I wouldn't consider this suitable,
> for 128MB max it may be possible.

I checked with our opengl developers on this. by default, we allocate
~64k for X's push buffer and ~1M per opengl client for their push
buffer. on quadro/workstation parts, we allocate 20M for the first
opengl client, then ~1M per client after that.

in addition to the push buffer, there is a lot of data that apps dump
through the push buffer. this includes textures, vertex buffers, display
lists, etc. the amount of memory used for this varies greatly from app
to app. the 20M listed above includes the push buffer and memory for
these buffers (I think workstation apps tend to push a lot more
pre-processed vertex data to the gpu). note that most agp apertures
these days are in the 128M - 1024M range, and there are times that we
exhaust that memory on the low end.

I think our driver is greedy when trying to allocate memory for
performance reasons, but has good fallback cases. so being somewhat
limited on resources isn't too bad, just so long as the kernel doesn't
panic instead of failing the memory allocation. I would think that 64M
or 128M would be good. a nice feature of the swiotlb is the ability to
tune it at boot, so if a workstation user found they really did need
more memory for performance, they could tweak that value up themselves.

also remember future growth. PCI-E has something like 20/24 lanes that
can be split among multiple PCI-E slots. Alienware has already announced
multi-card products, and it's likely multi-card products will be more
readily available on PCI-E, since the slots should have equivalent
bandwidth (unlike AGP+PCI). nvidia has also had workstation parts in the
past with 2 gpus and a bridge chip. each of these gpus ran twinview, so
each card drove 4 monitors. these were pci cards, and some crazy vendors
had 4+ of these cards in a machine driving many monitors. this just
pushes the memory requirements up in special circumstances.

Thanks,
Terence

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 15:44 ` Terence Ripperda
@ 2004-06-24 18:51   ` Andi Kleen
  2004-06-26  4:58     ` David Mosberger
  0 siblings, 1 reply; 20+ messages in thread
From: Andi Kleen @ 2004-06-24 18:51 UTC (permalink / raw)
  To: Terence Ripperda; +Cc: Andi Kleen, discuss, tiwai, linux-kernel, andrea

On Thu, Jun 24, 2004 at 10:44:29AM -0500, Terence Ripperda wrote:
> On Wed, Jun 23, 2004 at 04:46:44PM -0700, ak@muc.de wrote:
> > pci_alloc_consistent is limited to 16MB, but so far nobody has really
> > complained about that. If that should be a real issue we can make
> > it allocate from the swiotlb pool, which is usually 64MB (and can
> > be made bigger at boot time)
> 
> In all of the cases I've seen, it defaults to 4M. in swiotlb.c,
> io_tlb_nslabs defaults to 1024, * PAGE_SIZE == 4194304.

I checked this now. It's

	#define IO_TLB_SHIFT 11

	static unsigned long io_tlb_nslabs = 1024;

and the allocation does

	io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs * (1 << IO_TLB_SHIFT));

which, contrary to its name, does not allocate in pages (otherwise you
would get 8GB of memory on x86-64 and even more on IA64).

That's definitely far too small. A better IO_TLB_SHIFT would be 16 or 17.

-Andi

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 18:51 ` Andi Kleen
@ 2004-06-26  4:58   ` David Mosberger
  0 siblings, 0 replies; 20+ messages in thread
From: David Mosberger @ 2004-06-26 4:58 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Terence Ripperda, discuss, tiwai, linux-kernel, andrea

>>>>> On Thu, 24 Jun 2004 20:51:56 +0200, Andi Kleen <ak@muc.de> said:

  Andi> A better IO_TLB_SHIFT would be 16 or 17.

Careful. I see code like this:

	stride = (1 << (PAGE_SHIFT - IO_TLB_SHIFT));

You probably don't want IO_TLB_SHIFT > PAGE_SHIFT...

Increasing io_tlb_nslabs should be no problem though (subject to memory
availability). It can already be set via the "swiotlb" option.

I doubt the swiotlb is the right thing here, though, given the bandwidth
demands of graphics. Too bad Nvidia cards don't support > 32-bit
addressability and Intel chipsets don't support I/O MMUs...

	--david

^ permalink raw reply	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2004-06-30 8:12 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-06-23 18:35 32-bit dma allocations on 64-bit platforms Terence Ripperda
2004-06-23 19:19 ` Jeff Garzik
2004-06-26 5:05 ` David Mosberger
2004-06-26 7:16 ` Arjan van de Ven
2004-06-29 6:13 ` David Mosberger
2004-06-29 6:55 ` Arjan van de Ven
2004-06-30 8:00 ` Jes Sorensen
2004-06-26 5:02 ` David Mosberger
[not found] <2akPm-16l-65@gated-at.bofh.it>
2004-06-23 21:46 ` Andi Kleen
2004-06-24 6:18 ` Arjan van de Ven
2004-06-24 10:33 ` Andi Kleen
2004-06-24 13:48 ` Jesse Barnes
2004-06-24 14:39 ` Terence Ripperda
[not found] <m3acyu6pwd.fsf@averell.firstfloor.org>
[not found] ` <20040623213643.GB32456@hygelac>
2004-06-23 23:46 ` Andi Kleen
2004-06-24 11:13 ` Takashi Iwai
2004-06-24 14:45 ` Terence Ripperda
2004-06-24 15:41 ` Andrea Arcangeli
2004-06-24 15:44 ` Terence Ripperda
2004-06-24 18:51 ` Andi Kleen
2004-06-26 4:58 ` David Mosberger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox