* 32-bit dma allocations on 64-bit platforms
@ 2004-06-23 18:35 Terence Ripperda
2004-06-23 19:19 ` Jeff Garzik
2004-06-26 5:02 ` David Mosberger
0 siblings, 2 replies; 20+ messages in thread
From: Terence Ripperda @ 2004-06-23 18:35 UTC (permalink / raw)
To: Linux Kernel Mailing List; +Cc: Terence Ripperda
[-- Attachment #1: Type: text/plain, Size: 3008 bytes --]
I'm working on cleaning up some of our dma allocation code to properly allocate 32-bit physical pages for dma on 64-bit platforms. I think our first pass at supporting em64t is sub-par. I'd like to fix that by using the correct kernel interfaces.
From our early efforts in supporting AMD's x86_64, we've used the pci_map_sg/pci_map_single interface for remapping > 32-bit physical addresses through the system's IOMMU. Since Intel's em64t does not provide an IOMMU, the kernel falls back to a swiotlb to implement these interfaces. For our first pass at supporting em64t, we tried to work with the swiotlb, but this works very poorly.
We should have gone back and reviewed how we use the kernel interfaces and followed DMA-API.txt and DMA-mapping.txt. We're now working on using these interfaces (mainly pci_alloc_consistent), but we're still running into some general shortcomings of these interfaces. The main problem is the ability to allocate enough 32-bit addressable memory.
The physical addressing of memory allocations seems to boil down to the behavior of GFP_DMA and GFP_NORMAL. But there seems to be some disconnect between what these mean for each architecture and what various drivers expect them to mean.
Based on each architecture's paging_init routines, the zones look like this:
              x86:        ia64:   x86_64:
ZONE_DMA:     < 16M       < ~4G   < 16M
ZONE_NORMAL:  16M - ~1G   > ~4G   > 16M
ZONE_HIMEM:   1G+
An example of this disconnect is vmalloc_32. This function is obviously intended to allocate 32-bit addresses (this was specifically mentioned in a comment in 2.4.x header files), but vmalloc_32 is an inline routine that calls __vmalloc(GFP_KERNEL). Based on the above zone descriptions, this will do the correct thing for x86, but not for ia64 or x86_64. On ia64, a driver could just use GFP_DMA for the desired behavior, but this doesn't work well for x86_64.
AMD's x86_64 provides remapping of > 32-bit pages through the iommu, but obviously Intel's em64t provides no such ability. Based on the above zonings, this leaves us with the options of either relying on the swiotlb interfaces for dma, or relying on ISA memory for dma.
For the last day or two, I've been experimenting with using the pci_alloc_consistent interface, which uses the latter (note the attached patch to fix an apparent memory leak in the x86_64 pci_alloc_consistent). Unfortunately, this provides very little dma-able memory. In theory, up to 16 Megs, but in practice I'm only getting about 5 1/2 Megs.
I was rather surprised by these limitations on allocating 32-bit addresses. I checked through the dri and bttv drivers to see if they had dealt with these issues, and they did not appear to have done so. Has anyone tested these drivers on ia64/x86_64/em64t platforms w/ 4+ Gigs of memory?
Are these limitations on allocating 32-bit addresses intentional and known? Is there anything we can do to help improve this situation? Help work on development?
Thanks,
Terence
[-- Attachment #2: pci-gart.patch --]
[-- Type: text/plain, Size: 330 bytes --]
--- pci-gart.c.old 2004-06-21 18:33:29.000000000 -0500
+++ pci-gart.c.new 2004-06-21 18:33:57.000000000 -0500
@@ -211,6 +211,7 @@
if (no_iommu || dma_mask < 0xffffffffUL) {
if (high) {
if (!(gfp & GFP_DMA)) {
+ free_pages((unsigned long)memory, get_order(size));
gfp |= GFP_DMA;
goto again;
}
^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-23 18:35 32-bit dma allocations on 64-bit platforms Terence Ripperda
@ 2004-06-23 19:19 ` Jeff Garzik
  2004-06-26  5:05   ` David Mosberger
  2004-06-26  5:02 ` David Mosberger
  1 sibling, 1 reply; 20+ messages in thread
From: Jeff Garzik @ 2004-06-23 19:19 UTC (permalink / raw)
To: Terence Ripperda; +Cc: Linux Kernel Mailing List

Terence Ripperda wrote:

Fix your word wrap.

> I'm working on cleaning up some of our dma allocation code to properly allocate 32-bit physical pages for dma on 64-bit platforms. I think our first pass at supporting em64t is sub-par. I'd like to fix that by using the correct kernel interfaces.
>
> From our early efforts in supporting AMD's x86_64, we've used the pci_map_sg/pci_map_single interface for remapping > 32-bit physical addresses through the system's IOMMU. Since Intel's em64t does not provide an IOMMU, the kernel falls back to a swiotlb to implement these interfaces. For our first pass at supporting em64t, we tried to work with the swiotlb, but this works very poorly.

swiotlb was a dumb idea when it hit ia64, and it's now been propagated
to x86-64 :(

> We should have gone back and reviewed how we use the kernel interfaces and followed DMA-API.txt and DMA-mapping.txt. We're now working on using these interfaces (mainly pci_alloc_consistent), but we're still running into some general shortcomings of these interfaces. The main problem is the ability to allocate enough 32-bit addressable memory.
>
> The physical addressing of memory allocations seems to boil down to the behavior of GFP_DMA and GFP_NORMAL. But there seems to be some disconnect between what these mean for each architecture and what various drivers expect them to mean.
>
> Based on each architecture's paging_init routines, the zones look like this:
>
>               x86:        ia64:   x86_64:
> ZONE_DMA:     < 16M       < ~4G   < 16M
> ZONE_NORMAL:  16M - ~1G   > ~4G   > 16M
> ZONE_HIMEM:   1G+
>
> An example of this disconnect is vmalloc_32. This function is obviously intended to allocate 32-bit addresses (this was specifically mentioned in a comment in 2.4.x header files), but vmalloc_32 is an inline routine that calls __vmalloc(GFP_KERNEL). Based on the above zone descriptions, this will do the correct thing for x86, but not for ia64 or x86_64. On ia64, a driver could just use GFP_DMA for the desired behavior, but this doesn't work well for x86_64.
>
> AMD's x86_64 provides remapping of > 32-bit pages through the iommu, but obviously Intel's em64t provides no such ability. Based on the above zonings, this leaves us with the options of either relying on the swiotlb interfaces for dma, or relying on ISA memory for dma.

FWIW, note that there are two main considerations:

Higher-level layers (block, net) provide bounce buffers when needed, as
you don't want to do that purely with iommu.

Once you have bounce buffers properly allocated by <something>
(swiotlb? special DRM bounce buffer allocator?), you then pci_map the
bounce buffers.

> For the last day or two, I've been experimenting with using the pci_alloc_consistent interface, which uses the latter (note the attached patch to fix an apparent memory leak in the x86_64 pci_alloc_consistent). Unfortunately, this provides very little dma-able memory. In theory, up to 16 Megs, but in practice I'm only getting about 5 1/2 Megs.
>
> I was rather surprised by these limitations on allocating 32-bit addresses. I checked through the dri and bttv drivers to see if they had dealt with these issues, and they did not appear to have done so. Has anyone tested these drivers on ia64/x86_64/em64t platforms w/ 4+ Gigs of memory?
>
> Are these limitations on allocating 32-bit addresses intentional and known? Is there anything we can do to help improve this situation? Help work on development?

Sounds like you're not setting the PCI DMA mask properly, or perhaps
passing NULL rather than a struct pci_dev to the PCI DMA API?

	Jeff

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-23 19:19 ` Jeff Garzik
@ 2004-06-26  5:05   ` David Mosberger
  2004-06-26  7:16     ` Arjan van de Ven
  0 siblings, 1 reply; 20+ messages in thread
From: David Mosberger @ 2004-06-26 5:05 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Terence Ripperda, Linux Kernel Mailing List

>>>>> On Wed, 23 Jun 2004 15:19:22 -0400, Jeff Garzik <jgarzik@pobox.com> said:

  Jeff> swiotlb was a dumb idea when it hit ia64, and it's now been propagated
  Jeff> to x86-64 :(

If it's such a dumb idea, why not submit a better solution?

	--david

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-26  5:05   ` David Mosberger
@ 2004-06-26  7:16     ` Arjan van de Ven
  2004-06-29  6:13       ` David Mosberger
  0 siblings, 1 reply; 20+ messages in thread
From: Arjan van de Ven @ 2004-06-26 7:16 UTC (permalink / raw)
To: davidm; +Cc: Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 462 bytes --]

On Sat, 2004-06-26 at 07:05, David Mosberger wrote:
> >>>>> On Wed, 23 Jun 2004 15:19:22 -0400, Jeff Garzik <jgarzik@pobox.com> said:
>
>   Jeff> swiotlb was a dumb idea when it hit ia64, and it's now been propagated
>   Jeff> to x86-64 :(
>
> If it's such a dumb idea, why not submit a better solution?

the real solution is an iommu of course, but the highmem solution has
quite some merit too..... I know you disagree with me on that one
though.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-26  7:16     ` Arjan van de Ven
@ 2004-06-29  6:13       ` David Mosberger
  2004-06-29  6:55         ` Arjan van de Ven
  2004-06-30  8:00         ` Jes Sorensen
  0 siblings, 2 replies; 20+ messages in thread
From: David Mosberger @ 2004-06-29 6:13 UTC (permalink / raw)
To: arjanv; +Cc: davidm, Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List

>>>>> On Sat, 26 Jun 2004 09:16:27 +0200, Arjan van de Ven <arjanv@redhat.com> said:

  Arjan> the real solution is an iommu of course, but the highmem
  Arjan> solution has quite some merit too..... I know you disagree
  Arjan> with me on that one though.

Yes, some merits and some faults.  The real solution is iommu or
64-bit capable devices.  Interesting that graphics controllers should
be last to get 64-bit DMA capability, considering how much more
complex they are than disk controllers or NICs.

	--david

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-29  6:13       ` David Mosberger
@ 2004-06-29  6:55         ` Arjan van de Ven
  2004-06-30  8:00         ` Jes Sorensen
  1 sibling, 0 replies; 20+ messages in thread
From: Arjan van de Ven @ 2004-06-29 6:55 UTC (permalink / raw)
To: davidm; +Cc: Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 661 bytes --]

On Mon, Jun 28, 2004 at 11:13:12PM -0700, David Mosberger wrote:
> >>>>> On Sat, 26 Jun 2004 09:16:27 +0200, Arjan van de Ven <arjanv@redhat.com> said:
>
>   Arjan> the real solution is an iommu of course, but the highmem
>   Arjan> solution has quite some merit too..... I know you disagree
>   Arjan> with me on that one though.
>
> Yes, some merits and some faults.  The real solution is iommu or
> 64-bit capable devices.  Interesting that graphics controllers should
> be last to get 64-bit DMA capability, considering how much more
> complex they are than disk controllers or NICs.

I guess the first game with more than 4Gb in textures will fix it ;)

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-29  6:13       ` David Mosberger
  2004-06-29  6:55         ` Arjan van de Ven
@ 2004-06-30  8:00         ` Jes Sorensen
  1 sibling, 0 replies; 20+ messages in thread
From: Jes Sorensen @ 2004-06-30 8:00 UTC (permalink / raw)
To: davidm; +Cc: arjanv, Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List

>>>>> "David" == David Mosberger <davidm@napali.hpl.hp.com> writes:
>>>>> On Sat, 26 Jun 2004 09:16:27 +0200, Arjan van de Ven <arjanv@redhat.com> said:

  Arjan> the real solution is an iommu of course, but the highmem
  Arjan> solution has quite some merit too..... I know you disagree with
  Arjan> me on that one though.

  David> Yes, some merits and some faults. The real solution is iommu
  David> or 64-bit capable devices. Interesting that graphics
  David> controllers should be last to get 64-bit DMA capability,
  David> considering how much more complex they are than disk
  David> controllers or NICs.

You found a 64 bit capable sound card yet? ;-)

Cheers,
Jes

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-23 18:35 32-bit dma allocations on 64-bit platforms Terence Ripperda
  2004-06-23 19:19 ` Jeff Garzik
@ 2004-06-26  5:02 ` David Mosberger
  1 sibling, 0 replies; 20+ messages in thread
From: David Mosberger @ 2004-06-26 5:02 UTC (permalink / raw)
To: Terence Ripperda; +Cc: Linux Kernel Mailing List

Terence,

>>>>> On Wed, 23 Jun 2004 13:35:35 -0500, Terence Ripperda <tripperda@nvidia.com> said:

  Terence> based on each architecture's paging_init routines, the
  Terence> zones look like this:

  Terence>               x86:        ia64:   x86_64:
  Terence> ZONE_DMA:     < 16M       < ~4G   < 16M
  Terence> ZONE_NORMAL:  16M - ~1G   > ~4G   > 16M
  Terence> ZONE_HIMEM:   1G+

Not that it matters here, but for correctness let me note that the
ia64 column is correct only for machines which don't have an I/O MMU.
With I/O MMU, ZONE_DMA will have the same coverage as ZONE_NORMAL with
a recent enough kernel (older kernels had a bug which limited ZONE_DMA
to < 4GB, but that was unintentional).

	--david

^ permalink raw reply	[flat|nested] 20+ messages in thread
[parent not found: <2akPm-16l-65@gated-at.bofh.it>]
* Re: 32-bit dma allocations on 64-bit platforms
       [not found] <2akPm-16l-65@gated-at.bofh.it>
@ 2004-06-23 21:46 ` Andi Kleen
  2004-06-24  6:18   ` Arjan van de Ven
  0 siblings, 1 reply; 20+ messages in thread
From: Andi Kleen @ 2004-06-23 21:46 UTC (permalink / raw)
To: Terence Ripperda; +Cc: discuss, tiwai, linux-kernel

Terence Ripperda <tripperda@nvidia.com> writes:

[sending again with linux-kernel in cc]

> I'm working on cleaning up some of our dma allocation code to properly allocate 32-bit physical pages for dma on 64-bit platforms. I think our first pass at supporting em64t is sub-par. I'd like to fix that by using the correct kernel interfaces.

I get from this that your hardware cannot DMA to >32bit.

> > The physical addressing of memory allocations seems to boil down to the behavior of GFP_DMA and GFP_NORMAL. But there seems to be some disconnect between what these mean for each architecture and what various drivers expect them to mean.
> >
> > Based on each architecture's paging_init routines, the zones look like this:
> >
> >               x86:        ia64:   x86_64:
> > ZONE_DMA:     < 16M       < ~4G   < 16M
> > ZONE_NORMAL:  16M - ~1G   > ~4G   > 16M
> > ZONE_HIMEM:   1G+
> >
> > An example of this disconnect is vmalloc_32. This function is obviously intended to allocate 32-bit addresses (this was specifically mentioned in a comment in 2.4.x header files), but vmalloc_32 is an inline routine that calls __vmalloc(GFP_KERNEL). Based on the above zone descriptions, this will do the correct thing for x86, but not for ia64 or x86_64. On ia64, a driver could just use GFP_DMA for the desired behavior, but this doesn't work well for x86_64.
> >
> > AMD's x86_64 provides remapping of > 32-bit pages through the iommu, but obviously Intel's em64t provides no such ability. Based on the above zonings, this leaves us with the options of either relying on the swiotlb interfaces for dma, or relying on ISA memory for dma.
> >
> > For the last day or two, I've been experimenting with using the pci_alloc_consistent interface, which uses the latter (note the attached patch to fix an apparent memory leak in the x86_64 pci_alloc_consistent). Unfortunately, this provides very little dma-able memory. In theory, up to 16 Megs, but in practice I'm only getting about 5 1/2 Megs.
> >
> > I was rather surprised by these limitations on allocating 32-bit addresses. I checked through the dri and bttv drivers to see if they had dealt with these issues, and they did not appear to have done so. Has anyone tested these drivers on ia64/x86_64/em64t platforms w/ 4+ Gigs of memory?
> >
> > Are these limitations on allocating 32-bit addresses intentional and known? Is there anything we can do to help improve this situation? Help work on development?

First, vmalloc_32 is a rather broken interface and should imho just be
removed. The function name just gives promises that cannot be kept. It
was always quite bogus. Please don't use it.

The situation on EM64T is very unfortunate, I agree. There was a reason
we asked AMD to add an IOMMU, and it's quite bad that the Intel chipset
people ignored that wisdom and put us into this compatibility mess.
Failing that, it would be best if the other PCI DMA hardware could just
address enough memory, but that's less realistic than just fixing the
chipset.

The x86-64 port had decided early to keep the 16MB GFP_DMA zone to get
maximum driver compatibility and because the AMD IOMMU gave us a nice
alternative over bounce buffering.

In theory I'm not totally against enlarging GFP_DMA a bit on x86-64. It
would just be difficult to find a good value. The problem is that there
may be existing drivers that rely on the 16MB limit, and it would not
be very nice to break them. We got rid of a lot of them by disallowing
CONFIG_ISA, but there may be some left. So before doing this it would
need a full driver tree audit to check every device. The most prominent
example used to be the floppy driver, but the current floppy driver
seems to use some other way to get around this. There seem to be quite
a few sound chipsets with DMA limits < 32bit; e.g. 29 bits seems to be
quite common, but I see several 24bit PCI ones too.

I must say I'm somewhat reluctant to break a working in-tree driver,
especially for the sake of an out-of-tree binary driver. Arguably the
problem is probably not limited to you; it's quite possible that even
the in-tree DRI drivers have it, so it would still be worth fixing.

I see two somewhat realistic ways to handle this:

- We enlarge GFP_DMA and find some way to do double buffering for these
  sound drivers (it would need a PCI-DMA API extension that always
  calls swiotlb for this). For sound that's not too bad, because they
  are relatively slow. It would require reserving bootmem memory early
  for the bounces, but I guess requiring the user to pass a special
  boot-time parameter for these devices would be reasonable. If yes,
  someone would need to do this work. Also the question would be how
  large to make GFP_DMA. Ideally it should not be too big, so that e.g.
  29bit devices don't require the bounce buffering.

- We introduce multiple GFP_DMA zones: keep the 16MB GFP_DMA and add
  GFP_BIGDMA or somesuch for larger DMA. The VM should be able to
  handle this, but it may still require some tuning. It would need some
  generic changes, but not too bad. Still would need a decision on how
  big GFP_BIGDMA should be. I suspect 4GB would be too big again.

Comments?

-Andi

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-23 21:46 ` Andi Kleen
@ 2004-06-24  6:18   ` Arjan van de Ven
  2004-06-24 10:33     ` Andi Kleen
  2004-06-24 13:48     ` Jesse Barnes
  0 siblings, 2 replies; 20+ messages in thread
From: Arjan van de Ven @ 2004-06-24 6:18 UTC (permalink / raw)
To: Andi Kleen; +Cc: Terence Ripperda, discuss, tiwai, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 636 bytes --]

On Wed, 2004-06-23 at 23:46, Andi Kleen wrote:
> The VM should be able to handle this, but it may still require
> some tuning. It would need some generic changes, but not too bad.
> Still would need a decision on how big GFP_BIGDMA should be.
> I suspect 4GB would be too big again.

What is the problem again? Can't the driver use the dynamic pci mapping
API, which does allow more memory to be mapped even on crippled machines
without an iommu?

And isn't this a problem that will vanish since PCI Express and PCI-X
both *require* support for 64 bit addressing, so all higher speed cards
are going to be ok in principle?

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24  6:18   ` Arjan van de Ven
@ 2004-06-24 10:33     ` Andi Kleen
  0 siblings, 0 replies; 20+ messages in thread
From: Andi Kleen @ 2004-06-24 10:33 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Andi Kleen, Terence Ripperda, discuss, tiwai, linux-kernel

On Thu, Jun 24, 2004 at 08:18:06AM +0200, Arjan van de Ven wrote:
> On Wed, 2004-06-23 at 23:46, Andi Kleen wrote:
>
> > The VM should be able to handle this, but it may still require
> > some tuning. It would need some generic changes, but not too bad.
> > Still would need a decision on how big GFP_BIGDMA should be.
> > I suspect 4GB would be too big again.
>
> What is the problem again, can't the driver use the dynamic pci mapping
> API which does allow more memory to be mapped even on crippled machines
> without iommu ?

In theory one could make pci_alloc_consistent allocate from the swiotlb
pool, yes; the problem is just that this pool is completely
preallocated. If enough memory is needed that would be quite nasty,
because you suddenly lose 1 or 2GB of RAM.

> And isn't this a problem that will vanish since PCI Express and PCI X
> both *require* support for 64 bit addressing, so all higher speed cards
> are going to be ok in principle ?

There are EM64T systems with AGP only, and not all PCI-Express cards
seem to follow this. PCI-Express unfortunately discouraged the AGP
aperture too, so not even that can be used on those Intel systems.

-Andi

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24  6:18   ` Arjan van de Ven
  2004-06-24 10:33     ` Andi Kleen
@ 2004-06-24 13:48     ` Jesse Barnes
  2004-06-24 14:39       ` Terence Ripperda
  1 sibling, 1 reply; 20+ messages in thread
From: Jesse Barnes @ 2004-06-24 13:48 UTC (permalink / raw)
To: arjanv; +Cc: Andi Kleen, Terence Ripperda, discuss, tiwai, linux-kernel

On Thursday, June 24, 2004 2:18 am, Arjan van de Ven wrote:
> What is the problem again, can't the driver use the dynamic pci mapping
> API which does allow more memory to be mapped even on crippled machines
> without iommu ?
> And isn't this a problem that will vanish since PCI Express and PCI X
> both *require* support for 64 bit addressing, so all higher speed cards
> are going to be ok in principle ?

Well, PCI-X may require it, but there certainly are PCI-X devices that
don't do 64 bit addressing, or if they do, it's a crippled
implementation (e.g. top 32 bits have to be constant).

Jesse

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 13:48     ` Jesse Barnes
@ 2004-06-24 14:39       ` Terence Ripperda
  0 siblings, 0 replies; 20+ messages in thread
From: Terence Ripperda @ 2004-06-24 14:39 UTC (permalink / raw)
To: Jesse Barnes
Cc: arjanv, Andi Kleen, Terence Ripperda, discuss, tiwai, linux-kernel

Correct. I checked with my contacts here on the PCI Express
requirements. Apparently the spec says "A PCI Express Endpoint
operating as the Requester of a Memory Transaction is required to be
capable of generating addresses greater than 4GB", but my contact
claims this is a "soft" requirement.

But even if all PCI-X and PCI-E devices properly addressed the full 64
bits, legacy 32-bit PCI devices can be plugged into the motherboards as
well. My Intel em64t boards have mostly PCI-X slots but one PCI slot,
and my AMD x86_64 boards have all PCI slots (aside from the main PCI-E
slot). Also, at least one motherboard manufacturer claims PCI-E + AGP,
but the AGP is really just an AGP form-factor slot on the PCI bus.

Thanks,
Terence

On Thu, Jun 24, 2004 at 06:48:07AM -0700, jbarnes@engr.sgi.com wrote:
> On Thursday, June 24, 2004 2:18 am, Arjan van de Ven wrote:
> > What is the problem again, can't the driver use the dynamic pci mapping
> > API which does allow more memory to be mapped even on crippled machines
> > without iommu ?
> > And isn't this a problem that will vanish since PCI Express and PCI X
> > both *require* support for 64 bit addressing, so all higher speed cards
> > are going to be ok in principle ?
>
> Well, PCI-X may require it, but there certainly are PCI-X devices that
> don't do 64 bit addressing, or if they do, it's a crippled
> implementation (e.g. top 32 bits have to be constant).
>
> Jesse

^ permalink raw reply	[flat|nested] 20+ messages in thread
[parent not found: <m3acyu6pwd.fsf@averell.firstfloor.org>]
[parent not found: <20040623213643.GB32456@hygelac>]
* Re: 32-bit dma allocations on 64-bit platforms
       [not found] ` <20040623213643.GB32456@hygelac>
@ 2004-06-23 23:46   ` Andi Kleen
  2004-06-24 11:13     ` Takashi Iwai
  2004-06-24 15:44     ` Terence Ripperda
  0 siblings, 2 replies; 20+ messages in thread
From: Andi Kleen @ 2004-06-23 23:46 UTC (permalink / raw)
To: Terence Ripperda; +Cc: Andi Kleen, discuss, tiwai, linux-kernel, andrea

On Wed, Jun 23, 2004 at 04:36:43PM -0500, Terence Ripperda wrote:
> > The x86-64 port had decided early to keep the 16MB GFP_DMA zone
> > to get maximum driver compatibility and because the AMD IOMMU gave
> > us a nice alternative over bounce buffering.
>
> that was a very understandable decision. and I do agree that using the AMD IOMMU is a very nice architecture. it is unfortunate to have to deal with this on EM64T. Will AMD's pci-express chipsets still maintain an IOMMU, even if it's not needed for AGP anymore? (probably not public information, I'll check via my channels).

The IOMMU is actually implemented in the CPU northbridge on K8, so yes.
I hope they'll keep it in future CPUs too.

> > I must say I'm somewhat reluctant to break a working in-tree driver,
> > especially for the sake of an out-of-tree binary driver. Arguably the
> > problem is probably not limited to you; it's quite possible that even
> > the in-tree DRI drivers have it, so it would still be worth fixing.
>
> agreed. I completely understand that there is no desire to modify the core kernel to help our driver. that's one of the reasons I looked through the other drivers, as I suspect that this is a problem for many drivers. I only looked through the code for each briefly, but didn't see anything to handle this. I suspect it's more of a case that the drivers have not been stress tested on an x86_64 machine w/ 4+ G of memory.

We usually handle it using the swiotlb, which works.

pci_alloc_consistent is limited to 16MB, but so far nobody has really
complained about that. If that should be a real issue we can make it
allocate from the swiotlb pool, which is usually 64MB (and can be made
bigger at boot time).

Would that work for you too BTW? How much memory do you expect to need?

The drawback is that the swiotlb pool is not unified with the rest of
the VM, so tying up too much memory there is quite unfriendly. E.g. if
you can use up 1GB then I wouldn't consider this suitable; for 128MB
max it may be possible.

> > I see two somewhat realistic ways to handle this:
>
> either of those approaches sounds good. keeping compatibility with older devices/drivers is certainly a good thing.
>
> can the core kernel handle multiple new zones? I haven't looked at the code, but the zones seem to always be ZONE_DMA and ZONE_NORMAL, with some architectures adding ZONE_HIMEM at the end. if you add a ZONE_DMA_32 (or whatever) between ZONE_DMA and ZONE_NORMAL, will the core vm code be able to handle that? I guess one could argue if it can't yet, it should be able to, then each architecture could create as many zones as they wanted.

Sure, we create multiple zones on NUMA systems (even on x86-64). Each
node has one zone. But they're all ZONE_NORMAL. And the first node has
two zones, one ZONE_DMA and one ZONE_NORMAL (actually the others have a
ZONE_DMA too, but it's empty).

Multiple ZONE_DMA zones would be a novelty, but may be doable (I have
not checked all the implications of this, but I don't immediately see
any show stopper; maybe someone like Andrea can correct me on that). It
will probably be a somewhat intrusive patch though.

> another brainstorm: instead of counting on just a large-grained zone and call to __get_free_pages() returning an allocation within a given bit-range, perhaps there could be large-grained zones, with a fine-grained hint used to look for a subset within the zone. for example, there could be a DMA32 zone, but a mask w/ 24- or 29- bits enabled could be used to scan the DMA32 zone for a valid address. (don't know how well that fits into the current architecture).

Not very well. Or rather, the allocation would not be O(1) anymore
because you would need to scan the queues. That could be still
tolerable, but when there are no pages you have to call the VM and then
teach try_to_free_pages and friends that you are only interested in
pages under some mask. And that would probably get quite nasty. I did
something like this in 2.4 for an old prototype of the NUMA API, but it
never worked very well and also was quite ugly. Multiple zones are
probably better.

One of the reasons we rejected this early when the x86-64 port was
designed was that the VM had quite bad zone balancing problems at that
time. It should be better now though, or at least the NUMA setup works
reasonably well. But NUMA zones tend to be a lot bigger than DMA zones
and don't show all the corner cases.

-Andi

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-23 23:46   ` Andi Kleen
@ 2004-06-24 11:13     ` Takashi Iwai
  2004-06-24 14:45       ` Terence Ripperda
  1 sibling, 1 reply; 20+ messages in thread
From: Takashi Iwai @ 2004-06-24 11:13 UTC (permalink / raw)
To: Andi Kleen; +Cc: Terence Ripperda, discuss, linux-kernel, andrea

At 24 Jun 2004 01:46:44 +0200, Andi Kleen wrote:
>
> > > I must say I'm somewhat reluctant to break a working in-tree driver,
> > > especially for the sake of an out-of-tree binary driver. Arguably the
> > > problem is probably not limited to you; it's quite possible that even
> > > the in-tree DRI drivers have it, so it would still be worth fixing.
> >
> > agreed. I completely understand that there is no desire to modify the core kernel to help our driver. that's one of the reasons I looked through the other drivers, as I suspect that this is a problem for many drivers. I only looked through the code for each briefly, but didn't see anything to handle this. I suspect it's more of a case that the drivers have not been stress tested on an x86_64 machine w/ 4+ G of memory.
>
> We usually handle it using the swiotlb, which works.
>
> pci_alloc_consistent is limited to 16MB, but so far nobody has really
> complained about that. If that should be a real issue we can make
> it allocate from the swiotlb pool, which is usually 64MB (and can
> be made bigger at boot time)

Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
allocated pages are out of dma mask, just like in pci-gart.c?
(with ifdef x86-64)

Takashi

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 11:13     ` Takashi Iwai
@ 2004-06-24 14:45       ` Terence Ripperda
  2004-06-24 15:41         ` Andrea Arcangeli
  0 siblings, 1 reply; 20+ messages in thread
From: Terence Ripperda @ 2004-06-24 14:45 UTC (permalink / raw)
To: Takashi Iwai; +Cc: Andi Kleen, Terence Ripperda, discuss, linux-kernel, andrea

On Thu, Jun 24, 2004 at 04:13:47AM -0700, tiwai@suse.de wrote:
> > pci_alloc_consistent is limited to 16MB, but so far nobody has really
> > complained about that. If that should be a real issue we can make
> > it allocate from the swiotlb pool, which is usually 64MB (and can
> > be made bigger at boot time)
>
> Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
> allocated pages are out of dma mask, just like in pci-gart.c?
> (with ifdef x86-64)

pci_alloc_consistent (at least on x86-64) does do this. One of the
problems I've seen in experimentation is that GFP_KERNEL tends to
allocate from the top of memory down. This means that all of the
GFP_KERNEL allocations are > 32-bits, which forces us down to GFP_DMA
and the < 16M allocations.

I've mainly tested this after a cold boot, so after running for a
while, GFP_KERNEL may hit good allocations a lot more.

Thanks,
Terence

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 14:45 ` Terence Ripperda
@ 2004-06-24 15:41   ` Andrea Arcangeli
  0 siblings, 0 replies; 20+ messages in thread
From: Andrea Arcangeli @ 2004-06-24 15:41 UTC (permalink / raw)
  To: Terence Ripperda; +Cc: Takashi Iwai, Andi Kleen, discuss, linux-kernel

On Thu, Jun 24, 2004 at 09:45:51AM -0500, Terence Ripperda wrote:
> On Thu, Jun 24, 2004 at 04:13:47AM -0700, tiwai@suse.de wrote:
> > > pci_alloc_consistent is limited to 16MB, but so far nobody has really
> > > complained about that. If that should be a real issue we can make
> > > it allocate from the swiotlb pool, which is usually 64MB (and can
> > > be made bigger at boot time)
> > 
> > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
> > allocated pages are out of dma mask, just like in pci-gart.c?
> > (with ifdef x86-64)
> 
> pci_alloc_consistent (at least on x86-64) does do this. one of the
> problems I've seen in experimentation is that GFP_KERNEL tends to
> allocate from the top of memory down. this means that all of the
> GFP_KERNEL allocations are > 32 bits, which forces us down to GFP_DMA
> and the < 16M allocations.
> 
> I've mainly tested this after a cold boot, so after running for a
> while, GFP_KERNEL may hit good allocations a lot more.

it's trivial to change the order in the freelist to allocate from lower
addresses first, but the point is still that over time it will be
random. the 16M must be reserved entirely for __GFP_DMA on any machine
with >= 1G of ram, and the lowmem_reserve_ratio algorithm accomplishes
this; it scales down the reserve ratio according to the balance between
the lowmem and dma zones. I believe if anything you should try
GFP_KERNEL if GFP_DMA fails, not the other way around.

btw, 2.6 is even more efficient at shrinking and paging out the dma
zone than 2.4 could be.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-23 23:46     ` Andi Kleen
  2004-06-24 11:13       ` Takashi Iwai
@ 2004-06-24 15:44       ` Terence Ripperda
  2004-06-24 18:51         ` Andi Kleen
  1 sibling, 1 reply; 20+ messages in thread
From: Terence Ripperda @ 2004-06-24 15:44 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Terence Ripperda, discuss, tiwai, linux-kernel, andrea

On Wed, Jun 23, 2004 at 04:46:44PM -0700, ak@muc.de wrote:
> pci_alloc_consistent is limited to 16MB, but so far nobody has really
> complained about that. If that should be a real issue we can make
> it allocate from the swiotlb pool, which is usually 64MB (and can
> be made bigger at boot time)

In all of the cases I've seen, it defaults to 4M. in swiotlb.c,
io_tlb_nslabs defaults to 1024, * PAGE_SIZE == 4194304.

> Would that work for you too BTW ? How much memory do you expect
> to need?

potentially. our currently pending release uses pci_map_sg, which relies
on the swiotlb for em64t systems. it "works", but we have some ugly
hacks and were hoping to get away from using it (at least in its current
form). here are some of the problems we encountered:

probably the biggest problem is that the size is way too small for our
needs (more on our memory usage shortly). this is compounded by the
swiotlb code throwing a kernel panic when it can't allocate memory, and
if the panic doesn't halt the machine, the routine returns a random
value off the stack as the dma_addr_t. for this reason, we have an ugly
hack that notices when the swiotlb is enabled (it just checks whether
swiotlb is set) and prints a warning telling the user to bump the size
of the swiotlb up to 16384, or 64M.

also, the proper usage of the bounce buffers, calling pci_dma_sync_*,
would be a performance killer for us. we stream a considerable amount of
data to the gpu per second (on the order of 100s of megs a second), so
having to do an additional memcpy would reduce performance considerably,
in some cases by 30-50%. for this reason, we detect when
dma_addr != phys_addr and map the dma_addr directly to opengl to avoid
the copy. I know this is ugly, and that's one of the things I'd really
like to get away from.

finally, our driver already uses a considerable amount of memory. by
definition, the swiotlb interface doubles that memory usage. if our
driver used the swiotlb correctly (as in, didn't know about the swiotlb
and always called pci_dma_sync_*), we'd lock down the physical addresses
opengl writes to, since they're normally used directly for dma, plus the
pages allocated from the swiotlb would be locked down (currently we do
this manually, but if the swiotlb is supposed to be transparent to the
driver and used for dma, I assume it must already be locked down,
perhaps by definition of being bootmem?). this means not only is the
memory usage doubled, but it's all locked down and unpageable.

in this case, it almost would make more sense to treat the bootmem
allocated for the swiotlb as a pool of 32-bit memory that can be
directly allocated from, rather than as bounce buffers. I don't know
that this would be an acceptable interface though. but if we could come
up with reasonable solutions to these problems, this may work.

> drawback is that the swiotlb pool is not unified with the rest of the
> VM, so tying up too much memory there is quite unfriendly.
> e.g. if you can use up 1GB then I wouldn't consider this suitable,
> for 128MB max it may be possible.

I checked with our opengl developers on this. by default, we allocate
~64k for X's push buffer and ~1M per opengl client for their push
buffer. on quadro/workstation parts, we allocate 20M for the first
opengl client, then ~1M per client after that.

in addition to the push buffer, there is a lot of data that apps dump
through the push buffer. this includes textures, vertex buffers, display
lists, etc. the amount of memory used for this varies greatly from app
to app. the 20M listed above includes the push buffer and memory for
these buffers (I think workstation apps tend to push a lot more
pre-processed vertex data to the gpu). note that most agp apertures
these days are in the 128M - 1024M range, and there are times that we
exhaust that memory on the low end.

I think our driver is greedy when trying to allocate memory for
performance reasons, but has good fallback cases. so being somewhat
limited on resources isn't too bad, just so long as the kernel doesn't
panic instead of failing the memory allocation. I would think that 64M
or 128M would be good. a nice feature of the swiotlb is the ability to
tune it at boot, so if a workstation user found they really did need
more memory for performance, they could tweak that value up themselves.

also remember future growth. PCI-E has something like 20/24 lanes that
can be split among multiple PCI-E slots. Alienware has already announced
multi-card products, and it's likely multi-card products will be more
readily available on PCI-E, since the slots should have equivalent
bandwidth (unlike AGP+PCI). nvidia has also had workstation parts in the
past with 2 gpus and a bridge chip. each of these gpus ran twinview, so
each card drove 4 monitors. these were pci cards, and some crazy vendors
had 4+ of these cards in a machine driving many monitors. this just
pushes the memory requirements up in special circumstances.

Thanks,
Terence

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 15:44 ` Terence Ripperda
@ 2004-06-24 18:51   ` Andi Kleen
  2004-06-26  4:58     ` David Mosberger
  0 siblings, 1 reply; 20+ messages in thread
From: Andi Kleen @ 2004-06-24 18:51 UTC (permalink / raw)
  To: Terence Ripperda; +Cc: Andi Kleen, discuss, tiwai, linux-kernel, andrea

On Thu, Jun 24, 2004 at 10:44:29AM -0500, Terence Ripperda wrote:
> On Wed, Jun 23, 2004 at 04:46:44PM -0700, ak@muc.de wrote:
> > pci_alloc_consistent is limited to 16MB, but so far nobody has really
> > complained about that. If that should be a real issue we can make
> > it allocate from the swiotlb pool, which is usually 64MB (and can
> > be made bigger at boot time)
> 
> In all of the cases I've seen, it defaults to 4M. in swiotlb.c,
> io_tlb_nslabs defaults to 1024, * PAGE_SIZE == 4194304.

I checked this now. It's

	#define IO_TLB_SHIFT 11

	static unsigned long io_tlb_nslabs = 1024;

and the allocation does

	io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs * (1 << IO_TLB_SHIFT));

which, contrary to its name, does not allocate in pages (otherwise you
would get 8GB of memory on x86-64 and even more on IA64).

That's definitely far too small. A better IO_TLB_SHIFT would be 16 or 17.

-Andi

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 18:51 ` Andi Kleen
@ 2004-06-26  4:58   ` David Mosberger
  0 siblings, 0 replies; 20+ messages in thread
From: David Mosberger @ 2004-06-26 4:58 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Terence Ripperda, discuss, tiwai, linux-kernel, andrea

>>>>> On Thu, 24 Jun 2004 20:51:56 +0200, Andi Kleen <ak@muc.de> said:

  Andi> A better IO_TLB_SHIFT would be 16 or 17.

Careful. I see code like this:

	stride = (1 << (PAGE_SHIFT - IO_TLB_SHIFT));

You probably don't want IO_TLB_SHIFT > PAGE_SHIFT...

Increasing io_tlb_nslabs should be no problem though (subject to memory
availability). It can already be set via the "swiotlb" option.

I doubt the swiotlb is the right thing here, though, given the bandwidth
demands of graphics. Too bad Nvidia cards don't support > 32-bit
addressability and Intel chipsets don't support I/O MMUs...

	--david

^ permalink raw reply	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2004-06-30 8:12 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-06-23 18:35 32-bit dma allocations on 64-bit platforms Terence Ripperda
2004-06-23 19:19 ` Jeff Garzik
2004-06-26 5:05 ` David Mosberger
2004-06-26 7:16 ` Arjan van de Ven
2004-06-29 6:13 ` David Mosberger
2004-06-29 6:55 ` Arjan van de Ven
2004-06-30 8:00 ` Jes Sorensen
2004-06-26 5:02 ` David Mosberger
[not found] <2akPm-16l-65@gated-at.bofh.it>
2004-06-23 21:46 ` Andi Kleen
2004-06-24 6:18 ` Arjan van de Ven
2004-06-24 10:33 ` Andi Kleen
2004-06-24 13:48 ` Jesse Barnes
2004-06-24 14:39 ` Terence Ripperda
[not found] <m3acyu6pwd.fsf@averell.firstfloor.org>
[not found] ` <20040623213643.GB32456@hygelac>
2004-06-23 23:46 ` Andi Kleen
2004-06-24 11:13 ` Takashi Iwai
2004-06-24 14:45 ` Terence Ripperda
2004-06-24 15:41 ` Andrea Arcangeli
2004-06-24 15:44 ` Terence Ripperda
2004-06-24 18:51 ` Andi Kleen
2004-06-26 4:58 ` David Mosberger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox