* Re: 32-bit dma allocations on 64-bit platforms
       [not found] ` <20040623213643.GB32456@hygelac>
@ 2004-06-23 23:46   ` Andi Kleen
  2004-06-24 11:13     ` Takashi Iwai
  2004-06-24 15:44     ` Terence Ripperda
  0 siblings, 2 replies; 70+ messages in thread
From: Andi Kleen @ 2004-06-23 23:46 UTC (permalink / raw)
  To: Terence Ripperda; +Cc: Andi Kleen, discuss, tiwai, linux-kernel, andrea

On Wed, Jun 23, 2004 at 04:36:43PM -0500, Terence Ripperda wrote:
> > The x86-64 port had decided early to keep the 16MB GFP_DMA zone
> > to get maximum driver compatibility and because the AMD IOMMU gave
> > us a nice alternative over bounce buffering.
>
> that was a very understandable decision. and I do agree that using the
> AMD IOMMU is a very nice architecture. it is unfortunate to have to deal
> with this on EM64T. Will AMD's pci-express chipsets still maintain an
> IOMMU, even if it's not needed for AGP anymore? (probably not public
> information, I'll check via my channels).

The IOMMU is actually implemented in the CPU northbridge on K8, so yes.
I hope they'll keep it in future CPUs too.

> > I must say I'm somewhat reluctant to break a working in-tree driver,
> > especially for the sake of an out-of-tree binary driver. Arguably the
> > problem is probably not limited to you, but it's quite possible that
> > even the in-tree DRI drivers have it, so it would still be worth fixing.
>
> agreed. I completely understand that there is no desire to modify the
> core kernel to help our driver. that's one of the reasons I looked through
> the other drivers, as I suspect that this is a problem for many drivers. I
> only looked through the code for each briefly, but didn't see anything to
> handle this. I suspect it's more of a case that the drivers have not been
> stress tested on an x86_64 machine w/ 4+ G of memory.

We usually handle it using the swiotlb, which works.

pci_alloc_consistent is limited to 16MB, but so far nobody has really
complained about that. If that should be a real issue we can make it
allocate from the swiotlb pool, which is usually 64MB (and can be made
bigger at boot time). Would that work for you too BTW? How much memory
do you expect to need?

The drawback is that the swiotlb pool is not unified with the rest of
the VM, so tying up too much memory there is quite unfriendly. e.g. if
you can use up 1GB then I wouldn't consider this suitable; for 128MB max
it may be possible.

> > I see two somewhat realistic ways to handle this:
>
> either of those approaches sounds good. keeping compatibility with older
> devices/drivers is certainly a good thing.
>
> can the core kernel handle multiple new zones? I haven't looked at the
> code, but the zones seem to always be ZONE_DMA and ZONE_NORMAL, with some
> architectures adding ZONE_HIMEM at the end. if you add a ZONE_DMA_32 (or
> whatever) between ZONE_DMA and ZONE_NORMAL, will the core vm code be able
> to handle that? I guess one could argue if it can't yet, it should be able
> to, then each architecture could create as many zones as they wanted.

Sure, we create multiple zones on NUMA systems (even on x86-64). Each
node has one zone, but they're all ZONE_NORMAL. And the first node has
two zones, one ZONE_DMA and one ZONE_NORMAL (actually the others have a
ZONE_DMA too, but it's empty).

Multiple ZONE_DMA zones would be a novelty, but may be doable (I have
not checked all the implications of this, but I don't immediately see
any show stopper; maybe someone like Andrea can correct me on that). It
would probably be a somewhat intrusive patch though.

> another brainstorm: instead of counting on just a large-grained zone and
> call to __get_free_pages() returning an allocation within a given
> bit-range, perhaps there could be large-grained zones, with a fine-grained
> hint used to look for a subset within the zone. for example, there could be
> a DMA32 zone, but a mask w/ 24- or 29- bits enabled could be used to scan
> the DMA32 zone for a valid address. (don't know how well that fits into the
> current architecture).

Not very well. Or rather the allocation would not be O(1) anymore,
because you would need to scan the queues. That could still be
tolerable, but when there are no pages you have to call the VM and then
teach try_to_free_pages and friends that you are only interested in
pages under some mask. And that would probably get quite nasty.

I did something like this in 2.4 for an old prototype of the NUMA API,
but it never worked very well and was also quite ugly. Multiple zones
are probably better.

One of the reasons we rejected this early when the x86-64 port was
designed was that the VM had quite bad zone balancing problems at that
time. It should be better now though, or at least the NUMA setup works
reasonably well. But NUMA zones tend to be a lot bigger than DMA zones
and don't show all the corner cases.

-Andi

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-23 23:46 ` 32-bit dma allocations on 64-bit platforms Andi Kleen
@ 2004-06-24 11:13   ` Takashi Iwai
  2004-06-24 11:29     ` [discuss] " Andi Kleen
  2004-06-24 14:45     ` Terence Ripperda
  2004-06-24 15:44   ` Terence Ripperda
  1 sibling, 2 replies; 70+ messages in thread
From: Takashi Iwai @ 2004-06-24 11:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Terence Ripperda, discuss, linux-kernel, andrea

At 24 Jun 2004 01:46:44 +0200,
Andi Kleen wrote:
>
> > > I must say I'm somewhat reluctant to break a working in-tree driver,
> > > especially for the sake of an out-of-tree binary driver. Arguably the
> > > problem is probably not limited to you, but it's quite possible that
> > > even the in-tree DRI drivers have it, so it would still be worth fixing.
> >
> > agreed. I completely understand that there is no desire to modify the
> > core kernel to help our driver. that's one of the reasons I looked through
> > the other drivers, as I suspect that this is a problem for many drivers. I
> > only looked through the code for each briefly, but didn't see anything to
> > handle this. I suspect it's more of a case that the drivers have not been
> > stress tested on an x86_64 machine w/ 4+ G of memory.
>
> We usually handle it using the swiotlb, which works.
>
> pci_alloc_consistent is limited to 16MB, but so far nobody has really
> complained about that. If that should be a real issue we can make
> it allocate from the swiotlb pool, which is usually 64MB (and can
> be made bigger at boot time)

Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
allocated pages are out of dma mask, just like in pci-gart.c?
(with ifdef x86-64)


Takashi

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 11:13 ` Takashi Iwai
@ 2004-06-24 11:29   ` Andi Kleen
  2004-06-24 14:36     ` Takashi Iwai
  2004-06-24 14:45   ` Terence Ripperda
  1 sibling, 1 reply; 70+ messages in thread
From: Andi Kleen @ 2004-06-24 11:29 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: Andi Kleen, Terence Ripperda, discuss, linux-kernel, andrea

> Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
> allocated pages are out of dma mask, just like in pci-gart.c?
> (with ifdef x86-64)

That won't work reliably enough in extreme cases.

-Andi

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 11:29 ` [discuss] " Andi Kleen
@ 2004-06-24 14:36   ` Takashi Iwai
  2004-06-24 14:42     ` Andi Kleen
  0 siblings, 1 reply; 70+ messages in thread
From: Takashi Iwai @ 2004-06-24 14:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, Terence Ripperda, discuss, linux-kernel, andrea

At Thu, 24 Jun 2004 13:29:00 +0200,
Andi Kleen wrote:
>
> > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
> > allocated pages are out of dma mask, just like in pci-gart.c?
> > (with ifdef x86-64)
>
> That won't work reliably enough in extreme cases.

Well, it's not perfect, but it'd be far better than GFP_DMA only :)

BTW, we have a similar problem on i386, too. A non-32bit DMA mask always
results in allocation with GFP_DMA. The patch below does a similar hack
as described above, falling back to GFP_DMA only when the first
allocation misses the mask.


Takashi

--- linux-2.6.7/arch/i386/kernel/pci-dma.c	2004-06-24 15:56:46.017473544 +0200
+++ linux-2.6.7/arch/i386/kernel/pci-dma.c	2004-06-24 16:05:02.449803937 +0200
@@ -17,17 +17,35 @@ void *dma_alloc_coherent(struct device *
 	dma_addr_t *dma_handle, int gfp)
 {
 	void *ret;
+	unsigned long dma_mask;
+
 	/* ignore region specifiers */
 	gfp &= ~(__GFP_DMA | __GFP_HIGHMEM);
 
-	if (dev == NULL || (dev->coherent_dma_mask < 0xffffffff))
+	if (dev == NULL) {
 		gfp |= GFP_DMA;
+		dma_mask = 0xffffffUL;
+	} else {
+		dma_mask = 0xffffffffUL;
+		if (dev->dma_mask)
+			dma_mask = *dev->dma_mask;
+		if (dev->coherent_dma_mask)
+			dma_mask &= (unsigned long)dev->coherent_dma_mask;
+	}
+
+ again:
 	ret = (void *)__get_free_pages(gfp, get_order(size));
 	if (ret != NULL) {
-		memset(ret, 0, size);
 		*dma_handle = virt_to_phys(ret);
+		if (((unsigned long)*dma_handle + size - 1) & ~dma_mask) {
+			free_pages((unsigned long)ret, get_order(size));
+			if (gfp & GFP_DMA)
+				return NULL;
+			gfp |= GFP_DMA;
+			goto again;
+		}
+		memset(ret, 0, size);
 	}
 	return ret;
 }

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 14:36 ` Takashi Iwai
@ 2004-06-24 14:42   ` Andi Kleen
  2004-06-24 14:58     ` Takashi Iwai
  0 siblings, 1 reply; 70+ messages in thread
From: Andi Kleen @ 2004-06-24 14:42 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: ak, tripperda, discuss, linux-kernel, andrea

On Thu, 24 Jun 2004 16:36:47 +0200
Takashi Iwai <tiwai@suse.de> wrote:

> At Thu, 24 Jun 2004 13:29:00 +0200,
> Andi Kleen wrote:
> >
> > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
> > > allocated pages are out of dma mask, just like in pci-gart.c?
> > > (with ifdef x86-64)
> >
> > That won't work reliably enough in extreme cases.
>
> Well, it's not perfect, but it'd be far better than GFP_DMA only :)

The only description for this patch I can think of is "russian roulette".

-Andi

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 14:42 ` Andi Kleen
@ 2004-06-24 14:58   ` Takashi Iwai
  2004-06-24 15:29     ` Andrea Arcangeli
  0 siblings, 1 reply; 70+ messages in thread
From: Takashi Iwai @ 2004-06-24 14:58 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ak, tripperda, discuss, linux-kernel, andrea

At Thu, 24 Jun 2004 16:42:58 +0200,
Andi Kleen wrote:
>
> On Thu, 24 Jun 2004 16:36:47 +0200
> Takashi Iwai <tiwai@suse.de> wrote:
>
> > At Thu, 24 Jun 2004 13:29:00 +0200,
> > Andi Kleen wrote:
> > >
> > > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
> > > > allocated pages are out of dma mask, just like in pci-gart.c?
> > > > (with ifdef x86-64)
> > >
> > > That won't work reliably enough in extreme cases.
> >
> > Well, it's not perfect, but it'd be far better than GFP_DMA only :)
>
> The only description for this patch I can think of is "russian roulette".

Even if we have a bigger DMA zone, there's no guarantee that the
obtained page is precisely within the given mask. We can hardly define
zones fine-grained enough for all the different 24-, 28-, 29-, 30- and
31-bit DMA masks.


My patch for i386 works well in most cases, because such a device is
usually found on older machines with less memory than the DMA mask
covers.

Without the patch, the allocation is always <16MB and may fail even for
a small number of pages.


Takashi

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 14:58 ` Takashi Iwai
@ 2004-06-24 15:29   ` Andrea Arcangeli
  2004-06-24 15:48     ` Nick Piggin
  2004-06-24 16:04     ` Takashi Iwai
  1 sibling, 2 replies; 70+ messages in thread
From: Andrea Arcangeli @ 2004-06-24 15:29 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel

On Thu, Jun 24, 2004 at 04:58:24PM +0200, Takashi Iwai wrote:
> At Thu, 24 Jun 2004 16:42:58 +0200,
> Andi Kleen wrote:
> >
> > On Thu, 24 Jun 2004 16:36:47 +0200
> > Takashi Iwai <tiwai@suse.de> wrote:
> >
> > > At Thu, 24 Jun 2004 13:29:00 +0200,
> > > Andi Kleen wrote:
> > > >
> > > > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
> > > > > allocated pages are out of dma mask, just like in pci-gart.c?
> > > > > (with ifdef x86-64)
> > > >
> > > > That won't work reliably enough in extreme cases.
> > >
> > > Well, it's not perfect, but it'd be far better than GFP_DMA only :)
> >
> > The only description for this patch I can think of is "russian roulette".
>
> Even if we have a bigger DMA zone, there's no guarantee that the
> obtained page is precisely within the given mask. We can hardly define
> zones fine-grained enough for all the different 24-, 28-, 29-, 30- and
> 31-bit DMA masks.
>
>
> My patch for i386 works well in most cases, because such a device is
> usually found on older machines with less memory than the DMA mask
> covers.
>
> Without the patch, the allocation is always <16MB and may fail even for
> a small number of pages.

why does it fail? note that with the lower_zone_reserve_ratio algorithm
I added to 2.4, the whole dma zone will be reserved for __GFP_DMA
allocations, so you should have trouble only with 2.6; 2.4 should work
fine.

So with latest 2.4 it has to fail only if you already allocated 16M with
pci_alloc_consistent, which sounds unlikely.

the fact 2.6 lacks the lower_zone_reserve_ratio algorithm is a different
issue, but I'm confident there's no other possible algorithm to solve
this memory balancing problem completely, so there's no way around a
forward port.

well 2.6 has a tiny hack like some older 2.4 that attempts to do what
lower_zone_reserve_ratio does, but it's not nearly enough; there's no
per-zone-point-of-view watermark in 2.6 etc. 2.6 actually has a more
hardcoded hack for highmem, but the lower_zone_reserve_ratio has
absolutely nothing to do with highmem vs lowmem. it's by pure
coincidence that it keeps highmem machines from locking up without swap,
but the very same problem happens on x86-64 with lowmem vs dma.

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 15:29 ` Andrea Arcangeli
@ 2004-06-24 15:48   ` Nick Piggin
  2004-06-24 16:52     ` Andrea Arcangeli
  2004-06-24 17:39     ` Andrea Arcangeli
  1 sibling, 2 replies; 70+ messages in thread
From: Nick Piggin @ 2004-06-24 15:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Takashi Iwai, Andi Kleen, ak, tripperda, discuss, linux-kernel

Andrea Arcangeli wrote:
>
> why does it fail? note that with the lower_zone_reserve_ratio algorithm
> I added to 2.4, the whole dma zone will be reserved for __GFP_DMA
> allocations, so you should have trouble only with 2.6; 2.4 should work
> fine.
>
> So with latest 2.4 it has to fail only if you already allocated 16M with
> pci_alloc_consistent, which sounds unlikely.
>
> the fact 2.6 lacks the lower_zone_reserve_ratio algorithm is a different
> issue, but I'm confident there's no other possible algorithm to solve
> this memory balancing problem completely, so there's no way around a
> forward port.
>
> well 2.6 has a tiny hack like some older 2.4 that attempts to do what
> lower_zone_reserve_ratio does, but it's not nearly enough; there's no
> per-zone-point-of-view watermark in 2.6 etc. 2.6 actually has a more
> hardcoded hack for highmem, but the lower_zone_reserve_ratio has
> absolutely nothing to do with highmem vs lowmem. it's by pure
> coincidence that it keeps highmem machines from locking up without swap,
> but the very same problem happens on x86-64 with lowmem vs dma.

2.6 has the "incremental min" thing. What is wrong with that?
Though I think it is turned off by default.

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 15:48 ` Nick Piggin
@ 2004-06-24 16:52   ` Andrea Arcangeli
  2004-06-24 16:56     ` William Lee Irwin III
  0 siblings, 1 reply; 70+ messages in thread
From: Andrea Arcangeli @ 2004-06-24 16:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Takashi Iwai, Andi Kleen, ak, tripperda, discuss, linux-kernel

On Fri, Jun 25, 2004 at 01:48:47AM +1000, Nick Piggin wrote:
> Andrea Arcangeli wrote:
> >
> > why does it fail? note that with the lower_zone_reserve_ratio algorithm
> > I added to 2.4, the whole dma zone will be reserved for __GFP_DMA
> > allocations, so you should have trouble only with 2.6; 2.4 should work
> > fine.
> >
> > So with latest 2.4 it has to fail only if you already allocated 16M with
> > pci_alloc_consistent, which sounds unlikely.
> >
> > the fact 2.6 lacks the lower_zone_reserve_ratio algorithm is a different
> > issue, but I'm confident there's no other possible algorithm to solve
> > this memory balancing problem completely, so there's no way around a
> > forward port.
> >
> > well 2.6 has a tiny hack like some older 2.4 that attempts to do what
> > lower_zone_reserve_ratio does, but it's not nearly enough; there's no
> > per-zone-point-of-view watermark in 2.6 etc. 2.6 actually has a more
> > hardcoded hack for highmem, but the lower_zone_reserve_ratio has
> > absolutely nothing to do with highmem vs lowmem. it's by pure
> > coincidence that it keeps highmem machines from locking up without
> > swap, but the very same problem happens on x86-64 with lowmem vs dma.
>
> 2.6 has the "incremental min" thing. What is wrong with that?
> Though I think it is turned off by default.

sysctl_lower_zone_protection is an inferior implementation of the
lower_zone_reserve_ratio, inferior because it has no way to give a
different balance to each zone. As you said, it's turned off by default,
so it had no tuning. The lower_zone_reserve_ratio has already been tuned
in 2.4. Somebody can attempt a conversion, but it'll never be equal,
since lower_zone_reserve_ratio is a superset of what
sysctl_lower_zone_protection can do.

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 16:52 ` Andrea Arcangeli
@ 2004-06-24 16:56   ` William Lee Irwin III
  2004-06-24 17:32     ` Andrea Arcangeli
  2004-06-24 21:54     ` Andrew Morton
  0 siblings, 2 replies; 70+ messages in thread
From: William Lee Irwin III @ 2004-06-24 16:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Takashi Iwai, Andi Kleen, ak, tripperda, discuss,
	linux-kernel

On Fri, Jun 25, 2004 at 01:48:47AM +1000, Nick Piggin wrote:
>> 2.6 has the "incremental min" thing. What is wrong with that?
>> Though I think it is turned off by default.

On Thu, Jun 24, 2004 at 06:52:01PM +0200, Andrea Arcangeli wrote:
> sysctl_lower_zone_protection is an inferior implementation of the
> lower_zone_reserve_ratio, inferior because it has no way to give a
> different balance to each zone. As you said, it's turned off by default,
> so it had no tuning. The lower_zone_reserve_ratio has already been tuned
> in 2.4. Somebody can attempt a conversion, but it'll never be equal,
> since lower_zone_reserve_ratio is a superset of what
> sysctl_lower_zone_protection can do.

Is there any chance you could send in this improved implementation of
zone fallback watermarks and describe the deficiencies in the current
scheme that it corrects? Thanks.

-- wli

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 16:56 ` William Lee Irwin III
@ 2004-06-24 17:32   ` Andrea Arcangeli
  2004-06-24 17:38     ` William Lee Irwin III
  0 siblings, 1 reply; 70+ messages in thread
From: Andrea Arcangeli @ 2004-06-24 17:32 UTC (permalink / raw)
  To: William Lee Irwin III, Nick Piggin, Takashi Iwai, Andi Kleen, ak,
	tripperda, discuss, linux-kernel

On Thu, Jun 24, 2004 at 09:56:29AM -0700, William Lee Irwin III wrote:
> On Fri, Jun 25, 2004 at 01:48:47AM +1000, Nick Piggin wrote:
> >> 2.6 has the "incremental min" thing. What is wrong with that?
> >> Though I think it is turned off by default.
>
> On Thu, Jun 24, 2004 at 06:52:01PM +0200, Andrea Arcangeli wrote:
> > sysctl_lower_zone_protection is an inferior implementation of the
> > lower_zone_reserve_ratio, inferior because it has no way to give a
> > different balance to each zone. As you said, it's turned off by default,
> > so it had no tuning. The lower_zone_reserve_ratio has already been tuned
> > in 2.4. Somebody can attempt a conversion, but it'll never be equal,
> > since lower_zone_reserve_ratio is a superset of what
> > sysctl_lower_zone_protection can do.
>
> Is there any chance you could send in this improved implementation of
> zone fallback watermarks and describe the deficiencies in the current
> scheme that it corrects?

I did, quite a few times, and it was successfully merged in 2.4. Now I'd
need to forward port it to 2.6.

I recall I recommended Andrew to merge the lower_zone_reserve_ratio at
some point during 2.5 or early 2.6, but apparently he implemented this
other thing called sysctl_lower_zone_protection. Note that now that I
look more into it, it seems sysctl_lower_zone_protection and
lower_zone_reserve_ratio have very little in common; I'm glad
sysctl_lower_zone_protection is disabled. sysctl_lower_zone_protection
is just an improvement to the algorithm I dropped from 2.4 when
lowmem_zone_reserve_ratio was merged.

So in short, enabling sysctl_lower_zone_protection won't help;
sysctl_lower_zone_protection should be dropped entirely and replaced
with lower_zone_reserve_ratio.

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 17:32 ` Andrea Arcangeli
@ 2004-06-24 17:38   ` William Lee Irwin III
  2004-06-24 18:02     ` Andrea Arcangeli
  0 siblings, 1 reply; 70+ messages in thread
From: William Lee Irwin III @ 2004-06-24 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Takashi Iwai, Andi Kleen, ak, tripperda, discuss,
	linux-kernel

On Thu, Jun 24, 2004 at 07:32:36PM +0200, Andrea Arcangeli wrote:
> I did, quite a few times, and it was successfully merged in 2.4. Now I'd
> need to forward port it to 2.6.
> I recall I recommended Andrew to merge the lower_zone_reserve_ratio at
> some point during 2.5 or early 2.6, but apparently he implemented this
> other thing called sysctl_lower_zone_protection. Note that now that I
> look more into it, it seems sysctl_lower_zone_protection and
> lower_zone_reserve_ratio have very little in common; I'm glad
> sysctl_lower_zone_protection is disabled. sysctl_lower_zone_protection
> is just an improvement to the algorithm I dropped from 2.4 when
> lowmem_zone_reserve_ratio was merged. So in short, enabling
> sysctl_lower_zone_protection won't help; sysctl_lower_zone_protection
> should be dropped entirely and replaced with lower_zone_reserve_ratio.

Could you refer me to an online source (e.g. Message-Id or URL) where
the deficiencies in the incremental min and/or lower_zone_protection
that the zone-to-zone watermarks address are described in detail?

-- wli

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 17:38 ` William Lee Irwin III
@ 2004-06-24 18:02   ` Andrea Arcangeli
  2004-06-24 18:13     ` William Lee Irwin III
  0 siblings, 1 reply; 70+ messages in thread
From: Andrea Arcangeli @ 2004-06-24 18:02 UTC (permalink / raw)
  To: William Lee Irwin III, Nick Piggin, Takashi Iwai, Andi Kleen, ak,
	tripperda, discuss, linux-kernel

On Thu, Jun 24, 2004 at 10:38:27AM -0700, William Lee Irwin III wrote:
> On Thu, Jun 24, 2004 at 07:32:36PM +0200, Andrea Arcangeli wrote:
> > I did, quite a few times, and it was successfully merged in 2.4. Now I'd
> > need to forward port it to 2.6.
> > I recall I recommended Andrew to merge the lower_zone_reserve_ratio at
> > some point during 2.5 or early 2.6, but apparently he implemented this
> > other thing called sysctl_lower_zone_protection. Note that now that I
> > look more into it, it seems sysctl_lower_zone_protection and
> > lower_zone_reserve_ratio have very little in common; I'm glad
> > sysctl_lower_zone_protection is disabled. sysctl_lower_zone_protection
> > is just an improvement to the algorithm I dropped from 2.4 when
> > lowmem_zone_reserve_ratio was merged. So in short, enabling
> > sysctl_lower_zone_protection won't help; sysctl_lower_zone_protection
> > should be dropped entirely and replaced with lower_zone_reserve_ratio.
>
> Could you refer me to an online source (e.g. Message-Id or URL) where
> the deficiencies in the incremental min and/or lower_zone_protection
> that the zone-to-zone watermarks address are described in detail?

I've been talking to Andrew about this very issue since december 2002,
so I mostly gave up except for a few reminders like this one today.

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=20021206145718.GL1567%40dualathlon.random&prev=/groups%3Fq%3Dlinus%2Bgoogle%2Bfix%2Bmin%2Bwatermarks%26hl%3D

I'm confident that as people start to run into the zone imbalance with
2.6, and as google upgrades to 2.6, eventually lowmem_zone_reserve_ratio
will be forward ported from 2.4.26 to 2.6. I'm not the guy with >4G of
ram anyways, so it won't be myself having troubles with this ;).
Furthermore if you have some swap, the VM can normally relocate the
stuff (you have to be quite unlucky to be filled by pure ptes in the
lowmem zone, but it can happen too; certainly not in my or Andrew's
boxes, where we have no more than 2M of ptes allocated at any time).

I already tried to merge this in a preventive way, without a real-life
case of somebody cracking down on this trouble like it happened in 2.4,
but now I'll only react if somebody has a real-life case again in 2.6.
This lowmem vs dma zone thing would be helped very significantly by the
lowmem_reserve_ratio, and that's why I bring up this issue right now and
not one month ago. This is a matter of fact: with my algorithm the dma
zone would be completely preserved for __GFP_DMA allocations on the big
x86-64 boxes, guaranteeing that no DMA zone will be wasted on ptes or
similar stuff that can very well go in the higher zones.

The "how many bytes" question in my above email is now addressed by
sysctl_lower_zone_protection, but that's still a very weak answer, since
it doesn't work for dissimilar imbalances across different classzones
(i.e. huge dma, tiny lowmem, and even smaller highmem), and furthermore
it requires people to tune it by themselves from userspace, and they
cannot tune it as well as lowmem_reserve_ratio would, since it's a fixed
sysctl for all classzone-against-classzone imbalances.

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 18:02 ` Andrea Arcangeli
@ 2004-06-24 18:13   ` William Lee Irwin III
  2004-06-24 18:27     ` Andrea Arcangeli
  0 siblings, 1 reply; 70+ messages in thread
From: William Lee Irwin III @ 2004-06-24 18:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Takashi Iwai, Andi Kleen, ak, tripperda, discuss,
	linux-kernel

On Thu, Jun 24, 2004 at 08:02:56PM +0200, Andrea Arcangeli wrote:
> I've been talking to Andrew about this very issue since december 2002,
> so I mostly gave up except for a few reminders like this one today.
> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=20021206145718.GL1567%40dualathlon.random&prev=/groups%3Fq%3Dlinus%2Bgoogle%2Bfix%2Bmin%2Bwatermarks%26hl%3D
> I'm confident that as people start to run into the zone imbalance with
> 2.6, and as google upgrades to 2.6, eventually lowmem_zone_reserve_ratio
> will be forward ported from 2.4.26 to 2.6. I'm not the guy with >4G of
> ram anyways, so it won't be myself having troubles with this ;).
> Furthermore if you have some swap, the VM can normally relocate the
> stuff (you have to be quite unlucky to be filled by pure ptes in the
> lowmem zone, but it can happen too; certainly not in my or Andrew's
> boxes, where we have no more than 2M of ptes allocated at any time).

This sounds like the more precise fix would be enforcing a stricter
fallback criterion for pinned allocations. Pinned userspace would need
zone migration if it's done selectively like this. Thanks.

-- wli

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 18:13 ` William Lee Irwin III
@ 2004-06-24 18:27   ` Andrea Arcangeli
  2004-06-24 18:50     ` William Lee Irwin III
  0 siblings, 1 reply; 70+ messages in thread
From: Andrea Arcangeli @ 2004-06-24 18:27 UTC (permalink / raw)
  To: William Lee Irwin III, Nick Piggin, Takashi Iwai, Andi Kleen, ak,
	tripperda, discuss, linux-kernel

On Thu, Jun 24, 2004 at 11:13:11AM -0700, William Lee Irwin III wrote:
> This sounds like the more precise fix would be enforcing a stricter
> fallback criterion for pinned allocations. Pinned userspace would need
> zone migration if it's done selectively like this.

yes, and "the stricter fallback criterion" is precisely called
lower_zone_reserve_ratio; it's included in the 2.4 mainline kernel, and
this "stricter fallback criterion" doesn't exist in 2.6 yet.

I do apply it to non-pinned pages too, because wasting tons of cpu on
memcopies for migration is a bad idea compared to reserving 900M of
absolutely critical lowmem ram on a 64G box. So I find the
pinned/unpinned parameter worthless and I apply "the stricter fallback
criterion" to all allocations in the same way, which is a lot simpler,
doesn't require substantial vm changes to allow migration of ptes,
anonymous and mlocked memory w/o passing through some swapcache and
without clearing ptes, and, most important, I believe it's a lot more
efficient than migrating with bulk memcopies. Even on a big x86-64,
dealing with the migration complexity isn't worthwhile; reserving the
full 16M of dma zone makes a lot more sense.

The lower_zone_reserve_ratio algorithm scales with the size of the
zones, autotuned at boot time, and the balance settings are functions of
the imbalances found at boot time. That's the fundamental difference
from the sysctl, which is fixed for all zones and has no clue about the
size of the zones etc.

So in short: with little ram installed it will behave like mainline 2.6;
with tons of ram installed it will make a huge difference, and it will
reserve up to _whole_ classzones for the users that cannot use the
higher zones. But 16M on a 16G box is nothing, so nobody will notice any
regression; only the benefits will be noticeable in the otherwise
unsolvable corner cases (yeah, you could try to migrate ptes and other
stuff to solve them, but that's incredibly inefficient compared to
throwing 16M or 800M at the problem on 16G or 64G machines
respectively, etc.).

the numbers aren't mathematically exact with the 2.4 code, but you get
an idea of the order of magnitude.

BTW, I think I'm not the only VM guy who agrees this algo is needed. For
instance I recall Rik once included the lower_zone_reserve_ratio patch
in one of his 2.4 patches too.

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 18:27 ` Andrea Arcangeli @ 2004-06-24 18:50 ` William Lee Irwin III 0 siblings, 0 replies; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 18:50 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Takashi Iwai, Andi Kleen, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 08:27:37PM +0200, Andrea Arcangeli wrote: > yes and "the stricter fallback criterion" is precisely called > lower_zone_reserve_ratio and it's included in the 2.4 mainline kernel > and this "stricter fallback criterion" doesn't exist in 2.6 yet. > I do apply it to non-pinned pages too because wasting tons of cpu in > memcopies for migration is a bad idea compared to reseving 900M of > absolutely critical lowmem ram on a 64G box. So I find the > pinned/unpinned parameter worthless and I apply "the stricter fallback > criterion" to all allocations in the same way, which is a lot simpler, > doesn't require substantial vm changes to allow migration of ptes, > anonymous and mlocked memory w/o passing through some swapcache and > without clearng ptes and most important I believe it's a lot more > efficient than migrating with bulk memcopies. Even on a big x86-64 > dealing with the migration complexity is worthless, reserving the full > 16M of dma zone makes a lot more sense. Not sure what's going on here. I suppose I had different expectations, e.g. not attempting to relocate kernel allocations, but rather failing them outright after the threshold is exceeded. No matter, it just saves me the trouble of implementing it. I understood the migration to be a method of last resort, not preferred to admission control. On Thu, Jun 24, 2004 at 08:27:37PM +0200, Andrea Arcangeli wrote: > The lower_zone_reserve_ratio algorithm scales back to the size of the > zones automatically autotuned at boot time and the balance-setting are in > functions of the imbalances found at boot time. 
That's the fundamental > difference with the sysctl that is fixed, for all zones, and it has no > clue on the size of the zones etc... I wasn't involved with this, so unfortunately I don't have an explanation of why these semantics were considered useful. On Thu, Jun 24, 2004 at 08:27:37PM +0200, Andrea Arcangeli wrote: > So in short with little ram installed it will be like mainline 2.6, with > tons of ram installed it will make a huge difference and it will > reserve up to _whole_ classzones to the users that cannot use the higher > zones, but 16M on a 16G box is nothing so nobody will notice any > regression anyways, only the benefits will be noticeable in the otherwise > unsolvable corner cases (yeah, you could try to migrate ptes and other > stuff to solve them but that's incredibly inefficient compared to > throwing 16M or 800M at the problem on respectively 16G or 64G machines, > etc..). > the numbers aren't mathematically exact with the 2.4 code, but you get an idea of > the order of magnitude. This sounds like you're handing back hard allocation failures to unpinned allocations when zone fallbacks are meant to be discouraged. Given this, I think I understand where some of the concerns about merging it came from, though I'd certainly rather have underutilized memory than workload failures. I suspect one concern about this is that it may cause premature workload failures. My own review of the code has determined this to be a minor concern. Rather, I believe it's better to fail the allocations earlier than to allow the workload to slowly accumulate pinned pages in lower zones, even at the cost of underutilizing lower zones. This belief may not be universal. On Thu, Jun 24, 2004 at 08:27:37PM +0200, Andrea Arcangeli wrote: > BTW, I think I'm not the only VM guy who agrees this algo is needed, > for instance I recall Rik once included the lower_zone_reserve_ratio > patch in one of his 2.4 patches too. 
One of the reasons I've not seen this in practice is that the stress tests I'm running aren't going on for extended periods of time, where fallback of pinned allocations to lower zones would be a progressively more noticeable problem as they accumulate. -- wli ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 16:56 ` William Lee Irwin III 2004-06-24 17:32 ` Andrea Arcangeli @ 2004-06-24 21:54 ` Andrew Morton 2004-06-24 22:08 ` William Lee Irwin III ` (2 more replies) 1 sibling, 3 replies; 70+ messages in thread From: Andrew Morton @ 2004-06-24 21:54 UTC (permalink / raw) To: William Lee Irwin III Cc: andrea, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel William Lee Irwin III <wli@holomorphy.com> wrote: > > Is there any chance you could send in this improved implementation of > zone fallback watermarks and describe the deficiencies in the current > scheme that it corrects? We decided earlier this year that the watermark stuff should be forward-ported in toto, but I don't recall why. Nobody got around to doing it because there have been no bug reports. It irks me that the 2.4 algorithm gives away a significant amount of pagecache memory. It's a relatively small amount, but it's still a lot of memory, and all the 2.6 users out there at present are not reporting problems, so we should not penalise all those people on behalf of the few people who might need this additional fallback protection. It should be runtime tunable - that doesn't seem hard to do. All the infrastructure is there now to do this. Note that this code was significantly changed between 2.6.5 and 2.6.7. First thing to do is to identify some workload which needs the patch. Without that, how can we test it? ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 21:54 ` Andrew Morton @ 2004-06-24 22:08 ` William Lee Irwin III 2004-06-24 22:45 ` Andrea Arcangeli 2004-06-24 22:11 ` Andrew Morton 2004-06-24 22:21 ` Andrea Arcangeli 2 siblings, 1 reply; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 22:08 UTC (permalink / raw) To: Andrew Morton Cc: andrea, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 02:54:41PM -0700, Andrew Morton wrote: > We decided earlier this year that the watermark stuff should be > forward-ported in toto, but I don't recall why. Nobody got around to doing > it because there have been no bug reports. > It irks me that the 2.4 algorithm gives away a significant amount of > pagecache memory. It's a relatively small amount, but it's still a lot of > memory, and all the 2.6 users out there at present are not reporting > problems, so we should not penalise all those people on behalf of the few > people who might need this additional fallback protection. > It should be runtime tunable - that doesn't seem hard to do. All the > infrastructure is there now to do this. > Note that this code was significantly changed between 2.6.5 and 2.6.7. > First thing to do is to identify some workload which needs the patch. > Without that, how can we test it? That does sound troublesome, especially since it's difficult to queue up the kinds of extended stress tests needed to demonstrate the problems. The prolonged memory pressure and so on are things that we've unfortunately had to wait until extended runtime in production to see. 
=( The underutilization bit is actually why I keep going on and on about the pinned pagecache relocation; it resolves a portion of the problem of pinned pages in lower zones without underutilizing RAM, then once pinned user pages can arbitrarily utilize lower zones, pinned kernel allocations (which would not be relocatable) can be denied fallback entirely without overall underutilization. I've actually already run out of ideas here, so people just saying what they want me to write might help. Tests can be easily contrived (e.g. fill a swapless box's upper zones with file-backed pagecache, then start allocating anonymous pages), but realistic situations are much harder to trigger. -- wli ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:08 ` William Lee Irwin III @ 2004-06-24 22:45 ` Andrea Arcangeli 2004-06-24 22:51 ` William Lee Irwin III 0 siblings, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 22:45 UTC (permalink / raw) To: William Lee Irwin III, Andrew Morton, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 03:08:23PM -0700, William Lee Irwin III wrote: > The prolonged memory pressure and so on are things that we've > unfortunately had to wait until extended runtime in production to see. =( Luckily this problem doesn't fall in this scenario and it's trivial to reproduce if you've >= 2G of ram. I still have here the testcase google sent me years ago when this problem first saw the light during 2.4.1x. They used mlock, but it's even simpler to reproduce it with a single malloc + bzero (note: no mlock). The few mbytes of lowmem left won't last long if you load some big app after that. > The underutilization bit is actually why I keep going on and on about > the pinned pagecache relocation; it resolves a portion of the problem > of pinned pages in lower zones without underutilizing RAM, then once I also don't like the underutilization but I believe it's a price everybody has to pay if you buy x86. On x86-64 the cost of the insurance is much lower, max 16M wasted, and absolutely nothing wasted if you've an amd system (all amd systems have a real iommu that avoids having to mess with the physical ram addresses). it's like a health insurance: you can avoid paying it but it might not turn out to be a good idea for everyone not to pay for it. At least you should give the choice to the people to be able to pay for it and to have it, and the sysctl is not going to work. It's relatively very cheap as Andrew said, if you've very few mbytes of lowmem you're going to pay very few kbytes for it. 
But I think we should force everyone to have it like I did in 2.4 and absolutely nobody complained, in fact if anything somebody could complain _without_ it. Sure nobody cares about 800M of ram on a 64G machine when they risk a swap-slowdown (and vfs caches overshrink) and in the worst case a lockup without swap without the "insurance". I don't think one should be forced to have swap on a 64G box if the userspace apps have a very well defined high bound of ram utilization. There will always be a limit anyways that is ram+swap, so ideally if we had infinite money it would _always_ be better to replace swap with more ram and to never have swap, swap still makes sense only because disk is still cheaper than ram (watch MRAM). So a VM that destabilizes without swap is not a VM that I can avoid fixing and to me it remains a major bug even if nobody will ever notice it because we don't have that much cheap ram yet. About the ability to tune it at least at boot time, I always wanted it and I added the setup_lower_zone_reserve parameter, but that is parsed too late, so it doesn't work due to a minor implementation detail ;), just as setup_mem_frac apparently doesn't work either. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:45 ` Andrea Arcangeli @ 2004-06-24 22:51 ` William Lee Irwin III 2004-06-24 23:09 ` Andrew Morton 0 siblings, 1 reply; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 22:51 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Fri, Jun 25, 2004 at 12:45:29AM +0200, Andrea Arcangeli wrote: > Luckily this problem doesn't fall in this scenario and it's trivial to > reproduce if you've >= 2G of ram. I still have here the testcase google > sent me years ago when this problem seen the light during 2.4.1x. They > used mlock, but it's even simpler to reproduce it with a single malloc + > bzero (note: no mlock). The few mbytes of lowmem left won't last long if > you load some big app after that. Well, there are magic numbers here we need to explain to get a testcase runnable on more machines than just x86 boxen with exactly 2GB RAM. Where do the 2GB and 1GB come from? Is it that 1GB is the size of the upper zone? -- wli ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:51 ` William Lee Irwin III @ 2004-06-24 23:09 ` Andrew Morton 2004-06-24 23:15 ` William Lee Irwin III 2004-06-25 2:39 ` Andrea Arcangeli 0 siblings, 2 replies; 70+ messages in thread From: Andrew Morton @ 2004-06-24 23:09 UTC (permalink / raw) To: William Lee Irwin III Cc: andrea, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel William Lee Irwin III <wli@holomorphy.com> wrote: > > On Fri, Jun 25, 2004 at 12:45:29AM +0200, Andrea Arcangeli wrote: > > Luckily this problem doesn't fall in this scenario and it's trivial to > > reproduce if you've >= 2G of ram. I still have here the testcase google > > sent me years ago when this problem first saw the light during 2.4.1x. They > > used mlock, but it's even simpler to reproduce it with a single malloc + > > bzero (note: no mlock). The few mbytes of lowmem left won't last long if > > you load some big app after that. > > Well, there are magic numbers here we need to explain to get a testcase > runnable on more machines than just x86 boxen with exactly 2GB RAM. > Where do the 2GB and 1GB come from? Is it that 1GB is the size of the > upper zone? >

A testcase would be, on a 2G box:

a) free up as much memory as you can
b) write a 1.2G file to fill highmem with pagecache
c) malloc(800M), bzero(), sleep
d) swapoff -a

You now have a box which has almost all of lowmem pinned in anonymous memory. It'll limp along and go oom fairly easily. Another testcase would be:

a) free up as much memory as you can
b) write a 1.2G file to fill highmem with pagecache
c) malloc(800M), mlock it

You now have most of lowmem mlocked. In both situations the machine is really sick. Probably the most risky scenario is a swapless machine in which lots of lowmem is allocated to anonymous memory. It should be the case that increasing lower_zone_protection will fix all the above. If not, it needs fixing. So we're down to the question "what should we default to at bootup". 
I find it hard to justify defaulting to a mode where we're super-defensive against this sort of thing, simply because nobody seems to be hitting the problems. Distributors can, if they must, bump lower_zone_protection in initscripts, and it's presumably pretty simple to write a boot script which parses /proc/meminfo's MemTotal and SwapTotal lines, producing an appropriate lower_zone_protection setting. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 23:09 ` Andrew Morton @ 2004-06-24 23:15 ` William Lee Irwin III 2004-06-25 6:16 ` William Lee Irwin III 2004-06-25 2:39 ` Andrea Arcangeli 1 sibling, 1 reply; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 23:15 UTC (permalink / raw) To: Andrew Morton Cc: andrea, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 04:09:45PM -0700, Andrew Morton wrote:

> A testcase would be, on a 2G box:
> a) free up as much memory as you can
> b) write a 1.2G file to fill highmem with pagecache
> c) malloc(800M), bzero(), sleep
> d) swapoff -a
> You now have a box which has almost all of lowmem pinned in anonymous
> memory. It'll limp along and go oom fairly easily.
> Another testcase would be:
> a) free up as much memory as you can
> b) write a 1.2G file to fill highmem with pagecache
> c) malloc(800M), mlock it
> You now have most of lowmem mlocked.

These are approximately identical to the testcases I had in mind, except neither of these is truly specific to 2GB and can have the various magic numbers calculated from sysconf() and/or meminfo. On Thu, Jun 24, 2004 at 04:09:45PM -0700, Andrew Morton wrote: > In both situations the machine is really sick. Probably the most risky > scenario is a swapless machine in which lots of lowmem is allocated to > anonymous memory. > It should be the case that increasing lower_zone_protection will fix all > the above. If not, it needs fixing. > So we're down to the question "what should we default to at bootup". I find > it hard to justify defaulting to a mode where we're super-defensive against > this sort of thing, simply because nobody seems to be hitting the problems. > Distributors can, if they must, bump lower_zone_protection in initscripts, > and it's presumably pretty simple to write a boot script which parses > /proc/meminfo's MemTotal and SwapTotal lines, producing an appropriate > lower_zone_protection setting. 
I'm going to beat on this in short order, but will be indisposed for an hour or two before that begins. Thanks. -- wli ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 23:15 ` William Lee Irwin III @ 2004-06-25 6:16 ` William Lee Irwin III 0 siblings, 0 replies; 70+ messages in thread From: William Lee Irwin III @ 2004-06-25 6:16 UTC (permalink / raw) To: Andrew Morton, andrea, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel /* On Thu, Jun 24, 2004 at 04:09:45PM -0700, Andrew Morton wrote: >> A testcase would be, on a 2G box: >> a) free up as much memory as you can >> b) write a 1.2G file to fill highmem with pagecache >> c) malloc(800M), bzero(), sleep >> d) swapoff -a >> You now have a box which has almost all of lowmem pinned in anonymous >> memory. It'll limp along and go oom fairly easily. >> Another testcase would be: >> a) free up as much memory as you can >> b) write a 1.2G file to fill highmem with pagecache >> c) malloc(800M), mlock it >> You now have most of lowmem mlocked. On Thu, Jun 24, 2004 at 04:15:49PM -0700, William Lee Irwin III wrote: > These are approximately identical to the testcases I had in mind, except > neither of these is truly specific to 2GB and can have the various magic > numbers calculated from sysconf() and/or meminfo. It seems that glibc is fucking with sysinfo or something; hackish workaround was to call sysconf(_SC_PAGESIZE) by hand for where mem_unit would otherwise be needed and to treat the screwed-with sysinfo fields as being in opaque units. Blame Uli. At any rate, the result of running this with no swap online appears to be that this just results in OOM kills whenever enough lowmem is needed. This is expected, as the anonymous allocations aren't mlocked, so with swap online, they would merely be swapped out, and with swap offline, the nr_swap_pages deadlock is no longer possible (the nr_swap_pages fix wasn't in place for this testing). Something more sophisticated may have worse effects. However, there were apparent oddities with premature failures of vma allocations and piss poor vma merging observed. 
For instance, the sbrk()/mmap() changeover logic to fall back on a per-iteration basis is largely because sticking to mmap() and then changing over to sbrk() when it fails switches over prematurely, and so failed to sufficiently utilize lowmem. The failures to find the free areas for the vmas went away after alternating between sbrk() and mmap(). Also, the 64KB mmap()'s of the file aren't merged at all, despite being very very blatantly sequential. I'll look into this. The strategy of mmap()'ing locked pagecache is useless for PAE boxen in general and so things should be taught to, say, mount ramfs, allocate ramfs pagecache to fill highmem, and then go on to mmap() instead of fiddling around mmap()'ing and mlock()'ing pagecache. I can implement this if it's deemed necessary to have the testcase extensible to PAE. The results are mixed. It's not clear that this behavior is pathological, at least not in the manner Andrea described. It is, however, easy to trigger workload failure as opposed to kernel deadlock. It may help to clarify the general position on that kind of issue so I know how and whether that should be addressed. 
$ cat /proc/meminfo
MemTotal:      1032988 kB
MemFree:        106684 kB
Buffers:          3804 kB
Cached:          16256 kB
SwapCached:          0 kB
Active:         897104 kB
Inactive:         2708 kB
HighTotal:      130816 kB
HighFree:       101388 kB
LowTotal:       902172 kB
LowFree:          5296 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:             108 kB
Writeback:           0 kB
Mapped:         881912 kB
Slab:            18276 kB
Committed_AS:   911496 kB
PageTables:       1896 kB
VmallocTotal:   114680 kB
VmallocUsed:      2160 kB
VmallocChunk:   105244 kB
$ cat /proc/buddyinfo
Node 0, zone      DMA      0   0   1   1   1   0   1   1   1   0   0
Node 0, zone   Normal     56  14  59   2   3   0   1   1   1   0   0
Node 0, zone  HighMem    777 315 349 360 505 236  61   1   0   0   0
*/

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <unistd.h>
#include <stdlib.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/sysinfo.h>

#define LENGTH_STEP ((off64_t)pagesize << 4)
#define MAX_RETRIES 64

#ifdef DEBUG
#define dprintf(fmt, arg...) printf(fmt, ##arg)
#else
#define dprintf(fmt, arg...) do { } while (0)
#endif

#define die() \
do { \
	fprintf(stderr, "failure %s (%d) at %s:%d\n", \
		strerror(errno), errno, __FILE__, __LINE__); \
	fflush(stderr); \
	sleep(60); \
	exit(EXIT_FAILURE); \
} while (0)

int main(void)
{
	struct sysinfo info;
	char namebuf[64] = "/tmp/zoneDoS_XXXXXX";
	int i, fd, retries;
	off64_t len = 0;
	unsigned long *first, *last, *p, *first_buf, *last_buf, *q;
	unsigned long freehigh, freelow;
	long pagesize;

	first = last = NULL;
	first_buf = last_buf = NULL;
	if ((pagesize = sysconf(_SC_PAGESIZE)) < 0)
		die();
	if ((fd = mkstemp(namebuf)) < 0)
		die();
	if (unlink(namebuf))
		die();
	if (sysinfo(&info))
		die();
	retries = freehigh = 0;
	while (info.freehigh && retries < MAX_RETRIES) {
		if (ftruncate64(fd, len + LENGTH_STEP))
			die();
		p = mmap(NULL, LENGTH_STEP, PROT_READ|PROT_WRITE,
			 MAP_SHARED, fd, len);
		if (p == MAP_FAILED)
			die();
		len += LENGTH_STEP;
		if (mlock(p, LENGTH_STEP))
			die();
		*p = 0;
		if (last)
			*last = (unsigned long)p;
		last = p;
		if (!first)
			first = p;
		freehigh = info.freehigh;
		if (sysinfo(&info))
			die();
		if (info.freehigh >= freehigh)
			retries++;
		else
			retries = 0;
		dprintf("allocated %lu kB, freehigh = %lu kB\n",
			(unsigned long)(len >> 10),
			(unsigned long)(info.freehigh >> 10));
	}
	if (sysinfo(&info))
		die();
	retries = freelow = 0;
	while (info.freeram - info.freehigh && retries < MAX_RETRIES) {
		/*
		 * MAP_PRIVATE was missing in the original posting; plain
		 * MAP_ANONYMOUS fails with EINVAL, so every iteration
		 * would have fallen back to sbrk().
		 */
		q = mmap(NULL, LENGTH_STEP, PROT_READ|PROT_WRITE,
			 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
		if (q == MAP_FAILED)
			q = sbrk(LENGTH_STEP);
		if (q == MAP_FAILED) {
			sleep(1);
			++retries;
			continue;
		}
		for (i = 0; i < LENGTH_STEP/sizeof(*q); i += pagesize/sizeof(*q))
			q[i + 1] = 1;
		*q = 0;
		if (last_buf)
			*last_buf = (unsigned long)q;
		last_buf = q;
		if (!first_buf)
			first_buf = q;
		freelow = info.freeram - info.freehigh;
		if (sysinfo(&info))
			die();
		if (info.freeram - info.freehigh >= freelow)
			++retries;
		else
			retries = 0;
		dprintf("freelow = %lu kB\n",
			(info.freeram - info.freehigh) >> 10);
	}
	dprintf("done allocating anonymous memory, freeing pagecache\n");
	while (first) {
		p = first;
		first = (unsigned long *)(*first);
		if (munmap(p, LENGTH_STEP))
			die();
	}
	close(fd);
	pause();
	return EXIT_SUCCESS;
}
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 23:09 ` Andrew Morton 2004-06-24 23:15 ` William Lee Irwin III @ 2004-06-25 2:39 ` Andrea Arcangeli 2004-06-25 2:47 ` Andrew Morton 1 sibling, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-25 2:39 UTC (permalink / raw) To: Andrew Morton Cc: William Lee Irwin III, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 04:09:45PM -0700, Andrew Morton wrote: > this sort of thing, simply because nobody seems to be hitting the problems. nobody is hitting the problems because if this problem triggers the machine starts slowly swapping and shrinking the vfs and it eventually relocates the highmem. the crippling of the vfs caches as well isn't a good thing and it will not be noticeable by anybody. If they were truly running without swap they would be hitting these problems very fast. But everybody has swap. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-25 2:39 ` Andrea Arcangeli @ 2004-06-25 2:47 ` Andrew Morton 2004-06-25 3:19 ` Andrea Arcangeli 0 siblings, 1 reply; 70+ messages in thread From: Andrew Morton @ 2004-06-25 2:47 UTC (permalink / raw) To: Andrea Arcangeli Cc: wli, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel Andrea Arcangeli <andrea@suse.de> wrote: > > On Thu, Jun 24, 2004 at 04:09:45PM -0700, Andrew Morton wrote: > > this sort of thing, simply because nobody seems to be hitting the problems. > > nobody is hitting the problems because if this problem triggers the > machine starts slowly swapping and shrinking the vfs and it eventually > relocates the highmem. the crippling of the vfs caches as well isn't > a good thing and it will not be noticeable by anybody. Good point, that. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-25 2:47 ` Andrew Morton @ 2004-06-25 3:19 ` Andrea Arcangeli 0 siblings, 0 replies; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-25 3:19 UTC (permalink / raw) To: Andrew Morton Cc: wli, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel if you want to leave it disabled that's still fine with me as long as it can be enabled in an optimal way (the one I like as usual is the 256/32 ratios of 2.4 ;), but I'm quite convinced that it will provide benefit even if enabled, possibly with bigger ratios if you want less "guaranteed" waste. as usual if one doesn't want any ram and performance waste, x86-64 is out there in production, and it'll avoid all the waste (unless you care about wasting 16M of ram on a 4G box without the risk of failing order 0 dma allocations on the intel implementation). If one wants to go cheap and buy x86 still then he must be prepared to potentially lose 900M of ram on a 32G box, it's a relative cost, so the more ram the more memory will be potentially wasted, the less ram the less ram will be potentially wasted. the most frequent x86 highmem complaints I ever got were related to running _out_ of the lowmem zone with the lowmem zone _empty_. The day I will get a complaint for the lowmem being completely _free_ has yet to come ;). thanks a lot for all the help. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 21:54 ` Andrew Morton 2004-06-24 22:08 ` William Lee Irwin III @ 2004-06-24 22:11 ` Andrew Morton 2004-06-24 23:09 ` Andrea Arcangeli 2004-06-24 22:21 ` Andrea Arcangeli 2 siblings, 1 reply; 70+ messages in thread From: Andrew Morton @ 2004-06-24 22:11 UTC (permalink / raw) To: wli, andrea, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel Andrew Morton <akpm@osdl.org> wrote: > > Note that this code was significantly changed between 2.6.5 and 2.6.7.

Here's the default setup on a 1G ia32 box:

DMA free:4172kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB
        protections[]: 8 476 540
Normal free:54632kB min:936kB low:1872kB high:2808kB active:278764kB inactive:253668kB present:901120kB
        protections[]: 0 468 532
HighMem free:308kB min:128kB low:256kB high:384kB active:87972kB inactive:40300kB present:130516kB
        protections[]: 0 0 64

ie:

- protect 8 pages from ZONE_DMA from a GFP_DMA allocation attempt
- protect 476 pages from ZONE_DMA from a GFP_KERNEL allocation attempt
- protect 540 pages from ZONE_DMA from a GFP_HIGHMEM allocation attempt.

etcetera. After setting lower_zone_protection to 10:

Active:111515 inactive:65009 dirty:116 writeback:0 unstable:0 free:3290 slab:75489 mapped:52247 pagetables:446
DMA free:4172kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB
        protections[]: 8 5156 5860
Normal free:8736kB min:936kB low:1872kB high:2808kB active:352780kB inactive:224972kB present:901120kB
        protections[]: 0 468 1172
HighMem free:252kB min:128kB low:256kB high:384kB active:93280kB inactive:35064kB present:130516kB
        protections[]: 0 0 64

It's a bit complex, and perhaps the relative levels of the various thresholds could be tightened up. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:11 ` Andrew Morton @ 2004-06-24 23:09 ` Andrea Arcangeli 2004-06-25 1:17 ` Nick Piggin 0 siblings, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 23:09 UTC (permalink / raw) To: Andrew Morton Cc: wli, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 03:11:30PM -0700, Andrew Morton wrote: > After setting lower_zone_protection to 10: > > Active:111515 inactive:65009 dirty:116 writeback:0 unstable:0 free:3290 slab:75489 mapped:52247 pagetables:446 > DMA free:4172kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB > protections[]: 8 5156 5860 > Normal free:8736kB min:936kB low:1872kB high:2808kB active:352780kB inactive:224972kB present:901120kB > protections[]: 0 468 1172 > HighMem free:252kB min:128kB low:256kB high:384kB active:93280kB inactive:35064kB present:130516kB > protections[]: 0 0 64 > > It's a bit complex, and perhaps the relative levels of the various > thresholds could be tightened up. 
this is the algorithm I added to 2.4 to produce good protection levels (with lower_zone_reserve_ratio supposedly tunable at boot time):

static int lower_zone_reserve_ratio[MAX_NR_ZONES-1] = { 256, 32 };

	zone->watermarks[j].min = mask;
	zone->watermarks[j].low = mask*2;
	zone->watermarks[j].high = mask*3;
	/* now set the watermarks of the lower zones in the "j" classzone */
	for (idx = j-1; idx >= 0; idx--) {
		zone_t * lower_zone = pgdat->node_zones + idx;
		unsigned long lower_zone_reserve;

		if (!lower_zone->size)
			continue;

		mask = lower_zone->watermarks[idx].min;
		lower_zone->watermarks[j].min = mask;
		lower_zone->watermarks[j].low = mask*2;
		lower_zone->watermarks[j].high = mask*3;

		/* now the brainer part */
		lower_zone_reserve = realsize / lower_zone_reserve_ratio[idx];
		lower_zone->watermarks[j].min += lower_zone_reserve;
		lower_zone->watermarks[j].low += lower_zone_reserve;
		lower_zone->watermarks[j].high += lower_zone_reserve;

		realsize += lower_zone->realsize;
	}

Your code must be inferior since it doesn't even allow tuning each zone differently (you seem not to have a lower_zone_reserve_ratio[idx]). Not sure why you don't simply forward port the code from 2.4 instead of reinventing it. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 23:09 ` Andrea Arcangeli @ 2004-06-25 1:17 ` Nick Piggin 2004-06-25 3:11 ` Andrea Arcangeli 0 siblings, 1 reply; 70+ messages in thread From: Nick Piggin @ 2004-06-25 1:17 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, wli, tiwai, ak, ak, tripperda, discuss, linux-kernel Andrea Arcangeli wrote: > Your code must be inferior since it doesn't even allow tuning each zone > differently (you seem not to have a lower_zone_reserve_ratio[idx]). Not sure > why you don't simply forward port the code from 2.4 instead of reinventing it. > It can easily be modified if required though. Is there a need to be tuning these different things? This is probably where we should hold back on the complexity until it is shown to improve something. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-25 1:17 ` Nick Piggin @ 2004-06-25 3:11 ` Andrea Arcangeli 0 siblings, 0 replies; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-25 3:11 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, wli, tiwai, ak, ak, tripperda, discuss, linux-kernel On Fri, Jun 25, 2004 at 11:17:25AM +1000, Nick Piggin wrote: > It can easily be modified if required though. Is there a need to be > tuning these different things? This is probably where we should hold I did tune them differently in 2.4 mainline at least. 256 ratio for dma and 32 ratio for lowmem, the lowmem is already quite critical in most machines with >2G of ram so its ratio should be lower than dma's. for example on 64bit you want the 16M of dma to be completely reserved only on machines with >4G of ram. The 256 dma ratio applies fine to 64bit archs, and the 32 never applies to 64bit archs and it only applies to the highmem boxes. the 256 and 32 numbers aren't random, they're calculated this way:

4096M of 64bit platform / 16M = 256
32G of 32bit platform / 1G = 32

That means with my 2.4 algorithm any 64bit machine with >4G has its whole dma zone reserved to __GFP_DMA. and at the same time any 32bit machine with 32G of ram doesn't allow anything but GFP_KERNEL to go in lowmem, this is fundamental. Now you may very well argue about the numbers not being perfect and this is still a bit hardcoded with the highmem issues in mind, but it would be possible to generalize it even more and I do see a benefit in not having a fixed number for both issues, and in keeping the bit of extra flexibility that the 2.4 code has over the 2.6 one. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 21:54 ` Andrew Morton 2004-06-24 22:08 ` William Lee Irwin III 2004-06-24 22:11 ` Andrew Morton @ 2004-06-24 22:21 ` Andrea Arcangeli 2004-06-24 22:36 ` Andrew Morton 2004-06-24 22:37 ` William Lee Irwin III 2 siblings, 2 replies; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 22:21 UTC (permalink / raw) To: Andrew Morton Cc: William Lee Irwin III, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 02:54:41PM -0700, Andrew Morton wrote: > First thing to do is to identify some workload which needs the patch. that's quite trivial, boot a 2G box, malloc(1G), bzero(1GB), swapoff -a, then the machine will lock up. Depending on the architecture (more precisely depending if it starts allocating ram from the end or from the start of the physical memory), you may have to load 1G of data into pagecache first, like reading from /dev/hda 1G (without closing the file) will work fine, then run the above malloc + bzero + swapoff. Most people will never report this because everybody has swap and they simply run a lot slower than they could run if they didn't need to pass through the swap device to relocate memory because memory would been allocated in the right place in the first place. this plus the various oom killer breakages that get dominated by the nr_swap_pages > 0 check, are the reasons 2.6 is unusable w/o swap. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:21 ` Andrea Arcangeli @ 2004-06-24 22:36 ` Andrew Morton 2004-06-24 23:15 ` Andrea Arcangeli 2004-06-24 22:37 ` William Lee Irwin III 1 sibling, 1 reply; 70+ messages in thread From: Andrew Morton @ 2004-06-24 22:36 UTC (permalink / raw) To: Andrea Arcangeli Cc: wli, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel Andrea Arcangeli <andrea@suse.de> wrote: > > On Thu, Jun 24, 2004 at 02:54:41PM -0700, Andrew Morton wrote: > > First thing to do is to identify some workload which needs the patch. > > that's quite trivial, boot a 2G box, malloc(1G), bzero(1GB), swapoff -a, > then the machine will lockup. Are those numbers correct? We won't touch swap at all with that test? ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:36 ` Andrew Morton @ 2004-06-24 23:15 ` Andrea Arcangeli 0 siblings, 0 replies; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 23:15 UTC (permalink / raw) To: Andrew Morton Cc: wli, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 03:36:12PM -0700, Andrew Morton wrote: > Andrea Arcangeli <andrea@suse.de> wrote: > > > > On Thu, Jun 24, 2004 at 02:54:41PM -0700, Andrew Morton wrote: > > > First thing to do is to identify some workload which needs the patch. > > > > that's quite trivial, boot a 2G box, malloc(1G), bzero(1GB), swapoff -a, > > then the machine will lockup. > > Are those numbers correct? We won't touch swap at all with that test? they are correct if the page allocator allocates memory starting from address 0 physical up to 2G in contiguous order (sometimes it allocates memory backwards instead, in which case you need to load say 900M in pagecache and then malloc 1.2G, worked fine for me in 2.4 before I fixed it at least). the malloc(1G) will pin the whole lowmem, then the box will be dead. oom killer won't kill the task, but the syscalls will all hang (they don't even return -ENOMEM because you loop forever, 2.4 at least was returning -ENOMEM). workaround is to add swap and to slow down to a crawl relocating ram at disk-seeking-speed and over-shrinking vfs caches, but nobody will notice something is going wrong then. Only swapoff -a will show that something is not going well. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:21 ` Andrea Arcangeli 2004-06-24 22:36 ` Andrew Morton @ 2004-06-24 22:37 ` William Lee Irwin III 2004-06-24 22:40 ` William Lee Irwin III 2004-06-24 23:21 ` Andrea Arcangeli 1 sibling, 2 replies; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 22:37 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel /* On Thu, Jun 24, 2004 at 02:54:41PM -0700, Andrew Morton wrote: >> First thing to do is to identify some workload which needs the patch. On Fri, Jun 25, 2004 at 12:21:50AM +0200, Andrea Arcangeli wrote: > that's quite trivial, boot a 2G box, malloc(1G), bzero(1GB), swapoff -a, > then the machine will lockup. > Depending on the architecture (more precisely depending if it starts > allocating ram from the end or from the start of the physical memory), > you may have to load 1G of data into pagecache first, like reading from > /dev/hda 1G (without closing the file) will work fine, then run the > above malloc + bzero + swapoff. > Most people will never report this because everybody has swap and they > simply run a lot slower than they could run if they didn't need to pass > through the swap device to relocate memory because memory would been allocated > in the right place in the first place. this plus the various oom killer > breakages that gets dominated by the nr_swap_pages > 0 check, are the > reasons 2.6 is unusable w/o swap. Have you tried with 2.6.7? The following program fails to trigger anything like what you've mentioned, though granted it was a 512MB allocation on a 1GB machine. swapoff(2) merely fails. 
*/

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <strings.h>
#include <sys/swap.h>

int main(int argc, char * const argv[])
{
	int i;
	long pagesize, physpages;
	size_t size;
	void *p;

	pagesize = sysconf(_SC_PAGE_SIZE);
	if (pagesize < 0) {
		perror("failed to determine pagesize");
		exit(EXIT_FAILURE);
	}
	physpages = sysconf(_SC_PHYS_PAGES);
	if (physpages < 0) {
		perror("failed to determine physical memory capacity");
		exit(EXIT_FAILURE);
	}
	if ((size_t)(physpages/2) > SIZE_MAX/pagesize) {
		fprintf(stderr, "insufficient virtualspace capacity\n");
		exit(EXIT_FAILURE);
	}
	size = (physpages/2)*pagesize;
	p = malloc(size);
	if (!p) {
		perror("allocation failure");
		exit(EXIT_FAILURE);
	}
	bzero(p, size);
	for (i = 1; i < argc; ++i) {
		if (swapoff(argv[i]))
			perror("swapoff failure");
		fprintf(stderr, "failed to offline %s\n", argv[i]);
		exit(EXIT_FAILURE);
	}
	return EXIT_SUCCESS;
}
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:37 ` William Lee Irwin III @ 2004-06-24 22:40 ` William Lee Irwin III 0 siblings, 0 replies; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 22:40 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel

/* On Thu, Jun 24, 2004 at 03:37:50PM -0700, William Lee Irwin III wrote:
> Have you tried with 2.6.7? The following program fails to trigger anything
> like what you've mentioned, though granted it was a 512MB allocation on
> a 1GB machine. swapoff(2) merely fails.

And after fixing a bug in the program, not even that fails: */

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <strings.h>
#include <sys/swap.h>

int main(int argc, char * const argv[])
{
	int i;
	long pagesize, physpages;
	size_t size;
	void *p;

	pagesize = sysconf(_SC_PAGE_SIZE);
	if (pagesize < 0) {
		perror("failed to determine pagesize");
		exit(EXIT_FAILURE);
	}
	physpages = sysconf(_SC_PHYS_PAGES);
	if (physpages < 0) {
		perror("failed to determine physical memory capacity");
		exit(EXIT_FAILURE);
	}
	if ((size_t)(physpages/2) > SIZE_MAX/pagesize) {
		fprintf(stderr, "insufficient virtualspace capacity\n");
		exit(EXIT_FAILURE);
	}
	size = (physpages/2)*pagesize;
	p = malloc(size);
	if (!p) {
		perror("allocation failure");
		exit(EXIT_FAILURE);
	}
	bzero(p, size);
	for (i = 1; i < argc; ++i) {
		if (swapoff(argv[i])) {
			perror("swapoff failure");
			fprintf(stderr, "failed to offline %s\n", argv[i]);
			exit(EXIT_FAILURE);
		}
	}
	return EXIT_SUCCESS;
}
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:37 ` William Lee Irwin III 2004-06-24 22:40 ` William Lee Irwin III @ 2004-06-24 23:21 ` Andrea Arcangeli 2004-06-24 23:45 ` William Lee Irwin III 1 sibling, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 23:21 UTC (permalink / raw) To: William Lee Irwin III, Andrew Morton, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 03:37:50PM -0700, William Lee Irwin III wrote: > /* > On Thu, Jun 24, 2004 at 02:54:41PM -0700, Andrew Morton wrote: > >> First thing to do is to identify some workload which needs the patch. > > On Fri, Jun 25, 2004 at 12:21:50AM +0200, Andrea Arcangeli wrote: > > that's quite trivial, boot a 2G box, malloc(1G), bzero(1GB), swapoff -a, > > then the machine will lockup. > > Depending on the architecture (more precisely depending if it starts > > allocating ram from the end or from the start of the physical memory), > > you may have to load 1G of data into pagecache first, like reading from > > /dev/hda 1G (without closing the file) will work fine, then run the > > above malloc + bzero + swapoff. > > Most people will never report this because everybody has swap and they > > simply run a lot slower than they could run if they didn't need to pass > > through the swap device to relocate memory because memory would been allocated > > in the right place in the first place. this plus the various oom killer > > breakages that gets dominated by the nr_swap_pages > 0 check, are the > > reasons 2.6 is unusable w/o swap. > > Have you tried with 2.6.7? The following program fails to trigger anything I've definitely not tried 2.6.7 and I'm also reading a 2.6.5 codebase. But you can sure trigger it if you run a big workload after the big allocation. > like what you've mentioned, though granted it was a 512MB allocation on > a 1GB machine. swapoff(2) merely fails. what you have to do is this: 1) swapoff -a (it must not fail!! 
it cannot fail if you run it first)
2) fill 130000K in pagecache, be very careful, not more than that, every mbyte matters
3) run your program and allocate 904000K!!! (not 512M!!!)
then keep using the machine until it locks up because it cannot relocate the anonymous memory from the 900M of lowmem to the 130M of highmem. But really I said you need >=2G to have a realistic chance of seeing it. So don't be alarmed that you cannot reproduce it on a 1G box by allocating 512M and with swap still enabled: you had none of the conditions that make it reproducible. I reproduced this dozens of times so I know how to reproduce it very well (admittedly not in 2.6 because nobody crashed on this yet). ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 23:21 ` Andrea Arcangeli @ 2004-06-24 23:45 ` William Lee Irwin III 0 siblings, 0 replies; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 23:45 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Fri, Jun 25, 2004 at 01:21:57AM +0200, Andrea Arcangeli wrote: > what you have to do is this: > 1) swapoff -a (it must not fail!! it cannot fail if you run it first) > 2) fill 130000K in pagecache, be very careful, not more than that, every > mbyte matters > 3) run your program and allocate 904000K!!! (not 512M!!!) > then keep using the machine until it locks up because it cannot relocate > the anonymous memory from the 900M of lowmem to the 130M of highmem. > But really I said you need >=2G to have a realistic chance of seeing it. > So don't be alarmed you cannot reproduce on a 1G box by allocating 512M > and with swap still enabled, you had none of the conditions that make it > reproducible. > I reproduced this dozens of times so I know how to reproduce it very > well (admittedly not in 2.6 because nobody crashed on this yet). This resembles the more sophisticated testcase I originally had in mind. I'll be out for a couple of hours and then I'll fix this up. -- wli ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 15:48 ` Nick Piggin 2004-06-24 16:52 ` Andrea Arcangeli @ 2004-06-24 17:39 ` Andrea Arcangeli 2004-06-24 17:53 ` William Lee Irwin III 1 sibling, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 17:39 UTC (permalink / raw) To: Nick Piggin Cc: Takashi Iwai, Andi Kleen, ak, tripperda, discuss, linux-kernel On Fri, Jun 25, 2004 at 01:48:47AM +1000, Nick Piggin wrote: > 2.6 has the "incremental min" thing. What is wrong with that? > Though I think it is turned off by default. I looked more into it and you can leave it turned off since it's not going to work. it's all a function of z->pages_* and those are _global_ for all the zones, and in turn they're absolutely meaningless. the algorithm has nothing in common with lowmem_reserve_ratio; the effect has a tiny bit of similarity, but the incremental min thing is so weak and so bad that it will either not help or it'll waste tons of memory. Furthermore you cannot set a sysctl value that works for all machines. The whole thing should be dropped and replaced with the fine production quality lowmem_reserve_ratio in 2.4.26+ (the only broken thing of lowmem_reserve_ratio is that it cannot be tuned, not even at boot time, a recompile is needed, but that's fixable to tune it at boot time, and in theory at runtime too, but the point is that no dynamic tuning is required with it). Please focus on this code of 2.4:

/*
 * We don't know if the memory that we're going to allocate will
 * be freeable or/and it will be released eventually, so to
 * avoid totally wasting several GB of ram we must reserve some
 * of the lower zone memory (otherwise we risk to run OOM on the
 * lower zones despite there's tons of freeable ram on the
 * higher zones).
 */

typedef struct zone_watermarks_s {
	unsigned long min, low, high;
} zone_watermarks_t;

zone_watermarks_t watermarks[MAX_NR_ZONES];

class_idx = zone_idx(classzone);

for (;;) {
	zone_t *z = *(zone++);
	if (!z)
		break;

	if (zone_free_pages(z, order) > z->watermarks[class_idx].low) {
		page = rmqueue(z, order);
		if (page)
			return page;
	}
}

zone->watermarks[j].min = mask;
zone->watermarks[j].low = mask*2;
zone->watermarks[j].high = mask*3;

/* now set the watermarks of the lower zones in the "j" classzone */
for (idx = j-1; idx >= 0; idx--) {
	zone_t * lower_zone = pgdat->node_zones + idx;
	unsigned long lower_zone_reserve;

	if (!lower_zone->size)
		continue;

	mask = lower_zone->watermarks[idx].min;
	lower_zone->watermarks[j].min = mask;
	lower_zone->watermarks[j].low = mask*2;
	lower_zone->watermarks[j].high = mask*3;

	/* now the brainer part */
	lower_zone_reserve = realsize / lower_zone_reserve_ratio[idx];
	lower_zone->watermarks[j].min += lower_zone_reserve;
	lower_zone->watermarks[j].low += lower_zone_reserve;
	lower_zone->watermarks[j].high += lower_zone_reserve;

	realsize += lower_zone->realsize;
}

The 2.6 algorithm controlled by the sysctl does nothing similar to the above. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 17:39 ` Andrea Arcangeli @ 2004-06-24 17:53 ` William Lee Irwin III 2004-06-24 18:07 ` Andrea Arcangeli 0 siblings, 1 reply; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 17:53 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Takashi Iwai, Andi Kleen, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 07:39:27PM +0200, Andrea Arcangeli wrote: > I looked more into it and you can leave it turned off since it's not > going to work. > it's all in functions of z->pages_* and those are _global_ for all the > zones, and in turn they're absolutely meaningless. > the algorithm has nothing in common with lowmem_reverse_ratio, the > effect has a tinybit of similarity but the incremntal min thing is so > weak and so bad that it will either not help or it'll waste tons of > memory. Furthemore you cannot set a sysctl value that works for all > machines. The whole thing should be dropped and replaced with the fine > production quality lowmem_reserve_ratio in 2.4.26+ > (the only broken thing of lowmem_reserve_ratio is that it cannot be > tuned, not even at boottime, a recompile is needed, but that's fixable > to tune it at boot time, and in theory at runtime too, but the point is > that no dyanmic tuning is required with it) > Please focus on this code of 2.4: There is mention of discrimination between pinned and unpinned allocations not being possible; I can arrange this for more comprehensive coverage if desired. Would you like this to be arranged, and if so, how would you like that to interact with the fallback heuristics? -- wli ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 17:53 ` William Lee Irwin III @ 2004-06-24 18:07 ` Andrea Arcangeli 2004-06-24 18:29 ` William Lee Irwin III 0 siblings, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 18:07 UTC (permalink / raw) To: William Lee Irwin III, Nick Piggin, Takashi Iwai, Andi Kleen, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 10:53:31AM -0700, William Lee Irwin III wrote: > On Thu, Jun 24, 2004 at 07:39:27PM +0200, Andrea Arcangeli wrote: > > I looked more into it and you can leave it turned off since it's not > > going to work. > > it's all in functions of z->pages_* and those are _global_ for all the > > zones, and in turn they're absolutely meaningless. > > the algorithm has nothing in common with lowmem_reverse_ratio, the > > effect has a tinybit of similarity but the incremntal min thing is so > > weak and so bad that it will either not help or it'll waste tons of > > memory. Furthemore you cannot set a sysctl value that works for all > > machines. The whole thing should be dropped and replaced with the fine > > production quality lowmem_reserve_ratio in 2.4.26+ > > (the only broken thing of lowmem_reserve_ratio is that it cannot be > > tuned, not even at boottime, a recompile is needed, but that's fixable > > to tune it at boot time, and in theory at runtime too, but the point is > > that no dyanmic tuning is required with it) > > Please focus on this code of 2.4: > > There is mention of discrimination between pinned and unpinned > allocations not being possible; I can arrange this for more > comprehensive coverage if desired. Would you like this to be arranged, > and if so, how would you like that to interact with the fallback > heuristics? how do you handle swapoff and mlock then? anonymous memory is pinned w/o swap. 
You'd have to relocate the stuff during the mlock or swapoff to obey the pin limits to make this work right, and it sounds quite complicated; it would hurt mlock performance a lot too (some big apps use mlock to page in tons of stuff w/o page faults). Note that the "pinned" thing in theory makes *perfect* sense, but it only makes sense on _top_ of lowmem_zone_reserve_ratio, it's not an alternative. When the page is pinned you obey the "lowmem_zone_reserve_ratio"; when it's _not_ pinned you absolutely ignore the lowmem_zone_reserve_ratio and go with the watermarks[curr_zone_idx] instead of the class_idx. But in practice I doubt it's worth it since I doubt you want to relocate pagecache and anonymous memory during swapoff/mlock. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 18:07 ` Andrea Arcangeli @ 2004-06-24 18:29 ` William Lee Irwin III 0 siblings, 0 replies; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 18:29 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Takashi Iwai, Andi Kleen, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 08:07:56PM +0200, Andrea Arcangeli wrote: > how do you handle swapoff and mlock then? anonymous memory is pinned w/o > swap. You've relocate the stuff during the mlock or swapoff to obey to > the pin limits to make this work right, and it sounds quite complicated > and it would hurt mlock performance a lot too (some big app uses mlock > to pagein w/o page faults tons of stuff). I don't have a predetermined answer to this. I can take suggestions (e.g. page migration) for a preferred implementation of how pinned userspace is to be handled, or refrain from discriminating between pinned and unpinned allocations as desired. Another possibility would be ignoring the mlocked status of userspace pages in situations where cross-zone migration would be considered necessary. On Thu, Jun 24, 2004 at 08:07:56PM +0200, Andrea Arcangeli wrote: > Note that the "pinned" thing in theory makes *perfect* sense, but it > only makes sense on _top_ of lowmem_zone_reserve_ratio, it's not an > alternative. > When the page is pinned you obey to the "lowmem_zone_reserve_ratio" when > it's _not_ pinned then you absolutely ignore the > lowmem_zone_reseve_ratio and you go with the watermarks[curr_zone_idx] > instead of the class_idx. > But in practice I doubt it worth it since I doubt you want to relocate > pagecache and anonymous memory during swapoff/mlock. I suspect that if it's worth it to migrate userspace memory between zones, it's only worthwhile to do so during page reclamation. 
The first idea that occurs to me is checking for how plentiful memory in upper zones is when a pinned userspace page in a lower zone is found on the LRU, and then migrating it as an alternative to outright eviction or ignoring its pinned status. I didn't actually think of it as an alternative, but as just feeding your algorithm the more precise information the comment implied it wanted. I'm basically just looking to get things as solid as possible, so I'm not wedded to a particular solution. If it's too unclear how to handle the situation when pinned allocations can be distinguished, I can just port the 2.4 fallback discouraging algorithm without extensions. -- wli ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 15:29 ` Andrea Arcangeli 2004-06-24 15:48 ` Nick Piggin @ 2004-06-24 16:04 ` Takashi Iwai 2004-06-24 17:16 ` Andrea Arcangeli 1 sibling, 1 reply; 70+ messages in thread From: Takashi Iwai @ 2004-06-24 16:04 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel At Thu, 24 Jun 2004 17:29:46 +0200, Andrea Arcangeli wrote: > > On Thu, Jun 24, 2004 at 04:58:24PM +0200, Takashi Iwai wrote: > > At Thu, 24 Jun 2004 16:42:58 +0200, > > Andi Kleen wrote: > > > > > > On Thu, 24 Jun 2004 16:36:47 +0200 > > > Takashi Iwai <tiwai@suse.de> wrote: > > > > > > > At Thu, 24 Jun 2004 13:29:00 +0200, > > > > Andi Kleen wrote: > > > > > > > > > > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the > > > > > > allocated pages are out of dma mask, just like in pci-gart.c? > > > > > > (with ifdef x86-64) > > > > > > > > > > That won't work reliable enough in extreme cases. > > > > > > > > Well, it's not perfect, but it'd be far better than GFP_DMA only :) > > > > > > The only description for this patch I can think of is "russian roulette" > > > > Even if we have a bigger DMA zone, it's no guarantee that the obtained > > page is precisely in the given mask. We can unlikely define zones > > fine enough for all different 24, 28, 29, 30 and 31bit DMA masks. > > > > > > My patch for i386 works well in most cases, because such a device is > > usually equipped on older machines with less memory than DMA mask. > > > > Without the patch, the allocation is always <16MB, may fail even small > > number of pages. > > why does it fail? note that with the lower_zone_reserve_ratio algorithm I > added to 2.4 all dma zone will be reserved for __GFP_DMA allocations so > you should have troubles only with 2.6, 2.4 should work fine. > So with latest 2.4 it has to fail only if you already allocated 16M with > pci_alloc_consistent which sounds unlikely. 
If a driver needs large contiguous (e.g. a couple of MB) pages and the memory is fragmented, it may still fail. But it's anyway very rare... However, 16MB isn't enough in some cases indeed. For example, the following devices are often problematic:
- SB Live (emu10k1): This needs many single pages for WaveTable synthesis per user's request (up to 128MB). It sets a 31bit DMA mask (sigh...)
- ES1968: This requires a 28bit DMA mask and a single big buffer for all PCM streams.
Also there are other devices with <32bit DMA masks, for example, 24bit (als4000, es1938, sonicvibes, azt3328), 28bit (ice1712, maestro3), 30bit (trident), 31bit (ali5451)... Takashi ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 16:04 ` Takashi Iwai @ 2004-06-24 17:16 ` Andrea Arcangeli 2004-06-24 18:33 ` Takashi Iwai 0 siblings, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 17:16 UTC (permalink / raw) To: Takashi Iwai; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 06:04:58PM +0200, Takashi Iwai wrote: > At Thu, 24 Jun 2004 17:29:46 +0200, > Andrea Arcangeli wrote: > > > > On Thu, Jun 24, 2004 at 04:58:24PM +0200, Takashi Iwai wrote: > > > At Thu, 24 Jun 2004 16:42:58 +0200, > > > Andi Kleen wrote: > > > > > > > > On Thu, 24 Jun 2004 16:36:47 +0200 > > > > Takashi Iwai <tiwai@suse.de> wrote: > > > > > > > > > At Thu, 24 Jun 2004 13:29:00 +0200, > > > > > Andi Kleen wrote: > > > > > > > > > > > > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the > > > > > > > allocated pages are out of dma mask, just like in pci-gart.c? > > > > > > > (with ifdef x86-64) > > > > > > > > > > > > That won't work reliable enough in extreme cases. > > > > > > > > > > Well, it's not perfect, but it'd be far better than GFP_DMA only :) > > > > > > > > The only description for this patch I can think of is "russian roulette" > > > > > > Even if we have a bigger DMA zone, it's no guarantee that the obtained > > > page is precisely in the given mask. We can unlikely define zones > > > fine enough for all different 24, 28, 29, 30 and 31bit DMA masks. > > > > > > > > > My patch for i386 works well in most cases, because such a device is > > > usually equipped on older machines with less memory than DMA mask. > > > > > > Without the patch, the allocation is always <16MB, may fail even small > > > number of pages. > > > > why does it fail? note that with the lower_zone_reserve_ratio algorithm I > > added to 2.4 all dma zone will be reserved for __GFP_DMA allocations so > > you should have troubles only with 2.6, 2.4 should work fine. 
> > So with latest 2.4 it has to fail only if you already allocated 16M with > > pci_alloc_consistent which sounds unlikely. > > If a driver needs large contiguous (e.g. a couple of MB) pages and the > memory is fragmented, it may still fail. But it's anyway very > rare... Yes. This is why I suggested to use GFP_KERNEL _after_ GFP_DMA has failed, not the other way around. As Andi said in big systems you're pretty much guaranteed that GFP_KERNEL will always fail. > However, 16MB isn't enough in some cases indeed. For example, the > following devices are often problematic: > > - SB Live (emu10k1) > This needs many single pages for WaveTable synthesis per user's > request (up to 128MB). It sets 31bit DMA mask (sigh...) then it may never work. If the lowmem below 4G is all allocated in anonymous memory and you've no swap, there's no way, absolutely no way to make the above work. I start to think you should fail insmod if the machine has more than 2^31 bytes of ram being used by the kernel. All we can do is to give it a chance to work, that is to call GFP_KERNEL _after_ GFP_DMA has failed, but again there's no guarantee that it will work, even if you've only a few gigs of ram. > - ES1968 > This requires 28bit DMA mask and a single big buffer for all PCM > streams. this is just the order > 0 issue. Note that 2.6 limits the defragmentation to order == 3, order 4 and higher are ""guaranteed"" to always fail, this wasn't the case in 2.4. 2.6 adds a few terrible hacks called __GFP_REPEAT and __GFP_NOFAIL, those are all as deadlock prone as order < 4 allocations. The basic deadlocks in 2.6 are due to the lack of a return value from try_to_free_pages: 2.6 has no clue when it made progress or not, it can only try to kill tasks when the highmem and swap are exhausted, but there are tons of other conditions where it can deadlock, including while confusing the oom killer with apps using mlock. 
> Also there are other devices with <32bit DMA masks, for example, 24bit > (als4000, es1938, sonicvibes, azt3328), 28bit (ice1712, maestro3), > 30bit (trident), 31bit (ali5451)... creating a GFP_PCI28 zone at _runtime_ only for the intel implementations that unfortunately lack an iommu might not be too bad. Note that one other relevant thing we can add (with O(N) complexity) is an alloc_pages_range() that walks the whole freelist by hand searching for anything in the physical range passed as parameter. But it would need to be used with care since it'd loop in kernel space for a long time. irq disabling timeouts may also trigger, so implementing it safely won't be trivial. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 17:16 ` Andrea Arcangeli @ 2004-06-24 18:33 ` Takashi Iwai 2004-06-24 18:44 ` Andrea Arcangeli 0 siblings, 1 reply; 70+ messages in thread From: Takashi Iwai @ 2004-06-24 18:33 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel At Thu, 24 Jun 2004 19:16:20 +0200, Andrea Arcangeli wrote: > > On Thu, Jun 24, 2004 at 06:04:58PM +0200, Takashi Iwai wrote: > > At Thu, 24 Jun 2004 17:29:46 +0200, > > Andrea Arcangeli wrote: > > > > > > On Thu, Jun 24, 2004 at 04:58:24PM +0200, Takashi Iwai wrote: > > > > At Thu, 24 Jun 2004 16:42:58 +0200, > > > > Andi Kleen wrote: > > > > > > > > > > On Thu, 24 Jun 2004 16:36:47 +0200 > > > > > Takashi Iwai <tiwai@suse.de> wrote: > > > > > > > > > > > At Thu, 24 Jun 2004 13:29:00 +0200, > > > > > > Andi Kleen wrote: > > > > > > > > > > > > > > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the > > > > > > > > allocated pages are out of dma mask, just like in pci-gart.c? > > > > > > > > (with ifdef x86-64) > > > > > > > > > > > > > > That won't work reliable enough in extreme cases. > > > > > > > > > > > > Well, it's not perfect, but it'd be far better than GFP_DMA only :) > > > > > > > > > > The only description for this patch I can think of is "russian roulette" > > > > > > > > Even if we have a bigger DMA zone, it's no guarantee that the obtained > > > > page is precisely in the given mask. We can unlikely define zones > > > > fine enough for all different 24, 28, 29, 30 and 31bit DMA masks. > > > > > > > > > > > > My patch for i386 works well in most cases, because such a device is > > > > usually equipped on older machines with less memory than DMA mask. > > > > > > > > Without the patch, the allocation is always <16MB, may fail even small > > > > number of pages. > > > > > > why does it fail? 
note that with the lower_zone_reserve_ratio algorithm I > > > added to 2.4 all dma zone will be reserved for __GFP_DMA allocations so > > > you should have troubles only with 2.6, 2.4 should work fine. > > > So with latest 2.4 it has to fail only if you already allocated 16M with > > > pci_alloc_consistent which sounds unlikely. > > > > If a driver needs large contiguous (e.g. a coule of MB) pages and the > > memory is fragmented, it may still fail. But it's anyway very > > rare... > > Yes. This is why I suggested to use GFP_KERNEL _after_ GFP_DMA has > failed, not the other way around. As Andi said in big systems you're > pretty much guaranteed that GFP_KERNEL will always fail. Ok. > > However, 16MB isn't enough in some cases indeed. For example, the > > following devices are often problematic: > > > > - SB Live (emu10k1) > > This needs many single pages for WaveTable synthesis per user's > > request (up to 128MB). It sets 31bit DMA mask (sigh...) > > then it may never work. If the lowmem below 4G is all allocated in > anonymous memory and you've no swap, there's no way, absolutely no way > to make the above work. I start to think you should fail insmod if the > machine has more than 2^31 bytes of ram being used by the kernel. > > All we can do is to give it a chance to work, that is to call GFP_KERNEL > _after_ GFP_DMA has failed, but again there's no guarantee that it will > work, even if you've only a few gigs of ram. Sure, in extreme cases, it can't work. But at least, it _may_ work better than using only GFP_DMA. And indeed it should (still) work on most of consumer PC boxes. The addition of another zone would help much better, though. Takashi ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 18:33 ` Takashi Iwai @ 2004-06-24 18:44 ` Andrea Arcangeli 2004-06-25 15:50 ` Takashi Iwai 0 siblings, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 18:44 UTC (permalink / raw) To: Takashi Iwai; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 08:33:02PM +0200, Takashi Iwai wrote: > Sure, in extreme cases, it can't work. But at least, it _may_ work > better than using only GFP_DMA. And indeed it should (still) work > on most of consumer PC boxes. The addition of another zone would help > much better, though. of course agreed. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 18:44 ` Andrea Arcangeli @ 2004-06-25 15:50 ` Takashi Iwai 2004-06-25 17:30 ` Andrea Arcangeli 0 siblings, 1 reply; 70+ messages in thread From: Takashi Iwai @ 2004-06-25 15:50 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel At Thu, 24 Jun 2004 20:44:47 +0200, Andrea Arcangeli wrote: > > On Thu, Jun 24, 2004 at 08:33:02PM +0200, Takashi Iwai wrote: > > Sure, in extreme cases, it can't work. But at least, it _may_ work > > better than using only GFP_DMA. And indeed it should (still) work > > on most of consumer PC boxes. The addition of another zone would help > > much better, though. > > of course agreed. The below is the new patch to follow your advice. thanks, Takashi --- linux-2.6.7/arch/i386/kernel/pci-dma.c-dist 2004-06-24 15:56:46.017473544 +0200 +++ linux-2.6.7/arch/i386/kernel/pci-dma.c 2004-06-25 17:43:42.509366917 +0200 @@ -23,11 +23,22 @@ void *dma_alloc_coherent(struct device * if (dev == NULL || (dev->coherent_dma_mask < 0xffffffff)) gfp |= GFP_DMA; + again: ret = (void *)__get_free_pages(gfp, get_order(size)); - if (ret != NULL) { + if (ret == NULL) { + if (dev && (gfp & GFP_DMA)) { + gfp &= ~GFP_DMA; + goto again; + } + } else { memset(ret, 0, size); *dma_handle = virt_to_phys(ret); + if (!(gfp & GFP_DMA) && + (((unsigned long)*dma_handle + size - 1) & ~(unsigned long)dev->coherent_dma_mask)) { + free_pages((unsigned long)ret, get_order(size)); + return NULL; + } } return ret; } ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-25 15:50 ` Takashi Iwai @ 2004-06-25 17:30 ` Andrea Arcangeli 2004-06-25 17:39 ` Takashi Iwai 0 siblings, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-25 17:30 UTC (permalink / raw) To: Takashi Iwai; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel On Fri, Jun 25, 2004 at 05:50:04PM +0200, Takashi Iwai wrote: > --- linux-2.6.7/arch/i386/kernel/pci-dma.c-dist 2004-06-24 15:56:46.017473544 +0200 > +++ linux-2.6.7/arch/i386/kernel/pci-dma.c 2004-06-25 17:43:42.509366917 +0200 > @@ -23,11 +23,22 @@ void *dma_alloc_coherent(struct device * > if (dev == NULL || (dev->coherent_dma_mask < 0xffffffff)) > gfp |= GFP_DMA; > > + again: > ret = (void *)__get_free_pages(gfp, get_order(size)); > > - if (ret != NULL) { > + if (ret == NULL) { > + if (dev && (gfp & GFP_DMA)) { > + gfp &= ~GFP_DMA; I would find it cleaner to use __GFP_DMA in the whole file; this is not about your changes, the previous code used GFP_DMA too. The issue is that if we change GFP_DMA to add a __GFP_HIGH or similar, the above will clear the other bitflags too. > + (((unsigned long)*dma_handle + size - 1) & ~(unsigned long)dev->coherent_dma_mask)) { > + free_pages((unsigned long)ret, get_order(size)); > + return NULL; > + } I would do the memset and setting of dma_handle after the above check. this approach looks fine, thanks. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-25 17:30 ` Andrea Arcangeli @ 2004-06-25 17:39 ` Takashi Iwai 2004-06-25 17:45 ` Andrea Arcangeli 0 siblings, 1 reply; 70+ messages in thread From: Takashi Iwai @ 2004-06-25 17:39 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel At Fri, 25 Jun 2004 19:30:46 +0200, Andrea Arcangeli wrote: > > On Fri, Jun 25, 2004 at 05:50:04PM +0200, Takashi Iwai wrote: > > --- linux-2.6.7/arch/i386/kernel/pci-dma.c-dist 2004-06-24 15:56:46.017473544 +0200 > > +++ linux-2.6.7/arch/i386/kernel/pci-dma.c 2004-06-25 17:43:42.509366917 +0200 > > @@ -23,11 +23,22 @@ void *dma_alloc_coherent(struct device * > > if (dev == NULL || (dev->coherent_dma_mask < 0xffffffff)) > > gfp |= GFP_DMA; > > > > + again: > > ret = (void *)__get_free_pages(gfp, get_order(size)); > > > > - if (ret != NULL) { > > + if (ret == NULL) { > > + if (dev && (gfp & GFP_DMA)) { > > + gfp &= ~GFP_DMA; > > I would find cleaner to use __GFP_DMA in the whole file, this is not > about your changes, previous code used GFP_DMA too. The issue is that if > we change GFP_DMA to add a __GFP_HIGH or similar, the above will clear > the other bitflags too. Indeed. > > > + (((unsigned long)*dma_handle + size - 1) & ~(unsigned long)dev->coherent_dma_mask)) { > > + free_pages((unsigned long)ret, get_order(size)); > > + return NULL; > > + } > > I would do the memset and setting of dma_handle after the above check. Yep. The below is the corrected version. Thanks! 
Takashi --- linux-2.6.7/arch/i386/kernel/pci-dma.c-dist 2004-06-24 15:56:46.017473544 +0200 +++ linux-2.6.7/arch/i386/kernel/pci-dma.c 2004-06-25 19:38:26.334210809 +0200 @@ -21,13 +21,24 @@ void *dma_alloc_coherent(struct device * gfp &= ~(__GFP_DMA | __GFP_HIGHMEM); if (dev == NULL || (dev->coherent_dma_mask < 0xffffffff)) - gfp |= GFP_DMA; + gfp |= __GFP_DMA; + again: ret = (void *)__get_free_pages(gfp, get_order(size)); - if (ret != NULL) { - memset(ret, 0, size); + if (ret == NULL) { + if (dev && (gfp & __GFP_DMA)) { + gfp &= ~__GFP_DMA; + goto again; + } + } else { *dma_handle = virt_to_phys(ret); + if (!(gfp & __GFP_DMA) && + (((unsigned long)*dma_handle + size - 1) & ~(unsigned long)dev->coherent_dma_mask)) { + free_pages((unsigned long)ret, get_order(size)); + return NULL; + } + memset(ret, 0, size); } return ret; } ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-25 17:39 ` Takashi Iwai @ 2004-06-25 17:45 ` Andrea Arcangeli 0 siblings, 0 replies; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-25 17:45 UTC (permalink / raw) To: Takashi Iwai; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel On Fri, Jun 25, 2004 at 07:39:19PM +0200, Takashi Iwai wrote: > Yep. The below is the corrected version. looks perfect thanks ;). ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 11:13 ` Takashi Iwai 2004-06-24 11:29 ` [discuss] " Andi Kleen @ 2004-06-24 14:45 ` Terence Ripperda 2004-06-24 15:41 ` Andrea Arcangeli 1 sibling, 1 reply; 70+ messages in thread From: Terence Ripperda @ 2004-06-24 14:45 UTC (permalink / raw) To: Takashi Iwai; +Cc: Andi Kleen, Terence Ripperda, discuss, linux-kernel, andrea On Thu, Jun 24, 2004 at 04:13:47AM -0700, tiwai@suse.de wrote: > > pci_alloc_consistent is limited to 16MB, but so far nobody has really > > complained about that. If that should be a real issue we can make > > it allocate from the swiotlb pool, which is usually 64MB (and can > > be made bigger at boot time) > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the > allocated pages are out of dma mask, just like in pci-gart.c? > (with ifdef x86-64) pci_alloc_consistent (at least on x86-64) does do this. one of the problems I've seen in experimentation is that GFP_KERNEL tends to allocate from the top of memory down. this means that all of the GFP_KERNEL allocations are > 32-bits, which forces us down to GFP_DMA and the < 16M allocations. I've mainly tested this after a cold boot, so after running for a while, GFP_KERNEL may hit good allocations a lot more. Thanks, Terence ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 14:45 ` Terence Ripperda @ 2004-06-24 15:41 ` Andrea Arcangeli 0 siblings, 0 replies; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 15:41 UTC (permalink / raw) To: Terence Ripperda; +Cc: Takashi Iwai, Andi Kleen, discuss, linux-kernel On Thu, Jun 24, 2004 at 09:45:51AM -0500, Terence Ripperda wrote: > On Thu, Jun 24, 2004 at 04:13:47AM -0700, tiwai@suse.de wrote: > > > pci_alloc_consistent is limited to 16MB, but so far nobody has really > > > complained about that. If that should be a real issue we can make > > > it allocate from the swiotlb pool, which is usually 64MB (and can > > > be made bigger at boot time) > > > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the > > allocated pages are out of dma mask, just like in pci-gart.c? > > (with ifdef x86-64) > > pci_alloc_consistent (at least on x86-64) does do this. one of the problems > I've seen in experimentation is that GFP_KERNEL tends to allocate from the > top of memory down. this means that all of the GFP_KERNEL allocations are > > 32-bits, which forces us down to GFP_DMA and the < 16M allocations. > > I've mainly tested this after a cold boot, so after running for a while, > GFP_KERNEL may hit good allocations a lot more. it's trivial to change the order in the freelist to allocate from lower addresses first, but the point is still that over time that will be random. the 16M must be reserved entirely for __GFP_DMA on any machine with >=1G of ram, and the lowmem_reserve_ratio algorithm accomplishes this and scales down the reserve ratio according to the balance between the lowmem and dma zones. I believe, if anything, you should try GFP_KERNEL after GFP_DMA fails, not the other way around. btw, 2.6 is even more efficient in shrinking and paging out the dma zone than 2.4 could be. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-23 23:46 ` 32-bit dma allocations on 64-bit platforms Andi Kleen 2004-06-24 11:13 ` Takashi Iwai @ 2004-06-24 15:44 ` Terence Ripperda 2004-06-24 16:15 ` [discuss] " Andi Kleen 2004-06-24 18:51 ` Andi Kleen 1 sibling, 2 replies; 70+ messages in thread From: Terence Ripperda @ 2004-06-24 15:44 UTC (permalink / raw) To: Andi Kleen; +Cc: Terence Ripperda, discuss, tiwai, linux-kernel, andrea On Wed, Jun 23, 2004 at 04:46:44PM -0700, ak@muc.de wrote: > pci_alloc_consistent is limited to 16MB, but so far nobody has really > complained about that. If that should be a real issue we can make > it allocate from the swiotlb pool, which is usually 64MB (and can > be made bigger at boot time) In all of the cases I've seen, it defaults to 4M. in swiotlb.c, io_tlb_nslabs defaults to 1024, * PAGE_SIZE == 4194304. > Would that work for you too BTW ? How much memory do you expect > to need? potentially. our currently pending release uses pci_map_sg, which relies on swiotlb for em64t systems. it "works", but we have some ugly hacks and were hoping to get away from using it (at least in its current form). here are some of the problems we encountered: probably the biggest problem is that the size is way too small for our needs (more on our memory usage shortly). this is compounded by the swiotlb code throwing a kernel panic when it can't allocate memory. and if the panic doesn't halt the machine, the routine returns a random value off the stack as the dma_addr_t. for this reason, we have an ugly hack that notices that swiotlb is enabled (just checks if swiotlb is set) and prints a warning to the user to bump up the size of the swiotlb to 16384, or 64M. also, the proper usage of using the bounce buffers and calling pci_dma_sync_* would be a performance killer for us. 
we stream a considerable amount of data to the gpu per second (on the order of 100s of Megs a second), so having to do an additional memcpy would reduce performance considerably, in some cases between 30-50%. for this reason, we detect when the dma_addr != phys_addr, and map the dma_addr directly to opengl to avoid the copy. I know this is ugly, and that's one of the things I'd really like to get away from. finally, our driver already uses a considerable amount of memory. by definition, the swiotlb interface doubles that memory usage. if our driver used swiotlb correctly (as in didn't know about swiotlb and always called pci_dma_sync_*), we'd lock down the physical addresses opengl writes to, since they're normally used directly for dma, plus the pages allocated from the swiotlb would be locked down (currently we manually do this, but if swiotlb is supposed to be transparent to the driver and used for dma, I assume it must already be locked down, perhaps by definition of being bootmem?). this means not only is the memory usage double, but it's all locked down and unpageable. in this case, it almost would make more sense to treat the bootmem allocated for swiotlb as a pool of 32-bit memory that can be directly allocated from, rather than as bounce buffers. I don't know that this would be an acceptable interface though. but if we could come up with reasonable solutions to these problems, this may work. > drawback is that the swiotlb pool is not unified with the rest of the > VM, so tying up too much memory there is quite unfriendly. > e.g. if you you can use up 1GB then i wouldn't consider this suitable, > for 128MB max it may be possible. I checked with our opengl developers on this. by default, we allocate ~64k for X's push buffer and ~1M per opengl client for their push buffer. on quadro/workstation parts, we allocate 20M for the first opengl client, then ~1M per client after that. 
in addition to the push buffer, there is a lot of data that apps dump to the push buffer. this includes textures, vertex buffers, display lists, etc. the amount of memory used for this varies greatly from app to app. the 20M listed above includes the push buffer and memory for these buffers (I think workstation apps tend to push a lot more pre-processed vertex data to the gpu). note that most agp apertures these days are in the 128M - 1024M range, and there are times that we exhaust that memory on the low end. I think our driver is greedy when trying to allocate memory for performance reasons, but has good fallback cases. so being somewhat limited on resources isn't too bad, just so long as the kernel doesn't panic instead of failing the memory allocation. I would think that 64M or 128M would be good. a nice feature of swiotlb is the ability to tune it at boot. so if a workstation user found they really did need more memory for performance, they could tweak that value up for themselves. also remember future growth. PCI-E has something like 20/24 lanes that can be split among multiple PCI-E slots. Alienware has already announced multi-card products, and it's likely multi-card products will be more readily available on PCI-E, since the slots should have equivalent bandwidth (unlike AGP+PCI). nvidia has also had workstation parts in the past with 2 gpus and a bridge chip. each of these gpus ran twinview, so each card drove 4 monitors. these were pci cards, and some crazy vendors had 4+ of these cards in a machine driving many monitors. this just pushes the memory requirements up in special circumstances. Thanks, Terence ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 15:44 ` Terence Ripperda @ 2004-06-24 16:15 ` Andi Kleen 2004-06-24 17:22 ` Andrea Arcangeli 2004-06-24 22:28 ` Terence Ripperda 1 sibling, 2 replies; 70+ messages in thread From: Andi Kleen @ 2004-06-24 16:15 UTC (permalink / raw) To: Terence Ripperda; +Cc: Andi Kleen, discuss, tiwai, linux-kernel, andrea On Thu, Jun 24, 2004 at 10:44:29AM -0500, Terence Ripperda wrote: > On Wed, Jun 23, 2004 at 04:46:44PM -0700, ak@muc.de wrote: > > pci_alloc_consistent is limited to 16MB, but so far nobody has really > > complained about that. If that should be a real issue we can make > > it allocate from the swiotlb pool, which is usually 64MB (and can > > be made bigger at boot time) > > In all of the cases I've seen, it defaults to 4M. in swiotlb.c, > io_tlb_nslabs defaults to 1024, * PAGE_SIZE == 4194304. Oops, that should probably be fixed. I think it was 64MB at some point ... 4MB is definitely far too small. > probably the biggest problem is that the size is way too small for our > needs (more on our memory usage shortly). this is compounded by the > the swiotlb code throwing a kernel panic when it can't allocate > memory. and if the panic doesn't halt the machine, the routine returns > a random value off the stack as the dma_addr_t. That sounds like a bug too. pci_map_sg should return 0 when it overflows. The gart iommu code will do that. I'll take a look, need to convince the IA64 people of any changes though (I just reused their code). Newer pci_map_single also got a "bad_dma_address" magic return value to check for this, but some also just panic. > also, the proper usage of using the bounce buffers and calling > pci_dma_sync_* would be a performance killer for us. 
we stream a > considerable amount of data to the gpu per second (on the order of > 100s of Megs a second), so having to do an additional memcpy would > reduce performance considerably, in some cases between 30-50%. Understood. > finally, our driver already uses a considerable amount of memory. by > definition, the swiotlb interface doubles that memory usage. if our > driver used swiotlb correctly (as in didn't know about swiotlb and > always called pci_dma_sync_*), we'd lock down the physical addresses > opengl writes to, since they're normally used directly for dma, plus > the pages allocated from the swiotlb would be locked down (currently > we manually do this, but if swiotlb is supposed to be transparent to > the driver and used for dma, I assume it must already be locked down, > perhaps by definition of being bootmem?). this means not only is the It's allocated once at boot and never freed or increased. (the reason is that these functions must all work inside spinlocks and cannot sleep, and you cannot do anything serious to the VM with that constraint) - arguably it would have been much nicer to pass them a GFP flag and do sleeping for bounce memory and GFP_KERNEL allocations etc.instead of the dumb panics on overflow. Maybe something for 2.7. > in this case, it almost would make more sense to treat the bootmem > allocated for swiotlb as a pool of 32-bit memory that can be directly > allocated from, rather than as bounce buffers. I don't know that this > would be an acceptable interface though. Ok, that was one of my proposals too (using it for pci_alloc_consistent). But again it would only help if the memory requirements are relatively moderate. > but if we could come up with reasonable solutions to these problems, > this may work. > > > drawback is that the swiotlb pool is not unified with the rest of the > > VM, so tying up too much memory there is quite unfriendly. > > e.g. 
if you you can use up 1GB then i wouldn't consider this suitable, > > for 128MB max it may be possible. > > I checked with our opengl developers on this. by default, we allocate > ~64k for X's push buffer and ~1M per opengl client for their push > buffer. on quadro/workstation parts, we allocate 20M for the first > opengl client, then ~1M per client after that. Oh, that sounds quite moderate. Ok, then we probably don't need the GFP_BIGDMA zone just for you. Great. > > in addition to the push buffer, there is a lot of data that apps dump > to the push buffer. this includes textures, vertex buffers, display > lists, etc. the amount of memory used for this varies greatly from app > to app. the 20M listed above includes the push buffer and memory for > these buffers (I think workstation apps tend to push a lot more > pre-processed vertex data to the gpu). Overall it sounds more like you need 128MB though - especially since we cannot give everything to you, but also still need some memory for SATA and other devices with limited addressing capability (fortunately they slowly get fixed now) I would prefer if the default value would work for most users because any special options are a very high support load. Do you think 64MB (minus other users so maybe 30-40MB in practice) would be still sufficient to give reasonable performance without hickups? > > note that most agp apertures these days are in the 128M - 1024M range, > and there are times that we exhaust that memory on the low end. I Yes, I have the same problem with the IOMMU. The IOMMU makes it actually worse, because it reserves half of the aperture (so you may get only 64MB IOMMU/AGP aperture in the worst case) But it can be increased in the BIOS and the kernel has code to get a larger aperture too) > think our driver is greedy when trying to allocate memory for > performance reasons, but has good fallback cases. 
so being somewhat > limited on resources isn't too bad, just so long as the kernel doesn't > panic instead of falling the memory allocation. Agreed, the panics should be made optional at least. I will take a look at doing this for swiotlb too. I like them as options though because for debugging it's better to get a clear panic than a weird malfunction. > also remember future growth. PCI-E has something like 20/24 lanes that > can be split among multiple PCI-E slots. Alienware has already > announced multi-card products, and it's likely multi-card products > will be more readily available on PCI-E, since the slots should have > equivalent bandwidth (unlike AGP+PCI). > > nvidia has also had workstation parts in the past with 2 gpus and a > bridge chip. each of these gpus ran twinview, so each card drove 4 > monitors. these were pci cards, and some crazy vendors had 4+ of these > cards in a machine driving many monitors. this just pushes the memory > requirements up in special circumstances. But why didn't you implement addressing capability for >32bit in your hardware then? I imagine the memory requirements won't stop at 4GB (or rather 2-3GB because not all phys mapping space below 4GB can be dedicated to graphics) It sounds a bit weird to have such extreme requirements and then cripple the hardware like this. Anyways - for such extreme applications i think it's perfectly reasonable to require the user to pass special boot options and tie up much memory. -Andi ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 16:15 ` [discuss] " Andi Kleen @ 2004-06-24 17:22 ` Andrea Arcangeli 2004-06-24 22:28 ` Terence Ripperda 1 sibling, 0 replies; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 17:22 UTC (permalink / raw) To: Andi Kleen; +Cc: Terence Ripperda, Andi Kleen, discuss, tiwai, linux-kernel On Thu, Jun 24, 2004 at 06:15:40PM +0200, Andi Kleen wrote: > reasonable to require the user to pass special boot options and > tie up much memory. the boot parameter will always work and it avoids a new zone. btw, if we linked the driver into the kernel no boot parameter would be necessary: when the hardware was discovered it could allocate its tons of memory with bootmem. But it sounds like there are too many drivers in trouble, so I believe we can't link them all. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 16:15 ` [discuss] " Andi Kleen 2004-06-24 17:22 ` Andrea Arcangeli @ 2004-06-24 22:28 ` Terence Ripperda 1 sibling, 0 replies; 70+ messages in thread From: Terence Ripperda @ 2004-06-24 22:28 UTC (permalink / raw) To: Andi Kleen Cc: Terence Ripperda, Andi Kleen, discuss, tiwai, linux-kernel, andrea On Thu, Jun 24, 2004 at 09:15:40AM -0700, ak@suse.de wrote: > I would prefer if the default value would work for most users > because any special options are a very high support load. > Do you think 64MB (minus other users so maybe 30-40MB in practice) > would be still sufficient to give reasonable performance without > hickups? that's what we're currently asking users to do for our current swiotlb code. we are seeing some hickups in ut2004, but I haven't investigated if this is related to limited memory resources (actually, it shouldn't be, as we'd have paniced instead of failing to allocate memory). I think I would push for 128M by default, just to make sure there's plenty. I don't think this should be too bad, since this would only kick in if the user has 4+ Gigs of memory, in which 128M is a small portion of the total. > Agreed, the panics should be made optional at least. I will > take a look at doing this for swiotlb too. I like > them as options though because for debugging it's better to get > a clear panic than a weird malfunction. it makes perfect sense to have a debugging option for that, it'd just be nice to have that not be the default. > But why didn't you implement addressing capability for >32bit > in your hardware then? I imagine the memory requirements won't > stop at 4GB (or rather 2-3GB because not all phys mapping > space below 4GB can be dedicated to graphics) I suspect the addressing capability is due to cost/die size tradeoffs. and I didn't mean to imply that these setups would be common, or really use that much additional memory. 
just pointing out that it's not uncommon to have some odd frankenstein setups that would use a little more memory than normal. you are correct that in these cases, a little more end user tweaking is acceptable. after talking to some of the other developers here, we wanted to re-inquire about the extra dma zone approach, and how feasible/acceptable that might be. one of the thoughts is that the swiotlb approach would probably be the easiest to get in place quickly, but that the dma zone approach would be more robust. we wouldn't need to set aside an allocation pool, there wouldn't need to be end user tweaking for the corner cases, etc. Thanks, Terence ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 15:44 ` Terence Ripperda 2004-06-24 16:15 ` [discuss] " Andi Kleen @ 2004-06-24 18:51 ` Andi Kleen 2004-06-26 4:58 ` David Mosberger 1 sibling, 1 reply; 70+ messages in thread From: Andi Kleen @ 2004-06-24 18:51 UTC (permalink / raw) To: Terence Ripperda; +Cc: Andi Kleen, discuss, tiwai, linux-kernel, andrea On Thu, Jun 24, 2004 at 10:44:29AM -0500, Terence Ripperda wrote: > On Wed, Jun 23, 2004 at 04:46:44PM -0700, ak@muc.de wrote: > > pci_alloc_consistent is limited to 16MB, but so far nobody has really > > complained about that. If that should be a real issue we can make > > it allocate from the swiotlb pool, which is usually 64MB (and can > > be made bigger at boot time) > > In all of the cases I've seen, it defaults to 4M. in swiotlb.c, > io_tlb_nslabs defaults to 1024, * PAGE_SIZE == 4194304. I checked this now. It's #define IO_TLB_SHIFT 11 static unsigned long io_tlb_nslabs = 1024; and the allocation does io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs * (1 << IO_TLB_SHIFT)); which contrary to its name does not allocate in pages (otherwise you would get 8GB of memory on x86-64 and even more on IA64) That's definitely far too small. A better IO_TLB_SHIFT would be 16 or 17. -Andi ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 18:51 ` Andi Kleen @ 2004-06-26 4:58 ` David Mosberger 0 siblings, 0 replies; 70+ messages in thread From: David Mosberger @ 2004-06-26 4:58 UTC (permalink / raw) To: Andi Kleen; +Cc: Terence Ripperda, discuss, tiwai, linux-kernel, andrea >>>>> On Thu, 24 Jun 2004 20:51:56 +0200, Andi Kleen <ak@muc.de> said: Andi> A better IO_TLB_SHIFT would be 16 or 17. Careful. I see code like this: stride = (1 << (PAGE_SHIFT - IO_TLB_SHIFT)); You probably don't want IO_TLB_SHIFT > PAGE_SHIFT... Increasing io_tlb_nslabs should be no problem though (subject to memory availability). It can already be set via the "swiotlb" option. I doubt swiotlb is the right thing here, though, given the bw-demands of graphics. Too bad Nvidia cards don't support > 32 bit addressability and Intel chipsets don't support I/O MMUs... --david ^ permalink raw reply [flat|nested] 70+ messages in thread
[parent not found: <2akPm-16l-65@gated-at.bofh.it>]
* Re: 32-bit dma allocations on 64-bit platforms [not found] <2akPm-16l-65@gated-at.bofh.it> @ 2004-06-23 21:46 ` Andi Kleen 2004-06-24 6:18 ` Arjan van de Ven 0 siblings, 1 reply; 70+ messages in thread From: Andi Kleen @ 2004-06-23 21:46 UTC (permalink / raw) To: Terence Ripperda; +Cc: discuss, tiwai, linux-kernel Terence Ripperda <tripperda@nvidia.com> writes: [sending again with linux-kernel in cc] > I'm working on cleaning up some of our dma allocation code to properly allocate 32-bit physical pages for dma on 64-bit platforms. I think our first pass at supporting em64t is sub-par. I'd like to fix that by using the correct kernel interfaces. I get from this that your hardware cannot DMA to >32bit. > > the physical addressing of memory allocations seems to boil down to the behavior of GFP_DMA and GFP_NORMAL. but there seems to be some disconnect between what these mean for each architecture and what various drivers expect them to mean. > > based on each architecture's paging_init routines, the zones look like this:
>
>                 x86:        ia64:    x86_64:
>   ZONE_DMA:     < 16M       < ~4G    < 16M
>   ZONE_NORMAL:  16M - ~1G   > ~4G    > 16M
>   ZONE_HIMEM:   1G+
>
> > an example of this disconnect is vmalloc_32. this function is obviously intended to allocate 32-bit addresses (this was specifically mentioned in a comment in 2.4.x header files). but vmalloc_32 is an inline routine that calls __vmalloc(GFP_KERNEL). based on the above zone descriptions, this will do the correct thing for x86, but not for ia64 or x86_64. on ia64, a driver could just use GFP_DMA for the desired behavior, but this doesn't work well for x86_64. > > AMD's x86_64 provides remapping > 32-bit pages through the iommu, but obviously Intel's em64t provides no such ability. based on the above zonings, this leaves us with the options of either relying on the swiotlb interfaces for dma, or relying on the isa memory for dma. 
> > the last day or two, I've been experimenting with using the pci_alloc_consistent interface, which uses the latter (note attached patch to fix an apparent memory leak in the x86_64 pci_alloc_consistent). unfortunately, this provides very little dma-able memory. In theory, up to 16 Megs, but in practice I'm only getting about 5 1/2 Megs. > > I was rather surprised by these limitations on allocating 32-bit addresses. I checked through the dri and bttv drivers, to see if they had dealt with these issues, and they did not appear to have done so. has anyone tested these drivers on ia64/x86_64/em64t platforms w/ 4+ Gigs of memory? > > are these limitations on allocating 32-bit addresses intentional and known? is there anything we can do to help improve this situation? help work on development? First, vmalloc_32 is a rather broken interface and should imho just be removed. The function name just gives promises that cannot be kept. It was always quite bogus. Please don't use it. The situation on EM64T is very unfortunate, I agree. There was a reason we asked AMD to add an IOMMU and it's quite bad that the Intel chipset people ignored that wisdom and put us into this compatibility mess. Failing that it would be best if the other PCI DMA hardware could just address enough memory, but that's less realistic than just fixing the chipset. The x86-64 port had decided early to keep the 16MB GFP_DMA zone to get maximum driver compatibility and because the AMD IOMMU gave us a nice alternative over bounce buffering. In theory I'm not totally against enlarging GFP_DMA a bit on x86-64. It would just be difficult to find a good value. The problem is that there may be existing drivers that rely on the 16MB limit, and it would not be very nice to break them. We got rid of a lot of them by disallowing CONFIG_ISA, but there may be some left. 
So before doing this it would need a full driver tree audit to check every device. The most prominent example used to be the floppy driver, but the current floppy driver seems to use some other way to get around this. There seem to be quite a few sound chipsets with DMA limits < 32bit; e.g. 29 bits seems to be quite common, but I see several 24bit PCI ones too. I must say I'm somewhat reluctant to break a working in-tree driver. Especially for the sake of an out of tree binary driver. Arguably the problem is probably not limited to you, but it's quite possible that even the in tree DRI drivers have it, so it would still be worth fixing. I see two somewhat realistic ways to handle this: - We enlarge GFP_DMA and find some way to do double buffering for these sound drivers (it would need a PCI-DMA API extension that always calls swiotlb for this) For sound that's not too bad, because they are relatively slow. It would require reserving bootmem memory early for the bounces, but I guess requiring the user to pass a special boot time parameter for these devices would be reasonable. If yes, someone would need to do this work. Also the question would be how large to make GFP_DMA. Ideally it should not be too big, so that e.g. 29bit devices don't require the bounce buffering. - We introduce multiple GFP_DMA zones and keep 16MB GFP_DMA and GFP_BIGDMA or somesuch for larger DMA. The VM should be able to handle this, but it may still require some tuning. It would need some generic changes, but not too bad. Still would need a decision on how big GFP_BIGDMA should be. I suspect 4GB would be too big again. Comments? -Andi ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-23 21:46 ` Andi Kleen @ 2004-06-24 6:18 ` Arjan van de Ven 2004-06-24 10:33 ` Andi Kleen 2004-06-24 13:48 ` Jesse Barnes 0 siblings, 2 replies; 70+ messages in thread
From: Arjan van de Ven @ 2004-06-24 6:18 UTC (permalink / raw)
To: Andi Kleen; +Cc: Terence Ripperda, discuss, tiwai, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 636 bytes --]

On Wed, 2004-06-23 at 23:46, Andi Kleen wrote:
> The VM should be able to handle this, but it may still require some tuning. It would need some generic changes, but not too bad. Still would need a decision on how big GFP_BIGDMA should be. I suspect 4GB would be too big again.

What is the problem again? Can't the driver use the dynamic pci mapping API, which does allow more memory to be mapped even on crippled machines without an iommu?

And isn't this a problem that will vanish, since PCI Express and PCI-X both *require* support for 64 bit addressing, so all higher speed cards are going to be ok in principle?

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 6:18 ` Arjan van de Ven @ 2004-06-24 10:33 ` Andi Kleen 2004-06-24 13:48 ` Jesse Barnes 1 sibling, 0 replies; 70+ messages in thread
From: Andi Kleen @ 2004-06-24 10:33 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Andi Kleen, Terence Ripperda, discuss, tiwai, linux-kernel

On Thu, Jun 24, 2004 at 08:18:06AM +0200, Arjan van de Ven wrote:
> On Wed, 2004-06-23 at 23:46, Andi Kleen wrote:
> > The VM should be able to handle this, but it may still require some tuning. It would need some generic changes, but not too bad. Still would need a decision on how big GFP_BIGDMA should be. I suspect 4GB would be too big again.
>
> What is the problem again? Can't the driver use the dynamic pci mapping API, which does allow more memory to be mapped even on crippled machines without an iommu?

In theory one could make pci_alloc_consistent allocate from the swiotlb pool, yes; the problem is just that this pool is completely preallocated. If enough memory is needed, that would be quite nasty, because you suddenly lose 1 or 2GB of RAM.

> And isn't this a problem that will vanish, since PCI Express and PCI-X both *require* support for 64 bit addressing, so all higher speed cards are going to be ok in principle?

There are EM64T systems with AGP only, and not all PCI-Express cards seem to follow this. PCI-Express unfortunately discouraged the AGP aperture too, so not even that can be used on those Intel systems.

-Andi

^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 6:18 ` Arjan van de Ven 2004-06-24 10:33 ` Andi Kleen @ 2004-06-24 13:48 ` Jesse Barnes 2004-06-24 14:39 ` Terence Ripperda 1 sibling, 1 reply; 70+ messages in thread
From: Jesse Barnes @ 2004-06-24 13:48 UTC (permalink / raw)
To: arjanv; +Cc: Andi Kleen, Terence Ripperda, discuss, tiwai, linux-kernel

On Thursday, June 24, 2004 2:18 am, Arjan van de Ven wrote:
> What is the problem again? Can't the driver use the dynamic pci mapping API, which does allow more memory to be mapped even on crippled machines without an iommu?
> And isn't this a problem that will vanish, since PCI Express and PCI-X both *require* support for 64 bit addressing, so all higher speed cards are going to be ok in principle?

Well, PCI-X may require it, but there certainly are PCI-X devices that don't do 64 bit addressing, or if they do, it's a crippled implementation (e.g. the top 32 bits have to be constant).

Jesse

^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 13:48 ` Jesse Barnes @ 2004-06-24 14:39 ` Terence Ripperda 0 siblings, 0 replies; 70+ messages in thread
From: Terence Ripperda @ 2004-06-24 14:39 UTC (permalink / raw)
To: Jesse Barnes
Cc: arjanv, Andi Kleen, Terence Ripperda, discuss, tiwai, linux-kernel

correct. I checked with my contacts here on the PCI express requirements. Apparently the spec says "A PCI Express Endpoint operating as the Requester of a Memory Transaction is required to be capable of generating addresses greater than 4GB", but my contact claims this is a "soft" requirement.

but even if all PCI-X and PCI-E devices properly addressed the full 64 bits, legacy 32-bit PCI devices can be plugged into the motherboards as well. my Intel em64t boards have mostly PCI-X, but 1 PCI slot, and my AMD x86_64 boards have all PCI slots (aside from the main PCI-E slot). also, at least one motherboard manufacturer claims PCI-E + AGP, but the AGP is really just an AGP form-factor slot on the PCI bus.

Thanks,
Terence

On Thu, Jun 24, 2004 at 06:48:07AM -0700, jbarnes@engr.sgi.com wrote:
> On Thursday, June 24, 2004 2:18 am, Arjan van de Ven wrote:
> > What is the problem again? Can't the driver use the dynamic pci mapping API, which does allow more memory to be mapped even on crippled machines without an iommu?
> > And isn't this a problem that will vanish, since PCI Express and PCI-X both *require* support for 64 bit addressing, so all higher speed cards are going to be ok in principle?
>
> Well, PCI-X may require it, but there certainly are PCI-X devices that don't do 64 bit addressing, or if they do, it's a crippled implementation (e.g. the top 32 bits have to be constant).
>
> Jesse

^ permalink raw reply [flat|nested] 70+ messages in thread
* 32-bit dma allocations on 64-bit platforms
@ 2004-06-23 18:35 Terence Ripperda
2004-06-23 19:19 ` Jeff Garzik
2004-06-26 5:02 ` David Mosberger
0 siblings, 2 replies; 70+ messages in thread
From: Terence Ripperda @ 2004-06-23 18:35 UTC (permalink / raw)
To: Linux Kernel Mailing List; +Cc: Terence Ripperda
[-- Attachment #1: Type: text/plain, Size: 3008 bytes --]
I'm working on cleaning up some of our dma allocation code to properly allocate 32-bit physical pages for dma on 64-bit platforms. I think our first pass at supporting em64t is sub-par. I'd like to fix that by using the correct kernel interfaces.
From our early efforts in supporting AMD's x86_64, we've used the pci_map_sg/pci_map_single interface for remapping > 32-bit physical addresses through the system's IOMMU. Since Intel's em64t does not provide an IOMMU, the kernel falls back to a swiotlb to implement these interfaces. For our first pass at supporting em64t, we tried to work with the swiotlb, but this works very poorly.
We should have gone back and reviewed how we use the kernel interfaces and followed DMA-API.txt and DMA-mapping.txt. We're now working on using these interfaces (mainly pci_alloc_consistent). but we're still running into some general shortcomings of these interfaces. the main problem is the ability to allocate enough 32-bit addressable memory.
the physical addressing of memory allocations seems to boil down to the behavior of GFP_DMA and GFP_NORMAL. but there seems to be some disconnect between what these mean for each architecture and what various drivers expect them to mean.
based on each architecture's paging_init routines, the zones look like this:
              x86:         ia64:    x86_64:
ZONE_DMA:     < 16M        < ~4G    < 16M
ZONE_NORMAL:  16M - ~1G    > ~4G    > 16M
ZONE_HIGHMEM: 1G+
an example of this disconnect is vmalloc_32. this function is obviously intended to allocate 32-bit addresses (this was specifically mentioned in a comment in 2.4.x header files). but vmalloc_32 is an inline routine that calls __vmalloc(GFP_KERNEL). based on the above zone descriptions, this will do the correct thing for x86, but not for ia64 or x86_64. on ia64, a driver could just use GFP_DMA for the desired behavior, but this doesn't work well for x86_64.
AMD's x86_64 provides remapping of > 32-bit pages through the iommu, but obviously Intel's em64t provides no such ability. based on the above zonings, this leaves us with the options of either relying on the swiotlb interfaces for dma, or relying on the isa memory for dma.
the last day or two, I've been experimenting with using the pci_alloc_consistent interface, which uses the latter (note attached patch to fix an apparent memory leak in the x86_64 pci_alloc_consistent). unfortunately, this provides very little dma-able memory. In theory, up to 16 Megs, but in practice I'm only getting about 5 1/2 Megs.
I was rather surprised by these limitations on allocating 32-bit addresses. I checked through the dri and bttv drivers, to see if they had dealt with these issues, and they did not appear to have done so. has anyone tested these drivers on ia64/x86_64/em64t platforms w/ 4+ Gigs of memory?
are these limitations on allocating 32-bit addresses intentional and known? is there anything we can do to help improve this situation? help work on development?
Thanks,
Terence
[-- Attachment #2: pci-gart.patch --]
[-- Type: text/plain, Size: 330 bytes --]
--- pci-gart.c.old	2004-06-21 18:33:29.000000000 -0500
+++ pci-gart.c.new	2004-06-21 18:33:57.000000000 -0500
@@ -211,6 +211,7 @@
 	if (no_iommu || dma_mask < 0xffffffffUL) {
 		if (high) {
 			if (!(gfp & GFP_DMA)) {
+				free_pages((unsigned long)memory, get_order(size));
 				gfp |= GFP_DMA;
 				goto again;
 			}
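The leak the patch above fixes is in the retry path: when the first allocation lands above the device's DMA mask, the code jumps back to allocate again with GFP_DMA but never frees the unusable first allocation. A toy model of that control flow, with a stub allocator standing in for the real page allocator (all names here are invented for illustration, not kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Counter of outstanding allocations; with the fix applied, at most
 * one allocation is live when alloc_consistent_fixed() returns. */
int live_allocs;

void *stub_alloc(int from_dma_zone)
{
    static char dma_page, high_page;      /* stand-ins for real pages */
    live_allocs++;
    return from_dma_zone ? (void *)&dma_page : (void *)&high_page;
}

void stub_free(void *p)
{
    (void)p;
    live_allocs--;
}

/* Models the patched retry path: if the first try comes back "high"
 * (above the DMA mask), free it before retrying from the DMA zone. */
void *alloc_consistent_fixed(int first_try_is_high)
{
    void *memory = stub_alloc(0);         /* first try: any zone        */
    if (first_try_is_high) {
        stub_free(memory);                /* the line the patch adds    */
        memory = stub_alloc(1);           /* "goto again" with GFP_DMA  */
    }
    return memory;
}
```

Without the `stub_free` call, every retry would strand one allocation, which is the apparent leak Terence's patch addresses.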
^ permalink raw reply [flat|nested] 70+ messages in thread

* Re: 32-bit dma allocations on 64-bit platforms 2004-06-23 18:35 Terence Ripperda @ 2004-06-23 19:19 ` Jeff Garzik 2004-06-26 5:05 ` David Mosberger 2004-06-26 5:02 ` David Mosberger 1 sibling, 1 reply; 70+ messages in thread
From: Jeff Garzik @ 2004-06-23 19:19 UTC (permalink / raw)
To: Terence Ripperda; +Cc: Linux Kernel Mailing List

Terence Ripperda wrote:

Fix your word wrap.

> I'm working on cleaning up some of our dma allocation code to properly allocate 32-bit physical pages for dma on 64-bit platforms. I think our first pass at supporting em64t is sub-par. I'd like to fix that by using the correct kernel interfaces.
>
> From our early efforts in supporting AMD's x86_64, we've used the pci_map_sg/pci_map_single interface for remapping > 32-bit physical addresses through the system's IOMMU. Since Intel's em64t does not provide an IOMMU, the kernel falls back to a swiotlb to implement these interfaces. For our first pass at supporting em64t, we tried to work with the swiotlb, but this works very poorly.

swiotlb was a dumb idea when it hit ia64, and it's now been propagated to x86-64 :(

> We should have gone back and reviewed how we use the kernel interfaces and followed DMA-API.txt and DMA-mapping.txt. We're now working on using these interfaces (mainly pci_alloc_consistent). but we're still running into some general shortcomings of these interfaces. the main problem is the ability to allocate enough 32-bit addressable memory.
>
> the physical addressing of memory allocations seems to boil down to the behavior of GFP_DMA and GFP_NORMAL. but there seems to be some disconnect between what these mean for each architecture and what various drivers expect them to mean.
>
> based on each architecture's paging_init routines, the zones look like this:
>
>               x86:         ia64:    x86_64:
> ZONE_DMA:     < 16M        < ~4G    < 16M
> ZONE_NORMAL:  16M - ~1G    > ~4G    > 16M
> ZONE_HIGHMEM: 1G+
>
> an example of this disconnect is vmalloc_32. this function is obviously intended to allocate 32-bit addresses (this was specifically mentioned in a comment in 2.4.x header files). but vmalloc_32 is an inline routine that calls __vmalloc(GFP_KERNEL). based on the above zone descriptions, this will do the correct thing for x86, but not for ia64 or x86_64. on ia64, a driver could just use GFP_DMA for the desired behavior, but this doesn't work well for x86_64.
>
> AMD's x86_64 provides remapping of > 32-bit pages through the iommu, but obviously Intel's em64t provides no such ability. based on the above zonings, this leaves us with the options of either relying on the swiotlb interfaces for dma, or relying on the isa memory for dma.

FWIW, note that there are two main considerations: higher-level layers (block, net) provide bounce buffers when needed, as you don't want to do that purely with the iommu. Once you have bounce buffers properly allocated by <something> (swiotlb? special DRM bounce buffer allocator?), you then pci_map the bounce buffers.

> the last day or two, I've been experimenting with using the pci_alloc_consistent interface, which uses the latter (note attached patch to fix an apparent memory leak in the x86_64 pci_alloc_consistent). unfortunately, this provides very little dma-able memory. In theory, up to 16 Megs, but in practice I'm only getting about 5 1/2 Megs.
>
> I was rather surprised by these limitations on allocating 32-bit addresses. I checked through the dri and bttv drivers, to see if they had dealt with these issues, and they did not appear to have done so. has anyone tested these drivers on ia64/x86_64/em64t platforms w/ 4+ Gigs of memory?
>
> are these limitations on allocating 32-bit addresses intentional and known? is there anything we can do to help improve this situation? help work on development?

Sounds like you're not setting the PCI DMA mask properly, or perhaps passing NULL rather than a struct pci_dev to the PCI DMA API?
Jeff ^ permalink raw reply [flat|nested] 70+ messages in thread
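Jeff's point about the DMA mask can be illustrated with the usual negotiation pattern: try the widest mask first, then fall back to 32-bit. In a real driver this is done by calling pci_set_dma_mask() against a real struct pci_dev; the sketch below is a self-contained model where `negotiate_mask`, `device_max`, and `platform_max` are invented stand-ins for the device's and platform's addressing limits.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_64BIT_MASK 0xffffffffffffffffULL
#define DMA_32BIT_MASK 0x00000000ffffffffULL

/* Return the widest DMA mask both the device and the platform accept,
 * or 0 if neither 64-bit nor 32-bit addressing works, meaning the
 * driver would need bounce buffering (swiotlb) or GFP_DMA memory. */
uint64_t negotiate_mask(uint64_t device_max, uint64_t platform_max)
{
    if (device_max >= DMA_64BIT_MASK && platform_max >= DMA_64BIT_MASK)
        return DMA_64BIT_MASK;           /* full 64-bit addressing: no bouncing */
    if (device_max >= DMA_32BIT_MASK && platform_max >= DMA_32BIT_MASK)
        return DMA_32BIT_MASK;           /* fall back to 32-bit addressing      */
    return 0;                            /* e.g. a 29-bit sound chipset         */
}
```

A driver that skips this negotiation (or passes a NULL device, as Jeff suspects) gets the default conservative behavior, which on these platforms means being funneled into the tiny 16MB zone.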
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-23 19:19 ` Jeff Garzik @ 2004-06-26 5:05 ` David Mosberger 2004-06-26 7:16 ` Arjan van de Ven 0 siblings, 1 reply; 70+ messages in thread From: David Mosberger @ 2004-06-26 5:05 UTC (permalink / raw) To: Jeff Garzik; +Cc: Terence Ripperda, Linux Kernel Mailing List >>>>> On Wed, 23 Jun 2004 15:19:22 -0400, Jeff Garzik <jgarzik@pobox.com> said: Jeff> swiotlb was a dumb idea when it hit ia64, and it's now been propagated Jeff> to x86-64 :( If it's such a dumb idea, why not submit a better solution? --david ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-26 5:05 ` David Mosberger @ 2004-06-26 7:16 ` Arjan van de Ven 2004-06-29 6:13 ` David Mosberger 0 siblings, 1 reply; 70+ messages in thread From: Arjan van de Ven @ 2004-06-26 7:16 UTC (permalink / raw) To: davidm; +Cc: Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List [-- Attachment #1: Type: text/plain, Size: 462 bytes --] On Sat, 2004-06-26 at 07:05, David Mosberger wrote: > >>>>> On Wed, 23 Jun 2004 15:19:22 -0400, Jeff Garzik <jgarzik@pobox.com> said: > > Jeff> swiotlb was a dumb idea when it hit ia64, and it's now been propagated > Jeff> to x86-64 :( > > If it's such a dumb idea, why not submit a better solution? the real solution is an iommu of course, but the highmem solution has quite some merit too..... I know you disagree with me on that one though. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-26 7:16 ` Arjan van de Ven @ 2004-06-29 6:13 ` David Mosberger 2004-06-29 6:55 ` Arjan van de Ven 2004-06-30 8:00 ` Jes Sorensen 0 siblings, 2 replies; 70+ messages in thread From: David Mosberger @ 2004-06-29 6:13 UTC (permalink / raw) To: arjanv; +Cc: davidm, Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List >>>>> On Sat, 26 Jun 2004 09:16:27 +0200, Arjan van de Ven <arjanv@redhat.com> said: Arjan> the real solution is an iommu of course, but the highmem Arjan> solution has quite some merit too..... I know you disagree Arjan> with me on that one though. Yes, some merits and some faults. The real solution is iommu or 64-bit capable devices. Interesting that graphics controllers should be last to get 64-bit DMA capability, considering how much more complex they are than disk controllers or NICs. --david ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-29 6:13 ` David Mosberger @ 2004-06-29 6:55 ` Arjan van de Ven 2004-06-30 8:00 ` Jes Sorensen 1 sibling, 0 replies; 70+ messages in thread From: Arjan van de Ven @ 2004-06-29 6:55 UTC (permalink / raw) To: davidm; +Cc: Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List [-- Attachment #1: Type: text/plain, Size: 661 bytes --] On Mon, Jun 28, 2004 at 11:13:12PM -0700, David Mosberger wrote: > >>>>> On Sat, 26 Jun 2004 09:16:27 +0200, Arjan van de Ven <arjanv@redhat.com> said: > > Arjan> the real solution is an iommu of course, but the highmem > Arjan> solution has quite some merit too..... I know you disagree > Arjan> with me on that one though. > > Yes, some merits and some faults. The real solution is iommu or > 64-bit capable devices. Interesting that graphics controllers should > be last to get 64-bit DMA capability, considering how much more > complex they are than disk controllers or NICs. I guess the first game with more than 4Gb in textures will fix it ;) [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-29 6:13 ` David Mosberger 2004-06-29 6:55 ` Arjan van de Ven @ 2004-06-30 8:00 ` Jes Sorensen 1 sibling, 0 replies; 70+ messages in thread From: Jes Sorensen @ 2004-06-30 8:00 UTC (permalink / raw) To: davidm; +Cc: arjanv, Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List >>>>> "David" == David Mosberger <davidm@napali.hpl.hp.com> writes: >>>>> On Sat, 26 Jun 2004 09:16:27 +0200, Arjan van de Ven <arjanv@redhat.com> said: Arjan> the real solution is an iommu of course, but the highmem Arjan> solution has quite some merit too..... I know you disagree with Arjan> me on that one though. David> Yes, some merits and some faults. The real solution is iommu David> or 64-bit capable devices. Interesting that graphics David> controllers should be last to get 64-bit DMA capability, David> considering how much more complex they are than disk David> controllers or NICs. You found a 64 bit capable sound card yet? ;-) Cheers, Jes ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-23 18:35 Terence Ripperda 2004-06-23 19:19 ` Jeff Garzik @ 2004-06-26 5:02 ` David Mosberger 1 sibling, 0 replies; 70+ messages in thread From: David Mosberger @ 2004-06-26 5:02 UTC (permalink / raw) To: Terence Ripperda; +Cc: Linux Kernel Mailing List Terence, >>>>> On Wed, 23 Jun 2004 13:35:35 -0500, Terence Ripperda <tripperda@nvidia.com> said: Terence> based on each architecture's paging_init routines, the Terence> zones look like this: Terence> x86: ia64: x86_64: Terence> ZONE_DMA: < 16M < ~4G < 16M Terence> ZONE_NORMAL: 16M - ~1G > ~4G > 16M Terence> ZONE_HIMEM: 1G+ Not that it matters here, but for correctness let me note that the ia64 column is correct only for machines which don't have an I/O MMU. With I/O MMU, ZONE_DMA will have the same coverage as ZONE_NORMAL with a recent enough kernel (older kernels had a bug which limited ZONE_DMA to < 4GB, but that was unintentional). --david ^ permalink raw reply [flat|nested] 70+ messages in thread
end of thread, other threads:[~2004-06-30 8:12 UTC | newest]
Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <m3acyu6pwd.fsf@averell.firstfloor.org>
[not found] ` <20040623213643.GB32456@hygelac>
2004-06-23 23:46 ` 32-bit dma allocations on 64-bit platforms Andi Kleen
2004-06-24 11:13 ` Takashi Iwai
2004-06-24 11:29 ` [discuss] " Andi Kleen
2004-06-24 14:36 ` Takashi Iwai
2004-06-24 14:42 ` Andi Kleen
2004-06-24 14:58 ` Takashi Iwai
2004-06-24 15:29 ` Andrea Arcangeli
2004-06-24 15:48 ` Nick Piggin
2004-06-24 16:52 ` Andrea Arcangeli
2004-06-24 16:56 ` William Lee Irwin III
2004-06-24 17:32 ` Andrea Arcangeli
2004-06-24 17:38 ` William Lee Irwin III
2004-06-24 18:02 ` Andrea Arcangeli
2004-06-24 18:13 ` William Lee Irwin III
2004-06-24 18:27 ` Andrea Arcangeli
2004-06-24 18:50 ` William Lee Irwin III
2004-06-24 21:54 ` Andrew Morton
2004-06-24 22:08 ` William Lee Irwin III
2004-06-24 22:45 ` Andrea Arcangeli
2004-06-24 22:51 ` William Lee Irwin III
2004-06-24 23:09 ` Andrew Morton
2004-06-24 23:15 ` William Lee Irwin III
2004-06-25 6:16 ` William Lee Irwin III
2004-06-25 2:39 ` Andrea Arcangeli
2004-06-25 2:47 ` Andrew Morton
2004-06-25 3:19 ` Andrea Arcangeli
2004-06-24 22:11 ` Andrew Morton
2004-06-24 23:09 ` Andrea Arcangeli
2004-06-25 1:17 ` Nick Piggin
2004-06-25 3:11 ` Andrea Arcangeli
2004-06-24 22:21 ` Andrea Arcangeli
2004-06-24 22:36 ` Andrew Morton
2004-06-24 23:15 ` Andrea Arcangeli
2004-06-24 22:37 ` William Lee Irwin III
2004-06-24 22:40 ` William Lee Irwin III
2004-06-24 23:21 ` Andrea Arcangeli
2004-06-24 23:45 ` William Lee Irwin III
2004-06-24 17:39 ` Andrea Arcangeli
2004-06-24 17:53 ` William Lee Irwin III
2004-06-24 18:07 ` Andrea Arcangeli
2004-06-24 18:29 ` William Lee Irwin III
2004-06-24 16:04 ` Takashi Iwai
2004-06-24 17:16 ` Andrea Arcangeli
2004-06-24 18:33 ` Takashi Iwai
2004-06-24 18:44 ` Andrea Arcangeli
2004-06-25 15:50 ` Takashi Iwai
2004-06-25 17:30 ` Andrea Arcangeli
2004-06-25 17:39 ` Takashi Iwai
2004-06-25 17:45 ` Andrea Arcangeli
2004-06-24 14:45 ` Terence Ripperda
2004-06-24 15:41 ` Andrea Arcangeli
2004-06-24 15:44 ` Terence Ripperda
2004-06-24 16:15 ` [discuss] " Andi Kleen
2004-06-24 17:22 ` Andrea Arcangeli
2004-06-24 22:28 ` Terence Ripperda
2004-06-24 18:51 ` Andi Kleen
2004-06-26 4:58 ` David Mosberger
[not found] <2akPm-16l-65@gated-at.bofh.it>
2004-06-23 21:46 ` Andi Kleen
2004-06-24 6:18 ` Arjan van de Ven
2004-06-24 10:33 ` Andi Kleen
2004-06-24 13:48 ` Jesse Barnes
2004-06-24 14:39 ` Terence Ripperda
2004-06-23 18:35 Terence Ripperda
2004-06-23 19:19 ` Jeff Garzik
2004-06-26 5:05 ` David Mosberger
2004-06-26 7:16 ` Arjan van de Ven
2004-06-29 6:13 ` David Mosberger
2004-06-29 6:55 ` Arjan van de Ven
2004-06-30 8:00 ` Jes Sorensen
2004-06-26 5:02 ` David Mosberger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox