* Re: 32-bit dma allocations on 64-bit platforms
       [not found] ` <20040623213643.GB32456@hygelac>
@ 2004-06-23 23:46   ` Andi Kleen
  2004-06-24 11:13     ` Takashi Iwai
  2004-06-24 15:44     ` Terence Ripperda
  0 siblings, 2 replies; 70+ messages in thread
From: Andi Kleen @ 2004-06-23 23:46 UTC (permalink / raw)
  To: Terence Ripperda; +Cc: Andi Kleen, discuss, tiwai, linux-kernel, andrea

On Wed, Jun 23, 2004 at 04:36:43PM -0500, Terence Ripperda wrote:
> > The x86-64 port had decided early to keep the 16MB GFP_DMA zone
> > to get maximum driver compatibility and because the AMD IOMMU gave
> > us a nice alternative over bounce buffering.
>
> that was a very understandable decision. and I do agree that using the
> AMD IOMMU is a very nice architecture. it is unfortunate to have to deal
> with this on EM64T. Will AMD's pci-express chipsets still maintain an
> IOMMU, even if it's not needed for AGP anymore? (probably not public
> information, I'll check via my channels).

The IOMMU is actually implemented in the CPU northbridge on K8, so yes.
I hope they'll keep it in future CPUs too.

> > I must say I'm somewhat reluctant to break a working in-tree driver,
> > especially for the sake of an out-of-tree binary driver. Arguably the
> > problem is probably not limited to you, but it's quite possible that
> > even the in-tree DRI drivers have it, so it would still be worth fixing.
>
> agreed. I completely understand that there is no desire to modify the
> core kernel to help our driver. that's one of the reasons I looked through
> the other drivers, as I suspect that this is a problem for many drivers. I
> only looked through the code for each briefly, but didn't see anything to
> handle this. I suspect it's more of a case that the drivers have not been
> stress tested on an x86_64 machine w/ 4+ G of memory.

We usually handle it using the swiotlb, which works.

pci_alloc_consistent is limited to 16MB, but so far nobody has really
complained about that. If that should be a real issue we can make it
allocate from the swiotlb pool, which is usually 64MB (and can be made
bigger at boot time). Would that work for you too BTW? How much memory
do you expect to need?

The drawback is that the swiotlb pool is not unified with the rest of
the VM, so tying up too much memory there is quite unfriendly. e.g. if
you can use up 1GB then I wouldn't consider this suitable; for 128MB max
it may be possible.

> > I see two somewhat realistic ways to handle this:
>
> either of those approaches sounds good. keeping compatibility with older
> devices/drivers is certainly a good thing.
>
> can the core kernel handle multiple new zones? I haven't looked at the
> code, but the zones seem to always be ZONE_DMA and ZONE_NORMAL, with some
> architectures adding ZONE_HIMEM at the end. if you add a ZONE_DMA_32 (or
> whatever) between ZONE_DMA and ZONE_NORMAL, will the core vm code be able
> to handle that? I guess one could argue if it can't yet, it should be able
> to, then each architecture could create as many zones as they wanted.

Sure, we create multiple zones on NUMA systems (even on x86-64). Each
node has one zone, but they're all ZONE_NORMAL. And the first node has
two zones, one ZONE_DMA and one ZONE_NORMAL (actually the others have a
ZONE_DMA too, but it's empty).

Multiple ZONE_DMA zones would be a novelty, but may be doable (I have
not checked all the implications of this, but I don't immediately see
any show stopper; maybe someone like Andrea can correct me on that). It
would probably be a somewhat intrusive patch though.

> another brainstorm: instead of counting on just a large-grained zone and
> call to __get_free_pages() returning an allocation within a given
> bit-range, perhaps there could be large-grained zones, with a fine-grained
> hint used to look for a subset within the zone. for example, there could be
> a DMA32 zone, but a mask w/ 24- or 29- bits enabled could be used to scan
> the DMA32 zone for a valid address. (don't know how well that fits into the
> current architecture).

Not very well. Or rather the allocation would not be O(1) anymore,
because you would need to scan the queues. That could still be
tolerable, but when there are no pages you have to call the VM and then
teach try_to_free_pages and friends that you are only interested in
pages under some mask. And that would probably get quite nasty.

I did something like this in 2.4 for an old prototype of the NUMA API,
but it never worked very well and was also quite ugly. Multiple zones
are probably better.

One of the reasons we rejected this early when the x86-64 port was
designed was that the VM had quite bad zone balancing problems at that
time. It should be better now though, or at least the NUMA setup works
reasonably well. But NUMA zones tend to be a lot bigger than DMA zones
and don't show all the corner cases.

-Andi

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms
  2004-06-23 23:46 ` 32-bit dma allocations on 64-bit platforms Andi Kleen
@ 2004-06-24 11:13   ` Takashi Iwai
  2004-06-24 11:29     ` [discuss] " Andi Kleen
  2004-06-24 14:45     ` Terence Ripperda
  2004-06-24 15:44   ` Terence Ripperda
  1 sibling, 2 replies; 70+ messages in thread
From: Takashi Iwai @ 2004-06-24 11:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Terence Ripperda, discuss, linux-kernel, andrea

At 24 Jun 2004 01:46:44 +0200,
Andi Kleen wrote:
>
> > > I must say I'm somewhat reluctant to break a working in-tree driver,
> > > especially for the sake of an out-of-tree binary driver. Arguably the
> > > problem is probably not limited to you, but it's quite possible that
> > > even the in-tree DRI drivers have it, so it would still be worth fixing.
> >
> > agreed. I completely understand that there is no desire to modify the
> > core kernel to help our driver. that's one of the reasons I looked through
> > the other drivers, as I suspect that this is a problem for many drivers. I
> > only looked through the code for each briefly, but didn't see anything to
> > handle this. I suspect it's more of a case that the drivers have not been
> > stress tested on an x86_64 machine w/ 4+ G of memory.
>
> We usually handle it using the swiotlb, which works.
>
> pci_alloc_consistent is limited to 16MB, but so far nobody has really
> complained about that. If that should be a real issue we can make
> it allocate from the swiotlb pool, which is usually 64MB (and can
> be made bigger at boot time)

Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
allocated pages are out of dma mask, just like in pci-gart.c?
(with ifdef x86-64)


Takashi

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 11:13 ` Takashi Iwai
@ 2004-06-24 11:29   ` Andi Kleen
  2004-06-24 14:36     ` Takashi Iwai
  2004-06-24 14:45   ` Terence Ripperda
  1 sibling, 1 reply; 70+ messages in thread
From: Andi Kleen @ 2004-06-24 11:29 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: Andi Kleen, Terence Ripperda, discuss, linux-kernel, andrea

> Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
> allocated pages are out of dma mask, just like in pci-gart.c?
> (with ifdef x86-64)

That won't work reliably enough in extreme cases.

-Andi

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 11:29 ` [discuss] " Andi Kleen
@ 2004-06-24 14:36   ` Takashi Iwai
  2004-06-24 14:42     ` Andi Kleen
  0 siblings, 1 reply; 70+ messages in thread
From: Takashi Iwai @ 2004-06-24 14:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, Terence Ripperda, discuss, linux-kernel, andrea

At Thu, 24 Jun 2004 13:29:00 +0200,
Andi Kleen wrote:
>
> > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
> > allocated pages are out of dma mask, just like in pci-gart.c?
> > (with ifdef x86-64)
>
> That won't work reliably enough in extreme cases.

Well, it's not perfect, but it'd be far better than GFP_DMA only :)

BTW, we have a similar problem on i386, too. A non-32bit DMA mask always
results in allocation with GFP_DMA. The patch below does a similar hack
as described above, falling back to GFP_DMA only when the first
allocation misses the mask.


Takashi

--- linux-2.6.7/arch/i386/kernel/pci-dma.c	2004-06-24 15:56:46.017473544 +0200
+++ linux-2.6.7/arch/i386/kernel/pci-dma.c	2004-06-24 16:05:02.449803937 +0200
@@ -17,17 +17,35 @@ void *dma_alloc_coherent(struct device *
 	dma_addr_t *dma_handle, int gfp)
 {
 	void *ret;
+	unsigned long dma_mask;
+
 	/* ignore region specifiers */
 	gfp &= ~(__GFP_DMA | __GFP_HIGHMEM);
 
-	if (dev == NULL || (dev->coherent_dma_mask < 0xffffffff))
+	if (dev == NULL) {
 		gfp |= GFP_DMA;
+		dma_mask = 0xffffffUL;
+	} else {
+		dma_mask = 0xffffffffUL;
+		if (dev->dma_mask)
+			dma_mask = *dev->dma_mask;
+		if (dev->coherent_dma_mask)
+			dma_mask &= (unsigned long)dev->coherent_dma_mask;
+	}
+
+ again:
 	ret = (void *)__get_free_pages(gfp, get_order(size));
 	if (ret != NULL) {
-		memset(ret, 0, size);
 		*dma_handle = virt_to_phys(ret);
+		if (((unsigned long)*dma_handle + size - 1) & ~dma_mask) {
+			free_pages((unsigned long)ret, get_order(size));
+			if (gfp & GFP_DMA)
+				return NULL;
+			gfp |= GFP_DMA;
+			goto again;
+		}
+		memset(ret, 0, size);
 	}
 	return ret;
 }

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 14:36 ` Takashi Iwai
@ 2004-06-24 14:42   ` Andi Kleen
  2004-06-24 14:58     ` Takashi Iwai
  0 siblings, 1 reply; 70+ messages in thread
From: Andi Kleen @ 2004-06-24 14:42 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: ak, tripperda, discuss, linux-kernel, andrea

On Thu, 24 Jun 2004 16:36:47 +0200
Takashi Iwai <tiwai@suse.de> wrote:

> At Thu, 24 Jun 2004 13:29:00 +0200,
> Andi Kleen wrote:
> >
> > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
> > > allocated pages are out of dma mask, just like in pci-gart.c?
> > > (with ifdef x86-64)
> >
> > That won't work reliably enough in extreme cases.
>
> Well, it's not perfect, but it'd be far better than GFP_DMA only :)

The only description for this patch I can think of is "russian roulette".

-Andi

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 14:42 ` Andi Kleen
@ 2004-06-24 14:58   ` Takashi Iwai
  2004-06-24 15:29     ` Andrea Arcangeli
  0 siblings, 1 reply; 70+ messages in thread
From: Takashi Iwai @ 2004-06-24 14:58 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ak, tripperda, discuss, linux-kernel, andrea

At Thu, 24 Jun 2004 16:42:58 +0200,
Andi Kleen wrote:
>
> On Thu, 24 Jun 2004 16:36:47 +0200
> Takashi Iwai <tiwai@suse.de> wrote:
>
> > At Thu, 24 Jun 2004 13:29:00 +0200,
> > Andi Kleen wrote:
> > >
> > > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
> > > > allocated pages are out of dma mask, just like in pci-gart.c?
> > > > (with ifdef x86-64)
> > >
> > > That won't work reliably enough in extreme cases.
> >
> > Well, it's not perfect, but it'd be far better than GFP_DMA only :)
>
> The only description for this patch I can think of is "russian roulette".

Even if we have a bigger DMA zone, there's no guarantee that the
obtained page is precisely within the given mask. We can hardly define
zones fine-grained enough for all the different 24-, 28-, 29-, 30- and
31-bit DMA masks.


My patch for i386 works well in most cases, because such a device is
usually found on older machines with less memory than the DMA mask
covers.

Without the patch, the allocation is always <16MB and may fail even for
a small number of pages.


Takashi

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 14:58 ` Takashi Iwai
@ 2004-06-24 15:29   ` Andrea Arcangeli
  2004-06-24 15:48     ` Nick Piggin
  2004-06-24 16:04     ` Takashi Iwai
  1 sibling, 2 replies; 70+ messages in thread
From: Andrea Arcangeli @ 2004-06-24 15:29 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel

On Thu, Jun 24, 2004 at 04:58:24PM +0200, Takashi Iwai wrote:
> At Thu, 24 Jun 2004 16:42:58 +0200,
> Andi Kleen wrote:
> >
> > On Thu, 24 Jun 2004 16:36:47 +0200
> > Takashi Iwai <tiwai@suse.de> wrote:
> >
> > > At Thu, 24 Jun 2004 13:29:00 +0200,
> > > Andi Kleen wrote:
> > > >
> > > > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the
> > > > > allocated pages are out of dma mask, just like in pci-gart.c?
> > > > > (with ifdef x86-64)
> > > >
> > > > That won't work reliably enough in extreme cases.
> > >
> > > Well, it's not perfect, but it'd be far better than GFP_DMA only :)
> >
> > The only description for this patch I can think of is "russian roulette".
>
> Even if we have a bigger DMA zone, there's no guarantee that the
> obtained page is precisely within the given mask. We can hardly define
> zones fine-grained enough for all the different 24-, 28-, 29-, 30- and
> 31-bit DMA masks.
>
>
> My patch for i386 works well in most cases, because such a device is
> usually found on older machines with less memory than the DMA mask
> covers.
>
> Without the patch, the allocation is always <16MB and may fail even for
> a small number of pages.

why does it fail? note that with the lower_zone_reserve_ratio algorithm
I added to 2.4, the whole dma zone will be reserved for __GFP_DMA
allocations, so you should have trouble only with 2.6; 2.4 should work
fine.

So with latest 2.4 it has to fail only if you already allocated 16M with
pci_alloc_consistent, which sounds unlikely.

the fact 2.6 lacks the lower_zone_reserve_ratio algorithm is a different
issue, but I'm confident there's no other possible algorithm to solve
this memory balancing problem completely, so there's no way around a
forward port.

well 2.6 has a tiny hack like some older 2.4 that attempts to do what
lower_zone_reserve_ratio does, but it's not nearly enough; there's no
per-zone-point-of-view watermark in 2.6 etc. 2.6 actually has a more
hardcoded hack for highmem, but the lower_zone_reserve_ratio has
absolutely nothing to do with highmem vs lowmem. it's by pure
coincidence that it keeps highmem machines from locking up without swap,
but the very same problem happens on x86-64 with lowmem vs dma.

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 15:29 ` Andrea Arcangeli
@ 2004-06-24 15:48   ` Nick Piggin
  2004-06-24 16:52     ` Andrea Arcangeli
  2004-06-24 17:39     ` Andrea Arcangeli
  1 sibling, 2 replies; 70+ messages in thread
From: Nick Piggin @ 2004-06-24 15:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Takashi Iwai, Andi Kleen, ak, tripperda, discuss, linux-kernel

Andrea Arcangeli wrote:
>
> why does it fail? note that with the lower_zone_reserve_ratio algorithm
> I added to 2.4, the whole dma zone will be reserved for __GFP_DMA
> allocations, so you should have trouble only with 2.6; 2.4 should work
> fine.
>
> So with latest 2.4 it has to fail only if you already allocated 16M with
> pci_alloc_consistent, which sounds unlikely.
>
> the fact 2.6 lacks the lower_zone_reserve_ratio algorithm is a different
> issue, but I'm confident there's no other possible algorithm to solve
> this memory balancing problem completely, so there's no way around a
> forward port.
>
> well 2.6 has a tiny hack like some older 2.4 that attempts to do what
> lower_zone_reserve_ratio does, but it's not nearly enough; there's no
> per-zone-point-of-view watermark in 2.6 etc. 2.6 actually has a more
> hardcoded hack for highmem, but the lower_zone_reserve_ratio has
> absolutely nothing to do with highmem vs lowmem. it's by pure
> coincidence that it keeps highmem machines from locking up without swap,
> but the very same problem happens on x86-64 with lowmem vs dma.

2.6 has the "incremental min" thing. What is wrong with that?
Though I think it is turned off by default.

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 15:48 ` Nick Piggin
@ 2004-06-24 16:52   ` Andrea Arcangeli
  2004-06-24 16:56     ` William Lee Irwin III
  0 siblings, 1 reply; 70+ messages in thread
From: Andrea Arcangeli @ 2004-06-24 16:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Takashi Iwai, Andi Kleen, ak, tripperda, discuss, linux-kernel

On Fri, Jun 25, 2004 at 01:48:47AM +1000, Nick Piggin wrote:
> Andrea Arcangeli wrote:
> >
> > why does it fail? note that with the lower_zone_reserve_ratio algorithm
> > I added to 2.4, the whole dma zone will be reserved for __GFP_DMA
> > allocations, so you should have trouble only with 2.6; 2.4 should work
> > fine.
> >
> > So with latest 2.4 it has to fail only if you already allocated 16M with
> > pci_alloc_consistent, which sounds unlikely.
> >
> > the fact 2.6 lacks the lower_zone_reserve_ratio algorithm is a different
> > issue, but I'm confident there's no other possible algorithm to solve
> > this memory balancing problem completely, so there's no way around a
> > forward port.
> >
> > well 2.6 has a tiny hack like some older 2.4 that attempts to do what
> > lower_zone_reserve_ratio does, but it's not nearly enough; there's no
> > per-zone-point-of-view watermark in 2.6 etc. 2.6 actually has a more
> > hardcoded hack for highmem, but the lower_zone_reserve_ratio has
> > absolutely nothing to do with highmem vs lowmem. it's by pure
> > coincidence that it keeps highmem machines from locking up without
> > swap, but the very same problem happens on x86-64 with lowmem vs dma.
>
> 2.6 has the "incremental min" thing. What is wrong with that?
> Though I think it is turned off by default.

sysctl_lower_zone_protection is an inferior implementation of the
lower_zone_reserve_ratio, inferior because it has no way to give a
different balance to each zone. As you said, it's turned off by default,
so it had no tuning. The lower_zone_reserve_ratio has already been tuned
in 2.4. Somebody can attempt a conversion, but it'll never be equal,
since lower_zone_reserve_ratio is a superset of what
sysctl_lower_zone_protection can do.

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 16:52 ` Andrea Arcangeli
@ 2004-06-24 16:56   ` William Lee Irwin III
  2004-06-24 17:32     ` Andrea Arcangeli
  2004-06-24 21:54     ` Andrew Morton
  0 siblings, 2 replies; 70+ messages in thread
From: William Lee Irwin III @ 2004-06-24 16:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Takashi Iwai, Andi Kleen, ak, tripperda, discuss,
	linux-kernel

On Fri, Jun 25, 2004 at 01:48:47AM +1000, Nick Piggin wrote:
>> 2.6 has the "incremental min" thing. What is wrong with that?
>> Though I think it is turned off by default.

On Thu, Jun 24, 2004 at 06:52:01PM +0200, Andrea Arcangeli wrote:
> sysctl_lower_zone_protection is an inferior implementation of the
> lower_zone_reserve_ratio, inferior because it has no way to give a
> different balance to each zone. As you said, it's turned off by default,
> so it had no tuning. The lower_zone_reserve_ratio has already been tuned
> in 2.4. Somebody can attempt a conversion, but it'll never be equal,
> since lower_zone_reserve_ratio is a superset of what
> sysctl_lower_zone_protection can do.

Is there any chance you could send in this improved implementation of
zone fallback watermarks and describe the deficiencies in the current
scheme that it corrects? Thanks.

-- wli

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 16:56 ` William Lee Irwin III
@ 2004-06-24 17:32   ` Andrea Arcangeli
  2004-06-24 17:38     ` William Lee Irwin III
  0 siblings, 1 reply; 70+ messages in thread
From: Andrea Arcangeli @ 2004-06-24 17:32 UTC (permalink / raw)
  To: William Lee Irwin III, Nick Piggin, Takashi Iwai, Andi Kleen, ak,
	tripperda, discuss, linux-kernel

On Thu, Jun 24, 2004 at 09:56:29AM -0700, William Lee Irwin III wrote:
> On Fri, Jun 25, 2004 at 01:48:47AM +1000, Nick Piggin wrote:
> >> 2.6 has the "incremental min" thing. What is wrong with that?
> >> Though I think it is turned off by default.
>
> On Thu, Jun 24, 2004 at 06:52:01PM +0200, Andrea Arcangeli wrote:
> > sysctl_lower_zone_protection is an inferior implementation of the
> > lower_zone_reserve_ratio, inferior because it has no way to give a
> > different balance to each zone. As you said, it's turned off by default,
> > so it had no tuning. The lower_zone_reserve_ratio has already been tuned
> > in 2.4. Somebody can attempt a conversion, but it'll never be equal,
> > since lower_zone_reserve_ratio is a superset of what
> > sysctl_lower_zone_protection can do.
>
> Is there any chance you could send in this improved implementation of
> zone fallback watermarks and describe the deficiencies in the current
> scheme that it corrects?

I did, quite a few times, and it was successfully merged in 2.4. Now I'd
need to forward port it to 2.6.

I recall I recommended Andrew to merge the lower_zone_reserve_ratio at
some point during 2.5 or early 2.6, but apparently he implemented this
other thing called sysctl_lower_zone_protection. Note that now that I
look more into it, it seems sysctl_lower_zone_protection and
lower_zone_reserve_ratio have very little in common; I'm glad
sysctl_lower_zone_protection is disabled. sysctl_lower_zone_protection
is just an improvement to the algorithm I dropped from 2.4 when
lowmem_zone_reserve_ratio was merged.

So in short, enabling sysctl_lower_zone_protection won't help;
sysctl_lower_zone_protection should be dropped entirely and replaced
with lower_zone_reserve_ratio.

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 17:32 ` Andrea Arcangeli
@ 2004-06-24 17:38   ` William Lee Irwin III
  2004-06-24 18:02     ` Andrea Arcangeli
  0 siblings, 1 reply; 70+ messages in thread
From: William Lee Irwin III @ 2004-06-24 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Takashi Iwai, Andi Kleen, ak, tripperda, discuss,
	linux-kernel

On Thu, Jun 24, 2004 at 07:32:36PM +0200, Andrea Arcangeli wrote:
> I did, quite a few times, and it was successfully merged in 2.4. Now I'd
> need to forward port it to 2.6.
> I recall I recommended Andrew to merge the lower_zone_reserve_ratio at
> some point during 2.5 or early 2.6, but apparently he implemented this
> other thing called sysctl_lower_zone_protection. Note that now that I
> look more into it, it seems sysctl_lower_zone_protection and
> lower_zone_reserve_ratio have very little in common; I'm glad
> sysctl_lower_zone_protection is disabled. sysctl_lower_zone_protection
> is just an improvement to the algorithm I dropped from 2.4 when
> lowmem_zone_reserve_ratio was merged. So in short, enabling
> sysctl_lower_zone_protection won't help; sysctl_lower_zone_protection
> should be dropped entirely and replaced with lower_zone_reserve_ratio.

Could you refer me to an online source (e.g. Message-Id or URL) where
the deficiencies in the incremental min and/or lower_zone_protection
that the zone-to-zone watermarks address are described in detail?

-- wli

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 17:38 ` William Lee Irwin III
@ 2004-06-24 18:02   ` Andrea Arcangeli
  2004-06-24 18:13     ` William Lee Irwin III
  0 siblings, 1 reply; 70+ messages in thread
From: Andrea Arcangeli @ 2004-06-24 18:02 UTC (permalink / raw)
  To: William Lee Irwin III, Nick Piggin, Takashi Iwai, Andi Kleen, ak,
	tripperda, discuss, linux-kernel

On Thu, Jun 24, 2004 at 10:38:27AM -0700, William Lee Irwin III wrote:
> On Thu, Jun 24, 2004 at 07:32:36PM +0200, Andrea Arcangeli wrote:
> > I did, quite a few times, and it was successfully merged in 2.4. Now I'd
> > need to forward port it to 2.6.
> > I recall I recommended Andrew to merge the lower_zone_reserve_ratio at
> > some point during 2.5 or early 2.6, but apparently he implemented this
> > other thing called sysctl_lower_zone_protection. Note that now that I
> > look more into it, it seems sysctl_lower_zone_protection and
> > lower_zone_reserve_ratio have very little in common; I'm glad
> > sysctl_lower_zone_protection is disabled. sysctl_lower_zone_protection
> > is just an improvement to the algorithm I dropped from 2.4 when
> > lowmem_zone_reserve_ratio was merged. So in short, enabling
> > sysctl_lower_zone_protection won't help; sysctl_lower_zone_protection
> > should be dropped entirely and replaced with lower_zone_reserve_ratio.
>
> Could you refer me to an online source (e.g. Message-Id or URL) where
> the deficiencies in the incremental min and/or lower_zone_protection
> that the zone-to-zone watermarks address are described in detail?

I've been talking to Andrew about this very issue since december 2002,
so I mostly gave up except for a few reminders like this one today.

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=20021206145718.GL1567%40dualathlon.random&prev=/groups%3Fq%3Dlinus%2Bgoogle%2Bfix%2Bmin%2Bwatermarks%26hl%3D

I'm confident that as people start to run into the zone imbalance with
2.6, and as google upgrades to 2.6, eventually lowmem_zone_reserve_ratio
will be forward ported from 2.4.26 to 2.6. I'm not the guy with >4G of
ram anyways, so it won't be myself having troubles with this ;).
Furthermore if you have some swap, the VM can normally relocate the
stuff (you have to be quite unlucky to be filled by pure ptes in the
lowmem zone, but it can happen too; certainly not in my or Andrew's
boxes, where we have no more than 2M of ptes allocated at any time).

I already tried to merge this in a preventive way, without a real-life
case of somebody cracking down on this trouble like it happened in 2.4,
but now I'll only react if somebody has a real-life case again in 2.6.
This lowmem vs dma zone thing would be helped very significantly by the
lowmem_reserve_ratio, and that's why I bring up this issue right now and
not one month ago. This is a matter of fact: with my algorithm the dma
zone would be completely preserved for __GFP_DMA allocations on the big
x86-64 boxes, guaranteeing that no DMA zone will be wasted on ptes or
similar stuff that can very well go in the higher zones.

The "how many bytes" question in my above email is now addressed by
sysctl_lower_zone_protection, but that's still a very weak answer, since
it doesn't work for dissimilar imbalances across different classzones
(i.e. huge dma, tiny lowmem, and even smaller highmem), and furthermore
it requires people to tune it by themselves from userspace, and they
cannot tune it as well as lowmem_reserve_ratio would, since it's a fixed
sysctl for all classzone-against-classzone imbalances.

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 18:02 ` Andrea Arcangeli
@ 2004-06-24 18:13   ` William Lee Irwin III
  2004-06-24 18:27     ` Andrea Arcangeli
  0 siblings, 1 reply; 70+ messages in thread
From: William Lee Irwin III @ 2004-06-24 18:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Takashi Iwai, Andi Kleen, ak, tripperda, discuss,
	linux-kernel

On Thu, Jun 24, 2004 at 08:02:56PM +0200, Andrea Arcangeli wrote:
> I've been talking to Andrew about this very issue since december 2002,
> so I mostly gave up except for a few reminders like this one today.
> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=20021206145718.GL1567%40dualathlon.random&prev=/groups%3Fq%3Dlinus%2Bgoogle%2Bfix%2Bmin%2Bwatermarks%26hl%3D
> I'm confident that as people start to run into the zone imbalance with
> 2.6, and as google upgrades to 2.6, eventually lowmem_zone_reserve_ratio
> will be forward ported from 2.4.26 to 2.6. I'm not the guy with >4G of
> ram anyways, so it won't be myself having troubles with this ;).
> Furthermore if you have some swap, the VM can normally relocate the
> stuff (you have to be quite unlucky to be filled by pure ptes in the
> lowmem zone, but it can happen too; certainly not in my or Andrew's
> boxes, where we have no more than 2M of ptes allocated at any time).

This sounds like the more precise fix would be enforcing a stricter
fallback criterion for pinned allocations. Pinned userspace would need
zone migration if it's done selectively like this. Thanks.

-- wli

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms
  2004-06-24 18:13 ` William Lee Irwin III
@ 2004-06-24 18:27   ` Andrea Arcangeli
  2004-06-24 18:50     ` William Lee Irwin III
  0 siblings, 1 reply; 70+ messages in thread
From: Andrea Arcangeli @ 2004-06-24 18:27 UTC (permalink / raw)
  To: William Lee Irwin III, Nick Piggin, Takashi Iwai, Andi Kleen, ak,
	tripperda, discuss, linux-kernel

On Thu, Jun 24, 2004 at 11:13:11AM -0700, William Lee Irwin III wrote:
> This sounds like the more precise fix would be enforcing a stricter
> fallback criterion for pinned allocations. Pinned userspace would need
> zone migration if it's done selectively like this.

yes, and "the stricter fallback criterion" is precisely called
lower_zone_reserve_ratio; it's included in the 2.4 mainline kernel, and
this "stricter fallback criterion" doesn't exist in 2.6 yet.

I do apply it to non-pinned pages too, because wasting tons of cpu on
memcopies for migration is a bad idea compared to reserving 900M of
absolutely critical lowmem ram on a 64G box. So I find the
pinned/unpinned parameter worthless and I apply "the stricter fallback
criterion" to all allocations in the same way, which is a lot simpler,
doesn't require substantial vm changes to allow migration of ptes,
anonymous and mlocked memory w/o passing through some swapcache and
without clearing ptes, and, most important, I believe it's a lot more
efficient than migrating with bulk memcopies. Even on a big x86-64,
dealing with the migration complexity isn't worthwhile; reserving the
full 16M of dma zone makes a lot more sense.

The lower_zone_reserve_ratio algorithm scales with the size of the
zones, autotuned at boot time, and the balance settings are functions of
the imbalances found at boot time. That's the fundamental difference
from the sysctl, which is fixed for all zones and has no clue about the
size of the zones etc.

So in short: with little ram installed it will behave like mainline 2.6;
with tons of ram installed it will make a huge difference, and it will
reserve up to _whole_ classzones for the users that cannot use the
higher zones. But 16M on a 16G box is nothing, so nobody will notice any
regression; only the benefits will be noticeable in the otherwise
unsolvable corner cases (yeah, you could try to migrate ptes and other
stuff to solve them, but that's incredibly inefficient compared to
throwing 16M or 800M at the problem on 16G or 64G machines
respectively, etc.).

the numbers aren't mathematically exact with the 2.4 code, but you get
an idea of the order of magnitude.

BTW, I think I'm not the only VM guy who agrees this algo is needed. For
instance I recall Rik once included the lower_zone_reserve_ratio patch
in one of his 2.4 patches too.

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 18:27 ` Andrea Arcangeli @ 2004-06-24 18:50 ` William Lee Irwin III 0 siblings, 0 replies; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 18:50 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Takashi Iwai, Andi Kleen, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 08:27:37PM +0200, Andrea Arcangeli wrote: > yes and "the stricter fallback criterion" is precisely called > lower_zone_reserve_ratio and it's included in the 2.4 mainline kernel > and this "stricter fallback criterion" doesn't exist in 2.6 yet. > I do apply it to non-pinned pages too because wasting tons of cpu in > memcopies for migration is a bad idea compared to reseving 900M of > absolutely critical lowmem ram on a 64G box. So I find the > pinned/unpinned parameter worthless and I apply "the stricter fallback > criterion" to all allocations in the same way, which is a lot simpler, > doesn't require substantial vm changes to allow migration of ptes, > anonymous and mlocked memory w/o passing through some swapcache and > without clearng ptes and most important I believe it's a lot more > efficient than migrating with bulk memcopies. Even on a big x86-64 > dealing with the migration complexity is worthless, reserving the full > 16M of dma zone makes a lot more sense. Not sure what's going on here. I suppose I had different expectations, e.g. not attempting to relocate kernel allocations, but rather failing them outright after the threshold is exceeded. No matter, it just saves me the trouble of implementing it. I understood the migration to be a method of last resort, not preferred to admission control. On Thu, Jun 24, 2004 at 08:27:37PM +0200, Andrea Arcangeli wrote: > The lower_zone_reserve_ratio algorithm scales back to the size of the > zones automatically autotuned at boot time and the balance-setting are in > functions of the imbalances found at boot time. 
That's the fundamental > difference with the sysctl that is fixed, for all zones, and it has no > clue on the size of the zones etc... I wasn't involved with this, so unfortunately I don't have an explanation of why these semantics were considered useful. On Thu, Jun 24, 2004 at 08:27:37PM +0200, Andrea Arcangeli wrote: > So in short with little ram installed it will be like mainline 2.6, with > tons of ram installed it will make a huge difference and it will > reserve up to _whole_ classzones to the users that cannot use the higher > zones, but 16M on a 16G box is nothing so nobody will notice any > regression anyways, only the benefits will be noticeable in the otherwise > unsolvable corner cases (yeah, you could try to migrate ptes and other > stuff to solve them but that's incredibly inefficient compared to > throwing 16M or 800M at the problem on respectively 16G or 64G machines, > etc..). > the numbers aren't mathematically exact with the 2.4 code, but you get an idea of > the order of magnitude. This sounds like you're handing back hard allocation failures to unpinned allocations when zone fallbacks are meant to be discouraged. Given this, I think I understand where some of the concerns about merging it came from, though I'd certainly rather have underutilized memory than workload failures. I suspect one concern about this is that it may cause premature workload failures. My own review of the code has determined this to be a minor concern. Rather, I believe it's better to fail the allocations earlier than to allow the workload to slowly accumulate pinned pages in lower zones, even at the cost of underutilizing lower zones. This belief may not be universal. On Thu, Jun 24, 2004 at 08:27:37PM +0200, Andrea Arcangeli wrote: > BTW, I think I'm not the only VM guy who agrees this algo is needed, > for instance I recall Rik once included the lower_zone_reserve_ratio > patch in one of his 2.4 patches too. 
One of the reasons I've not seen this in practice is that the stress tests I'm running aren't going on for extended periods of time, where fallback of pinned allocations to lower zones would be a progressively more noticeable problem as they accumulate. -- wli ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 16:56 ` William Lee Irwin III 2004-06-24 17:32 ` Andrea Arcangeli @ 2004-06-24 21:54 ` Andrew Morton 2004-06-24 22:08 ` William Lee Irwin III ` (2 more replies) 1 sibling, 3 replies; 70+ messages in thread From: Andrew Morton @ 2004-06-24 21:54 UTC (permalink / raw) To: William Lee Irwin III Cc: andrea, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel William Lee Irwin III <wli@holomorphy.com> wrote: > > Is there any chance you could send in this improved implementation of > zone fallback watermarks and describe the deficiencies in the current > scheme that it corrects? We decided earlier this year that the watermark stuff should be forward-ported in toto, but I don't recall why. Nobody got around to doing it because there have been no bug reports. It irks me that the 2.4 algorithm gives away a significant amount of pagecache memory. It's a relatively small amount, but it's still a lot of memory, and all the 2.6 users out there at present are not reporting problems, so we should not penalise all those people on behalf of the few people who might need this additional fallback protection. It should be runtime tunable - that doesn't seem hard to do. All the infrastructure is there now to do this. Note that this code was significantly changed between 2.6.5 and 2.6.7. First thing to do is to identify some workload which needs the patch. Without that, how can we test it? ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 21:54 ` Andrew Morton @ 2004-06-24 22:08 ` William Lee Irwin III 2004-06-24 22:45 ` Andrea Arcangeli 2004-06-24 22:11 ` Andrew Morton 2004-06-24 22:21 ` Andrea Arcangeli 2 siblings, 1 reply; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 22:08 UTC (permalink / raw) To: Andrew Morton Cc: andrea, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 02:54:41PM -0700, Andrew Morton wrote: > We decided earlier this year that the watermark stuff should be > forward-ported in toto, but I don't recall why. Nobody got around to doing > it because there have been no bug reports. > It irks me that the 2.4 algorithm gives away a significant amount of > pagecache memory. It's a relatively small amount, but it's still a lot of > memory, and all the 2.6 users out there at present are not reporting > problems, so we should not penalise all those people on behalf of the few > people who might need this additional fallback protection. > It should be runtime tunable - that doesn't seem hard to do. All the > infrastructure is there now to do this. > Note that this code was significantly changed between 2.6.5 and 2.6.7. > First thing to do is to identify some workload which needs the patch. > Without that, how can we test it? That does sound troublesome, especially since it's difficult to queue up the kinds of extended stress tests needed to demonstrate the problems. The prolonged memory pressure and so on are things that we've unfortunately had to wait until extended runtime in production to see. 
=( The underutilization bit is actually why I keep going on and on about the pinned pagecache relocation; it resolves a portion of the problem of pinned pages in lower zones without underutilizing RAM, then once pinned user pages can arbitrarily utilize lower zones, pinned kernel allocations (which would not be relocatable) can be denied fallback entirely without overall underutilization. I've actually already run out of ideas here, so people just saying what they want me to write might help. Tests can be easily contrived (e.g. fill a swapless box's upper zones with file-backed pagecache, then start allocating anonymous pages), but realistic situations are much harder to trigger. -- wli ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:08 ` William Lee Irwin III @ 2004-06-24 22:45 ` Andrea Arcangeli 2004-06-24 22:51 ` William Lee Irwin III 0 siblings, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 22:45 UTC (permalink / raw) To: William Lee Irwin III, Andrew Morton, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 03:08:23PM -0700, William Lee Irwin III wrote: > The prolonged memory pressure and so on are things that we've > unfortunately had to wait until extended runtime in production to see. =( Luckily this problem doesn't fall in this scenario and it's trivial to reproduce if you've >= 2G of ram. I still have here the testcase google sent me years ago when this problem first saw the light during 2.4.1x. They used mlock, but it's even simpler to reproduce it with a single malloc + bzero (note: no mlock). The few mbytes of lowmem left won't last long if you load some big app after that. > The underutilization bit is actually why I keep going on and on about > the pinned pagecache relocation; it resolves a portion of the problem > of pinned pages in lower zones without underutilizing RAM, then once I also don't like the underutilization but I believe it's a price everybody has to pay if you buy x86. On x86-64 the cost of the insurance is much lower, max 16M wasted, and absolutely nothing wasted if you've an amd system (all amd systems have a real iommu that avoids having to mess with the physical ram addresses). it's like a health insurance: you can avoid paying it but it might not turn out to be a good idea for everyone not to pay for it. At least you should give the choice to the people to be able to pay for it and to have it, and the sysctl is not going to work. It's relatively very cheap as Andrew said, if you've very few mbytes of lowmem you're going to pay very few kbytes for it. 
But I think we should force everyone to have it like I did in 2.4 and absolutely nobody complained, in fact if anything somebody could complain _without_ it. Sure nobody cares about 800M of ram on a 64G machine when they risk a swap-slowdown (and vfs caches overshrink) and in the worst case a lockup without swap without the "insurance". I don't think one should be forced to have swap on a 64G box if the userspace apps have a very well defined high bound of ram utilization. There will always be a limit anyways that is ram+swap, so ideally if we had infinite money it would _always_ be better to replace swap with more ram and to never have swap, swap still makes sense only because disk is still cheaper than ram (watch MRAM). So a VM that destabilizes without swap is not a VM that I can avoid fixing and to me it remains a major bug even if nobody will ever notice it because we don't have that much cheap ram yet. About the ability to tune it at least at boot time, I always wanted it and I added the setup_lower_zone_reserve parameter, but that is parsed too late, so it doesn't work due to a minor implementation detail ;), just as setup_mem_frac apparently doesn't work either. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:45 ` Andrea Arcangeli @ 2004-06-24 22:51 ` William Lee Irwin III 2004-06-24 23:09 ` Andrew Morton 0 siblings, 1 reply; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 22:51 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Fri, Jun 25, 2004 at 12:45:29AM +0200, Andrea Arcangeli wrote: > Luckily this problem doesn't fall in this scenario and it's trivial to > reproduce if you've >= 2G of ram. I still have here the testcase google > sent me years ago when this problem seen the light during 2.4.1x. They > used mlock, but it's even simpler to reproduce it with a single malloc + > bzero (note: no mlock). The few mbytes of lowmem left won't last long if > you load some big app after that. Well, there are magic numbers here we need to explain to get a testcase runnable on more machines than just x86 boxen with exactly 2GB RAM. Where do the 2GB and 1GB come from? Is it that 1GB is the size of the upper zone? -- wli ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:51 ` William Lee Irwin III @ 2004-06-24 23:09 ` Andrew Morton 2004-06-24 23:15 ` William Lee Irwin III 2004-06-25 2:39 ` Andrea Arcangeli 0 siblings, 2 replies; 70+ messages in thread From: Andrew Morton @ 2004-06-24 23:09 UTC (permalink / raw) To: William Lee Irwin III Cc: andrea, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel William Lee Irwin III <wli@holomorphy.com> wrote: > > On Fri, Jun 25, 2004 at 12:45:29AM +0200, Andrea Arcangeli wrote: > > Luckily this problem doesn't fall in this scenario and it's trivial to > > reproduce if you've >= 2G of ram. I still have here the testcase google > > sent me years ago when this problem first saw the light during 2.4.1x. They > > used mlock, but it's even simpler to reproduce it with a single malloc + > > bzero (note: no mlock). The few mbytes of lowmem left won't last long if > > you load some big app after that. > > Well, there are magic numbers here we need to explain to get a testcase > runnable on more machines than just x86 boxen with exactly 2GB RAM. > Where do the 2GB and 1GB come from? Is it that 1GB is the size of the > upper zone? >

A testcase would be, on a 2G box:

a) free up as much memory as you can
b) write a 1.2G file to fill highmem with pagecache
c) malloc(800M), bzero(), sleep
d) swapoff -a

You now have a box which has almost all of lowmem pinned in anonymous memory. It'll limp along and go oom fairly easily. Another testcase would be:

a) free up as much memory as you can
b) write a 1.2G file to fill highmem with pagecache
c) malloc(800M), mlock it

You now have most of lowmem mlocked. In both situations the machine is really sick. Probably the most risky scenario is a swapless machine in which lots of lowmem is allocated to anonymous memory. It should be the case that increasing lower_zone_protection will fix all the above. If not, it needs fixing. So we're down to the question "what should we default to at bootup". 
I find it hard to justify defaulting to a mode where we're super-defensive against this sort of thing, simply because nobody seems to be hitting the problems. Distributors can, if they must, bump lower_zone_protection in initscripts, and it's presumably pretty simple to write a boot script which parses /proc/meminfo's MemTotal and SwapTotal lines, producing an appropriate lower_zone_protection setting. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 23:09 ` Andrew Morton @ 2004-06-24 23:15 ` William Lee Irwin III 2004-06-25 6:16 ` William Lee Irwin III 2004-06-25 2:39 ` Andrea Arcangeli 1 sibling, 1 reply; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 23:15 UTC (permalink / raw) To: Andrew Morton Cc: andrea, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 04:09:45PM -0700, Andrew Morton wrote:

> A testcase would be, on a 2G box:
> a) free up as much memory as you can
> b) write a 1.2G file to fill highmem with pagecache
> c) malloc(800M), bzero(), sleep
> d) swapoff -a
> You now have a box which has almost all of lowmem pinned in anonymous
> memory. It'll limp along and go oom fairly easily.
> Another testcase would be:
> a) free up as much memory as you can
> b) write a 1.2G file to fill highmem with pagecache
> c) malloc(800M), mlock it
> You now have most of lowmem mlocked.

These are approximately identical to the testcases I had in mind, except neither of these is truly specific to 2GB and can have the various magic numbers calculated from sysconf() and/or meminfo. On Thu, Jun 24, 2004 at 04:09:45PM -0700, Andrew Morton wrote: > In both situations the machine is really sick. Probably the most risky > scenario is a swapless machine in which lots of lowmem is allocated to > anonymous memory. > It should be the case that increasing lower_zone_protection will fix all > the above. If not, it needs fixing. > So we're down to the question "what should we default to at bootup". I find > it hard to justify defaulting to a mode where we're super-defensive against > this sort of thing, simply because nobody seems to be hitting the problems. > Distributors can, if they must, bump lower_zone_protection in initscripts, > and it's presumably pretty simple to write a boot script which parses > /proc/meminfo's MemTotal and SwapTotal lines, producing an appropriate > lower_zone_protection setting. 
I'm going to beat on this in short order, but will be indisposed for an hour or two before that begins. Thanks. -- wli ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 23:15 ` William Lee Irwin III @ 2004-06-25 6:16 ` William Lee Irwin III 0 siblings, 0 replies; 70+ messages in thread From: William Lee Irwin III @ 2004-06-25 6:16 UTC (permalink / raw) To: Andrew Morton, andrea, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel /* On Thu, Jun 24, 2004 at 04:09:45PM -0700, Andrew Morton wrote: >> A testcase would be, on a 2G box: >> a) free up as much memory as you can >> b) write a 1.2G file to fill highmem with pagecache >> c) malloc(800M), bzero(), sleep >> d) swapoff -a >> You now have a box which has almost all of lowmem pinned in anonymous >> memory. It'll limp along and go oom fairly easily. >> Another testcase would be: >> a) free up as much memory as you can >> b) write a 1.2G file to fill highmem with pagecache >> c) malloc(800M), mlock it >> You now have most of lowmem mlocked. On Thu, Jun 24, 2004 at 04:15:49PM -0700, William Lee Irwin III wrote: > These are approximately identical to the testcases I had in mind, except > neither of these is truly specific to 2GB and can have the various magic > numbers calculated from sysconf() and/or meminfo. It seems that glibc is fucking with sysinfo or something; hackish workaround was to call sysconf(_SC_PAGESIZE) by hand for where mem_unit would otherwise be needed and to treat the screwed-with sysinfo fields as being in opaque units. Blame Uli. At any rate, the result of running this with no swap online appears to be that this just results in OOM kills whenever enough lowmem is needed. This is expected, as the anonymous allocations aren't mlocked, so with swap online, they would merely be swapped out, and with swap offline, the nr_swap_pages deadlock is no longer possible (the nr_swap_pages fix wasn't in place for this testing). Something more sophisticated may have worse effects. However, there were apparent oddities with premature failures of vma allocations and piss poor vma merging observed. 
For instance, the sbrk()/mmap() changeover logic to fall back on a per-iteration basis is largely because sticking to mmap() and then changing over to sbrk() when it fails switches over prematurely, and so failed to sufficiently utilize lowmem. The failures to find the free areas for the vmas went away after alternating between sbrk() and mmap(). Also, the 64KB mmap()'s of the file aren't merged at all, despite being very very blatantly sequential. I'll look into this. The strategy of mmap()'ing locked pagecache is useless for PAE boxen in general and so things should be taught to, say, mount ramfs, allocate ramfs pagecache to fill highmem, and then go on to mmap() instead of fiddling around mmap()'ing and mlock()'ing pagecache. I can implement this if it's deemed necessary to have the testcase extensible to PAE. The results are mixed. It's not clear that this behavior is pathological, at least not in the manner Andrea described. It is, however, easy to trigger workload failure as opposed to kernel deadlock. It may help to clarify the general position on that kind of issue so I know how and whether that should be addressed. 
$ cat /proc/meminfo
MemTotal:      1032988 kB
MemFree:        106684 kB
Buffers:          3804 kB
Cached:          16256 kB
SwapCached:          0 kB
Active:         897104 kB
Inactive:         2708 kB
HighTotal:      130816 kB
HighFree:       101388 kB
LowTotal:       902172 kB
LowFree:          5296 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:             108 kB
Writeback:           0 kB
Mapped:         881912 kB
Slab:            18276 kB
Committed_AS:   911496 kB
PageTables:       1896 kB
VmallocTotal:   114680 kB
VmallocUsed:      2160 kB
VmallocChunk:   105244 kB
$ cat /proc/buddyinfo
Node 0, zone      DMA      0   0   1   1   1   0   1   1   1   0   0
Node 0, zone   Normal     56  14  59   2   3   0   1   1   1   0   0
Node 0, zone  HighMem    777 315 349 360 505 236  61   1   0   0   0
*/

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <unistd.h>
#include <stdlib.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/sysinfo.h>

#define LENGTH_STEP ((off64_t)pagesize << 4)
#define MAX_RETRIES 64

#ifdef DEBUG
#define dprintf(fmt, arg...) printf(fmt, ##arg)
#else
#define dprintf(fmt, arg...) do { } while (0)
#endif

#define die() \
do { \
	fprintf(stderr, "failure %s (%d) at %s:%d\n", \
		strerror(errno), errno, __FILE__, __LINE__); \
	fflush(stderr); \
	sleep(60); \
	exit(EXIT_FAILURE); \
} while (0)

int main(void)
{
	struct sysinfo info;
	char namebuf[64] = "/tmp/zoneDoS_XXXXXX";
	int i, fd, retries;
	off64_t len = 0;
	unsigned long *first, *last, *p, *first_buf, *last_buf, *q;
	unsigned long freehigh, freelow;
	long pagesize;

	first = last = NULL;
	first_buf = last_buf = NULL;
	if ((pagesize = sysconf(_SC_PAGESIZE)) < 0)
		die();
	if ((fd = mkstemp(namebuf)) < 0)
		die();
	if (unlink(namebuf))
		die();
	if (sysinfo(&info))
		die();
	retries = freehigh = 0;
	while (info.freehigh && retries < MAX_RETRIES) {
		if (ftruncate64(fd, len + LENGTH_STEP))
			die();
		p = mmap(NULL, LENGTH_STEP, PROT_READ|PROT_WRITE,
			 MAP_SHARED, fd, len);
		if (p == MAP_FAILED)
			die();
		len += LENGTH_STEP;
		if (mlock(p, LENGTH_STEP))
			die();
		*p = 0;
		if (last)
			*last = (unsigned long)p;
		last = p;
		if (!first)
			first = p;
		freehigh = info.freehigh;
		if (sysinfo(&info))
			die();
		if (info.freehigh >= freehigh)
			retries++;
		else
			retries = 0;
		dprintf("allocated %lu kB, freehigh = %lu kB\n",
			(unsigned long)(len >> 10),
			(unsigned long)(info.freehigh >> 10));
	}
	if (sysinfo(&info))
		die();
	retries = freelow = 0;
	while (info.freeram - info.freehigh && retries < MAX_RETRIES) {
		/*
		 * MAP_PRIVATE was missing in the original posting; plain
		 * MAP_ANONYMOUS fails with EINVAL, so every iteration
		 * would have fallen back to sbrk().
		 */
		q = mmap(NULL, LENGTH_STEP, PROT_READ|PROT_WRITE,
			 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
		if (q == MAP_FAILED)
			q = sbrk(LENGTH_STEP);
		if (q == MAP_FAILED) {
			sleep(1);
			++retries;
			continue;
		}
		for (i = 0; i < LENGTH_STEP/sizeof(*q); i += pagesize/sizeof(*q))
			q[i + 1] = 1;
		*q = 0;
		if (last_buf)
			*last_buf = (unsigned long)q;
		last_buf = q;
		if (!first_buf)
			first_buf = q;
		freelow = info.freeram - info.freehigh;
		if (sysinfo(&info))
			die();
		if (info.freeram - info.freehigh >= freelow)
			++retries;
		else
			retries = 0;
		dprintf("freelow = %lu kB\n",
			(info.freeram - info.freehigh) >> 10);
	}
	dprintf("done allocating anonymous memory, freeing pagecache\n");
	while (first) {
		p = first;
		first = (unsigned long *)(*first);
		if (munmap(p, LENGTH_STEP))
			die();
	}
	close(fd);
	pause();
	return EXIT_SUCCESS;
}
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 23:09 ` Andrew Morton 2004-06-24 23:15 ` William Lee Irwin III @ 2004-06-25 2:39 ` Andrea Arcangeli 2004-06-25 2:47 ` Andrew Morton 1 sibling, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-25 2:39 UTC (permalink / raw) To: Andrew Morton Cc: William Lee Irwin III, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 04:09:45PM -0700, Andrew Morton wrote: > this sort of thing, simply because nobody seems to be hitting the problems. nobody is hitting the problems because if this problem triggers the machine starts slowly swapping and shrinking the vfs and it eventually relocates the highmem. the crippling of the vfs caches as well isn't a good thing and it will not be noticeable by anybody. If they were truly running without swap they would be hitting these problems very fast. But everybody has swap. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-25 2:39 ` Andrea Arcangeli @ 2004-06-25 2:47 ` Andrew Morton 2004-06-25 3:19 ` Andrea Arcangeli 0 siblings, 1 reply; 70+ messages in thread From: Andrew Morton @ 2004-06-25 2:47 UTC (permalink / raw) To: Andrea Arcangeli Cc: wli, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel Andrea Arcangeli <andrea@suse.de> wrote: > > On Thu, Jun 24, 2004 at 04:09:45PM -0700, Andrew Morton wrote: > > this sort of thing, simply because nobody seems to be hitting the problems. > > nobody is hitting the problems because if this problem triggers the > machine starts slowly swapping and shrinking the vfs and it eventually > relocates the highmem. the crippling of the vfs caches as well isn't > a good thing and it will not be noticeable by anybody. Good point, that. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-25 2:47 ` Andrew Morton @ 2004-06-25 3:19 ` Andrea Arcangeli 0 siblings, 0 replies; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-25 3:19 UTC (permalink / raw) To: Andrew Morton Cc: wli, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel if you want to leave it disabled that's still fine with me as long as it can be enabled in an optimal way (the one I like as usual is the 256/32 ratios of 2.4 ;), but I'm quite convinced that it will provide benefit even if enabled, possibly with bigger ratios if you want less "guaranteed" waste. as usual if one doesn't want any ram and performance waste, x86-64 is out there in production, and it'll avoid all the waste (unless you care about wasting 16M of ram on a 4G box without the risk of failing order 0 dma allocations on the intel implementation). If one wants to go cheap and buy x86 still then he must be prepared to potentially lose 900M of ram on a 32G box, it's a relative cost, so the more ram the more memory will be potentially wasted, the less ram the less ram will be potentially wasted. the most frequent x86 highmem complaints I ever got were related to running _out_ of the lowmem zone with the lowmem zone _empty_. The day I will get a complaint for the lowmem being completely _free_ has yet to come ;). thanks a lot for all the help. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 21:54 ` Andrew Morton 2004-06-24 22:08 ` William Lee Irwin III @ 2004-06-24 22:11 ` Andrew Morton 2004-06-24 23:09 ` Andrea Arcangeli 2004-06-24 22:21 ` Andrea Arcangeli 2 siblings, 1 reply; 70+ messages in thread From: Andrew Morton @ 2004-06-24 22:11 UTC (permalink / raw) To: wli, andrea, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel Andrew Morton <akpm@osdl.org> wrote: > > Note that this code was significantly changed between 2.6.5 and 2.6.7.

Here's the default setup on a 1G ia32 box:

DMA free:4172kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB
        protections[]: 8 476 540
Normal free:54632kB min:936kB low:1872kB high:2808kB active:278764kB inactive:253668kB present:901120kB
        protections[]: 0 468 532
HighMem free:308kB min:128kB low:256kB high:384kB active:87972kB inactive:40300kB present:130516kB
        protections[]: 0 0 64

ie:

- protect 8 pages from ZONE_DMA from a GFP_DMA allocation attempt
- protect 476 pages from ZONE_DMA from a GFP_KERNEL allocation attempt
- protect 540 pages from ZONE_DMA from a GFP_HIGHMEM allocation attempt.

etcetera. After setting lower_zone_protection to 10:

Active:111515 inactive:65009 dirty:116 writeback:0 unstable:0 free:3290 slab:75489 mapped:52247 pagetables:446
DMA free:4172kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB
        protections[]: 8 5156 5860
Normal free:8736kB min:936kB low:1872kB high:2808kB active:352780kB inactive:224972kB present:901120kB
        protections[]: 0 468 1172
HighMem free:252kB min:128kB low:256kB high:384kB active:93280kB inactive:35064kB present:130516kB
        protections[]: 0 0 64

It's a bit complex, and perhaps the relative levels of the various thresholds could be tightened up. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:11 ` Andrew Morton @ 2004-06-24 23:09 ` Andrea Arcangeli 2004-06-25 1:17 ` Nick Piggin 0 siblings, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 23:09 UTC (permalink / raw) To: Andrew Morton Cc: wli, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 03:11:30PM -0700, Andrew Morton wrote: > After setting lower_zone_protection to 10: > > Active:111515 inactive:65009 dirty:116 writeback:0 unstable:0 free:3290 slab:75489 mapped:52247 pagetables:446 > DMA free:4172kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB > protections[]: 8 5156 5860 > Normal free:8736kB min:936kB low:1872kB high:2808kB active:352780kB inactive:224972kB present:901120kB > protections[]: 0 468 1172 > HighMem free:252kB min:128kB low:256kB high:384kB active:93280kB inactive:35064kB present:130516kB > protections[]: 0 0 64 > > It's a bit complex, and perhaps the relative levels of the various > thresholds could be tightened up. 
this is the algorithm I added to 2.4 to produce good protection levels (with lower_zone_reserve_ratio supposedly tunable at boot time):

static int lower_zone_reserve_ratio[MAX_NR_ZONES-1] = { 256, 32 };

	zone->watermarks[j].min = mask;
	zone->watermarks[j].low = mask*2;
	zone->watermarks[j].high = mask*3;
	/* now set the watermarks of the lower zones in the "j" classzone */
	for (idx = j-1; idx >= 0; idx--) {
		zone_t * lower_zone = pgdat->node_zones + idx;
		unsigned long lower_zone_reserve;

		if (!lower_zone->size)
			continue;

		mask = lower_zone->watermarks[idx].min;
		lower_zone->watermarks[j].min = mask;
		lower_zone->watermarks[j].low = mask*2;
		lower_zone->watermarks[j].high = mask*3;

		/* now the brainer part */
		lower_zone_reserve = realsize / lower_zone_reserve_ratio[idx];
		lower_zone->watermarks[j].min += lower_zone_reserve;
		lower_zone->watermarks[j].low += lower_zone_reserve;
		lower_zone->watermarks[j].high += lower_zone_reserve;

		realsize += lower_zone->realsize;
	}

Your code must be inferior since it doesn't even allow tuning each zone differently (you seem not to have a lower_zone_reserve_ratio[idx]). Not sure why you don't simply forward port the code from 2.4 instead of reinventing it. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 23:09 ` Andrea Arcangeli @ 2004-06-25 1:17 ` Nick Piggin 2004-06-25 3:11 ` Andrea Arcangeli 0 siblings, 1 reply; 70+ messages in thread From: Nick Piggin @ 2004-06-25 1:17 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, wli, tiwai, ak, ak, tripperda, discuss, linux-kernel Andrea Arcangeli wrote: > Your code must be inferior since it doesn't even allow tuning each zone > differently (you seem not to have a lower_zone_reserve_ratio[idx]). Not sure > why you don't simply forward port the code from 2.4 instead of reinventing it. > It can easily be modified if required though. Is there a need to be tuning these different things? This is probably where we should hold back on the complexity until it is shown to improve something. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-25 1:17 ` Nick Piggin @ 2004-06-25 3:11 ` Andrea Arcangeli 0 siblings, 0 replies; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-25 3:11 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, wli, tiwai, ak, ak, tripperda, discuss, linux-kernel On Fri, Jun 25, 2004 at 11:17:25AM +1000, Nick Piggin wrote: > It can easily be modified if required though. Is there a need to be > tuning these different things? This is probably where we should hold I did tune them differently in 2.4 mainline at least. 256 ratio for dma and 32 ratio for lowmem, the lowmem is already quite critical in most machines with >2G of ram so its ratio should be lower than dma's. for example on 64bit you want the 16M of dma to be completely reserved only on machines with >4G of ram. The 256 dma ratio applies fine to 64bit archs, and the 32 never applies to 64bit archs and it only applies to the highmem boxes. the 256 and 32 numbers aren't random, they're calculated this way:

4096M of 64bit platform / 16M = 256
32G of 32bit platform / 1G = 32

That means with my 2.4 algorithm any 64bit machine with >4G has its whole dma zone reserved to __GFP_DMA. and at the same time any 32bit machine with 32G of ram doesn't allow anything but GFP_KERNEL to go in lowmem, this is fundamental. Now you may very well argue about the numbers not being perfect and this is still a bit hardcoded with the highmem issues in mind, but it would be possible to generalize it even more and I do see a benefit in not having a fixed number for both issues, and in keeping the bit of extra flexibility that the 2.4 code has over the 2.6 one. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 21:54 ` Andrew Morton 2004-06-24 22:08 ` William Lee Irwin III 2004-06-24 22:11 ` Andrew Morton @ 2004-06-24 22:21 ` Andrea Arcangeli 2004-06-24 22:36 ` Andrew Morton 2004-06-24 22:37 ` William Lee Irwin III 2 siblings, 2 replies; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 22:21 UTC (permalink / raw) To: Andrew Morton Cc: William Lee Irwin III, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 02:54:41PM -0700, Andrew Morton wrote: > First thing to do is to identify some workload which needs the patch. that's quite trivial, boot a 2G box, malloc(1G), bzero(1GB), swapoff -a, then the machine will lock up. Depending on the architecture (more precisely depending if it starts allocating ram from the end or from the start of the physical memory), you may have to load 1G of data into pagecache first, like reading from /dev/hda 1G (without closing the file) will work fine, then run the above malloc + bzero + swapoff. Most people will never report this because everybody has swap and they simply run a lot slower than they could run if they didn't need to pass through the swap device to relocate memory because memory would been allocated in the right place in the first place. this plus the various oom killer breakages that get dominated by the nr_swap_pages > 0 check, are the reasons 2.6 is unusable w/o swap. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:21 ` Andrea Arcangeli @ 2004-06-24 22:36 ` Andrew Morton 2004-06-24 23:15 ` Andrea Arcangeli 2004-06-24 22:37 ` William Lee Irwin III 1 sibling, 1 reply; 70+ messages in thread From: Andrew Morton @ 2004-06-24 22:36 UTC (permalink / raw) To: Andrea Arcangeli Cc: wli, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel Andrea Arcangeli <andrea@suse.de> wrote: > > On Thu, Jun 24, 2004 at 02:54:41PM -0700, Andrew Morton wrote: > > First thing to do is to identify some workload which needs the patch. > > that's quite trivial, boot a 2G box, malloc(1G), bzero(1GB), swapoff -a, > then the machine will lockup. Are those numbers correct? We won't touch swap at all with that test? ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:36 ` Andrew Morton @ 2004-06-24 23:15 ` Andrea Arcangeli 0 siblings, 0 replies; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 23:15 UTC (permalink / raw) To: Andrew Morton Cc: wli, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 03:36:12PM -0700, Andrew Morton wrote: > Andrea Arcangeli <andrea@suse.de> wrote: > > > > On Thu, Jun 24, 2004 at 02:54:41PM -0700, Andrew Morton wrote: > > > First thing to do is to identify some workload which needs the patch. > > > > that's quite trivial, boot a 2G box, malloc(1G), bzero(1GB), swapoff -a, > > then the machine will lockup. > > Are those numbers correct? We won't touch swap at all with that test? they are correct if the page allocator allocates memory starting from address 0 physical up to 2G in contiguous order (sometimes it allocates memory backwards instead, in which case you need to load say 900M in pagecache and then malloc 1.2G, worked fine for me in 2.4 before I fixed it at least). the malloc(1G) will pin the whole lowmem, then the box will be dead. oom killer won't kill the task, but the syscalls will all hang (they don't even return -ENOMEM because you loop forever, 2.4 at least was returning -ENOMEM). workaround is to add swap and to slow down to a crawl relocating ram at disk-seeking-speed and over-shrinking vfs caches, but nobody will notice something is going wrong then. Only swapoff -a will show that something is not going well. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:21 ` Andrea Arcangeli 2004-06-24 22:36 ` Andrew Morton @ 2004-06-24 22:37 ` William Lee Irwin III 2004-06-24 22:40 ` William Lee Irwin III 2004-06-24 23:21 ` Andrea Arcangeli 1 sibling, 2 replies; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 22:37 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel /* On Thu, Jun 24, 2004 at 02:54:41PM -0700, Andrew Morton wrote: >> First thing to do is to identify some workload which needs the patch. On Fri, Jun 25, 2004 at 12:21:50AM +0200, Andrea Arcangeli wrote: > that's quite trivial, boot a 2G box, malloc(1G), bzero(1GB), swapoff -a, > then the machine will lockup. > Depending on the architecture (more precisely depending if it starts > allocating ram from the end or from the start of the physical memory), > you may have to load 1G of data into pagecache first, like reading from > /dev/hda 1G (without closing the file) will work fine, then run the > above malloc + bzero + swapoff. > Most people will never report this because everybody has swap and they > simply run a lot slower than they could run if they didn't need to pass > through the swap device to relocate memory because memory would been allocated > in the right place in the first place. this plus the various oom killer > breakages that gets dominated by the nr_swap_pages > 0 check, are the > reasons 2.6 is unusable w/o swap. Have you tried with 2.6.7? The following program fails to trigger anything like what you've mentioned, though granted it was a 512MB allocation on a 1GB machine. swapoff(2) merely fails. 
*/

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <strings.h>
#include <sys/swap.h>

int main(int argc, char * const argv[])
{
	int i;
	long pagesize, physpages;
	size_t size;
	void *p;

	pagesize = sysconf(_SC_PAGE_SIZE);
	if (pagesize < 0) {
		perror("failed to determine pagesize");
		exit(EXIT_FAILURE);
	}
	physpages = sysconf(_SC_PHYS_PAGES);
	if (physpages < 0) {
		perror("failed to determine physical memory capacity");
		exit(EXIT_FAILURE);
	}
	if ((size_t)(physpages/2) > SIZE_MAX/pagesize) {
		fprintf(stderr, "insufficient virtualspace capacity\n");
		exit(EXIT_FAILURE);
	}
	size = (physpages/2)*pagesize;
	p = malloc(size);
	if (!p) {
		perror("allocation failure");
		exit(EXIT_FAILURE);
	}
	bzero(p, size);
	for (i = 1; i < argc; ++i) {
		if (swapoff(argv[i]))
			perror("swapoff failure");
		fprintf(stderr, "failed to offline %s\n", argv[i]);
		exit(EXIT_FAILURE);
	}
	return EXIT_SUCCESS;
}
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:37 ` William Lee Irwin III @ 2004-06-24 22:40 ` William Lee Irwin III 0 siblings, 0 replies; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 22:40 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel

/* On Thu, Jun 24, 2004 at 03:37:50PM -0700, William Lee Irwin III wrote:
> Have you tried with 2.6.7? The following program fails to trigger anything
> like what you've mentioned, though granted it was a 512MB allocation on
> a 1GB machine. swapoff(2) merely fails.

And after fixing a bug in the program, not even that fails: */

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <strings.h>
#include <sys/swap.h>

int main(int argc, char * const argv[])
{
	int i;
	long pagesize, physpages;
	size_t size;
	void *p;

	pagesize = sysconf(_SC_PAGE_SIZE);
	if (pagesize < 0) {
		perror("failed to determine pagesize");
		exit(EXIT_FAILURE);
	}
	physpages = sysconf(_SC_PHYS_PAGES);
	if (physpages < 0) {
		perror("failed to determine physical memory capacity");
		exit(EXIT_FAILURE);
	}
	if ((size_t)(physpages/2) > SIZE_MAX/pagesize) {
		fprintf(stderr, "insufficient virtualspace capacity\n");
		exit(EXIT_FAILURE);
	}
	size = (physpages/2)*pagesize;
	p = malloc(size);
	if (!p) {
		perror("allocation failure");
		exit(EXIT_FAILURE);
	}
	bzero(p, size);
	for (i = 1; i < argc; ++i) {
		if (swapoff(argv[i])) {
			perror("swapoff failure");
			fprintf(stderr, "failed to offline %s\n", argv[i]);
			exit(EXIT_FAILURE);
		}
	}
	return EXIT_SUCCESS;
}
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 22:37 ` William Lee Irwin III 2004-06-24 22:40 ` William Lee Irwin III @ 2004-06-24 23:21 ` Andrea Arcangeli 2004-06-24 23:45 ` William Lee Irwin III 1 sibling, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 23:21 UTC (permalink / raw) To: William Lee Irwin III, Andrew Morton, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 03:37:50PM -0700, William Lee Irwin III wrote: > /* > On Thu, Jun 24, 2004 at 02:54:41PM -0700, Andrew Morton wrote: > >> First thing to do is to identify some workload which needs the patch. > > On Fri, Jun 25, 2004 at 12:21:50AM +0200, Andrea Arcangeli wrote: > > that's quite trivial, boot a 2G box, malloc(1G), bzero(1GB), swapoff -a, > > then the machine will lockup. > > Depending on the architecture (more precisely depending if it starts > > allocating ram from the end or from the start of the physical memory), > > you may have to load 1G of data into pagecache first, like reading from > > /dev/hda 1G (without closing the file) will work fine, then run the > > above malloc + bzero + swapoff. > > Most people will never report this because everybody has swap and they > > simply run a lot slower than they could run if they didn't need to pass > > through the swap device to relocate memory because memory would been allocated > > in the right place in the first place. this plus the various oom killer > > breakages that gets dominated by the nr_swap_pages > 0 check, are the > > reasons 2.6 is unusable w/o swap. > > Have you tried with 2.6.7? The following program fails to trigger anything I've definitely not tried 2.6.7 and I'm also reading a 2.6.5 codebase. But you can sure trigger it if you run a big workload after the big allocation. > like what you've mentioned, though granted it was a 512MB allocation on > a 1GB machine. swapoff(2) merely fails. what you have to do is this: 1) swapoff -a (it must not fail!! 
it cannot fail if you run it first)
2) fill 130000K in pagecache, be very careful, not more than that, every mbyte matters
3) run your program and allocate 904000K!!! (not 512M!!!)
then keep using the machine until it locks up because it cannot relocate the anonymous memory from the 900M of lowmem to the 130M of highmem. But really I said you need >=2G to have a realistic chance of seeing it. So don't be alarmed that you cannot reproduce it on a 1G box by allocating 512M and with swap still enabled: you had none of the conditions that make it reproducible. I reproduced this dozens of times so I know how to reproduce it very well (admittedly not in 2.6 because nobody crashed on this yet). ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 23:21 ` Andrea Arcangeli @ 2004-06-24 23:45 ` William Lee Irwin III 0 siblings, 0 replies; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 23:45 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, nickpiggin, tiwai, ak, ak, tripperda, discuss, linux-kernel On Fri, Jun 25, 2004 at 01:21:57AM +0200, Andrea Arcangeli wrote: > what you have to do is this: > 1) swapoff -a (it must not fail!! it cannot fail if you run it first) > 2) fill 130000K in pagecache, be very careful, not more than that, every > mbyte matters > 3) run your program and allocate 904000K!!! (not 512M!!!) > then keep using the machine until it locks up because it cannot relocate > the anonymous memory from the 900M of lowmem to the 130M of highmem. > But really I said you need >=2G to have a realistic chance of seeing it. > So don't be alarmed you cannot reproduce on a 1G box by allocating 512M > and with swap still enabled, you had none of the conditions that make it > reproducible. > I reproduced this dozens of times so I know how to reproduce it very > well (admittedly not in 2.6 because nobody crashed on this yet). This resembles the more sophisticated testcase I originally had in mind. I'll be out for a couple of hours and then I'll fix this up. -- wli ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 15:48 ` Nick Piggin 2004-06-24 16:52 ` Andrea Arcangeli @ 2004-06-24 17:39 ` Andrea Arcangeli 2004-06-24 17:53 ` William Lee Irwin III 1 sibling, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 17:39 UTC (permalink / raw) To: Nick Piggin Cc: Takashi Iwai, Andi Kleen, ak, tripperda, discuss, linux-kernel On Fri, Jun 25, 2004 at 01:48:47AM +1000, Nick Piggin wrote: > 2.6 has the "incremental min" thing. What is wrong with that? > Though I think it is turned off by default. I looked more into it and you can leave it turned off since it's not going to work. it's all a function of z->pages_* and those are _global_ for all the zones, and in turn they're absolutely meaningless. the algorithm has nothing in common with lowmem_reserve_ratio; the effect has a tiny bit of similarity, but the incremental min thing is so weak and so bad that it will either not help or it'll waste tons of memory. Furthermore you cannot set a sysctl value that works for all machines. The whole thing should be dropped and replaced with the fine production quality lowmem_reserve_ratio in 2.4.26+ (the only broken thing of lowmem_reserve_ratio is that it cannot be tuned, not even at boot time, a recompile is needed, but that's fixable to tune it at boot time, and in theory at runtime too, but the point is that no dynamic tuning is required with it). Please focus on this code of 2.4:

/*
 * We don't know if the memory that we're going to allocate will
 * be freeable or/and it will be released eventually, so to
 * avoid totally wasting several GB of ram we must reserve some
 * of the lower zone memory (otherwise we risk to run OOM on the
 * lower zones despite there's tons of freeable ram on the
 * higher zones).
 */

typedef struct zone_watermarks_s {
	unsigned long min, low, high;
} zone_watermarks_t;

zone_watermarks_t watermarks[MAX_NR_ZONES];

class_idx = zone_idx(classzone);

for (;;) {
	zone_t *z = *(zone++);
	if (!z)
		break;

	if (zone_free_pages(z, order) > z->watermarks[class_idx].low) {
		page = rmqueue(z, order);
		if (page)
			return page;
	}
}

zone->watermarks[j].min = mask;
zone->watermarks[j].low = mask*2;
zone->watermarks[j].high = mask*3;

/* now set the watermarks of the lower zones in the "j" classzone */
for (idx = j-1; idx >= 0; idx--) {
	zone_t * lower_zone = pgdat->node_zones + idx;
	unsigned long lower_zone_reserve;

	if (!lower_zone->size)
		continue;

	mask = lower_zone->watermarks[idx].min;
	lower_zone->watermarks[j].min = mask;
	lower_zone->watermarks[j].low = mask*2;
	lower_zone->watermarks[j].high = mask*3;

	/* now the brainer part */
	lower_zone_reserve = realsize / lower_zone_reserve_ratio[idx];
	lower_zone->watermarks[j].min += lower_zone_reserve;
	lower_zone->watermarks[j].low += lower_zone_reserve;
	lower_zone->watermarks[j].high += lower_zone_reserve;

	realsize += lower_zone->realsize;
}

The 2.6 algorithm controlled by the sysctl does nothing similar to the above. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 17:39 ` Andrea Arcangeli @ 2004-06-24 17:53 ` William Lee Irwin III 2004-06-24 18:07 ` Andrea Arcangeli 0 siblings, 1 reply; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 17:53 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Takashi Iwai, Andi Kleen, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 07:39:27PM +0200, Andrea Arcangeli wrote: > I looked more into it and you can leave it turned off since it's not > going to work. > it's all in functions of z->pages_* and those are _global_ for all the > zones, and in turn they're absolutely meaningless. > the algorithm has nothing in common with lowmem_reverse_ratio, the > effect has a tinybit of similarity but the incremntal min thing is so > weak and so bad that it will either not help or it'll waste tons of > memory. Furthemore you cannot set a sysctl value that works for all > machines. The whole thing should be dropped and replaced with the fine > production quality lowmem_reserve_ratio in 2.4.26+ > (the only broken thing of lowmem_reserve_ratio is that it cannot be > tuned, not even at boottime, a recompile is needed, but that's fixable > to tune it at boot time, and in theory at runtime too, but the point is > that no dyanmic tuning is required with it) > Please focus on this code of 2.4: There is mention of discrimination between pinned and unpinned allocations not being possible; I can arrange this for more comprehensive coverage if desired. Would you like this to be arranged, and if so, how would you like that to interact with the fallback heuristics? -- wli ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 17:53 ` William Lee Irwin III @ 2004-06-24 18:07 ` Andrea Arcangeli 2004-06-24 18:29 ` William Lee Irwin III 0 siblings, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 18:07 UTC (permalink / raw) To: William Lee Irwin III, Nick Piggin, Takashi Iwai, Andi Kleen, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 10:53:31AM -0700, William Lee Irwin III wrote: > On Thu, Jun 24, 2004 at 07:39:27PM +0200, Andrea Arcangeli wrote: > > I looked more into it and you can leave it turned off since it's not > > going to work. > > it's all in functions of z->pages_* and those are _global_ for all the > > zones, and in turn they're absolutely meaningless. > > the algorithm has nothing in common with lowmem_reverse_ratio, the > > effect has a tinybit of similarity but the incremntal min thing is so > > weak and so bad that it will either not help or it'll waste tons of > > memory. Furthemore you cannot set a sysctl value that works for all > > machines. The whole thing should be dropped and replaced with the fine > > production quality lowmem_reserve_ratio in 2.4.26+ > > (the only broken thing of lowmem_reserve_ratio is that it cannot be > > tuned, not even at boottime, a recompile is needed, but that's fixable > > to tune it at boot time, and in theory at runtime too, but the point is > > that no dyanmic tuning is required with it) > > Please focus on this code of 2.4: > > There is mention of discrimination between pinned and unpinned > allocations not being possible; I can arrange this for more > comprehensive coverage if desired. Would you like this to be arranged, > and if so, how would you like that to interact with the fallback > heuristics? how do you handle swapoff and mlock then? anonymous memory is pinned w/o swap. 
You'd have to relocate the stuff during the mlock or swapoff to obey the pin limits to make this work right, and it sounds quite complicated; it would hurt mlock performance a lot too (some big apps use mlock to page in tons of stuff w/o page faults). Note that the "pinned" thing in theory makes *perfect* sense, but it only makes sense on _top_ of lowmem_zone_reserve_ratio, it's not an alternative. When the page is pinned you obey the "lowmem_zone_reserve_ratio"; when it's _not_ pinned you absolutely ignore the lowmem_zone_reserve_ratio and go with the watermarks[curr_zone_idx] instead of the class_idx. But in practice I doubt it's worth it since I doubt you want to relocate pagecache and anonymous memory during swapoff/mlock. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 18:07 ` Andrea Arcangeli @ 2004-06-24 18:29 ` William Lee Irwin III 0 siblings, 0 replies; 70+ messages in thread From: William Lee Irwin III @ 2004-06-24 18:29 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Takashi Iwai, Andi Kleen, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 08:07:56PM +0200, Andrea Arcangeli wrote: > how do you handle swapoff and mlock then? anonymous memory is pinned w/o > swap. You've relocate the stuff during the mlock or swapoff to obey to > the pin limits to make this work right, and it sounds quite complicated > and it would hurt mlock performance a lot too (some big app uses mlock > to pagein w/o page faults tons of stuff). I don't have a predetermined answer to this. I can take suggestions (e.g. page migration) for a preferred implementation of how pinned userspace is to be handled, or refrain from discriminating between pinned and unpinned allocations as desired. Another possibility would be ignoring the mlocked status of userspace pages in situations where cross-zone migration would be considered necessary. On Thu, Jun 24, 2004 at 08:07:56PM +0200, Andrea Arcangeli wrote: > Note that the "pinned" thing in theory makes *perfect* sense, but it > only makes sense on _top_ of lowmem_zone_reserve_ratio, it's not an > alternative. > When the page is pinned you obey to the "lowmem_zone_reserve_ratio" when > it's _not_ pinned then you absolutely ignore the > lowmem_zone_reseve_ratio and you go with the watermarks[curr_zone_idx] > instead of the class_idx. > But in practice I doubt it worth it since I doubt you want to relocate > pagecache and anonymous memory during swapoff/mlock. I suspect that if it's worth it to migrate userspace memory between zones, it's only worthwhile to do so during page reclamation. 
The first idea that occurs to me is checking for how plentiful memory in upper zones is when a pinned userspace page in a lower zone is found on the LRU, and then migrating it as an alternative to outright eviction or ignoring its pinned status. I didn't actually think of it as an alternative, but as just feeding your algorithm the more precise information the comment implied it wanted. I'm basically just looking to get things as solid as possible, so I'm not wedded to a particular solution. If it's too unclear how to handle the situation when pinned allocations can be distinguished, I can just port the 2.4 fallback discouraging algorithm without extensions. -- wli ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 15:29 ` Andrea Arcangeli 2004-06-24 15:48 ` Nick Piggin @ 2004-06-24 16:04 ` Takashi Iwai 2004-06-24 17:16 ` Andrea Arcangeli 1 sibling, 1 reply; 70+ messages in thread From: Takashi Iwai @ 2004-06-24 16:04 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel At Thu, 24 Jun 2004 17:29:46 +0200, Andrea Arcangeli wrote: > > On Thu, Jun 24, 2004 at 04:58:24PM +0200, Takashi Iwai wrote: > > At Thu, 24 Jun 2004 16:42:58 +0200, > > Andi Kleen wrote: > > > > > > On Thu, 24 Jun 2004 16:36:47 +0200 > > > Takashi Iwai <tiwai@suse.de> wrote: > > > > > > > At Thu, 24 Jun 2004 13:29:00 +0200, > > > > Andi Kleen wrote: > > > > > > > > > > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the > > > > > > allocated pages are out of dma mask, just like in pci-gart.c? > > > > > > (with ifdef x86-64) > > > > > > > > > > That won't work reliable enough in extreme cases. > > > > > > > > Well, it's not perfect, but it'd be far better than GFP_DMA only :) > > > > > > The only description for this patch I can think of is "russian roulette" > > > > Even if we have a bigger DMA zone, it's no guarantee that the obtained > > page is precisely in the given mask. We can unlikely define zones > > fine enough for all different 24, 28, 29, 30 and 31bit DMA masks. > > > > > > My patch for i386 works well in most cases, because such a device is > > usually equipped on older machines with less memory than DMA mask. > > > > Without the patch, the allocation is always <16MB, may fail even small > > number of pages. > > why does it fail? note that with the lower_zone_reserve_ratio algorithm I > added to 2.4 all dma zone will be reserved for __GFP_DMA allocations so > you should have troubles only with 2.6, 2.4 should work fine. > So with latest 2.4 it has to fail only if you already allocated 16M with > pci_alloc_consistent which sounds unlikely. 
If a driver needs large contiguous (e.g. a couple of MB) pages and the memory is fragmented, it may still fail. But it's anyway very rare... However, 16MB isn't enough in some cases indeed. For example, the following devices are often problematic:
- SB Live (emu10k1): This needs many single pages for WaveTable synthesis per user's request (up to 128MB). It sets a 31bit DMA mask (sigh...)
- ES1968: This requires a 28bit DMA mask and a single big buffer for all PCM streams.
Also there are other devices with <32bit DMA masks, for example, 24bit (als4000, es1938, sonicvibes, azt3328), 28bit (ice1712, maestro3), 30bit (trident), 31bit (ali5451)... Takashi ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 16:04 ` Takashi Iwai @ 2004-06-24 17:16 ` Andrea Arcangeli 2004-06-24 18:33 ` Takashi Iwai 0 siblings, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 17:16 UTC (permalink / raw) To: Takashi Iwai; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 06:04:58PM +0200, Takashi Iwai wrote: > At Thu, 24 Jun 2004 17:29:46 +0200, > Andrea Arcangeli wrote: > > > > On Thu, Jun 24, 2004 at 04:58:24PM +0200, Takashi Iwai wrote: > > > At Thu, 24 Jun 2004 16:42:58 +0200, > > > Andi Kleen wrote: > > > > > > > > On Thu, 24 Jun 2004 16:36:47 +0200 > > > > Takashi Iwai <tiwai@suse.de> wrote: > > > > > > > > > At Thu, 24 Jun 2004 13:29:00 +0200, > > > > > Andi Kleen wrote: > > > > > > > > > > > > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the > > > > > > > allocated pages are out of dma mask, just like in pci-gart.c? > > > > > > > (with ifdef x86-64) > > > > > > > > > > > > That won't work reliable enough in extreme cases. > > > > > > > > > > Well, it's not perfect, but it'd be far better than GFP_DMA only :) > > > > > > > > The only description for this patch I can think of is "russian roulette" > > > > > > Even if we have a bigger DMA zone, it's no guarantee that the obtained > > > page is precisely in the given mask. We can unlikely define zones > > > fine enough for all different 24, 28, 29, 30 and 31bit DMA masks. > > > > > > > > > My patch for i386 works well in most cases, because such a device is > > > usually equipped on older machines with less memory than DMA mask. > > > > > > Without the patch, the allocation is always <16MB, may fail even small > > > number of pages. > > > > why does it fail? note that with the lower_zone_reserve_ratio algorithm I > > added to 2.4 all dma zone will be reserved for __GFP_DMA allocations so > > you should have troubles only with 2.6, 2.4 should work fine. 
> > So with latest 2.4 it has to fail only if you already allocated 16M with > > pci_alloc_consistent which sounds unlikely. > > If a driver needs large contiguous (e.g. a couple of MB) pages and the > memory is fragmented, it may still fail. But it's anyway very > rare... Yes. This is why I suggested to use GFP_KERNEL _after_ GFP_DMA has failed, not the other way around. As Andi said in big systems you're pretty much guaranteed that GFP_KERNEL will always fail. > However, 16MB isn't enough in some cases indeed. For example, the > following devices are often problematic: > > - SB Live (emu10k1) > This needs many single pages for WaveTable synthesis per user's > request (up to 128MB). It sets 31bit DMA mask (sigh...) then it may never work. If the lowmem below 4G is all allocated in anonymous memory and you've no swap, there's no way, absolutely no way to make the above work. I start to think you should fail insmod if the machine has more than 2^31 bytes of ram being used by the kernel. All we can do is to give it a chance to work, that is to call GFP_KERNEL _after_ GFP_DMA has failed, but again there's no guarantee that it will work, even if you've only a few gigs of ram. > - ES1968 > This requires 28bit DMA mask and a single big buffer for all PCM > streams. this is just the order > 0 issue. Note that 2.6 limits the defragmentation to order == 3, order 4 and higher are ""guaranteed"" to always fail, this wasn't the case in 2.4. 2.6 adds a few terrible hacks called __GFP_REPEAT and __GFP_NOFAIL, those are all as deadlock prone as order < 4 allocations. The basic deadlocks in 2.6 are due to the lack of a return value from try_to_free_pages: 2.6 has no clue when it made progress or not, it can only try to kill tasks when the highmem and swap are exhausted, but there are tons of other conditions where it can deadlock, including while confusing the oom killer with apps using mlock. 
> Also there are other devices with <32bit DMA masks, for example, 24bit > (als4000, es1938, sonicvibes, azt3328), 28bit (ice1712, maestro3), > 30bit (trident), 31bit (ali5451)... creating a GFP_PCI28 zone at _runtime_ only for the intel implementations that unfortunately lack an iommu might not be too bad. Note that one other relevant thing we can add (with O(N) complexity) is an alloc_pages_range() that walks the whole freelist by hand searching for anything in the physical range passed as parameter. But it would need to be used with care since it'd loop in kernel space for a long time. irq disabling timeouts may also trigger, so implementing it safely won't be trivial. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 17:16 ` Andrea Arcangeli @ 2004-06-24 18:33 ` Takashi Iwai 2004-06-24 18:44 ` Andrea Arcangeli 0 siblings, 1 reply; 70+ messages in thread From: Takashi Iwai @ 2004-06-24 18:33 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel At Thu, 24 Jun 2004 19:16:20 +0200, Andrea Arcangeli wrote: > > On Thu, Jun 24, 2004 at 06:04:58PM +0200, Takashi Iwai wrote: > > At Thu, 24 Jun 2004 17:29:46 +0200, > > Andrea Arcangeli wrote: > > > > > > On Thu, Jun 24, 2004 at 04:58:24PM +0200, Takashi Iwai wrote: > > > > At Thu, 24 Jun 2004 16:42:58 +0200, > > > > Andi Kleen wrote: > > > > > > > > > > On Thu, 24 Jun 2004 16:36:47 +0200 > > > > > Takashi Iwai <tiwai@suse.de> wrote: > > > > > > > > > > > At Thu, 24 Jun 2004 13:29:00 +0200, > > > > > > Andi Kleen wrote: > > > > > > > > > > > > > > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the > > > > > > > > allocated pages are out of dma mask, just like in pci-gart.c? > > > > > > > > (with ifdef x86-64) > > > > > > > > > > > > > > That won't work reliable enough in extreme cases. > > > > > > > > > > > > Well, it's not perfect, but it'd be far better than GFP_DMA only :) > > > > > > > > > > The only description for this patch I can think of is "russian roulette" > > > > > > > > Even if we have a bigger DMA zone, it's no guarantee that the obtained > > > > page is precisely in the given mask. We can unlikely define zones > > > > fine enough for all different 24, 28, 29, 30 and 31bit DMA masks. > > > > > > > > > > > > My patch for i386 works well in most cases, because such a device is > > > > usually equipped on older machines with less memory than DMA mask. > > > > > > > > Without the patch, the allocation is always <16MB, may fail even small > > > > number of pages. > > > > > > why does it fail? 
note that with the lower_zone_reserve_ratio algorithm I > > > added to 2.4 all dma zone will be reserved for __GFP_DMA allocations so > > > you should have troubles only with 2.6, 2.4 should work fine. > > > So with latest 2.4 it has to fail only if you already allocated 16M with > > > pci_alloc_consistent which sounds unlikely. > > > > If a driver needs large contiguous (e.g. a coule of MB) pages and the > > memory is fragmented, it may still fail. But it's anyway very > > rare... > > Yes. This is why I suggested to use GFP_KERNEL _after_ GFP_DMA has > failed, not the other way around. As Andi said in big systems you're > pretty much guaranteed that GFP_KERNEL will always fail. Ok. > > However, 16MB isn't enough in some cases indeed. For example, the > > following devices are often problematic: > > > > - SB Live (emu10k1) > > This needs many single pages for WaveTable synthesis per user's > > request (up to 128MB). It sets 31bit DMA mask (sigh...) > > then it may never work. If the lowmem below 4G is all allocated in > anonymous memory and you've no swap, there's no way, absolutely no way > to make the above work. I start to think you should fail insmod if the > machine has more than 2^31 bytes of ram being used by the kernel. > > All we can do is to give it a chance to work, that is to call GFP_KERNEL > _after_ GFP_DMA has failed, but again there's no guarantee that it will > work, even if you've only a few gigs of ram. Sure, in extreme cases, it can't work. But at least, it _may_ work better than using only GFP_DMA. And indeed it should (still) work on most of consumer PC boxes. The addition of another zone would help much better, though. Takashi ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 18:33 ` Takashi Iwai @ 2004-06-24 18:44 ` Andrea Arcangeli 2004-06-25 15:50 ` Takashi Iwai 0 siblings, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 18:44 UTC (permalink / raw) To: Takashi Iwai; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel On Thu, Jun 24, 2004 at 08:33:02PM +0200, Takashi Iwai wrote: > Sure, in extreme cases, it can't work. But at least, it _may_ work > better than using only GFP_DMA. And indeed it should (still) work > on most of consumer PC boxes. The addition of another zone would help > much better, though. of course agreed. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 18:44 ` Andrea Arcangeli @ 2004-06-25 15:50 ` Takashi Iwai 2004-06-25 17:30 ` Andrea Arcangeli 0 siblings, 1 reply; 70+ messages in thread From: Takashi Iwai @ 2004-06-25 15:50 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel At Thu, 24 Jun 2004 20:44:47 +0200, Andrea Arcangeli wrote: > > On Thu, Jun 24, 2004 at 08:33:02PM +0200, Takashi Iwai wrote: > > Sure, in extreme cases, it can't work. But at least, it _may_ work > > better than using only GFP_DMA. And indeed it should (still) work > > on most of consumer PC boxes. The addition of another zone would help > > much better, though. > > of course agreed. The below is the new patch to follow your advice. thanks, Takashi --- linux-2.6.7/arch/i386/kernel/pci-dma.c-dist 2004-06-24 15:56:46.017473544 +0200 +++ linux-2.6.7/arch/i386/kernel/pci-dma.c 2004-06-25 17:43:42.509366917 +0200 @@ -23,11 +23,22 @@ void *dma_alloc_coherent(struct device * if (dev == NULL || (dev->coherent_dma_mask < 0xffffffff)) gfp |= GFP_DMA; + again: ret = (void *)__get_free_pages(gfp, get_order(size)); - if (ret != NULL) { + if (ret == NULL) { + if (dev && (gfp & GFP_DMA)) { + gfp &= ~GFP_DMA; + goto again; + } + } else { memset(ret, 0, size); *dma_handle = virt_to_phys(ret); + if (!(gfp & GFP_DMA) && + (((unsigned long)*dma_handle + size - 1) & ~(unsigned long)dev->coherent_dma_mask)) { + free_pages((unsigned long)ret, get_order(size)); + return NULL; + } } return ret; } ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-25 15:50 ` Takashi Iwai @ 2004-06-25 17:30 ` Andrea Arcangeli 2004-06-25 17:39 ` Takashi Iwai 0 siblings, 1 reply; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-25 17:30 UTC (permalink / raw) To: Takashi Iwai; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel On Fri, Jun 25, 2004 at 05:50:04PM +0200, Takashi Iwai wrote: > --- linux-2.6.7/arch/i386/kernel/pci-dma.c-dist 2004-06-24 15:56:46.017473544 +0200 > +++ linux-2.6.7/arch/i386/kernel/pci-dma.c 2004-06-25 17:43:42.509366917 +0200 > @@ -23,11 +23,22 @@ void *dma_alloc_coherent(struct device * > if (dev == NULL || (dev->coherent_dma_mask < 0xffffffff)) > gfp |= GFP_DMA; > > + again: > ret = (void *)__get_free_pages(gfp, get_order(size)); > > - if (ret != NULL) { > + if (ret == NULL) { > + if (dev && (gfp & GFP_DMA)) { > + gfp &= ~GFP_DMA; I would find it cleaner to use __GFP_DMA in the whole file; this is not about your changes, the previous code used GFP_DMA too. The issue is that if we change GFP_DMA to add a __GFP_HIGH or similar, the above will clear the other bitflags too. > + (((unsigned long)*dma_handle + size - 1) & ~(unsigned long)dev->coherent_dma_mask)) { > + free_pages((unsigned long)ret, get_order(size)); > + return NULL; > + } I would do the memset and setting of dma_handle after the above check. this approach looks fine, thanks. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-25 17:30 ` Andrea Arcangeli @ 2004-06-25 17:39 ` Takashi Iwai 2004-06-25 17:45 ` Andrea Arcangeli 0 siblings, 1 reply; 70+ messages in thread From: Takashi Iwai @ 2004-06-25 17:39 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel At Fri, 25 Jun 2004 19:30:46 +0200, Andrea Arcangeli wrote: > > On Fri, Jun 25, 2004 at 05:50:04PM +0200, Takashi Iwai wrote: > > --- linux-2.6.7/arch/i386/kernel/pci-dma.c-dist 2004-06-24 15:56:46.017473544 +0200 > > +++ linux-2.6.7/arch/i386/kernel/pci-dma.c 2004-06-25 17:43:42.509366917 +0200 > > @@ -23,11 +23,22 @@ void *dma_alloc_coherent(struct device * > > if (dev == NULL || (dev->coherent_dma_mask < 0xffffffff)) > > gfp |= GFP_DMA; > > > > + again: > > ret = (void *)__get_free_pages(gfp, get_order(size)); > > > > - if (ret != NULL) { > > + if (ret == NULL) { > > + if (dev && (gfp & GFP_DMA)) { > > + gfp &= ~GFP_DMA; > > I would find cleaner to use __GFP_DMA in the whole file, this is not > about your changes, previous code used GFP_DMA too. The issue is that if > we change GFP_DMA to add a __GFP_HIGH or similar, the above will clear > the other bitflags too. Indeed. > > > + (((unsigned long)*dma_handle + size - 1) & ~(unsigned long)dev->coherent_dma_mask)) { > > + free_pages((unsigned long)ret, get_order(size)); > > + return NULL; > > + } > > I would do the memset and setting of dma_handle after the above check. Yep. The below is the corrected version. Thanks! 
Takashi --- linux-2.6.7/arch/i386/kernel/pci-dma.c-dist 2004-06-24 15:56:46.017473544 +0200 +++ linux-2.6.7/arch/i386/kernel/pci-dma.c 2004-06-25 19:38:26.334210809 +0200 @@ -21,13 +21,24 @@ void *dma_alloc_coherent(struct device * gfp &= ~(__GFP_DMA | __GFP_HIGHMEM); if (dev == NULL || (dev->coherent_dma_mask < 0xffffffff)) - gfp |= GFP_DMA; + gfp |= __GFP_DMA; + again: ret = (void *)__get_free_pages(gfp, get_order(size)); - if (ret != NULL) { - memset(ret, 0, size); + if (ret == NULL) { + if (dev && (gfp & __GFP_DMA)) { + gfp &= ~__GFP_DMA; + goto again; + } + } else { *dma_handle = virt_to_phys(ret); + if (!(gfp & __GFP_DMA) && + (((unsigned long)*dma_handle + size - 1) & ~(unsigned long)dev->coherent_dma_mask)) { + free_pages((unsigned long)ret, get_order(size)); + return NULL; + } + memset(ret, 0, size); } return ret; } ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-25 17:39 ` Takashi Iwai @ 2004-06-25 17:45 ` Andrea Arcangeli 0 siblings, 0 replies; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-25 17:45 UTC (permalink / raw) To: Takashi Iwai; +Cc: Andi Kleen, ak, tripperda, discuss, linux-kernel On Fri, Jun 25, 2004 at 07:39:19PM +0200, Takashi Iwai wrote: > Yep. The below is the corrected version. looks perfect thanks ;). ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 11:13 ` Takashi Iwai 2004-06-24 11:29 ` [discuss] " Andi Kleen @ 2004-06-24 14:45 ` Terence Ripperda 2004-06-24 15:41 ` Andrea Arcangeli 1 sibling, 1 reply; 70+ messages in thread From: Terence Ripperda @ 2004-06-24 14:45 UTC (permalink / raw) To: Takashi Iwai; +Cc: Andi Kleen, Terence Ripperda, discuss, linux-kernel, andrea On Thu, Jun 24, 2004 at 04:13:47AM -0700, tiwai@suse.de wrote: > > pci_alloc_consistent is limited to 16MB, but so far nobody has really > > complained about that. If that should be a real issue we can make > > it allocate from the swiotlb pool, which is usually 64MB (and can > > be made bigger at boot time) > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the > allocated pages are out of dma mask, just like in pci-gart.c? > (with ifdef x86-64) pci_alloc_consistent (at least on x86-64) does do this. one of the problems I've seen in experimentation is that GFP_KERNEL tends to allocate from the top of memory down. this means that all of the GFP_KERNEL allocations are > 32-bits, which forces us down to GFP_DMA and the < 16M allocations. I've mainly tested this after a cold boot, so after running for a while, GFP_KERNEL may hit good allocations a lot more. Thanks, Terence ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 14:45 ` Terence Ripperda @ 2004-06-24 15:41 ` Andrea Arcangeli 0 siblings, 0 replies; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 15:41 UTC (permalink / raw) To: Terence Ripperda; +Cc: Takashi Iwai, Andi Kleen, discuss, linux-kernel On Thu, Jun 24, 2004 at 09:45:51AM -0500, Terence Ripperda wrote: > On Thu, Jun 24, 2004 at 04:13:47AM -0700, tiwai@suse.de wrote: > > > pci_alloc_consistent is limited to 16MB, but so far nobody has really > > > complained about that. If that should be a real issue we can make > > > it allocate from the swiotlb pool, which is usually 64MB (and can > > > be made bigger at boot time) > > > > Can't it be called with GFP_KERNEL at first, then with GFP_DMA if the > > allocated pages are out of dma mask, just like in pci-gart.c? > > (with ifdef x86-64) > > pci_alloc_consistent (at least on x86-64) does do this. one of the problems > I've seen in experimentation is that GFP_KERNEL tends to allocate from the > top of memory down. this means that all of the GFP_KERNEL allocations are > > 32-bits, which forces us down to GFP_DMA and the < 16M allocations. > > I've mainly tested this after a cold boot, so after running for a while, > GFP_KERNEL may hit good allocations a lot more. it's trivial to change the order in the freelist to allocate from lower addresses first, but the point is still that over time that will be random. the 16M must be reserved entirely for __GFP_DMA on any machine with >=1G of ram, and the lowmem_reserve_ratio algorithm accomplishes this and scales down the reserve ratio according to the balance between the lowmem and dma zones. I believe, if anything, you should try GFP_KERNEL after GFP_DMA fails, not the other way around. btw, 2.6 is even more efficient in shrinking and paging out the dma zone than 2.4 could be. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-23 23:46 ` 32-bit dma allocations on 64-bit platforms Andi Kleen 2004-06-24 11:13 ` Takashi Iwai @ 2004-06-24 15:44 ` Terence Ripperda 2004-06-24 16:15 ` [discuss] " Andi Kleen 2004-06-24 18:51 ` Andi Kleen 1 sibling, 2 replies; 70+ messages in thread From: Terence Ripperda @ 2004-06-24 15:44 UTC (permalink / raw) To: Andi Kleen; +Cc: Terence Ripperda, discuss, tiwai, linux-kernel, andrea On Wed, Jun 23, 2004 at 04:46:44PM -0700, ak@muc.de wrote: > pci_alloc_consistent is limited to 16MB, but so far nobody has really > complained about that. If that should be a real issue we can make > it allocate from the swiotlb pool, which is usually 64MB (and can > be made bigger at boot time) In all of the cases I've seen, it defaults to 4M. in swiotlb.c, io_tlb_nslabs defaults to 1024, * PAGE_SIZE == 4194304. > Would that work for you too BTW ? How much memory do you expect > to need? potentially. our currently pending release uses pci_map_sg, which relies on swiotlb for em64t systems. it "works", but we have some ugly hacks and were hoping to get away from using it (at least in its current form). here are some of the problems we encountered: probably the biggest problem is that the size is way too small for our needs (more on our memory usage shortly). this is compounded by the swiotlb code throwing a kernel panic when it can't allocate memory. and if the panic doesn't halt the machine, the routine returns a random value off the stack as the dma_addr_t. for this reason, we have an ugly hack that notices that swiotlb is enabled (just checks if swiotlb is set) and prints a warning to the user to bump up the size of the swiotlb to 16384, or 64M. also, the proper usage of using the bounce buffers and calling pci_dma_sync_* would be a performance killer for us. 
we stream a considerable amount of data to the gpu per second (on the order of 100s of Megs a second), so having to do an additional memcpy would reduce performance considerably, in some cases between 30-50%. for this reason, we detect when the dma_addr != phys_addr, and map the dma_addr directly to opengl to avoid the copy. I know this is ugly, and that's one of the things I'd really like to get away from. finally, our driver already uses a considerable amount of memory. by definition, the swiotlb interface doubles that memory usage. if our driver used swiotlb correctly (as in didn't know about swiotlb and always called pci_dma_sync_*), we'd lock down the physical addresses opengl writes to, since they're normally used directly for dma, plus the pages allocated from the swiotlb would be locked down (currently we manually do this, but if swiotlb is supposed to be transparent to the driver and used for dma, I assume it must already be locked down, perhaps by definition of being bootmem?). this means not only is the memory usage double, but it's all locked down and unpageable. in this case, it almost would make more sense to treat the bootmem allocated for swiotlb as a pool of 32-bit memory that can be directly allocated from, rather than as bounce buffers. I don't know that this would be an acceptable interface though. but if we could come up with reasonable solutions to these problems, this may work. > drawback is that the swiotlb pool is not unified with the rest of the > VM, so tying up too much memory there is quite unfriendly. > e.g. if you you can use up 1GB then i wouldn't consider this suitable, > for 128MB max it may be possible. I checked with our opengl developers on this. by default, we allocate ~64k for X's push buffer and ~1M per opengl client for their push buffer. on quadro/workstation parts, we allocate 20M for the first opengl client, then ~1M per client after that. 
in addition to the push buffer, there is a lot of data that apps dump to the push buffer. this includes textures, vertex buffers, display lists, etc. the amount of memory used for this varies greatly from app to app. the 20M listed above includes the push buffer and memory for these buffers (I think workstation apps tend to push a lot more pre-processed vertex data to the gpu). note that most agp apertures these days are in the 128M - 1024M range, and there are times that we exhaust that memory on the low end. I think our driver is greedy when trying to allocate memory for performance reasons, but has good fallback cases. so being somewhat limited on resources isn't too bad, just so long as the kernel doesn't panic instead of failing the memory allocation. I would think that 64M or 128M would be good. a nice feature of swiotlb is the ability to tune it at boot. so if a workstation user found they really did need more memory for performance, they could tweak that value up for themselves. also remember future growth. PCI-E has something like 20/24 lanes that can be split among multiple PCI-E slots. Alienware has already announced multi-card products, and it's likely multi-card products will be more readily available on PCI-E, since the slots should have equivalent bandwidth (unlike AGP+PCI). nvidia has also had workstation parts in the past with 2 gpus and a bridge chip. each of these gpus ran twinview, so each card drove 4 monitors. these were pci cards, and some crazy vendors had 4+ of these cards in a machine driving many monitors. this just pushes the memory requirements up in special circumstances. Thanks, Terence ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 15:44 ` Terence Ripperda @ 2004-06-24 16:15 ` Andi Kleen 2004-06-24 17:22 ` Andrea Arcangeli 2004-06-24 22:28 ` Terence Ripperda 1 sibling, 2 replies; 70+ messages in thread From: Andi Kleen @ 2004-06-24 16:15 UTC (permalink / raw) To: Terence Ripperda; +Cc: Andi Kleen, discuss, tiwai, linux-kernel, andrea On Thu, Jun 24, 2004 at 10:44:29AM -0500, Terence Ripperda wrote: > On Wed, Jun 23, 2004 at 04:46:44PM -0700, ak@muc.de wrote: > > pci_alloc_consistent is limited to 16MB, but so far nobody has really > > complained about that. If that should be a real issue we can make > > it allocate from the swiotlb pool, which is usually 64MB (and can > > be made bigger at boot time) > > In all of the cases I've seen, it defaults to 4M. in swiotlb.c, > io_tlb_nslabs defaults to 1024, * PAGE_SIZE == 4194304. Oops, that should probably be fixed. I think it was 64MB at some point ... 4MB is definitely far too small. > probably the biggest problem is that the size is way too small for our > needs (more on our memory usage shortly). this is compounded by the > the swiotlb code throwing a kernel panic when it can't allocate > memory. and if the panic doesn't halt the machine, the routine returns > a random value off the stack as the dma_addr_t. That sounds like a bug too. pci_map_sg should return 0 when it overflows. The gart iommu code will do that. I'll take a look, need to convince the IA64 people of any changes though (I just reused their code). Newer pci_map_single also got a "bad_dma_address" magic return value to check for this, but some also just panic. > also, the proper usage of using the bounce buffers and calling > pci_dma_sync_* would be a performance killer for us. 
we stream a > considerable amount of data to the gpu per second (on the order of > 100s of Megs a second), so having to do an additional memcpy would > reduce performance considerably, in some cases between 30-50%. Understood. > finally, our driver already uses a considerable amount of memory. by > definition, the swiotlb interface doubles that memory usage. if our > driver used swiotlb correctly (as in didn't know about swiotlb and > always called pci_dma_sync_*), we'd lock down the physical addresses > opengl writes to, since they're normally used directly for dma, plus > the pages allocated from the swiotlb would be locked down (currently > we manually do this, but if swiotlb is supposed to be transparent to > the driver and used for dma, I assume it must already be locked down, > perhaps by definition of being bootmem?). this means not only is the It's allocated once at boot and never freed or increased. (the reason is that these functions must all work inside spinlocks and cannot sleep, and you cannot do anything serious to the VM with that constraint) - arguably it would have been much nicer to pass them a GFP flag and do sleeping for bounce memory and GFP_KERNEL allocations etc.instead of the dumb panics on overflow. Maybe something for 2.7. > in this case, it almost would make more sense to treat the bootmem > allocated for swiotlb as a pool of 32-bit memory that can be directly > allocated from, rather than as bounce buffers. I don't know that this > would be an acceptable interface though. Ok, that was one of my proposals too (using it for pci_alloc_consistent). But again it would only help if the memory requirements are relatively moderate. > but if we could come up with reasonable solutions to these problems, > this may work. > > > drawback is that the swiotlb pool is not unified with the rest of the > > VM, so tying up too much memory there is quite unfriendly. > > e.g. 
if you you can use up 1GB then i wouldn't consider this suitable, > > for 128MB max it may be possible. > > I checked with our opengl developers on this. by default, we allocate > ~64k for X's push buffer and ~1M per opengl client for their push > buffer. on quadro/workstation parts, we allocate 20M for the first > opengl client, then ~1M per client after that. Oh, that sounds quite moderate. Ok, then we probably don't need the GFP_BIGDMA zone just for you. Great. > > in addition to the push buffer, there is a lot of data that apps dump > to the push buffer. this includes textures, vertex buffers, display > lists, etc. the amount of memory used for this varies greatly from app > to app. the 20M listed above includes the push buffer and memory for > these buffers (I think workstation apps tend to push a lot more > pre-processed vertex data to the gpu). Overall it sounds more like you need 128MB though - especially since we cannot give everything to you, but also still need some memory for SATA and other devices with limited addressing capability (fortunately they slowly get fixed now) I would prefer if the default value would work for most users because any special options are a very high support load. Do you think 64MB (minus other users so maybe 30-40MB in practice) would be still sufficient to give reasonable performance without hickups? > > note that most agp apertures these days are in the 128M - 1024M range, > and there are times that we exhaust that memory on the low end. I Yes, I have the same problem with the IOMMU. The IOMMU makes it actually worse, because it reserves half of the aperture (so you may get only 64MB IOMMU/AGP aperture in the worst case) But it can be increased in the BIOS and the kernel has code to get a larger aperture too) > think our driver is greedy when trying to allocate memory for > performance reasons, but has good fallback cases. 
so being somewhat > limited on resources isn't too bad, just so long as the kernel doesn't > panic instead of falling the memory allocation. Agreed, the panics should be made optional at least. I will take a look at doing this for swiotlb too. I like them as options though because for debugging it's better to get a clear panic than a weird malfunction. > also remember future growth. PCI-E has something like 20/24 lanes that > can be split among multiple PCI-E slots. Alienware has already > announced multi-card products, and it's likely multi-card products > will be more readily available on PCI-E, since the slots should have > equivalent bandwidth (unlike AGP+PCI). > > nvidia has also had workstation parts in the past with 2 gpus and a > bridge chip. each of these gpus ran twinview, so each card drove 4 > monitors. these were pci cards, and some crazy vendors had 4+ of these > cards in a machine driving many monitors. this just pushes the memory > requirements up in special circumstances. But why didn't you implement addressing capability for >32bit in your hardware then? I imagine the memory requirements won't stop at 4GB (or rather 2-3GB because not all phys mapping space below 4GB can be dedicated to graphics) It sounds a bit weird to have such extreme requirements and then cripple the hardware like this. Anyways - for such extreme applications i think it's perfectly reasonable to require the user to pass special boot options and tie up much memory. -Andi ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 16:15 ` [discuss] " Andi Kleen @ 2004-06-24 17:22 ` Andrea Arcangeli 2004-06-24 22:28 ` Terence Ripperda 1 sibling, 0 replies; 70+ messages in thread From: Andrea Arcangeli @ 2004-06-24 17:22 UTC (permalink / raw) To: Andi Kleen; +Cc: Terence Ripperda, Andi Kleen, discuss, tiwai, linux-kernel On Thu, Jun 24, 2004 at 06:15:40PM +0200, Andi Kleen wrote: > reasonable to require the user to pass special boot options and > tie up much memory. the boot parameter will always work and it avoids a new zone. btw, if we linked the driver into the kernel no boot parameter would be necessary: when the hardware was discovered it could allocate its tons of memory with bootmem. But it sounds like there are too many drivers in trouble, so I believe we can't link them all. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 16:15 ` [discuss] " Andi Kleen 2004-06-24 17:22 ` Andrea Arcangeli @ 2004-06-24 22:28 ` Terence Ripperda 1 sibling, 0 replies; 70+ messages in thread From: Terence Ripperda @ 2004-06-24 22:28 UTC (permalink / raw) To: Andi Kleen Cc: Terence Ripperda, Andi Kleen, discuss, tiwai, linux-kernel, andrea On Thu, Jun 24, 2004 at 09:15:40AM -0700, ak@suse.de wrote: > I would prefer if the default value would work for most users > because any special options are a very high support load. > Do you think 64MB (minus other users so maybe 30-40MB in practice) > would be still sufficient to give reasonable performance without > hickups? that's what we're currently asking users to do for our current swiotlb code. we are seeing some hickups in ut2004, but I haven't investigated if this is related to limited memory resources (actually, it shouldn't be, as we'd have paniced instead of failing to allocate memory). I think I would push for 128M by default, just to make sure there's plenty. I don't think this should be too bad, since this would only kick in if the user has 4+ Gigs of memory, in which 128M is a small portion of the total. > Agreed, the panics should be made optional at least. I will > take a look at doing this for swiotlb too. I like > them as options though because for debugging it's better to get > a clear panic than a weird malfunction. it makes perfect sense to have a debugging option for that, it'd just be nice to have that not be the default. > But why didn't you implement addressing capability for >32bit > in your hardware then? I imagine the memory requirements won't > stop at 4GB (or rather 2-3GB because not all phys mapping > space below 4GB can be dedicated to graphics) I suspect the addressing capability is due to cost/die size tradeoffs. and I didn't mean to imply that these setups would be common, or really use that much additional memory. 
just pointing out that it's not uncommon to have some odd frankenstein setups that would use a little more memory than normal. you are correct that in these cases, a little more end user tweaking is acceptable. after talking to some of the other developers here, we wanted to re-inquire about the extra dma zone approach, and how feasible/acceptable that might be. one of the thoughts is that the swiotlb approach would probably be the easiest to get in place quickly, but that the dma zone approach would be more robust. we wouldn't need to set aside an allocation pool, there wouldn't need to be end user tweaking for the corner cases, etc. Thanks, Terence ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 15:44 ` Terence Ripperda 2004-06-24 16:15 ` [discuss] " Andi Kleen @ 2004-06-24 18:51 ` Andi Kleen 2004-06-26 4:58 ` David Mosberger 1 sibling, 1 reply; 70+ messages in thread From: Andi Kleen @ 2004-06-24 18:51 UTC (permalink / raw) To: Terence Ripperda; +Cc: Andi Kleen, discuss, tiwai, linux-kernel, andrea On Thu, Jun 24, 2004 at 10:44:29AM -0500, Terence Ripperda wrote: > On Wed, Jun 23, 2004 at 04:46:44PM -0700, ak@muc.de wrote: > > pci_alloc_consistent is limited to 16MB, but so far nobody has really > > complained about that. If that should be a real issue we can make > > it allocate from the swiotlb pool, which is usually 64MB (and can > > be made bigger at boot time) > > In all of the cases I've seen, it defaults to 4M. in swiotlb.c, > io_tlb_nslabs defaults to 1024, * PAGE_SIZE == 4194304. I checked this now. It's #define IO_TLB_SHIFT 11 static unsigned long io_tlb_nslabs = 1024; and the allocation does io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs * (1 << IO_TLB_SHIFT)); which contrary to its name does not allocate in pages (otherwise you would get 8GB of memory on x86-64 and even more on IA64) That's definitely far too small. A better IO_TLB_SHIFT would be 16 or 17. -Andi ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 18:51 ` Andi Kleen @ 2004-06-26 4:58 ` David Mosberger 0 siblings, 0 replies; 70+ messages in thread From: David Mosberger @ 2004-06-26 4:58 UTC (permalink / raw) To: Andi Kleen; +Cc: Terence Ripperda, discuss, tiwai, linux-kernel, andrea >>>>> On Thu, 24 Jun 2004 20:51:56 +0200, Andi Kleen <ak@muc.de> said: Andi> A better IO_TLB_SHIFT would be 16 or 17. Careful. I see code like this: stride = (1 << (PAGE_SHIFT - IO_TLB_SHIFT)); You probably don't want IO_TLB_SHIFT > PAGE_SHIFT... Increasing io_tlb_nslabs should be no problem though (subject to memory availability). It can already be set via the "swiotlb" option. I doubt swiotlb is the right thing here, though, given the bw-demands of graphics. Too bad Nvidia cards don't support > 32 bit addressability and Intel chipsets don't support I/O MMUs... --david ^ permalink raw reply [flat|nested] 70+ messages in thread
[parent not found: <2akPm-16l-65@gated-at.bofh.it>]
* Re: 32-bit dma allocations on 64-bit platforms [not found] <2akPm-16l-65@gated-at.bofh.it> @ 2004-06-23 21:46 ` Andi Kleen 2004-06-24 6:18 ` Arjan van de Ven 0 siblings, 1 reply; 70+ messages in thread From: Andi Kleen @ 2004-06-23 21:46 UTC (permalink / raw) To: Terence Ripperda; +Cc: discuss, tiwai, linux-kernel Terence Ripperda <tripperda@nvidia.com> writes: [sending again with linux-kernel in cc] > I'm working on cleaning up some of our dma allocation code to properly allocate 32-bit physical pages for dma on 64-bit platforms. I think our first pass at supporting em64t is sub-par. I'd like to fix that by using the correct kernel interfaces. I get from this that your hardware cannot DMA to >32bit. > > the physical addressing of memory allocations seems to boil down to the behavior of GFP_DMA and GFP_NORMAL. but there seems to be some disconnect between what these mean for each architecture and what various drivers expect them to mean. > > based on each architecture's paging_init routines, the zones look like this:
>
>                 x86:        ia64:    x86_64:
>   ZONE_DMA:     < 16M       < ~4G    < 16M
>   ZONE_NORMAL:  16M - ~1G   > ~4G    > 16M
>   ZONE_HIMEM:   1G+
>
> > an example of this disconnect is vmalloc_32. this function is obviously intended to allocate 32-bit addresses (this was specifically mentioned in a comment in 2.4.x header files). but vmalloc_32 is an inline routine that calls __vmalloc(GFP_KERNEL). based on the above zone descriptions, this will do the correct thing for x86, but not for ia64 or x86_64. on ia64, a driver could just use GFP_DMA for the desired behavior, but this doesn't work well for x86_64. > > AMD's x86_64 provides remapping > 32-bit pages through the iommu, but obviously Intel's em64t provides no such ability. based on the above zonings, this leaves us with the options of either relying on the swiotlb interfaces for dma, or relying on the isa memory for dma. 
> > the last day or two, I've been experimenting with using the pci_alloc_consistent interface, which uses the latter (note attached patch to fix an apparent memory leak in the x86_64 pci_alloc_consistent). unfortunately, this provides very little dma-able memory. In theory, up to 16 Megs, but in practice I'm only getting about 5 1/2 Megs. > > I was rather surprised by these limitations on allocating 32-bit addresses. I checked through the dri and bttv drivers, to see if they had dealt with these issues, and they did not appear to have done so. has anyone tested these drivers on ia64/x86_64/em64t platforms w/ 4+ Gigs of memory? > > are these limitations on allocating 32-bit addresses intentional and known? is there anything we can do to help improve this situation? help work on development? First, vmalloc_32 is a rather broken interface and should imho just be removed. The function name just gives promises that cannot be kept. It was always quite bogus. Please don't use it. The situation on EM64T is very unfortunate, I agree. There was a reason we asked AMD to add an IOMMU and it's quite bad that the Intel chipset people ignored that wisdom and put us into this compatibility mess. Failing that it would be best if the other PCI DMA hardware could just address enough memory, but that's less realistic than just fixing the chipset. The x86-64 port had decided early to keep the 16MB GFP_DMA zone to get maximum driver compatibility and because the AMD IOMMU gave us a nice alternative over bounce buffering. In theory I'm not totally against enlarging GFP_DMA a bit on x86-64. It would just be difficult to find a good value. The problem is that there may be existing drivers that rely on the 16MB limit, and it would not be very nice to break them. We got rid of a lot of them by disallowing CONFIG_ISA, but there may be some left. 
So before doing this it would need a full driver tree audit to check every device. The most prominent example used to be the floppy driver, but the current floppy driver seems to use some other way to get around this. There seem to be quite a few sound chipsets with DMA limits < 32bit; e.g. 29 bits seems to be quite common, but I see several 24bit PCI ones too. I must say I'm somewhat reluctant to break a working in-tree driver. Especially for the sake of an out of tree binary driver. Arguably the problem is probably not limited to you, but it's quite possible that even the in tree DRI drivers have it, so it would still be worth fixing. I see two somewhat realistic ways to handle this: - We enlarge GFP_DMA and find some way to do double buffering for these sound drivers (it would need a PCI-DMA API extension that always calls swiotlb for this) For sound that's not too bad, because they are relatively slow. It would require reserving bootmem memory early for the bounces, but I guess requiring the user to pass a special boot time parameter for these devices would be reasonable. If yes, someone would need to do this work. Also the question would be how large to make GFP_DMA. Ideally it should not be too big, so that e.g. 29bit devices don't require the bounce buffering. - We introduce multiple GFP_DMA zones and keep 16MB GFP_DMA and GFP_BIGDMA or somesuch for larger DMA. The VM should be able to handle this, but it may still require some tuning. It would need some generic changes, but not too bad. Still would need a decision on how big GFP_BIGDMA should be. I suspect 4GB would be too big again. Comments? -Andi ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-23 21:46 ` Andi Kleen @ 2004-06-24 6:18 ` Arjan van de Ven 2004-06-24 10:33 ` Andi Kleen 2004-06-24 13:48 ` Jesse Barnes 0 siblings, 2 replies; 70+ messages in thread
From: Arjan van de Ven @ 2004-06-24 6:18 UTC (permalink / raw)
To: Andi Kleen; +Cc: Terence Ripperda, discuss, tiwai, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 636 bytes --]

On Wed, 2004-06-23 at 23:46, Andi Kleen wrote:
> The VM should be able to handle this, but it may still require some tuning. It would need some generic changes, but not too bad. Still would need a decision on how big GFP_BIGDMA should be. I suspect 4GB would be too big again.

What is the problem again? Can't the driver use the dynamic pci mapping API, which does allow more memory to be mapped even on crippled machines without an iommu?

And isn't this a problem that will vanish, since PCI Express and PCI-X both *require* support for 64 bit addressing, so all higher speed cards are going to be ok in principle?

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 6:18 ` Arjan van de Ven @ 2004-06-24 10:33 ` Andi Kleen 2004-06-24 13:48 ` Jesse Barnes 1 sibling, 0 replies; 70+ messages in thread
From: Andi Kleen @ 2004-06-24 10:33 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Andi Kleen, Terence Ripperda, discuss, tiwai, linux-kernel

On Thu, Jun 24, 2004 at 08:18:06AM +0200, Arjan van de Ven wrote:
> On Wed, 2004-06-23 at 23:46, Andi Kleen wrote:
> > The VM should be able to handle this, but it may still require some tuning. It would need some generic changes, but not too bad. Still would need a decision on how big GFP_BIGDMA should be. I suspect 4GB would be too big again.
>
> What is the problem again? Can't the driver use the dynamic pci mapping API, which does allow more memory to be mapped even on crippled machines without an iommu?

In theory one could make pci_alloc_consistent allocate from the swiotlb pool, yes; the problem is just that this pool is completely preallocated. If enough memory is needed, that would be quite nasty, because you suddenly lose 1 or 2GB of RAM.

> And isn't this a problem that will vanish, since PCI Express and PCI-X both *require* support for 64 bit addressing, so all higher speed cards are going to be ok in principle?

There are EM64T systems with AGP only, and not all PCI-Express cards seem to follow this. PCI-Express unfortunately discouraged the AGP aperture too, so not even that can be used on those Intel systems.

-Andi

^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 6:18 ` Arjan van de Ven 2004-06-24 10:33 ` Andi Kleen @ 2004-06-24 13:48 ` Jesse Barnes 2004-06-24 14:39 ` Terence Ripperda 1 sibling, 1 reply; 70+ messages in thread
From: Jesse Barnes @ 2004-06-24 13:48 UTC (permalink / raw)
To: arjanv; +Cc: Andi Kleen, Terence Ripperda, discuss, tiwai, linux-kernel

On Thursday, June 24, 2004 2:18 am, Arjan van de Ven wrote:
> What is the problem again? Can't the driver use the dynamic pci mapping API, which does allow more memory to be mapped even on crippled machines without an iommu?
> And isn't this a problem that will vanish, since PCI Express and PCI-X both *require* support for 64 bit addressing, so all higher speed cards are going to be ok in principle?

Well, PCI-X may require it, but there certainly are PCI-X devices that don't do 64 bit addressing, or if they do, it's a crippled implementation (e.g. the top 32 bits have to be constant).

Jesse

^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-24 13:48 ` Jesse Barnes @ 2004-06-24 14:39 ` Terence Ripperda 0 siblings, 0 replies; 70+ messages in thread
From: Terence Ripperda @ 2004-06-24 14:39 UTC (permalink / raw)
To: Jesse Barnes
Cc: arjanv, Andi Kleen, Terence Ripperda, discuss, tiwai, linux-kernel

correct. I checked with my contacts here on the PCI express requirements. Apparently the spec says "A PCI Express Endpoint operating as the Requester of a Memory Transaction is required to be capable of generating addresses greater than 4GB", but my contact claims this is a "soft" requirement.

but even if all PCI-X and PCI-E devices properly addressed the full 64 bits, legacy 32-bit PCI devices can be plugged into the motherboards as well. my Intel em64t boards have mostly PCI-X, but 1 PCI slot, and my AMD x86_64 boards have all PCI slots (aside from the main PCI-E slot). also, at least one motherboard manufacturer claims PCI-E + AGP, but the AGP is really just an AGP form-factor slot on the PCI bus.

Thanks,
Terence

On Thu, Jun 24, 2004 at 06:48:07AM -0700, jbarnes@engr.sgi.com wrote:
> On Thursday, June 24, 2004 2:18 am, Arjan van de Ven wrote:
> > What is the problem again? Can't the driver use the dynamic pci mapping API, which does allow more memory to be mapped even on crippled machines without an iommu?
> > And isn't this a problem that will vanish, since PCI Express and PCI-X both *require* support for 64 bit addressing, so all higher speed cards are going to be ok in principle?
>
> Well, PCI-X may require it, but there certainly are PCI-X devices that don't do 64 bit addressing, or if they do, it's a crippled implementation (e.g. the top 32 bits have to be constant).
>
> Jesse

^ permalink raw reply [flat|nested] 70+ messages in thread
* 32-bit dma allocations on 64-bit platforms
@ 2004-06-23 18:35 Terence Ripperda
2004-06-23 19:19 ` Jeff Garzik
2004-06-26 5:02 ` David Mosberger
0 siblings, 2 replies; 70+ messages in thread
From: Terence Ripperda @ 2004-06-23 18:35 UTC (permalink / raw)
To: Linux Kernel Mailing List; +Cc: Terence Ripperda
[-- Attachment #1: Type: text/plain, Size: 3008 bytes --]
I'm working on cleaning up some of our dma allocation code to properly allocate 32-bit physical pages for dma on 64-bit platforms. I think our first pass at supporting em64t is sub-par. I'd like to fix that by using the correct kernel interfaces.
From our early efforts in supporting AMD's x86_64, we've used the pci_map_sg/pci_map_single interface for remapping > 32-bit physical addresses through the system's IOMMU. Since Intel's em64t does not provide an IOMMU, the kernel falls back to a swiotlb to implement these interfaces. For our first pass at supporting em64t, we tried to work with the swiotlb, but this works very poorly.
We should have gone back and reviewed how we use the kernel interfaces and followed DMA-API.txt and DMA-mapping.txt. We're now working on using these interfaces (mainly pci_alloc_consistent). but we're still running into some general shortcomings of these interfaces. the main problem is the ability to allocate enough 32-bit addressable memory.
the physical addressing of memory allocations seems to boil down to the behavior of GFP_DMA and GFP_NORMAL. but there seems to be some disconnect between what these mean for each architecture and what various drivers expect them to mean.
based on each architecture's paging_init routines, the zones look like this:
              x86:         ia64:    x86_64:
ZONE_DMA:     < 16M        < ~4G    < 16M
ZONE_NORMAL:  16M - ~1G    > ~4G    > 16M
ZONE_HIGHMEM: 1G+
an example of this disconnect is vmalloc_32. this function is obviously intended to allocate 32-bit addresses (this was specifically mentioned in a comment in 2.4.x header files). but vmalloc_32 is an inline routine that calls __vmalloc(GFP_KERNEL). based on the above zone descriptions, this will do the correct thing for x86, but not for ia64 or x86_64. on ia64, a driver could just use GFP_DMA for the desired behavior, but this doesn't work well for x86_64.
AMD's x86_64 provides remapping of > 32-bit pages through the iommu, but obviously Intel's em64t provides no such ability. based on the above zonings, this leaves us with the options of either relying on the swiotlb interfaces for dma, or relying on the isa memory for dma.
the last day or two, I've been experimenting with using the pci_alloc_consistent interface, which uses the latter (note attached patch to fix an apparent memory leak in the x86_64 pci_alloc_consistent). unfortunately, this provides very little dma-able memory. In theory, up to 16 Megs, but in practice I'm only getting about 5 1/2 Megs.
I was rather surprised by these limitations on allocating 32-bit addresses. I checked through the dri and bttv drivers, to see if they had dealt with these issues, and they did not appear to have done so. has anyone tested these drivers on ia64/x86_64/em64t platforms w/ 4+ Gigs of memory?
are these limitations on allocating 32-bit addresses intentional and known? is there anything we can do to help improve this situation? help work on development?
Thanks,
Terence
[-- Attachment #2: pci-gart.patch --]
[-- Type: text/plain, Size: 330 bytes --]
--- pci-gart.c.old	2004-06-21 18:33:29.000000000 -0500
+++ pci-gart.c.new	2004-06-21 18:33:57.000000000 -0500
@@ -211,6 +211,7 @@
 	if (no_iommu || dma_mask < 0xffffffffUL) {
 		if (high) {
 			if (!(gfp & GFP_DMA)) {
+				free_pages((unsigned long)memory, get_order(size));
 				gfp |= GFP_DMA;
 				goto again;
 			}
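The leak the patch above fixes is in the retry path: when the first allocation lands above the device's DMA mask, the code jumps back to allocate again with GFP_DMA but never frees the unusable first allocation. A toy model of that control flow, with a stub allocator standing in for the real page allocator (all names here are invented for illustration, not kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Counter of outstanding allocations; with the fix applied, at most
 * one allocation is live when alloc_consistent_fixed() returns. */
int live_allocs;

void *stub_alloc(int from_dma_zone)
{
    static char dma_page, high_page;      /* stand-ins for real pages */
    live_allocs++;
    return from_dma_zone ? (void *)&dma_page : (void *)&high_page;
}

void stub_free(void *p)
{
    (void)p;
    live_allocs--;
}

/* Models the patched retry path: if the first try comes back "high"
 * (above the DMA mask), free it before retrying from the DMA zone. */
void *alloc_consistent_fixed(int first_try_is_high)
{
    void *memory = stub_alloc(0);         /* first try: any zone        */
    if (first_try_is_high) {
        stub_free(memory);                /* the line the patch adds    */
        memory = stub_alloc(1);           /* "goto again" with GFP_DMA  */
    }
    return memory;
}
```

Without the `stub_free` call, every retry would strand one allocation, which is the apparent leak Terence's patch addresses.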
^ permalink raw reply [flat|nested] 70+ messages in thread

* Re: 32-bit dma allocations on 64-bit platforms 2004-06-23 18:35 Terence Ripperda @ 2004-06-23 19:19 ` Jeff Garzik 2004-06-26 5:05 ` David Mosberger 2004-06-26 5:02 ` David Mosberger 1 sibling, 1 reply; 70+ messages in thread
From: Jeff Garzik @ 2004-06-23 19:19 UTC (permalink / raw)
To: Terence Ripperda; +Cc: Linux Kernel Mailing List

Terence Ripperda wrote:

Fix your word wrap.

> I'm working on cleaning up some of our dma allocation code to properly allocate 32-bit physical pages for dma on 64-bit platforms. I think our first pass at supporting em64t is sub-par. I'd like to fix that by using the correct kernel interfaces.
>
> From our early efforts in supporting AMD's x86_64, we've used the pci_map_sg/pci_map_single interface for remapping > 32-bit physical addresses through the system's IOMMU. Since Intel's em64t does not provide an IOMMU, the kernel falls back to a swiotlb to implement these interfaces. For our first pass at supporting em64t, we tried to work with the swiotlb, but this works very poorly.

swiotlb was a dumb idea when it hit ia64, and it's now been propagated to x86-64 :(

> We should have gone back and reviewed how we use the kernel interfaces and followed DMA-API.txt and DMA-mapping.txt. We're now working on using these interfaces (mainly pci_alloc_consistent). but we're still running into some general shortcomings of these interfaces. the main problem is the ability to allocate enough 32-bit addressable memory.
>
> the physical addressing of memory allocations seems to boil down to the behavior of GFP_DMA and GFP_NORMAL. but there seems to be some disconnect between what these mean for each architecture and what various drivers expect them to mean.
>
> based on each architecture's paging_init routines, the zones look like this:
>
>               x86:         ia64:    x86_64:
> ZONE_DMA:     < 16M        < ~4G    < 16M
> ZONE_NORMAL:  16M - ~1G    > ~4G    > 16M
> ZONE_HIGHMEM: 1G+
>
> an example of this disconnect is vmalloc_32. this function is obviously intended to allocate 32-bit addresses (this was specifically mentioned in a comment in 2.4.x header files). but vmalloc_32 is an inline routine that calls __vmalloc(GFP_KERNEL). based on the above zone descriptions, this will do the correct thing for x86, but not for ia64 or x86_64. on ia64, a driver could just use GFP_DMA for the desired behavior, but this doesn't work well for x86_64.
>
> AMD's x86_64 provides remapping of > 32-bit pages through the iommu, but obviously Intel's em64t provides no such ability. based on the above zonings, this leaves us with the options of either relying on the swiotlb interfaces for dma, or relying on the isa memory for dma.

FWIW, note that there are two main considerations: higher-level layers (block, net) provide bounce buffers when needed, as you don't want to do that purely with the iommu. Once you have bounce buffers properly allocated by <something> (swiotlb? special DRM bounce buffer allocator?), you then pci_map the bounce buffers.

> the last day or two, I've been experimenting with using the pci_alloc_consistent interface, which uses the latter (note attached patch to fix an apparent memory leak in the x86_64 pci_alloc_consistent). unfortunately, this provides very little dma-able memory. In theory, up to 16 Megs, but in practice I'm only getting about 5 1/2 Megs.
>
> I was rather surprised by these limitations on allocating 32-bit addresses. I checked through the dri and bttv drivers, to see if they had dealt with these issues, and they did not appear to have done so. has anyone tested these drivers on ia64/x86_64/em64t platforms w/ 4+ Gigs of memory?
>
> are these limitations on allocating 32-bit addresses intentional and known? is there anything we can do to help improve this situation? help work on development?

Sounds like you're not setting the PCI DMA mask properly, or perhaps passing NULL rather than a struct pci_dev to the PCI DMA API?
Jeff ^ permalink raw reply [flat|nested] 70+ messages in thread
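Jeff's point about the DMA mask can be illustrated with the usual negotiation pattern: try the widest mask first, then fall back to 32-bit. In a real driver this is done by calling pci_set_dma_mask() against a real struct pci_dev; the sketch below is a self-contained model where `negotiate_mask`, `device_max`, and `platform_max` are invented stand-ins for the device's and platform's addressing limits.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_64BIT_MASK 0xffffffffffffffffULL
#define DMA_32BIT_MASK 0x00000000ffffffffULL

/* Return the widest DMA mask both the device and the platform accept,
 * or 0 if neither 64-bit nor 32-bit addressing works, meaning the
 * driver would need bounce buffering (swiotlb) or GFP_DMA memory. */
uint64_t negotiate_mask(uint64_t device_max, uint64_t platform_max)
{
    if (device_max >= DMA_64BIT_MASK && platform_max >= DMA_64BIT_MASK)
        return DMA_64BIT_MASK;           /* full 64-bit addressing: no bouncing */
    if (device_max >= DMA_32BIT_MASK && platform_max >= DMA_32BIT_MASK)
        return DMA_32BIT_MASK;           /* fall back to 32-bit addressing      */
    return 0;                            /* e.g. a 29-bit sound chipset         */
}
```

A driver that skips this negotiation (or passes a NULL device, as Jeff suspects) gets the default conservative behavior, which on these platforms means being funneled into the tiny 16MB zone.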
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-23 19:19 ` Jeff Garzik @ 2004-06-26 5:05 ` David Mosberger 2004-06-26 7:16 ` Arjan van de Ven 0 siblings, 1 reply; 70+ messages in thread From: David Mosberger @ 2004-06-26 5:05 UTC (permalink / raw) To: Jeff Garzik; +Cc: Terence Ripperda, Linux Kernel Mailing List >>>>> On Wed, 23 Jun 2004 15:19:22 -0400, Jeff Garzik <jgarzik@pobox.com> said: Jeff> swiotlb was a dumb idea when it hit ia64, and it's now been propagated Jeff> to x86-64 :( If it's such a dumb idea, why not submit a better solution? --david ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-26 5:05 ` David Mosberger @ 2004-06-26 7:16 ` Arjan van de Ven 2004-06-29 6:13 ` David Mosberger 0 siblings, 1 reply; 70+ messages in thread From: Arjan van de Ven @ 2004-06-26 7:16 UTC (permalink / raw) To: davidm; +Cc: Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List [-- Attachment #1: Type: text/plain, Size: 462 bytes --] On Sat, 2004-06-26 at 07:05, David Mosberger wrote: > >>>>> On Wed, 23 Jun 2004 15:19:22 -0400, Jeff Garzik <jgarzik@pobox.com> said: > > Jeff> swiotlb was a dumb idea when it hit ia64, and it's now been propagated > Jeff> to x86-64 :( > > If it's such a dumb idea, why not submit a better solution? the real solution is an iommu of course, but the highmem solution has quite some merit too..... I know you disagree with me on that one though. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-26 7:16 ` Arjan van de Ven @ 2004-06-29 6:13 ` David Mosberger 2004-06-29 6:55 ` Arjan van de Ven 2004-06-30 8:00 ` Jes Sorensen 0 siblings, 2 replies; 70+ messages in thread From: David Mosberger @ 2004-06-29 6:13 UTC (permalink / raw) To: arjanv; +Cc: davidm, Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List >>>>> On Sat, 26 Jun 2004 09:16:27 +0200, Arjan van de Ven <arjanv@redhat.com> said: Arjan> the real solution is an iommu of course, but the highmem Arjan> solution has quite some merit too..... I know you disagree Arjan> with me on that one though. Yes, some merits and some faults. The real solution is iommu or 64-bit capable devices. Interesting that graphics controllers should be last to get 64-bit DMA capability, considering how much more complex they are than disk controllers or NICs. --david ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-29 6:13 ` David Mosberger @ 2004-06-29 6:55 ` Arjan van de Ven 2004-06-30 8:00 ` Jes Sorensen 1 sibling, 0 replies; 70+ messages in thread From: Arjan van de Ven @ 2004-06-29 6:55 UTC (permalink / raw) To: davidm; +Cc: Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List [-- Attachment #1: Type: text/plain, Size: 661 bytes --] On Mon, Jun 28, 2004 at 11:13:12PM -0700, David Mosberger wrote: > >>>>> On Sat, 26 Jun 2004 09:16:27 +0200, Arjan van de Ven <arjanv@redhat.com> said: > > Arjan> the real solution is an iommu of course, but the highmem > Arjan> solution has quite some merit too..... I know you disagree > Arjan> with me on that one though. > > Yes, some merits and some faults. The real solution is iommu or > 64-bit capable devices. Interesting that graphics controllers should > be last to get 64-bit DMA capability, considering how much more > complex they are than disk controllers or NICs. I guess the first game with more than 4Gb in textures will fix it ;) [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-29 6:13 ` David Mosberger 2004-06-29 6:55 ` Arjan van de Ven @ 2004-06-30 8:00 ` Jes Sorensen 1 sibling, 0 replies; 70+ messages in thread From: Jes Sorensen @ 2004-06-30 8:00 UTC (permalink / raw) To: davidm; +Cc: arjanv, Jeff Garzik, Terence Ripperda, Linux Kernel Mailing List >>>>> "David" == David Mosberger <davidm@napali.hpl.hp.com> writes: >>>>> On Sat, 26 Jun 2004 09:16:27 +0200, Arjan van de Ven <arjanv@redhat.com> said: Arjan> the real solution is an iommu of course, but the highmem Arjan> solution has quite some merit too..... I know you disagree with Arjan> me on that one though. David> Yes, some merits and some faults. The real solution is iommu David> or 64-bit capable devices. Interesting that graphics David> controllers should be last to get 64-bit DMA capability, David> considering how much more complex they are than disk David> controllers or NICs. You found a 64 bit capable sound card yet? ;-) Cheers, Jes ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: 32-bit dma allocations on 64-bit platforms 2004-06-23 18:35 Terence Ripperda 2004-06-23 19:19 ` Jeff Garzik @ 2004-06-26 5:02 ` David Mosberger 1 sibling, 0 replies; 70+ messages in thread From: David Mosberger @ 2004-06-26 5:02 UTC (permalink / raw) To: Terence Ripperda; +Cc: Linux Kernel Mailing List Terence, >>>>> On Wed, 23 Jun 2004 13:35:35 -0500, Terence Ripperda <tripperda@nvidia.com> said: Terence> based on each architecture's paging_init routines, the Terence> zones look like this: Terence> x86: ia64: x86_64: Terence> ZONE_DMA: < 16M < ~4G < 16M Terence> ZONE_NORMAL: 16M - ~1G > ~4G > 16M Terence> ZONE_HIMEM: 1G+ Not that it matters here, but for correctness let me note that the ia64 column is correct only for machines which don't have an I/O MMU. With I/O MMU, ZONE_DMA will have the same coverage as ZONE_NORMAL with a recent enough kernel (older kernels had a bug which limited ZONE_DMA to < 4GB, but that was unintentional). --david ^ permalink raw reply [flat|nested] 70+ messages in thread
end of thread, other threads:[~2004-06-30 8:12 UTC | newest]
Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <m3acyu6pwd.fsf@averell.firstfloor.org>
[not found] ` <20040623213643.GB32456@hygelac>
2004-06-23 23:46 ` 32-bit dma allocations on 64-bit platforms Andi Kleen
2004-06-24 11:13 ` Takashi Iwai
2004-06-24 11:29 ` [discuss] " Andi Kleen
2004-06-24 14:36 ` Takashi Iwai
2004-06-24 14:42 ` Andi Kleen
2004-06-24 14:58 ` Takashi Iwai
2004-06-24 15:29 ` Andrea Arcangeli
2004-06-24 15:48 ` Nick Piggin
2004-06-24 16:52 ` Andrea Arcangeli
2004-06-24 16:56 ` William Lee Irwin III
2004-06-24 17:32 ` Andrea Arcangeli
2004-06-24 17:38 ` William Lee Irwin III
2004-06-24 18:02 ` Andrea Arcangeli
2004-06-24 18:13 ` William Lee Irwin III
2004-06-24 18:27 ` Andrea Arcangeli
2004-06-24 18:50 ` William Lee Irwin III
2004-06-24 21:54 ` Andrew Morton
2004-06-24 22:08 ` William Lee Irwin III
2004-06-24 22:45 ` Andrea Arcangeli
2004-06-24 22:51 ` William Lee Irwin III
2004-06-24 23:09 ` Andrew Morton
2004-06-24 23:15 ` William Lee Irwin III
2004-06-25 6:16 ` William Lee Irwin III
2004-06-25 2:39 ` Andrea Arcangeli
2004-06-25 2:47 ` Andrew Morton
2004-06-25 3:19 ` Andrea Arcangeli
2004-06-24 22:11 ` Andrew Morton
2004-06-24 23:09 ` Andrea Arcangeli
2004-06-25 1:17 ` Nick Piggin
2004-06-25 3:11 ` Andrea Arcangeli
2004-06-24 22:21 ` Andrea Arcangeli
2004-06-24 22:36 ` Andrew Morton
2004-06-24 23:15 ` Andrea Arcangeli
2004-06-24 22:37 ` William Lee Irwin III
2004-06-24 22:40 ` William Lee Irwin III
2004-06-24 23:21 ` Andrea Arcangeli
2004-06-24 23:45 ` William Lee Irwin III
2004-06-24 17:39 ` Andrea Arcangeli
2004-06-24 17:53 ` William Lee Irwin III
2004-06-24 18:07 ` Andrea Arcangeli
2004-06-24 18:29 ` William Lee Irwin III
2004-06-24 16:04 ` Takashi Iwai
2004-06-24 17:16 ` Andrea Arcangeli
2004-06-24 18:33 ` Takashi Iwai
2004-06-24 18:44 ` Andrea Arcangeli
2004-06-25 15:50 ` Takashi Iwai
2004-06-25 17:30 ` Andrea Arcangeli
2004-06-25 17:39 ` Takashi Iwai
2004-06-25 17:45 ` Andrea Arcangeli
2004-06-24 14:45 ` Terence Ripperda
2004-06-24 15:41 ` Andrea Arcangeli
2004-06-24 15:44 ` Terence Ripperda
2004-06-24 16:15 ` [discuss] " Andi Kleen
2004-06-24 17:22 ` Andrea Arcangeli
2004-06-24 22:28 ` Terence Ripperda
2004-06-24 18:51 ` Andi Kleen
2004-06-26 4:58 ` David Mosberger
[not found] <2akPm-16l-65@gated-at.bofh.it>
2004-06-23 21:46 ` Andi Kleen
2004-06-24 6:18 ` Arjan van de Ven
2004-06-24 10:33 ` Andi Kleen
2004-06-24 13:48 ` Jesse Barnes
2004-06-24 14:39 ` Terence Ripperda
2004-06-23 18:35 Terence Ripperda
2004-06-23 19:19 ` Jeff Garzik
2004-06-26 5:05 ` David Mosberger
2004-06-26 7:16 ` Arjan van de Ven
2004-06-29 6:13 ` David Mosberger
2004-06-29 6:55 ` Arjan van de Ven
2004-06-30 8:00 ` Jes Sorensen
2004-06-26 5:02 ` David Mosberger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox