Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

public inbox for linux-kernel@vger.kernel.org

* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
       [not found] <20060205163618.GB21972@in.ibm.com>
@ 2006-02-05 17:03 ` Andi Kleen
  2006-02-06 16:11   ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2006-02-05 17:03 UTC (permalink / raw)
  To: discuss, bharata; +Cc: linux-kernel, Christoph Lameter

On Sunday 05 February 2006 17:36, Bharata B Rao wrote:
> Hi,
> 
> I am seeing a kernel crash with 2.6.16-rc1 and rc2 but not on any
> 2.6.15 kernels (rc and 2.6.15.2). Arch is x86_64.
> 
> The kernel crashes when I run an application which does:
> 	- mmap (0, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS)
> 	- mbind the memory to the 1st node with policy MPOL_BIND
> 	- write to that memory
> 
> The crash time log on 2.6.16-rc2 looks like this:
> 
> Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP:
> <ffffffff801614df>{__rmqueue+63}

There's another report of it. The boot logs seem ok, so I guess
mbind broke somehow. I suppose it's related to the mempolicy changes
that went into 2.6.16-rc1. I'll try to take a look tomorrow if
Christoph doesn't beat it.

OOM with mbind seems to have broken also - it oopses too.

-Andi


* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-05 17:03 ` [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64 Andi Kleen
@ 2006-02-06 16:11   ` Christoph Lameter
  2006-02-06 18:12     ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2006-02-06 16:11 UTC (permalink / raw)
  To: Andi Kleen; +Cc: discuss, bharata, linux-kernel

On Sun, 5 Feb 2006, Andi Kleen wrote:

> > The kernel crashes when I run an application which does:
> > 	- mmap (0, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS)
> > 	- mbind the memory to the 1st node with policy MPOL_BIND
> > 	- write to that memory

Tried the following code on rc1 and rc2 and it worked fine on ia64:

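#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <numaif.h>

int main(int argc, char *argv[])
{
	char *p;
	unsigned long nodes = 0x01;	/* bit 0 set: bind to node 0 only */

	p = mmap(0, 32768, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (mbind(p, 32768, MPOL_BIND, &nodes, 64, 0) != 0)
		perror("mbind");
	p[34] = 89;	/* the write fault triggers the policy-bound allocation */
	return 0;
}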
 


* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-06 16:11   ` Christoph Lameter
@ 2006-02-06 18:12     ` Andi Kleen
  2006-02-06 18:25       ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2006-02-06 18:12 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: discuss, bharata, linux-kernel

On Monday 06 February 2006 17:11, Christoph Lameter wrote:
> On Sun, 5 Feb 2006, Andi Kleen wrote:
> 
> > > The kernel crashes when I run an application which does:
> > > 	- mmap (0, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS)
> > > 	- mbind the memory to the 1st node with policy MPOL_BIND
> > > 	- write to that memory
> 
> Tried the following code on rc1 and rc2 and it worked fine on ia64:

Perhaps it depends on if the node has enough memory free or not?
I assume if the zonelist has some issue but the first entry is ok
it will only cause problems when the allocation has to go off node
(it shouldn't actually go off node with that policy of course,
but with a full free local node that code path is never triggered)

-Andi



* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-06 18:12     ` Andi Kleen
@ 2006-02-06 18:25       ` Christoph Lameter
  2006-02-06 18:31         ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2006-02-06 18:25 UTC (permalink / raw)
  To: Andi Kleen; +Cc: discuss, bharata, linux-kernel

On Mon, 6 Feb 2006, Andi Kleen wrote:

> > Tried the following code on rc1 and rc2 and it worked fine on ia64:
> 
> Perhaps it depends on if the node has enough memory free or not?
> I assume if the zonelist has some issue but the first entry is ok
> it will only cause problems when the allocation has to go off node
> (it shouldn't actually go off node with that policy of course,

If node 0 is exhausted then you have an OOM situation.

> but with a full free local node that code path is never triggered)

Want me to test the OOM path for mbind?



* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-06 18:25       ` Christoph Lameter
@ 2006-02-06 18:31         ` Andi Kleen
  2006-02-06 18:45           ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2006-02-06 18:31 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: discuss, bharata, linux-kernel

On Monday 06 February 2006 19:25, Christoph Lameter wrote:
> On Mon, 6 Feb 2006, Andi Kleen wrote:
> 
> > > Tried the following code on rc1 and rc2 and it worked fine on ia64:
> > 
> > Perhaps it depends on if the node has enough memory free or not?
> > I assume if the zonelist has some issue but the first entry is ok
> > it will only cause problems when the allocation has to go off node
> > (it shouldn't actually go off node with that policy of course,
> 
> If node 0 is exhausted then you have an OOM situation.

No - it could just need to free some cleanable pages first. That's
a long way before going OOM.
 
> > but with a full free local node that code path is never triggered)
> 
> Want me to test the OOM path for mbind?

I already know it oopses - someone else reported that. If you feel
motivated feel free to fix.

-Andi


* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-06 18:31         ` Andi Kleen
@ 2006-02-06 18:45           ` Christoph Lameter
  2006-02-06 18:55             ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2006-02-06 18:45 UTC (permalink / raw)
  To: Andi Kleen; +Cc: discuss, bharata, linux-kernel

On Mon, 6 Feb 2006, Andi Kleen wrote:

> > If node 0 is exhausted then you have an OOM situation.
> 
> No - it could just need to free some cleanable pages first. That's
> a long way before going OOM.

Then node 0 still has memory available. So you suspect zone_reclaim?
  
> > > but with a full free local node that code path is never triggered)
> > 
> > Want me to test the OOM path for mbind?
> I already know it oopses - someone else reported that. If you feel
> motivated feel free to fix.

We also have a minor issue with huge pages. If the pools are exhausted 
then the kernel will terminate the application with Bus Error.



* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-06 18:45           ` Christoph Lameter
@ 2006-02-06 18:55             ` Andi Kleen
  2006-02-06 19:22               ` Christoph Lameter
  2006-02-07  5:59               ` Bharata B Rao
  0 siblings, 2 replies; 29+ messages in thread
From: Andi Kleen @ 2006-02-06 18:55 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: discuss, bharata, linux-kernel

On Monday 06 February 2006 19:45, Christoph Lameter wrote:
> On Mon, 6 Feb 2006, Andi Kleen wrote:
> 
> > > If node 0 is exhausted then you have an OOM situation.
> > 
> > No - it could just need to free some cleanable pages first. That's
> > a long way before going OOM.
> 
> Then node 0 still has memory available. So you suspect zone_reclaim?

Either zone reclaim or the first entry in the zonelist is ok, but it's 
not correctly terminated or something like that so it causes 
problems when the kernel looks for the second (just speculating here,
I don't know if that is the problem).
   
> > > > but with a full free local node that code path is never triggered)
> > > 
> > > Want me to test the OOM path for mbind?
> > I already know it oopses - someone else reported that. If you feel
> > motivated feel free to fix.
> 
> We also have a minor issue with huge pages. If the pools are exhausted 
> then the kernel will terminate the application with Bus Error.

That is what prereservation was supposed to prevent. I remember there 
were endless discussions when this all was originally implemented long
ago (in the version that never got merged).

Basically there were two approaches:
- Do strict overcommit checking at mmap with prereservation (that was
what the old Intel/SGI patch did)

- The hackish way I implemented in SLES9: just check at mmap time 
if there are enough pages but don't prereserve anything. That was 
more of an 80% solution with races, but seemed to fix the problem well enough
that people in the field didn't really complain. The advantage was that 
it was much simpler code.

-Andi



* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-06 18:55             ` Andi Kleen
@ 2006-02-06 19:22               ` Christoph Lameter
  2006-02-07  5:59               ` Bharata B Rao
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2006-02-06 19:22 UTC (permalink / raw)
  To: Andi Kleen; +Cc: discuss, bharata, linux-kernel

On Mon, 6 Feb 2006, Andi Kleen wrote:

> That is what prereservation was supposed to prevent. I remember there 
> were endless discussions when this all was originally implemented long
> ago (in the version that never got merged).

But the reservation does not consider cpusets and memory policies, right?
It surely must fail if one restricts allocation to one node and then we run
out of memory. That was the testcase that showed the Bus Error....


* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-06 18:55             ` Andi Kleen
  2006-02-06 19:22               ` Christoph Lameter
@ 2006-02-07  5:59               ` Bharata B Rao
  2006-02-07 16:49                 ` Christoph Lameter
  1 sibling, 1 reply; 29+ messages in thread
From: Bharata B Rao @ 2006-02-07  5:59 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Christoph Lameter, discuss, linux-kernel

On Mon, Feb 06, 2006 at 07:55:18PM +0100, Andi Kleen wrote:
> On Monday 06 February 2006 19:45, Christoph Lameter wrote:
> > On Mon, 6 Feb 2006, Andi Kleen wrote:
> > 
> > > > If node 0 is exhausted then you have an OOM situation.
> > > 
> > > No - it could just need to free some cleanable pages first. That's
> > > a long way before going OOM.
> > 
> > Then node 0 still has memory available. So you suspect zone_reclaim?
> 
> Either zone reclaim or the first entry in the zonelist is ok, but it's 
> not correctly terminated or something like that so it causes 
> problems when the kernel looks for the second (just speculating here,
> I don't know if that is the problem).
>    

I can still crash my x86_64 box with Christoph's program.

The meminfo in my case looks like this just before I execute the
program.

llm07:~ # cat /sys/devices/system/node/node0/meminfo

Node 0 MemTotal:      3095532 kB
Node 0 MemFree:       2960972 kB
Node 0 MemUsed:        134560 kB
Node 0 Active:          19752 kB
Node 0 Inactive:        14908 kB
Node 0 HighTotal:           0 kB
Node 0 HighFree:            0 kB
Node 0 LowTotal:      3095532 kB
Node 0 LowFree:       2960972 kB
Node 0 Dirty:               0 kB
Node 0 Writeback:         576 kB
Node 0 Mapped:              0 kB
Node 0 Slab:            24200 kB
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
llm07:~ # cat /sys/devices/system/node/node1/meminfo

Node 1 MemTotal:      2002368 kB
Node 1 MemFree:       1964464 kB
Node 1 MemUsed:         37904 kB
Node 1 Active:          10608 kB
Node 1 Inactive:         3056 kB
Node 1 HighTotal:           0 kB
Node 1 HighFree:            0 kB
Node 1 LowTotal:      2002368 kB
Node 1 LowFree:       1964464 kB
Node 1 Dirty:            1164 kB
Node 1 Writeback:           0 kB
Node 1 Mapped:          43064 kB
Node 1 Slab:             9648 kB
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0

I was trying to bind the memory to node 0, which still has enough
free memory.

Not sure if this helps, but I have some more debug data.
While the kernel (2.6.16-rc1) oopses at page_alloc.c, line 556
(list_del(&page->lru)), some of the variables in __rmqueue look like this at the time of the crash:

page = 0xffffffffffffffd8
&page->lru = 0000000000000000
zone = 0xffff81000000e700
zone->name Normal
current_order 0
area->nr_free 0

Regards,
Bharata.


* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-07  5:59               ` Bharata B Rao
@ 2006-02-07 16:49                 ` Christoph Lameter
  2006-02-07 23:27                   ` Ray Bryant
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2006-02-07 16:49 UTC (permalink / raw)
  To: Bharata B Rao; +Cc: Andi Kleen, discuss, linux-kernel

On Tue, 7 Feb 2006, Bharata B Rao wrote:

> I can still crash my x86_64 box with Christoph's program.

So it looks like the problem is arch specific. Test program runs fine on 
ia64.

> page = 0xffffffffffffffd8
> &page->lru = 0000000000000000

Yup lru field overwritten as I thought.


* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-07 16:49                 ` Christoph Lameter
@ 2006-02-07 23:27                   ` Ray Bryant
  2006-02-07 23:36                     ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Ray Bryant @ 2006-02-07 23:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Bharata B Rao, Andi Kleen, discuss, linux-kernel

On Tuesday 07 February 2006 10:49, Christoph Lameter wrote:
> On Tue, 7 Feb 2006, Bharata B Rao wrote:
> > I can still crash my x86_64 box with Christoph's program.
>
> So it looks like the problem is arch specific. Test program runs fine on
> ia64.
>
> > page = 0xffffffffffffffd8
> > &page->lru = 0000000000000000
>
> Yup lru field overwritten as I thought.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

For what it is worth:

Christoph's test program runs fine on my 32 GB, 4 socket, 8 core Opteron 64 
box with 2.6.16-rc1.
-- 
Ray Bryant
AMD Performance Labs                   Austin, Tx
512-602-0038 (o)                 512-507-7807 (c)



* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-07 23:27                   ` Ray Bryant
@ 2006-02-07 23:36                     ` Andi Kleen
  2006-02-08 12:10                       ` Bharata B Rao
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2006-02-07 23:36 UTC (permalink / raw)
  To: Ray Bryant; +Cc: Christoph Lameter, Bharata B Rao, discuss, linux-kernel

On Wednesday 08 February 2006 00:27, Ray Bryant wrote:
> On Tuesday 07 February 2006 10:49, Christoph Lameter wrote:
> > On Tue, 7 Feb 2006, Bharata B Rao wrote:
> > > I can still crash my x86_64 box with Christoph's program.
> >
> > So it looks like the problem is arch specific. Test program runs fine on
> > ia64.
> >
> > > page = 0xffffffffffffffd8
> > > &page->lru = 0000000000000000
> >
> > Yup lru field overwritten as I thought.
> 
> For what it is worth:
> 
> Christoph's test program runs fine on my 32 GB, 4 socket, 8 core Opteron 64 

Opteron 64? A new exciting upcoming product? @)

> box with 2.6.16-rc1.

Yes it also works on my test box and also some other simple tests with MPOL_BIND. 
But we had similar reports on two different systems, so there's very likely a problem.
Just need to reproduce it somehow. 

-Andi


* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-07 23:36                     ` Andi Kleen
@ 2006-02-08 12:10                       ` Bharata B Rao
  2006-02-08 15:42                         ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Bharata B Rao @ 2006-02-08 12:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ray Bryant, Christoph Lameter, discuss, linux-kernel

On Wed, Feb 08, 2006 at 12:36:30AM +0100, Andi Kleen wrote:
> On Wednesday 08 February 2006 00:27, Ray Bryant wrote:
> > On Tuesday 07 February 2006 10:49, Christoph Lameter wrote:
> > > On Tue, 7 Feb 2006, Bharata B Rao wrote:
> > > > I can still crash my x86_64 box with Christoph's program.
> > >
> > > So it looks like the problem is arch specific. Test program runs fine on
> > > ia64.
> > >
> > > > page = 0xffffffffffffffd8
> > > > &page->lru = 0000000000000000
> > >
> > > Yup lru field overwritten as I thought.
> > 
> > For what it is worth:
> > 
> > Christoph's test program runs fine on my 32 GB, 4 socket, 8 core Opteron 64 
> 
> Opteron 64? A new exciting upcoming product? @)
> 
> > box with 2.6.16-rc1.
> 
> Yes it also works on my test box and also some other simple tests with MPOL_BIND. 
> But we had similar reports on two different systems, so there's very likely a problem.
> Just need to reproduce it somehow. 
> 

I believe I understand why I am seeing this problem with my setup.

The zones in my machine look like this:

On node 0 totalpages: 773791
  DMA zone: 2151 pages, LIFO batch:0
  DMA32 zone: 771640 pages, LIFO batch:31
  Normal zone: 0 pages, LIFO batch:0
  HighMem zone: 0 pages, LIFO batch:0
On node 1 totalpages: 500592
  DMA zone: 0 pages, LIFO batch:0
  DMA32 zone: 242032 pages, LIFO batch:31
  Normal zone: 258560 pages, LIFO batch:31
  HighMem zone: 0 pages, LIFO batch:0

So it can be seen that the node 0 has only DMA and DMA32 zones while
node 1 has only DMA32 and Normal zones.

The current mempolicy code assumes that the highest zone(policy_zone) that
comes under the memory policy is valid (by which I mean zone->present_pages
is non-zero) for all nodes, which is not true in my case. In this case
the policy_zone gets set to ZONE_NORMAL (highest zone here). 

When mbind'ing to node 0, bind_zonelist()(and subsequent functions) binds
the ZONE_NORMAL zone to vma->vm_policy. During the write fault, the allocator
is asked to allocate from a non-existent ZONE_NORMAL zone for node 0. This
I believe is causing the oops I am seeing. It is still not clear to me
why the allocator doesn't gracefully fail allocations from a zone which has
zone->present_pages == 0.

This whole problem wasn't seen on 2.6.15.2 because, bind_zonelist()
actually makes sure that the zone it is binding to has a non-zero
zone->present_pages.

Regards,
Bharata.


* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-08 12:10                       ` Bharata B Rao
@ 2006-02-08 15:42                         ` Christoph Lameter
  2006-02-08 15:45                           ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2006-02-08 15:42 UTC (permalink / raw)
  To: Bharata B Rao; +Cc: Andi Kleen, Ray Bryant, discuss, linux-kernel

On Wed, 8 Feb 2006, Bharata B Rao wrote:

> The zones in my machine look like this:
> 
> On node 0 totalpages: 773791
>   DMA zone: 2151 pages, LIFO batch:0
>   DMA32 zone: 771640 pages, LIFO batch:31
>   Normal zone: 0 pages, LIFO batch:0
>   HighMem zone: 0 pages, LIFO batch:0
> On node 1 totalpages: 500592
>   DMA zone: 0 pages, LIFO batch:0
>   DMA32 zone: 242032 pages, LIFO batch:31
>   Normal zone: 258560 pages, LIFO batch:31
>   HighMem zone: 0 pages, LIFO batch:0
> 
> So it can be seen that the node 0 has only DMA and DMA32 zones while
> node 1 has only DMA32 and Normal zones.

Uhh... That's a rather asymmetric arrangement.
 
> The current mempolicy code assumes that the highest zone(policy_zone) that
> comes under the memory policy is valid (by which I mean zone->present_pages
> is non-zero) for all nodes, which is not true in my case. In this case
> the policy_zone gets set to ZONE_NORMAL (highest zone here). 

Right.

> When mbind'ing to node 0, bind_zonelist()(and subsequent functions) binds
> the ZONE_NORMAL zone to vma->vm_policy. During the write fault, the allocator
> is asked to allocate from a non-existent ZONE_NORMAL zone for node 0. This
> I believe is causing the oops I am seeing. It is still not clear to me
> why the allocator doesn't gracefully fail allocations from a zone which has
> zone->present_pages == 0.

Hmm....
 
> This whole problem wasn't seen on 2.6.15.2 because, bind_zonelist()
> actually makes sure that the zone it is binding to has a non-zero
> zone->present_pages.

Correct, there was a loop in bind_zonelist that I moved to the zone
initialization to simplify it.

However, this has implications for policy_zone. This variable should store
the zone that policies apply to. However, in your case this zone will vary 
which may lead to all sorts of weird behavior even if we fix 
bind_zonelist. To which zone does policy apply? ZONE_NORMAL or ZONE_DMA32?


* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-08 15:42                         ` Christoph Lameter
@ 2006-02-08 15:45                           ` Andi Kleen
  2006-02-08 15:59                             ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2006-02-08 15:45 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Bharata B Rao, Ray Bryant, discuss, linux-kernel

On Wednesday 08 February 2006 16:42, Christoph Lameter wrote:

> However, this has implications for policy_zone. This variable should store
> the zone that policies apply to. However, in your case this zone will vary 
> which may lead to all sorts of weird behavior even if we fix 
> bind_zonelist. To which zone does policy apply? ZONE_NORMAL or ZONE_DMA32?

It really needs to apply to both (currently you can't police 4GB of your 
memory on x86-64) But I haven't worked out a good design how to implement it yet.

-Andi


> 


* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-08 15:45                           ` Andi Kleen
@ 2006-02-08 15:59                             ` Christoph Lameter
  2006-02-08 16:06                               ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2006-02-08 15:59 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Bharata B Rao, Ray Bryant, discuss, linux-kernel

On Wed, 8 Feb 2006, Andi Kleen wrote:

> On Wednesday 08 February 2006 16:42, Christoph Lameter wrote:
> 
> > However, this has implications for policy_zone. This variable should store
> > the zone that policies apply to. However, in your case this zone will vary 
> > which may lead to all sorts of weird behavior even if we fix 
> > bind_zonelist. To which zone does policy apply? ZONE_NORMAL or ZONE_DMA32?
> 
> It really needs to apply to both (currently you can't police 4GB of your 
> memory on x86-64) But I haven't worked out a good design how to implement it yet.

So a provisional solution would be to simply ignore empty zones in 
bind_zonelist? Or fall back to earlier zones (which includes unpolicied 
zones in the bind zone list?)

Index: linux-2.6.16-rc2/mm/mempolicy.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/mempolicy.c	2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/mempolicy.c	2006-02-08 07:55:29.000000000 -0800
@@ -143,8 +143,12 @@ static struct zonelist *bind_zonelist(no
 	if (!zl)
 		return NULL;
 	num = 0;
-	for_each_node_mask(nd, *nodes)
-		zl->zones[num++] = &NODE_DATA(nd)->node_zones[policy_zone];
+	for_each_node_mask(nd, *nodes) {
+		struct zone *zone = &NODE_DATA(nd)->node_zones[policy_zone];
+
+		if (zone->present_pages)
+			zl->zones[num++] = &NODE_DATA(nd)->node_zones[policy_zone];
+	}
 	zl->zones[num] = NULL;
 	return zl;
 }


* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-08 15:59                             ` Christoph Lameter
@ 2006-02-08 16:06                               ` Andi Kleen
  2006-02-08 16:20                                 ` Christoph Lameter
  2006-02-09  4:39                                 ` Bharata B Rao
  0 siblings, 2 replies; 29+ messages in thread
From: Andi Kleen @ 2006-02-08 16:06 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Bharata B Rao, Ray Bryant, discuss, linux-kernel

On Wednesday 08 February 2006 16:59, Christoph Lameter wrote:
> On Wed, 8 Feb 2006, Andi Kleen wrote:
> 
> > On Wednesday 08 February 2006 16:42, Christoph Lameter wrote:
> > 
> > > However, this has implications for policy_zone. This variable should store
> > > the zone that policies apply to. However, in your case this zone will vary 
> > > which may lead to all sorts of weird behavior even if we fix 
> > > bind_zonelist. To which zone does policy apply? ZONE_NORMAL or ZONE_DMA32?
> > 
> > It really needs to apply to both (currently you can't police 4GB of your 
> > memory on x86-64) But I haven't worked out a good design how to implement it yet.
> 
> So a provisional solution would be to simply ignore empty zones in 
> bind_zonelist?

That would likely prevent the crash yes (Bharata can you test?)

But of course it still has the problem of a lot of memory being unpolicied
on machines with >4GB if there's both DMA32 and NORMAL.

> Or fall back to earlier zones (which includes unpolicied  
> zones in the bind zone list?)

Or that.

Thanks,
-Andi

 


* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-08 16:06                               ` Andi Kleen
@ 2006-02-08 16:20                                 ` Christoph Lameter
  2006-02-08 16:27                                   ` Andi Kleen
  2006-02-09  4:39                                 ` Bharata B Rao
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2006-02-08 16:20 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Bharata B Rao, Ray Bryant, discuss, linux-kernel

On Wed, 8 Feb 2006, Andi Kleen wrote:

> > So a provisional solution would be to simply ignore empty zones in 
> > bind_zonelist?
> 
> That would likely prevent the crash yes (Bharata can you test?)
> 
> But of course it still has the problem of a lot of memory being unpolicied
> on machines with >4GB if there's both DMA32 and NORMAL.

The fix could result in a zonelist with no zones. So we can answer one 
question in __alloc_pages().

Index: linux-2.6.16-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/page_alloc.c	2006-02-08 00:05:09.000000000 -0800
+++ linux-2.6.16-rc2/mm/page_alloc.c	2006-02-08 08:18:59.000000000 -0800
@@ -913,7 +913,7 @@ restart:
 	z = zonelist->zones;  /* the list of zones suitable for gfp_mask */
 
 	if (unlikely(*z == NULL)) {
-		/* Should this ever happen?? */
+		/* May occur if MPOL_BIND results in an empty zonelist */
 		return NULL;
 	}
 


* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-08 16:20                                 ` Christoph Lameter
@ 2006-02-08 16:27                                   ` Andi Kleen
  2006-02-08 16:51                                     ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2006-02-08 16:27 UTC (permalink / raw)
  To: discuss; +Cc: Christoph Lameter, Bharata B Rao, Ray Bryant, linux-kernel

On Wednesday 08 February 2006 17:20, Christoph Lameter wrote:
> On Wed, 8 Feb 2006, Andi Kleen wrote:
> 
> > > So a provisional solution would be to simply ignore empty zones in 
> > > bind_zonelist?
> > 
> > That would likely prevent the crash yes (Bharata can you test?)
> > 
> > But of course it still has the problem of a lot of memory being unpolicied
> > on machines with >4GB if there's both DMA32 and NORMAL.
> 
> The fix could result in a zonelist with no zones. So we can answer one 
> question in __alloc_pages().

I don't think it can happen - at least one zone <= policy_zone has to
have memory otherwise the machine wouldn't work at all.

-Andi


* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-08 16:27                                   ` Andi Kleen
@ 2006-02-08 16:51                                     ` Christoph Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2006-02-08 16:51 UTC (permalink / raw)
  To: Andi Kleen; +Cc: discuss, Bharata B Rao, Ray Bryant, linux-kernel

On Wed, 8 Feb 2006, Andi Kleen wrote:

> > The fix could result in a zonelist with no zones. So we can answer one 
> > question in __alloc_pages().
> 
> I don't think it can happen - at least one zone <= policy_zone has to
> have memory otherwise the machine wouldn't work at all.

One could bind to a nodeset that contains a single node. If that node has 
no memory in the policy zone then the zonelist generated by 
bind_zonelist will be empty.



* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-08 16:06                               ` Andi Kleen
  2006-02-08 16:20                                 ` Christoph Lameter
@ 2006-02-09  4:39                                 ` Bharata B Rao
  2006-02-09  9:58                                   ` Andi Kleen
  1 sibling, 1 reply; 29+ messages in thread
From: Bharata B Rao @ 2006-02-09  4:39 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Christoph Lameter, Ray Bryant, discuss, linux-kernel

On Wed, Feb 08, 2006 at 05:06:26PM +0100, Andi Kleen wrote:
> On Wednesday 08 February 2006 16:59, Christoph Lameter wrote:
> > On Wed, 8 Feb 2006, Andi Kleen wrote:
> > 
> > > On Wednesday 08 February 2006 16:42, Christoph Lameter wrote:
> > > 
> > > > However, this has implications for policy_zone. This variable should store
> > > > the zone that policies apply to. However, in your case this zone will vary 
> > > > which may lead to all sorts of weird behavior even if we fix 
> > > > bind_zonelist. To which zone does policy apply? ZONE_NORMAL or ZONE_DMA32?
> > > 
> > > It really needs to apply to both (currently you can't police 4GB of your 
> > > memory on x86-64) But I haven't worked out a good design how to implement it yet.
> > 
> > So a provisional solution would be to simply ignore empty zones in 
> > bind_zonelist?
> 
> That would likely prevent the crash yes (Bharata can you test?)

With this solution, the kernel doesn't crash, but the application does.

Shouldn't we fail mbind if we can't bind any zones?
Something like this...


Signed-off-by: Bharata B Rao <bharata@in.ibm.com>

--- linux-2.6.16-rc2/mm/mempolicy.c.orig	2006-02-09 01:34:37.000000000 -0800
+++ linux-2.6.16-rc2/mm/mempolicy.c	2006-02-09 01:39:32.000000000 -0800
@@ -143,8 +143,18 @@
 	if (!zl)
 		return NULL;
 	num = 0;
-	for_each_node_mask(nd, *nodes)
-		zl->zones[num++] = &NODE_DATA(nd)->node_zones[policy_zone];
+	for_each_node_mask(nd, *nodes) {
+		struct zone *zone = &NODE_DATA(nd)->node_zones[policy_zone];
+
+		if (zone->present_pages)
+			zl->zones[num++] = zone;
+	}
+
+	if (!num) {
+		/* failed to bind even a single zone */
+		kfree(zl);
+		return NULL;
+	}
 	zl->zones[num] = NULL;
 	return zl;
 }
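
For readers without a kernel tree at hand, the behaviour of the patched loop above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: the `model_*` names, the two-node / three-zone layout and the plain bitmask standing in for `nodemask_t` are all hypothetical and are not kernel API. It shows the two outcomes discussed here: empty policy zones are skipped, and a bind that ends up with no zones at all returns NULL so the caller can fail.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_NR_NODES 2	/* hypothetical: a two-node box as in this thread */
#define MODEL_NR_ZONES 3	/* hypothetical: DMA, DMA32, NORMAL */

struct model_zone { unsigned long present_pages; };
struct model_node { struct model_zone node_zones[MODEL_NR_ZONES]; };

/*
 * Collect the policy zone of every node set in nodemask, skipping zones
 * that have no pages. The result is a NULL-terminated array; NULL is
 * returned on allocation failure or when no zone could be bound.
 */
static struct model_zone **model_bind_zonelist(struct model_node *nodes,
					       unsigned int nodemask,
					       int policy_zone)
{
	struct model_zone **zl;
	int nd, num = 0;

	zl = calloc(MODEL_NR_NODES + 1, sizeof(*zl));
	if (!zl)
		return NULL;

	for (nd = 0; nd < MODEL_NR_NODES; nd++) {
		struct model_zone *zone = &nodes[nd].node_zones[policy_zone];

		if (!(nodemask & (1u << nd)))
			continue;	/* node not in the requested set */
		if (zone->present_pages)
			zl[num++] = zone;
	}

	if (!num) {
		/* failed to bind even a single zone */
		free(zl);
		return NULL;
	}
	zl[num] = NULL;
	return zl;
}

/* Usage: node 0 has no NORMAL memory, node 1 does. */
static void model_bind_selfcheck(void)
{
	struct model_node nodes[MODEL_NR_NODES] = {
		{ .node_zones = { {16}, {1024}, {0} } },
		{ .node_zones = { {0}, {0}, {2048} } },
	};
	struct model_zone **zl;

	/* Binding to node 0 alone leaves nothing to bind. */
	assert(model_bind_zonelist(nodes, 0x1, 2) == NULL);

	/* Binding to both nodes keeps only node 1's NORMAL zone. */
	zl = model_bind_zonelist(nodes, 0x3, 2);
	assert(zl && zl[0] == &nodes[1].node_zones[2] && zl[1] == NULL);
	free(zl);
}
```

In this model, binding to node 0 alone is the case where the caller would see NULL and could return an error instead of installing a policy that can never be satisfied.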

> 
> But of course it still has the problem of a lot of memory being unpolicied
> on machines with >4GB if there's both DMA32 and NORMAL.
> 
> > Or fall back to earlier zones (which includes unpolicied  
> > zones in the bind zone list?)
> 

Does it make sense to have a separate policy_zone for each node so that we
have at least one (highest) zone in a node which comes under memory policy?

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-09  4:39                                 ` Bharata B Rao
@ 2006-02-09  9:58                                   ` Andi Kleen
  2006-02-14 19:33                                     ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2006-02-09  9:58 UTC (permalink / raw)
  To: bharata; +Cc: Christoph Lameter, Ray Bryant, discuss, linux-kernel

On Thursday 09 February 2006 05:39, Bharata B Rao wrote:
> On Wed, Feb 08, 2006 at 05:06:26PM +0100, Andi Kleen wrote:
> > On Wednesday 08 February 2006 16:59, Christoph Lameter wrote:
> > > On Wed, 8 Feb 2006, Andi Kleen wrote:
> > > 
> > > > On Wednesday 08 February 2006 16:42, Christoph Lameter wrote:
> > > > 
> > > > > However, this has implications for policy_zone. This variable should store
> > > > > the zone that policies apply to. However, in your case this zone will vary 
> > > > > which may lead to all sorts of weird behavior even if we fix 
> > > > > bind_zonelist. To which zone does policy apply? ZONE_NORMAL or ZONE_DMA32?
> > > > 
> > > > It really needs to apply to both (currently you can't police 4GB of your 
> > > > memory on x86-64) But I haven't worked out a good design how to implement it yet.
> > > 
> > > So a provisional solution would be to simply ignore empty zones in 
> > > bind_zonelist?
> > 
> > That would likely prevent the crash yes (Bharata can you test?)
> 
> With this solution, the kernel doesn't crash, but the application does.
> 
> Shouldn't we fail mbind if we can't bind any zones?

Really need to fix this properly to support both zones in mbind.

> Does it make sense to have a separate policy_zone for each node so that we
> have at least one (highest) zone in a node which comes under memory policy?

That wouldn't solve the problem. The problem is that the mempolicy needs 
at least two zonelists to handle all types of allocations (that is why 
I added the concept of policy zone in the first place - to avoid the need
for multilevel zonelists in the policies).

Or maybe it's better to just not do any policy for GFP_DMA32 
allocations and always use the highest zonelist. I guess they're somewhat
rare anyway and the policy will rarely succeed.

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-09  9:58                                   ` Andi Kleen
@ 2006-02-14 19:33                                     ` Christoph Lameter
  2006-02-15  5:46                                       ` Bharata B Rao
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2006-02-14 19:33 UTC (permalink / raw)
  To: bharata; +Cc: Andi Kleen, Christoph Lameter, Ray Bryant, discuss, linux-kernel

I just took another look at this issue and I cannot see anything wrong. An 
empty zone should be ignored by the page allocator since nr_free == 0. My 
patch should not be needed.

Could you get us the contents of the struct zone that the page allocator 
is trying to get memory from?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-14 19:33                                     ` Christoph Lameter
@ 2006-02-15  5:46                                       ` Bharata B Rao
  2006-02-15 10:38                                         ` Bharata B Rao
  0 siblings, 1 reply; 29+ messages in thread
From: Bharata B Rao @ 2006-02-15  5:46 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, Ray Bryant, discuss, linux-kernel

On Tue, Feb 14, 2006 at 11:33:00AM -0800, Christoph Lameter wrote:
> I just took another look at this issue and I cannot see anything wrong. An 
> empty zone should be ignored by the page allocator since nr_free == 0. My 
> patch should not be needed.

There is a check for list_empty(&area->free_list) in __rmqueue(), which
I think is one of the points in the page allocator where the emptiness of
the free_area list is checked. The current zone (when the crash happens)
bypasses this test, leading to this crash.

> 
> Could you get us the contents of the struct zone that the page allocator 
> is trying to get memory from?

The zone looks like this:

crash> p *(struct zone *)0xffff81000000e700
$1 = {
  free_pages = 0,
  pages_min = 0,
  pages_low = 0,
  pages_high = 0,
  lowmem_reserve = {0, 0, 0, 0},
  pageset = {0xffff81000c013740, 0xffff81013fc42f40, 0xffffffff8071d600,
    0xffffffff8071d680, 0xffffffff8071d700, 0xffffffff8071d780,
    0xffffffff8071d800, 0xffffffff8071d880, 0xffffffff8071d900,
    0xffffffff8071d980, 0xffffffff8071da00, 0xffffffff8071da80,
    0xffffffff8071db00, 0xffffffff8071db80, 0xffffffff8071dc00,
    0xffffffff8071dc80, 0xffffffff8071dd00, 0xffffffff8071dd80,
    0xffffffff8071de00, 0xffffffff8071de80, 0xffffffff8071df00,
    0xffffffff8071df80, 0xffffffff8071e000, 0xffffffff8071e080,
    0xffffffff8071e100, 0xffffffff8071e180, 0xffffffff8071e200,
    0xffffffff8071e280, 0xffffffff8071e300, 0xffffffff8071e380,
    0xffffffff8071e400, 0xffffffff8071e480},
  lock = {
    raw_lock = {
      slock = 0
    },
    break_lock = 1
  },
  free_area = {{
      free_list = {
        next = 0x0,
        prev = 0x0
      },
      nr_free = 0
    }, {
      free_list = {
        next = 0x0,
        prev = 0x0
      },
      nr_free = 0
    }, {
      free_list = {
        next = 0x0,
        prev = 0x0
      },
      nr_free = 0
    }, {
      free_list = {
        next = 0x0,
        prev = 0x0
      },
      nr_free = 0
    }, {
      free_list = {
        next = 0x0,
        prev = 0x0
      },
      nr_free = 0
    }, {
      free_list = {
        next = 0x0,
        prev = 0x0
      },
      nr_free = 0
    }, {
      free_list = {
        next = 0x0,
        prev = 0x0
      },
      nr_free = 0
    }, {
      free_list = {
        next = 0x0,
        prev = 0x0
      },
      nr_free = 0
    }, {
      free_list = {
        next = 0x0,
        prev = 0x0
      },
      nr_free = 0
    }, {
      free_list = {
        next = 0x0,
        prev = 0x0
      },
      nr_free = 0
    }, {
      free_list = {
        next = 0x0,
        prev = 0x0
      },
      nr_free = 0
    }},
  _pad1_ = {
    x = 0xffff81000000e980 "\001"
  },
  lru_lock = {
    raw_lock = {
      slock = 1
    },
    break_lock = 0
  },
  active_list = {
    next = 0xffff81000000e988,
    prev = 0xffff81000000e988
  },
  inactive_list = {
    next = 0xffff81000000e998,
    prev = 0xffff81000000e998
  },
  nr_scan_active = 0,
  nr_scan_inactive = 0,
  nr_active = 0,
  nr_inactive = 0,
  pages_scanned = 0,
  all_unreclaimable = 0,
  reclaim_in_progress = {
    counter = 0
  },
  last_unsuccessful_zone_reclaim = 0,
  temp_priority = 12,
  prev_priority = 12,
  _pad2_ = {
    x = 0xffff81000000ea00 ""
  },
  wait_table = 0x0,
  wait_table_size = 0,
  wait_table_bits = 0,
  zone_pgdat = 0xffff81000000e000,
  zone_mem_map = 0x0,
  zone_start_pfn = 0,
  spanned_pages = 0,
  present_pages = 0,
  name = 0xffffffff804a858c "Normal"
}

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-15  5:46                                       ` Bharata B Rao
@ 2006-02-15 10:38                                         ` Bharata B Rao
  2006-02-15 11:21                                           ` Andi Kleen
  2006-02-15 18:10                                           ` Christoph Lameter
  0 siblings, 2 replies; 29+ messages in thread
From: Bharata B Rao @ 2006-02-15 10:38 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, Ray Bryant, discuss, linux-kernel

On Wed, Feb 15, 2006 at 11:16:20AM +0530, Bharata B Rao wrote:
> On Tue, Feb 14, 2006 at 11:33:00AM -0800, Christoph Lameter wrote:
> > I just took another look at this issue and I cannot see anything wrong. An 
> > empty zone should be ignored by the page allocator since nr_free == 0. My 
> > patch should not be needed.
> 
> There is a check for list_empty(&area->free_list) in __rmqueue(), which
> I think is one of the points in the page allocator where the emptiness of
> the free_area list is checked. The current zone (when the crash happens)
> bypasses this test, leading to this crash.
> 

We don't initialize the free_area list for all zones. Instead,
free_area_init_core() does that only for zones which are non-empty.

But in __rmqueue(), we depend on these free_area lists being initialized
correctly for all zones, which is not true in the present case we
are discussing.

I think we either need to initialize free_area lists for all zones
or check for !zone->free_area->nr_free in __rmqueue().
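
To make the failure mode concrete, here is a small userspace sketch. The `list_head` layout and the `list_empty()` test mirror the kernel's; `model_free_area` and the helper functions are hypothetical stand-ins, not kernel code. A zeroed, never-initialized list head has next == NULL, which is not equal to the head's own address, so `list_empty()` reports "not empty" and the caller goes on to follow a NULL pointer. That would be consistent with the NULL dereference at offset 8 in the oops at the start of this thread, though this has not been checked against the disassembly.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Same layout and emptiness test as the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Hypothetical stand-in for one entry of zone->free_area[]. */
struct model_free_area {
	struct list_head free_list;
	unsigned long nr_free;
};

/* A zeroed area, like the ones in the dump above, is NOT seen as empty. */
static bool model_zeroed_area_looks_nonempty(void)
{
	struct model_free_area area;

	memset(&area, 0, sizeof(area));
	return !list_empty(&area.free_list);
}

/* Second option above: consult nr_free before trusting the list. */
static bool model_area_has_pages(const struct model_free_area *area)
{
	return area->nr_free != 0 && !list_empty(&area->free_list);
}

/* First option above: initialize the list even for an empty zone. */
static void model_area_init(struct model_free_area *area)
{
	init_list_head(&area->free_list);
	area->nr_free = 0;
}

/* Usage: both options make an empty zone's area report "no pages". */
static void model_free_area_selfcheck(void)
{
	struct model_free_area area;

	assert(model_zeroed_area_looks_nonempty());

	memset(&area, 0, sizeof(area));
	assert(!model_area_has_pages(&area));	/* nr_free check catches it */

	model_area_init(&area);
	assert(list_empty(&area.free_list));	/* initialized list is empty */
}
```

Either remedy closes the hole in this model: an initialized list makes `list_empty()` truthful, and an `nr_free` check never looks at the uninitialized pointers in the first place.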

Even with this, mbind still needs to be fixed. Even though it
can't get a conforming zone in the node (MPOL_BIND case), right now
it goes ahead with the "bind"ing of the memory area. This causes the
application to crash (assuming we have fixed the __rmqueue kernel crash).
(I haven't yet figured out why exactly the application dies.)

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-15 10:38                                         ` Bharata B Rao
@ 2006-02-15 11:21                                           ` Andi Kleen
  2006-02-15 18:14                                             ` Christoph Lameter
  2006-02-16  5:18                                             ` Bharata B Rao
  2006-02-15 18:10                                           ` Christoph Lameter
  1 sibling, 2 replies; 29+ messages in thread
From: Andi Kleen @ 2006-02-15 11:21 UTC (permalink / raw)
  To: bharata; +Cc: Christoph Lameter, Ray Bryant, discuss, linux-kernel

On Wednesday 15 February 2006 11:38, Bharata B Rao wrote:

> 
> Even with this, mbind still needs to be fixed. Even though it
> can't get a conforming zone in the node (MPOL_BIND case),

It should just use lower zones then (e.g. if no ZONE_NORMAL,
use ZONE_DMA32). Yes, that needs to be fixed.

How about the appended patch? Does it fix the problem for you?

-Andi

Handle all and empty zones when setting up custom zonelists for mbind

The memory allocator doesn't like empty zones (which have an 
uninitialized freelist), so an x86-64 system with a node whose memory
lies entirely in ZONE_DMA32 would crash on mbind.

Fix that up by putting all possible zones as fallback into the zonelist
and skipping the empty ones.

In fact the code always allocated enough space for all zones,
but only used it for the highest. This change just uses all the
memory that was allocated before.

This should work fine for now, but whoever implements node hot removal
needs to fix this somewhere else too (or make sure the zone data structures 
themselves never go away, only their memory).

Signed-off-by: Andi Kleen <ak@suse.de>

Index: linux/mm/mempolicy.c
===================================================================
--- linux.orig/mm/mempolicy.c
+++ linux/mm/mempolicy.c
@@ -132,19 +132,29 @@ static int mpol_check_policy(int mode, n
 	}
 	return nodes_subset(*nodes, node_online_map) ? 0 : -EINVAL;
 }
+
 /* Generate a custom zonelist for the BIND policy. */
 static struct zonelist *bind_zonelist(nodemask_t *nodes)
 {
 	struct zonelist *zl;
-	int num, max, nd;
+	int num, max, nd, k;
 
 	max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
-	zl = kmalloc(sizeof(void *) * max, GFP_KERNEL);
+	zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
 	if (!zl)
 		return NULL;
 	num = 0;
-	for_each_node_mask(nd, *nodes)
-		zl->zones[num++] = &NODE_DATA(nd)->node_zones[policy_zone];
+	/* First put in the highest zones from all nodes, then all the next 
+	   lower zones etc. Avoid empty zones because the memory allocator
+	   doesn't like them. If you implement node hot removal you
+	   have to fix that. */
+	for (k = policy_zone; k >= 0; k--) { 
+		for_each_node_mask(nd, *nodes) { 
+			struct zone *z = &NODE_DATA(nd)->node_zones[k];
+			if (z->present_pages > 0) 
+				zl->zones[num++] = z;
+		}
+	}
 	zl->zones[num] = NULL;
 	return zl;
 }
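
The ordering that the new nested loop produces can be checked with a small userspace model. This is a sketch only: the `sketch_*` names, the fixed two-node / three-zone layout and the bitmask in place of `nodemask_t` are assumptions for illustration and are not part of the patch. The outer loop walks zone indexes from policy_zone downwards, the inner loop walks the bound nodes, and zones without pages are left out.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_NODES 2	/* hypothetical two-node system */
#define SKETCH_NR_ZONES 3	/* hypothetical: DMA, DMA32, NORMAL */

struct sketch_zone { unsigned long present_pages; };
struct sketch_node { struct sketch_zone node_zones[SKETCH_NR_ZONES]; };

/*
 * Fill zl with the populated zones of the nodes in nodemask: first the
 * highest zone of every node, then the next lower zone of every node,
 * and so on. zl must have room for SKETCH_NR_NODES * SKETCH_NR_ZONES + 1
 * entries. Returns the number of zones stored before the NULL terminator.
 */
static int sketch_bind_zonelist(struct sketch_node *nodes,
				unsigned int nodemask, int policy_zone,
				struct sketch_zone **zl)
{
	int k, nd, num = 0;

	for (k = policy_zone; k >= 0; k--) {
		for (nd = 0; nd < SKETCH_NR_NODES; nd++) {
			struct sketch_zone *z = &nodes[nd].node_zones[k];

			if (!(nodemask & (1u << nd)))
				continue;	/* node not bound */
			if (z->present_pages > 0)
				zl[num++] = z;
		}
	}
	zl[num] = NULL;
	return num;
}

/* Usage: node 0 only has DMA and DMA32 memory, node 1 only NORMAL. */
static void sketch_bind_selfcheck(void)
{
	struct sketch_node nodes[SKETCH_NR_NODES] = {
		{ .node_zones = { {16}, {1024}, {0} } },
		{ .node_zones = { {0}, {0}, {2048} } },
	};
	struct sketch_zone *zl[SKETCH_NR_NODES * SKETCH_NR_ZONES + 1];

	/* Binding to node 0 now falls back to its DMA32 and DMA zones. */
	assert(sketch_bind_zonelist(nodes, 0x1, 2, zl) == 2);
	assert(zl[0] == &nodes[0].node_zones[1]);
	assert(zl[1] == &nodes[0].node_zones[0]);
	assert(zl[2] == NULL);
}
```

With this layout, a bind to node 0 alone no longer yields an empty or invalid zonelist; it degrades to that node's lower zones, which matches the "use lower zones" behaviour described above.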

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-15 11:21                                           ` Andi Kleen
@ 2006-02-15 18:14                                             ` Christoph Lameter
  2006-02-16  5:18                                             ` Bharata B Rao
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2006-02-15 18:14 UTC (permalink / raw)
  To: Andi Kleen; +Cc: bharata, Ray Bryant, discuss, linux-kernel

On Wed, 15 Feb 2006, Andi Kleen wrote:

> How about the appended patch? Does it fix the problem for you?

I think we still need to address the issue of being able to crash
the page allocator if an empty zone is in the zonelist. 

> This should work fine for now, but whoever implements node hot removal
> needs to fix this somewhere else too (or make sure the zone data structures 
> themselves never go away, only their memory).

Yup. Simply initializing the pcp structures with empty lists should 
suffice though.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-15 11:21                                           ` Andi Kleen
  2006-02-15 18:14                                             ` Christoph Lameter
@ 2006-02-16  5:18                                             ` Bharata B Rao
  1 sibling, 0 replies; 29+ messages in thread
From: Bharata B Rao @ 2006-02-16  5:18 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Christoph Lameter, Ray Bryant, discuss, linux-kernel

On Wed, Feb 15, 2006 at 12:21:53PM +0100, Andi Kleen wrote:
> On Wednesday 15 February 2006 11:38, Bharata B Rao wrote:
> 
> > 
> > Even with this, mbind still needs to be fixed. Even though it
> > can't get a conforming zone in the node (MPOL_BIND case),
> 
> It should just use lower zones then (e.g. if no ZONE_NORMAL,
> use ZONE_DMA32). Yes, that needs to be fixed.
> 
> How about the appended patch? Does it fix the problem for you?
> 

Yes, this fixes the problem. The kernel and the application
don't crash now with this patch.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64
  2006-02-15 10:38                                         ` Bharata B Rao
  2006-02-15 11:21                                           ` Andi Kleen
@ 2006-02-15 18:10                                           ` Christoph Lameter
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2006-02-15 18:10 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Christoph Lameter, Andi Kleen, Ray Bryant, discuss, linux-kernel

On Wed, 15 Feb 2006, Bharata B Rao wrote:

> We don't initialize the free_area list for all zones. Instead,
> free_area_init_core() does that only for zones which are non-empty.

Right.

> But in __rmqueue(), we depend on these free_area lists being initialized
> correctly for all zones, which is not true in the present case we
> are discussing.

> I think we either need to initialize free_area lists for all zones
> or check for !zone->free_area->nr_free in __rmqueue().

Or we can initialize all pcp to contain empty lists for zones without 
pages.

> Even with this, mbind still needs to be fixed. Even though it
> can't get a conforming zone in the node (MPOL_BIND case), right now,
> it goes ahead with the "bind"ing of the memory area. This causes the
> application to crash (assuming we have fixed the __rmqueue kernel crash).
> (I haven't yet figured out why exactly the application dies.)

The application crashes because of an OOM.


^ permalink raw reply	[flat|nested] 29+ messages in thread
