All of lore.kernel.org
 help / color / mirror / Atom feed
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Christoph Lameter <clameter@sgi.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, ak@suse.de,
	Nishanth Aravamudan <nacc@us.ibm.com>
Subject: Re: [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes
Date: Thu, 14 Jun 2007 10:18:25 -0400	[thread overview]
Message-ID: <1181830705.5410.13.camel@localhost> (raw)
In-Reply-To: <Pine.LNX.4.64.0706131535200.32399@schroedinger.engr.sgi.com>

On Wed, 2007-06-13 at 15:46 -0700, Christoph Lameter wrote:
> On Wed, 13 Jun 2007, Lee Schermerhorn wrote:
> 
> > SLUB early allocation, included in the patch.  Works on HP ia64 platform
> > with small DMA only node and "zone order" zonelists.  Will test on
> > x86_64 real soon now...
> 
> I do not see the difference?? How does this work? node_memory(x) fails 
> there?

On my system, pseudo-node 4 contains ~512MB of DMA zone.  With the new
zoneorder auto-config patch [which this platform also needs], the
zonelist for gfp_zone == ZONE_NORMAL [hugetlb attempts to allocate from
this zone--as it should] contains the following zones:

	zone0-normal, zone1-normal, zone2-normal, zone3-normal,
	zone4-dma

node_memory(4) returns "true" -- node does have memory, in the dma zone,
but it's last in the list, as required.  alloc_pages_node() would call
__alloc_pages() with this zonelist.  If get_page_from_freelist() finds
the requested page in zone0-normal, the check that the zone's pgdat ==
the pgdat of zonelist->zone[0] will succeed and we'll return an off-node
page.

> 
> > The map of nodes with memory may include nodes with just
> > DMA/DMA32 memory.  Using this map/mask together with
> > GFP_THISNODE will not guarantee on-node allocations at higher
> > zones.  Modify checks in alloc_pages_node() to ensure that the
> > first zone in the selected zonelist is "on-node".
> 
> That check is already done by __alloc_pages.

You mean in get_page_from_freelist()?  No, it only checks that the zone
under consideration is on the same node as the zone at the start of the
list.  This can be off-node if the node is populated only at lower
zones; and the zonelists are in zone-order.

> 
> > This change will result in alloc_pages_node() returning NULL
> > when GFP_THISNODE is specified and the first zone in the zonelist
> > selected by (nid, gfp_zone(gfp_mask) is not on node 'nid'.  This,
> > in turn, BUGs out in slub.c:early_kmem_cache_node_alloc() which
> > apparently can't handle a NULL page from new_slab().  Fix SLUB
> > to handle NULL page in early allocation.
> 
> Ummm... Slub would need to consult node_memory_map instead I guess.

Probably should check the node_memory_map to avoid attempting
allocations from completely memoryless nodes.  However, it should still
be able to handle nulls from alloc_pages_nodes() because of the
scenarios discussed above.

Lee

> 
> > Index: Linux/mm/slub.c
> > ===================================================================
> > --- Linux.orig/mm/slub.c	2007-06-13 16:36:02.000000000 -0400
> > +++ Linux/mm/slub.c	2007-06-13 16:38:41.000000000 -0400
> > @@ -1870,16 +1870,18 @@ static struct kmem_cache_node * __init e
> >  	/* new_slab() disables interupts */
> >  	local_irq_enable();
> >  
> > -	BUG_ON(!page);
> > -	n = page->freelist;
> > -	BUG_ON(!n);
> > -	page->freelist = get_freepointer(kmalloc_caches, n);
> > -	page->inuse++;
> > -	kmalloc_caches->node[node] = n;
> > -	setup_object_debug(kmalloc_caches, page, n);
> > -	init_kmem_cache_node(n);
> > -	atomic_long_inc(&n->nr_slabs);
> > -	add_partial(n, page);
> > +	if (page) {
> > +		n = page->freelist;
> > +		BUG_ON(!n);
> > +		page->freelist = get_freepointer(kmalloc_caches, n);
> > +		page->inuse++;
> > +		kmalloc_caches->node[node] = n;
> > +		setup_object_debug(kmalloc_caches, page, n);
> > +		init_kmem_cache_node(n);
> > +		atomic_long_inc(&n->nr_slabs);
> > +		add_partial(n, page);
> > +	} else
> > +		kmalloc_caches->node[node] = NULL;
> >  	return n;
> >  }
> 
> It would be easier to modify SLUB to loop over node_memory_map instead of 
> node_online_map? Potentially we have to change all loops over online node 
> in the slab allocators.

Again, node_memory_map can't detect the "first zone in zonelist
off-node" situation.  That's the one that alloc_pages_node() must guard
against.   So, it can/should/must return NULL when attempting to
allocate from a higher zone that is off-node.

> 
> ---
>  include/linux/nodemask.h |    1 +
>  mm/slub.c                |    2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/include/linux/nodemask.h
> ===================================================================
> --- linux-2.6.orig/include/linux/nodemask.h	2007-06-13 15:40:27.000000000 -0700
> +++ linux-2.6/include/linux/nodemask.h	2007-06-13 15:40:48.000000000 -0700
> @@ -377,5 +377,6 @@ extern int nr_node_ids;
>  
>  #define for_each_node(node)	   for_each_node_mask((node), node_possible_map)
>  #define for_each_online_node(node) for_each_node_mask((node), node_online_map)
> +#define for_each_memory_node(node) for_each_node_mask((node), node_memory_map)
>  
>  #endif /* __LINUX_NODEMASK_H */
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2007-06-13 15:39:16.000000000 -0700
> +++ linux-2.6/mm/slub.c	2007-06-13 15:40:23.000000000 -0700
> @@ -1836,7 +1836,7 @@ static int init_kmem_cache_nodes(struct 
>  	else
>  		local_node = 0;
>  
> -	for_each_online_node(node) {
> +	for_each_memory_node(node) {
>  		struct kmem_cache_node *n;
>  
>  		if (local_node == node)
> 
> 
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2007-06-14 14:18 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-06-12 20:48 [patch 0/3] Fixes for NUMA allocations on memoryless nodes clameter
2007-06-12 20:48 ` [patch 1/3] NUMA: introduce node_memory_map clameter
2007-06-12 21:03   ` David Rientjes
2007-06-12 21:08     ` Christoph Lameter
2007-06-12 21:10       ` David Rientjes
2007-06-12 21:27         ` Christoph Lameter
2007-06-12 21:34           ` David Rientjes
2007-06-12 21:37             ` Christoph Lameter
2007-06-12 21:38               ` David Rientjes
2007-06-12 21:36         ` Nishanth Aravamudan
2007-06-12 21:39           ` Christoph Lameter
2007-06-12 21:42             ` Nishanth Aravamudan
2007-06-12 21:45               ` David Rientjes
2007-06-12 22:26               ` Christoph Lameter
2007-06-12 22:32                 ` Nishanth Aravamudan
2007-06-13  9:14           ` Andy Whitcroft
2007-06-12 20:48 ` [patch 2/3] Fix GFP_THISNODE behavior for memoryless nodes clameter
2007-06-12 21:03   ` David Rientjes
2007-06-12 21:07     ` Christoph Lameter
2007-06-12 21:08       ` David Rientjes
2007-06-12 21:25         ` Christoph Lameter
2007-06-12 21:34           ` David Rientjes
2007-06-12 21:51             ` Nishanth Aravamudan
2007-06-13 21:10   ` Lee Schermerhorn
2007-06-13 21:57     ` Nishanth Aravamudan
2007-06-13 22:46     ` Christoph Lameter
2007-06-13 23:11       ` Nishanth Aravamudan
2007-06-13 23:15         ` Christoph Lameter
2007-06-13 23:20           ` Nishanth Aravamudan
2007-06-13 23:26             ` Christoph Lameter
2007-06-13 23:32               ` Nishanth Aravamudan
2007-06-13 23:53                 ` Christoph Lameter
2007-06-14  0:04                   ` Nishanth Aravamudan
2007-06-14 14:18       ` Lee Schermerhorn [this message]
2007-06-14 14:24         ` Christoph Lameter
2007-06-14 14:55           ` Lee Schermerhorn
2007-06-14 15:51             ` Christoph Lameter
2007-06-14  7:07     ` Christoph Lameter
2007-06-14 14:23       ` Nishanth Aravamudan
2007-06-12 20:48 ` [patch 3/3] Fix MPOL_INTERLEAVE " clameter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1181830705.5410.13.camel@localhost \
    --to=lee.schermerhorn@hp.com \
    --cc=ak@suse.de \
    --cc=akpm@linux-foundation.org \
    --cc=clameter@sgi.com \
    --cc=linux-mm@kvack.org \
    --cc=nacc@us.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.