* [RFC] split zonelist and use nodemask for page allocation [1/4]
@ 2006-04-21 4:11 KAMEZAWA Hiroyuki
2006-04-21 4:41 ` Christoph Lameter
2006-04-21 6:17 ` Paul Jackson
0 siblings, 2 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-04-21 4:11 UTC (permalink / raw)
To: linux-mm; +Cc: clameter
These patches modify zonelist and add nodes_list[].
They also modify alloc_pages to use a nodemask instead of a zonelist.
With this:
(1) the very long zonelist is removed.
(2) MPOL_BIND can work in a sane way (see the sketch below).
(3) node hot-plug no longer needs to care about mempolicies; in other words,
mempolicy does not have to manage zonelists.
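To illustrate point (2), here is a minimal sketch of what an MPOL_BIND-style
caller could look like once the policy keeps a plain nodemask; the function
name and the 'allowed' parameter are invented for illustration and are not
part of the patch:

static struct page *alloc_pages_bound(gfp_t gfp_mask, unsigned int order,
					nodemask_t *allowed)
{
	/* start from the local node, fall back only within 'allowed' */
	return __alloc_pages_nodemask(gfp_mask, order, numa_node_id(),
					allowed);
}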
My current concerns are:
(a) the performance degradation of alloc_pages() caused by this change.
(b) whether this breaks assumptions made by mempolicy.
-Kame
==
Currently a zonelist covers all nodes' zones; this patch modifies it to cover
only one node's zones. It also modifies the front end of alloc_pages to take
a nodemask instead of a zonelist.
The old zonelist is split into a per-node zonelist and nodes_list[].
nodes_list[] records all node IDs in order of distance.
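For illustration only (this helper is not part of the patch), nodes_list[]
could be built at boot with a simple nearest-first selection over
node_distance(), terminated by -1 as expected by __alloc_pages_nodemask():

static void __init build_nodes_list(int nid)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	nodemask_t used = NODE_MASK_NONE;
	int i = 0;

	for (;;) {
		int n, best = -1;

		/* pick the nearest online node not placed yet */
		for_each_online_node(n) {
			if (node_isset(n, used))
				continue;
			if (best < 0 ||
			    node_distance(nid, n) < node_distance(nid, best))
				best = n;
		}
		if (best < 0)
			break;
		node_set(best, used);
		pgdat->nodes_list[i++] = best;
	}
	pgdat->nodes_list[i] = -1;	/* terminator */
}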
To be done:
- Duplicating nodes_list for each gfp type, as is done for zonelists, may be
better.
- This patch slows down the fast path of alloc_pages(), so more optimization
will be needed.
- More cleanup.
Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Index: linux-2.6.17-rc1-mm2/include/linux/gfp.h
===================================================================
--- linux-2.6.17-rc1-mm2.orig/include/linux/gfp.h 2006-04-21 10:54:40.000000000 +0900
+++ linux-2.6.17-rc1-mm2/include/linux/gfp.h 2006-04-21 10:55:15.000000000 +0900
@@ -104,7 +104,7 @@
#endif
extern struct page *
-FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
+FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int, int, nodemask_t *));
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
@@ -116,8 +116,7 @@
if (nid < 0)
nid = numa_node_id();
- return __alloc_pages(gfp_mask, order,
- NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
+ return __alloc_pages_nodemask(gfp_mask, order, nid, NULL);
}
#ifdef CONFIG_NUMA
Index: linux-2.6.17-rc1-mm2/include/linux/mmzone.h
===================================================================
--- linux-2.6.17-rc1-mm2.orig/include/linux/mmzone.h 2006-04-21 10:54:40.000000000 +0900
+++ linux-2.6.17-rc1-mm2/include/linux/mmzone.h 2006-04-21 12:07:40.000000000 +0900
@@ -268,7 +268,7 @@
* footprint of this construct is very small.
*/
struct zonelist {
- struct zone *zones[MAX_NUMNODES * MAX_NR_ZONES + 1]; // NULL delimited
+ struct zone *zones[MAX_NR_ZONES + 1]; // NULL delimited
};
@@ -287,6 +287,7 @@
typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
struct zonelist node_zonelists[GFP_ZONETYPES];
+ int nodes_list[MAX_NUMNODES + 1]; /* sorted by distance */
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP
struct page *node_mem_map;
Index: linux-2.6.17-rc1-mm2/mm/page_alloc.c
===================================================================
--- linux-2.6.17-rc1-mm2.orig/mm/page_alloc.c 2006-04-21 10:54:40.000000000 +0900
+++ linux-2.6.17-rc1-mm2/mm/page_alloc.c 2006-04-21 12:08:22.000000000 +0900
@@ -980,7 +980,7 @@
/*
* This is the 'heart' of the zoned buddy allocator.
*/
-struct page * fastcall
+static struct page * fastcall
__alloc_pages(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist)
{
@@ -999,7 +999,7 @@
z = zonelist->zones; /* the list of zones suitable for gfp_mask */
if (unlikely(*z == NULL)) {
- /* Should this ever happen?? */
+ /* goto next node */
return NULL;
}
@@ -1137,7 +1137,29 @@
return page;
}
-EXPORT_SYMBOL(__alloc_pages);
+struct page * fastcall
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+ int nid, nodemask_t *nodemask)
+{
+ pg_data_t *pgdat = NODE_DATA(nid);
+ struct page *page = NULL;
+ struct zonelist *zl;
+ int target_nid;
+ int i = 0;
+
+ do {
+ target_nid = pgdat->nodes_list[i++];
+ if (likely(node_online(target_nid)))
+ if (!nodemask || node_isset(target_nid, *nodemask)) {
+ zl = NODE_DATA(target_nid)->node_zonelists +
+ gfp_zone(gfp_mask);
+ page = __alloc_pages(gfp_mask, order, zl);
+ }
+ } while(!page && pgdat->nodes_list[i] != -1);
+
+ return page;
+}
+EXPORT_SYMBOL(__alloc_pages_nodemask);
/*
* Common helper functions.
* Re: [RFC] split zonelist and use nodemask for page allocation [1/4]
2006-04-21 4:11 [RFC] split zonelist and use nodemask for page allocation [1/4] KAMEZAWA Hiroyuki
@ 2006-04-21 4:41 ` Christoph Lameter
2006-04-21 6:17 ` Paul Jackson
1 sibling, 0 replies; 7+ messages in thread
From: Christoph Lameter @ 2006-04-21 4:41 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm
On Fri, 21 Apr 2006, KAMEZAWA Hiroyuki wrote:
>
> These patches modify zonelist and add nodes_list[]. They also modify
> alloc_pages to use a nodemask instead of a zonelist.
That is great. I have thought that this would be necessary for a long
time. The zonelist stuff is rather difficult to handle. This could allow a
clean up of the memory policy layer.
* Re: [RFC] split zonelist and use nodemask for page allocation [1/4]
2006-04-21 4:11 [RFC] split zonelist and use nodemask for page allocation [1/4] KAMEZAWA Hiroyuki
2006-04-21 4:41 ` Christoph Lameter
@ 2006-04-21 6:17 ` Paul Jackson
2006-04-21 6:49 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 7+ messages in thread
From: Paul Jackson @ 2006-04-21 6:17 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, clameter
Interesting ... maybe ?
Doesn't this change the semantics of the kernel page allocator?
If I read correctly:
The existing code scans the entire system's zonelists multiple times.
First, it looks on all nodes in the system for easy memory, and if that
fails, tries again, looking for less easy (lower threshold) memory.
Your code takes one node at a time, in the __alloc_pages_nodemask() loop,
and calls __alloc_pages() for that node, which will exhaust that node
before giving up.
In particular, the low memory failure cases, such as when the system
starts to swap on a node, or a task is forced to sleep waiting for memory,
or the out-of-memory killer is called, would seem to be quite different with
your patch. This could cause some serious problems, I suspect.
Some of your other advantages from this change look nice, but I suspect
it would take a radical rewrite of __alloc_pages(), moving the multiple
scans at increasingly aggressive free memory settings up into your
__alloc_pages_nodemask() routine, and moving the cpuset_zone_allowed()
check from get_page_from_freelist() up as well.
This would be a major rewrite of mm/page_alloc.c, perhaps a very
interesting one, but I don't think it would be an easy one.
Or, just perhaps, the above change in semantics is a -good- one. I'll
wager that my colleague Christoph will consider it such (I see he has
already heartily endorsed your patch.) Essentially your patch would
seem to increase the locality of allocations -- beating one node to
death before considering the next. Sometimes this will be a good
improvement.
And sometimes not. In my ideal world, there would be a per-cpuset
option, perhaps just a boolean, choosing between the two choices of:
1) look on all allowed nodes for easy memory, before reconsidering
each allowed node for one of the last free pages, or
2) beat all zones on one node hard, before going off-node.
I believe that the existing code does (1), and your patch does (2).
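For concreteness, here is a rough illustration of the two orderings. This is
pseudocode, not real kernel code; the loop helpers (for_each_watermark_level,
for_each_zone_in_zonelist, for_each_node_in_distance_order, zonelist_of) are
invented names:

/* (1) existing behaviour, simplified: try every allowed zone on every
 *     node at one threshold before lowering the threshold (pseudocode) */
for_each_watermark_level(mark)
	for_each_zone_in_zonelist(z, zonelist)
		if (zone_watermark_ok(z, order, mark, classzone_idx, flags))
			return buffered_rmqueue(...);

/* (2) behaviour with the patch, simplified: exhaust one node, including
 *     reclaim/sleep inside __alloc_pages(), before moving on (pseudocode) */
for_each_node_in_distance_order(nid, nodes_list)
	if ((page = __alloc_pages(gfp_mask, order, zonelist_of(nid, gfp_mask))))
		return page;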
In any event, the layering of yet another control loop on top of the
nested conditional fallback loops of loops we have now is a concern.
It is getting harder and harder for mere mortals to understand this.
Perhaps there are opportunities here for much more cleanup, though
that would not be easy.
My apologies for wasting your time if I misread this.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [RFC] split zonelist and use nodemask for page allocation [1/4]
2006-04-21 6:17 ` Paul Jackson
@ 2006-04-21 6:49 ` KAMEZAWA Hiroyuki
2006-04-21 6:56 ` Paul Jackson
0 siblings, 1 reply; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-04-21 6:49 UTC (permalink / raw)
To: Paul Jackson; +Cc: linux-mm, clameter
On Thu, 20 Apr 2006 23:17:51 -0700
Paul Jackson <pj@sgi.com> wrote:
> Interesting ... maybe ?
>
> Doesn't this change the semantics of the kernel page allocator?
>
> If I read correctly:
>
> The existing code scans the entire system's zonelists multiple times.
> First, it looks on all nodes in the system for easy memory, and if that
> fails, tries again, looking for less easy (lower threshold) memory.
>
> Your code takes one node at a time, in the __alloc_pages_nodemask() loop,
> and calls __alloc_pages() for that node, which will exhaust that node
> before giving up.
>
Ah, okay: get_page_from_freelist() is called several times in alloc_pages()....
I should reconsider and rewrite the whole patch.
Thank you for pointing it out.
Maybe what I should do is not add a function that encapsulates alloc_pages(),
but modify get_page_from_freelist() to take a nodemask.
> In particular, the low memory failure cases, such as when the system
> starts to swap on a node, or a task is forced to sleep waiting for memory,
> or the out-of-memory killer is called, would seem to be quite different with
> your patch. This could cause some serious problems, I suspect.
>
Yes, serious.
> Some of your other advantages from this change look nice, but I suspect
> it would take a radical rewrite of __alloc_pages(), moving the multiple
> scans at increasingly aggressive free memory settings up into your
> __alloc_pages_nodemask() routine, and moving the cpuset_zone_allowed()
> check from get_page_from_freelist() up as well.
>
Yes, I think so too.
> This would be a major rewrite of mm/page_alloc.c, perhaps a very
> interesting one, but I don't think it would be an easy one.
>
> Or, just perhaps, the above change in semantics is a -good- one. I'll
> wager that my colleague Christoph will consider it such (I see he has
> already heartily endorsed your patch.) Essentially your patch would
> seem to increase the locality of allocations -- beating one node to
> death before considering the next. Sometimes this will be a good
> improvement.
>
> And sometimes not. In my ideal world, there would be a per-cpuset
> option, perhaps just a boolean, choosing between the two choices of:
> 1) look on all allowed nodes for easy memory, before reconsidering
> each allowed node for one of the last free pages, or
> 2) beat all zones on one node hard, before going off-node.
>
> I believe that the existing code does (1), and your patch does (2).
>
> In any event, the layering of yet another control loop on top of the
> nested conditional fallback loops of loops we have now is a concern.
> It is getting harder and harder for mere mortals to understand this.
>
> Perhaps there are opportunities here for much more cleanup, though
> that would not be easy.
>
yes, not easy.
> My apologies for wasting your time if I misread this.
>
I think you are right.
Thank you.
-Kame
* Re: [RFC] split zonelist and use nodemask for page allocation [1/4]
2006-04-21 6:49 ` KAMEZAWA Hiroyuki
@ 2006-04-21 6:56 ` Paul Jackson
2006-04-21 8:05 ` KAMEZAWA Hiroyuki
2006-04-21 15:06 ` Christoph Lameter
0 siblings, 2 replies; 7+ messages in thread
From: Paul Jackson @ 2006-04-21 6:56 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, clameter
> yes, not easy.
Good luck <grin>.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [RFC] split zonelist and use nodemask for page allocation [1/4]
2006-04-21 6:56 ` Paul Jackson
@ 2006-04-21 8:05 ` KAMEZAWA Hiroyuki
2006-04-21 15:06 ` Christoph Lameter
1 sibling, 0 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-04-21 8:05 UTC (permalink / raw)
To: Paul Jackson; +Cc: linux-mm, clameter
On Thu, 20 Apr 2006 23:56:16 -0700
Paul Jackson <pj@sgi.com> wrote:
> > yes, not easy.
>
> Good luck <grin>.
>
Maybe the allocation code as a whole will end up looking like the sketch below.
But I noticed that try_to_free_pages(), out_of_memory(), etc. use an array of
zones ;(
The whole modification will be bigger than I thought....
Thanks
--Kame
=
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int nid,
		nodemask_t *nodemask, int alloc_flags)
{
	pg_data_t *pgdat;
	struct zone **z, *orig_zone;
	struct page *page = NULL;
	int classzone_idx, target_node, index;
	int alloc_type = gfp_zone(gfp_mask);

	/*
	 * Go through all the specified zones once, looking for a zone
	 * with enough free pages.
	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
	 */
	index = 0;
	orig_zone = NULL;
	do {
		target_node = NODE_DATA(nid)->nodes_list[index++];
		if (target_node < 0)	/* -1 terminates nodes_list[] */
			break;
		if (nodemask && !node_isset(target_node, *nodemask))
			continue;
		if (!node_online(target_node))
			continue;
		pgdat = NODE_DATA(target_node);
		if (!orig_zone) {
			/* record the first zone we found for statistics */
			z = pgdat->node_zonelists[alloc_type].zones;
			orig_zone = *z;
			classzone_idx = zone_idx(orig_zone);
		}
		for (z = pgdat->node_zonelists[alloc_type].zones; *z; ++z) {
			if ((alloc_flags & ALLOC_CPUSET) &&
			    !cpuset_zone_allowed(*z, gfp_mask))
				continue;
			if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
				unsigned long mark;

				if (alloc_flags & ALLOC_WMARK_MIN)
					mark = (*z)->pages_min;
				else if (alloc_flags & ALLOC_WMARK_LOW)
					mark = (*z)->pages_low;
				else
					mark = (*z)->pages_high;
				if (!zone_watermark_ok(*z, order, mark,
						classzone_idx, alloc_flags))
					if (!zone_reclaim_mode ||
					    !zone_reclaim(*z, gfp_mask, order))
						continue;
			}
			page = buffered_rmqueue(*z, order, gfp_mask, orig_zone);
			if (page)
				return page;
		}
	} while (target_node != -1);

	return page;
}
struct page * fastcall
__alloc_pages(gfp_t gfp_mask, unsigned int order, int nid,
		nodemask_t *nodemask)
{
	<snip>
	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
				nid, nodemask, ALLOC_WMARK_LOW|ALLOC_CPUSET);
	if (page)
		goto got_pg;

	alloc_type = gfp_zone(gfp_mask);
	/* run kswapd for all failed zones */
	for_each_node_mask(node, *nodemask)
		for (z = NODE_DATA(node)->node_zonelists[alloc_type].zones;
		     *z; ++z)
			if (cpuset_zone_allowed(*z, gfp_mask))
				wakeup_kswapd(*z, order);
* Re: [RFC] split zonelist and use nodemask for page allocation [1/4]
2006-04-21 6:56 ` Paul Jackson
2006-04-21 8:05 ` KAMEZAWA Hiroyuki
@ 2006-04-21 15:06 ` Christoph Lameter
1 sibling, 0 replies; 7+ messages in thread
From: Christoph Lameter @ 2006-04-21 15:06 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Paul Jackson, linux-mm, ak
One thing that may also be good to implement is to get away from traversing
lists for allocations.
Most of the time you will have multiple nodes at the same distance for an
allocation. It would be best if we could either round-robin over those nodes
or check the amount of free memory and allocate from the one with the most
free. This means that a plain node list would not work and that the algorithm
for selecting a remote node would get more complex.
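As a rough sketch (the function is hypothetical, not proposed code), choosing
the node with the most free pages among the online nodes at a given distance
might look like this, summing the per-zone free_pages counters of this era:

static int pick_fullest_node_at_distance(int nid, int distance)
{
	int n, best = -1;
	unsigned long best_free = 0;

	for_each_online_node(n) {
		unsigned long free = 0;
		int i;

		if (node_distance(nid, n) != distance)
			continue;
		for (i = 0; i < MAX_NR_ZONES; i++)
			free += NODE_DATA(n)->node_zones[i].free_pages;
		if (best < 0 || free > best_free) {
			best = n;
			best_free = free;
		}
	}
	return best;	/* -1 if no online node at that distance */
}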
Also, when going off-node: it may be good to increase the amount that cannot
be touched, reserving more memory for local allocations.
I think there are definitely some challenges here as Paul pointed out.
However, I think we may be at a dead end with the zonelist. Going away
from the zonelist would also enable the consolidation of policy and cpuset
restrictions. If the page allocator can take a list of nodes from which
allocations are allowed then the cpuset hooks may no longer be necessary.
However, this is certainly not immediately doable; it needs careful
thought and performance measurement to ensure that we avoid regressions.