Avoid allocating during interleave from almost full nodes

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Avoid allocating during interleave from almost full nodes
@ 2006-11-03 20:58 Christoph Lameter
  2006-11-03 21:46 ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2006-11-03 20:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel

Interleave allocation often go over large sets of nodes. Some nodes
may have tasks on them that heavily use memory and rely on node local
allocations to get optimium performance. Overallocating those nodes may
reduce performance of those tasks by forcing off node allocations and
additional reclaim passes. It is better if we try to avoid nodes
that have most of its memory used and focus on nodes that still have lots
of memory available.

The intend of interleave is to have allocations spread out over a set of
nodes because the data is likely to be used from any of those nodes. It is
not important that we keep the exact sequence of allocations at all times.

The exact node we choose during interleave does not matter much if we are
under memory pressure since the allocations will be redirected anyways
after we have overallocated a single node.

This patch checks for the amount of free pages on a node. If it is lower
than a predefined limit (in /proc/sys/kernel/min_interleave_ratio) then
we avoid allocating from that node. We keep a bitmap of full nodes
that is cleared every 2 seconds when draining the pagesets for
node 0.

Should we find that all nodes are marked as full then we disregard
the limit and continue allocate from the next node round robin
without any checks.

This is only effective for interleave pages that are placed without
regard to the address in a process (anonymous pages are typically
placed depending on an interleave node generated from the address).
It applies mainly to slab interleave and page cache interleave.

We operate on full_interleave_nodes without any locking which means
that the nodemask may take on an undefined value at times. That does
not matter though since we always can fall back to operating without
full_interleave_nodes. As a result of the racyness we may uselessly
skip a node or retest a node.

RFC: http://marc.theaimsgroup.com/?t=116200376700004&r=1&w=2

RFC->V2
- Rediff against 2.6.19-rc4-mm2
- Update description

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.19-rc4-mm2/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.19-rc4-mm2.orig/Documentation/sysctl/vm.txt	2006-11-02 14:18:59.000000000 -0600
+++ linux-2.6.19-rc4-mm2/Documentation/sysctl/vm.txt	2006-11-03 13:12:04.006734590 -0600
@@ -198,6 +198,28 @@ and may not be fast.
 
 =============================================================
 
+min_interleave_ratio:
+
+This is available only on NUMA kernels.
+
+A percentage of the free pages in each zone.  If less than this
+percentage of pages are in use then interleave will attempt to
+leave this zone alone and allocate from other zones. This results
+in a balancing effect on the system if interleave and node local allocations
+are mixed throughout the system. Interleave pages will not cause zone
+reclaim and leave some memory on node to allow node local allocation to
+occur. Interleave allocations will allocate all over the system until global
+reclaim kicks in.
+
+The mininum does not apply to pages that are placed using interleave
+based on an address such as implemented for anonymous pages. It is
+effective for slab allocations, huge page allocations and page cache
+allocations.
+
+The default ratio is 10 percent.
+
+=============================================================
+
 panic_on_oom
 
 This enables or disables panic on out-of-memory feature.  If this is set to 1,
Index: linux-2.6.19-rc4-mm2/include/linux/mmzone.h
===================================================================
--- linux-2.6.19-rc4-mm2.orig/include/linux/mmzone.h	2006-11-02 14:19:34.000000000 -0600
+++ linux-2.6.19-rc4-mm2/include/linux/mmzone.h	2006-11-03 13:12:04.027244005 -0600
@@ -192,6 +192,12 @@ struct zone {
 	 */
 	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
+	/*
+	 * If a zone has less pages free then interleave will
+	 * attempt to bypass the zone
+	 */
+	unsigned long 		min_interleave_pages;
+
 	struct per_cpu_pageset	*pageset[NR_CPUS];
 #else
 	struct per_cpu_pageset	pageset[NR_CPUS];
@@ -564,6 +570,8 @@ int sysctl_min_unmapped_ratio_sysctl_han
 			struct file *, void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
+int sysctl_min_interleave_ratio_sysctl_handler(struct ctl_table *, int,
+			struct file *, void __user *, size_t *, loff_t *);
 
 #include <linux/topology.h>
 /* Returns the number of the current Node. */
Index: linux-2.6.19-rc4-mm2/include/linux/swap.h
===================================================================
--- linux-2.6.19-rc4-mm2.orig/include/linux/swap.h	2006-11-02 14:19:35.000000000 -0600
+++ linux-2.6.19-rc4-mm2/include/linux/swap.h	2006-11-03 13:12:04.049706697 -0600
@@ -197,6 +197,7 @@ extern long vm_total_pages;
 extern int zone_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
+extern int sysctl_min_interleave_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
 #define zone_reclaim_mode 0
Index: linux-2.6.19-rc4-mm2/include/linux/sysctl.h
===================================================================
--- linux-2.6.19-rc4-mm2.orig/include/linux/sysctl.h	2006-11-02 14:19:35.000000000 -0600
+++ linux-2.6.19-rc4-mm2/include/linux/sysctl.h	2006-11-03 13:13:09.660305173 -0600
@@ -203,10 +203,11 @@ enum
 	VM_MIN_UNMAPPED=32,	/* Set min percent of unmapped pages */
 	VM_PANIC_ON_OOM=33,	/* panic at out-of-memory */
 	VM_VDSO_ENABLED=34,	/* map VDSO into new processes? */
-	VM_MIN_SLAB=35,		 /* Percent pages ignored by zone reclaim */
+	VM_MIN_SLAB=35,		/* Percent pages ignored by zone reclaim */
 	VM_SWAP_PREFETCH=36,	/* swap prefetch */
 	VM_READAHEAD_RATIO=37,	/* percent of read-ahead size to thrashing-threshold */
 	VM_READAHEAD_HIT_RATE=38, /* one accessed page legitimizes so many read-ahead pages */
+	VM_MIN_INTERLEAVE=39,	/* Limit for interleave */
 };
 
 
Index: linux-2.6.19-rc4-mm2/kernel/sysctl.c
===================================================================
--- linux-2.6.19-rc4-mm2.orig/kernel/sysctl.c	2006-11-02 14:19:36.000000000 -0600
+++ linux-2.6.19-rc4-mm2/kernel/sysctl.c	2006-11-03 13:12:04.102445192 -0600
@@ -1026,6 +1026,17 @@ static ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+	{
+		.ctl_name	= VM_MIN_INTERLEAVE,
+		.procname	= "min_interleave_ratio",
+		.data		= &sysctl_min_interleave_ratio,
+		.maxlen		= sizeof(sysctl_min_interleave_ratio),
+		.mode		= 0644,
+		.proc_handler	= &sysctl_min_interleave_ratio_sysctl_handler,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
 #endif
 #ifdef CONFIG_X86_32
 	{
Index: linux-2.6.19-rc4-mm2/mm/mempolicy.c
===================================================================
--- linux-2.6.19-rc4-mm2.orig/mm/mempolicy.c	2006-11-02 14:19:37.000000000 -0600
+++ linux-2.6.19-rc4-mm2/mm/mempolicy.c	2006-11-03 13:12:04.181552934 -0600
@@ -1118,16 +1118,60 @@ static struct zonelist *zonelist_policy(
 	return NODE_DATA(nd)->node_zonelists + gfp_zone(gfp);
 }
 
+/*
+ * Generic interleave function to be used by cpusets and memory policies.
+ */
+nodemask_t full_interleave_nodes = NODE_MASK_NONE;
+
+/*
+ * Called when draining the pagesets of node 0
+ */
+void clear_full_interleave_nodes(void) {
+	nodes_clear(full_interleave_nodes);
+}
+
+int __interleave(int current_node, nodemask_t *nodes)
+{
+	unsigned next;
+	struct zone *z;
+	nodemask_t nmask;
+
+redo:
+	nodes_andnot(nmask, *nodes, full_interleave_nodes);
+	if (unlikely(nodes_empty(nmask))) {
+		/*
+		 * All allowed nodes are overallocated.
+		 * Ignore interleave limit.
+		 */
+		next = next_node(current_node, *nodes);
+		if (next >= MAX_NUMNODES)
+			next = first_node(*nodes);
+		return next;
+	}
+
+	next = next_node(current_node, nmask);
+	if (next >= MAX_NUMNODES)
+		next = first_node(nmask);
+
+	/*
+	 * Check if node is overallocated. If so the set it to full.
+	 */
+	z = &NODE_DATA(next)->node_zones[policy_zone];
+	if (unlikely(z->free_pages <= z->min_interleave_pages)) {
+		node_set(next, full_interleave_nodes);
+		goto redo;
+	}
+	return next;
+}
+
 /* Do dynamic interleaving for a process */
-static unsigned interleave_nodes(struct mempolicy *policy)
+static int interleave_nodes(struct mempolicy *policy)
 {
 	unsigned nid, next;
 	struct task_struct *me = current;
 
 	nid = me->il_next;
-	next = next_node(nid, policy->v.nodes);
-	if (next >= MAX_NUMNODES)
-		next = first_node(policy->v.nodes);
+	next = __interleave(nid, &policy->v.nodes);
 	me->il_next = next;
 	return nid;
 }
Index: linux-2.6.19-rc4-mm2/mm/page_alloc.c
===================================================================
--- linux-2.6.19-rc4-mm2.orig/mm/page_alloc.c	2006-11-02 14:19:39.000000000 -0600
+++ linux-2.6.19-rc4-mm2/mm/page_alloc.c	2006-11-03 13:12:04.201085710 -0600
@@ -711,6 +711,8 @@ void drain_node_pages(int nodeid)
 			}
 		}
 	}
+	if (!nodeid)
+		clear_full_interleave_nodes();
 }
 #endif
 
@@ -2056,6 +2058,9 @@ static void setup_pagelist_highmark(stru
 
 
 #ifdef CONFIG_NUMA
+
+int sysctl_min_interleave_ratio = 10;
+
 /*
  * Boot pageset table. One per cpu which is going to be used for all
  * zones and all nodes. The parameters will be set in such a way
@@ -2651,6 +2656,7 @@ static void __meminit free_area_init_cor
 		zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
 						/ 100;
 		zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
+		zone->min_interleave_pages = (realsize + sysctl_min_interleave_ratio) / 100;
 #endif
 		zone->name = zone_names[j];
 		spin_lock_init(&zone->lock);
@@ -3226,6 +3232,21 @@ int sysctl_min_slab_ratio_sysctl_handler
 				sysctl_min_slab_ratio) / 100;
 	return 0;
 }
+int sysctl_min_interleave_ratio_sysctl_handler(ctl_table *table, int write,
+	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+	struct zone *zone;
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
+	if (rc)
+		return rc;
+
+	for_each_zone(zone)
+		zone->min_interleave_pages = (zone->present_pages *
+				sysctl_min_interleave_ratio) / 100;
+	return 0;
+}
 #endif
 
 /*
Index: linux-2.6.19-rc4-mm2/include/linux/mempolicy.h
===================================================================
--- linux-2.6.19-rc4-mm2.orig/include/linux/mempolicy.h	2006-10-30 21:37:36.000000000 -0600
+++ linux-2.6.19-rc4-mm2/include/linux/mempolicy.h	2006-11-03 13:12:04.212805376 -0600
@@ -156,6 +156,8 @@ extern void mpol_fix_fork_child_flag(str
 #else
 #define current_cpuset_is_being_rebound() 0
 #endif
+extern int __interleave(int node, nodemask_t *nodes);
+extern void clear_full_interleave_nodes(void);
 
 extern struct mempolicy default_policy;
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
Index: linux-2.6.19-rc4-mm2/kernel/cpuset.c
===================================================================
--- linux-2.6.19-rc4-mm2.orig/kernel/cpuset.c	2006-11-02 14:19:36.000000000 -0600
+++ linux-2.6.19-rc4-mm2/kernel/cpuset.c	2006-11-03 13:12:04.228431596 -0600
@@ -2396,9 +2396,8 @@ int cpuset_mem_spread_node(void)
 {
 	int node;
 
-	node = next_node(current->cpuset_mem_spread_rotor, current->mems_allowed);
-	if (node == MAX_NUMNODES)
-		node = first_node(current->mems_allowed);
+	node = __interleave(current->cpuset_mem_spread_rotor,
+			&current->mems_allowed);
 	current->cpuset_mem_spread_rotor = node;
 	return node;
 }

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-03 20:58 Avoid allocating during interleave from almost full nodes Christoph Lameter
@ 2006-11-03 21:46 ` Andrew Morton
  2006-11-03 22:10   ` Christoph Lameter
  2006-11-04 10:35   ` CTL_UNNUMBERED and killing sys_sysctl Eric W. Biederman
  0 siblings, 2 replies; 29+ messages in thread
From: Andrew Morton @ 2006-11-03 21:46 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

On Fri, 3 Nov 2006 12:58:24 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> Interleave allocation often go over large sets of nodes. Some nodes
> may have tasks on them that heavily use memory

"heavily used" means "referenced" and maybe "active" and maybe "dirty".
See below.

> and rely on node local
> allocations to get optimium performance. Overallocating those nodes may
> reduce performance of those tasks by forcing off node allocations and
> additional reclaim passes. It is better if we try to avoid nodes
> that have most of its memory used and focus on nodes that still have lots
> of memory available.
> 
> The intend of interleave is to have allocations spread out over a set of
> nodes because the data is likely to be used from any of those nodes. It is
> not important that we keep the exact sequence of allocations at all times.
> 
> The exact node we choose during interleave does not matter much if we are
> under memory pressure since the allocations will be redirected anyways
> after we have overallocated a single node.

Am not clear on what that means.

> This patch checks for the amount of free pages on a node. If it is lower
> than a predefined limit (in /proc/sys/kernel/min_interleave_ratio) then

You mean /proc/sys/vm

> we avoid allocating from that node. We keep a bitmap of full nodes
> that is cleared every 2 seconds when draining the pagesets for
> node 0.

Wall time is a bogus concept in the VM.  Can we please stop relying upon it?

> Should we find that all nodes are marked as full then we disregard
> the limit and continue allocate from the next node round robin
> without any checks.
> 
> This is only effective for interleave pages that are placed without
> regard to the address in a process (anonymous pages are typically
> placed depending on an interleave node generated from the address).
> It applies mainly to slab interleave and page cache interleave.
> 
> We operate on full_interleave_nodes without any locking which means
> that the nodemask may take on an undefined value at times. That does
> not matter though since we always can fall back to operating without
> full_interleave_nodes. As a result of the racyness we may uselessly
> skip a node or retest a node.

This design relies upon nodes having certain amounts of free memory.  This
concept is bogus.  Because it treats clean pagecache which hasn't been used
since last Saturday as "in use".  It is not in use.

This false distinction between free pages and trivially-reclaimable pages
is specific to particular workloads on particular machines hence this
design is not generally useful.


Perhaps a better design would be to key the decision off the page reclaim
scanning priority.

> 
> Index: linux-2.6.19-rc4-mm2/Documentation/sysctl/vm.txt
> ===================================================================
> --- linux-2.6.19-rc4-mm2.orig/Documentation/sysctl/vm.txt	2006-11-02 14:18:59.000000000 -0600
> +++ linux-2.6.19-rc4-mm2/Documentation/sysctl/vm.txt	2006-11-03 13:12:04.006734590 -0600
> @@ -198,6 +198,28 @@ and may not be fast.
>  
>  =============================================================
>  
> +min_interleave_ratio:
> +
> +This is available only on NUMA kernels.
> +
> +A percentage of the free pages in each zone.  If less than this
> +percentage of pages are in use then interleave will attempt to
> +leave this zone alone and allocate from other zones. This results
> +in a balancing effect on the system if interleave and node local allocations
> +are mixed throughout the system. Interleave pages will not cause zone
> +reclaim and leave some memory on node to allow node local allocation to
> +occur. Interleave allocations will allocate all over the system until global
> +reclaim kicks in.
> +
> +The mininum does not apply to pages that are placed using interleave
> +based on an address such as implemented for anonymous pages. It is
> +effective for slab allocations, huge page allocations and page cache
> +allocations.
> +
> +The default ratio is 10 percent.
> +

That has several typos and grammatical mistakes.

> +	VM_MIN_INTERLEAVE=39,	/* Limit for interleave */

I think we recently decided to set all new sysctl number to CTL_UNNUMBERED.
 Eric, can you remind us of the thinkin there please?

> --- linux-2.6.19-rc4-mm2.orig/mm/mempolicy.c	2006-11-02 14:19:37.000000000 -0600
> +++ linux-2.6.19-rc4-mm2/mm/mempolicy.c	2006-11-03 13:12:04.181552934 -0600
> @@ -1118,16 +1118,60 @@ static struct zonelist *zonelist_policy(
>  	return NODE_DATA(nd)->node_zonelists + gfp_zone(gfp);
>  }
>  
> +/*
> + * Generic interleave function to be used by cpusets and memory policies.
> + */
> +nodemask_t full_interleave_nodes = NODE_MASK_NONE;
> +
> +/*
> + * Called when draining the pagesets of node 0
> + */
> +void clear_full_interleave_nodes(void) {
> +	nodes_clear(full_interleave_nodes);
> +}

coding style.

> +int __interleave(int current_node, nodemask_t *nodes)
> +{
> +	unsigned next;
> +	struct zone *z;
> +	nodemask_t nmask;
> +
> +redo:
> +	nodes_andnot(nmask, *nodes, full_interleave_nodes);
> +	if (unlikely(nodes_empty(nmask))) {
> +		/*
> +		 * All allowed nodes are overallocated.
> +		 * Ignore interleave limit.
> +		 */
> +		next = next_node(current_node, *nodes);
> +		if (next >= MAX_NUMNODES)
> +			next = first_node(*nodes);
> +		return next;
> +	}
> +
> +	next = next_node(current_node, nmask);
> +	if (next >= MAX_NUMNODES)
> +		next = first_node(nmask);
> +
> +	/*
> +	 * Check if node is overallocated. If so the set it to full.
> +	 */
> +	z = &NODE_DATA(next)->node_zones[policy_zone];
> +	if (unlikely(z->free_pages <= z->min_interleave_pages)) {
> +		node_set(next, full_interleave_nodes);
> +		goto redo;
> +	}
> +	return next;
> +}

This function would benefit from an introductory comment.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-03 21:46 ` Andrew Morton
@ 2006-11-03 22:10   ` Christoph Lameter
  2006-11-03 22:31     ` Andrew Morton
  2006-11-04 10:35   ` CTL_UNNUMBERED and killing sys_sysctl Eric W. Biederman
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2006-11-03 22:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Fri, 3 Nov 2006, Andrew Morton wrote:

> > The exact node we choose during interleave does not matter much if we are
> > under memory pressure since the allocations will be redirected anyways
> > after we have overallocated a single node.
> 
> Am not clear on what that means.

If we currently overallocate a node then we fall back to other nodes along 
the zonelist. We will not be able to allocate on the intended node and 
the next interleave node becomes the nearest node with enough memory.

> > This patch checks for the amount of free pages on a node. If it is lower
> > than a predefined limit (in /proc/sys/kernel/min_interleave_ratio) then
> 
> You mean /proc/sys/vm

Right.

> > we avoid allocating from that node. We keep a bitmap of full nodes
> > that is cleared every 2 seconds when draining the pagesets for
> > node 0.
> 
> Wall time is a bogus concept in the VM.  Can we please stop relying upon it?

We use the same 2 second pulse to drain slab caches, and the page 
allocators per cpu caches. The slab draining has been around forever. Its 
relying on jiffies and not on wall time.

> > not matter though since we always can fall back to operating without
> > full_interleave_nodes. As a result of the racyness we may uselessly
> > skip a node or retest a node.
> 
> This design relies upon nodes having certain amounts of free memory.  This
> concept is bogus.  Because it treats clean pagecache which hasn't been used
> since last Saturday as "in use".  It is not in use.

It relies on free pages, not on in use pages. The attempt is to bypass 
expensive reclaim as long as we can find free memory on other nodes.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-03 22:10   ` Christoph Lameter
@ 2006-11-03 22:31     ` Andrew Morton
  2006-11-04  0:28       ` Christoph Lameter
  2006-11-04  1:26       ` Paul Jackson
  0 siblings, 2 replies; 29+ messages in thread
From: Andrew Morton @ 2006-11-03 22:31 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

On Fri, 3 Nov 2006 14:10:01 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> > 
> > Wall time is a bogus concept in the VM.  Can we please stop relying upon it?
> 
> We use the same 2 second pulse to drain slab caches, and the page 
> allocators per cpu caches. The slab draining has been around forever.

And it doesn't make sense there either.

Well.  There are situations where it makes a _bit_ of sense: in those
places where we are trying to determine whether a piece of memory is still
in the CPU's cache.  If we assume that the CPU is evicting cachelines at a
constant lines/sec rate then yes, using walltime has some correlation with
reality.

But in this application which you are proposing, any correlation with
elapsed walltime is very slight.  It's just the wrong baseline to use. 
What is the *sense* in it?

> > > not matter though since we always can fall back to operating without
> > > full_interleave_nodes. As a result of the racyness we may uselessly
> > > skip a node or retest a node.
> > 
> > This design relies upon nodes having certain amounts of free memory.  This
> > concept is bogus.  Because it treats clean pagecache which hasn't been used
> > since last Saturday as "in use".  It is not in use.
> 
> It relies on free pages, not on in use pages.

Yes.  And it is wrong to do so.  Because a node may well have no "free"
pages but plenty of completely stale ones which should be reclaimed.

This patch is very specific to the one particular usage scenario which your
users happen to deploy but is quite ineffective for other (and quite valid)
usage scenarios.

> The attempt is to bypass 
> expensive reclaim as long as we can find free memory on other nodes.

Reclaim isn't expensive.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-03 22:31     ` Andrew Morton
@ 2006-11-04  0:28       ` Christoph Lameter
  2006-11-04  0:58         ` Andrew Morton
  2006-11-04  1:26       ` Paul Jackson
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2006-11-04  0:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Fri, 3 Nov 2006, Andrew Morton wrote:

> But in this application which you are proposing, any correlation with
> elapsed walltime is very slight.  It's just the wrong baseline to use. 
> What is the *sense* in it?

You just accepted Paul's use of a similar mechanism to void cached 
zonelists. He has a one second timeout for the cache there it seems.

The sense is that memory on nodes may be freed and then we need to 
allocate from those nodes again.

> Yes.  And it is wrong to do so.  Because a node may well have no "free"
> pages but plenty of completely stale ones which should be reclaimed.

But there may be other nodes that have more free pages. If we allocate 
from those then we can avoid reclaim.

> Reclaim isn't expensive.

It is needlessly expensive if its done for an allocation that is not bound 
to a specific node and there are other nodes with free pages. We may throw 
out pages that we need later.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-04  0:28       ` Christoph Lameter
@ 2006-11-04  0:58         ` Andrew Morton
  2006-11-06 16:53           ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2006-11-04  0:58 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

On Fri, 3 Nov 2006 16:28:31 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Fri, 3 Nov 2006, Andrew Morton wrote:
> 
> > But in this application which you are proposing, any correlation with
> > elapsed walltime is very slight.  It's just the wrong baseline to use. 
> > What is the *sense* in it?
> 
> You just accepted Paul's use of a similar mechanism to void cached 
> zonelists. He has a one second timeout for the cache there it seems.

With complaints.

> The sense is that memory on nodes may be freed and then we need to 
> allocate from those nodes again.

This has almost nothing to do with elapsed time.

How about doing, in free_pages_bulk():

	if (zone->over_interleave_pages) {
		zone->over_interleave_pages = 0;
		node_clear(zone_to_nid(zone), full_interleave_nodes);
	}

?

> > Yes.  And it is wrong to do so.  Because a node may well have no "free"
> > pages but plenty of completely stale ones which should be reclaimed.
> 
> But there may be other nodes that have more free pages. If we allocate 
> from those then we can avoid reclaim.
> 
> > Reclaim isn't expensive.
> 
> It is needlessly expensive if its done for an allocation that is not bound 
> to a specific node and there are other nodes with free pages. We may throw 
> out pages that we need later.

Well it grossly changes the meaning of "interleaving".  We might as well
call it something else.  It's not necessarily worse, but it's not
interleaved any more.

Actually by staying on the same node for a string of successive allocations
it could well be quicker.  How come MPOL_INTERLEAVE doesn't already do some
batching?   Or does it, and I missed it?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-03 22:31     ` Andrew Morton
  2006-11-04  0:28       ` Christoph Lameter
@ 2006-11-04  1:26       ` Paul Jackson
  2006-11-04  1:42         ` Andrew Morton
  1 sibling, 1 reply; 29+ messages in thread
From: Paul Jackson @ 2006-11-04  1:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: clameter, linux-kernel

Andrew wrote:
> But in this application which you are proposing, any correlation with
> elapsed walltime is very slight.  It's just the wrong baseline to use. 
> What is the *sense* in it?

Ah - but time is cheap as dirt, and scales like the common cold virus.
That makes it sinfully attractive for secondary affect placement cache
hints like this.

What else would you suggest?

Same question applies, I suppose, to my zonelist caching patch that is
sitting in your *-mm patch stack, where you also had doubts about using
wall clock time to decay the fullnode hints.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-04  1:26       ` Paul Jackson
@ 2006-11-04  1:42         ` Andrew Morton
  2006-11-04 10:51           ` Paul Jackson
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2006-11-04  1:42 UTC (permalink / raw)
  To: Paul Jackson; +Cc: clameter, linux-kernel

On Fri, 3 Nov 2006 17:26:05 -0800
Paul Jackson <pj@sgi.com> wrote:

> Andrew wrote:
> > But in this application which you are proposing, any correlation with
> > elapsed walltime is very slight.  It's just the wrong baseline to use. 
> > What is the *sense* in it?
> 
> Ah - but time is cheap as dirt, and scales like the common cold virus.
> That makes it sinfully attractive for secondary affect placement cache
> hints like this.
> 
> What else would you suggest?
> 
> Same question applies, I suppose, to my zonelist caching patch that is
> sitting in your *-mm patch stack, where you also had doubts about using
> wall clock time to decay the fullnode hints.

Depends what it's doing.  "number of pages allocated" would be a good
"clock" to use in the VM.  Or pages scanned.  Or per-cpu-pages reloads. 
Something which adjusts to what's going on.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* CTL_UNNUMBERED and killing sys_sysctl
  2006-11-03 21:46 ` Andrew Morton
  2006-11-03 22:10   ` Christoph Lameter
@ 2006-11-04 10:35   ` Eric W. Biederman
  1 sibling, 0 replies; 29+ messages in thread
From: Eric W. Biederman @ 2006-11-04 10:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-kernel

Andrew Morton <akpm@osdl.org> writes:

> That has several typos and grammatical mistakes.
>
>> +	VM_MIN_INTERLEAVE=39,	/* Limit for interleave */
>
> I think we recently decided to set all new sysctl number to CTL_UNNUMBERED.
>  Eric, can you remind us of the thinkin there please?

Sure.  Sorry for the delay you buried the question well.

The basic thinking goes as follows.  To properly allocate the
numbers for the binary sysctl interface requires a lot of discipline that
we have proven that we don't always have.  Essentially no one uses
the binary sysctl interface anyway.  Therefore CTL_UNNUMBERED was
introduced so we don't need to allocate a binary sysctl number
to add a sysctl to the /proc/sys, interface.

This avoids approach patch decay before the patch is merged upstream.

So in general if you really need a new binary sysctl number the approach
should be first get your patch merged into Linus's tree and then get
an additional 3 line patch merged into Linus's tree to get your number.

I probably need to wake the conversation up again to see if we can make
the final determinate if we want to drop the binary sysctl interface
after having a long grace period, or simply commit to maintain it.  Linus's
tree still has the binary interface slated for removal in January 2007,
that was only appropriate when we believed there were no users in user space
that cared.

The big maintenance problem has been the bit rot of patches where
people allocate the next number and their patches take a long time to
get into Linus's tree.  So by the time they are merged the patches
conflict over which number they get, and by that time the code has
shipped with a binary interface in a distro kernel.

CTL_UNNUMBERED by freeing us from allocating the binary interface
and just using the file based one gives us a mechanism to solve that
maintenance problem.  I have not heard of a conflict of file names
under /proc/sys.

Andrew can we get the CTL_UNNUMBERED patches pushed up to Linus?

Eric

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-04  1:42         ` Andrew Morton
@ 2006-11-04 10:51           ` Paul Jackson
  2006-11-06 16:56             ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Paul Jackson @ 2006-11-04 10:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: clameter, linux-kernel

Andrew wrote:
> Depends what it's doing.  "number of pages allocated" would be a good
> "clock" to use in the VM.  Or pages scanned.  Or per-cpu-pages reloads. 
> Something which adjusts to what's going on.

Christoph,

  Do you know of any existing counters that we could use like this?

Adding a system wide count of pages allocated or scanned, just for
these fullnode hint caches, bothers me.

Sure, Andrew is right in the purist sense.  The connection to any
wall clock time base for these events is tenuous at best.

But if the tradeoff is:
 1) a new global counter on the pager allocator or scanning path,
 2) versus an impure heuristic for zapping these full node hints,

then I can't justify the new counter.  I work hard on this stuff to
keep any frequently written global data off hot code paths.

I just don't see any real world case where having a bogus time base for
these fullnode zaps actually hurts anyone.  A global counter in the
main allocator or scanning code paths hurts everyone (well, everyone on
big NUMA boxes, anyhow ... ;).

It might not matter for this here interleave refinement patch (which has
other open questions), but it could at least (in theory) benefit my
zonelist caching patch to get a more reasonable trigger for zapping the
fullnode hint cache.

Even using an existing counter isn't "free."  The more readers a
frequently updated warm cache line has, the hotter it gets.

Perhaps best if we used a node or cpu local counter.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-04  0:58         ` Andrew Morton
@ 2006-11-06 16:53           ` Christoph Lameter
  2006-11-06 19:59             ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2006-11-06 16:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Fri, 3 Nov 2006, Andrew Morton wrote:

> This has almost nothing to do with elapsed time.
> 
> How about doing, in free_pages_bulk():
> 
> 	if (zone->over_interleave_pages) {
> 		zone->over_interleave_pages = 0;
> 		node_clear(zone_to_nid(zone), full_interleave_nodes);
> 	}

Hmmm... We would also have to compare to the mininum pages 
required before clearing the node. Isnt it a bit much to have two 
comparisons added to the page free path?

> > It is needlessly expensive if its done for an allocation that is not bound 
> > to a specific node and there are other nodes with free pages. We may throw 
> > out pages that we need later.
> 
> Well it grossly changes the meaning of "interleaving".  We might as well
> call it something else.  It's not necessarily worse, but it's not
> interleaved any more.

It is going from node to node unless there is significant imbalance with 
some nodes being over the limit and some under. Then the allocations will 
take place round robin from the nodes under the limit until all are under 
the limit. Then we continue going over all nodes again.

> Actually by staying on the same node for a string of successive allocations
> it could well be quicker.  How come MPOL_INTERLEAVE doesn't already do some
> batching?   Or does it, and I missed it?

It should do interleaving because the data is to be accessed from multiple 
nodes. Clustering on a single node may create hotspots or imbalances. 
Hmmm... We should check how many nodes are remaining if there is just a 
single node left then we need to ignore the limit.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-04 10:51           ` Paul Jackson
@ 2006-11-06 16:56             ` Christoph Lameter
  2006-11-08 10:21               ` Paul Jackson
  2006-12-01  7:51               ` Paul Jackson
  0 siblings, 2 replies; 29+ messages in thread
From: Christoph Lameter @ 2006-11-06 16:56 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Andrew Morton, linux-kernel

On Sat, 4 Nov 2006, Paul Jackson wrote:

>   Do you know of any existing counters that we could use like this?
> 
> Adding a system wide count of pages allocated or scanned, just for
> these fullnode hint caches, bothers me.

There are already such counters. PGALLOC_* and PGSCAN_*. See 
include/linux/vmstat.h

> Perhaps best if we used a node or cpu local counter.

The counters are per cpu and are cpu local.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-06 16:53           ` Christoph Lameter
@ 2006-11-06 19:59             ` Andrew Morton
  2006-11-06 20:12               ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2006-11-06 19:59 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

On Mon, 6 Nov 2006 08:53:22 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Fri, 3 Nov 2006, Andrew Morton wrote:
> 
> > This has almost nothing to do with elapsed time.
> > 
> > How about doing, in free_pages_bulk():
> > 
> > 	if (zone->over_interleave_pages) {
> > 		zone->over_interleave_pages = 0;
> > 		node_clear(zone_to_nid(zone), full_interleave_nodes);
> > 	}
> 
> Hmmm... We would also have to compare to the mininum pages 
> required before clearing the node.

OK.

> Isnt it a bit much to have two 
> comparisons added to the page free path?

Page freeing is not actually a fastpath.  It's rate-limited by the
frequency at which the CPU can _use_ the page: by filling it from disk, or
by writing to all of the page with the CPU.

Plus this is free_pages_bulk(), so the additional test occurs once per
per_cpu_pages.batch pages, not once per page.

And I assume it could be brought down to a single comparison with some
thought.

> > > It is needlessly expensive if its done for an allocation that is not bound 
> > > to a specific node and there are other nodes with free pages. We may throw 
> > > out pages that we need later.
> > 
> > Well it grossly changes the meaning of "interleaving".  We might as well
> > call it something else.  It's not necessarily worse, but it's not
> > interleaved any more.
> 
> It is going from node to node unless there is significant imbalance with 
> some nodes being over the limit and some under. Then the allocations will 
> take place round robin from the nodes under the limit until all are under 
> the limit. Then we continue going over all nodes again.

<head spins>

> > Actually by staying on the same node for a string of successive allocations
> > it could well be quicker.  How come MPOL_INTERLEAVE doesn't already do some
> > batching?   Or does it, and I missed it?
> 
> It should do interleaving because the data is to be accessed from multiple 
> nodes.

I think you missed the point.

At present the code does interleaving by taking one page from each zone and
then advancing onto the next zone, yes?

If so, this is pretty awful frmo a cache utilsiation POV.  it'd be much
better to take 16 pages from one zone before advancing onto the next one.

> Clustering on a single node may create hotspots or imbalances. 

Umm, but that's exactly what the patch we're discussing will do.

> Hmmm... We should check how many nodes are remaining if there is just a 
> single node left then we need to ignore the limit.

yup.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-06 19:59             ` Andrew Morton
@ 2006-11-06 20:12               ` Christoph Lameter
  2006-11-06 20:24                 ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2006-11-06 20:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Mon, 6 Nov 2006, Andrew Morton wrote:

> > It should do interleaving because the data is to be accessed from multiple 
> > nodes.
> 
> I think you missed the point.
> 
> At present the code does interleaving by taking one page from each zone and
> then advancing onto the next zone, yes?

s/zone/node/ then yes (zone == node if we just have a single zone).

> If so, this is pretty awful frmo a cache utilsiation POV.  it'd be much
> better to take 16 pages from one zone before advancing onto the next one.

The L1/L2 cpu cache or the pageset hot / cold caches? Take N pages 
from a node instead of 1? That would mean we need to have more complex 
interleaving logic that keeps track of how many pages we took. The number 
of pages to take will vary depending on the size of the shared data. For 
shared data areas that are just a couple of pages this wont work.

> > Clustering on a single node may create hotspots or imbalances. 
> 
> Umm, but that's exactly what the patch we're discussing will do.

Not if we have a set of remaining nodes.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-06 20:12               ` Christoph Lameter
@ 2006-11-06 20:24                 ` Andrew Morton
  2006-11-06 20:31                   ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2006-11-06 20:24 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

On Mon, 6 Nov 2006 12:12:50 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Mon, 6 Nov 2006, Andrew Morton wrote:
> 
> > > It should do interleaving because the data is to be accessed from multiple 
> > > nodes.
> > 
> > I think you missed the point.
> > 
> > At present the code does interleaving by taking one page from each zone and
> > then advancing onto the next zone, yes?
> 
> s/zone/node/ then yes (zone == node if we just have a single zone).
> 
> > If so, this is pretty awful frmo a cache utilsiation POV.  it'd be much
> > better to take 16 pages from one zone before advancing onto the next one.
> 
> The L1/L2 cpu cache or the pageset hot / cold caches?

I'm referring to the metadata rather than to the pages themselves: the zone
structure at least.  I bet there are a couple of cache misses in there.

> Take N pages 
> from a node instead of 1? That would mean we need to have more complex 
> interleaving logic that keeps track of how many pages we took.

It's hardly rocket science.  Stick a nid and a counter in the task_struct
for a simple implmentation.

> The number 
> of pages to take will vary depending on the size of the shared data. For 
> shared data areas that are just a couple of pages this wont work.

What is "shared data"?

> > > Clustering on a single node may create hotspots or imbalances. 
> > 
> > Umm, but that's exactly what the patch we're discussing will do.
> 
> Not if we have a set of remaining nodes.

Yes it is.  You're proposing taking an arbitrarily large number of
successive pages from the same node rather than interleaving the allocations.
That will create "hotspots or imbalances" (whatever they are).

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-06 20:24                 ` Andrew Morton
@ 2006-11-06 20:31                   ` Christoph Lameter
  2006-11-06 20:42                     ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2006-11-06 20:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Mon, 6 Nov 2006, Andrew Morton wrote:

> I'm referring to the metadata rather than to the pages themselves: the zone
> structure at least.  I bet there are a couple of cache misses in there.

Yes, in particular in large systems.

> > The number 
> > of pages to take will vary depending on the size of the shared data. For 
> > shared data areas that are just a couple of pages this wont work.
> 
> What is "shared data"?

Interleave is used for data accessed from many nodes otherwise one would 
prefer to allocate from the current zone. The shared data may be very 
frequently accessed from multiple nodes and one would like different NUMA 
nodes to respond to these requests.

> > > Umm, but that's exactly what the patch we're discussing will do.
> > Not if we have a set of remaining nodes.
> 
> Yes it is.  You're proposing taking an arbitrarily large number of
> successive pages from the same node rather than interleaving the allocations.
> That will create "hotspots or imbalances" (whatever they are).

No I proposed to go round robin over the remaining nodes. The special case 
of one node left could be dealt with.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-06 20:31                   ` Christoph Lameter
@ 2006-11-06 20:42                     ` Andrew Morton
  2006-11-06 20:58                       ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2006-11-06 20:42 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

On Mon, 6 Nov 2006 12:31:36 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Mon, 6 Nov 2006, Andrew Morton wrote:
> 
> > I'm referring to the metadata rather than to the pages themselves: the zone
> > structure at least.  I bet there are a couple of cache misses in there.
> 
> Yes, in particular in large systems.
> 
> > > The number 
> > > of pages to take will vary depending on the size of the shared data. For 
> > > shared data areas that are just a couple of pages this wont work.
> > 
> > What is "shared data"?
> 
> Interleave is used for data accessed from many nodes otherwise one would 
> prefer to allocate from the current zone. The shared data may be very 
> frequently accessed from multiple nodes and one would like different NUMA 
> nodes to respond to these requests.

But what is "shared data"??  You're using a new but very general term
without defining it.

> > > > Umm, but that's exactly what the patch we're discussing will do.
> > > Not if we have a set of remaining nodes.
> > 
> > Yes it is.  You're proposing taking an arbitrarily large number of
> > successive pages from the same node rather than interleaving the allocations.
> > That will create "hotspots or imbalances" (whatever they are).
> 
> No I proposed to go round robin over the remaining nodes. The special case 
> of one node left could be dealt with.

OK, but if two nodes have a lot of free pages and the rest don't then
interleave will consume those free pages without performing any reclaim
from all the other nodes.  Hence hostpots or imbalances.

Whatever they are.  Why does it matter?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-06 20:42                     ` Andrew Morton
@ 2006-11-06 20:58                       ` Christoph Lameter
  2006-11-06 21:20                         ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2006-11-06 20:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Mon, 6 Nov 2006, Andrew Morton wrote:

> > Interleave is used for data accessed from many nodes otherwise one would 
> > prefer to allocate from the current zone. The shared data may be very 
> > frequently accessed from multiple nodes and one would like different NUMA 
> > nodes to respond to these requests.
> 
> But what is "shared data"??  You're using a new but very general term
> without defining it.

Data that is shared by applications or by the kernel. The user space 
programs may allocate shared data with interleave policy. For certain data 
the kernel may use interleave allocations. F.e. page cache pages in a 
cpuset configured for memory spreading.

It depends what the application or the kernel designates to be shared 
data.

> OK, but if two nodes have a lot of free pages and the rest don't then
> interleave will consume those free pages without performing any reclaim
> from all the other nodes.  Hence hostpots or imbalances.
> 
> Whatever they are.  Why does it matter?

Hotspots create lots of requests going to the same numa node. The nodes 
have a limited capability to service cacheline requests and the bandwidth 
on the interlink is also limited. If too many processors request 
information from the same remote node then performance will drop.

There are different kind of data in a NUMA system:

Data that is node local is only accessed by the local processor. For node 
local data we have no such concerns since the interlink is not used. Quite 
a lot of kernel data per node or per cpu and thus is not a problem.

For shared data that is known to be performance critical--and where we 
know that the data is accessed from multiple nodes--there we need to 
balance the data between multiple nodes to avoid overloads and 
to keep the system running at optimal speed. That is where interleave 
becomes important.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-06 20:58                       ` Christoph Lameter
@ 2006-11-06 21:20                         ` Andrew Morton
  2006-11-06 21:42                           ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2006-11-06 21:20 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

On Mon, 6 Nov 2006 12:58:52 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> > OK, but if two nodes have a lot of free pages and the rest don't then
> > interleave will consume those free pages without performing any reclaim
> > from all the other nodes.  Hence hostpots or imbalances.
> > 
> > Whatever they are.  Why does it matter?
> 
> Hotspots create lots of requests going to the same numa node. The nodes 
> have a limited capability to service cacheline requests and the bandwidth 
> on the interlink is also limited. If too many processors request 
> information from the same remote node then performance will drop.

OK.

> There are different kind of data in a NUMA system:
> 
> Data that is node local is only accessed by the local processor. For node 
> local data we have no such concerns since the interlink is not used. Quite 
> a lot of kernel data per node or per cpu and thus is not a problem.
> 
> For shared data that is known to be performance critical--and where we 
> know that the data is accessed from multiple nodes--there we need to 
> balance the data between multiple nodes to avoid overloads and 
> to keep the system running at optimal speed. That is where interleave 
> becomes important.

But doesn't this patch introduce considerable risks of the above problems
occurring?  In the two-nodes-have-lots-of-free-memory scenario?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-06 21:20                         ` Andrew Morton
@ 2006-11-06 21:42                           ` Christoph Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2006-11-06 21:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Mon, 6 Nov 2006, Andrew Morton wrote:

> But doesn't this patch introduce considerable risks of the above problems
> occurring?  In the two-nodes-have-lots-of-free-memory scenario?

If two nodes have lots of memory then we will alternate between both 
nodes. If one of the nodes is going below the interleave limit then we 
will indeed only allocate from that single node. At some point both are 
dropping below the limit and we will revert back to alternating.

We can avoid the phase where we only allocate from one node by checking 
the node weight of the available nodes instead of checking for an empty 
node mask.

For systems with less than 3 nodes the approach will not be useful. What I 
had in mind when writing this patch were systems with a large number of 
nodes segmented by cpusets into smaller slices. The segments would 
still be greater than 4 nodes.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-06 16:56             ` Christoph Lameter
@ 2006-11-08 10:21               ` Paul Jackson
  2006-11-08 15:18                 ` Peter Zijlstra
  2006-12-01  7:51               ` Paul Jackson
  1 sibling, 1 reply; 29+ messages in thread
From: Paul Jackson @ 2006-11-08 10:21 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-kernel

Christoph wrote:
> On Sat, 4 Nov 2006, Paul Jackson wrote:
> 
> >   Do you know of any existing counters that we could use like this?
> > 
> > Adding a system wide count of pages allocated or scanned, just for
> > these fullnode hint caches, bothers me.
> 
> There are already such counters. PGALLOC_* and PGSCAN_*. See 
> include/linux/vmstat.h


  Andrew,

    I'm willing to take a shot at replacing the wall clock time
    base with one of these vm counters, in my patch in *-mm:

	memory-page_alloc-zonelist-caching-speedup.patch

    But it will be a few weeks before I can get to it.

    I really need to do some other stuff first.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-08 10:21               ` Paul Jackson
@ 2006-11-08 15:18                 ` Peter Zijlstra
  2006-11-08 17:06                   ` Paul Jackson
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2006-11-08 15:18 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Christoph Lameter, akpm, linux-kernel

On Wed, 2006-11-08 at 02:21 -0800, Paul Jackson wrote:
> Christoph wrote:
> > On Sat, 4 Nov 2006, Paul Jackson wrote:
> > 
> > >   Do you know of any existing counters that we could use like this?
> > > 
> > > Adding a system wide count of pages allocated or scanned, just for
> > > these fullnode hint caches, bothers me.
> > 
> > There are already such counters. PGALLOC_* and PGSCAN_*. See 
> > include/linux/vmstat.h
> 
> 
>   Andrew,
> 
>     I'm willing to take a shot at replacing the wall clock time
>     base with one of these vm counters, in my patch in *-mm:
> 
> 	memory-page_alloc-zonelist-caching-speedup.patch
> 
>     But it will be a few weeks before I can get to it.
> 
>     I really need to do some other stuff first.

The swap token code in -mm (which I still have to review) has a global
fault counter to measure 'time'. Perhaps we can generalise that.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-08 15:18                 ` Peter Zijlstra
@ 2006-11-08 17:06                   ` Paul Jackson
  2006-11-08 17:09                     ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Paul Jackson @ 2006-11-08 17:06 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: clameter, akpm, linux-kernel

Peter wrote:
> global fault counter

I hope we avoid frequently updated, widely accessed global
counters.  They tend to create hot cache lines on big NUMA
boxes.

Christoph said that the counters he was suggesting were
node or cpu local.  That sounds good to me.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-08 17:06                   ` Paul Jackson
@ 2006-11-08 17:09                     ` Peter Zijlstra
  2006-11-08 17:21                       ` Paul Jackson
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2006-11-08 17:09 UTC (permalink / raw)
  To: Paul Jackson; +Cc: clameter, akpm, linux-kernel

On Wed, 2006-11-08 at 09:06 -0800, Paul Jackson wrote:
> Peter wrote:
> > global fault counter
> 
> I hope we avoid frequently updated, widely accessed global
> counters.  They tend to create hot cache lines on big NUMA
> boxes.
> 
> Christoph said that the counters he was suggesting were
> node or cpu local.  That sounds good to me.

Very true indeed, I was just hoping we could come up with 1 vm-time; but
perhaps the local vs global thing will keep us from that.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-08 17:09                     ` Peter Zijlstra
@ 2006-11-08 17:21                       ` Paul Jackson
  2006-11-08 17:40                         ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Paul Jackson @ 2006-11-08 17:21 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: clameter, akpm, linux-kernel

Peter wrote:
> I was just hoping we could come up with 1 vm-time

Eh - why?

At least for secondary uses such as this, marking little node local
caches stale, so we are forced to refresh them occassionally, I'd
almost as soon avoid beating with other activity, and keeping the
affect rather like background white noise.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-08 17:21                       ` Paul Jackson
@ 2006-11-08 17:40                         ` Christoph Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2006-11-08 17:40 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Peter Zijlstra, akpm, linux-kernel

The event counters we are considering are per cpu and you can ask the vm 
statistics subsystem to give you per cpu or global counts. The global 
counts are calculated by summing up all per processor counts.

We also have other counters (ZVC) that are per zone (they are updated per 
cpu per zone.. and are extremely scalable as well). Values can be obtained 
for those by zone, node or global. The global counters and the per zone 
counters do *not* have to be summed up (unlike event counters) but are 
kept current (within a certain delta).

If you need global counters and want to avoid summing up over all 
processors then I would suggest that you use a ZVC or look at the existing 
ZVCs and see if any of those are usable for you.

For ZVCs see include/linux/mmzone.h

For event counters see include/linux/vmstat.h

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-11-06 16:56             ` Christoph Lameter
  2006-11-08 10:21               ` Paul Jackson
@ 2006-12-01  7:51               ` Paul Jackson
  2006-12-01  7:59                 ` Andrew Morton
  2006-12-01 16:27                 ` Christoph Lameter
  1 sibling, 2 replies; 29+ messages in thread
From: Paul Jackson @ 2006-12-01  7:51 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-kernel

A month ago, Christoph replied to pj:
>
> On Sat, 4 Nov 2006, Paul Jackson wrote:
> 
> >   Do you know of any existing counters that we could use like this?
> > 
> > Adding a system wide count of pages allocated or scanned, just for
> > these fullnode hint caches, bothers me.
> 
> There are already such counters. PGALLOC_* and PGSCAN_*. See 
> include/linux/vmstat.h

These counters depend on CONFIG_VM_EVENT_COUNTERS.

The Kconfig comment for CONFIG_VM_EVENT_COUNTERS states:

          VM event counters are only needed to for event counts to be
          shown. They have no function for the kernel itself. This
          option allows the disabling of the VM event counters.
          /proc/vmstat will only show page counts.

(By the way - note the "needed to for event" phrasing error.)

The header file, include/linux/vmstat.h, for these counters states:

	/*
	 * Light weight per cpu counter implementation.
	 *
	 * Counters should only be incremented and no critical kernel component
	 * should rely on the counter values.

Both these clearly state that I should not use these counters for real
kernel functions.

If that is so, I should find some other "time base" for the zonelist
caching.

If that is not so, then these comments need updating.

Anybody have any idea which is the case?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-12-01  7:51               ` Paul Jackson
@ 2006-12-01  7:59                 ` Andrew Morton
  2006-12-01 16:27                 ` Christoph Lameter
  1 sibling, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2006-12-01  7:59 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Christoph Lameter, linux-kernel

On Thu, 30 Nov 2006 23:51:17 -0800
Paul Jackson <pj@sgi.com> wrote:

> A month ago, Christoph replied to pj:
> >
> > On Sat, 4 Nov 2006, Paul Jackson wrote:
> > 
> > >   Do you know of any existing counters that we could use like this?
> > > 
> > > Adding a system wide count of pages allocated or scanned, just for
> > > these fullnode hint caches, bothers me.
> > 
> > There are already such counters. PGALLOC_* and PGSCAN_*. See 
> > include/linux/vmstat.h
> 
> These counters depend on CONFIG_VM_EVENT_COUNTERS.
> 
> The Kconfig comment for CONFIG_VM_EVENT_COUNTERS states:
> 
>           VM event counters are only needed to for event counts to be
>           shown. They have no function for the kernel itself. This
>           option allows the disabling of the VM event counters.
>           /proc/vmstat will only show page counts.
> 
> (By the way - note the "needed to for event" phrasing error.)
> 
> The header file, include/linux/vmstat.h, for these counters states:
> 
> 	/*
> 	 * Light weight per cpu counter implementation.
> 	 *
> 	 * Counters should only be incremented and no critical kernel component
> 	 * should rely on the counter values.
> 
> Both these clearly state that I should not use these counters for real
> kernel functions.
> 
> If that is so, I should find some other "time base" for the zonelist
> caching.
> 
> If that is not so, then these comments need updating.
> 
> Anybody have any idea which is the case?

You need to set EMBEDDED to disable VM_EVENT_COUNTERS.

Things like procps (vmstat, top, etc) now use /proc/vmstat and would likely
break.

I don't know how much space it saves, but I doubt if the world would end if
we removed CONFIG_VM_EVENT_COUNTERS.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Avoid allocating during interleave from almost full nodes
  2006-12-01  7:51               ` Paul Jackson
  2006-12-01  7:59                 ` Andrew Morton
@ 2006-12-01 16:27                 ` Christoph Lameter
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2006-12-01 16:27 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, linux-kernel

On Thu, 30 Nov 2006, Paul Jackson wrote:

> Anybody have any idea which is the case?

You can rely on those to increment and count events if it does not matter 
that we may miss an event once in a while. And I think that is the case 
here.

The counters may only switched off for embedded systems. We could just 
remove the CONFIG option if necessary. The event counter operations are in 
critical paths of the VM though and I would think that embedded systems 
with no need for vmstat want those as efficient as possible.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2006-12-01 16:27 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-11-03 20:58 Avoid allocating during interleave from almost full nodes Christoph Lameter
2006-11-03 21:46 ` Andrew Morton
2006-11-03 22:10   ` Christoph Lameter
2006-11-03 22:31     ` Andrew Morton
2006-11-04  0:28       ` Christoph Lameter
2006-11-04  0:58         ` Andrew Morton
2006-11-06 16:53           ` Christoph Lameter
2006-11-06 19:59             ` Andrew Morton
2006-11-06 20:12               ` Christoph Lameter
2006-11-06 20:24                 ` Andrew Morton
2006-11-06 20:31                   ` Christoph Lameter
2006-11-06 20:42                     ` Andrew Morton
2006-11-06 20:58                       ` Christoph Lameter
2006-11-06 21:20                         ` Andrew Morton
2006-11-06 21:42                           ` Christoph Lameter
2006-11-04  1:26       ` Paul Jackson
2006-11-04  1:42         ` Andrew Morton
2006-11-04 10:51           ` Paul Jackson
2006-11-06 16:56             ` Christoph Lameter
2006-11-08 10:21               ` Paul Jackson
2006-11-08 15:18                 ` Peter Zijlstra
2006-11-08 17:06                   ` Paul Jackson
2006-11-08 17:09                     ` Peter Zijlstra
2006-11-08 17:21                       ` Paul Jackson
2006-11-08 17:40                         ` Christoph Lameter
2006-12-01  7:51               ` Paul Jackson
2006-12-01  7:59                 ` Andrew Morton
2006-12-01 16:27                 ` Christoph Lameter
2006-11-04 10:35   ` CTL_UNNUMBERED and killing sys_sysctl Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox