public inbox for linux-kernel@vger.kernel.org
* Slab: Node rotor for freeing alien caches and remote per cpu pages.
@ 2006-02-23 18:43 Christoph Lameter
  2006-02-23 19:33 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Christoph Lameter @ 2006-02-23 18:43 UTC (permalink / raw)
  To: akpm; +Cc: alokk, manfred, Pekka Enberg, linux-kernel

The cache reaper currently tries to free all alien caches and all remote
per cpu pages in each pass of cache_reap. For machines with a large number
of nodes (such as Altix) this may lead to sporadic delays of ~10ms.
Interrupts are disabled while reclaiming, creating unacceptable delays.

This patch changes that behavior by adding a per cpu reap_node variable.
Instead of attempting to free all caches, we free only one alien cache
and the per cpu pages from one remote node. That reduces the time
spent in cache_reap. However, doing so lengthens the time it takes to
completely drain all remote per cpu pagesets and all alien caches; that
time grows with the number of nodes in the system. All caches are still
drained when they overflow their respective capacity, so the only
drawback is that a bit of memory may be wasted for a while longer.

Details:

1. Rename drain_remote_pages to drain_node_pages to allow specifying
   the node whose pcp pages should be drained.

2. Add new functions init_reap_node and next_reap_node for NUMA
   that manage a per cpu reap_node counter.

3. Add a reap_alien function that reaps only from the current reap_node.


Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.16-rc4/include/linux/gfp.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/gfp.h	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/include/linux/gfp.h	2006-02-23 09:34:19.000000000 -0800
@@ -157,9 +157,9 @@ extern void FASTCALL(free_cold_page(stru
 
 void page_alloc_init(void);
 #ifdef CONFIG_NUMA
-void drain_remote_pages(void);
+void drain_node_pages(int node);
 #else
-static inline void drain_remote_pages(void) { };
+static inline void drain_node_pages(int node) { };
 #endif
 
 #endif /* __LINUX_GFP_H */
Index: linux-2.6.16-rc4/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc4.orig/mm/page_alloc.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/mm/page_alloc.c	2006-02-23 09:34:19.000000000 -0800
@@ -590,21 +590,20 @@ static int rmqueue_bulk(struct zone *zon
 }
 
 #ifdef CONFIG_NUMA
-/* Called from the slab reaper to drain remote pagesets */
-void drain_remote_pages(void)
+/*
+ * Called from the slab reaper to drain pagesets on a particular node that
+ * belong to the currently executing processor.
+ */
+void drain_node_pages(int nodeid)
 {
-	struct zone *zone;
-	int i;
+	int i, z;
 	unsigned long flags;
 
 	local_irq_save(flags);
-	for_each_zone(zone) {
+	for (z = 0; z < MAX_NR_ZONES; z++) {
+		struct zone *zone = NODE_DATA(nodeid)->node_zones + z;
 		struct per_cpu_pageset *pset;
 
-		/* Do not drain local pagesets */
-		if (zone->zone_pgdat->node_id == numa_node_id())
-			continue;
-
 		pset = zone_pcp(zone, smp_processor_id());
 		for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {
 			struct per_cpu_pages *pcp;
Index: linux-2.6.16-rc4/mm/slab.c
===================================================================
--- linux-2.6.16-rc4.orig/mm/slab.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/mm/slab.c	2006-02-23 10:20:16.000000000 -0800
@@ -789,6 +789,54 @@ static void __slab_error(const char *fun
 	dump_stack();
 }
 
+#ifdef CONFIG_NUMA
+/*
+ * Special reaping functions for NUMA systems called from cache_reap().
+ * These take care of doing round robin flushing of alien caches (containing
+ * objects freed on different nodes from which they were allocated) and the
+ * flushing of remote pcps by calling drain_node_pages.
+ */
+static DEFINE_PER_CPU(unsigned long, reap_node);
+
+static void init_reap_node(int cpu)
+{
+	int node;
+
+	node = next_node(cpu_to_node(cpu), node_online_map);
+	if (node >= MAX_NUMNODES)
+		node = 0;
+
+	__get_cpu_var(reap_node) = node;
+}
+
+static void next_reap_node(void)
+{
+	int node = __get_cpu_var(reap_node);
+
+	/*
+	 * Also drain per cpu pages on remote zones
+	 */
+	if (node != numa_node_id())
+		drain_node_pages(node);
+
+	node = next_node(node, node_online_map);
+	if (unlikely(node >= MAX_NUMNODES)) {
+		node = first_node(node_online_map);
+		/*
+		 * If the node_online_map is empty (early boot) then
+		 * only use node zero.
+		*/
+		if (node >= MAX_NUMNODES)
+			node = 0;
+	}
+	__get_cpu_var(reap_node) = node;
+}
+
+#else
+#define init_reap_node(cpu) do { } while (0)
+#define next_reap_node(void) do { } while (0)
+#endif
+
 /*
  * Initiate the reap timer running on the target CPU.  We run at around 1 to 2Hz
  * via the workqueue/eventd.
@@ -806,6 +854,7 @@ static void __devinit start_cpu_timer(in
 	 * at that time.
 	 */
 	if (keventd_up() && reap_work->func == NULL) {
+		init_reap_node(cpu);
 		INIT_WORK(reap_work, cache_reap, NULL);
 		schedule_delayed_work_on(cpu, reap_work, HZ + 3 * cpu);
 	}
@@ -884,6 +933,23 @@ static void __drain_alien_cache(struct k
 	}
 }
 
+/*
+ * Called from cache_reap() to regularly drain alien caches round robin.
+ */
+static void reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3)
+{
+	int node = __get_cpu_var(reap_node);
+
+	if (l3->alien) {
+		struct array_cache *ac = l3->alien[node];
+		if (ac && ac->avail) {
+			spin_lock_irq(&ac->lock);
+			__drain_alien_cache(cachep, ac, node);
+			spin_unlock_irq(&ac->lock);
+		}
+	}
+}
+
 static void drain_alien_cache(struct kmem_cache *cachep, struct array_cache **alien)
 {
 	int i = 0;
@@ -902,6 +968,7 @@ static void drain_alien_cache(struct kme
 #else
 
 #define drain_alien_cache(cachep, alien) do { } while (0)
+#define reap_alien(cachep, l3) do { } while (0)
 
 static inline struct array_cache **alloc_alien_cache(int node, int limit)
 {
@@ -3494,8 +3561,7 @@ static void cache_reap(void *unused)
 		check_irq_on();
 
 		l3 = searchp->nodelists[numa_node_id()];
-		if (l3->alien)
-			drain_alien_cache(searchp, l3->alien);
+		reap_alien(searchp, l3);
 		spin_lock_irq(&l3->list_lock);
 
 		drain_array_locked(searchp, cpu_cache_get(searchp), 0,
@@ -3545,8 +3611,8 @@ static void cache_reap(void *unused)
 	}
 	check_irq_on();
 	mutex_unlock(&cache_chain_mutex);
-	drain_remote_pages();
 	/* Setup the next iteration */
+	next_reap_node();
 	schedule_delayed_work(&__get_cpu_var(reap_work), REAPTIMEOUT_CPUC);
 }
 


* Re: Slab: Node rotor for freeing alien caches and remote per cpu pages.
  2006-02-23 18:43 Slab: Node rotor for freeing alien caches and remote per cpu pages Christoph Lameter
@ 2006-02-23 19:33 ` Andrew Morton
  2006-02-23 19:41   ` Christoph Lameter
  2006-02-23 21:53 ` Christoph Lameter
  2006-02-24  0:59 ` Andrew Morton
  2 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2006-02-23 19:33 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: alokk, manfred, penberg, linux-kernel

Christoph Lameter <clameter@engr.sgi.com> wrote:
>
> The cache reaper currently tries to free all alien caches and all remote
>  per cpu pages in each pass of cache_reap.

umm, why?  We have a reap timer per cpu - why doesn't each CPU drain its
own stuff and its own node's stuff and leave the other nodes&cpus alone?


* Re: Slab: Node rotor for freeing alien caches and remote per cpu pages.
  2006-02-23 19:33 ` Andrew Morton
@ 2006-02-23 19:41   ` Christoph Lameter
  2006-02-24  1:28     ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2006-02-23 19:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: alokk, manfred, penberg, linux-kernel

On Thu, 23 Feb 2006, Andrew Morton wrote:

> Christoph Lameter <clameter@engr.sgi.com> wrote:
> >
> > The cache reaper currently tries to free all alien caches and all remote
> >  per cpu pages in each pass of cache_reap.
> 
> umm, why?  We have a reap timer per cpu - why doesn't each CPU drain its
> own stuff and its own node's stuff and leave the other nodes&cpus alone?

Each cpu has per cpu pages on remote nodes and also has alien caches 
on remote nodes. These are only accessible from the processor using them.

This is the cpu's "own" stuff, but that stuff is per node.




* Re: Slab: Node rotor for freeing alien caches and remote per cpu pages.
  2006-02-23 18:43 Slab: Node rotor for freeing alien caches and remote per cpu pages Christoph Lameter
  2006-02-23 19:33 ` Andrew Morton
@ 2006-02-23 21:53 ` Christoph Lameter
  2006-02-24  0:59 ` Andrew Morton
  2 siblings, 0 replies; 9+ messages in thread
From: Christoph Lameter @ 2006-02-23 21:53 UTC (permalink / raw)
  To: akpm; +Cc: alokk, manfred, Pekka Enberg, linux-kernel,
	Ravikiran G Thirumalai

Fix two issues in the patch as suggested by Kiran:

1. next_node() cannot return a node number greater than MAX_NUMNODES,
   so use == instead of >=.

2. The node online map should not be empty during bootup, so we can
   skip the code that I added to deal with that situation.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.16-rc4/mm/slab.c
===================================================================
--- linux-2.6.16-rc4.orig/mm/slab.c	2006-02-23 10:20:16.000000000 -0800
+++ linux-2.6.16-rc4/mm/slab.c	2006-02-23 13:28:34.000000000 -0800
@@ -803,7 +803,7 @@ static void init_reap_node(int cpu)
 	int node;
 
 	node = next_node(cpu_to_node(cpu), node_online_map);
-	if (node >= MAX_NUMNODES)
+	if (node == MAX_NUMNODES)
 		node = 0;
 
 	__get_cpu_var(reap_node) = node;
@@ -820,15 +820,8 @@ static void next_reap_node(void)
 		drain_node_pages(node);
 
 	node = next_node(node, node_online_map);
-	if (unlikely(node >= MAX_NUMNODES)) {
+	if (unlikely(node >= MAX_NUMNODES))
 		node = first_node(node_online_map);
-		/*
-		 * If the node_online_map is empty (early boot) then
-		 * only use node zero.
-		*/
-		if (node >= MAX_NUMNODES)
-			node = 0;
-	}
 	__get_cpu_var(reap_node) = node;
 }
 



* Re: Slab: Node rotor for freeing alien caches and remote per cpu pages.
  2006-02-23 18:43 Slab: Node rotor for freeing alien caches and remote per cpu pages Christoph Lameter
  2006-02-23 19:33 ` Andrew Morton
  2006-02-23 21:53 ` Christoph Lameter
@ 2006-02-24  0:59 ` Andrew Morton
  2006-02-24  1:17   ` Christoph Lameter
  2 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2006-02-24  0:59 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: alokk, manfred, penberg, linux-kernel

Christoph Lameter <clameter@engr.sgi.com> wrote:
>
> +void drain_node_pages(int nodeid)
>   {
>  -	struct zone *zone;
>  -	int i;
>  +	int i, z;
>   	unsigned long flags;
>   
>   	local_irq_save(flags);
>  -	for_each_zone(zone) {
>  +	for (z = 0; z < MAX_NR_ZONES; z++) {
>  +		struct zone *zone = NODE_DATA(nodeid)->node_zones + z;
>   		struct per_cpu_pageset *pset;
>   
>  -		/* Do not drain local pagesets */
>  -		if (zone->zone_pgdat->node_id == numa_node_id())
>  -			continue;
>  -
>   		pset = zone_pcp(zone, smp_processor_id());
>   		for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {

Should it be testing populated_zone() in there?


* Re: Slab: Node rotor for freeing alien caches and remote per cpu pages.
  2006-02-24  0:59 ` Andrew Morton
@ 2006-02-24  1:17   ` Christoph Lameter
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Lameter @ 2006-02-24  1:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: alokk, manfred, penberg, linux-kernel

On Thu, 23 Feb 2006, Andrew Morton wrote:

> Should it be testing populated_zone() in there?

setup_pageset() is called for all zones for each cpu, so all pcps
of online nodes should be initialized properly.


* Re: Slab: Node rotor for freeing alien caches and remote per cpu pages.
  2006-02-23 19:41   ` Christoph Lameter
@ 2006-02-24  1:28     ` Ravikiran G Thirumalai
  2006-02-24  2:25       ` Christoph Lameter
  0 siblings, 1 reply; 9+ messages in thread
From: Ravikiran G Thirumalai @ 2006-02-24  1:28 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, alokk, manfred, penberg, linux-kernel

On Thu, Feb 23, 2006 at 11:41:51AM -0800, Christoph Lameter wrote:
> On Thu, 23 Feb 2006, Andrew Morton wrote:
> 
> > Christoph Lameter <clameter@engr.sgi.com> wrote:
> > >
> > > The cache reaper currently tries to free all alien caches and all remote
> > >  per cpu pages in each pass of cache_reap.
> > 
> > umm, why?  We have a reap timer per cpu - why doesn't each CPU drain its
> > own stuff and its own node's stuff and leave the other nodes&cpus alone?
> 
> Each cpu has per cpu pages on remote nodes and also has alien caches 
> on remote nodes. These are only accessible from the processor using them.

Actually, all cpus on the node share the alien_cache, and the alien_cache is
one per remote node (for the cachep).  So currently each cpu on the node
drains the same alien_cache onto all the remote nodes in the per-cpu eventd.

What is probably very expensive here in drain_alien_cache is free_block
getting called from the foreign node, and freeing remote pages.
We have a patch set here to drop the alien objects from the current node into
the respective alien node's drop box, and that drop box will be cleared
locally (so that freeing happens locally).  This would happen off cache_reap.
(I was holding off on posting it because akpm complained about slab.c
being full on -mm.  Maybe I should post it now...)

Round robin might still be useful for drain_alien_cache with that approach,
but maybe init_reap_node should initialize the per-cpu reap_node with a skew
for cpus on the same node, so that all cpus of a node do not drain to the
same foreign node when the eventd runs?

Round robin for drain_remote_pages is going to be useful for us too I think.

Thanks,
Kiran


* Re: Slab: Node rotor for freeing alien caches and remote per cpu pages.
  2006-02-24  1:28     ` Ravikiran G Thirumalai
@ 2006-02-24  2:25       ` Christoph Lameter
  2006-02-24  8:31         ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2006-02-24  2:25 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Andrew Morton, alokk, manfred, penberg, linux-kernel

On Thu, 23 Feb 2006, Ravikiran G Thirumalai wrote:

> Actually, all cpus on the node share the alien_cache, and the alien_cache is
> one per remote node (for the cachep).  So currently each cpu on the node 
> drains the same alien_cache onto all the remote nodes in the per-cpu eventd.  

Right. So we could optimize that.

> What is probably very expensive here at drain_alien_cache is free_block 
> getting called from the foreign node, and freeing remote pages.
> We have a patch-set here to drop-in the alien objects from the current node to 
> the respective alien node's drop box, and that drop box will be cleared
> locally (so that freeing happens locally).  This would happen off cache_reap. 
> (I was holding from posting it because akpm complained about slab.c 

Could you show us the patch?

> Round robin might still be useful for drain_alien_cache with that approach, 
> but maybe init_reap_node should initialize the per-cpu reap_node with a skew
> for cpus on the same node (so all cpus of a node do not drain to the same 
> foreign node when the eventd runs?)

Good idea.


* Re: Slab: Node rotor for freeing alien caches and remote per cpu pages.
  2006-02-24  2:25       ` Christoph Lameter
@ 2006-02-24  8:31         ` Ravikiran G Thirumalai
  0 siblings, 0 replies; 9+ messages in thread
From: Ravikiran G Thirumalai @ 2006-02-24  8:31 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, alokk, manfred, penberg, linux-kernel

On Thu, Feb 23, 2006 at 06:25:37PM -0800, Christoph Lameter wrote:
> On Thu, 23 Feb 2006, Ravikiran G Thirumalai wrote:
> 
> Could you show us the patch?

Sure.  They are in a series on our tree here, and I am removing dependencies
and rediffing against current -mm.  I should post them tomorrow.



end of thread, other threads:[~2006-02-24  8:31 UTC | newest]

Thread overview: 9+ messages
2006-02-23 18:43 Slab: Node rotor for freeing alien caches and remote per cpu pages Christoph Lameter
2006-02-23 19:33 ` Andrew Morton
2006-02-23 19:41   ` Christoph Lameter
2006-02-24  1:28     ` Ravikiran G Thirumalai
2006-02-24  2:25       ` Christoph Lameter
2006-02-24  8:31         ` Ravikiran G Thirumalai
2006-02-23 21:53 ` Christoph Lameter
2006-02-24  0:59 ` Andrew Morton
2006-02-24  1:17   ` Christoph Lameter
