public inbox for linux-ia64@vger.kernel.org
* [PATCH 1/3] Arch specific zone reclaim framework
@ 2005-12-05 19:01 Christoph Lameter
  2005-12-05 19:01 ` [PATCH 2/3] ia64 zone reclaim Christoph Lameter
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Christoph Lameter @ 2005-12-05 19:01 UTC (permalink / raw)
  To: akpm, torvalds; +Cc: linux-ia64, Christoph Lameter, linux-kernel

Generic framework for arch specific zone reclaim.

Zone reclaim allows the reclaiming of pages from a zone if the number of free
pages falls below the watermark even if other zones still have enough pages
available.

Zone reclaim is of particular importance for NUMA machines. It can be more
beneficial to reclaim a page than to take the performance penalties that come
with allocating a page on a remote zone. This may also be useful for
implementing reclaim of DMA zones on some architectures.

The penalty incurred by remote page accesses varies depending on the NUMA
factor of the architecture. If the NUMA factor is very low (architectures
that have multiple nodes on the same motherboard, such as Opteron
multi-processor boards), then no page reclaim may be needed, since access to
another node's memory is almost as fast as a direct access.
On Itanium and other bus-based NUMA architectures, a remote access usually
means that the access has to occur over some sort of NUMA interlink. There it
is worthwhile to sacrifice easily reclaimable pages in order to allow a local
allocation. Typically a large number of easily reclaimable pages are
available if a scan over some files has just been done or if an application
that mmapped many files has just terminated.

Other architectures (especially software NUMA like VirtualIron) may have
higher NUMA factors and consequently it may be beneficial to do even more
cleaning of the local zone before going off-node for those.

This patch adds a hook to the page allocator and defines a generic zone
reclaim function. These allow an arch to implement its own zone reclaim
so that the off-node allocation behavior of the page allocator may be
controlled in an arch-specific way. The patch replaces Martin Hicks' zone
reclaim function (which never worked properly).
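
The allocator-side control flow described above can be modeled in ordinary
userspace C. This is an illustrative sketch only: the struct fields and the
stub arch_zone_reclaim() below are hypothetical stand-ins for the kernel's
types and APIs, not the real thing.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel structures, for illustration. */
struct zone {
	int free_pages;
	int watermark;
	int reclaimable_pages;	/* easily reclaimable (clean pagecache) */
};

/* Model of the arch hook: try to free enough local pages, return nonzero
 * on success.  The real hook would call zone_reclaim(); here we just move
 * pages from the reclaimable pool to the free pool. */
static int arch_zone_reclaim(struct zone *z, int order)
{
	int want = 1 << order;

	if (z->reclaimable_pages < want)
		return 0;	/* nothing worth reclaiming: go off-node */
	z->reclaimable_pages -= want;
	z->free_pages += want;
	return 1;		/* retry the local allocation */
}

/* Model of the allocator loop: try the local zone first, invoke the arch
 * hook when the watermark check fails, and only then fall back. */
static int alloc_local(struct zone *z, int order)
{
	if (z->free_pages - (1 << order) < z->watermark &&
	    !arch_zone_reclaim(z, order))
		return 0;	/* caller falls back to the next zone */
	z->free_pages -= 1 << order;
	return 1;
}
```

The key property of the hook is that returning 0 preserves the old
behavior (fall through to the next zone), so an arch that does not
implement it sees no change.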

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.15-rc4/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc4.orig/mm/page_alloc.c	2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/mm/page_alloc.c	2005-12-05 10:21:36.000000000 -0800
@@ -842,7 +842,8 @@ get_page_from_freelist(gfp_t gfp_mask, u
 				mark = (*z)->pages_high;
 			if (!zone_watermark_ok(*z, order, mark,
 				    classzone_idx, alloc_flags))
-				continue;
+				if (!arch_zone_reclaim(*z, gfp_mask, order))
+						continue;
 		}
 
 		page = buffered_rmqueue(*z, order, gfp_mask);
Index: linux-2.6.15-rc4/include/linux/swap.h
===================================================================
--- linux-2.6.15-rc4.orig/include/linux/swap.h	2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/include/linux/swap.h	2005-12-05 10:21:36.000000000 -0800
@@ -172,7 +172,7 @@ extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
 extern int try_to_free_pages(struct zone **, gfp_t);
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
+extern int zone_reclaim(struct zone *, gfp_t, int, int);
 extern int shrink_all_memory(int);
 extern int vm_swappiness;
 
Index: linux-2.6.15-rc4/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc4.orig/mm/vmscan.c	2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/mm/vmscan.c	2005-12-05 10:21:36.000000000 -0800
@@ -1354,47 +1354,45 @@ static int __init kswapd_init(void)
 
 module_init(kswapd_init)
 
-
 /*
  * Try to free up some pages from this zone through reclaim.
  */
-int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
+#ifdef CONFIG_ARCH_ZONE_RECLAIM
+int zone_reclaim(struct zone *z, gfp_t gfp_mask, int writepage, int swap)
 {
+	struct task_struct *p = current;
 	struct scan_control sc;
-	int nr_pages = 1 << order;
-	int total_reclaimed = 0;
+	struct reclaim_state reclaim_state;
 
-	/* The reclaim may sleep, so don't do it if sleep isn't allowed */
-	if (!(gfp_mask & __GFP_WAIT))
-		return 0;
-	if (zone->all_unreclaimable)
-		return 0;
-
-	sc.gfp_mask = gfp_mask;
-	sc.may_writepage = 0;
-	sc.may_swap = 0;
-	sc.nr_mapped = read_page_state(nr_mapped);
 	sc.nr_scanned = 0;
 	sc.nr_reclaimed = 0;
-	/* scan at the highest priority */
+	sc.nr_mapped = read_page_state(nr_mapped);
 	sc.priority = 0;
-	disable_swap_token();
-
-	if (nr_pages > SWAP_CLUSTER_MAX)
-		sc.swap_cluster_max = nr_pages;
-	else
-		sc.swap_cluster_max = SWAP_CLUSTER_MAX;
+	sc.gfp_mask = gfp_mask;
+	sc.may_writepage = writepage;
+	sc.may_swap = swap;
+	sc.swap_cluster_max = SWAP_CLUSTER_MAX;
 
+	/* The reclaim may sleep, so don't do it if sleep isn't allowed */
+	if (!(gfp_mask & __GFP_WAIT))
+		return 0;
+	if (z->all_unreclaimable)
+		return 0;
 	/* Don't reclaim the zone if there are other reclaimers active */
-	if (atomic_read(&zone->reclaim_in_progress) > 0)
-		goto out;
-
-	shrink_zone(zone, &sc);
-	total_reclaimed = sc.nr_reclaimed;
+	if (atomic_read(&z->reclaim_in_progress) > 0)
+		return 0;
 
- out:
-	return total_reclaimed;
+	cond_resched();
+	p->flags |= PF_MEMALLOC;
+	reclaim_state.reclaimed_slab = 0;
+	p->reclaim_state = &reclaim_state;
+	shrink_zone(z, &sc);
+	p->reclaim_state = NULL;
+	current->flags &= ~PF_MEMALLOC;
+	cond_resched();
+	return sc.nr_reclaimed;
 }
+#endif
 
 asmlinkage long sys_set_zone_reclaim(unsigned int node, unsigned int zone,
 				     unsigned int state)
Index: linux-2.6.15-rc4/include/linux/gfp.h
===================================================================
--- linux-2.6.15-rc4.orig/include/linux/gfp.h	2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/include/linux/gfp.h	2005-12-05 10:21:54.000000000 -0800
@@ -100,6 +100,13 @@ static inline int gfp_zone(gfp_t gfp)
 static inline void arch_free_page(struct page *page, int order) { }
 #endif
 
+#ifndef CONFIG_ARCH_ZONE_RECLAIM
+static inline int arch_zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
+{
+	return 0;
+}
+#endif
+
 extern struct page *
 FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
 


* [PATCH 2/3] ia64 zone reclaim
  2005-12-05 19:01 [PATCH 1/3] Arch specific zone reclaim framework Christoph Lameter
@ 2005-12-05 19:01 ` Christoph Lameter
  2005-12-05 19:01 ` [PATCH 3/3] Remove debris from old " Christoph Lameter
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Christoph Lameter @ 2005-12-05 19:01 UTC (permalink / raw)
  To: akpm, torvalds; +Cc: linux-ia64, Christoph Lameter, linux-kernel

IA64 zone reclaim

Set up a zone reclaim function for IA64. The zone reclaim function will
reclaim easily reclaimable pages. Off-node allocations will occur only if no
easily reclaimable pages remain.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.15-rc4/arch/ia64/mm/numa.c
===================================================================
--- linux-2.6.15-rc4.orig/arch/ia64/mm/numa.c	2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/arch/ia64/mm/numa.c	2005-12-05 10:12:14.000000000 -0800
@@ -17,6 +17,7 @@
 #include <linux/node.h>
 #include <linux/init.h>
 #include <linux/bootmem.h>
+#include <linux/swap.h>
 #include <asm/mmzone.h>
 #include <asm/numa.h>
 
@@ -71,3 +72,17 @@ int early_pfn_to_nid(unsigned long pfn)
 	return 0;
 }
 #endif
+
+/*
+ * Remove easily reclaimable local pages if watermarks would prevent a
+ * local allocation.
+ */
+int arch_zone_reclaim(struct zone *z, gfp_t mask,
+				    unsigned int order)
+{
+	if (z->zone_pgdat->node_id == numa_node_id()) {
+		if (zone_reclaim(z, mask, 0, 0) > (1 << order))
+			return 1;
+	}
+	return 0;
+}
Index: linux-2.6.15-rc4/arch/ia64/Kconfig
===================================================================
--- linux-2.6.15-rc4.orig/arch/ia64/Kconfig	2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/arch/ia64/Kconfig	2005-12-03 13:30:27.000000000 -0800
@@ -338,6 +338,10 @@ config HAVE_ARCH_EARLY_PFN_TO_NID
 	def_bool y
 	depends on NEED_MULTIPLE_NODES
 
+config ARCH_ZONE_RECLAIM
+	def_bool y
+	depends on NUMA
+
 config IA32_SUPPORT
 	bool "Support for Linux/x86 binaries"
 	help


* [PATCH 3/3] Remove debris from old zone reclaim
  2005-12-05 19:01 [PATCH 1/3] Arch specific zone reclaim framework Christoph Lameter
  2005-12-05 19:01 ` [PATCH 2/3] ia64 zone reclaim Christoph Lameter
@ 2005-12-05 19:01 ` Christoph Lameter
  2005-12-05 19:11 ` [PATCH 1/3] Arch specific zone reclaim framework Christoph Hellwig
  2005-12-06 15:43 ` Andi Kleen
  3 siblings, 0 replies; 6+ messages in thread
From: Christoph Lameter @ 2005-12-05 19:01 UTC (permalink / raw)
  To: akpm, torvalds; +Cc: linux-ia64, Christoph Lameter, linux-kernel

Remove debris of old zone reclaim

Removes the leftovers from prior attempts to implement zone reclaim.

sys_set_zone_reclaim is not reachable in 2.6.14.

The reclaim_pages field in struct zone is only used by sys_set_zone_reclaim.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.15-rc4/include/linux/mmzone.h
===================================================================
--- linux-2.6.15-rc4.orig/include/linux/mmzone.h	2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/include/linux/mmzone.h	2005-12-05 09:57:36.000000000 -0800
@@ -150,11 +150,6 @@ struct zone {
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	int			all_unreclaimable; /* All pages pinned */
 
-	/*
-	 * Does the allocator try to reclaim pages from the zone as soon
-	 * as it fails a watermark_ok() in __alloc_pages?
-	 */
-	int			reclaim_pages;
 	/* A count of how many reclaimers are scanning this zone */
 	atomic_t		reclaim_in_progress;
 
Index: linux-2.6.15-rc4/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc4.orig/mm/vmscan.c	2005-12-03 13:34:59.000000000 -0800
+++ linux-2.6.15-rc4/mm/vmscan.c	2005-12-05 09:57:36.000000000 -0800
@@ -1394,33 +1394,3 @@ int zone_reclaim(struct zone *z, gfp_t g
 }
 #endif
 
-asmlinkage long sys_set_zone_reclaim(unsigned int node, unsigned int zone,
-				     unsigned int state)
-{
-	struct zone *z;
-	int i;
-
-	if (!capable(CAP_SYS_ADMIN))
-		return -EACCES;
-
-	if (node >= MAX_NUMNODES || !node_online(node))
-		return -EINVAL;
-
-	/* This will break if we ever add more zones */
-	if (!(zone & (1<<ZONE_DMA|1<<ZONE_NORMAL|1<<ZONE_HIGHMEM)))
-		return -EINVAL;
-
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		if (!(zone & 1<<i))
-			continue;
-
-		z = &NODE_DATA(node)->node_zones[i];
-
-		if (state)
-			z->reclaim_pages = 1;
-		else
-			z->reclaim_pages = 0;
-	}
-
-	return 0;
-}



* Re: [PATCH 1/3] Arch specific zone reclaim framework
  2005-12-05 19:01 [PATCH 1/3] Arch specific zone reclaim framework Christoph Lameter
  2005-12-05 19:01 ` [PATCH 2/3] ia64 zone reclaim Christoph Lameter
  2005-12-05 19:01 ` [PATCH 3/3] Remove debris from old " Christoph Lameter
@ 2005-12-05 19:11 ` Christoph Hellwig
  2005-12-05 19:24   ` Christoph Lameter
  2005-12-06 15:43 ` Andi Kleen
  3 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2005-12-05 19:11 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, torvalds, linux-ia64, linux-kernel

Nack.  Arch control over VM reclaim logic will lead to a total mess, with VM
logic all over the arch code.  Please introduce a framework that lets
individual machines control parameters, but procedural callouts are a big
no-no.


* Re: [PATCH 1/3] Arch specific zone reclaim framework
  2005-12-05 19:11 ` [PATCH 1/3] Arch specific zone reclaim framework Christoph Hellwig
@ 2005-12-05 19:24   ` Christoph Lameter
  0 siblings, 0 replies; 6+ messages in thread
From: Christoph Lameter @ 2005-12-05 19:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: akpm, torvalds, linux-ia64, linux-kernel

On Mon, 5 Dec 2005, Christoph Hellwig wrote:

> Nack.  Arch control over VM reclaim logic will lead to a total mess, with VM
> logic all over the arch code.  Please introduce a framework that lets
> individual machines control parameters, but procedural callouts are a big
> no-no.

The different penalties for off-node accesses on various architectures may
dictate different techniques in order to get the best performance. The
parameter control was tried before and it was not nice. IMHO this is the
cleanest possible solution.





* Re: [PATCH 1/3] Arch specific zone reclaim framework
  2005-12-05 19:01 [PATCH 1/3] Arch specific zone reclaim framework Christoph Lameter
                   ` (2 preceding siblings ...)
  2005-12-05 19:11 ` [PATCH 1/3] Arch specific zone reclaim framework Christoph Hellwig
@ 2005-12-06 15:43 ` Andi Kleen
  3 siblings, 0 replies; 6+ messages in thread
From: Andi Kleen @ 2005-12-06 15:43 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-ia64, linux-kernel

Christoph Lameter <clameter@sgi.com> writes:

> Generic framework for arch specific zone reclaim.
> 
> Zone reclaim allows the reclaiming of pages from a zone if the number of free
> pages falls below the watermark even if other zones still have enough pages
> available.
> 
> Zone reclaim is of particular importance for NUMA machines. It can be more
> beneficial to reclaim a page than to take the performance penalties that come
> with allocating a page on a remote zone. This may also be useful for
> implementing reclaim of DMA zones on some architectures.
> 
> The penalty incurred by remote page accesses varies depending on the NUMA
> factor of the architecture. If the NUMA factor is very low (architectures
> that have multiple nodes on the same motherboard, such as Opteron
> multi-processor boards), then no page reclaim may be needed, since access to
> another node's memory is almost as fast as a direct access.
> On Itanium and other bus-based NUMA architectures, a remote access usually
> means that the access has to occur over some sort of NUMA interlink. There it
> is worthwhile to sacrifice easily reclaimable pages in order to allow a local
> allocation. Typically a large number of easily reclaimable pages are
> available if a scan over some files has just been done or if an application
> that mmapped many files has just terminated.
> 
> Other architectures (especially software NUMA like VirtualIron) may have
> higher NUMA factors and consequently it may be beneficial to do even more
> cleaning of the local zone before going off-node for those.

I think it's a very bad idea to have architecture-specific
functions for such generic VM tasks. I'm all for fixing this
particular problem, but do it in generic code, possibly
with an ifdef and some arch-settable parameters. But no
architecture-specific VM code, please. Going down that path
would cause long-term maintenance headaches.

I suppose this particular problem could be handled by
just checking node_distance().
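
The node_distance() idea can be sketched as a purely generic policy with no
arch callout. Everything below is hypothetical illustration: the threshold
name, the value 20, and the hard-coded distance table are invented for the
example, while the real node_distance() consults the firmware's SLIT table.

```c
#include <assert.h>

/* Hypothetical distance cutoff: below it, remote access is cheap enough
 * that local reclaim is not worth it (cf. the Opteron-board case). */
#define RECLAIM_DISTANCE_THRESHOLD 20

/* Stand-in for the kernel's node_distance() SLIT lookup: a hard-coded
 * matrix for an imaginary two-node machine (local 10, remote 40). */
static int node_distance(int a, int b)
{
	static const int slit[2][2] = {
		{ 10, 40 },
		{ 40, 10 },
	};
	return slit[a][b];
}

/* Generic policy: reclaim from the local zone only when falling back to
 * the candidate zone would cross an expensive NUMA link. */
static int should_zone_reclaim(int local_node, int zone_node)
{
	return node_distance(local_node, zone_node) >
			RECLAIM_DISTANCE_THRESHOLD;
}
```

With a low-NUMA-factor table (all distances near 10) this policy never
reclaims, matching the Opteron argument from the patch description, while
high off-node distances trigger local reclaim first.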

-Andi

